Should you ask data science job candidates this tricky math question?

The question

Quantian1 on Twitter poses the following “junior data scientist” interview question:

(You can check the replies for my answer!)

This is a fun regression math / intuition question. In the spirit of appreciation for these types of interview questions, here are a few more.

  • Given Y_t = a_0 + a_1 Z_t + \epsilon_{Y,t} and X_t = b_0 + b_1 Z_t + \epsilon_{X,t}, with \epsilon_{X,t} and \epsilon_{Y,t} both being i.i.d. Gaussian white noise, calculate the OLS coefficients of Y_t = c_0 + c_1 X_t + \omega_{t} (where \omega_{t} is also i.i.d. Gaussian white noise).
  • Given some linear model y = \beta_0 + \beta_{x_1} x_1 + \beta_{x_2} x_2 + \epsilon, calculate the multiple regression OLS coefficients \beta_{x_1} and \beta_{x_2} using only simple univariate regressions and simple matrix arithmetic.
  • Given some linear model y = \beta_0 + \beta_{x} x + \epsilon with \epsilon \sim N(0, 1^2), calculate the OLS coefficients and the residual mean and standard deviation after adding \omega \sim N(0, 2^2) to the left-hand side y. (\epsilon and \omega are uncorrelated.) Now do this exercise again, but add \omega to x instead of y.

I leave solving these problems as an exercise to the reader.

What all of these questions (including Quantian1’s question) have in common is that they test your understanding of variance and covariance, the closed-form solution for the univariate slope coefficient (\frac{\textrm{Cov}(x,y)}{\textrm{Var}(x)}), and how to add and subtract Gaussian distributions. Things of that nature.
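That closed form is easy to sanity-check numerically. Here is a minimal stdlib-only simulation (the constants \beta_0 = 1.5 and \beta_1 = 2.0 are arbitrary, chosen just for illustration):

```python
import random

random.seed(0)
n = 100_000

# Simulate a simple linear model y = beta0 + beta1 * x + eps,
# with arbitrary illustrative coefficients.
beta0, beta1 = 1.5, 2.0
x = [random.gauss(0, 1) for _ in range(n)]
y = [beta0 + beta1 * xi + random.gauss(0, 1) for xi in x]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# The univariate closed form: slope = Cov(x, y) / Var(x),
# intercept recovered from the sample means.
slope = cov(x, y) / cov(x, x)
intercept = mean(y) - slope * mean(x)
print(slope, intercept)  # should land close to 2.0 and 1.5
```

If you can derive why this works from the normal equations, the three interview questions above mostly reduce to bookkeeping over variances and covariances.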

Should you ask that question in interviews?

Ask it to data scientists? Well, that depends on whether you’re OK with filtering out at least 95% of your applicants. In my experience, the vast majority of data scientists cannot answer any of the above questions. It is definitely not a “junior data scientist” interview question. I’m not sure if Quantian1 is being tongue-in-cheek or just has really high standards.

But I don’t think “should you ask it?” is the right question. The real question here is: “Should you be hiring a data scientist?” And the answer to that is probably not.

(Does it count as escaping the clutches of Betteridge’s law of headlines if you subvert the headline question?)

All these questions require an intermediate-to-advanced understanding of linear algebra, probability, and OLS math. By saying we shouldn’t ask data scientists this question, we are saying that we don’t expect data scientists to have this deep understanding.

I don’t take issue with hiring people to do data work that won’t require this understanding. Indeed, most practical data work doesn’t require knowing how to answer these questions. My issue is with the title: “data scientist” sets the expectation that people will come equipped with some non-superficial understanding of applied math, linear algebra, statistics, and probability, and that they will apply this knowledge. Yet we all seem to agree that this is the type of question a typical data scientist can’t answer and doesn’t need to know. There is a disconnect here!

The weirdness of hiring data scientists

Data science bootcamps and degree programs set the expectation that the vast majority of the work is fiddling around with models and algorithms, and yet:

  1. Most of the work is not actually that.
  2. People don’t have a strong understanding of how even the simplest algorithm (OLS / linear regression) works.

So we’re hiring data science folks, telling them they get to fiddle with models, when they don’t understand the models they’re fiddling with, and they don’t even get to fiddle with models that much in the first place (assuming they’re doing their jobs correctly).

I suppose many data scientists are numb to this, but you have to admit this is a really strange state of affairs! It would be like training a bunch of line cooks who can’t even make a roux, promising them they’re going to work in the kitchen of a Michelin-starred restaurant, and then hiring them to wait tables.

And that’s assuming these data scientists are doing their jobs correctly (i.e. doing the job correctly implies doing toil). Many don’t, though! Many data scientists buy into the idea that their work is to do nothing but twiddle with models in Jupyter notebooks, and arrogantly sit around waiting for someone else to clean and transform their data for them, while their actual models and skill growth suffer greatly from this listlessness. So many data science careers are atrophying, downstream of both pedagogical and managerial failures. And despite all that, they still can’t answer any of the math questions above. What are they actually learning at the end of the day? The XGBoost API?

Compared to “data analyst,” “data scientist” is essentially a prestige position that comes with the implicit promise of getting to tinker with models and math, do research, and be a thought leader for internal decision-making. Data analysts do toil; data scientists are expected to do cooler things, even though we don’t rigorously filter for the mathematical prerequisites to reliably execute on those cooler things. In essence, the cooler things are job perks: the perk of having a little fun at work and growing intellectually.

Hire other roles instead

Instead of hiring people with the expectation that they are going to be data scientists, consider hiring for the following positions instead:

  • Data engineer (someone who has the expectation that they will be coding a lot)
  • Software engineer (same as above)
  • Analytics engineer (someone who has the expectation they will be writing a lot of SQL pipelines)
  • Data analyst (someone who has the expectation they will be writing a lot of SQL and dashboards)

Hell, if you must, hire an ML engineer or MLops engineer (someone who has the expectation that they will be working with a lot of ML infrastructure). Hire basically anything except an ostensible model-fitter role that does everything except fit models.

Fellow data + tech cynic Lauren Balik besmirches the “analytics engineer” job title as a grift to get people to buy into managed B2B SaaS tools they don’t need. I don’t entirely disagree. Yet despite its inorganic origins, I’d still rather have someone who comes in with reasonable expectations about a reasonable job (building SQL pipelines) than someone with unreasonable expectations about an unreasonable job (predictive model fiddler).

“But who will write my algorithms?”

Honestly, most anyone can. Dumb algorithms work really well in practice and you should try them! Source basic requirements from business stakeholders and then build something that meets those requirements. It doesn’t need to come from “machine learning.”

E.g. if you want to sort content for users, “sort by most popular, over the last 72 hours” works really nicely. The data scientist you hire will tell you this is not sophisticated enough and is leaving money on the table because this algorithm’s parameters are not part of a Kubeflow pipeline deployment that outputs an RNN that optimizes some NDCG loss function. And then they will spend 2 months of their time trying and failing to beat it anyway.
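For concreteness, here’s roughly what that baseline looks like: count interactions per item over a trailing 72-hour window and sort. (A sketch; the item IDs, event log, and timestamps are made up for illustration.)

```python
from datetime import datetime, timedelta, timezone

# Hypothetical interaction log: (item_id, timestamp) pairs.
now = datetime(2023, 6, 1, tzinfo=timezone.utc)
events = [
    ("post-a", now - timedelta(hours=2)),
    ("post-b", now - timedelta(hours=50)),
    ("post-a", now - timedelta(hours=10)),
    ("post-c", now - timedelta(hours=100)),  # outside the window, ignored
    ("post-b", now - timedelta(hours=1)),
    ("post-a", now - timedelta(hours=71)),
]

def most_popular(events, now, window_hours=72):
    """Count interactions per item over the trailing window; most popular first."""
    cutoff = now - timedelta(hours=window_hours)
    counts = {}
    for item_id, ts in events:
        if ts >= cutoff:
            counts[item_id] = counts.get(item_id, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)

print(most_popular(events, now))  # post-a (3 hits) ahead of post-b (2)
```

In practice this is one `GROUP BY` plus an `ORDER BY` in SQL; the point is that the entire “algorithm” fits in a dozen lines.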

At my last job, I beat all the data scientists’ logistic-regression-based modeling by implementing some clever if statements behind a FastAPI endpoint and doing a key-value lookup on a table output by a nightly batch job. Pure heuristics, no model training to speak of; just using my domain knowledge and data engineering chops to run circles around data scientists and their models. Disabusing myself of the idea that the API call needed to be determined by some opaque “training” process rather than my common sense and intuition freed me to do my best work. (The full context for this is a long story, maybe for another day…)
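I can’t share that system, but the general shape of a heuristic endpoint is simple enough to sketch. Everything below — the segment names, the lookup table, the thresholds — is hypothetical, not the actual system:

```python
# Hypothetical precomputed table, refreshed by a nightly batch job:
# user segment -> default recommendation score.
nightly_scores = {"power_user": 0.9, "casual": 0.4, "new": 0.1}

def score(user_segment: str, days_since_last_visit: int) -> float:
    """Pure heuristics: a key-value lookup plus a couple of if statements.
    All segment names and thresholds are made up for illustration."""
    base = nightly_scores.get(user_segment, 0.1)  # unknown segment -> cold-start default
    if days_since_last_visit > 30:
        base *= 0.5  # decay users who have gone stale
    if user_segment == "new" and days_since_last_visit == 0:
        base = 0.8   # boost brand-new users on day one
    return base

print(score("power_user", 2))  # 0.9
print(score("casual", 45))     # 0.2
print(score("new", 0))         # 0.8
```

Wrap that in whatever web framework you like and you have an “ML endpoint” that is transparent, debuggable, and trivially cheap to serve.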

Algorithms are tiny parts of large systems. The algorithm part can usually be abstracted as an API call that provides some value(s) with inherent uncertainty; an ML endpoint is in essence an API call with few or no side effects that returns an uncertain value. It turns out the system design surrounding this almost always matters more than the algorithm, and the dumb algorithms tend to do well enough anyway. Tuning hyperparameters and adding features (aside from obvious ones, most notably autoregressive sorts of features) just doesn’t do much unless you’re operating at a large scale. And you can’t even begin to tune your hyperparameters until you have everything else in place: the system, dashboards (monitoring for both technical and business processes), MLops/devops infrastructure, and data pipelines that feed both into and out of the system. That’s a lot of work, and the data scientists you are hiring don’t always come with the expectation that this will be their work, or the training to actually be good at it.

And when you do get to the actual modeling, what your data scientist is likely doing is fiddling around with the APIs for tools they are importing into a notebook. They’re not often doing real math. Do you really need an “expert” on the SKLearn and XGBoost APIs when perusing the documentation may do just fine?

This all leaves very little room for actual clever “data science” work in the traditional sense of some nerd sitting around and doing math.

And if you’re going to hire just one data scientist, on a team of many people working on all the other parts of the system, to do the little actual modeling work that exists in the org, maybe it isn’t so unreasonable to hire someone who knows how to find the OLS slope coefficient when you take x \sim N(0, 1^2) and y \sim N(0, 0.1^2) and rotate the cloud by 45º counterclockwise. Asking that question in a data science interview may mean you reject all 20 candidates in your pool, but did you really need a “data scientist” in the first place?
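(And if you’d rather check an answer to that last question by brute force than by algebra, a quick stdlib-only simulation settles it — rotate the samples, then apply the same Cov/Var closed form from earlier. I won’t print the answer here; run it yourself.)

```python
import math
import random

random.seed(0)
n = 200_000
theta = math.radians(45)
c, s = math.cos(theta), math.sin(theta)

# Independent x ~ N(0, 1^2) and y ~ N(0, 0.1^2),
# rotated 45 degrees counterclockwise.
pts = [(random.gauss(0, 1), random.gauss(0, 0.1)) for _ in range(n)]
xr = [c * x - s * y for x, y in pts]
yr = [s * x + c * y for x, y in pts]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# OLS slope of the rotated cloud via the closed form slope = Cov/Var.
slope = cov(xr, yr) / cov(xr, xr)
print(round(slope, 3))
```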