In fact, you sometimes will hear about someone choosing to use the geometric mean over the arithmetic mean because they want to put less weight on outliers– my position on this is that it should be criminal and punished with jail time and a lifetime revocation of all software licenses.

Note that this blog refers to **log()** a lot. Every time I use **log()**, I am talking about the natural logarithm, never the log base 10.

The arithmetic mean, geometric mean, and harmonic mean all involve the following four steps:

- Define some invertible function **f(·)**
- Apply this function to each number in the set: **f(x)**
- Take the arithmetic mean of the transformed series: **avg(f(x))**
- Invert the transformation on the average: **f**^{-1}(avg(f(x)))

^ Given some function **avg(x) = sum(x) / count(x)**. Or in more math-y terms, **avg(x) = (1/N) * (x_{1} + x_{2} + … + x_{N})**.

The only difference between each type of mean is the function **f(·)**. Those functions for each respective type of mean are the following:

- **Arithmetic mean:** the function is the identity map: **f**(x) = x. The inverse of this function is trivially **f**^{-1}(x) = x.
- **Geometric mean:** the function is the natural log: **f**(x) = log(x). The inverse of this function is **f**^{-1}(x) = e^{x}.
- **Harmonic mean:** the function is the multiplicative inverse: **f**(x) = 1/x. Similar to the arithmetic mean, the function is involutory and thus the inverse is itself: **f**^{-1}(x) = 1/x.

We can also add the root mean square into the mix, which is not one of the “Pythagorean means” and is outside this article’s scope, but can be described the same way using some **f(·)**:

**Root mean square:** the function is **f**(x) = x^{2}. The inverse of this function is **f**^{-1}(x) = √x.

This means that:

- The **geometric mean** is basically **exp(avg(ln(x)))**.
- The **harmonic mean** is basically **1/avg(1/x)**.

We’re playing fast and loose with the math notation when we say that, but hopefully this makes perfect sense to the Pandas, R, Excel, etc. folks out there.

Another way of thinking about what’s happening: each type of mean is the average of **f(x)**, *converted back to the units x was originally in*.

In fact, you can imagine many situations in which we don’t care about the conversion back to the original units of **x**, and we simply take **avg(f(x))** and call it a day. (Later in this article we will discuss one such example.)

Basically, instead of thinking about it as arithmetic mean vs geometric mean vs harmonic mean, I believe it is far better to think of it as deciding between taking a mean of **x** vs **log(x)** vs **1/x**.
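
To make this concrete, here is a minimal numpy sketch (the helper name and example values are just my own choices for illustration) showing each mean as “apply **f**, take the arithmetic mean, invert **f**”:

```
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0])

def mean_via_transform(x, f, f_inv):
    """Apply f, take the arithmetic mean, then invert the transformation."""
    return f_inv(np.mean(f(x)))

arithmetic = mean_via_transform(x, lambda v: v, lambda v: v)        # 3.75
geometric = mean_via_transform(x, np.log, np.exp)                   # exp(avg(ln(x))) ~= 2.83
harmonic = mean_via_transform(x, lambda v: 1 / v, lambda v: 1 / v)  # 1/avg(1/x) ~= 2.13
rms = mean_via_transform(x, np.square, np.sqrt)                     # sqrt(avg(x^2)) ~= 4.61

# Cross-check the geometric mean against the usual multiplicative definition:
assert np.isclose(geometric, np.prod(x) ** (1 / len(x)))
```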

The geometric mean is basically just **exp(avg(ln(x)))** for some series **x**. You may be thinking to yourself, “hold on just a second, that’s not the formula I learned for geometric means, isn’t the geometric mean multiplicative?” Indeed, the geometric mean is usually defined like this:
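
**geometric mean = (x_{1} * x_{2} * … * x_{N})^{1/N}**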

The multiplicative formulation of the geometric mean is a big part of my complaint on how these things are taught and is a big motivation for why I wrote this blog post. This formulation has two problems:

- It does not provide a good intuition for the relationship between all the types of means.
- It does not provide a good intuition for when this type of mean is useful.

Let’s start with a slightly easier, non-geometric example to understand our motivation for why we want to use the geometric mean in the first place. Then we will move toward understanding why we prefer the formulation that uses logarithms.

Imagine a series **x_{i}** that starts at 42 and moves by a normally distributed increment each step, with **Δx ~ N(1, 1^{2})**.

The output:

```
>>> import numpy as np
>>> np.random.seed(42069)
>>> x = 42 + np.random.normal(1, 1, size=100).cumsum()
>>> print(x)
[ 42.92553097 43.49170572 44.8624312 46.51825442 48.63784536
48.8951176 50.54671526 51.22867236 50.3424614 52.75737276
51.69855546 52.65819763 54.06775613 53.28172338 53.66697074
56.12498604 57.61778804 58.27503826 57.70256389 59.38735085
60.22063766 61.90920957 63.79519213 64.69549338 64.43602456
67.30519201 67.13024514 67.78104984 69.06629574 69.60412637
71.16176743 71.62132034 74.39080503 77.09848795 79.4652772
80.39650678 80.91626346 81.52066251 82.33636486 81.91222599
83.48843792 85.30007755 86.06519214 86.43031374 87.13039204
90.25620318 93.03419293 94.51869434 95.27790282 95.90321776
96.76434325 98.1029641 98.32628988 98.49971699 97.59649203
99.94065695 103.04571803 102.67171363 103.76771164 103.59085332
105.19872315 107.13508702 109.37005647 110.2144704 112.2378265
113.62732814 114.72603068 115.62444397 115.05362115 115.76635685
116.1258608 117.39540806 119.84769791 120.87730939 121.77746089
121.28436228 123.23558472 124.28775198 123.13780385 123.70222487
124.66368608 126.87138039 127.04262872 127.67170174 128.37581743
131.53389542 130.40204315 132.71405898 135.27380666 136.73629043
136.92096472 136.91655098 139.62076843 140.17345403 141.37458494
141.97151287 142.00637033 143.79217108 143.55446468 143.82917616]
```

On average, this series increments by 1, or: **Δx̄ = 1**. The code example with N=100 is fairly close to this:

```
>>> print(np.diff(x).mean())
1.019228739283586
```

A nice property of this series is that we can estimate the value **x_{i}** at any point by taking:

**x_{i} ~= x_{0} + i * Δx̄**

For example, given **x_{0} = 42** and **i = 100**, we would estimate **x_{100} ~= 142**, which is close to the actual final value of about 143.8 in the output above.

One more note: you can see that if we did this with **Δx ~ N(0, 1^{2})**, the end value would be pretty close to our starting value, since on average the change is 0 so it doesn’t tend to go anywhere:

```
>>> y = 42 + np.random.normal(0, 1, size=100).cumsum()
>>> print(y)
[42.14327609 41.73220144 40.34751605 39.89921679 40.71819255 39.98666041
39.99145178 38.64581255 39.41370336 38.25513447 38.30145859 39.42274649
38.04648615 36.24137523 35.46375047 35.50439808 36.63500785 36.59388467
37.57533195 39.08275764 41.53667158 41.56685919 41.41477371 41.2577686
41.07551212 42.30114276 41.64867131 42.72961065 43.64169298 42.34088469
42.88702932 42.23553899 43.61890447 43.82668718 41.60184273 40.27826271
41.25916824 42.18715753 41.85738733 41.14549348 41.84768201 41.69063447
40.6840706 40.83319881 40.40445944 39.34640317 38.09054094 38.87652195
40.3595753 40.8148765 40.83529456 41.18929838 40.08415662 39.05557295
38.64264682 36.52039689 36.69964053 35.37000166 35.20584305 36.37476645
36.88307023 37.31826077 38.29526266 37.74575645 36.80800914 37.5450178
38.65961874 38.88282908 39.928218 40.56804189 41.80931894 42.89032324
42.04655778 42.18207305 44.03568325 44.30914736 43.87061869 42.66411857
41.56574674 42.80970406 44.62550963 44.68844778 44.39159728 44.83927878
43.25006607 42.00114961 42.34021689 41.85862954 41.12647006 42.40091559
41.94186999 42.49949258 40.68446343 40.00669043 39.43730559 39.77142686
39.78884967 40.82806017 40.40149929 40.08320282]
```

Now imagine instead that we’re working with *percentages*. Let’s try to create a series that acts as a percentage but doesn’t drift anywhere, i.e. its average percent change is 0. A sensible thing to do here is to set the mean of the normal distribution to `1.00`, because 42 * 1 * 1 * 1 * 1 * 1 * 1 * … = 42. If we do that (and set the variance to be small), then take the cumulative product, we get this:

```
>>> z = 42 * np.random.normal(1.00, 0.01, size=100).cumprod()
>>> z
array([42.29485307, 42.10421377, 41.92718004, 42.15880536, 42.40603275,
42.01686461, 41.69826372, 41.83495864, 41.25095708, 41.10461977,
41.49515729, 41.69517539, 41.32604793, 40.91920303, 40.79499964,
40.35644085, 40.20518116, 40.1654621 , 39.74730232, 39.78929327,
40.12100013, 40.02502775, 39.44923442, 39.44081756, 39.4641928 ,
38.80762578, 38.89268137, 39.11465487, 39.16728902, 39.76985309,
40.06795117, 40.23589703, 40.10532389, 39.34583711, 39.61553391,
39.9075052 , 40.10830292, 39.9106884 , 39.85976268, 39.79411555,
38.99997783, 39.44369954, 40.39200402, 40.88128515, 40.44479728,
40.22209045, 40.0285552 , 40.23540308, 40.42241127, 40.9433428 ,
41.24621725, 40.8689337 , 41.29192764, 40.56362479, 41.17267844,
40.51011262, 40.87376187, 40.39144992, 39.75819826, 40.0730663 ,
40.58474266, 40.38275682, 40.37280631, 40.23939867, 40.63397416,
41.09796335, 41.35882045, 41.51904402, 41.34215026, 40.6133968 ,
40.66670512, 40.23359742, 40.02866379, 38.7808683 , 39.19549234,
38.93472017, 38.95480154, 38.62374928, 38.76198573, 38.75981943,
38.8845329 , 39.13893181, 38.41712861, 38.51901389, 38.70939761,
39.17713543, 38.85041645, 38.54570443, 38.93457566, 39.1375754 ,
39.2318531 , 39.14232187, 39.11531491, 38.02500266, 38.77757832,
38.9453628 , 38.19591922, 37.34626113, 38.13230149, 37.96815914])
```

Oh dear, clearly the number is drifting downwards. Even though our normal distribution has a mean of `1.00`, the percentage change is negative on average!

The reason why this happens is pretty straightforward: if you multiply a number by 101% and then by 99%, you get a smaller number than you started with. If you do this again, and then a third time, the resulting number gets smaller and smaller, even though `(1.01 + 0.99 + 1.01 + 0.99 + 1.01 + 0.99) / 6 = 1.00`.
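
A quick check of that arithmetic in plain Python:

```
>>> x = 42.0
>>> for pct in [1.01, 0.99] * 3:  # alternate +1% and -1%, three times each
...     x *= pct
...
>>> print(round(x, 4))
41.9874
```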

Curious minds will be asking a few questions that are all variants of: What thing allows us to input a “0%” and allows us to have a series that actually, truly increases on average by 0% over time?

In the example of 101% and 99%, what we really needed to do was oscillate between 101.0101010101% and 99%, which gives us an average percentage change over time of 0%. The formula that represents this is the **geometric mean**:

`(0.99 * 1.01010101010101)^(1/2) = 1.00000`

We can intuit here that the reason this works is because:

- Percentages are multiplied.
- Geometric means are like the multiplicative equivalent of the arithmetic mean.
- Ergo, geometric means are good for percentage changes.

That’s all fine and dandy. But I would argue this approach is not the best way to think about it.

There is a ** much** better way to think about everything we’ve done up to this point on geometric means, which is to use logarithms. Logarithms are, in my experience, criminally underused. Logarithms are a way to do the following:

- turn multiplication into addition
- turn exponentiation into multiplication

And we can see that the formula for the geometric mean consists of two steps:

- Multiplication of all x_{0}, x_{1}, … followed by:
- Exponentiation by 1/N

If we were to instead use logarithms, it would become:

- Addition of all log(x_{0}), log(x_{1}), … followed by:
- Multiplication by 1/N

Well golly, that looks an awful lot like taking the arithmetic mean.

We can just use the above example to confirm this works as intended:

`(log(0.99) + log(1.01010101010101)) / 2 = 0`

Of course, the final step to a proper geometric mean is that we need to convert back to the original domain, so let’s take the **exp()** of the arithmetic average of log(x), and we’re all set. (That said, we could also skip this final step, but more on that later.)

`e^((log(0.99) + log(1.01010101010101)) / 2) = e^0 = 1`
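
And if we actually oscillate between 99% and 100/99 ≈ 101.0101…%, the series really does stay put. A quick numpy check:

```
>>> steps = np.tile([0.99, 1 / 0.99], 50)  # 100 alternating percentage changes
>>> np.isclose(42 * steps.prod(), 42)  # no drift in the cumulative product
True
>>> np.isclose(np.exp(np.log(steps).mean()), 1.0)  # the geometric mean of the steps is 1
True
```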

It would be negligent of me if I failed to point out another factoid here, which is that the example where we used a normal distribution and cumulatively multiplied it on the number 42 was pretty bad. What we should have been doing is something like this:

```
>>> np.random.seed(123789)
>>> w = np.exp(np.log(42) + np.random.normal(0, 0.01, size=100).cumsum())
>>> print(w)
[42.04757853 42.03767307 41.71175204 41.35454934 41.62054759 41.6075732
41.54983288 41.43970185 41.91568309 42.23240569 42.72341859 42.24791181
41.97051688 41.99435579 42.04788319 42.68784402 43.17163141 43.11321422
43.05973883 43.48593847 43.45580668 43.67426001 44.5840072 45.69515896
46.29346244 46.52853533 46.2992428 46.18620244 45.88003251 46.62132714
46.60257292 46.43310272 46.55999518 46.32343379 46.20800647 45.89207742
46.00097075 46.18218262 46.59929002 46.35029673 46.24973082 46.39807618
46.924215 46.94728105 46.5494736 46.43645126 47.26412466 46.54322008
46.33447728 46.63572133 46.6786372 46.90368576 47.21667417 47.05015934
47.01905637 47.43071786 47.57165118 46.9620617 47.2683152 46.28069134
46.20036235 45.76285665 46.12441204 46.58868319 46.32119959 46.33432096
46.83187783 46.486633 46.84950368 47.20823273 46.80124992 47.08130145
46.58325664 46.71110628 46.81578114 46.14625089 46.15792567 45.85668589
45.37720581 44.64861972 44.39836865 43.93450417 45.0169601 45.58411068
45.28835317 45.33707393 45.25037578 45.52754206 45.26391821 45.28384832
45.06173788 45.5085144 45.63230275 44.99180853 45.43820589 46.09319298
46.01279547 46.08962901 46.67943336 46.96112588]
```

Keep in mind that **ln(1) = 0**, and, again, that logarithms turn multiplication into addition– which is why we can use the cumulative *sum* this time (rather than the cumulative *product*).

Note that in this formulation, the series is no longer normally distributed with respect to **x**. Instead, it’s log-normally distributed with respect to **x**. Which is just a fancy way of saying that it’s normally distributed with respect to **log(x)**. Basically, you can put the “log” either in front of the word “normally” or in front of the “x”. “Log(x) is normal” and “x is log-normal” are equivalent statements.

The code above may look like a standard normal distribution, but you know it’s log-normal because (1) 42 was wrapped in **log()**, and (2) I end up wrapping the entire expression in **exp()**.

When working with percentages, we like to use logarithms and log-normal distributions for a few reasons:

- Log-normal distributions never go below 0 (the domain of the logarithm never goes below zero). If we used, say, a normal distribution, there is always a chance the output number is negative, and if we mess that up things will go haywire.
- The average of the series for **log(x)** equals the **μ** (“mu”) parameter for the log-normal distribution. Or rather, it may be more accurate to say that for any finite series that is log-normally distributed, the mean of the logged series is an unbiased estimator for **μ**.
- **Adding** 2 or more normally distributed random variables together yields another normally distributed variable, but the same is *not* true when **multiplying** 2 or more normally distributed random variables. So if we use logs, we get to keep everything normal, which can be extremely convenient.
- Adding is easier than multiplying in a lot of contexts. Some examples (a quick DuckDB sketch follows this list):
  - In SQL, the easiest way to take a “total product” in a `GROUP BY` statement is to do `EXP(SUM(LN(x)))`.
  - In SQL, the easiest way to take the geometric mean in a `GROUP BY` statement is to do `EXP(AVG(LN(x)))` (but you already knew that, right?).
  - Linear regression prediction is just adding up the independent variables on the right-hand side, so estimating some `log(y)` using linear regression is a lot like expressing all the features as multiplicative of one another.
- It just so happens that in a lot of contexts where percentages are an appropriate way to frame the problem, you’ll see log-normal distributions. This is especially true of real-world price data, or the changes in prices: **log(price_{t}) – log(price_{t-1})**. (Note that if log(price) is normal, subtracting two log(price)’s is also normal, because of the thing about adding two normal distributions.) Try a Shapiro-Wilk test on your **log(x)**, or just make a Q-Q plot of **log(x)** and use your eyes.
- **log(1 + r)** (with the natural log) is actually a pretty close approximation to **r**, especially for values close to 0.
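
To illustrate the SQL bullets, here is a small sketch using DuckDB from Python (the `prices` table and its columns are made up for the example):

```
import duckdb
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "grp": np.repeat(["a", "b"], 500),
    "x": np.random.lognormal(mean=0.0, sigma=0.01, size=1000),
})

con = duckdb.connect()
con.register("prices", df)  # made-up table name for the example
print(con.execute("""
    select
        grp,
        exp(sum(ln(x))) as total_product,
        exp(avg(ln(x))) as geometric_mean
    from prices
    group by grp
""").df())
```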

Two more things– first, you can always use the `np.random.lognormal()` function, then take the cumulative product, to confirm your high school pre-calculus:

```
>>> w = 42 * np.random.lognormal(0, 0.01, size=100).cumprod()
>>> print(w)
[42.04757853 42.03767307 41.71175204 41.35454934 41.62054759 41.6075732
41.54983288 41.43970185 41.91568309 42.23240569 42.72341859 42.24791181
41.97051688 41.99435579 42.04788319 42.68784402 43.17163141 43.11321422
43.05973883 43.48593847 43.45580668 43.67426001 44.5840072 45.69515896
46.29346244 46.52853533 46.2992428 46.18620244 45.88003251 46.62132714
46.60257292 46.43310272 46.55999518 46.32343379 46.20800647 45.89207742
46.00097075 46.18218262 46.59929002 46.35029673 46.24973082 46.39807618
46.924215 46.94728105 46.5494736 46.43645126 47.26412466 46.54322008
46.33447728 46.63572133 46.6786372 46.90368576 47.21667417 47.05015934
47.01905637 47.43071786 47.57165118 46.9620617 47.2683152 46.28069134
46.20036235 45.76285665 46.12441204 46.58868319 46.32119959 46.33432096
46.83187783 46.486633 46.84950368 47.20823273 46.80124992 47.08130145
46.58325664 46.71110628 46.81578114 46.14625089 46.15792567 45.85668589
45.37720581 44.64861972 44.39836865 43.93450417 45.0169601 45.58411068
45.28835317 45.33707393 45.25037578 45.52754206 45.26391821 45.28384832
45.06173788 45.5085144 45.63230275 44.99180853 45.43820589 46.09319298
46.01279547 46.08962901 46.67943336 46.96112588]
```

Second, we might as well take the geometric mean of the deltas in the series, huh?

We expect that the geometric mean should be very close to `1.00`, so when we subtract 1 from it we should see something very close to `0.00`:

```
>>> np.exp(np.diff(np.log(w)).mean()) - 1
0.0011169703289617416
```

Additionally, we expect the geometric mean to be pretty close (but not equal) to the arithmetic mean of the percentage changes:

```
>>> (np.diff(w) / w[:-1]).mean()
0.001156197504173517
```

Last but not least, in my opinion, if you are calculating geometric means and log-normal distributions and whatnot, it is often perfectly fine to just take the average of **log(x)** rather than the geometric mean of **x**, or the normal distribution of **log(x)** instead of the log-normal distribution of **x**, things of that nature.

In a lot of quantitative contexts where logging your variable makes sense, you will probably just stay in the log form for the entirety of your mathematical calculations. Remember that the **exp()** at the very end is just a conversion back from **log(x)** to **x**, and sometimes that conversion is unnecessary! Get used to talking about your data that should be logged in its logged form.

In that sense, I mostly feel like a geometric mean is just a convenience for managers. We tell our managers that the average percentage change of this or that is [geometric mean of X minus one] percent, but in our modeling we do [mean of log(X)]. Honestly, I prefer to stick with the latter.

The most common example provided for a harmonic mean is the average speed traveled given a fixed distance. Wikipedia provides the example of a vehicle traveling 60kph on an outbound trip, and 20kph on a return trip.

The “true” average speed that was traveled for this trip was 30kph. The 20kph leg needs to be “weighted” 3 times as much as the 60kph leg, because traveling a fixed distance at a third of the speed means you spend 3 times as long traveling at that speed. You *could* take the weighted average to get to this 30kph number, i.e. `(20*3+60)/4 = 30`. That works for mental math. But the more generalized way to solve this is to take reciprocals. Given what we now know about the harmonic mean, we can see that what we are doing is:

- Convert the numbers to hours per kilometer: **f**(x) = 1/x
- Take the average of hours per kilometer.
- Convert back to kilometers per hour: **f**^{-1}(x) = 1/x

Our motivation for this is because we want a fixed denominator and a variable numerator so that the numbers we’re working with work sensibly when being added up (e.g. **a/c + b/c = (a+b)/c**). In the original formulation where the numbers are reported in kilometers per hour, the denominator is variable (hours) and the numerator is fixed (kilometers).
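
Running the 60kph / 20kph example through those three steps (a quick numpy check):

```
import numpy as np

speeds_kph = np.array([60.0, 20.0])       # outbound and return legs over the same distance
hours_per_km = 1 / speeds_kph             # step 1: convert to hours per kilometer
avg_hours_per_km = hours_per_km.mean()    # step 2: average in the transformed units
harmonic_mean_kph = 1 / avg_hours_per_km  # step 3: convert back to kilometers per hour
print(harmonic_mean_kph)                  # 30.0 (up to floating point)
```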

Again, much like the geometric mean situation, *our motivation in this case is to make the problem trivially and sensibly additive* by applying an invertible transformation. Note that if the distance was variable but the time was fixed, then the numbers would already be additive and we would just take the average without any fuss.

It may sound silly to even ask, but we should not take for granted the question of * why* we use means and whether we should be using them.

For example, imagine a time series of annual US GDP levels from 1982 to 2021. What happens if you take the mean of this? Answer: you get $11.535 trillion. OK, so what though? This value is not going to be useful for, say, predicting US GDP in 2022. US GDP in 2022 is more likely to be closer to the 2021 number (23.315 trillion) than it is to be anywhere near the average annual GDP between 1982 and 2021. Also, this number is extremely sensitive to the time periods you pick.

Clearly, it is not necessarily the case that taking an average of any and all data is useful. Sometimes (like in the above example) it is pretty useless.

Another example where it may be somewhat useful but misleading would be for household income. Most household income statistics use median household income, not mean household income, because these numbers are heavily skewed by very rich people.

To be clear, *sometimes (arguably a majority of the time) it’s fine to use the mean over a median, even if it is skewed by a heavy tail of outliers!* For example, insurance policies: the mean payout is exactly what an insurer needs to know to price a policy, even though the vast majority of payouts are $0.

The only reason we prefer the median in the household income context is because we want to know what a typical or representative household is like, and the mean doesn’t give us a great idea of that due to the skew. Similarly, in the fire insurance case, a typical household with fire insurance gets a payout of $0. But we don’t actually care about typical or representative households when doing our actuarial assignment; that’s not what insurance is about.

In the case of the GDP example, the mean was useless because the domain over which we were trying to summarize the data was meaningless and arbitrary, and the data generating process is non-stationary with respect to that domain. Taking the median would have been pretty useless too, which is to say none of these sorts of summary statistics are particularly useful for this data. In the household income example, we want some sort of summary statistic, but we want every household at the top to offset every household at the bottom, hit for hit, until all that’s left is the 50th percentile.

In short:

- For some data, neither the mean nor the median is useful.
- Sometimes mean is useful and median is not.
- Sometimes median is useful and mean is not.
- Sometimes they’re both useful.
- No, the median is not something you use just to adjust for skew (because sometimes you don’t want to adjust for skew, even for very heavy-tailed distributions). There are no hard and fast rules you can make based on a finite distribution’s shape alone. It all depends on the context!

One reason we like to use means is because of the central limit theorem. Basically, if you have a random variable sampled from some stationary distribution with finite variance, then the larger your sample gets, the closer the distribution of the sample mean (which is itself a random variable) gets to a normal distribution whose mean equals the population mean.

The central limit theorem is often misunderstood, and it’s hard to blame people because it’s a mouthful and it’s easy to conflate many of the concepts inside that mouthful. There are two distinct mentions of random variables, distributions, and means.

If more people understood the central limit theorem you’d see fewer silly statements out in the wild like “you can’t do a t-test on data unless it’s normally distributed.” Ahem… yes you can! The key here is that the *distribution of the underlying data* is different from the *distribution of the sample mean*, and the central limit theorem is a statement about the latter.

This serves as our first motivation for why we care about means: all finite-variance stationary distributions have them, they are pretty stable as the sample size increases, and they converge to their population means.

A second motivation is because the arithmetic mean minimizes the squared errors of any given sample. And sometimes that’s pretty useful. That is to say, if you have some sequence **x_{i}** of N numbers, the single value **c** that minimizes **sum((x_{i} - c)^{2})** is the arithmetic mean of that sequence.

A third reason is because in many contexts, the mean gives us an estimator for a parameter we care about from an underlying distribution. For example:

- The sample mean of a normally distributed random variable **x** converges to the mean parameter for that distribution.
- The sample mean of a normally distributed random variable **log(x)** converges to the mean parameter of a log-normal distribution of **x**.
- The sample mean of a Bernoulli distributed random variable **x** converges to the probability parameter of a Bernoulli distribution.
- Given some **r**, the sample mean of a negative binomial distributed variable **x**, when divided by **r**, converges to the odds of the probability parameter **p**.

So by taking a mean, we can estimate these and many other distributions’ parameters.
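
For instance, a quick numpy sketch of two of these (note that numpy’s `negative_binomial(n, p)` counts failures before `n` successes, so its mean is `n * (1 - p) / p`):

```
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli: the sample mean estimates the probability parameter p.
print(rng.binomial(1, 0.3, size=100_000).mean())             # ~0.3

# Negative binomial: the sample mean divided by r estimates the odds (1 - p) / p.
r, p = 5, 0.4
print(rng.negative_binomial(r, p, size=100_000).mean() / r)  # ~1.5, i.e. (1 - 0.4) / 0.4
```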

- I find the statement that “arithmetic mean > geometric mean > harmonic mean” to be a cute factoid but often worse than useless to point out, which is why it’s down here. More often than not, it is inappropriate to compare means like this. It either is or isn’t appropriate to take reciprocals before averaging, or take the log before averaging. The comparison of magnitudes across these means is completely irrelevant to that decision, and this cute factoid may mislead people into thinking that it is.
- I think one reason the geometric mean is often defined using powers and multiplication, rather than logs and exponents, is that the multiplicative form technically allows negative numbers? However, I am not aware of any actual situations where a geometric mean would be useful but we’re allowing for negative numbers.

I had been a data scientist for the past few years, but in 2022, I got a new job as a data engineer, and it’s been pretty good to me so far.

I’m still working alongside “data scientists,” and I still do a little bit of that myself, but most of my “data science” work is directing and consulting on others’ work. I’ve been focusing more on the implementation of data science (“MLops”) and data engineering.

The main reason I soured on data science is that the work felt like it didn’t matter, in *multiple* senses of the words “didn’t matter”:

- The work is downstream of engineering, product, and office politics, meaning the work was often only as good as the weakest link in that chain.
- Nobody knew or even cared what the difference was between good and bad data science work, meaning you could absolutely suck at your job or be incredible at it and you’d get nearly the same recognition in either case.
- The work was often very low value-add to the business (often compensating for incompetence up the management chain).
- When the work’s value-add exceeded the labor costs, it was often personally unfulfilling (e.g. tuning a parameter to make the business extra money).

Management was by far my biggest gripe. I am completely exhausted by the absolute insanity that was the tech industry up to 2021. Companies all over were consistently pursuing things that could be reasoned about *a priori* as being insane ideas– ideas any decently smart person should know wouldn’t work before they’re tried. Some projects could have saved whole person-years of labor had anyone possessed a better understanding of the business, customers, broader economic / social landscape, financial accounting, and (far too underrated in tech) any relevant subject matter areas.

Those who have seen my Twitter posts know that I believe the role of the data scientist in a scenario of insane management is not to provide real, honest consultation, but to launder these insane ideas as having some sort of basis in objective reality even if they don’t. Managers will say they want to make data-driven decisions, but they really want decision-driven data. If you strayed from this role– e.g. by warning people not to pursue stupid ideas– your reward was their disdain, then they’d do it anyway, then it wouldn’t work (what a shocker). The only way to win is to become a stooge.

The reason managers pursued these insane ideas is partly because they are hired despite not having any subject matter expertise in business or the company’s operations, and partly because VC firms had the strange idea that ballooning costs well in excess of revenue was “growth” and therefore good in all cases; the business equivalent of the Flat Earth Society. It was also beneficial for one’s personal career growth to manage an insane project (rΓ©sumΓ© lines such as “managed $10 million in top-line revenue,” failing to disclose that their COGS was $30 million). Basically, there’s a decent reward for succeeding and no downside for failing, and sometimes you will even be rewarded for your failures! So why not do something insane?

Also, it seems that VC firms like companies to run the same way their portfolios do– they want companies to try 100 different things, and if only 5 out of those 100 things work, then the VCs will consider that a success. On the ground floor, this creates a lot of misery, since the median employee at the company is almost certainly working on a product that is not destined to perform well, but the shareholders are happy, which is of course all that matters.

The median data scientist is horrible at coding and engineering in general. The few who are remotely decent at coding are often not good at engineering in the sense that they tend to over-engineer solutions, have a sense of self-grandeur, and want to waste time building their own platform stuff (folks, do not do this).

This leads to two feelings on my end:

- It got annoying not having some amount of authority over code and infra decisions. Working with data scientists without having control over infra feels like wading through piles of immutable shit.
- It was obvious that there is a general industry-wide need for people who are good at both data science and coding to oversee firms’ data science practices in a technical capacity.

I don’t want to be *too* snooty: in a sense, it’s fine for data scientists to suck at coding! Especially if they bring other valuable skills to the table, or if they’re starting out. And in another sense, bad code getting into production is a symptom of bad team design and management, more than any individual contributors’ faults! By describing the median data scientist’s coding skills as shitty, I’m just trying to be honest, not scornful.

The problem is that the median data scientist works at a small to medium-sized company that doesn’t build their data science practices around a conceit that the data scientists’ code will suck. They’d rather let a 23 year old who knows how to `pip install jupyterlab` run loose and self-manage, or manage alongside other similarly situated 23 year-olds. Where is the adult in charge?

23 year-old data scientists should probably not work in start-ups, frankly; they should be working at companies that have actual capacity to on-board and delegate work to data folks fresh out of college. So many careers are being ruined before they’ve even started because data science kids went straight from undergrad to being the third data science hire at a series C company where the first two hires either provide no mentorship, or provide shitty mentorship because they too started their careers in the same way.

On the other hand, it’s not just the companies’ and managers’ faults; individual data scientists are also to blame for being really bad at career growth. This is not contemptible for people who are just starting out their careers, but at some point folks’ résumés start to outpace their actual accumulation of skills, and I cannot help but find that a teeny bit embarrassing.

It seems like the main career growth data scientists subject themselves to is learning the API of some gradient boosting tool or consuming superficial + shallow + irrelevant knowledge. I don’t really sympathize with this learning trajectory because I’ve never felt the main bottleneck to my work was that I needed some gradient boosting tool. Rather the main bottlenecks I’ve faced were always crappy infrastructure and lacking (quality) data, so it has always felt natural to focus my efforts toward learning that stuff to unblock myself.

My knowledge gaps have also historically been less grandiose than learning how some state of the art language model works or pretending I understand some arXiv white paper ornate with LaTeX notation of advanced calculus. Personally, I’ve benefited a ton from reading the first couple chapters out of advanced textbooks (while ignoring the last 75% of the textbook), and refreshing on embarrassingly pedestrian math knowledge like “how do logarithms work.” Yeah I admit it, I’m an old man and I’ve had to refresh on my high school pre-calc. Maybe it’s because I have 30k Twitter followers, but I live in constant anxiety that someone will pop quiz me with questions like “what is the formula for an F-statistic,” and that by failing to get it right I will vanish in a puff of smoke. So my brain tells me that I must always refresh myself on the basics. I admit this is perhaps a strange way to live one’s life, but it worked for me: after having gouged my eyes out on linear regression and basic math, it’s shockingly apparent to me how much people merely pretend to understand this stuff, and how much ostensible interest in more advanced topics is pure sophistry.

For the life of me I cannot see how reading a blog post that has sentences in it such as “DALL-E is a diffusion model with billions of parameters” would ever be relevant to my work. The median guy who is into this sort of superficial content consumption hasn’t actually gone through chapters in an advanced textbook in years if ever. Don’t take them at their word that they’ve actually grinded through the math because people lie about how well-read they are *all the time* (and it’s easy to tell when people are lying). Like bro, you want to do stuff with “diffusion models”? You don’t even know how to add two normal distributions together! You ain’t diffusing shit!

I don’t want to blame people for spending their free-time doing things other than learning how to code or doing math exercises out of grad school textbooks. To actually become experts in multiple things is oppressively time-consuming, and leaves little time for other stuff. There’s more to life than your dang job or the subject matters that may be relevant to your career. One of the main sins of “data scientist” jobs is that it expects far too much from people.

But there’s also a part of me that’s just like, how can you not be curious? How can you write Python for 5 years of your life and never look at a bit of source code and try to understand how it works, why it was designed a certain way, and why a particular file in the repo is there? How can you fit a dozen regressions and not try to understand where those coefficients come from and the linear algebra behind it? I dunno, man.

Ultimately nobody really knows what they are doing, and that’s OK. But between companies not building around this observation, and individuals not self-directing their educations around this observation, it is just a bit maddening to feel stuck in stupid hell.

These are the things I’ve been enjoying about data engineering:

- Sense of autonomy and control.
  - By virtue of what my job is, I have tons of control over the infrastructure.
  - Data engineering feels significantly less subject to the whims and direction of insane management.
- Less need for sophistry.
  - My work is judged based on how good the data pipelines are, not based on how good-looking my slide decks are or how many buzzwords I can use in a sentence. Not to say data engineering doesn’t have buzzwords and trends, but that’s peddled by SaaS vendors more than actual engineers.
- More free time.
  - I dunno, it feels like becoming a data engineer cured my imposter syndrome? I feel like I have more ability to dick around in my spare time without feeling inadequate about some aspect of my job or expertise. But this is probably highly collinear with not being a lackey for product managers.
- Obvious and immediate value that is not tied to a KPI.
  - I like being valued, what can I say.
  - Ultimately the data scientists need me more than I need them; I’m the reason their stuff is in production and runs smoothly.
  - I have a sense that, if my current place of business needed to chop employees, it would be a dumb decision to chop me over any data scientist.
- Frankly, I feel really good at what I do.
  - As someone who has worked a variety of downstream data-related jobs, I have both a very strong sense of what my downstream consumers want, as well as the chops to QC/QA my own work with relative ease the way a good analyst would.
  - At my last company I had a lot of “I could totally do a better job at designing this” feelings regarding our data stack, and it has immensely fed my ego to have confirmed all of these suspicions myself.
  - This role gets to leverage basically everything I’ve learned in my career so far.

By far the most important thing here is the sense of independence. At this point it feels like the main person I should be complaining about is myself. (And self-loathing is so much healthier than hating a random product manager.) As long as my company’s data scientists are dependent on me to get code into production, I have veto power over a lot of bad code. So if they are putting bad code in production, that ultimately ends up being my fault.

I think my career trajectory made sense– there was no way I was hopping straight into data engineering and doing a good job of it without having done the following first:

- See a few data stacks and form opinions about them as a downstream consumer.
- Get better at coding.
- Pick up on the lingo that data engineers use to describe data (which is distinct from how social scientists, financial professionals, data scientists, etc. describe data), like “entity,” “normalization,” “slowly-changing dimension type 2,” “CAP theorem,” “upsert,” “association table,” so on and so on.

So, ultimately I have no regrets having done data science, but I am also enjoying the transition to data engineering. I continue to do data science in the sense that these roles are murkily defined (at both my “data scientist” and “data engineer” jobs I spend like 40% of the time writing downstream SQL transformations), I get to critique and provide feedback on data science work, and, hey, I actually *did* deploy some math heavy code recently. Hell, you could argue I’m just a data scientist who manages the data science infra at a company.

Anyway, great transition, would recommend for anyone who is good at coding and currently hates their data science job. My advice is, if you want to get into this from the data science angle, make sure you are actively blurring the lines between data science and data engineering at your current job to prepare for the transition. Contribute code to your company’s workflow management repo; put stuff into production (both live APIs and batch jobs); learn how to do CI/CD, Docker, Terraform; and form opinions about the stuff your upstream engineers are feeding you (what you do and don’t like and why). In fact it is very likely this work is higher value and more fun than tuning hyperparameters anyway, so why not start now?

Sorry, this post has no point, so it’s ending rather anticlimactically.

A/B tests are the gold standard of user testing, but there are a few fundamental limitations to A/B tests:

- When evaluating an A/B test, your metrics must be (a) measurable, and (b) actually being measured.
- A/B tests apply effects that are (a) on the margin, and (b) within the testing period.

Some of these points may seem obvious on their face, but they have pretty important implications that many businesses (specifically managers, and even many folks with “data” in their job titles) fail to consider.

What people expect is that as an app grows, the sample sizes get larger, so this increases the statistical power of experiments. Additionally, larger denominators in KPIs imply that fixed labor costs are more productive per unit, so even if you see lower effect sizes (on a percentage basis), they become more acceptable when spread out across more users.

Everything above is true, but there are some oft neglected countervailing effects:

- Features “cannibalize” one another, or rather they experience diminishing marginal returns as other features get added.
- As a consequence of the first point, contemporaneous effects of experimentation may squeeze rapidly toward zero, *even if* intertemporal effects– which are often stronger anyway– remain non-zero.

The end result is that, at a growth company, it is not unreasonable to find that diminishing effect sizes may decrease the statistical power of experiments (measuring contemporaneous or near-term effects) *faster* than inflating sample sizes increase statistical power. But critically, this doesn’t mean that the features being tested are “bad.”

While drafting and before publishing, I have confirmed from other data scientists at other companies that they can see a pattern of diminishing effect sizes of A/B tests over time across experiments in aggregate, and have also discussed with them some things in regards to the internal politics surrounding it. That said, everything in this post is ultimately my own opinion and analysis.

The implications of measurability are that you can’t tell whether users actually enjoy or hate a particular feature if this doesn’t show up in metrics, e.g. they hate the feature but continue to use the app within the period of your testing.

The assumption that churn is a measure for user dissatisfaction is, ultimately, an assumption. It is more likely that contemporaneous churn signifies extreme dissatisfaction, but that minor dissatisfaction may exist without immediately obvious churn.

Some folks may mistakenly believe things like, “if we have enough users, we’ll be able to see the effects.” The mistake here is not really one of sample size, but of timing of the effects: if user dissatisfaction leads to a delayed churn, the test may conclude before the measurable quantity sees a statistically significant effect (even though the immeasurable effect, i.e. sentiment, was immediate).

The idea that users will not churn contemporaneously even if they’ll churn later sounds like a stretch– surely across 500,000 users in your experiment, you’ll see some sort of immediate effect, no?– but it’s not a stretch if you think about it from a user experience perspective. Measurable effects on churn or other KPIs may be delayed because users often do not immediately stop what they are doing just because a particular in-app flow becomes more annoying. Usually the biggest hurdle to a user is convincing them to log into the app in the first place, and a “high intent” user will not be immediately dissuaded from completing a flow even if they are dissatisfied with the smoothness of the experience in the process of doing so. Measuring the delayed effect may be especially slow if your typical user only logs in, say, once a month. (I highly recommend doing cumulative impulse response functions of key metrics partitioned by experiment group like average amount of dollar spend. You may be shocked how many effects linger past some experiment dates. You may even see sustained effects in the first derivatives of your CIRFs!) User insistence on completing subpar flows is how you get zero effects within-period but non-zero effects in future periods, even with massive sample sizes.

While we’re here, it should also be noted that “measurable” is not a synonym for “being measured.” Many metrics that sound great in theory during a quick call with management may not ever be collected in practice for a wide variety of reasons. Some of those reasons will be bad reasons and should be identified as missing metrics, then rectified. Other reasons will be good reasons, like it’d involve 30 Jira points of labor, and oh well, that’s life. So the company’s “data-driven” understanding of what is going on is confined to what the company has the internal capacity to measure. This means that many theoretically measurable metrics will be *de facto* immeasurable. And many of the things you really, really want to know will be unknowable via data.

And finally, a metric being measured and utilized also assumes that it is the correct metric being measured, and that there are no errors in what a particular number represents. Of course these assumptions can be violated as well.

Let’s say you are A/B testing a price for a simple consumer product, and you decide to go with whichever one yields the most profit to the business.

This is just a simple `(price - marginal cost) * quantity` calculation, and it should not usually be hard to calculate which is more profitable. The problem here isn’t the measurability of profit.
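
For instance, with made-up numbers for the two price arms:

```
# Toy example: pick the within-period profit-maximizing arm of a price A/B test.
marginal_cost = 4.00
arms = {
    "A": {"price": 9.99, "units_sold": 10_000},
    "B": {"price": 12.99, "units_sold": 7_500},
}
for arm, d in arms.items():
    profit = (d["price"] - marginal_cost) * d["units_sold"]
    print(arm, round(profit, 2))  # A: 59900.0, B: 67425.0 -- B wins on within-period profit
```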

Yet, there is still a hidden measurement problem, combined with an intertemporal effect problem. Specifically, the measurement problem is consumer surplus (i.e. it’s semi-measurable or deducible, but most likely not *actually* being measured at the majority of tech companies). The intertemporal effects are twofold: (1) users who receive low consumer surplus may be less likely to purchase again, and (2) the profit-maximizing price may change across periods due to competition and macroeconomic conditions.

Profit being an adversarial metric with the user is the root problem. Without a countervailing force being “measured” in some way (consumer surplus), you may end up pissing off users enough to not come back a 2nd time.

The other problem is competition, which may introduce non-stationarities in a broader sense. So optimal prices can change a lot over time– even if users have the memories of goldfish and there are no autoregressive (so to speak) intertemporal effects on a user-by-user basis.

I mention A/B testing of prices in particular because it is an especially dangerous thing to attempt without a deep understanding of pricing from the marketing side of things, plus a confidently held theory of how to attain desired business results with pricing. I don’t even mean it’s dangerous from an ethical sense (that’s true but outside the scope of this document); I mean it is dangerous from a strictly long-term business perspective. Even if your theory is that it’s optimal business to squeeze users for as much money as possible in a short-term sense, this should be acknowledged as a theory and it should be confidently held when being executed.

Many managers do not want to confidently state that squeezing users for every penny is desirable, perhaps because it is counter to liberal sensibilities. So they may instead state that to price a particular way is “data-driven.” It’s a farce to hide behind the descriptor “data-driven” in the pricing context, as short-term profit maximization is not a data-driven result, it’s a theory for how to run a business. The price that spits out from the A/B test as maximizing short-term profit is what’s “data-driven,” not the *decision* to go with said pricing strategy. A general form of this idea is true in all contexts of using KPIs to make decisions (e.g. is maximizing time users spend on an app actually a good thing?), but the pricing context is where it is most obvious that describing a decision as “data-driven” is just begging the question.

If your app has 5 features (ABCDE), and you are A/B testing a 6th feature (F), the A/B test is testing the differences between the combination of ABCDE and ABCDEF. In other words, you are measuring the marginal impact of F conditional on ABCDE.

It is easy to imagine a few unintended problems arising from this. Imagine for example that all these features are more or less interchangeable in terms of their effects on KPIs in isolation, but the effects of these features tend to eat into each other as the features get piled on. In this case, **the order in which these features gets tested (and not the quality of the features in a vacuum) would be the primary determinant of which features evaluate well**.

Note that this argument does not rely on the idea that the company tackles the most important features first, and then feature additions by their very nature become more fringe and smaller over time. Although there will certainly be some of that going on, this effect comes from a saturation of features in general. Diminishing marginal returns exist *even if* all the features are more or less the same quality. But, that’s the key to understanding a key limitation of A/B tests: you are testing the marginal effect conditional on what was contemporaneously in the app while you conducted the test.

In the worst-case scenario, diminishing marginal returns in quantity of features become *negative* marginal returns, and basically nothing else can be added to the app.

Let’s go back to our example of testing F, conditional on ABCDE. Imagine feature D was only ever tested conditional on ABC existing in the app. However, the fairest comparison of features D and F would be testing feature D conditional on ABCE, testing feature F conditional on ABCE, testing the interaction of DF conditional on ABCE, and testing neither conditional on ABCE. Given that ABCDE are already established features in the app, this would mean we need to re-test D.
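
In experimentation terms, that “fair” comparison is a 2x2 factorial design on D and F (conditional on ABCE). A rough sketch of the assignment, with made-up arm names:

```
import numpy as np

rng = np.random.default_rng(0)
n_users = 100_000

# Each user independently gets D and/or F on top of the ABCE baseline.
gets_d = rng.random(n_users) < 0.5
gets_f = rng.random(n_users) < 0.5
arm = np.select(
    [gets_d & gets_f, gets_d & ~gets_f, ~gets_d & gets_f],
    ["ABCE+D+F", "ABCE+D", "ABCE+F"],
    default="ABCE",
)
# Compare KPIs across the four arms to estimate the marginal effects of D, F, and their interaction.
```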

To be clear, actually doing this “fair” comparison is a little unreasonable. Companies don’t tend to continually A/B test features that have proven themselves successful for a few reasons:

- Non-zero maintenance cost of sustaining multiple experiences.
- Difficulty in marketing a feature that not all users receive. (Imagine being advertised to about a feature that you can’t even see because you’re in the control group.)
- Testing a worse experience comes with the opportunity cost (sometimes called “regret”) of showing some proportion of users something that doesn’t optimize a KPI.

Additionally, there are arguments in favor of preferring to retain older features at full exposure, *even if* they have worse marginal effects in actuality. Imagine in the hypothetical “fair” comparison that F is actually slightly better than D. We still may prefer D over F because:

- Maintaining tried and true features tends to be easier from the engineering perspective.
- Users come to expect the existence of a feature the longer it is around.
- Users are “loss averse” to the removal of a feature (i.e. going from “not X” to “X” back to “not X” is potentially worse than just starting and staying on “not X”).

A final reason to not tackle this problem is because the “dumb” interpretation of A/B testing isn’t obviously bad for operational purposes. Effectively, if F has basically no effect conditional on ABCDE existing, then it’s not obvious at first glance that there are problems with taking this literally to mean there is no effect of including or excluding “F,” unless modifying ABCDE is a reasonable option to pursue.

That said, many companies would still benefit from doing the following:

- Most importantly, have a theory and use common sense when building the app out. (Many things don’t need to be A/B tested.)
- When you start to see diminished effect sizes, run longer duration A/B tests.
- Be careful to consider the temporal effect of when the experiment was started; experiments not run contemporaneously are often not reasonable to compare to one another.
- Measure IRFs and CIRFs of your KPIs with experiment exposure as the impulse, even past the experimentation period.

Product managers, data scientists, data analysts, and engineers tend to suck at thinking about their company holistically. In fairness, their job is usually to think about very small slices of the company very intensely. But spending the entire month of August optimizing some API to run faster, even if economical to do so, will belie the true determinants of the app’s performance, which are often a grab bag of both banal and crazy factors, both intrinsic and extrinsic to the company and its operations. Some tech employees lose the plot and really do need to look at an overpriced MBB consulting deck about their company, unmarred by being too deep in the weeds on one thing.

So these tech employees may not even realize that their A/B test performances are getting worse; or they may notice, but not understand that it is not just a coincidence– that it is a perfectly reasonable, expected thing to be happening.

Another reason many tech employees may not understand what’s going on is because they’ve been told that A/B tests are good and the gold standard for testing features, without deeply understanding what they do and don’t measure. Unfortunately for managers who insist on being “data-driven,” comparing the results of A/B tests across different periods absolutely does require interpretation and subject matter expertise, because inter-period A/B tests are in some ways incomparable.

A “data-driven” culture that does not acknowledge simple limitations of data tools we use is one where data and data-ish rhetoric can be abused to push for nonsensical conclusions and to cause misery.

The main frustration that comes from diminishing marginal effects of A/B testing is that managers will see those beneath them as failing in some sense– upper managers thinking their product managers are failing; product managers thinking their data scientists are failing, etc. compared to the experiments and features of yesteryear.

The truth of course is more complicated and some combination of:

- The intertemporal effects dominate (implying the test needs to go on for longer)
- The app is saturated with other cannibalizing features (implying that the work is either not productive, or that other features should be removed).
- The feature makes the app better (or worse?), but in some fundamentally immeasurable-in-an-A/B-test way. (Requiring theory, intuition, and common sense to dictate what should happen)

Usually people over-rely on this sort of “data-driven” rhetoric when they lack confidence in what they are doing in some regards– maybe they are completely clueless, or maybe they sense what they are doing is unethical and need some sort of external validation / therapy in the form of a dashboard.

I am a data person at the end of the day. I think the highest value data is that which tells you that a prior belief was wrong, or tells you something when you had no real prior beliefs at all. A lot of features you’ll be adding to an app don’t really need to be justified through “data-driven” means if you have a strong prior belief that it makes the app better. Maybe it doesn’t have an immediately obvious impact measurable in the treatment group, but maybe it makes users more happy over time and more likely to tell their friends to download the app (also, your app’s referral feature is probably garbage/unused and doesn’t capture 99% of downloads through word-of-mouth attributable to an A/B test).

These strong prior beliefs may come from, say, subject matter expertise. And if your company is at the point where contemporaneous and measurable effect sizes are decaying your power analyses faster than sample size increases can keep up, or you are working on a problem that is dealing with hard to measure or immeasurable effects, you’re going to need subject matter expertise to fill in that gap.

Unfortunately, this is all easier said than done. Middle managers don’t like hearing that tests need to run for longer. They don’t like hearing that their product idea “failed” an A/B test. They don’t like having to actively disavow a “data-driven” approach that upper management is pressing them to adhere strictly to (although they’ll likely be politically pressured into pretending to be “data-driven” by always reporting metrics that support the value of their own work, but that’s another story).

I’ll end on a personal anecdote of when an over-abundance of nominal commitments to “data-driven decisionmaking” led to some toxic internal politics:

I once had a “data-driven” upper manager who would say “show me the data” in conversation a lot when I said we should do something. The data this person claimed to want was never something that was actually being measured (always for good reason or reasons outside of my control). In fact, that was usually why I was not showing data– it was because it didn’t exist in some sense! But I’m not so credulous to believe that the request for data was sincere; I suspect that this manager knew this data was unattainable in some sense, and was using “show me the data” as a bludgeon to deliberately suppress these conversations and “win” arguments.

The technique the authors use is cute, but it’s not a true arbitrary multivariate regression. They cheat a little bit using dummy variables for the majority of their coefficients. I respect it, but it’s not an arbitrary regression.

Fortunately, it is possible to do true multivariate regression using a real-valued design matrix. This post covers how and provides code that can fully replicate what I did.

Install dependencies:

```
pip install statsmodels pandas numpy duckdb
```

Let’s generate some data and set up a DuckDB connection:

```
import statsmodels.api as sm
import pandas as pd
import duckdb
import numpy as np

# DuckDB connection that the SQL queries below will run against
con = duckdb.connect()

N = 50_000
df = pd.DataFrame(index=range(N))
df.index.name = "idx"
np.random.seed(42069)
df["const"] = 1
df["x1"] = np.random.normal(2, 4, size=N)
df["x2"] = np.random.normal(3, 2, size=N) + df["x1"] * np.random.normal(-1, 1, size=N)
df["x3"] = np.random.normal(2, 2, size=N) + df["x2"] * np.random.normal(-1, 1, size=N)
df["y"] = 2 + 3 * df["x1"] + 5 * df["x2"] - df["x3"] + np.random.normal(0, 3, size=N)
df.to_csv("fwl_generated_data.csv")
```

Let’s first start with a univariate regression:

```
# Univariate
sm.OLS(endog=df["y"], exog=df[["const", "x1"]]).fit().summary()
```

Output:

In a univariate regression, the slope coefficient is equal to the covariance of the y variable and the x variable, divided by the variance of the x variable. We can cheat a bit and use the fact that the numerator and denominator would both be divided by the same sample size, so those divisions cancel and we can work with plain sums of the centered variables. This gives us our “x1” coefficient:

```
sum(y_centered * x1_centered) / sum(x1_centered * x1_centered)
```

Getting the constant term is basically a matter of walking back our process of centering the data. First, we add back in the mean of “y” we subtracted out originally. But that’s not enough– we also centered “x1.” How can we walk that back? Just subtract out the mean of x1 times the coefficient we just calculated. This gives us our constant term:

```
avg(y) - avg(x1) * sum(y_centered * x1_centered) / sum(x1_centered * x1_centered) as const_coef
```
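If it helps, here is the same arithmetic spelled out in Python (my own sketch, reusing the `df` from the setup block above); the two numbers it prints should match the Statsmodels output:

```
# Center both variables
y_centered = df["y"] - df["y"].mean()
x1_centered = df["x1"] - df["x1"].mean()

# Slope: sum of cross products over sum of squares (the sample sizes cancel)
x1_coef = (y_centered * x1_centered).sum() / (x1_centered ** 2).sum()

# Constant: add back avg(y), subtract out avg(x1) times the slope
const_coef = df["y"].mean() - df["x1"].mean() * x1_coef

print(const_coef, x1_coef)
```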

Let’s use that CSV we generated to run DuckDB and calculate the coefficients there:

```
univariate_regression_query = """
with base as (
select
y,
y - avg(y) over () as y_centered,
x1,
x1 - avg(x1) over () as x1_centered
from fwl_generated_data.csv
),
regress as (
select
avg(y) - avg(x1) * sum(y_centered * x1_centered) / sum(x1_centered * x1_centered) as const_coef,
sum(y_centered * x1_centered) / sum(x1_centered * x1_centered) as x1_coef
from base
)
select *
from regress
"""
con.execute(univariate_regression_query).df()
```

And our output is this:

Nice, we got a match!

The trickery behind what we will be using to generate multiple regression coefficients is the Frisch-Waugh-Lovell theorem (“FWL”). I have covered this theorem in a previous post here.

Long story short, we will be using the residuals of univariate regressions of x1 on x2 and vice-versa to construct *new* univariate regressions on y. Because we already know how to do univariate regression in SQL from the previous section, doing multiple regression is just a matter of stringing these pieces together.

Visually, constructing a single multiple regression coefficient using FWL looks like this:

Doing that twice gives us both coefficients.

(As you can tell from the animation, “residualizing” the “y” variable is not actually necessary for calculating the coefficients. We only need to “residualize” the regressors.)
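Before jumping to SQL, here is what that recipe looks like in Python (my own sketch, reusing the `df` and imports from the setup block above). This is the sequence of steps the SQL below will mirror:

```
# Residualize each regressor on the other regressor (plus a constant)
x1_resid = sm.OLS(endog=df["x1"], exog=df[["const", "x2"]]).fit().resid
x2_resid = sm.OLS(endog=df["x2"], exog=df[["const", "x1"]]).fit().resid

# Regress centered y on each residual to get the multivariate slope coefficients
y_centered = df["y"] - df["y"].mean()
x1_coef = (y_centered * x1_resid).sum() / (x1_resid ** 2).sum()
x2_coef = (y_centered * x2_resid).sum() / (x2_resid ** 2).sum()

# Walk back the centering to get the constant term
const_coef = df["y"].mean() - x1_coef * df["x1"].mean() - x2_coef * df["x2"].mean()

print(const_coef, x1_coef, x2_coef)
```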

Here is the regression done in Statsmodels:
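The Statsmodels output isn’t reproduced here; following the pattern of the univariate example above, the call that produces it is presumably just:

```
sm.OLS(endog=df["y"], exog=df[["const", "x1", "x2"]]).fit().summary()
```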

Here is the full code:

```
multiple_regression_query = """
with base as (
select
idx,
y,
y - avg(y) over () as y_centered,
x1,
x1 - avg(x1) over () as x1_centered,
x2,
x2 - avg(x2) over () as x2_centered
from fwl_generated_data.csv
),
univariate_regress as (
select
avg(x2) - avg(x1) * sum(x2_centered * x1_centered) / sum(x1_centered * x1_centered) as x1_const_coef,
sum(x2_centered * x1_centered) / sum(x1_centered * x1_centered) as x1_coef,
avg(x1) - avg(x2) * sum(x1_centered * x2_centered) / sum(x2_centered * x2_centered) as x2_const_coef,
sum(x1_centered * x2_centered) / sum(x2_centered * x2_centered) as x2_coef
from base
),
resids as (
select
y,
y_centered,
x1,
x1
- (select x2_coef from univariate_regress) * x2
- (select x2_const_coef from univariate_regress)
as x1_resid,
x2,
x2
- (select x1_coef from univariate_regress) * x1
- (select x1_const_coef from univariate_regress)
as x2_resid
from base
),
multiple_regression as (
select
sum(y_centered * x1_resid) / sum(x1_resid * x1_resid) as x1_coef,
sum(y_centered * x2_resid) / sum(x2_resid * x2_resid) as x2_coef,
avg(y)
- avg(x1) * sum(y_centered * x1_resid) / sum(x1_resid * x1_resid)
- avg(x2) * sum(y_centered * x2_resid) / sum(x2_resid * x2_resid)
as const_coef
from resids
)
select * from multiple_regression
"""
con.execute(multiple_regression_query).df()
```

The output:

Good God, we’ve done it!

But can we go even further?

You could probably tell that this was coming. That “x3” in the DataFrame we created wasn’t for nothing.

I almost didn’t do this out of laziness, but I figured people would get mad at me if I didn’t.

The compromise is that I will do it, but I won’t explain what is going on. Figuring out why this works is left as an exercise for the reader.

Here’s Statsmodels:
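Again, the output isn’t reproduced here, but following the earlier pattern the call is presumably:

```
sm.OLS(endog=df["y"], exog=df[["const", "x1", "x2", "x3"]]).fit().summary()
```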

Here’s the SQL embedded in Python:

```
big_multiple_regression_query = """
with base as (
select
y,
y - avg(y) over () as y_centered,
x1,
x1 - avg(x1) over () as x1_centered,
x2,
x2 - avg(x2) over () as x2_centered,
x3,
x3 - avg(x3) over () as x3_centered
from fwl_generated_data.csv
),
------------------------------------------------------------
regress_1 as (
select
avg(x1) - avg(x2) * sum(x1_centered * x2_centered) / sum(x2_centered * x2_centered) as x1x2_const,
sum(x1_centered * x2_centered) / sum(x2_centered * x2_centered) as x1x2_coef,
avg(x1) - avg(x3) * sum(x1_centered * x3_centered) / sum(x3_centered * x3_centered) as x1x3_const,
sum(x1_centered * x3_centered) / sum(x3_centered * x3_centered) as x1x3_coef,
avg(x2) - avg(x1) * sum(x2_centered * x1_centered) / sum(x1_centered * x1_centered) as x2x1_const,
sum(x2_centered * x1_centered) / sum(x1_centered * x1_centered) as x2x1_coef,
avg(x2) - avg(x3) * sum(x2_centered * x3_centered) / sum(x3_centered * x3_centered) as x2x3_const,
sum(x2_centered * x3_centered) / sum(x3_centered * x3_centered) as x2x3_coef,
avg(x3) - avg(x1) * sum(x3_centered * x1_centered) / sum(x1_centered * x1_centered) as x3x1_const,
sum(x3_centered * x1_centered) / sum(x1_centered * x1_centered) as x3x1_coef,
avg(x3) - avg(x2) * sum(x3_centered * x2_centered) / sum(x2_centered * x2_centered) as x3x2_const,
sum(x3_centered * x2_centered) / sum(x2_centered * x2_centered) as x3x2_coef
from base
),
resids_1 as (
select
y,
y_centered,
x1,
x1_centered,
x2,
x2_centered,
x3,
x3_centered,
x1 - (select x1x2_coef from regress_1) * x2 - (select x1x2_const from regress_1) as x1x2_resid,
x1 - (select x1x3_coef from regress_1) * x3 - (select x1x3_const from regress_1) as x1x3_resid,
x2 - (select x2x1_coef from regress_1) * x1 - (select x2x1_const from regress_1) as x2x1_resid,
x2 - (select x2x3_coef from regress_1) * x3 - (select x2x3_const from regress_1) as x2x3_resid,
x3 - (select x3x1_coef from regress_1) * x1 - (select x3x1_const from regress_1) as x3x1_resid,
x3 - (select x3x2_coef from regress_1) * x2 - (select x3x2_const from regress_1) as x3x2_resid,
from base
),
regress_2 as (
select
sum(x1_centered * x2x3_resid) / sum(x2x3_resid * x2x3_resid) as x1_x2x3_coef,
sum(x1_centered * x3x2_resid) / sum(x3x2_resid * x3x2_resid) as x1_x3x2_coef,
avg(x1)
- avg(x2) * sum(x1_centered * x2x3_resid) / sum(x2x3_resid * x2x3_resid)
- avg(x3) * sum(x1_centered * x3x2_resid) / sum(x3x2_resid * x3x2_resid)
as x1_const,
sum(x2_centered * x1x3_resid) / sum(x1x3_resid * x1x3_resid) as x2_x1x3_coef,
sum(x2_centered * x3x1_resid) / sum(x3x1_resid * x3x1_resid) as x2_x3x1_coef,
avg(x2)
- avg(x1) * sum(x2_centered * x1x3_resid) / sum(x1x3_resid * x1x3_resid)
- avg(x3) * sum(x2_centered * x3x1_resid) / sum(x3x1_resid * x3x1_resid)
as x2_const,
sum(x3_centered * x2x1_resid) / sum(x2x1_resid * x2x1_resid) as x3_x2x1_coef,
sum(x3_centered * x1x2_resid) / sum(x1x2_resid * x1x2_resid) as x3_x1x2_coef,
avg(x3)
- avg(x1) * sum(x3_centered * x1x2_resid) / sum(x1x2_resid * x1x2_resid)
- avg(x2) * sum(x3_centered * x2x1_resid) / sum(x2x1_resid * x2x1_resid)
as x3_const
from resids_1
),
resids_2 as (
select
y,
y_centered,
x1,
x1_centered,
x2,
x2_centered,
x3,
x3_centered,
x1
- (select x1_x2x3_coef from regress_2) * x2
- (select x1_x3x2_coef from regress_2) * x3
- (select x1_const from regress_2)
as x1_resid,
x2
- (select x2_x1x3_coef from regress_2) * x1
- (select x2_x3x1_coef from regress_2) * x3
- (select x2_const from regress_2)
as x2_resid,
x3
- (select x3_x1x2_coef from regress_2) * x1
- (select x3_x2x1_coef from regress_2) * x2
- (select x3_const from regress_2)
as x3_resid,
from base
),
regress_3 as (
select
sum(y_centered * x1_resid) / sum(x1_resid * x1_resid) as x1_coef,
sum(y_centered * x2_resid) / sum(x2_resid * x2_resid) as x2_coef,
sum(y_centered * x3_resid) / sum(x3_resid * x3_resid) as x3_coef,
avg(y)
- avg(x1) * sum(y_centered * x1_resid) / sum(x1_resid * x1_resid)
- avg(x2) * sum(y_centered * x2_resid) / sum(x2_resid * x2_resid)
- avg(x3) * sum(y_centered * x3_resid) / sum(x3_resid * x3_resid)
as const_coef
from resids_2
)
select * from regress_3
"""
con.execute(big_multiple_regression_query).df()
```

Here’s the output of the SQL:

And there you have it! Multiple regression with 3 predictors, done in SQL, using nothing more complex than AVG() and SUM().

Sign up for the 2021 Advent of Code here. Take a stab at these problems with Microsoft Excel or Google Sheets.

Solutions in Google Sheets are provided below.

With my solutions, I am trying to minimize use of esoteric spreadsheet trickery (array formulas, `OFFSET()`, `INDIRECT()`, stuff like that).

I will also explain the answer in depth, which will not be interesting to many readers, but it should be helpful for a few.

*(columns start on row 2)*

- **Column A:** [puzzle inputs]
- **Column B:** *(starting from row 2)* `=iferror(int(A2 - A1 > 0), 0)`
- **Solution:** `=sum(B:B)`

Very simple:

The above way of expressing it is my preferred way. But there are other equivalent approaches, such as (ignoring the `iferror()` part):

- `if(A2 - A1 > 0, 1, 0)`
- `if(sign(A2 - A1) = 1, 1, 0)`
- `int(sign(A2 - A1) = 1)`
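If you want to sanity-check the counting logic outside of a spreadsheet, here is the same idea in a couple of lines of Python (purely illustrative; it is not part of the spreadsheet solution, and the depths are made up):

```
depths = [1, 2, 2, 3, 1, 4]  # made-up example values, not the real puzzle input
increases = sum(int(b - a > 0) for a, b in zip(depths, depths[1:]))
print(increases)  # 3
```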

*(columns start on row 2)*

- **Column A:** [puzzle inputs]
- **Column B:** `=split(A2, " ")`
- **Column C:** [skip me]
- **Column D:** `=if(B2 = "down", 1, if(B2 = "up", -1, 0)) * C2`
- **Column E:** `=if(B2 = "forward", C2, 0)`

Start by splitting into two columns, delimiting by the space:

For depth, you don’t need to add and subtract. You can just add everything up, and multiply by `-1` for rows that you would have subtracted. For rows where you want to add, multiply by `1`:

The horizontal change is much simpler:

And then sum the two columns, then multiply, and we’re all set.

*(columns start on row 2)*

- **Column A:** *[puzzle inputs]*. I prefer to not copy + paste as text, for ol’ times’ sake. So let Google Sheets / Excel format it as a decimal number for you. Only weak-willed folks, such as software developers, should be intimidated by this.
- **Range P1:AA1:** *[hardcode 0 through 11]*
- **Columns B-M:** `=rounddown( mod( $A2, pow(10, P$1+1) ) / pow(10, P$1) )`
- **Range P2:AA2:** `=mode(B:B)`
- **Range P3:AA3:** `=pow(2,P$1)*P2`
- **Range P4:AA4:** `=pow(2,P$1)*(1-P2)`
- **Solution:** `=sum(P3:AA3)*sum(P4:AA4)`

This is what the whole thing looks like:

We are not intimidated by Google Sheets automatically formatting our binary fixed-width number into a decimal:

- To isolate the ones digit, divide by 10 and check the remainder:
  - If it’s `=1`, then the ones digit is 1.
  - If it’s `=0`, then the ones digit is 0.
- To isolate the tens digit, divide by 100, and check the remainder:
  - If it’s `=11`, then the tens digit is 1.
  - If it’s `=10`, then the tens digit is 1.
  - If it’s `=1`, then the tens digit is 0.
  - If it’s `=0`, then the tens digit is 0.

See the pattern? For each digit `n` from 0 to 11, we need to do the following steps to break down the number `x`:

- Calculate the remainder when dividing `x` by `10^(n+1)`: `mod(x, 10^(n+1))`
- Divide the number calculated above by `10^n`.
- Round that number down.

Let’s take G3 as an example. The number `x` is `10100011100`:

`G3` is in the 6th horizontal position, so `n=5` (because our list goes from 0 to 11). We calculate the remainder when dividing by `10^6`, then divide by `10^5`.
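Here is the same arithmetic in Python for that G3 example (purely illustrative; the spreadsheet doesn’t need it):

```
x = 10100011100  # the binary string, read as a base-10 number
n = 5            # G3 is the 6th position from the right

digit = (x % 10 ** (n + 1)) // 10 ** n  # remainder mod 10^(n+1), divided by 10^n, rounded down
print(digit)  # 0
```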

All set? Now let’s calculate the most common digits. The most common digit is easy: `=mode(B:B)`, then drag to the right. An alternate approach is `=round(average(B:B))`, which works for binary digits.

Now it’s time to calculate the “gamma” and “epsilon.” For gamma, take the individual digits and multiply each by `2^n`, then take the sum.

For epsilon, do the same, but instead of using the digit, use `(1-digit)`. This operation flips the 1’s into 0’s, and 0’s into 1’s.

Then sum it all up, multiply, and there’s your answer:

Day 4 is too complicated for me to walk through here. But rest assured, you can solve it with a spreadsheet, with zero manual parsing, zero VBA / Google App Scripts, and zero array formulas.

Unfortunately we do need to do a little trickery with `indirect()` and/or `offset()` to compile the boards, and there are a few `index()` and `match()` operations.

The trick is to parse the input such that each row of the data represents a single possible “bingo”. Then, keep a running count of the numbers that get called per bingo. The row that has the leftmost column with a “5” represents the winner.

In a separate tab that contains the running tally, find the first board with a bingo solution that has a “5” in its running tally. Get that winner’s board number, print the board out, then it should be straightforward to do the remaining operations on it:

Lots of people enjoyed it, but the LinkedIn-adjacent section of data science Twitter is not happy about it.

I want to provide as much context for it as I can here, clarify a few things, and correct myself on a few things, including on some negative stuff I said elsewhere about Prophet.

The Zillow job posting (PDF here) is specifically for the now-defunct “Zillow Offers” team.

In this role, you would “lead Zillow Offers’ forecasting efforts. You’ll use our wealth of internal and external data sources to produce a time series demand forecast, conversion/funnel forecast, and resale forecast.”

The job posting mentions that “day to day we answer questions like: What type of homes should we buy? Which markets should we enter? How many homes will we buy tomorrow / next week / next month / next year? How are we performing relative to the market? What will our unit economics look like next year?”

Now I’m no Zillow data scientist, but I am pretty sure I know the answer to all of those questions.

Prophet is the only Python/R library mentioned in the posting.

Prophet is a library that makes it easy for you to fit a model that decomposes a time series model into trend, season, and holiday components. It’s somewhat customizable and has a few nifty tools like graphing and well-thought out forecasting.
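For context, the basic workflow looks roughly like this (a minimal sketch with toy data; the `ds` and `y` column names are what Prophet expects):

```
import pandas as pd
from prophet import Prophet

# Prophet wants a two-column DataFrame: a datestamp column "ds" and a value column "y"
history = pd.DataFrame({
    "ds": pd.date_range("2020-01-01", periods=365, freq="D"),
    "y": range(365),  # toy data, just to make the example run
})

m = Prophet()
m.fit(history)

future = m.make_future_dataframe(periods=30)  # extend 30 days past the training data
forecast = m.predict(future)                  # yhat plus the trend / seasonal / holiday components
```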

I said a few rude things about Prophet that I partially take back. Here is my latest, nuanced opinion on Prophet after looking into it more for this post: it’s a nifty tool if you know what you’re doing or are doing something unimportant, and it is **possibly very dangerous** if you don’t know what you’re doing and are using it to make an actual decision.

In my original tweet, I said that Prophet is used for “y_t = f(t) time trend / curve fitting forecasts.” If you think that the quoted statement is a synonym for “time series model”– which I suspect a nonzero number of whiners about my original tweet probably believe because I know from experience that not all of you know this nomenclature– then Prophet is definitely potentially dangerous in your hands.

`f(t)` means function of time. But time series processes aren’t only functions of time. Or only of contemporaneous covariates. They’re also functions of lagged values, including lagged values of the variable you’re regressing on. So to say that a time series is `f(t)` is to specifically say that it’s not a function of other things, namely previous data points, such as the previous dependent variable value.

This is an accurate description of what Prophet’s model is. To get a little more in the weeds, Prophet does the following linear decomposition, `y(t) = g(t) + s(t) + h(t) + error`:

- `g(t)`: Logistic or linear growth trend with optional linear splines (linear in the exponent for the logistic growth). The library calls the knots “change points.”
- `s(t)`: Sine and cosine (i.e. Fourier series) terms for seasonality.
- `h(t)`: Gaussian functions (bell curves) for holiday effects (instead of dummies, to make the effect smoother).

You can read about the model here, or alternatively look at the source code.

The change point stuff is cleverly implemented and would be tedious to implement on one’s own. Prophet also does some great out-of-sample forecasting that you’re not going to get from doing `sm.OLS().predict()`. I commend the authors of the library for all of that, and those are great reasons to use the library.

But if you want linear growth, and you can remember the Pandas API for timestamp parsing, and you can `LEFT JOIN` to a holiday table, then the actual model isn’t super hard to do on your own as a linear regression. If you do stuff with tools such as Stan, PyMC3, or the Microsoft Excel solver tool, and you have a pretty important use case, then it might even be better to look at the gobsmackingly beautiful and short source code of its Stan implementation and reimplement it yourself and tailor it for what you need.

So Prophet is very convenient because it’s hard to beat `pip install prophet` into `from prophet import Prophet` into `Prophet().fit(df)`. But it’s not especially complicated or magical, and it’s concerning when people treat it like that. Prophet is designed with ease of use in mind, even for time series novices (as I said more pejoratively, “piss easy as possible for little babies”). It’s insane that this is the only Python library mentioned for a job that pays $200,000 a year, as if using it is some sort of arcane or special skill. And you should be careful using Prophet for critical decision-making if you don’t understand the caveats. But if you understand the risks and caveats, then go ahead and use it; it may make your life a little simpler.

I think Sean Taylor, the creator of the package, would agree with me on the above paragraph.

So the main caveat of Prophet is that it does “curve fitting.” This is fine for problems like web traffic forecasts, and some other crude forms of anomaly detection where the actionable decision is something minor like “send a notification to my Slack channel.” But why’s Prophet so bad for Zillow, exactly?

A process integrated to order 1 (an `I(1)` process) is one where its rate of change is stationary. Brownian motion is a canonical `I(1)` process because its rate of change is Gaussian white noise, which is stationary. But the random walk itself is not stationary. So the `t+1` value of a random walk is just the value at `t` plus a number sampled from some bell curve.
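If the jargon is unfamiliar, here is a toy simulation of the distinction (my own illustration; it has nothing to do with Zillow’s internals):

```
import numpy as np

rng = np.random.default_rng(0)

changes = rng.normal(0, 1, size=1_000)  # stationary Gaussian white noise
walk = changes.cumsum()                 # an I(1) series: each value is the previous value plus noise

# Differencing the walk recovers the stationary changes;
# the level of the walk itself wanders arbitrarily far from where it started.
recovered_changes = np.diff(walk)
```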

Home prices are stochastic and integrated (with little to no predictable seasonality that exceeds the cost of carry) because they are prices in a competitive, highly financialized marketplace. To *not* model prices as dependent on past values (or not simply assume the average intratemporal price is always correct and model intratemporal variance as being mispriced) would be an incredible mistake. I am not saying Zillow failed to do this or that it matters at the end of the day whether they even did this (more on that later), but if you understand this, then you can see why it’s especially awful to see Prophet, a time trend library, in their job description.

I want to focus for a moment on my claim that competitive, financialized marketplaces inescapably lead to stochastic integrated prices.

The main reason why competitive market price movements tend to be stochastic and I(1), and tend to exhibit little highly profitable predictable seasonality, is because if that wasn’t true, then people would be able to make free money and markets would no longer be weak-form efficient.

For example, imagine demand for cashews spikes a ton in January. As long as cashews last in storage, you might be able to make some free money buying up a bunch of cashews, storing them somewhere, and then selling them in January. In fact, instead of selling a package for $3 in the summer and $9 in the winter, you could store the ones you’d sell in the summer, and sell them for $8 in the winter (undercutting the competitors who sell at $9), and make $5 gross. As long as storing them until the winter costs under $5, then this is profitable.

Obviously, shifting quantity toward winter requires actually storing the cashews. Which is to say, people do stuff like this because there’s demand for it. But people who do this are not typically making exorbitant and unusually high profits doing so because they’ll typically compete against each other until exorbitant profit is hard to find. So to the extent that there is predictable seasonality in my toy example, it reflects the cost of storage, or more generally the cost of carry. It’s predictable, but not free money.

When prices exhibit predictable seasonality, it is usually because of something like:

- Cost of carry (e.g. annual seasonality of gas prices)
- Intertemporal price discrimination (e.g. last minute travel bookings)
- Elasticity of intertemporal substitutions interacting with intratemporal capacity constraints (e.g. holiday travel)

The travel sector (especially air travel, but also long distance train and bus travel, as well as other travel services such as hotels) is one of a few major sectors where price movements are both volatile yet seasonally predictable, hence why it makes up two-thirds of the list above. Travel is a very unique industry in that regard. There are many reasons why this is the case, one being that travel needs to be scheduled and what you’re paying for is in part the time slot, not just the service itself in a vacuum, so it makes sense that desirable time slots or shorter time deltas to departure often cost more.

Most consumer goods don’t behave as extremely as travel. But that said, prices for consumer goods often do have predictable seasonal fluctuations and predictable seasonally-adjusted trends, due to the above reasons, but in a less extreme form. For example, retail outlets discount their spring+summer clothes predictably when making space for their fall+winter styles. And otherwise clothing prices in aggregate tend to keep pace with inflation, which is relatively stable. For consumer products, a time trend model works just fine for many contexts.

Now let’s think about stocks, which have very little seasonality except for noise trading. The cost of carry for a stock is the opportunity cost of the capital that goes into it plus the risk you take on for holding it (this is what the idea of risk neutral valuation is based on). But stock ownership is otherwise just some 0’s and 1’s in a computer, so there’s no physical storage cost, so stocks just end up being reasonably well-modeled by a growth trend plus a random walk in the log price, with very little in the way of predictable seasonality.

The housing market does have seasonal fluctuations– specifically, the quarterly rate of change of 4th quarter median home sale prices is higher than in other quarters across all markets:

It’s unclear to me why that’s there or what that means, but I’m sure people make money on that. But I also imagine the money isn’t free, and probably not the kind of money you’re paying a team of 2,000 employees to chase after. If you dig and dig and dig and can find a true systematic mispricing that generates free money, then congratulations, now you’re in business. But that’s very hard to do.

I do not have subject matter expertise in real estate, and you should read others who are more knowledgeable about it than I am. (Here’s the extent to which I explored real estate pricing as I wrote this.) But I would be a little cautious in over-interpreting when realty agents believe there are extremely predictable patterns you can make money off of. The salary for their labor is the cost of doing this sort of discovery, after all. And unless they’re lavishly paid for doing basically no work, then that’s not free excess profit in excess of the cost of the labor. Which is to say it’s possible for both me to be correct about markets in general, and for a rising and grinding realty agent to be correct in the sense that you can get paid a fair wage to basically store the cashews.

So to summarize, the problem with Prophet here is that it doesn’t model prices as a function of previous values, which is a very wrong way to think about how prices work. It’s that simple.

Beats me and anyone else.

What we do know, to the extent anonymous internet people are to be believed, is that “ZO wasn’t NN but ZO was watching zestimates performance to see if they could go down a similar path,” so in other words, Zillow Offers was not powered by the “Zestimate.” (The Zestimate is Zillow’s number that estimates the value of a house.) I hate to be the bearer of this bad news for anyone memeing about the Zestimate.

So we don’t know the internal algorithm that fueled pricing for Zillow Offers. But we do know a bit about the Zestimate. Starting in 2019, Zillow started moving toward a single neural network model for the Zestimates. This model was apparently designed, as far as I can tell, via a Kaggle competition with over 3,000 teams fighting for a $1,000,000 prize. Certainly such a process wouldn’t lead to, say, overfitting on the test set.

We also know that iBuyers’ algorithms writ large are apparently well-approximated by a regression on log home prices based on this independent research, released in December 2020. So in other words, this tweet’s parenthetical is technically incorrect (to the extent that the Zestimate is concerned), but spiritually extremely correct:

But that’s the Zestimate, not Zillow Offers, which we believe is different. Does this all mean that Zillow Offer’s core pricing model used Prophet? Did it not include autoregressive pricing?

I would not extrapolate too heavily from the job posting. Job postings that reference frameworks often mask the amount of workflow heterogeneity within an organization. It’s possibly even the case that the median Zillow Offers data scientist does not use Prophet, or the median data scientist at Zillow may even agree with me on the silliness of that job description. But I do know that the recruiter didn’t come up with “especially Prophet” on their own. So the most confident thing we can say is that the number of Prophet users at Zillow Offers is N ≥ 1, likely more, but the level of organizational saturation is unclear.

Pure speculation here: I imagine Zillow Offers’s core algorithm for hedonic valuation is much more sophisticated than `Prophet().fit(df)`. And it is possibly autoregressive, or uses time fixed effects or first differences to control for within period averages, or at least I hope checks one of those boxes. And it may not use `Prophet().fit(df)` as part of the core pricing / trading algorithm, although it might be used in feature engineering or for forecasting covariates that the model uses.

So I’m not saying the model is just Prophet. But I do believe that mentioning Prophet as the singular skill they value in time series analysis means they probably don’t have as strong feelings about “financialized prices are stochastic I(1) processes” as I do. One thing the supposed ex-Zillow Redditor mentions is that “Zillow has almost zero institutional knowledge in quantitative methods and pretty much no one in Zillow AI had [a background in finance / trading].” The understanding of how prices work comes not from looking at some of your company’s internal data for a day, but from subject matter expertise in economics or finance.

I can’t wait for the informed postmortems to really start rolling out, but right now we’re all just speculating. And it’s extremely fun to speculate.

The most compelling explanation is that they got pwned by adverse selection.

This has less to do with their algorithm being wrong too much on average, and more about the fact that its wrongness can be exploited by more knowledgeable market participants who know a dollar bill lying on the ground when they see it, even if it’s often “correct.” This one-two punch of Twitter threads explains it really well:

For a more technical deep dive, Arpit Gupta has a great thread:

What’s so strange about how Zillow lost money is that they performed *worse than average.* In aggregate since 2019, housing prices have risen. By quite a lot, actually.

And they’ve been losing tons of money while being ostensibly long housing. In other words, they did worse than the boomer wojak guy:

(I wrote that thread before I was fully embracing the adverse selection explanation. The thread’s still fine, I think, just a smaller part of the story.)

In one tweet, I semi-joked that what happened was lower and mid level employees convinced upper management that the algorithm was 99% accurate by hiding the caveats of what “99% accuracy” means. Zillow is a publicly traded company and the fact it was losing money since its inception is on the public record, so I don’t think the C-suite was in complete denial. The COO would sort of answer questions about it. But this doesn’t tell us what sorts of folk stories and contrived KPIs the lower level employees and middle managers were using. I bet they had some fun ones like “average time to flip” and “forecasted sell price accuracy.”

Speaking of middle managers, word on the street is that Zillow Offers put their thumb on the scale of the algorithm to make it engage in more aggressive trades. Manually adjusting an algorithm isn’t necessarily a bad thing, but you need to do it for the right reasons. And clearly that didn’t end up working out.

If all of this is the case, then it’s not clear that the problem was bad price predictions *per se*. Instead, it seems more like Zillow didn’t understand what the risks were when the algorithm was wrong, and how to identify signals that suggest the algorithm was wrong (e.g. high demand for an offer is a sign that an offer is too good). You can predict your way out of that problem and try to reduce the residuals, which is the natural data science thing to do, but that’s *not* a smart way to do risk management. Unfortunately for Zillow’s shareholders, risk management is often going to be a blind spot when you hire a bunch of data scientists without finance sector experience.

I can’t believe that the world’s biggest Microsoft Excel fan has to say this, but I am fine with easy to use tools. I am fine with simple models.

I have two requirements for tools. The first one is that the tool does what I need it to do. The second is whether it’s easy to use.

The reason why Excel is so great as a tool is because it very often satisfies both the first requirement and the second requirement. It’s genuinely amazing the things you can do with Microsoft Excel. I slightly exaggerate on Twitter, but my appreciation of Microsoft Excel is genuine to some degree.

The first requirement almost always takes precedence over the second requirement. It doesn’t matter if your tool is easy to use if it doesn’t solve the problem you need it to. That’s the gist of the issue at hand here. As stated before, I believe a nonzero number of people grandstanding about this discourse to their professional audience may not entirely understand the issue with Prophet in this context. That’s fine, there are tons of skills that the term “data science” encompasses, many of which I am personally a total novice in that others may be experts in. In some cases other people will be total novices with time series analysis and financial markets, and not really grasp how absurd the whole Prophet thing really is. It’s ok if you don’t know this, but you look a little silly if you’re defending it.

Prophet is a simple model. But so is SARIMA. Neither is more “fancy” than the other, if you ask me. One is a better tool for modeling prices in a competitive marketplace. You should require your data scientists to know the one that is needed for the job.

And let’s not overlook the absurdity of strongly preferring a candidate who has experience in a plug and play library built for novices for a job that pays $200,000. How is this not completely ridiculous to anyone with a pulse? $200k a year can attract people who actually know what they’re doing. Maybe a math or econ PhD. Maybe a Microsoft Excel pro with 10 years of finance sector experience. The list goes on.

The requirement that people come to your company knowing how to use piss easy baby tools is an extremely dumb and lazy hiring practice. It is also, unfortunately, a common practice in data science job postings. The aggregate effect of this practice being widespread is that talented people with unusual backgrounds get gatekept out of good paying jobs that they’d be exceptional at. Making fun of the job posting and using Prophet has been compared to gatekeeping. To be clear, the Prophet prerequisite is an actual form of gatekeeping being undertaken by a major company that has actual material impacts on people’s careers. The job post excludes people not based on aptitude, but based on whether they have previous experience and familiarity with a tool they could be introduced to and then master in under 15 minutes. A tweet making fun of the job posting is not gatekeeping. Get over it, LinkedIn clout chasers.

I’ve been posting on the internet for over two decades and I’ve lost some interest in engaging in discourses because it has become repetitive to a degree. So analyzing these repetitions and patterns across discourses is more interesting and fresh to me than the actual individual discourses.

One pattern of discourse that I’ve noticed goes something like this:

- Someone makes a proximate cause claim: “X happened because of Y,” or in its alternate form, “X would not have happened but for Y”.
- There are multiple potential Ys that can explain X.
- People debate about which is the most correctest *sine qua non* among all the possible choices.

Here is a good example of such a “proximate cause” discourse (check the replies and feast on as much discourse as you can stomach):

These discourses tend to be uninteresting and uninformative because they involve everyone talking past each other. The pattern of retort to someone else’s theory of proximate cause is not to say that theirs is an invalid proximate cause, but that it is not the One True Proximate Cause. In other words, the retort to “A is the cause” is not “A did not cause it,” but rather “it was actually B.”

So everyone ends up making perfectly reasonable claims. It’s a choose-your-own-adventure. Discourses where everyone is allowed to be correct are not interesting to me; it is the discourse equivalent of a high school soccer scrimmage match. In my opinion, you will be happier and better off if you learn how to detect these sorts of discourses and refuse to engage in them.

Although these sorts of discourses are boring to engage with directly, they are fascinating from a metadiscourse perspective. *Why* would someone hone in on some proximate cause over another proximate cause? Obviously, people will pick whatever most suits their worldview, but isn’t that just begging the question?

As far as I can tell, the reasoning behind any given One True Proximate Cause comes down to theories of agency: i.e. people’s beliefs concerning which parties involved had the ability to correct course or to foresee how their actions would cause something undesirable, and which parties were bound by circumstance or reasonableness to act in a certain way.

The people with whom we align ourselves end up being a part of the latter group– constrained by circumstance, acting reasonably– whereas the out-group ends up being the group that had agency all along, but who failed to act correctly so as to not proximately cause the bad event.

Once you see how many political discourses on the internet are defined solely by theories of which groups do and don’t have agency, it’s hard to unsee it. This is not to say that all theories of agency are created equally– some theories of agency are much easier to sympathize with than others. My point is only to say that theories of agency are the real discussions that people are having when discussing proximate causes of multi-causal events, and these discourses would be much better off if people cut to the chase on this point.

**Another Update:** I think some of the explanations on this page may be helped with more colors. I have some updated visuals here that include colors.

The Frisch-Waugh-Lovell theorem states that within a multivariate regression of **y** on **x1** and **x2**, the coefficient for **x2** will be the exact same as if you had instead run a univariate regression on the residuals of **y** and **x2** after regressing each one on **x1** separately.

The point of this post is not to explain the FWL theorem in linear algebraic detail, or explain why it’s useful (basically, it’s a fundamental intuition about what multivariate regression does and what it means to “partial” out the effects of two regressors). If you want to learn more about that, there’s some great stuff already on Google.

The point of this post is to simply provide an animation of this theorem. I find that the explanations of this theorem are often couched in lots of linear algebra, and it may be hard for some people to understand what’s going on exactly. I hope this animation can help with that.

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
np.random.seed(42069)
df = pd.DataFrame({'x1': np.random.uniform(0, 10, size=50)})
df['x2'] = 4.9 + df['x1'] * 0.983 + 2.104 * np.random.normal(0, 1.35, size=50)
df['y'] = 8.643 - 2.34 * df['x1'] + 3.35 * df['x2'] + np.random.normal(0, 1.65, size=50)
df['const'] = 1
model = sm.OLS(
endog=df['y'],
exog=df[['const', 'x1', 'x2']]
).fit()
model.summary()
```

The output of the above:

OLS Regression Results

| Dep. Variable: | y | R-squared: | 0.977 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.976 |
| Method: | Least Squares | F-statistic: | 997.5 |
| Date: | Sat, 26 Dec 2020 | Prob (F-statistic): | 3.22e-39 |
| Time: | 17:11:39 | Log-Likelihood: | -95.281 |
| No. Observations: | 50 | AIC: | 196.6 |
| Df Residuals: | 47 | BIC: | 202.3 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 9.4673 | 0.546 | 17.337 | 0.000 | 8.369 | 10.566 |
| x1 | -2.2003 | 0.128 | -17.213 | 0.000 | -2.458 | -1.943 |
| x2 | 3.1931 | 0.081 | 39.647 | 0.000 | 3.031 | 3.355 |

| Omnibus: | 0.120 | Durbin-Watson: | 1.914 |
|---|---|---|---|
| Prob(Omnibus): | 0.942 | Jarque-Bera (JB): | 0.279 |
| Skew: | -0.095 | Prob(JB): | 0.870 |
| Kurtosis: | 2.687 | Cond. No. | 27.3 |

Here is what would happen if we actually ran a univariate regression on the residuals after factoring out **x1**.

*(The animation takes a few seconds, so you might need to wait for it to restart to get the full effect.)*

Notice that the slope in the final block ends up equaling **3.1931**, which is the coefficient for **x2** in the multivariate regression.

Getting the coefficient for **x1** is more interesting; one thing that happens in the multivariate regression is the **x1** coefficient is *negative* despite the fact that **x1** is positively correlated with **y**. What gives? Well, the following animation helps to show where that comes from:

You can mostly see here what’s happening: After we take out the effect of **x2** on **y**, what we’re left over with is a negative relationship between **x1** and **y**. Put another way: there is a negative correlation between **x1** and the residuals from the regression of **y** on **x2**.
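If you would rather verify the theorem numerically than squint at the animation, here is a minimal sketch that reuses the `df` and imports from the code block above:

```
# Residualize y and x2 on x1 (plus the constant), then regress residual on residual
y_resid = sm.OLS(endog=df['y'], exog=df[['const', 'x1']]).fit().resid
x2_resid = sm.OLS(endog=df['x2'], exog=df[['const', 'x1']]).fit().resid

fwl_coef = sm.OLS(endog=y_resid, exog=x2_resid).fit().params.iloc[0]
print(fwl_coef)  # ~3.1931, the same as the x2 coefficient in the multivariate regression
```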

Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning.

Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning.

Bitcoin Machine Learning.

Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning.

Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning.

Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning.

Bitcoin Machine Learning. Bitcoin Machine Learning. Bitcoin Machine Learning.

I recently deleted a Twitter thread discussing my interview strategy, partly because I agreed with the critics and changed my mind on a few questions I was asking,* and partly because I do not think Twitter is a good place for more nuanced takes.

The first major constraint of an interview from my perspective is this: we have **M** job openings and **N**>**M** candidates. I can pass or fail candidates in my round of the interview. The other major constraint is that I only have **X** minutes to interview you. I cannot spend the next 10 hours digging into everything about you and get a nuanced view of you as a person. I need to make a decision based on a relatively short interaction.

If I pass everybody or fail everybody then I am effectively relegating the decision-making process to co-interviewers since we ultimately have to pick **M** out of the **N** candidates. If I won’t do it, they will. So, should I relegate the decision-making to those co-interviewers? I’m personally worried about doing that. One thing I’ve noticed in reading evaluations is that a lot of interviewers rely on things that are… odd to me. A lot of the criteria interviewers rely on are things like “well I don’t think they seemed interested in the job” or “they seemed a little nervous.”

…Not interested in the job? If they’re this far into it they’re probably interested! …Nervous? If they’re interested in the job then I imagine they’d be! And hey wait, didn’t they complain before that they’re not interested? Plus some people don’t like interviews but they can work perfectly fine.

If I pass or fail everyone, then I’m relegating decision-making to the people who evaluate candidates based on that. And I don’t want to do that because the above criteria are ripe for bias laundering. Those criteria seem to filter less for job competency and more for people who are “good at interviewing,” which may correlate with class and upbringing, which in turn correlate with race and gender.

This is not to say I think my coworkers who I interview with are dumb people for evaluating candidates this way, I just think that interviewing is a tiny part of their job and they haven’t thought about interviewing as much as they think about other job related problems. If you want to think about this problem more yourself, I recommend reading “How to Take the Bias Out of Interviews” for more. The problem is basically that the way people like to interview candidates is in an unstructured format, but unstructured interviews have zero predictive ability because they often come down to things like happenstance of where the conversation veered, or how often they smiled, or how many boilerplate questions they asked about “How’s the workplace culture at ___?”

So ultimately I need to filter candidates out, and I want to avoid bias laundering. The way I do this is to ask questions that have clearly correct and incorrect answers, and evaluate candidates based on their answers to those questions.

People don’t like being evaluated based on questions that have correct and incorrect answers because the questions being asked can feel arbitrary: what if they know a lot and are really smart, but the particular question I asked is a personal shortcoming of the prospective candidate?

The reason I rely on questions where answers can be wrong is because I think a lot of the more discretionary parts of the interview process are actually bad.

Obviously, everyone thinks they personally would make the correct discretionary decision. But if two people would tend to make two different decisions given the same circumstances, then only one of them can be right. And obviously nobody thinks of themselves as “biased” in some superficial way. But there are studies that provide strong evidence that these biases exist, so at least some of these self-proclaimed unbiased people must be wrong.

I don’t just ask “correct/incorrect” questions. I also ask case study questions that include parts with possibly 1 million valid answers. I’ve also interviewed with companies who rely mostly or exclusively on such case study questions.

My verdict is that I actually don’t like my case study question much, as I’ve never filtered out a candidate based on that. “But surely, some answers to the case study are better than others,” you retort. Yes, I do agree some answers seem better than others! But how exactly do I compare answers across candidates? Obviously I am biased toward the subset of answers that I would have personally chosen– these are the ones I think are “better.” But then if I pick candidates that way, I’m just hiring people who think like me. And I want candidates who, to some extent, possibly *don’t* think like me, since diversity of thought is probably good to have.

When going by the case study alone, and applying the reasonable constraint that answers that differ from my preferences are fine (*if not preferable* to my personal answer by virtue of it providing diversity of thought), then I’d want to hire all of the candidates that I have interviewed with so far! None of the answers I’ve gotten are genuinely bad. But, this just leads us back to the problem outlined above: if I don’t make a decision, then someone else at my org will, and their decision might be bad.

I think people like both asking and answering case study questions because they believe that the unknowable process through which interviewers make decisions regarding these case study answers must be able to pick up on more nuance than the more knowable algorithm. I understand why people think that. I just think that these nuances that human discretion picks up on are often noise or (worse) bias– both forms of false wisdom.

The obscurity and multidimensionality of the discretionary decision-making process is, in this case, not a detriment but a *benefit* to its perceived value. You can nitpick whether or not the questions I’m asking are good or bad questions if my interview is based off questions that have the potential for incorrect answers. And yes, some of them might be bad questions. But it’s not clear how you criticize my processes if it’s entirely discretionary and obscure. If my hiring and firing decisions come down to “I did not believe they provided thoughtful answers” or “Alice’s answers were fine, but Bob’s were better and more creative,” how can you criticize that?

I’ve written a few times before about how one trade-off of using machine learning is that ML techniques obfuscate the interpretability of a model’s parameters. All ML algorithms suffer this to at least some degree (LASSO), and some algorithms suffer this more than others (RNNs). I argue that interpretability is a good thing, and obfuscating interpretability is a genuine trade-off that people should think more about.

However, if one accepts that higher powers have a great unknowable wisdom, interpretability becomes a *bad* thing. This conclusion is in stark contrast to the idea I’ve presented before in discussing ML, which is that knowing how things work is good actually. Knowing is actually bad if you believe in the false wisdom of a higher power. AI’s most uncritical proponents seem to believe the fact that ML models cannot be interpreted must mean they are doing things so advanced that mere mortals cannot comprehend them. The ML model is the higher power; the wisdom is the model’s outputs.

False wisdom of a higher power is not just a problem that stems from reliance on human discretion or machine learning. Another example of false wisdom of a higher power is Austrian economics and *laissez-faire*. The normative thrust of Austrian economics is that the market is calculating subjective human preferences in a way that ascends the understanding of mere mortals, therefore it is either “correct,” or at the very least will be more correct or more objective than a human trying to make decisions about how to allocate things. The market is the higher power; the wisdom is its allocation of resources.

Unknowable opaque processes can only be evaluated based on their outcomes. Transparent processes can be evaluated based on both their outcomes and their internal workings. Comparing a transparent process to an opaque one by criticizing the transparent process’s internal logic is a double standard.

Let’s say for example I have a process for hiring that was super transparent and also has no correlation with future performance. I show people the process’s internal logic: turns out, I’m just randomly sorting the list of **N** candidates and picking the first **M** candidates in that list.

Now let’s say I implement another process for choosing candidates. It’s just me making decisions based on my gut. Turns out, my gut is also no better than average and my gut choices have no correlation with future performance either.

The only way you can fairly compare these two processes is by their outcomes because the internal logic of the gut decision-making is not exposed. In that comparison, both processes are equally bad. Despite that, if I implement the latter process nobody at my company would bat an eye, and if I implemented the former process I’d get fired on the spot for being not serious.

It’s probably the case that I have in the past asked bad interview questions that have weak signals, only slightly better at best than flipping a coin. I’m working on making my questions have stronger signals. But is the signal any stronger or weaker than one’s unadulterated discretion, which is the preferred way of making hiring decisions for most interviewers? And is human discretion, on average, better than flipping a coin?

** The question I was asking related to knowledge of what I considered to be a basic part of the Python API, which I asked in order to compare the answer to people’s stated years of Python experience and get a signal on whether or not the candidate was exaggerating their Python experience. The intent here was that many candidates are slightly dishonest about their stated experiences and I did not want to penalize people who are more honest, but I did not approach it in an ideal way. I will not be asking the question from now on. Unfortunately though, without asking a few Python trivia questions, I may need to scrap the part of the interview where I try to assess Python experience in relation to one’s résumé in general, as this is hard to do without a barrage of trivia questions related to Python APIs or general OOP knowledge.*