Zillow, Prophet, Time Series, & Prices

So I made a mildly controversial tweet.

i cannot get over that the zillow data scientist job posting "strongly prefers" you have experience with a python library that is designed to make it as piss easy as possible for little babies to do y_t = f(t) time trend / curve fitting forecasts. pic.twitter.com/YTUcgascCi
— Senior Data Masseuse (@ryxcommar) November 3, 2021

Lots of people enjoyed it, but the LinkedIn-adjacent section of data science Twitter is not happy about it.

I want to provide as much context for it as I can here, clarify a few things, and correct myself on a few things, including on some negative stuff I said elsewhere about Prophet.

What was the Zillow job posting?

The Zillow job posting (PDF here) is specifically for the now-defunct “Zillow Offers” team.

In this role, you would “lead Zillow Offers’ forecasting efforts. You’ll use our wealth of internal and external data sources to produce a time series demand forecast, conversion/funnel forecast, and resale forecast.”

The job posting mentions that “day to day we answer questions like: What type of homes should we buy? Which markets should we enter? How many homes will we buy tomorrow / next week / next month / next year? How are we performing relative to the market? What will our unit economics look like next year?”

Now I’m no Zillow data scientist, but I am pretty sure I know the answer to all of those questions.

Prophet is the only Python/R library mentioned in the posting.

What is Prophet?

Prophet is a library that makes it easy for you to fit a model that decomposes a time series into trend, seasonal, and holiday components. It’s somewhat customizable and has a few nifty tools like graphing and well-thought-out forecasting.

I said a few rude things about Prophet that I partially take back. Here is my latest, nuanced opinion on Prophet after looking into it more for this post: it’s a nifty tool if you know what you’re doing or are doing something unimportant, and it is possibly very dangerous if you don’t know what you’re doing and are using it to make an actual decision.

In my original tweet, I said that Prophet is used for “y_t = f(t) time trend / curve fitting forecasts.” If you think that the quoted statement is a synonym for “time series model”– which I suspect a nonzero number of whiners about my original tweet probably believe because I know from experience that not all of you know this nomenclature– then Prophet is definitely potentially dangerous in your hands.

f(t) means function of time. But time series processes aren’t only functions of time. Or only of contemporaneous covariates. They’re also functions of lagged values, including lagged values of the variable you’re regressing on. So to say that a time series is f(t) is to specifically say that it’s not a function of other things, namely previous data points, such as the previous dependent variable value.

This is an accurate description of what Prophet’s model is. To get a little more in the weeds, Prophet does the following linear decomposition:

g(t): Logistic or linear growth trend with optional linear splines (linear in the exponent for the logistic growth). The library calls the knots “change points.”
s(t): Sine and cosine (i.e. Fourier series) for seasonal terms.
h(t): Gaussian functions (bell curves) for holiday effects (instead of dummies, to make the effect smoother).
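
Written out as a single equation, the three pieces above add up as follows. Here y(t) is the observed series and the last term is an error term; that notation follows the Prophet paper rather than anything in the list above.

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```

Every term on the right-hand side is a function of t alone. No y(t-1) appears anywhere.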

You can read about the model here, or alternatively look at the source code.

The change point stuff is cleverly implemented and would be tedious to implement on one’s own. Prophet also does some great out-of-sample forecasting that you’re not going to get from doing sm.OLS().predict(). I commend the authors of the library for all of that, and those are great reasons to use the library.

But if you want linear growth, and you can remember the Pandas API for timestamp parsing, and you can LEFT JOIN to a holiday table, then the actual model isn’t super hard to do on your own as a linear regression. If you do stuff with tools such as Stan, PyMC3, or the Microsoft Excel solver tool, and you have a pretty important use case, then it might even be better to look at the gobsmackingly beautiful and short source code of its Stan implementation and reimplement it yourself and tailor it for what you need.
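
To make the "do it yourself as a linear regression" point concrete, here is a minimal sketch using only NumPy. It is not Prophet's implementation: it assumes linear growth with no change points, a plain holiday dummy, and a made-up synthetic daily series. The function name `design_matrix`, the fake holiday rule, and all the coefficients are hypothetical and exist only for illustration.

```python
import numpy as np

def design_matrix(t_days, is_holiday, period=365.25, order=2):
    """Columns: intercept, linear trend, Fourier seasonal terms, holiday dummy."""
    t = np.asarray(t_days, dtype=float)
    cols = [np.ones_like(t), t]
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t / period
        cols.append(np.sin(angle))
        cols.append(np.cos(angle))
    cols.append(np.asarray(is_holiday, dtype=float))
    return np.column_stack(cols)

# Synthetic three-year daily series (invented numbers, not real data)
rng = np.random.default_rng(0)
t = np.arange(3 * 365)
is_holiday = (t % 365 == 358)  # pretend there is one holiday per year
y = (
    10.0
    + 0.02 * t                                # linear growth
    + 3.0 * np.sin(2 * np.pi * t / 365.25)    # yearly seasonality
    + 5.0 * is_holiday                        # holiday bump
    + rng.normal(0, 0.5, t.size)              # noise
)

# Ordinary least squares on the y_t = f(t) design matrix
X = design_matrix(t, is_holiday)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

trend_slope = beta[1]
first_sine_coef = beta[2]
holiday_effect = beta[-1]
```

With real data you would build `is_holiday` from a join against a holiday table instead of the modulo trick used here. Note that nothing in `X` depends on earlier values of `y`.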

So Prophet is very convenient because it’s hard to beat pip install prophet into from prophet import Prophet into Prophet().fit(df). But it’s not especially complicated or magical, and it’s concerning when people treat it like that. Prophet is designed with ease of use in mind, even for time series novices (as I said more pejoratively, “piss easy as possible for little babies”). It’s insane that this is the only Python library mentioned for a job that pays $200,000 a year, as if using it is some sort of arcane or special skill. And you should be careful using Prophet for critical decision-making if you don’t understand the caveats. But if you understand the risks and caveats, then go ahead and use it; it may make your life a little simpler.

I think Sean Taylor, the creator of the package, would agree with me on the above paragraph.

So the main caveat of Prophet is that it does “curve fitting.” This is fine for problems like web traffic forecasts, and some other crude forms of anomaly detection where the actionable decision is something minor like “send a notification to my Slack channel.” But why’s Prophet so bad for Zillow, exactly?

Prices In Competitive Financialized Markets Are Stochastic I(1) Processes

A process integrated of order 1 (an I(1) process) is one whose first difference is stationary. A random walk (the discrete-time analogue of Brownian motion) is the canonical I(1) process because its changes are Gaussian white noise, which is stationary. But the random walk itself is not stationary. So the t+1 value of a random walk is just the value at t plus a number sampled from some bell curve.
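
A small simulation can make the stationary versus I(1) distinction tangible. This is only a sketch under simple assumptions (unit-variance Gaussian steps, no drift); the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(42)
n_paths, n_steps = 5000, 400

# Gaussian white noise increments: stationary by construction
steps = rng.normal(loc=0.0, scale=1.0, size=(n_paths, n_steps))

# Random walk: each value is the previous value plus a fresh draw
walks = np.cumsum(steps, axis=1)

# Across paths, the variance of the level keeps growing with t ...
var_level_early = walks[:, 99].var()    # after 100 steps
var_level_late = walks[:, 399].var()    # after 400 steps

# ... while the variance of the first difference stays where it started
diffs = np.diff(walks, axis=1)
var_diff_early = diffs[:, 98].var()
var_diff_late = diffs[:, 398].var()
```

The level has no fixed variance to settle into, which is what "not stationary" means here. Differencing once recovers the white noise, which is why the process is called integrated of order 1.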

Home prices are stochastic and integrated (with little to no predictable seasonality that exceeds the cost of carry) because they are prices in a competitive, highly financialized marketplace. To not model prices as dependent on past values (or not simply assume the average intratemporal price is always correct and model intratemporal variance as being mispriced) would be an incredible mistake. I am not saying Zillow failed to do this or that it matters at the end of the day whether they even did this (more on that later), but if you understand this, then you can see why it’s especially awful to see Prophet, a time trend library, in their job description.

I want to focus for a moment on my claim that competitive, financialized marketplaces inescapably lead to stochastic integrated prices.

The main reason why competitive market price movements tend to be stochastic and I(1), and tend to exhibit little highly profitable predictable seasonality, is that if that weren’t true, then people would be able to make free money and markets would no longer be weak-form efficient.

For example, imagine demand for cashews spikes a ton in January. As long as cashews last in storage, you might be able to make some free money buying up a bunch of cashews, storing them somewhere, and then selling them in January. In fact, instead of selling a package for $3 in the summer and $9 in the winter, you could store the ones you’d sell in the summer, and sell them for $8 in the winter (undercutting the competitors who sell at $9), and make $5 gross. As long as storing them until the winter costs under $5, then this is profitable.
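
The arithmetic of the cashew example, as a tiny sketch. The function name and the two storage costs ($2 and $6) are hypothetical; only the $3 summer price and $8 winter price come from the example above.

```python
def storage_arbitrage_profit(summer_price, winter_price, storage_cost):
    """Net gain per package from holding summer stock and selling it in winter instead."""
    gross = winter_price - summer_price
    return gross - storage_cost

gross_gain = storage_arbitrage_profit(3.0, 8.0, 0.0)     # the $5 gross from the example
net_if_cheap = storage_arbitrage_profit(3.0, 8.0, 2.0)   # hypothetical $2 storage cost
net_if_costly = storage_arbitrage_profit(3.0, 8.0, 6.0)  # hypothetical $6 storage cost
```

The trade only pays when storage costs less than the $5 seasonal price gap.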

Obviously, shifting quantity toward winter requires actually storing the cashews. Which is to say, people do stuff like this because there’s demand for it. But people who do this are not typically making exorbitant and unusually high profits doing so because they’ll typically compete against each other until exorbitant profit is hard to find. So to the extent that there is predictable seasonality in my toy example, it reflects the cost of storage, or more generally the cost of carry. It’s predictable, but not free money.

When prices exhibit predictable seasonality, it is usually because of something like:

Cost of carry (e.g. annual seasonality of gas prices)
Intertemporal price discrimination (e.g. last minute travel bookings)
Elasticity of intertemporal substitutions interacting with intratemporal capacity constraints (e.g. holiday travel)

The travel sector (especially air travel, but also long distance train and bus travel, as well as other travel services such as hotels) is one of a few major sectors where price movements are both volatile yet seasonally predictable, hence why it makes up two-thirds of the list above. Travel is a very unique industry in that regard. There are many reasons why this is the case, one being that travel needs to be scheduled and what you’re paying for is in part the time slot, not just the service itself in a vacuum, so it makes sense that desirable time slots or shorter time deltas to departure often cost more.

Most consumer goods don’t behave as extremely as travel. But that said, prices for consumer goods often do have predictable seasonal fluctuations and predictable seasonally-adjusted trends, due to the above reasons, but in a less extreme form. For example, retail outlets discount their spring+summer clothes predictably when making space for their fall+winter styles. And otherwise clothing prices in aggregate tend to keep pace with inflation, which is relatively stable. For consumer products, a time trend model works just fine for many contexts.

Now let’s think about stocks, which have very little seasonality except for noise trading. The cost of carry for a stock is the opportunity cost of the capital that goes into it plus the risk you take on for holding it (this is what the idea of risk neutral valuation is based on). But stock ownership is otherwise just some 0’s and 1’s in a computer, so there’s no physical storage cost, so stocks just end up being reasonably well-modeled by a growth trend plus a random walk in the log price, with very little in the way of predictable seasonality.

The housing market does have seasonal fluctuations– specifically, the quarterly rate of change of 4th quarter median home sale prices is higher than in other quarters across all markets.

It’s unclear to me why that’s there or what that means, but I’m sure people make money on that. But I also imagine the money isn’t free, and probably not the kind of money you’re paying a team of 2,000 employees to chase after. If you dig and dig and dig and can find a true systematic mispricing that generates free money, then congratulations, now you’re in business. But that’s very hard to do.

I do not have subject matter expertise in real estate, and you should read others who are more knowledgeable about it than I am. (Here’s the extent to which I explored real estate pricing as I wrote this.) But I would be a little cautious about over-interpreting it when realty agents believe there are extremely predictable patterns you can make money off of. The salary for their labor is the cost of doing this sort of discovery, after all. And unless they’re lavishly paid for doing basically no work, then that’s not free profit in excess of the cost of the labor. Which is to say it’s possible for both me to be correct about markets in general, and for a rising and grinding realty agent to be correct in the sense that you can get paid a fair wage to basically store the cashews.

So to summarize, the problem with Prophet here is that it doesn’t model prices as a function of previous values, which is a very wrong way to think about how prices work. It’s that simple.
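
Here is a rough sketch of what goes wrong, under the assumption that the "price" is a pure random walk with unit-variance steps. It compares a fitted time trend (a stand-in for any y_t = f(t) model, not Prophet itself) against simply carrying the last observed value forward. The sample sizes and horizon are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(7)
n_sims, n_train, horizon = 500, 200, 20
t_train = np.arange(n_train)
t_target = n_train - 1 + horizon  # time index of the point being forecast

sq_err_trend = np.empty(n_sims)
sq_err_naive = np.empty(n_sims)

for i in range(n_sims):
    # Simulated "price": a random walk with no trend and no seasonality
    path = np.cumsum(rng.normal(size=n_train + horizon))
    train, actual = path[:n_train], path[-1]

    # y_t = f(t): fit a straight line in time and extrapolate it
    slope, intercept = np.polyfit(t_train, train, deg=1)
    trend_forecast = intercept + slope * t_target

    # Function of the previous value: carry the last observation forward
    naive_forecast = train[-1]

    sq_err_trend[i] = (actual - trend_forecast) ** 2
    sq_err_naive[i] = (actual - naive_forecast) ** 2

mse_trend = sq_err_trend.mean()
mse_naive = sq_err_naive.mean()
```

On a random walk, the trend line happily fits a slope to what was just accumulated noise, then extrapolates it from a level the series may have already left. The last-value forecast starts from where the series actually is, so its mean squared error comes out lower.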

What was Zillow’s algorithm?

Beats me and anyone else.

What we do know, to the extent anonymous internet people are to be believed, is that “ZO wasn’t NN but ZO was watching zestimates performance to see if they could go down a similar path,” so in other words, Zillow Offers was not powered by the “Zestimate.” (The Zestimate is Zillow’s number that estimates the value of a house.) I hate to be the bearer of this bad news for anyone memeing about the Zestimate.

So we don’t know the internal algorithm that fueled pricing for Zillow Offers. But we do know a bit about the Zestimate. Starting in 2019, Zillow started moving toward a single neural network model for the Zestimates. This model was apparently designed, as far as I can tell, via a Kaggle competition with over 3,000 teams fighting for a $1,000,000 prize. Certainly such a process wouldn’t lead to, say, overfitting on the test set.

We also know that iBuyers’ algorithms writ large are apparently well-approximated by a regression on log home prices based on this independent research, released in December 2020. So in other words, this tweet’s parenthetical is technically incorrect (to the extent that the Zestimate is concerned), but spiritually extremely correct:

oh, you are shocked that your "AI" (hedonic regression on number of bedrooms and age of house etc) is less accurate than the local that knows the neighborhood and can walk the property and compare it in detail to comps? yeah
— Dr. Benn "DJ D-Vol" Eifert (@bennpeifert) November 3, 2021

But that’s the Zestimate, not Zillow Offers, which we believe is different. Does this all mean that Zillow Offers’ core pricing model used Prophet? Did it not include autoregressive pricing?

I would not extrapolate too heavily from the job posting. Job postings that reference frameworks often mask the amount of workflow heterogeneity within an organization. It’s possibly even the case that the median Zillow Offers data scientist does not use Prophet, or the median data scientist at Zillow may even agree with me on the silliness of that job description. But I do know that the recruiter didn’t come up with “especially Prophet” on their own. So the most confident thing we can say is that the number of Prophet users at Zillow Offers is N ≥ 1, likely more, but the level of organizational saturation is unclear.

Pure speculation here: I imagine Zillow Offers’ core algorithm for hedonic valuation is much more sophisticated than Prophet().fit(df). And it is possibly autoregressive, or uses time fixed effects or first differences to control for within-period averages, or at least I hope it checks one of those boxes. And it may not use Prophet().fit(df) as part of the core pricing / trading algorithm, although it might be used in feature engineering or for forecasting covariates that the model uses.

So I’m not saying the model is just Prophet. But I do believe that mentioning Prophet as the singular skill they value in time series analysis means they probably don’t have as strong feelings about “financialized prices are stochastic I(1) processes” as I do. One thing the supposed ex-Zillow Redditor mentions is that “Zillow has almost zero institutional knowledge in quantitative methods and pretty much no one in Zillow AI had [a background in finance / trading].” The understanding of how prices work comes not from looking at some of your company’s internal data for a day, but from subject matter expertise in economics or finance.

So what happened at Zillow?

I can’t wait for the informed postmortems to really start rolling out, but right now we’re all just speculating. And it’s extremely fun to speculate.

The most compelling explanation is that they got pwned by adverse selection.

This has less to do with their algorithm being wrong too much on average, and more about the fact that its wrongness can be exploited by more knowledgeable market participants who know a dollar bill lying on the ground when they see it, even if it’s often “correct.” This one-two punch of Twitter threads explains it really well:

Cute fact about the Zillow thing is that even if Zillow's model was more accurate than local agent valuations, local agents/property owners can still win on average because they get to choose which houses to buy/sell.
— macrocephalopod (@macrocephalopod) November 3, 2021

1/ Zillow made the same mistake that every new quant trader makes early on: Mistaking an adversarial environment for a random one. https://t.co/d18TES7AmO
— Doug Colkitt (@0xdoug) November 3, 2021
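
The point in those two threads can be sketched with a toy simulation. Everything here is an assumption for illustration: the price distribution, the noise levels, and the rule that the seller accepts whenever the offer beats their own estimate. It is not a model of what Zillow actually did.

```python
import numpy as np

rng = np.random.default_rng(123)
n_homes = 200_000

true_value = rng.normal(300_000, 50_000, n_homes)

# Assumption: the buyer's model is MORE accurate than the local seller's guess
buyer_estimate = true_value + rng.normal(0, 10_000, n_homes)
seller_estimate = true_value + rng.normal(0, 20_000, n_homes)

# The buyer offers its estimate on every home; the seller chooses which to accept
offer = buyer_estimate
accepted = offer > seller_estimate

# What the buyer makes if it later resells at the true value
profit = true_value - offer
avg_profit_all_offers = profit.mean()
avg_profit_accepted = profit[accepted].mean()

buyer_abs_error = np.abs(buyer_estimate - true_value).mean()
seller_abs_error = np.abs(seller_estimate - true_value).mean()
```

Across all offers the buyer's errors wash out to roughly zero, and the buyer is the more accurate party. But the deals that actually close are disproportionately the ones where the buyer overpaid, so the average profit on accepted offers is negative. The environment is adversarial, not random.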

For a more technical deep dive, Arpit Gupta has a great thread:

Okay let’s figure out why Zillow Offers failed.

Buchak Matvos Piskorski Seru quantify the frictions (especially adverse selection) that limit dealer intermediation in real estate, Ie iBuyers.https://t.co/CM2si2F1Tl pic.twitter.com/FVfK7U9NKO
— Arpit Gupta (@arpitrage) November 5, 2021

What’s so strange about how Zillow lost money is that they performed worse than average. In aggregate since 2019, housing prices have risen. By quite a lot, actually.

And they’ve been losing tons of money while being ostensibly long housing. In other words, they did worse than the boomer wojak guy:

So the housing market's a good spot to park money, but my point is that it didn't seem like Zillow's supposed big data advantage would have any notable advantage over this guy parking his money into a randomly chosen 3rd house. pic.twitter.com/yHVwEUCIru
— Senior Data Masseuse (@ryxcommar) November 3, 2021

(I wrote that thread before I was fully embracing the adverse selection explanation. The thread’s still fine, I think, just a smaller part of the story.)

In one tweet, I semi-joked that what happened was lower and mid level employees convinced upper management that the algorithm was 99% accurate by hiding the caveats of what “99% accuracy” means. Zillow is a publicly traded company and the fact that it was losing money since its inception is on the public record, so I don’t think the C-suite was in complete denial. The COO would sort of answer questions about it. But this doesn’t tell us what sorts of folk stories and contrived KPIs the lower level employees and middle managers were using. I bet they had some fun ones like “average time to flip” and “forecasted sell price accuracy.”

Speaking of middle managers, word on the street is that Zillow Offers put their thumb on the scale of the algorithm to make it engage in more aggressive trades. Manually adjusting an algorithm isn’t necessarily a bad thing, but you need to do it for the right reasons. And clearly that didn’t end up working out.

If all of this is the case, then it’s not clear that the problem was bad price predictions per se. Instead, it seems more like Zillow didn’t understand what the risks were when the algorithm was wrong, and how to identify signals that suggest the algorithm was wrong (e.g. high demand for an offer is a sign that an offer is too good). You can predict your way out of that problem and try to reduce the residuals, which is the natural data science thing to do, but that’s not a smart way to do risk management. Unfortunately for Zillow’s shareholders, risk management is often going to be a blind spot when you hire a bunch of data scientists without finance sector experience.

the epistemology of machine learning doesn't usually allow for this type of thinking. Residuals are seen as a failure to predict rather than as a fact of nature. Residuals exist only to be reduced, not to be embraced and utilized.
— Senior Data Masseuse (@ryxcommar) November 4, 2021

I Have Nothing Against Easy Tools

Give me a break. This person’s pinned tweet portrays Microsoft Excel as some sort of evil menace, by the way.

I can’t believe that the world’s biggest Microsoft Excel fan has to say this, but I am fine with easy to use tools. I am fine with simple models.

I have two requirements for tools. The first one is that the tool does what I need it to do. The second is that it’s easy to use.

The reason why Excel is so great as a tool is because it very often satisfies both the first requirement and the second requirement. It’s genuinely amazing the things you can do with Microsoft Excel. I slightly exaggerate on Twitter, but my appreciation of Microsoft Excel is genuine to some degree.

The first requirement almost always takes precedence over the second requirement. It doesn’t matter if your tool is easy to use if it doesn’t solve the problem you need it to. That’s the gist of the issue at hand here. As stated before, I believe a nonzero number of people grandstanding about this discourse to their professional audience may not entirely understand the issue with Prophet in this context. That’s fine, there are tons of skills that the term “data science” encompasses, many of which I am personally a total novice in that others may be experts in. In some cases other people will be total novices with time series analysis and financial markets, and not really grasp how absurd the whole Prophet thing really is. It’s ok if you don’t know this, but you look a little silly if you’re defending it.

Prophet is a simple model. But so is SARIMA. Neither is more “fancy” than the other, if you ask me. One is a better tool for modeling prices in a competitive marketplace. You should require your data scientists to know the one that is needed for the job.

And let’s not overlook the absurdity of strongly preferring a candidate who has experience in a plug and play library built for novices for a job that pays $200,000. How is this not completely ridiculous to anyone with a pulse? $200k a year can attract people who actually know what they’re doing. Maybe a math or econ PhD. Maybe a Microsoft Excel pro with 10 years of finance sector experience. The list goes on.

The requirement that people come to your company knowing how to use piss easy baby tools is an extremely dumb and lazy hiring practice. It is also, unfortunately, a common practice in data science job postings. The aggregate effect of this practice being widespread is that talented people with unusual backgrounds get gatekept out of good paying jobs that they’d be exceptional at. Making fun of the job posting and using Prophet has been compared to gatekeeping. To be clear, the Prophet prerequisite is an actual form of gatekeeping being undertaken by a major company that has actual material impacts on people’s careers. The job post excludes people not based on aptitude, but based on whether they have previous experience and familiarity with a tool they could be introduced to and then master in under 15 minutes. A tweet making fun of the job posting is not gatekeeping. Get over it, LinkedIn clout chasers.