Goodhart’s Law

It always struck me as somewhat ironic, and sometimes uncomfortable, that I often had to tell stakeholders not to trust a metric I had helped define. As a data scientist, this was a tricky tightrope to walk. You don’t want improper inferences or decisions to be made using your metric, but at the same time you don’t want to undermine trust in the metric itself or your bona fides as a rigorous data scientist (“why should we trust this metric if it doesn’t work in this specific instance?”). It can take tens of person-years of work to define, validate, and gain consensus around a working metric, so these tend to be high-stakes conversations.

To be clear, the choice and specific definitions of metrics can have very real consequences on product direction and user experience. This is what made the evaluation aspects of data science work very rewarding, even if the impact was indirect because we weren’t necessarily shipping out algorithmic changes.

As one example, when I was working on the Discover feed product, we on the leadership team had many passionate discussions about what our “north star” metric should be: daily active users (DAU) or days active per monthly active user (aka L28, formally defined as the average number of days active per user who was active in the last 28 days). At the time, we chose DAU, but measured and reported progress on both.
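
To make these definitions concrete, here is a minimal sketch in Python. It assumes a toy activity log of (user, date) pairs; the names and numbers are purely illustrative, and this is not how these metrics were actually computed internally.

```python
from datetime import date, timedelta

# Hypothetical activity log: (user_id, date) pairs meaning the user
# was active on that day. Real pipelines are vastly more involved.
activity = {
    ("alice", date(2024, 1, 1)), ("alice", date(2024, 1, 2)),
    ("bob",   date(2024, 1, 2)),
}

def dau(activity, day):
    """Daily active users: distinct users active on `day`."""
    return len({user for user, d in activity if d == day})

def l28(activity, day):
    """Average days active over the trailing 28 days, among users
    active at least once in that window."""
    window_start = day - timedelta(days=27)
    days_per_user = {}
    for user, d in activity:
        if window_start <= d <= day:
            days_per_user[user] = days_per_user.get(user, 0) + 1
    if not days_per_user:
        return 0.0
    return sum(days_per_user.values()) / len(days_per_user)

print(dau(activity, date(2024, 1, 2)))  # 2
print(l28(activity, date(2024, 1, 2)))  # 1.5
```

Even in this toy form, the two metrics reward different things: DAU goes up when you reach more people on a given day, while L28 goes up when the people you already reach come back more often.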

The difference between DAU and L28 was clearest as a tradeoff, and it was a very real one in terms of product strategy: did we want a product that many people used infrequently, and focus on those low-engagement users, or a product with a strong power-user base, and focus on those high-engagement users? At the time, the DAU metric implied the former and L28 implied the latter. It is of course possible to do both, but when making tough resourcing decisions, one had to take precedence.

DAU is a relatively simple metric and thus is widely used. It has a lot of very nice properties, but like all metrics, it is “gameable.” That is, either intentionally or unintentionally, teams can optimize or “hill climb” on it, producing seemingly positive metric changes that, in reality, are at best spurious and at worst clearly bad for the product or its users. In my observation, the gameability problem is far more severe for mature products with little headroom than for newer products with lots of it. (Further thoughts on this in a future episode.)

This phenomenon is known outside of tech as Goodhart’s Law or Campbell’s Law.

Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”

Campbell’s Law: “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”

Note that the metric in question typically starts out well-intentioned, and genuinely good and useful as a metric. Eventually, however, its weaknesses are exploited and it becomes less useful for decision-making or for monitoring success. A favorite example is from way back when Google+ (yes, the misunderstood social network product) defined its daily actives to include activity on any Google product surface that called the Google+ backend. As you can imagine, when G+ features rolled out onto Google’s search result pages and YouTube’s watch pages, it looked like G+ grew like crazy. I probably don’t need to tell you that the social network product did not grow like crazy.

The tech world is rife with consumer products that relied too rigidly on a growth or engagement metric and devolved into cesspools of clickbait and spam. This drive for growth beyond reasonable limits is one contributing factor (beyond pure monetization motives) in what Cory Doctorow called “enshittification.”

The Metric is Not the Product

The bulk of my time as a data scientist was spent on what we called Evaluation, which entailed designing experimentation platforms, helping teams run and analyze experiments, and defining success metrics. All in service of answering the question: “how do we know that what we’re launching is actually good (for users)?” This is a seemingly simple question, but for a product as complex as Google Search, I assure you it’s nigh intractable.

As a data science tech lead / manager, I spent literally tens of thousands of hours on a primary task that essentially boiled down to navigating a duality:

  1. Convincing stakeholders that no metric is perfect.
  2. Convincing stakeholders that some metrics are useful, even if not perfect.

Or as the famous statistician George Box put it, “All models are wrong, but some are useful.”

The rest of this post offers some high-level generalizations on the theme that “the metric is not the product.” In subsequent posts, I’ll go into my observations of what happens when teams rely too heavily on metrics, or when executives demand too much from them.

One fairly concise explanation of this topic is in Shane Parrish’s The Great Mental Models, under the chapter “The Map is Not the Territory.” Excerpt: https://fs.blog/map-and-territory/

“Reality is messy and complicated, so our tendency to simplify it is understandable. However, if the aim becomes simplification rather than understanding, we start to make bad decisions. When we mistake the map for the territory, we start to think we have all the answers. We create static rules or policies that deal with the map but forget that we exist in a constantly changing world. When we close off or ignore feedback loops, we don’t see that the terrain has changed and we dramatically reduce our ability to adapt to a changing environment.”

As a data scientist working in a very data-driven organization, I found it easy to forget that data (e.g., logged interactions, user survey responses, etc.) and metrics are merely abstractions of ground-truth reality. My colleagues called mismatches in this abstraction mapping “representational uncertainty: the gap between the desired meaning of some measure and its actual meaning” in the excellent blog post “Uncertainties: Statistical, Representational, Interventional”: https://www.unofficialgoogledatascience.com/2021/12/uncertainties-statistical.html

The rather circular definition of “statistic” from Wikipedia is: “A statistic (singular) or sample statistic is any quantity computed from values in a sample which is considered for a statistical purpose.” With apologies for abusing these terms in a non-technical manner: a statistic (or metric) is a calculation done on data, a reduced representation that captures some relevant information and throws away the rest.
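
As a toy illustration (mine, not drawn from any internal tooling): two very different samples can reduce to the same summary statistic, and the difference between them is precisely the information the reduction throws away.

```python
import statistics

# Two hypothetical engagement samples (days active in a month).
uniform_engagement = [14, 14, 14, 14]    # everyone moderately engaged
polarized_engagement = [28, 28, 0, 0]    # power users plus lapsed users

# The mean keeps "central tendency" and discards distribution shape.
print(statistics.mean(uniform_engagement))      # 14
print(statistics.mean(polarized_engagement))    # 14

# A second statistic recovers some of what the first threw away.
print(statistics.stdev(uniform_engagement))     # 0.0
print(statistics.stdev(polarized_engagement))   # ~16.2
```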

In this sense, it should be abundantly clear that a metric, even a well-defined one, can’t possibly encapsulate all aspects of a product. Defining a “north star” product metric is easier for some products than for others. For example, long-term revenue is fairly uncontroversial as the Ads org’s north star, though nuances and precise assumptions abound even there.

For organic Search, we were lucky to be able to focus strictly on user value and not monetization, which was and is rare. Since before I joined, the holy grail was to find “one metric to rule them all” that captured all aspects of user value. We were never successful in this pursuit, and most of us are convinced that a “constellation” of metrics is more practical and operationally viable than a single “north star.”

In my roles as product manager and data scientist, I had to constantly remind myself and others to use metrics carefully, neither under-relying nor over-relying on them. I believe they are an incredibly useful tool for measuring progress or success, and for getting alignment across large and complex organizations. But they are not an organizational panacea, nor will they perfectly represent any aspect of ground-truth reality.

Hello Again, World

(FYI, I’m intending to cross-post upcoming entries to Substack, in case it’s easier to follow there: https://seantime.substack.com/)

I recently took a break from working on Google Search, after 15 years as a data scientist and, most recently, as an acting product manager. It was certainly a once-in-a-generation opportunity to work with so many amazing colleagues on a product that was, and is, so impactful on society as a whole.

Now that I have the privilege of some extra time, I feel drawn to put some of my thoughts in writing, because subtly careless mistakes around data, metrics, and scale were so prevalent even among my exceptional colleagues, including software engineers and data scientists, many of whom were much smarter than me. These nuances go beyond the occasional lapses in what we called “statistical thinking” or “probabilistic thinking,” though ultimately many of the issues stem from the further implications of such lapses.

There are many intelligent people who have thought about these topics more deeply than I have. I can only draw from my personal experiences and hope to slightly contribute to the conversation.

My goals for the next few months are to reflect on some of the highest-level insights I think I’ve learned in my career. I hope some of these reflections may be relevant or useful for some of you.

I’ll be upfront that my intention is to explore beyond the sphere of work and career. With any luck, I’ll have the fortitude in the future to more directly probe some of the broader and far more important problems in society that misinterpretations of data, metrics, and scale contribute to.

This will undoubtedly be a wandering journey with many digressions, and many posts will be rough. Content is unlikely to be published with consistent frequency, so please subscribe on Substack to get email updates.