Goodhart’s Law

It always struck me as somewhat ironic, and sometimes uncomfortable, that I often had to tell stakeholders not to trust a metric I had helped define. As a data scientist, this was a tricky tightrope to walk. You don’t want improper inferences or decisions to be made using your metric, but at the same time you don’t want to undermine trust in the metric itself or your bona fides as a rigorous data scientist (“why should we trust this metric if it doesn’t work in this specific instance?”). It can take tens of person-years of work to define, validate, and gain consensus around a working metric, so these tend to be high-stakes conversations.

To be clear, the choice and specific definitions of metrics can have very real consequences on product direction and user experience. This is what made the evaluation aspects of data science work very rewarding, even if the impact was indirect because we weren’t necessarily shipping out algorithmic changes.

As one example, when I was working on the Discover feed product, we on the leadership team had many passionate discussions about what our “north star” metric should be: daily active users (DAU) or days active per monthly active user (aka L28, formally defined as the average number of days active per user who was active in the last 28 days). At the time, we chose DAU, but measured and reported progress on both.
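
To make these definitions concrete, here is a minimal sketch in Python. It assumes a toy activity log of (user, date) pairs; the names and numbers are purely illustrative, and this is not how these metrics were actually computed internally.

```python
from datetime import date, timedelta

# Hypothetical activity log: (user_id, date) pairs meaning the user
# was active on that day. Real pipelines are vastly more involved.
activity = {
    ("alice", date(2024, 1, 1)), ("alice", date(2024, 1, 2)),
    ("bob",   date(2024, 1, 2)),
}

def dau(activity, day):
    """Daily active users: distinct users active on `day`."""
    return len({user for user, d in activity if d == day})

def l28(activity, day):
    """Average days active over the trailing 28 days, among users
    active at least once in that window."""
    window_start = day - timedelta(days=27)
    days_per_user = {}
    for user, d in activity:
        if window_start <= d <= day:
            days_per_user[user] = days_per_user.get(user, 0) + 1
    if not days_per_user:
        return 0.0
    return sum(days_per_user.values()) / len(days_per_user)

print(dau(activity, date(2024, 1, 2)))  # 2
print(l28(activity, date(2024, 1, 2)))  # 1.5
```

Even in this toy form, the two metrics reward different things: DAU goes up when you reach more people on a given day, while L28 goes up when the people you already reach come back more often.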

The difference between DAU and L28 was clearest as a tradeoff, and it was a very real one in terms of product strategy: did we want a product that many people used infrequently, and focus on those low-engagement users, or a product with a strong power-user base, and focus on those high-engagement users? At the time, the DAU metric implied the former and L28 implied the latter. It is of course possible to do both, but when making tough resourcing decisions, one had to take precedence.

DAU is a relatively simple metric and thus is widely used. It has a lot of very nice properties, but like all metrics, it is “gameable.” That is, either intentionally or unintentionally, teams can optimize or “hill climb” on it, producing seemingly positive metric changes that, in reality, are at best spurious and at worst clearly bad for the product or its users. In my observation, the gameability problem is far more severe for mature products with little headroom than for newer products with lots of it. (Further thoughts on this in a future episode.)

This phenomenon is known outside of tech as Goodhart’s Law or Campbell’s Law.

Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”

Campbell’s Law: “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”

Note that the metric in question typically starts out well-intentioned, and genuinely good and useful as a metric. Eventually, however, its weaknesses are exploited and it becomes less useful for decision-making or for monitoring success. A favorite example is from way back when Google+ (yes, the misunderstood social network product) defined its daily actives to include activity on any Google product surface that called the Google+ backend. As you can imagine, when G+ features rolled out onto Google’s search result pages and YouTube’s watch pages, it looked like G+ grew like crazy. I probably don’t need to tell you that the social network product did not grow like crazy.

The tech world is rife with consumer products that relied too rigidly on a growth or engagement metric and devolved into cesspools of clickbait and spam. This drive for growth beyond reasonable limits is one contributing factor (beyond pure monetization motives) in what Cory Doctorow called “enshittification.”

The Metric is Not the Product

The bulk of my time as a data scientist was spent on what we called Evaluation, which entailed designing experimentation platforms, helping teams run and analyze experiments, and defining success metrics. All in service of answering the question: “how do we know that what we’re launching is actually good (for users)?” This is a seemingly simple question, but for a product as complex as Google Search, I assure you it’s nigh intractable.

As a data science tech lead / manager, I spent literally tens of thousands of hours on a primary task that essentially boiled down to navigating a duality:

  1. Convincing stakeholders that no metric is perfect.
  2. Convincing stakeholders that some metrics are useful, even if not perfect.

Or as the famous statistician George Box put it, “All models are wrong, but some are useful.”

The rest of this post offers some high-level generalizations on the theme that “the metric is not the product.” In subsequent posts, I’ll go into my observations of what happens when teams rely too heavily on metrics, or when executives demand too much from them.

One fairly concise explanation of this topic is in Shane Parrish’s The Great Mental Models, under the chapter “The Map is Not the Territory.” Excerpt: https://fs.blog/map-and-territory/

“Reality is messy and complicated, so our tendency to simplify it is understandable. However, if the aim becomes simplification rather than understanding, we start to make bad decisions. When we mistake the map for the territory, we start to think we have all the answers. We create static rules or policies that deal with the map but forget that we exist in a constantly changing world. When we close off or ignore feedback loops, we don’t see that the terrain has changed and we dramatically reduce our ability to adapt to a changing environment.”

As a data scientist working in a very data-driven organization, I found it easy to forget that data (e.g., logged interactions, user survey responses, etc.) and metrics are merely abstractions of ground-truth reality. My colleagues called mismatches in this abstraction mapping “representational uncertainty: the gap between the desired meaning of some measure and its actual meaning” in the excellent blog post “Uncertainties: Statistical, Representational, Interventional”: https://www.unofficialgoogledatascience.com/2021/12/uncertainties-statistical.html

The rather circular definition of “statistic” from Wikipedia is: “A statistic (singular) or sample statistic is any quantity computed from values in a sample which is considered for a statistical purpose.” With apologies for abusing these terms in a non-technical manner: a statistic (or metric) is a calculation done on data, a reduced representation that captures some relevant information and throws away the rest.
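
As a toy illustration (mine, not drawn from any internal tooling): two very different samples can reduce to the same summary statistic, and the difference between them is precisely the information the reduction throws away.

```python
import statistics

# Two hypothetical engagement samples (days active in a month).
uniform_engagement = [14, 14, 14, 14]    # everyone moderately engaged
polarized_engagement = [28, 28, 0, 0]    # power users plus lapsed users

# The mean keeps "central tendency" and discards distribution shape.
print(statistics.mean(uniform_engagement))      # 14
print(statistics.mean(polarized_engagement))    # 14

# A second statistic recovers some of what the first threw away.
print(statistics.stdev(uniform_engagement))     # 0.0
print(statistics.stdev(polarized_engagement))   # ~16.2
```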

In this sense, it should be abundantly clear that a metric, even a well-defined one, can’t possibly encapsulate all aspects of a product. Defining a “north star” product metric is easier for some products than for others. For example, long-term revenue is fairly uncontroversial as the Ads org’s north star, though nuances and precise assumptions abound even there.

For organic Search, we were lucky to be able to focus strictly on user value and not monetization, which was and is rare. Since before I joined, the holy grail was to find “one metric to rule them all” that captured all aspects of user value. We were never successful in this pursuit, and most of us are convinced that a “constellation” of metrics is more practical and operationally viable than a single “north star.”

In my roles as product manager and data scientist, I had to constantly remind myself and others to use metrics carefully, neither under-relying nor over-relying on them. I believe they are an incredibly useful tool for measuring progress or success, and for getting alignment across large and complex organizations. But they are not an organizational panacea, nor will they perfectly represent any aspect of ground-truth reality.

Hello Again, World

(FYI, I’m intending to cross-post upcoming entries to Substack, in case it’s easier to follow there: https://seantime.substack.com/)

I recently took a break from working on Google Search, after 15 years as a data scientist and, most recently, as an acting product manager. It was certainly a once-in-a-generation opportunity to work with so many amazing colleagues on a product that was, and is, so impactful on society as a whole.

Now that I have the privilege of some extra time, I feel drawn to put some of my thoughts in writing, because subtly careless mistakes around data, metrics, and scale were so prevalent even among my exceptional colleagues, including software engineers and data scientists, many of whom were much smarter than me. These nuances go beyond the occasional lapses in what we called “statistical thinking” or “probabilistic thinking,” though ultimately many of the issues stem from the further implications of such lapses.

There are many intelligent people who have thought about these topics more deeply than I have. I can only draw from my personal experiences and hope to slightly contribute to the conversation.

My goals for the next few months are to reflect on some of the highest-level insights I think I’ve learned in my career. I hope some of these reflections may be relevant or useful for some of you.

I’ll be upfront that my intention is to explore beyond the sphere of work and career. With any luck, I’ll have the fortitude in the future to more directly probe some of the broader and far more important problems in society that misinterpretations of data, metrics, and scale contribute to.

This will undoubtedly be a wandering journey with many digressions, and many posts will be rough. Content is unlikely to be published with consistent frequency, so please subscribe on Substack to get email updates.