The bulk of my time as a data scientist was spent on what we called Evaluation, which entailed designing experimentation platforms, helping teams run and analyze experiments, and defining success metrics. All in service of answering the question: “how do we know that what we’re launching is actually good (for users)?” This is a seemingly simple question, but for a product as complex as Google Search, I assure you it’s nigh intractable.
As a data science tech lead / manager, I spent literally tens of thousands of hours on a primary task that essentially boiled down to navigating this duality:
- Convincing stakeholders that no metric is perfect.
- Convincing stakeholders that some metrics are useful, even if not perfect.
Or as the famous statistician George Box put it, “All models are wrong, but some are useful.”
The rest of this post offers some high-level generalizations on the theme that “the metric is not the product.” In subsequent posts, I’ll share my observations from times when teams over-relied on metrics, or when executives demanded too much from them.
One fairly concise explanation of this topic is in Shane Parrish’s The Great Mental Models, in the chapter “The Map Is Not the Territory.” Excerpt: https://fs.blog/map-and-territory/
“Reality is messy and complicated, so our tendency to simplify it is understandable. However, if the aim becomes simplification rather than understanding, we start to make bad decisions. When we mistake the map for the territory, we start to think we have all the answers. We create static rules or policies that deal with the map but forget that we exist in a constantly changing world. When we close off or ignore feedback loops, we don’t see that the terrain has changed and we dramatically reduce our ability to adapt to a changing environment.”
As a data scientist working in a very data-driven organization, I found it easy to forget that data (e.g., logged interactions, user survey responses) and metrics are merely abstractions of ground-truth reality. My colleagues called mismatches in this abstraction mapping “representational uncertainty: the gap between the desired meaning of some measure and its actual meaning” in the excellent blog post “Uncertainties: Statistical, Representational, Interventional”: https://www.unofficialgoogledatascience.com/2021/12/uncertainties-statistical.html
The rather circular definition of “statistic” from Wikipedia is: “A statistic (singular) or sample statistic is any quantity computed from values in a sample which is considered for a statistical purpose.” With apologies for abusing these terms in a non-technical manner: a statistic (or metric) is a calculation done on data that produces a reduced representation of that data, one which captures some relevant information and throws away the rest.
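To make that concrete, here is a minimal sketch in Python. The log fields (query, clicked, dwell_seconds) and the click-through-rate metric are made up for illustration, not anything Search actually uses:

```python
# Minimal sketch with made-up log fields, just to illustrate the point:
# a metric is a reduction of the underlying data, not the data itself.
from dataclasses import dataclass

@dataclass
class Interaction:
    query: str            # what the user searched for
    clicked: bool         # whether they clicked any result
    dwell_seconds: float  # how long they stayed on the clicked page

logs = [
    Interaction("best hiking boots", clicked=True, dwell_seconds=95.0),
    Interaction("weather tomorrow", clicked=False, dwell_seconds=0.0),
    Interaction("python dataclass", clicked=True, dwell_seconds=4.0),
]

# The "metric": click-through rate over the sample.
ctr = sum(i.clicked for i in logs) / len(logs)
print(f"CTR = {ctr:.2f}")

# The single number 0.67 keeps one slice of the information (did a click
# happen?) and throws away the rest: the queries, the dwell times, and
# whether the user was actually satisfied. The non-click on "weather
# tomorrow" may have been a perfectly good answer on the results page.
```

The single number at the end is a lossy summary: it tells you something real, and it silently discards everything else.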
In this sense, it should be abundantly clear that a metric, even a well-defined one, can’t possibly encapsulate all aspects of a product. For some products, defining a “north star” metric is relatively easy. For example, long-term revenue is fairly uncontroversial as the Ads org’s north star, though nuances and precise assumptions abound even there.
For organic Search, we were lucky to be able to focus strictly on user value rather than monetization, which was and still is rare. Since before I joined, the holy grail had been to find “one metric to rule them all” that captured all aspects of user value. We never succeeded in this pursuit, and most of us are convinced that a “constellation” of metrics is more practical and operationally viable than a single “north star.”
In my roles as product manager and data scientist, I had to constantly remind myself and others to use metrics carefully, to neither under-rely nor over-rely on them. I believe they are an incredibly useful tool for measuring progress and success, and for building alignment across large and complex organizations. But they are not some organizational panacea, nor will they perfectly represent any aspect of ground-truth reality.