It struck me as somewhat ironic, and sometimes uncomfortable, how often I had to tell stakeholders not to trust a metric that I had helped define. As a data scientist, this was always a tricky tightrope to walk. You don’t want improper inferences or decisions to be made using your metric, but at the same time you don’t want to undermine trust in the metric itself or your bona fides as a rigorous data scientist (“why should we trust this metric if it doesn’t work in this specific instance?”). It can take tens of person-years of work to define, validate, and gain consensus around a working metric, so these scenarios tend to be high-stakes conversations.
To be clear, the choice and specific definitions of metrics can have very real consequences on product direction and user experience. This is what made the evaluation aspects of data science work very rewarding, even if the impact was indirect because we weren’t necessarily shipping out algorithmic changes.
As one example, when I was working on the Discover feed product, we on the leadership team had many passionate discussions about what our “north star” metric should be: daily active users (DAU) or days active per monthly active (aka L28, formally defined as the average days active per user who was active in the last 28 days). At that point in time, we chose DAU, but measured and reported progress on both.
The implications of DAU vs. L28 were clearest as a tradeoff, and that tradeoff was very real in terms of product strategy: did we want a product that many people used infrequently, and focus on those low-engagement users, or a product with a strong power-user base, and focus on those high-engagement users? At the time, the DAU metric implied the former, and L28 implied the latter. Of course it’s possible to do both, but when making tough resourcing decisions, one had to take precedence.
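To make the distinction concrete, here is a minimal sketch of how the two metrics might be computed from a per-user activity log. The data shape and function names are my own illustration, not any team’s production definition; the L28 calculation just follows the informal definition above (average days active over a trailing 28-day window, among users active at least once in that window).

```python
from datetime import date, timedelta
from collections import defaultdict

def dau(events, day):
    """Count distinct users active on a given day."""
    return len({user for user, d in events if d == day})

def l28(events, day):
    """Average days active over the trailing 28 days, per user
    active at least once in that window (one reading of L28)."""
    window_start = day - timedelta(days=27)
    days_active = defaultdict(set)
    for user, d in events:
        if window_start <= d <= day:
            days_active[user].add(d)
    if not days_active:
        return 0.0
    return sum(len(days) for days in days_active.values()) / len(days_active)

# Toy activity log: (user_id, date) pairs.
events = [
    ("power_user", date(2024, 1, 1) + timedelta(days=i))
    for i in range(28)                      # active all 28 days
] + [
    ("casual_user", date(2024, 1, 28)),     # active once
]

print(dau(events, date(2024, 1, 28)))   # 2: both users active today
print(l28(events, date(2024, 1, 28)))   # 14.5 = (28 + 1) / 2
```

The toy numbers illustrate the tension: adding one more casual user nudges DAU up but drags L28 down, which is exactly why the two metrics can pull product strategy in different directions.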
DAU is a relatively simple metric and thus is widely used. It has a lot of very nice properties, but like all metrics, it is “gameable.” That is, either intentionally or unintentionally, teams can optimize or “hill climb” on it, resulting in seemingly positive metric changes that, in reality, are at best spurious or at worst clearly bad for the product or users. My observation is that the severity of the gameability issue is far higher for more mature products with smaller headroom than for newer products with greater headroom. (Further thoughts on this in a future episode.)
This phenomenon is known outside of tech as Goodhart’s Law or Campbell’s Law.
Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”
Campbell’s Law: “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”
Note that the metric in question typically starts out well-intentioned, and genuinely good and useful as a metric. However, its weaknesses are eventually exploited, and it becomes less useful for decision making or for monitoring success over time. A favorite example is from way back when Google+ (yes, the misunderstood social network product) defined its daily actives to include any Google product surface that called the Google+ backend. As you can imagine, when G+ features rolled out onto Google’s search results pages and YouTube’s watch pages, it looked like G+ grew like crazy. I probably don’t need to tell you that the social network product did not grow like crazy.
The tech world is rife with consumer products that relied too rigidly on a growth or engagement metric and, as a result, became cesspools of clickbait and spam. This drive for growth beyond reasonable limits is one contributing factor (beyond pure monetization motives) toward what Cory Doctorow calls “enshittification.”