A metrics Catch-22 for intervention design

Statisticians have a common frustration: they are called too late to help. A study is performed, data is collected, and only then is a statistician called — just in time to tell everyone that the study was underpowered, that the data collected can’t be used for evaluation, or that significance can’t be assessed without a now-impossible-to-reconstruct baseline. I’m here to argue that the opposite can occur as well: calling a statistician too early can doom a project to clinical or policy insignificance.

If you are a statistician, you are probably puzzled or dismissive at this point — but it’s true. Imagine an educational intervention to improve students’ comfort with fractions. The principal investigator comes to you and says: “I learned my lesson last time, and I want help early — how do I design this thing?” You collaborate to design or find an appropriate skills test that measures the students’ arithmetic fluency with fractions, do your power calculation, specify your data collection regime, and even randomize the intervention — and then the PI goes off and designs the details of implementation. This sounds ideal, right? You’ve covered all of your statistical bases.
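For concreteness, the power calculation in that workflow might look something like the sketch below: a standard normal-approximation sample-size formula for a two-arm comparison. The particular inputs (Cohen’s d of 0.4, two-sided alpha of 0.05, 80% power) are illustrative assumptions, not values from any real fractions study.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size: float,
                          alpha: float = 0.05,
                          power: float = 0.8) -> int:
    """Approximate per-group n for a two-sample comparison of means.

    Uses the normal approximation n = 2 * ((z_{1-a/2} + z_{power}) / d)^2,
    where d is Cohen's d. All parameter values are illustrative assumptions.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return math.ceil(n)

# A modest effect (d = 0.4) needs roughly a hundred students per arm:
print(sample_size_per_group(0.4))  # → 99
```

Note how much rides on the effect size guess: halving d roughly quadruples the required n, which is part of why the statistician wants to be involved before recruitment begins.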

The problem here is that you’ve allowed all the researcher degrees of freedom to wander down the path that leads most directly to testing success. Perhaps the intervention includes drills that happen to match the structure of the test questions, or it focuses a bit more on the types of applications the test covers. That means that unless the PI is incredibly careful (or purposefully incompetent), the intervention was implicitly designed to do well on the test created or selected for it, taking advantage of the simplified metric. This isn’t a statistical concern, but it is absolutely a problem for the generalizability of the study.

This is just Goodhart’s law and principal–agent conflicts in a different guise: it’s possible that testing success perfectly aligns with the end goal of greater long-term understanding of fractions, but it’s far from guaranteed. So behavior warps to optimize for the metric, not the goal.

On the other hand, as Gwern helpfully pointed out to me, the opposite is arguably worse: if you choose the metric after the intervention, you can tailor the metric to the intervention, and again enter the garden of forking paths to essentially cherry-pick results. This is widely recognized, but as I mention above, doing the opposite creates its own problems.

How can we avoid this when designing studies? I don’t have a good answer. The choice is between implicitly post-hoc optimizing either the metrics or the intervention — and so I’ll simply echo Yossarian in observing “that’s some catch, that Catch-22.”