A metrics Catch-22 for intervention design

Statisticians have a common frustration: they are called too late to help. A study is performed, data is collected, and then a statistician is called — just in time to tell everyone that the study was underpowered, that the data collected can’t be used for evaluation, or that significance can’t be assessed without a now-impossible-to-reconstruct baseline. I’m here to argue that the opposite can occur as well: calling a statistician too early can doom a project to clinical or policy insignificance.

If you are a statistician, you are probably puzzled or dismissive at this point — but it’s true. Imagine an educational intervention to improve students’ comfort with fractions. The principal investigator comes to you and says: I learned my lesson last time, and I want help early — how do I design this thing? You collaborate to design or find an appropriate skills test that measures the students’ arithmetic fluency with fractions, do your power calculation, specify your data collection regime, and even randomize the intervention — and then the PI goes off and designs the details of implementation. This sounds ideal, right? You’ve covered all of your statistical bases.

The problem here is that you’ve allowed all the researcher degrees of freedom to wander down the path that leads most directly to testing success. Perhaps the intervention has some drills that happen to match the test question structure, or it focuses a bit more on the types of applications that are tested. That means that unless the PI is incredibly careful (or purposefully incompetent), the intervention was implicitly designed to do well on the test created or selected for it, taking advantage of the simplified metric. This isn’t a statistical concern, but it is absolutely a problem for the generalizability of the study.

This is just Goodhart’s law and principal-agent conflicts in a different guise: it’s possible that testing success perfectly aligns with the end goal of greater long-term understanding of fractions, but it’s far from guaranteed. So behavior warps to optimize for the metric, not the goal.

On the other hand, as Gwern helpfully pointed out to me, the opposite is arguably worse: if you choose the metric after the intervention, you can tailor the metric to the intervention, and again enter the garden of forking paths to essentially cherry-pick results. This is widely recognized, but as I mention above, doing the opposite creates its own problems.

How can we avoid this when designing studies? I don’t have a good answer. The choice is between implicitly post-hoc optimizing either the metrics or the intervention — and so I’ll simply echo Yossarian in observing “that’s some catch, that Catch-22.”

The good, the bad, and the appropriately under-powered

Many quantitative studies are good — they employ appropriate methodology, have properly specified, empirically valid hypotheses registered before data collection, and then collect sufficient data transparently and appropriately. Others fail at one or more of these hurdles. But a third category also exists: the appropriately under-powered. Despite doing everything else right, many properly posed questions cannot be answered with the potentially available data.

Two examples will illustrate this point. It is difficult to ensure the safety and efficacy of treatments for sufficiently rare diseases in the typical manner, because the total number of cases can be insufficient for a properly powered clinical trial. Similarly, it is difficult to answer a variety of well-posed empirical questions in political science, because the number of countries available to use as samples is limited.
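
To make the rare-disease case concrete, here is a minimal sketch in base R, using invented response rates (20% versus 35%) rather than numbers from any real trial:

    # Invented example: detecting an improvement in response rate from 20% to
    # 35% with 80% power at the usual 5% significance level.
    power.prop.test(p1 = 0.20, p2 = 0.35, sig.level = 0.05, power = 0.80)
    # The answer is roughly 140 patients per arm; if only a few dozen cases
    # exist worldwide, the well-posed question remains under-powered.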

What are the options for dealing with this phenomenon? (Excepting the old, unacceptable standby of p-hacking, multiple comparisons, and so on, while hoping the journal publishes the study anyway.) I think there are three main ones, none of which are particularly satisfactory:

  1. Don’t try to answer these questions empirically; use other approaches.
    If data cannot resolve the problem to the customary “standard” of p<0.05, then use qualitative approaches or theory-driven methods instead.
  2. Estimate the effect and show that it is statistically non-significant.
    This will presumably be interpreted as the effect being practically small or insignificant, despite the fact that that isn’t how p-values work.
  3. Do a Bayesian analysis that compares different prior beliefs, to show how the posterior changes (a minimal sketch follows this list).
    This will not alter the fact that there is too little data to convincingly show an answer, and it is difficult to explain. Properly uncertain prior beliefs will show that the answer is still uncertain after accounting for the new data, though the estimated posterior will shift slightly toward the observed effect and narrow somewhat.
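
Here is a minimal sketch of what option 3 might look like, assuming a simple binomial outcome and conjugate Beta priors; the numbers are invented for illustration:

    # Prior-sensitivity comparison for a small binomial sample:
    # 12 successes in 20 observations, under three different Beta priors.
    successes <- 12; n <- 20
    priors <- list(skeptical  = c(2, 8),   # expects a low success rate
                   flat       = c(1, 1),   # uninformative
                   optimistic = c(8, 2))   # expects a high success rate
    for (name in names(priors)) {
      a <- priors[[name]][1] + successes
      b <- priors[[name]][2] + (n - successes)
      ci <- qbeta(c(0.025, 0.975), a, b)   # 95% posterior credible interval
      cat(sprintf("%-10s posterior mean %.2f, 95%% CI [%.2f, %.2f]\n",
                  name, a / (a + b), ci[1], ci[2]))
    }
    # With only 20 observations, the three posteriors still disagree noticeably:
    # the data narrows each prior a little, but does not settle the question.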

At the end of the day, we are left with the unsatisfying conclusion that some questions are not well suited to this approach, and if we are honest, we should not claim that the scientific or empirical evidence should shift people’s opinions much. That’s OK.

Unless, perhaps, someone out there has clearer answers for me?

Evaluating Ben Franklin’s Alternative to Regression Models for Decision Making

Recently, Gwern pointed me to a blog post by Chris Stucchio that makes the impressive-sounding claim that “a pro/con list is 75% as good as [linear regression],” which he goes on to show with a simulation. I was intrigued, as this seemed counterintuitive. I thought making choices would be a bit harder than that, especially when you have lots of choices — and it is, kind of. But first, let’s set up the problem motivation, before I show you pretty graphs of how it performs.

Motivation

Let’s posit a decision maker with a set of options, each of which has some number of characteristics that they have preferences about. How should they choose? It’s not easy to figure out exactly which option they would like the most — especially if you want to get the perfect answer! Decision theory has a panoply of tools, like Multi-Attribute Decision Theory, each with whole books written about them. But you don’t want to spend $20,000 on consultants and model building to choose what ice cream to order; those methods are complicated, and you have a relatively simple decision.

For example, someone is choosing a car. They know that they want fuel efficiency of more than 30 miles per gallon, they want at least 5 seats so their whole family can fit, they prefer a sedan to an SUV or small car, and they would like it to cost under $15,000. Specifying how much they care about each, however, is hard: do they care about price twice as much as the number of seats? Do they care about fuel efficiency more or less than speed?

Instead of asking people to specify their utility function, as many decision theory methods would require, most people just look at the options and pick the one they like most. That works OK, but given cognitive biases and sales pitches that convince them to do something they’ll regret later, a person might be better off with something a bit more structured. That’s where Chris brings in Ben Franklin’s advice.

…my Way is, to divide half a Sheet of Paper by a Line into two Columns, writing over the one Pro, and over the other Con. Then…I put down under the different Heads short Hints of the different Motives…I find at length where the Ballance lies…I come to a Determination accordingly.

Chris interprets “where the Ballance lies” as which list, Pro or Con, has more entries.

The question he asks is how much worse this fairly basic method, which uses a statistical approach referred to as “Unit-Weighted Regression,” is than a more complex regression model with exact preference weights.

Where did “75% as Good” come from?

Chris set up a simulation that showed that, given two random choices and random rankings, with a high number of attributes to consider, 75% of the time the choice given by Ben Franklin’s method is the same as that given by a method that uses the (usually unknown) exact preference weights. This is helpful, since we frequently don’t have enough data to arrive at a good approximation of those weights when considering a decision. (For example, we may want to assist senior management with a decision, but we don’t want to pester them with lots of questions in order to elicit their preferences.)
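
Chris’s post has its own simulation code; the sketch below is my rough re-creation of the idea. The feature and weight distributions are my guesses (binary attributes, exponentially distributed importance weights), so it illustrates the comparison rather than reproducing his exact numbers:

    # Two random options, each either having or lacking each attribute, and a
    # set of "true" importance weights the decision maker never gets to see.
    set.seed(1)
    agreement_rate <- function(n_features, n_sims = 10000) {
      mean(replicate(n_sims, {
        opts    <- matrix(rbinom(2 * n_features, 1, 0.5), nrow = 2)
        weights <- rexp(n_features)
        true_pick <- which.max(opts %*% weights)   # full-information choice
        unit_pick <- which.max(rowSums(opts))      # pro/con count, equal weights
        unit_pick == true_pick                     # ties go to the first option
      }))
    }
    sapply(c(2, 5, 10, 25, 50), agreement_rate)
    # Agreement stays well above the 50% a coin flip would give, even as the
    # number of attributes grows.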

Following the simulation, he proves that, given certain assumptions, this bound is exact. I’m not going to get into those assumptions, but I will note that they probably overstate the actual error rate in the given case: most of the time, there are not many features, and when there are, features that have very low weights wouldn’t be included, which will help the classification, as I’ll show below.

But first, there’s a different problem: he only considers 2 options. So let’s get to my question, and then back to our car buyer.

Multiple Options

It should be fairly intuitive that picking the best option is harder given more choices. If we picked randomly between two options, we’d get the right choice 50% of the time, without even a pro-con list. (And coin-flipping might be a good idea if you’re not sure what to do — Steven Levitt tried it, and according to the NBER working paper he wrote, it’s surprisingly effective. Despite this, most people don’t like the idea.)

But most choices have more than two options, and that makes the problem harder. First, I don’t have any fair three-sided coins. And second, our random guess now gets it right only a third of the time. But how does Ben Franklin’s method do?
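
Extending the sketch above to more than two options just means generating more candidates and checking whether the simple count and the exact weights single out the same one (again my own rough version, not the code behind the graphs below):

    # Same idea, with n_options candidates instead of two.
    agreement_rate_k <- function(n_options, n_features, n_sims = 10000) {
      mean(replicate(n_sims, {
        opts    <- matrix(rbinom(n_options * n_features, 1, 0.5), nrow = n_options)
        weights <- rexp(n_features)
        which.max(rowSums(opts)) == which.max(opts %*% weights)
      }))
    }
    # How often does the pro/con count find the best of 3, 10, or 50 options?
    sapply(c(3, 10, 50), agreement_rate_k, n_features = 10)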

First, this graph shows the case Chris analyzed, with only two options, compared to three:


The method does slightly worse, but it’s almost as good as long as there aren’t lots of dimensions. Intuitively, that makes sense: when there are only a couple of things you care about, one of the options probably has more of them than the other — so unless one of the factors is much more important than the others, it’s unlikely that the weights make a big difference. We can check this intuition by looking at our performance with many more options:


With only a few things that we care about, pro/con lists still perform incredibly well, even when there are tons of choices. In fact, with few enough features, the method performs even better. This makes sense: if there is a choice that is clearly best, we can pick it, since it has everything we want. This points to part of the problem with how the problem was set up: we are looking at whether each item has or doesn’t have each thing we want — not how much value it provides.

If we have a lot of cars to choose from, and we only care about the 4 things we listed (30 MPG, 5 seats, sedan, cost under $15,000), picking one that satisfies all of our preferences is easy. But that doesn’t mean we pick the best one! Given a choice between a five-seater sedan that gets 40 MPG and costs $14,000 or one that gets 32 MPG and costs $14,995, our method calls it a tie. (It’s “correct” because we assumed each feature is binary.) There are plenty of algorithmic ways to get around this that are a bit more complex, but any manual pro/con list would make this difference apparent without adding complexity.

Interestingly, however, with many choices the method starts performing much worse as the number of feature dimensions grows. Why? In a sense, it’s actually because we don’t have enough choices. But first, let’s talk about weak preferences, and why they make the problem seem harder than it really is.

Who Cares?

If we actually have a list of 10 or 15 features, odds are good that some of them don’t really matter. In algorithm design, we need a computer to make decisions without asking us, so a binary classifier can have problems picking the best of many choices with lots of features — but people don’t have that issue.

If I were to give you a list of 10 things you might care about for a car, some of them won’t matter to you nearly as much as others. So… if we drop elements of the pro/con list that are less than 1/5 as important as the average, how does the method perform?
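
The tweak behind the next graph is roughly the following (again a sketch of my own, applying the drop-anything-under-one-fifth-of-the-average rule to the sketch above):

    # Drop features whose weight is less than 1/5 of the average weight before
    # counting pros, mimicking a person leaving trivial items off the list.
    agreement_rate_trimmed <- function(n_options, n_features, n_sims = 10000) {
      mean(replicate(n_sims, {
        opts    <- matrix(rbinom(n_options * n_features, 1, 0.5), nrow = n_options)
        weights <- rexp(n_features)
        keep    <- weights >= mean(weights) / 5    # ignore the "who cares?" items
        which.max(rowSums(opts[, keep, drop = FALSE])) ==
          which.max(opts %*% weights)              # compare to the full-info pick
      }))
    }
    sapply(c(3, 10, 50), agreement_rate_trimmed, n_features = 15)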


And this is why I suggested above that when building a pro/con list, we normally leave off really low-importance items — and that helps a bit, usually.

When we have lots of choices, the low-importance features add noise, not useful information:


Of course, we need to be careful, because it’s not that simple! Dropping features when we don’t have very many choices is a bad idea — we’ll miss the best choice.


The Curse of Dimensionality versus Irrelevant Metrics

We can drop low-importance features, but why does the method work so much worse with more features in the first place? Because, given a lot of features, there are a huge number of possibilities. 5 features allows 2⁵ possibilities — 32. Any option that has all 5 of the things we want (or most of them) will be the best choice — and ignoring some of them, even if they are low weight, will miss that. If we have 50 features, though, we’ll never have 2⁵⁰ options to search through to find one that has everything we might want — so we want to pay attention to the most important features. And that’s the curse of dimensionality.
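
A quick back-of-the-envelope check of that intuition, assuming each option has each feature independently with probability 1/2:

    # Expected number of options (out of N) that happen to have every one of
    # k independent 50/50 features: N / 2^k.
    expected_perfect <- function(N, k) N / 2^k
    expected_perfect(100, 5)    # about 3 of 100 options have all 5 things we want
    expected_perfect(100, 50)   # about 1e-13: with 50 features, essentially none do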

If I were really a statistician, that would be an answer. But as a decision theorist, I think it actually means that our metric is a problem. Picking bad metrics can be a disaster, as I have argued at length elsewhere. And our car buyer shows us why.

There are easily a hundred dimensions we could consider when buying a car. Looking at the engine alone, we might look at torque, horsepower, and top speed, to name a few. But most of these dimensions are irrelevant, so we would ignore them in favor of the 4 things we really care about, listed above: picking a car with the best engine torque that didn’t seat 5 would be a massive failure.

And in our analysis here, these dimensions are collapsed into binaries, both in our heuristic pro/con list and in the base case we compared against! As mentioned earlier, this ignores the difference between 32 MPG and 40 MPG, or between $14,000 and $14,995 — both differences we do care about.

And that’s where I think Ben Franklin is cleverer than we gave him credit for initially. He says “I find at length where the Ballance lies…I come to a Determination accordingly.” That sounds like he’s going to list the options, think about the Pros and Cons, and then make a decision — not on the basis of which list is longer, but simply by looking at the question with the information presented clearly.

Note: Code to generate the graphs in R can be found here: https://github.com/davidmanheim/Random-Stuff/blob/master/MultiOption_Pro_Con_Graphs.R