A Quick Confidence Heuristic

Let’s say you have well-informed opinions on a variety of topics. Without information about your long-term accuracy in each area, how confident should you be in those opinions?

Here’s a quick heuristic for any area where other people also have well-informed opinions about the same topics: your confidence should be a function of the distance of your estimate from the average opinion and the standard deviation of those opinions. I’ll call this the wisdom-of-crowds confidence level, because it can be justified by the empirical observation that the average of even uninformed guesses is typically a better predictor than most individual predictions.
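To make this concrete, here is a minimal sketch of one way you might operationalize the heuristic; the specific formula is my own illustration, not a claim that it is the right one. It treats the crowd mean as the default estimate and lets confidence fall off with your distance from it, measured in crowd standard deviations.

```python
# A toy operationalization (illustrative only): confidence decays with the
# z-distance between my estimate and the crowd's average opinion.
from statistics import mean, stdev
from math import erf, sqrt

def crowd_confidence(my_estimate, other_estimates):
    """Return a rough confidence in my estimate, given others' estimates."""
    mu, sigma = mean(other_estimates), stdev(other_estimates)
    z = abs(my_estimate - mu) / sigma   # distance in crowd standard deviations
    # Normal tail mass beyond z: the further I sit from the crowd, the less
    # weight I should put on being the one who is right.
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

print(crowd_confidence(55, [50, 55, 60, 52, 58]))  # near the crowd: about 0.5
print(crowd_confidence(80, [50, 55, 60, 52, 58]))  # far from the crowd: near 0
```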

Why does this make sense?

The Aumann agreement theorem implies that rational discussants can, given enough patience and introspection, pass messages about their justifications until their estimates eventually converge. Given that informed opinions share most of their evidence, the difference between the opinions is likely due to specific unshared assumptions or evidence. If that evidence were shared, unless the vast majority of the unshared assumptions piled up on the same side, the answer would land somewhere near the middle. (This is why I was going to call the heuristic Aumann-confidence, but I don’t think it quite fits.)

Unless you have a strong reason to think you are a privileged observer (trading on inside information, say, or much better calibrated than the other observers), there is no reason to expect this unshared evidence to be biased. And while this appears to contradict the conservation of expected evidence theorem, it’s actually kind of a consequence of it, because we need to update on the knowledge that there is unshared evidence leading the other person to make their own claim.

This is where things get tricky — we need to make assumptions about joint distributions on unshared evidence. Suffice it to say that unless we have reason to believe our unshared evidence or assumptions are much stronger than theirs, we should end up near the middle. And that goes back to a different, earlier assumption – that others are also well informed.

Now that we’ve laid out the framework, though, we can sketch the argument.

  1. We can expect that our opinion should shift towards the average, once we know what the average is, even without exploring the other people’s unshared assumptions and data. The distance it should shift depends on how good our assumptions and data are compared to theirs.
  2. Even if we have strong reasons for thinking that we understand why others hold the assumptions they do, they presumably feel the same way about us.
  3. And why do you think your unshared evidence and assumptions are so great anyways, huh? Are you special or something?

Anyways, those are my thoughts.

Comments?

(Originally posted here on LessWrong)

Deceptive Dataviz and Confusing Data; Uncomparables in Education

I don’t actually want to talk about dataviz here; I want to talk about the data that gets visualized. I routinely see graphs that are not (necessarily) bad as graphs, but that present data which is bad to put in a graph at all. There are plenty of examples of unreasonably constrained axes, or simply incorrect bar heights — but that’s not the problem for today.

Today, I want to give an example of data that is displayed as if the information is comparable when it isn’t – like dollars and scores, or percentage improvements versus totals. What do I mean? I have a great example!

This graph is a masterpiece of the errors I am talking about. And it seems the very recently deceased Dr. Coulson is being maligned by a wiki article on Cato attributing this graph to him. (At the very least, the original seems to have kept dollars and percentages separate.) This graph tries hard to make incomparable data comparable by displaying the percentage change of a variety of incomparable datasets — which is better than showing the raw, incomparable data, right?

Well, no. At least not here. But why are they incomparable?

First, we have NAEP scores, which are not measured consistently over the period shown; the meaning of the metric has changed repeatedly as academic standards were altered to reflect the changing abilities and needs of students.

They are also scores, and as I’m sure everyone is aware, the difference between a 1300 and a 1400 on the SAT is much smaller than the difference between a 1500 and a 1600, so percentage improvements on these tests are not a great comparison. NAEP scores are also range-bound; they sit on a 0–500 scale, so doubling the math scores is not only a nonlinear notion of improvement but, with averages already around 300, literally impossible.
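As a toy illustration of why percentage changes on a capped score scale and on an uncapped dollar series aren’t comparable, here is a tiny calculation with invented numbers (not values taken from the chart):

```python
# Hypothetical numbers, purely for illustration -- not taken from the chart.
naep_then, naep_now = 285, 305        # scores on a scale capped at 500
cost_then, cost_now = 5_000, 12_000   # per-pupil spending in dollars, uncapped

pct_change = lambda a, b: 100 * (b - a) / a
print(f"Score change: {pct_change(naep_then, naep_now):.1f}% "
      f"(only {500 - naep_then} points of headroom even existed)")
print(f"Cost change:  {pct_change(cost_then, cost_now):.1f}% (no ceiling at all)")
```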

Next, the basis for all of these numbers is non-constant, in an interesting way. The chart presents enrollment as a total, but ignores the changing demographic mix — and no, this isn’t about the soft bigotry of low expectations, it’s about the expanding school population. Expanding? Yes — because the enrollment number held roughly constant while the underlying school-age population shrank, so the share of that population in school was growing. (Chart by Bill McBride)


The 1970s were the height of the baby boom — and the percentage of people who were going to school was still on an upward trend:


The totals were flat, but the demographic split wasn’t, and the percentage of low achievers, who are the least likely to attend, was increasing. And the demographic composition of schools matters. But I won’t get into divergent birth rates and similar demographic issues any further for now.

But what about cost? I mean, clearly that can’t be deceptive — we’re spending more because we keep hiring more teachers, like the chart seems to show! But we aren’t — the number of teachers only increased by 50% in that time, not nearly 100%. But the chart isn’t wrong — schools are hiring more staff (largely to deal with regulations, as I’m sure Cato would agree).


And this also explains why total cost went up — we have way more non-teacher staff, many of whom are much more expensive. We’re also neglecting the fact that the country is richer: as a share of GDP, teacher pay has fallen way behind, because we pay teachers the same amount but the economy as a whole grew. But that’s a different issue.

So yes, we can show a bunch of numbers correctly on a chart, but it won’t mean what it looks like it means if we’re sloppy — or purposefully misleading.

The good, the bad, and the appropriately under-powered

Many quantitative studies are good — they employ appropriate methodology, have properly specified, empirically valid hypotheses registered before data collection, and then collect sufficient data transparently and appropriately. Others fail at one or more of these hurdles. But a third category also exists: the appropriately under-powered. Despite doing everything else right, many properly posed questions cannot be answered with the potentially available data.

Two examples will illustrate this point. It is difficult to ensure the safety and efficacy of treatments for sufficiently rare diseases in the typical manner, because the total number of cases can be insufficient for a properly powered clinical trial. Similarly, it is difficult to answer a variety of well-posed, empirical questions in political science, because the number of countries to be used as samples is limited.
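To make the rare-disease example concrete, here is a rough power calculation with invented numbers; the normal-approximation formula below is a standard textbook choice, not something from a specific study. Even a sizable treatment effect can be undetectable when the entire patient population only supports a small trial.

```python
# Illustrative power calculation with invented numbers; normal approximation
# for a two-sided, two-sample test of proportions.
from scipy.stats import norm

def two_proportion_power(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power to detect a difference between two proportions."""
    p_bar = (p1 + p2) / 2
    se_null = (2 * p_bar * (1 - p_bar) / n_per_arm) ** 0.5
    se_alt = (p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm) ** 0.5
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf((abs(p1 - p2) - z_crit * se_null) / se_alt)

# Suppose only ~200 patients exist worldwide, so at most 100 per arm:
print(two_proportion_power(p1=0.30, p2=0.45, n_per_arm=100))  # roughly 0.6, below the usual 0.8 target
```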

What are the options for dealing with this phenomenon (excepting the old, unacceptable standby of p-hacking, running multiple comparisons, and hoping the journal publishes the study anyway)? I think there are three main ones, none of which is particularly satisfactory:

  1. Don’t try to answer these questions empirically; use other approaches.
    If data cannot resolve the problem to the customary “standard” of p<0.05, then use qualitative approaches or theory-driven methods instead.
  2. Estimate the effect and show that it is statistically non-significant.
    This will presumably be interpreted as the effect being practically small or insignificant, despite the fact that that isn’t how p-values work.
  3. Do a Bayesian analysis that compares different prior beliefs, to show how the posterior changes (a minimal sketch follows this list).
    This will not alter the fact that there is too little data to convincingly show an answer, and it is difficult to explain. Properly uncertain prior beliefs will show that the answer is still uncertain after accounting for the new data, though the estimated posterior will perhaps shift slightly toward the data and narrow somewhat.
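Here is a minimal sketch of that third option, with invented data and a simple beta-binomial model: with only a dozen observations, the posterior still depends heavily on which prior you started with.

```python
# Illustrative beta-binomial comparison with invented data: when n is small,
# the posterior remains mostly a reflection of the prior.
from scipy.stats import beta

successes, failures = 8, 4   # all the data we could collect

priors = {"skeptical Beta(2, 8)": (2, 8),
          "optimistic Beta(8, 2)": (8, 2)}

for name, (a, b) in priors.items():
    posterior = beta(a + successes, b + failures)
    low, high = posterior.ppf([0.05, 0.95])
    print(f"{name}: mean {posterior.mean():.2f}, 90% interval ({low:.2f}, {high:.2f})")
```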

At the end of the day, we are left with the unsatisfying conclusion that some questions are not well suited to this approach, and when being honest we should not claim that the scientific or empirical evidence should shift people’s opinions much. That’s OK.

Unless, perhaps, someone out there has clearer answers for me?

Evaluating Ben Franklin’s Alternative to Regression Models for Decision Making

Recently, Gwern pointed me to a blog post by Chris Stucchio that makes the impressive-sounding claim that “a pro/con list is 75% as good as [linear regression],” which he goes on to show with a simulation. I was intrigued, as this seemed counterintuitive. I thought making choices would be a bit harder than that, especially when you have lots of choices — and it is, kind of. But first, let’s set up the motivation for the problem, before I show you pretty graphs of how the method performs.

Motivation

Let’s posit a decision maker with a set of options, each of which has some number of characteristics that they have preferences about. How should they choose? It’s not easy to figure out exactly which option they would like the most — especially if you want to get the perfect answer! Decision theory has a panoply of tools, like Multi-Attribute Decision Theory, each with whole books written about them. But you don’t want to spend $20,000 on consultants and model building to choose what ice cream to order; those methods are complicated, and you have a relatively simple decision.

For example, someone is choosing a car. They know that they want fuel efficiency of more than 30 miles per gallon, they want at least 5 seats for their whole family to fit, they prefer a sedan to an SUV or small car, and they would like it to cost under $15,000. Specifying how much they care about each, however, is hard; do they care about price twice as much as the number of seats? Do they care about fuel efficiency more or less than speed?

Instead of asking people to specify their utility function, as many decision theory methods would require, most people just look at the options and pick the one they like most. That works OK, but given cognitive biases and sales pitches that convince them to do something they’ll regret later, a person might be better off with something a bit more structured. That’s where Chris brings in Ben Franklin’s advice.

…my Way is, to divide half a Sheet of Paper by a Line into two Columns, writing over the one Pro, and over the other Con. Then…I put down under the different Heads short Hints of the different Motives…I find at length where the Ballance lies…I come to a Determination accordingly.

Chris interprets “where the Ballance lies” as which list, Pro or Con, has more entries.

The question he asks is how much worse this fairly basic method, which statisticians would call “unit-weighted regression,” is than a more complex regression model with exact preference weights.

Where did “75% as Good” come from?

Chris set up a simulation showing that, given two random options with random preference weights and a high number of attributes to consider, the choice given by Ben Franklin’s method matches the one given by a method using the (usually unknown) exact preference weights 75% of the time. This is helpful, since we frequently don’t have enough data to arrive at a good approximation of those weights when considering a decision. (For example, we may want to assist senior management with a decision, but we don’t want to pester them with lots of questions in order to elicit their preferences.)

Following the simulation, he proves that, given certain assumptions, this bound is exact. I’m not going to get into those assumptions, but I will note that they probably overstate the actual error rate in the given case; most of the time, there are not many features, and when there are, features that have very low weights wouldn’t be included, which will help the classification, as I’ll show below.

But first, there’s a different problem: he only talks about two options. So let’s get to my question, and then back to our car buyer.

Multiple Options

It should be fairly intuitive that picking the best option is harder given more choices. If we picked randomly between two options, we’d get the right choice 50% of the time, without even a pro/con list. (And coin-flipping might be a good idea if you’re not sure what to do — Steven Levitt tried it, and according to the NBER working paper he wrote, it’s surprisingly effective. Despite this, most people don’t like the idea.)

But most choices have more than two options, and that makes the problem harder. First, I don’t have any fair three-sided coins. And second, our random guess now gets it right only a third of the time. But how does Ben Franklin’s method do?
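Here is a rough Python sketch of the simulation I’m describing; the actual graphs come from the R code linked at the end, and the distributional choices below (binary features, uniform random weights) are simplifications. It compares the pick made by counting features against the pick made by the true weighted score.

```python
# A rough sketch of the simulation (simplified; the real graphs come from the
# R code linked at the end). Features are binary, weights are uniform random.
import numpy as np
rng = np.random.default_rng(0)

def procon_agreement(n_options, n_features, trials=20_000):
    """How often the pro/con count picks the same option as the true weighted score."""
    agree = 0
    for _ in range(trials):
        weights = rng.random(n_features)                       # true preference weights
        options = rng.integers(0, 2, (n_options, n_features))  # has / lacks each feature
        best_true = np.argmax(options @ weights)                # pick using exact weights
        best_count = np.argmax(options.sum(axis=1))             # pick by counting pros
        agree += best_true == best_count                        # ties go to the first option
    return agree / trials

for n_options in (2, 3, 10):
    print(n_options, "options:", procon_agreement(n_options, n_features=10))
```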

First, here is the case Chris analyzed, with only two options, compared to three:


The method does slightly worse, but it’s almost as good as long as there aren’t lots of dimensions. Intuitively, that makes sense; when there are only a couple of things you care about, one of the options probably has more of them than the other — so unless one of the factors is much more important than the others, it’s unlikely that the weights make a big difference. We can check this intuition by looking at our performance with many more options:


With only a few things that we care about, pro/con lists still perform incredibly well, even when there are tons of choices. In fact, with few enough features, they perform even better. This makes sense; if there is a choice that is clearly best, we can pick it, since it has everything we want. This is part of the problem with how the exercise was set up: we are looking at whether each item has or lacks the thing we want — not at its value.

If we have a lot of cars to choose from, and we only care about the four things we listed (30 MPG, 5 seats, sedan, cost < $15,000), picking one that satisfies all of our preferences is easy. But that doesn’t mean we pick the best one! Given a choice between a five-seater sedan that gets 40 MPG and costs $14,000 or one that gets 32 MPG and costs $14,995, our method calls it a tie. (It’s “correct” because we assumed each feature is binary.) There are plenty of algorithmic ways to get around this that are a bit more complex, but any manual pro/con list would make this difference apparent without adding complexity.
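To make the tie concrete, here is the binary scoring for those two hypothetical cars:

```python
# The two hypothetical cars from the example above, scored on the four binary criteria.
cars = {
    "Car A": {"mpg": 40, "seats": 5, "body": "sedan", "price": 14_000},
    "Car B": {"mpg": 32, "seats": 5, "body": "sedan", "price": 14_995},
}

def pro_count(car):
    """Count how many of the four wanted features the car has (binary check)."""
    return sum([car["mpg"] > 30,
                car["seats"] >= 5,
                car["body"] == "sedan",
                car["price"] < 15_000])

for name, car in cars.items():
    print(name, pro_count(car))   # both print 4: the binary encoding calls it a tie
```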

Interestingly, however, with many choices, the method starts doing much worse as the number of feature dimensions grows. Why? In a sense, it’s actually because we don’t have enough choices. But first, let’s talk about weak preferences, and why they make the problem seem harder than it really is.

Who Cares?

If we actually have a list of 10 or 15 features, odds are good that some of them don’t really matter. In algorithm design, we need a computer to make decisions without asking us, so a binary classifier can have problems picking the best of many choices with lots of features — but people don’t have that issue.

If I were to give you a list of 10 things you might care about in a car, some of them wouldn’t matter to you nearly as much as others. So… if we drop elements of the pro/con list that are less than 1/5 as important as the average, how does the method perform?
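In simulation terms, that filter is a one-line change to the earlier sketch (again my simplification, not the linked R code): assume the decision maker knows roughly which features barely matter, and drop any feature whose weight is below a fifth of the average before counting pros.

```python
# Variant of the earlier sketch: ignore features whose (true) weight is less
# than 1/5 of the average weight, then count pros as before.
import numpy as np
rng = np.random.default_rng(1)

def filtered_agreement(n_options, n_features, trials=20_000):
    agree = 0
    for _ in range(trials):
        weights = rng.random(n_features)
        options = rng.integers(0, 2, (n_options, n_features))
        keep = weights >= weights.mean() / 5         # drop low-importance features
        best_true = np.argmax(options @ weights)     # still judged by the full weights
        best_count = np.argmax(options[:, keep].sum(axis=1))
        agree += best_true == best_count
    return agree / trials

print(filtered_agreement(n_options=10, n_features=15))
```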


And this is why I suggested above that when building a pro/con list, we normally leave off really low-importance items — and that helps a bit, usually.

When we have lots of choices, the low-importance features add noise, not useful information:


Of course, we need to be careful, because it’s not that simple! Dropping features when we don’t have very many is a bad idea — we’ll miss the best choice.


The Curse of Dimensionality versus Irrelevant Metrics

We can drop low-importance features, but why does the method work so much worse with more features in the first place? Because, given a lot of features, there are a huge number of possibilities. Five features allow 2⁵ = 32 possible combinations. An option that has all five features we want (or most of them) will be the best choice — and ignoring some of those features, even if they have low weight, will miss that. If we have 50 features, though, we’ll never have anywhere near 2⁵⁰ options, so none will have everything we might want — which is why we want to pay attention to the most important features. And that’s the curse of dimensionality.

If I were really a statistician, that would be an answer. But as a decision theorist, that actually means that our metric is a problem. Picking bad metrics can be a disaster as I have argued at length elsewhere. And our car buyer shows us why.

There are easily a hundred dimensions we could consider when buying a car. Looking at the engine alone, we might consider torque, horsepower, and top speed, to name a few. But most of these dimensions are irrelevant, so we would ignore them in favor of the four things we really care about, listed above; picking the car with the best engine torque that didn’t seat five would be a massive failure.

And in our analysis here, these dimensions are collapsed into a binary, both in our heuristic pro/con list, and in the base case we compared against! As mentioned earlier, this ignores the difference between 32 MPG and 40 MPG, or between $14,000 and $14,995 — both differences we do care about.

And that’s where I think Ben Franklin is cleverer than we gave him credit for initially. He says “I find at length where the Ballance lies…I come to a Determination accordingly.” That sounds like he’s going to list the options, think about the Pros and Cons, and then make a decision — not on the basis of which list is longer — but simply by looking at the question with the information presented clearly.

Note: Code to generate the graphs in R can be found here: https://github.com/davidmanheim/Random-Stuff/blob/master/MultiOption_Pro_Con_Graphs.R