A Tentative Typology of AI-Foom Scenarios

“If a foom-like explosion can quickly make a once-small system more powerful than the rest of the world put together, the rest of the world might not be able to use law, competition, social norms, or politics to keep it in check.” — Robin Hanson

As Robin Hanson recently discussed, there is a lack of clarity about what an “AI Foom” looks like, or how likely it is. He says “In a prototypical “foom,” or local intelligence explosion, a single AI system…” and proceeds to describe a possibility. I’d like to explore a few more possibilities, and discuss a bit more what qualifies as a “foom.” This is not intended as a full exploration, or as a prediction; it merely captures my current thinking.

First, it’s appropriate to briefly mention the assumptions made here:

  • Near Term — Human-level AI is possible in the near term, say within 30 years.
  • No Competitive Apocalypse — A single system will be created first, and other groups will not have resources sufficient to quickly build another system with similar capabilities.
  • Unsafe AI — The AI launched will not have a well-bounded and safe utility function, and will find something to maximize other than what humanity would like.

These assumptions are not certainties, and they are not the topic here — but I will condition the rest of the discussion on them, so that debating them remains reasonable, elsewhere.

What’s (enough for) a “foom”?

With preliminaries out of the way, what would qualify as a “foom,” an adaptation or change that makes the system “more powerful than the rest of the world put together”?

Non-Foom AI X-Risk

There are a few scenarios which lead more directly to existential risk, without passing through a stage of gathering power. (Beyond listing them, I will not discuss these here. Also, names of scenarios given here do not imply anything about the beliefs of the namesake.)

a) Accidental Paperclipping — The goals specified allow the AI system to do something destructive that is irreversible or goes unnoticed. The AI is not sufficiently risk-aware or intelligent to avoid doing so.

b) Purposeful Paperclipping — The goals specified allow the AI system to achieve them, or attempt to do so, by directly doing something destructive that is irreversible or not easily noticed in time.

c) Yudkowskian Simplicity-foom — There are relatively simple methods of vastly reducing the complexity of the systems the AI needs to deal with, allowing the system to better perform its goals. At near-human or human intelligence levels, one or more of those methods becomes feasible. (These might include designing viruses, nano-assemblers, or other systems that could wipe out humanity.)

Fooms

There are a few possibilities I would consider for an AI to become immensely powerful:

a) Yudkowskian Intelligence-foom — The AI is sophisticated enough to make further improvements on itself, and quickly moves from human-level intelligence to super-Einstein levels, and beyond. It can now make advances in physics, chemistry, biology, etc. that make it capable of arbitrarily dangerous behaviors.

b) Hansonian-Em foom — The AI can make efficient and small copies of, or variations on, itself rapidly and cheaply, and is unboxed (or unboxes itself). These human-level AIs can run on little enough hardware, or run enough faster than humans, that they can rapidly hack, exploit, or buy their way to direct control of financial and then physical resources.

c) Machiavellian Intelligence-foom — The AI can manipulate political systems surreptitiously, amassing power directly or indirectly by manipulating individual humans. (Perhaps the AI gains resources and control via blackmail of specific individuals, who are unaware on whose behalf they operate.) The resulting control can prevent coordinated action against the AI, and allow it to gather resources to achieve its unstated nefarious true goal.

d) Asimovian Psychohistory-foom — The AI can build predictive models of human reactions well enough to manipulate them over the medium and long term. (This is different than a Machiavellian-foom only because it relies on models of humans and predictive power rather than humanlike manipulation.)

This is almost certainly not a complete or comprehensive list, and I would be grateful for additional suggestions. What it does allow is a discussion of what makes various types of fooms likely, and consideration of which might be pursued.

AI Complexity and Intelligence Range

The first critical question among these is the complexity of intelligence — I won’t try to estimate this, but others are researching and discussing it. Here, complexity means something akin to computational complexity: the difficulty of running an artificial intelligence of a given capacity. If emulating a small mammal’s brain is possible, but increasing the intelligence of AI from there to human level requires an exponential increase in complexity and computing speed, we will say intelligence is very complex; if it requires only a doubling, it is not. (I assume the computational complexity matters here, and there are no breakthroughs in hardware, quantum computing, or computational complexity theory.)

The related question is the range of intelligence. If beyond human-level AI is not possible given the techniques used to achieve human-level intelligence, or requires an exponential or even a large polynomial increase in computing power, we will consider the range small — even if not bounded, there are near-term limits. Moore’s law (if it continues) implies that the speed of AI thought will increase, but not quickly. Alternatively, if the techniques used to achieve human level AI can be extended easily to create even more intelligent systems by adding hardware, the range is large. This gives us a simplified set of possibilities.

Intelligence vs. Range — Cases


Low-Complexity Intelligence within Large Range — If humans are, as Eliezer Yudkowsky has argued, relatively clustered on the scale of intelligence, the difficulty of designing significantly more intelligent reasoning systems may be within, or not far beyond, human capability. Rapid increases in the intelligence of AI systems above human levels would be a critical threshold, and an existential risk.

Low-Complexity Intelligence within Small Range — If human minds are near a peak of intelligence, near-human or human-level Hansonian Ems may still be possible to instantiate in relatively little hardware, and their relative lack of complexity makes them a potential existential risk.

High-Complexity Intelligence within Small Range — Relatively little existential risk from AI seems to exist, and instead a transition to an “Age of Em” scenario seems likely.

High-Complexity Intelligence within Large Range — A threshold or Foom is unlikely, but incremental AI improvements may still pose existential risks. When a single superintelligent AI is developed, other groups are likely to follow. A singularity may be plausible, where many systems are built with superhuman intelligence, posing different types of existential or other risks.

Human Complexity and Manipulability

The second critical question is human psychology. If human minds can be manipulated more easily by moderately complex AIs than by other humans (which is already significant), AIs might not need to “foom” in the Yudkowskian sense at all. Instead, the exponential increase in AI power and resources can happen via manipulation at an individual level or at a group level. Humans, individually or en masse, may be convinced that AI should be given this power.

Even if perfect manipulation is impossible, classical blackmail or other typical counterintelligence-type attacks may be possible, allowing a malevolent system to manipulate humans. Alternatively, if human-level cognition can be achieved with far fewer resources than a human mind requires, Hansonian-fooms are possible, but so is predictive modeling of individual human minds by a manipulative system.

Alternatively, very predictive models might be built that approximate human behavior, much like Asimov’s postulated psychohistory. This seems unlikely to be as rapid a threat, but AIs in intelligence, marketing, and other domains may specifically target this ability. If human psychology can be understood more easily than expected, these systems may succeed beyond current expectations, and the AI may be able to manipulate humans en masse, without controlling individuals. This is similar to an unresolved debate in history about the relative importance of individuals (a la “Great Man Theory”) versus societal trends.

Conclusion

We don’t know when human-level AI will occur, or what form it will take. Focus on AI-safety may depend on the type of AI-foom that we are concerned with, and a better characterization of these uncertainties could be useful for addressing existential risks of AI deployment.

All of this is speculation, and despite my certain-sounding claims above, I am interested in reactions or debate.

A New Way to Pay for Stadiums?

There is a well known problem with building stadiums. Simply put, it’s a losing proposition for cities, and a way for team owners to bilk the public out of tax funds that could be better spent elsewhere. On the other hand, is it really fair to tell cities and fans that they can’t try to lure a team to come — or stay — by building new stadiums?

Thankfully, Alex Tabarrok has a proposal that I think fits the bill, with some adaptation, called “Dominant Assurance Contracts.” Think of it as a kickstarter for public goods, with an extra safety net. Basically, someone puts up a starter fund for a project that enough people want, and pledges that money irrevocably. Everyone else then decides if they also want the project to happen. If they do, they make a kickstarter-like pledge of a fixed amount, with a special bonus: if the project doesn’t get funded, everyone who pledged is paid a fixed amount out of the starter fund. It’s a sort of compensation for the losers, who are now actually winners.

This has a clever component, which involves some fairly light game theory. Basically, it’s about solving the free rider problem: people don’t contribute, but they still benefit if the project happens, so they “free ride.” If this happens too much, people feel like suckers for paying, more start to cheat, and the system runs out of money.

Here, there’s an incentive not to let that happen; if the project fails and you didn’t contribute, you don’t get paid. So if you’re unsure the project will happen, you can bet on it, and win either way. People who want the project to happen can free ride, but if the system has too many free riders, it’s now (somewhat) self-correcting. And taking a page from Kickstarter, I think the dynamic can be further improved.
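To make the incentives concrete, here is a minimal sketch of the payoff logic in Python. The numbers and the bare-bones two-outcome structure are my own illustration, not Tabarrok’s formal model:

```python
# A minimal sketch of the payoff logic in a dominant assurance contract.
# All numbers here are hypothetical, chosen only for illustration.

def pledger_payoff(pledge, refund_bonus, funded):
    """Payoff to someone who pledges (ignoring the value of the public good itself)."""
    if funded:
        # Project happens: the pledger pays their pledge (and gets the stadium/park/etc.).
        return -pledge
    # Project fails: the pledge is returned, plus a bonus paid out of the starter fund.
    return refund_bonus

def free_rider_payoff(funded):
    """Payoff to someone who doesn't pledge: nothing extra happens either way."""
    return 0.0

# If you expect the project to fail, pledging strictly beats free riding (you
# collect the bonus); if it succeeds, you paid for something you wanted anyway.
for funded in (True, False):
    print(funded,
          pledger_payoff(pledge=100.0, refund_bonus=20.0, funded=funded),
          free_rider_payoff(funded))
```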

How would this work for stadiums? There are a couple of options, and I’ll outline one of them. But first, it’s worth noting that today, stadiums are usually financed by public bonds, which allow the city or state to pay for the stadium upfront and use tax revenue over the next 20, 30, or 50 years to pay off the bond. Sports teams then pay only about 10% of the overall cost from stadium revenue, staying within the requirement that no more than 10% of the cost be paid for by revenue from the project (which keeps the bonds tax-exempt).

Stadium Assurance Contracts, Part 1

One possible use for a Dominant Assurance Contract is to let the team pledge a couple percent of the cost. The city, or the sports team, can then decide how many people it thinks will pledge to build the stadium — and pledgers can be given something in return, such as season tickets.

Let’s consider a brand new, top of the line 75,000 seat football stadium, which costs about $1 billion. The city agrees on the minimum for the team to put up, say $15m, and the public gets a chance to pay for it — on a voluntary basis. The difference here is that teams can run a kickstarter; work out the prices so that if the team can get 15,000 fans to commit to pay for NFL season tickets for the next 30 years, at an average of $2,000/year, the stadium is paid for. (Don’t worry, season tickets for the really good seats already cost $3,000–$4,000, and those are likely the ones that people will want to reserve.)

That’s effectively a bond that pays out football tickets instead of coupon payments, paid for by the team out of its revenue; it pre-sells the seats to pay off the cost. Buyers can always sell their multi-season tickets, potentially at a profit, just as they could a bond. And if the team wants to move before then, it can refund the pledgers the remainder of their commitment — and pay for the remainder of the stadium itself.

In this setup, the team won’t risk its funds if it isn’t sure the fans will pay for it to stay. If the fans and investors don’t see the stadium built, they’ll get an average of about $1,000 each in payout — quite an incentive for simply pledging to buy tickets if the stadium is built. Moreover, if the team can’t find enough fans to pay for a stadium, the team is out quite a bit of money — so it will need to rally its fans to support the stadium.
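For anyone who wants to check the arithmetic in this example, here’s the back-of-the-envelope version, using the same hypothetical numbers as above:

```python
# Rough arithmetic for the hypothetical stadium example above.
stadium_cost = 1_000_000_000   # ~$1B stadium
team_pledge = 15_000_000       # the team's irrevocable starter fund
fans = 15_000                  # season-ticket pledgers
price_per_year = 2_000         # average season-ticket price, per year
years = 30

pledged_revenue = fans * price_per_year * years
print(pledged_revenue)         # $900,000,000 of pledged ticket revenue toward the cost
print(team_pledge / fans)      # ~$1,000 paid to each pledger if the stadium isn't built
```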

Of course, we’ll need a way to convince the cities that this is a better idea than continuing to give in to teams that want public funds — an uphill battle. But if we got cities to try it, it might keep fans, politicians, and economists all happy. And best of all, neither the Giants fans nor the Jets fans will need to help pay taxes for the other team’s stadium.

PS. Thanks to Kevin Chlebik and Jim Stone for their thoughts on the idea — and if anyone else has suggestions, I’d love to hear them here, or on Twitter.

Freedom of Propaganda

The first amendment to the United States constitution reads, in part, “Congress shall make no law… abridging the freedom of speech, or of the press,” and this has been extended quite a bit by the courts. For example, “freedom of speech and of press is accorded aliens residing in this country,” (Bridges v. Wixon, 326 U.S. 135, 148) and anonymity is protected as well (Talley v. California, 362 U.S. 60). These critical rights have both positive and negative repercussions, one of which has been particularly salient in the past year: propaganda. The question I’d like to ask is a simple one: what rights do foreign governments have to intentionally seek to disrupt the internal affairs of the United States?

The obvious answer, of course, is none. Bluman, et al., v. Federal Election Commission reaffirmed this, saying “The Supreme Court has long held that the government (federal, state, local) may exclude foreign citizens from activities that are part of democratic self-government in the United States.” The question, however, is how this is operationally possible given the current landscape of speech and propaganda. Can the money from an untraceable “Super-PAC” be constrained to US funds alone, given the complex web of international financial ownership that exists? Can we really ensure that our political candidates and appointees are not under the influence of foreign intelligence agencies, given their right to privacy? Can we guarantee freedom of speech to Infowars, however blatantly false and obviously insane, while restricting true accounts from RT.com used as disinformation?

The US Constitution famously begins with a purpose: “to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity.” These goals are praiseworthy, but no rights or obligations emerge from them. The president swears to defend the constitution, but it seems clear that he cannot be impeached for failing to insure domestic tranquility, or for undermining the blessings of liberty. The first amendment clearly isn’t intended to allow armies of Russian trolls to masquerade as citizens on Twitter — but that intent seems hard to reconcile with the first amendment’s protection of anonymity.

The question we need to ask is how we can insure domestic tranquility while respecting the rights of our own citizens to anonymously and freely express themselves. I do not have an answer, and I’m deeply worried that the two are not compatible.

A metrics Catch-22 for intervention design

Statisticians have a common frustration: they are called in too late to help. A study is performed, data is collected, and then a statistician is called — just in time to tell everyone that the study was underpowered, the data collected can’t be used for evaluation, or that significance can’t be assessed without a now-impossible-to-reconstruct baseline. I’m here to argue that the opposite can occur as well: calling a statistician too early can doom a project to clinical or policy insignificance.

If you are a statistician, you are puzzled or dismissive at this point — but it’s true. Imagine an educational intervention to improve students’ comfort with fractions. The principal investigator comes to you and says: I learned my lesson last time, and I want help early — how do I design this thing? You collaborate to design or find an appropriate skills test that considers the students’ arithmetic fluency with fractions, do your power calculation, specify your data collection regime, and even randomize the intervention — and then the PI goes off and designs the details of implementation. This sounds ideal, right? You’ve covered all of your statistical bases.

The problem here is that you’ve allowed all the researcher degrees of freedom to wander down the path that leads most directly to testing success. Perhaps the intervention has some drills that happen to match the test question structure, or it focuses a bit more on the types of applications that are tested. That means that unless the PI is incredibly careful (or purposefully incompetent) the intervention was implicitly designed to do well on the test created or selected for it, taking advantage of the simplified metric. This isn’t a statistical concern, but it is absolutely a problem for the generalizability of the study.

This is just Goodhart’s law and principal-agent conflicts in a different guise; it’s possible that testing success perfectly aligns with the end goal of greater long-term understanding of fractions, but it’s far from guaranteed. So behavior warps to optimize for the metric, not the goal.

On the other hand, as Gwern helpfully pointed out to me, the opposite is arguably worse; if you choose the metric after the intervention, you can tailor the metric to the intervention, and again enter the garden of forking paths to essentially cherry pick results. This is widely recognized, but as I mention above, doing the opposite creates its own problems.

How can we avoid this when designing studies? I don’t have a good answer. The choice is between implicitly post-hoc optimizing either the metrics, or the intervention — and so I’ll simply echo Yossarian in observing “that’s some catch, that Catch-22.”

A Quick Confidence Heuristic

Let’s say you have well-informed opinions on a variety of topics. Without information about your long term accuracy in each given area, how confident should you be in those opinions?

Here’s a quick heuristic, for any area where other people have well-informed opinions about the same topics; your confidence should be a function of the distance of your estimate from the average opinion, and the standard deviation of those opinions. I’ll call this the wisdom-of-crowds-confidence level, because it can be justified based on the empirical observation that the average of even uninformed guesses is typically a better predictor than most individual predictions.
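To make the heuristic concrete, here is a rough sketch of one way to operationalize it. The z-score form is my own choice for illustration, not a canonical formula:

```python
import statistics

def crowd_distance(my_estimate, informed_estimates):
    """How far my estimate sits from the crowd of informed opinions, measured in
    standard deviations. The farther out I am, the less confident I should be,
    absent a reason to think I'm a privileged observer."""
    mean = statistics.mean(informed_estimates)
    sd = statistics.stdev(informed_estimates)
    return abs(my_estimate - mean) / sd

# Example: my estimate is 9, while informed opinions cluster around 5.
print(crowd_distance(9, [4, 5, 5, 6, 7]))   # roughly 3 standard deviations out
```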

Why does this make sense?

The Aumann agreement theorem implies that rational discussants can, given enough patience and introspection, pass messages about their justifications until they eventually converge. Given that informed opinions share most evidence, the differential between the opinions is likely due to specific unshared assumptions or evidence. If that evidence were shared, unless the vast majority of the non-shared assumptions were piled up on the same side, the answer would land somewhere near the middle. (This is why I was going to call the heuristic Aumann-confidence, but I don’t think it quite fits.)

Unless you have a strong reason to assume you are a privileged observer (trading on inside information, or much better calibrated than other observers), there is no reason to expect this nonshared evidence to be biased. And while this appears to contradict the conservation of expected evidence theorem, it’s actually kind of a consequence of it, because we need to update on the knowledge that there is unshared evidence leading the other person to make their own claim.

This is where things get tricky — we need to make assumptions about joint distributions over unshared evidence. Suffice it to say that unless we have reason to believe our unshared evidence or assumptions are much stronger than theirs, we should end up near the middle. And that goes back to a different, earlier assumption – that others are also well informed.

Now that we’ve laid out the framework, though, we can sketch the argument.

  1. We can expect that our opinion should shift towards the average, once we know what the average is, even without exploring the other people’s unshared assumptions and data. The distance it should shift depends on how good our assumptions and data are compared to theirs.
  2. Even if we have strong reasons for thinking that we understand why others hold the assumptions they do, they presumably feel the same way about us.
  3. And why do you think your unshared evidence and assumptions are so great anyways, huh? Are you special or something?

Anyways, those are my thoughts.

Comments?

(Originally posted here on LessWrong)

Chasing Superior Good Syndrome vs. Baumol’s (or Scott’s) Cost Disease

Slatestarcodex had an excellent (as always) piece on “Considerations on Cost Disease.” It goes over a number of reasons, aside from Baumol’s cost disease, for why everything in certain sectors, namely healthcare and education, has gotten much more expensive. I think it misses an important dynamic, though, that I’d like to lay out.

First, though, he has a list of eight potential answers, each of which he partly dismisses. Cost increases are really happening, and markets mostly work, so it’s not simply a market failure. Government inefficiency and overregulation don’t really explain large parts of the problem, nor does fear of lawsuits. Risk tolerance has decreased, but that seems not to have been the sole issue. Cost shirking by some people might increase costs a bit, but that isn’t the whole picture. Finally, not on that list but implicitly explored when Scott refers to “politics,” is Moloch.

I think it’s a bit strange to end a piece with a long list of partial answers, which plausibly explain the vast majority of the issue, with “What’s happening? I don’t know and I find it really scary.” But I think there is another dynamic that’s being ignored — and I would be surprised if an economist ignored it, but I’ll blame Scott’s eclectic ad-hoc education for why he doesn’t discuss the elephant in the room — superior goods.

Superior Goods

For those who don’t remember their Economics classes, imagine a guy who makes $40,000/year and eats chicken for dinner 3 nights a week. He gets a huge 50% raise, to $60,000/year, and suddenly has extra money to spend — his disposable income probably tripled or quadrupled. Before the hedonic treadmill kicks in, and he decides to waste all the money on higher rent and nicer cars, he changes his diet. But he won’t start eating chicken 10 times a week — he’ll start eating steak. When people get more money, they replace cheap “inferior” goods with expensive “superior” goods. And steak is a superior good.

But how many times a week will people eat steak? Two? Five? Americans as a whole got really rich in the 1940s and 1950s, and needed someplace to start spending their newfound wealth. What do people spend extra money on? Entertainment is now pretty cheap, and there are only so many nights a week you’ll see a movie, and only so many $20/month MMORPGs you’re going to pay for. You aren’t going to pay 5 times as much for a slightly better video game or movie — and although you might pay double for 3D-Imax, there’s not much room for growth in that 5% of spending.

The Atlantic had a piece on this several years ago, with the following chart;


Food, including rising steak consumption, decreased to a negligible part of people’s budgets, as housing started rising. In this chart, the reason healthcare hasn’t really shot up to the extent Scott discussed, as the article notes, is that most of the cost is via pre-tax employer spending. The other big change the article discusses is that after 1950 or so, everyone got cars, and commuted from their more expensive suburban houses — which is effectively an implicit increase in housing cost.

And at some point, bigger houses and nicer cars begin to saturate; a Tesla is nicer than my Hyundai, and I’d love one, but not enough to upgrade for 3x the cost. I know how much better a Tesla is — I’ve seen them.

Limitless Demand, Invisible Supply

There are only a few things that we have a limitless demand for, but very limited ability to judge the impact of our spending. What are they?

I think this is one big missing piece of the puzzle; in both healthcare and education, we want improvements, and they are worth a ton, but we can’t figure out how much the marginal spending improves things. So we pour money into these sectors.

Scott thinks this means that teachers’ and doctors’ wages should rise, but they don’t. I think it’s obvious why: their supply isn’t very limited. And the marginal impact of two teachers versus one, or a team of doctors versus one, isn’t huge. (Class size matters, but we have tons of teachers — with no shortage in sight, there is no price pressure.)

What sucks up the increased money? Dollars, both public and private, chasing hard to find benefits.

I’d spend money to improve my health, both mental and physical, but how? Extra medical diagnostics to catch problems, pricier but marginally more effective drugs, chiropractors, probably useless supplements — all are exploding in popularity. How much do they improve health? I don’t really know — not much, but I’d probably try something if it might be useful.

I’m spending a ton of money on preschool for my kids. Why? Because it helps, according to the studies. How much better is the $15,000/year daycare versus the $8,000 a year program a friend of mine runs in her house? Unclear, but I’m certainly not the only one spending big bucks. Why spend less, if education is the most superior good around?

How much better is Harvard than a subsidized in-state school, or four years of that school versus two years of cheap community college before transferring in? The studies seem to suggest that most of the benefit is really due to the kids who get into the better schools, not the schooling itself. And Scott knows that this is happening.

We pour money into schools and medicine in order to improve things, but where does the money go? Into efforts to improve things, of course. But I’ve argued at length before that bureaucracy is bad at incentivizing things, especially when goals are unclear. So the money goes to sinkholes like more bureaucrats and clever manipulation of the metrics that are used to allocate the money.

As long as we’re incentivized to improve things that we’re unsure how to improve, the incentives to pour money into them unwisely will continue, and costs will rise. That’s not the entire answer, but it’s a central dynamic that leads to many of the things Scott is talking about — so hopefully that reduces Scott’s fears a bit.

Deceptive Dataviz and Confusing Data; Uncomparables in Education

I don’t actually want to talk about dataviz here, I want to talk about the data that is visualized. I routinely see graphs that are not (necessarily) bad as graphs, but that present data which shouldn’t be graphed together at all. There are plenty of examples of unreasonably constrained axes, or simply incorrect bar heights — but that’s not the problem for today.

Today, I want to give an example of data that is displayed as if the information is comparable, when it isn’t – like dollars and scores, or percentage improvement versus totals. What do I mean? I have a great example!

This graph is a masterpiece of the errors I am talking about. And it seems the very recently deceased Dr. Coulson is being maligned by a wiki article on Cato attributing this graph to him. (At the very least, the original seems to have kept dollars and percentages separate.) This graph tries hard to make incomparable data comparable, by displaying percentage change of a variety of incomparable datasets — which is better than showing the incomparable raw data, right?

Well, no. At least not here. But why are they incomparable?

First, we have NAEP scores, which are inconsistently measured over the time period; the meaning of the metric changed repeatedly over the time period displayed, as academic standards have been altered to reflect the changing abilities and needs of students.

They are also scores, and as I’m sure everyone is aware, the difference between a 1300 and a 1400 on the SAT is much smaller than the difference between a 1500 and a 1600. Percentage improvements on these tests are not a great comparison. NAEP scores are also range-bound: they run from 0 to 500, so doubling the math score is not only a nonlinear measure of improvement, but in most cases literally impossible, since the score is already around 300.
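A quick toy calculation makes the problem concrete. The numbers here are hypothetical, chosen only to show why overlaying percent changes of a bounded score and unbounded dollars misleads:

```python
# Hypothetical numbers, only to illustrate why the percentages aren't comparable.
score_then, score_now = 300, 310        # a score on a bounded 0-500 scale
spend_then, spend_now = 5_000, 10_000   # dollars per pupil, with no upper bound

print((score_now - score_then) / score_then)   # ~3.3% "improvement" in the score
print((spend_now - spend_then) / spend_then)   # 100% increase in spending

# From 300, the score can rise at most (500 - 300) / 300 ~= 67%, and the scale
# isn't linear in ability anyway; plotting both as "% change" invites a false comparison.
print((500 - score_then) / score_then)
```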

Next, the basis for all of these numbers is non-constant, in an interesting way. The chart presents enrollment as a total, but ignores the changing demographic mix — and no, this isn’t about the soft bigotry of low expectations, it’s about the expanding school population. Expanding? Yes — because enrollment stayed roughly constant while the school-age population was shrinking. (Chart by Bill McBride)


The 1970s were the height of the baby boom — and the percentage of people who were going to school was still on an upwards trend;


The totals were flat, but the demographic split wasn’t, and the percentage of low achievers, who are the least likely to attend, was increasing. And the demographic composition of schools matters. But I won’t get into divergent birth rates and similar demographic issues any further for now.

But what about cost? I mean, clearly that can’t be deceptive — we’re spending more, because we keep hiring more teachers, like the chart seems to show! But we aren’t — teachers only increased by about 50% in that time, not nearly 100%. But the chart isn’t wrong — schools are hiring more staff (largely to deal with regulations, as I’m sure Cato would agree).


And this also explains why total cost went up — we have way more non-teacher staff, many of whom are much more expensive. We’re also neglecting the fact that the country is richer: as a share of GDP, teacher pay has fallen way behind, because we pay teachers the same amount while the economy as a whole grew. But that’s a different issue.

So yes, we can show a bunch of numbers correctly on a chart, but it won’t mean what it looks like if we’re sloppy — or purposefully misleading.

The good, the bad, and the appropriately under-powered

Many quantitative studies are good — they employ appropriate methodology, have properly specified, empirically valid hypotheses registered before data collection, then collect sufficient data transparently and appropriately. Others fail at one or more of these hurdles. But a third category also exists; the appropriately under-powered. Despite doing everything else right, many properly posed questions cannot be answered with the potentially available data.

Two examples will illustrate this point. It is difficult to ensure the safety and efficacy of treatments for sufficiently rare diseases in the typical manner, because the total number of cases can be insufficient for a properly powered clinical trial. Similarly, it is difficult to answer a variety of well-posed, empirical questions in political science, because the number of countries to be used as samples is limited.

What are the options for dealing with this phenomenon? (Excepting the old unacceptable standby of p-hacking, multiple comparisons, etc., and hoping the journal publishes the study anyway.) I think there are three main ones, none of which are particularly satisfactory.

  1. Don’t try to answer these questions empirically, use other approaches.
    If data cannot resolve the problem to the customary “standard” of p<0.05, then use qualitative approaches or theory driven methods instead.
  2. Estimate the effect and show that it is statistically non-significant.
    This will presumably be interpreted as the effect having a small or insignificant practical effect, despite the fact that that isn’t how p-values work.
  3. Do a Bayesian analysis with comparisons of different prior beliefs to show how the posterior changes.
    This will not alter the fact that there is too little data to convincingly show an answer, and is difficult to explain (a minimal sketch follows this list). Properly uncertain prior beliefs will show that the answer is still uncertain after accounting for the new data, but will perhaps shift the estimated posterior slightly toward the data, and narrow the distribution.
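As a rough illustration of the third option, here is a minimal sketch of a conjugate normal-normal update with made-up numbers. With an underpowered sample, the posterior shifts only slightly from the prior and stays wide:

```python
import math

# Minimal normal-normal conjugate update with made-up numbers, illustrating
# option 3: with too little data, the posterior stays close to the prior.
prior_mean, prior_sd = 0.0, 1.0      # properly uncertain prior belief about the effect
data_mean, n, sigma = 0.4, 8, 2.0    # small, noisy sample
se = sigma / math.sqrt(n)            # standard error of the sample mean

post_precision = 1 / prior_sd**2 + 1 / se**2
post_mean = (prior_mean / prior_sd**2 + data_mean / se**2) / post_precision
post_sd = math.sqrt(1 / post_precision)

print(post_mean, post_sd)   # ~0.27 and ~0.58: a slight shift toward the data, still wide
```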

At the end of the day, we are left with the unsatisfying conclusion that some questions are not well suited to this approach, and when being honest we should not claim that the scientific or empirical evidence should shift people’s opinions much. That’s OK.

Unless, perhaps, someone out there has clearer answers for me?

A cruciverbalist’s introduction to Bayesian reasoning

Mathematical methods inspired by an eighteenth century minister (8)

“Bayesian” is a word that has gained a lot of attention recently, though my experience tells me most people aren’t exactly sure what it means. I’m fairly confident that there are many more crossword-puzzle enthusiasts than Bayesian statisticians — but I would also note that the overlap is larger than most would imagine. In fact, anyone who has ever worked on a crossword puzzle has employed Bayesian reasoning. They just aren’t (yet) aware of it. So I’m going to explain both how intuitive Bayesian thinking is, and why it’s useful, even outside of crosswords and statistics.

But first, who was Bayes, what is his “law” about, and what does that mean?

Sound of a Conditional Reverend’s Dog (5)

“Bayes” of statistical fame is the Reverend Thomas Bayes. He was a theologian and mathematician, and the two works he published during his lifetime dealt with the theological problem of happiness, and a defense of Newton’s calculus — neither of which concern us. His single posthumous work, however, was what made him a famous statistician. The original title, “A Method of Calculating the Exact Probability of All Conclusions founded on Induction,” clearly indicates that it’s meant to be a very inclusive, widely applicable theorem. It was also, supposedly, a response to a theological challenge posed by Hume — claiming miracles didn’t happen.

Wonders at distance travelled without vehicle upset (8)

“Miracles”, Hume’s probabilistic argument said, are improbable, but incorrect reports are likely — so, the argument goes, it is more likely that the reports are incorrect than that the miracle occurred. This way of comparing probabilities isn’t quite right, statistically, as we will suggest later. But Bayes didn’t address this directly at all.

Taking a risk bringing showy jewelry to school (8)

“Gambling” was a hot topic in eighteenth-century mathematics, and Bayes tried to answer an interesting question: when you see something happen several times, how can you figure out, in general, the probability of it occurring? His example was about throwing balls onto a table — you aren’t looking, and a friend throws the first ball. After this, he throws more, each time telling you whether the ball landed to the left or right of the first ball. After he has done this a few times, you still haven’t seen the table, but want to know how likely it is that the next ball will land to the left of that original ball.

To answer this, he pointed out that you get a bit more information about the answer every time a ball is thrown. After the first ball, for all you know the odds are 50/50 that the next one will be on either side. After a few balls are thrown, you get a better and better sense of what the answer is. After you hear that the next five balls all landed to the left, you’ve become convinced that the next ball landing to the left is more likely than landing to the right. That’s because the probabilities are not independent — each answer gives you a little bit more information about the odds.
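For the curious, here is a small simulation of Bayes’ table. The code is my own sketch rather than his derivation, but it recovers the classic answer that after k of n balls land to the left, the chance the next one lands left is (k + 1) / (n + 2):

```python
import random

# Monte Carlo sketch of Bayes' billiard-table setup: the first ball fixes an
# unseen position p (uniform across the table); each later ball lands to its
# left with probability p. Given that k of n balls landed left, how likely is
# it that the next one does too? The analytic answer is (k + 1) / (n + 2).
def prob_next_left(n, k, trials=200_000):
    hits = matches = 0
    for _ in range(trials):
        p = random.random()                        # unseen position of the first ball
        lefts = sum(random.random() < p for _ in range(n))
        if lefts == k:                             # keep only tables matching the reports
            matches += 1
            hits += random.random() < p            # does the next ball land left?
    return hits / matches

print(prob_next_left(n=5, k=5))   # ~0.857, i.e. (5 + 1) / (5 + 2)
```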

But enough math — I’m ready to look at a crossword.

Could wine be drunk by new arrival? (6)

“Newbie” is how I’d prefer to describe my ability with crossword puzzles. But as soon as I started, I noticed a clear connection. The methods of reasoning I practice and endorse as a decision theorist are nearly identical to the methods that are used by people in this everyday amusement. So I’ll get started on filling in (only one part of) the crossword I did yesterday, and we’ll see how my Bayesian reasoning works. I start by filling in a few easy answers, and I’m pretty confident in all of these: 6 Down — Taxing mo. for many, 31 Across — Data unit, 44 Across — “Scream” actress Campbell.


The way I’ve filled these in so far is simple — I picked answers I thought were very likely to be correct. But how can I know that they are correct? Maybe I’m fooling myself. The answer is that I’ve done a couple crosswords before, and I’ve found that I’m usually right when I’m confident, and these answers seem really obvious. But can I apply probabilistic reasoning here?

Distance into which vehicle reverses ___ that’s a wonder (7)

“Miracles,” or anything else, according to Reverend Bayes, should follow the same law as thrown balls. If someone is confident, that is evidence, of a sort. Stephen Stigler, a historian of math, argues that Bayes was implying an important caveat to Hume’s claim — the probability of hearing about a miracle increases each time you hear another report of it. That is, these two facts are, in a technical sense, not independent — and the more independent accounts you hear, the more convinced you should be.

But that certainly doesn’t mean that every time a bunch of people claim something outlandish, it’s true. And in modern Bayesian terms, this is where your prior belief matters. If someone you don’t know well at work tells you that they golfed seven under par on Sunday, you have every reason to be skeptical. If they tell you they golfed seven over par, you’re a bit less likely to be skeptical. How skeptical, in each case?

We can roughly assess your degree of belief — if a friend of yours attested to the second story, you’d likely be convinced, but it would take several people independently verifying the story for you to have a similar level of belief in the first. That’s because you’re more skeptical in the first place. We could try to quantify this, and introduce Bayes’ law formally, but there’s no need to bring algebra into this essay. Instead, I want to think a bit more informally — because I can assess something as more or less likely without knowing the answer, without doing any math, and without assigning it a number.

When you hear something outlandish, your prior belief is that it is unlikely. Evidence, however, can shift that belief — and enough evidence, even circumstantial or tentative, might convince you that the claim is plausible, probably, or even very likely. And in a way it doesn’t matter what your prior is, if you can accumulate enough different pieces of trustworthy evidence. And that leads us to how I can use the answers I filled in as evidence to help me make further plausible guesses.

I look at some of the clues I didn’t immediately figure out. I wasn’t sure what 6 Across — Completely blows away, would be; there are lots of 4-letter words that might fit the clue. Once I get the A, however, I’m fairly confident in my guess, conditional on this (fairly certain) new information. I look at 31 Down — Military Commission (6), but I can’t think of any that start with a B. I see 54 Across — Place for a race horse, and I’m unsure — there are a few words that fit — it could be “first”, “third”, “fifth,” “sixth” or “ninth”, and I have no reason to think any more likely than another. So I look for more information, and notice 54 Down — It might grow to be a mushroom (5, offscreen). “Spore” seems likely, and I can see that this means “Sixth” works — so I fill in both.
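Here is a tiny sketch of that cross-checking as a computation. The candidate lists are simplified from the clues above, just to show how one shared square prunes the joint possibilities:

```python
# Toy version of using 54 Down to constrain 54 Across (candidate lists simplified).
across_candidates = ["FIRST", "THIRD", "FIFTH", "SIXTH", "NINTH"]  # Place for a race horse
down_guess = "SPORE"                                               # It might grow to be a mushroom

# 54 Across and 54 Down share their first square, so their first letters must match.
consistent = [word for word in across_candidates if word[0] == down_guess[0]]
print(consistent)   # ['SIXTH'] -- the joint constraint leaves only one of the five options

# Every filled-in answer works this way: it prunes the joint space of possibilities,
# which is why confidence in the neighboring answers rises as the grid fills in.
```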


At this point, I can start filling in a lot more of the puzzle, and the pieces are falling in to place — each word I figure out that fits is a bit more evidence that the others are correct, making me confident, but there are a few areas where I seem stuck.


Being stuck is evidence of a different sort — it probably means at least one of two things — either I have something incorrect, or I’m really bad at figuring out crosswords. Or, of course, both.

At this point I start revisiting some of my earlier answers, ones I was pretty confident about until I got stuck. I’m still pretty confident in 39 Down — Was at one time, but ___ now. “Isn’t” is too obvious of an answer to be wrong, I think. On the other hand, 38 Down — A miscellany or collection, has me stumped, but two Is in a row also seem strange. 37 Down — Small, fruity candy, is also frustrating me; I’m not such an expert in candy, but I’m also not coming up with anything plausible. So I look at 50 Across — A tiny part of this?, again, and re-affirm that “Bit” seems like it’s a good fit. I’m now looking for something that can give me more information, so I rack my brains, and 36 Across — Ho Chi Minh’s capital, comes to me: Hanoi. I’m happy that 39 Down is confirmed, but getting nervous about the rest.

I decided to wait, and look elsewhere, filling in a bit more where I could. My progress elsewhere is starting to help me out.


Now, I need to re-evaluate some earlier decisions and update my beliefs again. It has become a bit more complex than evaluating single answers — I need to consider the joint probability of several different things at once. I’ll unpack how this relates to Bayesian reasoning afterwards, but first, I think I made a mistake.

I was marginally confident in 50 Across — A tiny part of this? as “bit”, but now I have new evidence. I’m pretty sure Nerb isn’t a type of candy, but “Nerd” seems to fit. I’m not sure if they are fruity, so I’m not confident, and I’m still completely at a loss on 38 Down — A miscellany or collection. That means I need to come up with an alternative for 50 Across; “Dot” seems like an unlikely option, but it fits really well. And then it occurs to me; A dot is a little bit of the question mark. That’s an annoying answer, but it seems a lot more likely than that “Nerb” is a type of candy. And I’m not sure what Olio is, but there’s really nothing else that I can imagine fitting. And there are plenty of words I don’t know. (As I found out later, this is one of them.)

At first, I had a high confidence that “Bit” was the best answer for 50 Across — I had a fairly strong prior belief, but I wasn’t certain. As evidence mounted, I started to re-assess. Weak evidence, like the strange two Is in a row, made me start to question the assumption that I was right. More weak evidence — remembering that there is a candy of some sort called Nerds, and realizing that “Dot” was a potential answer, made me revise my opinion. I wasn’t strongly convinced that I had everything right, but I revised my belief. And that’s exactly the way a Bayesian approach should work; you’re trying to figure out which possibility is worth betting on.

That’s because all of probability theory started with a simple question that a certain gambler asked Blaise Pascal: how do we split the pot when a game gets interrupted? And historians who don’t think Bayes was trying to formulate a theological rebuttal to Hume suggest that he was really responding to a question posed by de Moivre — from whose book he may have learned probability theory. That betting origin is what we need in order to figure out why I’d pick “Dot” over “Bit” — even though I think it’s a stupid answer. But before I get there, I’ve made a bit more progress — I’m finished, except for one little thing.


31 Down — Military Commission. That’s definitely a problem — I’m absolutely sure Brevei isn’t the right answer, and 49 Down, offscreen, is giving me trouble too. The problem is, I listed all the possible answers for 54 Across — Place for a race horse, and the only one that started with an “S” was sixth.

Conviction … or what’s almost required for a conviction (9)

“Certainty” can be dangerous, because if something is certain, almost by definition, it means nothing can convince me otherwise. It’s easy to be overconfident, but as a Bayesian, it’s dangerous to be so confident that I don’t consider other possibilities — because I can’t update my beliefs! That’s why Bayesians, in general, are skeptical of certainty. If I’m certain that my kid is smart and doing well in school, no number of bad grades or notes from the teacher can convince me to get them a tutor. In the same way, if I’m certain that I know how to get where I’m going, no amount of confused turns, circling, or patient wifely requests will convince me to ask for directions. And if I’m certain that “Place for a race horse” is limited to a numeric answer, no number of meaningless words like “Brevei” can change my mind.

High payout wagers (9)

“Perfectas” are bets placed on a horse race, predicting the winner and second-place finisher together. If you get them right, the payoff can be really significant — much more than bets on horses to win or to place. In fact, there are lots of weird betting terms in horse racing, and by excluding them from consideration, I may have been hasty in filling out “sixth.” My assumption of having compiled an exhaustive list of terms was premature. Instead, I need to reconsider once again — and that brings us to why, in a probabilistic sense, crosswords are hard.

Disreputable place for a smoke? (5)

“Joint” probabilities are those that relate to multiple variables. And when solving the crossword, I’m not just looking to answer each clue, I’m looking to fill in the puzzle — it needs to solve all of the clues together. Just as figuring out a Perfecta is harder than picking the right horse, putting multiple uncertain questions together is where joint probabilities show up. But it’s not hopeless; as you figure out more of the puzzle, you reduce the remaining uncertainty. It’s like getting to place a Perfecta bet after seeing 90% of the race; you have some pretty good ideas about what can and can’t happen.

Similarly, Bayesians, in general, collect evidence to constrain what they think is and isn’t probable. Once enough balls have been thrown to the left of that first one, you get pretty sure the odds aren’t 50–50. The prerequisite for getting the right answer, however, is being willing to reconsider your beliefs — because reality doesn’t care what you believe.

And the reality is that 31 Down is Brevet, so I need an answer to 54 Across — Place for a race horse that starts “St”. And that’s when it hit me — sometimes, I need to simply be less certain I know what’s going on. The race horse isn’t running, and there are no bets. It’s in a stall, waiting patiently for me to realize I was confused.


A Final Note

I’d note three key lessons that Bayesians can learn from crosswords, since I’ve already spent pages explaining how Crossworders already understand Bayesian thinking. And they are lessons for life, ones that I’d hope crossword enthusiasts can apply more generally as well.

  1. The process of explicitly thinking about what you are uncertain of, and noticing when something is off, or you are confused, is useful to apply even (especially!) when you’re not doing crossword puzzles.
  2. Evaluating how sure you are, and wondering if you are overconfident in your model or assumptions, would have come in handy to those predicting the 2016 election.
  3. Being willing to actually change your mind when presented with evidence is hard, but I hope you’d rather have a messy crossword than an incorrectly solved one.

A Postscript for Pedants

Scrupulously within the rules, but not totally restrictive

“Strict” Bayesians are probably annoyed about some of this — at no point in the process did I get any new evidence. No one told me about any new balls thrown; I only revised my belief based on thinking. A “Real Bayesian” starts with all the evidence already available, and only updates when new evidence comes in. For a non-technical response, it’s sufficient to note that computation and thought take time, and although the brain roughly approximates Bayesian reasoning, the process of updating is iterative. And for a technical version of the same argument, I’ll let someone else explain that there are no real Bayesians. (And thanks to Noah Smith for that link!)

The crossword clues were a combination of info from http://www.wordplays.com/crossword-clues/, and my own inventions.
The crossword is an excerpt from Washington Post Express’s Daily Crossword for January 11th, 2017, available in full on Page 20, here: https://issuu.com/expressnightout/docs/express_01112017

“Bearish” on Z-Cash

I recently made my 2017 predictions, and was asked why I was “bearish” on Z-Cash. I predicted a 25% chance that the price would rise, and a 75% chance that the market cap would do so, over the course of the year.

I’m not sure this is really bearish. First, after 2 months, there are currently about 375,000 ZEC minted, of which 300,000 are in circulation. (Block 40,000.) I’m not sure of the exact schedule — the block reward should only halve after 840,000 blocks, well over a year away — but in 12 months, there should be roughly 7 times as many coins. That means that, by the end of the year, at current prices, the market cap would move from $20m to closer to $140m. So the market cap would need to increase significantly in order for the price merely to stay stable.
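A quick back-of-the-envelope check of that claim, taking the rough figures above as given rather than modeling the emission schedule exactly:

```python
# Back-of-the-envelope check, using the post's own rough figures (not an exact
# model of the Zcash emission schedule).
current_market_cap = 20e6    # ~$20m market cap now, per the estimate above
supply_multiple = 7          # roughly 7x as many coins in 12 months, per the estimate above

# If supply multiplies while the price stays flat, market cap must grow by the same factor.
required_market_cap = current_market_cap * supply_multiple
print(required_market_cap)   # ~$140m of market cap needed just to hold the price steady
```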

Is this implausible? No. But it would probably involve cannibalizing much of the dark-web market share from Monero, (and darkweb markets won’t necessarily switch to new coins quickly,) or a speculative price bubble that extends through the end of the year. I am bullish on Z-Cash over the longer term, but it’s riding on speculation now, and I’d be a little bit surprised if it managed to attract that large a market cap within the year. Because at some point, as more coins are generated and speculators stop pouring in money, the fundamentals take over from the speculators. Perhaps only 25% was overconfident — but I’m definitely not certain of an increase.