My 2019 Predictions

I am once again making a yearly set of public predictions, which I will publicly score early next year, as I did last year and the year before (as well as in other formats and/or less publicly in various forums even earlier). This is, essentially, a vanity project — predictions benefit greatly from frequent updating, discussion, consensus, and aggregation (especially if the latter two are managed carefully — an interesting and ongoing area of research). It is helpful to me personally to reflect on my successes and failures, and to gain insight into my (lack of) accuracy and calibration, but that’s not a reason for anyone else to care.

At the same time, I’ve agreed with those who say it’s important to publicly signal the baseline willingness to make frequent, concrete, and publicly evaluated predictions. Not only do I take the predictions of those who fail to do this less seriously, I have publicly called out pundits, reporters, analysts and others who fail to do this, and will probably do so again in the future. If you talk about the future but don’t make your record clear, you’re doing a disservice to the public. This is doing my very small part to make publicly accountable predictions a norm.

The initial predictions are as of 1–13–2019, but will (probably) be updated to include my estimates on SSC’s predictions, as well as the expected Vox Future Perfect predictions. Edit to add: As of 1/16, I am predicting on all of the Vox Future Perfect (VFP) predictions. Their predictions are in {}.

Now, on to the show.


It’s a lot easier to do these predictions in an election year, but I have a few easy-to-quantify things I will mention. Also, since Israeli elections are coming up (in April,) and I’m in Israel, I’m going to risk making a fool of myself and predict a few things on those.

Trump’s RCP average approval rating on 1/1/19 is above 30%/35%/40%/45%/50%, respectively: 95% / 85% / 50% / 40% / 5%.

Trump still president at end of year: 96% {90%} (Note: I was predicting this question before VFP, but they included it.)

VFP: No Democratic presidential candidate will become a clear frontrunner (Predictwise probability of nomination >50%) in the political prediction markets at any point in 2019: 75% {60%}

VFP: The US will not enter a recession: 65% {80%} (My scoring assumes we use NBER’s retrospective peak month. They usually delay announcing for about a year, so this likely can’t be scored until 2020.)

VFP: Congress will not authorize funding for a full-length border wall: 98% {95%} (“Full length” is cheating.)

Added Q: Congress will authorize funding for a border wall of at least $5.7bn: 15%

VFP: US homicides will decline: 75% {80%}


VFP: The United Kingdom will leave the European Union: 65% {80%} (I think an extension past end of March is likely, and cancellation or extensions pushing past Jan 1, 2020 are possible.)

Added: Brexit will be delayed past March 29th (or cancelled): 51%

VFP: Narendra Modi will continue as Indian prime minister after the 2019 elections: 70% {60%} (I’m not better informed than Dylan and Kelsey, but I have a stronger trust in polls + a stronger prior that dislike of the opposition will translate into a win.)

VFP: Neither India nor China will enter a recession: 80% {70%} (Similar to Dylan’s reasoning, but stronger. But joint questions are annoying.)

Added: India will not enter a recession: 85%

Added: China will not enter a recession: 85%


Netanyahu is prime minister after the Israeli elections: 80%

Netanyahu’s party gets the most votes: 85%

Jewish home passes threshold / gets 6 seats: 60% / 35%

Arab parties’ (total) seats decline to 11: 70%. (Splitting is dumb, but seems inevitable.)


VFP: Malaria deaths will decrease: 75% {80%} (Strongly based on their guess — they know more than I do about this.)

VFP: No additional countries will adopt a universal basic income: 80% {90%} (There are lots of countries that might do something, and the idea is gaining traction, so I’m hedging.)

VFP: More animals will be killed for US human consumption in 2019 than in 2018: 75% {60%} (The trend is strong, the economy is fine. I’m confused that they have their probability so low.)


VFP: Impossible Burger meat will be sold in at least one national grocery chain: 95% {95%}

VFP: Fully autonomous self-driving cars will not be commercially available as taxis or for sale: 70% {90%} (Even if they aren’t price competitive, there’s a huge cachet in being first to market. Someone wants to do it, even if the tech is still too expensive. But they say “real commercial product,” so they might hedge if it’s offered but far too expensive, etc.)

VFP: DeepMind will release an AlphaZero update, or new app, capable of beating humans and existing computer programs at a task in a new domain: 60% {50%} (AlphaGo was October 2015, AlphaZero was Dec. 2017. I assume they have more projects in the works — unclear if they will release them.)


VFP: Average world temperatures will increase relative to 2018: 65% {60%}

VFP: Global carbon emissions will increase: 80% {80%}

My earlier long-term predictions:

(2017) There will be a Republican primary challenger getting >10% of the primary vote in 2020 (conditional on Trump running) — 60% (was 70%. I’m thinking about total popular vote, and given the structure of primaries, this is a higher bar than I initially thought about. Still, there are a LOT of Republicans who hate him, and many more public figures who would switch over if they weren’t scared of what happens when Trump wins.)

(2017) The stock market [Edit: S&P] will go down under President Trump (conditional on him having a 4-year term, Inauguration-to-Inauguration) — 60% (no change, was 60%. But I’m affirming because while a split Congress usually means markets go up, I have greater concerns. I’m updating based partly on results so far, with markets up, and partly on my suspicions that the current gyrations will get worse, and that the current economic mismanagement really is a big problem.)

(2018) The retrospective consensus of economists about the 2017 tax bill will be:
…didn’t increase GDP growth more than 0.2%: 96% (was: 95%)
…that, after accounting for growth, it increased the 10-year deficit
more than $1tr / $1.2tr / $1.5tr, respectively: 93% / 80% / 45% (was: 90% / 70% / 40%. See recent article about how poorly it’s working out.)

(2018) The House will vote to impeach Trump before the end of his current term: 75% (was 65%) Note: 50% vote needed.

(2018) Conditional on impeachment, the senate will convict: 10% (was 20%) Note: 67% vote needed. (Most uncertainty is if he does something additionally crazy, crazy enough to prompt short term worries about safety/stability.)


Still living in Israel at end of year: 97%

I have (some) official academic affiliation: 60%.

I have an affiliation with: F-O/CC-C/Te-I/Other: 40% / 40% / 30% / 20%

My multi-Agent Goodhart paper is accepted into the special issue: 60%

I publish or submit pre-prints of at least 1/2/3 more papers: 90%/80%/60%.

My Google Scholar H-index hits 7 / 8 / 9: 65% / 35% / 5%

My actual (no-self cites, includes non-google sources) H-index hits 6 / 7 : 70% / 30%

2018 Predictions — Accuracy and Score

OK, it’s (slightly after) that time of year again, and I need to make my new predictions. But you can’t improve unless you figure out how you’ve been doing, so here’s my review of last year’s predictions. (I still need to put in my calibration numbers, and those aren’t looking great for me.)

Note: The first set are my own, the second (numbered) set are following Scott Alexander’s predictions. I have the outcome as a binary yes/no in bold preceding the question. I have removed many of my earlier comments, and added new ones in bold. Feel free to go back and make sure I’m not cherry picking to make myself look less stupid.

(I’m taking out the long-term predictions that I can’t yet grade.)

Brier Score: 0.1957. Ouch! (Brier Score w/o Crypto Qs: 0.1374 But listing this score is just me cheating to feel better.)
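For context, the Brier score is just the mean squared error between stated probabilities and binary outcomes: lower is better, and always guessing 50% scores 0.25. A minimal sketch, using made-up example predictions rather than my actual list:

```python
def brier_score(forecasts):
    """Mean squared error between predicted probabilities and outcomes.

    forecasts: list of (probability, outcome) pairs, where outcome is
    1 if the event happened and 0 if it didn't.
    """
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# Hypothetical example: predictions of 90%, 60%, and 80%, where the
# first two events happened and the third didn't.
print(brier_score([(0.9, 1), (0.6, 1), (0.8, 0)]))  # ~0.27
```

Note that a confident miss (the 80% prediction that didn’t happen) dominates the score, which is why the crypto questions hurt so much.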

My calibration was fairly good, though the sample is noisy.
Key: Right / Wrong [Cheating, w/o Crypto] (Cumulative w/ 2017) — Percentage [Cheating, w/o Crypto] (Cumulative)

[50–60%): 4 (6) / 1 (4) — 80% (60%)
[60–70%): 3 (11) / 4 (7) — 43% (61%)
[70–80%): 4 (10) / 0 (2) — 100% (83%)
[80–90%): 5 [3] (15) / 3 [1] (4) — 57% [75%] (79%)
[90–95%): 4 (13) / 0 (0) — 100% (100%)
[95–99%): 7 (18) / 2 [0] (2) — 78% [100%] (90%)
[99+%]: 0 (4) / 0 (0) — N/A (100%)

US Politics

No — Democrats take the senate: 45%
(The seats up for grabs are largely Democrat controlled — hard to make inroads.)

No — The Republicans will maintain control of the House of Representatives in 2018 elections: 45% 
(Last year I said 40% — this is what the prediction markets now say, but I’m updating. I’m skeptical that Trump’s unpopularity convinces the heartland to vote dem, or stay home. But this is a low confidence prediction, made early.)

No (Wrong) — Republicans win House of Representatives special election in Pennsylvania’s 18th district: 60%

Yes/No — Trump’s approval rating, based on the RCP average, will be over 40% / 45% at some point in 2018: 60% / 25% (It got really close to 45%, though.)


I SUCK AT THIS, as the past two years should make clear. (And if you think you can do better, why aren’t you rich? (Alfred, you don’t need to respond, I know.)) But I still think there’s a real chance that the bubbles pop — and even if they don’t, I expect the pace of growth to slow once the regular capital markets have put in their money.

Bitcoin Crashes — “loses more than 50% of peak value”:

Distribution — hard to score via Brier — off-the-cuff probability distribution:
10% — BTC investment (not use) spreads until much of public holds at these high prices before crashing 
60% — not very soon, but w/in 2–3 years 
15% — Crash During 2018 
Yes — 15% — (Mid-December 2017) was the top.

I’m on the record already:
No, (No) — Conditional on the crash occurring: 1 year later, I’d predict bitcoin is smaller than at least 2 alternatives, and less than 25% of total cryptocoin market cap, with 80% confidence. (1 altcoin, 33%, 90% conf.)

Global Catastrophic Risks

AI Progress indicators –

AI wins a Real Time Strategy game (RTS — Starcraft, etc.) in full-mode against the best human players before end of:

(Other years removed.)

Scott’s Prediction Topics (His Numbers)

Yes — 1. Donald Trump remains president at end of year: 98% (95%)
Yes — 2. Democrats take control of the House in midterms: 55% (80%)
No — 3. Democrats take control of the Senate in midterms: 45% (50%)
No — 4. Mueller’s investigation gets cancelled (e.g., Trump fires him): 20% (50%) [I assume almost immediately being relaunched by appointing him an independent counsel or equivalent after firing doesn’t count. If it does, I probably agree with Scott.]
Correct — 5. Mueller does not indict Trump: 80% (70%) [I can’t see him indicting Trump. I think there will be a report with arguably indictable offenses, but even so it very well may come out in 2019.] (I’m happy to have called what now seems inevitable in the comment.)
No — 6. PredictIt shows Bernie Sanders having highest chance to be Dem nominee at end of year: 60% (60%) [Biden and Warren are more viable choices, and Bernie is really old. But this is predicting the prediction, so I’m less certain about this than I am that he won’t be the nominee.] (Beto and Kamala Harris are on top. And given that Obama beat Hillary, I should have remembered that new people would show up. This was stupidly overconfident.)
Yes — 7. PredictIt shows Donald Trump having highest chance to be GOP nominee at end of year: 95% (95%)
No — 9. Some sort of major immigration reform legislation gets passed: 80% (70%)
Correct — 10. No major health-care reform legislation gets passed: 90% (95%) (I predicted this right after the individual mandate was repealed at the end of 2017, it seems, so yeah.)
Correct — 11. No large-scale deportation of Dreamers: 95% (90%) (Not that they figured anything out, but it hasn’t been mass deportations.)
Correct — 12. US government shuts down again sometime in 2018: 60% (50%) (I was correct by a hair.)
Correct — 13. Trump’s approval rating lower than 50% at end of year: 95% (90%)
No — 14. …lower than 40%: 60% (50%)
WAITING ON SCOTT — 15. GLAAD poll suggesting that LGBQ acceptance is down will mostly not be borne out by further research: 70% (80%) [This is WAY outside my wheelhouse, here.]

No (Correct) — 16. Dow does not fall more than 10% from max at any point in 2018: 45% (50%) (It did, twice!)
Wrong — 17. Bitcoin is higher than $5,000 at end of year: 90% (95%)
Wrong — 18. Bitcoin is higher than $10,000 at end of year: 70% (80%)
Right — 19. Bitcoin is lower than $20,000 at end of year: 80% (70%)
It was — 20. Ethereum is lower than Bitcoin at end of year: 50% (95%)
X — 21. Luna has a functioning product by end of year: N/A (90%) [I don’t know what this is.] (Still Don’t.)
X — 22. Falcon Heavy first launch not successful: N/A — Just saw this. (70%)
X — 23. Falcon Heavy eventually launched successfully in 2018: N/A — Just saw this. (80%) 
Correct — 24. SpaceX does not attempt its lunar tourism mission by end of year: 95% (95%) [??]
Correct — 25. Sci-Hub is still relatively easily accessible from within US at end of year (even typing in IP directly is relatively easy): 95% (95%) [??]
Correct — 26. Nothing particularly bad (beyond the level of a funny/weird news story) happens because of ability to edit videos this year: 80% (90%) [But I’m putting a major fake news controversy as bad. Unsure Scott agrees.]
Nope — 27. A member of the general public can ride-share a self-driving car without a human backup driver in at least one US city by the end of the year: 60% (80%)

Correct — 28. Reddit does not ban r/the_donald by the end of the year: 90% (90%)
Correct — 29. None of his enemies manage to find a good way to shut up/discredit Jordan Peterson: 70% (70%) [??]

{I don’t follow these.}

PERSONAL (Not Scott’s, but adapted):
Correct — 47. I move by end of July: 95%
Correct — 50. I go to Oxford as a visiting researcher: 65% (It was AMAZING.)
Correct (I didn’t) — 51. I do a postdoc at Oxford: 30%
Nope — 53. I get at least one article published in a newspaper or decently large website (not Ribbonfarm or Kol Habirah): 20% (But it was close — I submitted something that’s still in process.)
55. I weigh more than 160lb at year end: 50% (Close. And I don’t have a scale handy. Not that it affects the score, so…?)
Correct — 63. My paper with Scott G. goes on Arxiv/published: 90%
N/A — 64. My paper with Abram/Osonde goes on Arxiv/published: 50% (To explain, this was intended to be the multi-agent Goodhart paper, which I did put a version of online, and which is getting journal reviewed, but it is not co-authored with them, so I’m taking a mulligan.)

Should you worry about terrorists with Bio-Weapons? (No.)

A recent story said that there was a terrorist “planning” to attack the Italian town of Macomer (pop. approx. 10,000) by putting Ricin and Anthrax in the water supply. That sounds scary, right?

First, he didn’t HAVE Anthrax or Ricin. He was planning on buying it online. Somehow. (No, no-one is selling biological weapons online. Not even on the dark web.) But let’s assume he managed to get it, somehow. Maybe ISIS or Al-Qaeda, which pursued biological weapons but couldn’t manage to buy or make them, nevertheless ended up finding some surplus bioweapons from the Russians, and gave them to this guy in Italy. And no, that’s not plausible, but we’re going well past the point of plausible in order to try to find a way to worry about this threat.

OK, so he’s got his materials, and wants to put them in the water. The water supply might be unguarded — I don’t know, but many places have fences and such around freshwater drinking sources, and might notice someone dumping in something. But maybe they don’t. Some places have water utilities that test the drinking water for various toxins somewhere along the line. Let’s assume there is no such program in place for the water delivered to this small Italian town. But how much Anthrax and Ricin would he need to put in, exactly?

We can start with Ricin. It’s relatively easy to make. (Obviously much harder than a bomb, and most terrorists can’t manage to make those without a fair amount of help, but maybe the terrorist has a really good chemistry background.) The victims only need to ingest 1 ml of ricin per kilogram of body weight to have a 50% chance of dying. We’ll conservatively assume that the people of Macomer average 50kg. That’s only 50ml of ricin per person, or 500 liters of ricin toxin for the whole town — about 130 gallons, for Americans. That’s not an amount you can stick in a backpack — you need to back a truck up to the water supply to get that volume of toxin in.

But people aren’t ingesting all of that water. An approximate level of water usage per person is 450 liters per day. (That’s still about 100 gallons.) In California, with its droughts, they require about half that, and we’ll go with the lower number. People only need to drink about 2 liters of water (8 cups) a day. That doesn’t need to all be in the form of tap water — juices and soda work too. But again, let’s assume everyone is being healthy and environmentally conscious, and they drink only tap water. That’s about 1% of their daily water usage. So we need to multiply the amount in the water supply by 100 to get the same effect — that’s a large tanker truck worth of ricin. And a city isn’t only going to have a single day’s supply of water in the reservoir. For every day’s worth of water they have, we need to add another tanker truck.
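The back-of-the-envelope numbers above are easy to check. A quick sketch (all figures are the rough assumptions from the text, not toxicology or utility data):

```python
# Rough assumptions from the text, for illustration only.
population = 10_000        # Macomer, approx.
dose_ml_per_kg = 1         # assumed dose for a 50% chance of death
body_mass_kg = 50          # conservative average
daily_use_liters = 225     # per-person usage (California-style, half of 450)
drunk_liters = 2           # of which is actually drunk

dose_per_person_ml = dose_ml_per_kg * body_mass_kg         # 50 ml
town_dose_liters = dose_per_person_ml * population / 1000  # 500 liters

# Only ~1% of the supplied water is drunk, so the toxin is diluted ~100x.
dilution = daily_use_liters / drunk_liters  # ~112; the text rounds to 100
needed_liters = town_dose_liters * dilution

print(town_dose_liters, dilution, needed_liters)  # 500.0 112.5 56250.0
```

And that is per day of water stored in the reservoir — every additional day’s supply multiplies the requirement again.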

OK, so maybe our would-be terrorist isn’t going to be able to order a couple dozen tanker trucks of Ricin on the dark web. (And don’t worry too much about a permanently contaminated water supply — boiling the water gets rid of ricin.)

But anthrax. Maybe that could work. And it is found naturally on the ground! Fortunately for us, and unfortunately for our would-be terrorist, culturing anthrax in large quantities is really, really hard. For an idea of how hard, we can look at Aum Shinrikyo — the people who successfully made Sarin and used it to attack a subway in Japan. It turns out that before they did this, they spent years trying to isolate Anthrax. They eventually succeeded, but it turns out that the strain they ended up cultivating wasn’t a very good one for hurting people. Still, maybe our terrorist gets lucky, and finds someone on the internet who happens to be willing to sell them a dangerous strain of anthrax. All they need to do is cultivate it.

Unfortunately, the terrorist in question doesn’t have any lab experience in microbiology. He can try to buy a book, and some equipment, but he’s going to have a hard time figuring out something that PhD bacteriologists need specially designed fermenters to do.

And then he needs to add it to the water. And even though there doesn’t need to be a huge quantity of spores to cause fatal anthrax by ingestion in an individual, as with Ricin, the picture changes when the spores are dumped into a large body of water.

At the end of the day, biological threats exist, and it’s likely that we will see more in the future. But idiots claiming to have plans to poison water supplies with non-existent supplies of bio-weapons are just that — idiots that make ridiculous claims. Terrorist threats are serious, and I’m sure he’ll serve time in jail. That’s good, because there are lots of ways that people interested in committing terrorism can kill people. They just involve trucks and guns, not imaginary mail-order bioweapons.

Blockchains, Reserve Banks, and Accounting for Liabilities

Blockchain enthusiasts have occasionally claimed that blockchains allow “an asset without a liability,” a phrase used by Walker and Luu, and echoed by Nic Carter. Despite being a ledger, the blockchain is not money owed by anyone — which is an informal understanding of what a liability is. The claimed advantage of this seems to be that cryptocurrency holdings are akin to a natural resource, like gold or silver, rather than a reserve-bank-backed fiat currency.

In many ways Bitcoin and similar ventures do resemble such assets, or even exceed them in important ways. For example, the supply of gold is “fixed” — modulo mining, which can increase when gold prices are high. The supply of bitcoin, of course, is fixed in a much less manipulable sense. However, not having a corresponding liability is not one of the ways that cryptocurrencies differ from reserve-bank currencies — in fact, there is an incredibly close parallel.

First, it is useful to understand what a reserve bank balance sheet does — it has assets, which are things it purchased, and liabilities, which is principally the money that has been issued. Reserve banks function by running a liability-focused balance sheet — they create money from nothing, which gives them an asset, the money created, and a corresponding liability, which is that the money is actually a debt to itself. If this money is used to buy something, they get an asset in exchange for an asset and the balance sheet balances. But the liability doesn’t mean they owe anything — they can leave the currency issued and never repay it.

If I have a $50 bill, that is a direct $50 liability on the part of the central bank. They don’t owe me anything, but it’s a liability. The way those liabilities are balanced is via negative equity — the total government debt. That’s because in most cases, central banks are actually paying for things that they don’t receive — the government runs a deficit, and the excess payments by the government create a debt, which is, again, a reserve bank liability. The theory is that the reserve bank could always balance its books by having the government tax to pay for all of those liabilities — but the more common resolution is inflation, insolvency, and often abandoning the currency.

This is fundamentally different from a commercial bank. If I have $50 in a commercial bank, and no debt, that’s a $50 liability that the bank owes me, and a corresponding $50 asset that they have to lend. The money is ALSO a liability on the reserve bank balance sheet, corresponding to the asset the bank has. The bank’s balance sheet for these assets needs to balance, however, unlike the reserve bank’s. If they hold the cash, they can then lend my $50 (in fractional reserve banking, to three or four people), but for every dollar they hand out, they gain a corresponding asset — someone owes them the money. They can become insolvent just like the federal government — if too many people default on loans, they run out of money, and (if the FDIC or equivalent doesn’t step in) the depositors lose their deposits.
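The distinction between the two balance sheets can be made concrete with a toy double-entry sketch (invented numbers, purely for illustration — not an accounting reference):

```python
def totals(ledger):
    """Sum the asset and liability columns of (description, assets, liabilities) rows."""
    return (sum(a for _, a, _ in ledger), sum(l for _, _, l in ledger))

# Central bank: creating $50 books the money as an asset and the currency
# outstanding as a liability; spending it just swaps one asset for another.
central_bank = [
    ("create $50 of currency", 50, 50),
    ("swap currency for a bond", 0, 0),  # asset-for-asset, net zero
]

# Commercial bank: my $50 deposit is a liability owed to me, matched by
# $50 of cash (an asset) it can lend out -- and can actually lose.
commercial_bank = [
    ("accept $50 deposit", 50, 50),
]

print(totals(central_bank), totals(commercial_bank))  # (50, 50) (50, 50)
```

Both sheets balance, but only the commercial bank’s liability is a debt someone can come collect — which is why ordinary loan defaults can bankrupt it, while the central bank’s “debt to itself” never has to be repaid.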

How does a blockchain work? Just like a reserve bank, it gives out money (pre-mined tokens, block rewards, transaction fees), but it does not get anything in exchange. This is a lot like when a government overspends its assets — the corresponding liabilities turn into money. Here, however, the item purchased with the money isn’t an asset, it’s an intangible — security and transactability. Until the currency is fully mined, every block mined costs the network money to pay for this security and transactability.

When I have a bitcoin, it’s an asset balanced by a liability held by the blockchain. Whose liability? It belongs to the network. The network must mine more blocks to allow transactions and keep these liabilities useful — and this is an ongoing expense.

Cryptocurrencies have associated liabilities, just like any other asset that gets issued. What balances the blockchain’s balance sheet? Nothing — just like central banks, which spent money and created liabilities. Unlike central governments, blockchains don’t have the ability to tax to re-balance the balance sheet. By design, of course, most also can’t inflate the currency. If for any reason the blockchain is unable to meet the ongoing expense of mining rewards to provide security and transactability, it does exactly the same thing a reserve bank does, and the money disappears.

Policy Beats Morality

This is a simple point, but one that gets overlooked, so I think it deserves a clear statement. Morality is less effective than incentives at changing behavior, and most of the time, policy is the way incentives get changed.

Telling people the right thing to do doesn’t work. Even if they believe you, or understand what you are saying, most people will not change their behavior simply because it’s the right thing to do. What works better is changing the incentives. If this is done right, people who won’t do the right thing on their own often support the change, and their behavior will follow.

I remember reading a story, which I think was about Douglas Hofstadter’s column in Scientific American, in which he asked eminent scientists to write in whether they would cooperate with someone described as being “as intelligent as themselves” in a one-shot prisoner’s dilemma. He was disappointed to find that even many of the smartest people in the world were rational, instead of superrational. Despite his assertion that intelligent enough people should agree that superrationality leads to better outcomes for everyone, those people followed their incentives, and everyone defected. Perhaps we can chalk this up to their lack of awareness of newer variants of decision theory, but the simpler explanation is that morality is a weak tool, and people know it. The beneficial nature of the “morality” of non-defection wasn’t enough to convince participants that anyone would go along.

Environmentalists spent decades attempting “moral suasion” as a way to get people to recycle. It didn’t work. What worked was curb-side pickup of recycling that made money for municipalities, paired with fines for putting recyclables in the regular garbage. Unsurprisingly, incentives matter. This is well understood, but often ignored. When people are told the way to curb pollution is to eat less meat or drive less, they don’t listen. The reason their behavior doesn’t change isn’t because it’s “really” the fault of companies, it’s because morality doesn’t change behavior much — but policy will.

The reason politics is even related to policy is because politicians like being able to actually change public behavior. The effectiveness of policy in changing behavior is the secondary reason why — after donations by Intuit and H&R Block — Congress will never simplify the tax code. To paraphrase / disagree with Scott Alexander, “Society Is Fixed, Policy Is Mutable.” Public policy can change the incentives in a way that makes otherwise impossible improvements turn into defaults. Punishment mechanisms are (at least sometimes) sufficient to induce cooperation among free-riders.

Policy doesn’t change culture directly, but it certainly changes behaviors and outcomes. So I’ll say it again: policy beats morality.

*) Yes, technological change and innovation can ALSO drive changes in incentives, but predicting the direction of such changes is really hard. This is why I’m skeptical that innovation alone is a good target for changing systems. Even when technology lowers the cost of recycling, it’s rarely clear beforehand whether new technology will in fact manage to prompt such changes — electric trolleys were a better technology than early cars, but they lost. Electric cars are still rare. Nuclear power is the lowest carbon alternative, but it’s been regulated into inefficiency.

Discounting the relativistic future

Tyler Cowen’s new book, “Stubborn Attachments,” contains a quote about why we should use low discount rates. He’s right about the conclusion of wanting low discount rates, but I think the example doesn’t quite make the point. (H/t to Robert Wiblin for pointing out the quote.)

…it seems odd, to say the least, to discount the well-being of people as their velocity increases. If, for instance, we sent off a spacecraft at near the velocity of light, the astronauts would return to earth, hardly aged, many millions of years hence. Should we pay less attention to the safety of our spaceship… the faster those vehicles go?

As I responded on Twitter, I’m fairly sure this is conceptually wrong because economists are used to thinking about time in Newtonian terms. If we use a proper spacetime metric, the problem, I argue, goes away — and so do some other things.

Let’s work through Tyler’s example. An astronaut leaves earth and immediately accelerates to 0.9998c (a time-dilation factor of about 50), crushing him into a pulp in a way that is mathematically convenient for us. As economically rational agents, assuming his spaceship conveniently resurrects him, should we care about his safety? [Note: The assumption of economically rational agents is obviously ridiculous, but it’s only slightly more of an exaggeration than the other parts of our story.]

So let’s look forward in time. When it’s a year later on earth, how much do we care about the astronaut? Using a typical discount rate, of say, 5%, we care about him 95% as much.

He, however, has had only about 0.02 years of time pass, and cares a bit more. But when he lets a year pass in his reference frame, he cares 95% as much about future him — but we earthbound people need to wait 50 years for that to happen, and we care about him 50 years from now about 8.7% as much as we did when he launched.

But where is he? About 10¹² kilometers away. Americans can’t be bothered to think about poor people in Africa, so why should they care about this guy who is about 100,000,000 times as far away? But Tyler Cowen agrees with Peter Singer in his moral objections to distance-based discounting, so after we’ve spent the next 50 years avoiding existential risks and solving poverty in some economically efficient way, we need to decide how much value we should have initially placed on our astronaut.

Even if we don’t want to discount distance in space, unless we’re using a discount rate of 0%, these post-Einstein sophisticates need to discount distance in space-time. Our astronaut travelling at 0.9998c is already about a light year away, and using a handy-dandy space-time distance calculator, that means he’s just about 51 years away, and we think he’s worth about 8% of what he was when we launched him.

Let’s say he turns around, once again suddenly changing velocity, getting crushed to a pulp, and being resurrected by his ship. On the centennial of his launch, he comes back, 2 years older. Unless we’re doing something really complicated with our intergenerational discounting, we should initially have discounted this future by that same 5% yearly, and our future returnee is worth 0.76% of a person. That has nothing to do with space travel, it just means people don’t care about the future. [Note: Yes, high discount rates might be bad if you’re hoping to live to 100, because it means decision-makers now should trash the future for present gain. As if they aren’t already. But we’ll get back to that.]

Our prospective astronaut, however, has a higher self-valuation, and thinks this future is worth about 90% as much as the present. That makes sense — he’s only lived 2 years. [Note: If you’ve got a fast enough spaceship, you’re gonna be able to find a hell of an IRR for your investments. Just make sure you figure out that whole not getting crushed to death thing.] But different people always have different discount rates — we’re just saying that high-speed relativistic astronauts should hope that society cares about the long term future.
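The numbers in this walkthrough are easy to reproduce. A sketch using the 5% discount rate and the time-dilation factor the worked numbers imply (γ ≈ 50, i.e. roughly 0.9998c; at a mere 0.99c, γ would only be about 7):

```python
import math

def lorentz_gamma(beta):
    """Time-dilation factor for speed beta = v/c."""
    return 1 / math.sqrt(1 - beta ** 2)

def present_value(years, rate=0.05):
    """Value today of one unit of welfare `years` in the future."""
    return 1 / (1 + rate) ** years

g = lorentz_gamma(0.9998)
print(round(g))                      # 50: one earth-year is ~0.02 astronaut-years

print(round(present_value(1), 3))    # 0.952: earth's concern after one earth-year
print(round(present_value(50), 3))   # 0.087: when the astronaut's first year has passed
print(round(present_value(100), 4))  # 0.0076: the centennial return
print(round(present_value(2), 3))    # 0.907: his own view, having aged only 2 years
```

The asymmetry is the whole point: the same return event is discounted over 100 years on earth but only 2 years in the astronaut’s frame.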

So we conclude that the people on earth care about events happening in a century very little, but people who travel really fast care quite a bit more. And we conclude that people who are really far away are absolutely worth less than people nearby, if only because they can’t get back here until the far future. But if we want to put someone on a spaceship, they better realize that they care about their safety a lot more than we do.

The conclusion is inescapable: we need to launch political decision makers away from earth as fast as we can possibly make them go. We don’t even need to make sure the spaceship is safe, because in our reference frame, it’ll be a long time until it gets back. This way, they might start to care a little bit more about the far future. [Note: Or at least they’ll care a bit more about engineering standards.] The problem goes away, and so do the politicians.

Shorrock’s Law of Limits

I recently saw an interesting new insight into the dynamics of over-optimization failures, stated by Steven Shorrock: “When you put a limit on a measure, if that measure relates to efficiency, the limit will be used as a target.” This seems to be a combination of several dynamics that can co-occur in at least a couple of ways, and despite my extensive earlier discussion of related issues, I think it’s worth laying out these dynamics along with a few examples to illustrate them.

When limits become targets

First, there is a general fact about constrained optimization that, in simple terms, says that for certain types of systems the best solution to a problem is going to involve hitting one of the limits. This was formally shown in a lemma by Dantzig about the simplex method: a convex function maximized over a convex feasible region attains its maximum at an extreme point of that region. (Convexity is important, but we’ll get back to it later.)
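A toy two-variable version makes the lemma concrete (the constraints and payoffs here are illustrative numbers of my own, not from any real regulation): enumerate the extreme points of a small feasible region and watch a linear payoff pick the one where limits bind.

```python
from itertools import combinations

# Constraints a*x + b*y <= c: a total-hours cap plus per-task caps.
constraints = [
    (1, 1, 10),   # x + y <= 10   (total hours)
    (1, 0, 8),    # x <= 8        (cap on the more efficient task)
    (0, 1, 8),    # y <= 8        (cap on the other task)
    (-1, 0, 0),   # x >= 0
    (0, -1, 0),   # y >= 0
]

def feasible(x, y, tol=1e-9):
    return all(a * x + b * y <= c + tol for a, b, c in constraints)

# Extreme points sit where pairs of constraints intersect and all constraints hold.
vertices = []
for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        continue  # parallel constraints never meet
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    if feasible(x, y):
        vertices.append((x, y))

# A linear payoff (3 per hour of x, 2 per hour of y) is maximized at a vertex,
# i.e. exactly where the limits bind.
best = max(vertices, key=lambda v: 3 * v[0] + 2 * v[1])
print(best)  # (8.0, 2.0): both the task cap and the total-hours cap bind
```

Nothing in the optimizer “wants” to hit the limit; hugging the constraint is just what maximizing under it looks like.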

When a regulator imposes a limit on a system, it’s usually because they see a problem with exceeding that limit. If the limit is a binding constraint — that is, if you limit something critical to the process and require a lower level of the metric than is currently being produced — the best response is to hug the limit as closely as possible. If we limit how many hours a pilot can fly (the initial prompt for Shorrock’s law), or that a trucker can drive, the best way to comply with the limit is to get as close to the limit as possible, which minimizes how much it impacts overall efficiency.

There are often good reasons not to track a given metric: it may be unclear how to measure it, or expensive to do so. A large part of the reason that companies don’t optimize for certain factors is that they aren’t tracked. What isn’t measured isn’t managed — but once there is a legal requirement to measure something, it’s much cheaper to start using that data to manage it. The companies now have something they must track, and once they are tracking hours, it would be wasteful not to also optimize for them.

Even when the limit is only sometimes reached in practice before the regulation is put in place, formalizing the metric and the limitation means that it becomes more explicit — leading to reification of the metric. This isn’t only because of the newly required cost of tracking the metric; it’s also because what used to be a difficult-to-conceptualize factor like “tiredness” now has a newly available, albeit imperfect, metric.

Lastly, there is the motivation to cheat. Before fuel efficiency standards, there was no incentive for companies to explicitly target the metric. Once the limit was put into place, companies needed to pay attention — and paying attention to a specific feature means that decisions are made with this new factor in mind. The newly reified metric gets gamed, and suddenly there is a ton of money at stake. And sometimes the easiest way to perform better is to cheat.

So there are a lot of reasons that regulators should worry about creating targets, and ignoring second-order effects caused by these rules is naive at best. If we expect the benefits to just exceed the costs, we should adjust those expectations sharply downward, and if we haven’t given fairly concrete and explicit consideration to how the rule will be gamed, we should expect to be unpleasantly surprised. That doesn’t imply that metrics can’t improve things, and it doesn’t even imply that regulations aren’t often justifiable. But it does mean that the burden of proof for justifying new regulation needs to be higher than we might previously have assumed.

Comment on “How to Beat Science and Influence People”

This is an interesting and well-written policy paper that clearly explains the dynamics in public policy that exploit policymakers’ naivete. Unfortunately, I think it goes too far in claiming that the failure occurs “even when policy makers rationally update on all evidence available to them.” This follows from a more general failure of many rational actor models to properly model data sources, a critical characteristic of any model used for inference. (See Chapter 2 of my dissertation, forthcoming.)

This failure results from not appreciating what Yudkowsky calls “Filtered evidence.” If, in fact, rational policymakers have a model that acknowledges the process which generates evidence, the failure mode assumed in the paper largely disappears. To see how, consider the difference between the following two Bayesian models that treat data x as representative of item y:

Naive Model:
x ~ Normal(y, σ²)

Filtered Evidence Model:
z ~ Normal(y, σ²)
x = z, observed only if (filterer goal − ε) < z < (filterer goal + ε)

Note that the filterer goal can usually be inferred contextually, and the range may be asymmetric — but people with a policy goal will select evidence to match that goal, and policymakers can often infer that goal and use it to account for the potential filtration.
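A small simulation (my own illustration, with made-up parameter values and a grid-search maximum likelihood estimate standing in for a full posterior) shows the gap between the two models:

```python
import math
import random

random.seed(0)

def norm_pdf(x, mu, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def norm_cdf(x, mu, sigma=1.0):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

TRUE_Y, SIGMA = 0.0, 1.0
GOAL, EPS = 2.0, 0.5                  # the filterer's goal and reporting window
lo, hi = GOAL - EPS, GOAL + EPS

# The filterer draws honest evidence z, but only passes along values near its goal.
reported = []
while len(reported) < 500:
    z = random.gauss(TRUE_Y, SIGMA)
    if lo < z < hi:
        reported.append(z)

# Naive model: treat the reported x as unfiltered draws from Normal(y, sigma^2).
naive_estimate = sum(reported) / len(reported)

# Filtered evidence model: the likelihood of each x is the normal truncated to (lo, hi).
def truncated_loglik(mu):
    window = norm_cdf(hi, mu) - norm_cdf(lo, mu)
    return sum(math.log(norm_pdf(x, mu) / window) for x in reported)

grid = [i / 100 for i in range(-300, 301)]
aware_estimate = max(grid, key=truncated_loglik)

print(f"naive: {naive_estimate:.2f}")  # biased toward the filterer's goal of 2
print(f"aware: {aware_estimate:.2f}")  # much closer to the true y = 0
```

The naive estimate necessarily lands inside the filterer’s window, while the filtering-aware estimate recovers something near the truth from the shape of the evidence within the window alone.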

This filtered evidence model relies on a policymaker knowing that the provider of evidence x has a motive to misrepresent the truth — something that politicians generally understand, though they may have motives of their own to ignore it. Failing to account for this does not require any particular level of stupidity or obvious failure, since modeling the evidence-generating process is far from typical. For example, the failure to appreciate this fact can be seen in the recent discussions of p-hacking and of the anomalies that p-curves display.

Comments on arXiv Paper 1712.03198

Compiling some random, nitpicky comments on this generally excellent paper:

Pg 2 — 
Paragraph “This article outlines…” seems mostly unnecessary given the abstract and the following paragraph.

Perhaps clarify that ADMEP is a new method (or cite it).

Sentence: “Section 8.1…” seems like the order should be reversed to mention 8 before 8.1.

“This excludes the article types: tutorial in biostatistics, commentary, book review, correction, letter to the editor and authors’ response. In total, this returned 264 research articles.” I would suggest: “In the volume, there were a total of 264 articles after removing the following article types: tutorial in biostatistics, commentary, book review, correction, letter to the editor and authors’ response” (The phrase “this returned” seemed to imply a different search than what I think you meant.)

Pg 9 — 
“are, to all intents and purposes, truly random” — I’m annoyed by this only because for cryptographic purposes this is very untrue. I’d prefer “are, for all statistical purposes, truly random.”

Justifying your Attempts at Redefinition

I am a co-author of the Lakens et al. paper, Justify Your Alpha. In it, we argued that it was critical for scientists to justify their choice of significance level. In doing so, we asked for a return to the original practice of significance testing, where the goal is not (and should never have become) to choose which results are noteworthy or publishable. Instead, “without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.” (Neyman and Pearson, 1933) Given the goal of finding methods with an upper bound on the long-run probability of being wrong in our conclusions, our paper clearly shows that the threshold for statistical significance as currently used should not be uniform. But I now think this is partly a red herring.
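That “not too often wrong” guarantee is a statement about long-run frequencies, which a quick simulation makes concrete (my own illustration, using a known-variance z-test on null data):

```python
import math
import random

random.seed(1)

ALPHA = 0.05
Z_CRIT = 1.959964  # two-sided critical value for alpha = 0.05
N, TRIALS = 30, 20000

rejections = 0
for _ in range(TRIALS):
    # Data generated under the null hypothesis: the mean really is 0.
    sample = [random.gauss(0, 1) for _ in range(N)]
    z = (sum(sample) / N) * math.sqrt(N)  # known sigma = 1, so a simple z-test
    if abs(z) > Z_CRIT:
        rejections += 1

print(rejections / TRIALS)  # hovers near 0.05: wrong no more than ~1 in 20
```

The rule controls how often we are wrong across many experiments; it says nothing about whether any single rejected hypothesis is true or false, which is exactly the distinction Neyman and Pearson insisted on.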

It is critical to note that the original conclusion of Neyman and Pearson was a “criteria suitable for testing any given statistical hypothesis.” The criterion they found was only one part of the goal, which was to allow statistical evidence to be used to converge to beliefs which were (with high probability) correct. In 1933, when the users of statistical methods were restricted to those who could calculate the answers manually, and the discussions in a given field were among only a few sophisticated users of the methods, p-values were a reasonable choice for such a criterion.

But as noted by Crane and Martin, the current question of interest is how the statistical methods that are demanded by journals and peer review affect the ability of science as a discipline to converge to correct answers in the long run. The principles that they suggest, along with those of open science, are more likely to help achieve Neyman and Pearson’s initial goals than redefining the p-value, especially given that all involved in the discussion agree that it currently fails to achieve its original purpose of keeping the error rate below 1-in-20. Preregistration, public availability of data and code, and justification of study design choices (like which alpha is used) are clearly attempts to further promote the sociological goals of science.

If we hope to find rules that fulfill Neyman and Pearson’s original goals, focusing on study design and statistical methods narrowly seems doomed to failure. Instead, we need to focus on the principles, techniques, and practices that each discipline demands of contributors. This conversation has started, and continues, but perhaps the goals we are looking to achieve need to be more clearly stated (and justified) before further discussion of statistical methods and goalposts is useful.