Justifying your Attempts at Redefinition

I am a co-author of the Lakens et al. paper, Justify Your Alpha. In it, we argued that it was critical for scientists to justify their choice of significance level. In doing so, we asked to return to the original practice of significance testing, where the goal is not (and should never have become) to choose which results are noteworthy or publishable. Instead, “without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.” (Neyman and Pearson, 1933) Given the goal of finding methods with an upper bound on the long-run probability of being wrong in our conclusions, our paper clearly shows that the threshold for statistical significance as currently used should not be uniform. But I now think this is partly a red herring.

It is critical to note that what Neyman and Pearson originally arrived at was “criteria suitable for testing any given statistical hypothesis.” The criteria they found were only one part of the goal, which was to allow statistical evidence to be used to converge to beliefs which were (with high probability) correct. In 1933, when the users of statistical methods were restricted to those who could calculate the answers manually, and the discussions in a given field were among only a few sophisticated users of the methods, p-values were a reasonable choice for such a criterion.

But as noted by Crane and Martin, the current question of interest is how the statistical methods that are demanded by journals and peer review affect the ability of science as a discipline to converge to correct answers in the long run. The principles that they suggest, along with those of open science, are more likely to help achieve Neyman and Pearson’s initial goals than redefining the p-value threshold, especially given that all involved in the discussion agree that the current threshold fails to achieve its original purpose of keeping the error rate below 1-in-20. Preregistration, public availability of data and code, and justification of study design choices (like which alpha is used) are clearly attempts to further promote the sociological goals of science.

If we hope to find rules that fulfill Neyman and Pearson’s original goals, focusing on study design and statistical methods narrowly seems doomed to failure. Instead, we need to focus on the principles, techniques, and practices that each discipline demands of contributors. This conversation has started, and continues, but perhaps the goals we are looking to achieve need to be more clearly stated (and justified) before further discussion of statistical methods and goalposts is useful.

Review of Brand et al. — MetaPsychology osf.io/6s29n

Overall Comments

Very interesting paper, and very interesting method, one which seems easy to integrate into current Bayesian modeling practice.

It took me a while to figure out what you meant by posterior passing. It might be worthwhile explaining the method more simply in the abstract; “posterior passing, where the posterior found in a past analysis is used as the Bayesian prior for the subsequent analysis.” This seems simpler to me. Others may disagree.
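To make that one-line description concrete, here is a minimal sketch of what I understand posterior passing to be. It assumes a normal prior and studies that each report a point estimate with a known standard error; that conjugate setup is my simplification for illustration, not the model used in the paper, and the names Normal, update and posterior_passing are my own.

```python
from dataclasses import dataclass


@dataclass
class Normal:
    """A normal belief about an effect size (hypothetical helper type)."""
    mean: float
    sd: float


def update(prior: Normal, estimate: float, se: float) -> Normal:
    """Conjugate normal update for one study with a known standard error."""
    prior_precision = 1.0 / prior.sd ** 2
    data_precision = 1.0 / se ** 2
    post_precision = prior_precision + data_precision
    # The posterior mean is the precision-weighted average of prior and data.
    post_mean = (prior_precision * prior.mean + data_precision * estimate) / post_precision
    return Normal(post_mean, post_precision ** -0.5)


def posterior_passing(initial_prior: Normal, studies: list[tuple[float, float]]) -> Normal:
    """Chain studies: each study's posterior becomes the next study's prior."""
    belief = initial_prior
    for estimate, se in studies:
        belief = update(belief, estimate, se)
    return belief


if __name__ == "__main__":
    # Illustrative numbers only, not data from the paper.
    print(posterior_passing(Normal(0.0, 1.0), [(1.0, 1.0), (1.0, 1.0)]))
```

The only point of the sketch is the loop: nothing is pooled at the end, because each analysis simply starts from where the previous one finished.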

Methodological Comments

1) If the simulation was an attempt to replicate the various methods, why is the study size fixed for the NHST methods? The posterior passing method allows the Bayesian approach to take advantage of prior data, but the way in which prior data is incorporated in NHST is at least partially via power calculations; the simulated NHST studies should vary their sample size based on the previously observed effect size.

2) If the intent is for posterior passing to be used in place of meta-analysis, shouldn’t the analysis of frequentist methods include a meta-analysis of the results from the 80 trials, to compare to the result found with posterior passing?

3) You note the importance of file-drawer bias. Would it be possible to run the analysis of the posterior-passing method only allowing passing of results when they are above some threshold, to account for this?
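To illustrate what I mean in comment 1, here is a rough sketch of how a follow-up NHST study might set its sample size from a previously observed effect. It assumes a two-sided, two-group comparison of means and uses the standard normal-approximation formula; the function name n_per_group and the default alpha and power are my assumptions, not anything taken from the paper's simulation.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    effect_size is a standardized difference (Cohen's d), e.g. the effect
    observed in the previous study. Uses the normal approximation
    n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2, rounded up.
    """
    if effect_size == 0:
        raise ValueError("effect_size must be non-zero")
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # A smaller previously observed effect calls for a larger next study.
    for d in (0.8, 0.5, 0.2):
        print(d, n_per_group(d))
```

Under this approximation a previously observed d of 0.5 calls for 63 participants per group, and smaller observed effects push the required sample up quickly, which is the kind of adaptation a fixed study size rules out.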

General conceptual introduction and attempt to improve science overall:

The presentation in the paper mentions that “the attempts of advocates of Bayesian methods of data analysis to introduce these methods to psychologists… have been without widespread success or response from the field.” To remedy this, some model of how it might change is necessary, and that model should explain the observation.

One plausible explanation is offered earlier in the review; “due to incentives for high numbers of publications, poorer methods that produced false positives persisted and proliferated.” Another plausible explanation is that newer methods are more complex, and people prefer not to learn new methods.

Ideally, at least a comment should explain how the proposal would address the presented problems — the answer to which eludes me. Perhaps embracing the proposed method needs to be a standard for the method to fix the problem of people incentivized to use simpler/easier to cheat methods, in which case how and why would people start to use it?

Alternatively, the background should be cut significantly, and the problem presented should be more closely restricted to “what method would reduce false positive rates and incorporate / replace reproducibility?” (This seems to be what was actually done.)

My 2018 Predictions

(Initial Version from Jan 1 — Scott finally posted his predictions on Feb 6, so here I am again on Feb 6th with updates, see lower.)

A hazy future means high uncertainty gets quantified!

Just to clarify what I’m doing here, I’m using my best guesses and knowledge to make predictions about a set of future events. This follows the urgings of Eliezer Yudkowsky, and the example of Scott over at SSC, who does this yearly — and it follows in the footsteps of my participation in the Good Judgement Project.

I’m picking these because they seem to be important things potentially happening in the coming year, not because I have specific domain knowledge. I’m happy to find and hear from people who are more accurate and have better judgement than myself, and can prove it with a public track record (I know several), because I can learn from them. So if you don’t think I have any basis for these predictions, you may be right, but I am a #superforecaster with a track record. And I challenge those with more knowledge, or claims that they could make guesses as well as I can, to try it and see.

All that said, I’m starting with the things I don’t think Scott over at SSC will predict, then I’ll log my predictions on his list once it’s out. That prevents me from cherry picking easy things to predict, or focusing on ones I have more than normal insight into.

US Politics

Interestingly, these are all gonna be correlated in a way the scoring won’t account for. Still, for predictions, it’s put up or shut up.
(I’m waiting for Scott to list what 2018 Election categories he’s predicting. For now;)
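
For anyone wondering what the scoring looks like, here is a minimal sketch using the Brier score. That choice of scoring rule is an assumption for illustration (it is one common option, not necessarily the one that will be used here), and the outcomes in the example are placeholders, not results.

```python
def brier_score(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared error between stated probabilities and what happened.

    Each forecast is (probability, happened). 0.0 is perfect, 0.25 is what
    always answering 50% earns, and 1.0 is confidently wrong every time.
    Every forecast is scored on its own, so correlation between questions
    is invisible to this rule.
    """
    if not forecasts:
        raise ValueError("need at least one forecast")
    squared_errors = [(p - (1.0 if happened else 0.0)) ** 2 for p, happened in forecasts]
    return sum(squared_errors) / len(squared_errors)


if __name__ == "__main__":
    # Probabilities in the style of this list; outcomes are HYPOTHETICAL.
    example = [(0.45, False), (0.45, True), (0.60, True)]
    print(brier_score(example))
```

Because each question contributes its own squared error, one political swing that moves several of these outcomes together gets counted as if it were several independent hits or misses.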

Democrats take the senate: 45%
(The seats up for grabs are largely Democrat controlled — hard to make inroads.)

The Republicans will maintain control of the House of Representatives in 2018 elections: 45%
(Last year I said 40% — this is what the prediction markets now say, but I’m updating. I’m skeptical that Trump’s unpopularity convinces the heartland to vote dem, or stay home. But this is a low confidence prediction, made early.)

Republicans win House of Representatives special election in Pennsylvania’s 18th district: 60%

Trump’s approval rating, based on the RCP average, will be over 40% / 45% at some point in 2018: 60% / 25%

Previous-year long-term predictions:

There will be a Republican primary challenger getting >10% of the primary vote in 2020 (conditional on Trump running) — 70%

The stock market will go down under President Trump (Conditional on him having a 4 year term, Inauguration-Inauguration) — 60%

New long-term predictions:

The retrospective consensus of economists about the 2017 tax bill will be;
…didn’t increase GDP growth more than 0.2%: 95%

…that, after accounting for growth, it increased the 10-year deficit
more than $1tr / $1.2tr / $1.5tr, respectively: 90% / 70% / 40%

The House will vote to impeach Trump before the end of his current term: 65% (50% vote needed)

Conditional on impeachment, the senate will convict: 20% (67% vote needed)
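
Since the conviction number is conditional, the unconditional numbers implied by these two predictions follow from simple multiplication. A small sketch of that arithmetic, with function and key names of my own choosing:

```python
def outcome_probabilities(p_impeach: float, p_convict_given_impeach: float) -> dict[str, float]:
    """Split the future into three exclusive outcomes.

    The Senate can only convict after the House impeaches, so
    P(convicted) = P(impeached) * P(convicted | impeached).
    """
    p_convicted = p_impeach * p_convict_given_impeach
    return {
        "not impeached": 1.0 - p_impeach,
        "impeached, not convicted": p_impeach - p_convicted,
        "impeached and convicted": p_convicted,
    }


if __name__ == "__main__":
    # 65% and 20% are the two predictions stated above.
    for outcome, p in outcome_probabilities(0.65, 0.20).items():
        print(f"{outcome}: {p:.0%}")
```

So the two predictions together imply roughly a 13% chance of conviction, a 52% chance of impeachment without conviction, and a 35% chance of no impeachment at all.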

Cryptocurrency

I SUCK AT THIS, as the past two years should make clear. (And if you think you can do better, why aren’t you rich? (Alfred, you don’t need to respond, I know.)) But I still think there’s a real chance that the bubbles pop — and even if they don’t, I expect the pace of growth to slow once the regular capital markets have put in their money.

Bitcoin Crashes — “loses more than 50% of peak value”;

Off-the-cuff probability distribution:
10%: BTC investment (not use) spreads until much of the public holds at these high prices before crashing
60%: not very soon, but w/in 2–3 years
15%: Crash During 2018
15%: (Mid-December 2017) was the top.

I’m on the record already;
Conditional on the crash occurring? 1 year later, I’d predict bitcoin is smaller than at least 2 alternatives, and less than 25% of total cryptocoin market cap, with 80% confidence. (1 altcoin, 33%, 90% conf.)

Global Catastrophic Risks

AI Progress indicators –

AI wins a Real Time Strategy game (RTS — Starcraft, etc.) in full-mode against the best human players before end of;
2018: 25%
2019: 45%
2020: 60%
Within Byun Hyun Woo’s Lifetime: 98% (He claims it won’t, here. Only this low because he might die in the next couple years.)

Scott’s Prediction Topics (His Numbers)

US:
1. Donald Trump remains president at end of year: 98% (95%)
2. Democrats take control of the House in midterms: 55% (80%)
3. Democrats take control of the Senate in midterms: 45% (50%)
4. Mueller’s investigation gets cancelled (eg Trump fires him): 20% (50%) [I assume almost immediately being relaunched by appointing him an independent counsel or equivalent after firing doesn’t count. If it does, I probably agree with Scott.]
5. Mueller does not indict Trump: 80% (70%) [I can’t see him indicting Trump. I think there will be a report with arguably indictable offenses, but even so it very well may come out in 2019.]
6. PredictIt shows Bernie Sanders having highest chance to be Dem nominee at end of year: 60% (60%) [Biden and Warren are more viable choices, and Bernie is really old. But this is predicting the prediction, so I’m less certain about this than I am that he won’t be the nominee.]
7. PredictIt shows Donald Trump having highest chance to be GOP nominee at end of year: 95% (95%)
9. Some sort of major immigration reform legislation gets passed: 80% (70%)
10. No major health-care reform legislation gets passed: 90% (95%)
11. No large-scale deportation of Dreamers: 95% (90%)
12. US government shuts down again sometime in 2018: 60% (50%)
13. Trump’s approval rating lower than 50% at end of year: 95% (90%)
14. …lower than 40%: 60% (50%)
15. GLAAD poll suggesting that LGBQ acceptance is down will mostly not be borne out by further research: 70% (80%) [This is WAY outside my wheelhouse, here.]

ECONOMICS AND TECHNOLOGY:
16. Dow does not fall more than 10% from max at any point in 2018: 45% (50%)
17. Bitcoin is higher than $5,000 at end of year: 90% (95%)
18. Bitcoin is higher than $10,000 at end of year: 70% (80%)
19. Bitcoin is lower than $20,000 at end of year: 80% (70%)
20. Ethereum is lower than Bitcoin at end of year: 50% (95%)
21. Luna has a functioning product by end of year: N/A (90%) [I don’t know what this is.]
22. Falcon Heavy first launch not successful: N/A — Just saw this. (70%)
23. Falcon Heavy eventually launched successfully in 2018: N/A — Just saw this. (80%)
24. SpaceX does not attempt its lunar tourism mission by end of year: 95% (95%) [??]
25. Sci-Hub is still relatively easily accessible from within US at end of year (even typing in IP directly is relatively easy): 95% (95%) [??]
26. Nothing particularly bad (beyond the level of a funny/weird news story) happens because of ability to edit videos this year: 80% (90%) [But I’m putting a major fake news controversy as bad. Unsure Scott agrees.]
27. A member of the general public can ride-share a self-driving car without a human backup driver in at least one US city by the end of the year: 60% (80%)

CULTURE WARS:
28. Reddit does not ban r/the_donald by the end of the year: 90% (90%)
29. None of his enemies manage to find a good way to shut up/discredit Jordan Peterson: 70% (70%) [??]

COMMUNITIES:
{I don’t follow these.}

PERSONAL (Not Scott’s, but adapted):
47. I move by end of July: 95%
50. I go to Oxford as a visiting researcher: 65%
51. I do a postdoc at Oxford: 30%
53. I get at least one article published in a newspaper or decently large website (not Ribbonfarm or Kol Habirah): 20%
55. I weigh more than 160lb at year end: 50%
63. My paper with Scott G. goes on Arxiv/published: 90%
64. My paper with Abram/Osonde goes on Arxiv/published: 50%