ArXiV minimal statistics checklist

This checklist should help you identify and fix common errors and misinterpretations in your own analysis, or in a paper you are refereeing. I tried to keep it short so that you can read the entire document, and I provide links in case you want to learn more (and find out why something is an error).

For general introductions to astrostatistics and machine learning, see the learning material section.

If you use p-values (from a KS test, Pearson correlation, etc.)
If you do a KS test
If you do Goodness-of-Fit
If you do Bayesian inference

I. If you use p-values (from a KS test, Pearson correlation, etc.)

What do you think a low p-value says?

A. You have absolutely disproved the null hypothesis (e.g. "no correlation" is ruled out, the data are not sampled from this model, there is no difference between the population means).

B. You have found the probability of the null hypothesis being true.

C. You have absolutely proved your experimental hypothesis (e.g. that there is a difference between the population means, there is a correlation, the data are sampled from this model).

D. You can deduce the probability of the experimental hypothesis being true.

E. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.

F. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.

Choose the right answer A - F.

If you have chosen any of A-F, you are wrong. p-values cannot give you any of this information.

A p-value is not the probability that a hypothesis is true. It only specifies how frequently data at least as extreme as yours would occur if the null hypothesis were true. The null hypothesis may not be true, and a rare occurrence will happen if you try often enough, so a low p-value cannot tell you that the null hypothesis is improbable (a prior is needed for that).

If you want to compare two models against data, try Bayesian model comparison instead.

If you want to see which of two empirical models describes the data best, try the Akaike information criterion.
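As a minimal sketch of such a comparison, assuming independent Gaussian errors with a known standard deviation: the data, the polynomial models and the helper name aic_gaussian below are made up for illustration. The AIC is 2k - 2 ln(L_max), with k the number of free parameters, and the model with the lower value is preferred.

```python
import numpy as np

def aic_gaussian(y, y_model, sigma, n_params):
    """AIC = 2k - 2 ln(L_max), assuming independent Gaussian errors of known sigma."""
    resid = (y - y_model) / sigma
    log_like = -0.5 * np.sum(resid**2 + np.log(2 * np.pi * sigma**2))
    return 2 * n_params - 2 * log_like

# Hypothetical data: a quadratic trend plus Gaussian noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
sigma = 0.5
y = 1.0 + 0.3 * x + 0.5 * x**2 + rng.normal(0, sigma, x.size)

# Least-squares polynomial fits are maximum-likelihood fits under this noise model.
aic = {}
for degree in (1, 2):
    coeffs = np.polyfit(x, y, degree)
    aic[degree] = aic_gaussian(y, np.polyval(coeffs, x), sigma, degree + 1)
    print(f"polynomial degree {degree}: AIC = {aic[degree]:.1f}")
```

Only differences in AIC between models fitted to the same data are meaningful, not the absolute values.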

Further reading:

The Null Ritual: What You Always Wanted to Know About Significance Testing but Were Afraid to Ask http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf

Do you declare the p-value "significant" (or some creative variant thereof) based on some threshold?

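p-values are not probabilities of a hypothesis, but random variables between 0 and 1 (uniformly distributed if the null hypothesis is true and the test statistic is continuous), so sometimes they are low by chance.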
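To illustrate how often low p-values appear by chance, here is a small sketch that repeats an experiment many times with no real effect present. The choice of a two-sample t-test, the sample size of 30 and the number of repetitions are arbitrary assumptions for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 2000
p_values = np.empty(n_experiments)

for i in range(n_experiments):
    # Both samples come from the same distribution: the null hypothesis is true.
    a = rng.normal(size=30)
    b = rng.normal(size=30)
    p_values[i] = stats.ttest_ind(a, b).pvalue

frac_below = np.mean(p_values < 0.05)
print(f"fraction of null experiments with p < 0.05: {frac_below:.3f}")
```

Roughly one in twenty of these null experiments should come out "significant" at the 0.05 level.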

Choosing a "significance" threshold (0.1, 0.05, 3sigma, 5sigma) based on your result is poor practice.

More important than the significance is to report the size of the effect, and place upper and lower limits on it.

Beware that p-values are neither the frequency of making a mistake nor the probability of detection. Following this article, for a p-value of 0.05 (claiming a detection of something), the probability that there really is something is only ~64%. For 2 sigma it is 80%, for p<0.01 it is 90%, for 3 sigma it is 99%. This has to do with the fact that most hypotheses you test are probably not true (the article assumes 10% of all tested hypotheses are true). To first order you can estimate the probability that you really detected something by
100% - p-value / (fraction of hypotheses that are true).
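The arithmetic can be sketched as follows. The function names are hypothetical, and the test power of 0.8 is an assumption on my part: it is not stated above, but with it the ~64% figure for p = 0.05 is reproduced.

```python
def prob_real_detection(p_threshold, frac_true, power=0.8):
    """Fraction of claimed detections that are real.

    power (probability of detecting a real effect) is an assumed value.
    """
    true_detections = power * frac_true
    false_detections = p_threshold * (1.0 - frac_true)
    return true_detections / (true_detections + false_detections)

def first_order_estimate(p_threshold, frac_true):
    """The rough estimate 100% - p-value / (fraction of true hypotheses)."""
    return 1.0 - p_threshold / frac_true

for p in (0.05, 0.01):
    print(f"p = {p}: {prob_real_detection(p, 0.1):.2f} "
          f"(first order: {first_order_estimate(p, 0.1):.2f})")
```

The first-order estimate is crude at p = 0.05 and gets closer to the full calculation as the p-value threshold decreases.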

To actually determine the frequency of making a mistake, make Monte Carlo simulations without the effect, generating the same number of data points at the same time/frequencies etc. and apply your detection method.
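A minimal sketch of such a simulation, assuming an unevenly sampled time series with unit Gaussian noise and taking the highest Lomb-Scargle periodogram peak as the detection method. The observation times, the frequency grid and the "observed" data are invented for illustration; replace them with your own sampling and your own method.

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 100, 80))                  # hypothetical observation times
ang_freqs = 2 * np.pi * np.linspace(0.01, 0.5, 200)   # angular trial frequencies

def detection_statistic(y):
    """Highest periodogram peak over all trial frequencies."""
    return lombscargle(t, y - y.mean(), ang_freqs).max()

# Stand-in for the real data: a weak sinusoid plus noise.
y_obs = 0.5 * np.sin(2 * np.pi * 0.1 * t) + rng.normal(size=t.size)
stat_obs = detection_statistic(y_obs)

# Simulate data without the effect, at the same times, with the same method.
n_sim = 1000
null_stats = np.array([detection_statistic(rng.normal(size=t.size))
                       for _ in range(n_sim)])

# Fraction of effect-free simulations that look at least as convincing.
false_alarm = (1 + np.sum(null_stats >= stat_obs)) / (1 + n_sim)
print(f"simulated false alarm frequency: {false_alarm:.3f}")
```

Because the maximum is taken over all trial frequencies inside detection_statistic, the number of frequencies searched is accounted for automatically.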

Publication bias additionally makes reported p-values unreliable (because p-values are random variables and preferentially only low values are reported).

Further reading:

http://www.medpagetoday.com/Blogs/TheMethodsMan/52171

The false detection rate is also known as the false positive rate (alpha) or type I error rate. It should not be confused with the False Discovery Rate (FDR), the expected fraction of claimed detections that are false.

The missed detection rate is also known as the false negative rate (beta) or type II error rate, and is related to the power of a test, 1-beta. For low-power tests, i.e. tests with a high missed detection rate, features that come out as highly significant are often false detections or strongly overestimate the effect. http://andrewgelman.com/2014/11/17/power-06-looks-like-get-used/

Do you do multiple tests on the same data?

For example, you have 10 measured variables, and try to find any correlations (N*(N-1)/2 combinations). Or you test many pixels for a significant detection.

You need to correct for the number of tests, so the threshold should be set much lower.

To do this, control the False Discovery Rate (FDR), for example with the Benjamini-Hochberg procedure.
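The Benjamini-Hochberg procedure is short enough to sketch directly. Sort the m p-values, find the largest rank k with p_(k) <= q*k/m, and reject the k hypotheses with the smallest p-values. The function name and the example p-values are made up for illustration; q is the FDR level you are willing to accept.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at false discovery rate q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank passing its threshold
        rejected[order[:k + 1]] = True     # reject everything up to that rank
    return rejected

# Hypothetical p-values from 10 tests.
p_example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
mask = benjamini_hochberg(p_example, q=0.05)
print(f"{mask.sum()} of {len(p_example)} tests survive the correction")
```

In this example five p-values lie below 0.05, but only the two smallest pass the corrected thresholds.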

II. If you do a KS test

Do you do a KS test in 2 or more dimensions?

The KS test is not valid there: in two or more dimensions the statistic is in general not distribution-free, so you would need to prove that it is in your application.

Instead, use Monte Carlo simulations.

Do you do a KS test of your data against a model that was fitted to the same data?

The KS test results are invalid then, and p-values are misleading. Do not use this!

Use Monte Carlo simulations instead; also look into bootstrapping and cross-validation.
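One way to set up such a simulation is a parametric bootstrap: simulate data sets from the fitted model, refit each one, and recompute the KS statistic, so that the fitting step is part of the null distribution. The sketch below assumes a Gaussian model and uses made-up data; the helper name is hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # stand-in for the real data

def ks_stat_fitted_normal(x):
    """KS statistic of x against a normal distribution fitted to x itself."""
    mu, sigma = x.mean(), x.std(ddof=1)
    return stats.kstest(x, "norm", args=(mu, sigma)).statistic

d_obs = ks_stat_fitted_normal(data)
mu_hat, sigma_hat = data.mean(), data.std(ddof=1)

# Null distribution: simulate from the fitted model, then refit every time.
n_sim = 1000
d_null = np.array([
    ks_stat_fitted_normal(rng.normal(mu_hat, sigma_hat, data.size))
    for _ in range(n_sim)
])
p_mc = (1 + np.sum(d_null >= d_obs)) / (1 + n_sim)

# The standard KS p-value ignores that the parameters were fitted.
p_naive = stats.kstest(data, "norm", args=(mu_hat, sigma_hat)).pvalue
print(f"Monte Carlo p-value: {p_mc:.3f}, naive KS p-value: {p_naive:.3f}")
```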

For the Anderson-Darling test (which is better than KS) and the Cramér–von Mises test, the above also applies.

Further reading:

https://asaip.psu.edu/Articles/beware-the-kolmogorov-smirnov-test

You do not detect anything with significance.

Don't be sad: A non-detection means you can place an upper limit on the effect size! Don't let the opportunity slip.

If you do a likelihood ratio to compare two models

Do you test near the boundary of a parameter? For example, do you test for the presence of an additional component?

The likelihood ratio is not chi-square distributed there.

You need to do simulations (using the null model) to predict the distribution of likelihood ratios.

Or: Use Bayesian model comparison, or an information criterion (AIC, WAIC, WBIC).
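A toy simulation shows how the boundary changes the distribution. Here the extra component is a template with a non-negative amplitude A, added to unit Gaussian noise, and the null model is A = 0. The Gaussian template and all numbers are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = np.linspace(-5, 5, 100)
template = np.exp(-0.5 * x**2)   # hypothetical component with fixed position and width

def likelihood_ratio(y):
    """2 ln(L_alt / L_null) for y = A * template + unit Gaussian noise, A >= 0."""
    a_hat = max(0.0, np.dot(y, template) / np.dot(template, template))
    chi2_null = np.sum(y**2)
    chi2_alt = np.sum((y - a_hat * template)**2)
    return chi2_null - chi2_alt

# Simulate data under the null model (noise only) and collect the ratios.
n_sim = 5000
lr_null = np.array([likelihood_ratio(rng.normal(size=x.size))
                    for _ in range(n_sim)])

frac_zero = np.mean(lr_null == 0.0)
threshold = stats.chi2.ppf(0.95, df=1)   # nominal 5% threshold from chi-square
frac_above = np.mean(lr_null > threshold)
print(f"ratios exactly zero: {frac_zero:.2f}")
print(f"ratios above the nominal 5% threshold: {frac_above:.3f}")
```

In this toy case about half of the simulated ratios are exactly zero, because the best-fit amplitude is pinned at the boundary, so a chi-square threshold does not correspond to the nominal false positive rate.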

Is one model a special case of the other model?

If not, the likelihood ratio is not chi-square distributed.

You need to do simulations (using the null model) to predict the distribution of likelihood ratios.

Or: Use Bayesian model comparison.

Further reading:

Statistics, Handle with Care: Detecting Multiple Model Components with the Likelihood Ratio Test http://adsabs.harvard.edu/abs/2002ApJ...571..545P

Bayesian model comparison and Information Criteria: http://adsabs.harvard.edu/abs/2008ConPh..49...71T

III. If you do Goodness-of-Fit

Do you use the chi^2 statistic based on binning chosen based on the data? (for example algorithms that provide adaptive binning)
Do you use the chi^2 statistic on non-linear models?
Do you use the chi^2 statistic on small datasets?

The chi^2 statistic is then not valid. See "Dos and don'ts of reduced chi-squared".

(What is good advice here?)

Instead of trying to "prove" that the fit is good, statisticians recommend visualisations such as the Q-Q plot, where you plot the quantiles of the data against the quantiles of the model and expect a straight line.
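The points of a Q-Q plot can be computed in a few lines. This sketch assumes a standard normal model and uses simulated data in place of real measurements; plotting is left out so that the example has no extra dependencies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
data = rng.normal(loc=0.0, scale=1.0, size=200)   # stand-in for e.g. fit residuals
model = stats.norm(loc=0.0, scale=1.0)            # assumed model distribution

# Empirical quantiles are the sorted data; evaluate the model quantiles
# at matching cumulative probabilities.
data_quantiles = np.sort(data)
probs = (np.arange(1, data.size + 1) - 0.5) / data.size
model_quantiles = model.ppf(probs)

# Plot model_quantiles (x) against data_quantiles (y); points close to the
# line y = x indicate agreement between data and model.
max_dev = np.max(np.abs(data_quantiles - model_quantiles))
print(f"largest deviation from the 1:1 line: {max_dev:.2f}")
```

Deviations in the tails are usually the most informative part of the plot, since that is where models tend to fail first.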

Another alternative to Goodness-of-Fit is to first make an over-fitting, empirical model, and then show that a physically motivated model has a similar likelihood (and given its higher prior should be preferred).

IV. If you do Bayesian inference

Do you check how sensitive your results are to the priors?

Try out other priors and see if the results are any different. [Assumed constants are also priors (delta function priors).]
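A small sketch of such a check, using a case where the posterior is known analytically: a fraction estimated from k successes in n trials with Beta priors, for which the posterior is Beta(a + k, b + n - k). The counts and the three priors are invented for illustration.

```python
from scipy import stats

k, n = 3, 20   # hypothetical: 3 detections among 20 objects

priors = {
    "flat Beta(1, 1)": (1.0, 1.0),
    "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
    "informative Beta(2, 8)": (2.0, 8.0),
}

intervals = {}
for name, (a, b) in priors.items():
    posterior = stats.beta(a + k, b + n - k)
    lo, med, hi = posterior.ppf([0.025, 0.5, 0.975])
    intervals[name] = (lo, med, hi)
    print(f"{name}: median {med:.3f}, 95% credible interval [{lo:.3f}, {hi:.3f}]")
```

If the intervals differ by more than you are comfortable with, the data alone do not constrain the parameter well and the choice of prior should be justified and reported.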

Do you use the "confidence interval" for a parameter constraint you derived with MCMC?

Credible interval is the correct word for Bayesian inference.

Credible intervals tell you that with a certain probability, the true value lies within this range.

Confidence intervals answer a different question, one we are usually less interested in. First, assume that the measured value is the true value. Then, if other data sets were generated, how far would the estimate typically lie from the true value? This is primarily a measure of how well a method that provides a single value works. It does not give the range of the parameter as constrained by the data.

In some cases, when you have flat priors, the confidence intervals and credible intervals are the same.

To get confidence intervals for a Bayesian method, create Monte Carlo simulations and fit them.
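As a rough sketch of this recipe, assume Gaussian data with known scatter and take the posterior mean under a Gaussian prior as the Bayesian point estimate. Data sets are simulated at the best-fit value and refitted with the same method, and the spread of the refitted values gives an approximate 68% interval. The prior, the data and the function name are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 1.0                        # known measurement scatter (assumed)
prior_mean, prior_sd = 0.0, 2.0    # hypothetical Gaussian prior on mu

def bayes_estimate(y):
    """Posterior mean of mu for Gaussian data and a Gaussian prior."""
    precision = y.size / sigma**2 + 1.0 / prior_sd**2
    return (y.sum() / sigma**2 + prior_mean / prior_sd**2) / precision

data = rng.normal(1.5, sigma, size=25)   # stand-in for the real data
mu_hat = bayes_estimate(data)

# Simulate data sets at the best-fit value and fit them with the same method.
n_sim = 2000
refits = np.array([bayes_estimate(rng.normal(mu_hat, sigma, data.size))
                   for _ in range(n_sim)])

# How far the estimate typically lies from the value used in the simulation.
err_lo, err_hi = np.percentile(refits - mu_hat, [16, 84])
ci = (mu_hat - err_hi, mu_hat - err_lo)
print(f"estimate {mu_hat:.2f}, approximate 68% interval [{ci[0]:.2f}, {ci[1]:.2f}]")
```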

Both credible and confidence intervals are only valid if the model is correct.

More common Statistical Mistakes in the Astronomical Literature: by Eric Feigelson http://bccp.berkeley.edu/beach_program/COTB14Feigelson3.pdf

Join the Astrostatistics Facebook group to learn more and ask questions.

Learning Material

Assembled with help from the Astrostatistics Facebook group.

Lectures:

Annual astrostatistics summer school at Penn State
Stefano Andreon's page (Click on teaching)
Brendon Brewer's teaching page (Click on STATS 331)
James M. Cordes's teaching page
James Long's SAMSI course
Jarle Brinchmann's Databases & Data mining course
Astrostatistics & Machine Learning course

Books:

"Introduction to Probability" (Dimitri Bertsekas): covers basics: random variable, combinatorics, derived distributions, what is a PDF, how to work with them, expectation values
"Probability and Statistics" (Morris H. DeGroot): covers classical approaches to hypothesis testing and frequentist analysis, so you can really understand, e.g., the classical tests, what p-values really are, why are they used and how and when they are useful.
"Scientific Inference" (Simon Vaughan): basic probability theory, statistical thinking, model and data representation in computers. likelihood function, graphical summaries, basic Monte Carlo.
"Data Analysis - A Bayesian Tutorial" (Sivia): building an intuition for how Bayesian statistics works
"Bayesian data analysis" (Gelman): nitty gritty details
"An Introduction to Statistical Learning with Applications in R" (James, Witten, Hastie & Tibshirani) available online
"Machine Learning: An Algorithmic Perspective" (Stephen Marsland)
"Statistics, Data Mining, and Machine Learning in Astronomy" (Zeljko Ivezic, Andy Connolly, Jake VanderPlas, Alex Gray) for scikit-learn and astroml
