Researchers are often confused about what can be inferred from significance tests. One problem occurs when people apply Bayesian intuitions to significance testing, two approaches that must be firmly separated.
Psychology and other disciplines have benefited enormously from having a rigorous procedure for extracting inferences from data. But can we do better than we do now? Regarding the Bayesian approach, the practical problems have largely been solved, so there is little to stop researchers from using the Bayesian approach in almost all circumstances.
Real research questions do not have pat answers, but see if, nonetheless, you have clear preferences. Almost all responses are consistent either with some statistical approach or with what a large section of researchers do in practice. There are three research scenarios by which you can see where your intuitions lie: (1) the stopping rule, (2) planned versus post hoc comparisons, and (3) multiple testing.
The orthodox logic of statistics, as developed by Neyman and Pearson, starts from the assumption that probabilities are long-run relative frequencies. This requires an indefinitely large series of events that constitutes the collective; the probability of some property (q) occurring is then the proportion of events in the collective with property q. Long-run relative frequencies do not apply to the truth of individual theories because theories are not collectives - each theory is simply true or false. So, when using this approach to probability, the null hypothesis of no population difference between two particular conditions cannot be assigned a probability - it is either true or false.
The logic of Neyman-Pearson (orthodox) statistics is to adopt decision procedures with known long-term error rates and then control those errors at acceptable levels. The error rate for false positives is called alpha, conventionally set at a significance level of .05, and the error rate for false negatives is called beta, where beta is 1 - power.
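These error rates are long-run frequencies over repeated experiments, and they can be made concrete with a small simulation. Below is a minimal sketch (Python, with illustrative values not taken from the text: alpha = .05, n = 30 per group, and a true standardized effect of 0.5) that estimates alpha and power for a two-sample t-test by Monte Carlo.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, true_effect, n_sims = 0.05, 30, 0.5, 10_000

def rejection_rate(effect):
    """Proportion of simulated experiments in which p < alpha."""
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n)        # control group
        b = rng.normal(effect, 1.0, n)     # treatment group
        _, p = stats.ttest_ind(a, b)
        rejections += p < alpha
    return rejections / n_sims

false_positive_rate = rejection_rate(0.0)   # long-run alpha (null is true)
power = rejection_rate(true_effect)         # 1 - beta (effect is present)
print(f"alpha ~ {false_positive_rate:.3f}, power ~ {power:.3f}, beta ~ {1 - power:.3f}")
```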
The probability of a theory being true given data can be symbolized as P(theory | data), and that is what many of us would like to know. But this is the inverse of what orthodox statistics tells us, namely P(data | theory).
When people directly infer a probability of the null hypothesis from a p value or significance level, they are violating the logic of Neyman-Pearson statistics. Such people want to know the probability of theories and hypotheses. Neyman-Pearson does not directly tell them that. Bayesian statistics starts from the premise that we can assign degrees of plausibility to theories, and what we want our data to do is tell us how to adjust these plausibilities.
In the Bayesian approach, probability applies to the truth of theories. Thus, we can answer questions about p(H), the probability of a hypothesis being true (the prior probability), and also p(H|D), the probability of the hypothesis given data (the posterior probability) - neither of which we can do with the orthodox approach. The probability of obtaining the exact data given the hypothesis is the likelihood.
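These three quantities are linked by Bayes' theorem, which in the notation above can be written as:

p(H|D) = p(D|H) x p(H) / p(D)

where p(D|H) is the likelihood and p(D) is the overall probability of the data.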
From this theorem (Bayes' theorem) comes the likelihood principle: all information relevant to inference contained in the data is provided by the likelihood. The likelihood is the probability of obtaining the exact data obtained given a hypothesis. This is different from a p value, which is the probability of obtaining the same or more extreme data given both a hypothesis and a decision procedure.
In orthodox statistics, p values change according to the decision procedure: under what conditions one would stop collecting data, whether or not the test is post hoc, and how many other tests one conducted. So, orthodox statistics violates the likelihood principle.
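The stopping-rule point can be illustrated with a short simulation. The sketch below (Python, with illustrative settings: testing after every 10 observations per group, up to 100 per group, with both groups drawn from the same population) shows that "peeking" and stopping as soon as p < .05 pushes the long-run false-positive rate well above the nominal .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, max_n, n_sims = 0.05, 100, 5_000

def peeks_until_significant():
    """One simulated experiment under the null: test after every 10 new
    observations per group and stop as soon as p < alpha."""
    a, b = [], []
    for _ in range(max_n // 10):
        a.extend(rng.normal(0.0, 1.0, 10))
        b.extend(rng.normal(0.0, 1.0, 10))
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            return True
    return False

false_positives = sum(peeks_until_significant() for _ in range(n_sims)) / n_sims
print(f"False-positive rate with optional stopping ~ {false_positives:.3f} (nominal .05)")
```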
The Bayes factor allows us to consider the contrast between the orthodox and Bayesian approaches in detail. The Bayes factor pits one theory against another. Once data are collected, we can calculate the likelihood for each theory. These likelihoods are things we want researchers to agree on.
posterior odds = B x prior odds
The Bayes factor B automatically gives a notion of sensitivity; it directly distinguishes data supporting the null from data uninformative about whether the null or your theory is supported.
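For two point hypotheses the Bayes factor is simply a likelihood ratio, and the posterior odds follow from the formula above. Below is a minimal sketch (Python, with made-up numbers: an observed mean difference of 5 with standard error 2, and an alternative that predicts an effect of exactly 5).

```python
from scipy import stats

# Hypothetical data summary: observed sample mean difference of 5, standard error 2.
mean_diff, se = 5.0, 2.0

# Likelihood of the data under the null (effect = 0) and under a point
# alternative predicting an effect of exactly 5 (both illustrative choices).
likelihood_null = stats.norm.pdf(mean_diff, loc=0.0, scale=se)
likelihood_alt = stats.norm.pdf(mean_diff, loc=5.0, scale=se)

B = likelihood_alt / likelihood_null   # Bayes factor: alternative over null
prior_odds = 1.0                       # theories considered equally plausible beforehand
posterior_odds = B * prior_odds
print(f"B = {B:.2f}, posterior odds = {posterior_odds:.2f}")
```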
One definition of rationality is having sufficient justification for one's beliefs; another is that it is a matter of having subjected one's beliefs to critical scrutiny. Popper followed the latter definition and termed it critical rationalism. In this view there is never a sufficient justification for a given belief, because knowledge has no absolute foundation. Critical rationalism bears some striking similarities to the orthodox approach to statistical inference: in this view, statistical inference cannot tell you how confident to be in different hypotheses; it only gives conventions for behavioral acceptance or rejection of different hypotheses, which, given a relevant statistical model, results in controlled preset long-term error rates.
One version of degrees of belief is subjective probability: a personal conviction in an opinion. When probabilities of different propositions form part of the inferential procedure we use in deriving conclusions from data, we need to make sure that the procedure is fair. There have also been attempts to specify objective probabilities that follow from the informational specification of a problem.
In sum, one notion of rationality is having sufficient justification for one's beliefs. If one can assign numerical, continuous degrees of justification to beliefs, then some simple minimal desiderata lead to the likelihood principle of inference. Hypothesis testing violates the likelihood principle, indicating that some of the most deeply held intuitions we train ourselves to have as orthodox users of statistics are irrational by a key intuitive notion of rationality.
The typical use of statistics is often not influenced by a factor that is logically relevant to inference: the effect size. A problem in many areas is that researchers have been relating theories to statistics by asking the wrong question - 'Is there a difference?' - with the only acceptable answers being 'yes' and 'withhold judgment'.
Neyman developed two specific measures of sensitivity: power and confidence intervals. A confidence interval is the set of population values that the data are consistent with. It may include zero but must include other values too. However, theories and practical questions generally specify, even if vaguely, relevant effect sizes. And they must, if predictions of a difference are ever to be tested.
Effect size is very important in the Neyman-Pearson approach: one must specify the sort of effect one predicts in order to calculate power. On the other hand, Fisherian significance testing leads people to ignore effect sizes. By contrast, one must specify what sort of effect sizes a theory predicts to calculate a Bayes factor. Despite some attempts to encourage researchers to use confidence intervals, their use has not taken off. Confidence intervals of some sort would deal with many problems.
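As an illustration of what a confidence interval delivers, here is a minimal sketch (Python, with hypothetical difference scores) that computes a 95% confidence interval for a mean difference; the interval is the set of population values the data are consistent with.

```python
import numpy as np
from scipy import stats

# Hypothetical difference scores from a within-subjects experiment.
diffs = np.array([2.1, -0.4, 1.8, 3.0, 0.9, 1.2, -0.1, 2.5, 1.7, 0.6])

mean = diffs.mean()
se = stats.sem(diffs)
# 95% CI based on the t distribution with n - 1 degrees of freedom.
ci_low, ci_high = stats.t.interval(0.95, len(diffs) - 1, loc=mean, scale=se)

# Values outside this interval would be rejected at the 5% level.
print(f"mean difference = {mean:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```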
To calculate a Bayes factor in support of a theory, one has to specify how probable different effect sizes are, given the theory. In terms of data, the Bayes factor calculator asks for a mean together with its standard error. In terms of the predictions of the theory, one has to decide what range of effects is relevant to the theory. The hard part is determining the best way to represent those predictions: which distribution of effect sizes, and with what parameters?
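A minimal sketch of such a calculation is shown below (Python; all numbers and the choice of a half-normal prior over effect sizes are illustrative assumptions, not taken from the text). The theory's predictions are represented as a distribution over effect sizes, the probability of the data under the theory is the likelihood averaged over that distribution, and the Bayes factor compares this with the probability of the data under the null.

```python
import numpy as np
from scipy import stats

# Data summary the calculator asks for: a sample mean with its standard error
# (illustrative numbers).
sample_mean, se = 5.0, 2.0

# One possible representation of the theory's predictions: a half-normal
# distribution over effect sizes, scaled by a roughly expected effect of 5.
effects = np.linspace(0.0, 30.0, 3000)
step = effects[1] - effects[0]
prior = stats.halfnorm.pdf(effects, scale=5.0)
prior /= (prior * step).sum()                     # normalise over the grid

# Likelihood of the observed mean at each candidate effect size.
likelihood = stats.norm.pdf(sample_mean, loc=effects, scale=se)

# p(data | theory): likelihood averaged over the predicted effect sizes.
p_data_theory = (likelihood * prior * step).sum()
# p(data | null): likelihood at an effect of exactly zero.
p_data_null = stats.norm.pdf(sample_mean, loc=0.0, scale=se)

bayes_factor = p_data_theory / p_data_null
print(f"Bayes factor (theory over null) ~ {bayes_factor:.2f}")
```

Different but reasonable choices for the prior over effect sizes (uniform over a range, normal centred on a predicted value) will give somewhat different Bayes factors, which is exactly the issue discussed next.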
Some researchers have suggested a 'default' Bayes factor to be used on any data, where the null hypothesis is compared with a default theory - namely, the theory that effects may occur in either direction, scaled to a large standardized effect size. But, as mentioned before, the Bayes factor is just one form of Bayesian inference - namely, a method for evaluating one theory against another.
With the Bayes factor, one does not have to worry about corrections for multiple testing, stopping rules, or planned versus post hoc comparisons. But, you might insist, all these rules in orthodox statistics were there to stop cheating. For example, when different assumptions concerning the predictions of a theory lead to different Bayes factors, what is to stop a researcher from picking the best one?
Strictly, every Bayes factor is a completely accurate indication of the support the data provide for one theory over another, where the theories are defined by the precise predictions they make, as we have represented them. The crucial question is which of these representations best matches the theory as the researcher has described it and related it to the existing literature. Note that there is nothing wrong with finding out which ways of representing predictions produce especially high Bayes factors. This is not cheating but determining possible constraints on theory.
The strengths of Bayesian analyses are also their weaknesses.
Ultimately, the issue is about what is more important to us: using a procedure with known long-term error rates, or knowing the degree of support for our theory (the amount by which we should change our conviction in a theory). If we want to know the degree of evidence or support for our theory, then our reliance on orthodox statistics is irrational.