What Is the Null Hypothesis? What Is Its Importance in Research?
Scientists begin their research with a hypothesis that a relationship of some kind exists between variables. The null hypothesis is the opposite: it states that no such relationship exists. The null hypothesis may seem unexciting, but it is a vital part of research. In this article, we discuss what the null hypothesis is, how to use it, and why it improves your statistical analyses.
What is the Null Hypothesis?
The null hypothesis can be tested using statistical analysis and is often written as H0 (read as “H-naught”). Before drawing conclusions, you determine how likely the observed sample relationship would be if H0 were true. Researchers use a significance test to assess how likely it is that results like theirs would occur by chance if H0 were true.
The null hypothesis is not the same as an alternative hypothesis. An alternative hypothesis states that there is a relationship between two variables, while H0 posits the opposite. Let us consider the following example.
A researcher wants to discover the relationship between exercise frequency and appetite. She asks:
Q: Does increased exercise frequency lead to increased appetite?
Alternative hypothesis: Increased exercise frequency leads to increased appetite.
H0 (assuming no relationship between the two variables): Increased exercise frequency does not lead to increased appetite.
Let us look at another example of how to state the null hypothesis:
Q: Does insufficient sleep lead to an increased risk of heart attack among men over age 50?
H0: The amount of sleep men over age 50 get does not increase their risk of heart attack.
Why is Null Hypothesis Important?
Many scientists neglect the null hypothesis in their testing. As the examples above show, H0 is often assumed to be simply the opposite of the hypothesis being tested. However, it is good practice to state H0 explicitly and word it carefully. To understand why, let us return to our previous example. In this case,
Alternative hypothesis: Getting too little sleep leads to an increased risk of heart attack among men over age 50.
H 0 : The amount of sleep men over age 50 get has no effect on their risk of heart attack.
Note that this H0 differs from the one in our first example. What if we conducted this experiment and found that neither H0 nor the alternative hypothesis was supported? The experiment would be considered invalid. Take our original H0 in this case: “the amount of sleep men over age 50 get does not increase their risk of heart attack.” If this H0 is found to be untrue, and so is the alternative, we can still consider a third hypothesis: perhaps getting insufficient sleep actually decreases the risk of a heart attack among men over age 50. Because we tested H0, we have information that we would not have had if we had neglected it.
Do I Really Need to Test It?
The biggest problem with the null hypothesis is that many scientists see accepting it as a failure of the experiment, believing they have not proven anything of value. However, as we have learned from the replication crisis, negative results are just as important as positive ones. While they may seem less appealing to publishers, they give the scientific community important information about correlations that do or do not exist. In this way, they drive science forward and prevent the waste of resources.
Do you test for the null hypothesis? Why or why not? Let us know your thoughts in the comments below.
Why Does Research Require a Null Hypothesis?
What is a null hypothesis and why does research need one?
Every researcher is required to establish hypotheses in order to predict, tentatively, the outcome of the research (Leedy & Ormrod, 2016). A null hypothesis states that any observed result is “the result of chance alone”: there are no patterns, differences, or relationships between variables (Leedy & Ormrod, 2016). Whether the outcome is positive or negative, stating a null hypothesis in addition to your alternative hypothesis ensures that your research (and you as the researcher) is not one-sided (Bland & Altman, 1994). In other words, you remain open to the possibility that a difference between the variables may or may not exist, and that the outcome of the research is due either to a real cause (alternative hypothesis) or to chance (null hypothesis) (Leedy & Ormrod, 2016; Pierce, 2008; Bland & Altman, 1994).
After collecting data, the hypotheses must be tested in order to reach a conclusion (Daniel & Cross, 2013). The null hypothesis is tested by asking whether the results could be “due to chance alone” or whether the data reasonably suggest that something (a factor, a reason, or another variable) in the studied environment/population leads to a difference, relationship, or pattern (Leedy & Ormrod, 2016; Pierce, 2008). The null hypothesis is used to draw conclusions by comparing the collected data with the outcome expected under chance alone (Leedy & Ormrod, 2016). When the result is attributable to “something other than chance,” the null hypothesis is rejected and the alternative hypothesis comes into play, because the data have indirectly led us to support it (Leedy & Ormrod, 2016). The alternative hypothesis may be the one the researcher wants to be accepted; however, it “can only be accepted” after the collected data show that the null hypothesis “has been rejected” (Pierce, 2008).
Bland, J. M., & Altman, D. G. (1994). Statistics Notes: One and two sided tests of significance. British Medical Journal (BMJ), 309 , 248-248. doi:10.1136/bmj.309.6949.248
Daniel, W. W., & Cross, C. L. (2013). Chapter 7: Hypothesis Testing. In Biostatistics: A Foundation for Analysis in the Health Sciences (10th ed., pp. 214-303). Hoboken, NJ: Wiley. Retrieved February 13, 2018, from https://msph1blog.files.wordpress.com/2016/10/biostatistics-_daniel-10th1.pdf
Leedy, P. D., & Ormrod, J. E. (2016). Practical Research: Planning and Design (11th ed.). NJ: Pearson Education. Retrieved February 13, 2018, from https://digitalbookshelf.argosy.edu/#/books/9781323328798/cfi/6/6!/4/2/2/48@0:0
Pierce, T. (2008, September). Independent samples t-test. Retrieved February 13, 2018, from http://www.radford.edu/~tpierce/610%20files/Data%20Analysis%20for%20Professional%20Psychologists/Independent%20samples%20t-test%2010-02-09.pdf
Importance of Null Hypothesis in Research
The null hypothesis is a fundamental concept in statistical analysis and research methodology. It forms the basis of many statistical tests and is a critical component in the process of scientific discovery. But what is the meaning of the null hypothesis, and why is it so important in research? Let's delve into this topic to gain a comprehensive understanding.
What is the Null Hypothesis?
The null hypothesis, often denoted as H0, is a statement in statistical inference asserting that no statistical significance exists in a set of observed data. In other words, it assumes that any difference or effect you see in a set of data is due to chance.
The null hypothesis is the initial claim that researchers set out to test. It's a starting point that allows us to test specific relationships between variables in a study. The null hypothesis is not necessarily a claim that researchers believe is true, but rather, a claim that is assumed to be true for the purpose of testing statistical significance.
Formulating the Null Hypothesis
When formulating a null hypothesis, it's important to remember that it should be a clear, concise, and testable statement. It should also make a claim about the population parameters for the variables under study, not about the sample statistics.
For example, if a researcher wants to test whether a new drug has an effect on a disease, the null hypothesis might be "The new drug has no effect on the disease." This is a claim that can be tested by collecting and analyzing data.
The Importance of the Null Hypothesis in Research
The null hypothesis plays a crucial role in statistical hypothesis testing, a standard procedure in scientific research. It provides a benchmark against which the alternative hypothesis is tested and helps control for the effects of random variation. Here are the prominent roles the null hypothesis plays in research.
Foundation of Research Design
In crafting a robust research design, formulating a clear null hypothesis is paramount. It defines the scope of the study, delineates variables, and establishes the groundwork for subsequent analyses.
Statistical Testing Reliance
Null hypothesis testing, a common statistical method, relies on comparing observed data to what would be expected under the assumption of no effect. This statistical scrutiny is integral to drawing valid conclusions.
Clarifying Research Objectives
The null hypothesis sharpens the focus of the research objectives. Defining the absence of an anticipated effect compels researchers to construct precise and testable hypotheses aligned with their inquiries.
Validating the Effects of the Study
Null hypothesis is important because it provides a framework for proving or disproving that something has an effect. By assuming that the null hypothesis is true, researchers can test the validity of their alternative hypothesis. If the data collected provides enough evidence to reject the null hypothesis, it suggests that the alternative hypothesis may be true.
Role in Statistical Significance
The null hypothesis is central to the concept of statistical significance. If the data collected in a study can be considered unlikely under the null hypothesis, then the null hypothesis is rejected, and the result is deemed statistically significant.
Statistical significance, however, does not necessarily imply practical significance. A result can be statistically significant yet of little practical importance, depending on the context and the specific research question.
Testing the Null Hypothesis
Testing the null hypothesis involves collecting data and calculating a test statistic. The test statistic is then compared to a critical value, which is determined based on the significance level, the type of test being conducted, and the degrees of freedom.
If the test statistic is more extreme than the critical value, the null hypothesis is rejected. If not, there is not enough evidence to reject the null hypothesis. This does not prove that the null hypothesis is true, but rather, that there is not enough evidence to suggest that it is false.
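As an illustration of this procedure, here is a short Python sketch (the sample values are invented for illustration) that computes a one-sample t statistic by hand and compares it with the critical value obtained from `scipy.stats`:

```python
import numpy as np
from scipy import stats

# Hypothetical sample: battery lifetimes in hours; H0: population mean = 100
sample = np.array([102.1, 98.4, 105.3, 101.7, 99.2, 104.8, 103.5, 100.9])
mu0 = 100.0       # value claimed by the null hypothesis
alpha = 0.05      # significance level

n = len(sample)
# Test statistic: distance of the sample mean from mu0, in standard errors
t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Two-tailed critical value for n - 1 degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

if abs(t_stat) > t_crit:
    decision = "reject H0"
else:
    decision = "fail to reject H0"
```

Here the test statistic falls short of the critical value, so the data do not provide enough evidence against H0; as the text notes, this does not prove H0 true.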
Types of Errors in Hypothesis Testing
When testing the null hypothesis, two types of errors can occur: Type I and Type II errors. A Type I error occurs when the null hypothesis is true but is rejected. A Type II error occurs when the null hypothesis is false but is not rejected.
Researchers must consider the potential for these errors when designing their studies and interpreting their results. The risk of these errors can be controlled to some extent by choosing an appropriate significance level and by increasing the sample size.
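The meaning of a Type I error can be demonstrated with a quick simulation. In this sketch (with hypothetical population values), we repeatedly sample from a population where the null hypothesis really is true and count how often a t-test wrongly rejects it; the false-rejection rate should hover near the chosen significance level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments = 5_000

# Draw samples from a population where H0 is true (the mean really is 50),
# then count how often the test wrongly rejects H0 (a Type I error).
false_rejections = 0
for _ in range(n_experiments):
    sample = rng.normal(loc=50, scale=10, size=30)
    _, p_value = stats.ttest_1samp(sample, popmean=50)
    if p_value <= alpha:
        false_rejections += 1

type_i_rate = false_rejections / n_experiments
# The observed rate should be close to alpha (about 0.05)
```

Lowering alpha reduces this rate, but, as the text notes, it raises the chance of a Type II error unless the sample size is increased.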
In conclusion, the null hypothesis is a fundamental concept in statistical analysis and research methodology. It provides a benchmark for testing statistical significance and helps control for the effects of random variation. While it is a simple concept, understanding the null hypothesis and its role in research is crucial for any researcher or statistician.
By formulating a clear and testable null hypothesis, researchers can design their studies in a way that allows them to make meaningful inferences about the relationships between variables. Whether the null hypothesis is ultimately rejected or not, it plays a crucial role in advancing scientific knowledge and understanding.
Testing Hypotheses
A hypothesis is a statement that we are trying to prove or disprove. It is used to express the relationship between variables and whether this relationship is significant. It is specific and offers a prediction on the results of your research question.
Your research question will lead you to develop a hypothesis, which is why your research question needs to be specific and clear.
The hypothesis will then guide you to the most appropriate techniques to answer the question. Hypotheses reflect the literature and theories on which you are basing them. They need to be testable (i.e. measurable and practical).
Null hypothesis (H0) is the proposition that there will not be a relationship between the variables you are looking at (i.e. any differences are due to chance). It always refers to the population. (Usually we don't believe this to be true.)
e.g. There is no difference in instances of illegal drug use between teenagers who are members of a gang and those who are not.
Alternative hypothesis (HA or H1): this is sometimes called the research hypothesis or experimental hypothesis. It is the proposition that there will be a relationship: a statement of inequality between the variables you are interested in. It always refers to the sample. It is usually a declaration rather than a question and is clear, to the point, and specific.
e.g. The instances of illegal drug use of teenagers who are members of a gang is different than the instances of illegal drug use of teenagers who are not gang members.
A non-directional research hypothesis - reflects an expected difference between groups but does not specify the direction of this difference (see two-tailed test).
A directional research hypothesis - reflects an expected difference between groups but does specify the direction of this difference. (see one-tailed test)
e.g. The instances of illegal drug use by teenagers who are members of a gang will be higher than the instances of illegal drug use by teenagers who are not gang members.
Then the process of testing is to ascertain which hypothesis to believe.
It is usually easier to prove something as untrue rather than true, so looking at the null hypothesis is the usual starting point.
The process of examining the null hypothesis in light of evidence from the sample is called significance testing . It is a way of establishing a range of values in which we can establish whether the null hypothesis is true or false.
The debate over hypothesis testing
There has been discussion over whether the scientific method employed in traditional hypothesis testing is appropriate.
See below for some articles that discuss this:
- Gill, J. (1999) 'The insignificance of null hypothesis testing', Politics Research Quarterly , 52(3), pp. 647-674.
- Wainer, H. and Robinson, D.H. (2003) 'Shaping up the practice of null hypothesis significance testing', Educational Researcher, 32(7), pp.22-30.
- Ferguson, C.J. and Heene, M. (2012) 'A vast graveyard of undead theories: publication bias and psychological science's aversion to the null', Perspectives on Psychological Science, 7(6), pp. 555-561.
Taken from: Salkind, N.J. (2017) Statistics for people who (think they) hate statistics. 6th edn. London: SAGE pp. 144-145.
A significance level defines the point at which your sample evidence contradicts your null hypothesis strongly enough that you can reject it. It is the probability of rejecting the null hypothesis when it is really true.
e.g. a significance level of 0.05 indicates that there is a 5% (or 1 in 20) risk of deciding that there is an effect when in fact there is none.
The lower the significance level that you set, then the evidence from the sample has to be stronger to be able to reject the null hypothesis.
N.B. - it is important that you set the significance level before you carry out your study and analysis.
Using Confidence Intervals
It is possible to test the significance of your null hypothesis using a Confidence Interval (see under the Samples and population tab):
- if the confidence interval does not contain the value predicted by the null hypothesis, we can reject it and accept the alternative hypothesis
The test statistic
Another commonly used approach is to calculate a test statistic and compare it with critical values:
- Write down your null and alternative hypothesis
- Find the sample statistic (e.g.the mean of your sample)
- Calculate the test statistic Z score (see under Measures of spread or dispersion and Statistical tests - parametric). In this case the sample mean is compared to the population mean (assumed from the null hypothesis) and the standard error (see under Samples and population) is used rather than the standard deviation.
- Compare the test statistic with the critical values (e.g. plus or minus 1.96 for 5% significance)
- Draw a conclusion about the hypotheses: does the calculated z value lie in the critical range, i.e. above 1.96 or below -1.96? If it does, we can reject the null hypothesis. This indicates that the results are significant (an effect has been detected): if there were no difference in the population, a result like the one observed would be highly unlikely, so we can reject the null hypothesis.
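The steps above can be sketched in Python. All the numbers here are invented for illustration (an assumed population mean, a known population standard deviation, and a sample mean):

```python
import math

# H0 says the population mean is 100; the population standard deviation
# is known to be 15; our sample of 64 observations has a mean of 104.
pop_mean, pop_sd, n, sample_mean = 100, 15, 64, 104

# Standard error of the mean (pop_sd / sqrt(n) = 15 / 8 = 1.875)
standard_error = pop_sd / math.sqrt(n)

# Test statistic: how many standard errors the sample mean is from pop_mean
z = (sample_mean - pop_mean) / standard_error

# Compare with the 5% two-tailed critical values of plus or minus 1.96
significant = abs(z) > 1.96
```

Here z is about 2.13, which lies beyond 1.96, so at the 5% level we would reject the null hypothesis.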
Type I error - this is the chance of wrongly rejecting the null hypothesis even though it is actually true, e.g. by using a 5% p level you would expect the null hypothesis to be rejected about 5% of the time when the null hypothesis is true. You could set a more stringent p level such as 1% (or 1 in 100) to be more certain of not making a Type I error. This, however, makes another type of error (Type II) more likely.
Type II error - this is where there is an effect, but the p value you obtain is non-significant, so you don't detect this effect.
One-tailed tests - where we know in which direction (e.g. larger or smaller) the difference between sample and population will be. It is a directional hypothesis.
Two-tailed tests - where we are looking at whether there is a difference between sample and population. This difference could be larger or smaller. This is a non-directional hypothesis.
If the difference is in the direction you have predicted (i.e. a one-tailed test) it is easier to get a significant result. Though there are arguments against using a one-tailed test (Wright and London, 2009, p. 98-99)*
*Wright, D. B. & London, K. (2009) First (and second) steps in statistics . 2nd edn. London: SAGE.
N.B. - think of the ‘tails’ as the regions at the far ends of a normal distribution. For a two-tailed test with a significance level of 0.05, a probability of 0.025 lies at one end of the distribution and the other 0.025 at the other end. It is the values in these ‘critical’ extreme regions that lead us to reject the null hypothesis and claim that there has been an effect.
Degrees of freedom (df) is a rather difficult mathematical concept, but it is needed to calculate the significance of certain statistical tests, such as the t-test, ANOVA and chi-squared test.
It is broadly defined as the number of "observations" (pieces of information) in the data that are free to vary when estimating statistical parameters. (Taken from Minitab Blog ).
The higher the degrees of freedom, the more powerful and precise your estimates of the parameter (population) will be.
Typically, for a 1-sample t-test it is considered as the number of values in your sample minus 1.
For chi-squared tests with a table of rows and columns the rule is:
(Number of rows minus 1) times (number of columns minus 1)
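A minimal sketch of this rule, using a hypothetical 2 x 3 contingency table and `scipy.stats.chi2_contingency` (which reports the degrees of freedom it used):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 3 contingency table (e.g. two groups, three outcomes);
# the counts are invented for illustration.
observed = np.array([[20, 15, 25],
                     [30, 25, 35]])

chi2, p, dof, expected = chi2_contingency(observed)

rows, cols = observed.shape
# The reported dof matches (rows - 1) * (cols - 1) = 1 * 2 = 2
assert dof == (rows - 1) * (cols - 1)
```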
An accessible example to illustrate the principle of degrees of freedom, using chocolates:
- You have seven chocolates in a box, each being a different type, e.g. truffle, coffee cream, caramel cluster, fudge, strawberry dream, hazelnut whirl, toffee.
- You are being good and intend to eat only one chocolate each day of the week.
- On the first day, you can choose to eat any one of the 7 chocolate types - you have a choice from all 7.
- On the second day, you can choose from the 6 remaining chocolates, on day 3 you can choose from 5 chocolates, and so on.
- On the sixth day you have a choice of the remaining 2 chocolates you haven't eaten that week.
- However on the seventh day - you haven't really got any choice of chocolate - it has got to be the one you have left in your box.
- You had 7 - 1 = 6 days of “chocolate” freedom, in which the chocolate you ate could vary!
Null Hypothesis Examples
In statistical analysis, the null hypothesis assumes there is no meaningful relationship between two variables. Testing the null hypothesis can tell you whether your results are due to the effect of manipulating an independent variable or due to chance. It's often used in conjunction with an alternative hypothesis, which assumes there is, in fact, a relationship between two variables.
The null hypothesis is among the easiest hypotheses to test using statistical analysis, making it perhaps the most valuable hypothesis for the scientific method. By evaluating a null hypothesis in addition to another hypothesis, researchers can support their conclusions with a higher level of confidence. Below are examples of how you might formulate a null hypothesis to fit certain questions.
What Is the Null Hypothesis?
The null hypothesis states there is no relationship between the measured phenomenon (the dependent variable ) and the independent variable , which is the variable an experimenter typically controls or changes. You do not need to believe that the null hypothesis is true to test it. On the contrary, you will likely suspect there is a relationship between a set of variables. One way to prove that this is the case is to reject the null hypothesis. Rejecting a hypothesis does not mean an experiment was "bad" or that it didn't produce results. In fact, it is often one of the first steps toward further inquiry.
To distinguish it from other hypotheses , the null hypothesis is written as H 0 (which is read as “H-nought,” "H-null," or "H-zero"). A significance test is used to determine the likelihood that the results supporting the null hypothesis are not due to chance. A confidence level of 95% or 99% is common. Keep in mind, even if the confidence level is high, there is still a small chance the null hypothesis is not true, perhaps because the experimenter did not account for a critical factor or because of chance. This is one reason why it's important to repeat experiments.
Examples of the Null Hypothesis
To write a null hypothesis, first start by asking a question. Rephrase that question in a form that assumes no relationship between the variables. In other words, assume a treatment has no effect. Write your hypothesis in a way that reflects this.
| Question | Null Hypothesis |
| --- | --- |
| Are teens better at math than adults? | Age has no effect on mathematical ability. |
| Does taking aspirin every day reduce the chance of having a heart attack? | Taking aspirin daily does not affect heart attack risk. |
| Do teens use cell phones to access the internet more than adults? | Age has no effect on how cell phones are used for internet access. |
| Do cats care about the color of their food? | Cats express no food preference based on color. |
| Does chewing willow bark relieve pain? | There is no difference in pain relief after chewing willow bark versus taking a placebo. |
Other Types of Hypotheses
In addition to the null hypothesis, the alternative hypothesis is also a staple in traditional significance tests . It's essentially the opposite of the null hypothesis because it assumes the claim in question is true. For the first item in the table above, for example, an alternative hypothesis might be "Age does have an effect on mathematical ability."
Key Takeaways
- In hypothesis testing, the null hypothesis assumes no relationship between two variables, providing a baseline for statistical analysis.
- Rejecting the null hypothesis suggests there is evidence of a relationship between variables.
- By formulating a null hypothesis, researchers can systematically test assumptions and draw more reliable conclusions from their experiments.
Null Hypothesis: Definition, Rejecting & Examples
By Jim Frost
What is a Null Hypothesis?
The null hypothesis in statistics states that there is no difference between groups or no relationship between variables. It is one of two mutually exclusive hypotheses about a population in a hypothesis test.
- Null Hypothesis H 0 : No effect exists in the population.
- Alternative Hypothesis H A : The effect exists in the population.
In every study or experiment, researchers assess an effect or relationship. This effect can be the effectiveness of a new drug, building material, or other intervention that has benefits. There is a benefit or connection that the researchers hope to identify. Unfortunately, no effect may exist. In statistics, we call this lack of an effect the null hypothesis. Researchers assume that this notion of no effect is correct until they have enough evidence to suggest otherwise, similar to how a trial presumes innocence.
In this context, the analysts don’t necessarily believe the null hypothesis is correct. In fact, they typically want to reject it because that leads to more exciting finds about an effect or relationship. The new vaccine works!
You can think of it as the default theory that requires sufficiently strong evidence to reject. Like a prosecutor, researchers must collect sufficient evidence to overturn the presumption of no effect. Investigators must work hard to set up a study and a data collection system to obtain evidence that can reject the null hypothesis.
Related post : What is an Effect in Statistics?
Null Hypothesis Examples
Null hypotheses start as research questions that the investigator rephrases as a statement indicating there is no effect or relationship.
| Research Question | Null Hypothesis |
| --- | --- |
| Does the vaccine prevent infections? | The vaccine does not affect the infection rate. |
| Does the new additive increase product strength? | The additive does not affect mean product strength. |
| Does the exercise intervention increase bone mineral density? | The intervention does not affect bone mineral density. |
| As screen time increases, does test performance decrease? | There is no relationship between screen time and test performance. |
After reading these examples, you might think they’re a bit boring and pointless. However, the key is to remember that the null hypothesis defines the condition that the researchers need to discredit before suggesting an effect exists.
Let’s see how you reject the null hypothesis and get to those more exciting findings!
When to Reject the Null Hypothesis
So, you want to reject the null hypothesis, but how and when can you do that? To start, you’ll need to perform a statistical test on your data. The following is an overview of performing a study that uses a hypothesis test.
The first step is to devise a research question and the appropriate null hypothesis. After that, the investigators need to formulate an experimental design and data collection procedures that will allow them to gather data that can answer the research question. Then they collect the data. For more information about designing a scientific study that uses statistics, read my post 5 Steps for Conducting Studies with Statistics .
After data collection is complete, statistics and hypothesis testing enter the picture. Hypothesis testing takes your sample data and evaluates how consistent they are with the null hypothesis. The p-value is a crucial part of the statistical results because it quantifies how strongly the sample data contradict the null hypothesis.
When the sample data provide sufficient evidence, you can reject the null hypothesis. In a hypothesis test, this process involves comparing the p-value to your significance level .
Rejecting the Null Hypothesis
Reject the null hypothesis when the p-value is less than or equal to your significance level. Your sample data favor the alternative hypothesis, which suggests that the effect exists in the population. For a mnemonic device, remember—when the p-value is low, the null must go!
When you can reject the null hypothesis, your results are statistically significant. Learn more about Statistical Significance: Definition & Meaning .
Failing to Reject the Null Hypothesis
Conversely, when the p-value is greater than your significance level, you fail to reject the null hypothesis. The sample data provide insufficient evidence to conclude that the effect exists in the population. When the p-value is high, the null must fly!
Note that failing to reject the null is not the same as proving it. For more information about the difference, read my post about Failing to Reject the Null .
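The reject / fail-to-reject rule is mechanical enough to express in a few lines of Python. This is only an illustrative sketch of the decision rule described above (the function name is made up for this example, not taken from any library):

```python
def hypothesis_test_decision(p_value, alpha=0.05):
    """Apply the standard decision rule: reject H0 when p-value <= alpha.

    Failing to reject H0 does not prove H0; it only means the sample
    provided insufficient evidence against it.
    """
    if p_value <= alpha:
        return "reject H0"        # statistically significant result
    return "fail to reject H0"    # insufficient evidence
```

For instance, `hypothesis_test_decision(0.003)` returns `"reject H0"` (when the p-value is low, the null must go), while `hypothesis_test_decision(0.27)` returns `"fail to reject H0"`.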
That’s a very general look at the process. But I hope you can see how the path to more exciting findings depends on being able to rule out the less exciting null hypothesis that states there’s nothing to see here!
Let’s move on to learning how to write the null hypothesis for different types of effects, relationships, and tests.
Related posts: How Hypothesis Tests Work and Interpreting P-values
How to Write a Null Hypothesis
The null hypothesis varies by the type of statistic and hypothesis test. Remember that inferential statistics use samples to draw conclusions about populations. Consequently, when you write a null hypothesis, it must make a claim about the relevant population parameter . Further, that claim usually indicates that the effect does not exist in the population. Below are typical examples of writing a null hypothesis for various parameters and hypothesis tests.
Related posts: Descriptive vs. Inferential Statistics and Populations, Parameters, and Samples in Inferential Statistics
Group Means
T-tests and ANOVA assess the differences between group means. For these tests, the null hypothesis states that there is no difference between group means in the population. In other words, the experimental conditions that define the groups do not affect the mean outcome. Mu (µ) is the population parameter for the mean, and you’ll need to include it in the statement for this type of study.
For example, an experiment compares the mean bone density changes for a new osteoporosis medication. The control group does not receive the medicine, while the treatment group does. The null states that the mean bone density changes for the control and treatment groups are equal.
- Null Hypothesis H₀: Group means are equal in the population: µ₁ = µ₂, or µ₁ – µ₂ = 0
- Alternative Hypothesis Hₐ: Group means are not equal in the population: µ₁ ≠ µ₂, or µ₁ – µ₂ ≠ 0.
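As a sketch, the pooled two-sample t statistic behind this kind of test can be computed directly. The bone density changes below are hypothetical numbers invented for illustration, not data from a real trial:

```python
import math

def pooled_t_statistic(group1, group2):
    """Two-sample t statistic for H0: mu1 = mu2 (equal variances assumed)."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = sum(group1) / n1, sum(group2) / n2
    var1 = sum((v - mean1) ** 2 for v in group1) / (n1 - 1)  # sample variance
    var2 = sum((v - mean2) ** 2 for v in group2) / (n2 - 1)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Hypothetical bone density changes (treatment vs. control)
treatment = [1.2, 0.8, 1.5, 1.1, 0.9]
control = [0.3, 0.5, 0.1, 0.4, 0.2]
t = pooled_t_statistic(treatment, control)
```

You would then compare |t| to the critical value of the t-distribution with n1 + n2 − 2 degrees of freedom, or let statistical software convert the statistic to a p-value.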
Group Proportions
Proportions tests assess the differences between group proportions. For these tests, the null hypothesis states that there is no difference between group proportions. Again, the experimental conditions did not affect the proportion of events in the groups. P is the population proportion parameter that you’ll need to include.
For example, a vaccine experiment compares the infection rate in the treatment group to the control group. The treatment group receives the vaccine, while the control group does not. The null states that the infection rates for the control and treatment groups are equal.
- Null Hypothesis H₀: Group proportions are equal in the population: p₁ = p₂.
- Alternative Hypothesis Hₐ: Group proportions are not equal in the population: p₁ ≠ p₂.
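A two-proportion z test for this null can be sketched with nothing but the standard library. The infection counts below are hypothetical and chosen only to illustrate the calculation:

```python
import math

def two_proportion_ztest(events1, n1, events2, n2):
    """Two-sided z test for H0: p1 = p2, using the pooled proportion."""
    p1, p2 = events1 / n1, events2 / n2
    p_pool = (events1 + events2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical trial: 10/1000 infected in the vaccine group, 30/1000 in control
z, p = two_proportion_ztest(10, 1000, 30, 1000)
```

With these invented counts the test gives roughly z ≈ −3.2 and p ≈ 0.001, so the null hypothesis of equal infection rates would be rejected at the 0.05 significance level.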
Correlation and Regression Coefficients
Some studies assess the relationship between two continuous variables rather than differences between groups.
In these studies, analysts often use either correlation or regression analysis . For these tests, the null states that there is no relationship between the variables. Specifically, it says that the correlation or regression coefficient is zero. As one variable increases, there is no tendency for the other variable to increase or decrease. Rho (ρ) is the population correlation parameter and beta (β) is the regression coefficient parameter.
For example, a study assesses the relationship between screen time and test performance. The null states that there is no correlation between this pair of variables. As screen time increases, test performance does not tend to increase or decrease.
- Null Hypothesis H₀: The correlation in the population is zero: ρ = 0.
- Alternative Hypothesis Hₐ: The correlation in the population is not zero: ρ ≠ 0.
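The sample correlation that estimates ρ is straightforward to compute by hand. This is a generic sketch of the standard formula, not a specific library routine:

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient (the estimate of rho)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

To test H₀: ρ = 0, software converts r to t = r·sqrt((n − 2) / (1 − r²)), which follows a t-distribution with n − 2 degrees of freedom when the null is true.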
For all these cases, the analysts define the hypotheses before the study. After collecting the data, they perform a hypothesis test to determine whether they can reject the null hypothesis.
The preceding examples are all for two-tailed hypothesis tests. To learn about one-tailed tests and how to write a null hypothesis for them, read my post One-Tailed vs. Two-Tailed Tests .
Related post: Understanding Correlation
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society A, 231(694–706), 289–337.
January 10, 2024 at 1:23 pm
Hi Jim, In your comment you state that equivalence test null and alternate hypotheses are reversed. For hypothesis tests of data fits to a probability distribution, the null hypothesis is that the probability distribution fits the data. Is this correct?
January 10, 2024 at 2:15 pm
Those are two separate things, equivalence testing and normality tests. But, yes, you're correct for both.
Hypotheses are switched for equivalence testing. You need to “work” (i.e., collect a large sample of good quality data) to be able to reject the null that the groups are different to be able to conclude they’re the same.
With typical hypothesis tests, if you have low quality data and a low sample size, you’ll fail to reject the null that they’re the same, concluding they’re equivalent. But that’s more a statement about the low quality and small sample size than anything to do with the groups being equal.
So, equivalence testing makes you work to obtain a finding that the groups are the same (at least within some amount you define as a trivial difference).
For normality testing, and other distribution tests, the null states that the data follow the distribution (normal or whatever). If you reject the null, you have sufficient evidence to conclude that your sample data don’t follow the probability distribution. That’s a rare case where you hope to fail to reject the null. And it suffers from the problem I describe above where you might fail to reject the null simply because you have a small sample size. In that case, you’d conclude the data follow the probability distribution but it’s more that you don’t have enough data for the test to register the deviation. In this scenario, if you had a larger sample size, you’d reject the null and conclude it doesn’t follow that distribution.
I don’t know of any equivalence testing type approach for distribution fit tests where you’d need to work to show the data follow a distribution, although I haven’t looked for one either!
February 20, 2022 at 9:26 pm
Is a null hypothesis regularly (always) stated in the negative? “there is no” or “does not”
February 23, 2022 at 9:21 pm
Typically, the null hypothesis includes an equal sign. The null hypothesis states that the population parameter equals a particular value. That value is usually one that represents no effect. In the case of a one-sided hypothesis test, the null still contains an equal sign but it’s “greater than or equal to” or “less than or equal to.” If you wanted to translate the null hypothesis from its native mathematical expression, you could use the expression “there is no effect.” But the mathematical form more specifically states what it’s testing.
It's the alternative hypothesis that typically contains the "does not equal" sign.
There are some exceptions. For example, in an equivalence test where the researchers want to show that two things are equal, the null hypothesis states that they’re not equal.
In short, the null hypothesis states the condition that the researchers hope to reject. They need to work hard to set up an experiment and data collection that’ll gather enough evidence to be able to reject the null condition.
Why Discovering 'Nothing' in Science Can Be So Incredibly Important
In science, as in life, we all like to celebrate the big news.
We confirmed the existence of black holes by the ripples they create in spacetime. We photographed the shadow of a black hole. We figured out how to edit DNA. We found the Higgs boson!
What we don't usually hear about is the years of back-breaking, painstaking hard work that delivers inconclusive results, appearing to provide no evidence for the questions scientists ask – the incremental application of constraints that bring us ever closer to finding answers and making discoveries.
Yet without non-detections – what we call the null result – the progress of science would often be slowed and stymied. Null results drive us forward. They keep us from repeating the same errors, and shape the direction of future studies.
There is, in fact, much that we can learn from nothing.
Often, however, null results don't make it into scientific publications. Not only does this generate significant inefficiencies in the way science is done, it's an indicator of potentially bigger problems in current scientific publication processes.
"We know that there is this powerfully distorting effect of failing to publish null results," psychologist Marcus Munafò of the University of Bristol told ScienceAlert.
"Solving the problem isn't straightforward, because it's very easy to generate a null result by running a bad experiment. If we were to flood the literature with even more null results by generating poor-quality studies, that wouldn't necessarily help the fundamental problem, which is, ultimately, to get the right answers to important questions."
Defining the problem
The null hypothesis defines the parameters under which the results of a study would be indistinguishable from background noise. Gravitational wave interferometry is a nice, neat example: The signals produced by gravitational waves are very faint, and there are many sources of noise that can affect LIGO's sensors. A confirmed detection could only be made once those sources were conclusively ruled out.
If those sources cannot be ruled out, that is what's called a null result. That doesn't mean gravitational waves were not detected; it just means we can't determine that we've made a detection with any certainty.
This can be really useful, and in some fields – like cosmology, and gravitational wave astronomy – the publication of null results helps scientists to adjust the parameters of future experiments.
In other fields, where results can be more qualitative than quantitative, null results are less valued.
"Part of the problem in a lot of behavioral and medical science is that we can't make quantitative predictions," Munafò explained.
"So, we're just looking for evidence that there is an effect or an association, irrespective of its magnitude, which then creates this problem when, if we fail to find evidence that there is an effect, we haven't put any parameters on whether or not an effect that small would actually matter – biologically, theoretically, clinically. We can't do anything with it."
An extraordinary nothing
When wielded correctly, a null result can yield some extraordinary findings.
One of the most famous examples is the Michelson-Morley experiment , conducted by physicists Albert A. Michelson and Edward W. Morley in 1887. The pair were attempting to detect the velocity of our planet with respect to 'luminiferous aether' – the medium through which light was thought to travel, much as waves travel through water.
As Earth moved through space, they hypothesized, oncoming waves of light rippling through a perfectly still, Universe-wide ocean of aether should move at a slightly different speed to those rippling out at right angles to it. Their experiments were ingenious and painstaking , but of course, they detected nothing of the sort. The null result showed that the speed of light was constant in all reference frames, which Einstein would go on to explain with his special theory of relativity.
In other instances, null results can help us to design instrumentation and future experiments. The detection of colliding black holes via gravitational waves only took place after years of null detections allowed for improvements to the design of the gravitational wave interferometer. While at CERN, physicists have so far made no detection of a dark matter signal in particle collision experiments, which has allowed constraints to be placed on what it could be.
"Null experiments are just a part of the full range of observations," astrophysicist George Smoot III of UC Berkeley told ScienceAlert. "Sometimes you see something new and amazing and sometimes you see there is not."
When it's all about hard numbers, null results are often easier to interpret. In other fields, there can be little incentive to publish.
The implications of non-detection aren't always clear, and studies that do make a significant finding receive more attention, more funding, and are more likely to be cited. Clinical trials with positive results are more likely to be published than those with negative or null results . When it comes to deciding who is going to get a research grant, these things matter.
Scientists, too, are very busy people, with many potential lines of inquiry they could be pursuing. Why chase the null hypothesis when you could be using your time doing research that is more likely to be seen and lead to further research opportunities?
To publish or null
As well as leaving out important context that could help us learn something new about our world, the non-publication of null results can also lead to inefficiency – and, worse, could even discourage young scientists from pursuing a career, as Munafò found first-hand. As a young PhD student, he set about replicating an experiment that had found a certain effect, and thought his results would naturally be the same.
"And it didn't work. I didn't find that effect in my experiment. So as an early career researcher, you think, well, I must have done something wrong, maybe I'm not cut out for this," he said.
"I was fortunate enough to bump into a senior academic who said, 'Oh, yeah, no one could replicate that finding'. If you've been in the field long enough, you find out about this stuff through conversations at conferences, and your own experiences, and so on. But you have to stay in the field long enough to find that out. If you aren't lucky enough to have that person tell you that it's not your fault, it's just the fact that the finding itself is pretty flaky, you might end up leaving the field."
Academic publishing has been grappling with this problem, too. In 2002, a unique project – the Journal of Negative Results in BioMedicine – was established to encourage the publication of results that might not otherwise see the light of day. It closed in 2017, claiming to have succeeded in its mission, as many other journals had followed its lead in publishing more articles with negative or null results.
However, encouraging scientists to bring their negative results to light may sometimes prove close to fruitless. On the one hand, there's the potential for a glut of poorly conceived, poorly designed, poorly conducted studies. But the opposite is possible, too.
In 2014, the Journal of Business and Psychology published a null results special issue, and received surprisingly few submissions . This, the editors deduced, could be because scientists themselves are conditioned to believe that null results are worthless. In 2019, the Berlin Institute of Health announced a reward for replication studies, explicitly welcoming null results , yet only received 22 applications.
These attitudes could change. We've seen that it can happen; Smoot, for instance, has gleaned a great deal of insight from null detections.
"Search for antimatter in the cosmic rays – that was a null experiment and convinced me that there was no serious amount of antimatter in our galaxy and likely on a much larger scale, even though there was a great symmetry between matter and antimatter," he said.
"Next null experiment was testing for violation of angular momentum and rotation of the Universe. While it is conceivable, the null result is very important to our worldview and cosmology view and it was the initial motivation for me to use the cosmic microwave background radiation to observe and measure the Universe. That led to more null results, but also some major discoveries."
Ultimately, it might be a slow process. Publication needs to incentivize not null results in and of themselves, but studies designed in such a way that these results can be interpreted and published in their appropriate context. By no means a trivial ask, but one crucial to scientific progress.
"Getting the right answer to the right question matters," Munafò said.
"And sometimes that will mean null results. But I think we need to be careful not to make the publication of a null result an end in itself; it's a means to an end, if it helps us get to the right answer, but it needs more than just the publication of null results to get there.
"Ultimately, what we need are better formulated questions and better designed studies so that our results are solid and informative, irrespective of what they are."
Hypothesis Testing
When you conduct a piece of quantitative research, you are inevitably attempting to answer a research question or hypothesis that you have set. One method of evaluating this research question is via a process called hypothesis testing , which is sometimes also referred to as significance testing . Since there are many facets to hypothesis testing, we start with the example we refer to throughout this guide.
An example of a lecturer's dilemma
Two statistics lecturers, Sarah and Mike, think that they use the best method to teach their students. Each lecturer has 50 statistics students who are studying a graduate degree in management. In Sarah's class, students have to attend one lecture and one seminar class every week, whilst in Mike's class students only have to attend one lecture. Sarah thinks that seminars, in addition to lectures, are an important teaching method in statistics, whilst Mike believes that lectures are sufficient by themselves and thinks that students are better off solving problems by themselves in their own time. This is the first year that Sarah has given seminars, but since they take up a lot of her time, she wants to make sure that she is not wasting her time and that seminars improve her students' performance.
The research hypothesis
The first step in hypothesis testing is to set a research hypothesis. In Sarah and Mike's study, the aim is to examine the effect that two different teaching methods – providing both lectures and seminar classes (Sarah), and providing lectures by themselves (Mike) – had on the performance of Sarah's 50 students and Mike's 50 students. More specifically, they want to determine whether performance is different between the two different teaching methods. Whilst Mike is skeptical about the effectiveness of seminars, Sarah clearly believes that giving seminars in addition to lectures helps her students do better than those in Mike's class. This leads to the following research hypothesis:
Research Hypothesis: When students attend seminar classes, in addition to lectures, their performance increases.
Before moving onto the second step of the hypothesis testing process, we need to take you on a brief detour to explain why you need to run hypothesis testing at all. This is explained next.
Sample to population
If you have measured individuals (or any other type of "object") in a study and want to understand differences (or any other type of effect), you can simply summarize the data you have collected. For example, if Sarah and Mike wanted to know which teaching method was the best, they could simply compare the performance achieved by the two groups of students – the group of students that took lectures and seminar classes, and the group of students that took lectures by themselves – and conclude that the best method was the teaching method which resulted in the highest performance. However, this is generally of only limited appeal because the conclusions would apply only to the students in this study. If those students were representative of all statistics students on a graduate management degree, though, the study would have wider appeal.
In statistics terminology, the students in the study are the sample and the larger group they represent (i.e., all statistics students on a graduate management degree) is called the population . Given that the sample of statistics students in the study are representative of a larger population of statistics students, you can use hypothesis testing to understand whether any differences or effects discovered in the study exist in the population. In layman's terms, hypothesis testing is used to establish whether a research hypothesis extends beyond those individuals examined in a single study.
Another example could be taking a sample of 200 breast cancer sufferers in order to test a new drug that is designed to eradicate this type of cancer. As much as you are interested in helping these specific 200 cancer sufferers, your real goal is to establish that the drug works in the population (i.e., all breast cancer sufferers).
As such, by taking a hypothesis testing approach, Sarah and Mike want to generalize their results to a population rather than just the students in their sample. However, in order to use hypothesis testing, you need to re-state your research hypothesis as a null and alternative hypothesis. Before you can do this, it is best to consider the process/structure involved in hypothesis testing and what you are measuring. This structure is presented on the next page .
A Critical Assessment of Null Hypothesis Significance Testing in Quantitative Communication Research
Timothy R. Levine, René Weber, Craig Hullett, Hee Sun Park, Lisa L. Massi Lindsey, A Critical Assessment of Null Hypothesis Significance Testing in Quantitative Communication Research, Human Communication Research, Volume 34, Issue 2, 1 April 2008, Pages 171–187, https://doi.org/10.1111/j.1468-2958.2008.00317.x
Null hypothesis significance testing (NHST) is the most widely accepted and frequently used approach to statistical inference in quantitative communication research. NHST, however, is highly controversial, and several serious problems with the approach have been identified. This paper reviews NHST and the controversy surrounding it. Commonly recognized problems include a sensitivity to sample size, the null is usually literally false, unacceptable Type II error rates, and misunderstanding and abuse. Problems associated with the conditional nature of NHST and the failure to distinguish statistical hypotheses from substantive hypotheses are emphasized. Recommended solutions and alternatives are addressed in a companion article.
What Does It Mean for Research to Be Statistically Significant?
Part 1: How Is Statistical Significance Defined in Research?
The world today is drowning in data.
That may sound like hyperbole but consider this. In 2018, humans around the world produced more than 2.5 quintillion bytes of data—each day. According to some estimates , every minute people conduct almost 4.5 million Google searches, post 511,200 tweets, watch 4.5 million YouTube videos, swipe 1.4 million times on Tinder, and order 8,683 meals from GrubHub. These numbers—and the world’s total data—are expected to continue growing exponentially in the coming years.
For behavioral researchers and businesses, this data represents a valuable opportunity. However, using data to learn about human behavior or make decisions about consumer behavior often requires an understanding of statistics and statistical significance.
Statistical significance is a measure of how likely it is that the difference between two groups, models, or statistics occurred by chance rather than because the variables are actually related to each other. This means that a “statistically significant” finding is one in which the finding is likely to be real and reliable, not due to chance.
To evaluate whether a finding is statistically significant, researchers engage in a process known as null hypothesis significance testing. Null hypothesis significance testing is less of a mathematical formula and more of a logical process for thinking about the strength and legitimacy of a finding.
Imagine a Vice President of Marketing asks her team to test a new layout for the company website. The new layout streamlines the user experience by making it easier for people to place orders and suggesting additional items to go along with each customer’s purchase. After testing the new website, the VP finds that visitors to the site spend an average of $12.63. Under the old layout, visitors spent an average of $12.32, meaning the new layout increases average spending by $0.31 per person. The question the VP must answer is whether the difference of $0.31 per person is significant or something that likely occurred by chance.
To answer this question with statistical analysis, the VP begins by adopting a skeptical stance toward her data known as the null hypothesis. The null hypothesis assumes that whatever researchers are studying does not actually exist in the population of interest. So, in this case, the VP assumes that the change in website layout does not influence how much people spend on purchases.
With the null hypothesis in mind, the manager asks how likely it is that she would obtain the results observed in her study—the average difference of $0.31 per visitor—if the change in website layout actually causes no difference in people’s spending (i.e., if the null hypothesis is true). If the probability of obtaining the observed results is low, the manager will reject the null hypothesis and conclude that her finding is statistically significant.
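To make that logic concrete, here is a minimal sketch of how such a test might be run. The two means come from the example above, but the standard deviations and sample sizes are invented assumptions, since the scenario does not state them.

```python
import math

def two_sample_z(mean1, mean2, sd1, sd2, n1, n2):
    """Two-sided z-test for a difference between two sample means."""
    # standard error of the difference between the two means
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / se
    # two-sided p value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Means are from the example; SDs and visitor counts are assumptions.
z, p = two_sample_z(12.63, 12.32, sd1=4.0, sd2=4.0, n1=5000, n2=5000)
```

Under these assumed numbers the p value comes out far below .05, so the VP would reject the null hypothesis; with much smaller samples or noisier spending data, the same $0.31 difference could easily fail to reach significance.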
Statistically significant findings indicate not only that the researchers’ results are unlikely to be the result of chance, but also that there is an effect or relationship between the variables being studied in the larger population. However, because researchers want to ensure they do not falsely conclude there is a meaningful difference between groups when in fact the difference is due to chance, they often set a stringent criterion for their statistical tests. This criterion is known as the significance level.
Within the social sciences, researchers often adopt a significance level of 5%. This means researchers are only willing to conclude that the results of their study are statistically significant if the probability of obtaining those results if the null hypothesis were true—known as the p value—is less than 5%.
Five percent represents a stringent criterion, but there is nothing magical about it. In medical research, significance levels are often set at 1%. In cognitive neuroscience, researchers often adopt significance levels well below 1%. And when astronomers seek to explain aspects of the universe or physicists study new particles like the Higgs boson, they set significance levels several orders of magnitude below .05.
In other research contexts like business or industry, researchers may set more lenient significance levels depending on the aim of their research. However, in all research, the more stringently a researcher sets their significance level, the more confident they can be that their results are not due to chance.
Determining whether a given set of results is statistically significant is only one half of the hypothesis testing equation. The other half is ensuring that the statistical tests a researcher conducts are powerful enough to detect an effect if one really exists. That is, when a researcher concludes their hypothesis was incorrect and there is no effect between the variables being studied, that conclusion is only meaningful if the study was powerful enough to detect an effect if one really existed.
The power of a hypothesis test is influenced by several factors.
Sample size—or, the number of participants the researcher collects data from—affects the power of a hypothesis test. Larger samples with more observations generally lead to higher-powered tests than smaller samples. In addition, large samples are more likely to produce replicable results because extreme scores that occur by chance are more likely to balance out in a large sample rather than in a small one.
Although setting a low significance level helps researchers ensure their results are not due to chance, it also lowers their power to detect an effect because it makes rejecting the null hypothesis harder. In this respect, the significance level a researcher selects is often in competition with power.
Standard deviations represent unexplained variability within data, also known as error. Generally speaking, the more unexplained variability within a dataset, the less power researchers have to detect an effect. Unexplained variability can be the result of measurement error, individual differences among participants, or situational noise.
A final factor that influences power is the size of the effect a researcher is studying. As you might expect, big changes in behavior are easier to detect than small ones.
Sometimes researchers do not know the strength of an effect before conducting a study. Even though this makes it harder to conduct a well-powered study, it is important to keep in mind that phenomena that produce a large effect will lead to studies with more power than phenomena that produce only a small effect.
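The interplay of sample size, significance level, and effect size can be seen in a rough power calculation. The sketch below uses a normal approximation for a two-sided, two-sample test of means; the effect size (d = 0.2, a small effect by common convention) and the sample sizes are illustrative assumptions.

```python
import math
from statistics import NormalDist

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of means.

    d is the standardized effect size (mean difference / SD). Uses a
    normal approximation, so it slightly overstates power for small n.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    ncp = d * math.sqrt(n_per_group / 2)           # noncentrality
    return NormalDist().cdf(ncp - z_alpha)

# Same small effect (d = 0.2), two different sample sizes:
power_small_n = power_two_sample(d=0.2, n_per_group=100)
power_large_n = power_two_sample(d=0.2, n_per_group=400)
```

With 100 participants per group the test has under a one-in-three chance of detecting this small effect; quadrupling the sample raises power to roughly the conventional 80% target. Lowering alpha or shrinking the effect size pushes power back down.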
Statistical significance is important because it allows researchers to hold a degree of confidence that their findings are real, reliable, and not due to chance. But statistical significance is not equally important to all researchers in all situations. The importance of obtaining statistically significant results depends on what a researcher studies and within what context.
Within academic research, statistical significance is often critical because academic researchers study theoretical relationships between different variables and behavior. Furthermore, the goal of academic research is often to publish research reports in scientific journals. The threshold for publishing in academic journals is often a series of statistically significant results.
Outside of academia, statistical significance is often less important. Researchers, managers, and decision makers in business may use statistical significance to understand how strongly the results of a study should inform the decisions they make. But, because statistical significance is simply a way of quantifying how much confidence to hold in a research finding, people in industry are often more interested in a finding’s practical significance than statistical significance.
To demonstrate the difference between practical and statistical significance, imagine you’re a candidate for political office. Maybe you have decided to run for local or state-wide office, or, if you’re feeling bold, imagine you’re running for President.
During your campaign, your team comes to you with data on messages intended to mobilize voters. These messages have been market tested and now you and your team must decide which ones to adopt.
If you go with Message A, 41% of registered voters say they are likely to turn out at the polls and cast a ballot. If you go with Message B, this number drops to 37%. As a candidate, should you care whether this difference is statistically significant at a p value below .05?
The answer, of course, is no. What you likely care about more than statistical significance is practical significance—the likelihood that the difference between groups is large enough to be meaningful in real life.
You should ensure there is some rigor behind the difference in messages before you spend money on a marketing campaign, but when elections are sometimes decided by as little as one vote you should adopt the message that brings more people out to vote. Within business and industry, the practical significance of a research finding is often equally if not more important than the statistical significance. In addition, when findings have large practical significance, they are almost always statistically significant too.
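As a rough illustration of how sample size, rather than the size of the gap itself, drives statistical significance here, the sketch below runs a standard two-proportion z-test on the 41% vs. 37% figures. The poll sizes are hypothetical assumptions.

```python
import math
from statistics import NormalDist

def two_proportion_z(p1, n1, p2, n2):
    """Two-sided z-test for a difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)  # pooled proportion
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# 41% vs. 37% turnout intention; the poll sizes are assumptions.
_, p_modest = two_proportion_z(0.41, 300, 0.37, 300)    # modest poll
_, p_large = two_proportion_z(0.41, 3000, 0.37, 3000)   # large poll
```

With the modest poll the four-point gap is not statistically significant at the .05 level; with the large poll it is. The practical significance of the gap is identical in both cases, which is exactly why a candidate should weigh it on its real-world merits.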
Conducting statistically significant research is a challenge, but it’s a challenge worth tackling. Flawed data and faulty analyses only lead to poor decisions. Start taking steps to ensure your surveys and experiments produce valid results by using CloudResearch. If you have the team to conduct your own studies, CloudResearch can help you find large samples of online participants quickly and easily. Regardless of your demographic criteria or sample size, we can help you get the participants you need. If your team doesn’t have the resources to run a study, we can run it for you. Our team of expert social scientists, computer scientists, and software engineers can design any study, collect the data, and analyze the results for you. Let us show you how conducting statistically significant research can improve your decision-making today.
Continue Reading: A Researcher’s Guide to Statistical Significance and Sample Size Calculations
Part 2: How to Calculate Statistical Significance
Part 3: Determining Sample Size: How Many Survey Participants Do You Need?
The Ohio State University
Research Questions & Hypotheses
Generally, in quantitative studies, reviewers expect hypotheses rather than research questions. However, research questions and hypotheses serve different purposes and can be beneficial when used together.
Research Questions
Clarify the research’s aim (Farrugia et al., 2010).
- Research often begins with an interest in a topic, but a deep understanding of the subject is crucial to formulate an appropriate research question.
- Descriptive: “What factors most influence the academic achievement of senior high school students?”
- Comparative: “What is the performance difference between teaching methods A and B?”
- Relationship-based: “What is the relationship between self-efficacy and academic achievement?”
- Increasing knowledge about a subject can be achieved through systematic literature reviews, in-depth interviews with patients (and proxies), focus groups, and consultations with field experts.
- Some funding bodies, like the Canadian Institutes of Health Research, recommend conducting a systematic review or a pilot study before seeking grants for full trials.
- The presence of multiple research questions in a study can complicate the design, statistical analysis, and feasibility.
- It’s advisable to focus on a single primary research question for the study.
- The primary question, clearly stated at the end of a grant proposal’s introduction, usually specifies the study population, intervention, and other relevant factors.
- The FINER criteria underscore aspects that can enhance the chances of a successful research project, including specifying the population of interest, aligning with scientific and public interest, clinical relevance, and contribution to the field, while complying with ethical and national research standards.
- Feasible
- Interesting
- Novel
- Ethical
- Relevant
- The PICOT approach is crucial in developing the study’s framework and protocol, influencing inclusion and exclusion criteria and identifying patient groups for inclusion.
- Population (patients)
- Intervention (for intervention studies only)
- Comparison group
- Outcome of interest
- Time
- Defining the specific population, intervention, comparator, and outcome helps in selecting the right outcome measurement tool.
- The more precise the population definition and stricter the inclusion and exclusion criteria, the more significant the impact on the interpretation, applicability, and generalizability of the research findings.
- A restricted study population enhances internal validity but may limit the study’s external validity and generalizability to clinical practice.
- A broadly defined study population may better reflect clinical practice but could increase bias and reduce internal validity.
- An inadequately formulated research question can negatively impact study design, potentially leading to ineffective outcomes and affecting publication prospects.
Checklist: Good research questions for social science projects (Panke, 2018)
Research Hypotheses
Present the researcher’s predictions based on specific statements.
- These statements define the research problem or issue and indicate the direction of the researcher’s predictions.
- Formulating the research question and hypothesis from existing data (e.g., a database) can lead to multiple statistical comparisons and potentially spurious findings due to chance.
- The research or clinical hypothesis, derived from the research question, shapes the study’s key elements: sampling strategy, intervention, comparison, and outcome variables.
- Hypotheses can express a single outcome or multiple outcomes.
- After statistical testing, the null hypothesis is either rejected or not rejected based on whether the study’s findings are statistically significant.
- Hypothesis testing helps determine if observed findings are due to true differences and not chance.
- Hypotheses can be 1-sided (specific direction of difference) or 2-sided (presence of a difference without specifying direction).
- 2-sided hypotheses are generally preferred unless there’s a strong justification for a 1-sided hypothesis.
- A solid research hypothesis, informed by a good research question, influences the research design and paves the way for defining clear research objectives.
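The difference between 1-sided and 2-sided hypotheses shows up directly in how the p value is computed. In this small sketch the test statistic is an arbitrary illustrative value, not a result from real data.

```python
from statistics import NormalDist

z = 1.8  # an illustrative test statistic, not from real data

# 2-sided: a difference in either direction counts as evidence
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

# 1-sided: only a difference in the predicted direction counts
p_one_sided = 1 - NormalDist().cdf(z)
```

Here the same statistic is significant at the .05 level under the 1-sided test (p ≈ .036) but not under the 2-sided test (p ≈ .072). The 2-sided form is more conservative, which is one reason it is generally preferred absent a strong directional justification.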
Types of Research Hypothesis
- In a Y-centered research design, the focus is on the dependent variable (DV) which is specified in the research question. Theories are then used to identify independent variables (IV) and explain their causal relationship with the DV.
- Example: “An increase in teacher-led instructional time (IV) is likely to improve student reading comprehension scores (DV), because extensive guided practice under expert supervision enhances learning retention and skill mastery.”
- Hypothesis Explanation: The dependent variable (student reading comprehension scores) is the focus, and the hypothesis explores how changes in the independent variable (teacher-led instructional time) affect it.
- In X-centered research designs, the independent variable is specified in the research question. Theories are used to determine potential dependent variables and the causal mechanisms at play.
- Example: “Implementing technology-based learning tools (IV) is likely to enhance student engagement in the classroom (DV), because interactive and multimedia content increases student interest and participation.”
- Hypothesis Explanation: The independent variable (technology-based learning tools) is the focus, with the hypothesis exploring its impact on a potential dependent variable (student engagement).
- Probabilistic hypotheses suggest that changes in the independent variable are likely to lead to changes in the dependent variable in a predictable manner, but not with absolute certainty.
- Example: “The more teachers engage in professional development programs (IV), the more their teaching effectiveness (DV) is likely to improve, because continuous training updates pedagogical skills and knowledge.”
- Hypothesis Explanation: This hypothesis implies a probable relationship between the extent of professional development (IV) and teaching effectiveness (DV).
- Deterministic hypotheses state that a specific change in the independent variable will lead to a specific change in the dependent variable, implying a more direct and certain relationship.
- Example: “If the school curriculum changes from traditional lecture-based methods to project-based learning (IV), then student collaboration skills (DV) are expected to improve because project-based learning inherently requires teamwork and peer interaction.”
- Hypothesis Explanation: This hypothesis presumes a direct and definite outcome (improvement in collaboration skills) resulting from a specific change in the teaching method.
- Example: “Students who identify as visual learners will score higher on tests that are presented in a visually rich format compared to tests presented in a text-only format.”
- Explanation: This hypothesis aims to describe the potential difference in test scores between visual learners taking visually rich tests and text-only tests, without implying a direct cause-and-effect relationship.
- Example: “Teaching method A will improve student performance more than method B.”
- Explanation: This hypothesis compares the effectiveness of two different teaching methods, suggesting that one will lead to better student performance than the other. It implies a direct comparison but does not necessarily establish a causal mechanism.
- Example: “Students with higher self-efficacy will show higher levels of academic achievement.”
- Explanation: This hypothesis predicts a relationship between the variable of self-efficacy and academic achievement. Unlike a causal hypothesis, it does not necessarily suggest that one variable causes changes in the other, but rather that they are related in some way.
Tips for developing research questions and hypotheses for research studies
- Perform a systematic literature review (if one has not been done) to increase knowledge and familiarity with the topic and to assist with research development.
- Learn about current trends and technological advances on the topic.
- Seek careful input from experts, mentors, colleagues, and collaborators to refine your research question as this will aid in developing the research question and guide the research study.
- Use the FINER criteria in the development of the research question.
- Ensure that the research question follows PICOT format.
- Develop a research hypothesis from the research question.
- Ensure that the research question and objectives are answerable, feasible, and clinically relevant.
If your research hypotheses are derived from your research questions, particularly when multiple hypotheses address a single question, it’s recommended to use both research questions and hypotheses. However, if this isn’t the case, using hypotheses over research questions is advised. It’s important to note these are general guidelines, not strict rules. If you opt not to use hypotheses, consult with your supervisor for the best approach.
Farrugia, P., Petrisor, B. A., Farrokhyar, F., & Bhandari, M. (2010). Practical tips for surgical research: Research questions, hypotheses and objectives. Canadian Journal of Surgery, 53(4), 278–281.
Hulley, S. B., Cummings, S. R., Browner, W. S., Grady, D., & Newman, T. B. (2007). Designing clinical research. Philadelphia.
Panke, D. (2018). Research design & method selection: Making good choices in the social sciences. Research Design & Method Selection, 1–368.
How to Write a Strong Hypothesis | Steps & Examples
Published on May 6, 2022 by Shona McCombes . Revised on November 20, 2023.
A hypothesis is a statement that can be tested by scientific research. If you want to test a relationship between two or more variables, you need to write hypotheses before you start your experiment or data collection .
Example: Hypothesis
Daily apple consumption leads to fewer doctor’s visits.
What is a hypothesis?
A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.
A hypothesis is not just a guess – it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).
Variables in hypotheses
Hypotheses propose a relationship between two or more types of variables .
- An independent variable is something the researcher changes or controls.
- A dependent variable is something the researcher observes and measures.
If there are any control variables , extraneous variables , or confounding variables , be sure to jot those down as you go to minimize the chances that research bias will affect your results.
Consider the hypothesis “Daily exposure to the sun leads to increased levels of happiness.” In this example, the independent variable is exposure to the sun – the assumed cause – and the dependent variable is the level of happiness – the assumed effect.
Step 1. Ask a question
Writing a hypothesis begins with a research question that you want to answer. The question should be focused, specific, and researchable within the constraints of your project.
Step 2. Do some preliminary research
Your initial answer to the question should be based on what is already known about the topic. Look for theories and previous studies to help you form educated assumptions about what your research will find.
At this stage, you might construct a conceptual framework to ensure that you’re embarking on a relevant topic . This can also help you identify which variables you will study and what you think the relationships are between them. Sometimes, you’ll have to operationalize more complex constructs.
Step 3. Formulate your hypothesis
Now you should have some idea of what you expect to find. Write your initial answer to the question in a clear, concise sentence.
Step 4. Refine your hypothesis
You need to make sure your hypothesis is specific and testable. There are various ways of phrasing a hypothesis, but all the terms you use should have clear definitions, and the hypothesis should contain:
- The relevant variables
- The specific group being studied
- The predicted outcome of the experiment or analysis
Step 5. Phrase your hypothesis in three ways
To identify the variables, you can write a simple prediction in if…then form. The first part of the sentence states the independent variable and the second part states the dependent variable.
In academic research, hypotheses are more commonly phrased in terms of correlations or effects, where you directly state the predicted relationship between variables.
If you are comparing two groups, the hypothesis can state what difference you expect to find between them.
Step 6. Write a null hypothesis
If your research involves statistical hypothesis testing, you will also have to write a null hypothesis. The null hypothesis is the default position that there is no association between the variables. The null hypothesis is written as H 0, while the alternative hypothesis is H 1 or H a.
- H 0 : The number of lectures attended by first-year students has no effect on their final exam scores.
- H 1 : The number of lectures attended by first-year students has a positive effect on their final exam scores.
Research question | Hypothesis | Null hypothesis |
---|---|---|
What are the health benefits of eating an apple a day? | Increasing apple consumption in over-60s will result in decreasing frequency of doctor’s visits. | Increasing apple consumption in over-60s will have no effect on frequency of doctor’s visits. |
Which airlines have the most delays? | Low-cost airlines are more likely to have delays than premium airlines. | Low-cost and premium airlines are equally likely to have delays. |
Can flexible work arrangements improve job satisfaction? | Employees who have flexible working hours will report greater job satisfaction than employees who work fixed hours. | There is no relationship between working hour flexibility and job satisfaction. |
How effective is high school sex education at reducing teen pregnancies? | Teenagers who received sex education lessons throughout high school will have lower rates of unplanned pregnancy than teenagers who did not receive any sex education. | High school sex education has no effect on teen pregnancy rates. |
What effect does daily use of social media have on the attention span of under-16s? | There is a negative correlation between time spent on social media and attention span in under-16s. | There is no relationship between social media use and attention span in under-16s. |
Null and alternative hypotheses are used in statistical hypothesis testing. The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.
Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses, by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.
Quantitative Data Analysis
5 Hypothesis Testing in Quantitative Research
Mikaila Mariel Lemonik Arthur
Statistical reasoning is built on the assumption that data are normally distributed, meaning that they will be distributed in the shape of a bell curve as discussed in the chapter on Univariate Analysis. While real life often—perhaps even usually—does not resemble a bell curve, basic statistical analysis assumes that if all possible random samples from a population were drawn and the mean taken from each sample, the distribution of sample means, when plotted on a graph, would be normally distributed (this assumption is called the Central Limit Theorem). Given this assumption, we can use the mathematical techniques developed for the study of probability to determine the likelihood that the relationships or patterns we observe in our data occurred due to random chance rather than due to some actual real-world connection, which we call statistical significance.
Statistical significance is not the same as practical significance. The fact that we have determined that a given result is unlikely to have occurred due to random chance does not mean that this given result is important, that it matters, or that it is useful. Similarly, we might observe a relationship or result that is very important in practical terms, but that we cannot claim is statistically significant—perhaps because our sample size is too small, for instance. Such a result might have occurred by chance, but ignoring it might still be a mistake.

Let’s consider some examples to make this a bit clearer. Assume we were interested in the impacts of diet on health outcomes and found the statistically significant result that people who eat a lot of citrus fruit end up having pinky fingernails that are, on average, 1.5 millimeters longer than those who tend not to eat any citrus fruit. Should anyone change their diet due to this finding? Probably not, even though it is statistically significant. On the other hand, if we found that the people who ate the diets highest in processed sugar died on average five years sooner than those who ate the least processed sugar, even in the absence of a statistically significant result we might want to advise that people consider limiting sugar in their diet. This latter result has more practical significance (lifespan matters more than the length of your pinky fingernail) as well as a larger effect size or association (5 years of life as opposed to 1.5 millimeters of length), a factor that will be discussed in the chapter on association.
While people generally use the shorthand of “the likelihood that the results occurred by chance” when talking about statistical significance, it is actually a bit more complicated than that. What statistical significance is really telling us is the likelihood (or probability) that a result equal to or more “extreme [1]” than the one we observed would have occurred due to random chance or sampling error alone, rather than reflecting something true in the real world. Testing for statistical significance, then, requires us to understand something about probability.
A Brief Review of Probability
You might remember having studied probability in a math class, with questions about coin flips or drawing marbles out of a jar. Such exercises can make probability seem very abstract. But in reality, computations of probability are deeply important for a wide variety of activities, ranging from gambling and stock trading to weather forecasts and, yes, statistical significance.
Probability is represented as a proportion (or decimal number) somewhere between 0 and 1. At 0, there is absolutely no likelihood that the event or pattern of interest would occur; at 1, it is absolutely certain that the event or pattern of interest will occur. We indicate that we are talking about probability by using the symbol [latex]p[/latex]. For example, if something has a 50% chance of occurring, we would write [latex]p=0.5[/latex] or [latex]\frac {1}{2}[/latex]. If we want to represent the likelihood of something not occurring, we can write [latex]1-p[/latex].
Check your thinking: Assume you were flipping coins, and you called heads. The probability of getting heads on a coin flip using a fair coin (in other words, a normal coin that has not been weighted to bias the result) is 0.5. Thus, in 50% of coin flips you should get heads. Consider the following probability questions and write down your answers so you can check them against the discussion below.
- Imagine you have flipped the coin 29 times and you have gotten heads each time. What is the probability you will get heads on flip 30?
- What is the probability that you will get heads on all of the first five coin flips?
- What is the probability that you will get heads on at least one of the first five coin flips?
There are a few basic concepts from the mathematical study of probability that are important for beginner data analysts to know, and we will review them here.
Probability over Repeated Trials : The probability of the outcome of interest is the same in each trial or test, regardless of the results of the prior test. So, if we flip a coin 29 times and get heads each time, what happens when we flip it the 30th time? The probability of heads is still 0.5! The belief that “this time it must be tails because it has been heads so many times” or “this coin just wants to come up heads” is simply superstition, and—assuming a fair coin—the results of prior trials do not influence the results of this one.
Probability of Multiple Events : The probability that the outcome of interest will occur repeatedly across multiple trials is the product [2] of the probability of the outcome on each individual trial. This is called the multiplication theorem . Thinking about the multiplication theorem requires that we keep in mind the fact that when we multiply decimal numbers together, those numbers get smaller— thus, the probability that a series of outcomes will occur is smaller than the probability of any one of those outcomes occurring on its own. So, what is the probability that we will get heads on all five of our coin flips? Well, to figure that out, we need to multiply the probability of getting heads on each of our coin flips together. The math looks like this (and produces a very small probability indeed):
[latex]\frac {1}{2} \cdot \frac {1}{2} \cdot \frac {1}{2} \cdot \frac {1}{2} \cdot \frac {1}{2} = 0.03125[/latex]
Probability of One of Many Events : Determining the probability that the outcome of interest will occur on at least one out of a series of events or repeated trials is a little bit more complicated. Mathematicians use the addition theorem to refer to this, because the basic way to calculate it is to calculate the probability of each sequence of events (say, heads-heads-heads, heads-heads-tails, heads-tails-heads, and so on) and add them together. But the greater the number of repeated trials, the more complicated that gets, so there is a simpler way to do it. Consider that the probability of getting no heads is the same as the probability of getting all tails (which would be the same as the probability of getting all heads that we calculated above). And the only circumstance in which we would not have at least one flip resulting in heads would be a circumstance in which all flips had resulted in tails. Therefore, what we need to do in order to calculate the probability that we get at least one heads is to subtract the probability that we get no heads from 1—and as you can imagine, this procedure shows us that the probability of the outcome of interest occurring at least once over repeated trials is higher than the probability of the occurrence on any given trial. The math would look like this:
[latex]1- (\frac{1}{2})^5=0.9688[/latex]
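Both theorems can be checked numerically. A minimal Python sketch, using the same five-flip example as above:

```python
# Multiplication theorem: the probability of heads on ALL of n
# fair coin flips is the product of the per-flip probabilities.
def p_all_heads(n, p_heads=0.5):
    result = 1.0
    for _ in range(n):
        result *= p_heads
    return result

# Addition theorem shortcut: the probability of heads on AT LEAST
# one of n flips is 1 minus the probability of getting all tails.
def p_at_least_one_head(n, p_heads=0.5):
    return 1 - (1 - p_heads) ** n

print(p_all_heads(5))          # 0.03125
print(p_at_least_one_head(5))  # 0.96875
```

The two results match the hand calculations: 0.03125 for all heads, and 0.96875 (rounded to 0.9688 above) for at least one head.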
So why is this digression into the math of probability important? Well, when we test for statistical significance, what we are really doing is determining the probability that the outcome we observed—or one that is more extreme than that which we observed—occurred by chance. We perform this analysis via a procedure called Null Hypothesis Significance Testing.
Null Hypothesis Significance Testing
Null hypothesis significance testing , or NHST , is a method of testing for statistical significance by comparing observed data to the data we would expect to see if there were no relationship between the variables or phenomena in question. NHST can take a little while to wrap one’s head around, especially because it relies on a logic of double negatives: first, we state a hypothesis we believe not to be true (there is no relationship between the variables in question) and then, we look for evidence that disconfirms this hypothesis. In other words, we are assuming that there is no relationship between the variables—even though our research hypothesis states that we think there is a relationship—and then looking to see if there is any evidence to suggest there is not no relationship. Confusing, right?
So why do we use the null hypothesis significance testing approach?
- The null hypothesis—that there is no relationship between the variables we are exploring—would be what we would generally accept as true in the absence of other information,
- It means we are assuming that differences or patterns occur due to chance unless there is strong evidence to suggest otherwise,
- It provides a benchmark for comparing observed outcomes, and
- It means we are searching for evidence that disconfirms our hypothesis, making it less likely that we will accept a conclusion that turns out to be untrue.
Thus, NHST helps us avoid making errors in our interpretation of the result. In particular, it helps us avoid Type 1 error , as discussed in the chapter on Bivariate Analyses . As a reminder, Type 1 error is error where you accept a hypothesis as true when in fact it was false (a false positive), while Type 2 error is error where you reject the hypothesis when in fact it was true (a false negative). For example, you are making a Type 2 error if you decide not to study for a test because you assume you are so bad at the subject that studying simply cannot help you, when in fact we know from research that studying does lead to higher grades. And you are making a Type 1 error if your boss tells you that she is going to promote you if you do enough overtime and you then work lots of overtime in response, when actually your boss is just trying to make you work more hours and already had someone else in mind to promote.
We can never remove all sources of error from our analyses, though larger sample sizes help reduce error. Looking at the formula for computing standard error , we can see that the standard error ([latex]SE[/latex]) would get smaller as the sample size ([latex]N[/latex]) gets larger. Note: σ is the symbol we use to represent standard deviation.
[latex]SE = \frac{\sigma}{\sqrt N}[/latex]
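A quick numerical check of this formula (the standard deviation of 10 is an arbitrary illustrative value): quadrupling the sample size halves the standard error.

```python
import math

def standard_error(sigma, n):
    # Standard error of the mean: sigma divided by the square root of N.
    return sigma / math.sqrt(n)

print(standard_error(10, 25))   # 2.0
print(standard_error(10, 100))  # 1.0
```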
Besides making our samples larger, another thing we can do is choose whether we are more willing to accept Type 1 error (false positives) or Type 2 error (false negatives) and adjust our strategies accordingly. In most research, we would prefer to accept more Type 2 error, because we are more willing to miss out on a finding than we are to make a finding that turns out later to be inaccurate (though, of course, lots of research does eventually turn out to be inaccurate).
Performing NHST
Performing NHST requires that our data meet several assumptions:
- Our sample must be a random sample—statistical significance testing and other inferential and explanatory statistical methods are generally not appropriate for non-random samples [3] —as well as representative and of a sufficient size (see the Central Limit Theorem above).
- Observations must be independent of other observations, or else additional statistical manipulation must be performed. For instance, a dataset of data about siblings would need to be handled differently due to the fact that siblings affect one another, so data on each person in the dataset is not truly independent.
- You must determine the rules for your significance test, including the level of uncertainty you are willing to accept (significance level) and whether or not you are interested in the direction of the result (one-tailed versus two-tailed tests, to be discussed below), in advance of performing any analysis.
- The number of significance tests you run should be limited, because the more tests you run, the greater the likelihood that one of your tests will result in an error. To make this clearer: if you are willing to accept a 5% probability of making the error of accepting a hypothesis as true when it is really false, and you run 20 tests, it is quite likely that at least one of those tests has produced an incorrect result.
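The multiple-testing point in the last assumption follows the same “at least one” logic used for the coin flips earlier. A sketch with illustrative numbers: at a 5% per-test error rate, 20 independent tests carry roughly a 64% chance of producing at least one false result.

```python
def p_at_least_one_error(alpha, n_tests):
    # The probability that no test errs is (1 - alpha)^n, so the
    # probability of at least one error is the complement.
    return 1 - (1 - alpha) ** n_tests

print(round(p_at_least_one_error(0.05, 1), 3))   # 0.05
print(round(p_at_least_one_error(0.05, 20), 3))  # 0.642
```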
If our data has met these assumptions, we can move forward with the process of conducting an NHST. This requires us to make three decisions: determining our null hypothesis , our confidence level (or acceptable significance level), and whether we will conduct a one-tailed or a two-tailed test. In keeping with Assumption 3 above, we must make these decisions before performing our analysis. The null hypothesis is the hypothesis that there is no relationship between the variables in question. So, for example, if our research hypothesis was that people who spend more time with their friends are happier, our null hypothesis would be that there is no relationship between how much time people spend with their friends and their happiness.
Our confidence level is the level of risk we are willing to accept that our results could have occurred by chance. Typically, in social science research, researchers use p<0.05 (we are willing to accept up to a 5% risk that our results occurred by chance), p<0.01 (we are willing to accept up to a 1% risk that our results occurred by chance), and/or p<0.001 (we are willing to accept up to a 0.1% risk that our results occurred by chance). P, as was noted above, is the mathematical notation for probability, and that’s why we use a p-value to indicate the probability that our results may have occurred by chance. A higher threshold increases the likelihood that we will accept as accurate a result that really occurred by chance; a lower threshold increases the likelihood that we will dismiss as chance a result that was actually real. Remember, what the p-value tells us is not the probability that our own research hypothesis is true, but rather this: assuming that the null hypothesis is correct, what is the probability that the data we observed—or data more extreme than the data we observed—would have occurred by chance?
Whether we choose a one-tailed or a two-tailed test tells us what we mean when we say “data more extreme than.” Remember that normal curve? A two-tailed test is agnostic as to the direction of our results—and many of the most common tests for statistical significance that we perform, like the Chi square, are two-tailed by default. However, if you are only interested in a result that occurs in a particular direction, you might choose a one-tailed test. For instance, if you were testing a new blood pressure medication, you might only care if the blood pressure of those taking the medication is significantly lower than those not taking the medication—having blood pressure significantly higher would not be a good or helpful result, so you might not want to test for that.
Having determined the parameters for our analysis, we then compute our test of statistical significance. There are different tests of statistical significance for different variables (for example, the Chi square discussed in the chapter on bivariate analyses ), as you will see in other chapters of this text, but all of them produce results in a similar format. We then compare this result to the p value we already selected. If the p value produced by our analysis is lower than the confidence level we selected, we can reject the null hypothesis, as the probability that our result occurred by chance is very low. If, on the other hand, the p value produced by our analysis is higher than the confidence level we selected, we fail to reject the null hypothesis, as the probability that our result occurred by chance is too high to accept. Keep in mind this is what we do even when the p value produced by our analysis is quite close to the threshold we have selected. So, for instance, if we have selected the confidence level of p<0.05 and the p value produced by our analysis is p=0.0501, we still fail to reject the null hypothesis and proceed as if there is not any support for our research hypothesis.
Thus, the process of null hypothesis significance testing proceeds according to the following steps:
- Determine the null hypothesis
- Set the confidence level and whether this will be a one-tailed or two-tailed test
- Compute the test value for the appropriate significance test
- Compare the test value to the critical value of that test statistic for the confidence level you selected
- Determine whether or not to reject the null hypothesis
Your statistical analysis software will perform steps 3 and 4 for you (before there was computer software to do this, researchers had to do the calculations by hand and compare their results to figures on published tables of critical values). But you as the researcher must perform steps 1, 2, and 5 yourself.
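The five steps can be sketched end to end without statistical software, using a deliberately simple example: testing whether a coin is fair, given hypothetical data of 9 heads in 10 flips, with an exact one-tailed binomial p-value. All the specifics here (the data, the 0.05 threshold) are illustrative assumptions.

```python
from math import comb

# Step 1: null hypothesis -- the coin is fair (p_heads = 0.5).
# Step 2: significance level 0.05, one-tailed test (we only care
#         about an excess of heads).
ALPHA = 0.05

def binomial_p_value(heads, flips, p=0.5):
    # Steps 3-4: the probability, under the null hypothesis, of a
    # result as extreme as or more extreme than the one observed.
    return sum(comb(flips, k) * p**k * (1 - p)**(flips - k)
               for k in range(heads, flips + 1))

p_value = binomial_p_value(9, 10)  # hypothetical data: 9 heads in 10 flips
print(round(p_value, 4))  # 0.0107

# Step 5: decide whether to reject the null hypothesis.
print("reject H0" if p_value < ALPHA else "fail to reject H0")  # reject H0
```

Because 0.0107 is below the 0.05 threshold chosen in advance, we reject the null hypothesis that the coin is fair.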
Confidence Intervals & Margins of Error
When talking about statistical significance, some researchers also use the terms confidence intervals and margins of error . Confidence intervals are ranges of values within which we can assume the true population parameter lies. Most typically, analysts aim for 95% confidence intervals, meaning that in 95 out of 100 cases, the population parameter will lie within the upper and lower bounds specified by the confidence interval. These are calculated by your statistics software as well. The margin of error, then, is the distance between the sample statistic and either bound of the confidence interval. So, for instance, a 2021 survey of Americans conducted by the Robert Wood Johnson Foundation and the Harvard T.H. Chan School of Public Health found that 71% of respondents favor substantially increasing federal spending on public health programs. This poll had a 95% confidence interval with a +/- 3.6 margin of error. What this tells us is that there is a 95% probability (19 in 20) that between 67.4% (71-3.6) and 74.6% (71+3.6) of Americans favored increasing federal public health spending at the time the poll was conducted. When a figure reflects an overwhelming majority, such as this one, the margin of error may seem of little relevance. But consider a similar poll with the same margin of error that sought to predict support for a political candidate and found that 51.5% of people said they would vote for that candidate. In that case, we would have found that there was a 95% probability that between 47.9% and 55.1% of people intended to vote for the candidate—which means the race is a total tossup and we really would have no idea what to expect. For some people, thinking in terms of confidence intervals and margins of error is easier than thinking in terms of p values; confidence intervals and margins of error are more frequently used in analyses of polls, while p values are found more often in academic research.
But basically, both approaches are doing the same fundamental analysis—they are determining the likelihood that the results we observed or a similarly-meaningful result would have occurred by chance.
What Does Significance Testing Tell Us?
One of the most important things to remember about significance testing is that, while the word “significance” is used in ordinary speech to mean importance, significance testing does not tell us whether our results are important—or even whether they are interesting. A full understanding of the relationship between a given set of variables requires looking at statistical significance as well as association and the theoretical importance of the findings. Table 1 provides a perspective on using the combination of significance and association to determine how important the results of statistical analysis are—but even using Table 1 as a guide, evaluating findings based on theoretical importance remains key. So: make sure that when you are conducting analyses, you avoid being misled into assuming that significant results are sufficient for making broad claims about the importance and meaning of results. And remember as well that significance only tells us the likelihood that the pattern of relationships we observe occurred by chance—not whether that pattern is causal. For, after all, quantitative research can never eliminate all plausible alternative explanations for the phenomenon in question (one of the three elements of causation, along with association and temporal order).
Table 1.

| | Statistically significant | Not statistically significant |
| --- | --- | --- |
| Strong association | Something’s happening here! | Could be interesting, but might have occurred by chance |
| Weak association | Probably did not occur by chance, but not interesting | Nothing’s happening here |
Compute the probability of each of the following:
- Getting 7 heads on 7 coin flips
- Getting 5 heads on 7 coin flips
- Getting 1 head on 10 coin flips
Then check your work using the Coin Flip Probability Calculator .
- As the advertised hourly pay for a job goes up, the number of job applicants increases.
- Teenagers who watch more hours of makeup tutorial videos on TikTok have, on average, lower self-esteem.
- Couples who share hobbies in common are less likely to get divorced.
- Assume a researcher conducted a study that found that people wearing green socks type on average one word per minute faster than people who are not wearing green socks, and that this study found a p value of p<0.01. Is this result statistically significant? Is this result practically significant? Explain your answers.
- If we conduct a political poll and have a 95% confidence interval and a margin of error of +/- 2.3%, what can we conclude about support for Candidate X if 49.3% of respondents tell us they will vote for Candidate X? If 24.7% do? If 52.1% do? If 83.7% do?
- One way to think about this is to imagine that your result has been plotted on a bell curve. Statistical significance tells us the probability that the "real" result—the thing that is true in the real world and not due to random chance—is at the same point as or further along the skinny tails of the bell curve than the result we have plotted. ↵
- In other words, what you get when you multiply. ↵
- They also are not appropriate for censuses—but you do not need inferential statistics in a census because you are looking at the entire population rather than a sample, so you can simply describe the relationships that do exist. ↵
A distribution of values that is symmetrical and bell-shaped.
A graph showing a normal distribution—one that is symmetrical with a rounded top that then falls away towards the extremes in the shape of a bell
The sum of all the values in a list divided by the number of such values.
The theorem that states that if you take a series of sufficiently large random samples from the population (replacing people back into the population so they can be reselected each time you draw a new sample), the distribution of the sample means will be approximately normally distributed.
A statistical measure that suggests that sample results can be generalized to the larger population, based on a low probability of having made a Type 1 error.
How likely something is to happen; also, a branch of mathematics concerned with investigating the likelihood of occurrences.
Measurement error created due to the fact that even properly-constructed random samples do not have precisely the same characteristics as the larger population from which they were drawn.
The theorem in probability about the likelihood of a given outcome occurring repeatedly over multiple trials; this is determined by multiplying the probabilities together.
The theorem addressing the determination of the probability of a given outcome occurring at least once across a series of trials; it is determined by adding the probability of each possible series of outcomes together.
A method of testing for statistical significance in which an observed relationship, pattern, or figure is tested against a hypothesis that there is no relationship or pattern among the variables being tested
Null hypothesis significance testing.
The error you make when you do not infer a relationship exists in the larger population when it actually does exist; in other words, a false negative conclusion.
The error made if one infers that a relationship exists in a larger population when it does not really exist; in other words, a false positive error.
A measure of accuracy of sample statistics computed using the standard deviation of the sampling distribution.
The hypothesis that there is no relationship between the variables in question.
The probability that the sample statistic we observe holds true for the larger population.
A measure of statistical significance used in crosstabulation to determine the generalizability of results.
A range of estimates into which it is highly probable that an unknown population parameter falls.
A suggestion of how far away from the actual population parameter a sample statistic is likely to be.
Social Data Analysis Copyright © 2021 by Mikaila Mariel Lemonik Arthur is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.
Null hypothesis significance testing: a short tutorial
Cyril Pernet
1 Centre for Clinical Brain Sciences (CCBS), Neuroimaging Sciences, The University of Edinburgh, Edinburgh, UK
Version Changes
Revised. Amendments from version 2.
This v3 includes minor changes that reflect the 3rd reviewer's comments - in particular the theoretical vs. practical difference between Fisher and Neyman-Pearson. Additional information and a reference are also included regarding the interpretation of the p-value for low-powered studies.
Peer Review Summary
| Review date | Reviewer name(s) | Version reviewed | Review status |
| --- | --- | --- | --- |
| | Dorothy Vera Margaret Bishop | | Approved with Reservations |
| | Stephen J. Senn | | Approved |
| | Stephen J. Senn | | Approved with Reservations |
| | Marcel ALM van Assen | | Not Approved |
| | Daniel Lakens | | Not Approved |
Although thoroughly criticized, null hypothesis significance testing (NHST) remains the statistical method of choice used to provide evidence for an effect, in biological, biomedical and social sciences. In this short tutorial, I first summarize the concepts behind the method, distinguishing test of significance (Fisher) and test of acceptance (Neyman-Pearson) and point to common interpretation errors regarding the p-value. I then present the related concepts of confidence intervals and again point to common interpretation errors. Finally, I discuss what should be reported in which context. The goal is to clarify concepts to avoid interpretation errors and propose reporting practices.
The Null Hypothesis Significance Testing framework
NHST is a method of statistical inference by which an experimental factor is tested against a hypothesis of no effect or no relationship based on a given observation. The method is a combination of the concepts of significance testing developed by Fisher in 1925 and of acceptance based on critical rejection regions developed by Neyman & Pearson in 1928 . In the following, I first present each approach, highlighting the key differences and common misconceptions that result from their combination into the NHST framework (for a more mathematical comparison, along with the Bayesian method, see Christensen, 2005 ). I next present the related concept of confidence intervals. I finish by discussing practical aspects of using NHST and reporting practice.
Fisher, significance testing, and the p-value
The method developed by ( Fisher, 1934 ; Fisher, 1955 ; Fisher, 1959 ) allows one to compute the probability of observing a result at least as extreme as a test statistic (e.g. t value), assuming the null hypothesis of no effect is true. This probability or p-value reflects (1) the conditional probability of achieving the observed outcome or larger: p(Obs≥t|H0), and (2) is therefore a cumulative probability rather than a point estimate. It is equal to the area under the null probability distribution curve from the observed test statistic to the tail of the null distribution ( Turkheimer et al. , 2004 ). The approach proposed is one of ‘proof by contradiction’ ( Christensen, 2005 ): we pose the null model and test whether the data conform to it.
In practice, it is recommended to set a level of significance (a theoretical p-value) that acts as a reference point to identify significant results, that is to identify results that differ from the null-hypothesis of no effect. Fisher recommended using p=0.05 to judge whether an effect is significant or not as it is roughly two standard deviations away from the mean for the normal distribution ( Fisher, 1934 page 45: ‘The value for which p=.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not’). A key aspect of Fisher’s theory is that only the null-hypothesis is tested, and therefore p-values are meant to be used in a graded manner to decide whether the evidence is worth additional investigation and/or replication ( Fisher, 1971 page 13: ‘it is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require […]’ and ‘no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon’). How small the level of significance is, is thus left to researchers.
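Fisher's "nearly 2" remark can be verified with a few lines of code; a sketch using only the standard library, where the two-tailed normal tail probability is obtained from the complementary error function:

```python
import math

def normal_two_tailed_p(z):
    # Two-tailed p-value for a standard normal deviate:
    # P(|Z| >= z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

# A deviation of 1.96 standard deviations corresponds to p of about
# 0.05 -- Fisher's "1 in 20" reference point.
print(round(normal_two_tailed_p(1.96), 3))  # 0.05
```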
What is not a p-value? Common mistakes
The p-value is not an indication of the strength or magnitude of an effect . Any interpretation of the p-value in relation to the effect under study (strength, reliability, probability) is wrong, since p-values are conditioned on H0. In addition, while p-values are uniformly distributed (if all the assumptions of the test are met) when there is no effect, their distribution depends on both the population effect size and the number of participants, making it impossible to infer the strength of an effect from them.
Similarly, 1-p is not the probability of replicating an effect . Often, a small value of p is taken to mean a strong likelihood of getting the same results on another try, but again this cannot be inferred, because the p-value is not informative about the effect itself ( Miller, 2009 ). Because the p-value depends on the number of subjects, it can only be used in high powered studies to interpret results. In low powered studies (typically with small numbers of subjects), the p-value has a large variance across repeated samples, making it unreliable as an estimate of replication ( Halsey et al. , 2015 ).
A (small) p-value is not an indication favouring a given hypothesis . Because a low p-value only indicates a misfit of the null hypothesis to the data, it cannot be taken as evidence in favour of a specific alternative hypothesis more than any other possible alternatives such as measurement error and selection bias ( Gelman, 2013 ). Some authors have even argued that the more (a priori) implausible the alternative hypothesis, the greater the chance that a finding is a false alarm ( Krzywinski & Altman, 2013 ; Nuzzo, 2014 ).
The p-value is not the probability of the null hypothesis p(H0), of being true, ( Krzywinski & Altman, 2013 ). This common misconception arises from a confusion between the probability of an observation given the null p(Obs≥t|H0) and the probability of the null given an observation p(H0|Obs≥t) that is then taken as an indication for p(H0) (see Nickerson, 2000 ).
Neyman-Pearson, hypothesis testing, and the α-value
Neyman & Pearson (1933) proposed a framework of statistical inference for applied decision making and quality control. In such a framework, two hypotheses are proposed: the null hypothesis of no effect and the alternative hypothesis of an effect, along with a control of the long run probabilities of making errors. The first key concept in this approach is the establishment of an alternative hypothesis along with an a priori effect size. This differs markedly from Fisher, who proposed a general approach for scientific inference conditioned on the null hypothesis only. The second key concept is the control of error rates . Neyman & Pearson (1928) introduced the notion of critical intervals, therefore dichotomizing the space of possible observations into correct vs. incorrect zones. This dichotomization allows distinguishing correct results (rejecting H0 when there is an effect and not rejecting H0 when there is no effect) from errors (rejecting H0 when there is no effect, the Type I error, and not rejecting H0 when there is an effect, the Type II error). In this context, alpha is the probability of committing a Type I error in the long run. Similarly, beta is the probability of committing a Type II error in the long run.
The (theoretical) difference in terms of hypothesis testing between Fisher and Neyman-Pearson is illustrated in Figure 1 . In the first case, we choose a level of significance for observed data of 5%, and compute the p-value. If the p-value is below the level of significance, it is used to reject H0. In the second case, we set a critical interval based on the a priori effect size and error rates. If an observed statistic value falls outside the critical values (the bounds of the acceptance region), it is deemed significantly different from H0. In the NHST framework, the level of significance is (in practice) assimilated to the alpha level, which appears as a simple decision rule: if the p-value is less than or equal to alpha, the null is rejected. It is however a common mistake to conflate these two concepts. The level of significance set for a given sample is not the same as the frequency of acceptance alpha found on repeated sampling, because alpha (a point estimate) is meant to reflect the long run probability whilst the p-value (a cumulative estimate) reflects the current probability ( Fisher, 1955 ; Hubbard & Bayarri, 2003 ).
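The critical value defining Neyman-Pearson's rejection region can be approximated by simulating the null distribution of the test statistic; a sketch under assumed conditions (a z-like statistic from samples of 32 standard normal draws, one-sided alpha = 0.05; the numbers echo, but do not reproduce, Figure 1):

```python
import random

random.seed(42)

# Simulate the null distribution of the statistic: the mean of 32
# standard normal draws, scaled so the statistic itself is ~N(0, 1).
null_stats = sorted(
    sum(random.gauss(0, 1) for _ in range(32)) / 32**0.5
    for _ in range(20_000)
)

# The one-sided critical value is the 95th percentile of the null
# distribution; observed statistics beyond it fall in the rejection
# region and lead to rejecting H0.
critical_value = null_stats[int(0.95 * len(null_stats))]
print(round(critical_value, 2))  # close to 1.64, the tabled one-sided 5% value
```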
The figure was prepared with G*Power for a one-sided one-sample t-test, with a sample size of 32 subjects, an effect size of 0.45, and error rates alpha=0.049 and beta=0.80. In Fisher’s procedure, only the null hypothesis is posed, and the observed p-value is compared to an a priori level of significance. If the observed p-value is below this level (here p=0.05), one rejects H0. In Neyman-Pearson’s procedure, the null and alternative hypotheses are specified along with an a priori level of acceptance. If the observed statistical value falls outside the critical region (here [-∞ +1.69]), one rejects H0.
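The critical bound quoted for the figure can be reproduced in a few lines of scipy. This is only a sketch: it uses the conventional alpha = 0.05 (the figure quotes alpha = 0.049), which yields an upper-tail critical t of roughly 1.69-1.70 for df = 31, consistent with the [-∞ +1.69] region above.

```python
from scipy import stats

n = 32          # sample size from the figure
df = n - 1      # degrees of freedom for a one-sample t-test
alpha = 0.05    # assumed conventional level (the figure quotes 0.049)

# Upper-tail critical value for a one-sided test: values of t above this
# bound fall in the rejection region.
t_crit = stats.t.ppf(1 - alpha, df)  # roughly 1.69-1.70
```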
Acceptance or rejection of H0?
The acceptance level α can also be viewed as the maximum probability that a test statistic falls into the rejection region when the null hypothesis is true ( Johnson, 2013 ). Therefore, one can only reject the null hypothesis if the test statistic falls into the critical region(s), or fail to reject this hypothesis. In the latter case, all we can say is that no significant effect was observed; one cannot conclude that the null hypothesis is true. This is another common mistake in using NHST: there is a profound difference between accepting the null hypothesis and simply failing to reject it ( Killeen, 2005 ). In failing to reject, we do not show that H0 is true, which implies that one cannot argue against a theory from a non-significant result (absence of evidence is not evidence of absence). To accept the null hypothesis, tests of equivalence ( Walker & Nowacki, 2011 ) or Bayesian approaches ( Dienes, 2014 ; Kruschke, 2011 ) must be used.
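As a hedged illustration of such an equivalence test, the sketch below implements the two one-sided tests (TOST) procedure by hand with scipy. The equivalence bounds (±0.3) and the simulated data are assumed values chosen for the demonstration, not from the paper; in practice the bounds must be justified a priori.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=400)  # simulated data, true mean = 0
low, high = -0.3, 0.3                          # assumed a priori equivalence bounds

n = len(x)
df = n - 1
se = x.std(ddof=1) / np.sqrt(n)

# Test 1: H0 "mean <= low" vs H1 "mean > low" (upper-tailed)
t_low = (x.mean() - low) / se
p_low = stats.t.sf(t_low, df)

# Test 2: H0 "mean >= high" vs H1 "mean < high" (lower-tailed)
t_high = (x.mean() - high) / se
p_high = stats.t.cdf(t_high, df)

# Equivalence is claimed only if BOTH one-sided tests reject,
# i.e. if the larger of the two p-values is at or below alpha.
p_tost = max(p_low, p_high)
```

Here a small `p_tost` supports the claim that the mean lies within the equivalence bounds, which is the kind of positive statement about H0 that an ordinary non-significant NHST result cannot provide.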
Confidence intervals
Confidence intervals (CI) are constructs that fail to cover the true value at a rate of alpha, the Type I error rate ( Morey & Rouder, 2011 ), and therefore indicate whether observed values can be rejected by a (two-tailed) test with a given alpha. CI have been advocated as alternatives to p-values because (i) they allow judging statistical significance and (ii) they provide estimates of effect size. Assuming the CI (a)symmetry and width are correct (but see Wilcox, 2012 ), they also give some indication of the likelihood that a similar value will be observed in future studies: for future studies of the same sample size, 95% CI give about an 83% chance of replication success ( Cumming & Maillardet, 2006 ). If sample sizes differ between studies, however, there is no guarantee that a CI from one study will cover the true value at the rate alpha in another, so CI cannot be directly compared across studies.
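The replication figure quoted above can be checked by simulation. The sketch below uses assumed values throughout (population, sample size, number of simulations): it estimates how often the 95% CI from one study captures the observed mean of an independent same-size replication, which lands in the low-to-mid 0.80s rather than at 0.95.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_sim = 30, 20_000  # assumed illustrative values

a = rng.normal(0.0, 1.0, size=(n_sim, n))  # original studies
b = rng.normal(0.0, 1.0, size=(n_sim, n))  # same-size replications

# 95% CI half-width of each original study (t-based).
half = stats.t.ppf(0.975, n - 1) * a.std(axis=1, ddof=1) / np.sqrt(n)

# How often does the original CI capture the replication's sample mean?
capture_rate = float(np.mean(np.abs(b.mean(axis=1) - a.mean(axis=1)) <= half))
# capture_rate is roughly 0.83-0.85, not 0.95: the CI covers the TRUE mean
# 95% of the time, but a replication mean carries its own sampling error.
```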
Although CI provide more information, they are no less subject to interpretation errors (see Savalei & Dunn, 2015 for a review). The most common mistake is to interpret a CI as meaning that the parameter (e.g. the population mean) will fall in that interval X% of the time. The correct interpretation is that, for repeated measurements with the same sample size, taken from the same population, X% of the CI obtained will contain the true parameter value ( Tan & Tan, 2010 ). The alpha value has the same interpretation as in testing against H0, e.g. a 95% CI fails to cover the true value 5% of the time in the long run. This implies that CI do not allow one to make strong statements about the parameter of interest (e.g. the mean difference) or about H1 ( Hoekstra et al. , 2014 ). To make a statement about the probability of a parameter of interest (e.g. the probability of the mean), Bayesian intervals must be used.
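The long-run reading of a CI described above can itself be illustrated by simulation. In this sketch the population mean, standard deviation, and sample size are assumed values chosen for the demonstration: across repeated samples, about 95% of the 95% CIs contain the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_mean, sd, n, n_sim = 10.0, 2.0, 25, 5_000  # assumed illustrative values

x = rng.normal(true_mean, sd, size=(n_sim, n))  # repeated samples
means = x.mean(axis=1)
half = stats.t.ppf(0.975, n - 1) * x.std(axis=1, ddof=1) / np.sqrt(n)

# Fraction of intervals that contain the true parameter value.
coverage = float(np.mean((means - half <= true_mean) & (true_mean <= means + half)))
# coverage is close to 0.95: the X% statement is about the procedure over
# repeated sampling, not about any single computed interval.
```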
The (correct) use of NHST
NHST has always been criticized, and yet it is still used every day in scientific reports ( Nickerson, 2000 ). One question to ask oneself is: what is the goal of the scientific experiment at hand? If the goal is to establish a discrepancy with the null hypothesis and/or to establish a pattern of order (i.e. establish that A > B), because both require ruling out equivalence, then NHST is a good tool ( Frick, 1996 ; Walker & Nowacki, 2011 ). If the goal is to test the presence of an effect and/or establish some quantitative values related to an effect, then NHST is not the method of choice, since testing is conditioned on H0.
While a Bayesian analysis is suited to estimating the probability that a hypothesis is correct, like NHST it does not by itself prove a theory; it only adds to its plausibility ( Lindley, 2000 ). No matter what testing procedure is used and how strong the results are, Fisher (1959, p. 13) reminds us that ‘[…] no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon’. Similarly, the recent statement of the American Statistical Association ( Wasserstein & Lazar, 2016 ) makes it clear that conclusions should be based on the researcher’s understanding of the problem in context, along with all summary data and tests, and that no single value (be it a p-value, Bayes factor, or anything else) can be used to support or invalidate a theory.
What to report and how?
Considering that quantitative reports will always have more information content than binary (significant or not) reports, one can always argue that raw and/or normalized effect sizes, confidence intervals, or Bayes factors must be reported. Reporting everything can, however, hinder the communication of the main result(s), and we should aim at giving only the information needed, at least in the core of a manuscript. I recommend adopting optimal reporting in the results section to keep the message clear, while providing detailed supplementary material. When the hypothesis concerns the presence/absence or order of an effect, and provided that the study has sufficient power, NHST is appropriate, and it is sufficient to report the actual p-value in the text, since it conveys the information needed to rule out equivalence. When the hypothesis and/or the discussion involve some quantitative value, and because p-values do not inform on effect size, it is essential to report effect sizes ( Lakens, 2013 ), preferably accompanied by confidence or credible intervals. The reasoning is simply that one cannot predict and/or discuss quantities without accounting for variability.
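As a minimal sketch of such reporting, the code below computes a standardized effect size (Cohen's d for a one-sample design) with a percentile-bootstrap confidence interval. The data and the true effect are simulated, and the bootstrap is one interval-construction choice among several; it is shown here only to make the "effect size plus interval" recommendation concrete.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=0.5, scale=1.0, size=80)  # simulated sample, true d = 0.5

def cohens_d(sample):
    # One-sample standardized effect size: mean / sd.
    return sample.mean() / sample.std(ddof=1)

d = cohens_d(x)

# Percentile-bootstrap 95% CI: resample with replacement, recompute d.
boot = np.array([cohens_d(rng.choice(x, size=len(x), replace=True))
                 for _ in range(5_000)])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# Report as e.g. "d = 0.52, 95% CI [0.29, 0.76]" rather than a bare p-value.
```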
Because scientific progress is obtained by accumulating evidence ( Rosenthal, 1991 ), scientists should also consider the secondary use of the data. With today’s electronic articles, there is no reason not to include all derived data: means, standard deviations, effect sizes, CI, and Bayes factors can be included as supplementary tables (or, even better, the raw data can be shared). It is also essential to report the context in which tests were performed, that is, to report all of the tests performed (all t, F, p values), because of the increased Type I error rate due to selective reporting (the multiple comparisons and p-hacking problems; Ioannidis, 2005 ). Providing all of this information allows (i) other researchers to directly and effectively compare their results in quantitative terms (replication of effects beyond significance; Open Science Collaboration, 2015 ), (ii) the computation of power for future studies ( Lakens & Evers, 2014 ), and (iii) the aggregation of results for meta-analyses whilst minimizing publication bias ( van Assen et al. , 2014 ).
[version 3; referees: 1 approved]
Funding Statement
The author(s) declared that no grants were involved in supporting this work.
- Christensen R: Testing Fisher, Neyman, Pearson, and Bayes. The American Statistician. 2005; 59 ( 2 ):121–126. 10.1198/000313005X20871 [ CrossRef ] [ Google Scholar ]
- Cumming G, Maillardet R: Confidence intervals and replication: Where will the next mean fall? Psychological Methods. 2006; 11 ( 3 ):217–227. 10.1037/1082-989X.11.3.217 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Dienes Z: Using Bayes to get the most out of non-significant results. Front Psychol. 2014; 5 :781. 10.3389/fpsyg.2014.00781 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Fisher RA: Statistical Methods for Research Workers . 5th Edition. Edinburgh, UK: Oliver and Boyd,1934. Reference Source [ Google Scholar ]
- Fisher RA: Statistical Methods and Scientific Induction. Journal of the Royal Statistical Society, Series B. 1955; 17 ( 1 ):69–78. Reference Source [ Google Scholar ]
- Fisher RA: Statistical methods and scientific inference . (2nd ed.). NewYork: Hafner Publishing,1959. Reference Source [ Google Scholar ]
- Fisher RA: The Design of Experiments . Hafner Publishing Company, New-York.1971. Reference Source [ Google Scholar ]
- Frick RW: The appropriate use of null hypothesis testing. Psychol Methods. 1996; 1 ( 4 ):379–390. 10.1037/1082-989X.1.4.379 [ CrossRef ] [ Google Scholar ]
- Gelman A: P values and statistical practice. Epidemiology. 2013; 24 ( 1 ):69–72. 10.1097/EDE.0b013e31827886f7 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Halsey LG, Curran-Everett D, Vowler SL, et al.: The fickle P value generates irreproducible results. Nat Methods. 2015; 12 ( 3 ):179–85. 10.1038/nmeth.3288 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Hoekstra R, Morey RD, Rouder JN, et al.: Robust misinterpretation of confidence intervals. Psychon Bull Rev. 2014; 21 ( 5 ):1157–1164. 10.3758/s13423-013-0572-3 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Hubbard R, Bayarri MJ: Confusion over measures of evidence (p’s) versus errors ([alpha]’s) in classical statistical testing. The American Statistician. 2003; 57 ( 3 ):171–182. 10.1198/0003130031856 [ CrossRef ] [ Google Scholar ]
- Ioannidis JP: Why most published research findings are false. PLoS Med. 2005; 2 ( 8 ):e124. 10.1371/journal.pmed.0020124 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Johnson VE: Revised standards for statistical evidence. Proc Natl Acad Sci U S A. 2013; 110 ( 48 ):19313–19317. 10.1073/pnas.1313476110 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Killeen PR: An alternative to null-hypothesis significance tests. Psychol Sci. 2005; 16 ( 5 ):345–353. 10.1111/j.0956-7976.2005.01538.x [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Kruschke JK: Bayesian Assessment of Null Values Via Parameter Estimation and Model Comparison. Perspect Psychol Sci. 2011; 6 ( 3 ):299–312. 10.1177/1745691611406925 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Krzywinski M, Altman N: Points of significance: Significance, P values and t -tests. Nat Methods. 2013; 10 ( 11 ):1041–1042. 10.1038/nmeth.2698 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Lakens D: Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t -tests and ANOVAs. Front Psychol. 2013; 4 :863. 10.3389/fpsyg.2013.00863 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Lakens D, Evers ER: Sailing From the Seas of Chaos Into the Corridor of Stability: Practical Recommendations to Increase the Informational Value of Studies. Perspect Psychol Sci. 2014; 9 ( 3 ):278–292. 10.1177/1745691614528520 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Lindley D: The philosophy of statistics. Journal of the Royal Statistical Society. 2000; 49 ( 3 ):293–337. 10.1111/1467-9884.00238 [ CrossRef ] [ Google Scholar ]
- Miller J: What is the probability of replicating a statistically significant effect? Psychon Bull Rev. 2009; 16 ( 4 ):617–640. 10.3758/PBR.16.4.617 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Morey RD, Rouder JN: Bayes factor approaches for testing interval null hypotheses. Psychol Methods. 2011; 16 ( 4 ):406–419. 10.1037/a0024377 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Neyman J, Pearson ES: On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference: Part I. Biometrika. 1928; 20A ( 1/2 ):175–240. [ CrossRef ] [ Google Scholar ]
- Neyman J, Pearson ES: On the problem of the most efficient tests of statistical hypotheses. Philos Trans R Soc Lond Ser A. 1933; 231 ( 694–706 ):289–337. 10.1098/rsta.1933.0009 [ CrossRef ] [ Google Scholar ]
- Nickerson RS: Null hypothesis significance testing: a review of an old and continuing controversy. Psychol Methods. 2000; 5 ( 2 ):241–301. 10.1037/1082-989X.5.2.241 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Nuzzo R: Scientific method: statistical errors. Nature. 2014; 506 ( 7487 ):150–152. 10.1038/506150a [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Open Science Collaboration: Estimating the reproducibility of psychological science. Science. 2015; 349 ( 6251 ):aac4716. 10.1126/science.aac4716 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Rosenthal R: Cumulating psychology: an appreciation of Donald T. Campbell. Psychol Sci. 1991; 2 ( 4 ):213–221. 10.1111/j.1467-9280.1991.tb00138.x [ CrossRef ] [ Google Scholar ]
- Savalei V, Dunn E: Is the call to abandon p -values the red herring of the replicability crisis? Front Psychol. 2015; 6 :245. 10.3389/fpsyg.2015.00245 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Tan SH, Tan SB: The Correct Interpretation of Confidence Intervals. Proceedings of Singapore Healthcare. 2010; 19 ( 3 ):276–278. 10.1177/201010581001900316 [ CrossRef ] [ Google Scholar ]
- Turkheimer FE, Aston JA, Cunningham VJ: On the logic of hypothesis testing in functional imaging. Eur J Nucl Med Mol Imaging. 2004; 31 ( 5 ):725–732. 10.1007/s00259-003-1387-7 [ PubMed ] [ CrossRef ] [ Google Scholar ]
- van Assen MA, van Aert RC, Nuijten MB, et al.: Why Publishing Everything Is More Effective than Selective Publishing of Statistically Significant Results. PLoS One. 2014; 9 ( 1 ):e84896. 10.1371/journal.pone.0084896 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Walker E, Nowacki AS: Understanding equivalence and noninferiority testing. J Gen Intern Med. 2011; 26 ( 2 ):192–196. 10.1007/s11606-010-1513-8 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
- Wasserstein RL, Lazar NA: The ASA’s Statement on p -Values: Context, Process, and Purpose. The American Statistician. 2016; 70 ( 2 ):129–133. 10.1080/00031305.2016.1154108 [ CrossRef ] [ Google Scholar ]
- Wilcox R: Introduction to Robust Estimation and Hypothesis Testing . Edition 3, Academic Press, Elsevier: Oxford, UK, ISBN: 978-0-12-386983-8.2012. Reference Source [ Google Scholar ]
Referee response for version 3
Dorothy Vera Margaret Bishop
1 Department of Experimental Psychology, University of Oxford, Oxford, UK
I can see from the history of this paper that the author has already been very responsive to reviewer comments, and that the process of revising has now been quite protracted.
That makes me reluctant to suggest much more, but I do see potential here for making the paper more impactful. So my overall view is that, once a few typos are fixed (see below), this could be published as is, but I think there is an issue with the potential readership and that further revision could overcome this.
I suspect my take on this is rather different from other reviewers, as I do not regard myself as a statistics expert, though I am on the more quantitative end of the continuum of psychologists and I try to keep up to date. I think I am quite close to the target readership , insofar as I am someone who was taught about statistics ages ago and uses stats a lot, but never got adequate training in the kinds of topic covered by this paper. The fact that I am aware of controversies around the interpretation of confidence intervals etc is simply because I follow some discussions of this on social media. I am therefore very interested to have a clear account of these issues.
This paper contains helpful information for someone in this position, but it is not always clear, and I felt the relevance of some of the content was uncertain. So here are some recommendations:
- As one previous reviewer noted, it’s questionable that there is a need for a tutorial introduction, and the limited length of this article does not lend itself to a full explanation. So it might be better to just focus on explaining as clearly as possible the problems people have had in interpreting key concepts. I think a title that made it clear this was the content would be more appealing than the current one.
- P 3, col 1, para 3, last sentence. Although statisticians always emphasise the arbitrary nature of p < .05, we all know that in practice authors who use other values are likely to have their analyses queried. I wondered whether it would be useful here to note that in some disciplines different cutoffs are traditional, e.g. particle physics. Or you could cite David Colquhoun’s paper in which he recommends using p < .001 ( http://rsos.royalsocietypublishing.org/content/1/3/140216) - just to be clear that the traditional p < .05 has been challenged.
What I can’t work out is how you would explain the alpha from Neyman-Pearson in the same way (though I can see from Figure 1 that with N-P you could test an alternative hypothesis, such as the idea that the coin would be heads 75% of the time).
‘By failing to reject, we simply continue to assume that H0 is true, which implies that one cannot….’ could instead read ‘In failing to reject, we do not assume that H0 is true; one cannot argue against a theory from a non-significant result.’
I felt most readers would be interested to read about tests of equivalence and Bayesian approaches, but many would be unfamiliar with these and might like to see an example of how they work in practice – if space permitted.
- Confidence intervals: I simply could not understand the first sentence – I wondered what was meant by ‘builds’ here. I understand about difficulties in comparing CI across studies when sample sizes differ, but I did not find the last sentence on p 4 easy to understand.
- P 5: The sentence starting: ‘The alpha value has the same interpretation’ was also hard to understand, especially the term ‘1-alpha CI’. Here too I felt some concrete illustration might be helpful to the reader. And again, I also found the reference to Bayesian intervals tantalising – I think many readers won’t know how to compute these and something like a figure comparing a traditional CI with a Bayesian interval and giving a source for those who want to read on would be very helpful. The reference to ‘credible intervals’ in the penultimate paragraph is very unclear and needs a supporting reference – most readers will not be familiar with this concept.
P 3, col 1, para 2, line 2; “allows us to compute”
P 3, col 2, para 2, ‘probability of replicating’
P 3, col 2, para 2, line 4 ‘informative about’
P 3, col 2, para 4, line 2 delete ‘of’
P 3, col 2, para 5, line 9 – ‘conditioned’ is either wrong or too technical here: would ‘based’ be acceptable as alternative wording
P 3, col 2, para 5, line 13 ‘This dichotomisation allows one to distinguish’
P 3, col 2, para 5, last sentence, delete ‘Alternatively’.
P 3, col 2, last para line 2 ‘first’
P 4, col 2, para 2, last sentence is hard to understand; not sure if this is better: ‘If sample sizes differ between studies, the distribution of CIs cannot be specified a priori’
P 5, col 1, para 2, ‘a pattern of order’ – I did not understand what was meant by this
P 5, col 1, para 2, last sentence unclear: possible rewording: “If the goal is to test the size of an effect then NHST is not the method of choice, since testing can only reject the null hypothesis.’ (??)
P 5, col 1, para 3, line 1 delete ‘that’
P 5, col 1, para 3, line 3 ‘on’ -> ‘by’
P 5, col 2, para 1, line 4 , rather than ‘Here I propose to adopt’ I suggest ‘I recommend adopting’
P 5, col 2, para 1, line 13 ‘with’ -> ‘by’
P 5, col 2, para 1 – recommend deleting last sentence
P 5, col 2, para 2, line 2 ‘consider’ -> ‘anticipate’
P 5, col 2, para 2, delete ‘should always be included’
P 5, col 2, para 2, ‘type one’ -> ‘Type I’
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
The University of Edinburgh, UK
I wondered about changing the focus slightly and modifying the title to reflect this to say something like: Null hypothesis significance testing: a guide to commonly misunderstood concepts and recommendations for good practice
Thank you for the suggestion – you indeed saw the intention behind the ‘tutorial’ style of the paper.
- P 3, col 1, para 3, last sentence. Although statisticians always emphasise the arbitrary nature of p < .05, we all know that in practice authors who use other values are likely to have their analyses queried. I wondered whether it would be useful here to note that in some disciplines different cutoffs are traditional, e.g. particle physics. Or you could cite David Colquhoun’s paper in which he recommends using p < .001 ( http://rsos.royalsocietypublishing.org/content/1/3/140216) - just to be clear that the traditional p < .05 has been challenged.
I have added a sentence on this citing Colquhoun 2014 and the new Benjamin 2017 on using .005.
I agree that this point is always hard to appreciate, especially because it seems like in practice it makes little difference. I added a paragraph but using reaction times rather than a coin toss – thanks for the suggestion.
Added an example based on new table 1, following figure 1 – giving CI, equivalence tests and Bayes Factor (with refs to easy to use tools)
Changed ‘builds’ to ‘constructs’ (this simply means they are something we build) and added that the implication of coverage not being guaranteed when sample sizes change is that we cannot compare CI.
I changed ‘ i.e. we accept that 1-alpha CI are wrong in alpha percent of the times in the long run’ to ‘, ‘e.g. a 95% CI is wrong in 5% of the times in the long run (i.e. if we repeat the experiment many times).’ – for Bayesian intervals I simply re-cited Morey & Rouder, 2011.
It is not that the CI cannot be specified, it’s that the interval is no longer predictive of anything! I changed it to ‘If sample sizes, however, differ between studies, there is no guarantee that a CI from one study will be true at the rate alpha in a different study, which implies that CI cannot be compared across studies, as studies rarely have the same sample sizes’
I added (i.e. establish that A > B) – we test that conditions are ordered, but without further specification of the probability of that effect nor its size
Yes it works – thx
P 5, col 2, para 2, ‘type one’ -> ‘Type I’
Typos fixed, and suggestions accepted – thanks for that.
Stephen J. Senn
1 Luxembourg Institute of Health, Strassen, L-1445, Luxembourg
The revisions are OK for me, and I have changed my status to Approved.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Referee response for version 2
On the whole I think that this article is reasonable, my main reservation being that I have my doubts on whether the literature needs yet another tutorial on this subject.
A further reservation I have is that the author, following others, stresses what in my mind is a relatively unimportant distinction between the Fisherian and Neyman-Pearson (NP) approaches. The distinction stressed by many is that the NP approach leads to a dichotomy accept/reject based on probabilities established in advance, whereas the Fisherian approach uses tail area probabilities calculated from the observed statistic. I see this as being unimportant and not even true. Unless one considers that the person carrying out a hypothesis test (original tester) is mandated to come to a conclusion on behalf of all scientific posterity, then one must accept that any remote scientist can come to his or her conclusion depending on the personal type I error favoured. To operate the results of an NP test carried out by the original tester, the remote scientist then needs to know the p-value. The type I error rate is then compared to this to come to a personal accept or reject decision (1). In fact Lehmann (2), who was an important developer of and proponent of the NP system, describes exactly this approach as being good practice. (See Testing Statistical Hypotheses, 2nd edition P70). Thus using tail-area probabilities calculated from the observed statistics does not constitute an operational difference between the two systems.
A more important distinction between the Fisherian and NP systems is that the former does not use alternative hypotheses (3). Fisher's opinion was that the null hypothesis was more primitive than the test statistic but that the test statistic was more primitive than the alternative hypothesis. Thus, alternative hypotheses could not be used to justify choice of test statistic. Only experience could do that.
Further distinctions between the NP and Fisherian approach are to do with conditioning and whether a null hypothesis can ever be accepted.
I have one minor quibble about terminology. As far as I can see, the author uses the usual term 'null hypothesis' and the eccentric term 'nil hypothesis' interchangeably. It would be simpler if the latter were abandoned.
Referee response for version 1
Marcel A. L. M. van Assen
1 Department of Methodology and Statistics, Tilburg University, Tilburg, Netherlands
Null hypothesis significance testing (NHST) is a difficult topic, with misunderstandings arising easily. Many texts, including basic statistics books, deal with the topic, and attempt to explain it to students and anyone else interested. I would refer to a good basic text book for a detailed explanation of NHST, or to a specialized article for an explanation of the background of NHST. So, what is the added value of a new text on NHST? In any case, the added value should be described at the start of this text. Moreover, the topic is so delicate and difficult that errors, misinterpretations, and disagreements are easy. I attempted to show this by giving comments to many sentences in the text.
Abstract: “null hypothesis significance testing is the statistical method of choice in biological, biomedical and social sciences to investigate if an effect is likely”. No, NHST is the method to test the hypothesis of no effect.
Intro: “Null hypothesis significance testing (NHST) is a method of statistical inference by which an observation is tested against a hypothesis of no effect or no relationship.” What is an ‘observation’? NHST is difficult to describe in one sentence, particularly here. I would skip this sentence entirely, here.
Section on Fisher; also explain the one-tailed test.
Section on Fisher; p(Obs|H0) does not reflect the verbal definition (the ‘or more extreme’ part).
Section on Fisher; use a reference and citation to Fisher’s interpretation of the p-value
Section on Fisher; “This was however only intended to be used as an indication that there is something in the data that deserves further investigation. The reason for this is that only H0 is tested whilst the effect under study is not itself being investigated.” First sentence, can you give a reference? Many people say a lot about Fisher’s intentions, but the good man is dead and cannot reply… Second sentence is a bit awkward, because the effect is investigated in a way, by testing the H0.
Section on p-value; Layout and structure can be improved greatly, by first again stating what the p-value is, and then statement by statement, what it is not, using separate lines for each statement. Consider adding that the p-value is randomly distributed under H0 (if all the assumptions of the test are met), and that under H1 the p-value is a function of population effect size and N; the larger each is, the smaller the p-value generally is.
Skip the sentence “If there is no effect, we should replicate the absence of effect with a probability equal to 1-p”. Not insightful, and you did not discuss the concept ‘replicate’ (and do not need to).
Skip the sentence “The total probability of false positives can also be obtained by aggregating results ( Ioannidis, 2005 ).” Not strongly related to p-values, and introduces unnecessary concepts ‘false positives’ (perhaps later useful) and ‘aggregation’.
Consider deleting; “If there is an effect however, the probability to replicate is a function of the (unknown) population effect size with no good way to know this from a single experiment ( Killeen, 2005 ).”
The following sentence; “ Finally, a (small) p-value is not an indication favouring a hypothesis . A low p-value indicates a misfit of the null hypothesis to the data and cannot be taken as evidence in favour of a specific alternative hypothesis more than any other possible alternatives such as measurement error and selection bias ( Gelman, 2013 ).” is surely not mainstream thinking about NHST; I would surely delete that sentence. In NHST, a p-value is used for testing the H0. Why did you not yet discuss significance level? Yes, before discussing what is not a p-value, I would explain NHST (i.e., what it is and how it is used).
Also the next sentence “The more (a priori) implausible the alternative hypothesis, the greater the chance that a finding is a false alarm ( Krzywinski & Altman, 2013 ; Nuzzo, 2014 ).“ is not fully clear to me. This is a Bayesian statement. In NHST, no likelihoods are attributed to hypotheses; the reasoning is “IF H0 is true, then…”.
Last sentence: “As Nickerson (2000) puts it ‘theory corroboration requires the testing of multiple predictions because the chance of getting statistically significant results for the wrong reasons in any given case is high’.” What is relation of this sentence to the contents of this section, precisely?
Next section: “For instance, we can estimate that the probability of a given F value to be in the critical interval [+2 +∞] is less than 5%” This depends on the degrees of freedom.
“When there is no effect (H0 is true), the erroneous rejection of H0 is known as type I error and is equal to the p-value.” Strange sentence. The Type I error is the probability of erroneously rejecting the H0 (so, when it is true). The p-value is … well, you explained it before; it surely does not equal the Type I error.
Consider adding a figure explaining the distinction between Fisher’s logic and that of Neyman and Pearson.
“When the test statistics falls outside the critical region(s)” What is outside?
“There is a profound difference between accepting the null hypothesis and simply failing to reject it ( Killeen, 2005 )” I agree with you, but perhaps you may add that some statisticians simply define “accept H0’” as obtaining a p-value larger than the significance level. Did you already discuss the significance level, and it’s mostly used values?
“To accept or reject equally the null hypothesis, Bayesian approaches ( Dienes, 2014 ; Kruschke, 2011 ) or confidence intervals must be used.” Is ‘reject equally’ appropriate English? Also using CIs, one cannot accept the H0.
Do you start discussing alpha only in the context of CIs?
“CI also indicates the precision of the estimate of effect size, but unless using a percentile bootstrap approach, they require assumptions about distributions which can lead to serious biases in particular regarding the symmetry and width of the intervals ( Wilcox, 2012 ).” Too difficult, using new concepts. Consider deleting.
“Assuming the CI (a)symmetry and width are correct, this gives some indication about the likelihood that a similar value can be observed in future studies, with 95% CI giving about 83% chance of replication success ( Lakens & Evers, 2014 ).” This statement is, in general, completely false. It very much depends on the sample sizes of both studies. If the replication study has a much, much, much larger N, then the probability that the original CI will contain the effect size of the replication approaches (1-alpha)*100%. If the original study has a much, much, much larger N, then the probability that the original CI will contain the effect size of the replication study approaches 0%.
“Finally, contrary to p-values, CI can be used to accept H0. Typically, if a CI includes 0, we cannot reject H0. If a critical null region is specified rather than a single point estimate, for instance [-2 +2] and the CI is included within the critical null region, then H0 can be accepted. Importantly, the critical region must be specified a priori and cannot be determined from the data themselves.” No. H0 cannot be accepted with Cis.
“The (posterior) probability of an effect can however not be obtained using a frequentist framework.” Frequentist framework? You did not discuss that, yet.
“X% of times the CI obtained will contain the same parameter value”. The same? True, you mean?
“e.g. X% of the times the CI contains the same mean” I do not understand; which mean?
“The alpha value has the same interpretation as when using H0, i.e. we accept that 1-alpha CI are wrong in alpha percent of the times. “ What do you mean, CI are wrong? Consider rephrasing.
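The frequentist coverage interpretation being discussed here can be illustrated with a small simulation; this is a minimal sketch assuming normally distributed data, with an illustrative mean, sample size, and repetition count (none taken from the manuscript). Across repeated sampling, roughly 95% of the 95% confidence intervals contain the true population mean, i.e. about alpha percent of intervals "miss":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, sd, n, alpha, reps = 10.0, 2.0, 30, 0.05, 10_000

covered = 0
for _ in range(reps):
    sample = rng.normal(true_mean, sd, n)
    m = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    lo, hi = m - t_crit * se, m + t_crit * se
    covered += lo <= true_mean <= hi

# Long-run coverage is close to 1 - alpha = 0.95; about alpha (5%)
# of the computed intervals fail to contain the true mean.
print(covered / reps)
```

This is the sense in which "1-alpha CIs are wrong alpha percent of the time": the statement is about the long-run behaviour of the interval-construction procedure, not about any single computed interval.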
“To make a statement about the probability of a parameter of interest, likelihood intervals (maximum likelihood) and credibility intervals (Bayes) are better suited.” ML gives the likelihood of the data given the parameter, not the other way around.
“Many of the disagreements are not on the method itself but on its use.” Bayesians may disagree.
“If the goal is to establish the likelihood of an effect and/or establish a pattern of order, because both requires ruling out equivalence, then NHST is a good tool ( Frick, 1996 )” NHST does not provide evidence on the likelihood of an effect.
“If the goal is to establish some quantitative values, then NHST is not the method of choice.” P-values are also quantitative… this is not a precise sentence. And NHST may be used in combination with effect size estimation (this is even recommended by, e.g., the American Psychological Association (APA)).
“Because results are conditioned on H0, NHST cannot be used to establish beliefs.” It can reinforce some beliefs, e.g., about whether H0, or any other hypothesis, is true.
“To estimate the probability of a hypothesis, a Bayesian analysis is a better alternative.” Is it the only alternative?
“Note however that even when a specific quantitative prediction from a hypothesis is shown to be true (typically testing H1 using Bayes), it does not prove the hypothesis itself, it only adds to its plausibility.” How can we show something is true?
I do not agree with the contents of the last section on ‘minimal reporting’. I prefer ‘optimal reporting’ instead, i.e., reporting the information that is essential to the interpretation of the result, to any reader, who may have other goals than the writer of the article. This reporting includes, for sure, an estimate of effect size, and preferably a confidence interval, which is in line with the recommendations of the APA.
I have read this submission. I believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.
The idea of this short review was to point to common interpretation errors (stressing again and again that we are under H0) in using p-values or CIs, and also to propose reporting practices that avoid bias. This is now stated at the end of the abstract.
Regarding textbooks, it is clear that many fail to clearly distinguish Fisher/Pearson/NHST, see Gliner et al. (2002) J. Exp. Education 71, 83-92. If you have 1 or 2 in mind that you know to be good, I’m happy to include them.
I agree – yet people use it to investigate (not test) if an effect is likely. The issue here is wording. What about adding this distinction at the end of the sentence?: ‘null hypothesis significance testing is the statistical method of choice in biological, biomedical and social sciences used to investigate if an effect is likely, even though it actually tests for the hypothesis of no effect’.
I think a definition is needed, as it offers a starting point. What about the following: ‘NHST is a method of statistical inference by which an experimental factor is tested against a hypothesis of no effect or no relationship based on a given observation’
The section on Fisher has been modified (more or less) as suggested: (1) avoiding talking about one- or two-tailed tests, (2) updating for p(Obs≥t|H0), and (3) referring to Fisher more explicitly (i.e. pages from articles and book); I cannot tell his intentions, but these quotes leave little room for alternative interpretations.
The reasoning here is as you state yourself, part 1: ‘a p-value is used for testing H0’; and part 2: ‘no likelihoods are attributed to hypotheses’; it follows that we cannot favour a hypothesis. It might seem contentious, but the case is that all we can do is reject the null – how could we favour a specific alternative hypothesis from there? This is explored further down the manuscript (and I now point to that) – note that we do not need to be Bayesian to favour a specific H1; all I’m saying is that this cannot be attained with a p-value.
The point was to emphasise that a p-value is not there to tell us a given H1 is true; that can only be achieved through multiple predictions and experiments. I deleted it for clarity.
This sentence has been removed
Indeed, you are right, and I have modified the text accordingly. When there is no effect (H0 is true), the erroneous rejection of H0 is known as a type 1 error. Importantly, the type 1 error rate, or alpha value, is determined a priori. It is a common mistake, but the level of significance (for a given sample) is not the same as the frequency of acceptance alpha found on repeated sampling (Fisher, 1955).
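The a priori nature of the type 1 error rate can be checked with a small simulation; this is a hedged sketch with illustrative group sizes and repetition counts (not values from the manuscript). When H0 is true by construction, repeated experiments produce p < alpha in about alpha percent of samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, reps = 0.05, 25, 10_000

false_positives = 0
for _ in range(reps):
    # H0 is true by construction: both groups come from the same distribution
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    _, p = stats.ttest_ind(a, b)
    false_positives += p < alpha

# The long-run false-positive rate matches the a priori alpha (~0.05)
print(false_positives / reps)
```

Note that this long-run rate is a property of the decision procedure fixed before seeing the data, which is precisely why it should not be confused with the significance level attached to any single observed sample.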
A figure is now presented – with levels of acceptance, critical region, level of significance and p-value.
I should have clarified further here, as I had in mind tests of equivalence. To clarify, I simply state now: ‘To accept the null hypothesis, tests of equivalence or Bayesian approaches must be used.’
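For readers unfamiliar with tests of equivalence, the logic can be sketched as two one-sided tests (TOST). The equivalence bounds [-2, +2] echo the critical null region mentioned earlier in this exchange; the sample sizes and means below are illustrative assumptions, not values from the manuscript:

```python
import numpy as np
from scipy import stats

def tost_one_sample(x, low, high):
    """Two one-sided t-tests: equivalence is concluded only if the mean
    is shown to be both above `low` and below `high`."""
    n = len(x)
    m = np.mean(x)
    se = np.std(x, ddof=1) / np.sqrt(n)
    p_low = 1 - stats.t.cdf((m - low) / se, df=n - 1)   # tests H0: mean <= low
    p_high = stats.t.cdf((m - high) / se, df=n - 1)     # tests H0: mean >= high
    return max(p_low, p_high)  # both one-sided tests must reject

rng = np.random.default_rng(2)
inside = rng.normal(0.1, 1.0, 200)   # true mean well inside [-2, 2]
outside = rng.normal(3.0, 1.0, 200)  # true mean outside the region

print(tost_one_sample(inside, -2.0, 2.0) < 0.05)   # equivalence concluded
print(tost_one_sample(outside, -2.0, 2.0) < 0.05)  # equivalence not concluded
```

Unlike a CI that merely includes 0, a significant TOST result supports the conclusion that the effect lies within the pre-specified null region, which is what "accepting H0" means here.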
It is now presented in the paragraph before.
Yes, you are right, I completely overlooked this problem. The corrected sentence (with a more accurate ref) is now: “Assuming the CI (a)symmetry and width are correct, this gives some indication about the likelihood that a similar value can be observed in future studies. For future studies of the same sample size, a 95% CI gives about an 83% chance of replication success (Cumming and Maillardet, 2006). If sample sizes differ between studies, CIs do not, however, warrant any a priori coverage”.
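The ~83% figure for equal sample sizes can be verified with a simple simulation (a sketch with illustrative parameters, assuming normally distributed data): generate an original study and a replication of the same n, and count how often the original 95% CI captures the replication's point estimate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_mean, sd, n, reps = 0.5, 1.0, 30, 10_000

captured = 0
for _ in range(reps):
    original = rng.normal(true_mean, sd, n)
    replication = rng.normal(true_mean, sd, n)  # same sample size
    m = original.mean()
    se = original.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)
    captured += (m - t_crit * se) <= replication.mean() <= (m + t_crit * se)

# With equal sample sizes, the original 95% CI captures the replication
# mean roughly 83% of the time, not 95%.
print(captured / reps)
```

The gap between 95% and ~83% arises because the replication mean is itself a random quantity, so the relevant sampling variability is that of the difference between two means, not of a single mean.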
Again, I had in mind equivalence testing, but in both cases you are right we can only reject and I therefore removed that sentence.
Yes, p-values must be interpreted in context with effect size, but this is not what people do. The point here is to be pragmatic: dos and don’ts. The sentence was changed.
Not for testing; but for estimating the probability of a hypothesis, I am not aware of anything else.
Cumulative evidence is, in my opinion, the only way to show it. Even in hard sciences like physics, multiple experiments are needed. In the recent CERN study on finding the Higgs boson, 2 different and complementary experiments ran in parallel – and the cumulative evidence was taken as proof of the true existence of the Higgs boson.
Daniel Lakens
1 School of Innovation Sciences, Eindhoven University of Technology, Eindhoven, Netherlands
I appreciate the author's attempt to write a short tutorial on NHST. Many people don't know how to use it, so attempts to educate people are always worthwhile. However, I don't think the current article reaches its aim. For one, I think it might be practically impossible to explain a lot in such an ultra-short paper - every section would require more than 2 pages to explain, and there are many sections. Furthermore, there are some excellent overviews which, although more extensive, are also much clearer (e.g., Nickerson, 2000). Finally, I found many statements to be unclear, and perhaps even incorrect (noted below). Because there is nothing worse than creating more confusion on such a topic, I have extremely high standards before I think such a short primer should be indexed. I note some examples of unclear or incorrect statements below. I'm sorry I can't make a more positive recommendation.
“investigate if an effect is likely” – ambiguous statement. I think you mean, whether the observed DATA is probable, assuming there is no effect?
The Fisher (1959) reference is not correct – Fisher developed his method much earlier.
“This p-value thus reflects the conditional probability of achieving the observed outcome or larger, p(Obs|H0)” – please add 'assuming the null-hypothesis is true'.
“p(Obs|H0)” – explain this notation for novices.
“Following Fisher, the smaller the p-value, the greater the likelihood that the null hypothesis is false.” This is wrong, and any statement about this needs to be much more precise. I would suggest direct quotes.
“there is something in the data that deserves further investigation” –unclear sentence.
“The reason for this” – unclear what ‘this’ refers to.
“ not the probability of the null hypothesis of being true, p(H0)” – second of can be removed?
“Any interpretation of the p-value in relation to the effect under study (strength, reliability, probability) is indeed wrong, since the p-value is conditioned on H0” - incorrect. A big problem is that it depends on the sample size, and that the probability of a theory depends on the prior.
“If there is no effect, we should replicate the absence of effect with a probability equal to 1-p.” I don’t understand this, but I think it is incorrect.
“The total probability of false positives can also be obtained by aggregating results (Ioannidis, 2005).” Unclear, and probably incorrect.
“By failing to reject, we simply continue to assume that H0 is true, which implies that one cannot, from a nonsignificant result, argue against a theory” – according to which theory? From a NP perspective, you can ACT as if the theory is false.
“(Lakens & Evers, 2014”) – we are not the original source, which should be cited instead.
“ Typically, if a CI includes 0, we cannot reject H0.” - when would this not be the case? This assumes a CI of 1-alpha.
“If a critical null region is specified rather than a single point estimate, for instance [-2 +2] and the CI is included within the critical null region, then H0 can be accepted.” – you mean practically, or formally? I’m pretty sure only the former.
The section on ‘The (correct) use of NHST’ seems to conclude only Bayesian statistics should be used. I don’t really agree.
“ we can always argue that effect size, power, etc. must be reported.” – which power? Post-hoc power? Surely not? Other types are unknown. So what do you mean?
The recommendation on what to report remains vague, and it is unclear why the recommended items should be reported.
This sentence was changed, following as well the other reviewer, to ‘null hypothesis significance testing is the statistical method of choice in biological, biomedical and social sciences to investigate if an effect is likely, even though it actually tests whether the observed data are probable, assuming there is no effect’
Changed, refers to Fisher 1925
I changed the sentence structure a little, which should make explicit that this is the conditional probability.
This has been changed to ‘[…] to decide whether the evidence is worth additional investigation and/or replication (Fisher, 1971 p13)’
my mistake – the sentence structure is now ‘not the probability of the null hypothesis p(H0), of being true’; I hope this makes more sense (and this way it refers back to p(Obs≥t|H0)).
Fair enough – my point was to stress the fact that the p-value and the effect size or H1 have very little in common, but yes, the part they have in common has to do with sample size. I left the conditioning on H0 but also point out the dependency on sample size.
The whole paragraph was changed to reflect a more philosophical take on scientific induction/reasoning. I hope this is clearer.
Changed to refer to equivalence testing
I rewrote this so as to show that frequentist analysis can be used - I’m not trying to sell Bayes more than any other approach.
I’m arguing we should report it all; that’s why there is no exhaustive list – I can provide one if needed.