
8.4.3 Hypothesis Testing for the Mean

$\quad$ $H_0$: $\mu=\mu_0$, $\quad$ $H_1$: $\mu \neq \mu_0$.

$\quad$ $H_0$: $\mu \leq \mu_0$, $\quad$ $H_1$: $\mu > \mu_0$.

$\quad$ $H_0$: $\mu \geq \mu_0$, $\quad$ $H_1$: $\mu \lt \mu_0$.

Two-sided Tests for the Mean:

Therefore, we can suggest the following test. Choose a threshold, and call it $c$. If $|W| \leq c$, accept $H_0$, and if $|W|>c$, accept $H_1$. How do we choose $c$? If $\alpha$ is the required significance level, we must have \begin{align} P(\textrm{type I error}) = P(|W| > c \; | \; H_0) = \alpha. \end{align}

  • As discussed above, we let \begin{align}%\label{} W(X_1,X_2, \cdots,X_n)=\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}. \end{align} Note that, assuming $H_0$, $W \sim N(0,1)$. We will choose a threshold, $c$. If $|W| \leq c$, we accept $H_0$, and if $|W|>c$, accept $H_1$. To choose $c$, we let \begin{align} P(|W| > c \; | \; H_0) =\alpha. \end{align} Since the standard normal PDF is symmetric around $0$, we have \begin{align} P(|W| > c \; | \; H_0) = 2 P(W>c | \; H_0). \end{align} Thus, we conclude $P(W>c | \; H_0)=\frac{\alpha}{2}$. Therefore, \begin{align} c=z_{\frac{\alpha}{2}}. \end{align} Therefore, we accept $H_0$ if \begin{align} \left|\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}} \right| \leq z_{\frac{\alpha}{2}}, \end{align} and reject it otherwise.
  • We have \begin{align} \beta (\mu) &=P(\textrm{type II error}) = P(\textrm{accept }H_0 \; | \; \mu) \\ &= P\left(\left|\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}} \right| \lt z_{\frac{\alpha}{2}}\; | \; \mu \right). \end{align} If $X_i \sim N(\mu,\sigma^2)$, then $\overline{X} \sim N(\mu, \frac{\sigma^2}{n})$. Thus, \begin{align} \beta (\mu)&=P\left(\left|\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}} \right| \lt z_{\frac{\alpha}{2}}\; | \; \mu \right)\\ &=P\left(\mu_0- z_{\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}} \leq \overline{X} \leq \mu_0+ z_{\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}\right)\\ &=\Phi\left(z_{\frac{\alpha}{2}}+\frac{\mu_0-\mu}{\sigma / \sqrt{n}}\right)-\Phi\left(-z_{\frac{\alpha}{2}}+\frac{\mu_0-\mu}{\sigma / \sqrt{n}}\right). \end{align}
  • Let $S^2$ be the sample variance for this random sample. Then, the random variable $W$ defined as \begin{equation} W(X_1,X_2, \cdots, X_n)=\frac{\overline{X}-\mu_0}{S / \sqrt{n}} \end{equation} has a $t$-distribution with $n-1$ degrees of freedom, i.e., $W \sim T(n-1)$. Thus, we can repeat the analysis of Example 8.24 here. The only difference is that we need to replace $\sigma$ by $S$ and $z_{\frac{\alpha}{2}}$ by $t_{\frac{\alpha}{2},n-1}$. Therefore, we accept $H_0$ if \begin{align} |W| \leq t_{\frac{\alpha}{2},n-1}, \end{align} and reject it otherwise. Let us look at a numerical example of this case.

$\quad$ $H_0$: $\mu=170$, $\quad$ $H_1$: $\mu \neq 170$.

  • Let's first compute the sample mean and the sample standard deviation. The sample mean is \begin{align}%\label{} \overline{X}&=\frac{X_1+X_2+\cdots+X_9}{9}\\ &=165.8. \end{align} The sample variance is given by \begin{align}%\label{} {S}^2=\frac{1}{9-1} \sum_{k=1}^9 (X_k-\overline{X})^2=68.01. \end{align} The sample standard deviation is given by \begin{align}%\label{} S= \sqrt{S^2}=8.25. \end{align} The following MATLAB code can be used to obtain these values:

    x = [176.2, 157.9, 160.1, 180.9, 165.1, 167.2, 162.9, 155.7, 166.2];
    m = mean(x);   % sample mean
    v = var(x);    % sample variance
    s = std(x);    % sample standard deviation

    Now, our test statistic is \begin{align} W(X_1,X_2, \cdots, X_9)&=\frac{\overline{X}-\mu_0}{S / \sqrt{n}}\\ &=\frac{165.8-170}{8.25 / 3}=-1.52. \end{align} Thus, $|W|=1.52$. Also, we have \begin{align} t_{\frac{\alpha}{2},n-1} = t_{0.025,8} \approx 2.31, \end{align} which can be obtained in MATLAB using the command $\mathtt{tinv(0.975,8)}$. Thus, we conclude \begin{align} |W| \leq t_{\frac{\alpha}{2},n-1}. \end{align} Therefore, we accept $H_0$. In other words, we do not have enough evidence to conclude that the average height in the city is different from the average height in the country.
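The same computation can be cross-checked outside MATLAB. Below is a small Python sketch using only the standard library; since the standard library has no $t$-distribution, the threshold $t_{0.025,8} \approx 2.31$ is taken from the text rather than computed.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Height data from the example above
x = [176.2, 157.9, 160.1, 180.9, 165.1, 167.2, 162.9, 155.7, 166.2]

n = len(x)
mu0 = 170
xbar = mean(x)   # sample mean
s = stdev(x)     # sample standard deviation (divides by n - 1)

# Test statistic W = (xbar - mu0) / (s / sqrt(n))
w = (xbar - mu0) / (s / sqrt(n))

print(round(xbar, 1))   # 165.8
print(round(s, 2))      # 8.25
# Compare |W| with t_{0.025,8} ~ 2.31 (value quoted in the text, via tinv(0.975,8)):
print(abs(w) <= 2.31)   # True -> accept H0
```

Note that `stdev` already applies the $n-1$ divisor, matching the sample-variance formula in the text.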

Let us summarize what we have obtained for the two-sided test for the mean.

Case | Test Statistic | Acceptance Region
$X_i \sim N(\mu, \sigma^2)$, $\sigma$ known | $W=\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}$ | $|W| \leq z_{\frac{\alpha}{2}}$
$n$ large, $X_i$ non-normal | $W=\frac{\overline{X}-\mu_0}{S / \sqrt{n}}$ | $|W| \leq z_{\frac{\alpha}{2}}$
$X_i \sim N(\mu, \sigma^2)$, $\sigma$ unknown | $W=\frac{\overline{X}-\mu_0}{S / \sqrt{n}}$ | $|W| \leq t_{\frac{\alpha}{2},n-1}$
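The first case above (normal data, $\sigma$ known) can be sketched in Python with the standard library; the function name and the numbers in the usage line are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_z_test(xbar, mu0, sigma, n, alpha=0.05):
    """Two-sided test of H0: mu = mu0 when sigma is known.

    Returns (W, c, accept_H0) following the rule: accept H0 iff |W| <= z_{alpha/2}.
    """
    w = (xbar - mu0) / (sigma / sqrt(n))
    c = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}, e.g. 1.96 for alpha = 0.05
    return w, c, abs(w) <= c

# Hypothetical numbers for illustration:
w, c, accept = two_sided_z_test(xbar=52.0, mu0=50.0, sigma=10.0, n=25, alpha=0.05)
print(round(c, 2))   # 1.96
print(round(w, 2))   # 1.0
print(accept)        # True: |1.0| <= 1.96, so accept H0
```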

One-sided Tests for the Mean:

  • As before, we define the test statistic as \begin{align}%\label{} W(X_1,X_2, \cdots,X_n)=\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}. \end{align} If $H_0$ is true (i.e., $\mu \leq \mu_0$), we expect $\overline{X}$ (and thus $W$) to be relatively small, while if $H_1$ is true, we expect $\overline{X}$ (and thus $W$) to be larger. This suggests the following test: Choose a threshold, and call it $c$. If $W \leq c$, accept $H_0$, and if $W>c$, accept $H_1$. How do we choose $c$? If $\alpha$ is the required significance level, we must have \begin{align} P(\textrm{type I error}) &= P(\textrm{Reject }H_0 \; | \; H_0) \\ &= P(W > c \; | \; \mu \leq \mu_0) \leq \alpha. \end{align} Here, the probability of type I error depends on $\mu$. More specifically, for any $\mu \leq \mu_0$, we can write \begin{align} P(\textrm{type I error} \; | \; \mu) &= P(\textrm{Reject }H_0 \; | \; \mu) \\ &= P(W > c \; | \; \mu)\\ &=P \left(\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}> c \; | \; \mu\right)\\ &=P \left(\frac{\overline{X}-\mu}{\sigma / \sqrt{n}}+\frac{\mu-\mu_0}{\sigma / \sqrt{n}}> c \; | \; \mu\right)\\ &=P \left(\frac{\overline{X}-\mu}{\sigma / \sqrt{n}}> c+\frac{\mu_0-\mu}{\sigma / \sqrt{n}} \; | \; \mu\right)\\ &\leq P \left(\frac{\overline{X}-\mu}{\sigma / \sqrt{n}}> c \; | \; \mu\right) \quad (\textrm{ since }\mu \leq \mu_0)\\ &=1-\Phi(c) \quad \big(\textrm{ since given }\mu, \frac{\overline{X}-\mu}{\sigma / \sqrt{n}} \sim N(0,1) \big). \end{align} Thus, we can choose $\alpha=1-\Phi(c)$, which results in \begin{align} c=z_{\alpha}. \end{align} Therefore, we accept $H_0$ if \begin{align} \frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}} \leq z_{\alpha}, \end{align} and reject it otherwise.
Case | Test Statistic | Acceptance Region
$X_i \sim N(\mu, \sigma^2)$, $\sigma$ known | $W=\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}$ | $W \leq z_{\alpha}$
$n$ large, $X_i$ non-normal | $W=\frac{\overline{X}-\mu_0}{S / \sqrt{n}}$ | $W \leq z_{\alpha}$
$X_i \sim N(\mu, \sigma^2)$, $\sigma$ unknown | $W=\frac{\overline{X}-\mu_0}{S / \sqrt{n}}$ | $W \leq t_{\alpha,n-1}$
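The one-sided rule ($H_0$: $\mu \leq \mu_0$ vs. $H_1$: $\mu > \mu_0$, $\sigma$ known) can likewise be sketched; the function name and usage numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def upper_tailed_z_test(xbar, mu0, sigma, n, alpha=0.05):
    """Test H0: mu <= mu0 vs H1: mu > mu0 with sigma known.

    Accept H0 when W <= z_alpha.
    """
    w = (xbar - mu0) / (sigma / sqrt(n))
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # e.g. 1.645 for alpha = 0.05
    return w, z_alpha, w <= z_alpha

# Hypothetical illustration:
w, z_a, accept = upper_tailed_z_test(xbar=53.0, mu0=50.0, sigma=10.0, n=36, alpha=0.05)
print(round(z_a, 3))   # 1.645
print(round(w, 2))     # 1.8
print(accept)          # False -> reject H0 in favor of mu > mu0
```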

$\quad$ $H_0$: $\mu \geq \mu_0$, $\quad$ $H_1$: $\mu \lt \mu_0$,

Case | Test Statistic | Acceptance Region
$X_i \sim N(\mu, \sigma^2)$, $\sigma$ known | $W=\frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}$ | $W \geq -z_{\alpha}$
$n$ large, $X_i$ non-normal | $W=\frac{\overline{X}-\mu_0}{S / \sqrt{n}}$ | $W \geq -z_{\alpha}$
$X_i \sim N(\mu, \sigma^2)$, $\sigma$ unknown | $W=\frac{\overline{X}-\mu_0}{S / \sqrt{n}}$ | $W \geq -t_{\alpha,n-1}$
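The type II error probability $\beta(\mu)$ derived earlier for the two-sided $z$ test, \begin{align} \beta(\mu)=\Phi\left(z_{\frac{\alpha}{2}}+\frac{\mu_0-\mu}{\sigma / \sqrt{n}}\right)-\Phi\left(-z_{\frac{\alpha}{2}}+\frac{\mu_0-\mu}{\sigma / \sqrt{n}}\right), \end{align} can be evaluated directly. A Python sketch with hypothetical $\mu_0$, $\sigma$ and $n$:

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu, mu0, sigma, n, alpha=0.05):
    """beta(mu) for the two-sided z test, per the formula derived above."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)            # z_{alpha/2}
    shift = (mu0 - mu) / (sigma / sqrt(n))   # (mu0 - mu) / (sigma / sqrt(n))
    return nd.cdf(z + shift) - nd.cdf(-z + shift)

# Sanity checks (hypothetical sigma and n):
print(round(type_ii_error(mu=50.0, mu0=50.0, sigma=10.0, n=25), 3))  # 0.95 = 1 - alpha at mu = mu0
print(type_ii_error(mu=60.0, mu0=50.0, sigma=10.0, n=25) < 0.01)     # power is high far from mu0
```

At $\mu = \mu_0$ the test accepts $H_0$ with probability $1-\alpha$, and $\beta(\mu)$ shrinks as $\mu$ moves away from $\mu_0$, as the formula predicts.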



Hypothesis Testing Calculator

[Interactive calculator inputs (hypotheses, $n$, $\bar{x}$, $\sigma$, $\alpha$, and the Type II error panel) appear here in the online version.]

The first step in hypothesis testing is to calculate the test statistic. The formula for the test statistic depends on whether the population standard deviation (σ) is known or unknown. If σ is known, our hypothesis test is known as a z test and we use the z distribution. If σ is unknown, our hypothesis test is known as a t test and we use the t distribution. Use of the t distribution relies on the degrees of freedom, which is equal to the sample size minus one. Furthermore, if the population standard deviation σ is unknown, the sample standard deviation s is used instead. To switch from σ known to σ unknown, click on $\boxed{\sigma}$ and select $\boxed{s}$ in the Hypothesis Testing Calculator.

 | $\sigma$ Known | $\sigma$ Unknown
Test Statistic | $ z = \dfrac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} $ | $ t = \dfrac{\bar{x}-\mu_0}{s/\sqrt{n}} $

Next, the test statistic is used to conduct the test using either the p-value approach or the critical value approach. The particular steps taken in each approach largely depend on the form of the hypothesis test: lower tail, upper tail or two-tailed. The form can easily be identified by looking at the alternative hypothesis ($H_a$). If there is a less-than sign in the alternative hypothesis, it is a lower tail test; a greater-than sign indicates an upper tail test; and a not-equal sign indicates a two-tailed test. To switch from a lower tail test to an upper tail or two-tailed test, click on $\boxed{\geq}$ and select $\boxed{\leq}$ or $\boxed{=}$, respectively.

Lower Tail Test | Upper Tail Test | Two-Tailed Test
$H_0 \colon \mu \geq \mu_0$ | $H_0 \colon \mu \leq \mu_0$ | $H_0 \colon \mu = \mu_0$
$H_a \colon \mu \lt \mu_0$ | $H_a \colon \mu \gt \mu_0$ | $H_a \colon \mu \neq \mu_0$

In the p-value approach, the test statistic is used to calculate a p-value. If the test is a lower tail test, the p-value is the probability of getting a value for the test statistic at least as small as the value from the sample. If the test is an upper tail test, the p-value is the probability of getting a value for the test statistic at least as large as the value from the sample. In a two-tailed test, the p-value is the probability of getting a value for the test statistic at least as unlikely as the value from the sample.

To test the hypothesis in the p-value approach, compare the p-value to the level of significance. If the p-value is less than or equal to the level of significance, reject the null hypothesis. If the p-value is greater than the level of significance, do not reject the null hypothesis. This method remains unchanged regardless of whether it's a lower tail, upper tail or two-tailed test. To change the level of significance, click on $\boxed{.05}$. Note that if the test statistic is given, you can calculate the p-value from the test statistic by clicking on the switch symbol twice.
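The p-value definitions above can be sketched in Python for a $z$ test; the function name is made up, and the $z = 2.38$ input is illustrative.

```python
from statistics import NormalDist

def p_value(z, tail):
    """p-value of a z test statistic for a 'lower', 'upper' or 'two' tailed test."""
    nd = NormalDist()
    if tail == "lower":
        return nd.cdf(z)               # P(Z <= z): at least as small
    if tail == "upper":
        return 1 - nd.cdf(z)           # P(Z >= z): at least as large
    return 2 * (1 - nd.cdf(abs(z)))    # two-tailed: at least as extreme in either direction

alpha = 0.05
z = 2.38                               # illustrative test statistic
p = p_value(z, "upper")
print(round(p, 4))                     # 0.0087
print(p <= alpha)                      # True -> reject H0
```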

In the critical value approach, the level of significance ($\alpha$) is used to calculate the critical value. In a lower tail test, the critical value is the value of the test statistic providing an area of $\alpha$ in the lower tail of the sampling distribution of the test statistic. In an upper tail test, the critical value is the value of the test statistic providing an area of $\alpha$ in the upper tail of the sampling distribution of the test statistic. In a two-tailed test, the critical values are the values of the test statistic providing areas of $\alpha / 2$ in the lower and upper tail of the sampling distribution of the test statistic.

To test the hypothesis in the critical value approach, compare the critical value to the test statistic. Unlike the p-value approach, the method we use to decide whether to reject the null hypothesis depends on the form of the hypothesis test. In a lower tail test, if the test statistic is less than or equal to the critical value, reject the null hypothesis. In an upper tail test, if the test statistic is greater than or equal to the critical value, reject the null hypothesis. In a two-tailed test, if the test statistic is less than or equal to the lower critical value or greater than or equal to the upper critical value, reject the null hypothesis.

Lower Tail Test | Upper Tail Test | Two-Tailed Test
If $z \leq -z_\alpha$, reject $H_0$. | If $z \geq z_\alpha$, reject $H_0$. | If $z \leq -z_{\alpha/2}$ or $z \geq z_{\alpha/2}$, reject $H_0$.
If $t \leq -t_\alpha$, reject $H_0$. | If $t \geq t_\alpha$, reject $H_0$. | If $t \leq -t_{\alpha/2}$ or $t \geq t_{\alpha/2}$, reject $H_0$.
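The $z$-row of these decision rules can be sketched as one Python function (the function name and example inputs are illustrative):

```python
from statistics import NormalDist

def reject_h0(z, alpha, tail):
    """Critical-value decision rule for a z test: 'lower', 'upper' or 'two' tailed."""
    nd = NormalDist()
    if tail == "lower":
        return z <= -nd.inv_cdf(1 - alpha)   # reject if z <= -z_alpha
    if tail == "upper":
        return z >= nd.inv_cdf(1 - alpha)    # reject if z >= z_alpha
    z_half = nd.inv_cdf(1 - alpha / 2)       # z_{alpha/2}
    return z <= -z_half or z >= z_half       # reject in either tail

print(reject_h0(-2.0, 0.05, "lower"))   # True  (-2.0 <= -1.645)
print(reject_h0(1.5, 0.05, "upper"))    # False (1.5 < 1.645)
print(reject_h0(2.1, 0.05, "two"))      # True  (2.1 >= 1.96)
```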

When conducting a hypothesis test, there is always a chance that you come to the wrong conclusion. There are two types of errors you can make: Type I Error and Type II Error. A Type I Error is committed if you reject the null hypothesis when the null hypothesis is true. Ideally, we'd like to accept the null hypothesis when the null hypothesis is true. A Type II Error is committed if you accept the null hypothesis when the alternative hypothesis is true. Ideally, we'd like to reject the null hypothesis when the alternative hypothesis is true.

Conclusion | $H_0$ True (Condition) | $H_a$ True (Condition)
Accept $H_0$ | Correct | Type II Error
Reject $H_0$ | Type I Error | Correct

Hypothesis testing is closely related to the statistical area of confidence intervals. If the hypothesized value of the population mean is outside of the confidence interval, we can reject the null hypothesis. Confidence intervals can be found using the Confidence Interval Calculator. The calculator on this page does hypothesis tests for one population mean. Sometimes we're interested in hypothesis tests about two population means. These can be solved using the Two Population Calculator. The probability of a Type II Error can be calculated by clicking on the link at the bottom of the page.

Hypothesis Testing (cont...)

Hypothesis testing: the null and alternative hypothesis

In order to undertake hypothesis testing you need to express your research hypothesis as a null and alternative hypothesis. The null hypothesis and alternative hypothesis are statements regarding the differences or effects that occur in the population. You will use your sample to test which statement (i.e., the null hypothesis or alternative hypothesis) is most likely (although technically, you test the evidence against the null hypothesis). So, with respect to our teaching example, the null and alternative hypothesis will reflect statements about all statistics students on graduate management courses.

The null hypothesis is essentially the "devil's advocate" position. That is, it assumes that whatever you are trying to prove did not happen (hint: it usually states that something equals zero). For example, the two different teaching methods did not result in different exam performances (i.e., zero difference). Another example might be that there is no relationship between anxiety and athletic performance (i.e., the slope is zero). The alternative hypothesis states the opposite and is usually the hypothesis you are trying to prove (e.g., the two different teaching methods did result in different exam performances). Initially, you can state these hypotheses in more general terms (e.g., using terms like "effect", "relationship", etc.), as shown below for the teaching methods example:

Null Hypothesis (H₀): Undertaking seminar classes has no effect on students' performance.
Alternative Hypothesis (Hₐ): Undertaking seminar classes has a positive effect on students' performance.

Depending on how you want to "summarize" the exam performances will determine how you might want to write a more specific null and alternative hypothesis. For example, you could compare the mean exam performance of each group (i.e., the "seminar" group and the "lectures-only" group). This is what we will demonstrate here, but other options include comparing the distributions , medians , amongst other things. As such, we can state:

Null Hypothesis (H₀): The mean exam mark for the "seminar" and "lecture-only" teaching methods is the same in the population.
Alternative Hypothesis (Hₐ): The mean exam mark for the "seminar" and "lecture-only" teaching methods is not the same in the population.

Now that you have identified the null and alternative hypotheses, you need to find evidence and develop a strategy for declaring your "support" for either the null or alternative hypothesis. We can do this using some statistical theory and some arbitrary cut-off points. Both these issues are dealt with next.

Significance levels

The level of statistical significance is often expressed as the so-called p -value . Depending on the statistical test you have chosen, you will calculate a probability (i.e., the p -value) of observing your sample results (or more extreme) given that the null hypothesis is true . Another way of phrasing this is to consider the probability that a difference in a mean score (or other statistic) could have arisen based on the assumption that there really is no difference. Let us consider this statement with respect to our example where we are interested in the difference in mean exam performance between two different teaching methods. If there really is no difference between the two teaching methods in the population (i.e., given that the null hypothesis is true), how likely would it be to see a difference in the mean exam performance between the two teaching methods as large as (or larger than) that which has been observed in your sample?

So, you might get a p-value such as 0.03 (i.e., p = .03). This means that there is a 3% chance of finding a difference as large as (or larger than) the one in your study given that the null hypothesis is true. However, you want to know whether this is "statistically significant". Typically, if there was a 5% or less chance (5 times in 100 or less) that the difference in the mean exam performance between the two teaching methods (or whatever statistic you are using) is as different as observed given the null hypothesis is true, you would reject the null hypothesis and accept the alternative hypothesis. Alternately, if the chance was greater than 5% (5 times in 100 or more), you would fail to reject the null hypothesis and would not accept the alternative hypothesis. As such, in this example where p = .03, we would reject the null hypothesis and accept the alternative hypothesis. We reject it because a result this extreme would occur only 3% of the time by chance alone (i.e., less than the 5% threshold), which is too rare for us to attribute the observed difference in exam performance to chance rather than to the two teaching methods.

Whilst there is relatively little justification why a significance level of 0.05 is used rather than 0.01 or 0.10, for example, it is widely used in academic research. However, if you want to be particularly confident in your results, you can set a more stringent level of 0.01 (a 1% chance or less; 1 in 100 chance or less).


One- and two-tailed predictions

When considering whether we reject the null hypothesis and accept the alternative hypothesis, we need to consider the direction of the alternative hypothesis statement. For example, the alternative hypothesis that was stated earlier is:

Alternative Hypothesis (Hₐ): Undertaking seminar classes has a positive effect on students' performance.

The alternative hypothesis tells us two things. First, what predictions did we make about the effect of the independent variable(s) on the dependent variable(s)? Second, what was the predicted direction of this effect? Let's use our example to highlight these two points.

Sarah predicted that her teaching method (independent variable: teaching method), whereby she not only required her students to attend lectures, but also seminars, would have a positive effect on (that is, increase) students' performance (dependent variable: exam marks). If an alternative hypothesis has a direction (and this is how you want to test it), the hypothesis is one-tailed; that is, it predicts the direction of the effect. If the alternative hypothesis had stated that the effect was expected to be negative, this would also be a one-tailed hypothesis.

Alternatively, a two-tailed prediction means that we do not make a choice over the direction that the effect of the experiment takes. Rather, it simply implies that the effect could be negative or positive. If Sarah had made a two-tailed prediction, the alternative hypothesis might have been:

Alternative Hypothesis (Hₐ): Undertaking seminar classes has an effect on students' performance.

In other words, we simply take out the word "positive", which implies the direction of our effect. In our example, making a two-tailed prediction may seem strange. After all, it would be logical to expect that "extra" tuition (going to seminar classes as well as lectures) would either have a positive effect on students' performance or no effect at all, but certainly not a negative effect. However, this is just our opinion (and hope) and certainly does not mean that we will get the effect we expect. Generally speaking, making a one-tailed prediction (i.e., and testing for it this way) is frowned upon, as it usually reflects the hope of a researcher rather than any certainty that it will happen. Notable exceptions to this rule are when there is only one possible way in which a change could occur. This can happen, for example, when biological activity/presence is measured. That is, a protein might be "dormant" and the stimulus you are using can only possibly "wake it up" (i.e., it cannot possibly reduce the activity of a "dormant" protein). In addition, for some statistical tests, one-tailed tests are not possible.

Rejecting or failing to reject the null hypothesis

Let's return finally to the question of whether we reject or fail to reject the null hypothesis.

If our statistical analysis shows that the significance level is below the cut-off value we have set (e.g., either 0.05 or 0.01), we reject the null hypothesis and accept the alternative hypothesis. Alternatively, if the significance level is above the cut-off value, we fail to reject the null hypothesis and cannot accept the alternative hypothesis. You should note that you cannot accept the null hypothesis, but only find evidence against it.

Hypothesis Testing for Means & Proportions

Lisa Sullivan, PhD

Professor of Biostatistics

Boston University School of Public Health


Introduction

This is the first of three modules that address the second area of statistical inference, hypothesis testing, in which a specific statement or hypothesis is generated about a population parameter, and sample statistics are used to assess the likelihood that the hypothesis is true. The hypothesis is based on available information and the investigator's belief about the population parameters. The process of hypothesis testing involves setting up two competing hypotheses, the null hypothesis and the alternative hypothesis. One selects a random sample (or multiple samples when there are more comparison groups), computes summary statistics and then assesses the likelihood that the sample data support the research or alternative hypothesis. Similar to estimation, the process of hypothesis testing is based on probability theory and the Central Limit Theorem.

This module will focus on hypothesis testing for means and proportions. The next two modules in this series will address analysis of variance and chi-squared tests. 

Learning Objectives

After completing this module, the student will be able to:

  • Define null and research hypothesis, test statistic, level of significance and decision rule
  • Distinguish between Type I and Type II errors and discuss the implications of each
  • Explain the difference between one and two sided tests of hypothesis
  • Estimate and interpret p-values
  • Explain the relationship between confidence interval estimates and p-values in drawing inferences
  • Differentiate hypothesis testing procedures based on type of outcome variable and number of samples

Introduction to Hypothesis Testing


The techniques for hypothesis testing depend on

  • the type of outcome variable being analyzed (continuous, dichotomous, discrete)
  • the number of comparison groups in the investigation
  • whether the comparison groups are independent (i.e., physically separate such as men versus women) or dependent (i.e., matched or paired such as pre- and post-assessments on the same participants).

In estimation we focused explicitly on techniques for one and two samples and discussed estimation for a specific parameter (e.g., the mean or proportion of a population), for differences (e.g., difference in means, the risk difference) and ratios (e.g., the relative risk and odds ratio). Here we will focus on procedures for one and two samples when the outcome is either continuous (and we focus on means) or dichotomous (and we focus on proportions).

General Approach: A Simple Example

The Centers for Disease Control (CDC) reported on trends in weight, height and body mass index from the 1960s through 2002.¹ The general trend was that Americans were much heavier and slightly taller in 2002 as compared to 1960; both men and women gained approximately 24 pounds, on average, between 1960 and 2002. In 2002, the mean weight for men was reported at 191 pounds. Suppose that an investigator hypothesizes that weights are even higher in 2006 (i.e., that the trend continued over the subsequent 4 years). The research hypothesis is that the mean weight of men in 2006 is more than 191 pounds. The null hypothesis is that there is no change in weight, and therefore the mean weight is still 191 pounds in 2006.

Null Hypothesis

H₀: μ = 191 (no change)

Research Hypothesis

H₁: μ > 191 (investigator's belief)

In order to test the hypotheses, we select a random sample of American males in 2006 and measure their weights. Suppose we have resources available to recruit n=100 men into our sample. We weigh each participant and compute summary statistics on the sample data. Suppose in the sample we determine the following: n = 100, x̄ = 197.1 pounds, s = 25.6 pounds.

Do the sample data support the null or research hypothesis? The sample mean of 197.1 is numerically higher than 191. However, is this difference more than would be expected by chance? In hypothesis testing, we assume that the null hypothesis holds until proven otherwise. We therefore need to determine the likelihood of observing a sample mean of 197.1 or higher when the true population mean is 191 (i.e., if the null hypothesis is true or under the null hypothesis). We can compute this probability using the Central Limit Theorem. Specifically, P(X̄ ≥ 197.1) = P(Z ≥ (197.1 - 191)/(25.6/√100)) = P(Z ≥ 2.38) = 0.0087.

Notice that we use the sample standard deviation in computing the Z score. This is generally an appropriate substitution as long as the sample size is large (n > 30). Thus, there is less than a 1% probability of observing a sample mean as large as 197.1 when the true population mean is 191. Do you think that the null hypothesis is likely true? Based on how unlikely it is to observe a sample mean of 197.1 under the null hypothesis (i.e., <1% probability), we might infer, from our data, that the null hypothesis is probably not true.

Suppose that the sample data had turned out differently. Suppose that we instead observed the following in 2006: n = 100, x̄ = 192.1 pounds, s = 25.6 pounds.

How likely is it to observe a sample mean of 192.1 or higher when the true population mean is 191 (i.e., if the null hypothesis is true)? We can again compute this probability using the Central Limit Theorem. Specifically, P(X̄ ≥ 192.1) = P(Z ≥ (192.1 - 191)/(25.6/√100)) = P(Z ≥ 0.43) = 0.334.

There is a 33.4% probability of observing a sample mean as large as 192.1 when the true population mean is 191. Do you think that the null hypothesis is likely true?  

Neither of the sample means that we obtained allows us to know with certainty whether the null hypothesis is true or not. However, our computations suggest that, if the null hypothesis were true, the probability of observing a sample mean >197.1 is less than 1%. In contrast, if the null hypothesis were true, the probability of observing a sample mean >192.1 is about 33%. We can't know whether the null hypothesis is true, but the sample that provided a mean value of 197.1 provides much stronger evidence in favor of rejecting the null hypothesis, than the sample that provided a mean value of 192.1. Note that this does not mean that a sample mean of 192.1 indicates that the null hypothesis is true; it just doesn't provide compelling evidence to reject it.

In essence, hypothesis testing is a procedure to compute a probability that reflects the strength of the evidence (based on a given sample) for rejecting the null hypothesis. In hypothesis testing, we determine a threshold or cut-off point (called the critical value) to decide when to believe the null hypothesis and when to believe the research hypothesis. It is important to note that it is possible to observe any sample mean when the null hypothesis is true (in this example, when the true mean equals 191), but some sample means are very unlikely. Based on the two samples above, it would seem reasonable to believe the research hypothesis when x̄ = 197.1, but to believe the null hypothesis when x̄ = 192.1. What we need is a threshold value such that if x̄ is above that threshold then we believe that H₁ is true, and if x̄ is below that threshold then we believe that H₀ is true. The difficulty in determining a threshold for x̄ is that it depends on the scale of measurement. In this example, the threshold, sometimes called the critical value, might be 195 (i.e., if the sample mean is 195 or more then we believe that H₁ is true, and if the sample mean is less than 195 then we believe that H₀ is true). If we were instead interested in assessing an increase in blood pressure over time, the critical value would be different, because blood pressures are measured in millimeters of mercury (mmHg) as opposed to in pounds. In the following we will explain how the critical value is determined and how we handle the issue of scale.

First, to address the issue of scale in determining the critical value, we convert our sample data (in particular the sample mean) into a Z score. We know from the module on probability that the center of the Z distribution is zero and extreme values are those that exceed 2 or fall below -2. Z scores above 2 and below -2 represent approximately 5% of all Z values. If the observed sample mean is close to the mean specified in H₀ (here μ₀ = 191), then Z will be close to zero. If the observed sample mean is much larger than the mean specified in H₀, then Z will be large.

In hypothesis testing, we select a critical value from the Z distribution. This is done by first determining what is called the level of significance, denoted α ("alpha"). What we are doing here is drawing a line at extreme values. The level of significance is the probability that we reject the null hypothesis (in favor of the alternative) when it is actually true and is also called the Type I error rate.

α = Level of significance = P(Type I error) = P(Reject H 0 | H 0 is true).

Because α is a probability, it ranges between 0 and 1. The most commonly used value in the medical literature for α is 0.05, or 5%. Thus, if an investigator selects α=0.05, then they are allowing a 5% probability of incorrectly rejecting the null hypothesis in favor of the alternative when the null is in fact true. Depending on the circumstances, one might choose to use a level of significance of 1% or 10%. For example, if an investigator wanted to reject the null only if there were even stronger evidence than that ensured with α=0.05, they could choose α = 0.01 as their level of significance. The typical values for α are 0.01, 0.05 and 0.10, with α=0.05 the most commonly used value.
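The interpretation of α as the long-run rate of incorrectly rejecting a true H₀ can be checked by simulation. A sketch assuming a hypothetical population in which H₀ holds exactly (μ = 191, with an assumed σ = 25 treated as known):

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(1)
population = NormalDist(mu=191, sigma=25)   # hypothetical population: H0 (mu = 191) is exactly true
alpha, n, trials = 0.05, 100, 10000
z_alpha = NormalDist().inv_cdf(1 - alpha)   # upper-tail critical value, ~1.645

rejections = 0
for _ in range(trials):
    sample = population.samples(n)                # draw a random sample of n = 100 weights
    z = (mean(sample) - 191) / (25 / sqrt(n))     # z statistic, sigma treated as known
    if z >= z_alpha:                              # upper-tailed decision rule
        rejections += 1

print(rejections / trials)   # the Type I error rate, close to 0.05 by construction
```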

Suppose in our weight study we select α=0.05. We need to determine the value of Z that holds 5% of the values above it (see below).

[Figure: standard normal distribution curve, showing the upper-tail area α = 0.05 above z = 1.645.]

The critical value of Z for α =0.05 is Z = 1.645 (i.e., 5% of the distribution is above Z=1.645). With this value we can set up what is called our decision rule for the test. The rule is to reject H 0 if the Z score is 1.645 or more.  

With the first sample we have Z = (197.1 - 191)/(25.6/√100) = 2.38.

Because 2.38 > 1.645, we reject the null hypothesis. (The same conclusion can be drawn by comparing the 0.0087 probability of observing a sample mean as extreme as 197.1 to the level of significance of 0.05. If the observed probability is smaller than the level of significance we reject H 0 ). Because the Z score exceeds the critical value, we conclude that the mean weight for men in 2006 is more than 191 pounds, the value reported in 2002. If we observed the second sample (i.e., sample mean =192.1), we would not be able to reject the null hypothesis because the Z score is 0.43 which is not in the rejection region (i.e., the region in the tail end of the curve above 1.645). With the second sample we do not have sufficient evidence (because we set our level of significance at 5%) to conclude that weights have increased. Again, the same conclusion can be reached by comparing probabilities. The probability of observing a sample mean as extreme as 192.1 is 33.4% which is not below our 5% level of significance.
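The two weight-sample calculations can be reproduced in Python, assuming a sample standard deviation of s = 25.6, which is consistent with the Z scores (2.38 and 0.43) reported above:

```python
from math import sqrt
from statistics import NormalDist

mu0, n = 191, 100
s = 25.6   # assumed sample standard deviation, back-solved from the reported Z scores
nd = NormalDist()

results = []
for xbar in (197.1, 192.1):
    z = (xbar - mu0) / (s / sqrt(n))
    p = 1 - nd.cdf(z)                         # upper-tailed p-value
    results.append((round(z, 2), p, z >= 1.645))  # decision rule at alpha = 0.05

print(results[0][0], results[1][0])   # 2.38 0.43
print(results[0][2], results[1][2])   # True False -> reject H0 for the first sample only
```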

Hypothesis Testing: Upper-, Lower-, and Two-Tailed Tests

The procedure for hypothesis testing is based on the ideas described above. Specifically, we set up competing hypotheses, select a random sample from the population of interest and compute summary statistics. We then determine whether the sample data supports the null or alternative hypotheses. The procedure can be broken down into the following five steps.  

  • Step 1. Set up hypotheses and select the level of significance α.

H₀: Null hypothesis (no change, no difference);

H₁: Research hypothesis (investigator's belief); α=0.05

 

Upper-tailed, Lower-tailed, Two-tailed Tests

The research or alternative hypothesis can take one of three forms. An investigator might believe that the parameter has increased, decreased or changed. For example, an investigator might hypothesize:  

H₁: μ > μ₀, where μ₀ is the comparator or null value (e.g., μ₀=191 in our example about weight in men in 2006) and an increase is hypothesized; this type of test is called an upper-tailed test. H₁: μ < μ₀, where a decrease is hypothesized; this is called a lower-tailed test. H₁: μ ≠ μ₀, where a difference is hypothesized; this is called a two-tailed test.

The exact form of the research hypothesis depends on the investigator's belief about the parameter of interest and whether it has possibly increased, decreased or is different from the null value. The research hypothesis is set up by the investigator before any data are collected.

 

  • Step 2. Select the appropriate test statistic.  

The test statistic is a single number that summarizes the sample information. An example of a test statistic is the Z statistic computed as follows:

Z = (x̄ - μ₀) / (s/√n)

When the sample size is small, we will use t statistics (just as we did when constructing confidence intervals for small samples). As we present each scenario, alternative test statistics are provided along with conditions for their appropriate use.
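For the large-sample case, the Z statistic can be sketched in a few lines of Python; the numeric inputs below are hypothetical illustrations, not values from the text:

```python
def one_sample_z(xbar, mu0, s, n):
    """Large-sample (n > 30) Z statistic for testing H0: mu = mu0."""
    return (xbar - mu0) / (s / n ** 0.5)

# Hypothetical data: sample mean 52, null value 50, s = 8, n = 64
z = one_sample_z(52, 50, 8, 64)  # (52 - 50) / (8 / 8) = 2.0
```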

  • Step 3.  Set up decision rule.  

The decision rule is a statement that tells under what circumstances to reject the null hypothesis. The decision rule is based on specific values of the test statistic (e.g., reject H₀ if Z > 1.645). The decision rule for a specific test depends on three factors: the research or alternative hypothesis, the test statistic, and the level of significance. Each is discussed below.

  • The decision rule depends on whether an upper-tailed, lower-tailed, or two-tailed test is proposed. In an upper-tailed test the decision rule has investigators reject H₀ if the test statistic is larger than the critical value. In a lower-tailed test the decision rule has investigators reject H₀ if the test statistic is smaller than the critical value. In a two-tailed test the decision rule has investigators reject H₀ if the test statistic is extreme, either larger than an upper critical value or smaller than a lower critical value.
  • The exact form of the test statistic is also important in determining the decision rule. If the test statistic follows the standard normal distribution (Z), then the decision rule will be based on the standard normal distribution. If the test statistic follows the t distribution, then the decision rule will be based on the t distribution, with the appropriate critical value selected again depending on the specific alternative hypothesis and the level of significance.
  • The third factor is the level of significance. The level of significance selected in Step 1 (e.g., α=0.05) dictates the critical value. For example, in an upper-tailed Z test with α=0.05, the critical value is Z=1.645.

The following figures illustrate the rejection regions defined by the decision rule for upper-, lower- and two-tailed Z tests with α=0.05. Notice that the rejection regions are in the upper, lower and both tails of the curves, respectively. The decision rules are written below each figure.

Rejection Region for Upper-Tailed Z Test (H₁: μ > μ₀) with α=0.05

The decision rule is: Reject H₀ if Z > 1.645.

 

 

α         Critical Value of Z (Upper-Tailed Test)
0.10      1.282
0.05      1.645
0.025     1.960
0.010     2.326
0.005     2.576
0.001     3.090
0.0001    3.719

[Figure: Standard normal distribution curve showing the lower-tail rejection region below Z=-1.645 for α=0.05]

Rejection Region for Lower-Tailed Z Test (H₁: μ < μ₀) with α=0.05

The decision rule is: Reject H₀ if Z < -1.645.

α         Critical Value of Z (Lower-Tailed Test)
0.10      -1.282
0.05      -1.645
0.025     -1.960
0.010     -2.326
0.005     -2.576
0.001     -3.090
0.0001    -3.719

[Figure: Standard normal distribution curve showing two-tailed rejection regions at ±1.960 for α=0.05]

Rejection Region for Two-Tailed Z Test (H₁: μ ≠ μ₀) with α=0.05

The decision rule is: Reject H₀ if Z < -1.960 or if Z > 1.960.

α         Critical Value of Z (Two-Tailed Test, ±)
0.20      1.282
0.10      1.645
0.05      1.960
0.010     2.576
0.001     3.291
0.0001    3.819

The complete table of critical values of Z for upper, lower and two-tailed tests can be found in the table of Z values to the right in "Other Resources."

Critical values of t for upper, lower and two-tailed tests can be found in the table of t values in "Other Resources."
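For Z, the critical values in the tables above can be reproduced from the standard normal inverse CDF. A minimal Python sketch using only the standard library:

```python
from statistics import NormalDist

std = NormalDist()  # standard normal distribution

alpha = 0.05
upper_crit = std.inv_cdf(1 - alpha)           # upper-tailed: reject if Z > 1.645
lower_crit = std.inv_cdf(alpha)               # lower-tailed: reject if Z < -1.645
two_tailed_crit = std.inv_cdf(1 - alpha / 2)  # two-tailed: reject if |Z| > 1.960
```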

  • Step 4. Compute the test statistic.  

Here we compute the test statistic by substituting the observed sample data into the test statistic identified in Step 2.

  • Step 5. Conclusion.  

The final conclusion is made by comparing the test statistic (which is a summary of the information observed in the sample) to the decision rule. The final conclusion will be either to reject the null hypothesis (because the sample data are very unlikely if the null hypothesis is true) or not to reject the null hypothesis (because the sample data are not very unlikely).  

If the null hypothesis is rejected, then an exact significance level is computed to describe the likelihood of observing the sample data assuming that the null hypothesis is true. The exact level of significance is called the p-value, and it will be less than the chosen level of significance if we reject H₀.

Statistical computing packages provide exact p-values as part of their standard output for hypothesis tests. In fact, when using a statistical computing package, the steps outlined above can be abbreviated. The hypotheses (Step 1) should always be set up in advance of any analysis and the significance criterion should also be determined (e.g., α=0.05). Statistical computing packages will produce the test statistic (usually reporting the test statistic as t) and a p-value. The investigator can then determine statistical significance using the following: If p < α, then reject H₀.

 

 

  • Step 1. Set up hypotheses and determine level of significance

H₀: μ = 191   H₁: μ > 191   α=0.05

The research hypothesis is that weights have increased, and therefore an upper tailed test is used.

  • Step 2. Select the appropriate test statistic.

Because the sample size is large (n > 30), the appropriate test statistic is

Z = (x̄ - μ₀) / (s/√n)

  • Step 3. Set up decision rule.  

In this example, we are performing an upper-tailed test (H₁: μ > 191), with a Z test statistic and selected α=0.05. Reject H₀ if Z > 1.645.

We now substitute the sample data into the formula for the test statistic identified in Step 2.  

We reject H₀ because 2.38 > 1.645. We have statistically significant evidence at α=0.05 to show that the mean weight in men in 2006 is more than 191 pounds. Because we rejected the null hypothesis, we now approximate the p-value, which is the likelihood of observing the sample data if the null hypothesis is true. An alternative definition of the p-value is the smallest level of significance where we can still reject H₀. In this example, we observed Z=2.38 and for α=0.05, the critical value was 1.645. Because 2.38 exceeded 1.645 we rejected H₀. In our conclusion we reported a statistically significant increase in mean weight at a 5% level of significance. Using the table of critical values for upper-tailed tests, we can approximate the p-value. If we select α=0.025, the critical value is 1.960, and we still reject H₀ because 2.38 > 1.960. If we select α=0.010, the critical value is 2.326, and we still reject H₀ because 2.38 > 2.326. However, if we select α=0.005, the critical value is 2.576, and we cannot reject H₀ because 2.38 < 2.576. Therefore, the smallest α where we still reject H₀ is 0.010. This is the p-value. A statistical computing package would produce a more precise p-value, between 0.005 and 0.010. Here we are approximating the p-value and would report p < 0.010.
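The table-based approximation can be checked directly: the exact upper-tailed p-value for the observed Z = 2.38 follows from the standard normal CDF (Python standard library):

```python
from statistics import NormalDist

z = 2.38  # observed test statistic from the weight example
p_value = 1 - NormalDist().cdf(z)  # exact upper-tailed p-value, about 0.0087
# Consistent with the table-based approximation: 0.005 < p < 0.010
```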

Type I and Type II Errors

In all tests of hypothesis, there are two types of errors that can be committed. The first is called a Type I error and refers to the situation where we incorrectly reject H₀ when in fact it is true. This is also called a false positive result (as we incorrectly conclude that the research hypothesis is true when in fact it is not). When we run a test of hypothesis and decide to reject H₀ (e.g., because the test statistic exceeds the critical value in an upper-tailed test), then either we make a correct decision because the research hypothesis is true, or we commit a Type I error. The different conclusions are summarized in the table below. Note that we will never know whether the null hypothesis is really true or false (i.e., we will never know which row of the following table reflects reality).

Table - Conclusions in Test of Hypothesis

                   Do Not Reject H₀      Reject H₀
H₀ is True         Correct Decision      Type I Error
H₀ is False        Type II Error         Correct Decision

In the first step of the hypothesis test, we select a level of significance, α, and α = P(Type I error). Because we purposely select a small value for α, we control the probability of committing a Type I error. For example, if we select α=0.05 and our test tells us to reject H₀, then there is a 5% probability that we commit a Type I error. Most investigators are very comfortable with this and are confident when rejecting H₀ that the research hypothesis is true (as it is the more likely scenario when we reject H₀).

When we run a test of hypothesis and decide not to reject H₀ (e.g., because the test statistic is below the critical value in an upper-tailed test), then either we make a correct decision because the null hypothesis is true, or we commit a Type II error. Beta (β) represents the probability of a Type II error and is defined as follows: β = P(Type II error) = P(Do not Reject H₀ | H₀ is false). Unfortunately, we cannot choose β to be small (e.g., 0.05) to control the probability of committing a Type II error, because β depends on several factors including the sample size, α, and the research hypothesis. When we do not reject H₀, it may be very likely that we are committing a Type II error (i.e., failing to reject H₀ when in fact it is false). Therefore, when tests are run and the null hypothesis is not rejected, we often make a weak concluding statement allowing for the possibility that we might be committing a Type II error. If we do not reject H₀, we conclude that we do not have significant evidence to show that H₁ is true. We do not conclude that H₀ is true.

Lightbulb icon signifying an important idea

 The most common reason for a Type II error is a small sample size.
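For the two-sided Z test with σ known, β can be written as β(μ) = Φ(z + d) - Φ(-z + d), where z is the two-tailed critical value and d = (μ₀ - μ)/(σ/√n). A minimal Python sketch; the numeric values for μ₀, σ, and n below are hypothetical:

```python
from statistics import NormalDist

std = NormalDist()

def beta_two_sided(mu, mu0, sigma, n, alpha=0.05):
    """P(Type II error) for the two-sided Z test of H0: mu = mu0 (sigma known)."""
    z = std.inv_cdf(1 - alpha / 2)
    d = (mu0 - mu) / (sigma / n ** 0.5)
    return std.cdf(z + d) - std.cdf(-z + d)

# Sanity check: when mu == mu0, H0 is "accepted" with probability 1 - alpha
b = beta_two_sided(mu=191, mu0=191, sigma=25, n=100)  # sigma and n hypothetical
```

Evaluating β at a true mean farther from μ₀, or with a larger n, gives a smaller β, which is the sense in which small samples drive Type II errors.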

Tests with One Sample, Continuous Outcome

Hypothesis testing applications with a continuous outcome variable in a single population are performed according to the five-step procedure outlined above. A key component is setting up the null and research hypotheses. The objective is to compare the mean in a single population to a known mean (μ₀). The known value is generally derived from another study or report, for example a study in a similar, but not identical, population or a study performed some years ago. The latter is called a historical control. It is important in setting up the hypotheses in a one sample test that the mean specified in the null hypothesis is a fair and reasonable comparator. This will be discussed in the examples that follow.

Test Statistics for Testing H₀: μ = μ₀

  • Z = (x̄ - μ₀) / (s/√n) if n > 30
  • t = (x̄ - μ₀) / (s/√n) if n < 30 (df = n - 1)

Note that statistical computing packages will use the t statistic exclusively and make the necessary adjustments for comparing the test statistic to appropriate values from probability tables to produce a p-value. 

The National Center for Health Statistics (NCHS) published a report in 2005 entitled Health, United States, containing extensive information on major trends in the health of Americans. Data are provided for the US population as a whole and for specific ages, sexes and races. The NCHS report indicated that in 2002 Americans paid an average of $3,302 per year on health care and prescription drugs. An investigator hypothesizes that in 2005 expenditures decreased, primarily due to the availability of generic drugs. To test the hypothesis, a sample of 100 Americans is selected and their expenditures on health care and prescription drugs in 2005 are measured. The sample data are summarized as follows: n=100, x̄=$3,190 and s=$890. Is there statistical evidence of a reduction in expenditures on health care and prescription drugs in 2005? Is the sample mean of $3,190 evidence of a true reduction in the mean or is it within chance fluctuation? We will run the test using the five-step approach.

  • Step 1.  Set up hypotheses and determine level of significance

H₀: μ = 3,302   H₁: μ < 3,302   α=0.05

The research hypothesis is that expenditures have decreased, and therefore a lower-tailed test is used.

This is a lower-tailed test, using a Z statistic and a 5% level of significance. Reject H₀ if Z < -1.645.

  •   Step 4. Compute the test statistic.  

The test statistic is Z = (3,190 - 3,302) / (890/√100) = -1.26. We do not reject H₀ because -1.26 > -1.645. We do not have statistically significant evidence at α=0.05 to show that the mean expenditures on health care and prescription drugs are lower in 2005 than the mean of $3,302 reported in 2002.

Recall that when we fail to reject H₀ in a test of hypothesis, either the null hypothesis is true (here, the mean expenditures in 2005 are the same as those in 2002 and equal to $3,302) or we committed a Type II error (i.e., we failed to reject H₀ when in fact it is false). In summarizing this test, we conclude that we do not have sufficient evidence to reject H₀. We do not conclude that H₀ is true, because there may be a moderate to high probability that we committed a Type II error. It is possible that the sample size is not large enough to detect a difference in mean expenditures.
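The arithmetic for this expenditures example can be sketched in Python (values from the text):

```python
n, xbar, s = 100, 3190, 890   # 2005 expenditure sample
mu0 = 3302                    # 2002 comparator
z = (xbar - mu0) / (s / n ** 0.5)   # about -1.26
reject = z < -1.645                 # lower-tailed decision rule at alpha = 0.05
```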

The NCHS reported that the mean total cholesterol level in 2002 for all adults was 203. Total cholesterol levels in participants who attended the seventh examination of the Offspring in the Framingham Heart Study are summarized as follows: n=3,310, x̄=200.3, and s=36.8. Is there statistical evidence of a difference in mean cholesterol levels in the Framingham Offspring?

Here we want to assess whether the sample mean of 200.3 in the Framingham sample is statistically significantly different from 203 (i.e., beyond what we would expect by chance). We will run the test using the five-step approach.

H₀: μ = 203   H₁: μ ≠ 203   α=0.05

The research hypothesis is that cholesterol levels are different in the Framingham Offspring, and therefore a two-tailed test is used.

  •   Step 3. Set up decision rule.  

This is a two-tailed test, using a Z statistic and a 5% level of significance. Reject H₀ if Z < -1.960 or if Z > 1.960.

The test statistic is Z = (200.3 - 203) / (36.8/√3,310) = -4.22. We reject H₀ because -4.22 < -1.960. We have statistically significant evidence at α=0.05 to show that the mean total cholesterol level in the Framingham Offspring is different from the national average of 203 reported in 2002. Because we reject H₀, we also approximate a p-value. Using the two-sided significance levels, p < 0.0001.
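The computation behind this two-tailed conclusion, sketched in Python:

```python
from statistics import NormalDist

n, xbar, s = 3310, 200.3, 36.8   # Framingham Offspring cholesterol
mu0 = 203                        # national comparator
z = (xbar - mu0) / (s / n ** 0.5)        # about -4.22
p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided p-value
```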

Statistical Significance versus Clinical (Practical) Significance

This example raises an important concept of statistical versus clinical or practical significance. From a statistical standpoint, the total cholesterol levels in the Framingham sample are highly statistically significantly different from the national average with p < 0.0001 (i.e., there is less than a 0.01% chance that we are incorrectly rejecting the null hypothesis). However, the sample mean in the Framingham Offspring study is 200.3, less than 3 units different from the national mean of 203. The reason that the data are so highly statistically significant is due to the very large sample size. It is always important to assess both statistical and clinical significance of data. This is particularly relevant when the sample size is large. Is a 3 unit difference in total cholesterol a meaningful difference?  

Consider again the NCHS-reported mean total cholesterol level in 2002 for all adults of 203. Suppose a new drug is proposed to lower total cholesterol. A study is designed to evaluate the efficacy of the drug in lowering cholesterol.   Fifteen patients are enrolled in the study and asked to take the new drug for 6 weeks. At the end of 6 weeks, each patient's total cholesterol level is measured and the sample statistics are as follows:   n=15, x̄ =195.9 and s=28.7. Is there statistical evidence of a reduction in mean total cholesterol in patients after using the new drug for 6 weeks? We will run the test using the five-step approach. 

H₀: μ = 203   H₁: μ < 203   α=0.05

  •  Step 2. Select the appropriate test statistic.  

Because the sample size is small (n < 30), the appropriate test statistic is

t = (x̄ - μ₀) / (s/√n)

This is a lower-tailed test, using a t statistic and a 5% level of significance. In order to determine the critical value of t, we need the degrees of freedom, df, defined as df = n - 1. In this example, df = 15 - 1 = 14. The critical value for a lower-tailed test with df=14 and α=0.05 is -2.145, and the decision rule is as follows: Reject H₀ if t < -2.145.

We do not reject H₀ because -0.96 > -2.145. We do not have statistically significant evidence at α=0.05 to show that the mean total cholesterol level is lower than the national mean in patients taking the new drug for 6 weeks. Again, because we failed to reject the null hypothesis, we make a weaker concluding statement allowing for the possibility that we may have committed a Type II error (i.e., failed to reject H₀ when in fact the drug is efficacious).
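The t computation for this small-sample test can be sketched as follows (the critical value is taken from a t table, df = 14):

```python
n, xbar, s = 15, 195.9, 28.7   # cholesterol after 6 weeks on the drug
mu0 = 203
t = (xbar - mu0) / (s / n ** 0.5)   # about -0.96
t_crit = -2.145                     # t table: df = 14, alpha = 0.05, lower-tailed
reject = t < t_crit
```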

Lightbulb icon signifying an important idea

This example raises an important issue in terms of study design. In this example we assume in the null hypothesis that the mean cholesterol level is 203. This is taken to be the mean cholesterol level in patients without treatment. Is this an appropriate comparator? Alternative and potentially more efficient study designs to evaluate the effect of the new drug could involve two treatment groups, where one group receives the new drug and the other does not, or we could measure each patient's baseline or pre-treatment cholesterol level and then assess changes from baseline to 6 weeks post-treatment. These designs are also discussed here.

Video - Comparing a Sample Mean to Known Population Mean (8:20)

Link to transcript of the video

Tests with One Sample, Dichotomous Outcome

Hypothesis testing applications with a dichotomous outcome variable in a single population are also performed according to the five-step procedure. Similar to tests for means, a key component is setting up the null and research hypotheses. The objective is to compare the proportion of successes in a single population to a known proportion (p₀). That known proportion is generally derived from another study or report and is sometimes called a historical control. It is important in setting up the hypotheses in a one sample test that the proportion specified in the null hypothesis is a fair and reasonable comparator.

In one sample tests for a dichotomous outcome, we set up our hypotheses against an appropriate comparator. We select a sample and compute descriptive statistics on the sample data. Specifically, we compute the sample size (n) and the sample proportion, which is computed by taking the ratio of the number of successes to the sample size: p̂ = x/n.

We then determine the appropriate test statistic (Step 2) for the hypothesis test. The formula for the test statistic is given below.

Test Statistic for Testing H₀: p = p₀

Z = (p̂ - p₀) / √(p₀(1-p₀)/n),   if min(np₀, n(1-p₀)) ≥ 5

The formula above is appropriate for large samples, defined as when the smaller of np₀ and n(1-p₀) is at least 5. This is similar, but not identical, to the condition required for appropriate use of the confidence interval formula for a population proportion, i.e., min(np̂, n(1-p̂)) ≥ 5.

Here we use the proportion specified in the null hypothesis as the true proportion of successes rather than the sample proportion. If we fail to satisfy the condition, then alternative procedures, called exact methods, must be used to test the hypothesis about the population proportion.

Example:  

The NCHS report indicated that in 2002 the prevalence of cigarette smoking among American adults was 21.1%.  Data on prevalent smoking in n=3,536 participants who attended the seventh examination of the Offspring in the Framingham Heart Study indicated that 482/3,536 = 13.6% of the respondents were currently smoking at the time of the exam. Suppose we want to assess whether the prevalence of smoking is lower in the Framingham Offspring sample given the focus on cardiovascular health in that community. Is there evidence of a statistically lower prevalence of smoking in the Framingham Offspring study as compared to the prevalence among all Americans?

H₀: p = 0.211   H₁: p < 0.211   α=0.05

We must first check that the sample size is adequate. Specifically, we need to check min(np₀, n(1-p₀)) = min(3,536(0.211), 3,536(1-0.211)) = min(746, 2,790) = 746. The sample size is more than adequate, so the following formula can be used:

Z = (p̂ - p₀) / √(p₀(1-p₀)/n)

This is a lower tailed test, using a Z statistic and a 5% level of significance. Reject H 0 if Z < -1.645.

We reject H₀ because -10.93 < -1.645. We have statistically significant evidence at α=0.05 to show that the prevalence of smoking in the Framingham Offspring is lower than the prevalence nationally (21.1%). Here, p < 0.0001.
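The computation behind this conclusion, using the rounded sample proportion p̂ = 0.136 as in the text:

```python
n = 3536
p_hat = 0.136   # 482/3536, rounded as in the text
p0 = 0.211      # national prevalence
se = (p0 * (1 - p0) / n) ** 0.5   # standard error under H0
z = (p_hat - p0) / se             # about -10.93
reject = z < -1.645               # lower-tailed decision rule at alpha = 0.05
```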

The NCHS report indicated that in 2002, 75% of children aged 2 to 17 saw a dentist in the past year. An investigator wants to assess whether use of dental services is similar in children living in the city of Boston. A sample of 125 children aged 2 to 17 living in Boston are surveyed and 64 reported seeing a dentist over the past 12 months. Is there a significant difference in use of dental services between children living in Boston and the national data?

Calculate this on your own before checking the answer.

Video - Hypothesis Test for One Sample and a Dichotomous Outcome (3:55)

Tests with Two Independent Samples, Continuous Outcome

There are many applications where it is of interest to compare two independent groups with respect to their mean scores on a continuous outcome. Here we compare means between groups, but rather than generating an estimate of the difference, we will test whether the observed difference (increase, decrease or difference) is statistically significant or not. Remember that hypothesis testing gives an assessment of statistical significance, whereas estimation gives an estimate of effect; both are important.

Here we discuss the comparison of means when the two comparison groups are independent or physically separate. The two groups might be determined by a particular attribute (e.g., sex, diagnosis of cardiovascular disease) or might be set up by the investigator (e.g., participants assigned to receive an experimental treatment or placebo). The first step in the analysis involves computing descriptive statistics on each of the two samples. Specifically, we compute the sample size, mean and standard deviation in each sample and we denote these summary statistics as follows:

for sample 1: n₁, x̄₁, and s₁

for sample 2: n₂, x̄₂, and s₂

The designation of sample 1 and sample 2 is arbitrary. In a clinical trial setting the convention is to call the treatment group 1 and the control group 2. However, when comparing men and women, for example, either group can be 1 or 2.  

In the two independent samples application with a continuous outcome, the parameter of interest in the test of hypothesis is the difference in population means, μ₁ - μ₂. The null hypothesis is always that there is no difference between groups with respect to means, i.e., H₀: μ₁ - μ₂ = 0.

The null hypothesis can also be written as follows: H₀: μ₁ = μ₂. In the research hypothesis, an investigator can hypothesize that the first mean is larger than the second (H₁: μ₁ > μ₂), that the first mean is smaller than the second (H₁: μ₁ < μ₂), or that the means are different (H₁: μ₁ ≠ μ₂). The three different alternatives represent upper-, lower-, and two-tailed tests, respectively. The following test statistics are used to test these hypotheses.

Test Statistics for Testing H₀: μ₁ = μ₂

  • Z = (x̄₁ - x̄₂) / (Sp √(1/n₁ + 1/n₂)) if n₁ > 30 and n₂ > 30
  • t = (x̄₁ - x̄₂) / (Sp √(1/n₁ + 1/n₂)) if n₁ < 30 or n₂ < 30 (df = n₁ + n₂ - 2)

NOTE: The formulas above assume equal variability in the two populations (i.e., the population variances are equal, or σ₁² = σ₂²). This means that the outcome is equally variable in each of the comparison populations. For analysis, we have samples from each of the comparison populations. If the sample variances are similar, then the assumption about variability in the populations is probably reasonable. As a guideline, if the ratio of the sample variances, s₁²/s₂², is between 0.5 and 2 (i.e., if one variance is no more than double the other), then the formulas above are appropriate. If the ratio of the sample variances is greater than 2 or less than 0.5, then alternative formulas must be used to account for the heterogeneity in variances.

The test statistics include Sp, the pooled estimate of the common standard deviation (again assuming that the variances in the populations are similar), computed as the weighted average of the standard deviations in the samples as follows:

Sp = √[ ((n₁-1)s₁² + (n₂-1)s₂²) / (n₁ + n₂ - 2) ]

Because we are assuming equal variances between groups, we pool the information on variability (sample variances) to generate an estimate of the variability in the population. Note: Because Sp is a weighted average of the standard deviations in the samples, Sp will always be in between s₁ and s₂.
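A small sketch of the pooled standard deviation; the sample values below are hypothetical, chosen only to show that Sp lands between s₁ and s₂:

```python
def pooled_sd(n1, s1, n2, s2):
    """Pooled estimate Sp of the common standard deviation."""
    return (((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)) ** 0.5

sp = pooled_sd(n1=10, s1=4.0, n2=20, s2=6.0)  # hypothetical samples
# Sp falls between s1 and s2, closer to the s from the larger sample
```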

Data measured on n=3,539 participants who attended the seventh examination of the Offspring in the Framingham Heart Study are shown below.  

 

Characteristic               Men                         Women
                             n       x̄       s           n       x̄       s
Systolic Blood Pressure      1,623   128.2   17.5        1,911   126.5   20.1
Diastolic Blood Pressure     1,622   75.6    9.8         1,910   72.6    9.7
Total Serum Cholesterol      1,544   192.4   35.2        1,766   207.1   36.7
Weight                       1,612   194.0   33.8        1,894   157.7   34.6
Height                       1,545   68.9    2.7         1,781   63.4    2.5
Body Mass Index              1,545   28.8    4.6         1,781   27.6    5.9

Suppose we now wish to assess whether there is a statistically significant difference in mean systolic blood pressures between men and women using a 5% level of significance.  

H₀: μ₁ = μ₂

H₁: μ₁ ≠ μ₂   α=0.05

Because both samples are large (n > 30), we can use the Z test statistic as opposed to t. Note that statistical computing packages use t throughout. Before implementing the formula, we first check whether the assumption of equality of population variances is reasonable. The guideline suggests investigating the ratio of the sample variances, s₁²/s₂². Suppose we call the men group 1 and the women group 2. Again, this is arbitrary; it only needs to be noted when interpreting the results. The ratio of the sample variances is 17.5²/20.1² = 0.76, which falls between 0.5 and 2, suggesting that the assumption of equality of population variances is reasonable. The appropriate test statistic is

Z = (x̄₁ - x̄₂) / (Sp √(1/n₁ + 1/n₂))

We now substitute the sample data into the formula for the test statistic identified in Step 2. Before substituting, we will first compute Sp, the pooled estimate of the common standard deviation.

Notice that the pooled estimate of the common standard deviation, Sp, falls in between the standard deviations in the comparison groups (i.e., 17.5 and 20.1). Sp is slightly closer in value to the standard deviation in the women (20.1), as there were slightly more women in the sample. Recall, Sp is a weighted average of the standard deviations in the comparison groups, weighted by the respective sample sizes.

Now the test statistic:

We reject H₀ because 2.66 > 1.960. We have statistically significant evidence at α=0.05 to show that there is a difference in mean systolic blood pressures between men and women. The p-value is p < 0.010.

Here again we find that there is a statistically significant difference in mean systolic blood pressures between men and women at p < 0.010. Notice that there is a very small difference in the sample means (128.2 - 126.5 = 1.7 units), but this difference is beyond what would be expected by chance. Is this a clinically meaningful difference? The large sample size in this example is driving the statistical significance. A 95% confidence interval for the difference in mean systolic blood pressures is 1.7 ± 1.26, or (0.44, 2.96). The confidence interval provides an assessment of the magnitude of the difference between means, whereas the test of hypothesis and p-value provide an assessment of the statistical significance of the difference.
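The full computation for this blood pressure example, sketched in Python using the summary statistics from the table:

```python
def pooled_sd(n1, s1, n2, s2):
    """Pooled estimate Sp of the common standard deviation."""
    return (((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)) ** 0.5

n1, x1, s1 = 1623, 128.2, 17.5   # men (group 1)
n2, x2, s2 = 1911, 126.5, 20.1   # women (group 2)
sp = pooled_sd(n1, s1, n2, s2)                    # falls between 17.5 and 20.1
z = (x1 - x2) / (sp * (1 / n1 + 1 / n2) ** 0.5)   # about 2.66
```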

Above we performed a study to evaluate a new drug designed to lower total cholesterol. The study involved one sample of patients, each patient took the new drug for 6 weeks and had their cholesterol measured. As a means of evaluating the efficacy of the new drug, the mean total cholesterol following 6 weeks of treatment was compared to the NCHS-reported mean total cholesterol level in 2002 for all adults of 203. At the end of the example, we discussed the appropriateness of the fixed comparator as well as an alternative study design to evaluate the effect of the new drug involving two treatment groups, where one group receives the new drug and the other does not. Here, we revisit the example with a concurrent or parallel control group, which is very typical in randomized controlled trials or clinical trials (refer to the EP713 module on Clinical Trials).  

A new drug is proposed to lower total cholesterol. A randomized controlled trial is designed to evaluate the efficacy of the medication in lowering cholesterol. Thirty participants are enrolled in the trial and are randomly assigned to receive either the new drug or a placebo. The participants do not know which treatment they are assigned. Each participant is asked to take the assigned treatment for 6 weeks. At the end of 6 weeks, each patient's total cholesterol level is measured and the sample statistics are as follows.

Treatment     n     x̄       s
New Drug      15    195.9   28.7
Placebo       15    227.4   30.3

Is there statistical evidence of a reduction in mean total cholesterol in patients taking the new drug for 6 weeks as compared to participants taking placebo? We will run the test using the five-step approach.

H₀: μ₁ = μ₂   H₁: μ₁ < μ₂   α=0.05

Because both samples are small (n < 30), we use the t test statistic. Before implementing the formula, we first check whether the assumption of equality of population variances is reasonable. The ratio of the sample variances is s₁²/s₂² = 28.7²/30.3² = 0.90, which falls between 0.5 and 2, suggesting that the assumption of equality of population variances is reasonable. The appropriate test statistic is:

t = (x̄₁ - x̄₂) / (Sp √(1/n₁ + 1/n₂))

This is a lower-tailed test, using a t statistic and a 5% level of significance. The appropriate critical value can be found in the t table (in More Resources to the right). In order to determine the critical value of t, we need the degrees of freedom, df, defined as df = n₁ + n₂ - 2 = 15 + 15 - 2 = 28. The critical value for a lower-tailed test with df=28 and α=0.05 is -1.701, and the decision rule is: Reject H₀ if t < -1.701.

Now the test statistic,

We reject H₀ because -2.92 < -1.701. We have statistically significant evidence at α=0.05 to show that the mean total cholesterol level is lower in patients taking the new drug for 6 weeks as compared to patients taking placebo, p < 0.005.
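The trial computation, sketched with the summary statistics from the table (critical value from a t table, df = 28):

```python
n1, x1, s1 = 15, 195.9, 28.7   # new drug
n2, x2, s2 = 15, 227.4, 30.3   # placebo
sp = (((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)) ** 0.5
t = (x1 - x2) / (sp * (1 / n1 + 1 / n2) ** 0.5)   # about -2.92
t_crit = -1.701   # t table: df = 28, alpha = 0.05, lower-tailed
reject = t < t_crit
```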

The clinical trial in this example finds a statistically significant reduction in total cholesterol, whereas in the previous example where we had a historical control (as opposed to a parallel control group) we did not demonstrate efficacy of the new drug. Notice that the mean total cholesterol level in patients taking placebo is 227.4, which is very different from the mean cholesterol reported among all Americans in 2002 of 203 and used as the comparator in the prior example. The historical control value may not have been the most appropriate comparator as cholesterol levels have been increasing over time. In the next section, we present another design that can be used to assess the efficacy of the new drug.

Video - Comparison of Two Independent Samples With a Continuous Outcome (8:02)

Tests with Matched Samples, Continuous Outcome

In the previous section we compared two groups with respect to their mean scores on a continuous outcome. An alternative study design is to compare matched or paired samples. The two comparison groups are said to be dependent, and the data can arise from a single sample of participants where each participant is measured twice (possibly before and after an intervention) or from two samples that are matched on specific characteristics (e.g., siblings). When the samples are dependent, we focus on difference scores in each participant or between members of a pair, and the test of hypothesis is based on the mean difference, μd. The null hypothesis again reflects "no difference" and is stated as H0: μd = 0. Note that there are some instances where it is of interest to test whether there is a difference of a particular magnitude (e.g., μd = 5), but in most instances the null hypothesis reflects no difference (i.e., μd = 0).

The appropriate formula for the test of hypothesis depends on the sample size. The formulas are shown below and are identical to those we presented for estimating the mean of a single sample (e.g., when comparing against an external or historical control), except here we focus on difference scores.

Test Statistics for Testing H0: μd = 0
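Applied to the difference scores, the formulas mirror the one-sample case; under H0: μd = 0 the standard forms for large and small samples are
\begin{align}
z=\frac{\overline{X}_d}{s_d / \sqrt{n}} \quad (n \geq 30),
\qquad
t=\frac{\overline{X}_d}{s_d / \sqrt{n}} \quad (n \lt 30, \; df = n-1),
\end{align}
where \(\overline{X}_d\) and \(s_d\) are the mean and standard deviation of the difference scores.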

A new drug is proposed to lower total cholesterol and a study is designed to evaluate the efficacy of the drug in lowering cholesterol. Fifteen patients agree to participate in the study and each is asked to take the new drug for 6 weeks. However, before starting the treatment, each patient's total cholesterol level is measured. The initial measurement is a pre-treatment or baseline value. After taking the drug for 6 weeks, each patient's total cholesterol level is measured again and the data are shown below. The rightmost column contains difference scores for each patient, computed by subtracting the 6-week cholesterol level from the baseline level. The differences represent the reduction in total cholesterol over 6 weeks. (The differences could have been computed by subtracting the baseline total cholesterol level from the level measured at 6 weeks. The way in which the differences are computed does not affect the outcome of the analysis, only the interpretation.)

Patient    Baseline    6 Weeks    Difference
   1          215        205          10
   2          190        156          34
   3          230        190          40
   4          220        180          40
   5          214        201          13
   6          240        227          13
   7          210        197          13
   8          193        173          20
   9          210        204           6
  10          230        217          13
  11          180        142          38
  12          260        262          -2
  13          210        207           3
  14          190        184           6
  15          200        193           7

Because the differences are computed by subtracting the cholesterols measured at 6 weeks from the baseline values, positive differences indicate reductions and negative differences indicate increases (e.g., participant 12 increases by 2 units over 6 weeks). The goal here is to test whether there is a statistically significant reduction in cholesterol. Because of the way in which we computed the differences, we want to look for an increase in the mean difference (i.e., a positive reduction). In order to conduct the test, we need to summarize the differences. In this sample, we have n = 15, Σd = 254 and Σd² = 7110, so the sample mean difference is 254/15 = 16.9 and the sample standard deviation of the differences is sd = √((7110 - 254²/15)/(15 - 1)) = 14.2.

The calculations are shown below.  

Patient    Difference    Difference²
   1           10            100
   2           34           1156
   3           40           1600
   4           40           1600
   5           13            169
   6           13            169
   7           13            169
   8           20            400
   9            6             36
  10           13            169
  11           38           1444
  12           -2              4
  13            3              9
  14            6             36
  15            7             49
Total         254           7110

Is there statistical evidence of a reduction in mean total cholesterol in patients after using the new medication for 6 weeks? We will run the test using the five-step approach.

H0: μd = 0     H1: μd > 0                 α = 0.05

NOTE: If we had computed differences by subtracting the baseline level from the level measured at 6 weeks, then negative differences would have reflected reductions and the research hypothesis would have been H1: μd < 0.

  • Step 2. Select the appropriate test statistic.

Because the sample size is small (n = 15 < 30), the appropriate test statistic is the t statistic, t = X̄d/(sd/√n), with df = n - 1.

This is an upper-tailed test, using a t statistic and a 5% level of significance. The appropriate critical value can be found in the t table, with df = 15 - 1 = 14. The critical value for an upper-tailed test with df = 14 and α = 0.05 is 2.145, and the decision rule is: Reject H0 if t > 2.145.

We now substitute the sample data into the formula for the test statistic identified in Step 2: t = 16.9/(14.2/√15) = 4.61.

We reject H0 because 4.61 > 2.145. We have statistically significant evidence at α = 0.05 to show that there is a reduction in cholesterol levels over 6 weeks.
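The paired analysis can be verified directly from the difference scores in the table. Carrying full precision gives t ≈ 4.63; the 4.61 above comes from first rounding the mean and standard deviation to 16.9 and 14.2. A sketch, with variable names of our own choosing:

```python
import math

# Difference scores (baseline - 6 weeks) for the 15 patients
diffs = [10, 34, 40, 40, 13, 13, 13, 20, 6, 13, 38, -2, 3, 6, 7]

n = len(diffs)
mean_d = sum(diffs) / n                                           # mean difference
sd_d = math.sqrt((sum(d**2 for d in diffs) - sum(diffs)**2 / n) / (n - 1))
t = mean_d / (sd_d / math.sqrt(n))                                # paired t statistic
print(round(mean_d, 2), round(sd_d, 2), round(t, 2))  # 16.93 14.16 4.63
```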

Here we illustrate the use of a matched design to test the efficacy of a new drug to lower total cholesterol. We also considered a parallel design (randomized clinical trial) and a study using a historical comparator. It is extremely important to design studies that are best suited to detect a meaningful difference when one exists. There are often several alternatives and investigators work with biostatisticians to determine the best design for each application. It is worth noting that the matched design used here can be problematic in that observed differences may only reflect a "placebo" effect. All participants took the assigned medication, but is the observed reduction attributable to the medication or a result of their participation in the study?

Video - Hypothesis Testing With a Matched Sample and a Continuous Outcome (3:11)

Tests with Two Independent Samples, Dichotomous Outcome

There are several approaches that can be used to test hypotheses concerning two independent proportions. Here we present one approach. The chi-square test of independence is an alternative, equivalent, and perhaps more popular approach to the same analysis; hypothesis testing with the chi-square test is addressed in the third module in this series: BS704_HypothesisTesting-ChiSquare.

In tests of hypothesis comparing proportions between two independent groups, one test is performed and results can be interpreted to apply to a risk difference, relative risk or odds ratio. As a reminder, the risk difference is computed by taking the difference in proportions between comparison groups, the risk ratio is computed by taking the ratio of proportions, and the odds ratio is computed by taking the ratio of the odds of success in the comparison groups. Because the null values for the risk difference, the risk ratio and the odds ratio are different, the hypotheses in tests of hypothesis look slightly different depending on which measure is used. When performing tests of hypothesis for the risk difference, relative risk or odds ratio, the convention is to label the exposed or treated group 1 and the unexposed or control group 2.      

For example, suppose a study is designed to assess whether there is a significant difference in proportions in two independent comparison groups. The test of interest is as follows:

H0: p1 = p2 versus H1: p1 ≠ p2.

The following are the hypotheses for testing for a difference in proportions using the risk difference, the risk ratio and the odds ratio. First, the hypotheses above are equivalent to the following:

  • For the risk difference, H0: p1 - p2 = 0 versus H1: p1 - p2 ≠ 0, which are, by definition, equal to H0: RD = 0 versus H1: RD ≠ 0.
  • If an investigator wants to focus on the risk ratio, the equivalent hypotheses are H0: RR = 1 versus H1: RR ≠ 1.
  • If the investigator wants to focus on the odds ratio, the equivalent hypotheses are H0: OR = 1 versus H1: OR ≠ 1.

Suppose a test is performed to test H0: RD = 0 versus H1: RD ≠ 0 and the test rejects H0 at α = 0.05. Based on this test we can conclude that there is significant evidence, at α = 0.05, of a difference in proportions: significant evidence that the risk difference is not zero, and significant evidence that the risk ratio and odds ratio are not one. The risk difference is analogous to the difference in means when the outcome is continuous. Here the parameter of interest is the difference in proportions in the population, RD = p1 - p2, and the null value for the risk difference is zero. In a test of hypothesis for the risk difference, the null hypothesis is always H0: RD = 0. This is equivalent to H0: RR = 1 and H0: OR = 1. In the research hypothesis, an investigator can hypothesize that the first proportion is larger than the second (H1: p1 > p2, which is equivalent to H1: RD > 0, H1: RR > 1 and H1: OR > 1), that the first proportion is smaller than the second (H1: p1 < p2, which is equivalent to H1: RD < 0, H1: RR < 1 and H1: OR < 1), or that the proportions are different (H1: p1 ≠ p2, which is equivalent to H1: RD ≠ 0, H1: RR ≠ 1 and H1: OR ≠ 1). The three different alternatives represent upper-, lower- and two-tailed tests, respectively.

The formula for the test of hypothesis for the difference in proportions is given below.

Test Statistics for Testing H0: p1 = p2
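The standard large-sample form of this test statistic uses the pooled (overall) proportion of successes \(\hat{p}\):
\begin{align}
z=\frac{\hat{p}_1-\hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}},
\qquad
\hat{p}=\frac{x_1+x_2}{n_1+n_2},
\end{align}
where \(x_1\) and \(x_2\) are the numbers of successes in the two comparison groups.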

                                     

The formula above is appropriate for large samples, defined as at least 5 successes (np > 5) and at least 5 failures (n(1-p) > 5) in each of the two samples. If there are fewer than 5 successes or failures in either comparison group, then alternative procedures, called exact methods, must be used to estimate the difference in population proportions.

The following table summarizes data from n=3,799 participants who attended the fifth examination of the Offspring in the Framingham Heart Study. The outcome of interest is prevalent CVD and we want to test whether the prevalence of CVD is significantly higher in smokers as compared to non-smokers.

 

                  Free of CVD    History of CVD    Total
Non-Smoker            2,757            298         3,055
Current Smoker          663             81           744
Total                 3,420            379         3,799

The prevalence of CVD (or proportion of participants with prevalent CVD) among non-smokers is 298/3,055 = 0.0975 and the prevalence of CVD among current smokers is 81/744 = 0.1089. Here smoking status defines the comparison groups and we will call the current smokers group 1 (exposed) and the non-smokers (unexposed) group 2. The test of hypothesis is conducted below using the five step approach.

H0: p1 = p2     H1: p1 ≠ p2                 α = 0.05

  • Step 2.  Select the appropriate test statistic.  

We must first check that the sample size is adequate. Specifically, we need to ensure that we have at least 5 successes and 5 failures in each comparison group. In this example, we have more than enough successes (cases of prevalent CVD) and failures (persons free of CVD) in each comparison group. The sample size is more than adequate so the following formula can be used:

Reject H0 if Z < -1.960 or if Z > 1.960.

We now substitute the sample data into the formula for the test statistic identified in Step 2. We first compute the overall proportion of successes: p̂ = (81 + 298)/(744 + 3,055) = 379/3,799 = 0.0998.

We now substitute to compute the test statistic: z = (0.1089 - 0.0975)/√(0.0998(0.9002)(1/744 + 1/3,055)) = 0.0114/0.0123 = 0.927.

  • Step 5. Conclusion.

We do not reject H0 because -1.960 < 0.927 < 1.960. We do not have statistically significant evidence at α = 0.05 to show that there is a difference in prevalent CVD between smokers and non-smokers.
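The Framingham calculation can be checked from the counts in the table. At full precision z ≈ 0.92, close to the 0.927 obtained above with rounded intermediate values. A sketch, with variable names of our own choosing:

```python
import math

x1, n1 = 81, 744      # current smokers with prevalent CVD (group 1)
x2, n2 = 298, 3055    # non-smokers with prevalent CVD (group 2)

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)   # overall proportion of successes, ~0.0998
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
print(round(z, 2))  # 0.92
```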

A 95% confidence interval for the difference in prevalent CVD (or risk difference) between smokers and non-smokers is 0.0114 ± 0.0247, or between -0.0133 and 0.0361. Because the 95% confidence interval for the risk difference includes zero, we again conclude that there is no statistically significant difference in prevalent CVD between smokers and non-smokers.

Smoking has been shown over and over to be a risk factor for cardiovascular disease. What might explain the fact that we did not observe a statistically significant difference using data from the Framingham Heart Study? HINT: Here we consider prevalent CVD, would the results have been different if we considered incident CVD?

A randomized trial is designed to evaluate the effectiveness of a newly developed pain reliever designed to reduce pain in patients following joint replacement surgery. The trial compares the new pain reliever to the pain reliever currently in use (called the standard of care). A total of 100 patients undergoing joint replacement surgery agreed to participate in the trial. Patients were randomly assigned to receive either the new pain reliever or the standard pain reliever following surgery and were blind to the treatment assignment. Before receiving the assigned treatment, patients were asked to rate their pain on a scale of 0-10 with higher scores indicative of more pain. Each patient was then given the assigned treatment and after 30 minutes was again asked to rate their pain on the same scale. The primary outcome was a reduction in pain of 3 or more scale points (defined by clinicians as a clinically meaningful reduction). The following data were observed in the trial.

Treatment                  n    Reduction of 3+ Points    Proportion with Reduction
New Pain Reliever         50             23                       0.46
Standard Pain Reliever    50             11                       0.22

We now test whether there is a statistically significant difference in the proportions of patients reporting a meaningful reduction (i.e., a reduction of 3 or more scale points) using the five step approach.  

H0: p1 = p2     H1: p1 ≠ p2              α = 0.05

Here the new or experimental pain reliever is group 1 and the standard pain reliever is group 2.

We must first check that the sample size is adequate. Specifically, we need to ensure that we have at least 5 successes and 5 failures in each comparison group, i.e., min(n1 p̂1, n1(1 - p̂1), n2 p̂2, n2(1 - p̂2)) ≥ 5.

In this example, we have min(50(0.46), 50(1-0.46), 50(0.22), 50(1-0.22)) = min(23, 27, 11, 39) = 11. The sample size is adequate, so the following formula can be used: with p̂ = (23 + 11)/100 = 0.34, z = (0.46 - 0.22)/√(0.34(0.66)(1/50 + 1/50)) = 0.24/0.095 = 2.526.

We reject H0 because 2.526 > 1.960. We have statistically significant evidence at α = 0.05 to show that there is a difference in the proportions of patients on the new pain reliever reporting a meaningful reduction (i.e., a reduction of 3 or more scale points) as compared to patients on the standard pain reliever.

A 95% confidence interval for the difference in proportions of patients on the new pain reliever reporting a meaningful reduction (i.e., a reduction of 3 or more scale points) as compared to patients on the standard pain reliever is 0.24 ± 0.18, or between 0.06 and 0.42. Because the 95% confidence interval does not include zero, we conclude that there is a statistically significant difference in proportions, which is consistent with the test of hypothesis result.
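Both the test statistic and the confidence interval for this trial can be reproduced from the counts; small rounding differences from the 2.526 above are expected. A sketch, with variable names of our own choosing:

```python
import math

x1, n1 = 23, 50   # new pain reliever: meaningful reductions
x2, n2 = 11, 50   # standard pain reliever: meaningful reductions

p1, p2 = x1 / n1, x2 / n2            # 0.46 and 0.22
p_pool = (x1 + x2) / (n1 + n2)       # 0.34
z = (p1 - p2) / math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# The 95% CI for the risk difference uses the unpooled standard error
se_ci = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo, hi = (p1 - p2) - 1.96 * se_ci, (p1 - p2) + 1.96 * se_ci
print(round(z, 2), round(lo, 2), round(hi, 2))  # 2.53 0.06 0.42
```

Note the design choice: the test statistic pools the proportions under H0, while the confidence interval does not, since the interval makes no assumption that the proportions are equal.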

Again, the procedures discussed here apply to applications where there are two independent comparison groups and a dichotomous outcome. There are other applications in which it is of interest to compare a dichotomous outcome in matched or paired samples. For example, in a clinical trial we might wish to test the effectiveness of a new antibiotic eye drop for the treatment of bacterial conjunctivitis. Participants use the new antibiotic eye drop in one eye and a comparator (placebo or active control treatment) in the other. The success of the treatment (yes/no) is recorded for each participant for each eye. Because the two assessments (success or failure) are paired, we cannot use the procedures discussed here. The appropriate test is called McNemar's test (sometimes called McNemar's test for dependent proportions).  

Video - Hypothesis Testing With Two Independent Samples and a Dichotomous Outcome (2:55)

Here we presented hypothesis testing techniques for means and proportions in one and two sample situations. Tests of hypothesis involve several steps, including specifying the null and alternative or research hypothesis, selecting and computing an appropriate test statistic, setting up a decision rule and drawing a conclusion. There are many details to consider in hypothesis testing. The first is to determine the appropriate test. We discussed Z and t tests here for different applications. The appropriate test depends on the distribution of the outcome variable (continuous or dichotomous), the number of comparison groups (one, two) and whether the comparison groups are independent or dependent. The following table summarizes the different tests of hypothesis discussed here.

  • Continuous Outcome, One Sample: H0: μ = μ0
  • Continuous Outcome, Two Independent Samples: H0: μ1 = μ2
  • Continuous Outcome, Two Matched Samples: H0: μd = 0
  • Dichotomous Outcome, One Sample: H0: p = p0
  • Dichotomous Outcome, Two Independent Samples: H0: p1 = p2, RD=0, RR=1, OR=1

Once the type of test is determined, the details of the test must be specified. Specifically, the null and alternative hypotheses must be clearly stated. The null hypothesis always reflects the "no change" or "no difference" situation. The alternative or research hypothesis reflects the investigator's belief. The investigator might hypothesize that a parameter (e.g., a mean, proportion, difference in means or proportions) will increase, will decrease or will be different under specific conditions (sometimes the conditions are different experimental conditions and other times the conditions are simply different groups of participants). Once the hypotheses are specified, data are collected and summarized. The appropriate test is then conducted according to the five step approach. If the test leads to rejection of the null hypothesis, an approximate p-value is computed to summarize the significance of the findings. When tests of hypothesis are conducted using statistical computing packages, exact p-values are computed. Because the statistical tables in this textbook are limited, we can only approximate p-values. If the test fails to reject the null hypothesis, then a weaker concluding statement is made for the following reason.

In hypothesis testing, there are two types of errors that can be committed. A Type I error occurs when a test incorrectly rejects the null hypothesis. This is referred to as a false positive result, and the probability that this occurs is equal to the level of significance, α. The investigator chooses the level of significance in Step 1, and purposely chooses a small value such as α = 0.05 to control the probability of committing a Type I error. A Type II error occurs when a test fails to reject the null hypothesis when in fact it is false. The probability that this occurs is equal to β. Unfortunately, the investigator cannot specify β at the outset because it depends on several factors including the sample size (smaller samples have higher β), the level of significance (β decreases as α increases), and the difference in the parameter under the null and alternative hypothesis.

In several examples in this chapter, we noted the relationship between confidence intervals and tests of hypothesis. The approaches are different, yet related. It is possible to draw a conclusion about statistical significance by examining a confidence interval. For example, if a 95% confidence interval does not contain the null value (e.g., zero when analyzing a mean difference or risk difference, one when analyzing relative risks or odds ratios), then one can conclude that a two-sided test of hypothesis would reject the null at α = 0.05. It is important to note that the correspondence between a confidence interval and test of hypothesis relates to a two-sided test and that the confidence level corresponds to a specific level of significance (e.g., 95% to α = 0.05, 90% to α = 0.10 and so on). The exact significance of the test, the p-value, can only be determined using the hypothesis testing approach, and the p-value provides an assessment of the strength of the evidence and not an estimate of the effect.

Answers to Selected Problems

Dental services problem - bottom of page 5.

  • Step 1: Set up hypotheses and determine the level of significance.

α=0.05

  • Step 2: Select the appropriate test statistic.

First, determine whether the sample size is adequate.

Therefore the sample size is adequate, and we can use the following formula:

  • Step 3: Set up the decision rule.

Reject H0 if Z is less than or equal to -1.96 or if Z is greater than or equal to 1.96.

  • Step 4: Compute the test statistic
  • Step 5: Conclusion.

We reject the null hypothesis because -6.15 < -1.96. Therefore there is a statistically significant difference in the proportion of children in Boston using dental services compared to the national proportion.


Chapter 13: Inferential Statistics

Some Basic Null Hypothesis Tests

Learning Objectives

  • Conduct and interpret one-sample, dependent-samples, and independent-samples  t  tests.
  • Interpret the results of one-way, repeated measures, and factorial ANOVAs.
  • Conduct and interpret null hypothesis tests of Pearson’s  r .

In this section, we look at several common null hypothesis testing procedures. The emphasis here is on providing enough information to allow you to conduct and interpret the most basic versions. In most cases, the online statistical analysis tools mentioned in  Chapter 12 will handle the computations—as will programs such as Microsoft Excel and SPSS.

The  t  Test

As we have seen throughout this book, many studies in psychology focus on the difference between two means. The most common null hypothesis test for this type of statistical relationship is the  t test . In this section, we look at three types of  t  tests that are used for slightly different research designs: the one-sample  t test, the dependent-samples  t  test, and the independent-samples  t  test.

One-Sample  t  Test

The  one-sample  t test  is used to compare a sample mean ( M ) with a hypothetical population mean (μ0) that provides some interesting standard of comparison. The null hypothesis is that the mean for the population (µ) is equal to the hypothetical population mean: μ = μ0. The alternative hypothesis is that the mean for the population is different from the hypothetical population mean: μ ≠ μ0. To decide between these two hypotheses, we need to find the probability of obtaining the sample mean (or one more extreme) if the null hypothesis were true. But finding this  p  value requires first computing a test statistic called  t . (A test statistic  is a statistic that is computed only to help find the  p  value.) The formula for  t  is as follows:

\[t=\dfrac{M-\mu_0}{\left(\dfrac{SD}{\sqrt{N}}\right)}\]

Again, M is the sample mean and µ0 is the hypothetical population mean of interest. SD is the sample standard deviation and N is the sample size.

The reason the  t  statistic (or any test statistic) is useful is that we know how it is distributed when the null hypothesis is true. As shown in Figure 13.1, this distribution is unimodal and symmetrical, and it has a mean of 0. Its precise shape depends on a statistical concept called the degrees of freedom, which for a one-sample  t  test is  N  − 1. (There are 24 degrees of freedom for the distribution shown in Figure 13.1.) The important point is that knowing this distribution makes it possible to find the  p value for any  t  score. Consider, for example, a  t  score of +1.50 based on a sample of 25. The probability of a  t  score at least this extreme is given by the proportion of  t  scores in the distribution that are at least this extreme. For now, let us define  extreme  as being far from zero in either direction. Thus the  p  value is the proportion of  t  scores that are +1.50 or above  or  that are −1.50 or below—a value that turns out to be .14.

Figure 13.1. Distribution of t scores with 24 degrees of freedom when the null hypothesis is true, with the one-tailed (±1.711) and two-tailed (±2.064) critical values marked.

Fortunately, we do not have to deal directly with the distribution of  t  scores. If we were to enter our sample data and hypothetical mean of interest into one of the online statistical tools in  Chapter 12 or into a program like SPSS (Excel does not have a one-sample  t  test function), the output would include both the  t  score and the  p  value. At this point, the rest of the procedure is simple. If  p  is less than .05, we reject the null hypothesis and conclude that the population mean differs from the hypothetical mean of interest. If  p  is greater than .05, we retain the null hypothesis and conclude that there is not enough evidence to say that the population mean differs from the hypothetical mean of interest. (Again, technically, we conclude only that we do not have enough evidence to conclude that it  does  differ.)

If we were to compute the  t  score by hand, we could use a table like Table 13.2 to make the decision. This table does not provide actual  p  values. Instead, it provides the  critical values  of  t  for different degrees of freedom ( df)  when α is .05. For now, let us focus on the two-tailed critical values in the last column of the table. Each of these values should be interpreted as a pair of values: one positive and one negative. For example, the two-tailed critical values when there are 24 degrees of freedom are +2.064 and −2.064. These are represented by the red vertical lines in Figure 13.1. The idea is that any  t  score below the lower critical value (the left-hand red line in Figure 13.1) is in the lowest 2.5% of the distribution, while any  t  score above the upper critical value (the right-hand red line) is in the highest 2.5% of the distribution. Therefore any  t  score beyond the critical value in  either  direction is in the most extreme 5% of  t  scores when the null hypothesis is true and has a  p  value less than .05. Thus if the  t  score we compute is beyond the critical value in either direction, then we reject the null hypothesis. If the  t  score we compute is between the upper and lower critical values, then we retain the null hypothesis.
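The two-tailed decision rule described in this paragraph can be sketched as a tiny function; the name `two_tailed_decision` is ours, for illustration:

```python
def two_tailed_decision(t, critical):
    """Reject H0 when the t score is beyond the critical value in either direction."""
    return "reject H0" if abs(t) > critical else "retain H0"

# With 24 degrees of freedom and alpha = .05, Table 13.2 gives 2.064
print(two_tailed_decision(1.50, 2.064))   # retain H0
print(two_tailed_decision(-3.00, 2.064))  # reject H0
```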

Table 13.2 Table of Critical Values of t When α = .05
df    One-tailed critical value    Two-tailed critical value
3 2.353 3.182
4 2.132 2.776
5 2.015 2.571
6 1.943 2.447
7 1.895 2.365
8 1.860 2.306
9 1.833 2.262
10 1.812 2.228
11 1.796 2.201
12 1.782 2.179
13 1.771 2.160
14 1.761 2.145
15 1.753 2.131
16 1.746 2.120
17 1.740 2.110
18 1.734 2.101
19 1.729 2.093
20 1.725 2.086
21 1.721 2.080
22 1.717 2.074
23 1.714 2.069
24 1.711 2.064
25 1.708 2.060
30 1.697 2.042
35 1.690 2.030
40 1.684 2.021
45 1.679 2.014
50 1.676 2.009
60 1.671 2.000
70 1.667 1.994
80 1.664 1.990
90 1.662 1.987
100 1.660 1.984

Thus far, we have considered what is called a  two-tailed test , where we reject the null hypothesis if the  t  score for the sample is extreme in either direction. This test makes sense when we believe that the sample mean might differ from the hypothetical population mean but we do not have good reason to expect the difference to go in a particular direction. But it is also possible to do a  one-tailed test , where we reject the null hypothesis only if the  t  score for the sample is extreme in one direction that we specify before collecting the data. This test makes sense when we have good reason to expect the sample mean will differ from the hypothetical population mean in a particular direction.

Here is how it works. Each one-tailed critical value in Table 13.2 can again be interpreted as a pair of values: one positive and one negative. A  t  score below the lower critical value is in the lowest 5% of the distribution, and a  t  score above the upper critical value is in the highest 5% of the distribution. For 24 degrees of freedom, these values are −1.711 and +1.711. (These are represented by the green vertical lines in Figure 13.1.) However, for a one-tailed test, we must decide before collecting data whether we expect the sample mean to be lower than the hypothetical population mean, in which case we would use only the lower critical value, or we expect the sample mean to be greater than the hypothetical population mean, in which case we would use only the upper critical value. Notice that we still reject the null hypothesis when the  t  score for our sample is in the most extreme 5% of the t scores we would expect if the null hypothesis were true—so α remains at .05. We have simply redefined  extreme  to refer only to one tail of the distribution. The advantage of the one-tailed test is that critical values are less extreme. If the sample mean differs from the hypothetical population mean in the expected direction, then we have a better chance of rejecting the null hypothesis. The disadvantage is that if the sample mean differs from the hypothetical population mean in the unexpected direction, then there is no chance at all of rejecting the null hypothesis.

Example One-Sample t Test

Imagine that a health psychologist is interested in the accuracy of university students’ estimates of the number of calories in a chocolate chip cookie. He shows the cookie to a sample of 10 students and asks each one to estimate the number of calories in it. Because the actual number of calories in the cookie is 250, this is the hypothetical population mean of interest (µ0). The null hypothesis is that the mean estimate for the population (μ) is 250. Because he has no real sense of whether the students will underestimate or overestimate the number of calories, he decides to do a two-tailed test. Now imagine further that the participants’ actual estimates are as follows:

250, 280, 200, 150, 175, 200, 200, 220, 180, 250

The mean estimate for the sample ( M ) is 212.00 calories and the standard deviation ( SD ) is 39.17. The health psychologist can now compute the  t  score for his sample:

\[t=\dfrac{212-250}{\left(\dfrac{39.17}{\sqrt{10}}\right)}=-3.07\]

If he enters the data into one of the online analysis tools or uses SPSS, it would also tell him that the two-tailed  p  value for this  t  score (with 10 − 1 = 9 degrees of freedom) is .013. Because this is less than .05, the health psychologist would reject the null hypothesis and conclude that university students tend to underestimate the number of calories in a chocolate chip cookie. If he computes the  t  score by hand, he could look at Table 13.2 and see that the critical value of  t  for a two-tailed test with 9 degrees of freedom is ±2.262. The fact that his  t  score was more extreme than this critical value would tell him that his  p  value is less than .05 and that he should reject the null hypothesis.

Finally, if this researcher had gone into this study with good reason to expect that university students underestimate the number of calories, then he could have done a one-tailed test instead of a two-tailed test. The only thing this decision would change is the critical value, which would be −1.833. This slightly less extreme value would make it a bit easier to reject the null hypothesis. However, if it turned out that university students overestimate the number of calories—no matter how much they overestimate it—the researcher would not have been able to reject the null hypothesis.
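The t score in this example can be reproduced from the reported summary statistics using only Python's standard library; the function name is ours, a sketch rather than the book's method:

```python
import math

def one_sample_t(m, mu0, sd, n):
    """One-sample t statistic: (M - mu0) / (SD / sqrt(N))."""
    return (m - mu0) / (sd / math.sqrt(n))

# Calorie-estimate example: M = 212.00, SD = 39.17, N = 10, mu0 = 250
t = one_sample_t(212.00, 250, 39.17, 10)
print(round(t, 2))  # -3.07
```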

The Dependent-Samples t Test

The  dependent-samples t test  (sometimes called the paired-samples  t  test) is used to compare two means for the same sample tested at two different times or under two different conditions. This comparison is appropriate for pretest-posttest designs or within-subjects experiments. The null hypothesis is that the means at the two times or under the two conditions are the same in the population. The alternative hypothesis is that they are not the same. This test can also be one-tailed if the researcher has good reason to expect the difference goes in a particular direction.

It helps to think of the dependent-samples  t  test as a special case of the one-sample  t  test. However, the first step in the dependent-samples  t  test is to reduce the two scores for each participant to a single  difference score  by taking the difference between them. At this point, the dependent-samples  t  test becomes a one-sample  t  test on the difference scores. The hypothetical population mean (µ 0 ) of interest is 0 because this is what the mean difference score would be if there were no difference on average between the two times or two conditions. We can now think of the null hypothesis as being that the mean difference score in the population is 0 (µ 0  = 0) and the alternative hypothesis as being that the mean difference score in the population is not 0 (µ 0  ≠ 0).

Example Dependent-Samples t Test

Imagine that the health psychologist now knows that people tend to underestimate the number of calories in junk food and has developed a short training program to improve their estimates. To test the effectiveness of this program, he conducts a pretest-posttest study in which 10 participants estimate the number of calories in a chocolate chip cookie before the training program and then again afterward. Because he expects the program to increase the participants’ estimates, he decides to do a one-tailed test. Now imagine further that the pretest estimates are

230, 250, 280, 175, 150, 200, 180, 210, 220, 190

and that the posttest estimates (for the same participants in the same order) are

250, 260, 250, 200, 160, 200, 200, 180, 230, 240

The difference scores, then, are as follows:

+20, +10, −30, +25, +10, 0, +20, −30, +10, +50

Note that it does not matter whether the first set of scores is subtracted from the second or the second from the first as long as it is done the same way for all participants. In this example, it makes sense to subtract the pretest estimates from the posttest estimates so that positive difference scores mean that the estimates went up after the training and negative difference scores mean the estimates went down.

The mean of the difference scores is 8.50 with a standard deviation of 24.27. The health psychologist can now compute the  t  score for his sample as follows:

\[t=\dfrac{8.5-0}{\left(\dfrac{24.27}{\sqrt{10}}\right)}=1.11\]

If he enters the data into one of the online analysis tools or uses Excel or SPSS, it would tell him that the one-tailed  p  value for this  t  score (again with 10 − 1 = 9 degrees of freedom) is .148. Because this is greater than .05, he would retain the null hypothesis and conclude that the training program does not increase people’s calorie estimates. If he were to compute the  t  score by hand, he could look at Table 13.2 and see that the critical value of  t for a one-tailed test with 9 degrees of freedom is +1.833. (It is positive this time because he was expecting a positive mean difference score.) The fact that his  t score was less extreme than this critical value would tell him that his  p  value is greater than .05 and that he should fail to reject the null hypothesis.
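For readers checking this example with software, SciPy's paired test reproduces the same numbers (a sketch; `ttest_rel` is SciPy's function, not a procedure named in the chapter):

```python
from scipy import stats

pre  = [230, 250, 280, 175, 150, 200, 180, 210, 220, 190]
post = [250, 260, 250, 200, 160, 200, 200, 180, 230, 240]

# Paired test on the same participants; equivalent to a one-sample
# t test on the difference scores against mu0 = 0
t, p_two = stats.ttest_rel(post, pre)
p_one = p_two / 2  # one-tailed, since t is in the predicted (positive) direction
```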

The Independent-Samples  t  Test

The  independent-samples  t test  is used to compare the means of two separate samples ( M 1  and  M 2 ). The two samples might have been tested under different conditions in a between-subjects experiment, or they could be preexisting groups in a correlational design (e.g., women and men, extraverts and introverts). The null hypothesis is that the means of the two populations are the same: µ 1  = µ 2 . The alternative hypothesis is that they are not the same: µ 1  ≠ µ 2 . Again, the test can be one-tailed if the researcher has good reason to expect the difference goes in a particular direction.

The  t  statistic here is a bit more complicated because it must take into account two sample means, two standard deviations, and two sample sizes. The formula is as follows:

\[t=\dfrac{M_1-M_2}{\sqrt{\dfrac{{SD_1}^2}{n_1}+\dfrac{{SD_2}^2}{n_2}}}\]

Notice that this formula includes squared standard deviations (the variances) that appear inside the square root symbol. Also, lowercase  n 1  and  n 2  refer to the sample sizes in the two groups or conditions (as opposed to capital  N , which generally refers to the total sample size). The only additional thing to know here is that there are  N  − 2 degrees of freedom for the independent-samples  t  test.
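The formula above translates directly into code. This is a minimal sketch working only from summary statistics (the helper name `independent_t` is made up for illustration):

```python
from math import sqrt

def independent_t(m1, sd1, n1, m2, sd2, n2):
    """Independent-samples t from the two means, SDs, and sample sizes."""
    return (m1 - m2) / sqrt(sd1**2 / n1 + sd2**2 / n2)
```

With this definition, the degrees of freedom (N − 2) still have to be tracked separately to look up the critical value or the p value.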

Example Independent-Samples t Test

Now the health psychologist wants to compare the calorie estimates of people who regularly eat junk food with the estimates of people who rarely eat junk food. He believes the difference could come out in either direction so he decides to conduct a two-tailed test. He collects data from a sample of eight participants who eat junk food regularly and seven participants who rarely eat junk food. The data are as follows:

Junk food eaters: 180, 220, 150, 85, 200, 170, 150, 190

Non–junk food eaters: 200, 240, 190, 175, 200, 300, 240

The mean for the junk food eaters is 168.12 with a standard deviation of 41.23. The mean for the non–junk food eaters is 220.71 with a standard deviation of 42.66. He can now compute his  t  score as follows:

\[t=\dfrac{220.71-168.12}{\sqrt{\dfrac{42.66^2}{7}+\dfrac{41.23^2}{8}}}=2.42\]

If he enters the data into one of the online analysis tools or uses Excel or SPSS, it would tell him that the two-tailed  p  value for this  t  score (with 15 − 2 = 13 degrees of freedom) is .031. Because this p value is less than .05, the health psychologist would reject the null hypothesis and conclude that people who eat junk food regularly make lower calorie estimates than people who eat it rarely. If he were to compute the  t  score by hand, he could look at Table 13.2 and see that the critical value of  t  for a two-tailed test with 13 degrees of freedom is ±2.160. The fact that his  t  score was more extreme than this critical value would tell him that his  p  value is less than .05 and that he should reject the null hypothesis.
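The same analysis can be run from the raw data. Note that SciPy's `ttest_ind` defaults to a pooled-variance t, which differs slightly from the chapter's unpooled formula but gives essentially the same result here (a sketch, not part of the original text):

```python
from scipy import stats

junk     = [180, 220, 150, 85, 200, 170, 150, 190]   # n = 8
non_junk = [200, 240, 190, 175, 200, 300, 240]       # n = 7

# Pooled-variance independent-samples t test with N - 2 = 13 df
t, p = stats.ttest_ind(non_junk, junk)
```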

The Analysis of Variance

When there are more than two groups or condition means to be compared, the most common null hypothesis test is the  analysis of variance  (ANOVA) . In this section, we look primarily at the  one-way ANOVA , which is used for between-subjects designs with a single independent variable. We then briefly consider some other versions of the ANOVA that are used for within-subjects and factorial research designs.

One-Way ANOVA

The one-way ANOVA is used to compare the means of more than two samples ( M 1 ,  M 2 … M G ) in a between-subjects design. The null hypothesis is that all the means are equal in the population: µ 1 = µ 2  =…= µ G . The alternative hypothesis is that not all the means in the population are equal.

The test statistic for the ANOVA is called  F . It is a ratio of two estimates of the population variance based on the sample data. One estimate of the population variance is called the  mean squares between groups (MS B )  and is based on the differences among the sample means. The other is called the mean squares within groups (MS W )  and is based on the differences among the scores within each group. The  F  statistic is the ratio of the  MS B  to the  MS W and can therefore be expressed as follows:

F = MS B ÷ MS W

Again, the reason that  F  is useful is that we know how it is distributed when the null hypothesis is true. As shown in Figure 13.2, this distribution is unimodal and positively skewed with values that cluster around 1. The precise shape of the distribution depends on both the number of groups and the sample size, and there is a degrees of freedom value associated with each of these. The between-groups degrees of freedom is the number of groups minus one:  df B  = ( G  − 1). The within-groups degrees of freedom is the total sample size minus the number of groups:  df W  =  N  −  G . Again, knowing the distribution of  F when the null hypothesis is true allows us to find the  p  value.

Figure 13.2 The distribution of the F statistic when the null hypothesis is true. The distribution is unimodal and positively skewed, peaking near 1, with a critical value of approximately 2.8 in this example.

The online tools in  Chapter 12 and statistical software such as Excel and SPSS will compute  F  and find the  p  value. If  p  is less than .05, then we reject the null hypothesis and conclude that there are differences among the group means in the population. If  p  is greater than .05, then we retain the null hypothesis and conclude that there is not enough evidence to say that there are differences. In the unlikely event that we would compute  F  by hand, we can use a table of critical values like Table 13.3 “Table of Critical Values of  F ” to make the decision. The idea is that any  F  ratio greater than the critical value has a  p  value of less than .05. Thus if the  F  ratio we compute is beyond the critical value, then we reject the null hypothesis. If the F ratio we compute is less than the critical value, then we retain the null hypothesis.

Table 13.3 Table of Critical Values of F When α = .05
df W	df B = 2	df B = 3	df B = 4
8 4.459 4.066 3.838
9 4.256 3.863 3.633
10 4.103 3.708 3.478
11 3.982 3.587 3.357
12 3.885 3.490 3.259
13 3.806 3.411 3.179
14 3.739 3.344 3.112
15 3.682 3.287 3.056
16 3.634 3.239 3.007
17 3.592 3.197 2.965
18 3.555 3.160 2.928
19 3.522 3.127 2.895
20 3.493 3.098 2.866
21 3.467 3.072 2.840
22 3.443 3.049 2.817
23 3.422 3.028 2.796
24 3.403 3.009 2.776
25 3.385 2.991 2.759
30 3.316 2.922 2.690
35 3.267 2.874 2.641
40 3.232 2.839 2.606
45 3.204 2.812 2.579
50 3.183 2.790 2.557
55 3.165 2.773 2.540
60 3.150 2.758 2.525
65 3.138 2.746 2.513
70 3.128 2.736 2.503
75 3.119 2.727 2.494
80 3.111 2.719 2.486
85 3.104 2.712 2.479
90 3.098 2.706 2.473
95 3.092 2.700 2.467
100 3.087 2.696 2.463
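Entries in Table 13.3 can be checked against the F distribution's quantile function (a sketch assuming SciPy is available; `f.ppf` is the inverse CDF):

```python
from scipy import stats

# Critical F at alpha = .05 for df_B = 2 and df_W = 21 (compare Table 13.3)
crit = stats.f.ppf(0.95, dfn=2, dfd=21)  # about 3.467
```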

Example One-Way ANOVA

Imagine that the health psychologist wants to compare the calorie estimates of psychology majors, nutrition majors, and professional dieticians. He collects the following data:

Psych majors: 200, 180, 220, 160, 150, 200, 190, 200

Nutrition majors: 190, 220, 200, 230, 160, 150, 200, 210

Dieticians: 220, 250, 240, 275, 250, 230, 200, 240

The means are 187.50 ( SD  = 23.14), 195.00 ( SD  = 27.77), and 238.13 ( SD  = 22.35), respectively. So it appears that dieticians made substantially more accurate estimates on average. The researcher would almost certainly enter these data into a program such as Excel or SPSS, which would compute  F  for him and find the  p  value. Table 13.4 shows the output of the one-way ANOVA function in Excel for these data. This table is referred to as an ANOVA table. It shows that  MS B  is 5,971.88,  MS W  is 602.23, and their ratio,  F , is 9.92. The  p  value is .0009. Because this value is below .05, the researcher would reject the null hypothesis and conclude that the mean calorie estimates for the three groups are not the same in the population. Notice that the ANOVA table also includes the “sum of squares” ( SS ) for between groups and for within groups. These values are computed on the way to finding  MS B  and MS W  but are not typically reported by the researcher. Finally, if the researcher were to compute the  F  ratio by hand, he could look at Table 13.3 and see that the critical value of  F  with 2 and 21 degrees of freedom is 3.467 (the same value in Table 13.4 under  F crit ). The fact that his  F  score was more extreme than this critical value would tell him that his  p  value is less than .05 and that he should reject the null hypothesis.

Table 13.4 Typical One-Way ANOVA Output From Excel
Source of variation	SS	df	MS	F	p value	F crit
Between groups	11,943.75	2	5,971.875	9.916234	0.000928	3.4668
Within groups	12,646.88	21	602.2321
Total	24,590.63	23
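The ANOVA table above can be reproduced from the raw calorie estimates with SciPy's one-way ANOVA function (a sketch; the nutrition group is entered with the eight scores consistent with the reported mean of 195.00 and the table's 21 within-groups degrees of freedom):

```python
from scipy import stats

psych     = [200, 180, 220, 160, 150, 200, 190, 200]
nutrition = [190, 220, 200, 230, 160, 150, 200, 210]
diet      = [220, 250, 240, 275, 250, 230, 200, 240]

# One-way between-subjects ANOVA: F = MS_B / MS_W with 2 and 21 df
f, p = stats.f_oneway(psych, nutrition, diet)  # F about 9.92, p about .0009
```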

ANOVA Elaborations

Post hoc comparisons.

When we reject the null hypothesis in a one-way ANOVA, we conclude that the group means are not all the same in the population. But this can indicate different things. With three groups, it can indicate that all three means are significantly different from each other. Or it can indicate that one of the means is significantly different from the other two, but the other two are not significantly different from each other. It could be, for example, that the mean calorie estimates of psychology majors, nutrition majors, and dieticians are all significantly different from each other. Or it could be that the mean for dieticians is significantly different from the means for psychology and nutrition majors, but the means for psychology and nutrition majors are not significantly different from each other. For this reason, statistically significant one-way ANOVA results are typically followed up with a series of  post hoc comparisons  of selected pairs of group means to determine which are different from which others.

One approach to post hoc comparisons would be to conduct a series of independent-samples  t  tests comparing each group mean to each of the other group means. But there is a problem with this approach. In general, if we conduct a  t  test when the null hypothesis is true, we have a 5% chance of mistakenly rejecting the null hypothesis (see Section 13.3 “Additional Considerations” for more on such Type I errors). If we conduct several  t  tests when the null hypothesis is true, the chance of mistakenly rejecting  at least one null hypothesis increases with each test we conduct. Thus researchers do not usually make post hoc comparisons using standard  t  tests because there is too great a chance that they will mistakenly reject at least one null hypothesis. Instead, they use one of several modified  t  test procedures—among them the Bonferroni procedure, Fisher’s least significant difference (LSD) test, and Tukey’s honestly significant difference (HSD) test. The details of these approaches are beyond the scope of this book, but it is important to understand their purpose: to keep the risk of mistakenly rejecting a true null hypothesis to an acceptable level (close to 5%).

Repeated-Measures ANOVA

Recall that the one-way ANOVA is appropriate for between-subjects designs in which the means being compared come from separate groups of participants. It is not appropriate for within-subjects designs in which the means being compared come from the same participants tested under different conditions or at different times. This requires a slightly different approach, called the repeated-measures ANOVA . The basics of the repeated-measures ANOVA are the same as for the one-way ANOVA. The main difference is that measuring the dependent variable multiple times for each participant allows for a more refined measure of  MS W . Imagine, for example, that the dependent variable in a study is a measure of reaction time. Some participants will be faster or slower than others because of stable individual differences in their nervous systems, muscles, and other factors. In a between-subjects design, these stable individual differences would simply add to the variability within the groups and increase the value of  MS W . In a within-subjects design, however, these stable individual differences can be measured and subtracted from the value of  MS W . This lower value of  MS W  means a higher value of  F  and a more sensitive test.

Factorial ANOVA

When more than one independent variable is included in a factorial design, the appropriate approach is the  factorial ANOVA . Again, the basics of the factorial ANOVA are the same as for the one-way and repeated-measures ANOVAs. The main difference is that it produces an  F  ratio and  p  value for each main effect and for each interaction. Returning to our calorie estimation example, imagine that the health psychologist tests the effect of participant major (psychology vs. nutrition) and food type (cookie vs. hamburger) in a factorial design. A factorial ANOVA would produce separate  F  ratios and  p values for the main effect of major, the main effect of food type, and the interaction between major and food. Appropriate modifications must be made depending on whether the design is between subjects, within subjects, or mixed.

Testing Pearson’s  r

For relationships between quantitative variables, where Pearson’s  r  is used to describe the strength of those relationships, the appropriate null hypothesis test is a test of Pearson’s  r . The basic logic is exactly the same as for other null hypothesis tests. In this case, the null hypothesis is that there is no relationship in the population. We can use the Greek lowercase rho (ρ) to represent the relevant parameter: ρ = 0. The alternative hypothesis is that there is a relationship in the population: ρ ≠ 0. As with the  t  test, this test can be two-tailed if the researcher has no expectation about the direction of the relationship or one-tailed if the researcher expects the relationship to go in a particular direction.

It is possible to use Pearson’s  r  for the sample to compute a  t  score with  N  − 2 degrees of freedom and then to proceed as for a  t  test. However, because of the way it is computed, Pearson’s  r  can also be treated as its own test statistic. The online statistical tools and statistical software such as Excel and SPSS generally compute Pearson’s  r  and provide the  p  value associated with that value of Pearson’s  r . As always, if the  p  value is less than .05, we reject the null hypothesis and conclude that there is a relationship between the variables in the population. If the  p  value is greater than .05, we retain the null hypothesis and conclude that there is not enough evidence to say there is a relationship in the population. If we compute Pearson’s  r  by hand, we can use a table like Table 13.5, which shows the critical values of  r  for various degrees of freedom ( N  − 2) when α is .05. A sample value of Pearson’s  r  that is more extreme than the critical value is statistically significant.

Table 13.5 Table of Critical Values of Pearson’s r When α = .05
df ( N  − 2)	Critical value (one-tailed)	Critical value (two-tailed)
5 .805 .878
10 .549 .632
15 .441 .514
20 .378 .444
25 .337 .396
30 .306 .361
35 .283 .334
40 .264 .312
45 .248 .294
50 .235 .279
55 .224 .266
60 .214 .254
65 .206 .244
70 .198 .235
75 .191 .227
80 .185 .220
85 .180 .213
90 .174 .207
95 .170 .202
100 .165 .197

Example Test of Pearson’s  r

Imagine that the health psychologist is interested in the correlation between people’s calorie estimates and their weight. He has no expectation about the direction of the relationship, so he decides to conduct a two-tailed test. He computes the correlation for a sample of 22 university students and finds that Pearson’s  r  is −.21. The statistical software he uses tells him that the  p  value is .348. It is greater than .05, so he retains the null hypothesis and concludes that there is no relationship between people’s calorie estimates and their weight. If he were to compute Pearson’s  r  by hand, he could look at Table 13.5 and see that the critical value for 22 − 2 = 20 degrees of freedom is .444. The fact that Pearson’s  r  for the sample is less extreme than this critical value tells him that the  p  value is greater than .05 and that he should retain the null hypothesis.
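The conversion from r to a t score mentioned earlier can be written out directly (a sketch; the helper name `r_to_t` is made up for illustration, and SciPy supplies the p value):

```python
from math import sqrt
from scipy import stats

def r_to_t(r, n):
    """t score for a sample Pearson's r, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r**2))

t = r_to_t(-0.21, 22)                  # about -0.96
p = 2 * stats.t.sf(abs(t), df=22 - 2)  # two-tailed p, about .348
```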

Key Takeaways

  • To compare two means, the most common null hypothesis test is the  t  test. The one-sample  t  test is used for comparing one sample mean with a hypothetical population mean of interest, the dependent-samples  t  test is used to compare two means in a within-subjects design, and the independent-samples  t  test is used to compare two means in a between-subjects design.
  • To compare more than two means, the most common null hypothesis test is the analysis of variance (ANOVA). The one-way ANOVA is used for between-subjects designs with one independent variable, the repeated-measures ANOVA is used for within-subjects designs, and the factorial ANOVA is used for factorial designs.
  • A null hypothesis test of Pearson’s  r  is used to compare a sample value of Pearson’s  r  with a hypothetical population value of 0.
Exercises

  • Practice: Use one of the online tools, Excel, or SPSS to reproduce the one-sample  t  test, dependent-samples  t  test, independent-samples  t  test, and one-way ANOVA for the four sets of calorie estimation data presented in this section.
  • Practice: A sample of 25 university students rated their friendliness on a scale of 1 ( Much Lower Than Average ) to 7 ( Much Higher Than Average ). Their mean rating was 5.30 with a standard deviation of 1.50. Conduct a one-sample  t test comparing their mean rating with a hypothetical mean rating of 4 ( Average ). The question is whether university students have a tendency to rate themselves as friendlier than average.
  • Practice: Use Table 13.5 to decide whether each of the following sample correlations is statistically significant.
  • The correlation between height and IQ is +.13 in a sample of 35.
  • For a sample of 88 university students, the correlation between how disgusted they felt and the harshness of their moral judgments was +.23.
  • The correlation between the number of daily hassles and positive mood is −.43 for a sample of 30 middle-aged adults.

Glossary

t test: A common null hypothesis test examining the difference between two means.

one-sample t test: Compares a sample mean with a hypothetical population mean that provides some interesting standard of comparison.

test statistic: A statistic that is computed only to help find the p value.

critical values: Points on the test distribution that are compared to the test statistic to determine whether to reject the null hypothesis.

two-tailed test: Where the null hypothesis is rejected if the t score for the sample is extreme in either direction.

one-tailed test: Where the null hypothesis is rejected only if the t score for the sample is extreme in one direction that we specify before collecting the data.

dependent-samples t test: Statistical test used to compare two means for the same sample tested at two different times or under two different conditions.

difference score: Variable formed by subtracting one variable from another.

independent-samples t test: Statistical test used to compare the means of two separate samples.

analysis of variance (ANOVA): Most common null hypothesis test when there are more than two groups or condition means to be compared.

one-way ANOVA: A null hypothesis test that is used for between-subjects designs with a single independent variable.

mean squares between groups (MSB): An estimate of population variance based on the differences among the sample means.

mean squares within groups (MSW): An estimate of population variance based on the differences among the scores within each group.

post hoc comparisons: Analysis of selected pairs of group means to determine which are different from which others.

repeated-measures ANOVA: An ANOVA for within-subjects designs in which the dependent variable is measured multiple times for each participant, allowing a more refined measure of MSW.

factorial ANOVA: A null hypothesis test that is used when more than one independent variable is included in a factorial design.

Research Methods in Psychology - 2nd Canadian Edition Copyright © 2015 by Paul C. Price, Rajiv Jhangiani, & I-Chant A. Chiang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


13.1 Understanding Null Hypothesis Testing

Learning Objectives

  • Explain the purpose of null hypothesis testing, including the role of sampling error.
  • Describe the basic logic of null hypothesis testing.
  • Describe the role of relationship strength and sample size in determining statistical significance and make reasonable judgments about statistical significance based on these two factors.

The Purpose of Null Hypothesis Testing

As we have seen, psychological research typically involves measuring one or more variables in a sample and computing descriptive statistics for that sample. In general, however, the researcher’s goal is not to draw conclusions about that sample but to draw conclusions about the population that the sample was selected from. Thus researchers must use sample statistics to draw conclusions about the corresponding values in the population. These corresponding values in the population are called  parameters . Imagine, for example, that a researcher measures the number of depressive symptoms exhibited by each of 50 adults with clinical depression and computes the mean number of symptoms. The researcher probably wants to use this sample statistic (the mean number of symptoms for the sample) to draw conclusions about the corresponding population parameter (the mean number of symptoms for adults with clinical depression).

Unfortunately, sample statistics are not perfect estimates of their corresponding population parameters. This is because there is a certain amount of random variability in any statistic from sample to sample. The mean number of depressive symptoms might be 8.73 in one sample of adults with clinical depression, 6.45 in a second sample, and 9.44 in a third—even though these samples are selected randomly from the same population. Similarly, the correlation (Pearson’s  r ) between two variables might be +.24 in one sample, −.04 in a second sample, and +.15 in a third—again, even though these samples are selected randomly from the same population. This random variability in a statistic from sample to sample is called  sampling error . (Note that the term error  here refers to random variability and does not imply that anyone has made a mistake. No one “commits a sampling error.”)

One implication of this is that when there is a statistical relationship in a sample, it is not always clear that there is a statistical relationship in the population. A small difference between two group means in a sample might indicate that there is a small difference between the two group means in the population. But it could also be that there is no difference between the means in the population and that the difference in the sample is just a matter of sampling error. Similarly, a Pearson’s  r  value of −.29 in a sample might mean that there is a negative relationship in the population. But it could also be that there is no relationship in the population and that the relationship in the sample is just a matter of sampling error.

In fact, any statistical relationship in a sample can be interpreted in two ways:

  • There is a relationship in the population, and the relationship in the sample reflects this.
  • There is no relationship in the population, and the relationship in the sample reflects only sampling error.

The purpose of null hypothesis testing is simply to help researchers decide between these two interpretations.

The Logic of Null Hypothesis Testing

Null hypothesis testing  is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the  null hypothesis  (often symbolized  H 0  and read as “H-naught”). This is the idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error. Informally, the null hypothesis is that the sample relationship “occurred by chance.” The other interpretation is called the  alternative hypothesis  (often symbolized as  H 1 ). This is the idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

Again, every statistical relationship in a sample can be interpreted in either of these two ways: It might have occurred by chance, or it might reflect a relationship in the population. So researchers need a way to decide between them. Although there are many specific null hypothesis testing techniques, they are all based on the same general logic. The steps are as follows:

  • Assume for the moment that the null hypothesis is true. There is no relationship between the variables in the population.
  • Determine how likely the sample relationship would be if the null hypothesis were true.
  • If the sample relationship would be extremely unlikely, then reject the null hypothesis  in favor of the alternative hypothesis. If it would not be extremely unlikely, then  retain the null hypothesis .
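The steps above can be illustrated with a small simulation (hypothetical data and a permutation approach, used here only to show the logic; this is not a procedure from the text):

```python
import random

random.seed(0)

# Hypothetical scores for two small groups
a = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7]
b = [4.2, 5.0, 4.6, 5.3, 4.4, 4.8]
observed = sum(a) / len(a) - sum(b) / len(b)

# Step 1: assume H0 is true -- the group labels are arbitrary.
# Step 2: see how often random relabelings produce a difference
# at least as extreme as the observed one.
pooled = a + b
extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    diff = sum(pooled[:6]) / 6 - sum(pooled[6:]) / 6
    if abs(diff) >= abs(observed):
        extreme += 1

p = extreme / trials
# Step 3: reject H0 if the sample result would be extremely
# unlikely under H0 (e.g., p < .05); otherwise retain H0.
```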

Following this logic, we can begin to understand why Mehl and his colleagues concluded that there is no difference in talkativeness between women and men in the population. In essence, they asked the following question: “If there were no difference in the population, how likely is it that we would find a small difference of  d  = 0.06 in our sample?” Their answer to this question was that this sample relationship would be fairly likely if the null hypothesis were true. Therefore, they retained the null hypothesis—concluding that there is no evidence of a sex difference in the population. We can also see why Kanner and his colleagues concluded that there is a correlation between hassles and symptoms in the population. They asked, “If the null hypothesis were true, how likely is it that we would find a strong correlation of +.60 in our sample?” Their answer to this question was that this sample relationship would be fairly unlikely if the null hypothesis were true. Therefore, they rejected the null hypothesis in favor of the alternative hypothesis—concluding that there is a positive correlation between these variables in the population.

A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the  p value . A low  p  value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A p  value that is not low means that the sample result would be likely if the null hypothesis were true and leads to the retention of the null hypothesis. But how low must the  p  value be before the sample result is considered unlikely enough to reject the null hypothesis? In null hypothesis testing, this criterion is called  α (alpha)  and is almost always set to .05. If there is a 5% chance or less of a result as extreme as the sample result if the null hypothesis were true, then the null hypothesis is rejected. When this happens, the result is said to be  statistically significant . If there is greater than a 5% chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained. This does not necessarily mean that the researcher accepts the null hypothesis as true—only that there is not currently enough evidence to reject it. Researchers often use the expression “fail to reject the null hypothesis” rather than “retain the null hypothesis,” but they never use the expression “accept the null hypothesis.”

The Misunderstood  p  Value

The  p  value is one of the most misunderstood quantities in psychological research (Cohen, 1994) [1] . Even professional researchers misinterpret it, and it is not unusual for such misinterpretations to appear in statistics textbooks!

The most common misinterpretation is that the  p  value is the probability that the null hypothesis is true—that the sample result occurred by chance. For example, a misguided researcher might say that because the  p  value is .02, there is only a 2% chance that the result is due to chance and a 98% chance that it reflects a real relationship in the population. But this is incorrect . The  p  value is really the probability of a result at least as extreme as the sample result  if  the null hypothesis  were  true. So a  p  value of .02 means that if the null hypothesis were true, a sample result this extreme would occur only 2% of the time.

You can avoid this misunderstanding by remembering that the p value is not the probability that any particular hypothesis is true or false. Instead, it is the probability of obtaining the sample result if the null hypothesis were true.
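This distinction can be made concrete with a small simulation (an illustration, not from the text): even when every test uses α = .05, the share of significant results for which the null hypothesis was actually true need not be 5%. The prior proportion of true nulls (0.5) and the assumed statistical power (0.35) below are hypothetical values chosen for the example.

```python
import random

# Illustrative simulation: P(result | H0 true) is not P(H0 true | result).
# Assumed setup: half of all studies test a true null; when the null is
# false, the test detects the effect with an assumed power of 0.35.
random.seed(1)

ALPHA, POWER, N_STUDIES = 0.05, 0.35, 200_000

false_positives = 0  # significant results where H0 was actually true
true_positives = 0   # significant results where H0 was actually false

for _ in range(N_STUDIES):
    h0_true = random.random() < 0.5
    if h0_true:
        significant = random.random() < ALPHA  # by definition of alpha
    else:
        significant = random.random() < POWER
    if significant:
        if h0_true:
            false_positives += 1
        else:
            true_positives += 1

# Among significant results, the share where H0 was true is well above 5%
share_h0_true = false_positives / (false_positives + true_positives)
print(round(share_h0_true, 3))  # roughly 0.125 under these assumptions
```

Under these assumed numbers, about one in eight "significant" results comes from a true null, which is why a p value of .05 cannot be read as "a 5% chance the result is due to chance."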


“Null Hypothesis” retrieved from http://imgs.xkcd.com/comics/null_hypothesis.png (CC-BY-NC 2.5)

Role of Sample Size and Relationship Strength

Recall that null hypothesis testing involves answering the question, “If the null hypothesis were true, what is the probability of a sample result as extreme as this one?” In other words, “What is the p value?” It can be helpful to see that the answer to this question depends on just two considerations: the strength of the relationship and the size of the sample. Specifically, the stronger the sample relationship and the larger the sample, the less likely the result would be if the null hypothesis were true. That is, the lower the p value. This should make sense. Imagine a study in which a sample of 500 women is compared with a sample of 500 men in terms of some psychological characteristic, and Cohen’s d is a strong 0.50. If there were really no sex difference in the population, then a result this strong based on such a large sample should seem highly unlikely. Now imagine a similar study in which a sample of three women is compared with a sample of three men, and Cohen’s d is a weak 0.10. If there were no sex difference in the population, then a relationship this weak based on such a small sample should seem likely. And this is precisely why the null hypothesis would be rejected in the first example and retained in the second.

Of course, sometimes the result can be weak and the sample large, or the result can be strong and the sample small. In these cases, the two considerations trade off against each other so that a weak result can be statistically significant if the sample is large enough and a strong relationship can be statistically significant even if the sample is small. Table 13.1 shows roughly how relationship strength and sample size combine to determine whether a sample result is statistically significant. The columns of the table represent the three levels of relationship strength: weak, medium, and strong. The rows represent four sample sizes that can be considered small, medium, large, and extra large in the context of psychological research. Thus each cell in the table represents a combination of relationship strength and sample size. If a cell contains the word Yes, then this combination would be statistically significant for both Cohen’s d and Pearson’s r. If it contains the word No, then it would not be statistically significant for either. There is one cell where the decision for d and r would be different and another where it might be different depending on some additional considerations, which are discussed in Section 13.2 “Some Basic Null Hypothesis Tests.”

Table 13.1

Sample Size            Weak               Medium   Strong
Small (N = 20)         No                 No       d = Maybe, r = Yes
Medium (N = 50)        No                 Yes      Yes
Large (N = 100)        d = Yes, r = No    Yes      Yes
Extra large (N = 500)  Yes                Yes      Yes

Although Table 13.1 provides only a rough guideline, it shows very clearly that weak relationships based on medium or small samples are never statistically significant and that strong relationships based on medium or larger samples are always statistically significant. If you keep this lesson in mind, you will often know whether a result is statistically significant based on the descriptive statistics alone. It is extremely useful to be able to develop this kind of intuitive judgment. One reason is that it allows you to develop expectations about how your formal null hypothesis tests are going to come out, which in turn allows you to detect problems in your analyses. For example, if your sample relationship is strong and your sample is medium, then you would expect to reject the null hypothesis. If for some reason your formal null hypothesis test indicates otherwise, then you need to double-check your computations and interpretations. A second reason is that the ability to make this kind of intuitive judgment is an indication that you understand the basic logic of this approach in addition to being able to do the computations.
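The intuition behind Table 13.1 can be sketched numerically (an illustration, not part of the text): for a two-group comparison with n participants per group, Cohen's d converts to a test statistic of roughly z = d·√(n/2), and the two-sided p value follows from the standard normal distribution. This is a large-sample approximation, so the tiny-sample result below is only directionally meaningful.

```python
from math import sqrt
from statistics import NormalDist

def approx_p_value(d: float, n_per_group: int) -> float:
    """Approximate two-sided p value for a two-group comparison with
    effect size d and n participants per group (normal approximation)."""
    z = abs(d) * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(z))

print(approx_p_value(0.50, 500))  # strong effect, large sample: p far below .05
print(approx_p_value(0.10, 3))    # weak effect, tiny sample: p close to 1
```

The same formula shows the trade-off in the text: a weak effect can reach p < .05 with a large enough n, and a strong effect can fail to with a very small n.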

Statistical Significance Versus Practical Significance

Table 13.1 illustrates another extremely important point. A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. This is closely related to Janet Shibley Hyde’s argument about sex differences (Hyde, 2007) [2]. The differences between women and men in mathematical problem solving and leadership ability are statistically significant. But the word significant can cause people to interpret these differences as strong and important—perhaps even important enough to influence the college courses they take or even who they vote for. As we have seen, however, these statistically significant differences are actually quite weak—perhaps even “trivial.”

This is why it is important to distinguish between the statistical significance of a result and the practical significance of that result. Practical significance refers to the importance or usefulness of the result in some real-world context. Many sex differences are statistically significant—and may even be interesting for purely scientific reasons—but they are not practically significant. In clinical practice, this same concept is often referred to as “clinical significance.” For example, a study on a new treatment for social phobia might show that it produces a statistically significant positive effect. Yet this effect still might not be strong enough to justify the time, effort, and other costs of putting it into practice—especially if easier and cheaper treatments that work almost as well already exist. Although statistically significant, this result would be said to lack practical or clinical significance.


“Conditional Risk” retrieved from http://imgs.xkcd.com/comics/conditional_risk.png (CC-BY-NC 2.5)

Key Takeaways

  • Null hypothesis testing is a formal approach to deciding whether a statistical relationship in a sample reflects a real relationship in the population or is just due to chance.
  • The logic of null hypothesis testing involves assuming that the null hypothesis is true, finding how likely the sample result would be if this assumption were correct, and then making a decision. If the sample result would be unlikely if the null hypothesis were true, then it is rejected in favor of the alternative hypothesis. If it would not be unlikely, then the null hypothesis is retained.
  • The probability of obtaining the sample result if the null hypothesis were true (the p value) is based on two considerations: relationship strength and sample size. Reasonable judgments about whether a sample relationship is statistically significant can often be made by quickly considering these two factors.
  • Statistical significance is not the same as relationship strength or importance. Even weak relationships can be statistically significant if the sample size is large enough. It is important to consider relationship strength and the practical significance of a result in addition to its statistical significance.
Exercises

  • Discussion: Imagine a study showing that people who eat more broccoli tend to be happier. Explain for someone who knows nothing about statistics why the researchers would conduct a null hypothesis test.
  • Practice: Use Table 13.1 to decide whether each of the following results is statistically significant:
      • The correlation between two variables is r = −.78 based on a sample size of 137.
      • The mean score on a psychological characteristic for women is 25 (SD = 5) and the mean score for men is 24 (SD = 5). There were 12 women and 10 men in this study.
      • In a memory experiment, the mean number of items recalled by the 40 participants in Condition A was 0.50 standard deviations greater than the mean number recalled by the 40 participants in Condition B.
      • In another memory experiment, the mean scores for participants in Condition A and Condition B came out exactly the same!
      • A student finds a correlation of r = .04 between the number of units the students in his research methods class are taking and the students’ level of stress.
  • Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
  • Hyde, J. S. (2007). New directions in the study of gender similarities and differences. Current Directions in Psychological Science, 16, 259–263.

Blown out of Proportion? Housing Prices, Home Prices and New Tenant Rents

Housing services — which include tenants' rents and imputed homeowners' rents — continue to be a major contributor to inflation relative to the share of housing in personal consumption expenditure (PCE). As shown in Figure 1 below, rent and homeowners' imputed rents made up 33 percent of year-over-year PCE inflation in June, despite only accounting for 15.4 percent of the PCE spending basket. When considering the outlook for future housing services prices, economists have found it useful to examine the ratio of housing services prices to two other prices: home prices and rents for new tenants. In this week's post, we look at what these data might be indicating for the near-term outlook of housing services prices.

Figure 1: Contributions to Year-Over-Year PCE Inflation

Combined line and column chart showing how much different variables contributed to year-over-year PCE inflation since January 2017. The different variables being represented by columns are services, food, goods, energy, and housing.

Source: Author's calculations using Bureau of Economic Analysis data via Haver Analytics

Figure 2 below plots the ratio of housing services prices to home prices as measured by the national S&P CoreLogic Case-Shiller Home Price Index (CSHPI). The ratio's May reading of 0.42 is low relative to its long-run 1975-2024 historical distribution: 1.9 standard deviations below the long-run average of 0.59. Assuming this ratio returns to its pre-pandemic level, the latest low readings potentially indicate further upside risk for housing services prices. If homebuying prices remain unchanged, tenant and homeowners' imputed rents would have to increase for this ratio to return to pre-pandemic levels.
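The standardization quoted above can be checked with back-of-the-envelope arithmetic (this is an illustration, not the author's code): given the long-run average of 0.59 and the statement that the May reading of 0.42 sits 1.9 standard deviations below it, the implied historical standard deviation of the ratio follows directly.

```python
# Recover the implied standard deviation from the quoted z-score:
# z = (latest - mean) / sd  =>  sd = (latest - mean) / z
long_run_mean = 0.59
latest = 0.42
z = -1.9

implied_sd = (latest - long_run_mean) / z  # about 0.089
print(round(implied_sd, 3))

# Standardizing the latest reading with that sd recovers the quoted figure
print(round((latest - long_run_mean) / implied_sd, 1))  # -1.9
```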

Figure 2: Ratio of Housing Services Prices to Case-Shiller Home Price Index

Line graph showing the ratio of housing services prices to the Case-Shiller Home Price Index since January 1975. Each recession is highlighted for awareness.

Source: Bureau of Economic Analysis, Standard & Poor's via Haver Analytics

In contrast, examining the ratio of housing services prices to new tenant rents hints at a more optimistic outlook for housing services price growth. (Note: An earlier Macro Minute post discusses the relationship between new rents and the rent measures captured in official inflation indexes.) In the latest data from the second quarter of 2024, the ratio of housing services prices to the new tenant rent index (NTRI) was 0.68 compared to a pre-pandemic ratio of 0.67 and a long-run (since 2005) average of 0.66. Assuming mean-reversion in this ratio and adjustment on the part of the housing services price index, this suggests housing services prices could potentially fall in the near term — although the difference between the latest reading and the pre-pandemic benchmark is small.

Figure 3: Ratio of Housing Services Prices to New Tenant Rent Index

Line graph showing the ratio of housing services prices to the new tenant rent index since the first quarter of 2005. Each recession is highlighted for awareness.

Source: Bureau of Economic Analysis, Bureau of Labor Statistics via Haver Analytics

When considering the overall balance of risks to the outlook, how does one decide between the upside risk to housing services prices noted in Figure 2 and the downside risk indicated in Figure 3? A key assumption we made in looking at both ratios was that both series tend to mean-revert. (In other words, they tend to return to their long-run average over time.) We can judge which of the two ratios is more informative by testing whether they have demonstrated this property in the past.

One way to check a series for mean-reversion is to test whether the series is stationary (exhibiting a constant mean and variance over time) or non-stationary (not converging toward any particular level over time). Table 1 below reports the results of three statistical tests that evaluate whether a series exhibits non-stationarity. The Augmented Dickey-Fuller (ADF) and Phillips-Perron (PP) tests evaluate a null hypothesis that a series exhibits a kind of non-stationary behavior called a unit root. The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test evaluates the null hypothesis that a series is stationary.

           Ratio of housing services prices to CSHPI          Ratio of housing services prices to NTRI
ADF test   Unable to reject unit root hypothesis (p = 0.23)   Reject unit root hypothesis (p = 0.04)
PP test    Unable to reject unit root hypothesis (p = 0.88)   Unable to reject unit root hypothesis (p = 0.50)
KPSS test  Reject stationarity hypothesis (p < 0.01)          Unable to reject stationarity hypothesis (p > 0.10)
Source: Author's calculations

Table 1 shows that all three tests suggest that the ratio of housing services prices to the CSHPI is non-stationary and therefore does not exhibit mean-reverting behavior. In contrast, two of the three tests suggest the ratio of housing services prices to the NTRI may be stationary. Based on these results, economic forecasters might want to put more emphasis on the latter ratio when forming an outlook for housing services prices.
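The core idea behind these unit-root tests can be sketched in a few lines of numpy (simulated data, not the article's series, and a bare Dickey-Fuller-style regression rather than the full ADF/PP/KPSS procedures): regress the period-to-period change on the lagged level. A clearly negative slope pulls the series back toward its mean (mean-reverting), while a slope near zero means shocks persist (a unit root).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Mean-reverting AR(1) series: x_t = 0.8 * x_{t-1} + e_t
shocks = rng.standard_normal(n)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + shocks[t]

# Random walk: pure accumulation of shocks (a unit root)
walk = np.cumsum(rng.standard_normal(n))

def level_slope(x):
    """OLS slope of diff(x) on the lagged level of x
    (the Dickey-Fuller regression with no extra lags or trend)."""
    lagged, diff = x[:-1], np.diff(x)
    return np.polyfit(lagged, diff, 1)[0]

print(level_slope(ar1))   # near 0.8 - 1 = -0.2: evidence of mean reversion
print(level_slope(walk))  # near 0: consistent with a unit root
```

The formal tests in Table 1 add lag corrections and proper critical values, but the forecasting implication is the same one drawn in the text: only a series with a clearly negative level coefficient can be counted on to return to its average.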

The challenge in dealing with ratios is that mean-reversion could be driven by either the numerator or the denominator. To identify which of the two price indexes is likely to be doing the adjustment, I estimate an error correction model for both series and use it to construct a five-year forecast. (Note: The model is described in greater detail in an earlier post that looks at the relationship between the CPI and the PPI.) The forecast is plotted below in Figure 4.
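A stylized error correction model on simulated data (an illustration of the approach named above, not the author's model) shows how the attribution works: two series share a long-run link, and the speed-of-adjustment coefficient on the lagged gap reveals which series does the correcting when the ratio drifts from its long-run level.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

x = np.cumsum(rng.standard_normal(n))  # "denominator" index (a random walk)
gap = np.zeros(n)                      # stationary deviation from the long-run link
for t in range(1, n):
    gap[t] = 0.5 * gap[t - 1] + rng.standard_normal()
y = x + gap                            # "numerator" index tied to x in the long run

# ECM regression: delta_y_t = alpha * (y - x)_{t-1} + b * delta_x_t + const + e_t
dy, dx, ect = np.diff(y), np.diff(x), (y - x)[:-1]
X = np.column_stack([ect, dx, np.ones(n - 1)])
alpha, b, const = np.linalg.lstsq(X, dy, rcond=None)[0]

print(alpha)  # negative: y adjusts back toward the long-run relationship
```

A significantly negative alpha in the equation for one series (and not the other) is what licenses the conclusion in the text that the adjustment runs through that series rather than its counterpart.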

Figure 4: Forecast of Housing Services Prices and Ratio of Housing Services Prices to NTRI

Line graph comparing the forecast of housing services prices to the ratio of housing services prices to the new tenant rent index since the first quarter of 2005.  The forecast extends to the first quarter of 2029.

Source: Author's calculations using Bureau of Economic Analysis and Bureau of Labor Statistics data via Haver Analytics

The model forecasts that the process of mean-reversion in the ratio could be delayed, with the ratio of housing services prices to NTRI rising through the middle of 2025. The model also projects that the housing services price index will stall in the near term before resuming an upward trajectory similar to its pre-pandemic pace. This indicates that the projected near-term increase in the ratio is driven by declines in the NTRI through mid-2025. Thereafter, the ratio of housing services prices to NTRI is projected to decline toward its long-run average as NTRI growth outpaces growth in housing services prices.

These results show the trickiness of forecasting based on a narrative of mean-reversion in ratios: The process of mean-reversion doesn't always kick in immediately, and adjustments in ratios depend on the interplay of the numerator and the denominator.

Views expressed in this article are those of the author and not necessarily those of the Federal Reserve Bank of Richmond or the Federal Reserve System.


ORIGINAL RESEARCH article

Does capital marketization promote better rural industrial integration? Evidence from China

Zhao Ding

  • College of Economics, Sichuan Agricultural University, Chengdu, China

Introduction: Although rural industrial integration is a crucial pathway for advancing the revitalization of rural economies, it continues to grapple with financial challenges. This paper delves into the theoretical underpinnings of how capital marketization influences rural industrial integration.

Methods: Using panel data from China’s provinces spanning the years 2010 to 2020, a comprehensive index of rural industrial integration is constructed from the vantage point of a new development paradigm. The paper employs the system GMM method to empirically investigate the impact of capital marketization on rural industrial integration and to dissect its transmission mechanisms. Additionally, a threshold regression model is applied to explore the specific patterns of the nonlinear relationship between the two variables.

Results and discussion: The study’s findings reveal that the degree of rural industrial integration is significantly and positively influenced by its previous level, demonstrating an accumulative effect wherein the prior level of integration lays the groundwork for future advancements. The influence of capital marketization on the degree of rural industrial integration is characterized by a non-linear relationship, adhering to a “U-shaped” curve. Below the inflection point, the development of capital marketization is detrimental to rural industrial integration, whereas above this point, it exerts a positive influence. Currently, China’s overall level of capital marketization is positioned beyond the inflection point, indicating substantial potential for enhancing industry integration in rural China. In addition, the study indicates that at very low levels of economic development, capital marketization does not affect the development of rural industries. As the economic development level rises, so does the impact of capital marketization on rural industrial integration.

1 Introduction

The promotion of the integration of the primary, secondary, and tertiary industries in rural areas (hereinafter referred to as “rural industry integration”) is a pivotal measure for the revitalization of rural regions. Capital marketization has risen as an effective approach to mitigate the financial challenges encountered in this integration process. Historically viewed, industry integration represents an inevitable trend in the development trajectory of rural industries. Since the initiation of China’s rural reform in 1978, marked by the introduction of the household contract responsibility system, the essence of rural reform has centered on the realignment of production relations. This has significantly bolstered the dynamism of rural agricultural development and disrupted the previously isolated status of various agricultural processes. In 1992, China outlined the goal of establishing a socialist market economy system, placing increased emphasis on the regulating role of the market in rural economic development. As the market economy evolved, the traditional fragmented farming practices became insufficient to satisfy evolving development demands. To reconcile the disparity between “small-scale farmers” and the “large-scale market,” an integrated agricultural industrial operation model emerged in China’s rural areas. This model, grounded in family contract farming, encompasses the entire spectrum from production to processing to sales. The exchange of factors between urban and rural areas intensified, propelling swift rural industrial development and fostering tighter integration among the primary, secondary, and tertiary sectors. In 2015, China proposed the concept of advancing rural industry integration, underscoring its importance as a cornerstone in the construction of a modern agricultural industry system. 
In 2018, China reaffirmed its dedication to fostering the integration of the primary, secondary, and tertiary industries in rural areas, vigorously advancing the development of agricultural modernization and the realization of the rural revitalization strategy.

At this stage, China’s agricultural industry chain is continuously expanding, and the entities involved in rural industry integration are growing more diverse and robust. The emergence of new agricultural industries and innovative formats is accelerating, with novel models for rural industry integration continually being developed and explored. The development of rural industry integration has become an essential pathway for the progress of social production in the contemporary era. It is also an imperative for the transformation and modernization of rural economies, a vital strategy for fostering integrated urban–rural development, a key driver for structural reform on the agricultural supply side, and a critical means to ensure sustained income growth for farmers ( Chen et al., 2020 ; Zhang et al., 2023 ). China’s rural industry integration is now at a pivotal juncture, transitioning from an initial exploratory phase to a period of rapid acceleration. However, this complex endeavor confronts a multitude of challenges. The most prominent of these is the presence of bottleneck constraints on various factors, particularly the significant shortfall in capital support. This financial shortfall has plunged rural industry integration into a profound predicament.

Challenges in the agricultural and rural sectors, including difficulties in securing financing, high costs of borrowing, and sluggish lending processes, underscore the importance of financial support as a vital catalyst for the advancement of rural industry integration. Strengthening this support is fundamentally linked to the enhancement of rural capital market development ( Lopez and Winkler, 2018 ). However, within an environment characterized by imperfect competition, the marketization of capital elements could potentially skew the allocation of production factors towards industry integration models that are more responsive to market demands. Paradoxically, this dynamic may, in fact, impede the progress of rural industry integration. Existing research suggests that government support ( Steiner and Teasdale, 2019 ), social capital ( Lang and Fink, 2019 ), financial services ( Khanal and Omobitan, 2020 ), digital technology ( Cowie et al., 2020 ), among others, can significantly enhance agricultural performance and promote the development of rural industry integration. Nonetheless, there remains a dearth of research elucidating the precise mechanisms through which the marketization of capital influences the development of rural industry integration.

The primary objective of this paper is to examine the influence of capital marketization on the development of rural industry integration. It aims to assess whether capital marketization can effectively alleviate the financial constraints faced by rural industries during the integration process and to clarify the mechanisms by which it influences this process. The article makes three contributions. Firstly, it measures rural industry integration using the new development concept, which includes five dimensions: innovation, coordination, green development, openness, and sharing. Secondly, it uncovers the mechanisms through which the marketization of capital influences rural industry integration, investigating the theoretical basis for the dynamic process of current capital market reforms in alleviating the financial challenges of rural industry integration development. Thirdly, it employs the System GMM method to verify the effects of the marketization of capital elements in unleashing the potential of rural industry integration development, clarifying the role and impact pathways of the marketization of capital elements on rural industry integration.
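The threshold regression named in the methods can be illustrated with a toy numpy example (simulated data and hypothetical variable names, not the paper's estimation): the effect of a regressor on the outcome switches at an unknown cut point in a threshold variable, and the cut point is chosen by grid search to minimize the residual sum of squares.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

q = rng.uniform(0, 10, n)    # threshold variable (e.g., economic development level)
x = rng.standard_normal(n)   # regressor (e.g., capital marketization)
true_c = 6.0
# The slope on x switches from -0.5 to 1.5 once q crosses the threshold
y = np.where(q <= true_c, -0.5, 1.5) * x + rng.standard_normal(n)

def ssr_at(c):
    """Fit separate slopes below/above candidate threshold c; return total SSR."""
    total = 0.0
    for mask in (q <= c, q > c):
        slope = np.dot(x[mask], y[mask]) / np.dot(x[mask], x[mask])
        resid = y[mask] - slope * x[mask]
        total += np.dot(resid, resid)
    return total

candidates = np.linspace(1, 9, 161)
best_c = min(candidates, key=ssr_at)
print(best_c)  # close to the true threshold of 6.0
```

This is the same logic that lets the paper report a regime change in the effect of capital marketization once economic development passes an estimated threshold, with slopes estimated separately on either side of the cut point.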

The rest of the paper is structured as follows. Section 2 presents a comprehensive literature review and Section 3 establishes research hypotheses. Section 4 presents the conceptual framework and the data used in the study. The empirical results are then reported in Section 5. The final section presents concluding remarks and implications.

2 Literature review

Industrial integration typically originates from technological interconnections between various sectors, which in turn leads to the blurring or dissolution of traditional industry boundaries. In the late 1990s, Japanese agricultural expert Naraomi Imamura introduced the concept of the “Sixth Industry,” formally incorporating agriculture into the realm of industrial integration research. In China, rural industry integration is led by innovative business entities, interconnected through a mechanism that fosters shared interests. It is driven by the momentum of technological innovation, institutional innovation, and format innovation, guided by the new development concept of “innovation, coordination, green development, openness, and sharing.” The reform agenda is centered on facilitating the free flow of factors, optimizing the allocation of resources, and achieving an organic integration of industries. The overarching objectives are to enhance agricultural productivity, augment farmers’ incomes, and stimulate rural prosperity.

In recent years, research on rural industrial integration has mainly focused on three areas. Firstly, some studies concentrate on exploring the pathways of rural industry integration. These pathways are crucial for promoting the revitalization of rural industries and achieving a more sustainable village economy ( Qin et al., 2020 ). Key pathways for integration include the integration of crop and livestock farming, the expansion of industrial chains in both upstream and downstream directions, the diversification of agricultural industry functions, the steering role of industrial and commercial capital and leading enterprises, and the establishment of horizontal industrial integration platforms along with the evolution of Internet + agricultural industry ( Zhang et al., 2022 ; Zhou et al., 2023 ). Secondly, scholars have investigated the construction of evaluation index and relevant measurements for assessing the level of rural industrial integration. Existing literature primarily measures rural industry integration from three perspectives. Initially, it evaluates the interaction and socio-economic impacts of the integration between agriculture and related industries, such as the extension of agricultural industry chains, the multifunctionality of agriculture, the development of agricultural service industries, the enhancement of farmers’ income, job creation, and the integration of urban–rural development ( Zhang and Wu, 2022 ). Subsequently, it examines rural industry integration through the lens of its types, such as industrial restructuring, extension, cross-linking, and penetration ( Hao et al., 2023 ). Finally, in light of the new development concept, scholars have developed evaluative frameworks for rural industry development across five dimensions: innovation, coordination, green development, openness, and sharing ( Liu et al., 2018 ; Xue et al., 2018 ). Thirdly, some studies concentrate on the challenges encountered by rural industry integration. 
Despite the positive momentum of recent developments, rural industry integration still faces various difficulties. Many regions in China involved in this integration are grappling with issues such as low levels of integration and superficial integration depths. In the course of rapid urbanization, which is characterized by profound shifts in population, land, and industry dynamics, specific rural areas are universally dealing with a dearth of motivation for industrial development, an intensifying phenomenon of rural land hollowing, weakened grassroots governance structures, a fragile mainstream of rural development, and a scarcity of public infrastructure (Tu et al., 2018). Villages constitute interconnected organic entities with the circulation of resources such as labor, capital, material, and information (Lopez and Winkler, 2018; Li et al., 2019; Zhou et al., 2020). Among these, capital has emerged as a pivotal factor restricting regional development in rural China (Guo et al., 2022), where factor mobility plays a significant role in determining the economic benefits of development (Banerjee et al., 2020).

Accordingly, some studies have focused on the financial challenges faced by rural industry integration, seeking to identify strategies to mitigate these financial difficulties and promote further industrial integration. A significant body of research indicates that the financial challenges in the development of rural industry integration mainly arise from insufficient capital support. Investing industrial and commercial capital into agriculture has been recognized as a potential solution to address the shortage of financial resources (Long et al., 2016). Such capital inflow can provide agriculture with essential inputs such as funding, technological advancements, and skilled personnel (Cofré-Bravo et al., 2019). However, it is noteworthy that increased agricultural productivity may paradoxically lead to capital outflows from rural regions. This occurs as productivity gains can lower interest rates, prompting capital to migrate optimally towards the urban manufacturing sector in search of higher returns (Bustos et al., 2020). Conversely, an alternative perspective from other research suggests that the financial challenges confronting rural industry integration are multifaceted and cannot be solely attributed to capital scarcity. The agricultural sector demands substantial investments that are fraught with high risks and characterized by long gestation periods for returns. Typically, individual operating entities struggle to shoulder these financial burdens on their own, highlighting the need for ongoing innovation in the financial markets to develop tailored rural financial products (Adegbite and Machethe, 2020). Additionally, it is imperative to harness the market’s role in resource allocation effectively. Evidence suggests that market forces have a pronounced impact on industrial integration, particularly in provinces with a more advanced degree of marketization (Tian et al., 2020).
The degree of economic marketization is identified as a pivotal factor in enhancing the efficiency of capital allocation across different regions within China ( Zhang et al., 2021 ).

Although there have been some empirical analyses on marketization, and a significant body of research has explored the construction of evaluation indicators for factor marketization, there is a scarcity of literature directly measuring capital marketization. Existing studies primarily focus on the measurement of factor marketization, land factor marketization, and production factor marketization. Fan et al. (2003) previously developed a marketization index for various provinces in China, including five dimensions: government and market, the ownership structure, goods market development, factors market development and the legal framework. Yan (2007) measured the degree of marketization in China by constructing an index that encompasses the agricultural, industrial, and service sectors. When considering a comprehensive assessment of factor marketization, Zhou and Hall (2019) calculated a relative index of marketization processes across different regions of China, considering five aspects: the relationship between government and market, the development of the non-state-owned economy, the maturity of product market development, the advancement of factor market development, and the establishment of market intermediary organizations and the legal system environment. The urban land marketization level is typically gauged by the proportion of land allocated through tender, auction, and listing relative to the total land supply ( Cheng et al., 2022 ). Regarding rural land marketization, Yao and Wang (2022) used the year 2008 as an indicator of agricultural land marketization in China when the country decided to strengthen the development of the agricultural land transfer market and improve the transfer rate.

In summary, the findings from existing research offer substantial insights for the theoretical analysis within this paper, underscoring the innovative aspects and contributions of this study. On the one hand, market-oriented reforms have emerged as a focal point of current economic development. Yet, the role of capital marketization in facilitating rural industry integration has received scant scholarly attention. Capital marketization, which is distinct from capital itself, encompasses a dynamic process that includes a range of economic, social, legal, and systemic reforms. The marketization of capital is essential for the free flow and rational distribution of capital, particularly in the structuring of rural financial institution networks. These elements are vital to the development process of rural industry integration.

This study employs a dynamic approach to investigate the financial challenges faced by rural industry integration and the mechanisms for their mitigation, offering valuable perspectives on tackling financial issues in the development of rural industry integration in China. This research carries significant implications for the formulation of future policies related to the advancement of rural industry integration in the country. On the other hand, the implementation of the new development philosophy is a vital pathway for China's progress in the new era. As a leading agricultural nation, China's agricultural development must align with and implement the new development philosophy. Currently, there is scarce research that measures the effectiveness of rural industry integration from the perspective of this philosophy. This study takes the new development philosophy as its starting point, formulates indicators to measure rural industry integration, and integrates rural industry integration deeply with the new development philosophy. This approach provides novel empirical evidence to inform the development of targeted financial policies aimed at propelling rural industry development.

3 Theoretical analysis and hypotheses

Summarizing the viewpoints from existing literature, this paper proposes that the impact of capital marketization on rural industry integration is nonlinear, exhibiting both positive and negative aspects. The positive impact comprises direct and mediating effects. The direct effect indicates that capital marketization fosters the development of rural industry integration by improving the efficiency of capital allocation, promoting the mobility of capital, and mitigating risks associated with agricultural production. The mediating effect refers to the indirect roles played by the development of rural finance and the optimization of industrial structure, which shape rural industry integration in the context of capital marketization. Conversely, the negative impact suggests that at low levels of capital marketization, the market’s capacity for integration planning is less than optimal. The marketization process might drive production factors towards configurations more aligned with market needs, thereby inhibiting the development of rural industry integration. Additionally, the facilitative role of capital marketization in rural industry integration is subject to constraints imposed by the threshold of regional economic development (see Figure 1 ).


Figure 1 . Mechanisms of the impact of capital marketization on rural industry integration.

3.1 Direct effects of capital marketization on rural industry integration

The concept of marketization finds its origins in the “Financial Deepening Theory,” initially proposed by Shaw (1973) and McKinnon (1973) . This theoretical framework emerged as a counterpoint to the financial repression policies that were prevalent in some developing countries during that era. The theory championed the liberalization of financial markets, advocating for the easing or even the dissolution of governmental financial controls, and the adoption of market-determined interest rates. These rates were intended to genuinely mirror the market’s supply and demand dynamics for capital. Consequently, the allocation of capital would be steered by market mechanisms, thereby empowering financial markets to effectively contribute to the allocation of resources. This study posits that the Financial Deepening Theory implies a fundamental logic: the more advanced the development of the financial sector, the more effectively it can serve the production sector. Enhanced service leads to improved capital allocation efficiency, which in turn stimulates industrial development and fosters economic growth.

The development of rural industry integration requires a large amount of capital collaboration, indicative of a capital accumulation process. Capital marketization enables the fluid and expeditious movement of capital within the market, directing surplus funds towards sectors that demand capital for growth ( Petry, 2020 ). This process is instrumental in enabling industries or enterprises in need of development to secure financing for innovative integration initiatives. Consequently, this transformation in the developmental approach of rural industries fosters the enhancement of industrial chains and facilitates the realization of rural industry integration. Furthermore, capital marketization significantly improves the efficiency of resource allocation. It does so by attracting additional capital, stimulating the expansion of savings, augmenting the availability of funds, offering investment and financing avenues, easing the financial strain on rural industry integration, and tackling the prevalent issues of “difficulty in securing financing” and “high cost of financing.” Moreover, capital can mitigate the risks associated with the adoption of new technologies, diminish the risk perceptions of investors in agriculture-related sectors, and disperse the concentration of risks inherent in the rural industry integration process, thereby fostering its progression ( Clapp, 2019 ).

However, in scenarios where the level of marketization is insufficient, the interest linkage mechanism within rural industries remains underdeveloped. The process of marketization tends to channel more robust production factors towards integration models that align more closely with market demands. This dynamic may impede small-scale farmers from participating in the modern agricultural system, thereby obstructing the overall advancement of rural industry integration. Additionally, considering the diverse sectors involved in rural industry integration and their complex interconnections, an inadequate level of marketization impairs the market’s capacity to effectively integrate and strategically plan capital allocation ( Liu et al., 2023 ). Based on the analysis above, this paper proposes Hypothesis 1:

H1: The influence of capital marketization on rural industry integration is nonlinear. At low levels of capital marketization, the marketization process inhibits rural industry integration. In contrast, at high levels of capital marketization, the marketization process is expected to foster rural industry integration.

3.2 Mediating effects of rural financial development and industrial structure optimization

In China's rural financial sector, market failure and inefficient government intervention are prevalent issues. The market environment in rural finance is not yet fully mature, and a commitment to market-oriented reforms can enhance the rural financial market environment ( Han, 2020 ). Such reforms have the potential to augment the provision of financial support for rural revitalization, tackle institutional and technological impediments in rural financial development, and bolster the efficiency of rural financial services ( Yaseen et al., 2018 ). The marketization process represents an efficient mechanism for resource allocation, preventing significant distortions in the distribution of rural financial resources. It addresses the capital requirements for rural economic development at a fundamental level and promotes the advancement of rural finance. Advancements in marketization can stimulate innovation in rural financial products and services, broaden the reach of financial services, amplify the scope of agricultural insurance, and bolster the development of rural industry integration. Moreover, the enhancement of marketization can standardize transaction systems within factor and product markets, ensuring the rational allocation of resources. This drives the optimization and upgrading of the industrial structure. A more rational industrial structure can encourage the reallocation of surplus rural labor from agriculture to the secondary and tertiary sectors ( Long et al., 2016 ). Consequently, this reallocation can raise both urban and rural income levels, thereby nurturing the development of rural industry integration. Based on the above analysis, we propose Hypothesis 2:

H2: Capital marketization is posited to influence rural industry integration by fostering the development of rural finance and by driving the optimization and upgrading of the industrial structure.

3.3 Threshold effect of regional economic development

The progression of rural industry integration is influenced not only by the development of rural finance and the composition of industry but also significantly by the level of local economic development. In regions where the economic development is comparatively advanced, capital marketization can enhance the mobility of capital and effectuate rational resource allocation, thereby actively fostering the development of rural industry integration ( Xu and Tan, 2020 ). On the contrary, in regions with lower levels of economic development, there may be a pervasive financial conundrum stemming from capital scarcity, and external capital might be disinclined to invest in areas with less robust economic development. In these contexts, the scope of resources that capital marketization can effectively allocate is constrained, which can substantially impede its capacity to promote rural industry integration. Therefore, this paper proposes Hypothesis 3:

H3: There exists a threshold effect of the regional economic development on the promotion of rural industry integration by capital marketization.

4 Methodology

4.1 Econometric model specification

The econometric model in this study is composed of three components. Firstly, a dynamic panel model is used to analyze whether the current level of rural industry integration is influenced by capital marketization and the previous level of rural industry integration. Secondly, a mediation effects model is employed to further verify the mechanism through which capital marketization affects rural industry integration. Thirdly, a threshold effects model is introduced, integrating the level of economic development as a threshold variable. This model is designed to explore the conditional nature of the relationship between capital marketization and rural industry integration, taking into account the potential threshold effects that economic development may impose on this dynamic.

4.1.1 Dynamic panel data model

This study examines the impact of capital marketization on the level of rural industry integration by employing the level of rural industry integration (ind) as the dependent variable and the level of capital marketization (cap) as the core explanatory variable. The benchmark panel model is constructed as follows:
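A standard form of Equation 1, consistent with the coefficient definitions below, is:

\begin{align}
ind_{it}=\beta_0+\beta_1\,cap_{it}+\mu_i+\varepsilon_{it}. \tag{1}
\end{align}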

where i = 1, 2, …, 30 indexes provinces (or municipalities), t = 2010, 2011, …, 2020 indexes years, and ind_it and cap_it represent the level of rural industry integration and the level of capital marketization, respectively. β0 is the intercept, β1 is the regression coefficient on capital marketization, μ_i represents fixed effects, and ε_it is the random disturbance term.

To encompass the influence of additional factors, such as rural education level (edu), economic openness (imex), rural ecological environment (envi), urbanization level (town), and government financial support (gov), the model is adjusted to include these variables. The extended panel model is given by Equation 2:
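A standard form of Equation 2, adding the five control variables to the benchmark specification, is:

\begin{align}
ind_{it}=\beta_0+\beta_1\,cap_{it}+\beta_2\,edu_{it}+\beta_3\,imex_{it}+\beta_4\,envi_{it}+\beta_5\,town_{it}+\beta_6\,gov_{it}+\mu_i+\varepsilon_{it}. \tag{2}
\end{align}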

where β2, β3, β4, β5, β6 are the regression coefficients for these control variables. All other terms have the same meanings as in the benchmark model (Equation 1).

To account for a potential non-linear relationship between rural industry integration and capital marketization, this study introduces the quadratic term of capital marketization in the model. The static panel model is constructed as Equation 3 :
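One consistent rendering of Equation 3, with the control variables collected in the vector X_{it}, is:

\begin{align}
ind_{it}=\beta_0+\beta_1\,cap_{it}+\beta_2\,cap_{it}^2+\boldsymbol{\beta}'X_{it}+\mu_i+\varepsilon_{it}. \tag{3}
\end{align}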

where cap²_it represents the squared term of capital marketization, and β1, …, β6 are the regression coefficients for the core explanatory and control variables. In model (3), if β1 and β2 are significantly non-zero, the shape of the relationship between capital marketization and rural industry integration can be determined from their signs. When β1 > 0 and β2 < 0, the relationship is inverted U-shaped: when the level of capital marketization is at or below the inflection point, it has a positive promoting effect on industry integration, and above the inflection point it has a negative inhibitory effect. When β1 < 0 and β2 > 0, the relationship is U-shaped: at or below the inflection point, capital marketization has a negative inhibitory effect, and above the inflection point it has a positive promoting effect on rural industry integration.
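The sign logic above can be sketched in a few lines; the coefficient values used are hypothetical illustrations, not estimates from this study.

```python
def classify_quadratic(b1, b2):
    """Classify the shape implied by the linear (b1) and quadratic (b2)
    coefficients of model (3), and return the inflection point -b1/(2*b2)
    of the fitted parabola."""
    if b2 == 0:
        return "linear", None
    shape = "inverted U-shaped" if b2 < 0 else "U-shaped"
    return shape, -b1 / (2 * b2)

# Hypothetical coefficients with b1 < 0 and b2 > 0, the sign pattern the
# paper later reports, which implies a U-shaped relationship:
print(classify_quadratic(-1.3, 1.0))   # ('U-shaped', 0.65)
```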

Considering the potential influence of past levels of rural industry integration on the current state within a region, this study further incorporates the first-order lag of the rural industry integration variable into the econometric model. The dynamic panel model is constructed as Equation 4:
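One consistent rendering of Equation 4, matching the coefficient indexing described below, is:

\begin{align}
ind_{it}=\beta_0+\beta_1\,ind_{i,t-1}+\beta_2\,cap_{it}+\beta_3\,cap_{it}^2+\sum_{j=4}^{8}\beta_j X_{j,it}+\mu_i+\varepsilon_{it}, \tag{4}
\end{align}

where X_{j,it} denotes the five control variables.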

where ind_{i,t−1} is the first-order lag of the rural industry integration variable, β1 is its regression coefficient, and β2, …, β8 are the regression coefficients for the core explanatory and control variables.

4.1.2 Threshold regression model

In order to examine the threshold effect of the economic development level on the impact of capital marketization on rural industry integration, this study employs the panel threshold model proposed by Hansen (1999) . The economic development level is taken as the threshold variable, and the threshold regression model is constructed as Equation 5:
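One consistent rendering of Equation 5, following Hansen's (1999) piecewise specification with the controls collected in X_{it}, is:

\begin{align}
ind_{it}=\beta_0+\beta_1\,ind_{i,t-1}+\theta_1\,cap_{it}I(eco_{it}\leq\gamma_1)+\theta_2\,cap_{it}I(\gamma_1<eco_{it}\leq\gamma_2)+\cdots+\theta_{n+1}\,cap_{it}I(eco_{it}>\gamma_n)+\boldsymbol{\beta}'X_{it}+\mu_i+\varepsilon_{it}. \tag{5}
\end{align}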

where eco_it is the threshold variable representing the economic development level, I(·) is an indicator function taking the value 1 if the expression inside the parentheses is true and 0 otherwise, and γ1, γ2, …, γn are the threshold values to be estimated for different levels of economic development.
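The role of the indicator function can be illustrated with a minimal sketch: the coefficient on capital marketization switches depending on which side of the threshold the economic-development variable falls. The threshold, coefficients, and data below are all hypothetical.

```python
import numpy as np

def threshold_effect(cap, eco, gamma, theta_low, theta_high):
    """Piecewise effect of cap: theta_low*cap where eco <= gamma,
    theta_high*cap where eco > gamma."""
    ind_low = (eco <= gamma).astype(float)    # I(eco <= gamma)
    ind_high = (eco > gamma).astype(float)    # I(eco > gamma)
    return theta_low * cap * ind_low + theta_high * cap * ind_high

cap = np.array([0.5, 0.5])
eco = np.array([0.8, 1.6])   # one observation below, one above the threshold
print(threshold_effect(cap, eco, gamma=1.0, theta_low=-0.2, theta_high=0.4))
# [-0.1  0.2]
```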

4.1.3 Mediation effects model

Building upon the prior analysis that capital marketization can enhance rural industry integration through the facilitation of rural financial development and the optimization of industrial structure, this study employs a mediation effects analysis framework. The panel mediation effects model is structured as follows:
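A standard three-step rendering of the mediation system (Equations 6–8), with med denoting a mediating variable and the controls collected in X_{it}, is:

\begin{align}
ind_{it}&=\beta_0+\beta_1\,cap_{it}+\boldsymbol{\beta}'X_{it}+\mu_i+\varepsilon_{it}, \tag{6}\\
med_{it}&=\alpha_0+\alpha_1\,cap_{it}+\boldsymbol{\alpha}'X_{it}+\mu_i+\epsilon_{it}, \tag{7}\\
ind_{it}&=\delta_0+\delta_1\,cap_{it}+\delta_2\,med_{it}+\boldsymbol{\delta}'X_{it}+\mu_i+\nu_{it}. \tag{8}
\end{align}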

where med represents the set of mediating variables, which includes the rural financial development level and the industrial structure, and control represents the set of control variables. In particular, Equation 6 is the total effect model, indicating the overall effect of capital marketization on rural industry integration. Equation 7 is designed to estimate the impact of capital marketization on the levels of rural financial development and industrial structure. Equation 8 is employed to estimate the direct effect of capital marketization on rural industry integration and the indirect effects through the levels of rural financial development and industrial structure.

4.1.4 Estimation methodology

The current level of rural industry integration may be influenced by historical levels due to inertia-like factors. To account for this, this study introduces the lagged term of industry integration as an explanatory variable, endowing the regression model with dynamic explanatory power. However, the inclusion of lagged dependent variables can introduce endogeneity. The System GMM method, proposed by Blundell and Bond (1998) , addresses this by estimating the level and first-differenced models simultaneously, which helps mitigate concerns related to unobserved heterogeneity, omitted variable bias, measurement error, and potential endogeneity.

A critical assumption of the GMM model is the absence of autocorrelation in the error term. To test this assumption, the study conducts residual autocorrelation tests (AR tests), with the null hypothesis (H0) stating that there is no autocorrelation at lag 2 in the error term. Failure to reject the null hypothesis in the AR(2) test suggests that the model specification is appropriate. Additionally, to validate the exogeneity of the instrumental variables, the Hansen J (over-identification) test is employed; its null hypothesis posits that the instrumental variables are valid, and failure to reject it supports the suitability of the chosen instruments. In terms of estimation technique, the System GMM model offers one-step and two-step procedures. Given that the two-step estimator is more robust to heteroscedasticity and cross-sectional correlation and generally outperforms the one-step estimator, this study opts for the two-step System GMM approach to estimate ( Equation 4 ).

4.2 Variables

4.2.1 Dependent variable

The dependent variable in this study is the development of rural industry integration (ind). The study constructs an index to measure the level of rural industry integration across five key dimensions: innovation, coordination, green development, openness, and shared development. In particular, innovation includes both innovation in cultivation methods and the innovation of integration entities. To quantify innovation in cultivation methods, the study uses the level of agricultural mechanization. The number of cooperatives per ten thousand people serves as a metric for assessing the development of integration entities. Coordination is examined through industry coordination and urban–rural coordination. The deviation degree of the primary industry structure is used to measure industry coordination, and the per capita income ratio of urban to rural residents measures the extent of urban–rural coordination. Green development is primarily concerned with the ecological performance of rural industry integration, which includes factors such as the use of fertilizers and pesticides, the capacity for harmless waste disposal, and carbon emissions. Openness is measured through the development of the agricultural service industry and the extension of the industrial chain. The integration of the primary and tertiary industries is measured by the ratio of the output value of the agricultural, forestry, animal husbandry, and fishery service industry to that of the primary industry. Additionally, the extension of the agricultural industry chain is measured by the ratio of the main business income of the agricultural and sideline food processing industry to the output value of the primary industry. Shared development is evaluated through the lens of benefit sharing and information sharing. The degree of benefit sharing is measured by the degree of benefit connection and the income growth rate.
Information sharing is quantified by the number of phones and computers per hundred rural households. Subsequently, the study applies the entropy method to determine the weights of each secondary indicator. Using a linear weighting method, the study calculates the rural industry integration development level for each province in China from 2010 to 2020 (see Table 1 ).
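The entropy weighting step can be sketched as follows; the province-by-indicator data matrix is hypothetical, and the sketch assumes positively oriented indicators that each vary across provinces.

```python
import numpy as np

def entropy_weights(X):
    """Entropy-method weights for an (n_provinces, m_indicators) matrix X.
    Assumes positively oriented indicators, each with some variation
    across provinces (otherwise min-max normalization divides by zero)."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    # Min-max normalize each indicator column to [0, 1].
    Z = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # Column-wise proportions; a small epsilon avoids log(0).
    P = (Z + 1e-12) / (Z + 1e-12).sum(axis=0)
    # Shannon entropy of each indicator, scaled into [0, 1] by log(n).
    e = -(P * np.log(P)).sum(axis=0) / np.log(n)
    d = 1.0 - e                  # degree of divergence (more spread = more weight)
    return d / d.sum()           # normalized weights summing to 1

# Hypothetical matrix: 3 provinces, 2 secondary indicators.
X = [[0.2, 10.0],
     [0.5, 10.5],
     [0.9, 11.0]]
w = entropy_weights(X)
print(w.round(3), w.sum())
```

In practice, negatively oriented indicators (for example, the urban–rural income ratio) would first be reverse-normalized before applying the same procedure.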


Table 1 . Evaluation system for rural industry integration development.

4.2.2 Core explanatory variable

Capital marketization (cap) is the central explanatory variable in this study, referring to market-oriented reform of the capital market. This process involves enhancing the legal regulatory system, achieving the autonomous and orderly flow of factors, and securing efficient and fair allocation through reforms in economic and social systems. In this research, capital is specifically understood to mean financial capital. The assessment of capital marketization is constructed around four dimensions: the government-market relationship, the liberalization of economic entities, price marketization, and the fairness of the financial environment.

In particular, the government-market relationship includes government resources and government size. The proportion of government expenditure in GDP measures the extent of government resource control, with higher control potentially leading to greater market distortion and lower marketization. The proportion of public employees in the total employment reflects the size of the government, with a smaller size indicating a higher degree of marketization. Economic entity liberalization includes enterprise and bank liberalization. Enterprise liberalization is measured by the share of non-state-owned economic fixed asset investment in total societal fixed asset investment and the share of non-state-owned enterprises’ liabilities in total liabilities. These two indicators reflect the market position of non-state-owned entities, which generally align more closely with market economy principles than state-owned counterparts. Bank liberalization is measured by the proportion of non-state-owned banks in total bank assets, with a higher proportion indicating lower market concentration and greater market competition, signifying higher marketization. Price marketization covers the marketization of both agricultural products and capital prices. The marketization of agricultural product prices is measured by the producer price index of agricultural and sideline products, reflecting price stability. The marketization of capital prices is measured by the degree of interest rate marketization, with a higher degree indicating a more advanced capital marketization. The fairness of the financial environment involves the protection of market competition and the emphasis on regulation. Market competition protection is measured by the ratio of concluded to accepted cases for unfair competition violations, indicating the robustness of the market economy legal system. Greater protection equates to a more effective market economic system. 
The emphasis on regulation is measured by the proportion of financial regulatory expenditure in GDP. The entropy method is also used to calculate the weights of each secondary indicator (see Table 2 ).


Table 2 . Evaluation system for the development degree of capital marketization.

4.2.3 Other variables

In this study, the threshold variable is the level of economic development (eco), measured by per capita GDP, a comprehensive indicator that reflects the economic activities and living standards of a region.

In addition, this paper identifies rural financial development (rural finance) and industrial structure (structure) as mediating variables. The level of rural financial development is measured by the ratio of outstanding agricultural loans from financial institutions in each province to the output value of the primary industry. The industrial structure, which refers to the composition of industries and the connections and proportions between them, is measured by the ratio of non-agricultural output value to agricultural output value; an increase in this ratio indicates an optimization of the industrial structure. The study also incorporates control variables that may affect the integrated development of rural industries: the level of rural education (edu), the degree of economic openness (imex), the rural ecological environment (envi), the level of urbanization (town), and government financial support (gov). These variables are crucial for a thorough understanding of the factors that influence the integration and development of rural industries (see Table 3 ).


Table 3 . Control variables.

This paper utilizes a balanced panel dataset from 30 provinces, municipalities, and autonomous regions in China, covering the period from 2010 to 2020, resulting in a total of 330 observations. The data are sourced from a variety of authoritative publications, including the “China Statistical Yearbook,” “China Rural Statistical Yearbook,” “China Science and Technology Statistical Yearbook,” “China Financial Statistical Yearbook,” “China Population and Employment Statistical Yearbook,” “China Basic Unit Statistical Yearbook,” “China Fiscal Yearbook,” “China Land and Resources Statistical Yearbook,” as well as provincial (municipal) statistical yearbooks and the Wind database. The data processing and regression analysis are primarily conducted using Stata 15 software. Descriptive statistics for each variable are presented in Table 4 , offering an initial overview of the dataset’s characteristics.


Table 4 . Descriptive statistics of variables.

As shown in Table 4 , the mean value of the comprehensive index of rural industrial integration is 0.418, with a standard deviation of 0.062. The index ranges from a minimum value of 0.225 to a maximum of 0.571. These statistics indicate that the differences in the level of rural industrial integration across various regions are relatively minor, suggesting a comparatively balanced development of rural industrial integration in China.

The mean value of the capital marketization index is 0.638, with a standard deviation of 0.111, which points to substantial variability in the degree of marketization among various regions. Based on the comprehensive index of capital marketization calculated within this study, regions such as Beijing, Shanghai, Guangdong, and Jiangsu exhibit a higher degree of marketization, whereas regions like Guizhou, Qinghai, and Guangxi are found to have a relatively lower degree of marketization.

Among the other control variables, government financial support shows a wide range, with a minimum value of 0.110 and a maximum value of 5.110, and a standard deviation of 0.694. This variation underscores the differing levels of emphasis that local governments place on supporting “agriculture, rural areas, and farmers,” highlighting the significant heterogeneity in financial commitment across regions.

5 Empirical results

5.1 Baseline regression results

To ensure a robust comparison and to bolster the reliability of the estimation outcomes, this paper employs a variety of estimation techniques for Equation 4 . Specifically, the analysis utilizes a mixed Ordinary Least Squares (OLS) model, a Panel Instrumental Variable (IV) model, and a System Generalized Method of Moments (GMM) model. The comparative results of these estimations are systematically displayed in Table 5 .


Table 5 . The impact of capital marketization on the integrated development of rural industries.

In Table 5 , models (1) through (3) present the results of estimating Equation 4 using mixed OLS, IV, and system GMM methods, respectively. A comparative analysis is conducted, and the Hausman test in models (1) and (2) rejects the null hypothesis, indicating the presence of endogenous explanatory variables. This finding implies that an IV model is more appropriate than the OLS model for addressing endogeneity. Moreover, there is a slight increase in the significance level of the impact of capital marketization on industrial integration from model (1) to model (2), suggesting that the endogeneity issue has been partially resolved. Model (3) shows the estimation results from the system GMM. The paper also performs AR(1) and AR(2) tests on the disturbance term, with the null hypothesis H0 that there is no autocorrelation in the model's disturbance term. The outcomes show that the first-order test rejects H0, while the second-order test fails to reject it, indicating that the disturbance term exhibits first-order autocorrelation but no second-order or higher-order autocorrelation. This result supports the selection of the system GMM model as a suitable method for this study. Additionally, the Sargan test is conducted to assess the validity of the chosen instrumental variables; its result, which fails to reject the null hypothesis, indicates that the instrumental variables used are essentially valid and appropriate for the analysis.

The estimation results from model (3) show that the lagged level of rural industrial integration has a significantly positive effect on its current level, indicating that rural industrial integration exhibits characteristics of accumulation and depends on previous levels of integration. The coefficient of capital marketization is significantly negative, while the coefficient of its squared term is significantly positive, revealing a significant non-linear U-shaped relationship between capital marketization and rural industrial integration. This finding aligns with the related research of Sun and Zhu (2022) and Liu et al. (2024) , who also discovered a U-shaped relationship between financial development and rural economic growth in China. More specifically, before the turning point of the curve the relationship between the level of capital marketization and the level of rural industrial integration is significantly negative, and beyond the turning point it becomes significantly positive. Thus, Hypothesis 1 is supported.

The turning point of the U-shaped curve is calculated using the vertex formula −β1 / (2β2), resulting in a value of 0.6509. This implies that when the level of capital marketization is below 0.6509, its development is likely to inhibit rural industrial integration; conversely, when the level exceeds 0.6509, the development of capital marketization is expected to foster rural industrial integration. Notably, the average index of capital marketization in China for the year 2020 was 0.667, slightly above the calculated turning point of the U-shaped curve. This suggests that China has entered a phase where the capital marketization process is conducive to the integrated development of rural industries.

To explore the mechanisms through which capital marketization influences the integrated development of rural industries, this study incorporates the level of rural financial development and the industrial structure into Equation 4 . Upon introducing the level of rural financial development, a notable reduction in the coefficients and significance levels of both capital marketization and its squared term is observed relative to model (3). The estimated coefficient on rural financial development shows that it is instrumental in fostering the integrated development of rural industries. This result suggests that capital marketization may exert its influence on rural industrial integration through the development of the rural financial sector. The rationale is that capital marketization facilitates the rational allocation of capital and enhances its liquidity, which in turn promotes rural financial development. A robust rural financial sector can attract additional factors and resources, effectively mitigating the constraints on rural industrial integration and alleviating financial bottlenecks. Consequently, this contributes to the advancement of the integrated development of rural industries.

In model (5), the inclusion of the industrial structure variable has led to an increase in both the coefficient of capital marketization and its squared term, in comparison to model (3). The coefficient for the industrial structure is significantly positive, indicating that a more rational industrial structure benefits the integrated development of rural industries. This positive outcome may stem from the role of capital marketization in promoting the rational allocation of various resources, which in turn fosters the optimization and upgrading of the industrial structure. Additionally, it aids in the reasonable distribution of rural surplus labor, thereby facilitating the integrated development of rural industries. Consequently, Hypothesis 2 is supported by these findings.

Models (6) and (7) extend the analysis by incorporating interaction terms: capital marketization multiplied by the level of rural financial development (Cap×Rural finance), and capital marketization multiplied by the industrial structure (Cap×Structure), respectively. The results indicate that the coefficients for the interaction terms in both models are significant. This signifies that capital marketization enhances the integrated development of rural industries through its impact on promoting rural financial development and optimizing the industrial structure. The significance of these interaction terms reaffirms Hypothesis 2.
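Constructing such interaction regressors is mechanical. A minimal pandas sketch follows; the column names and values are hypothetical illustrations, not the paper's dataset:

```python
import pandas as pd

# Hypothetical panel slice; names and values are illustrative only.
df = pd.DataFrame({
    "cap": [0.55, 0.62, 0.70],            # capital marketization index
    "rural_finance": [0.30, 0.35, 0.41],  # rural financial development level
    "structure": [1.10, 1.15, 1.22],      # industrial structure measure
})

# Interaction terms of the kind used in models (6) and (7)
df["cap_x_rural_finance"] = df["cap"] * df["rural_finance"]
df["cap_x_structure"] = df["cap"] * df["structure"]
```

In practice, centering each variable before forming the product is a common way to reduce collinearity between a regressor and its interaction term.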

5.2 Mechanism analysis

To ensure the reliability of the aforementioned conclusions and further examine the impact mechanism of capital marketization on the integrated development of rural industries, this paper employs a mediation effect analysis. This methodical approach is utilized to scrutinize the mediating roles of two distinct pathways through which capital marketization is hypothesized to influence rural industrial integration. The test results are shown in Table 6. In Table 6, model (8) presents the regression outcomes reflecting the direct effect of capital marketization on the level of rural financial development. Model (9) illustrates the adjusted effect of capital marketization on the degree of rural industrial integration, with the inclusion of the rural financial development level as a variable. Models (10) and (11), on the other hand, display the regression results for the impact of capital marketization on the industrial structure and the combined effect of both capital marketization and the industrial structure on the degree of rural industrial integration.


Table 6 . Mediation effect analysis of rural financial development and industrial structure.

The results in Table 6 indicate that capital marketization has a significantly positive impact on rural financial development. This positive impact aligns with the conclusions reached by Tian et al. (2020), who emphasized the substantial role of rural finance in fostering industrial integration. When both capital marketization and the level of rural financial development are incorporated into the model, they are found to significantly and positively influence the integration of rural industries. This finding indicates that rural financial development acts as a mediating factor in the relationship between capital marketization and the integrated development of rural industries. As capital marketization advances, it enhances rural financial development, mitigating the challenges of “difficulty and high cost of financing” that rural industries face, and thus effectively promoting their integrated development. Furthermore, the regression results in Table 6 indicate that capital marketization significantly and positively affects the industrial structure, which in turn significantly and positively impacts the integrated development of rural industries. This suggests that the industrial structure serves as another mediating channel through which capital marketization influences rural industrial integration. The results are consistent with the earlier regression findings, reinforcing the mediating role of the industrial structure in this context.

As shown in Table 7 , the indirect effect of capital marketization on the integrated development of rural industries, as mediated by rural financial development, is 0.042, with a confidence interval CI = [0.027 0.059]. The exclusion of zero from this confidence interval substantiates the mediating role of rural financial development in the impact of capital marketization on rural industrial integration. Similarly, the indirect effect through the industrial structure is identified as 0.017, with a corresponding confidence interval CI = [0.005 0.031]. Once again, the absence of zero from this interval confirms the mediating influence of the industrial structure on the relationship between capital marketization and the integrated development of rural industries. Consequently, these findings corroborate Hypothesis 2, which posits that both rural financial development and the industrial structure serve as pivotal mediators in the influence of capital marketization on the advancement of rural industrial integration.
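Confidence intervals for an indirect effect of this kind are commonly obtained with a percentile bootstrap over the product a·b, where a is the slope of the mediator on the treatment and b is the slope of the outcome on the mediator controlling for the treatment. The paper does not report its exact estimation code, so the following is a minimal sketch of the generic procedure, not the authors' implementation:

```python
import numpy as np

def _slopes(X, y):
    """OLS slope coefficients (an intercept is prepended internally)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta[1:]  # drop the intercept

def bootstrap_indirect(x, m, y, n_boot=1000, seed=0):
    """Percentile-bootstrap 95% CI for the indirect effect a*b:
    a = slope of mediator m on treatment x,
    b = slope of outcome y on m, controlling for x."""
    rng = np.random.default_rng(seed)
    n = len(x)
    eff = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)          # resample rows with replacement
        a = _slopes(x[idx], m[idx])[0]
        b = _slopes(np.column_stack([m[idx], x[idx]]), y[idx])[0]
        eff[i] = a * b
    return np.percentile(eff, [2.5, 97.5])
```

A returned interval that excludes zero, as in Table 7, is the evidence of a significant indirect (mediated) effect.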


Table 7 . The results of direct and indirect effects.

5.3 Threshold effect test

To ascertain whether the promotional effect of capital marketization on the integrated development of rural industries is moderated by the level of regional economic development, acting as a threshold, this section introduces a threshold regression model with economic development level as the threshold variable. The model examines the differences in the impact of capital marketization on rural industrial integration across different economic development intervals. The test results are shown in Table 8. Table 8 lists the p-values obtained from the threshold effect test, which are based on three scenarios: the presence of a single threshold, a dual threshold, and a triple threshold in the way economic development level impacts the integrated development of rural industries through the mediation of capital marketization.


Table 8 . Threshold effect test results.

The results in Table 8 indicate that when the null hypothesis is the absence of a triple threshold, the p-value is 0.6967. This result does not lead to the rejection of the null hypothesis. Conversely, when the null hypothesis is the absence of a double threshold, the corresponding p-value is 0.0000, which leads to the rejection of the null hypothesis. Based on the structure of the test and the observed outcomes, it can be preliminarily concluded that there are two thresholds in the impact of economic development level on the integrated development of rural industries, as mediated by capital marketization.

Table 9 presents the threshold estimation results. The results show that the first threshold value for economic development, within the context of capital marketization’s influence on rural industrial integration, is 2.54, with the second threshold value being 5.47. When the level of economic development is below the first threshold, the impact of capital marketization on rural industrial integration proves to be non-significant. This indicates that at low levels of economic development, regions encounter a financial conundrum characterized by a scarcity of capital. External capital demonstrates a reluctance to invest in areas with lower economic development, resulting in an insufficient pool of resources available for capital marketization to allocate rationally, which in turn significantly undermines its capacity to foster rural industrial integration. Upon surpassing the first threshold value, the influence of capital marketization on rural industry development transitions to a notably positive impact. Furthermore, once the economic development level surpasses the second threshold, the magnitude of the coefficient for capital marketization’s impact on rural industrial integration intensifies compared to when it is below this value. This heightened impact suggests that in regions with higher levels of economic development, there is a greater abundance of factors and resources. Consequently, capital marketization has a more substantial pool of resources to allocate rationally, thereby exerting a more pronounced role in advancing the integrated development of rural industries. In light of these findings, Hypothesis 3 is substantiated.
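The single-threshold step of this procedure follows the logic of Hansen (1999): the threshold value is estimated by grid search, choosing the candidate that minimizes the sum of squared residuals of a regression whose slope is allowed to differ below and above the threshold. The sketch below illustrates that idea on simulated data; it is not the paper's estimation code, and repeating it on the residuals of a fitted single-threshold model yields the second threshold:

```python
import numpy as np

def threshold_ssr(x, q, y, gamma):
    """SSR when x's slope is allowed to differ below/above threshold gamma."""
    below = q <= gamma
    X = np.column_stack([np.ones_like(x), x * below, x * ~below])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

def estimate_threshold(x, q, y, trim=0.15, n_grid=200):
    """Grid-search the SSR-minimising threshold over the trimmed
    quantiles of the threshold variable q (Hansen-1999-style)."""
    grid = np.quantile(q, np.linspace(trim, 1 - trim, n_grid))
    ssr = [threshold_ssr(x, q, y, g) for g in grid]
    return grid[int(np.argmin(ssr))]
```

Trimming the candidate grid (here, 15% in each tail) ensures each regime retains enough observations to estimate its slope.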


Table 9 . The results of threshold effect.

6 Conclusion and policy implications

6.1 Conclusion

This paper has elucidated the mechanisms by which capital marketization influences the integration of rural industries. It has developed an evaluation index for the development levels of both capital marketization and rural industrial integration, ensuring alignment with real-world scenarios and policy directions. Using dynamic panel data from China, the paper has conducted an analysis of the trends, transmission mechanisms, and threshold constraints influencing the impact of capital marketization on rural industrial integration. The study’s findings reveal that the degree of rural industrial integration is significantly and positively influenced by its previous level, demonstrating an accumulative effect wherein the prior level of integration lays the groundwork for future advancements. The influence of capital marketization on the degree of rural industrial integration is characterized by a non-linear relationship, adhering to a “U-shaped” curve. Below the inflection point, the development of capital marketization is detrimental to rural industrial integration, whereas above this point, it exerts a positive influence. Currently, China’s overall level of capital marketization is positioned beyond the inflection point, indicating substantial potential for enhancing industry integration in rural China. Capital marketization can stimulate rural financial development and refine the industrial structure, thereby mitigating the challenges of “difficulty and high cost of financing” and acting as a mediating pathway to foster rural industrial integration. In addition, the study indicates that at very low levels of economic development, capital marketization does not affect the development of rural industries. As the economic development level rises, so does the impact of capital marketization on rural industrial integration. Collectively, the evidence suggests that capital marketization is instrumental in advancing the integrated development of rural industries. With appropriate conditions in place, capital marketization can facilitate profound integration within rural industries and pave the way for high-quality development.

6.2 Policy implications

The research findings yield several key policy recommendations. Firstly, the accumulation of experience and factors in rural industrial integration merits attention. It is essential to continuously improve the level of rural industrial integration. In regions where rural industrial integration is advanced, ongoing efforts should focus on maintaining the utilization of existing facilities, fostering innovation among business entities, and sharing development outcomes to further enhance the dynamism of industrial integration. Conversely, in areas with lower levels of integration, strategies should aim to leverage underutilized resources, capitalize on advantageous industries, learn from the experiences of more integrated regions, and adapt development approaches to local conditions.

Secondly, with China’s overall level of capital marketization positioned to promote the integrated development of rural industries, there is an opportunity to bolster this integration. Establishing branches of rural financial institutions, ensuring adequate staffing, and advancing interest rate marketization could enhance the lending and deposit capabilities of these institutions. Such measures would elevate the level of capital marketization in China, encouraging the discovery of new agricultural roles and the emergence of innovative business models, thereby advancing the integration of rural industries.

Thirdly, given the current low overall educational level among rural residents in China, there is a pressing need to augment investment in rural education. This would elevate the educational standards of the rural populace, facilitate the transition of surplus rural labor to secondary and tertiary sectors, refine the industrial structure, and, by extension, foster deeper integration and development of rural industries.

While this study provides valuable insights, it acknowledges certain limitations and avenues for future research. The data’s temporal scope may not encompass the most recent trends and policy shifts that could influence the dynamics between capital marketization and rural industrial integration. Future studies should consider extending the timeframe of their data and broadening the research to encompass micro-level analyses for a nuanced understanding of local particularities. Additionally, a detailed examination of the specific components within the capital marketization process that lead to the observed non-linear effects could yield more precise policy directives. Despite these limitations, the research establishes a robust foundation for further exploration of the capital markets’ role in the integrated development of rural industries.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Author contributions

ZD: Conceptualization, Methodology, Supervision, Writing – review & editing, Project administration. XF: Data curation, Software, Writing – original draft, Formal analysis, Methodology.

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This work was supported by the National Social Science Fund of China (21CGL026).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Adegbite, O. O., and Machethe, C. L. (2020). Bridging the financial inclusion gender gap in smallholder agriculture in Nigeria: an untapped potential for sustainable development. World Dev. 127:104755. doi: 10.1016/j.worlddev.2019.104755


Banerjee, A., Duflo, E., and Qian, N. (2020). On the road: access to transportation infrastructure and economic growth in China. J. Dev. Econ. 145:102442. doi: 10.1016/j.jdeveco.2020.102442

Blundell, R., and Bond, S. (1998). Initial conditions and moment restrictions in dynamic panel data models. J. Econom. 87, 115–143. doi: 10.1016/S0304-4076(98)00009-8

Bustos, P., Garber, G., and Ponticelli, J. (2020). Capital accumulation and structural transformation. Q. J. Econ. 135, 1037–1094. doi: 10.1093/qje/qjz044

Chen, K., Long, H., Liao, L., Tu, S., and Li, T. (2020). Land use transitions and urban-rural integrated development: theoretical framework and China’s evidence. Land Use Policy 92:104465. doi: 10.1016/j.landusepol.2020.104465

Cheng, J., Zhao, J., Zhu, D., Jiang, X., Zhang, H., and Zhang, Y. (2022). Land marketization and urban innovation capability: evidence from China. Habitat Int. 122:102540. doi: 10.1016/j.habitatint.2022.102540

Clapp, J. (2019). The rise of financial investment and common ownership in global agrifood firms. Rev. Int. Polit. Econ. 26, 604–629. doi: 10.1080/09692290.2019.1597755

Cofré-Bravo, G., Klerkx, L., and Engler, A. (2019). Combinations of bonding, bridging, and linking social capital for farm innovation: how farmers configure different support networks. J. Rural. Stud. 69, 53–64. doi: 10.1016/j.jrurstud.2019.04.004

Cowie, P., Townsend, L., and Salemink, K. (2020). Smart rural futures: will rural areas be left behind in the 4th industrial revolution? J. Rural. Stud. 79, 169–176. doi: 10.1016/j.jrurstud.2020.08.042


Fan, G., Wang, X., Zhang, L., and Zhu, H. (2003). Marketization Index for China’s Provinces. J. Econ. Res. 3, 9–18+89. doi: 10.3969/j.issn.1005-6432.2001.06.024

Guo, Y., Zhou, Y., and Liu, Y. (2022). Targeted poverty alleviation and its practices in rural China: a case study of Fuping county, Hebei Province. J. Rural. Stud. 93, 430–440. doi: 10.1016/j.jrurstud.2019.01.007

Han, J. (2020). How to promote rural revitalization via introducing skilled labor, deepening land reform and facilitating investment? China Agric. Econ. Rev. 12, 577–582. doi: 10.1108/CAER-02-2020-0020

Hansen, B. E. (1999). Threshold effects in non-dynamic panels: estimation, testing, and inference. J. Econ. 93, 345–368. doi: 10.1016/S0304-4076(99)00025-1

Hao, H., Liu, C., and Xin, L. (2023). Measurement and dynamic trend research on the development level of rural industry integration in China. Agriculture 13:2245. doi: 10.3390/agriculture13122245

Khanal, A. R., and Omobitan, O. (2020). Rural finance, capital constrained small farms, and financial performance: findings from a primary survey. J. Agric. Appl. Econ. 52, 288–307. doi: 10.1017/aae.2019.45

Lang, R., and Fink, M. (2019). Rural social entrepreneurship: the role of social capital within and across institutional levels. J. Rural. Stud. 70, 155–168. doi: 10.1016/j.jrurstud.2018.03.012

Li, Y., Westlund, H., and Liu, Y. (2019). Why some rural areas decline while some others not: an overview of rural evolution in the world. J. Rural. Stud. 68, 135–143. doi: 10.1016/j.jrurstud.2019.03.003

Liu, Y., Cui, J., Feng, L., and Yan, H. (2024). Does county financial marketization promote high-quality development of agricultural economy? Analysis of the mechanism of county urbanization. PLoS One 19:e0298594. doi: 10.1371/journal.pone.0298594

Liu, Y., Cui, J., Jiang, H., and Yan, H. (2023). Do county financial marketization reforms promote food total factor productivity growth?: a mechanistic analysis of the factors quality of land, labor, and capital. Front. Sustain. Food Systems. 7:1263328. doi: 10.3389/fsufs.2023.1263328

Liu, Y., Li, J., and Yang, Y. (2018). Strategic adjustment of land use policy under the economic transformation. Land Use Policy 74, 5–14. doi: 10.1016/j.landusepol.2017.07.005

Long, H., Tu, S., Ge, D., Li, T., and Liu, Y. (2016). The allocation and management of critical resources in rural China under restructuring: problems and prospects. J. Rural. Stud. 47, 392–412. doi: 10.1016/j.jrurstud.2016.03.011

Lopez, T., and Winkler, A. (2018). The challenge of rural financial inclusion–evidence from microfinance. Appl. Econ. 50, 1555–1577. doi: 10.1080/00036846.2017.1368990

McKinnon, R. I. (1973). Money and Capital in Economic Development . Washington, D.C.: The Brookings Institution.


Petry, J. (2020). Financialization with Chinese characteristics? Exchanges, control and capital markets in authoritarian capitalism. Econ. Soc. 49, 213–238. doi: 10.1080/03085147.2020.1718913

Qin, X., Li, Y., Lu, Z., and Pan, W. (2020). What makes better village economic development in traditional agricultural areas of China? Evidence from 338 villages. Habitat Int. 106:102286. doi: 10.1016/j.habitatint.2020.102286

Shaw, E. S. (1973). Financial Deepening in Economic Development . New York: Oxford University Press.

Steiner, A., and Teasdale, S. (2019). Unlocking the potential of rural social enterprise. J. Rural. Stud. 70, 144–154. doi: 10.1016/j.jrurstud.2017.12.021

Sun, L., and Zhu, C. (2022). Impact of digital inclusive finance on rural high-quality development: evidence from China. Discret. Dyn. Nat. Soc. 2022:7939103. doi: 10.1155/2022/7939103

Tian, X., Wu, M., Ma, L., and Wang, N. (2020). Rural finance, scale management and rural industrial integration. China Agric. Econ. Rev. 12, 349–365. doi: 10.1108/CAER-07-2019-0110

Tu, S., Long, H., Zhang, Y., Ge, D., and Qu, Y. (2018). Rural restructuring at village level under rapid urbanization in metropolitan suburbs of China and its implications for innovations in land use policy. Habitat Int. 77, 143–152. doi: 10.1016/j.habitatint.2017.12.001

Xu, L., and Tan, J. (2020). Financial development, industrial structure and natural resource utilization efficiency in China. Res. Policy 66:101642. doi: 10.1016/j.resourpol.2020.101642

Xue, L., Weng, L., and Yu, H. (2018). Addressing policy challenges in implementing sustainable development goals through an adaptive governance approach: a view from transitional China. Sustain. Dev. 26, 150–158. doi: 10.1002/sd.1726

Yan, J. (2007). The measurement of China’s marketization process. Statistics & Decision 23, 69–71. doi: 10.3969/j.issn.1002-6487.2007.23.027

Yao, W., and Wang, C. (2022). Agricultural land marketization and productivity: evidence from China. J. Appl. Econ. 25, 22–36. doi: 10.1080/15140326.2021.1997045

Yaseen, A., Bryceson, K., and Mungai, A. N. (2018). Commercialization behaviour in production agriculture: the overlooked role of market orientation. J. Agribus. Dev. Emerg. Econ. 8, 579–602. doi: 10.1108/JADEE-07-2017-0072

Zhang, S., Chen, C., Xu, S., and Xu, B. (2021). Measurement of capital allocation efficiency in emerging economies: evidence from China. Technol. Forecast. Soc. Chang. 171:120954. doi: 10.1016/j.techfore.2021.120954

Zhang, Z., Sun, C., and Wang, J. (2023). How can the digital economy promote the integration of rural industries—taking China as an example. Agriculture 13:2023. doi: 10.3390/agriculture13102023

Zhang, H., and Wu, D. (2022). The impact of transport infrastructure on rural industrial integration: spatial spillover effects and spatio-temporal heterogeneity. Land 11:1116. doi: 10.3390/land11071116

Zhang, R., Yuan, Y., Li, H., and Hu, X. (2022). Improving the framework for analyzing community resilience to understand rural revitalization pathways in China. J. Rural. Stud. 94, 287–294. doi: 10.1016/j.jrurstud.2022.06.012

Zhou, J., Chen, H., Bai, Q., Liu, L., Li, G., and Shen, Q. (2023). Can the integration of rural industries help strengthen China’s agricultural economic resilience? Agriculture 13:1813. doi: 10.3390/agriculture13091813

Zhou, Y., and Hall, J. (2019). The impact of marketization on entrepreneurship in China: recent evidence. Am. J. Entrep. 12, 31–55.

Zhou, Y., Li, X., and Liu, Y. (2020). Rural land system reforms in China: history, issues, measures and prospects. Land Use Policy 91:104330. doi: 10.1016/j.landusepol.2019.104330

Keywords: rural industrial integration, capital marketization, system GMM model, threshold regression, China

Citation: Ding Z and Fan X (2024) Does capital marketization promote better rural industrial integration: evidence from China. Front. Sustain. Food Syst. 8:1412487. doi: 10.3389/fsufs.2024.1412487

Received: 05 April 2024; Accepted: 02 August 2024; Published: 15 August 2024.


Copyright © 2024 Ding and Fan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Zhao Ding, [email protected]


IMAGES

  1. Describe a Benefit of Hypothesis Testing Using Statistics

    null hypothesis test for mean

  2. Hypothesis Testing with Two Samples

    null hypothesis test for mean

  3. PPT

    null hypothesis test for mean

  4. Everything You Need To Know about Hypothesis TestingPart I Null

    null hypothesis test for mean

  5. Null Hypothesis Significance Testing Overview

    null hypothesis test for mean

  6. t test null hypothesis example

    null hypothesis test for mean

VIDEO

  1. Hypothesis Testing

  2. Hypothesis Testing

  3. Hypothesis Testing: Two-tailed z test for mean

  4. Hypothesis Testing and The Null Hypothesis, Clearly Explained!!!

  5. Intro to Hypothesis Testing in Statistics

  6. Null and Alternate Hypothesis

COMMENTS

  1. 10.26: Hypothesis Test for a Population Mean (5 of 5)

    The hypotheses are claims about the population mean, µ. The null hypothesis is a hypothesis that the mean equals a specific value, µ 0. The alternative hypothesis is the competing claim that µ is less than, greater than, or not equal to the . When is < or > , the test is a one-tailed test. When is ≠ , the test is a two-tailed test.

  2. Hypothesis Testing for the Mean

    Table 8.3: One-sided hypothesis testing for the mean: H0: μ ≤ μ0, H1: μ > μ0. Note that the tests mentioned in Table 8.3 remain valid if we replace the null hypothesis by μ = μ0. The reason for this is that in choosing the threshold c, we assumed the worst case scenario, i.e, μ = μ0 .

  3. Hypothesis Testing Calculator with Steps

    Hypothesis Testing Calculator. The first step in hypothesis testing is to calculate the test statistic. The formula for the test statistic depends on whether the population standard deviation (σ) is known or unknown. If σ is known, our hypothesis test is known as a z test and we use the z distribution. If σ is unknown, our hypothesis test is ...

  4. Null Hypothesis: Definition, Rejecting & Examples

    The null states that the mean bone density changes for the control and treatment groups are equal. Null Hypothesis H 0: Group means are equal in the population: ... In the case of a one-sided hypothesis test, the null still contains an equal sign but it's "greater than or equal to" or "less than or equal to." If you wanted to ...

  5. Hypothesis Testing

    Hypothesis testing is a formal procedure for investigating our ideas about the world. It allows you to statistically test your predictions. ... Stating results in a statistics assignment In our comparison of mean height between men and women we found an average difference of 13.7 cm and a p-value of 0.002; ... The null hypothesis of a test ...

  6. Lesson 6b: Hypothesis Testing for One-Sample Mean

    With a test statistic of - 1.3 and p-value between 0.1 to 0.2, we fail to reject the null hypothesis at a 1% level of significance since the p-value would exceed our significance level. We conclude that there is not enough statistical evidence that indicates that the mean length of lumber differs from 8.5 feet.

  7. How to Write a Null Hypothesis (5 Examples)

    Example 1: Weight of Turtles. A biologist wants to test whether or not the true mean weight of a certain species of turtles is 300 pounds. To test this, he goes out and measures the weight of a random sample of 40 turtles. Here is how to write the null and alternative hypotheses for this scenario: H0: μ = 300 (the true mean weight is equal to ...

  8. 8.6 Hypothesis Tests for a Population Mean with Known Population

    The p-value of 0.0126 is a large probability compared to the 1% significance level, and so is likely to happen assuming the null hypothesis is true. This suggests that the assumption that the null hypothesis is true is most likely correct, and so the conclusion of the test is to not reject the null hypothesis.

  9. Hypothesis Testing

    Let's return finally to the question of whether we reject or fail to reject the null hypothesis. If our statistical analysis shows that the significance level is below the cut-off value we have set (e.g., either 0.05 or 0.01), we reject the null hypothesis and accept the alternative hypothesis. Alternatively, if the significance level is above ...

  10. 6.6

    We can conduct a hypothesis test. Because 98.6 is not contained within the 95% confidence interval, it is not a reasonable estimate of the population mean. We should expect to have a p value less than 0.05 and to reject the null hypothesis. \(H_0: \mu=98.6\) \(H_a: \mu \ne 98.6\)

  11. Hypothesis Testing for Means & Proportions

    We then determine the appropriate test statistic (Step 2) for the hypothesis test. The formula for the test statistic is given below. Test Statistic for Testing H0: p = p 0. if min (np 0 , n (1-p 0 )) > 5. The formula above is appropriate for large samples, defined when the smaller of np 0 and n (1-p 0) is at least 5.

  12. Hypothesis tests for a mean

    The hypothesis test is based on the T statistic. The resulting statistic from the test drops into the plot. Red values represent tests where the null hypothesis is rejected at the specified level of significance. The default significance level of 0.05 used for the tests can be changed by adjusting the Level input within the applet.

  13. 8.3: Hypothesis Test Examples for Means with Unknown Standard Deviation

    Full Hypothesis Test Examples. Example 8.3.6 8.3. 6. Statistics students believe that the mean score on the first statistics test is 65. A statistics instructor thinks the mean score is higher than 65. He samples ten statistics students and obtains the scores 65 65 70 67 66 63 63 68 72 71.

  14. 8.6: Hypothesis Test of a Single Population Mean with Examples

    Answer. Set up the hypothesis test: A 5% level of significance means that α = 0.05 α = 0.05. This is a test of a single population mean. H0: μ = 65 Ha: μ > 65 H 0: μ = 65 H a: μ > 65. Since the instructor thinks the average score is higher, use a " > > ". The " > > " means the test is right-tailed.

  15. 10.29: Hypothesis Test for a Difference in Two Population Means (1 of 2)

    Step 3: Assess the evidence. If the conditions are met, then we calculate the t-test statistic. The t-test statistic has a familiar form. Since the null hypothesis assumes there is no difference in the population means, the expression (μ 1 - μ 2) is always zero.. As we learned in "Estimating a Population Mean," the t-distribution depends on the degrees of freedom (df).

  16. Some Basic Null Hypothesis Tests

    The most common null hypothesis test for this type of statistical relationship is the t test. In this section, we look at three types of t tests that are used for slightly different research designs: the one-sample t test, the dependent-samples t test, and the independent-samples t test. The one-sample t test is used to compare a sample mean ...

  17. 8.7 Hypothesis Tests for a Population Mean with Unknown Population

    The p-value for a hypothesis test on a population mean is the area in the tail(s) of the distribution of the sample mean. When the population standard deviation is unknown, use the [latex]t[/latex]-distribution to find the p-value.. If the p-value is the area in the left-tail: Use the t.dist function to find the p-value. In the t.dist(t-score, degrees of freedom, logic operator) function:

  18. Examples of null and alternative hypotheses

    It is the opposite of your research hypothesis. The alternative hypothesis--that is, the research hypothesis--is the idea, phenomenon, observation that you want to prove. If you suspect that girls take longer to get ready for school than boys, then: Alternative: girls time > boys time. Null: girls time <= boys time.

  19. PDF Hypothesis Testing for population mean

    Test statistic: The statistic used as a basis for deciding whether the null hypothesis should be rejected. If the test statistic results in a value that is in the rejection region we will reject the null hypothesis, H o. If the test statistic results in a value that is not in the rejection region we will accept the null hypothesis.

  20. Null hypothesis

    The null hypothesis and the alternative hypothesis are types of conjectures used in statistical tests to make statistical inferences, which are formal methods of reaching conclusions and separating scientific claims from statistical noise. The statement being tested in a test of statistical significance is called the null hypothesis. The test of significance is designed to assess the strength ...

  21. 13.1 Understanding Null Hypothesis Testing

    A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the p value. A low p value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A p value that is not low means that ...
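
    The p value described above, the probability of a result at least as extreme as the observed one if $H_0$ were true, can also be estimated by simulation. This is a minimal Monte Carlo sketch assuming $H_0$: $X_i \sim N(\mu_0, \sigma^2)$; the function name, defaults, and seed are illustrative, not from the source.

```python
import random
import statistics

def simulated_p_value(sample, mu0=0.0, sigma=1.0, n_sims=20000, seed=1):
    """Monte Carlo estimate of the two-sided p value for a sample mean.
    Simulates samples under H0 and counts how often the simulated mean
    deviates from mu0 at least as much as the observed mean did."""
    random.seed(seed)
    n = len(sample)
    observed = abs(statistics.mean(sample) - mu0)
    count = 0
    for _ in range(n_sims):
        sim = [random.gauss(mu0, sigma) for _ in range(n)]
        if abs(statistics.mean(sim) - mu0) >= observed:
            count += 1
    return count / n_sims
```

    A sample whose mean equals $\mu_0$ gives a p value of 1 (every simulated mean is at least as extreme), while a sample far from $\mu_0$ gives a p value near 0, leading to rejection of $H_0$.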

  22. Statistical hypothesis test

    A statistical hypothesis test is a method of statistical inference used to decide whether the data sufficiently support a particular hypothesis. A statistical hypothesis test typically involves a calculation of a test statistic. Then a decision is made, either by comparing the test statistic to a critical value or equivalently by evaluating a p ...

  23. Null Hypothesis

    Mean Comparison (Two-sample t-test) $H_0$: ... In the realm of hypothesis testing, the null hypothesis ($H_0$) and alternative hypothesis ($H_1$ or $H_a$) play critical roles. The null hypothesis generally assumes no difference, effect, or relationship between variables, suggesting that any observed change or effect is due to random chance. ...

  24. Blown out of Proportion? Housing Prices, Home Prices and New Tenant

    One way to check a series for mean-reversion is to test whether the series is stationary (exhibiting a constant mean and variance over time), or non-stationary (not converging toward any particular level over time). ... The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test evaluates the null hypothesis that a series is stationary. Table 1: Tests of ...

  25. Frontiers

    The test results, which accept the null hypothesis, indicate that the instrumental variables used are essentially valid and appropriate for the analysis. The estimation results from model (3) show that the lagged level of rural industrial integration has a significantly positive effect on its current level.