Regression Analysis

Regression analysis is a technique that permits one to study and measure the relation between two or more variables. Starting from data recorded in a sample, regression analysis seeks to determine an estimate of a mathematical relation between two or more variables. The goal is to estimate the value of one variable as a function of one or more other variables. The estimated variable is called the dependent variable and is commonly denoted by Y. In contrast, the variables that explain the variations in Y are called independent variables, and they are denoted by X.

When Y depends on only one X, we have simple regression analysis; when Y depends on more than one independent variable, we have multiple regression analysis. If the relation between the dependent and the independent variables is linear, then we have linear regression analysis.
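To make the distinction concrete, here is a minimal sketch in Python; the data, coefficients and variable names are invented for illustration, not taken from the entry.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(150, 190, size=30)            # one independent variable X
y = 0.7 * x - 50 + rng.normal(0, 5, size=30)  # dependent variable Y with noise

# Simple linear regression: Y as a function of one X
b1, b0 = np.polyfit(x, y, deg=1)
print(f"simple:   y = {b0:.1f} + {b1:.2f} x")

# Multiple linear regression: Y as a function of two predictors
x2 = rng.uniform(20, 80, size=30)             # a second independent variable
X = np.column_stack([np.ones(30), x, x2])     # design matrix with intercept column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit
print(f"multiple: y = {coef[0]:.1f} + {coef[1]:.2f} x1 + {coef[2]:.2f} x2")
```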

The pioneer in linear regression analysis, Roger Joseph Boscovich, an astronomer as well as a physicist, was one of the first to find...


Eisenhart, C.: Boscovich and the Combination of Observations. In: Kendall, M., Plackett, R.L. (eds.) Studies in the History of Statistics and Probability, vol. II. Griffin, London (1977)


Galton, F.: Natural Inheritance. Macmillan, London (1889)

Gauss, C.F.: Theoria Motus Corporum Coelestium. Werke, 7 (1809)

Laplace, P.S. de: Sur les degrés mesurés des méridiens, et sur les longueurs observées sur pendule. Histoire de l'Académie royale des inscriptions et belles lettres, avec les Mémoires de littérature tirées des registres de cette académie. Paris (1789)

Legendre, A.M.: Nouvelles méthodes pour la détermination des orbites des comètes. Courcier, Paris (1805)

Plackett, R.L.: The discovery of the method of least squares. In: Kendall, M., Plackett, R.L. (eds.) Studies in the History of Statistics and Probability, vol. II. Griffin, London (1977)

Stigler, S.: The History of Statistics: The Measurement of Uncertainty Before 1900. Belknap, London (1986)

Points of Significance

Multiple linear regression

  • Martin Krzywinski
  • Naomi Altman

Nature Methods volume 12, pages 1103–1104 (2015)


When multiple variables are associated with a response, the interpretation of a prediction equation is seldom simple.


Last month we explored how to model a simple relationship between two variables, such as the dependence of weight on height 1 . In the more realistic scenario of dependence on several variables, we can use multiple linear regression (MLR). Although MLR is similar to linear regression, the interpretation of MLR correlation coefficients is confounded by the way in which the predictor variables relate to one another.

In simple linear regression 1 , we model how the mean of variable Y depends linearly on the value of a predictor variable X; this relationship is expressed as the conditional expectation E(Y|X) = β₀ + β₁X. For more than one predictor variable X₁, ..., Xₚ, this becomes β₀ + Σⱼ βⱼXⱼ. As for simple linear regression, one can use the least-squares estimator (LSE) to determine estimates bⱼ of the βⱼ regression parameters by minimizing the residual sum of squares, SSE = Σᵢ(yᵢ − ŷᵢ)², where ŷᵢ = b₀ + Σⱼ bⱼxᵢⱼ. When we use the regression sum of squares, SSR = Σᵢ(ŷᵢ − ȳ)², the ratio R² = SSR/(SSR + SSE) is the amount of variation explained by the regression model and in multiple regression is called the coefficient of determination.
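These quantities can be computed directly with NumPy's least-squares solver; a minimal sketch on simulated data (the coefficients and seed are arbitrary choices, not from the column):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
beta = np.array([3.0, 1.5, -2.0])                           # true parameters
y = X @ beta + rng.normal(0, 1, size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)  # LSE estimates b_j of beta_j
y_hat = X @ b

sse = np.sum((y - y_hat) ** 2)         # residual sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
r2 = ssr / (ssr + sse)                 # coefficient of determination
print(f"R^2 = {r2:.3f}")
```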

The slope βⱼ is the change in Y if predictor j is changed by one unit while the others are held constant. When normality and independence assumptions are fulfilled, we can test whether any (or all) of the slopes are zero using a t-test (or regression F-test). Although the interpretation of βⱼ seems to be identical to its interpretation in the simple linear regression model, the innocuous phrase "and others are held constant" turns out to have profound implications.

To illustrate MLR—and some of its perils—here we simulate predicting the weight (W, in kilograms) of adult males from their height (H, in centimeters) and their maximum jump height (J, in centimeters). We use a model similar to that presented in our previous column 1 , but we now include the effect of J as E(W|H, J) = βH·H + βJ·J + β₀ + ε, with βH = 0.7, βJ = −0.08, β₀ = −46.5 and normally distributed noise ε with zero mean and σ = 1 (Table 1). We set βJ negative because we expect a negative correlation between W and J when height is held constant (i.e., among men of the same height, lighter men will tend to jump higher). For this example we simulated a sample of size n = 40 with H and J normally distributed with means of 165 cm (σ = 3) and 50 cm (σ = 12.5), respectively.
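A sketch of that simulation under the stated parameters (the random seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
H = rng.normal(165, 3, size=n)       # height (cm)
J = rng.normal(50, 12.5, size=n)     # maximum jump height (cm); uncorrelated with H here
eps = rng.normal(0, 1, size=n)       # noise with zero mean, sigma = 1
W = 0.7 * H - 0.08 * J - 46.5 + eps  # weight (kg) from the stated model
```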

Although the statistical theory for MLR seems similar to that for simple linear regression, the interpretation of the results is much more complex. Problems in interpretation arise entirely as a result of the sample correlation 2 among the predictors. We do, in fact, expect a positive correlation between H and J —tall men will tend to jump higher than short ones. To illustrate how this correlation can affect the results, we generated values using the model for weight with samples of J and H with different amounts of correlation.

Let's look first at the regression coefficients estimated when the predictors are uncorrelated, r(H, J) = 0, as evidenced by the zero slope in the association between H and J (Fig. 1a). Here r is the Pearson correlation coefficient 2 . If we ignore the effect of J and regress W on H, we find Ŵ = 0.71H − 51.7 (R² = 0.66) (Table 1 and Fig. 1b). Ignoring H, we find Ŵ = −0.088J + 69.3 (R² = 0.19). If both predictors are fitted in the regression, we obtain Ŵ = 0.71H − 0.088J − 47.3 (R² = 0.85). This regression fit is a plane in three dimensions (H, J, W) and is not shown in Figure 1. In all three cases, the results of the F-test for zero slopes show high significance (P ≤ 0.005).

Figure 1: (a) Simulated values of uncorrelated predictors, r(H, J) = 0. The thick gray line is the regression line, and thin gray lines show the 95% confidence interval of the fit. (b) Regression of weight (W) on height (H) and of weight on jump height (J) for the uncorrelated predictors shown in a. Regression slopes are shown (bH = 0.71, bJ = −0.088). (c) Simulated values of correlated predictors, r(H, J) = 0.9. Regression and 95% confidence interval are denoted as in a. (d) Regression (red lines) using the correlated predictors shown in c. Light red lines denote the 95% confidence interval. Notice that bJ = 0.097 is now positive. The regression line from b is shown in blue. In all graphs, horizontal and vertical dotted lines show average values.

When the sample correlations of the predictors are exactly zero, the regression slopes (bH and bJ) for the "one predictor at a time" regressions and the multiple regression are identical, and the simple regression R² values sum to the multiple regression R² (0.66 + 0.19 = 0.85; Fig. 2). The intercept changes when we add a predictor with a nonzero mean to satisfy the constraint that the least-squares regression line goes through the sample means, which is always true when the regression model includes an intercept.

Figure 2: Shown are the values of regression coefficient estimates (bH, bJ, b₀) and R², and the significance of the test used to determine whether the coefficient is zero, from 250 simulations at each value of predictor sample correlation −1 < r(H, J) < 1 for each scenario in which H, J or both predictors are fitted in the regression. Thick and thin black curves show the coefficient estimate median and the boundaries of the 10th–90th percentile range, respectively. Histograms show the fraction of estimated P values in different significance ranges, and correlation intervals are highlighted in red where >20% of the P values are >0.01. Actual regression coefficients (βH, βJ, β₀) are marked on vertical axes. The decrease in significance for bJ when jump height is the only predictor and r(H, J) is moderate (red arrow) is due to insufficient statistical power (bJ is close to zero). When predictors are uncorrelated, r(H, J) = 0, the R² of individual regressions sum to the R² of the multiple regression (0.66 + 0.19 = 0.85). Panels are organized to correspond to Table 1, which shows estimates from a single trial at two different predictor correlations.

Balanced factorial experiments show a sample correlation of zero among the predictors when their levels have been fixed. For example, we might fix three heights and three jump heights and select two men representative of each combination, for a total of 18 subjects to be weighed. But if we select the samples and then measure the predictors and response, the predictors are unlikely to have zero correlation.

When we simulate highly correlated predictors, r(H, J) = 0.9 (Fig. 1c), we find that the regression parameters change depending on whether we use one or both predictors (Table 1 and Fig. 1d). If we consider only the effect of H, the coefficient βH = 0.7 is inaccurately estimated as bH = 0.44. If we include only J, we estimate βJ = −0.08 inaccurately, and even with the wrong sign (bJ = 0.097). When we use both predictors, the estimates are quite close to the actual coefficients (bH = 0.63, bJ = −0.056).

In fact, as the correlation between predictors r(H, J) changes, the estimates of the slopes (bH, bJ) and intercept (b₀) vary greatly when only one predictor is fitted. We show the effects of this variation for all values of predictor correlation (both positive and negative) across 250 trials at each value (Fig. 2). We include negative correlation because although J and H are likely to be positively correlated, other scenarios might use negatively correlated predictors (e.g., lung capacity and smoking habits). For example, if we include only H in the regression and ignore the effect of J, bH steadily decreases from about 1 to 0.35 as r(H, J) increases. Why is this? For a given height, larger values of J (an indicator of fitness) are associated with lower weight. If J and H are negatively correlated, as J increases, H decreases, and both changes result in a lower value of W. Conversely, as J decreases, H increases, and thus W increases. If we use only H as a predictor, J is lurking in the background, depressing W at low values of H and enhancing W at high levels of H, so that the effect of H is overestimated (bH increases). The opposite effect occurs when J and H are positively correlated. A similar effect occurs for bJ, which increases in magnitude (becomes more negative) when J and H are negatively correlated. Supplementary Figure 1 shows the effect of correlation when both regression coefficients are positive.
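One way to reproduce this behavior is to draw the predictors from a bivariate normal distribution whose covariance encodes the desired r(H, J), then compare the one-predictor and two-predictor fits. A sketch, assuming the same model parameters as above:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 40, 0.9
sd_H, sd_J = 3.0, 12.5
cov = np.array([[sd_H**2,         r * sd_H * sd_J],
                [r * sd_H * sd_J, sd_J**2]])
H, J = rng.multivariate_normal([165, 50], cov, size=n).T
W = 0.7 * H - 0.08 * J - 46.5 + rng.normal(0, 1, size=n)

# One predictor at a time: the slope absorbs the lurking effect of the omitted variable
bH_alone = np.polyfit(H, W, 1)[0]
bJ_alone = np.polyfit(J, W, 1)[0]

# Both predictors: estimates land near the true coefficients
X = np.column_stack([np.ones(n), H, J])
b = np.linalg.lstsq(X, W, rcond=None)[0]
print(f"H alone: bH = {bH_alone:.2f}; J alone: bJ = {bJ_alone:.3f}")
print(f"both:    bH = {b[1]:.2f}, bJ = {b[2]:.3f}")
```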

When both predictors are fitted (Fig. 2), the regression coefficient estimates (bH, bJ, b₀) are centered at the actual coefficients (βH, βJ, β₀) with the correct sign and magnitude regardless of the correlation of the predictors. However, the standard error in the estimates steadily increases as the absolute value of the predictor correlation increases.

Neglecting important predictors has implications not only for R 2 , which is a measure of the predictive power of the regression, but also for interpretation of the regression coefficients. Unconsidered variables that may have a strong effect on the estimated regression coefficients are sometimes called 'lurking variables'. For example, muscle mass might be a lurking variable with a causal effect on both body weight and jump height. The results and interpretation of the regression will also change if other predictors are added.

Given that missing predictors can affect the regression, should we try to include as many predictors as possible? No, for three reasons. First, any correlation among predictors will increase the standard error of the estimated regression coefficients. Second, having more slope parameters in our model will reduce interpretability and cause problems with multiple testing. Third, the model may suffer from overfitting. As the number of predictors approaches the sample size, we begin fitting the model to the noise. As a result, we may seem to have a very good fit to the data but still make poor predictions.
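The overfitting point can be seen numerically: regressing a pure-noise response on more and more random predictors drives the in-sample R² toward 1 even though there is no signal to find. A sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
y = rng.normal(size=n)  # pure noise: no predictor carries real information

for p in (2, 10, 18):   # number of random predictors
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    print(f"p = {p:2d}: in-sample R^2 = {r2:.2f}")  # climbs toward 1 as p approaches n
```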

MLR is powerful for incorporating many predictors and for estimating the effects of a predictor on the response in the presence of other covariates. However, the estimated regression coefficients depend on the predictors in the model, and they can be quite variable when the predictors are correlated. Accurate prediction of the response is not an indication that regression slopes reflect the true relationship between the predictors and the response.

1. Altman, N. & Krzywinski, M. Nat. Methods 12, 999–1000 (2015).

2. Altman, N. & Krzywinski, M. Nat. Methods 12, 899–900 (2015).


Naomi Altman is a Professor of Statistics at The Pennsylvania State University. Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.


Supplementary Figure 1: Regression coefficients and R². The significance and value of regression coefficients and R² for a model with both regression coefficients positive, E(W|H, J) = 0.7H + 0.08J − 46.5 + ε. The format of the figure is the same as that of Figure 2.



Understanding and interpreting regression analysis

Evidence-Based Nursing, Volume 24, Issue 4

Parveen Ali 1, 2 and Ahtisham Younas 3, 4

  • 1 School of Nursing and Midwifery, University of Sheffield, Sheffield, South Yorkshire, UK
  • 2 Sheffield University Interpersonal Violence Research Group, The University of Sheffield SEAS, Sheffield, UK
  • 3 Faculty of Nursing, Memorial University of Newfoundland, St. John's, Newfoundland and Labrador, Canada
  • 4 Swat College of Nursing, Mingora, Swat, Pakistan

Correspondence to Ahtisham Younas, Memorial University of Newfoundland, St. John's, NL A1C 5S7, Canada; ay6133{at}mun.ca

https://doi.org/10.1136/ebnurs-2021-103425

Introduction

A nurse educator is interested in finding out the academic and non-academic predictors of success in nursing students. Given the complexity of educational and clinical learning environments and the many demographic, clinical and academic factors (age, gender, previous educational training, personal stressors, learning demands, motivation, assignment workload, etc) that influence nursing students' success, she can list various potential factors contributing to success relatively easily. Nevertheless, not all of the identified factors will be plausible predictors of increased success. Therefore, she could use a powerful statistical procedure called regression analysis to identify whether the likelihood of increased success is influenced by factors such as age, stressors, learning demands, motivation and education.

What is regression?

Purposes of regression analysis

Regression analysis has four primary purposes: description, estimation, prediction and control. 1 , 2 By description, regression can explain the relationship between dependent and independent variables. Estimation means that, by using the observed values of the independent variables, the value of the dependent variable can be estimated. 2 Regression analysis can be useful for predicting outcomes and changes in dependent variables based on the relationships of dependent and independent variables. Finally, regression enables controlling for the effect of one or more independent variables while investigating the relationship of one independent variable with the dependent variable. 1

Types of regression analyses

There are commonly three types of regression analyses, namely, linear, logistic and multiple regression. The differences among these types are outlined in Table 1 in terms of their purpose, the nature of the dependent and independent variables, the underlying assumptions, and the nature of the curve. 1 , 3 A more detailed discussion of linear regression follows.

Table 1. Comparison of linear, logistic and multiple regression

Linear regression and interpretation

Linear regression analysis involves examining the relationship between one independent variable and one dependent variable. Statistically, the relationship between one independent variable (x) and a dependent variable (y) is expressed as y = β₀ + β₁x + ε. In this equation, β₀ is the y intercept and refers to the estimated value of y when x is equal to 0. The coefficient β₁ is the regression coefficient and denotes the estimated increase in the dependent variable for every unit increase in the independent variable. The symbol ε is a random error component and signifies the imprecision of regression, indicating that, in actual practice, the independent variables cannot perfectly predict the change in any dependent variable. 1 Multiple linear regression follows the same logic as univariate linear regression, except that (a) in multiple regression there is more than one independent variable and (b) there should be no collinearity among the independent variables.
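As a sketch of fitting and reading these quantities in Python (the variables, study hours and exam scores, are invented stand-ins rather than data from any study discussed here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
study_hours = rng.uniform(0, 20, size=50)                        # independent variable x
exam_score = 55 + 1.8 * study_hours + rng.normal(0, 6, size=50)  # dependent variable y

fit = stats.linregress(study_hours, exam_score)
print(f"b0 (intercept) = {fit.intercept:.1f}")  # estimated y when x = 0
print(f"b1 (slope)     = {fit.slope:.2f}")      # estimated change in y per unit x
print(f"p value        = {fit.pvalue:.2g}")     # test of the null hypothesis that b1 = 0
```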

Factors affecting regression

Linear and multiple regression analyses are affected by several factors, namely, sample size, missing data and the nature of the sample. 2

A small sample size may only demonstrate connections among variables with strong relationships. Therefore, the sample size must be chosen based on the number of independent variables and the expected strength of the relationships.

Many missing values in the data set may reduce the effective sample size. Therefore, all missing values should be adequately dealt with before conducting regression analyses.

Subsamples within the larger sample may mask the actual relationship between the independent and dependent variables. Therefore, if subsamples are predefined, a regression within each subsample can be used to detect true relationships. Otherwise, the analysis should be undertaken on the whole sample.

Building on her research interest mentioned in the beginning, let us consider a study by Ali and Naylor. 4 They were interested in identifying the academic and non-academic factors that predict the academic success of nursing diploma students. This purpose is consistent with one of the above-mentioned purposes of regression analysis (ie, prediction). Ali and Naylor's chosen academic independent variables were preadmission qualification, previous academic performance and school type; the non-academic variables were age, gender, marital status and time gap. To achieve their purpose, they collected data from 628 nursing students aged 15–34 years. They used both linear and multiple regression analyses to identify the predictors of student success. For the analysis, they examined the relationship of academic and non-academic variables across different years of study and noted that academic factors accounted for 36.6%, 44.3% and 50.4% of the variability in the academic success of students in year 1, year 2 and year 3, respectively. 4

Ali and Naylor presented the relationship among these variables using scatter plots, which are commonly used graphs for data display in regression analysis—see examples of various scatter plots in figure 1. 4 In a scatter plot, the clustering of the dots denotes the strength of the relationship, whereas the direction indicates the nature of the relationship among variables: positive (ie, an increase in one variable results in an increase in the other) or negative (ie, an increase in one variable results in a decrease in the other).

Figure 1. An example of a scatter plot for regression.
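A minimal sketch of producing such a scatter plot with a fitted regression line, using simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=60)
y = 2.0 * x + 5 + rng.normal(0, 3, size=60)

b1, b0 = np.polyfit(x, y, 1)
plt.scatter(x, y, alpha=0.6)               # tight clustering = strong relationship
xs = np.sort(x)
plt.plot(xs, b0 + b1 * xs, color="black")  # upward slope = positive relationship
plt.xlabel("Independent variable (x)")
plt.ylabel("Dependent variable (y)")
plt.show()
```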

Table 2 presents the results of the regression analysis of academic and non-academic variables for final-year students' success. The significant predictors of student success are denoted by a significant p value. For every significant predictor, the beta value indicates the percentage increase in students' academic success with a one-unit increase in the variable.

Table 2. Regression model for the final-year students (N=343)

Conclusions

Regression analysis is a powerful and useful statistical procedure with many implications for nursing research. It enables researchers to describe, predict and estimate the relationships among interrelated variables and to draw plausible conclusions about any studied phenomenon. Regression also allows for controlling one or more variables when researchers are interested in examining the relationship among specific variables. Some key considerations that may be useful for researchers undertaking regression analysis have been presented. While planning and conducting regression analysis, researchers should consider the type and number of dependent and independent variables as well as the nature and size of the sample. Choosing the wrong type of regression analysis, or using a small sample, may result in erroneous conclusions about the studied phenomenon.



Cardiopulmonary Physical Therapy Journal, 20(3), September 2009

Regression Analysis for Prediction: Understanding the Process

Phillip B. Palmer and Dennis G. O'Connell

Hardin-Simmons University, Department of Physical Therapy, Abilene, TX

Research related to cardiorespiratory fitness often uses regression analysis in order to predict cardiorespiratory status or future outcomes. Reading these studies can be tedious and difficult unless the reader has a thorough understanding of the processes used in the analysis. This feature seeks to “simplify” the process of regression analysis for prediction in order to help readers understand this type of study more easily. Examples of the use of this statistical technique are provided in order to facilitate better understanding.

INTRODUCTION

Graded, maximal exercise tests that directly measure maximum oxygen consumption (VO2max) are impractical in most physical therapy clinics because they require expensive equipment and personnel trained to administer the tests. Performing these tests in the clinic may also require medical supervision; as a result, researchers have sought to develop exercise and non-exercise models that would allow clinicians to predict VO2max without having to perform direct measurement of oxygen uptake. In most cases, the investigators utilize regression analysis to develop their prediction models.

Regression analysis is a statistical technique for determining the relationship between a single dependent (criterion) variable and one or more independent (predictor) variables. The analysis yields a predicted value for the criterion resulting from a linear combination of the predictors. According to Pedhazur, 15 regression analysis has 2 uses in scientific literature: prediction, including classification, and explanation. The following provides a brief review of the use of regression analysis for prediction. Specific emphasis is given to the selection of the predictor variables (assessing model efficiency and accuracy) and cross-validation (assessing model stability). The discussion is not intended to be exhaustive. For a more thorough explanation of regression analysis, the reader is encouraged to consult one of many books written about this statistical technique (eg, Fox; 5 Kleinbaum, Kupper, & Muller; 12 Pedhazur; 15 and Weisberg 16 ). Examples of the use of regression analysis for prediction are drawn from a study by Bradshaw et al. 3 In this study, the researchers' stated purpose was to develop an equation for prediction of cardiorespiratory fitness (CRF) based on non-exercise (N-EX) data.

SELECTING THE CRITERION (OUTCOME MEASURE)

The first step in regression analysis is to determine the criterion variable. Pedhazur 15 suggests that the criterion have acceptable measurement qualities (ie, reliability and validity). Bradshaw et al 3 used VO2max as the criterion of choice for their model and measured it using a maximum graded exercise test (GXT) developed by George. 6 George 6 indicated that his protocol for testing compared favorably with the Bruce protocol in terms of predictive ability and had good test-retest reliability (ICC = .98–.99). The American College of Sports Medicine indicates that measurement of VO2max is the "gold standard" for measuring cardiorespiratory fitness. 1 These facts support that the criterion selected by Bradshaw et al 3 was appropriate and meets the requirements for acceptable reliability and validity.

SELECTING THE PREDICTORS: MODEL EFFICIENCY

Once the criterion has been selected, predictor variables should be identified (model selection). The aim of model selection is to minimize the number of predictors that account for the maximum variance in the criterion. 15 In other words, the most efficient model maximizes the value of the coefficient of determination (R²). This coefficient estimates the amount of variance in the criterion score accounted for by a linear combination of the predictor variables. The higher the value of R², the less error or unexplained variance and, therefore, the better the prediction. R² is dependent on the multiple correlation coefficient (R), which describes the relationship between the observed and predicted criterion scores. If there is no difference between the predicted and observed scores, R equals 1.00. This represents a perfect prediction with no error and no unexplained variance (R² = 1.00). When R equals 0.00, there is no relationship between the predictor(s) and the criterion and no variance in scores has been explained (R² = 0.00). The chosen variables cannot predict the criterion. The goal of model selection is, as stated previously, to develop a model that results in the highest estimated value for R².

According to Pedhazur, 15 the value of R is often overestimated. The reasons for this are beyond the scope of this discussion; however, the degree of overestimation is affected by sample size. The larger the ratio between the number of predictors and the number of subjects, the larger the overestimation. To account for this, sample sizes should be large, with 15 to 30 subjects per predictor. 11 , 15 Of course, the most effective way to determine the optimal sample size is through statistical power analysis. 11 , 15

Another method of determining the best model for prediction is to test the significance of adding one or more variables to the model using the partial F-test. This process, which is further discussed by Kleinbaum, Kupper, and Muller, 12 allows for exclusion of predictors that do not contribute significantly to the prediction, allowing determination of the most efficient model of prediction. In general, the partial F-test is similar to the F-test used in analysis of variance. It assesses the statistical significance of the difference between values for R² derived from 2 or more prediction models using a subset of the variables from the original equation. For example, Bradshaw et al 3 indicated that all variables contributed significantly to their prediction. Though the researchers do not detail the procedure used, it is highly likely that different models were tested, excluding one or more variables, and the resulting values for R² assessed for statistical difference.
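A generic sketch of the partial F-test computed from the residual sums of squares of two nested models (an illustration of the idea, not the procedure or code used by Bradshaw et al):

```python
import numpy as np
from scipy import stats

def partial_f_test(y, X_full, X_reduced):
    """F-test for the predictors in X_full that are absent from X_reduced."""
    def sse(X):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ b) ** 2)
    n = len(y)
    q = X_full.shape[1] - X_reduced.shape[1]  # number of predictors being tested
    df_full = n - X_full.shape[1]             # error df of the full model
    F = ((sse(X_reduced) - sse(X_full)) / q) / (sse(X_full) / df_full)
    return F, stats.f.sf(F, q, df_full)       # F statistic and its p value

# Example: does x2 contribute beyond x1?
rng = np.random.default_rng(7)
n = 60
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.8 * x1 + 0.5 * x2 + rng.normal(0, 1, size=n)
X_reduced = np.column_stack([np.ones(n), x1])
X_full = np.column_stack([X_reduced, x2])
F, p = partial_f_test(y, X_full, X_reduced)
print(f"F = {F:.2f}, p = {p:.3g}")
```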

Although the techniques discussed above are useful in determining the most efficient model for prediction, theory must be considered in choosing the appropriate variables. Previous research should be examined and predictors selected for which a relationship between the criterion and predictors has been established. 12 , 15

It is clear that Bradshaw et al 3 relied on theory and previous research to determine the variables to use in their prediction equation. The 5 variables they chose for inclusion–gender, age, body mass index (BMI), perceived functional ability (PFA), and physical activity rating (PA-R)–had been shown in previous studies to contribute to the prediction of VO2max (eg, Heil et al; 8 George, Stone, & Burkett 7 ). These 5 predictors accounted for 87% (R = .93, R² = .87) of the variance in the predicted values for VO2max. Based on a ratio of 1:20 (predictor:sample size), this estimate of R, and thus R², is not likely to be overestimated. The researchers used changes in the value of R² to determine whether to include or exclude these or other variables. They reported that removal of perceived functional ability (PFA) as a variable resulted in a decrease in R from .93 to .89. Without this variable, the remaining 4 predictors would account for only 79% of the variance in VO2max. The investigators did note that each predictor variable contributed significantly (p < .05) to the prediction of VO2max (see the discussion of the partial F-test above).

ASSESSING ACCURACY OF THE PREDICTION

Assessing the accuracy of the model is best accomplished by analyzing the standard error of estimate (SEE) and the percentage that the SEE represents of the predicted mean (SEE%). The SEE represents the degree to which the predicted scores vary from the observed scores on the criterion measure, similar to the standard deviation used in other statistical procedures. According to Jackson, 10 lower values of the SEE indicate greater accuracy in prediction. Comparison of the SEE for different models using the same sample allows for determination of the most accurate model to use for prediction. SEE% is calculated by dividing the SEE by the mean of the criterion (SEE/mean criterion) and can be used to compare different models derived from different samples.
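The article does not state a formula for the SEE, but a common definition divides the residual sum of squares by the error degrees of freedom; a sketch under that assumption:

```python
import numpy as np

def see_and_percent(y, y_hat, k):
    """SEE and SEE%, assuming SEE = sqrt(SSE / (n - k - 1)) for k predictors."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)
    see = np.sqrt(sse / (n - k - 1))      # standard error of estimate
    return see, 100.0 * see / np.mean(y)  # SEE as a percentage of the criterion mean
```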

Bradshaw et al 3 report a SEE of 3.44 mL·kg⁻¹·min⁻¹ (approximately 1 MET) using all 5 variables in the equation (gender, age, BMI, PFA, PA-R). When the PFA variable is removed from the model, leaving only 4 variables for the prediction (gender, age, BMI, PA-R), the SEE increases to 4.20 mL·kg⁻¹·min⁻¹. The increase in the error term indicates that the model excluding PFA is less accurate in predicting VO2max. This is confirmed by the decrease in the value for R (see discussion above). The researchers compare their model of prediction with that of George, Stone, and Burkett, 7 indicating that their model is as accurate. It is not advisable to compare models based on the SEE if the data were collected from different samples, as they were in these 2 studies. That type of comparison should be made using SEE%. Bradshaw and colleagues 3 report the SEE% for their model (8.62%), but do not report values from other models in making comparisons.

Some advocate the use of statistics derived from the predicted residual sum of squares ( PRESS ) as a means of selecting predictors. 2 , 4 , 16 These statistics are used more often in cross-validation of models and will be discussed in greater detail later.

ASSESSING STABILITY OF THE MODEL FOR PREDICTION

Once the most efficient and accurate model for prediction has been determined, it is prudent that the model be assessed for stability. A model, or equation, is said to be “stable” if it can be applied to different samples from the same population without losing the accuracy of the prediction. This is accomplished through cross-validation of the model. Cross-validation determines how well the prediction model developed using one sample performs in another sample from the same population. Several methods can be employed for cross-validation, including the use of 2 independent samples, split samples, and PRESS -related statistics developed from the same sample.

Using 2 independent samples involves random selection of 2 groups from the same population. One group becomes the "training" or "exploratory" group used for establishing the model of prediction. 5 The second group, the "confirmatory" or "validatory" group, is used to assess the model for stability. The researcher compares R² values from the 2 groups; the assessment of "shrinkage," the difference between the two values for R², is used as an indicator of model stability. There is no rule of thumb for interpreting the differences, but Kleinbaum, Kupper, and Muller 12 suggest that "shrinkage" values of less than 0.10 indicate a stable model. While preferable, the use of independent samples is rarely employed due to cost considerations.

A similar technique of cross-validation uses split samples. Once the sample has been selected from the population, it is randomly divided into 2 subgroups. One subgroup becomes the "exploratory" group and the other is used as the "validatory" group. Again, values for R² are compared and model stability is assessed by calculating "shrinkage."
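A sketch of split-sample cross-validation, assuming the validatory R² is computed with the exploratory group's coefficients so that the difference between the two R² values estimates the shrinkage:

```python
import numpy as np

def split_sample_shrinkage(X, y, seed=0):
    """Fit on a random half of the data, then compare R^2 between the halves."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    explore, validate = idx[:half], idx[half:]

    b, *_ = np.linalg.lstsq(X[explore], y[explore], rcond=None)

    def r2(rows):
        resid = y[rows] - X[rows] @ b
        return 1 - np.sum(resid ** 2) / np.sum((y[rows] - y[rows].mean()) ** 2)

    return r2(explore) - r2(validate)  # "shrinkage"; < 0.10 suggests a stable model
```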

Holiday, Ballard, and McKeown 9 advocate the use of PRESS-related statistics for cross-validation of regression models as a means of dealing with the problems of data splitting. The PRESS method is a jackknife analysis that is used to address the issue of estimate bias associated with the use of small sample sizes. 13 In general, a jackknife analysis calculates the desired test statistic multiple times with individual cases omitted from the calculations. In the case of the PRESS method, residuals, or the differences between the actual values of the criterion for each individual and the predicted value using the formula derived with the individual's data removed from the prediction, are calculated. The PRESS statistic is the sum of the squares of the residuals derived from these calculations and is similar to the sum of squares for the error (SS_error) used in analysis of variance (ANOVA). Myers 14 discusses the use of the PRESS statistic and describes in detail how it is calculated. The reader is referred to this text and the article by Holiday, Ballard, and McKeown 9 for additional information.

Once determined, the PRESS statistic can be used to calculate a modified form of R² and the SEE. R²_PRESS is calculated using the following formula: R²_PRESS = 1 − [PRESS/SS_total], where SS_total equals the sum of squares for the original regression equation. 14 The standard error of the estimate for PRESS is calculated as SEE_PRESS = √(PRESS/n), where n equals the number of individual cases. 14 The smaller the difference between the 2 values for R² and SEE, the more stable the model for prediction. Bradshaw et al 3 used this technique in their investigation. They reported a value for R²_PRESS of .83, a decrease of .04 from the R² for their prediction model. Using the standard set by Kleinbaum, Kupper, and Muller, 12 the model developed by these researchers would appear to have stability, meaning it could be used for prediction in samples from the same population. This is further supported by the small difference between the SEE and the SEE_PRESS, 3.44 and 3.63 mL·kg⁻¹·min⁻¹, respectively.
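A sketch of the PRESS calculation; rather than literally refitting the model n times, it uses the standard hat-matrix identity that the leave-one-out residual equals the ordinary residual divided by (1 − hᵢᵢ):

```python
import numpy as np

def press_statistics(X, y):
    """PRESS, R^2_PRESS and SEE_PRESS for a linear model with design matrix X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    h = np.diag(X @ np.linalg.pinv(X.T @ X) @ X.T)  # leverages h_ii
    loo_resid = resid / (1 - h)                     # leave-one-out residuals
    press = np.sum(loo_resid ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    r2_press = 1 - press / ss_total                 # R^2_PRESS = 1 - PRESS/SS_total
    see_press = np.sqrt(press / len(y))             # SEE_PRESS = sqrt(PRESS/n)
    return press, r2_press, see_press
```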

COMPARING TWO DIFFERENT PREDICTION MODELS

A comparison of 2 different models for prediction may help to clarify the use of regression analysis in prediction. Table 1 presents data from 2 studies and will be used in the following discussion.

Table 1. Comparison of Two Non-exercise Models for Predicting CRF

Variable                              Heil et al (n = 374)    Bradshaw et al (n = 100)
Intercept                             36.580                  48.073
Gender (male = 1, female = 0)         3.706                   6.178
Age (years)                           0.558                   −0.246
Age²                                  −7.81E−3                —
Percent body fat                      −0.541                  —
Body mass index (kg·m⁻²)              —                       −0.619
Activity code (0–7)                   1.347                   —
Physical activity rating (0–10)       —                       0.671
Perceived functional ability          —                       0.712
R (R²)                                .88 (.77)               .93 (.87)
SEE (mL·kg⁻¹·min⁻¹)                   4.90                    3.44
SEE%                                  12.7%                   8.6%

As noted above, the first step is to select an appropriate criterion, or outcome measure. Bradshaw et al 3 selected VO2max as their criterion for measuring cardiorespiratory fitness. Heil et al 8 used VO2peak. These 2 measures are often considered to be the same; however, VO2peak assumes that the conditions for measuring maximum oxygen consumption were not met. 17 It would be optimal to compare models based on the same criterion, but that is not essential, especially since both criteria measure cardiorespiratory fitness in much the same way.

The second step involves selection of variables for prediction. As can be seen in Table 1, both groups of investigators selected 5 variables to use in their model. The 5 variables selected by Bradshaw et al 3 provide a better prediction based on the values for R² (.87 and .77), indicating that their model accounts for more variance (87% versus 77%) in the prediction than the model of Heil et al. 8 It should also be noted that the SEE calculated in the Bradshaw 3 model (3.44 mL·kg⁻¹·min⁻¹) is less than that reported by Heil et al 8 (4.90 mL·kg⁻¹·min⁻¹). Remember, however, that comparison of the SEE should only be made when both models are developed using samples from the same population. Comparing predictions developed from different populations can be accomplished using the SEE%. Review of the values for SEE% in Table 1 would seem to indicate that the model developed by Bradshaw et al 3 is more accurate because the percentage of the mean value for VO2max represented by error is less than that reported by Heil et al. 8 In summary, the Bradshaw 3 model would appear to be more efficient, accounting for more variance in the prediction using the same number of variables. It would also appear to be more accurate based on comparison of the SEE%.

The 2 models cannot be compared based on stability of the models. Each set of researchers used different methods for cross-validation. Both models, however, appear to be relatively stable based on the data presented. A clinician can assume that either model would perform fairly well when applied to samples from the same populations as those used by the investigators.

The purpose of this brief review has been to demystify regression analysis for prediction by explaining it in simple terms and to demonstrate its use. When reviewing research articles in which regression analysis has been used for prediction, physical therapists should ensure that: (1) the criterion chosen for the study is appropriate and meets the standards for reliability and validity; (2) the processes used by the investigators to assess both model efficiency and accuracy are appropriate; (3) the predictors selected for use in the model are reasonable based on theory or previous research; and (4) the investigators assessed model stability through a process of cross-validation, providing the opportunity for others to utilize the prediction model in different samples drawn from the same population.


