Reliability vs. Validity in Research | Difference, Types and Examples

Published on July 3, 2019 by Fiona Middleton. Revised on June 22, 2023.

Reliability and validity are concepts used to evaluate the quality of research. They indicate how well a method, technique, or test measures something. Reliability is about the consistency of a measure, and validity is about the accuracy of a measure.

It’s important to consider reliability and validity when you are creating your research design, planning your methods, and writing up your results, especially in quantitative research. Failing to do so can lead to several types of research bias and seriously affect your work.

Reliability vs validity

| | Reliability | Validity |
|---|---|---|
| What does it tell you? | The extent to which the results can be reproduced when the research is repeated under the same conditions. | The extent to which the results really measure what they are supposed to measure. |
| How is it assessed? | By checking the consistency of results across time, across different observers, and across parts of the test itself. | By checking how well the results correspond to established theories and other measures of the same concept. |
| How do they relate? | A reliable measurement is not always valid: the results might be reproducible, but they’re not necessarily correct. | A valid measurement is generally reliable: if a test produces accurate results, they should be reproducible. |

Table of contents

  • Understanding reliability vs validity
  • How are reliability and validity assessed?
  • How to ensure validity and reliability in your research
  • Where to write about reliability and validity in a thesis
  • Other interesting articles

Understanding reliability vs validity

Reliability and validity are closely related, but they mean different things. A measurement can be reliable without being valid. However, if a measurement is valid, it is usually also reliable.

What is reliability?

Reliability refers to how consistently a method measures something. If the same result can be consistently achieved by using the same methods under the same circumstances, the measurement is considered reliable.

What is validity?

Validity refers to how accurately a method measures what it is intended to measure. If research has high validity, that means it produces results that correspond to real properties, characteristics, and variations in the physical or social world.

High reliability is one indicator that a measurement is valid. If a method is not reliable, it probably isn’t valid.

For example, suppose you measure the temperature of a liquid sample several times under identical conditions. If the thermometer shows different temperatures each time, even though you have carefully controlled conditions to ensure the sample’s temperature stays the same, the thermometer is probably malfunctioning, and therefore its measurements are not valid.

However, reliability on its own is not enough to ensure validity. Even if a test is reliable, it may not accurately reflect the real situation.

Validity is harder to assess than reliability, but it is even more important. To obtain useful results, the methods you use to collect data must be valid: the research must be measuring what it claims to measure. This ensures that your discussion of the data and the conclusions you draw are also valid.


How are reliability and validity assessed?

Reliability can be estimated by comparing different versions of the same measurement. Validity is harder to assess, but it can be estimated by comparing the results to other relevant data or theory. Methods of estimating reliability and validity are usually split up into different types.

Types of reliability

Different types of reliability can be estimated through various statistical methods.

| Type of reliability | What does it assess? | Example |
|---|---|---|
| Test-retest | The consistency of a measure across time: do you get the same results when you repeat the measurement? | A group of participants complete a questionnaire designed to measure personality traits. If they repeat the questionnaire days, weeks, or months apart and give the same answers, this indicates high test-retest reliability. |
| Interrater | The consistency of a measure across raters or observers: do you get the same results when different people conduct the same measurement? | Based on an assessment criteria checklist, five examiners submit substantially different results for the same student project. This indicates that the assessment checklist has low inter-rater reliability (for example, because the criteria are too subjective). |
| Internal consistency | The consistency of the measurement itself: do you get the same results from different parts of a test that are designed to measure the same thing? | You design a questionnaire to measure self-esteem. If you randomly split the results into two halves, there should be a strong correlation between the two sets of results. If the two results are very different, this indicates low internal consistency. |
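In practice these estimates reduce to simple statistics. The sketch below is a minimal illustration in Python, assuming hypothetical simulated questionnaire scores (all variable names and numbers are illustrative): test-retest reliability is estimated as a Pearson correlation between two administrations, and internal consistency as Cronbach's alpha across items.

```python
# A minimal sketch, assuming hypothetical simulated questionnaire scores,
# of how two of the reliability types above can be estimated.
# Dependencies: numpy, scipy.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Test-retest reliability: correlate scores from two administrations of
# the same questionnaire by the same participants, weeks apart.
time1 = rng.normal(50, 10, size=100)
time2 = time1 + rng.normal(0, 3, size=100)  # scores are mostly stable
r_test_retest, _ = pearsonr(time1, time2)

# Internal consistency (Cronbach's alpha): items measuring the same
# construct should covary across participants (rows = people, cols = items).
def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

trait = rng.normal(0, 1, size=(100, 1))
items = trait + rng.normal(0, 0.8, size=(100, 5))  # five related items

print(f"test-retest r = {r_test_retest:.2f}")
print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")
```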

Types of validity

The validity of a measurement can be estimated based on three main types of evidence. Each type can be evaluated through expert judgement or statistical methods.

| Type of validity | What does it assess? | Example |
|---|---|---|
| Construct | The adherence of a measure to existing theory and knowledge of the concept being measured. | A self-esteem questionnaire could be assessed by measuring other traits known or assumed to be related to the concept of self-esteem (such as social skills and optimism). Strong correlation between the scores for self-esteem and associated traits would indicate high construct validity. |
| Content | The extent to which the measurement covers all aspects of the concept being measured. | A test that aims to measure a class of students’ level of Spanish contains reading, writing, and speaking components, but no listening component. Experts agree that listening comprehension is an essential aspect of language ability, so the test lacks content validity for measuring the overall level of ability in Spanish. |
| Criterion | The extent to which the result of a measure corresponds to other valid measures of the same concept. | A survey is conducted to measure the political opinions of voters in a region. If the results accurately predict the later outcome of an election in that region, this indicates that the survey has high criterion validity. |
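Criterion validity, in particular, is often quantified with a correlation coefficient. Here is a minimal sketch, assuming hypothetical simulated data, of correlating a new measure against an established criterion; a strong correlation supports validity, and construct validity can be checked the same way against theoretically related traits.

```python
# A minimal sketch, assuming hypothetical simulated data: scores on a new
# measure compared against an established criterion for the same people.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

new_measure = rng.normal(0, 1, size=200)                      # e.g., survey scores
criterion = 0.7 * new_measure + rng.normal(0, 0.7, size=200)  # e.g., later outcome

r, p = pearsonr(new_measure, criterion)
print(f"criterion validity estimate: r = {r:.2f} (p = {p:.3g})")
```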

To assess the validity of a cause-and-effect relationship, you also need to consider internal validity (the design of the experiment) and external validity (the generalizability of the results).

How to ensure validity and reliability in your research

The reliability and validity of your results depend on creating a strong research design, choosing appropriate methods and samples, and conducting the research carefully and consistently.

Ensuring validity

If you use scores or ratings to measure variations in something (such as psychological traits, levels of ability or physical properties), it’s important that your results reflect the real variations as accurately as possible. Validity should be considered in the very earliest stages of your research, when you decide how you will collect your data.

  • Choose appropriate methods of measurement

Ensure that your method and measurement technique are high quality and targeted to measure exactly what you want to know. They should be thoroughly researched and based on existing knowledge.

For example, to collect data on a personality trait, you could use a standardized questionnaire that is considered reliable and valid. If you develop your own questionnaire, it should be based on established theory or findings of previous studies, and the questions should be carefully and precisely worded.

  • Use appropriate sampling methods to select your subjects

To produce valid and generalizable results, clearly define the population you are researching (e.g., people from a specific age range, geographical location, or profession). Ensure that you have enough participants and that they are representative of the population. Failing to do so can lead to sampling bias and selection bias.

Ensuring reliability

Reliability should be considered throughout the data collection process. When you use a tool or technique to collect data, it’s important that the results are precise, stable, and reproducible.

  • Apply your methods consistently

Plan your method carefully to make sure you carry out the same steps in the same way for each measurement. This is especially important if multiple researchers are involved.

For example, if you are conducting interviews or observations, clearly define how specific behaviors or responses will be counted, and make sure questions are phrased the same way each time. Failing to do so can lead to errors such as omitted variable bias or information bias.

  • Standardize the conditions of your research

When you collect your data, keep the circumstances as consistent as possible to reduce the influence of external factors that might create variation in the results.

For example, in an experimental setup, make sure all participants are given the same information and tested under the same conditions, preferably in a properly randomized setting. Failing to do so can lead to a placebo effect, Hawthorne effect, or other demand characteristics. If participants can guess the aims or objectives of a study, they may attempt to act in more socially desirable ways.

Where to write about reliability and validity in a thesis

It’s appropriate to discuss reliability and validity in various sections of your thesis, dissertation, or research paper. Showing that you have taken them into account in planning your research and interpreting the results makes your work more credible and trustworthy.

Reliability and validity in a thesis

| Section | Discuss |
|---|---|
| Literature review | What have other researchers done to devise and improve methods that are reliable and valid? |
| Methodology | How did you plan your research to ensure reliability and validity of the measures used? This includes the chosen sample set and size, sample preparation, external conditions, and measuring techniques. |
| Results | If you calculate reliability and validity, state these values alongside your main results. |
| Discussion | This is the moment to talk about how reliable and valid your results actually were. Were they consistent, and did they reflect true values? If not, why not? |
| Conclusion | If reliability and validity were a big problem for your findings, it might be helpful to mention this here. |


Other interesting articles

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Degrees of freedom
  • Null hypothesis
  • Discourse analysis
  • Control groups
  • Mixed methods research
  • Non-probability sampling
  • Quantitative research
  • Ecological validity

Research bias

  • Rosenthal effect
  • Implicit bias
  • Cognitive bias
  • Selection bias
  • Negativity bias
  • Status quo bias



Validity – Types, Examples and Guide


Validity

Validity is a fundamental concept in research, referring to the extent to which a test, measurement, or study accurately reflects or assesses the specific concept that the researcher is attempting to measure. Ensuring validity is crucial as it determines the trustworthiness and credibility of the research findings.

Research Validity

Research validity pertains to the accuracy and truthfulness of the research. It examines whether the research truly measures what it claims to measure. Without validity, research results can be misleading or erroneous, leading to incorrect conclusions and potentially flawed applications.

How to Ensure Validity in Research

Ensuring validity in research involves several strategies:

  • Clear Operational Definitions : Define variables clearly and precisely.
  • Use of Reliable Instruments : Employ measurement tools that have been tested for reliability.
  • Pilot Testing : Conduct preliminary studies to refine the research design and instruments.
  • Triangulation : Use multiple methods or sources to cross-verify results.
  • Control Variables : Control extraneous variables that might influence the outcomes.

Types of Validity

Validity is categorized into several types, each addressing different aspects of measurement accuracy.

Internal Validity

Internal validity refers to the degree to which the results of a study can be attributed to the treatments or interventions rather than other factors. It is about ensuring that the study is free from confounding variables that could affect the outcome.

External Validity

External validity concerns the extent to which the research findings can be generalized to other settings, populations, or times. High external validity means the results are applicable beyond the specific context of the study.

Construct Validity

Construct validity evaluates whether a test or instrument measures the theoretical construct it is intended to measure. It involves ensuring that the test is truly assessing the concept it claims to represent.

Content Validity

Content validity examines whether a test covers the entire range of the concept being measured. It ensures that the test items represent all facets of the concept.

Criterion Validity

Criterion validity assesses how well one measure predicts an outcome based on another measure. It is divided into two types:

  • Predictive Validity : How well a test predicts future performance.
  • Concurrent Validity : How well a test correlates with a currently existing measure.

Face Validity

Face validity refers to the extent to which a test appears to measure what it is supposed to measure, based on superficial inspection. While it is the least scientific measure of validity, it is important for ensuring that stakeholders believe in the test’s relevance.

Importance of Validity

Validity is crucial because it directly affects the credibility of research findings. Valid results ensure that conclusions drawn from research are accurate and can be trusted. This, in turn, influences the decisions and policies based on the research.

Examples of Validity

  • Internal Validity : A randomized controlled trial (RCT) where the random assignment of participants helps eliminate biases.
  • External Validity : A study on educational interventions that can be applied to different schools across various regions.
  • Construct Validity : A psychological test that accurately measures depression levels.
  • Content Validity : An exam that covers all topics taught in a course.
  • Criterion Validity : A job performance test that predicts future job success.

Where to Write About Validity in a Thesis

In a thesis, the methodology section should include discussions about validity. Here, you explain how you ensured the validity of your research instruments and design. Additionally, you may discuss validity in the results section, interpreting how the validity of your measurements affects your findings.

Applications of Validity

Validity has wide applications across various fields:

  • Education : Ensuring assessments accurately measure student learning.
  • Psychology : Developing tests that correctly diagnose mental health conditions.
  • Market Research : Creating surveys that accurately capture consumer preferences.

Limitations of Validity

While ensuring validity is essential, it has its limitations:

  • Complexity : Achieving high validity can be complex and resource-intensive.
  • Context-Specific : Some validity types may not be universally applicable across all contexts.
  • Subjectivity : Certain types of validity, like face validity, involve subjective judgments.

By understanding and addressing these aspects of validity, researchers can enhance the quality and impact of their studies, leading to more reliable and actionable results.

About the author


Muhammad Hassan

Researcher, Academic Writer, Web developer



Reliability and validity: Importance in Medical Research

Affiliations

  • 1 Al-Nafees Medical College,Isra University, Islamabad, Pakistan.
  • 2 Fauji Foundation Hospital, Foundation University Medical College, Islamabad, Pakistan.
  • PMID: 34974579
  • DOI: 10.47391/JPMA.06-861

Reliability and validity are among the most important and fundamental domains in the assessment of any measuring methodology for data collection in good research. Validity is about what an instrument measures and how well it does so, whereas reliability concerns the trustworthiness of the data obtained and the degree to which any measuring tool controls random error. The current narrative review was planned to discuss the importance of reliability and validity of data collection or measurement techniques used in research. It comprehensively describes and explores the reliability and validity of research instruments, and also discusses different forms of reliability and validity with concise examples. An attempt has been made to give a brief literature review regarding the significance of reliability and validity in medical sciences.

Keywords: Validity; Reliability; Medical research; Methodology; Assessment; Research tools.


Validity In Psychology Research: Types & Examples

Saul McLeod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul McLeod, PhD., is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.


Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.

In psychology research, validity refers to the extent to which a test or measurement tool accurately measures what it’s intended to measure. It ensures that the research findings are genuine and not due to extraneous factors.

Validity can be categorized into different types, broadly grouped under internal and external validity.

The concept of validity was formulated by Kelley (1927, p. 14), who stated that a test is valid if it measures what it claims to measure. For example, a test of intelligence should measure intelligence and not something else (such as memory).

Internal and External Validity In Research

Internal validity refers to whether the effects observed in a study are due to the manipulation of the independent variable and not some other confounding factor.

In other words, there is a causal relationship between the independent and dependent variables .

Internal validity can be improved by controlling extraneous variables, using standardized instructions, counterbalancing, and eliminating demand characteristics and investigator effects.

External validity refers to the extent to which the results of a study can be generalized to other settings (ecological validity), other people (population validity), and over time (historical validity).

External validity can be improved by conducting experiments in more natural settings and by using random sampling to select participants.

Types of Validity In Psychology

Two main categories of validity are used to assess the validity of a test (i.e., questionnaire, interview, IQ test, etc.): content and criterion.

  • Content validity refers to the extent to which a test or measurement represents all aspects of the intended content domain. It assesses whether the test items adequately cover the topic or concept.
  • Criterion validity assesses the performance of a test based on its correlation with a known external criterion or outcome. It can be further divided into concurrent (measured at the same time) and predictive (measuring future performance) validity.

[Figure: table showing the different types of validity]

Face Validity

Face validity is simply whether the test appears (at face value) to measure what it claims to. This is the least sophisticated measure of content-related validity, and is a superficial and subjective assessment based on appearance.

Tests wherein the purpose is clear, even to naïve respondents, are said to have high face validity. Accordingly, tests wherein the purpose is unclear have low face validity (Nevo, 1985).

A direct measurement of face validity is obtained by asking people to rate the validity of a test as it appears to them. Raters could use a Likert scale to assess face validity.

For example:

  • The test is extremely suitable for a given purpose
  • The test is very suitable for that purpose
  • The test is adequate
  • The test is inadequate
  • The test is irrelevant and, therefore, unsuitable

It is important to select suitable people to rate a test (e.g., questionnaire, interview, IQ test, etc.). For example, individuals who actually take the test would be well placed to judge its face validity.

Also, people who work with the test could offer their opinion (e.g., employers, university administrators). Finally, the researcher could use members of the general public with an interest in the test (e.g., parents of testees, politicians, teachers, etc.).

The face validity of a test can be considered a robust construct only if a reasonable level of agreement exists among raters.
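One simple way to check that agreement is to summarize the raters' Likert scores directly. The sketch below assumes hypothetical ratings from ten raters on the five-point scale above; the spread-based agreement criterion is illustrative, not a standard.

```python
# A minimal sketch, assuming hypothetical 1-5 Likert ratings of a test's
# face validity from ten raters (1 = irrelevant ... 5 = extremely suitable).
import numpy as np

ratings = np.array([4, 5, 4, 4, 3, 5, 4, 4, 5, 4])

mean_rating = ratings.mean()
sd_rating = ratings.std(ddof=1)
# Crude agreement check: proportion of raters within one point of the median.
agreement = np.mean(np.abs(ratings - np.median(ratings)) <= 1)

print(f"mean rating = {mean_rating:.1f}, sd = {sd_rating:.2f}")
print(f"raters within 1 point of the median: {agreement:.0%}")
```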

It should be noted that the term face validity should be avoided when the rating is done by an “expert,” as content validity is more appropriate.

Having face validity does not mean that a test really measures what the researcher intends to measure, but only that, in the judgment of raters, it appears to do so. Consequently, it is a crude and basic measure of validity.

A test item such as “ I have recently thought of killing myself ” has obvious face validity as an item measuring suicidal cognitions and may be useful when measuring symptoms of depression.

However, the implication of items on tests with clear face validity is that they are more vulnerable to social desirability bias. Individuals may manipulate their responses to deny or hide problems or exaggerate behaviors to present a positive image of themselves.

It is possible for a test item to lack face validity but still have general validity and measure what it claims to measure. This is good because it reduces demand characteristics and makes it harder for respondents to manipulate their answers.

For example, the test item “ I believe in the second coming of Christ ” would lack face validity as a measure of depression (as the purpose of the item is unclear).

This item appeared on the first version of The Minnesota Multiphasic Personality Inventory (MMPI) and loaded on the depression scale.

Because most of the original normative sample of the MMPI were good Christians, only a depressed Christian would think Christ is not coming back. Thus, for this particular religious sample, the item does have general validity but not face validity.

Construct Validity

Construct validity assesses how well a test or measure represents and captures an abstract theoretical concept, known as a construct. It indicates the degree to which the test accurately reflects the construct it intends to measure, often evaluated through relationships with other variables and measures theoretically connected to the construct.

Construct validity was invented by Cronbach and Meehl (1955). This type of validity refers to the extent to which a test captures a specific theoretical construct or trait, and it overlaps with some of the other aspects of validity.

Construct validity does not concern the simple, factual question of whether a test measures an attribute.

Instead, it is about the complex question of whether test score interpretations are consistent with a nomological network involving theoretical and observational terms (Cronbach & Meehl, 1955).

To test for construct validity, it must be demonstrated that the phenomenon being measured actually exists. So, the construct validity of a test for intelligence, for example, depends on a model or theory of intelligence.

Construct validity entails demonstrating the power of such a construct to explain a network of research findings and to predict further relationships.

The more evidence a researcher can demonstrate for a test’s construct validity, the better. However, there is no single method of determining the construct validity of a test.

Instead, different methods and approaches are combined to present the overall construct validity of a test. For example, factor analysis and correlational methods can be used.
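As one illustration of the factor-analytic route, the sketch below fits an exploratory factor analysis to hypothetical simulated item responses (scikit-learn is assumed); if the test has construct validity, the estimated loadings should recover the intended two-factor structure.

```python
# A minimal sketch, assuming hypothetical simulated responses: six items
# designed so that items 0-2 tap latent trait A and items 3-5 tap trait B.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)

latent = rng.normal(size=(300, 2))                   # two latent traits
loadings = np.array([[1, 0], [1, 0], [1, 0],
                     [0, 1], [0, 1], [0, 1]], dtype=float)
responses = latent @ loadings.T + rng.normal(0, 0.5, size=(300, 6))

fa = FactorAnalysis(n_components=2, random_state=0).fit(responses)
# Rows = items, columns = factors; a clean two-factor pattern here is
# evidence that the items measure the constructs they were written for.
print(np.round(fa.components_.T, 2))
```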

Convergent validity

Convergent validity is a subtype of construct validity. It assesses the degree to which two measures that theoretically should be related are related.

It demonstrates that measures of similar constructs are highly correlated. It helps confirm that a test accurately measures the intended construct by showing its alignment with other tests designed to measure the same or similar constructs.

For example, suppose there are two different scales used to measure self-esteem: Scale A and Scale B. If both scales effectively measure self-esteem, then individuals who score high on Scale A should also score high on Scale B, and those who score low on Scale A should score similarly low on Scale B.

If the scores from these two scales show a strong positive correlation, then this provides evidence for convergent validity because it indicates that both scales seem to measure the same underlying construct of self-esteem.
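A minimal sketch of this Scale A / Scale B example, assuming hypothetical simulated self-esteem scores for the same participants on both scales:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)

true_self_esteem = rng.normal(0, 1, size=150)
scale_a = true_self_esteem + rng.normal(0, 0.4, size=150)  # Scale A scores
scale_b = true_self_esteem + rng.normal(0, 0.4, size=150)  # Scale B scores

r, _ = pearsonr(scale_a, scale_b)
print(f"convergent validity estimate: r = {r:.2f}")  # strong positive r expected
```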

Concurrent Validity (i.e., occurring at the same time)

Concurrent validity evaluates how well a test’s results correlate with the results of a previously established and accepted measure, when both are administered at the same time.

It helps in determining whether a new measure is a good reflection of an established one without waiting to observe outcomes in the future.

If the new test is validated by comparison with a currently existing criterion, we have concurrent validity.

Very often, a new IQ or personality test might be compared with an older but similar test known to have good validity already.

Predictive Validity

Predictive validity assesses how well a test predicts a criterion that will occur in the future. It measures the test’s ability to foresee the performance of an individual on a related criterion measured at a later point in time. It gauges the test’s effectiveness in predicting subsequent real-world outcomes or results.

For example, a prediction may be made on the basis of a new intelligence test that high scorers at age 12 will be more likely to obtain university degrees several years later. If the prediction is borne out, then the test has predictive validity.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

Hathaway, S. R., & McKinley, J. C. (1943). Manual for the Minnesota Multiphasic Personality Inventory. New York: Psychological Corporation.

Kelley, T. L. (1927). Interpretation of educational measurements. New York: Macmillan.

Nevo, B. (1985). Face validity revisited. Journal of Educational Measurement, 22(4), 287-293.


What is the Significance of Validity in Research?


Introduction

  • What is validity in simple terms?
  • Internal validity vs. external validity in research
  • Uncovering different types of research validity
  • Factors that improve research validity

In qualitative research , validity refers to an evaluation metric for the trustworthiness of study findings. Within the expansive landscape of research methodologies , the qualitative approach, with its rich, narrative-driven investigations, demands unique criteria for ensuring validity.

Unlike its quantitative counterpart, which often leans on numerical robustness and statistical veracity, the essence of validity in qualitative research delves deep into the realms of credibility, dependability, and the richness of the data .

The importance of validity in qualitative research cannot be overstated. Establishing validity refers to ensuring that the research findings genuinely reflect the phenomena they are intended to represent. It reinforces the researcher's responsibility to present an authentic representation of study participants' experiences and insights.

This article will examine validity in qualitative research, exploring its characteristics, techniques to bolster it, and the challenges that researchers might face in establishing validity.


At its core, validity in research speaks to the degree to which a study accurately reflects or assesses the specific concept that the researcher is attempting to measure or understand. It's about ensuring that the study investigates what it purports to investigate. While this seems like a straightforward idea, the way validity is approached can vary greatly between qualitative and quantitative research .

Quantitative research often hinges on numerical, measurable data. In this paradigm, validity might refer to whether a specific tool or method measures the correct variable, without interference from other variables. It's about numbers, scales, and objective measurements. For instance, if one is studying personalities by administering surveys, a valid instrument could be a survey that has been rigorously developed and tested to verify that the survey questions are referring to personality characteristics and not other similar concepts, such as moods, opinions, or social norms.

Conversely, qualitative research is more concerned with understanding human behavior and the reasons that govern such behavior. It's less about measuring in the strictest sense and more about interpreting the phenomenon that is being studied. The questions become: "Are these interpretations true representations of the human experience being studied?" and "Do they authentically convey participants' perspectives and contexts?"


Differentiating between qualitative and quantitative validity is crucial because the research methods to ensure validity differ between these research paradigms. In quantitative realms, validity might involve test-retest reliability or examining the internal consistency of a test.

In the qualitative sphere, however, the focus shifts to ensuring that the researcher's interpretations align with the actual experiences and perspectives of their subjects.

This distinction is fundamental because it impacts how researchers engage in research design , gather data , and draw conclusions . Ensuring validity in qualitative research is like weaving a tapestry: every strand of data must be carefully interwoven with the interpretive threads of the researcher, creating a cohesive and faithful representation of the studied experience.

While internal and external validity are terms more often associated with quantitative research, they can still be relevant concepts to understand within the context of qualitative inquiries. Grasping these notions can help qualitative researchers better navigate the challenges of ensuring their findings are both credible and applicable in wider contexts.

Internal validity

Internal validity refers to the authenticity and truthfulness of the findings within the study itself. In qualitative research, this might involve asking: Do the conclusions drawn genuinely reflect the perspectives and experiences of the study's participants?

Internal validity revolves around the depth of understanding, ensuring that the researcher's interpretations are grounded in participants' realities. Techniques like member checking, where participants review and verify the researcher's interpretations, can bolster internal validity.

External validity

External validity refers to the extent to which the findings of a study can be generalized or applied to other settings or groups. For qualitative researchers, the emphasis isn't on statistical generalizability, as often seen in quantitative studies. Instead, it's about transferability.

It becomes a matter of determining how and where the insights gathered might be relevant in other contexts. This doesn't mean that every qualitative study's findings will apply universally, but qualitative researchers should provide enough detail (through rich, thick descriptions) to allow readers or other researchers to determine the potential for transfer to other contexts.


Looking deeper into the realm of validity, it's crucial to recognize and understand its various types. Each type offers distinct criteria and methods of evaluation, ensuring that research remains robust and genuine. Here's an exploration of some of these types.

Construct validity

Construct validity is a cornerstone in research methodology. It pertains to ensuring that the tools or methods used in a research study genuinely capture the intended theoretical constructs.

In qualitative research, the challenge lies in the abstract nature of many constructs. For example, if one were to investigate "emotional intelligence" or "social cohesion," the definitions might vary, making them hard to pin down.


To bolster construct validity, it is important to clearly and transparently define the concepts being studied. In addition, researchers may triangulate data from multiple sources, ensuring that different viewpoints converge towards a shared understanding of the construct. Furthermore, they might delve into iterative rounds of data collection, refining their methods with each cycle to better align with the conceptual essence of their focus.

Content validity

Content validity's emphasis is on the breadth and depth of the content being assessed. In other words, content validity refers to capturing all relevant facets of the phenomenon being studied. Within qualitative paradigms, ensuring comprehensive representation is paramount. If, for instance, a researcher is using interview protocols to understand community perceptions of a local policy, it's crucial that the questions encompass all relevant aspects of that policy. This could range from its implementation and impact to public awareness and opinion variations across demographic groups.

Enhancing content validity can involve expert reviews where subject matter experts evaluate tools or methods for comprehensiveness. Another strategy might involve pilot studies , where preliminary data collection reveals gaps or overlooked aspects that can be addressed in the main study.

Ecological validity

Ecological validity refers to the genuine reflection of real-world situations in research findings. For qualitative researchers, this means their observations, interpretations, and conclusions should resonate with the participants and context being studied.

If a study explores classroom dynamics, for example, studying students and teachers in a controlled research setting would have lower ecological validity than studying real classroom settings. Ecological validity is important to consider because it helps ensure the research is relevant to the people being studied. Individuals might behave entirely differently in a controlled environment as opposed to their everyday natural settings.

Ecological validity tends to be stronger in qualitative research compared to quantitative research, because qualitative researchers are typically immersed in their study context and explore participants' subjective perceptions and experiences. Quantitative research, in contrast, can sometimes be more artificial if behavior is being observed in a lab or participants have to choose from predetermined options to answer survey questions.

Qualitative researchers can further bolster ecological validity through immersive fieldwork, where researchers spend extended periods in the studied environment. This immersion helps them capture the nuances and intricacies that might be missed in brief or superficial engagements.

Face validity

Face validity, while seemingly straightforward, holds significant weight in the preliminary stages of research. It serves as a litmus test, gauging the apparent appropriateness and relevance of a tool or method. If a researcher is developing a new interview guide to gauge employee satisfaction, for instance, a quick assessment from colleagues or a focus group can reveal if the questions intuitively seem fit for the purpose.

While face validity is more subjective and lacks the depth of other validity types, it's a crucial initial step, ensuring that the research starts on the right foot.

Criterion validity

Criterion validity evaluates how well the results obtained from one method correlate with those from another, more established method. In many research scenarios, establishing high criterion validity involves using statistical methods to measure validity. For instance, a researcher might utilize the appropriate statistical tests to determine the strength and direction of the linear relationship between two sets of data.

If a new measurement tool or method is being introduced, its validity might be established by statistically correlating its outcomes with those of a gold standard or previously validated tool. Correlational statistics can estimate the strength of the relationship between the new instrument and the previously established instrument, and regression analyses can also be useful to predict outcomes based on established criteria.
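For instance, a simple linear regression can relate scores on a new instrument to an established criterion. The sketch below assumes hypothetical simulated data; all names and coefficients are illustrative.

```python
# A minimal sketch, assuming hypothetical simulated data: a new test score
# used to predict a later outcome measured by an established criterion.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(4)

test_score = rng.normal(100, 15, size=120)                # new instrument
outcome = 0.5 * test_score + rng.normal(0, 10, size=120)  # e.g., later performance

fit = linregress(test_score, outcome)
print(f"slope = {fit.slope:.2f}, r = {fit.rvalue:.2f}, p = {fit.pvalue:.3g}")
# A sizeable r against the established criterion supports criterion validity.
```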

While these methods are traditionally aligned with quantitative research, qualitative researchers, particularly those using mixed methods, may also find value in these statistical approaches, especially when wanting to quantify certain aspects of their data for comparative purposes. More broadly, qualitative researchers could compare their operationalizations and findings to other similar qualitative studies to verify that they are indeed examining what they intend to study.

In the realm of qualitative research, the role of the researcher is not just that of an observer but often that of an active participant in the meaning-making process. This unique positioning means the researcher's perspectives and interactions can significantly influence the data collected and its interpretation. Here's a deep dive into the researcher's pivotal role in upholding validity.

Reflexivity

A key concept in qualitative research, reflexivity requires researchers to continually reflect on their worldviews, beliefs, and potential influence on the data. By maintaining a reflexive journal or engaging in regular introspection, researchers can identify and address their own biases, ensuring a more genuine interpretation of participant narratives.

Building rapport

The depth and authenticity of information shared by participants often hinge on the rapport and trust established with the researcher. By cultivating genuine, non-judgmental, and empathetic relationships with participants, researchers can enhance the validity of the data collected.

Positionality

Every researcher brings to the study their own background, including their culture, education, socioeconomic status, and more. Recognizing how this positionality might influence interpretations and interactions is crucial. By acknowledging and transparently sharing their positionality, researchers can offer context to their findings and interpretations.

Active listening

The ability to listen without imposing one's own judgments or interpretations is vital. Active listening ensures that researchers capture the participants' experiences and emotions without distortion, enhancing the validity of the findings.

Transparency in methods

To ensure validity, researchers should be transparent about every step of their process. From how participants were selected to how data was analyzed, clear documentation offers others a chance to understand and evaluate the research's authenticity and rigor.

Member checking

Once data is collected and interpreted, revisiting participants to confirm the researcher's interpretations can be invaluable. This process, known as member checking, ensures that the researcher's understanding aligns with the participants' intended meanings, bolstering validity.

Embracing ambiguity

Qualitative data can be complex and sometimes contradictory. Instead of trying to fit data into preconceived notions or frameworks, researchers must embrace ambiguity, acknowledging areas of uncertainty or multiple interpretations.


Validity in research: a guide to measuring the right things

Last updated 27 February 2023. Reviewed by Cathy Heath.

Validity is necessary for all types of studies ranging from market validation of a business or product idea to the effectiveness of medical trials and procedures. So, how can you determine whether your research is valid? This guide can help you understand what validity is, the types of validity in research, and the factors that affect research validity.


  • What is validity?

In the most basic sense, validity is the quality of being based on truth or reason. Valid research strives to eliminate the effects of unrelated information and the circumstances under which evidence is collected. 

Validity in research is the ability to conduct an accurate study with the right tools and conditions to yield acceptable and reliable data that can be reproduced. Researchers rely on carefully calibrated tools for precise measurements. However, collecting accurate information can be more of a challenge.

To achieve and maintain validity, studies must be conducted in environments that don't sway the results. Validity can be compromised by asking the wrong questions or relying on limited data.

Why is validity important in research?

Research is used to improve life for humans. Every product and discovery, from innovative medical breakthroughs to advanced new products, depends on accurate research to be dependable. Without it, the results couldn't be trusted, and products would likely fail. Businesses would lose money, and patients couldn't rely on medical treatments. 

While wasting money on a lousy product is a concern, a lack of validity paints a much grimmer picture in fields like medicine, automobile manufacturing, and aviation. Whether you're launching an exciting new product or conducting scientific research, validity can determine success and failure.

  • What is reliability?

Reliability is the ability of a method to yield consistency. If the same result can be consistently achieved by using the same method to measure something, the measurement method is said to be reliable. For example, a thermometer that shows the same temperatures each time in a controlled environment is reliable.

While high reliability is a part of measuring validity, it's only part of the puzzle. If the reliable thermometer hasn't been properly calibrated and reliably measures temperatures two degrees too high, it doesn't provide a valid (accurate) measure of temperature. 

Similarly, if a researcher uses a thermometer to measure weight, the results won't be accurate because it's the wrong tool for the job. 

  • How are reliability and validity assessed?

While measuring reliability is a part of measuring validity, there are distinct ways to assess both measurements for accuracy. 

How is reliability measured?

Measures of consistency and stability help assess reliability, including:

  • Consistency and stability of the same measure when repeated multiple times under the same conditions
  • Consistency and stability of the measure across different test subjects
  • Consistency and stability of results from different parts of a test designed to measure the same thing

How is validity measured?

Since validity refers to how accurately a method measures what it is intended to measure, it can be difficult to assess. Validity can be estimated by comparing research results to other relevant data or theories, considering:

  • The adherence of a measure to existing knowledge of how the concept is measured
  • The ability to cover all aspects of the concept being measured
  • The relation of the result in comparison with other valid measures of the same concept

  • What are the types of validity in a research design?

Research validity is broadly gathered into two groups: internal and external. Yet, this grouping doesn't clearly define the different types of validity. Research validity can be divided into seven distinct groups.

  • Face validity: A test that appears valid simply because of the appropriateness or relativity of the testing method, included information, or tools used.
  • Content validity: The determination that the measure used in research covers the full domain of the content.
  • Construct validity: The assessment of the suitability of the measurement tool to measure the activity being studied.
  • Internal validity: The assessment of how your research environment affects measurement results. This is where other factors can't explain the extent of an observed cause-and-effect response.
  • External validity: The extent to which the study will be accurate beyond the sample and the level to which it can be generalized in other settings, populations, and measures.
  • Statistical conclusion validity: The determination of whether a relationship exists between procedures and outcomes (appropriate sampling and measuring procedures along with appropriate statistical tests).
  • Criterion-related validity: A measurement of the quality of your testing methods against a criterion measure (like a "gold standard" test) that is measured at the same time.

  • Examples of validity

Like different types of research and the various ways to measure validity, examples of validity can vary widely. These include:

  • A questionnaire may be considered valid because each question addresses specific and relevant aspects of the study subject.
  • In a brand assessment study, researchers can use comparison testing to verify the results of an initial study. For example, the results from a focus group response about brand perception are considered more valid when the results match those of a questionnaire answered by current and potential customers.
  • A test to measure a class of students' understanding of the English language contains reading, writing, listening, and speaking components to cover the full scope of how language is used.

  • Factors that affect research validity

Certain factors can affect research validity in both positive and negative ways. By understanding the factors that improve validity and those that threaten it, you can enhance the validity of your study. These include:

  • Random selection of participants vs. the selection of participants that are representative of your study criteria
  • Blinding with interventions the participants are unaware of (like the use of placebos)
  • Manipulating the experiment by inserting a variable that will change the results
  • Randomly assigning participants to treatment and control groups to avoid bias
  • Following specific procedures during the study to avoid unintended effects
  • Conducting a study in the field instead of a laboratory for more accurate results
  • Replicating the study with different factors or settings to compare results
  • Using statistical methods to adjust for inconclusive data

What are the common validity threats in research, and how can their effects be minimized or nullified?

Research validity can be difficult to achieve because of internal and external threats that produce inaccurate results. These factors can jeopardize validity.

  • History: Events that occur between an early and later measurement
  • Maturation: When subjects mature or change naturally during the study, and this natural change is attributed to the effects of the study
  • Repeated testing: The outcome of repeated tests can change the outcome of subsequent tests
  • Selection of subjects: Unconscious bias, which can result in the selection of non-equivalent comparison groups
  • Statistical regression: Choosing subjects based on extremes doesn't yield an accurate outcome for the majority of individuals
  • Attrition: When the sample group is diminished significantly during the course of the study

While some validity threats can be minimized or wholly nullified, removing all threats from a study is impossible. For example, random selection can remove unconscious bias and statistical regression. 

Researchers can even hope to avoid attrition by using smaller study groups. Yet, smaller study groups could potentially affect the research in other ways. The best practice for researchers to prevent validity threats is through careful environmental planning and reliable data-gathering methods.

  • How to ensure validity in your research

Researchers should be mindful of the importance of validity in the early planning stages of any study to avoid inaccurate results. Researchers must take the time to consider tools and methods as well as how the testing environment matches closely with the natural environment in which results will be used.

The following steps can be used to ensure validity in research:

  • Choose appropriate methods of measurement
  • Use appropriate sampling to choose test subjects
  • Create an accurate testing environment

How do you maintain validity in research?

Accurate research is usually conducted over a period of time with different test subjects. To maintain validity across an entire study, you must take specific steps to ensure that gathered data has the same levels of accuracy. 

Consistency is crucial for maintaining validity in research. When researchers apply methods consistently and standardize the circumstances under which data is collected, validity can be maintained across the entire study.

Is there a need for validation of the research instrument before its implementation?

An essential part of validity is choosing the right research instrument or method for accurate results. Consider the thermometer that is reliable but still produces inaccurate results. You're unlikely to achieve research validity without activities like instrument calibration and checks of content and construct validity.

  • Understanding research validity for more accurate results

Without validity, research can't provide the accuracy necessary to deliver a useful study. By getting a clear understanding of validity in research, you can take steps to improve your research skills and achieve more accurate results.


Reliability vs Validity in Research | Differences, Types & Examples

Published on 3 May 2022 by Fiona Middleton. Revised on 10 October 2022.

How are reliability and validity assessed?

Reliability can be estimated by comparing different versions of the same measurement. Validity is harder to assess, but it can be estimated by comparing the results to other relevant data or theory. Methods of estimating reliability and validity are usually split up into different types.

Types of reliability

Different types of reliability can be estimated through various statistical methods.

Type of reliability | What does it assess? | Example
Test-retest | The consistency of a measure across time: do you get the same results when you repeat the measurement? | A group of participants complete a questionnaire designed to measure personality traits. If they repeat the questionnaire days, weeks, or months apart and give the same answers, this indicates high test-retest reliability.
Interrater | The consistency of a measure across raters or observers: do you get the same results when different people conduct the same measurement? | Based on an assessment criteria checklist, five examiners submit substantially different results for the same student project. This indicates that the assessment checklist has low inter-rater reliability (for example, because the criteria are too subjective).
Internal consistency | The consistency of the measurement itself: do you get the same results from different parts of a test that are designed to measure the same thing? | You design a questionnaire to measure self-esteem. If you randomly split the results into two halves, there should be a strong correlation between the two sets of results. If the two results are very different, this indicates low internal consistency.
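To make the internal-consistency row concrete, a common check is the split-half method: split the questionnaire items into two halves, correlate participants' half-scores, and step the result up to full-test length with the Spearman-Brown formula. A minimal Python sketch, with invented item scores:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical item scores: rows = participants, columns = 6 questionnaire items
scores = [
    [4, 5, 4, 5, 3, 4],
    [2, 1, 2, 2, 1, 2],
    [5, 5, 4, 5, 5, 4],
    [3, 2, 3, 3, 2, 3],
    [4, 4, 5, 4, 4, 5],
]

# Split-half: sum the even-numbered and odd-numbered items per participant
half_a = [sum(row[0::2]) for row in scores]
half_b = [sum(row[1::2]) for row in scores]

r_half = correlation(half_a, half_b)  # Pearson correlation of the two halves
r_full = 2 * r_half / (1 + r_half)    # Spearman-Brown estimate for the full test

print(f"Split-half correlation:  {r_half:.2f}")
print(f"Spearman-Brown estimate: {r_full:.2f}")
```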

Types of validity

The validity of a measurement can be estimated based on three main types of evidence. Each type can be evaluated through expert judgement or statistical methods.

Type of validity | What does it assess? | Example
Construct validity | The adherence of a measure to existing theory and knowledge of the concept being measured. | A self-esteem questionnaire could be assessed by measuring other traits known or assumed to be related to the concept of self-esteem (such as social skills and optimism). Strong correlation between the scores for self-esteem and associated traits would indicate high construct validity.
Content validity | The extent to which the measurement covers all aspects of the concept being measured. | A test that aims to measure a class of students’ level of Spanish contains reading, writing, and speaking components, but no listening component. Experts agree that listening comprehension is an essential aspect of language ability, so the test lacks content validity for measuring the overall level of ability in Spanish.
Criterion validity | The extent to which the result of a measure corresponds to other valid measures of the same concept. | A survey is conducted to measure the political opinions of voters in a region. If the results accurately predict the later outcome of an election in that region, this indicates that the survey has high criterion validity.
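Content validity can also be put on a numeric footing with Lawshe's content validity ratio, CVR = (n_e - N/2) / (N/2), where n_e of N expert panellists rate an item as essential. A minimal sketch (the panel and its ratings are invented) that mirrors the Spanish-test example above:

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe's CVR: ranges from -1 (no expert calls the item essential)
    to +1 (all experts agree it is essential). Values near +1 support
    the content validity of that item."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical panel of 10 experts rating each test component as essential or not
ratings = {"reading": 10, "writing": 9, "speaking": 8, "listening": 4}
for item, n_ess in ratings.items():
    print(f"{item:10s} CVR = {content_validity_ratio(n_ess, 10):+.2f}")
```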

To assess the validity of a cause-and-effect relationship, you also need to consider internal validity (the design of the experiment) and external validity (the generalisability of the results).

The reliability and validity of your results depend on creating a strong research design, choosing appropriate methods and samples, and conducting the research carefully and consistently.

Ensuring validity

If you use scores or ratings to measure variations in something (such as psychological traits, levels of ability, or physical properties), it’s important that your results reflect the real variations as accurately as possible. Validity should be considered in the very earliest stages of your research, when you decide how you will collect your data .

  • Choose appropriate methods of measurement

Ensure that your method and measurement technique are of high quality and targeted to measure exactly what you want to know. They should be thoroughly researched and based on existing knowledge.

For example, to collect data on a personality trait, you could use a standardised questionnaire that is considered reliable and valid. If you develop your own questionnaire, it should be based on established theory or the findings of previous studies, and the questions should be carefully and precisely worded.

  • Use appropriate sampling methods to select your subjects

To produce valid generalisable results, clearly define the population you are researching (e.g., people from a specific age range, geographical location, or profession). Ensure that you have enough participants and that they are representative of the population.

Ensuring reliability

Reliability should be considered throughout the data collection process. When you use a tool or technique to collect data, it’s important that the results are precise, stable, and reproducible.

  • Apply your methods consistently

Plan your method carefully to make sure you carry out the same steps in the same way for each measurement. This is especially important if multiple researchers are involved.

For example, if you are conducting interviews or observations, clearly define how specific behaviours or responses will be counted, and make sure questions are phrased the same way each time.

  • Standardise the conditions of your research

When you collect your data, keep the circumstances as consistent as possible to reduce the influence of external factors that might create variation in the results.

For example, in an experimental setup, make sure all participants are given the same information and tested under the same conditions.

It’s appropriate to discuss reliability and validity in various sections of your thesis, dissertation, or research paper. Showing that you have taken them into account in planning your research and interpreting the results makes your work more credible and trustworthy.

Reliability and validity in a thesis
Section | Discuss
Literature review | What have other researchers done to devise and improve methods that are reliable and valid?
Methodology | How did you plan your research to ensure reliability and validity of the measures used? This includes the chosen sample set and size, sample preparation, external conditions, and measuring techniques.
Results | If you calculate reliability and validity, state these values alongside your main results.
Discussion | This is the moment to talk about how reliable and valid your results actually were. Were they consistent, and did they reflect true values? If not, why not?
Conclusion | If reliability and validity were a big problem for your findings, it might be helpful to mention this here.

Cite this Scribbr article

If you want to cite this source, you can copy and paste the citation or click the ‘Cite this Scribbr article’ button to automatically add the citation to our free Reference Generator.

Middleton, F. (2022, October 10). Reliability vs Validity in Research | Differences, Types & Examples. Scribbr. Retrieved 12 August 2024, from https://www.scribbr.co.uk/research-methods/reliability-or-validity/



NIH Clinical Research Trials and You

Guiding Principles for Ethical Research

Pursuing Potential Research Participants Protections


“When people are invited to participate in research, there is a strong belief that it should be their choice based on their understanding of what the study is about, and what the risks and benefits of the study are,” said Dr. Christine Grady, chief of the NIH Clinical Center Department of Bioethics, to Clinical Center Radio in a podcast.

Clinical research advances the understanding of science and promotes human health. However, it is important to remember the individuals who volunteer to participate in research. There are precautions researchers can take – in the planning, implementation and follow-up of studies – to protect these participants in research. Ethical guidelines are established for clinical research to protect patient volunteers and to preserve the integrity of the science.

NIH Clinical Center researchers published seven main principles to guide the conduct of ethical research:

  • Social and clinical value
  • Scientific validity
  • Fair subject selection
  • Favorable risk-benefit ratio
  • Independent review
  • Informed consent
  • Respect for potential and enrolled subjects

Social and clinical value

Every research study is designed to answer a specific question. The answer should be important enough to justify asking people to accept some risk or inconvenience for others. In other words, answers to the research question should contribute to scientific understanding of health or improve our ways of preventing, treating, or caring for people with a given disease to justify exposing participants to the risk and burden of research.

Scientific validity

A study should be designed in a way that will get an understandable answer to the important research question. This includes considering whether the question asked is answerable, whether the research methods are valid and feasible, and whether the study is designed with accepted principles, clear methods, and reliable practices. Invalid research is unethical because it is a waste of resources and exposes people to risk for no purpose.

Fair subject selection

The primary basis for recruiting participants should be the scientific goals of the study, not vulnerability, privilege, or other unrelated factors. Participants who accept the risks of research should be in a position to enjoy its benefits. Specific groups of participants (for example, women or children) should not be excluded from research opportunities without a good scientific reason or a particular susceptibility to risk.

Favorable risk-benefit ratio

Uncertainty about the degree of risks and benefits associated with a clinical research study is inherent. Research risks may be trivial or serious, transient or long-term. Risks can be physical, psychological, economic, or social. Everything should be done to minimize the risks and inconvenience to research participants, to maximize the potential benefits, and to determine that the potential benefits are proportionate to, or outweigh, the risks.

Independent review

To minimize potential conflicts of interest and make sure a study is ethically acceptable before it starts, an independent review panel should review the proposal and ask important questions, including: Are those conducting the trial sufficiently free of bias? Is the study doing all it can to protect research participants? Has the trial been ethically designed, and is the risk-benefit ratio favorable? The panel also monitors a study while it is ongoing.

Informed consent

Potential participants should make their own decision about whether they want to participate or continue participating in research. This is done through a process of informed consent in which individuals (1) are accurately informed of the purpose, methods, risks, benefits, and alternatives to the research, (2) understand this information and how it relates to their own clinical situation or interests, and (3) make a voluntary decision about whether to participate.

Respect for potential and enrolled participants

Individuals should be treated with respect from the time they are approached for possible participation — even if they refuse enrollment in a study — throughout their participation and after their participation ends. This includes:

  • respecting their privacy and keeping their private information confidential
  • respecting their right to change their mind, to decide that the research does not match their interests, and to withdraw without a penalty
  • informing them of new information that might emerge in the course of research, which might change their assessment of the risks and benefits of participating
  • monitoring their welfare and, if they experience adverse reactions, unexpected effects, or changes in clinical status, ensuring appropriate treatment and, when necessary, removal from the study
  • informing them about what was learned from the research

More information on these seven guiding principles and on bioethics in general

This page last reviewed on March 16, 2016


Reliability and Validity – Definitions, Types & Examples

Published by Alvin Nicolas on August 16th, 2021. Revised on October 26, 2023.

A researcher must test the collected data before drawing any conclusions. Every research design needs to be concerned with reliability and validity, which measure the quality of the research.

What is Reliability?

Reliability refers to the consistency of the measurement. Reliability shows how trustworthy the score of the test is. If the collected data shows the same results after being tested using various methods and sample groups, the information is reliable. Reliability is necessary for validity, but it does not by itself guarantee it.

Example: If you weigh yourself on a weighing scale throughout the day, you’ll get the same results. These are considered reliable results obtained through repeated measures.

Example: If a teacher conducts the same math test with her students and repeats it the next week with the same questions, and the students get the same scores, then the reliability of the test is high.

What is Validity?

Validity refers to the accuracy of the measurement. Validity shows how a specific test is suitable for a particular situation. If the results are accurate according to the researcher’s situation, explanation, and prediction, then the research is valid. 

If the method of measuring is accurate, then it’ll produce accurate results. If a method is not reliable, it’s not valid; however, a reliable method is not automatically valid.

Example: Your weighing scale shows different results each time you weigh yourself within a day, even though you handle it carefully and weigh under comparable conditions. Your weighing machine might be malfunctioning. It means your method has low reliability, and hence you are getting inaccurate or inconsistent results that cannot be valid.

Example: Suppose a questionnaire about the quality of a skincare product is distributed among a group of people, and the same questionnaire is then repeated with many other groups. If you get the same responses from the various participants, the questionnaire has high reliability, which in turn supports the validity of the questionnaire and the product.

Most of the time, validity is difficult to assess even when the process of measurement is reliable, because it is not easy to tell how well a measurement reflects the real situation.

Example: If the weighing scale shows the same result, let’s say 70 kg, each time even though your actual weight is 55 kg, the scale is consistent but miscalibrated. Its results are reproducible but inaccurate: the method has high reliability yet low validity.
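The distinction in this example can be made concrete with a small simulation: a consistently biased scale shows low spread (high reliability) but a large mean error (low validity), while a noisy but unbiased scale shows the reverse. A minimal Python sketch with invented readings:

```python
import random
import statistics

random.seed(0)
TRUE_WEIGHT = 55.0  # kg

# Biased scale: consistently reads ~15 kg too high (reliable, not valid)
biased = [TRUE_WEIGHT + 15 + random.gauss(0, 0.1) for _ in range(10)]
# Noisy scale: centred on the truth but with large scatter (accurate on average, unreliable)
noisy = [TRUE_WEIGHT + random.gauss(0, 3.0) for _ in range(10)]

for name, readings in [("biased", biased), ("noisy", noisy)]:
    spread = statistics.stdev(readings)                    # low spread -> high reliability
    error = abs(statistics.mean(readings) - TRUE_WEIGHT)   # low error  -> high validity
    print(f"{name:6s} scale: spread = {spread:4.2f} kg, mean error = {error:5.2f} kg")
```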

Internal Vs. External Validity

One of the key features of randomised designs is that they tend to have high internal and external validity.

Internal validity  is the ability to draw a causal link between your treatment and the dependent variable of interest. It means the observed changes should be due to the experiment conducted, and any external factor should not influence the  variables .

Examples of such external factors: age, ability level, height, and grade.

External validity  is the ability to identify and generalise your study outcomes to the population at large. The relationship between the study’s situation and the situations outside the study is considered external validity.

Also, read about Inductive vs Deductive reasoning in this article.


Threats to Internal Validity

Threat | Definition | Example
Confounding factors | Unexpected events during the experiment that are not part of the treatment. | You attribute the increased weight of your participants to a lack of physical activity, when it was actually due to their consumption of coffee with sugar.
Maturation | The influence of the passage of time on participants and the dependent variable. | During a long-term experiment, subjects may become tired, bored, and hungry.
Testing | The results of one test affect the results of another test. | Participants of the first experiment may react differently during the second experiment.
Instrumentation | Changes in the instrument's calibration during the study. | A change in the measuring instrument may give different results instead of the expected results.
Statistical regression | Groups selected on the basis of extreme scores are not as extreme on subsequent testing. | Students who failed the pre-final exam are likely to do better in the final exam, as their scores drift back toward the average; they might also be more confident and conscientious than before.
Selection bias | Choosing comparison groups without randomisation. | A group of trained and efficient teachers is selected to teach children communication skills, instead of selecting teachers randomly.
Experimental mortality | Participants may leave the experiment when its duration is extended. | Participants juggling multiple commitments may drop out because they are dissatisfied with the time extension, even if they were doing well.

Threats to External Validity

Threat | Definition | Example
Reactive/interactive effects of testing | A pre-test may make participants aware of the study's focus, so the treatment may not be effective without the pre-test. | Participants who complete a pre-test may change their behaviour during the experiment simply because the pre-test alerted them to what is being studied, so the effect may not generalise to people who were never pre-tested.
Selection of participants | A group of participants is selected with specific characteristics, and the treatment of the experiment may work only on participants possessing those characteristics. | If an experiment is conducted specifically on the health issues of pregnant women, the same treatment cannot be given to male participants.

How to Assess Reliability and Validity?

Reliability can be measured by comparing the consistency of the procedure and its results. There are various methods to measure validity and reliability. Reliability can be measured through various statistical methods depending on the type of reliability, as explained below:

Types of Reliability

Type of reliability | What does it measure? | Example
Test-retest | The consistency of results at different points in time: it identifies whether the results are the same after repeated measures. | A questionnaire about the quality of a skincare product is given to a group of people, and the same questionnaire is repeated with the same group later. If the responses are the same both times, the questionnaire has high test-retest reliability.
Inter-rater | The consistency of results obtained at the same time by different raters (researchers). | Five researchers measure the academic performance of the same student, incorporating questions from all the academic subjects, and submit substantially different results. This shows that the assessment has low inter-rater reliability.
Parallel forms | Equivalence: different forms of the same test are performed on the same participants. | The same researcher conducts two different forms of a test on the same topic with the same students, for example a written and an oral test. If the results are the same, the parallel-forms reliability of the test is high; if they differ, it is low.
Internal consistency (split-half) | The consistency of the measurement itself. | The results of the same test are split into two halves and compared with each other. If there is a large difference between the results of the two halves, the internal consistency of the test is low.
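For the inter-rater row, agreement between two raters is commonly summarised with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal, self-contained sketch (the ratings below are invented):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal category proportions
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    chance = sum((counts_a[c] / n) * (counts_b[c] / n)
                 for c in set(rater_a) | set(rater_b))
    return (observed - chance) / (1 - chance)

# Two hypothetical examiners grading the same 10 projects as pass/fail
rater_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
rater_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
print(f"Cohen's kappa = {cohens_kappa(rater_1, rater_2):.2f}")
```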

Types of Validity

As we discussed above, the reliability of a measurement alone cannot determine its validity. Validity is difficult to measure even if the method is reliable. The following types of tests are conducted to measure validity.

Type of validity | What does it measure? | Example
Content validity | Whether all aspects of the test/measurement are covered. | A language test is designed to measure writing and reading skills as well as listening and speaking skills. This indicates that the test has high content validity.
Face validity | The apparent validity of the test or its procedure on the surface. | The type of questions included in the question paper, the time and marks allotted, and the number of questions and their categories: is this a good question paper for measuring the academic performance of students?
Construct validity | Whether the test measures the intended construct (ability, attribute, trait, or skill). | Is a test designed to measure communication skills actually measuring communication skills?
Criterion validity | Whether the test scores obtained are similar to other measures of the same concept. | The results of a pre-final exam accurately predict the results of the later final exam, showing that the test has high criterion validity.


How to Increase Reliability?

  • Use an appropriate questionnaire to measure the competency level.
  • Ensure a consistent environment for participants.
  • Make the participants familiar with the criteria of assessment.
  • Train the participants appropriately.
  • Analyse the research items regularly to avoid poor performance.

How to Increase Validity?

Ensuring validity is also not an easy job. Proven ways to help ensure validity are given below:

  • Reactivity should be minimised as a first concern.
  • The Hawthorne effect should be reduced.
  • The respondents should be motivated.
  • The intervals between the pre-test and post-test should not be lengthy.
  • Dropout rates should be kept low.
  • Inter-rater reliability should be ensured.
  • Control and experimental groups should be matched with each other.

How to Implement Reliability and Validity in your Thesis?

According to the experts, it is helpful to implement the concepts of reliability and validity, especially in a thesis or a dissertation, where these concepts are adopted most often. A method for implementing them is given below:

Segment | Explanation
Methodology | All the planning for reliability and validity is discussed here, including the chosen samples and size and the techniques used to measure reliability and validity.
Results | Talk about the level of reliability and validity of your results and their influence on the values you report.
Literature review | Discuss the contribution of other researchers to improving reliability and validity.

Frequently Asked Questions

What is reliability and validity in research?

Reliability in research refers to the consistency and stability of measurements or findings. Validity relates to the accuracy and truthfulness of results, measuring what the study intends to. Both are crucial for trustworthy and credible research outcomes.

What is validity?

Validity in research refers to the extent to which a study accurately measures what it intends to measure. It ensures that the results are truly representative of the phenomena under investigation. Without validity, research findings may be irrelevant, misleading, or incorrect, limiting their applicability and credibility.

What is reliability?

Reliability in research refers to the consistency and stability of measurements over time. If a study is reliable, repeating the experiment or test under the same conditions should produce similar results. Without reliability, findings become unpredictable and lack dependability, potentially undermining the study’s credibility and generalisability.

What is reliability in psychology?

In psychology, reliability refers to the consistency of a measurement tool or test. A reliable psychological assessment produces stable and consistent results across different times, situations, or raters. It ensures that an instrument’s scores are not due to random error, making the findings dependable and reproducible in similar conditions.

What is test retest reliability?

Test-retest reliability assesses the consistency of measurements taken by a test over time. It involves administering the same test to the same participants at two different points in time and comparing the results. A high correlation between the scores indicates that the test produces stable and consistent results over time.
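As a concrete sketch of that comparison, the two sets of scores can be correlated directly; a high Pearson correlation is taken to indicate good test-retest reliability. The scores below are invented:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical scores for 8 participants tested twice, two weeks apart
time_1 = [12, 18, 25, 9, 30, 22, 15, 27]
time_2 = [13, 17, 26, 11, 29, 21, 16, 25]

r = correlation(time_1, time_2)
print(f"Test-retest reliability (Pearson r) = {r:.2f}")
```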

How to improve reliability of an experiment?

  • Standardise procedures and instructions.
  • Use consistent and precise measurement tools.
  • Train observers or raters to reduce subjective judgments.
  • Increase sample size to reduce random errors (see the sketch after this list).
  • Conduct pilot studies to refine methods.
  • Repeat measurements or use multiple methods.
  • Address potential sources of variability.
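To see why a larger sample reduces random error, note that the standard error of the mean shrinks with the square root of the sample size: SEM = s / sqrt(n). A quick illustration (the standard deviation is an assumed value):

```python
import math

s = 10.0  # assumed sample standard deviation of a measurement
for n in (4, 16, 64, 256):
    sem = s / math.sqrt(n)  # standard error of the mean
    print(f"n = {n:3d}: SEM = {sem:5.2f}")
```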

What is the difference between reliability and validity?

Reliability refers to the consistency and repeatability of measurements, ensuring results are stable over time. Validity indicates how well an instrument measures what it’s intended to measure, ensuring accuracy and relevance. While a test can be reliable without being valid, a valid test must inherently be reliable. Both are essential for credible research.

Are interviews reliable and valid?

Interviews can be both reliable and valid, but they are susceptible to biases. The reliability and validity depend on the design, structure, and execution of the interview. Structured interviews with standardised questions improve reliability. Validity is enhanced when questions accurately capture the intended construct and when interviewer biases are minimised.

Are IQ tests valid and reliable?

IQ tests are generally considered reliable, producing consistent scores over time. Their validity, however, is a subject of debate. While they effectively measure certain cognitive skills, whether they capture the entirety of “intelligence” or predict success in all life areas is contested. Cultural bias and over-reliance on tests are also concerns.

Are questionnaires reliable and valid?

Questionnaires can be both reliable and valid if well-designed. Reliability is achieved when they produce consistent results over time or across similar populations. Validity is ensured when questions accurately measure the intended construct. However, factors like poorly phrased questions, respondent bias, and lack of standardisation can compromise their reliability and validity.


The 4 Types of Validity in Research Design (+3 More to Consider)


The conclusions you draw from your research (whether from analyzing surveys, focus groups, experimental design, or other research methods) are only useful if they’re valid.

How “true” are these results? How well do they represent the thing you’re actually trying to study? Validity is used to determine whether research measures what it intended to measure and to approximate the truthfulness of the results.

Unfortunately, researchers sometimes create their own definitions when it comes to what is considered valid.

  • In quantitative research testing for validity and reliability is a given.
  • However, some qualitative researchers have gone so far as to suggest that validity does not apply to their research even as they acknowledge the need for some qualifying checks or measures in their work.

This is wrong. Validity is always important – even if it’s harder to determine in qualitative research.

To disregard validity is to put the trustworthiness of your work in question and to call into question others’ confidence in its results. Even when qualitative measures are used in research, they need to be looked at using measures of reliability and validity in order to sustain the trustworthiness of the results.

What is validity in research?

Validity is how researchers talk about the extent to which results represent reality. Research methods, quantitative or qualitative, are methods of studying real phenomenon – validity refers to how much of that phenomenon they measure vs. how much “noise,” or unrelated information, is captured by the results.

Validity and reliability make the difference between “good” and “bad” research reports. Quality research depends on a commitment to testing and increasing the validity as well as the reliability of your research results.

Any research worth its weight is concerned with whether what is being measured is what is intended to be measured and considers how observations are influenced by the circumstances in which they are made.

The basis of how our conclusions are made plays an important role in addressing the broader substantive issues of any given study.

For this reason, we are going to look at various validity types that have been formulated as a part of legitimate research methodology.

Here are the 7 key types of validity in research:

  • Face validity
  • Content validity
  • Construct validity
  • Internal validity
  • External validity
  • Statistical conclusion validity
  • Criterion-related validity

1. Face validity

Face validity is how valid your results seem based on what they look like. This is the least scientific method of validity, as it is not quantified using statistical methods.

Face validity is not validity in a technical sense of the term.  It is concerned with whether it seems like we measure what we claim.

Here we look at how valid a measure appears on the surface and make subjective judgments based on that.

For example,

  • Imagine you give a survey that appears to be valid to the respondents, with questions selected because they look valid to the administrator.
  • The administrator asks a group of random people, untrained observers, whether the questions appear valid to them.

In research, it’s never enough to rely on face judgments alone, and more quantifiable methods of validity are necessary to draw acceptable conclusions. There are many instruments of measurement to consider, so face validity is useful in cases where you need to distinguish one approach from another.

Face validity should never be trusted on its own merits.

2. Content validity

Content validity is whether or not the measure used in the research covers all of the content in the underlying construct (the thing you are trying to measure).

This is also a subjective measure, but unlike face validity, we ask whether the content of a measure covers the full domain of the content. If a researcher wanted to measure introversion, they would have to first decide what constitutes a relevant domain of content for that trait.

Content validity is considered a subjective form of measurement because it still relies on people’s perceptions for measuring constructs that would otherwise be difficult to measure.

Content validity distinguishes itself (and becomes useful) through its use of experts in the field or individuals belonging to a target population. Such a study can be made more objective through the use of rigorous statistical tests.

For example, you could have a content validity study that informs researchers how items used in a survey represent their content domain, how clear they are, and the extent to which they maintain the theoretical factor structure assessed by the factor analysis.

3. Construct validity

A construct represents a collection of behaviors that are associated in a meaningful way to create an image or an idea invented for a research purpose. Construct validity is the degree to which your research measures the construct (as compared to things outside the construct).

Depression is a construct that represents a personality trait that manifests itself in behaviors such as oversleeping, loss of appetite, difficulty concentrating, etc.

The existence of a construct is manifested by observing the collection of related indicators. Any one sign may be associated with several constructs. A person with difficulty concentrating may have ADHD but not depression.

Construct validity is the degree to which inferences can be made from operationalizations (connecting concepts to observations) in your study to the constructs on which those operationalizations are based.  To establish construct validity you must first provide evidence that your data supports the theoretical structure.

You must also show that you control the operationalization of the construct, in other words, show that your theory has some correspondence with reality.

  • Convergent Validity – the degree to which an operation is similar to other operations it should theoretically be similar to (illustrated in the sketch after this list).
  • Discriminant Validity – the degree to which a scale differentiates between groups that should differ, and does not differentiate between groups that should not, based on theoretical reasons or previous research.
  • Nomological Network – a representation of the constructs of interest in a study, their observable manifestations, and the interrelationships among and between these. According to Cronbach and Meehl, a nomological network has to be developed for a measure for it to have construct validity.
  • Multitrait-Multimethod Matrix – six major considerations when examining construct validity, according to Campbell and Fiske. These include evaluations of convergent validity and discriminant validity; the others are trait method unit, multi-method/trait, truly different methodology, and trait characteristics.
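As flagged in the list above, convergent and discriminant validity are typically checked with correlations: two measures of the same construct should correlate strongly, while measures of theoretically unrelated constructs should not. A minimal Python sketch with invented questionnaire scores:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical scores for 6 participants on three measures:
# two different questionnaires intended to measure self-esteem,
# and one questionnaire measuring an unrelated trait.
self_esteem_a = [10, 14, 9, 16, 12, 15]
self_esteem_b = [11, 13, 10, 15, 13, 14]
unrelated     = [8, 6, 7, 9, 5, 8]

# Convergent validity: two operationalizations of the same construct
# should correlate highly.
print(f"convergent   r = {correlation(self_esteem_a, self_esteem_b):+.2f}")

# Discriminant validity: a measure should correlate only weakly with
# measures of constructs it is theoretically unrelated to.
print(f"discriminant r = {correlation(self_esteem_a, unrelated):+.2f}")
```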

4. Internal validity

Internal validity refers to the extent to which the independent variable can accurately be stated to produce the observed effect.

If the effect on the dependent variable is due only to the independent variable(s), then internal validity is achieved. This is the degree to which a result can be attributed to your manipulation rather than to anything else.

Put another way, internal validity is how you can tell that your research “works” in a research setting. Within a given study, does the variable you change affect the variable you’re studying?

5. External validity

External validity refers to the extent to which the results of a study can be generalized beyond the sample. Which is to say that you can apply your findings to other people and settings.

Think of this as the degree to which a result can be generalized. How well do the research results apply to the rest of the world?

A laboratory setting (or other research setting) is a controlled environment with fewer variables. External validity refers to how well the results hold, even in the presence of all those other variables.

6. Statistical conclusion validity

Statistical conclusion validity is a determination of whether a relationship or co-variation exists between cause and effect variables.

This type of validity requires:

  • Ensuring adequate sampling procedures
  • Appropriate statistical tests
  • Reliable measurement procedures

This is the degree to which a conclusion is credible or believable.
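As a minimal illustration of testing for such co-variation, a two-sample t-test compares outcome scores between treatment and control groups. The sketch below assumes SciPy is installed, and the scores are invented:

```python
from scipy import stats  # assumes SciPy is available

# Hypothetical outcome scores for control and treatment groups
control   = [52, 48, 50, 47, 53, 49, 51, 50]
treatment = [55, 58, 54, 57, 53, 59, 56, 55]

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value supports (but does not prove) a real co-variation between
# the treatment and the outcome, one ingredient of statistical conclusion validity.
```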

7. Criterion-related validity

Criterion-related validity (also called instrumental validity) is a measure of the quality of your measurement methods.  The accuracy of a measure is demonstrated by comparing it with a measure that is already known to be valid.

In other words – if your measure has a high correlation with other measures that are known to be valid because of previous research.

For this to work, you must know that the criterion has been measured well, and be aware that appropriate criteria do not always exist.

What you are doing is checking the performance of your operationalization against criteria.

The criteria you use as a standard of judgment account for the different approaches you would use:

  • Predictive Validity – the operationalization’s ability to predict what it is theoretically able to predict; the extent to which a measure predicts expected outcomes.
  • Concurrent Validity – the operationalization’s ability to distinguish between groups it theoretically should be able to distinguish between; a test shows concurrent validity when it correlates well with a measure that has previously been validated.

When we look at validity in survey data we are asking whether the data represents what we think it should represent.

We depend on the respondent’s mindset and attitude to give us valid data.

In other words, we depend on them to answer all questions honestly and conscientiously. We also depend on whether they are able to answer the questions that we ask. When questions are asked that the respondent cannot comprehend or understand, the data does not tell us what we think it does.


Log in using your username and password

  • Search More Search for this keyword Advanced search
  • Latest content
  • Current issue
  • Write for Us
  • BMJ Journals

You are here

  • Volume 18, Issue 2
  • Issues of validity and reliability in qualitative research
  • Article Text
  • Article info
  • Citation Tools
  • Rapid Responses
  • Article metrics

Download PDF

  • Helen Noble 1 ,
  • Joanna Smith 2
  • 1 School of Nursing and Midwifery, Queen's University Belfast, Belfast, UK
  • 2 School of Human and Health Sciences, University of Huddersfield, Huddersfield, UK
  • Correspondence to Dr Helen Noble, School of Nursing and Midwifery, Queen's University Belfast, Medical Biology Centre, 97 Lisburn Rd, Belfast BT9 7BL, UK; helen.noble{at}qub.ac.uk

https://doi.org/10.1136/eb-2015-102054


Evaluating the quality of research is essential if findings are to be utilised in practice and incorporated into care delivery. In a previous article we explored ‘bias’ across research designs and outlined strategies to minimise bias. 1 The aim of this article is to further outline rigour, or the integrity in which a study is conducted, and ensure the credibility of findings in relation to qualitative research. Concepts such as reliability, validity and generalisability typically associated with quantitative research and alternative terminology will be compared in relation to their application to qualitative research. In addition, some of the strategies adopted by qualitative researchers to enhance the credibility of their research are outlined.

Are the terms reliability and validity relevant to ensuring credibility in qualitative research?

Although the tests and measures used to establish the validity and reliability of quantitative research cannot be applied to qualitative research, there are ongoing debates about whether terms such as validity, reliability and generalisability are appropriate to evaluate qualitative research.2–4 In the broadest context these terms are applicable, with validity referring to the integrity and application of the methods undertaken and the precision in which the findings accurately reflect the data, while reliability describes consistency within the employed analytical procedures.4 However, if qualitative methods are inherently different from quantitative methods in terms of philosophical positions and purpose, then alternative frameworks for establishing rigour are appropriate.3 Lincoln and Guba5 offer alternative criteria for demonstrating rigour within qualitative research, namely truth value, consistency, neutrality, and applicability. Table 1 outlines the differences in terminology and criteria used to evaluate qualitative research.

Table 1 | Terminology and criteria used to evaluate the credibility of research findings

What strategies can qualitative researchers adopt to ensure the credibility of the study findings?

Unlike quantitative researchers, who apply statistical methods for establishing validity and reliability of research findings, qualitative researchers aim to design and incorporate methodological strategies to ensure the ‘trustworthiness’ of the findings. Such strategies include:

Accounting for personal biases which may have influenced findings;6

Acknowledging biases in sampling and ongoing critical reflection of methods to ensure sufficient depth and relevance of data collection and analysis;3

Meticulous record keeping, demonstrating a clear decision trail and ensuring interpretations of data are consistent and transparent;3,4

Establishing a comparison case/seeking out similarities and differences across accounts to ensure different perspectives are represented;6,7

Including rich and thick verbatim descriptions of participants’ accounts to support findings;7

Demonstrating clarity in terms of thought processes during data analysis and subsequent interpretations;3

Engaging with other researchers to reduce research bias;3

Respondent validation: includes inviting participants to comment on the interview transcript and whether the final themes and concepts created adequately reflect the phenomena being investigated;4

Data triangulation,3,4 whereby different methods and perspectives help produce a more comprehensive set of findings.8,9

Table 2 provides some specific examples of how some of these strategies were utilised to ensure rigour in a study that explored the impact of being a family carer to patients with stage 5 chronic kidney disease managed without dialysis.10

Table 2 | Strategies for enhancing the credibility of qualitative research

In summary, it is imperative that all qualitative researchers incorporate strategies to enhance the credibility of a study during research design and implementation. Although there is no universally accepted terminology and criteria used to evaluate qualitative research, we have briefly outlined some of the strategies that can enhance the credibility of study findings.

  • Sandelowski M
  • Lincoln YS ,
  • Barrett M ,
  • Mayan M , et al
  • Greenhalgh T
  • Lingard L ,

Twitter Follow Joanna Smith at @josmith175 and Helen Noble at @helnoble

Competing interests None.


Cody Kommers

The Validity of Psychological Research

How do we know whether a finding is legitimate or not.

Posted August 27, 2018


There is a distinction that one learns about on the first day of a course in psychological research methods. It is the difference between internal and external validity.

Internal validity is scientific validity. The extent to which a researcher devises a solid experiment, controls for confounding variables, and executes the procedure as planned determines a finding’s internal validity. If it were to come to the researcher’s attention that a confounding variable for which they did not control could also explain their result, then the finding’s internal validity would be called into question. This validity is concerned with what happens inside the lab, while the experiment is happening.

External validity, in contrast, is ecological validity. How well does the researcher’s finding generalize to the world outside the lab? You could control for all the variables perfectly, execute your procedure flawlessly, and run a pristine experiment. But if the stimuli you’re using aren’t representative of what people are likely to encounter in real life, then the experiment lacks external validity. This validity is concerned with what happens outside the lab, what to make of the result after all the nitty-gritty has been finely tuned.

Ideally, a researcher would conduct an experiment that is unassailably valid both internally and externally. However, in practice, these considerations usually involve a tradeoff. The more internally valid your stimuli (the more precisely they can be measured and controlled), the more sterile they become, and they thereby reflect less of the inherent messiness of our everyday experience. The more realistic you make your stimuli, the less meticulous you’re able to be about what exactly you’re showing your participant.

The upshot is that without internal validity, you can’t draw scientific conclusions. But without external validity, you have nothing worth drawing conclusions about. Practically speaking, the best a researcher can hope for is a healthy and reasonable balance between the two. But how well does psychological research actually balance the tension between these two considerations? Is it fifty-fifty, equal parts external and internal? Or does one get prioritized at the expense of the other?

There is an important asymmetry between internal and external validity, which gives insight into the answer to this question. It has to do with how these different kinds of validity are measured. Scientists are trained every day of their professional lives to be sensitive to internal validity. They can spot a confounding variable in a study from a mile away. And once one is identified, it’s difficult to shake it off as inconsequential to the study’s findings. Perhaps more importantly, it’s embarrassing for a scientist to run a shoddy experiment in which people can easily point out procedural flaws. It is, in short, relatively obvious how to optimize for internal validity.

But external validity is not so easily optimized for. It is much more difficult to point at an experiment and claim that it bears little resemblance to the real world in a crucial and undeniable way. Such an issue will be regarded as perhaps a good point, but ultimately just an opinion. The experimental stimuli aren’t intended to be representative of the whole of the human cognitive experience, after all, but only a specific part of it. That’s what makes the experimental variables so well controlled and the theoretical predictions so parsimonious in the first place. There is also no such embarrassment of an accusation about lack of external validity, but rather a certain pride associated with being a hardline scientist who studies her phenomena of interest with clinical precision and ardent fastidiousness.

The result is that psychological research biases toward internal rather than external validity. Internal validity can more easily be measured, and consequently is a much more apparent badge that proclaims, “here is the work of a legitimate scientist.” Those scientists who prioritize internal validity and scientific legitimacy are better positioned for promotion, and this influences the constituency of the institution of psychological research as a whole to favor those who care more for internal considerations than external ones. The problem is that while psychological research becomes more ostensibly scientific, it becomes less connected to that which it intends to study. Human behavior is a fundamentally messy topic, and psychology benefits from existing in the tension between these two kinds of validity. Both are, after all, necessary for truly valid psychological research, and neither is sufficient on its own. If we lose our sense of external validity because it is a trickier metric to optimize, then psychological research suffers just as much as if we failed to construct solid experiments.

This is a consideration that gets overlooked in the “replication crisis” in which psychological research currently finds itself. The usual approach to addressing this crisis lies with better statistical analyses, which decrease the probability of a false or misleading finding. While this is surely a critical contribution to the internal validity of psychological findings, it takes external validity out from under the spotlight of attention. Psychology, the thinking goes, will not become healthy by an attempt to render it more externally valid, but only by making it more rigorously scientific.

This thinking, to my mind, is only one side of the story. Sure, psychology can improve its statistical methodology in the service of the field’s betterment. But too stringent a focus on the internal runs the risk of leading to an equally illegitimate state of the science, one which pursues a significance that is more statistical than psychological. Our aversion to getting our hands dirty with the veridical messiness of the human experience may lead us to forfeit the opportunity to go out there and actually work with nature itself, opting instead for the safety and clarity of the laboratory setting. This seems hardly like a psychology worth replicating.


Cody Kommers is a PhD student in Experimental Psychology at Oxford.


Validity and Reliability

The principles of validity and reliability are fundamental cornerstones of the scientific method.

Together, they are at the core of what is accepted as scientific proof, by scientist and philosopher alike.

By following a few basic principles, any experimental design will stand up to rigorous questioning and skepticism.


What is Reliability?

The idea behind reliability is that any significant results must be more than a one-off finding and be inherently repeatable.

Other researchers must be able to perform exactly the same experiment, under the same conditions, and generate the same results. This will reinforce the findings and ensure that the wider scientific community will accept the hypothesis.

Without this replication of statistically significant results, the experiment and research have not fulfilled all of the requirements of testability.

This prerequisite is essential to a hypothesis establishing itself as an accepted scientific truth.

For example, if you are performing a time-critical experiment, you will be using some type of stopwatch. Generally, it is reasonable to assume that the instruments are reliable and will keep true and accurate time. However, diligent scientists take measurements many times, to minimize the chances of malfunction and maintain validity and reliability.

At the other extreme, any experiment that uses human judgment is always going to come under question.

For example, if observers rate certain aspects, like in Bandura's Bobo Doll Experiment, then the reliability of the test is compromised. Human judgment can vary wildly between observers, and the same individual may rate things differently depending upon time of day and current mood.

This means that such experiments are more difficult to repeat and are inherently less reliable. Reliability is a necessary ingredient for determining the overall validity of a scientific experiment and enhancing the strength of the results.
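Agreement between observers can be quantified rather than eyeballed. The sketch below is a minimal illustration, not anything reported by the Bobo Doll study itself: it computes Cohen's kappa, a standard inter-rater reliability statistic, for two hypothetical observers' categorical ratings (all ratings invented). Kappa near 1 indicates near-perfect agreement; near 0 indicates agreement no better than chance.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical ratings of the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labelled identically.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labelled independently at their own base rates.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical ratings of 10 behaviours as aggressive ("A") or non-aggressive ("N").
observer_1 = ["A", "A", "N", "A", "N", "N", "A", "A", "N", "A"]
observer_2 = ["A", "N", "N", "A", "N", "A", "A", "A", "N", "A"]
print(f"kappa = {cohens_kappa(observer_1, observer_2):.2f}")  # 0.58: moderate agreement
```

For numeric rather than categorical ratings, an intraclass correlation coefficient would play the same role.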

Debate between social and pure scientists, concerning reliability, is robust and ongoing.


What is Validity?

Validity encompasses the entire experimental concept and establishes whether the results obtained meet all of the requirements of the scientific research method.

For example, there must have been randomization of the sample groups and appropriate care and diligence shown in the allocation of controls.

Internal validity dictates how an experimental design is structured and encompasses all of the steps of the scientific research method.

Even if your results are great, sloppy and inconsistent design will compromise your integrity in the eyes of the scientific community. Internal validity and reliability are at the core of any experimental design.

External validity is the process of examining the results and questioning whether there are any other possible causal relationships.

Control groups and randomization will lessen external validity problems, but no method can be completely successful. This is why statistical tests of a hypothesis are described as significant, not as absolute truth.

Any scientific research design only puts forward a possible cause for the studied effect.

There is always the chance that another unknown factor contributed to the results and findings. This extraneous causal relationship may become more apparent, as techniques are refined and honed.

If you have constructed your experiment to contain validity and reliability then the scientific community is more likely to accept your findings.

Eliminating other potential causal relationships, by using controls and duplicate samples, is the best way to ensure that your results stand up to rigorous questioning.


Martyn Shuttleworth (Oct 20, 2008). Validity and Reliability. Retrieved Aug 18, 2024 from Explorable.com: https://explorable.com/validity-and-reliability


Peer Review in Scientific Publications: Benefits, Critiques, & A Survival Guide

Jacalyn Kelly

1 Clinical Biochemistry, Department of Pediatric Laboratory Medicine, The Hospital for Sick Children, University of Toronto, Toronto, Ontario, Canada

Tara Sadeghieh

Khosrow Adeli

2 Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada

3 Chair, Communications and Publications Division (CPD), International Federation of Clinical Chemistry and Laboratory Medicine (IFCC), Milan, Italy

The authors declare no conflicts of interest regarding publication of this article.

Peer review has been defined as a process of subjecting an author’s scholarly work, research or ideas to the scrutiny of others who are experts in the same field. It functions to encourage authors to meet the accepted high standards of their discipline and to control the dissemination of research data to ensure that unwarranted claims, unacceptable interpretations or personal views are not published without prior expert review. Despite its widespread use by most journals, the peer review process has also been widely criticized due to the slowness of the process to publish new findings and due to perceived bias by the editors and/or reviewers. Within the scientific community, peer review has become an essential component of the academic writing process. It helps ensure that papers published in scientific journals answer meaningful research questions and draw accurate conclusions based on professionally executed experimentation. Submission of low quality manuscripts has become increasingly prevalent, and peer review acts as a filter to prevent this work from reaching the scientific community. The major advantage of a peer review process is that peer-reviewed articles provide a trusted form of scientific communication. Since scientific knowledge is cumulative and builds on itself, this trust is particularly important. Despite the positive impacts of peer review, critics argue that the peer review process stifles innovation in experimentation, and acts as a poor screen against plagiarism. Despite its downfalls, there has not yet been a foolproof system developed to take the place of peer review; however, researchers have been looking into electronic means of improving the peer review process. Unfortunately, the recent explosion in online-only/electronic journals has led to mass publication of a large number of scientific articles with little or no peer review. This poses significant risk to advances in scientific knowledge and its future potential. The current article summarizes the peer review process, highlights the pros and cons associated with different types of peer review, and describes new methods for improving peer review.

WHAT IS PEER REVIEW AND WHAT IS ITS PURPOSE?

Peer Review is defined as “a process of subjecting an author’s scholarly work, research or ideas to the scrutiny of others who are experts in the same field” ( 1 ). Peer review is intended to serve two primary purposes. Firstly, it acts as a filter to ensure that only high quality research is published, especially in reputable journals, by determining the validity, significance and originality of the study. Secondly, peer review is intended to improve the quality of manuscripts that are deemed suitable for publication. Peer reviewers provide suggestions to authors on how to improve the quality of their manuscripts, and also identify any errors that need correcting before publication.

HISTORY OF PEER REVIEW

The concept of peer review was developed long before the scholarly journal. In fact, the peer review process is thought to have been used as a method of evaluating written work since ancient Greece ( 2 ). The peer review process was first described by a physician named Ishaq bin Ali al-Rahwi of Syria, who lived from 854-931 CE, in his book Ethics of the Physician ( 2 ). There, he stated that physicians must take notes describing the state of their patients’ medical conditions upon each visit. Following treatment, the notes were scrutinized by a local medical council to determine whether the physician had met the required standards of medical care. If the medical council deemed that the appropriate standards were not met, the physician in question could receive a lawsuit from the maltreated patient ( 2 ).

The invention of the printing press in 1453 allowed written documents to be distributed to the general public ( 3 ). At this time, it became more important to regulate the quality of the written material that became publicly available, and editing by peers increased in prevalence. In 1620, Francis Bacon wrote the work Novum Organum, where he described what eventually became known as the first universal method for generating and assessing new science ( 3 ). His work was instrumental in shaping the Scientific Method ( 3 ). In 1665, the French Journal des sçavans and the English Philosophical Transactions of the Royal Society were the first scientific journals to systematically publish research results ( 4 ). Philosophical Transactions of the Royal Society is thought to be the first journal to formalize the peer review process in 1665 ( 5 ), however, it is important to note that peer review was initially introduced to help editors decide which manuscripts to publish in their journals, and at that time it did not serve to ensure the validity of the research ( 6 ). It did not take long for the peer review process to evolve, and shortly thereafter papers were distributed to reviewers with the intent of authenticating the integrity of the research study before publication. The Royal Society of Edinburgh adhered to the following peer review process, published in their Medical Essays and Observations in 1731: “Memoirs sent by correspondence are distributed according to the subject matter to those members who are most versed in these matters. The report of their identity is not known to the author.” ( 7 ). The Royal Society of London adopted this review procedure in 1752 and developed the “Committee on Papers” to review manuscripts before they were published in Philosophical Transactions ( 6 ).

Peer review in the systematized and institutionalized form has developed immensely since the Second World War, at least partly due to the large increase in scientific research during this period ( 7 ). It is now used not only to ensure that a scientific manuscript is experimentally and ethically sound, but also to determine which papers sufficiently meet the journal’s standards of quality and originality before publication. Peer review is now standard practice by most credible scientific journals, and is an essential part of determining the credibility and quality of work submitted.

IMPACT OF THE PEER REVIEW PROCESS

Peer review has become the foundation of the scholarly publication system because it effectively subjects an author’s work to the scrutiny of other experts in the field. Thus, it encourages authors to strive to produce high quality research that will advance the field. Peer review also supports and maintains integrity and authenticity in the advancement of science. A scientific hypothesis or statement is generally not accepted by the academic community unless it has been published in a peer-reviewed journal ( 8 ). The Institute for Scientific Information ( ISI ) only considers journals that are peer-reviewed as candidates to receive Impact Factors. Peer review is a well-established process which has been a formal part of scientific communication for over 300 years.

OVERVIEW OF THE PEER REVIEW PROCESS

The peer review process begins when a scientist completes a research study and writes a manuscript that describes the purpose, experimental design, results, and conclusions of the study. The scientist then submits this paper to a suitable journal that specializes in a relevant research field, a step referred to as pre-submission. The editors of the journal will review the paper to ensure that the subject matter is in line with that of the journal, and that it fits with the editorial platform. Very few papers pass this initial evaluation. If the journal editors feel the paper sufficiently meets these requirements and is written by a credible source, they will send the paper to accomplished researchers in the field for a formal peer review. Peer reviewers are also known as referees (this process is summarized in Figure 1 ). The role of the editor is to select the most appropriate manuscripts for the journal, and to implement and monitor the peer review process. Editors must ensure that peer reviews are conducted fairly, and in an effective and timely manner. They must also ensure that there are no conflicts of interest involved in the peer review process.

Figure 1. Overview of the review process

When a reviewer is provided with a paper, he or she reads it carefully and scrutinizes it to evaluate the validity of the science, the quality of the experimental design, and the appropriateness of the methods used. The reviewer also assesses the significance of the research, and judges whether the work will contribute to advancement in the field by evaluating the importance of the findings, and determining the originality of the research. Additionally, reviewers identify any scientific errors and references that are missing or incorrect. Peer reviewers give recommendations to the editor regarding whether the paper should be accepted, rejected, or improved before publication in the journal. The editor will mediate author-referee discussion in order to clarify the priority of certain referee requests, suggest areas that can be strengthened, and overrule reviewer recommendations that are beyond the study’s scope ( 9 ). If the paper is accepted, as per suggestion by the peer reviewer, the paper goes into the production stage, where it is tweaked and formatted by the editors, and finally published in the scientific journal. An overview of the review process is presented in Figure 1 .

WHO CONDUCTS REVIEWS?

Peer reviews are conducted by scientific experts with specialized knowledge on the content of the manuscript, as well as by scientists with a more general knowledge base. Peer reviewers can be anyone who has competence and expertise in the subject areas that the journal covers. Reviewers can range from young and up-and-coming researchers to old masters in the field. Often, the young reviewers are the most responsive and deliver the best quality reviews, though this is not always the case. On average, a reviewer will conduct approximately eight reviews per year, according to a study on peer review by the Publishing Research Consortium (PRC) ( 7 ). Journals will often have a pool of reviewers with diverse backgrounds to allow for many different perspectives. They will also keep a rather large reviewer bank, so that reviewers do not get burnt out, overwhelmed or time constrained from reviewing multiple articles simultaneously.

WHY DO REVIEWERS REVIEW?

Referees are typically not paid to conduct peer reviews and the process takes considerable effort, so the question is raised as to what incentive referees have to review at all. Some feel an academic duty to perform reviews, and are of the mentality that if their peers are expected to review their papers, then they should review the work of their peers as well. Reviewers may also have personal contacts with editors, and may want to assist as much as possible. Others review to keep up-to-date with the latest developments in their field, and reading new scientific papers is an effective way to do so. Some scientists use peer review as an opportunity to advance their own research as it stimulates new ideas and allows them to read about new experimental techniques. Other reviewers are keen on building associations with prestigious journals and editors and becoming part of their community, as sometimes reviewers who show dedication to the journal are later hired as editors. Some scientists see peer review as a chance to become aware of the latest research before their peers, and thus be first to develop new insights from the material. Finally, in terms of career development, peer reviewing can be desirable as it is often noted on one’s resume or CV. Many institutions consider a researcher’s involvement in peer review when assessing their performance for promotions ( 11 ). Peer reviewing can also be an effective way for a scientist to show their superiors that they are committed to their scientific field ( 5 ).

ARE REVIEWERS KEEN TO REVIEW?

A 2009 international survey of 4000 peer reviewers conducted by the charity Sense About Science at the British Science Festival at the University of Surrey, found that 90% of reviewers were keen to peer review ( 12 ). One third of respondents to the survey said they were happy to review up to five papers per year, and an additional one third of respondents were happy to review up to ten.

HOW LONG DOES IT TAKE TO REVIEW ONE PAPER?

On average, it takes approximately six hours to review one paper ( 12 ), however, this number may vary greatly depending on the content of the paper and the nature of the peer reviewer. One in every 100 participants in the “Sense About Science” survey claims to have taken more than 100 hours to review their last paper ( 12 ).

HOW TO DETERMINE IF A JOURNAL IS PEER REVIEWED

Ulrichsweb is a directory that provides information on over 300,000 periodicals, including information regarding which journals are peer reviewed ( 13 ). After logging into the system using an institutional login (e.g., from the University of Toronto), search terms, journal titles or ISSN numbers can be entered into the search bar. The database provides the title, publisher, and country of origin of the journal, and indicates whether the journal is still actively publishing. The black book symbol (labelled ‘refereed’) reveals that the journal is peer reviewed.

THE EVALUATION CRITERIA FOR PEER REVIEW OF SCIENTIFIC PAPERS

As previously mentioned, when a reviewer receives a scientific manuscript, he/she will first determine if the subject matter is well suited for the content of the journal. The reviewer will then consider whether the research question is important and original, a process which may be aided by a literature scan of review articles.

Scientific papers submitted for peer review usually follow a specific structure that begins with the title, followed by the abstract, introduction, methodology, results, discussion, conclusions, and references. The title must be descriptive and include the concept and organism investigated, and potentially the variable manipulated and the systems used in the study. The peer reviewer evaluates if the title is descriptive enough, and ensures that it is clear and concise. A study by the National Association of Realtors (NAR) published by the Oxford University Press in 2006 indicated that the title of a manuscript plays a significant role in determining reader interest, as 72% of respondents said they could usually judge whether an article will be of interest to them based on the title and the author, while 13% of respondents claimed to always be able to do so ( 14 ).

The abstract is a summary of the paper, which briefly mentions the background or purpose, methods, key results, and major conclusions of the study. The peer reviewer assesses whether the abstract is sufficiently informative and if the content of the abstract is consistent with the rest of the paper. The NAR study indicated that 40% of respondents could determine whether an article would be of interest to them based on the abstract alone 60-80% of the time, while 32% could judge an article based on the abstract 80-100% of the time ( 14 ). This demonstrates that the abstract alone is often used to assess the value of an article.

The introduction of a scientific paper presents the research question in the context of what is already known about the topic, in order to identify why the question being studied is of interest to the scientific community, and what gap in knowledge the study aims to fill ( 15 ). The introduction identifies the study’s purpose and scope, briefly describes the general methods of investigation, and outlines the hypothesis and predictions ( 15 ). The peer reviewer determines whether the introduction provides sufficient background information on the research topic, and ensures that the research question and hypothesis are clearly identifiable.

The methods section describes the experimental procedures, and explains why each experiment was conducted. The methods section also includes the equipment and reagents used in the investigation. The methods section should be detailed enough that it can be used to repeat the experiment ( 15 ). Methods are written in the past tense and in the active voice. The peer reviewer assesses whether the appropriate methods were used to answer the research question, and if they were written with sufficient detail. If information is missing from the methods section, it is the peer reviewer’s job to identify what details need to be added.

The results section is where the outcomes of the experiment and trends in the data are explained without judgement, bias or interpretation ( 15 ). This section can include statistical tests performed on the data, as well as figures and tables in addition to the text. The peer reviewer ensures that the results are described with sufficient detail, and determines their credibility. Reviewers also confirm that the text is consistent with the information presented in tables and figures, and that all figures and tables included are important and relevant ( 15 ). The peer reviewer will also make sure that table and figure captions are appropriate both contextually and in length, and that tables and figures present the data accurately.

The discussion section is where the data is analyzed. Here, the results are interpreted and related to past studies ( 15 ). The discussion describes the meaning and significance of the results in terms of the research question and hypothesis, and states whether the hypothesis was supported or rejected. This section may also provide possible explanations for unusual results and suggestions for future research ( 15 ). The discussion should end with a conclusions section that summarizes the major findings of the investigation. The peer reviewer determines whether the discussion is clear and focused, and whether the conclusions are an appropriate interpretation of the results. Reviewers also ensure that the discussion addresses the limitations of the study, any anomalies in the results, the relationship of the study to previous research, and the theoretical implications and practical applications of the study.

The references are found at the end of the paper, and list all of the information sources cited in the text to describe the background, methods, and/or interpret results. Depending on the citation method used, the references are listed in alphabetical order according to author last name, or numbered according to the order in which they appear in the paper. The peer reviewer ensures that references are used appropriately, cited accurately, formatted correctly, and that none are missing.

Finally, the peer reviewer determines whether the paper is clearly written and if the content seems logical. After thoroughly reading through the entire manuscript, they determine whether it meets the journal’s standards for publication, and whether it falls within the top 25% of papers in its field ( 16 ) to determine priority for publication. An overview of what a peer reviewer looks for when evaluating a manuscript, in order of importance, is presented in Figure 2.

Figure 2. How a peer reviewer evaluates a manuscript

To increase the chance of success in the peer review process, the author must ensure that the paper fully complies with the journal guidelines before submission. The author must also be open to criticism and suggested revisions, and learn from mistakes made in previous submissions.

ADVANTAGES AND DISADVANTAGES OF THE DIFFERENT TYPES OF PEER REVIEW

The peer review process is generally conducted in one of three ways: open review, single-blind review, or double-blind review. In an open review, both the author of the paper and the peer reviewer know one another’s identity. Alternatively, in single-blind review, the reviewer’s identity is kept private, but the author’s identity is revealed to the reviewer. In double-blind review, the identities of both the reviewer and author are kept anonymous. Open peer review is advantageous in that it prevents the reviewer from leaving malicious comments, being careless, or procrastinating completion of the review ( 2 ). It encourages reviewers to be open and honest without being disrespectful. Open reviewing also discourages plagiarism amongst authors ( 2 ). On the other hand, open peer review can also prevent reviewers from being honest for fear of developing bad rapport with the author. The reviewer may withhold or tone down their criticisms in order to be polite ( 2 ). This is especially true when younger reviewers are given a more esteemed author’s work, in which case the reviewer may be hesitant to provide criticism for fear that it will damper their relationship with a superior ( 2 ). According to the Sense About Science survey, editors find that completely open reviewing decreases the number of people willing to participate, and leads to reviews of little value ( 12 ). In the aforementioned study by the PRC, only 23% of authors surveyed had experience with open peer review ( 7 ).

Single-blind peer review is by far the most common. In the PRC study, 85% of authors surveyed had experience with single-blind peer review ( 7 ). This method is advantageous as the reviewer is more likely to provide honest feedback when their identity is concealed ( 2 ). This allows the reviewer to make independent decisions without the influence of the author ( 2 ). The main disadvantage of reviewer anonymity, however, is that reviewers who receive manuscripts on subjects similar to their own research may be tempted to delay completing the review in order to publish their own data first ( 2 ).

Double-blind peer review is advantageous as it prevents the reviewer from being biased against the author based on their country of origin or previous work ( 2 ). This allows the paper to be judged based on the quality of the content, rather than the reputation of the author. The Sense About Science survey indicates that 76% of researchers think double-blind peer review is a good idea ( 12 ), and the PRC survey indicates that 45% of authors have had experience with double-blind peer review ( 7 ). The disadvantage of double-blind peer review is that, especially in niche areas of research, it can sometimes be easy for the reviewer to determine the identity of the author based on writing style, subject matter or self-citation, and thus, impart bias ( 2 ).

Masking the author’s identity from peer reviewers, as is the case in double-blind review, is generally thought to minimize bias and maintain review quality. A study by Justice et al. in 1998 investigated whether masking author identity affected the quality of the review ( 17 ). One hundred and eighteen manuscripts were randomized; 26 were peer reviewed as normal, and 92 were moved into the ‘intervention’ arm, where editor quality assessments were completed for 77 manuscripts and author quality assessments were completed for 40 manuscripts ( 17 ). There was no perceived difference in quality between the masked and unmasked reviews. Additionally, the masking itself was often unsuccessful, especially with well-known authors ( 17 ). However, a previous study conducted by McNutt et al. had different results ( 18 ). In this case, blinding was successful 73% of the time, and they found that when author identity was masked, the quality of review was slightly higher ( 18 ). Although Justice et al. argued that this difference was too small to be consequential, their study targeted only biomedical journals, and the results cannot be generalized to journals of a different subject matter ( 17 ). Additionally, there were problems masking the identities of well-known authors, introducing a flaw in the methods. Regardless, Justice et al. concluded that masking author identity from reviewers may not improve review quality ( 17 ).

In addition to open, single-blind and double-blind peer review, there are two experimental forms of peer review. In some cases, following publication, papers may be subjected to post-publication peer review. As many papers are now published online, the scientific community has the opportunity to comment on these papers, engage in online discussions and post a formal review. For example, online publishers PLOS and BioMed Central have enabled scientists to post comments on published papers if they are registered users of the site ( 10 ). Philica is another journal launched with this experimental form of peer review. Only 8% of authors surveyed in the PRC study had experience with post-publication review ( 7 ). Another experimental form of peer review called Dynamic Peer Review has also emerged. Dynamic peer review is conducted on websites such as Naboj, which allow scientists to conduct peer reviews on articles in the preprint media ( 19 ). The peer review is conducted on repositories and is a continuous process, which allows the public to see both the article and the reviews as the article is being developed ( 19 ). Dynamic peer review helps prevent plagiarism as the scientific community will already be familiar with the work before the peer reviewed version appears in print ( 19 ). Dynamic review also reduces the time lag between manuscript submission and publishing. An example of a preprint server is the ‘arXiv’ developed by Paul Ginsparg in 1991, which is used primarily by physicists ( 19 ). These alternative forms of peer review are still un-established and experimental. Traditional peer review is time-tested and still highly utilized. All methods of peer review have their advantages and deficiencies, and all are prone to error.

PEER REVIEW OF OPEN ACCESS JOURNALS

Open access (OA) journals are becoming increasingly popular as they allow the potential for widespread distribution of publications in a timely manner ( 20 ). Nevertheless, there can be issues regarding the peer review process of open access journals. In a study published in Science in 2013, John Bohannon submitted 304 slightly different versions of a fictional scientific paper (written by a fake author, working out of a non-existent institution) to a selected group of OA journals. This study was performed in order to determine whether papers submitted to OA journals are properly reviewed before publication in comparison to subscription-based journals. The journals in this study were selected from the Directory of Open Access Journals (DOAJ) and Beall’s List, a list of journals which are potentially predatory, and all required a fee for publishing ( 21 ). Of the 304 journals, 157 accepted a fake paper, suggesting that acceptance was based on financial interest rather than the quality of the article itself, while 98 journals promptly rejected the fakes ( 21 ). Although this study highlights useful information on the problems associated with lower quality publishers that do not have an effective peer review system in place, the article also generalizes the study results to all OA journals, which can be detrimental to the general perception of OA journals. There were two limitations of the study that made it impossible to accurately determine the relationship between peer review and OA journals: 1) there was no control group (subscription-based journals), and 2) the fake papers were sent to a non-randomized selection of journals, resulting in bias.

JOURNAL ACCEPTANCE RATES

Based on a recent survey, the average acceptance rate for papers submitted to scientific journals is about 50% ( 7 ). Twenty percent of submitted manuscripts are rejected prior to review, and 30% are rejected following review ( 7 ). Of the 50% accepted, 41% are accepted with the condition of revision, while only 9% are accepted without the request for revision ( 7 ).
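Taken together, these four percentages account for every submission. As a minimal sketch, assuming a hypothetical batch of 1,000 submissions (the batch size is invented purely for illustration), the rates break down as follows:

```python
# Outcome rates reported in the PRC survey cited above.
rates = {
    "rejected before review":    0.20,
    "rejected after review":     0.30,
    "accepted with revision":    0.41,
    "accepted without revision": 0.09,
}
assert abs(sum(rates.values()) - 1.0) < 1e-9  # the four outcomes cover all submissions

submissions = 1000  # hypothetical batch size, for illustration only
for outcome, rate in rates.items():
    print(f"{outcome}: {round(rate * submissions)} papers")
```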

SATISFACTION WITH THE PEER REVIEW SYSTEM

Based on a recent survey by the PRC, 64% of academics are satisfied with the current system of peer review, and only 12% claimed to be ‘dissatisfied’ ( 7 ). The large majority, 85%, agreed with the statement that ‘scientific communication is greatly helped by peer review’ ( 7 ). There was a similarly high level of support (83%) for the idea that peer review ‘provides control in scientific communication’ ( 7 ).

HOW TO PEER REVIEW EFFECTIVELY

The following are ten tips on how to be an effective peer reviewer as indicated by Brian Lucey, an expert on the subject ( 22 ):

1) Be professional

Peer review is a mutual responsibility among fellow scientists, and scientists are expected, as part of the academic community, to take part in peer review. If one is to expect others to review their work, they should commit to reviewing the work of others as well, and put effort into it.

2) Be pleasant

If the paper is of low quality, suggest that it be rejected, but do not leave ad hominem comments. There is no benefit to being ruthless.

3) Read the invite

When emailing a scientist to ask them to conduct a peer review, the majority of journals will provide a link to either accept or reject. Do not respond to the email; respond through the link.

4) Be helpful

Suggest how the authors can overcome the shortcomings in their paper. A review should guide the author on what is good and what needs work from the reviewer’s perspective.

5) Be scientific

The peer reviewer plays the role of a scientific peer, not an editor for proofreading or decision-making. Don’t fill a review with comments on editorial and typographic issues. Instead, focus on adding value with scientific knowledge and commenting on the credibility of the research conducted and conclusions drawn. If the paper has a lot of typographical errors, suggest that it be professionally proof edited as part of the review.

6) Be timely

Stick to the timeline given when conducting a peer review. Editors track who is reviewing what and when and will know if someone is late on completing a review. It is important to be timely both out of respect for the journal and the author, as well as to not develop a reputation of being late for review deadlines.

7) Be realistic

The peer reviewer must be realistic about the work presented, the changes they suggest, and their role. Peer reviewers may set the bar too high for the paper they are reviewing by proposing changes that are too ambitious, and editors must then override them.

8) Be empathetic

Ensure that the review is scientific, helpful and courteous. Be sensitive and respectful with word choice and tone in a review.

9) Be open

Remember that both specialists and generalists can provide valuable insight when peer reviewing. Editors will try to get both specialised and general reviewers for any particular paper to allow for different perspectives. If someone is asked to review, the editor has determined they have a valid and useful role to play, even if the paper is not in their area of expertise.

10) Be organised

A review requires structure and logical flow. A reviewer should proofread their review before submitting it for structural, grammatical and spelling errors as well as for clarity. Most publishers provide short guides on structuring a peer review on their website. Begin with an overview of the proposed improvements; then provide feedback on the paper structure, the quality of data sources and methods of investigation used, the logical flow of argument, and the validity of conclusions drawn. Then provide feedback on style, voice and lexical concerns, with suggestions on how to improve.

In addition, the American Physiological Society (APS) recommends in its Peer Review 101 Handout that peer reviewers should put themselves in both the editor’s and author’s shoes to ensure that they provide what both the editor and the author need and expect ( 11 ). To please the editor, the reviewer should ensure that the peer review is completed on time, and that it provides clear explanations to back up recommendations. To be helpful to the author, the reviewer must ensure that their feedback is constructive. It is suggested that the reviewer take time to think about the paper; they should read it once, wait at least a day, and then re-read it before writing the review ( 11 ). The APS also suggests that graduate students and researchers pay attention to how peer reviewers edit their work, as well as to what edits they find helpful, in order to learn how to peer review effectively ( 11 ). Additionally, it is suggested that graduate students practice reviewing by editing their peers’ papers and asking a faculty member for feedback on their efforts. It is recommended that young scientists offer to peer review as often as possible in order to become skilled at the process ( 11 ). The majority of students, fellows and trainees do not get formal training in peer review, but rather learn by observing their mentors. According to the APS, one acquires experience through networking and referrals, and should therefore try to strengthen relationships with journal editors by offering to review manuscripts ( 11 ). The APS also suggests that experienced reviewers provide constructive feedback to students and junior colleagues on their peer review efforts, and encourages them to peer review to demonstrate the importance of this process in improving science ( 11 ).

The peer reviewer should only comment on areas of the manuscript that they are knowledgeable about ( 23 ). If there is any section of the manuscript they feel they are not qualified to review, they should mention this in their comments and not provide further feedback on that section. The peer reviewer is not permitted to share any part of the manuscript with a colleague (even if they may be more knowledgeable in the subject matter) without first obtaining permission from the editor ( 23 ). If a peer reviewer comes across something they are unsure of in the paper, they can consult the literature to try and gain insight. It is important for scientists to remember that if a paper can be improved by the expertise of one of their colleagues, the journal must be informed of the colleague’s help, and approval must be obtained for their colleague to read the protected document. Additionally, the colleague must be identified in the confidential comments to the editor, in order to ensure that he/she is appropriately credited for any contributions ( 23 ). It is the job of the reviewer to make sure that the colleague assisting is aware of the confidentiality of the peer review process ( 23 ). Once the review is complete, the manuscript must be destroyed and cannot be saved electronically by the reviewers ( 23 ).

COMMON ERRORS IN SCIENTIFIC PAPERS

When performing a peer review, there are some common scientific errors to look out for. Most of these errors are violations of logic and common sense: these may include contradicting statements, unwarranted conclusions, suggestion of causation when there is only support for correlation, inappropriate extrapolation, circular reasoning, or pursuit of a trivial question ( 24 ). It is also common for authors to suggest that two variables are different because the effects of one variable are statistically significant while the effects of the other variable are not, rather than directly comparing the two variables ( 24 ). Authors sometimes overlook a confounding variable and do not control for it, or forget to include important details on how their experiments were controlled or the physical state of the organisms studied ( 24 ). Another common fault is the author’s failure to define terms or use words with precision, as these practices can mislead readers ( 24 ). Jargon and/or misused terms can be a serious problem in papers. Inaccurate statements about specific citations are also a common occurrence ( 24 ). Additionally, many studies produce knowledge that can be applied to areas of science outside the scope of the original study; therefore, it is better for reviewers to look at the novelty of the idea, conclusions, data, and methodology, rather than scrutinize whether or not the paper answered the specific question at hand ( 24 ). Although it is important to recognize these points, when performing a review it is generally better practice for the peer reviewer to not focus on a checklist of things that could be wrong, but rather carefully identify the problems specific to each paper and continuously ask themselves if anything is missing ( 24 ). An extremely detailed description of how to conduct peer review effectively is presented in the paper How I Review an Original Scientific Article written by Frederic G. Hoppin, Jr. It can be accessed through the American Physiological Society website under the Peer Review Resources section.

CRITICISM OF PEER REVIEW

A major criticism of peer review is that there is little evidence that the process actually works, that it is actually an effective screen for good quality scientific work, and that it actually improves the quality of scientific literature. As a 2002 study published in the Journal of the American Medical Association concluded, ‘Editorial peer review, although widely used, is largely untested and its effects are uncertain’ ( 25 ). Critics also argue that peer review is not effective at detecting errors. Highlighting this point, an experiment by Godlee et al. published in the British Medical Journal (BMJ) inserted eight deliberate errors into a paper that was nearly ready for publication, and then sent the paper to 420 potential reviewers ( 7 ). Of the 420 reviewers that received the paper, 221 (53%) responded, the average number of errors spotted by reviewers was two, no reviewer spotted more than five errors, and 35 reviewers (16%) did not spot any.

Another criticism of peer review is that the process is not conducted thoroughly by scientific conferences with the goal of obtaining large numbers of submitted papers. Such conferences often accept any paper sent in, regardless of its credibility or the prevalence of errors, because the more papers they accept, the more money they can make from author registration fees ( 26 ). This misconduct was exposed by three MIT graduate students, Jeremy Stribling, Dan Aguayo and Maxwell Krohn, who developed a simple computer program called SCIgen that generates nonsense papers and presents them as scientific papers ( 26 ). Subsequently, a nonsense SCIgen paper submitted to a conference was promptly accepted. In 2014, Nature reported that French researcher Cyril Labbé discovered that sixteen SCIgen nonsense papers had been used by the German academic publisher Springer ( 26 ). Over 100 nonsense papers generated by SCIgen were published by the US Institute of Electrical and Electronics Engineers (IEEE) ( 26 ). Both organisations have been working to remove the papers. Labbé developed a program to detect SCIgen papers and has made it freely available to ensure publishers and conference organizers do not accept nonsense work in the future. It is available at this link: http://scigendetect.on.imag.fr/main.php ( 26 ).

Additionally, peer review is often criticized for being unable to accurately detect plagiarism. However, many believe that detecting plagiarism cannot practically be included as a component of peer review. As explained by Alice Tuff, development manager at Sense About Science, ‘The vast majority of authors and reviewers think peer review should detect plagiarism (81%) but only a minority (38%) think it is capable. The academic time involved in detecting plagiarism through peer review would cause the system to grind to a halt’ ( 27 ). Publishing house Elsevier began developing electronic plagiarism tools with the help of journal editors in 2009 to help improve this issue ( 27 ).

It has also been argued that peer review has lowered research quality by limiting creativity amongst researchers. Proponents of this view claim that peer review has repressed scientists from pursuing innovative research ideas and bold research questions that have the potential to make major advances and paradigm shifts in the field, as they believe that this work will likely be rejected by their peers upon review ( 28 ). Indeed, in some cases peer review may result in rejection of innovative research, as some studies may not seem particularly strong initially, yet may be capable of yielding very interesting and useful developments when examined under different circumstances, or in the light of new information ( 28 ). Scientists that do not believe in peer review argue that the process stifles the development of ingenious ideas, and thus the release of fresh knowledge and new developments into the scientific community.

Another issue peer review is criticized for is that there are a limited number of people competent to conduct peer review compared to the vast number of papers that need reviewing. An enormous number of papers are published each year (1.3 million papers in 23,750 journals in 2006), far more than the pool of competent peer reviewers could possibly review ( 29 ). Thus, people who lack the required expertise to analyze the quality of a research paper are conducting reviews, and weak papers are being accepted as a result. It is now possible to publish any paper in an obscure journal that claims to be peer-reviewed, though the paper or journal itself could be substandard ( 29 ). On a similar note, the US National Library of Medicine indexes 39 journals that specialize in alternative medicine, and though they all identify themselves as “peer-reviewed”, they rarely publish any high quality research ( 29 ). This highlights the fact that peer review of more controversial or specialized work is typically performed by people who are interested and hold similar views or opinions as the author, which can cause bias in their review. For instance, a paper on homeopathy is likely to be reviewed by fellow practicing homeopaths, and thus is likely to be accepted as credible, though other scientists may find the paper to be nonsense ( 29 ). In some cases, papers are initially published, but their credibility is challenged at a later date and they are subsequently retracted. Retraction Watch is a website dedicated to revealing papers that have been retracted after publishing, potentially due to improper peer review ( 30 ).

Additionally, despite its many positive outcomes, peer review is also criticized for delaying the dissemination of new knowledge into the scientific community, and for being an unpaid activity that takes scientists’ time away from activities that they would otherwise prioritize, such as research and teaching, for which they are paid ( 31 ). As described by Eva Amsen, Outreach Director for F1000Research, peer review was originally developed as a means of helping editors choose which papers to publish when journals had to limit the number of papers they could print in one issue ( 32 ). However, nowadays most journals are available online, either exclusively or in addition to print, and many journals have very limited printing runs ( 32 ). Since there are no longer page limits to journals, any good work can and should be published. Consequently, being selective for the purpose of saving space in a journal is no longer a valid excuse that peer reviewers can use to reject a paper ( 32 ). However, some reviewers have used this excuse when they have personal ulterior motives, such as getting their own research published first.

RECENT INITIATIVES TOWARDS IMPROVING PEER REVIEW

F1000Research was launched in January 2013 by Faculty of 1000 as an open access journal that immediately publishes papers (after an initial check to ensure that the paper is in fact produced by a scientist and has not been plagiarised), and then conducts transparent post-publication peer review ( 32 ). F1000Research aims to prevent delays in new science reaching the academic community that are caused by prolonged publication times ( 32 ). It also aims to make peer reviewing more fair by eliminating any anonymity, which prevents reviewers from delaying the completion of a review so they can publish their own similar work first ( 32 ). F1000Research offers completely open peer review, where everything is published, including the name of the reviewers, their review reports, and the editorial decision letters ( 32 ).

PeerJ was founded by Jason Hoyt and Peter Binfield in June 2012 as an open access, peer reviewed scholarly journal for the Biological and Medical Sciences ( 33 ). PeerJ selects articles to publish based only on scientific and methodological soundness, not on subjective determinants of ‘impact’, ‘novelty’ or ‘interest’ ( 34 ). It works on a “lifetime publishing plan” model which charges scientists for publishing plans that give them lifetime rights to publish with PeerJ, rather than charging them per publication ( 34 ). PeerJ also encourages open peer review, and authors are given the option to post the full peer review history of their submission with their published article ( 34 ). PeerJ also offers a pre-print review service called PeerJ Pre-prints, in which paper drafts are reviewed before being sent to PeerJ to publish ( 34 ).

Rubriq is an independent peer review service designed by Shashi Mudunuri and Keith Collier to improve the peer review system ( 35 ). Rubriq is intended to decrease redundancy in the peer review process so that the time lost in redundant reviewing can be put back into research ( 35 ). According to Keith Collier, over 15 million hours are lost each year to redundant peer review, as papers get rejected from one journal and are subsequently submitted to a less prestigious journal where they are reviewed again ( 35 ). Authors often have to submit their manuscript to multiple journals, and are often rejected multiple times before they find the right match. This process could take months or even years ( 35 ). Rubriq makes peer review portable in order to help authors choose the journal that is best suited for their manuscript from the beginning, thus reducing the time before their paper is published ( 35 ). Rubriq operates under an author-pay model, in which the author pays a fee and their manuscript undergoes double-blind peer review by three expert academic reviewers using a standardized scorecard ( 35 ). The majority of the author’s fee goes towards a reviewer honorarium ( 35 ). The papers are also screened for plagiarism using iThenticate ( 35 ). Once the manuscript has been reviewed by the three experts, the most appropriate journal for submission is determined based on the topic and quality of the paper ( 35 ). The paper is returned to the author in 1-2 weeks with the Rubriq Report ( 35 ). The author can then submit their paper to the suggested journal with the Rubriq Report attached. The Rubriq Report will give the journal editors a much stronger incentive to consider the paper as it shows that three experts have recommended the paper to them ( 35 ). Rubriq also has its benefits for reviewers; the Rubriq scorecard gives structure to the peer review process, and thus makes it consistent and efficient, which decreases time and stress for the reviewer. Reviewers also receive feedback on their reviews and most significantly, they are compensated for their time ( 35 ). Journals also benefit, as they receive pre-screened papers, reducing the number of papers sent to their own reviewers, which often end up rejected ( 35 ). This can reduce reviewer fatigue, and allow only higher-quality articles to be sent to their peer reviewers ( 35 ).

According to Eva Amsen, peer review and scientific publishing are moving in a new direction, in which all papers will be posted online, and a post-publication peer review will take place that is independent of specific journal criteria and solely focused on improving paper quality ( 32 ). Journals will then choose papers that they find relevant based on the peer reviews and publish those papers as a collection ( 32 ). In this process, peer review and individual journals are uncoupled ( 32 ). In Keith Collier’s opinion, post-publication peer review is likely to become more prevalent as a complement to pre-publication peer review, but not as a replacement ( 35 ). Post-publication peer review will not serve to identify errors and fraud but will provide an additional measurement of impact ( 35 ). Collier also believes that as journals and publishers consolidate into larger systems, there will be stronger potential for “cascading” and shared peer review ( 35 ).

CONCLUDING REMARKS

Peer review has become fundamental in assisting editors in selecting credible, high quality, novel and interesting research papers to publish in scientific journals and in ensuring the correction of any errors or issues present in submitted papers. Though the peer review process still has some flaws and deficiencies, a more suitable screening method for scientific papers has not yet been proposed or developed. Researchers have begun, and must continue, to look for means of addressing the current issues with peer review to ensure that it is a foolproof system that allows only quality research papers to be released into the scientific community.

Validity, Accuracy and Reliability Explained with Examples

This is part of the NSW HSC science curriculum, under the Working Scientifically skills.

Part 1 – Validity

Part 2 – Accuracy

Part 3 – Reliability

Science experiments are an essential part of high school education, helping students understand key concepts and develop critical thinking skills. However, the value of an experiment lies in its validity, accuracy, and reliability. Let's break down these terms and explore how each can be improved or compromised, using simple experiments as examples.

Target Analogy to Understand Accuracy and Reliability

The target analogy is a classic way to understand the concepts of accuracy and reliability in scientific measurements and experiments. 


Accuracy refers to how close a measurement is to the true or accepted value. In the analogy, it's how close the arrows come to hitting the bullseye (represents the true or accepted value).

Reliability  refers to the consistency of a set of measurements. Reliable data can be reproduced under the same conditions. In the analogy, it's represented by how tightly the arrows are grouped together, regardless of whether they hit the bullseye. Therefore, we can have scientific results that are reliable but inaccurate.

Validity refers to how well an experiment investigates the aim or tests the underlying hypothesis. While validity is not represented in the target analogy, it can sometimes be assessed by using the accuracy of results as a proxy: experiments that produce accurate results are likely to be valid, as invalid experiments usually do not yield accurate results.
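In code, the analogy maps onto two summary statistics: the distance of the mean from the true value (accuracy) and the spread of repeated measurements (reliability). The following is a minimal sketch with invented numbers, not anything prescribed by the syllabus:

```python
from statistics import mean, stdev

true_value = 100.0  # the "bullseye": the true or accepted value being measured

# Two hypothetical sets of repeated measurements.
tight_but_off = [90.1, 90.3, 89.9, 90.2, 90.0]        # reliable (tight grouping) but inaccurate
centred_but_loose = [92.0, 108.0, 97.0, 103.0, 100.0]  # accurate on average but unreliable

for label, data in [("tight but off", tight_but_off),
                    ("centred but loose", centred_but_loose)]:
    error = abs(mean(data) - true_value)  # accuracy: distance from the bullseye
    spread = stdev(data)                  # reliability: grouping of the arrows
    print(f"{label}: mean error = {error:.1f}, spread = {spread:.1f}")
```

The first dataset prints a small spread but a large error (reliable, inaccurate); the second prints the reverse, mirroring the two failure modes in the target diagram.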

Validity refers to how well an experiment measures what it is supposed to measure and investigates the aim.

Ask yourself the questions:

  • "Is my experimental method and design suitable?"
  • "Is my experiment testing or investigating what it's suppose to?"


For example, if you're investigating the effect of the volume of water (independent variable) on plant growth, your experiment would be valid if you measure growth factors like height or leaf size (these would be your dependent variables).

However, validity entails more than just what's being measured. When assessing validity, you should also examine how well the experimental methodology investigates the aim of the experiment.

Assessing Validity

An experiment’s procedure, the subsequent methods of analysis of the data, the data itself, and the conclusion you draw from the data all have their own associated validities. It is important to understand this division because there are different factors to consider when assessing the validity of any one of them. The validity of an experiment as a whole depends on the individual validities of these components.

When assessing the validity of the procedure , consider the following:

  • Does the procedure control all necessary variables except for the dependent and independent variables? That is, have you isolated the effect of the independent variable on the dependent variable?
  • Does this effect you have isolated actually address the aim and/or hypothesis?
  • Does your method include enough repetitions for a reliable result? (Read more about reliability below)

When assessing the validity of the method of analysis of the data , consider the following:

  • Does the analysis extrapolate or interpolate the experimental data? Generally, interpolation is valid, but extrapolation is invalid. This is because by extrapolating you are ‘peering out into the darkness’ – just because your data showed a certain trend over a certain range does not mean that the trend will hold outside that range.
  • Does the analysis use accepted laws and mathematical relationships? That is, do the equations used for analysis have a scientific or mathematical basis? For example, `F = ma` is an accepted law in physics, but if your analysis made up a relationship like `F = ma^2` that has no scientific or mathematical backing, the method of analysis is invalid.
  • Is the most appropriate method of analysis used? Consider the differences between using a table and a graph. In a graph, you can use the gradient to minimise the effects of systematic errors and can also reduce the effect of random errors. The visual nature of a graph also allows you to easily identify outliers and potentially exclude them from analysis. This is why graphical analysis is generally more valid than using values from tables.

When assessing the validity of your results , consider the following: 

  • Is your primary data (data you collected from your own experiment) BOTH accurate and reliable? If not, it is invalid.
  • Are the secondary sources you may have used BOTH reliable and accurate?

When assessing the validity of your conclusion , consider the following:

  • Does your conclusion relate directly to the aim or the hypothesis?

How to Improve Validity

Ways of improving validity will differ across experiments. You must first identify which area of the experiment’s validity is lacking (is it the procedure, analysis, results, or conclusion?). Then, you must come up with ways of overcoming that particular weakness.

Below are some examples of this.

Example – Validity in Chemistry Experiment 

Let's say we want to measure the mass of carbon dioxide in a can of soft drink.

Heating a can of soft drink

The following steps are followed:

  • Weigh an unopened can of soft drink on an electronic balance.
  • Open the can.
  • Place the can on a hot plate until it begins to boil.
  • When cool, re-weigh the can to determine the mass loss.

To ensure this experiment is valid, we must establish controlled variables:

  • type of soft drink used
  • temperature at which this experiment is conducted
  • period of time before soft drink is re-weighed

Despite these controlled variables, this experiment is invalid because it actually doesn't help us measure the mass of carbon dioxide in the soft drink. This is because by heating the soft drink until it boils, we are also losing water due to evaporation. As a result, the mass loss measured is not only due to the loss of carbon dioxide, but also water. A simple way to improve the validity of this experiment is to not heat it; by simply opening the can of soft drink, carbon dioxide in the can will escape without loss of water.

Example – Validity in Physics Experiment

Let's say we want to measure the value of gravitational acceleration `g` using a simple pendulum system, and the following equation:

$$T = 2\pi \sqrt{\frac{l}{g}}$$

  • `T` is the period of oscillation
  • `l` is the length of string attached to the mass
  • `g` is the acceleration due to gravity

Pendulum practical

  • Cut a piece of a string or dental floss so that it is 1.0 m long.
  • Attach a 500.0 g mass of high density to the end of the string.
  • Attach the other end of the string to the retort stand using a clamp.
  • Starting at an angle of less than 10º, allow the pendulum to swing and measure the pendulum’s period for 10 oscillations using a stopwatch.
  • Repeat the experiment with 1.2 m, 1.5 m and 1.8 m strings.

The controlled variables we must establish in this experiment include:

  • mass used in the pendulum
  • location at which the experiment is conducted

The validity of this experiment depends on the starting angle of oscillation. The above equation (the method of analysis) is only true for small angles (`\theta < 15^{\circ}`), for which `\sin \theta \approx \theta`. We also want to make sure the pendulum system has a small enough surface area to minimise the effect of air resistance on its oscillation.
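A quick way to see why the small-angle requirement matters is to compare `\sin \theta` with `\theta` directly. The following short sketch (plain Python with NumPy) shows how the approximation degrades as the angle grows:

```python
import numpy as np

# Compare sin(theta) with theta (in radians) for increasing angles
for deg in (5, 10, 15, 30):
    theta = np.radians(deg)
    print(f"{deg:>2} deg: sin(theta)/theta = {np.sin(theta)/theta:.4f}")
# At small angles the ratio stays close to 1, so sin(theta) ~ theta holds;
# by 30 degrees the approximation is already off by a few percent.
```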


In this instance, it would be invalid to use a single pair of values (length and period) to calculate the value of gravitational acceleration. A more appropriate method of analysis would be to plot period squared against length to obtain a linear relationship, then use the gradient of the line of best fit to determine the value of `g`.
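To make the graphical analysis concrete, here is a minimal sketch of this calculation in Python. The measurements are hypothetical but plausible values for this practical; since `T^2 = (4\pi^2/g)\,l`, the gradient of a `T^2` versus `l` plot equals `4\pi^2/g`:

```python
import numpy as np

# Hypothetical data: string lengths (m) and measured periods (s)
lengths = np.array([1.0, 1.2, 1.5, 1.8])
periods = np.array([2.01, 2.20, 2.46, 2.69])

# T = 2*pi*sqrt(l/g)  =>  T^2 = (4*pi^2 / g) * l,
# so plotting T^2 against l gives a line with gradient m = 4*pi^2 / g.
gradient, intercept = np.polyfit(lengths, periods**2, 1)

g = 4 * np.pi**2 / gradient
print(f"g = {g:.2f} m/s^2")  # should come out close to 9.8 m/s^2
```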

Accuracy

Accuracy refers to how close the experimental measurements are to the true value.

Accuracy depends on:

  • the validity of the experiment
  • the degree of error:
    • systematic errors are those that affect every single one of your data points consistently, meaning that the cause of the error is always present. For example, a badly calibrated temperature gauge might report every reading 5 °C above the true value.
    • random errors are errors that occur inconsistently. For example, the temperature gauge readings might be affected by random fluctuations in room temperature: some readings might be above the true value, and some might be below it.
  • the sensitivity of the equipment used.

Assessing Accuracy 

The effect of errors and insensitive equipment can both be captured by calculating the percentage error:

$$\% \text{ error} = \frac{\left| \text{experimental value} - \text{true value} \right|}{\text{true value}} \times 100\%$$

Generally, measurements are considered accurate when the percentage error is less than 5%. You should always take the context of the experiment into account when assessing accuracy.
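As a quick illustration, here is a short Python helper (with hypothetical values) that applies this formula together with the 5% rule of thumb:

```python
def percentage_error(experimental: float, true_value: float) -> float:
    """Percentage error of an experimental value relative to the true value."""
    return abs(experimental - true_value) / abs(true_value) * 100

# Hypothetical example: a measured pendulum period vs. the theoretical value
measured, theoretical = 2.10, 2.01  # seconds
error = percentage_error(measured, theoretical)
print(f"{error:.1f}% error -> {'accurate' if error < 5 else 'not accurate'}")
```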

While accuracy and validity have different definitions, the two are closely related. Accurate results often suggest that the underlying experiment is valid, as invalid experiments are unlikely to produce accurate results.

In a simple pendulum experiment, if your measurements of the pendulum's period are close to the calculated value, your experiment is accurate. A table showing sample experimental measurements vs accepted values from using the equation above is shown below. 

[Table: sample measured pendulum periods compared with accepted values calculated from the equation above]

All experimental values in the table above are within 5% of the accepted (theoretical) values, so they are considered accurate.

How to Improve Accuracy

  • Remove systematic errors: for example, if the experiment’s measuring instruments are poorly calibrated, calibrate them correctly before doing the experiment again.
  • Reduce the influence of random errors: this can be done by having more repetitions in the experiment and reporting the average values. If you have enough of these random errors – some above the true value and some below it – then averaging them will make them cancel each other out. This brings your average value closer and closer to the true value (the short simulation after this list illustrates the idea).
  • Use more sensitive equipment: for example, measure time by analysing a recording of an object’s motion frame by frame, instead of using a stopwatch. The sensitivity of a piece of equipment can be gauged by its limit of reading. A stopwatch may only measure to the nearest millisecond – that is its limit of reading – whereas a recording can be analysed frame by frame and, depending on the frame rate of the camera, this could mean measuring to the nearest microsecond.
  • Obtain more measurements over a wider range: in some cases, the relationship between two variables can be more accurately determined by testing over a wider range. For example, in the pendulum experiment, the period can be measured for strings of various lengths. In this instance, repeating the experiment does not relate to reliability, because we have changed the value of the independent variable tested.
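The effect of averaging on random errors can be demonstrated with a toy simulation. This sketch assumes a true pendulum period of 2.01 s and stopwatch readings with purely random noise (standard deviation 0.05 s, an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
true_value = 2.01  # assumed true period in seconds

# Simulate stopwatch readings scattered randomly around the true value
for n_trials in (3, 10, 100):
    readings = true_value + rng.normal(0, 0.05, size=n_trials)
    print(f"{n_trials:>3} trials: average = {readings.mean():.3f} s")
# With more repetitions, the random errors increasingly cancel out
# and the average drifts towards the true value of 2.010 s.
```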

Reliability

Reliability involves the consistency of your results over multiple trials.

Assessing Reliability

The reliability of an experiment can be broken down into the reliability of the procedure and the reliability of the final results.

The reliability of the procedure refers to how consistently the steps of your experiment produce similar results. For example, if an experiment produces the same values every time it is repeated, then it is highly reliable. This can be assessed quantitatively by looking at the spread of measurements, using statistics such as the greatest deviation from the mean, the standard deviation, or z-scores.
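These spread statistics are straightforward to compute; here is a minimal sketch using hypothetical mass-loss readings from the soft drink experiment:

```python
import numpy as np

# Hypothetical repeated measurements of mass loss (g)
trials = np.array([2.16, 2.09, 2.20])

mean = trials.mean()
greatest_deviation = np.max(np.abs(trials - mean))  # worst single departure from the mean
std_dev = trials.std(ddof=1)                        # sample standard deviation
z_scores = (trials - mean) / std_dev                # deviations in units of std. dev.

print(f"mean = {mean:.3f} g, greatest deviation = {greatest_deviation:.3f} g")
print(f"standard deviation = {std_dev:.3f} g, z-scores = {np.round(z_scores, 2)}")
```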

Ask yourself: "Is my result reproducible?"

The reliability of results cannot be assessed if only one data point or measurement is obtained in the experiment; there must be at least three. When you are repeating the experiment to assess the reliability of its results, you must follow the same steps and use the same value for the independent variable. Results obtained from methods with different steps cannot be assessed for their reliability.

Obtaining only one measurement in an experiment is not enough because it could be affected by errors or produced by pure chance. Repeating the experiment and obtaining the same or similar results will increase your confidence that the results are reproducible (and therefore reliable).

In the soft drink experiment, reliability can be assessed by repeating the steps at least three times:

[Table: mass loss measured in three repeated trials of the soft drink experiment]

The mass losses measured in all three trials are fairly consistent, suggesting that the reliability of the underlying method is high.

The reliability of the final results refers to how consistently your final data points (e.g. the average values of repeated trials) point towards the same trend – that is, how close they all are to the trend line. This can be assessed quantitatively using the `R^2` value, which ranges between 0 and 1: a value of 0 suggests no correlation between data points, while a value of 1 suggests a perfect correlation with no variance from the trend line.

In the pendulum experiment, we can calculate the `R^2` value (done in Excel) by using the final average period values measured for each pendulum length.


Here, an `R^2` value of 0.9758 suggests the four average values are fairly close to the overall linear trend line (low variance from the trend line). Thus, the results are fairly reliable.
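The same `R^2` calculation can be done outside Excel. Here is a minimal Python sketch; the data are illustrative, so the printed value will not match the 0.9758 quoted above:

```python
import numpy as np

# Illustrative average periods (s) for each string length (m)
lengths = np.array([1.0, 1.2, 1.5, 1.8])
avg_periods_sq = np.array([2.01, 2.21, 2.45, 2.70]) ** 2

# Fit a straight line, then compute R^2 from the residuals
gradient, intercept = np.polyfit(lengths, avg_periods_sq, 1)
predicted = gradient * lengths + intercept

ss_res = np.sum((avg_periods_sq - predicted) ** 2)              # residual sum of squares
ss_tot = np.sum((avg_periods_sq - avg_periods_sq.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.4f}")  # values near 1 indicate low variance from the trend line
```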

How to Improve Reliability

A common misconception is that increasing the number of trials increases the reliability of the procedure. This is not true. The only way to increase the reliability of the procedure is to revise it. This could mean using instruments that are less susceptible to random errors, which cause measurements to be more variable.

Increasing the number of trials actually increases the reliability of the final results . This is because having more repetitions reduces the influence of random errors and brings the average values closer to the true values. Generally, the closer experimental values are to true values, the closer they are to the true trend. That is, accurate data points are generally reliable and all point towards the same trend.

Reliable but Inaccurate / Invalid

It is important to understand that results from an experiment can be reliable (consistent) yet inaccurate (deviating greatly from theoretical values) and/or invalid. In this case, your procedure is reliable, but your final results likely are not.

Examples of Reliability

Using the soft drink example again, if the mass losses measured for three soft drinks (same brand and type of drink) are consistent, then the procedure is reliable.

Using the pendulum example again, if you get similar period measurements every time you repeat the experiment, it’s reliable.  

However, in both cases, if the underlying methods are invalid, the consistent results would be invalid and inaccurate (despite being reliable).


Published: 22 March 2017

More than 3Rs: the importance of scientific validity for harm-benefit analysis of animal research

Hanno Würbel

Lab Animal volume 46, pages 164–166 (2017)


The reproducibility crisis in biomedical research presents a new challenge for conducting harm-benefit analysis: how do we improve the validity of studies to maximize the likelihood of benefit?

Every year, 50–100 million vertebrates are used in experimental procedures worldwide. The use of animals for research is legally regulated on the explicit understanding that such use will provide significant new knowledge facilitating relevant benefits, and no unnecessary harm will be imposed on the animals 1 . Harm-benefit analysis (HBA) is the common tool for making ultimate decisions on whether study protocols meet these expectations. Therefore, HBA is a crucial part of project evaluation and explicitly required by the EU Directive 2010/63; it is also implied in the US Guide for the Care and Use of Laboratory Animals and emphasized in the Terrestrial Animal Health Code by the World Organization for Animal Health (OIE) 2 .

HBA follows the legal principle of proportionality and involves three main questions, namely (1) whether the study is suitable for achieving a legitimate aim, (2) whether it is necessary, and (3) whether it is adequate. Question (3) refers to the actual HBA, which evaluates whether the expected benefits of a study outweigh the harms imposed on the animals. Questions (1) and (2) are instrumental prerequisites for the actual HBA; they are concerned with the scientific rationale underpinning the expected outcome of the study (suitability) and potential alternatives to the likely harms imposed on the animals (necessity).

Evaluation of potential alternatives essentially examines whether the 3Rs principle 3 has been exploited to minimize the harms imposed on the animals. Thus, for a study protocol to proceed to the final HBA, it must argue convincingly that the expected outcome cannot be achieved by using no or non-sentient animals (replace), by using fewer animals (reduce), or by using less harmful procedures (refine). In particular, refinements such as enriched housing, habituation to procedures, non-invasive techniques, and anesthetics and analgesics can shift weights in HBA of animal experiments by minimizing the harms imposed on the animals.

Bumping up the benefits

But what about the benefit side of the equation? Unless a study produces results that are scientifically valid and reproducible, the animals may be wasted for inconclusive research, no matter how little harm is inflicted on them 1 . Whereas 3R efforts to minimize harms to the animals are carefully scrutinized by ethical review committees, the scientific validity and reproducibility of study outcomes are generally taken for granted 4 . Such confidence may not be warranted as highlighted by the ongoing “reproducibility crisis” in biomedical research.

Over the past decade, evidence has accumulated indicating that scientific validity and reproducibility are alarmingly poor throughout biomedical research 1 , 5 . Based on systematic reviews and simulations, Ioannidis concluded that “for most study designs and settings, it is more likely for a research claim to be false than true” 6 . This is supported by evidence for risks of bias throughout in vivo research 4 , 7 , 8 , spectacular cases of irreproducibility 9 , 10 , and translational failure on a large scale 11 , 12 .

Systematic error (bias), poor reproducibility, and translational failure can be caused by flaws at all levels of research, including design, conduct, analysis, and reporting of experiments. For example, studies may use poorly validated animal models or outcome variables 13 ; they may be based on samples that are too small 14 or idiosyncratic 15 ; they may violate principles of good research practice (for example, randomization, blinded outcome assessment, a priori sample size calculation) 4 , 7 , 8 or use inappropriate statistics (for example, p -hacking) 16 ; or they may report results selectively 17 or not at all (for example, publication bias) 18 .

All of this can be detrimental to the scientific validity and reproducibility of results published in the primary scientific literature, thereby compromising the outcome of the research. In much the same way as the 3Rs principle serves to implement strategies that minimize harms to the animals, a more powerful principle may be needed to implement strategies that maximize scientific validity, thereby facilitating the benefits of animal experiments. The following analogy may illustrate this. When refinements for a harmful procedure are available (for example, post-surgical analgesia) but ignored in a study protocol, this represents a violation of the 3Rs principle, thereby causing unnecessary harms to the animals. Similarly, ignorance of measures against risks of bias (for example, randomization, blinded outcome assessment) can be regarded as violation of the principles of good research practice, thereby compromising the outcome of studies. However, similar to unavoidable harms, not all risks of bias are avoidable. For example, when assessing behavioral differences between mice of different coat color, blinded outcome assessment may be impossible. Although non-blinded outcome assessment represents a risk of bias that compromises the study outcome, it is not unethical. By contrast, when blinded outcome assessment is feasible but ignored without justification, it represents a case of irresponsible use of animals, which is unethical, and for example, in the EU is actually against the law.

There is some debate as to whether scientific validity should be weighed on the harm side or the benefit side of the equation 2 , or whether it should be part of an independent third dimension “likelihood of benefit” as in “Bateson's cube” 19 . However, in their recent report on current concepts of HBA of animal experiments, the AALAS-FELASA Working Group concluded that “performing HBA in a systematic way and thereby defining and describing benefits is not common practice”, but that “a well-designed experiment is a fundamental criterion for reliable information and for generating any benefit at all” 2 .

The 3Vs of scientific validity

I therefore propose to extend HBA by adding a more systematic assessment of scientific validity and suggest including three key aspects of scientific validity, namely construct validity (cV), internal validity (iV), and external validity (eV), which for reasons of convenience I will hereafter refer to as the 3Vs. Thus, before the actual HBA, study protocols should not only be assessed for the 3Rs but also for the 3Vs ( Fig. 1 and Table 1 ). Assessment of construct validity should be based on evidence about the level of agreement between the animal model, test or outcome variable and the quality it is meant to measure 20 . In the case of outcome variables this may include evidence of convergent and discriminant validity; in the case of animal models for specific conditions (for example, diseases) in humans or other animals this may include evidence of the three main aspects of model validity: face, construct, and predictive validity 20 , 21 . Assessment of internal validity should be based on evidence for the scientific rationale ( e.g . use of appropriate control groups) and for scientific rigor in terms of measures against risks of bias (for example, definition of primary and secondary outcome variables, sample size calculation, randomization, blinding, statistical analysis plan) 1 , 22 . Finally, assessment of external validity should be based on evidence for experimental design features that enhance, or facilitate inference about, the reproducibility and generalizability of the expected results 1 . This includes splitting experiments into multiple independent replicates (batches) 23 , introducing systematic variation (heterogenization) of relevant variables (for example, species/strains of animals, housing conditions, tests, etc.) 15 , 24 , 25 , or implementing multi-center study designs 26 . In this way, the 3Vs could offer welcome guiding principles for assessing and maximizing the scientific validity of study outcomes, thereby increasing the likelihood of achieving the expected benefit of animal experiments.

Figure 1 (illustration: Kim Caesar/Springer Nature): Whereas 3Rs methods minimize the weight of harms to the animals on the HBA balance, methods to improve the scientific validity of the research (3Vs) maximize the value of study outcomes, thereby facilitating the expected benefits.

At present, ethical review does not include a systematic assessment of scientific validity in the course of HBA. For animal research in Switzerland we recently demonstrated that the authorities licensing animal experiments would actually lack important information to do so; the application form does not explicitly ask for it and, therefore, applicants do not provide it 4 , 8 . In light of the current “reproducibility crisis”, I propose that a more systematic assessment of the 3Vs – similar to the assessment of the 3Rs – as part of HBA would provide a powerful tool to evaluate and enhance the scientific validity and reproducibility of in vivo research.

This seems particularly pertinent in terms of reproducibility and generalizability of research findings. The scope of animal experiments is often very narrow, with most studies conducted as small-scale single-laboratory studies. Due to the highly standardized conditions within laboratories, results of single-laboratory studies often have very little external validity 1 , 15 , 27 . Ironically, 3R efforts to minimize animal use (reduce) may inadvertently exacerbate this situation by promoting standardization as a means to reduce within-experiment variation in view of smaller sample sizes 28 . However, this can be counterproductive, since standardization inevitably reduces external validity and, as a consequence, reproducibility 27 , 29 .

Using data from 50 independent studies on the effect of hypothermia on infarct volume in animal models of stroke, we recently conducted a simulation study to analyze reproducibility of single-laboratory studies compared to multi-laboratory studies. Treatment effects of single-laboratory studies varied widely (between 0% and 100% reduction of infarct volume), and this variation was reduced considerably by multi-laboratory designs. Furthermore, whereas less than 50% of single-laboratory studies produced an accurate estimate of the “true” effect size (reduction of infarct volume by 48%, as assessed by meta-analysis), simulations showed that multi-laboratory studies based on as few as three laboratories can increase reproducibility from less than 50% to over 80%, without increasing false negative rate or a need for larger sample sizes 30 .
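To illustrate the logic of such a simulation (this is a toy sketch, not the authors’ actual code; the between-laboratory and within-laboratory variation below are assumed values), consider the following Monte Carlo comparison of single-laboratory and three-laboratory study designs:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

TRUE_EFFECT = 48.0      # "true" reduction of infarct volume (%), per the meta-analysis
BETWEEN_LAB_SD = 20.0   # assumed spread of lab-specific effects
WITHIN_LAB_SD = 10.0    # assumed measurement noise within a lab
N_ANIMALS = 12          # assumed total sample size per study

def study_estimate(n_labs: int) -> float:
    """Pooled effect estimate from one simulated study spread over n_labs labs."""
    lab_effects = rng.normal(TRUE_EFFECT, BETWEEN_LAB_SD, size=n_labs)
    samples = rng.normal(np.repeat(lab_effects, N_ANIMALS // n_labs), WITHIN_LAB_SD)
    return samples.mean()

for n_labs in (1, 3):
    estimates = np.array([study_estimate(n_labs) for _ in range(20_000)])
    hit_rate = np.mean(np.abs(estimates - TRUE_EFFECT) < 10)
    print(f"{n_labs} lab(s): {hit_rate:.0%} of simulated studies within 10 points of the true effect")
```

Pooling over multiple laboratories averages out the lab-specific idiosyncrasies, so a much larger fraction of simulated studies lands near the true effect, mirroring the pattern reported in the study.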

Beyond HBA in ethical review of animal research, the 3Vs could also become instrumental for peer-review of grant applications and manuscripts submitted for publication. It is laudable that the NIH has recently updated its guidelines for how to evaluate research proposals by including assessment of scientific rigor ( https://grants.nih.gov/reproducibility/index.htm ), and that more and more journals are endorsing the UK NC3Rs ARRIVE guidelines ( https://www.nc3rs.org.uk/arrive-guidelines ). However, assessing scientific validity more systematically based on the 3Vs could help develop these initiatives further toward more powerful guidelines. As with the 3Rs, there is no need for a fixed checklist approach. Instead, funders deciding on the allocation of grant money, authorities licensing animal experiments, and editors evaluating manuscripts for publication could all define their own criteria for assessing each of the 3Vs in a way that appears most conducive to the kinds of decisions at their hands. Besides facilitating decision making, this would also enhance the scientific validity and reproducibility of findings from animal research. While this is clearly important for scientific reasons, it also matters on ethical grounds; it helps to avoid wasting animals for inconclusive research and imposing unnecessary harm on laboratory animals.

References

1. Bailoo, J.D., Reichlin, T.S. & Würbel, H. Refinement of experimental design and conduct in laboratory animal research. ILAR J. 55, 383–391 (2014).
2. Brønstad, A. et al. Current concepts of harm–benefit analysis of animal experiments – report from the AALAS–FELASA working group on harm–benefit analysis – part 1. Lab. Anim. 50(1S), 1–20 (2016).
3. Russell, W.M.S. & Burch, R.L. The Principles of Humane Experimental Technique (Methuen, London, 1959).
4. Vogt, L., Reichlin, T.S., Nathues, C. & Würbel, H. Authorization of animal experiments is based on confidence rather than evidence of scientific rigor. PLoS Biol. 14, e2000598 (2016).
5. Ioannidis, J.P.A., Fanelli, D., Dunne, D.D. & Goodman, S.N. Meta-research: evaluation and improvement of research methods and practices. PLoS Biol. 13, e1002264 (2015).
6. Ioannidis, J.P.A. Why most published research findings are false. PLoS Med. 2, e124 (2005).
7. Macleod, M.R. et al. Risk of bias in reports of in vivo research: a focus for improvement. PLoS Biol. 13, e1002273 (2015).
8. Reichlin, T.S., Vogt, L. & Würbel, H. The researchers' view of scientific rigor – survey on the conduct and reporting of in vivo research. PLoS ONE 11, e0165999 (2016).
9. Prinz, F., Schlange, T. & Asadullah, K. Believe it or not: how much can we rely on published data on potential drug targets? Nat. Rev. Drug Discov. 10, 712 (2011).
10. Begley, C.G. & Ellis, L.M. Drug development: raise standards for preclinical cancer research. Nature 483, 531–533 (2012).
11. Kola, I. & Landis, J. Can the pharmaceutical industry reduce attrition rates? Nat. Rev. Drug Discov. 3, 711–715 (2004).
12. O'Collins, V.E. et al. 1,026 experimental treatments in acute stroke. Ann. Neurol. 59, 467–477 (2006).
13. Nestler, E.J. & Hyman, S.E. Animal models of neuropsychiatric disorders. Nat. Neurosci. 13, 1161–1169 (2010).
14. Button, K.S. et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376 (2013).
15. Richter, S.H., Garner, J.P. & Würbel, H. Environmental standardization: cure or cause of poor reproducibility in animal experiments? Nat. Methods 6, 257–261 (2009).
16. Head, M.L., Holman, L., Lanfear, R., Kahn, A.T. & Jennions, M.D. The extent and consequences of P-hacking in science. PLoS Biol. 13, e1002106 (2015).
17. Tsilidis, K.K. et al. Evaluation of excess significance bias in animal studies of neurological diseases. PLoS Biol. 11, e1001609 (2013).
18. Sena, E.S., van der Worp, H.B., Bath, P.M.W., Howells, D.W. & Macleod, M.R. Publication bias in reports of animal stroke studies leads to major overstatement of efficacy. PLoS Biol. 8, e1000344 (2010).
19. Bateson, P. When to experiment on animals. New Sci. 109, 30–32 (1986).
20. van der Staay, F.J., Arndt, S.S. & Nordquist, R.E. Evaluation of animal models of neurobehavioral disorders. Behav. Brain Funct. 5, 11 (2009).
21. Willner, P. Validation criteria for animal models of human mental disorders: learned helplessness as a paradigm case. Prog. Neuropsychopharmacol. Biol. Psychiatry 10, 677–690 (1986).
22. van der Worp, H.B. et al. Can animal models of disease reliably inform human studies? PLoS Med. 7, e1000245 (2010).
23. Paylor, R. Questioning standardization in science. Nat. Methods 6, 253–254 (2009).
24. Richter, S.H., Garner, J.P., Auer, C., Kunert, J. & Würbel, H. Systematic variation improves reproducibility of animal experiments. Nat. Methods 7, 167–168 (2010).
25. Richter, S.H. et al. Effect of population heterogenization on the reproducibility of mouse behavior: a multi-laboratory study. PLoS ONE 6, e16461 (2011).
26. Wodarski, R. et al. Cross-centre replication of suppressed burrowing behaviour as an ethologically relevant pain outcome measure in the rat: a prospective multicentre study. Pain 157, 2350–2365 (2016).
27. Voelkl, B. & Würbel, H. Reproducibility crisis: are we ignoring reaction norms? Trends Pharmacol. Sci. 37, 509–510 (2016).
28. Parker, R.M.A. & Browne, W.J. The place of experimental design and statistics in the 3Rs. ILAR J. 55, 477–485 (2014).
29. Würbel, H. Behaviour and the standardization fallacy. Nat. Genet. 26, 263 (2000).
30. Würbel, H., Reichlin, T.S., Voelkl, B. & Vogt, L. More than refinement – improving the validity and reproducibility of animal research. in Proc. 50th Congr. Int. Soc. Appl. Ethol. (eds. Dwyer, C., Haskell, M. & Sandilands, V.) 324 (Wageningen Academic Publishers, 2016).


Acknowledgements

I would like to thank Eimear Murphy, Katharina Friedli and Herwig Grimm for valuable comments on earlier drafts of this article. Research on which this article is based was funded by the European Research Council (ERC Advanced Grant REFINE No. 322576) and the Swiss Federal Food Safety and Veterinary Office (FSVO Grant No. 2.13.01).

Author information

Division of Animal Welfare, Veterinary Public Health Institute, Vetsuisse Faculty, University of Bern, Länggassstrasse 120, 3012 Bern, Switzerland

Hanno Würbel

Corresponding author: Hanno Würbel.

Ethics declarations

Competing interests: The author declares no competing financial interests.

Cite this article: Würbel, H. More than 3Rs: the importance of scientific validity for harm-benefit analysis of animal research. Lab Anim 46, 164–166 (2017). https://doi.org/10.1038/laban.1220


Facial Expression Recognition for Probing Students’ Emotional Engagement in Science Learning

Open access | Published: 14 August 2024

Xiaoyu Tang, Yayun Gong, Yang Xiao, Jianwen Xiong & Lei Bao (ORCID: orcid.org/0000-0003-3348-4198)

Student engagement in the science classroom is an essential element for delivering effective instruction. However, the popular method for measuring students’ emotional learning engagement (ELE) relies on self-reporting, which has been criticized for possible bias and for lacking the fine-grained time resolution needed to track the effects of short-term learning interactions. Recent research suggests that students’ facial expressions may serve as an external representation of their emotions in learning. Accordingly, this study proposes a machine learning method to efficiently measure students’ ELE in real classrooms. Specifically, a facial expression recognition system based on a multiscale perception network (MP-FERS) was developed by combining the pleasure-displeasure, arousal-nonarousal, and dominance-submissiveness (PAD) emotion model. Data were collected from videos of six physics lessons with 108 students. Meanwhile, students’ academic records and self-reported learning engagement were also collected. The results show that students’ ELE measured by MP-FERS was a significant predictor of academic achievement and a better indicator of true learning status than self-reported ELE. Furthermore, MP-FERS can provide fine-grained time resolution for tracking changes in students’ ELE in response to different teaching environments, such as teacher-centered or student-centered classroom activities. The results of this study demonstrate the validity and utility of MP-FERS in studying students’ emotional learning engagement.


Introduction

As an essential indicator of the impact of reforms in science education, effective teaching has received much attention from the science education community. Currently, explanations of effective teaching can be classified into three intertwined orientations: the importance of teachers’ behavioral characteristics in encouraging and facilitating student learning (Blomeke et al., 2022 ; Joshi & Bhaskar, 2022 ); the importance of meeting students’ needs and interests to engage them in the classroom and achieve positive learning outcomes (Elias et al., 2023 ; Kennedy, 2016 ); and the effectiveness of teacher–student interactions (Sun et al., 2022 ; Xintong et al., 2022 ). Regardless of the orientation, the goal is to promote effective learning. The first two orientations prioritize external conditions, while the third focuses on internal psychological conditions. Consequently, effective learning requires students to be intrinsically motivated to construct knowledge and transfer their learning, supported by appropriate guidance and learning activities that promote positive emotional experiences.

The different emphases of these three teaching orientations have led to several notable trends in current classroom teaching styles: (1) teacher-centered teaching, which emphasizes teacher-directed behaviors and verbal expressions to promote students’ understanding (Kateřina, 2019 ); (2) student-centered teaching, which emphasizes the active construction of knowledge through students’ hands-on participation in learning activities (Eva & Kathleen, 2023 ); and (3) interactive teaching, which emphasizes the interaction between teacher guidance and student activities (Howe et al., 2019 ) and promotes the construction of knowledge through both subjects’ joint efforts. Teachers’ behavioral characteristics in different teaching styles directly impact students’ emotional learning engagement and subsequently influence their learning outcomes. For example, the extended use of the lecture mode may lead to student boredom and loss of attention from the classroom. Emotional learning engagement constitutes students’ response to the teacher’s teaching behavior and serves not only as a prerequisite for learning outcomes but also as an important indicator of the effectiveness of teaching itself.

Current educational trends increasingly emphasize the role of emotional learning engagement in teaching and learning. For example, in a study on students’ learning effectiveness and core literacy in the UK, measuring students’ enjoyment of learning was the focus (Office for Standards in Education, 2010 ). Canada’s education policy promotes a focus on students’ interests and performances in learning (Ontario, 2014 ). Moreover, studies from China suggest that teachers should evaluate students’ learning through their emotional performance, such as their interest and participation in daily learning activities (China, 2022 ). These results show that using students’ emotional engagement as a factor for evaluating the effectiveness of classroom teaching has become a significant trend in education assessment. Psychological studies have also shown that positive emotions, such as concentration and happiness, can promote students’ learning efficiency. In contrast, negative emotions, such as boredom and anxiety, can decrease intellectual development (Sun et al., 2015 ). Moreover, it has been found in numerous studies that classroom emotion and interest are key factors in science learning retention (Prescod et al., 2018 ; Sadler et al., 2012 ; Schelfhout et al., 2021 ). Therefore, importance needs to be attributed to students’ emotional changes in the classroom, as such changes can constantly remind teachers to make timely adjustments to their teaching strategies to keep students actively engaged in science learning.

In the literature, the commonly used measures of student engagement are self-reports and structured observations, both of which can be biased when measuring implicit emotional engagement. Structured observations are usually designed with a specific observation plan and purpose, which restricts the measurement to a single perspective, makes it subjective, and limits the information captured about students’ emotional experiences (Mikeska et al., 2019 ). Moreover, although self-reports can reveal students’ psychological feelings more directly, they are often widely inaccurate in the emotional dimension, especially in lower grades (Ben-Eliyahu et al., 2018 ). This uncertainty may be exacerbated by the time lag in measurement and the influence of social expectations. In addition, analyzing data from structured observations and self-reports is labor intensive and can be subjectively biased. Therefore, finding a machine-based objective measuring method could be invaluable for advancing research in this area. The recent emergence of facial expression recognition technology has led to the development of a promising approach (Liu et al., 2021 ) that can automatically perform feature extraction and expression recognition on facial images (Mollahosseini et al., 2016 ). Motivated by recent work, this study develops a multiscale perception facial expression recognition system (MP-FERS) for measuring students’ emotional learning engagement and validates the measurement outcomes of the MP-FERS using a mixed research approach.

Literature Review

Student Engagement as a Multidimensional Construct

Many studies have demonstrated that student engagement is a multidimensional construct and can significantly impact students’ academic achievement (Engels et al., 2021 ; Muenks et al., 2017 ). In general, student engagement includes behavioral, emotional, and cognitive engagement (Fredricks et al., 2004 ). Behavioral engagement reflects aspects of student learning behaviors, including answering questions, participating in class discussions, and completing expected tasks (Sedova et al., 2019 ; Wang et al., 2011 ). Cognitive engagement is not directly observable and reflects the degree to which students think about or focus on learning activities (Greene, 2015 ). This engagement is mainly involved in the mental effort and cognitive strategies used for understanding knowledge (Connell & Wellborn, 1991 ; Newmann, 1992 ; Olivier et al., 2021 ).

Compared to the first two types of engagement, emotional engagement is an implicit psychological state. Since this study focuses on students’ emotional engagement in the classroom, it is further defined as emotional learning engagement (ELE), which specifically refers to students’ emotional responses to the learning process and classroom environment, including interest and belonging (Connell & Wellborn, 1991 ; Skinner & Belmont, 1993 ). As part of the dyadic interaction between a learner and a learning activity, ELE is present throughout the learning process (Ben-Eliyahu et al., 2018 ). ELE aligns well with research on flow, which refers to learners feeling positive emotions, losing time, and becoming fully immersed during a learning activity (Nakamura & Csikszentmihalyi, 2014 ). Flow is often used to describe high-quality ELE. Both situational interests caused by the specific characteristics of classroom activities and personal interests aroused by the willingness to undertake challenging tasks can equip students with a mindset directed toward classroom tasks (Renninger et al., 1994 ), which constitutes the psychological basis of classroom experiences. Numerous studies have demonstrated the importance of emotions. Positively activated emotions (e.g., joy, anticipation) may lead to higher behavioral and cognitive engagement (Linnenbrink, 2007 ; Pekrun et al., 2009 ), while negatively activated emotions (e.g., confusion) may lead to more in-depth inquiry about the learning materials (D’Mello & Graesser, 2012 ). The deactivations of emotions (e.g., fatigue) reflect a negative state of being absent from classroom activities, which can lead to a lack of psychological affiliation, causing behavioral and cognitive engagement burnout (Linnenbrink, 2007 ; Pekrun et al., 2002 ). Notably, deactivated neutral emotions (e.g., boredom) reflect a tendency to detach from ongoing activity and cannot be ignored when modeling learning engagement (Ben-Eliyahu et al., 2018 ).

In summary, the literature has demonstrated that ELE can play a vital role in student engagement by influencing behavioral and cognitive engagement from a psychological perspective, which focuses on emotional dimensions such as interest, pleasure, and enjoyment. Thus, this study focuses on measuring ELE and its interactions with the learning environment.

The Influence of ELE on Academic Achievement

Emotions are ubiquitous in academic settings (e.g., emotions such as enjoyment, anger, anxiety, and boredom that arise during the learning process), and they can profoundly impact students’ academic engagement and performance (Pekrun & Linnenbrink-Garcia, 2012 ). Evidence shows that negative emotions such as anger and sadness are negatively associated with achievement (Hernández et al., 2016 ), while positive emotions such as enjoyment are positively associated with achievement (Pekrun & Linnenbrink-Garcia, 2012 ). Recent studies have found a feedback loop between emotion and achievement over time (Pekrun et al., 2017 ; Putwain et al., 2018 , 2022 ). For example, higher enjoyment and lower boredom predict greater subsequent achievement, and, in turn, greater academic achievement predicts subsequent greater enjoyment and lower boredom. This suggests that emotion and academic achievement are consistently and tightly intertwined. These empirical studies revealed the feasibility of using student engagement as an indicator of effective teaching (Reinhold et al., 2020 ).

ELE can have complex and extensive influences on academic achievement because it provides a critical psychological foundation for learning. ELE can influence both behavioral and cognitive engagement, which can further influence academic achievement (Geertshuis, 2019 ; Liu et al., 2022a , 2022b ). For example, students with positive emotions are more likely to devote time and energy to learning and can prevent themselves from possible academic burnout (González-Romá et al., 2006 ). Thus, these students may show more sustained behavioral engagement and are more likely to deal effectively with learning difficulties (Wang & Eccles, 2012 ). The control-value theory explains that emotions determine the use of cognitive resources and learning strategies, as well as motivation, to influence achievement (Meinhardt & Pekrun, 2003 ; Pekrun, 2006 ). Positive emotions (e.g., enjoyment) retain cognitive resources, increase interest and motivation, and promote flexible and deep learning strategies, leading to a better likelihood for academic success. Conversely, negative emotions (anger, sadness) may induce irrelevant thinking, reduce cognitive resources, disrupt attentional focus, and prevent the systematic use of deep learning strategies, all of which are detrimental factors to academic progress (Kuhbandner et al., 2010 ).

Overall, student engagement is dynamically related internally, with emotion being the most foundational factor in academic achievement. ELE influences the formation of motivation, interest, and attention, thus promoting sustainable behavioral engagement (Yang et al., 2021 ). Behavioral engagement, in turn, influences cognitive engagement through different learning modes, ultimately affecting learning outcomes (Chi & Wylie, 2014 ; Yang et al., 2021 ). These studies have provided strong evidence demonstrating the predictive role of emotional engagement in academic achievement, which provides the theoretical and experimental basis for using the MP-FERS to measure student engagement in this study.

The Influence of Teaching Style on ELE

Numerous studies have demonstrated that teaching style affects students’ learning interest and enjoyment (Kang & Keinonen, 2018 ). Emotion, as a result of the classroom context, mediates learning outcomes that reflect teaching effectiveness (Schukajlow & Rakoczy, 2016 ). As science education shifts from teacher-centered to student-centered, many studies note that student-centered instruction is more likely to increase students’ affective interest than traditional teacher-centered instruction (Alimoglu et al., 2017 ; Renninger & Bachrach, 2015 ; Trobst et al., 2016 ). For example, cooperative learning (Sibomana et al., 2021 ), game-based approaches (North et al., 2021 ), and problem-solving approaches (Taub et al., 2020 ) can increase students’ enjoyment and positive attitudes toward learning. A cooperative learning environment stimulates student interaction and significantly increases positive emotions (Martínez-Sierra, 2014 ). Conversely, game-based instruction conforms to students’ instincts, thereby increasing their enjoyment of learning and confidence in success (Battersby et al., 2020 ; Byusa et al., 2022 ). Further evidence of the facilitative effects of student-centered instruction on student engagement was provided in a mixed study that investigated the perceptions of engagement factors among middle and high school students who varied in their level of science engagement. The researchers found that student-centered instruction significantly influenced ELE, motivational beliefs, and social support (Fredricks et al., 2018 ).

Summarizing the literature, the positive effect of student-centered instruction on students’ ELE is relatively consistent across ages (Areepattamannil, 2012 ), grades, and subjects (Capar & Tarim, 2015 ). Emotional interest, as a strong predictor of the science learning retention rate, cannot be ignored, and teaching style is an essential factor. This study will further investigate this relationship by using the MP-FERS, which will also provide evidence for the validity of the measurement using this system.

Measuring ELE by Facial Expression Recognition

Since the abovementioned three types of engagement differ in their degrees of externalization, they are commonly measured by teacher observations and student self-reports (Ben-Eliyahu et al., 2018 ; Fredricks et al., 2004 ). External behavioral engagement is often measured through teacher observations (Bakker et al., 2015 ; Guo et al., 2014 ), while cognitive engagement is measured through student self-reports or work samples such as standardized tests and student work (Bakker et al., 2015 ). However, measuring ELE is particularly difficult because emotions are implicit. Self-reports are currently the most popular method for measuring emotions. Such measurements rely on behavioral indicators of ELE, which are often difficult for younger students to use to discriminate between the different types of engagement items (Fredricks et al., 2004 ). In addition, self-reports are often used as a one-time metric and lack the temporal resolution for tracking ELE variations over time and in connection with specific learning contexts (Park et al., 2012 ).

To examine classroom effectiveness, we choose to measure ELE, which is influenced by the classroom environment and is characterized by student emotions resulting from classroom elements such as learning content and activities. The traditional measure of ELE usually collects students’ subjective feelings and self-evaluations after a course, requiring students to recall the class content and their mental states. This method is both abrupt and subjective (Henrie et al., 2015 ), resulting in a measurement of student engagement that may deviate significantly from the real engagement level due to the delay. In addition, students may conceal their low level of engagement due to the influence of social expectations, leading to inaccurate measurements.

The 2017 Horizon Report (Freeman et al., 2017 ) suggested that classroom measurement should focus on using measuring tools to track, analyze, and reflect the learning data from student classroom engagement. With the advancement of technology, information technology can be used to measure student engagement.

Mehrabian and Russell’s ( 1974 ) research showed that emotional expression consists of 7% words, 38% voice, and 55% facial expressions, which indicates that facial expressions can be an essential avenue for measuring emotion. Although the emotional intensity of facial expressions may vary due to cultural differences (Tsai et al., 2019 ), the basic categories of emotions expressed are broadly consistent (Anthony & Nicolas, 2021 ). As a result, facial expression recognition (FER) techniques are now widely used. Currently, the information expressed by students’ facial expressions is associated with their learning emotions, and FER is gradually being applied in teaching environments. Studies have suggested that students’ expressions reflect their cognition. Wang et al. ( 2014 ) studied puzzlement detection using FER. Liaw et al. ( 2021 ) analyzed changes in students’ facial expressions and found a significant relationship with conflict-induced conceptual change, which is valuable for predicting students’ learning outcomes. Several studies have also attempted to analyze students’ emotions with facial expressions. Chen et al. ( 2015 ) built detectors of confusion, engagement, and frustration with features extracted from FER. Zhu and Chen ( 2019 ) constructed a database of students’ spontaneous facial expressions and applied it to evaluate emotions during e-learning. Most of these studies have classified emotions based on extracted feature information. Recent research has further quantified emotions as classroom status indicators. For example, Pei and Shan ( 2019 ) generated students’ concentration scores via FER. Shen et al. ( 2021 ) constructed an engagement equation based on four emotions, namely, neutral, understanding, disgust, and doubt, to generate students’ classroom engagement scores. The findings of these studies suggest the feasibility of developing the MP-FERS for classroom evaluation.

Although the aforementioned studies achieved acceptable identification accuracy and precision, the quantitative criteria lack theoretical support. Few have explored the strength of the link between engagement measured by computer vision and actual engagement (for example, comparing FER measurements with teacher observations or self-reports). Furthermore, discussions of how FER measurement results work for teaching feedback are lacking (Vanneste et al., 2021 ). This study attempts to address these limitations by exploring two research questions, which are discussed next.

In this study, the PAD emotional state model proposed by psychologists, such as Mehrabian ( 1995 ), was used for quantifying ELE through facial recognition. The PAD model describes emotional states through pleasure, arousal, and dominance and uses continuous sampling and recognition of facial expressions to systematically quantify emotion as ELE in real time. It has been demonstrated that almost all the reliable variance in the other 42 emotional response scales can be explained by the PAD emotional state model (Mehrabian, 1996 ). Moreover, the PAD model remains valid for facial expression analysis (Cao et al., 2008 ; Jia et al., 2014 ). Since each emotion has a set of PAD values, the PAD emotion space region can effectively characterize the learner’s emotional state. Gilroy et al. ( 2009 ) established a correlation between flow and emotional state measures through PAD values. As discussed earlier, the flow state represents high-quality ELE, which supports the use of the PAD model as the method for quantifying ELE.

Research Questions

This research proposes a method for measuring ELE using the MP-FERS and discusses its effectiveness in science classrooms. Specifically, this research aims to answer the following two research questions:

To what extent can MP-FERS produce valid and reliable measures of students’ ELE in a real classroom setting?

How may students’ ELE, as measured by the MP-FERS, vary with different teaching activities?

Research Methodology

As discussed previously, measuring ELE is essential but challenging. Using primarily quantitative methods to investigate ELE involves limitations. Therefore, mixed methods were used in this study (Creswell and Clark, 2011 ). Mixed methods are suitable for problems where quantitative or qualitative methods are insufficient for developing a comprehensive understanding (Greene, 2007 ). This study collected adequate data from multiple sources to better explore the feasibility of using the MP-FERS to measure ELE. With the types of data and analysis methods used, a mixed-method approach was needed for this study.

Multiscale Perception Facial Expression Recognition System (MP-FERS) for ELE Measurement

Facial expression can provide an essential basis for determining students’ ELE. However, teachers have limited capacity to capture changes in each student’s facial expressions over time. Accordingly, this research designed a machine learning-based MP-FERS to measure ELE in real time.

The Organizational Structure of MP-FERS

The organizational structure of the MP-FERS is shown in Fig. 1. First, an HD camera collects real-time videos of students' facial expressions, which are streamed into the sentiment analysis module. Then, the sentiment analysis module predicts the emotions of students' facial images extracted from the video. Finally, the various emotions identified are further analyzed through the engagement measurement module and quantified based on the pleasure-displeasure, arousal-nonarousal, and dominance-submissiveness (PAD) emotion model proposed by Mehrabian (1995). The ELE values are then calculated through an equation to generate a change curve on the display terminal. The continuous sampling and recognition of facial expressions facilitate a more precise capture of students' ELE caused by changes in the classroom environment. Essentially, the MP-FERS primarily measures situational emotions that reflect the effectiveness of instruction.

Fig. 1 The organizational structure of the MP-FERS, comprising a camera, a sentiment analysis module, and an engagement measurement module. The MP-FERS is deployed on an edge computing box with an input image size of 640 × 640, running at a frame rate of 5 fps
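To make the pipeline concrete, the following is a minimal sketch of the capture-and-score loop described above. The `emotion_model` (a 7-way emotion classifier) and `compute_ele` (PAD-based scoring, Eq. 1) callables are hypothetical stand-ins for the paper's sentiment analysis and engagement measurement modules; only the OpenCV-based capture, the 640 × 640 input size, and the roughly 5 fps sampling rate come from the text.

```python
# A minimal sketch of the MP-FERS capture-and-score loop (not the authors'
# implementation). emotion_model and compute_ele are hypothetical callables.
import cv2

def run_mp_fers(video_source, emotion_model, compute_ele, target_fps=5):
    face_detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_source)
    step = max(int(cap.get(cv2.CAP_PROP_FPS) or target_fps) // target_fps, 1)
    frame_idx = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:                       # subsample to ~5 fps
            frame = cv2.resize(frame, (640, 640))       # MP-FERS input size
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.1, 5):
                probs = emotion_model(frame[y:y + h, x:x + w])  # 7 emotion probs
                yield frame_idx, compute_ele(probs)     # one ELE value per face
        frame_idx += 1
    cap.release()
```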

Sentiment Analysis Model of MP-FERS

Facial expressions are among the most potent, natural, and universal signals that humans use to convey emotions and intentions. The sentiment analysis model used in this study consists of two steps: (1) facial image preprocessing and (2) facial expression recognition for sentiment classification. Preprocessing is performed mainly by using an HD camera to acquire students' classroom images and then using the OpenCV toolbox to extract students' facial images. In the classroom environment, facial occlusion and pose changes can reduce the amount of complete facial feature information available to FER, resulting in a small recognizable range and low accuracy. To address this problem, this study employs a vision transformer (ViT)-based (Dosovitskiy et al., 2020) multiscale local and global perception network (MLGPN), which learns the local and global representations of expressions and the relationships between representations at multiple levels, reducing interference from occlusion and pose changes. Its overall network structure is shown in Fig. 2. First, the multiscale local perception unit embeds a channel attention module that guides the network to learn global and local salient features of expression images at different scales. Then, the expression features with multiscale information are analyzed by the global perception unit, composed of the ViT architecture, which produces channel and spatial information about the features and adaptively models the global dependencies of expression images in different dimensions. Finally, the hierarchical stacking of multiscale local and global perception (MLGP) blocks, composed of these two types of perception units, effectively reduces the influence of pose changes and occlusions.

Fig. 2 The overall architecture of the proposed MP-FERS. The input image is first processed by a backbone network based on a convolutional neural network (CNN) to obtain a 256 × 14 × 14 feature map. The feature map is then passed through the stacked MLGP block layers, sequentially passing through multiscale local attention units and global perception units. A classification head module is connected after the output of the last MLGP block; it consists of fully connected layers for dimensionality reduction, batch normalization, GELU, and other modules. The output sequence is turned into a vector of expression polarity probability distributions, and the expression categories are finally obtained via softmax normalization
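A hedged PyTorch sketch of one MLGP block, following the description in Fig. 2, is shown below. The branch kernel sizes, the squeeze-and-excite form of the channel attention, and the head count are assumptions; the published MLGPN configuration may differ.

```python
# A sketch of one multiscale local and global perception (MLGP) block.
# Kernel sizes, the channel-attention design, and the head count are
# assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn

class MultiscaleLocalUnit(nn.Module):
    """Depthwise convolutions at several scales, fused and reweighted by
    channel attention (the 'multiscale local perception unit')."""
    def __init__(self, dim=256, scales=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in scales)
        self.fuse = nn.Conv2d(dim * len(scales), dim, 1)
        self.attn = nn.Sequential(            # channel attention over fused map
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // 4, 1), nn.GELU(),
            nn.Conv2d(dim // 4, dim, 1), nn.Sigmoid())

    def forward(self, x):
        y = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return x + y * self.attn(y)           # residual, attention-weighted

class GlobalPerceptionUnit(nn.Module):
    """ViT-style self-attention over the 14 x 14 feature grid
    (the 'global perception unit')."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)      # (B, H*W, C) token sequence
        q = self.norm(t)
        t = t + self.mhsa(q, q, q)[0]         # global dependency modeling
        return t.transpose(1, 2).reshape(b, c, h, w)

class MLGPBlock(nn.Module):
    """Local unit followed by global unit; blocks are stacked hierarchically."""
    def __init__(self, dim=256):
        super().__init__()
        self.local = MultiscaleLocalUnit(dim)
        self.globl = GlobalPerceptionUnit(dim)

    def forward(self, x):
        return self.globl(self.local(x))

# e.g., the backbone's 256 x 14 x 14 feature map passes through stacked blocks:
# x = torch.randn(1, 256, 14, 14)
# y = nn.Sequential(*[MLGPBlock() for _ in range(4)])(x)
```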

To validate the robustness and generalizability of the model, it was evaluated on three large-scale in-the-wild datasets widely used in FER research; the results are shown in Table 1. FERPlus contains 28,709 training images and 3,589 test images. The RAF-DB dataset is a representative in-the-wild FER dataset with 12,271 training images and 3,068 testing images. AffectNet is the largest in-the-wild dataset for the FER task, with 283,901 training samples and 3,500 test samples. As shown in Table 1, the MP-FERS method achieved the best accuracy on all three datasets compared to the other methods. These experimental results demonstrate that the MP-FERS can learn both local nuances and region-global relationships of expressions, effectively reducing the effects of occlusion and pose changes.

Designing the Measurement Model of ELE

Quantifying emotion scales to produce an engagement index is one of the challenges faced by researchers in the field of emotion measurement. By reviewing studies on ELE, we found that each level of engagement is either associated with or directly explained by emotion. For example, high engagement is expressed as excitement, happiness, and surprise, while disengagement is expressed as tiredness, boredom, and sadness (Altuwairqi et al., 2021 ; D’Mello & Graesser, 2012 ). The definition of engagement categories is consistent with the description of multidimensional emotions. Hence, this study associates emotion categories with ELE.

Next, the predicted emotional labels of facial expressions are quantified through the PAD emotion model. The PAD model conceptualizes emotion along three dimensions: pleasure, arousal, and dominance (Mehrabian, 1995), which correspond to the positive or negative character of an emotion, the level of neurophysiological activation, and the individual's sense of control over the situation or others, respectively. Unlike discrete emotion models and other dimensional emotion models (Arent, 2005; Wundt, 1980), the PAD emotion model describes subjective experiences and maps their relations to external performance and physiological arousal; it is therefore considered more appropriate for describing students' ELE under the influence of classroom situations (Jia et al., 2014). In expression recognition, human expressions are classified into seven main types: happy, angry, disgusted, fearful, sad, surprised, and neutral. Table 2 shows the mapping values of these seven emotions in the three dimensions, as proposed by the Institute of Psychology, Chinese Academy of Sciences.

Although the PAD values are informative, they are not by themselves a single engagement measure. To address this, we created a single measure based on the PAD values to represent the overall ELE of an individual or the whole class, as described in Eq. (1). Since the contribution of the three dimensions to ELE varies across subjects and teaching forms, the weight values of each dimension \((\alpha, \beta, \gamma)\) change continually under the attention mechanism. The subscript \(j\) indexes the seven emotion categories, \((P_j, A_j, D_j)\) are the mapping values of emotion \(j\) in the PAD model (Table 2), and \(p_j\) is the predicted probability of emotion \(j\); thus \(\sum_{j=1}^{7} P_j p_j\), \(\sum_{j=1}^{7} A_j p_j\), and \(\sum_{j=1}^{7} D_j p_j\) represent the pleasure, arousal, and dominance values, respectively. During the pretraining of the evaluation model, the equation was established by adjusting the parameters repeatedly so that the machine scores would be as close as possible to the manual scores.
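Written out from these definitions, Eq. (1) plausibly takes the following form (a reconstruction, not the verbatim published equation; the normalization may differ):

```latex
% Reconstruction of Eq. (1) from the surrounding definitions (an assumption,
% not the verbatim published equation): a weighted sum of the expected
% pleasure, arousal, and dominance values over the seven emotion categories.
\mathrm{ELE} \;=\; \alpha \sum_{j=1}^{7} P_j\, p_j
\;+\; \beta \sum_{j=1}^{7} A_j\, p_j
\;+\; \gamma \sum_{j=1}^{7} D_j\, p_j
\qquad (1)
```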

Equation (1) yields the ELE at each measured time point. To evaluate individual students and the whole class, we define \(E_k\) in Eq. (2) as the average ELE of the \(k\)-th student, which reflects the effect of classroom activities on stimulating that student's interest in learning. Combining the \(E_k\) values of all students yields the ELE of the whole class, as shown in Eq. (3). In these equations, \(m\) denotes the total number of time points measured during the class, and \(n\) represents the number of students.
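Reconstructed from the prose definitions in the same hedged spirit (the exact published notation may differ), Eqs. (2) and (3) can be written as:

```latex
% Reconstructions of Eqs. (2) and (3) from the prose (assumptions, not the
% verbatim published equations): a per-student time average, then a class
% mean. ELE_{k,t} is the Eq.-(1) value of student k at time point t.
E_k \;=\; \frac{1}{m}\sum_{t=1}^{m} \mathrm{ELE}_{k,t} \qquad (2)

E_{\mathrm{class}} \;=\; \frac{1}{n}\sum_{k=1}^{n} E_k \qquad (3)
```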

Procedure and Analysis

As discussed in the review, research on facial emotion recognition tools generally lacks validity confirmation. Validity for the MP-FERS is understood in the same sense as validity in social science research. Therefore, this study evaluated the validity of the MP-FERS by comparing its outcomes with other measures, including self-reports of engagement and students' academic performance. The overall framework of measures and comparisons is shown in Fig. 3.

Fig. 3 The mixed-methods approach used to validate the MP-FERS: a triangulated validation model in which evidence from self-reports, academic achievement, and teaching style each corroborates the MP-FERS results from a different dimension

As discussed earlier, objective evidence of ELE should come from students’ behaviors in real classroom situations, i.e., the measurement using MP-FERS, which is the core of this study. Due to the greater specificity of the MP-FERS data compared to those collected by traditional test instruments such as questionnaires, we employed criterion-related validity (Cohen et al., 2007 ). Criterion validity reflects the degree to which a measurement instrument is valid for measuring or predicting an individual’s performance in a given context. It is critical to find suitable criteria that reflect students’ ELE. In this study, two criteria were selected for validation.

First, self-reports are still regarded as the standard method because they produce perceptive and subjective evidence directly from subjects (Fredricks et al., 2004 ). In this study, students’ self-reported engagement was measured within the same time frame as that of the MP-FERS, which can serve as a criterion to validate concurrent validity.

In addition, evidence of related factors cannot be ignored. As discussed in the literature review, students’ ELE is strongly influenced by teaching style and is an effective predictor of academic achievement. For this reason, the degree of student-centeredness in the classroom and student academic achievement are also selected as related criteria to validate the predictive validity of the scale.

By analyzing and comparing the three interrelated groups of measures from multiple data sources, mixed methods facilitate the triangulation of the data analysis to establish the validity and reliability of the MP-FERS measurement outcomes (Delahunty et al., 2018 ). Since perceptive and subjective evidence strongly predict related evidence, if the ELE measured by the MP-FERS matches subjective evidence and predicts related evidence equally well, a high level of validity of measurements should be demonstrated using the MP-FERS.

Participants and Context

The study was conducted with 118 eighth- and eleventh-grade students from two middle schools in Guangdong Province, China. These students followed a science-based curriculum track that included courses in physics, chemistry, and biology. Ten participants (8.5%) did not complete the final test, a relatively low attrition rate that is unlikely to have affected the subsequent analysis (Hair et al., 2010). Ultimately, 108 students completed the course and reported their engagement and academic achievement. The sample included 62 boys (57%) and 46 girls (43%).

Study Design and Procedure

This study was conducted in physics classes. We recorded the participants’ classroom facial expressions for six 40-min-long lessons. To maximize the ability to measure authentic emotional engagement, the researchers administered self-report questionnaires immediately after each lesson was completed. To measure students’ academic achievement, their final exam scores were obtained from the schools. Finally, three researchers majoring in science education analyzed the teaching clips, which were categorized into three teaching styles.

Before the study, students and teachers voluntarily agreed to participate and to provide us with all of the requested personal data. The participants were informed that sensitive facial information would not be retained and that all the data would be kept confidential. Official informed consent was obtained following the requirements and policies of the schools and the local ethics committees.

Additional Measurements and Analysis

In addition to the MP-FERS measurement of students' ELE, several other measures were used in this study: a questionnaire survey for self-reported ELE, course grades for academic achievement, and classroom video analysis for teaching style.

Self-Reported ELE

To measure self-reported ELE, the Science Learning Engagement Scale (SLES) developed by Ben-Eliyahu et al. (2018) was used. The SLES is a reliable and valid instrument for measuring student engagement and covers emotional, behavioral, and cognitive engagement. The original scale measures student engagement in both formal and informal learning; in this study, only the items for the formal science classroom context (17 items) were retained. The complete scale is shown in the supplementary material. The three reverse-structured questions (Q3, Q4, Q5) included in the scale were reverse-coded before the data were analyzed, so that a higher score always indicates a higher level of engagement.
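As a small illustration, the reverse-coding step can be done in one operation; the sketch below assumes a 5-point Likert response format, which the text does not specify.

```python
# A minimal sketch of the reverse-coding step. The 5-point scale (SCALE_MAX)
# is an assumption; only the reversed items Q3-Q5 come from the text.
import pandas as pd

SCALE_MAX = 5                        # assumed 5-point Likert scale
REVERSED_ITEMS = ["Q3", "Q4", "Q5"]  # reverse-structured items named above

def reverse_code(responses: pd.DataFrame) -> pd.DataFrame:
    """Flip reversed items so higher scores always mean higher engagement."""
    coded = responses.copy()
    coded[REVERSED_ITEMS] = (SCALE_MAX + 1) - coded[REVERSED_ITEMS]
    return coded
```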

Academic Achievement

In this study, academic achievement was evaluated using students' final physics test scores. The questions were assigned by the local municipal education authorities and developed by a panel of senior teachers after two rounds of review. The tests included multiple-choice, fill-in-the-blank (short answer), and computational show-work questions, with a total possible score of 100. Since ELE was measured in only six lessons during the semester, only the scores on questions corresponding to the content areas of those six lessons were used for student achievement, and each student's score was normalized.

Teaching Style

Teaching styles were categorized using the S-T interaction analysis scale, a standard method for quantitatively analyzing classroom instruction that captures the distribution of teacher behavior (T) and student behavior (S) based on a classroom observation framework (Kaiyue et al., 2021). This method distinguishes student-centered from teacher-centered instruction by analyzing teacher occupancy (Rt) at each classroom stage (Fu & Zhang, 2001). Following Li et al. (2021), Rt > 0.7 indicates the lecture type, defined as teacher-centered in this study; Rt < 0.3 indicates the practice type, defined as student-centered; and 0.3 < Rt < 0.7 indicates the interactive type. The coding method was based on educational information processing technology (Fu & Zhang, 2001). A period of class time was coded as teacher behavior (T) if it was dominated by the teacher through explanation, demonstration, media display, questioning, or evaluation; it was coded as student behavior (S) if it was dominated by students through speaking, reading, thinking, discussing, experimenting, or notetaking. According to previous S-T analysis studies (Dong & Ke, 2015; Liu et al., 2014), the minimum teaching session length is typically 2 min, while the duration of a behavior is typically less than 3 min. Therefore, the classroom videos were divided into 2- to 3-min teaching clips, which were coded based on the teacher and student behaviors above to determine the teaching styles.
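The occupancy computation and thresholds above translate directly into code; a minimal sketch follows (the behavior codes and thresholds are as described, while the function name is ours):

```python
# A sketch of the S-T occupancy computation. Input: one clip's sequence of
# behavior codes ('T' or 'S') sampled at fixed intervals; thresholds follow
# Li et al. (2021) as cited above.
def classify_teaching_style(codes: list[str]) -> str:
    """Classify a clip by teacher occupancy Rt = (# 'T' codes) / (total codes)."""
    rt = codes.count("T") / len(codes)
    if rt > 0.7:
        return "teacher-centered"   # lecture type
    if rt < 0.3:
        return "student-centered"   # practice type
    return "interactive"

print(classify_teaching_style(list("TTTSTTTTST")))  # Rt = 0.8 -> teacher-centered
```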

Correlation Analysis for Criterion Validity

To establish the criterion validity of the MP-FERS, ELE scores measured by the MP-FERS were compared with self-reported data, and correlations between ELE scores and students' physics test scores were also analyzed. Because the MP-FERS and self-reports are entirely different instruments, the absolute scales of their results are not directly comparable, but correlations can be used to compare their variances. Moreover, since ELE is believed to be a strong predictor of academic performance, analyzing the correlations among ELE measures and student academic performance can provide useful evidence for establishing the validity of the MP-FERS. Descriptive details of the dataset are included in the supplementary materials. The correlations are provided in Table 3 and discussed next.
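Such criterion correlations can be computed with standard routines; a sketch under hypothetical variable names:

```python
# A sketch of the criterion correlations reported in Table 3, using SciPy's
# Pearson correlation; the array names are hypothetical.
from scipy.stats import pearsonr

def criterion_correlations(mp_fers_ele, self_report_ele, physics_scores):
    """Return (r, p) for the concurrent- and predictive-validity checks."""
    return {
        "MP-FERS vs. self-report": pearsonr(mp_fers_ele, self_report_ele),
        "MP-FERS vs. achievement": pearsonr(mp_fers_ele, physics_scores),
        "self-report vs. achievement": pearsonr(self_report_ele, physics_scores),
    }
```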

Self-reports are commonly used as subjective measures of ELE; therefore, we first compared the correlation between ELE scores from the MP-FERS and self-reports. The correlation matrix showed that the ELE scores from the MP-FERS and self-reports were positively and significantly correlated (r = 0.496, p < 0.01; R² = 0.25). This result suggested that the MP-FERS has appropriate concurrent validity with students' self-reported ELE scores. Accordingly, the ELE measured by the MP-FERS is a suitable indicator of students' emotional state in the classroom.

However, the correlation was in the medium range, indicating a moderate level of inconsistency between the subjective and objective measures. In the self-reports, many students reported close to perfect scores on the emotional engagement dimension, leading to a small variance in the measurement. Yet there were clear differences in students' learning behaviors and facial expressions in the videos, which indicates that students' self-reports of ELE are likely biased by self-presentation. For example, a significant fraction of students displayed expressions of boredom during instruction, as seen in the class video, but reported a high level of engagement. Therefore, we further analyzed the correlations between the ELE measures and academic achievement to determine which measure has more predictive value for learning: if the MP-FERS results correlate more strongly with science achievement than self-reports do, the MP-FERS score can be considered a better predictor of student performance. Two examples of student data are shown in the supplementary materials.

As shown in Table 3, the ELE obtained from the MP-FERS was strongly correlated with physics scores (r = 0.845, p < 0.01; R² = 0.71), a correlation much stronger than that between self-reported ELE and physics scores (r = 0.479, p < 0.01; R² = 0.23). This result suggested that the MP-FERS score was a better predictor of student performance than self-reports were, and it further reveals the weakness of self-reports, which are subjective and involve a high degree of uncertainty. For example, students may know what teachers expect them to answer and may not want their low level of engagement to be revealed; thus, self-reported ELE can be strongly biased by social expectations. Moreover, self-reports were obtained after the completion of an entire lesson, which means that students may ignore some neutral or mildly negative feelings and still feel good about themselves. These limitations of self-reports are consistent with similar concerns in the literature, as discussed in the literature review. Accordingly, we believe that the strong correlation between MP-FERS scores and students' physics test scores demonstrates a higher level of criterion validity for the MP-FERS than for self-reports.

Applications of MP-FERS to Informing Teaching

The unique advantage of the MP-FERS is that it can provide near real-time measures of ELE, which can be used to inform teaching practices. To explore how the MP-FERS can be applied in real classrooms, we analyzed the relationships between teaching style and students’ ELE measured by the MP-FERS. As indicated by related research, teaching style can significantly influence ELE (Fredricks et al., 2018 ); therefore, understanding how such influence manifests in a classroom can provide valuable information for teachers to adjust their teaching strategies so that appropriate ELE states can be maintained during the teaching process.

Student ELE in Different Teaching Styles

To examine the utility of the MP-FERS, we compared the differences between the measured ELE states of students in different teaching styles, including student-centered, interactive, and teacher-centered styles. If the student-centered ELE is significantly higher than the teacher-centered ELE, then we can consider the MP-FERS a practical tool for probing students’ emotional responses to classroom teaching and helping teachers improve teaching effectiveness.

In the analysis, we divided the collected classroom videos into 139 clips of 2–3 min each and calculated the average ELE of the whole class in each clip. The researchers also reviewed the teaching style of each clip based on teacher and student activities; 49 teacher-centered clips, 52 interactive clips, and 38 student-centered clips were identified. The ELE was measured with the MP-FERS using video images from the 65 students whose ELE could be identified in a teaching clip. Each student typically had 800–1,400 measured ELE data points in a lesson. All the students' data were aggregated to produce the average ELE in video clips of the three teaching styles. The results are shown in Fig. 4 as violin plots, which combine the features of the boxplot and density plot; the plots were generated with the ggpubr package in the statistical software R. Explanations of the types of information included in the plots are provided in the supplementary material.

Fig. 4 Emotional learning engagement (ELE) in the teacher-centered, interactive, and student-centered styles. The violin plots show the distributions of students' ELE in the three teaching styles, as well as the medians. The p-value from ANOVA was used to compare the differences among all three groups, while the p-values between any two group means were obtained with t-tests

As shown in Fig. 4, there was a significant difference in ELE scores across the three teaching styles (F(2,136) = 3.99, p = 0.021; η² = 0.055). The mean ELE score was highest for the student-centered style (M = 58.15, SD = 7.43) and lowest for the teacher-centered style (M = 53.41, SD = 6.95), while the ELE score for the interactive style (M = 55.59, SD = 8.67) fell between these two values. Independent t-tests indicated that the ELE was significantly greater in the student-centered clips than in the teacher-centered clips (t = 3.06, p = 0.003; Cohen's d = 0.66). However, the ELE in the interactive clips was not significantly different from that in either the student-centered clips (t = −1.47, p = 0.15; Cohen's d = 0.32) or the teacher-centered clips (t = 1.39, p = 0.17; Cohen's d = 0.28).
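For reference, comparisons of this kind can be reproduced with SciPy; a sketch, with `ele_by_style` as a hypothetical mapping from teaching style to the per-clip mean ELE values for that style:

```python
# A sketch of the omnibus and pairwise comparisons above (not the authors'
# analysis script); ele_by_style is a hypothetical variable.
from itertools import combinations
from scipy.stats import f_oneway, ttest_ind

def compare_styles(ele_by_style: dict[str, list[float]]):
    f_stat, p_omnibus = f_oneway(*ele_by_style.values())    # one-way ANOVA
    pairwise = {
        (a, b): ttest_ind(ele_by_style[a], ele_by_style[b])  # independent t-test
        for a, b in combinations(ele_by_style, 2)
    }
    return f_stat, p_omnibus, pairwise
```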

The results showed that students had higher ELE in teaching clips with a higher degree of student activity, which is consistent with the findings of previous studies (Fredricks et al., 2018 ; Renninger & Bachrach, 2015 ). Therefore, the MP-FERS measurement of ELE can be considered a convenient and viable tool for probing real-time ELE in teaching and learning.

ELE for Students at Different Academic Levels

To explore how different teaching styles may impact the emotional reactions of students at different academic levels, the students were sorted into two performance groups: a high-score group (top 50%) and a low-score group (bottom 50%). The two groups’ ELE scores for different teaching styles were analyzed and are shown in Table  4 .

One-way ANOVA revealed that ELE scores differed significantly among the three teaching styles for both the high-score group (F(2,696) = 9.57, p < 0.001, η² = 0.027) and the low-score group (F(2,609) = 13.96, p < 0.001, η² = 0.04). In both groups, ELE was highest in the student-centered style and lowest in the teacher-centered style (see Table 4), indicating that the effect of teaching style on students' emotional experience is relatively consistent across performance levels. The overall ELE was significantly higher in the high-score group than in the low-score group (p = 0.001). However, among the three teaching styles, the difference between the performance groups was significant only for the teacher-centered style (t = 2.79, p = 0.005; Cohen's d = 0.25); there were no significant differences between the two groups for either the interactive (t = 1.31, p = 0.190; Cohen's d = 0.13) or student-centered styles (t = 0.42, p = 0.672; Cohen's d = 0.05). This finding suggests that students' emotional responses to interactive and student-centered styles are similar across achievement levels.

Both interactive and student-centered teaching styles involve open-exploratory methods, which provide students with more opportunities to express themselves and engage in learning; an active classroom atmosphere allows most students to experience a sense of participation and thus present collective engagement. In the passive teacher-centered style, by contrast, higher-performing students often have a better chance of keeping up with the instructor's lectures than weaker students, leading to a more pronounced difference between their engagement levels. In addition, since teacher-centered lectures lack interaction, students' ELE depends heavily on their own intrinsic learning motivation and interest, which are significantly correlated with achievement and can therefore also lead to differences in ELE between performance groups. To summarize, the results suggest that designing appropriate interactive and student-centered activities can be an effective strategy for improving the ELE of the majority of students, regardless of their academic performance levels; such activities can therefore provide a more inclusive environment for teaching and learning.

Changes in Student’s ELE During a Lesson

One advantage of the MP-FERS over traditional self-reports is that measurements are performed automatically on class videos and produce near real-time outcomes. This makes it possible to analyze fine-grained teaching interactions, evaluate their effectiveness, and provide real-time feedback for improving the classroom environment toward better learning engagement. To examine whether the MP-FERS can provide such fine-grained ELE feedback for monitoring and improving teaching interactions, the temporal variation in MP-FERS-measured ELE in one lesson was analyzed. Since students' ELE varied greatly from lesson to lesson due to changes in content and teaching emphasis, we selected one lesson as an example to demonstrate the features and capacities of the MP-FERS for real-time measurement. The results are shown in Fig. 5 with a time resolution of 3–5 min. The outcomes were calculated with a moving average over a 5-min window throughout the 40-min class time. Each window was calculated from 9 students' data, with approximately 60–170 data points per student in each 5-min time frame; the total sample size for each calculated mean is thus in the range of 540–1,530, which makes the standard errors very small (error bars not shown). For the time frame of each plotted data point, the teaching style was determined based on the majority type of teaching activity.
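A minimal sketch of this 5-min moving-average smoothing is shown below; the `timestamp` and `ele` column names are hypothetical.

```python
# A sketch of the 5-min moving-average smoothing described above (not the
# authors' analysis script); column names are hypothetical.
import pandas as pd

def smooth_ele(samples: pd.DataFrame) -> pd.Series:
    """samples: one student's rows with 'timestamp' and 'ele' columns;
    returns the 5-min rolling mean of ELE over the lesson."""
    ts = samples.assign(timestamp=pd.to_datetime(samples["timestamp"]))
    ts = ts.set_index("timestamp").sort_index()
    return ts["ele"].rolling("5min").mean()
```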

Fig. 5 The curves of students' emotional learning engagement in a lesson

The results showed that both the teaching style and the students' ELE varied substantially throughout the lesson. The trends indicate that high-performing students can often maintain a higher and more stable level of ELE, while low-performing students show a lower and more fluctuating ELE during class time. Most of the peak engagement stages involved student-centered activities, such as doing exercises (minutes 27–33) and group discussions (minutes 35–37). These findings are consistent with the summative results discussed in the previous section but add a timeline that allows fine-grained examination of how ELE changes with time and teaching style, which can be valuable for research and teaching. For example, the results show that appropriate interactions can significantly increase students' ELE. During the instruction episode in minutes 27–33, the teacher asked all the students to assume the role of a bomb disposal expert and dismantle an explosive device whose control circuit was fabricated in series and parallel forms. All the students participated in the activity, which stimulated an ELE peak, as shown in Fig. 5. This example illustrates that students' interest in science can increase when they are given roles in which they can take the lead in solving a context-rich problem.

The results in Fig. 5 can also provide useful diagnostic information for further analysis of teaching activities. For example, during minutes 15–20, a teacher-centered stage, the ELE changes of the high- and low-score groups differed sharply from their states in the remaining time frames. Further analysis of this teaching stage suggested that the teacher was explaining a difficult practice problem: high-performing students were able to keep up with and respond to the teacher, whereas low-performing students could not meaningfully follow the discussion and were left out of the interaction loop, which might further lead to frustration among these students. Given this feedback, teachers can improve their teaching design by adopting a more inclusive strategy, such as adding additional “scaffolding” steps, to help develop desirable learning pathways for all students.

Conclusions and Implications

In this research, we developed a multiscale perception facial expression recognition system (MP-FERS) for measuring students' emotional learning engagement and applied it to study teaching activities in real classrooms.

For the first research question, this study found that ELE measured by the MP-FERS is moderately correlated with ELE measured via self-reports, indicating moderately good concurrent validity. Moreover, the ELE measured by the MP-FERS is more strongly correlated with academic achievement than self-reported ELE, revealing greater predictive validity; these findings are consistent with those of Muñoz-García and Villena-Martínez (2021). The results also suggest that the MP-FERS can help address the weakness of self-reports, which are a subjective measure likely biased by students' intentions, whereas the MP-FERS is an objective measure that students cannot easily manipulate. Scrutinizing the self-reported and MP-FERS measures, we found that self-reported ELE was generally higher than MP-FERS-measured ELE for both the high- and low-performance groups. This could indicate potential biases in self-reporting, where students may tend to report in a way that aligns with teachers' expectations; the MP-FERS, on the other hand, provides a relatively objective measure that is harder to bias intentionally. In addition, the MP-FERS is noninvasive, far more efficient than a questionnaire, and provides real-time results at much finer temporal resolutions, which are valuable for research and teaching. In previous research, Whitehill et al. (2014) constructed an FER model that discriminates between four levels of engagement, whereas human observers could distinguish only high from low engagement, and Ashwin and Guddeti (2020) found that machine classification of emotional states matches human annotations. All of these outcomes demonstrate the efficiency and potential of FER systems for measuring ELE through students' facial expressions.

For the second research question, this study explored how teaching styles impacted students' ELE overall and at the high- and low-performance levels. Students were generally more engaged in student-centered and interactive activities. This finding is consistent with previous studies that used self-reports to measure ELE, which have demonstrated the positive influence of student-centered teaching on student engagement (Baeten et al., 2010; Watson et al., 2021). In Fig. 4, the distribution of ELE across teaching styles shows that ELE is concentrated at a lower level for the teacher-centered style and toward a higher level for the student-centered style, whereas for the interactive style the ELE is distributed more broadly from low to high levels. This suggests that interactive teaching may be influenced by a more diverse set of factors. Since the interactive style involves interactions among multiple participants (teachers, students, and groups) and is characterized by interactive cycles throughout, the stimulation of ELE in the interactive style may depend not only on the form of the activities but also on the design of the interaction processes. For example, the design of question chains and the manner of guidance are both important factors in promoting ELE, while interactions that do not match students' proficiency levels may be counterproductive to ELE. This suggests that teachers need to pay particular attention to the design of interactive teaching to help all students develop desirable learning pathways.

In addition, the results demonstrated that the MP-FERS can produce accurate real-time measurements of ELE at a fine-grained temporal resolution, echoing the current literature on measuring emotion in the teaching process (Liaw et al., 2021). This feature makes it possible to study the temporal variation in ELE levels in response to different teaching activities and can help teachers improve classroom instruction where students' ELE is low. We analyzed the temporal variation in MP-FERS-measured ELE in one lesson. The synchronous changes in ELE among the high- and low-performing groups across the timeline reflect their immediate responses to teaching activities, demonstrating the sensitivity of ELE to teaching variations. The results also reveal differences in the ELE responses between the high-performing and low-performing groups. In the student-centered style, the difference in ELE is small: student-centered teaching is more open-ended, and students have a greater sense of self-control, so low-performing groups may be emotionally satisfied through a variety of activities. In contrast, in most of the teacher-centered stages, the difference between the two groups was significant and even showed an opposite trend (minutes 15–20). In a teacher-centered style characterized by lecturing, students' emotional satisfaction may be more related to their ability to keep up with the teacher's lectures. This suggests that teachers need to pay attention to the difficulty of lecture content and develop desirable learning pathways that promote comprehension among low-performing students. Teachers can also learn from teaching formats that produce high engagement for all students and tailor their future instruction toward a higher level of learning engagement.

Notably, facial expressions may be subject to cultural variations, and ELE may vary across classroom settings, disciplines, and demographic groups. Follow-up studies should expand into other science disciplines and student populations. In addition, further research could consider digging deeper into the value-added effects of students’ emotions on learning performance, such as retention of scientific concepts and participation in science activities, enabling educators to recognize how fostering ELE can contribute to learning outcomes in the science classroom.

The application of artificial intelligence (AI) tools such as the MP-FERS has the potential to enhance the measurement of teaching effectiveness. However, variations in facial expressions across cultures and the privacy and ethical acceptance of AI technology should be carefully considered (Wu et al., 2020 ). For example, incomplete or biased data collection may lead to biased educational decisions. Overreliance on technology may reduce emotional communication between teachers and students and weaken teachers’ discriminative ability. In conclusion, it is vital to respect each student’s unique learning process and refrain from imposing uniform standards. Data derived from intelligent tools should inform and help refine teachers’ pedagogical approaches and strategies.

Data Availability

The models used herein were trained on large-scale open-source datasets that are publicly available online. Data sheets are available upon request from the corresponding author.

Code Availability

The code is original and was produced by the authors. Please contact the corresponding author to inquire about the availability of the code.

The data generated during the current study are partly available from the corresponding author on reasonable request. Because the class video data include images of students, we cannot share them for ethical reasons.

Alimoglu, M. K., Yardim, S., & Uysal, H. (2017). The effectiveness of TBL with real patients in neurology education in terms of knowledge retention, in-class engagement, and learner reactions. Advances in Physiology Education, 41 (1), 38–43. https://doi.org/10.1152/advan.00130.2016

Altuwairqi, K., Jarraya, S. K., Allinjawi, A., & Hammami, M. (2021). A new emotion-based affective model to detect student’s engagement. Journal of King Saud University-Computer and Information Sciences, 33 (1), 99–109. https://doi.org/10.1016/j.jksuci.2018.12.008

Anthony, C., & Nicolas, M. (2021). The recognition of emotions beyond facial expressions: Comparing emoticons specifically designed to convey basic emotions with other modes of expression. Computers in Human Behavior, 118 , 106689. https://doi.org/10.1016/j.chb.2021.106689

Areepattamannil, S. (2012). Effects of inquiry-based science instruction on science achievement and interest in science: Evidence from Qatar. Journal of Educational Research, 105 (2), 134–146. https://doi.org/10.1080/00220671.2010.533717

Arent, S. (2005). Thayer’s model of arousal and activation. In R. Bartlett, C. Gratton, & C. G. Rolf (Eds.), Encyclopedia of International Sport Studies. London: Routledge.

Ashwin, T. S., & Guddeti, R. (2020). Affective database for e-learning and classroom environments using Indian students’ faces, hand gestures and body postures. Future Generation Computer Systems, 108 , 334–348. https://doi.org/10.1016/j.future.2020.02.075

Baeten, M., Kyndt, E., Struyven, K., & Dochy, F. (2010). Using student-centred learning environments to stimulate deep approaches to learning: Factors encouraging or discouraging their effectiveness. Educational Research Review, 5 (3), 243–260. https://doi.org/10.1016/J.EDUREV.2010.06.001

Bakker, A. B., Vergel, A. I. S., & Kuntze, J. (2015). Student engagement and performance: A weekly diary study on the role of openness. Motivation And Emotion, 39 (1), 49–62. https://doi.org/10.1007/S11031-014-9422-5

Battersby, G. L., Beeley, C., Baguley, D. A., Barker, H. D., Broad, H. D., Carey, N. C., & Williams, D. P. (2020). Go Fischer: An Introductory Organic Chemistry Card Game. Journal of Chemical Education, 97 (8), 2226–2230. https://doi.org/10.1021/acs.jchemed.0c00504

Ben-Eliyahu, A., Moore, D., Dorph, R., & Schunn, C. D. (2018). Investigating the multidimensionality of engagement: Affective, behavioral, and cognitive engagement across science activities and contexts. Contemporary Educational Psychology, 53 , 87–105. https://doi.org/10.1016/J.CEDPSYCH.2018.01.002

Blomeke, S., Jentsch, A., Ross, N., Kaiser, G., & Konig, J. (2022). Opening up the black box: Teacher competence, instructional quality, and students' learning progress. Learning and Instruction, 79, 101600. https://doi.org/10.1016/j.learninstruc.2022.101600

Byusa, E., Kampire, E., & Mwesigye, A. R. (2022). Game-based learning approach on students' motivation and understanding of chemistry concepts: A systematic review of literature. Heliyon, 8(5). https://doi.org/10.1016/j.heliyon.2022.e09541

Cao, J., Wang, H., Hu, P., & Miao, J. (2008). PAD Model Based Facial Expression Analysis. Paper presented at the Advances in Visual Computing, Berlin, Heidelberg.

Capar, G., & Tarim, K. (2015). Efficacy of the cooperative learning method on mathematics achievement and attitude: A meta-analysis research. Educational Sciences-Theory & Practice, 15 (2), 553–559.

Chen, D., Wen, G., Li, H., Chen, R., & Li, C. (2023). Multi-relations aware network for in-the-wild facial expression recognition. IEEE Transactions on Circuits and Systems for Video Technology, 33(8), 3848–3859. https://doi.org/10.1109/TCSVT.2023.3234312

Chen, Y., Bosch, N., & D'Mello, S. (2015). Video-based affect detection in noninteractive learning environments. Paper presented at the 8th International Conference on Educational Data Mining, Madrid, Spain.

Chi, M. T. H., & Wylie, R. (2014). The ICAP framework: Linking cognitive engagement to active learning outcomes. Educational Psychologist, 49 (4), 219–243. https://doi.org/10.1080/00461520.2014.965823

Ministry of Education of the People's Republic of China. (2022). Science curriculum standards for compulsory education. Beijing Normal University Publishing House.

Cohen, L., Manion, L., & Morrison, K. (2007). Research methods in education (6th ed.). Routledge.

Connell, J. P., & Wellborn, J. G. (1991). Competence, autonomy, and relatedness: A motivational analysis of self-system processes. Journal of Personality and Social Psychology, 65.

Creswell, J. W., & Clark, V. L. (2011). Designing and conducting mixed methods research. Thousand Oaks, CA: Sage.

D’Mello, S., & Graesser, A. (2012). Dynamics of affective states during complex learning. Learning and Instruction, 22 (2), 145–157. https://doi.org/10.1016/J.LEARNINSTRUC.2011.10.001

Delahunty, T., Seery, N., & Lynch, R. (2018). Exploring the use of electroencephalography to gather objective evidence of cognitive processing during problem solving. Journal of Science Education and Technology, 27 (2), 114–130. https://doi.org/10.1007/s10956-017-9712-2

Dong, J., & Ke, X. (2015). Using S-T teaching analysis to evaluate teacher-student interaction behavior. Biology Teaching, 40 (06), 11–12.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Elias, B., Kimberly, T. N., Roberta Michnick, G., & Kathy, H.-P. (2023). Investigating the contributions of active, playful learning to student interest and educational outcomes. Acta Psychologica, 238 , 103983. https://doi.org/10.1016/j.actpsy.2023.103983

Engels, M. C., Spilt, J. L., Denies, K., & Verschueren, K. (2021). The role of affective teacher-student relationships in adolescents’ school engagement and achievement trajectories. Learning and Instruction, 75 , 101485. https://doi.org/10.1016/J.LEARNINSTRUC.2021.101485

Eva, T., & Kathleen, M. (2023). Teaching routines and student-centered mathematics instruction: The essential role of conferring to understand student thinking and reasoning. The Journal of Mathematical Behavior, 70 , 101032. https://doi.org/10.1016/j.jmathb.2023.101032

Fredricks, J. A., Blumenfeld, P. C., & Paris, A. H. (2004). School engagement: Potential of the concept, state of the evidence. Review of Educational Research, 74 (1), 59–109. https://doi.org/10.3102/00346543074001059

Fredricks, J. A., Hofkens, T., Wang, M. T., Mortenson, E., & Scott, P. (2018). Supporting girls’ and boys’ engagement in math and science learning: A mixed methods study. Journal of Research in Science Teaching, 55 (2), 271–298. https://doi.org/10.1002/tea.21419

Freeman, A., Becker, S. A., & Cummins, M. (2017). NMC/CoSN Horizon Report: 2017 K-12 Edition. The New Media Consortium.

Fu, D., & Zhang, H. (2001). Educational Information Processing . Beijing Normal University Publishing House.

Geertshuis, S. A. (2019). Slaves to our emotions: Examining the predictive relationship between emotional well-being and academic outcomes. Active Learning in Higher Education, 20 (2), 153–166. https://doi.org/10.1177/1469787418808932

Gilroy, S. W., Cavazza, M., & Benayoun, M. (2009). Using affective trajectories to describe states of flow in interactive art. Paper presented at the Proceedings of the International Conference on Advances in Computer Entertainment Technology, Athens, Greece. https://doi.org/10.1145/1690388.1690416

González-Romá, V., Schaufeli, W. B., Bakker, A. B., & Lloret, S. (2006). Burnout and work engagement: Independent factors or opposite poles? Journal of Vocational Behavior, 68 (1), 165–174. https://doi.org/10.1016/j.jvb.2005.01.003

Greene, B. A. (2015). Measuring cognitive engagement with self-report scales: Reflections from over 20 years of research. Educational Psychologist, 50 (1), 14–30. https://doi.org/10.1080/00461520.2014.989230

Greene, J. C. (2007). Mixed methods in social inquiry. San Francisco, CA: Jossey-Bass.

Guo, Y., Sun, S., Breit-Smith, A., Morrison, F. J., & Connor, C. M. D. (2014). Behavioral engagement and reading achievement in elementary-school-age children: A longitudinal cross-lagged analysis. Journal of Educational Psychology, 107 (2), 332–347. https://doi.org/10.1037/A0037638

Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2010). Multivariate Data Analysis: A Global Perspective : Multivariate Data Analysis: A Global Perspective.

Henrie, C. R., Halverson, L. R., & Graham, C. R. (2015). Measuring student engagement in technology-mediated learning. Computer Education, 90 (90), 36–53. https://doi.org/10.1016/J.COMPEDU.2015.09.005

Hernández, M. M., Eisenberg, N., Valiente, C., Vanschyndel, S. K., Spinrad, T. L., Silva, K. M., & Thompson, M. S. (2016). Emotional expression in school context, social relationships, and academic adjustment in kindergarten. Emotion, 16 (4), 553.

Howe, C., Hennessy, S., Mercer, N., Vrikki, M., & Wheatley, L. (2019). Teacher-student dialogue during classroom teaching: Does it really impact on student outcomes? Journal of the Learning Sciences, 28 (4–5), 462–512. https://doi.org/10.1080/10508406.2019.1573730

Jia, J., Wu, Z., Zhang, S., Meng, H. M., & Cai, L. (2014). Head and facial gestures synthesis using PAD model for an expressive talking avatar. Multimedia Tools and Applications, 73 (1), 439–461. https://doi.org/10.1007/s11042-013-1604-8

Jin, R., Zhao, S., Hao, Z., Xu, Y., Xu, T., & Chen, E. (2022). AVT: AU-assisted visual transformer for facial expression recognition. Paper presented at the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France. https://doi.org/10.1109/ICIP46576.2022.9897960

Joshi, A., & Bhaskar, P. (2022). Qualitative study on critical traits of teacher for effective teaching in higher education institutions. International Journal of Learning and Change, 14 (4), 390–408. https://doi.org/10.1504/ijlc.2022.124466

Kaiyue, L., Zhong, S., & Min, X. (2021). Artificial intelligent based video analysis on the teaching interaction patterns in classroom environment. International Journal of Information and Education Technology, 11 (3), 126–130.

Kang, J., & Keinonen, T. (2018). The effect of student-centered approaches on students’ interest and achievement in science: Relevant topic-based, open and guided inquiry-based, and discussion-based approaches. Research in Science Education, 48 (4), 865–885. https://doi.org/10.1007/s11165-016-9590-2

Kateřina, L. (2019). Socialization of a student teacher on teaching practice into the discursive community of the classroom: Between a teacher-centered and a learner-centered approach. Learning, Culture and Social Interaction, 22 , 100314. https://doi.org/10.1016/j.lcsi.2019.05.001

Kennedy, M. M. (2016). How does professional development improve teaching? Review of Educational Research, 86 (4), 945–980. https://doi.org/10.3102/0034654315626800

Kuhbandner, C., Pekrun, R., & Maier, M. A. (2010). The role of positive and negative affect in the “mirroring” of other persons' actions. Cognition & Emotion, 24(7), 1182–1190. https://doi.org/10.1080/02699930903119196

Li, X., Wu, J., & Huang, S. (2021). Analysis of high school biology quality course based on improved S-T analysis method. Journal of Teaching and Management, 21 , 3.

Liaw, H., Yu, Y. R., Chou, C. C., & Chiu, M. H. (2021). Relationships between facial expressions, prior knowledge, and multiple representations: A case of conceptual change for kinematics instruction. Journal of Science Education and Technology, 30 (2), 227–238. https://doi.org/10.1007/S10956-020-09863-3

Linnenbrink, E. A. (2007). Chapter 7 - The role of affect in student learning: A multi-dimensional approach to considering the interaction of affect, motivation, and engagement. In P. A. Schutz & R. Pekrun (Eds.), Emotion in Education (pp. 107–124). Academic Press.

Liu, H., Cai, H., Li, Q., Li, X., & Xiao, H. (2022a). Adaptive multilayer perceptual attention network for facial expression recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32(9), 6253–6266. https://doi.org/10.1109/tcsvt.2022.3165321

Liu, L., Du, W., Wang, P., & Jing, M. (2014). Improvement of S-T analysis method and analysis of national high school chemistry quality lessons. Education in Chemistry, (07), 19–22.

Liu, S., Liu, S., Liu, Z., Peng, X., & Yang, Z. (2022b). Automated detection of emotional and cognitive engagement in MOOC discussions to predict learning achievement. Computers & Education, 181 , 104461. https://doi.org/10.1016/j.compedu.2022.104461

Liu, T., Wang, J., Yang, B., & Wang, X. (2021). Facial expression recognition method with multi-label distribution learning for non-verbal behavior understanding in the classroom. Infrared Physics & Technology, 112 , 103594. https://doi.org/10.1016/J.INFRARED.2020.103594

Liu, Y., Tao, L., & Fu, X. (2009). The analysis of PAD emotional state model based on emotion pictures. Journal of Image and Graphics, 14 (05), 753–758.

Martínez-Sierra, G. G. (2014). High school students’ emotional experiences in mathematics classes. Research in Mathematics Education, 16 (3), 17.

Mehrabian, A. (1995). Framework for a comprehensive description and measurement of emotional states. Genetic Social and General Psychology Monographs, 121 (3), 339–361.

Mehrabian, A. (1996). Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament. Current Psychology, 14 (4), 261–292. https://doi.org/10.1007/BF02686918

Mehrabian, A., & Russell, J. A. (1974). An approach to environmental psychology. MIT Press.

Meinhardt, J., & Pekrun, R. (2003). Attentional resource allocation to emotional events: An ERP study. Cognition and Emotion, 17 (3), 477–500. https://doi.org/10.1080/02699930244000039

Mikeska, J. N., Holtzman, S., McCaffrey, D. F., Liu, S., & Shattuck, T. (2019). Using classroom observations to evaluate science teaching: Implications of lesson sampling for measuring science teaching effectiveness across lesson types. Science Education, 103 (1), 123–144. https://doi.org/10.1002/SCE.21482

Mollahosseini, A., Chan, D., & Mahoor, M. H. (2016). Going deeper in facial expression recognition using deep neural networks . Paper presented at the Workshop on Applications of Computer Vision.

Muenks, K., Wigfield, A., Yang, J. S., & O’Neal, C. R. (2017). How true is grit? Assessing its relations to high school and college students’ personality characteristics, self-regulation, engagement, and achievement. Journal of Educational Psychology, 109 (5), 599–620. https://doi.org/10.1037/EDU0000153

Muñoz-García, A., & Villena-Martínez, M. D. (2021). Influences of learning approaches, student engagement, and satisfaction with learning on measures of sustainable behavior in a social sciences student sample. Sustainability, 13 (2), 541. https://doi.org/10.3390/SU13020541

Nakamura, J., & Csikszentmihalyi, M. (2014). The concept of flow. In Flow and the foundations of positive psychology. Springer.

Newmann, F. M. (1992). Student engagement and achievement in American secondary schools. Teachers College Press.

North, B., Diab, M., Lameras, P., Zaraik, J., & Fischer, H. (2021). Developing a Platform for using Game-Based Learning in Vocational Education and Training. Paper presented at the 2021 IEEE Global Engineering Education Conference (EDUCON).

Office for Standards in Education, Children's Services and Skills (Ofsted). (2010). The evaluation schedule for schools.

Olivier, E., Galand, B., Morin, A. J. S., & Hospel, V. (2021). Need-supportive teaching and student engagement in the classroom: Comparing the additive, synergistic, and global contributions. Learning And Instruction, 71 , 101389. https://doi.org/10.1016/J.LEARNINSTRUC.2020.101389

Ontario Ministry of Education. (2014). Achieving excellence: A renewed vision for education in Ontario. Government of Ontario.


Funding

This work was supported in part by the National Social Science Foundation of China under Grant No. CHA200261. Any opinions expressed in this work are those of the authors and do not necessarily represent those of the funding agencies.

Author information

Authors and Affiliations

School of Physics, South China Normal University, Guangzhou, China

Xiaoyu Tang, Yayun Gong, Yang Xiao & Jianwen Xiong

Department of Physics, The Ohio State University, Columbus, OH 43210, USA

Lei Bao

Contributions

Conceptualization: Xiaoyu Tang, Lei Bao. Methodology: Yang Xiao, Xiaoyu Tang. Formal analysis and investigation: Yayun Gong. Writing—original draft preparation: Xiaoyu Tang, Yayun Gong. Writing—review and editing: Lei Bao, Yang Xiao. Funding acquisition: Xiaoyu Tang. Supervision: Lei Bao, Jianwen Xiong.

Corresponding authors

Correspondence to Jianwen Xiong or Lei Bao.

Ethics declarations

Ethics Approval

All human trials in this research met the ethical standards of the Chinese Association for Ethical Research (CAES), and the research was conducted with approval from the authors' institution.

Consent to Participate

Informed consent was obtained from all individual participants included in the study and from their legal guardians.

Consent for Publication

The participants provided informed consent for the publication of their learning data in this article and consented to the submission of this article to the journal.

Conflict of Interest

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (DOCX 242 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Tang, X., Gong, Y., Xiao, Y. et al. Facial Expression Recognition for Probing Students’ Emotional Engagement in Science Learning. J Sci Educ Technol (2024). https://doi.org/10.1007/s10956-024-10143-7


Accepted: 01 August 2024

Published: 14 August 2024

DOI: https://doi.org/10.1007/s10956-024-10143-7


Keywords

  • Student engagement
  • Emotional learning engagement
  • Facial expression recognition
  • Science teaching
