The table above shows only the t-tests for population means. Another common t-test is for correlation coefficients. You use this t-test to decide if the correlation coefficient is significantly different from zero.
When you define the hypothesis, you also define whether you have a one-tailed or a two-tailed test. You should make this decision before collecting your data or doing any calculations. You make this decision for all three of the t -tests for means.
To explain, let’s use the one-sample t -test. Suppose we have a random sample of protein bars, and the label for the bars advertises 20 grams of protein per bar. The null hypothesis is that the unknown population mean is 20. Suppose we simply want to know if the data shows we have a different population mean. In this situation, our hypotheses are:
$ \mathrm H_o: \mu = 20 $
$ \mathrm H_a: \mu \neq 20 $
Here, we have a two-tailed test. We will use the data to see if the sample average differs sufficiently from 20 – either higher or lower – to conclude that the unknown population mean is different from 20.
Suppose instead that we want to know whether the advertising on the label is correct. Does the data support the idea that the unknown population mean is at least 20? Or not? In this situation, our hypotheses are:
$ \mathrm H_o: \mu \geq 20 $
$ \mathrm H_a: \mu < 20 $
Here, we have a one-tailed test. We will use the data to see if the sample average is sufficiently less than 20 to reject the hypothesis that the unknown population mean is 20 or higher.
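The one-tailed protein-bar test above can be carried out by hand. The sketch below uses made-up measurements and a hard-coded critical value from a t table (for df = 9 at alpha = 0.05 one-tailed, about -1.833); both are illustrative assumptions, not data from the original example.

```python
import math
import statistics

# Hypothetical sample: protein (grams) measured in 10 randomly chosen bars.
sample = [19.2, 20.1, 18.7, 19.5, 20.3, 18.9, 19.0, 19.8, 18.5, 19.6]
mu0 = 20  # advertised population mean under the null hypothesis
n = len(sample)

mean = statistics.mean(sample)
s = statistics.stdev(sample)           # sample standard deviation (n - 1 denominator)
t = (mean - mu0) / (s / math.sqrt(n))  # one-sample t statistic

# One-tailed test at alpha = 0.05 with df = 9: critical value is about -1.833.
# We reject H0 (mu >= 20) only if t falls below this critical value.
t_crit = -1.833
reject_h0 = t < t_crit
print(f"t = {t:.3f}, reject H0: {reject_h0}")
```

Here the sample average of 19.36 g falls far enough below 20 that the null hypothesis would be rejected.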
See the "tails for hypotheses tests" section on the t -distribution page for images that illustrate the concepts for one-tailed and two-tailed tests.
For all of the t-tests involving means, you perform the same steps in the analysis: state the null and alternative hypotheses, set the alpha level, compute the t-statistic and its degrees of freedom, and compare the result against the t-distribution.
The t-test assesses whether the means of two groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two groups, and especially appropriate as the analysis for the posttest-only two-group randomized experimental design .
Figure 1 shows the distributions for the treated (blue) and control (green) groups in a study. Actually, the figure shows the idealized distribution – the actual distribution would usually be depicted with a histogram or bar graph . The figure indicates where the control and treatment group means are located. The question the t-test addresses is whether the means are statistically different.
What does it mean to say that the averages for two groups are statistically different? Consider the three situations shown in Figure 2. The first thing to notice is that the difference between the means is the same in all three. But the three situations don't look the same: they tell very different stories. The top example shows a case with moderate variability of scores within each group. The second shows the high-variability case, and the third shows the low-variability case. Clearly, we would conclude that the two groups appear most different or distinct in the bottom, low-variability case. Why? Because there is relatively little overlap between the two bell-shaped curves. In the high-variability case, the group difference appears least striking because the two bell-shaped distributions overlap so much.
This leads us to a very important conclusion: when we are looking at the differences between scores for two groups, we have to judge the difference between their means relative to the spread or variability of their scores. The t-test does just this.
The formula for the t-test is a ratio. The top part of the ratio is just the difference between the two means or averages. The bottom part is a measure of the variability or dispersion of the scores. This formula is essentially another example of the signal-to-noise metaphor in research: the difference between the means is the signal that, in this case, we think our program or treatment introduced into the data; the bottom part of the formula is a measure of variability that is essentially noise that may make it harder to see the group difference. Figure 3 shows the formula for the t-test and how the numerator and denominator are related to the distributions.
The top part of the formula is easy to compute – just find the difference between the means. The bottom part is called the standard error of the difference . To compute it, we take the variance for each group and divide it by the number of people in that group. We add these two values and then take their square root. The specific formula for the standard error of the difference between the means is:
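In standard notation, with subscripts T and C (labels chosen here) for the treatment and control groups, the standard error of the difference described above is:

```latex
\mathrm{SE}(\bar{X}_T - \bar{X}_C) = \sqrt{\frac{\mathrm{var}_T}{n_T} + \frac{\mathrm{var}_C}{n_C}}
```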
Remember that the variance is simply the square of the standard deviation.
The final formula for the t-test is:
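In standard notation, with subscripts T and C (labels chosen here) for the treatment and control group means and variances, the ratio described above is:

```latex
t = \frac{\bar{X}_T - \bar{X}_C}{\sqrt{\dfrac{\mathrm{var}_T}{n_T} + \dfrac{\mathrm{var}_C}{n_C}}}
```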
The t-value will be positive if the first mean is larger than the second and negative if it is smaller. Once you compute the t-value, you have to look it up in a table of significance to test whether the ratio is large enough to say that the difference between the groups is not likely to have been a chance finding. To test the significance, you need to set a risk level (called the alpha level). In most social research, the rule of thumb is to set the alpha level at .05. This means that five times out of a hundred you would find a statistically significant difference between the means even if there was none (i.e. by chance). You also need to determine the degrees of freedom (df) for the test: in the t-test, the degrees of freedom is the sum of the persons in both groups minus 2. Given the alpha level, the df, and the t-value, you can look the t-value up in a standard table of significance (available as an appendix in the back of most statistics texts) to determine whether the t-value is large enough to be significant. If it is, you can conclude that the difference between the means for the two groups is statistically significant (even given the variability). Fortunately, statistical computer programs routinely print the significance test results and save you the trouble of looking them up in a table.
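The computation just described can be sketched in a few lines of Python. The two groups below are made-up data, and the code follows the formula given in the text (difference in means over the standard error, df = total persons minus 2):

```python
import math
import statistics

# Hypothetical scores for a treatment and a control group.
treatment = [1, 2, 3, 4, 5]
control = [2, 4, 6, 8, 10]

m_t, m_c = statistics.mean(treatment), statistics.mean(control)
v_t, v_c = statistics.variance(treatment), statistics.variance(control)
n_t, n_c = len(treatment), len(control)

# Standard error of the difference: each group's variance over its size,
# summed, then square-rooted.
se = math.sqrt(v_t / n_t + v_c / n_c)
t = (m_t - m_c) / se

# Degrees of freedom: total number of persons in both groups minus 2.
df = n_t + n_c - 2
print(f"t = {t:.3f}, df = {df}")
```

With the t-value and df in hand, you would look up the critical value for your chosen alpha level, exactly as the text describes.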
The t-test, one-way Analysis of Variance (ANOVA) and a form of regression analysis are mathematically equivalent (see the statistical analysis of the posttest-only randomized experimental design ) and would yield identical results.
Pandey, R. M.
Department of Biostatistics, All India Institute of Medical Sciences, New Delhi, India
Student's t-test is a method of testing hypotheses about the mean of a small sample drawn from a normally distributed population when the population standard deviation is unknown. William Sealy Gosset, an Englishman publishing under the pseudonym Student, developed the t-test in 1908. This article discusses the types of t-test and shows a simple way of performing one.
To draw conclusions about a population parameter (the true value of some quantity in the population) using the information contained in a sample, two approaches of statistical inference are used: the confidence interval (the range within which the true value is likely to lie, usually with 95% confidence) and hypothesis testing, which asks how often the observed finding could be due to chance alone, reported as a P value, the probability of obtaining a result as extreme as the one observed under the null hypothesis. Statistical tests used for hypothesis testing are broadly classified into two groups: parametric tests and nonparametric tests. In parametric tests, some assumption is made about the distribution of the population from which the sample is drawn; in all parametric tests, the quantitative variable is assumed to be normally distributed in the population. As one does not have access to the population values, the assumption of normality is assessed from the sample values. Nonparametric methods, also known as distribution-free methods or methods based on ranks, make no assumptions about the distribution of the variable in the population.
The family of t-tests falls in the category of parametric statistical tests, in which the mean value(s) is (are) compared against a hypothesized value. In hypothesis testing of any statistic, for example a mean or a proportion, the hypothesized value of the statistic is specified while the population variance is not; the only available information about variability comes from the sample. Therefore, to compute the standard error (the measure of variability of the statistic of interest, which always appears in the denominator of the test statistic), it is reasonable to use the sample standard deviation. William Sealy Gosset, a chemist working for a brewery in Dublin, Ireland, introduced the t-statistic. Company policy did not allow chemists to publish their findings, so Gosset published his mathematical work under the pseudonym "Student". The Student's t-test was published in the journal Biometrika in 1908.[1,2]
In medical research, t-tests and Chi-square tests are the two most commonly used statistical tests. In any hypothesis testing situation, if the test statistic follows a Student's t distribution under the null hypothesis, the test is a t-test. The most frequently used t-tests are: comparison of a mean in a single sample; comparison of means in two related samples; comparison of means in two unrelated samples; and testing of a correlation coefficient or a regression coefficient against a hypothesized value, which is usually zero. In the one-sample location test, it is tested whether or not the mean of the population has the value specified in the null hypothesis. In the two independent sample location test, equality of the means of two populations is tested. The paired t-test (or repeated-measures t-test) compares the mean delta (the difference between two related samples) against a hypothesized value of zero. The t-test for a regression coefficient tests whether or not the slope of a regression line differs significantly from zero. For a binary variable (such as cure, relapse, hypertension, or diabetes), which is either yes or no for a subject, if we score yes as 1 and no as 0, then the sample proportion (p) and the sample mean are the same. Therefore, the t-test approach for a mean can be used for a proportion as well.
The focus here is on describing the situations in which a particular t-test would be used. These are divided into t-tests used for testing: (a) a mean/proportion in one sample, (b) means/proportions in two unrelated samples, (c) means/proportions in two related samples, (d) a correlation coefficient, and (e) a regression coefficient. The process of hypothesis testing is the same for any statistical test: formulate the null and alternative hypotheses; identify and compute the test statistic from the sample values; decide the alpha level and whether the test is one-tailed or two-tailed; and reject or accept the null hypothesis by comparing the computed test statistic with the theoretical value of t from the t-distribution table for the given degrees of freedom. In hypothesis testing, the P value is reported simply as P < 0.05; in significance testing, the exact P value is reported so that the reader is in a better position to judge the level of statistical significance.
The above is an illustration of the most common situations where the t-test is used. With the availability of software, computation is no longer an issue. Any software providing basic statistical methods will include these tests. All one needs to do is identify the t-test appropriate for a given situation, arrange the data in the manner required by the particular software, run the test, and report the following: the number of observations, the summary statistic, the P value, and the 95% confidence interval of the summary statistic of interest.
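For a sample mean, the "number of observations, summary statistic, confidence interval" part of that report can also be computed by hand. The sketch below uses made-up data and hard-codes the two-tailed 95% critical value for df = 9 (about 2.262) from a t table:

```python
import math
import statistics

# Hypothetical sample of 10 measurements.
data = [12.1, 11.4, 12.8, 13.0, 11.9, 12.5, 11.7, 12.2, 12.6, 11.8]
n = len(data)
mean = statistics.mean(data)
s = statistics.stdev(data)      # sample standard deviation
se = s / math.sqrt(n)           # standard error of the mean

# Two-tailed 95% critical value of t for df = 9 (from a t table) is about 2.262.
t_crit = 2.262
ci = (mean - t_crit * se, mean + t_crit * se)
print(f"n = {n}, mean = {mean:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```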
In addition to statistical software, you can also use online calculators to obtain t-statistics, P values, 95% confidence intervals, and so on. Various online calculators are available on the web; a brief description of how to use them is given below. One such calculator is available at http://www.graphpad.com/quickcalcs/ .
Similarly, online t-test calculators can be used for the paired t-test (the t-test for two related samples) and the t-test for two independent samples. You just need to know what format your data are in, which test applies in which situation, and the correct way of entering the data into the calculator.
Conflicts of interest.
There are no conflicts of interest.
An introduction to t-test theory for surveys.
What are t-tests, when should you use them, and what are their strengths and weaknesses for analyzing survey data?
The t-test, which relies on the t-statistic and the t-distribution, is a popular statistical tool used to test differences between the means (averages) of two groups, or the difference between one group's mean and a standard value. Running a t-test helps you to understand whether the differences are statistically significant (i.e. they didn't just happen by a fluke).
For example, let’s say you surveyed two sample groups of 500 customers in two different cities about their experiences at your stores. Group A in Los Angeles gave you on average 8 out of 10 for customer service, while Group B in Boston gave you an average score of 5 out of 10. Was your customer service really better in LA, or was it just chance that your LA sample group happened to contain a lot of customers who had positive experiences?
T-tests give you an answer to that question. They tell you what the probability is that the differences you found were down to chance. If that probability is very small, then you can be confident that the difference is meaningful (or statistically significant).
In a t-test, you start with a null hypothesis: an assumption that the two populations are the same and there is no meaningful difference between them. The t-test tells you whether the data gives you grounds to reject that null hypothesis.
So far we’ve talked about testing whether there’s a difference between two independent populations, aka a 2-sample t-test. But there are some other common variations of the t-test worth knowing about too.
Instead of a second population, you run a test to see if the average of your population is significantly different from a certain number or value.
Example: Is the average monthly spend among my customers significantly more or less than $50?
The classic example we’ve described above, where the means of two independent populations are compared to see if there is a significant difference.
Example: Do Iowan shoppers spend more per store visit than Alaskan ones?
With a paired t-test, you’re testing two dependent (paired) groups to see if they are significantly different. This can be useful for “before and after” scenarios.
Example: Did the average monthly spend per customer significantly increase after I ran my last marketing campaign?
You can also choose between one-tailed or two-tailed t-tests.
A t-test is used when there are two or fewer groups. If you have more than two groups, another option, such as ANOVA , may be a better fit.
There are a couple more conditions for using a 2-sample t-test: the data in each group should be approximately normally distributed, the two groups should have roughly equal variances, and the observations should be independent of one another.
You also need to have a big enough sample size to make sure the results are sound. However, one of the benefits of the t-test is that it allows you to work with relatively small quantities of data, since it relies on the mean and variance of the sample, not the population as a whole.
The table shows alternative statistical techniques that can be used to analyze this type of data when different levels of measurement are available.
You may sometimes hear the t-test referred to as the “Student’s t-test”. Although it is regularly used by students, that’s not where the name comes from.
The t-distribution was developed by W. S. Gosset (1908), an employee of the Guinness brewery in Dublin. Gosset was not allowed to publish research findings in his own name, so he adopted the pseudonym “Student”. The t-distribution, as it was first designated, has been known under a variety of names, including the Student’s distribution and Student’s t-distribution.
In order to run a t-test, you need 5 things:
From there, you can either use formulae to run your t-test manually (we’ve provided formulae at the end of this article), or use a stats software package such as SPSS or Minitab to compute your results.
The outputs of a t-test are:
The t-value is made up of two elements: the difference between the means of your two groups, and the variability within them. These two elements are expressed as a ratio. If the ratio is small, there isn't much difference between the groups; if it's larger, there is more difference.
The degrees of freedom relate to the size of the sample and how many of its values are free to vary while still maintaining the same average. For a one-sample test, it's the sample size minus one; for a two-sample test, it's the total sample size minus two. You can also think of it as the number of values you'd need to find out in order to know all of the values. (The final one could be deduced by knowing the others and the total.)
Going the manual route, with these two numbers in hand, you can use your critical value table to find:
The p-value is the heart of the matter: it tells you the probability of obtaining your t-value by chance. The smaller the p-value, the more confident you can be in the statistical significance of your results.
Microbe Notes
The t-test is a test in statistics that is used for testing hypotheses regarding the mean of a small sample taken from a population when the standard deviation of the population is not known.
T-tests can be performed manually using a formula or with statistical software.
Tae Kyun Kim
Department of Anesthesia and Pain Medicine, Pusan National University School of Medicine, Busan, Korea.
In statistical tests, the probability distribution of the test statistic is important. When samples of size n are drawn from a population $N(\mu, \sigma^2)$, the distribution of the sample mean $\bar{X}$ is the normal distribution $N(\mu, \sigma^2/n)$. Under the null hypothesis $\mu = \mu_0$, the statistic $z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ follows the standard normal distribution. When the variance of the population is not known, it can be replaced with the sample variance $s^2$. In this case, the statistic $\frac{\bar{X} - \mu_0}{s/\sqrt{n}}$ follows a t distribution with $n-1$ degrees of freedom. An independent-group t test can be carried out for a comparison of means between two independent groups, and a paired t test for paired data. As the t test is a parametric test, samples should meet certain preconditions, such as normality, equal variances, and independence.
A t test is a type of statistical test that is used to compare the means of two groups. It is one of the most widely used statistical hypothesis tests in pain studies [ 1 ]. There are two types of statistical inference: parametric and nonparametric methods. Parametric methods refer to a statistical technique in which one defines the probability distribution of probability variables and makes inferences about the parameters of the distribution. In cases in which the probability distribution cannot be defined, nonparametric methods are employed. T tests are a type of parametric method; they can be used when the samples satisfy the conditions of normality, equal variance, and independence.
T tests can be divided into two types. There is the independent t test, which can be used when the two groups under comparison are independent of each other, and the paired t test, which can be used when the two groups under comparison are dependent on each other. T tests are usually used in cases where the experimental subjects are divided into two independent groups, with one group treated with A and the other group treated with B. Researchers can acquire two types of results for each group (i.e., prior to treatment and after the treatment): preA and postA, and preB and postB. An independent t test can be used for an intergroup comparison of postA and postB or for an intergroup comparison of changes in preA to postA (postA-preA) and changes in preB to postB (postB-preB) ( Table 1 ).
Treatment A (first four columns) and Treatment B (last four columns) are given to two independent groups of subjects.

| ID | preA | postA | ΔA | ID | preB | postB | ΔB |
|----|------|-------|----|----|------|-------|----|
| 1 | 63 | 77 | 14 | 11 | 81 | 101 | 20 |
| 2 | 69 | 88 | 19 | 12 | 87 | 103 | 16 |
| 3 | 76 | 90 | 14 | 13 | 77 | 107 | 30 |
| 4 | 78 | 95 | 17 | 14 | 80 | 114 | 34 |
| 5 | 80 | 96 | 16 | 15 | 76 | 116 | 40 |
| 6 | 89 | 96 | 7 | 16 | 86 | 116 | 30 |
| 7 | 90 | 102 | 12 | 17 | 98 | 116 | 18 |
| 8 | 92 | 104 | 12 | 18 | 87 | 120 | 33 |
| 9 | 103 | 110 | 7 | 19 | 105 | 120 | 15 |
| 10 | 112 | 115 | 3 | 20 | 69 | 127 | 58 |
ID: individual identification, preA, preB: before the treatment A or B, postA, postB: after the treatment A or B, ΔA, ΔB: difference between before and after the treatment A or B.
On the other hand, paired t tests are used in different experimental environments. For example, the experimental subjects are not divided into two groups, and all of them are treated initially with A. The amount of change (postA-preA) is then measured for all subjects. After all of the effects of A disappear, the subjects are treated with B, and the amount of change (postB-preB) is measured for all of the subjects. A paired t test is used in such crossover test designs to compare the amount of change of A to that of B for the same subjects ( Table 2 ).
Treatment A (first four columns) is followed by a washout period and then Treatment B (last four columns) for the same 10 subjects.

| ID | preA | postA | ΔA | ID | preB | postB | ΔB |
|----|------|-------|----|----|------|-------|----|
| 1 | 63 | 77 | 14 | 1 | 73 | 103 | 30 |
| 2 | 69 | 88 | 19 | 2 | 74 | 104 | 30 |
| 3 | 76 | 90 | 14 | 3 | 76 | 107 | 31 |
| 4 | 78 | 95 | 17 | 4 | 84 | 108 | 24 |
| 5 | 80 | 96 | 16 | 5 | 84 | 110 | 26 |
| 6 | 89 | 96 | 7 | 6 | 86 | 110 | 24 |
| 7 | 90 | 102 | 12 | 7 | 92 | 113 | 21 |
| 8 | 92 | 104 | 12 | 8 | 95 | 114 | 19 |
| 9 | 103 | 110 | 7 | 9 | 103 | 118 | 15 |
| 10 | 112 | 115 | 3 | 10 | 115 | 120 | 5 |
Statistics is basically about probabilities. A statistical conclusion of a large or small difference between two groups is not based on an absolute standard but is rather an evaluation of the probability of an event. For example, a clinical test is performed to determine whether or not a patient has a certain disease. If the test results are either higher or lower than the standard, clinicians will determine that the patient has the disease despite the fact that the patient may or may not actually have the disease. This conclusion is based on the statistical concept which holds that it is more statistically valid to conclude that the patient has the disease than to declare that the patient is a rare case among people without the disease because such test results are statistically rare in normal people.
The test results and the probability distribution of the results must be known in order for the results to be determined as statistically rare. The criteria for clinical indicators have been established based on data collected from an entire population or at least from a large number of people. Here, we examine a case in which a clinical indicator exhibits a normal distribution with a mean of µ and a variance of σ 2 . If a patient's test result is χ, is this statistically rare against the criteria (e.g., 5 or 1%)? Probability is represented as the surface area in a probability distribution, and the z score that represents either 5 or 1%, near the margins of the distribution, becomes the reference value. The test result χ can be determined to be statistically rare compared to the reference probability if it lies in a more marginal area than the z score, that is, if the value of χ is located in the marginal ends of the distribution ( Fig. 1 ).
The procedure above compares a single individual's clinical indicator value against the population. How, then, would we compare the mean of a sample group (consisting of more than one individual) against the population mean? It is meaningless to compare each individual separately; we must compare the means of the two groups. Can we make a statistical inference using only the distribution of the clinical indicator in the population and the mean of the sample? No. To infer a statistical possibility, we must know the indicator of interest and its probability distribution; in other words, we must know the mean of the sample and the distribution of that mean. We can then determine how far the sample mean varies from the population mean by knowing the sampling distribution of the means.
The sample mean obtained in a study is only one of the means of all possible samples that could be drawn from the population. That sample mean was acquired from a real experiment, but how could we know the distribution of the means of all possible samples, including the studied one? Do we need to repeat the experiment over and over again? The simulation in which samples are drawn repeatedly from a population is shown in Fig. 2. If samples of size n are drawn from a normally distributed population $N(\mu, \sigma^2)$, the sampling distribution is a normal distribution with mean $\mu$ and variance $\sigma^2/n$. The sample size affects the shape of the sampling distribution: the curve becomes a narrower bell with a smaller variance as n increases, because the variance of the sampling distribution is $\sigma^2/n$. The formation of a sampling distribution is well explained in Lee et al. [2] in the form of a figure.
Now that the sampling distribution of the means is known, we can locate the position of the mean of a specific sample against the distribution data. However, one problem remains. As we noted earlier, the sampling distribution exhibits a normal distribution with a variance of σ 2 / n , but in reality we do not know σ 2 , the variance of the population. Therefore, we use the sample variance instead of the population variance to determine the sampling distribution of the mean. The sample variance is defined as follows:
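The standard definition of the sample variance, for observations $x_1, \ldots, x_n$ with sample mean $\bar{X}$, is:

```latex
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{X})^2
```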
In such cases in which the sample variance is used, the sampling distribution follows a t distribution that depends on the degrees of freedom of each sample, rather than a normal distribution ( Fig. 3 ).
A t test is also known as Student's t test. It is a statistical analysis technique that was developed by William Sealy Gosset in 1908 as a means of controlling the quality of dark beers. A t test used to examine whether there is a difference between two independent sample means is not fundamentally different from the t test used when there is only one sample (as mentioned earlier). If there is no difference between the two population means, the difference between the two sample means should be close to zero; the statistical test therefore verifies whether the observed difference can be said to be equal to zero.
Let's extract two independent samples from a population that displays a normal distribution and compute the difference between the means of the two samples. The difference between the sample means will not always be zero, even if the samples are extracted from the same population, because the sampling process is randomized, which results in samples with a variety of combinations of subjects. We extracted two samples of size 6 from a population $N(150, 5^2)$ and found the difference in the means. If this process is repeated 1,000 times, the sampling distribution exhibits the shape illustrated in Fig. 4. When the distribution is displayed as a histogram with a density line, it is almost identical to the theoretical sampling distribution $N(0, 2 \times 5^2/6)$ ( Fig. 4 ).
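The simulation described above can be sketched in a few lines of Python using only the standard library; the seed and loop structure here are illustrative choices, not the authors' original code:

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

MU, SIGMA, N, REPS = 150, 5, 6, 1000

# Repeatedly draw two independent samples of size N from N(150, 5^2)
# and record the difference between their means.
diffs = []
for _ in range(REPS):
    sample1 = [random.gauss(MU, SIGMA) for _ in range(N)]
    sample2 = [random.gauss(MU, SIGMA) for _ in range(N)]
    diffs.append(statistics.mean(sample1) - statistics.mean(sample2))

# Theory: the differences follow N(0, 2 * sigma^2 / n) = N(0, 50/6).
print(f"empirical mean = {statistics.mean(diffs):.3f}")
print(f"empirical variance = {statistics.variance(diffs):.3f}  (theory: {2 * SIGMA**2 / N:.3f})")
```

The empirical mean comes out near 0 and the empirical variance near 8.33, matching the theoretical sampling distribution.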
However, it is difficult to define the distribution of the difference in the two sample means because the variance of the population is unknown. If we use the sample variance instead, the distribution of the difference of the sample means follows a t distribution. Note, however, that the two samples display a normal distribution and have equal variance because they were independently extracted from an identical population that has a normal distribution.
Under the assumption that the two samples display a normal distribution and have an equal variance, the t statistic is as follows:

$ t = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} $

The population mean difference $ (\mu_1 - \mu_2) $ was assumed to be 0; thus:

$ t = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} $

The population variance was unknown, so a pooled variance of the two samples was used:

$ s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $

However, if the population variances are not equal, the t statistic of the t test would be

$ t = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $

and the degrees of freedom are calculated with the Welch-Satterthwaite equation.
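As a sketch, the pooled-variance statistic described above can be computed by hand and checked against scipy.stats.ttest_ind, which assumes equal variances by default. The sample values here are invented for illustration.

```python
# Compute the pooled-variance t statistic manually and verify it against
# scipy's independent-samples t test (equal variances assumed).
import numpy as np
from scipy import stats

a = np.array([152.0, 148.0, 155.0, 149.0, 151.0, 150.0])
b = np.array([146.0, 151.0, 147.0, 149.0, 144.0, 148.0])

n1, n2 = len(a), len(b)
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

t_scipy, p = stats.ttest_ind(a, b)   # pooled-variance t test
print(np.isclose(t_manual, t_scipy))
```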
It is apparent that if $ n_1 $ and $ n_2 $ are sufficiently large, the distribution of the t statistic resembles a normal distribution ( Fig. 3 ).
A statistical test is performed to locate the difference in the sample means within the sampling distribution of the mean ( Fig. 4 ). It is statistically very rare for the difference between two sample means drawn from the same population to lie in the margins of the distribution. Therefore, if the observed difference does lie in the margins, it is more reasonable to conclude that the two samples were extracted from different populations than to accept that such a rare event occurred.
Paired t tests can be categorized as a type of t test for a single sample because they test the difference between two paired results. If there is no difference between the two treatments, the differences in the results will be close to zero; hence, the hypothesized mean difference for a paired t test is 0.
Let's go back to the sampling distribution used in the independent t test discussed earlier. The variance of the difference between two independent sample means was represented as the sum of each variance. If the samples were not independent, the variance of the difference of two variables A and B, Var(A - B), can be shown as follows:

$ \mathrm{Var}(A - B) = \sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2 $
where $ \sigma_1^2 $ is the variance of variable A, $ \sigma_2^2 $ is the variance of variable B, and $ \rho $ is the correlation coefficient for the two variables. In an independent t test, the correlation coefficient is 0 because the two groups are independent, so it is logical to represent the variance of the difference simply as the sum of the two variances. For paired variables, however, the correlation coefficient may not equal 0. Thus, the independent-samples t statistic,

$ t = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} $

must be changed. First, the samples are paired; thus $ n_1 = n_2 = n $, and, taking the correlation coefficient into account, the variance of the difference can be represented as $ s_1^2 + s_2^2 - 2\rho s_1 s_2 $. Therefore, the t statistic for a paired t test is as follows:

$ t = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2 + s_2^2 - 2\rho s_1 s_2}{n}}} $
In this equation, the t statistic is increased if the correlation coefficient is greater than 0 because the denominator becomes smaller, which increases the statistical power of the paired t test compared to that of an independent t test. On the other hand, if the correlation coefficient is less than 0, the statistical power is decreased and becomes lower than that of an independent t test. It is important to note that if one misunderstands this characteristic and uses an independent t test when the correlation coefficient is less than 0, the generated results would be incorrect, as the process ignores the paired experimental design.
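Because a paired t test is a one-sample test on the within-pair differences, the two ways of computing it must agree. A minimal sketch, with hypothetical before/after measurements:

```python
# A paired t test equals a one-sample t test on the within-pair differences.
import numpy as np
from scipy import stats

before = np.array([7.1, 6.8, 7.4, 7.0, 6.5, 7.3])
after  = np.array([6.4, 6.5, 6.9, 6.6, 6.3, 6.8])
d = before - after

t_paired, p_paired = stats.ttest_rel(before, after)     # paired t test
t_onesample, p_onesample = stats.ttest_1samp(d, 0.0)    # one-sample test on d

print(np.isclose(t_paired, t_onesample))
```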
As previously explained, if samples are extracted from a population that displays a normal distribution but the population variance is unknown, we can use the sample variance to examine the sampling distribution of the mean, which will resemble a t distribution. Therefore, to reach a statistical conclusion about a sample mean with a t distribution, certain conditions must be satisfied: the two samples for comparison must be sampled independently from normally distributed populations with equal variance; that is, the conditions of normality, equal variance, and independence must hold.
The Shapiro-Wilk test or the Kolmogorov-Smirnov test can be performed to verify the assumption of normality. If the condition of normality is not met, a nonparametric alternative is used instead: the Wilcoxon rank sum test (Mann-Whitney U test) for independent samples, and the Wilcoxon signed rank test for paired samples.
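These checks and fallbacks are all available in scipy.stats. A short sketch, with made-up sample values:

```python
# Normality check plus the nonparametric alternatives described above.
import numpy as np
from scipy import stats

a = np.array([5.1, 4.8, 5.5, 5.0, 4.9, 5.3, 5.2, 4.7])
b = np.array([4.2, 4.5, 4.1, 4.4, 4.0, 4.6, 4.3, 4.4])

# Shapiro-Wilk: a small p-value argues against normality
w_stat, p_normal = stats.shapiro(a)

# If normality fails: Wilcoxon rank sum (Mann-Whitney U) for independent samples
u_stat, p_u = stats.mannwhitneyu(a, b)

# ...or the Wilcoxon signed rank test for paired samples
w_paired, p_w = stats.wilcoxon(a, b)
```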
The condition of equal variance is verified using Levene's test or Bartlett's test. If the condition of equal variance is not met, a nonparametric test can be performed, or the following statistic, which follows a t distribution, can be used:

$ t = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $
However, this statistic has a different number of degrees of freedom, which is calculated with the Welch-Satterthwaite [ 3 , 4 ] equation.
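The Welch correction can be sketched directly: compute the unpooled statistic and the Welch-Satterthwaite degrees of freedom, then compare against scipy's built-in version (the data are made up, with deliberately unequal spread):

```python
# Welch's t test: unpooled t statistic plus Welch-Satterthwaite degrees of freedom.
import numpy as np
from scipy import stats

a = np.array([12.0, 14.5, 13.2, 15.1, 12.8, 14.0])
b = np.array([10.1, 18.9, 9.5, 19.8, 11.2, 17.0, 12.5, 16.3])

v1, v2 = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
t_manual = (a.mean() - b.mean()) / np.sqrt(v1 + v2)
df = (v1 + v2) ** 2 / (v1 ** 2 / (len(a) - 1) + v2 ** 2 / (len(b) - 1))

t_scipy, p = stats.ttest_ind(a, b, equal_var=False)   # Welch's t test
print(np.isclose(t_manual, t_scipy))
```

Note that the Welch degrees of freedom fall between the smaller sample size minus one and the pooled degrees of freedom.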
Owing to user-friendly statistical software, the rich pool of statistical information on the Internet, and expert advice from statisticians at many hospitals, using and processing statistical data is no longer an intractable task. However, it remains the researchers' responsibility to design experiments that fulfill all of the conditions of their chosen statistical methods and to ensure that their statistical assumptions are appropriate. In particular, parametric statistical methods yield reasonable statistical conclusions only when their assumptions are fully met. Some researchers regard these assumptions as inconvenient and neglect them. Some statisticians even argue, on the basis of the central limit theorem, that sampling distributions display a normal distribution regardless of whether the population distribution is normal, and that t tests have sufficient statistical power even when the condition of normality is not satisfied [ 5 ]. Moreover, they contend that the condition of equal variance is not so strict, because even a ninefold difference in variance merely changes the α level from 0.05 to 0.06 [ 6 ]. However, the arguments over the condition of normality, and over how far the condition of equal variance may be violated, remain bones of contention. Researchers who unquestioningly accept these arguments and neglect the basic assumptions of a t test when submitting papers will face critical comments from editors, and it will be difficult to persuade the editors to overlook the basic assumptions regardless of how solid the evidence in the paper is. Hence, researchers should test the basic statistical assumptions thoroughly and employ widely accepted methods so as to draw valid statistical conclusions.
The results of independent and paired t tests of the examples are illustrated in Tables 1 and 2. The tests were conducted using the SPSS Statistics Package (IBM® SPSS® Statistics 21, SPSS Inc., Chicago, IL, USA).
First, we examine normality by checking the Kolmogorov-Smirnov or Shapiro-Wilk results in the second table. We can conclude that the samples satisfy the condition of normality because the P value is greater than 0.05. Next, we check Levene's test to examine the equality of variance; the P value is again greater than 0.05, so the condition of equal variance is also met. Finally, we read the significance probability on the "equal variance assumed" line. If the condition of equal variance were not met (i.e., if the P value for Levene's test were less than 0.05), we would instead refer to the significance probability on the "equal variance not assumed" line, or perform a nonparametric test.
A paired t test is essentially a single-sample t test on the differences. Therefore, we test the normality of the difference in the amount of change between treatment A and treatment B (ΔA - ΔB). Normality is verified from the Kolmogorov-Smirnov and Shapiro-Wilk results shown in the second table. In conclusion, there is a significant difference between the two treatments (P < 0.001).
If you’re not a statistician, looking through statistical output can sometimes make you feel a bit like Alice in Wonderland. Suddenly, you step into a fantastical world where strange and mysterious phantasms appear out of nowhere.
For example, consider the T and P in your t-test results.
“Curiouser and curiouser!” you might exclaim, like Alice, as you gaze at your output.
What are these values, really? Where do they come from? Even if you’ve used the p-value to interpret the statistical significance of your results umpteen times , its actual origin may remain murky to you.
T and P are inextricably linked. They go arm in arm, like Tweedledee and Tweedledum. Here's why.
When you perform a t-test, you're usually trying to find evidence of a significant difference between population means (2-sample t) or between the population mean and a hypothesized value (1-sample t). The t-value measures the size of the difference relative to the variation in your sample data . Put another way, T is simply the calculated difference represented in units of standard error. The greater the magnitude of T, the greater the evidence against the null hypothesis. This means there is greater evidence that there is a significant difference. The closer T is to 0, the more likely there isn't a significant difference.
Remember, the t-value in your output is calculated from only one sample from the entire population. If you took repeated random samples of data from the same population, you'd get slightly different t-values each time, due to random sampling error (which is really not a mistake of any kind; it's just the random variation expected in the data).
How different could you expect the t-values from many random samples from the same population to be? And how does the t-value from your sample data compare to those expected t-values?
You can use a t-distribution to find out.
For the sake of illustration, assume that you're using a 1-sample t-test to determine whether the population mean is greater than a hypothesized value, such as 5, based on a sample of 20 observations, as shown in the above t-test output.
The highest part (peak) of the distribution curve shows you where you can expect most of the t-values to fall. Most of the time, you’d expect to get t-values close to 0. That makes sense, right? Because if you randomly select representative samples from a population, the mean of most of those random samples from the population should be close to the overall population mean, making their differences (and thus the calculated t-values) close to 0.
For our sample, the probability of obtaining a t-value of 2.8 or higher, when sampling from the same population (here, a population with a hypothesized mean of 5), is approximately 0.006.
How likely is that? Not very! For comparison, the probability of being dealt 3-of-a-kind in a 5-card poker hand is over three times as high (≈ 0.021).
Given that the probability of obtaining a t-value this high or higher when sampling from this population is so low, what’s more likely? It’s more likely that this sample doesn’t come from this population (with the hypothesized mean of 5). It's much more likely that this sample comes from a different population, one with a mean greater than 5.
To wit: Because the p-value is very low (< alpha level), you reject the null hypothesis and conclude that there's a statistically significant difference.
In this way, T and P are inextricably linked. Consider them simply different ways to quantify the "extremeness" of your results under the null hypothesis. You can’t change the value of one without changing the other.
The larger the absolute value of the t-value, the smaller the p-value, and the greater the evidence against the null hypothesis. (You can verify this by entering lower and higher t-values for the t-distribution in step 6 above.)
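You can see this relationship directly with scipy's t distribution. A small sketch, using df = 19 to match the 20-observation example above:

```python
# How the one-tailed and two-tailed p-values shrink as t grows (df = 19).
from scipy import stats

df = 19
for t in (1.0, 2.0, 2.8, 3.5):
    p_one_tailed = stats.t.sf(t, df)        # P(T >= t), right tail
    p_two_tailed = 2 * stats.t.sf(t, df)    # both tails
    print(t, p_one_tailed, p_two_tailed)

# At t = 2.8 the one-tailed probability is about 0.006, matching the example.
```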
The t-distribution example shown above is based on a one-tailed t-test to determine whether the mean of the population is greater than a hypothesized value. Therefore the t-distribution example shows the probability associated with the t-value of 2.8 only in one direction (the right tail of the distribution).
How would you use the t-distribution to find the p-value associated with a t-value of 2.8 for two-tailed t-test (in both directions)?
Hint: In Minitab, adjust the options in step 5 to find the probability for both tails.
Institute for Digital Research and Education
Introduction.
This page shows how to perform a number of statistical tests using SPSS. Each section gives a brief description of the aim of the statistical test, when it is used, an example showing the SPSS commands and SPSS (often abbreviated) output with a brief interpretation of the output. You can see the page Choosing the Correct Statistical Test for a table that shows an overview of when each test is appropriate to use. In deciding which test is appropriate to use, it is important to consider the type of variables that you have (i.e., whether your variables are categorical, ordinal or interval and whether they are normally distributed), see What is the difference between categorical, ordinal and interval variables? for more information on this.
Most of the examples in this page will use a data file called hsb2, high school and beyond. This data file contains 200 observations from a sample of high school students with demographic information about the students, such as their gender ( female ), socio-economic status ( ses ) and ethnic background ( race ). It also contains a number of scores on standardized tests, including tests of reading ( read ), writing ( write ), mathematics ( math ) and social studies ( socst ). You can get the hsb data file by clicking on hsb2 .
A one sample t-test allows us to test whether a sample mean (of a normally distributed interval variable) significantly differs from a hypothesized value. For example, using the hsb2 data file , say we wish to test whether the average writing score ( write ) differs significantly from 50. We can do this as shown below. t-test /testval = 50 /variable = write. The mean of the variable write for this particular sample of students is 52.775, which is statistically significantly different from the test value of 50. We would conclude that this group of students has a significantly higher mean on the writing test than 50.
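The same one-sample test can be sketched in Python with scipy.stats; the scores below are made-up stand-ins for the hsb2 write variable, not the real data.

```python
# One-sample t test: does the mean writing score differ from 50?
import numpy as np
from scipy import stats

write = np.array([54, 51, 49, 56, 52, 55, 50, 53, 57, 48])  # hypothetical scores
t, p = stats.ttest_1samp(write, 50)   # test value = 50
print(t, p)
```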
A one sample median test allows us to test whether a sample median differs significantly from a hypothesized value. We will use the same variable, write , as we did in the one sample t-test example above, but we do not need to assume that it is interval and normally distributed (we only need to assume that write is an ordinal variable). nptests /onesample test (write) wilcoxon(testvalue = 50).
A one sample binomial test allows us to test whether the proportion of successes on a two-level categorical dependent variable significantly differs from a hypothesized value. For example, using the hsb2 data file , say we wish to test whether the proportion of females ( female ) differs significantly from 50%, i.e., from .5. We can do this as shown below. npar tests /binomial (.5) = female. The results indicate that there is no statistically significant difference (p = .229). In other words, the proportion of females in this sample does not significantly differ from the hypothesized value of 50%.
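A rough Python analogue of the binomial test: is an observed count of females consistent with a true proportion of .5? The count of 91 out of 200 below is assumed for illustration; the hsb2 file itself is not loaded here.

```python
# Exact binomial test of an observed proportion against .5.
from scipy import stats

result = stats.binomtest(91, n=200, p=0.5)   # 91 "successes" in 200 trials (assumed)
print(result.pvalue)
```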
A chi-square goodness of fit test allows us to test whether the observed proportions for a categorical variable differ from hypothesized proportions. For example, let’s suppose that we believe that the general population consists of 10% Hispanic, 10% Asian, 10% African American and 70% White folks. We want to test whether the observed proportions from our sample differ significantly from these hypothesized proportions. npar test /chisquare = race /expected = 10 10 10 70. These results show that racial composition in our sample does not differ significantly from the hypothesized values that we supplied (chi-square with three degrees of freedom = 5.029, p = .170).
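In Python, the same goodness-of-fit test runs through scipy.stats.chisquare. The observed counts below are chosen so that the statistic reproduces the chi-square of 5.029 reported above; treat them as illustrative rather than as the verified hsb2 counts.

```python
# Chi-square goodness of fit against hypothesized proportions 10/10/10/70.
import numpy as np
from scipy import stats

observed = np.array([24, 11, 20, 145])                # assumed counts, n = 200
expected = np.array([0.10, 0.10, 0.10, 0.70]) * observed.sum()

chi2, p = stats.chisquare(observed, f_exp=expected)
print(chi2, p)
```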
An independent samples t-test is used when you want to compare the means of a normally distributed interval dependent variable for two independent groups. For example, using the hsb2 data file , say we wish to test whether the mean for write is the same for males and females. t-test groups = female(0 1) /variables = write. Because the standard deviations for the two groups are similar (10.3 and 8.1), we will use the “equal variances assumed” test. The results indicate that there is a statistically significant difference between the mean writing score for males and females (t = -3.734, p = .000). In other words, females have a statistically significantly higher mean score on writing (54.99) than males (50.12). See also SPSS Learning Module: An overview of statistical tests in SPSS
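The equal-variances test above maps onto scipy.stats.ttest_ind. The two groups below are invented stand-ins for the male and female writing scores, not the hsb2 data.

```python
# Independent samples t test, equal variances assumed (scipy's default).
import numpy as np
from scipy import stats

write_male   = np.array([48, 52, 45, 55, 50, 47, 53, 49])  # hypothetical
write_female = np.array([56, 53, 58, 54, 57, 52, 55, 59])  # hypothetical

t, p = stats.ttest_ind(write_male, write_female)
print(t, p)
```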
The Wilcoxon-Mann-Whitney test is a non-parametric analog to the independent samples t-test and can be used when you do not assume that the dependent variable is a normally distributed interval variable (you only assume that the variable is at least ordinal). You will notice that the SPSS syntax for the Wilcoxon-Mann-Whitney test is almost identical to that of the independent samples t-test. We will use the same data file (the hsb2 data file ) and the same variables in this example as we did in the independent t-test example above and will not assume that write , our dependent variable, is normally distributed.
npar test /m-w = write by female(0 1). The results suggest that there is a statistically significant difference between the underlying distributions of the write scores of males and the write scores of females (z = -3.329, p = 0.001). See also FAQ: Why is the Mann-Whitney significant when the medians are equal?
A chi-square test is used when you want to see if there is a relationship between two categorical variables. In SPSS, the chisq option is used on the statistics subcommand of the crosstabs command to obtain the test statistic and its associated p-value. Using the hsb2 data file , let’s see if there is a relationship between the type of school attended ( schtyp ) and students’ gender ( female ). Remember that the chi-square test assumes that the expected value for each cell is five or higher. This assumption is easily met in the examples below. However, if this assumption is not met in your data, please see the section on Fisher’s exact test below. crosstabs /tables = schtyp by female /statistic = chisq. These results indicate that there is no statistically significant relationship between the type of school attended and gender (chi-square with one degree of freedom = 0.047, p = 0.828). Let’s look at another example, this time looking at the linear relationship between gender ( female ) and socio-economic status ( ses ). The point of this example is that one (or both) variables may have more than two levels, and that the variables do not have to have the same number of levels. In this example, female has two levels (male and female) and ses has three levels (low, medium and high). crosstabs /tables = female by ses /statistic = chisq. Again we find that there is no statistically significant relationship between the variables (chi-square with two degrees of freedom = 4.577, p = 0.101). See also SPSS Learning Module: An Overview of Statistical Tests in SPSS
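In Python, the crosstab chi-square corresponds to scipy.stats.chi2_contingency on the 2x2 table of counts. The cell counts below are chosen to reproduce the Pearson chi-square of 0.047 reported above; treat the labels as assumptions.

```python
# Pearson chi-square test of independence on a 2x2 table
# (school type by gender; counts assumed for illustration).
import numpy as np
from scipy import stats

table = np.array([[77, 91],    # e.g. public:  male, female
                  [14, 18]])   # e.g. private: male, female

# correction=False gives the uncorrected Pearson statistic, matching SPSS's
# "Pearson Chi-Square" line rather than the Yates-corrected value.
chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
print(chi2, dof, p)
```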
The Fisher’s exact test is used when you want to conduct a chi-square test but one or more of your cells has an expected frequency of five or less. Remember that the chi-square test assumes that each cell has an expected frequency of five or more, but the Fisher’s exact test has no such assumption and can be used regardless of how small the expected frequency is. In SPSS unless you have the SPSS Exact Test Module, you can only perform a Fisher’s exact test on a 2×2 table, and these results are presented by default. Please see the results from the chi squared example above.
A one-way analysis of variance (ANOVA) is used when you have a categorical independent variable (with two or more categories) and a normally distributed interval dependent variable and you wish to test for differences in the means of the dependent variable broken down by the levels of the independent variable. For example, using the hsb2 data file , say we wish to test whether the mean of write differs between the three program types ( prog ). The command for this test would be: oneway write by prog. The mean of the dependent variable differs significantly among the levels of program type. However, we do not know if the difference is between only two of the levels or all three of the levels. (The F test for the Model is the same as the F test for prog because prog was the only variable entered into the model. If other variables had also been entered, the F test for the Model would have been different from prog .) To see the mean of write for each level of program type, means tables = write by prog. From this we can see that the students in the academic program have the highest mean writing score, while students in the vocational program have the lowest. See also SPSS Textbook Examples: Design and Analysis, Chapter 7 SPSS Textbook Examples: Applied Regression Analysis, Chapter 8 SPSS FAQ: How can I do ANOVA contrasts in SPSS? SPSS Library: Understanding and Interpreting Parameter Estimates in Regression and ANOVA
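The one-way ANOVA itself is a one-liner in scipy. The three groups below are made-up writing scores standing in for the three program types.

```python
# One-way ANOVA: does mean writing score differ across three program types?
import numpy as np
from scipy import stats

general    = np.array([51, 48, 54, 50, 52, 49])   # hypothetical scores
academic   = np.array([58, 55, 60, 57, 54, 59])
vocational = np.array([45, 47, 43, 48, 44, 46])

f, p = stats.f_oneway(general, academic, vocational)
print(f, p)
```

As in the SPSS output, a significant F tells you the group means differ but not which pairs differ; that requires a follow-up comparison.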
The Kruskal-Wallis test is used when you have one independent variable with two or more levels and an ordinal dependent variable. In other words, it is the non-parametric version of ANOVA and a generalized form of the Mann-Whitney test, since it permits two or more groups. We will use the same data file as the one-way ANOVA example above (the hsb2 data file ) and the same variables as in the example above, but we will not assume that write is a normally distributed interval variable. npar tests /k-w = write by prog (1,3). If some of the scores receive tied ranks, then a correction factor is used, yielding a slightly different value of chi-squared. With or without ties, the results indicate that there is a statistically significant difference among the three types of programs.
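The scipy version takes the same groups as the ANOVA sketch (again, made-up scores rather than the real hsb2 data) and applies the tie correction automatically.

```python
# Kruskal-Wallis H test: nonparametric analogue of one-way ANOVA.
import numpy as np
from scipy import stats

g1 = np.array([51, 48, 54, 50, 52, 49])   # hypothetical group scores
g2 = np.array([58, 55, 60, 57, 54, 59])
g3 = np.array([45, 47, 43, 48, 44, 46])

h, p = stats.kruskal(g1, g2, g3)   # ties are corrected for automatically
print(h, p)
```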
A paired (samples) t-test is used when you have two related observations (i.e., two observations per subject) and you want to see if the means on these two normally distributed interval variables differ from one another. For example, using the hsb2 data file we will test whether the mean of read is equal to the mean of write . t-test pairs = read with write (paired). These results indicate that the mean of read is not statistically significantly different from the mean of write (t = -0.867, p = 0.387).
The Wilcoxon signed rank sum test is the non-parametric version of a paired samples t-test. You use the Wilcoxon signed rank sum test when you do not wish to assume that the difference between the two variables is interval and normally distributed (but you do assume the difference is ordinal). We will use the same example as above, but we will not assume that the difference between read and write is interval and normally distributed. npar test /wilcoxon = write with read (paired). The results suggest that there is not a statistically significant difference between read and write . If you believe the differences between read and write were not ordinal but could merely be classified as positive and negative, then you may want to consider a sign test in lieu of sign rank test. Again, we will use the same variables in this example and assume that this difference is not ordinal. npar test /sign = read with write (paired). We conclude that no statistically significant difference was found (p=.556).
You would perform McNemar’s test if you were interested in the marginal frequencies of two binary outcomes. These binary outcomes may be the same outcome variable on matched pairs (like a case-control study) or two outcome variables from a single group. Continuing with the hsb2 dataset used in several above examples, let us create two binary outcomes in our dataset: himath and hiread . These outcomes can be considered in a two-way contingency table. The null hypothesis is that the proportion of students in the himath group is the same as the proportion of students in hiread group (i.e., that the contingency table is symmetric). compute himath = (math>60). compute hiread = (read>60). execute. crosstabs /tables=himath BY hiread /statistic=mcnemar /cells=count. McNemar’s chi-square statistic suggests that there is not a statistically significant difference in the proportion of students in the himath group and the proportion of students in the hiread group.
You would perform a one-way repeated measures analysis of variance if you had one categorical independent variable and a normally distributed interval dependent variable that was repeated at least twice for each subject. This is the equivalent of the paired samples t-test, but allows for two or more levels of the categorical variable. This tests whether the mean of the dependent variable differs by the categorical variable. We have an example data set called rb4wide , which is used in Kirk’s book Experimental Design. In this data set, y is the dependent variable, a is the repeated measure and s is the variable that indicates the subject number. glm y1 y2 y3 y4 /wsfactor a(4). You will notice that this output gives four different p-values. The output labeled “sphericity assumed” is the p-value (0.000) that you would get if you assumed compound symmetry in the variance-covariance matrix. Because that assumption is often not valid, the three other p-values offer various corrections (the Huynh-Feldt, H-F, Greenhouse-Geisser, G-G and Lower-bound). No matter which p-value you use, our results indicate that we have a statistically significant effect of a at the .05 level. See also SPSS Textbook Examples from Design and Analysis: Chapter 16 SPSS Library: Advanced Issues in Using and Understanding SPSS MANOVA SPSS Code Fragment: Repeated Measures ANOVA
If you have a binary outcome measured repeatedly for each subject and you wish to run a logistic regression that accounts for the effect of multiple measures from single subjects, you can perform a repeated measures logistic regression. In SPSS, this can be done using the GENLIN command and indicating binomial as the probability distribution and logit as the link function to be used in the model. The exercise data file (available at https://stats.idre.ucla.edu/wp-content/uploads/2016/02/exercise.sav) contains 3 pulse measurements from each of 30 people assigned to 2 different diet regimens and 3 different exercise regimens. If we define a “high” pulse as being over 100, we can then predict the probability of a high pulse using diet regimen. GET FILE='C:\mydata\exercise.sav'. GENLIN highpulse (REFERENCE=LAST) BY diet (order = DESCENDING) /MODEL diet DISTRIBUTION=BINOMIAL LINK=LOGIT /REPEATED SUBJECT=id CORRTYPE = EXCHANGEABLE. These results indicate that diet is not statistically significant (Wald Chi-Square = 1.562, p = 0.211).
A factorial ANOVA has two or more categorical independent variables (either with or without the interactions) and a single normally distributed interval dependent variable. For example, using the hsb2 data file we will look at writing scores ( write ) as the dependent variable and gender ( female ) and socio-economic status ( ses ) as independent variables, and we will include an interaction of female by ses . Note that in SPSS, you do not need to have the interaction term(s) in your data set. Rather, you can have SPSS create it/them temporarily by placing an asterisk between the variables that will make up the interaction term(s). glm write by female ses. These results indicate that the overall model is statistically significant (F = 5.666, p = 0.000). The variables female and ses are also statistically significant (F = 16.595, p = 0.000 and F = 6.611, p = 0.002, respectively). However, the interaction between female and ses is not statistically significant (F = 0.133, p = 0.875). See also SPSS Textbook Examples from Design and Analysis: Chapter 10 SPSS FAQ: How can I do tests of simple main effects in SPSS? SPSS FAQ: How do I plot ANOVA cell means in SPSS? SPSS Library: An Overview of SPSS GLM
You perform a Friedman test when you have one within-subjects independent variable with two or more levels and a dependent variable that is not interval and normally distributed (but at least ordinal). We will use this test to determine if there is a difference in the reading, writing and math scores. The null hypothesis in this test is that the distribution of the ranks of each type of score (i.e., reading, writing and math) are the same. To conduct a Friedman test, the data need to be in a long format. SPSS handles this for you, but in other statistical packages you will have to reshape the data before you can conduct this test. npar tests /friedman = read write math. Friedman’s chi-square has a value of 0.645 and a p-value of 0.724 and is not statistically significant. Hence, there is no evidence that the distributions of the three types of scores are different.
Ordered logistic regression is used when the dependent variable is ordered, but not continuous. For example, using the hsb2 data file we will create an ordered variable called write3 . This variable will have the values 1, 2 and 3, indicating a low, medium or high writing score. We do not generally recommend categorizing a continuous variable in this way; we are simply creating a variable to use for this example. We will use gender ( female ), reading score ( read ) and social studies score ( socst ) as predictor variables in this model. We will use a logit link and on the print subcommand we have requested the parameter estimates, the (model) summary statistics and the test of the parallel lines assumption. if write ge 30 and write le 48 write3 = 1. if write ge 49 and write le 57 write3 = 2. if write ge 58 and write le 70 write3 = 3. execute. plum write3 with female read socst /link = logit /print = parameter summary tparallel. The results indicate that the overall model is statistically significant (p < .000), as are each of the predictor variables (p < .000). There are two thresholds for this model because there are three levels of the outcome variable. We also see that the test of the proportional odds assumption is non-significant (p = .563). One of the assumptions underlying ordinal logistic (and ordinal probit) regression is that the relationship between each pair of outcome groups is the same. In other words, ordinal logistic regression assumes that the coefficients that describe the relationship between, say, the lowest versus all higher categories of the response variable are the same as those that describe the relationship between the next lowest category and all higher categories, etc. This is called the proportional odds assumption or the parallel regression assumption. Because the relationship between all pairs of groups is the same, there is only one set of coefficients (only one model). 
If this was not the case, we would need different models (such as a generalized ordered logit model) to describe the relationship between each pair of outcome groups. See also SPSS Data Analysis Examples: Ordered logistic regression SPSS Annotated Output: Ordinal Logistic Regression
A factorial logistic regression is used when you have two or more categorical independent variables but a dichotomous dependent variable. For example, using the hsb2 data file we will use female as our dependent variable, because it is the only dichotomous variable in our data set; certainly not because it is common practice to use gender as an outcome variable. We will use type of program ( prog ) and school type ( schtyp ) as our predictor variables. Because prog is a categorical variable (it has three levels), we need to create dummy codes for it. SPSS will do this for you by making dummy codes for all variables listed after the keyword with . SPSS will also create the interaction term; simply list the two variables that will make up the interaction separated by the keyword by . logistic regression female with prog schtyp prog by schtyp /contrast(prog) = indicator(1). The results indicate that the overall model is not statistically significant (LR chi2 = 3.147, p = 0.677). Furthermore, none of the coefficients are statistically significant either. This shows that the overall effect of prog is not significant. See also Annotated output for logistic regression
A correlation is useful when you want to see the relationship between two (or more) normally distributed interval variables. For example, using the hsb2 data file we can run a correlation between two continuous variables, read and write.

correlations /variables = read write.

In the second example, we will run a correlation between a dichotomous variable, female, and a continuous variable, write. Although correlation assumes that the variables are interval and normally distributed, we can include dummy variables when performing correlations.

correlations /variables = female write.

In the first example above, we see that the correlation between read and write is 0.597. By squaring the correlation and then multiplying by 100, you can determine what percentage of the variability is shared. Rounding 0.597 to 0.6 and squaring gives .36, which multiplied by 100 is 36%; hence read shares about 36% of its variability with write. In the output for the second example, we can see that the correlation between write and female is 0.256. Squaring this number yields .065536, meaning that female shares approximately 6.5% of its variability with write.

See also: Annotated output for correlation; SPSS Learning Module: An Overview of Statistical Tests in SPSS; SPSS FAQ: How can I analyze my data by categories?; Missing Data in SPSS
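The shared-variance arithmetic described above (square the correlation, then multiply by 100) is easy to verify directly. This quick sketch uses the two correlations reported in the text:

```python
# Percentage of shared variability = r squared x 100.
r_read_write = 0.597       # correlation between read and write (from the output)
r_female_write = 0.256     # correlation between female and write (from the output)

shared_read_write = r_read_write ** 2 * 100      # about 35.6%; ~36% if r is rounded to 0.6
shared_female_write = r_female_write ** 2 * 100  # about 6.55%, i.e. roughly 6.5%
```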
Simple linear regression allows us to look at the linear relationship between one normally distributed interval predictor and one normally distributed interval outcome variable. For example, using the hsb2 data file, say we wish to look at the relationship between writing scores (write) and reading scores (read); in other words, predicting write from read.

regression variables = write read
  /dependent = write
  /method = enter.

We see that the relationship between write and read is positive (.552) and, based on the t-value (10.47) and p-value (0.000), we would conclude this relationship is statistically significant. Hence, we would say there is a statistically significant positive linear relationship between reading and writing.

See also: Regression With SPSS: Chapter 1 – Simple and Multiple Regression; Annotated output for regression; SPSS Textbook Examples: Introduction to the Practice of Statistics, Chapter 10; SPSS Textbook Examples: Regression with Graphics, Chapter 2; SPSS Textbook Examples: Applied Regression Analysis, Chapter 5
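To make the mechanics concrete, here is a minimal pure-Python sketch of what the regression command estimates: the slope is the covariance of predictor and outcome divided by the variance of the predictor. The data below are invented for illustration; they are not the hsb2 values.

```python
# Simple linear regression by hand: slope = cov(x, y) / var(x).
read  = [40, 45, 50, 55, 60, 65, 70]   # illustrative predictor values
write = [42, 46, 49, 56, 59, 66, 68]   # illustrative outcome values

n = len(read)
mean_x = sum(read) / n
mean_y = sum(write) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(read, write)) / (n - 1)
var = sum((x - mean_x) ** 2 for x in read) / (n - 1)

slope = cov / var                      # positive: higher read goes with higher write
intercept = mean_y - slope * mean_x    # completes the fitted line write = a + b*read
```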
A Spearman correlation is used when one or both of the variables are not assumed to be normally distributed and interval (but are assumed to be ordinal). The values of the variables are converted into ranks and then correlated. In our example, we will look for a relationship between read and write; we will not assume that these variables are normal and interval.

nonpar corr /variables = read write
  /print = spearman.

The results suggest that the relationship between read and write (rho = 0.617, p = 0.000) is statistically significant.
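The rank-then-correlate idea can be sketched by hand. With no ties, Spearman's rho reduces to the shortcut formula rho = 1 − 6·Σd² / (n(n² − 1)), where d is the difference between paired ranks. The data below are invented for illustration:

```python
# Spearman's rho by hand: rank each variable, then apply the no-ties shortcut.
read  = [34, 47, 52, 60, 71]   # illustrative values, no ties
write = [38, 55, 45, 58, 70]

def to_ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

d2 = sum((a - b) ** 2 for a, b in zip(to_ranks(read), to_ranks(write)))
n = len(read)
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))   # 0.9 for these data
```

With ties present, the shortcut no longer applies and rho is computed as the Pearson correlation of the (mid-)ranks, which is what SPSS does internally.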
Logistic regression assumes that the outcome variable is binary (i.e., coded as 0 and 1). We have only one variable in the hsb2 data file that is coded 0 and 1, and that is female. We understand that female is a silly outcome variable (it would make more sense to use it as a predictor variable), but we can use female as the outcome variable to illustrate how the code for this command is structured and how to interpret the output. The first variable listed after the logistic command is the outcome (or dependent) variable, and all of the rest of the variables are predictor (or independent) variables. In our example, female will be the outcome variable, and read will be the predictor variable. As with OLS regression, the predictor variables must be either dichotomous or continuous; they cannot be categorical.

logistic regression female with read.

The results indicate that reading score (read) is not a statistically significant predictor of gender (i.e., being female), Wald = .562, p = 0.453. Likewise, the test of the overall model is not statistically significant, LR chi-squared = 0.56, p = 0.453.

See also: Annotated output for logistic regression; SPSS Library: What kind of contrasts are these?
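For intuition about what the logistic command is fitting, here is a self-contained sketch that estimates P(female = 1) = 1 / (1 + exp(−(b0 + b1·read))) by gradient ascent on the log-likelihood. The data, learning rate, and iteration count are all invented for illustration, and SPSS itself uses a different (Newton-type) optimizer; this only demonstrates the model being maximized.

```python
import math

# Toy logistic regression: one 0/1 outcome, one continuous predictor.
read   = [30, 35, 40, 45, 50, 55, 60, 65]   # illustrative predictor
female = [0,  0,  0,  1,  0,  1,  1,  1]    # illustrative binary outcome

mean_read = sum(read) / len(read)
xc = [x - mean_read for x in read]           # center the predictor for stability

b0, b1 = 0.0, 0.0
for _ in range(20000):                        # plain gradient ascent on the log-likelihood
    g0 = g1 = 0.0
    for x, y in zip(xc, female):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))  # predicted P(female = 1)
        g0 += y - p                              # gradient w.r.t. intercept
        g1 += (y - p) * x                        # gradient w.r.t. slope
    b0 += 0.001 * g0
    b1 += 0.001 * g1

# In this toy data higher read goes with female = 1, so b1 ends up positive.
```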
Multiple regression is very similar to simple regression, except that in multiple regression you have more than one predictor variable in the equation. For example, using the hsb2 data file we will predict writing score from gender (female) and reading, math, science and social studies (socst) scores.

regression variables = write female read math science socst
  /dependent = write
  /method = enter.

The results indicate that the overall model is statistically significant (F = 58.60, p = 0.000). Furthermore, all of the predictor variables are statistically significant except for read.

See also: Regression with SPSS: Chapter 1 – Simple and Multiple Regression; Annotated output for regression; SPSS Frequently Asked Questions; SPSS Textbook Examples: Regression with Graphics, Chapter 3; SPSS Textbook Examples: Applied Regression Analysis
Analysis of covariance is like ANOVA, except that in addition to the categorical predictors you also have continuous predictors. For example, the one-way ANOVA example used write as the dependent variable and prog as the independent variable. Let's add read as a continuous predictor to this model, as shown below.

glm write with read by prog.

The results indicate that, even after adjusting for reading score (read), writing scores still differ significantly by program type (prog), F = 5.867, p = 0.003.

See also: SPSS Textbook Examples from Design and Analysis: Chapter 14; SPSS Library: An Overview of SPSS GLM; SPSS Library: How do I handle interactions of continuous and categorical variables?
Multiple logistic regression is like simple logistic regression, except that there are two or more predictors. The predictors can be interval variables or dummy variables, but cannot be categorical variables. If you have categorical predictors, they should be coded into one or more dummy variables. We have only one variable in our data set that is coded 0 and 1, and that is female. We understand that female is a silly outcome variable (it would make more sense to use it as a predictor variable), but we can use female as the outcome variable to illustrate how the code for this command is structured and how to interpret the output. The first variable listed after the logistic regression command is the outcome (or dependent) variable, and all of the rest of the variables are predictor (or independent) variables (listed after the keyword with). In our example, female will be the outcome variable, and read and write will be the predictor variables.

logistic regression female with read write.

These results show that both read and write are significant predictors of female.

See also: Annotated output for logistic regression; SPSS Textbook Examples: Applied Logistic Regression, Chapter 2; SPSS Code Fragments: Graphing Results in Logistic Regression
Discriminant analysis is used when you have one or more normally distributed interval independent variables and a categorical dependent variable. It is a multivariate technique that considers the latent dimensions in the independent variables when predicting group membership in the categorical dependent variable. For example, using the hsb2 data file, say we wish to use read, write and math scores to predict the type of program a student belongs to (prog).

discriminant groups = prog(1, 3)
  /variables = read write math.

Clearly, the SPSS output for this procedure is quite lengthy, and it is beyond the scope of this page to explain all of it. The main point is that two canonical variables are identified by the analysis, the first of which seems to be more related to program type than the second.

See also: discriminant function analysis; SPSS Library: A History of SPSS Statistical Features
MANOVA (multivariate analysis of variance) is like ANOVA, except that there are two or more dependent variables. In a one-way MANOVA, there is one categorical independent variable and two or more dependent variables. For example, using the hsb2 data file, say we wish to examine the differences in read, write and math broken down by program type (prog).

glm read write math by prog.

The results show that the students in the different programs differ in their joint distribution of read, write and math.

See also: SPSS Library: Advanced Issues in Using and Understanding SPSS MANOVA; GLM: MANOVA and MANCOVA; SPSS Library: MANOVA and GLM
Multivariate multiple regression is used when you have two or more dependent variables that are to be predicted from two or more independent variables. In our example using the hsb2 data file, we will predict write and read from female, math, science and social studies (socst) scores.

glm write read with female math science socst.

These results show that all of the variables in the model have a statistically significant relationship with the joint distribution of write and read.
Canonical correlation is a multivariate technique used to examine the relationship between two groups of variables. For each set of variables, it creates latent variables and looks at the relationships among the latent variables. It assumes that all variables in the model are interval and normally distributed. SPSS requires that the two groups of variables be separated by the keyword with; there need not be an equal number of variables in the two groups (before and after the with).

manova read write with math science
  /discrim.

* * * Analysis of Variance -- design 1 * * *

EFFECT .. WITHIN CELLS Regression
Multivariate Tests of Significance (S = 2, M = -1/2, N = 97)

 Test Name       Value   Approx. F  Hypoth. DF  Error DF  Sig. of F
 Pillais        .59783    41.99694        4.00    394.00       .000
 Hotellings    1.48369    72.32964        4.00    390.00       .000
 Wilks          .40249    56.47060        4.00    392.00       .000
 Roys           .59728
 Note.. F statistic for WILKS' Lambda is exact.

EFFECT .. WITHIN CELLS Regression (Cont.)
Univariate F-tests with (2,197) D. F.

 Variable  Sq. Mul. R  Adj. R-sq.  Hypoth. MS   Error MS           F  Sig. of F
 READ          .51356      .50862  5371.66966   51.65523   103.99081       .000
 WRITE         .43565      .42992  3894.42594   51.21839    76.03569       .000

Raw canonical coefficients for DEPENDENT variables (Function No. 1):
 READ .063, WRITE .049

Standardized canonical coefficients for DEPENDENT variables (Function No. 1):
 READ .649, WRITE .467

Correlations between DEPENDENT and canonical variables (Function No. 1):
 READ .927, WRITE .854

Variance in dependent variables explained by canonical variables:
 CAN. VAR.  Pct Var DE  Cum Pct DE  Pct Var CO  Cum Pct CO
 1              79.441      79.441      47.449      47.449

Raw canonical coefficients for COVARIATES (Function No. 1):
 MATH .067, SCIENCE .048

Standardized canonical coefficients for COVARIATES (CAN. VAR. 1):
 MATH .628, SCIENCE .478

Correlations between COVARIATES and canonical variables (CAN. VAR. 1):
 MATH .929, SCIENCE .873

Variance in covariates explained by canonical variables:
 CAN. VAR.  Pct Var DE  Cum Pct DE  Pct Var CO  Cum Pct CO
 1              48.544      48.544      81.275      81.275

Regression analysis for WITHIN CELLS error term
(Individual univariate .9500 confidence intervals)

 Dependent variable .. READ (reading score)
 COVARIATE        B     Beta  Std. Err.  t-Value  Sig. of t  Lower -95% CL-  Upper
 MATH        .48129   .43977       .070    6.868       .000            .343   .619
 SCIENCE     .36532   .35278       .066    5.509       .000            .235   .496

 Dependent variable .. WRITE (writing score)
 COVARIATE        B     Beta  Std. Err.  t-Value  Sig. of t  Lower -95% CL-  Upper
 MATH        .43290   .42787       .070    6.203       .000            .295   .571
 SCIENCE     .28775   .30057       .066    4.358       .000            .158   .418

EFFECT .. CONSTANT
Multivariate Tests of Significance (S = 1, M = 0, N = 97)

 Test Name     Value    Exact F  Hypoth. DF  Error DF  Sig. of F
 Pillais      .11544   12.78959        2.00    196.00       .000
 Hotellings   .13051   12.78959        2.00    196.00       .000
 Wilks        .88456   12.78959        2.00    196.00       .000
 Roys         .11544
 Note.. F statistics are exact.

EFFECT .. CONSTANT (Cont.)
Univariate F-tests with (1,197) D. F.

 Variable  Hypoth. SS    Error SS  Hypoth. MS  Error MS         F  Sig. of F
 READ       336.96220  10176.0807   336.96220  51.65523   6.52329       .011
 WRITE     1209.88188  10090.0231  1209.88188  51.21839  23.62202       .000

Raw discriminant function coefficients (Function No. 1):
 READ .041, WRITE .124

Standardized discriminant function coefficients (Function No. 1):
 READ .293, WRITE .889

Estimates of effects for canonical variables:
 Canonical Variable 1, Parameter 1: 2.196

Correlations between DEPENDENT and canonical variables (Canonical Variable 1):
 READ .504, WRITE .959

The output above shows the linear combinations corresponding to the first canonical correlation. At the bottom of the output are the two canonical correlations. These results indicate that the first canonical correlation is .7728. The F-test in this output tests the hypothesis that the first canonical correlation is equal to zero; clearly, F = 56.4706 is statistically significant. However, the second canonical correlation of .0235 is not statistically significantly different from zero (F = 0.1087, p = 0.7420).
Factor analysis is a form of exploratory multivariate analysis that is used either to reduce the number of variables in a model or to detect relationships among variables. All variables involved in the factor analysis need to be interval and are assumed to be normally distributed. The goal of the analysis is to identify factors that underlie the variables. There may be fewer factors than variables, but there cannot be more factors than variables. For our example using the hsb2 data file, let's suppose that we think there are some common factors underlying the various test scores. We will include subcommands for varimax rotation and a plot of the eigenvalues. We will use a principal components extraction and will retain two factors. (These options make our results compatible with those from SAS and Stata; they are not necessarily the options you will want to use.)

factor
  /variables read write math science socst
  /criteria factors(2)
  /extraction pc
  /rotation varimax
  /plot eigen.

Communality (which is the opposite of uniqueness) is the proportion of variance of a variable (e.g., read) that is accounted for by all of the factors taken together; a very low communality can indicate that a variable may not belong with any of the factors. The scree plot may be useful in determining how many factors to retain. From the component matrix table, we can see that all five of the test scores load onto the first factor, while none loads heavily on the second factor. The purpose of rotating the factors is to get the variables to load either very high or very low on each factor. In this example, because all of the variables loaded onto factor 1 and not onto factor 2, the rotation did not aid in the interpretation; instead, it made the results even more difficult to interpret.

See also: SPSS FAQ: What does Cronbach's alpha mean?
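The communality computation described above is simple to sketch: for each variable, sum the squared loadings across the retained factors. The loadings below are invented for illustration; they are not the hsb2 output.

```python
# Communality = sum of squared factor loadings for a variable.
# Hypothetical loadings: (factor 1, factor 2) for each test score.
loadings = {
    "read":  (0.78, 0.12),
    "write": (0.72, -0.05),
}

communality = {var: sum(l * l for l in ls) for var, ls in loadings.items()}
# A communality near 1 means the retained factors reproduce the variable well;
# a very low value suggests the variable does not belong with any factor.
```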
The evaluation of vaccines continues long after initial regulatory approval. Postapproval observational studies are often used to investigate aspects of vaccine effectiveness (VE) that clinical trials cannot feasibly assess. These include long-term effectiveness, effectiveness within subgroups, effectiveness against rare outcomes, and effectiveness as the circulating pathogen changes. 1 Policymakers rely on these data to guide vaccine recommendations or formulation updates. 2
Dean N, Amin AB. Test-Negative Study Designs for Evaluating Vaccine Effectiveness. JAMA. Published online June 12, 2024. doi:10.1001/jama.2024.5633
A multi-chapter look at website usability testing, its benefits and methods, and how to get started with it.
Usability testing is all about getting real people to interact with a website, app, or other product you've built and observing their behavior and reactions to it. Whether you start small by watching session recordings or go all out and rent a lab with eye-tracking equipment, usability testing is a necessary step to make sure you build an effective, efficient, and enjoyable experience for your users.
We start this guide with an introduction to:
What is usability testing
Why usability testing matters
What are the benefits of usability testing
What is not usability testing
The following chapters cover different testing methods, the usability questions they can help you answer, how to run a usability testing session, and how to analyze and evaluate your testing results. Finally, we wrap up with 12 checklists and templates to help you run efficient usability sessions, and the best usability testing tools.
Usability testing is a method of testing the functionality of a website, app, or other digital product by observing real users as they attempt to complete tasks on it . The users are usually observed by researchers working for a business during either an in-person or, more commonly, a remote usability testing session.
The goal of usability testing is to reveal areas of confusion and uncover pain points in the customer journey to highlight opportunities to improve the overall user experience. Usability evaluation seeks to gauge the practical functionality of the product, specifically how efficiently a user completes a pre-defined goal.
(Note: if all testing activities take place on a website, the terms 'usability testing' and ' website usability testing' can be used interchangeably—which is what we're going to do throughout the rest of this page.)
💡Did you know there are different types of usability tests?
Moderated usability testing : a facilitator introduces the test to participants, answers their queries, and asks follow-up questions
Unmoderated usability testing : the participants conduct the test without direct supervision, usually with a script
Remote usability testing : the test participants (and the researcher, in the case of moderated usability testing) conduct the test online or, more rarely, over the phone
In-person usability testing : the test participants and the researcher(s) are in the same location
Hotjar Engage lets you conduct remote, moderated usability testing with your own users or testers from our pool of 175,000+ participants.
While the terms are often used interchangeably, usability testing and user testing differ in scope.
They are both, however, a part of UX testing—a more comprehensive approach aiming to analyze the user experience at every touchpoint, including users’ perception of a digital product or service’s performance, emotional response, perceived value, and satisfaction with UX design, as well as their overall impression of the company and brand.
User testing is a research method that uses real people to evaluate a product or service by observing their interactions and gathering feedback.
By comparison with usability testing, user testing insights reveal:
What users think about when using your product or service
How they perceive your product or service
What their user needs are
Usability testing, on the other hand, takes a more focused approach, seeking to answer questions like:
Are there bugs or other errors impacting user flow?
Can users complete their task efficiently?
Do they understand how to navigate the site?
Usability testing is done by real-life users who are likely to reveal issues that people familiar with a website can no longer identify—very often, in-depth knowledge makes it easy for designers, marketers, and product owners to miss a website's usability issues.
Bringing in new users to test your site and/or observing how real people are already using it are effective ways to determine whether your visitors:
Understand how your site works and don't get 'lost' or confused
Can complete the main actions they need to
Don't encounter usability issues or bugs
Have a functional and efficient experience
Notice any other usability problems
This type of user research is exceptionally important with new products or new design updates: without it, you may be stuck with a UX design process that your team members understand, but your target audience will not.
I employ usability testing when I’m looking to gut-check myself as a designer. Sometimes I run designs by my cross-functional squad or the design team and we all have conflicting feedback. The catch is, we’re not always our user so it’s hard to sift through and agree on the best way forward.
Usability testing cuts through the noise and reveals if the usability of a proposed design meets basic expectations. It’s a great way to quickly de-risk engineering investment.
I also like to iterate on designs as we receive more and more information, so usability testing is a great way to move fast and not break too many things in the process.
Your website can benefit from usability testing no matter where it is in the development process, from prototyping all the way to the finished product. You can also continue to test the user experience as you iterate and improve your product over time.
Employing tests with real users helps you:
Validate your prototype . Bring in users in the early stages of the development process, and test whether they’re experiencing any issues before locking down a final product. Do they encounter any bugs ? Does your site or product behave as expected when users interact with it? Testing on a prototype first can validate your concept and help you make plans for future functionality before you spend a lot of money to build out a complete website.
Confirm your product meets expectations. Once your product is completed, test usability again to make sure everything works the way it was intended. How's the ease of use? Is something still missing in the interface?
Identify issues with complex flows . If there are functions on your site that need users to follow multiple steps (for example an ecommerce checkout process ), run usability testing to make sure these processes are as straightforward and intuitive as possible.
Complement and illuminate other data points . Usability testing can often provide the why behind data points accumulated from other methods: your funnel analysis might show you that visitors drop off your site , and conducting usability testing can highlight underlying issues with pages with high churn rate.
Catch minor errors . In addition to large-scale usability issues, usability testing can help identify smaller errors. A new set of eyes is more likely to pick up on broken links, site errors, and grammatical issues that have been inadvertently glossed over. Usability testing can also validate fixes made after identifying those errors.
💡Pro tip: enable console tracking in Hotjar and filter session recordings by ‘Error’ to watch sessions of users who ran into a JavaScript error.
Open the console from the recording player to understand where the issue comes from, fix the issue, and run a usability test to validate the fix.
Develop empathy. It's not unusual for the people working on a project to develop tunnel vision around their product and forget they have access to knowledge that their typical website visitor may not have. Usability testing is a good way to develop some empathy for the real people who are using and will be using your site, and look at things from their perspective.
Get buy-in for change. It's one thing to know about a website issue; it's another to see users actually struggle with it. When it's evident that something is being misunderstood by users, it's natural to want to make it right. Watching short clips of key usability testing findings can be a very persuasive way to lobby for change within your organization.
Ultimately provide a better user experience. Great customer experience is essential for a successful product. Usability testing can help you identify issues that wouldn't be uncovered otherwise and create the most user-friendly product possible.
There are several UX tools and user testing tools that help improve the customer experience , but don't really qualify as 'usability testing tools' because they don't explicitly evaluate the functionality of a product:
A/B testing : A/B testing is a way to experiment with multiple versions of a web page to see which is most effective. While it can be used to test changes based on user testing, it is not a usability testing tool.
Focus groups : focus groups are a type of user testing , for which researchers gather a group of people together to discuss a specific topic. Usually, the goal is to learn people's opinions about a product or service, not to test how they use it.
Surveys : use surveys to gauge user experience. Because they do not allow you to actually observe visitors on the site in action, surveys are not considered usability testing—though they may be used in conjunction with it via a website usability survey .
Heatmaps : heatmaps offer a visual representation of how users interact with the page by showing the hottest (most engaged with) and coolest (least engaged with) parts of it. The click , scroll , and move maps allow you to see how users in aggregate engage with a website, but they are still technically not usability testing.
User acceptance testing : this is often the last phase of the software-testing process, where testers go through a calibrated set of steps to ensure the software works correctly. This is a technical test of QA (quality assurance), not a way to evaluate if the product is user-friendly and efficient.
In-house proper use testing : people in your company probably test software all the time, but this is not usability testing. Employees are inherently biased, making them unable to give the kind of honest results that real users can.
Your website's user interface should be straightforward and easy to use, and usability testing is an essential step in getting there. But to get the most actionable results, testing must be done correctly—you will need to reproduce normal-use conditions as closely as possible.
One of the easiest ways to get started with usability testing is through session recordings . Observing how visitors navigate your website can help you create the best user experience possible.
What is website usability testing?
Website usability testing is the practice of evaluating the functionality of your website by observing visitors’ actions and behavior as they complete specific tasks. Website usability testing lets you experience your site from the visitors’ perspective so you can identify opportunities to improve the user experience.
Your in-depth knowledge of, and familiarity with, your website might prevent you from seeing its design or usability issues. When you run a website usability test, users can identify issues with your site that you may have otherwise missed: for example, website bugs, missing or broken elements, or an ineffective call to action (CTA).
The type of website usability test you need will be based on your available resources, target audience, and goals. The main types of usability tests are:
Remote or in-person
Moderated or unmoderated
Scripted or unscripted
For more detailed information about the types of usability tests and to determine which one you should try on your site, visit the usability testing methods chapter of this guide.
Your goals and objectives will determine both the steps you’ll need to take to run a test on your website and the usability testing questions you’ll ask.
Having a plan before you start will help you organize the data and results you collect in an understandable way so you can improve the user experience. These 12 usability testing checklists and templates are a good place to start.
A 5-step process for moderated usability testing could be:
Plan the session : nature of the study and logistical details like number of participants and moderators, as well as recording setup
Recruit participants : from your user base or via a tester recruitment tool
Design the task
Run the session : don’t forget to record it and take notes
Analyze the insights
Tip: if you want to get started with website usability testing right now, with minimal set-up, we recommend giving Hotjar Engage a try:
Bring your own users into the platform or recruit from our pool of 175,000+ participants
Involve more stakeholders by adding up to 4 moderators and 10 spectators from your team during the session
Focus on gathering insights from user feedback while the platform automatically records and transcribes the session
Health care professionals use thyroid tests to check how well your thyroid is working and to find the cause of problems such as hyperthyroidism or hypothyroidism. The thyroid is a small, butterfly-shaped gland in the front of your neck that makes two thyroid hormones: thyroxine (T4) and triiodothyronine (T3). Thyroid hormones control how the body uses energy, so they affect nearly every organ in your body, even your heart.
Thyroid tests help health care professionals diagnose thyroid diseases such as
Your doctor will start with blood tests and may also order imaging tests.
Doctors may order one or more blood tests to check your thyroid function. Tests may include thyroid stimulating hormone (TSH), T4, T3, and thyroid antibody tests.
For these tests, a health care professional will draw blood from your arm and send it to a lab for testing. Your doctor will talk to you about your test results.
Health care professionals usually check the amount of TSH in your blood first. TSH is a hormone made in the pituitary gland that tells the thyroid how much T4 and T3 to make.
A high TSH level most often means you have hypothyroidism, or an underactive thyroid. This means that your thyroid isn’t making enough hormone. As a result, the pituitary keeps making and releasing TSH into your blood.
A low TSH level usually means you have hyperthyroidism, or an overactive thyroid. This means that your thyroid is making too much hormone, so the pituitary stops making and releasing TSH into your blood.
If the TSH test results are not normal, you will need at least one other test to help find the cause of the problem.
A high blood level of T4 may mean you have hyperthyroidism. A low level of T4 may mean you have hypothyroidism.
In some cases, high or low T4 levels may not mean you have thyroid problems. If you are pregnant or are taking oral contraceptives, your thyroid hormone levels will be higher. Severe illness or using corticosteroids—medicines to treat asthma, arthritis, skin conditions, and other health problems—can lower T4 levels. These conditions and medicines change the amount of proteins in your blood that “bind,” or attach, to T4. Bound T4 is kept in reserve in the blood until it’s needed. “Free” T4 is not bound to these proteins and is available to enter body tissues. Because changes in binding protein levels don’t affect free T4 levels, many health care professionals prefer to measure free T4.
If your health care professional thinks you may have hyperthyroidism even though your T4 level is normal, you may have a T3 test to confirm the diagnosis. Sometimes T4 is normal yet T3 is high, so measuring both T4 and T3 levels can be useful in diagnosing hyperthyroidism.
Measuring levels of thyroid antibodies may help diagnose an autoimmune thyroid disorder such as Graves’ disease —the most common cause of hyperthyroidism—and Hashimoto’s disease —the most common cause of hypothyroidism. Thyroid antibodies are made when your immune system attacks the thyroid gland by mistake. Your health care professional may order thyroid antibody tests if the results of other blood tests suggest thyroid disease.
Your health care professional may order one or more imaging tests to diagnose and find the cause of thyroid disease. A trained technician usually does these tests in your doctor’s office, outpatient center, or hospital. A radiologist, a doctor who specializes in medical imaging, reviews the images and sends a report for your health care professional to discuss with you.
Ultrasound of the thyroid is most often used to look for, or more closely at, thyroid nodules. Thyroid nodules are lumps in your neck. Ultrasound can help your doctor tell if the nodules are more likely to be cancerous.
For an ultrasound, you will lie on an exam table and a technician will run a device called a transducer over your neck. The transducer bounces safe, painless sound waves off your neck to make pictures of your thyroid. The ultrasound usually takes around 30 minutes.
Health care professionals use a thyroid scan to look at the size, shape, and position of the thyroid gland. This test uses a small amount of radioactive iodine to help find the cause of hyperthyroidism and check for thyroid nodules. Your health care professional may ask you to avoid foods high in iodine, such as kelp, or medicines containing iodine for a week before the test.
For the scan, a technician injects a small amount of radioactive iodine or a similar substance into your vein. You also may swallow the substance in liquid or capsule form. The scan takes place 30 minutes after an injection, or up to 24 hours after you swallow the substance, so your thyroid has enough time to absorb it.
During the scan, you will lie on an exam table while a special camera takes pictures of your thyroid. The scan usually takes 30 minutes or less.
Thyroid nodules that make too much thyroid hormone show up clearly in the pictures. Radioactive iodine that shows up over the whole thyroid could mean you have Graves’ disease.
Even though only a small amount of radiation is needed for a thyroid scan and it is thought to be safe, you should not have this test if you are pregnant or breastfeeding.
A radioactive iodine uptake test, also called a thyroid uptake test, can help check thyroid function and find the cause of hyperthyroidism. The thyroid “takes up” iodine from the blood to make thyroid hormones, which is why this is called an uptake test. Your health care professional may ask you to avoid foods high in iodine, such as kelp, or medicines containing iodine for a week before the test.
For this test, you will swallow a small amount of radioactive iodine in liquid or capsule form. During the test, you will sit in a chair while a technician places a device called a gamma probe in front of your neck, near your thyroid gland. The probe measures how much radioactive iodine your thyroid takes up from your blood. Measurements are often taken 4 to 6 hours after you swallow the radioactive iodine and again at 24 hours. The test takes only a few minutes.
If your thyroid collects a large amount of radioactive iodine, you may have Graves’ disease, or one or more nodules that make too much thyroid hormone. You may have this test at the same time as a thyroid scan.
Even though the test uses a small amount of radiation and is thought to be safe, you should not have this test if you are pregnant or breastfeeding.
If your health care professional finds a nodule or lump in your neck during a physical exam or on thyroid imaging tests, you may have a fine needle aspiration biopsy to see if the lump is cancerous or noncancerous.
For this test, you will lie on an exam table and slightly bend your neck backward. A technician will clean your neck with an antiseptic and may use medicine to numb the area. An endocrinologist who treats people with endocrine gland problems like thyroid disease, or a specially trained radiologist, will place a needle through the skin and use ultrasound to guide the needle to the nodule. Small samples of tissue from the nodule will be sent to a lab for testing. This procedure usually takes less than 30 minutes. Your health care professional will talk with you about the test result when it is available.
This content is provided as a service of the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), part of the National Institutes of Health. NIDDK translates and disseminates research findings to increase knowledge and understanding about health and disease among patients, health professionals, and the public. Content produced by NIDDK is carefully reviewed by NIDDK scientists and other experts.
The NIDDK would like to thank: COL Henry B. Burch, MD, Chair, Endocrinology Division and Professor of Medicine, Uniformed Services University of the Health Sciences
Statistics By Jim
Making statistics intuitive
By Jim Frost
T-tests are statistical hypothesis tests that analyze one or two sample means. When you analyze your data with any t-test, the procedure reduces your entire sample to a single value, the t-value. In this post, I describe how each type of t-test calculates the t-value. I don’t explain this just so you can understand the calculation, but I describe it in a way that really helps you grasp how t-tests work.
The equation for how the 1-sample t-test produces a t-value based on your sample is below:
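The formula image from the original post isn’t reproduced here; the standard 1-sample t-value, matching the signal (numerator) and noise (denominator) described in the following sections, is:

$ t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}} $

where $ \bar{x} $ is the sample mean, $ \mu_0 $ is the null hypothesis value, $ s $ is the sample standard deviation, and $ n $ is the sample size.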
This equation is a ratio, and a common analogy is the signal-to-noise ratio. The numerator is the signal in your sample data, and the denominator is the noise. Let’s see how t-tests work by comparing the signal to the noise!
In the signal-to-noise analogy, the numerator of the ratio is the signal. The effect that is present in the sample is the signal. It’s a simple calculation. In a 1-sample t-test, the sample effect is the sample mean minus the value of the null hypothesis. That’s the top part of the equation.
For example, if the sample mean is 20 and the null value is 5, the sample effect size is 15. We’re calling this the signal because this sample estimate is our best estimate of the population effect.
The calculation for the signal portion of t-values is such that when the sample effect equals zero, the numerator equals zero, which in turn means the t-value itself equals zero. The estimated sample effect (signal) equals zero when there is no difference between the sample mean and the null hypothesis value. For example, if the sample mean is 5 and the null value is 5, the signal equals zero (5 – 5 = 0).
The size of the signal increases when the difference between the sample mean and null value increases. The difference can be either negative or positive, depending on whether the sample mean is greater than or less than the value associated with the null hypothesis.
A relatively large signal in the numerator produces t-values that are further away from zero.
The denominator of the ratio is the standard error of the mean, which measures the sample variation. The standard error of the mean represents how much random error is in the sample and how well the sample estimates the population mean.
As the value of this statistic increases, the sample mean provides a less precise estimate of the population mean. In other words, high levels of random error increase the probability that your sample mean is further away from the population mean.
In our analogy, random error represents noise. Why? When there is more random error, you are more likely to see considerable differences between the sample mean and the null hypothesis value in cases where the null is true . Noise appears in the denominator to provide a benchmark for how large the signal must be to distinguish from the noise.
Our signal-to-noise ratio analogy equates to:
Both of these statistics are in the same units as your data. Let’s calculate a couple of t-values to see how to interpret them.
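The worked examples from the original post’s figure aren’t shown here, so here is an illustrative sketch with invented numbers: the same signal of 15 against two different noise levels.

```python
signal = 15  # sample mean of 20 minus a null value of 5

t_low_noise = signal / 3    # standard error of 3  -> t = 5.0
t_high_noise = signal / 12  # standard error of 12 -> t = 1.25

print(t_low_noise, t_high_noise)
```

A t-value of 5 stands far from zero, while a t-value of 1.25 is of the same general magnitude as the noise.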
The signal is the same in both examples, but it is easier to distinguish from the lower amount of noise in the first example. In this manner, t-values indicate how clear the signal is from the noise. If the signal is of the same general magnitude as the noise, it’s probable that random error causes the difference between the sample mean and null value rather than an actual population effect.
Paired t-tests require dependent samples. I’ve seen a lot of confusion over how a paired t-test works and when you should use it. Pssst! Here’s a secret! Paired t-tests and 1-sample t-tests are the same hypothesis test incognito!
You use a 1-sample t-test to assess the difference between a sample mean and the value of the null hypothesis.
A paired t-test takes paired observations (like before and after), subtracts one from the other, and conducts a 1-sample t-test on the differences. Typically, a paired t-test determines whether the paired differences are significantly different from zero.
Download the CSV data file to check this yourself: T-testData . All of the statistical results are the same when you perform a paired t-test using the Before and After columns versus performing a 1-sample t-test on the Differences column.
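To make the equivalence concrete, here is a minimal sketch in Python using only the standard library. The Before/After numbers are invented for illustration, not taken from the CSV file above.

```python
import math
import statistics

def one_sample_t(sample, null_value=0.0):
    """t = (sample mean - null value) / standard error of the mean."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - null_value) / se

before = [88, 83, 76, 91, 70, 84]
after  = [91, 89, 77, 95, 69, 90]

# A paired t-test is a 1-sample t-test on the row-wise differences,
# testing whether the mean difference equals zero.
diffs = [a - b for a, b in zip(after, before)]
paired_t = one_sample_t(diffs)
```

Running a paired t-test on the two columns in any statistics package yields the same t-value as this 1-sample test on the differences.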
Once you realize that paired t-tests are the same as 1-sample t-tests on paired differences, you can focus on the deciding characteristic: does it make sense to analyze the differences between two columns?
Suppose the Before and After columns contain test scores and there was an intervention in between. If each row in the data contains the same subject in the Before and After column, it makes sense to find the difference between the columns because it represents how much each subject changed after the intervention. The paired t-test is a good choice.
On the other hand, if a row has different subjects in the Before and After columns, it doesn’t make sense to subtract the columns. You should use the 2-sample t-test described below.
The paired t-test is a convenience for you. It eliminates the need for you to calculate the difference between two columns yourself. Remember, double-check that this difference is meaningful! If using a paired t-test is valid, you should use it because it provides more statistical power than the 2-sample t-test, which I discuss in my post about independent and dependent samples .
Use the 2-sample t-test when you want to analyze the difference between the means of two independent samples. This test is also known as the independent samples t-test . Click the link to learn more about its hypotheses, assumptions, and interpretations.
Like the other t-tests, this procedure reduces all of your data to a single t-value in a process similar to the 1-sample t-test. The signal-to-noise analogy still applies.
Here’s the equation for the t-value in a 2-sample t-test.
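The formula image isn’t reproduced here; in general form, the 2-sample t-value is the difference between the sample means divided by the standard error of that difference:

$ t = \dfrac{\bar{x}_1 - \bar{x}_2}{\mathrm{SE}(\bar{x}_1 - \bar{x}_2)} $

where the standard error in the denominator can be computed from a pooled variance (equal variances assumed) or from each group’s variance separately (Welch’s version).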
The equation is still a ratio, and the numerator still represents the signal. For a 2-sample t-test, the signal, or effect, is the difference between the two sample means. This calculation is straightforward. If the first sample mean is 20 and the second mean is 15, the effect is 5.
Typically, the null hypothesis states that there is no difference between the two samples. In the equation, if both groups have the same mean, the numerator, and the ratio as a whole, equals zero. Larger differences between the sample means produce stronger signals.
The denominator again represents the noise for a 2-sample t-test. However, you can use two different values depending on whether you assume that the variation in the two groups is equal or not. Most statistical software lets you choose which value to use.
Regardless of the denominator value you use, the 2-sample t-test works by determining how distinguishable the signal is from the noise. To ascertain that the difference between means is statistically significant, you need a high positive or negative t-value.
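As a minimal sketch of the pooled (equal-variance) version of the denominator, with invented data values:

```python
import math
import statistics

def two_sample_t(x, y):
    """Pooled (equal-variance) 2-sample t-value:
    signal = difference in sample means, noise = pooled standard error."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x) +
                  (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(x) - statistics.mean(y)) / se

group1 = [20, 22, 19, 21]   # mean 20.5
group2 = [15, 14, 16, 17]   # mean 15.5
t = two_sample_t(group1, group2)
```

Here the signal is the 5-point difference between the group means, and the pooled standard error provides the benchmark noise level.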
Here’s what we’ve learned about the t-values for the 1-sample t-test, paired t-test, and 2-sample t-test:
For example, a t-value of 2 indicates that the signal is twice the magnitude of the noise.
Great … but how do you get from that to determining whether the effect size is statistically significant? After all, the purpose of t-tests is to assess hypotheses. To find out, read the companion post to this one: How t-Tests Work: t-Values, t-Distributions and Probabilities . Click here for step-by-step instructions on how to do t-tests in Excel !
If you’d like to learn about other hypothesis tests using the same general approach, read my posts about:
January 9, 2023 at 11:11 am
Hi Jim, thank you for explaining this. I will revert to this during my 8 weeks in class every day to make sure I understand what I’m doing. May I ask more questions in the future?
November 27, 2021 at 1:37 pm
This was an awesome piece, very educative and easy to understand
June 19, 2021 at 1:53 pm
Hi Jim, I found your posts very helpful. Could you please explain how to do a t-test for panel data?
June 19, 2021 at 3:40 pm
You’re limited by what you can do with t-tests. For panel data and t-tests, you can compare the same subjects at two points in time using a paired t-test. For more complex arrangements, you can use repeated measures ANOVA or specify a regression model to meet your needs.
February 11, 2020 at 10:34 pm
Hi Jim: I was reviewing this post in preparation for an analysis I plan to do, and I’d like to ask your advice. Each year, staff complete an all-employee survey, and results are reported at workgroup level of analysis. I would like to compare mean scores of several workgroups from one year to the next (in this case, 2018 and 2019 scores). For example, I would compare workgroup mean scores on psychological safety between 2018 and 2019. I am leaning toward a paired t test. However, my one concern is that….even though I am comparing workgroup to workgroup from one year to the next….it is certainly possible that there may be some different employees in a given workgroup from one year to the next (turnover, transition, etc.)….Assuming that is the case with at least some of the workgroups, does that make a paired t test less meanginful? Would I still use a paired t test or would another type t test be more appropriate? I’m thinking because we are dealing with workgroup mean scores (and not individual scores), then it may still be okay to compare meaningfully (avoiding an ecological fallacy). Thoughts?
Many thanks for these great posts. I enjoy reading them…!
April 8, 2019 at 11:22 pm
Hi jim. First of all, I really appreciate your posts!
When I use a t-test via R or scikit-learn, there is an option for homogeneity of variance. I think that option only applies to the two-sample t-test, but what should I do with that option?
Should I always perform an F-test to check the homogeneity of variance? Or which one is the stricter assumption?
November 9, 2018 at 12:03 am
This blog is great. I’m at Stanford and can say this is a great supplement to class lectures. I love the fact that there aren’t formulas so as to get an intuitive feel. Thank you so much!
November 9, 2018 at 9:12 am
Thanks Mel! I’m glad it has been helpful! Your kind words mean a lot to me because I really strive to make these topics as easy to understand as possible!
December 29, 2017 at 4:14 pm
Thank you so much Jim! I have such a hard time understanding statistics without people like you who explain it using words to help me conceptualize rather than utilizing symbols only!
December 29, 2017 at 4:56 pm
Thank you, Jessica! Your kind words made my day. That’s what I want my blog to be all about. Providing simple but 100% accurate explanations for statistical concepts!
Happy New Year!
October 22, 2017 at 2:38 pm
Hi Jim, sure, I’ll go through it…Thank you..!
October 22, 2017 at 4:50 am
In summary, the t-test tells how the sample mean differs from the null hypothesis value, but how does it comment on the significance? Is it that the further from the null, the more significant? If so, could you give some more explanation about it?
October 22, 2017 at 2:30 pm
Hi Omkar, you’re in luck, I’ve written an entire blog post that talks about how t-tests actually use the t-values to determine statistical significance. In general, the further away from zero, the more significant it is. For all the information, read this post: How t-Tests Work: t-Values, t-Distributions, and Probabilities . I think this post will answer your questions.
September 12, 2017 at 2:46 am
Excellent explanation, appreciate you..!!
September 12, 2017 at 8:48 am
Thank you, Santhosh! I’m glad you found it helpful!
February 5, 2024
Informed by new research, Dartmouth will reactivate the standardized testing requirement for undergraduate admission beginning with applicants to the Class of 2029.
When Dartmouth suspended its standardized testing requirement for undergraduate applicants in June 2020, it was a pragmatic pause taken by most colleges and universities in response to an unprecedented global pandemic. At the time, we imagined the resulting "test-optional" policy as a short-term practice rather than an informed commentary on the role of testing in our holistic evaluation process. Nearly four years later, having studied the role of testing in our admissions process as well as its value as a predictor of student success at Dartmouth, we are removing the extended pause and reactivating the standardized testing requirement for undergraduate admission, effective with the Class of 2029. For Dartmouth, the evidence supporting our reactivation of a required testing policy is clear. Our bottom line is simple: we believe a standardized testing requirement will improve—not detract from—our ability to bring the most promising and diverse students to our campus.
A new research study commissioned by Dartmouth President Sian Beilock and conducted by Dartmouth economists Elizabeth Cascio, Bruce Sacerdote, and Doug Staiger and educational sociologist Michele Tine confirms that standardized testing—when assessed using the local norms at a student's high school—is a valuable element of Dartmouth's undergraduate application. Their illuminating study found that high school grades paired with standardized testing are the most reliable indicators for success in Dartmouth's course of study. They also found that test scores are an especially valuable tool for identifying high-achieving applicants from low- and middle-income backgrounds, applicants who are first-generation college-bound, and students from urban and rural backgrounds. Testing is also an important tool as we meet applicants from under-resourced or less familiar high schools across the increasingly wide geography of our applicant pool. That is, contrary to what some have perceived, standardized testing allows us to admit a broader and more diverse range of students.
The finding that standardized testing can be an effective tool to expand access and identify talent was unexpected, thought-provoking, and encouraging. Indeed, their study challenges the longstanding critique that standardized testing inhibits rather than broadens college access; they note that contextually strong testing clearly enhances the admission chances of high-achieving applicants from less-resourced backgrounds when such scores are disclosed. Indeed, their finding reinforces the value of Dartmouth's longstanding practice of considering testing within our broader understanding of the candidate as a whole person. Especially during the pandemic's test-optional period, my colleagues and I sharpened our awareness of local norms and environmental factors, as well as the degree of opportunity available at a student's high school and in their community. Those environmental elements of discovery and assessment were one of the fortuitous by-products of the extended pandemic moment during which we reimagined traditional guidelines and practices. Knowing what we now know, it is an approach we will preserve as we move forward. Contextualized testing is an essential element of our individualized, holistic review. Of course, Dartmouth will never reduce any student to their test scores. It is simply one data point among many, but a helpful one when it is present.
The faculty researchers write: "Our overall conclusion is that SAT and ACT scores are a key method by which Dartmouth can identify students who will succeed at Dartmouth , including high performing students…who may attend a high school for which Dartmouth has less information to (fully) judge the transcript." Simply said, it is another opportunity to identify students who are the top performers in their environments, wherever they might be.
Indeed, as Dartmouth experienced our first admissions round with a "testing recommended" advisory this past fall, we set new institutional records for access even as 75 percent of those early acceptances included testing as an element of the application. We celebrated two early milestones: 22 percent are first-generation college bound and 21 percent qualified for a zero-parent contribution with family incomes and assets at or below $65,000 USD. These outcomes encourage and excite us, and we view contextualized testing as another opportunity to amplify our objective to admit and enroll a broadly heterogeneous undergraduate class that is well-prepared to succeed in the curriculum we offer.
Our experience with optional testing has been enlightening. As with the other optional elements of the Dartmouth application—an alumni interview, a peer recommendation—the decision to share testing was individualized. But as the faculty study notes, "Some low-income students appear to withhold test scores even in cases where providing the test score would be a significant positive signal to admissions." Dartmouth admission officers also observed this pattern: Our post-admission research showed students with strong scores in their local framework often opted for a test-optional approach when their scores fell below our typical mean or mid-50% range. Often, those scores would have been additive, positive elements of the candidacy had they been shared. The absence of such scores underscores longstanding misperceptions about what represents a "high" or a "low" score; those definitions are not binary. A score that falls below our class mean but several hundred points above the mean at the student's school is "high" and, as such, it has value as one factor among many in our holistic assessment. That is how we consider testing at Dartmouth, and the opportunity to imagine better ways to inform students about their "score strength" will be a priority for us.
Moreover, the Dartmouth faculty study found testing "allows Dartmouth admission officers to more precisely identify students who will thrive academically." In our high-volume, globally heterogeneous applicant pool in which most candidates are "high achievers," environmental and historical data, high school performance, and testing—when taken together—offer the most robust framework for predicting success at Dartmouth. That finding was especially true for applicants from under-resourced high schools, noting that students with standardized test scores at or above the 75th percentile of test-takers from their respective high schools are well prepared to succeed in our fast-paced, rigorous course of study. All scores are assessed through that local framing as we seek excellence from new geographies.
Beginning with the Class of 2029, Dartmouth will once again require applicants from high schools within the United States to submit results of either the SAT or ACT, with no Dartmouth preference for either test. As always, the results of multiple administrations will be super-scored, which means we will consider the highest result on individual sections of either exam regardless of the test date or testing format. For applicants from schools outside the U.S., results of either the SAT, ACT, or three Advanced Placement (AP) examinations, OR predicted or final exam results from the International Baccalaureate (IB), British A-Levels, or an equivalent standardized national exam are required. This distinction between students attending a school in the U.S. or outside the U.S. acknowledges the disparate access to American standardized testing—as well as the lack of familiarity with such testing—in different parts of the world. Dartmouth's English language proficiency policy remains unchanged: students for whom English is not the first language, or for whom English has not been the primary language of instruction for at least two years, are required to submit an English proficiency score from TOEFL, IELTS, Duolingo, or the Cambridge English Exam.
Dartmouth will pair the restoration of required testing with a reimagined way of reporting testing outcomes, ideally in ways that are more understandable for students, families, and college counselors. For example, when testing was submitted as part of our Early Decision round for the Class of 2028, 94 percent of the accepted students who shared testing scored at or above the 75th percentile of test-takers at their respective high school. More significantly, this figure was a full 100 percent for the 79 students who attend a high school that matriculates 50 percent or fewer of its graduates to a four-year college. Accordingly, we will develop a new testing profile that seeks, in part, to disrupt the long-standing focus on the class mean and mid-50 percent range, with hopes of empowering students to understand how a localized score aligns with the admissions parameters at Dartmouth.
Dartmouth has practiced holistic admissions since 1921, and that century-long consideration of the whole person is unquestionably as relevant as ever. As we reactivate our required testing policy, contextualized testing will be one factor—but never the primary factor—among the many quantitative and qualitative elements of our application. As always, the whole person counts, as do the environmental factors each person navigates. And, as always, we will evaluate and reframe Dartmouth's undergraduate admission requirements as the data and the evidence inform us.
Breast Cancer Research, volume 26, Article number: 97 (2024)
Tumor immune infiltration and peripheral blood immune signatures have prognostic and predictive value in breast cancer. Whether distinct peripheral blood immune phenotypes are associated with response to neoadjuvant chemotherapy (NAC) remains understudied.
Peripheral blood mononuclear cells from 126 breast cancer patients enrolled in a prospective clinical trial (NCT02022202) were analyzed using Cytometry by time-of-flight with a panel of 29 immune cell surface protein markers. Kruskal–Wallis tests or Wilcoxon rank-sum tests were used to evaluate differences in immune cell subpopulations according to breast cancer subtype and response to NAC.
There were 122 evaluable samples: 47 (38.5%) from patients with hormone receptor-positive, 39 (32%) triple-negative (TNBC), and 36 (29.5%) HER2-positive breast cancer. The relative abundances of pre-treatment peripheral blood T, B, myeloid, NK, and unclassified cells did not differ according to breast cancer subtype. In TNBC, higher pre-treatment myeloid cells were associated with lower pathologic complete response (pCR) rates. In hormone receptor-positive breast cancer, lower pre-treatment CD8+ naïve T cells and CD4+ effector memory T cells re-expressing CD45RA (TEMRA) were associated with more extensive residual disease after NAC. In HER2+ breast cancer, the peripheral blood immune phenotype did not differ according to NAC response.
Pre-treatment peripheral blood immune cell populations (myeloid in TNBC; CD8+ naïve T cells and CD4+ TEMRA cells in luminal breast cancer) were associated with response to NAC in early-stage TNBC and hormone receptor-positive breast cancers, but not in HER2+ breast cancer.
NCT02022202 . Registered 20 December 2013.
The successful implementation of immunotherapy in multiple cancers has led to an increased appreciation of the relevance of antitumor immune responses in clinical outcomes. In patients with breast cancer, the generation of anticancer adaptive immunity appears more robust in the triple-negative (TNBC) and the human epidermal growth factor receptor 2 (HER2)-positive subtypes, while estrogen receptor (ER)-positive/HER2-negative breast cancers (herein referred to as luminal subtype) are generally regarded as less immunogenic [ 1 , 2 ]. The robustness of immune cell infiltration within the tumor stroma is both prognostic and predictive of response to chemotherapy and immunotherapy in all breast cancer subtypes [ 3 , 4 , 5 ]. Furthermore, robust tumor immune cell infiltration is highly associated with favorable prognosis in patients with early-stage TNBC, even without systemic therapy administration [ 6 , 7 ].
Most of our understanding of the interactions between breast cancer tumor cells and immune cells comes from “tumor-centric” research evaluating immune cells infiltrating the tumor microenvironment. However, immune cells infiltrating tumors must first be recruited from the peripheral blood systemic pool. Akin to the use of “liquid biopsies” to detect circulating tumor DNA, studies in other malignancies [ 8 , 9 ] and in breast cancer [ 10 , 11 ] have demonstrated that distinct peripheral blood immune signatures at the time of diagnosis (before any treatment) and changes in those signatures induced by treatment have the potential to predict treatment outcome.
Comprehensive simultaneous enumeration of distinct peripheral blood immune cell subpopulations has been historically limited by the low-plex capabilities of technologies such as standard flow cytometry. However, the advent of highly multiplexed proteomic platforms, such as mass cytometry (also known as Cytometry by Time-Of-Flight [CyTOF]), has enabled the simultaneous investigation of large numbers of cell markers at single-cell resolution. By replacing fluorophores with non-organic elements, mass cytometry offers an extensive spectrum with minimal spillover between channels and virtually no biological signal background [ 12 ], positioning CyTOF as an ideal technology to characterize the systemic immunological landscape of patients with cancer [ 12 ]. In this study, we aimed to evaluate the relative abundance of the major peripheral blood immune cell lineages (i.e., B, T, NK, and myeloid cells)—and their diverse subsets—in patients with operable breast cancer treated with neoadjuvant chemotherapy (NAC) within the context of a prospective clinical trial [ 13 ]. To accomplish this, we used a CyTOF panel including 29 surface protein markers (Fig. 1 ) to interrogate the profile of peripheral blood mononuclear cell (PBMC) samples obtained before initiation of neoadjuvant chemotherapy (NAC) and evaluate the differential abundance of immune cell subsets according to breast cancer subtype and pathologic response.
( A ) Peripheral blood mononuclear cell immune phenotyping workflow, ( B ) Labeling strategy
Patient population.
PBMC samples were prospectively collected from 126 of 132 eligible patients enrolled in the Breast Cancer Genome-Guided Therapy study at Mayo Clinic (NCT02022202) between March 5, 2012, and May 1, 2014. Patients with a new diagnosis of operable invasive breast cancer of any subtype were eligible if the primary tumor measured ≥ 1.5 cm and they were recommended to receive NAC by their treating oncologist. The primary results of the study, including patient characteristics and genomic profiling data, have been published previously [ 13 ]. Clinically approximated breast cancer subtypes were defined using the St. Gallen criteria [ 14 ]: luminal A (ER > 10% + grade 1, or ER > 10% + grade 2 + Ki-67 < 15%); luminal B (ER > 10% + grade 2 + Ki-67 ≥ 15%, or ER > 10% + grade 3); HER2 + (3 + by immunohistochemistry [IHC] or amplified by fluorescence in situ hybridization [FISH]); and TNBC (ER ≤ 10%, progesterone receptor ≤ 10%, and HER2-negative).
Participants in this study were recommended to receive twelve doses of weekly paclitaxel (with trastuzumab for HER2 + breast cancer), followed by four cycles of an anthracycline-based regimen. Pertuzumab was allowed along with trastuzumab for HER2 + breast cancer after September 2012. Carboplatin was allowed for TNBC after June 2013. None of the patients enrolled in this study received immunotherapy. Following completion of NAC, patients underwent surgery, and resected tissue was evaluated for pathologic response. Pathologic complete response (pCR) was defined as the absence of invasive tumor in the breast and axillary lymph nodes (ypT0/Tis, ypN0). The amount of residual disease after NAC was evaluated using the Residual Cancer Burden (RCB) index, with RCB-0 representing pCR, and RCB-1, RCB-2, and RCB-3 representing increasing amounts of residual disease [ 15 , 16 ]. Endocrine therapy was to be administered postoperatively for patients with ER + breast cancer. The Mayo Clinic Institutional Review Board and appropriate committees approved this study. All patients provided written informed consent.
PBMC suspensions were prospectively created from peripheral blood collected using heparin tubes (Becton Dickinson Vacutainer® SKU: 367874) before NAC initiation by the Mayo Clinic Biospecimens Accessioning and Processing laboratory. Mononuclear cells were isolated using a density gradient isolation technique. Following isolation, the sample was viably cryopreserved in a mixture of cell culture medium, fetal bovine serum (FBS), and dimethyl sulfoxide (DMSO). Cells were subsequently slow frozen to maintain cell integrity and stored in liquid nitrogen.
We divided the study population into three cohorts according to breast cancer subtype: TNBC, HER2-positive, and luminal. For each cohort, samples were thawed and processed in batches of 6–7 individual patient samples, along with a longitudinal reference sample, using the workflow depicted in Fig. 1 A. The longitudinal reference samples were technical replicates created from a single PBMC pool composed of four healthy donors. These reference samples were used for panel titration and served as a longitudinal reference to identify issues with antibody staining quality and batch effects [ 17 , 18 ]. The order in which each patient sample was processed within each cohort was determined by randomization, stratified by pCR status.
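The randomization scheme itself is not detailed in the text. Purely as an illustrative sketch of the idea of stratified randomization by pCR status (the sample IDs and the round-robin interleaving policy below are hypothetical, not the study's actual procedure), a shuffle that spreads responders and non-responders across the processing order could look like:

```python
import random

def stratified_processing_order(samples, seed=0):
    """Shuffle samples within each pCR stratum, then interleave the
    strata round-robin so responders and non-responders are spread
    across the processing order. `samples` is a list of
    (sample_id, pcr_status) pairs."""
    rng = random.Random(seed)
    strata = {}
    for sample_id, pcr in samples:
        strata.setdefault(pcr, []).append(sample_id)
    for ids in strata.values():
        rng.shuffle(ids)           # randomize within each stratum
    order = []
    iterators = [iter(ids) for ids in strata.values()]
    while iterators:               # round-robin across strata
        for it in list(iterators):
            try:
                order.append(next(it))
            except StopIteration:
                iterators.remove(it)
    return order

order = stratified_processing_order(
    [("s1", True), ("s2", False), ("s3", True), ("s4", False)]
)
```

With two equally sized strata, consecutive samples alternate between pCR and non-pCR, so no batch is dominated by one response group.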
After thawing, samples were stained with a panel of 29 commercially available, metal-tagged antibodies (Fluidigm, CA) optimized to identify major human immune cell subsets (Fig. 1 B). Final antibody concentrations were selected based on signal-to-noise ratio and their ability to differentiate negative, dim, and bright populations. Samples were stained individually using standard manufacturer protocol (Fluidigm, CA), barcoded overnight with a unique palladium barcode during DNA intercalation, and pooled for acquisition in the mass cytometer.
After acquisition in the mass cytometer, output data were de-barcoded and normalized on a per-batch basis to the median intensity of EQ calibration beads [ 19 ]. Gaussian discrimination parameters were used for data cleanup [ 20 ]. Flow Cytometry Standard (FCS) files were uploaded to an automated platform for unbiased processing (Astrolabe Diagnostics, Arlington, VA, USA), which uses the FlowSOM (self-organizing map) algorithm [ 21 ] followed by a labeling step to automatically assign cells to pre-selected, biologically defined immune cell lineages. Patient-level metadata were added to the experimental matrix, and immune cell subsets were clustered and annotated to determine the differential abundance of immune cell subpopulations across clinical and pathological groups of interest.
First, we identified and calculated the frequencies of major immune cell populations (i.e., B, T, NK, and myeloid cells) according to lineage-defining cell surface proteins (Fig. 1 B). Within these major immune cell compartments, we evaluated the maturation and antigen-experienced states of T and B cells and distinct NK cell subsets according to the labeling strategy shown in Fig. 1 B. Of note, CD11c was used to define the myeloid lineage in these experiments due to suboptimal performance of CD14 and CD16 (which were thus excluded from the labeling hierarchy); consequently, no additional phenotyping of this compartment was carried out. Immune cell subset frequencies are presented as a percentage of all PBMCs.
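The actual cell labeling was performed by the automated FlowSOM-based pipeline described above. As a conceptual sketch only (not the study's implementation), the gating hierarchy implied by the lineage-defining markers can be expressed as a cascade of binary marker calls, with CD11c standing in for CD14/CD16 in the myeloid definition as noted in the text:

```python
def assign_lineage(cell):
    """Assign a CD45+ cell to a major lineage from binary (+/-) calls
    on lineage-defining markers. `cell` maps marker names to booleans;
    missing markers are treated as negative."""
    if cell.get("CD3"):
        return "T cell"        # CD45+ CD3+
    if cell.get("CD20"):
        return "B cell"        # CD45+ CD3- CD20+
    if cell.get("CD11c"):
        return "myeloid cell"  # CD45+ CD3- CD20- CD11c+
    if cell.get("CD56"):
        return "NK cell"       # CD45+ CD3- CD20- CD11c- CD56+
    return "unclassified"      # no canonical lineage marker detected
```

The order of the checks matters: a hypothetical CD3+CD56+ event is labeled a T cell, because each later gate is conditioned on negativity for the earlier markers.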
For an initial exploration of the high-dimensional data generated in this study, we used the Uniform Manifold Approximation and Projection (UMAP) dimensionality-reduction algorithm [ 22 ]. We projected PBMC data from all patients, stratified by breast cancer subtype and by response to systemic therapy, into UMAP plots generated using OMIQ (Dotmatics, Boston, MA). Kruskal–Wallis tests or Wilcoxon rank-sum tests were used to assess whether an immune cell type (expressed as a percent of total immune cells) differed with respect to breast cancer subtype. Wilcoxon rank-sum tests were used to compare patients with and without pCR in the HER2 + and TNBC subtypes. Given the expected low rates of pCR after NAC in the luminal breast cancer subtype, we grouped patients with pCR or minimal residual disease after NAC (RCB index class 0/1) versus those with moderate-to-extensive residual disease (RCB class 2/3). p values < 0.05 were considered statistically significant. Because the analysis was exploratory, no correction for multiple comparisons was performed. Analyses were performed using SAS (Version 9.4, SAS Institute, Inc., Cary, NC).
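The study ran these tests in SAS; equivalent nonparametric tests are available in SciPy. The sketch below uses invented frequencies (not the study's data) solely to illustrate the two comparison designs described above:

```python
from scipy import stats

# Invented pre-treatment myeloid-cell frequencies (% of PBMCs),
# for illustration only.
luminal = [14.2, 15.1, 13.8, 16.0, 14.9]
tnbc = [13.1, 15.4, 12.7, 16.2, 14.0]
her2 = [15.3, 13.9, 14.6, 15.8, 13.5]

# Three-group comparison across breast cancer subtypes
h_stat, p_subtype = stats.kruskal(luminal, tnbc, her2)

# Two-group comparison, e.g. pCR vs. residual disease
pcr = [13.1, 12.5, 14.0]
no_pcr = [15.4, 16.1, 14.8]
z_stat, p_response = stats.ranksums(pcr, no_pcr)

print(f"subtype p = {p_subtype:.3f}, response p = {p_response:.3f}")
```

Both tests compare rank distributions rather than means, so they do not assume normally distributed cell frequencies, which is why they suit skewed immune-subset percentages.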
Viably cryopreserved PBMC samples from 126 patients obtained before the initiation of NAC were available. After thawing the cryopreserved samples, the average cell count was 3.94 × 10 6 (SD 1.94 × 10 6 ), with mean post-thaw cell viability of 81% (SD 15%). After acquisition on the mass cytometer, the mean yield per sample was 506,099 single-cell events (range: 48,725–1,130,427). Four samples (3 from patients with luminal breast cancer and one from TNBC) were excluded from subsequent analyses due to a low number of single-cell events, leaving 122 evaluable samples. In these, we analyzed a total of 61,744,075 single-cell events (luminal: 28,465,649; TNBC: 13,906,902; and HER2-positive: 19,371,524). The average yield (SD) per sample by breast cancer subtype was luminal: 605,652 (217,935); TNBC: 356,587 (239,863); and HER2-positive: 538,098 (284,120).
Of the 122 evaluable samples, 47 (38.5%) were from patients with luminal breast cancer (11 luminal A, 36 luminal B, 2 luminal subtype unknown), 39 (32%) from patients with TNBC, and 36 (29.5%) from patients with HER2 + breast cancer (16 ER + /HER2 + and 20 ER-/HER2 +). Baseline patient characteristics for each cohort and their best response to NAC are detailed in Table 1 . Patients with TNBC included in this study were more frequently clinically node-negative (cN0) at presentation than patients with other breast cancer subtypes (64% cN0 in TNBC compared to 34% and 22% for luminal and HER2 +, respectively). Stromal TIL data were available for 24 (62%) of the patients with TNBC. The median TIL level was 20% (range 1–80%, IQR 10–40%). TIL levels were not obtained for the luminal or HER2 + breast cancer cohorts (Table 1 ).
For visualization purposes, we projected all CD45 + viable single-cell events into a UMAP and identified major immune cell islands according to the expression of lineage-defining markers (Fig. 2 A, B). We calculated the total frequencies of the major immune cell subtypes across the three breast cancer subtypes (Fig. 2 C). Across breast cancer subtypes, the largest peripheral blood immune cell compartment was the T cell compartment (CD45 + CD3 + CD20-CD11c-CD56-), followed by overall similar frequencies of B cells (CD45 + CD3-CD20 + CD11c-CD56-), myeloid cells (CD45 + CD3-CD20-CD11c + CD56-), and NK cells (CD45 + CD3-CD20-CD11c-CD56 +). The relative abundances of pre-treatment peripheral blood T cells, B cells, myeloid cells, NK cells, and unclassified cells did not significantly differ according to breast cancer subtype. Additionally, we did not identify significant differences in the phenotypic composition of each of the individual compartments of T cells, myeloid cells, B cells, and NK cells (Supplement Figs. S1 – S4 show the distribution of B and T cell subsets according to breast cancer subtype). Within unclassified cells, canonical marker negative (CD3-CD11c-CD20-CD56-CD123-) HLADR + cells were highest in TNBC (TNBC: 0.39%, HER2 + BC 0.28%, and luminal: 0.17%, p = 0.0228).
Major immune cell compartments in the overall study population. ( A ) UMAP projection of major PBMC immune cell compartments, ( B ) Canonical marker expression in each island corresponding to panel ( A ), ( C ) Relative pre-treatment abundance of major immune cell populations according to breast cancer subtype
We observed a moderate negative correlation between age and the levels of peripheral blood CD8 + naïve T cells across breast cancer subtypes, with the strongest correlation seen in patients with luminal breast cancers (Spearman rank correlation rho − 0.57 in luminal, − 0.51 in HER2 + and − 0.40 in TNBC). Correlations of other immune cells with age are shown in Fig. S8 and Supplementary Table 1 .
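For illustration only, a Spearman rank correlation of the kind reported above can be computed with SciPy. The age and cell-frequency values below are invented and deliberately perfectly anti-monotone, so the coefficient comes out at exactly −1; real data, like the study's rho of −0.57, would be weaker:

```python
from scipy.stats import spearmanr

# Invented values: patient age vs. naive CD8+ T cell frequency
# (% of PBMCs). Chosen strictly decreasing, so rho is exactly -1.
age = [34, 41, 48, 55, 62, 70]
naive_cd8 = [11.2, 9.8, 8.1, 6.5, 5.2, 3.9]

rho, p = spearmanr(age, naive_cd8)
print(f"Spearman rho = {rho:.2f}")  # rho = -1.00 for these values
```

Spearman's rho depends only on ranks, so it captures the monotone decline of naïve T cells with age without assuming a linear relationship.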
Among 39 patients with TNBC, 21 (54%) achieved pCR (Table 1 ). The distribution of RCB was RCB 0/1: 27 (69%), RCB 2/3: 10 (26%), and not available in 2 (5%). The proportion of pre-NAC myeloid cells (CD3-CD20-CD56-CD11c +) was significantly lower among the patients who achieved a pCR compared to those with residual disease (median 13.1% vs. 15.4%, p = 0.0217), Fig. 3 A, B. No significant differences in B, T, or NK cells were seen according to response to NAC (Fig. S5 ). Among the 24 patients with stromal TIL data, TIL levels were not found to differ significantly between patients who achieved pCR (n = 14, median TILs 20%, IQR 10–40%) and those who did not (n = 10, median TILs 25%, IQR 5–30%, p = 0.68, Fig. S9 ). Weak to moderate correlations were observed between stromal TIL levels and specific peripheral blood immune cell populations (Supplementary Table 2 and Figs. S10 – S12 ).
PBMC immunophenotypic differences were observed according to response to neoadjuvant chemotherapy (NAC) in TNBC and luminal breast cancers. ( A ) Density plots showing lower density of myeloid cells (dashed outline) in patients with TNBC who achieved pCR (left) compared to those who did not (right), ( B ) Relative pre-treatment abundance of major immune cell populations in TNBC according to response to NAC, ( C ) Density plots showing higher density of CD8 + naïve T cells (dashed outline in top island) and CD4 + TEMRA cells (dashed outline in bottom island) in patients with luminal BC with minimal or no residual disease (RCB 0-I) versus moderate to extensive residual disease (RCB II-III) after NAC, ( D ) Relative pre-treatment abundance of CD4 + T cell subsets in luminal breast cancer according to response to NAC, ( E ) Relative pre-treatment abundance of CD8 + T cell subsets in luminal breast cancer according to response to NAC, ( F ) Density plots showing a trend towards higher density of B cells in patients with HER2 + breast cancer who achieved pCR (left) compared to those who did not (right), ( G ) Relative pre-treatment abundance of major immune cell populations in HER2 + breast cancer according to response to NAC
Among 47 patients with luminal breast cancer (11 luminal A, 36 luminal B), 4 (9%) achieved pCR (Table 1 ). The distribution of RCB was RCB 0/1: 7 (15%), RCB 2/3: 38 (81%), and not available in 2 (4%) patients (Table 1 ). All 7 patients who achieved RCB 0/1 had tumors consistent with a luminal B-like phenotype (ER > 10% + grade 3 [2 pts] or ER > 10% + grade 2 + Ki-67 ≥ 15% [5 pts]). No statistically significant differences in the proportion of total myeloid, B, T, or NK cells were detected between patients who achieved pCR versus not, or according to RCB (data not shown). However, within the T cell compartment, CD8 + naïve (CD3 + CD8 + CD45RA + CD197 +) and CD4 + effector memory cells re-expressing CD45RA T cells (T EMRA , CD3 + CD4 + CD45RA + CD197-) were significantly higher in patients with better response to NAC (RCB 0/1) compared to those with more extensive residual disease (RCB 2/3, CD8 + naïve median 8.5% vs 3.9%, p = 0.0273; CD4 + T EMRA median 7.1% vs 2.4%, p = 0.0467, Figs. 3 C–E and S6 ).
Among 36 patients with HER2 + breast cancer, 16 (44%) achieved pCR (Table 1 ). The distribution of RCB was RCB 0/1: 21 (58%) and RCB 2/3: 15 (42%) (Table 1 ). Pre-NAC total B cells trended higher among patients who achieved a pCR compared to those with residual disease (median 11.5% vs. 9.3%, p = 0.0827), Fig. 3 F–G. Within the B cell compartment, transitional B cells were numerically higher among patients who achieved pCR compared to those who did not (median 0.89% vs. 0.62%, p = 0.0915) (Fig. S7 ).
It is now well established that antitumor immunity plays a key role in the treatment response and prognosis of patients with breast cancer. The presence of high levels of TILs and of tumor-derived immune-related gene expression are associated with improved prognosis and therapeutic response, particularly in triple-negative and HER2 + breast cancer [ 1 , 2 , 3 , 4 , 5 , 6 , 23 ]. In addition, morphological immune features identified in regional lymph nodes are also prognostic in TNBC [ 24 , 25 ]. Based on the hypothesis that tumor-triggered immune responses can be detected not only in the tumor microenvironment and lymph nodes but also in the peripheral blood, this study utilized CyTOF to evaluate the circulating immune cell repertoire of patients with operable breast cancer before initiation of NAC and potential associations with response to NAC. We identified significant differences in the peripheral blood immune phenotype according to treatment response in patients with TNBC and luminal breast cancer (in the myeloid and T cell compartments, respectively). However, among patients with HER2 + breast cancer, pre-NAC B cells only trended higher in patients achieving pCR compared to those with residual disease.
Our findings in the TNBC cohort suggest that higher pre-treatment circulating myeloid cells may be associated with NAC resistance. Myeloid cells, including monocytes, granulocytes, and myeloid-derived suppressor cells (MDSCs), have potent immunosuppressive effects that counteract the endogenous antitumor immune response [ 26 ]. Tumor-derived inflammatory signals may promote the expansion of myeloid cells [ 27 ], which can, in turn, promote tumor progression by infiltrating tumors or homing to distant organs and establishing pre-metastatic niches that “prime” tissues for the engraftment of disseminating tumor cells [ 28 , 29 , 30 ]. It has been shown that myeloid cells are enriched in the tumor microenvironment of chemoimmunotherapy-resistant breast cancer tumors [ 31 , 32 ]. Additionally, peripheral blood MDSCs are significantly elevated in patients with various cancers compared to unaffected individuals [ 33 ], and higher expression of peripheral blood macrophage-related chemokines (e.g., CCL3) has been associated with lower pCR rates in the context of neoadjuvant chemoimmunotherapy [ 11 ]. While we were unable to further characterize the myeloid compartment in our study, our data support further evaluation of circulating myeloid cells throughout NAC in TNBC, particularly considering that tumor-associated myeloid cells exist in a diverse phenotypic continuum [ 34 , 35 ] that may also be reflected in the peripheral blood. Notably, while it has been reported that higher T cell levels within TNBC tumors are associated with pCR after NAC [ 36 ], we did not observe statistically significant differences in baseline peripheral blood T cell subsets according to subsequent NAC response. This lack of association may be due to the relatively small TNBC sample size in our study, or to tumor immune phenotype differences (and their association with treatment response) not being fully recapitulated in the peripheral blood.
Additionally, it is possible that peripheral blood T cell dynamics during chemotherapy ± immunotherapy may be more informative than isolated baseline values (the only values available in our study). Indeed, it has been suggested that peripheral blood cytotoxic T cell signatures at the end of NAC may be associated with long-term outcomes among patients with chemotherapy-resistant tumors [ 10 ].
Patients with luminal breast cancer who achieved a more robust response to NAC exhibited higher levels of pre-NAC naïve CD8 + T cells and of CD4 + T EMRA cells compared to those with more extensive residual disease. These findings are in alignment with previous studies in lung and head and neck cancer, which have demonstrated a positive correlation between higher levels of peripheral blood naïve CD8 + T cells and survival [ 37 , 38 ]. In young women with luminal breast cancers, higher intratumoral CD8 + T cells correlate with improved survival [ 39 ]. Naïve T cells—immune cells that have not yet encountered antigen—can differentiate into several types of effector T cells with the capacity to subsequently destroy cancer cells. Effector CD8 + T cells derived from naïve subsets may be better able to maintain their replicative potential and resist exhaustion compared to CD8 + T cells derived from memory subsets [ 40 ]. With regards to CD4 + T EMRA cells, these have been found to be more abundant in the peripheral blood of breast cancer survivors compared to healthy volunteers [ 41 ], but associations with response to chemotherapy are less well understood. Further studies confirming these observations in additional cohorts and exploring underlying mechanisms by which these cells contribute to the anti-tumor immune response in luminal breast cancers are needed.
We observed a moderate negative correlation (rho = − 0.57) between age and pre-NAC levels of peripheral blood naïve CD8 + T cells in patients with luminal breast cancer, raising questions on age as a potential confounder in the association of these cells with treatment response. In this cohort, we found that age did not differ significantly between patients achieving RCB 0/1 and those achieving RCB 2/3. However, a larger dataset would be needed to examine the association of age and baseline peripheral blood naïve CD8 + T cells with chemoresistance in patients with luminal breast cancer.
A growing body of literature suggests that B cell immunity is highly relevant in breast cancer, particularly in the HER2 + subtype and in the context of treatment with trastuzumab [ 42 , 43 , 44 ]. Higher tumor-infiltrating B cells correlate with improved prognosis in various solid tumors, including melanoma, gastrointestinal tumors, non-small cell lung cancer, and ovarian cancer [ 45 , 46 , 47 , 48 , 49 , 50 ]. When compared to healthy controls, patients with breast cancer have higher total peripheral blood B cells, particularly memory B cells [ 51 ]. While we did not observe statistically significant differences in total peripheral blood B cells across breast cancer subtypes, pre-NAC B cells trended higher in patients with HER2 + breast cancer achieving pCR compared to those with residual disease. This observation is in alignment with studies showing that tumor-derived B cell signatures predict response to NAC in HER2 + breast cancer [ 42 ], and that enrichment of tumor-infiltrating B cells correlates with improved survival in TNBC [ 50 , 52 ].
Our study has several strengths, including (1) the use of peripheral blood samples from a prospective clinical trial, with treatment response information, (2) homogeneous treatment that was guideline-concordant at the time of the study, (3) inclusion of all breast cancer subtypes, and (4) the use of a robust CyTOF panel for single-cell immune phenotyping. Limitations include (1) lack of healthy controls, (2) a single PBMC timepoint for evaluation, precluding immune phenotype monitoring throughout NAC, (3) inability to further phenotype the myeloid compartment, (4) the use of cryopreserved samples, which may lead to non-proportional loss of cell types more susceptible to the freeze/thaw process, (5) evaluation limited to the association of immune phenotype with NAC response (without evaluation of long-term outcomes), and (6) limited sample size, impacting the ability to examine luminal A separately from luminal B, or ER + /HER2 + separately from ER-/HER2 + breast cancer, or to carry out multivariate analyses. Additionally, patients with TNBC in this study were treated prior to the introduction of neoadjuvant immunotherapy, which has since become standard [ 53 ]. Further studies longitudinally examining the peripheral blood immune phenotype and the functional state of immune cell populations at various time points throughout NAC, and potential associations with long-term clinical outcomes, may provide further insights into their potential as a minimally invasive biomarker. A prospective evaluation using freshly stained PBMC samples from patients undergoing modern NAC regimens for breast cancer, and including healthy controls, is ongoing in our center (NCT04897009).
Data are available upon reasonable request to the corresponding author.
Loi S, Michiels S, Salgado R, et al. Tumor infiltrating lymphocytes are prognostic in triple negative breast cancer and predictive for trastuzumab benefit in early breast cancer: results from the FinHER trial. Ann Oncol. 2014;25(8):1544–50. https://doi.org/10.1093/annonc/mdu112
Loi S, Sirtaine N, Piette F, et al. Prognostic and predictive value of tumor-infiltrating lymphocytes in a phase III randomized adjuvant breast cancer trial in node-positive breast cancer comparing the addition of docetaxel to doxorubicin with doxorubicin-based chemotherapy: BIG 02-98. J Clin Oncol. 2013;31(7):860–7. https://doi.org/10.1200/JCO.2011.41.0902
Denkert C, von Minckwitz G, Darb-Esfahani S, et al. Tumour-infiltrating lymphocytes and prognosis in different subtypes of breast cancer: a pooled analysis of 3771 patients treated with neoadjuvant therapy. Lancet Oncol. 2018;19(1):40–50. https://doi.org/10.1016/s1470-2045(17)30904-x
Adams S, Gray RJ, Demaria S, et al. Prognostic value of tumor-infiltrating lymphocytes in triple-negative breast cancers from two phase III randomized adjuvant breast cancer trials: ECOG 2197 and ECOG 1199. J Clin Oncol. 2014;32(27):2959.
Loi S, Drubay D, Adams S, et al. Tumor-infiltrating lymphocytes and prognosis: a pooled individual patient analysis of early-stage triple-negative breast cancers. J Clin Oncol. 2019;37(7):559–69. https://doi.org/10.1200/jco.18.01010
Leon-Ferre RA, Polley M-Y, Liu H, et al. Impact of histopathology, tumor-infiltrating lymphocytes, and adjuvant chemotherapy on prognosis of triple-negative breast cancer. Breast Cancer Res Treat. 2018;167(1):89–99.
Leon-Ferre RA, Jonas SF, Salgado R, et al. Tumor-infiltrating lymphocytes in triple-negative breast cancer. JAMA. 2024;331(13):1135–44.
Krieg C, Nowicka M, Guglietta S, et al. High-dimensional single-cell analysis predicts response to anti-PD-1 immunotherapy. Nat Med. 2018;24(2):144–53.
Wistuba-Hamprecht K, Martens A, Weide B, et al. Establishing high dimensional immune signatures from peripheral blood via mass cytometry in a discovery cohort of stage IV melanoma patients. J Immunol. 2017;198(2):927–36.
Axelrod ML, Nixon MJ, Gonzalez-Ericsson PI, et al. Changes in peripheral and local tumor immunity after neoadjuvant chemotherapy reshape clinical outcomes in patients with breast cancer. Clin Cancer Res. 2020;26(21):5668–81.
Huebner H, Rübner M, Schneeweiss A, et al. RNA expression levels from peripheral immune cells, a minimally invasive liquid biopsy source to predict response to therapy, survival and immune-related adverse events in patients with triple negative breast cancer enrolled in the GeparNuevo trial. American Society of Clinical Oncology; 2023.
Spitzer MH, Nolan GP. Mass cytometry: single cells, many features. Cell. 2016;165(4):780–91.
Goetz MP, Kalari KR, Suman VJ, et al. Tumor sequencing and patient-derived xenografts in the neoadjuvant treatment of breast cancer. JNCI J Natl Cancer Inst. 2017;109(7):djw306.
Goldhirsch A, Wood WC, Coates AS, et al. Strategies for subtypes—dealing with the diversity of breast cancer: highlights of the St Gallen International Expert Consensus on the Primary Therapy of Early Breast Cancer 2011. Ann Oncol. 2011;22(8):1736–47.
Symmans WF, Wei C, Gould R, et al. Long-term prognostic risk after neoadjuvant chemotherapy associated with residual cancer burden and breast cancer subtype. J Clin Oncol. 2017;35(10):1049.
Yau C, Osdoit M, van der Noordaa M, et al. Residual cancer burden after neoadjuvant chemotherapy and long-term survival outcomes in breast cancer: a multicentre pooled analysis of 5161 patients. Lancet Oncol. 2022;23(1):149–60.
Rybakowska P, Van Gassen S, Quintelier K, et al. Data processing workflow for large-scale immune monitoring studies by mass cytometry. Comput Struct Biotechnol J. 2021;19:3160–75.
Sahaf B, Pichavant M, Lee BH, et al. Immune profiling mass cytometry assay harmonization: multicenter experience from CIMAC-CIDC. Clin Cancer Res. 2021;27(18):5062–71.
Finck R, Simonds EF, Jager A, et al. Normalization of mass cytometry data with bead standards. Cytometry A. 2013;83(5):483–94. https://doi.org/10.1002/cyto.a.22271
Bagwell CB, Inokuma M, Hunsberger B, et al. Automated data cleanup for mass cytometry. Cytometry A. 2020;97(2):184–98.
Van Gassen S, Callebaut B, Van Helden MJ, et al. FlowSOM: using self-organizing maps for visualization and interpretation of cytometry data. Cytometry A. 2015;87(7):636–45.
McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint https://arxiv.org/abs/1802.03426 (2018).
Leon-Ferre RA, Jonas SF, Salgado R, et al. Abstract PD9-05: stromal tumor-infiltrating lymphocytes identify early-stage triple-negative breast cancer patients with favorable outcomes at 10-year follow-up in the absence of systemic therapy: a pooled analysis of 1835 patients. Cancer Res. 2023;83(5):PD9-05.
Verghese G, Li M, Liu F, et al. Multiscale deep learning framework captures systemic immune features in lymph nodes predictive of triple negative breast cancer outcome in large-scale studies. J Pathol. 2023;260(4):376–89.
Liu F, Hardiman T, Wu K, et al. Systemic immune reaction in axillary lymph nodes adds to tumor-infiltrating lymphocytes in triple-negative breast cancer prognostication. NPJ Breast Cancer. 2021;7(1):86.
Engblom C, Pfirschke C, Pittet MJ. The role of myeloid cells in cancer therapies. Nat Rev Cancer. 2016;16(7):447–62.
Condamine T, Mastio J, Gabrilovich DI. Transcriptional regulation of myeloid-derived suppressor cells. J Leucoc Biol. 2015;98(6):913–22.
Gubin MM, Esaulova E, Ward JP, et al. High-dimensional analysis delineates myeloid and lymphoid compartment remodeling during successful immune-checkpoint cancer therapy. Cell. 2018;175(4):1014–30.
Zhu Y, Herndon JM, Sojka DK, et al. Tissue-resident macrophages in pancreatic ductal adenocarcinoma originate from embryonic hematopoiesis and promote tumor progression. Immunity. 2017;47(2):323–38.
Gabrilovich DI, Nagaraj S. Myeloid-derived suppressor cells as regulators of the immune system. Nat Rev Immunol. 2009;9(3):162–74.
Zhang Y, Chen H, Mo H, et al. Single-cell analyses reveal key immune cell subsets associated with response to PD-L1 blockade in triple-negative breast cancer. Cancer Cell. 2021;39(12):1578–93.
Ye J-h, Wang X-h, Shi J-j, et al. Tumor-associated macrophages are associated with response to neoadjuvant chemotherapy and poor outcomes in patients with triple-negative breast cancer. J Cancer. 2021;12(10):2886.
Almand B, Clark JI, Nikitina E, et al. Increased production of immature myeloid cells in cancer patients: a mechanism of immunosuppression in cancer. J Immunol. 2001;166(1):678–89.
Azizi E, Carr AJ, Plitas G, et al. Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell. 2018;174(5):1293–308.
Lambrechts D, Wauters E, Boeckx B, et al. Phenotype molding of stromal cells in the lung tumor microenvironment. Nat Med. 2018;24(8):1277–89.
Yam C, Yen E-Y, Chang JT, et al. Immune phenotype and response to neoadjuvant therapy in triple-negative breast cancer. Clin Cancer Res. 2021;27(19):5365–75.
Takahashi H, Sakakura K, Ida S, et al. Circulating naïve and effector memory T cells correlate with prognosis in head and neck squamous cell carcinoma. Cancer Sci. 2022;113(1):53.
Zhao X, Zhang Y, Gao Z, et al. Prognostic value of peripheral naive CD8+ T cells in oligometastatic non-small-cell lung cancer. Future Oncol. 2021;18(1):55–65.
Tesch ME, Guzman Arocho YD, Collins LC, et al. Association of tumor-infiltrating lymphocytes (TILs) with clinicopathologic characteristics and prognosis in young women with HR+/HER2-breast cancer (BC). American Society of Clinical Oncology; 2023.
Hinrichs CS, Borman ZA, Gattinoni L, et al. Human effector CD8+ T cells derived from naive rather than memory subsets possess superior traits for adoptive immunotherapy. Blood J Am Soc Hematol. 2011;117(3):808–14.
Arana Echarri A, Struszczak L, Beresford M, et al. Immune cell status, cardiorespiratory fitness and body composition among breast cancer survivors and healthy women: a cross sectional study. Front Physiol. 2023;14:879.
Fernandez-Martinez A, Pascual T, Singh B, et al. Prognostic and predictive value of immune-related gene expression signatures vs tumor-infiltrating lymphocytes in early-stage ERBB2/HER2-positive breast cancer: a correlative analysis of the CALGB 40601 and PAMELA trials. JAMA Oncol. 2023;9(4):490–9.
Taylor C, Hershman D, Shah N, et al. Augmented HER-2–specific immunity during treatment with trastuzumab and chemotherapy. Clin Cancer Res. 2007;13(17):5133–43.
Knutson KL, Clynes R, Shreeder B, et al. Improved survival of HER2+ breast cancer patients treated with trastuzumab and chemotherapy is associated with host antibody immunity against the HER2 intracellular domain. Can Res. 2016;76(13):3702–10.
Fristedt R, Borg D, Hedner C, et al. Prognostic impact of tumour-associated B cells and plasma cells in oesophageal and gastric adenocarcinoma. J Gastrointest Oncol. 2016;7(6):848.
Hennequin A, Derangere V, Boidot R, et al. Tumor infiltration by Tbet+ effector T cells and CD20+ B cells is associated with survival in gastric cancer patients. Oncoimmunology. 2016;5(2):e1054598.
Berntsson J, Nodin B, Eberhard J, et al. Prognostic impact of tumour-infiltrating B cells and plasma cells in colorectal cancer. Int J Cancer. 2016;139(5):1129–39.
Bosisio FM, Wilmott JS, Volders N, et al. Plasma cells in primary melanoma. Prognostic significance and possible role of IgA. Mod Pathol. 2016;29(4):347–58.
Milne K, Köbel M, Kalloger SE, et al. Systematic analysis of immune infiltrates in high-grade serous ovarian cancer reveals CD20, FoxP3 and TIA-1 as positive prognostic factors. PLoS ONE. 2009;4(7):e6412.
Lohr M, Edlund K, Botling J, et al. The prognostic relevance of tumour-infiltrating plasma cells and immunoglobulin kappa C indicates an important role of the humoral immune response in non-small cell lung cancer. Cancer Lett. 2013;333(2):222–8.
Tsuda B, Miyamoto A, Yokoyama K, et al. B-cell populations are expanded in breast cancer patients compared with healthy controls. Breast Cancer. 2018;25(3):284–91.
Kuroda H, Jamiyan T, Yamaguchi R, et al. Prognostic value of tumor-infiltrating B lymphocytes and plasma cells in triple-negative breast cancer. Breast Cancer. 2021;28:904–14.
Schmid P, Cortes J, Pusztai L, et al. Pembrolizumab for early triple-negative breast cancer. N Engl J Med. 2020;382(9):810–21.
Download references
The authors would like to express their gratitude to all the patients and families for their participation in the BEAUTY trial and in this study. We would like to extend our appreciation to the Mayo Clinic Immune Monitoring Core for their assistance with the CyTOF data acquisition and for facilitating the analyses, to the Mayo Clinic Biospecimen Accessioning and Processing Core for their assistance with central biobanking of all biological samples, and to the BEAUTY study team for their support of this study.
This work was supported by CTSA Grant Number KL2 TR002379 from the National Center for Advancing Translational Science (NCATS) to RL-F, the Mayo Clinic Breast Cancer Specialized Program of Research Excellence Grant (P50CA 116201) to RL-F, VJS, JMC, KK, LW, KLK and MPG, a generous gift from the Wohlers Family Foundation to SMA and JC, the Mayo Clinic Cancer Center Support Grant (P30 CA15083-40A2), the Mayo Clinic Center for Individualized Medicine, Nadia’s Gift Foundation, John P. Guider, The Eveleigh Family, the Prospect Creek Foundation, the George M. Eisenberg Foundation for Charities, the Pharmacogenomics Research Network (U19 GM61388, to MPG, LW, RW, KRK, and JNI), NIH R01 CA196648 to LW, the Regis Foundation, and generous support from Afaf Al-Bahar. JCB is the W.H. Odell Professor of Individualized Medicine. RW is the Mary Lou and John H. Dasburg Professor of Cancer Genomics Research. MPG is the Erivan K. Haub Family Professor of Cancer Research Honoring Richard F. Emslander, M.D. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.
Matthew P. Goetz and Jose C. Villasboas are co-senior authors.
Department of Oncology, Mayo Clinic, Rochester, MN, USA
Roberto A. Leon-Ferre, Karthik V. Giridhar, James N. Ingle & Matthew P. Goetz
Division of Hematology, Mayo Clinic, Rochester, MN, USA
Kaitlyn R. Whitaker, Ahmad Al-Jarrad, Stephen M. Ansell & Jose C. Villasboas
Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA
Vera J. Suman, Tanya Hoskin, Raymond M. Moore, Krishna Kalari & Liewei Wang
Department of Surgery, Mayo Clinic, Jacksonville, FL, USA
Sarah A. McLaughlin
Division of Hematology and Oncology, Mayo Clinic, Scottsdale, AZ, USA
Donald W. Northfelt
Department of Radiology, Mayo Clinic, Rochester, MN, USA
Katie N. Hunt & Amy Lynn Conners
Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA
Department of Laboratory Medicine and Pathology, University of Alberta, Edmonton, Alberta, Canada
Jodi M. Carter
Schulze Center for Novel Therapeutics, Mayo Clinic, Rochester, MN, USA
Richard Weinshilboum
Department of Immunology, Mayo Clinic, Jacksonville, FL, USA
Keith L. Knutson
Department of Surgery, Mayo Clinic, Rochester, MN, USA
Judy C. Boughey
You can also search for this author in PubMed Google Scholar
JCV, SMA, RLF, MPG, and JCB conceived and designed the study. KRW performed the CyTOF staining and acquired the immune phenotype data. MPG, JCB, VJS, SAM, DWN, KNH, ALC, AM, JMC, KK, RW, LW and JNI contributed to the design, analyses, and patient sample procurement from the clinical trial leveraged for this study. VJS and TH conducted the statistical analyses. RMM provided bioinformatics support. RLF, JCV, MPG, VJS, TH and KLK analyzed and interpreted the data. RLF and JCV drafted the manuscript. KVG, JNI, AAJ, KLK, VJS, TH, JCB and MPG critically revised the manuscript for important intellectual content. All authors read and approved the final manuscript.
Correspondence to Roberto A. Leon-Ferre .
Ethics approval and consent to participate.
The Mayo Clinic Institutional Review Board and appropriate committees approved this study. All patients provided written informed consent.
Before enrollment in the clinical trial, all patients consented for the treatment of their coded data for the publication of the study results. This publication does not contain identifiable patient data or images.
RL-F: Dr. Leon-Ferre reports consulting fees paid to Mayo Clinic from Gilead Sciences, Lyell Immunopharma and AstraZeneca, outside of the scope of this work, and personal fees for CME activities from MJH Life Sciences. MPG: Dr. Goetz reports personal fees for CME activities from Research to Practice, Clinical Education Alliance, Medscape, and MJH Life Sciences; personal fees serving as a panelist for a panel discussion from Total Health Conferencing and personal fees for serving as a moderator for Curio Science; consulting fees to Mayo Clinic from ARC Therapeutics, AstraZeneca, Biotheranostics, Blueprint Medicines, Lilly, Novartis, Rna Diagnostics, Sanofi Genzyme, Seattle Genetics, Sermonix, Engage Health Media, Laekna and TerSera Therapeutics/Ampity Health; grant funding to Mayo Clinic from Lilly, Pfizer, Sermonix, Loxo, AstraZeneca and ATOSSA Therapeutics; and travel support from Lilly. JCB: Dr. Boughey reports research support paid to Mayo Clinic from Eli Lilly and SymBioSis, outside of the scope of this work, participation on a DSMB for CairnsSurgical, and personal fees for speaking for PER, PeerView, EndoMag and contributing a chapter to UpToDate. KRW, VJS, TH, KVG, RM, AA-J, JMC, KK, RW, LW, JNI, KLK, SMA, and JCV report no conflicts of interest within the scope of this work.
Publisher's note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Prior presentations: 2019 San Antonio Breast Cancer Symposium, 2020 American Society of Clinical Oncology Annual Meeting, and 2022 Association for Clinical and Translational Science Annual Meeting.
Additional file 1., rights and permissions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Reprints and permissions
Cite this article.
Leon-Ferre, R.A., Whitaker, K.R., Suman, V.J. et al. Pre-treatment peripheral blood immunophenotyping and response to neoadjuvant chemotherapy in operable breast cancer. Breast Cancer Res 26 , 97 (2024). https://doi.org/10.1186/s13058-024-01848-z
Download citation
Received : 22 February 2024
Accepted : 22 May 2024
Published : 10 June 2024
DOI : https://doi.org/10.1186/s13058-024-01848-z
Anyone you share the following link with will be able to read this content:
Sorry, a shareable link is not currently available for this article.
Provided by the Springer Nature SharedIt content-sharing initiative
ISSN: 1465-542X
A t test is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups differ from one another.
We'll use a two-sample t test to evaluate if the difference between the two group means is statistically significant. The t test output is below. In the output, you can see that the treatment group (Sample 1) has a mean of 109 while the control group's (Sample 2) average is 100. The p-value for the difference between the groups is 0.112.
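As a sketch of the calculation behind an output like this, a pooled two-sample t statistic can be computed in plain Python. The data below are made up for illustration, not the actual samples behind the quoted means of 109 and 100:

```python
import math
from statistics import mean, stdev

def two_sample_t(a, b):
    """Pooled two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    se = math.sqrt(pooled_var) * math.sqrt(1 / na + 1 / nb)
    return (mean(a) - mean(b)) / se, na + nb - 2

# Hypothetical treatment and control measurements.
treatment = [112, 105, 118, 99, 110, 108]
control = [101, 97, 104, 95, 100, 103]
t, df = two_sample_t(treatment, control)
```

The t value is then compared against a t distribution with `df` degrees of freedom to get the p-value; statistical packages do that last step for you.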
A paired two-sample t-test can be used to capture the dependence of measurements between the two groups. These variations of Student's t-test use observed or collected data to calculate a test statistic, which can then be used to calculate a p-value. Often misinterpreted, the p-value is equal to the probability of collecting data that is at ...
T-Test: A t-test is an analysis of two population means through the use of statistical examination; a t-test with two samples is commonly used with small sample sizes, testing the difference ...
A t test is a statistical technique used to quantify the difference between the mean (average value) of a variable from up to two samples (datasets). The variable must be numeric. Some examples are height, gross income, and amount of weight lost on a particular diet. A t test tells you if the difference you observe is "surprising" based on ...
The t test tells you how significant the differences between group means are. It lets you know if those differences in means could have happened by chance. The t test is usually used when data sets follow a normal distribution but you don't know the population variance. For example, you might flip a coin 1,000 times and find the number of heads follows a normal distribution for all trials.
Hypothesis tests work by taking the observed test statistic from a sample and using the sampling distribution to calculate the probability of obtaining that test statistic if the null hypothesis is correct. In the context of how t-tests work, you assess the likelihood of a t-value using the t-distribution.
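One way to see "the probability of obtaining that test statistic if the null hypothesis is correct" without t-tables is a brute-force simulation. The sketch below uses hypothetical protein-bar measurements (echoing the label-claim example earlier on this page); `simulated_p_value` is our own helper, not a library function. It draws repeated samples assuming the null is true and counts how often the simulated t is at least as extreme as the observed one:

```python
import math
import random
from statistics import mean, stdev

def t_stat(sample, mu0):
    """One-sample t statistic: (x_bar - mu0) / (s / sqrt(n))."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))

def simulated_p_value(sample, mu0, sims=20000, seed=1):
    """Two-sided p-value by simulation: draw same-size samples from a normal
    distribution centred on mu0 (i.e. assuming the null is true) and count
    how often the simulated |t| is at least as extreme as the observed |t|."""
    rng = random.Random(seed)
    n, s = len(sample), stdev(sample)
    observed = abs(t_stat(sample, mu0))
    extreme = 0
    for _ in range(sims):
        null_sample = [rng.gauss(mu0, s) for _ in range(n)]
        if abs(t_stat(null_sample, mu0)) >= observed:
            extreme += 1
    return extreme / sims

# Hypothetical protein-bar measurements against the label claim of 20 g.
bars = [19.2, 20.1, 18.7, 19.5, 20.4, 18.9, 19.8, 19.0]
p = simulated_p_value(bars, 20.0)
```

In practice the p-value comes from the t distribution itself rather than simulation, but the simulated answer converges to the same number and makes the definition concrete.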
Typically, you perform this test to determine whether two population means are different. This procedure is an inferential statistical hypothesis test, meaning it uses samples to draw conclusions about populations. The independent samples t test is also known as the two sample t test. This test assesses two groups.
What is a t-test and when is it used? What types of t-tests are there? What are hypotheses and prerequisites in a t-test? How is a t-test calculated and how ...
The t test is one type of inferential statistics. It is used to determine whether there is a significant difference between the means of two groups. With all inferential statistics, we assume the dependent variable fits a normal distribution. When we assume a normal distribution exists, we can identify the probability of a particular outcome.
Two- and one-tailed tests. The one-tailed test is appropriate when there is a difference between groups in a specific direction. It is less common than the two-tailed test, so the rest of the article focuses on this one. Types of t-test. Depending on the assumptions of your distributions, there are different types of statistical tests.
A t -test (also known as Student's t -test) is a tool for evaluating the means of one or two populations using hypothesis testing. A t-test may be used to evaluate whether a single group differs from a known value (a one-sample t-test), whether two groups differ from each other (an independent two-sample t-test), or whether there is a ...
The T-Test. The t-test assesses whether the means of two groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two groups, and especially appropriate as the analysis for the posttest-only two-group randomized experimental design. Figure 1.
In medical research, various t -tests and Chi-square tests are the two types of statistical tests most commonly used. In any statistical hypothesis testing situation, if the test statistic follows a Student's t distribution under the null hypothesis, it is a t -test. Most frequently used t -tests are: For comparison of mean in single sample ...
T-tests give you an answer to that question. They tell you what the probability is that the differences you found were down to chance. If that probability is very small, then you can be confident that the difference is meaningful (or statistically significant). In a t-test, you start with a null hypothesis - an assumption that the two ...
T-tests are handy hypothesis tests in statistics when you want to compare means. You can compare a sample mean to a hypothesized or target value using a one-sample t-test. You can compare the means of two groups with a two-sample t-test. If you have two groups with paired observations (e.g., before and after measurements), use the paired t-test.
The t-test is frequently used in comparing 2 group means. The compared groups may be independent of each other, such as men and women. Otherwise, compared data are correlated, as in a comparison of blood pressure levels from the same person before and after medication (Figure 1). In this section we will focus on the independent t-test only. There are 2 kinds of independent t-test depending on ...
The t-test is a test in statistics that is used for testing hypotheses regarding the mean of a small sample taken population when the standard deviation of the population is not known. The t-test is used to determine if there is a significant difference between the means of two groups. The t-test is used for hypothesis testing to determine ...
A t test is also known as Student's t test. It is a statistical analysis technique that was developed by William Sealy Gosset in 1908 as a means to control the quality of dark beers. A t test used to test whether there is a difference between two independent sample means is not different from a t test used when there is only one sample (as ...
Key takeaways: A t-test is a statistical calculation that measures the difference in means between two sample groups. T-tests can help you measure the validity of results in fields like marketing, sales and accounting. Conducting a t-test involves inputting the mean and standard deviation values into a defined formula.
T and P are inextricably linked. They go arm in arm, like Tweedledee and Tweedledum. Here's why. When you perform a t-test, you're usually trying to find evidence of a significant difference between population means (2-sample t) or between the population mean and a hypothesized value (1-sample t). The t-value measures the size of the difference ...
Example 1: Fuel Treatment. Researchers want to know if a new fuel treatment leads to a change in the mean miles per gallon of a certain car. To test this, they conduct an experiment in which they measure the mpg of 11 cars with and without the fuel treatment. Since each car is used in each sample, the researchers can use a paired samples t-test ...
A one sample t-test allows us to test whether a sample mean (of a normally distributed interval variable) significantly differs from a hypothesized value. For example, using the hsb2 data file, say we wish to test whether the average writing score (write) differs significantly from 50. We can do this as shown below. t-test /testval = 50 ...
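A minimal pure-Python version of this one-sample test, with hypothetical writing scores standing in for the `hsb2` data, looks like:

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """t = (x_bar - mu0) / (s / sqrt(n)); also returns df = n - 1."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n)), n - 1

# Hypothetical writing scores; H0 is that the population mean is 50.
scores = [52, 48, 55, 60, 47, 53, 51, 58]
t, df = one_sample_t(scores, 50)
```

The resulting t value and degrees of freedom are what a package such as SPSS (the `t-test /testval = 50` syntax above) or R reports alongside the p-value.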
A paired t-test takes paired observations (like before and after), subtracts one from the other, and conducts a 1-sample t-test on the differences. Typically, a paired t-test determines whether the paired differences are significantly different from zero.
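That reduction, a paired test as a one-sample test on the differences, can be sketched with hypothetical before/after values:

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t-test: a one-sample t-test on the pairwise differences vs 0."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1

# Hypothetical before/after measurements for five subjects.
before = [20.1, 19.8, 21.0, 20.5, 19.9]
after = [21.2, 20.3, 21.4, 21.1, 20.0]
t, df = paired_t(before, after)
```

Note the degrees of freedom are n − 1 pairs, not 2n − 2, because the pairing collapses the two measurements into one difference per subject.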