Results Cohen's D Article
With a Cohen's d of 0.8, 78.8% of the 'treatment' group will be above the mean of the 'control' group (Cohen's U3), 68.9% of the two groups will overlap, and there is a 71.4% chance that a person picked at random from the treatment group will have a higher score than a person picked at random from the control group (probability of superiority).
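The three percentages above can be reproduced with a short calculation. The following is a minimal sketch, assuming both groups are normally distributed with equal standard deviations; the function name `interpret_d` is illustrative and not taken from any particular package.

```python
from math import sqrt
from statistics import NormalDist


def interpret_d(d):
    """Return (U3, overlap, probability of superiority) for a Cohen's d.

    Assumption: two normal populations with equal standard deviations.
    """
    phi = NormalDist().cdf  # standard normal cumulative distribution
    u3 = phi(d)                          # share of treatment above control mean
    overlap = 2 * phi(-abs(d) / 2)       # overlapping area of the two curves
    prob_superiority = phi(d / sqrt(2))  # P(random treatment > random control)
    return u3, overlap, prob_superiority


u3, overlap, ps = interpret_d(0.8)
print(f"U3 = {u3:.1%}, overlap = {overlap:.1%}, PS = {ps:.1%}")
# U3 = 78.8%, overlap = 68.9%, PS = 71.4%
```

If the normality or equal-variance assumption does not hold, these conversions are only rough guides.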
As you read educational research, you’ll encounter t-test (t) and ANOVA (F) statistics frequently. Hopefully, you understand the basics of (statistical) significance testing as related to the null hypothesis and p values, to help you interpret results. If not, see the Significance Testing (t-tests) review for more information. In this class, we’ll consider the difference between statistical significance and practical significance, using a concept called effect size.
The “Significance” Issue
Cohen’s d is a type of effect size between two means. An effect size is a quantitative measure of the magnitude of an effect, in this case the difference between two means. Cohen’s d values are also known as the standardised mean difference (SMD). Since the values are standardised, it is possible to compare values between different variables.
A typical report reads as follows. The effect size for this analysis (d = 1.56) was found to exceed Cohen’s (1988) convention for a large effect (d = .80). These results indicate that individuals in the experimental psychotherapy group (M = 8.45, SD = 3.93) experienced fewer episodes of self-injury following treatment than did individuals in the control group (M = 13.83, SD = 2.14).
In short, the sign of your Cohen’s d effect tells you the direction of the effect. If M1 is your experimental group, and M2 is your control group, then a negative effect size indicates the effect decreases your mean, and a positive effect size indicates that the effect increases your mean. How is Cohen’s d related to statistical significance?
Most statistical measures used in educational research rely on some type of statistical significance measure to authenticate results. Recall that the one thing t-tests, ANOVA, chi square, and even correlations have in common is that interpretation relies on a p value (p = statistical significance). That is why the easy way to interpret significance studies is to look at the direction of the sign (<, =, or >) to understand if the results are statistically meaningful.
While most published statistical reports include information on significance, such measures can cause problems for practical interpretation. For example, a significance test does not tell the size of a difference between two measures (practical significance), nor can it easily be compared across studies. To account for this, the American Psychological Association (APA) recommended all published statistical reports also include effect size (for example, see the APA 5th edition manual section, 1.10: Results section). Further guidance is summarized by Neill (2008):
- When there is no interest in generalizing (e.g., we are only interested in the results for the sample), there is no need for significance testing. In these situations, effect sizes are sufficient and suitable.
- When examining effects using small sample sizes, significance testing can be misleading. Contrary to popular opinion, statistical significance is not a direct indicator of size of effect, but rather it is a function of sample size, effect size, and p level.
- When examining effects using large samples, significance testing can be misleading because even small or trivial effects are likely to produce statistically significant results.
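The dependence of significance on sample size can be illustrated numerically. This is a rough sketch under stated assumptions: two groups of equal size n, so that t = d × √(n/2), and a normal approximation to the t distribution, which is only reasonable for larger samples. The function name `approx_p_value` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_p_value(d, n_per_group):
    """Approximate two-sided p value for a two-group mean comparison.

    Assumptions: equal group sizes, and a normal approximation in place
    of the t distribution (less accurate for small samples).
    """
    t = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(t)))


# The same trivial effect (d = 0.1) at three sample sizes per group.
for n in (20, 200, 2000):
    print(n, round(approx_p_value(0.1, n), 4))
```

With the effect size held fixed at d = 0.1, the approximate p value falls below .05 only at the largest sample size, even though the magnitude of the effect never changes.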
What is Effect Size?
The simple definition of effect size is the magnitude, or size, of an effect. Statistical significance (e.g., p < .05) tells us there was a difference between two groups or more based on some treatment or sorting variable. For example, using a t-test, we could evaluate whether the discussion or lecture method is better for teaching reading to 7th graders:
For six weeks, we use the discussion method to teach reading to Class A, while using the lecture method to teach reading to Class B. At the end of the six weeks, both groups take the same test. The discussion group (Class A), averages 92, while the lecture group (Class B) averages 84.
Recalling the Significance Testing review, we would calculate standard deviation and evaluate the results using a t-test. The results give us a value for p, telling us (if p <.05, for example) the discussion method is superior for teaching reading to 7th graders. What this fails to tell us is the magnitude of the difference. In other words, how much more effective was the discussion method? To answer this question, we standardize the difference and compare it to 0.
Effect Size (Cohen’s d, r) & Standard Deviation
Effect size is a standard measure that can be calculated from any number of statistical outputs.
One type of effect size, the standardized mean effect, expresses the mean difference between two groups in standard deviation units. Typically, you’ll see this reported as Cohen’s d, or simply referred to as “d.” Though the values calculated for effect size are generally low, d is measured in standard deviation units and has no fixed upper or lower limit, so values can be quite large (in practice most fall between -3.0 and 3.0). Interpretation depends on the research question. The meaning of effect size varies by context, but the standard interpretation offered by Cohen (1988) is:
.8 = large (8/10 of a standard deviation unit)
.5 = moderate (1/2 of a standard deviation)
.2 = small (1/5 of a standard deviation)
*Recall from the Correlation review that r can be interpreted as an effect size using the same guidelines. If you are working with correlations, you don’t need to calculate Cohen’s d. If you are asked for effect size, it is r.
Calculating Effect Size (Cohen’s d)
Option 1 (on your own)
Given mean (m) and standard deviation (sd), you can calculate effect size (d). The formula is:
$$d = \frac{m_1 - m_2}{sd_{\text{pooled}}}$$
where m_1 and m_2 are the means of group (or treatment) 1 and group (or treatment) 2, and the pooled sd is
$$sd_{\text{pooled}} = \sqrt{\frac{sd_1^2 + sd_2^2}{2}}$$
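The hand calculation can also be sketched in a few lines of Python. The function name `cohens_d` is illustrative, and the pooled standard deviation is taken as the square root of the average of the two variances, which assumes groups of roughly equal size. The inputs are the team 1 and team 2 summary statistics from the reading exam example under Wording Results.

```python
from math import sqrt


def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d from two means and two standard deviations.

    Assumption: the pooled SD is the square root of the mean of the two
    variances, which suits groups of similar size.
    """
    pooled_sd = sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd


# Team 1 (M = 818.92, SD = 16.11) versus team 2 (M = 828.28, SD = 14.09).
d = cohens_d(818.92, 16.11, 828.28, 14.09)
print(round(d, 2))  # -0.62
```

The negative sign only reflects that team 1 was entered first and scored lower; the magnitude matches the d = .62 reported in that example.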
Option 2 (using an online calculator)
If you have mean and standard deviation already, or the results from a t-test, you can use an online calculator, such as this one. When using the calculator, be sure to only use Cohen’s d when you are comparing groups. If you are working with correlations, you don’t need d. Report and interpret r.
Wording Results
The basic format for group comparison is to provide: population (N), mean (M) and standard deviation (SD) for both samples, the statistical value (t or F), degrees of freedom (df), significance (p), and confidence interval (CI.95). Follow this information with a sentence about effect size, as in the final sentence of each example below.
Effect size example 1 (using a t-test): p ≤ .05, or Significant Results
Among 7th graders in Lowndes County Schools taking the CRCT reading exam (N = 336), there was a statistically significant difference between the two teaching teams, team 1 (M = 818.92, SD = 16.11) and team 2 (M = 828.28, SD = 14.09), t(98) = 3.09, p ≤ .05, CI.95 -15.37, -3.35. Therefore, we reject the null hypothesis that there is no difference in reading scores between teaching teams 1 and 2. Further, Cohen’s effect size value (d = .62) suggested a moderate to high practical significance.
Effect size example 2 (using a t-test): p ≥ .05, or Not Significant Results
Among 7th graders in Lowndes County Schools taking the CRCT science exam (N = 336), there was no statistically significant difference between female students (M = 834.00, SD = 32.81) and male students (M = 841.08, SD = 28.76), t(98) = 1.15, p ≥ .05, CI.95 -19.32, 5.16. Therefore, we fail to reject the null hypothesis that there is no difference in science scores between females and males. Further, Cohen’s effect size value (d = .09) suggested low practical significance.
In our two previous posts on Cohen’s d and standardized effect size measures [1, 2], we learned why we might want to use such a measure, how to calculate it for two independent groups, and why we should always be mindful of what standardizer (i.e., the denominator in d = effect size / standardizer) is used to calculate Cohen’s d.
But how do we interpret Cohen’s d?
First a tangent: bias in Cohen’s d
Most statistical analyses try to inform us about the population we are studying, not the sample of that population we happen to have tested for our study. With Cohen’s d we want to estimate the standardized effect size for a given population.
If our standardizer is an estimate, which it almost always will be, d will be a biased measure and tend to overestimate the population effect size. As pointed out by Cumming and Calin-Jageman in their book Introduction to the New Statistics: Estimation, Open Science, and Beyond, if our sample size is less than 50 we should be reporting d_unbiased. The unbiased version of Cohen’s d is often referred to as Hedges’ g and can easily be calculated by various statistical packages, including R.
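To give a feel for the size of the correction, here is a minimal sketch using the widely used approximate correction factor 1 − 3/(4·df − 1), with df = n1 + n2 − 2 for two independent groups. This is an approximation rather than the exact correction, the function name `hedges_g` is illustrative, and the group sizes in the example are hypothetical.

```python
def hedges_g(d, n1, n2):
    """Shrink Cohen's d with the approximate small-sample correction.

    Assumption: two independent groups, so df = n1 + n2 - 2, and the
    approximate correction factor 1 - 3 / (4 * df - 1).
    """
    df = n1 + n2 - 2
    correction = 1 - 3 / (4 * df - 1)
    return d * correction


# Hypothetical group sizes: the correction matters most for small samples.
print(round(hedges_g(0.62, 10, 10), 3))  # 0.594
print(round(hedges_g(0.62, 50, 50), 3))  # 0.615
```

With 10 participants per group the estimate shrinks noticeably, while with 50 per group the difference from d is small.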
r-based family of effect size values.
There is another family of standardized effect size measures based on r, which is often used in correlation and regression analysis. As explained in this article, eta squared (η²) is the biased version and omega squared (ω²) is the unbiased version.
Thinking about Cohen’s d: Cohen’s reference values
Cohen was reluctant to provide reference values for his standardized effect size measures. Although he stated that d = 0.2, 0.5 and 0.8 correspond to small, medium and large effects, he specified that these values provide a conventional frame of reference which is recommended when no other basis is available.
Thinking about Cohen’s d: overlap pictures
If we can assume that our data comes from a population with a normal distribution, it is helpful to picture the amount of overlap between two distributions associated with various values of Cohen’s d. Below is a figure illustrating the amount of overlap associated with the three d values identified by Cohen (code used to generate the figure is available here).
Figure 1: Examples of overlap between two normally distributed groups for different Cohen d values. The mean of the pink population is 50. The standardizer (i.e., the standard deviation) of the between-group difference is 15. Thus, for a standardised between-group difference of 0.5, the between-group difference (effect size; ES) in original units will be 0.5 = ES/15, which gives 7.5. So the difference between the mean of the two distributions is 1/2 a standard deviation, or 7.5 (figure panel 2).
As you can see, there is considerable overlap between the two distributions even when Cohen’s d indicates a large effect. This means that even for large effects there will be many individuals that go against the population-level pattern. Always keep these types of figures in mind when trying to interpret effect size measures.
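The overlap in each panel of Figure 1 can also be put into numbers. The sketch below uses the `NormalDist.overlap` method from Python's standard library, with the figure's values (first population mean 50, standard deviation 15). Placing the second population above the first, and the helper name `figure_overlap`, are assumptions made for illustration.

```python
from statistics import NormalDist

SD = 15         # standardizer used in Figure 1
PINK_MEAN = 50  # mean of the first (pink) population in Figure 1


def figure_overlap(d):
    """Return (difference in original units, overlapping area) for a d.

    Assumption: both populations are normal with the same SD, and the
    second population's mean lies above the first.
    """
    difference = d * SD
    first = NormalDist(PINK_MEAN, SD)
    second = NormalDist(PINK_MEAN + difference, SD)
    return difference, first.overlap(second)


for d in (0.2, 0.5, 0.8):
    difference, overlap = figure_overlap(d)
    print(f"d = {d}: difference = {difference:.1f}, overlap = {overlap:.0%}")
```

Under these assumptions the overlap is roughly 92%, 80% and 69% for small, medium and large effects, which is why many individuals run against the group-level pattern even when d = 0.8.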
Thinking about Cohen’s d: effect size in original units
This is often the first approach to use when interpreting results. The outcome measure used to compute Cohen’s d may have known reference values (e.g., BMI) or a meaningful scale (e.g., hours of sleep per night).
Thinking about Cohen’s d: the standardizer and the reference population
Cohen’s d is a number of standard deviation units. It is important to ask yourself what standard deviation these units are based on. As was discussed in the previous post, if available it is always better to use an estimate of the population standard deviation rather than the standard deviation of the studied sample. If such a value is not available and the sample standard deviation is used, be aware that, as the denominator in the formula, the standardizer can have a large influence on the value of d.
Thinking about Cohen’s d: values of d across disciplines
In any discipline there is a wide range of effect sizes reported. However, as highlighted by Cumming and Calin-Jageman, researchers in various fields have reported on what range of d values can be expected.
The mean effect size in psychology is d = 0.4, with 30% of effects below 0.2 and 17% greater than 0.8. In education research, the average effect size is also d = 0.4, with 0.2, 0.4 and 0.6 considered small, medium and large effects. In contrast, medical research is often associated with small effect sizes, often in the 0.05 to 0.2 range. Despite being small, these effects often represent meaningful effects such as saving lives. For example, being fit decreases mortality risk in the next 8 years by d = 0.08. Finally, effects as large as d = 5 are common in fields such as pharmacology.
Summary
There is no straightforward way to interpret standardized effect size measures. While they are increasingly being reported in published manuscripts, Cohen’s d and other such measures should not be glossed over. As pointed out in this and previous posts [1, 2], numerous things need to be considered when interpreting these values.