{smcl}
[*!version 1.0 20sep2023]{...}
{hline}
{cmd:pooledsd} {it:Calculates pooled standard deviation for a continuous variable by a factor variable.}
{hline}
{title:Syntax}
{cmd:pooledsd} {depvar} {ifin}, by({var}) [mdiff(num)]
{title:Description}
{p} Calculates the pooled standard deviation for a continuous variable using the groups listed in a factor variable {p_end}
{phang} sqrt((({it:n}_1 - 1)*{it:sd}_1^2 + ({it:n}_2 - 1)*{it:sd}_2^2 + ... + ({it:n}_{it:k} - 1)*{it:sd}_{it:k}^2)/({it:n}_1 + {it:n}_2 + ... + {it:n}_{it:k} - {it:k})) {p_end}
{p}Where: {p_end}
{phang} {it:n} = the number of observations in a given group. {p_end}
{phang} {it:sd} = the standard deviation for a continuous variable for a given group. {p_end}
{phang} {it:k} = the total number of groups for which the standard deviation is being pooled. {p_end}
{title:Options}
{phang} {opt by(var)} is required. It specifies the factor variable whose groups define the standard deviations to be pooled for {it:depvar}. {p_end}
{phang} {opt mdiff(num)} is optional. The supplied mean difference is divided by the pooled standard deviation, and the resulting Cohen's d is reported in the output. {p_end}
{title:Example #1}
{p}We are interested in exploring the variability in average January temperatures. {p_end}
{phang} {cmd:sysuse citytemp, clear} {p_end}
{phang} {cmd:des tempjan division} {p_end}
{asis}
              storage   display    value
variable name   type    format     label      variable label
...
tempjan         float   %9.0g                 Average January temperature
division        int     %8.0g      division   Census Division
{smcl}
{phang} {cmd:tabstat tempjan, statistics(n mean sd)} {p_end}
{phang2} {asis}
    variable |         N      mean        sd
-------------+------------------------------
     tempjan |       954  35.74895  14.18813
--------------------------------------------
{smcl}
{p}As we can see, the standard deviation for temperature is 14.188 degrees. However, this estimate is based on the grand mean for the contiguous USA and does not account for regional variation across the country. {p_end}
{phang} {cmd:tabstat tempjan, statistics(n mean sd) by(division)} {p_end}
{asis}
Summary for variables: tempjan
     by categories of: division (Census Division)

division |         N      mean        sd
---------+------------------------------
 N. Eng. |        67  26.93134  3.193279
 Mid Atl |        97  28.54433  3.637363
  E.N.C. |       206  22.79126  3.761282
  W.N.C. |        78  18.79744   8.43165
 S. Atl. |       115  49.15739  12.82852
  E.S.C. |        46  40.77826  6.252676
  W.S.C. |        89  45.02809  6.624558
Mountain |        61  32.70164  9.551553
 Pacific |       195   50.4559  7.922557
---------+------------------------------
   Total |       954  35.74895  14.18813
----------------------------------------

{smcl}
{p}Looking at the standard deviation column, we come to suspect that variability in temperature is unequal across census divisions. For example, the standard deviation for South Atlantic is about {it:four times as large} as the standard deviation for New England. We should incorporate information regarding {it:division} into the estimate of the standard deviation. The {cmd:pooledsd} command will do that for us. {p_end}
{phang} {cmd:pooledsd tempjan, by(division)} {p_end}
{asis}
Pooled standard deviation for groups 1 2 3 4 5 6 7 8 9 in division.
There were a total of 954 observations used in the calculation.

       Census |
     Division |         n         sd
--------------+---------------------
 #1 (N. Eng.) |        67    3.19328
 #2 (Mid Atl) |        97   3.637363
  #3 (E.N.C.) |       206   3.761282
  #4 (W.N.C.) |        78    8.43165
 #5 (S. Atl.) |       115   12.82852
  #6 (E.S.C.) |        46   6.252676
  #7 (W.S.C.) |        89   6.624558
#8 (Mountain) |        61   9.551553
 #9 (Pacific) |       195   7.922557

The pooled standard deviation is 7.4429
{smcl}
{p}The pooled standard deviation is approximately half of what the unpooled standard deviation was. This makes sense given that the estimate is now incorporating information about geographical locale, which will be related to variability in temperature. {p_end}
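{p}As a rough cross-check (a sketch using only official Stata commands, not part of {cmd:pooledsd} itself): the formula in the Description is algebraically the root mean squared error of a one-way model of {it:tempjan} on {it:division}, so the two values should agree. {p_end}
{phang} {cmd:* assumes citytemp is still in memory from the commands above} {p_end}
{phang} {cmd:quietly regress tempjan i.division} {p_end}
{phang} {cmd:display e(rmse)} {p_end}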
{title:Example #2}
{p}Assume we are interested in testing whether Group #1 (New England) reports a different mean temperature than the pooled means of Group #2 (Mid-Atlantic) and Group #3 (East North Central). We run an ANOVA on these data and follow up with a {cmd:contrast} test. {p_end}
{phang} {cmd:qui anova tempjan division} {p_end}
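{phang} {cmd:contrast {division -1 .5 .5 0 0 0 0 0 0}, effects} {p_end}
{asis}
Contrasts of marginal linear predictions

Margins      : asbalanced

------------------------------------------------
             |         df           F        P>F
-------------+----------------------------------
    division |          1        1.54     0.2149
             |
 Denominator |        945
------------------------------------------------

------------------------------------------------------------------------------
             |   Contrast   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    division |
        (1)  |  -1.263548   1.018249    -1.24   0.2149    -3.261838   .7347432
------------------------------------------------------------------------------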

{smcl}
{p}Although our hypothesis was not supported (p > .05), we still want to report our findings with a metric of effect size for the two-group comparison. Unfortunately, there is no obvious method to produce an effect size estimate. While it is possible to recode division and take advantage of Stata's {cmd:esize twosample} command, doing so creates a problem. {p_end}
{phang} {cmd:qui recode division (1=2) (2/3=1) (*=.), gen(pooled)} {p_end}
{phang} {cmd:ttest tempjan, by(pooled)} {p_end}
{asis}
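Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |     303      24.633    .2634905    4.586552    24.11449    25.15151
       2 |      67    26.93134    .3901212    3.193279    26.15244    27.71025
---------+--------------------------------------------------------------------
combined |     370    25.04919    .2314825    4.452655      24.594    25.50438
---------+--------------------------------------------------------------------
    diff |            -2.29834    .5898923               -3.458323   -1.138358
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =  -3.8962
Ho: diff = 0                                     degrees of freedom =      368

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0001         Pr(|T| > |t|) = 0.0001          Pr(T > t) = 0.9999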
{smcl}
{p}As can be seen, the magnitude of the mean difference has nearly doubled, from 1.26 to 2.30. This is because Group #2 (Mid-Atlantic) and Group #3 (East North Central) have different numbers of observations, so the recoded group's mean is pulled toward the mean of East North Central. If we use the {cmd:esize twosample} command, which uses the same mean difference as the t test, we will overestimate the resulting effect size. {cmd:pooledsd} offers an alternative. {p_end}
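{p}The size of this shift can be sketched by hand from the division means and group sizes in the {cmd:tabstat} output above. The first line reproduces the size-weighted mean that {cmd:ttest} reports for the recoded group; the second is the equally weighted mean implied by the contrast. {p_end}
{phang} {cmd:* size-weighted mean of Mid-Atlantic and East North Central (97 and 206 observations)} {p_end}
{phang} {cmd:display (97*28.54433 + 206*22.79126)/(97 + 206)} {p_end}
{phang} {cmd:* equally weighted mean of the same two divisions} {p_end}
{phang} {cmd:display (28.54433 + 22.79126)/2} {p_end}
{p}The first value is about 24.63 and the second about 25.67, so the distance from New England's mean of 26.93 is about 2.30 in the first case and 1.26 in the second. {p_end}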
{phang} {cmd:pooledsd tempjan, by(division) mdiff(1.263548)} {p_end}
{asis}
Pooled standard deviation for groups 1 2 3 4 5 6 7 8 9 in division.
There were a total of 954 observations used in the calculation.

       Census |
     Division |         n         sd
--------------+---------------------
 #1 (N. Eng.) |        67  3.1932794
 #2 (Mid Atl) |        97  3.6373629
  #3 (E.N.C.) |       206  3.7612817
  #4 (W.N.C.) |        78  8.4316499
 #5 (S. Atl.) |       115  12.828518
  #6 (E.S.C.) |        46  6.2526763
  #7 (W.S.C.) |        89   6.624558
#8 (Mountain) |        61  9.5515528
 #9 (Pacific) |       195   7.922557

The pooled standard deviation is 7.4429
The Cohen's d estimate is 0.1698
{smcl}
{p}By using {cmd:pooledsd} and specifying the contrast value in {opt mdiff} we can produce an estimate of effect size based on the pooled standard deviation of all groups. {p_end}
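{p}The arithmetic behind the reported effect size can be reproduced directly, assuming (as the Options section states) that Cohen's d is the {opt mdiff} value divided by the pooled standard deviation. {p_end}
{phang} {cmd:* mdiff divided by the pooled standard deviation reported above} {p_end}
{phang} {cmd:display 1.263548/7.4429} {p_end}
{p}This gives roughly 0.17, in line with the reported estimate. {p_end}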
{title:Example #3}
{p}While the previous use of {cmd:pooledsd} produced an estimate of Cohen's d, it used the pooled standard deviation of all groups in {it:division}. As can be seen in the output, Group #1, Group #2, and Group #3 seem to have lower variability in their temperatures. Consequently, we should exclude Groups #4 through #9 so that our effect size estimate better reflects the groups being compared. Let's include only observations in Group #1, Group #2, or Group #3. {p_end}
{phang}{cmd:pooledsd tempjan if division == 1 | division == 2 | division == 3, by(division) mdiff(1.263548)}{p_end}
{asis}
Pooled standard deviation for groups 1 2 3 in division.
There were a total of 370 observations used in the calculation.

       Census |
     Division |         n         sd
--------------+---------------------
 #1 (N. Eng.) |        67  3.1932794
 #2 (Mid Atl) |        97  3.6373629
  #3 (E.N.C.) |       206  3.7612817

The pooled standard deviation is 3.6328
The Cohen's d estimate is 0.3478
{smcl}
{p}Only groups that were given a nonzero weight in the {cmd:contrast} command are now included in the pooled standard deviation estimate. Our Cohen's {it:d} estimate is 0.35, which is conventionally interpreted to be a small effect. {p_end}
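{p}As a check on the formula in the Description, the pooled value for these three groups can be recomputed by hand from the {it:n} and {it:sd} columns above. {p_end}
{phang} {cmd:* numerator: (n - 1)*sd^2 summed over the three groups; denominator: total n minus k} {p_end}
{phang} {cmd:display sqrt((66*3.1932794^2 + 96*3.6373629^2 + 205*3.7612817^2)/(370 - 3))} {p_end}
{p}The result is approximately 3.63, matching the pooled standard deviation reported by {cmd:pooledsd}. {p_end}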
{title:Scalars}
{p} {cmd:pooledsd} stores two scalars in {cmd:r()}. {p_end}
{asis}
scalars:
r(cohd) = Cohen's d estimate
r(psd) = Pooled standard deviation estimate
{smcl}
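{p}A minimal usage sketch, assuming the results behave like those of any r-class command and are read before another command overwrites them: {p_end}
{phang} {cmd:pooledsd tempjan, by(division) mdiff(1.263548)} {p_end}
{phang} {cmd:* list everything left in r(), then display the two scalars documented above} {p_end}
{phang} {cmd:return list} {p_end}
{phang} {cmd:display r(psd)} {p_end}
{phang} {cmd:display r(cohd)} {p_end}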
{title:Author}
Dr. David Speed
Department of Psychology
University of New Brunswick - Saint John
dspeed@unb.ca
{p}{it:Note 1.} While I have tested {cmd:pooledsd}, it is offered 'as-is' with no warranty. However, if you encounter issues or errors, please email me. {p_end}
{hline}