{smcl}
[*!version 1.0 20sep2023]{...}
{hline}
{cmd:pooledsd} {it:Calculates pooled standard deviation for a continuous variable by a factor variable.} 
{hline}

{title:Syntax}

{cmd:pooledsd} {depvar} {ifin}, by({var}) [mdiff(num)]

{title:Description}

{p} Calculates the pooled standard deviation for a continuous variable using the groups listed in a factor variable {p_end}

{phang} sqrt((({it:n}_1-1)*{it:sd}_1^2 + ({it:n}_2-1)*{it:sd}_2^2 + ... ({it:n}_{it:k}-1)*{it:sd}_{it:k}^2)/({it:n}_1 + {it:n}_2 + ... {it:n}_{it:k} - {it:k})) {p_end}

{p}Where:  {p_end}

{phang} {it:n} = the number of observations in a given group. {p_end}
{phang} {it:sd} = the standard deviation for a continuous variable for a given group. {p_end}
{phang} {it:k} = the total number of groups for which the standard deviation is being pooled. {p_end}

{title:Options}

{opt by(var)} is required. Specifies for which factor variable the depvar should be pooled.

{opt mdiff} is optional. The mdiff value will be divided by the pooled standard deviation value and Cohen's d will be reported in the output.
	
{title:Example #1}

{p}We are interested in exploring the variability in average January temperatures. {p_end}

{phang} {cmd:sysuse citytemp, clear} {p_end}

{phang} {cmd:des tempjan division} {p_end}
{asis}
        
                      storage   display    value
        variable name   type    format     label      variable label
        --------------------------------------------------------------...--------
        tempjan         float   %9.0g                 Average January temperature
        division        int     %8.0g      division   Census Division
{smcl}

{phang} {cmd:tabstat tempjan, statistics(n mean sd)} {p_end}
{phang2} {asis}
            variable |         N      mean        sd
        -------------+------------------------------
             tempjan |       954  35.74895  14.18813

{smcl}

{p}As we can see, the standard deviation for temperature is 14.188 degrees. But, this estimate is based on the grand mean for the contiguous USA and isn't accounting for regional variation across the country. {p_end}

{phang} {cmd:tabstat tempjan, statistics(n mean sd) by(division)} {p_end}
{asis}
        Summary for variables: tempjan
             by categories of: division (Census Division)
        
        division |         N      mean        sd
        ---------+------------------------------
         N. Eng. |        67  26.93134  3.193279
         Mid Atl |        97  28.54433  3.637363
          E.N.C. |       206  22.79126  3.761282
          W.N.C. |        78  18.79744   8.43165
         S. Atl. |       115  49.15739  12.82852
          E.S.C. |        46  40.77826  6.252676
          W.S.C. |        89  45.02809  6.624558
        Mountain |        61  32.70164  9.551553
         Pacific |       195   50.4559  7.922557
        ---------+------------------------------
           Total |       954  35.74895  14.18813
        ----------------------------------------

{smcl}
{p}Looking at the standard deviation column, we come to suspect there is unequal variability in temperature across census divisions. For example, the standard deviation for South Atlantic is {it:four times larger} than the standard deviation for New England. We should incorporate information regarding {it: division} into the estimate of standard deviation. The {cmd:pooledsd} command will do that for us. {p_end}

{phang} {cmd:pooledsd tempjan, by(division)} {p_end}

{asis}
        Pooled standard deviation for groups 1 2 3 4 5 6 7 8 9 in division.
        There were a total of 954 observations used in the calculation.
        
        ----------------------------------
        Census        |
        Division      |        n        sd
        --------------+-------------------
         #1 (N. Eng.) |       67   3.19328
         #2 (Mid Atl) |       97  3.637363
          #3 (E.N.C.) |      206  3.761282
          #4 (W.N.C.) |       78   8.43165
         #5 (S. Atl.) |      115  12.82852
          #6 (E.S.C.) |       46  6.252676
          #7 (W.S.C.) |       89  6.624558
        #8 (Mountain) |       61  9.551553
         #9 (Pacific) |      195  7.922557
        ----------------------------------
        The pooled standard deviation is 7.4429
{smcl}

{p}The pooled standard deviation is approximately half of what the unpooled standard deviation was. This makes sense given that the estimate is now incorporating information about geographical locale, which will be related to variability in temperature. {p_end}

{title:Example #2}

{p}Assume we are interested in testing if Group #1 (New England) reports a different mean temperature than the pooled means of Group #2 (Mid-Atlantic) and Group #3 (East-North-Central). We run an ANOVA on these data and follow up with a {cmd:contrast} test. {p_end}

{phang} {cmd:qui anova tempjan division} {p_end}
{phang} {cmd:contrast {division -1 .5 .5 0 0 0 0 0 0}, effects} {p_end}

{asis}      
        Contrasts of marginal linear predictions
        
        Margins      : asbalanced
        
        ------------------------------------------------
                     |         df           F        P>F
        -------------+----------------------------------
            division |          1        1.54     0.2149
                     |
         Denominator |        945
        ------------------------------------------------
        
        ------------------------------------------------------------------------------
                     |   Contrast   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
            division |
                (1)  |  -1.263548   1.018249    -1.24   0.2149    -3.261838    .7347432
        ------------------------------------------------------------------------------
{smcl}

{p}Although our hypothesis was not supported (p < .05), we still want to report our findings with a metric of effect size for the two-group comparison. Unfortunately, there's not an obvious method to produce an effect size estimate. While it's possible to recode division and take advantage of Stata's {cmd:esize twosample} command, this will produce a problem. {p_end}

{phang} {cmd:qui recode division (1=2) (2/3=1) (*=.), gen(pooled)} {p_end}

{phang} {cmd:ttest tempjan, by(pooled)} {p_end}

{asis}
        Two-sample t test with equal variances
        ------------------------------------------------------------------------------
           Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
        ---------+--------------------------------------------------------------------
               1 |     303      24.633    .2634905    4.586552    24.11449    25.15151
               2 |      67    26.93134    .3901212    3.193279    26.15244    27.71025
        ---------+--------------------------------------------------------------------
        combined |     370    25.04919    .2314825    4.452655      24.594    25.50438
        ---------+--------------------------------------------------------------------
            diff |            -2.29834    .5898923               -3.458323   -1.138358
        ------------------------------------------------------------------------------
            diff = mean(1) - mean(2)                                      t =  -3.8962
        Ho: diff = 0                                     degrees of freedom =      368
        
            Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
         Pr(T < t) = 0.0001         Pr(|T| > |t|) = 0.0001          Pr(T > t) = 0.9999
{smcl}

{p} As can be seen, the mean difference has nearly doubled from |1.26| to |2.30|. This is because Group #2 (Mid-Atlantic) and Group #3 (East-North-Central) had a different number of observations and the estimate is biased toward the mean of East-North-Central. If we use the {cmd:esize twosample} command--which will use the same mean difference as the t-test--we will overestimate the resulting effect size. {cmd:pooledsd} offers an alternative. {p_end}

{phang} {cmd:pooledsd tempjan, by(division) mdiff(-1.263548)} {p_end}

{asis}
        Pooled standard deviation for groups 1 2 3 4 5 6 7 8 9 in division.
        There were a total of 954 observations used in the calculation.
        
        ------------------------------------
        Census        |
        Division      |         n         sd
        --------------+---------------------
         #1 (N. Eng.) |        67  3.1932794
         #2 (Mid Atl) |        97  3.6373629
          #3 (E.N.C.) |       206  3.7612817
          #4 (W.N.C.) |        78  8.4316499
         #5 (S. Atl.) |       115  12.828518
          #6 (E.S.C.) |        46  6.2526763
          #7 (W.S.C.) |        89   6.624558
        #8 (Mountain) |        61  9.5515528
         #9 (Pacific) |       195   7.922557
        ------------------------------------
        The pooled standard deviation is 7.4429
        The Cohen's d estimate is -0.1698
{smcl}

{p}By using {cmd:pooledsd} and specifying the contrast value in {opt mdiff} we can produce an estimate of effect size based on the pooled standard deviation of all groups. {p_end}

{title:Example #3}

{p}While the previous use of {cmd:pooledsd} produced an estimate of Cohen's d, it used the pooled standard deviation of all groups in {it:division}. As can be seen in the output Group #1, Group #2, and Group #3, seem to have lower variability with respect to their temperatures. Consequently, we should exclude Groups #4/#9 to ensure our effect size estimate better reflects the groups being compared. Let's only include respondents who were in Group #1 or Group #2 or Group #3. {p_end}

{phang}{cmd:pooledsd tempjan if division == 1 | division == 2 | division == 3, by(division) mdiff(-1.263548)}{p_end}

{asis}
        Pooled standard deviation for groups 1 2 3 in division.
        There were a total of 370 observations used in the calculation.
        
        -----------------------------------
        Census       |
        Division     |         n         sd
        -------------+---------------------
        #1 (N. Eng.) |        67  3.1932794
        #2 (Mid Atl) |        97  3.6373629
         #3 (E.N.C.) |       206  3.7612817
        -----------------------------------
        The pooled standard deviation is 3.6328
        The Cohen's d estimate is -0.3478
{smcl}

{p}Only groups who were given a non-zero weight in the contrast command are now included in the pooled standard deviation estimate. Our Cohen's {it:d} estimate is |0.35|, which is conventionally interpreted to be a small effect.  {p_end}

{title:Scalars}

{p} {cmd:pooledsd} produces two scalars. {p_end}

{asis}
        scalars:
                       r(cohd) =  Cohen's d estimate
                        r(psd) =  Pooled standard deviation estimate
{smcl}

{title:Author}

Dr. David Speed
Department of Psychology
University of New Brunswick - Saint John
dspeed@unb.ca

{p}{it:Note 1.} While I have tested pooledsd it is offered 'as-is' with no warranty. However, if you encounter issues or errors, please email me. 

{hline}