{smcl}
{* *! version 4.004 March 2023}{...}
{help summclust:summclust}
{hline}{...}

{title:Title}

{pstd}
Cluster-specific summary statistics{p_end}

{title:Syntax}

{phang}
{cmd:summclust} {it:varlist}, {it:cluster(varname)} [{it:options}]

{phang}
{it:varlist} contains the dependent variable, the independent variable of interest, and other (binary or continuous) independent variables.

{phang}
{it:cluster} is the clustering variable.

{synoptset 45 tabbed}{...}
{synopthdr}
{synoptline}
{synopt:{opt fevar(varlist)}}creates fixed effects for the included variables, using {cmd:i.varname}.{p_end}
{synopt:{opt absorb(varname)}}partials out this variable from the regression before computing other statistics. This should only be used for variables that are nested within the specified clusters. This option is computationally faster than using {cmd:fevar}.{p_end}
{synopt:{opt jack:knife}}calculates the jackknife variance estimator CV_3J in addition to CV_3.{p_end}
{synopt:{opt add:means}}displays additional summary statistics for cluster variability based on alternative means.{p_end}
{synopt:{opt gstar}}calculates the effective number of clusters G*(0) and G*(1).{p_end}
{synopt:{opt rho(scalar)}}calculates the effective number of clusters, G*(rho), in addition to G*(0) and G*(1). This option can be used with or without the {cmd:gstar} option. The value of rho must be between 0 and 1.{p_end}
{synopt:{opt tab:le}}displays the cluster-by-cluster statistics.{p_end}
{synopt:{opt sam:ple}}allows for sample restrictions. For instance, to restrict the analysis to individuals 25 years of age or older based on a variable "age", add {cmd:sample(age>=25)} to the list of options.{p_end}
{synopt:{opt nog:raph}}suppresses creation of the figure, which is otherwise shown by default.{p_end}
{synopt:{opt reg:table}}displays the full regression table, similar to Stata's {cmd:regress} table but with CV_3 standard errors.
{p_end}

{title:Updates}

{marker description}{...}
{title:Description}

{pstd}
{cmd:summclust} is a stand-alone command that summarizes cluster variability and calculates a cluster jackknife variance estimator. MacKinnon, Nielsen, and Webb (2023) document it more fully than this help file. The command calculates measures of cluster-level influence and leverage. It can optionally calculate the effective number of clusters. By default, it reports CV_1 and CV_3 standard errors, and it can optionally report a CV_3J standard error. It can also, optionally, calculate additional measures of cluster-level heterogeneity. By default, it produces a figure that can help identify the source of cluster-level heterogeneity. Finally, it can produce a full table of regression results.

{pstd}
{cmd:summclust} by default calculates the CV_3 standard error. With well-behaved samples, this should match the standard error calculated using either Stata's native {cmd:jackknife: reg y x, cluster(group)} or {cmd:reg y x, cluster(group) vce(jackknife)} commands. However, many samples are not well behaved. Specifically, some of the omit-one-cluster subsamples may be singular. When that happens, {cmd:summclust} calculates two standard errors. One drops the singular subsamples, as the native Stata routines do. The other uses a generalized inverse. {cmd:summclust} provides guidance as to which standard error is likely to be more reliable. When {cmd:regtable} is specified and singular subsamples are present, two versions of the regression table are displayed. Similarly, if {cmd:jackknife} is specified and there are singular subsamples, four standard errors are shown: CV_3 and CV_3J, each computed either with the generalized inverse or after dropping the singular subsamples.

{pstd}
{cmd:nograph} suppresses creation of the figure, which is otherwise shown by default.
The figure shows four scatter plots: leverage against observations per cluster, partial leverage against observations per cluster, leverage against omit-one-cluster coefficients, and partial leverage against omit-one-cluster coefficients. This figure can be quite informative, but it is computationally costly to produce. We therefore recommend specifying {cmd:nograph} once you have inspected the figure.

{pstd}
{cmd:regtable}: when {cmd:jackknife} is also specified, the regression table uses the CV_3J estimates. Otherwise, it uses the CV_3 estimates.

{title:Stored results}

{pstd}
{cmd:summclust} stores the following in {cmd:r()}:

{p2col 5 20 24 2: Matrices}{p_end}
{synopt:{cmd:r(ng)}}The number of observations per cluster.{p_end}
{synopt:{cmd:r(lever)}}The cluster-specific leverage.{p_end}
{synopt:{cmd:r(part)}}The cluster-specific partial leverage.{p_end}
{synopt:{cmd:r(betanog)}}The estimate of beta when the g_th cluster is omitted.{p_end}

{p2col 5 20 24 2: Scalars}{p_end}
{synopt:{cmd:r(gstarzero)}}The effective number of clusters for the coefficient of interest using rho=0.{p_end}
{synopt:{cmd:r(gstarrho)}}The effective number of clusters for the coefficient of interest using the scalar rho from the {cmd:rho} option.{p_end}
{synopt:{cmd:r(gstarone)}}The effective number of clusters for the coefficient of interest using rho=1.{p_end}
{synopt:{cmd:r(beta)}}The estimated beta for the coefficient of interest.{p_end}
{synopt:{cmd:r(cv1se)}}The CV_1 standard error for the coefficient of interest.{p_end}
{synopt:{cmd:r(cv1t)}}The CV_1 t-statistic for the coefficient of interest.{p_end}
{synopt:{cmd:r(cv1p)}}The P value for the null hypothesis that beta=0 for the coefficient of interest using the CV_1 standard error.{p_end}
{synopt:{cmd:r(cv1lci)}}The lower bound of the 95% confidence interval for beta using the CV_1 standard error.{p_end}
{synopt:{cmd:r(cv1uci)}}The upper bound of the 95% confidence interval for beta using the CV_1 standard error.{p_end}
{synopt:{cmd:r(cv3se)}}The CV_3 standard error for the coefficient of interest.{p_end}
{synopt:{cmd:r(cv3t)}}The CV_3 t-statistic for the coefficient of interest.{p_end}
{synopt:{cmd:r(cv3p)}}The P value for the null hypothesis that beta=0 for the coefficient of interest using the CV_3 standard error.{p_end}
{synopt:{cmd:r(cv3lci)}}The lower bound of the 95% confidence interval for beta using the CV_3 standard error.{p_end}
{synopt:{cmd:r(cv3uci)}}The upper bound of the 95% confidence interval for beta using the CV_3 standard error.{p_end}
{synopt:{cmd:r(cv3Jse)}}The CV_3J standard error for the coefficient of interest.{p_end}
{synopt:{cmd:r(cv3Jt)}}The CV_3J t-statistic for the coefficient of interest.{p_end}
{synopt:{cmd:r(cv3Jp)}}The P value for the null hypothesis that beta=0 for the coefficient of interest using the CV_3J standard error.{p_end}
{synopt:{cmd:r(cv3Jlci)}}The lower bound of the 95% confidence interval for beta using the CV_3J standard error.{p_end}
{synopt:{cmd:r(cv3Juci)}}The upper bound of the 95% confidence interval for beta using the CV_3J standard error.{p_end}

{pstd}
{cmd:summclust} stores the following in {cmd:mata}:

{p2col 5 20 24 2: Matrices}{p_end}
{synopt:{cmd:cvstuff}}The matrix with the standard errors, t-statistics, etc.{p_end}
{synopt:{cmd:clustsum}}The matrix with the measures of cluster variability.{p_end}
{synopt:{cmd:bonus}}The matrix with additional measures of cluster variability. Only calculated when the option {cmd:addmeans} is specified.{p_end}
{synopt:{cmd:scall}}The matrix with the cluster-by-cluster statistics. Only calculated when the option {cmd:table} is specified.{p_end}
{synopt:{cmd:cnames}}The string matrix with the cluster names, to match with elements in scall.
Only calculated when the option {cmd:table} is specified.{p_end}
{synopt:{cmd:regresstab}}The matrix that is displayed when the {cmd:regtable} option is specified.{p_end}
{synopt:{cmd:regresstab}}The additional matrix that is displayed when the {cmd:regtable} option is specified and there are singular subsamples.{p_end}

{title:Examples}

{hline}

{pstd}
nlswork -- using {cmd:regress}

{phang2}{cmd:. webuse nlswork, clear}

{phang2}{cmd:. keep if inrange(age,20,40)}

{phang2}{cmd:. reg ln_wage i.grade i.age i.birth_yr union race msp, cluster(ind)}

{pstd}
nlswork -- using {cmd:summclust}

{phang2}{cmd:. summclust ln_wage msp union race, fevar(grade age birth_yr) cluster(ind)}

{pstd}
adding industry fixed effects -- using {cmd:absorb}

{phang2}{cmd:. summclust ln_wage msp union race, fevar(grade age birth_yr) absorb(ind) cluster(ind)}

{pstd}
sample restrictions -- using {cmd:sample}

{phang2}{cmd:. summclust ln_wage msp union race, fevar(grade age birth_yr) sample(south==1) cluster(ind)}

{pstd}
effective number of clusters -- using {cmd:gstar} or {cmd:rho}

{phang2}{cmd:. summclust ln_wage msp union race, fevar(grade age birth_yr) cluster(ind) gstar}

{phang2}{cmd:. summclust ln_wage msp union race, fevar(grade age birth_yr) cluster(ind) rho(0.5)}

{pstd}
all output

{phang2}{cmd:. summclust ln_wage msp union race, fevar(grade age birth_yr) absorb(ind) cluster(ind) table addmeans jack rho(0.5) regtable}

{title:Author}

{p 4}Matthew D. Webb{p_end}
{p 4}matt.webb@carleton.ca{p_end}

{title:Citation}

{p 4 8 2}{cmd:summclust} is not an official Stata command. It is a free contribution to the research community. Please cite:{p_end}

{p 8 8 2}James G. MacKinnon, Morten Ø. Nielsen, and Matthew D. Webb. 2023. Leverage, Influence, and the Jackknife in Clustered Regression Models: Reliable Inference Using summclust.{p_end}
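
{title:Additional example}

{hline}

{pstd}
A minimal sketch, not taken from the package documentation, of how the results listed under Stored results might be inspected after a {cmd:summclust} call. It assumes the nlswork data used in the Examples section and relies only on Stata's built-in {cmd:return list}, {cmd:display}, and {cmd:matrix list} commands, plus {cmd:mata:} to print a Mata matrix. Output is not shown, and the exact contents may depend on the installed version of {cmd:summclust}.{p_end}

{phang2}{cmd:. webuse nlswork, clear}

{phang2}{cmd:. keep if inrange(age,20,40)}

{phang2}{cmd:. summclust ln_wage msp union race, fevar(grade age birth_yr) cluster(ind) nograph}

{phang2}{cmd:. return list}

{phang2}{cmd:. display r(cv3se)}

{phang2}{cmd:. matrix list r(lever)}

{phang2}{cmd:. mata: clustsum}

{pstd}
{cmd:return list} shows everything left in {cmd:r()}, so it is a quick way to confirm which of the documented scalars and matrices a particular set of options produced before referring to them by name.{p_end}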