{smcl} {* *! version 1.4.1 MLB 26Sept2022}{...} {cmd:help stdtable} {hline} {title:Title} {phang} {bf:stdtable} {hline 2} Standardize cross-tabulations to pre-specified row and column totals {title:Syntax} {p 8 17 2} {cmd:stdtable} {help varname:rowvar} {help varname:colvar} {ifin} {weight} {cmd:,} {it:options} {synoptset 35 tabbed}{...} {synopthdr} {synoptline} {syntab:Main} {synopt:{cmd:by(}{it:{varname}}{cmd: [, }{opt base:line(#|string)}{cmd:])}}specifies a numeric or string variable to be treated as {it:superrow}. The {cmd:baseline()} sub-option specifies the value in {it:varname} to which the tables are standardized.{p_end} {synopt:{opt baser:ow(matrix)}}matrix with row totals to which the table(s) are standardized{p_end} {synopt:{opt basec:ol(matrix)}}matrix with column totals to which the table(s) are standardized{p_end} {p 41 43 0}The default is to standardize to row and column totals of all 100s if the table is square, and to row totals of 100/(number of rows) and column totals of 100/(number of columns) if the table is not square. {synopt:{opt row}}Standardize such that they can be interpreted as standardized row percentages.{p_end} {synopt:{opt col}}Standardize such that they can be interpreted as standardized column percentages.{p_end} {synopt:{opth f:ormat(%fmt)}}specifies the display format for the output{p_end} {synopt:{opt raw}}also displays the raw counts.{p_end} {synopt:{opt replace}}replace current data with standardized (and raw) counts.{p_end} {synopt:{cmd:replace(}{it:framename} {cmd:[}{it:, replace}{cmd:] )}}replaces the data in frame {it:framename} with standardized (and raw) counts. Requires Stata 16 or higher{p_end} {synopt:{opt name(collectname)}}Specifies the name of the {help tables intro :collection} that {cmd:stdtable} will leave behind. Requires Stata 17 or higher{p_end} {syntab:IPF options} {synopt:{opt tol:erance(#)}}tolerance for the standardized counts; default is 1e-6{p_end} {synopt:{opt iter:ate(#)}}perform maximum of # iterations; default is {cmd:iterate(16000)}{p_end} {synopt:{opt log}}display an iteration log of the maximum relative change in estimated standardized counts and max relative difference between the row totals and target row totals.{p_end} {synoptline} {p2colreset}{...} {p 4 6 2} {cmd:fweight}s, {cmd:aweight}s, and {cmd:iweight}s are allowed; see {help weight}. {title:Description} {pstd} {cmd:stdtable} standardizes a cross-tabulation such that the by fixing the row and column totals (Yule 1912, Mosteller 1968, Agresti 2002: 345-346). These standardized counts are estimated using Iterative Proportional Fitting. By default it sets all the row and column totals to 100 if the number of columns is the same as the number of rows. Consider the following example from Featherman and Hauser (1978) using data collected in the USA as a supplement to the March Current Population Survey by the U.S. Bureau of the Census in 1973: {cmd} . preserve . use "http://www.maartenbuis.nl/software/mob.dta", clear . tab row col [fw=pop], . restore {txt} {p 4 4 2}({stata "stdtable_ex 1":click to run}){p_end} {pstd} There are many more people that went from a farm to lower manual than the other way around. However, the number of people in agriculture strongly declined so sons had to leave the farm. Moreover, the number of people in lower manual occupations were on the increase, offering room for those sons that had to leave their farm.({help stdtable_foot##fert:1}){marker fert} We may be interested in knowing if this asymmetry is completely explained by these changes in the marginal distribution, or if there is more to it. We could look at row (outflow) percentages, but than we only control for the distribution of the father's occupation. Similarly, the column (inflow) percentages only control for the distribution of son's occupation. What we want is something that does both simultaneously, i.e. fix both the column totals and the row totals to 100. This is what {cmd:stdtable} does: {cmd} . preserve . use "http://www.maartenbuis.nl/software/mob.dta", clear . stdtable row col [fw=pop], . restore {txt} {p 4 4 2}({stata "stdtable_ex 2":click to run}){p_end} {pstd} These standardized counts can be interpreted as the row and column percentages that would occur if for both fathers and sons each occupation was equally likely. It appears that the apparent asymmetry was almost entirely due to changes in the marginal distributions. Also, it is now much clearer that farming is much more persistent over generations than the other occupations. {pstd} This table shows the counts that would have occurred when the odds ratios (effects) are the same as in the data, but the row and column totals were all 100. By setting the row and column totals to all the same number we filter out the effect of the marginal distribution. Setting the row and column totals to a 100 works when we have the same number of rows and columns. If the number of rows and columns differ then the total sample size implied by summing the row totals would not match the total sample size when summing the column totals. In that case the default margins will the 100 / (number of columns) for the column totals and 100 / (number of rows) for row totals. These standardized counts can be interpreted as the cell percentages that would have occurred if each category was equally likely to occur. {pstd} As of Stata 17, the results will be displayed with the new {help table} command, which mean that it will leave behind a {help tables intro :collection} that can be easily exported using {help collect export}. {pstd} Standardizing tables can also be useful to compare tables with different marginal distributions. In the example below we look at the race of husbands and wives in the USA for married couples whose husbands were born born between 1821 and 1989 using the 1880 till 2000 censuses and the 2001 till 2014 American Comunity Surveys. We can see that the racial boundaries have become a bit more permeable over time, but that the USA is still very far removed from being a melting pot. {cmd} . preserve . use "http://www.maartenbuis.nl/software/homogamy.dta", clear . stdtable racem racef [fw=freq], by(marcoh) . restore {txt} {p 4 4 2}({stata "stdtable_ex 3":click to run}){p_end} {pstd} The standardized table can be left in memory using the {cmd:replace} option, which can be useful for graphing that table. {stata "ssc desc twby":twby} from {help ssc :SSC} is nice for this. {cmd} . preserve . use "http://www.maartenbuis.nl/software/homogamy.dta", clear . stdtable racem racef [fw=freq] , by(marcoh) replace format(%5.0f) . gen y = -6 . twby racem racef, compact left xoffset(0.4) legend(off): /// > twoway bar std marcoh, barw(8) || /// > scatter y marcoh, msymbol(i) mlab(std) mlabpos(0) /// > yscale(range(0 100)) ylab(none) ytitle("") /// > xlab(1950(10)2010, val angle(30)) . restore {txt} {p 4 4 2}({stata "stdtable_ex 4":click to run}){p_end} {pstd} Setting all the row and column totals to a 100 is nice for filtering out the effect for filtering out the effect of the marginal distributions, but is unrealistic. If we just want to filter out the effects of changes in the marginal distributions over time, we could fix all the margins to be equal to the margins of one cohort, say 2010-2017. In the example below we look at how the row percentages would have developed if the row and column totals would have stayed constant at the 2010-2017 levels. {cmd} . preserve . use "http://www.maartenbuis.nl/software/homogamy.dta", clear . stdtable racem racef [fw=freq], /// > by(marcoh, base(2010)) row raw replace format(%5.0f) . gen marcoh1 = marcoh - 2 . gen marcoh2 = marcoh + 2 . gen y = -7 . twby racem racef , compact left xoffset(.4) /// > title("Raw row percentages and row percentages standardized" /// > "to marginal distributions of marriage cohort 2010-2017") : /// > twoway bar _freq marcoh1 , barwidth(4) scheme(s1color) || /// > bar std marcoh2 , barwidth(4) /// > legend(order(1 "raw" 2 "standardized")) /// > ytitle(row percentages) /// > xlab(1950 "1950-1959" /// > 1960 "1960-1969" /// > 1970 "1970-1979" /// > 1980 "1980-1989" /// > 1990 "1990-1999" /// > 2000 "2000-2009" /// > 2010 "2010-2017", angle(30)) /// > yscale(off range(0 105)) ytitle("") ylab(none) || /// > scatter y marcoh1 , /// > msymbol(i) mlab(_freq) mlabpos(0) mlabcolor(black) || /// > scatter std marcoh2 , /// > msymbol(i) mlab(std) mlabpos(12) mlabcolor(black) . restore {txt} {p 4 4 2}({stata "stdtable_ex 5":click to run}){p_end} {title:Options} {dlgtab:Main} {phang} {cmd:by(}{it:{varname}}{cmd: [, }{opt base:line(#|string)}{cmd:])} specifies a numeric or string variable to be treated as {it:superrow}. The {cmd:baseline()} sub-option specifies the value in {it:varname} to which the tables are standardized.{p_end} {phang} {opt baser:ow(matrix)} matrix with row totals to which the table(s) are standardized. The first cell corresponds to the lowest value of {it:rowvar}, the second cell to the second lowest value of {it:rowvar}, etc.{p_end} {phang} {opt basec:ol(matrix)} matrix with column totals to which the table(s) are standardized. The first cell corresponds to the lowest value of {it:colvar}, the second cell to the second lowest value of {it:colvar}, etc.{p_end} {pmore} The default is to standardize to row and column totals of all 100s if the table is square. In that case the standardized counts can be interpreted as row percentages and as column percentages. if the table is not square, then the default is to standardize the row totals to 100/(number of rows) and the column totals to 100/(number of columns). In that case the standardized counts can be interpreted as cell percentages. {phang} {opt row} Standardize such that the output can be interpreted as standardized row percentages. Cannot be combined with the option {cmd:col}. {phang} {opt col} Standardize such that the output can be interpreted as standardized column percentages. Cannot be combined with the option {cmd:row}. {pmore}{opt row} and {opt col} can be useful when the number of rows is not equal to the number of columns or when you used the {opt baseline()} sub-option in the {cmd:by()} option. {phang} {opth f:ormat(%fmt)} specifies the display format for the output table. This format is also applied to variables left behind by the {cmd:replace} option. The default is %9.3g. {p_end} {phang} {opt raw} also displays the raw counts when the {cmd:row} and {cmd:col} options have not been specified, or raw row and column percentages when the options {cmd:row} or {cmd:col} have been specified.{p_end} {phang} {opt replace} replace current data with standardized (and raw) counts. The row and column totals are returned in observations with missing values on {it:colvar} and {it:rowvar} respectively. {p_end} {phang} {cmd:replace(}{it:framename} {cmd:, [}{it:replace}{cmd:] )} replace data in frame {it:framename} with standardized (and raw) counts. Frame {it: framename} will be created if frame {it:framename} does not exist. {cmd:stdtable} will exit with an error if frame {it:framename} already exists and the {cmd:replace} sub-option has not been specified. Stata 16 or higher is required. The row and column totals are returned in observations with missing values on {it:colvar} and {it:rowvar} respectively. {p_end} {phang} {opt name(collectname)} specifies the name of the {help tables intro :collection} that {cmd:stdtable} leaves behind. The default is {it:stdtable}. Stata 17 or higher is required. {dlgtab:IPF options} {phang} {opt tol:erance(#)} tolerance for the standardized counts; default is 1e-6. Convergance is achieved when the maximum {help f_reldif:relative change} in standardized counts from one iteration to the next is less than {it:#}, {it:and} the maximum relative difference between the row totals and the target row totals is less than {it:#}. (Given the order in which the IPF algorithm is implemented the difference between the column totals and the target column totals is guaranteed be less than {it:#}){p_end} {phang} {opt iter:ate(#)} perform maximum of # iterations; default is {cmd:iterate(16000)}. That may seem a lot, but IPF algorithm is known for requiring a lot of iterations before reaching convergence. Fortunately, each iteration is very quick. {p_end} {phang} {opt log} display an iteration log of the maximum relative change in estimated standardized counts and max relative difference between the row totals and target row totals. Some tables have no solution. An indication that this is the case is when the max rel row diff remains well above the tolerance for all iterations. {p_end} {title:Author} {pstd} Maarten L. Buis,{break}University of Konstanz,{break}maarten.buis@uni.kn {title:References} {pstd} Agresti, A. (2002) {it:Categorical Data Analysis}, second edition. Hoboken: Wiley Interscience. {pstd} Featherman, D.L. and R.M. Hauser (1978) {it:Opportunity and Change}. New York: Academic. {pstd} Mosteller, F. (1968) Association and estimation in contingency tables, {it:Journal of the American Statistical Association}, 63(321): 1-28. {pstd} Yule, U. (1912) On the methods of measuring association between two attributes, {it:Journal of the Royal Statistical Society}, 75(6):579-652.