{smcl} {* 29mar2016/01apr2017/04apr2017/13apr2017}{...} {cmd:help rangestat} {hline} {title:Title} {phang} {cmd:rangestat} {hline 2} Generate statistics using observations within range {title:Syntax} {p 8 17 2} {cmd:rangestat} {it:slist} {ifin} {cmd:,} {opt i:nterval(keyvar low high)} [ {it:{help rangestat##table_options:options}} ] {pstd} where {it:slist} is composed of one or more of the following {p 8 17 2} {opt (stat)} { {varlist} | {it:new_varname}{cmd:=}{varname} } [ { {varlist} | {it:new_varname}{cmd:=}{varname} } ...] {p 8 17 2} {opt (flex_stat)} {varlist} {pstd} and {it:stat} is one of {p2colset 9 22 24 2}{...} {* 11apr2017 added skew, kurt; some new names; rearranged NJC}{...} {p2col :{opt obs}}number of raw observations{p_end} {p2col :{opt count}}number of non-missing observations{p_end} {p2col :{opt missing}}number of missing observations{p_end} {p2col :{opt mean}}{p_end} {p2col :{opt sum}}{p_end} {p2col :{opt sd}}standard deviation{p_end} {p2col :{opt variance}}{p_end} {p2col :{opt skewness}}see Cox (2010) for implementation details{p_end} {p2col :{opt kurtosis}}see Cox (2010) for implementation details{p_end} {p2col :{opt min}}minimum{p_end} {p2col :{opt median}}{p_end} {p2col :{opt max}}maximum{p_end} {p2col :{opt first}}first value{p_end} {p2col :{opt last}}last value{p_end} {p2col :{opt firstnm}}first non-missing value{p_end} {p2col :{opt lastnm}}last non-missing value{p_end} {p2colreset}{...} {pstd} and {it:flex_stat} is one of {p2colset 9 22 24 2}{...} {p2col :{opt corr}}correlation, first and second variables{p_end} {p2col :{opt cov}}covariance, first and second variables{p_end} {p2col :{opt reg}}ordinary least squares regression with a constant{p_end} {p2colreset}{...} {pstd} or the name of a user-supplied Mata function. {pstd} If {it:slist} does not start with a {opt (stat)} or {opt (flex_stat)}, {it:slist} is prefixed by {cmd:(mean)}. {synoptset 27 tabbed}{...} {marker table_options}{...} {synopthdr} {synoptline} {p2coldent :* {opt i:nterval(keyvar low high)}}use observations where {it:keyvar} is within the bounds indicated by {it:low} and {it:high} {p_end} {synopt :{opth by(varlist)}}the set of observations to use is found within {it:by} group {p_end} {synopt :{opt excl:udeself}}set the input variables to missing for the current observation {p_end} {synopt :{opt casew:ise}}casewise deletion of observations within variable groups {p_end} {synopt :{opt d:escribe}}runs {help describe} to show the names of the new variables created {p_end} {synopt :{opt local(name)}}define a local macro {it:name} that contains the names of the variables created {p_end} {synoptline} {phang}* {opt i:nterval(keyvar low high)} is required. {it:keyvar} is a numeric variable. The lower and upper bound of the closed interval to use for each observation can be specified using a numeric variable, a {it:#}, or a {help missing:system missing value}. If a {it:#} is used, the bound for each observation is computed by adding {it:#} to {it:keyvar}. If {it:low} is specified using a {help missing:system missing value}, {it:low} is set to missing for all observations. {cmd:rangestat} applies the same rules as {help inrange()} for missing bounds. {p2colreset}{...} {title:Description} {pstd} You can find various general and specific comments on problems that require calculating statistics based on other observations in Cox (2007, 2011, 2014). Typically, these types of problems require some form of looping since the desired result is based on a set of observations that may differ from the set used for the previous or next observation. {cmd:rangestat} offers a general solution to such problems as long as the set of observations to use can be expressed using a range of {it:keyvar} values. {pstd} {cmd:rangestat} can do calculations based on a {it:{help rangestat##degenerate_interval:degenerate interval}} (where {bind:{it:low == high}}), a {it:{help rangestat##rolling_window:rolling window}}, a {it:{help rangestat##recursive_window:recursive window}} (where the first period is fixed), a {it:{help rangestat##reversed_recursive_window:reversed recursive window}} (where the last period is fixed), or with {it:{help rangestat##observation_specific_windows:observation-specific windows}}, each independently specified using interval bound variables. {pstd} {it:keyvar} can be any numeric variable and may contain duplicates. If {it:keyvar} is a time variable, {cmd:tsset} or {cmd:xtset} settings are ignored. When {it:keyvar} is missing, the observation is excluded from the sample (its results are set to missing and the observation is not used to calculate statistics for other observations). {pstd} Variables (which should be numeric) are grouped and processed together for each {opt (stat)} or {opt (flex_stat)} part of {it:slist}. A new {opt (stat)} or {opt (flex_stat)} resets the list and starts a new variable group. Within a {opt (stat)} variable group, the statistic is calculated by variable, ignoring missing values in other variables in the group. You can use the {opt casew:ise} option to indicate that a missing value in any variable in the group will exclude the observation. With a {opt (flex_stat)} variable group, all built-in functions implement casewise deletion so there is no need to specify the {opt casew:ise} option for them. If you provide your own Mata function, you need to specify the {opt casew:ise} option if you want {cmd:rangestat} to pass a matrix with no missing values. {pstd} With {opt (stat)} variable groups, {cmd:rangestat} creates one variable per input variable to store the computed statistic. If the form {bind:{it:new_varname=varname}} was used in {it:slist}, the new variable will be named using {it:new_varname}. Otherwise, the variable name will be created by appending "_{it:stat}" to the name of the input variable. A built-in {opt (flex_stat)} statistic creates variables using rules specific to the function called. If {opt (flex_stat)} is an external Mata function, variables created to store the results are named using a combination of {it:flex_stat} and a sequence number. {pstd} For convenience, {cmd:rangestat} checks for intervals with the exact same bounds. In such cases, the results would be exactly the same for repeats within the same bounds so calculations are done only once per interval bound group and results are filled in for repeats. Note that this does not apply if the {opt excl:udeself} option is specified. {marker options}{...} {title:Options} {dlgtab:Options} {phang}{opt i:nterval(keyvar low high)} is required and defines the interval that selects the set of observations to use to calculate result for the current observation. {it:keyvar} is a numeric variable. Observations whose values for {it:keyvar} fall within the closed interval bounds are selected. {it:low} and {it:high} can each be specified using a numeric variable, a {it:#} (a number in Stata parlance), or a {help missing:system missing value}. If a {it:#} is used, the bound for each observation is computed by adding {it:#} to {it:keyvar}. If {it:low} is specified using a {help missing:system missing value}, {it:low} is set to missing for all observations. {cmd:rangestat} applies the same rules as {help inrange()} for missing bounds: if the lower bound is missing, observations will match up to and including the value of {it:high}. If both {it:low} and {it:high} are missing, all observations will match. Note that the treatment of missing values for {it:low} and {it:high} differs in version 1.1 up from the previous version of {cmd:rangestat} and this may require that previous code be adapted. (Use {help which} to find out which version you are running if you do not know.) {phang}{opth by(varlist)} groups observations, so that statistics are generated using only observations within the same group. For example, this option should be specified when you wish calculations to be restricted to given panels or given times for panel or longitudinal data. {phang}{opt excl:udeself} specifies that the input variables are set to missing for the current observation for the purpose of calculating the statistic. Note that the observation is still included if {it:keyvar} is within range: this affects statistics that count observations. If you really want to ignore the observation, use the {opt case:wise} option. {phang}{opt casew:ise} specifies casewise deletion of observations within variable groups. A variable group is the set of input variables that follows a {opt (stat)} or {opt (flex_stat)} in {it:slist}. The {cmd:(corr)}, {cmd:(cov)}, and {cmd:(reg)} {it:flex_stat} functions implement casewise deletion so you do not have to specify this option when using them. {phang}{opt d:escribe} runs {help describe} to show the names of the new variables created. Each new variable is labelled to indicate its source. {phang}{opt local(name)} specifies the name of a local macro that {cmd:rangestat} will populate with the names of the variables it creates. The macro is created within the scope of the calling do-file or program. {title:Setting the interval} {pstd} In most cases, you will want to define interval bounds in relation to the observation's current value for {it:keyvar}. You can do that by specifying each bound using {it:#} (a number in Stata parlance). The {it:#} is added to {it:keyvar} to set the bound. {cmd:rangestat} uses a closed interval, which means that values that match the bound are included. {marker degenerate_interval}{...} {pstd} {ul:A degenerate interval:} {pstd} The simplest case is to calculate statistics using observations with the same {it:keyvar} value. This requires that lower and upper bounds be the same as the value of {it:keyvar}, so you specify {it:low} and {it:high} using 0. The example below calculates the mean and standard deviation of the variables {cmd:price mpg} using observations with the same value for {cmd:rep78}. The example is a bit silly because you can do the same thing using {cmd:egen} functions: {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - interval_0_0}{...} sysuse auto, clear rangestat (min) price mpg (mean) price mpg, interval(rep78 0 0) * redo using egen functions sort rep78 make by rep78: egen min_price = min(price) by rep78: egen min_mpg = min(mpg) by rep78: egen mean_price = mean(price) by rep78: egen mean_mpg = mean(mpg) list rep78 *price* *mpg* if rep78 <= 2, sepby(rep78) {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run interval_0_0 using rangestat.sthlp:click to run})} {pstd} Note that there is one important difference between the {cmd:rangestat} results above and those computed using {cmd:egen} functions: a missing value for {it:keyvar} (in this case {hi:rep78}) excludes the observation from the sample and thus no results are computed. {marker rolling_window}{...} {pstd} {ul:Rolling window interval:} {pstd} With {cmd:rangestat}, you can easily perform calculations on a rolling window. The following example uses a window of 5 years that includes the current observation. By specifying {bind:{cmd:interval(year -4 0)}}, the interval for an observation in 1950 for example will amount to [1950-4, 1950] and evaluate to [1946, 1950]. Since this is panel data, we use the {cmd:by(company)} option to restrict calculations to observations within the same company group. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - interval_rw}{...} webuse grunfeld, clear * include some missing values and omit some random observations set seed 1234 replace invest = . if uniform() < .1 drop if uniform() < .1 rangestat (mean) invest (sd) invest (count) invest, /// interval(year -4 0) by(company) {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run interval_rw using rangestat.sthlp:click to run})} {pstd} As with all rolling window problems, results may not be based on the full window, in this example 5 years of observations. This will be true for the first 4 observations of each panel in the example and for subsequent observations if there are missing values or missing data years. You should always use the {cmd:(count)} statistic to get the number of non-missing values and use the count to reject results based on an insufficient sample size. {marker recursive_window}{...} {pstd} {ul:Recursive window interval:} {pstd} You can also perform calculations on a recursive window where the first period is fixed. The following example specifies {it:low} using a {help missing:system missing value} and {it:high} using 0. When a {help missing:system missing value} is used for the lower bound, {cmd:rangestat} assumes that you want the largest negative number possible for all observations. Hence, for example, the bounds for an observation in 1950 will be [{cmd:c(mindouble)},1950]. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - interval_rec}{...} webuse grunfeld, clear rangestat (sum) invest mvalue kstock, interval(year . 0) by(company) * the above is the same as performing a running sum by company (year): gen double rs_invest = sum(invest) by company (year): gen double rs_mvalue = sum(mvalue) by company (year): gen double rs_kstock = sum(kstock) {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run interval_rec using rangestat.sthlp:click to run})} {marker reversed_recursive_window}{...} {pstd} {ul:Reversed recursive window:} {pstd} Similarly, you can perform calculations on a reversed recursive window where the last period is fixed. The following example specifies {it:low} using 0 and {it:high} using a {help missing:system missing value}. In Stata, a {help missing:system missing value} is a value that is higher than any non-missing value that can be stored. So the bounds for an observation in 1950 for example will be {bind:[1950,.]} and results will be calculated using all observations where {bind:year >= 1950}. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - interval_rev}{...} webuse grunfeld, clear rangestat (sum) invest, interval(year 0 .) by(company) * this above is the same as removing a running sum from the overall total by company (year): egen double invest_total = total(invest) by company (year): gen double rsum = sum(invest) by company (year): gen double match = invest_total - rsum + invest {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run interval_rev using rangestat.sthlp:click to run})} {marker observation_specific_windows}{...} {pstd} {ul:Observation-specific interval:} {pstd} You can also specify an interval that can't be computed simply by adding a {it:#} to {it:low} or {it:high}. The following example finds the average repair record of similarly priced cars, as defined by cars within 10% of the price of the current car. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - interval_10pc}{...} sysuse auto, clear gen low = .9 * price gen high = 1.1 * price rangestat (mean) rep78, interval(price low high) * spot check results for observation 15 list make price rep78 low high rep78_mean in 15 sum rep78 if inrange(price, low[15], high[15]) {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run interval_10pc using rangestat.sthlp:click to run})} {pstd} As a rule, you must specify both bounds but there is no obligation to specify {it:low} and {it:high} the same way. You can use a variable for one bound and a {it:#} or {help missing:system missing value} for the other. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - interval_low_.}{...} sysuse auto, clear gen low = .9 * price rangestat (mean) rep78, interval(price low .) * spot check results for observation 15 list make price rep78 low rep78_mean in 15 sum rep78 if inrange(price, low[15], .) {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run interval_low_. using rangestat.sthlp:click to run})} {title:Controlling the sample} {pstd} {cmd:rangestat} supports the standard Stata {cmd:if} and {cmd:in} qualifiers to reduce the computations to those observations where the condition is true. In practice, you are not likely to use these qualifiers because they restrict which observations get results {it:AND} which observations fall into each observation's interval. So for example you could be interested only in averages for foreign cars with the same repair record, calculated using data only from foreign cars: {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - control_if}{...} sysuse auto, clear rangestat (mean) price (count) price if foreign, interval(rep78 0 0) * spot check for a repair record of 3 for foreign cars only sum price if foreign & rep78 == 3 list make price* if foreign & rep78 == 3 {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run control_if using rangestat.sthlp:click to run})} {pstd} If you browse the results, you can confirm that only foreign cars have results and the example listing confirms that the results are based only on data from foreign cars. {pstd} {ul:Calculate results for all observations using data only from some} {pstd} Let's say that we need the average price of foreign cars with the same repair record and we need that for all cars in the data. The solution is to create a copy of the price variable and replace the prices of domestic cars with missing values. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - control_value}{...} sysuse auto, clear clonevar price_foreign = price if foreign rangestat (mean) price_foreign (count) price_foreign, interval(rep78 0 0) * spot check for a repair record of 3 sum price_foreign if rep78 == 3 list make price* foreign if rep78 == 3 {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run control_value using rangestat.sthlp:click to run})} {pstd} {ul:Calculate results for some observations using data from all} {pstd} Say you want to calculate the mean price per repair record but you want results for a single observation per level of repair record. To prevent {cmd:rangestat} from calculating results for the other observations, simply use bounds where {it:low > high}. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - control_sample}{...} sysuse auto, clear * tag the first observation per level of rep78 bysort rep78 (make): gen first = _n == 1 * create bounds for the first observation, use [1,0] for the rest by rep78: gen low = cond(first, rep78, 1) by rep78: gen high = cond(first, rep78, 0) rangestat (mean) price (count) price, interval(rep78 low high) * show the results and confirm using summarize list rep78 price_* if first by rep78: sum price {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run control_sample using rangestat.sthlp:click to run})} {pstd} Note again that {cmd:rangestat} does not generate results when {cmd:rep78} is missing because those observations are excluded from the sample. {pstd} {ul:Watch out for missing values for {it:keyvar}} {pstd} In the following example we have household data. Each person in the data has their mother's identifier if living in the same household. The mother's id is missing if the mother is deceased or living elsewhere. You can use {cmd:rangestat} with an {bind:{cmd:interval(motherid personid personid)}} to look up, for each observation, how many persons in the household list the {hi:personid} of the current observation as their mother. This will not work, however, if there are missing values in {hi:motherid} as {cmd:rangestat} will ignore all observations where {it:keyvar} is missing. The solution is simple: make a copy of {hi:motherid} and then replace the missing values with an identifier that does not occur in the data. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - control_missing}{...} * Example generated by -dataex-. To install: ssc install dataex clear input float(hhid personid motherid female age) 123 111 888 1 12 123 222 888 0 13 123 333 888 1 14 123 444 999 1 33 123 555 . 0 40 123 666 . 0 60 123 888 . 1 35 123 999 . 1 55 end * since keyvar cannot be missing, replace motherid with a value that * is not in the data clonevar mid = motherid replace mid = 0 if mi(mid) rangestat (count) age (mean) age, interval(mid personid personid) list {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run control_missing using rangestat.sthlp:click to run})} {title:Validating your results} {pstd} As with all things in Stata, the code you use may run without error, but that is not a guarantee of correct results. {pstd} Before you even try {cmd:rangestat} on your problem, you should be able to calculate what you want for any given observation using standard Stata commands. You should know that in Stata you can reference the value of any variable for a particular observation using {help subscripting:explicit subscripting}. For example, the following code will list the fifth observation in the data and then list all observations in the data that have the same value in the variable {hi:year} as that stored in the fifth observation. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - validate_subscripting}{...} webuse grunfeld, clear list in 5 list if year == year[5] {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run validate_subscripting using rangestat.sthlp:click to run})} {pstd} Now let's say that you want to calculate the standard deviation of variable {hi:mvalue} by company using a rolling window of 5 years. To make the problem a bit more complicated, we introduce missing values for {hi:mvalue} and some years are also missing. How would you calculate the measure for a single observation? The simplest solution is to use Stata's {help inrange():inrange()} function to target observations that are within the desired 5-year window from the current observation. And because this is panel data, you have to make sure that you pick up only observations from the same {hi:company}. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - validate_spot}{...} webuse grunfeld, clear * include some missing values and omit some random observations set seed 1234 replace mvalue = . if uniform() < .1 drop if uniform() < .1 * calculate expected results for observation 5 sum mvalue if inrange(year, year[5]-4, year[5]) & company == company[5] * calculate expected results for observation 20 sum mvalue if inrange(year, year[20]-4, year[20]) & company == company[20] {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run validate_spot using rangestat.sthlp:click to run})} {pstd} Once you are satisfied with the set up of the problem for individual observations, it's time to use {cmd:rangestat} to loop over all observations in the data. If the results do not match your individual test cases, then you did not set up the command correctly. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - validate_full}{...} webuse grunfeld, clear * include some missing values and omit some random observations set seed 1234 replace mvalue = . if uniform() < .1 drop if uniform() < .1 * calculate over the whole sample and list results for test observations rangestat (sd) mvalue (count) mvalue, interval(year -4 0) by(company) list in 5 list in 20 {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run validate_full using rangestat.sthlp:click to run})} {title:Additional examples using built-in functions} {pstd} {ul:Mean wage of people of similar age} {pstd} Calculate the mean wage for everyone within 1 year of the age for the current observation: {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - mean_age}{...} sysuse nlsw88, clear * calculate expected results for observation 10 sum wage if inrange(age, age[10]-1, age[10]+1) * calculate expected results for observation 20 sum wage if inrange(age, age[20]-1, age[20]+1) * calculate over the whole sample and list results for test observations rangestat wage, interval(age -1 1) list age wage* in 10 list age wage* in 20 {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run mean_age using rangestat.sthlp:click to run})} {pstd} Repeat, but this time, do it by groups of people of the same race and in the same industry. Also, exclude the wage from the current observation. While we are at it, count the number of observations: {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - mean_age2}{...} sysuse nlsw88, clear * calculate expected results for observation 10 sum wage if inrange(age, age[10]-1, age[10]+1) & race == race[10] & industry == industry[10] & _n != 10 * calculate expected results for observation 20 sum wage if inrange(age, age[20]-1, age[20]+1) & race == race[20] & industry == industry[20] & _n != 20 * calculate over the whole sample and list results for test observations rangestat mwage = wage (count) wage, interval(age -1 1) excludeself by(race industry) list age *wage* in 10 list age *wage* in 20 {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run mean_age2 using rangestat.sthlp:click to run})} {pstd} {ul:Median investment of other firms in a given year} {pstd} Some problems are hard to solve in Stata without a loop. You can easily calculate the median investment from all firms in any given year using a single {cmd:egen} call: {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - median_other}{...} webuse grunfeld, clear bysort year: egen m = median(invest) {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run median_other using rangestat.sthlp:click to run})} {pstd} but there is no equally easy way to get an observation-specific median calculated using all observations within the group {it:except} the one from the current observation. A naive brute force solution is to loop over observations: {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - median_otherx}{...} webuse grunfeld, clear gen double mexclude = . quietly forvalues i=1/`=_N' { sum invest if year == year[`i'] & company != company[`i'], detail replace mexclude = r(p50) in `i' } list in 10/20 {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run median_otherx using rangestat.sthlp:click to run})} {pstd} However, this is much slower than the {cmd:egen} direct solution and will become painfully slow as the number of observations increases. {pstd} With {cmd:rangestat}, all you need is {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - median_otherx2}{...} webuse grunfeld, clear rangestat (median) invest, interval(year 0 0) excludeself list in 10/20 {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run median_otherx2 using rangestat.sthlp:click to run})} {pstd} Note that the {cmd:rangestat} interval can be degenerate, as is the case above. Setting both {it:low} and {it:high} to zero will have the effect of selecting, for the current observation, all (and only) observations that have the same year. As the {opt excludeself} option is specified, the value of the variable {cmd:invest} for each current observation will be ignored. {pstd} It is often possible to find solutions that reduce a problem like this to something simpler that does not require looping over all observations. Cox (2014) shows two such solutions using the example presented here. Both solutions loop over companies instead of observations. In this case, {cmd:rangestat} is faster than either solution and its performance advantage will increase as the number of groups (here companies) increases. {pstd} {ul:Looking up the education of a child's mother within a household} {pstd} {cmd:rangestat} can be used to look up the value of a variable in another observation. Suppose we have household data where {cmd:hhid} and {cmd:personid} uniquely identify individuals in the household and we want to look up the education of the mother for each child in the household. To do an individual case by hand, you could use {cmd:summarize} on the {hi:educ} variable subject to the condition that the {hi:personid} code is the same as the {hi:motherid} for the child and that the {hi:hhid} code is the same as the one for the child. {pstd} It would be tempting to specify the interval using {hi:interval(personid motherid motherid)} to pick up the observation where the value in {hi:personid} is the same as the value of {hi:motherid} for the current observation. However, missing values are allowed when specifying bounds and if both bounds are missing, all observations within the group will be selected. One solution to this problem is to use bounds where {it:low > high} when {hi:motherid} is missing. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - mother_educ}{...} * Example generated by -dataex-. To install: ssc install dataex clear input float(hhid personid motherid educ) 101 1 . 10 101 3 1 8 101 4 1 0 101 5 3 0 102 1 . 9 102 2 1 . 102 4 . 6 102 5 4 2 end * verify that hhid personid uniquely identify observations isid hhid personid, sort * look up education of mother for observation 3 sum educ if personid == motherid[3] & hhid == hhid[3] * use [1,0] bounds if motherid is missing gen low = cond(mi(motherid),1, motherid) gen high = cond(mi(motherid),0, motherid) rangestat (min) educ, interval(personid low high) by(hhid) list, sepby(hhid) {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run mother_educ using rangestat.sthlp:click to run})} {pstd} Since {cmd:hhid} and {cmd:personid} uniquely identify individuals, there will be at most one observation that matches. Hence it does not matter if we request the {cmd:(min)}, {cmd:(max)}, {cmd:(mean)}, or indeed any other statistic that yields the unique value. {pstd} The example above was inspired by this {browse "http://www.statalist.org/forums/forum/general-stata-discussion/general/1323740-linking-mothers-education-with-child-in-individual-dataset-of-all-household":thread} on Statalist. {pstd} {* NJC 22mar2017 new example/11apr2017 quite different example} {ul:Finding panels with runs of observations} {pstd} A common problem in management of panel or longitudinal data is finding panels with long enough runs of observations, say successive years. {pstd} A first version of the problem is detecting whether long enough runs of observations exist at all: panels may start late, finish early or contain gaps. The solution is pretty simple: count, for each observation, how many observations are within the desired window and flag panels if the highest count is less than the target window length. {pstd} A second version is to spell out further that they must be good in some sense, usually that variables of interest are not missing. Here we use an indicator variable which is 1 if values of two variables are present (not missing) and 0 otherwise. The indicator variable approach has the advantage that it can be extended easily to accommodate arbitrarily complicated criteria. We could add other conditions in calculating the indicator, say that firms are in a particular industry or in a certain size interval. When that is done, you simply sum the indicator variable within the desired window. You then flag a panel if the highest count is less than the target window length. {pstd}The Grunfeld data are balanced and contain no missing values. We wilfully wreak some havoc on the first panel and drop odd years for the second. We are looking for panels with at least one run of 3 successive years. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - panel_runs}{...} webuse grunfeld, clear * sprinkle some missing values replace invest = . if inlist(_n, 3, 6, 9, 12, 15) replace mvalue = . if inlist(_n, 17, 18) drop if mod(year,2) & company == 2 * first version, simple observation count rangestat (count) year, interval(year -2 0) by(company) bysort company (year_count): gen to_drop = year_count[_N] < 3 * second version, sum good observations gen good_obs = !missing(invest, mvalue) rangestat (sum) good_obs, interval(year -2 0) by(company) bysort company (good_obs_sum): gen to_drop_mv = good_obs_sum[_N] < 3 sort company year list company-mvalue good_obs to_drop to_drop_mv if to_drop | to_drop_mv {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run panel_runs using rangestat.sthlp:click to run})} {pstd} {ul:Running a regression over a rolling window of time} {pstd} You can use the built-in function {cmd:(reg)} to perform a basic ordinary least squares linear regression (with a constant) over a rolling window of time. {pstd} Before proceeding with a {cmd:rangestat} solution, you should be able to set up the problem for individual observations using Stata's {help regress} commands. Let's say you have panel data and you want a rolling regression over a 7 years window. You need to structure the {cmd:regress} command so that only observations within the same panel group are used and include the current observation as well as those in the six preceding years. You can do this using an {cmd:if} qualifier on the {cmd:regress} command. You use {help subscripting:explicit subscripting} to get the value of a variable for a particular observation. So to perform a regression for observation 15 in the example below, we use {hi:year[15]} to refer to the value of variable {hi:year} for that observation and the {help inrange()} function will be true for all observations in the data where {bind:{hi:year >= year[15]-6}} and {bind:{hi:year <= year[15]}}. Since we want to stick to observations within the panel group, we also add the condition that the company identifier be the same as the one in observation 15. We redo this for observation 40. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - user_reg}{...} webuse grunfeld, clear * calculate the expected results for observation 15 regress invest mvalue kstock if inrange(year, year[15]-6, year[15]) /// & company == company[15] * calculate the expected results for observation 40 regress invest mvalue kstock if inrange(year, year[40]-6, year[40]) /// & company == company[40] {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run user_reg using rangestat.sthlp:click to run})} {pstd} Now you can use {cmd:rangestat} to repeat the same regression for all observations. The {cmd:(reg)} built-in function will create a number of variables to store the results. The first three variables store the number of observations, the R-squared, and the adjusted R-squared. Then follow as many variables as there are predictors (plus the constant) for the coefficients and another series for the standard errors. If {cmd:rangestat} is used to replicate the results above, the variables created will be: {p2colset 9 22 24 2}{...} {p2col :{hi:reg_nobs}}number of obs{p_end} {p2col :{hi:reg_r2}}R-squared{p_end} {p2col :{hi:reg_adj_r2}}adj. R-squared{p_end} {p2col :{hi:b_mvalue}}coef of mvalue{p_end} {p2col :{hi:b_kstock}}coef of kstock{p_end} {p2col :{hi:b_cons}}coef of constant{p_end} {p2col :{hi:se_mvalue}}standard error of mvalue{p_end} {p2col :{hi:se_kstock}}standard error of kstock{p_end} {p2col :{hi:se_cons}}standard error of constant{p_end} {p2colreset}{...} {pstd} With {cmd:(reg)}, as with {help regress}, you specify the response (dependent variable) first, followed by the predictors. The interval bounds are set in relation to the value of the {hi:year} variable for the current observation. The {cmd:by(company)} indicates that the observations within the interval range are to be found within the same panel group. {pstd} Note that we are just showing Stata technique here. You might well think that seven observations is in practice rather few for a regression with two predictors. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - user_reg2}{...} webuse grunfeld, clear rangestat (reg) invest mvalue kstock, interval(year -6 0) by(company) * check results for observation 15 list in 15 regress invest mvalue kstock if inrange(year, year[15]-6, year[15]) /// & company == company[15] * check results for observation 40 list in 40 regress invest mvalue kstock if inrange(year, year[40]-6, year[40]) /// & company == company[40] {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run user_reg2 using rangestat.sthlp:click to run})} {pstd} As with all problems involving calculations over a rolling window, the number of observations will be lower for the leading observations for each group. Missing values will also reduce the number of observations for individual regressions. {cmd:rangestat} will attempt to calculate results, regardless of sample size: it is up to the user to use the {cmd:reg_nobs} variable to reject results if they are based on a sample size that is below an acceptable threshold. Note that {cmd:(reg)} will return missing values for the current observation in cases where Stata's {help regress} would omit variable(s) due to collinearity in the current observation's interval. {pstd} {ul:Calculating monthly covariances on daily data} {pstd} You can use the built-in function {cmd:(cov)} to get the covariance of two variables. {pstd} As with all {cmd:rangestat} problems, the first thing to do is to find a solution for individual observations using standard Stata commands. In this example, we create fake daily stock returns for a large number of firms over a period of 50 months. As with the rolling regression example above, we use {help subscripting:explicit subscripting} to get the value of a variable for a particular observation. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - cov_solo}{...} * fake data on 100 firms for 50 months with 21 daily returns per month clear set seed 123123 set obs 100 gen long firm = _n expand 50 bysort firm: gen month = _n expand 21 bysort firm month: gen ret_day = _n gen return = runiform() if runiform() < .99 gen weight = runiform() egen firm_month = group(firm month) * calculate results using -correlate- for first and last observation corr return weight if firm_month == firm_month[1], covariance dis as res r(cov_12) corr return weight if firm_month == firm_month[_N], covariance dis as res r(cov_12) {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run cov_solo using rangestat.sthlp:click to run})} {pstd} Now you can use {cmd:rangestat}'s {cmd:(cov)} option to repeat the same covariance calculation for all observations in data. While this appears inefficient because the results are the same for every day in the month, {cmd:rangestat} is smart and will only perform calculations once for all observations where the interval is the same and replicate results for repeats with the same interval. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - cov}{...} * fake data on 100 firms for 50 months with 21 daily returns per month clear set seed 123123 set obs 100 gen long firm = _n expand 50 bysort firm: gen month = _n expand 21 bysort firm month: gen ret_day = _n gen return = runiform() if runiform() < .99 gen weight = runiform() egen firm_month = group(firm month) rangestat (cov) return weight, interval(firm_month 0 0) describe list in 1 list in l * compare with results using -correlate- for first and last observation corr return weight if firm_month == firm_month[1], covariance dis as res r(cov_12) corr return weight if firm_month == firm_month[_N], covariance dis as res r(cov_12) {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run cov using rangestat.sthlp:click to run})} {pstd} This example was adapted from {browse "http://www.statalist.org/forums/forum/general-stata-discussion/general/1381867-calculating-monthly-covariances-based-on-daily-data":this thread} on Statalist. {title:Examples with a user-supplied Mata function} {pstd} {ul:Rowwise sorting} {pstd} Say you have data in wide layout and you need to sort values across several variables. You could use {help reshape} to convert the data to long layout, sort observations, and then {help reshape} the data back to wide layout, but that's a rather inefficient way to do this. A better solution is to use {cmd:rowsort} (Cox 2009; {stata search rowsort} to locate), a computationally efficient solution implemented in Mata. {pstd} Since the sort order is observation specific, you can also use {cmd:rangestat} to target each observation and use a simple Mata function to sort values. For each observation, {cmd:rangestat} will call your Mata function with a real matrix that contains the values of the specified variables for the set of observations that fall within the interval bounds. In the following example, {hi:id} uniquely identifies observations and the interval is defined using {hi:interval(id 0 0)}. The only observation that will satisfy this condition is the current observation and {hi:X} will contain a real matrix with a single row. In Mata, the {cmd:sort()} function reorders rows so {hi:X} must first be transposed, sorted according to the values found in column 1, and then transposed back before being returned to {cmd:rangestat}. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - rowsort}{...} * Example generated by -dataex-. To install: ssc install dataex clear input float(id x1 x2 x3 x4 x5 x6) 1 13 14 12 2 23 56 2 2 34 56 43 21 12 3 2 3 45 1 23 34 4 4 6 13 14 22 45 5 2 4 23 56 78 23 end * make sure that id uniquely identifies each observation isid id * define a Mata function that sorts a rowvector mata: mata clear real rowvector rsort(real matrix X) { return(sort(X',1)') } end rangestat (rsort) x1-x6, interval(id 0 0) list, noobs compress {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run rowsort using rangestat.sthlp:click to run})} {pstd} The example above was adapted from this {browse "http://www.statalist.org/forums/forum/general-stata-discussion/general/1383287-row-sorting":Statalist thread}. {pstd} {* NJC 22mar2017 new example}{...} {ul: Moving quantiles using a user-supplied Mata function} {pstd} Given a response variable and a predictor variable, we might be interested in plotting conditional quantiles, particular quantiles for the response calculated within moving windows of the predictor. The function {cmd:mm_quantile()} (Jann 2005) is a very suitable general tool for calculating quantiles. All we need is to decide which quantiles we want and specify those in a wrapper function. Then we call up a graph. {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - moving_quantiles}{...} webuse nlswork, clear * ssc inst moremata needed for -mm_quantile()- mata: mata clear real rowvector myquantile(real colvector X) { return(mm_quantile(X, 1, (0.1, 0.25, 0.5, 0.75, 0.9))) } end rangestat (myquantile) ln_wage, interval(age -2 2) label var myquantile1 "p10" label var myquantile2 "p25" label var myquantile3 "p50" label var myquantile4 "p75" label var myquantile5 "p90" set scheme s1color scatter ln_wage age, ms(oh) mc(gs8) || /// line myquantile? age, sort legend(order(6 5 4 3 2) col(1) pos(3)) /// ytitle("`: var label ln_wage'") yla(, ang(h)) xla(15(5)45) {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run moving_quantiles using rangestat.sthlp:click to run})} {title:Comparison with tsegen (SSC)} {pstd} There is some overlap in functionality between {stata "ssc des tsegen":tsegen} (SSC) and {cmd:rangestat}. Both can calculate statistics over a rolling window of time. For example {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - compared_to_tsegen}{...} webuse grunfeld, clear tsegen double inv_m5b = rowmean(L(0/4).invest) rangestat (mean) invest, interval(year -4 0) by(company) assert inv_m5b == invest_mean {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run compared_to_tsegen using rangestat.sthlp:click to run})} {pstd} In general, {stata "ssc des tsegen":tsegen} will be more efficient with time-series data because Stata is very efficient at expanding {help tsvarlist:time-series varlists} and the {help egen} functions process all observations at the same time. {pstd} With {cmd:rangestat}, observations are processed one at a time. This requires finding, for the current observation, which observations are within the specified range and then computing and storing each statistic for that observation. {cmd:rangestat} is optimized for speed; it does all its computations in Mata and has a very efficient algorithm to identify the set of observations to use. As the window of time increases, {stata "ssc des tsegen":tsegen} will have to expand more temporary variables and {cmd:rangestat} will eventually outperform {stata "ssc des tsegen":tsegen}. In minimal testing, the break-even point appears to be around 50 periods, less if {stata "ssc des tsegen":tsegen} has to be repeatedly called to calculate additional statistics. {pstd} The syntax of {cmd:rangestat} is more flexible (it's very similar to that of {help collapse}) and {cmd:rangestat} can calculate multiple statistics on multiple variables at the same time: {space 8}{hline 27} {it:example do-file content} {hline 27} {cmd}{...} {* example_start - compared_to_tsegen2}{...} webuse grunfeld, clear rangestat (sd) sd_inv=invest kstock (count) invest kstock, interval(year -4 0) by(company) {* example_end}{...} {txt}{...} {space 8}{hline 80} {space 8}{it:({stata rangestat_run compared_to_tsegen2 using rangestat.sthlp:click to run})} {* NJC note 13apr2017 scope for future brief discussion comparing with egen} {title:Stored results} {pstd} {cmd:rangestat} stores the following in {cmd:r()}: {synoptset 15 tabbed}{...} {p2col 5 15 19 2: Macros}{p_end} {synopt:{cmd:r(newvars)}}list of variables created{p_end} {p2colreset}{...} {title:References} {pstd} Cox, N.J. 2007. {browse "http://www.stata-journal.com/sjpdf.html?articlenum=pr0033":Events in intervals.} {it:Stata Journal} 7: 440{c -}443. {pstd} Cox, N.J. 2009. {browse "http://www.stata-journal.com/sjpdf.html?articlenum=pr0046":Rowwise.} {it:Stata Journal} 9: 137{c -}157. {pstd} Cox, N.J. 2010. {browse "http://www.stata-journal.com/article.html?article=st0204":The limits of sample skewness and kurtosis.} {it:Stata Journal} 10: 482{c -}495. {pstd} Cox, N.J. 2011. {browse "http://www.stata-journal.com/sjpdf.html?articlenum=dm0055":Compared with ....} {it:Stata Journal} 11: 305{c -}314. {pstd} Cox, N.J. 2014. {browse "http://www.stata-journal.com/article.html?article=dm0075":Self and others.} {it:Stata Journal} 14: 432{c -}444. {pstd} Jann, B. 2005 and later updates. moremata: Stata module (Mata) to provide various functions. {browse "https://ideas.repec.org/c/boc/bocode/s455001.html":https://ideas.repec.org/c/boc/bocode/s455001.html} (In Stata, install with {cmd: ssc inst moremata}) {title:Acknowledgements} {pstd} Thanks to Clyde Schechter for kindly showing us an example where {cmd:rangestat} would generate an overflow when computing interval bounds if {it:keyvar + #} could not be stored in a variable of {it:keyvar}'s data type. This was most likely to bite when {it:keyvar} was a byte. Observations with the overflow would be excluded from the sample. This report led to a review of {cmd:rangestat}'s handling of missing interval bounds and it was decided to follow the same rules as {help inrange()} and allow missing bounds. {pstd} Several Statalist members helped indirectly by posting challenging real problems. {title:Authors} {pstd}Robert Picard{p_end} {pstd}picard@netbox.com{p_end} {pstd}Nicholas J. Cox, Durham University, U.K.{p_end} {pstd}n.j.cox@durham.ac.uk{p_end} {pstd}Roberto Ferrer{p_end} {pstd}refp16@gmail.com{p_end} {title:Also see} {psee} Stata: {help egen}, {help rolling}, {help tssmooth}, {help tsvarlist}, {help tsrevar} {p_end} {psee} SSC: {stata "ssc desc rangejoin":rangejoin}, {stata "ssc desc tsegen":tsegen}, {stata "ssc desc mvsumm":mvsumm}, {stata "ssc desc rollstat":rollstat}, {stata "ssc desc egenmore":egenmore} {p_end} {psee} Others: {stata "search vlookup, all":vlookup} {p_end}