------------------------------------------------------------------------------- help for carryforward -------------------------------------------------------------------------------
Carry values forward, filling in missing values.
carryforward varname [if exp] [in range], {gen(newvar) | replace} [cfindic(newvar2) back carryalong(varlist)]
by ... : may be used with carryforward; see help by.
Description
carryforward carries values forward from one observation to the next, filling in missing values with the previous value. Note that this is not appropriate for imputing missing values; see Additional Remarks below.
The carrying-forward action proceeds sequentially in the present sort order (or as sorted by bysort), cascading values from one observation to the next, potentially carrying a given value through many observations. The process stops when a nonmissing value (or an excluded observation or the end of a by group) is encountered, and resumes when another missing value is encountered.
An example will illustrate:
. carryforward x, gen(y)
(6 real changes made)

. list, noobs sep(0)

  +---------+
  |  x    y |
  |---------|
  | 12   12 |
  |  4    4 |
  |  .    4 |
  |  .    4 |
  |  .    4 |
  |  3    3 |
  |  .    3 |
  |  7    7 |
  |  .    7 |
  |  .    7 |
  +---------+
Notice that each value is carried until a non-missing value of x is encountered.
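To reproduce the example, the dataset can be entered by hand. The following is a minimal sketch that assumes carryforward is installed; the two assert lines are added here as checks that follow from the listing above and are not part of the original example.

```stata
* Rebuild the ten-observation example dataset shown above.
clear
input x
12
4
.
.
.
3
.
7
.
.
end

* Carry x forward into a new variable y; x itself is left untouched.
carryforward x, gen(y)

* Checks that follow from the listing: no missing values remain in y,
* and y agrees with x wherever x was originally non-missing.
assert !missing(y)
assert y == x if !missing(x)

list, noobs sep(0)
```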
Options
gen(newvar) specifies the new variable that will receive the values.
replace specifies that the new values are to go directly into varname. Under this option, carryforward functions as a replace operation.
You must use either gen() or replace, but not both.
cfindic(newvar2) specifies an indicator variable that will be generated, indicating which observations received carry-forward values.
carryalong(varlist) specifies additional variables that will have their values carried along in concert with varname. These variables get their values carried forward, but the set of observations that are affected is determined by varname rather than the variables in varlist themselves. Be aware that this is essentially a replace operation, with no regard for the original values in varlist. Whereas varname or newvar never have non-missing values overwritten, the variables in varlist can, indeed, have non-missing values overwritten. (If you are concerned about overwriting values, keep a copy in a separate variable. But typically, you would use this option to carry values into what were originally missing values.)
back merely affects the wording of labels and notes, changing "fwd" to "back"; it has no effect on the data. Typically, you would use it when you "fool" carryforward into carrying values backward (see example).
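The following sketch combines gen(), cfindic(), and carryalong() on a small made-up dataset. The variable names src and y_cfi are hypothetical, chosen only for illustration. No output is shown; list the result to inspect what each option produced.

```stata
* Made-up data: x has holes; src is an attribute recorded alongside x.
clear
input x src
12 1
.  .
.  2
3  3
.  .
end

* y receives the carried values, y_cfi flags the observations that
* received them, and src is carried along wherever x was carried.
carryforward x, gen(y) cfindic(y_cfi) carryalong(src)

list, noobs sep(0)
```

According to the description of carryalong() above, the 2 in the third observation of src is expected to be overwritten by the value carried from the first observation, because the affected observations are determined by x and not by src.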
Remarks
The effect of carryforward is sensitive to the sort order of the data. Thus, you should have the data sorted in an order that is meaningful with respect to what is being carried forward. This can be done with a preceding sort operation, or in conjunction with bysort. carryforward will not sort the data, unless you specify bysort with it. With by or bysort, you would typically include a secondary varlist (such as year and negyear in the examples below) to control the order of observations within by groups, and you should be sure that the by variables, taken together, are sufficient to uniquely sort the data, so as to get meaningful and consistent results.
Of course, the use of by or bysort will also constrain the cascading action to stay within by groups.
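For comparison, the cascading behavior within by groups can be sketched with Stata's built-in replace and the _n-1 subscript. This is an illustration of the same idea, not carryforward's actual implementation; the variables id, year, x, x_fill, and x_cf are hypothetical.

```stata
* Hypothetical panel: two persons, entered out of order on purpose.
clear
input id year x
2 2001 .
1 2002 .
1 2000 5
2 2000 8
1 2001 .
2 2002 9
end

* Built-in idiom: sort within id by year, then fill each missing value
* from the previous observation. replace works through the observations
* in order, so a value cascades across consecutive missing values.
* Within a by group, x_fill[_n-1] is missing for the first observation,
* so nothing crosses from one id to the next.
gen x_fill = x
bysort id (year): replace x_fill = x_fill[_n-1] if missing(x_fill)

* The same operation with carryforward; bysort does the sorting.
bysort id (year): carryforward x, gen(x_cf)

* Under the behavior described above, the two results should agree.
assert x_cf == x_fill

list, noobs sepby(id)
```

Here id and year together uniquely sort the data, which is the condition recommended above for consistent results.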
carryforward will create variable labels for generated variables, or will add notes to varname under the replace option.
When values are carried forward, you will see a message such as (22 real changes made), reporting the number of originally missing values that were replaced. This refers to either varname or newvar, depending on which option you used.
When an if exp or in range condition is specified, observations failing the condition will be excluded from having values carried into them, and will interrupt the carrying of values. That is, they are not merely excluded from consideration; they affect subsequent observations. An example will illustrate.
. carryforward x if c1, gen(y)
(4 real changes made)

. list, noobs sep(0)

  +--------------+
  |  x   c1    y |
  |--------------|
  | 12    1   12 |
  |  4    1    4 |
  |  .    1    4 |
  |  .    0    . |
  |  .    1    . |
  |  3    1    3 |
  |  .    1    3 |
  |  7    0    7 |
  |  .    1    7 |
  |  .    1    7 |
  +--------------+
Notice that the fourth observation did not receive a value in y, since c1==0, and that the fifth observation also did not receive a value, as the fourth observation interrupted the flow of values. If, on the other hand, you wish for such excluded observations to not interrupt the flow of values, you should first sort the dataset so as to move these observations out of the way.
--------------------------------------------------------------------
Technical note: It would be possible to program an option such that excluded observations were merely skipped and did not stop the flow of values. (Thus, observation 5 in the above example would receive 4 in y.) This is a potential avenue for future development, and the author welcomes comments on whether this is desirable.
--------------------------------------------------------------------
Also notice that the 0 in c1 in observation 8 had no effect, since x is non-missing in that observation. When observations are excluded by conditions, it is the observations where values are being replaced - not the ones where the values come from - that matter.
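One way to move excluded observations out of the way is sketched below. It is an illustration only; seq and excl are hypothetical helper variables. Only excluded observations with a missing x are moved, so that an excluded observation holding a value (observation 8 here) stays in place and can still serve as a source.

```stata
* Rebuild the example with the condition variable c1.
clear
input x c1
12 1
4  1
.  1
.  0
.  1
3  1
.  1
7  0
.  1
.  1
end

* Remember the original order, then move the excluded observations
* that have nothing to contribute (c1==0 and x missing) to the end.
gen long seq = _n
gen byte excl = !c1 & missing(x)
sort excl seq

carryforward x if c1, gen(y)

* Restore the original order and drop the helper variable.
sort seq
drop excl

list x c1 y, noobs sep(0)
```

Given the behavior described above, observation 5 should now receive 4 in y, while observation 4 remains missing.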
When using carryalong(varlist) there is nothing to stop you from including varname among varlist, but there is no point in doing so. This is effectively equivalent to specifying replace. (If you specified replace, then there is no additional effect; if you specified gen(newvar), then newvar and varname will be equal - as if you had specified both gen(newvar) and replace, if that were allowed.)
Examples
. by personid spellno (year): carryforward statefp, replace
. gen int negyear = -year
. bysort personid (negyear): carryforward educ2, gen(educ2b) back cfindic(educ2b_cbi) carryalong(educ2_from_hw educ2_cfi)
In the latter example, we are going backwards; thus, the back option. Also, educ2_from_hw is an attribute about how educ2 was constructed, so we want it to be carried along with educ2. Similarly for educ2_cfi, but that was actually a cfindic variable from an earlier carryforward operation (not shown). (That earlier operation was in the forward direction; the present one goes backward. In between, certain observations were dropped; otherwise, there would be little use in having educ2_cfi in the carryalong variables.)
Additional Remarks
carryforward is not intended for imputing missing values; indeed, this operation is often a bad choice for missing-value imputation. The intent is, rather, to fill in "holes", where it is natural that a value should prevail from one observation to the next, depending on the order of the data (typically based on time or date). A notable example is where you have datasets of changes in attributes, and after a merge, you are left with missing values in non-matched observations. An example will illustrate.
Suppose you have two or more datasets that represent changes in different attributes over time, say salary and marital status for a set of people. Each dataset should be uniquely sorted on person_id and date, but note that these observations ("events" or "changes") may occur on different dates in the different datasets. That is, there will be non-matched observations when they are merged. Also, it is preferable that these datasets should have non-missing values for the "content" variables (salary, marital status), but our code will handle the possibility of missing values. You would not want to carry actual values through such an observation, as the given missing value presumably signifies a true unknown.
Suppose salary.dta contains salary, and marstat.dta contains marit_stat.
. use salary
. gen byte rec_sal = 1

. merge person_id date using marstat, uniq
. gen byte rec_mar = (_merge==2 | _merge==3)
. drop _merge

. recode rec_sal (mis=0)

. assertky person_id date
. by person_id (date): carryforward salary if ~rec_sal, replace
. by person_id (date): carryforward marit_stat if ~rec_mar, replace
The use of by person_id (date): ensures that you limit the carrying of values to within person-based groups, as you don't want to carry a value from one person to another. The inclusion of (date) ensures that the sort order is correct within each such group.
The use of if ~rec_sal ensures that you don't carry a value into and potentially beyond an originally-missing value. Cases with ~rec_sal always have missing values for salary; they are the holes in salary data created by the merge. On the other hand, missing values in salary for cases with rec_sal were missing in the original salary data; they presumably represent "unknown", rather than being an artifact of the merge. Similarly for rec_mar and marit_stat.
assertky is a program that sorts the data and assures that the sort order is unique. It is available on SSC.
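The merge syntax used above is the form that predates Stata 11. As a sketch for later versions, the same steps could be written with the merge 1:1 syntax, and with the built-in isid command standing in for assertky. The substitution of isid is an assumption made here for illustration; it is not part of the original example.

```stata
use salary, clear
gen byte rec_sal = 1

* Stata 11 and later: the match type is stated explicitly, and the
* datasets need not be sorted beforehand.
merge 1:1 person_id date using marstat
gen byte rec_mar = (_merge==2 | _merge==3)
drop _merge

* Observations that came only from marstat have no salary record.
recode rec_sal (mis=0)

* Built-in check that person_id and date uniquely identify the
* observations; the sort option also leaves the data sorted on them.
isid person_id date, sort

by person_id (date): carryforward salary if ~rec_sal, replace
by person_id (date): carryforward marit_stat if ~rec_mar, replace
```

isid exits with an error if the key variables do not uniquely identify the observations, so the carryforward steps run only when the sort order is unambiguous.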
Author
David Kantor, Institute for Policy Studies, Johns Hopkins University. Email kantor.d@att.net if you observe any problems.
Also See