------------------------------------------------------------------------------- help for collapseunique -------------------------------------------------------------------------------

Collapse a dataset to unique observations of a specified set of basis variables

collapseunique [varlist] [if exp] [in range] [, by(byvars) testsep fast emptyvarlist]

Description

collapseunique performs an action that is similar to that of collapse, but rather than summarizing over a multitude of values, it yields the unique values within by-groups, provided that the values are indeed unique.

varlist is the set of variables you want to collect which are supposedly functionally dependent on byvars.

by(byvars) specifies the "basis" on which you are reducing the dataset - the variables on which varlist is supposedly functionally dependent.

collapseunique tests to be sure that all of the variables in varlist are functionally dependent on byvars. If that test passes, then the dataset is reduced to one observation per unique value (or unique tuple of values) of byvars, and only those variables mentioned in varlist or byvars are retained. Thus, it generally reduces both observations and variables. Note that in making this reduction, there is no ambiguity of which varlist values to choose, because of the functional dependency.

Other variables (those not mentioned in varlist or byvars) are dropped, since they might not be functionally dependent on byvars, and thus, would, if retained, have no logical connection to the corresponding byvars values; such variables would have an arbitrary representative value retained in the reduction step, and could be inconsistent in repeated runs.

After this successfully runs, the resulting dataset will be sorted by byvars (providing it exists), which also becomes a key for the resulting dataset. (A key is a set of variables that uniquely identify obserations. Thus, in a typical usage, you would not have byvars as the key for the starting dataset, since, in that case, the dataset is already "reduced".)

See funcdep for a further explanation of functional dependency.

Additional Options

testsep performs the functional dependency test of varlist one variable at a time; the default is to test them all at once. The difference is that if the test fails, and if varlist contains multiple variables, then under testsep, you will be able to tell which variable(s) caused the failure. (When exploring data for functional dependency, it may be useful to first use funcdep.)

It is less efficient to use testsep, but the difference is negligible on small datasets. Also, testsep will generate more output noise.

fast has the same effect as the fast option in collapse; it specifies that collapseunique not do extra work (a preserve) so that it can restore the original data should the user press Break - or, additionally, if the functional dependency test fails. Thus, it is faster for large datasets, but is potentially dangerous. See more on this subject under Remarks.

emptyvarlist is required if varlist is absent or is a subset of byvars; it is ignored otherwise. This is a safeguard against accidentally reducing the data; it confirms that you truly intend to reduce to the unique values in byvars. (Without this safeguard, it would be easy to accidentally clobber the data.)

Remarks

The purpose of collapseunique is to extract embedded functional dependency information, which you may subsequently want to save in a separate dataset. Thus, it facilitates data normalization. More on this later. There are other possible uses, which will be explained under Examples.

A simple example may help illustrate the action of this command:

. list, noobs

+-----------+ | a b c | |-----------| | 6 4 7 | | 9 4 7 | | 1 6 5 | | 2 6 5 | | 4 6 5 | | 2 8 5 | +-----------+

. collapseunique c, by(b) (output omitted)

. list, noobs

+-------+ | b c | |-------| | 4 7 | | 6 5 | | 8 5 | +-------+

At this point, you can save this table, and then go back and remove c from the original table. The resulting two tables constitute a normalized version of the original table, where you have "factored out" the functional dependency of c on b. (Note that in this example, c is functionally dependent on b, but not vice-versa.)

You would not normally specify varlist and byvars such that an overlap exists between them, however there is no harm in having one. Any variables common to both are removed from varlist at the outset. This can potentially make varlist empty (if it is a subset of byvars).

varlist may be absent (or it may become empty due to editing as described above), in which case the dataset is reduced to the unique tuples of values in byvars. Note the requirement of the emptyvarlist option in this case. (An absent varlist does not default to _all. Note also, that an empty varlist is always functionally dependent on any byvars.)

by(byvars) may be absent, in which case the whole dataset is treated as one by-group, and is reduced to a single observation containing varlist, provided that the values of varlist are constant over the whole set.

Either varlist or byvars may be absent, but not both.

collapseunique differs from collapse in that it may refuse to do any action at all, depending on the content of the data, whereas collapse does not have such a sensitivity to the data.

collapseunique treats missing values (including the extended missing values) as distinct.

Note regarding fast: If the functional dependency test fails, or if the user presses Break, the dataset is restored to its original state, unless fast is specified. collapseunique starts by filtering on any if or in qualifiers that may be present. Therefore, under fast, if the functional dependency test fails or the user presses Break, the dataset is left in the state as screened by the if or in qualifiers, if any.

Examples

. collapseunique county_median_income, by(state county)

. collapseunique family_income family_size , by(familyid year)

Another use of collapseunique is to reduce a dataset to one observation per by-group after creating aggregate measures such as those generated by egen functions, though in many cases, the result could be achieved in one command using collapse (and collapse has some additional advantages such as the ability to take weights).

. egen meanincome= mean(income), by(state) . egen maxincome= max(income), by(state) . collapseunique meanincome maxincome, by(state)

The foregoing set of commands could be replaced by one collapse command. But there are some types of aggregations that are not handled by collapse, for example,

. egen kurtincome= kurt(income), by(state) . collapseunique kurtincome, by(state)

Yet another possible use of collapseunique is to eliminate duplicate observations:

. collapseunique _all, by(recordid)

The prior example presumes a variable (recordid) which would be key if it weren't for the duplicate obserevations. If you just want to reduce to all unique observations, you could do...

. collapseunique, by(_all) emptyvarlist

Additional Remarks

One way to understand the action of collapseunique is to note that...

. collapseunique a, by(k)

is roughly equivalent to...

. collapse (max) maxa=a (min) mina=a , by(k) . assert maxa==mina . rename maxa a . drop mina

where the dataset is restored if the assert fails. (Code to handle that situation is not shown here. Also, to properly use the latter technique would require the transferrence of labels onto the new a. These examples may not be equivalent if a contains missing values.)

You may ask, what is the difference between these two?:

. collapseunique a b c, by(k)

. collapseunique, by(k a b c) emptyvarlist

The answer is that the first example may or may not execute, depending on the data; the second will always execute (assuming that k, a, b, and c are variables). If the first example does execute successfully, then the two examples will yield the same reduced dataset (putting aside the nominal sort order).

Be aware that functional dependency can arise "by accident", in which case it may be inappropriate to apply a collapseunique operation. See funcdep for more on this matter.

collapseunique has some similarity to contract.

This calls funcdep (another program by the same author) to perform the functional dependency test.

This is written for Stata 8; a version for Stata 6 also exists.

-------------------------------------------------------------------- technical note: As noted, varlist is initially reduced by removing any variables it has in common with byvars. There is no intrinsic need for this, but it improves efficiency and closes a loophole in the emptyvarlist safeguard. --------------------------------------------------------------------

On Normalization

Data normalization is the process of taking a table and possibly splitting it into multiple tables, each of which is "normal" in that it contains no functional dependency between non-key variables. To help understand this, consider the common procedure of merging higher-level information onto lower-level data, such as merging family-based information onto a set of person-based observations. Normalization reverses this process. Non-normal datasets can be convenient in analysis, but are considered as poor design in the primary storage of data. Also, once data are normalized, you can more easily construct new analysis sets by recombining the normalized datasets.

Author

David Kantor. Email kantor.d@att.net if you observe any problems.

Also see