{smcl} {* 20Oct2006; prior versions 23mar2004; 22Mar2004; 17Mar2004; 16Mar2004; 04mar2004; 29Oct2003} {hline} help for {hi:collapseunique} {hline} {title:Collapse a dataset to unique observations of a specified set of basis variables} {p 8 17 2} {cmd:collapseunique} [{it:varlist}] [{cmd:if} {it:exp}] [{cmd:in} {it:range}] [{cmd:, by(}{it:byvars}{cmd:)} {cmd:testsep} {cmd:fast} {cmdab:emptyv:arlist}] {title:Description} {p 4 4 2} {cmd:collapseunique} performs an action that is similar to that of {help collapse}, but rather than summarizing over a multitude of values, it yields the unique values within by-groups, provided that the values are indeed unique. {p 4 4 2} {it:varlist} is the set of variables you want to collect which are supposedly functionally dependent on {it:byvars}. {p 4 4 2} {cmd:by(}{it:byvars}{cmd:)} specifies the "basis" on which you are reducing the dataset {c -} the variables on which {it:varlist} is supposedly functionally dependent. {p 4 4 2} {cmd:collapseunique} tests to be sure that all of the variables in {it:varlist} are functionally dependent on {it:byvars}. If that test passes, then the dataset is reduced to one observation per unique value (or unique tuple of values) of {it:byvars}, and only those variables mentioned in {it:varlist} or {it:byvars} are retained. Thus, it generally reduces both observations and variables. Note that in making this reduction, there is no ambiguity of which {it:varlist} values to choose, because of the functional dependency. {p 4 4 2} Other variables (those not mentioned in {it:varlist} or {it:byvars}) are dropped, since they might not be functionally dependent on {it:byvars}, and thus, would, if retained, have no logical connection to the corresponding {it:byvars} values; such variables would have an arbitrary representative value retained in the reduction step, and could be inconsistent in repeated runs. {p 4 4 2} After this successfully runs, the resulting dataset will be sorted by {it:byvars} (providing it exists), which also becomes a key for the resulting dataset. (A key is a set of variables that uniquely identify obserations. Thus, in a typical usage, you would not have {it:byvars} as the key for the starting dataset, since, in that case, the dataset is already "reduced".) {p 4 4 2} See {help funcdep} for a further explanation of functional dependency. {title:Additional Options} {p 4 8 2} {cmd:testsep} performs the functional dependency test of {it:varlist} one variable at a time; the default is to test them all at once. The difference is that if the test fails, and if {it:varlist} contains multiple variables, then under {cmd:testsep}, you will be able to tell which variable(s) caused the failure. (When exploring data for functional dependency, it may be useful to first use {help funcdep}.) {p 8 8 2} It is less efficient to use {cmd:testsep}, but the difference is negligible on small datasets. Also, {cmd:testsep} will generate more output noise. {p 4 8 2} {cmd:fast} has the same effect as the {cmd:fast} option in {help collapse}; it specifies that {cmd:collapseunique} not do extra work (a {help preserve}) so that it can restore the original data should the user press {hi:Break} {c -} or, additionally, if the functional dependency test fails. Thus, it is faster for large datasets, but is potentially dangerous. See more on this subject under {ul:{bf:Remarks}}. {p 4 8 2} {cmdab:emptyv:arlist} is required if {it:varlist} is absent or is a subset of {it:byvars}; it is ignored otherwise. This is a safeguard against accidentally reducing the data; it confirms that you truly intend to reduce to the unique values in {it:byvars}. (Without this safeguard, it would be easy to accidentally clobber the data.) {title:Remarks} {p 4 4 2} The purpose of {cmd:collapseunique} is to extract embedded functional dependency information, which you may subsequently want to {help save} in a separate dataset. Thus, it facilitates data normalization. More on this later. There are other possible uses, which will be explained under {ul:{bf:Examples}}. {p 4 4 2} A simple example may help illustrate the action of this command: {com}. list, noobs {txt} {c TLC}{hline 3}{c -}{hline 3}{c -}{hline 3}{c TRC} {c |} {res}a b c {txt}{c |} {c LT}{hline 3}{c -}{hline 3}{c -}{hline 3}{c RT} {c |} {res}6 4 7 {txt}{c |} {c |} {res}9 4 7 {txt}{c |} {c |} {res}1 6 5 {txt}{c |} {c |} {res}2 6 5 {txt}{c |} {c |} {res}4 6 5 {txt}{c |} {c |} {res}2 8 5 {txt}{c |} {c BLC}{hline 3}{c -}{hline 3}{c -}{hline 3}{c BRC} {com}. collapseunique c, by(b) {txt}{it:(output omitted)} {com}. list, noobs {txt} {c TLC}{hline 3}{c -}{hline 3}{c TRC} {c |} {res}b c {txt}{c |} {c LT}{hline 3}{c -}{hline 3}{c RT} {c |} {res}4 7 {txt}{c |} {c |} {res}6 5 {txt}{c |} {c |} {res}8 5 {txt}{c |} {c BLC}{hline 3}{c -}{hline 3}{c BRC} {p 4 4 2} At this point, you can save this table, and then go back and remove c from the original table. The resulting two tables constitute a {it:normalized} version of the original table, where you have "factored out" the functional dependency of c on b. (Note that in this example, c is functionally dependent on b, but not vice-versa.) {p 4 4 2} You would not normally specify {it:varlist} and {it:byvars} such that an overlap exists between them, however there is no harm in having one. Any variables common to both are removed from {it:varlist} at the outset. This can potentially make {it:varlist} empty (if it is a subset of {it:byvars}). {p 4 4 2} {it:varlist} may be absent (or it may become empty due to editing as described above), in which case the dataset is reduced to the unique tuples of values in {it:byvars}. Note the requirement of the {cmdab:emptyv:arlist} option in this case. (An absent {it:varlist} does not default to {cmd:_all}. Note also, that an empty {it:varlist} is always functionally dependent on any {it:byvars}.) {p 4 4 2} {cmd:by(}{it:byvars}{cmd:)} may be absent, in which case the whole dataset is treated as one by-group, and is reduced to a single observation containing {it:varlist}, provided that the values of {it:varlist} are constant over the whole set. {p 4 4 2} Either {it:varlist} or {it:byvars} may be absent, but not both. {p 4 4 2} {cmd:collapseunique} differs from {help collapse} in that it may refuse to do any action at all, depending on the content of the data, whereas {cmd:collapse} does not have such a sensitivity to the data. {p 4 4 2} {cmd:collapseunique} treats missing values (including the extended missing values) as distinct. {p 4 4 2} Note regarding {cmd:fast}: If the functional dependency test fails, or if the user presses {hi:Break}, the dataset is {help restore}d to its original state, unless {cmd:fast} is specified. {cmd:collapseunique} starts by filtering on any {cmd:if} or {cmd:in} qualifiers that may be present. Therefore, under {cmd:fast}, if the functional dependency test fails or the user presses {hi:Break}, the dataset is left in the state as screened by the {cmd:if} or {cmd:in} qualifiers, if any. {title:Examples} {p 6 6 2} {cmd:. collapseunique county_median_income, by(state county)} {p 6 6 2} {cmd:. collapseunique family_income family_size , by(familyid year)} {p 4 4 2} Another use of {cmd:collapseunique} is to reduce a dataset to one observation per by-group after creating aggregate measures such as those generated by {help egen} functions, though in many cases, the result could be achieved in one command using {help collapse} (and {cmd:collapse} has some additional advantages such as the ability to take weights). {p 6 6 2} {cmd:. egen meanincome= mean(income), by(state)}{p_end} {p 6 6 2} {cmd:. egen maxincome= max(income), by(state)}{p_end} {p 6 6 2} {cmd:. collapseunique meanincome maxincome, by(state)}{p_end} {p 4 4 2} The foregoing set of commands could be replaced by one {help collapse} command. But there are some types of aggregations that are not handled by {cmd:collapse}, for example, {p 6 6 2} {cmd:. egen kurtincome= kurt(income), by(state)}{p_end} {p 6 6 2} {cmd:. collapseunique kurtincome, by(state)}{p_end} {p 4 4 2} Yet another possible use of {cmd:collapseunique} is to eliminate duplicate observations: {p 6 6 2} {cmd:. collapseunique _all, by(recordid)}{p_end} {p 4 4 2} The prior example presumes a variable (recordid) which would be key if it weren't for the duplicate obserevations. If you just want to reduce to all unique observations, you could do... {p 6 6 2} {cmd:. collapseunique, by(_all) emptyvarlist}{p_end} {title:Additional Remarks} {p 4 4 2} One way to understand the action of {cmd:collapseunique} is to note that... {p 6 6 2} {cmd:. collapseunique a, by(k)} {p 4 4 2} is roughly equivalent to... {p 6 6 2} {cmd:. collapse (max) maxa=a (min) mina=a , by(k)}{p_end} {p 6 6 2} {cmd:. assert maxa==mina}{p_end} {p 6 6 2} {cmd:. rename maxa a}{p_end} {p 6 6 2} {cmd:. drop mina}{p_end} {p 4 4 2} where the dataset is {help restore}d if the {help assert} fails. (Code to handle that situation is not shown here. Also, to properly use the latter technique would require the transferrence of labels onto the new a. These examples may not be equivalent if a contains missing values.) {p 4 4 2} You may ask, what is the difference between these two?: {p 6 6 2} {cmd:. collapseunique a b c, by(k)} {p 6 6 2} {cmd:. collapseunique, by(k a b c) emptyvarlist} {p 4 4 2} The answer is that the first example may or may not execute, depending on the data; the second will always execute (assuming that k, a, b, and c are variables). If the first example does execute successfully, then the two examples will yield the same reduced dataset (putting aside the nominal sort order). {p 4 4 2} Be aware that functional dependency can arise "by accident", in which case it may be inappropriate to apply a {cmd:collapseunique} operation. See {help funcdep} for more on this matter. {p 4 4 2} {cmd:collapseunique} has some similarity to {help contract}. {p 4 4 2} This calls {help funcdep} (another program by the same author) to perform the functional dependency test. {p 4 4 2} This is written for Stata 8; a version for Stata 6 also exists. {col 12}{hline} {p 12 12 12} {hi:technical note:} As noted, {it:varlist} is initially reduced by removing any variables it has in common with {it:byvars}. There is no intrinsic need for this, but it improves efficiency and closes a loophole in the {cmd:emptyvarlist} safeguard. {p_end} {col 12}{hline} {title:On Normalization} {p 4 4 2} Data normalization is the process of taking a table and possibly splitting it into multiple tables, each of which is "normal" in that it contains no functional dependency between non-key variables. To help understand this, consider the common procedure of merging higher-level information onto lower-level data, such as merging family-based information onto a set of person-based observations. Normalization reverses this process. Non-normal datasets can be convenient in analysis, but are considered as poor design in the primary storage of data. Also, once data are normalized, you can more easily construct new analysis sets by recombining the normalized datasets. {title:Author} {p 4 4 2} David Kantor. Email {browse "mailto:kantor.d@att.net":kantor.d@att.net} if you observe any problems. {title:Also see} {p 4 4 2} {help funcdep}, {help collapse}, {help contract}, {help duplicates}