-------------------------------------------------------------------------------
help for mmerge                                              [jw]  Feb 26, 2002
-------------------------------------------------------------------------------

Easy and safe merging of datasets

Basic syntax

mmerge match-variable(s) using filename [, {simple | table} umatch(varlist) ukeep(varlist) ]

Full Syntax

mmerge match-variable(s) using filename [, { type(type_value) unmatched(unmatched_value) | simple | table } missing(m_value) nolabel replace update _merge(varname) noshow { ukeep(varlist) | udrop(varlist) } uif(exp) umatch(varlist) { uname(stub) | urename(rename_specs) } ulabel(stub) ]

where

type_value = { auto | 1:1 | 1:n | n:1 | n:n | spread }

unmatched_value = { both | none | master | using }

missing_value = { none | value | nomatch }

rename_specs = oldname newname [\ oldname newname \ ...]

Description

mmerge is an extension of merge that makes matched merging safe. It requires users to specify the type of match to be performed; mmerge verifies that the requirements hold. It also makes merging easy, though that may not be obvious at a first look at the full syntax diagram. Look at the examples first!

Detailed description

Safety features

- The safety of merging crucially depends on the requirement that users should specify whether the match-variables form a key in the master and using data. A key uniquely defines observations and does not contain missing values. - In contrast with merge, the resulting data after mmerge is independent of the order of observations in master and using data. As a consequence, you are not required to sort the data yourself. - merge distinguishes between unmatched records due to missing values in the match variables (_merge=-1,-2) and to non-missing values (_merge=1,2). Here 1/-1 and 2/-2 refer to unmatched obs originating in the master and using data respectively. - mmerge displays names of variables that occur in both master and using data.

Convenience features

- The master and using data are automatically sorted - A _merge variable in master or using data will be silently overwritten; this is not so bad as mmerge automatically tabulates _merge - The match-variable(s) of the using data can be named differently from the master data - Selection of observations and variables in the using dataset is facilitated - Prefixing strings to the names (and labels) of variables in the using data is supported - Full renaming of variables in the using data is supported - Value labels are added for the _merge variable (label name __MERGE)

Options to specify special cases

simple specifies matched merging in which the master and using data supposedly describe the same objects, fully identified by the match variables;, i.e., the match variables form a key in the master and using data.

simple is equivalent to specifying options type(1:1) and unmatched(both), and invoking assert _merge==3 upon the completion of mmerge.

table specifies a "table lookup merge", in which the master data contains data on objects of type A (e.g., households) that contain an object of type B (e.g., a city), identified by the match-variables (e.g., the city of residence), while the using data contains descriptions of type B (cities). Thus, the match variables should form a key in the using data.

table is equivalent to specifying options type{n:1} unmatched(master).

In a future version of mmerge, I may support additional "special types of matched merging".

Options for matching

types(str) specifies whether the match variables are keys in master and using data. Valid values are

1:1 key in the master and using data 1:n key in master data n:1 key in using data n:n no keys; mmerge performs a relational join via joinby

spread mmerge determines which of 1:n or n:1 holds auto mmerge determines the match-type. (default)

While auto is easy, you are strongly advised to specify your knowledge of the relationship between the master and using data via one of the other values. This allows mmerge to test whether your understanding is consistent with the data.

missing(str) specifies how missing values in the match variables are treated. Valid values are

none missing values not allowed in the match vars (default) value missing values are treated as ordinary values nomatch missing values of the match vars in the master should not match missing values in the match-vars of the using data.

Options for merging

unmatched(str) specifies whether non-matching observations in the master and using data are included in the merge result. Valid values are

none only completely maching obs are retained both non-matching obs from master and using are included (default) master non-matching obs from master are included using non-matching obs from using are included

nolabel prevents Stata from copying the value label definitions from the disk dataset. Even if you do not specify this option, in no event do label definitions from disk replace those already in meory.

update varies the action merge takes when an observation is matched. By default, the master data is held inviolate--values from the master data are retained when the same variables are found in both datasets. If update is specified, however, the values from the using data are retained in cases where the master data constains missing.

replace, allowed with update only, specified that even in the case when the master data contains nomissing values, they are to be replaced with corresponding values from the using data when corresponding data are not equal. A nonmissing value, however, will never be replaced with a missing value.

_merge(varname) specifies the name of the variable that will mark the source of the resulting observation. The default is _merge(_merge).

noshow specifies that the report on the files and the contents of _merge is suppressed.

Options for manipulating the using data ("u"-options)

ukeep(varlist) udrop(varlist) specifies a varlist in the using data that has to be kept (dropped) before being merged into the master data. It is not valid to specify both ukeep and udrop. If neither is specified, all variables of the using data are used. The match variable(s) need not be specified in ukeep; they are automatically included in ukeep (excluded from udrop).

uif(exp) specifies that only the observations in the using data that meet expression exp are to be used. Properness of the key in the using data is determined after uif is processed.

umatch(varlist) specifies the names of the match variables in the using data. The umatch variables are associated with the match variables in the specified order. Clearly, the number of match variables in umatch should be the same as the number of matching variables in the master.

mmerge renames the umatch variables to the master match variable names after ukeep/udrop have been processed, but before urename is processed. An error occurs if there are naming conflicts.

urename(speclist) specifies a list of "oldname newname" clauses to be applied to the using data, separated by a backslash \. Note that urename is applied after ukeep/udrop, and hence ukeep/udrop should use the original names. It is not allowed to rename the match variables here. Use umatch instead.

uname({it:stub) specifies a stub prefixed to the names of the variables imported from the using data, truncated at 32 characters. uname may not be combined with urename.

ulabel(stub) specifies a stub to be prefixed to the variable labels of the imported variables from the using data.

Examples

1. You have two datasets (auto1 and auto2) with properties of cars, identified by the variable make in both files. You can merge the two datasets with

. use auto1 . mmerge make using auto2, type(1:1) unmatched(both)

to obtain a dataset on all cars represented in either of the data sets. Since both is the default value of the option unmatched, it could actually have been omitted. To obtain a dataset that does not include cars that occur in only of the datasets, issue the command

. mmerge make using auto2, type(1:1) unmatched(none)

If the identifier in auto2 was named brand,

. mmerge make using auto2, type(1:1) umatch(brand)

If you had been convinced that the auto1 and auto2 describe the same cars, you could have specified simple instead of type(1:1) and unmatched(both). Then, in addition, mmerge would verify that your conviction holds true, displaying an error message otherwise.

2. You have a data set of individuals (indiv) and a dataset of households (hhold) with key hhid. The variable hhid is also in dataset indiv to specify the household to which individuals belong. You want to add the household level variables to the individuals. This is an example of "table lookup merging",

. use indiv . mmerge hhid using hhold, table

One might also have typed

. mmerge hhid using hhold, type(n:1) Alternatively, you could switch the role of indiv and hhold,

. use hhold . mmerge hhid using indiv, type(1:n)

If you are a bit like me, you'll be easily confused between 1:n and n:1. You can ask mmerge to settle the issue for you with the match type spread.

. use indiv . mmerge hhid using hhold, type(spread)

Typically, hhold contains many variables, and you only want to import some of them, say the city of residence (city) and number of kids (nkids). This is faciliated by the ukeep option.

. use indiv . mmerge hhid using hhold, table ukeep(city nkids)

Saved results

mmerge saves in r()

scalar

mobs number of obs in master data mvar number of vars (incl match vars) in master data mkey boolean: match vars form key in master data uobs number of obs in using data uvar number of vars (incl match vars) in using data ukey boolean: match vars form key in using data

local

mfile full name of master data mmatch names of matching vars in master data ufile full name of using data umatch (original) names of matching vars in using data common common variables (excl match vars) in master and using

Author

This commands was written by Jeroen Weesie, Dept of Sociology, Utrecht University. I want to acknowledge stimulating discussions with Bill Gould, Chris Snijders, and Fred Wolfe. Comments and suggestions are welcomed.

Also See

Manual: [U] 25 Commands for combining data [R] merge [R] joinby

On-line: append Append datasets -- add cases cross Form every pairwise combination of two datasets fillin Rectangularize dataset joinby Form all pairwise combinations within groups merge Merge datasets -- add variables