help cfby


cfby -- Compare two files to get the number of differences "by" a common variable


cfby [varlist] using filename , id(varname) [options]

    options        Description
    -------------------------------------------------------------------------
    nopunct        ignore differences in punctuation and capitalization
    nomatch        suppress warnings about missing observations
    upper          convert all string variables to upper case before comparing
    lower          convert all string variables to lower case before comparing
    nostring       do not compare any string variables
    -------------------------------------------------------------------------


cfby compares the variables in varlist from the dataset in memory to the variables in varlist from the using dataset and displays the discrepancy rates by a common variable. It is useful if you are doing data entry and want to get discrepancy rates of data entry officers.


id(varname) is required. varname is the variable that matches observations in the master dataset to observations in the using dataset. It must uniquely identify observations in both the master and using datasets.

nopunct deletes the following characters before comparing: ! ? ' It replaces each of the following characters with a space: . , - / ; It then trims all extra spaces.
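
The cleaning described for nopunct can be approximated by hand with Stata's string functions. The following is a rough sketch only, not cfby's actual code; the variable names raw and clean and the sample value are made up for illustration.

```stata
* Illustrative only: approximate the nopunct cleaning on one string variable.
clear
input str30 raw
"O'Brien,  J.-P.!"
end

generate clean = raw

* Delete ! and ? and the apostrophe (char(39)).
foreach c in "!" "?" {
    replace clean = subinstr(clean, "`c'", "", .)
}
replace clean = subinstr(clean, char(39), "", .)

* Replace . , - / ; with a single space.
foreach c in "." "," "-" "/" ";" {
    replace clean = subinstr(clean, "`c'", " ", .)
}

* Collapse runs of internal spaces, then trim leading and trailing spaces.
replace clean = trim(itrim(clean))

list raw clean
```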

nomatch is specified if the observations in the master and using datasets do not need to match. The default is to assume 1:1 matching between the datasets and to list any observations that exist in only one dataset.
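
Before running cfby, it may help to confirm that the id variable is unique in both files and to see which observations appear in only one of them. The sketch below uses only built-in commands (isid and merge) and borrows the file names and the uniqueid variable from the example in this help file; it is a suggested pre-check, not something cfby requires.

```stata
* Confirm that uniqueid uniquely identifies observations in each dataset.
* isid exits with an error if it does not.
use "first entry.dta", clear
isid uniqueid

use "audit dataset.dta", clear
isid uniqueid

* List ids that exist in only one dataset (_merge is 1 or 2),
* then restore the master dataset as it was.
preserve
merge 1:1 uniqueid using "first entry.dta"
list uniqueid _merge if _merge != 3
restore
```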


cfby is intended to be used as part of the data entry process, when data are checked for accuracy. It outputs a matrix of discrepancies for each unique combination of values of the by variable between the master and using datasets. For example, if you compare the first entry of a dataset to the second entry, it outputs the discrepancy rate for each pair of data entry officers.

If the master dataset is the result of a thoroughly checked audit and the using dataset is the raw first entry, set the by variable to a constant in the audit dataset, and cfby will output the error rate for each data entry officer in the first entry.

cfby does not compare variables whose string/numeric type differs between the two datasets. It also does not compare variables that differ in all observations.


use "audit dataset.dta"

cfby region-no_good_at_all using "first entry.dta" , id(uniqueid) by(deo)
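
When the master dataset is an audit, the by variable can be set to a constant before calling cfby. The sketch below assumes that deo is a string variable in "first entry.dta" and that the value "audit" is an acceptable placeholder; both are assumptions, so adjust the type and value to match your data.

```stata
* Sketch: audit dataset as master, raw first entry as using.
* Assumption: deo is a string variable in "first entry.dta".
use "audit dataset.dta", clear

* Replace any existing deo with a constant so that every audit
* observation falls in the same by group.
capture drop deo
generate str5 deo = "audit"

cfby region-no_good_at_all using "first entry.dta" , id(uniqueid) by(deo)
```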

Saved Results

cfby saves the following in r():

Matrices
    r(e)    number of discrepancies
    r(q)    number of data points compared
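
A discrepancy rate can be derived from the saved matrices by dividing r(e) by r(q) element by element. The sketch below assumes that the two matrices have the same dimensions; the matrix names E, Q, and rate are arbitrary. It must be run immediately after cfby, before another command overwrites r().

```stata
* Copy the saved results before they are overwritten.
matrix E = r(e)
matrix Q = r(q)

* Element-wise division is done in Mata; cells where Q is 0 become missing.
mata: st_matrix("rate", st_matrix("E") :/ st_matrix("Q"))

matrix list rate
```

Note that st_matrix() creates rate without the row and column names of E, so read the result alongside matrix list E.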


Ryan Knight,

Also see

Online: cf, compare