{smcl} {* 31 Jul 2023}{...} {hline} help for {hi:iecompdup} {hline} {title:Title} {phang2}{cmdab:iecompdup} {hline 2} Compares two duplicates and generate a list of the variables where the duplicates are identical and a list of the variables where the duplicates differ {phang2}For a more descriptive discussion on the intended usage and work flow of this command please see the {browse "https://dimewiki.worldbank.org/wiki/Ieduplicates":DIME Wiki}. Note that this command share wiki article with {help ieduplicates}. {title:Syntax} {phang2} {cmdab:iecompdup} {it:id_varname} [{help if:if}], {cmdab:id(}{it:id_value}{cmd:)} [{cmdab:didi:fference} {cmdab:keepdiff:erence} {cmdab:keepoth:er(}{it:varlist}{cmd:)} {cmdab:more2ok}] {marker opts}{...} {synoptset 28}{...} {synopthdr:options} {synoptline} {synopt :{cmdab:id(}{it:id_value}{cmd:)}}value of the {it:id_varname} variable that is duplicated{p_end} {synopt :{cmdab:didi:fference}}outputs the list with the variables for which the two observation differs. The default is to only store them in a local{p_end} {synopt :{cmdab:keepdiff:erence}}drops all but the variables for which the two observations differ{p_end} {synopt :{cmdab:keepoth:er(}{it:varlist}{cmd:)}}used together with {cmdab:keepdifference}. Variables included in {it:varlist} are also kept{p_end} {synopt :{cmdab:more2ok}}allows running the command on groups of more than two observations, although only the first two duplicates (in the order the data is sorted) are compared{p_end} {synoptline} {title:Description} {pstd}{cmdab:iecompdup} compare all variables for observations that are duplicates in the {it:id_varname} variable and the duplicated value is {cmdab:id(}{it:id_value}{cmd:)}. Duplicates can be identified and corrected with its sister command {help ieduplicates}. {cmdab:iecompdup} is intended to assist in the process of investigating why two observations are duplicated with respect to {it:id_varname}, and what correction is appropriate. {pstd}{cmdab:iecompdup} returns two locals {cmd:r(matchvars)} and {cmd:r(diffvars)}. {cmd:r(matchvars)} returns a list of the names of all variables for which the two observations have identical values, unless both values are missing values or the empty string. {cmd:r(diffvars)} returns a list of the names of all variables where the two observations are not identical. {pstd}For example, if a duplicate is found in a dataset downloaded from a data collection server (ODK or similar) and the duplicates were due to redundant submissions of the same data, then {cmd:r(diffvars)} would only include the submission time variable and any unique key used by the server. In such case, one observation can be dropped without risking losing information, since it is an identical submission of the exact same observation. (See Examples section below for a more detailed suggestion on how to use the command. ) {title:Options} {phang}{cmdab:id(}{it:id_value}{cmd:)} is used to specify the ID value that the duplicates share. Both text strings and numeric values are allowed. {phang}{cmdab:didi:fference} is used to display the list of all variables for which the {it:id_varname} variable duplicates differ. The default is to provide this list in a local, and only display the number of variables that differ. {phang}{cmdab:keepdiff:erence} is used to return the data set with only the ID variable and variables that differs between the duplicates. This means that the command would drop all variables where the duplicates are identical or both missing. It also drops all observations but the two duplicates compared. {phang}{cmdab:keepoth:er(}{it:varlist}{cmd:)} is used to keep more variables than the variables that differs between the duplicates when {cmdab:keepdifference} is specified. The command can keep, for example, a variable with information about who collected these data. This option returns an error if it is specified not in conjunction with {cmdab:keepdifference}. {phang}{cmdab:more2ok} allows running the command on groups of more than two observations, although only the first two duplicates (in the order the data is sorted) are compared. In a group of three duplicates, run the command three times on each combination of the three duplicates. A future update that includes the possibility to compare more than one case is under consideration{p_end} {title:Stored results} {pstd} {cmdab:iecompdup} stores the following results in {hi:r()}: {synoptset 15 tabbed}{...} {p2col 5 15 19 2: Locals}{p_end} {synopt:{cmd:r(matchvars)}}a list of the variables where the duplicates has the same value{p_end} {synopt:{cmd:r(diffvars)}}a list of the variables where the duplicates has different values{p_end} {p2col 5 15 19 2: Scalars}{p_end} {synopt:{cmd:r(nummatch)}}The number of variables in {cmd:r(matchvars)}{p_end} {synopt:{cmd:r(numdiff)}}The number of variables in {cmd:r(matchvars)}{p_end} {synopt:{cmd:r(numnomiss)}}The number of variables for which at least one of the duplicates has a non-missing value. By definition, {cmd:r(numnomiss)} equals the sum of {cmd:r(nummatch)} and {cmd:r(numdiff)}{p_end} {p2colreset}{...} {title:Examples} {pstd} A series of examples on how to specify command, and how to evaluate output: {pstd}{hi:Example 1.} {phang2}{inp:iecompdup HH_ID , id(55424) didifference}{p_end} {pmore}In the example above, let's say that there are two observations in the data set with the value 55424 for variable HH_ID. HH_ID holds an ID that was uniquely assigned to each household. Before continuing the analysis, one must investigate why two observations were assigned the same ID. iecompdup is a great place to start. {pmore}Specifying the command as above compares the two observations that both have a value of 55424 for variable {it:id_varname}. The output displayed will only be number of non-missing variables for which the two observations have identical values, and the number of non-missing variables for which the two observations have different values. The list of those two sets of variables are stored as locals. The data set is returned exactly as it was. {pmore}The locals stored in {cmd:r(diffvars)} and {cmd:r(nummatch)} can be used to provide information on why the two observations are duplicates. A suggested method to evaluate these two lists are presented in Example 2 below. {pstd}{hi:Example 2.} {phang2}{inp:iecompdup HH_ID , id(55424) didifference}{p_end} {pmore}This example makes the same assumptions as example 1 that there are two observations in the data set with the value 55424 for variable HH_ID. The only difference is that the option didifference is specified. The output is the same as example 1 but with the addition that the list stored in {cmd:r(diffvars)} is displayed in the output window. The data set is returned exactly as it was. {pmore}The method to evaluate the output presented in this example focus on the variables for which the duplicates are different. Therefore, start by looking at the list of variables displayed by {inp:didifference}. Do the variables with different values across the duplicates contain observation data like "number of household members" or "annual income", or are they submission information such as "submission ID", "server key" or "submission time"? The answer to this question could suggest one of the three solutions below. Note that this method should only be used as a guiding rule of thumb, all suggested solutions should be evaluated qualitatively as well. {pmore}{ul:Solution 1. All variables contain submission information data.} The far most common mistake leading to duplicates in household surveys is that the same observation data is submitted to the server twice. If that is the case, then only submission information variables would be outputted by the command, not any observation data. If this is the case, then you can safely delete either of the observations. {pmore}{ul:Solution 2. Most variables contain submission information data, but a few contain observation data.} If a few observation data variables are displayed together with submission information variables then it is likely that it is the same observation but some variables were edited after the first submission. Follow up with your field team to see why some variables were changed. See the tips in example 3 below before following up. {pmore}{ul:Solution 3. Many variables contain observation data.} If many observation data variables are displayed together with submission data variables, then it is likely that two different observations have accidentally been given the same ID. That is especially likely if location variables or name variables are different, or if the values for enumerator and/or supervisor are different. See the tips in example 3 below before following up. {pmore}The cases listed above will solve the vast majority of duplicates encountered in household surveys. The appropriate correction can afterwards be applied using the command {help ieduplicates}. {pstd}{hi:Example 3.} {phang2}{inp:iecompdup HH_ID , id(55424) didifference keepdifference keepother(village enumerator supervisor)}{p_end} {pmore}This example again makes the same assumptions as example 1 and example 2 that there are two observations in the data set with the value 55424 for variable HH_ID. This time {inp:keepdifference} and {inp:keepother()} are specified. Those two options can be used to provide additional information to the field team when following up based on solution 2 and solution 3 in example 2. {inp:keepdifference} drops all variables apart from {it:id_varname} and the variables in {cmd:r(diffvars)}. Any variables in {inp:keepother()} are also kept. All observations apart from the duplicates with the ID specified in {inp:id()} are also dropped. This data can be exported to excel and sent to a field team that can see how the observations differ. In this example the field team can also see in which village the data was collected, as well as the name of the enumerator and the supervisor. Any other information helpful to the field team can be entered in {inp:keepother()}. {pstd}{hi:Example 4.} {phang2}{inp:iecompdup HH_ID if inlist(key, "uuid:0003aad0", "uuid:0009baf1"), id(55424) didifference keepdifference keepother(village enumerator supervisor)}{p_end} {pmore}When there are several pairs or groups of duplicates, the command should be run once for each pair or group, as {cmdab:iecompdup} can oly compare two observations at a time. In this case, use an {inp:if} expression to select the observations to be compared. Alternatively, you can use the {inp:more2ok} option, which will compare the first two duplicates observations. {title:Acknowledgements} {phang}I would like to acknowledge the help in testing and proofreading I received in relation to this command and help file from (in alphabetic order):{p_end} {pmore}Michell Dong{break}Carlos Goes{break}Paula Gonzales {title:Author} {phang}All commands in ietoolkit is developed by DIME Analytics at DECIE, The World Bank's unit for Development Impact Evaluations. {phang}Main author: Kristoffer Bjarkefur, DIME Analytics, The World Bank Group {phang}Please send bug-reports, suggestions and requests for clarifications writing "ietoolkit iecompdup" in the subject line to:{break} dimeanalytics@worldbank.org {phang}You can also see the code, make comments to the code, see the version history of the code, and submit additions or edits to the code through {browse "https://github.com/worldbank/ietoolkit":the GitHub repository of ietoolkit}.{p_end}