{smcl}
{* 11 Jan 2022}{...}
{hline}
help for {hi:ieboilsave}
{hline}

{title:Title}

{phang2}{cmdab:ieboilsave} {hline 2} Checks that a data set follows DECIE
	standards and tags the data set with metadata.

{phang2}For a more detailed discussion of the intended usage and workflow of this
command, please see the {browse "https://dimewiki.worldbank.org/wiki/Ieboilsave":DIME Wiki}.

{title:Syntax}

{phang2}
{cmdab:ieboilsave}
, {cmdab:idvar:name(}{it:varname}{cmd:)} [{cmdab:diout:put} {cmdab:missingok}
{cmdab:tagnoname} {cmdab:tagnohost}]


{marker opts}{...}
{synoptset 18}{...}
{synopthdr:options}
{synoptline}
{synopt :{cmdab:idvar:name(}{it:varname}{cmd:)}}specifies the ID variable
	uniquely and fully identifying the data set{p_end}
{synopt :{cmdab:missingok}}regular missing values are allowed{p_end}
{synopt :{cmdab:diout:put}}display output summarizing results of tests made and meta data stored{p_end}
{synopt :{cmdab:tagnoname}}do not tag the data set with user name or the computer (host) name{p_end}
{synopt :{cmdab:tagnohost}}do not tag the data set with the computer (host) name{p_end}
{synoptline}

{title:Description}

{pstd}{cmdab:ieboilsave} standardizes the boilerplate (section of standardized
	code) used at DECIE before saving a data set. This includes checking that
	the ID variable uniquely and fully identifies the data set. The test uses the
	command {help isid}, but provides more useful output. Only one variable
	is allowed to be the ID variable; see {help ieboilsave##IDnotes:Notes on ID variables} below.
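
{pstd}For comparison, similar checks can be run by hand with built-in Stata
	commands. This is only a sketch using the variable name {it:respondent_ID} from
	the examples further down; it is not taken from the source code of
	{cmdab:ieboilsave}:

{pmore}{inp:isid respondent_ID}{p_end}
{pmore}{inp:duplicates report respondent_ID}{p_end}
{pmore}{inp:count if missing(respondent_ID)}{p_end}

{pstd}{help isid} exits with an error if the variable has duplicates or missing
	values, while {help duplicates:duplicates report} and {help count} show how
	many observations are affected.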

{pstd}The command also checks that no regular missing values are used. Missing values should
	be replaced with the extended missing values .a, .b, ... , .z where each
	extended missing value represents a reason for why the value is missing.
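
{pstd}As a minimal sketch of such a replacement, assume a hypothetical numeric
	variable {it:income} and hypothetical indicator variables {it:refused} and
	{it:not_applicable}. None of these names come from this command, and which
	extended missing value stands for which reason is a project-level choice:

{pmore}{inp:replace income = .a if income == . & refused == 1}{p_end}
{pmore}{inp:replace income = .b if income == . & not_applicable == 1}{p_end}
{pmore}{inp:count if income == .}{p_end}

{pstd}The last line counts the regular missing values that remain in {it:income}
	after the replacements.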

{pstd}The command also tags the data set with meta data useful to
	future users. The meta data is attached to the data set using {cmdab:char}.
	Char stores meta data in the data set as an associative array;
	see {help char} for an explanation of how to access data stored with
	char. The charnames (the equivalent of keys or indexes in associative
	arrays) are listed below. When applicable, these values are taken from the
	system parameters stored in {cmdab:c()} (see {help creturn}), and the
	{cmdab:c()} parameters used by this command are listed below. When a data
	set already has these charnames, the old values are overwritten with the
	new ones.

{p2colset 5 24 26 2}
{p2col : Charname}Meta data associated with the charname{p_end}
{p2line}
{p2col :{cmdab:_dta[ie_idvar]}}stores the name of the ID variable that uniquely
	and fully identifies the data set.{p_end}
{p2col :{cmdab:_dta[ie_version]}}stores the Stata version used (not installed,
	see {help ieboilstart: ieboilstart} for more details) to create the data
	set. Retrieved from {cmdab:c(version)}.{p_end}
{p2col :{cmdab:_dta[ie_date]}}stores the date the file was saved. Copying files,
	sharing files over sync services or emails may change the time stamp shown
	in the folder. Retrieved from {cmdab:c(current_date)}.{p_end}
{p2col :{cmdab:_dta[ie_name]}}stores the user name chosen when installing the
	instance of Stata that was used when generating the file. Retrieved from
	{cmdab:c(username)}. Storing this meta data is optional.{p_end}
{p2col :{cmdab:_dta[ie_host]}}stores the computer name chosen when installing the
	instance of the operating system that was used when generating the file.
	Retrieved from {cmdab:c(hostname)}. Storing this meta data is optional.{p_end}
{p2col :{cmdab:_dta[ie_boilsave]}}stores a short summary of the result of
	running {cmdab:ieboilsave}. See option {cmdab:dioutput} below for more details.{p_end}
{p2line}
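
{pstd}A short sketch of how these values can be read back after the command has
	run, using only the charnames listed above. The local name {it:idvar} is
	arbitrary:

{pmore}{inp:char list _dta[]}{p_end}
{pmore}{inp:local idvar : char _dta[ie_idvar]}{p_end}
{pmore}{inp:isid `idvar'}{p_end}

{pstd}The first line lists all characteristics attached to the data set, including
	any not created by {cmdab:ieboilsave}. The last two lines retrieve the stored
	ID variable name and re-test that it still identifies the data set.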

{title:Options}

{phang}{cmdab:idvar:name(}{it:varname}{cmd:)} specifies the ID variable that is
	supposed to fully and uniquely identify the data set. This command
	uses the command {help isid:isid} but provides more helpful output in case
	the ID variable has duplicates or missing values. Using multiple ID variables
	to uniquely identify a data set is not best practice, and only one variable
	is therefore allowed in {it:varname}. See {help ieboilsave##IDnotes:Notes on ID variables} below
	for a justification of why it is bad practice.{p_end}

{phang}{cmdab:diout:put} displays the same information stored in {cmdab:_dta[ie_boilsave]} in
	the output window in Stata. This information includes the results of the ID
	variable test, the missing values test and all the meta data stored
	with {cmdab:char}. Unless this option is specified, {cmdab:ieboilsave} runs silently as
	long as it does not cause any errors.{p_end}

{phang}{cmdab:missingok} allows the data set to have the standard missing values;
	see {help missing values}. Since changing regular missing values to extended
	missing values is time consuming, it might not always be a good use of a
	Stata coder's time to do this for intermediary data sets. But since it should
	be done for all final data sets, the default is to not allow regular missing
	values.{p_end}

{phang}{cmdab:tagnoname} prevents the command from tagging the data set with metadata
	containing user name and computer (host) name. Username and computer name
	can be very useful when facing issues related to replicability. For privacy
	reasons this can be disabled, but best practice is to keep it enabled at
	least for all data sets that are not meant for public dissemination.{p_end}

{phang}{cmdab:tagnohost} is similar to {cmdab:tagnoname} but it only prevents the
	command from tagging the data set with metadata containing the computer name.
	Specifying {cmdab:tagnohost} is redundant if {cmdab:tagnoname} is already
	specified.{p_end}

{title:Examples}

{pstd} {hi:Example 1.}

{pmore}{inp:ieboilsave, idvarname(respondent_ID)}

{pmore}In the example above, the command checks that the variable {it:respondent_ID}
uniquely and fully identifies the data set, checks that there are no missing
values other than the extended missing values, and saves meta data to the
data set using char.

{pstd} {hi:Example 2.}

{pmore}{inp:ieboilsave, idvarname(respondent_ID) dioutput}

{pmore}The only difference between example 1 and this example is that in this
	example the command outputs the information stored in _dta[ie_boilsave]. The
	output will look similar to this:

{pmore}. {inp:ieboilsave, idvarname(respondent_ID) dioutput}{p_end}

{phang2}. {res:ieboilsave ran successfully. The uniquely and fully identifying ID variable is respondent_ID. This data set was created in Stata version 13.1, by user Kristoffer using computer Kristoffer-PC, on 27 Jan 2016. There are no regular missing values in this data set}

{pstd} {hi:Example 3.}

{pmore}{inp:ieboilsave, idvarname(respondent_ID)}{p_end}
{pmore}{inp:local localname : char _dta[ie_boilsave]}{p_end}
{pmore}{inp:di "`localname'"}{p_end}

{pmore}Example 3 would generate exactly the same output as example 2 (formatted
	slightly differently), but this example shows how to display the information
	in char _dta[ie_boilsave] at any point after running the command. For
	example, if you receive a new data set where _dta[ie_boilsave] is already
	specified, then the last two lines of code are the easiest way to access that
	information in a readable way.
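
{pstd} {hi:Example 4.}

{pmore}{inp:ieboilsave, idvarname(respondent_ID) missingok tagnoname}

{pmore}This example is a sketch of how the options in the syntax above might be combined, for
	instance for an intermediary data set that will be shared. The command still
	tests that {it:respondent_ID} uniquely and fully identifies the data set, but
	it allows regular missing values and does not tag the data set with the user
	name or the computer name.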

{marker IDnotes}{...}
{title:Notes on ID variables}

{pstd}{it:Unique and Fully Identifying IDs} (in short, unique IDs) and
	{it:Unit of Observation} are two concepts that cannot be emphasized enough in
	data management best practices. The unit of observation is what each row
	represents in a data set, and the unique ID should be unique for each
	instance of the unit of observation. This is often the same unit as the
	respondent during data collection.

{pstd}For example, say the respondents during a data collection were farmers;
	the data set is then downloaded from the servers with farmers as the unit of observation.
	However, say that the analysis is carried out at the plot level. The data
	set prepared for the plot-level regressions no longer has farmers as the
	unit of observation. The unit is now plots, and the data set should be identified using
	plot IDs, not farmer IDs. If farmer IDs are unique for each farmer, and plot
	IDs are unique among the plots of each farmer, then those two
	IDs combined uniquely identify the data set. While this is technically true, it is
	not good practice. Impact evaluations run over many years and several
	different people are likely to work with the data set, so
	the slightest confusion about ID variables can lead to large analysis
	mistakes. It can lead to data sets being merged incorrectly, which can create
	duplicates, and it can lead to observations being included multiple times
	in a regression, thereby inflating N and underestimating the p-value, causing
	false positives.

{pstd}Best practice is to always create a single variable that uniquely and fully
	identifies every instance of the unit of observation before saving a data
	set. Common practice is to make this the first variable in a data set
	using {help order:order}. It is also best practice to always start by making
	sure you fully understand the unit of observation in data sets you get from
	someone else. After you think you know the unit of observation, make sure
	that you have a single variable that uniquely and fully identifies the
	unit of observation in the data set.
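
{pstd}One possible way to build such a variable for the plot-level example above,
	assuming hypothetical variables {it:farmer_id} and {it:plot_id} and a
	hypothetical new variable name {it:plot_uid}:

{pmore}{inp:egen long plot_uid = group(farmer_id plot_id)}{p_end}
{pmore}{inp:isid plot_uid}{p_end}
{pmore}{inp:order plot_uid}{p_end}
{pmore}{inp:ieboilsave, idvarname(plot_uid)}{p_end}

{pstd}The group() function of {help egen:egen} assigns one integer to each
	combination of farmer and plot. Observations missing either ID get a missing
	{it:plot_uid}, which {help isid:isid} then reports as an error. The integers
	depend on the values present when the variable is created, so a project may
	prefer an ID scheme that stays stable when observations are added.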

{pstd}These concepts are also central to modern database design. They are approached
	somewhat differently there, as databases mostly consist of more than one data set,
	but the principles are the same. There is a lot of reading material online;
	search for {it:primary keys} and {it:normalization} in database design resources.

{title:Acknowledgements}

{phang}I would like to acknowledge the help in testing and proofreading I received in relation to this command and help file from (in alphabetic order):{p_end}
{pmore}Michell Dong{break}Paula Gonzales

{title:Author}

{phang}All commands in ietoolkit are developed by DIME Analytics at DECIE, the World Bank's unit for Development Impact Evaluations.

{phang}Main author: Kristoffer Bjarkefur, DIME Analytics, The World Bank Group

{phang}Please send bug reports, suggestions and requests for clarification,
		 writing "ietoolkit ieboilsave" in the subject line, to:{break}
		 dimeanalytics@worldbank.org

{phang}You can also see the code, make comments to the code, see the version
		 history of the code, and submit additions or edits to the code through {browse "https://github.com/worldbank/ietoolkit":the GitHub repository of ietoolkit}.{p_end}