help ckvar 


ckvar - Data Validation (or Scoring) using Rules


ckvar [varlist] [, { valid | score } key(varlist) markdup(newvar) novars keepgoing brief progress stub(str) droplabels nopreserve loud ]


ckvar is a utility command which can be used to validate or score values of variables. It does this by reading the validation or scoring rules from characteristics which are attached to the variables themselves, instead of relying on external do or ado files. ckvar can also be used to check for duplicate observations based on a key, and can mark groups of duplicated keys.

This help file explains the syntax of the ckvar command itself.

To see how to set or edit the rules used by ckvar, look at the help for ckvaredit. (No knowledge of characteristics is needed.)

To see an overview of the purpose of ckvar, please look here.

To see the details of how ckvar is implemented, and what characteristics it uses to validate or score a dataset, please look here.


valid and score tell ckvar whether to run the validation (i.e. error-checking) routines or the scoring routines associated with the specified variables. Only one of the two can be specified. The valid option is the default. When validating, a new variable will be produced for each variable which has at least one error; each observation is marked as either valid or invalid. When scoring, a new score variable will be produced for every variable scored, and there is no assumption of just two possible outcomes.

key allows checking for observations with duplicate keys. The varlist here defines the variable(s) which together are supposed to define unique identifiers for the observations of the dataset (in database terms: the fields which define the key). These variables must already exist.
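
As a rough illustration of what a duplicate-key check amounts to, the sketch below uses only built-in Stata commands (duplicates report and isid), not ckvar itself. The toy data are made up for the example; the variable names ssn and date_of_visit are borrowed from the examples in this help file.

```stata
* Toy data: the pair (ssn, date_of_visit) is meant to identify each row.
clear
input str11 ssn str10 date_of_visit
"111-11-1111" "2001-01-05"
"111-11-1111" "2001-02-09"
"222-22-2222" "2001-01-05"
"222-22-2222" "2001-01-05"
end

* Summarize how many observations share each key combination.
duplicates report ssn date_of_visit

* isid exits with an error when the key is not unique; capture traps it.
capture isid ssn date_of_visit
if _rc {
    display as text "ssn and date_of_visit do not uniquely identify the observations"
}
```

In this toy dataset the last two rows share the same key, so isid fails and the message is displayed.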

markdup allows unique marking of groups of duplicate observations so that they can be investigated more easily. The variable name given here must be that of a new variable. After the duplicate check has been run, this variable contains a 0 for observations which are not part of a group of duplicates and a positive integer for each observation which is part of a group of duplicates. Each group has its own number to make comparisons easier.
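
The marking scheme described above (0 for unduplicated observations, a separate number for each group of duplicates) can be imitated with built-in commands. This is only a sketch of the idea, not ckvar's own code; the names id, dup_count, and dup_group are hypothetical.

```stata
* Toy data with one key variable; ids 2 and 4 are duplicated.
clear
input id
1
2
2
3
4
4
4
end

* dup_count = number of additional copies of each key value (0 when unique).
duplicates tag id, generate(dup_count)

* Number the duplicated groups 1, 2, ...; egen leaves other rows missing.
egen long dup_group = group(id) if dup_count > 0

* Use 0 for observations that are not duplicated.
replace dup_group = 0 if missing(dup_group)

list id dup_group, sepby(id)
```

Here the two rows with id 2 form group 1, the three rows with id 4 form group 2, and all other rows are marked 0.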

novars may be specified together with key if all that is desired is a check for duplicates. Specifying this option causes ckvar to skip all validation routines.

keepgoing tells ckvar to keep running, even if programming errors are encountered. This can be used to find all problematic characteristics at once. All variables that can be checked are checked. All variables with fatal errors are noted.

brief shortens the validation table produced after the variables are checked by eliminating rows for variables which either are completely valid or which do not get validated. This is intended for those concentrating on tracking down errors rather than documenting their existence.

progress echoes the name of each variable as it is being validated or scored. This is useful for detecting runaway processes, though it clutters the screen when checking datasets with many variables.

stub overrides the usual stub for the characteristics used by ckvar. By default, validation routines use characteristics starting with valid, while scoring routines use characteristics starting with score. The stub option is intended to allow multiple possible scoring routines on the same dataset.

droplabels instructs ckvar to drop value labels associated with variables generated when checking errors. This would be very rarely used, except when debugging validation or scoring routines.

nopreserve prevents the dataset from being preserved before running ckvar. By default, the dataset is preserved so that if there are problems with the validation or scoring, it is returned to its pristine state, without any extra variables. If the dataset is large, nopreserve can save time.
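
For readers unfamiliar with preserving, the general pattern looks roughly like the sketch below, written with Stata's built-in preserve and restore on the auto dataset shipped with Stata. It is an assumption-laden illustration of the idea, not ckvar's actual implementation; the variable flag and the check on price are hypothetical.

```stata
sysuse auto, clear

preserve                      // snapshot the data before making changes
generate byte flag = price < 0
capture noisily assert flag == 0
if _rc {
    restore                   // a problem occurred: discard the new variable
}
else {
    restore, not              // no problem: keep the changes, drop the snapshot
}
```

Taking the snapshot costs time and disk space in proportion to the size of the data, which is why skipping it can help with large datasets.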

loud causes output from the underlying dochar program to be echoed to the screen. Its only use is for debugging.


. ckvar checks all the variables which have validation routines, generating indicator variables for variables which have bad data. If there are any errors, the total number of errors will be stored in a variable called error__total.

. ckvar, score does the same, but scores all the variables, generating one score variable for every variable which has a scoring routine. In this case the total will be stored in a variable called score__total.

. ckvar this that theOther checks the three variables this, that, and theOther for errors. If there are errors, the total count of errors for each observation is put into the new variable error__total.

. ckvar, key(ssn date_of_visit) markdup(duplicates) runs all the validation routines and checks to see if there are any observations which have the same combination of ssn and date_of_visit. If there are any duplicates, the variable duplicates will mark groups of duplicates with different numbers. Finally, if there are any errors, the total number of errors found in each observation will be stored in the (new) variable error__total.

. ckvar, key(ssn date_of_visit) markdup(duplicates) novars checks only to see if there are any observations which have both the same ssn and the same date_of_visit. Once again, if there are any duplicates, the variable duplicates will mark groups of duplicates with different numbers. Finally, the novars option states that no error checking or scoring is to be done in this case.
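
Once ckvar has run, the variables it creates can be inspected with ordinary Stata commands. The following is a sketch only: it assumes that ckvar is installed, that a dataset containing ssn and date_of_visit is in memory, and that rules have already been attached to the variables (for example with ckvaredit). The guards with confirm are a precaution, since error__total is created only when errors are found.

```stata
* Run validation plus the duplicate-key check, as in the example above.
ckvar, key(ssn date_of_visit) markdup(duplicates)

* error__total exists only when errors were found, so check for it first.
capture confirm variable error__total
if !_rc {
    tabulate error__total
}

* Show each group of duplicated keys together, if the marker was created.
capture confirm variable duplicates
if !_rc {
    sort duplicates ssn date_of_visit
    list ssn date_of_visit if duplicates > 0, sepby(duplicates)
}
```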


You do not need any understanding of characteristics to use ckvar, even if you need very complicated rules. ckvaredit provides a dialog box which allows the rules to be entered and edited in a natural fashion. ckvardo allows the rules to be dumped into a do-file for application to another dataset. If you are, however, interested in understanding the naming conventions for the characteristics, look at ckchar. If you are truly masochistic, and would like to see how to program complicated rules by hand, first look at dochar, and then at docharprog.


Bill Rising, StataCorp