Count matching values for one variable in another
countmatch var1 var2 [if] [in] [ , generate(newvar) by(byvarlist) missing list_options ]
Description
countmatch counts observations for which each distinct value of var1 is matched by (is equal to) var2, whether for the same observation or for some different observation(s). var1 and var2 should be both numeric or both string.
Options
generate() specifies the name of a new variable to hold information on match counts. If generate() is not specified, data and counts will be listed.
by() specifies that matching is to be carried out only within distinct groups defined by byvarlist. Observations with equal values must belong to the same group to count as matching.
missing indicates that missing values of var1 should be included in the comparison. By default, they are excluded.
list_options are options of list, which may be used to tune the output of any listing.
Remarks
--------------------------------------------------------------------------- Examples
For concreteness, consider data on friendships. Two variables are name and bestfriendname. Then countmatch name bestfriendname counts how many people name each person in name as their best friend in bestfriendname. This will include all those who name themselves as their own best friends.
Alternatively, two variables are name and friendname and each observation specifies a person and one of their friends, so that the data occur in blocks, one block for each person. Then countmatch name friendname counts how many people name each person in name as their friend in friendname. This will, again, include all those who name themselves as their own friends. The count will necessarily be the same for each observation on a particular person. Downstream of this you may wish to list each person and the corresponding count just once, and egen's tag() function offers a way to do this.
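As a hedged sketch of that last step, the following assumes countmatch is installed and uses made-up toy data. The names nfriends and tag are illustrative only and are not fixed by the command.

```stata
* Toy data: one observation per (person, friend) pair
clear
input str5 name str5 friendname
"ann" "bob"
"ann" "cat"
"bob" "ann"
"cat" "ann"
"cat" "cat"
end

* Store the match counts in a new variable (assumes countmatch is installed)
countmatch name friendname, generate(nfriends)

* tag() marks exactly one observation per distinct value of name
egen byte tag = tag(name)

* List each person and the corresponding count just once
list name nfriends if tag, noobs
```

Any other way of picking one observation per person would serve equally well; tag() is simply convenient because it does not require the data to be sorted.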
Doing this with by() adds a restriction: count only within distinct groups of byvarlist. You might be counting only friends of the same race or gender, for example. Getting all friends and all friends in the same group will allow you to determine all friends outside the same group by subtraction.
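The subtraction could be sketched as follows, again assuming countmatch is installed. The data are invented, gender is taken to be the gender of the person in name, and nall, nsame and nother are hypothetical variable names.

```stata
* Toy data: gender describes the person in name
clear
input str5 name str5 friendname str1 gender
"ann" "bob" "f"
"ann" "cat" "f"
"bob" "ann" "m"
"cat" "ann" "f"
end

* Matches across all observations
countmatch name friendname, generate(nall)

* Matches restricted to observations in the same gender group
countmatch name friendname, by(gender) generate(nsame)

* Matches from outside the group, by subtraction
gen long nother = nall - nsame
```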
--------------------------------------------------------------------------- Do-it-yourself
Although countmatch automates a solution, the following notes on how to do this for yourself may be interesting or useful.
We focus on a simple version of the problem. For each distinct value of var1, how many observations have the same value in var2?
We will need to loop over the distinct values of var1. Each time round the loop there will be a count, and then the result will be put into a variable in the right place(s). To do that we need to have a variable to put it in.
. gen long count = 0
initialises a counter variable. The long is cautious, just in case the counts get really big. Another variable type may well be fine for your problem. Initialising to missing (not 0) is another good way.
For toy examples, we can use levelsof confidently. (In an updated Stata 8, use levels instead.) Frequently, var1 and var2 are both string, so let us focus on that situation.
. levelsof var1, local(levels)
puts the distinct values into a local macro.
. quietly foreach l of local levels {
.     count if `"`l'"' == var2
.     replace count = r(N) if var1 == `"`l'"'
. }
gives a first solution. Compound double quotes `" "' are used just in case there are double quotes lurking in the strings. That may be unlikely, but it does no harm.
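To try this first solution end to end, here is a minimal sketch on invented string data. The values of var1 and var2 are made up for illustration, and the block should be run as a whole (for example from a do-file) so that the local macro is still defined when the loop starts.

```stata
* Toy data: both variables are string
clear
input str5 var1 str5 var2
"a" "b"
"b" "a"
"c" "a"
"a" "c"
end

* Counter variable, initialised to 0
gen long count = 0

* Distinct values of var1 go into local macro levels
levelsof var1, local(levels)

* For each distinct value, count matches in var2 and store the result
quietly foreach l of local levels {
    count if `"`l'"' == var2
    replace count = r(N) if var1 == `"`l'"'
}

list, noobs
```

Here "a" occurs twice in var2 and "b" and "c" once each, so count ends up as 2 for both observations with var1 equal to "a" and 1 for the others.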
Now this code pivots on both variables being string. Also, in an industrial-strength solution, you would not want to rely on all the distinct values fitting into a macro, so levelsof may be set on one side. One thing we can always do is map the distinct values to successive integers:
. egen group = group(var1)
. su group, meanonly
. local ngroup = r(max)
egen, group() maps the distinct values of var1 to the integers 1,...,#groups; and we can retrieve #groups by a summarize and then peeking at the saved results. Initialise as before:
. gen long count = 0
Another variable will come in useful, holding the observation numbers. Then once again the counting is done in a loop.
. gen long obs = _n
. qui forval i = 1/`ngroup' {
.     su obs if group == `i', meanonly
.     local first = r(min)
.     count if var1[`first'] == var2
.     replace count = r(N) if group == `i'
. }
The loop uses a look-up technique. When we are focusing on group == 1, for example, how do we know what value of var1 we are dealing with? (By construction, var1 is constant for each distinct value of group - we set up a one-to-one mapping - but what is that constant?) Notice that it is not general enough to go
. su var1 if group == `i'
and look at the saved results, because in general var1 could be a string. We have to be one step more devious. We just need to find the observation number for any observation in a particular group, and then we can get at the corresponding value of var1. That is where the obs variable comes in useful. There are two saved results after a summarize that will work here, the minimum or the maximum, and you can choose. (The mean will not work in general: consider, for example, a group with just two representatives, in observation 8 and observation 10: the mean at 9 does not identify a representative.)
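The look-up approach can be gathered into one runnable sequence. This is a sketch on invented numeric data, chosen to show that nothing in it depends on the variables being string; the same commands should work unchanged with string var1 and var2. Run it as a whole so that the local macros survive between commands.

```stata
* Toy data: both variables are numeric this time
clear
input var1 var2
1 2
2 1
3 1
1 3
end

* Map distinct values of var1 to integers 1, ..., #groups
egen group = group(var1)
summarize group, meanonly
local ngroup = r(max)

* Counter and observation numbers
gen long count = 0
gen long obs = _n

quietly forvalues i = 1/`ngroup' {
    * Find one representative observation in this group
    summarize obs if group == `i', meanonly
    local first = r(min)
    * Count observations of var2 equal to that group's value of var1
    count if var1[`first'] == var2
    replace count = r(N) if group == `i'
}

list var1 var2 count, noobs
```

Using r(max) in place of r(min) would pick a different representative of each group but give the same counts, since var1 is constant within group.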
--------------------------------------------------------------------------- Existence of match deducible from count of matches
Whether or not a match exists is determined by inrange(count,1,.).
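A small sketch of that indicator, using made-up counts and a hypothetical variable name anymatch. A missing upper limit in inrange() is treated as no upper limit, and a missing count yields 0 rather than 1.

```stata
* Toy counts, including a zero and a missing value
clear
input long count
0
2
.
end

* 1 if at least one match was counted, 0 otherwise
gen byte anymatch = inrange(count, 1, .)

list, noobs
```

Writing count >= 1 instead would flag missing counts as matches, because missing values sort above every number in Stata.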
--------------------------------------------------------------------------- Multiple variables
Given var1 and some varlist over which we wish to count matches, loop over varlist. This will fail if variables are not either all numeric or all string. One way of checking first is to use ds.
. qui foreach v of var varlist {
.     countmatch var1 `v', gen(`v'_m)
. }
--------------------------------------------------------------------------- Matches in the same observation
Given var1 and some varlist over which we wish to count matches in the same observation, initialise a count variable and then loop over varlist. This will fail if variables are not either all numeric or all string. One way of checking first is to use ds.
. gen count = 0
. qui foreach v of var varlist {
.     replace count = count + (`v' == var1)
. }
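One way the ds check and the loop might fit together is sketched below. The data and the variable names friend1, friend2 and age are invented; ds with has(type string) keeps only the string variables among those named, so the numeric age is dropped before the comparison with the string variable name.

```stata
* Toy data: two string candidates and one numeric variable
clear
input str5 name str5 friend1 str5 friend2 byte age
"ann" "bob" "ann" 30
"bob" "cat" "ann" 25
"cat" "cat" "cat" 41
end

* Keep only the string variables from the candidate list
ds friend1 friend2 age, has(type string)
local strvars `r(varlist)'

* Count matches with name within the same observation
gen byte count = 0
quietly foreach v of local strvars {
    replace count = count + (`v' == name)
}

list name friend1 friend2 count, noobs
```

With numeric var1, has(type numeric) would play the same role.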
Examples
. countmatch name bestfriend
. countmatch name bestfriend, gen(nfriends)
Author
Nicholas J. Cox, Durham University, U.K. n.j.cox@durham.ac.uk
Acknowledgments
This is a rewriting of fndmtch2. The original problem was suggested by Brian Uzzi. A bug was reported by Socrates Mokkas, which prompted this rewriting. Marcello Pagano pointed out some unclear wording in this help.
See also
Online: help for duplicates; fndmtch (if installed)