Title
moss -- Find multiple occurrences of substrings
Syntax
moss strvar [if] [in] , match(["]pattern["]) [ regex prefix(prefix) suffix(suffix) maximum(#) compact ]
Description
moss finds occurrences of substrings matching a pattern in a given string variable. Depending on what is sought and what is found, variables are created giving the count of occurrences (always); the positions of occurrences (whenever any are found); and the exact substrings found (when a regular expression defines a subexpression to be returned). The default names are respectively _count; _pos1, _pos2, and so on; and _match1, _match2, and so on.
Remarks
By default, moss finds repeated occurrences of the string specified in match() using Stata's strpos() string function (in older versions of Stata, strpos() was named index()). A _count variable is created to indicate the number of occurrences per observation. The position, per observation, of the first instance will be recorded in _pos1, the second in _pos2, and so on.
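The repeated strpos() search described above can be sketched in plain Stata. This is an illustration under stated assumptions, not moss's actual source code: the variable names n_found, offset, rest, p, and pos1, pos2, ... are hypothetical, and the data are made up for the example.

```stata
* Sketch only: repeated strpos() search for a literal pattern.
* All variable names here are hypothetical, not those used by moss.
clear
input str20 s
"a,b,c"
"no commas"
",,"
end

local pat ","
generate n_found = 0                 // occurrences found so far
generate offset = 0                  // characters of s already consumed
generate rest = s                    // unsearched remainder of s
generate p = strpos(rest, "`pat'")   // position of next occurrence in rest
local k = 0
while 1 {
    quietly count if p > 0
    if r(N) == 0 {
        continue, break              // no observation has a further match
    }
    local ++k
    * position within the original string = consumed characters + p
    quietly generate pos`k' = offset + p if p > 0
    quietly replace n_found = n_found + 1 if p > 0
    * skip past the occurrence just found, so later matches cannot overlap it
    quietly replace offset = offset + p + strlen("`pat'") - 1 if p > 0
    quietly replace rest = substr(rest, p + strlen("`pat'"), .) if p > 0
    quietly replace p = strpos(rest, "`pat'")
}
list s n_found pos*

* checks on the first observation, "a,b,c": commas at positions 2 and 4
assert n_found == 2 in 1
assert pos1 == 2 & pos2 == 4 in 1
```

Each pass removes everything up to and including the occurrence just found, which is why the occurrences counted this way are disjoint.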
With the regex option, moss can be used to repeatedly find more complex patterns of text. The specification of the search pattern must follow regexm() syntax and include one and only one subexpression to be matched. When using regular expressions, subexpressions are identified using parentheses. For example, match("AMC ([A-Za-z]+)") will match "AMC Concord", "AMC Pacer", and "AMC Spirit" but moss will put in _match1 the matched subexpressions "Concord", "Pacer", and "Spirit".
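As a minimal sketch of the regex option, the pattern from this paragraph can be tried on the auto dataset shipped with Stata, which contains AMC makes. The sketch assumes moss is already installed (for example from SSC) and that no variables named _count, _pos1, or _match1 exist beforehand.

```stata
* Assumes moss is installed, e.g. via: ssc install moss
sysuse auto, clear

* one subexpression in parentheses, as the regex option requires
moss make, match("AMC ([A-Za-z]+)") regex

* show only the observations in which the pattern was found
list make _count _pos1 _match1 if _count > 0
```

For the matching observations, _match1 should hold the text captured by the parenthesized subexpression rather than the full text matched by the pattern.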
moss follows the principle that occurrences must be disjoint and may not overlap. That is, it finds just one occurrence of "ana" in "banana", not two.
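The disjointness rule can be checked on a small made-up dataset. The sketch below assumes moss is installed; the second string, "ananana", is an added illustration in which an overlapping count would give three occurrences of "ana", while a left-to-right disjoint search would be expected to give two.

```stata
* Assumes moss is installed, e.g. via: ssc install moss
clear
input str10 s
"banana"
"ananana"
end

moss s, match("ana")
list s _count _pos*

* "banana" contains just one disjoint occurrence of "ana", as stated above
assert _count == 1 in 1
```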
Options
match() is required. The pattern is treated as literal text unless the regex option is specified, in which case it is interpreted as a regular expression.
regex specifies that the pattern is to be interpreted as a regular expression. Such a pattern must contain precisely one subexpression to be extracted. See Examples.
prefix() specifies an alternative prefix for new variable names to be created by moss. Such a prefix must start either with a letter or with an underscore.
suffix() specifies a suffix for new variable names to be created.
prefix() and suffix() may not be combined.
maximum() specifies an upper limit to the number of position and match variables to be created. That is, specify max(3) if you want to see details of at most the first 3 occurrences of your pattern.
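A brief sketch of maximum(), assuming moss is installed and using the auto dataset shipped with Stata. Several makes contain more than one space, so without the option more than one position variable would be expected; with maximum(1) the description above implies that _pos1 is the only position variable created.

```stata
* Assumes moss is installed, e.g. via: ssc install moss
sysuse auto, clear

* record details of at most the first space in each make
moss make, match(" ") maximum(1)

* inspect which new variables were created
describe _*
```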
compact specifies that the most compact storage types possible be used during calculations. Specifying this option may slow moss down.
Examples
. moss make, match(",")
. moss make, match("([0-9]+)") regex
. moss history, match("(X+)") regex
. moss s, match("([^ ]+)") prefix(s_) regex
Authors
Robert Picard picard@netbox.com
Nicholas J. Cox, Durham University n.j.cox@durham.ac.uk
Acknowledgments
A question on Statalist from Rebecca A. Pope was the stimulus for writing this program.
Also see
Help: [D] strpos(), [D] regexm(), [D] split
FAQs: What are regular expressions and how can I use them in Stata?