{smcl}
{* *! version 1.0  march2014}{...}
{vieweralsosee "matchit" "help matchit"}{...}
{viewerjumpto "Syntax" "freqindex##syntax"}{...}
{viewerjumpto "Description" "freqindex##description"}{...}
{viewerjumpto "Options" "freqindex##options"}{...}
{viewerjumpto "Examples" "freqindex##examples"}{...}
{viewerjumpto "Saved results" "freqindex##saved_results"}{...}
{marker Top}{...}
{title:Title}

{p2colset 5 18 20 2}{...}
{p2col :Freqindex {hline 2}}Generates an index of terms with their frequencies based on the current dataset{p_end}
{p2colreset}{...}

{marker syntax}{...}
{title:Syntax}

{p 5 15}
{cmd:freqindex} {it:[idvar] txtvar} 
[{it:, options}]
{p_end}

{synoptset 20 tabbed}{...}
{synoptline}
{syntab :}
{synopthdr}
{synoptline}
{synopt :{opt sim:ilmethod(simfcn)}}
Specifies the method to decompose the string into {it:grams}. 
Default is {bf:token}. Other built-in {it:simfcn} are:
{bf:bigram, ngram, ngram_circ, soundex} and {bf:token_soundex}.
{p_end}

{synopt :{opt incm:ata(mata_array)}}
Increments an existing index in memory ({it:mata_array}) 
with the information from the current dataset.
{p_end}

{synopt :{opt keepm:ata}}
Keeps the Mata objects after conclusion (including {it:mata_array}). 
Default is dropping them. See list below.
{p_end}

{synopt :{opt nost:ata}}
Omits producing a Stata output with the results (which is the default). 
It only makes sense to be used in combination with {it:keepmata} 
and meant for programming purposes when indexing several files. 
{p_end}

{synoptline}

{marker description}{...}
{title:Description}

{pstd}
{cmd:freqindex} indexes each singular term in a given string variable ({it:txtvar})
from the current dataset and computes its frequencies. 
As such, it returns a new dataset containing a string variable listing the terms 
(named {it:grams}) and a numeric variable with the corresponding frequencies (named {it:freq}).
Please, note that {cmd:freqindex} is case-sensitive and  
it also takes into account any other symbol (as far as Stata does).
{p_end}

{pstd}
{cmd:freqindex} is also a required element of {help matchit}, which uses it to compute weights.
Moreover, {cmd:freqindex} can be used autonomously as a complementary tool for 
computing weights based on custom frequencies or frequencies found in other sources. 
When using it with {help matchit} you should always specify the same {it:simfcn} in both commands.
Check {help "matchit##table_examples":here} for an example of how each built-in {it:simfcn} 
treats strings.
{p_end}

{pstd}
The numeric variable {it:idvar} is optional and 
has limited use beyond programming purposes.
{p_end}

{marker options}{...}
{title:Options}

{dlgtab: Options}

{phang}
{opt txtvar}
is the required string {varname} from the current file to be indexed.

{phang}
{opt sim:ilmethod(simfcn)}
explicitly declares the method to parse the two string variables into {it:Grams}. 
Default is {bf:token}. Other built-in {it:simfcn} are: {bf:bigram, ngram, soundex} and {bf:token_soundex}.
{p_end}

{phang}
{opt sim:ilmethod(simfcn,arg)}
is the alternative syntax when {it:simfcn} requires an argument. 
This is the case of {bf:ngram} and {bf:ngram_circ}, 
which allows computing 1-gram, 2-gram, 3-gram, etc. by passing {bf:n} as an argument.
For instance, {cmd:sim}({bf:ngram,2}) is equivalent to {cmd: sim}({bf:bigram}).
{p_end}

{phang}
{opt keepm:ata}
keeps the Mata objects after conclusion (including {it:mata_array}). 
Default is dropping them. 
It is useful when indexing several columns and/or several files.
See an {help "freqindex##examples":example} below.
{p_end}

{phang}
{opt nost:ata}
omits producing a Stata output with the results (which is the default). 
It only makes sense to be used in combination with {bf:keepmata}. 
It is particularly useful when indexing several columns from the same file 
(see an {help "freqindex##examples":example} below). 
{p_end}

{phang}
{opt incm:ata(mata_array)}
Increments an existing index in the Mata associative array ({it:mata_array}) 
with the information from the current dataset.
Please explicitly set {bf:keepmata} if you want to keep {it:mata_array} 
after running {cmd:freqindex}. 
{p_end}

{phang}
{opt idvar} 
is a numeric {varname} from the current file identifying its observations.
It is optional and of no use beyond Mata programming. 
It only makes sense to be used in combination with {bf:keepmata}. 

{synoptline}

{marker examples}{...}
{title:Examples:}

{phang2}{cmd:. freqindex} {it:mystring} 

{pstd}Setting matching method{p_end}
{phang2}{cmd:. freqindex} {it:mystring}, {bf: sim(soundex)} {p_end}
{phang2}{cmd:. freqindex} {it:mystring}, {bf: sim(ngram,3)} {p_end}

{pstd}Incrementing an existing index{p_end}
{phang2}{cmd:. freqindex} {it:mystring1}, {bf: keepm nost}{p_end}
{phang2}{cmd:. freqindex} {it:mystring2}, {bf: incm(WGTARRAY) keepm nost}{p_end}
{phang2}{cmd:. freqindex} {it:mystring3}, {bf: incm(WGTARRAY) keepm nost}{p_end}
{phang2}{cmd:. freqindex} {it:mystring4}, {bf: incm(WGTARRAY) keepm nost}{p_end}
{phang2}{bf: ...} {p_end}
{phang2}{cmd:. freqindex} {it:mystringN}, {bf: incm(WGTARRAY)}{p_end}
{phang2}{cmd:. list}{p_end}
{synoptline}

{marker saved_results}{...}
{title:Saved results}

{pstd}
{cmd:freqindex} saves the following in {cmd:Mata}{p_end}
{pstd}
(only if keepmata option is included){p_end}

{synoptset 15 tabbed}{...}
{p2col 5 15 19 2: Mata:}{p_end}
{synopt:{cmd:IDW}}colvector of idvar (only if specified){p_end}
{synopt:{cmd:TXTW}}colvector of txtvar{p_end}
{synopt:{cmd:WGTARRAY}}Array of grams->frequencies{p_end}
{p2colreset}{...}

{marker author}{...}
{title:Author}

{pstd}Julio D. Raffo{p_end}