{* This file was generated by scripts/examples2smcl.}{...}
{* It is included by the ngram.sthlp file to embed the examples/ folder into the documentation.}{...}
{...}
{title:Examples: basic}

{pstd}{cmd:ngram} has many options, but most uses need only a few of them.  For many uses {cmd:ngram} {it:varname} will suffice.{p_end}
{pstd}This example demonstrates the customizations you are most likely to need.{p_end}
{pstd}{p_end}
{pstd}Stata is not commonly used for text mining, so this example is contrived.{p_end}

{pstd}Setup{p_end}
{phang2}{cmd:. sysuse auto}{p_end}

{pstd}Set the locale to declare the language of the dataset{p_end}
{phang2}{cmd:. set locale_functions en}{p_end}

{pstd}Extract unigrams through trigrams, {it:without} normalizing the tokens{p_end}
{phang2}{cmd:. ngram make, deg(3) nolower thresh(1)}{p_end}

{pstd}Inspect the results{p_end}
{phang2}{cmd:. desc}{p_end}
{phang2}{cmd:. list make if t_VW}{p_end}

{pstd}{it:({stata ngram_example basic:click to run})}{p_end}

{title:Examples: threshold}

{pstd}Setup{p_end}
{phang2}{cmd:. sysuse auto}{p_end}
{phang2}{cmd:. set locale_functions en}{p_end}

{pstd}Extract n-grams, culling sparse n-grams with {cmd:threshold()}{p_end}
{phang2}{cmd:. ngram make, deg(3) thresh(5)}{p_end}

{pstd}Inspect the results{p_end}
{phang2}{cmd:. desc}{p_end}
{phang2}{cmd:. list make if t_buick}{p_end}

{pstd}{it:({stata ngram_example threshold:click to run})}{p_end}

{title:Examples: stemmer}

{pstd}These data come from the Reuters-21578 dataset donated to the UCI machine learning archive:{p_end}
{pstd}http://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection{p_end}
{pstd}The first 2000 records have been reformatted from the ancient SGML into a spreadsheet that Stata can load.{p_end}
{pstd}The file is in CSV format because .dta files from before Stata 13 cannot store strings longer than 244 bytes.{p_end}
{phang2}{cmd:. import delimited using "reuters.csv", clear}{p_end}

{pstd}Set the locale.{p_end}
{pstd}If you are reading this in English, your locale is probably already set to English, which makes this step redundant, but it is good to be explicit. Read this as "the Reuters dataset is an American English text dataset."{p_end}
{phang2}{cmd:. set locale_functions en_US}{p_end}

{pstd}Extract unigrams as normal{p_end}
{phang2}{cmd:. ngram title, thresh(3)}{p_end}

{pstd}Inspect the results{p_end}
{phang2}{cmd:. desc t_*}{p_end}

{phang2}{cmd:. list title if t_acquire}{p_end}

{pstd}{p_end}
{pstd}Reset the state.{p_end}
{pstd}This is a fragile way to do it; in general you should reload the dataset fresh each time.{p_end}
{phang2}{cmd:. drop t_*}{p_end}
{phang2}{cmd:. drop n_token}{p_end}
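
{pstd}A minimal sketch of the more robust reset, assuming reuters.csv is still in the working directory: reload the dataset in place of the two {cmd:drop} commands above.{p_end}
{phang2}{cmd:. import delimited using "reuters.csv", clear}{p_end}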


{pstd}Now run the same extraction with stemming enabled{p_end}
{phang2}{cmd:. ngram title, thresh(3) stem}{p_end}

{phang2}{cmd:. desc t_*}{p_end}

{pstd}It is informative to compare the two sets of listed results:{p_end}
{pstd}the stemmed results capture lines that the unstemmed ones miss,{p_end}
{pstd}because the unstemmed results split the single stem t_acquir across{p_end}
{pstd}separate t_acquire, t_acquires, and t_acquired columns.{p_end}
{phang2}{cmd:. list title if t_acquir}{p_end}

{pstd}{it:({stata ngram_example stemmer:click to run})}{p_end}