{* This file was generated by scripts/examples2smcl.}{...}
{* It is included by the ngram.sthlp file to embed the examples/ folder into the documentation.}{...}
{...}
{title:Examples: basic}
{pstd}ngram has many options, but most uses need only a few of them. For many uses, "ngram varname" will suffice.{p_end}
{pstd}This example demonstrates the customizations you are most likely to need.{p_end}
{pstd}{p_end}
{pstd}Stata is not commonly used for text mining, so this example is contrived.{p_end}
{pstd}Setup{p_end}
{phang2}{cmd:. sysuse auto}{p_end}
{pstd}Set the locale to declare the language of the dataset{p_end}
{phang2}{cmd:. set locale_functions en}{p_end}
{pstd}Extract unigrams through trigrams, {it:without} normalizing the tokens{p_end}
{phang2}{cmd:. ngram make, deg(3) nolower thresh(1)}{p_end}
{pstd}Inspect the results{p_end}
{phang2}{cmd:. desc}{p_end}
{phang2}{cmd:. list make if t_VW}{p_end}
{pstd}{it:({stata ngram_example basic:click to run})}{p_end}

{title:Examples: threshold}
{pstd}Setup{p_end}
{phang2}{cmd:. sysuse auto}{p_end}
{phang2}{cmd:. set locale_functions en}{p_end}
{pstd}Extract n-grams, culling infrequent n-grams with threshold(){p_end}
{phang2}{cmd:. ngram make, deg(3) thresh(5)}{p_end}
{pstd}Inspect the results{p_end}
{phang2}{cmd:. desc}{p_end}
{phang2}{cmd:. list make if t_buick}{p_end}
{pstd}{it:({stata ngram_example threshold:click to run})}{p_end}

{title:Examples: stemmer}
{pstd}This data comes from the Reuters-21578 dataset donated to the UCI machine learning archive:{p_end}
{pstd}http://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection{p_end}
{pstd}The first 2000 records have been reformatted from the ancient SGML into a spreadsheet that Stata can load.{p_end}
{pstd}The data is in CSV format because .dta files from before Stata 13 cannot store strings longer than 244 bytes.{p_end}
{phang2}{cmd:. import delimited using "reuters.csv", clear}{p_end}
{pstd}Set the locale.{p_end}
{pstd}If you are reading this in English, your locale is probably already set to English, which makes this redundant, but it is good to be explicit. Read this as "the Reuters dataset is an American English text dataset."{p_end}
{phang2}{cmd:. set locale_functions en_US}{p_end}
{pstd}Extract unigrams as normal{p_end}
{phang2}{cmd:. ngram title, thresh(3)}{p_end}
{pstd}Inspect the results{p_end}
{phang2}{cmd:. desc t_*}{p_end}
{phang2}{cmd:. list title if t_acquire}{p_end}
{pstd}{p_end}
{pstd}Reset the state.{p_end}
{pstd}This is a fragile way to do it; in general you should reload the dataset fresh each time.{p_end}
{phang2}{cmd:. drop t_*}{p_end}
{phang2}{cmd:. drop n_token}{p_end}
{pstd}Now compare with the same extraction with stemming enabled{p_end}
{phang2}{cmd:. ngram title, thresh(3) stem}{p_end}
{phang2}{cmd:. desc t_*}{p_end}
{pstd}It is informative to compare the listed results: the stemmed results capture lines that the unstemmed ones miss, because the unstemmed results split the single stemmed t_acquir column into separate t_acquire, t_acquires, and t_acquired columns.{p_end}
{phang2}{cmd:. list title if t_acquir}{p_end}
{pstd}{it:({stata ngram_example stemmer:click to run})}{p_end}