{smcl}
{* *! version 1.0  17june2026}{...}
{title:Title}

{phang}
{bf:littext} {hline 2} Automated construct discovery and relationship
inference from academic text

{pstd}
The short alias {bf:litt} is provided for interactive use; the two commands
are functionally identical.

{title:Syntax}

{p 8 16 2}
{bf:littext analyze} {cmd:,} {opt t:ext(varname)} [ {it:options} ]

{p 8 16 2}
{bf:littext graph} {cmd:,} {opt out:dir(string)} [ {opt t:ype(string)} {opt top(#)}
{opt we:ighted} {opt lev:el(string)}
{opt format(string)} {opt emb:ed(string)} {opt sav:ing(string)} {opt rep:lace} ]

{p 8 16 2}
{bf:littext export} {cmd:,} {opt out:dir(string)} [ {opt n:ame(string)}
{opt format(string)} {opt minc:onf(#)} {opt t:ype(string)} {opt top(#)}
{opt col:umns(string)} ]

{p 8 16 2}
{bf:littext example} [ {cmd:,} {opt clear} ]

{p 8 16 2}
{bf:littext install} [ {cmd:,} {opt q:uiet} {opt v:erbose} ]

{title:Description}

{pstd}
{bf:littext} extracts candidate construct relationships from an unstructured
corpus of academic text (titles, abstracts, full texts) or other research
text such as interview transcripts, consumer reviews, and social media
comments. It is intended for the exploratory researcher who has assembled a
large corpus and wants to generate candidate relationships of the form
"X is associated with Y", "X moderates the effect of Z on Y", etc., that
can then be hand-curated into a formal systematic literature review coding
scheme.

{pstd}
The pipeline is: text-kind-appropriate cleaning; spaCy noun-chunk
extraction; sentence-transformer embedding of candidate constructs;
HDBSCAN clustering into synonym groups; lexical construct-hierarchy
detection; co-occurrence-based relation candidacy with normalized
PMI scoring; dependency-pattern matching for relationship valence.
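
{pstd}
As a worked illustration of the scoring step, the lines below evaluate the
standard normalized PMI of Bouma (2009),
ln(p(x,y)/(p(x)p(y))) / (-ln p(x,y)), for a single construct pair. The
counts are hypothetical (1,000 units; X appears in 100, Y in 50, both in
20), and the exact estimator or smoothing that {bf:littext} applies
internally is not documented here, so this is a sketch of the definition
rather than of the package's implementation:

{phang}{cmd:. scalar p_x = 100/1000}{p_end}
{phang}{cmd:. scalar p_y = 50/1000}{p_end}
{phang}{cmd:. scalar p_xy = 20/1000}{p_end}
{phang}{cmd:. display ln(p_xy/(p_x*p_y)) / (-ln(p_xy))}{p_end}

{pstd}
The result is about 0.35. NPMI lies in [-1, 1]: a value of 1 means the two
constructs always co-occur, and 0 means they occur independently.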

{pstd}
Results are returned in three Stata frames left in memory:

{p 8 12 2}{bf:lt_constructs}  -  one row per extracted construct{p_end}
{p 8 12 2}{bf:lt_relations}   -  one row per candidate relationship{p_end}
{p 8 12 2}{bf:lt_diag}        -  one row per source document with diagnostics{p_end}

{pstd}
After {bf:littext analyze} returns, the user remains in the frame from which
the command was called. The results are held in the named frames
{bf:lt_constructs}, {bf:lt_relations}, and {bf:lt_diag}, which are queried
with a frame prefix, e.g.
{cmd:frame lt_relations: list source target relation_type confidence}.
Files are written to disk only if {opt sav:ing()} is specified.
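
{pstd}
One way to confirm what a run left in memory is to list and describe the
frames. The commands below use only built-in Stata frame commands together
with the frame and column names documented in this file:

{phang}{cmd:. frames dir}{p_end}
{phang}{cmd:. frame lt_constructs: describe}{p_end}
{phang}{cmd:. frame lt_diag: summarize n_constructs_extracted n_relations_extracted}{p_end}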

{title:Options for {cmd:littext analyze}}

{phang}
{opt t:ext(varname)} (required) - the variable in the current dataset that
holds the document text. May be a {bf:str#} or {bf:strL} variable. Must be
a string variable; numeric variables are rejected with an error.

{phang}
{opt i:d(varname)} - a per-document identifier. If omitted, {bf:_n} is used.

{phang}
{opt y:ear(varname)} - publication year (numeric). Used for the trend graph
and stored in {bf:lt_diag}.

{phang}
{opt j:ournal(varname)} - outlet name (string). Stored in {bf:lt_diag} for
comparative analysis.

{phang}
{opt textt:ype(string)} - declares the kind of text in the corpus. One of:

{p 8 12 2}{bf:abstract}    - academic abstracts (default; Emerald and copyright cleaners){p_end}
{p 8 12 2}{bf:fulltext}    - full papers (above plus LaTeX, references, captions){p_end}
{p 8 12 2}{bf:transcript}  - interview / focus group transcripts (speaker labels, timestamps){p_end}
{p 8 12 2}{bf:review}      - consumer reviews (HTML, ratings, verified-purchase labels){p_end}
{p 8 12 2}{bf:comment}     - social-media comments (URLs; emoticons preserved){p_end}
{p 8 12 2}{bf:other}       - minimal cleaning only (whitespace, control chars){p_end}

{pstd}
If {opt textt:ype()} is not declared, the package defaults to
{bf:abstract} and emits a note indicating that this default was applied.
The declaration drives the cleaning regime, the default {opt u:nit()},
and the default {opt mint:extlen()}. A post-clean median-length sanity
check warns when the corpus length is outside the typical window for
the declared texttype, which most often indicates a misdeclared
{opt t:ext()} variable.

{phang}
{opt u:nit(string)} - unit of analysis for relationship candidacy. One of
{bf:sentence}, {bf:abstract}, {bf:paragraph}. If not specified, defaults
from {opt textt:ype()}: sentence for abstract/transcript/review/comment/other,
paragraph for fulltext.

{phang}
{opt emb:edmodel(string)} - name of the sentence-transformers model used for
construct embeddings.

{phang}
{opt minf:req(#)} - minimum document frequency for a candidate construct to
be retained. Default: {bf:1} for corpora with fewer than 50 documents,
{bf:2} otherwise. Resolved value and rationale are printed at run time.

{phang}
{opt maxr:elations(#)} - cap on the number of candidate relationships
written to {bf:lt_relations} (highest-confidence first). Default {bf:100000}.

{phang}
{opt mint:extlen(#)} - minimum text length in characters. Rows whose
{opt t:ext()} value is shorter than this threshold are dropped before the
pipeline runs. If not specified, defaults from {opt textt:ype()}: 50 for
abstract/other, 500 for fulltext, 30 for transcript, 20 for review,
10 for comment. Pass {opt keepe:mpty} to disable row-dropping entirely.

{phang}
{opt keepe:mpty} - retain all rows including empty, whitespace-only, and
below-threshold ones. The default behavior is to drop these with a logged
count. Use this when the corpus is being analyzed for a purpose that
requires preserving the input row count.
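
{pstd}
The derived defaults can be overridden one option at a time. The sketch
below reuses the variable names {bf:body} and {bf:paper_id} from the
Examples section; the numeric values are illustrative only, not
recommendations:

{phang}{cmd:. littext analyze, text(body) id(paper_id) texttype(fulltext) unit(sentence) mintextlen(200) minfreq(3)}{p_end}
{phang}{cmd:. littext analyze, text(body) id(paper_id) texttype(fulltext) keepempty}{p_end}

{pstd}
The first call keeps the full-text cleaning regime but segments by sentence
and drops rows shorter than 200 characters; the second keeps every input row.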

{phang}
{opt addsentiment} - additionally compute VADER affective polarity on each
evidence sentence and store it in {bf:text_polarity}. Note: this is
{it:affective sentiment} of the text, NOT {it:relationship valence}.
Relationship valence is always computed and stored in {bf:relation_type}.
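
{pstd}
A possible use of {opt addsentiment} on a review corpus is sketched below.
The variable names follow the Examples section, and the summary assumes
that {bf:text_polarity} is stored as a numeric score, which this file does
not state explicitly:

{phang}{cmd:. littext analyze, text(review_text) id(review_id) texttype(review) addsentiment}{p_end}
{phang}{cmd:. frame lt_relations: tabstat text_polarity, by(relation_type) statistics(mean n)}{p_end}

{pstd}
The table shows the affective tone of the evidence sentences within each
relation type; it does not re-estimate the relation types themselves.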

{phang}
{opt q:uiet} - suppress progress output.

{phang}
{opt sav:ing(string)} - if specified, the three frames are also saved as
{it:stub}_constructs.dta, {it:stub}_relations.dta, {it:stub}_diag.dta.

{phang}
{opt rep:lace} - allow overwriting existing files when {opt sav:ing()} is used.
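
{pstd}
For instance, with the hypothetical stub {bf:rbv}, and assuming the files
are written relative to the current working directory, a run can be saved
and the relations table reloaded in a later session:

{phang}{cmd:. littext analyze, text(abstract) id(article_id) saving(rbv) replace}{p_end}
{phang}{cmd:. use rbv_relations.dta, clear}{p_end}

{pstd}
Note that {cmd:use ..., clear} replaces the data in the current frame, so
it is best issued after the source corpus is no longer needed in memory.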

{title:Options for {cmd:littext graph}}

{phang}
{opt t:ype(string)} - figure type. One of:

{p 8 12 2}{bf:frequency}     - bar chart of top-k constructs (Stata-native){p_end}
{p 8 12 2}{bf:distribution}  - distribution of relation types (Stata-native){p_end}
{p 8 12 2}{bf:trend}         - extraction yield over years (Stata-native){p_end}
{p 8 12 2}{bf:confidence}    - histogram of confidence scores (Stata-native){p_end}
{p 8 12 2}{bf:extraction}    - distribution by extraction method (Stata-native){p_end}
{p 8 12 2}{bf:map}           - UMAP concept map (matplotlib; default){p_end}
{p 8 12 2}{bf:network}       - relationship network (matplotlib){p_end}
{p 8 12 2}{bf:dendrogram}    - construct-cluster dendrogram (matplotlib){p_end}
{p 8 12 2}{bf:cooccurrence}  - pairwise NPMI heatmap of top-k constructs (matplotlib){p_end}
{p 8 12 2}{bf:roles}         - construct x relation-type heatmap (matplotlib){p_end}

{phang}
{opt top(#)} - number of top constructs or relationships to display.
Default {bf:20}. For heatmaps, controls the matrix dimensions.

{phang}
{opt we:ighted} - for {bf:type(network)} only: color edges continuously by
confidence (viridis) rather than discretely by relation type. Useful when
edge strength matters more than syntactic type.

{phang}
{opt lev:el(string)} - hierarchy specificity for construct-vocabulary graph
types. Accepts {bf:leaf} (default; constructs at maximum specificity),
{bf:root} (each construct replaced by its hierarchy root), or a
non-negative integer N (collapse to depth N). Honored by
{bf:type(frequency)} and by the matplotlib {bf:type(map)}
and {bf:type(network)} renderers. In {bf:map} a rolled construct is drawn
at the frequency-weighted centroid of its children. In {bf:network} edges
are aggregated within each relation type but never across types: a
positive and a negative edge between the same rolled pair stay distinct.
Other Stata-native types, the heatmaps ({bf:cooccurrence}, {bf:roles}),
and {bf:dendrogram} (whose tree is built from cluster distances, not the
construct hierarchy) ignore {opt level()} and emit a one-line note. The
hierarchy is computed by the lexical right-substring rule plus the
hyphenated-prefix rule described in the Notes section.

{phang}
{opt out:dir(string)} - directory where figure files will be written.
REQUIRED. Pass an absolute path (e.g. {bf:"D:\projects\figures"}). If
omitted, {cmd:littext graph} stops with an error rather than guessing a
location. A relative path is accepted but resolved against the current
working directory ({bf:c(pwd)}) with a warning. The resolved absolute
path is printed on every save.

{phang}
{opt sav:ing(string)} - output file stub for matplotlib figures (PNG and PDF
are written). For Stata-native graphs, the file is saved as PNG via
{cmd:graph export}.

{phang}
{opt format(string)} - output format for the matplotlib figure types
({bf:map}, {bf:network}, {bf:dendrogram}, {bf:cooccurrence}, {bf:roles}).
Accepts {bf:static} (default; PNG and PDF via matplotlib), {bf:html}
(interactive Plotly HTML), or {bf:both}. Ignored with a note for
Stata-native types, which are always static.

{phang}
{opt emb:ed(string)} - how plotly.js is embedded in {bf:format(html)}
output. {bf:selfcontained} (default) writes a standalone file (~3.5 MB)
that opens offline on any machine; {bf:cdn} writes a small file that
loads plotly.js from a content-delivery network and therefore needs an
internet connection to render.
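
{pstd}
The two calls below illustrate how {opt format()} and {opt emb:ed()}
combine, using only options documented above; the output directory is a
placeholder:

{phang}{cmd:. littext graph, type(network) outdir("D:/figs") format(html) embed(cdn)}{p_end}
{phang}{cmd:. littext graph, type(cooccurrence) top(30) outdir("D:/figs") format(both)}{p_end}

{pstd}
The first writes a small HTML file that needs an internet connection to
render; the second writes the static PNG and PDF alongside a
self-contained HTML file.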

{title:Options for {cmd:littext export}}

{pstd}
{cmd:littext export} writes the candidate relationships from the most recent
{cmd:littext analyze} (the {bf:lt_relations} frame) as a hypothesis register
for hand-curation: a clean candidate table, sorted strongest-first, with no
curation columns added (the analyst adds their own). Run
{cmd:littext analyze} first.

{phang}
{opt out:dir(string)} - REQUIRED. Absolute path to the directory where the
register is written. A relative path is resolved against {bf:c(pwd)} with a
warning.

{phang}
{opt n:ame(string)} - file-name stub for the register (default
{bf:littext_register}). The extension is added per {opt format()}.

{phang}
{opt format(string)} - output format: {bf:csv} (default), {bf:xlsx}, or
{bf:both}. CSV is written with full quoting, so evidence spans containing
commas or quotes survive intact.

{phang}
{opt minc:onf(#)} - keep only candidates with {bf:confidence} at or above
this value (default: keep all).

{phang}
{opt t:ype(string)} - restrict to one or more relation types, given as a
space- or comma-separated list (e.g. {bf:type(pos_assoc neg_assoc)}).

{phang}
{opt top(#)} - keep only the top {bf:#} candidates after sorting by
descending confidence (default: keep all).

{phang}
{opt col:umns(string)} - space-separated list of {bf:lt_relations} columns
to export. Unknown columns are skipped with a note.
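
{pstd}
A round trip through the CSV register might look as follows. The stub
{bf:rbv_register} and the frame name {bf:check} are hypothetical, and the
file name assumes that the extension added for {bf:format(csv)} is
{bf:.csv}. Because the register is fully quoted, {opt bindquote(strict)}
is used on import so that quoted evidence text containing commas or line
breaks stays in one field:

{phang}{cmd:. littext export, outdir("D:/register") name(rbv_register) columns(source target relation_type confidence evidence_text)}{p_end}
{phang}{cmd:. frame create check}{p_end}
{phang}{cmd:. frame check: import delimited using "D:/register/rbv_register.csv", varnames(1) bindquote(strict) clear}{p_end}
{phang}{cmd:. frame check: list source target confidence in 1/5}{p_end}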

{title:Stata frames produced}

{pstd}
{bf:lt_constructs}: construct_id, surface_form, canonical_form, cluster_id,
freq_doc, freq_total, parent_canonical, canonical_root, hierarchy_depth,
is_root.

{pstd}
{bf:lt_relations}: rel_id, doc_id, unit_id, source, target,
source_construct_id, target_construct_id, relation_type, confidence,
extraction_method, evidence_text, text_polarity.

{pstd}
{bf:lt_diag}: doc_id, year, journal, n_constructs_extracted,
n_relations_extracted.
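
{pstd}
Because {bf:lt_relations} carries construct identifiers, it can be linked
to {bf:lt_constructs} with Stata's built-in {cmd:frlink} and {cmd:frget}.
The sketch below assumes that {bf:source_construct_id} and
{bf:construct_id} share a storage type and that {bf:construct_id} is
unique; the names {bf:src_link} and {bf:source_root} are introduced here
for illustration. It adds two variables to {bf:lt_relations}:

{phang}{cmd:. frame lt_relations: frlink m:1 source_construct_id, frame(lt_constructs construct_id) generate(src_link)}{p_end}
{phang}{cmd:. frame lt_relations: frget source_root = canonical_root, from(src_link)}{p_end}
{phang}{cmd:. frame lt_relations: list source source_root relation_type in 1/10}{p_end}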

{title:relation_type vocabulary}

{phang}
{bf:pos_assoc}  - positive association (X increases/enhances/predicts Y){p_end}
{phang}
{bf:neg_assoc}  - negative association (X reduces/attenuates Y){p_end}
{phang}
{bf:moderates}  - X moderates the relationship between two others{p_end}
{phang}
{bf:mediates}   - X mediates the effect of one construct on another{p_end}
{phang}
{bf:causes}     - X causes / leads to Y{p_end}
{phang}
{bf:assoc}      - non-directional or unclassified co-occurrence{p_end}
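
{pstd}
The vocabulary can be used directly to filter candidates. The commands
below assume that {bf:relation_type} is stored as a string holding the
codes listed above; if it is a labeled numeric variable instead, the
comparison would need the corresponding value:

{phang}{cmd:. frame lt_relations: tabulate relation_type extraction_method}{p_end}
{phang}{cmd:. frame lt_relations: list source target confidence evidence_text if relation_type == "moderates"}{p_end}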

{title:Notes}

{pstd}
{bf:row-drop behavior.} By default, {cmd:littext analyze} drops rows
where the {opt t:ext()} variable is missing or whitespace-only, rows where
a user-supplied {opt i:d()} variable is missing, and rows whose text is
shorter than {opt mint:extlen()} characters. A summary of the drops is
printed (suppressed under {opt q:uiet}). A warning is emitted if more
than 25% of input rows are dropped, which most often indicates that the
{opt t:ext()} variable points at the wrong column.
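
{pstd}
The likely drop share can be approximated before running the pipeline. The
check below uses the {bf:abstract} variable from the bundled example and
the documented default threshold of 50 characters for abstracts; it only
approximates the package's own rule, whose exact counting is not
documented here:

{phang}{cmd:. count if missing(abstract) | ustrlen(strtrim(abstract)) < 50}{p_end}
{phang}{cmd:. display "share that would be dropped: " %5.3f r(N)/_N}{p_end}

{pstd}
A share above 0.25 corresponds to the condition under which
{cmd:littext analyze} emits its warning.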

{pstd}
{bf:text-kind declaration.} The {opt textt:ype()} option drives
three downstream defaults: which cleaning regime is applied to the raw
text; the default segmentation {opt u:nit()}; and the default
{opt mint:extlen()}. Each derived default is overridable by passing the
corresponding option explicitly. The resolved values and their sources
("user-specified" vs "texttype default") are printed at run time.
A post-clean median-length sanity check warns when the corpus length
falls outside the typical window for the declared texttype.

{pstd}
{bf:construct hierarchy.} Four columns describe the hierarchy in
{bf:lt_constructs}: {bf:parent_canonical} (the canonical form of the
immediate IS-A parent, or empty if the construct is a root),
{bf:canonical_root} (the topmost ancestor; equals {bf:canonical_form}
for roots), {bf:hierarchy_depth} (zero for roots, one for direct
children), and {bf:is_root} (1 if root, 0 otherwise). The hierarchy is
detected by a lexical right-substring rule with a frequency prior,
supplemented by a hyphenated-prefix rule that admits constructs of the
form {it:X-based Parent}, {it:X-driven Parent}, {it:X-led Parent}, and
{it:X-oriented Parent} as children of {it:Parent} regardless of the
frequency prior. The rule is English-specific and is silent on
conceptually-subsumed but lexically-distinct relations (e.g., it does
not link {it:brand reputation} to {it:brand equity} because they
share no right substring).

{pstd}
{bf:Example hierarchies the rule recovers.} If a corpus contains
{it:brand equity} and any of {it:consumer-based brand equity},
{it:financial-based brand equity}, {it:online brand equity}, or
{it:employee-based brand equity}, the rule places each subtype as a
depth-1 child of {it:brand equity}. Query the hierarchy with:

{phang}{cmd:. frame lt_constructs: list canonical_form parent_canonical canonical_root, sepby(canonical_root)}{p_end}

{pstd}
or roll up at the graph level with:

{phang}{cmd:. littext graph, type(frequency) level(root) outdir("D:/figs")}{p_end}
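
{pstd}
To inspect a single family, the hierarchy columns can be filtered
directly. The value {bf:"brand equity"} is the illustrative root from the
paragraph above and assumes that canonical forms are stored in lower case:

{phang}{cmd:. frame lt_constructs: list canonical_form hierarchy_depth freq_doc if canonical_root == "brand equity" & is_root == 0}{p_end}
{phang}{cmd:. frame lt_constructs: tabulate hierarchy_depth}{p_end}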

{title:Examples}

{pstd}Load the bundled synthetic RBV (resource-based view) corpus (300 abstracts) and analyze it:{p_end}

{phang}{cmd:. littext example, clear}{p_end}
{phang}{cmd:. littext analyze, text(abstract) id(article_id) year(year) journal(journal) texttype(abstract)}{p_end}
{phang}{cmd:. frame lt_relations: list source target relation_type confidence in 1/10}{p_end}
{phang}{cmd:. frame lt_relations: tab relation_type}{p_end}
{phang}{cmd:. littext graph, type(map) outdir("D:/figs")}{p_end}
{phang}{cmd:. littext graph, type(network) top(25) outdir("D:/figs")}{p_end}
{phang}{cmd:. littext graph, type(network) level(root) outdir("D:/figs")}{p_end}
{phang}{cmd:. littext graph, type(map) level(1) outdir("D:/figs")}{p_end}
{phang}{cmd:. littext graph, type(network) outdir("D:/figs") format(html)}{p_end}
{phang}{cmd:. littext graph, type(map) outdir("D:/figs") format(both)}{p_end}
{phang}{cmd:. littext export, outdir("D:/register") format(both)}{p_end}
{phang}{cmd:. littext export, outdir("D:/register") minconf(0.7) type(pos_assoc neg_assoc) top(200)}{p_end}
{phang}{cmd:. littext graph, type(frequency) level(root) outdir("D:/figs")}{p_end}

{pstd}For a corpus of interview transcripts:{p_end}

{phang}{cmd:. use my_transcripts.dta, clear}{p_end}
{phang}{cmd:. littext analyze, text(transcript) id(case_id) texttype(transcript)}{p_end}

{pstd}For a corpus of consumer reviews:{p_end}

{phang}{cmd:. use product_reviews.dta, clear}{p_end}
{phang}{cmd:. littext analyze, text(review_text) id(review_id) texttype(review)}{p_end}

{pstd}For full-text academic papers (LaTeX or PDF-extracted text):{p_end}

{phang}{cmd:. use my_fulltexts.dta, clear}{p_end}
{phang}{cmd:. littext analyze, text(body) id(paper_id) texttype(fulltext)}{p_end}

{title:Sentiment analysis: a note}

{pstd}
{bf:littext} distinguishes two concepts that are often conflated in social
science, marketing, and management applications:

{phang}
1. {it:Relationship valence} is the sign of the directional relationship
between two constructs (X positively/negatively related to Y). This is always
computed and stored in {bf:relation_type}. It is essential to the purpose of
the package; a hypothesis register that cannot distinguish "X increases Y"
from "X reduces Y" is not a hypothesis register.{p_end}

{phang}
2. {it:Affective sentiment} is the emotional polarity of a piece of text, in
the sense of VADER, LIWC, or the NRC Emotion Lexicon. This is meaningful for
consumer-text corpora (reviews, tweets) but largely uninformative for academic
abstracts. {bf:littext} computes it only on request via {opt addsentiment} and
stores it in {bf:text_polarity}.{p_end}

{pstd}
Users should not treat {bf:text_polarity} as a measure of relationship sign.

{title:Requirements}

{pstd}
Stata 19 or higher with Python integration configured. Python 3.14
recommended on Windows; spaCy on Python requires {bf:blis 1.3.3} or
higher. Required Python packages: spacy, sentence-transformers, hdbscan,
scikit-learn, umap-learn, matplotlib, networkx, plotly, pandas, numpy. The spaCy
model {bf:en_core_web_sm} must be downloaded once via
{cmd:python -m spacy download en_core_web_sm}.

{title:Limitations}

{pstd}
{bf:littext} uses noun-chunk extraction rather than a domain-trained NER
model, and co-occurrence plus dependency-pattern matching rather than a
trained relation extractor. It is therefore best understood as a
candidate-generation tool whose output requires manual curation before
being treated as a coding scheme. Quantitative precision/recall figures
should not be reported against the bundled synthetic corpus.

{pstd}
The construct-hierarchy detector and the {opt textt:ype()} cleaning
regimes are English-specific. The hierarchy rule does not recover
conceptual subsumption that lacks a lexical signal.

{title:References}

{pstd}
Bouma, G. (2009). Normalized (pointwise) mutual information in collocation
extraction. In {it:Proceedings of GSCL, 30}, 31-40.

{pstd}
Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text
corpora. In {it:Proceedings of COLING-92}, 539-545.

{pstd}
Hutto, C. J., & Gilbert, E. (2014). VADER: A parsimonious rule-based model
for sentiment analysis of social media text. In {it:Proceedings of ICWSM}, 8(1), 216-225.

{pstd}
Li, J., Larsen, K. R., & Abbasi, A. (2020). TheoryOn: A design framework and
system for unlocking behavioral knowledge through ontology learning. {it:MIS Quarterly}, 44(4), 1733-1772.


{title:Aliases}

{pstd}
{cmd:litt} is provided as a short-form alias for {cmd:littext}. It
forwards every argument and propagates returned scalars and macros.


{title:Author}

{pstd}
Nebojsa S. Davcik{break}
EM Normandie Business School, Oxford, UK{break}
ORCID: 0000-0003-1041-8788{break}
{browse "https://orcid.org/0000-0003-1041-8788":https://orcid.org/0000-0003-1041-8788}{break}
Email: {browse "mailto:davcik@live.com":davcik@live.com}


{title:Citation}

{pstd}
When citing {cmd:littext} in academic work, please use:

{phang2}
Davcik, N. S. 2026. {it:LITTEXT: Stata module for automated construct discovery and relationship inference from academic text.}
Available at: {browse "https://github.com/Davcik/littext":https://github.com/Davcik/littext}


{title:License}

{pstd}
{cmd:littext} is free software released under the
{browse "https://www.gnu.org/licenses/gpl-3.0.html":GNU General Public License version 3 or later} (GPL-3.0-or-later).
You may redistribute and modify it under the terms of that license;
modified versions and larger works that incorporate {cmd:littext}
must also be released under GPL-3 or later. See the LICENSE file in
the repository root for the full license text.

{pstd}
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


{title:Also see}

{phang}The command: {helpb littext}{p_end}