{smcl} {title:edm Description} {p 4 4 2} The command {bf:edm} implements a series of tools that can be used for empirical dynamic modeling in Stata. The core algorithm is written in C++ (with a Mata backup) to achieve reasonable execution speed. The command keyword is {bf:edm}, and should be immediately followed by a subcommand such as explore or xmap. A dataset must be declared as time-series or panel data by the tsset or xtset command prior to using the edm command, and time-series operators including l., f., d., and s. can be used (the last for seasonal differencing). {title:Syntax} {p 4 4 2} The {bf:explore} subcommand follows the syntax below and supports one variable for exploration using simplex projection or S-mapping. {p 8 12 2} {bf:edm} explore {bf:variable} [if {bf:exp}], [e(numlist ascending >=2)] [tau(integer 1)] [theta(numlist ascending)] [k(integer 0)] [ALGorithm(string)] [REPlicate(integer 0)] [seed(integer 0)] [full] [RANDomize] [PREDICTionsave(name)] [COPREDICTionsave(name)] [copredictvar(string)] [CROSSfold(integer 0)] [CI(integer 0)] [EXTRAembed(string)] [ALLOWMISSing] [MISSINGdistance(real 0)] [dt] [reldt] [DTWeight(real 0)] [DTSave(name)] [DETails] [reportrawe] [strict] [Predictionhorizon(string)] [dot(integer 1)] [mata] [gpu] [nthreads(integer 0)] [savemanifold(name)] [idw(real 0)] [lowmemory] {p 4 4 2} The second subcommand {bf:xmap} performs convergent cross-mapping (CCM). The subcommand follows the syntax below and requires two variables to follow immediately after xmap. It shares many of the same options with the explore subcommand although there are some differences given the different purpose of the analysis. {p 8 12 2} {bf:edm} xmap {bf:variables} [if {bf:exp}], [e(integer 2)] [tau(integer 1)] [theta(real 1)] [Library(numlist)] [RANDomize] [k(integer 0)] [ALGorithm(string)] [REPlicate(integer 0)] [strict] [DIrection(string)] [seed(integer 0)] [PREDICTionsave(name)] [COPREDICTionsave(name)] [copredictvar(string)] [CI(integer 0)] [EXTRAembed(string)] [ALLOWMISSing] [MISSINGdistance(real 0)] [dt] [reldt] [DTWeight(real 0)] [DTSave(name)] [oneway] [DETails] [SAVEsmap(string)] [Predictionhorizon(string)] [dot(integer 1)] [mata] [gpu] [nthreads(integer 0)] [savemanifold(name)] [idw(real 0)] [lowmemory] {p 4 4 2} The third subcommand {bf:update} updates the plugin to its latest version {p 8 12 2} {bf:edm} update, [develop] [replace] {p 4 4 2} The fourth subcommand {bf:version} displays the current version number {p 8 12 2} {bf:edm} version {marker options}{...} {title:Options} {dlgtab: Options for explore and xmap subcommands} {phang} {bf:e(numlist ascending)}: This option specifies the number of dimensions E used for the main variable in the manifold reconstruction. If a list of numbers is provided, the command will compute results for all numbers specified. The xmap subcommand only supports a single integer as the option whereas the explore subcommand supports the option as a numlist. The default value for E is 2, but in theory E can range from 2 to almost half of the total sample size. The actual E used in the estimation may be different if additional variables are incorporated. A error message is provided if the specified value is out of range. Missing data will limit the maximum E under the default deletion method. {phang} {bf:tau(integer)}: The tau (or τ) option allows researchers to specify the ‘time delay’, which essentially sorts the data by the multiple τ. This is done by specifying lagged embeddings that take the form: t,t-1τ,…,t-(E-1)τ, where the default is tau(1) (i.e., typical lags). However, if tau(2) is set then every-other t is used to reconstruct the attractor and make predictions—this does not halve the observed sample size because both odd and even t would be used to construct the set of embedding vectors for analysis. This option is helpful when data are oversampled (i.e., spaced too closely in time) and therefore very little new information about a dynamic system is added at each occasion. However, the tau() setting is also useful if different dynamics occur at different times scales, and can be chosen to reflect a researcher’s theory-driven interest in a specific time-scale (e.g., daily instead of hourly). Researchers can evaluate whether τ>1 is required by checking for large autocorrelations in the observed data (e.g., using Stata’s corrgram function). Of course, such a linear measure of association may not work well in nonlinear systems and thus researchers can also check performance by examining ρ and MAE at different values of τ. {phang} {bf:theta(numlist ascending)}: Theta (or θ) is the distance weighting parameter for the local neighbours in the manifold. It is used to detect the nonlinearity of the system in the explore subcommand for S-mapping. Of course, as noted above, for simplex projection and CCM a weight of theta(1) is applied to neighbours based on their distance, which is reflected in the fact that the default value of θ is 1. However, this can be altered even for simplex projection or CCM (two cases that we do not cover here). Particularly, values for S-mapping to test for improved predictions as they become more local may include the following command: theta(0 .00001 .0001 .001 .005 .01 .05 .1 .5 1 1.5 2 3 4 6 8 10). {phang} {bf:k(integer)}: This option specifies the number of neighbours used for prediction. When set to 1, only the nearest neighbour is used, but as k increases the next-closest nearest neighbours are included for making predictions. In the case that k is set 0, the number of neighbours used is calculated automatically (typically as k = E + 1 to form a simplex around a target), which is the default value. When k < 0 (e.g., k(-1)), all possible points in the prediction set are used (i.e., all points in the library are used to reconstruct the manifold and predict target vectors). This latter setting is useful and typically recommended for S-mapping because it allows all points in the library to be used for predictions with the weightings in theta. However, with large datasets this may be computationally burdensome and therefore k(100) or perhaps k(500) may be preferred if T or NT is large. {phang} {bf:ALGorithm(string)}: This option specifies the algorithm used for prediction. If not specified, simplex projection (a locally weighted average) is used. Valid options include simplex and smap, the latter of which is a sequential locally weighted global linear mapping (or S-map as noted previously). In the case of the xmap subcommand where two variables predict each other, the algorithm(smap) invokes something analogous to a distributed lag model with E + 1 predictors (including a constant term c) and, thus, E + 1 locally-weighted coefficients for each predicted observation/target vector—because each predicted observation has its own type of regression done with k neighbours as rows and E + 1 coefficients as columns. As noted below, in this case special options are available to save these coefficients for post-processing but, again, it is not actually a regression model and instead should be seen as a manifold. {phang} {bf:RANDomize}: When splitting the observations into library and prediction sets, by default the oldest observations go into the library set and the newest observations to the prediction set. Though if the randomize option is specified, the data is allocated into the two sets in a random fashion. If the replicate option is specified, then this randomization is enabled automatically. {phang} {bf:REPlicate(integer)}: The explore subcommand uses a random 50/50 split for simplex projection and S-maps, whereas the xmap subcommand selects the observations randomly for library construction if the size of the library L is smaller than the size of all available observations. In these cases, results may be different in each run because the embedding vectors (i.e., the E-dimensional points) used to reconstruct a manifold are chosen at random. The replicate option takes advantages of this to allow repeating the randomization process and calculating results each time. This is akin to a nonparametric bootstrap without replacement, and is commonly used for inference using confidence intervals in EDM (Tsonis et al., 2015; van Nes et al., 2015; Ye et al., 2015b). When replicate is specified, such as replicate(50), mean values and the standard deviations of the results are reported across the 50 runs by default. As we note below, it is possible to save all estimates for post-processing using typical Stata commands such as svmat, allowing the graphing of results or finding percentile-based with the pctile command. {phang} {bf:PREDICTionsave(variable)}: This option allows you to save the edm predictions as a variable, which could be useful for plotting and diagnosis. {phang} {bf:COPREDICTionsave(variable)}: This option allows you to save the copredictions as a variable. You must specify the copredictvar(variables) options for this to work. {phang} {bf:copredictvar(variable)}: This option specifies the variable used for coprediction. A second prediction is run for each configuration of E, library, etc., using the same library set but with a prediction set built from the lagged embedding of this variable. {phang} {bf:Predictionhorizon(integer)}: This option adjusts the default number of observations ahead which we predict. By default, the explore mode predict τ observations ahead and the xmap mode uses p(0). This parameter can be negative. {phang} {bf:DETails}: By default, only mean values and standard deviations are reported when the replicate option is specified. The details option overrides this behaviour by providing results for each individual run. Irrespective of using this option, all results can be saved for post-processing. {phang} {bf:CI(integer)}: When used with replicate() or crossfold(), this option reports the confidence interval for the mean of the estimates (MAE and/or ρ), as well as the percentiles of their distribution. The first row of output labelled “Est. mean CI” reports the estimated confidence interval of the mean ρ, assuming that ρ has a normal distribution—estimated as the corrected sample standard deviation (with N-1 in the denominator) divided by the squared root of the number of replications. The reported range can be used to compare mean ρ across different (hyper) parameter values (e.g., different E, θ, or L) using the same datasets as if the sample was the entire population (such that uncertainty is reduced to 0 when the number of replications →∞). These intervals can be used to test which (hyper) parameter values best describe a sample, as might be typically used when using crossfold validation methods. The row labelled with “Pc (Est.)” follows the same normality assumption and reports the estimated percentile values based on the corrected sample standard deviation of the replicated estimates. The row labelled “Pc (Obs.)” reports the actual observed percentile values from the replicated estimates. In both of these latter cases the percentile values offer alternative metrics for comparisons across distributions, which would be more useful for testing typical hypotheses about population differences in estimates across different (hyper) parameter values (e.g., different E, θ, or L), such as testing whether a dynamical system appears to be nonlinear in a population (i.e., testing whether ρ is maximized when θ > 0). The number specified within the ci() bracket determines the confidence level and the locations of the percentile cut-offs. For example, ci(90) instructs edm to return 90% CI as well as the cut-off values for the 5th and 95th percentile values—because ρ and MAE values cannot or are not expected to take on negative values, we typically prefer one-tailed hypothesis tests and therefore would use ci(90) to get a one-tailed 95% interval. These estimated ranges are also included in the e() return list as a series of scalars with names starting with “ub” for upper bound and “lb” for lower bound values of the CIs. These return values can be used for further post-processing. {phang} {bf:seed(integer)}: This option specifies the seed used for the random number. In some special cases users may wish to use this in order to keep library and prediction sets the same across simplex projection and S-mapping with a single variable, or across multiple CCM runs with different variables. Note: if "set rng" has been used to change Stata's random number generation algorithm, then edm will temporarily change it the default 64 bit Mersenne twister internally. {phang} {bf:strict}: When this option is specified, the computation will fail if the requested number of neighboring observations is not present. By default, if not all of the request neighbors are available, the computation will just continue using as many as possible. {phang} {bf:ALLOWMISSing} This option allows observations with missing values to be used in the manifold. Vectors with at least one non-missing values will be used in the manifold construction. Distance computations are adapted to allow missing values when this option is specified. {phang} {bf:MISSINGdistance(real)}: This option allows users to specify the assumed distance between missing values and any values (including missing) when estimating the Euclidean distance of the vector. This enables computations with missing values. The option implies allowmissing. By default, the distance is set to the expected distance of two random draws in a normal distribution, which equals to 2/sqrt(pi) * standard deviation of the mapping variable. {phang} {bf:EXTRAembed(variables)}: This option allows incorporating additional variables into the embedding (multivariate embedding), e.g. extra(z l.z). Time series lists are unabbreviated here, e.g. extra(L(1/3).z) will be equivalent to extra(L1.z L2.z L3.z). Normally, lagged versions of the extra variables are not included in the embedding, however the syntax extra(z(e)) includes e lags of z in the embedding. {phang} {bf:dt}: This option allows automatic inclusion of the timestamp differencing in the embedding. There will be E dt variables included for an embedding with E dimensions. By default, the weights used for these additional variables equal to the standard deviation of the main mapping variable divided by the standard deviation of the time difference. This can be overridden by the `dtweight()` option. `dt` option will be ignored when running with data with no sampling variation in the time lags. The first dt variable embeds the time of the between the most recent observation and the time of the corresponding target/predictand. {phang} {bf:reldt}: This option, to be read as 'relative dt', is like the 'dt' option above in that it includes E extra variables for an embedding with E dimensions. However the timestamp differences added are not the time between the corresponding observations, but the time of the target/predictand minus the time of the lagged observations. {phang} {bf:DTWeight(real)}: This option specifies the weight used for the timestamp differencing variable. {phang} {bf:DTSave(variable)}: This option allows users to save the internally generated timestamp differencing variable. {phang} {bf:nthreads(integer)}: The number of threads the C++ plugin will use for parallel computations. The default is the number of cores available on the host computer. {phang} {bf:idw(real)}: This parameter is used when xtset indicates that the current dataset is panel data. Then, idw specifies a penalty that is added to the distances between points in the manifold which correspond to observations from different panels. By default idw is 0, so the data from all panels is mixed together and treatly equally. If idw(-1) is set (or any other negative value), then the weight is treated as 'infinity', so neighbours will never be selected which cross the boundaries between panels. Setting idw(-1) with k(-1) means we may use a different number of neighbors for different predictions (i.e. if the panels are unbalanced). {phang} {bf:lowmemory}: It is possible that RAM may be depleted while running edm on large datasets with large values of E. The lowmemory flag directs the plugin to try to save as much space as possible by more efficiently using memory, though for small datasets this will likely slow down the computations by a small but noticeable amount. {phang} Besides the shared parameters, edm explore supports the following extra options: {phang} {bf:CROSSfold(integer)}: This option asks the program to run a cross-fold validation of the predicted variables. crossfold(5) indicates a 5-fold cross validation. Note that this cannot be used together with replicate. This option is only available with the `explore` subcommand. {phang} {bf:full}: When this option is specified, the explore command will use all possible observations in the manifold construction instead of the default 50/50 split. This is effectively the same as leave-one-out cross-validation as the observation itself is not used for the prediction. {phang} {bf:reportrawe}: By default, the program reports the actual E used in the manifold. With this option, the program will only report the number of dimensions constructed from the main variable. {phang} {bf:Library(numlist ascending)}: This option specifies the total library size L used for the manifold reconstruction. Varying the library size is used to estimate the convergence property of the cross-mapping, with a minimum value Lmin = E + 2 and the maximum equal to the total number of observations minus sufficient lags (e.g., in the time-series case without missing data this is Lmax = T + 1 – E). An error message is given if the L value is beyond the allowed range. To assess the rate of convergence (i.e., the rate at which ρ increases as L grows), the full range of library sizes at small values of L can be used, such as if E = 2 and T = 100, with the setting then perhaps being library(4(1)25 30(5)50 54(15)99). This option is only available with the `xmap` subcommand. {phang} {bf:SAVEsmap(string)}: This option allows smap coefficients to be stored in variables with a specified prefix. For example, specifying “edm xmap x y, algorithm(smap) savesmap(beta) k(-1)” will create a set of new variables such as beta1_b0_rep1. The string prefix (e.g., ‘beta’) must not be shared with any variables in the dataset, and the option is only valid if the algorithm(smap) is specified. In terms of the saved variables such as beta1_b0_rep1, the first number immediately after the prefix ‘beta’ is 1 or 2 and indicates which of the two listed variables is treated as the dependent variable in the cross-mapping (i.e., the direction of the mapping). For the “edm xmap x y” case, variables starting with beta1_ contain coefficients derived from the manifold M_X created using the lags of the first variable ‘x’ to predict Y, or Y|M_X. This set of variables therefore store the coefficients related to ‘x’ as an outcome rather than a predictor in CCM. Keep in mind that any Y→X effect associated with the beta1_ prefix is shown as Y|M_X, because the outcome is used to cross-map the predictor, and thus the reported coefficients will be scaled in the opposite direction of a typical regression (because in CCM the outcome variable predicts the cause). To get more familiar regression coefficients (which will be locally weighted), variables starting with beta2_ store the coefficients estimated in the other direction, where the second listed variable ‘y’ is used for the manifold reconstruction M_Y for the mapping X|M_Y in the “edm xmap x y” case, testing the opposite X→Y effect in CCM, but with reported S-map coefficients that map to a Y→X regression. We appreciate that this may be unintuitive, but because CCM causation is tested by predicting the causal variable with the outcome, to get more familiar regression coefficients requires reversing CCM’s causal direction to a more typical predictor→outcome regression logic. This can be clarified by reverting to the conditional notation such as X|M_Y, which in CCM implies a left-to-right X→Y effect, but for the S-map coefficients will be scaled as a locally-weighted regression in the opposite direction Y→X. Moving on, following the 1 and 2 is the letter b and a number. The numerical labeling scheme generally follows the order of the lag for the main variable and then the order of the extra variables introduced in the case of multivariate embedding. b0 is a special case which records the coefficient of the constant term in the regression. The final term rep1 indicates the coefficients are from the first round of replication (if the replicate() option is not used then there is only one). Finally, the coefficients are saved to match the observation t in the dataset that is being predicted, which allows plotting each of the E estimated coefficients against time and/or the values of the variable being predicted. The variables are also automatically labelled for clarity. This option is only available with the `xmap` subcommand. {phang} {bf:DIrection(string)}: This option allows users to control whether the cross mapping is calculated bidirectionally or unidirectionally, the latter of which reduces computation times if bidirectional mappings are not required. Valid options include “oneway” and “both”, the latter of which is the default and computes both possible cross-mappings. When oneway is chosen, the first variable listed after the xmap subcommand is treated as the potential dependent variable following the conventions in the regression syntax of Stata such as the‘reg’ command, so “edm xmap x y, direction(oneway)” produces the cross-mapping Y|M_X, which pertains to a Y→X effect. This is consistent with the beta1_ coefficients from the previous savesmap(beta) option. On this point, the direction(oneway) option may be especially useful when an initial “edm xmap x y”procedure shows convergence only for a cross-mapping Y|M_X, which pertains to a Y→X effect. To save time with large datasets, any follow-up analyses with the algorithm(smap) option can then be conducted with “edm xmap x y, algorithm(smap) savesmap(beta) direction(oneway)”. To make this easier there is also a simplified oneway option that implies direction(oneway). This option is only available with the `xmap` subcommand. {phang} {bf:oneway}: This option is equivalent to "direction(oneway)" {dlgtab: Options for update subcommand} {phang} The update subcommand supports the following options: {phang} {bf:develop}: This option updates the command to its latest development version. The development version usually contains more features but may be less tested compared with the older version distributed on SSC. {phang} {bf:replace}: This option specifies whether you allow the update to override your local ado files. {title:Examples} {p 4 4 4} Chicago crime dataset example (included in the auxiliary file) {cmd: use chicago,clear} {cmd: edm explore temp, e(2/30)} {cmd: edm xmap temp crime} {cmd: edm xmap temp crime, alg(smap) savesmap(beta) e(6) k(-1)} {title:Updates} {phang} To install the stable version or upgrade directly through Stata: {phang} {cmd:edm update, replace} {phang} To install the development version directly through Stata: {phang} {cmd:edm update, develop replace} {title:Suggested Citation} {phang} Li, J, Zyphur, M & Sugihara, G. Beyond linearity, stability, and equilibrium: The edm package for empirical dynamic modeling and convergent cross-mapping, Stata Journal {title:Contact} {phang} Jinjing Li, National Centre for Social and Economic Modelling, University of Canberra, Australia {browse "mailto:jinjing.li@canberra.edu.au":jinjing.li@canberra.edu.au}