/* import_svmlight: import an .svmlight format file, replacing the current Stata dataset. See _svmlight.c */
* The variables created will be 'y' and 'x%d' for %d=[1 through max(feature_id)].
* Feature IDs are always positive integers, in svmlight format, according to its source code.

* TODO: rename to svm_use and figure out how to support the dual 'svm use filename'
*       and 'svm use varlist using filename' that the built-in use does
*       it will be possible, just maybe ugly

program _svmlight, plugin /*load the C extension if not already loaded*/

program define import_svmlight
  version 13
  syntax using/, [clip]

  quietly {

    * Do the pre-loading, to count how much space we need
    * (this sets the scalars _svm_load_N and _svm_load_M, which are used below)
    plugin call _svmlight, "import" "pre" "`using'"

    * HACK: Stata's various versions all have a hard upper limit on the number of variables;
    *       for example StataIC has 2048 (2^11) and StataMP has 2^15
    * ADDITIONALLY, Stata has an off-by-one bug: the max you can actually pass to a C plugin is one less [citation needed]
    * We simply clamp the number of variables to get around this: the total (the M features plus the Y variable)
    * is held to c(max_k_theory)-1-1, leaving 1 spare to avoid the off-by-one bug and 1 spare for a prediction column
    * This needs to be handled better. Perhaps we should let the user give varlist
    * (but if they don't give it, default to all in the file??)
    if(`=_svm_load_M+1' > `c(max_k_theory)'-1-1) {
      di as error "Warning: your version of Stata will not allow `=_svm_load_M+1' variables nor be able to use the C plugin with that many."
      if("`clip'"!="") {
        * the count reported here is the total number of variables, including y
        di as error "Clamping to `=c(max_k_theory)-1-1'."
        scalar _svm_load_M = `=c(max_k_theory)-1-1-1' /*remember: the extra -1 is to account for the Y column, and the extra extra -1 is to leave room for a prediction column*/
      }
      else {
        exit 1
      }
    }

    * Handle error cases explicitly
    if(`=_svm_load_M'<1) {
      * because Stata programming is all with macros, a bad value here doesn't cause a sensible crash;
      * instead it causes either "invalid syntax" or some sort of mysterious "invalid operation" error
      * (in particular "newlist x1-x0" is invalid)
      * checking this doesn't cover all the ways M can be bad (e.g. it could be a string)
      di as error "Need at least one feature to load"
      exit 1
    }
    if(`=_svm_load_N'<1) {
      * likewise, guard the observation count
      di as error "Need at least one observation to load"
      exit 1
    }

    * make a new, empty, dataset of exactly the size we need
    clear

    * Make variables y x1 x2 x3 ... x`=_svm_load_M'
    generate double y = .

    * this weird newlist syntax is the official suggestion for making a set of new variables in "help foreach"
    foreach j of newlist x1-x`=_svm_load_M' {
      * make a new variable named "xj" where j is an integer
      * specify "double" because libsvm uses doubles and the C interface uses doubles, yet the default is floats
      generate double `j' = .
    }

    * Make observations 1 .. `=_svm_load_N'
    * Stata will fill in the missing value for each at this point
    set obs `=_svm_load_N'

    * Delete the "local variables"
    * Do this here in case the next step crashes
    * I am programming in BASIC.
    scalar drop _svm_load_N _svm_load_M

    * Do the actual loading
    * "*" means "all variables". We need to pass this in because, in addition to C plugins only being able
    * to read and write to variables that already exist, they can only read and write to variables specified in varlist
    * (mata does not have this sort of restriction.)
    capture plugin call _svmlight *, "import" "`using'"
  }
end

* load the given svmlight-format file into memory
* the outcome variable (the first one on each line) is loaded in y, the rest are loaded into x