{smcl} {* *! version 1.3.2 10dec2013}{...} {cmd:help project}{right:dialog: {bf:{dialog project_setup:project}}} {hline} {title:Title} {phang} {bf:project} {hline 2} A set of tools to build and manage a Stata project. {title:Syntax: 1. Dataset of Projects} {pstd} To define a new project, select its master do-file and default log type using {p 8 16 2} {cmd:project, setup} {pstd} To remove a project from the dataset of projects {p 8 16 2} {cmd:project} {it:project_name}, {cmd: pclear} {pstd} To list currently defined projects and their directories {p 8 16 2} {cmd:project} [{it:project_name}] {cmd:, plist} {pstd} To change Stata's current working directory to a project's directory {p 8 16 2} {cmd:project} {it:project_name} {cmd:, cd} {title:Syntax: 2. Project Management Tasks} {pstd} A project is managed one task at a time by {p 8 16 2} {cmd:project} {it:project_name} {cmd:,} {it:{help project##manage_task:manage_task}} [{opt text:log} {opt smcl:log}] {phang} except for the share task, which has an extended syntax {p 8 16 2} {cmd:project} {it:project_name} {cmd:,} {cmd:share(}{it:[share_with_name] [, {help project##share_options:share_options}])}{cmd:)} [{opt text:log} {opt smcl:log}] {synoptset 21 tabbed}{...} {marker manage_task}{...} {synopthdr :manage_task} {synoptline} {synopt :{opt build}}build {it:project_name} by running the master do-file {p_end} {synopt :{opt list(options)}}list files in the project; {it:options} are {opt build} to list the details of the last build, {opt type} to list project files by type, {opt index} to list project files alphabetically, {opt directory} to list project files by directory, {opt concordance} to produce a dependency to do-file concordance table, {opt archive} to list project files that would be copied to the archive directory by the {opt archive} task. The {opt cleanup} option lists files that would be moved to the archive directory because they are in the project folder but are not included in the project.{p_end} {synopt :{opt validate}}validate a build by checking that project files have not changed since the last build {p_end} {synopt :{opt replicate}}completely rebuild {it:project_name} and check all created files for differences {p_end} {synopt :{opt archive}}copy files that have changed since the last archive to the archive directory{p_end} {synopt :{opt cleanup}}move files that are in the project directory but are not referenced within the build to the archive directory{p_end} {synopt :{opt rmcreated}}erase all files created by the project {p_end} {synoptline} {p2colreset}{...} {synoptset 21 tabbed}{...} {marker share_options}{...} {synopthdr :share_options} {synoptline} {synopt :{opt all:time}}share all files, irrespective of when they were added/modified{p_end} {synopt :{opt nocre:ated}}do not include files created by the project{p_end} {synopt :{opt max(size_string)}}specify a maximum file size in bytes, kB, MB, or GB{p_end} {synopt :{opt list}}list files that would be shared; no files are actually copied{p_end} {synoptline} {p2colreset}{...} {title:Syntax: 3. Build Directives} {pstd} You include build directives in do-files. By default, these clear the data in memory unless {cmd:preserve} is specified. {p 8 16 2} {cmd:project} {cmd:,} {it:{help project##build_directives:build_directives}} [{cmd:preserve}] {pstd} The build directive to run a do-file within a do-file (nested do-file) can include a log type override for the do-file {p 8 16 2} {cmd:project} {cmd:,} {opt do(filename)} [{opt text:log} {opt smcl:log} {cmd:preserve}] {synoptset 21 tabbed}{...} {marker build_directives}{...} {synopthdr :build_directives} {synoptline} {synopt :{opt do(filename)}}run a nested do-file {p_end} {synopt :{opt original(filename)}}indicate an input file dependency for the currently running do-file - the input file is not created within the project {p_end} {synopt :{opt uses(filename)}}indicate an input file dependency for the currently running do-file - the input file was created previously within the project {p_end} {synopt :{opt relies_on(filename)}}indicate a dependency - the currently running do-file relies on information in {it:filename} (the file itself is not directly accessed by the do-file, e.g. pdf, info, docs, etc.) {p_end} {synopt :{opt creates(filename)}}indicate an output file dependency for the currently running do-file - {it:filename} was previously saved to disk by the current do-file{p_end} {synopt :{opt doinfo}}retrieve project information within a do-file {p_end} {synopt :{opt break}}stop execution of the build from within a do-file; data in memory is preserved{p_end} {synoptline} {p2colreset}{...} {p 4 6 2} Note that if your {it:filename} contains embedded spaces, it must be enclosed within double quotes. Also, {it:{help project##build_directives:build_directives}} are embedded in do-files and will clear Stata's memory when executed unless the {opt preserve} option is added, e.g. {bind:{cmd:project, creates("mydata.dta") preserve}} {p_end} {title:Overview} {pstd} {cmd:project} automates the execution of do-files, skipping do-files with unchanged dependencies. With {cmd:project}, you accumulate and organize all Stata code related to a project, from the early data management steps to the final analysis, in a web of interconnected do-files, all managed from a master do-file. Each time you build a project (i.e. run the master do-file), {cmd:project} knows what has changed and only runs do-files that are affected by these changes. {pstd} You define a Stata project by typing {bind:{cmd:project, setup}} in Stata's Command window. This brings up this {bf:{dialog project_setup:dialog}} window that you use to select the project's master do-file. The name of the master do-file, without the ".do" extension, becomes the {it:project_name}. The {it:project_name} must conform to Stata's standard {help [M-1] naming:naming convention} for variables and other objects so make sure that your master do-file is named appropriately. A short {it:project_name} is recommended as {cmd:project} management tasks include the {it:project_name} and are typed interactively. {pstd} The directory where the master do-file resides becomes the project directory. If you defined a project called "abc", then typing {cmd:project abc, build} will run the master do-file "abc.do" located in the project directory irrespective of Stata's current working directory. Use {bind:{cmd:project, setup}} again if you move or rename the project directory or master do-file. {pstd} If a do-file loads files (raw data, Stata datasets, dictionaries, etc.) or creates files (Stata datasets, graphs, text files, etc.), you must embed {it:{help project##build_directives2:build_directives}} in the do-file to let {cmd:project} record the do-file's dependencies with respect to these input and output files. For example, {bind:{cmd:project, original("raw.txt")}} indicates that the current do-file loads an original file (e.g. {bind:{cmd:insheet using "raw.txt"}}) that was not created by the project. If a do-file loads a file previously created by the project (e.g. {bind:{cmd:use "final.dta"}}), you would write {bind:{cmd:project, uses("final.dta")}} to indicate the do-file's dependency. The do-file that saves "final.dta" would include a {bind:{cmd:project, creates("final.dta")}} directive. It's also a good idea to record dependencies for files that are not used directly by the do-file but are relevant nonetheless. If you download data from a web site, you might want to record the steps by printing to pdf the download page. You would then use a {bind:{cmd:project, relies_on("download.pdf")}} directive to record the dependency. If you reuse the same input file in a do-file, you only need to record the dependency once. When {cmd:project} records a dependency, it runs Stata's {help checksum} command on the file and stores the result in its databases. {cmd:project} also writes, for the record, the checksum in the log file. {pstd} All {it:{help project##build_directives2:build_directives}} (except {opt break}) clear the data in memory while {cmd:project} does its thing. Usually, it does not matter because a dependency on an input file is typically recorded just before loading the file and a dependency on an output file is recorded just after saving the file to disk. If this is inconvenient because, for example, you want to continue using the data after it is saved, you can add {opt preserve} to the directive to request that {cmd:project} preserve and restore the data (e.g. {bind:{cmd:project, creates("mydata.dta") preserve}}). The location in the do-file of each directive does not really matter except of course that you want to record a dependency to a file created after it is created. You could therefore put all the directives for input file dependencies at the top of the do-file and all output file dependencies directives at the end of the do-file. Alternatively, you could put them all at the end, it really does not matter. {pstd} Use the {opt do(filename)} {it:{help project##build_directives2:build_directive}} to run a nested do-file. This puts {cmd:project} in control of running the do-file. {cmd:project} automatically creates log files for all nested do-files. Before running a nested do-file, {cmd:project} suspends the current log file and starts a log file for the nested do-file. When control returns at the end of the nested do-file's run, {cmd:project} closes its log file and resumes logging the current do-file. {cmd:project} notes that the nested do-file and its dependencies are now dependencies of the current do-file. Note that {cmd:project} automatically adds the do-file itself as well as its log file to its dependencies. {pstd} {cmd:project} automatically changes Stata's working directory to the directory of the current do-file. This means that you can always access files in the do-file's directory by file name only. Files elsewhere but still within the project directory can be accessed using a file path that is relative to the project directory (see {help project##ex_2:example 2}). A project that never uses full path names can be easily shared with others or moved to a new directory without having to update any file path. {pstd} When building a project, {cmd:project} skips do-files with unchanged dependencies under the assumption that they would produce exactly the same results since nothing has changed. You can use the {opt replicate} task to check that this is actually the case. This command moves all files created by the project (including log files) to an archive and then starts a replication build. When the build is complete, {cmd:project} checks each file created by the replication build against the matching copy in the archive and reports any differences found. {pstd} With {cmd:project}, you generally develop your work in a semi-interactive way, one do-file at a time. You can insert a break point anywhere in a do-file to stop execution and do some interactive experimentation. If your work is well compartmentalized, only the currently edited do-file is rerun at each build. {pstd} When you use the {opt build} task for the first time, {cmd:project} creates two datasets in the project directory. The first is "{it:project_name}_files.dta" and it stores information about each file in the project. The second is "{it:project_name}_links.dta" and it stores links between do-files and their dependencies. Obviously, you should not edit or change these files directly. {pstd} The various projects and their directories are stored in a dataset called "project.dta" in the same directory as the "project.ado" file. Again, you should not edit or change this file directly. {pstd} Within the project directory, the "archive" and "replicate" directories are automatically created by {cmd:project} as needed and should be reserved for its use. {marker manage_task2}{...} {title:Project management tasks} {pstd} {opt build} runs the master do-file "{it:project_name}.do" located in the project directory. Before running the master do-file, {cmd:project} loops over all the files linked to the project and uses {help checksum} to check for changes since the previous {opt build}. The files are checked in increasing order of file size. The first difference found stops this process and the master do-file is run. This process resumes when {cmd:project} is called from within the master do-file to run a nested do-file (e.g. {bind:{cmd:project, do("data.do")}}). A nested do-file inherits the dependencies of all do-files nested within. The nested do-file's dependencies list is first checked against previously calculated checksums (within the same build). If no change is detected, checksums are computed on the remaining dependencies, again in increasing order of file size. As soon as a change is detected, the nested do-file is run. The process repeats for all subsequent nested do-files. {pstd} {opt list(options)} produces a variety of listings related to {it:project_name}. The listing are recorded in a date and time-stamped log file saved in the project's "archive/list" directory. The options are (more than one can be specified, separated by spaces) {synoptset 15 tabbed}{...} {synopt :{opt build}}list the current build. Nested do-files are indented and all dependencies are shown, in order of appearance.{p_end} {synopt :{opt type}}list project files by type. This groups do-files, logfiles, original files, files that are relied upon, and files created by {it:project_name}. {p_end} {synopt :{opt index}}list project files alphabetically {p_end} {synopt :{opt directory}}list project files by directory {p_end} {synopt :{opt concordance}}dependency to do-file concordance table. This listing shows, for each file associated with the project, the do-file(s) within which it appears. Note that dependencies between do-files are skipped; use the {cmd:list(build)} task to track those. {p_end} {synopt :{opt cleanup}}lists all files in the project directory that are {hi:not} part of the most recent successful build. These files would be moved to the archive directory if the {opt cleanup} task is used. {p_end} {pstd} {opt validate} is used to verify that the results produced by the previous build still hold by checking that files associated with the project have not changed. As with the {opt build} task, the {help checksum} Stata command is used to check all project files. This command produces an alphabetical listing of all the files in the project and their status. Changes are just reported, they do not trigger a new build. The report is saved in a date and time-stamped log file in the project's "archive" directory. {pstd} {opt replicate} is used to verify that the results produced by the previous build can be replicated. All files created by the project are moved to a directory called "replicate" within the project directory and a new complete build is performed. Each file created by the {opt replicate} build is then checked against the version produced by the previous build. Stata datasets are checked by comparing their {help datasignature} (because their checksum changes since Stata includes a time stamp in datasets). Logfiles are tricky to compare because they include the logfile's time-stamp and changes in the {help checksum} for logfiles and datasets. The {opt replicate} option checks logfiles record by record and ignores these specific differences. All other files are checked by comparing their {help checksum}. A report is produced at the end of the {opt replicate} build that indicates, for each file created by the project, whether they have changed or not. A log file of the report is saved in the "replicate" directory. If you find unexpected differences, this is probably because your code operates on data that is not fully sorted; see the following {browse "http://www.stata.com/support/faqs/programming/sorting-on-categorical-variables/":FAQ}. Note that a replication build may generate differences in log files compared to the previous normal build. For example {cmd:save "mydata.dta", replace} generates a note if the file does not exist. A similar situation occurs when a do-file is skipped because no change was found since the last build. To completely replicate log files, it is usually necessary to perform a second replicate build. {pstd} {opt archive} is used to copy to an archive all additions and modified files in the project since the last archive. Files that are created by the project are omitted because they can be recreated by building the project again. This provides a quick and simple way to back up what's new/changed in the project. The files to be archived are copied to subdirectories that match the directory structure where they reside in the project directory. The archive itself is in a date and time-stamped directory within the "archive" directory. A log file of what was archived is also saved next to the archived directory. {pstd} {cmd:share(}{it:[share_with_name] [, {help project##share_options:share_options}])}{cmd:)} is used to share project files with others. For example, {bind:{cmd:project abc, share(Bob)}} copies to an archive directory all files that have been added or have changed since the last time files were shared with "Bob". This archive directory is created in the project's "archive" directory. Its name is based on {it:share_with_name} plus a date and time stamp. The {it:share_with_name} must be a valid Stata name, i.e. a single name that could be given to a variable (use an underscore instead of a space if you want to share using a first and last name). When specifying {it:share_options}, do not use double quotes; also case matters. Use the {opt alltime} option to force an archive of all the files in the project. The {opt nocreated} is used to omit files created by the project (these include all datasets saved, all log files, and any other files generated by the project). You can also omit files that are larger than {opt max(size_string)}. The {opt list} {it:share_option} lists files that would be archived (no files are copied). A log file of the files shared is also saved next to the archived directory. {pstd} The {opt cleanup} task targets files that are not part of the project. {cmd:project} scans the project's master directory (and recursively subdirectories) and moves to the project's "archive" directory any file that is not included in the build. The {opt list(cleanup)} task can be used to see which files would be moved. This task also removes files from past builds that are not part of the most recent build from "{it:project_name}_files.dta" and "{it:project_name}_links.dta". You should run a {opt cleanup} task before running a {opt replicate} task. This is a good way to catch some missing dependencies. A log file of the files moved is also saved next to the archived directory. {pstd} {opt rmcreated} erases all files created by the project. Because this includes all logfiles, all do-files in the project will run the next time the project is built. A report is saved in a date and time-stamped log file in the project's "archive" directory. {marker build_directives2}{...} {title:Build Directives} {pstd} Build directives are commands within do-files that are executed as the project is built. These cannot be used interactively and they do not include the {it:project_name} (e.g. {bind:{cmd:project, uses("cpi.raw")}}). All file names must be fully specified as {cmd:project} never assumes a particular file extension. {pstd} All build directives clear Stata's memory except for {opt break} which simply stops the build after closing the log file(s). You can add {opt preserve} to a build directive (e.g. {bind:{cmd:project, creates("mydata.dta") preserve}}) if you do not want {cmd:project} to clear the memory. Alternatively, you can move the directive to a location in the do-file where clearing the memory will not matter. {pstd} {opt do(do_filename)} is used to run a nested do-file. Note that this directive is not the same as Stata's {help do} command. The {opt do(do_filename)} build directive will not run {it:do_filename} if the do-file has not changed and there is no change in any of the do-file's dependencies since the last {opt build}. Before running {it:do_filename}, {cmd:project} completely resets Stata so that {it:do_filename} runs with no priors (global macros are also cleared). This promotes good programming practices by making sure that a do-file's results are entirely a consequence of commands within that do-file. {cmd:project} also changes Stata's current directory to the one that contains {it:do_filename}. The {opt do(do_filename)} directive also manages log files for all do-files automatically. It suspends the log file of the currently running do-file while a nested do-file runs. The {opt do(do_filename)} task automatically records the do-file's dependency to itself and to its log file. All other dependencies must be explicitly declared within do-files using the directives below. {pstd} {opt original(original_file)} records the current do-file's dependence on {it:original_file}, a file that was not created within the project. It indicates that if {it:original_file} changes, the do-file must be run again as its results may change. {pstd} {opt uses(uses_file)} records the current do-file's dependence on {it:uses_file}, a file that was previously created within the project. It indicates that if {it:uses_file} changes, the do-file must be run again as its results may change. {pstd} {opt relies_on(relies_on_file)} records the current do-file's dependence on {it:relies_on_file}, a file that is not created within the project. The {it:relies_on_file} is not used within the do-file so it cannot affect what is produced by the do-file. However, a change in {it:relies_on_file} will still trigger the running of the do-file because the log file must, as a matter of record, include the correct checksum for {it:relies_on_file}. This option is for documentation files, notes, or other raw data files that were not in a format that can be input by Stata. This directive indicates that the do-file's structure and code is somehow dependent on the content of {it:relies_on_file}. Using this directive has the benefit of adding {it:relies_on_file} to the list of project files, which means that it will be included in {opt archive} and {opt share} tasks, and will not be subject to removal by the {opt cleanup} task. {pstd} {opt creates(created_file)} records the current do-file's dependency on {it:created_file}, a file that is created by the do-file. This ensures that if {it:created_file} is missing or if it has changed in any way since the last build, the do-file will be run again. {pstd} {opt doinfo} prints information about the project currently being built. It also defines the following local macros: {synoptset 15 tabbed}{...} {synopt:{cmd:r(pname)}}the project name{p_end} {synopt:{cmd:r(pdir)}}the project's main directory{p_end} {synopt:{cmd:r(bdate)}}build start date{p_end} {synopt:{cmd:r(btime)}}build end date{p_end} {synopt:{cmd:r(dofile)}}do-file stub name (i.e. minus the ".do" extension){p_end} {p2colreset}{...} {pstd} With files that are not in a do-file's directory, you can use the {cmd:r(pdir)} macro to reference them using a path that is relative to the project directory (see {help project##ex_2:example 2}). If you avoid full pathnames entirely, the whole project can be moved to a new directory or to a different computer or shared with others without having to edit any file path. It is sometimes useful to construct file names that are derived from the name of the do-file. For example, a do-file called "cpi.do" can be used to input "cpi.xls" using {bind:{cmd:import excel using "`r(dofile)'.xls"}} and save a Stata dataset using {bind:{cmd:save "`r(dofile)'.dta", replace}}. If you have another version called "cpi2.xls", all you may have to do is save a copy of the do-file as "cpi2.do" and add it to the project. Also see {help project##ex_2:example 2}. {pstd} The {opt break} directive is used to stop execution of the build at a specific point within the do-file. Note that this is not Stata's {help break} command; it is a build directive so the correct syntax is {cmd:project, break}. The data in memory is not cleared but the log file is closed. The build is considered incomplete. This allows for the interactive development of do-files within a project. You just stop the build at any point and update the do-file as desired. You can try out commands interactively as the data remained in memory. When you are ready, build the project again and repeat until you are satisfied with the do-file. Note that you get exactly the same effect if a command in a do-file generates an error: the log file is closed, the data remain in memory, and the build is incomplete. {marker ex_1} {title:Example 1. Master do-file} {pstd} When starting a new project, the first step is to create the master do-file. The master do-file's stub name (the name without the ".do" extension) becomes the project name. Then you type {bind:{cmd:project, setup}} in Stata's command window to bring up a dialog and select "abc.do". The master do-file usually does not include any working code, it simply calls nested do-files. Here's an example: {hline 30} top of {cmd:abc.do} {hline 3} {cmd:Version 9.2} // in case Stata's syntax changes in future versions * Common settings {cmd:set more off} {cmd:set varabbrev off} // less confusing {cmd:set linesize 132} // use 7pt font for printing * Run do-files {cmd:project , do(data/original_data.do)} {cmd:project , do(data/cleanup_data.do)} {cmd:project , do(data/other_data.do)} {cmd:project , do(analysis/reg_data.do)} {cmd:project , do(analysis/myregs.do)} {cmd:project , do(tables/mytables.do)} {cmd:project , do(figures/myfigs.do)} {hline 30} end of {cmd:abc.do} {hline 3} {pstd} Each of these nested do-files can also contain nested do-files. Notice that file paths are relative to the do-file's directory. To build this project: {cmd:. project abc, build} {pstd} You do not need to worry about the location of Stata's current working directory, {cmd:project} will find "abc.do" and automatically change the directory to the project's directory. {marker ex_2} {title:Example 2. Access the project directory and do-file name within do-files} {pstd} With {cmd:project}, Stata's working directory is always aligned with the directory of the currently running do-file. This makes it easy to access files in that directory by referencing them by name only. In the example below, the dataset produced will be saved in the directory where the do-file is located. Its name is formed using the do-file's stub name. If you make copies of the do-file to try something else, you do not have to edit the name of the saved file as it will be derived from the new do-file's name. {pstd} A well-organized project is unlikely to put all project files in the same directory. You can easily reference files in other directories using a path that is relative to the project's directory. Both the {cmd:r(pdir)} and {cmd:r(dofile)} macro are returned by the {opt doinfo} {it:{help project##build_directives2:build_directive}} {hline 30} top of {cmd:reg_data.do} {hline 3} {cmd:Version 9.2} // for future replication {cmd:project, doinfo} {cmd:local master "`r(pdir)'"} {cmd:local doname "`r(dofile)'"} {cmd:project, uses("`master'/other_data/macro.dta")} // declare here is faster than using preserve {cmd:project, uses("`master'/data/maindata_cleaned.dta")} {cmd:use "`master'/data/maindata_cleaned.dta"} {cmd:sort state year} {cmd:merge state year using "`master'/other_data/macro.dta"} {cmd:tab _merge} {cmd:drop _merge} {cmd:save "`doname'.dta", replace} {cmd:project, creates("`doname'.dta")} {hline 30} end of {cmd:reg_data.do} {hline 3} {title:Author} {pstd}Robert Picard{p_end} {pstd}picard@netbox.com{p_end}