{smcl}
{* *! version 1.1 04feb2016}{...}
help for {cmd:whichencoding}{right:version 1.1 (04 February 2016)}
help for {cmd:ascii2unicode}{right:version 1.1 (04 February 2016)}
help for {cmd:unicode2ascii}{right:version 1.1 (04 February 2016)}
{hline}

{title:Title}

{phang}{bf:whichencoding} {hline 2} Examine the encoding of Stata datasets and text files. Stata 14.1 (28jan2016 update) or newer required{p_end}

{phang}{bf:ascii2unicode} {hline 2} Translate datasets and text files from ASCII to Unicode encoding. Stata 14.1 (28jan2016 update) or newer required{p_end}

{phang}{bf:unicode2ascii} {hline 2} Translate datasets and text files from Unicode to ASCII encoding. Stata 14.1 (28jan2016 update) or newer required{p_end}


{title:Syntax}

{p 8 17 2}
{cmd:whichencoding} {it:filespec} [{cmd: ,} {cmdab:d:etail} {cmd:nodata}]

{p 8 17 2}
{cmd:ascii2unicode} {it:filespec} [{cmd: ,} {cmdab:e:ncoding()} {cmdab:s:uffix()} {cmdab:d:etail} {cmd:nodata replace}]

{p 8 17 2}
{cmd:unicode2ascii} {it:filespec} [{cmd: ,} {cmdab:e:ncoding()} {cmdab:v:ersion()} {cmdab:s:uffix()} {cmdab:d:etail} {cmd:nodata replace}]

{pstd}{it:filespec} specifies the file(s) to be examined or translated, typically datasets and do-files.
You can specify single files or groups of files; for example, {cmd:*.dta *.do} selects all datasets and do-files in the current directory.

{pstd}{ul:Windows users}: Windows filenames are not case-sensitive, but the official {cmd:unicode} commands are, even in Stata for Windows, and the commands described here inherit that behavior.
For example, the file specifications {cmd:b*.dta} and {cmd:B*.dta} select different datasets.

{synoptset 20 tabbed}{...}
{synopthdr}
{synoptline}
{syntab:Main}
{synopt:{cmdab:e:ncoding()}}Extended ASCII encoding scheme for ASCII source or destination files.
If not specified, the scheme recorded in the global macro {cmd:UnicodeEncoding} is used; if that macro is not defined, {cmd:Windows-1252}{p_end}
{synopt:{cmdab:v:ersion(#)}}Destination dataset version 12 or 13; version 13 if not specified{p_end}
{synopt:{cmdab:s:uffix()}}Add a suffix to ASCII source or destination files; {bf:_v12} or {bf:_v13} for datasets and {bf:_asc} for text files if not specified{p_end}
{synopt:{cmdab:d:etail}}Display details about encoding of each file{p_end}
{synopt:{cmd:nodata}}Do not analyze or translate contents of string variables{p_end}
{synopt:{cmd:replace}}Overwrite existing destination files{p_end}
{synoptline}


{title:Description}

{pstd}Stata version 14 and later use Unicode encoding; prior versions use ASCII encoding.
If you use plain ASCII characters only, for example if you write in English and avoid extended ASCII characters like é, ü, and ñ, you need not read further.
Likewise, if you do not mind that an occasional character is displayed incorrectly, you might decide not to spend more time on this issue.
Read more about ASCII and Unicode encoding in the Remarks section.

{pstd}If a Stata dataset or text file generated by Stata <14 contains extended ASCII characters (for example, é, ü, ñ), Stata 14+ can open the file, but the extended ASCII characters will not be displayed correctly.
Similarly, if a dataset or text file generated by Stata version 14+ contains Unicode encoded characters beyond plain ASCII, you can use {help saveold} to generate a file that can be used by Stata 11-13, but the Unicode characters will not be displayed correctly.

{pstd}{cmd:whichencoding} examines the occurrence of Unicode and extended ASCII characters in Stata datasets and in text files such as do-files, ado-files, help files, and log files.
This is useful to determine the need for translation when sharing Stata files between users or computers with different versions of Stata installed.
The official {cmd:unicode analyze} command serves the same purpose, but the output from {cmd:whichencoding} is more compact and transparent.

{pstd}{cmd:ascii2unicode} translates datasets and text files with extended ASCII characters to Unicode encoding.
Files with plain ASCII characters only are not translated.
The translated (destination) file takes the original name of the source file, and the source file is renamed by adding a suffix.
The official {cmd:unicode translate} serves the same purpose, but the output from {cmd:ascii2unicode} is more compact and transparent, and you have access to both a Unicode and an ASCII version of a dataset at the same time.

{pstd}{cmd:unicode2ascii} translates datasets and text files with Unicode characters to ASCII encoding.
Files with plain ASCII characters only are not translated.
In datasets, variable names, labels and label names (including labels in different languages), string variable contents, and notes are translated.
The source file keeps its name, and a suffix is added to the destination file name.
Currently (February 2016), no official Stata command serves the same purpose.


{title:Options}

{col 5}{bf:Commands and their options}
{col 5}{hline 62}
{col 39}{it:Command}
{col 18}{hline 49}
{col 5} {it:Option}{col 19}whichencoding{col 35}ascii2unicode{col 51}unicode2ascii
{col 5}{hline 62}
{col 5} encoding(){col 41}+{col 57}+
{col 5} version(){col 57}+
{col 5} suffix(){col 41}+{col 57}+
{col 5} detail{col 25}+{col 41}+{col 57}+
{col 5} nodata{col 25}+{col 41}+{col 57}+
{col 5} replace{col 41}+{col 57}+
{col 5}{hline 62}

{phang}
{opt encoding}{bf:(}{it:encoding scheme}{bf:)} specifies the extended ASCII encoding scheme for ASCII encoded source or destination files.
If the computer's encoding scheme is {cmd:Latin 2}, make it the default by including this command in {cmd:profile.do}:{p_end}

{p 12}{cmd:if c(stata_version)>=14 unicode encoding set Latin 2}

{p 8 8}This will create the global macro {cmd:UnicodeEncoding}, making {cmd:Latin 2} the default encoding scheme.
If the macro is not defined, the default encoding scheme will be {cmd:Windows-1252}.{p_end}

{p 8 8}(Commands in {cmd:profile.do} will be executed automatically at Stata start-up.
Use the {cmd:sysdir} command to locate the {cmd:PERSONAL} folder and put {cmd:profile.do} there.)

{phang}
{opt version(#)} ({cmd:unicode2ascii}) specifies the Stata version (12 or 13) of the destination dataset; version 13 is the default.
(The diligent user may have noticed that Stata 14's help for {help saveold} claims that {cmd:saveold} can also save a version 11 dataset.
This is not quite accurate: It can save a version 12 dataset, which Stata 11 can read, but not a version 11 dataset, which Stata 10 would be able to read.)

{phang}
{opt suffix()} specifies a suffix to be included in ASCII destination file names ({cmd:unicode2ascii}) or in modified ASCII source file names ({cmd:ascii2unicode}).
Default suffixes are {bf:_v12} or {bf:_v13} for datasets and {bf:_asc} for text files.

{phang}
{opt detail} displays details about each file to be analyzed or translated.
It displays the output from {bf:unicode analyze}, which can be quite verbose, so avoid using this option for more than a few files.

{phang}
{opt nodata} specifies that the contents of string variables should not be analyzed or translated.

{phang}
{opt replace} allows the destination dataset to overwrite an existing dataset with the same name.


{title:Examples: whichencoding}

{p 8}{cmd:. whichencoding filea.dta , detail}{p_end}
{p 8}{cmd:. whichencoding *.dta *.do}

{pstd}We examine three datasets and three do-files.
Two files contain plain ASCII characters only, two files contain extended ASCII encoded characters, and two files contain Unicode encoded characters.
Here is the output:

{col 8}{hline 55}
{col 8}Directory: C:\Docs\Project D
{col 8}{hline 55}
{col 8}File name{col 38}Version{col 47}Encoding
{col 8}{hline 55}
{col 8}filea.dta{col 40}v13{col 47}Plain ASCII
{col 8}fileb.dta{col 40}v12{col 47}Extended ASCII
{col 8}filec.dta{col 40}v14{col 47}Unicode
{col 8}gen_filea.do{col 47}Plain ASCII
{col 8}gen_fileb.do{col 47}Extended ASCII
{col 8}gen_filec.do{col 47}Unicode
{col 8}{hline 55}

{pstd}The output displays the version of each dataset.
We also learn that {cmd:filea.dta} and {cmd:gen_filea.do} contain plain ASCII characters only, so both Stata 13 and 14 can use them without any problems.

{pstd}{cmd:fileb.dta} and {cmd:gen_fileb.do} contain extended ASCII characters.
They can be opened by Stata 14, but some characters will not be displayed correctly unless they are translated to Unicode with the official {cmd:unicode translate} command or the unofficial {cmd:ascii2unicode} command (see below).

{pstd}{cmd:filec.dta} and {cmd:gen_filec.do} contain Unicode encoded characters.
If needed, we can use {cmd:saveold} to generate a version that Stata 11, 12, or 13 can open, but some characters will not be displayed correctly.
Currently (February 2016), no official commands translate datasets from Unicode to extended ASCII, but the unofficial {cmd:unicode2ascii} command does (see below).


{title:Examples: ascii2unicode}

{p 8}{cmd:. ascii2unicode alpha.dta}{p_end}
{p 8}{cmd:. ascii2unicode *.dta *.do , encoding(Latin 2)}{p_end}
{p 8}{cmd:. ascii2unicode beta2.dta gen_beta2.do , detail replace}

{pstd}We examine six files: {cmd:filea.dta} and {cmd:gen_filea.do}, which contain plain ASCII characters only; {cmd:fileb.dta} and {cmd:gen_fileb.do}, which contain some extended ASCII characters; and {cmd:filec.dta} and {cmd:gen_filec.do}, which are Unicode encoded.
The files with plain ASCII characters and the Unicode encoded files need no translation; only the files with extended ASCII characters are translated.
Here is the output:

{col 5}{hline 72}
{col 5}Directory: C:\docs\project D
{col 5}{hline 72}
{col 5}Source files, modified names{col 43}Result files, Unicode compatible
{col 5}{hline 34}{col 42}{hline 35}
{col 5}filea.dta{col 40}={col 43}filea.dta
{col 5}fileb_v12.dta{col 39}{hline 2}>{col 43}fileb.dta
{col 5}filec.dta{col 40}={col 43}filec.dta
{col 5}gen_filea.do{col 40}={col 43}gen_filea.do
{col 5}gen_fileb_asc.do{col 39}{hline 2}>{col 43}gen_fileb.do
{col 5}gen_filec.do{col 40}={col 43}gen_filec.do
{col 5}{hline 72}
{col 5}{hline 2}> File translated; new source file name{col 50}= File not translated
{col 5}{hline 72}

{pstd}
The result files in the rightmost column have the same names as the original source files; four of them are actually the same files, but two files were translated from ASCII to Unicode encoding.
The original source files for these two files were renamed by adding a suffix.
{cmd:fileb.dta} was a Stata version 12 dataset, hence the new name {cmd:fileb_v12.dta}.
For do-files and other text files the default suffix is {cmd:_asc}; unlike datasets, they do not belong to a specific Stata version.


{title:Examples: unicode2ascii}

{p 8}{cmd:. unicode2ascii alpha.dta}{p_end}
{p 8}{cmd:. unicode2ascii *.dta *.do , encoding(Latin 2)}{p_end}
{p 8}{cmd:. unicode2ascii beta2.dta gen_beta2.do , version(12) replace}

{pstd}We examine three datasets and three do-files to be used by Stata version 13, and translate them from Unicode to ASCII, if necessary.
This is the output:

{col 5}{hline 70}
{col 10}Directory: C:\Docs\Project C
{col 5}{hline 70}
{col 10}Source files{col 43}Result files, ASCII compatible
{col 5}{hline 33}{col 42}{hline 33}
{col 5}v13: filea.dta{col 40}={col 43}filea.dta
{col 5}v12: fileb.dta{col 40}={col 43}fileb.dta
{col 5}v14: filec.dta{col 39}{hline 2}>{col 43}filec_v13.dta
{col 5}asc: gen_filea.do{col 40}={col 43}gen_filea.do
{col 5}asc: gen_fileb.do{col 40}={col 43}gen_fileb.do
{col 5}unc: gen_filec.do{col 39}{hline 2}>{col 43}gen_filec_asc.do
{col 5}{hline 70}
{col 5}{hline 2}> File translated; new result file name{col 50}= File not translated
{col 5}{hline 70}

{pstd}The first column lists the source files, the second column the result files.
Source dataset filenames are preceded by their version, and do-files and other text files are preceded by "{cmd:unc}" or "{cmd:asc}" to indicate their encoding.
If a file is translated, a suffix is added to the result file name.


{title:Remarks}

{pstd}Prior to version 14, Stata used ASCII encoding of characters.
The codes for plain ASCII are 0-127; for extended ASCII they are 128-255.
There are several extended ASCII encoding schemes, for example, {cmd:Latin 1} and {cmd:Windows-1252} for Western European languages, {cmd:Latin 2} for some Central and Eastern European languages, and {cmd:ISO-8859-5} for languages using the Cyrillic alphabet.
Thus, the same extended ASCII code may be displayed as different characters on computers using different encoding schemes.

{pstd}From version 14, Stata uses Unicode encoding of characters.
This gives access to thousands of characters and symbols, including Arabic, Cyrillic, Chinese, and other scripts.
The Unicode {it:code point} is the number to use with the {help uchar()} function, but behind it is a more complex encoding (UTF-8) where each character is stored as one to four 8-bit bytes.
For plain ASCII characters, the Unicode {it:code point}, the UTF-8 byte, and the ASCII code are all the same.
For characters in the {cmd:Latin 1} extended ASCII encoding, the Unicode {it:code point} equals the extended ASCII code, although UTF-8 stores such a character as two bytes.

{pstd}{cmd:whichencoding}, {cmd:ascii2unicode}, and {cmd:unicode2ascii} utilize the official {cmd:unicode analyze} command.
Currently (February 2016) {cmd:unicode analyze} skips a file if a backup file indicates that a file with the same name has been analyzed previously.
To make sure that revised versions of already analyzed files are re-analyzed, {cmd:whichencoding} and {cmd:ascii2unicode} rename the backup file for the file being analyzed.
Backup files are located in the {cmd:bak.stunicode} and {cmd:bak.stunicode/status.stunicode} subdirectories of the current directory.


{title:Authors}

{p 4 4 2}Svend Juul{break}
Aarhus University{break}
sj@ph.au.dk

{p 4 4 2}Morten Frydenberg{break}
Aarhus University{break}
morten@ph.au.dk


{title:Also see}

{p 4 4 2}help for {help unicode} (Stata 14+ only){break}
help for {help saveold}