{smcl}
{* *! version 1.1 04feb2016}{...}
help for {cmd:whichencoding}{right:version 1.1 (04 February 2016)}
help for {cmd:ascii2unicode}{right:version 1.1 (04 February 2016)}
help for {cmd:unicode2ascii}{right:version 1.1 (04 February 2016)}
{hline}


{title:Title}

{phang}{bf:whichencoding} {hline 2} Examine the encoding of Stata datasets and 
text files. Stata 14.1 (28jan2016 update) or newer required{p_end}
{phang}{bf:ascii2unicode} {hline 2} Translate datasets and text files from ASCII 
to Unicode encoding. Stata 14.1 (28jan2016 update) or newer required{p_end}
{phang}{bf:unicode2ascii} {hline 2} Translate datasets and text files from Unicode 
to ASCII encoding. Stata 14.1 (28jan2016 update) or newer required{p_end}


{title:Syntax}

{p 8 17 2}
{cmd:whichencoding} {it:filespec} [{cmd: ,} {cmdab:d:etail} {cmd:nodata}]

{p 8 17 2}
{cmd:ascii2unicode} {it:filespec} [{cmd: ,} {cmdab:e:ncoding()} {cmdab:s:uffix()}
{cmdab:d:etail} {cmd:nodata replace}]

{p 8 17 2}
{cmd:unicode2ascii} {it:filespec} [{cmd: ,} {cmdab:e:ncoding()} {cmdab:v:ersion()}
{cmdab:s:uffix()} {cmdab:d:etail} {cmd:nodata replace}]

{pstd}{it:filespec} specifies the file(s) to be examined or translated, typically 
datasets and do-files. 
You can specify single files or groups of files, for
example, {cmd:*.dta *.do} indicating all datasets and do-files in the current 
directory.

{pstd}{ul:Windows users}: Windows itself ignores case in filenames, but the
{cmd:unicode} commands are case-sensitive, even in Stata for Windows, and the 
commands described here inherit that behavior. For example, the file 
specifications {cmd:b*.dta} and {cmd:B*.dta} select different datasets.
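
{pstd}As a hedged illustration (the file patterns are hypothetical), both 
patterns can be given in one call to cover datasets whose names begin with 
either a lowercase or an uppercase b:{p_end}
{p 8}{cmd:. whichencoding b*.dta B*.dta}{p_end}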

{synoptset 20 tabbed}{...}
{synopthdr}
{synoptline}
{syntab:Main}
{synopt:{cmdab:e:ncoding()}}Extended ASCII encoding scheme for ASCII source or 
destination files. If not specified, the scheme recorded in the global macro 
{cmd:UnicodeEncoding} is used; if that macro is undefined, {cmd:Windows-1252}{p_end}
{synopt:{cmdab:v:ersion(#)}}Destination dataset version 12 or 13; 
version 13 if not specified{p_end}
{synopt:{cmdab:s:uffix()}}Add a suffix to ASCII source or destination files; 
{bf:_v12} or {bf:_v13} for datasets and {bf:_asc} for text files if not specified{p_end}
{synopt:{cmdab:d:etail}}Display details about encoding of each file{p_end}
{synopt:{cmd:nodata}}Do not analyze or translate contents of string variables{p_end}
{synopt:{cmd:replace}}Overwrite existing destination files{p_end}
{synoptline}


{title:Description}

{pstd}Stata version 14 uses Unicode encoding, and prior versions use ASCII 
encoding.  If you use plain ASCII characters only, for example if you write in 
English and avoid extended ASCII characters like é, ü, and ñ, you need not
read further.  Also, if you do not care whether individual characters are 
displayed incorrectly, you might decide not to spend more time on this issue.
Read more about ASCII and Unicode encoding in the Remarks section.

{pstd}If a Stata dataset or text file generated by Stata <14 contains extended ASCII 
characters (for example, é, ü, ñ), Stata 14+ can open the file, but the extended 
ASCII characters will not be displayed correctly.  Similarly, if a dataset or 
text file generated by Stata version 14+ contains Unicode encoded characters 
beyond plain ASCII, you can use {help saveold} to generate a file that can be 
used by Stata 11-13, but the Unicode characters will not be displayed correctly.

{pstd}{cmd:whichencoding} examines the occurrence of Unicode and extended ASCII 
characters in Stata datasets and text files like do-files, ado-files, help files
and log files.  This is useful to determine the need for translation when sharing 
Stata files between users or computers with different versions of Stata 
installed.  The official {cmd:unicode analyze} command serves the same purpose, 
but the output from {cmd:whichencoding} is more compact and transparent.

{pstd}{cmd:ascii2unicode} translates datasets and text files with extended ASCII
characters to Unicode encoding.  Files with plain ASCII characters only are not
translated.  The translated (Unicode) destination file takes the name of the 
source file, and the original ASCII source file is renamed by adding a suffix.  
The official {cmd:unicode translate} command serves the same purpose, but the 
output from {cmd:ascii2unicode} is more compact and transparent, and you have 
access to both a Unicode and an ASCII version of a dataset at the same time.

{pstd}{cmd:unicode2ascii} translates datasets and text files with Unicode
characters to ASCII encoding.  Files with plain ASCII characters only are not
translated.  In datasets, variable names, labels and label names (including labels
in different languages), string variable contents, and notes are translated. 
The source file keeps its name, and a suffix is added to the destination file name.
Currently (February 2016), no official Stata command serves the same purpose.


{title:Options}

{col 5}{bf:Commands and their options}
{col 5}{hline 62}
{col 39}{it:Command}
{col 18}{hline 49}
{col 5} {it:Option}       whichencoding    ascii2unicode    unicode2ascii
{col 5}{hline 62}
{col 5} encoding()                          +                +
{col 5} version()                                            +
{col 5} suffix()                            +                +
{col 5} detail             +                +                +
{col 5} nodata             +                +                +
{col 5} replace                             +                +
{col 5}{hline 62}

{phang}
{opt encoding}{bf:(}{it:encoding scheme}{bf:)} specifies the extended ASCII encoding 
scheme for ASCII encoded source or destination files. 
If the computer's encoding scheme is {cmd:Latin 2},
make it the default by including this command in {cmd:profile.do}:{p_end}
{p 12}{cmd:if c(stata_version)>=14 unicode encoding set Latin 2}

{p 8 8}This will create the global macro {cmd:UnicodeEncoding}, making 
{cmd:Latin 2} the default encoding scheme.  If the macro is not defined, 
the default encoding scheme will be {cmd:Windows-1252}.{p_end}
{p 8 8}(Commands in {cmd:profile.do} will be executed automatically at Stata 
start-up. Use the {cmd:sysdir} command to locate the {cmd:PERSONAL} folder and 
put {cmd:profile.do} there.)
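
{p 8 8}A minimal sketch for checking the setting, assuming Stata 14 or newer: 
the official {cmd:unicode encoding list} command displays the encoding names 
that Stata recognizes, and {cmd:display} shows the current contents of the 
global macro (nothing is shown if the macro is not defined):{p_end}
{p 12}{cmd:. unicode encoding list}{p_end}
{p 12}{cmd:. display "$UnicodeEncoding"}{p_end}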

{phang}
{opt version(#)} ({cmd:unicode2ascii}) specifies the Stata version (12 or 13) 
of the destination dataset; version 13 is the default.  (The diligent user may 
have noticed that Stata 14's help for {help saveold} claims that {cmd:saveold} 
can also save a version 11 dataset.  This is not quite accurate: It can save a 
version 12 dataset, which Stata 11 can read, but not a version 11 dataset, 
which Stata 10 would be able to read.)  

{phang}
{opt suffix()} specifies a suffix to be included in ASCII destination file 
names ({cmd:unicode2ascii}) or in modified ASCII source file names 
({cmd:ascii2unicode}).  Default suffixes are {bf:_v12} or {bf:_v13} for datasets
and {bf:_asc} for text files.

{phang}
{opt detail} displays details about each file to be analyzed or translated. It 
displays the output from {cmd:unicode analyze}, which can be quite verbose, so 
avoid using this option for more than a few files.

{phang}
{opt nodata} specifies that the contents of string variables should not be 
analyzed or translated.

{phang}
{opt replace} allows destination files to overwrite existing files with the 
same name.


{title:Examples: whichencoding}

{p 8}{cmd:. whichencoding filea.dta , detail}{p_end}
{p 8}{cmd:. whichencoding *.dta *.do}

{pstd}We examine three datasets and three do-files.  Two files contain plain 
ASCII characters only, two files contain extended ASCII encoded characters, 
and two files contain Unicode encoded characters.  Here is the output: 

{col 8}{hline 55}
{col 8}Directory: C:\Docs\Project D
{col 8}{hline 55}
{col 8}File name{col 38}Version  Encoding
{col 8}{hline 55}
{col 8}filea.dta{col 40}v13    Plain ASCII
{col 8}fileb.dta{col 40}v12    Extended ASCII
{col 8}filec.dta{col 40}v14    Unicode
{col 8}gen_filea.do{col 47}Plain ASCII
{col 8}gen_fileb.do{col 47}Extended ASCII
{col 8}gen_filec.do{col 47}Unicode
{col 8}{hline 55}
 
{pstd}The output displays the version of each dataset.  We also
learn that {cmd:filea.dta} and {cmd:gen_filea.do} contain plain ASCII characters 
only, so both Stata 13 and 14 can use them without any problems. 

{pstd}{cmd:fileb.dta} and {cmd:gen_fileb.do} contain extended ASCII characters.
They can be opened by Stata 14, but some characters will not be displayed  
correctly unless they are translated to Unicode with the official 
{cmd:unicode translate} command or the unofficial {cmd:ascii2unicode}
command (see below).
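
{pstd}For comparison, a sketch of the official route for {cmd:fileb.dta}, 
assuming the file was written with the {cmd:Windows-1252} encoding (an 
assumption; substitute the encoding actually used).  {cmd:clear} is included 
so that the sketch starts with no data in memory:{p_end}
{p 8}{cmd:. clear}{p_end}
{p 8}{cmd:. unicode encoding set Windows-1252}{p_end}
{p 8}{cmd:. unicode analyze fileb.dta}{p_end}
{p 8}{cmd:. unicode translate fileb.dta}{p_end}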

{pstd}{cmd:filec.dta} and {cmd:gen_filec.do} contain Unicode encoded characters.
If needed, we can use {cmd:saveold} to generate a version that Stata 11, 12, or 
13 can open, but some characters will not be displayed correctly.
Currently (February 2016), no official commands translate datasets from 
Unicode to extended ASCII, but the unofficial {cmd:unicode2ascii} command 
does (see below).
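
{pstd}A sketch of the {cmd:saveold} route, to be run in Stata 14 or newer.  The 
destination name {cmd:filec_old} is hypothetical, and this step does not 
translate characters beyond plain ASCII:{p_end}
{p 8}{cmd:. use filec.dta}{p_end}
{p 8}{cmd:. saveold filec_old , version(12)}{p_end}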


{title:Examples: ascii2unicode}

{p 8}{cmd:. ascii2unicode alpha.dta}{p_end}
{p 8}{cmd:. ascii2unicode *.dta *.do , encoding(Latin 2)}{p_end}
{p 8}{cmd:. ascii2unicode beta2.dta gen_beta2.do , detail replace}

{pstd}We examine six files: {cmd:filea.dta} and {cmd:gen_filea.do}, which contain 
plain ASCII characters only; {cmd:fileb.dta} and {cmd:gen_fileb.do}, which contain 
some extended ASCII characters; and {cmd:filec.dta} and {cmd:gen_filec.do}, 
which are Unicode encoded.  The files with plain ASCII characters and the Unicode 
encoded files need no translation; only the files with extended ASCII characters
are translated. Here is the output:

{col 5}{hline 72}
{col 5}Directory: C:\docs\project D
{col 5}{hline 72}
{col 5}Source files, modified names{col 43}Result files, Unicode compatible
{col 5}{hline 34}{col 42}{hline 35}
{col 5}filea.dta{col 40}=  filea.dta
{col 5}fileb_v12.dta{col 39}{hline 2}> fileb.dta
{col 5}filec.dta{col 40}=  filec.dta
{col 5}gen_filea.do{col 40}=  gen_filea.do
{col 5}gen_fileb_asc.do{col 39}{hline 2}> gen_fileb.do
{col 5}gen_filec.do{col 40}=  gen_filec.do
{col 5}{hline 72}
{col 5}{hline 2}> File translated; new source file name        = File not translated
{col 5}{hline 72}

{pstd}The result files in the rightmost column have the same names as the 
original source files; four of them are actually the same files, but two files 
were translated from ASCII to Unicode encoding.  The original source files for 
these two files had their names changed by including a suffix.
{cmd:fileb.dta} was a Stata version 12 dataset, hence the new name 
{cmd:fileb_v12.dta}. For do-files and other text files the default suffix is
{cmd:_asc}; unlike datasets, they do not belong to a specific Stata version.


{title:Examples: unicode2ascii}

{p 8}{cmd:. unicode2ascii alpha.dta}{p_end}
{p 8}{cmd:. unicode2ascii *.dta *.do , encoding(Latin 2)}{p_end}
{p 8}{cmd:. unicode2ascii beta2.dta gen_beta2.do , version(12) replace}

{pstd}We examine three datasets and three do-files that are to be used by Stata 
version 13, and translate them from Unicode to ASCII where necessary.  This is 
the output:

{col 5}{hline 70}
{col 10}Directory: C:\Docs\Project C
{col 5}{hline 70}
{col 10}Source files{col 43}Result files, ASCII compatible
{col 5}{hline 33}{col 42}{hline 33}
{col 5}v13: filea.dta{col 40}=  filea.dta
{col 5}v12: fileb.dta{col 40}=  fileb.dta
{col 5}v14: filec.dta{col 39}{hline 2}> filec_v13.dta
{col 5}asc: gen_filea.do{col 40}=  gen_filea.do
{col 5}asc: gen_fileb.do{col 40}=  gen_fileb.do
{col 5}unc: gen_filec.do{col 39}{hline 2}> gen_filec_asc.do
{col 5}{hline 70}
{col 5}{hline 2}> File translated; new result file name      = File not translated
{col 5}{hline 70}

{pstd}The first column lists the source files, the second column the result files.
Source dataset filenames are preceded by their version, and do-files and 
other text files are preceded by "{cmd:unc}" or "{cmd:asc}" to indicate their 
encoding.  If a translation is indicated, a suffix is added to the result file
name.


{title:Remarks}

{pstd}Prior to version 14, Stata used ASCII encoding of characters. 
The codes for plain ASCII are 0-127; for extended ASCII they are 128-255. 
There are several extended ASCII encoding schemes, for example, {cmd:Latin 1} and 
{cmd:Windows-1252} for Western European languages, {cmd:Latin 2} for some Central
and Eastern European languages, and {cmd:Windows-1251} for languages using the 
Cyrillic alphabet.
Thus, the same extended ASCII code may be displayed as different characters 
on computers using different encoding schemes. 

{pstd}From version 14, Stata uses Unicode encoding of characters.  This gives 
access to thousands of characters and symbols, including Arabic, Cyrillic, 
Chinese, and other scripts.  The Unicode {it:code point} is the number to use with 
the {help uchar()} function, but the underlying storage uses a more complex 
encoding (UTF-8) in which each character is represented by one to four 8-bit 
bytes.  For the characters represented in plain ASCII, the Unicode 
{it:code point} and the ASCII code are the same, and for the characters in the 
{cmd:Latin 1} extended ASCII encoding, the Unicode {it:code point} is likewise 
the same as the {cmd:Latin 1} code.
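
{pstd}A small illustration, assuming Stata 14 or newer: code point 233 is é in 
both {cmd:Latin 1} and Unicode, yet the character occupies two bytes in UTF-8.  
{cmd:strlen()} counts bytes and {cmd:ustrlen()} counts Unicode characters:{p_end}
{p 8}{cmd:. display uchar(233)}{p_end}
{p 8}{cmd:. display strlen(uchar(233))}{p_end}
{p 8}{cmd:. display ustrlen(uchar(233))}{p_end}
{pstd}The three commands should display é, 2, and 1, respectively.{p_end}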

{pstd}{cmd:whichencoding}, {cmd:ascii2unicode}, and {cmd:unicode2ascii}
utilize the official {cmd:unicode analyze} command. Currently (February 2016) 
{cmd:unicode analyze} avoids analyzing a file if a backup file indicates that a 
file with the same name has been analyzed previously.  To make sure that revised 
versions of already analyzed files are re-analyzed, {cmd:whichencoding} and
{cmd:ascii2unicode} rename the backup file for the file being analyzed. 
Backup files are located in the {cmd:bak.stunicode} and 
{cmd:bak.stunicode/status.stunicode} subdirectories of the current directory.
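
{pstd}When the official command is used directly, its {cmd:redo} option 
requests that a previously analyzed file be analyzed again.  A sketch, using a 
file name from the examples in this help file:{p_end}
{p 8}{cmd:. unicode analyze fileb.dta , redo}{p_end}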

{title:Authors}

{p 4 4 2}Svend Juul{break} 
Aarhus University{break}  
sj@ph.au.dk

{p 4 4 2}Morten Frydenberg{break}
Aarhus University{break} 
morten@ph.au.dk


{title:Also see}

{p 4 4 2}help for {help unicode} (Stata 14+ only){break}
help for {help saveold}