Title
Makes retrievals from GSOEP real easy
Syntax
    soepuse varnames using dirname , mandatory_options [ soepuse_options joint_options ]
    soepadd varnames , mandatory_options [ joint_options ]
dirname refers to the name of the directory in which the GSOEP files are stored. varnames refers to variable names of the GSOEP. Note: You cannot specify varnames in terms of a varlist.
options                 Description
-------------------------------------------------------------------------
Mandatory options
  ftyp(filetype)        type of SOEP file (h, p, pgen, etc.)
  waves(numlist)        waves to be used

Soepuse_options
  design(designtype)    Design; default: design(balanced)
  keep(varlist)         Keep variables from ppfad
  clear                 Replace data in memory

Joint options (seldom used)
  ost(g|h)              Request special files for east
  onlyost               Use only special files
  oldnetto              Use old design of netto variables
  uc                    Variable list is upper case
  fast                  Speed up (not much)
-------------------------------------------------------------------------
Description
soepuse and soepadd are two little tools for performing retrievals from the German Socio-Economic Panel. The programs are revisions of two older programs with the same purpose, mkdat and holrein. Both programs provide the same functionality as their predecessors, but with a simplified syntax.
The programs create SOEP datasets containing the variables named in varnames. soepuse generates a new file, and soepadd merges further variables into a file generated with soepuse. By default, the created files have a balanced panel design, but various other designs can be specified.
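To make the design choices concrete, here is a toy illustration of what design(balanced), design(#) and design(any) (see Options) select. It is only a sketch using built-in Stata commands: the data are invented, and the indicators w1991 to w1993 (1 = interviewed, 0 = not interviewed) are hypothetical stand-ins, not the GSOEP netto variables.

```stata
* Toy illustration of design(); NOT GSOEP data.
* w1991-w1993 are hypothetical indicators: 1 = interviewed, 0 = not.
clear
input id w1991 w1992 w1993
1 1 1 1
2 1 0 1
3 0 0 1
end

* Number of requested waves in which each respondent was interviewed
egen n_int = rowtotal(w1991 w1992 w1993)

count if n_int == 3     // kept under design(balanced): interviewed in all waves
count if n_int >= 2     // kept under design(2): at least two interviews
count                   // kept under design(any): all available observations
```

In this toy file the balanced design keeps one respondent, design(2) keeps two, and design(any) keeps all three.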
soepuse and soepadd both require that the variables to be loaded are specified in the order of the item correspondence list and that all variables belong to the same file type. Here is an example: To combine individual gross and net income variables with household income using the waves of 1991 and 1992, you would specify
    . soepuse hp5401 ip5401 hp5402 ip5402 using ~/data/gsoep24, f(p) w(1991 1992)
    . soepadd hh49 ih49, f(h) w(1991 1992)
or, in a layout that better shows the required structure of the variable list:
    . soepuse hp5401 ip5401
              hp5402 ip5402
              using ~/data/gsoep24, f(p) w(1991 1992)
soepuse and soepadd are designed to be used together with SOEPinfo. Suppose you have searched the GSOEP database with SOEPinfo for information on political interest and party identification from 1984 to 1989. Having found that information, you have stored SOEPinfo's item correspondence list in a file, which looks like this:
-----------------------------------------------------------
1984    |1985    |1986    |1987    |1988    |1989
-----------------------------------------------------------
Politik  Politisches Interesse
-       |BP75    |CP75    |DP84    |EP73    |FP89
Politik  Allgemeine Parteienpraeferenz
AP5601  |BP7901  |CP7901  |DP8801  |EP7701  |FP9301
Politik  Parteienidentifikation
AP5602  |BP7902  |CP7902  |DP8802  |EP7702  |FP9302
After cutting pipes and headings, and changing uppercase to lowercase you end up with
    -      bp75   cp75   dp84   ep73   fp89
    ap5601 bp7901 cp7901 dp8801 ep7701 fp9301
    ap5602 bp7902 cp7902 dp8802 ep7702 fp9302
which is the structure required by soepuse and soepadd. Take care not to erase the - sign for the missing variable name in the first row. soepuse and soepadd need it as a placeholder whenever a variable is missing in the item correspondence list.
The entire soepuse command to load all variables of the example becomes
    . soepuse -      bp75   cp75   dp84   ep73   fp89
              ap5601 bp7901 cp7901 dp8801 ep7701 fp9301
              ap5602 bp7902 cp7902 dp8802 ep7702 fp9302
              using ~/data/gsoep24, f(p) w(1984/1989)
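In a do-file, a command spread over several lines needs line continuation. The following sketch keeps the grid layout with /// and, before calling soepuse, checks that the number of names is a multiple of the number of waves. That check is only a sanity test suggested here, not something soepuse is documented to require, and it assumes that no ost() files are involved. The sketch also assumes that soepuse is installed and that the data are stored in ~/data/gsoep24, as in the example above.

```stata
* Do-file sketch: hold the grid in a local macro, check its shape, then load.
* The /// continuation works in do-files only, not in the Command window.
local vars                                          ///
    -      bp75   cp75   dp84   ep73   fp89         ///
    ap5601 bp7901 cp7901 dp8801 ep7701 fp9301       ///
    ap5602 bp7902 cp7902 dp8802 ep7702 fp9302

* One name (or "-") per wave in every row: 18 names over 6 waves gives 3 rows.
numlist "1984/1989"
local nwaves : word count `r(numlist)'
local nvars  : word count `vars'
assert mod(`nvars', `nwaves') == 0

* Assumes soepuse is installed and the GSOEP files are in ~/data/gsoep24.
soepuse `vars' using ~/data/gsoep24, ftyp(p) waves(1984/1989)
```

If a placeholder was erased by accident, the assert stops the do-file before any data are read.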
Options
ftyp(string) specifies the type of the GSOEP datasets in which the variables to be loaded are stored. As it stands, this can be any of the following types. Note that you can only specify one file type at a time. Use soepadd to add variables from further file types.
---------------------------------------------------
h         Household data
hbrutto   Gross information on household
hgen      Household data, generated variables
kind      child information
p         Person data
pausl     Person files for foreigners
pbrutto   Gross information on persons
pequiv    PSID equivalence files
pgen      Person data, generated variables
pkal      Person calendar files
pluecke   Retrospective question to fill gaps
---------------------------------------------------
waves(numlist) specifies the waves from which the variables are to be taken. waves(1984/2002) is used if the variable names correspond to files for all waves from 1984 to 2002. Likewise, waves(1985(5)2005) is used if the variable names correspond to the waves of 1985, 1990, ..., 2005. See help numlist for the various ways to specify the list of waves.
design(designtype) specifies the design of the dataset to be created. design(balanced) creates a balanced panel design, i.e. the data will contain only observations interviewed in all requested waves. design(any) keeps all available observations in the dataset. design(#), with # being a positive integer, creates datasets with respondents interviewed at least # times. With design(any) and design(#), the netto variables from ppfad are retained in the dataset for further fine-tuning of the design.
clear specifies that it is okay to replace the data in memory, even though the current data have not been saved to disk.
ost(g|h) must be specified if your list of variable names contains names from the specialized files for East Germany of the years 1990 and 1991. Specify ost(g) if you have used names from gpost or gpkalost, ost(h) if you have used names from hpost or hpkalost, and ost(g h) if you have used specialized East files of both waves.
onlyost must be specified if your list of variable names contains only names from the specialized files for East Germany.
oldnetto must be used if you are working with an old version of the GSOEP database, i.e. with a version where the variables anetto, bnetto ... znetto in the dataset ppfad have the value 1 for interviews.
uc must be used if the variable list is in upper case. This is helpful if you don't have a decent text editor that is capable of lowercasing the uppercase variable names from SOEPinfo.
fast speeds up the retrieval. By default, soepuse and soepadd do some extra work to check whether the variable names make sense. This helps when debugging lengthy lists, but takes some time, especially if the GSOEP data are stored on a slow network drive. Option fast bypasses the additional check of variable names.
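For the waves() option, it can help to see which years a numlist expands to before starting a retrieval. A minimal sketch using Stata's built-in numlist command, which stores the expanded list in r(numlist); the two specifications are the ones mentioned under waves() above:

```stata
* Expand wave specifications with Stata's built-in numlist command.
numlist "1984/2002"
display "`r(numlist)'"      // every year from 1984 to 2002

numlist "1985(5)2005"
display "`r(numlist)'"      // 1985 1990 1995 2000 2005
```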
Example(s)
Constructing Longitudinal Individual Records
    . soepuse gp109 zp6401 hp10901 ip10901 jp10901 using ., ost(g) w(1990/1993) f(p)
Linking Household Data to Individuals
    . soepuse hp5401 hp5402 using ., w(1991) f(p)
    . soepadd hh48, w(1991) f(h)
Linking Household Data to Individuals Across Waves
    . soepuse hp5401 ip5401 hp5402 ip5402 using ., w(1991/1992) f(p)
    . soepadd hh48 ih49, w(1991/1992) f(h)
Household Level Variables from Individual Data
    . soepuse hp07 hp15 using ., f(p) w(1991)
    . gen ft=1 if hp15==1
    . gen pt=1 if hp15==2
    . gen unemp=1 if hp07==1
    . gen noinf=1 if hp15==9
    . collapse (count) n_ft=ft n_pt=pt n_unemp=unemp n_noinf=noinf (mean) hhnr=hhnr, by(hhnr)
    . soepadd htyphh1 htyphh2, w(1991) f(hgen)
Creating longitudinal data from waves 1984-2005 with variables from different sources
    . soepuse
        afamstd bfamstd cfamstd dfamstd efamstd ffamstd gfamstd hfamstd ifamstd jfamstd kfamstd lfamstd mfamstd nfamstd ofamstd pfamstd qfamstd rfamstd sfamstd tfamstd ufamstd vfamstd
        egp84 egp85 egp86 egp87 egp88 egp89 egp90 egp91 egp92 egp93 egp94 egp95 egp96 egp97 egp98 egp99 egp00 egp01 egp02 egp03 egp04 egp05
        using . , ftyp(pgen) waves(1984/2005) design(3) keep(sex gebjahr) clear
    . soepadd
        ap6801 bp9301 cp9601 dp9801 ep89 fp108 gp109 hp10901 ip10901 jp10901 kp10401 lp10401 mp11001 np11701 op12301 pp13501 qp14301 rp13501 sp13501 tp14201 up14501 vp154
        , ftyp(p) waves(1984/2005)
    . soepadd
        i1110284 i1110285 i1110286 i1110287 i1110288 i1110289 i1110290 i1110291 i1110292 i1110293 i1110294 i1110295 i1110296 i1110297 i1110298 i1110299 i1110200 i1110201 i1110202 i1110203 i1110204 i1110205
        e1110184 e1110185 e1110186 e1110187 e1110188 e1110189 e1110190 e1110191 e1110192 e1110193 e1110194 e1110195 e1110196 e1110197 e1110198 e1110199 e1110100 e1110101 e1110102 e1110103 e1110104 e1110105
        , ftyp(pequiv) waves(1984/2005)
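When variable names follow a regular pattern, as the famstd and egp variables in the last example do, the long lists can be built in a loop instead of typed by hand. The sketch below relies on two patterns read off that example: a wave letter prefix (a for 1984 up to v for 2005) and a two-digit year suffix. The macro names famstd, egp, letter and yy are chosen here for illustration. The final line assumes that soepuse is installed and that the GSOEP files are in the working directory.

```stata
* Build "afamstd ... vfamstd" and "egp84 ... egp05" for the waves 1984-2005.
local famstd
local egp
forvalues y = 1984/2005 {
    local letter = char(97 + `y' - 1984)     // 97 is "a": a = 1984, ..., v = 2005
    local yy     = substr("`y'", 3, 2)       // two-digit year: 84, ..., 05
    local famstd `famstd' `letter'famstd
    local egp    `egp' egp`yy'
}
display "`famstd'"
display "`egp'"

* Same retrieval as in the example above; requires soepuse and the GSOEP data.
soepuse `famstd' `egp' using ., ftyp(pgen) waves(1984/2005) design(3) keep(sex gebjahr) clear
```

Irregular names, such as the p-file variables in the first soepadd call, still have to be copied from the item correspondence list.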
Note
soepuse and soepadd are two little unambitious helper programs. A far more advanced Stata program for working with the GSOEP and many other panel datasets is PanelWhiz by John Haisken-DeNew.
Author
Ulrich Kohler, WZB, kohler@wzb.eu
Also see
Online: soepren (if installed), rgroup (if installed), soepdo (if installed)