Downloading Data From The Web
Websites provide data in several formats: ASCII, SAS, SPSS, Stata and others. Not all data are available in every format, so choose the one that best suits your needs. Sometimes the data you want are not available in the format you want. Here are some tips:
- Data and related files are often bundled together and compressed into a single archive, commonly called a "zip" file. Common file extensions are ".zip" on Windows and ".gz", ".tar" or ".tar.gz" on Unix (strictly speaking, ".tar" bundles files without compressing them, and ".gz" compresses a single file). You will need to unpack the files before you can do anything else with them. WinZip is a good Windows program; on Unix, use the "gunzip" command for ".gz" files and the "tar" command for ".tar" files.
- If there is an ASCII (i.e., plain text) data set with a program file (often called a "setup" file) for the statistical package you intend to use, then select that option. Sometimes there is an option for data files already in the format you want ("system," "portable," "transport"), but these may have some "glitches" due to differences between the type of machine they were created on and the type you are using. It's rare, but it does happen.
- If you are downloading data from a geospatial data site, the file may be in "dBase" format and have an extension of ".dbf". This is the format used by ArcView (a "shape" file is actually a set of files, one of which is a .dbf file). These files can be read directly into SAS, Stata, SPSS and Excel.
- The setup files are written to read the entire data set, which you may not need. Rather than editing the program to read only the variables/observations you want, let the program read the entire data set, then just add drop and/or keep statements in the appropriate place to retain what you want. Make absolutely sure that you select all identification and weighting variables. If you are not sure if you want a particular variable, keep it. It's easier to ignore or drop a variable later than it is to go back and add it to your dataset.
- Sometimes the programs have large sections "commented out" so those statements are not executed. If you do want these statements to be executed, then be sure to un-comment them. Typically, these are statements to convert missing value codes (such as "999") to system-missing codes.
- If possible, run some descriptive statistics on your data and compare them to the codebook or some other source to make sure you have read the data correctly.
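Several of the tips above (unpack the archive, keep only the variables you need, convert missing value codes, then check descriptive statistics) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a replacement for the setup file that comes with your data. The file name "survey.csv", the variable names "id", "weight" and "age", and the missing value code "999" for "age" are all hypothetical; take the real ones from your codebook.

```python
import csv
import io
import statistics
import zipfile

# Hypothetical names: replace with the variables listed in your codebook.
KEEP = ["id", "weight", "age"]      # always keep ID and weighting variables
MISSING = {"age": {"999"}}          # assumed missing value codes, per variable


def read_from_zip(archive, member):
    """Read one CSV file out of a zip archive and return its rows as dicts."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="ascii", newline="")
            return list(csv.DictReader(text))


def keep_and_recode(rows, keep=KEEP, missing=MISSING):
    """Read everything first, then keep selected variables and recode
    missing value codes to None (the "system-missing" stand-in here)."""
    cleaned = []
    for row in rows:
        cleaned.append({
            name: None if row[name] in missing.get(name, set()) else row[name]
            for name in keep
        })
    return cleaned


def describe(rows, name):
    """Return n, min, max and mean of a numeric variable, skipping missing."""
    values = [float(r[name]) for r in rows if r[name] is not None]
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }


if __name__ == "__main__":
    # Build a tiny zip archive in memory so the sketch runs on its own.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr(
            "survey.csv",
            "id,weight,age,income\n1,1.5,34,100\n2,0.8,999,200\n3,1.2,51,300\n",
        )
    rows = keep_and_recode(read_from_zip(buffer, "survey.csv"))
    print(describe(rows, "age"))
```

Compare the numbers that describe() reports with the codebook. If the count, minimum or maximum do not match, the data were probably not read correctly or a missing value code was not converted.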
Image source: Porter Novelli Global. Reading RUFUS data with yED. CC BY-SA 2.0. Flickr.