Definitions of some common and uncommon data terms:
Stands for American Standard Code for Information Interchange. This is the numeric representation of any kind of character in a file. Although not strictly correct, "ASCII" is often used interchangeably with "text" or "plain text". It simply means that the information in the file is not in any system or proprietary format. See also binary, text, raw data.
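As a minimal sketch of the "numeric representation" idea, the Python snippet below converts a short piece of text to its ASCII codes and back. The sample text is made up for illustration.

```python
# Each character in a plain-text file is stored as a number (its ASCII code).
text = "Yes 1"

# Character -> numeric code
codes = [ord(ch) for ch in text]
print(codes)  # [89, 101, 115, 32, 49]

# Numeric code -> character
restored = "".join(chr(code) for code in codes)
print(restored == text)  # True
```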
The way in which a computer actually stores the information in a file; at its most basic level, a series of ones and zeroes. Some older data files are stored in special binary formats such as "column binary" or "zoned decimal" to save storage space. SAS and SPSS are capable of reading some binary files; Stata is not. See also ASCII, text.
A group of eight bits in a computer file, each bit being a zero or a one. A kilobyte (KB) is 1,000 bytes; a megabyte is 1,000,000 bytes; a gigabyte is 1,000,000,000 bytes; a terabyte is 1,000,000,000,000 bytes. (Some software counts in powers of 1,024 instead, so its "kilobyte" is 1,024 bytes.)
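The arithmetic behind these units can be sketched as follows, assuming the decimal (powers of 1,000) definitions given in this entry. The helper name `to_bytes` is hypothetical.

```python
BITS_PER_BYTE = 8

# Decimal unit sizes, in bytes, as defined in this glossary entry.
UNITS = {
    "kilobyte": 10**3,
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,
}

def to_bytes(amount, unit):
    """Convert an amount in the given unit to a number of bytes."""
    return amount * UNITS[unit]

# One byte (eight bits) can hold 2**8 distinct values.
values_per_byte = 2 ** BITS_PER_BYTE
print(values_per_byte)          # 256
print(to_bytes(2, "megabyte"))  # 2000000
```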
Originally referred to a punch card, an outdated way of entering data into a computer. Today, it refers to a single line of data in a file. "Card-image" data files have two or more lines of data for each observation. See also record, observation.
A character is any letter, number, space, punctuation mark, or symbol that may be typed on a computer.
A document that describes in detail the data with which you are working. Although there is no standard format for a codebook, a good one has the full wording of the questions and answers, a list of all the codes or values used to enter the data, and the begin and end columns or the begin column and the length of the variable. See also dictionary, data definition statements.
Also called a "CSV" file, this is a raw data file in which the variables are separated by commas. CSV files are often used to move a data file from one software package, such as Excel, to another, such as Stata. See also delimiter, tab-delimited file, fixed format (file), free format (file).
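A minimal sketch of reading such a file with Python's standard `csv` module is shown below. The variable names and values are invented, and the file contents are held in a string so the example is self-contained.

```python
import csv
import io

# Hypothetical raw data: the first line holds variable names,
# and each later line holds one observation.
raw = "id,age,answer\n1,34,Yes\n2,27,No\n"

# DictReader splits each line on commas and pairs values with the names.
rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    print(row["id"], row["age"], row["answer"])
```

Note that every value comes back as a string; converting `age` to a number is a separate step.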
Program code that tells the computer in which columns and lines each variable can be found. In SAS, this is the "input" statement; in SPSS it is the "data list" statement; in Stata it is an "infix" or "infile" dictionary. See also codebook, dictionary.
A character or characters used to separate variables in a raw data file. The most common delimiters are commas, tabs, and blank spaces. The choice of delimiter depends on whether commas or blank spaces appear in the values of any of the variables. If they do, one must use some other character as the delimiter, because the computer cannot distinguish the characters used as delimiters from those that are part of actual values. See also comma-separated values; tab-delimited (file); fixed format (file); free format (file).
There are two types of dictionaries: a data dictionary and a Stata dictionary. A data dictionary is a document that lists all the variables and their locations (columns) in the data file and, sometimes, the values for those variables. A data dictionary can be used with any statistical package. A Stata dictionary is a program or file that Stata uses to read a raw data file; the information in this program can be obtained from either a data dictionary or a codebook. See also codebook; data definition statements.
A filename extension, or simply extension, is the last part of a filename, the part that comes after the final ".". Some common extensions are ".sav" for SPSS files, ".csv" for a comma-separated values file, and ".doc" for Microsoft Word files.
A term typically used in database management, a field is the same as a variable.
A raw data file in which each variable occupies the same column or columns on each line for each observation. Fixed format files typically do not have any delimiters between variables. See also free-format (file), comma-separated values, tab-delimited (file).
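Reading a fixed format file amounts to cutting each line at the columns listed in the codebook. The sketch below assumes a hypothetical three-variable layout and made-up data lines; column numbers are 1-based and inclusive, as they usually appear in a codebook.

```python
# Hypothetical codebook layout: variable -> (begin column, end column).
layout = {"id": (1, 3), "age": (4, 5), "sex": (6, 6)}

# Made-up fixed format data: no delimiters between variables.
lines = ["001341", "002272"]

def read_fixed(line, layout):
    """Slice one line into variables using 1-based, inclusive columns."""
    return {name: line[begin - 1:end] for name, (begin, end) in layout.items()}

records = [read_fixed(line, layout) for line in lines]
print(records[0])  # {'id': '001', 'age': '34', 'sex': '1'}
```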
A raw data file in which each variable may occupy different columns on each line for each observation. Free format files must have some type of delimiter, usually spaces, between variables. See also fixed-format (file), comma-separated values, tab-delimited (file).
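By contrast, a free format file can be read by splitting each line on its delimiter, since column positions carry no meaning. A minimal sketch, assuming space-delimited data and invented variable names:

```python
# Made-up free format data: values are separated by one or more spaces,
# so a variable's column position differs from line to line.
lines = ["1 34 1", "2   27  2"]
names = ["id", "age", "sex"]

# split() with no argument treats any run of whitespace as one delimiter.
records = [dict(zip(names, line.split())) for line in lines]
print(records[1])  # {'id': '2', 'age': '27', 'sex': '2'}
```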
Sometimes a data file will have different types or levels of information on different records. An example is when a survey collects information about a household, each family within that household, and each person within each family. The different types of variables will be on different records within the file. See also rectangular, flat.
The number of columns (or characters) a variable in a raw data file will occupy. Sometimes this also means the number of bytes the same variable will take in a system or binary file. See also record length.
Sometimes there is more than one record or line of data for each observation in a raw data file. Each record has a different set of variables on it, so each record must be read differently. See also hierarchical; flat.
These are numbers, plain and simple. Decimal points and minus signs are the only non-digit characters allowed. See also string (variable).
An observation is a unit of analysis - a respondent in a survey, for example. See also record.
Data storage format used by the ancient Egyptians. No, just kidding; it isn't quite that old. Osiris is a storage format where the data and the dictionary are in binary format. Many older datasets available from ICPSR are in Osiris format; SAS and SPSS are capable of reading Osiris files.
Raw data are data that have not been read into a software package like SAS or Stata. If you were to open a raw data file you would see numbers and, perhaps, letters. You need a codebook or data dictionary to be able to read the data. Raw data files are sometimes called "text" or "ASCII" files. See also text; ASCII; system (file).
A record is one line in a data file. Sometimes there is more than one record per observation or there are different types of records in a single file. Records are sometimes referred to as "cards." See also observation; card.
Literally, the length of a record in a data file. This is measured in columns for raw data files and in bytes for system or binary files. This information is necessary for files that have more than 256 columns on one record as the total length must be specified in the data definition statement. Often abbreviated as "lrecl". Also referred to as "logical record length." See also length.
A data file in which there is one line or record of data for each observation. To "rectangularize" a file means to convert a multiple-record file or hierarchical file to this format. See also hierarchical; flat.
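To illustrate "rectangularizing", the sketch below merges a multiple-record file into one row per observation. The layout is hypothetical: columns 1-3 hold the observation id, column 4 holds the record (card) number, and the remaining values are space-delimited.

```python
from collections import defaultdict

# Made-up multiple-record file: two records per observation.
lines = ["0011 34 1", "0012 52000", "0021 27 2", "0022 41000"]

# Hypothetical layout: which variables appear on which record number.
card_vars = {"1": ["age", "sex"], "2": ["income"]}

rows = defaultdict(dict)
for line in lines:
    obs_id, card, values = line[0:3], line[3], line[4:].split()
    rows[obs_id]["id"] = obs_id
    # Add this record's variables to the observation's single row.
    rows[obs_id].update(zip(card_vars[card], values))

rectangular = list(rows.values())
print(rectangular[0])
# {'id': '001', 'age': '34', 'sex': '1', 'income': '52000'}
```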
A string variable is one that has letters and/or numbers as opposed to just numbers. An example would be a person's name. Numbers can be treated as strings, but strings cannot be treated as numbers. Strings are also referred to as "character" or "alphanumeric" variables. See also numeric.
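The one-way relationship between numbers and strings can be sketched in Python as follows. The helper `to_number` is hypothetical and returns `None` when a string cannot be read as a number.

```python
def to_number(value):
    """Try to read a string as a number; return None if it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None

# A number can always be written out as a string...
as_text = str(42)

# ...but only some strings can be read back as numbers.
print(to_number("42"))    # 42.0
print(to_number("-3.5"))  # -3.5
print(to_number("Jane"))  # None
```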
A raw data file that has tab characters to separate the variables. Tab-delimited files are often used to convert data from one software package such as Excel to another such as SPSS. Tab-delimited files are useful when variables can have commas or spaces as part of their values (e.g., a person's name). See also delimiter; comma-separated values; fixed format (file); free-format (file).
Value labels assign words to numeric values in a data file. Labels are used simply to make the output easier to read. Instead of printing "1", "2", "3", the computer will print "Yes", "No", "Maybe". See also variable label.
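In code, value labels behave like a lookup table from numeric codes to words. A minimal sketch using the codes from this entry and made-up survey answers:

```python
# Value labels: numeric code -> word shown in the output.
value_labels = {1: "Yes", 2: "No", 3: "Maybe"}

# Made-up answers as stored in the data file.
answers = [1, 3, 2, 1]

# Fall back to the raw code when no label has been defined for it.
labeled = [value_labels.get(code, code) for code in answers]
print(labeled)  # ['Yes', 'Maybe', 'No', 'Yes']
```

The stored data stay numeric; only the printed output changes.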
A variable label is a short description of a variable. Variable labels are not required, but they make the output easier to understand. See also value label.