Social Sciences Data Collection (SSDC)

SSDC Data File Structure in a Nutshell

If you (1) have a basic understanding of SSDC data file structure, (2) know about SSDC data file formats, and (3) can locate SSDC data documentation, you can read our data like a book. You can read our data because most SSDC data use the ASCII character set and data file records are delimited. ASCII is often referred to as "plain text" and can be displayed by any application that can read a text file (browsers, Notepad, WordPad, Word, etc.). An SSDC ASCII record typically contains alphanumeric characters, blanks, and something that you can't see: a record delimiter.

Let's look at one record in a data file with a record length of 28 characters or "columns". You can use the ruler that is above the data file record to count the columns of ASCII characters.
--------------------------------------------------------------------------------
[Ruler:]10        20        30        40        50        60        70        80
123456789*123456789*123456789*123456789*123456789*123456789*123456789*123456789*
--------------------------------------------------------------------------------
3000102291 212111999 6231291

If our data file has two records, it might look like this:

3000102291 212111999 6231291
3000202291 212111999 5731291

or like this:

3000102291 212111999 62312913000202291 212111999 5731291

The two data files display differently because the first file uses a "record delimiter" (an ASCII linefeed, often called a newline) that causes each record to be displayed on a separate line. The second data file does not have a record delimiter, so the two records are displayed on one continuous line. Record delimiters are useful because statistical software applications (like SPSS) use them to determine the record length of each record in our data files.
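As a minimal sketch of why the delimiter matters, the Python below recovers the same two records from both layouts shown above. The function name `split_records` is ours for illustration and is not part of any SSDC tool; the 28-column record length comes from the example record.

```python
RECORD_LENGTH = 28  # columns per record in the example above


def split_records(text, record_length=RECORD_LENGTH):
    """Return the records in text, with or without linefeed delimiters."""
    if "\n" in text:
        # Delimited file: each linefeed marks the end of one record.
        return [line for line in text.split("\n") if line]
    # Undelimited file: the only way to find a record boundary is to
    # know the record length in advance and cut at every multiple of it.
    return [text[i:i + record_length]
            for i in range(0, len(text), record_length)]


delimited = "3000102291 212111999 6231291\n3000202291 212111999 5731291\n"
undelimited = "3000102291 212111999 62312913000202291 212111999 5731291"

print(split_records(delimited))
print(split_records(undelimited))
```

Both calls print the same two 28-column records. Without the delimiter, the program has to be told the record length; with it, the length can be read from the file itself.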

SSDC Data File Structure:

Once you come to terms with "unit of analysis", understanding data file structures is simple.

Unit of Analysis - The basic observable entity being analyzed by a study. It consists of one or more physical records in a data file. It is also called a logical record (all physical records for a given unit of analysis), a case (all physical records for a person) or a record type (used in data files that have more than one unit of analysis).

An outline of SSDC data file structures looks like this:

  1. Logical Records (units of analysis, cases or record types)
    1. Physical Records (may be cards or decks if Card-Image data)
      1. Variables
        1. Variable Labels
        2. Variable column locations
        3. Variable Values
      2. Record Length
      3. Record Delimiters

Visualize a tree. The trunk is the logical record, the branches are the physical records and the leaves are the variables. Here is an example of an ASCII rectangular (or flat) data file. Each line of data (physical record) is one unit of analysis (logical record). Each line of data has the same record length (80 columns).

        

Here is an example of an ASCII card-image data file. There is one unit of analysis (case) and more than one physical record (card or deck) for each case. All of the components of the data file are labeled.

        

The last piece of the puzzle is the set of variable value labels. Codebooks translate all the column values (numbers) to labels (text).

SSDC Data File Formats:

Rectangular Data Format - One logical record (unit of analysis). One physical record for each logical record. The General Social Survey is rectangular; for each person (case) there is one physical record. The logical record length is equal to the physical record length. The majority of SSDC data files are rectangular.

Card-Image Data Format - One logical record (unit of analysis). May be more than one physical record for each logical record or case. Each physical record is limited to 80 columns of data and is called a deck or card. The number of physical records in each logical record is often noted as "cards per case". The Field (California) Polls are card-image; for each person (case) there are one or more physical records (cards or decks). Cards or decks are usually numbered.

Hierarchical Data Format - One logical record (unit of analysis). More than one physical record for each logical record. The Census of Population and Housing is hierarchical; for each housing unit (logical record) there is more than one physical record (household record + household person records).

Relational Data Format - More than one logical record type (units of analysis). Can be organized in different ways using logical record type relationships. The Survey of Income and Program Participation data files are relational; there are record types for household, family, person, wage and salary job, and general income amounts.
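To make the card-image idea concrete, here is a small Python sketch that groups physical records (cards) into logical records (cases). It assumes the cards are stored in order, that every case has the same number of cards, and that no card is missing; the function name `group_cards` and the sample strings are hypothetical and are not taken from any SSDC file.

```python
def group_cards(records, cards_per_case):
    """Group consecutive physical records (cards) into cases."""
    if cards_per_case < 1:
        raise ValueError("cards_per_case must be at least 1")
    if len(records) % cards_per_case != 0:
        # A leftover card means a case is incomplete or the count is wrong.
        raise ValueError("record count is not a multiple of cards per case")
    return [records[i:i + cards_per_case]
            for i in range(0, len(records), cards_per_case)]


# Hypothetical file: two cases with three cards each.
cards = [
    "case 1 card 1", "case 1 card 2", "case 1 card 3",
    "case 2 card 1", "case 2 card 2", "case 2 card 3",
]

cases = group_cards(cards, cards_per_case=3)
print(len(cases))   # number of logical records (cases)
print(cases[1][0])  # first card of the second case
```

A rectangular file is the special case where `cards_per_case` is 1, so each physical record is its own logical record.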

SSDC Data File Documentation:

SSDC study descriptions document data structure, format and record lengths, and include references to codebooks, SAS and SPSS control cards, and data dictionaries. The codebooks, control cards and dictionaries document the variable names, labels, values and column locations. Always start with and use the file format paragraph in the SSDC study descriptions. SSDC file formats take precedence over formats specified in codebooks: the SSDC staff probably translated that packed zone decimal file specified in the codebook to ASCII and added newline delimiters. Next, read the codebook paragraph, then locate and read the codebook.

From the file format paragraph for the Field (California) Polls, you know that the data file format is Card-Image ASCII, that the record length (card or deck length) is either 80 columns (81 with the line feed) or varies, and that there are newline record delimiters.

FILE FORMAT = Card-Image ASCII 
NUMBER OF CASES = Varies, please see the codebooks for specific information 
CARDS PER CASE= Varies, please see the codebooks for specific information 
     TIP: Whenever you search the codebooks for a topic, the number of cases
          and cards per case are specified for each data file retrieved
RECORD DELIMITER = Line Feed 
RECORD LENGTH = 1956 - 1996 no.4: 81 (80 + line feed); 1996 no.5 -
                current data file: varies (newline delimited) 
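The "81 (80 + line feed)" arithmetic can be checked with a few lines of Python. This is only a sketch: the function name `check_record_lengths` and the sample bytes are hypothetical, and it assumes a file whose records are each 80 data columns followed by a single line feed.

```python
def check_record_lengths(data, data_columns=80):
    """Return the 1-based numbers of records that are not data_columns long."""
    records = data.split(b"\n")
    if records and records[-1] == b"":
        records.pop()  # drop the empty piece after the final line feed
    return [number for number, record in enumerate(records, start=1)
            if len(record) != data_columns]


# One well-formed record: 80 data columns plus the line feed is 81 bytes.
one_record = b"A" * 80 + b"\n"
print(len(one_record))

# Hypothetical three-record file; the second record is one column short.
sample = b"A" * 80 + b"\n" + b"B" * 79 + b"\n" + b"C" * 80 + b"\n"
print(check_record_lengths(sample))
```

For the files described as "varies (newline delimited)", a fixed-length check like this does not apply; the line feed is then the only reliable record boundary.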

If you browse the codebook, you can find the:

  1. unit of analysis = person (case)
  2. number of cases = 1219 (persons interviewed)
  3. number of records for each unit of analysis = 6 (cards per case or physical records per person)

UNIVERSE:    CALIFORNIA REGISTERED VOTERS
INTERVIEWING PERIODS: OCTOBER 22-NOVEMBER 1, 1998
METHOD OF INTERVIEW:  TELEPHONE
NUMBER OF CASES: 1,219
CARDS PER CASE:   6

Then you can browse more until you locate information about the variables in each card or physical record:

Name     Position  Label
Q110A    130       RESPONDENT/HOUSEHOLD OWN: COMPUTER
Q110B    131       RESPONDENT/HOUSEHOLD OWN: RIFLE/SHOTGUN
Q110C    132       RESPONDENT/HOUSEHOLD OWN: ANSWER MACHINE
Q110D    133       RESPONDENT/HOUSEHOLD OWN: PISTOL/REVOLVR
Q111A    134       REGULAR COMPUTER USER

Now you know the names of some of the variables, the variable positions and the variable labels. If you examine variable name Q111A in detail, you can read the variable question and determine the physical record number (deck or record 4), the column location (57) and the variable values (1, 2, 8 or 9). You can also find the labels (yes, no, don't know, not applicable) for each value.

VARIABLE 134 REGULAR COMPUTER USER    DECK 4/57

Q111A. DO YOU HAVE OCCASION TO USE A PERSONAL COMPUTER OR COMPUTER
WORKSTATION ON A REGULAR BASIS EITHER AT HOME, AT WORK OR AT
SCHOOL?

VALUE LABEL                 VALUE  N OF CASES
-----------                 -----  ----------
YES, COMPUTER USER              1       732
NO, NON-USER                    2       278
DON'T KNOW                      8         2
NOT APPLICABLE*                 9       207
                                     -------
                            TOTAL      1219
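Putting the pieces together, the sketch below reads Q111A (deck 4, column 57) from one case and translates the value to its codebook label. The deck, column, values and labels come from the codebook excerpt above; the function name `read_variable` and the sample case (six otherwise blank 80-column cards) are made up for illustration, and the code assumes the cards of a case are stored in deck order.

```python
# Value labels for Q111A, copied from the codebook excerpt above.
Q111A_LABELS = {
    "1": "YES, COMPUTER USER",
    "2": "NO, NON-USER",
    "8": "DON'T KNOW",
    "9": "NOT APPLICABLE*",
}


def read_variable(case, deck, column, width=1):
    """Return the characters at a 1-based deck and column of one case."""
    card = case[deck - 1]            # codebook decks are numbered from 1
    start = column - 1               # codebook columns are numbered from 1
    return card[start:start + width]


# Hypothetical case: six blank cards, with a "1" in deck 4, column 57.
case = [" " * 80 for _ in range(6)]
case[3] = case[3][:56] + "1" + case[3][57:]

value = read_variable(case, deck=4, column=57)
print(value, "=", Q111A_LABELS[value])
```

The only subtle step is the off-by-one conversion: codebooks count decks and columns from 1, while Python indexes from 0.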

For more detailed information about data file structure and SPSS, see INTRODUCTION TO DATA HANDLING, from the University of Chicago Social Science Research Computing unit.


Official Web Page of the University of California, San Diego
© Copyright 2000, UCSD, All Rights Reserved. This site may not be reproduced.
Social Sciences & Humanities Library, 9500 Gilman Drive, La Jolla, CA 92093, 858-534-3336