
Data Migration Project Glossary:

ASCII - American Standard Code for Information Interchange is the most common format for text files in computers and on the Internet. In an ASCII file, each alphabetic, numeric, or special character is represented with a 7-bit binary number (a string of seven 0s or 1s). 128 possible characters are defined. UNIX and DOS-based operating systems use ASCII for text files. Windows NT and 2000 use a newer code, Unicode. IBM's S/390 systems use a proprietary 8-bit code called EBCDIC.
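
As a small illustration of the 7-bit encoding described above, the following Python sketch lists the code of each character in decimal, octal, and binary. The helper name `ascii_codes` is ours and is used only for this example.

```python
def ascii_codes(text):
    """Return (character, decimal, octal, 7-bit binary) for each ASCII character."""
    rows = []
    for ch in text:
        code = ord(ch)
        if code > 127:
            # Only 128 codes (0 through 127) are defined in ASCII.
            raise ValueError(f"{ch!r} is not an ASCII character")
        rows.append((ch, code, format(code, "03o"), format(code, "07b")))
    return rows


for row in ascii_codes("A1 "):
    print(row)
# First row printed: ('A', 65, '101', '1000001')
```

Each binary string is exactly seven digits long, which is why 128 (2 to the 7th power) characters can be defined.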

ASCII control characters - The first 32 ASCII characters (octal codes 000 through 037) form a special set of non-printing characters called the control characters. They are so named because they perform legacy printer and display control operations rather than displaying symbols. Unfortunately, a given control character can perform different operations on different output devices, and there is very little standardization among those devices. With the exception of the line feed character, we have replaced these "extraneous" control characters in the processed ASCII documents and data files with a blank so that the processed files can be used with a variety of software on many kinds of computers.
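
The replacement described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the program the project actually used: the function name `blank_control_chars` is hypothetical, and the sketch treats the carriage return like any other control character because the text names the line feed as the only exception.

```python
LINE_FEED = 0o012  # the one control character that is kept
BLANK = 0o040      # the ASCII space character


def blank_control_chars(data: bytes) -> bytes:
    """Replace the first 32 ASCII codes with a blank, keeping line feeds."""
    return bytes(
        BLANK if byte < 0o040 and byte != LINE_FEED else byte
        for byte in data
    )


# A bell (octal 007) and a tab (octal 011) become blanks; the line feed stays.
print(blank_control_chars(b"AB\x07C\tD\n"))  # b'AB C D\n'
```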

ASCII high characters - Extra or extended characters (decimal codes 128 through 255, octal 200 through 377) added to the original ASCII character set. These can be foreign-language characters, math characters, symbols, and so on. Unfortunately, extended character sets vary, and the same code displays differently in different ASCII file viewers and editors. They can also cause meaningless records to be imported into statistical or database software. Most of these characters in the original diskette files are probably EBCDIC characters that were left untranslated in the conversion to ASCII. These high or extended "extraneous" characters have been replaced with a blank or an English alphanumeric character in the processed files.
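
A minimal Python sketch of the blank substitution follows. It assumes the simplest case, in which every high character becomes a blank; choosing an English alphanumeric replacement instead would require a judgment about each file that this sketch does not attempt. The name `blank_high_chars` is hypothetical.

```python
def blank_high_chars(data: bytes) -> bytes:
    """Replace every byte outside the 7-bit ASCII range with a blank."""
    # Octal 200 is decimal 128, the first code beyond original ASCII.
    return bytes(0o040 if byte >= 0o200 else byte for byte in data)


# Bytes 0xE9 and 0xB0 are high characters whose meaning depends on the
# extended character set in use, so they are replaced with blanks.
print(blank_high_chars(b"caf\xe9 10\xb0"))  # b'caf  10 '
```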

CSV - A comma-separated values (CSV) data file is a physical ASCII file structure that contains records whose values are delimited, or separated, by commas. Within the context of the Data Migration Project, many of the original diskette data files have been translated to ASCII CSV format so they can be used by a variety of software on many types of computers.
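
For readers who want to see the structure, here is a short Python sketch that reads CSV text with the standard `csv` module. The sample values are invented for illustration and do not come from any project file.

```python
import csv
import io

# Invented sample: a value that itself contains a comma is wrapped in quotes.
sample = 'region,year,count\n"North, Area 1",1990,125\nSouth,1990,98\n'

rows = list(csv.reader(io.StringIO(sample)))
for row in rows:
    print(row)
```

Every value is read as text, so numeric columns still need to be converted before any arithmetic is done on them.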

DOS - The MS-DOS Disk Operating System was the Microsoft-marketed version of the first widely installed operating system in personal computers. It was essentially the same operating system that Bill Gates's young company developed for IBM as Personal Computer Disk Operating System (PC-DOS). The Data Migration Project original diskette files were written to be used on DOS computers.

data - In computing, data is information that has been translated into a form that is more convenient to move or process. Relative to today's computers and transmission media, data is information converted into binary digital form.

database - A database is a collection of data that is organized so that its contents can easily be accessed, analyzed, and updated. In the context of the Data Migration Project, databases are usually in the DOS dBase format.

data file - In the context of the Data Migration Project, a data file is an ASCII text file that has a header record consisting of variable names or values that is followed by "n" records of data values. Records may contain actual numbers or coded values.
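
The layout described above (one header record of variable names followed by "n" data records) can be read as sketched below. The sketch assumes a comma-delimited file; the function name `read_data_file`, the variable names, and the sample values are all hypothetical.

```python
import csv
import io


def read_data_file(text):
    """Split a data file into its header record and its n data records."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # the first record holds the variable names
    records = [dict(zip(header, row)) for row in reader]
    return header, records


# Invented sample: SEX holds a coded value, INCOME an actual number.
sample = "ID,SEX,INCOME\n1,1,25000\n2,2,31000\n"
header, records = read_data_file(sample)
print(header)        # ['ID', 'SEX', 'INCOME']
print(len(records))  # 2
```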

data table - In the context of the Data Migration Project, a data table can be an ASCII text file that has many header records followed by "n" records of data values. A data table may also be in DOS Lotus 1-2-3 (spreadsheet) format. Typically, data tables are prepared for viewing data rather than for processing or analyzing it.

EBCDIC - EBCDIC (pronounced either "ehb-suh-dik" or "ehb-kuh-dik") is a binary code for alphanumeric characters that IBM developed for its larger operating systems. It is the code for text files that is used in IBM's OS/390 operating system for its S/390 servers and that thousands of corporations use for their legacy applications and databases. In an EBCDIC file, each alphabetic or numeric character is represented with an 8-bit binary number (a string of eight 0s or 1s). 256 possible characters (letters of the alphabet, numerals, and special characters) are defined. We believe that many of the ASCII diskette files are translations of files originally written in the EBCDIC character set used on government agency computers.
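
To show what an EBCDIC-to-ASCII translation involves, the Python sketch below decodes a few EBCDIC bytes using the `cp037` codec from Python's standard library. Note the assumption: EBCDIC exists in many code-page variants, and we do not know which one the agency computers used, so `cp037` (the US/Canada variant) is chosen only for illustration.

```python
# The same text has entirely different byte values in the two codes.
ebcdic_bytes = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6, 0x40, 0xF1, 0xF2])

text = ebcdic_bytes.decode("cp037")  # interpret the bytes as EBCDIC
ascii_bytes = text.encode("ascii")   # write the same text as ASCII

print(text)               # HELLO 12
print(ascii_bytes.hex())  # 48454c4c4f203132
```

A byte that the translation step does not handle passes through with its original value, which is one way the high characters described earlier in this glossary can end up in an otherwise ASCII file.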

extension - In computer operating systems, a file name extension is an optional addition to the file name in a suffix of the form ".xxx" where "xxx" represents a limited number of alphanumeric characters depending on the operating system. The file name extension helps an application program recognize whether a file is a type that it can work with.
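
As a sketch of how a program might use the extension, the Python example below splits a file name with `os.path.splitext` and looks the suffix up in a small table. The table `KNOWN_TYPES` and the function `describe` are hypothetical and cover only a few formats mentioned in this glossary.

```python
import os.path

# Illustrative mapping only; real applications recognize many more types.
KNOWN_TYPES = {
    ".csv": "comma-separated values",
    ".dbf": "dBase database",
    ".pdf": "Portable Document Format",
    ".txt": "plain text",
}


def describe(filename):
    """Return a description of the file type implied by the extension."""
    _, ext = os.path.splitext(filename)
    # DOS file names are not case sensitive, so compare in lower case.
    return KNOWN_TYPES.get(ext.lower(), "unknown")


print(describe("TABLE1.DBF"))  # dBase database
print(describe("README"))      # unknown
```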

file - In data processing, a file is a related collection of records. For example, you might put the records you have on each of your customers in a file. In turn, each record would consist of fields for individual data items, such as customer name, customer number, customer address, and so forth. By providing the same information in the same fields in each record (so that all records are consistent), your file will be easily accessible for analysis and manipulation by a computer program.

kernel - The kernel is the essential center of a computer operating system, the core that provides basic services for all other parts of the operating system. Windows NT and 2000 use the NT kernel, while earlier versions of Windows use the DOS kernel. Some of the original DOS diskette applications will not run on versions of Windows that use the NT kernel.

legacy application - In information technology, legacy applications and data are those that have been inherited from languages, platforms, and techniques earlier than current technology. In the past, much programming has been written for specific operating systems. Currently, efforts are underway to migrate legacy applications to new programming languages and operating systems that follow open or standard programming interfaces. Theoretically, this will make it easier in the future to update applications without having to rewrite them entirely and will allow applications to run on any operating system.

octal - Octal (pronounced AHK-tuhl, from Latin octo or "eight") is a term that describes a base-8 number system. An octal number system consists of eight single-digit numbers: 0, 1, 2, 3, 4, 5, 6, and 7. The number after 7 is 10. The number after 17 is 20 and so forth. In computer programming, the octal equivalent of a binary number is sometimes used to represent it because it is shorter.
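
The counting rule above can be checked with a few lines of Python. The helper `to_octal` is our own name for a thin wrapper around the built-in `format` function.

```python
def to_octal(number):
    """Return the base-8 representation of a non-negative integer."""
    return format(number, "o")


print(to_octal(7 + 1))             # 10, the number after 7
print(to_octal(int("17", 8) + 1))  # 20, the number after 17
print(to_octal(0b11111111))        # 377, three digits instead of eight
```

Each octal digit stands for exactly three binary digits, which is why the octal form is shorter.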

PDF - Portable Document Format is a file format that has captured all the elements of a printed document as an electronic image that you can view, navigate, print, or forward to someone else. PDF files are created using Adobe Acrobat, Acrobat Capture, or similar products. To view and use the files, you need the free Acrobat Reader, which you can easily download. Once you've downloaded the Reader, it will start automatically whenever you want to look at a PDF file.

record - A physical record is a chunk of data that has a specified and constant size in bytes or that is clearly delimited from other records by a newline character or sector of a disk or other means identifiable to a computer program reading the file.
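
The two kinds of record boundary mentioned above, constant size and a delimiter, can be sketched in Python as follows. Both function names are hypothetical.

```python
def fixed_records(data: bytes, size: int):
    """Split data into records that each occupy exactly `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def delimited_records(data: bytes):
    """Split data into records that end at a line ending (LF, CR, or CR LF)."""
    return data.splitlines()


print(fixed_records(b"AAAABBBBCCCC", 4))  # [b'AAAA', b'BBBB', b'CCCC']
print(delimited_records(b"one\ntwo\n"))   # [b'one', b'two']
```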

spreadsheet - A spreadsheet is a sheet of paper that shows accounting or other data in rows and columns; a spreadsheet is also a computer application program that simulates a physical spreadsheet by capturing, displaying, and manipulating data arranged in rows and columns. The spreadsheet is one of the most popular uses of the personal computer.

text file - A text file is a human-readable sequence of characters and the words they form that can be encoded into computer-readable formats such as ASCII.

Unicode - Unicode is an entirely new idea in setting up binary codes for text or script characters. Officially called the Unicode Worldwide Character Standard, it is a system for "the interchange, processing, and display of the written texts of the diverse languages of the modern world." It also supports many classical and historical texts in a number of languages. Currently, the Unicode standard contains 34,168 distinct coded characters derived from 24 supported language scripts. These characters cover the principal written languages of the world.
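
As a brief illustration, the Python sketch below looks up the Unicode code number and official name of two characters that fall outside the 128 characters of ASCII. It uses the standard `unicodedata` module; the two characters were picked arbitrarily.

```python
import unicodedata

# Escapes are used so the example does not depend on the file's encoding.
for ch in ["\u00e9", "\u03a9"]:
    print(ord(ch), unicodedata.name(ch))
# 233 LATIN SMALL LETTER E WITH ACUTE
# 937 GREEK CAPITAL LETTER OMEGA
```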


Official Web Page of the University of California, San Diego
© Copyright 2000, UCSD, All Rights Reserved. This site may not be reproduced.
Social Sciences & Humanities Library, 9500 Gilman Drive, La Jolla, CA 92093, 858-534-3336