Chapter 4. RECOVERY OF HISTORICAL U.S. CENSUS BUREAU MICRODATA: SUCCESS TO DATE

B.K. Atrostic, Randy Becker, Todd Gardner, Cheryl Grim, and Mark Mildorf, Center for Economic Studies

Additional years of microdata (CES). See Text Box 4-1 for high- many household and business from the Annual Survey of lights of recovered . files to be recovered and sought Manufactures (ASM), Survey the help of the research com- Historic data from over 2,500 of Industrial Research and munity (Becker and Grim, 2009 tapes were recovered before the Development (SIRD), and the and Gardner, 2009). Without the Census Bureau’s main- Current Population Survey are intensive recovery effort led by frame was decom- just a few examples of the valu- CES during the last half of 2009, missioned in 2010. In the 2008 able data recently recovered by important information would Research Report, CES noted the the Center for Economic Studies have been lost forever.

Text Box 4-1. RECOVERED DATA: HIGHLIGHTS The economic and demographic data recovered from the Unisys will be valuable additions to the data already available at the Center for Economic Studies (CES) and the Research Data Centers (RDCs). Most files will require additional work before they can be used for research purposes, and some may require approval by sponsoring agencies. Examples of data recovered from the Unisys include: • Earlier years of series already available at CES · Censuses of Mining, Retail, Wholesale, and Services · Annual Survey of Manufactures (ASM) · Survey of Industrial Research and Development (SIRD) · Survey of Minority-Owned Business Enterprises · Commodity Transport Survey (now called the Commodity Flow Survey) · Decennial Census data ­~ Puerto Rico sample and complete count files ­~ U.S. Possessions sample and complete count files · Selected Current Population Survey (CPS) March Supplements • Series not currently available at CES · Agriculture Surveys and Censuses · Annual Survey of Oil and Gas · Heating Fuel Survey · Water Use Survey · Survey of Construction · Income Surveys Development Program · CPS Supplements for months other than March • New variables for series already available at CES · Census of Manufactures (CM) special inquiries data · ASM Central Administrative Office data · ASM and CM data flags • Historical analysis files (see Text Box 4-2) · Industrial Time Series data · Linked CM/SIRD data created for Zvi Griliches

U.S. Census Bureau Research at the Center for Economic Studies and the Research Data Centers: 2009 19 nearly two decades. See Text Box 4-3.

Innovative on-the-fly solutions to problems that hampered earlier efforts let CES surmount a series of technical, operational, and administrative challenges that each threat- ened to halt the recovery in mid-process.

While CES led the recovery effort, success on this scale required help from many part- ners. The project got off the ground because of the strong support of C. Harvey Monk, Jr., then Associate Director

Photo by Lauren Brenner for Economic Programs at the Census Bureau. Occupying 336 square feet, the Unisys IX Clearpath 4400 and 6 tape drives were purchased by the Census Bureau in 1995. We also appreciate the support of the Census Bureau’s Computer The recovered files have the been proven for the ASM and Services Division. They granted potential to bring new data SIRD. See Text Box 4-2. us additional time to access series to CES, extend existing the Unisys machine and gra- The 2009 recovery effort built data series at CES by a decade ciously allowed us to use their on work that had been going or more, and fill gaps in existing office facilities. on at a much lower intensity for data. This potential has already

Text Box 4-2. NEWLY AVAILABLE HISTORIC DATA: THE MANUFACTURING SECTOR

The data recovery project has already borne manufacturing microdata series provides oppor- fruit. Annual Survey of Manufactures (ASM) data tunities to study the evolution and behavior of dating as far back as 1954 have been recovered, plants and industries over longer time periods and as many as 15 years of additional historical and more business cycles. data from the Survey of Industrial Research and Industrial Time Series Data Development (SIRD) have been recovered for large firms.1 In the 1960s, Census Bureau staff created a longitudinal microdata file using the ASM. The oldest microdata available to researchers at Internally, the file was referred to as the the Center for Economic Studies (CES) and in the Industrial Time Series (ITS). They began with a Research Data Centers (RDCs) as of December pilot study using data from 1954–1961. They 2009 were manufacturing data from the 1963 selected 25 industries containing about 2,500 Census of Manufactures (CM). Extending the establishments (Conklin, 1964).2

1 The recovered data have not been fully converted into 2 Most of the selected industries were capital-intensive research-ready format nor have the data gone through and many were primary metals industries. exhaustive checks for completeness. Continued on page 21

20 Research at the Center for Economic Studies and the Research Data Centers: 2009 U.S. Census Bureau Text Box 4-2.—Con. linked data, he found a positive relationship Research using the data from the pilot between investments in R&D and company study show early evidence of the value of longi- productivity. tudinally linked microdata. Jordan (1965) stud- Griliches returned to the Census Bureau ies the measurement of capacity utilization, and to extend his work in the early 1980s. Schaffer (1968) looks at changes in the structure Unfortunately, the data from his original project of manufacturing employment. had been lost (Griliches, 1986).

CES recovered not only the original pilot study With Census Bureau staff doing the hands-on data, but also what we believe to be the 1954– data work, Griliches designed a new dataset. 1964 linked ASM data covering all manufactur- The new dataset contained 1957–1977 firm- ing industries discussed in Kallek (1982). level data from the SIRD matched to data from The ITS data were abandoned when a new, the 1962, 1967, and 1977 ES and data from the and ultimately very successful, project to link 1967 and 1972 CM. The sample was limited to manufacturing data began in the early 1980s certainty companies from the 1972 SIRD and (Kallek, 1982). This project was led by Nancy contains approximately 1,100 companies. Ruggles and Richard Ruggles and developed the Griliches highlights three findings in his 1986 Longitudinal Establishment , a prede- : (1) investments in R&D make a positive cessor to the manufacturing microdata available contribution to productivity growth; (2) basic at CES today (Atrostic, 2009). research is a more important contributor to The ITS data were not included in the 1980s proj- productivity than other types of R&D; and (3) ect because of anticipated difficulties in linking privately-financed R&D expenditures appear to the more recent 1972 data. Given the comput- to be more effective at the company-level than ing capabilities and other data sources available federally-financed R&D expenditures. now, there is a good probability these data can The data underlying the findings in Griliches be linked to the currently available manufactur- (1986) were recently recovered from the Unisys ing data at CES.3 machine as part of the historical data recov- ery project.4 CES currently has data from the Griliches R&D Data SIRD going back to 1972. The addition of the In the mid-1960s, Zvi Griliches—famed Harvard Griliches sample extends our historical informa- University economist, then at the University tion for many large R&D companies back of Chicago—was approached by the Census to 1957. Bureau and the National Science Foundation While the Griliches sample is limited to large to work on a project analyzing historical data companies and key SIRD variables (e.g., total on industrial R&D. R&D expenditures, number of scientists and Griliches led efforts to develop a longitudinally- engineers), up to 15 additional years of data will linked dataset containing 1957–1965 company- now be available for many large R&D compa- level data from the Survey of Industrial Research nies. The recovered data have been converted and Development (SIRD) matched to data from into SAS datasets. Foster and Grim (2010) use the 1958 and 1963 CM and Enterprise Statistics these newly recovered data to extend the period (ES). Griliches (1980) details this effort and covered in their analysis of research and devel- discusses limitations of the data. Using these opment back to the 1950s.

3 See Becker and Grim (2009) for more information 4 The recovered data do not include the 1977 ES data. on other sources of available historical manufacturing microdata.

U.S. Census Bureau Research at the Center for Economic Studies and the Research Data Centers: 2009 21 Text Box 4-3. RECOVERING HISTORICAL MICRODATA: A LONG HISTORY

Before 1998: Microdata recovered from Unisys decommissioned in late 2008, CES was able to were the source of major CES prod- delay the decommissioning. ucts, such as the original Longitudinal Research 2009: CES reached out to the research commu- Database (LRD) (Atrostic, 2009 and McGuckin and Pascoe, 1988). nity. The research potential of microdata remain- ing on the Unisys were described in a presenta- 1998–2007: tion by B.K. Recovering files Over a decade ago, then-CES researcher, Al Nucci (1998) Atrostic at the became harder. wrote of the significant technical and institutional chal- National Bureau Staff with the lenges in moving historical microdata to modern comput- of Economic necessary skills ing systems: Research and institu- Productivity tional memory “In principle, it ought to be a relatively straightforward Program meet- retired. The exercise to obtain these data (i.e., a copy from archival ing in March Unisys machine tapes). This is not the case. The Census Bureau’s legacy 2009, and arti- became mainframe is a nonstandard UNISYS (a descendant of the cles by Becker increasingly earlier UNIVAC computers). Further, the Census Bureau and Grim, fragile, prone enhanced the capabilities of these computers with a propri- and Gardner to crashes, etary . Hence the migration of data from earlier in the 2008 and difficult to years not only has the more traditional problem of missing CES Research bring back on or incomplete file documentation and deteriorating tape Report. CES line. files but requires the use of specialized software, often actively solic- specially written for each file. Important data ited the views of our Research from several Further, the skills required for these tasks are no longer Data Center censuses and common at the [Census] Bureau and becoming increas- partners and, surveys were ingly rare as time progresses, in addition to the continuing through them, nevertheless diminution of Census Bureau institutional memory on the the broader retrieved and contents of earlier files and at their creation.” research com- some made munity. Strong available for research. Among the data recov- positive feedback and concrete support gave ered were microdata for the 1960–1990 decen- CES efforts new life. nial censuses (Gardner, 2009); the 1955, 1956, and 1965–1971 Annual Survey of Manufactures (Becker and Grim, 2009); and the 1973–1978 Pollution Abatement Costs and Expenditures survey (Becker, 2007).

But moving the thousands of files remaining on the Unisys remained a low priority. CES would need to identify which of those files existed nowhere else, and only one person at CES had extensive experience on 1100/2200 Unisys mainframes. Photo by Lauren Brenner 2008–2009: When the Census Bureau One of many drawers full of paper announced the Unisys machine would be register files, which contain vital information for the recovery of data files stored on the Unisys.

22 Research at the Center for Economic Studies and the Research Data Centers: 2009 U.S. Census Bureau CES assembled team mem- data and make them consistent Identifying Data to be bers from a number of profes- with data already at CES. Recovered sions and divisions across the Decommissioning the Unisys The first challenge was to create Census Bureau. Partners from closed a long and significant a list of “rescue worthy” data the Research Data Center (RDC) chapter in Census Bureau files on the Unisys. system and the broader research computing history. Some of community also joined the data Information about data stored the recovered files were likely recovery team. See Text Box 4-4. on the Unisys, such as data initially processed on the Census storage forms, record layouts, Recovery is only the first step in Bureau’s UNIVAC I computer, first and lists of computer tapes for creating usable research micro- used by the Census Bureau in each “data storage submission” data files. CES programmers are 1951. See Text Box 4-5. are stored in a paper file called a working to develop processes “register.” A data storage reg- that should greatly streamline DATA RECOVERY PROCESS ister may represent an entire the last steps of converting the The data recovery team over- economic census with scores of data from legacy formats and came significant challenges to tapes holding several different character sets to more readily recover data from the Unisys: files and millions of records, or accessible formats, such as ASCII part of a survey with a single or SAS datasets. After the con- • Identifying data to be recovered tape holding a single file with version is completed, substantial • Moving data from the Unisys only a few hundred records. work will remain to assess the • Keeping the Unisys running

Text Box 4-4. Research Data Center Partner Support THE DATA RECOVERY TEAM Baruch College, CUNY System University of California, Berkeley Center for Economic Studies University of Michigan Interuniversity B.K. Atrostic Consortium for Political and Social Research Randy Becker University of Minnesota Population Center Jason Chancellor Joshua Coates (intern) Henry Cross (intern) Todd Gardner Cheryl Grim Mark Mildorf Rose Taylor Ya Jiun Tsai

Demographic Surveys Division Zelda McBride Sue Peters

Economic Statistical Methods and Programming Division Connie Christensen Stephen Jarvis Photo by Lauren Brenner Keith Paterno Center for Economic Studies (CES) Assistant Robert Penrod Division Chief for Research Support, Mark Daniel Vacca Mildorf, with the Unisys in January 2010. As the only CES member of the data recovery Population Division team with prior experience on a Unisys, Mark Marie Pees led the charge to download data.

U.S. Census Bureau Research at the Center for Economic Studies and the Research Data Centers: 2009 23 Text Box 4-5. THE UNISYS: END OF COMPUTING ERA THAT BEGAN WITH UNIVAC I

Decommissioning the Census Bureau’s Unisys characteristics (such as the 36 bit word) and Clearpath 4400 Mainframe will end a thread of using the same family of operating systems. computer usage reaching back to the 1950s. The Census Bureau’s UNIVAC I was retired in The Census Bureau purchased UNIVAC I, serial 1964. The UNIVAC 1105s were retired in 1967, number 001, in 1951. It was the first commer- and the UNIVAC 1107s were replaced in 1971. cial electronic general-purpose data processing The Unisys Clearpath 4400 is the last Unisys computer. mainframe in operation at the Census Bureau. However, the lineage of the Unisys Clearpath is The Unisys Corporation discontinued support of better traced back to the UNIVAC 1100 series of the Clearpath as of December 31, 2009, and it mainframe, specifically the UNIVAC 1105 and was decommissioned in 2010. 1107 models.

In 1958, the Census Bureau acquired the first two UNIVAC 1105 computers ever sold. In 1963, the Census Bureau purchased two UNIVAC 1107 computers.1

Since then, the Census Bureau has had one or more 1100 or 2200 series mainframes in near continual operation.

The Unisys Clearpath 4400 is in the same fam- ily line as the UNIVAC 1107, sharing hardware Photo by U.S. Census Bureau, Public Information Office

1 For more information, see the CNN news article “50th Built by Remington-Rand, using more than 5,600 Anniversary of the UNIVAC I,” available at . commercial computer.

CES researchers examined over Moving data from the Unisys in late October 2009 to empha- 2,000 registers. Consulting with was a complicated process with size transferring data from the other subject-matter experts many technical challenges. See Unisys as quickly as possible. as needed, they set priorities Text Box 4-6. The creation of research-ready for recovery and created elec- SAS datasets was deferred. tronic scans of hundreds of The data had to be converted important registers. from a variety of formats to This change, together with a format that could be copied process improvements and Moving Data From the from the Unisys and read on a increased staffing, led to a dra- Unisys modern computer. Initially, the matic improvement in the flow unstructured data were read in of data from the Unisys. In the After the data to be recovered variable-by-variable and loaded 6 months prior to the change were identified, much work into SAS datasets. This procedure in processing, the data recov- remained. The next step was to proved to be painstakingly slow. ery team moved economic data move data from the Unisys to a from approximately 20 registers modern computer system. The data recovery team made a from the Unisys. In the 4 months major change to this procedure

24 Research at the Center for Economic Studies and the Research Data Centers: 2009 U.S. Census Bureau following the change, the data recovery team moved data from over 490 economic registers.

Moving data files from the Unisys was a major production effort. Over 2,500 tapes were pro- cessed. The resulting 7,000-plus data files were then copied from the Unisys to Linux computers.

Keeping the Unisys Running

No account of the data recovery effort would be complete with- out noting the vital role played by Rose Taylor, a retired Census Bureau employee with over 25 years of experience with the Unisys. Photo by Lauren Brenner Brought back as a contractor, Rose Taylor, a retired Census employee and key member of the data Taylor was essential in keep- recovery team, with the Unisys in January 2010.

Text Box 4-6. TECHNICAL CHALLENGES TO THE DATA RECOVERY EFFORT

Listed below is a small subset of the technical data and data in the Excess-3 character set, issues that made the data recovery effort such a a character set not widely used since the challenging exercise. 1960s, were also recovered.

1. The Unisys operating environment. Expertise 3. Completely unstructured data. All of the data with the Unisys 1100/2200 series is now recovered were stored in files that did not rare at the Census Bureau. Most of the data enforce any . To enable users recovery work was done by people who to properly interpret the data, hardcopy had never previously logged on to a Unisys. record layouts—road maps of the data—were Of the Center for Economic Studies staff prepared when these files were initially involved with the recovery effort, only one created. In many cases, the record layouts had prior experience on a Unisys, and that were lost, requiring the data recovery team experience ended in 1982. to develop and implement new methods to move the data off of the Unisys. 2. Legacy character sets. Although some of the data recovered were in the ASCII char- 4. Unique record and file formats. Much of the acter set, the vast majority of the data were data were stored in CENsus Input Output in the less familiar FIELDATA character set. (CENIO) files. CENIO is a unique Census Developed in the 1950s by the Department Bureau-developed file structure. Data were of Defense, the FIELDATA character set was also retrieved from “Algol Direct Access only used widely in commercial computing files” and a “save file” from a System 2000 on the Unisys 1100/2200 series. EBCDIC database.

U.S. Census Bureau Research at the Center for Economic Studies and the Research Data Centers: 2009 25 ing the Unisys running and from the Unisys. However, as of Becker, Randy and Cheryl Grim. recovering data. March 2010, relatively little of 2009. “Recovering Historical the data from the recovered reg- Manufacturing Microdata.” With Taylor, CES regained exper- isters has been converted into In 2008 Research Report: tise in recovering and restarting SAS datasets. Further process- Center for Economic Studies the machine following frequent ing is required to convert the and Research Data Centers. system crashes and in manag- remainder of the recovered data ing the flow of computing jobs to modern formats. to avoid conditions leading to Conklin, Mawell R. 1964. “Time crashes. She also understood Record layouts are not available Series for Individual Plants how programmers had stored for some data files. These cases from the Annual Survey of files on the Unisys. She solved will require additional analysis Manufactures and Related many puzzles that would have to reconstruct the record layout. Data.” Presented at a Joint slowed or completely stymied File conversion is potentially Session of the American the recovery. a painstaking process, but the Economic Association and end result should be to make the American Statistical Without Taylor’s expertise, it is detailed data available to ana- Association on December likely that CES would only have lyze important economic and 30, 1964. been able to retrieve a handful social issues. of files, rather than thousands. Foster, Lucia, and Cheryl Grim. Researchers interested in part- 2010. “Characteristics of Role of RDC Partners nering with CES to develop these the Top R&D Performing data should contact the authors Firms in the U.S.: Evidence Help from our RDC partners was or send an e-mail to . R&D.” Center for Economic cess of the data recovery proj- Studies Discussion Paper ect. Our partners at the Berkeley, REFERENCES CES-WP-10-33. Michigan, Minnesota, and New York (Baruch) RDCs completed Atrostic, B.K. 2009. “The Center Gardner, Todd. 2009. “Expanded important groundwork to speed for Economic Studies and Enhanced Decennial the process of converting data 1982–2007: A Brief History.” Census Data for Research.” pulled from the Unisys into SAS Center for Economic In 2008 Research Report: datasets. Our Michigan and Studies Discussion Paper Center for Economic Studies Minnesota RDC partners were CES-WP-09-35. and Research Data Centers. also instrumental in acquiring Becker, Randy. 2007. “New the services of Rose Taylor. Developments With the Griliches, Zvi. 1980. “Returns to Pollution Abatement Costs MOVING FORWARD Research and Development and Expenditures (PACE) Expenditures in the CES needs partners to convert Survey.” In Research at the Private Sector.” In New the recovered data files into Center for Economic Studies Developments in Productivity formats useful for research. and the Research Data Measurement and Analysis, Data from hundreds of registers Centers: 2006. Beatrice N. Vaccara, 419– 462. Chicago and London: NBER/University of Chicago Press.

26 Research at the Center for Economic Studies and the Research Data Centers: 2009 U.S. Census Bureau Griliches, Zvi. 1986. Kallek, Shirley. 1982. “Objectives Nucci, Alfred R. 1998. “The “Productivity, R&D, and Basic and Framework.” Center for Economic Research at the Firm Level In Workshop on the Studies Program to in the 1970s.” American Development and Use of Assemble Economic Census Economic Review, 76(1): Longitudinal Establishment Establishment Information.” 141–154. Data. U.S. Government Business and Economic Printing Office, Washington, History, 27(1): 249–256. Jordan, Willis K. 1965. DC. “The Measurement of Schaffer, William A. 1968. Performance Potential McGuckin, Robert H. and George “Changes in the Structure of in Manufacturing A. Pascoe Jr. 1988. “The Manufacturing Employment.” Establishments.” Bureau of Longitudinal Research Bureau of the Census the Census Working Paper Database (LRD): Status and Working Paper No. 26. No. 18. Possibilities.” Survey of Current Business, 68(11): 30–37.

U.S. Census Bureau Research at the Center for Economic Studies and the Research Data Centers: 2009 27

This document is an extract from “2009 Research Report: Center for Economic Studies and Research Data Centers.”

The full report is available at .

This report has not undergone the review accorded Census Bureau publications and no endorsement should be inferred. Any opinions and conclusions expressed herein are those of the author(s) and do not necessarily represent the views of the U.S. Census Bureau or other organizations. All results have been reviewed to ensure that no confidential information is disclosed.