CANADIAN HISTORICAL MOBILITY PROJECT

1871 NATIONAL

DOCUMENTATION

for

SPSS PORTABLE FILE

Michael Ornstein and Gordon Darroch

York University, Toronto

CITATION RULES FOR MACHINE-READABLE DATA IN CANADIAN HISTORICAL JOURNALS

by José E. Igartua, History, Université du Québec à Montréal

At its June 7, 1993 business meeting, the Canadian Committee for History and Computing adopted the following citation rules for machine-readable data in Canadian historical journals. The committee hopes that journal editors in turn adopt these rules and make them known to their contributors.

I. Goals

The purpose of these rules is to define a method of citation for machine-readable data that provides rules similar to those that apply to traditional sources, in order to adapt the historians' scholarly apparatus to the new kinds of sources used in the process of historical enquiry.

As well, these rules allow for scholarly recognition of the scientific work involved in the creation and distribution of data bases of historical material, as the practice already exists in other disciplines.¹

Finally, the adoption of these rules will enable researchers to meet the requirements of funding agencies (among which the Social Sciences and Humanities Research Council of Canada) that data made machine-readable through their funding be made available to the scholarly community.

II. Three Types of Situations

A. Authors use data bases created and distributed by other scholars, by research organizations or by commercial enterprises.
References to such data should be made according to the rules defined below.

B. Authors make use of data that they have made machine-readable and these data are accessible to other researchers through a third party (traditional archives, data libraries, research groups or other organizations).
Making the data accessible should be considered a form of publication; data thus "published" are to be cited according to the rules defined below.
Authors are strongly encouraged to turn over to an organization (department, research centre, archives, or other) the functions of producer and distributor for their machine-readable data, under such stipulations as are agreeable to both parties.

C. Authors make use of machine-readable data in the same way as they do research notes on paper and are under no obligation to make their machine-readable notes accessible to others.
In this situation authors should make reference to the original sources from which machine-readable notes are made. Any substantial transformation of the raw data is explained, either in the body of the published material or in a note, according to the place taken by such data in the argument presented. For instance, one should explain the method by which occupational titles have been classified into categories.

III. Citation Rules²

As much as possible, the information given in references should be taken from the machine-readable document itself or from accompanying documentation.

1. Author: cite in the usual manner.
2. Title: title of the data file or of the data base.³
3. Machine-readable documents: to indicate that the document is machine-readable, write the words "[computer file]", in square brackets without using any acronyms.
4. Statement of responsibility, where applicable. "This indicates the responsibility of the person(s) or corporate body named as principal investigator or of other significant parties, such as the department, funding agency, or sponsoring organisation."⁴
5. Edition, series, or version, if indicated.
6. Place of production, name of producer, followed by "[producer]", date of production.
7. Place of distribution, name of distributor, followed by "[distributor]", date of distribution.
8. Collection, where indicated.
9. The following additional information may be added:
   a. A brief description of contents, within square brackets, if the title does not give sufficient information on this score.

Comité canadien d'histoire et d'informatique / Canadian Committee on History and Computing

   b. Material designation in square brackets; for instance:
      [on-line database]
      [magnetic tape]
      [floppy disk]
      [CD-ROM]
   c. If the database is periodically updated, give the date when the database was used.
10. Outline conditions governing access, where applicable.

Some examples:

• Centre interuniversitaire de recherche sur les populations SOREP, Base de données SAGUENAY [computer file], SOREP, Chicoutimi: Université du Québec à Chicoutimi, 1992, [on-line database] accessible for research purposes, subject to approval by the SOREP ethics committee.

• CANSIM University Base [computer file], Ottawa: , 1992 [magnetic tape].

• Science Citation Index [computer file], Philadelphia: Institute for Scientific Information, 1989 [CD-ROM].

• IGARTUA, José E., Base de données MEMBERS2 [list of members of the Canadian Committee for History and Computing] [computer file] Montréal: Département d'histoire, Université du Québec à Montréal, 1992, [on-line database] accessible through the author upon request.

Notes:

¹ See for instance the "Notice to Contributors" in the American Sociological Review, 57, 1 (February 1992): iii-iv.

² The citation rules outlined here follow those defined by Terry Cook et al., Archival Citations: Suggestions for the Citation of Documents at the Public Archives of Canada (Ottawa: Public Archives of Canada, 1983), 13-14. See also Danielle Thibault, Bibliographic Citation Guide (Ottawa: National Library of Canada, 1989), 102-103. One may also consult the rules defined by the American Sociological Review, as well as the cataloguing methods used in the Canadian Union List of Machine Readable Data Files (CULDAT), produced by Edward H. Hanis and described in Edward H. Hanis, "Reference and User Guide for the CULDAT Information System" (London, Ont.: January 1990). The CULDAT project was sponsored by the Government Archives Division of the National Archives of Canada.

³ A database is a set of data files linked together by a logical structure.

⁴ Cook, ed., Archival Citations, 13.

NEW COMPUTER TOOLS

If you are looking for a note on a certain person, for example, simply type in the person's name and hit a key. The program rapidly searches through your notes, most often hundreds of pages, and retrieves the note(s) that contain(s) the name in a matter of seconds: the exact speed depends on the amount of information in the database and the speed of your computer. Similarly, you can search for any word, number, or combination. Better yet, you can sort the notes into different groups depending on your purpose. If you wanted all the notes on the subject of railways, type in railways and hit a key. You can be even more specific and ask for notes on railways in eastern Canada. Also, you can reorder the notes to your liking. I often shift mine into chronological order, but you can also order them by subject, author, source, etc., all depending on the type of information you've entered and the fields you've created.

Once you've sorted and reordered your notes you can print them. When printing you can decide to print the entire database, i.e. all of your notes, or only select subgroups. You can also customize a printing format for your own needs by setting the margins, the order of printing, the print size, and many other options. And, as mentioned above, you can print as many copies as often as you want. But even more importantly, you can also transfer your notes directly into your wordprocessor. I find this extremely useful for making fast, detailed outlines. I choose the notes for my topic from the database, reorder them to suit my needs, and then transfer them into my

Overview: Canadian Historical Mobility Project and Class, Household and Mobility Project.

Gordon Darroch and Michael Ornstein, Department of Sociology, York University, June 1986.

The Context

The studies have been conducted in two phases, though they are aspects of a continuing project. The first was focused on all four provinces of Canada in 1871 (Ontario, Quebec, New Brunswick and Nova Scotia). The second phase is focused on a large region of Central Ontario in the 1861-1871 decade (a wedge of counties stretching from the middle of Lake Erie to the lower shore of Lake Huron on the west, and on the east, from about one third of the way from Toronto to Kingston, at Port Hope, north to the southern tip of Georgian Bay). The studies are based on samples taken from the nominal data given on the census manuscripts of 1861 and 1871. Nominal data means simply the records of the individuals and households recorded on the original folios by the nineteenth-century census enumerators. At the time of writing they are available, in varying quality, on microfilm from 1851 to 1881 for Canada.

Both phases of the data collection have unique elements. The first phase created a representative national sample of households for 1871 that allows detailed analysis for a variety of variables. We have reported some results in historical journals (for example, Darroch and Ornstein, 1983; 1984). The Ontario phase has two unique elements. First, it is based on record linkage of very large and unusual samples of individuals drawn from the census manuscripts of 1861 and 1871. Second, we created records for these individuals that nearly exhausted the information from all schedules of the censuses of those years, including household information, farm tenancy and productivity, real estate and the data of the manufacturing censuses (though the latter are very problematic). In some cases this required additional linkage procedures to attach information from more than one schedule to the same purported individual. This report provides an account of the methodology involved and of the nature and limitations of the data files.
The studies were undertaken in the context of two types of historical analysis that emerged in the 1960s and early 1970s. One was the breakthrough in demographic studies represented by the development and spread of family reconstitution using parish records in pre-census times. The other was the work on social mobility in past time, largely stimulated in North America by Stephan Thernstrom's early study Poverty and Progress (1964). Though in many respects a limited work, Thernstrom's first study nonetheless had a pervasive influence on the writing of social history after the mid-sixties (see Social Science History, special issue, Spring 1986: 1-44). Two aspects of these analyses influenced the design of the current studies. First, they showed that a systematic social history could be built up from unconventional historical sources for the great majority of people who left no intentional traces or records. Second, and more specifically, each was confronted with a serious problem of design posed by the facts of migration. The problem of migration was simply that there was a great deal more of it everywhere, in every era, than historians had usually imagined. The "discovery" of the volume of migration in the past deeply complicated the new historical methodologies, which were founded on the capacity to build limited biographies for historical individuals by linking nominal records with some fidelity. In other words, only the stable population of the geographic areas under study was "at risk" in this crucial linkage methodology; the surprisingly large numbers of movers simply escaped the analytic net. In part, of course, the problem stems from the arbitrary nature of the civil or administrative units most often adopted as convenient sites for study: a small town, a parish or two, a city or possibly a county or département. These units bear only limited correspondence to meaningful social structural or individual spaces, in the past as in the present.
The problems presented by migration and the limits of civil units to tracing individuals through historical records were probably exaggerated in early studies (Thernstrom, 1964). Still, recent historical studies of migration underscore the general difficulty, since they are based on rare historical sources, such as continuous population registers (Kertzer and Hogan, 1985; Hochstadt, 1986), on the unique U.S. Soundex indexes of surnames (Stephenson et al., 1978) or on formidably tedious procedures of tracking individuals through innumerable discrete records (Knights, 1971).

The Sampling Method

Migration was only one among several related concerns of our study, which centred on social class formation, mobility and the household economy after mid-century in Canada. Some solution to the methodological dilemma was necessary in order not to vitiate other inquiries. Our solution was to combine a methodological sledgehammer with a methodological scalpel. The sledgehammer was simply to expand the area under study sufficiently to capture the large component of total migration made up of local and circular moves (despite the heavy flow of outmigration to the U.S. in nineteenth-century Canada). Initially, we envisaged a study of the population of all four provinces at Confederation. In principle such a study is possible, though only the first stage of the current study has such broad scope. The more intensive, second stage focuses on the large area of Ontario described above and outlined in Figure 1.

Figure 1 about here

[Figure 1: Southern Ontario counties and geographical townships. Legend: Central Ontario main study region, 1861-1871 linked data.]

The scalpel was a sample design, a necessary complement to the sledgehammer. Clearly, some sampling strategy is required by an historical study that breaks out of the confines of local administrative units and attempts to construct a "collective biography" of thousands of ordinary individuals recorded in administrative documents. A study of the population of even a relatively large city, like Toronto after mid-century, or of a major region or province clearly will generate an immense number of individual records. In the 1970s and 1980s such files quickly exhausted the efficient data management capacities of even very large computers.

Moreover, our problem was not merely to design an efficient sampling strategy, but at the same time to preserve the capacity to conduct systematic record linkage. If one draws conventional samples from two or more historical listings, taking every nth person or otherwise randomly drawing cases, then one virtually eliminates the possibility of systematically linking records for the same individuals from the different listings. The random element of the samples, which ensures representativeness, also ensures that there is only the very slightest chance that any one individual will appear in both samples. One needs both to sample and to ensure that the samples are effectively closed populations, so that the same surviving individuals will appear in each.

Our solution was to devise a form of sampling we call letter sampling. It can be briefly described, although the actual procedure is rather more tedious (see other reports in this documentation). First, we needed to demonstrate that a random sample of surnames from a population of all surnames was, in fact, an adequate representative sample of the population itself. A sample of surnames, of course, preserves the possibility of record linkage (at least for those who do not alter their surnames; unfortunately women tend to on marriage).

The procedure requires several stages. They can be briefly outlined. First, we wanted to examine in detail the character of surname samples for a nineteenth-century population, before proceeding to a random selection. Ideally one would compare the results of a large number of such samples with the characteristics of the population in question. Of course, a knowledge of those characteristics would obviate the need for sample estimates. Alternatively, there were the limited and flawed tabulations of the aggregate census reports. We chose a second alternative. A large random sample of households could be drawn from the microfilmed copies of the nominal manuscript census of 1871, available through the Public Archives of Canada and other libraries. The first or personal schedule of the manuscript census provides a variety of socio-demographic data for all individuals enumerated in their households of residence, including boarders, servants and visitors.

The data collection itself was simple enough in principle, and relatively complicated in practice. The basic sample selections were proportional to township populations. Virtually all the reels of the microfilmed manuscript census of 1871 were scanned for the four provinces, township by township, and a prespecified number of sample households were identified in each (see sampling documentation). The information for all individuals in the sampled households was transcribed to paper and subsequently keypunched. The sample was a stratified random sample, with the stratification ensuring that relatively small groups were adequately represented for purposes of analysis.
Specifically, the stratified sample overrepresents the urban population in general, those of English origin in Quebec and of French origin in Ontario and New Brunswick, and the German-origin group in all areas in which they were at least 10 percent of the population (as determined from the aggregate census). There were two primary objectives of this elaborate procedure. As noted above, it provided a surrogate national population from which letter samples could be drawn and to which they could be compared for a variety of characteristics and relationships. Second, it was clear that we had an unusual opportunity to supplement our methodological concerns with substantive ones: a relatively large stratified random sample of a nineteenth-century national population provides for unique and very rich socio-historical analyses.

Analysis of the national sample has been reported to date in published articles on the complexity of nineteenth-century ethnic divisions of labour (Darroch and Ornstein, 1980), on the relationships between regional economies and household organization (Darroch and Ornstein, 1984) and on the complexity of households themselves (Darroch and Ornstein, 1983). As for the original methodological objective, the results were consistently encouraging: the design effects of letter samples, which are technically cluster samples, were modest and the letter samples adequately represented characteristics of the population from which they were drawn (see sample documentation).

The second stage of the study adopted a refined version of the letter sampling strategy. In this case the national sample was used to divide all surnames appearing in the census of 1871 into a set of 100 mutually exclusive clusters defined by Soundex phonetic codes, using the first letter of the surname and a phonetic classification of the next portion of the name. From these clusters a random sample of surname clusters or pockets was drawn, stratified by the size of the surname pockets. This second phase of the study aimed to link individual records between two census years, 1861 and 1871. The magnitude of that task and the limits of funding and research time restricted the study to the large central region of Ontario, centred on Toronto. The area is represented in Figure 1.
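The pocket construction and the closed-population property described above can be illustrated with a minimal Python sketch. It is not the project's procedure: the pocket key used here (first letter plus the Soundex class of the next coded consonant), the function names and the simple unstratified draw are all assumptions made for illustration, and the sample records are invented.

```python
import random

# Standard Soundex consonant classes; vowels and H, W, Y carry no code.
SOUNDEX_CLASS = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in letters:
        SOUNDEX_CLASS[letter] = digit


def surname_pocket(surname):
    """Hypothetical pocket key: first letter + class of next coded consonant."""
    letters = [c for c in surname.upper() if c.isalpha()]
    if not letters:
        return ""
    for c in letters[1:]:
        if c in SOUNDEX_CLASS:
            return letters[0] + SOUNDEX_CLASS[c]
    return letters[0] + "0"  # no coded consonant after the first letter


def draw_pockets(pockets, fraction, seed):
    """Simple random draw of pockets (the study stratified by pocket size)."""
    ordered = sorted(pockets)
    k = max(1, round(fraction * len(ordered)))
    return set(random.Random(seed).sample(ordered, k))


def letter_sample(records, chosen):
    """Keep every record whose surname falls in a chosen pocket."""
    return [r for r in records if surname_pocket(r["surname"]) in chosen]


# Invented records: the same four people listed in two census years.
census_1861 = [{"surname": s, "age": a} for s, a in
               [("Smith", 30), ("Smyth", 41), ("Darroch", 25), ("Lee", 52)]]
census_1871 = [{"surname": r["surname"], "age": r["age"] + 10}
               for r in census_1861]

all_pockets = {surname_pocket(r["surname"]) for r in census_1861}
chosen = draw_pockets(all_pockets, 0.5, seed=1871)
sample_1861 = letter_sample(census_1861, chosen)
sample_1871 = letter_sample(census_1871, chosen)
```

Because selection depends only on the surname, anyone who keeps the same surname and is enumerated in both years falls into both samples or into neither, which is the property that keeps linkage possible. Spelling variants such as Smith and Smyth share a pocket under this key.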

By the time we were prepared to collect data again from the microfilm copies of the manuscripts, we were also able to substitute programmable data entry terminals for our paper and keypunch technology. In effect, however, the procedure was much the same. Coders were trained to search systematically the manuscript collection by city, town and township for the whole of the contiguous area of Ontario. All individuals with surnames that fell into the cluster sample were considered primary sampling units and their complete census records, as well as the records of every other member of the same household, were transcribed exactly to computer file. The sampling and recording were repeated separately for 1871 and 1861 for each township or town in the region. A further selection had to be taken from the 1861 agricultural census, which had been taken as an independent enumeration. The data collection for Ontario differed from the national sample in that all the information from the several schedules of the censuses was recorded for every member of the households (in 1861, the personal schedule is supplemented by information on manufacturing and industries and by the separate agricultural schedule; in 1871 there were nine full schedules, including agricultural, industrial and real estate censuses).

The last of the major steps in this research design was the linking of the individual records between 1861 and 1871 for the Ontario region. Record linkage has become a central feature of historical analyses using nominal data, from the early manual linkage of parish records in family reconstitution to quite elaborate computer algorithms for automatic linkage of large numbers of records. After reviewing well-known computerized procedures we chose to develop a combination of computer and manual linkage that is particularly suited to these historical census records. In capsule form the procedure was as follows.
The computer was enlisted to sort the records initially and to accomplish the merges of the individual data files after linkage. The sorting was no mean task: there were over 34,000 individual records selected in the letter sampling for central Ontario in 1861 and over 40,000 in 1871. Using alphabetically sorted surname lists, the linkage proper combined a complex set of decision rules, regarding which records would be considered to refer to the same historical person, with the pattern recognition capabilities of research assistants. The rules emerged out of a reading of the linkage literature and from trials undertaken by the principal investigators. They were quite complex, with separate decision algorithms applying to cases where information was limited to individuals and to cases where the family and household context added information. The algorithms were conditional ones, in which the requirement of a precise or close match on one item of information (for example, on name spellings, age or birthplace) depended on the precision of the match on others (say, on a wife's or parent's first name, age and ethnicity, or on the names and ages of children). Uncertainty was systematically reduced by the accumulation of information across several items. The research assistants learned their "trade" largely by trial and error under supervision. Results of initial trials showed quite high rates of replication for different individuals. In all, some 16,000 records were considered true links. In every case, the links were coded to include a subjective estimate of the level of certainty.

First estimates for the entire region put the rate of linkage at about 55 percent of those at risk in 1861, taking account of mortality and, for women, marriage and name change. The large residual represents an unknown combination of outmigration from the region, census underenumeration and record linkage failure. Both emigration to the U.S. and short-distance migration are known to have been very high during the period; they are probably the major component of the residual, but other evidence indicates that the limits of the nineteenth-century censuses and of the method are substantial. Previous studies suggest we might set the limit of census underenumeration at 10 to 12 percent in any census year, with the highest rate for 1861 (Knights, 1971; Stephenson et al., 1978). Considering the likelihood of significant overlap in those subject to underenumeration, a combined rate for both years might be 15 to 18 percent. Estimating the errors and omissions of linkage adds another approximately 10 percent (see the differences between rates for different methods of tracing migrants in Katz, Doucet and Stern, 1982: ch. 3). For both years, then, the combined rate of underenumeration and linkage failure could be as high as 25 to 28 percent of the total; set against an unlinked residual of about 45 percent, that leaves 17 to 20 percent of the loss to migration.

REFERENCES

Darroch, G. and M. Ornstein. "Family Coresidence in Canada in 1871: Family Life-cycles, Occupation and Networks of Mutual Aid". Canadian Historical Association, Historical Papers, 1983:30-55.

Darroch, G. and M. Ornstein. "Family & Household in Nineteenth Century Canada: Regional Patterns and Regional Economies". Journal of Family History, (Summer, 1984):158-177.

Darroch, G. and M. Ornstein. "Ethnicity and Occupational Structure in Canada in 1871: The Vertical Mosaic in Historical Perspective". Canadian Historical Review, (September, 1980):305-333.

Hochstadt, S. "Urban Migration in Imperial Germany". Paper presented to the Canadian Historical Association, Winnipeg, June 9, 1986.

Katz, M.; M. Doucet and M. Stern. The Social Organization of Early Industrial Capitalism. Boston: 1982.

Kertzer, D. and D. Hogan. "On the Move: Migration in an Italian Community, 1865-1921". Social Science History, (Winter, 1985):1-24.

Knights, P.R. The Plain People of Boston 1830-1860: A Study of City Growth. New York: 1971.

Social Science History, Special Issue, Spring, 1986.

Stephenson, C. et al. Social Predictors of American Mobility: A Census Capture-Recapture Study of New York & Wisconsin, 1875-1905. Newberry Library, Chicago: 1978.

Thernstrom, S. Poverty and Progress: Social Mobility in a Nineteenth Century City. New York: 1969.

Coding and Data Processing for the Feasibility Study: Canadian Historical Mobility Project.

by:

Gordon Darroch
Department of Sociology
York University

and

Michael D. Ornstein
Institute for Behavioural Research and Department of Sociology
York University

August, 1977. Revised, 1980, 1984, 1994, 1998.

Appendix E: Part 1

Feasibility Study: Coding and Data Processing

Introduction.

Our project was faced from the start with the need to create very large machine-readable files of data transcribed from the microfilms of the Canadian censuses of 1861 and 1871. An examination of published work on nineteenth-century census-type data provides some, but not a great deal of, guidance as to how to proceed. Only a very few projects, notably the Philadelphia Social History Project, have had experience with data files of the magnitude of those we propose to collect or, for that matter, were involved in the feasibility project. Acquiring experience with large historical data files was one of the reasons we designed the feasibility study to entail the collection and management of data which was much more extensive and diverse than that required only to test the proposed "letter sampling" strategy.

Previous research did make it clear, however, that two particularly serious analytic difficulties had to be avoided. The first arises when early decisions about coding procedures result in some variables of interest simply being omitted. The second arises when a variable is created with a smaller number of categories than turns out, in later analysis, to be necessary to capture fully the historically significant variation in the variable.

Both problems initially may seem obvious ones to avoid. Yet those who have coded large amounts of data, and especially historical data, will recognize the considerable temptation to simplify coding procedures by ignoring some seemingly unimportant information, say the size of dwellings as recorded on censuses, or by collapsing categories of a variable with a great many legitimate categories, such as religion or occupation. There were enormous numbers of distinctly named protestant churches and sects and of distinct occupations in nineteenth-century North America. For a coding task of even moderate size, the additional cost of returning to the original source of the data to rectify omissions or errors is usually prohibitive. The coded data thus come to impose unnecessary constraints on the analysis itself.

In the light of these considerations we adopted the following principle in all phases of the data processing on this project: in the original coding of manuscript data (microfilm images) all the variables describing an individual are coded. Each variable is also to be transcribed exactly as it appears on the original document or coded in such a manner that exact original values are recoverable. Finally, the subsequent data processing of the records must always assign a unique value to each unique category of every variable.
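As an illustration of this principle only, the following Python sketch assigns a distinct numeric code to every distinct transcribed value and shows that the original wording can be recovered from the codes. The function names and the sample denominations are assumptions for the example, not the project's actual coding scheme.

```python
def build_codebook(values):
    """Give each distinct transcribed value its own code, in order of first appearance."""
    codebook = {}
    for value in values:
        if value not in codebook:
            codebook[value] = len(codebook) + 1
    return codebook


def encode(values, codebook):
    """Replace each transcribed value by its numeric code."""
    return [codebook[v] for v in values]


def decode(codes, codebook):
    """Recover the exact original values from the numeric codes."""
    labels = {code: value for value, code in codebook.items()}
    return [labels[c] for c in codes]


# Invented transcriptions; spelling variants are deliberately kept distinct.
transcribed = ["Wesleyan Methodist", "W. Methodist",
               "Church of England", "Wesleyan Methodist"]
codebook = build_codebook(transcribed)
codes = encode(transcribed, codebook)
```

Any collapsing of categories (for example, treating "W. Methodist" as "Wesleyan Methodist") can then be done later as a separate recode, leaving the detailed codes intact.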

It should be noted that this detailed and complete method of coding facilitates a full exploratory analysis of the data using all possible variables. In addition, it maximizes the value of the data to other researchers who may employ it for any secondary analysis which the original documents themselves permitted. Even if we had begun with a focused analytic purpose which, for example, did not require any information regarding religious affiliation or detailed occupations, the decision not to code these variables fully would obviously place severe limits on any future secondary analysis of the data file while entailing only a relatively minor initial cost saving.

A second major consideration in the collection and processing of nominal manuscript data involves the units of analysis. The data should be in a form which makes it possible to use each of the following as units of analysis:

a. the individuals listed on the manuscript source, for example, to permit examining relationships such as that between a person's religion and his or her occupation;

b. the complete households, for example, to permit examining relationships such as that between the religion of the head of the household and the size of the household; and

c. the individuals listed, but in "contextual analyses" in which the context is given by the characteristics of the households as a whole or by the characteristics of other individuals in the household. For example, in the first case, it should be possible to examine school attendance of children as a function of the size of household in which they live; in the second case, to examine school attendance of children in relation to the occupation of their fathers.

In order to make this possible, households must be coded in their entirety. The important but perhaps not immediately obvious implication is that any sample from manuscript censuses must be a sample of households, not only of individuals. The formation of household composite variables, of course, again poses the requirement that all the data on each individual in every sampled household be coded.
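A minimal Python sketch of how household composite variables might be formed from fully coded individual records, and then attached back to each individual for contextual analysis, is given below. The field names (including the `is_head` flag), the variable names and the records are hypothetical and chosen only for the example.

```python
from collections import defaultdict

# Invented individual-level records; every member of each household is present.
people = [
    {"household": 1, "person": 1, "is_head": True,
     "religion": "Presbyterian", "age": 41, "at_school": False},
    {"household": 1, "person": 2, "is_head": False,
     "religion": "Presbyterian", "age": 38, "at_school": False},
    {"household": 1, "person": 3, "is_head": False,
     "religion": "Presbyterian", "age": 9, "at_school": True},
    {"household": 2, "person": 1, "is_head": True,
     "religion": "Roman Catholic", "age": 60, "at_school": False},
]


def household_variables(people):
    """Household as unit of analysis: one composite record per household."""
    members = defaultdict(list)
    for p in people:
        members[p["household"]].append(p)
    summary = {}
    for hh, group in members.items():
        head = next((p for p in group if p["is_head"]), None)
        summary[hh] = {
            "hh_size": len(group),
            "hh_head_religion": head["religion"] if head else None,
        }
    return summary


def attach_household_context(people, summary):
    """Contextual analysis: copy the household variables onto each individual."""
    return [{**p, **summary[p["household"]]} for p in people]


summary = household_variables(people)
individuals = attach_household_context(people, summary)
```

With the context attached, a question such as school attendance by household size becomes an ordinary individual-level tabulation. Note that `hh_size` is only correct because no household member was left uncoded.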

The full range of contextual variables which could be of interest in analyzing these data will not be apparent until the analysis is underway. For example, an examination of school attendance might lead one to relate this variable to the occupation of an individual's eldest brother--but it is hard to anticipate this beforehand. Other variables, like "household size", have been frequently employed in the published research and it makes sense to create them at the start. In this study, a set of variables describing each household is attached to each individual in the household. In addition, a set of "pointers" is created which will allow new contextual variables to be created easily when they are needed. For example, one of these pointers is equal to the person number (i.e., position in the sequence of individuals in the household) of each individual's father. Part 2 contains a list of all these variables describing the household structure.
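The pointer idea can be sketched in Python as follows. This is an illustration under stated assumptions rather than the project's file layout: the field names, the convention that 0 means "no father in this household", and the records themselves are all invented for the example.

```python
NO_POINTER = 0  # assumed convention: 0 means no such person in the household

# One invented household; "father" holds the person number of the father.
household = [
    {"person": 1, "sex": "M", "age": 45, "occupation": "farmer",
     "father": 0, "at_school": False},
    {"person": 2, "sex": "F", "age": 40, "occupation": None,
     "father": 0, "at_school": False},
    {"person": 3, "sex": "M", "age": 17, "occupation": "labourer",
     "father": 1, "at_school": False},
    {"person": 4, "sex": "F", "age": 10, "occupation": None,
     "father": 1, "at_school": True},
]


def follow_pointer(household, person, pointer, field):
    """Return `field` of the household member that `pointer` refers to."""
    by_number = {m["person"]: m for m in household}
    target = by_number.get(person[pointer])
    return target[field] if target is not None else None


def eldest_brother(household, person):
    """A contextual variable built later: oldest male sharing the same father."""
    if person["father"] == NO_POINTER:
        return None
    brothers = [m for m in household
                if m["father"] == person["father"]
                and m["person"] != person["person"]
                and m["sex"] == "M"]
    return max(brothers, key=lambda m: m["age"], default=None)


child = household[3]
father_occupation = follow_pointer(household, child, "father", "occupation")
brother = eldest_brother(household, child)
```

The point of storing pointers rather than finished variables is that a relationship nobody anticipated, such as the eldest brother's occupation, can still be derived without returning to the microfilm.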

Coding Procedures

Considerations of cost led us to employ a coding procedure for the nominal census data which involves two steps common to coding in projects such as this one. First, the census microfilm data are transcribed onto coding forms designed for the purpose and, secondly, the data are keypunched from the coding forms. It is possible to combine the two steps into one, either by coding the census directly onto a computer terminal using remote data entry, or by keypunching directly from manuscript records to computer cards. Remote data entry would allow for immediate error correction, a considerable advantage. Direct keypunching, of course, saves the time, cost and possible error involved in a two-step procedure.

At the time our coding was carried out, the facilities available

through the York University Computing Centre all but precluded our using direct data entry. Direct keypunching would require coders with keypunching skills. However, it is likely that we would have chosen the two-step procedure in any case at this pilot study stage. The procedure is suited to gaining firsthand experience with the difficulties faced in reading and accurately transcribing microfilm records. The Canadian census data, available in 1975 at least, presented the problems of near illegibility of some of the original manuscript records and of relatively poor quality of the microfilming itself.

Of course, the two-step procedure requires that the keypunched data be checked and verified for illegal codes, inconsistencies and the like. In our case we opted both to verify the keypunching and to scrutinize the data for coding errors by using a computer programme in batch mode. This meant that once errors were detected or suspected, we were required to return to the microfilm to locate the incorrectly coded record. The correction of the files has proved to be a time-consuming and tedious process which the principal investigators undertook almost entirely themselves. Two advantages to this verifying-checking procedure became evident. First, in a project the size of the feasibility study, much less the size of the proposed study, principal investigators simply cannot undertake much of the original coding, though close supervision is essential. We have found, however, that familiarity with all the problems of coding, and their inevitable implications for data analysis, has been assured by primarily undertaking the verification and error checking ourselves. As we shall shortly describe, the computer programme written for this purpose requires that every conceivable error and ambiguity which can be detected in the punched card file be examined and corrected. In effect the procedure has required us to review in detail the entire machine-readable file. Of course, only some kinds of coding errors can be detected in this way. Hence, we also undertook as part of the data collection the special "error detection" sample, amounting to a complete replication of the coding from the microfilms of a ten percent sample of the originally sampled households.

The second advantage of the procedure adopted has simply been our assurance that we have a very clean and consistent historical data file. In fact, as described below, the computer programme employed in checking the data has given us a unique record of the original errors and ambiguities in the file and the kinds of corrections made.

The Coding Forms & Instructions

Our coding forms have been designed to closely resemble original printed forms used for the nineteenth-century censuses (see figures 1, 2, and 3).

The two coding forms included below in the text were the only forms employed for the collection of personal and household information. Note: Figures 2 and 3 refer to a pilot study for two Ontario counties, Essex and Kent. These data have not been made available as an electronic file. Only card files were created.

[Coding form: Mobility Study 1861, Essex-Kent: Basic Household Information. Column headings include Last Name and Profession-Trade-Occupation.]

[Coding form: Personal Census for 1851. Canadian Historical Mobility Project: Essex-Kent Study.]

The variables are transcribed in the precise order in which they appear in the original and, where it is necessary to code something less than the complete original entry, mnemonic codes with letters are used. All the data on an individual are coded in a total of eighty characters, for ease at the keypunching stage. Our main objective, especially in using mnemonic codes, was to minimize error at the coding stage, because of the high cost of correcting errors. The pursuit of this objective is not without its costs, for the data can then only be analysed after considerable transformation.

Primarily, we must substitute a unique numeric (rather than alphabetic) code to represent each possible value of each variable for the purposes of analysis. The transformation of alphabetic into numeric codes is accomplished by a computer programme also designed specifically for this project. The alternative to our procedure is the conventional one of recording the original data as numeric codes when it is first read from the microfilm. But the conventional procedure is both slower and more prone to error than our mnemonic coding method. Consider, for example, that if two digits are used to represent a religion variable, approximately sixty of the one hundred possible two-digit combinations would be required. The numbers 43, 67, and 10 might represent Wesleyan Methodists, Adventists, and Roman Catholics respectively, instead of our codes WM, AD, and RC.

The mnemonic codes are superior in two respects: they are much easier for coders to recall and so they should increase efficiency and result in fewer coding errors; and, if an error is made, the resulting error is more likely to be detected in the data checking procedure. It may be noted that there are 26 x 26 or 676 valid two-character mnemonic codes, of which sixty are required. Thus in most cases a coding error will take the form of an invalid and hence correctable code. If by mistake an Adventist is coded AT rather than AD, the error is simply more likely to be spotted than an error in numeric codes.
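The arithmetic behind this argument can be made concrete. The sketch below counts the possible two-letter codes and estimates what share of miscodings would land on an invalid (and therefore detectable) code. It rests on a simplifying assumption that is not in the text: a miscoded value is treated as equally likely to be any of the other possible codes. The set of valid codes is a three-item stand-in built from the examples in the text, not the project's full list of about sixty.

```python
import string

# Stand-in for the full list of religion codes; only WM, AD and RC are
# named in the text.
VALID_RELIGION_CODES = {"WM", "AD", "RC"}


def is_valid_code(code, valid_codes=VALID_RELIGION_CODES):
    """A code passes the check only if it is one of the assigned codes."""
    return code in valid_codes


def detectable_share(n_valid, n_possible):
    """Share of miscodings that produce an invalid code, assuming the
    wrong value is drawn uniformly from the other possible codes."""
    return (n_possible - n_valid) / (n_possible - 1)


n_two_letter = len(string.ascii_uppercase) ** 2  # 26 x 26
n_two_digit = 10 ** 2                            # 00 through 99

letter_share = detectable_share(60, n_two_letter)
digit_share = detectable_share(60, n_two_digit)
```

Under that assumption roughly nine in ten letter miscodings are caught by a validity check, against about four in ten for two-digit numeric codes. Real coding errors are not uniform, so the figures only indicate the direction of the advantage.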

Two sets of two-character mnemonic codes were developed, one for religion and one for place of birth. The place of birth codes were also used to code the 'nation of origin' variable in the 1871 census, so, for example, the code for England (a place of birth) was also used for English (reported as a nation of origin). The major difficulty which arose from the adoption of this procedure was that certain religions and places of birth occurred so infrequently that it made no sense to create codes for them, yet we were committed to preserving the exact content of the original manuscripts. The solution was to use a special code, a blank followed by an asterisk, whenever a mention of a religion, place of birth or nation of origin occurred which had not previously been assigned a mnemonic code by us.

At the same time an additional coding form, called the "long form," was filled out which was key-punched and its contents merged by computer with the individual record file (see figures 1a, 1b and 1c).

The 1871 long form has three fields: a location code which identifies the individual to whom the data referred, a code specifying the variable in question ('R' for religion, 'B' for place of birth, 'N' for nation of origin), and the exact mention or name for the variable as it is written on the census manuscript. For example, "West Indies" was a very uncommon place of birth requiring a "blank, *" code and a long name form.
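The merge of long-form entries into the individual records might look like the following sketch. The variable codes 'R', 'B' and 'N' and the "blank, *" special code come from the text; the dictionary layout, the field names and the location string are hypothetical, and the project's own merge programme is not reproduced here.

```python
# Long-form variable codes from the text, mapped to hypothetical field names.
VARIABLE_FIELDS = {"R": "religion", "B": "birthplace", "N": "origin"}
SPECIAL_CODE = " *"  # a blank followed by an asterisk


def merge_long_forms(records, long_forms):
    """Attach each long-form mention to the individual it refers to.

    records: dict mapping a location code to a person's record (a dict).
    long_forms: iterable of (location, variable_code, mention) triples.
    Returns new records; a '<field>_text' entry holds the exact mention.
    """
    merged = {location: dict(record) for location, record in records.items()}
    for location, variable_code, mention in long_forms:
        field = VARIABLE_FIELDS[variable_code]
        record = merged[location]
        if record[field] != SPECIAL_CODE:
            raise ValueError(f"{location}: {field} does not carry the special code")
        record[field + "_text"] = mention
    return merged


records = {
    "HYPOTHETICAL-LOCATION-01": {"religion": "RC", "birthplace": " *", "origin": "EN"},
}
long_forms = [("HYPOTHETICAL-LOCATION-01", "B", "WEST INDIES")]
merged = merge_long_forms(records, long_forms)
```

Raising an error when a long form points at a field without the special code is a design choice of this sketch; it surfaces mismatched forms instead of overwriting a valid mnemonic code.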

The first and last name of each individual and his or her occupation were transcribed directly onto the coding form in the form in which they appeared on the manuscript, for no predetermined coding scheme could preserve the original content in full. Because of the fixed field coding scheme, a procedure was required to deal with cases where the name or occupation exceeded the length of the field (sixteen characters for each name and for the occupation) on the coding form. Here again, the "long form" was employed: an asterisk was placed in the last space allocated to the variable in question on the individual coding form and a long form was filled out containing only the uncoded part of the person's name or occupation.

[Coding form: Long Name Form, 1871 Census of Canada. Fields include Last Name; the footer carries Coder's Initials, Date and Page Number.]

[Coding form (title partly illegible): "... and Manufacturing Census 1851." Canadian Historical Mobility Project: Essex-Kent Study.]

The 1861 long form used the same three fields in altered format but using the same variable codes as in 1871. The form was also used for a secondary purpose. In 1861, as in 1851, the personal census schedule included information on household production and business and manufacturing information for those members of households who were owners and operators.

Since this was occasional information which could be taken from the microfilm reels we provided for its direct coding. At the time of coding, comparable information for 1871 had not been microfilmed and provided on the same reels as the personal and household data.

Four of the variables which were coded, the surname, religion, place of birth, and nation of origin, were quite often identical for each person in a sequence of individuals listed within a household. So that the coders would not be required to tediously copy out these variables for each person in a sequence, blank fields in the keypunched data were automatically given the same values as the corresponding field for the previously coded person. Of course, the first person in each household must have all his or her variables coded.
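The blank-field rule amounts to carrying a value forward from the previously coded person. A minimal sketch follows, assuming each person is a dictionary with the four carry-forward fields; the field names and the birthplace codes `SC`, `IR` and `ON` are placeholders for illustration, not codes taken from the project's list.

```python
# The four variables that may be left blank after their first occurrence.
CARRY_FIELDS = ("surname", "religion", "birthplace", "origin")


def fill_repeated_fields(household):
    """Give each blank carry-forward field the value coded for the
    previous person in the same household."""
    filled = []
    previous = None
    for person in household:
        current = dict(person)
        for field in CARRY_FIELDS:
            if current[field].strip() == "":
                if previous is None:
                    raise ValueError(
                        "the first person in a household must have every field coded"
                    )
                current[field] = previous[field]
        filled.append(current)
        previous = current  # later blanks copy the already-filled values
    return filled


household = [
    {"surname": "MCLEOD", "religion": "CS", "birthplace": "SC", "origin": "SC"},
    {"surname": "", "religion": "", "birthplace": "IR", "origin": "IR"},
    {"surname": "", "religion": "", "birthplace": "ON", "origin": ""},
    {"surname": "", "religion": "", "birthplace": "", "origin": ""},
]
filled = fill_repeated_fields(household)
```

Because each person copies from the already-filled previous record, a run of blanks of any length resolves to the last value actually written by the coder.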

Coding

The coding of the data for the sample of households from the 1871 census manuscripts was done through the facilities of the Institute of

Behavioural Research. Four Bell and Howell microfilm machines were rented for a period of six months. Of several machines examined, these provided the best reproductions of the microfilms.

Under contract, but in continuous contact with the principal investigators, coders were selected and trained. In the end twelve coders were employed, working four at a time for approximately four hour periods.

They were supervised by a full-time, experienced member of the staff of the

Institute. The coding and keypunching of the entire file of households consisting of individual records required eight months and cost $12,426. Thus, the average cost per household was about $1.24 or 20 cents per individual record.

In addition to the main file, we collected data for a trial "letter sample" of households for Essex and Kent counties, Ontario for 1851, 1861 and

1871. This coding was subsidized for the most part by York University in providing funds for us to employ, part-time, six graduate student research assistants. Mainly they were employed in coding the data from microfilm.

Two assisted in correcting the files. Two students are currently using the data procedures and computer programmes of the project for their own research. Pages 17 to 34, which follow, are copies of the coding instructions for the national sample from the 1871 census and for the coding of the special letter sample of the 1861 census for Essex and Kent counties.

For coding the 1871 letter sample for Essex and Kent counties, we employed the coding forms and slightly revised instructions used for the national sample for that year. Coding of the 1851 census for the letter sample for

Essex and Kent was not initially planned, but we have been able to complete part of this work to date. The instructions and coding forms used in 1851 are only slightly altered versions of those employed for 1861.

CANADIAN HISTORICAL MOBILITY PROJECT CODING INSTRUCTIONS: PILOT PROJECT: NATIONAL STUDY NOMINAL CENSUS RETURNS ON MICROFILM, 1871

The objective of this coding is to transcribe selected cases of households from the 1871 census returns, which are on microfilm, to coding forms for key-punching. You will be coding information on all individuals in all families in selected households.

The only information to be coded is that for the households given to you on a "list of sample households." This list provides you with the information to locate the households in the 1871 microfilm.

The appropriate microfilm reels for the District (County) you are coding will be given to you. The "list of sample households" gives district and subdistrict numbers (a number and a letter: 1A, 1B, 2A, 2B ... 200A, 200B ...). This refers to districts and subdistricts listed on the top and right-hand side of each microfilm frame. You should locate the appropriate microfilm frame by this number and check that it is correct by looking at the district names.

The "list of selected households" provides the following additional information for locating the households to be coded.

E.A. Numbers: These refer to the divisions within subdistricts; they are numbered on the microfilm at the top right-hand side (and called "Divisions" there). Sometimes the selected households are all in the same division (E.A.), sometimes in different ones.

NOTE: It is essential that you locate the household in the appropriate division (E.A.). This number will be coded, see coding instructions for column 5 below.

Special sample numbers (spec. samp. # ), to be coded, see columns 6-9 below.

Household numbers (H.Hold. = ): These are the selected cases within the Divisions (E.A.'s) for which all information given on the microfilm for all individuals in the household is to be transcribed on the coding sheets. The household number will be coded, see columns 10-12 below.

The "list of sample households" provides an additional instruction as to which households should be coded. Immediately after the household number one of several instructions is listed. (1) "Code any case" means simply to code the household indicated whatever ethnicity or nationality (not birthplace) it may be. (2) "French only" means you only code the case if the "Nation of origin" (called ethnic origin on microfilm) of the head of household (first person listed) is French. (3) "German only" and (4) "Non-French only" are the other instructions which will appear, in which case you only code the household to which the selected household number refers if the head of household is of German ethnic origin or of non-French ethnic origin, respectively. The "Non-French," "German" and "French only" instructions refer to ethnic origin or nation of origin (these refer to the same column on the microfilm); they do not refer to "Birthplace," which appears first, i.e., two columns to the left of "origins."
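The four selection instructions reduce to a small decision rule on the head of household's nation of origin. The sketch below is one reading of that rule; the function name and the use of plain words such as "French" for the origin value are assumptions made for illustration, and "Non-French only" is interpreted here as "any origin other than French."

```python
def should_code(instruction, head_origin):
    """Decide whether a listed household is coded, given the instruction
    on the list of sample households and the nation of origin (not the
    birthplace) of the first person listed in the household."""
    rule = instruction.strip().lower()
    if rule == "code any case":
        return True
    if rule == "french only":
        return head_origin == "French"
    if rule == "german only":
        return head_origin == "German"
    if rule == "non-french only":
        return head_origin != "French"
    raise ValueError(f"unknown instruction: {instruction!r}")


decisions = [
    should_code("code any case", "Irish"),
    should_code("French only", "Irish"),
    should_code("German only", "German"),
    should_code("Non-French only", "French"),
]
```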

GENERAL INSTRUCTIONS

1. Always code with a sharpened pencil, use an eraser.

2. Exercise caution in coding: accuracy counts more than speed.

PLACEMENT OF INDIVIDUALS ON CODING SHEETS:

- Code one person per line, skipping no lines.

- Try to avoid splitting households on more than one sheet by seeing that the number of lines left is greater than or equal to the number of individuals in the new household to be coded.

- For large households which must continue on more than one sheet, or where you have miscalculated and not left enough space, code the family, household, and the person numbers at the top of the next sheet.

- Leave lines blank which will not contain a full household on a sheet, and go to the next sheet.

- You will probably average two households per code sheet.

3. If parts of a name (last, first, initials or occupation) are illegible, and quite indecipherable, then

A. code exactly those letters which can be made out -

B. place dashes (--) in appropriate columns for the illegible letters - one dash for each letter - if the whole name is illegible, then leave the whole space blank on the code form.

C. also place a question mark (?) in the last column of the entry - e.g. for last names, in col. 31; first names, col. 47; birthplace, col. 53; religion, col. 55; nation of origin, col. 57; occupation, col. 73.

NOTES: If name or occupation is partly illegible and also too long for the columns on the code sheet put a (+) sign in the last col. of the entry instead of the question mark - then complete the name on the "long name code sheets" (also see coding instructions below for the specific items).

It is very important that the first letter of the last name is correctly coded. If this letter is illegible then an entirely new household is to be substituted and coded. The substitute household is given on the list of sample households directly below the normal listing within each section. You code the first household listed if you have to substitute. There are a variable number of substitute cases depending on the number of cases required.

Substitution will only be occasional; it probably means you must erase the original codes for location, household and special sample no. and substitute the new codes - then complete the substituted case as it is found on microfilm.

If for the substitute case the first letter of the last name is again illegible, another substitution will be necessary. The next case listed in order in the substitute list must be taken.

If all substitute cases have an illegible first letter, then check with your supervisor.

When you have substituted, put beside the one taken from the list of substitute households.

More than one substitution is made only if the original case was being coded under the instruction "code any case." If the original case was "French only," "Non-French only," or "German only," then the substitute case must also be one of these. If the original, illegible case did qualify under these 3 conditions but the substitute does not, then do not code a household, go to the next section of the "list of selected households" and proceed as usual.

You may have to search the microfilm to locate a substitute, but it is always located near by the location of the original - perhaps in another division (E.A.), but always in the same general area.

4. All information coded should be PRINTED IN BLOCK CAPITALS. The letter "I" should have this form "I"; the number one should be written "I." The letter "L" is coded "L," not "1."

5. It is easy to miss the codes for married or widowed, school (attendance), able to read, write, etc., (i.e. col. 74-80), since this information is located at the extreme right-hand side of the microfilm frame.

6. Make certain that you are copying the precise spelling of the household nos. and names and occupations. They are given on the microfilm. Read them carefully first, and copy them letter by letter - the original enumerators made errors, copy the errors; spellings in the nineteenth century differed in many respects from current spellings; copy the nineteenth-century version.

7. Occasionally first names are listed in the last name column and vice versa. Read names before copying. If it is obvious that the error has been made, code the names in the correct cols. If you are unsure, copy microfilm directly.

8. All information must be coded precisely in the columns marked on the code sheet. Last names, first names, and occupations are coded by beginning with the first column from the left (col. 16, col. 32, col. 58). Other codes which have more than one col. (location codes, household no., person no., age) are right-justified and leading zeros are to be filled in, e.g. age 1-9 is coded 001-009, 10-99 is coded 010-099 and 100 and over is coded 100-XXX in cols. 49-51.

Birthplace, religion and nation of origin are always two (2) letter codes.

9. REPETITION OF CODES

a. For the first member of the household, all variables must be filled in regardless of whether they are the same as the values for the last member of the previous household.

b. For individuals within a household some codes may be left blank - they will be automatically duplicated in punching. The only variables for which blanks can be left are last names (cols. 16-31); birthplace (cols. 52-53); religion (cols. 54-55); nation of origin (ethnic origin) (cols. 56-57).

The code must be given for the first individual listed on microfilm with a particular last name, birthplace, religion or ethnic origin. For individuals immediately following in the list who have the same code, the code should be left blank.

e.g. If the head of the household is born in Scotland, the wife in Ireland and all the children in Ontario, then for birthplace we must code the first three birthplaces - the second and later children will automatically be given the birthplace Ontario, if it is not coded.

10. Try to make out letters that initially appear illegible. In trying to make out the writing on the microfilms, try to familiarize yourself with the handwriting of the enumerator by looking around the microfilm frame on which a selected case is located. You may be able to recognize the differences between, say, M's and W's, S's and T's, etc. Each new enumerator, of course, had a different script. But, if you are not certain, follow the question mark (?) codes for illegible cases as indicated on these instructions.
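The layout rules in instructions 3 and 8 above can be expressed as three small helpers: names and occupations start at the left of their field, numeric codes are right-justified with leading zeros, and a partly illegible entry takes a dash per unread letter and a question mark in the last column. The function names are invented for this sketch, and the handling of entries that are both illegible and too long (the "+" rule) is deliberately left out.

```python
def left_justified(text, width):
    """Names and occupations: begin in the first column of the field."""
    if len(text) > width:
        raise ValueError("entry too long; it would continue on the long name form")
    return text.ljust(width)


def right_justified(number, width):
    """Location codes, household no., person no., age: leading zeros."""
    digits = str(number).zfill(width)
    if len(digits) > width:
        raise ValueError("number does not fit the field")
    return digits


def partly_illegible(letters, width):
    """letters: sequence of characters, with None for each letter that
    cannot be read. One dash per illegible letter, '?' in the last column."""
    if len(letters) >= width:
        raise ValueError("no room for the question mark in this field")
    body = "".join("-" if ch is None else ch for ch in letters)
    return body.ljust(width - 1) + "?"


age_field = right_justified(7, 3)
name_field = left_justified("SMITH", 16)
doubtful_field = partly_illegible(["S", "M", None, "T", "H"], 16)
```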

SPECIFIC CODING INSTRUCTIONS

Columns 1-12: are coded directly from the "list of selected households."

Columns 1-3: District number, starting with district No. 1 and numbered consecutively. Code is right-justified in cols. 1-3. This number is given as the numerical digits on the "list of sample households." Labelled, SUBDIS = 1A, 1B, 2A, 2B, etc.

Column 4: Subdistrict code within Districts. Given as the alphabetic code on the "list of sample households." Labelled, SUBDIS = 1A, 1B, 2A, 2B, etc.

Column 5: E.A. or DIVISION number within subdistricts. These numbers are also given on "list of sample households" as E.A. = X for each household selected. They range from 1 to 9 only.

Columns 6-9: Special Sample Number, given on "list of sample households" as SPEC SAMP # = , one for each household listed. Code is right-justified in cols. 6-9.

Columns 10-12: Household number, given on "list of selected households," as HHOLD = . Also right-justified in cols. 10-12.
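Taken together, columns 1-12 form a fixed-width location key. The sketch below formats and parses such a key using the column positions given above; the `Location` type, its field names and the sample values are assumptions made for illustration.

```python
from typing import NamedTuple


class Location(NamedTuple):
    district: int        # cols. 1-3, right-justified
    subdistrict: str     # col. 4, alphabetic
    division: int        # col. 5, E.A. number, 1 to 9
    special_sample: int  # cols. 6-9, right-justified
    household: int       # cols. 10-12, right-justified


def format_location(loc):
    """Build the 12-character key for columns 1-12."""
    if not 1 <= loc.division <= 9:
        raise ValueError("E.A. (division) numbers range from 1 to 9 only")
    return (
        f"{loc.district:03d}{loc.subdistrict}{loc.division}"
        f"{loc.special_sample:04d}{loc.household:03d}"
    )


def parse_location(card):
    """Read columns 1-12 from the start of a keypunched record."""
    key = card[:12]
    return Location(
        district=int(key[0:3]),
        subdistrict=key[3],
        division=int(key[4]),
        special_sample=int(key[5:9]),
        household=int(key[9:12]),
    )


example = Location(district=2, subdistrict="A", division=3,
                   special_sample=17, household=45)
key = format_location(example)
```

Here `key` is twelve characters long, and `parse_location(key)` returns the original values, which gives a simple round-trip check on a keypunched record.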

The following information is coded from the microfilm:

Column 13: Family number: some households have more than one resident family. On the 1871 microfilms separate nos. are given for families (column 6 of the microfilm). DO NOT code these. Code the families consecutively within each household, beginning with 1, 2, 3 . . . Leave blank if only one family is listed in a household.

Columns 14-15: Person number: The coder assigns each individual within each family a separate number, consecutively 01, 02, 03 ...

NOTE: some households will in fact be boarding houses or hotels. One or more families may be listed as well as any number of unattached individuals.

Code a family no. (col. 13) for each one given on microfilm listing.

Columns 16-31: Last or family name; from microfilm (manuscript col. 7), begin at left hand, col. 16. Fill in coding form until next to last column - if name will exceed space (16 spaces max.), then fill last column (col. 31) with an asterisk (*) and go to the long name coding form (#2). Provide the complete name there. Blanks are left for all persons with same last name after the first is coded.

NOTE: Make certain that all the different last names are filled, e.g. the last name is coded for all non-family members of the household (usually listed last). See no. 3 above for instructions regarding illegible letters.

Columns 32-47: First names; initials: Begin at left hand, col. 32, code as given (col. 7 on microfilm).

Leave one blank column between names and initials.

As above, if names and initials exceed space (16 spaces) fill in last column (col. 47) with asterisk (*) and code the complete name on long name form (#2).

See no. 3, above, for instructions regarding illegible letters.

Column 48: Sex: code M or F from microfilm.

If sex is missing put a question mark (?).

Columns 49-51: Age as given on microfilm. Code is right-justified.

Columns 52-53: Birthplace

NOTES: All single names use the first two letters of the birthplace as a code. Those consisting of two or three words use the first letters of the first two names. There are a few exceptions - they have asterisks beside them in the following code listing.

For a few birthplaces listed on the microfilm, there will not be a code here. Then place an asterisk in col. 53 and code them on the long name code form.

Where a birthplace is not legible on microfilm, put a question mark (?) in column 53 and leave column 52 blank.

Many birthplaces are abbreviated on the microfilm - some abbreviations are given beside the name in the following list.
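The default rule for forming a birthplace code can be written as a short function. This is only a sketch of the general rule stated in the notes above: the asterisked exceptions in the code listing would need a separate lookup table, which is not attempted here, and the function name is invented for the example.

```python
def default_mnemonic(name):
    """Two-letter code by the general rule: the first two letters of a
    single-word name, or the first letters of the first two words of a
    longer name. Asterisked exceptions in the listing are not handled."""
    words = name.upper().split()
    if not words:
        raise ValueError("empty name")
    if len(words) == 1:
        return words[0][:2]
    return words[0][0] + words[1][0]


codes = [default_mnemonic(place) for place in
         ("England", "Nova Scotia", "Prince Edward Island")]
```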

PLACE OF BIRTH CODES INCLUDING ALL 1861 & 1871 MENTIONS

AFRICA ...... AF
AT SEA ...... SE
AUSTRALIA ...... AS
AUSTRIA ...... AU
BAVARIA ...... BA
BELGIUM ...... BE
BRITISH COLUMBIA ...... BC
CANADA ...... [illegible]
CANADA EAST ...... [illegible]
CANADA WEST ...... [illegible]
CANADA [illegible] ...... [illegible]
[illegible] ...... [illegible]
DENMARK ...... [illegible]
EAST INDIA ...... [illegible]
ENGLAND ...... EN
FRANCE ...... FR
GERMANY ...... GE
GREEK ...... GR
HOLLAND ...... [illegible]
ILLEGIBLE ...... [illegible]
[entries illegible]
NATIVE [illegible] ...... [illegible]
NATIVE [illegible] ...... [illegible]
NEW BRUNSWICK ...... [illegible]
NEW BRUNSWICK ...... [illegible]
NEWFOUNDLAND ...... [illegible]
NORTHWEST ...... [illegible]
NORWAY ...... [illegible]
NOT GIVEN ...... NG
NOVA SCOTIA ...... NS
ONTARIO ...... [illegible]
ONTARIO ...... [illegible]
ONTARIO ...... UC
ONTARIO ...... UC
POLAND ...... [illegible]
PORTUGAL ...... [illegible]
PRINCE EDWARD ISLAND ...... PE
PRUSSIA ...... [illegible]
QUEBEC ...... [illegible]
QUEBEC ...... [illegible]
QUEBEC ...... [illegible]
RUSSIA ...... [illegible]
SCANDINAVIA ...... [illegible]
SPAIN ...... SP
SWEDEN ...... [illegible]
SWITZERLAND ...... [illegible]
TERRITORIES ...... [illegible]
UNITED STATES ...... US
UPPER CANADA ...... [illegible]
UPPER CANADA ...... UC
UPPER CANADA ...... UC
WALES ...... [illegible]
WEST INDIES ...... [illegible]

Columns 54-55: Religion

NOTE I: The religion codes are formed in the same way as birthplace: exceptions have an asterisk.

NOTE II: As for birthplaces, if no code is given for a religion, place * in col. 55 and code on long name form.

NOTE III: As for birthplace, if illegible, put question mark (?) in col. 55 and leave col. 54 blank.

NOTE IV: Religions may also be abbreviated, some are listed below.

(See codes for religion next page)

RELIGIOUS CODES INCLUDING ALL 1861 & 1871 MENTIONS

ADVENTIST ...... AD
AMERICAN ASSOCIATION BAPT ...... AA
AMERICAN PRESBYTERIAN ...... AP
ATHEIST ...... [illegible]
B METH (E) ...... [illegible]
BAPTIST ...... [illegible]
BIBLE BELIEVER ...... [illegible]
BIBLE CHRISTIAN METHODIST ...... BC
[illegible] ...... BN
BRITISH EPISCOPAL METHODIST ...... BE
BRITISH EPISCOPAL METHODIST ...... [illegible]
[illegible] ...... [illegible]
C PRESB ...... CP
CALVINISTIC METHODIST ...... [illegible]
CANADA PRES ...... CP
CANADIAN PRESBYTERIAN ...... CP
CHRISTIAN BAP ...... [illegible]
CHRISTIAN BAP ...... [illegible]
CHRISTIAN BAP ...... CF
CHRISTIAN BRETHREN ...... CH
CHRISTIAN CONF ...... [illegible]
CHRISTIAN CONF ...... [illegible]
CHRISTIAN CONF ...... CF
CHRISTIAN CONFERENCE BAPT ...... [illegible]
CHRISTIAN CONFERENCE BAPT ...... [illegible]
CHRISTIAN CONFERENCE BAPT ...... CF
CHURCH OF ENGLAND ...... CE
CHURCH OF SCOTLAND ...... CS
CONGREGATIONALIST ...... CO
DEIST ...... DE
DISCIPLE ...... DI
DISCIPLE OF CHRIST ...... DI
E ...... CE
E CHURCH ...... CE
E METH ...... [illegible]
E METHODIST ...... [illegible]
EPISCOPAL (OF ENGLAND) ...... [illegible]
EPISCOPAL METHODIST ...... [illegible]
EPISCOPAL OF ENGLAND ...... [illegible]
EPISCOPAL OF ENGLAND ...... [illegible]
EPISCOPALIAN ...... EP
EVANGELICAL ASSOC ...... EA
EVANGELIST ...... EV
F (C) OF S ...... [illegible]
F (C) OF S ...... [illegible]
F BAPTIST ...... [illegible]
[illegible] ...... [illegible]
F C (P) ...... [illegible]
F C BAPTIST ...... FB
FREE CHRISTIAN ...... [illegible]
FREE WILL ...... [illegible]
FREE WILL BAPTIST ...... FW
[entries illegible]
ILLEGIBLE ...... IL
[illegible] ...... IN
INDEPENDENT ...... [illegible]
[illegible] ...... [illegible]
JEW ...... [illegible]
K PRESBYTERIAN ...... [illegible]
KIRK (OF SCOTLAND) ...... SP
LATTER DAY SAINTS ...... [illegible]
LUTHERAN ...... LU
MENNONITE ...... MN
[illegible] ...... MS
METH E ...... EM
METHODIST ...... ME
MORMON ...... MO
[illegible] ...... WM
NEW CONNEXION ...... [illegible]
NEW JERUSALEM (C, CH, CHURCH) ...... [illegible]
NO DENOMINATION ...... NR
NO RELIGION ...... NR
NO SECT ...... [illegible]
NOT GIVEN ...... NG
PAGAN ...... PA
PLYMOUTH BRETHREN ...... [illegible]
PRESB C OF SCOTLAND ...... [illegible]
PRESB C S ...... [illegible]
PRESBYTERIAN ...... [illegible]
PRESBYTERIAN C. OF L.P. ...... [illegible]
PRESBYTERIAN KIRK ...... [illegible]
PRIMITIVE METHODIST ...... PM
PROTESTANT ...... [illegible]
QUAKER ...... QU
R BAPT(IST) ...... [illegible]
R PRESB ...... RP
RE BAPTIST ...... RB
REFORM BAPTIST ...... RB
REFORMED BAPTIST ...... [illegible]
RF BAPTIST ...... [illegible]
ROMAN CATHOLIC ...... RC
S KIRK ...... [illegible]
S PRESB ...... SP
SCO PRESBY(TERIAN) ...... SP
SCOTCH PRESBY(TERIAN) ...... SP
SEVENTH DAY ADVENTIST ...... SD
SWEDENBORGIAN ...... SW
TUNKER ...... [illegible]
[entries illegible]

NOTE I: These codes are very similar to the British codes, but a few differ; those that differ are underlined in the list which follows.

NOTE II: As for birthplace and religion, if no code is given for an ethnic or nation of origin, place * in col. 57 and code the name on the long name form.

NOTE III: As above, if illegible, put a question mark (?) in col. 57 and leave col. 56 blank.

NOTE IV: Ethnic-Nation of Origin listings are abbreviated as for birthplace.

(See next page for codes for Ethnic or Nation of Origin)

NATION OF ORIGIN CODES INCLUDING ALL 1861 & 1871 MENTIONS

ACADIAN ...... AC
AFRICAN ...... AF
AMERICAN ...... US
ANGLO-SAXON ...... AX
AUSTRIAN ...... AU
BAVARIAN ...... BA
BELGIAN ...... BE
CANADIAN ...... CA
DANISH ...... DE
DUTCH ...... DU
EAST INDIAN ...... [code illegible]
ENGLISH ...... EN
FRENCH ...... FR
GERMAN ...... GE
GREEK ...... GR
HALF BREED ...... HB
HINDU ...... [code illegible]
ILLEGIBLE ...... ?
ILLEGIBLE ...... [code illegible]
ILLEGIBLE ...... [code illegible]
IRISH ...... IR
ITALIAN ...... IT
JEWISH ...... [code illegible]
NATIVE INDIAN ...... IN
NATIVE INDIAN ...... [code illegible]
NORWEGIAN ...... [code illegible]
NOT GIVEN ...... NG
POLISH ...... [code illegible]
PORTUGUESE ...... [code illegible]
PRUSSIAN ...... PR
RUSSIAN ...... RU
SCANDINAVIAN ...... SV
SCOTTISH ...... [code illegible]
SPANISH ...... [code illegible]
SWEDISH ...... SW
SWISS ...... ST
WELSH ...... WA

Columns 58-73: Occupation - Profession: Complete title as on microfilm manuscript.

If name exceeds 16 columns fill in last column (col. 73) with asterisk (*) and complete title on the long name code form (#2).

If an occupation is scratched out on the original form - but you can decipher it clearly, then code it as usual.

Column 74: Married or widowed; code M or W. If blank, code blank.

NOTE: If next item on the microfilm "married in last 12 months" is filled in - code L for married and W for widowed.

Column 75: School (attendance): - Code 1 if marked (usually a 1), otherwise leave blank.

Column 76: Unable to read: - labeled "read" on code form - Code 1 if marked, otherwise blank.

Column 77: Unable to write: - labeled "write" on code form - Code 1, if marked, otherwise blank.

Column 78: Deaf and dumb: - Code 1 if checked, otherwise, blank.

Column 79: Blind: - Code 1 if checked, otherwise, blank.

Column 80: Unsound mind: - Code 1 if checked, otherwise, blank.
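The column layout above can be illustrated with a minimal sketch that slices columns 58-80 out of one 80-column coded record. The function name and the dictionary keys are hypothetical and are not taken from the project's own programs; only the column positions and the meaning of each code come from the instructions above.

```python
def parse_card_tail(card):
    """Slice columns 58-80 of one coded record (columns are 1-indexed)."""
    card = card.ljust(80)  # pad short lines so every column exists

    def col(first, last):
        # Convert a 1-indexed inclusive column range into a Python slice.
        return card[first - 1:last]

    occupation = col(58, 73)
    return {
        "occupation": occupation.rstrip(),
        # An asterisk in col. 73 means the title continues on the long name form.
        "occupation_on_long_form": occupation.endswith("*"),
        "marital": col(74, 74).strip(),      # M, W, or blank
        "school": col(75, 75) == "1",
        "cannot_read": col(76, 76) == "1",
        "cannot_write": col(77, 77) == "1",
        "deaf_dumb": col(78, 78) == "1",
        "blind": col(79, 79) == "1",
        "unsound_mind": col(80, 80) == "1",
    }
```

A record whose occupation fills all sixteen columns and ends in an asterisk is reported with `occupation_on_long_form` set, which is the signal to look for the full title on long name form #2.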

Processing of the Coded Data

The computer programs noted above are designed to deal with these data to accomplish two tasks: they carry out a series of logical checks on the coded data to allow coding errors to be corrected; and they take the original data and transform it into files on which data analysis can proceed. The two tasks are intimately related. For example, if the "blank *" is found in the religion field for a given individual, it is necessary to make certain that a long form containing the religion for this individual has also been created. When the data are transformed into files for analysis, the contents of the religion field on the long form must be merged into a specific position on the record for the relevant individual. Besides checking for the existence of long-form data, flagged by asterisks on the individual records, a number of other checks of the data are carried out. The entries for variables with a fixed set of codes, including religion, place of birth, nation of origin, marital status, and school attendance, are checked to see that they are among the predetermined acceptable codes. In addition, some infrequently occurring combinations of codes are also flagged by the program so that they can be checked. These include individuals who are listed as attending school but are under four or over nineteen years of age, married persons without a spouse present in the household, and all individuals listed as deaf and dumb, blind, or of unsound mind. These cases were scrutinized for possible error and many were checked against the microfilm records.
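The per-record checks described above might be sketched as follows. This is an illustration only: the record layout, the function name, and the warning labels other than RELIGION IN ERROR are assumptions, and the household-level check for married persons without a spouse is left out because it needs the whole household.

```python
def check_record(rec, valid_codes):
    """Return a list of warning labels for one coded individual."""
    warnings = []
    # Fixed-code fields must hold an acceptable code; " *" marks a long form.
    for field in ("religion", "birthplace", "origin"):
        value = rec[field]
        if value != " *" and value not in valid_codes[field]:
            warnings.append(field.upper() + " IN ERROR")
    if rec["marital"] not in ("", "M", "W"):
        warnings.append("MARITAL IN ERROR")
    # Unusual but possible combinations are flagged for a manual check.
    if rec["school"] and (rec["age"] < 4 or rec["age"] > 19):
        warnings.append("SCHOOL AGE")          # hypothetical label
    if rec["deaf_dumb"] or rec["blind"] or rec["unsound"]:
        warnings.append("INFIRMITY")           # hypothetical label
    return warnings
```

A flagged record is not necessarily wrong; as the text notes, many flagged cases were confirmed against the microfilm and left unchanged.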

The 1871 census does not record the relationship of the individuals in a household to each other. If some error is tolerated, however, it is possible to deduce relationships among individuals within a household, making use of surnames, marital status, age, and the order in which the individuals within a household are recorded on the census manuscript. A decision was made to carry out such an analysis for each household and to use this information to attach to each individual record a number of summary variables describing the household of which he or she was a member (see below). Clearly there were cases where the family relationships are ambiguous. In all the detectable cases of ambiguity, a message was printed out by the data-checking programme to apprise us of the difficulty. For example, a child could logically have more than one person in the household as its mother--logically here being taken to mean that there are two or more women in the household with the same surname as that child, who are married or have been married, and whose ages differ by at least fifteen and no more than fifty years from that of the child. In all such ambiguous cases an "error" message was printed and the household scrutinized by the principal investigators to attempt to resolve the ambiguity. In the great majority of cases, the determination of family relationships among members of a household was unambiguous.

Three other potentially ambiguous situations were flagged by the program: married individuals without their spouses present (most of these proved to have been correctly coded), individuals who appear to be the children of those who are listed later in the sequence of persons in the household (mostly this seems to indicate a widowed parent or aged couple living in the household of their child), and individuals who are identified as children of parents in the household, but who are separated from their parents by one or more persons of a different surname.

In each of these cases, a "warning" message led us to reexamine the household. A "special allocation" procedure was developed whereby any alteration of the application of the computer program's rules for establishing family relationships could be recorded and the "imposed" relationships changed in the final data file. See Figure 4 and the accompanying "Layout" description for the form and kinds of "special allocations" permitted. We discuss the nature of this intervention by means of "special allocations" and provide illustrative cases in the following text.

[Figure 4: Special allocation coding form. Legible column headings: District, Subdistrict, Division.]

Special Allocation Cards: Layout

(NOTE: 1861 data for a local study of Essex and Kent counties are referred to below.)

1-3     District of household
4       Subdistrict of household
5       Division (enumeration area) of household
6-8     for 1861: blank
        for 1871: household number
9-10    type of special allocation:

        NS-- no spouse, used to prevent the assignment of two married adjacent individuals of opposite sex and with the same last names as a couple
        SP-- spouses, used to assign two individuals as spouses who are non-adjacent
        NP-- no parents, used to prevent the assignment of an individual as the child of any other person in the household
        Cx-- where x is 1, 2, 3, 4, 5, or 6, used to assign a child or children to a parent or parents other than those which the programme would automatically choose. The child or children are of type x:
             x = 1 means child of couple
             x = 2 means child of married
             x = 3 means child of widower
             x = 4 means child of widow
             x = 5 means child of father and stepmother
             x = 6 means child of mother and stepfather
        HS-- household size, used to break a household into 2 or more units

11-16, 17-22, 23-28, 29-34
        contain the numbers of four persons in either of the following two forms (one or more may be blank in any given case)
        form I: person number of the individual, counted from the first, right-justified.
        form II: xxxyyy. For 1861, xxx is the page number and yyy is the person number of the individual. For 1871, xxx is the family number and yyy is the original person number. N.B. Should either of these be in error on the original (which means they will be corrected by the programme), use the original values.

The positions are used as follows (left blank if they do not apply):

11-16-- husband or male parent
17-22-- wife or female parent
23-28-- (first) child
29-34-- (second) child

N.B. If both the person numbers for children are filled, then the programme assumes that all persons in between are to be treated as children.
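As a rough illustration of the card layout, the following sketch splits one special allocation card into its fields. The function and key names are hypothetical; the person fields are returned as text because a field may hold either form I (a right-justified person number) or form II (xxxyyy), and the sketch does not try to tell them apart.

```python
def parse_allocation_card(card):
    """Split one special allocation card into named fields (1-indexed columns)."""
    card = card.ljust(34)  # pad so that all 34 columns exist

    def col(first, last):
        text = card[first - 1:last].strip()
        return text or None  # blank fields become None

    return {
        "district": col(1, 3),
        "subdistrict": col(4, 4),
        "division": col(5, 5),
        "household": col(6, 8),         # blank for 1861, household number for 1871
        "type": col(9, 10),             # NS, SP, NP, C1..C6, or HS
        "male_parent": col(11, 16),     # husband or male parent
        "female_parent": col(17, 22),   # wife or female parent
        "first_child": col(23, 28),
        "second_child": col(29, 34),
    }
```

For an HS card the same four six-column fields carry the last person numbers of each subunit rather than parents and children, so a caller would need to interpret them according to `type`.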

Some Illustrations

                                                      Type  11-16  17-22  23-28  29-34

1. Persons 5 and 9 are married                        SP    [person numbers illegible]
2. Persons 9 and 10 are not married                   NS    [illegible]   10
3. Persons 4 through 7 have no parents                NP    b      b      4      7
   in the household
4. Person 4 and 7 have no parents                     NP    b      b      4
   in the household (N.B. requires 2 cards)           NP    b      b      7
5. Persons 4 through 7 have persons 2                 C1    2      3      4      7
   and 3 as father and mother respectively
   (i.e. type 1 children)
6. Person 4 and person 7 have persons 1               C6    2      1      4
   and 2 as mother and step father                    C6    2      1      7
   respectively (i.e. type 6 children,
   requires 2 cards)

To cause the household to be broken into more than one unit for analysis, the four person numbers must contain the person numbers (counting from the first individual only and not using the original family or person numbers) of the last persons in each subunit. E.g., if a household with 20 persons is to be analyzed in 3 units, 1-8, 9-13, 14-20, the person numbers used are 8, 13, 20, b.

If the split is into more than five groups, two cards must be employed; in this case the second card must follow the first in deck placement and should be based on an entirely new count of the household. Say we wish to break a fifty person household at persons 3, 18, 25, 30, 36, 42, 50. Then the two cards must read

HS 3 18 25 30
HS 6 12 20 b

The second card counts afresh from the person following the last break on the first card, so that persons 36, 42 and 50 become 6, 12 and 20.
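The recount on the second card can be worked through with a small sketch. The function name is hypothetical, and the extension to a third or later card is an assumption: the text describes only the two-card case.

```python
def household_size_cards(last_persons, per_card=4):
    """Turn the last person number of each subunit into HS card values.

    Each card holds up to four numbers; every card after the first restarts
    the count after the last person named on the previous card.
    """
    cards = []
    offset = 0
    for start in range(0, len(last_persons), per_card):
        chunk = last_persons[start:start + per_card]
        card = [person - offset for person in chunk]
        card += [None] * (per_card - len(card))  # None stands for a blank (b)
        cards.append(card)
        offset = chunk[-1]
    return cards
```

Called with the breaks 3, 18, 25, 30, 36, 42, 50 from the example, this yields the two cards given in the text, with the blank fourth position on the second card represented as `None`.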

Should it be necessary to include special allocation data on parents and children simultaneously with household size information, then the parent and child data should be keyed to the original family and person number data.

We think that automatic creation of relationships, aided by our intervention in all ambiguous cases, is an acceptable substitute for an original manuscript variable describing these relationships. Such a variable is a critical one in analysis.

Part 2 of this appendix fully describes the algorithm used to analyze the relationships within each household.

Two kinds of variables are generated which correspond to those described above as coded directly from microfilm. The first is a set of summary variables describing the entire household which are attached to the record of each individual in the household. Among these variables are the number of people in the household, the number of married couples in the household, the number of children of widowers, and so on. The complete set of variables is also given in Part 2 of this appendix. The second type of variable describes each individual uniquely; such variables have different values for each person in the household. Examples of these variables are a child's number of older and of younger brothers and sisters resident in the household and the person number (i.e., position within the household) of each person's mother and father. (Some variables are chiefly of interest when it is necessary to create new summary variables for the household, for they allow the household to be read into storage and reanalyzed without going through the process of redefining the basic relationships.)

Built into the family relationships algorithm is a set of decision rules to be followed at ambiguous points. For example, if two individuals could "logically" be taken as the mother of a given child, the algorithm assigns the child to the "mother" who is closest in the household listing. As noted above, a warning message is printed when this happens. What if our reexamination of a specific household leads us to conclude that the wrong person has been selected automatically as the mother? In such a case a "special allocation" form is filled out, from which a card is keypunched. The card contains an instruction to the programme to reallocate the parent-child relationship. In this case, when the raw data are reprocessed a message indicating this change is printed when the household is encountered, and a variable on the record records the fact that this "special allocation" has taken place.

Further processing of the basic data was required. Four variables require substantial recoding from alphabetic to numerical codes to be usable in the data analysis. They are occupation, religion, place of birth, and nation of origin. The last three are two character codes, but a code for the complete set of occupations for such a large file has as many as forty characters! The four recoding tasks are carried out at a single step: the programme first reads in a "dictionary" which specifies how the alphabetic codes are to be transformed; it then goes through the individual records, "looks up" the words in the dictionary, and adds to the record the corresponding numeric codes from the dictionary.

The three two character variables can be handled fairly easily since there are no more than about eighty valid codes for each one of them; it is not difficult to assign numeric codes to the majority of the valid codes before the data are even coded. But there is nothing approximating a complete list of the occupations which can be found in the census abstracts. The aggregate census lists only about one hundred and twenty occupations, though over one thousand are found in our data. So, before the dictionary can be made up, the individual records must be analyzed by a program which identifies all the unique occupational mentions. The program developed for this project punches a card for each occupation mentioned and these cards can then be used to make up the dictionary. Each occupation is assigned an eight digit code. Of course, each unique spelling of an occupation must be treated separately. The dictionaries are each described in detail in Part 3 of this appendix.
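The two-pass procedure just described (first collect every unique spelling, then look each record up in a dictionary) could be sketched as below. The function names, the record layout, and the numeric codes used in any example are hypothetical; the actual dictionaries are those described in Part 3.

```python
def unique_mentions(records, field="occupation"):
    """Collect every distinct spelling found in one field, sorted for review."""
    return sorted({rec[field] for rec in records})


def recode(record, dictionaries):
    """Return a copy of the record with a numeric code added for each coded field."""
    out = dict(record)
    for field, table in dictionaries.items():
        # Each distinct spelling is its own dictionary entry;
        # a spelling missing from the dictionary yields None.
        out[field + "_num"] = table.get(record[field])
    return out
```

Because every spelling is looked up separately, two misspellings of one occupation can be given the same numeric code simply by entering both in the dictionary.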

There is a final problem which is caused by the long forms. For example, the record for some individuals does not contain a valid two character religion code, but rather a "blank *" code and a 24 character string with the unique religion written out. All the infrequently occurring religions are coded in this way. Our solution is to place a three-digit numeric code for those religions coded on "long forms" in the last three positions of the 24 character string; i.e., the numeric code is simply punched onto the long form card itself after the data were collected. When the computer programme encounters a "blank *" in the religion field of the original record of an individual, it scans the 22nd, 23rd, and 24th positions of the long form card to find the code. The long form values for place of birth, nation of origin, and occupation are handled in exactly the same way.
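A minimal sketch of the long-form lookup follows. The function name, the argument names, and the numeric values in any example are assumptions made for illustration; only the "blank *" flag and the use of positions 22-24 come from the text.

```python
def religion_number(code, long_form_text, dictionary):
    """Return the numeric religion code for one individual.

    `code` is the two-character field from the short record;
    `long_form_text` is the 24-character string from the long form, if any.
    """
    if code == " *":
        # The numeric code was punched into positions 22-24 of the long form.
        return int(long_form_text.ljust(24)[21:24])
    return dictionary[code]
```

The same function shape would serve for place of birth, nation of origin, and occupation, which the text says are handled in exactly the same way.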

Finally, certain cross-references among individuals within a household are likely to be of quite common use, though they certainly do not exhaust the possibilities. In particular, it will often be of interest to examine the characteristics of an individual in relation to five specific individuals in the household: his or her spouse, mother, father, family head, and household head (of course, the same person could fill more than one of these relationships). On our file each individual's record contains six additional variables describing these five individuals in terms of their occupation, religion, place of birth, nation of origin, age and sex. Thus, for example, one could examine the relationship between school attendance and a child's father's occupation and his or her mother's place of birth using only the data already on the records.

Appendix E: Part 2

Use of the Automatic Household Analysis Program

Ordinarily all data sets are analyzed twice. In the first run, no final records are created. The list of warnings, signalling ambiguities in the identification of parent-child relationships, married persons without spouses, and so on, is carefully examined and a special allocation form is filled out for each case where the computer algorithm produces an incorrect result (see above). These special allocation forms are then used in the second run of the data, when final records are created.

Variables Created by the Automatic Household Analysis

As indicated above, two types of variables are created by the analysis. The first is a set of summary variables, attached to every person in the household, which describe that household's general characteristics, including its size, the number of married couples in the household, the number of children with parents in the household, the number of servants in the household, etc. The second is a set of variables which are unique to each person in the household; they include the number of older brothers an individual in the household has, the person number of his or her father, etc. These variables are listed below.

Steps in Automatic Household Analysis

1. The file of special allocation forms is searched to find any forms which refer to the household in question. The instructions on these forms override all automated allocations produced by the algorithm. These special allocations permit the following programme overrides:

a. two individuals whose records do not meet the required conditions to be identified as a man and wife can be so designated,

b. two individuals who are automatically identified as man and wife can be separated from each other,

c. a child can be identified as having a specific parent or parents when they would not so be identified automatically,

d. a child who is automatically identified as the son or daughter of a specific individual or couple can be separated from them,

e. a child who is automatically identified as having two biological parents in the household can be identified as having one stepparent and one biological parent; also a person automatically identified as a stepparent can be identified as a biological parent,

f. the household can be broken into two or more groups of individuals which are analyzed as separate families or households and not together.

2. The marital status and presence of a spouse of the head of the household (the head is taken simply as the first person listed in the household) are ascertained.

3. All servants and visitors in the household are identified, using the occupation variable. Since a number of designations of a servant are possible in the manuscript data, e.g., servant, maid servant, general servant, servente, sevt, etc., the occupation of each individual (except for the household head, who cannot be a servant) is compared to a list of occupations previously identified as including all the possible occupational titles referring to servants and also all the misspellings of those titles which are found in the data. This procedure, of course, requires a preliminary examination of the occupational titles which occur in the entire file. Visitors to the household are identified only by the exact title 'VISITOR' in the occupation field.
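Step 3 can be illustrated with a short sketch. The set of servant titles below contains only the five spellings quoted in the text, written in upper case on the assumption that occupations are keyed in capitals; the project's full list is not reproduced here, and the function name and role labels are hypothetical.

```python
# Only the spellings quoted in the text; the real list was compiled from the file.
SERVANT_TITLES = {"SERVANT", "MAID SERVANT", "GENERAL SERVANT", "SERVENTE", "SEVT"}


def classify_roles(household):
    """Label each person as 'servant', 'visitor', or 'other' from the occupation field."""
    roles = []
    for position, person in enumerate(household):
        occupation = person["occupation"].strip()
        if occupation == "VISITOR":            # exact title only
            roles.append("visitor")
        elif position > 0 and occupation in SERVANT_TITLES:
            roles.append("servant")            # the head (first listed) cannot be a servant
        else:
            roles.append("other")
    return roles
```

Matching against an explicit list rather than a pattern keeps every accepted misspelling visible and reviewable, which fits the preliminary examination of titles that the step requires.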

4. All married couples in the household are identified. A couple must have identical last names, must both be listed as married, and must be listed on adjacent lines in the order of the household. If, as occasionally occurs, the two spouses are not listed sequentially or the maiden name of the wife is given as her last name (for example, in rare cases this appears in Quebec in the 1861 census), a special allocation form must be used to instruct the programme that the two individuals are married. All such cases are identified and reexamined.
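The couple rule in step 4 might look like the sketch below. The opposite-sex condition is taken from the description of the NS card rather than from step 4 itself, and skipping past a matched pair before testing the next line is an assumption; the function name and record layout are hypothetical.

```python
def find_couples(household):
    """Return (husband/wife) pairs as 1-based person numbers of adjacent spouses."""
    couples = []
    i = 0
    while i < len(household) - 1:
        a, b = household[i], household[i + 1]
        if (a["surname"] == b["surname"]
                and a["marital"] == "M" and b["marital"] == "M"
                and a["sex"] != b["sex"]):
            couples.append((i + 1, i + 2))
            i += 2  # assumption: a person belongs to at most one couple
        else:
            i += 1
    return couples
```

Spouses who are not adjacent, or a wife listed under her maiden name, are not found by this rule and would need an SP special allocation.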

5. For all married persons in the household for whom a spouse cannot be identified, a warning message is issued; the processing then proceeds normally. In most cases a subsequent recheck of the microfilm found there was, in fact, no spouse present. Occasionally, this warning led to our finding some coding error.

6. Widows and widowers are identified.

7. All probable parent-child relationships in the household are identified. A mother and child must have identical surnames and the age difference between them must be between 15 and 50 years. A father and child must have identical surnames and the age difference between them must be at least 17 years. For any other cases a special allocation form must be used to create a parent-child relationship.

In those cases where, according to these criteria, a child could have more than one person as his or her mother or father, a warning message is printed. The household processing is then carried on under the assumption that the actual parent is the one closest to the child in the listing of the household.
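The parent rule and its tie-breaking can be sketched as follows. The function names and record layout are hypothetical, positions are 0-based, and requiring the candidate parent to be married or widowed is applied to both sexes as an assumption (the text states it only for mothers). When two candidates are equally close, the earlier-listed one is taken, which is also an assumption.

```python
def candidate_parents(household, child_index, sex):
    """List positions of persons who satisfy the surname and age rules."""
    child = household[child_index]
    found = []
    for i, person in enumerate(household):
        if i == child_index or person["sex"] != sex:
            continue
        if person["surname"] != child["surname"]:
            continue
        if person["marital"] not in ("M", "W"):
            continue
        gap = person["age"] - child["age"]
        # Mothers: 15 to 50 years older; fathers: at least 17 years older.
        ok = 15 <= gap <= 50 if sex == "F" else gap >= 17
        if ok:
            found.append(i)
    return found


def assign_parent(household, child_index, sex):
    """Return (position of chosen parent or None, whether a warning is due)."""
    candidates = candidate_parents(household, child_index, sex)
    if not candidates:
        return None, False
    closest = min(candidates, key=lambda i: abs(i - child_index))
    return closest, len(candidates) > 1
```

With the ages from the Roy household reproduced later in this part (Lezzar 43, Mary 25, Timothy 30, Mary 38, and a child of 8), both couples qualify, a warning is due, and the child is assigned to the nearer pair.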

A warning message is printed whenever a child is found to be listed before his or her probable parent in the household. This is usually the case when an aged parent resides with the family of one of his or her children.

A warning message is printed when a person identified as a child of a probable parent in the household is separated from that parent in the listing of individuals by one or more people with a different surname, for example, in the case of younger stepchildren.

It is important to note that where a child-parent relationship exists but their surnames are not identical, due to a name change at marriage or to remarriage, the automatic routines will fail to identify the relationship. If the relationship were known from some other data source or investigators were willing to deduce it from an inspection of the household, a special allocation form could be used to instruct the program to identify the relationship as part of the file.

If a child is identified as related to only one member of a married couple--as occurs where the child's surname is identical to that of the couple, but the age comparisons indicate that only one of the couple is likely to be a parent--a warning message is issued. The child is identified as having a stepparent and data processing proceeds normally. In cases where a visual examination of the household suggests that there are both children and stepchildren (this is usually revealed by the presence of two distinct age-ordered groups of siblings), a special allocation form can be used to treat one group as stepchildren, even if the age requirements for biological parenthood are satisfied for all.

8. Daughters-in-law are identified; they are the wives of men for whom at least one of the parents is found in the household. If the family lives with the parents of the wife, the in-law relationship cannot be identified without using outside sources, because of the name change at marriage.

9. For each set of siblings identified above, a set of variables to measure the number of older and of younger brothers and sisters (separately) is computed.
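The sibling counts in step 9 can be sketched in a few lines. The function name and the output keys are hypothetical, and treating the earlier-listed sibling as the older one when two ages are equal is an assumption, since the text does not say how ties are handled.

```python
def sibling_counts(siblings):
    """For each (sex, age) sibling, count older/younger brothers and sisters."""
    results = []
    for i, (_, age) in enumerate(siblings):
        counts = {"older_brothers": 0, "older_sisters": 0,
                  "younger_brothers": 0, "younger_sisters": 0}
        for j, (other_sex, other_age) in enumerate(siblings):
            if i == j:
                continue
            # Assumption: on equal ages, the earlier-listed sibling counts as older.
            older = other_age > age or (other_age == age and j < i)
            kind = "brothers" if other_sex == "M" else "sisters"
            counts[("older_" if older else "younger_") + kind] += 1
        results.append(counts)
    return results
```

The counts cover only siblings resident in the household, which matches the variables described earlier in this appendix.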

Below we reproduce some selected examples of the output of the Automatic Household Analysis Program. Each case includes a copy of the program's print-out to which we have attached short captions noting the nature of the ambiguity or error indicated by the message. Each case also gives a copy of the original coding form.

1C I 456 214 101 F 2 1 0 1 1C 1 456 214 LGl 1 2 6 1 1

1Cl 456 314 1Cl M 20 CN RC FR TAVERNKEEPER M 11. 1. c 1 456 214 lC? F 46 GN RC’ FR M 11 1 c 1 456 2L4 1111 Y 18 GN RC EN FARMER 1 c 1 456 214 101 M 16 ON RC EN I=ARMER I : 1 C 1 446 214 131 F 14 CN RC EN I 1 C 1 45s 214 191 M 14 GN RC EN 1 c 1 45c 214 lti! M 12 CN RC EN ICI 45!*2!& 1 I! F 10 ON RC EN lC1 456 %\C :';I CN RC FR

The error messages are given above the household listing on this run. The messages SEPCHILD and REVCHILD refer, respectively, to person number 8, Elly Renard, and person number 1, Eli Renard. The SEPCHILD message identifies Elly, aged 2, as the possible daughter of Eli and Josephine Renard, given the surname similarity and age differences. But the relationship is ambiguous because Elly is separated from her possible parents by six Brooker children. Elly is probably the sister of Josephine's six surviving children by a previous marriage to a Mr. Brooker. The automatic household analysis program would have assumed this relationship, assigning Elly to the closest probable parents. The "error" message here serves as a warning to the investigator that the ambiguity warrants consideration and that possibly the automatically established relationships should be altered by using the Special Allocation feature of the program. We did not alter the allocations in this case. Note that Elly is given as being of French origin, like her assumed father, Eli, while the other children are of English origin. The REVCHILD message identifies Eli Renard, aged 28, as a possible son of Josephine Renard, aged 46, merely because of the name similarity and the age differential (18 years). The automatic program will also have recorded this relationship. On the basis of the other information (age sequence of the children, married couple indicated) this ambiguity is resolved by overriding the automatic allocation by means of a Special Allocation, making Eli the husband of Josephine, as assumed above.

[Copy of the original coding form for this case. Legible column headings: Subdistrict, Special Sample Number, Household Number, Family, Person, Last Name, First Name(s) and Initials, Sex, Age, Birth Place, Religion, Nation of Origin, Occupation, Married, School, Read, Write, Deaf and Dumb, Blind, Unsound.]

. . ***AG$-1 SRW- Ii-EiRtiR FilIi PERSCh 69C 3 732 31 101 4 -247447488 c‘ 0 0 1

69 c 2 702 31 19 1 1091 HOLMS c FORtif M 61 EN HH EN FARMER l M 69 c 2 to2 31 2oc2 HCLMS PAPY .69 c 2 792 31 :g: 3Cd3 HCLMS t*‘APGPEl 69C2 t02 *l 191 40.34 HCLMS vURCUS 69C2 702 il 101 5cc5 H?LM 5 ALREPT M 13 ON WM EN 69 c 2 702 31 10 1 h ? c0 IWLMF HARVV v 7 Ch WC EN 1

The error message above the case, AGE---SRW, indicates that person number 4, Murcus Holms, aged 25, a farmer, is listed as attending school. In this case a subsequent review of the microfilm record indicated that no coding or punching error was made and the record is maintained.


-,y ...... -. - ...... -_...... _ .. - .... e-m...... -- . ... a- ..... - .-....-.-- . __I, .. . _-._- ... a- I 136 8 0 1101 217 1 1 ORAS “. ? RENECICT P 031 GE JE GE IMFCRTEP P 136 &l 3 1101 217 1 2 SOPI! E f 027 - M I 136 R 0 1131 217 1 3 ELLEN F 034 OU . 1131 217 1 4 IVA 105 B 0 F 033 / 10660 ilo 5 LAli2EWE SARA!- F 020 CN CE SC SERVANT i 106 I3 0 1101 217 1 6 HIX CATHERINE F 019 IR GC fR SERVANT

***RELIGION IN ERROR FOR PERSON lcc14 C 1lCl 217 A 1 347 0 0 '0

The error message below the case, RELIGION IN ERROR, indicates that for person number 1, JE is not a valid code (in columns 54-55 of the coding form). A check of the microfilm showed that the actual religion was Jew and the valid code would be JU. The punch card was altered and the file corrected. The corrected code is also recorded in the right margin of the coding form.

[Copy of the original coding form for this case. Legible column headings: Subdistrict, Household Number, Family, Person Number, Last Name, First Name(s) and Initials, Sex, Birth Place, Religion, Nation of Origin, Occupation.]

..- -. **+f’ARNC 5F IN EPROa FOR PERSCK 1740 1 3202 226 lC1 9 C C 0 C I; 2- 1 3202 226 101 1501 STEVENS RCBT w 63 NE @A EN '> 174 D BOARDING HOUSE M I 174 D 1 2202 226 lC1 2002 STEVENS rJPS R F 63 N@ @A EN IM 174 0 1 3232 226 131 3033 STEVENS BEVERLY M 29 NB BA EN 174 II 1 32G2 226 1Cl 4cc4 STEVENS c E M 19 NO BA EN -174 D 1 3202 226 101 SC35 SIPE P c F 37 NB t?A EN 174 0 1 3202 226 131 6DC6 SIPE FWANK F! 14 NB 84 EN 174 0 1 3202 226 101 7CO7 Sf ME MPPY F B NR BA EN 174 0 1 3232 226 1Sl O( 00 SIPE rc s c 3 NC BA EN 174 0 1 ?2U2 226 101 9c29 WEO@E @SPOON J IJ N F F 21 EN CE EN SERVANT M 174 0 1 3202 226 1Cl 130 10 GI LFDRO MPRY F 20 WA RC IR SERVANT i. 174 D 1 3202 226 101 11011 OVIRK MPRY F 21 EN CE EN SERVANT 174 D 1 ?232 226 191 12012 STCCKLM F/PRY M 24 SC cs SC . CLERK 174 D 1 3202 226 101 13c13 BELYEA A )r 28 NE! @A EN EXPRESS OR1 VER 1’ : 174 0 1 3202 226 101 14014 CHANOLEI? c l-l u 30 NB CE EN POLICE OFFICE CE I74 D 1 3202 226 101 15015 CHISHCLC P M 35 NS Pt. SC GRCCER W \ i 174 D 1 3202 226 101 lbC16 CASE JCHN P 20 N@ CE EN I : 174 D 1 3202 22h lG1 17017 COLC,ING E M 19 NO CE EN CLERK 174 D I 3202 226 101 lClCl@ GrLCHAtST R M 25 SC CS SC CLERK 174 C 1 3292 226 101 10319 LAUSCN w s M 10 KS AA EN CLERK 174 n 1 3202 22h :c1 2q.3 211 Pf!!SfF E F F u 22 NS HA EN CLERK 174 D 1 3202 221, 131 ilCil YCE~RR 5 r. 25 N@ PL 1R GROCER 174 D 1 3202 226 1c1 22021 F!CDC:F!ALD u M 24 NE 84 SC ATTORNEY AT LAH 174 0 1 32C2 226 1"l. 23273 PAAKIN u M 25 NB PL tR CLERK 174 D 1 3202 27b 1121 2454 hH1 rs L 4 M .19 NE! WM EN STUDENT AT LAU . The case serves to illustrate the nature of some boarding houses considered as households. The error message MARNO SP,, indicates that person number 9, June Wedderspoon, apparantly a member of the household staff, is listed as married but without an’adjacent spouse. A check of the microfilm confirmed the record.


. ..: .. ‘. ._ . . . . 4 4 1 0

174 0 2 3502 4 II 1Cl lc!Ol uALSTCI; fHc,YAS 6 t-l 24 OU RA IR 6 t SHOE MANUFAT P 174 0 2 3502 48 lr,l 2or)z P'LI, Sfr:N PFr)CCCA F 21 CU BA SC N 114 0 2 1592 4e 1.7 1 30';3 PALStCN p.h Y i E F 0 NE! EA IR I 174 0 2 3532 48 1Cl 4 c 0 4 p*LSTCN A hJI F 49 EN PA EN H I SERVANT 174 0 2 3532 40 131 '.i'!C5 14E L 0 Y Cl. 1 ZACETH F 16 IR RC IR I

The error message given above the case, REVCHILD, indicates that person number 1, Thomas Ralston, aged 21, has been automatically allocated as a child of a parent listed after him in the household, i.e., Ann Ralston, aged 49, a widow. The allocation seemed highly probable here and was not altered. In general, nineteenth century census enumerators in Canada seemed to reflect the household structure by listing first whoever had assumed the position of the head of the household, followed by their spouse, if any, their children in order of their birth, and other relatives, with boarders, servants and visitors last. Note too that the coding form only gives an "*" in the last column for occupation while this version of the automatic program provides the complete occupational title taken from the long name forms.


Lt I -t ...... i o-\ V\ 1 - -.-. 5 3 0 0 **swYJcY IL? N ERROR FOR PE2SCN 1834 6 3 0 0 :*?*YIiOCN IL? ‘N ERSOR FOQ PEQSCN 183A 7 3 0 0 ***WWCH IL? N EPRCQ F@P PZRSCY 183A 6 -3 0 0 :*+*HHi)cH IL? ‘N EPROQ FOR PERSCv 1834 9 3 0 cl ‘***wIwcH IL? ‘N ERROR FOR PEQSCN 1834 11 3 0 0 ‘***WHr)CP IL? :N ERROR FOP PZPSCN 1834 12 3 0 0 ‘***WHI?CH IL? :N EPQQQ FOR PERSCN 183A 11 C 0 :***HHOCH fL3 N ERRCR FOP PERSCN 183A 12 0 -:a3 0 c**+QEVCH IL0 :N ERROQ FOR PERSCN 1834 I 4 3 0 :***QEVCH IL0 N ERQOR FOR PERSCN 183A 3 4 4 0

[Printout listing for household 183 A (partly illegible in reproduction): nineteen persons. Persons 1 to 12 are surnamed Roy and persons 13 to 19 Laplante; the legible entries give birthplace NB, religion RC and origin FR. See description attached.]

This case illustrates what we take to be a truly ambiguous case regarding the determination of the relationships among the household members. The ambiguities are indicated by the WHOCHIL? IN ERROR messages. They first indicate that the five Roy children, numbers 5 to 9, could reasonably be allocated to either of the two sets of parents listed above them, Lezzar and Mary Roy or Timothy and Mary Roy. The names and age differences between the children and these couples do not resolve the ambiguity. The program automatically assigns the children to the potential parents listed most immediately above them, arbitrarily but not unreasonably imposing a resolution. We did not alter this allocation.

There are four other WHOCHIL? IN ERROR messages, two each for persons numbered 11 and 12, Tureca (?) and Stephen Roy, aged 21 and 20. The first of the messages indicates that the previously mentioned couples could also be the parents of these two men, but the second message notes that they are listed immediately after Lucie Roy, aged 65, a widow (column 74). The age difference between Lucie Roy and these men (44 and 45 years) leads the program to conclude that she is their widowed mother. The program again assigns the two men to the closest previously listed potential parent or parents, in this case Lucie. The logic of the overall listing suggests to us that the allocations are appropriate. Thus the outcome of the program allocations in this case is a household in which the first couple is considered childless, or without children residing in the household, the second couple is considered to have five children living in the household, and Tureca (?) and Stephen Roy are taken to be younger sons of Lucie Roy.

Two additional messages, REVCHILD IN ERROR, are given for persons numbered 1 and 3, indicating that both of these men, Lezzar and Timothy, are possibly also sons of the widow, Lucie Roy. The name similarity and the age differences again are the basis for the message. The program's automatic allocation was not altered. The household is considered to have four surviving sons of Lucie living in it.
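The allocation rule described above (assign a child to the potential parents listed most immediately above) can be sketched as follows. This is only an illustration: the names ParentUnit and allocate_child are hypothetical, and the tests the original program used to decide who counts as a potential parent are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParentUnit:
    """A couple or widowed parent; position is the listing number of its last member."""
    label: str
    position: int


def allocate_child(child_position: int, candidates: list[ParentUnit]) -> Optional[ParentUnit]:
    """Return the candidate parent unit listed most immediately above the child."""
    above = [unit for unit in candidates if unit.position < child_position]
    if not above:
        return None
    return max(above, key=lambda unit: unit.position)


# Household 183 A as described in the text: two couples, then the widow Lucie Roy.
couples = [ParentUnit("Lezzar and Mary Roy", 2), ParentUnit("Timothy and Mary Roy", 4)]
widow = ParentUnit("Lucie Roy", 10)

child_5 = allocate_child(5, couples)            # first of the children numbered 5 to 9
son_11 = allocate_child(11, couples + [widow])  # persons 11 and 12 follow the widow
```

Under these assumptions child 5 goes to the second couple and person 11 to Lucie Roy, which matches the outcome reported for this household.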

[Reproduction of the coding form, page number 2. Legible column headings: Sample Number, Household Number, Family, Person Number, Last Name, Birth Place, Religion, Nation of Origin, Occupation. The coder's initials are illegible.]

[Printout listing for household 196 A1 101 (partly illegible in reproduction): person 1, William Lowe, M, 040, shopkeeper; then Sarah (F), Elizabeth (F, 009, NS) and Amelia; person 5, Lizzie McFie, servant. Error message printed below the listing: PL BIRTH IN ERROR FOR PERSON 196A, person 5.]

This case illustrates the results of a single coding error. The error message below the household listing, PL BIRTH IN ERROR, for person number 5, Lizzie McFie, indicates that the mnemonic code PL in columns 52 and 53 is not a valid birthplace code. The punched card and the coding form were checked, and the error was found on the form, as the copy included shows. A subsequent check of the microfilm located the source of the error. The coder had placed the correct code for religion in the birthplace columns.

For this case we provide a reproduction of the microfilm of the original manuscript record. It indicates the likely source of the coding error. The poor quality of the microfilm in this case, as in many others, is obvious, though somewhat exaggerated by reproduction. More specifically, the entry for religion was altered by the enumerator and is difficult to interpret. The coder translated the entry, probably correctly, as Presbyterian, scratched out and replaced by PCLP, meaning Presbyterian Church of the Lower Province, a denominational subheading used in the census abstracts of 1871. We had provided a valid code of PL for this subheading.

The error is indicated on the coding form and the record corrected.
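A check of the kind that produces PL BIRTH IN ERROR might look like the sketch below. The set of valid mnemonics is taken from the place-of-birth list in Part I of this appendix; the function name, the 80-column card string and the decision to ignore blank fields are assumptions made for illustration, not a description of the original program.

```python
# Mnemonic birthplace codes as listed in Part I of this appendix.
VALID_BIRTHPLACE = {
    "NG", "IL", "UC", "ON", "NB", "NS", "PE", "NE", "BC", "CA", "US", "FR",
    "EN", "WA", "SC", "IR", "AU", "GE", "BA", "PR", "BE", "DE", "NO", "SW",
    "DU", "GR", "IT", "PO", "RU", "JU", "NI", "HB", "AF", "EI", "HI",
}


def check_birthplace(card: str, person: int) -> list[str]:
    """Return an error message if columns 52-53 hold an unknown birthplace mnemonic."""
    code = card[51:53]  # columns 52 and 53, counting from 1
    # Assumption: a blank field is left to other checks and is not flagged here.
    if code.strip() and code not in VALID_BIRTHPLACE:
        return [f"PL BIRTH IN ERROR FOR PERSON {person}"]
    return []


# A hypothetical card image with the religion code PL punched in the birthplace columns.
bad_card = " " * 51 + "PL" + " " * 27
good_card = " " * 51 + "NS" + " " * 27
messages = check_birthplace(bad_card, 5)
```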

[Printout listing for household 196 A1 102 12 (partly illegible in reproduction): nine persons. Person 1 is John James Sawyer, M, 062, US, CE, EN, RETIR GENTLEMAN; persons 2 to 6 are Mary (F, 028, NS), Frances (F, 026), Alice (F, 025), Emily (F, 024) and Arthur (M, 022, bank clerk); persons 7 to 9 are servants: Catherine Jones (F, 030), Ellen Lonnergan, whose entries are visibly displaced ("F02 71 RR CI RSERVANT"), and Julia Publicover (F, 035, NS).]

***AGE IN ERROR FOR PERSON 196A 1 102 12 101 8 0
***PL BIRTH IN ERROR FOR PERSON 196A 1 102 12 101 8 0
***RELIGION IN ERROR FOR PERSON 196A 1 102 12 101 8 664 0 0
***NATION IN ERROR FOR PERSON 196A 1 102 12 101 8 85 0 0
***SEX IN ERROR FOR PERSON 196A 1 102 12 101 8 0 0 0

This page of the printout also illustrates error messages arising from mispunching. The messages given are AGE IN ERROR, PL BIRTH IN ERROR, RELIGION IN ERROR, NATION IN ERROR and SEX IN ERROR, all referring to person number 8, Ellen Lonnergan. A comparison of the printout with the coding form shows that the entries for all these variables have been punched slightly to the right of the correct columns.


[Printout listing for household 197 G1 1001 223 (partly illegible in reproduction): Joseph Graham (M, 050, NS, CE, IR, marked "*"), Catherine Jane (F, 037, SC), Alexandre R (M, 019, IR, marked "*"), Emma (F, 012), Lotoia (?) Susanna (F, 008) and John William (M, 005). The last two persons are both punched as number 5.]

****CANNOT SUBSTITUTE LONG FORM DATA 6 FOR PERSON 197G 1 1001 223 101 1001 NO SUCH CASE
****CANNOT SUBSTITUTE LONG FORM DATA 6 FOR PERSON 197G 1 1001 223 101 3003 NO SUCH CASE
****PERSON # IN ERROR FOR PERSON 197G 1 1001 223 101 6 5

This page illustrates two error messages, listed below the case. The first indicates that the program CANNOT SUBSTITUTE LONG FORM DATA for persons number 1 and 3, Joseph and Alexandre Graham. Both of these persons have a "*" in column 73 of the coding form and punch card, indicating that their occupations were entered on Long Name Forms. The program's initial search for a long name card matching the identification of these two was unsuccessful. The appropriate forms were located and cards punched. The second message, PERSON # IN ERROR, for person number 6, John William Graham, points to a punching error: two persons are given as number 5.
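The PERSON # IN ERROR check can be illustrated with a short sketch. It assumes that persons in a household are meant to be numbered consecutively from 1 in listing order; the function name is hypothetical and the original program's exact rule is not documented here.

```python
def check_person_numbers(numbers: list[int]) -> list[int]:
    """Return the listing positions (1-based) whose punched person number is out of sequence."""
    return [pos for pos, punched in enumerate(numbers, start=1) if punched != pos]


# Household 197 G1 as punched: the sixth person was also given number 5.
flagged = check_person_numbers([1, 2, 3, 4, 5, 5])
```

Here the sixth listing position is flagged, in line with the message reported for John William Graham.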


Appendix E: Part 3

Canadian Historical Mobility Project: Numeric and Mnemonic Codes for Place of Birth, Nation of Origin, Religion and Occupation, Census Data, 1861 and 1871

The following coding schemes were employed in coding the national sample from 1871 census manuscript data on microfilm (and the sample of census manuscript data for Essex and Kent Counties, Ontario in 1851, 1861 and 1871). The mnemonic codes for place of birth, nation of origin and religion were used in transcribing the data from microfilm to coding forms. Using mnemonic rather than numeric codes at this stage was intended to reduce coding error.

Numeric codes are employed on the SPSS file.

The codes include all mentions of place of birth, nation of origin, religion and occupation for all members of all households in the 1871 national sample and the 1861 and 1871 samples for Essex and Kent Counties, Ontario.

For the place of birth, nation of origin and religion codes, initial lists of mnemonic codes were taken from the census abstracts and provided for coders in the coding instructions. Provision was made for coding all other mentions in the course of coding (see p. 11 above in reference to "long name" coding and coding instructions). Subsequently, the full codes were constructed. The occupational coding is clearly the most complex and conceptually difficult. The religion codes were also given a general conceptual ordering in terms of church-sect status (for details, refer to the specific descriptions given below).

The occupational code dictionary consists of all occupational mentions in the three samples (currently excluding 1851 data for Essex and Kent) and corresponding numerical codes. Given that occupation as reported in the census is a critical variable in this study, as in most current quantitative historical analyses, we have created a quite complex multiple code. We have been informed by previous coding schemes, including Armstrong's work for British occupational-industry codes, based on Booth's early work, the work of the historians Hershberg, Katz, Blumin, Glasco and Griffin (The "5 Cities Study," Historical Methods Newsletter 7 [June 1973]), and the Philadelphia Social History Project's very elaborate coding scheme. The full description of the latter given in the Historical Methods Newsletter 9 (nos. 2 and 3, March-June 1976) was not available to us at the time our coding scheme was created. The logic of the two schemes is similar.

I. Four Digit and Two Character Mnemonic Codes for Place of Birth and Nation of Origin in the Canadian Census of 1871

General Coding Scheme, 1977. Revised and Expanded. See variable list.

First                                         Later   Mnemonic
Digit  Group                                  Digits  Code      Description

0      Missing & Illegible                    000     NG        Not given
                                              100     IL        Illegible

1      Upper Canada                           000     UC        Province as a whole
       Ontario                                000     ON
                                              xyz               District level code from 1871; xyz is the sequential code from 001 to 090 from the 1871 census

2      Lower Canada                           000               Province as a whole
       Quebec                                 000
                                              xyz               District level code from 1871; xyz is the sequential code from 091 to 173 from the 1871 census

3      All other Canada,                      000     NB        New Brunswick as a whole
       including Nfld. and P.E.I.             0yz               District level code from 1871; yz are the last two digits of the sequential code between 74 and 87, from the 1871 census
                                              100     NS        Nova Scotia (Terre Neuve) as a whole
                                              110               Cape Breton
                                              1yz               District level code; yz are the last two digits of the sequential code between 88 and 06, from the 1871 census
                                              200     PE        P.E.I.
                                              300     NE        Newfoundland
                                              400     BC        British Columbia
                                              500               Canada West
                                              600
                                              610               Red River
                                              700               Northwest
                                              710               Rupert's Land
                                              900     CA        Canada--Canadian

4      United States--American                000     US

5      France--French                         000     FR

6      United Kingdom & Ireland               000               Britain--British
                                              100     EN        England--English
                                              110               Great Britain
                                              200     WA        Wales--Welsh
                                              300     SC        Scotland--Scottish or Scotch
                                              400     IR        Ireland--Irish
                                              500               Guernsey
                                              510               Jersey
                                              520               Isle of Man
                                              530               Orkney

7      Other European,                        000     AU        Austria
       including Australia                    100     GE        Germany
                                              110     BA        Bavaria
                                              120     PR        Prussia
                                              130               Bohemia
                                              200     BE        Belgium
                                              300               Scandinavia
                                              310     DE        Denmark
                                              320     NO        Norway
                                              330     SW        Sweden
                                              340               Greenland
                                              400     DU        Holland--Dutch
                                              500     GR        Greece
                                              600     IT        Italy
                                              610               Sicily
                                              700     PO        Poland
                                              800               Portugal
                                              900     RU        Russia

8      Other European,                        000               Spain
       including Australia (continued)        100               Switzerland
                                              200               Australia
                                              900     JU        Jewish

9      All other, non-European                000     NI        Native Indian
                                              100     HB        Half-breed
                                              200     AF        Africa
                                              210               Cape of Good Hope
                                              300     EI        East India
                                              310               Ceylon
                                              320               Malta
                                              400     HI        Hindoo or Hindu
                                              500               West Indies
                                              510               Trinidad
                                              520               Jamaica
                                              530               Bermuda
                                              540               Mexico
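As a rough illustration of how the four-digit numeric code decomposes, the sketch below splits a code into its first-digit group and its later digits. The group labels are copied from the list above; the function name and the use of an integer to hold the code are assumptions for illustration.

```python
FIRST_DIGIT_GROUPS = {
    0: "Missing & Illegible",
    1: "Upper Canada / Ontario",
    2: "Lower Canada / Quebec",
    3: "All other Canada",
    4: "United States",
    5: "France",
    6: "United Kingdom & Ireland",
    7: "Other European, including Australia",
    8: "Other European, including Australia (continued)",
    9: "All other, non-European",
}


def split_place_code(code: int) -> tuple[str, int]:
    """Split a four-digit place code into (first-digit group, later three digits)."""
    if not 0 <= code <= 9999:
        raise ValueError(f"not a four-digit code: {code}")
    first, later = divmod(code, 1000)
    return FIRST_DIGIT_GROUPS[first], later


scotland = split_place_code(6300)  # first digit 6, later digits 300 (SC)
canada = split_place_code(3900)    # first digit 3, later digits 900 (CA)
```

The same split applies to birthplace and nation of origin, since both use this scheme.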

II. Three Digit and Two Character Mnemonic Codes for Religion in the Canadian Census of 1871

In general, an attempt was made to code religious affiliations according to their position on a dimension varying from established church to minority church to sect. The distinction varies as much with time as with religious affiliation; by 1871 many sects and minority churches were moving toward established status. Our primary sources of information have been:

S.D. Clark, Church and Sect in Canada (Toronto: University of Toronto Press, 1948) and David Millett, "The Age of Organized Religion," (unpublished manuscript, no date). We thank David Millett for making his manuscript available to us and Ted Mann for his comments.

The first digit of the code divides the religions into major groups. The second and third digits provide a detailed code for all mentions in the sample of religions which are known to be affiliated with the major church of column 1. The first digit is 9 for all "other" mentions, including those for which no known major affiliation was given. Codes in columns 2 and 3 are also loosely ordered in terms of increasing sectarianism, where this was given for the late nineteenth century in Millett. The size of the recorded congregation was used as a surrogate for this information in some cases: smaller congregations were assumed to be more sectarian. Many religious mentions, however, were not clearly classifiable, and they are listed after the known ones in alphabetic order.
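The three-digit structure just described can be sketched in the same way as the place codes. The function names below are hypothetical; the example codes 601, 607 and 801 are read from the religion list that follows (Canadian Presbyterian, Evangelical Union and Free Will Baptist).

```python
def split_religion_code(code: int) -> tuple[int, int]:
    """Split a three-digit religion code into (major group digit, detailed two digits)."""
    if not 0 <= code <= 999:
        raise ValueError(f"not a three-digit code: {code}")
    return divmod(code, 100)


def same_major_church(a: int, b: int) -> bool:
    """True when two religion codes share the same first digit."""
    return split_religion_code(a)[0] == split_religion_code(b)[0]


canadian_presbyterian = split_religion_code(601)
```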

All religious affiliation mentions are coded separately, unless they are clearly only different spellings. All spellings are given exactly as they were transcribed from the census manuscripts. Mnemonic codes are indicated in brackets for those codes used in the coding of the 1871 national sample of 10,000 households (and in the 1871 Essex and Kent County letter sample). All mentions of religions found in our coding of these 10,000 households are included in the overall list.

First                           Later   Mnemonic
Digit  Group                    Digits  Codes     Description

0      Missing & Illegible      00      NG        Not given
                                10      IL        Illegible

1      Catholic, Church of Rome 00      RC

2      Church of England        00      CE        Also E. Church
                                01                English
                                02      EP        Episcopalian; Episcopal; "Church of" (remainder illegible)

3      Church of Scotland       00      CS

4      Lutheran                 00      LU
                                10                Ev Lutheran; Evangst Lutheran

5      Methodist                00      ME
                                01      WM        Wesleyan Methodist; Wesleyan
                                02      EM        Episcopal Methodist; E Methodist; E Meth; Meth E
       From here, minority churches and sects in order of increasing sectarianism
                                10      BE        British Episcopal Methodist
                                11      NC        New Connexion
                                12      PM        Primitive Methodist
                                13      BC        Bible Christian Methodist
                                14      BB        Bible Believer
                                15      CM        Calvinistic Methodist
       From here, unclassifiable mentions
                                30                Dutch Meth
                                31                EGL Wesl
                                32                Evangel Meth; evangelical methodist; evangelist M; evangst meth
                                33                I meth E
                                34                I meth C
                                35                J meth E
                                36                Meth If E
                                37                Methodist H
                                38                Methodist N

6      Presbyterian             00      PS        Presbyterian
       From here, minority churches and sects in order of increasing sectarianism
                                01      CP        Canadian Presbyterian; Canada Pres; C Presb
                                02                Free Kirk; Free Church; F C Presb; F C Presbyterian; F Churc Presb; F Church; FE Presb; F Presb; F Presbyterian; Free Presb; Free Presby
                                03                K Presbyterian; Kirk; Kirk of Scotland; Presbyterian Kirk; Presb C S; Presb C of Scotland; S. Kirk; S. Pres; S. Presb; Sco Presby; Sco Presbyterian; Scoth Presby; Scotch Presbyter; Scotch Presbyterian
                                04                United Presbyterians; United Presb; Un presb; U Presbyterian; U Presb; U P Presb; U Kirk Presb
                                05      AP        American Presbyterian; U S Presb
                                06                Reformed Presbyterian; R Presb
                                07                Evangelical Union
       From here, unclassifiable mentions
                                30                ES Presb; Est Pres
                                31                Irish Pres
                                32                N Presbyterian
                                33                Old Presbyterian; Old Presbyterian Kirk
                                34                Presb N A
                                35                W Presbyterian

7      Congregationalists       00      CO

8      Baptist                  00      BA
       From here, minority churches and sects in order of increasing sectarianism
                                01      FW        Free Will; P Baptist; FWC Bapt; FWC Baptist; Free Christian
                                02      RB        Reformed Bapt; Reform Baptist; RF Baptist; R Bapt; R Baptist
                                03                Regular Baptist; Regl Baptist
                                04      UN        Union Baptist
                                05      CC        Christian Conference Baptist; Christian Conf; Christian Bap
                                06      AA        African Association Bapt
       From here, unclassifiable mentions
                                30                Baptist Christian
                                31                C Baptist; C Bapt
                                32                CM Bapt; CM Baptist
                                33                Cal Bap; Cal Bapt; Cal-st Baptist; Calvin Baptist
                                34                Close Com Baptist
                                35                First Baptist
                                36                Lu C Baptist
                                37                N Bapt
                                38                Open C Baptist
                                39                Second Advent Baptist

9      Other                    00                Amish; Omish
                                01      AD        Adventists
                                02      SD        Seventh Day Adventists
                                03                Apostotic (sic)
                                05                Bethern
                                06      CB        Christian Brethern
                                07      PB        Plymouth Brethern
                                08      UB        United Brethern
                                10                Christian
                                11                Church of Christ; C of Christ
                                12                Christian Delp
                                13                Church of God
                                15                Dain Ward
                                16      DI        Disciple; Disciple of Christ
                                17                Dunkers
                                20      EV        Evangelical
                                21                Evangelist
                                22      EA        Evangelical Assoc
                                24                German Episcopal
                                25                Independent
                                26      IR        Irvingites
                                27      TJ        Latter Day Saints
                                28      MS        Messiah
                                30      MN        Mennonites
                                31      MO        Mormon
                                32                Mnece; Munice
                                33                NSB Assoc
                                34      SW        New Jerusalem C; New Jerusalem; New Jerusalem Ch; New Jerusalem Church; New Jltusalem; Swedenborgians
                                35      QU        Quaker; Friend; Friends
                                36                Prot Cong Zion; Protest Congr; Protest Congrega; Congt Protest
                                37      PR        Protestant
                                38      TU        Tunkers
                                39      UN        Uniterian
                                40      w         Universalists
                                50                Greek; Greek Orthodox
                                60                Mahometan
                                70      JU
                                71                Hebrew; Hebrew Church
                                72                Reformed Jew
                                80      NR        No religion; No Denomination; No Sect
                                81      AT        Atheist
                                82                Free Thinker; Free Thinker of England
                                83                Materialist
                                84      PA        Pagan
                                85                Infidele
                                90      DE        Deist
                                91                Spirtulist; Spirituecist

III. Occupational Dictionary: Eight Digit Codes for Occupation and Industry in the Canadian Census of 1871 and in the Censuses of Essex and Kent Counties, Ontario in 1861

The occupational dictionary for the pilot project classifies every occupation mentioned either in the 1871 main file or the Essex-Kent 1861 and 1871 files. It provides an eight digit code constructed as follows:

Cols. 1 - 2: A Detailed Industrial Classification
Col. 3: Occupational Class Position
Cols. 4 - 6: Detailed Occupational Codes
Col. 7: Vertical Status Code
Col. 8: Degree of Difficulty
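The column layout above can be illustrated by splitting an eight-digit code into its five fields. This is a minimal sketch: the class and field names are chosen here for illustration (the dataset's own variable labels, where given in this appendix, are INDUS, STATUS and OCPROB), and the example code 61420192 is the one quoted for the mention "oale-" in the description of Col. 8.

```python
from typing import NamedTuple


class OccupationCode(NamedTuple):
    indus: int       # cols. 1-2: detailed industrial classification
    occ_class: int   # col. 3: occupational class position
    detail: int      # cols. 4-6: detailed occupational code
    status: int      # col. 7: vertical status code
    difficulty: int  # col. 8: degree of difficulty


def parse_occupation_code(code: str) -> OccupationCode:
    """Split an eight-digit occupational code string into its component fields."""
    if len(code) != 8 or not code.isdigit():
        raise ValueError(f"expected eight digits, got {code!r}")
    return OccupationCode(
        indus=int(code[0:2]),
        occ_class=int(code[2]),
        detail=int(code[3:6]),
        status=int(code[6]),
        difficulty=int(code[7]),
    )


parsed = parse_occupation_code("61420192")
```

Keeping the code as a string rather than an integer preserves leading zeros, which matter for industries numbered below 10.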

Cols. 1 - 2: A DETAILED INDUSTRIAL CLASSIFICATION

The detailed industrial classification (variable label INDUS) is adapted from Armstrong's more detailed occupational allocation for all occupations separately distinguished in the English 1861 national occupational census abstract. Armstrong employed Booth's occupational list for the majority of the mentions classified. This is, clearly, a "functional" classification in Katz's sense (1972). The following shows the actual INDUS codes:

1-6. PRIMARY SECTOR

    1. Farming
    2. Other Agriculture
    3. Logging
    4. Fishing
    5. Hunting

8. MINING SECTOR (inc. quarrying and well drilling)

10. BUILDING SECTOR

20-39. MANUFACTURE SECTOR

    20. Machinery and tools (makers)
    21. Shipbuilding
    22. Metal workers
    23. Watches, instruments and toys
    25. Earthenware, inc. brickmakers
    26. Coals, gas, chemicals
    30. Furs, leather, glue, tallow, etc.
    31. Wood workers, inc. furniture and paper
    32. Carriages and harness
    33. Printing and bookbinding
    35. Textiles
    36. Dress and textile products
    37. Food, inc. drink and tobacco
    39. Unspecified

40-49. TRANSPORT SECTOR

    40. Navigation
    41. Warehouses and docks
    42. Railways
    43. Roads

50-59. DEALING SECTOR

    50. Raw materials, inc. fuels
    51. Textiles, inc. textile products
    52. Food, tobacco and all spirits
    53. Furniture, utensils, and stationery
    54. Hotels and lodging and restaurants
    55. Other dealers
    59. Unspecified

60-69. BUSINESS, GOVERNMENT AND PROFESSIONAL SERVICE SECTOR

    60. Banking, insurance, accountancy
    61. Public administration, communication, army, navy, police and prisons
    63. Law and Medicine
    64. Art and amusement
    65. Literature and science, inc. newspapers
    66. Education
    67. Religion
    69. Unspecified

70. DOMESTIC AND PERSONAL SERVICE SECTOR

80. INDUSTRY NOT KNOWN (inc. labourers)

90-99. RESIDUAL

    91. Property owning and independent (inc. gentlemen)
    92. Students
    93. Other

Col. 3: OCCUPATIONAL CLASS POSITION

Occupational class position is an attempt to provide a relatively detailed classification in terms of the most likely implications which occupational titles have in terms of four main criteria of social class as distinguished from status or prestige. The criteria used for determining occupational class position are:

(1) Is the "occupational title" in or out of the labour force? E.g., lawyer and law student; seamstress and mother.

(2) Titles which clearly imply property ownership are coded as Merchants, Manufacturers, Agents & Dealers (code=1---). We include farmers as a separate, second category of property owner. Clearly there is some unavoidable error in such a classification. Examples of property owners are manufacturer, miller, hotel keeper, etc. Note that we specifically exclude from this code occupations which could be either small manufacturers-owners or skilled, artisanal "makers" employed by others. This is a large group of nineteenth-century occupational titles, many of which were likely used interchangeably when the position of a person slipped from one category of artisan to the other. We do distinguish all occupational mentions most likely subject to this petit bourgeois/artisan slippage in terms of a specific range of detailed codes (see below, cols. 4-6). There are reasons to be able to examine this group separately in analysis given the ambiguity of the class position which specifically characterizes them.

Within the non-propertied occupation titles, we attempt to employ the following three criteria: (a) probable skill level of the occupation, (b) the nature of the work process implied, and (c) the level of authority, i.e., directing others' work as a primary aspect of the occupation. These lead to six separate occupational class codes. The actual codes and criteria are as follows:

- Professionals, Managerial and Supervisory Occupations (code=2---) includes occupations which imply specialized formal training of any kind (doctors, lawyers, career soldiers, for example), and which suggest self-employment. It also includes those who are not self-employed but who direct others' work as a primary aspect of the job: managers, foremen, superintendents, bailiffs, inspectors. The two types are distinguished by different detailed codes, cols. 4-6, see below.

- White Collar (code=3---) occupations entailing administrative, clerical and technical work as an employee, i.e., not implying direct control of others' work as a major aspect of the job.

- Artisanal (code=4---) occupational titles imply a high level of specialised skill and likely degree of autonomy in the work-process. Many of these occupations are further classified as ambiguously petit bourgeois/artisanal in the detailed codes, cols. 4-6, as mentioned. (Equivalent to "5 Cities Study," Historical Methods Newsletter, June 1973; Category III).

- Semi-Skilled and Unskilled (code=5---), with the exception of labourer as a specific occupational title. These "blue-collar" occupations imply little to moderate skill levels, tedious and physically demanding work-processes and minimum authority or autonomy in the work. Examples of semi-skilled are barbers, drivers, lumbermen and shantymen; unskilled are messengers, operatives, miners. (Equivalent to "5 Cities Study," Category IV).

- Labourer (code=6---) is retained as a separate code on the grounds of the particular insecurity and hardship characterizing common labour in the nineteenth century.

- Servants (code=7---) also retained as a separate code due to the special interest in the class implications of servant employment.

- Farmers (code=8---) are also retained as a separate code due to their predominance in the nineteenth-century.

- Outside Labour Force (code=9---) denotes those occupation titles that could not be placed in the above schema (e.g. gentlemen, students).

NOTE: A small number of occupations (<0.1%) were miscoded and could be treated as "illegible" or "missing."

Cols. 4 - 6: DETAILED OCCUPATIONAL CODES

Detailed occupational codes are individual codes for each separate mention of a different occupation. Col. 6 is reserved for apparently synonymous occupational titles in French or English, but misspellings (Farner for Farmer, armer for Farmer, etc.) are given the same code as the correct spelling.

These codes incorporate several important, additional distinctions.

I - even numbers were used for English titles, odd for French, with one exception; for dealers (as coded in terms of industry) we applied a special convention to facilitate analysis of this category. The convention is, for all occupational titles with the word "dealer" itself in them the 3 digit detailed code ends with a 5, those with the title "merchant" end with a 3, those with "marchand" end with a 4. (Note this is the only inconsistency in the general rule even = English, odd = French).

The object is to be able readily to separate the specifically labelled dealers and merchants from other occupations. We assume this is the majority of dealing occupations, although obviously others are also dealers (e.g., miller).
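The last-digit conventions can be sketched as a small function. It assumes that within the dealing sector (INDUS 50 to 59) the endings 5, 3 and 4 always mark "dealer", "merchant" and "marchand" titles, and that the endings 8 and 9 mark apprentices, as stated in the coding procedures further on. The function name is hypothetical, the carpenter codes are those listed in the coding procedures, and the industry number 10 given to them here and the dealing-sector code 105 are illustrative assumptions only.

```python
DEALING_SECTOR = range(50, 60)  # INDUS codes 50-59
DEALER_ENDINGS = {5: "dealer", 3: "merchant", 4: "marchand"}


def describe_detail_code(indus: int, detail: int) -> str:
    """Describe what the last digit of a three-digit detailed code implies."""
    last = detail % 10
    if indus in DEALING_SECTOR and last in DEALER_ENDINGS:
        return DEALER_ENDINGS[last]
    # General rule: even last digit = English title, odd = French title.
    language = "English" if last % 2 == 0 else "French"
    if last in (8, 9):
        return f"{language} apprentice"
    return language


carpenter = describe_detail_code(10, 280)      # Carpenter
charpentier = describe_detail_code(10, 281)    # Charpentier
app_menuisier = describe_detail_code(10, 289)  # App menuisier
```

Because the dealer endings override the even/odd rule, language cannot be read from the last digit for those titles without also checking the industry code.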

II - to distinguish between professionals and managerial/supervisory occupations within category 3 of the occupational class code above; all professional occupations are given 3 digit detailed codes 000 - 599; supervisory/managerial 600 - 799.

III - to distinguish those occupations which implied small property owners or ambiguous and variable petit bourgeois/artisanal occupations. The detailed 3 digit codes 800-999 were used in all cases where it was thought the occupational title was even possibly in this category. This coding should be considered only with other sources, such as city directories.

IV - for the general coding procedures see Detailed Coding Procedures (below), which include several additional conventions to ensure logically ordered and readily recodable detailed codes for every distinct occupational mention.

Coding Procedures for Detailed Occupational Codes

a. Within each industry, distinct occupations are numbered using the round numbers, i.e., 000 for the first, 010 for the second, 020 for the third, etc. The first three corresponding French occupations are then coded 001, 011, 021, etc.

b. When there are very similar titles, e.g., shingle cutter, shingle maker and shingle weaver, they receive consecutive (even, because they are English) numbers, in this case, 320, 322, 324. Since there are no corresponding French occupations, the numbers 321, 323 and 325 are not used.

c. The last digits 8 and 9 are used only for apprentices in English and French. The listings for carpenters are as follows:

    Carpenter        280
    Charpentier      281
    Menuisier        283
    House carpenter  284
    Shop carpenter   286
    App carpenter    288
    App menuisier    289

d. When there is a difficulty in translation, as above, there being both "charpentiers" and "menuisiers" for carpenters, the titles are grouped together. Occasionally more than ten numbers are required to include a whole group of similar occupations, in which case the numbers over a sequence of 20 are used.

e. In general, within industrial code (cols. 1-2) numerical gaps are left between dissimilar occupations and a few more detailed occupational groupings are provided for obvious differences, e.g., in Industry 63, Law and Medicine, codes 000 to 099 are used for legal occupations, 100 to 199 for medical occupations and "the" phrenologist has a number 200; in Industry 40, the Navigation industry, codes 000 to 099 are for sailors and other ship workers, 100 to 199 are for sea captains, pilots and other "managerial" occupations.

Col. 7: VERTICAL STATUS CODE

This code (variable label STATUS) is the more conventional "vertical" status code. We have adopted Michael Katz's code (for Hamilton, Ontario 1851-1861) directly, which includes status groups ranked 1 to 5, with 6 reserved for "unclassifiable" (Katz classified all female occupations 6). We have used Katz's detailed occupational listing and follow it exactly. All occupational mentions in our data not actually given in Katz's listing are coded 9, unclassifiable mentions. Note that this Katzian code places Farmers within the ranks of "White Collar" occupations.

Col. 8: DEGREE OF DIFFICULTY

This column is reserved for a subjective assessment made by the coders of the degree of difficulty (variable label OCPROB) in making a specific allocation. It refers only to problems of legibility or deciphering the actual mention as given on the original microfilm manuscript data. We used direct transcriptions of these data, but where a letter could not be made out reasonably, a blank space was coded, e.g., -oale.

The codes are:
0 = no problem
1 = some difficulty
2 = considerable difficulty

For those mentions for which reasonable guesses could be made, a classification was given, e.g.,
Coaler is coded 61420190
oale- is coded 61420192

com marchand (merchant's clerk) is coded 59340120
Canie marchand is coded 59340122.

Note: There remains, of course, a truly ambiguous category of either un- decipherable or uninterpretable occupational mentions.

The general convention was to minimize the ambiguous category by reasonable guesses. Of a total of over 1,000 occupational mentions in the three files fewer than 50 were counted as truly ambiguous.

COMMENTS:

All the occupational coding was carried out by the two principal investigators and by Bruce Bellingham, a research assistant, working on aspects of the social history of Essex and Kent counties.

The procedure first punched all occupational mentions on separate computer cards with an appropriate identification code. Each mention was then given the full occupational code. Then for each different occupational code a definition card was punched as:

Cols. 1-4 = [illegible]
Cols. 25-32 = the eight character occupational code
Cols. 55-80 = occupational title

The final occupational dictionary consists of the definition cards with the (possibly several) distinct original mentions filed after them. A computer program written for the purpose searches the file for duplicated codes and misclassified cards.
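The duplicate-code search mentioned above could be sketched as follows, using the card columns given for definition cards (code in cols. 25-32, title in cols. 55-80). This is an illustration only, not the original program: the function names, the helper make_card and the title "TITLE B" are hypothetical, and the check for misclassified cards is not attempted.

```python
from collections import defaultdict


def parse_definition_card(card: str) -> tuple[str, str]:
    """Extract (occupational code, occupational title) from an 80-column card image."""
    card = card.ljust(80)
    return card[24:32], card[54:80].strip()  # cols. 25-32 and cols. 55-80


def duplicated_codes(cards: list[str]) -> dict[str, list[str]]:
    """Return every code that appears on more than one definition card, with its titles."""
    titles = defaultdict(list)
    for card in cards:
        code, title = parse_definition_card(card)
        titles[code].append(title)
    return {code: found for code, found in titles.items() if len(found) > 1}


def make_card(code: str, title: str) -> str:
    """Hypothetical helper: build a card image with the code and title in place."""
    return " " * 24 + code + " " * 22 + title


cards = [
    make_card("61420190", "COALER"),
    make_card("61420190", "TITLE B"),  # hypothetical second card with the same code
    make_card("59340120", "COM MARCHAND"),
]
duplicates = duplicated_codes(cards)
```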

OCCUPATIONAL CLASSIFICATION: 19th Century

1. Merchants, Manufacturers, Agents and Dealers (property-owners and dealers), OCCUP codes 1000 to 1999

2. Professionals, Managerial and Supervisory Occupations, OCCUP codes 2000 to 2999

3. White Collar, OCCUP codes 3000 to 3999

4. Artisanal (Five cities study, category III), OCCUP codes 4000 to 4999

5. Semi-Skilled and Unskilled (Five cities study, category IV), OCCUP codes 5000 to 5999

6. Labourer, OCCUP codes 6000 to 6999

7. Servants, OCCUP codes 7000 to 7999

8. Farmers (also property-owning), OCCUP codes 8000 to 8999

9. Outside Labour Force (e.g. gentlemen, students), OCCUP codes 9000 to 9999

0. Occupation Blank, OCCUP code 0

Miscoded, OCCUP code 1 to 999

The Design of the Sample from the 1871 Canadian Census

The 1871 sample combines two separate samples: a stratified sample of all households in all four provinces of Canada; and a “two-stage” sample of households that included at least one person of a particular national origin in a particular province. The two are referred to, respectively, as the “main” and “special” samples and are described in turn.

The Main Sample

This is a stratified sample of all households from the entire census. The population is stratified by province and, within provinces, between urban (defined as communities of 3,000 or more) and non-urban areas, so there are eight strata (four provinces by urban/non-urban). Table 1 gives the population and number of households in each stratum, obtained from Volume 1 of the published Census. Table 4 (photocopied from the original documentation) gives the places identified as “urban” in the sample design. In Table 1, note that the estimated population is slightly different from the population recorded in the 1871 Census volumes; for example, the estimated population for non-urban Ontario (in the first row) is 240,483, versus the published figure of 243,568. This discrepancy arises from sampling, and the sample weights therefore involve post-stratification to remove these small errors. This assumes, of course, that the published Census is an exactly correct count of the census returns.
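The figures in Table 1 appear consistent with a simple reading of the post-stratification step: the correction equals the published count divided by the number of selections, and the sample-based estimate equals the number of cells (half the selections, at two selections per cell) times the mean cell size. The sketch below checks that reading against the non-urban Ontario row; it is an inference from the table rather than a documented formula, and the function names are hypothetical.

```python
def post_stratification_weight(published_count: int, selections: int) -> float:
    """Weight that makes the selections in a stratum sum to the published count."""
    return published_count / selections


def estimate_from_cells(selections: int, mean_cell_size: float, per_cell: int = 2) -> float:
    """Stratum size implied by the number of cells and their mean size."""
    return selections / per_cell * mean_cell_size


# Non-urban Ontario row of Table 1: published 243,568; 1750 selections; mean cell 274.84.
weight = post_stratification_weight(243_568, 1750)
estimate = estimate_from_cells(1750, 274.84)
```

The computed weight rounds to the tabulated correction of 139.182, and the cell-based estimate falls within rounding distance of the tabulated 240,483.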

In order to increase the precision of comparisons between the two Atlantic provinces and Ontario and Quebec, the former were sampled with higher probability. Also urban areas (which included a minority of the population in 1871) were sampled with higher probability. Table 3, which is appended, is from the original sample design report and describes the urban sample in detail.

Because of the unequal probabilities of selection, weights are required to obtain unbiased estimates of population characteristics. Different weights are required to make optimal use of the data for estimates of characteristics of the entire population, for each province separately, for urban and rural areas, and for the combination of urban and rural areas. Table 1 also shows the value of the variable PROVURB in the dataset, which serves to identify the eight strata.

Reflecting the paired selection of households, a second factor figures into the weights. The actual sample was drawn by dividing the population, within each of the 206 districts, into “cells” consisting of sequences of consecutive households in the census microfilms; that is, the districts were put in order, then subdistricts, divisions and households. From each cell, two selections were made for the main sample. There were a small number of errors in the selection of observations in cells (about 40 errors in 10,000 households), in which one or three cases (not two) are in the sample. These errors are corrected by changes in the weights: in one-case cells the weight was doubled; in three-case cells the weights were multiplied by a factor of two-thirds.
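The correction for mis-selected cells can be sketched in a few lines. This is only an illustration of the rule stated above (two intended selections per cell); the function name is hypothetical and is not part of the dataset or its documentation.

```python
def cell_weight_factor(n_selected: int, intended: int = 2) -> float:
    """Factor applied to a household weight when a cell holds
    n_selected cases instead of the intended two.

    Hypothetical helper: it only restates the documented rule
    (one-case cells doubled, three-case cells times two-thirds).
    """
    if n_selected <= 0:
        raise ValueError("a cell must contain at least one selected household")
    return intended / n_selected


for n in (1, 2, 3):
    print(n, cell_weight_factor(n))
```

A correctly drawn cell keeps a factor of 1, so the correction only touches the small number of cells with one or three selections.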

There is a complication in considering the data as a sample of individuals, rather than a sample of households. The dataset includes every person in every selected household; in other words, for persons in selected households the probability of selection is one. So, the weights for the household sample can simply be applied to each individual in a household. The attractiveness of the stratified sample of households is that errors derived from it are no larger than would be obtained from a simple random sample; in more technical terms, the “design effect” is not larger than one. The same is not true, however, of the resulting sample of persons in households, which is a cluster sample, so that the given weights may result in standard errors (as given by SAS, SPSS and other programmes that assume simple random sampling) that underestimate the true values.
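To give a rough sense of how much the person-level standard errors might be understated, one can use the common approximation for the design effect of a cluster sample, deff = 1 + (m - 1) * rho, where m is the average cluster (household) size and rho is the intraclass correlation. This approximation is not taken from the documentation, and the household size and correlation below are purely illustrative assumptions.

```python
import math


def approximate_design_effect(mean_cluster_size: float, intraclass_corr: float) -> float:
    """Textbook approximation deff = 1 + (m - 1) * rho for equal-sized clusters."""
    return 1.0 + (mean_cluster_size - 1.0) * intraclass_corr


def adjusted_standard_error(naive_se: float, mean_cluster_size: float,
                            intraclass_corr: float) -> float:
    """Inflate a simple-random-sampling standard error by sqrt(deff)."""
    deff = approximate_design_effect(mean_cluster_size, intraclass_corr)
    return naive_se * math.sqrt(deff)


# Illustrative values only: 5 persons per household, rho = 0.1.
print(approximate_design_effect(5, 0.1))
print(adjusted_standard_error(0.010, 5, 0.1))
```

Variables that are nearly constant within a household (for example, religion or origin) have a high intraclass correlation and are the ones most affected.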

Table 1

                             Value of   Population,               Mean Number   Population,      Post-Strat-
                Urban-       Variable   Published    Number of    of Cases in   Estimated from   ification
Province        ization      PROVURB    Census       Selections*  the "Cell"    the Sample       Correction

Ontario         Non-urban    10         243,568      1750         274.84        240,483          139.182
                Urban        11          43,450      1170          74.75         43,728           37.137

Quebec          Non-urban    20         151,395      1110         269.82        149,749          136.392
                Urban        21          29,220       792          74.52         29,512           36.893

New Brunswick   Non-urban    30          37,348       958          71.12         34,021           38.985
                Urban        31           6,231       286          41.90          5,992           21.787

Nova Scotia     Non-urban    40          53,415      1086          97.94         53,185           49.185
                Urban        41           9,086       360          50.02          9,013           25.239

Total                                   573,713      7512

* corrected for a small number of cases where one or three, rather than two, selections were made in a cell
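As a reading aid for Table 1, the short check below compares the last column with the published count divided by the number of selections. That relationship is an observation from the tabulated figures, not a formula stated in the documentation, and the names used are hypothetical.

```python
# (published count, number of selections, tabulated correction) keyed by PROVURB
TABLE_1 = {
    10: (243_568, 1750, 139.182),
    11: (43_450, 1170, 37.137),
    20: (151_395, 1110, 136.392),
    21: (29_220, 792, 36.893),
    30: (37_348, 958, 38.985),
    31: (6_231, 286, 21.787),
    40: (53_415, 1086, 49.185),
    41: (9_086, 360, 25.239),
}


def published_per_selection(published: int, selections: int) -> float:
    """Published stratum count represented by each selection."""
    return published / selections


for provurb, (published, selections, tabulated) in TABLE_1.items():
    ratio = published_per_selection(published, selections)
    print(provurb, round(ratio, 3), tabulated)
```

The computed ratios agree with the tabulated corrections to within about 0.001 in every stratum.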

Table 4: Districts with 15% or More German, as compiled from the Census of Canada 1870-71, Vol I, Table I (pp. 2-83), and Table III Nation of Origin (pp. 252-333)

Census                             Number of  Total       Proportion  Total       Sampling  No. Hhlds      No. Hhlds
District No.   District Name       Germans    Population  of Germans  Households  Fraction  Selected in    Selected in
                                                                                            First Stage    Second Stage

Ontario
5              Elgin W             1138       12796       .089        2296        .1695     392            36
6              Elgin E             3512       20870       .168        4024        .1695     686            93
11             Norfolk S           2843       15370       .185        2860        .1695     480            [illegible]
12             Norfolk N           2541       15390       .165        2837        .1695     480            [illegible]
17             Haldimand           3357       20091       .167        3510        .1695     595            107
18             Monck               5628       15130       .372        2903        .1020     296            98
19             Welland             5916       20572       .288        3856        .1058     408            134*
21             Lincoln             4844       20672       .234        3795        .1267     481            127*
[illegible]    Wentworth S         3957       14638       .270        2629        .1695     450            114
[illegible]    Bruce               5525       31332       .176        5270        .0453     247            39
30             Perth N             5543       25377       .218        4355        .0453     202            43
31             Waterloo S          8892       20995       .424        3591        .0453     170            71
32             Waterloo N          13158      19256       .684        3222        .0453     144            59
[illegible]    Prince Edward       4866       20336       .239        3780        .0453     182            [illegible]
60             Hastings W          2764       14365       .192        2616        .0453     132            [illegible]
[illegible]    Lennox              4649       16396       .284        2983        .0453     143            [illegible]
[illegible]    Addington           5453       21312       .256        3681        .0453     169            44
71             Dundas              5563       18777       .296        3139        .0453     143            42
72             Stormont            2220       11873       .187        1936        .0453     91             21
83             Nipissing           266        943         .282        225         .0453     26             8
               Total               92635      356491      .2599       63508       .09317    5917           1335

Nova Scotia
195            Lunenburg           16612      23834       .697        3681        .0453     180            120
197            Halifax             3425       19955       .172        3273        .0453     158            28
               Total               20037      43789       .4576       6954                  338            148

               Grand Total         112672     400280      .28148      70462                 6255           1483

*In the District of Monck, Subdistricts 18 a-d; the District of Welland, Subdistricts 19 a-f; the District of Lincoln, Subdistricts 21 a and 21 c-f, the sampling fraction was .1695. In the remaining subdistricts, 18 e-g, 19 g-l, and 21 b, the sampling fraction was .0453.

Table 5: Districts in Ontario and New Brunswick with 15% or more French Nation of Origin, as compiled from the Census of Canada 1870-71, Vol I, Table I (pp. 2-83) and Table XI (pp. 252-333)

Census                             Number of  Total       Proportion  Total       Sampling  No. of Hhlds   No. of Hhlds
District No.   District Name       French     Population  of French   Households  Fraction  Selected in    Selected in
                                                                                            First Stage    Second Stage

Ontario
1              Essex               10539      32697       .322        6036        .10627    606            166
75             Prescott            9623       17647       .545        2779        .10627    300            135
[illegible]    Russell             5600       18344       .305        2964        .10627    319            98
[illegible]    Nipissing, S        151        943         .160        225         .10627    16             0
84             Nipissing, N        207        848         .183        153         .10627    13             5
88             Algoma, E           255        977         .261        219         .10627    10             3
89             Algoma, C           536        2177        .246        418         .10627    44             10
               Total               26911      73633       .36547      12794       .10224    1308           417

New Brunswick
181            Victoria            7104       11641       .617        1788        .036232   75             40
182            Restigouche         1143       5575        .205        876         .036232   36             9
183            Gloucester          12680      18810       .674        2564        .036232   113            73
185            Kent                9356       19101       .560        2917        .036232   120            68
186            Westmoreland        1071       29335       .460        4766        .036232   192            55
               Total               41064      84462       .42819      12911       .041515   536            245

Table 6: Districts of Quebec with 15% or More non-French Nation of Origin, as compiled from the Census of Canada, Vol I, Table I (pp. 2-83) and Table II (pp. 252-333)

Census                             Number of  Total       Proportion  Total       Sampling  No. of Hhlds   No. of Hhlds
District No.   District Name       French     Population  of French   Households  Fraction  Selected in    Selected in
                                                                                            First Stage    Second Stage

91             Pontiac S           3195       14,591      .219        2319        .04762    112            87
[illegible]    Pontiac N           260        1,219       .213        207         .04762    9              5
[illegible]    Ottawa W            11531      23,794      .485        3895        .04762    182            92
94             Ottawa C            2929       5,282       .555        825         .01988    18             [illegible]
95             Ottawa E            7054       9,553       .738        1499        .01988    30             [illegible]
96             Argenteuil          3902       12,806      .305        2109        .01988    46             32
101            Montcalm            10794      12,742      .847        2073        .01988    [illegible]    [illegible]
107            Hochelaga           20224      25,640      .789        3680        .01988    [illegible]    [illegible]
112            Chateauguay         11288      16,166      .698        2602        .01988    54             [illegible]
113            Huntingdon E        2383       8,834       .300        1493        .01988    30             [illegible]
114            Huntingdon W        2541       7,470       .340        1167        .01988    24             [illegible]
117            St. Jean            9415       12,122      .777        1948        .01988    44             15
125            Missisquoi          7114       16,922      .420        3022        .01988    60             36
126            Brome               3471       13,757      .252        2448        .01988    54             [illegible]
127            Shefford            12683      19,077      .665        3363        .01988    [illegible]    [illegible]
136            Drummond            10487      14,281      .734        2339        .01988    [illegible]    [illegible]
138            Richmond            3718       11,213      .332        1850        .01988    35             [illegible]
140            Sherbrooke          3544       8,516       .416        1388        .04762    40             [illegible]
141            Stanstead           3212       13,138      .245        2555        .04762    126            [illegible]
142            Compton             3785       13,665      .277        2376        .01988    [illegible]    [illegible]
144            Comté de Québec     14681      19,607      .749        3091        .01988    [illegible]    13
155            Lotbinière          17340      20,606      .841        3129        .04762    154            [illegible]
156            Mégantic            12074      18,879      .640        2827        .04762    140            [illegible]
159            Dorchester          7872       9,564       .823        1446        .01988    30             4
169            Bonaventure         9545       15,923      .599        2369        .01988    48             17
171            Gaspé C             2396       5,278       .454        843         .01988    18             6
172            Gaspé S             4897       7,296       .671        1005        .01988    20             6
               Total               202335     357,931     .56529      57868                 1620           769

Appropriate Weights, for Different Analytical Goals

In order to obtain national estimates which yield population counts (i.e. one obtains estimates of the actual numbers in the population), use the weight POPWGT.

In order to obtain national estimates which yield numbers of observations close to those in the sample (i.e. crosstabulations and other tables reflect the sample size, approximately), use the weight SAMPWGT.

POPWGT and SAMPWGT can be used for all kinds of analysis and, because the sample is large, are generally sufficient. They do not, however, maximize one's ability to detect differences between provinces and between urban and non-urban areas, because they do not take account of the higher sampling ratios in urban areas and in the two Maritime provinces. The next three weights are designed to make use of this property of the sample.

In order to make comparisons between provinces (providing numbers of observations approximately equal to the actual sample sizes), use the weight PROVWT.

In order to make comparisons between urban and non-urban areas (providing numbers of observations approximately equal to the actual sample sizes) use the weight URBWT.

In order to make comparisons between urban and non-urban areas within provinces (providing numbers of observations approximately equal to the actual sample sizes) use the weight PRURBWT.
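The choice of weight can be summarized as a lookup, followed by a weighted tabulation. The sketch below is illustrative only: the weight names come from the text above, but the goal labels, the record layout (a list of dictionaries) and the weight values are made-up assumptions.

```python
from collections import defaultdict

# Weight variable recommended in the text for each analytical goal.
WEIGHT_FOR_GOAL = {
    "national population counts": "POPWGT",
    "national estimates at sample size": "SAMPWGT",
    "comparisons between provinces": "PROVWT",
    "urban versus non-urban": "URBWT",
    "urban versus non-urban within provinces": "PRURBWT",
}


def weighted_counts(records, weight_var, by):
    """Sum the chosen weight within each category of the variable `by`."""
    totals = defaultdict(float)
    for record in records:
        totals[record[by]] += record[weight_var]
    return dict(totals)


# Hypothetical records with invented weight values.
records = [
    {"PROVURB": 10, "POPWGT": 2.0},
    {"PROVURB": 10, "POPWGT": 3.0},
    {"PROVURB": 11, "POPWGT": 0.5},
]
weight = WEIGHT_FOR_GOAL["national population counts"]
print(weighted_counts(records, weight, by="PROVURB"))
```

Keeping the goal-to-weight mapping in one place makes it harder to apply, say, a province-comparison weight to a national estimate by accident.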

The Special Samples

There are three components, separately derived, in the special sample: a sample of German households in all four provinces, from all districts in the four provinces with at least 15 percent German origin population; a sample of French households in Ontario and New Brunswick, selected from districts in Ontario and New Brunswick in which at least 15 percent of the population were of French ethnic origin; and a sample of non-French households in Quebec, from districts in Quebec with at least 15 percent British (combining English, Irish, Scottish and Channel Islanders).

These samples are designed to allow particular, theoretically interesting comparisons, for example between the "charter" ethnic groups and the Germans (which in 1871 constituted the only non-French, non-British group of any size). Tables 4, 5 and 6, which are appended, are from the original sample design report and describe the special samples in more detail.

In order to cut down the cost, the special samples were restricted to districts which the published census volumes showed included enough of the group to yield a sufficiently large number of sample cases. In practice, districts with at least 15 percent of the desired group were included in special samples. Because they have large non-British populations (which would be coded in the main sample) and were in the urban sample (with a higher sampling fraction, see above), Montreal and Quebec City were excluded from the sample of the non-French in Quebec. From the additional random samples of households in these districts, we coded households with at least one person in the target group. For example, a household with one person of German origin and five of French origin would qualify for inclusion in the German special sample if it was selected.

In order to use the special samples, one should employ the weight ETHWT and always analyze the sectors of the population divided into categories of the variable ETHSEL. The special samples, it should be emphasized, are not representative of the entire groups from which they are drawn. The German households, for example, are from areas in which Germans are relatively numerous; these may or may not be the same as German households in more ethnically-isolated circumstances.

Table 2

Number of Households

                                                            First-stage   Second-stage
Group                                          Population   Sample        Sample

Districts in all four provinces at             70462        6225          1483
least 15 percent German origin

Districts in Ontario with at least             12794        1308          417
15 percent French origin

Districts in New Brunswick with                12911        536           245
at least 15 percent French origin

Districts of Quebec (excluding                 57868        1620          769
Montreal and Quebec City) at least
15 percent Non-French origin

Some General Advice on Using Weights

Because the 1871 sample is fairly large, unless small subsamples are the object of analysis, statistical significance will rarely be at issue. That is, effects that are just large enough to be significant will generally correspond to small and uninteresting substantive differences. For this reason, the weights are calculated conservatively, so minimizing Type I error.

SPSS, SAS and most other statistical packages treat data as if they were derived from a simple random sample. While the samples are not actually simple random samples, at the household level the stratified sample is at least as efficient as a simple random sample. The same is not true for the cluster sample of persons.

It is possible (but not simple) to make exact estimates of the standard errors of population (or sub-population) statistics, taking account of stratification and clustering. To do so you need to use the following variables: CELLNO, which is the number of the stratum for each household (these numbers begin with 1 and are incremented, but may be restarted in each district and sometimes within districts), and NSEL, the number of main-sample household selections in the stratum. It would be helpful to begin by printing out selected variables for a few hundred observations to see the pattern of sample allocation.
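A minimal sketch of such a calculation follows, using the standard stratified, with-replacement variance estimator for a weighted total, with each cell as a stratum and each household as the sampled unit. It is not the investigators' procedure. The field names DISTRICT, WEIGHT and PERSONS are hypothetical; the district is combined with CELLNO because cell numbers may restart; the number of selections is counted from the records rather than read from NSEL; and no finite population correction is applied.

```python
from collections import defaultdict


def weighted_total_variance(households, stratum_keys, weight_key, value_key):
    """Variance of the weighted total sum(w * y) over households.

    Each distinct combination of stratum_keys is one stratum. Strata with
    a single household cannot contribute and are only counted, since in
    practice they would have to be merged with a neighbouring cell.
    """
    contributions = defaultdict(list)
    for hh in households:
        stratum = tuple(hh[key] for key in stratum_keys)
        contributions[stratum].append(hh[weight_key] * hh[value_key])

    variance = 0.0
    singleton_strata = 0
    for values in contributions.values():
        n = len(values)
        if n < 2:
            singleton_strata += 1
            continue
        mean = sum(values) / n
        variance += n / (n - 1) * sum((v - mean) ** 2 for v in values)
    return variance, singleton_strata


# Invented household records; PERSONS is the household total of the variable.
households = [
    {"DISTRICT": 1, "CELLNO": 1, "WEIGHT": 2.0, "PERSONS": 5},
    {"DISTRICT": 1, "CELLNO": 1, "WEIGHT": 2.0, "PERSONS": 7},
    {"DISTRICT": 1, "CELLNO": 2, "WEIGHT": 1.0, "PERSONS": 5},
    {"DISTRICT": 1, "CELLNO": 2, "WEIGHT": 1.0, "PERSONS": 5},
    {"DISTRICT": 2, "CELLNO": 1, "WEIGHT": 4.0, "PERSONS": 3},
]
variance, skipped = weighted_total_variance(
    households, ("DISTRICT", "CELLNO"), "WEIGHT", "PERSONS"
)
print(variance, skipped)
```

For person-level statistics the household totals, not the individual records, enter the calculation; that is how the clustering of persons within households is taken into account.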

The variable SPECALOC is a special allocation variable used by the principal investigators. Normal analysis of the 1871 data does not require its use.

To pursue these issues, a reasonable knowledge of sampling theory is essential, and you should probably also consult Professor Michael Ornstein (York Institute for Social Research, York University, Toronto, M3J 1P3 [email protected]).


1871 Data File, Added Documentation - weighted files (April, 2001).

The 1871 file has several variables for weighting cases, since the sample was designed with several forms of oversampling in order to provide better estimates for specific comparisons: between provinces, between urban and non-urban areas, or a combination of these two. The oversampling and the appropriate weight variables are described in the sample section toward the end of the basic documentation, but the weight variables are relabelled in the current data file. The original and current weight variable names are:

TOTWT = POPWGT - for population estimates.
NATWGT = SAMPWGT - for sample estimates.
PROVWT = PROVWGT - for comparisons of provinces.
URBWT = URBWGT - for comparisons of urban areas (see urban variable).
PRURBWT = PRURBWGT - for comparisons of the 8 rural/urban by province (2 X 4) sectors of the sample.
ETHWT = ETHWGT - for comparisons of selected ethnic populations - see below.

In addition, the variable ETHWGT must only be used to analyze the population when comparisons are made among the very specific ethnic selections or categories defined by the variable ETHSEL. A frequency tabulation for ETHSEL, without any weight on, will reveal the specific groups and their frequencies in the special samples (as described in “The design of the Sample” section in the original documentation).

There are several other considerations in using weighted samples in the SPSS version of the files. For unknown reasons, we have found that if a weight variable is applied for an analysis, even if the weight is taken off before one exits the file (and the file not saved as weighted), when the file is reopened, the total N may appear as something over 40,000 cases. This is IN ERROR. When the file is retrieved for analysis, the weight procedure in SPSS “Data/weight” must be reset to no weight (it may appear as such, but should be reset). We recommend that before tabulations are undertaken, one establish that the N for the sample will be either 62,281, when no weight is applied (the full number of records), or 24,741, the size for appropriate sample estimates.
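The recommended check on N can be written down as a small guard. This is a sketch only: the two expected totals are those quoted above, while the function name and the tolerance (allowing for rounding of weighted totals) are assumptions.

```python
EXPECTED_UNWEIGHTED_N = 62_281       # full number of records, no weight applied
EXPECTED_SAMPLE_WEIGHTED_N = 24_741  # size for appropriate sample estimates


def classify_total_n(total_n: float, tolerance: float = 0.5) -> str:
    """Say which documented N a reported total corresponds to, if any."""
    if abs(total_n - EXPECTED_UNWEIGHTED_N) <= tolerance:
        return "unweighted"
    if abs(total_n - EXPECTED_SAMPLE_WEIGHTED_N) <= tolerance:
        return "sample-weighted"
    return "unexpected"


print(classify_total_n(62_281))
print(classify_total_n(40_123))
```

A result of "unexpected" before any weight has been deliberately chosen suggests that a stale weight setting is still in force and should be reset.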

Also note that SPSS will provide unbiased parameter estimates, but NOT correct estimates of error for weighted samples such as this one. SPSS makes error estimates assuming simple random samples of a size equal to the sum of the weighted cases, rather than taking account of the probabilities of selection of cases. Normally, in a sample the size of this national sample the question of statistical significance is not a key one, unless one is examining fairly small subgroups. Nevertheless, some statistical packages, notably STATA, do compute correct estimates of error for stratified, weighted and “clustered” samples. (As the original documentation indicates, this is equivalent to a simple random sample of households, but it is a cluster sample of the individuals within them.)


Historical Methods, Vol. 11, No. 4, Fall 1978

National Mobility Studies in Past Time: A Sampling Strategy

Michael D. Ornstein and A. Gordon Darroch

York University

Among the most striking findings of the new urban history is the consistency with which North American communities, in particular, appear to be characterized by very high rates of geographic mobility. In the nineteenth century, the typical and now quite familiar finding is that between 50 and 75 percent of the residents of a city cannot be found there ten years later.¹ This largely unforeseen evidence of transiency continues to present challenging intellectual problems of determining the general patterns of historical mobility and of establishing their relationships to patterns of social mobility and to social structural change. Since there is every reason to think that these linked patterns were national, and probably continental, in scope, it does not seem likely that even a wider proliferation of historical studies of transiency and mobility within the confines of single communities can be accumulated into the wide-lens view required for an understanding of the larger patterns.

This paper presents a general research strategy which makes national studies of geographic and social mobility feasible, in the absence of continuous registers of domicile changes, so long as nominal records survive for an entire population from at least two points in time. We have come to develop this methodology in the course of considering the dilemmas posed by high rates of transiency in nineteenth-century North America. It should be clear that the techniques suggested are not bound in time to the nineteenth century or in space to North America.

In his pioneering work Poverty and Progress, Thernstrom laments

    This whets our curiosity about the subsequent career patterns of the hundreds of laborers who worked in Newburyport for a short time in the 1850-1880 period and then moved on. It is quite impossible, let it be said immediately, to trace these individuals and thereby to provide a certain answer as to how many of them later won fame and fortune. Without a magical electronic device capable of sifting through tens of millions of names and locating a few hundred, there is no way of picking out former residents of Newburyport on later national censuses.³

Later, reflecting on the unknown fate of transients in America, Thernstrom and Knights only slightly moderate this earlier judgment, “. . . the knotty, but in principle, not insoluble, question of what happens to out-migrants after they have left the city cries out for further systematic study.”⁴ We take it for granted that the problem is not simply that of measuring the amount of internal migration. For the nature and extent of this migration had fundamentally important consequences for the lives of the individuals involved and for the social structures in which these lives were embedded.

Migration was a dominant characteristic of nineteenth-century North American society. The very experience of migrating must have been one of the recurrent crises in the lives of a very great number of men and women.⁵ As most studies to date have emphasized, we have every reason to suppose that migration was integrally linked to occupational mobility or, at least, to the search for mobility opportunities. But despite a virtual consensus on the ubiquity of transiency, it is not known whether transients tended quickly to come to rest in other communities or whether they remained more or less “on the move,” a floating reserve army of labour. It seems probable that there were two or more distinct patterns. For some, surely, migration led eventually to some social mobility and economic security, a pattern which characterizes the relation of migration to mobility in this century.⁶ For others, migration must have been a journey which was endless and largely fruitless. Of course, even posing these questions in historical context presumes that there is a prospect of tracing large numbers of representative migrants through nominal records.

We also argue that the methodological problem of tracing migrants on a large scale has wider substantive significance. Its significance lies in opening for the first time the possibility of systematically linking individual experiences in the past with historical patterns of institutional change. Quantitative social history to date has generally focused on the individual experiences of migration and mobility, interpreting patterns in terms of individual attributes, rationales, and consequences. The alternative, complementary perspective is to focus on the institutional implications of the patterns, that is, to examine them as systems of migratory labour and as changes in class and status structures. Questions framed at this level of analysis may focus on the processes of renewal and maintenance of migrating labour and on its economic functions as well as on the implications for class structures and mobility processes.⁷

Two Current Methods

The most straightforward approach to tracing out-migrants from a given community is to search exhaustively all known sources of nominal records for the individuals who cannot be found within the original community. Peter Knights adopts this strategy in his study of antebellum Boston.⁸ Knights examines the various records of the towns surrounding Boston, identifying individuals from his origin sample whose names he has committed to memory. For local or for relatively small regional socio-historical studies, the individual tracing procedure is a sensible and efficient strategy, but it cannot be applied on a larger scale. It does not lend itself to subdivision among the members of a research team. Moreover, although in principle the results obtained using this method might be replicated, the replication is quite impractical. A single researcher may have to scan hundreds of thousands or millions of records to find a relatively small number of sampled individuals, and the decision that the recorded individual was a member of the original sample is one which must be made, once and for all, by the researcher at the time he or she confronts the original data. A tracing technique which would allow the researcher, or others, to reevaluate decisions linking nominal records by altering the decision rules governing the linkage and reexamining the results would clearly be preferable.⁹

A second and more promising approach is that reported by Stephenson.¹⁰ Stephenson and Jensen propose to employ the unique phonetic codes (Soundex indexes) to the surnames of the U.S. censuses of 1880 and 1900 to trace a sample of Americans. (There is also a Soundex index of the 1920 U.S. census, but it is not presently available.)¹¹ This approach does make it possible to trace large numbers of individuals, but a large-scale regional or national study would prove very tedious because of the large number of individuals who fall into any Soundex “pocket” or phonetic grouping and because the indexes are compiled separately for each state. Unfortunately, we know of no other instances in which phonetically arranged indexes to national censuses have been created, so this method is inapplicable beyond the American censuses for certain years. Furthermore, the existing American index for 1880 is restricted to those families with a child under eleven years of age, who constituted only about 25 percent of the 1880 population.¹² These restrictions mean, of course, that the approach cannot be considered a general strategy.

Of course, as a first step in a national mobility study, one might go about creating Soundex indexes for the population. The cost would be prohibitive. Essentially it would require the conversion of an entire national census to machine-readable form. (If two censuses were to be linked, only one would have to be Soundex coded; the other could simply be sampled.) If a study on such a scale could be financed, it would make more sense to study the entire population at both times.¹³

Given the present limitations of cost and data handling requirements, any feasible method of conducting national studies must employ a sampling procedure. Why not take random samples from each set of records? Simply because the proportion of individuals which would be found in all the samples (presuming their names were actually present) is a small fraction of the size of each sample. If 5 percent samples are taken from two sets of records for the same population, then just .25 percent of the population could be expected to appear in both sets of records; if three sets of records are considered, to cover a longer period, the linked sample comprises only .0125 percent of the population. In practice, the proportion traced would be even smaller, since some of the individuals in the original sample will have died, emigrated from the region or nation under study, or be missing from the records because of census underenumeration and because of the limitations of even the best linkage procedure.

Somehow, samples of individuals who can be traced must be created from the enumerations of entire populations. Beyond the requirement that at least two national enumerations exist, the method we will describe is a general one; the actual form of the data is unimportant.¹⁴ We present our solution to this problem and include some estimates of the size of the sample required for a national study of Canada between 1851 and 1871, the particular problem around which we have elaborated this methodology.

Letter Samples

Assume two enumerations of the same set of individuals exist. From each we choose a “letter sample” which includes all persons whose surnames begin with a randomly selected letter, say “B.” No attempt is made to identify particular individuals during the coding process. Both files are simply scanned to locate and then to code the complete manuscript record of all individuals whose names begin with “B.” Now each person whose surname has not changed (assuming the first letter is legible) and who appears in the first sample must also be present in the second. This sampling procedure has a critically important feature. It cuts the number of individuals to be studied to a manageable and affordable size, yet every person who is present in both sets of records can be linked. To make certain that each person has an equal chance of appearing in the sample, a letter is randomly chosen in such a manner that each letter has an equal probability of selection.¹⁵

This is the basic idea. In actual practice the situation is much more complicated, and the balance of this paper addresses these complications. For convenience, the method is elaborated in reference to the practical problems of conducting mobility studies using the nineteenth-century censuses. In general, we must deal with sets of records which contain only some individuals in common. Death, emigration from the area studied (no matter how large), changes of name in marriage, and serious errors in the original enumeration all remove individuals from the study. The failure to include individuals in the population with surnames beginning with the sampled letter, due either to coding error or to the illegibility of the original manuscript or microfilm copy, has the same effect. Still more links between records will be ambiguous if similar errors (or actual changes between enumerations) are made in variables other than the last name which are routinely used in the linkage procedures of current quantitative social history.¹⁶

Errors in Cluster Samples

Technically, a letter sample can be classified as a cluster sample. Each group of persons whose surnames begin with a given letter constitutes a cluster. To obtain a letter sample, one letter or cluster is chosen in a random fashion from the 26 clusters (1 for each letter). Then the sample comprises all individuals in the selected cluster. Each individual listed in the census has an equal chance of being in such a sample, for the probability of selection for each individual is the product of the probability of his or her cluster being selected (1/26, there being an equal probability of each letter being selected given a random method), multiplied by the probability that he or she will enter the sample once the cluster has been selected (this probability is 1, since the entire cluster is in the sample).

In general, estimates of population parameters (of means, proportions, correlations, etc.) calculated from cluster samples exhibit larger errors than those obtained from simple random samples of equal size when some important characteristic of the individuals is unevenly distributed among the clusters. Clusters based on surnames pose just such a difficulty: there is a strong relationship between the first letter of an individual’s surname and the ethnic group to which he or she belongs. Disproportionately large numbers of individuals of Scottish origin will be found in the “M” cluster; people of French origin are over-represented in the “L” cluster, and so on. Two points should be stressed:

(1) Once the effect of ethnicity has been accounted for, there is no reason to believe that other individual characteristics, such as place of birth, religion, age, occupation, or size of household, should bear any additional systematic relationship to the first letter of the surname, and

(2) The existence of this relationship increases the expected error but does not bias the sample.¹⁷

We have two objectives in dealing with errors which arise from the letter sampling procedure: first, any scheme must allow the errors to be estimated; second, we hope to minimize the impact of the ethnic heterogeneity among clusters. The first can be accomplished if at least two clusters are chosen for the sample; the difference between the values obtained in the two clusters can then be used to estimate the error (in practice, it is preferable to sample more than two clusters, so as to obtain a more precise estimate of the error). Whatever the procedure used to select the clusters, a replicated design must be used. The thornier problem is to minimize the impact of the ethnic homogeneity of the clusters on the sample error, a problem compounded by the great variation in the size of the clusters: surnames are distributed very unequally through the alphabet. Kish describes the estimation procedures for samples of this kind.¹⁸

We will suggest two approaches to this problem, but with this caveat: the evaluation of their merits waits on the availability, in machine-readable form, of a large random sample of individuals from a nineteenth-century census, like the one that we are now collecting for Canada in 1871. It should also be noted that, if the actual impact of ethnicity proves to be small, either because there is relatively little ethnic variation among clusters or because only weak relationships are discovered between ethnicity and the other variables of interest (like occupation, family size, etc.), then it may not be sensible to employ either strategy. This would be the case if the population under study was known to be ethnically homogeneous. If this were true, or if it can be shown that ethnic differences are of little consequence, we can proceed to employ letter samples as if they were conventional samples of the population, with the important advantage that they make possible linkage between samples. We are pessimistic in this regard; ethnic variations are likely to be of some considerable consequence in nineteenth-century national populations, and certainly in North America.

The first solution to the problem of ethnic variation involves stratifying the clusters by ethnicity, that is, placing those letters which begin a high proportion of French surnames in one stratum, those with many Irish surnames in another, and so on, for all the ethnic groups which are of significant size, or of particular concern, in the population. Then a sample of at least two letter clusters is randomly drawn from within each stratum; this selection is made independently for each stratum. Two or more clusters would be chosen from each stratum so that errors can be estimated, as noted above. The main object of this stratification, of course, is to improve the precision of the estimates from the sample, in this case by insuring adequate representation of all relevant ethnic groups in the population.

What we end up with are stratified, cluster samples. Such samples are relatively common in contemporary survey sampling, and the desired estimates of population values can be made without great difficulty. The drawback of this method is that very complex stratification may be necessary in order to capture the variation in the ethnic compositions of the clusters. Then drawing a minimum of two clusters within each stratum could result in a total sample size much larger than that desired for reasons of cost. The problem of controlling for sample size is a serious one, and we shall discuss solutions in a moment. In addition, although the relatively straightforward strategy just presented seems a workable one, it is also relatively inefficient. Since no one letter cluster of surnames will be perfectly homogeneous with respect to ethnic composition, one has to work with a typology of the known or presumed ethnic composition of the letter clusters, attempting by a judicious combination of letters to represent the ethnic composition of the total population. But we note, too, that it is a feasible strategy, permitting linkage of samples for the first time, so it may have uses in smaller scale projects. This procedure requires us to know the distribution of the ethnic groups within each letter cluster and in the population as a whole. In general, it will not be known. One happy exception is provided by the “nation of origin” variable reported in the Canadian census of 1871.

What if such an ethnic origin variable is not available? Can anything be done to substitute for it? There

which should be adequate for determining the broad relationships among ethnicity, other population characteristics, and surname spelling.

Given an adequate knowledge of the relationship between ethnicity and surnames, there is a second and seemingly more promising solution to the problem this relationship poses for letter sampling. But this approach has the disadvantage of proceeding through statistically uncharted waters. In this second strategy, we simply proceed to sample randomly letter clusters from the population. Each sampled cluster would then be “post-stratified” so that its ethnic distribution would be made to conform to that of the population by an appropriate set of weights.¹⁹ In other words, the method fits the letter samples of surnames to the ethnic distribution of the population, after the samples have been drawn. For example, if 40 percent of the individuals in a letter cluster sampled were of English origin, compared to 20 percent of the population as a whole, each English person in the sample would be assigned a weight of 1/2. As in the previous case, we require that the ethnicity of each individual and the distribution of the entire population be known or surrogate data provided.²⁰

An alternative and more elegant post-stratification can be achieved, although the basic principle remains the same. The letter clusters sampled could be fitted to the marginal and crosstabular distributions for the population, known from published census aggregate data or from independent analyses, using a method called “iterative proportional fitting.”²¹ For example, if for a total, national population we know the distribution of occupations, the cross-classification of region by place of birth, of sex by age, and of marital status by age, a set of weights can be derived which would ensure that the sample matches the population on all these variables and cross-classifications simultaneously.²² This procedure could be used to produce weights even if the ethnicity of the population and the samples are not known.
are several approaches to creating a surrogate variable We consider this post-stratification procedure more that appear feasible, if inelegant. First, one could use promising since it clearly adds precision to the matching genealogical sources to classify the names of the sampled of the ethnic composition of the sample and that of individuals according to presumed ethnicity. It would the population. It does not solve the problem posed by be necessary to classify some names into an ambiguous the variability in the total sample size. category. In addition, or perhaps afternatively, if first We have perhaps exaggerated the problems posed by names and/or place of birth data are available, they could ethnicity, because of the North American context of be used to make an ethnic classification. We have not, in our own work. As we remarked earlier, in a more or fact, attempted such classifications and recognize that less ethnically homogeneous population, the problem the approach is not without difficulties. However, such is reduced, though there may still be some systematic classifications would not have to be very detailed, since relationship of surname and social characteristics. How- nineteenth-century national populations, even in the ever, real difficulties arise only when there are a number U.S. and Canada, consisted of, at most, half a dozen or of ethnic groups in the population which merit individual SOsizeable ethnic groups. Moreover, even where there is study. In our own study of Canada before 1880, there direct information on ethnicity from nineteenth-century are five groups comprising 5 percent or more of the census manuscripts, as there is for Canada, a surrogate total population, and their characteristics are of signi- variable may be as reliable. The point is that in the ficant historical interest. absence of an ethnic origin or nationality variable in a A disclaimer might be in order at this point. 
For the given set of historical records, one can be constructed majority of historians and historical sociologists, letter 155 sampling is clearly a complicated statistical procedure samples which is the critical feature of letter S8npling. and one which should not be undertaken without quali- There is no reason to believe that this procedure wiIl tied consultation. This difficulty is, we feel, more much alter the probability of linkage, since the linkage than compensated for by the fact that it makes possible, is carried out wirhin the subgroups selected for the for the first time, sample studies of very large popula- sample. On the other hand, since the strategy entails tions. In fact, as we suggested and shall discuss further scanning manuscript records to identify surnames below, a wide range of applications is possible. Simpli- which qualify for the sample, distinguishing those fied letter sampling procedures can be used to facilitate which qualify for the appropriate subclusters will large urban or regional studies carried out by single require even more care in the original coding process investigators, and, at the opposite extreme, they allow than the use of first letters alone?4 national mobility studies involving hundreds of thou- The final step is to recombine some of the I92 sands of records. Clearly, the larger the study, the more smaller clusters within letters in order to equalize the concern should be given to the sample design. size of the final clusters as much as possible. For letters containing very few surnames, like “Z,” all sevensmaller Considerations of Sample Size clusters might be recombined. For letters containing many surnames, like “S,” all the smaller clusters would be Consider the elementary, illustrative case of letter sampling, retained in the foal sample. Suppose the recombination in which a single letter is chosen and all the individuals leaves about one hundred subclusters of surnames. 
‘Ihen, with surnames commencing with the letter are selected. to obtain a 5 percent sample of a population, five groups This case exemplifies why it is so difficult to predict would be selected by some probability procedure, like the size of the sample, for there is a very unequal division simple random sampling. The fact that five such clusters of surnames among the letters of the alphabet. The deter- were chosen would have two great advantages: the sam- mination of the size of a sample commonly involves a ple size would be much more predictable, both because deliberate compromise among two conflicting objec- of the equalization of size through recombination of tives: larger samples make for smaller errors in the clusters and because the size differences in the selected analysis, but they also cost more. Suppose that we wish clusters would tend to cancel one another out, and the to select a sample comprising about 5 percent of the variance of estimates would be reduced because the population, a proportion slightly larger than the average ethnic variations in the five clusters would also tend to for the 26 possible letters (100/26, or 3.85 percent). cancel out. The estimates of error would also be appre- Now if a single letter were chosen, we have the foUo&ng ciably enhanced by the choice of a sample with more than problem. Should “Z” be chosen, the sample will be far two clusters. too small to permit the desired analysis; should “S” be chosen, the cost would be too high. It is important to Letter Sampling and the Analysis of Households recognize that unless every individual in the letter clus- ter is included in the sample, the letter sampling proce- The discussion has been phrased in terms of an aoalysis dure will not provide fully linkable samples. It is no of the movements of individuals. 
However, one of the help to code a portion or a sample of the names begin- most important lines of inquiry in nineteenth-century ning with “S.” social and demographic history concerns the changing The remedy we propose involves the subdivision nature of the family and household. There is also an of surnames beyond the 26 unevenly sized groups formed important technical reason for wanting to deal with the of individuals whose names begin with different first individual in the context of a household-information letters. If these subdivisions are more or less equal in size on the characteristics of other members of that house- and there is a relatively large number of them, then we hold greatly increases the power of linkage procedures, can approximate the optimum, total sample size by whether performed manually or by computer. For this selecting an appropriate number of subgroups. reason, when data are available for all members of a A sensible way to subdivide the 26 letter clusters household, we strongly suggest that provision be made of surnames is to divide each of them into the 7 additional for coding all data on each individual in a household clusters defined by the first, numeric code of the Russell which contains any person whose surname falls in a Soundex Code, providing a total of 192 sgbclustersz3 selected cluster. Of course, only individuals with sur- The Soundex code uses the first letter of surnames in names which are in the chosen clusters form the basis uncoded form as a prefm, A, B, C, etc. The next conso- for the analysis, since they, and not the households, nant is coded as the first digit of the code in six cate- constitute the sample. Under these conditions the cost gories where B, P, F, V are coded as 1; C, G, J, K, Q, S, of coding may not be determined precisely from calcu- X, Z as 2; D and T as 3, etc. 
Hence, the subclusters of lations of sample size based exclusively on those whose names we would create are simply those in which Able surnames begin with the sampled letters. The practical and Avery would be coded Al ; A&land and Agger, A2; questions of cost cannot be divorced from questions Atkinson and Adams, A3 and so forth. Such subclusters of sampling strategy. If entire households are coded, of names preserve the capacity to link names across given a “letter sample” resident, a number of the indi- 156 vidual records will not refer to individuals with surnames contmuous populations registers, taxation records, in the letter sample. In a small proportion of casesof church records, and the like, could be employed. For boarding houses and exceptionally large households, the example, the method might be applied in some large number of additional individuals coded qould be quite city where the objective was eventually to combine substantial. What is required to make more precise several different types of records for the individuals estimates of coding cost is some estimate of the kinds traced through letter samples. At the level of the large of households having members with differing surnames city or region, a letter sampling strategy makes it possi- and the proportions of all households which they ble for a single researcher with limited funds to deal represent. Recent historical research has provided a with populations which now remain the preserve of generous amount of information on these questions of teams of researchersfueled with large sums of money. the composition of households.” Other things equal, of course, the smaller the area, the We might mention one unexpected dividend of higher the rates of outmigration from the sampled popu- the letter sampling procedure. The letter cfusters almost lations we can expect. certainly wih.exhibit a greater ethnic homogeneity than The sampling procedures described above create proba- simple random samples. 
This tendency for any two bility samples whose statistical properties can be described individuals within a letter cluster to be more like one mathematically. But what would be the effect of a another than two people randomly selected from the researchersimply choosing a letter which was easily population is the cause of the increased error which we distinguishable visuahy and which he or she knew had identified earlier as common to cluster samples. There a distribution of ethnic groups approximating that of is a second source of homogeneity within clusters: indi- the population? This “purposive” sample, of course, viduals within any one family who have the same sur- would have the same properties as a probability sample name are all in the sample or all outside of it. Since so far as the linkage of individuals in different sets of most individuals live with at least one other person records is concerned. The only difficulty is that the with the same surname, the letter sample can be seen as level of error produced by such a sample cannot be esti- a sample of families, some of which have only one mem- mated, since no real element of probability enters the ber. In fact, this conception can be extended across choice of the letter. However, one could proceed this households and over time. Individuals who retain their way if he or she was prepared to assertthat the indivi- surnames but leave the households in which they resided duals in the sample sufficiently resembled the popula- in the first sample are still caught in a subsequent sample. tion as a whole to make the sample representative. 
In This opens up the possibility of relating the study of fact, such an assertion could, to some extent, be con- migration and intragenerational mobility, for which the firmed by comparing the characteristics of the sample sampling methodology was primarily designed, to the to tabulated results for the population or to another, detailed study of intergenerational mobility, to patterns of small, simple random sample drawn from the popula- geographic dispersion of family members and perhaps tion by the researcher for this purpose. Moreover, as we to patterns of family formation. have argued, it is likely that a reasonable match between Finally, we note that letter sampling complicates the ethnic composition of the letter sample and the popu- linkage of individuals across sets of records. The compli- lation will itself go a long way toward matching the cation results from the fact that individuals are sampled sample and the population on other important charac- on a basis which narrows the range of surnames, the vari- teristics. The risk involved in this procedure can also able conventionally used to make the primary link between be considerably reduced if a minimum of two letters sets of records. It is clearly more difficuh to link two is chosen for coding; comparisons of the results obtained sets of data where all the names begin with a single letter for the individuals within each letter cluster allow at than to link fdes of the same size which contain all last least a guessat the extent of error. Especially where a names. Since linkage procedures do use additional infor- population is known to be more or lessethnically homo- mation-first names, age, sex, and even the characteris- geneous and where a rapid examination of the data was tics of other members of households where possible- required in a small-scale project, this less-rigorous pro- letter sampling will likely require that these more com- cedure is worthy of consideration. 
plex linkage procedures are systematically exploited. One possibility is to construct transition matrices out The Determination of Sample Size of pairs of records on individuals which are not linked In a Linkage Study: An Illustration with absolute certainty but only with some measured probability. Finally, as a practical illustration we consider the prob- The letter sampling methodology we have described lem of deciding on the size of a sample which would be is a general procedure. In many casesit can as easily be required in order to conduct a national study of mobility used to study regions as whole nations, to study the popu- using nineteenth-century manuscript census data and lation of several nations as that of a single nation. Although letter sampling. There are two important factors in the discussion has centered about national studies making such a decision: the loss of individuals between based on censuses, other types of records, including census points, from a variety of causes we discuss below, Table I.-Estimated Linkage Probabilities between 1851 and 1871, for Canada.

                                   Number Remaining from Sample

                             Minimum Emigration        Maximum Emigration
                                      Household                 Household
                                      heads,                    heads,
Cause of                     All      boarders,        All      boarders,
Sample Attrition             Males    servants         Males    servants

1851 Sample                  1000     1000             1000     1000
1851-61 Losses
  Emigration                   35       35              136      136
  Death                       121      166              113      155
  Underenumeration,
    Linkage Error             169      160              150      142
1861 Remainder                675      639              602      567

1851-71 Losses
  Emigration                  103       99              207      202
  Death                       228      310              208      281
  Underenumeration,
    Linkage Error             134      118              117      104
1871 Remainder                535      473              469      413

1861 Sample                  1000     1000             1000     1000
1861-71 Losses
  Emigration                  115      115              139      135
  Death                       117      161              116      159
  Underenumeration,
    Linkage Error             154      140              150      141
1871 Remainder                614  [illegible]  [illegible]      565

and the usual considerations of the desired levels of error and the nature of the internal comparisons which are desired. This discussion limits itself to the first factor, presenting estimates for a study of Canada (actually Ontario, Quebec, New Brunswick, and Nova Scotia) based on the censuses of 1851, 1861, and 1871.26 Failures to find individuals sampled in one census at a later point can be attributed to five sources: (1) deaths of sampled individuals, (2) emigration, (3) census underenumeration, (4) missing data, and (5) failures correctly to link records actually referring to the same individual. We can calculate the proportion of individuals from the 1851 Canadian census who are likely to be found in the 1861 and 1871 censuses and of the individuals coded in the 1861 census who should be found in 1871. These are rough and intentionally conservative estimates; there is insufficient data to make precise calculations for this or for other national studies. Separate calculations are carried out for all males (as usual, we cannot follow most women because of the name change at marriage) and for the combination of male household heads, boarders, and servants.27 It is assumed that the emigration rates and the losses due to underenumeration and linkage are equal for both these groups, but their mortality rates differ because of differences in age structure.

Table 1 shows our estimates of losses from various sources. Kalbach and McVey show the extent of disagreement among researchers over the number of Canadian residents emigrating in the decades between 1851 and 1871.28 The smallest estimate for the first of these decades is only 86,000; the largest, 332,000. They are 3.5 percent and 13.6 percent, respectively, of the 1851 population. The variation is much smaller, between 370,000 and 436,000 emigrants, for the 1861-71 decade. Because of these differences, it seemed sensible to provide two estimates based on minimum and maximum rates. Let us also assume that an individual who was found in the censuses of 1851 and 1861 was only 70 percent as likely to emigrate as the average Canadian, between 1851 and 1861.29

Second, there are losses due to mortality. Let us assume that the emigrants had a death rate only 60 percent as great as the population as a whole, since it is known that the emigrants are younger and likely to be in better health than those who remain behind. We apply this correction factor to Keyfitz's estimates of survivorship for nineteenth-century Canada.30 Finally, let us assume the combined effects of underenumeration and imperfect linkage mean that only 80 percent of the links can be actually carried out, given the fact that an individual was alive and in Canada at the two points in time. Our current state of knowledge for studies of the scale discussed here means that these are only guesses, although the apparent success with linkage of those doing community studies suggests this is not an unreasonable estimate. A 20 percent linkage error is based on rates of underenumeration between 5 and 10 percent and of linkage loss between 10 and 15 percent.

The results of compounding the separate forms of loss from the sample are in Table 1. Of 1,000 men sampled in 1851, we can expect to find 675 in 1861, assuming a lowest estimate of emigration, 602 if the highest estimate is used. By 1871, 535 remain, using the low estimate, and 469 for the high one. If we attempt to trace heads of households, boarders, and servants, the loss rate is about 10 percent greater. The loss from 1861 to 1871, for a sample chosen in 1861, is of course of the same order. Thus, according to these estimates it is necessary to sample about twice as many individuals as one expects to trace across a 20-year period in nineteenth-century Canada.

Conclusion

It may be difficult to retain sight of our objective in the methodological maze we have just detailed. In reference to the "New Urban History," Roberta Balstad Miller has succinctly stated what we take to be an intellectual priority in the joint work of sociologists and historians:

    . . . it is no longer enough to celebrate the fluidity of a [nineteenth century] city's population; the study of social and spatial mobility must include an analysis of the factors stimulating and retarding social mobility, the patterns and the modes of ex-urban and inter-urban migration, and the cumulative effect of such migration on the economics and the social orders of both sending and receiving areas.31

The letter sampling technique offers a feasible methodology, perhaps the only one short of coding entire censuses, for addressing these central issues of social history on a national scale.
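The compounding of losses behind Table 1 can be retraced with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: it takes the emigration and death counts from the table as given, applies the text's assumption that 80 percent of those still alive and in Canada can be linked, and rounds to whole persons. The function names and the rounding rule are hypothetical, not the authors' procedure, and some published cells differ from this arithmetic by one, presumably because they were computed from unrounded inputs.

```python
import math

# Share of individuals still alive and in Canada who can actually be linked
# (the text's assumption: 80 percent, i.e. a 20 percent combined loss from
# underenumeration and imperfect linkage).
LINK_RATE = 0.80


def expected_remainder(sample, emigration, deaths, link_rate=LINK_RATE):
    """Return (linkage_losses, remainder) for one column of Table 1.

    Emigrants and deaths are removed first; the combined underenumeration
    and linkage loss is then taken from those who are left.
    """
    still_present = sample - emigration - deaths
    linkage_losses = round(still_present * (1 - link_rate))
    return linkage_losses, still_present - linkage_losses


def required_sample(target_traced, retention):
    """Initial sample size needed to end with `target_traced` linked cases."""
    return math.ceil(target_traced / retention)


if __name__ == "__main__":
    # 1851 sample traced to 1861, all males, minimum emigration estimate.
    print(expected_remainder(1000, emigration=35, deaths=121))  # (169, 675)

    # With 535 of every 1,000 men still traceable after twenty years,
    # tracing 1,000 men requires an initial sample of roughly this size.
    print(required_sample(1000, 0.535))  # 1870
```

The second figure is one way of reading the remark that about twice as many individuals must be sampled as one expects to trace across twenty years.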

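The subdivision of letter clusters by first letter and first Soundex digit, described under "Considerations of Sample Size," can be sketched in a few lines. This is a simplified illustration, not the authors' coding procedure: digits 1 to 3 follow the categories quoted in the text, digits 4 to 6 are taken from the standard Soundex scheme (L; M and N; R), the digit 0 for surnames with no coded consonant is an assumption made here to give a seventh category, and the standard Soundex rule for adjacent letters with the same code is ignored. The function name `subcluster` is hypothetical.

```python
# Consonant groups for the first Soundex digit. Groups 1-3 are quoted in the
# text; groups 4-6 come from the standard Soundex scheme.
_GROUPS = {
    "1": "BPFV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
SOUNDEX_DIGIT = {letter: digit
                 for digit, letters in _GROUPS.items()
                 for letter in letters}


def subcluster(surname):
    """Return the subcluster label: first letter plus first Soundex digit.

    Vowels and H, W, Y carry no code and are skipped. A surname with no
    coded consonant after its first letter is assigned digit 0 (an
    assumption of this sketch).
    """
    letters = [ch for ch in surname.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("surname contains no letters")
    for ch in letters[1:]:
        if ch in SOUNDEX_DIGIT:
            return letters[0] + SOUNDEX_DIGIT[ch]
    return letters[0] + "0"


if __name__ == "__main__":
    for name in ("Able", "Avery", "Agger", "Atkinson", "Adams"):
        print(name, subcluster(name))
```

Run on the surnames used as examples in the text, the sketch places Able and Avery in A1, Agger in A2, and Atkinson and Adams in A3.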
NOTES

1. The senior author of this paper has been primarily responsible for the technical development of the sampling strategy described here. Developing its application to socio-historical studies of mobility has been jointly undertaken by the authors as part of the Canadian Historical Mobility Project, which they co-direct. We have had the benefit of the thoughtful advice of Peter Peskun, sampling statistician at the Institute of Behavioural Research, York University, and of Michael B. Katz, Department of History, York University. We would like to thank Karen Barker, Sheila Creighton, Shirley Young, and Marlene Sherman for typing this manuscript at various stages. The research project is supported by The Canada Council.

2. Stephan Thernstrom, The Other Bostonians (Cambridge, Mass., 1973), pp. 221-32, summarizes a number of North American studies. But also see Peter Laslett and John Harrison, "Clayworth and Cogenhoe," in Historical Essays Presented to David Ogg, ed. H. E. Bell and R. L. Ollard (London, 1963), pp. 157-84.

3. Stephan Thernstrom, Poverty and Progress (New York, 1969), p. 86.

4. Stephan Thernstrom and Peter Knights, "Men in Motion: Some Data and Speculations About Urban Population Mobility in Nineteenth Century America," in Anonymous Americans, ed. T. K. Hareven (Englewood Cliffs, 1971), pp. 17-44.

5. Michael B. Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, Mass., 1976).

6. Peter M. Blau and Otis Dudley Duncan, The American Occupational Structure (New York, 1967).

7. For example see M. Buroway, "The Functions and Reproduction of Migrant Labour: Comparative Material from Southern Africa and the United States," American Journal of Sociology 81 (March 1976).

8. Peter R. Knights, The Plain People of Boston, 1830-1860: A Study in City Growth (New York, 1971).

9. All quantitative historical studies employing nominal records face the problem of linking individual records from different sources which are presumed to refer to the same historical actor (say, census manuscript data, tax records, and newspaper reports, or any of these at more than one date). Given the many sources of variation and error in the records of the last century, the problem is especially serious. Knights has informed us in personal communication that in his work which uses all the available data from records regarding a given individual (names, occupation, age, and the like) for hand linkages, only a very few cases of potentially linked records seem to be truly ambiguous.

10. Charles Stephenson, "Tracing Those Who Left: Mobility Studies and the Soundex Indexes to the U. S. Census," Journal of Urban History 1 (November 1974): 73-84.

11. Soundex indexes are based on the Russell Soundex code which codes surnames according to a scheme in which the first letter of a name is combined with a three-digit numeric code based on the remaining sequence of letters so that all surnames are grouped in mutually exclusive categories; for example, see the discussion in Michael B. Katz, The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City (Cambridge, Mass., 1976).

12. Stephenson, p. 74.

13. A mobility study based on the entire population would, of course, eliminate the difficulties of statistical inference which sampling creates. But the method is both costly and inefficient; generally a sample of the population would yield results as good as those from the whole population. Such a national population study would cost some millions of dollars. There is still another difficulty: despite the tremendous advances in computer speed in the last decade, it might actually prove to be impossible to link two files each containing millions of individual records. Computer linkage procedures require very large amounts of machine time, since they must cope with misspellings and other errors in the data. At the least, such a national linkage would require weeks of machine time.

14. In the context of a national study, there is the possibility that more intensive local studies of particular communities could be carried out, tracing the individuals sampled in the main study through local records, such as tax assessment rolls and birth, death, and marriage registers. It should be noted, too, that linking two censuses separated by ten years will not reveal the nature of movement in the ten-year interval, only the outcome of that process.

15. In the course of developing the methods described here, a literature review encountered several mentions of letter sampling in historical demography. They all refer to the research of Dupâquier dealing with a region near Paris. In conjunction with extensive studies of family reconstitution from parish records, Dupâquier traces those individuals whose last name begins with B over the region of some 84 parishes in order to compare demographic characteristics of the migrating population with those of the stable population on which the work in France has, to date, solely concentrated. See Jacques Dupâquier, "Sur la population française au XVIIe et au XVIIIe siècle," Revue Historique 237 (1968), 43-79; and Louis Henry, "Historical Demography," Daedalus 97 (Spring 1968), 385-96. These authors do not indicate whether the formal properties of letter samples which we discuss are recognized, or how the letter B was chosen. One might assume that the French population under study is thought not to have strong systematic relationships between social and demographic characteristics and first letters of surnames. Still, we suggest that this is a question worth examining even when "ethnic" homogeneity is assumed.

16. None of these limitations, it should be noted, are unique to the methodology proposed here. For example, the method of letter samples is no help in tracing women whose surnames change at marriage. But, unless records of marriage are available, this problem is shared by the previous methods of searching for sampled individuals by name or employing Soundex indexes to censuses. We should acknowledge, in fact, that what is on the surface a "technical" problem of tracing women may well help perpetuate a male bias within social history. For a good recent discussion of record-linkage procedures, see Theodore Hershberg, Alan N. Burstein, and Robert Dockhorn, "Record Linkage," Historical Methods Newsletter 9 (March-June 1976): 137-63.

17. It is worthwhile reviewing here the distinction between sample bias and sample variance. A biased sample is one which, on average, does not yield results (by which we mean mean values of variables, proportions in categories of variables, correlation coefficients, measures of association, and other statistics) equal to the values in the population. The sample variance, usually characterized by the standard error of the mean (or of some other statistic), is a measure of the expected value of the difference between the sample and population values. In general, the sample variance decreases as the size of the sample increases (by a factor which varies approximately as the square root of the number of cases for most sampling designs). Consider a sample of only ten people chosen to represent Canada. Under proper sampling procedures, such a sample will be unbiased: for example, on average it will contain five men and five women (presuming a population consisting of half men and half women). The difficulty is that such a small sample will very frequently produce erroneous estimates: in only 24.6 percent of samples of 10 will there actually be a 5-5 sex distribution, 20.5 percent of the time the split is 6-4, 11.7 percent of the time it is 7-3, and so on. Thus the sample estimate is often in error, but it is unbiased because a 6-4 split is as likely as a 4-6 split, 7-3 is as likely as 3-7, etc. Alternatively, consider attempting to estimate the sex ratio for Canada drawing a sample of 100,000 people from the province of Nova Scotia. Such a large sample would yield very precise estimates. Repeated samples of 100,000 would all produce sex ratios which were very similar to one another. But this sample produces biased estimates of the sex ratio in Canada as a whole, since we have no reason to believe that Nova Scotia is adequately representative of Canada.

18. Leslie Kish, Survey Sampling (New York, 1965).

19. Ibid., pp. 90-92.

20. A somewhat more efficient procedure would be to combine the clusters into subgroups systematically, using the method of half-sample replicates described by P. I. McCarthy, "Replication: An Approach to the Analysis of Data from Complex Surveys," (Washington: National Center for Health Statistics, Series 2, No. 31, 1966). When the clusters are individually weighted, the greatest losses in efficiency would take place for clusters whose ethnic distributions vary markedly from that of the total population. By creating half-sample combinations of the clusters (i.e., one cluster from each stratum), and producing weights for them, the weighted subsamples would tend more to resemble the population as the "eccentricities" of the individual clusters within each half-sample and would tend to cancel out one another.

21. Yvonne M. M. Bishop, Stephen E. Fienberg, and Paul W. Holland, Discrete Multivariate Analysis: Theory and Practice (Cambridge, Mass., 1975), pp. 83-102.

22. In fact, iterative proportional fitting could be combined with the first method discussed in which last names were classified by ethnicity.

23. Pierre Beauchamp, Hubert Charbonneau, and Yolande LaVoie show that the linkage of French names is best accomplished by using Henry's phonetic code, but its advantage over the Russell Soundex code is not very great; so it seems sensible to choose the Russell code to create the subgroups within letters in a population including both British and French surnames. See their "Automatic Record Linkage of Nominal Data: The Experience of 17th Century Canadian Censuses" (Paper presented at the annual meeting, Population Association of America, Toronto, April 1972). A small increase in its accuracy would probably be obtained by creating additional categories (i.e., beyond the 26 letters) for names beginning with "Saint," "Le" or "La," "Mc" or "Mac," and "O'."

24. One might anticipate problems even in recognizing, say, from the census microfilms, certain first letters. For example, F and T might be confused. Also, it may be difficult to distinguish the vowels from one another; A, E, and O are especially prone to confusion. Our experience to date with coding approximately 70,000 individuals from the Canadian censuses of 1861 and 1871 failed to reveal significant problems of legibility. Our conclusion is based both on the reports of coders and on careful additional checks of about 13,000 individual records. Since the Soundex code does not use the vowels, except as "separators" between consonants, the problem of distinguishing them is eliminated.

25. See, for example, Michael Anderson, Family and Kinship in Nineteenth-century Lancashire (Cambridge, 1972); Lutz Berkner, "The Stem Family and the Developmental Cycle of the Peasant Household: An Eighteenth Century Austrian Example," American Historical Review 77 (April 1972), 398-418; Peter Laslett and Richard

26. It might be noted that the small proportions of individuals in a number of important categories suggest the need for rather large samples. In Canada in 1871, cities and towns comprised only 15 percent of the population; the two Maritime provinces, New Brunswick and Nova Scotia, contained 8.2 percent and 11.1 percent of the population respectively; and those of German ethnicity made up only 5.8 percent of the population in 1871. Aside from the Germans, English, French, Irish, and Scots, no ethnic group of European origin contains as much as 1 percent of the total population.

27. It is assumed, for simplicity, that all males aged 30 and over are household heads and that 25 percent of the men aged 15-19 and 50 percent of those aged 20-29 are servants or boarders. The estimates do not take account of the possible effect of missing data. For the Canadian data, the only serious gaps are in 1851. In this year, very significant gaps in the manuscript records exist. For example, no manuscripts survive for Toronto in 1851, the third largest Canadian city in that year, containing over 1 percent of the total population enumerated in British North America. The high proportion of missing data would bias a letter sample begun in that year, but the fact that no significant amount of data is missing in the later years means that the problem is minimized. In cases where missing data occur in more than one enumeration, then the problem of linkage is more complicated. Our main consideration will be to employ conservative estimates of loss of members from a sample due to missing data in subsequent records and to compensate by increasing the original sample size.

28. Their estimates are taken from Nathan Keyfitz, "The Growth of the Canadian Population," Population Studies 4 (June 1946), 47-63; Duncan M. McDougall, "Immigration into Canada, 1851-1920," Canadian Journal of Economics and Political Science 27 (May 1961), 162-75; Norman Ryder, "Components of Canadian Population Growth," Population Index 20 (January 1954): 71-80; Pierre Camu, E. P. Weeks, and Z. W. Sametz, Economic Geography of Canada (Toronto, 1964).

29. We may note at this point that letter sampling on a national basis does not permit us to know the fate of emigrants, but it is quite conceivable that some emigrants could be located in various nominal records of other countries. For Canada, for example, we could search for those individuals who cannot be traced (and are likely not to have died in the interval) in the censuses of American states to which migration is known to have been heavy.
We thank Peter Knights and Michael Katz Wall, eds.,Household and Family in Past Time: Com- for separately drawing this possibility to our attention. parative Studies in the Size and Structure of the Domes- tic Group over the Last Three Centuries in England, 30. Keyfitt, p. 49. France, Serbia, Japan and Colonial North America, with Further Materials from Western Europe (Cambridge, 31. Roberta Balstad Miller, “The Historical Study of 1972); Michael Gordon and Tamara Hareven,eds., Social Mobility: A New Perspective,” Historical Methods Journal of Marriage and the Family 35 (August 1973). Newsletter 8 (June 1975): 92-97.
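The percentages quoted in the note above on estimating the sex ratio from samples of 10 can be checked with a short calculation. The following is a minimal sketch, assuming a simple binomial model in which each sampled person is a man with probability 0.5; the function name `split_probabilities` is illustrative and does not come from the original study.

```python
from math import comb


def split_probabilities(n=10, p=0.5):
    """Return P(exactly k men in a sample of n) for k = 0..n (binomial model)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


probs = split_probabilities()

# Probability of each split in one direction (k men, 10 - k women).
for men in (5, 6, 7):
    print(f"{men}-{10 - men} split: {100 * probs[men]:.1f} percent")

# The estimate is unbiased: mirror-image splits are equally likely,
# so the expected number of men equals the true value n * p.
expected_men = sum(k * pk for k, pk in enumerate(probs))
print("expected number of men:", expected_men)
```

Under these assumptions the loop reproduces the 24.6, 20.5, and 11.7 percent figures, where the 6-4 and 7-3 figures each refer to one direction of the split (six men and four women, say), and the expected number of men is 5, which is the sense in which the small-sample estimate is unbiased although frequently in error.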

Historical Methods, Vol. 12, No. 4, Fall 1979

Error in Historical Data Files: A Research Note on the Automatic Detection of Error and on the Nature and Sources of Errors in Coding

A. Gordon Darroch
Department of Sociology
York University

and

Michael D. Ornstein
Department of Sociology and The Institute for Behavioural Research
York University

Introduction

There are now a very considerable number of quantitative social history projects using census data for the nineteenth-century United States, Canada, and Europe.1 Some of these projects are based on relatively large bodies of data collected for several points in time. With increasing frequency they involve more than one principal investigator and a number of student or other assistants. The tendency toward relatively large-scale research seems likely only to increase as interest in quantitative social history of the last century continues to grow.

There are always errors committed in the transcription of data from original manuscript sources or microfilm to a form permitting systematic analysis, commonly a computer file. The problem increases, we presume, with the size of the data file and with the complexity of the data collection procedures. The numbers of errors committed likely vary with the numbers of assistants coding data, their experience, and the time devoted to the task, as well as with the quality of the original manuscript sources or microfilm copies. Moreover, unlike social science projects from which coding procedures may be readily adopted, the problem of legibility in the original data can be severe and is complicated, even for experienced social historians, by the problems of familiarity with nineteenth-century usage. Problems may arise with the variety of names and spellings and with ethnic variations, with occupational titles, and especially with abbreviations and conventions used in the original coding. Of course the problems are greatly reduced by experience, but experience alone cannot be expected to eliminate errors made in the very first stages of the research.

Moreover, it is crucial that undetected errors committed in the coding stages of a project are truly lost. Unlike sampling error, for example, errors in interpretation or transcription of historical records cannot even be estimated systematically. They become part of the record itself. Such errors may be dispersed enough not to affect aggregate analyses, for example, of occupational composition or variations in household structure, although we really do not now know. They are almost certainly more likely to complicate the linking of individual files which provides much of the richness of quantitative history. So far as we are aware, no one has provided an assessment of the extent and types of coding errors committed in any project.2 We report here the experience of the first phase of a large-scale project based on the Canadian census of 1871.

The 1871 Canadian Sample

The project for which the study of coding errors was undertaken was the first phase of a continuing study whose ultimate aim is to trace a representative national sample of nineteenth-century people through the Canadian censuses of 1851, 1861, and 1871. In the larger study we complement the nominal data on individuals and households with data from the aggregate census reports regarding the economic and sociodemographic characteristics of the counties, townships, and urban areas of residence. We are currently beginning the first large regional (provincial) studies with the support of the Social Science and Humanities Research Council of Canada (formerly the Canada Council).3

The feasibility project required the coding of all the data given on the census manuscripts for over 60,000 individuals who were members of 10,000 households sampled from the Canadian census of 1871. The manuscript data is on microfilm. The sample was a random, stratified sample.
The stratification overrepresents house- time, cost, and possible error involved in a two-step holds located in cities and towns (15 percent of the total procedure. Canadian population in 187 1) and also overrepresents We intend to use remote data entry procedures in the selected ethnic-origin populations in various regions. Spe- continuing larger study. At the time initial coding was cifically, the French-origin group is overrepresented in carried out, however, the facilities available through the New Brunswick and Ontario, the British-origin group in York University Computing Centre all but precluded its Quebec; further, there are additional members of the use. On the other hand, direct keypunching would require German-origin group (comprising 5.8 percent of the pop- coders with keypunching skills. However, it is likely that ulation). These groups were added by “double-sampling,” we would have chosen the two-step procedure in any case i.e., additional German-origin members were added to at this feasibility study stage. The procedure is suited to the sample in a second stage of sampling. gaining first-hand experience with the difficulties faced The data were coded from microfilm by students in reading and accurately transcribing microfilm records. employed and trained for the task. They had previous Of course, the two-step procedure requires that the experience coding on social science projects as employees key-punched data be checked and verified for illegal codes, of the Institute for Behavioural Research at York Uni- inconsistencies, and the like. In our case we opted both to versity but no experience with historical data. The coding verify the keypunching and to scrutinize the data for cod- was supervised by members of the staff of the institute ing errors by using a computer program in batch mode.4 working closely with the principal investigators. In all, This means that once errors are detected or suspected, we twelve different coders worked on the project. 
They were were required to return to the microfilm to locate the selected by the coding supervisor from a larger number. incorrectly coded record. The correction of the files The coders usually worked in groups of four or five for up proved to be a time-consuming’and tedious process which to four hours at a time, but seldom longer. The coding the principal investigators undertook almost entirely them- supervisor was always available for consultation and selves. Nevertheless, two advantages of this verifying- interpretation of the manuscript data. Coders also con- checking procedure became evident. First, in a project the sulted among themselves. The supervisor routinely spot- size of the feasibility study, much less the size of the pro- checked the completed coding forms against the micro- posed study, principal investigators simply cannot under- film records. Corrections to the data were made take much of the original coding, though close supervision immediately. is essential. We have found, however, that familiarity with The coders worked from detailed coding instructions ail the problems of coding and their implications for data and practiced under supervision before proceeding. We analysis has been assured by primarily undertaking the required virtually all the information on the microfilm verification and error-checking ourselves. The computer records to be transcribed exactly as it appeared. The program written for this purpose requires that in addition coding forms were facsimiles of the manuscript records to the detection of field errors and invalid codes a set of with the following exceptions: the place of birth, religion, ambiguities in the card file is reexamined and corrected if and nation of origin data on the 1871 Canadian census it indicates errors. 
We describe the coding and checking were coded according to preestablished two-letter procedures, then turn to an analysis of the results of a mnemonic codes (in a few cases, where a series of previ- special “error detection” subproject which we built into ously undetected entries turned up, new mnemonics were the data collection. The subproject entailed the complete . . created, with the consent of the supervisor). Otherwise replication of the coding from the microfilm records of all names, occupational titles, and other data were directly a 10 percent sample of the originally sampled households. transcribed. Coding Decisions The coding forms were designed to resemble closely orig- Coding Procedures and Considerations of Error inal printed forms used for the nineteenthcentury cen- Considerations of cost led us to employ a coding procedure suses, as we noted. The variables are transcribed in the for the nominal census data which involves two steps com- precise order in which they appear in the original, and mon to coding in projects such as this one. First, the cen- where it is necessary to code something less than the sus microfilm data are transcribed onto coding forms complete original entry, mnemonic codes with letters designed for the purpose; second, the data is keypunched are used. All the data on an individual are coded in a total from the coding forms. It is possible to combine the two of 80 characters, for ease at the keypunching stage. steps into one, by coding the census directly onto a com- Because of the high cost of correcting errors, our main puter terminal and using remote data entry or by key- objective, especially in using mnemonic codes, was to punching directly from manuscript records to computer minimize error at the coding stage. The pursuit of this cards. 
Remote data entry would allow for immediate objective is not without its costs, for the data can &hen error correction, using a program such as the one we be analyzed only after considerable transformation. Pri- describe below, to provide continuous error-checking and marily we must substitute a unique numeric (rather than file creation. Direct keypunching, of course, saves the alphabetic) code to represent each possible value of each 158 variable for the purposes of analysis. The transformation Four of the variables which were coded, the surname. of alphabetic into numeric codes is accomplished by a religion, place of birth, and nation of origin, were quite computer program designed specifically for this project. often identical for each person in a sequence of individuals The alternative to our procedure is the conventional one listed within a household. So that the coders would not be of recording the original data as numeric codes when it is required tediously to copy out these variables for each -‘first read from the microfilm. But the conventional pro- person in (I sequence, blank fields in the keypunched data cedure is both slower and, we suspect, more prone to were automatically given the same values as the corre- error than our mnemonic coding method. Consider, for sponding field for the previously coded person. Of course, example, that if two digits are used to represent a religion the first person in each household must have all his or variable, approximately 60 of the 100 possible two-digit her variables coded. combinations would be required. The numbers 43,67, and 10 might represent Wesleyan Methodists, Adventists, and Data Processing: Detecting Errors and Ambiguities Roman Catholics, respectively-instead of the codes WM, Computer programs were designed for this data to accom- AD, and RC which we use. 
The mnemonic codes are plish two tasks: they carry out a series of logical checks superior in two respects: they are much easier for coders on the coded data to allow coding errors to be corrected, to recall and thus should increase efficiency and result in and they take the original data and transform it into files fewer coding errors; and, if an error is made, the resulting on which data analysis can proceed. The two tasks are error is more likely to be detected in the data-checking intimately related. For example, if the “blank *” is found procedure. It may be noted that there are 26 X 26, or in the religion field for a given individual, it is necessary ’ 676, valid two-character mnemonic codes, of which 60 to make certain that a long form containing the religion ‘are required. Thus in most cases a coding error will take for this individual has also been created. When the data the form of an invalid and hence correctable code. If by are transformed into files for analysis, the contents of the mistake an Adventist is coded AT, rather than AD, the religion field on the long form must be merged into a error is simply more likely to be spotted in any visual specific position on the record for the relevant individual. review of the file than an error in numeric codes. Our Besides checking for the existence of long-form data, error correction entailed scanning a print-out of coded flagged by asterisks on the individual records, a number data in which this feature was helpful. of other checks of the data are carried out. The entries for Two sets of two-character mnemonic codes were variables with a fixed set of codes, including religion, developed: one for religions and one for place of birth. place of birth, nation of origin, marital status, and school The place of birth codes were also used to code the attendance, are checked to see that they are among the “nation of origin” variable in the 187 1 census of Canada, predetermined acceptable codes. 
In addition, some infre- so for example, the code for England (a place of birth) quently occurring combinations of codes are also flagged was also used for English (reported as a nation of origin). by the program so that they can be checked. These include The major difficulty which arose from the adoption of individuals who are listed as attending school but are under this procedure was that certain religions and places of four or over nineteen years of age, married persons without birth occurred so infrequently that it made no sense to a spouse present in the household, and all individuals listed create codes for them, yet we were committed to preserv- as deaf and dumb, blind, or of unsound mind. These cases ing the exact content of the original manuscripts. The were scrutinized for possible error, and many were checked solution was to use the special code, a blank followed by against the microfilm records. an asterisk, whenever a mention of a religion, place of The fixed field format which we chose to employ has birth, or nation of origin occurred which had not previ- the advantage of providing records which are facsimiles ously been assigned a mnemonic code by us. At the same of the original enumeration and make it especially easy to time an additional coding form, called the “long form,” compare the original microfilm records to the computer was filled out, keypunched, and its contents merged by print-out and cards of household records in which the computer with the individual record file.s program detected errors or ambiguities. The first and last name of each individual and his or The census of Canada for 187 1, like some other her occupation were transcribed directly onto the coding records, does not record the relationships between mem- form in the form in which they appeared on the manu- bers of the household. 
If some error is tolerated, it is script, for no predetermined coding scheme could pre- possible to deduce relationships among individuals within serve the original content in full. Because of the fixed a household, making use of surnames, marital status, age, field coding scheme, a procedure was required to deal with and the ordering in which individuals within a household cases where the name or occupation exceeded the length are recorded on the census manuscript. A decision was of the field (16 characters for each name and for the made to carry out such an analysis for each household occupation) on the coding form. Here again, the “long and, using this information, to attach to all records of form” was employed: an asterisk was placed in the last individuals a number of summary variables describing space allocated to the variable in question on the individual the households of which they were members. Clearly, coding form, and a long form was filled out containing there were cases where the family relationships are ambigu- only the uncoded part of the person’s name or occupation. ous. In all the detectable cases of ambiguity, a message 159 insist that when this contingency is faced in historical # was printed out by the data-checking program to apprise us of the difficulty. For example, a child could logically studies, some form of intervention by investigators in have more than one person in the household as a possible ambiguous cases should be built into the data-processing, mother-logically here being taken to mean that in the and a record of those interventions and ambiguities should household with the same surname as that child there are become part of the data file. two or more women who are married or have been mar- ried, and whose ages differ by at least 15 and no more Errors and Ambiguities Detected by Computer than 50 years from that of the child. 
In all such ambigu- Program ous cases an “error” message was printed and the house- We outlined above the main aspects of the programs hold scrutinized by the principal investigators to attempt devised for this project which detected coding errors and to resolve the ambiguity. In the great majority of cases, ambiguities in the course of processing the data files. We the determination of family relationships among mem- summarize here the results of these procedures. The pro- bers of a household was unambiguous. gram had two options; the second added the “long form” Two other potentially ambiguous situations of this data to the files but at the same time searched the data kind were flagged by the program: individuals who appear again for errors which were not completely corrected to be the children of those who are listed later in the after the first pass. Each version checked for valid codes sequence of persons in the household (mostly this seems and correctly punched fields in three sets of variables. to indicate a widowed parent or aged couple living in the The first set consisted of identification variables of vari- . household of their child), and individuals who are identi- ous kinds: household and family numbers (correct ; fied as children of parents in the hot&hold but who are sequencing), sample numbers, and geographic location separated from their parents by one or more persons of codes. The second set consisted of sociodemographic a different surname. variables for which valid codes can be readily established: In each of these ambiguous cases, a “warning” message birthplace, religion, nation of origin, marital status, sex, led us to reexamine the household-for example, if two age, and school attendance. Finally, during the second individuals could “logically” be taken as the mother of a pass of the data, the program identified the “ambiguities” given child, the algorithm assigns the child to the “mother” in inferred family relations, as discussed above. 
who is closest in the household listing. What if our reexami- nation of a specific household leads us to conclude that ‘In addition, the program flagged ambiguities which the wrong person has been selected “automatically” as the inspection of early coding trials suggested might indicate mother? In such a case a “special allocation” form is filled cases which were poorly coded but in which specific field out, from which a card is keypunched. The card contains errors or invalid codes did not occur. These were the cod- an instruction to the program to reallocate the parent- ing of individuals as married but with no obvious spouse child relationship. In this case, when the raw data are present and cases in which very young children were coded reprocessed a message indicating this change is printed as attending school or being able to read or write. Although when the household is encountered, and a variable on the the completed files contained surprisingly large numbers of record indicates that this “special allocation” has taken both ambiguities, checks of the original enumeration indi- place.6 cated that almost none of these were coding or keypunch- Two additional kinds of variables and their ambiguities ing errors. Specifically, the program detected 433 married are generated by the computer program. The first is a set individuals with spouses absent from households in 187 1 of summary variables describing the entire household and 286 cases of children under six enumerated as being which is attached to the record of each individual in the in school or literate. Substantive analysis may reveal some household. Among these variables are the number of * of the reasons for these characteristics of the enumeration. people in the household, the number of married couples In every case of an ambiguity or error being detected, in the household, the number of children or widowers, we examined the keypunched cards and the original cod- and so on. 
The second type of variable describes each ing form. If errors were detected in keypunching, they individual uniquely; that is, they have different values for were corrected directly; otherwise the original manuscript each person in the household: for example, a child’s num- records were consulted on microfilm. There are few ber of older and of younger brothers and sisters resident enough errors even in these relatively large files to make in the household, the number of children of widowers, this procedure feasible, if tedious. Since the program pro- each person’s mother and father (head of household, a vided a summary record of all the ambiguities and errors second, third, etc., marital unit), and so forth. The latter, detected, we can report first the actual results of our of course, rely upon prior inferences regarding family error-detection procedures. relationships and duplicate their ambiguities and errors. Errors in identification variables were not consequen- We think that “automatic” creation of such relation- tial. In the file of over 10,000 households and over 60,000 ships is an acceptable substitute in the absence of an individuals, there were just over 2,200 recorded ambigui- original manuscript variable describing them. Certainly ties in the “automatic” allocation of children to parents swine inference regarding family reelatiansb@s is critical in the same hausehotds. Over half of these were ambigui- in the analysis of historical census data. But we would ties which arose because there were two women whose 160 ages made it possible for either to be the mother of a tion marks, due to illegibility or unintelligibility of the child in the household. These cases were resolved by the original microfilm records. Almost all (55) were ambigui- computer algorithm which assigned the child to the ties in age records. women listed closest in the household record. 
Occasion- That these few hundred errors occur in the records of ally we altered this decision after reviewing the enumera- over 60,000 individuals seems to warrant a firm conclu- tion. In some cases it seemed more likely that a person sion that historical data can be collected with considerable with the same surname as a group of children in the fidelity, and, indeed, probably has been in any reasonably household or a much older couple were actually grand- careful project. Moreover, we found that a negligible num- parents-or perhaps, but unknown to us, an older uncle ber of the errors were keypunch errors. Finally, we hope and aunt. In the case of one of the other ambiguities in it is obvious that a computer program, based on principles family relations, the separation of children from a parent such as the ones we have described, efficiently cleans his- by a different-named household member, over half torical data files and, as well, provides a potentially use- occurred in Quebec in cases in which it appeared likely ful record of the exact errors and ambiguities committed that the married woman was enumerated by her maiden and surviving in the file. name. Despite these results, the procedures described cannot All in all, then, we found the automatic detection of detect errors in the interpretation and transcription of “ambiguities” to provide a useful permanent record of names or occupations. Yet these are crucial errors for these cases and to provide grounds for some confidence historical studies. Moreover, these, or any other errors in family re$tionshi’ps established among members of escaping detection, become “hidden” in the data them- the househo!ds. That is, the family relationships can be selves. Hence, as part of the larger project we undertook inferred consistently for a very large proportion of the the further study of error by ,means of a sample of the households on the basis of a relatively simple set of rules. previously coded household records. 
Further, ambiguities in the records which suggest inconsistencies in coding (absent spouses, young children in school) give no indication in fact that omissions or miscodes are a problem in coding large amounts of data.

Finally, errors in the sociodemographic variables could arise in either coding or keypunching. The types of error which an automated procedure can directly detect are invalid codes and column or field errors. There were seven variables in our records for which these could be checked. For the three mnemonic codes there were totals of 285 errors in coding religion, 86 errors in coding birthplace, and 72 errors in nation of origin. The large number of errors for religion was primarily accounted for by a string of exactly 200 coding errors for a section of the data in the province of New Brunswick. The coders had used a small set of erroneous but frequent codes for this section of the data. The error was obvious and easily corrected upon inspection.

These errors were all but eliminated after the first pass of the program, but it bears noting that the second pass identified 56 remaining errors in the three variables together. Most of these were true ambiguities, due to illegibility of the records. In such cases the coders had been instructed to record a question mark in the place of any truly illegible entry. However, a few errors identified in the first pass failed to be corrected and were only caught in the second. Our inspection of the records suggests that these were commonly cases in which several errors were committed in a single household and one or more were overlooked in the correction. A second pass of the data by an error-detection program is worth undertaking in comparable projects.

There were smaller numbers of errors detected in the other four variables, sex, age, marital status, and school attendance. Together there were only 166 errors committed, of which about 60 were recorded simply as question marks.

Determining the Extent and Sources of Original Coding Error

A second, stratified random sample of previously coded households was taken as part of the design of the original project. The sample consisted of 810 households, selected as follows: 20 cities and towns were randomly selected from the provinces of Ontario and Quebec, and 8 cities and towns from Nova Scotia and New Brunswick. Forty rural districts were randomly selected from Ontario and Quebec; 13 rural districts from Nova Scotia and New Brunswick. These distributions approximated population distributions. Within each census district selected, we made a random start among the households included in the original sample. Ten consecutive households were then reexamined by a single experienced coder who was fully informed of the need to exercise caution in recoding in order to facilitate the determination of errors in the original coding. The original and subsequent coding were compared, and the differences were recorded as errors in the original.7

The recorded errors were classified and punched as a card file for analysis. The classification of the errors was straightforward. They were recorded as one of three types: errors of omission (simply failing to record a variable which was on the original manuscript); errors of interpretation or transcription (failing to record precisely all details of the original entries, or in the case of mnemonic codes, failing to enter the correct code); and field errors (entering data in the wrong columns of the code form). Every error for every variable was classified. In addition we have as information about the coding the names of the coders, the date of the original coding, the location of the household by province and district and designation as rural or urban, and a subjective estimate of

Table 1.-Illustration of Types of Errors in Coding Household Data

Names

First coding

John Woodsich       Farmer        NS   FC   61  M  M  P  L 1 1
Lennet Woodsich     Wife          SC   FC   47  F  M  P
Adelaid Woodsich                  NS   FC   09  F  S  P
Barbra Woodsich                   UC   FC   05  F  S  P
Peter Morison       Labourer      SC   CS   49  M  S  P  L
Mary Bell           Housekeeper   SC   CS   51  F  S  P  B
John Magnere        Labourer      SC   CS   16  M  S  P  1

Second coding

John Woodside?      Farmer        NS   FB_  61  M  M  P  L 1 1
Tennet Woodside     ____          SC   FB_  47  F  M  P
Adelaide_ Woodside                NC_  FB_  09  F  S  P
Barbara Woodside                  UC   FB_  05  F  S  P
Peter Morison       Labourer      SC   CS   49  M  S  P  L
Mary Bell           Housekeeper?  SC   CS   51  F  S  P  _
John Magnere?       Labourer      SC   CS   16  M  S  P  B_ _

the legibility of the microfilm records (coded 1 to 5, from excellent to barely legible). The estimate was made at the time of the second coding.

If quantitative social historians lie awake at night worrying about the quality of their data, the coding reproduced in table 1 may induce perpetual insomnia. This little nightmare is an actual instance of coding problems found when two different coders accidentally duplicated the coding of a single household. In fact, the duplication did not occur in the study reported above, but in a related project on two selected counties of southwestern Ontario.8 We are happy to say at the outset that the evidence we will present below indicates that the case is exceptional.9 At the risk of reducing the credibility of our research, we report it here to show concretely the variety of errors coding may entail which would be virtually undetectable in the absence of duplication or systematic checking. We do wish it were merely hypothetical.

All the differences in the coding are underlined in the second case. (In fact the coding is not completely represented here. Our procedures did not require the duplication of some of the information, such as same last names appearing for adjacent individuals or for a duplication of codes for birthplace and religion, marital status, sex, and so on, for adjacent individuals.) Consider the numbers and types of errors committed in the two cases. There are 23 specific differences between them which would have been counted as errors according to our error-checking procedures. There are 3 additional differences which would not have been recorded as errors. They are the three question marks entered after two last names and an occupation. In our coding these were flags recording ambiguity on the part of the coder and punched into the machine-readable file so that cases of uncertainty were readily checked. One coder was more hesitant in a situation in which the legibility of the microfilm record was far from good.

No field errors are made, but there are errors of omission in each of the examples. In the first, John Magnere (?), labourer, is not recorded as blind. Mary Bell, housekeeper, is, though this would also be recorded as an error of transcription by our procedures. The second coder failed to record the occupation “wife” and failed to record that John Magnere, labourer, was also listed as being in school, however little confidence this unusual combination of attributes inspires in the original enumeration.

Primarily the errors are those of transcription and interpretation. The transcription errors occur mainly in recording names. There are actually eight separate errors committed in recording the last names of four family members of the nuclear family, that is, two mis-transcribed letters in each entry. Three additional errors were made in recording the characters of first names.

Errors of interpretation are most evident in the differences in the mnemonic codes given for birthplaces and religious affiliations. In this respect the first coder was actually correct. The second gave NC for NS, an invalid code, instead of the one which stands for Nova Scotia. In the case of recording religions, the second coder erred in recording FB rather than FC for all the members of the family of the head of household. This is an error which is particularly difficult to detect, for unlike the invalid code NC, FB was a valid code, for Free Baptist. In fact the members of the family are Scottish Presbyterian, Free Church (FC).

It is clear that errors in coding might accumulate quickly in specific cases and, indeed, are relatively difficult to detect and correct by conventional verification procedures. We have dwelt on a specific, exceptional case at length to illustrate the problem; we are able to present a more systematic and representative analysis, which offsets the impression that coding errors of this kind are likely to be frequent.

Rates and Sources of Error: An Analysis of Sample Data

Considering the illustrative case and the analysis given above, it is apparent that determining “rates” of errors committed in transcribing nominal, historical data is complicated by the definition of error accepted. Perhaps the matter seems to be splitting hairs, but clearly errors of transcription can be considered for every single character in the record. A number of these are quite crucial, especially first letters of names and individual letters of mnemonic codes and single entries, such as marital status. Of course, the possible number of errors varies with each individual household, the numbers of persons, the lengths of their names, occupations, and the information originally recorded. It is worth noting that, defined in this way, the number of possible transcription errors is extremely large and in any large file is virtually astronomical. There are over 200 possible transcription errors alone in the characters of the illustrative case given above. In a household sample as large as the one we have collected (approximately 10,000 households), this means there are in the order of two million possible transcription errors. Errors of omission, defined more reasonably as the omission of complete entries or variables, and field errors, add many thousands of other possible errors.

In the analysis of the numbers, kinds, and sources of error we present below, we have taken a more limited, but manageable, definition of error. A single error of transcription-interpretation was counted if any character of a complete entry was in error. For example, recording Woodsich for Woodside for four members of a household would be counted as a single error of transcription-interpretation in the recording of surnames. Errors of omission and field errors are more obvious. The former are counted in every case of a missing entry of any kind and any entry in an incorrect column of the coding form is taken as a single field error.

In the sample of 810 households, with 22 separate entries or variables for each, a total of 272 errors of all kinds, as defined above, was committed. Of these, 187 were transcription-interpretation errors, 76 were errors of omission, and only 9 were field errors. The absolute numbers of coding errors, especially transcription-interpretation errors, appear to be relatively large even in a study based on a moderately large number of households. On the basis of this subsample, in our total sample of over 10,000 households we can expect over 3,400 errors to have been committed in separate variables. Two important qualifications of this conclusion must be entered. First, the errors detected in this subsample include those which any checking and data-cleaning procedures would locate, and, secondly, the overall rate of error is actually exceedingly small. If we say for every individual there are 22 possible errors, one for each variable coded, and grant that the average number of individuals per household was 6, then in the sample of only 810 households, there were approximately 103,900 possible errors in entries. A total of 272 is a minuscule ratio of .0026, or about one-quarter of one percent of all possibilities. The largest single category of errors, those defined as transcription-interpretation, amounts to a mere .18 percent. We suggest that in this common-sense view of the quality of coding, there is reason to be confident.

If we take a very cautious position, we might argue that such rates as reported above tend to minimize the question of error in cases where even small errors of specific kinds may be of some importance in affecting analysis. For example, in our study as in others, the household is the sampling unit and is often the basic unit of analysis as well. The misinterpretation of surnames alone becomes a category of error which directly affects the possibilities of effective linkage and all of the essential analysis which depends upon it. As linkage increasingly comes to use as much of the information about entire households as possible, to locate individuals by their household contexts, then other errors in first names, ages, religious affiliation, and so forth, all become relevant limitations. Hence, we might consider the rates of error of all kinds per household as a gross indicator of the possibility of errors affecting linkage. Given the sample data for 810 households, the 272 errors of all kinds amount to a rate of 33.6 errors for every 100 households. In other words, a maximum of one-third of the households could have at least one error of some kind in one or another variable, although in fact the errors might be concentrated in a relatively few households. Still, in this average sense a maximum of 23 percent of the households could have an interpretation-transcription error; 9 percent, omission errors; and 1 percent, field errors. This is a way of considering, perhaps, the maximum influence which errors might have on analysis.

Another consideration might be the occurrence of errors in specific variables. For example, much recent analysis focuses on social mobility and employs occupation as the central variable. Admitting all the limitations of this single variable for assessing class and status changes still leaves it in a crucial role. It would be of some importance if many of the omission or interpretation errors should occur in this variable, a distinct possibility given the problems of legibility and unfamiliarity with occupational titles and spellings which are faced by coders, if not by principal investigators. Errors in other specific variables will also pose unique problems of analysis, akin to those such as age-heaping for demographic analysis, or perhaps for the analysis of intermarriage rates among ethnic and religious groups.

In terms of the sample data, three kinds of variables account for 244 of the total of 272 errors of all kinds. Of the 272, 142 (52 percent) were committed in recording names. An additional 68 errors were committed in six sociodemographic variables combined, and 34 errors were made in recording occupational titles. Most of the errors in names, 110, were errors in recording first names; 32 were errors in surnames. The latter certainly is the more limiting kind of error, and its relative absence is encouraging. In terms of household rates of error, a maximum of 18 percent of the households could have had an error in names occur. The number of errors which occur in first names does raise the problem of the use of first names as part of the linkage of nominal records. It must be noted, however, that our counts do not indicate the actual loss of information in any given name; a mistake in a single vowel in a name (perhaps the most common problem of transcription or interpretation) is a minor loss of information in a great many cases and often barely impairs our ability to make successful links, given any other information (Jack and Jock, Mary and Marie, Jones and Janes, Moore and More and so forth).

The next largest number of errors occurred in the sociodemographic variables, taken as a whole. Sixty-eight errors were made in the 810 households in the following six variables combined: sex, age, place of birth, religion, national origin, and marital status. Most of these, 45, or nearly 70 percent of the total, were interpretation-transcription errors; 21 were omissions; and 2 were field errors. The greatest single problem occurred in recording ages. There were 22 errors in transcribing ages, clearly not a very substantial number overall. Our experience with the coding suggests that these errors are mainly in the second of two-digit figures. We do know from the sample data that the errors in recording age were very evenly distributed among the sample cases. The rates of error do not vary directly with variations in legibility and microfilm quality or with differences among coders. We address below the question of systematically accounting for the distribution of errors in different variables.

We found it especially reassuring that there were exceedingly few errors of any kind in the coding of the “ethnicity” variables, place of birth, religion, and nation of origin, since they were the only variables which required coding according to preestablished mnemonic symbols, such as RC for Catholic, WM for Wesleyan Methodist, UC for Upper Canada, and SC for Scotland. There was a quite long list of such codes for religious affiliation, amounting to a total of some 50 separate codes. However, in all three variables only 18 errors of interpretation-transcription were made, only 15 complete omissions, and 1 field error in the sample data.

We can compare the sample results for these variables to the error detected for the same variables in the actual files, as reported above. The results of the actual error detection procedures conform closely to the results of the sample data. The numbers of errors committed may on occasion be quite large, for example the 285 errors in recording religions (although these were readily located and corrected). Yet, in terms of the quality of the coding, given the mass of data being collected, the rate of errors is small. Consider that for the six sociodemographic variables discussed here in the actual coding, 558 errors in all were detected, or possible error in just 5 percent of the over 10,000 households. This is not much different than the 9 percent of the households of the sample data. The sample results are at variance with the actual error detection in one respect. Errors in the three variables, birthplace, nation of origin, and religion in the larger project exceeded in number errors in recording ages, the largest single source of error in comparable variables in the sample data.

We found it reassuring that the coding of occupation seems to be largely error free, that is, 34 errors of all kinds in the 810 households in which it is common for more than one member to have an occupation given. Moreover, when we expected transcription errors, we find that 29 of the 34 errors were clear omissions. In any case, coders seemed in general to have unexpected ease in correctly interpreting and recording even unfamiliar occupational titles.

Finally, there were only 19 errors of any kind made in recording from the manuscript records a variety of variables, such as school attendance, literacy, and reporting those who were deaf, dumb, blind, or of unsound mind. As well there were a mere 9 errors made in the recording of seven items of information regarding the geographic location, identification codes unique to the study, and the numbering of the households, families, and individuals on the manuscript documents themselves. We think the conclusion is warranted that coding errors in large studies employing nineteenth-century nominal data can be kept to a very modest level. Nevertheless, we note again that some errors have more serious consequences than others. A relatively large proportion of all the errors we found were errors of transcription or interpretation of initials and first names. Though the absolute numbers are quite small, they are common enough to warrant special care in coding.

Table 2.-Numbers of Errors and Correlation Ratios (Eta²) between the Major Types of Coding Errors and the Independent Variables, Legibility, Coder, and Province

Dependent Variable                          Number of   Eta² for     Eta² for   Eta² for
                                            Errors      Legibility   Coder      Province

Total Errors                                272         .026         .111       .076
Interpretation Errors                       187         .056         .097       .078
Omission Errors                              76         .014         .100       .096
Errors in Names                             142         .092         .088       .071
Errors in Six Sociodemographic Variables     68         .010         .187       .043
Errors in Occupations                        34         .036         .117       .113
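
As a rough illustration of what the correlation ratios in table 2 measure, the following sketch computes eta squared for a one-way classification as the between-group sum of squares divided by the total sum of squares. The function name and the small set of coder labels and error counts are hypothetical and are not data from the study; the multiple classification analysis used in the project also yields adjusted statistics, which this sketch does not attempt.

```python
from collections import defaultdict


def eta_squared(groups, values):
    """Correlation ratio for a one-way layout: SS_between / SS_total."""
    by_group = defaultdict(list)
    for group, value in zip(groups, values):
        by_group[group].append(value)

    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        # No variation at all in the dependent variable.
        return 0.0

    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    return ss_between / ss_total


# Hypothetical example: errors per household for households coded by
# two coders, "A" and "B" (invented numbers, for illustration only).
coders = ["A", "A", "A", "B", "B", "B"]
errors = [0, 1, 0, 1, 2, 0]
print(round(eta_squared(coders, errors), 3))
```

A value near zero, like most of those in table 2, means that knowing the category (coder, province, or legibility rank) accounts for very little of the variation in the number of errors per household.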

Possible Sources of Coding Error

In an effort to assess some of the possible sources of the various kinds of errors which did occur, we undertook several analyses of variance and associated multiple classification analyses. We took as the several dependent variables the total errors of all kinds committed and errors classified into the following subtypes: total transcription-interpretation errors, total omissions, errors committed in recording names or in recording occupations, and the errors made in recording six sociodemographic variables, taken together and taken separately (sex, marital status, age, birthplace, religion, nation of origin).

The independent, explanatory variables were those which could be directly derived from the coding itself. They were the following categoric factors: the differences among the 12 coders, the differences among provinces, and a rural-urban dichotomy regarding the location of the coded households. Finally we used an ordered variable, described earlier, which was a rank assigned to the recorded records classifying them on a five-point scale from excellent to barely legible. (Wholly illegible records, which had been selected as elements of the sample in the original coding, were discarded, and randomly sampled, substitute cases from the same census districts were coded in their place.)

We examined first the results of the one-way analysis of variance between each of the dependent, error variables and each of the independent variables. We were fishing for results, of course, but we thought that the relative legibility of the records and differences among coders were likely to be significant explanatory variables. These were variables which could affect the quality of data in any large-scale historical study. There is a particular reason to consider differences in coding among regions in a Canadian study. For some portions of the populations of Quebec, New Brunswick, and Ontario in 1871, the enumeration of French-named families was done by English-speaking enumerators, a situation which, in our experience, frequently gave rise to curious phonetic constructions. We know from other researchers in Canada that this problem is not uncommon.

The analysis of variance results were straightforward and conclusive. None of the independent variables had a statistically significant effect on total errors or on any of the subtypes of error. The negative result adds weight to the interpretation offered above. The absolute numbers of errors committed in coding large amounts of historical data may appear to be considerable, but their effect relative to the vast amount of detailed information collected is minor; they are not systematic in their sources, so far as our data can reveal them.

Despite the negative result, we did attempt to sort out sources of variation in errors through further analysis. Table 2 presents the correlation ratios (Eta²), measuring the variation due to independent variables, from multiple classification analysis generated in association with analysis of variance.10

The table is useful in indicating descriptively the generally very low explanatory power of any of the independent variables, with the possible exception of two results. The differences among coders, though not statistically significant in accounting for any type of error, consistently appear to be the most important source of variation in the commission of coding errors. In fact we can report that in further analysis of covariance, which assessed the effect of legibility first and then the effects of coders and areas on the residual variation in errors, we did locate statistically significant results in explaining errors in occupational coding and for the combined sociodemographic variables.

These results, though meeting conventional standards of statistical significance, are not readily interpreted. In the case of occupational coding errors, it was the variation among areas which was statistically significant, with the other two variables taken into account. In the case of the sociodemographic errors, it was only the partial effect of differences among coders which was significant. Further, an examination of the patterns of errors for the categories of the variables (coders and provinces) in the multiple classification analysis revealed no consistent patterns.

One other result of the analysis is, perhaps, worth noting. As reported, we simply could not locate significant effects of variations in the legibility of the microfilmed manuscript records. There were quite great variations in legibility among the records, as there are likely to be in any such study. On the other hand, a close examination of the pattern of errors of various kinds by levels of legibility suggests a consistent and curious relationship. In the case of all errors committed, and for several of the subcategories of error, the most illegible records (rated “barely legible”) were coded with as little error or even less error than the records assessed to be “excellent.” On the other hand, those records considered to be of moderate-to-good quality were often the most poorly coded. Though there was some variation in the pattern, the records ranked as “barely legible” were never most prone to error for any of the types of error variables, and they were most error-free in several cases.

This unexpected result may be accounted for, first, by the relatively small numbers of records rated as “barely legible” in comparison to the numbers of “excellent” and “good” records. Of the 80 sets of records sampled, 36 were rated as “excellent” (360 households), while only 4 sets (40 households) were rated as “barely legible” in the error sample. Still there is a clear hint that when the coders were most aware of the problems of legibility of the records, they were able to decipher them in a very reliable fashion. When the records were of highest quality, there were also few problems. It was when there were uneven or unpredictable deficiencies in the legibility that the coding suffered most.

Conclusion

We have reported in some detail first the results of a set of procedures for coding and processing large amounts of nominal historical data in which a primary consideration was the minimization of coding error and the detection of those errors and ambiguities which such projects inevitably incur. Secondly we have reported the results of a unique subproject of the larger study which identified and analyzed the nature and possible sources of coding errors of all kinds which occurred in the larger project. The identification of errors resulted from recoding a 10 percent random sample of the originally coded census manuscript records for Canada in 1871.

We show how errors and ambiguities in coding and in keypunching can be readily located by appropriate computer programs for checking and verifying a file of nominal records at the same time one is creating individual and household records. The two tasks can be efficiently consolidated. Both the results of the computerized error detection and a rather detailed exploration of the kinds and sources of coding errors identified in the subsample are encouraging. They indicate that we can expect extremely small rates of error in the collection of all types of variables from nominal records, whether in interpretation and transcription, simple omission, or missed fields. Moreover, there was a distinct absence of systematic patterns of error among a variety of coders and among microfilmed manuscript records which varied considerably in their legibility. The study of errors confirms that it is possible to preserve the historian's traditional care with original sources in the collection of very large files of nominal, historical data.
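
The error rates quoted in the analysis of the sample data can be rechecked with a few lines of arithmetic. The sketch below uses only counts given in the text (810 households, 22 variables per individual, and 272 errors of three types). The variable names are illustrative, and the figure of 6 persons per household is the rounded average mentioned in the text, which is an assumption here and is why the count of possible errors comes out somewhat above the reported figure of approximately 103,900.

```python
# Counts reported for the recoded sample of 810 households.
errors = {
    "transcription-interpretation": 187,
    "omission": 76,
    "field": 9,
}
households = 810
variables_per_person = 22
persons_per_household = 6  # rounded average; an assumption taken from the text

total_errors = sum(errors.values())
possible_errors = households * persons_per_household * variables_per_person

# Errors per 100 households, the "gross indicator" used for linkage.
errors_per_100_households = 100 * total_errors / households

# Share of all possible entries that were in error.
share_of_possible = total_errors / possible_errors

# Upper bound on the percentage of households touched by each type of
# error, reached only if no household has two errors of the same type.
max_percent_of_households = {
    kind: 100 * count / households for kind, count in errors.items()
}

print(total_errors, possible_errors)
print(round(errors_per_100_households, 1))
print(round(share_of_possible, 4))
for kind, percent in max_percent_of_households.items():
    print(kind, round(percent))
```

With an average slightly under 6 persons per household the denominator falls to roughly the 103,900 reported, and the ratio rises to the published .0026; the rate per 100 households does not depend on that assumption.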

NOTES

1. Stephan Thernstrom, The Other Bostonians (Cambridge, Mass.: Harvard University Press, 1973); Theodore Hershberg, et al., “The Philadelphia Social History Project,” Historical Methods Newsletter (Special Issue) 9 (March-June 1976); Michael Katz, The People of Hamilton, Canada West (Cambridge, Mass.: Harvard University Press, 1975); David Gagan and Herbert Mays, “Historical Demography and Canadian Social History: Families and Land in Peel County, Ontario,” Canadian Historical Review 54 (March 1973): 27-47; Michael Anderson, Craig Scott, and Brenda Collins, “The National Sample from the 1851 Census of Britain: An Interim Report on Methods and Progress,” Historical Methods Newsletter 10 (Summer 1977): 117-22.

2. Peter Knights has discussed probable levels of underenumeration in original censuses of 1850 and 1860 for Boston. The Plain People of Boston, 1830-1860 (New York: Oxford University Press, 1971), Appendix C. See also Anderson, et al., “The National Sample.”

3. It was the primary purpose of the feasibility study on which this report is based to establish that an unusual form of sampling, letter sampling, provided a practical and a statistically justifiable method for tracing a sample of individuals through two or more sets of historical records. We have previously reported the statistical development of the sampling technique and have subsequently explored in detail the statistical characteristics of such samples by means of a computer simulation. See Michael D. Ornstein and Gordon Darroch, “National Mobility Studies in Past Time: A Sampling Strategy,” Historical Methods 11 (Fall 1978): 152-61.

4. The programs were written for the project by Professor Ornstein.

5. The 1871 long form has three fields: a location code which identifies the individual to whom the data referred, a code specifying the variable in question (“R” for religion, “B” for place of birth, “N” for nation of origin), and the exact mention or name for the variable as it is written on the census manuscript. For example, “West Indies” was a very uncommon place of birth requiring a “blank *” code and a long name form.

6. Further processing of our basic data was required; for example, four variables require substantial recoding from alphabetic to numerical codes to be usable in the data analysis. They are occupation, religion, place of birth, and nation of origin. Here we describe neither these procedures nor those used to deal with the coding of “long form” data. They do not give rise to comparable forms of error or ambiguity in the variables.

7. Of course, having yet another independent coding of the data would provide a possible check on errors made by the second coder, and so on. Concurrence of two of three coders might be taken as establishing the “correct” code. We did not think the cost of this additional coding was warranted by the probable marginal gain in accuracy given the experience of the second coder and the time and care he could take for these 810 households. In any event, whatever errors, or simple disagreements in interpretation, might be contributed by the second coder will almost certainly be counted as differences between the coding and counted as “error.” Hence the procedure inflates rather than underestimates rates of error. Since we show rates to be quite small, the procedure is a conservative one in this respect. We thank a reviewer for bringing the point to our attention.

8. The data are from Essex County, Ontario, in 1861. In this year no nation of origin variable is reported.

9. We can also report that in one other case of fully duplicated coding in the same county study, two different coders provided identical coding. In fact, this case was a hotel or inn with quite a large number of residents.

10. Multiple classification analysis displays the patterns of relationship between dependent variables and the categories of independent variables, in terms of adjusted deviations of the category means from the grand mean of the dependent variable in question.