An Examination of Monitored, Remote Microdata Access Systems

Sandra Rowland

Prepared under contract to the Committee on National Statistics, National Academy of Sciences (NAS)

Presented at the NAS Workshop on Access to Research Data: Assessing Risks and Opportunities, October 16-17, 2003

1. Introduction
   Types of Microdata Access
   Sampling of Systems for Monitored Remote Access to Restricted Microdata
   Methodologies Commonly Employed in Monitored Remote Access Systems
   Usage of Monitored Remote Access Systems
2. Monitored Remote Access in Foreign Agencies
   Luxembourg Income Study
   Statistics Canada
   Statistics Denmark
   Statistics Netherlands
   Australian Bureau of Statistics
   Statistics Sweden
3. Monitored Remote Access in US Federal Agencies
   National Center for Health Statistics
   National Center for Education Statistics
   Census Bureau
4. Sample of Research Projects in US
   Digital Government
   Cornell Restricted Access Data Center
   Integrated Public Use Microdata Series-International
   Internet2 and Next Generation Internet
   USGenWeb
5. Conclusions
Appendices
   Appendix 1A: Monitored Remote Access System Methodology – U.S. Federal Agencies
   Appendix 1B: Usage of Monitored Remote Access Systems – U.S. Federal Agencies
   Appendix 2A: Monitored Remote Access System Methodology – Foreign Agencies
   Appendix 2B: Usage of Monitored Remote Access Systems – Foreign Agencies

1. Introduction

Many national statistical offices (NSOs) disseminate microdata in three ways: public use microdata files on CD-ROM or online, research centers or licensed sites, and remote access to restricted microdata. This paper covers a sampling of systems in NSOs that permit monitored remote access to restricted microdata. The sample includes six foreign systems and three systems in the United States (US). Section 2 of the paper covers foreign systems in the following order: Luxembourg Income Study, Statistics Canada, Statistics Denmark, Statistics Netherlands, Australian Bureau of Statistics, and Statistics Sweden. Section 3 covers US systems in federal agencies in the following order: National Center for Health Statistics, National Center for Education Statistics and Census Bureau. The type of methodology employed in each of the systems is reviewed for each country because the methodology influences the kinds of access and results given to users. The usage of each system and the kinds of research that have benefited from its use are reviewed for each country, insofar as such information is available. A few research programs in the US bearing on remote access to restricted microdata are also briefly reviewed in Section 4 because they are examples of the kinds of research that contribute to practical applications in the public sector. Appendices 1A and 1B summarize the methodology and usage of US systems, and Appendices 2A and 2B summarize the methodology and usage of foreign systems.

Types of Microdata Dissemination

Public use microdata files contain data from surveys and subsamples from censuses and are usually edited through perturbation of data, addition of random noise, top and bottom coding, rounding, variable suppression, and adding, removing and swapping records (Horm 1999). Public use files are usually available on CD-ROM and in some cases are available on the Internet. The Federated Electronic Research, Review, Extraction and Tabulation Tool (now known as DataFerrett) is one example of the latter. It makes public use files from several US federal agencies available on the web.
Research data centers (RDCs) were created to allow users to access restricted microdata files that are not available on CD-ROM. The files may have confidentiality edits, but the detail on the files is much greater than is permitted in a public use microdata file. Users are required to submit a research proposal to the NSO that maintains the research center and, if approved, must carry out their work at the center. In some cases NSOs will license user sites, which they inspect or monitor, to obtain and use restricted microdata sets for approved purposes (Seastrom and Kaufman 2003).

Remote access systems make it possible for users to analyze restricted microdata without visiting an RDC. The systems used for remote access to restricted microdata are monitored automatically and/or manually for disclosure avoidance. They employ automated and manual filters that block certain kinds of queries and results. The files available are usually edited for disclosure avoidance using the same techniques as those used for public use files. They provide more detail to researchers than public use files, but less detail than is usually available in an RDC. The files reside in the NSO, and extracts of microdata and direct access to the records are not permitted.

Sampling of Systems for Monitored Remote Access to Restricted Microdata

The Luxembourg Income Study is reviewed first because it is the oldest of the programs that give users remote access to restricted microdata. It began in 1983 and utilizes the LISSY remote access system. The system has served as a model for many other systems (consciously or unconsciously) currently in use and under development abroad and in the US. Canada and Denmark have given users remote access to restricted microdata since 2001. The Netherlands, Sweden and Australia began pilots in the use of remote access systems in 2002 and 2003. In the US, the National Center for Education Statistics (NCES) and the National Center for Health Statistics (NCHS) gave users remote access to restricted microdata beginning in 1997 and 1998, respectively. The Census Bureau began disseminating Census 2000 microdata in 2003, after pilot tests that took place in 2002.
Methodologies Commonly Employed in Monitored Remote Access Systems

There are two common methodologies among the systems reviewed in the paper. One type usually consists of an email interface that allows users to send programs as part of the body of the email or in an attachment. These systems usually accept standard statistical programs such as SAS, SPSS, STATA and GAUSS, chosen because they are commonly used by researchers and lend themselves to automated review of input programs and statistical results. These systems return SAS, SPSS, STATA and GAUSS results and may prohibit or modify certain commands, thus limiting the kinds of output that users can obtain. The email systems are used in all of the foreign applications and by NCHS. Such systems have been referred to as “remote job execution systems” (Schouten and Cigrang 2003, p. 6). Processing is usually done in batch mode rather than online, so the systems may also be referred to as off-line. Results are returned within minutes or days depending on the size of the program and the degree of manual intervention.

The other, less common type of system consists of a web interface with custom-built or custom-tailored (commercial) software that requires users to learn how to use the program and/or user interface. The web systems produce tabular results with percentages and/or means and may have variances and correlation matrices. The web applications are used by NCES and the Census Bureau. Processing is done while the user is online and results are returned within seconds or minutes depending on the size of the tabulation. There is no manual intervention.

There are several aspects of the methodology employed that vary by system (refer to Appendices 1A and 2A). Many of these aspects are reviewed in the paper for each system depending on the information available. Methodologies employed may include:

- confidentiality edits to the base files accessed that usually involve adding noise to the data to reduce the possibility of disclosure,
- electronic authorization of users that requires the use of user identification and passwords to gain access to the system,
- email or web user interfaces that provide a facility to the user to communicate what they want from the system,
- standard statistical programs or custom applications for processing that use statistical software packages or custom programs to process the user request,
- query filters to examine requests and block the user from requesting certain results prohibited by the NSO,
- results filters to examine results and block any result prohibited by the NSO,
- automated and manual intervention for disclosure avoidance at the query submission or output stages that are totally automated or involve the review of input and/or output by statisticians, and
- usage logs for disclosure avoidance review that are accumulated and used by NSOs to determine if their rules are adequate for disclosure avoidance and to detect possible risks to confidentiality.

Another important methodological aspect of a remote system is automating complementary disclosure review to prevent the possible disclosure of restricted data that could result from combining multiple outputs. Although research in this area has been undertaken (Duncan, Roehrig, and Kannan 2000), none of the systems reviewed contain mechanisms to prevent complementary disclosure. This may be due to the difficulty and expense of automating such procedures.

Researchers and NSOs are also interested in disseminating data from smaller geographic areas while preventing disclosure of restricted data. Research on automating aggregation of low-level geographic areas and/or allowing user-defined geographic areas has been carried out using real data (Karr and Sanil 2001). None of the systems reviewed contain mechanisms to disseminate data for user-defined areas.
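The complementary disclosure problem noted above can be illustrated with a small sketch. The records, the minimum cell size of 5 and the function names are all hypothetical and are not taken from any of the systems reviewed; the point is only that two outputs that each satisfy a per-table rule can be differenced to reveal a cell that the rule would have blocked.

```python
from collections import Counter

# Hypothetical toy microdata: (region, age group) pairs. Not real survey data.
RECORDS = [("North", "under 65")] * 40 + [("North", "65 and over")] * 2

MIN_CELL = 5  # assumed minimum cell size; actual thresholds vary by NSO


def count_by_region(records, keep=lambda record: True):
    """Count records per region among those that satisfy `keep`."""
    return Counter(region for region, age in records if keep((region, age)))


def passes_cell_rule(table, minimum=MIN_CELL):
    """A per-table results filter: every released cell must meet the minimum."""
    return all(count >= minimum for count in table.values())


# Two separate requests, each acceptable on its own.
all_ages = count_by_region(RECORDS)
under_65 = count_by_region(RECORDS, keep=lambda record: record[1] == "under 65")

# Differencing the two released tables recovers the small "65 and over" cell.
implied_65_and_over = {
    region: all_ages[region] - under_65[region] for region in all_ages
}
```

A filter that examines each output in isolation accepts both tables, which is why preventing this kind of disclosure requires comparing a request against the history of earlier outputs.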
Usage of Monitored Remote Access Systems

Most of the users of monitored remote access systems are researchers and public sector staff. The systems are sometimes limited to official users but most often are not. Most statistical offices require user registration and some must officially accept a research proposal before the system can be accessed. Some of the foreign systems reviewed have little information on usage because the systems are new or in the pilot stage. Others, such as those used in Statistics Canada and the Luxembourg Income Study, have detailed information. The Census Bureau has statistics on usage and a little information on customer satisfaction. The NCES has a customer satisfaction survey with some usage statistics and the NCHS has usage statistics.

There are several aspects of system usage covered in the paper depending on the information available (refer to Appendices 1B and 2B):

- permission or authorization required by the NSO to access the systems that involves signing a research contract and confidentiality agreement or some kind of registration,
- types of users permitted to use the systems, ranging from the most exclusive policy of allowing only public sector users to the most inclusive policy of allowing anyone to use the systems,
- files available through the systems, ranging from one or two files to many, as well as combining files for use and permitting user files,
- documents provided or available online, varying from detailed user guides to tailored emails,
- assistance available, including automatic feedback, help desks and user workshops,
- turnaround time for results, ranging from seconds to days,
- hours of availability, ranging from 24 hours a day, 7 days a week (24/7) to office hours only (8/5),
- cost, ranging from zero to membership fees to fee-for-service time, and
- benefits derived from use, including reports and policy implications.

2. Monitored Remote Access in Foreign Agencies

In recent years the European Advisory Committee on Statistical Information in the Economic and Social Spheres (CEIES) has encouraged research in the dissemination of microdata because “significant research can only be undertaken with microdata” (CEIES 2002). The committee also notes that “we are now moving to a situation where there are technological solutions which can produce a ‘virtual safe setting’ over the Internet … an area that needs to be explored as a cheaper and much preferred alternative to [physical] safe settings”. It recommended that Eurostat establish the feasibility of a virtual safe setting (CEIES 2002, pp. 2 and 3).

CEIES holds periodic conferences on dissemination of microdata and confidentiality that encourage cooperation. Knowledge and research are shared among European and other countries, such as Canada, Australia and the US, and contribute to efforts to build and utilize remote access systems. Given encouragement by Eurostat and the example of the Luxembourg Income Study, several European countries, Canada and Australia have begun to use or test remote access to restricted microdata.

Luxembourg Income Study (LIS)

The Luxembourg Income Study is an international program that makes microdata from 66 household income surveys available for research from the 25 countries that participate in the program. The files may be restricted use or public use depending on the country. The Current Population Survey is used for the US. The program began in 1983 and utilizes the Luxembourg Income Study System (LISSY) to disseminate data. LISSY is an independent system that can be and is being used by other programs. The programs include the Luxembourg Employment Study (LES), the DIW (Deutsches Institut für Wirtschaftsforschung) in Berlin for the German household panel, and EUROSTAT/London School of Economics for remote access to EUROSTAT data (Cigrang 2003). This section of the paper covers the use of LISSY in the LIS.

LISSY was designed to handle statistical software programs, thus allowing users to write their own programs. Today LISSY can accept SAS, SPSS and STATA programs. LISSY began accepting and returning programs from users on diskette in 1983. In 1987 it used email on the European Academic Research Network (EARN/BITNET), allowing users with access to EARN/BITNET to email their programs to the system. Email on the Internet was used in LISSY version 4.
The heart of LISSY is the mail router (Post Office). The Post Office

- “retrieves the email requests from the mail server,
- prepares these requests for processing by checking for all security issues like clearly identifying a user, checking for the use of illegal statistical commands, and checking for the usage of sequences of commands or variables or any other combinations not allowed,
- returns any job that breaches security to the sender along with an error message explaining the violation,
- distributes the requests to the batch processor computers,
- examines the output file size and contents for output acceptable to security rules,
- returns acceptable statistical results to the proper (registered) user email addresses,
- sends suspicious output to the review queue for manual review instead of returning results to the user, and
- maintains critical [...] needed for the overall operation” (Cigrang and Schouten 2003, p. 8).

LISSY can use public use files and restricted files. It accesses the standard geographic areas that are defined in the program files, ranging from national to subnational areas, and has no facility for generating user-defined areas. Although it generates usage logs, it does not automatically check multiple outputs for complementary disclosure. Cigrang and Schouten examined some research on automatic checking for complementary disclosure and found that “complete evaluation of multiple queries may be too complex, time consuming or restrictive to implement” (Cigrang and Schouten 2003, p. 11).

The advantage of LISSY is that it allows users to submit their own programs using familiar, well-known statistical packages. Users can be supplied with synthetic or dummy files to test the syntax of their programs. They may obtain every type of analysis that the packages render within the bounds of the security rules defined for the program. For example, in LIS all types of analysis are acceptable except extracts. However, the LIS query filter does not permit certain commands, word sequences and variables.

In LIS, users must submit a research proposal and sign a contract as well as a confidentiality pledge. Users are primarily academic researchers. There is no cost to the user, but only users from participating countries that pay an annual fee may use the data. The program has income files from 25 countries. User files are not accommodated. Documentation is available for key variables that have been made as comparable as possible by the program. LIS sponsors workshops and has a full-time person on a help desk to assist users. LISSY output can be returned within minutes depending on the size of the programs submitted, the number of programs being submitted at the same time and the number of servers online. The system is available 24 hours a day, 7 days a week.
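The Post Office checks quoted earlier in this section can be sketched roughly as follows. This is an illustration only: the registered address, the list of prohibited commands and the output size limit are assumptions made for the example, since the actual LISSY security rules are not given in the sources cited.

```python
import re
from dataclasses import dataclass

# All three values below are hypothetical; the real LISSY rules are not published here.
REGISTERED_USERS = {"researcher@example.org"}
PROHIBITED_COMMANDS = {"print", "list", "export"}
MAX_OUTPUT_LINES = 500


@dataclass
class Decision:
    action: str  # "reject", "run", "return" or "review"
    reason: str


def check_request(sender: str, program: str) -> Decision:
    """Query filter: identify the user, then look for prohibited commands."""
    if sender not in REGISTERED_USERS:
        return Decision("reject", "sender is not a registered user")
    words = set(re.findall(r"[a-z]+", program.lower()))
    blocked = sorted(words & PROHIBITED_COMMANDS)
    if blocked:
        # The job goes back to the sender with a message explaining the violation.
        return Decision("reject", "prohibited command: " + ", ".join(blocked))
    return Decision("run", "passed query filter")


def check_output(output: str) -> Decision:
    """Results filter: unusually large output is held for manual review."""
    if len(output.splitlines()) > MAX_OUTPUT_LINES:
        return Decision("review", "output is unusually large")
    return Decision("return", "passed output filter")
```

A simple word match of this kind is crude, because it would also reject a harmless program that happens to use a blocked word as a variable name. That trade-off may be one reason the systems reviewed combine automated filters with manual review.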
The LIS has a 20-year history of data analysis. Recent statistics show that from January 1, 2001 through June 30, 2003, 213 users submitted 36,280 programs on average per year. The highest usage was by the US, the UK and Germany. US researchers alone submitted 10,047 programs, or 28 percent of the average number of programs per year, during that period. The number of LIS working papers published per year has grown from around six in 1985 to over 45 in 2002, for a total of approximately 350 working papers (LIS 2003).

Four fifths of the papers written are on inequality of income and poverty. “There was a slight preponderance of poverty issues in the early 90s and of inequality issues before and thereafter” (Forster and Vieminckx 2003, slide 2). Groups at risk are a main concern and have made up around 30 percent of the studies since the inception of the LIS. There has been an increasing focus on families and children at risk compared with studies of the elderly since the early 1990s. Analysis of comparative trends among countries made up just under 30 percent of the topics studied in the late 1980s but dropped to under 20 percent in the early 2000s. The LIS has contributed to four major fields of study in its 20-year history:

- “refinement of the income concept,
- proliferation of equivalence scales,
- conceptualization and measurement of income inequality and poverty, and
- proper identification of international rankings and trends” (Forster and Vieminckx 2003, slide 7).

Statistics Canada

Statistics Canada began to consider the use of remote access for research on restricted microdata in the 1990s because researchers had complained about the lack of detail in the Public Use Microdata Files (PUMF) and the inability to produce exact variances for their analyses. “Remote access entails researchers e-mailing their analytical programs to Statistics Canada where they are run on survey master files residing within the Agency's secure internal network. Researchers do not have direct access to the confidential survey microdata. The program outputs are vetted for confidentiality before being e-mailed back to the researchers” (Tambay, Goldman and Potter 2003, p. 7).

Statistics Canada began offering remote access to a small number of surveys, some of which do not produce a PUMF. Disclosure control varies by survey and by whether or not a PUMF is available for a survey. If a PUMF is available, unit-level data, minimum and maximum values, location of sample units or clusters and anecdotal information about respondents are not released. Statistics and cell values in tables must be based on a minimum number of cases. Guidelines for the creation of synthetic files given to users for testing have been established.

Researchers are required to contact Statistics Canada to create the files for use.
Registered users may submit their programs through email using software or files such as SAS, SPSS, STATA, Foxpro and ASCII. The programs and files users may submit vary according to the survey data being offered. For some surveys, researchers are provided with documentation on the description of the survey, the record layout and a dummy (synthetic) file for testing programs. For others they may submit requests without knowing the internal structure of the file (Tambay, Goldman and Potter 2003).

The actual process of receiving programs and vetting results is fairly manual. Statistics Canada executes the programs and reviews outputs for disclosure avoidance before they are returned to the user in one to two working days. The work is done during normal working hours.

The longitudinal National Population Health Survey (NPHS) has the highest use of remote access to date. The NPHS covers health status, use of health services, risk factors and demographic and socio-economic status. “In 2001 and 2002 the number of programs received averaged 99 and 40 per month, respectively” (Tambay, Goldman and Potter 2003, p. 8). Statistics Canada provides dummy files with no analytical use for researchers to test their programs. Workshops are provided for users interested in using the data and macros are provided for variance estimation. There is a fee for remote access to the NPHS microdata.

According to a presentation on NPHS research findings, there were 242 articles from NPHS research in 91 journals. Fifty-nine grants based on the research were identified (Hamilton and Humphrey 2002). The presentation did not distinguish among the media used to access the data, which include published reports and public use microdata files in addition to remote access. Important areas of research include cancer and diabetes prevalence and utilization of health services.

The techniques used in the Survey of Labour and Income Dynamics (SLID) are different from those used in the NPHS. The SLID is the “first Canadian household survey to provide national data on the fluctuations in income that a typical family or individual experiences over time which gives greater insight on the nature and extent of poverty in Canada” (http://www.statcan.ca/english/sdds/3889.htm). The SLID data retrieval system permits users to create results from a single screen without knowing the structure of the database. They may select variables and create both longitudinal and cross-sectional data sets that can be used by their preferred analytical software. Between May 2002 and July 2003, 160 requests for SLID had been submitted remotely.

The principal research areas of the SLID are:

- Employment and unemployment dynamics
- Life-cycle labor market transitions
- Job quality
- Family economic mobility
- Dynamics of low income
- Life events and family changes
- Educational advancement and combining work and school.

(http://www.ciqss.umontreal.ca/Documents/acetat_e_2002.ppt)
Statistics Canada has cooperated with universities to make researchers aware of what is available from both PUMFs and remote access to restricted microdata. The Data Liberation Initiative (DLI) improved access to Canadian data in the universities. The British Columbia Interuniversity Research Data Centre is an example of a university center that provides assistance to researchers. University libraries in Quebec pooled their resources to render PUMFs more usable through the Sherlock system. “SHERLOCK was developed mainly for members of the Quebec academic community to enable them to access and utilize the survey microdata of the DLI and the Inter-university Consortium for Political and Social Research (ICPSR)” (Drolet 1999, p. 15). Although this cooperation facilitates data usage, the restricted access microdata files are kept in Statistics Canada, and researchers must apply to Statistics Canada for remote use of restricted access microdata and pay a fee.

Statistics Denmark

Statistics Denmark began allowing remote access in March 2001 after completion of a successful pilot that began the year before (Anderson and Thygesen 2003). The experience using remote access from its inception to date has yielded good results with no breaches of confidentiality. Therefore, Statistics Denmark will eventually replace on-site access to restricted microdata with remote access.

Users of the remote system may submit SPSS, STATA and GAUSS programs and may work with the data freely, creating new data sets from the original data sets. All data processing is done at Statistics Denmark and researchers may not download or print data sets or data extracts. Users communicate with the system through the Internet and outputs are returned using email. Output is examined by Statistics Denmark staff and must be aggregated enough to avoid disclosure of information on individuals and enterprises (Anderson 2003).

Remote access is granted by Statistics Denmark only to authorized institutional environments, on a need-to-know basis, for specific projects. Government ministries, research institutions, universities and nongovernmental organizations in Denmark are the types of environments approved. Access is not given to individuals, and foreigners may have access only if residing temporarily in an authorized institution in Denmark. From March 2001 to March 2003, Statistics Denmark gave forty-three authorizations.
Most data files made available to researchers are register-based samples that cover labor market research, sociology, epidemiology and business economics. Statistics Denmark has created a number of research databases that link information from several individual registers because users often need linked information. The databases include the Demographic Database, the Fertility Database, the Prevention Register (health data), the Social Research Register and others. The most popular database is the Integrated Database for Labor Market Research, developed over a span of 9-10 years. Research institutions may also pay for the creation of databases (Anderson 2003).

Statistics Netherlands

Statistics Netherlands initiated a pilot in 2002, with the Dutch Ministry of Social Affairs and Employment as the principal user, to evaluate the feasibility of a remote access facility. Based on the pilot, the system was made available to all Dutch government ministries in 2003.

During the pilot, the Ministry of Social Affairs and Employment submitted queries by email in SPSS. The Ministry was given access to a microdata file with over a million records with information on social allowances from 1997-2000. A sample of the data was prepared for users to test their program syntax and become familiar with the variables available. All SPSS commands were accepted but extracts of individual records were not permitted.

The pilot was purposely simple. E-mails were received and acknowledged by phone. Statistics Netherlands processed the SPSS programs and reviewed all output manually to determine the kinds of queries that were submitted and how they could be controlled for disclosure limitation. Output was returned to the users by email. Findings from the pilot will be used to implement automated filters, with emphasis on output filters. Disclosure rules and filters may be developed based on the types of variables requested, the sensitivity of certain variables, the possibility for identifying subpopulations, and the most recurrent types of analyses (Schouten and Jonker 2003).

The development of the output filter will be gradual, starting with simple frequencies and contingency tables. Dutch legislation and Statistics Netherlands policies are being reviewed to determine what output is not allowed (Schouten 2003).

Statistics Netherlands may extend access to the system to nongovernmental researchers. It hopes to implement other statistical software in addition to SPSS, to automate query and results filters and to construct log files to evaluate disclosure.

Australian Bureau of Statistics (ABS)

The Australian Bureau of Statistics divides the means used to disseminate microdata into eight categories. Although some of these categories provide “safe data”, the categorization recognizes that microdata is the source for all statistical output. The eight categories also provide an excellent framework for NSOs to consider the options for disseminating microdata.
According to Dennis Trewin, the Australian Statistician, the eight categories are:

“1. Standard Statistical Outputs: The release of statistical outputs, usually in the form of tables, in printed and/or electronic form…

2. Data cubes: The release of detailed statistical matrices that have already been confidentialised. It is a more appropriate form of release when confidentiality protection can be automated, particularly for small cells (eg. population census)…

3. Special Data Services: The release of statistical outputs, not necessarily tables, at the request of researchers…

4. Confidentialized Unit Record Files (CURFs): The release of microdata files on a CDROM which have been amended so that the identification of an individual person or organisation is unlikely…

5. Remote Access Data Laboratory (RADL): Running jobs submitted by authorised users via the internet against CURFs held at the ABS, and returning analysis results after largely automated confidentiality checks…

6. ABS Site Data Laboratory: Similar to RADL except that no downloading of unit record data is available (this is possible in RADL for up to 30 records to support outlier detection, etc)…

7. Collaboration: Working collaboratively with a researcher to produce an output (often a published output) of relevance to the ABS…

8. In-house Analysis: The ABS can engage persons as ‘officers’ if they are undertaking functions to support the ABS in its activities. In these situations they can access unit record data although subject to the same secrecy provisions of other ABS officers…” (Trewin 2003, quotes throughout paper).

The key areas of future development for ABS dissemination of microdata are categories 1, 5 and 7.

Category 5, the RADL, became available in April 2003 after it was developed by a special project team. The system will be modified and more files made available over time. The system will permit users to submit SAS and SPSS programs through e-mail. Output from the system will be reviewed by the ABS for disclosure avoidance, and automatic triggers will be used to identify output that requires more thorough inspection. Downloading of unit data is possible for up to 30 records to support outlier detection. Usage logs for confidentiality review will be kept. There will be sanctions against offenders. The ABS will encourage use of RADL when users want linked files and the data matching risk is present. The Information Services Division of ABS will maintain the system. Because the system is new, the ABS provided no information on users and usage of the system.

Statistics Sweden

Statistics Sweden currently does not have ongoing access to microdata. However, it is exploring the feasibility of implementing a system similar to that of Statistics Denmark.
In the feasibility study, users are able to submit programs by email and obtain the results by email. The results must be in the form of tables (Nordback 2003). The user logs in from a predefined domain using an encrypted name and password. Access to data is subject to confidentiality procedures and clearance by Statistics Sweden. The feasibility study tested the CITRIX system in combination with RSA software and boxes. Users taking part in the study are researchers, users of regional statistics and users from other public authorities (Hjelm 2003).

3. Monitored Remote Access Systems in US Federal Agencies

Eight US federal agencies were contacted to determine if they currently have systems for monitored remote access to restricted microdata for external users:

- Department of Agriculture, National Agricultural Statistics Service and the Economic Research Service
- Department of Commerce, Census Bureau
- Department of Education, National Center for Education Statistics
- Department of Energy, Energy Information Administration
- Department of Health and Human Services, National Center for Health Statistics
- Department of Labor, Bureau of Labor Statistics
- Department of Justice
- Department of Transportation, Bureau of Transportation Statistics

Among these eight agencies, three had implemented monitored remote access systems for external users:

- National Center for Health Statistics (NCHS)
- National Center for Education Statistics (NCES)
- Census Bureau (CB)

National Center for Health Statistics (NCHS)

The Analytical Data Research by Email (ANDRE) system provides remote access to virtually all of the surveys sponsored by the NCHS. The Research Data Center (RDC) and ANDRE were created in 1998 to serve data users who need data with smaller geographic areas (e.g., state, county or lower) and other detail not available in the public use files. The effort was spurred on by funding to provide users with contextual data and small geographic areas with direct identifiers removed from the National Survey of Family Growth (Horm 1999). Other important surveys that can be accessed by ANDRE are the National Health Interview Survey and the National Health and Nutrition Examination Survey.
ANDRE allows users to submit SAS programs by email. SAS was chosen because it is widely used by researchers and it lends itself to review by an automated scanning process. Automatic scanning is used as a query filter to examine the input files and suppress or modify certain SAS commands for disclosure avoidance and for ease of automatic output scanning. Commands such as ADD, PRINT, OBS are suppressed and commands such as PROC MEANS, N MEAN STD are modified. Automatic scanning of SAS output wipes out extreme values and suppresses complete output lines with sample sizes less than the minimum standard value (Gambhir and Harris 2003). Although ANDRE is totally automated, questionable output identified in the automated output scan is routed to an RDC staff person for manual resolution. In addition, all of the users' data requests, log files and results are maintained in electronic form.

Users of ANDRE must provide a research proposal to the NCHS RDC. If approved, the user must sign a research affidavit of confidentiality. Approved users must submit a user identification and password to access the system. The researchers may request data from multiple files or have NCHS data merged with their own data files. “In general each dataset is specifically prepared for the user. Such a dataset may include many variables selected from multiple internal data files of NCHS. User supplied data may also be merged. The user owns the dataset prepared for him/her and RDC serves as custodian to the dataset. No user is allowed to access the dataset of any other user” (Gambhir 2003).

All of the microdata files accessed by ANDRE have undergone confidentiality editing. Dummy files are sometimes created so the user can refine the SAS input files, and documents are emailed to the user explaining the system. Personal assistance is also provided if required. The system is available 24 hours a day so users may submit requests at any time; however, output is returned during working hours. The results are returned to the user within a few hours. The fee for use of ANDRE is $500 per month, making its use less expensive than visiting the RDC.

ANDRE has had 45 users and has run 10,000 SAS programs in the last 5 years. The most popular file requested is the National Survey of Family Growth. “The main purpose of the 1973-1995 surveys was to provide reliable national data on marriage, divorce, contraception, infertility, and the health of women and infants in the United States. More than 250 studies in academic journals and NCHS reports have been published using NSFG data. Topics covered by researchers include fertility, family formation, marriage, cohabitation, divorce, contraception, sterilization, unintended pregnancy, HIV/STD and risk behavior, infertility, health, and health services. The National Survey of Family Growth was conducted again in 2002 and 2003 (Surveys of Men and Women, 2002). The interviews include questions on schooling, work, marriage and divorce, having and raising children (including contraceptive use, infertility, and parenting), and related medical care. The first statistical reports and public use data files and documentation should be available in 2004” (http://www.cdc.gov/nchs/nsfg.htm).
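The input scan that ANDRE applies to submitted SAS programs, described earlier in this section, might look something like the sketch below. Only the keywords ADD, PRINT and OBS come from the text; the function name, the statement-level matching and the choice to drop a whole statement are assumptions for illustration, not the NCHS implementation. Splitting on semicolons also ignores quoted strings and comments, which a production scanner would have to handle.

```python
import re

# Keywords named in the text as suppressed; the matching rule itself is assumed.
SUPPRESSED_KEYWORDS = {"ADD", "PRINT", "OBS"}


def filter_sas_program(program: str):
    """Drop statements that use a suppressed keyword; return (filtered, dropped)."""
    kept, dropped = [], []
    for statement in program.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        tokens = set(re.findall(r"[A-Z_]+", statement.upper()))
        if tokens & SUPPRESSED_KEYWORDS:
            dropped.append(statement)
        else:
            kept.append(statement)
    filtered = "; ".join(kept) + ";" if kept else ""
    return filtered, dropped
```

Unlike a filter that rejects the whole job, this approach lets the remainder of the program run, which matches the description of commands being suppressed or modified rather than refused.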
National Center for Education Statistics (NCES)

The Data Analysis System (DAS) provides remote access to Department of Education survey data. The DAS was developed in order to increase access to postsecondary data without having to license each and every data user. Another rationale for the creation of DAS was that NCES established that the amount of categorizing, top and bottom coding, and additional perturbations required to produce public use data files would render much of the content of the postsecondary sample survey data files useless to the average analyst (Seastrom 2003).

The NCES developed the DAS as a CD application in 1987. Users can use the CD at their desks but they cannot transfer files to their hard drives or over the Internet. The data are on the CD, but the system was designed to prevent access to the microdata per se and permits only tabular results.

In 1997 the first DAS web application was designed and deployed. It permitted a download of the DAS software and gave tabular results, but data files could not be downloaded. Users could send their table requests through the Internet or through file transfer protocol (FTP). The requests were processed in about 6 hours and the user could obtain the results in designated “pick-up bins” (Carroll 2003).

The third and current DAS application was designed and deployed in 2003. The application is available in windows and web-based formats. DAS Online is the web version of the DAS. The system allows users to create programming instruction files (DAS files) that specify the information they want to display in a table. There is a separate DAS for each survey dataset, and each one is usually built around an analysis report and includes the major analytical variables published in the reports. Users must learn the programming language used in DAS; however, all DAS applications have consistent interface and command structures (NCES 2003, nces.ed.gov/dasol/index.asp).

The underlying DAS databases must incorporate a series of Disclosure Review Board confidentiality edits. The edits are directed toward outliers, and DAS applications with more capabilities require more perturbation edits. “However, every respondent/item is given a chance of being edited. In order for this to be effective with the least amount of edits, it is critical that how this is done and how much is done be kept strictly confidential” (Seastrom and Kaufman 2003, p. 3). There is a results filter for the tables that suppresses cells with fewer than 30 cases and rows of proportions that have fewer than 30 cases in the denominator. Data files cannot be downloaded.

Aside from the use of the Internet, there have been other major advances in DAS over the years. The system can compute standard errors appropriate to the complex sample designs employed in the postsecondary surveys. It can compute a correlation matrix that can be used as input to run regression analyses, and it allows users to recategorize the variables within the DAS. The batch processing feature of the system permits multiple datasets to be run in one job.

DAS has real-time processing of microdata. The web-based application delivers tables within seconds to minutes over the Internet. The system is available free of charge and unrestricted, 24 hours a day, seven days a week. Help is available on-line and personal assistance is given through email, as required.
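The results filter described above for DAS tables can be illustrated with a minimal sketch. The threshold of 30 cases comes from the text; the function names, the use of None as the suppression marker, and the treatment of a row as a simple list of unweighted counts are assumptions made for illustration and do not describe the actual DAS implementation.

```python
from typing import List, Optional

MIN_CASES = 30  # threshold stated in the text


def filter_counts(row: List[int]) -> List[Optional[int]]:
    """Suppress (replace with None) any cell with fewer than MIN_CASES cases."""
    return [n if n >= MIN_CASES else None for n in row]


def row_proportions(row: List[int]) -> Optional[List[float]]:
    """Return the proportions for a row, or None when the row's
    denominator (the total number of cases) is below MIN_CASES."""
    denominator = sum(row)
    if denominator < MIN_CASES:
        return None  # the whole row of proportions is suppressed
    return [n / denominator for n in row]
```

In this sketch a row with counts 45, 12 and 30 would keep its first and third cells and lose the second, while a row of proportions based on only 25 cases would be withheld entirely.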
Most of the survey files that DAS accesses are for postsecondary education analysis. There are eight of these surveys, including the Baccalaureate and Beyond Longitudinal Study, the Beginning Postsecondary Students Longitudinal Study, the High School and Beyond Longitudinal Study, and the National Postsecondary Student Aid Study. The use of DAS is a requirement in several NCES research contracts. For example, all estimates produced for the Postsecondary Education Descriptive Analysis Reports must come from DAS. Generally, the analyses done are policy related. Examples of reports based on DAS results include:

• How Families of Low- and Middle-Income Undergraduates Pay for College: Full-Time Dependent Students in 1999-2000. This report describes how the families of dependent students used financial aid and their own resources to pay for college, emphasizing variation by family income and type of institution attended.

• What Colleges Contribute: Institutional Aid to Full-Time Undergraduates Attending 4-Year Colleges and Universities. This study provides information about recent trends in institutional aid receipt and then examines the relationship between such aid and the likelihood of recipients staying enrolled in the awarding institution relative to comparable unaided students.

• Characteristics of Undergraduate Borrowers: 1999-2000. The report describes the demographic and enrollment characteristics of these borrowers as well as their risk for not persisting to completion of an educational program and the various types of loans and other financial aid they received.

• Descriptive Summary of 1995-96 Beginning Postsecondary Students: Six Years Later. This report describes the enrollment, persistence, and degree attainment of students who began postsecondary education for the first time in the 1995–96 academic year. It covers the experiences of these first-time beginners over a period of six academic years, from 1995–96 to 2000–01, and provides information about the rates at which students completed degrees, transferred to other institutions, and left postsecondary education without attaining degrees.

Many more examples can be obtained at http://nces.ed.gov/das/reports.

The NCES 1999 Customer Satisfaction Survey is somewhat outdated and precedes the last revision of the DAS. However, it does give an order of magnitude of use of, and user satisfaction with, the system. Overall, 42 percent of NCES’ potential customers were aware of available databases and user tools and 12 percent had used them in the two years preceding the survey. Small percentages avoided user tools because they were too difficult. Four percent of respondents had used DAS applications in the two years preceding the survey and 84 percent of them were satisfied or very satisfied. Depending on the database, two to four percent of the respondents used the databases that have DAS applications and 78 to 91 percent were satisfied or very satisfied with the databases (NCES 1999).
Census Bureau (CB)

The Advanced Query (AQ) system was developed to give users the ability to request the Census 2000 variables they want in tabulations from the one hundred percent and sample microdata files. Remote access to microdata was originally planned as part of the American FactFinder system that disseminates Census 2000 predefined standard summary tables over the Internet. However, the Advanced Query became a stand-alone system that requires user registration. The system was developed and tested for one hundred percent and sample data in 2001 and 2002. It was made available for use in April 2003 to State Data Centers, Census Information Centers and State Legislatures. The Census Bureau will expand the user base on a flow basis and will try to accommodate as many users as possible.

The individual records (observations) in the base files for both the Census 2000 one hundred percent data and sample data are subject to data swapping. The variables are categorized and users may not define their own categories. Sensitive variables dealing with items such as income and costs are top coded (Zayatz, Steel and Rowland 2000).

The system has a query filter that limits what users may select in the user interface. Users may select up to 3 variables for a table. The geographic detail available in AQ is limited to the standardized areas from Census 2000 and is more curtailed than that available from the summary tables in American FactFinder. In AQ the census block group is the smallest area available from the 100 percent data and the tract is the smallest area available from the sample data. Each area selected must have at least 200 people.

The statistical results filter checks for minimum mean and median cell sizes and a minimum percentage of cells with one observation. The minimum values in the results filter are applied to every geographic area requested in the table. If an area fails to meet one or more of the minimum values, the entire area is omitted from the table (Rowland and Zayatz 2001).

The Advanced Query uses commercial business intelligence software that was tailored for use in the system. The system is completely automated and no manual intervention takes place. Logs of each query are maintained, recording the variables and geographic areas requested by each user. The logs are examined periodically to determine what is most often requested and to identify possible disclosure risks. There is no provision for automated complementary disclosure avoidance.
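The two-stage filtering described above (a query filter applied before tabulation and a results filter applied to each geographic area) might be sketched as follows. The limits of 3 variables and 200 people per area come from the text. The numeric thresholds for mean and median cell size are hypothetical placeholders, since the paper does not state them, and rejecting a whole query when one area is too small is also an assumption. The check on the percentage of cells with one observation is left out because the text gives neither its value nor enough detail to implement it faithfully.

```python
from statistics import mean, median
from typing import Dict, List

MAX_VARIABLES = 3      # stated in the text
MIN_AREA_POP = 200     # stated in the text
MIN_MEAN_CELL = 5.0    # hypothetical value, not given in the paper
MIN_MEDIAN_CELL = 3.0  # hypothetical value, not given in the paper


def query_allowed(variables: List[str], area_pops: Dict[str, int]) -> bool:
    """Query filter: at most MAX_VARIABLES variables, and every requested
    area must contain at least MIN_AREA_POP people."""
    if len(variables) > MAX_VARIABLES:
        return False
    return all(pop >= MIN_AREA_POP for pop in area_pops.values())


def filter_results(tables: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Results filter: keep an area's cell counts only if they pass every
    threshold; an area that fails any check is omitted entirely."""
    kept = {}
    for area, cells in tables.items():
        if cells and mean(cells) >= MIN_MEAN_CELL and median(cells) >= MIN_MEDIAN_CELL:
            kept[area] = cells
    return kept
```

Under these assumed thresholds, an area whose cells are 12, 8, 9 and 11 is returned, while an area whose cells are 1, 1, 2 and 30 is dropped: its mean is acceptable, but its median of 1.5 falls below the assumed minimum.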
Users must register with the CB, but there is no cost to use the system. Registered users must log in with a user identification and password. They must learn how to select the geographic areas and variables they want through the web user interface. The results are returned in the form of database tables, and means and medians can be requested with the table. The software allows users to reformat the table in many different ways, and graphic results are available. The results are generated in real time and are returned within seconds to minutes on the web. Tables may be downloaded in several formats and/or printed. The system is available 24 hours a day, 7 days a week.

After the system was developed to access sample data, the Census Bureau tested the system with users primarily from State Data Centers, Census Information Centers, and Census Bureau Regional Offices. The purpose of the test was to determine the utility of the system based on the results the users could obtain with confidentiality filtering. Eighty-two testers produced 1,186 tabulations from Census 2000 sample data. They were asked to fill in evaluation forms to determine the usefulness of the tabulations. They evaluated approximately 370 tabulations.

The objectives of the tabulations were fully met for over half of the tabulations and partially met for 90 percent of them. “The main reason that objectives were not met fully was the confidentiality filters. More than half of the cases specified failure of geographic areas to pass the filters and one third specified the cause of failure to be the subject detail requested. About 20 percent mentioned insufficient subject detail available in the AQ recodes” (Schneider 2002, p. 2). Users expected to use the results of the tabulations for research, to plan or evaluate programs, to define needs, to apply for funding and to implement programs.

The testers gave many examples of the expected use of the tabulations. Most have a direct or indirect government purpose:

Examples of expected uses for research

• “We are conducting an immigration study for Kentucky … to tell what parts of the state had larger numbers of non-citizens by specific demographic characteristics.”
• “… determine the tax burden on homeowners by comparing real estate taxes paid to mortgage payments and income.”
• “… identify level of need for basic water services in the US. Also, to identify range and variance in water rates in US counties.”
• “… prepare a report on the needs of special populations in New York.”
• “… assessment of the percentage of their salary that NYC citizens are using to cover rent.”

Examples of expected uses for planning or evaluating programs

• “… (answer) policy questions about changes in welfare programs and the types of jobs the working poor have, with a focus on women.”
• “… better understand … the economic characteristics of non-English speakers … to develop and plan programs to match employer needs …”
• “… plan and evaluate programs for the elderly and identify areas of the state (PA) where there may be special needs.”
• “… assess the potential effect of legislation on … (participation in) the state’s social services programs.”
• “… provide a benchmark against which to measure what percentage of the target population is reached by programs.”

Examples of expected uses for defining need, applying for funding

• “Used by local government to assess number of householders 65 and above still having both a mortgage and a second mortgage.”
• “… identify need for low-income energy programs.”
• “… advocate for more ESL training for students …”
• “… determine if workers in selected core occupations are able to locate affordable housing close to their place of work in major cities.”
• “… understand the income distribution of households in which grandparents are responsible for minor children …”

Examples of expected uses for implementing programs

• “… identify the characteristics of the unemployed population to provide targeted job training, information, and services.”
• “… implement a program for mammography screening for women 50-64 who are in poverty.”

• “… estimate the number of income-eligible families for federal food stamp program in Minnesota in order to better target outreach to un-served eligible families.”
• “… need of county courts in selecting jury pools which include Hispanics.”
• “… analyze whether they meet affirmative action standards.”

(Schneider 2002, anonymous quotes from testers, pp. 7-9)

In summary, the test evaluation report found that “testers included the Census Bureau’s major intermediaries in distributing data, all of which are experienced in providing data to the ultimate users. Based on testers’ comments, the AQS was efficient and friendly enough for these experienced users but could present difficulties to users who are not well versed in the use of census data” (Schneider 2002, p. 1).

There are currently over 500 users registered to use the AQ system. The table below shows the usage statistics since May 1, 2003.

Advanced Query (AQ) Usage Statistics by Month

Month 2003    Number of users    Number of tabulations
May                  72                  886
June                 54                  947
July                 75                  611
August              119                  762

Source: Census Bureau 2003

4. Sample of Research Projects in the US

There are a number of ongoing research projects in the US that have made contributions to practical applications or that collaborate with the Federal government to make microdata available remotely. These projects may have an impact on a wide spectrum of topics ranging from methodology, to hardware, to archiving and dissemination of microdata. Some use only restricted microdata and others do not. Many of the projects are funded by federal government agencies, the National Science Foundation, the National Academy of Sciences, the National Institutes of Health, etc. The examples reviewed below are a few projects selected because they touch on different aspects of remote access with implications for future development of remote access sites.

Digital Government

Continued research in monitored remote access to restricted microdata is important for future development of such systems by NSOs. Researchers from various institutions led by the National Institute of Statistical Sciences (NISS) have undertaken research in a number of areas of interest such as data swapping, confidentiality of tabular data, and web-based systems that disseminate data and protect confidentiality (www.niss.org/dgii/techreports.html).

Researchers from NISS developed a prototype web system for the National Agricultural Statistics Service (NASS) of USDA to disseminate survey data on usage of fertilizers on farms. The principal purpose of the research was to develop a methodology to disseminate data for smaller geographical levels than states (preferably counties) and still protect the confidentiality of the farms in the survey.

The data consisted of almost 200,000 records of average fertilizer use per acre from 30,500 farms for the years 1996-1998. The goal was to allow dissemination of data at the county level, but data for half of the counties in the US were not eligible for disclosure using the prescribed confidentiality rules. Therefore, the researchers developed a methodology to aggregate geographic areas to the extent that data could be disseminated within the confidentiality rules. “Undisclosable counties are merged with neighboring counties (in the same state) to form disclosable ‘supercounties’” (Karr and Sanil 2001, p. 1).

The system can be accessed through a web browser that allows the user to select the state, the crop and the fertilizer type desired in the output. The output is in the form of a list of supercounties derived from the aggregation algorithm, with related fertilizer application rates by crop by year and a list of component counties. The output can be shown on a map or in tabular form. A history of queries is kept in a database that can be used to monitor complementary disclosure.

The NASS prototype developed by NISS tackles both automatic aggregation of small geographic areas to avoid disclosure and examination of complementary disclosure from previously answered queries. These two automated techniques have thus far not been included in the remote systems examined in this paper because of funding and feasibility constraints. The NASS would like to implement the system pending availability of funding.
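The idea of merging undisclosable counties with neighbors in the same state can be illustrated with a simple greedy sketch. This is not the aggregation algorithm of Karr and Sanil (2001), whose details are not given here; it is a stand-in heuristic. The disclosure rule used (a minimum number of farms per published area), the threshold value, and all names are assumptions made for illustration only.

```python
from typing import Dict, FrozenSet, List, Set

MIN_FARMS = 3  # hypothetical disclosure threshold, not taken from the paper


def build_supercounties(
    farms: Dict[str, int],
    neighbors: Dict[str, Set[str]],
    min_farms: int = MIN_FARMS,
) -> List[FrozenSet[str]]:
    """Greedily merge each undisclosable group of counties with its
    smallest adjacent group until every group meets the threshold or
    has no neighbors left to merge with.

    farms:     county -> number of sampled farms in that county
    neighbors: county -> adjacent counties in the same state
    """
    # Every county starts as its own group; merged counties share one set.
    group: Dict[str, Set[str]] = {county: {county} for county in farms}

    def size(members) -> int:
        return sum(farms[c] for c in members)

    changed = True
    while changed:
        changed = False
        # Visit counties in smallest groups first, by name for determinism.
        for county in sorted(farms, key=lambda c: (size(group[c]), c)):
            current = group[county]
            if size(current) >= min_farms:
                continue  # already disclosable
            adjacent = {
                frozenset(group[n])
                for member in current
                for n in neighbors.get(member, ())
                if n not in current
            }
            if not adjacent:
                continue  # isolated group: cannot be made disclosable
            target = min(adjacent, key=lambda g: (size(g), sorted(g)))
            merged = current | set(target)
            for c in merged:
                group[c] = merged
            changed = True
            break  # group sizes changed, so re-sort and start over
    return sorted({frozenset(g) for g in group.values()}, key=sorted)
```

With four counties in a row holding 5, 1, 1 and 4 farms and a threshold of 3, the sketch leaves the first county alone and merges the other three into one supercounty. Merging with the smallest adjacent group is one of several possible design choices; it tends to keep supercounties small, which matches the stated goal of publishing at the finest geography the rules allow.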
As part of the Digital Government program, researchers from Carnegie Mellon reviewed the confidentiality protection in the Advanced Query of American FactFinder (Duncan, Roehrig, and Kannan 2000). The examination of confidentiality protection in the system

• compared the disclosure limitation used with that of agency best practice,
• determined if confidential data could be inferred using nonconfidential data outside the system or from data within the system itself, and
• assessed whether results from the system could be compromised using modern record linkage techniques.

Carnegie Mellon recommended a mechanism for controlling complementary disclosure from repeated queries through the use of a linear programming method to check for confidentiality before allowing queries. The method was too complex to implement.

Cornell Restricted Access Data Center (CRADC)

CRADC was created at the Cornell Institute for Social and Economic Research (CISER) to give selected external researchers access to restricted use microdata, research tools and inference-valid simulated microdata files under development. CRADC uses a windows user interface that shows the files available according to the access rights of the user. CRADC makes available research tools such as EXCEL, SAS, STATA, Matlab, Fortran V6, GLIM, Genstat, Gauss, etc. Available data sources include the Longitudinal Employer-Household Dynamics data, which can be accessed only by state partners participating in the project.

Inference-valid simulated data files will be accessible through CRADC. Researchers will create inference-valid simulated microdata files by scientifically producing replacement values for actual microdata. The simulated microdata files are meant to yield useful analytical results. Research will be carried out to determine whether the simulated microdata yield valid results when compared to the actual microdata. A working group will prepare the inference-valid simulated files and compare results to the actual microdata using the Survey of Income and Program Participation as a test database (Lane 2003).

Integrated Public Use Microdata Series (IPUMS) – International

IPUMS-International was developed by the Minnesota Population Center to “inventory, preserve, harmonize, and disseminate census microdata”. Data from seven countries are available (China, Colombia, France, Kenya, Mexico, the United States, and Vietnam, between 1960 and 2000) and data from Central and South America will be available soon. The census data are microdata samples and most files are public use files. Users must apply for access and sign an authorization form agreeing to abide by regulations for using the data. The use of the data is restricted to scholarly and educational purposes (http://www.ipums.org/international/release_dates.shtml).
Internet2 and Next Generation Internet

Over two hundred universities are participating in a nationwide project known as Internet2 in collaboration with industry and government. Over one hundred of the most important computer and telecommunications corporations are involved. “The primary goals of Internet2 are to:

• Create a leading edge network capability for the national research community
• Enable revolutionary Internet applications
• Ensure the rapid transfer of new network services and applications to the broader Internet community” (http://www.internet2.edu/about/).

Internet2 has developed and deployed a “10-Gigabit-per-second national backbone supporting high-performance connectivity and Internet innovation within the US research university community” (http://www.internet2.edu/about/).

The Next Generation Internet (NGI) initiative is a multi-agency Federal research and development program that developed advanced networking technologies and applications demonstrated on testing environments that are 100 to 1,000 times faster than previous capabilities. Its goals, similar to those of Internet2, have been completed. Federal agencies are currently coordinating advanced networking research programs under the Large Scale Networking (LSN) Coordinating Group (http://www.ngi.gov/).

USGenWeb

USGenWeb is a volunteer supported effort to present actual transcriptions of public domain records on the Internet. The files contain historical census records, marriage bonds, wills, and other public documents organized by state. It represents an effort to make microdata available for genealogy research to the general public. Although such files may not be considered to be statistical data by some, they are a type of microdata of great interest to the general public. The effort reflects the importance of the Internet for disseminating all types of microdata remotely and reflects the fact that the general public, not just sophisticated researchers, want and have access to remote access technology and microdata (http://www.rootsweb.com/-usgenweb/).

5. Conclusions

Remote access to restricted microdata is still rarely implemented by NSOs due to their desire to protect confidentiality and the difficulty and expense of implementing systems that disseminate results with adequate disclosure avoidance. Some of the problems, such as protecting the confidentiality of single outputs, have been tackled more successfully than others, such as avoiding complementary disclosure.
Generally speaking, researchers have to date used systems for remote access to restricted microdata much less than they have used PUMFs on CDROM and RDCs. This is because users are more familiar with CDROM and RDCs and are more comfortable using them. Additional reasons for less use of remote systems are that NSOs purposely restrict access to the systems and that the data accessed by the systems may have less detail or more distortion than that available in an RDC.

Researchers, however, continue to express interest in using restricted microdata to obtain the types of analyses necessary for their work. NSOs appreciate these arguments, especially when the results are required by official agencies for necessary policy development and implementation. Therefore, the quest to make restricted microdata available has resulted in research and experimentation with remote access systems. The challenge is to translate research into practical, affordable applications.

NSOs have shown an increased interest in the development of remote access systems for disseminating restricted use microdata in the last few years. The LIS program pioneered the technology with the LISSY in the late 1980s, and most systems developed or under development since then have employed similar methodology. The methodology can be referred to as “remote job execution systems” because the programs are executed off-line, usually in a batch mode that takes anywhere from minutes to days to return results to the user. The amount of time required to return the results is largely determined by the degree of manual intervention used by the NSO to review the results for disclosure avoidance. The user communicates with the systems by sending programs written in popular statistical software and receiving results through e-mail. The advantage of this type of system is that it allows users to work with their favorite research software, permitting them to write their own programs and obtain the results they need within the confidentiality constraints imposed by the system.

A couple of remote access systems used by NSOs permit users to communicate with the system using a web browser or windows application. These systems use custom built or tailored commercial software and deliver tabulations. Users cannot submit their own programs, but the requests are executed while the user is on-line and the results are usually returned within seconds or minutes because there is no manual intervention.

Currently, popular statistical programs such as SAS and SPSS are not web-enabled. Users may submit programs through email but cannot use them with a web browser. It is possible that such software will become web-enabled in the future, thus making its use through a web browser possible. Business intelligence software that is web enabled will increase its use of statistical measures, thus becoming more like the favored statistical packages. Both of these possibilities point toward the merging of email and web systems.
Commercial software programs, be they statistical or business intelligence packages, in some cases have or will have the capacity to intercept queries going into and results coming out of the systems, thus providing automatic query and results filtering. Programming is then required only to create the disclosure avoidance rules desired by the NSO.

The use of research results in practical applications is evident, and the cumulative effect is important. Continued research by statisticians is necessary to resolve the remaining problems of automating complementary disclosure avoidance and developing aggregated and/or user defined areas. The conduits for such research are often the mathematical statisticians from the NSOs who work closely with statisticians in universities. NSOs that have developed systems and test systems have gained and shared experience that is necessary to keep up with modern methods and technology. Existing systems contribute to the body of knowledge about remote access that can be drawn upon to build future systems.

As time goes on, developments in technology and familiarity with Internet applications among users will continue to grow, preparing fertile ground for applying research to further development of systems. Progress is inevitable due to advances in computer hardware, software, and Internet communications. Faster database and processing software and cheaper hardware will make it more feasible to implement the research. NSOs and researchers will continue to reach compromises on what data can be made available according to the laws of each country. Hopefully, advanced methodology and technology will improve the ability of NSOs to make restricted data available for research with the confidence that confidential data will not be released.

Each generation of users will become more sophisticated in the use of Internet applications. Users have adapted as the media for data dissemination have gone from printed tables, to tapes, to diskettes, to CDROM, to remote access. Each decade’s new medium becomes the next decade’s most popular medium. As it was with public use files on CDROM, so it may be with monitored remote access to restricted data.

References

Anderson, Otto (2003) ‘From On-Site to Remote Access - The Revolution of the Danish System for Access to Microdata’, Paper presented at the joint ECE/Eurostat work session on statistical data confidentiality, Luxembourg.

______ and Lars Thygesen (2003) ‘The Danish System for Access to Microdata; from on-site to remote access’, Paper presented at Swedish Workshop on Microdata, Stockholm, www.micro2122.scb.se/papers.asp.

Berker, Ali and Susan Choy (2003) ‘How Families of Low- and Middle-Income Undergraduates Pay for College: Full-Time Dependent Students in 1999-2000’, National Center for Education Statistics Reports, http://nces.ed.gov/pubsearch.

Berkner, Lutz, Shirley He, and Emily Forrest Cataldi (2003) ‘Descriptive Summary of 1995-96 Beginning Postsecondary Students: Six Years Later’, National Center for Education Statistics Reports, http://nces.ed.gov/pubsearch.

Carroll, Dennis (2003) Emails from Carroll to Rowland.

CEIES (2002) ‘Opinions of the European Advisory Committee on Statistical Information in the Economic and Social Spheres (CEIES)’, 19th Seminar on Innovative Solutions to Providing Access to Microdata, Lisbon, Contributed paper submitted by Eurostat, www.unec.org/stats/documents/2003.04.confidentiality.htm.

Cigrang, Mark (2003) Emails from Cigrang to Rowland.

______ and Barry Schouten (2003) ‘Remote Access Systems for Statistical Analysis of Microdata’, Methods and Informatics Department, Statistics Netherlands.

Clinedinst, Melissa E., Alisa F. Cunningham, and Jamie P. Merisotis (2003) ‘Characteristics of Undergraduate Borrowers: 1999-2000’, National Center for Education Statistics Reports, http://nces.ed.gov/pubsearch.

Drolet, Gaetan (1999) ‘Sherlock: A Web Magnifying Glass for Microdata Files’ in International Association of Social Science Information Service and Technology Quarterly, Summer Issue, pp. 15-18.

Duncan, G., Roehrig, S., and Kannan, K. (2000) ‘Final Report on the American FactFinder Disclosure Audit Project for the US Census Bureau’, prepared under contract to the US Census Bureau.

Forster, Michael and Koen Vieminckx (2003) ‘Inequality and Poverty Contributions of LIS’, presented at the Luxembourg Income Study 20th Anniversary Conference, Luxembourg.

Gambhir, Vijay (2003) Emails from Gambhir to Rowland.

______ and Kenneth Harris (2003) ‘CDC/NCHS Data Center’, Powerpoint presentation given at US Bureau of Transportation Statistics Seminar Series on Confidentiality, Washington DC.

Hamilton, Elizabeth and Chuck Humphrey (2002) ‘Data Access and Data Use: The Missing Link’, http://admin.acadiau.ca/library/DLI2003/session%201.2_pumfs/pumfs%20pumped.ppt.

Hjelm, Claus-Goran (2003) ‘Remote Access to Microdata at Statistics Sweden’, paper presented at the Swedish Workshop on Microdata, Stockholm, www.micro2122.scb.se/papers.asp.

Horm, John (1999) ‘National Center for Health Statistics Approaches to Protection and Release of Microdata’, contributed paper for Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Thessaloniki.

Horn, Laura and Katharin Peter (2003) ‘What Colleges Contribute: Institutional Aid to Full-Time Undergraduates Attending 4-Year Colleges and Universities’, National Center for Education Statistics Reports, http://nces.ed.gov/pubsearch.

Karr, Alan F., and Ashish P. Sanil (2001) ‘Web-Based Systems that Disseminate Information but Protect Confidentiality’, dg.o 2001: Proc. First National Conference on Digital Government Research, pages 159-166. Digital Government Research Center, Marina del Rey, CA.

Lane, Julia (2003) ‘Synthetic Data and Confidentiality Protection’, paper presented at the Swedish Workshop on Microdata, Stockholm, www.micro2122.scb.se/papers.asp.

Luxembourg Income Study (2003) Summary of Numbers of Users and Jobs Submitted to LIS by Country, 2001–2003, Luxembourg.

______ (2003) Number of Working Papers Published per Year 1985-2002, Luxembourg.

Nordback, Lars (2003) Email from Nordback to Rowland.

Rowland, Sandra and Laura Zayatz (2001) ‘Automating Access with Confidentiality Protection: The American FactFinder’, in Proceedings of the Social Statistics Section, American Statistical Association, Alexandria.

Schneider, Paula J (2002) American FactFinder Advanced Query System - Assessment Report on Stage Two (Sample File) Beta Testing, prepared under contract to the US Census Bureau, Washington, DC.

Schouten, Barry and Jan Jonker (2003) ‘Remote Access at Statistics Netherlands’, paper presented at the Swedish Workshop on Microdata, Stockholm, www.micro2122.scb.se/papers.asp.

Seastrom, Marilyn (2003) Email from Seastrom to Rowland.

______ and Steven Kaufman (2003) ‘NCES Disclosure Risk Procedures’, paper presented at the American Statistical Association Meetings in San Francisco.

Statistics Canada (2002) Survey of Labor and Income Dynamics (SLID) Workshop Presentation, Montreal, http://www.ciqss.umontreal.ca/Documents/acetat_e_2002.ppt.

Tambay, Jean-Louis, Gustave Goldman and Gerry Potter (2003) ‘Providing Researcher Access to Data for Analysis at Statistics Canada’, paper presented at the Swedish Workshop on Microdata, Stockholm, www.micro2122.scb.se/papers.asp.

Trewin, Dennis (2003) ‘Access to Microdata – Issues, Organization and Approaches’, paper presented at the Conference of European Statisticians, Geneva.

US Department of Commerce, Census Bureau (2003) Advanced Query User Guide for 100 Percent Data and Advanced Query User Guide for Sample Data, Washington, DC.

US Department of Education, National Center for Education Statistics (2003), NCES Handbook of Survey Methods, Appendix C: Web-based and Standalone Tools for Use with NCES Survey Data, http://nces.ed.gov/pubsearch.

______ (1999) NCES Customer Satisfaction Survey Report, Section IV. Questions about NCES Databases and User Tools, http://nces.ed.gov/pubsearch.

Zayatz, Laura, Philip Steel, and Sandra Rowland (2000) ‘Disclosure Limitation for Census 2000’, in Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria.
