NSF-Census Research Network Newsletter
Vol. 2, Issue 3

In This Issue
Profile of Carnegie Mellon University
Communicating Uncertainty in Official Statistics
Kristen Olson Named New PI at University of Nebraska
Node News
Fienberg Receives NISS Sacks Award
Publications
Presentations

Research Node Focus: Carnegie Mellon University

Stephen (Steve) Fienberg, Maurice Falk University Professor of Statistics and Social Science at Carnegie Mellon University (CMU), and Bill Eddy, John C. Warner Professor of Statistics (Emeritus) at CMU, have a long-standing interest in everything to do with census taking. They have spent several decades working on statistical issues for the U.S. Census. Fienberg and Eddy each served two terms as Chair of the Committee on National Statistics (CNSTAT) for the National Academy of Sciences’ National Research Council.

In fact, in 1999 Fienberg and Margo Anderson wrote a book, Who Counts?, about the 1990 decennial census and the efforts made to use sampling to adjust census results for the differential undercount. Fienberg’s research has long focused on how to conduct a census, and on why a census is actually a statistical activity. As such, what is the formal way to think about it?

Fienberg is a renowned veteran in census research. He has been studying census surveys since the 1970s, looking at what methodologies one should use in interpreting censuses. As chair of CNSTAT in the ’80s, Fienberg convened a panel that was the trigger to census evaluation. “The first really big census panel on new methodology for census taking which came out in 1985 was done under my auspices,” he said. Almost two decades later, before Eddy was leading CNSTAT, he was a member of another panel evaluating the results of the 2000 census; the panel’s work included elaborate field visits during census taking as well as an assessment of the methodology and how well it worked.

Photo: Bill Eddy and Steve Fienberg. Bill is holding a report from the 9th Census in 1870, and Steve is holding the Encyclopedia of the Census, in which he has an article.

CMU’s node is focusing on three main issues: the cost of census taking, diminishing response rates, and privacy.

“We are looking at things such as record linkage, and thinking about how to benchmark the census and how to decide if something is accurate or not accurate. These are all themes that show up from the work that Bill and I had done over the years,” said Fienberg.

Another important aspect of the CMU node’s work is education and training. Eddy is making projects related to census work available to summer undergraduates. “Rebecca Nugent, who is one of the NCRN team of researchers, oversees the whole statistics undergraduate program which has grown from a rather small number to about 250 majors. They all get exposed in one way or another about issues in methodology in survey and census taking,” explained Fienberg. “We want to get the next generation of statisticians and informed citizens engaged in the importance of these issues.”

One of this year’s summer projects for the undergraduate students was to recreate the website that the U.S. Census will use in 2020 to conduct online surveying. “Using the Internet is a horrible break from tradition that causes the U.S. Census staff immense mental pain because if somebody says they live in a particular place to give their information, they have no way to know if that is true,” said Eddy. Another term for this phenomenon is “non-ID,” because there is no way to know where the person is really located.

The students wanted to see whether the response rate changed when respondents knew up front whether or not the questions were being asked on behalf of a government organization. In about five weeks they built a replica of the U.S. Census’ website for the 2020 Census. Frank McPhillips from the U.S. Census told Eddy that he was flabbergasted at the amount of work the students were able to accomplish in just five weeks.

Fienberg said, “Our graduate students are now moving on in their careers. Mauricio Sadinle just took a position at Duke and NISS as a postdoctoral fellow. Sam Ventura has just joined our department faculty on a multi-year appointment, and Rebecca Steorts who was a postdoc on our project is now on the faculty at Duke.” Eddy and Fienberg hope that everyone involved in their NCRN project will continue to explore census and survey research issues in their new roles.

“There is a giant gulf between the way the agencies go about their business and the ways in which we in the department think about statistical methodology more broadly,” noted Fienberg.

Another area that CMU is focusing on is record linkage, which means taking information from two overlapping files and combining it. The U.S. Census takes a section of a census block, conducts the census again for that block, and then puts the data together to figure out how accurate the census data are. But there are a couple of problems with this methodology. One is that linkage can go wrong in two ways: you may pair up records for different people, or you may fail to match records that actually belong to the same person. The other is that, once the match has been made, most statisticians and practitioners have treated the matching process as if it had been without error. CMU’s team is looking at record linkage from very fresh perspectives, and is especially interested in the propagation of matching uncertainty into subsequent analyses.

The team is also researching how to handle the large amount of data produced by the various censuses and surveys, especially when it comes to record linkage. Take, for example, a file that has 100,000 records in it and another file that has 200,000 records in it, and suppose you are trying to link the two together. You have to match every one of the 100,000 records to every one of the 200,000 records, which is 20 billion comparisons. That takes a lot of computing power. But the U.S. population is over 300 million, so scaling this up to a national level is really hard. “A lot of the activity we have been doing is focusing on how to make record linkage methods scale,” said Fienberg. The team has been using data other than the U.S. Census data to experiment with their methods, because working with the Census data is sometimes tricky. One of the data sets they have used consists of lists of people who have been killed in Syria. There are seven of these lists, and the team has focused heavily on combining four of the seven, linking the multiple files all at once.

Zack Kurtz, in his PhD thesis work as part of the project, worked on multiple recapture estimation motivated by looking at three or more lists. Once you’ve linked them, how do you estimate how many people you didn’t see? This is also known as the problem of multiple systems estimation, and it has been studied in different ways over several decades at CMU. Zack’s work has extended the known approaches in interesting ways of relevance to census work.

The other major area of research at CMU involves privacy. There are two kinds of intrusions. One is someone trying to hack into the files. The second arises from the data that are made available: by doing record linkage, an intruder can gain information about people from a supposedly protected data set.

Alessandro Acquisti, one of the co-principal investigators and a professor of Information Technology, and his former Ph.D. student and postdoc, Laura Brandimarte, have been focused on how people respond to issues and concerns about privacy and confidentiality when they actually interact with a governmental agency. They have done some fascinating experiments on this, particularly on people sharing geography online or in some kind of electronic form. They have been deeply involved with the discussion about online forms. They are also looking at online geographic information and the way in which it could be used to help identify the location of individuals responding to an online Census form. Laura is joining the faculty at the University of Arizona this fall, but will remain a collaborator with the CMU team.

There is another project that the CMU team would like to carry out: taking the Pennsylvania driver’s license records and matching them with the American Community Survey (ACS) and the 2010 U.S. Census data, so they can actually scale up some of the theories they have written. But moving the data set from Pennsylvania to the U.S. Census has been held up for three years, so it is uncertain whether the team will be able to work on this project.

The NCRN team at CMU can be proud that the work they have accomplished over four decades or more will continue to be carried on by the next generation of statisticians, thanks to the great mentoring by this veteran team.
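
The scaling problem in the record linkage example above (a 100,000-record file against a 200,000-record file) can be made concrete with a short Python sketch. It is illustrative only: the profile does not say which techniques the CMU team uses, and the approach shown here, known as blocking, is a common general-purpose way to cut down the number of comparisons, swapped in for illustration. The record fields, the `block_key` rule, and the sample records are all hypothetical.

```python
from collections import defaultdict

def naive_pair_count(n_a, n_b):
    """Pairs to compare when every record in file A is checked against every record in file B."""
    return n_a * n_b

def block_key(record):
    """Hypothetical blocking key (an assumption, not from the article): surname initial plus birth year."""
    return (record["surname"][0].upper(), record["birth_year"])

def candidate_pairs(file_a, file_b):
    """Return only the cross-file pairs whose records share a blocking key."""
    blocks = defaultdict(list)
    for rec_b in file_b:
        blocks[block_key(rec_b)].append(rec_b)
    pairs = []
    for rec_a in file_a:
        # .get avoids creating empty blocks for keys that appear only in file A
        for rec_b in blocks.get(block_key(rec_a), []):
            pairs.append((rec_a, rec_b))
    return pairs

# Made-up sample records, for illustration only.
file_a = [
    {"id": "a1", "surname": "Smith", "birth_year": 1970},
    {"id": "a2", "surname": "Jones", "birth_year": 1985},
]
file_b = [
    {"id": "b1", "surname": "Smyth", "birth_year": 1970},
    {"id": "b2", "surname": "Jones", "birth_year": 1985},
    {"id": "b3", "surname": "Brown", "birth_year": 1990},
]

print(naive_pair_count(100_000, 200_000))          # 20000000000
print(naive_pair_count(len(file_a), len(file_b)))  # 6
print([(a["id"], b["id"]) for a, b in candidate_pairs(file_a, file_b)])
# [('a1', 'b1'), ('a2', 'b2')]
```

In this toy case blocking reduces six comparisons to two. The trade-off is that two records for the same person are never compared if an error in the surname initial or birth year puts them in different blocks, so a blocking rule of this kind can itself introduce missed matches.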