CUCS Welcomes Two New Faculty

NEWSLETTER OF THE DEPARTMENT OF COMPUTER SCIENCE AT COLUMBIA UNIVERSITY
VOL. 11 NO. 1, SPRING 2015

YANIV ERLICH
ASSISTANT PROFESSOR OF COMPUTER SCIENCE
Dissecting the Complex Relationships of Genes, Health, and Privacy

Health and predisposition to disease strongly depend on the genetic material carried deep within the cells of every individual. Increasingly this genetic information, unique to every individual, is becoming public. For Yaniv Erlich, it’s two parts of the complex relationship between genes and health, one increasingly understood through quantitative analysis and computer algorithms.

“Computational methods are necessary at every step in examining the genome,” says Erlich, assistant professor of computer science. “The strings of nucleotides (A, T, G, and C) that form each person’s unique genome are billions of letters long. It’s not even possible to look at these long sequences without a computer. As more genetic data is collected, concepts from machine learning, statistics, and signal processing are needed to detect the subtle variations and statistical patterns that reveal traits and predisposition to disease.”

Erlich is well-positioned for untangling the complex genetic underpinnings of human biology. An initial love of math and a friend’s chance remark that biology entailed a lot of mathematics piqued his interest, and he went on to study both biology and genetics, two sciences increasingly awash in data and in need of algorithms for inferring information from that data. He approached both subjects from a computational perspective.

At MIT’s Whitehead Institute of Biomedical Research, where he headed a research lab from 2010 through 2014, he had the chance to develop and then apply new computational tools to genetics. Results were impressive. One method sequenced tens of thousands of samples at a time. Another harnessed signal processing and statistical learning to extract genetic information from short tandem repeats, or STRs, a fast-mutating fragment of DNA so small it had been mostly neglected by the research community. Both methods contributed new information about how genes operate at the molecular level.

But Erlich was also interested in how genes affect health and traits. Is the relationship linear, where mutations predictably sum to a trait, or is the relationship nonlinear, with mutations interacting unpredictably with one another? The answer required a population-scale analysis, one too large to be constructed using traditional data collection methods. Here, Erlich and his Whitehead colleagues came up with an innovative solution. Using genealogical data already uploaded to social media sites, they created a genealogical tree of 13 million individuals dating back to the 15th century. Such a deep genealogy reveals clusters of genetic variation, with some tied to longevity and some to rare disorders. (And offering as a side bonus an intriguing view into human migration.) Looking at how mutations ripple through populations, researchers can measure the frequency of a certain trait, and thus evaluate the genetic contribution. With a larger tree, even more will become possible.

The promise is great but it all means nothing if there isn’t genetic material to work with.

[...] genetic information donated by research participants and cross-reference it to online data sources. Using only Internet searches with no actual DNA, Erlich and his research team were able to correlate the donated DNA to a surname in 13% of the U.S. population, a result that surprised even Erlich.

“Our study highlighted current gaps in genetic privacy as we enter the brave new era of ubiquitous genetic information,” explains Erlich. “However, we must remember that sharing genetic information is crucial to understanding the hereditary basis of devastating disorders. We were pleased to see that our work has helped facilitate discussions and procedures to better share genetic information in ways that respect participants’ preferences.”

Erlich hopes that ensuring better safeguarding of genetic information will encourage more people [...]

DAVID BLEI
PROFESSOR OF COMPUTER SCIENCE AND STATISTICS
Embracing the Science of Teaching Topics to Machines

Consider the challenge of the modern-day researcher: Potentially millions of pages of information dating back hundreds of years are available to be read from a computer screen. How does a simple Internet search deliver appropriate findings?

The answer is topic modeling, a mathematical approach to uncovering the hidden themes in collections of documents. David Blei, professor of computer science and statistics, led the groundbreaking research that resulted in the development of the Latent Dirichlet Allocation (LDA) model, a topic model capable of exploiting hidden topics and semantic themes among countless lines in billions of documents via language processing. By co-developing the LDA, Blei effectively influenced the research and development of topic models and machine learning.

“We were fortunate that LDA has become influential for analyzing all kinds of digital information, including text, images, user data and recommendation systems, social networks, and survey data,” says Blei. “Even in population genetics, LDA-like ideas were developed independently and are used to uncover patterns in collections of individuals’ gene sequences.”

Blei, whose challenges and rewards both come from solving large-scale machine learning problems, appreciates the interconnectivity of disciplines within his field.

“Columbia has an extremely strong faculty in machine learning and statistics, and across [...]

Blei’s interest in data science and machine learning was reinforced by the work of Herbert Robbins, a former professor of mathematical statistics at Columbia. “Robbins developed an algorithm that allows the machine learning community to scale up algorithms to massive data,” says Blei, who predicts that one of the next breakthroughs in modern machine learning and statistics will involve observational data, or that which is observed but not collected as part of a carefully controlled experiment.

“Previously, machine learning has focused on prediction; observational data has been difficult to understand and exploring that data was a fuzzily defined activity,” Blei explains. “Now, massive collections of observational data are everywhere (in government, industry, natural science, and social science) and practitioners need to be able to quickly explore, understand, visualize, and summarize them. But to do so, we need new statistical and machine learning tools that can help reveal what such data sets say about the real world.”

Prior to joining Columbia Engineering in July of 2014, Blei was an associate professor in the Department of Computer Science at Princeton University. He is a faculty affiliate of Columbia’s Institute for Data Sciences and Engineering. Blei is a recipient of the 2013 ACM-Infosys Foundation Award in Computing Sciences and is a winner of an NSF CAREER Award, an Alfred P. Sloan Fellowship, and the NSF Presidential Early [...]

Columbia Engineering Team Finds Thousands of Secret Keys in Android Apps

[Photo captions: Professor of Computer Science Jason Nieh; PhD Candidate Nicolas Viennot. The researchers used PlayDrone to recover app sources and, in the process, uncovered crucial security flaws.]

In a paper presented at the ACM SIGMETRICS conference on June 18, and awarded the prestigious Ken Sevcik Outstanding Student Paper Award, Jason Nieh, professor of computer science at Columbia Engineering, and PhD candidate Nicolas Viennot reported that they have discovered a crucial security problem in Google Play, the official Android app store.

“Google Play has more than one million apps and over 50 billion app downloads, but no one reviews what gets put into Google Play; anyone can get a $25 account and upload whatever they want. Very little is known about what’s there at an aggregate level,” says Nieh, who is also a member of the University’s Institute for Data Sciences and Engineering’s Cybersecurity Center. “Given the huge popularity of Google Play and the potential risks to millions of users, we thought it was important to take a close look at Google Play content.”

Nieh and Viennot’s paper is the first to make a large-scale measurement of the huge Google Play marketplace. To do this, they developed PlayDrone, a tool that uses various hacking techniques [...]

[...] about the content in Google Play, including a critical security problem: developers often store their secret keys in their app software, much like username and password information, and these can then be used by anyone to maliciously steal user data or resources from service providers such as Amazon and Facebook. These vulnerabilities can affect users even if they are not actively running the Android apps. Nieh notes that even “Top Developers,” designated by the Google Play team as the best developers on Google Play, included these vulnerabilities in their apps.

“We’ve been working closely with Google, Amazon, Facebook, and other service providers to identify and notify customers at risk, and make the Google Play store a safer place,” says Viennot. “Google is now using our techniques to proactively scan apps for these problems to prevent this from happening again in the future.”

In fact, Nieh adds, developers are already receiving notifications from Google to fix their apps and remove the secret keys.

Nieh and Viennot expect PlayDrone to lay a foundation for [...]

Other findings of the research include:

• Showing that roughly a quarter of all Google Play free apps are clones: these apps are duplicative of other apps already in Google Play.
• Identifying a performance problem resulting in very slow app purchases in Google Play: this has since been fixed.
• A list of the top 10 most highly rated apps and the top 10 worst rated apps in Google Play that included surprises such as an app that, while the worst rated, still had more than a million downloads: it purports to be a scale that measures the weight of an object placed on the touchscreen of an Android device, [...]
