NEWSLETTER OF THE DEPARTMENT OF COMPUTER SCIENCE AT COLUMBIA UNIVERSITY
VOL.12 NO.1 SPRING 2016

New Faces at CUCS

ALEXANDR ANDONI
ASSOCIATE PROFESSOR OF COMPUTER SCIENCE

Advancing the algorithmic foundations of massive data

The impact of huge data sets is hard to overstate; they are responsible for advancing almost every scientific and technological field, from machine learning and personalized medicine to speech recognition and translation.

The flip side of the data revolution is that massive data has rendered many standard algorithms computationally expensive, necessitating a wholesale rethinking of even the most basic computational methods.

To take one example: Nearest neighbors search, the classic method since the 1970s for finding similarity among objects, is straightforward for small data sets: compare every object to every other one and compute their similarity for each pair. But as the number of objects increases into the billions, computing time grows quadratically, making the task prohibitively expensive, at least in terms of traditional expectations.

Alexandr Andoni, a theoretical computer scientist focused on developing algorithmic foundations for massive data, sees the need to reframe the issue: “The question today is not ‘what can we solve in polynomial time?’ but ‘what is possible in time proportional to data size, or even less?’ With all this data, what is the best we can do given the resources? Fortunately, vast improvements are possible in both theory and practice once we settle for approximate answers.”

Rather than searching an entire data set for the single nearest neighbor, a search would go much faster if objects were pre-grouped according to some shared attribute, making it easy to zero in on just the small subset of objects most likely to contain the most similar neighbor. The new challenge then becomes: what attribute shall we use to make such a pre-grouping maximally efficient? The speed-up thus gained reverberates across a wide range of computational methods, since nearest neighbors search is ubiquitous and serves as a primitive in higher-level algorithms, particularly in machine learning.

More generally, in the same spirit of relying on (approximate) attributes to speed operations, Andoni has developed a theory of sketching that represents complex objects by smaller, simpler “sketches” that capture the main structure and essential properties of the original objects yet use less (sublinear) space and time to compute. For many tasks, such as estimating the similarity of a pair of objects, a sketch may work just as well as a fully realized object. While relaxing strict formulations is happening generally throughout the community, for the most part by necessity, Andoni is carrying the idea further and is in the forefront of those inventing new primitives and new data structures that explicitly incorporate the concept of sketches.

In early work applying a sketch primitive (Locality Sensitive Hashing) to nearest neighbor search, Andoni in 2006 with Piotr Indyk was able, for the most basic Euclidean distances, to improve over a seminal 1998 algorithm widely used for classification. The Communications of the ACM later (2008, vol. 51) hailed the new primitive as a breakthrough technology that allowed researchers to revisit decades-old problems and solve them faster. Few expected more progress to be possible.

[...] where the ability to provide computationally efficient solutions depends on the development of new algorithms.”

Andoni also looks for inspiration from students. “You feel the new energy. Students are excited, and that excitement and enthusiasm is invigorating. It leads you to think about even things you’ve already checked off, to believe there might be new ways of doing things. You want to try again.”

SUMAN JANA
ASSISTANT PROFESSOR OF COMPUTER SCIENCE

Protecting security and privacy in an age of perceptual computing

[...] issues related to security and privacy. It’s a wide-open field.

Two years ago, for his thesis he looked hard at the security risks inherent in perceptual computing, where devices equipped with cameras, microphones, and sensors are able to perceive the world around them so they can operate and interact more intelligently: lights that dim when a person leaves the room, games that react to a player’s throwing motion, doors that unlock when recognizing the owner.

It all comes at a cost, of course, especially in terms of privacy and security.

“Features don’t come for free; they require incredible amounts of data. And that brings risks. The same data that tells the thermostat no one is home might also be telling a would-be burglar,” says Jana.

What data is being collected isn’t always known, even by the device manufacturers who, pursuing features, default to collecting as much data as they can. This data is handed off to [...]

[...] novel by Philip K. Dick, inserts a privacy protection layer that intercepts the data apps receive and displays it in a console so users can see and, if they want, limit how much data is passed on to the app. The platform, which integrates with the popular computer vision library OpenCV, is designed to make it easy for companies to implement and requires no changes to apps. The DARKLY paper, called revolutionary, won the 2014 PET Award for Outstanding Research in Privacy Enhancing Technologies.

Making it easy to build in safety and privacy mechanisms is critical. Manufacturers have little incentive to construct privacy protections; in any case, determining what data is sensitive is not easy. A single data point by itself (a random security photo of a passerby, for instance) might seem harmless, but combined with another data point or aggregated over time (similar photos over several weeks), it reveals patterns and personal behaviors. The challenge in preventing security leaks is first finding them amidst [...]

[...] classifiers to find security risks? I’m fortunate to be working within Columbia’s Data Science Institute alongside machine learners who can build such classifiers.”

For guarding against buggy code, Jana imagines adapting program analysis, an existing technology for automatically finding software bugs, so it specifically searches out those bugs that concern security and privacy.

Technology alone, however, isn’t the answer. Companies are unlikely to fix privacy problems unless pressured by the public, and Jana sees his role encompassing the policy arena, where he will work to propose and enact workable regulations and legislation to protect data and security.

At least for perceptual computing, Jana says there is an opening to do something about privacy risks. “The field is still relatively new, and we have the chance to build in security from the beginning and make life better so people can trust these devices and use them.”

[...] level of interactivity is not yet possible for massive data sets.

“Computing power has grown, data sets have grown, what hasn’t kept pace is the ability to visualize and interact with all this data in a way that’s easy and intuitive for people to understand,” says Eugene Wu, who recently received his PhD from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), where he was a member of the database group.

Speed is one important component for visualizing data, but there are others, such as the ease with which interactive visualizations can be created and the ability to help understand what the results actually say. For his PhD thesis, Wu tackled the latter problem by developing a visualization tool that automatically generates explanations for anomalies in a user’s visualization. This is important because while visualizations are very good at showing what’s happening in the data, they are not good at explaining why. A visualization might show that company expenses shot up 400% in a [...]

Part of the problem is that the database and the visualization communities have traditionally been separate, with the database side focusing on efficient query processing and accuracy, and the visualization community focusing on usability and interactions. Says Wu, “If you look at visualizations from a database perspective, a lot of it looks like database operations. In both cases, you’re computing sums, you’re computing common aggregates. We can remove many of the perceived differences between databases and visualization systems.” Wu wants to bridge the two sides to operate more closely together so both consider first the expectations and requirements of the human in the loop.

For instance, what does database accuracy mean when a human analyst can’t differentiate 3.4 from 3.45 in a scatterplot? A slight relaxation of accuracy requirements, unnoticeable to users, would conserve resources while speeding up query operations. In understanding the boundary between what [...]

For Wu, the natural progression is to extend the declarative approach to interactive visualizations. With colleagues at Berkeley and University of Washington, Wu is designing a declarative visualization language to provide a set of logical operations and mappings that would free programmers from implementation details so they can logically state what they want while letting the database figure out the best way to do it.

A declarative language for visualization would have additional positive benefits. “Once you have a high-level language capable of expressing analyses, all of these analysis tools such as the explanatory analysis from my thesis is in a sense baked into whatever you build; it comes for free. There will be less need for individuals to write their own ad hoc analysis programs.”

As interactions become portable and sharable, they can be copied and pasted from one interactive visualization to another for someone else to modify. And it becomes easier to build tools, which fits with Wu’s focus in [...]