Dan Suciu Speaks Out on Research, Shyness and Being a Scientist

Marianne Winslett and Vanessa Braganholo

Dan Suciu
https://homes.cs.washington.edu/~suciu/

SIGMOD Record, December 2017 (Vol. 46, No. 4)

Welcome to ACM SIGMOD Record’s series of interviews with distinguished members of the database community. I’m Marianne Winslett, and today we are in Snowbird, Utah, USA, site of the 2014 SIGMOD and PODS conference. I have here with me Dan Suciu, who is a professor at the University of Washington. Dan has two Test of Time Awards from PODS as well as Best Paper Awards from SIGMOD and ICDT. Dan’s Ph.D. is from the University of Pennsylvania.

So, Dan, welcome!

Thank you.

Can we put a price on individuals’ privacy?

Oh, that’s tough. Today users give away their private information to Internet companies like Google or Yahoo for free, but this has to change. Users need to have control over their private data. So, it’s not clear how this would happen, but at the University of Washington, we’ve started a number of projects that look into how to price data. In particular, we’ve looked at how to cover the entire continuum between free and differentially private data and paying the users the full price for access to their private data if they’re willing to release that. It is still way too early to see what the right business model would be to allow users to monetize their private data. So far, we do studies in academia. We are still waiting for somebody to come up with the right business model.

“We need to train the future scientists [..] in a way that will help them develop their careers for many years. For that, we need a good combination of both theory and practice.”

I know it’s a hard question but what can you tell me about how we should price it? What kind of methodology should we use?

Clearly, we need to allow users to opt-in and there are some technical challenges here because sometimes the user’s decision of whether to opt-in or not might actually reveal something about their private data. This used to be one of the most technically challenging problems in pricing private data. In addition to that, the problem that we did address and that seems more within our reach is how to adjust the price according to the amount of perturbation that we add to the query. So, if the data analyst wants to have very precise data, then he would pay to have the perturbation removed. If he is willing to cope with differentially private data, then he would not pay, and then he would get differentially private perturbed query answers.

So what is my email address worth?

Ah, I think that kind of depends on you, and I don’t think your email address is worth in isolation. Maybe it is, but you’re part of a larger population. The question is, what is it worth to the analyst to get statistical information about that population? There is a game between users who would like to be compensated for access to their data and the analyst who would like to pay for access to aggregated data over a larger population. So, I don’t have a good answer to this. I think that there are developing techniques that would allow the users to choose their price.

It’s like there are two categories. There are the ways to contact me and then there are facts about me.

These are indeed different kinds of private information. I don’t think we have thought too much about how to distinguish between these kinds of information. So nope, I can’t tell you much more about this.

Channels vs. attributes so to speak. Okay! What about the temporal element?

Temporal element in data... you got me here. We know the techniques to deal with temporal data. Maybe you’re asking about archiving data and how to keep it for a long time?

Well once something is out there, it’s out there.

Ah, how to retract data? Is that the solution?

If I change my mind, update it or whatever?

That is again, a difficult technical challenge. I know there has been research, not inside the database community, but in the operating systems community. There was a project run by some of my colleagues called Vanish¹ that allowed people to produce data that would completely vanish from the cloud after a few hours. So the idea is that you can send an email to your friend, but this email is private, and then it will completely disappear. Every trace of this email will completely disappear from the cloud after a few hours that your friend has read this email. It’s a technically challenging problem. Now we also see legislation in Europe that tries to force companies to remove data. I think the jury is still out for what the right model is -- if the data should be kept forever and how much should the users have control over when their data is being deleted.

Differential privacy also gets more difficult when there’s a periodic release of aggregated information compared to a one-time release.

That’s a major limitation of differential privacy. They talk about a privacy budget. You can only ask as many queries as is allowed by a privacy budget. There is no story of what happens after this privacy budget has been exhausted. So, then the analyst should theoretically never have access to the data because he/she has exhausted the privacy budget. There is some hope if you also take into account data churning. The data is never static. It always gets updated. Old data is being removed, and new data is being added. So, then your privacy budget would maybe be renewed, but this is still work in progress, and this is not something we do in our group. In our group, we try to solve the problem by allowing users to place a price on this data and that will simplify the privacy budget because now the budget is really a monetary budget (as much money as you have). This is how much you can get access to the private data.

Ok great. Tell me about the work that you received the Test of Time Awards for.

The Test of Times in PODS? There were two papers. Both were about XML. The first was on type checking XML transformations². Here the question that we asked was: If you are given the schemas for your input data and for your output data, can you automatically check if a given XML transformation will indeed map every input data conforming to the input schema to an output data that conforms to the output schema? It turns out this is decidable. You can do this for a non-trivial fragment of XPath. Today XQuery does it differently. It uses type inference, which is not a complete decision procedure. We had a complete decision procedure for a more restricted fragment. So, this was one work.

The other³ was for a very simple and fundamental problem, which is, you’re given two XPath expressions and you need to check if they’re equivalent. They may be syntactically different perhaps, but are they actually semantically equivalent? We looked at the tiniest [..] time. What we showed is that when you throw in all three features, then the equivalence problem becomes co-NP-complete. And that was an interesting insight because it tells you that XPath is as difficult to check for equivalence as arbitrary conjunctive queries, for example.

Interesting. So both of those papers were on XML, and after that, you moved to working on privacy and then to probabilistic data and from probabilistic data to data markets. How do you choose your next topic and the timing of the move?

This is actually quite hard, but as researchers, we need to watch technology trends and application pulls. The world changes. So in the 90s, the web was new, and for the first time, people discovered that they could share data. They could exchange data. XML was actually designed by people working as a document community. So, they designed XML thinking of it as a document. I think the database community should be credited for showing how XML should be thought of as data and as a data exchange format. The research problems that emerged were fun, but they were not particularly deep. I would say that they are largely solved by now.

But in the meantime, we got new challenges. People started to realize that by exchanging data and having access to data, you need to worry about data privacy. Also, much of the data is uncertain, so data privacy and probabilistic databases emerged later as new challenges for data management. The technical questions underlying these challenges turn out to be much harder. None of them is solved. I actually have my doubts whether privacy can ever be solved in the way in which the academic papers present it. I am a little bit more hopeful that adding a price to the data might lead to a practical solution. For probabilistic databases, they are equally technically challenging, but at least here we’re not the only ones looking for solutions. The knowledge representation community and the machine learning community are very hard at work at trying to solve the same challenges we face in probabilistic databases, which is probabilistic inference.

¹ The VANISH project. https://vanish.cs.washington.edu
