Low Latency Geo-Distributed Data Analytics

Total Page:16

File Type:pdf, Size:1020Kb

Low Latency Geo-Distributed Data Analytics Low Latency Geo-distributed Data Analytics Qifan Pu1;2, Ganesh Ananthanarayanan1, Peter Bodik1 Srikanth Kandula1, Aditya Akella3, Paramvir Bahl1, Ion Stoica2 1Microsoft Research 2University of California at Berkeley 3University of Wisconsin at Madison ABSTRACT 1. INTRODUCTION Low latency analytics on geographically distributed dat- Large scale cloud organizations are deploying dat- asets (across datacenters, edge clusters) is an upcoming acenters and \edge" clusters globally to provide their and increasingly important challenge. The dominant users low latency access to their services. For instance, approach of aggregating all the data to a single data- Microsoft and Google have tens of datacenters (DCs) [6, center significantly inflates the timeliness of analytics. 11], with the latter also operating 1500 edges world- At the same time, running queries over geo-distributed wide [24]. The services deployed on these geo-distributed inputs using the current intra-DC analytics frameworks sites continuously produce large volumes of data like also leads to high query response times because these user activity and session logs, server monitoring logs, frameworks cannot cope with the relatively low and and performance counters [34, 46, 53, 56]. variable capacity of WAN links. Analyzing the geo-distributed data gathered across We present Iridium, a system for low latency geo-distri- these sites is an important workload. Examples of such buted analytics. Iridium achieves low query response analyses include querying user logs to make advertise- times by optimizing placement of both data and tasks ment decisions, querying network logs to detect DoS of the queries. The joint data and task placement op- attacks, and querying system logs to maintain (stream- timization, however, is intractable. Therefore, Iridium ing) dashboards of overall cluster health, perform root- uses an online heuristic to redistribute datasets among cause diagnosis and build fault prediction models. Be- the sites prior to queries' arrivals, and places the tasks cause results of these analytics queries are used by data to reduce network bottlenecks during the query's ex- analysts, operators, and real-time decision algorithms, ecution. Finally, it also contains a knob to budget minimizing their response times is crucial. WAN usage. Evaluation across eight worldwide EC2 re- Minimizing query response times in a geo-distributed gions using production queries show that Iridium speeds setting, however, is far from trivial. The widely-used up queries by 3× − 19× and lowers WAN usage by approach is to aggregate all the datasets to a central 15% − 64% compared to existing baselines. site (a large DC), before executing the queries. How- ever, waiting for such centralized aggregation, signifi- cantly delays timeliness of the analytics (by as much CCS Concepts as 19× in our experiments).1 Therefore, the natural •Computer systems organization ! Distributed alternative to this approach is to execute the queries Architectures; •Networks ! Cloud Computing; geo-distributedly over the data stored at the sites. Additionally, regulatory and privacy concerns might also forbid central aggregation [42]. Finally, verbose or Keywords less valuable data (e.g., detailed system logs stored only geo-distributed; low latency; data analytics; network for a few days) are not shipped at all as this is deemed aware; WAN analytics too expensive. Low response time for queries on these datasets, nonetheless, remains a highly desirable goal. Permission to make digital or hard copies of all or part of this work for personal Our work focuses on minimizing response times of or classroom use is granted without fee provided that copies are not made or geo-distributed analytics queries. A potential approach distributed for profit or commercial advantage and that copies bear this notice would be to leave data in-place and use unmodified, and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is per- intra-DC analytics framework (such as Hadoop or Spark) mitted. To copy otherwise, or republish, to post on servers or to redistribute to across the collection of sites. However, WAN band- lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. SIGCOMM ’15, August 17 - 21, 2015, London, United Kingdom 1An enhancement could \sample" data locally and send only c 2015 ACM. ISBN 978-1-4503-3542-3/15/08. $15.00 a small fraction [46]. Designing generic samplers, unfortu- DOI: http://dx.doi.org/10.1145/2785956.2787481 nately, is hard. Sampling also limits future analyses. 421 widths can be highly heterogeneous and relatively mod- Bangalore Boston erate [43, 47, 48] which is in sharp contrast to intra-DC Site Manager networks. Because these frameworks are not optimized Site Manager I1 Map1 S1 for such heterogeneity, query execution could be dra- Map2 matically inefficient. Consider, for example, a simple Reduce2 Reduce1 map-reduce query executing across sites. If we place no (or very few) reduce tasks on a site that has a large Core amount of intermediate data but low uplink bandwidth, Network all of the data on this site (or a large fraction) would Global Manager Site Manager have to be uploaded to other sites over its narrow up- Job Queue link, significantly affecting query response time. Map3 We build Iridium, a system targeted at geo-distributed San Francisco Beijing data analytics. Iridium views a single logical analytics framework as being deployed across all the sites. To Figure 1: Geo-distributed map-reduce query. The user submits the query in San Francisco, and the achieve low query response times, it explicitly considers query runs across Boston, Bangalore and Beijing. the heterogeneous WAN bandwidths to optimize data We also show the notations used in the paper at and task placement. These two placement aspects are Bangalore, see Table 1. central to our system since the source and destination of a network transfer depends on the locations of the data that trades off the WAN usage and latency by limiting and the tasks, respectively. Intuitively, in the example the amount of WAN bandwidth used by data moves and above, Iridium will either move data out of the site with task execution. In our experiments, with a budget equal low uplink bandwidth before the query arrives or place to that of a WAN-usage optimal scheme (proposed in many of the query's reduce tasks in it. [53, 54]), Iridium obtains 2× faster query responses. Because durations of intermediate communications Our implementation of Iridium automatically estimates (e.g., shuffles) depend on the duration of the site with site bandwidths, future query arrivals along with their the slowest data transfer, the key intuition in Iridium characteristics (intermediate data), and prioritizes data is to balance the transfer times among the WAN links, movement of the earlier-arriving queries. It also sup- thereby avoiding outliers. To that end, we formulate ports Apache Spark queries, both streaming [60] as well the task placement problem as a linear program (LP) as interactive/batch queries [59]. 2 by modeling the site bandwidths and query character- Evaluation across eight worldwide EC2 regions and istics. The best task placement, however, is still limited trace-driven simulations using production queries from by input data locations. Therefore, moving (or repli- Bing Edge, Conviva, Facebook, TPC-DS, and the Big- cating) the datasets to different sites can reduce the data benchmark show that Iridium speeds up queries by anticipated congestion during query execution. 3× − 19× compared to existing baselines that (a) cen- The joint data and task placement, even for a single trally aggregate the data, or (b) leave the data in-place map-reduce query, results in a non-convex optimization and use unmodified Spark. with no efficient solution. Hence, we devise an effi- cient and greedy heuristic that iteratively moves small 2. BACKGROUND AND MOTIVATION chunks of datasets to \better" sites. To determine which datasets to move, we prefer those with high value-per- We explain the setup of geo-distributed analytics (x2.1), byte; i.e., we greedily maximize the expected reduction illustrate the importance of careful scheduling and stor- in query response time normalized by the amount of age (x2.2), and provide an overview of our solution (x2.3). data that needs to be moved to achieve this reduction. 2.1 Geo-distributed Analytics This heuristic, for example, prefers moving datasets with many queries accessing them and/or datasets with Architecture: We consider the geo-distributed ana- queries that produce large amount of intermediate data. lytics framework to logically span all the sites. We Our solution is also mindful of the bytes transferred assume that the sites are connected using a network on the WAN across sites since WAN usage has impor- with congestion-free core. The bottlenecks are only be- tant cost implications ($/byte) [53]. Purely minimizing tween the sites and the core which has infinite band- query response time could result in increased WAN us- width, valid as per recent measurements [13]. Addi- age. Even worse, purely optimizing WAN usage can ar- tionally, there could be significant heterogeneity in the bitrarily increase query latency. This is because of the uplink and downlink bandwidths due to widely different fundamental difference between the two metrics: band- link capacities and other applications (non-Iridium traf- width cost savings are obtained by reducing WAN usage fic) sharing the links. Finally, we assume the sites have on any of the links, whereas query speedups are ob- relatively abundant compute and storage capacity. tained by reducing WAN usage only on the bottleneck Data can be generated on any site and as such, a link. Thus, to ensure fast query responses and reason- dataset (such as \user activity log for application X") able bandwidth costs, we incorporate a simple \knob" 2https://github.com/Microsoft-MNR/GDA 422 could be distributed across many sites.
Recommended publications
  • SIGOPS Annual Report 2012
    SIGOPS Annual Report 2012 Fiscal Year July 2012-June 2013 Submitted by Jeanna Matthews, SIGOPS Chair Overview SIGOPS is a vibrant community of people with interests in “operatinG systems” in the broadest sense, includinG topics such as distributed computing, storaGe systems, security, concurrency, middleware, mobility, virtualization, networkinG, cloud computinG, datacenter software, and Internet services. We sponsor a number of top conferences, provide travel Grants to students, present yearly awards, disseminate information to members electronically, and collaborate with other SIGs on important programs for computing professionals. Officers It was the second year for officers: Jeanna Matthews (Clarkson University) as Chair, GeorGe Candea (EPFL) as Vice Chair, Dilma da Silva (Qualcomm) as Treasurer and Muli Ben-Yehuda (Technion) as Information Director. As has been typical, elected officers agreed to continue for a second and final two- year term beginning July 2013. Shan Lu (University of Wisconsin) will replace Muli Ben-Yehuda as Information Director as of AuGust 2013. Awards We have an excitinG new award to announce – the SIGOPS Dennis M. Ritchie Doctoral Dissertation Award. SIGOPS has lonG been lackinG a doctoral dissertation award, such as those offered by SIGCOMM, Eurosys, SIGPLAN, and SIGMOD. This new award fills this Gap and also honors the contributions to computer science that Dennis Ritchie made durinG his life. With this award, ACM SIGOPS will encouraGe the creativity that Ritchie embodied and provide a reminder of Ritchie's leGacy and what a difference a person can make in the field of software systems research. The award is funded by AT&T Research and Alcatel-Lucent Bell Labs, companies that both have a strong connection to AT&T Bell Laboratories where Dennis Ritchie did his seminal work.
    [Show full text]
  • Arxiv:1611.04369V1 [Cs.AI] 14 Nov 2016 Computer Science Conferences Are Selected As Target Conferences, to Rank [8]
    Feature Engineering and Ensemble Modeling for Paper Acceptance Rank Prediction Yujie Qian∗, Yinpeng Dong∗, Ye Ma∗, Hailong Jin, and Juanzi Li Department of Computer Science and Technology, Tsinghua University {qyj13, dongyp13, y-ma13}@mails.tsinghua.edu.cn, [email protected], [email protected] ABSTRACT MAG is a large and heterogeneous academic graph provided by Mi- Measuring research impact and ranking academic achievement are crosoft, containing scientific publication records, citation relation- important and challenging problems. Having an objective picture ships between publications, as well as authors, institutions, jour- of research institution is particularly valuable for students, parents nal and conference venues, and fields of study. The latest version and funding agencies, and also attracts attention from government of MAG includes 19,843 institutions, 114,698,044 authors, and and industry. KDD Cup 2016 proposes the paper acceptance rank 126,909,021 publications. prediction task, in which the participants are asked to rank the im- The evaluation is performed after the conferences announce their portance of institutions based on predicting how many of their pa- decisions of paper acceptance. For every conference, the compe- pers will be accepted at the 8 top conferences in computer science. tition organizers collect the full list of accepted papers, calculate In our work, we adopt a three-step feature engineering method, ground truth ranking, and evaluate the participants’ submissions. including basic features definition, finding similar conferences to Ground truth ranking is generated following a simple policy to de- enhance the feature set, and dimension reduction using PCA. We termine the Institution Ranking Score: propose three ranking models and the ensemble methods for com- 1.
    [Show full text]
  • Llista Congressos Notables
    Llistat de congressos notables per la UPC (Ordenat pel nom del congrés) Servei d'Informació RDI (Per dubtes: [email protected]). 5 de febrer de 2018 Nom del congrés Acrònim AAAI Conference on Artificial Intelligence AAAI ACM Conference on Computer and Communications Security CCS ACM Conference on Human Factors in Computing Systems CHI ACM European Conference on Computer Systems EuroSys ACM Great Lakes Symposium on VLSI GLSVLSI ACM International Conference on Information and Knowledge Management CIKM ACM International Conference on Measurement and Modelling of Computer Systems SIGMETRICS ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems MSWiN ACM International Conference on Multimedia ACMMM ACM International Symposium on Mobile AdHoc Networking and Computing ACM MOBIHOC ACM International Symposium on Modeling, Analysis and Simulation of Wireless and Mobile Systems MSWiM ACM International Symposium on Physical Design ISPD ACM Object-Oriented Programming, Systems, Languages & Applications OOPSLA ACM Principles of Programming Languages POPL ACM SIGCOMM Special Interest Group on Data Communications SIGCOMM ACM SIGMOD Conference ACM SIGMOD ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation PLDI ACM Symposium on Applied Computing SAC ACM Symposium on Operating Systems Principles SOSP ACM Symposium on Parallel Algorithms and Architectures SPAA ACM Symposium on Principles of Database Systems PODS ACM Symposium on Principles of Distributed Computing PODC ACM Symposium on
    [Show full text]
  • Curriculum Vitae
    CURRICULUM VITAE Name: Robert J. Robbins Birth Date: 10 May 1944 Birthplace: Niles, Michigan Citizenship: United States Address: Vice President, Information Technology Fred Hutchinson Cancer Research Center 1100 Fairview Avenue North, J4-300 Seattle, WASHINGTON 98109 (206) 667–4778 (206) 667–7733 (FAX) [email protected] Home: 21454 NE 143rd Street Woodinville, WASHINGTON 98072 (425) 556–9386 (home) (425) 556–0970 (home FAX) Education: Ph.D. 1977 Zoology Michigan State University East Lansing, Michigan M.S. 1974 Biological Michigan State University Science East Lansing, Michigan B.S. 1973 Zoology Michigan State University East Lansing, Michigan A.B. 1966 History Stanford University Palo Alto, California Research Interests: Computer applications in biology; information systems design; database theory and design; software engineering; history of science; intellectual history of genetics; concept of the gene. Professional Affiliations: American Association for the Advancement of Science American Society for Microbiology Association for Computing Machinery SIGMOD, SIGIR, SIGSOFT Genetics Society of America IEEE Computer Society Honors and Awards: Listed in American Men and Women of Science Listed in Who’s Who in the Midwest Listed in Who’s Who of Emerging Leaders in America Listed in Who’s Who in Frontier Science and Technology Outstanding Performance Award, NSF (1991) Outstanding Performance Award, NSF (1989) National Science Foundation National Needs Postdoctoral Fellowship (1978– 1979) Sigma Xi Award for Meritorious Research (1977) National
    [Show full text]
  • Analysing Scholarly Communication Metadata of Computer Science Events
    Analysing Scholarly Communication Metadata of Computer Science Events Said Fathalla1;3, Sahar Vahdati1, Christoph Lange1;2, and Sören Auer4;5 1 Enterprise Information Systems (EIS), University of Bonn, Germany {fathalla,vahdati,langec}@cs.uni-bonn.de 2 Fraunhofer IAIS, Germany 3 Faculty of Science, University of Alexandria, Egypt 4 Computer Science, Leibniz University of Hannover, Germany 5 TIB Leibniz Information Center for Science and Technology, Hannover, Germany [email protected] Abstract Over the past 30 years we have observed the impact of the ubiquitous availability of the Internet, email, and web-based services on scholarly communication. The preparation of manuscripts as well as the organisation of conferences, from submission to peer review to publica- tion, have become considerably easier and efficient. A key question now is what were the measurable effects on scholarly communication in com- puter science? Of particular interest are the following questions: Did the number of submissions to conferences increase? How did the selection processes change? Is there a proliferation of publications? We shed light on some of these questions by analysing comprehensive scholarly commu- nication metadata from a large number of computer science conferences of the last 30 years. Our transferable analysis methodology is based on descriptive statistics analysis as well as exploratory data analysis and uses crowd-sourced, semantically represented scholarly communication metadata from OpenResearch.org. Keywords: Scientific Events, Scholarly Communication, Semantic Publishing, Metadata Analysis 1 Introduction The mega-trend of digitisation affects all areas of society, including business and science. Digitisation is accelerated by ubiquitous access to the Internet, the global, distributed information network.
    [Show full text]
  • Central Library: IIT GUWAHATI
    Central Library, IIT GUWAHATI BACK VOLUME LIST DEPARTMENTWISE (as on 20/04/2012) COMPUTER SCIENCE & ENGINEERING S.N. Jl. Title Vol.(Year) 1. ACM Jl.: Computer Documentation 20(1996)-26(2002) 2. ACM Jl.: Computing Surveys 30(1998)-35(2003) 3. ACM Jl.: Jl. of ACM 8(1961)-34(1987); 43(1996);45(1998)-50 (2003) 4. ACM Magazine: Communications of 39(1996)-46#3-12(2003) ACM 5. ACM Magazine: Intelligence 10(1999)-11(2000) 6. ACM Magazine: netWorker 2(1998)-6(2002) 7. ACM Magazine: Standard View 6(1998) 8. ACM Newsletters: SIGACT News 27(1996);29(1998)-31(2000) 9. ACM Newsletters: SIGAda Ada 16(1996);18(1998)-21(2001) Letters 10. ACM Newsletters: SIGAPL APL 28(1998)-31(2000) Quote Quad 11. ACM Newsletters: SIGAPP Applied 4(1996);6(1998)-8(2000) Computing Review 12. ACM Newsletters: SIGARCH 24(1996);26(1998)-28(2000) Computer Architecture News 13. ACM Newsletters: SIGART Bulletin 7(1996);9(1998) 14. ACM Newsletters: SIGBIO 18(1998)-20(2000) Newsletters 15. ACM Newsletters: SIGCAS 26(1996);28(1998)-30(2000) Computers & Society 16. ACM Newsletters: SIGCHI Bulletin 28(1996);30(1998)-32(2000) 17. ACM Newsletters: SIGCOMM 26(1996);28(1998)-30(2000) Computer Communication Review 1 Central Library, IIT GUWAHATI BACK VOLUME LIST DEPARTMENTWISE (as on 20/04/2012) COMPUTER SCIENCE & ENGINEERING S.N. Jl. Title Vol.(Year) 18. ACM Newsletters: SIGCPR 17(1996);19(1998)-20(1999) Computer Personnel 19. ACM Newsletters: SIGCSE Bulletin 28(1996);30(1998)-32(2000) 20. ACM Newsletters: SIGCUE Outlook 26(1998)-27(2001) 21.
    [Show full text]
  • Test of Time Award
    Test of Time Award ACM SIGMETRICS March 13, 2009 Proposed Award ACM SIGMETRICS proposes to create a new annual award called the ACM SIGMET- RICS Test of Time Award. This annual award will recognize an influential paper from the ACM SIGMETRICS conference proceedings 10-12 years prior, based on the criterion of identifying the paper that has had the most impact (research, methodology, application, transfer) in the intervening time period. In each year, the author(s) of the winning paper will receive a recognition plaque and a check for $1,000 US, to be presented at the annual ACM SIGMETRICS conference. Background and Rationale SIGMETRICS is the ACM Special Interest Group for the computer systems performance evaluation community. It promotes research in performance evaluation and analysis tech- niques, as well as the advanced and innovative use of known methods and tools. Target areas of performance analysis include computer architecture, computer networks, database systems, distributed systems, fault-tolerant systems, file and memory systems, operating systems, and real-time systems. ACM SIGMETRICS has held its own annual conference since 1973. Every three years, this conference occurs jointly with the IFIP Performance conference, the flagship conference of IFIP Working Group 7.3. ACM SIGMETRICS currently sponsors four major awards to recognize important con- tributions in the area of performance evaluation: • the SIGMETRICS Achievement Award (first awarded in 2003); • the SIGMETRICS Rising Star Researcher Award (first awarded in 2008); • the SIGMETRICS Best Paper Award at the ACM SIGMETRICS conference; • the Kenneth C. Sevcik Outstanding Student Paper Award for the best student paper at the ACM SIGMETRICS conference.
    [Show full text]
  • Richard Newton Taylor
    Richard Newton Taylor Contact School of Information and Computer Sciences Information University of California, Irvine E-mail: [email protected] Irvine, CA 92697-3440 www.ics.uci.edu/∼taylor Citizenship USA Research Software engineering, software architecture, Web technologies, decentralized systems, Interests design, system engineering, development environments Education University of Colorado, Boulder, CO USA Ph.D., Computer Science, December 1980 Static Analysis of the Synchronization Structure of Concurrent Programs Advisor: Professor Leon J. Osterweil M.S., Computer Science, May 1976 University of Colorado, Denver, CO USA B.S., Applied Mathematics, May 1974 With Special Honors Minor in Distributed Engineering Awards & 1985 Presidential Young Investigator Award Honors 1998 ACM Fellow 2004 ICSE Distinguished Paper Award 2005 ACM SIGSOFT Distinguished Service Award 2008 ACM SIGSOFT/IEEE TCSE Most Influential Paper from ICSE 1998 Award 2009 Dean's Award for Graduate Student Mentoring 2009 ACM SIGSOFT Outstanding Research Award 2010 UC Irvine Chancellor's Professor 2012 Dean's Award for Excellence in Research 2017 Distinguished Engineering Alumni Award, University of Colorado Boulder 2017 ACM SIGSOFT Impact Paper Award (with Roy Fielding) 1 of 45 Professional 7/17-present Chancellor's Professor Emeritus, School of Information and Computer Experience Sciences, University of California, Irvine (UCI) 7/13-6/17 Chancellor's Professor Emeritus (Recalled status), School of Informa- tion and Computer Sciences, University of California, Irvine (UCI) 7/99{6/17 Director, Institute for Software Research, UCI. 1/03 -6/13 Professor (VIII), School of Information and Computer Sciences. Uni- versity of California, Irvine (UCI) 1/03{6/04 Chair, Department of Informatics, School of Information and Computer Sciences, UCI 7/91{12/02 Professor, Department of Information and Computer Science, UCI.
    [Show full text]
  • Spain BIOGRAPHY Academic Background
    Candidate for Councilor Rosa M. Badia Barcelona Supercomputing Center (BCS), Spain BIOGRAPHY Academic Background: PhD, Universitat Politecnica de Catalunya, 1994, Computer Sciences. Professional Experience: Group Manager, Barcelona Supercomputing Center, Barcelona, Spain, 2005 – Present; Scientific Researcher, Spanish National Research Council (CSIC), Barcelona, Spain, 2008 – 2019; Associate Professor, Universitat Politecnica de Catalunya, Barcelona, Spain, 1997 – 2008. Professional Interests: High Performance computing; Distributed Computing; Programming models for parallel and distributed computing; Tools for workflow development; Cloud computing. ACM Activities: Working Group for Enhancing European Research Visibility, ACM Europe, 2021 – Present; ACM International Conference on Supercomputing, Programme chair, 2020; ACM-IEEE CS George Michael Memorial HPC Fellowships, Member of the committee, 2017 – 2019. Membership and Offices in Related Organizations: IEEE TMRF committee, IEEE, 2020 – Present; IEEE Transactions on Parallel computing - Associate editor, IEEE, 2020 – Present; IEEE CS TCHPC Early Career Researchers Award, IEEE, 2020. Awards Received: Euro-Par Achievement Award 2019; DonaTIC award, category Academia/Researcher, 2019. STATEMENT I am the manager of the Workflows and Distributed Computing research group at the Barcelona Supercomputing Center (BSC). I have been contributed to the research in Parallel programming models for multicore and distributed computing with contributions to task-based programming models during the
    [Show full text]
  • Alexandra Meliou Curriculum Vitae October 5, 2020 Page 1 of 14 Honors, Awards, Fellowships
    Alexandra Meliou College of Information and Computer Sciences +1-413-545-3788 University of Massachusetts [email protected] 140 Governors Dr., Amherst, MA 01003-9264 http://www.cs.umass.edu/∼ameli Research Interests My research augments data management with user-facing functionality that helps people make sense of their data and use it effectively, at a time when data becomes increasingly unpredictable, unwieldy, and unmanageable. I focus on issues of provenance, causality, explanations, data quality, usability, and data and algorithmic bias. In the past, I have worked on a variety of topics spanning data management, machine learning, and sensor networks. Education UNIVERSITY OF CALIFORNIA,BERKELEY: . Berkeley, CA, USA 2009 Doctor of Philosophy in Computer Science Dissertation: Querying uncertain data in resource-constrained settings Advisors: Joseph Hellerstein, Carlos Guestrin 2005 Master of Science in Computer Science Thesis: Data gathering tours in sensor networks Advisors: Joseph Hellerstein, Carlos Guestrin NATIONAL TECHNICAL UNIVERSITY OF ATHENS: . Athens, Greece 2003 Diploma in Electrical and Computer Engineering (5-year degree) Thesis: Modeling and exploring the algebraic properties of hierarchical structures Advisor: Timos Sellis Employment History UNIVERSITY OF MASSACHUSETTS: . Amherst, MA, USA 9/2019 – present Associate Professor 9/2012 – 8/2019 Assistant Professor UNIVERSITY OF WASHINGTON: . Seattle, WA, USA 9/2009 – 8/2012 Postdoctoral Research Associate Advisor: Dan Suciu UNIVERSITY OF CALIFORNIA,BERKELEY: . Berkeley,
    [Show full text]
  • Towards Scalable Udtfs in Noria Justus Adam Technische Universität Dresden Chair for Compiler Construction Dresden, Germany [email protected]
    Student Abstract SIGMOD ’20, June 14–19, 2020, Portland, OR, USA Towards Scalable UDTFs in Noria Justus Adam Technische Universität Dresden Chair for Compiler Construction Dresden, Germany [email protected] ABSTRACT or variance calculations. We apply known techniques [1, 15] User Defined Functions (UDF) are an important and pow- for incrementalizing single-tuple UDFs and UDAs and par- erful extension point for database queries. Systems using tially automate the process for developer convenience. incremental materialized views largely do not support UDFs We integrate the state needed by UDAs into Noria’s mate- because they cannot easily be incrementalized. rialization, which allows UDAs to be data parallelized and In this work we design single-tuple UDF and User Defined integrate with Noria’s memory management. Aggregates (UDA) interfaces for Noria, a state-of-the art dataflow system with incremental materialized views. We fn main(clicks: RowStream<i32, i23, i32>) also add limited support for User Defined Table Functions -> GroupedRows<i32, i32> { let click_streams = group_by(0, clicks); (UDTF), by compiling them to query fragments. We show for (uid, group_stream) in click_streams { our UDTFs scale by implementing a motivational example let sequences = IntervalSequence::new(); used by Friedman et al. [7]. for (_, category, timestamp) in group_stream { let time = deref(timestamp); ACM Reference Format: let cat = deref(category); Justus Adam. 2020. Towards Scalable UDTFs in Noria. In Proceedings if eq(cat , 1) { sequences.open(time) of the 2020 ACM SIGMOD International Conference on Management of } else if eq(cat , 2) { Data (SIGMOD’20), June 14–19, 2020, Portland, OR, USA. ACM, New sequences.close(time) York, NY, USA, 3 pages.
    [Show full text]
  • Kenneth A. Ross
    Kenneth A. Ross Work Address Home Address Email Address Dept. of Computer Science 2828 Broadway, Apt. 9D [email protected] Columbia University New York, NY 10025 New York, NY 10027 (212) 864{1279 Fax: (212) 666{0140 (212) 939{7058 Education Stanford University, Stanford, California. University of Melbourne, Melbourne, Australia. Ph.D., Computer Science. B.Sc. (honors), Mathematics and Computer Science. September 1987 { August 1991. March 1983 { December, 1986. Advisor: Jeffrey D. Ullman. Employment Professor, Columbia University. July 2005 - present Associate Professor, Columbia University. July 1996 - June 2005 Assistant Professor, Columbia University. August 1991 - June 1996 Research Assistant and Teaching Assistant, Stanford University. Sept 1987 - August 1991 Summer Intern, IBM T.J. Watson Research Laboratory, Hawthorne, NY. June - Sept 1990 Teaching Assistant, Melbourne University. 1986, July - August 1988 Experimental Scientist, Commonwealth Scientific and Industrial Research Feb - August 1987 Organization, Melbourne, Australia. Honors NSF Young Investigator Award, 1994. Sloan Foundation Fellowship, 1994. Packard Foundation Fellowship, 1993. NSF Research Initiation Award, 1992. IBM Graduate Fellowship, 1990. Melbourne University Pure Mathematics Prize, 1983 & 1984. Represented Australia at the International Mathematical Olympiad, 1981 & 1982. Publications Journal publications indicated with a *. Journal Publications and Refereed Conference Publications 1. \Master of None Acceleration: A Comparison of Accelerator Architectures for Analytical Query Processing," A. Lottarini, J. P. Cerqueira, T. J. Repetti, S. A. Edwards, K. A. Ross, M. Seok, M. A. Kim Proceedings of the 46th International Symposium on Computer Architecture (ISCA), June 2019. 2. * \Distributed Joins and Data Placement for Minimal Network Traffic," O. Polychroniou, W. Zhang, K. A. Ross, in ACM Transactions on Database Systems, 43(3), pages 14:1{14:45, November 2018.
    [Show full text]