Edinburgh Research Explorer

Can Common Crawl Reliably Track Persistent Identifier (PID) Use Over Time

Citation for published version: Thompson, H & Tong, J 2018, Can Common Crawl Reliably Track Persistent Identifier (PID) Use Over Time. in Companion Proceedings of the The Web Conference 2018. WWW '18, International World Wide Web Conferences Steering Committee, Lyon, France, pp. 1749-1755, The Web Conference 2018, Lyon, France, 23-27 April. DOI: 10.1145/3184558.3191636

Digital Object Identifier (DOI): 10.1145/3184558.3191636
Document Version: Publisher's PDF, also known as Version of record
Published In: Companion Proceedings of the The Web Conference 2018

Track: 8th Temporal Web Analytics Workshop, WWW 2018, April 23-27, 2018, Lyon, France

Can Common Crawl Reliably Track Persistent Identifier (PID) Use Over Time?

Henry S. Thompson, University of Edinburgh, Edinburgh, United Kingdom
TONG Jian∗, University of Edinburgh, Edinburgh, United Kingdom

∗SURNAME forename

ABSTRACT

We report here on the results of two studies using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017, whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focusses on the tooling needed for dealing with CC data, and the problems we found with it. The first study is based on over 10^12 URIs from over 5x10^9 pages crawled in April 2014 and April 2017; the second study adds a further 3x10^9 pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.

KEYWORDS

temporal web analytics, persistent identifier, Common Crawl, Uniform Resource Identifier, longitudinal web crawl analysis, Digital Object Identifier

ACM Reference Format:
Henry S. Thompson and TONG Jian. 2018. Can Common Crawl Reliably Track Persistent Identifier (PID) Use Over Time?. In WWW '18 Companion: The 2018 Web Conference Companion, April 23–27, 2018, Lyon, France. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3184558.3191636

This paper is published under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '18 Companion, April 23–27, 2018, Lyon, France
© 2018 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC BY 4.0 License. ACM ISBN 978-1-4503-5640-4/18/04.
https://doi.org/10.1145/3184558.3191636

1 INTRODUCTION

The history of efforts to meet the demand for so-called 'persistent identifiers' (PIDs) for use on the Web is complicated, with many alternative offerings and much debate about the meaning of persistence and how to go about ensuring it. We take no position in that debate here, beyond the observation that the demand for PIDs shows no signs of abating, and that there has been a more-or-less general acknowledgement over the last 5–10 years that to be successful in the context of the Web a PID scheme must define and support a mapping from PIDs in the scheme to 'actionable' identifiers. In practice this has meant specifying a purely syntactic procedure for converting a PID into an http(s): URI using a domain owned and operated by the proprietors of the scheme. An HTTP request for such 'actionable' URIs will typically result in a redirection to the then-current location of the identified resource.

The Digital Object Identifier scheme [11], managed by the International DOI Foundation (IDF) [16], was an early adopter of this approach, and DOIs are now in widespread use, particularly in scientific journals, where their use is actually mandated by a number of major publishers. The mapping for DOIs to actionable https: URIs is simple: for example, a DOI for a journal article written in the form of a URI such as doi:...¹ is mapped (client-side) to https://doi.org/... In response to an HTTP request for that URI, the server at doi.org (operated on behalf of IDF by the Corporation for National Research Initiatives (CNRI) [5]) will respond with a redirect to the appropriate http(s): URI from the actual publisher of the article. We call the three forms involved the 'original' (e.g. doi: or info:hdl), the 'actionable' (e.g. https://doi.org/... and variants thereof or http://hdl.handle.net/...) and the 'locating'. Note that none of these is strictly speaking a PID as such: that's what comes after the doi: or http://hdl.handle.net/.

The success of this approach has overcome a significant barrier to the adoption of PIDs in general: to date there has been no significant move towards support for any of them as URIs in web browsers or PDF viewers. That is, if you try to use doi://10.1000/182 or info:hdl/20.1000/100 as a link (for example, as the value of the href attribute of an HTML A element), it will not work. But you can use them as the link text of an A element, and put the actionable form (https://doi.org/10.1000/182 and http://hdl.handle.net/20.1000/100 respectively) in the href attribute, and that will work just fine.

That's the good news. The less good news is that the use of redirection from the actionable form to the locating form means that when someone follows a link such as those in the previous paragraph, it's the locating form that appears in the address bar of their browser, and is thus the form they may well copy and paste into an email to a colleague or their own reading list. But this undermines the fundamental value proposition of the original ('persistent') form: that it is not vulnerable to all the things that cause http: URIs to fail over time.

Our goal in the work reported here was to quantify the growth over time in actual usage of the three forms, to see not only how good the good news was, but also whether there was cause to worry about the less good news: are locating forms 'leaking' into public use? For concrete evidence we used the Common Crawl sample of HTML pages on the Web [3], the only large-scale public source of evidence readily available to us. This turned out to be challenging in a number of respects, to the extent that although our results are interesting, problems with the CC data mean that they may not accurately reflect the actual situation. In what follows we will first describe the work as such, and then discuss the ways in which the CC data fell short of what we think is required for reliable analysis.

Note on terminology Although most of the PIDs in various forms (original, actionable or locating) found during our studies were DOIs, we will be careful hereafter to use 'PID' when we mean anything recognised as a form of persistent identifier, and 'DOI' for the subset thereof which are some form of DOI.

¹ doi: is not (yet) a registered URI scheme, but is often used as if it were one.

2 PRIOR WORK AND OTHER SOURCES OF INFORMATION

An excellent overview of the space of PIDs and arguments for their use, only slightly dated, can be found in [21]. The IDF's views on the need for PIDs and their goals for DOIs are described in [1].

The IDF occasionally update their "Key Facts" page [12], which currently says that
• [DOIs are] Currently used by well over 5,000 assigners, e.g., publishers, science data centres, movie studios, etc.
• Approximately 148 million DOI names assigned to date
• Over 5 billion DOI resolutions per year

The leading issuer of DOIs for publications is CrossRef [7], who publish regularly-updated statistics about membership numbers, DOIs registered, etc. [8]

Table 1: Crawl size for first study [24]

Crawl month    URIs crawled     Pages retrieved    Dup URI %age
2014-04        1,718,646,762    2,641,371,316      34.9%
2017-04        2,907,715,349    2,942,930,482      1.2%

Table 2: Duplicate page estimates for first study [24]

Crawl month    Pages retrieved    Digests          Dup pages %age
2014-04        2,641,371,316      2,250,363,653    14.8%
2017-04        2,907,715,349      2,915,114,582    0.9%

[…] starts with a unique set of URIs and does not follow page-internal links, redirects to URIs in the initial set occur surprisingly often, giving rise to duplication in some cases. The "Dup URI %age" column in Table 1 reports this, as estimated by subtracting the ratio of the URI to Page columns from 1.
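
As a minimal sketch of the estimate just described, the duplicate URI percentage can be recomputed from the URI and page counts given in Table 1. The function name `duplicate_rate` and the `TABLE_1` dictionary are illustrative names only, not part of the authors' tooling, and the code assumes nothing beyond the "one minus the ratio" formula stated above.

```python
def duplicate_rate(unique_count: int, pages_retrieved: int) -> float:
    """Estimate the duplicate fraction as 1 - (unique items / pages retrieved)."""
    if pages_retrieved <= 0:
        raise ValueError("pages_retrieved must be positive")
    return 1.0 - unique_count / pages_retrieved


# (URIs crawled, pages retrieved) per crawl month, copied from Table 1.
TABLE_1 = {
    "2014-04": (1_718_646_762, 2_641_371_316),
    "2017-04": (2_907_715_349, 2_942_930_482),
}

for month, (uris, pages) in TABLE_1.items():
    # Format as a percentage with one decimal place, as in the table.
    print(f"{month}: {duplicate_rate(uris, pages):.1%}")
```

Run on the Table 1 figures, this gives values that round to the 34.9% and 1.2% reported in the "Dup URI %age" column.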