Rqs: Ranking Queries for Search-As-You- Type In

Total Page:16

File Type:pdf, Size:1020Kb

Rqs: Ranking Queries for Search-As-You- Type In RQS: RANKING QUERIES FOR SEARCH-AS-YOU- TYPE IN DATABASES S.Srinivasa Rao1, V Sai Spandana2 1Pursuing M.Tech (CSE), 2Working as Assistant Professor (CSE), Nalanda Institute of Technology (NIT), Kantepudi(V), Sattenpalli(M),Guntur (India) ABSTRACT Search as you type computes the answers as a user types in a keyword query or the textbox character by character. Here we will study that how to support the data as the user types i.e., supporting search as you type on data residing in a relational Database. Here in this project we mainly focus on how to support this type of what the user is typing using the database language that is we mainly focus on the SQL. The main theme is how to leverage existing database functionalities to meet the high accuracy to achieve a minimum speed. Here we are going to study how to use the auxiliary indexes that is stored as tables to increase the search performance. We present a better solution for entering the single keyword query or the multi keyword query, and develop a technique for fuzzy search as the user enters the keyword incorrectly or some mismatch keyword. We present a technique to answer first N queries and discuss how to support the data updates efficiently. In real time it experiments an shows that our technique is enable DBMS systems on a machine to support the ranking queries for search as you type in the SQ databases. I. INTRODUCTION Numerous data frameworks these days enhance client hunt encounters by giving moment criticism as clients define hunt questions. Most internet searchers and online pursuit structures bolster[1] auto completion, which appears proposed questions or even replies "on the fly" as a client sorts in a watchword question character by character. For occurrence, consider the Web seek interface at Netflix, which permits a client to hunt down film data. In the event that a client sorts in a fractional inquiry "distraught,[2]" the framework demonstrates films with a title coordinating this catchphrase as a prefix, for example, "Madagascar" also, "Crazy people: Season 1." The moment criticism helps the client in figuring the question, as well as in comprehension the fundamental information. This kind of inquiry is for the most part called seek as-you-write or sort ahead hunt. Since numerous inquiry frameworks store their data in a backend social DBMS, an inquiry emerges normally: how to bolster seek as-you-write on the information living in a DBMS? A few databases, for example, Oracle and SQL server as of now bolster prefix hunt, and we could utilize this component to do seek as-you-write. Be that as it may, not all databases give this highlight. Thus, we concentrate new routines that can be utilized as a part of all databases. One methodology is to build up a different application layer on the database to build lists, and actualize calculations for noting inquiries. While the methodology has the benefit of accomplishing a superior, its primary downside is copying information and lists, bringing about extra equipment costs. Another methodology is to utilize database extenders, for example, DB2 Extenders, Informix DataBlades[3], Microsoft SQL Server Common Language Runtime (CLR) joining, and Oracle Cartridges, which permit engineers to 68 | P a g e execute new functionalities to a DBMS. This methodology is not practical for databases that don't give such an extender interface, for example, MySQL. Since it necessities to use exclusive interfaces gave by database merchants, an answer for one database may not be convenient to others. Furthermore, an extender-based arrangement, particularly those executed in C/C++, could bring about genuine dependability what's more, security issues to database motors. In this paper we concentrate how to bolster seek as-you-write on DBMS frameworks utilizing the local inquiry dialect (SQL). In different words, we need to utilize SQL to discover responses to a pursuit inquiry as a client sorts in catchphrases character by character.We will likely use the implicit question motor of the database framework however much as could be expected. Along these lines, we can diminish the programming endeavors to bolster seek as-you-write. In expansion, the arrangement created on one database utilizing standard SQL procedures is compact to different databases which bolster the same standard. Comparative perception are likewise made by Gravano et al[4]. and Jestes et al. which use SQL to bolster similitude join in databases. A principle inquiry while receiving this appealing thought is: Is it plausible and adaptable? Specifically, can SQL meet the high performance necessity to actualize an intuitive inquiry interface? Studies have demonstrated that such an interface requires every inquiry be replied inside of 100 milliseconds. DBMS frameworks are not uniquely intended for catchphrase questions, making it additionally difficult to bolster seek as-you-write. As we will see later in this paper, some vital functionality to bolster look as-you-write requires join operations, which could be fairly costly to execute by the query engine. Let T be a relational table with attributes A1,A2…..AT. Let R={r1,r2,r3….r n},be the collection of records in T, and ri [Aj ]denote the content of record ri; ...;Ain attribute Aj. Let W be the set of tokenized keywords in R. II. RELATED WORK 2.1 Search as You Type for Single Keyword Queries Exact Search: As a client sorts in a single fragmentary (prefix) watchword w character by character, look for as- you-compose on the-fly finds the records that contain catchphrases with a prefix w. We call this chase perspective[5] prefix look. Without loss of exhaustive proclamation, each tokenized catchphrase in the data set and request is acknowledged to use lower case characters. For example, consider the data in Table 1, A1 =title, A2 = authors, A3 = booktitle, and A4 = year. R = fr1; . ; r10g. r3[booktitle] = „„motivation‟‟. W ={privacy; motivation; motive; . .}: If a user types in a query “mot,” we return records r3, r6, andr9. In particular, r3 contains a keyword “motivation” with a prefix “mot.” 2.2 Fuzzy Search As a client sorts in a lone midway watchword w character by character, cushy interest on-the-fly finds records with catchphrases like the inquiry watchword. In Table 1, tolerating a customer sorts in an inquiry "corel," record r7 is a related answer since it contains a catchphrase "association" with a prefix "correl" like the request watchword "corel." 2.3 Search as You Type for Multi Keyword Queries Exact Search:Given a multi watchword question Q with m catchphrases w1; w2; . ; wm, as the customer is completing the last watchword wm, we view wm as a partial catchphrase and distinctive watchwords as 69 | P a g e complete keywords.2 As a customer sorts being referred to Q character by character, look as-you-compose on- the-fly finds the records that contain the complete watchwords and a watchword with a prefix wm. For example, if a customer sorts in a request "privacy not," look as-you-compose returns records r3, r6, and r9. In particular, r3 contains the complete catchphrase "security" and a watchword " inspiration " with a prefix "adage." Fuzzy Search: Fuzzy request on-the-fly discovers the records that contain watchwords like the complete catchphrases and a watchword with a prefix like midway catchphrase wm. For example, expect change partition limit T= 1. Tolerating a customer sorts in a request "protection corel," feathery sort ahead chase returns record r7 since it contains a catchphrase "security" like the complete watchword "privicy" and a catchphrase "relationship" with a prefix "correl" like the midway catchphrase "corel." 2.4 Exact Search for Single Keyword 2.4.1 Non Index Methods An immediate way to deal with support look for as-you-compose is to issue a SQL request that ranges each record and checks whether the record is a reaction to the inquiry. There are two ways to deal with do the checking: 1) Calling User-Defined Functions (UDFs)[6]. We can add limits into databases to check whether a record contains the request watchword; and 2) Using the LIKE predicate. Databases give a LIKE predicate to allow customers to perform string planning. We can use the LIKE predicate to check whether a record contains the inquiry catchphrase. This strategy may exhibit false positives, e.g., watchword "dissemination" contains the inquiry string "ic," yet the watchword does not have the inquiry string "ic" as a prefix. Wecan empty these false positives by calling UDFs. The two no-record systems require no additional space, yet they won't not scale since they need to compass all records in the table (Section 8 gives the results.). SELECT T . * FROM 푃푇 , 퐼푇, 푇 WHERE 푃푇.prefix =”w” AND 푃푇.ukid>퐼푇.kid AND 푃푇.lkid <, 퐼푇.kid AND , 퐼푇.rid= T . rid. For example, assuming a user types in a partial query “sig” on table dblp(Table 1), we issue the following SQL: SELECT dblp . * FROM 푃푑푏푙푝 , 퐼푑푏푙푝 , dplp WHERE 푃푑푏푙푝 .PREFIX =”sig” AND 푃푑푏푙푝 .ukid ≥ 퐼푑푏푙푝 .kid AND 푃푑푏푙푝 .lkid ≤ 퐼푑푏푙푝 .kid AND퐼푑푏푙푝 .rid=dblp.rid 2.5 Index Methods In this segment, we propose to build colleague tables as record structures to support prefix look. A couple of databases, for instance, Oracle and SQL server starting now reinforce prefix request, and we could use this component to prefix look for. Nevertheless, not all databases give this segment. Subsequently, we develop another framework that can be used as a piece of all databases. Besides, our trials in show that our framework performs prefix look for more gainfully. 2.6 Inverted-Index Table Given a table T, we dole out unique ids to the catchphrases in table T, taking after their in successive request demand.
Recommended publications
  • A Comparison of On-Line Computer Science Citation Databases
    A Comparison of On-Line Computer Science Citation Databases Vaclav Petricek1,IngemarJ.Cox1,HuiHan2, Isaac G. Councill3, and C. Lee Giles3 1 University College London, WC1E 6BT, Gower Street, London, United Kingdom [email protected], [email protected] 2 Yahoo! Inc., 701 First Avenue, Sunnyvale, CA, 94089 [email protected] 3 The School of Information Sciences and Technology, The Pennsylvania State University, University Park, PA 16802, USA [email protected] [email protected] Abstract. This paper examines the difference and similarities between the two on-line computer science citation databases DBLP and CiteSeer. The database entries in DBLP are inserted manually while the CiteSeer entries are obtained autonomously via a crawl of the Web and automatic processing of user submissions. CiteSeer’s autonomous citation database can be considered a form of self-selected on-line survey. It is important to understand the limitations of such databases, particularly when citation information is used to assess the performance of authors, institutions and funding bodies. We show that the CiteSeer database contains considerably fewer single author papers. This bias can be modeled by an exponential process with intuitive explanation. The model permits us to predict that the DBLP database covers approximately 24% of the entire literature of Computer Science. CiteSeer is also biased against low-cited papers. Despite their difference, both databases exhibit similar and signif- icantly different citation distributions compared with previous analysis of the Physics community. In both databases, we also observe that the number of authors per paper has been increasing over time. 1 Introduction Several public1 databases of research papers became available due to the advent of the Web [1,22,5,3,2,4,8] These databases collect papers in different scientific disciplines, index them and annotate them with additional metadata.
    [Show full text]
  • Linking Educational Resources on Data Science
    The Thirty-First AAAI Conference on Innovative Applications of Artificial Intelligence (IAAI-19) Linking Educational Resources on Data Science Jose´ Luis Ambite,1 Jonathan Gordon,2∗ Lily Fierro,1 Gully Burns,1 Joel Mathew1 1USC Information Sciences Institute, 4676 Admiralty Way, Marina del Rey, California 90292, USA 2Department of Computer Science, Vassar College, 124 Raymond Ave, Poughkeepsie, New York 12604, USA [email protected], [email protected], flfierro, burns, [email protected] Abstract linked data (Heath and Bizer 2011) with references to well- known entities in DBpedia (Auer et al. 2007), DBLP (Ley The availability of massive datasets in genetics, neuroimag- 2002), and ORCID (orcid.org). Here, we detail the meth- ing, mobile health, and other subfields of biology and ods applied to specific problems in building ERuDIte. First, medicine promises new insights but also poses significant we provide an overview of our project. In later sections, we challenges. To realize the potential of big data in biomedicine, the National Institutes of Health launched the Big Data to describe our resource collection effort, the resource schema, Knowledge (BD2K) initiative, funding several centers of ex- the topic ontology, our linkage approach, and the linked data cellence in biomedical data analysis and a Training Coordi- resource provided. nating Center (TCC) tasked with facilitating online and in- The National Institutes of Health (NIH) launched the Big person training of biomedical researchers in data science. A Data to Knowledge (BD2K) initiative to fulfill the promise major initiative of the BD2K TCC is to automatically identify, of biomedical “big data” (Ohno-Machado 2014). The NIH describe, and organize data science training resources avail- BD2K program funded 15 major centers to investigate how able on the Web and provide personalized training paths for data science can benefit diverse fields of biomedical research users.
    [Show full text]
  • Analysis of Computer Science Communities Based on DBLP
    Analysis of Computer Science Communities Based on DBLP Maria Biryukov1 and Cailing Dong1,2 1 Faculty of Science, Technology and Communications, MINE group, University of Luxembourg, Luxembourg [email protected] 2 School of Computer Science and Technology, Shandong University, Jinan, 250101, China cailing [email protected] Abstract. It is popular nowadays to bring techniques from bibliometrics and scientometrics into the world of digital libraries to explore mecha- nisms which underlie community development. In this paper we use the DBLP data to investigate the author’s scientific career, and analyze some of the computer science communities. We compare them in terms of pro- ductivity and population stability, and use these features to compare the sets of top-ranked conferences with their lower ranked counterparts.1 Keywords: bibliographic databases, author profiling, scientific commu- nities, bibliometrics. 1 Introduction Computer science is a broad and constantly growing field. It comprises various subareas each of which has its own specialization and characteristic features. While different in size and granularity, research areas and conferences can be thought of as scientific communities that bring together specialists sharing simi- lar interests. What is specific about conferences is that in addition to scope, par- ticipating scientists and regularity, they are also characterized by level. In each area there is a certain number of commonly agreed upon top ranked venues, and many others – with the lower rank or unranked. In this work we aim at finding out how the communities represented by different research fields and conferences are evolving and communicating to each other. To answer this question we sur- vey the development of the author career, compare various research areas to each other, and finally, try to identify features that would allow to distinguish between venues of different rank.
    [Show full text]
  • How Do Computer Scientists Use Google Scholar?: a Survey of User Interest in Elements on Serps and Author Profile Pages
    BIR 2019 Workshop on Bibliometric-enhanced Information Retrieval How do Computer Scientists Use Google Scholar?: A Survey of User Interest in Elements on SERPs and Author Profile Pages Jaewon Kim, Johanne R. Trippas, Mark Sanderson, Zhifeng Bao, and W. Bruce Croft RMIT University, Melbourne, Australia fjaewon.kim, johanne.trippas, mark.sanderson, zhifeng.bao, [email protected] Abstract. In this paper, we explore user interest in elements on Google Scholar's search engine result pages (SERPs) and author profile pages (APPs) through a survey in order to predict search behavior. We inves- tigate the effects of different query intents (keyword and title) on SERPs and research area familiarities (familiar and less-familiar) on SERPs and APPs. Our findings show that user interest is affected by the re- spondents' research area familiarity, whereas there is no effect due to the different query intents, and that users tend to distribute different levels of interest to elements on SERPs and APPs. We believe that this study provides the basis for understanding search behavior in using Google Scholar. Keywords: Academic search · Google Scholar · Survey study 1 Introduction Academic search engines (ASEs) such as Google Scholar1 and Microsoft Aca- demic2, which are specialized for searching scholarly literature, are now com- monly used to find relevant papers. Among the ASEs, Google Scholar covers about 80-90% of the publications in the world [7] and also provides individual profiles with three types of citation namely, total citation number, h-index, and h10-index. Although the Google Scholar website3 states that the ranking is de- cided by authors, publishers, numbers and times of citation, and even considers the full text of each document, there is a phenomenon called the Google Scholar effect [11].
    [Show full text]
  • PP-DBLP: Modeling and Generating Public-Private Attributed Networks with DBLP
    PP-DBLP: Modeling and Generating Public-Private Attributed Networks with DBLP Xin Huang∗, Jiaxin Jiang∗, Byron Choi∗, Jianliang Xu∗, Zhiwei Zhang∗, Yunya Song# ∗Department of Computer Science, Hong Kong Baptist University #Department of Journalism, Hong Kong Baptist University ∗fxinhuang, jxjian, bchoi, xujl, [email protected], #[email protected] Abstract—In many online social networks (e.g., Facebook, graph is visible only to the corresponding user. From each Google+, Twitter, and Instagram), users prefer to hide her/his users viewpoint, the social network is exactly the union of the partial or all relationships, which makes such private relation- public graph and her/his own private graph. Several sketching ships not visible to public users or even friends. This leads to a new graph model called public-private networks, where each and sampling approaches [2] have been proposed to address user has her/his own perspective of the network including the essential problems of graph processing, such as the size of private connections. Recently, public-private network analysis has reachability tree [5], all-pair shortest paths [6], pairwise node attracted significant research interest in the literature. A great similarities [7], correlation clustering [8] and so on. deal of important graph computing problems (e.g., shortest paths, centrality, PageRank, and reachability tree) has been studied. In social networks, vertices usually contain attributes, e.g., However, due to the limited data sources and privacy concerns, a person has information including name, interests, skills, proposed approaches are not tested on real-world datasets, but and so on. Recent study [2] focuses on one essential aspect on synthetic datasets by randomly selecting vertices as private of topological structure of public-private graphs.
    [Show full text]
  • The Best Nurturers in Computer Science Research
    The Best Nurturers in Computer Science Research Bharath Kumar M. Y. N. Srikant IISc-CSA-TR-2004-10 http://archive.csa.iisc.ernet.in/TR/2004/10/ Computer Science and Automation Indian Institute of Science, India October 2004 The Best Nurturers in Computer Science Research Bharath Kumar M.∗ Y. N. Srikant† Abstract The paper presents a heuristic for mining nurturers in temporally organized collaboration networks: people who facilitate the growth and success of the young ones. Specifically, this heuristic is applied to the computer science bibliographic data to find the best nurturers in computer science research. The measure of success is parameterized, and the paper demonstrates experiments and results with publication count and citations as success metrics. Rather than just the nurturer’s success, the heuristic captures the influence he has had in the indepen- dent success of the relatively young in the network. These results can hence be a useful resource to graduate students and post-doctoral can- didates. The heuristic is extended to accurately yield ranked nurturers inside a particular time period. Interestingly, there is a recognizable deviation between the rankings of the most successful researchers and the best nurturers, which although is obvious from a social perspective has not been statistically demonstrated. Keywords: Social Network Analysis, Bibliometrics, Temporal Data Mining. 1 Introduction Consider a student Arjun, who has finished his under-graduate degree in Computer Science, and is seeking a PhD degree followed by a successful career in Computer Science research. How does he choose his research advisor? He has the following options with him: 1. Look up the rankings of various universities [1], and apply to any “rea- sonably good” professor in any of the top universities.
    [Show full text]
  • Author Template for Journal Articles
    Mining citation information from CiteSeer data Dalibor Fiala University of West Bohemia, Univerzitní 8, 30614 Plzeň, Czech Republic Phone: 00420 377 63 24 29, fax: 00420 377 63 24 01, email: [email protected] Abstract: The CiteSeer digital library is a useful source of bibliographic information. It allows for retrieving citations, co-authorships, addresses, and affiliations of authors and publications. In spite of this, it has been relatively rarely used for automated citation analyses. This article describes our findings after extensively mining from the CiteSeer data. We explored citations between authors and determined rankings of influential scientists using various evaluation methods including cita- tion and in-degree counts, HITS, PageRank, and its variations based on both the citation and colla- boration graphs. We compare the resulting rankings with lists of computer science award winners and find out that award recipients are almost always ranked high. We conclude that CiteSeer is a valuable, yet not fully appreciated, repository of citation data and is appropriate for testing novel bibliometric methods. Keywords: CiteSeer, citation analysis, rankings, evaluation. Introduction Data from CiteSeer have been surprisingly little explored in the scientometric lite- rature. One of the reasons for this may have been fears that the data gathered in an automated way from the Web are inaccurate – incomplete, erroneous, ambiguous, redundant, or simply wrong. Also, the uncontrolled and decentralized nature of the Web is said to simplify manipulating and biasing Web-based publication and citation metrics. However, there have been a few attempts at processing the Cite- Seer data which we will briefly mention. Zhou et al.
    [Show full text]
  • The Claremont Report on Database Research
    The Claremont Report on Database Research Rakesh Agrawal, Anastasia Ailamaki, Philip A. Bernstein, Eric A. Brewer, Michael J. Carey, Surajit Chaudhuri, AnHai Doan, Daniela Florescu, Michael J. Franklin, Hector Garcia‐ Molina, Johannes Gehrke, Le Gruenwald, Laura M. Haas, Alon Y. Halevy, Joseph M. Hellerstein, Yannis E. Ioannidis, Hank F. Korth, Donald Kossmann, Samuel Madden, Roger Magoulas, Beng Chin Ooi, Tim O’Reilly, Raghu Ramakrishnan, Sunita Sarawagi, Michael Stonebraker, Alexander S. Szalay, Gerhard Weikum Abstract In late May, 2008, a group of database researchers, architects, users and pundits met at the Claremont Resort in Berkeley, California to discuss the state of the research field and its impacts on practice. This was the seventh meeting of this sort in twenty years, and was distinguished by a broad consensus that we are at a turning point in the history of the field, due both to an explosion of data and usage scenarios, and to major shifts in computing hardware and platforms. Given these forces, we are at a time of opportunity for research impact, with an unusually large potential for influential results across computing, the sciences and society. This report details that discussion, and highlights the group’s consensus view of new focus areas, including new database engine architectures, declarative programming languages, the interplay of structured and unstructured data, cloud data services, and mobile and virtual worlds. We also report on discussions of the community’s growth, including suggestions for changes in community processes to move the research agenda forward, and to enhance impact on a broader audience. 1. A Turning Point in Database Research Over the last twenty years, small groups of database researchers have periodically gathered to assess the state of the field and propose directions for future research [BDD+89, SSU91, ASU95, AZ+96, BBC+98, AAB03].
    [Show full text]
  • Database Architecture Evolution: Mammals Flourished Long Before Dinosaurs Became Extinct
    Database Architecture Evolution: Mammals Flourished long before Dinosaurs became Extinct Stefan Manegold Martin L. Kersten Peter Boncz CWI, Amsterdam, The Netherlands {manegold,mk,boncz}@cwi.nl ABSTRACT building block for realizing more complex structures (such as ta- The holy grail for database architecture research is to find a solution bles, but also objects and even trees or graphs). that is Scalable & Speedy, to run on anything from small ARM In this paper, we summarize the quest to renovate database ar- processors up to globally distributed compute clusters, Stable & chitecture with the goal of addressing data management needs with Secure, to service a broad user community, Small & Simple, to be a common core rather than bolting on new subsystems [38], and comprehensible to a small team of programmers, Self-managing, to pushing through the functionality plateau, especially using self- let it run out-of-the-box without hassle. adaptation, exploiting resource optimization opportunities as they In this paper, we provide a trip report on this quest, covering appear dynamically, rather than by static prediction. This quest both past experiences, ongoing research on hardware-conscious al- covers the major design space choices, experiences gained and on- gorithms, and novel ways towards self-management specifically fo- going research around the MonetDB system. cused on column store solutions. In all, this story demonstrates the viability of new self-adaptive life-forms that may be very successful yet are very different from the long existing species. In this, the MonetDB strain forms an 1. INTRODUCTION open platform available for all who want to explore further evolu- Forty years of relational database technology seemingly con- tion in the data management ecosystem.
    [Show full text]
  • Decrease in Free Computer Science Papers Found Through Google Scholar Lee A
    Virginia Commonwealth University VCU Scholars Compass VCU Libraries Faculty and Staff ubP lications VCU Libraries 2014 Decrease in Free Computer Science Papers Found through Google Scholar Lee A. Pedersen Brown University, [email protected] Julie Arendt Virginia Commonwealth University, [email protected] Follow this and additional works at: http://scholarscompass.vcu.edu/libraries_pubs Part of the Databases and Information Systems Commons, and the Library and Information Science Commons Copyright Emerald Group Publishing Recommended Citation Lee A. Pedersen, Julie Arendt, (2014) "Decrease in free computer science papers found through Google Scholar", Online Information Review, Vol. 38 Iss: 3, pp.348 - 361. This Article is brought to you for free and open access by the VCU Libraries at VCU Scholars Compass. It has been accepted for inclusion in VCU Libraries Faculty and Staff ubP lications by an authorized administrator of VCU Scholars Compass. For more information, please contact [email protected]. Decrease in free computer science papers found through Google Scholar Abstract: Purpose - Google Scholar was used to locate free full-text versions of computer science research papers to determine what proportion could be freely accessed. Design/methodology/approach - A sample of 1967 conference papers and periodical articles from 2003-2010, indexed in the ACM Guide to Computing Literature, was searched for manually in Google Scholar, using the paper or article title and the first author’s surname and supplementary searches as needed. Findings - Free full-text versions were found for 52% of the conference papers and 55% of the periodical articles. Documents with older publication dates were more likely to be freely accessible than newer documents, with free full-text versions found for 71% of items published in 2003 and 43% of items published 2010.
    [Show full text]
  • DBLP — Some Lessons Learned ∗
    DBLP — Some Lessons Learned ∗ Michael Ley Universitat¨ Trier, Informatik D–54286 Trier Germany [email protected] ABSTRACT Bibliometric studies are a third group of publications. They The DBLP Computer Science Bibliography evolved from an make use of the full semantics of the data. The main disad- early small experimental Web server to a popular service vantages of DBLP for this purpose are the lacking citation for the computer science community. Many design decisions information and the varying coverage for different subfields and details of the public XML-records behind DBLP never of computer science[3]. The main advantages are the free were documented. This paper is a review of the evolution availability and the inclusion of many conference proceed- of DBLP. The main perspective is data modeling. In DBLP ings which play an essential role for many branches of CS persons play a central role, our discussion of person names and are poorly covered by other bibliographic data bases. A may be applicable to many other data bases. All DBLP fourth group of papers deals with person name disambigua- data are available for your own experiments. You may either tion, a special aspect of data quality. download the complete set, or use a simple XML-based API DBLP is a (very imperfect) \authority file”[1] for com- described in an online appendix. puter science researchers. We try to identify the persons be- hind the research papers and to treat synonyms and homo- 1. INTRODUCTION nyms as precise as possible. Incomplete and inconsistent in- formation, imperfect software, lack of time, and our own in- In June 2009 the DBLP Computer Science Bibliography1 ability are the limiting constraints for this task.
    [Show full text]
  • New Dblp URL Scheme and API Updates | 1
    New dblp URL scheme and API updates | 1 A big change has just been made to the dblp website … and, in case we did our job right, you may even haven’t noticed yet: With the latest update, we introduced major changes to the dblp URL scheme. In particular, this applies to the URLs of all author bibliographies listed on dblp, which are now served under a new and persistent URL. But don’t worry, just like the first time we made such a change about eight years ago, we try to keep all previously existing URLs as a redirect for the foreseeable future. In this post, we talk about the reasons that made us abandon our old URL scheme and why you will most likely want to update your hyperlinks and bookmarks anyway. So, what’s the problem with the old URL scheme? Author bibliographies have been an integral part of dblp since its inception in 1993. Since all author name strings are unique in dblp (even if we often have to enforce uniqueness by adding a magic name disambiguation suffix number) it was quite natural to just use an author’s name in order to locate her bibliography. For example, Jim Gray is listed in dblp under the unique name “Jim Gray 0001”. Hence, a simple rewriting rule derived the following URL for his bibliography webpage: https://dblp.org/pers/g/Gray_0001:Jim.html This approach has served us well through the years. But recently, the need for persistent unique identifiers has become more and more evident. And those bibliography URLs tied to an author’s name have two major problems: They are neither unique, nor persistent.
    [Show full text]