Geographic Information Retrieval: Classification, Disambiguation

Total Page:16

File Type:pdf, Size:1020Kb

Geographic Information Retrieval: Classification, Disambiguation Imperial College London Department of Computing Geographic Information Retrieval: Classification, Disambiguation and Modelling Simon E Overell Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of the University of London and the Diploma of Imperial College, July 2009 1 2 Abstract This thesis aims to augment the Geographic Information Retrieval process with information extracted from world knowledge. This aim is approached from three directions: classifying world knowledge, disambiguating placenames and modelling users. Geographic information is becoming ubiquitous across the Internet, with a significant proportion of web documents and web searches containing geographic entities, and the proliferation of Internet enabled mobile devices. Traditional information retrieval treats these geographic entities in the same way as any other textual data. In this thesis I augment the retrieval process with geographic information, and show how methods built upon world knowledge outperform methods based on heuristic rules. The source of world knowledge used in this thesis is Wikipedia. Wikipedia has become a phenomenon of the Internet age and needs little introduction. As a linked corpus of semi-structured data, it is unsurpassed. Two approaches to mining information from Wikipedia are rigorously explored: initially I classify Wikipedia articles into broad categories; this is followed by much finer classification where Wikipedia articles are disambiguated as specific locations. The thesis concludes with the proposal of the Steinberg hypothesis: By analysing a range of wikipedias in different languages I demonstrate that a localised view of the world is ubiquitous and inherently part of human nature. All people perceive closer places as larger and more important than distant ones. The core contributions of this thesis are in the areas of extracting information from Wikipedia, supervised placename disambiguation, and providing a quantitative model for how people view the world. The findings clearly have a direct impact for applications such as geographically aware search engines, but in a broader context documents can be automatically annotated with machine readable meta-data and dialogue enhanced with a model of how people view the world. This will reduce ambiguity and confusion in dialogue between people or computers. 3 4 Acknowledgements In the same way that far more is produced during a PhD than a thesis, it is a journey taken by many more people than just the author. I would like to take this opportunity to thank the people that have accompanied me on this journey. Firstly I would like to thank my supervisor, Stefan R¨uger, for his help, advice and support over • the past three years. My examiners, Mark Sanderson and Julie McCann, for agreeing to scrutinise my work. • EPSRC for funding my work and helping me remain a student for three more years. • My colleagues in the Multimedia and Information Systems group, with whom I shared many a cup • of tea and lazy afternoon: Jo˜ao, Alexei, Peter, Paul, Ed and Daniel from Imperial College London, and Ainhoa, Haiming, Jianhan, Adam and Rui from the Open University. Roelof van Zwol who gave me the opportunity to spend a summer in Barcelona where I failed • spectacularly to learn any Spanish. I would also like to thank B¨okur, Georgina, Llu´ıs and my colleagues at Yahoo! Research Barcelona, • and my flat mates Claudia and Gleb (who, to my knowledge, also failed to learn Spanish). As well as Barcelona, I have had the opportunity to travel both nationally and internationally as • part of my PhD. I would like to thank the people who have made this possible: Ali Azimi Bolourian for hosting my visit to Glasgow, Cathal Gurrin for hosting my visit to Dublin, Jo˜ao Magalh˜aes and Jos´eIria for hosting my visit to Sheffield, MMKM for partially funding my trips to Seattle and Alicante, and Google for partially funding my return visit to Barcelona. A chance meeting with Leif Azzopardi at ECIR led me to joining the BCS-IRSG committee. In • turn, this led me to a series of fascinating talks and introduced me to many members of the IR community. I would like to thank Leif, Ali, Andy and the rest of the BCS-IRSG committee. For the past three years I have lived in a cozy corner of room 433 of Imperial College London’s • Huxley building. A series of people have shared this room with me and joined in often quite heated debates. This includes Uri, Sergio, Georgia, Alok, Mohammed, Mark, Adam, Jonathan, Clemens, Daniel and Nick. Before terms such as Credit Crunch, Down Turn and Recession were common place, I was ap- • proached by Evgeny Shadchnev to found a company. I learnt a huge amount during the next six months and I would like to thank Evgeny for the much needed distraction from my thesis. Unfor- tunately by the time we were approaching investors, beta-release and business-plan in hand, the credit bubble had burst and the venture-capital ships had sailed. 5 PhD, or in full Philosophiæ Doctor, litteral translation is “teacher of philosophy”. This brings me • to the next great distraction from my thesis: teaching the third-year Robotics course. I would like to thank Keith and Andy, who gave me the opportunity to be paid to teach undergraduates how to play with Lego for three months a year. If a PhD is a journey, the person who gave me a map was George Mallen. When I worked at SSL • he showed me what industrial research was and introduced me to Information Retrieval. I would like to thank the many friends that have surrounded me for the past three years. They • have bought me drinks and listened to tales of placename disambiguation. They include Dave, Dom, Steve, Tim, Lewis, Kurt, Alec, Sacha, Mark, Olly, Dan, Ash and far too many more to list. During the final three months of my write up, I developed crippling back pain. I would like to • thank Stewart and Melanie for helping me walk tall and sit up straight. Without their help I would not have finished this thesis. My parents, John and Yvonne, have always supported me to pursue whatever avenue has interested • me. I would like to thank them. Despite wanting to live in a farmhouse in the country, my girlfriend Jancy has shared a flat with • me, slightly too small to be cozy, in Fulham for the past three years. I thank her from the bottom of my heart for her endless love and support. 6 Contents Abstract 3 Acknowledgements 5 Contents 7 1 Introduction 15 1.1 GIR,IRandGIS..................................... 15 1.2 Placenames and locations . 16 1.3 The need for geographic indexing . 16 1.4 The need for placename disambiguation . 17 1.5 GIRsystems ....................................... 18 1.6 Wikipedia......................................... 19 1.7 Scope ............................................. 20 1.8 Contributions...................................... ..... 21 1.9 Publications........................................ 22 1.10Implementations .................................... ..... 23 1.11Organisation ..................................... ...... 24 1.12Roadmap ......................................... 25 2 GIR and Wikipedia 27 2.1 Introduction....................................... ..... 27 2.2 FromIRtoGIR ..................................... 27 2.2.1 Relevance......................................... 27 2.2.2 Phrases and named entities . 28 2.3 GIR ............................................. 29 2.3.1 GIRsystems ..................................... 30 2.3.2 Placename disambiguation . 30 2.3.3 Geographic indexing . 33 2.3.4 Geographic relevance . 35 2.3.5 Combining text and geographic relevance . 37 2.4 Geographic resources . 39 2.5 Wikipedia......................................... 40 2.5.1 Wikipediaarticles ................................. 41 2.5.2 Wikipedians...................................... 41 2.5.3 Accuracy ......................................... 42 7 2.5.4 Mining semantic information . ..... 43 2.6 Discussion........................................ ..... 44 3 Evaluation and metrics 45 3.1 Introduction....................................... ..... 45 3.2 EvaluatingIRsystems ................................ ...... 45 3.2.1 Evaluationforums ................................ 45 3.3 Evaluationmeasures .................................. ..... 47 3.3.1 Binary classification and unordered retrieval . 47 3.3.2 Scored classification and ranked retrieval . 48 3.4 Statisticaltesting.................................. ....... 49 3.4.1 TheSigntest...................................... 50 3.4.2 The Wilcoxon Signed-Rank test . 50 3.4.3 TheFriedmantest ................................... 51 3.4.4 The Student’s t-test . 51 3.4.5 One-tailed vs. two-tailed . 52 3.5 Discussion........................................ ..... 52 4 Classifying Wikipedia articles 55 4.1 Introduction....................................... ..... 55 4.2 Classification and disambiguation of Wikipedia articles . ........... 55 4.3 Classifying Wikipedia articles . ......... 57 4.3.1 Groundtruth...................................... 58 4.3.2 Sparsityofdata................................... 58 4.3.3 Removingnoise.................................... 60 4.3.4 Systemoptimisation ............................... 60 4.4 Evaluation and comparison to existing methods . ....... 62 4.4.1 Experimentalsetup................................. 63 4.4.2 Results ......................................... 64 4.5 Acasestudy:Flickr .................................. ..... 65 4.5.1 Atagclassificationsystem. ..... 67 4.5.2 Coverage ....................................... 69 4.5.3 Summary
Recommended publications
  • Cultural Anthropology Through the Lens of Wikipedia: Historical Leader Networks, Gender Bias, and News-Based Sentiment
    Cultural Anthropology through the Lens of Wikipedia: Historical Leader Networks, Gender Bias, and News-based Sentiment Peter A. Gloor, Joao Marcos, Patrick M. de Boer, Hauke Fuehres, Wei Lo, Keiichi Nemoto [email protected] MIT Center for Collective Intelligence Abstract In this paper we study the differences in historical World View between Western and Eastern cultures, represented through the English, the Chinese, Japanese, and German Wikipedia. In particular, we analyze the historical networks of the World’s leaders since the beginning of written history, comparing them in the different Wikipedias and assessing cultural chauvinism. We also identify the most influential female leaders of all times in the English, German, Spanish, and Portuguese Wikipedia. As an additional lens into the soul of a culture we compare top terms, sentiment, emotionality, and complexity of the English, Portuguese, Spanish, and German Wikinews. 1 Introduction Over the last ten years the Web has become a mirror of the real world (Gloor et al. 2009). More recently, the Web has also begun to influence the real world: Societal events such as the Arab spring and the Chilean student unrest have drawn a large part of their impetus from the Internet and online social networks. In the meantime, Wikipedia has become one of the top ten Web sites1, occasionally beating daily newspapers in the actuality of most recent news. Be it the resignation of German national soccer team captain Philipp Lahm, or the downing of Malaysian Airlines flight 17 in the Ukraine by a guided missile, the corresponding Wikipedia page is updated as soon as the actual event happened (Becker 2012.
    [Show full text]
  • Canton of Basel-Stadt
    Canton of Basel-Stadt Welcome. VARIED CITY OF THE ARTS Basel’s innumerable historical buildings form a picturesque setting for its vibrant cultural scene, which is surprisingly rich for THRIVING BUSINESS LOCATION CENTRE OF EUROPE, TRINATIONAL such a small canton: around 40 museums, AND COSMOPOLITAN some of them world-renowned, such as the Basel is Switzerland’s most dynamic busi- Fondation Beyeler and the Kunstmuseum ness centre. The city built its success on There is a point in Basel, in the Swiss Rhine Basel, the Theater Basel, where opera, the global achievements of its pharmaceut- Ports, where the borders of Switzerland, drama and ballet are performed, as well as ical and chemical companies. Roche, No- France and Germany meet. Basel works 25 smaller theatres, a musical stage, and vartis, Syngenta, Lonza Group, Clariant and closely together with its neighbours Ger- countless galleries and cinemas. The city others have raised Basel’s profile around many and France in the fields of educa- ranks with the European elite in the field of the world. Thanks to the extensive logis- tion, culture, transport and the environment. fine arts, and hosts the world’s leading con- tics know-how that has been established Residents of Basel enjoy the superb recre- temporary art fair, Art Basel. In addition to over the centuries, a number of leading in- ational opportunities in French Alsace as its prominent classical orchestras and over ternational logistics service providers are well as in Germany’s Black Forest. And the 1000 concerts per year, numerous high- also based here. Basel is a successful ex- trinational EuroAirport Basel-Mulhouse- profile events make Basel a veritable city hibition and congress city, profiting from an Freiburg is a key transport hub, linking the of the arts.
    [Show full text]
  • Modeling Popularity and Reliability of Sources in Multilingual Wikipedia
    information Article Modeling Popularity and Reliability of Sources in Multilingual Wikipedia Włodzimierz Lewoniewski * , Krzysztof W˛ecel and Witold Abramowicz Department of Information Systems, Pozna´nUniversity of Economics and Business, 61-875 Pozna´n,Poland; [email protected] (K.W.); [email protected] (W.A.) * Correspondence: [email protected] Received: 31 March 2020; Accepted: 7 May 2020; Published: 13 May 2020 Abstract: One of the most important factors impacting quality of content in Wikipedia is presence of reliable sources. By following references, readers can verify facts or find more details about described topic. A Wikipedia article can be edited independently in any of over 300 languages, even by anonymous users, therefore information about the same topic may be inconsistent. This also applies to use of references in different language versions of a particular article, so the same statement can have different sources. In this paper we analyzed over 40 million articles from the 55 most developed language versions of Wikipedia to extract information about over 200 million references and find the most popular and reliable sources. We presented 10 models for the assessment of the popularity and reliability of the sources based on analysis of meta information about the references in Wikipedia articles, page views and authors of the articles. Using DBpedia and Wikidata we automatically identified the alignment of the sources to a specific domain. Additionally, we analyzed the changes of popularity and reliability in time and identified growth leaders in each of the considered months. The results can be used for quality improvements of the content in different languages versions of Wikipedia.
    [Show full text]
  • Title of Thesis: ABSTRACT CLASSIFYING BIAS
    ABSTRACT Title of Thesis: CLASSIFYING BIAS IN LARGE MULTILINGUAL CORPORA VIA CROWDSOURCING AND TOPIC MODELING Team BIASES: Brianna Caljean, Katherine Calvert, Ashley Chang, Elliot Frank, Rosana Garay Jáuregui, Geoffrey Palo, Ryan Rinker, Gareth Weakly, Nicolette Wolfrey, William Zhang Thesis Directed By: Dr. David Zajic, Ph.D. Our project extends previous algorithmic approaches to finding bias in large text corpora. We used multilingual topic modeling to examine language-specific bias in the English, Spanish, and Russian versions of Wikipedia. In particular, we placed Spanish articles discussing the Cold War on a Russian-English viewpoint spectrum based on similarity in topic distribution. We then crowdsourced human annotations of Spanish Wikipedia articles for comparison to the topic model. Our hypothesis was that human annotators and topic modeling algorithms would provide correlated results for bias. However, that was not the case. Our annotators indicated that humans were more perceptive of sentiment in article text than topic distribution, which suggests that our classifier provides a different perspective on a text’s bias. CLASSIFYING BIAS IN LARGE MULTILINGUAL CORPORA VIA CROWDSOURCING AND TOPIC MODELING by Team BIASES: Brianna Caljean, Katherine Calvert, Ashley Chang, Elliot Frank, Rosana Garay Jáuregui, Geoffrey Palo, Ryan Rinker, Gareth Weakly, Nicolette Wolfrey, William Zhang Thesis submitted in partial fulfillment of the requirements of the Gemstone Honors Program, University of Maryland, 2018 Advisory Committee: Dr. David Zajic, Chair Dr. Brian Butler Dr. Marine Carpuat Dr. Melanie Kill Dr. Philip Resnik Mr. Ed Summers © Copyright by Team BIASES: Brianna Caljean, Katherine Calvert, Ashley Chang, Elliot Frank, Rosana Garay Jáuregui, Geoffrey Palo, Ryan Rinker, Gareth Weakly, Nicolette Wolfrey, William Zhang 2018 Acknowledgements We would like to express our sincerest gratitude to our mentor, Dr.
    [Show full text]
  • IMPRESSIONEN IMPRESSIONS Bmw-Berlin-Marathon.Com
    IMPRESSIONEN IMPRESSIONS bmw-berlin-marathon.com 2:01:39 DIE ZUKUNFT IST JETZT. DER BMW i3s. BMW i3s (94 Ah) mit reinem Elektroantrieb BMW eDrive: Stromverbrauch in kWh/100 km (kombiniert): 14,3; CO2-Emissionen in g/km (kombiniert): 0; Kraftstoffverbrauch in l/100 km (kombiniert): 0. Die offi ziellen Angaben zu Kraftstoffverbrauch, CO2-Emissionen und Stromverbrauch wurden nach dem vorgeschriebenen Messverfahren VO (EU) 715/2007 in der jeweils geltenden Fassung ermittelt. Bei Freude am Fahren diesem Fahrzeug können für die Bemessung von Steuern und anderen fahrzeugbezogenen Abgaben, die (auch) auf den CO2-Ausstoß abstellen, andere als die hier angegebenen Werte gelten. Abbildung zeigt Sonderausstattungen. 7617 BMW i3s Sportmarketing AZ 420x297 Ergebnisheft 20180916.indd Alle Seiten 18.07.18 15:37 DIE ZUKUNFT IST JETZT. DER BMW i3s. BMW i3s (94 Ah) mit reinem Elektroantrieb BMW eDrive: Stromverbrauch in kWh/100 km (kombiniert): 14,3; CO2-Emissionen in g/km (kombiniert): 0; Kraftstoffverbrauch in l/100 km (kombiniert): 0. Die offi ziellen Angaben zu Kraftstoffverbrauch, CO2-Emissionen und Stromverbrauch wurden nach dem vorgeschriebenen Messverfahren VO (EU) 715/2007 in der jeweils geltenden Fassung ermittelt. Bei Freude am Fahren diesem Fahrzeug können für die Bemessung von Steuern und anderen fahrzeugbezogenen Abgaben, die (auch) auf den CO2-Ausstoß abstellen, andere als die hier angegebenen Werte gelten. Abbildung zeigt Sonderausstattungen. 7617 BMW i3s Sportmarketing AZ 420x297 Ergebnisheft 20180916.indd Alle Seiten 18.07.18 15:37 AOK Nordost. Beim Sport dabei. Nutzen Sie Ihre individuellen Vorteile: Bis zu 385 Euro für Fitness, Sport und Vorsorge. Bis zu 150 Euro für eine sportmedizinische Untersuchung.
    [Show full text]
  • An Analysis of Contributions to Wikipedia from Tor
    Are anonymity-seekers just like everybody else? An analysis of contributions to Wikipedia from Tor Chau Tran Kaylea Champion Andrea Forte Department of Computer Science & Engineering Department of Communication College of Computing & Informatics New York University University of Washington Drexel University New York, USA Seatle, USA Philadelphia, USA [email protected] [email protected] [email protected] Benjamin Mako Hill Rachel Greenstadt Department of Communication Department of Computer Science & Engineering University of Washington New York University Seatle, USA New York, USA [email protected] [email protected] Abstract—User-generated content sites routinely block contri- butions from users of privacy-enhancing proxies like Tor because of a perception that proxies are a source of vandalism, spam, and abuse. Although these blocks might be effective, collateral damage in the form of unrealized valuable contributions from anonymity seekers is invisible. One of the largest and most important user-generated content sites, Wikipedia, has attempted to block contributions from Tor users since as early as 2005. We demonstrate that these blocks have been imperfect and that thousands of attempts to edit on Wikipedia through Tor have been successful. We draw upon several data sources and analytical techniques to measure and describe the history of Tor editing on Wikipedia over time and to compare contributions from Tor users to those from other groups of Wikipedia users. Fig. 1. Screenshot of the page a user is shown when they attempt to edit the Our analysis suggests that although Tor users who slip through Wikipedia article on “Privacy” while using Tor. Wikipedia’s ban contribute content that is more likely to be reverted and to revert others, their contributions are otherwise similar in quality to those from other unregistered participants and to the initial contributions of registered users.
    [Show full text]
  • Out of the Ordinary: Finding Hidden Threats by Analyzing Unusual
    RAND-INITIATED RESEARCH CHILD POLICY This PDF document was made available CIVIL JUSTICE from www.rand.org as a public service of EDUCATION the RAND Corporation. ENERGY AND ENVIRONMENT HEALTH AND HEALTH CARE Jump down to document6 INTERNATIONAL AFFAIRS NATIONAL SECURITY The RAND Corporation is a nonprofit POPULATION AND AGING research organization providing PUBLIC SAFETY SCIENCE AND TECHNOLOGY objective analysis and effective SUBSTANCE ABUSE solutions that address the challenges TERRORISM AND facing the public and private sectors HOMELAND SECURITY TRANSPORTATION AND around the world. INFRASTRUCTURE Support RAND Purchase this document Browse Books & Publications Make a charitable contribution For More Information Visit RAND at www.rand.org Explore RAND-Initiated Research View document details Limited Electronic Distribution Rights This document and trademark(s) contained herein are protected by law as indicated in a notice appearing later in this work. This electronic representation of RAND intellectual property is provided for non- commercial use only. Permission is required from RAND to reproduce, or reuse in another form, any of our research documents. This product is part of the RAND Corporation monograph series. RAND monographs present major research findings that address the challenges facing the public and private sectors. All RAND mono- graphs undergo rigorous peer review to ensure high standards for research quality and objectivity. Out of the Ordinary Finding Hidden Threats by Analyzing Unusual Behavior JOHN HOLLYWOOD, DIANE SNYDER, KENNETH McKAY, JOHN BOON Approved for public release, distribution unlimited This research in the public interest was supported by RAND, using discretionary funds made possible by the generosity of RAND's donors, the fees earned on client-funded research, and independent research and development (IR&D) funds provided by the Department of Defense.
    [Show full text]
  • Topological Evolution of Networks: Case Studies in the US Airlines and Language Wikipedias By
    Topological Evolution of Networks: Case Studies in the US Airlines and Language Wikipedias by Gergana Assenova Bounova B.S., Theoretical Mathematics, Massachusetts Institute of Technology (2003) B.S., Aeronautics & Astronautics, Massachusetts Institute of Technology (2003) S.M., Aeronautics & Astronautics, Massachusetts Institute of Technology (2005) Submitted to the Department of Aeronautics and Astronautics in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY February 2009 c 2009 Gergana A. Bounova, All rights reserved. ... The author hereby grants to MIT the permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part. Author.......................................................................................... Department of Aeronautics and Astronautics February 27, 2009 Certified by..................................................................................... Prof. Olivier L. de Weck Associate Professor of Aeronautics and Astronautics and Engineering Systems Thesis Supervisor Certified by..................................................................................... Prof. Christopher L. Magee Professor of the Practice of Mechanical Engineering and Engineering Systems Certified by..................................................................................... Dr. Daniel E. Whitney Senior Research Scientist, Center for Technology, Policy and Industrial Development, Senior Lecturer in
    [Show full text]
  • Langmag July06 14-17.Qxd (Page 1)
    Methodology Robert L. Read and Steven D. Brewer explain how Esperanto acts as a springboard for the acquisition of other languages Who Knows Where Esperanto Might Lead? In 1887, an obscure eye doctor in ly attain a competency that eluded them in Esperanto, or any language, provides a Poland self-published a little book in Russian. learning an ethnic language or report that they propaedeutic effect in learning a next lan- Over the next several years Lingvo Internacia1 reached a given level of competency in a frac- guage which is similar. appeared in English, French, German, tion of the time required by a national lan- Several factors may contribute to the Hebrew, and Polish. This book, written under guage. Early success creates a virtuous cycle Corder effect, including similarities in vocabu- the pen name Doctor Esperanto, laid the which encourages more study and often leads lary, grammatical structure, and word order. foundation for a new language that would to genuine fluency. Achievement yields positive Similarity of vocabulary has been shown to achieve what no other language project had effects on student self-confidence, insight into be an effective metric for predicting how ever done: establish a living community that the nature of languages in general, and the much knowing one language will help with would go on to survive the death of its cre- structure of their native language in particular. learning another.5 Since Esperanto was ator. Even conservative estimates place the Barry Farber writes in his book How to designed to have a widely recognized vocab- number of active speakers in the tens of Learn Any Language:2 “It’s said that once you ulary and grammatical features broadly thousands, with the number who have master one foreign language, all others come shared across language families, it takes learned Esperanto at some time in their lives much more easily.
    [Show full text]
  • Private Companiescompanies
    PrivatePrivate CompaniesCompanies Private Company Profiles @Ventures Kepware Technologies Access 360 Media LifeYield Acronis LogiXML Acumentrics Magnolia Solar Advent Solar Mariah Power Agion Technologies MetaCarta Akorri mindSHIFT Technologies alfabet Motionbox Arbor Networks Norbury Financial Asempra Technologies NumeriX Asset Control OpenConnect Atlas Venture Panasas Autonomic Networks Perimeter eSecurity Azaleos Permabit Technology Azimuth Systems PermissionTV Black Duck Software PlumChoice Online PC Services Brainshark Polaris Venture Partners BroadSoft PriceMetrix BzzAgent Reva Systems Cedar Point Communications Revolabs Ceres SafeNet Certeon Sandbridge Technologies Certica Solutions Security Innovation cMarket Silver Peak Systems Code:Red SIMtone ConsumerPowerline SkyeTek CorporateRewards SoloHealth Courion Sonicbids Crossbeam Systems StyleFeeder Cyber-Ark Software TAGSYS DAFCA Tatara Systems Demandware Tradeware Global Desktone Tutor.com Epic Advertising U4EA Technologies ExtendMedia Ubiquisys Fidelis Security Systems UltraCell Flagship Ventures Vanu Fortisphere Versata Enterprises GENBAND Visible Assets General Catalyst Partners VKernel Hearst Interactive Media VPIsystems Highland Capital Partners Ze-gen HSMC ZoomInfo Invention Machine @Ventures Address: 187 Ballardvale Street, Suite A260 800 Menlo Ave, Suite 120 Wilmington, MA 01887 Menlo Park, CA 94025 Phone #: (978) 658-8980 (650) 322-3246 Fax #: ND ND Website: www.ventures.com www.ventures.com Business Overview @Ventures provides venture capital and growth assistance to early stage clean technology companies. Established in 1995, @Ventures has funded more than 75 companies across a broad set of technology sectors. The exclusive focus of the firm's fifth fund, formed in 2004, is on investments in the cleantech sector, including alternative energy, energy storage and efficiency, advanced materials, and water technologies. Speaker Bio: Marc Poirier, Managing Director Marc Poirier has been a General Partner with @Ventures since 1998 and operates out the firm’s Boston-area office.
    [Show full text]
  • Proceedings of the 3Rd Workshop on Building and Using Comparable
    Workshop Programme Saturday, May 22, 2010 9:00-9:15 Opening Remarks Invited talk 9:15 Comparable Corpora Within and Across Languages, Word Frequency Lists and the KELLY Project Adam Kilgarriff 10:30 Break Session 1: Building Comparable Corpora 11:00 Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Ma- chine Translation Inguna Skadin¸a, Andrejs Vasil¸jevs, Raivis Skadin¸š, Robert Gaizauskas, Dan Tufi¸s and Tatiana Gornostay 11:30 Statistical Corpus and Language Comparison Using Comparable Corpora Thomas Eckart and Uwe Quasthoff 12:00 Wikipedia as Multilingual Source of Comparable Corpora Pablo Gamallo and Isaac González López 12:30 Trillions of Comparable Documents Pascale Fung, Emmanuel Prochasson and Simon Shi 13:00 Lunch break i Saturday, May 22, 2010 (continued) Session 2: Parallel and Comparable Corpora for Machine Translation 14:30 Improving Machine Translation Performance Using Comparable Corpora Andreas Eisele and Jia Xu 15:00 Building a Large English-Chinese Parallel Corpus from Comparable Patents and its Experimental Application to SMT Bin Lu, Tao Jiang, Kapo Chow and Benjamin K. Tsou 15:30 Automatic Terminologically-Rich Parallel Corpora Construction José João Almeida and Alberto Simões 16:00 Break Session 3: Contrastive Analysis 16:30 Foreign Language Examination Corpus for L2-Learning Studies Piotr Banski´ and Romuald Gozdawa-Goł˛ebiowski 17:00 Lexical Analysis of Pre and Post Revolution Discourse in Portugal Michel Généreux, Amália Mendes, L. Alice Santos Pereira and M. Fernanda Bace- lar do Nascimento
    [Show full text]
  • ORES: Lowering Barriers with Participatory Machine Learning in Wikipedia
    ORES: Lowering Barriers with Participatory Machine Learning in Wikipedia AARON HALFAKER∗, Microsoft, USA R. STUART GEIGER†, University of California, San Diego, USA Algorithmic systems—from rule-based bots to machine learning classifiers—have a long history of supporting the essential work of content moderation and other curation work in peer production projects. From counter- vandalism to task routing, basic machine prediction has allowed open knowledge projects like Wikipedia to scale to the largest encyclopedia in the world, while maintaining quality and consistency. However, conversations about how quality control should work and what role algorithms should play have generally been led by the expert engineers who have the skills and resources to develop and modify these complex algorithmic systems. In this paper, we describe ORES: an algorithmic scoring service that supports real-time scoring of wiki edits using multiple independent classifiers trained on different datasets. ORES decouples several activities that have typically all been performed by engineers: choosing or curating training data, building models to serve predictions, auditing predictions, and developing interfaces or automated agents that act on those predictions. This meta-algorithmic system was designed to open up socio-technical conversations about algorithms in Wikipedia to a broader set of participants. In this paper, we discuss the theoretical mechanisms of social change ORES enables and detail case studies in participatory machine learning around ORES from the 5 years since its deployment. CCS Concepts: • Networks → Online social networks; • Computing methodologies → Supervised 148 learning by classification; • Applied computing → Sociology; • Software and its engineering → Software design techniques; • Computer systems organization → Cloud computing; Additional Key Words and Phrases: Wikipedia; Reflection; Machine learning; Transparency; Fairness; Algo- rithms; Governance ACM Reference Format: Aaron Halfaker and R.
    [Show full text]