Enriching Unstructured Media Content About Events to Enable Semi-Automated Summaries, Compilations, and Improved Search by Leveraging Social Networks

Total Page:16

File Type:pdf, Size:1020Kb

Enriching Unstructured Media Content About Events to Enable Semi-Automated Summaries, Compilations, and Improved Search by Leveraging Social Networks Enriching Unstructured Media Content About Events to Enable Semi-Automated Summaries, Compilations, and Improved Search by Leveraging Social Networks Thomas Steiner Departament de Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya A thesis submitted for the degree of Philosophiæ Doctor (PhD) February 2014 1st Advisor: Joaquim Gabarró Vallés Universitat Politècnica de Catalunya, Barcelona, Spain 2nd Advisor: Michael Hausenblas Digital Enterprise Research Institute, Galway, Ireland MapR Technologies, San Jose, CA, USA ii Abstract (i) Mobile devices and social networks are omnipresent Mobile devices such as smartphones, tablets, or digital cameras together with social networks enable people to create, share, and consume enormous amounts of media items like videos or photos both on the road or at home. Such mobile devices—by pure definition—accompany their owners almost wherever they may go. In consequence, mobile devices are omnipresent at all sorts of events to capture noteworthy moments. Exemplary events can be keynote speeches at conferences, music concerts in stadiums, or even natural catastrophes like earthquakes that affect whole areas or countries. At such events—given a stable network connection—part of the event-related media items are published on social networks both as the event happens or afterwards, once a stable network connection has been established again. (ii) Finding representative media items for an event is hard Common media item search operations, for example, searching for the of- ficial video clip for a certain hit record on an online video platform can in the simplest case be achieved based on potentially shallow human-generated metadata or based on more profound content analysis techniques like optical character recognition, automatic speech recognition, or acoustic fingerprint- ing. More advanced scenarios, however, like retrieving all (or just the most representative) media items that were created at a given event with the ob- jective of creating event summaries or media item compilations covering the event in question are hard, if not impossible, to fulfill at large scale. The central research question of this thesis can be formulated as follows. (iii) Central research question “Can user-customizable media galleries that summarize given events be created solely based on textual and multimedia data from social networks?” (iv) Core contributions In the context of this thesis, we have developed and evaluated a novel interac- tive application and related methods for media item enrichment, leveraging social networks, utilizing the Web of Data, techniques known from Content- based Image Retrieval (CBIR) and Content-based Video Retrieval (CBVR), and fine-grained media item addressing schemes like Media Fragments URIs to provide a scalable and near realtime solution to realize the abovemen- tioned scenario of event summarization and media item compilation. (v) Methodology For any event with given event title(s), (potentially vague) event location(s), and (arbitrarily fine-grained) event date(s), our approach can be divided in the following six steps. 1. Via the textual search APIs (Application Programming Interfaces) of different social networks, we retrieve a list of potentially event-relevant microposts that either contain media items directly, or that provide links to media items on external media item hosting platforms. 2. Using third-party Natural Language Processing (NLP) tools, we rec- ognize and disambiguate named entities in microposts to predetermine their relevance. 3. We extract the binary media item data from social networks or media item hosting platforms and relate it to the originating microposts. 4. Using CBIR and CBVR techniques, we first deduplicate exact-duplicate and near-duplicate media items and then cluster similar media items. 5. We rank the deduplicated and clustered list of media items and their related microposts according to well-defined ranking criteria. 6. In order to generate interactive and user-customizable media galleries that visually and audially summarize the event in question, we compile the top-n ranked media items and microposts in aesthetically pleasing and functional ways. To Laura, Lena, Emma, and Nil. Acknowledgements Personal Acknowledgements: First and foremost, I would like to thank my wife Laura for her support, understanding, patience, and energy during my time as a PhD student, and simply for being at my side. Without you, I would not be where I am today. I wholeheartedly thank my two advisors Joaquim Gabarró Vallés and Michael Hausenblas for their guidance, helpful comments, informative pointers, and especially for their constructive criticisms. The areas of research that I have tackled in this thesis are still young and sometimes uncharted territory. I am very thankful that the two of you have ventured on the undertaking of leading me through this thesis. I deeply appreciate all the review comments, thoughts, challenging ques- tions, and, last not least, the LATEX help of my dear friend and research col- league Ruben Verborgh. It was, is, and will be an honor to work with you. My warm thanks also go to Raphaël Troncy and his team at EURECOM Sophia Antipolis in France who have helped shape some of the ideas pre- sented in this thesis. I would like to thank my former and current managers at Google, namely N. Kryvossidis, R. Ashley, C. Bouchère, and I. Sassarini for their support for my thesis. A lot of valuable input for my thesis came in via social networks. My thanks go out to everyone I have interacted with around the hashtag #TomsPhD on Twitter, Google+, and Facebook. Finally, I sincerely thank my parents and my brother who have made me the person I am today. My parents have taught me not to go for the easy choices, even if at times they may seem tempting, but instead to try harder and never give up. This thesis also is for you. Formal Acknowledgements: The research presented in this doctoral thesis was partially supported by the European Commission under Grant No. 248296 with the European Union (FP7 ICT STReP) project I-SEARCH . To cite this document: @phdthesis{steiner2013thesis, author = {Thomas Steiner}, title = {Enriching Unstructured Media Content About Events to Enable Semi-Automated Summaries, Compilations, and Improved Search by Leveraging Social Networks}, year = {2013}, school = {Universitat Polit\`{e}cnica de Catalunya} } Copyright and License: ©2013 Thomas Steiner Licensed under the Creative Commons Attribution-ShareAlike 3.0 License. http://creativecommons.org/licenses/by-sa/3.0/ Contents List of Figures xiii List of Listings xv List of Tables xix Glossary xxi 1 Event Summarization Challenge 1 1.1 Motivation and Problem Statement . 1 1.2 Research Question and Hypothesis . 2 1.3 Approach . 2 1.4 Related Work . 3 1.5 Contributions . 6 1.5.1 Social Network Multimedia and Data Analysis . 6 1.5.2 Application of Semantic Web and NLP Techniques . 6 1.5.3 Video Content and Metadata Analysis . 6 1.5.4 Event Detection . 7 1.5.5 Standardization and Specifications . 7 1.5.6 Crowdsourcing . 7 1.5.7 Studies . 7 1.5.8 Multimodal Search Engines . 7 1.5.9 Web Service Description . 7 1.5.10 Others . 8 1.6 Thesis Structure . 8 v CONTENTS 2 Semantic Web Technologies 21 2.1 The World Wide Web and Semantics . 21 2.2 The Semantic Web . 22 2.2.1 The Non-Semantic Web . 22 2.2.2 Structured Data on the Web . 23 2.2.3 The Structured Knowledge Base DBpedia . 24 2.2.4 Semantics in HTML Versions 4.01 and 5 . 25 2.2.5 Structured Data Beyond Pure HTML . 27 Microformats . 27 Microdata . 29 2.3 Resource Description Framework (RDF) . 30 2.3.1 Triples as a Data Structure . 30 2.3.2 Important RDF Serialization Syntaxes . 31 RDF Sample Graph . 31 The RDF/XML Syntax . 31 The N-Triples Syntax . 32 The Turtle Syntax . 33 The Notation3 Syntax . 33 The RDFa Syntax . 34 2.4 SPARQL: Semantic Web Query Language . 35 2.4.1 The Vision of the Web as a Giant Single Database . 35 2.4.2 Different SPARQL Query Variations . 36 SELECT . 37 DESCRIBE . 37 ASK . 37 CONSTRUCT . 37 2.5 Linked Data . 38 2.5.1 The Linked Data Principles . 39 2.5.2 The Linking Open Data Cloud Diagram . 39 2.6 Conclusions . 40 vi CONTENTS 3 Social Networks 47 3.1 Definition of Terms Used in this Thesis . 48 3.2 Description of Popular Social Networks . 49 3.2.1 Facebook . 49 3.2.2 Flickr . 49 3.2.3 Google+ ................................... 50 3.2.4 Img.ly . 50 3.2.5 Imgur . 50 3.2.6 Instagram . 51 3.2.7 Lockerz . 51 3.2.8 MobyPicture . 51 3.2.9 Myspace . 52 3.2.10 Photobucket . 52 3.2.11 Twitpic . 52 3.2.12 Twitter . 53 3.2.13 Yfrog . 53 3.2.14 YouTube . 54 3.3 Decentralized Social Networks . 54 3.4 Classification of Social Networks . 55 3.5 Conclusions . 56 4 Micropost Annotation 61 4.1 Introduction . 61 4.1.1 Direct Access to Micropost Raw Data . 61 4.1.2 Browser Extensions to Access Microposts Indirectly . 62 4.1.3 Data Analysis Flow . 62 4.1.4 A Wrapper API for Named Entity Disambiguation . 63 4.2 Related Work . 63 4.2.1 Named Entity Disambiguation Using Lexical Databases . 64 4.2.2 Named Entity Disambiguation Using Semantic Coherence . 64 4.2.3 Named Entity Disambiguation Using Dictionaries . 64 4.2.4 Disambiguation Using Corpuses and Probability . 65 4.2.5 Disambiguation Using Search Query Logs . 65 vii CONTENTS 4.2.6 Combining Different Web Services and Provenance . 65 4.3 Structuring Unstructured Textual Data . 66 4.3.1 Natural Language Processing Services . 66 OpenCalais . 67 AlchemyAPI . 68 Zemanta . 68 DBpedia Spotlight . 69 4.3.2 Machine Translation . 69 4.3.3 Part-of-Speech Tagging . 69 4.4 Consolidating Named Entity Disambiguation Results .
Recommended publications
  • Comércio Eletrônico
    Comércio Eletrônico Aula 1 - Overview of Electronic Commerce Learning Objectives 1. Define electronic commerce (EC) and describe its various categories. 2. Describe and discuss the content and framework of EC. 3. Describe the major types of EC transactions. 4. Describe the drivers of EC. 5. Discuss the benefits of EC to individuals, organizations, and society. 6. Discuss social computing. 7. Describe social commerce and social software. 8. Understand the elements of the digital world. 9. Describe some EC business models. 10. List and describe the major limitations of EC. Case: Starbucks Electronic Commerce (EC) EC refers to using the Internet and other networks (e.g., intranets) to purchase, sell, transport, or trade data, goods, or services. e-Business • Narrow definition of EC: buying and selling transactions between business partners. • e-Business refers to a broader definition of EC: – buying and selling of goods and services – Servicing customers – collaborating with business partners, – delivering e-learning, – conducting electronic transactions within organizations. – Among others e-Business Note: some view e-business only as comprising those activities that do not involve buying or selling over the Internet, i.e., a complement of the narrowly defined EC. Major EC Concepts: Non-EC vs. Pure EC vs. Partial EC • EC three major activities: – ordering and payments, – order fulfilment, and – delivery to customers. • pure EC: all activities are digital, • non-EC: none are digital, • otherwise, we have partial EC. Major EC Concepts: Pure
    [Show full text]
  • Social Commerce
    SOCIAL COMMERCE AND YOUR E-COMMERCE STRATEGY T HE DIGITAL GRAND BAZAAR Imagine yourself standing in the middle of Istanbul’s Grand Bazaar—the world’s oldest and largest covered market. There are 61 covered streets. Over 3,000 shops. The bazaar attracts nearly half a million shoppers on a daily basis. Imagine the smell of spices and leather goods. Imagine the vibrant colors of the fabrics and glittering jewelry, picture the ornateness of handmade ceramics. All around you, people are looking at the wares, haggling with the vendors, talking about this shop or that one. The noise is steady and unrelenting and the sheer amount of humanity is exhilarating/overwhelming. The Grand Bazaar today is not much different than it was in the 16th century. Shops still sell similar items. Goods are still bargained for—no one pays what a seller is asking (you’re expected to haggle and deal with the vendor face-to-face). It’s been described as one of the best ways to “recapture the romantic atmosphere of old Istanbul.”1 And it represents the archetypical driver of commerce: social interaction; word of mouth. Commerce has always been a social event. Word of mouth, for example, was one of the first and most powerful forms of advertising. And it still remains so today. According to global management consulting firm McKinsey & Company, word of mouth “is the primary factor behind 20 to 50 percent of all purchasing decisions.”2 In the digital era, e-commerce has become the Grand Bazaar for the world. Social media is the modern word of mouth where instead of sharing opinions with just a small and close-knit group, people connect and share across the globe.
    [Show full text]
  • Globalwebindex's Flagship Report on the Latest Trends in Social Media
    Social GlobalWebIndex’s flagship report on the latest trends in social media FLAGSHIP REPORT 2020 globalwebindex.com What's inside? 03 Introduction 04 Key insights 05 Social media engagement 11 The social media landscape 13 Social media behaviors 15 Social entertainment 21 Notes on methodology 23 More from GlobalWebIndex SOCIAL Introduction METHODOLOGY GlobalWebIndex Social Media flagship report provides All figures in this report are drawn from GlobalWebIndex’s version of this survey via mobile, hence the sample sizes the most important insights on the world of social media, online research among internet users aged 16-64. We only presented in the charts throughout this report may differ as from the very latest figures for social media engagement interview respondents aged 16-64 and our figures are some will include all respondents and others will include only to the key trends within the social space and a representative of the online populations of each market, not its respondents who completed GlobalWebIndex’s Core survey via comprehensive view of which social platforms are most total population. Note that in many markets in Latin America, PC/laptop/tablet. popular. Among others, this report covers the following the Middle East and Africa, and the Asia Pacific region, low topics in detail: internet penetration rates can mean online populations are Throughout this report we refer to indexes. Indexes are used more young, urban, affluent and educated than the total to compare any given group against the average (1.00), which How much time per day are digital consumers population. unless otherwise stated refers to the global average.
    [Show full text]
  • John Athayde and Bruce Williams — «The Rails View
    What readers are saying about The Rails View This is a must-read for Rails developers looking to juice up their skills for a world of web apps that increasingly includes mobile browsers and a lot more JavaScript. ➤ Yehuda Katz Driving force behind Rails 3.0 and Co-founder, Tilde In the past several years, I’ve been privileged to work with some of the world’s leading Rails developers. If asked to name the best view-layer Rails developer I’ve met, I’d have a hard time picking between two names: Bruce Williams and John Athayde. This book is a rare opportunity to look into the minds of two of the leading experts on an area that receives far too little attention. Read, apply, and reread. ➤ Chad Fowler VP Engineering, LivingSocial Finally! An authoritative and up-to-date guide to everything view-related in Rails 3. If you’re stabbing in the dark when putting together your Rails apps’ views, The Rails View provides a big confidence boost and shows how to get things done the right way. ➤ Peter Cooper Editor, Ruby Inside and Ruby Weekly The Rails view layer has always been a morass, but this book reins it in with details of how to build views as software, not just as markup. This book represents the wisdom gained from years’ worth of building maintainable interfaces by two of the best and brightest minds in our business. I have been writing Ruby code for over a decade and Rails code since its inception, and out of all the Ruby books I’ve read, I value this one the most.
    [Show full text]
  • Wiki Semantics Via Wiki Templating
    Chapter XXXIV Wiki Semantics via Wiki Templating Angelo Di Iorio Department of Computer Science, University of Bologna, Italy Fabio Vitali Department of Computer Science, University of Bologna, Italy Stefano Zacchiroli Universitè Paris Diderot, PPS, UMR 7126, Paris, France ABSTRACT A foreseeable incarnation of Web 3.0 could inherit machine understandability from the Semantic Web, and collaborative editing from Web 2.0 applications. We review the research and development trends which are getting today Web nearer to such an incarnation. We present semantic wikis, microformats, and the so-called “lowercase semantic web”: they are the main approaches at closing the technological gap between content authors and Semantic Web technologies. We discuss a too often neglected aspect of the associated technologies, namely how much they adhere to the wiki philosophy of open editing: is there an intrinsic incompatibility between semantic rich content and unconstrained editing? We argue that the answer to this question can be “no”, provided that a few yet relevant shortcomings of current Web technologies will be fixed soon. INTRODUCTION Web 3.0 can turn out to be many things, it is hard to state what will be the most relevant while still debating on what Web 2.0 [O’Reilly (2007)] has been. We postulate that a large slice of Web 3.0 will be about the synergies between Web 2.0 and the Semantic Web [Berners-Lee et al. (2001)], synergies that only very recently have begun to be discovered and exploited. We base our foresight on the observation that Web 2.0 and the Semantic Web are converging to a common point in their initially split evolution lines.
    [Show full text]
  • The Girlboss Guide to Killing It Online & Making an Impact in the Digital Space
    AN INTRODUCTORY GUIDE TO MAKING YOUR MARK IN THE DIGITAL SPACE FOR GIRBOSSES, BLOGGERS AND CREATIVES. By Dani Watson A N I N T R O T O M A K I N G Y O U R M A R K O N L I N E B Y D A N I W A T S O N WITH THE AMOUNT OF COMPETITION IN THE ONLINE WORLD, IT CAN BE AN OVERWHELMING TASK TO THINK ABOUT HOW YOU ARE EVER GOING TO MAKE YOU AND YOUR BRAND STAND OUT FROM THE NOISE. WHETHER YOU'RE A BUSINESS LOOKING TO LAUNCH ONLINE, A BLOGGER WHO WANTS TO TURN THEIR BLOG INTO A FULL TIME INCOME OR A BRAND THAT IS LOOKING TO ELEVATE ITS PRESENCE IN THE DIGITAL SPACE, ITS ALL ABOUT FINDING AND CONNECTING WITH YOUR CLIQUE AND ONCE YOU DO, LOVING THEM HARD. I AM A FIRM BELIEVER THAT WHENEVER YOU ARE LOOKING TO CREATE AN INCOME ONLINE, THE COMMUNITY COMES BEFORE THE SELL. PEOPLE HAVE TO KNOW, LIKE AND TRUST YOU BEFORE THEY ARE WILLING TO INVEST. THIS IS WHY IT IS SO IMPORTANT TO BE THINKING ABOUT CREATING NOT ONLY YOUR PRODUCTS/ SERVICES/BLOG, BUT FIGURING OUT WAYS TO MAKE PEOPLE FALL IN LOVE WITH YOU AND YOUR BRAND SO THEY BUY WHATEVER IT IS YOU SELL. I AM SOMEONE WHO THOROUGHLY UNDERSTANDS THE IMPORTANCE OF FINDING YOUR TRIBE ONLINE AND BECOMING KNOWN FOR WHAT YOU DO AS THIS IS EXACTLY WHAT I HAVE BEEN ABLE TO DO WITH MY OWN BUSINESS AND BRAND. WHAT STARTED OFF AS A SMALL BLOG AND SOCIAL MEDIA CONSULTANCY HAS QUICKLY DEVELOPED INTO AN ONLINE EMPIRE THAT SHOWS NO SIGNS OF SLOWING DOWN.
    [Show full text]
  • A Method to Analyze Multiple Social Identities in Twitter Bios
    A Method to Analyze Multiple Social Identities in Twitter Bios ARJUNIL PATHAK, University at Buffalo, USA NAVID MADANI, University at Buffalo, USA KENNETH JOSEPH, University at Buffalo, USA Twitter users signal social identity in their profile descriptions, or bios, in a number of important but complex ways that are not well-captured by existing characterizations of how identity is expressed in language. Better ways of defining and measuring these expressions may therefore be useful both in understanding howsocial identity is expressed in text, and how the self is presented on Twitter. To this end, the present work makes three contributions. First, using qualitative methods, we identify and define the concept of a personal identifier, which is more representative of the ways in which identity is signaled in Twitter bios. Second, we propose a method to extract all personal identifiers expressed in a given bio. Finally, we present a series of validation analyses that explore the strengths and limitations of our proposed method. Our work opens up exciting new opportunities at the intersection between the social psychological study of social identity and the study of how we compose the self through markers of identity on Twitter and in social media more generally. CCS Concepts: • Applied computing Sociology; Additional Key Words and Phrases: Twitter, self-presentation, social identity, computational social science ACM Reference Format: Arjunil Pathak, Navid Madani, and Kenneth Joseph. 2018. A Method to Analyze Multiple Social Identities in Twitter Bios. J. ACM 37, 4, Article 111 (August 2018), 35 pages. https://doi.org/10.1145/1122445.1122456 1 INTRODUCTION A social identity is a word or phrase that refers to a particular social group (e.g.
    [Show full text]
  • Where Is the Semantic Web? – an Overview of the Use of Embeddable Semantics in Austria
    Where Is The Semantic Web? – An Overview of the Use of Embeddable Semantics in Austria Wilhelm Loibl Institute for Service Marketing and Tourism Vienna University of Economics and Business, Austria [email protected] Abstract Improving the results of search engines and enabling new online applications are two of the main aims of the Semantic Web. For a machine to be able to read and interpret semantic information, this content has to be offered online first. With several technologies available the question arises which one to use. Those who want to build the software necessary to interpret the offered data have to know what information is available and in which format. In order to answer these questions, the author analysed the business websites of different Austrian industry sectors as to what semantic information is embedded. Preliminary results show that, although overall usage numbers are still small, certain differences between individual sectors exist. Keywords: semantic web, RDFa, microformats, Austria, industry sectors 1 Introduction As tourism is a very information-intense industry (Werthner & Klein, 1999), especially novel users resort to well-known generic search engines like Google to find travel related information (Mitsche, 2005). Often, these machines do not provide satisfactory search results as their algorithms match a user’s query against the (weighted) terms found in online documents (Berry and Browne, 1999). One solution to this problem lies in “Semantic Searches” (Maedche & Staab, 2002). In order for them to work, web resources must first be annotated with additional metadata describing the content (Davies, Studer & Warren., 2006). Therefore, anyone who wants to provide data online must decide on which technology to use.
    [Show full text]
  • Html Cheat Sheet
    BEGINNER’S_ ​ HTML CHEAT SHEET Main root 2 Document metadata 2 Sectioning root 3 Content sectioning 3 Text content 4 Inline text semantics 6 Image and multimedia 8 Scripting 9 Demarcating edits 9 Table content 9 Forms 11 Interactive elements 12 WebsiteSetup.org - Beginner’s HTML Cheat Sheet ​ 1 Main root <html> … </html> The HTML <html> element represents the root (top-level element) of an HTML document, so it is also referred to as the root element. All other elements must be descendants of this element. Example: <!DOCTYPE html> <html lang="en"> <head>...</head> <body>...</body> </html> Document metadata <head> … </head> The HTML <head> element contains machine-readable information (metadata) about the document, like its title, scripts, and style sheets. <link> The HTML External Resource Link element (<link>) specifies relationships between the current document and an external resource. This element is most commonly used to link to stylesheets, but is also used to establish site icons (both "favicon" style icons and icons for the home screen and apps on mobile devices) among other things. <meta> The HTML <meta> element represents metadata that cannot be represented by other HTML meta-related elements, like <base>, <link>, <script>, <style> or <title> <style> … </style> The HTML <style> element contains style information for a document, or part of a document. <title> … </title> The HTML Title element (<title>) defines the document's title that is shown in a browser's title bar or a page's tab. Example: WebsiteSetup.org - Beginner’s HTML Cheat Sheet ​ 2 <!DOCTYPE html> <html lang="en"> <head>...</head> <body>...</body> </html> Sectioning root <body> … </body> The HTML <body> Element represents the content of an HTML document.
    [Show full text]
  • Westminsterresearch the Role of Knowledge Share, Satisfaction
    WestminsterResearch http://www.westminster.ac.uk/westminsterresearch The Role of Knowledge Share, Satisfaction, Social Commerce Usage Experience on Smart Mobile Device User’s Purchase Intentions: Evidence from South Korean Consumers Kwon, K. This is an electronic version of a PhD thesis awarded by the University of Westminster. © Mr Kyung Kwon, 2018. The WestminsterResearch online digital archive at the University of Westminster aims to make the research output of the University available to a wider audience. Copyright and Moral Rights remain with the authors and/or copyright owners. Whilst further distribution of specific materials from within this archive is forbidden, you may freely distribute the URL of WestminsterResearch: ((http://westminsterresearch.wmin.ac.uk/). In case of abuse or copyright appearing without permission e-mail [email protected] The Role of Knowledge Share, Satisfaction, Social Commerce Usage Experience on Smart Mobile Device User’s Purchase Intentions: Evidence from South Korean Consumers Kyung-Joon Kwon A Thesis submitted in partial fulfilment of the requirements of the University of Westminster for the degree of Doctor of Philosophy March 2018 Abstract This thesis analyses the factors that contribute to consumers’ intention to make online purchases via smart mobile devices. To examine consumers’ purchase intentions, frameworks described in the marketing and information system literatures were integrated, and a theoretical framework was then proposed. In total, 498 Korean consumers were recruited to
    [Show full text]
  • Microformats the Next (Small) Thing on the Semantic Web?
    Standards Editor: Jim Whitehead • [email protected] Microformats The Next (Small) Thing on the Semantic Web? Rohit Khare • CommerceNet “Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards.” — Microformats.org hen we speak of the “evolution of the is precisely encoding the great variety of person- Web,” it might actually be more appropri- al, professional, and genealogical relationships W ate to speak of “intelligent design” — we between people and organizations. By contrast, can actually point to a living, breathing, and an accidental challenge is that any blogger with actively involved Creator of the Web. We can even some knowledge of HTML can add microformat consult Tim Berners-Lee’s stated goals for the markup to a text-input form, but uploading an “promised land,” dubbed the Semantic Web. Few external file dedicated to machine-readable use presume we could reach those objectives by ran- remains forbiddingly complex with most blog- domly hacking existing Web standards and hop- ging tools. ing that “natural selection” by authors, software So, although any intelligent designer ought to developers, and readers would ensure powerful be able to rely on the long-established facility of enough abstractions for it. file transfer to publish the “right” model of a social Indeed, the elegant and painstakingly inter- network, the path of least resistance might favor locked edifice of technologies, including RDF, adding one of a handful of fixed tags to an exist- XML, and query languages is now growing pow- ing indirect form — the “blogroll” of hyperlinks to erful enough to attack massive information chal- other people’s sites.
    [Show full text]
  • Sociolinguistically Driven Approaches for Just Natural Language Processing
    University of Massachusetts Amherst ScholarWorks@UMass Amherst Doctoral Dissertations Dissertations and Theses April 2021 Sociolinguistically Driven Approaches for Just Natural Language Processing Su Lin Blodgett University of Massachusetts Amherst Follow this and additional works at: https://scholarworks.umass.edu/dissertations_2 Part of the Artificial Intelligence and Robotics Commons Recommended Citation Blodgett, Su Lin, "Sociolinguistically Driven Approaches for Just Natural Language Processing" (2021). Doctoral Dissertations. 2092. https://doi.org/10.7275/20410631 https://scholarworks.umass.edu/dissertations_2/2092 This Open Access Dissertation is brought to you for free and open access by the Dissertations and Theses at ScholarWorks@UMass Amherst. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of ScholarWorks@UMass Amherst. For more information, please contact [email protected]. SOCIOLINGUISTICALLY DRIVEN APPROACHES FOR JUST NATURAL LANGUAGE PROCESSING A Dissertation Presented by SU LIN BLODGETT Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY February 2021 College of Information and Computer Sciences © Copyright by Su Lin Blodgett 2021 All Rights Reserved SOCIOLINGUISTICALLY DRIVEN APPROACHES FOR JUST NATURAL LANGUAGE PROCESSING A Dissertation Presented by SU LIN BLODGETT Approved as to style and content by: Brendan O’Connor, Chair Mohit Iyyer, Member Hanna Wallach, Member Lisa Green, Member James Allan, Chair of the Faculty College of Information and Computer Sciences ACKNOWLEDGMENTS First and foremost, I would like to thank my advisor, Brendan O’Connor, who introduced me to natural language processing and computational sociolinguistics, taught me more than I can say about research, NLP, and writing, and has been endlessly supportive of my many and growing research interests; I consider myself amazingly lucky to have stumbled into his lab.
    [Show full text]