Towards Building a Blog Preservation Platform 3 New Comment, Etc.) As Triggers for Crawling Activities Would Be More Suitable
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
CHAPTER 12 Making Your Web Site Mashable
CHAPTER 12 Making Your Web Site Mashable This chapter is a guide to content producers who want to make their web sites friendly to mashups. That is, this chapter answers the question, how would you as a content producer make your digital content most effectively remixable and mashable to users and developers? Most of this book is addressed to creators of mashups who are therefore consumers of data and services. Why then should I shift in this chapter to addressing producers of data and services? Well, you have already seen aspects of APIs and web content that make it either easier or harder to remix, and you’ve seen what makes APIs easy and enjoyable to use. Showing content and data producers what would make life easier for consumers of their content provides useful guidance to service providers who might not be fully aware of what it’s like for consumers. The main audience for the book—as consumers (as opposed to producers) of services—should still find this chapter a helpful distillation of best practices for creating mashups. In some ways, this chapter is a review of Chapters 1–11 and a preview of Chapters 13–19. Chapters 1–11 prepared you for how to create mashups in general. I presented a lot of the technologies and showed how to build a reasonably sophisticated mashup with PHP and JavaScript as well as using mashup tools. Some of the discussion in this chapter will be amplified by in-depth discussions in Chapters 13–19. For example, I’ll refer to topics such as geoRSS, iCalendar, and microformats that I discuss in greater detail in those later chapters. -
Life Sciences and the Web: a New Era for Collaboration
Life Sciences and the web: a new era for collaboration The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Sagotsky, Jonathan A., Le Zhang, Zhihui Wang, Sean Martin, and Thomas S. Deisboeck. 2008. Life Sciences and the web: a new era for collaboration. Molecular Systems Biology 4: 201. Published Version doi:10.1038/msb.2008.39 Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:4874800 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA Molecular Systems Biology 4; Article number 201; doi:10.1038/msb.2008.39 Citation: Molecular Systems Biology 4:201 & 2008 EMBO and Nature Publishing Group All rights reserved 1744-4292/08 www.molecularsystemsbiology.com PERSPECTIVE Life Sciences and the web: a new era for collaboration Jonathan A Sagotsky1, Le Zhang1, Zhihui Wang1, Sean Martin2 inaccuracies per article compared to Encyclopedia Britannica’s and Thomas S Deisboeck1,* 2.92 errors. Although measures have been taken to improve the editorial process, accuracy and completeness remain valid 1 Complex Biosystems Modeling Laboratory, Harvard-MIT (HST) Athinoula concerns. Perhaps the issue is one in which it has become A Martinos Center for Biomedical Imaging, Massachusetts General Hospital, difficult to establish exactly what has actually been peer Charlestown, MA, USA and reviewed and what has not, given that the low cost of digital 2 Cambridge Semantics Inc., Cambridge, MA, USA publishing on the web has led to an explosive amount * Corresponding author. -
Quality Spine Care
Quality Spine Care Healthcare Systems, Quality Reporting, and Risk Adjustment John Ratliff Todd J. Albert Joseph Cheng Jack Knightly Editors 123 Quality Spine Care John Ratliff • Todd J. Albert Joseph Cheng • Jack Knightly Editors Quality Spine Care Healthcare Systems, Quality Reporting, and Risk Adjustment Editors John Ratliff Todd J. Albert Department of Neurosurgery Hospital for Special Surgery Stanford University New York, NY Stanford, CA USA USA Jack Knightly Joseph Cheng Atlantic Neurosurgical Specialists University of Cincinnati Morristown, NJ Cincinnati, OH USA USA ISBN 978-3-319-97989-2 ISBN 978-3-319-97990-8 (eBook) https://doi.org/10.1007/978-3-319-97990-8 Library of Congress Control Number: 2018957706 © Springer Nature Switzerland AG 2019 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. -
Uncovering Social Network Sybils in the Wild
Uncovering Social Network Sybils in the Wild ZHI YANG, Peking University 2 CHRISTO WILSON, University of California, Santa Barbara XIAO WANG, Peking University TINGTING GAO,RenrenInc. BEN Y. ZHAO, University of California, Santa Barbara YAFEI DAI, Peking University Sybil accounts are fake identities created to unfairly increase the power or resources of a single malicious user. Researchers have long known about the existence of Sybil accounts in online communities such as file-sharing systems, but they have not been able to perform large-scale measurements to detect them or measure their activities. In this article, we describe our efforts to detect, characterize, and understand Sybil account activity in the Renren Online Social Network (OSN). We use ground truth provided by Renren Inc. to build measurement-based Sybil detectors and deploy them on Renren to detect more than 100,000 Sybil accounts. Using our full dataset of 650,000 Sybils, we examine several aspects of Sybil behavior. First, we study their link creation behavior and find that contrary to prior conjecture, Sybils in OSNs do not form tight-knit communities. Next, we examine the fine-grained behaviors of Sybils on Renren using clickstream data. Third, we investigate behind-the-scenes collusion between large groups of Sybils. Our results reveal that Sybils with no explicit social ties still act in concert to launch attacks. Finally, we investigate enhanced techniques to identify stealthy Sybils. In summary, our study advances the understanding of Sybil behavior on OSNs and shows that Sybils can effectively avoid existing community-based Sybil detectors. We hope that our results will foster new research on Sybil detection that is based on novel types of Sybil features. -
Svms for the Blogosphere: Blog Identification and Splog Detection
SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari∗, Tim Finin and Anupam Joshi University of Maryland, Baltimore County Baltimore MD {kolari1, finin, joshi}@umbc.edu Abstract Most blog search engines identify blogs and index con- tent based on update pings received from ping servers1 or Weblogs, or blogs have become an important new way directly from blogs, or through crawling blog directories to publish information, engage in discussions and form communities. The increasing popularity of blogs has and blog hosting services. To increase their coverage, blog given rise to search and analysis engines focusing on the search engines continue to crawl the Web to discover, iden- “blogosphere”. A key requirement of such systems is to tify and index blogs. This enables staying ahead of compe- identify blogs as they crawl the Web. While this ensures tition in a domain where “size does matter”. Even if a web that only blogs are indexed, blog search engines are also crawl is inessential for blog search engines, it is still pos- often overwhelmed by spam blogs (splogs). Splogs not sible that processed update pings are from non-blogs. This only incur computational overheads but also reduce user requires that the source of the pings need to be verified as a satisfaction. In this paper we first describe experimental blog prior to indexing content.2 results of blog identification using Support Vector Ma- In the first part of this paper we address blog identifica- chines (SVM). We compare results of using different feature sets and introduce new features for blog iden- tion by experimenting with different feature sets. -
Elie Bursztein, Baptiste Gourdin, John Mitchell Stanford University & LSV-ENS Cachan
Talkback: Reclaiming the Blogsphere Elie Bursztein, Baptiste Gourdin, John Mitchell Stanford University & LSV-ENS Cachan 1 What is a blog ? • A Blog ("Web log") is a site, usually maintained by an individual with • Regular entries • Commentary • LinkBack • Entries displayed in reverse-chronological order. http://elie.im/blog Elie Bursztein, Baptiste Gourdin, John Mitchell TalkBack: reclaiming the blogosphere from spammer http://ly.tl/p21 Key Statistics • 184 Millions blogs • 73% of users read blogs • 50% post comments universalmccann Elie Bursztein, Baptiste Gourdin, John Mitchell TalkBack: reclaiming the blogosphere from spammer http://ly.tl/p21 Anatomy of a blog post Elie Bursztein, Baptiste Gourdin, John Mitchell TalkBack: reclaiming the blogosphere from spammer http://ly.tl/p21 Why blogs are special ? User Elie Bursztein, Baptiste Gourdin, John Mitchell TalkBack: reclaiming the blogosphere from spammer http://ly.tl/p21 Why blogs are special ? User Elie Bursztein, Baptiste Gourdin, John Mitchell TalkBack: reclaiming the blogosphere from spammer http://ly.tl/p21 What is a TrackBack ? Elie Bursztein, Baptiste Gourdin, John Mitchell TalkBack: reclaiming the blogosphere from spammer http://ly.tl/p21 Trackback Illustrated Little Timmy said to me... "What's Trackback, Daddy?" "Wow! Jimmy Lightning has written the best 1. post ever! It's so funny! And it's true! That's "Best Post Ever" why it's so good. I need to tell the world!" "Check it out world! I've "Jimmy written all about Jimmy 2. Lightning is Lightning's post on my Elie Bursztein, Baptiste Gourdin, John Mitchell swell"TalkBack: reclaiming the blogosphere from spammerweblog. My weblog's http://ly.tl/p21 called 'The Unbloggable Blogness of Blogging'. -
By Nilesh Bansal a Thesis Submitted in Conformity with the Requirements
ONLINE ANALYSIS OF HIGH-VOLUME SOCIAL TEXT STEAMS by Nilesh Bansal A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Computer Science University of Toronto ⃝c Copyright 2013 by Nilesh Bansal Abstract Online Analysis of High-Volume Social Text Steams Nilesh Bansal Doctor of Philosophy Graduate Department of Computer Science University of Toronto 2013 Social media is one of the most disruptive developments of the past decade. The impact of this information revolution has been fundamental on our society. Information dissemination has never been cheaper and users are increasingly connected with each other. The line between content producers and consumers is blurred, leaving us with abundance of data produced in real-time by users around the world on multitude of topics. In this thesis we study techniques to aid an analyst in uncovering insights from this new media form which is modeled as a high volume social text stream. The aim is to develop practical algorithms with focus on the ability to scale, amenability to reliable operation, usability, and ease of implementation. Our work lies at the intersection of building large scale real world systems and developing theoretical foundation to support the same. We identify three key predicates to enable online methods for analysis of social data, namely : • Persistent Chatter Discovery to explore topics discussed over a period of time, • Cross-referencing Media Sources to initiate analysis using a document as the query, and • Contributor Understanding to create aggregate expertise and topic summaries of authors contributing online. The thesis defines each of the predicates in detail and covers proposed techniques, their practical applicability, and detailed experimental results to establish accuracy and scalability for each of the three predicates. -
Blogosphere: Research Issues, Tools, and Applications
Blogosphere: Research Issues, Tools, and Applications Nitin Agarwal Huan Liu Computer Science and Engineering Department Arizona State University Tempe, AZ 85287 fNitin.Agarwal.2, [email protected] ABSTRACT ging. Acknowledging this fact, Times has named \You" as the person of the year 2006. This has created a consider- Weblogs, or Blogs, have facilitated people to express their able shift in the way information is assimilated by the indi- thoughts, voice their opinions, and share their experiences viduals. This paradigm shift can be attributed to the low and ideas. Individuals experience a sense of community, a barrier to publication and open standards of content genera- feeling of belonging, a bonding that members matter to one tion services like blogs, wikis, collaborative annotation, etc. another and their niche needs will be met through online These services have allowed the mass to contribute and edit interactions. Its open standards and low barrier to publi- articles publicly. Giving access to the mass to contribute cation have transformed information consumers to produc- or edit has also increased collaboration among the people ers. This has created a plethora of open-source intelligence, unlike previously where there was no collaboration as the or \collective wisdom" that acts as the storehouse of over- access to the content was limited to a chosen few. Increased whelming amounts of knowledge about the members, their collaboration has developed collective wisdom on the Inter- environment and the symbiosis between them. Nonetheless, net. \We the media" [21], is a phenomenon named by Dan vast amounts of this knowledge still remain to be discovered Gillmor: a world in which \the former audience", not a few and exploited in its suitable way. -
Web Spam Taxonomy
Web Spam Taxonomy Zolt´an Gy¨ongyi Hector Garcia-Molina Computer Science Department Computer Science Department Stanford University Stanford University [email protected] [email protected] Abstract techniques, but as far as we know, they still lack a fully effective set of tools for combating it. We believe Web spamming refers to actions intended to mislead that the first step in combating spam is understanding search engines into ranking some pages higher than it, that is, analyzing the techniques the spammers use they deserve. Recently, the amount of web spam has in- to mislead search engines. A proper understanding of creased dramatically, leading to a degradation of search spamming can then guide the development of appro- results. This paper presents a comprehensive taxon- priate countermeasures. omy of current spamming techniques, which we believe To that end, in this paper we organize web spam- can help in developing appropriate countermeasures. ming techniques into a taxonomy that can provide a framework for combating spam. We also provide an overview of published statistics about web spam to un- 1 Introduction derline the magnitude of the problem. There have been brief discussions of spam in the sci- As more and more people rely on the wealth of informa- entific literature [3, 6, 12]. One can also find details for tion available online, increased exposure on the World several specific techniques on the Web itself (e.g., [11]). Wide Web may yield significant financial gains for in- Nevertheless, we believe that this paper offers the first dividuals or organizations. Most frequently, search en- comprehensive taxonomy of all important spamming gines are the entryways to the Web; that is why some techniques known to date. -
Robust Detection of Comment Spam Using Entropy Rate
Robust Detection of Comment Spam Using Entropy Rate Alex Kantchelian Justin Ma Ling Huang UC Berkeley UC Berkeley Intel Labs [email protected] [email protected] [email protected] Sadia Afroz Anthony D. Joseph J. D. Tygar Drexel University, Philadelphia UC Berkeley UC Berkeley [email protected] [email protected] [email protected] ABSTRACT Keywords In this work, we design a method for blog comment spam detec- Spam filtering, Comment spam, Content complexity, Noisy label, tion using the assumption that spam is any kind of uninformative Logistic regression content. To measure the “informativeness” of a set of blog com- ments, we construct a language and tokenization independent met- 1. INTRODUCTION ric which we call content complexity, providing a normalized an- swer to the informal question “how much information does this Online social media have become indispensable, and a large part text contain?” We leverage this metric to create a small set of fea- of their success is that they are platforms for hosting user-generated tures well-adjusted to comment spam detection by computing the content. An important example of how users contribute value to a content complexity over groupings of messages sharing the same social media site is the inclusion of comment threads in online ar- author, the same sender IP, the same included links, etc. ticles of various kinds (news, personal blogs, etc). Through com- We evaluate our method against an exact set of tens of millions ments, users have an opportunity to share their insights with one of comments collected over a four months period and containing another. -
Spam in Blogs and Social Media
ȱȱȱȱ ȱ Pranam Kolari, Tim Finin Akshay Java, Anupam Joshi March 25, 2007 ȱ • Spam on the Internet –Variants – Social Media Spam • Reason behind Spam in Blogs • Detecting Spam Blogs • Trends and Issues • How can you help? • Conclusions Pranam Kolari is a UMBC PhD Tim Finin is a UMBC Professor student. His dissertation is on with over 30 years of experience spam blog detection, with tools in the applying AI to information developed in use both by academia and systems, intelligent interfaces and industry. He has active research interest robotics. Current interests include social in internal corporate blogs, the Semantic media, the Semantic Web and multi- Web and blog analytics. agent systems. Akshay Java is a UMBC PhD student. Anupam Joshi is a UMBC Pro- His dissertation is on identify- fessor with research interests in ing influence and opinions in the broad area of networked social media. His research interests computing and intelligent systems. He include blog analytics, information currently serves on the editorial board of retrieval, natural language processing the International Journal of the Semantic and the Semantic Web. Web and Information. Ƿ Ȭȱ • Early form seen around 1992 with MAKE MONEY FAST • 80-85% of all e-mail traffic is spam • In numbers 2005 - (June) 30 billion per day 2006 - (June) 55 billion per day 2006 - (December) 85 billion per day 2007 - (February) 90 billion per day Sources: IronPort, Wikipedia http://www.ironport.com/company/ironport_pr_2006-06-28.html ȱȱǵ • “Unsolicited usually commercial e-mail sent to a large -
Advances in Online Learning-Based Spam Filtering
ADVANCES IN ONLINE LEARNING-BASED SPAM FILTERING A dissertation submitted by D. Sculley, M.Ed., M.S. In partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science TUFTS UNIVERSITY August 2008 ADVISER: Carla E. Brodley Acknowledgments I would like to take this opportunity to thank my advisor Carla Brodley for her patient guidance, my parents David and Paula Sculley for their support and en- couragement, and my bride Jessica Evans for making everything worth doing. I gratefully acknowledge Rediff.com for funding the writing of this disserta- tion. D. Sculley TUFTS UNIVERSITY August 2008 ii Abstract The low cost of digital communication has given rise to the problem of email spam, which is unwanted, harmful, or abusive electronic content. In this thesis, we present several advances in the application of online machine learning methods for auto- matically filtering spam. We detail a sliding-window variant of Support Vector Machines that yields state of the art results for the standard online filtering task. We explore a variety of feature representations for spam data. We reduce human labeling cost through the use of efficient online active learning variants. We give practical solutions to the one-sided feedback scenario, in which users only give la- beling feedback on messages predicted to be non-spam. We investigate the impact of class label noise on machine learning-based spam filters, showing that previous benchmark evaluations rewarded filters prone to overfitting in real-world settings and proposing several modifications for combating these negative effects. Finally, we investigate the performance of these filtering methods on the more challenging task of abuse filtering in blog comments.