Dissertation

Title of the Dissertation
Efficient Feature Reduction and Classification Methods
Applications in Drug Discovery and Email Categorization

Author: Mag. Andreas Janecek
Academic degree sought: Doktor der technischen Wissenschaften (Dr. techn.), i.e. Doctor of Technical Sciences
Vienna, December 2009
Degree programme code as per student record sheet: A 786 880
Dissertation field as per student record sheet: Informatik (Computer Science)
Supervisor: Priv.-Doz. Dr. Wilfried Gansterer

Contents

1 Introduction
  1.1 Motivation and Problem Statement
  1.2 Synopsis
  1.3 Summary of Publications
  1.4 Notation
  1.5 Acknowledgements

I Theoretical Background

2 Data Mining and Knowledge Discovery
  2.1 Definition of Terminology
  2.2 Connection to Other Disciplines
  2.3 Data
  2.4 Models for the Knowledge Discovery Process
    2.4.1 Step 1 - Data Extraction
    2.4.2 Step 2 - Data Pre-processing
    2.4.3 Step 3 - Feature Reduction
    2.4.4 Step 4 - Data Mining
    2.4.5 Step 5 - Post-processing and Interpretation
  2.5 Relevant Literature

3 Feature Reduction
  3.1 Relevant Literature
  3.2 Feature Selection
    3.2.1 Filter Methods
    3.2.2 Wrapper Methods
    3.2.3 Embedded Approaches
    3.2.4 Comparison of Filter, Wrapper and Embedded Approaches
  3.3 Dimensionality Reduction
    3.3.1 Low-rank Approximations
    3.3.2 Principal Component Analysis
    3.3.3 Singular Value Decomposition
    3.3.4 Nonnegative Matrix Factorization
    3.3.5 Algorithms for Computing NMF
    3.3.6 Other Dimensionality Reduction Techniques
    3.3.7 Comparison of Techniques

4 Supervised Learning
  4.1 Relevant Literature
  4.2 The k-Nearest Neighbor Algorithm
  4.3 Decision Trees
  4.4 Rule-Based Learners
  4.5 Support Vector Machines
  4.6 Ensemble Methods
    4.6.1 Bagging
    4.6.2 Random Forest
    4.6.3 Boosting
    4.6.4 Stacking
    4.6.5 General Evaluation
  4.7 Vector Space Model
  4.8 Latent Semantic Indexing
  4.9 Other Relevant Classification Methods
    4.9.1 Naïve Bayesian
    4.9.2 Neural Networks
    4.9.3 Discriminant Analysis

5 Application Areas
  5.1 Email Filtering
    5.1.1 Classification Problem for Email Filtering
    5.1.2 Relevant Literature for Spam Filtering
    5.1.3 Relevant Literature for Phishing
  5.2 Predictive QSAR Modeling
    5.2.1 Classification Problem for QSAR Modeling
    5.2.2 Relevant Literature for QSAR Modeling
  5.3 Comparison of Areas

II New Approaches in Feature Reduction and Classification

6 On the Relationship Between FR and Classification Accuracy
  6.1 Overview of Chapter
  6.2 Introduction and Related Work
  6.3 Open Issues and Own Contributions
  6.4 Datasets
  6.5 Feature Subsets
  6.6 Machine Learning Methods
  6.7 Experimental Results
    6.7.1 Email Data
    6.7.2 Drug Discovery Data
    6.7.3 Runtimes
  6.8 Discussion

7 Email Filtering Based on Latent Semantic Indexing
  7.1 Overview of Chapter
  7.2 Introduction and Related Work
  7.3 Open Issues and Own Contributions
  7.4 Classification based on VSM and LSI
    7.4.1 Feature Sets
    7.4.2 Feature/Attribute Selection Methods Used
    7.4.3 Training and Classification
  7.5 Experimental Evaluation
    7.5.1 Datasets
    7.5.2 Experimental Setup
    7.5.3 Analysis of Data Matrices
    7.5.4 Aggregated Classification Results
    7.5.5 True/False Positive Rates
    7.5.6 Feature Reduction
  7.6 Discussion

8 New Initialization Strategies for NMF
  8.1 Overview of Chapter
  8.2 Introduction and Related Work
  8.3 Open Issues and Own Contributions
  8.4 Datasets
  8.5 Interpretation of Factors
    8.5.1 Interpretation of Email Data
    8.5.2 Interpretation of Drug Discovery Data
  8.6 Initializing NMF
    8.6.1 Feature Subset Selection
    8.6.2 FS-Initialization
    8.6.3 Results
    8.6.4 Runtime
  8.7 Discussion

9 Utilizing NMF for Classification Problems
  9.1 Overview of Chapter
  9.2 Introduction and Related Work
  9.3 Open Issues and Own Contributions
  9.4 Classification using Basis Features
  9.5 Generalizing LSI Based on NMF
    9.5.1 Two NMF-based Classifiers
    9.5.2 Classification Results
    9.5.3 Runtimes
  9.6 Discussion

10 Improving the Performance of NMF
  10.1 Overview of Chapter
  10.2 Introduction and Related Work
  10.3 Open Issues and Own Contributions
  10.4 Hardware, Software, Datasets
    10.4.1 Datasets
    10.4.2 Hardware Architecture
    10.4.3 Software Architecture
  10.5 Task-Parallel Speedup
  10.6 Improvements for Single Factorizations
    10.6.1 Multithreading Improvements
    10.6.2 Improving Matlab's ALS Code
  10.7 Initialization vs. Task-parallelism
  10.8 Discussion

11 Conclusion and Future Work

Summary

The sheer volume of data today and its expected growth over the next years are among the key challenges in data mining and knowledge discovery applications. Besides the huge number of data samples that are collected and processed, the high-dimensional nature of data arising in many applications creates a need for effective and efficient techniques that can deal with this massive amount of data.

In addition to significantly increasing the demand for computational resources, such large datasets may also degrade the quality of several data mining applications (especially if the number of features is very high compared to the number of samples). As the dimensionality of data increases, many types of data analysis and classification problems become significantly harder. This can lead to problems for both supervised and unsupervised learning.

Dimensionality reduction and feature (subset) selection methods are two types of techniques for reducing the attribute space. While in feature selection a subset of the original attributes is extracted, dimensionality reduction in general produces linear combinations of the original attribute set. In both approaches, the goal is to obtain a low-dimensional representation of the attribute space that retains most of the information of the original data. In recent years, feature selection and dimensionality reduction techniques have become a.
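
The distinction the summary draws between feature selection (keeping a subset of the original attributes) and dimensionality reduction (forming linear combinations of them) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: the toy data, the function names `select_features` and `project`, the use of variance as a stand-in filter criterion, and the fixed weight matrix `W` (a method such as PCA or NMF would derive such weights from the data).

```python
from statistics import pvariance

# Toy data (made up): 4 samples (rows) x 3 attributes (columns).
X = [
    [1.0, 10.0, 0.5],
    [2.0, 10.0, 1.5],
    [3.0, 10.0, 2.5],
    [4.0, 10.0, 3.5],
]


def select_features(rows, k):
    """Feature selection: keep k of the ORIGINAL columns.

    Columns are ranked by population variance, used here only as a
    simple illustrative filter criterion (an assumption).
    """
    n_cols = len(rows[0])
    variances = [pvariance([r[j] for r in rows]) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])  # indices of the retained original attributes
    return keep, [[r[j] for j in keep] for r in rows]


def project(rows, weights):
    """Dimensionality reduction: every new feature is a linear
    combination of ALL original columns, one weight row per new feature."""
    return [
        [sum(w * x for w, x in zip(w_row, r)) for w_row in weights]
        for r in rows
    ]


kept, X_sel = select_features(X, 2)

# Hypothetical weights for a single new feature; hard-coded for illustration.
W = [[0.5, 0.0, 0.5]]
X_red = project(X, W)

print(kept)      # indices of the original attributes that survive selection
print(X_sel[0])  # first sample, original values of the kept attributes
print(X_red[0])  # first sample, one new feature mixing the attributes
```

The selected columns keep their original meaning, whereas the projected feature is a mixture of attributes and no longer corresponds to any single one of them.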