Dissertation

Total Page:16

File Type:pdf, Size:1020Kb

Dissertation AndreasDISSERTATION Titel der Dissertation Efficient Feature Reduction and Classification Methods Applications in Drug Discovery and Email Categorization VerfasserJanecek Mag. Andreas Janecek angestrebter akademischer Grad Doktor der technischen Wissenschaften (Dr. techn.) Wien, im Dezember 2009 Studienkennzahl lt. Studienblatt: A 786 880 Dissertationsgebiet lt. Studienblatt: Informatik Betreuer: Priv.-Doz. Dr. Wilfried Gansterer 2 Andreas Janecek Contents Andreas 1 Introduction 11 1.1 Motivation and Problem Statement . 11 1.2 Synopsis . 12 1.3 Summary of Publications . 13 1.4 Notation . 14 1.5 Acknowledgements . 14 I Theoretical Background 15 2 Data Mining and Knowledge Discovery 17 2.1 Definition of Terminology .Janecek . 18 2.2 Connection to Other Disciplines . 19 2.3 Data . 20 2.4 Models for the Knowledge Discovery Process . 21 2.4.1 Step 1 { Data Extraction . 23 2.4.2 Step 2 { Data Pre-processing . 24 2.4.3 Step 3 { Feature Reduction . 25 2.4.4 Step 4 { Data Mining . 26 2.4.5 Step 5 { Post-processing and Interpretation . 29 2.5 Relevant Literature . 31 3 Feature Reduction 33 3.1 Relevant Literature . 34 3.2 Feature Selection . 34 3.2.1 Filter Methods . 35 3.2.2 Wrapper Methods . 40 3.2.3 Embedded Approaches . 41 3.2.4 Comparison of Filter, Wrapper and Embedded Approaches . 42 3.3 Dimensionality Reduction . 43 3 4 CONTENTS 3.3.1 Low-rank Approximations . 43 3.3.2 Principal Component Analysis . 44 3.3.3 Singular Value Decomposition . 48 3.3.4 Nonnegative Matrix Factorization . 50 3.3.5 Algorithms for Computing NMF . 51 3.3.6 Other Dimensionality Reduction Techniques . 54 Andreas3.3.7 Comparison of Techniques . 55 4 Supervised Learning 57 4.1 Relevant Literature . 58 4.2 The k-Nearest Neighbor Algorithm . 58 4.3 Decision Trees . 60 4.4 Rule-Based Learners . 63 4.5 Support Vector Machines . 65 4.6 Ensemble Methods . 70 4.6.1 Bagging . 70 4.6.2 Random Forest . 71 4.6.3 Boosting . 73 4.6.4 Stacking . .Janecek . 74 4.6.5 General Evaluation . 74 4.7 Vector Space Model . 76 4.8 Latent Semantic Indexing . 77 4.9 Other Relevant Classification Methods . 79 4.9.1 Na¨ıve Bayesian . 79 4.9.2 Neural Networks . 80 4.9.3 Discriminant Analysis . 81 5 Application Areas 83 5.1 Email Filtering . 83 5.1.1 Classification Problem for Email Filtering . 84 5.1.2 Relevant Literature for Spam Filtering . 85 5.1.3 Relevant Literature for Phishing . 89 5.2 Predictive QSAR Modeling . 91 5.2.1 Classification Problem for QSAR Modeling . 93 5.2.2 Relevant Literature for QSAR Modeling . 94 5.3 Comparison of Areas . 96 CONTENTS 5 II New Approaches in Feature Reduction and Classification 99 6 On the Relationship Between FR and Classification Accuracy 101 6.1 Overview of Chapter . 101 6.2 Introduction and Related Work . 102 6.3 Open Issues and Own Contributions . 102 6.4 Datasets . 103 Andreas6.5 Feature Subsets . 104 6.6 Machine Learning Methods . 105 6.7 Experimental Results . 105 6.7.1 Email Data . 106 6.7.2 Drug Discovery Data . 109 6.7.3 Runtimes . 113 6.8 Discussion . 114 7 Email Filtering Based on Latent Semantic Indexing 117 7.1 Overview of Chapter . 117 7.2 Introduction and Related Work . 118 7.3 Open Issues and Own Contributions . 118 7.4 Classification based on VSM and LSI . 119 7.4.1 Feature Sets . .Janecek . 119 7.4.2 Feature/Attribute Selection Methods Used . 122 7.4.3 Training and Classification . 123 7.5 Experimental Evaluation . 123 7.5.1 Datasets . 123 7.5.2 Experimental Setup . 124 7.5.3 Analysis of Data Matrices . 124 7.5.4 Aggregated Classification Results . 126 7.5.5 True/False Positive Rates . 127 7.5.6 Feature Reduction . 128 7.6 Discussion . 130 8 New Initialization Strategies for NMF 133 8.1 Overview of Chapter . 133 8.2 Introduction and Related Work . 133 8.3 Open Issues and Own Contributions . 135 8.4 Datasets . 135 8.5 Interpretation of Factors . 136 8.5.1 Interpretation of Email Data . 137 8.5.2 Interpretation of Drug Discovery Data . 138 6 CONTENTS 8.6 Initializing NMF . 140 8.6.1 Feature Subset Selection . 140 8.6.2 FS-Initialization . 141 8.6.3 Results . 141 8.6.4 Runtime . 142 8.7 Discussion . 144 9 Utilizing NMF for Classification Problems 147 Andreas9.1 Overview of Chapter . 147 9.2 Introduction and Related Work . 148 9.3 Open Issues and Own Contributions . 148 9.4 Classification using Basis Features . 149 9.5 Generalizing LSI Based on NMF . 152 9.5.1 Two NMF-based Classifiers . 152 9.5.2 Classification Results . 153 9.5.3 Runtimes . 154 9.6 Discussion . 157 10 Improving the Performance of NMF 161 10.1 Overview of Chapter . 161 10.2 Introduction and Related Work . 161 10.3 Open Issues and Own ContributionsJanecek . 163 10.4 Hardware, Software, Datasets . 164 10.4.1 Datasets . 164 10.4.2 Hardware Architecture . 165 10.4.3 Software Architecture . 165 10.5 Task-Parallel Speedup . 166 10.6 Improvements for Single Factorizations . 168 10.6.1 Multithreading Improvements . 168 10.6.2 Improving Matlab's ALS Code . 169 10.7 Initialization vs. Task-parallelism . 171 10.8 Discussion . 174 11 Conclusion and Future Work 177 AndreasSummary The sheer volume of data today and its expected growth over the next years are some of the key challenges in data mining and knowledge discovery applications. Besides the huge number of data samples that are collected and processed, the high dimensional nature of data arising in many applications causes the need to develop effective and efficient techniques that are able to deal with this massive amount of data. In addition to the significant increase in the demand of computational resources, those large datasets might also influence the quality of several data min- ing applications (especially if the number of features is very high compared to the number of samples). As the dimensionality of data increases, many types of data analysis and classification problems become significantly harder. This can lead to problems for both supervised and unsupervisedJanecek learning. Dimensionality reduction and feature (subset) selection methods are two types of techniques for reducing the attribute space. While in feature selection a subset of the original attributes is extracted, dimensionality reduction in general produces linear combinations of the original attribute set. In both approaches, the goal is to select a low dimensional subset of the attribute space that covers most of the information of the original data. During the last years, feature selection and dimensionality reduction techniques have.
Recommended publications
  • Delivering Results to the Inbox Sailthru’S 2020 Playbook on Deliverability, Why It’S Imperative and How It Drives Business Results Introduction to Deliverability
    Delivering Results to the Inbox Sailthru’s 2020 Playbook on Deliverability, Why It’s Imperative and How It Drives Business Results Introduction to Deliverability Every day, people receive more than 293 billion Deliverability is the unsung hero of email marketing, emails, a staggering number that only represents ultimately ensuring a company’s emails reach their the tip of the iceberg. Why? The actual number intended recipients. It’s determined by a host of of emails sent is closer to 5.9 quadrillion, with the factors, including the engagement of your subscribers overwhelming majority blocked outright or delivered and the quality of your lists. All together, these factors to the spam folder. result in your sender reputation score, which is used to determine how the ISPs treat your email stream. Something many people don’t realize is that to the Deliverability is also a background player, so far in the major Internet Service Providers (ISPs) — Gmail, shadows that many people don’t think about it, until Yahoo!, Hotmail, Comcast and AOL — “spam” there’s a major issue. doesn’t refer to marketing messages people may find annoying, but rather malicious email filled with That’s why Sailthru’s deliverability team created this scams and viruses. In order to protect their networks guide. Read on to learn more about how deliverability and their customers, the ISPs cast a wide net. If a works on the back-end and how it impacts revenue, message is deemed to be spam by the ISP’s filters, it’s your sender reputation and how to maintain a good dead on arrival, never to see the light of the inbox, as one, and best practices for list management, email protecting users’ inboxes is the top priority of any ISP.
    [Show full text]
  • Spam, Spammers, and Spam Control a White Paper by Ferris Research March 2009
    Spam, Spammers, and Spam Control A White Paper by Ferris Research March 2009. Report #810 Ferris Research, Inc. One San Antonio Place San Francisco, Calif. 94133, USA Phone: +1 (650) 452-6215 Fax: +1 (408) 228-8067 www.ferris.com Table of Contents Spam, Spammers, and Spam Control................................................3 Defining Spam.................................................................................3 Spammer Tactics .............................................................................3 Sending Mechanisms.................................................................4 Spammer Tricks.........................................................................4 Techniques for Identifying Spam ....................................................5 Connection Analysis..................................................................5 Behavioral Analysis...................................................................6 Content Scanning.......................................................................6 Controlling Spam: How and Where ................................................7 The Key Role of Reputation Services .......................................7 Conclusion: An Arms Race.............................................................8 Trend Micro Interview........................................................................9 Ferris Analyzer Information Service. Report #810. March 2009. © 2009 Ferris Research, Inc. All rights reserved. This document may be copied or freely reproduced provided you
    [Show full text]
  • Email Authentication Faqs V3
    Email Authentication GUIDE Frequently Asked QUES T ION S T OGETHER STRONGER EMAIL AUTHENTICATION Marketers that use email for communication and transactional purposes should adopt and use identification and authentication protocols.” This document will explain what authentication is – includ- ing some recommendations on what you should do as an email marketer to implement these guidelines within your organization. * This Guide should not be considered as legal advice. It is being provided for informational purposes only. Please review your email program with your legal counsel to ensure that your program is meeting appropriate legal requirements. THIS COMPLIANCE GUIDE COVERS: Basics of Email Authentication Technologies Basic FAQs on the DMA’s Email Authentication Guidelines Implementation: Complementary Types of Email Authentication Systems Beyond Authentication: Email Reputation Email Authentication Resources for Marketers 1. What Do the DMA’s Email Authentication Guidelines Require? The DMA’s guidelines require marketers to choose and implement authentication technolo- gies in their email systems. It is up to your company to decide what kind of authentication protocol to use, though all are recommended based on current-day trends. The DMA does not require nor endorse the use of any specific protocol, as there are several interoperable, inexpensive, and easy to implement solutions available today. 2. Why does the DMA Require Members to Authenticate Their Email Systems? The DMA requires its members to authenticate their email systems primarily because mailbox providers (aka ISPs, MSPs or receivers) are increasingly requiring authentication. This strongly aligns with a growing trend in the email deliverability industry that’s leaning more towards domain-based reputation (as opposed to IP-based reputation a couple of years ago).
    [Show full text]
  • WHITE PAPER Email Deliverability Review
    WHITE PAPER Email DELIVeraBility REView dmawe are the White Paper Email Deliverability Review Published by Deliverability Hub of the Email Marketing Council Sponsored by 1 COPYRIGHT: THE DIRECT MARKETING ASSOCIATION (UK) LTD 2012 WHITE PAPER Email DELIVeraBility REView Contents About this document ...............................................................................................................................3 About the authors ...................................................................................................................................4 Sponsor’s perspective .............................................................................................................................5 Executive summary .................................................................................................................................6 1. Major factors that impact on deliverability ..............................................................................................7 1.1 Sender reputation .............................................................................................................................7 1.2 Spam filtering ...................................................................................................................................7 1.3 Blacklist operators ............................................................................................................................8 1.4 Smart Inboxes ..................................................................................................................................9
    [Show full text]
  • Effectiveness and Limitations of Statistical Spam Filters
    International Conference on “New Trends in Statistics and Optimization” Effectiveness and Limitations of Statistical Spam Filters M. Tariq Banday, Lifetime Member, CSI P.G. Department of Electronics and Instrumentation Technology University of Kashmir, Srinagar, India Tariq R. Jan P. G. Department of Statistics University of Kashmir, Srinagar, India Abstract Spam is not only clogging the Internet traffic by consuming a hefty amount of network bandwidth but it is also a source for e-mail born viruses, spyware, adware and Trojan Horses. It is also used to carry out denial of service, directory harvesting and phishing attacks that directly cause financial losses. Further, the contents of spam are often offensive and contain adult oriented and fraudulent materials which are objectionable to recipients. Several anti-spam procedures are currently employed to distinguish spam from legitimate e-mails; however spammers and phishers employ dynamic spam structures to obfuscate e- mail content to circumvent these procedures. Apart from other technological procedures various adaptive learning filters have been developed that have an ability to allow an algorithm to constantly learn what sort of e-mail’s or e-mail content a recipient would typically process and what to see in normal course of its business. These filters are based on complex statistical techniques that classify future e-mails based on the word content of accepted e-mails. The statistical techniques employed in these filters separate an incoming e-mail into tokens and assign a probability value to each token. The probability of each token are collectively used to calculate the overall spam probability and accordingly the incoming e-mail is scored as spam, probably spam or legitimate e-mail.
    [Show full text]
  • Email Deliverability: the Ultimate Guide
    1 Email Deliverability: the Ultimate Guide Why does email deliverability matter? According to the "2015 Email Data Quality Trends Report" by Experian, a majority (73%) of companies experienced email deliverability issues in the past 12 months. Return Path has reported that over 20% of legitimate email are missing. Undoubtedly, marketers have problems with deliverability, and that negatively affects their business. "The most common consequences of poor email deliverability are the inability to communicate with subscribers (41%), poor customer service (24%), unnecessary costs (22%), and lost revenue (15%)" - the Experian's "2015 Email Data Quality Trends Report." How to deliver emails to the recipient's Inbox? This simplest question might have the most complicated answer. As an email marketer, you have to make people engage with your emails in a positive way: open, click, forward or reply. Recipient engagement is a powerful factor that mailbox providers rely on when filtering inbound messages. Keeping recipients engaged is not just about sending beautiful, optimized emails. It's also about positioning yourself as a reputable sender, avoiding spam filters and getting to the user's Inbox. That's where email deliverability comes in. 2 In this article, we'll touch the most important factors that determine email deliverability and should be on the mind of every marketer: 1. Permission-Based Marketing: - Single Opt-In vs. Confirmed Opt-In - Pre-checked Boxes or Passive Opt-in - Subscriber's Expectations 2. Sender Reputation: - Branding - Spam Traps - Bounces and Complaints - Monitoring Tools 3. Sending Infrastructure: - Shared IP vs. Dedicated IP - Blacklists - Email Authentication - Feedback Loops 3 1.
    [Show full text]
  • Contents Software Description:
    Swift Email Processor v2.0 www.webemailverifier.com Support: www.webemailverifier.com/supportsuite/ Contents Software Description: ................................................................................................................................... 3 Program GUI Screenshots: ............................................................................................................................ 4 Program Features: ........................................................................................................................................ 9 Chapter 1: Installation and Settings Configuration ..................................................................................... 11 Configuration and Settings: .................................................................................................................... 12 Chapter 2: Using the Proxy Servers Module ............................................................................................... 31 Chapter 2.1: X-Originating-IP email header & Proxy Acceptable Usage Policy .......................................... 35 Chapter 3: Sending Messages using the Sender Module ........................................................................... 36 3.1: Unsubscribe link placements ........................................................................................................... 49 3.2: Installing and using the MySQL/MariaDB CRUD API script or application ...................................... 50 3.2.1 Installing MySQL API on Linux Server running
    [Show full text]
  • Public Sender Score System(S3) by Esps for Email Spam Mitigation with Score…
    Paper—Public Sender Score System(S3) by ESPs for Email Spam Mitigation with Score… Public Sender Score System(S3) by ESPs for Email Spam Mitigation with Score Management in Mobile Application https://doi.org/10.3991/ijim.v14i17.16609 Lucky Kannan (), Jebakumar R SRM Institute of Science and Technology, Chennai, India [email protected] Abstract—Many businesses use email as a medium for advertising and they use emails to communicate with their customers. In the email world, the most common issue that remains unresolved even now is spamming or in other terms unsolicited bulk email. Currently, there is no common way to regulate the practices of an email sender. This proposed system is to formulate a protocol common for all the ESPs or inbox providers and a centralized system that will easily find the spammers and block them. By this method, the Email Service Providers (ESPs) or Inbox Providers need not wait for the sender behaviour and then take actions on the sender or sender domain or sender IP address. Instead, they can get the sender history of reputation from blockchain where the ESPs or Inbox Provider provides a score based on the emails they have received from the sender. The ESPs can get the Public Sender Score(S3) from the mobile application or web application which provides the score management user interface and APIs. The email marketers can also monitor their score through the application. Keywords—Email, spam detection, sender score, reputation system, blockchain, mobile application, marketing email. 1 Introduction The email industry has been functioning steadily for very long and they needed minimal or no change over the past years.
    [Show full text]
  • A Practical Guide to Protect Your Inbox Reputation and Support Your Ongoing Email Deliverability Goals Deliverability Best Practice Guide - 2019
    A PRACTICAL GUIDE TO PROTECT YOUR INBOX REPUTATION AND SUPPORT YOUR ONGOING EMAIL DELIVERABILITY GOALS DELIVERABILITY BEST PRACTICE GUIDE - 2019 For professionals in email marketing, achieving great deliverability begins with a solid understanding of email best practices. And with today’s constantly changing environment, it’s nearly impossible to keep up with current trends. At the foundation of this understanding are practices that align with legal and regulatory compliance requirements that protect consumer rights. The lens we must look through is always the customer’s. Federal and Global laws are only a small part of safely managing inbox reputation, however, and all email marketing professionals must understand the importance of complying with a set of best practices meant to protect reputation and ensure strong, unrestricted email deliverability. In this whitepaper, eDataSource provides readers with an extensive introduction to these best practices. In the future, we’ll be releasing more in-depth coverage of sending best practices so don’t forget to sign up for our email newsletter to get the latest and greatest from our resident industry experts. 2 DELIVERABILITY BEST PRACTICE GUIDE - 2019 TABLE OF CONTENTS EMAIL MARKETING LAWS AND REGULATIONS THE CAN-SPAM ACT 5 Best Practices Go Far Beyond Legal Compliance The Basics When Does CAN-SPAM Apply? CANADA’S ANTI-SPAM LEGISLATION (CASL) 7 What Is Considered A CEM Implied vs. Expressed Consent Content and Record-Keeping Unsubscribes GENERAL DATA PROTECTION REGULATION (GDPR) 8 Consent Opt-Out
    [Show full text]
  • Deliverability-Declassified-1.Pdf
    DELIVERABILITY A Practical Guide to Protect Your Inbox Reputation and Support Your Ongoing Email Deliverability Goals Deliverability Declassified 1 For professionals in email marketing, achieving great deliverability begins with a solid understanding of email best practices. And with today’s constantly changing environment, it’s nearly impossible to keep up with current trends. At the foundation of this understanding are practices that align with legal and regulatory compliance requirements that protect consumer rights. The lens we must look through is always the customer’s. Federal and Global laws are only a small part of safely managing inbox reputation, however, and all email marketing professionals must understand the importance of complying with a set of best practices meant to protect reputation and ensure strong, unrestricted email deliverability. In this guide, SparkPost provides readers with an extensive introduction to these best practices. Once you’ve put these best practices into place, monitoring and measuring your deliverability will be crucial. An easy way to start is to visit sparkpost.com/deliveryindex, enter your mailing domain, and see a quick snapshot of your deliverability in 7 day increments. You’ll be able to check your inbox placement across all of the major inbox providers, as well as key areas impacting your inbox reputation (including blacklists, spam trap hits, and authentication elements). You can also check your competition’s sending domains to see how you measure up. It’s free to use, and a great way to understand your deliverability and identify possible ways to improve it. Deliverability Declassified 2 TABLE OF CONTENTS EMAIL MARKETING LAWS AND REGULATIONS........................................
    [Show full text]
  • Dissertation
    DISSERTATION Titel der Dissertation Efficient Feature Reduction and Classification Methods Applications in Drug Discovery and Email Categorization Verfasser Mag. Andreas Janecek angestrebter akademischer Grad Doktor der technischen Wissenschaften (Dr. techn.) Wien, im Dezember 2009 Studienkennzahl lt. Studienblatt: A 786 880 Dissertationsgebiet lt. Studienblatt: Informatik Betreuer: Priv.-Doz. Dr. Wilfried Gansterer 2 Contents 1 Introduction 11 1.1 Motivation and Problem Statement . 11 1.2 Synopsis . 12 1.3 Summary of Publications . 13 1.4 Notation . 14 1.5 Acknowledgements . 14 I Theoretical Background 15 2 Data Mining and Knowledge Discovery 17 2.1 Definition of Terminology . 18 2.2 Connection to Other Disciplines . 19 2.3 Data . 20 2.4 Models for the Knowledge Discovery Process . 21 2.4.1 Step 1 { Data Extraction . 23 2.4.2 Step 2 { Data Pre-processing . 24 2.4.3 Step 3 { Feature Reduction . 25 2.4.4 Step 4 { Data Mining . 26 2.4.5 Step 5 { Post-processing and Interpretation . 29 2.5 Relevant Literature . 31 3 Feature Reduction 33 3.1 Relevant Literature . 34 3.2 Feature Selection . 34 3.2.1 Filter Methods . 35 3.2.2 Wrapper Methods . 40 3.2.3 Embedded Approaches . 41 3.2.4 Comparison of Filter, Wrapper and Embedded Approaches . 42 3.3 Dimensionality Reduction . 43 3 4 CONTENTS 3.3.1 Low-rank Approximations . 43 3.3.2 Principal Component Analysis . 44 3.3.3 Singular Value Decomposition . 48 3.3.4 Nonnegative Matrix Factorization . 50 3.3.5 Algorithms for Computing NMF . 51 3.3.6 Other Dimensionality Reduction Techniques . 54 3.3.7 Comparison of Techniques .
    [Show full text]
  • Mastering Your Email Reputation: Seven Strategies for Improving
    WHITE PAPER | AugusT 2008 Best Practices in Email Marketing Mastering Your Email Reputation: seven strategies for Improving Deliverability Effective tips and best practices for safeguarding your email sender reputation in today’s ever-changing email environment PUBLISHED BY: StrongMail Systems, Inc. StrongMail Systems UK, Ltd 1300 Island Drive, Suite 200 Prospect House, Crendon Street Redwood City, CA 94065 High Wycombe, Bucks Telephone: (650) 421-4200 HP13 6LA Facsimile: (650) 421-4201 Telephone: +44 (0) 1494 435 120 Copyright © 2008 StrongMail Systems, Inc. All rights reserved. No part of the contents of this publication may be reproduced or transmitted in any form or by any means without the written permission of StrongMail Systems, Inc. STRONGMAIL and the STRONGMAIL logo are registered trademarks in the United States, other countries or both. All Rights Reserved. StrongMail Systems UK, Ltd is a company registered in England and Wales at Carmelite, 50 Victoria Embankment, Blackfriars, London EC4Y 0DX. Reg. No. 6398867. VAT # GB 925 07 6228. Trading Address: Prospect House, Crendon Street, High Wycombe, Bucks HP13 6LA. www.strongmail.com Table of Contents STRATEGY 1: MAINTAIN A CLEAN LIST .......................................................................1 Bounce Management .............................................................................................................1 List Hygiene ................................................................................................................................3 Feedback
    [Show full text]