Online Incremental Feature Learning with Denoising Autoencoders


Guanyu Zhou, Kihyuk Sohn, Honglak Lee
Department of EECS, University of Michigan, Ann Arbor, MI 48109

Appearing in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012, La Palma, Canary Islands. Volume XX of JMLR: W&CP XX. Copyright 2012 by the authors.

Abstract

While determining model complexity is an important problem in machine learning, many feature learning algorithms rely on cross-validation to choose an optimal number of features, which is usually challenging for online learning from a massive stream of data. In this paper, we propose an incremental feature learning algorithm to determine the optimal model complexity for large-scale, online datasets based on the denoising autoencoder. This algorithm is composed of two processes: adding features and merging features. Specifically, it adds new features to minimize the objective function's residual and merges similar features to obtain a compact feature representation and prevent over-fitting. Our experiments show that the proposed model quickly converges to the optimal number of features in a large-scale online setting. In classification tasks, our model outperforms the (non-incremental) denoising autoencoder, and deep networks constructed from our algorithm perform favorably compared to deep belief networks and stacked denoising autoencoders. Further, the algorithm is effective in recognizing new patterns when the data distribution changes over time in the massive online data stream.

1 Introduction

In recent years, there has been much interest in online learning algorithms [1, 9, 29, 8] with the purpose of developing an intelligent agent that can perform life-long learning in complex real environments. To that end, many large-scale learning algorithms have also been proposed. In particular, Support Vector Machines (SVMs) [5] have been one of the most popular discriminative methods for large-scale learning which includes selective sampling to reduce the number of support vectors [4, 24] and incremental support vectors to learn an SVM with a small number of examples in the early phase of training [7, 23]. Typically, linear SVMs are used in large-scale settings due to their efficient training as well as the scalability to large-scale datasets. However, this approach is limited in that the feature mapping has to be fixed during the training; therefore, it cannot be adapted to the training data.

Alternatively, there are several methods for jointly training the feature mapping and the classifier, such as via multi-task learning [31, 6], transfer learning [2, 22, 17], non-parametric Bayesian learning [1, 10], and deep learning [16, 3, 28, 27, 20]. Among these, we are interested in deep learning approaches that have shown promise in learning features from complex, high-dimensional unlabeled and labeled data. Specifically, we present a large-scale feature learning algorithm based on the denoising autoencoder (DAE) [32]. The DAE is a variant of autoencoders [3] that extracts features by adding perturbations to the input data and attempting to reconstruct the original data. This process learns features that are robust to input noise and useful for classification.

Despite the promise, determining the optimal model complexity, i.e., the number of features for DAE, still remains a nontrivial question. When there are too many features, for example, the model may overfit the data or converge very slowly. On the other hand, when there are too few features, the model may underfit due to the lack of relevant features. Further, finding an optimal feature set size becomes even more difficult for large-scale or online datasets whose distribution may change over time, since the cross-validation may be challenging given a limited amount of time or computational resources.

To address this problem, we propose an incremental algorithm to learn features from the large-scale online data by adaptively incrementing the features depending on the data and the existing features, using DAE as a basic building block. Specifically, new features are added and trained to minimize the residual of the objective function in generative tasks (e.g., minimizing reconstruction error) or discriminative tasks (e.g., minimizing supervised loss function). At the same time, similar features are merged to avoid redundant feature representations. Experimental results show that our incremental feature learning algorithms perform favorably compared to the non-incremental feature learning algorithms, including the standard DAE, the deep belief network (DBN), and the stacked denoising autoencoder (SDAE), on classification tasks with large datasets. Moreover, we show that incremental feature learning is more effective in quickly recognizing new patterns than the non-incremental algorithms when the data distribution (e.g., the ratio of labels) is highly non-stationary.

2 Preliminaries

In this paper, we use the denoising autoencoder [32] as a building block for incremental feature learning. Based on the idea that good features should be robust to input corruption, the DAE tries to recover the input data $x$ from encoded representation $f(\tilde{x})$ of corrupted data $\tilde{x}$ via a decoding function $g(h)$. More precisely, there are four components in the DAE:

• A conditional distribution $q(\tilde{x} \mid x)$, that stochastically perturbs input $x$ to a corrupted version $\tilde{x}$.
• An encoding function $f(\tilde{x}) \equiv h \in \mathbb{R}^N$, which gives a representation of input data $x$.
• A decoding function $g(h) \equiv \hat{x} \in \mathbb{R}^D$, which recovers the data from the representation $h$.
• A differentiable cost function $\mathcal{L}(x)$ computes the dissimilarity between input and reconstruction.

Throughout the paper, we consider a case in which the input and the hidden variables are binary (or bounded between 0 and 1), i.e., $x \in [0, 1]^D$ and $h \in [0, 1]^N$. As described in [32], we consider the conditional distribution $q(\tilde{x} \mid x) = \prod_{i=1}^{D} q(\tilde{x}_i \mid x_i)$ that randomly sets some of the coordinates to zeros. By corrupting the input elements, the DAE attempts to learn informative representations that can successfully recover the missing coordinates to reconstruct the original input data. In other words, the DAE is trained to fill in the missing values introduced by corruption.

The DAE perturbs the input $x$ to $\tilde{x}$ and maps it to the hidden representation, or abusing notation, the feature $h$ through the encoding function defined as follows:

$$h = f(\tilde{x}) = \mathrm{sigm}(W\tilde{x} + b),$$

where $W \in \mathbb{R}^{N \times D}$ is a weight matrix, $b \in \mathbb{R}^N$ is a hidden bias vector, and $\mathrm{sigm}(s) = \frac{1}{1 + \exp(-s)}$ is the sigmoid function. Then, we reconstruct $\hat{x} \in [0, 1]^D$ from $h$ using the decoding function:

$$\hat{x} = g(h) = \mathrm{sigm}(W^T h + c),$$

where $c \in \mathbb{R}^D$ is an input bias vector. Here, we implicitly assumed tied weights for encoding and decoding functions, but we can consider separate weights as well.

In this work, we use cross-entropy as a cost function:

$$\mathcal{L}(x) = \mathcal{H}(x, \hat{x}) = -\sum_{i=1}^{D} \left[ x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \right]$$

Finally, the objective is to learn model parameters that minimize the cost function

$$\{\hat{W}, \hat{b}, \hat{c}\} = \arg\min_{W, b, c} \sum_{j} \mathcal{L}(x^{(j)}),$$

and this can be optimized by gradient descent (or conjugate gradient, L-BFGS, etc.) with respect to all parameters for encoding and decoding functions.

3 Incremental Feature Learning

In many cases, the number of hidden units $N$ is fixed during the training of DAE and the optimal value is determined by cross-validation. However, cross-validation is computationally prohibitive and often infeasible for large datasets. Furthermore, it may not perform well when the distribution of training data significantly changes over time.

To address these issues, we present an adaptive feature learning algorithm that can handle large-scale datasets without explicitly performing cross-validation over the number of features on the whole dataset. In detail, our incremental feature learning algorithm is composed of two key processes: (1) adding new feature mappings to the existing feature set and (2) merging parts of the existing feature mappings that are considered redundant. We describe the details of the proposed algorithm in this section.

3.1 Overview

The incremental feature learning algorithm addresses the following two issues: (1) when to add and merge features and (2) how to add and merge features. To deal with these problems systematically, we define an objective function to determine when to add or merge ...

Algorithm 1 Incremental feature learning
  repeat
    Compute the objective $\mathcal{L}(x)$ for input $x$.
    Collect hard examples into a subset $B$ (i.e., $B \leftarrow B \cup \{x\}$ if $\mathcal{L}(x) > \mu$).
    if $|B| > \tau$ then
      Select $2\Delta M$ candidate features and merge them into $\Delta M$ features (Section 3.3).
      Add $\Delta N$ new features by greedily optimizing ...

... well as the weight matrix $W$ and the hidden bias $b$. We use $\theta$ to denote all parameters $\{W, b, c, \Gamma, \nu\}$.

For incremental feature learning, we use $\mathcal{N}$ to denote new features and $\mathcal{O}$ to denote existing or old features. For example, $f_{\mathcal{O}}(\tilde{x}) \equiv h_{\mathcal{O}} \in [0, 1]^N$ represents an encoding function with existing features, and $f_{\mathcal{N}}(\tilde{x}) \equiv h_{\mathcal{N}} \in [0, 1]^{\Delta N}$ denotes an encoding function with newly added features. A combined encoding function is then written as $f_{\mathcal{O} \cup \mathcal{N}}(\tilde{x}) = [h_{\mathcal{O}}; h_{\mathcal{N}}] \in [0, 1]^{N + \Delta N}$.
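
The tied-weight DAE of Section 2, the combined encoding over old and new features, and the hard-example step of Algorithm 1 can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python lists rather than a numerical library, and the function names (`corrupt`, `encode`, `decode`, `cross_entropy`, `encode_combined`, `collect_hard_examples`), the `drop_prob` parameter, the `eps` guard inside the logarithm, and all numeric values are assumptions made for illustration only. Training, feature merging, and the greedy feature-addition step are not shown.

```python
import math
import random


def sigm(s):
    """Logistic sigmoid, sigm(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + math.exp(-s))


def corrupt(x, drop_prob, rng):
    """Masking noise q(x~ | x): zero each coordinate independently
    with probability drop_prob (an assumed, hypothetical parameter)."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]


def encode(W, b, x_tilde):
    """h = sigm(W x~ + b); W is a list of N rows, each of length D."""
    return [sigm(sum(w * xi for w, xi in zip(row, x_tilde)) + bj)
            for row, bj in zip(W, b)]


def decode(W, c, h):
    """x_hat = sigm(W^T h + c), reusing the encoder matrix (tied weights)."""
    return [sigm(sum(W[j][i] * h[j] for j in range(len(h))) + c[i])
            for i in range(len(c))]


def cross_entropy(x, x_hat, eps=1e-12):
    """L(x) = -sum_i [x_i log x_hat_i + (1 - x_i) log(1 - x_hat_i)].
    eps is an assumed numerical guard, not part of the definition."""
    return -sum(xi * math.log(xh + eps) + (1.0 - xi) * math.log(1.0 - xh + eps)
                for xi, xh in zip(x, x_hat))


def encode_combined(W_old, b_old, W_new, b_new, x_tilde):
    """[h_O; h_N]: old feature activations followed by new ones."""
    return encode(W_old, b_old, x_tilde) + encode(W_new, b_new, x_tilde)


def collect_hard_examples(B, x, loss, mu):
    """B <- B U {x} if L(x) > mu."""
    if loss > mu:
        B.append(x)
    return B


# Tiny illustrative run with D = 4 inputs, N = 3 old and dN = 2 new features.
rng = random.Random(0)
D, N, dN = 4, 3, 2
W_old = [[rng.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(N)]
W_new = [[rng.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(dN)]
b_old, b_new, c = [0.0] * N, [0.0] * dN, [0.0] * D

x = [1.0, 0.0, 1.0, 1.0]
x_tilde = corrupt(x, 0.25, rng)
h = encode(W_old, b_old, x_tilde)
x_hat = decode(W_old, c, h)
loss = cross_entropy(x, x_hat)
h_all = encode_combined(W_old, b_old, W_new, b_new, x_tilde)
```

Here `h` has length $N$, `x_hat` has length $D$, and `h_all` has length $N + \Delta N$, matching the dimensions stated in the text. In a full implementation the loss would be summed over examples and minimized with respect to `W`, `b`, and `c` by gradient descent or a similar optimizer.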