Classifying Bats to Species Based on Ultrasound Recordings: Systems Using Artificial Neural Networks and Random Forests


IT 20 047
Degree project (Examensarbete), 30 credits (30 hp)
August 2020
Samuel Pettersson
Department of Information Technology (Institutionen för informationsteknologi)
Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH unit (UTH-enheten)

Abstract

Methods from machine learning are applied to species identification of bats based on short audio recordings. Feedforward artificial neural networks (densely connected as well as convolutional) perform the classification based on spectrograms of the recordings. This method is largely unexplored in the literature for acoustic species-identification of bats, although it has seen great success for species identification of birds. Random forests are trained to classify individual bat calls (detected automatically using preëxisting software) based on temporal and spectral features (automatically measured in spectrograms of the recordings). The random forests then classify entire recordings by aggregating the classifications of all calls in each recording.

The deep convolutional neural networks perform the best, achieving an accuracy of 89.09 % (averaged over all classes) on a held-out test set and proving the viability of deep learning for acoustic species-identification of bats. The best-performing random forest achieves an accuracy of 83.10 % (averaged over all classes) on a held-out validation set. These results seem to compare decently to results found in the literature, but a fair comparison is difficult to make.
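The two computations the abstract relies on, aggregating per-call classifications into one label per recording and averaging accuracy over classes, can be illustrated with a minimal sketch. It rests on assumptions that this section does not confirm: aggregation is taken to be a plain majority vote, a recording without detected calls is mapped to 'No bat', "accuracy averaged over all classes" is read as the mean of the per-class fractions of correctly labeled recordings, and the names `aggregate_calls`, `class_averaged_accuracy`, and the species labels are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_calls(call_predictions):
    """Majority vote over per-call labels (assumed aggregation rule).

    Ties go to the label encountered first. An empty list is mapped to
    'No bat', which is an assumption made for this sketch only.
    """
    if not call_predictions:
        return "No bat"
    return Counter(call_predictions).most_common(1)[0][0]


def class_averaged_accuracy(true_labels, predicted_labels):
    """Mean over classes of the fraction of each class labeled correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        correct[truth] += int(truth == guess)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy data: per-call predictions for four recordings (placeholder species names).
recordings = {
    "rec1": ["species_A", "species_A", "species_B"],
    "rec2": ["species_B", "species_B"],
    "rec3": [],
    "rec4": ["species_A", "species_B", "species_B"],
}
truth = {
    "rec1": "species_A",
    "rec2": "species_B",
    "rec3": "No bat",
    "rec4": "species_A",
}

predicted = {name: aggregate_calls(calls) for name, calls in recordings.items()}
score = class_averaged_accuracy(
    [truth[name] for name in recordings],
    [predicted[name] for name in recordings],
)
```

Averaging per class rather than per recording keeps a rare species from being drowned out by a common one: in the toy data, species_A is right in one of two recordings while the other two classes are always right, so the score is (0.5 + 1 + 1) / 3.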
Supervisor (Handledare): Lars Pettersson
Subject reader (Ämnesgranskare): Alexander Medvedev
Examiner (Examinator): Mats Daniels
Printed by (Tryckt av): Reprocentralen ITC

Contents

1 Introduction
2 Background
  2.1 Bat surveying in general
    2.1.1 Motivation
    2.1.2 Methods
  2.2 Acoustic bat-surveying
    2.2.1 Echolocation
    2.2.2 Social calls
    2.2.3 History of bat acoustics
    2.2.4 Automated feature extraction from bat recordings
    2.2.5 Automated bat-species identification
    2.2.6 Concerns regarding automated species-identification
  2.3 Signal processing
  2.4 Supervised machine learning
    2.4.1 Artificial neural networks
    2.4.2 Convolutional neural networks
    2.4.3 Decision trees
    2.4.4 Random forests
  2.5 Popular tools for machine learning
3 Materials and methods
  3.1 The dataset
  3.2 Preprocessing
  3.3 Hardware and software
  3.4 Part 1: Classifiers using low-level features
    3.4.1 Using unweighted data
    3.4.2 Using weighted data
    3.4.3 Fine tuning
  3.5 Part 2: Classifiers using high-level features
4 Results
  4.1 Part 1: Classifiers using low-level features
  4.2 Part 2: Classifiers using high-level features
5 Discussion
  5.1 Future work
6 Conclusion
7 Acknowledgments
References

1 Introduction

Bats of most species make use of echolocation in flight in order to navigate, avoid obstacles, and find prey. In other words, they periodically emit sound and listen to the echoes to get an understanding of their surroundings. These echolocation calls turn out to differ between species to varying degrees, allowing for the identification of the species of a bat by acoustic means.

This project is performed at Pettersson Elektronik AB, which is a company based in Uppsala that develops hardware and software for bioacoustics.
The task is to develop a prototype for classifying short ultrasound recordings of echolocation calls and other vocalizations of bats according to the species of the vocalizing bats, i.e., for acoustic species-identification of bats. Moreover, the classifier should be able to identify recordings that do not actually contain any bat vocalizations; such recordings are plentiful in practice. Preferably, the output of the classifier should be not just a single most likely species but a degree of belief for each species under consideration. It should be possible to refine and extend the prototype into a product (or incorporate the classifier into an existing product), but such refinement is outside the scope of the project.

There are two important use cases for such a classifier. First, large quantities of bat recordings (e.g., as obtained from unattended recording devices) can be analyzed in an offline setting, typically on a desktop computer. This relieves bat researchers of the time-consuming task of manually classifying all the recordings. Second, individual recordings made with a hand-held device can be classified in real time. Compared to the first use case, this places more severe computational and memory constraints on the classifier. This project focuses on the first use case.

Two different approaches to the acoustic species-identification of bats are explored in this project, which is correspondingly divided into two parts. Some aspects are common to both parts: the classifiers distinguish between vocalizations of 13 Swedish bat species as well as other sounds (e.g., vocalizations of other animals and weather-induced noise, grouped into a single 'No bat' class). This is achieved through the use of methods from machine learning and a dataset of audio recordings labeled by human experts with the species of the vocalizing bats.
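The preference stated above, that the classifier report a degree of belief for each class rather than a single best guess, can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a softmax over raw scores, which is one common way to obtain such beliefs and is not confirmed by this section as the method used in the project. The class names are placeholders (the text only says there are 13 Swedish species plus 'No bat'), and `softmax` and `rank_beliefs` are hypothetical helper names.

```python
import math

# Placeholder names for the 13 species; only the 'No bat' class is named in the text.
CLASSES = [f"species_{i:02d}" for i in range(1, 14)] + ["No bat"]


def softmax(scores):
    """Turn raw classifier scores into degrees of belief that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_beliefs(scores, classes=CLASSES):
    """Pair each class with its degree of belief, most likely class first."""
    if len(scores) != len(classes):
        raise ValueError("expected exactly one score per class")
    beliefs = softmax(scores)
    return sorted(zip(classes, beliefs), key=lambda pair: pair[1], reverse=True)


# Illustrative scores only: every species scores 0.0 and 'No bat' scores 2.0.
scores = [0.0] * 13 + [2.0]
ranked = rank_beliefs(scores)
```

Returning the full ranked list rather than only the top class lets a user see when the classifier is torn between two similar species, or between a species and 'No bat'.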
The classifiers are trained to imitate the human labeling of the recordings in such a way that their labeling abilities generalize, to a large extent, to previously unseen data. The classifiers are limited to recordings of at most one species of bat each.

In the first part of the project, entire recordings (up to 5 seconds) are classified using artificial neural networks (traditional densely connected as well as convolutional networks). The inputs to the networks are power spectral density spectrograms. Most of the artificial neural networks are deep, and so this first part of the project may be thought of as exploring the viability of deep learning for acoustic species-identification of bats, which apparently had not been done in the literature before 2020. Seeing as very little domain-specific knowledge is used, this approach is likely applicable to species identification of other vocalizing animals.

In the second part of the project, a more traditional approach to acoustic species-identification of bats is taken. Individual bat calls rather than entire recordings are classified to species using random forests. The classifier inputs are temporal and spectral measurements of a call, as extracted by the callViewer software. By aggregating the classifications of all calls in a recording, the classifier may be used to classify entire recordings.

All classifiers are trained on a dedicated training set and evaluated on a held-out validation set. Two measures are computed: the accuracy averaged over all classes, and confusion matrices, which reveal how the predictions relate to the true labels. The best-performing classifier is finally reëvaluated on a held-out test set (different from the validation set) to ensure that the performance of the final classifier is measured as fairly as possible.

2 Background

Bats have inhabited the Earth for over 50 million years [119].
They constitute a speciose (species-rich) and in many ways extraordinary order of mammals, making use of such novelties as powered flight and echolocation.

Bats are one of only four groups of animals to have developed powered flight (the other three being insects, pterosaurs, and birds) [9], which makes bats the only mammals capable of powered flight. This has contributed to their rich dietary diversity (which includes insects, fruits, leaves, flowers, nectar, pollen, seeds, fish, frogs, and blood), diverse roosting habitats (which include foliage, caves, hollow trees, crevices in rocks and trees, and various man-made structures), reproductive strategies, and social behavior, and likely to their prevalence across the Earth [68].

Most bats (more than 85 % of all species) make use of echolocation [38, 22], which is the emission of sound and use of its echoes to get an understanding of the surroundings. Bats use echolocation for navigation and obstacle avoidance [12] and for finding prey [38].

2.1 Bat surveying in general

Identifying the species of bats is critical in surveys and monitoring programs [138], and it is also the focus of this project.

2.1.1 Motivation

Bats are excellent indicators for human-induced environmental change (physical or chemical alterations), the ecological effects of environmental change (how biotic systems are affected), and biodiversity (richness and variety of species) [113]. In short, they are excellent bioindicators. Some reasons for this are:

• Bats are a diverse group of mammals. In terms of number of individuals, bats may be among the most abundant groups of mammals [68]. In terms of number of species, there are over 1300 species of bats, which amounts to more than a fifth of all mammalian species and makes them second in species richness only to rodents among all orders of mammals [113, 22].
• Bats are globally distributed. The polar regions and some remote oceanic islands are the only regions without living bats [68].
• Bats are taxonomically stable. In other words, they have characteristics that make them easy to identify, and the rate of species invalidation through synonymy is low [68].
• Insectivorous (insect-eating) bats occupy high trophic levels, which makes them more sensitive to the accumulation of pesticides and other toxins than, e.g., herbivorous (plant-eating) animals [68].
• Bats offer several ecosystem services, such as pollination, seed dispersal, and pest control. Therefore, the change in bat-population size reflects the state of plants and insects in the ecosystem [68].
• Bats have a low reproductive rate, which makes trends in the size of bat populations less sensitive to noise [68].

Other animals used as bioindicators include birds as well as insects that are easily sampled [68].