Chapter 2. Machine Learning Overview
2.1 Overview

Learning denotes the process of acquiring new declarative knowledge, organizing that knowledge into general yet effective representations, and discovering new facts and theories through observation and experimentation. Learning is one of the most important skills that humans can master, and it distinguishes us from the other animals on this planet. For example, from past experience we know that the sun rises in the east and sets in the west, that the moon revolves around the earth, and that a year has 365 days. Ever since computers became available, people have been striving to implant such skills into them. Knowledge that is clear to humans can be represented explicitly in a program as a set of simple reasoning rules. Meanwhile, over the past couple of decades, extremely large amounts of data have been generated in various areas, including the World Wide Web (WWW), telecommunication, climate, medical science, and transportation. In these applications, the knowledge to be extracted from such massive data can be so complex that it can hardly be captured by any explicit, fine-grained specification of what the patterns look like. Solving this problem has been, and still remains, one of the most challenging and fascinating long-range goals of machine learning.

Machine learning is a discipline that aims at endowing programs with the ability to learn and adapt. In machine learning, experiences are usually represented as data, and the main objective is to derive models from data that capture the complicated hidden patterns. When we feed new data to these learned models, they return inference results that match the captured patterns.
Generally, to test the effectiveness of these learned models, different evaluation metrics can be applied to measure the quality of the inference results.

Existing machine learning tasks have become very diverse, but based on the supervision (label) information used in model building, they can generally be categorized into two types: “supervised learning” and “unsupervised learning.” In supervised learning, the data used to train models are labeled in advance, where the labels indicate the categories of the data instances. Representative examples of supervised learning tasks include “classification” and “regression.” In unsupervised learning, no label information is needed when building the models, and the representative example of an unsupervised learning task is “clustering.” Between unsupervised learning and supervised learning there is another type of learning task, named “semi-supervised learning.” Semi-supervised learning is a class of learning tasks and techniques that make use of both labeled and unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data. Besides these, there are many other categories of learning tasks, such as “transfer learning,” “sparse learning,” “reinforcement learning,” and “ensemble learning.”

© Springer Nature Switzerland AG 2019. J. Zhang, P. S. Yu, Broad Learning Through Fusions, https://doi.org/10.1007/978-3-030-12528-8_2

To make this book self-contained, the goal of this chapter is to provide a rigorous, yet easy to follow, introduction to the main concepts underlying machine learning, including the detailed data representations, supervised learning and unsupervised learning tasks and models, and several classic evaluation metrics.
Considering the popularity of deep learning [14] in recent years, we will also provide a brief introduction to deep learning models in this chapter. Many other learning tasks, such as semi-supervised learning [7, 57], transfer learning [31], and sparse learning [28], will be introduced in the following chapters when discussing the specific research problems in detail.

2.2 Data Overview

Data is a physical representation of information, from which useful knowledge can be effectively discovered by machine learning and data mining models. Good input data can be the key to the discovery of useful knowledge. In this section, we introduce some background knowledge about data, including data types, data quality, data transformation and processing, and data proximity measures.

2.2.1 Data Types

A data set refers to a collection of data instances, where each data instance denotes the description of a concrete information entity. In real scenarios, data instances are usually represented by a number of attributes capturing the basic characteristics of the corresponding information entity. For instance, assume we see a group of Asian and African elephants (i.e., the elephant is the information entity). As shown in Table 2.1, each elephant in the group has a certain weight, height, skin smoothness, and body shape (i.e., different attributes), and can be represented by these attributes as an individual data instance. Generally, as shown in Fig. 2.1, the mature Asian elephant is smaller than the mature African elephant. African elephants are discernibly larger, about 8.2–13 ft (2.5–4 m) tall at the shoulder, and weigh between 5000 and 14,000 lbs (2.5–7 t). Asian elephants are smaller, measuring about 6.6–9.8 ft (2–3 m) tall at the shoulder and weighing between 4500 and 12,000 lbs (2.25–6 t).
In addition, from their ear size, skin smoothness, trunk “finger” number, etc., we can also effectively differentiate them from each other. The group of elephant data instances forms an elephant data set, as illustrated in Table 2.1. The instance-attribute style data format is a general way of representing data. For different types of data sets, the attributes used to describe the information entities will be different. In the following parts of this subsection, we first discuss the categories of attributes and then briefly introduce different data types.

Table 2.1 An example of the elephant data set (ear size: 1 denotes large; 0 denotes small)

Elephant ID  Weight (t)  Height (m)  Skin      Ear size  Trunk “finger”  Category  ···
1            6.8         4.0         Wrinkled  1         2               African   ···
2            5.8         3.5         Wrinkled  1         2               African   ···
3            4.5         2.1         Smooth    0         1               Asian     ···
4            5.8         3.1         Wrinkled  0         1               Asian     ···
5            4.8         2.7         Wrinkled  1         2               African   ···
6            5.6         2.8         Smooth    1         1               Asian     ···
···          ···         ···         ···       ···       ···             ···       ···

Fig. 2.1 A comparison of Asian elephant vs African elephant (panels: Asian Elephant, African Elephant)

2.2.1.1 Attribute Types

The attribute is the basic element in describing information entities, and we provide its formal definition as follows.

Definition 2.1 (Attribute) Formally, an attribute denotes a basic property or characteristic of an information entity. Generally, attribute values can vary from one information entity to another in a provided data set.

In the example provided in Table 2.1, the elephant body weight, height, skin smoothness, ear size, and trunk “finger” number are all attributes of the elephant information entities.
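The instance-attribute format of Table 2.1 can be sketched in Python as follows. This is only a minimal illustration: the class name `Elephant`, the field names, and the helper `attribute_column` are hypothetical choices made here, not part of the book; the six rows are copied from Table 2.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Elephant:
    """One data instance; each field is one attribute (names are illustrative)."""
    elephant_id: int
    weight_t: float      # weight, in the unit "t" used by Table 2.1
    height_m: float      # shoulder height in meters
    skin: str            # "Wrinkled" or "Smooth"
    ear_size: int        # 1 denotes large; 0 denotes small
    trunk_fingers: int   # number of trunk "fingers"
    category: str        # "African" or "Asian"


# The six rows listed in Table 2.1.
ELEPHANTS = [
    Elephant(1, 6.8, 4.0, "Wrinkled", 1, 2, "African"),
    Elephant(2, 5.8, 3.5, "Wrinkled", 1, 2, "African"),
    Elephant(3, 4.5, 2.1, "Smooth", 0, 1, "Asian"),
    Elephant(4, 5.8, 3.1, "Wrinkled", 0, 1, "Asian"),
    Elephant(5, 4.8, 2.7, "Wrinkled", 1, 2, "African"),
    Elephant(6, 5.6, 2.8, "Smooth", 1, 1, "Asian"),
]


def attribute_column(instances, name):
    """Collect the values of one attribute across all data instances."""
    return [getattr(instance, name) for instance in instances]


print(attribute_column(ELEPHANTS, "weight_t"))
# [6.8, 5.8, 4.5, 5.8, 4.8, 5.6]
```

Each `Elephant` object corresponds to one row (data instance) and each field to one column (attribute), so reading a column amounts to collecting one attribute over all instances.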
Among these attributes, body weight and height are attributes with continuous values, which can take values in a reasonable range (e.g., [1 t, 7 t] for weight and [1 m, 5 m] for height); trunk “finger” number is a discrete attribute, which takes values from the set {1, 2}; ear size is a transformed attribute, which maps the ear size into 2 levels (1: large; 0: small); and skin smoothness is an attribute with categorical values from the set {Wrinkled, Smooth}. The attributes listed above are all facts about elephants.

In Table 2.1, we use integers to denote the elephant ID, ear size, and trunk “finger” number. However, the same number appearing in these three kinds of attributes has a different physical meaning in each. For instance, the elephant ID 1 in the 1st row of the table is the unique identifier of that elephant; the ear size 1 in rows 1, 2, 5, and 6 denotes that the elephant has large ears; and the trunk “finger” number 1 in rows 3, 4, and 6 is the count of trunk fingers of those elephant instances. Therefore, to interpret these data attributes, we need to know their specific attribute types and the corresponding physical meanings in advance.

Fig. 2.2 A systematic categorization of attribute types (figure labels: Numerical (order, distance): Continuous, Discrete, Interval, Ratio; Categorical (equality): Nominal (orderless), Ordinal (order); Coding Scheme; Periodic (year, month, week, hours))

As shown in Fig. 2.2, the attribute types used in information entity description can usually be divided into two main categories, categorical attributes and numeric attributes. For instance, in Table 2.1, skin smoothness is a categorical attribute and weight is a numeric attribute. As to the ear size, originally this attribute is a categorical attribute whose values have an inherent order in terms of the elephant ear size.
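As a minimal sketch of why attribute types must be known before integer values can be interpreted, the snippet below reads the integer 1 under the three integer-valued attributes of Table 2.1. The names `AttributeKind`, `SCHEMA`, `EAR_SIZE_LEVELS`, and `interpret` are hypothetical and introduced only for this illustration; the ear-size coding (1: large; 0: small) is taken from the table caption.

```python
from enum import Enum


class AttributeKind(Enum):
    """Illustrative labels for how an integer-valued attribute should be read."""
    IDENTIFIER = "identifier"    # unique key, no arithmetic meaning
    CODED_LEVEL = "coded level"  # integer code standing for a category level
    COUNT = "count"              # a genuine discrete quantity


# Assumed schema for the three integer-valued columns of Table 2.1.
SCHEMA = {
    "elephant_id": AttributeKind.IDENTIFIER,
    "ear_size": AttributeKind.CODED_LEVEL,
    "trunk_fingers": AttributeKind.COUNT,
}

# Coding stated in the caption of Table 2.1.
EAR_SIZE_LEVELS = {0: "small", 1: "large"}


def interpret(attribute: str, value: int) -> str:
    """Return the physical meaning of an integer under the given attribute."""
    kind = SCHEMA[attribute]
    if kind is AttributeKind.IDENTIFIER:
        return f"elephant with ID {value}"
    if kind is AttributeKind.CODED_LEVEL:
        return f"{EAR_SIZE_LEVELS[value]} ears"
    return f"trunk finger count = {value}"


# The same integer, 1, means three different things.
for name in SCHEMA:
    print(name, "->", interpret(name, 1))
```

Keeping the attribute kind in a separate schema, rather than inferring it from the stored values, reflects the point in the text: the integers alone do not say whether they are identifiers, codes, or counts.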
Once transformed into integers, ear size becomes a numerical attribute instead, where the numbers express the order of the ear-size levels explicitly. We will talk about attribute transformation later. Generally, the categorical attributes are qualitative attributes, while the numeric attributes are quantitative attributes, both of which can be further divided into several sub-categories