
Cyber-Physical System for Environmental Monitoring Based on Deep Learning

Íñigo Monedero 1, Julio Barbancho 1,*, Rafael Márquez 2 and Juan F. Beltrán 3

1 Tecnología Electrónica, Escuela Politécnica Superior, Universidad de Sevilla, Calle Virgen de África 7, 41012 Sevilla, Spain; [email protected]
2 Fonoteca Zoológica, Departamento de Biodiversidad y Biología Evolutiva, Museo Nacional de Ciencias Naturales (CSIC), Calle José Gutiérrez Abascal, 2, 28006 Madrid, Spain; [email protected]
3 Departamento de Zoología, Facultad de Biología, Universidad de Sevilla, Avenida de la Reina Mercedes, s/n, 41012 Sevilla, Spain; [email protected]
* Correspondence: [email protected]; Tel.: +34-955-55-28-38

Abstract: Cyber-physical systems (CPSs) constitute a promising paradigm that could fit various applications. Monitoring based on the Internet of Things (IoT) has become a research area with new challenges in which to extract valuable information. This paper proposes a deep learning sound classification system for execution over a CPS. This system is based on convolutional neural networks (CNNs) and is focused on the different types of vocalization of two species of anurans. CNNs, in conjunction with the use of mel-spectrograms for sounds, are shown to be an adequate tool for the classification of environmental sounds. The classification results obtained are excellent (97.53% overall accuracy) and can be considered a very promising use of the system for classifying other biological acoustic targets as well as analyzing biodiversity indices in the natural environment. The paper concludes by observing that the execution of this type of CNN, involving low-cost and reduced computing resources, is feasible for monitoring extensive natural areas. The use of a CPS enables flexible and dynamic configuration and deployment of new CNN updates over remote IoT nodes.

Keywords: convolutional neural network; deep learning; machine learning; cyber-physical systems; passive acoustic monitoring; Internet of Things

1. Introduction

1.1. Environmental Sound Classification

Recently, international institutions such as the United Nations (UN) have paid growing attention to global strategies to coordinate the interactions among different agents (countries, companies ... ) around the world. These strategies [1] are focused on the development of sustainable development goals (SDGs). Special interest has been shown regarding goal 11, "sustainable cities and communities"; goal 13, "climate action"; goal 14, "life below water"; and goal 15, "life on land". Local and regional institutions are designing roadmaps to reach these goals. This is the case for the Doñana Biological Station (DBS), a public research institute belonging to the Spanish Council for Scientific Research (CSIC) in the area of Natural Resources, located in the south of Spain, which focuses its research mainly on the Doñana National Park (DNP), a 135 km2 natural park composed of several ecosystems. The DNP is a key environment where several species of birds stop on their migration paths between Europe and Africa. These features make this space a unique enclave for analyzing several aspects of the interactions among human beings and nature.

This paper describes a project named SCENA-RBD, developed by the University of Seville in collaboration with the DBS. The main aim of the project is to design and implement a cyber-physical system (CPS) that allows distributed data processing for biological information. This CPS will constitute a useful tool for different governments (local, regional, and national) to evaluate the sustainability of an area.


Several metrics [2] can be considered for this purpose. The Acoustic Entropy Index (H), which is calculated according to the Shannon information theory, indicates the richness of a specific habitat. The higher the value of a habitat's H, the richer it is. Another common metric is the Acoustic Complexity Index (ACI), which considers the soundscape of an environment by measuring the variability of its natural audio signals in relation to audio signals with a human origin. Aligned with this idea, the Normalized Difference Soundscape Index (NDSI) is a metric that estimates the influence of human acoustic presence (anthrophony) over the biological soundscape (biophony). Based on this, a new paradigm is rising: passive acoustic monitoring (PAM). PAM is a promising research area that has recently increased in importance due to the development of low-cost wireless electronic sensors.

The CPS of the present paper consists of two stages (Figure 1). The first stage is focused on the identification of acoustic patterns. Each pattern is identified according to a specific algorithm. There could be several algorithms running at the same time depending on the target under study: biological acoustic targets such as animals or climate events, or even environmental targets such as engines, gun shots, human voices, etc., could be considered in the classification. In this stage, a classification procedure is executed based on previously supervised training. The second stage correlates the classification carried out in the previous stage with the purpose of developing biodiversity indices.
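To make the index computation of the second stage concrete, the following is a minimal Python sketch of the NDSI described above. The band limits (anthrophony 1–2 kHz, biophony 2–8 kHz) follow common soundscape-ecology practice and are assumptions, not values taken from this paper.

import numpy as np

def ndsi(x, sr, anthro=(1000, 2000), bio=(2000, 8000)):
    """Normalized Difference Soundscape Index of a mono waveform x."""
    spectrum = np.abs(np.fft.rfft(x)) ** 2                           # power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    a = spectrum[(freqs >= anthro[0]) & (freqs < anthro[1])].sum()   # anthrophony
    b = spectrum[(freqs >= bio[0]) & (freqs < bio[1])].sum()         # biophony
    return (b - a) / (b + a)   # in [-1, 1]; values near 1 indicate pure biophony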

Figure 1. Stages in the CPS of the SCENA-RBD project.

This paper is focused on the first stage of the CPS. The first step is the audio recording. Industrial devices such as the Song Meter [3] have become solid platforms for recording wildlife sounds in real-world conditions with a high quality of recording. These devices usually offer complementary software that enables a post-analysis of the recorded data. However, this sort of product has features that must be considered in an extensive sensor device deployment. First, the price of the devices is quite high. If the area to be studied is large, a significant number of elements must be spread out, which could increase the cost of the project considerably. Although some projects are based on open hardware (e.g., AudioMoth [4]) that reduces the cost of fabrication, the possibilities they offer are limited or unstable. Second, this sort of device is designed to record audio signals on a programmable schedule. This feature involves a lack of flexibility that produces a large amount of data with no significant content. It would be desirable to ascertain certain types of events that could trigger the recording process. Third, there is no possibility of executing any preprocessing. These data are stored in raw format or with basic preprocessing.

Audio waves are recorded using a standardized bandwidth (19 Hz to 19 kHz) with a sample rate of 44.1 kHz and a single channel (mono) with 16 bits per sample. A 1-minute file recorded with this configuration has a size of 5 MB (44,100 samples/s × 2 bytes/sample × 60 s ≈ 5.3 × 10^6 bytes). This file could be stored with or without compression. This procedure is extremely inefficient and requires large data storage. Fourth, there are no communication interfaces available. Therefore, in this sort of scenario a technician must establish a schedule for visiting every installed device to collect all the information. This task introduces an artificial effect into the measurement that contaminates the dynamics of the environment under study.

Thus, there is a need to enhance the process of environmental sound classification with a more flexible and powerful system. Not only must central processing be executed, but local and distributed processing should also be considered. Cooperative networking paradigms are suitable for this kind of scenario. In this sense, the use of Internet of Things (IoT) technology with low power consumption, low data rate, and low processing capability offers a good chance to arrive at an optimal solution. Additionally, wireless schemes of communication are desirable for scenarios in which the human presence must be minimized. In other words, the more autonomy the nodes have, the higher the quality of the data obtained.

Another benefit of this approach is the scalability of the system. Covering large areas for monitoring requires the deployment of a large number of devices. The use of IoT technology could ease the creation and management of the network topology. Once the nodes are spread out, they cooperate in the configuration of the network, following a spanning-tree or mesh architecture, and design the topology in an ad hoc manner. This topology is highly scalable, allowing an autonomous mechanism for network supervision. The reduction of human physical assistance at every node of the network is possible thanks to the cooperative features of wireless networking. The network offers data transportation facilities, usually with a low data rate. In this sense, data fusion and data aggregation techniques must be considered to optimize the information flow. Furthermore, due to the lack of infrastructure in natural environments, nodes need autonomous power supplies, commonly based on solar cells or windmills. Large-area monitoring implies the use of low-cost devices; consequently, computation resources are severely restricted.

At the classification stage, there is a tendency in the literature to consider semi-automated PAM analysis [5]. One of the most interesting approaches is focused on the use of modern digital image processing techniques for audio pattern identification. This strategy creates an image that represents the audio of a certain record in a time and frequency graph. Deep learning algorithms are run using these images [6–9]. This constitutes a data fusion technique that is easy to implement on nodes with low power and limited computation resources. In this way, the classification can be done in a distributed manner, avoiding sending the audio data records through the network links, thus minimizing network power consumption. The following section describes this approach.

1.2. Convolutional Neural Networks in Audio Classification

The convolutional neural network (CNN) [10] is a class of deep learning neural network oriented mainly to the classification of images. It presents a unique architecture (Figure 2) with a set of layers for feature extraction and an output layer for classification. Layers are organized in three dimensions: width, height, and depth. The neurons in one layer connect not to all the neurons in the next layer, but only to a small region of the layer's neurons. The final output of a CNN is reduced to a single vector of probability scores, organized along the depth dimension.

Figure 2. CNN architecture.

Convolution is the first layer that extracts features from an input image. Convolution preserves the relationship between pixels by learning image features using small squares of input data. It carries out a linear mathematical operation that involves the multiplication of the image matrix and a filter or kernel specified in a two-dimensional array of weights.

The aim of the pooling layer is to progressively reduce the spatial size of the representation in order to optimize the number of parameters and computations in the network. The pooling layers are typically inserted between convolutional layers. They operate on each feature map independently.

CNNs represent a major breakthrough in image recognition. They learn directly from image data, using patterns to classify images and eliminating the need for manual feature extraction. They are currently a topic of great interest in machine learning and have demonstrated excellent performance in classification [10,11].

A spectrogram [12] is a visual representation of the frequencies of a signal as it varies with time. Based on Fourier's work, it transforms the representation of an audio signal as a function of time into a representation of its frequencies and vice versa. By using discrete Fourier transform algorithms, it is possible to create an image corresponding to the presence of various frequencies in an audio signal across time. Thus, a spectrogram is a picture of a sound: it shows the frequencies that make up the sound, from low to high, and how they change over time, from left to right.

A spectrogram basically provides a two-dimensional graph, with a third dimension represented by colors. Time runs from left (oldest) to right (youngest) along the horizontal axis. The vertical axis represents frequency, which can also be thought of as pitch or tone, with the lowest frequencies at the bottom and the highest frequencies at the top. The amplitude (or energy or loudness) of a frequency at a particular time is represented by the third dimension, color, with dark blues corresponding to low amplitudes and brighter colors up through red corresponding to progressively stronger (or louder) amplitudes.

A mel-spectrogram [12] is a spectrogram where the frequencies are converted to the mel scale. The mel scale is the result of a nonlinear transformation of the frequency scale. It is constructed such that sounds at an equal distance from each other on the mel scale also "sound" to humans as if they are equally distant from one another. A typical spectrogram uses a linear frequency scaling, so each frequency bin is spaced an equal number of Hertz apart, whereas the mel-frequency scale is a quasi-logarithmic spacing roughly resembling the resolution of the human auditory system.

CNN models have been combined with sound spectrograms or mel-spectrograms for the generation of classification models [12–17]. Specifically, in [13], CNN and clustering techniques are combined to generate a dissimilarity space used to train an SVM for automated audio classification, in this case audios of birds and cats. Ref. [14] uses CNNs to identify healthy subjects using spectrograms of electrocardiograms. Ref. [15] employs a CNN with three audio attribute extraction techniques (mel-spectrogram, Mel Frequency Cepstral Coefficients, and Log-Mel) in order to classify three datasets of environmental sounds. Ref. [16] uses three types of time-frequency representation (mel-spectrogram, harmonic-component-based spectrogram, and percussive-component-based spectrogram) in conjunction with CNNs to characterize different acoustic components of birds (43 different species). Ref. [17] employs a concatenated spectrogram as input features for the classification of environmental sounds; the authors state that the concatenated spectrogram feature increases the richness of features compared with a single spectrogram.
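For reference, a common closed-form definition of the mel scale (the O'Shaughnessy variant adopted by several audio toolkits; the paper itself does not commit to a specific formula) maps a frequency f in hertz to mels as

m = 2595 log10(1 + f/700)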

1.3. Previous Work

It is possible to find two approaches from other research groups to the classification of anuran sounds with CNNs in the literature [18,19]. The first one [18] researches the use of convolutional neural networks with Mel-Frequency Cepstral Coefficients (MFCCs) as input for the classification of anuran sounds. It uses a dataset of 7784 anuran sounds from the following anuran families: Leptodactylidae, Dendrobatidae, Hylidae, and Bufonidae. Two binary CNNs are designed, achieving 91.41% accuracy in identifying Leptodactylidae family members from the rest and 99.14% accuracy for the identification of the Adenomera hylaedactyla species. The second paper [19] uses two deep-learning methods that apply convolutional neural networks to anuran classification. About 200 calls from 15 frog species, from a mixture of web-based collections and field recordings, were used for the design. A deep-learning algorithm was used to find the most important features for classification. A 77% classification accuracy was obtained for the dataset of 15 frog species.

In the context of the automatic classification of anuran sounds using other algorithms, a set of works [20–23] has been published by our research group at the University of Seville. In [20], the MPEG-7 standard was deeply analyzed in order to provide standard acoustic descriptors that could characterize animal sound patterns. The conclusions of this work demonstrate the advantages of the use of this standard in terms of scalability in the wireless acoustic sensor network. In [21], a set of classifiers based on different algorithms was proposed. The classifier with the best results is based on decision trees, obtaining an overall accuracy of 87.30%. In this work, frames were analyzed without taking into account their relationship with their predecessors and successors. The various classifiers were also compared with a sequential classifier (hidden Markov model). In [22], a study was proposed to analyze the improvements in classification that can be provided by applying feature-extraction methods different from the MPEG-7 descriptors used in [21] to characterize an audio frame. Two were proposed: filter bank energies and mel-frequency coefficients. The best results, with an overall accuracy of 94.85%, were obtained using the (optimal) MFCC characteristics and a Bayesian classifier. In [23], a new algorithm with the same advantages as the classifiers used in [22] was proposed; it adds additional processing that considers not only the label that the classifier assigns to each frame but also a score that determines whether the frame belongs to one class or another. To do this, the researchers analyzed the spectrograms of each sound, identifying the regions of interest (ROIs) containing anuran sounds. With this approach, the use of two classifiers was identified as the optimal method: minimum distance and decision tree. The resulting overall classification accuracy was 97.35%.

In the above-described works by our group [20–23], a database of real anuran sounds was used for training the models. This database was provided by the National Natural History Museum (Museo Nacional de Ciencias Naturales) [24]. The sounds were recorded in five different locations, four in Spain and one in Portugal, using a Sennheiser ME80 microphone, and were sampled at 44.1 kHz. The sample included sounds from two anuran species: Epidalea calamita (natterjack toad) and Alytes obstetricans (common midwife toad). Three types of vocalization were distinguished for Epidalea calamita: standard, chorus, and amplexus. Two types of vocalization were distinguished for Alytes obstetricans: standard and distress call. This same database was used for training the classification system of the present work. In this way, it should also be possible to compare the results of the CNN against the models used in previous studies.

1.4. Research Objectives

The research objectives of this paper are twofold:

• The implementation of a cyber-physical system to optimize the execution of CNN algorithms: audio wave processing based on CNNs [10] has been frequently proposed in the literature, as shown in Section 1.2. However, an optimal execution model must be considered in order to maximize the flexibility and adaptability of the algorithms and to minimize the execution costs (power consumption and time response). Thus, a first research objective is to propose the implementation of a cyber-physical system (CPS) to reach these goals. Figure 3 represents the main components of the CPS proposal. The physical system is composed of IoT nodes that create an ad hoc network. These nodes are equipped with microphones adapted for natural environments. Every node captures the soundscape of its surroundings. The audio signal is converted to a mel-spectrogram [12], which is processed using a set of trained CNNs. This data fusion procedure reduces the size of the information to be sent through the network. The nodes are powered by solar cells and have a wireless interface. The physical system is modeled in a set of servers following the paradigm of digital twins [25]. This constitutes the cyber-physical part of the system. Every physical node has a model (digital twin) that can interact with other models, creating a cooperative network. At this point, the CNNs are trained with the audio data recorded previously by the physical nodes. There are as many CNNs as species or biological acoustic targets being studied. Once a CNN is trained, its structure, its topology, and its weights are sent to the corresponding physical node, where it will be run. The execution of every CNN offers events of identification of a given species. The entire system can be monitored by scientists with a user-friendly interface.

• The design of a CNN classification system for anuran sounds: for this second research objective, the different types of anuran vocalizations that our research group has worked on in the past (see Section 1.3) would be used as the training set. Obtaining good results would allow a demonstration of the feasibility of a CNN-based design for the classification of species sounds and, in general, for biological acoustic targets. At the same time, having previous work in the classification of this type of sound allows us to compare the results of the CNN with other techniques. Thus, although the results of the most recent of the previous works [23] were satisfactory, there were two areas in need of improvement. The first one focused on creating a system that performs an even finer classification of anuran sounds, both in terms of accuracy and in the number of classified sounds. A second objective was to remove the previous manual preprocessing of the audio signals carried out in [23]. This would speed up the classification process and save costs. The different stages in the designed classification process are shown in Figure 4.


Figure 3. Main components of the CPS.

Figure 4. Stages in the CNN classification system.

Firstly, the environmental audios are captured with a microphone and stored in WAV or MP3 format. Subsequently, the mel-spectrogram is generated for the audio file and stored in JPEG format. Finally, the CNN carries out the detection and classification of the anuran vocalization from the mel-spectrogram.

This paper is organized as follows. Section 2 gives the details of the creation of the cyber-physical system and the CNN classification system. The results of the evaluation of the system are described in Section 3. Section 4 offers a discussion about the results, and some conclusions are provided in Section 5.

2. Material and Methods

2.1. Cyber-Physical System

As was seen in the previous section, most of the approaches to environmental monitoring are mainly based on two strategies:


1. Sensors acquire data and send them to a base station from where they are forwarded to a central processing entity. In this scenario, data are sent raw or with basic processing. In order to enhance the management of the communications, data could be aggregated. The main drawback to this approach is the huge data traffic that the communication infrastructure must allow. This feature is especially important in low-bandwidth systems.

2. Sensors acquire data and process them locally. In this scenario, a data fusion algorithm is executed locally in every remote node or in a cluster of nodes that can cooperate among themselves. Only relevant information is sent to sink nodes, which deliver it to supervising and control entities. This sort of architecture reduces the data traffic and consequently the power consumption. The required bandwidth for these nodes can be minimized, allowing low-power wireless sensor network paradigms such as IEEE 802.15.4TM [26], LoRa® (LoRa Alliance, Fremont, CA, USA, https://lora-alliance.org, accessed on 19 May 2021) [27], etc. The main drawback of these systems is the necessity of adapting the data fusion algorithm to nodes with processing and power consumption constraints.

The first approach is based on a centralized architecture, as depicted in Figure 5. A central processing node spreads its sensor nodes out following a cluster tree communication infrastructure. Data can flow from source nodes through intermediate nodes (called base stations) that forward the packets toward the sink node. In this paradigm, the nodes that acquire data (e.g., audio waves) from the environment are assumed to be in the physical domain. In Figure 5, this domain is represented in black. Additionally, the nodes that process data are usually located in a data center, considered the cyber domain (represented in grey in Figure 5). The main processing tasks are developed in the cyber domain. Typical processing algorithms are based on artificial intelligence techniques such as convolutional neural networks for classification purposes. The construction of this artificial neural network is based on two stages, both carried out in this central entity. In the first stage, the network is trained with the data collected from sensor nodes (audio waves). In this stage, human supervision is needed to identify every output with its correct pattern. In the second stage, the network is run with the trained supervised neural network.

While this approach delivers the control and supervision responsibility to a centralized entity called the sink node (with no hardware constraints), the decision is made remotely from the data sensors, which may cause inflexibility due to its fixed structure and the latency of its links. Furthermore, the processing is carried out in a central entity that could create a bottleneck. Possible dysfunctionalities of this entity could cause the system to fail.

Figure 5. Centralized architecture.

The second approach is based on a decentralized architecture, as shown in Figure 6. In this scenario, every source node acquires data and processes them locally. In this case, the CNN is run on the physical domain. There are as many CNNs as there are sensor nodes. Every CNN implements a classification algorithm that provides a short plain text file as a result. These files are collected by the sink node in an aggregation scheme. Therefore, the required bandwidth on every link is not considered a challenge. Human supervision must be carried out on every node in order to fit the algorithm to the target under study.

Figure 6. Decentralized architecture.
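To make the data reduction concrete, the following is a hypothetical example (in Python; the actual message format is not specified in the paper) of the short plain-text result a node could send to the sink instead of raw audio:

import json, time

event = {
    "node_id": "n07",             # hypothetical node identifier
    "timestamp": time.time(),     # detection time
    "species": "Epidalea calamita",
    "vocalization": "standard",
    "confidence": 0.97,           # softmax score of the local CNN
}
payload = json.dumps(event)       # a few tens of bytes per detection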

In contrast with both the first and second approaches, the cyber-physical system (CPS) approach combines the execution of the algorithms in both the physical and the cyber domains. Figure 7 shows an example of how a CPS architecture could fulfill the requirements of a classification problem while minimizing the drawbacks found in the prior approaches.

Figure 7. Cyber-physical system architecture.

The strategy followed in this approach develops a dual-stage implementation of the CNNs: training carried out in the cyber system and execution over nodes in the physical system. This strategy must be analyzed from the viewpoint of several metrics. Because the CPS follows a distributed processing strategy, the communication features among nodes and between the physical and cyber systems play an important role in the achievement of the objectives. On one side, the low data rate of the wireless links among nodes does not support the transmission of raw audio data through the network. Due to this feature, a possible central CNN execution scheme is discarded. Instead, a local CNN execution scheme must be implemented. Furthermore, before the execution of the CNN, a pre-processing stage is needed: the creation of the mel-spectrogram. Therefore, nodes must execute the CNN using their own processing resources. Two possibilities can be considered: pre-process the audio signal on the central processing unit (CPU) and execute the CNN on the graphics processing unit (GPU), or pre-process the audio signal and execute the CNN on the CPU. In the work proposed here, both the pre-processing of the audio signal and the execution of the CNN are carried out by the CPU using the TensorFlow Lite library [28]. The platform used for the testbeds is based on the System on Chip (SoC) BCM2835, which includes the ARM1176JZF-S microprocessor at 700 MHz (it can reach 1 GHz with overclocking) and the VideoCore IV GPU. The communications were implemented using LoRa® with a data rate of around 5 kbps, using a spreading factor of 7 and a bandwidth of 125 kHz [29]. Nodes are provided with a weatherproof microphone and a solar cell for an autonomous power supply. On the cyber system side, the implementation was based on a workstation with an AMD Threadripper Pro 3955WX processor with 32 GB RAM and an NVIDIA Quadro P620 GPU, running Windows 10 Pro. The communication between the cyber and the physical system was made using the TCP/IP protocol. A Python application was designed to train the CNN based on a selected dataset. This cyber system is responsible for interacting with a final user (a biologist), who is responsible for ordering new trainings of the CNNs and new updates of the trained CNNs on the physical nodes or on their digital twins to carry out simulations or cooperative classifications.

The CPS is being tested at the Doñana Biological Reserve ICTS (DBR) and the Higher Technical School of Computer Engineering (ETSII), both located in the south of Spain and separated by 60 km. Figure 8 depicts the actual location of every part of the system. The physical system is composed of nine IoT nodes spread out over Ojillo Lake, a natural environment in the DBR. These nodes create a spanning tree with a base station as the root. The communication among nodes is carried out by LoRa®. A node with two interfaces (Ethernet and LoRa®) performs the base station role. This node communicates with the cyber system through fiber optics.
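As an illustration of the on-node execution described above, the following is a minimal sketch of CNN inference with TensorFlow Lite on the node's CPU. The model file name, the input pipeline, and the use of the tflite_runtime package are illustrative assumptions.

import numpy as np
import tflite_runtime.interpreter as tflite

# Load the CNN received from the cyber system (structure, topology, weights).
interpreter = tflite.Interpreter(model_path="anuran_cnn.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# 128 x 435 mel-spectrogram computed locally from a 5 s audio segment.
mel = np.load("mel_segment.npy").astype(np.float32)
interpreter.set_tensor(inp["index"], mel.reshape(inp["shape"]))
interpreter.invoke()
probs = interpreter.get_tensor(out["index"])[0]   # softmax class probabilities
print(int(np.argmax(probs)), float(probs.max()))  # predicted class and confidence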

Figure 8. CPS topology. Physical system, DBR, Huelva, Spain; cyber system, ETSII, Sevilla, Spain.

2.2. CNN Classification System

In order to carry out the design of the CNN classification system, a set of stages was followed. These stages are represented in Figure 9 and described in the following subsections.

Figure 9. Stages in the classification system.

Each of the stages was programmed in Python using the Keras library [30] for CNNs.

2.2.1. Data Augmentation

In order to generate a more robust training set, an augmentation process was carried out using the sound database. When one deals with CNNs and images, data augmentation consists of extending the set of images through transformations of them (rotations, size modifications, etc.). For the purpose of the present work on sounds, the transformations must be made on the audios, and these transformations will later be reflected in the mel-spectrogram images.

In this way, for each of the 865 audio samples in the database, 10 additional audios were generated:

• 2 files with two types of white noise added to the audio.
• 4 files with signal time shifts (1, 1.25, 1.75, and 2 seconds, respectively).
• 4 files with modifications in the amplitude of the signal: 2 with amplifications of 20% and 40%, and 2 with attenuations of 20% and 40%. Specifically, dynamic compression [31] was used for this processing. It reduces the volume of loud sounds and amplifies (by a percentage) the quiet ones. The aim was to generate new signals with the background sound enhanced and the sound of the anuran reduced.

These transformations were carried out so that the system could detect the anurans in a noisy environment, as well as at different distances from the recording microphone. With the previous transformations, the 865 audios were converted into a total of 9515 audios. After the data augmentation stage, the distribution of 5-second-long samples among the different classes is detailed in Table 1.

Table 1. Sound collections.

Type                  Vocalization    Number of Original Samples   Number of Augmented Samples
Epidalea calamita     standard        293                          3223
Epidalea calamita     chorus          74                           814
Epidalea calamita     amplexus        63                           693
Alytes obstetricans   standard        419                          4609
Alytes obstetricans   distress call   16                           176
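The transforms listed in Section 2.2.1 translate into a few lines of NumPy, as in the following minimal sketch. The white-noise amplitudes and the simple gain used here in place of the dynamic compressor [31] are illustrative assumptions; the paper does not specify those parameters.

import numpy as np

SR = 44100  # sample rate of the recordings

def add_white_noise(x, amplitude):
    # Additive white Gaussian noise; amplitude is an assumed scale factor.
    return x + amplitude * np.random.randn(len(x))

def time_shift(x, seconds):
    # Circularly shift the waveform by the given number of seconds.
    return np.roll(x, int(seconds * SR))

def rescale_amplitude(x, factor):
    # Amplify (factor > 1) or attenuate (factor < 1); a simple stand-in for
    # dynamic compression, which acts nonlinearly on loud and quiet parts.
    return np.clip(x * factor, -1.0, 1.0)

def augment(x):
    # The 10 variants generated per original 5 s sample.
    out = [add_white_noise(x, a) for a in (0.003, 0.01)]            # 2 noise variants
    out += [time_shift(x, t) for t in (1.0, 1.25, 1.75, 2.0)]       # 4 time shifts
    out += [rescale_amplitude(x, f) for f in (1.2, 1.4, 0.8, 0.6)]  # ±20%, ±40%
    return out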

In the works cited above [20–23], it was assumed that, due to the similarity of the chorus of Epidalea calamita to the standard vocalization, as well as the low number of chorus vocalization samples, it was not possible to train a system to distinguish between the two. As a contribution beyond that work, in the present work a CNN was generated for the classification of the 5 classes registered in the database. The objective was to demonstrate that the power of this type of CNN model allowed satisfactory results to be obtained for all vocalizations.

2.2.2. Mel-Spectrogram Generation

The generation of a mel-spectrogram for each audio file was perfectly adapted to the classification objective of the system. The WAV files were passed through filter banks to obtain the mel-spectrogram. Each audio sample was converted to a shape of 128 × 435, indicating 128 filter banks used and 435 time steps per audio sample. Figure 10 shows the audio corresponding to a sample with an amplexus call of Epidalea calamita.

Figure 10. Example of anuran call: waveform (up) and mel-spectrogram (down).

As can be seen in the figure, the anuran call is located in the interval (2.5, 3.5 s) of the sample. The mel-spectrogram allows identification, in the image, of both the location of the call and its range of frequencies and amplitudes.
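For illustration, the following sketch produces such a 128 × 435 mel-spectrogram with librosa (an assumed dependency; the paper does not name the audio library). The hop length of 507 samples is chosen only because it yields 435 frames for a 5 s clip at 44.1 kHz; the authors' exact parameters are not stated.

import numpy as np
import librosa
import matplotlib.pyplot as plt

# Load a 5 s mono recording at the original 44.1 kHz sample rate.
y, sr = librosa.load("amplexus_sample.wav", sr=44100)

# 128 mel filter banks; 1 + 220500 // 507 = 435 time steps.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,
                                     n_fft=2048, hop_length=507)
mel_db = librosa.power_to_db(mel, ref=np.max)   # log-amplitude (dB) scale
print(mel_db.shape)                             # (128, 435)

# Store the image in JPEG format, as in the pipeline of Figure 4.
plt.imsave("amplexus_sample.jpg", mel_db, origin="lower", cmap="magma")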

2.2.3.2.2.3. Design Design and and Training Training AsAs detailed detailed in inthe the introduction, introduction, a CNN a CNN is a istype a type of neural of neural network network essentially basically com- composed posedof convolution of convolution layers layers and and pooling pooling layers. layers. For For the the design design of this of this work, work, the convolu- the convolution tionlayers layers extract extract features features from from the the mel-spectrograms, mel-spectrograms, preserving preserving the the relationship relationship be- between tween pixels by learning image features using small squares of input data. Thus, the role pixels by learning image features using small squares of input data. Thus, the role of a of a convolution layer is to detect call or vocalization features at different positions from theconvolution input feature layer maps is towith detect these calllearnable or vocalization filters. The pooling features layers at different are used positions to reduce from the theinput dimensionality feature maps of the with previous these learnablelayers. For filters.this work, The max pooling pooling layers was used. are used This type to reduce the ofdimensionality pooling calculates of thethe previousmaximum layers.value in For each this patch work, of each max feature pooling map. was used. This type of poolingFor the calculates classification the maximumsystem presented value in th eache paper, patch the of best each results feature were map. found with CNNsFor of three the classification convolution layers system and presented three pooling in the layers. paper, A flattened the best layer results and were a ReLU found with (RectifiedCNNs of Linear 3 convolution Unit) layer layers were and added. 3 pooling The flattened layers. layer A flattened simply converts layer and the a ReLUpooled (Rectified featureLinear map Unit) (represented layer were as added. a numeric The matrix) flattened into layera column simply (represented converts as the a vector). pooled feature ReLUmap (represented[29] stands for asrectified a numeric linear matrix) unit and into is a anonlinear column operat (representedion defined as a by vector). the max ReLU [32] (zerostands input). for rectifiedThe purpose linear of applying unit and the is rectifier a nonlinear function operation is to increase defined the bynonlinearity max (zero input). inThe the purpose images (the ofapplying images are the naturally rectifier nonlinear). function Finally, is to increase softmax the functions nonlinearity with cross- in the images entropy [29] were applied in the output layer. This turns the output vector of real values (the images are naturally nonlinear). Finally, softmax functions with cross-entropy [32] into a vector that sums to 1 so that the results can be interpreted as probabilities. The numberwere applied of output in filters the output per convolution layer. This layer turns is 32 the for output the firstvector one, 64 offor real the second values one, into a vector andthat 128 sums for tothe 1 final so that one. the A kernel results size can (hei beght interpreted and width as of probabilities. the convolution The window) number of of output 3 × 3 is used on each convolution layer. The fully connected layer (FC) consists of 512 neurons and the softmax layer with the number of units for classification (one for each type of sound). Figure 11 shows the structure of the CNN.


Figure 11. Structure of the designed CNN.

For network training, 80% of the samples were used for training and 20% for testing. This strategy is widely used to enhance the generalization of a machine learning model and prevent overfitting. A balanced sampling was carried out for the test set: one audio sample (with its corresponding data augmentation samples) out of every five was selected for each type of anuran call. Thus, 7623 samples were used for training and 1892 for testing.

As mentioned in the data augmentation section above, a CNN was first trained for the four sound classes covered in the classification model developed in previous work [20–23]. Later, a second CNN was trained to expand the classification system to the total of five classes detailed in Table 1.

For the first network, the best results were obtained with training in only 8 epochs (passes over the entire training dataset) and a batch size (number of samples used in one training iteration) of 32. For the second network, the best training was achieved with 6 epochs and, again, a batch size of 32. For both networks, an initial learning rate [33] of 0.001 and the Adam optimizer [34] were used in the training process.
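A hedged sketch of this training setup follows, reusing build_cnn (and the tensorflow import) from the architecture sketch above; x_train/y_train and x_test/y_test are placeholders for the mel-spectrogram dataset and its one-hot labels:

```python
# Reported settings: Adam optimizer, initial learning rate 0.001, softmax with
# cross-entropy, batch size 32, and 8 epochs for the 4-class network (6 for
# the 5-class one). Dataset variables are placeholders.
model = build_cnn(n_classes=4)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(x_train, y_train,
          epochs=8, batch_size=32,
          validation_data=(x_test, y_test))  # 80/20 train/test split
```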
3. Results

3.1. CPS Performance

Several factors were considered in evaluating the CPS. First, at the level of data transport, a comparison between two paradigms was carried out. The central paradigm is based on sending the audio record from an IoT node to its digital twin on the cyber system side, where it is processed. The distributed paradigm is based on processing the audio record in the IoT node and sending the obtained result to its digital twin, where it is delivered to the central processing entity. Considering a 5-second audio register recorded at a sample rate of 44.1 kHz with 32 bits per sample in mono configuration, the resulting WAV file has a size of 434 kB. It was assumed that a single radio link (1 kbps) exists, so the minimum time needed to transport the audio file from the physical system to the cyber system is 57.8 s. In this scheme, the time associated with data communication over the fiber optics is negligible in comparison with the data communication over the radio link. As a result, sending a 5-second audio record takes more than 10 times longer than the duration of the record itself, so no real streaming could be done and distributed processing is needed. In this sense, two times have to be considered at the physical nodes: pre-processing time (mel-spectrogram representation) and CNN execution time. Considering the same WAV file, the time to compute the mel-spectrogram was measured at 120 ms, and the subsequent CNN execution, using the TensorFlow Lite library, took 160 ms. Finally, sending the CNN results of the 5-second audio analysis (70 B) takes 0.55 s. As a result, the total processing time needed in the distributed paradigm is around 300 ms. In conclusion, it is possible to perform the audio analysis in real time.
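These figures can be cross-checked with simple link-budget arithmetic; the sketch below takes the 1 kbps rate and the payload sizes from the text, and everything else is a back-of-the-envelope assumption:

```python
# Transfer time over the 1 kbps radio link and on-node processing budget.
RADIO_BPS = 1_000  # effective radio throughput, bits per second

def transfer_time_s(size_bytes: int, bps: int = RADIO_BPS) -> float:
    return size_bytes * 8 / bps

result_tx = transfer_time_s(70)   # 70 B CNN result: ~0.56 s, matching 0.55 s
on_node = 0.120 + 0.160           # mel-spectrogram + TensorFlow Lite inference
print(f"result transfer {result_tx:.2f} s, on-node processing {on_node:.2f} s")
# ~0.28 s of processing per 5 s window is the "around 300 ms" figure above;
# even with the result transfer added, the node keeps pace with the audio.
```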


The main handicap of this approach is that if any change in the CNN is needed, the network topology, weights, and structure have to be transmitted through the radio links. A common neural network size using TensorFlow Lite is about 5 MB. With a 1 kbps bandwidth, the time to deliver this network from the cyber system to a remote IoT node takes 11.4 hours. This time, although significantly high, corresponds to a procedure that is not executed frequently and can be performed in the background.
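The same helper checks the model-update cost; whether "5 MB" here means 10^6-byte or 2^20-byte megabytes is our assumption:

```python
# Reusing transfer_time_s from the sketch above for a 5 MB TensorFlow Lite model.
update_h = transfer_time_s(5 * 10**6) / 3600
print(f"model update: ~{update_h:.1f} h")  # ~11.1 h (~11.7 h with 2**20-byte MB),
                                           # in line with the reported 11.4 h
```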

Second, at the level of boundary conditions, the temperature of the SoC at the IoT node was measured while executing the CNN. Figure 12 shows a thermal photo that identifies the temperature at the SoC (48.8 °C). This shows that the selected platform is suitable for executing this kind of processing. This testbed was carried out at 25 °C (spring and fall); additional tests should be considered at higher temperatures (summer).

Figure 12. Temperature at the SoC of the IoT node executing the CNN.

Third, at the levelThird, of power at the levelconsumption, of power consumption, the execution the execution of the CNN of the was CNN tested was tested using using a precision multimeter. Figure 13 depicts the level of intensity required by the IoT platform a precision multimeter.when the Figure CNN begins 13 depicts its execution. the level The solarof intensity cell installed required in the remoteby the location IoT plat- where form when the CNNthe node begins is working its execution. has been dimensioned The solar cell according installed to the in IoT the power remote consumption. location where the node is workingFourth, has at the been level dimensioned of resources, the according application thatto the runs IoT the CNNpower takes consump- 16.4% of the tion. CPU capacity and 8.7% of RAM. These results let us conclude that the available computing resources are adequate for running this sort of application.

Figure 13. Power consumption of the IoT executing the CNN.


3.2. CNN Models

The overall accuracy results obtained for the first network, which classifies the sounds into four classes, were 96.34% for the training data and 97.53% for the test data. The confusion matrix for the test data of this network is detailed in Table 2.

Table 2. Confusion matrix for the first CNN (4 classes).

                                       Predicted values
Actual values         Ep. cal. st&ch   Ep. cal. amplexus   Al. obs. standard   Al. obs. distress
Ep. cal. st&ch        96.51% (775)     3.49% (28)          0                   0
Ep. cal. amplexus     4.90% (7)        95.10% (136)        0                   0
Al. obs. standard     1.21% (11)       0                   98.79% (902)        0
Al. obs. distress     2.28% (1)        0                   0                   97.72% (43)

The dispersion of the error among the four classes is low, ranging between 1.21% for the Alytes obstetricans standard class (11 errors out of 913 audios) and 4.9% for the Epidalea calamita class with amplexus vocalization (7 errors out of 143 audios).

For the CNN that classifies the five types of sound detailed in Table 1, a 94.54% success rate was obtained for the training data and 96.2% for the test data. Table 3 shows the confusion matrix obtained for the test data, detailing how the errors are distributed among the five classes. The percentage of error for all classes is in the range between 1.01% and 4.78%, except for the Epidalea calamita class with amplexus vocalization, which has a 9.09% error (13 errors out of 143 audios). On the other hand, only one error was made in the Alytes obstetricans class with distress calls. Thus, the network classification results can be considered both good and robust. The accuracy results for this network decrease slightly with respect to the first network, but they are good enough to show that the CNN is capable of distinguishing between standard and amplexus vocalization for Epidalea calamita.

Table 3. Confusion matrix for the second CNN (5 classes).

                                          Predicted values
Actual values        Ep. cal. standard   Ep. cal. chorus   Ep. cal. amplexus   Al. obs. standard   Al. obs. distress
Ep. cal. standard    94.27% (560)        1.01% (6)         4.71% (28)          0                   0
Ep. cal. chorus      4.78% (10)          95.22% (199)      0                   0                   0
Ep. cal. amplexus    9.09% (13)          0                 90.91% (130)        0                   0
Al. obs. standard    1.2% (11)           0                 0                   98.8% (902)         0
Al. obs. distress    2.28% (1)           0                 0                   0                   97.72% (43)
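For reference, the row-normalized percentages in Tables 2 and 3 can be reproduced from raw prediction counts; a minimal sketch, assuming scikit-learn is available and that y_true/y_pred hold the test labels and CNN predictions:

```python
# Row-normalize a confusion matrix so each actual class sums to 100%.
import numpy as np
from sklearn.metrics import confusion_matrix

def normalized_confusion(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    return 100.0 * cm / cm.sum(axis=1, keepdims=True)
# e.g., 775 correct out of the 803 "Ep. cal. st&ch" test samples
# gives the 96.51% diagonal entry in Table 2.
```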

4. Discussion and Conclusions

Regarding the CNN-based classification system, as shown in Section 3, the accuracy results are excellent and robust for the set of anuran sounds. Specifically, for the CNN based on four sound classes, the overall accuracy obtained (97.53%) is a small improvement over the best result (97.35%) of the most recent paper [23] from the previous work on this sample set [20–23]. Thus, if Table 2 of Section 3 is compared with Table 7 of [23], improvements can be observed in the accuracy results for the Epidalea calamita and Alytes obstetricans classes with standard vocalization. The result is only worse for the Alytes obstetricans class with distress calls, in which a classification error occurs in one of the 44 samples. Although the improvement in accuracy is not large, [23] requires manual labeling of the sounds, which adds a cost to the process. In the classification system of the present work, taking advantage of the capabilities of deep learning, this human help is not necessary, which is a significant contribution. Furthermore, previous works could not distinguish between chorus and standard vocalization for Epidalea calamita. The classification system based on the second CNN obtains very good results (96.2%) when adding this new class. This provides a greater level of detail in sound classification and implies a major improvement with respect to the previous work cited above.

This classification system, designed for anuran sounds, demonstrates the feasibility of a CNN-based design for the classification of biological acoustic targets. In this way, the first of the objectives proposed in the paper (see Section 1.4) was completed. On the other hand, although this paper has focused on the use of a CPS as an anuran classification tool, the system could also be used to analyze biodiversity indices in the natural environment, given the appropriate sounds for CNN training.

Regarding the CPS, the proposed scheme offers a promising opportunity for biologists and engineers to collaborate in complex scenarios, minimizing the supervision of experts and, consequently, the cost in human hours. Doubtless, the CPS could be implemented with less costly devices. Consequently, extensive natural areas could be included for analysis in monitoring applications. Given the results reported here, the authors conclude that the CPS design constitutes a novel, flexible, and powerful PAM system. Overall, CNNs in IoT nodes can serve in the classification stage as a fusion technique that minimizes the data to be transported and maximizes the information exchanged. Thus, the second of the research objectives of the work (see Section 1.4) was completed.

As a future line of work, CNNs can be engaged not only in classification but also in feature extraction. In the present work, CNNs were used with classification as the main objective. However, the audio data could contain hidden information that has not been captured by this approach. A deeper insight through unsupervised training may provide new and interesting conclusions regarding the process dynamics.

Author Contributions: Conceptualization, Í.M. and J.B. Initial conceptualization, R.M. and J.F.B. Research, Í.M., J.B., R.M. and J.F.B. Writing—original draft, Í.M. and J.B. Audio records and field work, R.M. and J.F.B. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Acknowledgments: The authors would like to thank Antonio Barbancho (Escuela Politécnica Superior, Universidad de Sevilla) for his collaboration and support. This project has been supported by the Doñana Biological Station (CSIC, Spain) under the title SCENA-RBD (2020/20), TEMPURA (CGL2005-00092/BOS), ACURA (CGL2008-04814-C02) and TATANKA (CGL2011-25062).

Conflicts of Interest: The authors declare no conflict of interest.

References

1. United Nations. The Millennium Development Goals Report 2014; Department of Economic and Social Affairs: New York, NY, USA, 2014.
2. Bradfer-Lawrence, T.; Gardner, N.; Bunnefeld, L.; Bunnefeld, N.; Willis, S.G.; Dent, D.H. Guidelines for the use of acoustic indices in environmental research. Methods Ecol. Evol. 2019, 10, 1796–1807. [CrossRef]
3. Wildlife Acoustics. Available online: http://www.wildlifeacoustics.com (accessed on 23 February 2021).
4. Hill, A.P.; Prince, P.; Covarrubias, E.P.; Doncaster, C.P.; Snaddon, J.L.; Rogers, A. AudioMoth: Evaluation of a smart open acoustic device for monitoring biodiversity and the environment. Methods Ecol. Evol. 2018, 9, 1199–1211. [CrossRef]
5. Campos-Cerqueira, M.; Aide, T.M. Improving distribution data of threatened species by combining acoustic monitoring and occupancy modelling. Methods Ecol. Evol. 2016, 7, 1340–1348. [CrossRef]
6. Goeau, H.; Glotin, H.; Vellinga, W.P.; Planque, R.; Joly, A. LifeCLEF Bird Identification Task 2016: The arrival of deep learning. In CLEF 2016 Working Notes; Balog, K., Cappellato, L., Eds.; The CLEF Initiative: Évora, Portugal, 5–8 September 2016; pp. 440–449. Available online: http://ceur-ws.org/Vol-1609/ (accessed on 19 May 2021).
7. Farina, A.; James, P. The acoustic communities: Definition, description and ecological role. Biosystems 2016, 147, 11–20. [CrossRef]
8. Mac Aodha, O.; Gibb, R.; Barlow, K.E.; Browning, E.; Firman, M.; Freeman, R.; Harder, B.; Kinsey, L.; Mead, G.R.; Newson, S.E.; et al. Bat detective—Deep learning tools for bat acoustic signal detection. PLoS Comput. Biol. 2018, 14, e1005995. [CrossRef] [PubMed]
9. Karpištšenko, A. The Marinexplore and Cornell University Whale Detection Challenge. 2013. Available online: https://www.kaggle.com/c/whale-detection-challenge/discussion/4472 (accessed on 19 May 2021).
10. Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 2020, 53, 5455–5516. [CrossRef]
11. Dhillon, A.; Verma, G.K. Convolutional neural network: A review of models, methodologies and applications to object detection. Prog. Artif. Intell. 2019, 9, 85–112. [CrossRef]
12. Dorfler, M.; Bammer, R.; Grill, T. Inside the spectrogram: Convolutional Neural Networks in audio processing. In Proceedings of the 2017 International Conference on Sampling Theory and Applications (SampTA), Tallinn, Estonia, 3–7 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 152–155.
13. Nanni, L.; Rigo, A.; Lumini, A.; Brahnam, S. Spectrogram Classification Using Dissimilarity Space. Appl. Sci. 2020, 10, 4176. [CrossRef]
14. Bento, N.; Belo, D.; Gamboa, H. ECG Biometrics Using Spectrograms and Deep Neural Networks. Int. J. Mach. Learn. Comput. 2020, 10, 259–264. [CrossRef]
15. Mushtaq, Z.; Su, S.-F. Environmental sound classification using a regularized deep convolutional neural network with data augmentation. Appl. Acoust. 2020, 167, 107389. [CrossRef]
16. Xie, J.; Hu, K.; Zhu, M.; Yu, J.; Zhu, Q. Investigation of Different CNN-Based Models for Improved Bird Sound Classification. IEEE Access 2019, 7, 175353–175361. [CrossRef]
17. Chi, Z.; Li, Y.; Chen, C. Deep Convolutional Neural Network Combined with Concatenated Spectrogram for Environmental Sound Classification. In Proceedings of the 2019 IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT), Dalian, China, 19–20 October 2019; IEEE: New York, NY, USA, 2019; pp. 251–254.
18. Colonna, J.; Peet, T.; Ferreira, C.A.; Jorge, A.M.; Gomes, E.F.; Gama, J. Automatic Classification of Anuran Sounds Using Convolutional Neural Networks. In Proceedings of the Ninth International C* Conference on Computer Science & Software Engineering—C3S2E, Porto, Portugal, 20–22 July 2016; Desai, E., Ed.; Association for Computing Machinery: New York, NY, USA, 2016; pp. 73–78. [CrossRef]
19. Strout, J.; Rogan, B.; Seyednezhad, S.M.; Smart, K.; Bush, M.; Ribeiro, E. Anuran call classification with deep learning. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; IEEE: New York, NY, USA, 2017; pp. 2662–2665.
20. Luque, J.; Larios, D.F.; Personal, E.; Barbancho, J.; León, C. Evaluation of MPEG-7-Based Audio Descriptors for Animal Voice Recognition over Wireless Acoustic Sensor Networks. Sensors 2016, 16, 717. [CrossRef] [PubMed]
21. Luque, A.; Romero-Lemos, J.; Carrasco, A.; Barbancho, J. Non-sequential automatic classification of anuran sounds for the estimation of climate-change indicators. Expert Syst. Appl. 2018, 95, 248–260. [CrossRef]
22. Luque, A.; Gómez-Bellido, J.; Carrasco, A.; Barbancho, J. Optimal Representation of Anuran Call Spectrum in Environmental Monitoring Systems Using Wireless Sensor Networks. Sensors 2018, 18, 1803. [CrossRef] [PubMed]
23. Luque, A.; Romero-Lemos, J.; Carrasco, A.; Barbancho, J. Improving Classification Algorithms by Considering Score Series in Wireless Acoustic Sensor Networks. Sensors 2018, 18, 2465. [CrossRef] [PubMed]
24. Fonozoo. Available online: www.fonozoo.com (accessed on 23 February 2021).
25. Haag, S.; Anderl, R. Digital twin—Proof of concept. Manuf. Lett. 2018, 15, 64–66. [CrossRef]
26. Howitt, I.; Gutierrez, J. IEEE 802.15.4 low rate—Wireless personal area network coexistence issues. In Proceedings of the 2003 IEEE Wireless Communications and Networking Conference (WCNC 2003); IEEE: New Orleans, LA, USA, 2004.
27. Augustin, A.; Yi, J.; Clausen, T.; Townsley, W.M. A Study of LoRa: Long Range & Low Power Networks for the Internet of Things. Sensors 2016, 16, 1466. [CrossRef]
28. Byrd, R.H.; Chin, G.M.; Nocedal, J.; Wu, Y. Sample size selection in optimization methods for machine learning. Math. Program. 2012, 134, 127–155. [CrossRef]
29. Petäjäjärvi, J.; Mikhaylov, K.; Pettissalo, M.; Janhunen, J.; Iinatti, J.H. Performance of a low-power wide-area network based on LoRa technology: Doppler robustness, scalability, and coverage. Int. J. Distrib. Sens. Netw. 2017, 13. [CrossRef]
30. Chollet, F. Keras. 2015. Available online: https://github.com/fchollet/keras (accessed on 19 May 2021).
31. Wilmering, T.; Moffat, D.; Milo, A.; Sandler, M.B. A History of Audio Effects. Appl. Sci. 2020, 10, 791. [CrossRef]
32. Agarap, A.F. Deep learning using rectified linear units (ReLU). arXiv 2018, arXiv:1803.08375.
33. Patterson, J.; Gibson, A. Understanding Learning Rates. In Deep Learning: A Practitioner's Approach, 1st ed.; O'Reilly: Sebastopol, CA, USA, 2017; Chapter 6; pp. 415–530.
34. Zhong, H.; Chen, Z.; Qin, C.; Huang, Z.; Zheng, V.W.; Xu, T.; Chen, E. Adam revisited: A weighted past gradients perspective. Front. Comput. Sci. 2020, 14, 1–16. [CrossRef]