
Article

Texture Classification Using Spectral Entropy of Acoustic Signal Generated by a Human Echolocator

Raja Syamsul Azmir Raja Abdullah 1,*, Nur Luqman Saleh 1, Sharifah Mumtazah Syed Abdul Rahman 1 , Nur Syazmira Zamri 1 and Nur Emileen Abdul Rashid 2

1 Wireless and Photonic Network Research Centre (WiPNET), Faculty of Engineering, Universiti Putra Malaysia (UPM), Serdang 43400, Malaysia; [email protected] (N.L.S.); [email protected] (S.M.S.A.R.); [email protected] (N.S.Z.)
2 Microwave Research Institute, Universiti Teknologi MARA (UiTM), Shah Alam 40450, Selangor, Malaysia; [email protected]
* Correspondence: [email protected]; Tel.: +603-976-943-47

 Received: 12 August 2019; Accepted: 29 September 2019; Published: 2 October 2019 

Abstract: Human echolocation is a biological process wherein the human emits a punctuated acoustic signal, and the ear analyzes the returning echoes in order to perceive the surroundings. The peculiar acoustic signal is normally produced by clicking inside the mouth. This paper utilized this unique acoustic signal from a human echolocator as the source of the transmitted signal in a synthetic human echolocation technique. Thus, the aim of the paper was to extract information from the echo signal and develop a classification scheme to identify signals reflected from different textures at various distances. The scheme was based on spectral entropy extracted from the Mel-scale filtering output within the Mel-frequency cepstral coefficient (MFCC) framework applied to a reflected echo signal. The classification process involved data mining, features extraction, clustering, and classifier validation. The reflected echo signals were obtained via an experimental setup resembling a human echolocation scenario, configured for synthetic data collection. Unlike in typical speech signals, formant characteristics were not visible in the human mouth-click signals. Instead, the multiple spectral peaks derived from the synthesized mouth-click signal were taken as the entropy obtained from the Mel-scale filtering output. To realize the classification process, K-means clustering and K-nearest neighbor processes were employed. Moreover, the impact of propagation on the extracted spectral entropy used in the classification was also investigated. The classifier performance reported herein indicates that spectral entropy is essential for human echolocation.

Keywords: spectral entropy; acoustic signal; human echolocation; classification; MFCC

1. Introduction

The phrase “echolocation” was initially used by Griffin to describe bats’ ability to safely navigate and locate their prey using call signals [1]. What is less known is that a group of people (often blind people), known as human echolocators, have adapted to visualize their surroundings using a similar concept. Human echolocation is the ability to perceive one’s surroundings by listening to the echoes of actively emitted sound signals reflected from an obstacle. Recent studies have reported that these people are able to visualize their surroundings by “seeing through sound”. They exhibit exceptional performance in defining their space, and are even able to accurately discriminate the profile of objects. This ability has elicited inquiries among scholars on how the space and objects can be recognized using mouth-click signals. Although a series of studies have focused on this question, most works have focused on the perceptual concept rather than technical explanations. It is known that

human echolocators depend on their auditory systems in order to translate meaningful cues from mouth-click signals and turn them into visual perception, as illustrated in Figure 1 [1].

Figure 1. The human echolocation concept.

It is essential for humans to recognize the acoustic signals entering their ears, mainly for communication and recognition. For comparison, radar and sonar are good examples of man-made sensors which benefit from such classification; the illumination of a target is associated with detection and classification schemes [2–4]. This meaningful information is useful in distinguishing the profile of the detected target, and helps to minimize poor and false detection events. For human echolocation, how people recognize spaces and objects using mouth-click signals has still not been clearly verified. In addition, no studies to date have reported a technical classification process for human mouth-clicks. Moreover, human mouth-clicks do not inherit formant properties like those in typical speech, which have been proven to be strong features in speech recognition. Instead, the multiple frequency components (spectral entropy) found in the signal serve as features for the classification process. This gap should be investigated to assure continuity and to utilize the full potential of human echolocation. We were thus motivated to analyze and design this classification process.

Studies related to the human auditory system became the primary references in this paper for the classification process of human mouth-clicks. We herein propose a classification scheme for human mouth-clicks using experimental data by utilizing a human auditory model (Mel-frequency cepstral coefficient (MFCC) framework processing). By understanding how human echolocators carry out echolocation (utilizing the mouth-click), a new dimension could be opened in the design of man-made sensors (especially radar and sonar) in the near future. In this paper, we have not made any claim that the classification schemes described closely replicate the strategy used for human echolocation; we present the best intuitive approach for decision-making based on credible knowledge of human echolocation techniques and the human hearing process.

Hence, this study aimed to investigate the characteristics of the echo signal (acoustic mouth-click) reflected from different textures at various distances. We developed a classification scheme to classify textures based on the spectral entropy of the echo signal obtained from the Mel-scale filtering output (Mel-spectrum), which was incorporated into the MFCC framework. The classification tasks included distinguishing hard, medium, and soft textures, and grouping them into their respective cluster regions using the reflected echo signal at different distances. The classification routine was realized using the K-means process and was validated using K-nearest neighbors (K-NN). The paper is structured as follows: Section 2 provides a description of the characteristics of the human mouth-click signal, and presents a brief explanation of the experimental setup used for data collection, followed by echo signal identification. Section 3 elaborates the flow process of extracting spectral entropy using the K-means approach, followed by texture clustering and classification performance analysis using K-NN. Section 4 presents the results and discussion. Section 5 provides an outline of the study's conclusions and the direction of future work.


The study of human echolocators was initiated by Supa et al. nearly half a century ago [5]. They performed experiments with participants who were instructed to tap their heels on the floor in such a way as to create noise for echolocation purposes. A subsequent study conducted by Kellogg found that tongue-clicks, finger-snaps, hissing, and whistling were among the signal sources that could be used to echolocate [6]. Rice et al. conducted a subsequent study which revealed that the majority of the participants preferred to use self-generated signals for echolocation [7]. In a later study, Rojas et al. reported that the majority of the participants also used self-generated signals produced with the oral structure to echolocate [8]. In 2010, Schenkman et al. found that the distance between the target and the human echolocator could affect echolocation performance [9]. A year later, Schenkman et al. carried out experiments on auditory perception that revealed the importance of pitch and loudness in human echolocation [10]. During the experiments, the participants were asked to listen to sound signals with manipulated pitch and loudness; the results revealed that pitch (spectral content) was sufficient to conduct echolocation. In parallel, studies have proven that the information carried via human echolocation processes is adequate to provide the identity of obstacles in a space (the position, size, material, and shape of an object). Rice et al. found that blind participants were able to make a judgment on the effective surface area (size) of an object [11]. Findings from this study showed that larger surface areas could reflect much more energy and lead to an improved success rate. A newer study mostly confirmed that humans do echolocate on a daily basis and are able to differentiate shapes and sizes of objects [12]. In separate works, human echolocation has exhibited remarkable performance in discriminating object texture [13,14]. These facts indicate that human echolocation uses multiple cues, which helps echolocators to translate that meaningful information into an accurate result. These important cues include time delay, spectrum (frequency), and amplitude (loudness). The most recent studies assert that spectral entropy plays a major role in determining the effectiveness of human echolocation processes [15]. However, no studies to date have reported a technical analysis of how these people successfully differentiate the shape, size, and texture of objects.

A study revisiting the concept of human echolocation placed an emphasis on the technical analysis perspective. For instance, analysis of the waveform diversity of human mouth-clicks revealed that they are relatively short, wide-band, and have multiple peak frequency components with exceptional resolution detail [16]. Subsequent studies mostly confirmed that mouth-click signals are individually unique, and that their spectral entropy contains multiple frequency components underlying an envelope factor that probably makes up the entire signal [17,18]. Translated into radar and sonar system applications, such properties could resolve detailed Doppler and delay information for accurate detection results. In addition, speech is undoubtedly human beings' primary mode of communication.
The natural speech communication process involves the use of the mouth to produce an audible signal (acoustic signal), and the use of the ear to interpret this signal [19]. Interestingly, human echolocators also use the mouth–ear mechanism in a secondary sensing modality with which to perceive their surroundings [20]. They listen to the return echo of an actively emitted acoustic signal generated by a punctuated mouth-click in the space (Figure 1). The echo signal is then interpreted by the human brain and translated into meaningful information. Similarly to speech events, human echolocators rely on their hearing ability to interpret the reflected echo signals. Thereby, a strategy similar to that used in speech processing is used in human mouth-click echolocation. Understanding the nature of the biological processes involved is essential to carrying out an analysis of the human mouth-click [21]. For any audible signal, the recognition process in human hearing follows an identical processing manner. It is worth mentioning that the behavior that stands out the most in the human hearing sense is the ability to synthesize an audible signal on a logarithmic scale, in order to accommodate the physical structure of the basilar membrane in the inner ear.

Human echolocation studies in recent years have shown exceptional performance and high accuracy while echolocating using the mouth-click signal [20]. Mouth–ear structures have been acknowledged to be used to echolocate, but how echolocators exploit those systems for detection and classification has still not been widely explored. As a result, an initiative to analyze the mouth-click mechanism in the context of human auditory modeling has been introduced, and was exemplified by the successful recognition results for a pair of transmission–echo signals using the Linde–Buzo–Gray vector quantization (LBGVQ) method [22]. The method's principle is to extract cepstral entropy between transmission–echo signals using Mel-frequency cepstral coefficient (MFCC) processes. As a result, features of a true pair of transmission–echo signals are scattered within the same cluster. Improved detection of human mouth-clicks was achieved with a bio-inspired (BI) processing approach over the matched filter (MF) outcome [23,24]. The BI process utilized a gammatone filter (GF) process in order to synthesize the mouth-click that was used in the detection process. In addition, the synthesized mouth-click signal using GF was extended into ambiguity function (AF) analysis [25]. Results of the analysis revealed that the ability to resolve Doppler-delay information from each filter output was unique. Thus, it became worthwhile to study the human mouth-click from a human modeling perspective, as the signal source was a human being.

2. Signal Characteristics and Experimental Setup

The human mouth–ear communication framework can be replicated by a speaker–microphone setup. The transmission source of the mouth-click signal that was used throughout the data collection is publicly available and can be retrieved from reference [20]. In this section, the characteristics of the mouth-click used for artificial data collection are described in brief. We acknowledge that there are blind people who use finger-snaps, a tapping cane, and hand claps to help them navigate safely. However, this paper only considered signals from mouth-click sounds, as (i) recent scientific studies have reported that human echolocators often use sound from mouth-clicks for echolocation processes, and several studies have emphasized the technical perspective of this method [16–18,20]; and (ii) the signal has been properly recorded and made publicly available for research [20].

2.1. Human Mouth-Click Signal Characteristics

A single human echolocator signal was employed in this study, belonging to a blind person with reportedly exceptional echolocating skill. At as young as 13 months old, this person was diagnosed with retinoblastoma, which caused his blindness, and he has since utilized echolocation in his daily activities. The recorded mouth-click was digitized using a sampling frequency, fs, of 44,100 Hz in an uncompressed Windows audio format (wav). This was able to retain the spectral entropy of the mouth-click in a digital format based on the Nyquist theorem. The duration of the mouth-click was relatively short, approximately 3 ms, without any specific modulation scheme, as shown in Figure 2a. The spectrogram analysis shown in Figure 2b revealed an entropy identity with the presence of multiple frequency components in two separate regions, namely the main frequency region and the upper frequency region. The major energy could be found in the main frequency region.

Despite having unique waveform diversity, these multiple frequency components are likely to cause undesirable output when performing the detection process using the MF, due to the abundance of multiple local maxima and a low sidelobe level (SLL). This is an undesirable outcome especially in detection outputs, where the system is prone to poor detection and false alarms. To tackle the issue, an alternative approach using BI entropy has reportedly been used in recent analyses [22,24]. More specifically, the BI-incorporated GF process first synthesizes the signal, then performs a multi-stage correlation process, followed by summing all correlation products to improve the detection results. Thus, it is worthwhile to explore the potential of classification performance by utilizing extracted spectral entropy based on a human auditory modeling approach, as presented in this paper. Details of entropy extraction using the MFCC framework are discussed briefly in Section 3.

Figure 2. Characteristics of the human echolocator’s mouth-click: (a) time-domain signal and (b) spectrogram [2].


2.2. Human Mouth–Ear Experimental Setup

Figure 3a shows the configuration of the human mouth–ear system modeled with a speaker–microphone setup. A condenser microphone was chosen as it offered better input sensitivity, lower noise, and a wider frequency response than dynamic microphones. Both the speaker output and the microphone input connector were relayed into an RME Fireface UC, acting as the DAC and ADC, and were linked to the computer via high-speed USB 2.0, as shown in Figure 3b. Data collection throughout the experiment was controlled using Matlab for flexible data manipulation in the classification tasks.

Figure 3. Experiment setup for an artificial human echolocator: (a) speaker–microphone layout and (b) complete layout.

The target was placed on top of a flat wooden chair, in line of sight and facing the speaker–microphone layout, to help maximize the reflection properties of the individual target with different textures. The target distance was measured from the baseline of the speaker–microphone, at 50 and 100 cm, as illustrated in Figure 4. Such distance values were reasonable and correspond to the average human step and stride length at a normal walking pace, which is reportedly 70 cm on average [26]. Considering a range of scenarios, 50 and 100 cm should be appropriate to represent actual human echolocation. Throughout the data collection, the air-conditioner temperature was set to 16 °C in order to maintain the surrounding ambient temperature and humidity.

Figure 4. Target distance from the speaker–microphone setup.

Accordingly, it was essential for each texture to have an identical effective surface area in order to standardize the effect of sound pressure level (SPL) and sound intensity (SI) upon the collected echo signals [27,28]. Likewise, both SPL and SI were associated with the distance the sound traveled. Importantly, as sound radiates freely in space, it is bound to experience a decrease in sound pressure p (N/m²), as illustrated in Figure 5. Thus, p is inversely proportional to distance d, as shown in Equation (1). Consequently, the relation between SPL and d can be described using the law of sound pressure, ΔPp, as expressed in Equation (2), where Dr is the distance ratio corresponding to the sound origin and do is the sound source location. In the experiments herein, understanding of this phenomenon was crucial, as a microphone records data in a similar manner to SPL.

p ∼ 1/d  (1)

ΔPp = 20 log(Dr);  Dr = d0/di,  i = 1, 2, 3, ..., N  (2)
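For concreteness, the following minimal Python sketch evaluates Equation (2) for the two target distances used in this work; the 50 cm and 100 cm values simply reuse the experimental setup above and do not represent additional measurements.

```python
# Worked example of Equations (1)-(2): sound pressure falls off as 1/d, so the
# SPL change between the two target distances used here (50 cm and 100 cm) is
# 20*log10(d0/di). Values are illustrative only.
import math

d0 = 0.50   # reference target distance (m)
d1 = 1.00   # farther target distance (m)

Dr = d0 / d1                      # distance ratio in Equation (2)
delta_SPL = 20 * math.log10(Dr)   # change in sound pressure level (dB)

print(f"Distance ratio Dr = {Dr:.2f}")
print(f"Delta SPL = {delta_SPL:.2f} dB")   # about -6.02 dB when the distance doubles
```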



Figure 5. Sound pressure level law theory.



In addition, the SI decreases significantly with increasing distance from the sound origin when traveling in free space. Degradation of the SI value occurs because the sound energy continues to spread across a much larger surface area as it travels, as illustrated in Figure 6. Thereby, for any doubling of the distance from the sound source, the intensity I (W/m²) is bound to decrease to a quarter of its original value, following the inverse square law. The SI was calculated using Equation (3), with Pac (Watt) being the acoustic power at do, A (m²) the sphere area, and d (m) the distance.

I ∼ 1/d²;  I = Pac/A;  A = 4π × r²  (3)
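A similar sketch of Equation (3) illustrates the inverse square law; the acoustic power value below is an arbitrary placeholder chosen only to show the quarter-intensity ratio when the distance doubles.

```python
# Worked example of Equation (3): sound intensity follows the inverse square law,
# so doubling the distance from the source leaves a quarter of the intensity.
# Pac is a hypothetical acoustic power chosen only for illustration.
import math

Pac = 1e-3                       # assumed acoustic power of the click source (W)

def intensity(d):
    """Intensity at distance d (m) over a spherical wavefront of area A = 4*pi*d^2."""
    A = 4 * math.pi * d ** 2
    return Pac / A               # W/m^2

I_50 = intensity(0.50)
I_100 = intensity(1.00)
print(f"I(0.5 m) = {I_50:.6f} W/m^2")
print(f"I(1.0 m) = {I_100:.6f} W/m^2")
print(f"Ratio I(1.0)/I(0.5) = {I_100 / I_50:.2f}")   # 0.25, i.e. a quarter
```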

Figure 6. Sound intensity (SI) associated with the inverse square law theory.

The target base was made of polyvinyl chloride (PVC) and wrapped with different materials obtained from a hardware shop, as detailed in Table 1. A total of three materials were used for distinguishing texture, namely soft, medium, and hard materials, each with a surface area of 3025 cm². Each material has distinct composition characteristics; thus, the absorption properties varied accordingly, as discussed in Section 3.2 [29]. As this was a preliminary investigation, we scaled down the scope of this paper to classifying three different textures at the respective distances.


Table 1. Description of material used to wrap the polyvinyl chloride (PVC) board.

Texture | Material | Dimension (cm)
Hard | Flat PVC board wrapped with aluminum foil | L = 50, H = 50, W = 0.5
Medium | Rubber mat | L = 50, H = 50, W = 0.8
Soft | Sponge | L = 50, H = 50, W = 4

The methodology discussed above emphasized the introduction of diverse elements relative to the reflected echo signals so as to classify their behavior and performance, which was a major aim of the analysis in this paper.

2.3. Signature of Reflected Echo Signal

The spectral entropy in the sound signal was associated with sound propagation effects while it traveled the distance described in Section 2.2. Thus, these two factors were expected to influence the quality of the reflected echo signal received by the microphone. In order to analyze the texture classification at a certain distance, the correct reflected signal had to be identified from the raw data. The true echo signal was extracted via Equation (4), with d as the distance between the target and the speaker–microphone, t as the return time of the echo signal at the microphone, θ as the elevation of the target with respect to the speaker–microphone, and C as the speed of sound in air, which
equates to 342 m/s.

d = (C · t / 2) cos θ  (4)

Moreover, using Equation (4), the echo signal in the raw data could be estimated and extracted. Figure 7a shows the experimental data for the target at a distance of 50 cm: given t = 3.2538 ms, Equation (4) yields d = 55.64 cm. For the target at 100 cm and t = 5.7828 ms, d was 98.89 cm, as shown in Figure 7b. The offset of the measured echo delay from the value expected via Equation (4) was caused by atmospheric factors, which can significantly influence the speed of sound [30].
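The range calculation in Equation (4) can be reproduced directly; the sketch below assumes zero elevation (cos θ = 1) and reuses the delay values reported above.

```python
# Sketch of Equation (4): recovering target range from the measured echo delay.
# The delays below are the ones reported for the 50 cm and 100 cm targets.
import math

C = 342.0            # speed of sound in air used in the paper (m/s)
theta = 0.0          # assumed target elevation relative to the speaker-microphone (rad)

def echo_range(t):
    """Target distance d = (C*t/2)*cos(theta) for a two-way delay t in seconds."""
    return (C * t / 2.0) * math.cos(theta)

print(f"t = 3.2538 ms -> d = {100 * echo_range(3.2538e-3):.2f} cm")  # ~55.64 cm
print(f"t = 5.7828 ms -> d = {100 * echo_range(5.7828e-3):.2f} cm")  # ~98.89 cm
```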

Figure 7. Experimental raw data containing transmit leak and echo signal: distance of (a) 50 cm and (b) 100 cm.

The SPL effect, as discussed in Section 2.2, was discernible from the extracted echo signal; differences in the amplitude are highlighted in Figure 8. At any given position of the target from the speaker–microphone, the echo signal experienced an attenuation effect, as demonstrated in the amplitude.

Figure 8. Reflected echo signal for all textures: distance of (a) 50 cm and (b) 100 cm.

The time-domain characteristic of a sound signal provides a first glance at the spectral entropy of the energy profile of the individual echo signal. This energy (W/m²) denotes the SI in Equation (3). The visible variation in normalized amplitude between echoes at 50 and 100 cm was due to sound energy being dissipated across a larger surface area as it traveled from the sound origin, as shown in Figure 9.

Figure 9. Power spectrum density (PSD) of reflected echo signals: distance of (a) 50 cm and (b) 100 cm.

3. Spectral Entropy Features and Classification Framework

In classification tasks, it is essential to obtain a set of features that helps to translate meaningful information accurately to discriminate the respective identities of the features [2,31,32]. In human speech recognition applications, extracted formant coefficients are often harvested and manipulated and then used as features [33–36]. Speech recognition processes in present-day scenarios have reached a level of maturity where they are able to produce exceptional performance. Consequently, speech recognition processing schemes utilizing spectral entropy have been exploited in non-speech signal applications, e.g., smart devices [37,38], robotics [39], and surveillance cameras [40].

To accomplish the features extraction task in this paper, a Mel-scale filtering process (incorporated into the MFCC framework) was deployed in the classification campaign. The feasibility of MFCC to contribute to obtaining exceptional results in speech recognition applications has already been clearly demonstrated [37]. In a wider sense, if an extracted entropy from a Mel-scale filtering process is able to classify these textures, which are similar in surface area (2D shape), then it could be much easier for this method to segregate a target with a complex curved shape and materials (3D in shape) in subsequent analysis. The motivation to use spectral entropy for features arose because the mouth-click signal was made up of multiple frequencies, as described in Section 2.1.

3.1. MFCC Structure

The conventional MFCC framework consists of five major sub-processes, namely, (i) signal framing, (ii) windowing, (iii) discrete Fourier transform (DFT), (iv) Mel-scale filtering and smoothing, and (v) discrete cosine transform (DCT), as graphically illustrated in Figure 10. To avoid conflicts of interest, our aim was to use the extracted spectral entropy (Mel-spectrum) from the Mel-scale filtering process for the classification campaign (see Section 3.2). Hence, in this section we have described the complete MFCC framework for easier understanding of how spectral entropy is obtained via the Mel-scale filtering process (since the Mel-scale filter is embedded in the MFCC framework).

Figure 10. Block diagram for features extraction using the Mel-frequency cepstral coefficient (MFCC).

The entropy extraction process was initiated via the breakdown of the mouth-click echo signal into frames corresponding to the number of Mel-scale filters, K. Here, K was defined to be equal to 40, which is sufficient to represent the echo spectral amplitude in a minimum windowing size of approximately 15.87 ms after zero-padding (as the length of the mouth-click itself is about 3 ms). Next, using a Hamming window, the windowing process minimized the effect of sharp edges. Subsequently, the DFT was taken on each frame to obtain the spectral amplitude (frequency domain), after which this amplitude was passed onto the Mel-scale filter-bank m, where Mn(f) is an inverse representation of m, as expressed in Equation (5). The Mel-scale filter-bank response corresponding to Equation (5) is illustrated in Figure 11.
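As a rough illustration of the framing and windowing steps just described, the sketch below zero-pads a ~3 ms click into a 700-sample frame (our concrete reading of the ~15.87 ms window at fs = 44,100 Hz) and applies a Hamming window before the DFT; the click itself is synthetic noise standing in for the recorded echo.

```python
# Sketch of the framing and windowing stages: zero-pad a short click into a
# ~15.87 ms analysis frame and taper it with a Hamming window before the DFT.
import numpy as np

fs = 44_100                                  # sampling frequency (Hz)
frame_len = 700                              # ~15.87 ms analysis frame (assumption)
click = np.random.randn(int(0.003 * fs))     # placeholder ~3 ms click (132 samples)

frame = np.zeros(frame_len)
frame[:click.size] = click                   # zero-pad the short click into the frame
windowed = frame * np.hamming(frame_len)     # minimize sharp-edge effects

power_spectrum = np.abs(np.fft.rfft(windowed)) ** 2   # DFT bin energies |X(f)|^2
print(windowed.shape, power_spectrum.shape)           # (700,) (351,)
```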

Figure 11. Mel-scale filter-bank with 40 channels.

Mel-filtering aims to make the spectral content smooth, in preparation for meaningful features representation. In addition, the Mel-scale filtering process helped to exhibit spectral content mimicking the human auditory system, using the weighted Mel-scale filter-bank Wk(f) given in Equation (6), where X(f) is the DFT bin energy. With this scheme, it was expected that a robust and efficient task for a specific application could be obtained.

m = 2595 log10(1 + f/700),
Mn(f) = 700(10^(m/2595) − 1),  n = 1, 2, 3, ..., K  (5)
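A minimal construction of the 40-channel triangular Mel filter-bank implied by Equation (5) is sketched below; the frame length and sampling frequency follow Section 3.1, while the triangular interpolation details are a common convention rather than a specification taken from this paper.

```python
# Sketch of Equation (5): build K = 40 triangular Mel filters using the mapping
# m = 2595*log10(1 + f/700) and its inverse Mn(f) = 700*(10^(m/2595) - 1).
import numpy as np

fs, n_fft, K = 44_100, 700, 40

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# K filters need K + 2 equally spaced points on the Mel axis
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), K + 2)
hz_points = mel_to_hz(mel_points)
bins = np.floor((n_fft + 1) * hz_points / fs).astype(int)

filter_bank = np.zeros((K, n_fft // 2 + 1))
for n in range(1, K + 1):
    lo, mid, hi = bins[n - 1], bins[n], bins[n + 1]
    for b in range(lo, mid):
        filter_bank[n - 1, b] = (b - lo) / max(mid - lo, 1)   # rising edge
    for b in range(mid, hi):
        filter_bank[n - 1, b] = (hi - b) / max(hi - mid, 1)   # falling edge

print(filter_bank.shape)   # (40, 351), cf. the 40-channel filter-bank in Figure 11
```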



Wk(f) = Σf Mn(f) · |X(f)|²  (6)

As mentioned earlier in this section, we used the Mel-scale filtering process to extract features for the classification tasks. Hence, the features extraction process via the MFCC framework used in this paper was limited to the Mel-filtering process (see Figure 10). In effect, to obtain cepstrum entropy from the Mel-scale filtering output, it had to be converted into a time-domain representation using the DCT, termed the MFCC coefficient Ck (also known as cepstrum entropy). Thus, the Ck process can be expressed via Equation (7) and could be used as classification features in future works.

Ck = Σ_{k=1}^{K} Wk(f) cos[n(k − 1/2)(π/K)]  (7)

3.2. Mel-Spectrum Output

Each of the coefficients extracted during the MFCC process can be mined for meaningful features input depending on the specific tasks. Our approach employed the Mel-scale filtering process as a baseline method to extract features of the mouth-click signal, because it provided an effective means to discriminate texture clusters at the respective distances. As such, the Mel-scale filter-bank output in Equation (6) was exploited as the features for the classification campaign, due to its strength in presenting the spectral entropy of the mouth-click signal. The Mel-spectrum outputs of each texture at the distances of 50 and 100 cm are shown in Figure 12a,b, respectively. The Mel-spectrum output was able to resolve the spectral features because the human auditory system perceives loudness in an approximately logarithmic manner. Compared to the PSD results in Figure 9, the Mel-spectrum output managed to discriminate the spectral entropy of individual textures with a much more detailed representation.
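The filter-bank energies of Equation (6) and the cepstral conversion of Equation (7) can be sketched as follows; the arrays are random placeholders with the shapes used above, and the log compression plus DCT-II call is a standard MFCC convention that may differ slightly from the authors' exact cosine sum.

```python
# Sketch of Equations (6)-(7): Mel-spectrum W_k(f) as filter-weighted DFT bin
# energies, followed by a cosine transform to obtain cepstral coefficients C_k.
import numpy as np
from scipy.fftpack import dct

K = 40
filter_bank = np.abs(np.random.randn(K, 351))         # stand-in for the Eq. (5) filters
power_spectrum = np.abs(np.random.randn(351)) ** 2    # stand-in for |X(f)|^2

mel_spectrum = filter_bank @ power_spectrum           # Equation (6): one energy per filter
# Equation (7) as printed applies the cosine sum to W_k(f); a log step, common in
# standard MFCC pipelines, is added here for numerical stability (assumption).
cepstrum = dct(np.log(mel_spectrum + 1e-12), type=2, norm='ortho')

print(mel_spectrum.shape, cepstrum.shape)             # (40,) (40,)
```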


Figure 12. Mel-spectrum output: distance of (a) 50 cm and (b) 100 cm.

3.3. Classification Assignment Using K-Means and K-NN Validation

Figure 13 displays the block diagram used for the classification process in this paper. It was divided into two sub-processes: a signal pre-process that synthesized the mouth-click echo using the Mel-scale filtering process to obtain the Mel-spectrum output as the features input, and a classification task realization process using K-means to distinguish the cluster of each texture. Next, the information containing the clustering vector from the K-means process was handed over to the K-NN process for performance validation. Built-in functions in Matlab were used in this study to accommodate the K-means and K-NN tasks.

Figure 13. Classification block diagram used to distinguish different textures from mouth-click. Figure 13. Classification block diagram used to distinguish different textures from mouth-click. Under the machine learning philosophy, features extraction, data mining, and data simplification Under the machine learning philosophy, features extraction, data mining, and data are welcome steps in achieving the desired outcome prior to a classification campaign [41–44] For simplification are welcome steps in achieving the desired outcome prior to a classification campaign this objective, the authors of this study created a database using the calculated mean values of the [41–44] For this objective, the authors of this study created a database using the calculated mean extracted spectral entropies and arranged them into a matrix, after which they were tagged according values of the extracted spectral entropies and arranged them into a matrix, after which they were to their texture. Once this was completed, relaying into the classification phase could follow using tagged according to their texture. Once this was completed, relaying into the classification phase the K-means process. Essentially, K-means group a large number of data samples into a specific could follow using the K-means process. Essentially, K-means group a large number of data samples number of clusters. To achieve this task, K-means minimize the total intra-cluster variance, represented into a specific number of clusters. To achieve this task, K-means minimize the total intra-cluster in Equation (8) as J, where a indicates a number of clusters, b is the number of cases, xk is cases variance, represented in Equation (8) as J, where a indicates a number of clusters, b is the number of corresponding to b, cj is a clustering of a, and Dj is the distance function. Next, the K-NN process cases, xk is cases corresponding to b, cj is a clustering of a, and Dj is the distance function. Next, the calculated the performance between the training and testing of each cluster texture, which can be K-NN process calculated the performance between the training and testing of each cluster texture, generally expressed in Equation (9) as KN. Moreover, the consistency of each cluster was interpreting which can be generally expressed in Equation (9) as KN. Moreover, the consistency of each cluster using silhouette value. Finally, the clustering performance was further validated with a confusion was interpreting using silhouette value. Finally, the clustering performance was further validated matrix, and the cluster results were later displayed. with a confusion matrix, and the cluster results were later displayed. Xa Xb j 2 j J = xk cj ,Dj = xk cj, (8) 𝐽 =𝑥k −−𝑐k , − j = 1 k = 1 (8) v tu m 𝐷 =𝑥 −𝑐, X 2 KN = D (9) j = 1 𝐾 = 𝐷 (9) A total of 80 and 20 datasets from each texture were used for the training and testing tasks at distances of 50 and 100 cm, respectively. Left and right signal data from the microphone were equally employed for the 80 training data samples, and 20 for the test data sample tabulated in Table2. To ensure that the 80 datasets used for training were adequate for the features generalization stage,

To ensure that the 80 datasets used for training were adequate for the features generalization stage, we conducted a consistency test using analysis of variance (ANOVA). The ANOVA analysis measured the variation between the columns (signals) in each dataset and then calculated the probability of the test statistic, pt, from the F-statistic table, denoted as P(F > pt). A larger pt value indicates that the differences between the columns tested were less significant, and vice versa for smaller values. Based on our results, the individual signals of each dataset scored over 90% (translated into a percentage) in the ANOVA analysis, as tabulated in Table 2. This indicates that the signals used to create the dataset for classification differed insignificantly (consistent signals). Hence, we can conclude that the total of 80 datasets used for data training was adequate in this paper.
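A consistency check of this kind can be sketched with SciPy's one-way ANOVA. The dataset array below is a hypothetical placeholder for one texture/distance recording set; the call simply illustrates how a probability of the form P(F > pt) could be obtained.

```python
# One-way ANOVA consistency check (sketch); `dataset` is a placeholder matrix
# of 1024 samples x 80 recorded echoes for a single texture and distance.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
dataset = rng.normal(size=(1024, 80))

# Treat every recorded echo (column) as one ANOVA group and test whether the
# columns differ; a large p-value, i.e. P(F > pt), indicates consistent signals.
f_stat, p_value = f_oneway(*dataset.T)
print(f"P(F > pt) = {p_value:.4f}")
```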

Table 2. Summary of datasets used in the classification process.

Distance | Texture | Training, Left Mic. | Training, Right Mic. | Testing, Left Mic. | Testing, Right Mic. | ANOVA Probability
50 cm | Hard | 40 | 40 | 10 | 10 | 0.9990
50 cm | Medium | 40 | 40 | 10 | 10 | 0.9810
50 cm | Soft | 40 | 40 | 10 | 10 | 0.9850
100 cm | Hard | 40 | 40 | 10 | 10 | 0.9266
100 cm | Medium | 40 | 40 | 10 | 10 | 0.9031
100 cm | Soft | 40 | 40 | 10 | 10 | 0.9132

4. Results Evaluation and Discussion

4.1. Clustering Results

The clustering region for the training data in the K-means environment revealed a visible separation between the textures, as shown in Figure 14a. Similarly, a clustering texture region for the test data plotted on top of the training data appeared, as shown in Figure 14b. Figure 14b revealed that there were testing coefficients for the medium and soft textures breaching the respective clusters at both distances. There is a high possibility that this occurred due to ambient noise recorded during data collection, which was reflected in the K-means process (generalization stage). The features output obtained from the K-means process was used to determine the nearest neighbor features via the K-NN process. To further improve the classification performance, we decided to perform dimension reduction by choosing only ten of the nearest neighbor features between the training and testing data from each cluster for the verdict, as shown in Figure 14c.
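One possible reading of this selection step, assuming that "nearest" is measured by Euclidean distance between the training and test points of a cluster, is sketched below; the function name and data are illustrative placeholders rather than the study's exact procedure.

```python
# Sketch of the dimension-reduction step: keep only the ten training points
# lying closest to the test points of one cluster (assumed Euclidean distance).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ten_best_features(train_cluster, test_cluster, n_best=10):
    """Return the n_best training rows nearest to any test row in this cluster."""
    nn = NearestNeighbors(n_neighbors=1).fit(test_cluster)
    distances, _ = nn.kneighbors(train_cluster)   # distance of each training row to the test set
    best_idx = np.argsort(distances.ravel())[:n_best]
    return train_cluster[best_idx]
```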

Figure 14. Clustering region of all textures: (a) cluster of training data; (b) cluster of training and test data; (c) cluster of 10 best features.

4.2. Silhouette Results

The clustering behavior shown in Figure 14b,c was monitored using a silhouette function in Matlab, which took the vector corresponding to the coordinates, points, and column from the K-means process. It measured the degree of similarity between points within a cluster, defined as Cl_i and given by Equation (10), where a_i is the initial point and b_i the next iteration point. The silhouette distribution pattern for the full K-means coefficient values was better at the distance of 50 cm than at the distance of 100 cm, with the cluster distances receiving silhouette plot scores of 75% and 60%, respectively, as shown in Figure 15a. For comparison, Figure 15b shows that dimension reduction with the 10 best K-means coefficient values significantly superseded the silhouette values at both distances, 50 cm and 100 cm, with scores of 90% and 82.38%, respectively, against the values shown in Figure 15a. We managed to optimize the classification score and concluded by using the 10 best K-means coefficient values to improve the classification performance, as presented in Figure 16.

Cl_i = \frac{b_i - a_i}{\max(a_i, b_i)} \qquad (10)
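For reference, the silhouette monitoring corresponding to Equation (10) can be reproduced with scikit-learn as in the sketch below; features and cluster_labels are placeholders standing in for the K-means coefficient values and cluster assignments.

```python
# Silhouette check corresponding to Equation (10); inputs are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(2)
features = rng.random((240, 2))              # placeholder K-means coefficient values
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

per_point_silhouette = silhouette_samples(features, cluster_labels)   # Cl_i for each point
mean_silhouette = silhouette_score(features, cluster_labels)          # overall cluster quality
```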

Figure 15. Silhouette plot of different textures (cluster 1 for hard, cluster 2 for medium, and cluster 3 for soft) at different distances: (a) full set of K-means coefficient values and (b) 10 best K-means coefficient values.


Figure 16. Silhouette score trending.

4.3. Confusion Matrix

The final texture classification results corresponding to Figure 14c are tabulated in Table 3, revealing improved performances with 100% scores at the distances of 50 and 100 cm. It is worth noting that the dimension reduction significantly improved the classification of human mouth-clicks by selecting the strong features, which was reflected in the good performance. Moreover, three significant findings from the results are worth stressing: (i) SPL and SI may be the factors that reflect the spectral entropy of the reflected echo signals, (ii) dimensional reduction helped to significantly improve the classification performance by selecting the strongest features from the K-means coefficient values, and (iii) these factors helped to build a strong perception of obstacle texture using an echolocation process. These pieces of information could be very useful for blind people, as well as for building an artificial system able to echolocate, which could be a promising direction for future man-made sensor applications. Furthermore, the results suggested that the MFCC framework could become a suitable process for exploiting the spectral entropy of human mouth-clicks, as all textures at both distances (50 and 100 cm) scored full marks (100%), as shown in Table 3.

Table 3. Confusion matrix for classification performance.

Distance | Texture | Hard (%) | Medium (%) | Soft (%)
50 cm | Hard | 100 | 0 | 0
50 cm | Medium | 0 | 100 | 0
50 cm | Soft | 0 | 0 | 100
100 cm | Hard | 100 | 0 | 0
100 cm | Medium | 0 | 100 | 0
100 cm | Soft | 0 | 0 | 100
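A confusion matrix such as Table 3 can be generated with scikit-learn as sketched below; the label lists are placeholders reproducing the perfect-score case rather than the study's raw K-NN predictions.

```python
# Confusion-matrix validation (sketch); labels are placeholders that
# reproduce the 100% scores reported in Table 3.
from sklearn.metrics import confusion_matrix

true_textures = ["hard"] * 10 + ["medium"] * 10 + ["soft"] * 10
predicted_textures = list(true_textures)
cm = confusion_matrix(true_textures, predicted_textures, labels=["hard", "medium", "soft"])
print(cm)   # rows: true texture, columns: predicted texture
```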

5. Conclusions

This paper presented classification results for human mouth-clicks, the motivation for which sprouted from speech signal classification processes that use the MFCC framework for spectral entropy extraction. Artificial data of human mouth-clicks were collected using off-the-shelf audio devices that have been described comprehensively herein. Moreover, the classification framework has been explained in detail, from top to bottom of the processing scheme. The extracted entropy of the Mel-spectrum output as a feature vector yielded exceptional outcomes. The experimental results and analysis justified the combination of the MFCC framework, K-means, and K-NN as a viable model for classifying human mouth-clicks. It is also worth noting that the distance significantly affected the clustering outcomes, with a shorter distance producing better silhouette values subject to sound propagation. Overall, the spectral entropy from the Mel-spectrum output provided sufficient features for the classification task discussed in this paper. This study was the first to conduct an analysis regarding the classification of human mouth-clicks utilizing Mel-spectrum output as a source modality for features.

In the long run, several improvements need to be considered to address gaps in this paper. At this stage, it seems that there are plenty of strategies to be learned from the classification of texture using mouth-clicks. However, we also need to be realistic about the applications of the analysis discussed in this paper. The performance of human mouth-click classification could be benchmarked using different human auditory model processes for spectral entropy extraction, in order to find the model that is best fitted for actual human echolocation applications. Diverse spectral entropies using various object profiles (e.g., size and stiffness) should be created in order to evaluate the effectiveness of the spectral entropy of mouth-clicks in the classification of various objects. These recommendations are just a few factors that will help to ensure the continuing study of human mouth-clicks, especially for classification tasks. As such, they can be exploited by others in relevant fields as befits the nature of their studies. Furthermore, such studies will help to validate the credibility of the knowledge discussed in this paper. Moreover, the findings discussed in this paper can contribute towards realizing the development of aid devices for human echolocators, and could benefit human echolocators who are learning and polishing their skill in order to achieve accurate results. In addition, the technique can be applied to radar and sonar systems with appropriate frequencies.

Author Contributions: R.S.A.R.A. created the main ideas and served as main advisor. N.L.S. analyzed the data and N.S.Z. conducted the data collection. S.M.S.A.R. and N.E.A.R. served on the advisory committee and provided advice based on their related expertise.

Funding: Fundamental Research Grant Scheme (FRGS), Ministry of Education, Malaysia; High Impact Grant (HIG), Universiti Putra Malaysia.

Acknowledgments: The authors thank Lore Thaler (Durham University) and Mike Cherniakov (University of Birmingham) for their advice on human echolocation.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Griffin, D.R. Echolocation by blind men, bats and radar. Science 1944, 100, 589–590. [CrossRef] [PubMed]
2. Abdullah, R.R.; Aziz, N.A.; Rashid, N.A.; Salah, A.A.; Hashim, F. Analysis on Target Detection and Classification in LTE Based Passive Forward Scattering Radar. Sensors 2016, 16, 1607. [CrossRef] [PubMed]
3. Will, C.; Vaishnav, P.; Chakraborty, A.; Santra, A. Human Target Detection, Tracking, and Classification Using 24-GHz FMCW Radar. IEEE Sens. J. 2019, 19, 7283–7299. [CrossRef]
4. Angelov, A.; Robertson, A.; Murray-Smith, R.; Fioranelli, F. Practical classification of different moving targets using automotive radar and deep neural networks. IET Radar Sonar Navig. 2018, 12, 1082–1089. [CrossRef]
5. Supa, M.; Cotzin, M.; Dallenbach, K.M. Facial Vision: The Perception of Obstacles by the Blind. Am. J. Psychol. 1944, 57, 133–183. [CrossRef]
6. Kellogg, W.N. Sonar System of the Blind: New research measures their accuracy in detecting the texture, size, and distance of objects by ear. Science 1962, 137, 399–404. [CrossRef] [PubMed]
7. Rice, C.E. Human Echo Perception. Science 1967, 155, 656–664.
8. Rojas, J.A.M.; Hermosilla, J.A.; Montero, R.S.; Espí, P.L.L. Physical analysis of several organic signals for human echolocation: Oral vacuum pulses. Acta Acust. United Acust. 2009, 95, 325–330. [CrossRef]
9. Schenkman, B.N.; Nilsson, M.E. Human Echolocation: Blind and Sighted Persons' Ability to Detect Sounds Recorded in the Presence of a Reflecting Object. Perception 2010, 39, 483–501. [CrossRef]
10. Schenkman, B.N.; Nilsson, M.E. Human Echolocation: Pitch versus Loudness Information. Perception 2011, 40, 840–852. [CrossRef]

11. Rice, C.E.; Feinstein, S.H. Sonar System of the Blind: Size Discrimination. Science 1965, 148, 1107–1108. [CrossRef] [PubMed]
12. Milne, J.L.; Goodale, M.A.; Thaler, L. The role of head movements in the discrimination of 2-D shape by blind echolocation experts. Atten. Percept. Psychophys. 2014, 76, 1828–1837. [CrossRef] [PubMed]
13. Hausfeld, S.; Power, R.P.; Gorta, A.; Harris, P. Echo perception of shape and texture by sighted subjects. Percept. Mot. Skills 1982, 55, 623–632. [CrossRef] [PubMed]
14. DeLong, C.M.; Au, W.W.L.; Stamper, S.A. Echo features used by human listeners to discriminate among objects that vary in material or wall thickness: Implications for echolocating dolphins. J. Acoust. Soc. Am. 2007, 121, 605–617. [CrossRef] [PubMed]
15. Norman, L.J.; Thaler, L. Human Echolocation for Target Detection Is More Accurate With Emissions Containing Higher Spectral Frequencies, and This Is Explained by Echo Intensity. Iperception 2018, 9, 204166951877698. [CrossRef] [PubMed]
16. Smith, G.E.; Baker, C.J. Human echolocation waveform analysis. In Proceedings of the IET International Conference on Radar Systems, Glasgow, UK, 22–25 October 2012; p. 103.
17. Zhang, X.; Reich, G.M.; Antoniou, M.; Cherniakov, M.; Baker, C.J.; Thaler, L.; Kish, D.; Smith, G.E. Echolocation in humans: Waveform analysis of tongue clicks. IEEE IET Lett. 2017, 53, 580–582. [CrossRef]
18. Thaler, L.; Reich, G.M.; Zhang, X.; Wang, D.; Smith, G.E.; Tao, Z.; Abdullah, R.S.A.B.R.; Cherniakov, M.; Baker, C.J.; Kish, D. Mouth-clicks used by blind expert human echolocators–signal description and model based signal synthesis. PLoS Comput. Biol. 2017, 13, e1005670. [CrossRef] [PubMed]
19. Purves, D.; Williams, S.M. Neuroscience, 3rd ed.; Sinauer Associates Inc.: Sunderland, MA, USA, 2004; Volume 3.
20. Thaler, L.; Arnott, S.R.; Goodale, M.A. Neural correlates of natural human echolocation in early and late blind echolocation experts. PLoS ONE 2011, 6, e20162. [CrossRef] [PubMed]
21. Mead, C. Neuromorphic electronic systems. Proc. IEEE 1990, 78, 1629–1636. [CrossRef]
22. Abdullah, R.S.A.R.; Saleh, N.L.; Ahmad, S.M.S.; Rashid, N.E.A.; Reich, G.; Cherniakov, M.; Antoniou, M.; Thaler, L. Bio-inspired radar: Recognition of human echolocator tongue clicks signals. In Proceedings of the 2017 IEEE Asia Pacific Microwave Conference, Kuala Lumpur, Malaysia, 13–16 November 2017; Volume 1, pp. 861–864.
23. Abdullah, R.S.A.R.; Saleh, N.L.; Rashid, N.E.A.; Ahmad, S.M.S. Bio-inspired signal detection mechanism for tongue click waveform used in human echolocation. Electron. Lett. 2017, 53, 1456–1458. [CrossRef]
24. Abdullah, R.R.; Saleh, N.; Ahmad, S.; Salah, A.A.; Rashid, N.A. Detection of Human Echo Locator Waveform Using Gammatone Filter Processing. In Proceedings of the 2018 International Conference on Radar (RADAR), Brisbane, Australia, 30 August 2018; pp. 1–6.
25. Abdullah, R.S.A.R.; Saleh, N.L.; Ahmad, S.M.S.; Salah, A.A.; Rashid, N.E.A. Ambiguity function analysis of human echolocator waveform by using gammatone filter processing. J. Eng. 2019, 2018, 1–5. [CrossRef]
26. Anwary, A.R.; Yu, H.; Vassallo, M. Optimal Foot Location for Placing Wearable IMU Sensors and Automatic Feature Extraction for Gait Analysis. IEEE Sens. J. 2018, 18, 2555–2567. [CrossRef]
27. Patterson, R.D.; Winter, I.M.; Carlyon, R.P. Basic Aspects of Hearing; Springer: New York, NY, USA, 2013; Volume 787.
28. Kuttruff, H. Acoustics: An Introduction, 1st ed.; CRC Press: New York, NY, USA, 2007.
29. Song, B.; Peng, L.; Fu, F.; Liu, M.; Zhang, H. Experimental and theoretical analysis of sound absorption properties of finely perforated wooden panels. Materials 2016, 9, 942. [CrossRef] [PubMed]
30. Albert, D.G. Acoustic waveform inversion with application to seasonal snow covers. J. Acoust. Soc. Am. 2002, 109, 91. [CrossRef] [PubMed]
31. Xie, W.; Xie, Z.; Zhao, F.; Ren, B. POLSAR Image Classification via Clustering-WAE Classification Model. IEEE Access 2018, 6, 40041–40049. [CrossRef]
32. Phan, H.; Andreotti, F.; Cooray, N.; Chen, O.Y.; de Vos, M. Joint Classification and Prediction CNN Framework for Automatic Sleep Stage Classification. IEEE Trans. Biomed. Eng. 2019, 66, 1285–1296. [CrossRef] [PubMed]
33. Meltzner, G.S.; Heaton, J.T.; Deng, Y.; de Luca, G.; Roy, S.H.; Kline, J.C. Silent Speech Recognition as an Alternative Communication Device for Persons With Laryngectomy. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 2386–2398. [CrossRef] [PubMed]
34. Grozdic, D.T.; Jovicic, S.T. Whispered Speech Recognition Using Deep Denoising Autoencoder and Inverse Filtering. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 2313–2322. [CrossRef]

35. Tabibi, S.; Kegel, A.; Lai, W.K.; Dillier, N. Investigating the use of a Gammatone filterbank for a cochlear implant coding strategy. J. Neurosci. Methods 2017, 277, 63–74. [CrossRef]
36. Qi, J.; Wang, D.; Jiang, Y.; Liu, R. Auditory features based on Gammatone filters for robust speech recognition. In Proceedings of the 2013 IEEE International Symposium on Circuits and Systems (ISCAS), Beijing, China, 19–23 May 2013; pp. 305–308.
37. Eronen, A.J.; Peltonen, V.T.; Tuomi, J.T.; Klapuri, A.; Fagerlund, S.; Sorsa, T.; Lorho, G.; Huopaniemi, J. Audio-based context recognition. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 321–329. [CrossRef]
38. Cai, R.; Lu, L.; Hanjalic, A. Co-clustering for auditory scene categorization. IEEE Trans. Multimed. 2008, 10, 596–606. [CrossRef]
39. Chu, S.; Narayanan, S.; Kuo, C.-C.J. Environmental Sound Recognition with Time–Frequency Audio Features. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 1142–1158. [CrossRef]
40. Ballan, L.; Bazzica, A.; Bertini, M.; del Bimbo, A.; Serra, G. Deep networks for audio event classification in soccer videos. In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo, New York, NY, USA, 28 June–3 July 2009; pp. 474–477.
41. Michalak, H.; Okarma, K. Improvement of image binarization methods using image preprocessing with local entropy filtering for alphanumerical character recognition purposes. Entropy 2019, 21, 562. [CrossRef]
42. Li, Z.; Li, Y.; Zhang, K. A Feature Extraction Method of Ship-Radiated Noise Based on Fluctuation-Based Dispersion Entropy and Intrinsic Time-Scale Decomposition. Entropy 2019, 21, 693. [CrossRef]
43. Li, J.; Ke, L.; Du, Q. Classification of Heart Sounds Based on the Wavelet Fractal and Twin Support Vector Machine. Entropy 2019, 21, 472. [CrossRef]
44. Chen, Z.; Li, Y.; Cao, R.; Ali, W.; Yu, J.; Liang, H. A New Feature Extraction Method for Ship-Radiated Noise Based on Improved CEEMDAN, Normalized Mutual Information and Multiscale Improved Permutation Entropy. Entropy 2019, 21, 624. [CrossRef]

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).