
Article

A Single-Trial P300 Detector Based on Symbolized EEG and Autoencoded-(1D)CNN to Improve ITR Performance in BCIs

Daniela De Venuto * and Giovanni Mezzina

Department of Electrical and Information Engineering, Politecnico di Bari, Via E. Orabona, 4, 70124 Bari, Italy; [email protected] * Correspondence: [email protected]; Tel.: +39-0805963582

Abstract: In this paper, we propose a breakthrough single-trial P300 detector that maximizes the information transfer rate (ITR) of the brain–computer interface (BCI), keeping high recognition accuracy performance. The architecture, designed to improve the portability of the algorithm, demonstrated full implementability on a dedicated embedded platform. The proposed P300 detector is based on the combination of a novel pre-processing stage based on the EEG signals symbolization and an autoencoded convolutional neural network (CNN). The proposed system acquires data from only six EEG channels; thus, it treats them with a low-complexity preprocessing stage including baseline correction, winsorizing and symbolization. The symbolized EEG signals are then sent to an autoencoder model to emphasize those temporal features that can be meaningful for the following CNN stage. The latter consists of a seven-layer CNN, including a 1D convolutional layer and three dense ones. Two datasets have been analyzed to assess the algorithm performance: one from a P300 speller application in BCI competition III data and one from self-collected data during a fluid prototype car driving experiment. Experimental results on the P300 speller dataset showed that the proposed method achieves an average ITR (on two subjects) of 16.83 bits/min, outperforming the state of the art for this parameter by +5.75 bits/min. Jointly with the speed increase, the recognition performance returned disruptive results in terms of the harmonic mean of precision and recall (F1-score), which achieves 51.78 ± 6.24%. The same method used in the prototype car driving led to an ITR of ~33 bits/min with an F1-score of 70.00% in a single-trial P300 detection context, allowing fluid usage of the BCI for driving purposes. The realized network has been validated on an STM32L4 microcontroller target for complexity and implementation assessment. The implementation showed an overall resource occupation of 5.57% of the total available ROM and ~3% of the available RAM, requiring less than 3.5 ms to provide the classification outcome.

Received: 12 May 2021; Accepted: 7 June 2021; Published: 8 June 2021

Keywords: brain–computer interface (BCI); P300; single-trial detection; autoencoder; CNN

1. Introduction

Brain–computer interfaces (BCIs) are platforms able to integrate straightforward interactions between the brain and the external world, "bridging" users' brain signals directly to a mechatronic device without the need of any physical interaction [1]. BCIs were initially developed to help locked-in patients in formalizing specific requests [1]; nevertheless, in recent years, the BCIs' versatility has also spread their use among healthy subjects in areas such as entertainment [2], car driving [3], smart homes [4], speller systems [5], mental state monitoring [6], Internet of Things applications [7], etc.

For these kinds of applications, electroencephalogram (EEG)-based BCIs are the most diffused type due to their noninvasive nature, safety, and the inexpensiveness of the dedicated acquisition devices, compared with BCIs that exploit, for instance, the electrocorticogram (ECoG) or functional near-infrared spectroscopy (fNIRS) [1,8–11].

There are several types of EEG patterns used in BCI applications, such as event-related potentials (ERPs), steady-state visual evoked potentials (SSVEPs) and motor imagery (MI).


Specifically, the first type (i.e., ERPs) is the most used because of the simplicity of the elicitation paradigm implementation [8,11]. Moreover, since these potentials are time-locked with respect to the stimulation onset, they are optimal candidates for repetitive and reliable practical BCIs [11]. ERPs are characterized by positive and negative signal deflections, namely components, with each one being linked to different cognitive processes and elicitable with different paradigms. A particular ERP component that has found large use in the BCI area is the P300. Compared with other potentials, the P300 is simple to elicit and measure, requires low training time with no complex paradigm and is suitable for most subjects, including those with severe neuromuscular diseases [1]. This work is focused on the P300 signal elicited during a classic odd-ball paradigm [8], in which a series of stimuli are presented to the subject. In general, these stimuli can be categorized into two types: frequent and rare ones. The occurrence of a rare stimulus causes the elicitation of the P300 component [8].

P300 ERPs typically present a very low signal-to-noise ratio (SNR) if considered in a single trial [9]. For this reason, in most BCI applications, several trials are stored and averaged over an increasing number of repetitions, in order to emphasize the time-locked components (e.g., P300) and minimize the background noise [10]. This grand-average method improves the robustness of the system but, at the same time, it drastically reduces the BCI speed in terms of information transfer rate (ITR). For instance, considering the widely diffused P300 speller problem, the ITR moves from an average value of ~11 bits/min [10–13] in a single repetition context to ~8 bits/min after accumulating the 15 repetitions constituting the character epoch [8]. The issue can be mitigated by adaptive flashing time techniques [14]. However, the correct classification of the P300 response in a few repetitions is a crucial target for the design of fast BCI systems.

Several solutions have been proposed in the literature to address the single-trial P300 recognition problem. Most of them use traditional shallow learning for the P300 classification purpose. In this context, the authors in [13] filtered the EEG signals in the interval 0.1–20 Hz. The filtered signal was then decimated according to the high cut-off frequency. At this point, each signal from a given channel is characterized by 14 samples. Next, they divided the EEG trials into several data subsets, each one connected to a dedicated support vector machine (SVM), for a total of 17 SVMs. To work properly, the implemented algorithm requires a first channel selection stage based on a recursive channel elimination. The proposed system was demonstrated to achieve a P300 detection accuracy of 25.5% after a single repetition and 96% after 15 repetitions in the BCI competition III speller problem (i.e., selection among 36 characters). The authors in [15] also achieved similar results by implementing a Bayesian P300 detection method based on the maximum regression target probability value. Despite the good results, the method suffers from a very long computation time. Indeed, the method requires, on average, 7 min to complete the proposed task (four six-character words and a random number), against about 2.5 min required by most of the equivalent approaches [15].
A processing chain made up of a continuous wavelet transform (CWT), scalogram peak detection, and linear discriminant analysis (LDA) was used for this purpose by the authors in [16]. A misclassification error of 45% was reached by the method, which won the BCI competition II [16]. In the same shallow learning application context, the authors in [17] investigated the possibility of exploiting temporal processing, widely adopted in SSVEP-based BCIs [18] and in EEG-based emotion recognition [19], to improve P300-based brain–computer interface (BCI) performance. Their study, typically oriented to amyotrophic lateral sclerosis (ALS) patients, investigated those cognitive processes using a stepwise linear discriminant analysis (SWLDA) for the offline calibration of the classifier coefficients, identifying the character as the intersection between the row and the column exhibiting the maximum of the sum of scored features. In the P300 speller context, by averaging all the waveforms from target and non-target trials composing each run (i.e., 15), the system in [17] demonstrated the ability to approach the P300 state-of-the-art accuracy (i.e., 98%) by using only six channels.

Nevertheless, the authors did not test the same system on the BCI competition Dataset III [13], making a possible comparison unsatisfying.

In the recent past, a growing interest has been dedicated to the deep learning application in P300 detection. Deep learning approaches strongly impacted natural language and computer vision processing, allowing the resolution of real-world problems that had remained unsolved up to that moment [11]. Deep learning application for P300 detection is still in an evolving stage, far from a consolidated methodology [10]. The first authors who exploited the deep learning capability of capturing hierarchical features from EEG signals were Cecotti and Graser in their work [12]. Here, they implemented a four-layer convolutional neural network (CNN) to capture spatial and temporal features from the raw EEG signals in input. Seven different CNN architectures were proposed with three ensemble models. The most performant among the proposed methods approached the results reached by the authors in [13], i.e., 25.5% after a single repetition and ~95% after 15 repetitions in the speller problem. This approach was exploited by the authors in [10], who extended the model by introducing a batch normalization layer to prevent CNN overfitting. The proposed method, named BN3, analyzed EEG data as a bidimensional matrix with the {channels × time} shape, acting on it as an image structure and reaching the same results as the authors in [13] after 15 repetitions, with an improvement of +10% of character accuracy on a single repetition. A similar approach was adopted by the authors in [20], who—differently from [10]—analyzed an EEG trial as a 3D matrix. Their work exploits spectral features from three different frequency bands (i.e., θ, α, β), creating from each time instant an RGB image given by the linked spectrogram. Then, they stacked convolutional layers with a long short-term memory (LSTM) neural network, increasing the system complexity without any substantial improvement against the above-listed state-of-the-art results. Similar results were achieved in [21], where a 3D recurrent neural network (3DRNN) was used to detect a P300 signal in a single trial. Finally, the authors in [22] proposed the introduction of a preprocessing stage based on principal component analysis (PCA) to improve the CNN results. The proposed system demonstrated a slight improvement of the results, starting from eight repetitions up to 15. Recently, autoencoder neural networks have also been applied to the brain activity analysis field thanks to their capability of denoising the EEG signal. Autoencoders impacted the seizure detection field [23], realizing a useful feature extraction step when empowered with CNNs and LSTMs in the encoder–decoder setup. The authors in [24] exploited this concept, proposing an event-related potential encoder network (ERPENet) that incorporates four 2D CNNs and two 512-unit LSTMs in the autoencoder setup structure, trying to simultaneously compress the input EEG signal and extract P300-related features into a latent vector. This vector was sent to a single one-unit fully connected layer for classification. The model provided a single-trial accuracy of 83.54% (with an area under the curve of 69%) on the P300 speller problem included in the BCI competition III dataset, analyzing 35 channels from the central-parietal lobe.
In this framework, this paper proposes the design, implementation and test of a novel architecture for single-trial P300 detection with two main aims: (i) maximizing the ITR, while keeping high recognition accuracy for a low number of stimulus repetitions, and (ii) ensuring the full system implementability on a low-cost dedicated microcontroller-based platform. The first design focus allows the BCI, which exploits the here presented P300 detector, to increase its speed, ensuring a human–machine interaction that is as fluid as possible. It represents a critical constraint for those BCIs in which the neural interface speed can strongly impact the safety of the system, such as in P300-based car driving [3], or in which the BCI becomes an important enabling technology, such as in the case of P300-based word spellers for people with disabilities [10,11]. To ensure this condition, in this paper, a punctual inference timing analysis has been carried out. It permitted the identification and the correction of all the time-critical parts in the architecture. The second design focus, which represents a novelty in the NN architecture design workflow, concerns the architecture implementability analysis on a widely diffused family of low-cost microcontrollers.

This analysis provides a good starting point for the conceptualization and the realization of a dedicated low-cost and mobile P300-based BCI. In particular, by keeping the consumption of microcontroller resources far below their availability, the design of a dedicated P300-based BCI board is also considerably simplified, with positive consequences on board footprint (e.g., external memory and related circuitry no longer needed) and power consumption (e.g., no external read/write operations, only microcontroller static/dynamic consumption).

Keeping in mind the above, the here-proposed P300 detector bases its working on the combination of a novel pre-processing stage based on the EEG signals' symbolization and a first parallel autoencoding stage followed by a batch-normalized CNN for the temporal filtering. We refer to this architecture as an autoencoded CNN in the following. Specifically, the proposed system acquires data from six EEG electrodes on the central and parietal brain lobes. Next, it submits the EEG signals to a first low-complexity preprocessing stage including low pass filtering, baseline correction, winsorizing and symbolization. The symbolized EEG signals are then sent to an autoencoder model to emphasize those temporal features that can be meaningful for the following CNN stage. Autoencoded EEGs are then concatenated and used to feed a seven-layer network, which includes a 1D convolutional layer and three fully connected layers. Inputs, convolutional and fully connected layers are preceded by a batch normalization. The realized network is then validated on an STM32L4 microcontroller target, for complexity and implementation assessment. Since all of the analyzed state-of-the-art studies use the Farwell and Donchin P300-based speller as a testbench [8] to compare their system capabilities with each other, in this paper, two datasets are analyzed to assess the algorithm performance: (i) BCI competition III data from a P300 speller testbench and (ii) self-collected data from a prototype car driving experiment. Experimental results from the first dataset, i.e., the P300 speller one, showed that the system is able to outperform the state of the art in terms of the harmonic mean of precision and recall (F1-score = 51.78 ± 6.24%), ensuring a +5 bits/min average ITR gain if compared with other state-of-the-art solutions. The same method, applied to the prototype car driving experiment, provided an F1-score of 70.00% and an ITR > 30 bits/min, allowing a car direction change every 1.8 s. The implementation feasibility assessment on an STM32L4 microcontroller showed that the overall system occupies 5.57% of the total available ROM and ~3% of the available RAM.

The paper is organized as follows: Section 2 describes the P300 paradigms, detailing the considered datasets, the preprocessing stages and the model architecture. Sections 3 and 4 are dedicated to the experimental results and their discussion, respectively, while Section 5 concludes the paper, underlining the achievements and perspectives.

2. The Method

2.1. Datasets and Stimulation Protocols

Datasets. Two datasets were investigated and are described in the following. Both of them shared the same processing chain but differed from each other in the final application and the acquisition device. One dataset, namely Dataset 1 in the following, was related to data collected by the team of the DEISLab (Politecnico di Bari, Italy) on voluntary subjects in the context of prototype car driving [3]. The subject group was composed of four students (healthy subjects) recruited to participate in this BCI experiment. The second dataset, chosen for comparison purposes and named Dataset 2 in the following, was provided by the Wadsworth Research Center, NYS Department of Health, for the international BCI competition III [25]. This dataset collected EEG data from two subjects who participated in the competition.

More in detail, Dataset 1 contained trials acquired by a 32-channel wireless EEG headset by g.Tec (i.e., g.Nautilus Research), considering AFz as a reference channel and A2 (right ear lobe) as a ground lead. Specifically, the monitoring for this application was focused on 6 specific channels, i.e., Fz, Cz, Pz, PO7, Oz, PO8, resulting from a statistical offline recursive channel elimination routine all over the four subjects, as better explained in Section 2.3, and confirmed by the best electrode configuration for P300 speller applications found by the authors in [26]. EEG data were acquired with a resolution of 24 bits and transmitted at 250 Sa/s, reduced to match the 240 Sa/s from Dataset 2. The EEG signals were band passed in the interval 0.1–20 Hz before the transmission.

The second dataset [25] was composed of EEG trials collected by 64 channels using the BCI2000 system [25]. The monitoring was focused, also in this case, on 6 selected channels to match the architecture input size. Differently from Dataset 1, the analyzed EEG channels from the two involved subjects statistically answered in different ways to the recursive channel elimination routine (Section 2.3). For the first involved subject (namely subject A in the following), the six selected channels were FCz, C2, CP5, CPz, P6 and PO7, while for the second subject, similarly named subject B, the electrodes CPz, PO7, POz, PO4, PO1 and O1 were chosen. Signals from the selected channels were sampled at 240 Hz and bandpass filtered into 0.1–60 Hz.

Stimulation Protocols. For Dataset 1 data collection, the subjects underwent the 4-choice paradigm, already proposed in our previous works [3,23] and reported in Figure 1a. The protocol, designed according to the odd-ball paradigm, consisted of four visual stimuli, individually and randomly flashing on a display (25% occurrence probability) with an inter-stimulus interval (ISI) of 200 ms. Each stimulus persisted on the screen for 100 ms.
When the user focused their attention on a particular stimulus (so they are expressing a direction), the stimulation could be considered a classical binary discrimination problem. As reported in [3], the above-described protocol can be easily used to drive a prototype car. A snapshot of the experimental setup for the protocol is shown in Figure 1b. For the training set, a total of 1520 target trials (with elicited P300) and 4560 non-target trials (without P300 contributions) were considered per each involved subject. The test set was, instead, composed of 500 target trials and 1500 non-target trials per subject. The subjects involved in the BCI experiments are numbered from 1 to 4 in the following. The composition of the presented dataset is summarized in Table 1.

Figure 1. BCI stimulation protocol. (a) Snapshot of the 4 targets composing the prototype car driving odd-ball paradigm; (b) experimental setup for the prototype car test; (c) 6 × 6 character matrix for the P300 speller diagram from BCI competition III; (d) row and column indexes distribution.



Table 1. Training and test datasets composition.

Dataset | Sub. | Training Set ¹: P300 | Training Set ¹: non-P300 | Test Set ¹: P300 | Test Set ¹: non-P300
Dataset 1 | 1...4 | 1520 | 4560 | 500 | 1500
Dataset 2 | A, B | 2550 | 12,750 | 3000 | 15,000
¹ Data expressed in trials per subject.

Data composing Dataset 2 were collected considering—as stimulation paradigm—the 6 × 6 character matrix in Figure 1c. During the competition experiments, each of the 12 rows and columns flashed randomly with an ISI of 175 ms and a flashing duration of 100 ms. The subjects were asked to focus on a specific character. Each row and column were flashed 15 times, constituting a character epoch. The selected character prediction was typically generated at the end of each character epoch. For the training set, 85 characters per subject were considered, for a total (per subject) of 2550 target trials and 12,750 non-target trials. The test set was composed of 100 characters, for a total of 3000 target trials and 15,000 non-target ones. The subjects involved in the BCI experiments are identified by A and B in the following, according to the BCI competition nomenclature. Table 1 summarizes the training and test datasets composition.

2.2. The Architecture

The network architecture of the proposed single-trial P300 detector is schematically depicted in Figure 2. The BCI workflow, shown in Figure 2, can be summarized as follows. First, the EEG signals from the acquisition device (or from the offline test dataset in Dataset 2) underwent a preprocessing stage. This stage consisted of three main steps: (i) baseline correction, (ii) winsorizing and (iii) local binary patterning (LBP) symbolization. An optional additive bandpass filtering stage should be included before all the above-listed ones, if not already embedded in the acquisition device (e.g., the g.Nautilus Research signal conditioner includes this stage before the transmission).

Figure 2. Overall architecture overview.

The first optional stage required a bandpass filter with high-pass and low-pass cutoff frequencies of 0.1 Hz and 20 Hz, respectively. For the sake of repeatability, the filter utilized for Dataset 2 corresponded to the 8th-order Butterworth included in the g.Nautilus Research device.


The filtered EEGs were then submitted to a trial extraction stage in which a specific time window after the stimulus onset was selected. Next, the trials were treated by a local baseline correction routine, which consisted of aligning each trial around a "zeroed" initial point. The corrected EEG trials were analyzed by a winsorizing procedure that truncated signal outliers, giving them a predefined value. Finally, the winsorized signals went through a last preprocessing step: the LBP symbolization. This routine, widely known in the image-processing field, permits one to analytically transform experimental measurements such as 1D time series into a series of binary strings. As it will become clearer shortly, this step generated a signal per channel that ranged between 0 and 8 with single-unit discrete steps [27].

The symbolized EEG signals (sEEG in Figure 2) became the input for the proposed autoencoded CNN. The proposed neural network, expanded at the bottom of Figure 2, consists of a total of 13 layers, namely L0–L13 in the following. The NN embedded a first batch normalization (limited to the training activity adaptation); then, it implemented a first tensor slice on the six selected EEG channels (Lambda in Figure 2). Each sliced channel went through an autoencoder based on 3 fully connected layers, with rectified linear unit (ReLU) activation functions, which represented—in the order—the encoder, the code, and the decoder of the autoencoder architecture. The last layer consisted of a dense layer with the same size as the sliced channel in input, with a linear activation function (not explicitly reported in the model). The six outcomes were then concatenated in a tensor with size {timesteps, channels}. Next, the tensor fed a 1D convolutional and subsampling layer for temporal feature extraction. ReLU activation was also used in this case. The convolutional layer outcome was then flattened and sent to a deep network made up of 2 dense layers, preceded by a batch normalization step to avoid the distribution shift, which would lead to signal saturation, decelerating the learning [28]. During the training, both the dense layers composing the deep NN were supported by a dropout operation to reduce the overfitting phenomenon and favor the system generalization. The architecture ended with a single-unit dense layer and a sigmoid activation that returned a value between 0 and 1, representing the probability that the specific sample is related to a P300 trial or a non-P300 one. A threshold system ensured the binary classification.

2.3. Recursive Channels Elimination

Aiming to reduce the memory usage and to favor the overall system implementability, a first dimensionality reduction on the input size was needed. Considering—for instance—the g.Nautilus Research solution presented in Section 2.1 for the Dataset 1 data collection, an input matrix of 24 bits/sample × 32 channels × 250 samples/s should be analyzed every second (for a total of 24 kB). In a similar way, considering the BCI2000 system used for the Dataset 2 composition, an input matrix of 24 bits/sample × 64 channels × 240 samples/s, for a total of 46.08 kB, would become the input matrix for the overall system. Considering a parallel autoencoded CNN implementation such as the one proposed in Section 2.2, this would result in a strong increment of the network complexity and, thus, of the resources usage. For this reason, an offline recursive channel elimination step preceded the nominal working of the architecture. This processing step's role was to find a 6-channel combination (24 bits/sample × 7 channels × 250 samples/s for an input flow of 5.25 kB/s) able to ensure the best channel selection criterion value proposed by the authors in [13]. It is mathematically defined as:

Criterion = TP / (TP + FP + FN)      (1)

where TP is the number of P300 trials correctly classified, TN is the number of non-P300 trials correctly classified, FP is the number of non-P300 trials misclassified (classified as P300 trials) and, similarly, FN is the number of P300 trials misclassified. For the recursive channel elimination purpose, a provisional NN architecture model was instantiated.

This provisional NN was inspired by the BN3 one proposed in [10] but discarded the spatial filtering layer (i.e., the first one). This architecture was chosen because of its capability to return promising results by analyzing only raw EEG signals. All the combinations were tested on the provisional neural network considering the training set. Specifically, a validation split on 30% of the training dataset was considered in a holdout validation setting and the validation criterion value for each combination was extracted. Firstly, the routine started with all the 32 (Dataset 1) or 64 (Dataset 2) channels; then, each channel was eliminated, and the criterion score was computed on the remaining 31 (Dataset 1) or 63 (Dataset 2) channels. The procedure was repeated for all the involved channels. The channel whose elimination led to the highest criterion score was discarded. Next, the procedure was repeated until the best 10 channels per subject were found. A statistical analysis was then carried out to find the channels that are mostly shared among the involved subjects, aiming to favor the application generalization. Considering Dataset 1, the six channels that fell within the best ten for all the subjects were Fz, Cz, Pz, PO7, Oz and PO8. Differently, on the Dataset 2 training set, only the CPz and PO7 channels resulted in being shared between the two subjects. The remaining best channels for subject A resulted to be FCz, C2, CP5 and P6, while for subject B they were POz, PO4, PO1 and O1.
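To make the loop above concrete, the following is a minimal sketch of the backward recursive channel elimination, assuming a hypothetical helper train_and_score() that trains the provisional BN3-like network on a 70/30 holdout split and returns the validation value of the criterion in Equation (1); the helper and variable names are illustrative and are not taken from the authors' code.

```python
def criterion(tp, fp, fn):
    # Channel selection criterion of Equation (1): TP / (TP + FP + FN).
    return tp / (tp + fp + fn)

def recursive_channel_elimination(x, y, channel_names, n_keep=10, train_and_score=None):
    # x: trials with shape (n_trials, n_samples, n_channels); y: binary labels (1 = P300).
    # train_and_score(x_subset, y) -> criterion value on a 30% validation split (assumed helper).
    selected = list(channel_names)
    while len(selected) > n_keep:
        scores = []
        for ch in selected:
            remaining = [c for c in selected if c != ch]
            idx = [channel_names.index(c) for c in remaining]
            scores.append((train_and_score(x[:, :, idx], y), ch))
        _, worst = max(scores)      # the channel whose removal maximizes the criterion
        selected.remove(worst)      # is discarded at each pass
    return selected
```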

2.4. Data Preprocessing

Once the stimulation occurred (i.e., a row/column or direction flash), the acquisition device started streaming data to a dedicated buffer of parametrizable depth. This buffer depth defined the time window composing the trial. As introduced in Section 2.2, each trial underwent three main pre-processing steps: (i) baseline correction, (ii) winsorizing and (iii) LBP symbolization. This section discusses them in more detail, providing the experimentally chosen hyperparameters.

Baseline correction. For the proper window definition of the local baseline correction, firstly, the same provisional NN architecture model presented in Section 2.3 was instantiated. Three different time windows for the trials were then analyzed, changing the buffer size: (i) [0 s, 1 s]; (ii) [0 s, 0.8 s]; (iii) [0.1 s, 0.8 s]. For each of these time windows, two kinds of baseline extraction approaches were tried: (i) baseline calculated as the average of the first 50 ms of the time window or (ii) baseline calculated as the average of the first 100 ms of the time window. In an analogous way to the procedure described in Section 2.3, all the combinations were tested on the provisional neural network considering the training set. Specifically, a validation split on 30% of the training dataset was considered in a holdout validation setting. The chosen assessment criterion was the harmonic mean of precision and recall, the F1-score, which is defined as:

F1 = 2TP / (2TP + FP + FN)      (2)

A grid search routine among the analyzed combinations returned that the best window and, thus, the selected one was the range [0.1 s, 0.8 s], for a total of 168 samples, while the chosen baseline correction consisted of extracting the average in the interval [0.1 s, 0.15 s], for a total of 12 samples. The average value was then subtracted from all the trial values.

Winsorizing. Subject movements, eye movement, eye blinking, and muscle activity can cause large amplitude outliers in the EEG. To reduce the effect of these outliers, the data from each channel should be winsorized according to the preprocessing guidelines proposed for a P300-based BCI by the authors in [29]. As the first step, for each channel, the 1st and the 99th percentiles were computed all over the training set. Next, all the amplitudes lower than the 1st percentile were replaced with the 1st percentile value. Similarly, all the amplitude values higher than the 99th percentile were assigned the 99th percentile fixed value. The presence of winsorized samples, in terms of trial length percentage, can represent a useful indicator of the most informative channels.

For instance, it was observed that several trials (~100 P300 trials, ~200 non-P300 trials) in Dataset 2 were affected on the Fz channel by more than 40% of winsorized values. Optical inspection confirmed the presence of physiological and non-physiological artifacts on this channel [30].

LBP symbolization. The LBP operator was introduced in [31] as a powerful tool for texture description and is defined as a gray-scale invariant texture measure. In that application, for each pixel in a 2D image, a binary code is produced by considering a central pixel as a threshold and analyzing 8 neighbor pixels. This approach also found large application in treating experiments based on 1D time series such as EEG analysis, especially in seizure detection [27]. In this paper, we propose the application of a custom LBP symbolization routine aiming to improve the network performance. The 1D-LBP method is described step-by-step by considering a segment of 40 samples from a 168-sample EEG trial, as per Figure 3a. The LBP routine starts by identifying all the onset points for the binary code assessment, namely P1, . . ., P54 in Figure 3a. These P samples have a range that goes from 1 to 160 (8 samples are left on the right of P54 to avoid index overflow) with step 3. Considering a comparison window of n = 8 samples, as shown in Figure 3a, the amplitude value related to the jth sample (Pj) is set as a local threshold and the following 8 samples are compared with it. The resulting LBP code is given by:

sEEG(k) = ∑_{i = Pj+1}^{Pj+n} [EEG(i) − EEG(Pj) > 0],  with Pj = 1:3:160      (3)

where sEEG(k) is the kth sample of the sEEG signal in Figure 3b, and [EEG(i) − EEG(Pj) > 0] is a logic condition returning 1 if EEG(i) is higher than the EEG value at sample Pj, zero otherwise. The resulting sEEG signals consisted of 54 samples ranging between 0 and 8.
The preprocessed data were then ready to be analyzed by the proposed autoencoded CNN.
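As an illustration, the trial preprocessing chain described above (optional 0.1–20 Hz band-pass, local baseline correction on the [0.1 s, 0.15 s] segment, winsorizing against training-set percentiles, and the 1D-LBP of Equation (3)) can be sketched in NumPy/SciPy as follows. This is a hedged reconstruction of the described steps, not the authors' code: function names are illustrative and the filter design only approximates the device's built-in 8th-order Butterworth.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 240                                  # Sa/s, matching the Dataset 2 rate

def bandpass(eeg, low=0.1, high=20.0, order=8, fs=FS):
    # Optional stage, applied only if not already embedded in the acquisition device.
    sos = butter(order, [low, high], btype="bandpass", output="sos", fs=fs)
    return sosfiltfilt(sos, eeg, axis=0)

def baseline_correct(trial):
    # trial: (168, n_channels) samples in the [0.1 s, 0.8 s] post-stimulus window.
    # Subtract the mean of the first 12 samples (i.e., the [0.1 s, 0.15 s] segment).
    return trial - trial[:12].mean(axis=0, keepdims=True)

def winsorize(trial, p1, p99):
    # p1, p99: per-channel 1st/99th percentiles computed on the training set.
    return np.clip(trial, p1, p99)

def lbp_symbolize(trial, n=8, step=3):
    # 1D-LBP of Equation (3): for each onset sample P_j, count how many of the
    # following n samples exceed it, giving 54 values in [0, 8] per channel.
    onsets = np.arange(0, 160, step)                      # 54 onset points
    sEEG = np.stack([(trial[p + 1:p + 1 + n] > trial[p]).sum(axis=0)
                     for p in onsets])
    return sEEG                                           # shape: (54, n_channels)
```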

Figure 3. Implementation of 1D-LBP code. (a) 40-sample EEG data subset with LBP code extraction; (b) sEEG from LBP routine application on a 168-sample trial.

2.5. Autoencoded-CNN: Network Topology

The here proposed network consists of 6 parallel 5-layer autoencoder structures followed by 6 sequential layers.

Specifically, the first layer L0 consists of a batch normalization layer. This layer aims to normalize the input data with respect to the mean and the standard deviation of the training batch. In the very first stage, it is necessary to feed the network with the proper data format. The resulting tensor, with size (54,6), is used to feed 6 parallel autoencoders, one per channel. The autoencoder network is the same for all 6 parallel branches and is reported in Figure 4.


Figure 4. 5-layer autoencoder implementation.

Autoencoders are neural networks with the aim of generating new data by first compressing the input into a space of latent variables and, subsequently, reconstructing the output based on the acquired information. This type of network consists of two parts: the encoder and the decoder. The code is placed in the middle of the coding–decoding procedure. The encoder is the part of the network that compresses the input into a space of latent variables; it can be represented by the encoding function h = f(x). The decoder is the part that reconstructs the input based on the information previously collected; it is represented by the decoding function r = g(h). The use of an autoencoder in this P300 detector architecture lies in the fact that, by training the autoencoder to copy the input, the space of latent variables h (the code) can capture useful characteristics. This can be achieved by imposing limits on the coding action, forcing the space h to dimensions smaller than those of x. In this case, the autoencoder is known as undercomplete. By training the undercomplete space, we bring the autoencoder to grasp the most relevant characteristics of the training data [32].

For this aim, L1 of the autoencoder in Figure 4 consists of a slicing procedure, aiming to singularly separate the 6 channels. The resulting tensor shape is (54,) according to the TensorFlow and Keras shape nomenclature. The encoder part is realized utilizing a fully connected layer L2 with 16 units activated by the ReLU function. The ReLU activation function has been largely used in this architecture because the efficient computation ReLU(x) = max(x, 0) permits a faster convergence during training [33]. Moreover, used in deep networks, ReLU does not saturate, avoiding the vanishing gradients problem [34]. L3 serves as the code layer and is composed of an 8-unit dense layer with ReLU activation. It allows one to compress the initial tensor, extracting only meaningful features. L4 acts as a decoder and is a mirrored version of L2. The output layer, L5, is a fully connected layer with the same size as the input tensor. It does not have an activation function, to properly reconstruct the input in a more useful way. The autoencoder structure has been compiled separately from the remaining architecture. Specifically, the model fit was based on a mean squared error (MSE) loss function, a root mean square propagation (RMSprop) optimizer, and the mean absolute error (MAE) was monitored as a metric for the weights update. A minibatch size of 32 samples was selected for the application. The fit resulted in 2144 parameters, properly frozen for the implementation in the overall architecture. Table 2 summarizes the autoencoder characteristics, focusing on the distribution of the parameters.

Table 2. Autoencoder layers’ characteristics.

Layer | Input | Type | Output | # Parameters
L1 | (54,6) | tf.Slice | (54,) | 0
L2 | (54,) | Dense(16)+ReLU | (16,) | 912
L3 | (16,) | Dense(8)+ReLU | (8,) | 136
L4 | (8,) | Dense(16)+ReLU | (16,) | 144
L5 | (16,) | Dense(54) | (54,1) | 952
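As a reference, a minimal Keras sketch of the per-channel 5-layer autoencoder of Table 2 is shown below, with the separate compilation described above (MSE loss, RMSprop optimizer, MAE monitoring, minibatch of 32). Layer sizes follow Table 2, while the function name, the epoch count and the training-data variable (x_train_ch, the 54-sample sEEG trials of one channel) are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_channel_autoencoder(n_samples=54):
    inp = keras.Input(shape=(n_samples,))               # one sliced sEEG channel (L1)
    enc = layers.Dense(16, activation="relu")(inp)      # L2: encoder
    code = layers.Dense(8, activation="relu")(enc)      # L3: undercomplete code
    dec = layers.Dense(16, activation="relu")(code)     # L4: decoder
    out = layers.Dense(n_samples)(dec)                  # L5: linear reconstruction
    ae = keras.Model(inp, out)
    ae.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
    return ae

# One autoencoder per selected channel, trained to copy its own input and then
# frozen before being embedded in the overall architecture (epoch count illustrative).
# ae = build_channel_autoencoder()
# ae.fit(x_train_ch, x_train_ch, batch_size=32, epochs=100, validation_split=0.3)
# ae.trainable = False
```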

The tensors resulting from the parallel autoencoding were then concatenated along the 3rd dimension, recovering the initial size (54,6). The resulting tensor is then sent to the 8-layer CNN presented in Figure 5. Since the batch normalization step can reduce the covariate shift in neural network training, another batch normalization was added at the beginning of the CNN, following the guidelines in [35]. It constitutes L7.

Figure 5. 8-layer sequential NN implementation.

The normalized tensor was sent to a 1D convolutional layer, L8, for temporal feature extraction and subsampling. The convolutional kernels had size 16 × 1 with ReLU activation and a stride of 8 samples. For this reason, the time series was subsampled, reaching a length of only 6 samples. The resulting tensor of shape (6,16) was flattened by L9, realizing a (96,)-size feature vector. The subsampling rate of 8 led to a temporal filter length of about 100 ms. Notably, the temporal convolution over L8 was not overlapped, to save computational effort and avoid redundant features for the following layers. The normalized vector from L10 went through 2 sequential dense layers, L11 and L12, with the same characteristics: 64 units and ReLU activation. Both layers included a dropout stage with a dropout rate of 0.4, chosen from a trial-and-error application. The dropout step was necessary to reduce coadaptations between neurons [36]. It forced the neurons to learn more robust features, limiting, in this way, the overfitting phenomenon. Only during the forward propagation of training, with a dropout rate of 0.4 as in our application, every neuron of the layer has a 0.4 probability of being set to 0 (i.e., a 0.6 probability of being kept). During the test, every neuron is preserved. The outer layer, L13, consists of a single-unit dense layer (binary classification) with a sigmoid activation function. This activation function returns a number in the range [0, 1] that represents the probability of a trial being a P300 trial. Setting a threshold at 0.5 makes it possible to discriminate whether a P300 is present or not. Table 3 summarizes the network characteristics, focusing on the distribution of the parameters.

Table 3. Autoencoded CNN Layers’ characteristics.

Layer | Input | Type | Output | # Parameters
L6 | (54,6) | Concatenate | (54,6) | 0
L7 | (54,6) | BatchNormalization | (54,6) | 24
L8 | (54,6) | Conv1D+ReLU | (6,16) | 784
L9 | (6,16) | Flatten | (96,) | 0
L10 | (96,) | BatchNormalization | (96,) | 448
L11 | (96,) | Dense(64)+ReLU | (64,) | 7232
Dout1 | (64,) | Dropout (0.4) | (64,) | 0
L12 | (64,) | Dense(64)+ReLU | (64,) | 4160
Dout2 | (64,) | Dropout (0.4) | (64,) | 0
L13 | (64,) | Dense(1)+Sigmoid | (1,) | 65

The compiling settings for the overall neural network training included a binary cross-entropy loss function and an Adam gradient descent optimizer. Specifically, the Adam optimizer learning rate was set to 10^−4 with no decay, while its parameters β1, β2, and ε were set, respectively, to 0.9, 0.999, and 10^−8, as per [37]. The batch size for the stochastic gradient descent was set to 64. The proposed model has a total of 14,857 parameters, of which only 12,477 have been trained, since 2380 came from the frozen autoencoder section. The whole neural network described above was realized in Python by using the Keras library. Next, the model was saved and imported into the STM32CubeMX environment for code migration (Python → C), quantization cross-check, and on-board validation.
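Putting the pieces together, the following hedged Keras sketch reconstructs the overall autoencoded CNN from the description above and Table 3 (input batch normalization, per-channel Lambda slices feeding the six frozen autoencoders, concatenation, the non-overlapping 1D convolution, and the batch-normalized dense head), compiled with binary cross-entropy and the Adam settings reported in the text. It is an illustrative reconstruction, not the released implementation; exact tensor sizes and parameter counts may differ from those reported by the authors.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoded_cnn(frozen_autoencoders, n_samples=54, n_channels=6):
    inp = keras.Input(shape=(n_samples, n_channels))                 # sEEG trials
    x = layers.BatchNormalization()(inp)                             # L0
    branches = []
    for ch, ae in enumerate(frozen_autoencoders):
        sliced = layers.Lambda(lambda t, c=ch: t[:, :, c])(x)        # L1: channel slice
        branches.append(layers.Reshape((n_samples, 1))(ae(sliced)))  # L2-L5 (frozen)
    x = layers.Concatenate()(branches)                               # L6
    x = layers.BatchNormalization()(x)                               # L7
    x = layers.Conv1D(16, kernel_size=16, strides=8,
                      activation="relu")(x)                          # L8: non-overlapping temporal conv
    x = layers.Flatten()(x)                                          # L9
    x = layers.BatchNormalization()(x)                               # L10
    x = layers.Dropout(0.4)(layers.Dense(64, activation="relu")(x))  # L11 + Dout1
    x = layers.Dropout(0.4)(layers.Dense(64, activation="relu")(x))  # L12 + Dout2
    out = layers.Dense(1, activation="sigmoid")(x)                   # L13: P300 probability
    model = keras.Model(inp, out)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9,
                                                  beta_2=0.999, epsilon=1e-8),
                  loss="binary_crossentropy", metrics=["binary_accuracy"])
    return model

# model = build_autoencoded_cnn(six_frozen_autoencoders)
# model.fit(x_train, y_train, batch_size=64, validation_split=0.3)
```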

2.6. Output Management

The outputs provided by the classifier were managed differently for each analyzed dataset. Considering Dataset 1, each selection of the direction to be taken consisted of 2 scintillations (flashes) per direction. Ideally, every 8 flashes (2 per target), the decision was taken as the target that returned the maximum occurrence. Eight flashes corresponded to a decision every 1.4 s considering an ISI of 175 ms.

In a similar way, the flashing of the 6 rows and 6 columns in the speller experiment was repeated 15 times per character epoch [8]. Initially, this procedure was chosen because the single-trial P300 detection accuracy over 36 characters was very low. The probability values corresponding to the multiple row/column flashings could then be accumulated over the repetitions. Thus, the target character can be defined as the character that corresponds to the row and the column with the highest occurrence value.
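A short sketch of the accumulation logic described above: for Dataset 1, the direction whose accumulated probability over the 8 flashes is maximal; for the speller, the character at the intersection of the best-scoring row and column. The stimulus-index convention (rows 0–5, columns 6–11) and the variable names are assumptions for illustration only.

```python
import numpy as np

def pick_direction(probs, stim_ids, n_targets=4):
    # probs: sigmoid outputs for the 8 flashes of one decision (2 per direction);
    # stim_ids: which of the 4 directions was flashed in each trial.
    scores = np.zeros(n_targets)
    for p, s in zip(probs, stim_ids):
        scores[s] += p
    return int(np.argmax(scores))

def pick_character(probs, stim_ids, char_matrix):
    # probs/stim_ids: all flashes of a character epoch; indexes 0-5 assumed rows,
    # 6-11 assumed columns (Figure 1d). Probabilities are accumulated over repetitions.
    scores = np.zeros(12)
    for p, s in zip(probs, stim_ids):
        scores[s] += p
    row, col = int(np.argmax(scores[:6])), int(np.argmax(scores[6:]))
    return char_matrix[row][col]
```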

3. Results

Section 3.1 is dedicated to the system performance extraction with respect to both datasets. For this purpose, four metrics are monitored: precision, recall, binary accuracy and the F1-score defined in Equation (2). Section 3.2 analyzes the BCI performance in terms of character recognition accuracy, providing comparison graphs with respect to the state of the art. In Section 3.3, the ITR is analyzed for both the proposed applications, in order to evaluate the system speed. Section 3.4 outlines the LBP routine impact on the P300 detection accuracy metrics by considering several neural network architectures. Finally, Section 3.5 is dedicated to the architecture implementation on a dedicated microcontroller.

3.1. BCI Performance: Single-Trial Classification Metrics

The performance of the single-trial P300 detection is analyzed in the following by means of four metrics: precision, recall, binary accuracy and the F1-score as defined in Equation (2). The remaining metrics are defined as:

Precision (%) = 100 × TP / (TP + FP)      (4)

Recall (%) = 100 × TP / (TP + FN)      (5)

Acc. (%) = 100 × (TP + TN) / (TP + TN + FP + FN)      (6)

where TP is the number of P300 trials correctly classified, TN is the number of non-P300 trials correctly classified, FP is the number of non-P300 trials misclassified (classified as P300 trials) and, similarly, FN is the number of P300 trials misclassified. Among the above-mentioned metrics, the F1-score is well correlated with the direction or character recognition accuracy and is, thus, generally considered the most reliable metric for imbalanced datasets such as P300-based ones. The other metrics suffer from the imbalanced nature of the dataset [10]. To address the problem of imbalanced datasets, in this work, a mixed undersampling/oversampling approach was adopted. Data from both datasets underwent a first stage of dataset undersampling based on the NearMiss version 2 method from the imblearn Python library and, secondarily, upsampling by the RandomOverSampler method from the same library [38].

For the sake of a coherent comparison, the same metrics were extracted from other state-of-the-art solutions which analyzed the same dataset: SVM-1 [13], CNN-1 [12], BN3 [10], ConvLSTM [11], their ensemble provided by [11], PCA-NN [22], and the ERPENet [24]. Table 4 reports the comparison results over the BCI competition dataset (Dataset 2). Only CNN-1 [12], BN3 [10], ConvLSTM [11] and BN3+ConvLSTM [11] provided a complete overview of these metrics in their works. ERPENet by [24] reports a general average plus standard deviation value over both the subjects involved in the Dataset 2 composition (i.e., 83.54 ± 1.53). However, no explicit reference is reported with respect to the accuracy extraction modalities (e.g., validation- or testing-related accuracy). PCA-CNN [22] and SVM-1 [13] do not provide the investigated metrics. Moreover, Table 4 reports two versions of the here-proposed autoencoded CNN. A first version, named autoencoded CNN in Table 4, is the here-presented method in all of its steps, while the second version, i.e., autoencoded CNN no LBP, consists of the same model without the LBP preprocessing step but with an average-based downsampling to match the LBP code length, leaving the number of the model parameters unaltered.
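The under/oversampling step mentioned above can be sketched with the imblearn library as follows; flattening the trials to 2D for the resamplers and the default sampling ratios are assumptions made here for illustration, since the exact settings are not reported.

```python
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import RandomOverSampler

def rebalance(x, y):
    # x: sEEG trials with shape (n_trials, 54, 6); y: binary P300 labels.
    n, t, c = x.shape
    x2d = x.reshape(n, t * c)                        # imblearn expects 2D feature matrices
    x2d, y = NearMiss(version=2).fit_resample(x2d, y)
    x2d, y = RandomOverSampler(random_state=0).fit_resample(x2d, y)
    return x2d.reshape(-1, t, c), y
```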

Table 4. Performance of different models on test dataset from Dataset 2.

Subject | Model | TP | FP | TN | FN | Prec. (%) | Recall (%) | Acc. (%) | F1-Score (%)
A | Autoencoded CNN (our work) | 2017 | 3499 | 11,501 | 983 | 36.57 | 67.23 | 75.10 | 47.37
A | Autoencoded CNN no LBP (our work) | 1951 | 3419 | 11,581 | 1049 | 36.33 | 65.03 | 75.18 | 46.62
A | BN3 | 1910 | 3771 | 11,229 | 1090 | 33.62 | 63.67 | 72.99 | 44.00
A | ConvLSTM | 1928 | 3418 | 11,582 | 1072 | 36.06 | 64.26 | 75.05 | 46.20
A | CNN-1 | 2021 | 4355 | 10,645 | 979 | 31.70 | 67.37 | 70.37 | 43.11
A | BN3+ConvLSTM | 1919 | 3299 | 11,701 | 1081 | 36.78 | 63.97 | 75.67 | 46.70
A | ERPENet | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | 83.54 * | N.A.
A | PCA-CNN | Data not available (N.A.)
A | SVM-1 | Data N.A.
B | Autoencoded CNN (our work) | 1996 | 2108 | 12,892 | 1004 | 48.64 | 66.53 | 82.71 | 56.19
B | Autoencoded CNN no LBP (our work) | 1988 | 2278 | 12,722 | 1012 | 46.60 | 66.27 | 81.72 | 54.72
B | BN3 | 2009 | 2671 | 12,329 | 991 | 42.93 | 66.97 | 79.66 | 52.32
B | ConvLSTM | 2099 | 2670 | 12,330 | 901 | 44.01 | 69.97 | 80.16 | 54.04
B | CNN-1 | 2035 | 2961 | 12,039 | 965 | 40.73 | 67.83 | 78.19 | 50.90
B | BN3+ConvLSTM | 1956 | 2136 | 12,864 | 1044 | 47.80 | 65.20 | 82.33 | 55.16
B | ERPENet | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | 83.54 * | N.A.
B | PCA-CNN | Data N.A.
B | SVM-1 | Data N.A.
* The work reports only the average value of accuracy considering both subjects.
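As a quick consistency check of Equations (4)–(6) and of the F1-score, the first row of Table 4 can be reproduced from its confusion-matrix counts. The snippet below is a worked example only, not the authors' evaluation code.

```python
# Recompute the metrics for the autoencoded CNN, subject A (first row of Table 4).
def metrics(tp, fp, tn, fn):
    precision = 100 * tp / (tp + fp)
    recall = 100 * tp / (tp + fn)
    accuracy = 100 * (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, accuracy, f1

print([round(m, 2) for m in metrics(2017, 3499, 11501, 983)])
# -> [36.57, 67.23, 75.1, 47.37], matching the first row of Table 4
```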

The system performance has also been validated on the prototype car driving dataset, where the decision is among four possible directions. Table 5 summarizes the results obtained on the single-trial P300 detection over the four analyzed subjects.

Table 5. Performance on test dataset from Dataset 1.

Subject | TP | FP | TN | FN | Prec. | Recall | Acc. | F1-Score
1 | 364 | 269 | 1231 | 136 | 0.5750 | 0.7280 | 0.7975 | 0.6425
2 | 351 | 176 | 1324 | 149 | 0.6660 | 0.7020 | 0.8375 | 0.6835
3 | 349 | 99 | 1401 | 151 | 0.7790 | 0.6980 | 0.8750 | 0.7363
4 | 364 | 135 | 1365 | 129 | 0.7332 | 0.7420 | 0.8680 | 0.7376

To provide a complete overview of the system workflow, from the acquisition up to the mechatronic actuation, a snapshot from the in vivo proof of concept validation on the prototype car model designed in our previous work [3] is shown in Figure 6.

Figure 6. Snapshot of an in vivo proof of concept validation on the acrylic prototype car system designed in [3].
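As a side note, the aggregate values quoted later in the Discussion (e.g., a precision of 68.83 ± 8.86%) correspond to the mean and sample standard deviation of the four per-subject rows of Table 5. A minimal NumPy sketch of this aggregation is shown below; it is not the authors' code, and small last-digit differences can appear because the table entries are already rounded.

```python
# Aggregate the per-subject metrics of Table 5 into mean +/- sample standard
# deviation (ddof = 1). Values are copied from Table 5 (illustrative sketch).
import numpy as np

table5 = {
    "precision": [0.5750, 0.6660, 0.7790, 0.7332],
    "recall":    [0.7280, 0.7020, 0.6980, 0.7420],
    "accuracy":  [0.7975, 0.8375, 0.8750, 0.8680],
    "f1-score":  [0.6425, 0.6835, 0.7363, 0.7376],
}
for name, values in table5.items():
    v = 100 * np.array(values)
    print(f"{name}: {v.mean():.2f} +/- {v.std(ddof=1):.2f} %")
```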

3.2. BCI Performance: Character Recognition Accuracy
Figure 7 shows the accuracy rate in recognizing the 100 test characters provided with Dataset 2 versus the number of considered repetitions. Figure 7 excludes the ConvLSTM and the ensemble model from the comparison, because no results are provided in the related articles [11]. It must be specified that no trial accumulation nor averaging has been carried out concerning the here-proposed method. The values reported in Figure 8 are fully extracted from single-trial occurrences over all the repetitions, according to the procedure reported in Section 2.6.

Figure 7. Accuracy rate on subjects A and B for Dataset 2.


Figure 8. ITR graph for four algorithms on Dataset 2.

3.3. BCI Performance: Information Translate Rate
Another metric widely used to characterize a BCI system in terms of speed is the information translate rate, or ITR. This parameter provides an idea of the number of bits/minute that the BCI architecture can provide, taking into account the classification accuracy. As per [19,39], the ITR can be defined as:

ITR = 60 ∗ (P ∗ log2(P) + (1 − P) ∗ log2((1 − P)/(N − 1)) + log2(N))/T    (7)

where N represents the number of recognizable classes (N = 36 in the speller problem, N = 4 in the prototype car application), P represents the accuracy rate in character/direction recognition and T is the time needed for character/direction recognition.
Concerning the T parameter, it assumes different values for the different recognition problems. Considering the recognition of the direction for the prototype car driving application, some considerations should be made. Each direction scintillation lasts about 100 ms, followed by a blank status of about 75 ms, for a total of 4 directions × 175 ms × nrep = 700 ms × nrep, with nrep the number of repetitions needed to provide the direction choice. A pause of 500 ms follows the selection in order to ensure driving fluidity. Considering the above-mentioned timing constraints, the T value (in seconds) for this specific application can be extracted by:

T = 0.5 + 0.7 ∗ nrep with nrep = {1, 2}    (8)

In a similar way, each row/column scintillation lasts about 100 ms interspersed by a 75 ms pause, for a total of 12 rows and columns × 175 ms × nrep = 2100 ms × nrep, with nrep the number of repetitions needed to provide the character choice. A pause of 2.5 s follows the selection. Considering the above-mentioned timing constraints, the T value (in seconds) for this specific application can be extracted by:

T = 2.5 + 2.1 ∗ nrep with nrep = {1 . . . 15}    (9)

Figure 8 shows the ITR achieved by the proposed algorithm and the above-analyzed state-of-the-art solutions, considering the accuracy rate from Dataset 2. Considering Dataset 1, the ITR achieved by the autoencoded CNN on a single or on a double repetition to decide the prototype car direction is 32.44 bit/min and 28.35 bit/min, respectively.
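Equation (7), together with the selection times of Equations (8) and (9), can be evaluated with a short script. The sketch below is illustrative only: the accuracy values passed to the function are illustrative placeholders, not figures reported in the paper.

```python
# Sketch of Equation (7) with the selection times of Equations (8) and (9).
from math import log2

def itr_bits_per_min(P, N, T):
    """P = recognition accuracy, N = number of classes, T = seconds per selection."""
    if P >= 1.0:
        bits = log2(N)
    else:
        bits = log2(N) + P * log2(P) + (1 - P) * log2((1 - P) / (N - 1))
    return 60.0 * bits / T

# Prototype car driving (N = 4): T = 0.5 + 0.7 * n_rep, n_rep in {1, 2}
print(itr_bits_per_min(P=0.70, N=4, T=0.5 + 0.7 * 1))   # illustrative, ~32 bit/min
# P300 speller (N = 36): T = 2.5 + 2.1 * n_rep, n_rep in 1..15
print(itr_bits_per_min(P=0.42, N=36, T=2.5 + 2.1 * 1))  # illustrative single-repetition case
```

For example, nrep = 1 gives T = 1.2 s for the driving task and T = 4.6 s for the speller, which are the selection times underlying the single-repetition ITR values discussed in this section.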
3.4. Local Binary Patterning Impact on BCI Performance
To analyze the LBP impact on other neural network architectures used in the P300 recognition context, the CNN-1 [12], BN3 [10], ConvLSTM [11] and BN3+ConvLSTM [11] structures have been reproposed in Python by using the Keras library with Tensorflow backend [40]. The implemented architectures have been realized according to the details provided in the related works. Specifically, the BN3 architecture has been implemented in two forms: (i) the original one, namely BN3, and (ii) a BN3 version without the first spatial filtering stage, named BN3(ns) in the following.


The above-mentioned architectures have been tested on the six-channel EEG setup defined by the recursive channel elimination procedure provided in Section 2.3. Results concerning the LBP impact on the BCI performance are shown in Table 6.

Table 6. LBP impact on different state-of-the-art neural network architectures.

Subject | Model | TP | FP | TN | FN | Prec. (%) | Recall (%) | Acc. (%) | F1-Score (%)
A | Autoencoded CNN (our work) | 2017 | 3499 | 11,501 | 983 | 36.57 | 67.23 | 75.10 | 47.37
A | BN3 | 1096 | 3013 | 11,987 | 1904 | 26.67 | 36.53 | 72.68 | 30.83
A | BN3(ns) | 1611 | 2462 | 12,538 | 1389 | 39.55 | 53.70 | 78.61 | 45.55
A | ConvLSTM | 956 | 1998 | 13,002 | 2044 | 32.36 | 31.87 | 77.54 | 32.11
A | CNN-1 | 1312 | 2144 | 12,856 | 1688 | 37.96 | 43.73 | 78.71 | 40.64
A | BN3+ConvLSTM | 1496 | 2464 | 12,536 | 1504 | 37.78 | 49.87 | 77.96 | 42.99
A | BN3(ns)+ConvLSTM | 1515 | 2406 | 12,594 | 1485 | 38.64 | 50.50 | 78.38 | 43.78
B | Autoencoded CNN (our work) | 1996 | 2108 | 12,892 | 1004 | 48.64 | 66.53 | 82.71 | 56.19
B | BN3 | 1401 | 1975 | 13,025 | 1599 | 41.50 | 46.70 | 80.14 | 43.95
B | BN3(ns) | 1812 | 2139 | 12,861 | 1188 | 45.86 | 60.40 | 81.52 | 52.14
B | ConvLSTM | 1035 | 1105 | 13,895 | 1965 | 48.36 | 34.50 | 82.94 | 40.27
B | CNN-1 | 1735 | 2149 | 12,851 | 1265 | 44.67 | 57.83 | 81.03 | 50.41
B | BN3+ConvLSTM | 1098 | 1989 | 13,011 | 1902 | 35.57 | 36.60 | 78.38 | 36.08
B | BN3(ns)+ConvLSTM | 1523 | 2145 | 12,855 | 1477 | 41.52 | 50.77 | 79.88 | 45.68

3.5. Microcontroller Implementation
One of the main focuses of the present paper concerns the full compatibility of the NN with a microcontroller implementation context. For this aim, a first important comparison that should be carried out before starting the implementation analysis concerns the number of total parameters composing each model analyzed up to now, as well as the amount of input data required for the model to work properly. Figure 9 summarizes these parameters via a histogram view, comparing only those above-introduced methods that provided these metrics in their works [10–12].

Figure 9. Histogram plot of the number of NN model parameters and input data size (bytes) for the analyzed methods.

For the complexity analysis, the proposed model was realized in Python by using the Keras library with Tensorflow backend [40]. Next, the different NNs composing the final autoencoded-CNN architecture were extracted as separate .h5 models for the C-code generation and combination. The target hardware chosen for validation purposes is the Nucleo board L476RG.
For this aim, the Nucleo board L476RG was instantiated by using the STM32CubeMX peripheral initializer [41]. The STM32CubeMX extension, i.e., X-CUBE-AI, was used to integrate the NN model for a real-time inference assessment. This package provides an automatic and advanced NN mapping tool to generate and deploy an optimized and robust C-model implementation of a pretrained NN for embedded systems with limited and constrained hardware resources. This package was chosen because of the possibility of obtaining post-training quantization support for the Keras models, which allows the user to check the error related to the quantization operated to translate the Keras model into a C one. A simple validation mechanism was implemented to check the accuracy of the C-generated method against the uploaded pretrained model from Keras. The validation engine procedure is schematized in Figure 10. It consists of sending a preprocessed dataset (or a random number generator with mean and standard deviation from the input dataset) in two dedicated branches: (i) the original model framework and (ii) a serial bridge toward the Nucleo board that runs the translated C-model. Next, the labels from the model execution block are used as a reference to be compared with the prediction from the COM bridge. The comparison is then used to provide several complexity metrics.
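For reference, the export step mentioned above reduces to saving each Keras sub-model in HDF5 (.h5) format, which is the format imported by the X-CUBE-AI code generator. The snippet below is a hedged sketch: the layer sizes, input length and file name are placeholders and do not reproduce the paper's exact architecture; the CNN part would be exported in the same way.

```python
# Placeholder model only (NOT a faithful copy of the paper's architecture);
# the point is the per-submodel .h5 export consumed by X-CUBE-AI.
from tensorflow import keras

autoencoder = keras.Sequential([
    keras.Input(shape=(56,)),                   # symbolized-EEG input (placeholder length)
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(56),
])
autoencoder.save("autoencoder.h5")              # .h5 file imported by the X-CUBE-AI tool
```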
The metrics analyzed in this section are: RAM, which indicates the size (in bytes) of the expected R/W memory used to store the intermediate inference computing values; ROM/Flash, which indicates the size of the generated read-only memory used to store the weight and bias parameters; and, finally, the complexity metric provided by the Multiply and ACCumulate (MACC) operations, which universally defines the functional complexity of the imported NN.

Figure 10. Validation on target flow by X-CUBE-AI.

As a first step, we evaluate the implementation metrics for the first autoencoder stage. Overall, it results in a complexity of 2184 MACC, a flash occupation of 8.38 kB over the 1024 kB available and a RAM occupation of 544 B, comprising a single branch input and a single branch output vector (3.264 kB in total). RAM also comprises 96 B from activation. The MACC parameter provides an estimation of the autoencoder operative timing. Since, with an STM32 Arm® Cortex®-M4, each MACC corresponds to about nine clock cycles and considering the system clock set to 80 MHz, the time needed to encode and decode the symbolization code is ~246 µs. Validated on target, this timing was 307 µs. Details about the flash utilization, MACC usage and validated timing per layer are provided in Table 7, while the RAM occupation is shown layer by layer in Figure 11.



Finally, the quantization comparison returns an error on the prediction matching of 4.92 × 10−7, largely below the threshold of 0.01 that ensures a good model match regardless of the quantization.

Table 7. Autoencoder layers' ROM utilization and MACC.

Layer | Type | Flash (B) | MACC | Timing (ms)
L2,1 | Dense(16) | 3648 | 912 | 0.113
L2,2 | ReLU | - | 16 | 0.007
L3,1 | Dense(8) | 544 | 136 | 0.024
L3,2 | ReLU | - | 8 | 0.005
L4,1 | Dense(16) | 576 | 144 | 0.028
L4,2 | ReLU | - | 16 | 0.007
L5 | Dense(54) | 3808 | 952 | 0.124

Figure 11. RAM and I/O memory usage of the autoencoder stage.
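The timing estimates reported in this subsection follow from the stated rule of thumb of about nine clock cycles per MACC at an 80 MHz system clock. The short sketch below reproduces this back-of-the-envelope calculation; the 17,995 and 20,179 MACC figures are those reported for the sequential CNN and for the full pipeline later in the paper, and the measured on-target timings are slightly higher than these estimates.

```python
# Back-of-the-envelope MACC-to-latency estimate (not a cycle-accurate model).
CYCLES_PER_MACC = 9
F_CLK_HZ = 80e6

def macc_to_seconds(macc):
    return macc * CYCLES_PER_MACC / F_CLK_HZ

print(f"{macc_to_seconds(2184) * 1e6:.0f} us")    # autoencoder: ~246 us
print(f"{macc_to_seconds(17995) * 1e3:.2f} ms")   # sequential CNN: ~2.02 ms
print(f"{macc_to_seconds(20179) * 1e3:.2f} ms")   # full pipeline: ~2.27 ms
```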

As a second step, we evaluate the layers composing the remaining sequential structure. Overall, it results in a complexity of 17,995 MACC, a flash occupation of 48.74 kB over the 1024 kB available and a RAM occupation of 2.79 kB for the I/O management. RAM also comprises 1.48 kB from activation layers. The MACC parameter allows one to estimate an operation timing of about ~2.02 ms @ 80 MHz. Validated on target, this timing was 2.663 ms. Details about the flash utilization, MACC usage and validated timing per layer are provided in Table 8, while the RAM occupation is shown layer by layer in Figure 12. Finally, the quantization comparison returns an error on the prediction matching of 8.98 × 10−7, largely below the threshold of 0.01.

Table 8. CNN layers' ROM utilization and MACC.

Layer | Type | Flash (B) | MACC | Timing (ms)
L7 | BatchNormalization | 48 | 672 | 0.067
L8 | Conv1D+ReLU | 3136 | 5504 | 1.231
L10 | BatchNormalization | 896 | 224 | 0.028
L11,1 | Dense(64) | 28,928 | 7232 | 0.808
L11,2 | ReLU | - | 64 | 0.016
L12,1 | Dense(64) | 16,640 | 4160 | 0.746
L12,2 | ReLU | - | 64 | 0.016
L13,1 | Dense(1) | 260 | 65 | 0.014
L13,2 | Sigmoid | - | 10 | 0.008



Figure 12. RAM and I/O memory usage of the sequential NN following the autoencoder.

4. Discussion
The proposed single-trial P300 detector based on an autoencoded CNN-like architecture was designed with the first main goal of maximizing the ITR while keeping high recognition accuracy. As a second goal, we investigated the full implementability of the whole architecture on a low-cost dedicated microcontroller-based platform, the STM32L476RG Nucleo board, minimizing the resource utilization.
Concerning the single-trial P300 detection framework, the experimental results on the dataset from the BCI competition III (i.e., Dataset 2), reported in Table 4, showed how, on subject A, the proposed autoencoded CNN model ensures a precision comparable with the best model, which is the ensemble composed of BN3 and ConvLSTM. The best model in terms of recall is CNN-1; however, the autoencoded CNN model achieved a competitive recall (−0.14%) in the same application. The autoencoded CNN design focused on a metric that expresses a tradeoff between precision and recall, the F1-score. Considering this metric, the autoencoded CNN ensures an F1-score of 47.37%, which is +0.67% with respect to the ensemble model. On subject B, the proposed model performs better than the other solutions in terms of precision, binary accuracy and F1-score, with +0.84%, +0.38% and +1.03%, respectively. The model that presents the best recall is the ConvLSTM.
To provide a complete overview of the multi-objective optimization achieved by the proposed method, a bubble scatter plot reporting precision on the y-axis, recall on the x-axis and the F1-score as the bubble dimension is presented in Figure 13. All the parameters reported in Figure 13 are extracted by averaging values from subjects A and B from Dataset 2.

Figure 13. Precision versus recall versus F1-score bubble scatter plot for model comparison.


The red model names are related to the here-presented ones, as per the definition in Table 4. Considering the scatter plot's nature, the best single-trial P300 detector system should lie in the top-right part of the plot, which means precision and recall are maximized. The bubble should also be as large as possible to maximize the F1-score. The scatter plot in Figure 13 shows how the autoencoded CNN outperforms, on average, the other state-of-the-art solutions in terms of precision and F1-score, placing itself in the highest top-right corner of the comparison plane. It demonstrates the capability of the model to properly weight imbalanced datasets such as P300-based ones, where the target trials are largely fewer than the non-target ones (target: 16.67% of the total dataset, non-target: 83.33%). Moreover, results concerning the single-trial P300 recognition in the prototype car driving experiment (Table 5) showed that the system performs better in a four-choice BCI paradigm than in a 36-character P300 speller [42,43]. Indeed, with an average precision of 68.83 ± 8.86%, the system overcomes its own performance on the P300 speller problem by about +26.23%. In a similar way, an improvement of +4.87% has been recorded in the recall metric (71.75 ± 2.10% versus 66.88% from the P300 speller) and of +5.54% in the binary accuracy context (84.45 ± 3.53% versus 78.91% from the P300 speller). Finally, the F1-score also returns an improvement of +18.22%, considering an average F1-score of 51.78% in the P300 speller problem and of 70.00 ± 4.58% recorded in the prototype car driving problem. The F1-score improvement between the two datasets can be related to the lower number of choices in the four-choice oddball paradigm (Dataset 1) with respect to the 36-choice paradigm of the P300 speller. Indeed, in the first case, the ratio between target and non-target stimulations is 1:4, making the dataset "less unbalanced" than the P300 speller case (1:6); class imbalance negatively influences the harmonic mean between precision and recall [44]. Additionally, no explicit information has been provided by the Wadsworth Research Center, NYS Department of Health, about the data collection routines for the international BCI competition III. EEG channel data have been provided with different SNRs, and many outliers have been recorded on the Fz channel, which is one of the most informative ones according to the study conducted by the authors in [22]. The data collection routine for the car prototype fluid driving experiment has, instead, been conducted in a controlled environment, by regulating lights, minimizing distracting phenomena and performing a real-time impedance check on each electrode. This could result in a performance improvement.
This paper also presented the character recognition rate results in Figure 7. The experiment conducted on the test set of Dataset 2 [21] showed how the proposed autoencoded CNN model performs better than the other solutions for a few row–column repetitions (considering the range going from 1 to 3 repetitions). Then, the number of FPs starts to become critical, reducing the accuracy in the comparison. Overall, the autoencoded CNN is able to achieve an average character accuracy of 42% after a single character repetition, overcoming the second-best architecture, BN3, by about +7.5%. After 15 repetitions, the accuracy reached is, on average, 90.5%, about −6.5% below the best model, implemented by the joint application of PCA and CNN.
Next to these accuracy values, we analyzed another metric widely used to characterize a BCI system in terms of speed, the ITR, which constitutes one of the main constraints of this design. In this context, the results in Figure 8 show that the autoencoded CNN ITR is able to reach, on average, 16.83 bit/min with a single repetition, performing better than the other solutions. The ITR performance after eight repetitions degrades proportionally with the accuracy rate, losing, on average, 1.64 bit/min if compared to the best solution realized via the BN3 network. Considering subject B, the system achieves an ITR of 26 bits/min, which formally means one correctly brain-digitized character every 2 s. In the same context, with an ITR of 32.44 bit/min, the user can formalize and actuate a direction choice about every 1.8 s.
Contextually, the paper also investigated the impact of the proposed LBP routine as the preprocessing stage for several state-of-the-art neural networks.

Specifically, the CNN-1 [12], BN3 [10], ConvLSTM [11] and BN3+ConvLSTM [11] structures have been reproposed and used as the final classification step, substituting the raw EEG signals from 64 channels with the six-channel EEG setup defined by the recursive channel elimination procedure (Section 2.3). The BN3 architecture has been reproposed in two forms: (i) the original one, BN3, and (ii) a BN3 version without the first spatial filtering stage (i.e., BN3(ns)).
According to the results in Table 6, a scatter plot reporting precision on the x-axis and recall on the y-axis is depicted in Figure 14. As in Figure 13, all the parameters reported in Figure 14 are extracted by averaging values from subjects A and B from Dataset 2.

Figure 14. Precision versus recall scatter plot for LBP impact on neural network architecture comparison.

The red model names are related to the here-presented model, as per the definitions in Tables 4 and 6. Considering the scatter plot (Figure 14), the best single-trial P300 detector system should lie in the top-right part of the plot, which means precision and recall are maximized. It is interesting to note that the LBP routine works better as a preprocessing stage for those neural networks that implement a first 1D convolutional stage dedicated to temporal feature extraction (temporal convolution), as highlighted by the brown-colored models in Figure 14. For instance, by using the LBP as a preprocessing stage, the BN3 architecture without a spatial filtering layer (i.e., BN3(ns)) performs better (+7.46% on F1-score) than its version with a first stage of spatial filtering (i.e., BN3). The problem of these CNNs with the LBP routine seems to lie in the presence of the first spatial convolutional operation, which converts each column of spatial data from the input tensor into an abstract feature map. This feature map is thus sent to a temporal convolution layer that analyzes, in this way, an abstract temporal signal rather than a raw temporal one. This problem has also been analyzed by the authors in [45], who demonstrated that the introduction of a spatial convolution layer as a first stage in a P300 detector application strongly increases the complexity of the following CNN needed to achieve good results. Since most of the main P300 features have a temporal nature, the presence of this spatial filtering degrades the LBP capabilities. The introduction of an autoencoder between the LBP and the first temporal convolution stage also improves the neural network's performance, leading to a +4.21% on the F1-score for the BN3(ns), +5.31% for the ensemble BN3(ns)+ConvLSTM and +1.24% for the CNN-1. The reason lies in the capability of the encoder–decoder process to denoise the signal while preserving the useful temporal information [32].
Beyond the ITR-accuracy tradeoff, the second main design goal for the here-presented system was the architecture implementability on a low-cost microcontroller or dedicated embedded platform. For this reason, Figure 9 summarizes the number of implemented parameters and the memory usage for those above-introduced methods that provided these metrics in their studies [10–12].


Results in Figure 9 show that, with a total of 14,857 parameters, the here-proposed autoencoded CNN lies largely below most of the state-of-the-art solutions (about 21,140 parameters less than the CNN-1). Only the ConvLSTM by [11], with 11,213 parameters, slightly reduces the neural network complexity (−3644 parameters) with respect to the here-proposed method; however, it requires the parallel analysis of trials from 64 EEG channels, strongly increasing the input data size (+37.12 kB w.r.t. the autoencoded CNN). It must be specified that all the compared models exploit a high amount of data as input. Indeed, all of them consider data from all the 64 channels and different trial lengths, against the monitoring of only six channels proposed in the present work. Nevertheless, assessing 64 channels in parallel with a minimum sampling rate of 240 Sa/s is not practically applicable in mobile EEG applications equipped with microcontrollers, due to the overflow of RAM capabilities. For this aim, the present work uses only six properly selected channels, which strongly reduces the memory resource consumption. Considering this first complexity view and the accuracy results, we can infer that the autoencoded CNN provides the best tradeoff between accuracy, speed and complexity.
Concerning the full architecture implementation, the autoencoded CNN was implemented and validated on a Nucleo board L476RG, connected through a COM bridge and compared, in terms of inference results, with the realized Keras model. Overall, the whole system (autoencoder + CNN) results in a complexity of 20,179 MACC, a flash occupation of 57.11 kB over the 1024 kB available and a RAM occupation of 2.85 kB over the 96 kB available on the board. The MACC parameter allows one to estimate an operation timing for the overall inference of about ~2.27 ms @ 80 MHz. Validated on target, this timing was 3.175 ms. The quantization comparison returns an error on the prediction matching of 4.61 × 10−6, below the threshold of 0.01. It must also be specified that the here-proposed method, with a minimum flash request of 57.11 kB and a minimum RAM of 2.85 kB (without any compression process), turns out to be easily implementable on all the STM32 microcontroller families above the F3 one. This leads to a low-cost solution for the realization of a potentially dedicated embedded platform if we consider, for the sake of example, the use of an STM32F301C8 microcontroller, with a cost of USD 0.16, as the central computation core. The use of this low-cost microcontroller ensures all the needed memory capabilities without the need for any additional external memory. The proposed microcontroller also includes several communication protocols to interface low-cost WiFi dongles, such as an ESP8266, toward the wireless EEG headset.

5. Conclusions
In this paper, a single-trial P300 detector has been presented that combines a novel preprocessing stage, based on EEG signal symbolization, with an autoencoded convolutional neural network (CNN). The detector maximizes the accuracy in single-trial P300 detection for a low number of stimulus repetitions, enhancing, as a consequence, the BCI speed in terms of ITR, and it ensures full compatibility with microcontroller implementations. Specifically, the proposed system exploited data from a small number of selected EEG channels to address the memory consumption issue. Next, a preprocessing routine, which includes (i) baseline correction, (ii) windsorizing and (iii) LBP symbolization, was implemented and studied. The symbolized EEG signals were then sent to an autoencoder model to emphasize those temporal features that can be effectively exploited by the following sequential network composed of seven layers. This latter sequential structure is made up of a combination of a 1D convolutional layer for the temporal filtering, three fully connected layers, batch normalization and dropout stages. The method was tested on two different datasets: one from a P300 speller application in the BCI competition III data and one from self-collected data implementing fluid prototype car driving. The experimental results showed that, in a P300 speller application, the proposed method was able to outperform the state of the art in terms of F1-score (51.78 ± 6.24%), ensuring a +5 bits/min average ITR improvement if compared with other state-of-the-art solutions. The same method, applied to the prototype car driving experiment, provided an F1-score of 70.00% and an ITR > 30 bit/min, allowing a car direction change every 1.8 s. The implementation feasibility assessment on an STM32L4 microcontroller showed that the overall system occupies 5.57% of the total available ROM and ~3% of

the available RAM. Considering this complexity assessment and the ITR and accuracy results, it can be inferred that the autoencoded CNN provides a good tradeoff between accuracy, speed and complexity to realize a single-trial P300 detector. Moreover, an example of design conceptualization has also been proposed and discussed, demonstrating the P300 detector implementability on small-footprint and low-cost dedicated microcontroller-based boards. In a real-life application context, the here-proposed P300 detector was demonstrated to be able to successfully expand the capabilities of autonomous driving systems, while its application in brain-controlled assistive robotics provided promising initial results.

Author Contributions: Conceptualization, D.D.V. and G.M.; methodology, D.D.V. and G.M.; software, G.M.; validation, D.D.V. and G.M.; formal analysis, D.D.V. and G.M.; investigation, D.D.V. and G.M.; data curation, G.M.; writing—original draft preparation, D.D.V. and G.M.; supervision, D.D.V.; project administration, D.D.V.; funding acquisition, D.D.V. Both authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board (or Ethics Committee) of Politecnico di Bari (Prot. 0016364—8 June 2021).
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Allison, B.Z.; Kübler, A.; Jin, J. 30+ years of P300 brain–computer interfaces. Psychophysiology 2020, 57, e13569. [CrossRef] [PubMed]
2. Li, M.; Li, F.; Pan, J.; Zhang, D.; Zhao, S.; Li, J.; Wang, F. The MindGomoku: An Online P300 BCI Game Based on Bayesian Deep Learning. Sensors 2021, 21, 1613. [CrossRef] [PubMed]
3. De Venuto, D.; Annese, V.F.; Mezzina, G. An embedded system remotely driving mechanical devices by P300 brain activity. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Lausanne, Switzerland, 27–31 March 2017; pp. 1014–1019. [CrossRef]
4. Kim, H.; Lee, M.; Lee, M. A BCI based smart home system combined with event-related potentials and speech imagery task. In Proceedings of the 8th International Winter Conference on Brain-Computer Interface (BCI), Gangwon, Korea, 26–28 February 2020.
5. Abibullaev, B.; Zollanvari, A. A Systematic Deep Learning Model Selection for P300-Based Brain-Computer Interfaces. IEEE Trans. Syst. Man Cybern. Syst. 2021. [CrossRef]
6. Zhang, Y.; Xu, H.; Zhao, Y.; Zhang, L.; Zhang, Y. Application of the P300 potential in cognitive impairment assessments after transient ischemic attack or minor stroke. Neurol. Res. 2021, 43, 336–341. [CrossRef]
7. Chakraborty, D.; Ahona, G.; Sriparna, S. Chapter 2: A survey on Internet-of-Thing applications using electroencephalogram. In Emergence of Pharmaceutical Industry Growth with Industrial IoT Approach; Academic Press: Cambridge, MA, USA, 2020; pp. 21–47.
8. Farwell, L.A.; Donchin, E. Talking off the top of your head: Toward a mental prosthesis utilizing event-related brain potentials. Electroencephalogr. Clin. Neurophysiol. 1988, 70, 510–523. [CrossRef]
9. Fereshteh Salimian, R.; Abootalebi, V.; Taghi Sadeghi, M. Spatial and spatio-temporal filtering based on common spatial patterns and Max-SNR for detection of P300 component. Biocybern. Biomed. Eng. 2017, 37, 365–372.
10. Liu, M.; Wu, W.; Gu, Z.; Yu, Z.; Qi, F.; Li, Y. Deep learning based on batch normalization for P300 signal detection. Neurocomputing 2018, 275, 288–297. [CrossRef]
11. Joshi, R.; Goel, P.; Sur, M.; Murthy, H.A. Single Trial P300 Classification Using Convolutional LSTM and Deep Learning Ensembles Method. In Intelligent Human Computer Interaction. IHCI 2018. Lecture Notes in Computer Science; Tiwary, U., Ed.; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11278. [CrossRef]
12. Cecotti, H.; Graser, A. Convolutional neural networks for P300 detection with application to brain–computer interfaces. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 433–445. [CrossRef]
13. Rakotomamonjy, A.; Guigue, V. BCI competition III: Dataset II-ensemble of SVMs for BCI P300 speller. IEEE Trans. Biomed. Eng. 2008, 55, 1147–1154. [CrossRef]
14. Jin, J.; Allison, B.Z.; Sellers, E.W.; Brunner, C.; Horki, P.; Wang, X.; Neuper, C. An adaptive P300-based control system. J. Neural Eng. 2011, 8, 036006. [CrossRef]
15. Throckmorton, C.S.; Colwell, K.A.; Ryan, D.B.; Sellers, E.W.; Collins, L.M. Bayesian Approach to Dynamically Controlling Data Collection in P300 Spellers. IEEE Trans. Neural Syst. Rehabil. Eng. 2013, 21, 508–517. [CrossRef]
16. Bostanov, V. BCI competition 2003-data sets Ib and IIb: Feature extraction from event-related brain potentials with the continuous wavelet transform and the t-value scalogram. IEEE Trans. Biomed. Eng. 2004, 51, 1057–1061. [CrossRef]

17. Riccio, A.; Schettini, F.; Simione, L.; Pizzimenti, A.; Inghilleri, M.; Olivetti-Belardinelli, M.; Mattia, D.; Cincotti, F. On the Relationship between Attention Processing and P300-Based Brain Computer Interface Control in Amyotrophic Lateral Sclerosis. Front. Hum. Neurosci. 2018, 12, 165. [CrossRef] [PubMed]
18. Gao, Z.; Sun, X.; Liu, M.; Dang, W.; Ma, C.; Chen, G. Attention-based Parallel Multiscale Convolutional Neural Network for Visual Evoked Potentials EEG Classification. IEEE J. Biomed. Health Inform. 2021. [CrossRef]
19. Tao, W.; Li, C.; Song, R.; Cheng, J.; Liu, Y.; Wan, F.; Chen, X. EEG-based Emotion Recognition via Channel-wise Attention and Self Attention. IEEE Trans. Affect. Comput. 2020. [CrossRef]
20. Carabez, E.; Miho, S.; Nambu, I.; Yasuhiro, W. Convolutional Neural Networks with 3D Input for P300 Identification in Auditory Brain-Computer Interfaces. Comput. Intell. Neurosci. 2017, 2017, 8163949. [CrossRef]
21. Maddula, R.; Stivers, J.; Mousavi, M.; Ravindran, S.; de Sa, V. Deep Recurrent Convolutional Neural Networks for Classifying P300 BCI signals. In Proceedings of the 7th Graz Brain-Computer Interface Conference, GBCIC 2017, Graz, Austria, 18–22 September 2017.
22. Li, F.; Li, X.; Wang, F.; Zhang, D.; Xia, Y.; He, F. A Novel P300 Classification Algorithm Based on a Principal Component Analysis-Convolutional Neural Network. Appl. Sci. 2020, 10, 1546. [CrossRef]
23. Wen, T.; Zhang, Z. Deep Convolution Neural Network and Autoencoders-Based Unsupervised Feature Learning of EEG Signals. IEEE Access 2018, 6, 25399–25410. [CrossRef]
24. Ditthapron, A.; Banluesombatkul, N.; Ketrat, S.; Chuangsuwanich, E.; Wilaiprasitporn, T. Universal joint feature extraction for P300 EEG classification using multi-task autoencoder. IEEE Access 2019, 7, 68415–68428. [CrossRef]
25. Krusienski, D.J.; Schalk, G. Wadsworth BCI Dataset (P300 Evoked Potentials), BCI Competition III Challenge. Available online: http://www.bbci.de/competition/iii/ (accessed on 11 May 2021).
26. Krusienski, D.J.; Sellers, E.W.; McFarland, D.J.; Vaughan, T.M.; Wolpaw, J.R. Toward enhanced P300 speller performance. J. Neurosci. Methods 2008, 167, 15–21. [CrossRef]
27. Yılmaz, K.; Uyar, M.; Tekin, R.; Yildirım, S. 1D-local binary pattern-based feature extraction for classification of epileptic EEG signals. Appl. Math. Comput. 2014, 243, 209–219. [CrossRef]
28. Santurkar, S.; Tsipras, D.; Ilyas, A.; Madry, A. How does batch normalization help optimization? arXiv 2018, arXiv:1805.11604.
29. Hoffmann, U.; Vesin, J.-M.; Ebrahimi, T.; Diserens, K. An efficient P300-based brain–computer interface for disabled subjects. J. Neurosci. Methods 2008, 167, 115–125. [CrossRef] [PubMed]
30. De Venuto, D.; Annese, V.F.; Mezzina, G.; Defazio, G. FPGA-Based Embedded Cyber-Physical Platform to Assess Gait and Postural Stability in Parkinson's Disease. IEEE Trans. Compon. Packag. Manuf. Technol. 2018, 8, 1167–1179. [CrossRef]
31. Ojala, T.; Pietikäinen, M.; Harwood, D. A comparative study of texture measures with classification based on featured distributions. Pattern Recognit. 1996, 29, 51–59. [CrossRef]
32. Wang, W.; Huang, Y.; Wang, Y.; Wang, L. Generalized autoencoder: A neural network framework for dimensionality reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA, 23–28 June 2014.
33. Zou, D.; Cao, Y.; Zhou, D.; Gu, Q. Gradient descent optimizes over-parameterized deep ReLU networks. Mach. Learn. 2020, 109, 467–492. [CrossRef]
34. Hidenori, I.; Kurita, T. Improvement of learning for CNN with ReLU activation by sparse regularization. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017.
35. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 448–456.
36. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
37. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
38. Ashraf, S.; Ahmed, T. Machine learning shrewd approach for an imbalanced dataset conversion samples. J. Eng. Technol. (JET) 2020, 11, 1–9.
39. Obermaier, B.; Neuper, C.; Guger, C.; Pfurtscheller, G. Information transfer rate in a five-classes brain–computer interface. IEEE Trans. Neural Syst. Rehabil. Eng. 2001, 9, 283–288. [CrossRef]
40. Manaswi, N.K. Understanding and working with Keras. In Deep Learning with Applications Using Python; Apress: Berkeley, CA, USA, 2018; pp. 31–43.
41. Sakr, F.; Bellotti, F.; Berta, R.; De Gloria, A. Machine Learning on Mainstream Microcontrollers. Sensors 2020, 20, 2638. [CrossRef] [PubMed]
42. De Venuto, D.; Ohletz, M.J. On-Chip Test for Mixed-Signal ASICs using Two-Mode Comparators with Bias-Programmable Reference Voltages. J. Electron. Test. 2001, 17, 243–253. [CrossRef]
43. De Venuto, D.; Rabaey, J. RFID transceiver for wireless powering brain implanted microelectrodes and backscattered neural data collection. Microelectron. J. 2014, 45, 1585–1594. [CrossRef]
44. Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27. [CrossRef]
45. Hongchang, S.; Liu, Y.; Stefanov, T. A Simple Convolutional Neural Network for Accurate P300 Detection and Character Spelling in Brain Computer Interface. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, 13–19 July 2018.