Gamelan Instrument Sound Recognition Using Spectral and Facial Features of the First Harmonic Frequency

Acoust. Sci. & Tech. 36, 1 (2015) ©2015 The Acoustical Society of Japan

PAPER

Aris Tjahyanto (1,*), Diah Puspito Wulandari (2,†), Yoyon K. Suprapto (2,‡) and Mauridhi Hery Purnomo (2,§)

(1) Information Systems Department, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
(2) Electrical Engineering Department, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia

(Received 7 August 2013, Accepted for publication 19 August 2014)

Abstract: Principal component and spectral-based feature sets were applied to the recognition of gamelan instrument sounds using support vector machines (SVMs). The principal components were calculated from a segmented scalogram of the first harmonic frequency of the gamelan recordings. The segmented scalogram is treated as a “facial image” of the gamelan instrument sound in a frontal pose, with neutral expression and normal lighting. The scalogram was computed from the gamelan sound signal using a continuous wavelet transform (CWT). The performance and contribution of the principal component and spectral-based features were compared using the F-measure. For the training phase, the feature sets were extracted from isolated tones recorded over the entire frequency range of four gamelan instruments (demung, saron, peking, and bonang families). Using 90%/10% splits between the training and validation data sets, model classifiers were constructed with a radial basis function (RBF) kernel SVM. The classifiers consist of 28 separate classifiers in a one-against-one multiclass scheme. The experiment showed that the spectral-based feature set yields an average F-measure of 74.05%, while the appearance-based feature set yields 71.87%. For saron-only note tracking, the spectral-based feature set had an F-measure of 83.79%, higher than for demung-only note tracking, which yielded 63.89%.
Keywords: Support vector machines, Automatic transcription, Wavelet transform

PACS number: 43.60.Lq [doi:10.1250/ast.36.12]

* e-mail: [email protected]
† e-mail: diah [email protected]
‡ e-mail: [email protected]
§ e-mail: [email protected]

1. INTRODUCTION

The process of converting a gamelan audio signal into balungan gendhing notation can be done automatically by recognizing which balungan instrument is playing which note. Balungan is the skeleton or core melody of a gendhing, which appears like the one shown in Fig. 1; gendhing is a generic term for any gamelan composition. The gamelan instruments that imitate the core melody are known as balungan instruments, such as demung, saron, and peking. The bonang plays as a leading and elaborating instrument. The demung, saron, and peking play balungan within their one-octave ranges. The demung’s octave overlaps with the lowest octave of bonang barung, and the saron’s octave overlaps with the highest octave.

Fig. 1 Notation of the balungan Lancaran Manyar Sewu (Thousands of Weaver Birds), a gendhing for welcoming guests. Bk is for the opening melody, and the letter A indicates a section of a gamelan composition. The circle is for gong ageng, square for gong suwuk, smiley for kempul, and frowny for kenong. A dot above a number indicates the upper octave; below a number, the lower octave. A dot in the place of a number indicates a rest or a sustained note.

Gamelan has distinctive scale systems and intervals. Gamelan also has no standard tuning system. Gamelan tuning practice was not governed by concern with the mathematical purity of interval vibration ratios [1]. For example, a gamelan set from Yogyakarta differs from a gamelan set from Surakarta. Because of the distinctive scale systems and their intervals, the frequency range and the intervals may vary slightly.

Each frequency channel that represents a note contains many onsets from several instruments played in a gamelan ensemble. Therefore, transcribing a multi-instrumental gamelan performance into balungan gendhing requires information about which instrument each onset is attributed to. To differentiate one gamelan instrument from another, we used features extracted from the time-frequency domain. After segmenting the signals in the time-frequency domain, we can recognize the gamelan instrument using the global representation of the segment or distinctive spectral features such as centroid, flux, roll-off, skewness, and kurtosis.

In the facial recognition field, it is common to implement principal component analysis (PCA) or linear discriminant analysis (LDA) for appearance-based or global representations of the face image. It is also common to use geometric relationships among the facial features and to use facial features such as eyebrows, eyes, nose, mouth, and cheeks. In the field of automatic music transcription systems, monophonic transcription tasks rely on information in the time domain. Polyphonic transcription tasks, by contrast, depend on the analysis of information in the frequency domain. In the time domain or frequency domain, many acoustic features are captured for instrument recognition.

In this paper, we propose a novel feature extraction approach based on PCA and spectral-based features to convert the “facial image” of a segmented scalogram taken from gamelan instrument tones. Then we apply support vector machines (SVMs) to build a classifier and evaluate the performance of those features in gamelan instrument recognition.

1.1. Previous Work

PCA has been known as an effective technique for feature extraction in the field of facial recognition [2]. Many researchers have also addressed the use of LDA as a feature in facial recognition research [3]. PCA has also been applied to image compression [4] and object detection [5]. De Paula showed that PCA can be used to represent the different timbre spaces of a musical instrument [6]. Kitahara implemented PCA for reducing a 129-dimensional feature space to a 79-dimensional one, then applied LDA to further reduce the feature space to a lower-dimensional one [7].

In automatic music transcription, many approaches have been applied to estimate the pitch and instruments. Hidden periodicities in a time domain signal can be determined using an autocorrelation function. The peaks are related to the lags where periodicity is stronger [8]. Suprapto et al. introduced a technique for generating gamelan transcription using a spectral density model and adaptive cross correlation [9]. These approaches used the energy profile taken from the frequency channel of a saron instrument, and they work well for producing gamelan notation taken from saron-only notes. One of the disadvantages of these approaches is the difficulty in distinguishing the energy profiles of the saron and bonang signals, because the frequency channels of the two instruments overlap.

Another approach is the pattern recognition technique. This technique requires that a set of features be extracted from the audio signal [10,11]. The feature set for the recognition process can be grouped into spectral-based features and automatic speech recognition (ASR) features [12,13]. The common features for audio signals are zero crossing rate, envelope, RMS energy, centroid, spectrum representation, and flux [14,15].

Feature extraction and selection is a key process in automatic music transcription using pattern recognition technology. Feature extraction is a process for discovering a set of vectors that represents an observation while reducing the dimensionality. Many algorithms have been developed to transform an audio signal into another representation for extracting the features. Common methods for transforming the audio signal are the fast Fourier transform (FFT), short time Fourier transform (STFT), discrete wavelet transform (DWT), and continuous wavelet transform (CWT). On the basis of the transformed audio signal, various features are calculated and extracted.

1.2. Proposed Method

The goal of this paper is to compare the performance of principal component and spectral-based features for gamelan instrument sound recognition, especially for the instruments that imitate balungan gendhing. The principal component obtains an orthogonal projection of the data signal, including the noise component. The spectral-based features take into account only relevant features if the process of feature selection has been done carefully. The performance of the principal component and spectral-based features is cross-validated using SVMs. There are two main reasons for addressing these tasks using SVMs. First, accurate recognition of gamelan instruments is itself important for automatic transcription. Second, because of the effectiveness of SVMs [16], they recently became one of the most popular recognition and classification methods. They have been used in a wide variety of applications, such as text classification [17], facial recognition [18], and gene analysis [19].

The rest of this paper is organized as follows. Section 2

Fig. 2 Automatic Gamelan Notes Transcription System Architecture.

domain and then continued for feature extraction. Based on the extracted features, the Gamelan Instrument Classifier module estimates the candidate pitch values and the predicted instruments. The Cleaning and Tabulation module deletes uncommon events and then tabulates the estimated notes into a balungan gendhing notation.

Fig. 3 3D surface spectrogram of the recorded Manyar Sewu gamelan sound signals played using saron and bonang.

2.1.
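
As an illustration of the spectral descriptors named in the Introduction (centroid, flux, roll-off, skewness, and kurtosis), the following is a minimal sketch of how they might be computed for one signal frame. It is not the authors' implementation: the function name `spectral_features`, the Hann window, the 0.85 roll-off fraction, and the choice to treat the normalized magnitude spectrum as a distribution over frequency are all assumptions made for this example.

```python
import numpy as np


def spectral_features(frame, prev_frame, sample_rate, rolloff_fraction=0.85):
    """Compute five spectral descriptors for one audio frame.

    Hypothetical helper: the 0.85 roll-off fraction and the Hann window are
    assumed settings, not values stated in the paper. `frame` and
    `prev_frame` must have the same length.
    """
    frame = np.asarray(frame, dtype=float)
    prev_frame = np.asarray(prev_frame, dtype=float)
    window = np.hanning(len(frame))  # taper to limit spectral leakage

    mag = np.abs(np.fft.rfft(frame * window))
    prev_mag = np.abs(np.fft.rfft(prev_frame * window))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    eps = 1e-12  # guards against division by zero on silent frames
    p = mag / (mag.sum() + eps)            # current spectrum, sums to ~1
    q = prev_mag / (prev_mag.sum() + eps)  # previous spectrum, sums to ~1

    # Moments of the spectrum viewed as a distribution over frequency.
    centroid = float(np.sum(freqs * p))
    spread = float(np.sqrt(np.sum((freqs - centroid) ** 2 * p))) + eps
    z = (freqs - centroid) / spread
    skewness = float(np.sum(z ** 3 * p))
    kurtosis = float(np.sum(z ** 4 * p))

    # Roll-off: lowest frequency below which the chosen fraction of the
    # total spectral magnitude is contained.
    cumulative = np.cumsum(p)
    idx = int(np.searchsorted(cumulative, rolloff_fraction * cumulative[-1]))
    rolloff = float(freqs[min(idx, len(freqs) - 1)])

    # Flux: Euclidean distance between successive normalized spectra.
    flux = float(np.sqrt(np.sum((p - q) ** 2)))

    return {
        "centroid": centroid,
        "rolloff": rolloff,
        "flux": flux,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


if __name__ == "__main__":
    sr = 8000  # assumed sample rate for the demonstration
    t = np.arange(1024) / sr
    tone = np.sin(2 * np.pi * 440.0 * t)  # synthetic tone, not gamelan data
    print(spectral_features(tone, tone, sr))
```

A vector of such values per segmented tone could then serve as one row of a feature matrix for a classifier. Definitions of flux and roll-off vary between toolkits, so the exact numbers depend on the conventions assumed here.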
