Musical Computer Games Played by Singing
Total Page:16
File Type:pdf, Size:1020Kb
04 PrProc.oc. of the 77thth Int. Int.Confer Conferenceence on onDigital DigitalA Audioudio Ef Effectsfects (D (DAFx'04),AFX-04)Naples,, Naples,Italy Italy,, October October5-8, 5-8,2004 2004 DAFx MUSICAL COMPUTER GAMES PLAYED BY SINGING Perttu Hämäläinen, Teemu Mäki-Patola Ville Pulkki, Matti Airas Telecommunications Software Laboratory of Acoustics and Multimedia Laboratory and Audio Signal Processing Helsinki Univ. of Technology, Espoo, Finland Helsinki Univ. of Technology, Espoo, Finland {pjhamala|tmakipat}@tml.hut.fi {ville.pulkki|matti.airas}@hut.fi ABSTRACT cords vibrate. However, the concept of pitch and the related musi- cal vocabulary are not clear for many people. For example, if you Although voice has been used as an input modality in various instruct a person with no musical background to sing “higher”, he user interfaces, there are no reports of using pitch of the user’s or she may just continue to sing at the same pitch, but use a differ- voice for real-time control of computer games. This paper ex- ent vowel. plores pitch-based control for novel games for musical education. In the following, we first describe the two example games, Mapping pitch to the position of a game character provides visual shown in Figures 1 and 2. We then discuss the technological feedback that helps you to learn to control your voice and sing in aspects and our experiences from testing the games. tune. As demonstrated by two example games in this paper, the approach can be applied to both single and two-player games even with just one microphone. 2. THE HEDGEHOG GAME The Hedgehog game in Figure 1 is a part of Soittopeli (PlaySing- 1. INTRODUCTION Music), a musical edutainment CD-ROM released in Finland in 2000 by Elmorex Ltd. PlaySingMusic also features other games Voice has been used as an input modality in various user inter- played with a MIDI keyboard, but they are beyond the scope of faces. Speech recognition is nowadays used in games and other this paper. The game is aimed at 5 to 10 year old children. applications for wide audiences. The technology has matured so The game is designed to provide a fun way of learning to sing. that simple recognition of a limited set of spoken commands can A song is played and the scenery scrolls from right to left in sync be implemented with little technological expertise using tools like with the music. The player has to sing so that the hedgehog stays the Microsoft Speech Application SDK [1]. on the path that twists along the melody. As the vertical position An alternative to recognizing spoken commands is using of the hedgehog corresponds directly to pitch, the player gets sound pressure level, pitch, and other features of voice for imme- immediate visual feedback that helps in finding the right pitch. diate, real-time control. Igarashi and Hughes call this approach “Voice as Sound” [2] and give examples, such as a television volume control where the user says “Volume up, ahhh” and the volume continues to increase as long as the “ahhh” continues. Hämäläinen and Höysniemi have used the approach in a multimo- dal perceptual game user interface, where you control a dragon that flies when you flap your hands like wings and breathes fire when you shout [3]. Earlier, similar features of sound have mainly been used to control musical user interfaces. For example, in the Swim Swan performance in 1993, the pitch and loudness of a clarinet were used to control musical synthesis [4]. Speech input without actual speech recognition has been in used toys, such as a parrot that repeats words [5], for which it is necessary only to segment the input signal into words or phrases. This paper describes two computer games that apply the “Voice as Sound” approach in the context of musical edutainment (education and entertainment). The main idea is to use the pitch of your voice to control a game so that you learn to control your voice and sing correctly with help of immediate graphical feed- back, created by mapping pitch to the position of an avatar or other visual objects in the game. Figure 1: A screenshot of the hedgehog game. The vertical posi- Pitch has been defined as the characteristic of a sound that tion of the hedgehog is controlled by the pitch of the user’s makes it sound high or low or determines its position on a scale voice. You sing the song correctly if the hedgehog stays on the [6]. Human perception of pitch is a complex process, but in case path twisting along the melody. The words of the song are shown of singing, pitch corresponds to the frequency at which the vocal at the bottom of the screen. DAFX-1 367 — DAFx'04 Proceedings — 367 PrProc.oc. of the 77thth Int. Int.Confer Conferenceence on onDigital DigitalA Audioudio Ef Effectsfects (D (DAFx'04),AFX-04)Naples,, Naples,Italy Italy,, October October5-8, 5-8,2004 2004 0.5 0 input signal −0.5 0 200 400 600 800 1000 sample index 100 50 0 −50 autocorrelation −100 0 200 400 600 800 1000 lag Figure 3: A frame of the vowel ‘u’ and the corresponding auto- correlation. The strongest peak of the autocorrelation is located at the signal period of 124 samples. Figure 2: A pitch-controlled Pong-style game. The bats on the left and on the right move vertically according to the pitch of the user’s voice. The screenshot is from one-player mode, where the 4. PITCH DETECTION TECHNOLOGY right bat is controlled by the computer. Pitch detection (fundamental frequency estimation, f0 estimation) has been researched widely since the 70’s. It is perhaps most The feedback helps particularly when the player has some idea of widely used as part of speech recognition and coding. Estimating going up and down with the melody, but the intervals and key are pitch can be implemented as estimating the periodicity of voice, as not right. The game grades the player’s performance as the aver- seen in Figure 3. A periodic waveform repeats itself after the age distance from the correct notes and the player is awarded with period T, the reciprocal of the fundamental frequency f0. In Figure different fanfares and prizes after the song. The expression of the 3, the period of 124 samples is clearly visible. hedgehog also changes according to the performance, shown in There are several algorithms for pitch detection. We have the bottom-right corner of the screen. mainly experimented with autocorrelation based methods. The approach was found suitable for speech signals already in the 70’s [9] and several improved variants of the approach have been 3. PITCH-CONTROLLED PONG developed, one of the latest being YIN by de Cheveigné and Kawahara [10], who report error rates three times lower than the Pitch can be used to control practically any game with one- best competing methods. dimensional input. We created the pitch-controlled Pong game in Comparing pitch detection algorithms is difficult, since they Figure 2 to provide a simple game for learning the concept of are often tested with different data. Recently, the Keele pitch pitch and to experiment with a two-player game. Pitch controls the detection reference database [11] has been gaining popularity. It vertically moving bats and the goal is to intercept a bouncing has been used to evaluate YIN as well as methods by Wang and projectile with your bat, similar to the original Pong and other Seneff [12] and Tabrikian et al. [13] to name a few. “paddle games” derived from it [7]. In autocorrelation based methods, sound input signal is di- The history of computer games features several examples of vided into short frames and the period of a frame is estimated by multiplayer games played on a single computer. Generally speak- computing its autocorrelation function (ACF), as shown in Figure ing, you have to either use multiple control devices, or take turns 3. ACF estimates the similarity of a signal with the same signal using a shared control. Examples of the former are Tekken and delayed by some time lag. If the signal is periodic, the strongest other martial arts games played with several gamepads. Stewart et peak of its ACF is usually located at the lag equal to the period. al. have also researched this kind of interfaces for collaborative Note that the ACF computed from a frame of finite length is only work [8]. Having two microphones would make it easy to imple- an estimate, since the autocorrelation of a random process x(k) is ment a two-player game, because the pitch detection usually out- defined as the expectation E[x(k)x*(l)], where * denotes complex puts the pitch of the loudest voice. However, few users have the conjugate. For wide-sense stationary (WSS) processes, autocorre- equipment needed to connect two microphones into the left and lation depends only on the lag k-l [14]. right input channels of a soundcard. Autocorrelation of a WSS process can also be defined as the The main reason for choosing Pong as the starting point was inverse Fourier transform of the signal’s power spectrum (squared that it combines both real-time and turn-based control so that it magnitude spectrum). In practice, the fastest way to compute the works using a single microphone. In two-player mode, pitch con- ACF of a signal frame is to assume stationarity within the frame, trols both bats so that they are always at the same vertical position. compute an estimate of the power spectrum using FFT and then The players take turns in singing or humming so that once a player apply inverse FFT.