Session: Phones & Watches UIST 2017, Oct. 22–25, 2017, Québec City, Canada

SoundCraft: Enabling Spatial Interactions on Smartwatches using Hand Generated Acoustics

Teng Han¹, Khalad Hasan¹, Keisuke Nakamura², Randy Gomez², Pourang Irani¹
¹University of Manitoba, Winnipeg, Canada. {hanteng, khalad, irani}@cs.umanitoba.ca
²Honda Research Institute, 8-1 Honcho, Wako-shi, Japan. {k.nakamura, r.gomez}@jp.honda-ri.com

Figure 1. The SoundCraft prototype uses an acoustic array on a smartwatch to angularly localize (in azimuth and elevation) and classify hand-generated sounds, or acoustic signatures, such as (a) a finger snap and (b) a body tap. Such localized signatures can enrich smartwatch interactions to facilitate (c) one-handed content navigation and (d) ad-hoc multi-user interaction.

ABSTRACT
We present SoundCraft, a smartwatch prototype embedded with a microphone array that angularly localizes, in azimuth and elevation, acoustic signatures: non-vocal acoustics that are produced using our hands. Acoustic signatures are common in our daily lives, such as when snapping or rubbing our fingers, tapping on objects, or even when using an auxiliary object to generate the sound. We demonstrate that we can capture and leverage the spatial location of such naturally occurring acoustics using our prototype. We describe our algorithm, adopted from the MUltiple SIgnal Classification (MUSIC) technique [31], which enables robust localization and classification of the acoustics when the microphones must be placed in close proximity. SoundCraft enables a rich set of spatial interaction techniques, including quick access to smartwatch content, rapid command invocation, in-situ sketching, and multi-user around-device interaction. Via a series of user studies, we validate SoundCraft's localization and classification capabilities in non-noisy and noisy environments.

Author Keywords
Microphone array processing; smartwatch spatial input; acoustic signatures; classification and localization.

ACM Classification Keywords
H.5.2. [Information Interfaces and Presentation]: User Interfaces - Input Devices and Strategies.

UIST 2017, October 22-25, 2017, Québec City, QC, Canada. © 2017 Association for Computing Machinery. ACM ISBN 978-1-4503-4981-9/17/10. https://doi.org/10.1145/3126594.3126612

INTRODUCTION
Smartwatches facilitate quick glances at information when needed. With a growing reliance on such devices [23], short and rapid input bursts have become equally important. For the most part, smartwatch interactions rely on touch [2]. However, such input is hampered by the device's small form factor [9]. Furthermore, touch is prone to screen occlusion and 'fat-finger' limitations [33], is unavailable in typical smartwatch scenarios such as one-handed input [38], and provides limited 2D spatial interaction [8]. For these reasons, researchers have proposed novel smartwatch interaction methods for quicker information access and command invocation on wearable devices [25, 29].

Alternatively, acoustic input using on-board microphones can enhance smartwatch interactions. Verbal acoustics, or speech, can be used for triggering specific actions such as making phone calls, sending text messages, and asking for directions [23]. Non-verbal acoustics such as whooshing or whistling [25, 29] can also be used for discrete events. However, such interactions are not well suited for spatial input, such as scrolling or swiping [29], or for multi-command input, all of which can serve the rapid interaction needs common to wearables.

To mitigate these issues, we implemented SoundCraft, a smartwatch prototype embedded with a microphone array to capture acoustic signatures, or non-vocal acoustics that are produced using our hands. Given the close proximity of our microphones on the smartwatch (1.5 inches apart, Figure 3), Time Difference of Arrival (TDOA) estimation for audible sounds results in low-resolution input. We instead use a more robust target signal subspace method, adopting the basic idea of the MUltiple SIgnal Classification (MUSIC) technique [31].
This facilitates spatial input on smartwatches by simultaneously localizing and classifying acoustic signatures generated using our hands, either in-air (finger snapping, Figure 1a), upon surface contact (clapping or tapping, Figure 1b), or via external tools (scissors). These capabilities open opportunities to explore a rich set of novel spatial interaction techniques on smartwatches, such as rapid access to smartwatch content (Figure 1c), in-situ sketching, and ad-hoc multi-user input (Figure 1d).

We validate SoundCraft's localization and classification accuracy in both non-noisy and noisy environments. Our results reveal that SoundCraft can effectively capture sound sources with an angular separation of 22.5° in azimuth and elevation, and achieves over 90% classification accuracy with 11 acoustic signatures in a lab environment. In an environment with relatively constant background noise, such as sounds coming from a nearby vending machine or a 3D printer, SoundCraft achieves average classification and localization accuracies of 88% and 89%, respectively.

The contributions of this paper include: (i) the use of acoustic signatures for spatial input on smartwatches; (ii) our SoundCraft prototype, which includes the design and implementation of suitable algorithms that allow angular localization of acoustic signatures with microphones at very close proximity; and (iii) a set of applications showing the unique capabilities of SoundCraft over current smartwatch input techniques.

RELATED WORK
SoundCraft's interaction capabilities are based on hand-generated acoustics. We first briefly review related work on verbal and non-verbal acoustic interactions on mobile and wearable platforms, then discuss acoustic processing for localization. We also briefly cover gesture-based spatial input capabilities on mobile and wearable devices.

Speech and non-verbal acoustics for mobile interactions
Speech input for smart devices has now become a viable input alternative for mobile on-the-go interactions [23]. A recent study revealed that of 1,800 smartphone users, 60% use speech input [19]. Such interactions rely on the built-in microphones of the devices for detecting and processing speech into commands. More recently, researchers have proposed an increasing number of interaction alternatives using non-speech voice input. These sounds include humming, hissing or blowing in proximity to the device. Humming [34], for example, has been shown to control games, while blowing [3, 4, 25] and whooshing [29] can be used for making coarse selections, or even for text entry. Additionally, non-speech voice input has also been demonstrated as a complementary modality to touch [30]. Furthermore, varying sound features such as pitch or volume has been shown to be useful for adjusting the volume on a television set [13] or controlling a mouse cursor [7].

The above projects have demonstrated that non-speech interaction, where the sound is produced with the vocal tract, could be a viable input alternative for smart devices. However, voice-generated sounds do not fulfill spatial smartwatch tasks such as scrolling, swiping, or menu selection, which may require continuous location mapping or a rich command set [29]. As an alternative, we explore the use of hand-generated acoustics, where the sound is produced from a gesture of one or more hands, either in-air, on contact with a surface, or with external tools.

Acoustic hand gestures and localized interactions
Most acoustic approaches for hand gesture detection rely on contact microphone(s) capturing signal transmission through an object [8] or through the body (i.e., bio-acoustics) [1, 10]. A notable exception is ViBand [20], in which the authors recently demonstrated the feasibility of using the embedded accelerometer for detecting rich gestures and recognizing micro-vibrations transmitted through the body from objects to the hand wearing the device. Contact acoustic sensing offers very high fidelity and preserves a less 'noisy' signal generated from the hand wearing the device. As a result, in contrast to SoundCraft, ViBand detects neither in-air signals nor spatial positions. Additionally, ViBand requires users to use the hand wearing the watch to perform the interaction; if the hand wearing the device is busy, such as holding a rail on a bus, the user cannot use ViBand.

Acoustic sensing allows novel sound-based spatial interactions. To localize sound sources, different passive [14, 26, 41] and active [27, 42] acoustic approaches have been proposed in the literature. Among them, time difference of arrival (TDOA) is a well-known and widely used passive acoustic localization technique which relies on input from a set of acoustic sensors, i.e., microphones, strategically attached to the device. In TDOA, the time difference at which the acoustic signal arrives at the different sensors allows the system to localize the sound source. In PingPongPlus [14], the authors demonstrated this technique to track the location where a ping pong ball lands on a table in order to project digital feedback on that table's surface. Paradiso et al. [24] used a similar setup, with four piezo acoustic sensors at four corners, to examine knock/tap events on glass. Toffee [41] exploited TDOA principles to detect hand tap events around mobile devices on a table. Systems such as Toffee [41] inspire our work, as SoundCraft extends spatial input for use in smartwatch contexts with acoustic signatures. However, with a close-proximity sensor arrangement, existing TDOA methods for sounds in the audible range (long wavelength) result in low resolution. Thus, we avoid TDOA and use a more robust target signal subspace method based on the MUltiple SIgnal Classification (MUSIC) technique [31]. SoundTrak [42] uses a similar microphone array setup on smartwatches, but necessitates an active predefined sound source (a ring on the finger). We do not require such additional peripherals, as we are not restricted to spatially tracking a predefined source, but also capture varying acoustic signatures. This enables a broad range of applications with SoundCraft.

Gestural input for spatial interactions on mobile devices
Prior research has shown that a mobile device or a small environment can be instrumented to track gestures

performed in the air [5, 16, 18, 33], on the body [10, 37, 44] and on physical surfaces [8, 11, 21]. These gestures can extend the input space of a mobile or a wearable device beyond its touchscreen. For instance, such gestures have been demonstrated for basic on-device operations such as UI widget control [9], multi-touch interaction [18], digit input [15], item selection and workspace navigation [12]. To detect in-air hand gestures with wearable devices, several research projects have investigated a rich set of sensing approaches. These include, but are not limited to, embedded cameras [36], IMUs [38], external IR sensors [17, 39], capacitive fields [28], bio-signals [35, 43], and radar signals [32]. Additionally, on-board sensors such as a camera [3] or a magnetometer [9] have shown promise for tracking hand movement. SoundCraft leverages microphone array techniques on the smartwatch to detect and localize audible sounds associated with hand input. We highlight its breadth of use for in-air, on-body, on-surface, and tool interactions, as well as wearing or non-wearing hand input, one-handed and bimanual input.

CATALOG OF ACOUSTIC SIGNATURES
Acoustic signatures are generated using hand-based sound events and therefore can provide a large and rich vocabulary, especially if we include acoustic features such as tempo and rhythm. In this initial work, we only focus on defining acoustic signatures based on the sound sources generated via different hand gestures. We are unaware of any prior work that produces a catalog of acoustic signatures and their input attributes. Therefore, we elicit user input to gain initial insights on the types and attributes of acoustic signatures.

We ran a design workshop with eight senior students and one professor from the music department at a local university. We selected this group as they are familiar with sound generation theory and processes. The design session was structured as follows. We first explained the goal of the project and mentioned how non-voice acoustics can be produced with hand actions (i.e., snapping or clapping). We then asked for written suggestions or sketches of potential acoustic signatures. Participants were instructed to list all the sound sources they could imagine without considering recognition or technical issues. We discussed and refined the proposed signatures together with the participants. The session took approximately 30 minutes to complete.

Taxonomy
We collected 56 acoustic signatures in total. A subset of that list is presented in Figure 2, which is not intended to show an exhaustive set, but those deemed possible and popular (listed at least twice by participants). We grouped them according to the following two key axes, handedness and location:

Handedness
This category defines the number of hands (one-handed or two-handed) involved in producing a sound. One-handed signatures usually require the use of multiple fingers. Some are explicit (e.g., snapping, joint cracking, tapping), where the gestures can be performed at a relatively large distance from the device and the sound generated is recognizable by nearby users. Others are finer-grained signatures (e.g., finger rubbing, nail scratching), which are more subtle. One-handed signatures can be performed with either the wearing or the non-wearing hand. During the design session, most participants demonstrated the gestures using their dominant hand.

Two-handed signatures require both hands to produce a sound. Examples include hand clapping or clasping, where one hand comes in contact with the other. These gestures, however, can hardly benefit from sound-source location as input to the device, as they are restricted to the position of the hands. One-handed signatures can be transformed into two-handed ones if both hands simultaneously perform the action, offering the opportunity for bimanual interactions on smartwatches.

Location
This category defines where the gestures are performed (in-air, on-body or on-surface). In-air signatures are those performed by fingers without touching additional body parts or surfaces. Examples of such gestures include finger snapping, rubbing and pinching.

On-body signatures produce sounds with hands or fingers contacting other body parts (including the other hand). Tapping and scratching, for instance, are two common signatures in this category. In these cases, forearms are often used as they provide relatively large and flat surfaces.

In Air: Finger Snap**, Finger Rub**, Finger Flick**, Finger Pinch**, Joint Crack, Finger Flutter, Nail Scratch, Jonas Snap

On Body: Clap*/**, Hand Rub*/**, Body Scratch**, Tap on Body**, Hand Clasp*, Hand Dust*, Facepalm

On Surface: Surface Clap**, Surface Scratch**, Surface Tap**, Surface Punch

Figure 2: Acoustic signatures collected from musicians (* indicates two-handed, ** indicates signatures evaluated in this paper).


On-surface signatures are similar to on-body ones, but contact is with physical objects. A variety of sounds can be produced with the same gesture by varying the surface type. For instance, flicking on a table generates a different sound than flicking on water.

We discussed with participants their impression of using acoustic signatures for smartwatch usage. All of them were very enthusiastic about using such gestures. Many of our participants mentioned that such input would allow them to access their smartwatch without using the small touch screen. Additionally, they indicated that accessing information with acoustic signatures would be faster than on-screen operations. However, a few participants raised the issue of not being able to perform certain gestures; for instance, one student reported that he was not able to snap. Some others raised social acceptance issues in public places. They stated that "small, quick and less noticeable" gestures such as the finger rub are more acceptable in public places than explicit two-handed gestures such as clapping.

SOUNDCRAFT PROTOTYPE
Our SoundCraft prototype was built using a microphone array on a small wrist-worn platform. We achieve sound detection, localization, enhancement, and classification by placing four microphones at the four corners of the platform and running a series of acoustic processing algorithms.

Figure 3. (a) Four microphones placed 1.5 inches apart at the corners of the device, and an IMU in the middle; (b) a watch display attached to create the SoundCraft prototype.

Hardware
We use four digital MEMS microphones (Knowles SPM0406¹), each placed at one corner of a platform (1.5 × 1.5 inches, Figure 3). The microphones operate synchronously with a central microcontroller (RASP-ZX²), which is powered by and communicates with a Linux machine via a USB connection. The system is set to record data at a 16 kHz sampling frequency and 24-bit ADC dynamic range. We also mounted a 6-axis Inertial Measurement Unit (IMU) (BMI160³, Figure 3a) that records data at a 2 kHz sampling frequency and 16-bit ADC dynamic range, synchronously with the microphones.

Software
The Linux machine receives and processes the acoustic signals. It transfers sound source information (i.e., location and classification) over sockets to a Samsung S4 Mini phone. An Android Wear watch is used for the demo applications and communicates with the phone via Bluetooth.
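As a rough illustration of this client-server split, the sketch below sends one localized, classified event from the Linux machine to the phone over a plain TCP socket. The JSON message format, host address, and port are assumptions made for the sketch, not the actual SoundCraft protocol.

```python
# Minimal sketch of a Linux-side result sender (message format,
# host, and port are illustrative assumptions).
import json
import socket

def send_result(label, azimuth_deg, elevation_deg,
                host="192.168.0.10", port=5005):
    """Send one classified, localized acoustic event to the phone."""
    msg = json.dumps({"label": label,
                      "azimuth": azimuth_deg,
                      "elevation": elevation_deg}).encode("utf-8")
    with socket.create_connection((host, port)) as sock:
        sock.sendall(msg + b"\n")   # newline-delimited for easy parsing

# Example: a finger snap detected 45 deg to the right, 30 deg up.
# send_result("finger_snap", 45.0, 30.0)
```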
Figure 4. Block diagram of SoundCraft's audio processing.

Acoustic processing
Our aim was to implement acoustic processing that operates with microphones at very close proximity and that adapts to noise changes when the user moves between environments. The key challenges in the acoustic processing for recognizing acoustic signatures therefore involve: i) designing a suitable algorithm that separates source locations when the microphones are at proximal locations; ii) detecting low-power acoustics in noisy environments; and iii) enhancing the acoustics by estimating the current environmental noise. For (i) and (ii), we adopted a robust target signal subspace method, MUltiple SIgnal Classification (MUSIC) [31], with extensions to the generalized eigenvalue decomposition introducing correlation matrices of estimated noise [22]. MUSIC originally uses standard eigenvalue decomposition so that high-power sounds can be detected. In our work, we extend it to use generalized eigenvalue decomposition with dynamic backward and forward noise correlation matrices to improve detection of acoustics in a noisy environment. We detect an acoustic event onset by defining the frames just before the observed frames as 'noise'. For (iii), we introduce wrist twists as a "trigger gesture", detected by the IMU, to explicitly notify the smartwatch of environment changes. After the trigger, the background noise information is updated and used for a Minimum Variance Distortionless Response (MVDR) beamformer [40], which results in enhanced acoustics. The block diagram in Figure 4 summarizes the acoustic processing components used in this paper. The system contains two pipelines: detection and localization of acoustic events, and enhancement of the detected acoustics for classification. We describe the system stepwise as follows.

¹ www.Knowles.com/eng/Products/Microphones/SiSonic%E2%84%A2-surface-mount-MEMS
² www.sifi.co.jp/system/modules/pico/index.php?content_id=36
³ www.bosch-sensortec.com/bst/products/all_products/bmi160

Sound Source Model
To describe the system mathematically, we define X_m(ω, f) and S(ω, f) as the acoustic signal input of the m-th channel (1 ≤ m ≤ M) of an M-microphone array and a sound source signal after a Short Time Fourier Transform (STFT), respectively, where ω is the frequency bin index and f is the time frame index. Assuming the frame is sufficiently long and S(ω, f) is at location ψ, the observation model is described as:

X_m(ω, f) = A_m(ω, ψ) S(ω, f),   (1)

where A_m(ω, ψ) is a Transfer Function (TF) at frequency ω from the m-th microphone to the sound source located at ψ. The TF models the characteristics of the surroundings relative to the intrinsic properties of the microphones with regard to sound propagation. TFs are usually used as prior information and obtained in advance, either via actual measurement or geometrical calculation. Since we use a planar microphone array, we consider only the azimuth (from -180° to 180°) and positive elevation (from 0° to 90°) in relation to the device. Moreover, the sound source distance is not considered, since the small-size array has a short Fraunhofer distance d_F(ω), computed as:

d_F(ω) = 2 D_M² / λ(ω),   (2)

where λ(ω) and D_M are the wavelength at frequency bin ω and the size of the array, respectively. A distance greater than d_F(ω) is considered "far field", making distance estimation difficult. In our case, the microphone array on the smartwatch, a 0.038 m square, has D_M ≈ 0.054 m, resulting in d_F(ω) ≈ 0.135 m at 8 kHz and 0.051 m at 3 kHz, which is too short for capturing spatial position around the watch. Additionally, we assume that there is only one acoustic signature at a time in one location. Let S_k(ω, f) denote the k-th spatial acoustic (1 ≤ k ≤ K), where K is the total number of acoustics at a time. The purpose of detection, localization, and enhancement of sound sources is to detect all S_k(ω, f) under a noisy environment.
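The far-field figures above can be reproduced with a few lines; the sketch below evaluates Eq. (2) assuming a speed of sound of 343 m/s (an assumption, not stated in the paper).

```python
# Sketch of the far-field check in Eq. (2): d_F = 2 * D_M**2 / wavelength.
SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def fraunhofer_distance(array_size_m, freq_hz):
    wavelength = SPEED_OF_SOUND / freq_hz
    return 2.0 * array_size_m ** 2 / wavelength

D_M = 0.038 * 2 ** 0.5                   # diagonal of the 0.038 m square (~0.054 m)
print(fraunhofer_distance(D_M, 8000))    # ~0.135 m
print(fraunhofer_distance(D_M, 3000))    # ~0.051 m
```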

Computation of Backward and Forward Correlation Matrices
As shown in Figure 4, the system first defines forward and backward correlation matrices of the currently observed and previous frames, respectively, which are used for sound source localization and enhancement. The correlation matrices are averaged over F frames to make them robust against instantaneous noise; F is empirically determined as F = 50.

Let X(ω, f) = [X_1(ω, f), ⋯, X_M(ω, f)] denote X_m(ω, f) of all channels. The forward correlation matrix of the observed signals is computed as:

R_F(ω, f) = (1/F) Σ_{i=0}^{F-1} X(ω, f + i) X*(ω, f + i),   (3)

and the backward correlation matrix is computed as:

R_B(ω, f) = (1/F) Σ_{i=1}^{F} X(ω, f - i) X*(ω, f - i),   (4)

where (·)* is the complex conjugate transpose operator.
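A minimal NumPy sketch of Eqs. (3)-(4) for a single frequency bin is shown below; X is assumed to be an M × n_frames array of STFT coefficients for that bin.

```python
# Sketch of Eqs. (3)-(4): forward/backward spatial correlation matrices
# at frame f, averaged over F frames (F = 50 follows the paper).
import numpy as np

def forward_backward_correlation(X, f, F=50):
    """X: (M, n_frames) complex STFT for one bin; requires F <= f."""
    M = X.shape[0]
    R_F = np.zeros((M, M), dtype=complex)
    R_B = np.zeros((M, M), dtype=complex)
    for i in range(F):
        x_fwd = X[:, f + i]       # current and upcoming frames, Eq. (3)
        x_bwd = X[:, f - 1 - i]   # frames just before f ("noise"), Eq. (4)
        R_F += np.outer(x_fwd, x_fwd.conj())
        R_B += np.outer(x_bwd, x_bwd.conj())
    return R_F / F, R_B / F
```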
Sound Source Detection and Localization
We adopt MUSIC [31] to detect and localize acoustic signatures relative to the microphone array. This block in Figure 4 labels acoustics with direction information.

In this paper, we focus on onset detection of acoustic signatures. At each onset, the backward correlation matrix contains only noise (i.e., background without acoustic signatures), while the forward correlation matrix contains both noise and target acoustics. We therefore introduce the following generalized eigenvalue decomposition to whiten the noise information of the backward correlation matrix from the forward correlation matrix:

R_F(ω, f) E(ω, f) = R_B(ω, f) E(ω, f) Λ(ω, f),   (5)

where E(ω, f) and Λ(ω, f) are the eigenvector matrix and the eigenvalue matrix, respectively. Let E(ω, f) = [E_1(ω, f), …, E_M(ω, f)] denote the eigenvectors. The spatial spectrum is given as:

P(f, ψ) = (1 / (ω_H - ω_L + 1)) Σ_{ω=ω_L}^{ω_H} [ |A*(ω, ψ) A(ω, ψ)| / Σ_{m=L_s+1}^{M} |A*(ω, ψ) E_m(ω, f)| ],   (6)

where L_s is the number of sound sources at a time frame, and ω_H and ω_L are the frequency bin indices of the maximum and minimum frequencies considered, set as 500 Hz ≤ ω ≤ 2800 Hz. Let ψ̂ denote the ψ that maximizes P(f, ψ) at the f-th frame. The frames that satisfy P(f, ψ̂) ≥ T_P, where T_P is a threshold, are defined as activity periods of target acoustics, and the sequence of ψ̂ with respect to f is the direction of the detected acoustics. Moreover, we track ψ̂ with respect to time and direction to obtain S_k(ω, f): by simple thresholding, consecutive ψ̂ that are sufficiently close in time and direction are regarded as the same acoustic signal. All ψ̂ regarded as the same signal are labelled as S_k(ω, f).
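A sketch of Eqs. (5)-(6) using SciPy's generalized eigendecomposition is shown below. The steering (transfer function) vectors A and the candidate direction grid are assumed to be given as prior information, as described under the sound source model, and R_B is assumed positive definite; this is a simplified reading of the algorithm, not the exact implementation.

```python
# Sketch of Eqs. (5)-(6): GEVD of (R_F, R_B) and the MUSIC-style
# spatial spectrum, averaged over the bins in the 500-2800 Hz band.
import numpy as np
from scipy.linalg import eigh

def spatial_spectrum(R_F, R_B, A, L_s=1):
    """R_F, R_B: dicts {bin: (M, M) complex} from Eqs. (3)-(4);
    A: dict {bin: (n_directions, M) complex} steering vectors."""
    omegas = sorted(R_F)
    n_dir = next(iter(A.values())).shape[0]
    P = np.zeros(n_dir)
    for w in omegas:
        # Eq. (5): eigenvalues are returned ascending, so the last L_s
        # columns span the signal subspace and the rest the noise subspace.
        _, E = eigh(R_F[w], R_B[w])
        E_noise = E[:, : E.shape[1] - L_s]
        a = A[w]                                           # (n_dir, M)
        num = np.abs(np.einsum("dm,dm->d", a.conj(), a))   # |A* A|
        den = np.abs(a.conj() @ E_noise).sum(axis=1)       # noise projection
        P += num / den
    P /= len(omegas)
    psi_hat = int(np.argmax(P))   # detected if P[psi_hat] >= threshold T_P
    return P, psi_hat
```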

Detection of Trigger Gestures
Meanwhile, we use a beamformer to spatially de-noise and enhance the detected acoustics for classification, which improves robustness in noisy environments. Using a trigger mechanism to support this is practical and not unusual for commercially available systems such as "Alexa" on the Amazon Echo⁴ and "Siri" on Apple devices⁵. To update the noise information when the environment changes, we use a trigger gesture. We experimented with nine hand gestures, with signals shown in Figure 5, and selected wrist twists, as they showed a sufficiently larger angular velocity of the y-axis, θ̇_y(f), than the others. We simply detect wrist twists with the following threshold:

max_{0 ≤ i ≤ F} |θ̇_y(f + i)| ≥ T_θ,   (7)

where T_θ is a threshold. The last frame where Eq. (7) is satisfied is defined as the offset of the trigger gesture, and that frame is denoted f̂. The noise information is updated after the trigger: the forward correlation matrix at the f̂-th frame (just after the trigger) is memorized as the current noise information, denoted R_F(ω, f̂), and used for the enhancement of the acoustic signature.
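A minimal sketch of the threshold test in Eq. (7) on buffered gyroscope samples follows; the threshold value and window length used below are illustrative assumptions.

```python
# Sketch of Eq. (7): flag a wrist-twist trigger when the peak absolute
# y-axis angular velocity within a window of F frames exceeds T_theta.
import numpy as np

def detect_wrist_twist(gyro_y, f, F=50, T_theta=1.0):
    """gyro_y: y-axis angular velocity samples (kdeg/sec); f: current frame."""
    window = np.abs(np.asarray(gyro_y[f : f + F + 1]))   # i = 0..F
    return window.max() >= T_theta

# The last frame satisfying the test is taken as the trigger offset f_hat,
# after which the current forward correlation matrix is stored as noise.
```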

⁴ www.amazon.com/dp/product/B00X4WHP5E/ref=EchoCP_dt_tile_text
⁵ www.apple.com/ios/siri/

Figure 5. Angular velocity of the y-axis for the nine hand gestures, in kdeg/sec.

Enhancement of Localized Sound Sources
We introduce MVDR using the forward correlation matrix R_F(ω, f̂) selected by the trigger gesture and the location of the sound source ψ̂. The sound source localization above provides the direction ψ̂ and the label index k. When a sound source is detected, the corresponding TF is selected from the estimated ψ̂ and k, denoted A_k(ω, ψ̂). The MVDR-based enhanced signal is computed as follows:

Ŝ_k(ω, f) = [ A_k*(ω, ψ̂) R_F⁻¹(ω, f̂) / ( A_k*(ω, ψ̂) R_F⁻¹(ω, f̂) A_k(ω, ψ̂) ) ] X(ω, f),   (8)

where Ŝ_k(ω, f) is the estimate of S_k(ω, f). Since we use R_F(ω, f̂), which contains the current environmental noise, Eq. (8) enhances the k-th acoustic by beamforming in the A_k(ω, ψ̂) direction and subtracting the environmental noise.
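For a single frequency bin, Eq. (8) reduces to the standard MVDR weight computation, sketched below; a stands for the transfer-function vector A_k(ω, ψ̂) and R_noise for the stored forward correlation matrix R_F(ω, f̂).

```python
# Sketch of the MVDR enhancement in Eq. (8) for one frequency bin:
# w = R_noise^-1 a / (a^H R_noise^-1 a), then S_hat = w^H X.
import numpy as np

def mvdr_enhance(X, a, R_noise):
    """X: (M,) observed bin; a: (M,) TF toward psi_hat; R_noise: (M, M)."""
    R_inv = np.linalg.pinv(R_noise)           # pinv for numerical safety
    w = R_inv @ a / (a.conj() @ R_inv @ a)    # MVDR weights
    return w.conj() @ X                       # enhanced S_k(omega, f)
```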

Sound-Source Classification
We train a Gaussian Mixture Model (GMM) with 64 components for each of the enhanced acoustics. The GMM captures the acoustic information of the individual gesture, namely S_k(ω, f). Training parameters are summarized in Table 1. The trained GMMs are used during online classification together with the separated acoustic signal. The likelihood score for each model representing a particular sound gesture is computed using the separated acoustic data, and the acoustic model with the highest score is selected as the most likely acoustic signature.

Sampling frequency: 16 kHz
Frame length: 25 ms
Frame period: 10 ms
Pre-emphasis: 1 - 0.97 z⁻¹
Feature vectors: 12-order MFCCs, 12-order ΔMFCCs, and 1-order ΔE
GMM: 64 Gaussian components

Table 1. System parameters used in SoundCraft.
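A simplified sketch of this per-gesture GMM classifier using librosa and scikit-learn is shown below. The feature extraction only approximates Table 1 (12 MFCCs plus deltas; the ΔE term and pre-emphasis are omitted), so it is not the exact SoundCraft front end, and it assumes enough training frames per gesture for 64 components.

```python
# Sketch: one 64-component GMM per acoustic signature, trained on
# MFCC-based frames of the enhanced signal, compared by log-likelihood.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def features(signal, sr=16000):
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=12,
                                n_fft=400, hop_length=160)  # 25 ms / 10 ms
    delta = librosa.feature.delta(mfcc)
    return np.vstack([mfcc, delta]).T          # (n_frames, 24)

def train_models(training_data):
    """training_data: dict {label: list of 1-D enhanced signals}."""
    models = {}
    for label, signals in training_data.items():
        frames = np.vstack([features(s) for s in signals])
        models[label] = GaussianMixture(n_components=64,
                                        covariance_type="diag").fit(frames)
    return models

def classify(models, signal):
    frames = features(signal)
    scores = {label: gmm.score(frames) for label, gmm in models.items()}
    return max(scores, key=scores.get)         # highest average log-likelihood
```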

BEYOND CLASSIFICATION AND LOCALIZATION
The SoundCraft prototype has additional capabilities that come naturally with its input modality. We briefly discuss these here and further show them in our applications.

Allowing continuous gestural input
SoundCraft tracks various properties of acoustic sources, such as sound power in dB and sound event duration. Using these properties, SoundCraft can distinguish an acoustic signature as a discrete or continuous event. For instance, finger rubbing produces a continuous sound that can be detected as a continuous acoustic event, which can further be mapped to continuous parameter manipulation.

Recognizing multiple gestural acoustics
The algorithm allows our prototype to separate overlapping sound sources (i.e., acoustics occurring at the same time but from different directions). For instance, target acoustics could overlap with other gestural or ambient sounds. We use this feature for multi-sound-source localization.

Extended capabilities with external sounds
We mainly focus on SoundCraft's capabilities with hand-made gestures. However, SoundCraft can also operate using external sound sources. For instance, it can track sounds from a speaker that is playing music, or even sounds generated by auxiliary tools, such as scissors. We demonstrate the use of such sound events for interaction.

LOCALIZATION AND CLASSIFICATION ACCURACY
We conducted three studies to evaluate how accurately SoundCraft can localize and classify acoustic signatures, both in non-noisy and noisy environments.

Study 1: Localizing acoustic signatures
In this study, we examined how accurately SoundCraft can localize acoustic signatures in a non-noisy environment. We used a 2D discrete target selection task where wedge-shaped items were arranged in a 180° circular layout (Figure 6b and c). An external screen was laid horizontally and vertically to mimic the azimuth and elevation conditions, respectively. These settings alleviate confounds due to the visuo-motor mapping. We selected one in-air (finger snap) and one on-surface (surface tap) acoustic signature, based on the gestures most frequently mentioned by the participants in the design workshop. We used 4, 8, 12 and 16 discrete wedge shapes laid out in a half-circle, giving 45°, 22.5°, 15° and 11.25° angular widths for the items. In a natural smartwatch viewing position (i.e., close to the body), users have a limited in-air angular space (~180°) in the horizontal direction. To be consistent between the azimuth and elevation conditions, we only considered a 180° angular space for both axes.

Figure 6. Experiment interface for 12 segments.

We used a 2×2×4 within-subject design for the factors Axis (Azimuth and Elevation), Acoustic Signature (Finger Snap and Surface Tap) and Angular Width (45°, 22.5°, 15°, 11.25°). The presentation order of Axis and Acoustic Signature was counterbalanced among the participants using a Latin square design. The order of Angular Width was randomized.

We recruited 12 right-handed participants (2 females, mean age 26.4 years). They wore the prototype on their non-dominant hand and moved it close to the bottom center of the screen (Figures 6b and 6c). We then showed a target on the screen as a blue wedge and asked the participants to move their dominant hand close to the target and perform a gesture. Once the acoustic signal was detected on the target, the blue wedge changed to green and the screen showed the next target location. An incorrect selection highlighted the actually selected wedge in red. Participants performed 30 repetitions (the target segment was randomly picked) for each condition, resulting in a total of 480 trials per participant. Including practice trials, the study session lasted approximately 50 minutes.

The data was analyzed using a repeated-measures ANOVA and Bonferroni-corrected paired t-tests for pair-wise comparisons. We found significant effects of Axis (F1,11 = 39.786, p < .01) and Angular Width (F3,33 = 82.897, p < .01), and an interaction between the two (F3,33 = 9.601, p < .01). We found no significant effect of Acoustic Signature (p > .05).

Figure 7 shows the average localization accuracy across the twelve participants. Overall, SoundCraft achieved an average accuracy of 99.37% (std. = 0.45%) with 45° segments on both the azimuth and elevation dimensions. As expected, the 11.25° segments returned the lowest average accuracy of 83.77% (std. = 5.02%). Results also indicated higher accuracy across all conditions (≥90%) when the segment is larger than 15°.

We observed that SoundCraft localized the sound events significantly more accurately in azimuth (avg. = 94.28%, std. = 4.33%) than in elevation (avg. = 88.73%, std. = 7.94%). Tap (avg. = 91.33%, std. = 6.24%) is more accurate than Snap (avg. = 86.13%, std. = 8.59%) in the elevation dimension, but both perform equally well in azimuth (avg. = 93.77%, std. = 4.90% and avg. = 94.79%, std. = 3.60%, respectively).

Figure 7. Average localization accuracy across 12 participants on Axis and Angular Width. (Error bars: 95% CI)

Results from this study reveal several key SoundCraft features. First, our findings suggest that SoundCraft detects acoustic signatures in azimuth more accurately than in elevation. We anticipate that this is due either to the chosen alignment of the microphone array, or to users' better perception of degree segments on the horizontal plane than the vertical; further experimentation is needed to confirm this. Second, selecting a target with a Snap is, on average, less accurate than with a Tap. This is expected, as the sound source of a Snap gesture is harder to locate precisely: it is generated at the thenar muscles at the base of the thumb, and not at the location where the thumb and middle finger meet. Third, and most importantly, our results indicate that localization accuracy degrades under a 22.5° angular width. Therefore, angular targets cannot be smaller than this size.

Study 2: Classifying acoustic signatures
We tested SoundCraft's classification accuracy with the 11 acoustic signatures listed in Figure 2 in a non-noisy environment. A paper box was used as the surface.

We recruited 12 right-handed participants (2 female, mean age 25.2 years) for this study. We asked them to perform the gestures next to the smartwatch while wearing the device. The study was conducted in a quiet laboratory with an average sound level of 50 dB, where the ambient sound was produced by mechanical devices such as air conditioning, computers, or people talking outside the laboratory. The room was kept quiet (e.g., no door opening/closing) during the study. This environment allowed us to include more gestures and evaluate the classification in the best case.

A smartwatch program guided the participants to perform the gestures, and informed the SoundCraft server application to start recording when the sound event was detected. We observed that non-impulsive sounds such as finger rub and hand rub last 0.5-1.0 seconds, while impulsive sounds such as hand clap and finger snap last only 0.1-0.2 seconds. Therefore, the number of feature frames obtained from an acoustic signature varies according to its type. To obtain approximately the same number of feature frames for training, participants performed the impulsive sounds at least 3 times as often as the non-impulsive ones in a trial. The order of gestures was randomized across participants. The program recorded 20 sound files for each gesture.

After completing the training phase, the participants rested for 30 minutes and then recorded data for testing, in a random gesture order. In each trial, we asked them to perform a gesture once and then wait for the next trial. We repeated this 5 times to collect 5 testing data points for each acoustic signature. The study lasted roughly 75 minutes.

Rows are performed gestures; columns (in order) are the classification results: Finger Snap, Finger Rub, Finger Flick, Finger Pinch, Hand Clap, Hand Rub, Body Scratch, Body Tap, Surface Clap, Surface Scratch, Surface Tap.
Finger Snap:     56  1  0  0  2  0  0  1  0  0  0
Finger Rub:       0 56  4  0  0  0  0  0  0  0  0
Finger Flick:     0  2 57  0  1  0  0  0  0  0  0
Finger Pinch:     0 14 14 29  0  0  0  3  0  0  0
Hand Clap:        2  0  0  0 53  0  0  5  0  0  0
Hand Rub:         0  0  0  0  0 60  0  0  0  0  0
Body Scratch:     0  0  1  0  0  0 59  0  0  0  0
Body Tap:         2  0  3  1  2  0  1 50  1  0  0
Surface Clap:     0  0  0  0  0  0  0  0 60  0  0
Surface Scratch:  0  0  0  0  0  0  0  0  0 60  0
Surface Tap:      0  0  0  0  2  0  0  0  3  0 55

Figure 8. Confusion matrix of classification results across 11 acoustic signatures and 12 participants, 60 trials per gesture.
We trained a GMM and used it to classify the testing gestures. Our results revealed an average accuracy of 90.15% across all 12 participants and 11 gestures. A one-way ANOVA yielded a significant effect of gesture (F10,121 = 10.965, p < .01). A Tukey post-hoc test revealed that Finger Pinch, which produced small and less noticeable sounds, provided significantly lower accuracy (avg. = 48.33%) compared to the other gestures (all p < .01). There was no significant difference for the other pairs (all p > .05). Figure 8 shows the confusion matrix for the gestures we tested. We found that the non-impulsive gestures such as finger rub, hand rub, body scratch and surface scratch have a higher average accuracy (avg. = 97.91%) than the impulsive ones (avg. = 85.71%). This raises challenges for collecting training data with an equal number of feature frames for all acoustic signatures.

We also collected users' preferences for in-air, on-body and on-surface gestures. 6 out of 12 participants rated finger snap as the best in-air gesture, while 4 preferred finger rub; none of them listed finger pinch as their preferred gesture. For on-body gestures, 7 favored hand clap and 3 preferred hand rub. Furthermore, 6 participants preferred surface tap as the on-surface gesture, and 3 preferred surface scratch.

Study 3: In-situ evaluation with background noise
In this study, we evaluated SoundCraft in noisy environments with a variety of background noise (i.e., near vending machines or an operating 3D printer). More specifically, we examined the impact of background noise on the prototype's localization and classification (Figure 9b). We applied the noise subtraction technique discussed before, where a trigger gesture, i.e., a wrist twist, was used to update the background noise information. In addition, we collected data and trained models in-situ, and let SoundCraft classify and localize the acoustic signatures in real-time.

We recruited 6 participants for this study (all male, mean age 25.5 years). To simulate environments with constant background noise, three participants were taken close to a set of vending machines and the other three to a room with a running 3D printer. In both locations, the average sound level was ~70 dB. With the sound subtraction algorithm, no sound event was detected from the background noise alone (Figure 9a). For this study, we selected the two top-ranked gestures from the in-air (finger snap, finger rub), on-body (hand clap, hand rub) and on-surface (surface tap, surface scratch) categories. We used the same procedure as in Study 2 to train the classification model per user.

Figure 9. (a) SoundCraft handles noises such as the sound from a 3D printer. (b) User evaluation in front of a vending machine.

We next started our testing session. As we were also interested in evaluating localization accuracy, we used a target selection task similar to Study 1. We only included the azimuth condition, as the prototype detected acoustic signatures more accurately in this dimension. We chose a 22.5° angular width for targets, as localization accuracy degrades below this size. The targets were shown at random positions in a 180° circular layout. An external screen was positioned horizontally on a table to display the layout, trial number and the target acoustic signature.

The participants were asked to place the hand wearing the device at the bottom center of the screen. The other hand was used to perform the acoustic signature on top of the target displayed on the monitor. For hand clapping and hand rubbing, the targets were always displayed at a fixed location (i.e., the bottom-right corner of the layout) due to the constrained movement of the hand wearing the device. Additionally, paper tape was attached along eight angular directions on the screen to ensure that scratches produced enough audible sound. Once an acoustic event was detected, the sound was processed to display the localization and classification results on the screen. Each acoustic signature was repeated 5 times at random locations. The session lasted approximately 30 minutes.

Figure 10 shows the confusion matrix across the acoustic signatures and the classification accuracy for our tested gestures. Results revealed that out of 180 (6 participants × 6 gestures × 5 repetitions) testing trials, SoundCraft successfully classified 158 samples into the correct category, yielding an average accuracy of 87.78%. We found trends similar to those in Study 2, where gestures that are continuous in nature (such as finger rub) provided higher accuracy than the discrete ones (e.g., surface tap).

Rows are performed gestures; the first six columns are the classification results (Finger Snap, Finger Rub, Hand Clap, Hand Rub, Surface Scratch, Surface Tap) and the last column is the localization accuracy.
Finger Snap:      27  2  1  0  0  0 | 86.67%
Finger Rub:        0 27  0  3  0  0 | 93.33%
Hand Clap:         3  0 27  0  0  0 | 80.00%
Hand Rub:          0  0  0 30  0  0 | 90.00%
Surface Scratch:   1  2  0  1 26  0 | 90.00%
Surface Tap:       0  0  7  2  0 21 | 93.33%

Figure 10. (a) Confusion matrix across 6 gestural acoustics, 5 testing samples each, yielding 30 samples per participant; (b) localization accuracy for these gestural acoustics.

Results showed that SoundCraft achieved an average of 88.89% localization accuracy. The prototype estimated the location of Tap and Rub sound events more accurately (93.33%) than the others in noisy environments. We also found that the prototype detects Clap and Snap sounds with lower accuracy; as mentioned before, it is hard to predict the exact sound source location generated with these gestures, thus making them less accurate to localize.

Overall, our results indicate that SoundCraft can localize and classify a set of acoustic signatures with high accuracy, in non-noisy and noisy environments. Impulsive acoustic signatures require more instances to obtain enough feature frames during training. Additionally, the sounds generated by gestures that have different initial and final finger contact positions (e.g., snap) are less accurate to localize than others.
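The reported overall accuracy can be checked directly from the confusion matrix in Figure 10a:

```python
# Arithmetic check on the Study 3 confusion matrix: per-gesture and
# overall classification accuracy, reproducing the 87.78% reported above.
import numpy as np

confusion = np.array([
    [27,  2,  1,  0,  0,  0],   # finger snap
    [ 0, 27,  0,  3,  0,  0],   # finger rub
    [ 3,  0, 27,  0,  0,  0],   # hand clap
    [ 0,  0,  0, 30,  0,  0],   # hand rub
    [ 1,  2,  0,  1, 26,  0],   # surface scratch
    [ 0,  0,  7,  2,  0, 21],   # surface tap
])
per_gesture = confusion.diagonal() / confusion.sum(axis=1)
overall = confusion.diagonal().sum() / confusion.sum()
print(per_gesture)          # e.g. surface tap: 21/30 = 0.70
print(round(overall, 4))    # 0.8778 -> 87.78%
```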


SOUNDCRAFT INTERACTION APPLICATIONS
We implemented several scenarios to demonstrate SoundCraft's capabilities. While numerous scenarios are possible, we emphasize those demonstrating novel smartwatch interaction techniques (i.e., unique angularly localized signatures). We note that to make these interactions fully operable, specific delimiters to indicate that an acoustic signature is being issued (instead of environmental sounds), as well as sound source training, are needed.

Rapid access to smartwatch content
Smartwatch users often require information access on-the-go, even while the other hand is busy, such as holding a coffee cup. SoundCraft facilitates input with either or both hands to quickly send device commands.

Physical Surface for One-Handed Interaction
SoundCraft is capable of using a physical surface, such as a table, as available input space for one-handed interaction. This makes it possible to use SoundCraft in single-handed mode. For instance, it can track directional finger scratching sounds on a table, either from top-to-bottom or in reverse (Figure 1c). In a demo application, we map these actions to navigation and scroll activities on the smartwatch: a user browses an image gallery with the hand wearing the device, and a finger tap on the table opens the content.

Delimiter for One-handed Interactions
Researchers have demonstrated the use of motion sensors (e.g., IMUs) to control smartwatch input [38]. However, to operate these, a delimiting action is necessary. Acoustic signatures can serve as input delimiters. We implemented a scenario wherein the user snaps the fingers of the wearing hand to enable and disable IMU functions to browse a contact list (Figures 11a-11b). Another acoustic signature (in this case a finger clap) places a call to the person selected in the list.

Figure 11. (a) A user snaps to activate the IMU function and (b) tilts the hand to scroll items. (c) The user rubs fingers to select an item from a 2D pie menu.

2D Pie Menu
Operating a menu system with many items across multiple layers can be challenging on smartwatches. Instead of direct touch or swipe gestures on the touchscreen, SoundCraft allows users to shift the input space around the device, thus expanding the item selection zone, and several acoustic signatures can be used for commands. We use a 2D pie menu application to demonstrate this usage (Figure 11c). Users navigate the menus by rubbing their fingers in the corresponding directions; performing a snap opens the menu items and a flick moves back to the previous layer.
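As an illustration of how a localized signature could drive such a menu, the sketch below maps a (label, azimuth) pair to a pie-menu action; the sector count, gesture labels, and mapping are assumptions made for the sketch rather than the application's exact logic.

```python
# Illustrative mapping from a localized signature to a pie-menu action,
# in the spirit of Figure 11c (labels and sector count are assumptions).
def pie_menu_action(label, azimuth_deg, n_sectors=8):
    if label == "finger_snap":
        return ("open_menu", None)
    if label == "finger_flick":
        return ("previous_layer", None)
    if label == "finger_rub":
        # Azimuth ranges from -180 to 180 degrees; bin it into sectors.
        sector = int(((azimuth_deg + 180.0) % 360.0) / (360.0 / n_sectors))
        return ("highlight_item", sector)
    return ("ignore", None)

# Example: a finger rub localized at 100 deg azimuth highlights sector 6.
print(pie_menu_action("finger_rub", 100.0))
```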
In-situ content creation on smartwatches
Smartwatches are primarily used to consume content, but they have the potential to expand their functionality to support content creation anytime-anywhere, such as drawing. With SoundCraft, a user can leverage the large space surrounding the smartwatch for content creation. We developed a demo application that shows the use of acoustic signatures (i.e., finger rub) to sketch (Figure 12a) and navigate content in space.

Figure 12. (a) A user sketches on the smartwatch with finger rubs. (b) The user unlocks the screen with a chain of acoustic signatures. (c) The user "cuts" the rope with a scissor in a game.

Smartwatch authentication
The hand wearing the device is often left idle while the non-wearing hand is operating, leaving out the opportunity for a/symmetric bimanual input. SoundCraft allows classifying and localizing sounds generated using both hands, opening possibilities for bimanual smartwatch interactions. We demonstrate a smartwatch authentication application using bimanual input (Figure 12b). In this application, a user performs acoustic signatures in asymmetric patterns to log into his laptop: the user taps with both hands, on a book and a table, generating sounds on the right and the left side of the watch, respectively. A matched sequence of sound types and locations unlocks the screen.

Multi-user ad-hoc gaming on a smartwatch

Multi-user game
SoundCraft is capable of detecting sound sources produced by people in different directions. This capability enables SoundCraft to support tasks in co-located ad-hoc collaboration. We implemented a two-player pong game with SoundCraft to extend the smartwatch into a multiplayer device (Figure 1d). The directions (e.g., inward and outward) of sounds relative to the watch are used to differentiate players, and the location of the finger snap is mapped to controlling the paddle movement.

Figure 13. A user explores details of a plane model on the phone by scanning it around the watch, which shows the overview. SoundCraft tracks white noise from the smartphone.

Physical tools for gaming
We demonstrate the use of a physical tool to play computer games with SoundCraft. We used a pair of scissors to perform the "cut" operation in a popular game, 'cut-the-rope'. A user places the hand with SoundCraft at the bottom-center of a game console and holds regular scissors to play the game.


Snipping sounds produced by the scissors (Figure 12c) are classified and localized to cut the ropes at different places.

Spatially extended display for smartwatches
SoundCraft provides a low-cost approach to enable spatially-aware phone-watch interactions. Here, the sound (e.g., white noise) produced by a smartphone's speaker is used to track the smartphone's spatial position around the watch. We developed an application where the smartwatch shows an overview of a 3D plane model, and a user can explore the detailed model on the smartphone by moving it around the watch (Figure 13).

DISCUSSION AND FUTURE WORK
We consider SoundCraft to be an early exploration that opens up the use of acoustic signatures for spatial interactions on smartwatches. The angularly localized signatures enrich its interaction design space, as demonstrated in the previous sections. During the process of designing, implementing and evaluating SoundCraft, we identified several key concerns, limitations and challenges.

Limitations and considerations for improvement
Ambient Noise: As with any acoustic input, SoundCraft is affected by sudden environmental noise. Acoustic signatures that are not loud (e.g., finger rub) require input close to the device, as well as a quiet environment. Although the microphone array system and noise subtraction strategy improved the system's robustness to various constant background noises, it remains challenging to handle accidental noises that could result in false positives. Such noise is common in daily life, such as when people talk, walk, or open and close doors, and is largely unpredictable. SoundCraft could potentially mitigate such challenges with localization information: users could define an operation zone based on direction information around the watch, and sounds outside the zone would be discarded, possibly limiting its spatial input capabilities. Moreover, the current system is not robust to the effects caused by reverberation [6] or echo when the smartwatch is operated inside an enclosed environment. While proximity to the sensing mechanism can mitigate such concerns, further work is necessary to integrate the latest noise elimination techniques.

We evaluated SoundCraft in two noisy environments to show that background noise elimination is possible. It is worth noting that we did not exhaust the forms of background noise, but we illustrate a feasible path with our detection methods. Adding other noisy conditions would further ascertain our claims.

Furthermore, the current localization process does not allow us to estimate the actual distance of the sound source. We observed a ~1 second delay in classification due to the heavy load of our machine learning algorithm; without the classification step, the localization process does not consume as much computation power and runs in real-time. In future work, we will look at the prospects of addressing additive noise and the effects of reverberation via de-noising and de-reverberation techniques [6]. We will also investigate issues concerning computing power consumption while running the system on an embedded platform.

Training sounds: We implemented SoundCraft with user-dependent data. This was necessary as the sound profiles generated differ significantly across users. In particular, we found certain gestures more sensitive to user-specific approaches than others; for example, several users commented on not being able to easily complete a finger snap. Such gestures can also be affected by finger moisture, causing a different sound profile from that trained with. We provide training prior to using the system; however, we could integrate an on-line training approach that continuously monitors input. Given the large vocabulary of acoustic signatures, we could imagine users preferring one subset over another. A longitudinal study could reveal the variance in usage patterns and preferences.

Practical use
Acoustic signatures are applicable to a broad range of smartwatch interactions, including in-air, on-body and on-surface taps. Besides formal evaluations and developing more advanced algorithms for classifying acoustic signatures, using the signatures in a real-world application requires mapping suitable gestures to commands. Our initial exploration reveals some practical use cases: for instance, assigning finger snap to discrete commands (e.g., button click, mode switch) and finger rub to continuous commands (e.g., a slider). However, we do not have prior knowledge of what other acoustic signatures, such as hand rubbing and joint cracking, are useful for. The design space is relatively large. Our next step concerns developing a mapping of acoustic signatures to mobile interactions, in the form of an elicitation study.

Comfort: Our informal evaluation suggests that users are comfortable using certain gestures, but only in private. The reason for this could be the social perception of certain gestures. For example, users were comfortable and indeed enjoyed snapping in private; their reaction was completely the opposite in public settings. However, if the snapping was assigned to contextually relevant tasks, they did not show discomfort; for example, snapping to turn a smart appliance on or off seemed natural.

Privacy is another limitation that needs consideration. For example, an unwanted user can produce a snap in the proximity of the user's device to invoke input. To mitigate this, we can limit SoundCraft to the user's comfort region based on direction information, treated as the privacy space. Further consideration is needed to define socially acceptable acoustic zones.

SoundCraft in conjunction with other alternatives
SoundCraft can potentially enhance multimodal interactions with the smartwatch. We have shown a few examples, such as when SoundCraft delimits input for IMU interactions.

We can also imagine combining voice input to invoke or revert commands, such as "Pan" or "Stop", while SoundCraft input is used for panning in a given direction with finger snaps or rubs, based on user preference or environmental context. A combination of the various modalities possible with SoundCraft warrants further exploration.

CONCLUSION
SoundCraft utilizes angularly localized acoustic signatures for smartwatch input. A dictionary of acoustic signatures was elicited through a design workshop, revealing opportunities for a broad range of spatial interaction scenarios. We developed suitable algorithms showing that such proximal sensor placement works for producing uniquely and angularly localized signatures (i.e., snapping vs. rubbing). The classification and localization capabilities of the device allowed us to explore a rich set of interaction techniques on the smartwatch in both non-noisy and noisy environments. For instance, SoundCraft can be used in a single-handed mode, as a delimiter for tilt gestures, as well as for 2D pie menus, ad-hoc multi-user games, in-situ content creation and coarse spatial manipulation. Preliminary evaluations confirm that SoundCraft can accurately classify common acoustic signatures and localize sound sources with above 90% accuracy.

ACKNOWLEDGEMENTS
This project was funded by the Honda Research Institute as well as through a Mitacs grant.

REFERENCES
1. Brian Amento, Will Hill, and Loren Terveen. 2002. The sound of one hand: a wrist-mounted bio-acoustic fingertip gesture interface. In CHI '02 Extended Abstracts on Human Factors in Computing Systems (CHI EA '02). ACM, 724-725. http://dx.doi.org/10.1145/506443.506566
2. Daniel L. Ashbrook. 2010. Enabling Mobile Microinteractions. Ph.D. Dissertation. Georgia Institute of Technology, Atlanta, GA, USA. AAI3414437.
3. Wei-Hung Chen. 2015. Blowatch: Blowable and Hands-free Interaction for Smartwatches. In Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA '15). 103-108. http://dx.doi.org/10.1145/2702613.2726961
4. Jackson Feijó Filho, Wilson Prata, and Thiago Valle. 2015. Advances on Breathing Based Text Input for Mobile Devices. In International Conference on Universal Access in Human-Computer Interaction. Springer, 279-287. DOI: 10.1007/978-3-319-20678-3_27
5. Mathieu Le Goc, Stuart Taylor, Shahram Izadi, and Cem Keskin. 2014. A low-cost transparent electric field sensor for 3D interaction on mobile devices. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '14). 3167-3170. http://dx.doi.org/10.1145/2556288.2557331
6. Randy Gomez, Tatsuya Kawahara, and Kazuhiro Nakadai. 2015. Optimized wavelet-domain filtering under noisy and reverberant conditions. APSIPA Transactions on Signal and Information Processing, 4, e3. doi:10.1017/ATSIP.2015.5
7. Susumu Harada, James A. Landay, Jonathan Malkin, Xiao Li, and Jeff A. Bilmes. 2006. The vocal joystick: evaluation of voice-based cursor control techniques. In Proceedings of the 8th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '06). ACM, 197-204. http://dx.doi.org/10.1145/1168987.1169021
8. Chris Harrison and Scott E. Hudson. 2008. Scratch input: creating large, inexpensive, unpowered and mobile finger input surfaces. In Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology (UIST '08). 205-208. http://dx.doi.org/10.1145/1449715.1449747
9. Chris Harrison and Scott E. Hudson. 2009. Abracadabra: wireless, high-precision, and unpowered finger input for very small mobile devices. In Proceedings of the 22nd Annual ACM Symposium on User Interface Software and Technology (UIST '09). 121-124. http://dx.doi.org/10.1145/1622176.1622199
10. Chris Harrison, Desney Tan, and Dan Morris. 2010. Skinput: appropriating the body as an input surface. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '10). 453-462. http://dx.doi.org/10.1145/1753326.1753394
11. Chris Harrison, Robert Xiao, and Scott Hudson. 2012. Acoustic barcodes: passive, durable and inexpensive notched identification tags. In Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology (UIST '12). 563-568. http://dx.doi.org/10.1145/2380116.2380187
12. Khalad Hasan, David Ahlström, and Pourang Irani. 2015. SAMMI: A Spatially-Aware Multi-Mobile Interface for Analytic Map Navigation Tasks. In Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI '15). ACM, 36-45. http://dx.doi.org/10.1145/2785830.2785850
13. Takeo Igarashi and John F. Hughes. 2001. Voice as sound: using non-verbal voice input for interactive control. In Proceedings of the 14th Annual ACM Symposium on User Interface Software and Technology (UIST '01). ACM, 155-156. http://dx.doi.org/10.1145/502348.502372


14. Hiroshi Ishii, Craig Wisneski, Julian Orbanes, Ben Chun, and Joe Paradiso. 1999. PingPongPlus: design of an athletic-tangible interface for computer-supported cooperative play. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '99). 394-401. https://doi.org/10.1145/302979.303115
15. Hamed Ketabdar, Mehran Roshandel, and Kamer Ali Yüksel. 2010. MagiWrite: towards touchless digit entry using 3D space around mobile devices. In Proceedings of the 12th international conference on Human computer interaction with mobile devices and services (MobileHCI '10). ACM, New York, NY, USA, 443-446. https://doi.org/10.1145/1851600.1851701
16. David Kim, Otmar Hilliges, Shahram Izadi, Alex D. Butler, Jiawen Chen, Iason Oikonomidis, and Patrick Olivier. 2012. Digits: freehand 3D interactions anywhere using a wrist-worn gloveless sensor. In Proceedings of the 25th annual ACM symposium on User interface software and technology (UIST '12). 167-176. https://doi.org/10.1145/2380116.2380139
17. Jungsoo Kim, Jiasheng He, Kent Lyons, and Thad Starner. 2007. The Gesture Watch: A Wireless Contact-free Gesture based Wrist Interface. In Proceedings of the 2007 11th IEEE International Symposium on Wearable Computers (ISWC '07). IEEE Computer Society, 1-8. https://doi.org/10.1109/ISWC.2007.4373770
18. Sven Kratz and Michael Rohs. 2009. Hoverflow: exploring around-device interaction with IR distance sensors. In Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI '09). Article 42, 4 pages. https://doi.org/10.1145/1613858.1613912
19. MindMeld Launches Voice Assistant 2.0, Says Voice Search Growing Dramatically. 2015. Retrieved August 25, 2016 from http://searchengineland.com/mindmeld-launches-voice-assistant-2-0-says-voice-search-growing-dramatically-238130
20. Gierad Laput, Robert Xiao, and Chris Harrison. 2016. ViBand: High-Fidelity Bio-Acoustic Sensing Using Commodity Smartwatch Accelerometers. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology (UIST '16). ACM, New York, NY, USA, 321-333. https://doi.org/10.1145/2984511.2984582
21. Pedro Lopes, Ricardo Jota, and Joaquim A. Jorge. 2011. Augmenting touch interaction through acoustic sensing. In Proceedings of the ACM International Conference on Interactive Tabletops and Surfaces (ITS '11). ACM, New York, NY, USA, 53-56. https://doi.org/10.1145/2076354.2076364
22. Keisuke Nakamura, Kazuhiro Nakadai, Futoshi Asano, Yuji Hasegawa, and Hiroshi Tsujino. 2009. Intelligent sound source localization for dynamic environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, MO, 664-669. https://doi.org/10.1109/IROS.2009.5354419
23. OMG! Mobile voice survey reveals teens love to talk. 2014. Retrieved August 25, 2016 from https://googleblog.blogspot.ca/2014/10/omg-mobile-voice-survey-reveals-teens.html
24. Joseph A. Paradiso and Che King Leo. 2005. Tracking and characterizing knocks atop large interactive displays. Sensor Review 25, 2: 134-143. https://doi.org/10.1108/02602280510585727
25. Shwetak N. Patel and Gregory D. Abowd. 2007. Blui: low-cost localized blowable user interfaces. In Proceedings of the 20th annual ACM symposium on User interface software and technology (UIST '07). 217-220. https://doi.org/10.1145/1294211.1294250
26. D. T. Pham, Ze Ji, Ming Yang, Zuobin Wang, and Mostafa Al-Kutubi. 2007. A novel human-computer interface based on passive acoustic localization. In Proceedings of the 12th international conference on Human-computer interaction: interaction platforms and techniques (HCI'07), Julie A. Jacko (Ed.). Springer-Verlag, Berlin, Heidelberg, 901-909.
27. Nissanka B. Priyantha, Anit Chakraborty, and Hari Balakrishnan. 2000. The Cricket location-support system. In Proceedings of the 6th annual international conference on Mobile computing and networking (MobiCom '00). ACM, New York, NY, USA, 32-43. https://doi.org/10.1145/345910.345917
28. Jun Rekimoto. 2001. GestureWrist and GesturePad: unobtrusive wearable interaction devices. In Proceedings of the Fifth International Symposium on Wearable Computers. IEEE, 21-27.
29. Gabriel Reyes, Dingtian Zhang, Sarthak Ghosh, Pratik Shah, Jason Wu, Aman Parnami, Bailey Bercik, Thad Starner, Gregory D. Abowd, and W. Keith Edwards. 2016. Whoosh: non-voice acoustics for low-cost, hands-free, and rapid input on smartwatches. In Proceedings of the 2016 ACM International Symposium on Wearable Computers (ISWC '16). ACM, New York, NY, USA, 120-127. https://doi.org/10.1145/2971763.2971765


30. Daisuke Sakamoto, Takanori Komatsu, and Takeo Igarashi. 2013. Voice augmented manipulation: using paralinguistic information to manipulate mobile devices. In Proceedings of the 15th international conference on Human-computer interaction with mobile devices and services (MobileHCI '13). 69-78. https://doi.org/10.1145/2493190.2493244
31. R. Schmidt. 1986. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation 34, 3: 276-280. https://doi.org/10.1109/TAP.1986.1143830
32. Project Soli. 2016. Retrieved August 25, 2016 from http://atap.google.com/soli
33. Jie Song, Gábor Sörös, Fabrizio Pece, Sean Ryan Fanello, Shahram Izadi, Cem Keskin, and Otmar Hilliges. 2014. In-air gestures around unmodified mobile devices. In Proceedings of the 27th annual ACM symposium on User interface software and technology (UIST '14). 319-329. https://doi.org/10.1145/2642918.2647373
34. Adam J. Sporka, Sri H. Kurniawan, Murni Mahmud, and Pavel Slavík. 2006. Non-speech input and speech recognition for real-time control of computer games. In Proceedings of the 8th international ACM SIGACCESS conference on Computers and accessibility (Assets '06). 213-220. https://doi.org/10.1145/1168987.1169023
35. T. Scott Saponas, Desney S. Tan, Dan Morris, Ravin Balakrishnan, Jim Turner, and James A. Landay. 2009. Enabling always-available input with muscle-computer interfaces. In Proceedings of the 22nd annual ACM symposium on User interface software and technology (UIST '09). 167-176. https://doi.org/10.1145/1622176.1622208
36. Wouter Van Vlaenderen, Jens Brulmans, Jo Vermeulen, and Johannes Schöning. 2015. WatchMe: A Novel Input Method Combining a Smartwatch and Bimanual Interaction. In Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA '15). 2091-2095. https://doi.org/10.1145/2702613.2732789
37. Cheng-Yao Wang, Min-Chieh Hsiu, Po-Tsung Chiu, Chiao-Hui Chang, Liwei Chan, Bing-Yu Chen, and Mike Y. Chen. 2015. PalmGesture: Using Palms as Gesture Interfaces for Eyes-free Input. In Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI '15). 217-226. https://doi.org/10.1145/2785830.2785885
38. Hongyi Wen, Julian Ramos Rojas, and Anind K. Dey. 2016. Serendipity: Finger Gesture Recognition using an Off-the-Shelf Smartwatch. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI '16). 3847-3851. https://doi.org/10.1145/2858036.2858466
39. Anusha Withana, Roshan Peiris, Nipuna Samarasekara, and Suranga Nanayakkara. 2015. zSense: Enabling Shallow Depth Gesture Recognition for Greater Input Expressivity on Smart Wearables. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI '15). 3661-3670. https://doi.org/10.1145/2702123.2702371
40. B. D. Van Veen and K. M. Buckley. 1988. Beamforming: a versatile approach to spatial filtering. IEEE ASSP Magazine 5, 2: 4-24. https://doi.org/10.1109/53.665
41. Robert Xiao, Greg Lew, James Marsanico, Divya Hariharan, Scott Hudson, and Chris Harrison. 2014. Toffee: enabling ad hoc, around-device interaction with acoustic time-of-arrival correlation. In Proceedings of the 16th international conference on Human-computer interaction with mobile devices & services (MobileHCI '14). 67-76. https://doi.org/10.1145/2628363.2628383
42. Cheng Zhang, Qiuyue Xue, Anandghan Waghmare, Sumeet Jain, Yiming Pu, Sinan Hersek, Kent Lyons, Kenneth A. Cunefare, Omer T. Inan, and Gregory D. Abowd. 2017. SoundTrak: Continuous 3D Tracking of a Finger Using Active Acoustics. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 1, 2, Article 30 (June 2017), 25 pages. https://doi.org/10.1145/3090095
43. Yang Zhang and Chris Harrison. 2015. Tomo: Wearable, Low-Cost Electrical Impedance Tomography for Hand Gesture Recognition. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology (UIST '15). 167-173. https://doi.org/10.1145/2807442.2807480
44. Yang Zhang, Junhan Zhou, Gierad Laput, and Chris Harrison. 2016. SkinTrack: Using the Body as an Electrical Waveguide for Continuous Finger Tracking on the Skin. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI '16). 1491-1503. https://doi.org/10.1145/2858036.2858082
