
Fine-Grained Hand Pose Estimation System Based on Channel State Information

Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in

the Graduate School of The Ohio State University

By

Weijie Yao

Graduate Program in Computer Science and Engineering

The Ohio State University

2020

Thesis Committee

Dr. Dong Xuan, Advisor

Dr. Wei-Lun (Harry) Chao

Copyrighted by

Weijie Yao 2020

Abstract

In recent years, WiFi-based human-computer interaction has achieved significant progress in localization, fall detection, and activity recognition applications since the introduction of CSI (Channel State Information). However, WiFi sensing for fine-grained activity recognition, such as hand pose estimation, has not yet been explored. In this study, we present a WiFi sensing system that utilizes only commercial off-the-shelf WiFi devices to capture human hand pose. To our knowledge, this is the first system that considers the application of hand pose estimation using CSI. We provide configuration details for CSI and image data collection and processing that can be reused in other WiFi-based sensing research, and we propose a deep learning approach that achieves cross-modal learning from CSI to hand pose labels. Our system collects CSI signals and 2D images in a time-synchronized manner. The 2D images are used to generate hand pose labels, and the CSI signals, collected from 3 x 3 transmitter and receiver antenna pairs, are used as the input to our model. Our model includes 3 different learning targets. Experiment results show that CSI measurements have structures similar to digital images and that popular network architectures for hand pose estimation in images can be applied to CSI measurements with slight modification.

Dedication

Dedicated to my parents, who give me the freedom to make my own decisions and support me as always.

Acknowledgments

I first want to thank my advisor, Dr. Dong Xuan, for giving me the freedom to explore and the resources to conduct my master's research during this special period. This thesis would have been impossible without his support. He was instrumental in pointing out the right direction throughout the process, and working with him was a great learning experience. I also have to thank my thesis committee member Dr. Wei-Lun (Harry) Chao for his insightful comments.

Next, I would like to thank Cheng Zhang. He has a conscientious work manner and rich research experience. The idea for this research originated with him, and we had extensive discussions during the process. His suggestions proved correct every time; I wish I had thought more carefully before pushing back and had followed his useful suggestions from the start. Special thanks for his patience and effort during our discussions. I hope we will have more chances to cooperate in the future.

Finally, I would like to thank the people who helped and encouraged me during my time at The Ohio State University. You make my life wonderful and color my world. I will keep the memory of our time together as a precious treasure in my heart.

Vita

2014 - 2018 .......................... B.E. Chemical Engineering, Tianjin University, Tianjin, China

August 2018 - present ................ M.S. Electrical and Computer Engineering, Ohio State University, Columbus, USA

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
Table of Contents
List of Tables
List of Figures
Chapter 1. Introduction
1.1 Human Activity Recognition
1.2 WiFi Sensing
1.3 Channel State Information
1.4 Contribution
1.5 Organization of This Thesis
Chapter 2. Related Work
2.1 Vision-based pose estimation
2.2 WiFi-based pose estimation
Chapter 3. System Framework and Implementation
3.1 Hardware Device
3.1.1 Antenna
3.1.2 Wireless Card and Laptop
3.1.3 Camera
3.2 CSI Extraction
3.3 Video Recording
3.4 Data Processing
3.4.1 Time Alignment
3.4.2 Signal Processing
3.4.3 Keypoint Generation
3.5 Deep Learning
3.5.1 Learning Target
3.5.2 Network Architecture
3.5.3 Loss Function
3.6 Implementation Details
Chapter 4. Experiments and Discussions
4.1 Testbed Installation
4.2 Testbed Setup
4.3 CSI Tool Configuration
4.3.1 Mode Selection
4.3.2 Parameter Configuration
4.4 Dataset
4.5 Evaluation Metrics
4.6 Experiment and Discussion
Chapter 5. Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
Bibliography

List of Tables

Table 1. All possible parameters of monitor_tx_rate in the CSI tool

Table 2. The performance of the baseline, PAM, and PAM+2 models

Table 3. The comparison between the PAM+2 model, Person-in-WiFi, and Openpose

List of Figures

Figure 1. WiFi signal transmission that observes human movement in the indoor environment [2]

Figure 2. Subcarrier-level signal strength computed from channel state information for four single-antenna 802.11n links [5]

Figure 3. The 4D CSI tensor is a time series of CSI matrices of MIMO-OFDM channels [4]

Figure 4. The overview of the system framework

Figure 5. The antenna, wireless card, and laptop

Figure 6. The comparison of the CSI signal before and after signal processing of subcarrier 10 with sampling rate 1000 Hz over 90 seconds

Figure 7. The comparison of the CSI signal before and after signal processing of subcarrier 10 with sampling rate 1000 Hz over 2 frames

Figure 8. One example of the ground-truth keypoints with the bounding box on the testbed

Figure 9. Hand keypoint skeleton from Openpose

Figure 10. Neural network architecture

Figure 11. The motherboard of the Lenovo Thinkpad T400 and two MiniPCIe slots

Figure 12. The placement of the WiFi antennas and the camera

Figure 13. The meaning of each bit of monitor_tx_rate in binary view

Figure 14. The training curves of the baseline, PAM, and PAM+2 models

Figure 15. Two examples of comparison between ground truth and the prediction of the PAM+2 model

Chapter 1. Introduction

1.1 Human Activity Recognition

Human activity recognition (HAR) is a well-known topic in the Human-Computer Interaction (HCI) field. Generally, the goal of human activity recognition is to recognize the activity of one or more humans in an environment and understand their behavior, given time-series data that observes human behavior and contextual changes.

Human activity recognition plays an essential role in HCI. It provides information about human identity, physical status, and activity status, which helps people understand ongoing events.

From the general definition of HAR, two main questions [1] arise: "What is the human activity?" (i.e., what activity we want to recognize) and "What is the time-series data?" (i.e., what data we use to build the model). In terms of machine learning, activity recognition can be classified into activity classification and activity regression. Activity classification attempts to predict discrete or categorical activities from a pre-defined set. It ranges from binary classification, such as fall detection and motion detection, to multi-class classification, such as gesture recognition or recognizing walking, running, and jumping. Activity regression attempts to predict numerical or continuous quantities, e.g., location or speed, which is more fine-grained than activity classification. Defining a location representation of an activity requires the activity to be simple or to have enough points to form the body skeleton, and the goal of the model is to predict these point representations given the data. The applications of activity regression include localization, motion tracking, pose estimation, etc.

The content of the data is usually highly dependent on the sensor. In terms of sensors, HAR approaches can be classified into vision-based, radio frequency based, inertial sensor based, and other methods. Vision-based methods commonly use images or video from a webcam or RGB camera as the data. Radio frequency based methods use WiFi signals, UWB, continuous-wave radar, etc. Inertial sensor based methods mainly use the gyroscope, accelerometer, magnetometer, and other wearable sensors. In this work, we mainly focus on hand pose estimation as the activity we want to recognize and WiFi Channel State Information as the data we use.

1.2 WiFi Sensing

Compared with sensor-based or vision-based sensing, WiFi sensing has three major advantages: ubiquitous availability, cost-effectiveness, and non-intrusiveness. WiFi devices are widely deployed in indoor areas and can be used for WiFi sensing directly. Non-intrusive means people do not have to wear a special sensor or sit in front of a camera; the sensing process is passive and does not require the cooperation of humans. WiFi devices used for sensing are also much cheaper than specially designed UWB chips.

Figure 1. WiFi signal transmission that observes human movement in the indoor environment [2]

In an indoor environment, as shown in Figure 1, the propagation of wireless signals often experiences reflection and scattering. Multiple signal paths result in multiple aliased signals superimposing at the receiver; therefore, the signal characterizes the physical environment [2][3]. When people perform activities in the environment, the wireless signals "observe" the movement. The reflection, refraction, and absorption introduce distortion into the WiFi signals. The distortion may have a positive effect (amplification) on the signal, caused by a new transmission path, or a negative effect (attenuation), caused by absorption of the signal by the human body. Since the receiver keeps receiving the WiFi signal, it records the human activity continuously. The basic idea of WiFi sensing is to find a mapping from the WiFi signal, RSSI or CSI, to human activities.

In the WiFi sensing literature, the algorithms can be classified into modeling-based and learning-based algorithms [4]. Modeling-based algorithms rely on physical features such as amplitude attenuation and phase shift, and build a mathematical or statistical model directly from these features. Learning-based algorithms attempt to learn the mapping function from the CSI measurements to the ground-truth labels directly via machine learning algorithms or Deep Neural Networks (DNN). This work mainly focuses on the learning-based method using DNNs.

1.3 Channel State Information

In WiFi sensing, people use two kinds of signal measurements: RSSI and CSI. The Received Signal Strength Indicator (RSSI) provides the amplitude information of the wireless signal transmitted between each pair of transmitter and receiver, namely the total amount of power over all channels. This value is easy to extract but has limited accuracy, since it contains only coarse-grained information.

Channel State Information (CSI) provides the amplitude and phase information of the signal for each OFDM subcarrier between each pair of transmitter and receiver. This value contains fine-grained information about signal changes and helps achieve better performance. Compared to RSSI, CSI is the preferable choice for attaining remarkable performance in fine-grained activity recognition.

Figure 2. Subcarrier-level signal strength computed from channel state information for four single-antenna 802.11n links [5].

Figure 2 shows one CSI packet from each of four single-antenna links to one receiver. The four communication links offer the same performance but differ in RSSI by up to 13 dB. The figure clearly shows the signal propagation anomaly: frequency-selective fading. Cancellation of signals on certain subcarriers at the receiver requires the transmitter to spend more power to deliver the same performance, which makes the RSSI higher.

Figure 3. The 4D CSI tensor is a time series of CSI matrices of MIMO-OFDM channels [4].

For a MIMO-OFDM channel with N transmit antennas, M receive antennas, and K subcarriers, a CSI packet is a 3D complex matrix in C^(N x M x K). The data structure of CSI signals is similar to the data structure of images: the 3D CSI matrix is like an image with width N, height M, and K color channels, and the time dimension makes the 4D CSI tensor similar to a video. This similarity suggests the potential of CSI to replace the depth camera.
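As a concrete illustration of this analogy, the sketch below builds a synthetic CSI tensor in NumPy and treats one packet as an image and the packet sequence as a video. The antenna and subcarrier counts follow the text; the packet count of 200 is an arbitrary choice for the example.

```python
import numpy as np

# Synthetic CSI: T packets, N = 3 transmit antennas, M = 3 receive
# antennas, K = 30 subcarriers (as reported by the Intel 5300).
T, N, M, K = 200, 3, 3, 30
csi = np.random.randn(T, N, M, K) + 1j * np.random.randn(T, N, M, K)

# One packet is a 3D complex matrix in C^(N x M x K).
packet = csi[0]

# Treat the packet like an image with width N, height M, and K "color"
# channels, using the amplitude as the pixel intensity; the full time
# series is then analogous to a video.
image_like = np.abs(packet)   # N x M x K
video_like = np.abs(csi)      # T x N x M x K
```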

The IEEE 802.11n standard [6] introduces CSI in a feedback mechanism that allows the transmitter to improve channel communication via transmit beamforming. However, CSI is kept private by chip manufacturers and is not reported to common users. The 802.11n CSI Tool [5] is the most widely used tool in CSI-based WiFi sensing research. It modifies the driver and firmware of the Intel 5300 WiFi card to report CSI following the IEEE 802.11n standard. The Intel 5300 WiFi card reports CSI for 30 subcarriers, spread evenly among the 56 subcarriers of a 20 MHz channel or the 114 subcarriers of a 40 MHz channel.

1.4 Contribution

We propose a WiFi-based hand pose estimation system that can estimate human hand pose using only CSI signals. Our contributions are as follows:

 To the best of our knowledge, this is the first system that explores the potential of hand pose estimation using cheap, commercial off-the-shelf WiFi hardware without hardware modification.

 We provide our configuration of the Linux 802.11n CSI Tool in detail, which can be reused in any other WiFi-based sensing research for correct and highly accurate CSI extraction, collection, processing, and visualization.

 We propose a CNN-based deep neural network to achieve cross-modal learning and extract the spatial information from the CSI matrices that observe hand movement. We show that CSI measurements include spatial information similar to digital images, and that popular neural network architectures in computer vision for hand pose estimation can be applied to CSI measurements with slight modification.

1.5 Organization of This Thesis

The rest of the thesis is organized as follows. Chapter 2 introduces the related work in vision-based and WiFi-based pose estimation. Chapter 3 introduces the framework of our system, including the hardware testbed, data collection, keypoint generation, signal processing, time alignment, and deep learning. Chapter 4 describes the implementation details, including testbed installation, testbed setup, and CSI tool configuration, and provides the experiment results and discussion. Chapter 5 presents the conclusion and future work of this thesis.

Chapter 2. Related Work

2.1 Vision-based pose estimation

Pose estimation is a popular problem in computer vision that estimates the locations of body joints in images and videos of human figures. It can also help distinguish different identities. With the advance of deep learning and neural networks, human pose estimation has achieved great results [7]. Recent works [8, 9] usually use a two-step scheme, which first detects the locations of body parts and crops the Region-of-Interest of each person from the image feature maps, and then estimates the body keypoints solely from the corresponding feature maps.

Openpose [7] is a vision-based pose estimation framework that achieves competitive performance in 3D and 2D hand pose estimation. The hand keypoint detector of Openpose was trained using 31 HD cameras, with keypoints projected from 3D space to 2D space. The hand location is estimated based on the location of the arms, so researchers must provide their own hand detector for standalone hand keypoint detection.

2.2 WiFi-based pose estimation

Hand pose is usually represented by keypoint locations in an image, specifically the pixel locations in the image. Humans sense and extract information about the world through visible light. Since wireless signals are invisible to human eyes, it is hard and unnecessary to find a representation of pose in the form of wireless signals. The most popular and widely used representation of pose is still pixel locations in images and videos from cameras, even for wireless-based pose estimation.

Because of the issue mentioned above, almost no work has explored the WiFi-based pose estimation problem. In 2019, Fei et al. [10] released the first work, Person-in-WiFi, that tried to solve the problem at the body level. They used an existing computer vision algorithm to generate keypoint annotations and reconstructed the keypoint coordinates solely from the WiFi signal via a deep learning approach. The experiment was conducted in an indoor environment with a 5-meter distance between transmitter and receiver.

Chapter 3. System Framework and Implementation

The framework of our system includes the hardware testbed, data collection, keypoint generation, signal processing, time alignment, and deep learning. An overview flow chart of the system is shown below.

Figure 4. The overview of system framework

3.1 Hardware Device

Figure 5. The antenna, wireless card, and laptop

Our hardware testbed includes a Bingfu dual-band WiFi 9 dBi omni-directional antenna, an Intel® Ultimate N WiFi Link 5300 NIC, a Lenovo Thinkpad T400 laptop, and a HUE HD portable USB camera. The placement of the devices is shown in Section 4.2.

3.1.1 Antenna

Antennas are key components of WiFi communication systems. They receive incoming WiFi signals or radiate outgoing signals; thus, antennas are the starting and ending points of signal transmission, i.e., the transmitter and receiver. WiFi antennas can be mounted externally or embedded inside the device. We choose the Bingfu antennas, which have 3-meter-long cables.

3.1.2 Wireless Card and Laptop

A WiFi card, or wireless network adapter, allows the computer to send and receive data via radio waves. We choose the 802.11n CSI Tool [5] to extract CSI signals from the WiFi card. This tool requires the Intel® Ultimate N WiFi Link 5300 NIC, which was first released in 2010. The Intel 5300 card supports the PCIe mini-card and half mini-card form factors.

For compatibility with the Intel 5300 card, we use the Lenovo Thinkpad T400 as the device for CSI extraction.

3.1.3 Camera

We use the HUE HD portable USB camera to record the video. The camera is placed behind the receiver and the camera view covers most of the space for wireless signal transmission.

3.2 CSI Extraction

Considering availability and cost, we used the 802.11n CSI Tool [5] to extract CSI signals. This tool is developed in C and Matlab. CSI extraction is made possible by modifying the Intel drivers and firmware, iwlwifi and iwldvm, on an Ubuntu system. The details of the firmware modification are not open-source and are not the focus of this research.

We operate the CSI tool in monitor mode. CSI extraction in monitor mode includes the following four functions: (1) load the modified firmware and drivers, iwldvm and iwlwifi, and set the parameters of the wireless card, e.g., frequency, bandwidth, and operating mode; (2) on the receiver side, create a socket with an infinite timeout and listen for data from the iwlwifi driver; (3) on the transmitter side, generate packet payloads and send packets to the receiver using the LORCON module; (4) on the receiver side, receive packets from the socket and log the CSI packets and local timestamps to a file.
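The receiver-side steps (2) and (4) can be sketched as follows. This is an illustrative Python stand-in, not the tool's actual C code: the real tool reads from a socket connected to the modified iwlwifi driver, so the socket type and log format below are assumptions.

```python
import socket
import time

def make_csi_socket():
    # Illustrative stand-in: the real tool listens on a driver-connected
    # socket; a plain UDP socket is used here for demonstration.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(None)  # infinite timeout, as in step (2)
    return s

def log_packet(logfile, payload):
    # Step (4): record the raw CSI payload together with a local receive
    # timestamp (mirroring the tool's use of gettimeofday() in C).
    ts = time.time()
    logfile.write(f"{ts:.6f} {payload.hex()}\n")
    return ts
```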

The CSI tool provides four scripts for monitor mode: the parameter configuration scripts for the transmitter and the receiver, the packet sending script, and the packet receiving script. These scripts were written in C and modified to achieve the functions mentioned above.

3.3 Video Recording

We use the OpenCV Python API to record videos and timestamps. Although the authors of Person-in-WiFi [10] showed that the FPS of the video can be adjusted with OpenCV 3.1.0, our setup fails to adjust the FPS, possibly due to a hardware problem.

When creating the dataset, the camera records first, and then the wireless card begins to send CSI packets. This ensures that all CSI packets have corresponding video frames. The camera is mainly used for creating ground-truth keypoint labels and image frames. Each image frame and keypoint label corresponds to several CSI packets. Recording the timestamps is necessary for the time alignment of video and CSI.
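The recording loop can be sketched as follows; only the per-frame timestamp pairing mirrors our procedure, while the camera index and frame count are placeholders.

```python
from datetime import datetime

def timestamped_frames(read_frame, n_frames):
    """Grab up to n_frames via read_frame() and pair each frame with a
    wall-clock timestamp, which is later needed for CSI/image time
    alignment. read_frame() follows OpenCV's (ok, frame) convention."""
    frames, stamps = [], []
    for _ in range(n_frames):
        ok, frame = read_frame()
        if not ok:
            break
        frames.append(frame)
        stamps.append(datetime.now())
    return frames, stamps

if __name__ == "__main__":
    # Actual capture with OpenCV; the camera index 0 is an assumption.
    import cv2
    cap = cv2.VideoCapture(0)
    frames, stamps = timestamped_frames(cap.read, 300)
    cap.release()
```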

3.4 Data Processing

The data processing includes time alignment, signal processing, and keypoint generation. After we collect data on the Lenovo T400 laptop, we transfer the data to a high-end laptop for processing and deep learning.

3.4.1 Time Alignment

We only care about the hand pose transition on a frame basis, so we assume that during the transition from one frame to the next, the hand stays still. This is because our first goal is to match the performance of camera-based pose estimation. The small hand shaking and movements within the transition from one frame to the next are considered noise.

Therefore, for the CSI signal transmitted in the air, during the time from the current frame to the next, it always observes the current frame's hand pose. Thus, we can map each frame to the CSI packets received during the interval before the next frame. The number of packets per frame is the CSI sampling rate divided by the frames per second; ideally, 33 packets correspond to one frame.

The timestamps we want to align are the times of hand movement in the air, denoted THCn and THIm for CSI and image respectively, where n is the current CSI packet index and m is the current image frame index. According to the assumption above, the CSI packets aligned with frame 2 are all CSI packets whose THCn falls between THI2 and THI3.

The work Person-in-WiFi used machine timestamps to align CSI and images. Next, we show how to calculate THCn and THIm and why it is incorrect to use machine timestamps for alignment.

Because light and wireless radio waves both travel at the speed of light, the duration of wave transmission is denoted Dw. The timestamp of a CSI packet on the receiver side, denoted TCn, is recorded using gettimeofday in C after receiving the packet from the socket. The duration from the time the receiver antenna receives the radio wave to the time the socket receives the packet is denoted Dc. The timestamp of an image frame on the machine, denoted TIm, is recorded using datetime.datetime.now() in Python. The duration of receiving an image is denoted Di. We denote the current packet index as n and the current frame index as m. We have two equations:

14 THCn TCn  nDc  Dw  (1)

THIm  TIm  m(Di  Dw ) (1)

From the equations above, it is clear that it is not correct to align the timestamps TCn and TIm directly, as was done in Person-in-WiFi [10], because the time of hand movement is not equal to the time of receiving. The cumulative error caused by these durations means the receiving timestamps and the hand-movement timestamps do not match equally.

Assume we transmit packets over 3 frames; the timestamps TC1 and TI1 are equal, and the CSI packet timestamps between TI1 and TI2 are TC1 to TC33. Starting from frame 2, because of the cumulative delay caused by Dc, Dw, and Di, there are no longer 33 packets between TI2 and TI3. This delay causes a mismatch between CSI and image frames and is the reason why some image frames do not have enough CSI packets to match. It is better to align CSI and image frames based on the timestamp of hand movement in the air. It is then safe to align all CSI hand-movement timestamps THCn that fall within the transition of the current image frame m, i.e., between THIm and THIm+1.
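The correction and alignment above can be sketched as follows. The delay constants are assumed to be known; the interval test implements the rule that a packet belongs to frame m when its hand-movement time falls between THIm and THIm+1.

```python
def hand_time_csi(tc_n, n, d_c, d_w):
    # Eq. (1): remove the cumulative receive delay from the logged time.
    return tc_n - n * (d_c + d_w)

def hand_time_img(ti_m, m, d_i, d_w):
    # Eq. (2): the same correction for the m-th image frame.
    return ti_m - m * (d_i + d_w)

def align(csi_hand_times, frame_hand_times):
    """Map frame index m to the indices of CSI packets whose corrected
    timestamp falls in [THI_m, THI_{m+1})."""
    mapping = {m: [] for m in range(len(frame_hand_times) - 1)}
    for n, t in enumerate(csi_hand_times):
        for m in range(len(frame_hand_times) - 1):
            if frame_hand_times[m] <= t < frame_hand_times[m + 1]:
                mapping[m].append(n)
                break
    return mapping
```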

3.4.2 Signal Processing

Due to different configurations and the instability of the CSI tool, the CSI collection experiences packet loss. The number of CSI packets within one frame interval is therefore not necessarily the sampling rate divided by the frames per second. The packet loss issue is introduced in Section 4.3.

Because of the packet loss and the hand-stillness assumption in Section 3.4.1, we want the CSI signal to reflect only the variation on a frame basis. The raw CSI signal includes noise introduced by small hand movements and other human activity, and it reflects the CSI variation on a packet basis. We hope that after introducing filters, we can remove the noise and flatten the variation of the signal.

We first apply a Hampel filter to remove outliers from the signals. We then apply a Butterworth filter, which helps remove the noise and flatten the signal.
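A minimal sketch of this two-stage filtering on one subcarrier's amplitude series, assuming a 10 Hz low-pass cutoff, filter order 4, and a Hampel half-window of 5 samples (the exact filter parameters used in the thesis are not restated here):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def hampel(x, window=5, n_sigmas=3):
    """Sliding-window Hampel filter: replace samples that deviate from
    the local median by more than n_sigmas scaled MADs."""
    x = np.asarray(x, dtype=float).copy()
    k = 1.4826  # relates MAD to standard deviation for Gaussian data
    for i in range(len(x)):
        lo, hi = max(0, i - window), min(len(x), i + window + 1)
        med = np.median(x[lo:hi])
        mad = k * np.median(np.abs(x[lo:hi] - med))
        if mad > 0 and abs(x[i] - med) > n_sigmas * mad:
            x[i] = med
    return x

def smooth(amplitude, fs=1000.0, cutoff=10.0, order=4):
    """Hampel filter followed by a zero-phase low-pass Butterworth
    filter; the cutoff and order here are assumptions."""
    b, a = butter(order, cutoff / (fs / 2.0), btype="low")
    return filtfilt(b, a, hampel(amplitude))
```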

Figure 6. The comparison of the CSI signal before and after signal processing of subcarrier 10 with sampling rate 1000 Hz over 90 seconds

Figure 7. The comparison of the CSI signal before and after signal processing of subcarrier 10 with sampling rate 1000 Hz over 2 frames

As Figures 6 and 7 show, after applying signal processing to the CSI signal, most of the noise is removed, and the CSI signal within one frame stays relatively constant.

3.4.3 Keypoint Generation

The keypoint generation includes hand detection and hand keypoint estimation. We use the Openpose Python API to detect hand keypoints. The input is an image frame and a hand bounding box, and the output is a hand keypoint tensor of size 2 x 21 x 3. Because the hand box location in Openpose is estimated based on the location of the arms, we need another hand detector to detect the location of the hands.

A Single Shot Detector (SSD) [12] model pre-trained on the EgoHands dataset is used as the hand detector. The hand detector outputs the vertices of a square bounding box that bounds the hands. Figure 8 shows an example of the ground-truth keypoints with the bounding box.
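Since Openpose's hand module expects a square input region, the detector output must be squared; the helper below sketches only that geometry. The 1.2 padding factor is an assumption, and the detector and Openpose calls themselves are omitted.

```python
def square_box(x1, y1, x2, y2, img_w, img_h, scale=1.2):
    """Expand a detector box to a square region around its center and
    clamp it to the image; the padding factor of 1.2 is an assumption."""
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half = scale * max(x2 - x1, y2 - y1) / 2.0
    return (max(0.0, cx - half), max(0.0, cy - half),
            min(float(img_w), cx + half), min(float(img_h), cy + half))
```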

Figure 8. One example of the ground-truth keypoints with the bounding box on the testbed

3.5 Deep Learning

3.5.1 Learning Target

The learning target of the neural network determines the purpose and structure of the network. Analyzing the size and physical meaning of different learning targets informs the network design.

The final output we want to predict is the locations of 21 hand keypoints, following the format of Openpose; this output has size 2 x 2 x 21, representing the x, y pixel locations of the 21 hand keypoints of 2 hands in the camera's view. Adding the two bounding box vertices produces an output of size 2 x 2 x 23. A natural choice of learning target is to regress the keypoint coordinates (KC) directly, which has low computational cost. However, according to several papers in the pose estimation literature [13, 14, 15], regressing coordinates directly is highly non-linear, and only one value needs to be correctly predicted over all the image pixels. Imagine we have 150,000 pixels in one image; the ratio of positive to negative labels is 1:149,999, which is highly unbalanced and makes learning more difficult. Besides, keypoint coordinates fail to include the information of the keypoint skeleton. We choose KC as our baseline for WiFi-based pose estimation.

Figure 9. Hand keypoint Skeleton from Openpose

Since the idea was first proposed [23], nearly all popular pose estimation methods choose heatmaps as the learning target. Heatmaps provide several advantages over keypoint coordinates: they mostly avoid the problems of ConvNets predicting real values, and they can represent uncertainty with multiple modes. The size of the hand keypoint heatmap output by Openpose is 386 x 386 x 3. Limited by computational resources, we did not choose the heatmap as the learning target.

We choose the pose adjacency matrix (PAM), with the ground-truth box vertex locations added, as the learning target. We treat the hand skeleton as an undirected finite graph and define the computation of the pose adjacency matrix as follows (similarly for y'i,j and c'i,j), where x, y are the keypoint locations, c is the confidence, and i, j are indices in the adjacency matrix:

x'i,j = xi,        if i = j;
x'i,j = 0,         if i ≠ j and i, j are connected;    (3)
x'i,j = xi - xj,   if i ≠ j and i, j are not connected.

Adding the upper-right vertex and lower-left vertex of the bounding box, the PAM+2 has size 23 x 23 x 3 for one image frame, with 23 corresponding to the 21 keypoints and the two vertex points. We hope that the PAM can represent the information of the hand skeleton, keep the information of the hand locations, and reuse the popular network structures designed for heatmaps.
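Following Eq. (3) as stated, one channel of the PAM can be constructed as below. The edge list shown is a small illustrative subset, not the full 21-joint Openpose hand skeleton.

```python
import numpy as np

# Illustrative subset of hand-skeleton edges (the full Openpose hand
# skeleton connects all 21 joints along the fingers).
EDGES = {(0, 1), (1, 2), (2, 3), (3, 4)}

def pam_channel(values, edges):
    """One channel of the pose adjacency matrix per Eq. (3): the
    diagonal holds the value itself, connected pairs hold 0, and
    unconnected pairs hold the difference of values."""
    n = len(values)
    m = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                m[i, j] = values[i]
            elif (i, j) in edges or (j, i) in edges:
                m[i, j] = 0.0
            else:
                m[i, j] = values[i] - values[j]
    return m

def pam_plus_2(x, y, c):
    # x, y, c each have 23 entries: 21 keypoints plus 2 box vertices.
    return np.stack([pam_channel(x, EDGES),
                     pam_channel(y, EDGES),
                     pam_channel(c, EDGES)], axis=-1)
```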

3.5.2 Network Architecture

Figure 10. Neural Network Architecture

As Figure 10 shows, we use a simple network architecture that includes upsampling, a feature extractor, and a head network. Bilinear interpolation is used directly to upsample the CSI tensor from 200 x 3 x 3 to 200 x 168 x 168.

Deep convolutional neural networks are widely used in the pose estimation literature, but they suffer from gradient vanishing and gradient exploding. ResNets [16] are among the most commonly used backbone networks; they alleviate the vanishing gradient problem with skip connections. We use 3 basic residual blocks as the feature extractor to extract convolutional features from the input CSI data.

The shallow head network acts as the decoder that decodes the convolutional features into the learning target chosen in Section 3.5.1. The head network consists of two convolutional layers.
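A sketch of this architecture in PyTorch. The channel widths and the pooling step that maps the feature map onto the 23 x 23 target grid are illustrative assumptions, not the thesis's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """A basic residual block: 3x3 convolutions with batch norm, ReLU,
    and a skip connection (1x1 conv when channel counts differ)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.skip = nn.Identity() if c_in == c_out else nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.skip(x))

class CSIPoseNet(nn.Module):
    """Upsampling -> 3 residual blocks -> 2-conv head."""
    def __init__(self, out_ch=3):
        super().__init__()
        self.blocks = nn.Sequential(BasicBlock(200, 64),
                                    BasicBlock(64, 32),
                                    BasicBlock(32, 32))
        self.head = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1),
                                  nn.ReLU(),
                                  nn.Conv2d(16, out_ch, 3, padding=1))

    def forward(self, csi):  # csi: (batch, 200, 3, 3)
        x = F.interpolate(csi, size=(168, 168), mode="bilinear",
                          align_corners=False)
        x = self.blocks(x)
        x = F.adaptive_avg_pool2d(x, (23, 23))  # to the 23 x 23 target grid
        return self.head(x)  # (batch, 3, 23, 23), channel-first PAM+2
```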

3.5.3 Loss Function

We define the predicted keypoints, the ground-truth keypoints, and the confidence of the ground-truth keypoints as follows:

K^P = ((x1^p, y1^p), (x2^p, y2^p), ..., (x22^p, y22^p), (x23^p, y23^p))    (4)

K^G = ((x1^g, y1^g), (x2^g, y2^g), ..., (x22^g, y22^g), (x23^g, y23^g))    (5)

C = (c1, c2, ..., c21, 2, 2)    (6)

The confidence of the ground-truth points acts as a weight on the correctness of the keypoints. Thus, the loss function of the network is defined as the Mean Square Error (MSE) loss between the predicted and ground-truth keypoints, each multiplied element-wise by the weight:

Loss = MSE(C ⊙ K^P, C ⊙ K^G)    (7)
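Eq. (7) can be written directly in PyTorch; the tensor layout assumed here (batch, 23 keypoints, 2 coordinates) is an illustrative choice.

```python
import torch

def weighted_mse(pred, gt, conf):
    """Eq. (7): MSE between confidence-weighted predicted and
    ground-truth keypoints. pred, gt: (batch, 23, 2); conf: (batch, 23),
    with the two box vertices fixed at confidence 2."""
    w = conf.unsqueeze(-1)  # broadcast the weight over the x, y channel
    return torch.mean((w * pred - w * gt) ** 2)
```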

3.6 Implementation Details

We use C for the CSI extraction part and Matlab for the CSI signal processing part. Apart from that, Python is the main language used in the other parts of our system, and PyTorch is our deep learning library. We use the ADAM optimizer. The learning rate is set to 0.7 and the batch size to 30. We train for 20 epochs on the baseline model, where the learning target is the keypoint coordinates (size 2 x 2 x 21), the PAM model (size 21 x 21 x 3), and the PAM+2 model (size 23 x 23 x 3). All models were trained on a 6 GB NVIDIA GeForce GTX 1660 Ti.

In each residual block of the feature extractor, we stack 4 convolutional layers. Each convolutional layer has a 3 x 3 kernel and stride 1, followed by batch normalization and a rectified linear unit activation. The output channels of each convolutional layer are changed according to the learning target.
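The training setup above can be sketched as follows, using the stated settings (ADAM optimizer, learning rate 0.7, 20 epochs); the model and data loader are placeholders.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=20, lr=0.7):
    """Training loop with the stated settings (ADAM, learning rate 0.7,
    20 epochs); batch size 30 would be handled by the data loader."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for csi, target in loader:
            opt.zero_grad()
            loss = criterion(model(csi), target)
            loss.backward()
            opt.step()
    return model
```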

Chapter 4. Experiments and Discussions

4.1 Testbed Installation

As Section 3.1 pointed out, we installed the Intel 5300 NIC in two Lenovo T400 laptops. Both were installed with the Linux 802.11n CSI Tool [5] for CSI extraction. The Bingfu long antennas are connected to the Intel 5300 NICs. All the devices mentioned above are Commodity Off-The-Shelf (COTS) devices without hardware modification.

The installation of the CSI tool on the laptop can be done smoothly by following the official guide of the CSI tool [11]. It is worth noting that Lenovo put a whitelist check in the BIOS that only allows the computer to run "authorized" cards. Even though the laptop supports the wireless card according to the official maintenance manual of the Thinkpad T400 [18], the BIOS whitelist issue still appeared during the installation of the Intel 5300 card in the Thinkpad T400 laptop.

According to [19], if the computer has multiple slots, the BIOS checks only the WLAN slot used for WiFi; the WWAN slot is not checked. However, the WWAN slot's disable-radio signal, which is implemented on pin 20 of common WiFi cards, may be active, so the card will refuse to use the radio. Isolating pin 20 makes the disable-radio signal inactive and thus enables the radio. Since the Lenovo Thinkpad T400 laptop has a mini PCI-E WWAN slot, the BIOS check issue can be solved by taping pin 20 of the WiFi card and inserting the card into the WWAN slot. Our implementation shows that the Intel 5300 WiFi card works correctly in the WWAN slot. Figure 11 shows the Lenovo T400 motherboard and the location of the mini PCI-E slots.

Figure 11. The Motherboard of the Lenovo Thinkpad T400 and two MiniPCIe slots.

4.2 Testbed Setup

The CSI tool supports Ubuntu 12.04 and Ubuntu 14.04. Although there is a modified version of the tool that provides support for Ubuntu 18.04 [17], the modified tool is not stable: it has a version-magic problem with the iwlwifi module and sometimes cannot correctly load the modified iwlwifi module due to a driver problem. So we choose Ubuntu 14.04 as the operating system to collect the data.

The placement of the WiFi antennas and the camera is shown in Figure 12. To simplify the experiment, we try to keep the propagation path of the wireless signal from the transmitter to the receiver as clean as possible, so we only place the antennas on the table. The distance between adjacent receiver antennas is 13 cm, and the distance between the receiver and the transmitter is 40 cm. The camera is placed behind the receiver, positioned so that it covers most of the space along the signal propagation path.

Figure 12. The placement of WiFi antennas and the camera

4.3 CSI Tool Configuration

4.3.1 Mode Selection

The CSI tool provides two modes for collecting the CSI data: AP mode and monitor mode.

In AP mode, the computer acts as a client and communicates with the access point. The access point, or station, can be either a computer or a router. The client sends data frames to the router, and the router sends back the data it has processed. Data frames include unicast frames addressed to that station, as well as broadcast frames for addresses to which the station is subscribed; they do not include frames addressed to other stations or sent on other 802.11 networks. In AP mode, the user runs the ping command on the client against the station's IP address, and the station sends the packet information back to the client, allowing the user to collect the CSI data.

Monitor mode disables the WiFi card's ability to connect to the internet but allows the card to observe the 802.11 traffic on the channel. It passively listens to all frames, including data frames, beacons, and management frames. The CSI tool creates custom frames and injects data frames to stations in monitor mode. The modified firmware collects the CSI for any frame that has both a source and a destination MAC address of 00:16:ea:12:34:56. The user can configure multiple parameters for data transmission, including the sampling rate, packet size, packet number, data rate selection, WiFi frequency, and WiFi bandwidth.

The pros and cons of AP mode are summarized below.

Pros: (1) Only one device needs an Intel 5300 card, which means only one computer is required. (2) The internet connection remains enabled, so the packets received on the receiver side can be transmitted over the internet.

Cons: (1) The ping command has overhead that slows down the packet transmission speed. (2) The transmission parameters cannot be controlled precisely.

The pros and cons of monitor mode are summarized below.

Pros: (1) Monitor mode can control multiple parameters, which allows the experiment to be more precise. (2) Monitor mode offers high flexibility and high reliability, as the authors of the CSI tool point out [11].

Cons: (1) There is no internet connection in monitor mode, so the data cannot be processed online. (2) Both the transmitter and the receiver must be equipped with an Intel 5300 NIC.

4.3.2 Parameter Configuration

We configure the CSI tool in monitor mode. Monitor mode allows configuration of multiple parameters of the CSI tool: the sampling rate, packet size, monitor_tx_rate, frequency, and bandwidth.

The sampling rate determines the number of CSI packets per second. In the CSI tool, it is controlled by the packet delay in microseconds. If the sampling rate is too high, there will be packet loss, because processing a packet takes longer than receiving and saving it. If the sampling rate is too low, we do not have enough CSI packets to match each image frame. Based on a small experiment, the sampling rate is set to 1000, so that each image frame has at least 30 CSI packets corresponding to it.

The Intel 5300 card measures the CSI for each received packet during the packet preamble. In wireless transmission, the packet preamble is the "introduction" information of each data frame, used to improve data communication and synchronize the time clock. The contents of the packets have no influence on the CSI data, so the packet size was set to 1 to reduce the overhead of packet processing.
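For reference, a monitor-mode capture with these settings might be launched roughly as follows. The script names (setup_monitor_csi.sh, setup_inject.sh, log_to_file, random_packets) come from the CSI tool's supplementary material, and the channel and bandwidth here are assumptions that may differ on other setups:

```shell
# Receiver: put the card in monitor mode and log CSI to a file
# (channel 64, HT20 bandwidth are assumed example values).
sudo ./setup_monitor_csi.sh 64 HT20
sudo ./log_to_file csi.dat

# Transmitter: monitor mode for injection, then send 1-byte packets with a
# 1000 us inter-packet delay, i.e. roughly 1000 CSI packets per second.
sudo ./setup_inject.sh 64 HT20
sudo ./random_packets 100000 1 1 1000
```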

In the CSI tool, monitor_tx_rate is the most important parameter; it encodes multiple parameters in a single 32-bit value. Figure 13 shows the meaning of each parameter in the binary view of the hex value.

Figure 13. The meaning of each bit of monitor_tx_rate in binary view

The parameters that affect the structure or collection of CSI are the number of transmitter antennas Ntx in bits 14-16 and the number of spatial streams in bits 3-4. The number of spatial streams can be understood as the number of receiver antennas Nrx and is determined by the MCS index in Tables 20-29 of the IEEE 802.11n standard [6]. Because of the properties of MIMO techniques, Ntx must be at least as large as Nrx. Table 1 below shows all the possible monitor_tx_rate values that can be used for CSI packet collection.

We choose 0x1c110 as our monitor_tx_rate, so the structure of a CSI packet is 3 x 3 x 30. Each value in the CSI packet is a complex number containing the amplitude and phase information of the CSI. Note that we only use the amplitude information.

monitor_tx_rate   Ntx   Nrx   MCS index
0x4101            1     1     1
0x4102            1     1     2
....              1     1     ....
0x4107            1     1     7
0x8101            2     1     1
....              2     1     ....
0x8107            2     1     7
0x8108            2     2     8
0x8109            2     2     9
....              2     2     ....
0x810g            2     2     16
0x1c101           3     1     1
....              3     1     ....
0x1c107           3     1     7
0x1c108           3     2     8
....              3     2     ....
0x1c10g           3     2     16
0x1c110           3     3     17
....              3     3     ....
0x1c117           3     3     23

Table 1. All possible parameters of monitor_tx_rate in the CSI tool
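To make the bit layout concrete, here is a small illustrative decoder for the one- and three-antenna values used in this work. It assumes the antenna-select bits sit at positions 14-16 and the low byte holds the MCS index, whose spatial-stream count follows the 802.11n MCS tables; it is a sketch of the encoding, not the driver's actual parser:

```python
def decode_monitor_tx_rate(value: int):
    """Decode an illustrative view of monitor_tx_rate: transmit antennas
    from the antenna-select bits 14-16, and the spatial-stream count implied
    by the 802.11n MCS index (MCS 0-7 -> 1 stream, 8-15 -> 2, 16-23 -> 3)."""
    mcs = value & 0xFF                     # assumption: low byte holds the MCS index
    antenna_mask = (value >> 14) & 0b111   # bits 14-16: antennas A, B, C
    n_tx = bin(antenna_mask).count("1")
    n_ss = mcs // 8 + 1
    return n_tx, n_ss, mcs

# 0x1c117: all three antennas, MCS 23 (3 spatial streams), cf. Table 1.
print(decode_monitor_tx_rate(0x1c117))    # (3, 3, 23)
```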

4.4 Dataset

The dataset is created in a single environment with a single person. Four videos and the corresponding CSI were recorded, and each video is cropped to 50 seconds. The first 10 CSI packets that fall into a frame's duration are used to match that image frame. To prevent the network from simply learning the same position, all the frames are randomly shuffled. We remove all frames in which no bounding box is detected. 80% of the frames are used for training and 20% for testing; the training/testing frames number 4000/1000.
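The pairing and splitting steps above can be sketched as follows. The container layout (frames as (t_start, t_end, label) and time-sorted CSI packets as (timestamp, csi)) and the function name are assumptions for illustration:

```python
import random

def build_dataset(frames, csi_packets, per_frame=10, train_frac=0.8, seed=0):
    """Sketch of the pairing described in the text: for each image frame,
    take the first `per_frame` CSI packets whose timestamp falls inside the
    frame's duration, shuffle the pairs, and split 80/20."""
    pairs = []
    for t_start, t_end, label in frames:
        in_frame = [c for t, c in csi_packets if t_start <= t < t_end]
        if len(in_frame) >= per_frame:          # drop frames with too few packets
            pairs.append((in_frame[:per_frame], label))
    random.Random(seed).shuffle(pairs)          # avoid memorizing frame order
    split = int(train_frac * len(pairs))
    return pairs[:split], pairs[split:]

# Synthetic example: 10 one-second frames, 20 CSI packets per second.
frames = [(i, i + 1, "pose%d" % i) for i in range(10)]
csi_packets = [(i + j / 20.0, (i, j)) for i in range(10) for j in range(20)]
train, test = build_dataset(frames, csi_packets)
print(len(train), len(test))    # 8 2
```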

4.5 Evaluation Metrics

For the hand bounding box, we follow the evaluation metric used in the COCO challenge: mean average precision mAP (averaged from AP@50 to AP@95). AP@a is defined as follows:

AP@a = \frac{1}{N} \sum_{n=1}^{N} \delta(100 \cdot IOU_n \geq a)   (8)

where N is the number of image frames and \delta is a logical operator that outputs 1 if the statement inside is true and 0 otherwise.

IOU is defined as the area of the intersection divided by the area of the union of the predicted bounding box (B_p) and the ground-truth box (B_{gt}):

area(B  B ) IOU  p gt (9) area(Bp  Bgt )

For the pose estimation metrics, the Percentage of Correct Keypoints (PCK) is the common metric used in pose estimation tasks. We use [email protected] to [email protected] and compute mPCK as their average. PCK@a is defined as follows:

PCK@a = \frac{1}{NP} \sum_{i=1}^{N} \sum_{p=1}^{P} \delta\!\left( \frac{\| pd_i^p - gt_i^p \|_2}{\sqrt{w^2 + h^2}} \leq a \right)   (10)

where N is the number of image frames, P is the number of hand keypoints, pd_i^p and gt_i^p are the coordinates of the predicted and ground-truth keypoints, w and h are the width and height of the bounding box, and a is the threshold.
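Equation (10) likewise reduces to a few lines; this sketch computes PCK@a over a list of frames, normalizing each keypoint error by the bounding-box diagonal. The container layout is an assumption:

```python
import math

def pck_at(frames, a):
    """PCK@a of Eq. (10). `frames` is a list of (pred, gt, w, h) tuples,
    where pred and gt are lists of (x, y) keypoints and (w, h) is the
    bounding-box size; this layout is an assumption for illustration."""
    total, correct = 0, 0
    for pred, gt, w, h in frames:
        diag = math.sqrt(w * w + h * h)       # bounding-box diagonal
        for p, g in zip(pred, gt):
            total += 1
            correct += math.dist(p, g) / diag <= a
    return correct / total

# One frame, two keypoints: errors of 0 and 5 px in a 30 x 40 box (diag 50).
frames = [([(0, 0), (3, 4)], [(0, 0), (0, 0)], 30, 40)]
print(pck_at(frames, 0.1))    # 1.0 (both normalized errors <= 0.1)
```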

4.6 Experiment and Discussion

The training of each model takes one hour. Figure 14 shows the training curves of the baseline model, the PAM model, and the PAM+2 model, respectively. The shapes of the learning curves of the three models are similar, possibly because of their similar network structures and parameter configurations. The PAM+2 model does have the lowest loss compared with the PAM model and the baseline model: adding the two bounding-box points to the PAM improves the model's performance. From the loss curves, we can see that all three models could be trained longer to converge further.

Figure 14. The training curves of the baseline, PAM, and PAM+2 models

The performance of each model is shown in Table 2. Because the baseline model and the PAM model do not predict a bounding box, the mAP is calculated only for the PAM+2 model, as in Tables 2C and 2D. We can see that the PAM+2 model outperforms the PAM and baseline models. The baseline model starts to predict correct points only from [email protected], while the PAM and PAM+2 models already reach 72.88 and 86.86 at that threshold. From observing the predicted keypoints, the baseline model always predicts the keypoints in the same place.

Model     [email protected]  [email protected]  [email protected]  [email protected]  [email protected]
Baseline  0         0         0         0         0
PAM       0         0         0.88      29.11     60.44
PAM+2     1.43      1.43      2.57      36.57     74.29
A

Model     [email protected]  [email protected]  [email protected]  [email protected]   [email protected]
Baseline  4         12         32        65        74
PAM       72.88     74.66      76.44     99.33     99.55
PAM+2     86.86     94.96      96.57     97.71     98.28
B

mIOU  mAP   mAP@50  mAP@55  mAP@60  mAP@65
0.52  0.27  0.77    0.67    0.56    0.43
C

mAP@70  mAP@75  mAP@80  mAP@85  mAP@90  mAP@95
0.42    0.23    0.04    0.02    0.01    0.01
D

Table 2. A, B, C, D: The performance of the baseline, PAM, and PAM+2 models

Table 3 shows the comparison between the PAM+2 model, Person-in-WiFi, and Openpose. The [email protected] and mAP of Person-in-WiFi and Openpose come from the evaluation in Person-in-WiFi [10], which is body pose estimation on that paper's dataset. The comparison between Person-in-WiFi and Openpose suggests that, to achieve state-of-the-art performance comparable to the vision-based method, we need at least 80% of the performance of the vision-based method. From this perspective, our model still needs fine-tuning and improvement.

Model           mIOU   [email protected]
Person-in-WiFi  0.66   78.75
Openpose        -      89.48
PAM+2           0.52   36.57

Table 3. The comparison between the PAM+2 model, Person-in-WiFi, and Openpose

Two examples comparing the ground truth and the prediction of the PAM+2 model are shown in Figure 15 below.

Figure 15. Two examples of comparison between the ground truth and the prediction of the PAM+2 model

The comparison of the two examples shows that our model does capture the difference between the two gestures. It learns the location of the hand bounding box correctly, but it does not learn the hand skeleton information very well. There are several possible reasons for this:

1. The definition of the loss function may not be correct. The confidence from Openpose was used as the weight of each hand keypoint, which may be unnecessary. The importance of a hand keypoint is not equal to the confidence of Openpose: that confidence only reflects the highest value of one location in the heatmap for the prediction of that keypoint, while the importance of a keypoint should make sense physically. For example, a keypoint with more connections to other keypoints should be more important, and a keypoint that moves easily may be more important. The weight of a keypoint may be a parameter or a variable that the model should learn.
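As a sketch of the last suggestion, the per-keypoint weights could be made learnable parameters rather than fixed Openpose confidences. Everything here (the softmax normalization, the squared-error base loss, the class name) is an assumption for illustration:

```python
import torch
import torch.nn as nn

class LearnedKeypointWeightLoss(nn.Module):
    """Weighted keypoint regression loss whose per-keypoint weights are
    learned jointly with the model, instead of taken from Openpose."""
    def __init__(self, num_keypoints: int = 21):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_keypoints))

    def forward(self, pred, target):
        # pred, target: (batch, num_keypoints, 2) coordinate tensors
        weights = torch.softmax(self.logits, dim=0)      # positive, sums to 1
        per_kp = ((pred - target) ** 2).sum(dim=-1)      # squared L2 per keypoint
        return (weights * per_kp).mean()

loss_fn = LearnedKeypointWeightLoss(num_keypoints=21)
pred = torch.zeros(4, 21, 2)
print(loss_fn(pred, pred).item())    # 0.0
```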

2. The hand skeleton information was not extracted from the channel state information as expected. This may be because the PAM does not represent the hand skeleton information correctly, or because the neural network structure only focuses on local features.

3. The network structure has multiple issues. First, it is not the best fit for the pose adjacency matrix: the network consists of multiple convolutional layers, but the PAM does not contain local features the way a heatmap does, and we may need to design a specific network for cross-modal learning from CSI to video. Second, the network structure does not capture or encode the hand skeleton information. Third, CSI is time-series data, and the temporal information is also useful for learning how the hand moves and thus how the hand keypoints change, so an LSTM should be introduced into the task. Fourth, the learning of hand location and hand pose is done by one model, which harms the model's ability to learn each task well.

4. The effect of noise should be further investigated. Signal processing flattens the CSI packets within one hand movement, but some useful information may also be contained in the noise.

5. The PAM may not be the best learning target. Nearly all pose estimation papers since ECCV 2014 use the heatmap as the learning target. The learning target for pose estimation problems using the CSI signal should be further investigated.

Chapter 5. Conclusion and Future Work

5.1 Conclusion

In this thesis, we proposed a WiFi-based hand pose estimation system. The system uses channel state information that observes human hand movement as the data to estimate the human hand pose. To our knowledge, this is the first system that focuses on human hand pose estimation using commodity off-the-shelf devices. Our system overcomes multiple issues that arise when using the Linux 802.11n CSI Tool to collect CSI information, and it can serve as the basis for further research in hand pose estimation and in object recognition, detection, and tracking using channel state information. Because our work is at an initial stage, the experimental results of our deep learning model are not yet comparable to vision-based methods, but we demonstrate the possibility that WiFi can be used for 2D hand pose estimation, and our deep learning model can be further fine-tuned and improved. With multiple cameras and more computation resources, we believe our system can achieve performance comparable to vision-based models trained on multiple depth cameras from different angles in 3D hand pose estimation tasks.

We show that channel state information does include spatial information that is similar to that of a digital image.

5.2 Future Work

Due to the restrictions of time, space, and devices, our system is not mature, and it has many directions in which to improve. There are several possible directions.

The influence of CSI tool configuration on CSI data. Our system can extract stable and useful CSI information during wireless transmission, but this is only one correct configuration that can be used for WiFi sensing research; there is more to discover about the influence of the CSI tool on the CSI signal. Within the Linux 802.11n CSI Tool, packet loss has a strong relationship with the sampling rate, data rate, frame type, packet size, and socket buffer. Since CSI itself is a measurement used to estimate data transmission, it is worthwhile to investigate this relationship and thus provide better insight for research in either the WiFi communication mechanism or WiFi sensing.

Moving forward, the CSI collected by different WiFi chips may differ. It is also worthwhile to look at the CSI packets produced by different WiFi chips observing the same environment, which can be the same hand movement or the same placement of an object. Because of differences in hardware, firmware, drivers, and mechanisms, the CSI recordings will probably differ in both content and structure. From the different CSI representations of one environment, we may find the core relationship between the environment and the CSI data.

Improvement of the neural network. We followed the same network structure used for pose estimation tasks in computer vision research. This kind of structure is good at capturing convolutional features, but an ideal neural network that learns the mapping from the CSI signal to the human hand pose should satisfy the following functions: signal processing that extracts useful information; cross-modal learning from the CSI signal to the human hand pose; keeping the time-series relationship between CSI packets; capturing the hand skeleton information; and learning a representation suitable for this specific task. Our model is just the first exploration toward this ultimate goal. It is at the beginning of the path to the ideal model, and there is still much space to explore and investigate.

The relationship between CSI data and hand keypoints. We tried to use a neural network to model this relationship, but, as is well known, the interpretability of neural networks is still a non-trivial problem. This kind of modeling is a black box, and the improvement of a neural network is mainly based on experience and experiment, without solid theoretical support. Meanwhile, in the field of WiFi sensing, some researchers use model-based methods to achieve hand gesture recognition and hand tracking. Combining the learning-based method and the model-based method requires more knowledge and effort from both sides, but the result of the combination is promising: it may address the limitations of both methods and achieve better results. For estimation applications like hand pose estimation, the learning-based method can first be used to detect the human hand segmentation; then the model-based method can be used to estimate the orientation and speed of each part of the hand, thus recognizing the small parts of the hand from the segmentation; after that, the learning-based method can be used to estimate the hand keypoints from the reconstructed image.

Data augmentation. Our data use hand movements from a single person, a single environment, and a single camera view. The hand pose estimation model of Openpose uses 31 HD cameras from different views, different environments, and different people with human-labeled videos, and then projects the learning result from 3D to 2D. That may be the reason why the performance of our system is not as visually pleasing as that of Openpose. If our system used CSI data collected under the same conditions, I believe it could achieve results comparable to Openpose.

Bibliography

[1] Herath, Samitha, Mehrtash Harandi, and Fatih Porikli. "Going deeper into action recognition: A survey." Image and Vision Computing 60 (2017): 4-21.

[2] Zhou, Zimu, et al. "Sensorless sensing with WiFi." Tsinghua Science and Technology 20.1 (2015): 1-6.

[3] Yang, Zheng, Zimu Zhou, and Yunhao Liu. "From RSSI to CSI: Indoor localization via channel response." ACM Computing Surveys (CSUR) 46.2 (2013): 1-32.

[4] Ma, Yongsen, Gang Zhou, and Shuangquan Wang. "WiFi sensing with channel state information: A survey." ACM Computing Surveys (CSUR) 52.3 (2019): 1-36.

[5] Halperin, Daniel, et al. "Tool release: Gathering 802.11n traces with channel state information." ACM SIGCOMM Computer Communication Review 41.1 (2011): 53-53.

[6] IEEE Standards Association. "IEEE SA–802.11n-2009–IEEE standard for information technology–telecommunications and information exchange between systems–local and metropolitan area networks–specific requirements part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) specifications amendment 5: enhancements for higher throughput." IEEE Std 802 (2009).

[7] Cao, Zhe, et al. "OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields." arXiv preprint arXiv:1812.08008 (2018).

[8] Chen, Yilun, et al. "Cascaded pyramid network for multi-person pose estimation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

[9] Fang, Hao-Shu, et al. "RMPE: Regional multi-person pose estimation." Proceedings of the IEEE International Conference on Computer Vision. 2017.

[10] Wang, Fei, et al. "Person-in-WiFi: Fine-grained person perception using WiFi." Proceedings of the IEEE International Conference on Computer Vision. 2019.

[11] Halperin, Daniel, et al. Official Website of the Linux 802.11n CSI Tool. http://dhalperi.github.io/linux-80211n-csitool/. 2011.

[12] Liu, Wei, et al. "SSD: Single shot multibox detector." European Conference on Computer Vision. Springer, Cham, 2016.

[13] Belagiannis, Vasileios, and Andrew Zisserman. "Recurrent human pose estimation." 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017). IEEE, 2017.

[14] Pfister, Tomas, James Charles, and Andrew Zisserman. "Flowing convnets for human pose estimation in videos." Proceedings of the IEEE International Conference on Computer Vision. 2015.

[15] Bulat, Adrian, and Georgios Tzimiropoulos. "Human pose estimation via convolutional part heatmap regression." European Conference on Computer Vision. Springer, Cham, 2016.

[16] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

[17] Modified version of the CSI Tool. https://github.com/spanev/linux-80211n-csitool. 2019.

[18] ThinkPad T400 and R400 Hardware Maintenance Manual. https://images10.newegg.com/UploadFilesForNewegg/itemintelligence/Lenovo/43y6629_021400130551905.pdf. 2015.

[19] Siemer, et al. Problem with unauthorized MiniPCI network card. http://www.thinkwiki.org/wiki/Problem_with_unauthorized_MiniPCI_network_card. 2014.

[20] Andriluka, Mykhaylo, et al. "2D human pose estimation: New benchmark and state of the art analysis." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.

[21] Yang, Yi, and Deva Ramanan. "Articulated human detection with flexible mixtures of parts." IEEE Transactions on Pattern Analysis and Machine Intelligence 35.12 (2012): 2878-2890.

[22] Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human pose estimation." European Conference on Computer Vision. Springer, Cham, 2016.

[23] Ramakrishna, Varun, et al. "Pose machines: Articulated pose estimation via inference machines." European Conference on Computer Vision. Springer, Cham, 2014.