
Identifying Sensor Accesses from Service Descriptions

Antara Palit,∗ Mudhakar Srivatsa, Raghu Ganti
IBM T J Research Center, Yorktown Heights, NY
Email: [email protected], msrivats, [email protected]

Christopher Simpkin
Cardiff University
Email: [email protected]

Abstract—Recent advances in computing infrastructure feature edge nodes (e.g., smartphones, cars) prominently in the computation pipeline. Applications that combine sensor/IoT data from edge devices in a distributed fashion are growing. Dynamically composed applications that combine resources from edge nodes and the cloud are becoming common in various domains, including urban and military settings. A key challenge in such applications is to bridge the gap between the application's description and its IoT resource requirements, where the application description is unstructured text and its IoT requirements are structured. In this paper, we describe an approach that develops a model which, given the unstructured text description of a service, predicts its IoT sensor requirements. Our model can predict with an average accuracy of 77% and up to 88% on the top-20 sensor requirements for 300 different applications.

I. INTRODUCTION

A rapid proliferation of compute power in the form of edge devices such as smartphones and smart cars has enabled a distributed computational infrastructure. These devices are equipped with sensors, thus resulting in the largest connected and distributed IoT infrastructure [2], [11]. The applications and services on top of this infrastructure are regulated through app marketplaces such as the Apple AppStore and Google Play, where app developers make their apps available. As apps developed for edge devices evolve in complexity, it becomes harder to describe and regulate their access to sensitive IoT sensors (e.g., location, audio). This paper targets the problem of capturing the IoT sensor requirements of an app based on its (unstructured) textual description.

A model that can "predict" the sensory requirements based on textual service descriptions has many applications. To begin with, it can serve as a simple privacy evaluator, as the human being installing the app is not fully aware of all the sensor requirements from reading the description. If the discrepancy between the requested application sensors and the predicted requirements is significant, such an application can be flagged and further examined. Another application of such a technique is in creating sensory requirements from simple textual descriptions in order to compose microservices and weave them together into a unified application. Such scenarios are fairly common in military settings and are becoming popular with the advent of distributed edge infrastructure.

Our model combines the state-of-the-art in vector representation of application textual descriptions and neural networks to produce a model that can fairly accurately predict the sensor requirements. Our premise is that the textual descriptions of most targeted applications capture their IoT requirements quite well. For example, consider the app description of FindMyDevice; a snippet reads as "Locate your phone, tablet or watch.... Find My Device helps you track down your device when it's close by", and the permissions requested by the app are access to the network, location, and permission to send/receive text messages. As exemplified, we notice that there is a clear correlation between the textual description and the requested permissions. The challenge is in being able to capture this using a model.

We design a pipeline that learns vector representations of the words and then a neural network model on these vector representations to capture the IoT sensor requirements from the textual description of the application. We evaluate this pipeline on 300 applications varying from popular applications such as Facebook and Google Photos to fairly unknown applications such as DataBot and DateBang. Our model achieves an average accuracy of 67% across all the 80 sensor types and an accuracy of up to 88% in the top-20 sensor categories.

∗Work was done as an intern at IBM

II. DATA DESCRIPTION

We collect data from 300 apps on the Google Android MarketPlace, ranging from famous apps such as Facebook, Messenger, Instagram, and Box to relatively unknown apps such as DataBot, DateBang, and Line. This enables us to study a large variety of apps. We download the apk files for each of these applications and then extract the service description (the text associated with this service) and the permissions that the application's XML file declares (this is the file that specifies at installation time what permissions an app requires).

Figure 1 shows the occurrence of the top-20 permissions across all the 300 applications; a sum total of 86 permissions were requested by all these applications. We observe from Figure 1 that the top permissions are access to the Internet, external storage, and vibration. From a sensor standpoint, the top permission requests are for camera, location, contacts, and audio. A distribution of the permissions across all the applications is shown in Figure 2, which follows a power-law distribution, quite common in Internet applications.

Fig. 1: Permissions across the 300 applications
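The permission counts behind Figure 1 can be tallied once each app's manifest has been decoded to plain XML; the following is a minimal sketch, not the paper's extraction code, and the toy manifests here are hypothetical:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Android manifests qualify the "name" attribute with this namespace.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def permissions_from_manifest(xml_text):
    """Return the permission names declared via <uses-permission> tags."""
    root = ET.fromstring(xml_text)
    return [el.attrib.get(ANDROID_NS + "name", "")
            for el in root.iter("uses-permission")]

def top_permissions(manifests, k=20):
    """Count permissions across all apps and return the k most common."""
    counts = Counter()
    for xml_text in manifests:
        counts.update(permissions_from_manifest(xml_text))
    return counts.most_common(k)

# Two toy (hypothetical) manifests:
m1 = """<manifest xmlns:android="http://schemas.android.com/apk/res/android">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.CAMERA"/>
</manifest>"""
m2 = """<manifest xmlns:android="http://schemas.android.com/apk/res/android">
  <uses-permission android:name="android.permission.INTERNET"/>
</manifest>"""

print(top_permissions([m1, m2]))
# -> [('android.permission.INTERNET', 2), ('android.permission.CAMERA', 1)]
```

Note that the manifest inside an apk is in a binary format; a tool such as `aapt` or `apktool` is assumed to have decoded it to textual XML first.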

Fig. 2: Distribution of permissions across 300 applications

The actual service descriptions are summarized for the reader in the form of a wordcloud, as shown in Figure 3. We obtain the wordcloud in Figure 3 by removing stop words and lemmatizing [7] (finding the root word of) the descriptions. We observe from this wordcloud that there are some occurrences of words related to the permissions requested; for example, camera, photo, and video are words related to the camera permission. Words such as download, email, receive, and upload are related to the Internet access permission. However, location and audio related words do not show up in this corpus, indicating that these permissions, even though requested quite often, do not show up in the service descriptions prominently. We also plot the word distribution across the lemmatized service description corpus in Figure 4, which, similar to the permission distribution, follows a power-law curve.

Fig. 3: Word Cloud of App service descriptions for the 300 applications

Fig. 4: Distribution of lemmatized words across 300 applications

III. PREDICTION FRAMEWORK

The framework that we build relies on a combination of a simple Multilayer Perceptron (MLP) model with a vector representation of words. The challenge arises from being able to represent words or documents as vectors. In the past, vector representation was achieved using simple collision-free hashing functions such as MurmurHash [12]. However, such an approach has been shown to not capture the similarity between words; for example, location and geography are used in the same context. More recent efforts [8], [9] develop a model that learns the context in which a word appears and represents that context as a vector in a high-dimensional space. In what follows, we explain the vector representation of the words followed by the neural network model that we use. The entire end-to-end process is illustrated pictorially in Figure 5.

Fig. 5: Architectural illustration of the various stages of the modeling process

A. Data Cleaning

Data cleaning is achieved using the Stanford CoreNLP toolkit [7], which provides a suite of tools for text processing. Our pipeline relies on identifying stopwords (e.g., the, an, about) – words that do not add significant context to the service description – and removing these from the corpus, followed by lemmatization. Lemmatization is the process of identifying the root word of a given word; for example, the root of running is run. These steps enable better capture of context in the next stage of the pipeline and are fairly standard techniques in text analytics.

B. Word2Vec

Word2Vec [9] belongs to a group of models that produce word embeddings using a shallow two-layer neural network, which is trained to reconstruct linguistic contexts of words. The input to this neural network is a large text corpus, with the neural network learning vector space embeddings for each word based on the context of its usage. The Word2Vec neural network groups the vectors of similar words together in vector space, i.e., it detects similarities from a mathematical standpoint. Every unique word starts with a random vector in a high-dimensional space (the dimensionality is chosen based on the vocabulary of the corpus; in our experiments, as shown later, we use a dimensionality of 128). With each iteration of the neural network training, the jth value of this vector captures the probability of the occurrence of a group of words with the given word. Over multiple iterations, a probability distribution of word similarity is learned.

There are typically two options for learning this model: one is to predict a target word given a context (the continuous bag of words or CBOW model) and the other is to predict the context given a target word (also called Skip-gram). It has been shown through empirical evaluations on large datasets that the Skip-gram model yields better results. In our pipeline, we use the Skip-gram model for training and evaluation purposes. We train the model using the text corpus and use this trained model to output the vector representations of the text of the service.

C. MultiLayer Perceptron

The final stage of our pipeline, which feeds from the output of the previous stages, takes as input a tensor – a collection of vectors representing the textual description of that service. We use a Multilayer Perceptron for training the neural network on this input. An MLP can be thought of as a classifier where the input is first transformed using a non-linear transformation that projects the input data into a space where it becomes linearly separable; the intermediate layer is the hidden layer. The MLP that we use for our training is represented in Figure 6. Our configuration for the MLP is a single hidden layer and one output layer; the input layer takes in 128-dimensional vectors, the hidden layer is of size 256, and the output layer is of size N, where N is the number of unique IoT sensor capabilities (in our case, 80). The model, once trained, is used for predicting, given a new text corpus, the class labels for the IoT sensors in the output layer.

Fig. 6: Illustration of MLP with single hidden layer

IV. EVALUATION

We use TensorFlow [1] for our evaluation on the corpus of 300 applications. In this process, we train a local Word2Vec model using various embedding sizes (the dimensionality of the vector) as well as different skip lengths for the Skip-gram model. The skip length identifies how many words to the left and right of the given target word need to be considered for training (i.e., the size of the context). We compare the results of the MLP approach with that of a Naive Bayes classifier. The text corpus is cleaned using the stopword removal and lemmatization steps. For all our accuracy estimates, we use 60% of the data for training, 20% for validation, and 20% for testing. The best accuracy is obtained on 128-dimensional vector inputs with a hidden layer size of 256. We define accuracy on each individual prediction, i.e., if the text predicts that 50 out of 80 sensors are being used and only 49 of them are actually being used, the accuracy is 49/50 or 98%.

Figure 7 fixes the embedding size as 128 and compares the performance of Naive Bayes and MLP as the skip length of the Word2Vec model is increased. We make note of two observations from this figure: one is that the performance of MLP is consistently better, with the peak difference being 10% at a skip length of 7. The second is that as the skip length increases, the performance of both models also improves significantly. Our experiments (not shown in this paper) show that using hashing functions such as MurmurHash provides poor performance results.
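To make the role of the skip length concrete, the following sketch generates the (target, context) pairs that a Skip-gram model with a given skip length would be trained on; this is an illustrative reimplementation, not the TensorFlow code used in our experiments:

```python
def skipgram_pairs(tokens, skip_length):
    """Pair each word with every word up to skip_length positions
    to its left and right (the context window)."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - skip_length)
        hi = min(len(tokens), i + skip_length + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# A snippet of a (cleaned, lemmatized) service description:
tokens = ["locate", "your", "phone", "tablet", "watch"]
pairs = skipgram_pairs(tokens, skip_length=1)
print(pairs)  # 8 pairs; e.g. ('phone', 'your') and ('phone', 'tablet')
```

Increasing the skip length grows the number of training pairs roughly linearly, which is consistent with the longer training times reported below for larger skip lengths.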

Fig. 8: Accuracy of MLP for a trained Word2Vec and a global Word2Vec Model as embedding size is increased, fixing the skip length to 5

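The per-prediction accuracy used in these results is the fraction of predicted sensors that are actually requested by the app; a minimal sketch, where the convention for an empty prediction is our assumption:

```python
def prediction_accuracy(predicted, actual):
    """Fraction of predicted sensors that the app actually requests
    (49 correct out of 50 predicted -> 0.98, as in the text)."""
    predicted, actual = set(predicted), set(actual)
    if not predicted:
        return 1.0  # assumption: an empty prediction is vacuously accurate
    return len(predicted & actual) / len(predicted)

predicted = set(range(50))         # model predicts sensors 0..49
actual = set(range(49)) | {79}     # 49 of those are requested, plus one missed
print(prediction_accuracy(predicted, actual))  # -> 0.98
```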
Fig. 7: Accuracy of Naive Bayes and MLP as skip length of Word2Vec is increased, fixing embedding size to 128

We examine the effect of embedding sizes and different Word2Vec models using the MLP at a skip length of 5 in Figure 8. We consider the Google Word2Vec model predefined in TensorFlow and generate vector embeddings for the same text corpus. The MLP model is trained and evaluated with both these Word2Vec models. We observe that the performance in general improves as the embedding size is increased, with the local Word2Vec model performing better. We note that the local model is better as it captures the context of the application text better than a model that is trained on the News corpus.

We present the training times for both MLP and Naive Bayes, normalized with respect to Naive Bayes at a skip length of 1, in Figure 9. We observe that both models take longer as the skip length is increased, with MLP having a greater rate of increase than the Naive Bayes model. This is a normal expectation for neural network models, for which typical training times of several hours to a few days are common.

We plot the training time for MLP model generation in milliseconds per sample (amortized) as the embedding size is increased in Figure 10, which shows that the training time grows linearly with the embedding size for this model.

Finally, we plot the time taken in milliseconds to predict the output given a text corpus and a model in Figure 11. This figure illustrates that model scoring is fairly quick, on the order of sub-milliseconds even with a high-dimensional embedding size.

To summarize, we present the accuracy of the top 20 sensors that occur in the corpus and the individual prediction accuracies on the testing dataset in Figure 12. We observe from this figure that we achieve >85% accuracies for several sensors in the top 20, with an average accuracy of 77%.
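The MLP behind these results (128-dimensional input, one hidden layer of 256 units, 80 outputs, one per sensor type) can be sketched as a numpy forward pass; the ReLU and sigmoid activation choices are our assumptions for illustration, since only the layer sizes are specified above, and in our experiments the model is trained with TensorFlow rather than hand-rolled:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from Section III-C: 128 -> 256 -> 80
W1 = rng.normal(scale=0.1, size=(128, 256))
b1 = np.zeros(256)
W2 = rng.normal(scale=0.1, size=(256, 80))
b2 = np.zeros(80)

def forward(x):
    """One forward pass: non-linear hidden layer, then one score per
    sensor type. Activations are assumed, not specified in the paper."""
    h = np.maximum(0.0, x @ W1 + b1)              # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # sigmoid per sensor

x = rng.normal(size=128)   # one 128-dim vector for a service description
probs = forward(x)
print(probs.shape)         # (80,) -- a score in (0, 1) for each sensor
```

Thresholding each of the 80 outputs (e.g., at 0.5) yields the multi-label sensor prediction that the per-prediction accuracy is computed over.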

Fig. 9: Amortized runtime for MLP and Naive Bayes model as the skip length is increased

Fig. 12: MLP accuracy for the top 20 sensors by application use, MLP model generated with embedding size of 128 and skip length of 7

V. RELATED WORK

Word2Vec models [8], [9] were first introduced to the general public by Google in 2013. Since then, they have gained in popularity and have been used in different contexts. Specifically for smartphone apps and understanding/deriving insights about these apps, there is some previous work. Understanding the significance of ads and their effect in a specific application is studied in [3] and [4]. Automated mobile testing is addressed in [5]. A common aspect in these efforts is the use of Word2Vec for representing a text document as a set of vectors in a high-dimensional space. These papers do not focus on the identification of required sensors from the text corpus, but point to the usefulness of Word2Vec for vector space embedding.

Neural network models have been used in the context of correlating different applications based on their descriptions [6], [10]. There, synergies between various applications/micro-services are extracted, whereas our focus is to identify the sensors; the idea of using neural networks is applicable when identifying these correlations.

Fig. 10: Per sample training runtime (in milliseconds) for MLP using the trained Word2Vec model as embed size is increased

VI. CONCLUSIONS

We develop a pipeline/framework that combines Word2Vec with an MLP neural network for predicting the IoT sensor requirements from the textual description of an application. We trained the MLP model using a corpus of 300 applications and showed that we can achieve an accuracy of 67% on average, and on the top-20 sensors used, an average accuracy of 77%. We compared with a simple approach that uses Naive Bayes and showed that the MLP approach outperforms this baseline by about 10%.

Fig. 11: Scoring time per sample in milliseconds for MLP model as the embed size of the Word2Vec model is increased

ACKNOWLEDGEMENTS

This research was sponsored by the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

REFERENCES

[1] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] R. Ganti, F. Ye, and H. Lei. Mobile crowdsensing: Current state and future challenges. IEEE Communications Magazine, 49(11):32–39, November 2011.
[3] C. Gao, H. Xu, Y. Man, Y. Zhou, and M. R. Lyu. IntelliAd: Understanding in-app ad costs from users' perspective. CoRR, abs/1607.03575, 2016.
[4] J. Gui, M. Nagappan, and W. G. J. Halfond. What aspects of mobile ads do users care about? An empirical study of mobile in-app ad reviews. CoRR, abs/1702.07681, 2017.
[5] P. Liu, X. Zhang, M. Pistoia, Y. Zheng, M. Marques, and L. Zeng. Automatic text input generation for mobile testing. In Proceedings of the 39th International Conference on Software Engineering, ICSE '17, pages 643–653, 2017.
[6] S. Liu, Y. Li, G. Sun, B. Fan, and S. Deng. Hierarchical RNN networks for structured semantic web API model learning and extraction. In Proc. of Web Services, ICWS, pages 708–713, 2017.
[7] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014.
[8] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119, 2013.
[10] T. Thiele, T. Sommer, S. Stiehm, S. Jeschke, and A. Richert. Exploring research networks with : A data-driven microservice architecture for synergy detection. In Proc. of Future Internet of Things and Cloud Workshops (FiCloudW), pages 246–251, 2016.
[11] R. Want, B. N. Schilit, and S. Jenson. Enabling the internet of things. IEEE Computer, 48:28–35, 2015.
[12] F. Yamaguchi and H. Nishi. Hardware-based hash functions for network applications. In Proc. of International Conference on Networks, pages 1–6, 2013.