Analysis of human activities on smart devices using Riak TS

Hinduja Dhanasekaran, Siddharth Selvam, Jeongkyu Lee
University of Bridgeport

Abstract – In this paper we implement Riak TS, a time-series database. It is a key-value-based database that treats time as an important parameter. During the implementation of the project we worked through the installation process, loading the data, and analyzing the data using Riak TS. By running complex queries we learnt how time plays a crucial role in understanding, analyzing, and visualizing the data.

Index Terms – Big Data, NoSQL database, Motivation, Riak TS features, Riak TS Architecture, Dataset and Implementation, Result, Conclusion

I. INTRODUCTION

The digital world today runs on the effective use of data. Social platforms like Facebook, Instagram, and Snapchat all deal with huge amounts of data on a day-to-day basis. In fact, the amount of data that their systems must store and support changes every second. Given the number of users and the time they spend on social media, we can imagine how much data needs to be managed and processed. In 2013 the Oxford dictionary officially added the term “Big Data”. A search for the term gives the following definition – “Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions”.

Big data also changes every second, at a very fast pace, and hence the processing rate must be fast enough to match. Since big data is huge in volume, we can’t process it using traditional tools. Traditional tools are unfavorable for processing Big Data because most of them can’t handle a huge amount of data at once, and because the format of the data collected from various sources differs from one source to another. When we handle Big Data we must also make sure we choose the right database to get the desired results. The database should be able to handle huge sets of data. It should also accommodate data of various types and formats, which makes the data unpredictable. Keeping all this in mind, a traditional database is ruled out, and we must have a database that supports all of these needs. Hence, we go for a NoSQL database.

NoSQL stands for “not only SQL”. Popular NoSQL databases include MongoDB, OrientDB, Couchbase, and Neo4j. Notably, NoSQL databases are designed with Big Data in mind.

NoSQL databases support dynamic schema design, which gives maximum flexibility and makes the data easy to work with. They provide high levels of scalability compared to relational databases. These key features make them well suited to handling non-uniform data that varies from time to time. They are ideal for social media applications and for web applications as well. NoSQL databases are classified into four types, namely document databases, graph databases, key-value stores, and wide-column stores.

From the above it follows that NoSQL databases do not depend on a fixed structure of tables, columns, and rows, and hence pave the way to process unstructured data. The NoSQL database which we have selected is Riak TS. TS in the name stands for “time series”, which indicates the importance of time in this database. Riak TS was developed by Basho Technologies on top of Riak KV. The difference between them is that Riak TS adds the ability to co-locate keys of the same series within the same quantum for faster and more efficient read operations.

II. MOTIVATION –

We all knowingly or unknowingly spend time on social media on a day-to-day basis. Hence, we wanted to work on something related to social media, and so we needed a database that handles datasets based on time. Riak TS is one such database. It treats time as an important factor and is designed for handling IoT and time-series datasets. Riak TS also integrates with Apache Spark for faster and easier analysis of data.

III. Riak TS Features –

1. Availability – Riak TS ensures high availability of the data since it follows the 3X (three-copy) model, and hence the data is replicated across multiple data centers spanning multiple zones. When one goes down, the system as a whole stays up, and the data remains available 24x7.

2. Resilient – Riak TS follows a masterless architecture, and hence it ensures the availability of data even at the worst times, such as network failures or hardware issues. This also eliminates the downtime of copying the data to other places in an emergency, as everything here is automatic.

3. Scalability – Scaling is easier than with other databases. Increasing and decreasing the size of the database is simple. Based on our requirement we can scale to meet peak demand and improve the performance.

4. Operational Simplicity – As the term suggests, the interface of the database is easy and user friendly to operate and navigate. It is convenient to add nodes to the cluster and to distribute the data uniformly among them. Therefore, the setup and the addition and removal of capacity are very easy.

5. Data co-location – It locates related data together in the same physical part of the cluster based on the time range. This speeds up queries and hence the reading and analysis of the time series data.

6. SQL commands – Riak TS stores semi-structured data in a schema with the help of SQL commands. Data co-location and range queries help in reading and analyzing the time series data at a faster rate.

7. SQL range queries – SQL commands in Riak TS have a time quantum attached to them, and hence they locate the data quickly instead of going through the entire set of data, which is time consuming.

8. Data Expiry – Riak TS has a data expiry feature. This feature allows you to explicitly specify the age at which data must be removed from the database. This decreases the amount of data stored in the database and hence improves its performance.

9. Apache Spark – Riak TS integrates with Apache Spark to provide faster and more efficient analysis of the data.

10. Multi Cluster Replication – Riak TS follows the 3X architecture, and hence the data is replicated across multiple locations. This ensures high availability of the data.

11. Robust API and Client Libraries – Riak TS supports various programming languages such as Python, Java, PHP, Node.js, and Erlang. This makes it comfortable to build applications for IoT and time-series data.

12. Aggregations – Built-in aggregations compute summaries inside the database, which speeds up reading and analyzing huge amounts of data.

13. Apache Mesos Framework – The Riak Mesos framework provides cluster management and push-button scale-up and scale-down for all the nodes in Riak.

14. Time Stamped Data Feeds – Riak TS has all its data feeds associated with time, and hence it is very handy for medical, financial, and economic data.

IV. Riak TS Architecture –

Riak TS integrates SQL functionality and structure with Riak KV storage; this is achieved by using Riak TS tables. Riak TS enables you to query large amounts of data, and in this it differs from Riak KV. The primary key of a Riak TS table consists of –

1. Partition Key – It decides where the data is located in the cluster.
2. Local Key – It decides where the data is written within the partition.

The partition key uses time quantization to group data that will be queried together in the same physical part of the cluster. The time quantization has its own parameters. To query time series data, the data must be structured using a specific schema. The schema defines what data can be stored in the table and what type each field has. The following is an example of a table in Riak TS –

CREATE TABLE GeoCheckin
(
   id           SINT64    NOT NULL,
   region       VARCHAR   NOT NULL,
   state        VARCHAR   NOT NULL,
   time         TIMESTAMP NOT NULL,
   weather      VARCHAR   NOT NULL,
   temperature  DOUBLE,
   PRIMARY KEY (
     (id, QUANTUM(time, 15, 'm')),
     id, time
   )
)

In the above example the partition key is (id, QUANTUM(time, 15, 'm')) and the local key is id, time. Together they constitute the primary key.

PARTITION KEY

The partition key is defined as three named fields, as below:
(myfamily, myseries, QUANTUM(time, 15, 'm'))

The structure of the partition key must be defined in the following way –

1- The first field (family) is the type or class of the data.
2- The second field (series) identifies a particular instance of the class or type, such as a device ID or user ID.
3- The third field (quantum) sets the time interval by which the data is grouped.

The quantum function has the following three parameters –

1- The quantity
2- The unit of time
   I) d – days
   II) h – hours
   III) m – minutes
   IV) s – seconds
3- The name of a field in the table definition of type TIMESTAMP (written first in the call, as in QUANTUM(time, 15, 'm'))

LOCAL KEY

The local key, or second key, must contain the same fields in the same order as the partition key, with the timestamp field itself in place of the quantum, as in the example above. This makes sure that the same fields that determine your data’s partition also dictate the sorting of data inside the partition.

The values in the partition key decide which vnodes handle its writes and queries. If the family and series fields are identical for a large number of writes and the specified quantum covers a long time interval, only n_val vnodes will process the incoming writes. During those times Riak TS will not be able to parallelize the workload across CPUs.
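The grouping performed by the quantum can be pictured with a small sketch. This is a simplified model written for illustration, not Riak TS code: it assumes a quantum is a fixed-width window aligned to the Unix epoch, and the helper names (quantum_start, partition_key) and the sample family and series values are hypothetical. In the real system the partition key is additionally hashed to choose vnodes.

```python
from datetime import datetime, timezone

# Milliseconds per unit, keyed by the unit letters used in QUANTUM(time, 15, 'm').
UNIT_MS = {"d": 86_400_000, "h": 3_600_000, "m": 60_000, "s": 1_000}


def quantum_start(timestamp_ms: int, quantity: int, unit: str) -> int:
    """Return the start (ms since the epoch) of the window containing timestamp_ms."""
    width = quantity * UNIT_MS[unit]
    return (timestamp_ms // width) * width


def partition_key(family: str, series: str, timestamp_ms: int,
                  quantity: int = 15, unit: str = "m") -> tuple:
    """Illustrative partition key: rows with equal keys would be co-located."""
    return (family, series, quantum_start(timestamp_ms, quantity, unit))


def to_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the epoch."""
    return int(dt.timestamp() * 1000)


if __name__ == "__main__":
    t1 = to_ms(datetime(2017, 1, 1, 10, 2, tzinfo=timezone.utc))
    t2 = to_ms(datetime(2017, 1, 1, 10, 14, tzinfo=timezone.utc))
    t3 = to_ms(datetime(2017, 1, 1, 10, 16, tzinfo=timezone.utc))
    # 10:02 and 10:14 fall in the same 15-minute window; 10:16 does not.
    print(partition_key("watch", "user42", t1) == partition_key("watch", "user42", t2))
    print(partition_key("watch", "user42", t1) == partition_key("watch", "user42", t3))
```

Under these assumptions, a very wide quantum combined with a single family and series maps every write to one key, which is the situation in which only n_val vnodes receive the writes.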

The full efficiency of Riak TS queries is achieved when the data being queried is in the same family, series, and quantum, because all keys and values are written contiguously on disk. For each co-located block of data, a sub-query is created.

V. Riak TS Implementation –

Dataset –

The dataset that we have chosen is an open data set for human activity analysis. It is taken from the Kaggle website.

Our dataset is 1 GB in size and contains about 1 million records. The dataset has two reports, namely smartwatch and smartphone.

The following image shows the dataset that contains various activities carried out on a smartwatch.

Fig 1 – Smartwatch Dataset

The following image shows the various activities carried out on a smartphone with respect to time.

Fig 2 – Smart Phone Dataset

The following image shows the time taken for uploading both datasets.

Fig 3 – Time taken for the dataset to be uploaded

Setting up the environment –

1- First, we check for all the necessary updates and keep all the packages updated using the following command.

$ sudo apt-get update

2- Then we install the development package for the pluggable authentication module (PAM), which is necessary for the installation of Riak TS.

$ sudo apt-get install libpam0g-dev

3- We have implemented our project on GCP (Google Cloud Platform), and hence we download the database package using the following command. The downloaded .deb package must then be installed, for example with dpkg -i.

$ wget http://s3.amazonaws.com/downloads.basho.com/riak_ts/1.5./1.5.2/Debian/jessie/riak-ts_1.5.2-1_amd64.deb

4- We can check whether the database is successfully installed by using the following command.

$ dpkg -l | grep riak

The output has to list riak-ts to confirm a successful installation.

VI. RESULTS –

The following image shows the list of smartwatch activities, from which the following can be concluded –
• Highly used activities
• Least used activities
• Activities that influence the battery’s power-withstanding capacity

Fig 7 – Graph showing the various activities performed and the number of times they are being carried out

The following image shows the reduction of battery power from start to end due to the various activities performed. From this we can conclude the following points:
• Capacity of the phone’s battery
• The user’s degree of phone addiction
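The "highly used" and "least used" activity lists in these results amount to a frequency count over the activity column. The following is a minimal sketch of that idea, not the queries used in the project: the records are made-up sample rows, and the (timestamp, activity) layout and the helper name activity_usage are assumptions rather than the dataset's actual schema.

```python
from collections import Counter

# Hypothetical (timestamp_ms, activity) rows standing in for query results.
records = [
    (1483264800000, "walking"),
    (1483264860000, "sitting"),
    (1483264920000, "walking"),
    (1483264980000, "running"),
    (1483265040000, "sitting"),
    (1483265100000, "walking"),
]


def activity_usage(rows):
    """Count how often each activity occurs, most frequent first."""
    counts = Counter(activity for _, activity in rows)
    return counts.most_common()


ranked = activity_usage(records)
most_used = ranked[0]    # activity with the highest count
least_used = ranked[-1]  # activity with the lowest count

if __name__ == "__main__":
    print("most used:", most_used)
    print("least used:", least_used)
```

The same ranking could instead be computed inside the database with a grouped count, which avoids transferring every row to the client.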

Fig 4 – The various activities carried out

Fig 5 – The various activities causing power drain in smartphone

The following image shows the graph drawn from the step count of the user. From this data, the following can be inferred –
• The physical activity of the user over the period
• The days on which the user must improve his/her activity
• The adjustments required in daily activity for a healthy life

Fig 6 – Graph depicting the number of steps covered by a person over a period

VII. CONCLUSION –

By implementing the above project using Riak TS we have understood the functioning of the Riak TS database. We also learnt the key aspects and terms related to Riak TS. We understood the various activities influencing the battery power drain in the smartphone. We also measured the number of steps covered by a person over a period of time, a measure of the physical activity that is a must for a healthy life.

VIII. REFERENCES –

[1] Data set taken from https://www.kaggle.com/sasanj/human-activity-smart-devices
[2] http://basho.com/products/riak-ts/
[3] https://opensource.com/life/16/9/time-series-analysis-riak-ts
[4] https://db-engines.com/en/system/Riak+KV%3BRiak+TS
[5] http://docs.basho.com/riak/ts/1.5.2/setup/
[6] https://github.com/cvitter/Riak-TS-Data-Modeling