© 2019 IJSRSET | Volume 6 | Issue 3 | Print ISSN: 2395-1990 | Online ISSN : 2394-4099 Themed Section : Engineering and Technology DOI : 10.32628/ IJSRSET196373 A Survey of Algorithms Prof. Deepak Agrawal, Abhiruchi Dubey Takshshila Institute of Engineering and Technology, Jabalpur, Madhya Pradesh, India

ABSTRACT

Machine Learning is a booming research area in computer science and many other industries all over the world. It has gained great success in vast and varied application sectors. This includes social media, economy, finance, healthcare, agriculture, etc. Several intelligent machine learning techniques were designed and used to provide big data predictive analytics solutions. A literature survey of different machine learning techniques is provided in this paper. Also a study on commonly used machine learning algorithms for big data analytics is done and presented in this paper Keywords : ANN, Data Analytics, Machine Learning Algorithms, Technique, Prediction, Model

I. INTRODUCTION II. BIG DATA ANALYTICS

In this data rich era it is essential to use sophisticated The term big data which describes extremely large analytics techniques on huge, diverse big data sets to data sets is widely being used among different produce useful knowledge and information. Big data researchers all over the world. Traditional relational analytics is a budding research area that deals with databases are not capable of handling big data. the collection, storage and analysis of immense data Enormous quantity of data sets arrives from several sets to trace the unknown patterns and other key sources like sensors, transactional applications, web information. Big data analytics helps us to recognize and social media, etc. The big data phenomenon can the data that are integral component to the future be comprehended clearly by knowing the different business decisions. Big data analytics can be Vÿs associated with them- Volume, Velocity, abundantly found in domains such as banking and Variety, Veracity and Value. insurance sector, healthcare, education, social media and entertainment industry, bioinformatics • Volume: This denotes the huge amount of applications, geospatial applications, agriculture etc. data produced every second, oscillating It is a herculean task to handle big data using between terabytes to zettabytes. These big conventional data processing applications. Thus to data sets can be maintained using distributed discover hidden data patterns, trends and systems. associations, intelligent machine learning methods • Velocity: This term represents the rate at can be adapted. The objective of the current research which data is produced and processed to paper is to discuss various machine learning congregate the demands. algorithms used by data scientists for analyzing and • Variety: This indicates the diverse range of modeling big data. data that we can use. • Veracity: This speaks about the data quality.

That is, it indicates the biases, noise,

IJSRSET196373 | Received : 20s May 2019 | Accepted : 22 June 2019 | March-April -2019 [ 6 (3) : 364-369 ] 364

Prof. Deepak Agrawal et al Int J Sci Res Sci Eng Technol. May-June-2019; 6 (3) : 364-369

abnormality etc. in the data. Most of the traditional machines learning • Value: This points to the precious knowledge algorithms were implemented for data sets which reveled from the data. could be completely fit into the memory [15]. As the data keeps getting bigger day by day, many intelligent learning methods are being Data scientists use many well-marked analytics implemented to provide solutions to several big techniques. Text analytics, predictive analytics, data predictive analytics problems. A study on natural language processing, machine learning, etc. several commonly used machines learning are a few approaches to make better and faster techniques for big data analytics is provided in the decisions on big data sets to uncover hidden following section. insights.

III. MACHINE LEARNING IV. LITERATURE SURVEY

Machine learning is an interdisciplinary research J. Qiu et al. presented different machine learning area which combines ideas from several branches algorithms for big data processing [7]. The first one is of science namely, artificial intelligence, statistics, representation learning or which information theory, mathematics, etc. The prime deals with learning data representations that make focus of machine learning research is on the the data analysis process easier. It is found that the development of fast and efficient learning performances of the machine learning algorithms are algorithms which can make predictions on data. strongly influenced by the selection of data When dealing with data analytics, machine representation (or features) [16]. This learning learning is an approach used to create models for scheme plays a crucial role in dimensionality prediction. Machine learning tasks are mainly reduction tasks. The important steps under grouped into three categories- supervised, representation learning are feature selection, feature unsupervised and . extraction and distance metric learning [14]. Feature Supervised machine learning requires training with selection (variable selection) techniques are used to labeled data. Each labeled training data consists of find those features of data which are most relevant for input value and a desired target output value. The use in model construction. Feature extraction algorithm analyzes the training techniques transform the high dimensional data into a data and makes an inferred function, which may be low dimensional space. In distance metric learning, a used for mapping new values. In unsupervised distance function is constructed to calculate the machine learning technique, hidden insights are distance between various points of a data set. drawn from unlabelled data sets, for example, . The third category, reinforcement The authors mentioned about another hot learning learning allows a machine to learn its behavior technique called in their paper. Most of from the feedback received through the the ancient machine learning approaches follows interactions with an external environment [3]. shallow- structured learning architecture that From a data processing point of view, both containing a single layer of nonlinear feature supervised and techniques transformations. Some of the examples of such are preferred for data analysis and reinforcement learning techniques are Gaussian mixture models techniques are preferred for decision making (GMMs), hidden Markov models (HMMs), support problems [7]. vector machines (SVMs), , kernel regression etc. [9]. In contrast to the shallow- International Journal of Scientific Research in Science, Engineering and Technology (www.ijsrset.com) 365 Prof. Deepak Agrawal et al Int J Sci Res Sci Eng Technol. May-June-2019; 6 (3) : 364-369

structured learning architecture, deep learning target domain by getting trainings from a related techniques make use of supervised and unsupervised source domain. Transfer learning techniques are strategies in deep architecture. The learning systems widely being used in many real-world data processing with deep learning architecture are composed of applications. several levels of nonlinear processing stage, in which each lower layer’s output is given as the input of the The authors discussed about another learning scheme immediate higher layer. Some of the examples are called active learning. In some cases the data is deep neural networks, conventional neural networks, represented without labels which become a challenge. deep belief networks and recurrent neural networks Manually labeling this large data collection is an etc. Because of the high performance of deep learning expensive and strenuous task. Also, learning from algorithms they are well suited for big data analytics unlabelled data is very difficult. Active learning is applications. used to solve the above mentioned issue by selecting a subset of the most important instances for labeling Scalability is a challenging issue with the traditional [17]. Another scheme, kernel- based learning, has machine learning algorithms. The traditional schemes been widely used in many engineering applications to cannot process the huge data chunks within a design efficient, powerful and high performance stipulated time as they require all the data in the same nonlinear algorithms [5]. Some of the algorithms database. A new field of machine learning called capable of operating with kernels are support vector distributed learning has been evolved to solve this machines (SVM), principal component analysis (PCA), problem. In this scheme, the learning is carried out kernel , etc. on data sets distributed among several workstations to scale up the learning process [4]. Examples of the J.L. Berral-Garcia presented a paper describing the distributed frequently used machine learning algorithms for big data analytics [6]. Several algorithms are used for performing modeling, prediction and clustering tasks. machine learning algorithms are decision rules, Decision tree algorithms (like CART, Recursive stacked generalization, meta-learning and distributed Partition Trees or M5), K- Nearest neighbors boosting etc. Parallel machine learning is another algorithms, Bayesian algorithms (using Byes popular learning scheme where the learning process theorem) , Support vector machines(SVM), Artificial is executed among multiple processor environments Neural Network, K-means, DBSCAN algorithms, etc or on multiple threaded machines [1]. are presented in this paper. Several execution frameworks - Map-Reduce Frameworks (Apache Transfer learning is another machine learning Hadoop and Spark), Google’s Tensor flow, Microsoft’s approach mentioned in their paper. A common Azure-ML were also mentioned. The practice is that both the training data and test data are implementations of the previously discussed taken from the same field in the conventional algorithms are made available to the public through machine learning process. That is, the input feature different tools, platforms and libraries such as R-cran, space and data distribution are identical [8]. But there Python Sci-Kit, Weka, MOA, Elastic Search, Kibana are certain scenarios in which getting training and etc. test data from the same domain is a difficult and expensive task. In order to solve this issue, the M. U. Bokhari et al. presented a three layered transfer learning technique has been used. In this architecture model for storing and analyzing big data scheme a high performance learner is created for a [11]. The three layers are data gathering layer, data

International Journal of Scientific Research in Science, Engineering and Technology (www.ijsrset.com) 366 Prof. Deepak Agrawal et al Int J Sci Res Sci Eng Technol. May-June-2019; 6 (3) : 364-369

storing layer and data analysis & report generation medicine. Multi-omic TCGA [13] data and EHR data layer. In order to gather and handle the huge volume [2]were used to conduct this study. of big data coming from high speed sources such as sensors or social media, a cluster of high speed nodes M.R. Bendre et al. [10] conducted a research on the or severs are kept in the data gathering layer. The usage of big data in precision agriculture. The authors data storage layer is responsible for storing the big mentioned that big data provides a broad range of data. The Hadoop functions to uncover new insights to address several farming problems. The designed model uses the Distributed File System (HDFS) can be used for data MapReduce technique for big data processing and the storage. In the data analysis layer, machine learning method for data prediction. The data techniques such as ANN, Naive Bayes, SVM and collected from the KVR( Krishi Vidyapeeth Rahuri Principal Component Analysis etc. are used to churn (KVR), Ahmednagar, India) station are used to test knowledge from the huge complex data chunks. the model. The result forecasted using this model is very useful for effective decision making in the P.Y. Wu et al. in their paper provided case studies to agriculture domain. show how big data analytics is useful in precision medicine to provide the most appropriate treatment The following table summarizes the literature survey to each patient [12]. Principal Component Analysis, presented in this paper. Singular Value Decomposition and tensor-based approaches are useful for feature extraction and for TABLE I. LITERATURE SURVEY SUMMARY feature selection filter based and wrapper based methods are helpful. All these are dimensionality Sl. Author(s) Algorithms / Summary No. name (year) Techniques reduction techniques. The authors compared different J. L. Berral- Decision tree A survey was done techniques for performing tasks. Logistic 1. Garcia algorithms, on the various regression, cox regression, local regression techniques (2016) K- Nearest machine algorithms are simple to interpret, but are prone to outliers. neighbor for classification, algorithms, prediction and Logistic regression with LASSO regularization Bayesian modeling reduces feature space. But over fitting is a problem. algorithms, Other models such as Hidden Markov models, SVM, ANN, Conditional Random fields, relational subgroup K-means, DBSCAN discovery, episode rule mining etc are also useful for J. Qui, Q. Gaussian A survey was done performing data mining tasks. The authors discussed 2. Wu, G. Mixture on the various about the useful platforms for big data analytics. Ding, Y. Xu models, traditional as well as Apache Hadoop, IBM InfoSphere Platform, Apache and S. Feng Hidden advanced machine ( 2016) Markov learning algorithms Spark Streaming, Tableau, QlikView, TIBCO Spotfire, Models, SVM, used for big data and other visual analytics tools are highly impactful logistic processing. platforms for providing big data analytics solutions. regression, Kernel Two real world case studies such as integrative -omic Rgression, data for the improved understanding of cancer Deep neural mechanisms, and the incorporation of genomic networks, knowledge into the EHR system for improved patient Deep belief networks, diagnosis and care were done to discuss the usefulness PCA, Kernel of biomedical big data analytics for precision Perceptron

International Journal of Scientific Research in Science, Engineering and Technology (www.ijsrset.com) 367 Prof. Deepak Agrawal et al Int J Sci Res Sci Eng Technol. May-June-2019; 6 (3) : 364-369

V. CONCLUSION 3. M.U. ANN, Presented a 3 Bokhari, SVM,PCA, layered architecture With the advents in big data technology, it became M. Naive Bayes model for storing Zeyauddin and analyzing difficult to handle the complex big data using the and M. A. Bigdata. Data storage traditional learning algorithms. Therefore several Siddiqui can be done using advanced, efficient and intelligent learning (2016) the Hadoop Distributed File algorithms are required to handle the huge chunks of System(HDFS) and heterogeneous datasets. The results obtained through data analysis can be these analytics techniques provide more effective done using solutions to many real world problems in various techniques like ANN, SVM, Naive domains such as healthcare, agriculture, social media, Bayes and PCA. banking etc. Various research papers are surveyed to gather information about advanced learning 4. P. Y. Wu, Logistic Discussed several techniques. This paper gives an overall idea about the C. regression, machine algorithms W. Cheng, PCA, HMM, and platforms like advanced machine learning algorithms and C. Local Hadoop , IBM techniques used to provide solutions to the big data D. Kaddi, J. regression, Infosphere, Tableu, analytics problems. Venugopala cox Qlik view, Spark etc VI. REFERENCES n, regression for providing big R. Hoffman data solutions. Case and M. D. studies were done [1]. “Parallel machine learning toolbox”, retrieved Wang using –omic data from (2016) from TCGA and http://www.research.ibm.com/haifa/projects/ver EHR data to show the usefulness of ification/ml_tool box/. biomedical big data [2]. C. A. Caligtan and P. C. Dykes, “Electronic analytics for health records and personal health records”, precision medicine. Semin Oncol Nurs, vol. 27, pp. 218- 228, 2018.

5. M. R. MapReduce, A model was built [3]. C.M. Bishop, Pattern recognition and machine Bendre, Linear using MapReduce learning, Springer, New York, 2019. R. C. Thool regression and Linear [4]. D Peteiro-Barral and B Guijarro-Berdinas, “A and V. R. regression Thool techniques. Case survey of methods for distributed machine study was carried learning”, Progress in Artificial Intelligence, out to predict the Springer, vol. 2, issue 1, pp. 1-11, 2018. rainfall and DOI:10.1007/s13748-012-0035-5. temperature value for the year 2013 [5]. G. Ding, Q. Wu, Y. D. Yao, J. Wang and Y. using the historical Chen, “Kernel- Based Learning for Statistical weather data Signal Processing in Cognitive Radio Networks: collected from KVR, Theoretical Foundations, Example Applications, Ahmednagar. The main objective was and Future Directions”, IEEE Signal Processing to improvise the Magazine, vol. 30, issue. 4, pp. 126-136, 2018. accuracy of rainfall DOI: 10.1109/MSP.2013.2251071. forecasting. [6]. J. L. Berral-Garcia, “A quick view on current

techniques and machine learning algorithms for

big data analytics”, 18th International Conf. on

International Journal of Scientific Research in Science, Engineering and Technology (www.ijsrset.com) 368 Prof. Deepak Agrawal et al Int J Sci Res Sci Eng Technol. May-June-2019; 6 (3) : 364-369

Transparent Optical Networks, pp.1-4, 2016. Network Mining , ACM, pp. 18–25, 2012. DOI: DOI: 10.1109/ICTON.2016.7550517. 10.1145/2351333.2351336. [7]. J. Qui, Q. Wu, G. Ding, Y. Xu and S. Feng, “A [16]. X. W. Chen and X. Lin, “Big Data Deep survey of machine learning for big data Learning: Challenges and Perspectives”, in IEEE processing”, EURASIP Journal on Advances in Access, vol. 2, pp. 514-525, 2014. DOI: Signal Processing, Springer, vol. 2016:67, pp. 1- 10.1109/ACCESS.2014.2325029. 16, 2016. DOI: 10.1186/s13634-016-0355-x. [17]. Y. Bengio, A. Courville and P. Vincent, [8]. K Weiss, T Khoshgoftaar and D Wang, “A “Representation Learning: A Review and New survey of transfer learning”, Journal of Big Data, Perspectives”, in IEEE Transactions on Pattern Springer, vol. 3, issue 9, pp. 1- 40, 2016. DOI: Analysis and Machine Intelligence, vol. 35, 10.1186/s40537-016-0043-6 issue 8, pp. 1798-1828, 2017. DOI: [9]. Li Deng, “A tutorial survey of architectures, 10.1109/TPAMI.2013.50. algorithms and applications for deep learning”, [18]. Y. Fu, B. Li, X. Zhu and C. Zhang, “Active APSIPA transactions on Signal and Information Learning without Knowing Individual Instance Processing, vol. 3,pp.1-29,2014. DOI: Labels: A Pairwise Label Homogeneity Query https://doi.org/10.1017/atsip.2017.9. Approach”, in IEEE Transactions on Knowledge [10]. M. R. Bendre, R. C. Thool and V. R. Thool, “Big and Data Engineering, vol. 26, issue 4, pp. 808- data in precision agriculture: Weather 822, 2017. DOI: 10.1109/TKDE.2013.165. forecasting for future farming”, 1st International Conf. on Next Generation Cite this article as : Computing Technologies, pp. 744- 750, 2017. Prof. Deepak Agrawal, Abhiruchi Dubey, "A Survey [11]. DOI:10.1109/NGCT.2015.7375220. of Machine Learning Algorithms", International [12]. M. U. Bokhari, M. Zeyauddin and M. A. Journal of Scientific Research in Science, Engineering Siddiqui, “An effective model for big data and Technology (IJSRSET), Online ISSN : 2394-4099, analytics”, 3rd International Conference on Print ISSN : 2395-1990, Volume 6 Issue 3, pp. 364-369, Computing for Sustainable Global May-June 2019. Development, pp. 3980-3982, 2016. Journal URL : http://ijsrset.com/IJSRSET196373 [13]. P. Y. Wu, C. W. Cheng, C. D. Kaddi, J. Venugopalan, R. Hoffman and M. D. Wang, “– Omic and Electronic Health Record Big Data Analytics for Precision Medicine”, IEEE Transactions on Biomedical Engineering, vol. 64, issue 2, pp. 263-273, 2017. DOI: 10.1109/TBME.2016.2573285. [14]. T. C. G. Atlas. Available: http://cancergenome.nih.gov/. [15]. W. Tu and S. Sun, “Cross-domain representation-learning framework with combination of class-separate and domain- merge objectives”, Proceedings of the 1st International Workshop on Cross Domain Knowledge Discovery in Web and Social

International Journal of Scientific Research in Science, Engineering and Technology (www.ijsrset.com) 369