A Study on Big Data in Health Care Industry

Shivani Jain1, Alankrita Aggarwal2, # 1 Research Scholar, Department of Computer Science, IGDTUW, New Delhi, India 2 Assistant Professor, Department of Computer Science and Engineering Panipat Institute of Engineering and Technology, Samalkha (Haryana) #Corresponding Author, Email: [email protected]

Abstract— With the development in health care meaningful information. Big data [1] has the industry, more and more healthcare data are being potential to manage the large datasets. Every day a collected from different sources. Extracting, maintain and used further for analysis is a complex task in the health zettabyte (1021 gigabytes) of data being generated care industry. Health care data is generated enormously by healthcare industry in India. Various sources are every bit of a second analyze that data is difficult by any there, through which health care data is generated conventional methods so data big data gain interest among the research communities. Big data is a viable solution to such as medical record of a patient, images of solve these problems. In this paper, authors give an infected body part, IOT data, Smartphone health overview of the big data possibility in healthcare domain. data. Many applications are developed that record It outlines the 5V’s of big data in health care and a brief the health data of every second of a user. Such a description about how data is generated from different sources and processed though the big data software and huge data cannot be processed and managed by the how these processes are performed. We have also conventional methods of software that only record summarized few of the software tools used for health care and used for query system. To analyse such records domain. At last, discussed the challenges faced to process of patient’sBig data software tools are required. It health care data efficiently. helps health care organization to improve care, save Keywords— Big Data, Map-Reduce, Hadoop, Healthcare, lives and reduce cost and provide quality of health Clinical data, Software Tools care services [2]. Big data assures to mitigate the 1. INTRODUCTION change to authenticate the data-driven healthcare, to Health is the major concern in all the human take decisions for individual patients based on beings. A good heath is essential for a human life. customized analysis and to detect the risk at an But now a days majorly all person around the globe early state. It is vitally important for the healthcare suffers from various health issues such as organization to be aware of sources of data hypertension, flu, blood pressure, lung infection, generated, how to process data to leverage big data heart diseases, mental disorder etc. A large volume effectively so as to improve efficiency and quality of data is generated from the various methods to of health care delivery. Different software tools, record the human health conditions. Such a huge machine learning techniques are used in big data for volume of data is generated through these records extraction of meaningful information from the are so complex and difficult to manage and process health care domains. In this research paper, authors through conventional data management tool and try to summarize some of techniques used in health techniques. Figure 1 shows the volume of health care domain for better understating of health care data generated around the globe. So, there is a records of human. need of high process devices and software tools which can process large data to extract useful and manifest all of the V’s and it is inevitable that we will use big data techniques to solve them. Features of bigdata are expressed in 3V’s [5]. Many other

characteristics of big data have been proposed features like ‘Value’ and ‘Veracity’. Volume deals with a enormous amount of data being generated; Variety of data coming from variety of data

sources; Velocity is the speed at which data is generated, captured and gathered; Veracity deals with trustworthiness of data and variety of data results in data inconsistency and ambiguity and a value is to identify which data is valuable than transformed and analyzed. Fig.1 illustrates the characteristic of big data.

2.1 5 V’s of Big Data in Healthcare- Volume: Volume comes from the large number of Fig. 1: Health care data generated around the Globe. datasets that include Personal Medical Records, This paper is structured as follows: Ssection Electronic Medical Records (EMR), sensor reading, IIdescribe the big data in healthcare and various source clinical reports, X-rays reports and laboratory results. of data generated in the health care domain; Ssection III Velocity: Healthcare related data is accumulated in at explains the techniques used for processing big data and high speed while monitoring patients’ current condition the major software used in health care sector; Ssection through sensor or tracking web-posts (such as from IV discusses the various challenges faced in healthcare twitter) for clinical decision support. domain, Section V has the conclusion and the future work of the research paper.

2. BIG DATA IN HEALTHCARE Healthcare produces structured and unstructured data from the different source at a rapid speed such as electronic patient’s records contains a different type of information as blood pressure, heart rate, pulse rate, body temperature, past diseases information. U.S healthcare only generated 1150 Exabyte of data in the year 2017. At this rapid pace, health care industry data in U.S. will reach to zeta- byte (1021 gigabytes) in numbers or even yotta- Fig. 2: Characteristics of Big Data bytes (1024 gigabytes) in numbers [3].A patient hospitalisation generates large amount of data Variety: Large heterogeneous data generated every records that includes diagnosis, medical records, day from the different source which includes digital image processing, laboratory rreports, and sensors data, clinical data report, prescriptions for hospital bills, based on these data available certain detection and diagnosis of disease. outcomes can be predicated.In U.S healthcare, Veracity: Patient may have uncertain and dubious analysis of big data helps to save roughly $300 information. Let’s take an example-Patients Health billion every year, 66% of that Records may subsist of typing error, acronyms, and throughdiminishments of proximately 8% in incorrect spellings. Data from social networking expenditures value of national healthcare[4] as sites can prompt to inaccurate decision as they are estimated by McKinsey. Healthcare problems unaware of the context of data and data coming from uncontrolled active environments. Data encourages patients to take more care of their qualities are major cause of concern for decision health. These EHR records further used for making in healthcare. predicative analysis now a days. As such data is Value: Large set of valuable and accurate data are huge so we need Big data tools to process them and generated from sensor data and electronic to identify useful pattern from the health records. monitoring of data for the detection of disease. Wearable devices and Sensor Data: Now a day’s HealthCare data sources: Health care data can be sensors and wearable devices are used to measure generated through various sources, which are the health records of the user. Many health different in structure and have different complexity applications are developed to measure the to extract the useful pattern from these records. In temperature, bp, pulse record, heart beat record of this section we have shown the various sources the patients. Through, these application health from which health care data can be recorded. Fig. 3 records are stored in mobile memory. Companies shows the sources of big data in health domain. like apple launched apple watch (http://www.appple.com/watch/) which measures heart rate for diabetic patients. Wearable and implantable sensors were developed, which providepersistent monitoring and glucose level which is outstanding to be eating regimen subordinate [8]. Sensors senses, monitors and capture critical data and provide continuous health- related information to find useful pattern for early detection of health issues of patients. Social media or web data: Social media helps in increase communication between patients, physician, and communities. Twitter tweets, blogs, Fig. 3: Various Data sources in Health care status posted on platforms, web pages like Facebook, all these include health related pages, 2.2 Electronic Health Records: Traditionally, the websites, pages, smart phone application etc. For health records are written by the doctors and stored analysis of spread of disease or transmission many by the patients only. No copy is available with the online social networks data are generally used [9, doctors and hospital. Next time if the patient visits 10]. Johns Hopkins School of Medicine the doctor, no previous data is available with the Researchers analyzes that “google flu trends” data doctor. To overcome this problem now the doctors could be utilized to anticipate surges in flu-related and hospital are maintaining patient’s data using crisis. Additionally, Twitter updates were as precise of the patients that can as official reports, Haiti to track the effect of used for analysis and accurately assessing the health cholera Haiti in the month of January after the disorder.Doctors composed notes and prescriptions, earthquake they were moreover two weeks earlier like X-rays, CT scan, MRI, [11]. Twitter data can efficiently track an epidemic Laboratory report, Pharmacy, and insurance claims across a given population (in real time). In the are used in electronic form to determine the cost- recent time Covid -19 flu outbreak is analysed effective way to diagnosis and treat patients using the google trends. These data is used to accurately. Electronic Health Records (EHR) [6] predict the Ppatients with perpetual diseases, for also comes under clinical data records. The example- diabetics, cancer and cardiac problems California based healthcare network called Kaiser patients sharing experience with other patients with Permanente, which has memberships around 9 the similar condition on social media resulting in million, archived around 26.5 and 44 petabytes of big data and status updates, text messages and posts data from sources like medical images, EHRs and on Twitter, Facebook provides healthcare related comments [7]. EHR measures outcomes and information. 2.3 Patients Records: Medical images such as X- Tracker and Task Tracker deals with the map and rays, Brain images, fingerprints, retrieve scan, reduce tasks [23]. Job Tracker is the master of the computed tomography are important sources of data system, handles and stores the blocks which are in used for early diagnosis of diseases using image Task Tracker whereas task tracker is the slave of segmentation and machine learning techniques[12- the system, which provides storage for data and 16].Healthcare systems in recent scenario utilizes takes the responsibility of the implementation of various divergent and interminable tracking devices tasks. The figure below describes the Hadoop Map- that use particular discretized or physiological Reduce process, whichincludes input data and five waveform information to give alert direction in case phases i.e., split, map, shuffle, reduce and output. of obvious events.These records are stored in the For complex task, combine together an array of form of Images, so analysis are done on these Map-Reduce jobs and process them in sequentally. records with the help of image process software Using directed acyclic graph (DAG) pattern tools. Different machine learning and deep learning developers were allowed to design miscellaneous, algorithm are applied in these areas to early multistep data pipelines in Spark [24- 27], in- identification of health patterns and disease. memory data sharing among DAGs is also 2.4 Payer Records: Billing records and health care supported by the spark to perform diverse jobs with claims are available in semi-structured and same data and results in better performance than unstructured formats. Healthcare insurers are using other big data technologies. big data to develop customizing insurance product, managing risk and reducing health care costs. These health records are used for insurance companies for their strategy building. 2.5 Publicly Available data: Health data of the human being are used by the government, private sector to analyse the country growth and to measure the overall health of people of their countries. These data can be used for research, to find the vaccine and cure for the particular diseases. Government agencies record the heath condition of their country people to prepare the health policy. A huge amount of revenues is spent to increase the health care of people.

Figure 4: Hadoop Map-Reduce process. 3. PROCESSING BIG DATA Big data needs to be processed and store large To serve upgraded and extra functionality, Spark datasets. MapReduce [17-18] and its open source runs over existing Hadoop Distributed File System implementation Hadoop [19], leading computing (HDFS) platform. Operating drivers in Spark platforms for big data. Hadoop is built to gather and utilizes a pair of operation i.e. (1) Action and (2) analyse data in petabytes and Exabyte scale. Transformation. Action is same as the reduce and Hadoop [20] uses Hadoop distributed file systemis transformation is same as the map operation. Here extremelyfault tolerantt. HDFS is manufactured to we have shown the working of the Hadoop map- handle large extent data and deployed on hardware reducer used in big data technology, but using the with minimal-cost and provide high processing big data technology our main purpose is to handle speed. In 2011, an adverse drug event (ADE) the data that is generated by the health care sector. detection has been developed based on MapReduce The data generated by the health sector is having lot algorithm [21]; for mining useful data for clinical of complexity as its volume is high and highly decision-making purpose. MapReduce [22] is the diverse in nature. EMR health records are combination of map and reduce function. Job maximally written in hard form, first change into Table 1. Software used to store and analyze medical health electronic readable device. For this scanner are used records of the patients. but scanned documents are difficult to process and . read by the machine as every handwritten document is different. Different AI based algorithms are used to fetch the information from the scanned 5. CONCLUSION documents. All data is un-structural format and Big data has great potential to change the way different machine learning techniques are used to Name of Company Software Tools for Analysis Software Name Used read, analyze and predict the information from ClearHealth Clear health PHP, Java Used EMR, these records. Maximum data used in health care Inc Script Scheduling and billing sector in the Image form. So, to process the Open Dental PHP, Used for Dental thousands of image file is a big issue in health care Software MySQL Records OpenHospital Informatici Java Used for billing, system. Here we have summarized a few software Senza scheduling and used in medical domain for storing, analyzing and Frontiere maintaining records CamBA GNU GPL Java, Collecting predicting health care data. Python neuroimaging GIMIAS CISTIB C++ Biomedical Imaging University processing & of Sheffield simulation. 4. CHALLENGES MITK NVIDIA AI, Used to create There are various challenges are faced to mine the Medical Imaging Transfer interactive medical Interaction Toolkit learning, image for better information in the field of health care sector CNN understanding. Clinical data are in unstructured and semi- Caisis GNU GPL C#, XML, Web based HTML, information system structured format, needs to be processed and mined Java for collecting, by the different platform, having different context. Script storing and analyzing the cancer Every practitioner has its own opinion to deal with patient data the problem so standardization is difficult for health cTAKES “clinical Mayo Clinic Python It uses NLP Text Analysis processing system to care data. Healthcare data is rarely standardized, Knowledge identify and extract often fragmented with incompatible format. Privacy Extraction different entities Software” from un-structural and Security is a major problem as data is text data. heterogenous and are generated from the diverse LabKey Server LabKey Java Used for analyzing, sharing, integrating environment. Health records having patients’ and storing medical personal records so security and privacy of the data research data, web- based application for is the main concern for any health care company. querying and Leverage healthcare data deliver quality healthcare reporting across different platform services to patient’s while Glucosio GNUGPL IOS, IOT This is mobile 1. preserving it private and secure. application used to measure the glucose 2. Genomic data analysis is a computationally hard level of diabetic job and integration with standard clinical data adds patients. more difficulty and add an extra layer of healthcare provider uses data sources and complexity. technologies to get vision from the clinical and 3. Data structure of health care data is highly complex other health related repositories. Big data mining and difficult to handle by any machine learning software’s improved efficiency of health services algorithm, DNA one sequence conations millions of and providing quality services at low cost. Machine possible combinations, to identify useful pattern is a learning and deep learning methods have gained complex task. much interest among the research communities to 4. Smart phone applications generate billions of data solve health care data. Health care data is complex every second, storing and processing all the data in nature and are generated from the heterogenous without human intervention is a difficult task. sources so standardization is not possible. In this paper, we have summarized the sources of health- related data and the processes involved to analyse 8. M. Breton, A. Farret, D. Bruttomesso, S. the data for information extraction. Authors have Anderson, L. Magni, S. Patek,C. D. Man, J. also discussed the various software used in health Place, S. Demartini, S. Del Favero, C. care domain. Although various studies are available Toffanin, C. HughesKarvetski, E. Dassau, H. is the field however this research paper focus on Zisser, F. J. Doyle, G. De Nicolao, A. health sector, big data and shows different Avogaro,C. Cobelli, E. Renard, and B. challenges in this area that leads a better Kovatchev, “Fully integrated artificial pancreas understanding of healthcare services at an early in type 1 diabetes: Modular closed-loop stage. We also surveyed major software used in glucose control maintains near healthcare sector. normoglycemia,” Diabetes, 61, 2230–2237, In future, authors will work on various methods of 2012 machine learning and deep learning applied in 9. A. Sadilek, H. Kautz, V. Silenzio, “Predicting Health domain for delivering quality health services disease transmission from geotagged micro- at minimal cost. blog data”, Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012. REFERENCES 10. J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large 1. Zikopoulos PC, Eaton C, DeRoos D, Deutsch T clusters”,6th USENIX Symp. Oper. Syst. Des. and Lapis G. “Understanding Bigdata,” New Implementation, 137-150, 2004. York et al: McGraw-Hill, 2012. 11. A. Sadilek, H. Kautz, V. Silenzio, Modelling 2. Knowledgent: Big Data and Healthcare Payers; spread of disease from social interactions, in: 2013. http:// knowledgent.com / media page Sixth AAAI International Conference on /Insights/whitepaper/482. Weblogs and Social Media (ICWSM), 2012, 3. Cerrato, P. “Is Population Health Management http://www. cs.rochester.edu /~kautz/ papers the Latest Health IT Fad?”, 2012, Information /Sadilek-Kautz-Silenzio_Modeling-Spread-of- Week: http://www.info4mationweek.com/ Disease-from-Social Interactions_ICWSM- health care/clinical-systems/is-population- 12.pdf. health-management-latest-h/240004578. 12. TechAmerica Foundation, “Demystifying Big 4. Manyika J, Chui M, Brown B, Buhin J, Dobbs Data: A Practical Guide to Transforming the , Roxburgh C, Byers AH: Big Data: The Next Business of Government”. Washington, D.C.: Frontier for Innovation, Competition, and TechAmerica Foundation,2012 Productivity. USA: McKinsey Global Institute; 13. Mittal, A., Kumar, D., Mittal, M., Saba, T., 2011. Abunadi, I., Rehman, A., & Roy, S. “Detecting 5. Laney,D. "3D Data Management: Controlling d Pneumonia Using Convolutions and Dynamic ata volume, velocity and variety," ed: Meta Capsule Routing for Chest X-ray Group, 2001 Images”. Sensors, 20(4), 1068. 6. Forrest, G. N., Schooneveld,T. C. Van, Kullar, 14. Mittal, M., Kaur, I., Pandey, S. C., Verma, A., R., Schulz, L. T., Duong, P. and Postelnick, & Goyal, L. M. “Opinion Mining for the M. “Use of electronic health records and Tweets in Healthcare Sector using Fuzzy clinical decision support systems for Association Rule”. EAI Endorsed Transactions antimicrobial stewardship,” Clin. Infectious on Pervasive Health and Technology, 4(16). Dis., 59,122–133, 2014 15. Kaur, B., Sharma, M., Mittal, M., Verma, A., 7. IHTT: Transforming Health Care through Big Goyal, L. M., & Hemanth, D. J. “An improved Data Strategies for leveraging big data in the salient object detection algorithm combining health care industry, 2013. background and foreground connectivity for http://ihealthtran.com/wordpress/2013/03/iht% brain image analysis”. Computers & Electrical C2%B2-releases-big-data-research- Engineering, 71, 692-703. reportdownload-today/ 16. Chhetri, B., Goyal, L. M., Mittal, M., Gurung Maryland – USA: American Medical S. “Consumption of Licit and Illicit Substances Informatics Association; 2011:1464. leading to Mental Illness: A Prevalence Study”. 22. E. A. Mohammed, B. H. Far, and C. Naugler, EAI Endorsed Transactions on Pervasive “Applications of the MapReduce programming Health and Technology, 6(21). framework to clinical big data analysis: current 17. Mittal, M., Arora, M., Pandey, T., & Goyal, L. landscape and future trends,” Bigdata M. “Image Segmentation Using Deep Learning Mining,7(22), 1–23, 2014. Techniques in Medical Images. 23. Huang Lu, Hu Ting-ting, Chem Hai-shan,” In Advancement of Machine Intelligence in Research on Hadoop cloud computing model Interactive Medical Image Analysis” 41-63, and its applications”, 2012 Third International Springer, Singapore. Conference on Networking and Distributed 18. Mittal, M., Singh, H., Paliwal, K. K., & Goyal, Computing,59-63. L. M., “Efficient random data accessing in 24. Singh, A., Mittal, M., & Kapoor, N, Data MapReduce”, 2017 International Conference processing framework using Apache and Spark on Infocom Technologies and Unmanned technologies in big data. In Big Data Systems (Trends and Future Directions) Processing Using Spark in Cloud, 107-122, (ICTUS), IEEE, 552-556. Springer, Singapore. 19. Hadoop. (2014) [Online]. Available: 25. Mittal, M., Balas, V. E., Goyal, L. M., & http://hadoop.apache.org/. Kumar, R. (Eds.) “Big data processing using 20. XieJiong, Yin Shu, RuanXiaojun, Ding spark in cloud”, Springer. Zhiyang, TianYun, “Improving Mapreduce 26. Mittal, M., Balas, V. E., & Hemanth, D. J. performance through data placements in (Eds.), “Data Intensive Computing heterogeneous Hadoop cluster”, 2010. Applications for Big Data” 29, IOS Press. 21. Wang W, Haerian K, Salmasian H, Harpaz R, 27. “Big Data Processing with Apache Spark – Chase H, Friedman C: A drug-adverse event Part 1: Introduction”, 2015. extraction algorithm to support https://ww.infoq.com/articles/apache-spark- pharmacovigilance knowledge mining from introduction PubMed citations. In AMIA Annual Symposium Proceedings: 2011.Bethesda,