DOI: https://doi.org/10.37075/EA.2021.2.09
Economic Alternatives, 2021, Issue 2, pp. 315-328

Big Data Analysis Architecture

Received: 03.02.2021
Available online: 28.06.2021

Daniel Delchev*, Vanya Lazarova**

Keywords: Big Data, Traditional Big Data Architecture, Big Data Analysis Architecture.

JEL: O33, O36, L20

Abstract

Nowadays, because of the rapid changes in the environment, we can observe complex developments in companies. Furthermore, recent technological growth has rendered the business world a very competitive environment. It is essential for companies to make highly accurate decisions in a short period of time in order to compete in such an environment. They have to rely not only on the amount of data that they have gathered, but also on the efficiency of gathering new data from external sources. Big Data is a crucial part of the decision making of big and small enterprises. To use Big Data efficiently, it is necessary for companies to rethink the already integrated ways of gathering knowledge because of the enormous amount of data that is being stored. To refine the existing strategies, the data has to be analysed in order to discover inefficiencies in the systems. By removing those flaws, systems for gaining knowledge will become more adequate in the ever-changing environment.

Introduction. Data and Big Data Sources

In today's digital society, Big Data is essential for both small and large companies. Within a short period of time, storing and correctly using data became a vital part of the success of any company. Companies can gain a number of advantages over their competitors, because Big Data enables them to analyse the current market and foresee what changes might occur in the future. This knowledge allows them to make more accurate decisions about the future development of their business.

Big Data can also reveal disadvantages in their whole data infrastructure. Traditional architectures do not enable companies to explore a whole new range of possibilities, such as penetrating new markets, widening their existing ones, enriching their assortment, increasing their profitability and many more. Modern architectures must include additional data sources, new technological components and modern ways to analyse the data.¹

* PhD student, Department of Information Technologies and Communications, University of National and World Economy.
** Assoc. Prof., PhD, Department of Information Technologies and Communications, University of National and World Economy.
1 This work has been supported by the project No. BG05M2OP001-1.002-0002 "Digital Transformation of Economy in Big Data Environment", funded by the Operational Program "Science and Education for Smart Growth" 2014-2020, Procedure BG05M2OP001-1.002 "Establishment and development of competence centers".

Big Data has been researched by numerous scientists and experts in the field, but there are still many improvements to be made in order to overcome the challenges that Big Data has brought. Some of them are connected to data efficiency, credibility, sensitivity, accuracy and more. The most important aspects of Big Data science are gathering, storing, analysing, searching, updating and visualising the data. The data also has to be well protected, because it sometimes contains sensitive information.

In the past, data was characterized in three ways: in terms of speed, variety and size. Eventually, two more characteristics came into use: value and credibility. Nowadays, Big Data is defined as the work with huge arrays of data which describe a user's behaviour. A broader definition is data analysis whose objective is to obtain a meaningful result (such as a forecast for a stock market value). Such data can improve the work efficiency of a company or organisation in any field - government agencies, science, the business sector and financial technologies (Tsaneva, M., 2019).

The ability to synthesise useful data from large amounts of data is the process of data extraction. Specific techniques for extraction are becoming more and more sophisticated, and nowadays they go far beyond simply querying the information. Social networks, online markets, blogs, sensors, health care facilities and others produce petabytes worth of information. This quantity cannot be compared with the traditional volumes that people are used to.

The data has to be characterized accordingly in terms of variety, speed, size, credibility and value. Data can be unstructured, semi-structured or structured, which is reflected in the variety aspect. Speed is measured as the amount of data produced by the data source per unit of time. The size of the data shows how much data there is, and it usually grows exponentially. The credibility of Big Data shows the truthfulness of the information, but also takes into consideration its accuracy. Data value is required when we gather information from social data. Some people describe data value as a necessity when building any type of Big Data application, because it allows for the extraction of the most essential information. We can also observe a trend of data extraction using NoSQL databases (Radoev, M., 2017; 2019), which differ a lot from the well-known relational databases.

Everything described above proves the need to create efficient and innovative algorithms which can find patterns in data analysis models, discover different interdependencies in the data, and more. Only then would we be able to extract useful information from Big Data. Life in a digital era has rendered knowledge extraction from unstructured, semi-structured and structured data a main challenge we have to deal with. Big Data scientists have to have exceptional knowledge in the field and put a lot of hard work into their research in order to overcome the difficulties that Big Data imposes on companies.

One of the primary features of Big Data is the way it is structured. It is divided into three categories:

• Unstructured data - email, social networks (Facebook, Twitter, LinkedIn), text files, presentations, mobile data (text messages), communications (chat), phone records, location data (GPS) of mobile devices (mobile phones, tablets) and many more. Unstructured data analysis searches for patterns in text, media, photos and files with similar contents. That is quite different from conventional search, which queries documents using a given string. Text analysis searches for repeating patterns in documents, emails and recorded conversations in order to collect knowledge. There are several methods for unstructured data analysis, among them Natural Language Processing (NLP), Master Data Management (MDM) and other statistical techniques. Text analysis uses NoSQL databases which standardize the data so that it can be queried using languages such as Pig and Hive.

• Structured data - data generated by a computer (either automatically or with the help of human intervention), metadata (time of creation, file size, author, etc.), library catalogues (date, author, location, topic, etc.), mobile numbers and phone books. Such data is stored in databases (Microsoft SQL Server, Oracle, MySQL, etc.). Using queries, one can easily extract arrays of data which can later be used for decision making. Enterprises will continue to analyse structured data using methods such as the Structured Query Language (SQL).

• Semi-structured data - data which does not obey the tabular structure of the data models associated with databases or other forms of data tables, but nonetheless contains tags or other markers which separate semantic elements and enforce fields within the data. The main advantages of semi-structured data are:

◦ Data sources are not limited to a given data model, which allows for a broader range of available data sources;
◦ Data can be easily transferred between databases;
◦ The data model can be easily updated.

XML is a form of semi-structured data and is used for transferring data between servers and applications. An XML analyser evaluates the document's hierarchy and extracts the document's structure and/or its contents (a small sketch is given at the end of this section).

Big Data sources can be divided into three main categories:

• People (social networks) - This is data gathered from posts in social networks. The contents and the quality of the information depend on the platform, its purpose and the audience that the post has reached. Data gathered from news comments, and even the search history of engines such as Google, Bing, etc., also falls into this category.

• Automated systems (sensors and specialized mechanisms) - Sensors gather images from spacecraft and trace people's locations using applications on their mobile phones, IoT devices, etc. Mobile applications analyse geolocation data, and IoT devices automatically monitor activities such as tracking people's movements, rainfall information and more. Traffic from wireless and mobile devices is increasing rapidly and generating data for analyzing consumer behavior.

• IT systems in business models - This is data generated by Enterprise Resource Planning (ERP) systems. ERP refers to a type of software that organizations use to manage day-to-day business activities such as accounting, procurement, project management, risk management and compliance, and supply chain operations.

Knowledge extraction from these huge arrays of data is not at all an easy task. Even while you are reading this article, the data sources are generating a vast quantity of data. This is a big factor which strains a company's capacity, because every time new data is generated, every bit of information has to be processed and stored somewhere. That is why companies and organizations that want to take advantage of the benefits of working with Big Data must have technologies that extract knowledge from data - data mining. In addition, it is necessary to implement specialized algorithms from the field of machine learning (Mihova, Veska, 2015).
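To illustrate how an XML analyser of the kind described in the semi-structured data bullet above can walk a document's hierarchy and extract both its structure and its contents, here is a minimal Python sketch using the standard library. The invoice schema is a made-up example for illustration, not one taken from the article.

```python
# Minimal sketch: extracting structure and contents from semi-structured XML.
import xml.etree.ElementTree as ET

doc = """
<invoices>
  <invoice id="1001">
    <customer>Acme Ltd.</customer>
    <amount currency="EUR">250.00</amount>
  </invoice>
  <invoice id="1002">
    <customer>Globex</customer>
    <amount currency="USD">99.90</amount>
  </invoice>
</invoices>
"""

root = ET.fromstring(doc)              # evaluate the document's hierarchy
for invoice in root.findall("invoice"):
    # Tags and attributes are the markers that separate semantic elements
    # and enforce fields within the data.
    print(invoice.get("id"),
          invoice.findtext("customer"),
          invoice.find("amount").get("currency"),
          invoice.findtext("amount"))
```

Because the tags travel with the data, the same document can be consumed without a fixed table schema, which is exactly what makes transferring such data between servers and applications easy.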

Organizations in today's information society are beginning to consider not only human capital, but also the data from which they gain knowledge and information, as one of their most valuable and important resources. Increasingly, information in organizations is perceived not as plain information, but as a strategic business asset, regardless of the industry the business operates in.

Big Data Processing

Extracting information (knowledge) from data is the basis of modern business intelligence. However, existing information retrieval algorithms do not work directly with Big Data; they need to be adapted to Big Data processing platforms such as Hadoop and MapReduce, or the newer open source platform Spark. The development and implementation of algorithms for extracting information from Big Data is a serious challenge. Two main methods of Big Data processing will be considered:

• MapReduce
• MPP (Massively Parallel Processing)

MapReduce characteristics

Hadoop enables the distributed processing of large volumes of data (measured in petabytes) using the map/reduce method. MapReduce is the process of dividing each individual task into smaller ones, each of which can be performed on a separate node of the cluster. The MapReduce algorithm performs highly efficient parallel data processing using the Hadoop MapReduce engine. This engine provides all the tools required to split Big Data into manageable pieces and process them in parallel on a distributed cluster. MapReduce is a way to perform a set of functions on a large amount of data in batch mode.

The map function splits the work into a number of smaller tasks and handles their placement in a way that balances workloads and facilitates fault recovery. Once the distributed calculation is complete, the reduce function combines all the elements back together and provides the final result (a minimal sketch is given at the end of this section). In order to take advantage of MapReduce, the method has to be implemented on a distributed cluster of servers, which provides a way to apply the same function simultaneously on a number of machines.

MPP characteristics

Massively parallel processing (MPP) is changing the market of management tools. This method has a performance advantage over traditional approaches thanks to a parallel processing architecture that combines blade servers and data filtering via programmable logic devices. This combination provides fast analytical queries that support thousands of users of business analysis applications, complex analysis at tremendous speed, and petabyte scalability. MPP delivers high performance without the need for indexing or tuning. All of the integration of hardware, software and storage is done in advance for the user, because it is delivered as a ready-made tool. This method provides a ready-to-use form of immediate data entry and allows for the execution of queries via leading ETL, BI and analytical applications through standard ODBC, JDBC and OLE DB interfaces. Massively parallel processing greatly simplifies the analysis by consolidating all analytical activity in one place - exactly where the data is stored.

MPP is coordinated processing executed by multiple processors working on separate parts of a given program. Each processor uses its own operating system and memory. Usually, MPP processors communicate through a type of messaging interface; in some implementations, 200 or more processors can run on the same application simultaneously. Setting up MPP is a more complex task and requires consideration of how the workload and data will be distributed among the processors. MPP systems can be described as loosely coupled or shared-nothing systems. Unlike HiveQL, Impala is not based on the MapReduce algorithm; it implements a distributed architecture based on daemon processes which are responsible for all aspects of query execution.
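The map/shuffle/reduce flow described above can be simulated in a few lines of plain Python. This is a single-machine sketch of a word count; a real job would be distributed by Hadoop across cluster nodes, with the shuffle step handled by the framework itself, and the documents list is an invented toy input.

```python
# Word count expressed as map / shuffle / reduce phases (single-machine simulation).
from collections import defaultdict
from functools import reduce

documents = ["big data analysis", "big data architecture", "data analysis"]

# Map: each input split independently emits (key, value) pairs,
# so this step can run in parallel on separate cluster nodes.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the intermediate pairs by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: combine the values for each key into the final result.
totals = {key: reduce(lambda a, b: a + b, values)
          for key, values in groups.items()}
print(totals)   # {'big': 2, 'data': 3, 'analysis': 2, 'architecture': 1}
```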

Good Practices for Big Data Analysis

Hadoop enables the distributed processing of large data sets across multiple server clusters. It can scale from a single server up to a thousand machines, while also providing high fault tolerance. It also provides the means to cost-effectively store and process unstructured or semi-structured data from streams of web clicks, social media, server logs, devices, sensors and more, and it allows you to draw conclusions from analyses of newly added datasets. These conclusions can reveal new business opportunities and advance a given organization. Hadoop is a very good fit for network performance management, data storage, risk management, customer loss prevention, fraud detection, customer opinion analysis, social media analysis and others.

When employees get a better grasp of Hadoop, it can be applied to many projects and bring benefits to the entire organization. Integrating new knowledge-sharing practices would provide value to different business groups and improve efficiency when adopting new technologies. Best collaboration and training practices that result from Hadoop may include:

• Private social networks. Encouraging communication within and between teams through an internal social network. Publishing interesting articles, tutorials and presentations, which can be found through blogs or via Twitter or Facebook. Such blogs can encourage employees to think outside the current technical paradigms.

• Innovation time. Employees can be offered the opportunity to devote a certain percentage of their time to new projects. Google is known for offering its development teams 20% of their time for new initiatives. This has led to a number of key innovations, such as Gmail and Google Groups.

• Topical discussions. Offering developers the opportunity to share their knowledge and experience in a public forum. Topical discussions on Friday afternoons, for example, have become a standard for many start-ups in Silicon Valley. One can choose the best presenters and invite a number of groups to attend a topical discussion on an arbitrarily chosen date. This is a good example of a forum for presenting projects created during an innovation period.

• Evening consumer groups. Periodic evening events at which company employees present new tools or products they are currently building for consumers.

• Mobile versions of company applications. In many cases, operational enterprise applications are not suitable for mobile devices. Hadoop allows distributed processing of large data sets, which makes it possible to create mobile applications that provide end users with the information they need, such as payrolls, benefits and deductions, all within a sleek design.

• Bringing back offices to the front. Businesses often have data distributed across different systems. This data is stored in different formats, in different infrastructures and for different purposes. Therefore, it can be quite challenging to create new applications that use this data. This data is of tremendous value, so businesses can use Hadoop together with the technique of "bringing back offices to the front" in order to create new applications. With Hadoop as a data hub, data can be extracted from inherited, closed systems, which allows new applications to be built.

• Single customer view. Businesses are starting to interact with their customers in more and more new ways, such as social media, websites and mobile applications. However, traditional Customer Relationship Management (CRM) systems are insufficiently equipped to handle such interactions, because the generated data is unstructured. Data pours into the databases very quickly (for example, click streams), which forces enterprises to use non-relational databases in order to overcome these challenges.

• Infrastructure monitoring. Many large organizations have hundreds or even thousands of internal systems. These operating systems are often critical for maintaining a business's workflow, uptime and productivity. Monitoring is a complex process that involves processing and analyzing multiple signals from different subsystems, including storage, the network, system components and software processes.

From a traditional architecture to a Big Data architecture

In order to understand the aspects of a Big Data architecture, let us first look at a traditional information architecture for analyzing large volumes of structured data. Figure 1 shows two data sources which are typical of almost every business. Transaction data is reported by the internal operating systems, such as production management, accounting and finance, logistics, warehousing, human resource management and more. The company may have a unified enterprise resource planning (ERP) system, or the data may come from separate subsystems. The other source of data within the company are the existing nomenclatures that the company handles (nomenclatures of employees, customers, suppliers, goods, materials, types of documents, etc.), as well as other basic data such as numbers and dates of documents, patents, state regulatory documents, etc.

Techniques for integrating and transferring data to a DBMS repository or data warehouse can be applied to these traditional sources. They offer a wide variety of analytical capabilities for extracting knowledge from data, including dashboards, reports, BI applications, summary and statistical queries, semantic interpretations of text data, and high-density data visualization tools. In addition, some organizations have applied supervision and standardization to projects and may have further developed the capacity of the information architecture by managing it at the enterprise level.


Figure 1: Components of a traditional architecture for Big Data analysis

The basic principles of information architecture include treating data as an asset through value, cost and risk, as well as ensuring the timeliness, quality and accuracy of the data. It is the responsibility of the enterprise architecture oversight to establish and maintain a balanced management approach, including the use of a center of excellence for standards management and training.

Architecture of a Big Data Analysis System

The amount of data produced by devices, services and sensors throughout the economy and society is constantly increasing. This creates opportunities for innovation and for the improvement of existing products and services or the creation of new ones. Data diversity requires the formulation of approaches that allow data to be exploited in order to stimulate innovation and productivity, while ensuring security and respecting privacy and intellectual property rights. The type and categorization of the generated data may depend on the source, application or business model.

In order to meet the new requirements for data size, speed, diversity and value, the traditional architecture needs to be changed accordingly. The new corporate architecture for Big Data must include new data sources, new technological components and new analysis tools which differ greatly from the traditional ones.

The data sources in the architecture of a Big Data analysis system include both machine-generated data and text data. The machine-generated data includes databases, metadata, catalogs, XML files, etc. The text data includes e-mail, social networks, text files from word processing, spreadsheets, presentations, text messages, chats, phone records, etc.

Dedicated distributed (multi-node) architectures for parallel processing (distributed file systems) have been created for analyzing large data sets, and there are different technological strategies for real-time storage and batch processing. To store key-value data in real time, one uses key-value stores and NoSQL databases (essentially distributed hashtables), which allow for highly efficient, index-based data extraction. MapReduce is used for batch processing in order to filter the data according to a specific search strategy. Once the filtered data is found, it can be analyzed directly, sent to mobile devices, loaded into other unstructured or semi-structured databases, or even combined with structured data in a traditional storage environment.
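The contrast drawn above - index-based key-value retrieval for real-time access versus MapReduce-style batch filtering - can be sketched in a few lines of Python. The in-memory dictionary merely stands in for a real key-value/NoSQL store such as HBase, and the event records are invented for the example.

```python
# Illustrative stand-ins: a hashtable for real-time lookup,
# a full scan with a filter for batch processing.
events = [
    {"id": "e1", "user": "u7", "action": "click"},
    {"id": "e2", "user": "u3", "action": "purchase"},
    {"id": "e3", "user": "u7", "action": "purchase"},
]

# Real-time path: a key-value store gives highly efficient,
# index-based extraction of a single record.
kv_store = {event["id"]: event for event in events}
print(kv_store["e2"])

# Batch path: filter the whole dataset according to a search strategy,
# which is the role MapReduce-style filtering plays at cluster scale.
purchases = [e for e in events if e["action"] == "purchase"]
print(purchases)
```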


Figure 2: Components of a Big Data Architecture

In addition to the new components, new architectures are emerging to adapt effectively to the new requirements for storage, access, processing and analysis. Dedicated data warehouses suitable for this purpose are able to store new types of data and optimize their processing. One strategy assumes that Big Data-oriented architectures will have many types of data warehouses. Another way to process the data is to parallelize it with MPP for both speed and size; this is crucial for next-generation services and analysis, which can be scaled to any latency and size requirements. A third way to process the data is by using MPP data pipelines, which allow the handling of data events over moving time windows with variable latency. In the long run, this will change the way ETL is used in most cases.

In short, this is what a corporate Big Data architecture might look like (Figure 3).

Figure 3: Corporate Big Data Architecture

Different types of data (structured, unstructured and semi-structured) come from the sources. Data can either be written directly (in real time) to in-memory processes, or it can be written to disk as messages, files, or database transactions. Once the data has been collected ("pick up data"), there are a number of ways for it to be stored. It can be written to a file system, a traditional RDBMS, or distributed clusters such as NoSQL stores and the Hadoop Distributed File System (HDFS). The basic techniques for organizing and quickly evaluating unstructured data are running MapReduce (Hadoop) in batch mode or Spark in memory.

Additional options are available for evaluating real-time streams. The integration layer (Analyze) in the Hadoop environment is extensive and allows the filling of a data lake, a data warehouse and an SQL analytical architecture. It covers all types of data and domains and manages both the traditional and the new environment for data collection and processing, in both directions. Different technologies can be used in the integration layer (Milev, Plamen, 2019). Most importantly, it meets the requirements of the four Vs: exceptional volume and speed, a variety of data types, and establishing veracity. All of this is self-sustained and does not depend on the analyst. In addition, the architecture provides data quality services, maintains metadata and tracks the lineage of transformations.

Technological Tools for Building a Big Data Analysis System

The means for implementing a Big Data analysis system architecture can be a combination of solutions. Here we will point out some of them, and later we will consider some of the key technological innovations.

• Data entry into a Big Data platform - Apache Spark Streaming, Oracle Stream Explorer and others.
• Acquisition of processed data from the primary data and its recording into relational and non-relational databases - HDFS and Cloudera Manager.
• Data organization - Apache HBase, Apache Kudu, Oracle NoSQL Database.
• Data processing - MapReduce and others.
• Data integration - Oracle Big Data Connectors, SQL access.
• Data preparation for analytical use and SQL access - Hadoop Hive and Hadoop Impala.

Hadoop, Hive and Impala

One of the latest technological tools for Hadoop Big Data processing and analysis is Impala - an MPP system with SQL. Impala combines the capabilities of SQL with the scalability and flexibility of Apache Hadoop, using standard components such as HDFS, HBase, the Metastore, YARN and Sentry. Impala users can communicate with HDFS or HBase via SQL queries. Impala can read almost all data file formats and is a pioneer in the use of the Parquet file format - a columnar storage layout that is optimized for large-scale queries.

Impala is freely available as an open source application under the Apache License. It supports data processing in memory, i.e. it accesses and analyzes data stored in Hadoop without data movement. Data can be accessed through Impala using SQL-like queries, and it provides faster access to data in HDFS compared to other SQL engines. Using Impala, data can be kept in storage systems such as HDFS, Apache HBase and Amazon S3. Impala can be integrated with BI tools such as Tableau, Pentaho, MicroStrategy and Zoomdata. Impala uses the metadata, the ODBC driver and the SQL syntax of Apache Hive.

Hive provides a mechanism for defining structure over Hadoop data and querying it using a SQL-like language called HiveQL. Apache Hive makes it easy to read, write and manage large arrays of data stored in distributed storage, and the structure can be defined over data that is already stored. Nowadays, Spark is often used to execute queries written in HiveQL, because it offers enhanced performance.
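As a sketch of the SQL access pattern just described - HiveQL queries over tables registered in the Hive metastore, executed by the Spark engine - the following assumes a Spark installation with Hive support enabled and an already existing sales table; the table and column names are hypothetical.

```python
# Hedged sketch: running a HiveQL-style query through Spark's Hive support.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hiveql-example")
         .enableHiveSupport()      # use the Hive metastore for table metadata
         .getOrCreate())

# HiveQL query executed by the Spark engine instead of MapReduce.
top_products = spark.sql("""
    SELECT product, SUM(amount) AS revenue
    FROM sales
    GROUP BY product
    ORDER BY revenue DESC
    LIMIT 10
""")
top_products.show()
```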

Apache Spark

Apache Spark is a general-purpose cluster computing system. It supports clusters of many machines and provides various computational tools for structured data processing, machine learning, graph processing and data flow processing. Soon after Apache Spark was developed, it became one of the leading cluster computing systems. The functions of Spark are manifested in different directions - it supports interfaces for Java, SQL, Scala and Python. It is used by a number of telecommunications companies, banks, game development companies and industrial giants such as Microsoft, IBM and Apple. Even some governments are using Spark for a portion of their software products.

The main features of Apache Spark are:

• Fast processing: With Apache Spark one can achieve up to 100 times faster speeds when processing in RAM, and up to 10 times faster when processing on disk.
• Dynamic development: The ability to develop parallel applications thanks to the 80+ high-level operators that Spark provides.
• In-memory computation: In-memory processing can increase the processing speed, as the data is cached and it is not necessary to retrieve it from disk every time. Spark achieves such speeds thanks to a DAG execution mechanism which facilitates in-memory computation and acyclic data flow.
• Reusability: The ability to reuse Spark code for batch processing and to execute ad-hoc requests on streaming data.
• Fault tolerance: The Spark RDD abstraction provides resistance to failures and damage. Spark RDDs are designed to deal with the possible failure of any worker node in the cluster and ensure that data loss is reduced to zero.
• Real-time stream processing: Spark can process real-time streams with Spark Streaming.
• Multi-language support: Spark supports multiple languages, such as Java, R, Scala and Python. In this respect it has an edge over Hadoop, which allows development only in Java.
• Active and expanding community: Developers from over 50 companies have been involved in the creation of Apache Spark. The project was launched in 2009, and as of today about 250 developers have contributed to its expansion. It is considered the most important project of the Apache community.
• Complex analysis support: Spark offers special tools for data flow processing, queries and machine learning.
• Hadoop integration: Spark runs on the Hadoop YARN cluster manager, which enables it to read existing Hadoop data.
• Spark GraphX: Spark has GraphX - a component for graph processing and graph-parallel computation. It simplifies graph analysis tasks through its algorithms and constructors.
• Cost-effectiveness: Apache Spark is a cost-effective Big Data solution; together with Hadoop, it provides the necessary Big Data storage and replication.

There are two main components to Apache Spark. The first is the driver, which converts the user's code into multiple tasks and distributes those tasks across the individual nodes. The second component is the executor, which performs the tasks assigned to it on the nodes. The cluster manager plays an important role in the connection between the two components (see the sketch below). Spark turns the user's commands into a directed acyclic graph (DAG), which determines which tasks have to be performed and the order in which they have to be executed.

Some analysts compare Apache Spark and Apache Hadoop. However, this is not quite correct, as the two systems provide different functionality. Spark can be found in most Hadoop distributions.
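Several of the features listed above - the high-level operators, in-memory caching, and the driver/executor split - are visible in a minimal PySpark sketch; the HDFS path is a placeholder.

```python
# Minimal PySpark sketch: the driver builds a DAG of transformations,
# executors run the resulting tasks, and cache() keeps data in memory.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-features").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/logs.txt")                 # placeholder path
errors = lines.filter(lambda line: "ERROR" in line).cache()  # keep in RAM

# Both actions reuse the cached RDD instead of re-reading from disk.
print(errors.count())
print(errors.take(5))
```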

It is a common belief that Hadoop has become so popular thanks to MapReduce. There are two main reasons that put Spark at the forefront and made it a desirable cluster computing system for large data set processing.

The first reason is that, from a technical point of view, Spark is much more convenient for developers thanks to the Spark API (multiple interfaces). The other main reason is speed. Spark can perform tasks many times faster than MapReduce, especially when it comes to multi-stage tasks. MapReduce generates a two-step execution graph consisting of data mapping and reduction, whereas the Apache Spark DAG can include many stages, which enables much more efficient execution. Even when the data cannot be stored entirely in memory, Apache Spark is several times faster than MapReduce.

The Apache Spark API is designed to be easier to use than MapReduce, because much of the complexity of the allocation and calculation processes remains hidden behind simple methods. Apache Spark works with Python and R, as well as with Java and Scala, which are oriented towards large systems. Everyone who uses Apache Spark gets an accessible way to obtain a fast and scalable platform.

We are currently flooded with vast amounts of information which, when properly analyzed, can lead to useful results. In many organizations and companies, such as Amazon, data analysis is applied after information about the products in question has been recorded from the end user. In other companies, consumers receive various offers in real time as a result of the analysis of data gathered about the products in question. Every other company that generates and uses a huge amount of data can apply those practices to improve its work processes and increase its customer satisfaction. Now is the time to mention the importance of data streaming.

Data streaming can be considered a special method by which information is obtained continuously, in different sequences. Information from various sources constantly floods our systems. Unstructured data that arrives in a continuous sequence is called a data stream, and such data is processed by dividing it into logically related groups. The processing and analysis of this data is called a streaming process; further processing of the data is performed after it is divided into separate units. Low-latency data processing and analysis is called stream processing. Apache Spark can be linked to many sources of information, such as Amazon Kinesis, Apache Flume and TCP sockets.

Apache Spark Streaming

Spark Streaming divides the input data streams into small batches. The Spark engine is used to process these batches and generates a final stream of batches. Spark Streaming can easily be integrated with many Apache Spark components, such as Spark SQL and Spark MLlib.

Spark Streaming is used to process streaming data. The Spark API environment plays a key role, as it is adapted for streaming and error correction. Another powerful feature of Spark Streaming is real-time processing. The service provided by Spark Streaming is used by big companies such as Uber, Netflix and others. A characteristic feature of Spark Streaming is the ability to process data in real time using a standalone platform built for Spark Streaming (a minimal sketch follows below).

Why is Apache Spark needed?

We have different types of data sources, such as IoT devices, large systems and many more. The data sources generate data streams, which are handled by systems such as Apache Kafka. The data is grouped in clusters which process it in parallel. Continuous processes consist of several phases (operators): the data is processed by a sequence of operators, and after one operator processes the data, its output is passed to the next operator in the pipeline.
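Here is a minimal sketch of the Spark Streaming behaviour just described - the input stream is divided into small batches which the Spark engine then processes. The socket source on port 9999 is only a test assumption (it can be fed with `nc -lk 9999`).

```python
# Word count over a text stream, processed in 5-second micro-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)                    # batch interval in seconds

lines = ssc.socketTextStream("localhost", 9999)  # assumed test source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # one result per micro-batch

ssc.start()
ssc.awaitTermination()
```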

In modern computer architectures, problems occur when complex real-time data processing is required, and the following challenges have to be overcome:

1. Fast error recovery. This is the process of recovering information after an error occurs. The data is recalculated in parallel computing branches, which allows for much faster completion than traditional computer systems have to offer.

2. Load balancing. Load balancing is a process of intelligent allocation of resources. What it tries to achieve is efficient use of the system resources by evenly loading all computational units of the system. It achieves that by not allowing individual resources to become idle or overloaded.

3. Unification. Unification allows for data stream analysis using queries. A standalone process can combine interactive streaming and batch requests.

SQL queries and analysis with Machine Learning

The development of systems with a common set of commands for database management is a necessary condition for realizing the interaction between individual analytical systems. SQL queries are widespread and well accepted. Typically, systems provide libraries and machine learning modules that allow for more complex analysis.

Apache Kudu

In Apache Kudu, data is stored in tables, similar to relational databases. In some cases tables can be extremely complex, while at other times they can be extremely simple. As in relational tables, each table contains a primary key. Using the primary key, one can easily update and even delete records. An example of such a primary key in a table is the ID of a given user (a sketch follows below).

Kudu uses a simple data model which allows for easy data transfer and integration. Another great convenience is the fact that it is not necessary to use a binary format or to be familiar with JSON files. Tables are able to self-describe, depending on the database system. This feature enables developers to use standard tools such as Spark or an SQL engine.

We can say that Kudu is more than a file format. It has the power to store data in real time, and it provides NoSQL-style APIs for Python, Java and even C++. Thus, it can be applied in data analysis and machine learning. The data model is applied in such a way that it is not necessary to even think about binary encoding, which makes Kudu's API extremely easy to use. In case you have different types of data and subsets, Kudu, although not an OLTP system, is fully compatible with other systems.

Apache Kudu is open source, so you can easily link different information processing frameworks: Kudu can be integrated with the Hadoop ecosystem without any complications. It might not come as a surprise that Kudu stores its data in columns instead of rows, just like most analytics repositories used in today's digital society. That is because column storage allows for more efficient compression and encoding, and it reduces the amount of I/O required to perform analytical queries.

A rational solution is to use a Java client for real-time data redirection. This makes MapReduce and Apache Spark a good solution for immediate processing.
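To make the Kudu data model described above concrete - relational-style tables with an explicit primary key that supports row-level updates and deletes - here is a hedged sketch that manages a Kudu table through Impala SQL using the impyla client. The host, port and the users table are assumptions made for the example.

```python
# Hedged sketch: a Kudu table managed through Impala SQL via impyla.
from impala.dbapi import connect

conn = connect(host="impala-host", port=21050)   # assumed Impala daemon
cur = conn.cursor()

# A primary key is mandatory in Kudu; it is what makes
# row-level UPDATE and DELETE possible.
cur.execute("""
    CREATE TABLE IF NOT EXISTS users (
        id BIGINT,
        name STRING,
        PRIMARY KEY (id)
    )
    PARTITION BY HASH (id) PARTITIONS 16  -- tables are split into tablets
    STORED AS KUDU
""")

cur.execute("UPSERT INTO users VALUES (1, 'Ana')")
cur.execute("UPDATE users SET name = 'Anna' WHERE id = 1")  # update by key
cur.execute("DELETE FROM users WHERE id = 1")               # delete by key
```

Because every row is addressed by its key, these statements behave like their relational counterparts even though the data lives in Kudu's columnar storage.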

What's more, you can even join a Kudu table to data stored in HDFS. The ease of use and the opportunities Kudu offers make it a natural choice for joining a Hadoop cluster. A distinctive feature of Kudu is that it can easily handle memory-heavy operations of less than 1 GB. In addition, Kudu can execute drill-down and needle-in-a-haystack queries on millions of rows of data in less than a second.

Tables in Kudu are divided into smaller units called tablets. The partitioning can be configured for each table individually or done by hashing. Kudu uses the Raft consensus algorithm, which ensures the preservation of information at any time by replicating each operation to the nodes holding the respective tablets. This recording of each operation on separate nodes ensures that the data will not be lost in case of machine damage, and it is performed before responding to the client request. If a machine does fail, the replicated information is copied over in a matter of seconds, keeping the system fully operational and giving users the sense that the system is reliable. Raft consensus makes sure that the replicated data stays consistent, which ensures the least possible delays even when some nodes are loaded with multiple parallel tasks (as in the case of Impala queries or Spark tasks).

Oracle Big Data Connectors

Oracle Big Data Connectors enables the integration of Oracle's RDBMS with data stored on Big Data platforms, including HDFS and NoSQL databases. Oracle Big Data Connectors provides an easy-to-use graphical environment without the need to write complex code. It is a means of communication between Apache Hadoop and Oracle Database: Apache Hadoop can initially be used for data collection and primary processing, and afterwards the large data sets can be linked to the enterprise data stored in Oracle Database in order to produce an integrated analysis.

The introduction of new technologies and techniques is always a challenge. The IT department should be expanded to include Big Data specialists familiar with the relevant technologies.

Conclusion

Nowadays, organizations have the opportunity to become faster, better and more efficient by adopting new approaches to developing their products. In order to adopt those changes, the management of the company must constantly adapt to the rapidly changing technologies. Some basic steps a company can take in order to successfully integrate Big Data are:

• Launch simple and easy-to-implement Big Data projects.
• Expand the data sources used and explore the potential uses of internally and externally available data.
• Allow employees to receive data in real time.
• Use analytical Big Data applications and lead analytical initiatives both at the management and at the operational level of the company.
• Develop strategies for the effective use of leading analytical techniques and technologies.
• Develop and maintain the staff's skill set in terms of Big Data.
• Build data visualization skills.

The skillful development and gradual implementation of Big Data analysis technologies will provide huge business opportunities. Innovation, of course, does not emerge and develop from scratch. Different industries and businesses design their data management architecture based on the existing and future data sources they have or will have. Technologies, databases and analysis tools are selected to serve legacy protocols and standards.

The new architecture for Big Data analysis is built on the basis of the existing company architecture for data processing.

References

Edwards, Martin R. 2016. Predictive HR Analytics: Mastering the HR Metric.

Fitz-enz, Jac, John R. Mattox, II. 2014. Predictive Analytics for Human Resources. John Wiley & Sons, Inc., Hoboken, New Jersey.

Mihova, Veska. 2015. Methods of Using Business Intelligence Technologies for Dynamic Database Performance Administration. Economic Alternatives, issue 3, pp. 105-116, University of National and World Economy, Sofia, Bulgaria.

Milev, Plamen. 2019. Integration of Software Solutions via an Intermediary Web Service. Trakia Journal of Sciences, issue 1, pp. 181-185.

Radoev, M. 2017. A Comparison between Characteristics of NoSQL Databases and Traditional Databases. Comput. Sci. Info. Technol., 5 (5), pp. 149-153.

Radoev, Mitko. 2019. Using Microsoft SQL Server 2019 Big Data Clusters. In: International Conference on Application of Information and Communication Technology and Statistics in Economy and Education (ICAICTSEE).

Tsaneva, Monika. 2019. A Practical Approach for Integrating Heterogeneous Systems. Business Management, issue 2, D. A. Tsenov Academy of Economics, Svishtov, Bulgaria.

Tsaneva, Monika. 2019. General Data-driven File Export Integration Solution. In: Proceedings of the International Conference on Application of Information and Communication Technology and Statistics in Economy and Education (ICAICTSEE), pp. 154-159.

Apache Kudu Overview. Available at: https://kudu.apache.org/overview.html (accessed 07.01.2021).

Apache Spark Streaming Tutorial For Beginners: Working, Architecture & Features. Available at: https://www.upgrad.com/blog/apache-spark-streaming-tutorial-beginners/ (accessed 07.01.2021).

Overview of Apache Kudu. Available at: https://www.techntrip.com/overview-of-apache-kudu/ (accessed 07.01.2021).

An Enterprise Architect's Guide to Big Data. Available at: https://www.oracle.com/technetwork/topics/entarch/articles/oea-big-data-guide-1522052.pdf (accessed 03.11.2020).

Cloudera Documentation. Cloudera Enterprise 5.13.x. Available at: https://docs.cloudera.com/documentation/enterprise/5-13-x/topics/impala_file_formats.html (accessed 03.11.2020).

Optimized Real-time Analytics using Spark Streaming and Apache Druid. Available at: https://medium.com/gumgum-tech/optimized-real-time-analytics-using-spark-streaming-and-apache-druid-d872a86ed99d (accessed 07.01.2021).
