Arcadia PAGE 20 HOW TO GET MORE VALUE FROM YOUR DATA LAKE AND EXTEND ACCESS TO MORE USERS

HPE Data Security PAGE 22 SOLVING FOR DATA SECURITY AND PRIVACY IN IoT

Cask PAGE 24 5 KEY INGREDIENTS OF A SUCCESSFUL DATA LAKE

Informatica PAGE 26 BUILDING AN ENTERPRISE DATA LAKE WITH BRAINS, NOT BRAWN

IBM PAGE 27 HARNESSING A LOGICAL DATA WAREHOUSE AND DATA RESERVOIR FOR BIG DATA SUCCESS

Franz PAGE 28 TRANSFORMING DATA LAKES INTO KNOWLEDGE

Cambridge PAGE 29 SMART DATA LAKES: REVOLUTIONIZING ENTERPRISE ANALYTICS

Best Practices Series: BUILDING A DATA LAKE FOR THE ENTERPRISE


Data lakes are forming as a response to today's big data challenges, offering a cost-effective way to maintain and manage immense data resources that hold both current and future potential to the enterprise. However, enterprises need to build these environments with great care and consideration, as these potentially critical business resources could quickly lose their way with loose governance, insecure protocols, and redundant data.

In their book, BI and Analytics on a Data Lake: The Definitive Guide, Sameer Nori and Jim Scott identify the role of a data lake, serving as a repository for raw data "which was previously too expensive to store and process." It can then be moved to a data warehouse, or simply serve "as an online archive for infrequently accessed data." The benefits of a data lake are far-reaching, from boosting big data analytics capabilities to simply serving as a holding tank for raw data that can be refined at some future point by more advanced systems. The best part is data lakes are agnostic to the format of the data being supported—data can come in any format, from any vendor, for any purpose.

In a survey of 385 IT and data managers, Unisphere Research found that the data lake is increasingly recognized as both a viable and compelling component within a data strategy, with companies large and small continuing to move toward adoption ("Data Lake Adoption and Maturity Survey Findings Report," Unisphere Research/Radiant Advisors, October 2015). At the time of the survey, more than 32% of enterprises already had an approved budget to launch a data lake initiative and another 15% had already submitted a budget for adoption. On top of that, 35% were researching and considering a data lake approach. In addition, the Unisphere survey found clear early use cases exist for the data lake. However, governance and security are still top of mind as key challenges and success factors for the data lake.

Besides the mechanical considerations of data storage, there is a need to integrate complex datasets to build more intelligent analytic frameworks for more targeted and intelligent decisions. The data lake helps move big data analytics beyond handling large files of flat data. Advanced analytics, sitting on a data lake, will enable the creation of a robust set of business capabilities, such as predictive and prescriptive analytics.

For example, a key area in which data lakes are proving their potential is the healthcare sector. A semantic data lake for healthcare is underway at Montefiore Medical Center, a sophisticated project that is slated to go live for patient care in the summer of 2017. The data lake delivers predictive analytics to clinicians, starting with the ability to flag patients entering the health system at high risk of experiencing a serious crisis event within 48 hours. The system also generates customized checklists of intervention tasks sent to clinicians that may help to avert or lessen the impact of the crisis.

The following are best practices for making the most of data lakes in the enterprise:


THINK ABOUT THE BUSINESS
As with any technology endeavor, a data lake that is created simply for the purpose of having a data lake will accomplish little. The Unisphere survey finds 70% of data managers see data lakes as a way to boost data discovery, data science, and big data projects—functions critical to today's analytics-driven enterprises. A majority of respondents, 58%, also view data lakes as key to enabling real-time analytics or operationalized insights—again, functions that are critical to businesses that seek to succeed in the 21st century. Thus, alignment with business objectives—better understanding customer needs or creating new revenue streams—is vital, and the data lake needs to accomplish purposes that data warehouses may not be capable of addressing.

THINK EXPERTISE
As with any major data environment, it's critical to have competent professionals. Hadoop skills are still notoriously hard to come by, as is expertise in data governance, data architecture, and data security. Managing data lakes requires renaissance professionals with skills in a broad range of areas. Needed are people who have strong backgrounds in data management and analytics. Data scientists or high-level data analysts are also required to help business users get the most out of their data experiences.

THINK ABOUT WHAT DATA REALLY NEEDS TO BE CAPTURED AND STORED
In today's on-demand environment, many third-party data sources are available almost instantaneously from their original sources, such as Google or Twitter Analytics, or other sites with indexed information that are made available for analysis. These sources are readily available to enterprise decision makers, and it may not be necessary to capture and store the data in a separate corporate location. Data that may need to be stored and maintained is that which is transitory—such as readings from sensors. Before making decisions that will require storage capacity and security considerations, determine what can be best maintained on the web and what needs to be captured. Another consideration is the fact that established data is often very difficult to move—data already existing in transactional systems, such as mainframe platforms, may need to stay where it is.

THINK LONG-TERM, THINK ARCHITECTURALLY
Enterprise architecture and future infrastructure requirements are part of the equation in looking forward over the next 3, 5, or 10 years. While data lakes can be quite fluid and requirements and technologies are changing quickly beneath data managers' feet, a data lake still must be part of a long-term, comprehensive strategy. Since the data lake needs to map to the business, it's important to plan how and what kind of data will be maintained in the repository, who will have access to it, and how it will be governed. Achieving a robust infrastructure capable of scaling for large capacity is also important. Technology strategies such as faster in-memory computing and the use of cloud computing should also be part of the architectural equation. There are many different types of data-level solutions that can be considered and designed, including Hadoop, Spark, NoSQL databases, and data as a service. It will also be necessary to eventually be able to accommodate next-generation approaches, such as real-time data streaming, machine learning, artificial intelligence, and 3D interfaces that will become an essential part of business processes. Also important are the significant technical distinctions between lakes and warehouses. While data warehouse data is highly structured, data lakes store data in its original format—structured, unstructured, and semi-structured. Because data warehouse data is loaded through extract, transform, and load processes, it is more expensive per byte than data lakes, which do not thoroughly vet incoming data. In addition, data warehouses are more likely supported by commercial vendors, while data lakes tend to be home-grown affairs.

THINK SECURE
One of the most pressing concerns about data lakes is that they may end up as "data swamps." The worry is that they may end up storing untrustworthy data, sensitive corporate data that may be subject to unintentional exposure, or even data that violates corporate policies or ethics, such as personally identifiable information or pornography. Accordingly, the Unisphere survey finds that respondents report the most vexing challenges are metadata management issues (71%), security (67%), and governance (71%). Rules and policies that outline access and permissions need to be established, just as in a standard database environment. The same safeguards and processes employed to protect traditional data environments—such as data encryption—need to be extended to data lakes. Regulations and mandates—such as HIPAA or Sarbanes-Oxley—will apply to data lakes as well. Data held within data lakes is also subject to legal discovery proceedings, in the same manner as all other electronic data. In addition, disaster recovery and business continuity should be part of a security strategy. In the event of system failure, what datasets get priority in terms of recovery? Which datasets should be replicated in real time, immediately available to end users without so much as a hiccup? As many data lakes may serve as repositories for historical data, they may not require priority, real-time replication.

THINK SELF-SERVICE
All across the data and IT space, there is a growing emphasis on enabling end users to access IT services as they need them. A data lake that delivers value would enable business users to make inquiries with little or no help from the IT department. Framed within permissions, users need to be able to see all the datasets available for analysis or reporting. Ideally, users should be able to, at the click of a button, view an on-demand catalog of datasets and be able to run analysis or reports at any time. They should also be able to identify potential new data sources as well, and understand the trustworthiness of the data. Through a robust self-service approach, the data lake can realize significant value to the business.

—Joe McKendrick


How to Get More Value From Your Data Lake And Extend Access to More Users

With all the press about data lakes in the last year, the conversation is now pivoting to: How can I get more value from my data lake and make it accessible to more users?

Data architects, analysts, and data scientists alike have been looking for a solution that allows them to discover business insights on the raw data, versus just working with summaries, aggregates, or extracts. The promise of the data lake is the agility to access and analyze any and all data in its native format to get deeper business insights. Streamlining traditional data pipelines for ingesting, cleansing, and normalizing data for analysis makes this possible. There is an opportunity for data architects to reduce the number of steps which traditionally relied on a complex stack of data warehouses, data marts, and BI or cubing servers that can fragment data access, slow down data provisioning, and clutter the architecture. Business analysts are used to relying on BI tools designed before the big data era; these tools do not scale horizontally and therefore will choke on all the data in a data lake and require summarizing or aggregating data within limited drill-down capabilities. This can limit cross-functional data discovery and visualization. And while data scientists can perform advanced model building and predictive simulations on data in Hadoop, they often falter at the stage where they need to present discoveries to end users which describe the insights and value they found.

The main goal of visual analytics is to provide users across the organization with information at the right time to support decision making. But will your analytics tools be able to scale to support hundreds of concurrent users on raw data on the data lake, or are you setting up your data lake for failure by keeping it as a walled garden for only the data science elite? Will you have to move data out of your data lake to a separate BI server? As soon as data is moved out of Hadoop and into a different system, users are forced to aggregate the data, which results in a loss of fidelity, not to mention the increased costs of data replication.

BI platforms which do not inherit the security privileges of the underlying data platform make it more difficult to properly secure the data in copies that are then shared across the organization. This creates multiple points for administering security for role-based access to data, increasing the risk of mismatched policies between the data management and BI tiers.

Legacy BI tools can also get very expensive as you consider giving more access to the data lake. Traditional perpetual licensing requires a much larger upfront cost, as well as annual support and maintenance costs. Licenses must be purchased for both the BI server and user licenses, so it's expensive to scale and add new users.

DATA LAKES WITH TRADITIONAL BI ARCHITECTURE
• Data summarization
• Big data fidelity loss
• No collaboration
• Higher security risk
• Operational complexity
• High TCO with multiple systems

LEVERAGING YOUR DATA LAKE WITH A "DATA-NATIVE" VISUALIZATION ANALYTICS PLATFORM
Because of these limitations, there has been an increasing interest in the past few years from business users and data architects who want to extend the value of their data lakes and to run dashboards, reports, and analytic applications in a data-native way—directly on the data nodes of Hadoop and cloud-based clusters.


With modern BI platforms which are distributed systems, users can perform on-cluster analytics and build applications that run in a data center and scale to much larger data volumes and higher numbers of users. There are many benefits to this approach, and these modern, data-native BI platforms can help both analysts and architects to find deeper insights on raw data versus working with aggregates and extracts of the data; this allows them to go way beyond simple reporting to better understand user behavior, identify suspicious network activity, and visualize other patterns which only emerge with longer history and detailed data.

DATA LAKES WITH DATA-NATIVE VISUAL ANALYTICS
• Simplified architecture
• Faster insights
• Deeper analytics
• More concurrent users

With a web-based architecture for big data visualization, business users can perform self-service analytics on very large datasets directly from their browser, using drag-n-drop authoring to create a full range of charts and graphs, as well as take advantage of in-Hadoop advanced analytic functions. Users can perform search-based analysis, behavior-based segmentation, event analytics, and dimension/measure correlations. The data-native architecture is completely server-side based, so there is no need for a separate client application or tool to access everything through the browser. The end result? Business agility. Analysts can make faster decisions on live, real-time data applications which don't require IT involvement or rely on stale data.

Linear scalability is another important component of a data-native BI/visualization platform on a data lake. Since performance acceleration structures don't need to be designed ahead of time as with traditional data cubes, users can start exploring raw data right away, right where the data resides. As the data lake grows, the computational capabilities grow naturally. There's no need to have a separate BI server for moving and computing extracts off-Hadoop for production analytics. A built-in recommendation engine helps generate what's called Analytical Views (derived forms of raw data) that are based on dynamic data usage within a Hadoop cluster. These Analytical Views provide performance optimizations based on usage statistics on what data is most commonly accessed and would therefore provide the most value to end users. The acceleration framework re-routes data queries to Analytical Views, providing automated acceleration for sub-second performance and making it easy to linearly scale production workloads. This approach solves a major hurdle with user concurrency, enabling thousands of users to access and rapidly analyze data at once.

You can also leverage the management and security capabilities of the underlying platform that's already in place, so you don't have to re-implement your entire security model in each storage/database/application layer. Look for a data-native BI/visualization platform that takes advantage of unified security with Hadoop-preferred authentication (Kerberos) and authorization methods, as well as role-based access control and row-level security.

Total cost of ownership (TCO) is the last piece of the puzzle. With a converged visual analytics and BI platform, you're able to dramatically lower TCO, since you only pay for a yearly subscription license. Since it's licensed per node in the cluster, there's no limit to the number of users. And since the platform is deployed directly on data nodes, you can leverage the Hadoop or cloud infrastructure and scale in tandem, so there's no need for additional hardware.

MOVE OVER, WEB APPS: IT'S ALL ABOUT DATA APPLICATIONS
Organizations across a wide variety of industries are now looking to leverage an on-cluster visual analytics and BI platform to build data-centric, interactive apps. One leading American media conglomerate is doing just that. They're taking all of the data from their various networks and creating a data lake that contains comprehensive viewership data. They're using the native Hadoop BI platform to analyze and visualize audience analytics, which include ads that were clicked on, how long shows were watched, etc. They can then provide this data back to the advertisers as a service.

Business intelligence is about building apps that deliver the data visually, in the context of daily decision-making. Legacy analytics/BI tools are built on outdated models that make it difficult if not impossible to discover critical business insights. After all, you don't use previous-generation architectures to store your data, so why should you use previous-generation BI tools for your big data?
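
The Analytical View machinery described above belongs to the data-native BI platform itself, but the underlying idea of a derived, reusable aggregate over the most frequently accessed raw data can be sketched generically. The snippet below is an illustration only, using Spark SQL on the cluster with invented paths, table names, and columns; a data-native platform would build and route to such structures automatically rather than by hand.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("derived-aggregate-sketch").getOrCreate()

# Raw viewership events living in the lake (hypothetical path and columns).
views = spark.read.parquet("hdfs:///lake/raw/viewership/")
views.createOrReplaceTempView("viewership")

# A derived form of the raw data: the dimensions and measures that usage
# statistics show are queried most often. Dashboards hit this small table
# instead of re-scanning the full raw history on every request.
spark.sql("""
    CREATE TABLE viewership_by_show_day
    USING parquet
    AS SELECT show_id,
              to_date(event_time) AS view_date,
              COUNT(*)            AS views,
              AVG(watch_seconds)  AS avg_watch_seconds
       FROM viewership
       GROUP BY show_id, to_date(event_time)
""")

# A typical dashboard query answered from the derived aggregate.
spark.sql("""
    SELECT show_id, SUM(views) AS weekly_views
    FROM viewership_by_show_day
    WHERE view_date >= date_sub(current_date(), 7)
    GROUP BY show_id
    ORDER BY weekly_views DESC
""").show()
```

The raw files stay where they are; the aggregate is simply a smaller, query-friendly projection that can be rebuilt or re-routed to as usage patterns change.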


The former software division of Hewlett Packard Enterprise is now part of Micro Focus

Solving for Data Security and Privacy in IoT

The Internet of Things (IoT) promises new value and revenues to the enterprise—and new problems. The scale of data generation has no precedent and, likely, most of that Big Data lands in Hadoop or a "data lake" where the need for rapid insights has never been greater, yet cyber-threats are targeting the most valuable assets of the enterprise: sensitive, high-value data. At the same time, increasingly stringent data privacy regulations such as the General Data Protection Regulation (GDPR) are requiring the protection and privacy of personal data, with grave consequences for non-compliance.

The data generated from IoT is a valued commodity for adversaries, as it can contain sensitive information such as personally identifiable information (PII), payment card information (PCI) or protected health information (PHI). For example, a breach of a connected blood pressure monitor's readings alone may have no value to an attacker, but when paired with a patient's name, it could become identity theft and a violation of HIPAA regulations.

Implementing data security can be a daunting process, especially in the rapidly evolving Hadoop space.


It's essential for long-term success and future-proofing investments to apply technology via a framework that can adapt to the rapid changes ongoing in Hadoop "data lake" environments. Unfortunately, implementations based on agents frequently face issues when new releases or new technology are introduced into the stack, and require updating the Hadoop instance multiple times.

What's needed is a framework that enables rapid integration into the newest technologies, to allow broad access to data for secure analytics. The obvious answer for true Hadoop security is to augment infrastructure controls with protecting the data itself. This "data-centric" security approach calls for de-identifying the data as close to its source as possible. Two technologies—format-preserving encryption and Apache™ NiFi™ technology—in combination provide such a framework and are revolutionizing Hadoop and IoT ecosystems by extending data-centric protection to the IoT edge.

With format-preserving encryption, protection is applied at the data field and sub-field level, preserves characteristics of the original data, including numbers, symbols, letters and numeric relationships such as date and salary ranges, and maintains referential integrity across distributed data sets. This protected form of the data can then be used in applications, analytic engines, data transfers and data stores, while being readily and securely re-identified for those specific applications and users that require it. As with other data sources, sensitive streaming information from IoT connected devices and sensors can be protected with format-preserving encryption to secure sensitive data from both insider risk and external attack, while the values in the data maintain usability for analysis.

Apache NiFi, a recent technology innovation, is enabling IoT to deliver on its potential for a more connected world. Apache NiFi began as a Federal government project, was then the basis for a start-up, and subsequently developed into an open source platform that enables security and risk architects, as well as business users, to graphically design and easily manage data flows in their IoT or back-end environments.

The integration of high-strength format-preserving encryption with NiFi processor technology extends data-centric protection for IoT data at-rest, in-transit and in-use, by enabling organizations to encrypt streaming data at scale, before it moves into the back-end Hadoop data lake for secure analytics. Now organizations can incorporate data security into their IoT strategies by allowing them to more easily manage sensitive data flows and insert encryption closer to the intelligent edge.

Micro Focus (the former software division of HPE) has introduced SecureData for Hadoop and IoT, designed to easily secure sensitive information that is generated and transmitted across IoT environments, with Format-Preserving Encryption. The solution features the industry's first-to-market Apache™ NiFi™ integration with NIST-standardized and FIPS-compliant format-preserving encryption technology to protect IoT data at-rest, in-transit and in-use. With this industry first, the SecureData for Hadoop and IoT solution now extends data-centric protection, enabling organizations to encrypt data closer to the intelligent edge before it moves into the back-end Hadoop Big Data environment, while maintaining the original format for processing and enabling secure Big Data analytics.

Companies that use IoT connected devices, collect large amounts of data, and follow the best practice of data-centric security will come out on top in the push to utilize the power of IoT.

ABOUT DATA SECURITY
Micro Focus Data Security drives leadership in data-centric security and encryption solutions. With over 80 patents and 51 years of expertise, we protect the world's largest brands and neutralize breach impact by securing sensitive data at rest, in use and in motion. Our solutions provide advanced encryption, tokenization and key management that protect sensitive data across enterprise applications, data processing IT, cloud, payments ecosystems, mission critical transactions, storage and big data ecosystems. Micro Focus Data Security solves one of the industry's biggest challenges: how to simplify the protection of sensitive data in even the most complex use cases.

LEARN MORE:
https://software.microfocus.com/software/data-security-encryption
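
The SecureData products implement NIST-standardized format-preserving encryption (FF1), which is far more involved than a short listing can show. Purely to make the idea of "format-preserving" concrete, the toy sketch below runs an FFX-style alternating Feistel over the digits of a value so that the output keeps the same length and separator layout. It is an assumption-laden illustration, not the NIST algorithm, and not suitable for protecting real data.

```python
import hashlib
import hmac

ROUNDS = 10  # an even round count keeps the half lengths aligned for decryption


def _round_value(key, rnd, half):
    # Pseudo-random integer derived from the key, the round number, and one half.
    digest = hmac.new(key, f"{rnd}|{half}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big")


def _encrypt_digits(key, digits):
    # Expects a string of at least two digits.
    u = len(digits) // 2
    a, b = digits[:u], digits[u:]
    for i in range(ROUNDS):
        m = u if i % 2 == 0 else len(digits) - u
        c = (int(a) + _round_value(key, i, b)) % (10 ** m)
        a, b = b, str(c).zfill(m)
    return a + b


def _decrypt_digits(key, digits):
    u = len(digits) // 2
    a, b = digits[:u], digits[u:]
    for i in reversed(range(ROUNDS)):
        m = u if i % 2 == 0 else len(digits) - u
        prev_b = a
        prev_a = str((int(b) - _round_value(key, i, prev_b)) % (10 ** m)).zfill(m)
        a, b = prev_a, prev_b
    return a + b


def fpe_encrypt(key, value):
    # Encrypt only the digits; separators such as dashes stay where they are.
    digits = "".join(ch for ch in value if ch.isdigit())
    out = iter(_encrypt_digits(key, digits))
    return "".join(next(out) if ch.isdigit() else ch for ch in value)


def fpe_decrypt(key, value):
    digits = "".join(ch for ch in value if ch.isdigit())
    out = iter(_decrypt_digits(key, digits))
    return "".join(next(out) if ch.isdigit() else ch for ch in value)


key = b"demo key - not for production use"
card = "4556-7375-8689-9855"  # made-up, card-shaped test value
protected = fpe_encrypt(key, card)
print(protected)                            # another 16-digit, dash-separated value
print(fpe_decrypt(key, protected) == card)  # True
```

Because the digit count and dashes survive encryption, downstream schemas, joins on the protected value, and format validations keep working, which is the property the article describes.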


5 Key Ingredients of a Successful Data Lake

The concept of a data lake is frequently challenging for data and analytics leaders to understand. For some, the data lake is simply a staging area that is accessible to several categories of business users. To others, the "data lake" is a proxy for "not a data warehouse." While data warehousing is highly structured and needs schema to be defined before the data is stored, data lakes take the opposite approach. They collect information in a near-native form without considering semantics, quality, or consistency. Data collected in data lakes is defined as schema-on-read, where the structure of the data collected is not known upfront, but needs to be evaluated through discovery when it is read. Data lakes unlock value which was not previously attainable or was hard to achieve. Even with the multiple benefits that data lakes provide, there are substantial challenges because:
• Big data talent is expensive
• Point-solutions are limited and cannot be easily put in production, often requiring custom integration code
• Petabytes of data distributed across a cluster complicates operations, security, and data governance
• Lack of structure and pre-processing can result in data swamps

TAKE A UNIFIED APPROACH USING CASK DATA APPLICATION PLATFORM (CDAP)
Today, when data analysts try to work with data in a data lake using only Hadoop and its associated open source tools, they are effectively building from scratch or cobbling together the functionality required to drive the data lake process. Point-solutions are limited and cannot be easily put in production or often require custom integration code.

A significant part of improving the adoption of data lakes is to provide comprehensive, built-in support to both deploy and maintain a data lake over time. At Cask, we have created a Unified Integration Platform for Big Data that provides all aspects of data lake management, including metadata and lineage, security and operations, and app development on Hadoop and Spark. This means that companies can focus on application logic and insights instead of integration and infrastructure.

This article will describe the 5 key ingredients of successful data lake implementations that can overcome these above-mentioned challenges.
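
To make the schema-on-read idea concrete before the ingredients, here is a minimal sketch using Apache Spark, one of the engines this series discusses. The lake path and field names are invented for illustration; the point is that no table definition or ETL job exists before the read, and the structure is discovered when the raw files are opened.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Schema-on-read: no table definition or ETL step up front.
# The structure is discovered when the raw files are read.
events = spark.read.json("hdfs:///lake/raw/clickstream/2017/03/")  # hypothetical path

events.printSchema()            # schema inferred at read time
events.createOrReplaceTempView("events")

# Analysts can query the raw data immediately, in its native form.
spark.sql("""
    SELECT user_id, COUNT(*) AS clicks        -- hypothetical field names
    FROM events
    WHERE event_type = 'click'
    GROUP BY user_id
    ORDER BY clicks DESC
    LIMIT 10
""").show()
```

The same raw files remain untouched and available to any other engine or future application.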

INGREDIENT 1: SIMPLIFY AND STANDARDIZE YOUR API
The Big Data ecosystem offers a broad range of choices of open source technologies, which, due to their specialized and diverse nature, can trigger integration headaches:
• Multiple storage layers (HDFS, Hive, HBase, Kudu)
• Many processing engines (MR, Hive, Pig, Impala, Drill)
• Various workflow engines/schedulers (Cron, Oozie, Falcon)
CDAP provides simple and easy-to-use Java APIs that help maximize productivity, reducing the time to deliver Big Data solutions. CDAP APIs can seamlessly integrate new storage and compute technologies as they gain traction in the ecosystem.

INGREDIENT 2: PROVIDE SELF-SERVICE SUPPORT TO DIFFERENT PERSONAS
Data lakes go a long way towards helping business users, citizen data integrators, and data scientists be more productive and self-sufficient. Currently, users spend most of their time doing ETL, data cleansing, and basic data exploration tasks rather than applying their valuable and rare analytic skills to complete high-value data modeling and analysis.
CDAP provides purpose-built self-service tools that enable users to do their own data preparation, ingest, and metadata management by eliminating the need for specialized programming skills and hand-coding efforts. By giving them fast, direct, self-service data access, CDAP eliminates dependence on IT and cuts the time required to gain access to a new data source from months to minutes.

INGREDIENT 3: ENSURE REUSABILITY, PORTABILITY, AND FUTURE-PROOFING
Too often, data lakes suffer from lack of forethought on what they're supposed to achieve. Creating a data lake becomes the goal rather than some overriding objective, such as enhancing customer experience, improving process efficiency, or any of the other reasons for which companies claim they're investing in Big Data. It is important to start the project with a goal but plan on reusability instead of having to recreate the wheel every time. All the components created on CDAP are not only reusable but also future-proofed because of CDAP's unique abstraction architecture. These components can be distributed via the external Cask Market or an internal one.
Cask Market is Cask's "Big Data App Store" with push-button deployment of pre-built applications, use cases, pipelines, data packs, and plugins from within CDAP. It provides step-by-step wizards to help configure and deploy new entities effortlessly across all the Hadoop distributions (Cloudera, MapR, Hortonworks, AWS EMR, Azure HDInsights, GCP).

INGREDIENT 4: CAPTURE METADATA AT INGEST
The main issue with data lakes is that they can degrade to a data swamp if data is just dumped into them with no regard for metadata, schema, or description. CDAP mitigates this problem by capturing robust technical and operational metadata automatically during ingest. Additionally, users can tag data for business context and easily search them using a Google Search-like interface.

INGREDIENT 5: DESIGN FOR SECURITY BEFORE, NOT AFTER, IMPLEMENTATION
Finally, the security capabilities of data lake technologies are still emerging. Many data lakes, especially those in highly regulated industries, require stringent security and access control. CDAP adds security directly into data lakes so organizations can remove the burden of compliance integration from their data analysts and non-IT personnel.

SUMMARY
Data lakes usually fail due to lack of governance, security, self-disciplined users, and a rational data flow. CDAP provides organizations with a unified platform for data integration, preparation, governance, security, operations, management, and application development on Hadoop and Spark. It takes care of all the grunt work so users can enjoy fast, direct, and secure self-service access to data. CDAP also provides IT with turnkey security, compliance, and manageability, maximizing the value of data lakes.

CASK
http://cask.co/
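
CDAP performs the metadata capture in Ingredient 4 automatically and exposes it through its own APIs; the snippet below is only a hand-rolled, generic illustration of the "capture metadata at ingest" idea. The catalog file, dataset, and tags are invented for the example, but the shape of the captured technical, operational, and business metadata is the point.

```python
import csv
import json
import time
from pathlib import Path

CATALOG = Path("catalog.jsonl")  # hypothetical catalog store; a platform keeps this for you


def ingest(source, path, tags):
    """Copy nothing, transform nothing; just record metadata as the file lands."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)                 # technical metadata: column names
        row_count = sum(1 for _ in reader)    # operational metadata: volume
    entry = {
        "dataset": Path(path).name,
        "source": source,
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "columns": header,
        "row_count": row_count,
        "tags": tags,                         # business context added by users
    }
    with CATALOG.open("a") as out:
        out.write(json.dumps(entry) + "\n")


def search(term):
    """A crude, search-engine-like lookup over names, columns, and tags."""
    with CATALOG.open() as f:
        for line in f:
            entry = json.loads(line)
            blob = " ".join([entry["dataset"], entry["source"],
                             *entry["columns"], *entry["tags"]]).lower()
            if term.lower() in blob:
                yield entry


# Example (paths are hypothetical):
# ingest("crm-export", "/lake/raw/customers_2017-03-01.csv", ["customer", "pii"])
# print(list(search("customer")))
```

Even this simple record of origin, arrival time, structure, and tags is what keeps a lake searchable instead of degrading into a swamp.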


Building an Enterprise Data Lake with Brains, Not Brawn

THE DATA LAKE OPPORTUNITY AND RISK
Data has become the lifeblood of nearly every industry-leading company. But the ability to turn this data into valuable business insights with the right data delivered to the right people at the right time is what separates leaders from laggards.

But despite the promise of Hadoop and enterprise data lake environments, very few enterprises have shown fast and repeatable success with these environments. In a world of growing volume, variety, and velocity of data, business analysts and other data consumers still struggle to find and reconcile fragmented, replicated, incomplete, and inconsistent data across the organization.

FIT-FOR-PURPOSE APPROACH TO DATA LAKE MANAGEMENT
The traditional approach to data warehousing imposed a highly structured and rigid approach to transforming raw data into trusted information. In contrast, a fit-for-purpose approach is more appropriate with data lakes as enterprises seek to balance the demand for self-service discovery with the demand for security and governance. The right set of controls with the right mechanism of enforcement can actually promote autonomy and collaboration rather than inhibit it.

SMARTER, MORE COLLABORATIVE DATA LAKES WITH METADATA INTELLIGENCE
Data lake projects fail due to an excessively manual and custom approach to specialized development. The more that one-off platform-specific code and code generators are used, the higher the risk of project delays, project rewrites, and long-term technical debt. Avoiding these risks requires more than simple automation of manual processes. The key to avoiding delays and risks is the use of metadata intelligence.

Metadata intelligence is the use of machine learning-based technologies to infer "data about the data." By intelligently understanding the meaning of data stored in the data lake, a solution employing metadata intelligence can do the heavy lifting of data lake management for you. Data can be automatically transformed and parsed as it is ingested into the data lake. Rule-based data quality processes quickly identify problems with data assets and enable organizations to focus on the most important data errors. Fuzzy logic-based data matching enables more complete and consistent single sources of truth to be built. Pre-built connectors and adapters serve to automatically push and pull data from the requisite systems.

Metadata intelligence empowers end users through its intelligent understanding. By inferring the structure and format of data, data stewards can easily catalog data assets across the enterprise. Data analysts can also be provided with guided recommendations about data assets as they perform self-service data preparation processes. All of this work can then be instantly transformed into technical metadata and mappings that data engineers can operationalize for repeatable benefit. Metadata intelligence essentially becomes the brains that power the enterprise data lake and ensure that the right data is available at the right time to the right people.

BRINGING THE RIGHT PEOPLE, PROCESS, AND TECHNOLOGY TOGETHER
Succeeding with data lakes is often best served by building a center of excellence. Cognizant is a leading provider of business and technology services for helping organizations get more value from their data assets. Cognizant believes investments in data quality, data governance, and data security processes are essential for fit-for-purpose data lake environments. Moreover, these data management processes are enhanced by the right types of training and change management to help define the right structures, roles, and operating procedures to effectively bring together the right people and processes for maximum return on investments.

Meanwhile, Informatica is the #1 independent provider of data management solutions for enterprises. Informatica's comprehensive Cloud Data Lake Management solutions deliver accurate and consistent big data assets you can trust to power faster business decisions. Informatica's solutions enable you to find, prepare, govern, and protect the big data that your business needs to quickly and repeatedly get business value. They accelerate developer and analyst productivity, while enabling IT agility in a fully secure and governed environment. Data security and data governance are automatically enforced through metadata intelligence that ultimately enables more data consumers to quickly and repeatedly get more trusted big data without more risk. As close partners, Informatica and Cognizant work together to deliver the right people, process, and technology to build data lakes with brains, not brawn.
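
Informatica's fuzzy matching is driven by trained models and rule sets, so the following is only a minimal sketch of the idea using Python's standard-library difflib: scoring near-duplicate records so that a single source of truth can be assembled from fragmented copies. The records, weights, and threshold here are invented for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Rough string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


records = [
    {"id": 1, "name": "Acme Corporation",  "city": "New York"},
    {"id": 2, "name": "ACME Corp.",        "city": "New York"},
    {"id": 3, "name": "Globex Industries", "city": "Springfield"},
]

THRESHOLD = 0.75  # tune per data set; commercial tools learn this from training data

# Pair up records that are probably the same real-world entity.
for i, left in enumerate(records):
    for right in records[i + 1:]:
        score = (0.7 * similarity(left["name"], right["name"])
                 + 0.3 * similarity(left["city"], right["city"]))
        if score >= THRESHOLD:
            print(f"probable duplicate: {left['id']} ~ {right['id']} (score {score:.2f})")
```

Matched pairs like records 1 and 2 would then be merged or linked by a steward, which is the "single source of truth" outcome described above.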


Harnessing a Logical Data Warehouse and Data Reservoir for Big Data Success

In an environment where data is the most critical natural resource, speed-of-thought insights from information and analytics are a competitive imperative. Organizations are counting on data to drive every decision, fuel every interaction, and power every process, but without the right systems and infrastructure, they will be eclipsed by the competition.

The traditional data warehouse is not up to the task. With a flood of new data pouring in at an increasingly rapid pace, prospects for generating new insights, driving innovation, and sharpening competitive advantage tend to get mired in the management of complexity. Expectations are far exceeding existing data warehouse capabilities.

To maintain their competitive advantage, organizations must take action now to modernize the traditional data warehouse. The alternative is to fall behind as the competition becomes more adept at manipulating big data to spot trends, enhance customer interactions, and get innovative products to the market faster.

The need for greater depth and breadth of data access and insights has given rise to the Logical Data Warehouse. Designed to optimize data access while reducing complexity, cost, and time to value, the Logical Data Warehouse is characterized by a coordinated approach that incorporates a range of integrated solutions and interoperable services for managing and delivering data.

The Logical Data Warehouse leverages varied structured and unstructured data repositories, which operate as a single, unified data source from which data can be accessed in whatever format is desired. It intelligently routes queries to the correct data store, simplifying and unifying information access for all users and applications. The resulting benefits include the delivery of analytics and business insights with superior speed and simplicity, a standardized level of access to all data assets, and flexible architecture deployment options.

Technologies such as Apache Hadoop are aligned with the objectives of the Logical Data Warehouse. Hadoop provides a platform for multiple data types and large data volumes. Its distributed file environments excel at storing large data volumes and allowing access and exploration across both structured and unstructured data. There has been speculation that Apache Hadoop could potentially unseat the traditional data warehouse as the main analytics engine within the enterprise. However, the maturity and robustness of the data warehouse will maintain its role as the foundational building block for enterprise big data and analytic strategies. Features such as data integration and governance allow for creation, control, and enforcement of mandatory data security and privacy policies. Data warehouses continue to be the best choice for traditional structured analytics, historical reporting, and standard business intelligence in most environments, and they will remain a key component of any Logical Data Warehouse implementation.

The recognition that multiple data types need to be pooled into a central repository in order to optimize business intelligence has given rise to the concept of the data lake. Designed to eliminate the inefficiencies associated with ad hoc acquisition and discovery of data for analytics, the data lake contains data from a wide range of sources, and data can be added or accessed as needed. While this kind of consolidation represents movement in the right direction for the data warehouse, without rigorous management and governance, the data lake risks becoming more of a data swamp—overwhelmed and populated with data of uncertain origin and dubious reliability.

The better alternative is the data reservoir—a central source of clean, usable, protected data that is catalogued as it enters, with all activity tracked for purposes of analytic modeling. The catalogue for all of the data includes a description, for search and governance purposes, of where it came from, who owns it, the last time the data was refreshed, as well as structural and profile information. With governance tools automated and in place, the data reservoir allows for streamlined, self-service access to data.

IBM advocates that most organizations can realize significant benefits from the modernization of their data warehouses. This does not require a "rip and replace" of what's currently working. While maintaining what's in place and what works, new technologies and strategic solutions in the data warehouse architecture stand to bring a new level of agility and business intelligence.

Your journey can begin from a number of entry points with various goals such as:
• Adding new data sources, increased usage, or analytic capability
• Accelerating analytic queries or boosting performance
• Exploiting technology innovations to leverage big data
• Adding flexibility through deployment model choices, appliances, software solutions, and cloud

For more information, visit www.ibm.com/analytics/us/en/technology/data-warehousing/.
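
As a rough illustration of the query-routing behavior described above, here is a small sketch. Everything in it is an assumption made up for the example (the catalog entries, dataset names, and the two run_on_* stubs); a real Logical Data Warehouse performs this routing transparently underneath the SQL layer rather than in application code.

```python
# A toy catalog of the kind the data reservoir description suggests:
# origin, owner, refresh time, and which repository holds the dataset.
CATALOG = {
    "sales_history":    {"store": "warehouse", "owner": "finance", "refreshed": "2017-02-28"},
    "clickstream":      {"store": "hadoop",    "owner": "digital", "refreshed": "2017-03-01"},
    "call_transcripts": {"store": "hadoop",    "owner": "support", "refreshed": "2017-03-01"},
}


def run_on_warehouse(query):
    return f"[warehouse] {query}"      # stand-in for a JDBC call to the EDW


def run_on_hadoop(query):
    return f"[hadoop] {query}"         # stand-in for Hive or Spark SQL on the lake


def route(dataset, query):
    """Send structured, historical queries to the warehouse and raw,
    multi-structured exploration to Hadoop, based on the catalog entry."""
    entry = CATALOG.get(dataset)
    if entry is None:
        raise KeyError(f"{dataset} is not catalogued")
    if entry["store"] == "warehouse":
        return run_on_warehouse(query)
    return run_on_hadoop(query)


print(route("sales_history", "SELECT SUM(amount) FROM sales_history WHERE year = 2016"))
print(route("clickstream", "SELECT COUNT(*) FROM clickstream WHERE page LIKE '%checkout%'"))
```

The catalog, not the user, decides where each query runs, which is what lets the warehouse and the reservoir behave as one logical source.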


Transform Data Lakes Into Knowledge

DATA LAKE CHALLENGES
Data Lakes address the Total Cost of Ownership (TCO) of data warehouses. Unlike data warehouses, Data Lakes try to minimize pre-processing of data and fish out what is needed later. Extract, Transform, Load (ETL) and data integration is delayed until you need to do queries or analytics—so-called late binding. That all sounds great, but tossing everything into the lake in native formats has a number of challenges that need to be addressed.

Meaningful analytics on complex data in a Data Lake requires:
• Intelligent metadata management for curation, provenance and known quality of the data
• Semantics for consistent and uniform taxonomies between silos
• Semantics for cross-linking or data integration—otherwise, silos stay silos
• Entity-level access control/governance

WHY SEMANTICS
Data Warehouses, Data Marts and Data Lakes all lack critical data semantics and are bound by the rigidity of the ubiquitous Relational Database schema. However, a Data Lake based on Semantic Models addresses both problems.

At the core of a Semantic Data Lake model we find two W3C standards: the URI and RDF. The URI is the well-known Uniform Resource Identifier, and RDF stands for data based on the Resource Description Framework. Instead of using strings like "Diabetes Mellitus," "Diabetes Mellitus NOS," or "糖尿病" to represent the disease, it is represented by a URI such as: http://linkedlifedata.com/resource/umls/id/C0011860. "C0011860" in the above URI is the Unified Medical Language System (UMLS) designator for Diabetes Mellitus. The beginning of the term, "http://linkedlifedata.com/resource/umls/id/", typically denotes the organization defining that URI. This way, the same term is unlikely to be used by other organizations for different diseases. In essence, we no longer use character strings that are ambiguous and without data semantics to represent a concept like Diabetes Mellitus. Instead, EVERY concept within a semantic database is represented by a unique (and universal) URI, and the same URIs within a semantic database or among different databases are designed to represent the same concept. The linkage between data elements is thus automatically established.

COGNITIVE COMPUTING WITH SEMANTIC DATA LAKES
Cognitive Computing in the context of databases and Data Lakes is the use of Artificial Intelligence to do very intelligent inferencing and analytics. Cognitive Computing in general combines structured and unstructured enterprise data that is cross-linked through the use of semantics and uniform terminologies. Once you have a standard vocabulary based on semantics, you can apply algorithms through a database or Data Lake that can understand and leverage the terminology. This is only possible by using a semantic graph database, which also links data and creates self-describing data. By applying machine learning to semantically linked data, you have the beginning of Artificial Intelligence. The basis is uniform terminologies underpinned by shared URIs; graph databases provide the environment in which semantic statements are used to draw inferences; ontologies enrich the contextualized understanding of data linked together. The ability to learn over time and operate autonomously is the crux of AI, and depends entirely on semantic technologies.

To solve Artificial Intelligence problems, you need a data system that goes beyond just data. You have to create a system that can link to anything outside your own predefined parameters—and that can learn from previous experiences, too. That is where Cognitive Computing comes into the picture, and that is why you need to pay very close attention to how all of these data sets link together. Semantic computing is adaptable to those changes, if it is set on the right path to begin with, and that is what differentiates it from other forms of Big Data analytics.

ALLEGROGRAPH—SEMANTIC GRAPH DATABASE
Unlike traditional relational databases, AllegroGraph provides the unique ability to link data without manual user intervention, coding, or the database being explicitly pre-structured. AllegroGraph processes data with contextual and conceptual intelligence to resolve queries and help clients build predictive analytics, which helps them make better, real-time decisions.

THE SEMANTIC DATA LAKE
A Semantic Data Lake is incredibly agile. The architecture quickly adapts to changing business needs, as well as to the frequent addition of new and continually changing data sets. No schemas, lengthy data preparation, or curating is required before analytics work can begin. Data is ingested once and is then usable by any and all analytic applications. Best of all, there is no requirement of pre-selected data sets or pre-formulated questions, which frees users to follow the data trail wherever it may lead them.

FRANZ ALLEGROGRAPH
http://franz.com/
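
To show what the URI-and-triple representation above looks like in practice, here is a small sketch using the open source rdflib library in Python (AllegroGraph exposes the same RDF and SPARQL standards through its own clients). The hospital namespace, patient, and claim identifiers are invented for the example; only the UMLS concept URI comes from the article.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

UMLS = Namespace("http://linkedlifedata.com/resource/umls/id/")
EX = Namespace("http://example.org/hospital/")   # invented namespace for the sketch

g = Graph()
diabetes = UMLS["C0011860"]   # the concept URI quoted above

# Different silos label the concept differently, but they share one URI.
g.add((diabetes, RDFS.label, Literal("Diabetes Mellitus", lang="en")))
g.add((diabetes, RDFS.label, Literal("糖尿病", lang="zh")))

# Facts from two separate sources, linked only through the shared concept URI.
g.add((EX["patient/123"], RDF.type, EX.Patient))
g.add((EX["patient/123"], EX.diagnosedWith, diabetes))
g.add((EX["claim/987"], EX.relatesToCondition, diabetes))

# One query walks across both sources without any hand-written join logic.
query = """
    SELECT ?subject ?relation
    WHERE { ?subject ?relation <http://linkedlifedata.com/resource/umls/id/C0011860> . }
"""
for row in g.query(query):
    print(row.subject, row.relation)
```

Because the diagnosis and the claim both point at the same concept URI, the query links records from different silos automatically, which is the "linkage established automatically" point made above.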


Smart Data Lakes: Revolutionizing Enterprise Analytics

As the quantity and diversity of relevant data grows within and outside of the enterprise, business users and IT are struggling to extract maximum value from this data. Fortunately, recent developments in big data technologies have significantly impacted the proficiency of contemporary analytics – the most profound of these involving the deployment of semantically enhanced Smart Data Lakes.

Defined as centralized repositories, Smart Data Lakes enable organizations to analyze all data assets with a specificity and speed that wasn't previously available, revolutionizing the scope and focus of analytics. The value derived from this approach improves the analytics process and expedites conventional data preparation. By understanding data's meaning prior to conducting analytics, users are able to vastly improve the type of analytics performed while pinpointing results for specific uses.

These new Smart Data Lakes go beyond the inflexible relational data warehouse and the unwieldy Hadoop-only data lake, disrupting the way IT and businesses alike manage and analyze data at enterprise scale with unprecedented flexibility, insight and speed. Smart Data Lake solutions permit organizations to focus on the data that provides real business benefit. Currently, Smart Data Lakes are being adopted by pharma and financial institutions in use cases ranging from competitive intelligence and insider trading surveillance, to investigatory analytics and risk and compliance.

THE BENEFIT OF SMART DATA LAKES
Organizations truly reap the benefits of utilizing Smart Data Lakes, the primary being the newfound ability to incorporate their entire information assets with the notion of scalable semantics. With the concept of semantics at scale, an RDF graph query engine is able to analyze billions of triples (the atomic unit of data in the semantic web) in a short amount of time. The result? The ability to issue more queries, utilize more data and get results quicker, so that all enterprise data becomes relevant.

SMART DATA LAKE USE CASE EXAMPLES
The intrinsic understanding of the relationships of the underlying data is the premier advantage of semantic graph technologies. When leveraged at scale, this fuels countless capabilities that are impossible with most other technologies. The granular nature of semantics enables the data to determine the relationships among its various elements, as opposed to guessing what those relationships are and then asking the data to confirm them. The richness of this contextualized understanding of how data is linked and related across sources, structures, and systems creates remarkable analytic insight, especially when exploited at scale.

Today, many companies are applying semantic analytics tools with scalable graph-based database technology that provide a quick and easy path to query and analyze Data Lakes effectively. For instance, the regulatory reporting environment for financial institutions is evolving quickly, placing unprecedented demands on legacy processes and technology. Two areas where new smart data solutions are already adding value for banks include report preparation as well as data and technology.

Smart Data Lakes also improve the quality of your competitive intelligence by allowing subject-matter experts to curate, correct, and augment the data they know best. For instance, top pharma companies are using Smart Data Lakes to proactively monitor the industry and receive alerts when key developments occur around drugs, companies, targets, disease areas, or geographies of interest. The alerts draw on private, public, and proprietary sources such as Citeline or Thomson Reuters to give senior decision-makers updates that summarize drug development activities of interest to them. The speed, accuracy and completeness of data that is delivered to stakeholders can make all the difference between a blockbuster drug and an expensive failed experiment.

These Smart Data Lake industry use cases are only the beginning of a revolution in big data analysis. The unique attributes of Smart Data Lakes that incorporate graph-based technologies and semantic standards are enabling the democratization of data science and data discovery, promising answers to new questions for a far wider group of business users across the enterprise than ever before.
