Harvesting Unstructured Data

SOLUTION SHEET
Data Movement and Transformation
Unstructured Data Content Extraction

There is good news and bad news about unstructured data formats

STRENGTHS:
• Non-programmers can quickly and easily access unstructured text file sources
• Integrate applications at the presentation layer and close the integration “loop”

Even before the introduction of the World Wide Web, Gartner analysts estimated that more than 70% of the world’s data was “lost” or “buried” in unstructured formats like documents. If you factor in the World Wide Web – the largest repository of unstructured data in history – you begin to realize that the vast majority of data in the world is accessible only as unstructured sources. Imagine the wealth of information and the strategic advantage you could gain by accessing and integrating this virtual gold mine of data!

Pervasive has positioned itself at the forefront of this problem by acquiring Content eXtraction Language (CXL) technology. This is the basis for the new Pervasive Internet Rapid Integration Services (djIRIS) launched in 2003. Furthermore, Pervasive Data Junction was recently selected by the readers and editors of Intelligent Enterprise for the 5th straight year as the #1 ETL tool for data movement and transformation, substantiating the company as the market leader for accessing and integrating data.

ADDED BENEFITS:
• In today’s Web-centric world, most data is only accessible in unstructured formats.
• Even before the explosion of the Web, Gartner analysts estimated that more than 70% of the world’s data was “lost” or “buried” in unstructured formats like documents.
• Access and integrate unstructured data from a myriad of sources, tap a wealth of information and gain crucial strategic advantage.

1. Pervasive Data Junction Extract Schema Designer

The key problem with unstructured data is a simple one, at least on paper: How do you tap into and access the mountains of unstructured text data in the world? The good news is that once you’re able to access these unstructured text formats, you have unfettered access to scores of valuable “structured” data – invoices, customer details, catalogs, addresses, etc. The bad news, of course, is that the data is “lost” in unstructured formats with all the standard blemishes: floating fields, white space, page breaks, large text blobs, etc. The truth, however, is that these sources are not really unstructured. Rather, they are simply defined with structures that make them more readable by humans, not by conventional, “machine-friendly” fixed and delimited text file readers. Consequently, the best way to access unstructured data is with a powerful pattern recognition language and graphical interface that allow extraction to occur in an automated fashion, but driven by text patterns – in other words, just like human readers.

The foundation for Pervasive’s ability to extract structured data from unstructured text sources is the CXL Engine. The CXL Engine is a highly efficient line-oriented text manipulation and pattern recognition engine invented in, and built with, specific wiring to Pervasive Data Junction back-end high-speed Integration Engines. In just a few lines of code, one can extract perfect row and column “views” from streams of otherwise unreachable dirty text and simple HTML sources.

Since the real productivity gains in IT come from powerful end-user graphical interfaces, Pervasive Data Junction has built a powerful user interface on top of the CXL Engine. This enables non-programmer users to mark up the unstructured text file source quickly and easily and, in a matter of minutes, build an entire Extract Schema for even the most complicated unstructured text sources. The real beauty of the Extract Schema Designer is that it is in fact a “code generator” that is able to generate the CXL necessary to feed into the Pervasive Data Junction unstructured text parsing engine. Consequently, users have the best of both worlds: they can quickly build and implement solutions using the Extract Schema Designer and, for those (admittedly rare) cases where additional performance or power is needed, they can fall back on the CXL language for the extra engineering horsepower they need. The Extract Schema Designer also includes visual debugging, source data “structured” viewers, add-on modules for PDFs and other document formats, as well as the ability to handle non-text binary and print characters. This complete infrastructure for accessing unstructured text sources has no rival in the industry, equipping users with a world-class toolset for unlocking the treasure of unstructured text data.

Figure: Extract Schema Designer

2. djIRIS - Internet Rapid Integration Services SDK

Pervasive Internet Rapid Integration Services SDK (djIRIS) solves the difficult yet essential problem of harvesting data from the greatest data source of all time: the World Wide Web. Again, this presents a good news/bad news scenario. The good news is that this data source represents an ocean of data of every conceivable kind from every conceivable source – and it resides literally at our fingertips. The bad news is that it is all locked away behind opaque HTML pages. And, in addition to the usual difficulties of navigating unstructured text-like HTML, there are the added barriers of HTTP and application-based authentication, as well as extremely complex navigation and looping scenarios to access the “leaf” pages targeted for harvesting.

To address this problem Pervasive has engineered, from scratch, a patent-pending djIRIS Engine that can act as a fully automated proxy browser, spoofing multiple browsers. djIRIS is a highly efficient and optimized language, based on the popular Java syntax, for controlling the behavior of an HTTP-based Web-browser agent. And, via a DOM-based XHTML infrastructure, djIRIS gives direct and automated control of a Web site to users. Guided by djIRIS scripts, or a very rich set of Java API calls, the djIRIS Engine intelligently traverses the World Wide Web and extracts useful structures of data of any shape or volume. The harvested data can then be delivered as XML, or fed directly to the high-speed Integration Engine for further downstream transformation and processing.

With the djIRIS Engine, two significant aspects of the Web are penetrated. First, there is the surface Web; the djIRIS Engine can harvest and deliver this data rather easily. This level consists of the 2,000,000,000+ pages of static HTML that we can think of as “generation 1” of the Web – i.e., Web pages Google can index. These HTML pages contain untold riches of data from a staggering variety of sources: internal or external, private or public. In many cases, these HTML surface pages are simply the most direct path – and sometimes the only path – to every kind of vital data needed for all sorts of business purposes (e.g., catalogs, documents, histories, etc.).

Second, there is “generation 2”: the “deep” Web. This is the more interesting “hidden” Web, the part that Google cannot touch, and is in fact not indexed at all. After some trial and error people quickly learned that it was impractical to simply dump the contents of their databases onto the surface Web – the volume of the databases was too large and their content changed constantly. Consequently, the generation 2 Web became much more interactive (with CGI and other scripting alternatives), which helped to build more intelligent gateways or portal pages. These could ferry authenticated user requests from the front-end HTML page to the back-end database, and then return the results of the query in dynamically formatted HTML to the browser. This generation 2 Web is rapidly dwarfing the generation 1 Web of static HTML pages. It is called the “deep Web” because it represents Web-based access to the real data treasures in the world – the thousands of huge databases that are otherwise locked behind firewalls, but are now integration-friendly via the magic of djIRIS. The scale of content access, aggregation and harvesting that this represents, via the unstructured medium of the World Wide Web, is truly staggering. And Pervasive Data Junction does this at a cost far below the pricey and less powerful alternatives offered by other vendors.

3. djIRIS Engine as Screen-scraping Integration Tool

Virtually all new application development is achieved with browser-based interfaces. djIRIS Engine, working directly at the browser interface level, and engineered to work bi-directionally (data entry and output) with all Web-based applications at the HTTP/HTML level, is a superb next-generation screen-scraping platform, leapfrogging all traditional legacy options (e.g., 3270, 5250, VT100 and PC). Above all, this gives users a powerful and dynamic new tool for integration projects. While application integration at the application and logic layers – and occasionally at the data layer – is quite common, there are still scenarios when integration at the presentation layer is requisite. Since Pervasive Data Junction

Figure: 3-Tier Application Integration
