
THE LARGING-UP OF BIG DATA

'Big data' is a buzz-term that is resonating big-time with IT solutions providers and end-user organisations. But are 'big data' applications really so different from the business intelligence and analytics tools that have been around for decades? Martin Courtney investigates.

THE TERM 'BIG DATA' has been getting much exposure in IT circles over the last year or two, on a scale that is bound to cause seasoned industry-watchers to sniff the air for the familiar aroma of industry hyperbole. There is the customary amount of hype, of course, but there is more to it than the covert repackaging and repurposing of existing products.

In one sense 'big data' is a classic misnomer. The implication is that the volume of electronic information being generated and stored is now so large that existing database systems are no longer able to handle it.

It is certainly true that the world is generating data on an unprecedented scale, and it is going to escalate as trends such as machine-to-machine applications roll out. However, it is not so much its size as the diversity of formats that data now comes in – particularly unstructured sources like text, email, instant messages, Web pages, audio files, phone records and videos – and what people want to do with it that presents the bigger problem.

"Most vendors are now realising that big data has actually very little to do with databases and more to do with information management," according to Clive Longbottom, director of analyst firm Quocirca. "Eighty per cent of an organisation's data is now electronic, yet 80 per cent of that is not held in a database, so cannot be dealt with just by throwing a big database at it." It is a question of "how you pull data from a Microsoft Office or whatever into an environment where it can be dealt with", Longbottom believes.

"Companies have always done big data – escalating amounts of information – but that is not really the definition of the term. It is more about the variety of the data and the velocity at which it comes at you," explains Dr David Schrader, director of marketing and strategy at data warehousing software firm Teradata. "Traditional customers may have a lot of data in tabular format – customer credit ratings tables, for example – which they need to join together in a variety of ways. For some customers it's megabytes, gigabytes, terabytes – the biggest with petabytes, like eBay, say." However, with entities like the Web, and social media sites like LinkedIn, the data sets those analytics must run over are semi-structured. Schrader says it is "hard to force them into a relational database. It is far easier if you have database systems with the required speed to be up and running already to handle non-relational database data, systems able to run queries in parallel".

Compliance versus intelligence

Maintaining separate storage systems to handle all those different forms of data is generally inefficient, particularly if an individual or organisation wants to exploit all of the information it stores for meaningful insight, and to do that fast enough to make the most of any business opportunity the exercise might subsequently provide. Most organisations keep data archived for compliance and regulatory purposes, at least on a temporary basis, before deleting. But others see the value in the information itself, and apply business intelligence and analytics tools to pull out statistics and patterns which they can turn to their advantage before discarding it.

Archiving data as insurance against potential e-discovery requests is relatively easy, as the organisation does not need to know precisely what information is being kept, only that it can search it if necessary, while modest investment in the required capacity is easily offset against the cost of potential litigation.

Arvind Krishna is general manager of information management at IBM, which, like Teradata, EMC, Oracle, and a host of other software application vendors, is making a big play for big data customers, albeit with a slightly different approach (see 'Big Data Goes To Work' boxout). He recalls the case of a big utility industry customer in the US running a power plant offering nuclear and fossil fuel. "It had a bunch of systems from 20-30 years ago, and wanted to cut down storage and IT costs, but because of compliance and regulation it had to keep the old systems going to show the auditor what systems they were running to avoid accidents," Krishna says. "Now it can use metadata to search them, build a new archive [to house them] and keep it in a place where they can easily query it, and shut down all the stuff sitting in the main database. It can be much more cost effective than having two systems where there is some accountability, and can pay for itself in six months."

Quocirca's Longbottom agrees: "If this [stored data] is going to be something about people's mortgages, for example, we need to be able to prove how we put everything together to prove that opportunity, so when mis-selling cases hit the headlines it is maintaining that auditability as well."

Onboard the 'big data' bandwagon

When applying business intelligence and analytics tools to large repositories of structured and unstructured data on a regular basis, there is a danger that companies will spend time and money on new systems that are able to sift through information on an industrial scale, only to find that the data contains little or no value to the business anyway. As such, there are certain industries that are far more likely to gain advantage from big data projects than others, with the healthcare, retail, utilities and transport sectors top of the list.

We are already seeing the healthcare sector benefitting, because it has so much information that is not in databases, or is spread across multiple databases. Longbottom argues that the retail sector "could do a lot with it because it has lots of stuff held in databases around loyalty cards, for example, and they often want to be pulling data in from social networks to get a better idea of what customers and prospects are thinking".


‘Most of an organisation’s data cannot be dealt with just by throwing a great big database at it’ Clive Longbottom, Quocirca

Big data Describes data sets that have grown so large that they become awkward to work with using on-hand database management tools. Typical difficulties include capture, storage, search, sharing, analytics and visualising. This trend continues because of the benefits of working with larger and larger data sets, allowing analysts to discern and validate trends, such as tracking (and preventing) the spread of diseases.

Business intelligence (BI) Computer-based techniques used in identifying, extracting and analysing business data, such as sales revenue by products and/or departments, or by associated costs and incomes. BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining and predictive analytics. Until recently, BI applications have been seen mainly as the preserve of very large enterprises and organisations.

Data analytics (DA) The science of examining raw data with the purpose of drawing conclusions about that information. Data analytics is used in many industries to allow companies and organisations to make better business decisions, and in the sciences to verify or disprove existing models or theories. Data analytics is distinguished from data mining by the scope, purpose and focus of the analysis. Data miners sort through huge data sets using sophisticated software to identify undiscovered patterns and establish hidden relationships. Data analytics focuses on inference, the process of deriving a conclusion based solely on what is already known by the researcher.

Data mining Interdisciplinary field of computer science that describes the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems.

SOURCES: WIKIPEDIA, TECHTARGET, E&T RESEARCH.
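Schrader's point about semi-structured sources is easier to see with a concrete, if toy, example. The Python sketch below uses invented social-media records (the field names and posts are purely illustrative, not drawn from any vendor's product) to show why such data resists a single fixed table – every record carries a different shape – yet a simple analytic can still be run directly over the raw records.

```python
import json
from collections import Counter

# Hypothetical, hand-made records: each "post" carries different fields,
# so no single CREATE TABLE statement fits them all without constant
# schema changes or columns full of NULLs.
raw_posts = [
    '{"user": "alice", "text": "Love the new phone", "likes": 12}',
    '{"user": "bob", "text": "Battery life is poor", "replies": [{"user": "carol", "text": "Agreed"}]}',
    '{"user": "dave", "text": "Great camera", "tags": ["photo", "review"], "likes": 3}',
]

posts = [json.loads(p) for p in raw_posts]

# A simple analytic run straight over the semi-structured records:
# count how many posts mention each keyword of interest.
keywords = ("battery", "camera", "phone")
mentions = Counter(
    kw for post in posts for kw in keywords if kw in post["text"].lower()
)
print(mentions)  # Counter({'phone': 1, 'battery': 1, 'camera': 1})
```

Document stores and Hadoop-style systems generalise the same idea: keep the records in their native, irregular form and push the query logic to the data, rather than forcing everything through a relational schema first.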

[Survey charts, Freeform Dynamics (www.freeformdynamics.com): Chart 1 asked 'What level of growth are you seeing in the following types of data within your organisation?' for structured data (e.g. tabular in RDBMSs) and unstructured data (e.g. documents, messages, multi-media, etc.), rated from 5 (extremely high growth) to 1 (no growth). Chart 2 asked 'Considered overall, to what degree does your organisation exploit its information assets for analysis and decision-making purposes?', rated from 5 (fully) to 1 (poorly).]

Table 1: Organisations are seeing data volumes increase, with unstructured data looking set to grow even faster than structured data in some cases, according to a survey by IT industry analyst Freeform Dynamics.

Table 2: Lack of clear return on investment is one key reason why so few organisations are extracting value from information held outside systems designed for handling structured data.

He adds: "The utility companies have masses of data that is not being mined correctly, and they are not pulling in external information. Security agencies – MI5 or MI6, for example – have got to be thinking about picking up patterns of information going across things like mobile phone records, email and what's happening on Facebook, so they can pull it all together and say 'right, this is the door that we go and knock down'."

Teradata's Dr David Schrader identifies the telephone companies or individual company call-centres as those which can benefit from big data analytics that interrogate call detail records (CDRs) to identify patterns around customer behaviour, as well as examples in the retail and transport industries.

"Think about eBay, the rate and volume of transactions, and the active intelligence you can gather from the data and put it in a database, for example," Schrader says. "It is also about situational awareness in real-time – British Airways uses similar tools to replan operations in the event that a volcano blows and screws up [its schedules], with information on grounded planes, crew and passengers all at their fingertips in order to be able to construct an alternative [route and schedule]."

Profit wedge curve

Return on investment is always an ephemeral concept when it comes to business intelligence and Web analytics, but Schrader insists that big data solutions able to process so many different types of information in real-time provide better predictions on the effect of new sales or product strategies than earlier tools. So much so, according to Schrader, that a profit wedge curve – the classic V schematic, where growing revenue is offset by reduced costs to deliver increased profit – is very much a reality.

"A retailer, for example, would use Web analytics to see what would happen if they dropped something [a new product] into their website," he says, "but alongside traditional measures like sales or net promoter scores, you can now capture user tweets, which do not use tabular data, and get back an idea about who is happy with a new product, and who is not happy. Those can be critical."

Cloud services are the future

That return on investment depends to a large extent on the capital cost of the storage, processing and analytics resources needed to handle big data in the first place, which is generally not cheap. Oracle, EMC, and Microsoft have rushed to introduce big data solutions based around Apache Hadoop, a platform inspired by the technology Google created to index the vast amounts of text and other document metadata it was collecting via the Internet to help improve its own search engine performance. Apache Hadoop is customised towards specific tasks and data types under an open source licence, typically running on a specialised hardware appliance designed to be installed on the customer's premises.
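The map-and-reduce pattern that Hadoop popularised – and the tweet-counting analysis Schrader describes – can be sketched at laptop scale with nothing more than standard Python. The word lists and tweets below are invented for illustration; a real deployment would run an equivalent job across a cluster rather than a handful of local processes.

```python
from multiprocessing import Pool
from collections import Counter

# Invented sentiment word lists, purely for illustration.
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "broken", "slow"}

def map_tweet(text):
    """Map step: score one tweet's words against the two lists."""
    words = text.lower().split()
    return Counter(
        {"positive": sum(w in POSITIVE for w in words),
         "negative": sum(w in NEGATIVE for w in words)}
    )

def reduce_counts(counters):
    """Reduce step: merge the per-tweet tallies into one total."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total

if __name__ == "__main__":
    tweets = [
        "love the new handset, great screen",
        "delivery was slow and the box arrived broken",
        "happy with the camera",
    ]
    with Pool() as pool:                 # map work is spread across CPU cores
        per_tweet = pool.map(map_tweet, tweets)
    print(reduce_counts(per_tweet))      # Counter({'positive': 3, 'negative': 2})
```

The design point is that the map step needs no shared state, so the same code shape scales from a handful of processes to thousands of cluster nodes, with the framework handling distribution and fault tolerance.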


There's more online...
Data management - will we ever press 'delete'? http://bit.ly/eandt-dump-data
Unstructured data: nail it - then mine it http://bit.ly/eandt-unstructured
IET Event: Big Data seminar, December 2012 http://conferences.theiet.org/big-data/index.cfm

That thinking is starting to change, with all the vendors looking to deliver more flexible, hosted big-data solutions available through cloud services which – in theory – could trim costs through an on-demand, pay-as-you-go model, as long as customer concerns around security and performance can be addressed. IBM led with the launch of its Hadoop-based InfoSphere BigInsights distributed data processing and analysis platform as a service (PaaS) in October 2011, with rivals seemingly set to follow.

"IBM's makes more sense as a cloud solution rather than selling somebody a shedload of powerful [on-premise] systems," says Quocirca's Clive Longbottom. "Business intelligence vendors are also moving towards the cloud – look at what they are doing when digging through 12TB of data in Facebook and other environments, it is much better that they have that control, their own security and data centres."

The problem with big data and the cloud: pushing large volumes of information over any network invariably risks performance and availability issues. This opens up the market to vendors keen to sell additional bandwidth optimisation solutions, and is one reason why Teradata prefers to stick with the on-premise approach.

"That is a key engineering challenge," reckons Teradata's David Schrader. "Typically you want to push the computation as close to the data as possible – you don't want bits and pieces all over the place, especially with call detail records (CDRs), for example. You would never want to copy 100 billion CDRs into the cloud to do the calculation, and that is why a lot of big companies prefer to have data at their fingertips in one system. Other than cloud surge capabilities, they have mostly tended to keep stuff in-house."

Is 'big data' actually that big?

Despite the continued frenzy of hardware and software vendors keen to sell their wares on the back of big-data initiatives, a project does not necessarily require investment in new hardware and software if it is done correctly, says Quocirca's Clive Longbottom. He believes it is more about tweaking existing systems in the first instance. Deduplication, its advocates claim, can make a significant contribution to stripping away the 'slag' that can make data mining initiatives daunting at first sight.

"When you start looking at big data you find much redundant data: the same file in 48 different places, so if you can delete 47 of them, and just maintain pointers to all the rest, you instantly need less storage," Longbottom points out. "Once you get single instance you get less network traffic, so it can all be done correctly; but you need to plan correctly. As with anything to do with information management, it is a case of 'garbage in, garbage out' – you need to do data cleansing and de-duplication across the whole environment first so you end up with something far cleaner, and look at master data modelling before you look at a big data solution."
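Longbottom's single-instance argument is easy to sketch. The example below (plain Python; the 'archive' directory is a hypothetical path) hashes file contents so that identical copies collapse to one stored instance plus lightweight pointers – a drastically simplified version of what commercial deduplication products do.

```python
import hashlib
from pathlib import Path

def deduplicate(paths):
    """Map every file to a single stored instance keyed by content hash.

    Returns (store, pointers): 'store' holds one path per unique content,
    'pointers' maps every original path to the instance that keeps the bytes.
    """
    store = {}      # content hash -> path of the single kept instance
    pointers = {}   # original path -> path of the kept instance
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        store.setdefault(digest, path)   # first copy seen becomes the instance
        pointers[path] = store[digest]
    return store, pointers

# Hypothetical usage: scan a directory tree and report the potential saving.
if __name__ == "__main__":
    files = [p for p in Path("archive").rglob("*") if p.is_file()]
    store, pointers = deduplicate(files)
    print(f"{len(pointers)} files resolve to {len(store)} unique instances")
```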

BIG DATA GOES TO WORK
JSTART STARTS TO PULL IN BIG DATA BACK ENDS

[Picture: IBM's Watson supercomputer – from TV gameshows to grappling with 'big data']

Many companies have successfully applied business intelligence or Web analytics tools to existing data warehouses or other databases, often integrating data from various unstructured sources into other applications, such as customer relationship management (CRM) or enterprise resource planning (ERP). Whether this constitutes a big-data solution or not is a moot point; but one company taking a different approach is IBM, through its Jstart client engagement team, part of the IBM Software Solutions Group.

Car hire company Hertz engaged Jstart to implement an analytics project based on a MindShare Technologies application to gather daily 'customer sentiment' information from unstructured sources as diverse as Web surveys, emails and text messages, in order to consistently analyse feedback and provide insights on problem areas which could be immediately addressed. As a result, Hertz says, it was able to identify areas for improvement in its Philadelphia office around delays for the return of vehicles at specific times of day, and solve them by adjusting staffing levels at peak times.

Jstart was also the catalyst behind a big-data project for US healthcare company UNC Healthcare. The company used IBM's own text analytics software to mine patient data from various forms and databases to discover which patients were at greater risk of re-admittance to hospital, take proactive steps to minimise that risk, and therefore reduce the chance of it being penalised under the new Medicaid regulations which fine healthcare providers who have excessive numbers of re-admittances.

More significantly for the future, IBM Jstart will help commercialise IBM's legendary Watson supercomputer platform – as used in a special 2011 edition of the US TV quiz show 'Jeopardy', in which the system competed against humans to come up with correct answers as quickly as possible while receiving all information electronically as a text file. Ignore the artificial intelligence and text-to-speech technology which enabled Watson to perform, and the mighty computer is actually a data analytics and insight engine that uses a combination of text analytics, natural language processing and semantic systems to analyse the vast stores of information contained within its gigantic complement of hard disks and RAM.

After calling for proof of concept implementations last year, IBM is now looking to apply Watson's talents to tackle business-orientated big data solutions. That means working with Jstart to produce lighter-weight configurations of Watson for different, application-specific tasks, then using that processing power to gather, organise and sift large volumes of unstructured data, and understand the context of items of interest within the broader scope of the text in which they are discovered, to identify trends and patterns more accurately.

IBM general manager Arvind Krishna points out that Watson's real strength here is the ability to perform very sophisticated statistical analysis on the information gathered to understand 'relationships' and 'correlations' between data – many of which are either not immediately apparent or are very subtle. "If all you have is the book-keeping table," says Krishna, "it does not tell you whether the information is important or not. It needs to become application aware and business context aware; you cannot just do it by looking at a disk or a tape."
