
Data :Data mining 19/4/07 16:33 Page 39

Informatics

DATA MINING in the pharmaceutical industry

A research-based pharmaceutical company is a data-accumulating wonder. More than in any other industry, success is predicated on the collection, processing and exploitation of that data. This is not always recognised, and often not planned for, by large pharmaceutical companies. Now, however, with data storage and mining techniques making major advances in other industries, the pharmaceutical industry must adjust to fully exploit this potential competitive advantage in the discovery, development and marketing of its products. We discuss some of the impacts on companies and some of the adjustments they will have to make to maintain their position in the information age.

By Dr Robert D. Small and Herbert A. Edelstein

Sixteen years ago Peter Drucker¹ pointed out that the pharmaceutical industry was an information industry, not a or even a health industry. Since then, information technology has raced beyond anything he could have imagined. We now routinely collect and process gargantuan amounts of data quickly and cheaply. Drucker’s original insight did not receive much attention in the pharmaceutical industry at the time, and given the subsequent growth and prosperity of the industry, one could argue that this ignorance has not been very damaging.

However, the pharmaceutical industry is now moving to embrace this viewpoint and the technological and organisational changes it demands. Indeed, some of the biggest contributions to the recent growth of pharmaceutical companies have resulted from capitalising on the increased availability of data, improved information systems and database technology and the introduction of bioinformatics.

The size and complexity of the databases now proliferating in the pharmaceutical business are a major departure from most of the clinical and R&D databases of the past. Analysis methods that were useful on older, smaller databases do not always scale up to accommodate large amounts of data, necessitating the introduction of new methods and software tools, in particular data mining.

Data mining is a process that uses a variety of analysis and modelling techniques to find patterns and relationships in data. These patterns can be used to make accurate predictions that aid in solving problems across the entire spectrum of development, including R&D, clinical trials and marketing.

For example, a data mining contest² was recently held to predict the molecular bioactivity for a drug design; specifically, determining which organic molecules would bind to a target site on thrombin. The predictions were based on about 500 megabytes of data on approximately 1,900 organic molecules, each with more than 130,000 attributes (or dimensions, as they are called in data mining). This was a challenging problem not only because of the large number of attributes but because only 42 of the compounds (2.2%) were active. The relatively small number of cases (organic molecules in this example), compared to the large number of attributes, makes the problem

Drug Discovery World Fall 2001

Figure 1 Merck’s High-Throughput Screening business analytic based on Insightful Corp’s S-PLUS and StatServer products allows Merck to apply advanced statistical methods to well-plate data on thousands of promising compounds, and interpret the results quickly and easily using graphical methods.

Graphic courtesy B. Pikounis, Merck Research Laboratories and Insightful Corporation

even more difficult. Of the 136 contest entries, about 10% achieved the impressive result of more than 60% accuracy⁶, with the winner, Jie Cheng of the Canadian Imperial Bank of Commerce, reaching almost 70% accuracy.

As one of the most data intensive industries, the pharmaceuticals business has a wealth of such potential data mining applications from which it will gain substantial benefits. Tremendous amounts of data are collected during the development of a drug. The trend is to collect larger amounts of it automatically. A great deal of the basic biology data is now collected online in laboratories. More clinical data is being collected with electronic diaries, in clinic laptops, or even on devices attached to the patient in some way. The applications range from those common to most industries, such as marketing and sales, to unique opportunities in clinical trials and research and development, such as genomics and proteomics. Clearly, data mining has much to contribute to the pharmaceutical business.

The rise of data mining

The term data mining, in its most common use, is very new. The term previously had been used pejoratively by some statisticians and other specialists to refer to the process of analysing the same data repeatedly until an acceptable result arose³. By the early 1990s, a number of forces converged to make data mining a very hot topic. It has subsequently been widely applied in fields including retailing, banking, marketing and telecommunications.

At first, parts of the scientific community were slow to embrace data mining. This was at least partially due to the marketing hype and wild claims made by some software salespeople and consultants⁴. However, data mining is now moving into the mainstream of science.

Data mining has come of age because of the confluence of three factors. The first is the ability to inexpensively capture, store and process tremendous amounts of data. The second is advances in database technology that allow the stored data to be organised and stored in ways that facilitate speedy answers to complex queries. Finally, there are developments and improvements in analysis methods that allow them to be effectively applied to these very large and complex databases.

It is important to remember that data mining is a tool, not a magic wand. You can’t simply throw your data at a data mining tool and expect it to produce reliable or even valid results. You still need to know your business, to understand your data, and to understand the analytical methods you use.

Furthermore, the patterns uncovered by data mining must be verified in the real world. Just because data mining predicts that a gene will express a particular protein, or that a drug is best sold to a certain group, it doesn’t


mean this prediction is valid in the real world. You still need to verify the prediction with experiments to confirm the existence of a causal relationship.

The data mining process

For success in data mining it is essential to follow a methodical process, such as the following seven-step process⁵:

1 Define the business problem
2 Build the data mining database
3 Explore the data
4 Prepare the data for modelling
5 Build a model
6 Evaluate the model
7 Act on the results

Although the numbering implies a linear process, data miners often find themselves revisiting earlier steps based on what they have learned.

The first step is to prepare a clear statement of the problem you are trying to solve. As you proceed through the steps of data mining, however, your deepening understanding of the data and the problem will occasionally lead you to reformulate your objectives.

The next three steps involve preparing the data for mining and lead up to the actual model building. Together, they take more time and effort than all the other steps combined, typically consuming 60% to 95% of a project’s time and resources.

The data to be mined is usually gathered from multiple sources, and while in some cases it is possible to mine those sources directly, more often it is preferable to gather the data into a uniformly designed database first. Though a daunting task, such integration is worthwhile. For example, integrating all the databases having to do with a single drug allows a company to more easily respond to a regulatory agency that suspects a problem. Typically, the data relating to the drug is spread across tens to hundreds of databases of vastly different design, conforming to different standards and even stored in different database management systems. Analysis would be very difficult; millions of dollars would be spent in producing the requested safety summaries. If this data were properly consolidated, on the other hand, data mining would enable the company to quickly explore the data and identify both drug and patient-related candidate factors that may have raised the regulatory concern.

There are also external databases that can be mined in conjunction with corporate data. On the R&D side, hundreds of public and licensed

Figure 2 In clinical trials, understanding multivariate, time-dependent results can be challenging. This Trellis graphic shows the relationship between beta-carotene and Vitamin E in the bloodstream, for five different dosage levels of supplemental beta-carotene, as a local average of hundreds of blood sample measurements on 45 over a 16-month period.

S-PLUS graphic courtesy Insightful Corp


Figure 3 Data Mining Products

COMPANY | PRODUCT | ADDRESS
Angoss Software | KnowledgeStudio | www.angoss.com, 34 Saint Patrick Street, Suite 200, Toronto, Ontario, M5T 1V1, Canada
IBM | Intelligent Miner | http://www-4.ibm.com/software/data/iminer/fordata/, Route 100, Drop 1302, Somers, NY 10589, USA
Inforsense | Kensington Discovery Edition | www.inforsense.com, 47 Prince’s Gate, London SW7 2QA, UK
Insightful Corporation | S-Plus 6 | www.insightful.com, 1700 Westlake Avenue N, #500, Seattle, WA 98109, USA
MarketMiner | ModelQuest MarketMiner | http://www.marketminer.com/, 1575 State Farm Blvd, Charlottesville, VA 22911, USA
Megaputer Intelligence Inc | PolyAnalyst | http://www.megaputer.com/, 120 West 7th Street, Suite 310, Bloomington, IN 47404, USA
Microsoft | Data Mining Services | http://www.microsoft.com/data/oledb/dm.htm
NCR | Teradata Miner | http://www.ncr.com/products/software/teradata_mining.htm, NCR Computer Sys Grp, 17087 Via Del Campo, San Diego, CA, 92127, USA
Norkom Ltd | Alchemist | http://www.norkom.com/, Norkom House, 43 Upper Mount Street, Dublin 2, Ireland
Oracle | 9i Data Mining, 9i Personalization | http://www.oracle.com/ip/analyse/warehouse/datamining/
Quadstone | Quadstone | www.quadstone.com, 321 Summer Street, Boston, MA 02210, USA
Salford Systems | CART 4.0, Mars 2 | http://www.salford-systems.com/, 8880 Rio San Diego Dr, Suite 1045, San Diego, CA 92108, USA
SAS | Enterprise Miner | http://www.sas.com/, SAS Campus Drive, Cary, NC 27513, USA
Silicon Graphics | MineSet | http://www.sgi.com/software/mineset/, 2011 N. Shoreline Blvd, Mountain View, CA 94043, USA
SPSS | Clementine | http://www.spss.com/clementine/, 233 S. Wacker Drive, 11th Floor, Chicago, IL 60606, USA
Torrent Systems | Orchestrate | http://www.torrent.com/, 5 Cambridge Center, 7th Floor, Cambridge, MA 02152, USA
Unica Solutions, Inc | Affinium Model | http://www.unicacorp.com/products/model.htm, Lincoln North, Lincoln, MA 01773, USA
Urban Science | GainSmarts | http://urbanscience.com/, 200 Renaissance Center, 19th Fl, Detroit, MI 48243, USA


genomic databases are available. These databases can be brought in-house. On the marketing side, there are databases of individuals’ demographics and behaviours from vendors such as Acxiom (www.acxiom.com). Increasingly, companies are licensing this data for internal use, accessing it through tools such as IBM’s Discovery Link (http://www3.ibm.com/solutions/lifesciences/index.html), or using it through services such as those from Viaken (www.viaken.com).

Once the data mining database has been built, it is time to explore the data. Data visualisation often yields insights to help build better predictive models. The graphs in Figures 1 and 2 show how visualisations can help the data miner by providing a comprehensive yet concise representation of the data. They also show that even a good visualisation takes some training and experience to interpret.

The final step in data preparation is to transform the data for mining. Ideally, you feed all your attributes into the data mining tool and let it determine which are the best predictors. In practice, this doesn’t work very well. Not only can you introduce problems by including too many irrelevant attributes, but in all likelihood the best predictors are actually combinations of other attributes. For example, BMI (body mass index, a calculated value based on height and weight) may be more important than either height or weight in predicting the efficacy of a drug.

The fifth step is actually building the model. The most important thing to remember about model building is that it is an iterative process. You will need to explore alternative models to find the one that is most useful in solving your business problem. The process of building predictive models requires a well-defined training and validation protocol in order to generate the most accurate and robust predictions. This kind of protocol is sometimes called supervised learning. The essence of supervised learning is to train (estimate) your model on a portion of the data, then test and validate it on the remainder of the data.

In the sixth step you evaluate your models’ results and interpret their significance. Remember that the accuracy rate found during testing applies only to the data on which the model was built. In practice, the accuracy may vary if the data to which the model is applied differs significantly from the original data.

Once a data mining model has been built and validated, it can be used as a general guideline for action or it can be applied to a large batch of data, such as microarray data. It can also be applied to a single case at a time. For example, as we learn how individuals respond to a drug, a model may be interactively applied to determine the safest and most effective drug to prescribe for that person.

There are a large number of data mining tools available. A partial list is shown in Figure 3⁶. In general, the products available are quite good, and are especially strong in model building. In the last year, there has been great improvement in the ease of using the models you build. However, data preparation and visualisation still need to be made more effective.

Organising for data mining success

In order to enjoy the maximum benefits of their data resources and realise the potential offered by mining them, companies will have to adopt some of the best organisational practices of those who have succeeded with data mining.

Perhaps the most important change is to recognise that pharmaceutical research, development and marketing are part of an integrated effort and this integration must be backed with a management commitment that supports the integration of data across the corporation.

This integration will also be reflected in a broadened knowledge and understanding of how to use data by all participants in the research, development and marketing process. There is no doubt that the average and of today is much more sophisticated in many areas of computer usage than even five years ago. But it is no longer sufficient to partition and segment duties among narrowly trained specialists. What is needed is a more generalist approach and an integration of job roles and skills.

Because the data now being collected is so diverse, massive and complicated, a team approach that brings together domain experts, database experts, and analytical experts is required. Such a team needs a holistic view that cuts across both domain and technical areas with each member understanding and appreciating the contribution of the others.

There are numerous examples of the kinds of integration needed in multiple disciplines such as bioinformatics, pharmacoepidemiology, pharmacoeconomics and pharmacogenomics that cut across traditional boundaries. Though they are only first steps, they have important common traits. All involve integration of domain areas that were once thought to be unconnected. They all require an understanding of data structures, storage technology and data models that would have been unthinkable a decade ago. A researcher or


contributor in any of these areas needs training or experience in multiple domain areas as well as in information science.

Interestingly, many of these arguments about multiple skills and technical vs domain knowledge now occur at data mining conferences and in data mining publications. Though the emphasis is somewhat different from that made here, the source is similar: a confluence of multiple disciplines and huge amounts of data and computer power provides new and only recently realised opportunities.

Data and the future of the pharmaceutical industry

New approaches are needed in order to fully realise the potential of technologies that allow for the creation, acquisition, storage and analysis of databases of unprecedented size and complexity. The primary change is one of attitude. When database technology first arrived in the pharmaceutical industry, most researchers thought that it was useful to have programmers who had some domain knowledge. It was common to give newly hired IT people a course in medical terminology. The modern company needs to view itself as a data machine whose primary business is collecting, processing, analysing and using its data as the primary resource of the company.

This trend will continue and accelerate. Genomic and related technologies allow for more data to be collected on each patient both during the development of a drug and after it is marketed. General IT technologies continue to move in a direction that allows more data to be collected, processed and analysed. There are demands for more information about each product from patients, physicians and regulators.

These pressures will lead to certain changes in the successful pharmaceutical company. The company will have to invest substantial effort in integrating its large and diverse data sources. This is a stupendous effort that we could not hope to describe here. Suffice to say it will involve a change in the status of IT people in the organisation, a demand for more training and experience in IT for people all over the organisation and an increased level of domain expertise by IT and associated workers.

The company will be storing and processing all of these data with a view to improving the business. This will require access with appropriate tools by an analytically astute workforce. More people will have more access to more data. More people will need analyses and interpretations of the data. A larger share of all decisions will be made with the support of the pointed analyses of appropriate data. The structure, job descriptions and likely organisations will change.

We are living in an age that emphasises a great increase in collection and use of data. It is not surprising that the pharmaceutical industry, which has been an information industry for decades, is even more affected by these trends. The fact that some of these changes have happened so quickly and that many in pharmaceutical and medical research did not recognise the previous emphasis on information may have disguised the cataclysmic events taking place. Nevertheless they are coming and data mining will play a major and steadily increasing role in pharmaceutical research in the 21st century. DDW

Bob Small is currently Vice-President of Data Mining Technologies for GlaxoSmithKline. After an academic career that included appointments at Temple University and the University of Pennsylvania, Bob joined Merck. There he took on positions of increasing responsibility serving as Senior Research Fellow, Senior Director and Executive Director of the Biometrics Department. He later served as Head of Biometrics for Burroughs Wellcome, Glaxo Wellcome and . In 1996 he joined Two Crows Corp, a data mining consultancy, as VP of Research. Bob is widely published in statistical, medical and biological journals, is active in the American Statistical Association (ASA) and is Chair Elect of the Biopharmaceutical Society Section of the ASA. He has a BS from the University of Maryland in Maths and Physics and a PhD in Biomathematics from North Carolina State University.

Herbert Edelstein is President of Two Crows Corporation. He is an internationally recognised expert in data mining, data warehousing and client-server computing, consulting to both computer vendors and users. He is regularly invited as a chair and keynote speaker at conferences on these topics and is a founder of the Data Warehousing Institute. Prior to Two Crows, Herb was a founding partner of Euclid Associates, specialising in data warehousing and data management. He was also Vice-President of Marketing and Sales at Sybase and International Database Systems, General Manager of the Model 204 division of Computer Corporation of America and a consultant for American Management Systems.

References
1 Drucker, Peter F. Innovation and Entrepreneurship, 1985, Harper & Row, NY, NY.
2 Knowledge Discovery in Databases: 10 Years After. SIGKDD Explorations, ACM SIGKDD V.1 59-61, 2000.
3 Friedman, HP and Goldberg, Judith. Knowledge Discovery from Databases and Data Mining: New Paradigms for Statistics and Data Analysis. Report V.8 No.2 2000.
4 DuPont Pharmaceuticals Research Laboratories and KDD Cup 2001.
5 Edelstein, Herb. Introduction to Data Mining and Knowledge Discovery, 1999, Two Crows Corporation.
6 Edelstein, Herb. Data Mining Technology Report: 2002, Forthcoming, Two Crows Corporation.
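
A worked sketch of data preparation, model building and evaluation

The article's description of steps 4 to 6 (deriving an attribute such as BMI from height and weight, training a model on a portion of the data, then checking it on the remainder) can be made concrete with a small sketch. It is not taken from the authors or from any product listed in Figure 3. The synthetic patients, the BMI cut-off of 27.0 used to generate the labels, the 10% label noise, the 70/30 split and every function name below are assumptions chosen purely for illustration, and the single-threshold rule merely stands in for a real predictive model.

```python
import random


def bmi(height_m, weight_kg):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def make_synthetic_patients(n=200, seed=1):
    """Generate invented patients; the response rule is an assumption, not real data."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        height = rng.uniform(1.50, 1.95)
        weight = rng.uniform(50.0, 110.0)
        value = bmi(height, weight)
        # Hypothetical rule: response depends on BMI, with roughly 10% of labels flipped.
        responder = (value >= 27.0) != (rng.random() < 0.1)
        patients.append({"height_m": height, "weight_kg": weight,
                         "bmi": value, "responder": responder})
    return patients


def holdout_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows, then split them into a training part and a held-out part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(rows, threshold):
    """Fraction of rows where 'BMI at or above threshold' matches the recorded response."""
    hits = sum((r["bmi"] >= threshold) == r["responder"] for r in rows)
    return hits / len(rows)


def fit_threshold(train_rows):
    """'Train' the stand-in model: pick the BMI cut-off that is most accurate on training rows."""
    candidates = sorted({r["bmi"] for r in train_rows})
    return max(candidates, key=lambda t: accuracy(train_rows, t))


patients = make_synthetic_patients()
train, holdout = holdout_split(patients)

threshold = fit_threshold(train)            # estimated from the training portion only
train_acc = accuracy(train, threshold)      # tends to be optimistic
holdout_acc = accuracy(holdout, threshold)  # measured on rows the model never saw

print(f"learned BMI cut-off: {threshold:.1f}")
print(f"training accuracy:   {train_acc:.2f}")
print(f"held-out accuracy:   {holdout_acc:.2f}")
```

Because the cut-off is chosen using the training rows alone, the held-out figure is the fairer estimate of how the rule would behave on new patients. As the article cautions, even that figure may not hold if the data the model is later applied to differs significantly from the original data.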
