<<

USOO791 7525B2

(12) United States Patent (10) Patent No.: US 7,917,525 B2 RawlingsWi et all e 45) Date of Patent : Mar.e 29, 2011

(54) ANALYZING ADMINISTRATIVE 5,956,689 A 9/1999 Everhart, III ...... 705/3 HEALTHCARE CLAMS DATA AND OTHER 5,970,463 A 10/1999 Cave et al...... TOS/3 5,970.464 A 10/1999 Apte et al...... TOS/4 DATASOURCES 6,014,631 A 1/2000 Teagarden et al. . ... TOS/3 6,137,911 A 10/2000 Zhilyaev ...... 382,225 (75) Inventors: Jean Rawlings, Roy, UT (US); David 6,151,581. A 1 1/2000 Kraftson et al. ... TOS/3 Anderson, Chaska, MN (US); Carl 6,223,164 B1 4/2001 Seare et al...... TOS/2 Kraus, Raleigh, NC (US); Andrew 6.253,186 B1 6/2001 Pendleton, Jr. ... TOS/2 r s s 6,370,511 B1 4/2002 Dang ...... TOS/3 Paris, Victor, NY (US) 6,732,113 B1 5/2004 Ober et al...... 707/102 (73) Assignee: Ingenix, Inc., Eden Prairie, MN (US) (Continued) (*) Notice: Subject to any disclaimer, the term of this FOREIGN PATENT DOCUMENTS patent is extended or adjusted under 35 EP 1420344 A2 * 5, 2004 U.S.C. 154(b) by 357 days. OTHER PUBLICATIONS (21) Appl. No.: 11/567,577 Grunwald et al., “Analysis of Genotypic Diversity Data for Popula (22) Filed Dec. 6, 2006 tions of Microorganisms.” Phytopathology, 93:738-746, 2003. 1C ec. O, (Continued) (65) Prior Publication Data US 2007/O174252A1 Jul 26, 2007 Primary Examiner — John R. Cottingham s Assistant Examiner — Susan F Rayyan Related U.S. Application Data (74) Attorney, Agent, or Firm — Fulbright & Jaworski L.L.P. (60) Synal application No. 60/742,774, filed on Dec. (57) ABSTRACT s Techniques suitable for identifying potential Subjects for a (51) Int. Cl. clinical trial and other applications are disclosed. One or more G6F 70 (2006.01) exclusion or inclusion criteria are defined for the clinical trial. GO6F 17/30 (2006.01) One or more specialized searching tables are pre-generated (52) U.S. Cl...... 707f758. 707/941, 705/2 using administrative healthcare claims data and the one or (58) Field of Classification Search s 707f1-3 more exclusion or inclusion criteria. The specialized search See application file for complete search historv. ing tables are searched.- 0 Through the searching step, Subjects pp p ry are identified within the administrative healthcare claims data (56) References Cited who match the one or more exclusion or inclusion criteria. Through the searching step, a geographical area is identified U.S. PATENT DOCUMENTS corresponding to the Subjects who match the one or more 5,191.522 A 3, 1993 Bosco et al...... 705/4 exclusion O inclusion criteria. A customized report is gener 5,197,024 A 3, 1993 Pickett ...... 708/517 ated using the identified Subjects and geographical area. 5,557.514 A 9, 1996 Seare et al. ...7052 5,835,897 A 1 1/1998 Dang ...... 705/2 13 Claims, 11 Drawing Sheets zoo

2O2 Pre-generate table

204 w Search administrative healthcare claims data

2O6 w Compare one subset wersus others

208 w Calculate hypergeometric statistical result

210 w Detect fraud/rate physician etc. US 7,917,525 B2 Page 2

U.S. PATENT DOCUMENTS 2004/0260577 Al 12/2004 Dahlin et al...... 705/2 6,799,124 B2 9/2004 Perdue et al...... TO2/34 2005/00 10443 A1 1/2005 Zammitt ...... 705/2 2005, 0071189 A1 3/2005 Blake et al. .. 705/2 6,826,536 B1 1 1/2004 Forman ...... 705/4 2005.0114334 A1 5/2005 Ober et al...... 707/9 2001/00 16842 A1* 8, 2001 Umen et al. 707/1 2005/02O3776 A1 9/2005 Godwin et al...... 705/3 2001/0034631 A1 10, 2001 Kiselik ...... 705/8 2005/0228593 A1 10, 2005 Jones ...... 702/19 2002fOOO2474 A1 1/2002 Michelson et al. 705/3 2005/0228808 A1 10, 2005 Mamou et al. .. 7O7/1OO 2002/01383.06 A1 9, 2002 Sabovich ...... 705/3 2005/023474.0 A1*ck 10/2005 Krishnan et al...... 705/2 2002fO165762 A1 11/2002 Goldman et al. 70.5/10 ck 2002/0173990 A1 11/2002 Marasco ...... 705/2 2005/0234968 A1*ck 10/2005 Arumainayagam et al. .. 707/102

2005/0283466 A1* 12/2005 Dettinger et al...... 707 3 2003/004.6114 A1 3/2003 Davies et al. 705/3 2006, OO15325 A1 ck 1/2006 Moore ...... 704/9 2003, OO65740 A1 4, 2003 Allen ...... 709/217 2007/0271214 A1* 11/2007 Sumino et all 707/1 2003/0191664 A1 10, 2003 Blumenfeld ...... 705/2 ...... 2004/0044654 A1 3/2004 Ballenger et al. 707/3 OTHER PUBLICATIONS 2004f0078216 A1 4/2004 Toto ...... 705/2 2004f0078220 A1 4/2004 Jackson ...... 705/2 Wu, "An Accurate Computation of the Hypergeometric Distribution 383-8658 A. 2. Sisley et al. 293 Function.” ACM Transactions. On Mathematical Sofiware, 19:33-43, 2004/0153435 A1* 8/2004 Gudbartsson et al. 707/1 1993. 2004/0172293 A1 9/2004 Bruschi et al...... 705/2 PCT International Search Report and Written Opinion, issued in 2004/0210457 A1 10, 2004 Sameh ...... 705/2 International Application No. PCT/US 06/61691, dated Feb. 14. 2004/O220831 A1* 11, 2004 Fabricant .... 705/2 2008. 2004/0236601 A1 11/2004 Summers et al. 705/2 2004/024.9677 A1* 12/2004 Datta et al...... 705/3 * cited by examiner U.S. Patent Mar. 29, 2011 Sheet 1 of 11 US 7,917,525 B2

100

102 Define exclusion/inclusion Criteria

104 Pre-generate specialized searching table(s)

106

108 ldentifyh patients/ professionals & geography

110 Generate Customized report

FIG. 1 U.S. Patent Mar. 29, 2011 Sheet 2 of 11 US 7,917,525 B2

200

202 Pre-generate factorial table

204 Search administrative healthCare claims data

2O6 Compare one subset Versus others

208 Calculate hypergeometric Statistical result

210 Detect fraud/rate physician etc.

FIG. 2 U.S. Patent Mar. 29, 2011 Sheet 3 of 11 US 7,917,525 B2

304

if it, it did, if fy if f f 4Riyadiyyyyaf f yy ys FFF Fif F if FR FF 44RyfyFyfyiyyid AR fyly, AR 4 y H

FIG. 3 U.S. Patent Mar. 29, 2011 Sheet 4 of 11 US 7,917,525 B2

Enter title here

Enter title here U.S. Patent Mar. 29, 2011 Sheet 5 of 11 US 7,917,525 B2

Enter title here

88 : U.S. Patent Mar. 29, 2011 Sheet 6 of 11 US 7,917,525 B2

Acute Coronary Syndrome Diaetic acareera Non-Hodgkins lyinphona Acte LekaeriaS Diabetic Meuropathy Obesity ADH Eri Stage Renal Disease O3Sessive-Compulsive disorder Allergic Riinitis Epilepsy Oriopharyngeal Candidiasis Allergy Erectile Dysfunctio Osteoarthritis eire's sease Gastric Symptoms Overactive Bladder (OAB) Area Gastreiteitis Painful Diabetic Neuropathy Aeria in Cance Faents 6astroesophageal Reflux Disease Paisle Argirls Generalized Anxiety Disorder Pakirs risisease Anisometropia and for Amblyopia acca Post-Operative Pain orthopedic) Asia Healthy Wolunteers (Analgesia Premenstral Dysphoric Disorder Benign Prostac Hyperplasia (BPH) Her philia Postate Carcer Bipolar Disorder Hepatitis B Fariasis Bore farrow Transpart Hepatis C Rigrati Atrig Brains Harmone Replacement Therapy Riritis Breast Carcer Adjuvant) Haar immunodeficiency Wirus (HIW Schizophrenia Breast Cancer (Advanted Metastatic Hypercholesterolena SeaStral Affectiye Disorder Brocits Hyperlipidemia Sexual sysfunction Chronic Immune Thrombocytopenic Purpura irritable Bowe Syndrone Small Celling Cancer (SCLC) Chronic Inflammatory Demyelinating Polyneuropathy Major Depressie Disorder Sroking Cessation Chronic, Non-alignant Pain Soelia Sgia Pi Congestive Heart failure Migraine Strike CFD Mild Cognitive impairment Systeris Luptas Erythematosis Cronary Artes isease Multiple Sclerosis Ps Dementia titley Bodies Myocarcial triarction Ulcerative Colts O3es Non Small Ceil Larg Carcer (NSCLC) Urinary Tract infection FIG. 5 U.S. Patent Mar. 29, 2011 Sheet 7 of 11 US 7,917,525 B2

DC Codes DX: 042:HIV DISEASE (23718) OX: 042.0:HIV & SPECIFIC INFECTION (68) DX: 043:HIV INFECTION (3) DX: 043,2HIV-CAUSED IMMUN DISEASE (1) DX 043.3:HIV CAUSING DISEASE NEC (13) DX 0440 HIV WITH ACUTE INFECTION (4) DX 044.9 HIV INFECTION NOS (71) DX W08 ASYMPTOMATIC HIV STATUS (5634)

Female FIG. 6 U.S. Patent Mar. 29, 2011 Sheet 8 of 11 US 7,917,525 B2

Age Distribution

s

s ts scies,

sississ & S 3 is Age it. Years as if Today Top 22ip 3 regions

ric Patiers is

... : : ::::

3:33. 3rtists es; eiser: ,3: ... ck:Fix's is : ...a 53ds is s: sri:"... skiss rise Assis is Fists as U.S. Patent Mar. 29, 2011 Sheet 9 of 11 US 7,917,525 B2

HW Patient Distributior

H. Ragicins 2::1:

its 2 its 'i & S. its ( fries: sy lar is sigs: as: 25.

FIG. 8 U.S. Patent Mar. 29, 2011 Sheet 10 of 11 US 7,917,525 B2

- Figars

: 28; 3. is 73: 3 S. e. g. 3: is a 3: 3

*(iii:S sit.: fast's

s s | r s :

ore sever

: :

: 8.

3 U.S. Patent Mar. 29, 2011 Sheet 11 of 11 US 7,917,525 B2

100

75

50

25

Billing Code X FIG 10 US 7,917,525 B2 1. 2 ANALYZING ADMINISTRATIVE classes of analysis that are impractical or impossible to per HEALTHCARE CLAIMS DATA AND OTHER form on very large data sets, no matter how powerful the DATASOURCES database engine. Administrative Healthcare Claims Data for Clinical Trials This application claims priority to, and incorporates by 5 Clinical trials rely on voluntary participation of study sub reference in its entirety, U.S. Provisional Patent Application jects to evaluate new drugs, medical devices, or other inter Ser. No. 60/742,774 entitled, “Analyzing Administrative ventions. Trials may also be directed to, among other things, Healthcare Claims Data and Other DataSources, which was evaluating procedures for detecting or diagnosing a particular filed on Dec. 6, 2005. disease or finding ways to improve the quality of life for those 10 Suffering from a chronic illness. Trials are usually conducted by researchers associated in some way with a pharmaceutical BACKGROUND OF THE INVENTION company, university, hospital, foundation, or governmental 1. Field of the Invention agency. The present invention relates generally to data mining and A significant challenge in carrying out any clinical trial is 15 recruiting the appropriate number and type of volunteer study analysis. More particularly, it concerns mining and analyzing subjects. Volunteer study subjects are selected so that they medial claims data to, e.g., (a) assist in the identification of meet one or more exclusion or inclusion criteria defined by a clinical investigators and potential trial Subjects for clinical study protocol that has been approved by an ethics review trials or determining feasibility of clinical trials, (b) assist in board. These criteria are aimed at investigating the impact of the identification of medical expert witnesses, medical direc a predefined intervention (e.g., a new drug) on a particular tors, or other medical professionals, (c) assist in the investi patient population (e.g., include only hypertensive patients gation of medical fraud, and (d) assist in various types of and exclude those younger than 18) and thereby characterize marketing. Even more particularly, it concerns improving the the effect of such an intervention on this population. This speed of medical-related data mining and analysis of very stage of the clinical trial patient recruitment—can be costly, large data sets such as administrative healthcare data through 25 for each extra day it takes to identify a pool of Subjects may the creation and use of specialized searching tables (SSTs). It ultimately represent one fewer day a new drug is on the also concerns improving the speed of certain statistical cal market (and protected by a patent or other intellectual prop culations through the creation and use of factorial tables erty). For Some Successful drugs, the cost of delay may having logarithmic entries, making it possible to reliably approach or even Surpass millions of dollars per day. work with very large numbers and data sets. 30 Some have attempted to use administrative healthcare 2. Description of Related Art claims data for the recruitment of subjects for clinical trials. A wealth of information is contained in administrative Services in existence today involve researchers submitting a healthcare claims data. For example, an administrative clinical trial protocol including related inclusion and exclu healthcare claims database may contain information concern sion criteria to a data service. The data service accesses ing, but not limited to, patient identification, physician iden 35 administrative healthcare claims data (often of limited Scope) tification, physician history, prescription drug history, medi in an attempt to estimate the size of a pool of potential study cal examination history, medical diagnosis history, medical subjects and estimate their location. The service, however, billing history, medical cost information, health benefit infor can take upwards of one-month for results to be returned. This mation, medical procedures, etc. time delay comes about, at least partially, due to the large Conventional techniques have been employed to mine at 40 amount of time necessary for the actual data mining and least some of this information. Data mining of healthcare analysis. Because healthcare claims data can involve millions claims data, however, involves a slow, computationally-inten of records, the searching necessary to identify potential study sive process that may return useful results only after hours or Subjects can be very time consuming and can, in some more of computation time. Lengthy search and analysis times instances, representa significant time delay in bringing a drug plague the medical data mining field and discourage many 45 to market. Additionally, the long delay may compound itself from fully utilizing medial claims data for useful applica if researchers discover that a first set of inclusion/exclusion tions. criteria would not yield a large enough potential study subject Administrative Healthcare Claims Data and Statistical Cal pool. When the inclusion/exclusion criteria are modified in an culations attempt to encompass more participants, the researcher may Healthcare organizations and many other organizations 50 be forced to wait another month or longer before knowing if lack the ability to rapidly analyze extremely large data sets the change in criteria will indeed yield an appropriate number (e.g., over a billion claim lines), apply statistical analysis of possible study subjects. protocols, and aggregate output into relevant, actionable Administrative Healthcare Claims Data for Detecting Medi answers for a specific need. cal Fraud When working with very large datasets (like administrative 55 Data mining techniques known in the art have been used in healthcare claims data), it is difficult and time consuming to an attempt to detect abnormalities in billing practices of phy look for patterns that are non-random. Generally speaking, sicians, through analysis of underlying claims data. For the process sometimes involves comparing each record (for example, through claims data, one can attempt to determine example in a claim) against every other record, keeping track whether there are any abnormalities or consistent differences of differences, and then analyzing the differences for patterns. 60 in billing practices that would result in higher payments being AS data sets get larger, there can be an explosion in the directed to the physician in question. number of unique comparisons that need to be made. For Conventional techniques, however, Suffer from the same or example, if one has 10 million records, then adding one similar problems discussed above-namely, lengthy analysis record may mean that there will be 10 million new compari times. Additionally, because of the vast amount of data that sons that need to be made and tracked. When one has 100 65 may be associated with a claims database, traditional tech million records and 1 record is added, there may be up to 100 niques have not been able to take advantage of certain statis million new comparisons to make. As such, there are entire tical techniques that would provide particularly useful infor US 7,917,525 B2 3 4 mation concerning potential fraud. For example, statistical In another respect, the invention involves a computerized techniques that employ the of extremely large num method for identifying potential subjects for a clinical trial. bers are not undertaken at least because the calculations One or more exclusion or inclusion criteria are defined for the would cause “data overflow' errors, or other errors that would clinical trial. One or more specialized searching tables are slow or stop an analysis. 5 pre-generated using administrative healthcare claims data Administrative Healthcare Claims Data for Other Applica and the one or more exclusion or inclusion criteria. The spe tions cialized searching tables are searched. Through the searching Mining administrative healthcare claims data for other step, subjects are identified within the administrative health applications Suffers similar problems concerning long com care claims data who match the one or more exclusion or 10 inclusion criteria. Through the searching step, a geographical putation times and delay. The problems are believed to dis area is identified corresponding to the Subjects who match the courage researchers and others from taking advantage of the one or more exclusion or inclusion criteria. A customized full potential of claims data. report is generated using the identified Subjects and geo The referenced shortcomings of conventional methodolo graphical area. Defining one or more exclusion or inclusion gies mentioned above are not intended to be exhaustive, but 15 criteria may include selecting criteria using a Venn diagram. rather are among many that tend to impair the effectiveness of Defining one or more exclusion or inclusion criteria may previously known techniques concerning data mining and include selecting one or more medical diagnosis codes. Iden aggregated analysis of large amounts of healthcare claims tifying the geographical area may include identifying a Zip data. Other noteworthy problems may also exist; however, code. The customized report may include a map illustrating those mentioned here are sufficient to demonstrate that the Subjects according to location. The method may also include methodology appearing in the art have not been altogether identifying potential clinical investigators for the clinical trial satisfactory and that a significant need exists for the tech through searching of the specialized searching tables and niques described and claimed here. generating a customized report using identified investigators and a corresponding geographical area. One or more investi SUMMARY OF THE INVENTION 25 gator databases may be used to identify the investigators. The method may also include, prior to the generating of the cus Techniques disclosed here may be used to improve data tomized report, defining a minimum subject participation and mining and analysis of administrative healthcare claims data. modifying the one or more exclusion or inclusion criteria if These techniques are applicable to a vast number of applica the number of subjects within the administrative healthcare tions, including but not limited to (a) the identification of 30 claims data who match the one or more exclusion or inclusion potential clinical trial investigators, identification of potential criteria does not meet the minimum subject participation. subject populations for clinical trial participation or analyz Such modifying may be done automatically. Such modifying ing the feasibility of clinical trials, (b) the identification of may be done automatically and iteratively until the minimum medical expert witnesses, medical directors, or other medical Subject participation is met. This technology may be embod professionals, (c) the investigation of medical fraud, and (d) 35 ied on a computer readable medium comprising computer marketing. Medical research applications may also benefit executable instructions that, when executed, carry out the from the techniques of this disclosure. Although focused on techniques described here. administrative healthcare claims data, the same techniques In another respect, the invention involves a computerized can be applied to other types of data. method for recruiting a medical professional. One or more In different embodiments, the techniques of this disclosure 40 exclusion or inclusion criteria are defined for the medical improve the speed of data mining and analysis of administra professional. One or more specialized searching tables are tive healthcare claims data through the creation and use of pre-generated using administrative healthcare claims data specialized searching tables (SSTs). The ability to use certain and the one or more exclusion or inclusion criteria. The spe statistical calculations is provided. Further, those statistical cialized searching tables are searched. Through the searching calculations can be accomplished quickly through the cre 45 step, medical professionals are identified within the admin ation and use of factorial tables including logarithmic entries, istrative healthcare claims data who match the one or more which make it possible to work with very large numbers and exclusion or inclusion criteria. Through the searching step, a data sets. For example, hypergeometric statistical calcula geographical area is identified corresponding to the medical tions can be performed quicker using these tables than by professionals who match the one or more exclusion or inclu traditional techniques. 50 sion criteria. A customized report is generated using the iden In one respect, the invention involves a computerized tified medical professionals and geographical area. Defining method. One or more exclusion or inclusion criteria are one or more exclusion or inclusion criteria may include defined. One or more specialized searching tables are pre selecting criteria using a Venn diagram. Defining one or more generated using the one or more exclusion or inclusion crite exclusion or inclusion criteria may include selecting one or ria. The specialized searching tables are searched. Through 55 more medical diagnosis codes. The medical professionals the searching step, data is identified within a data set that may include physicians being recruited as an expert witness matches the one or more exclusion or inclusion criteria. for litigation. The method may also include determining if Through the searching step, a geographical area is identified one or more of the physicians have previous experience as an corresponding to the data that matches the one or more exclu expert witness, through correlation with one or more expert sion or inclusion criteria. A customized report is generated 60 databases. This technology may be embodied on a computer using the identified data and geographical area. The method readable medium comprising computer executable instruc may also include (a) pre-generating one or more factorial tions that, when executed, carry out the techniques described tables, where the factorial tables include logarithmic entries, here. (b) comparing one or more data records against a plurality of In another respect, the invention involves a computerized other records, and (c) calculating a hypergeometric statistical 65 method for statistical calculations based on administrative result based on the comparing step using the one or more healthcare claims data. Administrative healthcare claims data factorial tables. is searched. One subset of the administrative healthcare US 7,917,525 B2 5 6 claims data is compared against a plurality of other Subsets of from one or more larger tables into a separate table that is then the administrative healthcare claims data. A hypergeometric searched. One SST can act in concert with one or more other statistical result is calculated based on the comparing step SSTs to achieve a search. Searching of SSTs can be done in using one or more pre-generated factorial tables, the factorial parallel, serially, or a combination thereof. In one embodi tables including logarithmic entries. Calculating may include ment, an SST or set of SSTs may be built with or on a FACT one or more calculations using the logarithmic entries fol table using a concatenated index (an index containing several lowed by one or more exponential operations. The method fields and leading with the appropriate field(s)). In such an may also include using the hypergeometric statistical result to embodiment, optimal queries only use the SST index struc detect medical-related fraud. The one subset may include ture and not interact with the FACT table. In this disclosure, medical coding data associated with a first physician and the 10 plurality of other Subsets may include medical coding data SSTs may also be referred to as “packed” tables. associated with a plurality of other physicians. The plurality As used in this disclosure, “administrative healthcare of other physicians may be selected to be within the same claims data or “healthcare data is used according to its specialty as the first physician. The method may also include ordinary meaning in the art and should be interpreted to generating a customized report comparing the first physician 15 include, at least, data organized electronically that is search versus the plurality of other physicians. The customized able via computer algorithm and which contains records asso report may include a graph of utilization percentage versus ciated with one or more medical procedures, prescriptions, medical code for the first physician and the plurality of other diagnoses, medical devices, etc. physicians. The method may also include using the hypergeo As used in this disclosure, “match' in the context of a metric statistical result to rate one physician versus other search should be interpreted to include exact matches as well physicians. The method may also include using the hypergeo as Substantial matches or matches set up with a pre-defined metric statistical result to identify potential subjects for a tolerance. clinical trial. The method may also include using the hyper As used in this disclosure the term, "customized report' geometric statistical result to recruit a medical professional means an output (hard-copy or soft-copy) that is individually for use as an expert witness for litigation. This technology 25 tailored for the user (e.g., person or entity) through the inclu may be embodied on a computer readable medium compris sion of a result or result Summary prompted through user ing computer executable instructions that, when executed, input. A customized report need not be unique to a user. carry out the techniques described here. As used in this disclosure the term, “minimum subject In another respect, the invention involves a computerized participation' is any quantitative measure of a minimum level method, in which one or more specialized searching tables are 30 of participation such as Subject total or subject density. pre-generated using administrative healthcare claims data. As used in this disclosure the term, “factorial table' is an One or more factorial tables are pre-generated, the factorial indexed data table whose entries include factorial values for tables including logarithmic entries. The specialized search one or more numbers. In a preferred embodiment, a factorial ing tables are searched. Through the searching step, one or table is an indexed data table whose entries include logarith more records are identified within the administrative health 35 mic representations of factorial values for one or more num care claims data that matches one or more search criteria. The bers. one or more records are compared against a plurality of other The term “code keys, as used herein, represents any records of the administrative healthcare claims data. A hyper desired searchable attribute. In one embodiment, “code keys' geometric statistical result is calculated based on the compar may represent diagnosis codes, prescription codes, procedure ing step using the one or more factorial tables. A customized 40 codes, or medical device codes. report is generated using the one or more records and the The terms 'a' and “an are defined as one or more unless statistical result. The one or more search criteria may include this disclosure explicitly requires otherwise. one or more exclusion or inclusion criteria selected using a The term “approximately' and its variations are defined as Venn diagram. The calculating may include one or more being close to as understood by one of ordinary skill in the art. calculations using the logarithmic entries followed by one or 45 In one non-limiting embodiment the terms are defined to be more exponential operations. This technology may be within 10%, preferably within 5%, more preferably within embodied on a computer readable medium comprising com 1%, and most preferably within 0.5%. The term “substan puter executable instructions that, when executed, carry out tially' and its variations are defined as being largely but not the techniques described here. necessarily wholly what is specified as understood by one of As used in this disclosure, an “inclusion criteria' means a 50 ordinary skill in the art. In one non-limiting embodiment the parameter that aims at including certain data in search results. terms refer to ranges within 10%, preferably within 5%, more An 'exclusion criteria' aims to exclude certain data in search preferably within 1%, and most preferably within 0.5% of results. Inclusion and exclusion criteria are relative terms— what is specified. an inclusion criteria may by necessity exclude some data and The terms “comprise' (and any form of comprise, Such as Vice-versa. In general, an exclusion or inclusion criteria is 55 “comprises” and “comprising”), “have (and any form of simply a searching parameter. Specifically, exclusion or have, such as “has and “having), “include’ (and any form of inclusion criteria can be any parameters that define a search include, such as “includes” and “including”) and “contain' and operate to filter or potentially filter data. (and any form of contain, such as "contains and “contain As used in this disclosure the term, “pre-generate” means ing') are open-ended linking verbs. As a result, a method or to generate prior to any searching step. 60 device that “comprises.” “has.” “includes” or “contains” one As used in this disclosure the term, “Specialized Searching or more steps or elements possesses those one or more steps Table' or “SST means a custom, indexed data table orga or elements, but is not limited to possessing only those one or nized according to predefined exclusion or inclusion criteria, more elements. Likewise, a step of a method or an element of the indexed table populated with a subset of information from a device that “comprises.”“has “includes” or “contains one one or more larger tables. The SST is designed to optimize or 65 or more features possesses those one or more features, but is speed the searching of data, at the expense of added disk space not limited to possessing only those one or more features. or other memory, for it reproduces a subset of information Furthermore, a device or structure that is configured in a US 7,917,525 B2 7 8 certain way is configured in at least that way, but may also be Defining Exclusion/Inclusion Criteria configured in ways that are not listed. In step 102 of FIG. 1, one or more exclusion or inclusion The term “coupled as used herein, is defined as con criteria are defined. In a preferred embodiment, the exclusion nected, although not necessarily directly, and not necessarily or inclusion criteria are designed to correspond to criteria for mechanically. a clinical trial or other application such as recruiting a medical Other features and advantages will become apparent with professional and may include, but are not limited to, desired reference to the following detailed description of specific, characteristics of a person (e.g., age, gender, etc.), a targeted example embodiments in connection with the accompanying health condition (e.g., possessing a certain diagnosis or being drawings. associated with a medical diagnosis code, etc.), oran employ 10 ment characteristic (e.g., medical specialty, etc.). In a pre BRIEF DESCRIPTION OF THE DRAWINGS ferred embodiment, exclusion or inclusion criteria are defined through direct input from a user. In other embodiments, cri The following drawings form part of the present specifica teria may be input from other software (e.g., a parameter may tion and are included to further demonstrate certain aspects of be generated in Software and output for use in a search). In the present invention. The drawings do not limit the invention 15 still other embodiments, criteria may be pre-stored and but simply offer examples. loaded or otherwise accessed. FIG. 1 is a flowchart showing a computerized method for FIGS. 4A-4B illustrate some possible ways in which exclu identifying clinical trial investigators and potential Subject sion or inclusion criteria may be defined via a computer populations for a clinical trial, for recruiting a medical pro software interface, with direct user input. FIG. 4A presents a fessional, or for evaluating the feasibility of a clinical trial, in “wizard' environment, in whichauser is asked to enterexclu accordance with embodiments of the invention. The steps of sion/inclusion criteria by typing one or more parameters FIG. 1 can also be used for other applications. along with a respective operator and value. FIG. 2 is a flowchart showing a computerized method for Criteria may be keywords specifically recognized by soft statistical calculations based on administrative healthcare ware, as shown in FIG. 4A, in which software recognizes the claims data, in accordance with embodiments of the inven 25 terms “Gender,” “Condition, and Age.” In the illustrated tion. The steps of FIG. 2 can also be used for other applica embodiment, "Gender” refers to male/female, “Condition' tions. refers to a recognized medical condition, and Age” refers to FIG.3 is a schematic diagram of a computer system includ the age of a person. In other embodiments, Suitable criteria ing a computer readable medium Suitable for carrying out (and optionally, here, medical conditions) may be chosen techniques of this disclosure, in accordance with embodi 30 through a “pull down” or “drop down list or other mecha ments of the invention. nism known in the art. For example, one may be presented FIGS. 4A-4C are schematic diagrams of a computer soft with a link stating, "List possible criteria.” Clicking this link ware interface Suitable for carrying out techniques of this would allow the user to view a list of criteria along with an disclosure, in accordance with embodiments of the invention. explanation about each. In different embodiments, the list of FIG. 5 is a list of example indications that can make up a 35 criteria may be modifiable by the user, depending on appli administrative healthcare claims data search criteria, in accor cation and depending on the underlying data being searched. dance with embodiments of the invention. For example, if access is provided to a database that has one FIG. 6 is a schematic diagram illustrating how one or more or more new fields available for search, one may want to add exclusion or inclusion criteria can be selected using a Venn additional search criteria based on those fields. In the embodi diagram, in accordance with embodiments of the invention. 40 ment of FIG. 4A, software allows one to enter up to five FIG. 7 is a schematic diagram illustrating example custom criteria. In other embodiments, more or fewer may be pro ized reports, in accordance with embodiments of the inven vided. In other embodiments, the user may specify the num tion. ber of criteria desired for a particular application. In one FIGS. 8-9 are map-based customized reports, in accor embodiment, entire groups of criteria may be defined by a dance with embodiments of the invention. 45 "one-click” or other shortcut manner by, e.g., providing a FIG. 10 is a customized report directed at fraud detection, button or menu that allows a user to define groups of criteria in accordance with embodiments of the invention. by shortcut name, or through reference to previously-used or saved groups of criteria. If a parameter is not recognized, an DESCRIPTION OF ILLUSTRATIVE appropriate alert or error message may be generated. EMBODIMENTS 50 Operators suitable for use in the illustrated embodiment of FIG. 4A include, but are not limited to, an equal sign (), the Embodiments of this disclosure allow for the computerized greater-than sign (C), and the less-than sign (<). These opera identification of clinical trial investigators and potential Sub tors act on, or modify the “value.” In other embodiments, the ject populations for a clinical trial, the computerized identi operator can be any mathematical or logical operator known fication of medical professionals (e.g., as an expert witness 55 in the art to assist in searching. For example, standard or for litigation, as a medical director for large hospitals), deter customized Boolean-type operations may be permitted. As mining the feasibility of a clinical trial, marketing, and other shown in FIG. 4A, the operator for the first and second criteria purposes. Embodiments of this disclosure also allow for the is the equal sign (), and the operator for the final parameter improved calculation of statistical results using, e.g., pre is the less-than-or-equal-to combination (<). If an operator generated tables and transforming factorial tables into their 60 is not recognized, an appropriate alert or error message may logarithmic equivalent. The statistical results can be used to be generated. further efforts for recruiting, marketing, and other applica The “value.’ acting with the operator and criteria, estab tions. lishes what search is to be performed. In FIG. 4A, the first, Turning first to FIG. 1, an example method 100 is shown second and thirds values are, respectively, Female, Diabetes, for identifying clinical investigators and potential Subject 65 and 35. Accordingly, the type of search the user may be populations for a clinical trial, determining the feasibility of a interested in involves people who are female, are associated clinical trial, or recruiting a medical professional. with the Diabetes medical condition, and who are 35 years old US 7,917,525 B2 10 or younger. In one embodiment, values are entered directly by different exclusion or inclusion criteria combinations are a user. In other embodiments, values may be pre-stored and considered. If more than one box is checked, search output loaded for use, selected individually or in groups, or other may be arranged or formatted to indicate the search results wise entered. FIG. 5 is a list of example indications that can corresponding to each check box. act as a value for a criteria such as “Condition, which is FIG. 6 illustrates another Venn diagram that can be used to shown in FIG. 4A. More or fewer indications could be used in assist in setting up exclusion or inclusion criteria. There, the other embodiments. criteria involved are similar to those of FIG. 4B they are FIG. 6 indicates another example method by which one HIV, age greater than 45, and Female. Five of the seven may define a value for a criteria—here, through the use of possible combinations of criteria are numbered. In FIG. 6, accepted medical codes or medical diagnosis codes. Such as 10 IDC9 codes. Specifically, to establish a value for a criteria different medical codes associated with HIV are given at left such as “Condition, one may enter one or more IDC9 codes (e.g., code 042 for HIV). In FIG. 6, the numbers in parenthesis to help identify what condition is of interest. In the embodi (e.g., 23718) are counts in the searched member population ment of FIG. 6, eight different IDC9 diagnosis codes are that have the particular DX/RX/Px. being used to define an HIV exclusion or inclusion criteria, 15 In one embodiment, the exclusion or inclusion criteria may which is represented by the upper left circle of the Venn be chosen to satisfy conditions of a clinical trial so that one diagram of FIG. 6. In different embodiments, software may may recruit Subjects (e.g., so that one may, through the help users look-up appropriate codes to aid in the searching searching process, identify patients who would meet the clini process. For example, if a user wants a search to involve cal trial criteria). For example, if a researcher is recruiting hepatitis, one may look-up all medical diagnosis codes per patients for a drug study and desires Volunteer patients over tinent to all forms of hepatitis. The user may then select from the age of 40 who have asthma but who are not taking a the look-up results to define exclusion or inclusion criteria, particular class of blood-pressure medications, those criteria and particularly, values for condition-related criteria. may be entered. A Venn diagram or other technique may be used to help the In one embodiment, the criteria may be model criteria, user define or visualize exclusion or inclusion criteria. FIG. 25 chosen by the researcher simply to see if there would be a 4B illustrates using a Venn diagram for this purpose. The suitable subject pool if the model criteria were, in fact, actual Venn diagram of FIG. 4B is used in conjunction with defining requirements. In other words, criteria may be set up to model exclusion or inclusion criteria and appears in a “wizard' a clinical trial for potential subject identification. Such mod screen that is called once the user selects “Next after setting eling may be used to provide a list of potential Suggestions up criteria, operators, and values in FIG. 4A. The Venn dia 30 that could be implemented to meet clinical trial enrollment gram of FIG. 4B includes three circles, corresponding to the targets. Such modeling, discussed more below, may also criteria previously defined in FIG. 4A and an additional two allow a user to check on whether an investigator's enrollment studies or groups. In this embodiment, each complete list of predictions seem reasonable as well as provide temporal and inclusion/exclusion criteria creates one Venn diagram. The geographic data on targeted enrollment. Additionally, mod Venn diagrams allow users to overlap multiple studies or 35 eling may allow a user to evaluate whether, based on patient groups. Thus, the first circle in FIG. 4B corresponds to the base attrition, an investigator is likely to retain study trial entire set of criterialisted in FIG. 4A Females with Diabe Subjects. tes age 35 or younger. The second circle in FIG. 4B would In another embodiment, the exclusion or inclusion criteria correspond with another study—for example, people with may be chosen to satisfy job conditions so that one may hypertension and hepatitis. The third circle in FIG. 4B would 40 recruit a medical professional. For example, one may define correspond with yet another study, trial, or protocol for exclusion or inclusion criteria to find a Suitable medical example, children under the age of 12 who have received expert witness for litigation. If litigation involves esophagus gamma. The Venn capability provides the ability to identify injuries associated with screws backing out from an anterior clinical investigators and potential patient populations who cervical plate, one could define exclusion or inclusion criteria reside in the intersection of these three separate studies. The 45 designed to locate a Surgeon who has performed over 100 result may therefore be a list of providers who treat one or cervical plate procedures during the past five years. If one more patients who are female with diabetes who have hyper believes that a female expert would “connect more with the tension and have had hepatitis who are under 12 and have jury, one could define a Gender criteria to be equal to Female. received a gamma shot. This ability allows users to establish If one believes that the expert witness should be from Texas, completely separate protocols for different drugs and to com 50 one could set a Medical School criteria to be equal to one or bine protocols in the future for new drugs and/or different more Texas schools. If one believes that an expert in the 45-65 indications (potentially identifying off label use, etc.) age range would have the most credibility, an age criteria Of course, a different number of criteria would lead to a could be entered accordingly. In the same manner, one could different Venn diagram, with different labels. The Venn dia tailor a search according to any desire, and limited only by the gram allows the user to tailor a search according to any of the 55 underlying data being searched. As with the clinical trial exclusion or inclusion criteria alone or in any combination recruitment embodiment, one may define exclusion or inclu with other exclusion or inclusion criteria. In the illustrated sion criteria to simply satisfy different “what if scenarios— embodiment of FIG. 4B, there are seven different possibilities for example, “what if I was looking for a male expert wit for searching, represented by the seven different checkboxes ness, age 52-55, who went to Baylor College of Medicine, presented to a user. Here, by way of example only, the user has 60 and who has done over 400 cervical plate procedures—how selected a search aimed at uncovering data that satisfies the many such people could I possibly identify? If the answer is Gender criteria (i.e., Female), the “Condition' criteria (i.e., Zero or extremely low, one may realize that expectations need HIV), and the Age criteria (i.e., 35 years old or younger). Had to be modified. the user wished to search a different combination, he or she In another embodiment, the exclusion or inclusion criteria could have checked a different box. Additionally, a user may 65 may be chosen to satisfy job conditions so that one may wish to chose more than one box to determine, e.g., the recruit a medical professor, executive, researcher, etc. For difference in the number of search “hits” that would result if example, one may define exclusion or inclusion criteria to US 7,917,525 B2 11 12 find an executive with particular experience as a physician The SSTs of this disclosure can overcome a number of working with certain conditions. performance obstacles in both the gathering and the process These examples illustrate that it may be beneficial to com ing of large data sets for, e.g., real time statistical probability bine data from one database with that of others so that addi analysis (a.k.a. signal detection). This method can also con tional criteria may be defined and used for various applica tain temporal information (e.g., service date) to define tions. For example, in the clinical trial recruitment encounters in the data set. Standard warehouse structures applications, it may be beneficial to use information that hold vast amounts of data and allow access to specific records identifies physicians as being past investigators for clinical via bitmapped indexes. However, when large population sets trials so that one may identify not only Volunteer study Sub are desired, the shear number of disk seeks required via the jects, but also appropriate physicians with experience with 10 bitmapped indexes becomes prohibitive for real-time pro cessing. A solution provided herein is to use SSTs (or stan trials. This may be accomplished by linking administrative dard tables loaded and indexed in a particular way) where healthcare claims databases with, for instance, an FDA-re each block of each table is rich with the desired information. lated database. Additionally, one may identify medical pro In other words, if one is searching for possible patients who fessionals who have testified at trial or deposition by corre 15 have had at least one of a set of 10 diabetic medical codes, the lating a physician match from a administrative healthcare user may be directed to a table which contains rowspacked by claims database with a database that keeps track of expert codes. Each physical block read (disk seek) may contain witness experience. hundreds of the desired individuals, whereas in a standard Pre-Generating Specialized Searching Tables warehouse a block read will certainly hold at least one desired In step 104 of FIG. 1, specialized searching tables (SSTs) individual but likely, at most, only a few as dictated by are pre-generated. The SSTs of the present disclosure offer chance. significant benefits in the area of administrative healthcare The SSTs of this disclosure can also overcome statistical claims data searching as well as other fields at least because of processing challenges present in traditional data mining the marked improvement in searching speed and any associ operations. Statistical processing challenges can be encoun ated analysis—albeit at the expense of using more disk space 25 tered once individuals “in play' are ferreted out. For example, (or other computer memory) and the time associated with if checking for drug safety signals, each attribute of each pre-generating the tables themselves, which may be done at individual “in play' must be accessed. Also, each outcome for off-peak times, if desired. In one embodiment, over 18 mil these individuals must be accessed. These two sets may then lion patient healthcare claims histories, resulting in over 410 be permeated against each other, one individual at a time. This million records, may be “packed' into SSTs to greatly 30 process is repeated for each “in play' individual, and the improve data mining and analysis. cumulative set is then aggregated for each outcome for each In one embodiment, SSTs may be pre-generated and used base condition, for each drug in the pairing. For weak filters as follows. In this example, “code keys' represent any desired (filters than don't appreciably narrow the population) this can searchable attribute including, but not limited to, Diagnosis be a time consuming process. To alleviate this, and according codes, Prescription codes, Procedure Codes, etc. In this 35 to embodiments of this disclosure, all possible permutation example, temporal information may also be utilized (e.g., sets may be pre-generated and “packed' by individual IDs service date) to define encounters in the data set. Those hav into an SST. The “in play' population may then be extracted ing ordinary skill in the art, having the benefit of this disclo from the pre-generated SST for aggregation. Sure, will recognize that other types of information may be Techniques of this disclosure may be advantageously included for SSTs, according to need. The steps below rep 40 applied to a wide variety of "raw data” to be searched, a resent an example only. preferred embodiment involving administrative healthcare Creating SSTs claims data. For example, SSTs can act on virtually any data 1. CREATE TABLE PACKED TABLE as SELECT or to improve searching and analysis, and particularly adminis INSERT /*--append/ ... select distinct code key, sex, birth trative healthcare claims data, regardless of the format and date, geographic region, individual id from fact table of other 45 size of the data being mined. In one embodiment, adminis large table . . . ORDER BY code key, sex, birth date, geo trative healthcare claims data may be housed on computer graphic region, individual id setting appropriate block servers or other storage devices at one physical location, parameters to 0 space for updates while in other embodiments, the data may be dispersed about 2. Index the PACKED TABLE by code key many locations. The data may be accessible via network. The a. alternate 1: create concatenated index on large table 50 data may be in one or more different formats or layouts. leading with code key and containing all required fields. Advantageously, the techniques of this disclosure can lay on b. Alternate 2: create stand alone index organized table top of virtually any data, and the data can be linked together leading with code key and containing all required fields. from a variety of sources. Because of inherent TCP transport Access SSTs delays, when dealing with large amounts of data spread 1. Seti 1 is SELECT sex, birth date, geographic region, 55 across multiple platforms, there is a performance advantage individual id from PACKED TABLE where code key in to ensuring that each platforms SST data set be selfcontained (code key 1a, code key2a,etc) in the sense that only aggregated values are passed to an 2. setti 2 is SELECT sex, birth date, geographic region, application server, client or master database server. SSTs individual id from PACKED TABLE where code key in have significant performance benefits whenever data sets are (code key lb.code key2b.etc) 60 primarily accessed by a non-unique field. 3. seti N is SELECT sex, birth date, geographic region, SSTs can be updated in a number of ways. In one embodi individual id from PACKED TABLE where code key in ment, this activity may be done off hours. Updating may be (code key 1X.code key2X,etc) done as follows, which are examples only: 4. The sets can then be combined via INTERSECT, 1. Completely reprocess the SST from a FACT table each UNION, MINUS, etc. to yield results corresponding to any/ 65 load cycle. all Venn region(s). Patient demographic Summaries can also 2. INSERT /*+APPEND*/ the new data into the existing be calculated rapidly without requiring joins. table in the desired order. This will not “pack” as tightly as a US 7,917,525 B2 13 14 complete reprocess but will still have most of the perfor Demographics of the Patient Population that has Participated mance advantages provided by this disclosure. in all Three Code Sets Even if They Did not Happen in the Significant storage benefits can be reaped if the database in Same Encounter: use allows field compression on leading and even non-leading index fields. Depending on the packing method used, indexes may be dropped prior to loading and re-created post-load. Or, SSTF1 code key, birth date, geo region, sex, individual id indexes may be allowed to grow during the load process. Ordered by code key indexed by code key However, repeated loads (and deletions) can have detrimental --tally by gender breakout, age and region are similar effects on index efficiency. Select sex, count(unique individual id) unique patients from 10 ( Searching the SSTs Select birth date, geo region, sex, individual id from SST#1 In step 106 of FIG. 1, the SSTs are searched. The searching where dx IN (Dx1,.DX2,...) INTERSECT step involves the use of the SSTs to filter the underlying Select birth date, geo region, sex, individual id from SST#1 administrative healthcare claims data (or other data being where pX IN (Px1,Px2,...) searched) according to the exclusion or inclusion criteria 15 INTERSECT defined by the user, and which may be associated with a Venn Select birth date, geo region, sex, individual id from SST#1 diagram or other tool. The searching step itselfmay be carried where pX INRX1.RX2,...) out according to techniques known by those of ordinary skill ) group by sex in the art. In one embodiment, searching may be carried out using the Note: when available, temporary holding structures (e.g., techniques of the following example. In this example, one global temp tables, WITH temp AS, etc.) can avoid multi might be interested in determining how many physicians have Sourcing. Analytic functions can also be used. treated a specific set of diagnoses in a certain way during an Example of Multi-Sourcing from a Single Temp Structure encounter (with a specific class of drugs and specific set of procedures) and how many patients each physician has 25 WITH temp AS treated in that way, total encounters by physician. This or a similar embodiment may be framed as shown in the following non-limiting scenarios: ( 1. One class of search only requires that the patient have SELECT *-- first rows 30 had certain DX. PX and RX codes during an interval regardless FROM (SELECT /*-i- index(a)*/ of the intervals between the codes (e.g., in the last year which individual id patients have had procedure “A” and drug “B”). FROM MASTER R1D a WHERE a.code key IN 2. Another class of search imposes temporal restrictions on (DX8648,571, the order and interval between the DX, PX and RX codes. A 35 DX864857481, temporal example might be: A novel treatment approach for DX864857491, DX8648575O1, disease “X” (coded as x1, x2, or X3) is procedure “Y” (coded DX864857511, asy 1 or y2) and drug “Z” (coded as Z1 or Z2). One may define DX864857521, this novel treatment as belonging to an “encounter, which DX864857531, may require the diagnosis “X” to occur on or before “Y” and 40 DX86485753481, DX86485753491, “Z” and furthermore, “Y” must take place on at most 1 day DX864857541, after “X” and “Z” must be filled on or at most 2 days after DX864857551, “X”. Now, one may find all patients who have had disease “X” DX86485755481, DX86485755491, in any of its x1, x2 or X3 forms who were treated with proce DX864857561, dure “Y” informs y1 ory2 within 1 day and filled drug “Z” in 45 DX86485756481, forms Z1 or Z2 within 2 days. DX86485756491, DX864857571, 3. This logic can be further extended into an “episode' DX86485757481, where the procedure “Y” might have a much longer interval DX86485757491, of treatments and even numerous treatments over this longer s 50 ) interval (same with drug “Z”). INTERSECT Regardless of the logic, an output common to all three may SELECT *-i- index(a)*/ be, in one embodiment: Count the unique patients that fit this individual id FROM MASTER R1D a logic, identify and count the providers that treat patients this WHERE a.code key IN way; which providers do this treatment the most often, and 55 (PX48485354541, which providers do not treat patients this way. PX51515349481, One might also be interested in the demographics of the PX51515349491, PX515153495.01, patient population that has participated in three code sets even PX51515349511, if they did not happen in the same encounter. In this example PX51515349521, one may create two SST structures (this could be done in one 60 PX51515349541, PX51515349551, denormalized SST with a moderate performance hit due to PX51515349561, increased row length). PX51515349571, Some fields in this example are shown with “natural val PX51515350491, PX51515350501, ues although it is generally desirable to use Surrogate keys of PX51515350511, the smallest possible length for the desired criteria. After 65 PX51515351511, aggregation, the Surrogate keys may be joined to descriptive PX51515351521, fields. US 7,917,525 B2 16 -continued Determining how Many Physicians have Treated a Specific Set of Diagnosis in a Certain Way During an Encounter (with PX51515351531, a Specific Class of Drugs and Specific Set of Procedures) and PX51515351541, how Many Patients. Each Physician has Treated in that Way, ) 5 Total Encounters by Physician INTERSECT SELECT * + index(a)*/ individual id FROM MASTER R1D a WHERE a.code key IN (RX-ZITHROMAX, )) a, SSTH2 code key, individual id., encounter date, provider key INDIVIDUAL ID R1Db 10 ordered by code key indexed by code key WHERE a.individual id = b.individual id) Select provider key.count(unique individual id) SELECT-1 ord, Male ATTRIBUTE, SUM (CASE unique patients,count(unique individual idencounter date) WHEN gender = “M total encounters from THEN 1 ELSEO Select individual id, encounter date, provider key from SST#2 where dx IN (Dx1,.DX2,...) END) counts 15 FROM temp INTERSECT UNIONALL Select individual id, encounter date, provider key from SST#2 SELECT-2 ord, Female ATTRIBUTE, where pX IN (Px1,Px2,...) SUM (CASE INTERSECT WHEN gender = F Select individual id, encounter date, provider key from SST#2 THEN 1 where pX INRX1.RX2,...) ELSE O ) group by provider key END) counts FROM temp UNIONALL In one embodiment, SSTs may also work together. Con SELECT * sidera query where one wants to know the ensuing “play-out” FROM (SELECT rn, rn chr ATTRIBUTE, counts after an individual has had a particular Diagnosis, procedure FROM (SELECT TO CHAR (TRUNC (SYSDATE-dob) / 25 365.25)) ATTRIBUTE, or drug: COUNT Table SSTH3 is code key, individual id, encounter date (UNIQUE CASE WHEN gender = “M ordered by code key, individual id and indexed by code key THEN individual id Table SSTH4 is individual idencounter date.code key ELSE NULL 30 ordered by individual id, encounter date and indexed by END individual id ) counts FROM temp Then the combined query becomes (one of ordinary skill in GROUP BY TO CHAR (TRUNC (SYSDATE-dob) / the art will recognize that they are many ways to write the 365.25))) a, YEARS b SQL): WHERE brn chr=a.ATTRIBUTE(+) 35 ORDER BY brn ASC) Select SSTH4.* from SST#4,(Select individual idmin(en UNIONALL counter date) first encounter from SSTH3 group by indi SELECT * vidual id) iv FROM (SELECT rn + 500, rn chr ATTRIBUTE, counts where SSTH4.individual id iv.individual id and FROM (SELECT TO CHAR (TRUNC (SYSDATE-dob) / 365.25)) ATTRIBUTE, SSTH4.encounter date iv. first encounter COUNT 40 Searching times of course depend on the amount of data (UNIQUE CASE involved. However, using SSTs can dramatically cut search WHEN gender = “F ing time to a degree where once-impossible or impracticable THEN individual id ELSE NULL tasks can be completed—e.g., identification of clinical inves END tigators and potential Subject populations for clinical trials in ) counts 45 a quick enough manner so that exclusion or inclusion criteria FROM temp can be modified “on the fly in an attempt to establish appro GROUP BY TO CHAR (TRUNC (SYSDATE-dob) / priate protocols with a feasible subject pool. 365.25))) a, YEARS b WHERE brn chr=a.ATTRIBUTE(+) To highlight performance difference versus existing ware ORDER BY brn ASC) house techniques, consider the following example query: UNIONALL 50 SELECT * FROM (SELECT 1000 rn, ZIP-3: Select sex, count(unique individual id) unique patients from | SUBSTR (zipcode, 1, 3) ( | Select birth date, geo region, sex, individual id from SST#1 || TRIM (city) 55 where dx IN (Dx1,.DX2,...) || ". INTERSECT | state ATTRIBUTE, Select birth date, geo region, sex, individual id from SST#1 COUNT (UNIQUE individual id) counts where pX IN (Px1,Px2,...) FROM temp, ZIP3 INTERSECT WHERE SUBSTR (zipcode, 1,3) = ZIP3 Select birth date, geo region, sex, individual id from SST#1 GROUP BY 1, 60 where pX IN (RX1.RX2,...) ZIP-3: ) group by sex | SUBSTR (zipcode, 1, 3) | || TRIM (city) || ". To accommodate itemized costs by procedure and the fact | state) that hundreds of procedures may sometimes define an ORDER BY counts DESC); 65 encounter, administrative healthcare claim tables generally contain a single procedure per line with other fields holding diagnosis codes or pointers to a table listing the diagnosis US 7,917,525 B2 17 18 codes associated with the encounter. Prescription data is fied to identify even more potential clinical investigators and often, but not always, held in a separate table. Bitmapped study Subject populations. In another embodiment, one or indexes are not tailored to this class of problem because they more medical professionals are identified. For example, a cannot be directly merged/ANDed to narrow the output set potential medical expert witness, professor, researcher, etc. prior to table access because the codes are not required to be may be identified. resident on the same line, only on the same patient. Building In one embodiment, information regarding one or more a claim table holding every triplet permutation over the cov geographic regions is also identified in response to a search. ered patient interval could leverage bitmapped index power In addition to information that identifies a potential clinical but would be cause the table to be prohibitively large for investigator, study Subject, or medical professional, the indi 10 vidual's detailed location may be also be identified. The iden anything but a small Subset of the population. Likewise tification of geographic information may be done automati “X-walking columns into a single table is not practical when cally, with or without the user setting up a geographical hundreds of codes are possible in an covered period. search parameter. For example, if the only exclusion or inclu So, for a generic claims table containing sion criteria specified by a user involved the age of a patient, Individual id, Px1, DX1, DX2, DX2, etc. 15 a search may nevertheless return information not only iden And pharmacy table containing tifying patients matching the age limitation, but also identi Individual id., RX1, etc fying a general geographical region where those patients live. The warehouse version of the query becomes: Such information may be pulled from a claims database or Select p.sex.count(unique individual id) unique patients other data source being searched. As described more below, from the geographic area information may be used advantageously ( to present search results in map-format. Select individual id from big claim table where pX1 in In one embodiment, identification of individuals through (Px1,.Px2, . . . )—we must pull this set independently since searching is done without revealing any sensitive or protected they may not occur on the same line?encounter as the DX codes material. In other words, the techniques of this disclosure may 25 be used in a manner that would not violate privacy rules or INTERSECT laws (e.g., HIPAA regulations). Searching can take place on Select individual id from big claim table where (dx1 IN data that has been "de-identified to remove reference to, e.g., (Dx1,.Dx2, . . . ) OR dx2 in (Dx1,.Dx2, . . . ) OR dx3 in patient names and social security numbers. Alternatively, (DX1.Dx2, ...) INTERSECT searching can take place on original data and then de-identi 30 fied so that privacy guidelines are met. Techniques known in Select individual id from big pharmacy table where rX1 the art may be used for the de-identification process. in (RX1.RX2,...) Additional example situations involving the identification ) iv patient lup where iv.individual id=p.individual id of information through searching, and particularly through group by sex searching using SSTs are provided below. Those having ordi This query run on modest code sets compares as follows: 35 nary skill in the art will recognize that the techniques of this disclosure can be used to identify information for a multitude of other applications, encompassed by the claims. Data Structure Physical I/Os Time to run Generate Customized Report In step 110 of FIG. 1, a customized report is generated. The Warehouse 4OOOOO 1000 seconds 40 arrangement and format of the customized report may be SST 3300 10 seconds dictated by the underlying application and desires of the user. In one embodiment, however, one may wish to choose Or 100 times faster. between textual and map-based reports. In FIG. 4C, such a Identifying Patients, Professionals, Geography, and Other choice is presented to a user through the “wizard' interface Results 45 discussed previously. Here, a user can check for a text report In step 108 of FIG. 1, searching yields an identification of or a map. data matching the exclusion or inclusion criteria previously A text report may be set up to show the exclusion or defined. Any data fields, portions of data fields, or combina inclusion criteria at issue, the search results themselves in tions of data fields may be identified in response to the search. columnar or other convenient format, and other information In one embodiment, the identification of data comes about 50 (automatically generated or chosen by the user) that may be through an exact match with search criteria. In other embodi pertinent to the analysis. Text reports need not be text-only. A ments, a search "hit' may result if there is an approximate or text report can include graphics in the form of pictures, Substantial match. For instance, if an exclusion or inclusion graphs, charts, or the like. A customized report may, if criteria seeks physicians who have treated 200 patients hav desired, be entirely graphical. Reports may be electronic ing a particular diagnosis, one embodiment would return 55 (e.g., on a computer Screen) and may include video clips, information for physicians who have treated, e.g., 195 animations, or the like. patients. Non-exact matches Such as these can be indicated FIG. 7 shows example customized reports, which are accordingly on a customized report. In one embodiment, the mixed text and graphics. In the upper left, a graph shows the tolerance required to constitute a match may be defined by the number of unique patients (e.g., potential Subjects for a clini user, and the tolerance may be different for different search 60 cal trial matching pre-defined exclusion or inclusion criteria) ing criteria. Versus age in years. A quick glance at the graph would reveal In one embodiment, patients are identified who may be to a user that potential Subject populations are centered suitable candidates for a clinical trial, for which clinical trial around an age of approximately 42. At the middle-right of exclusion or inclusion criteria were defined. In another FIG. 7 is a graph showing the gender split for the identified embodiment, patients are identified for a clinical trial model, 65 patients. Here, over 80% of potential trial subjects are male. in order to determine the number of patients and in order to At the lower left of FIG. 7, geographical information is pre determine if exclusion or inclusion criteria should be modi sented, but not in the form of a map. Unique Subjects are US 7,917,525 B2 19 20 charted as a function of their 3-digit zip code, which reveals there are approximately 2,100,000*0.0125–26,250 adults in that New York, N.Y. may be a suitable site for clinical inves City A with HIV-related diagnoses. These types of calcula tigator and trial Subject recruitment. tions may be used to effectively normalize data among cities A map-based report takes advantage of geographical infor with vastly different populations—i.e., having 200“hits” in a mation pulled from the underlying data and may advanta large city may actually indicate that it would be a more geously provide a convenient mechanism for a user to quickly difficult recruiting region than a significantly Smaller city determine what area of the country would be a suitable site for having the same number of hits. Using data that shows city a clinical trial, for a job fair, etc. With geographic information size, one may readily arrive at density or other metrics, which accessible, one may study geographic propagation associated would indicate the number of patients per square mile, etc. Of with one or more criteria. For example, map based reports 10 course. Such techniques are not limited to the identification of may be "put in motion” by employing Successive map frames. clinical investigators and potential trial Subjects. They may be This may essentially create a “Doppler like' view of the applied to any application discussed here or recognized by propagation of disease, test Subjects, etc. over time. Intensity those having ordinary skill in the art. of the subjects area can be conveyed thematically by color, 15 Turning now to FIG. 2, an example method 200 is shown shading, object size, height, etc. The time period can be for handling statistical calculations or operations, which can aggregated by any desired interval to visibly accent seasonal be based on administrative healthcare claims data or other variations or long term trends. FIGS. 8-9 are examples of data. In preferred embodiments, the statistical calculations or map-based, customized reports. In FIG. 8, a United States operations can be used in conjunction with the techniques map is shown, with an HIV patient distribution as an overlay. HIV patient totals are given as a function of region, and illustrated in FIG. 1 and described herein. information is provided as a function of region for health Pre-Generate Factorial Table providers. FIG. 9 is a Zoomed-in region of the map of FIG.8. The calculation of certain statistics has historically caused Those having ordinary skill in the art will recognize that problems. With respect to hypergeometric calculations, there several other types of customized reports may be generated. has been a tradeoff between accuracy and speed. In 1993, Wu In one embodiment, for example, one may generate a cus 25 published an algorithm that addressed a number of perfor tomized report that is a provider report. The provider report mance issues involved in processing, especially in dealing assists in identifying and enrolling clinical investigators and with the large factorial sets needed for hypergeometric cal may be similar to those shown in, e.g., FIGS. 8-9. culations. However, while performance using the Wu algo It may also be beneficial to link data for clinical trial rithm may be faster than other conventional techniques, the recruitment applications with data maintained by, e.g., the 30 inventors found it insufficient for, e.g., real time return sets Center for Disease Control (CDC). This linkage, or other from tens of thousands of attribute/outcome sets requiring similar linkages, can allow one to calculate different metrics full processing. Also, the Wu code requires over/under flow for confirmation of data or for other purposes. For example, logic when generating an initial recursion point H(0). by comparing to CDC data, one can determine if the number The cumulative hypergeometric function can be calculated of hits received for a certain condition in a certain geographi 35 in a number of different ways. Typically each probability “p' cal area is “in line' with CDC information for the same value (pdf) is generated in Some fashion and the values are condition. If a search indicates that City A has 5,000 adult Summed to compute the cumulative density function (cdf). patients with HIV (out of a total of 400,000 adult patients total For large populations, challenges exist in both the pdf and cdf for City A residing in the database(s) being searched), a computations. The pdf calculation can be difficult because comparison with CDC information regarding HIV rates in 40 large factorials must be processed. Wu tackles this issue by City A may serve as a confirmation of the 1.25% HIV rate or breaking the factorial terms into prime numbers with expo an alert if the CDC information indicates a substantially dif nents and reducing each prime/exponent combination to its ferent rate. A confirmation with other data such as CDC data simplest value. The remaining primes in the numerator and can be indicated on a customized report through a change in denominator are then processed in Such away that over/under color, a confirmation symbol (e.g., a check mark), or the like. 45 flow issues will not manifest. Once this first probability term, In one embodiment, information pulled to generate a cus h(O), is calculated, then the other probability terms can be tomized report may be linked or compared to information quickly generated and Summed to a cdf using known recur from the U.S. Census to arrive at patient percentage values or sion techniques; again with care to avoid over/under flow. the like. For example, if 5,000 adult patients in City A are Issues arise when many cafs need to be computed and the identified as being associated with HIV diagnoses, and one 50 factorial->prime sets->cancellation->computation process knows that there are 400,000 adult patients total for City A must be processed, and generate accurate results, many thou residing in the database(s) being searched, then one may sands of times. This process has limitations but can be coded assume that about 1.25% of City A's adults have an HIV directly in SQL (below) to returnpdfvalues. Table ALL59701 related diagnosis. If Census data reveals that City A has an contains the prime factorization of every factorial 0 through adult population of 2.1 million people, one can estimate that 59,701.

SELECT value,row number() over ( ORDER BY value ASC) rank ascrow number() over (ORDER BY value DESC) rank desc FROM ( SELECT (POWER(primenumber.((n1 exp+n2 exp)-(d1 exp+d2 exp)))) value FROM ( SELECT primenumber SUM(CASE WHEN n1=number of interest AND prime exp IS NOT NULL THEN prime exp ELSE O END) N1 exp US 7,917,525 B2

-continued SUM(CASE WHEN n2=number of interest AND prime exp IS NOT NULL THEN prime exp ELSE O END) N2 exp SUM(CASE WHEN d1 =number of interest AND prime exp IS NOT NULL THEN prime exp ELSE O END) D1 exp SUM(CASE WHEN d2=number of interest AND prime exp IS NOT NULL THEN prime exp ELSE O END) D2 exp FROM ( SELECT a.nla.n2.a.d1.a.d2,number of interest s primenumber, (prime exponent) prime exp FROM (SELECT 866 n1,1591 n2,1745 d1,712 d2 FROM dual) a ALL59701iotb WHERE a.d1=b.number of interest OR a.d2=b.number of interest OR an1=b.number of interest OR an2=b.number of interest ) GROUP BY primenumber ) WHERE ((n1 exp+n2 exp)-(d1 exp+d2 exp))<>0 ) ORDER BY rank ascrank desc

Table ALL59701 showing the representation of 20000! -continued expressed as a series of prime exp values: 2O FROM (SELECT /*-i- index(b)*/ n1 n1c, n2 n2c, d1 d1 c, d2 d2c, number of interest, NUMBER OF INTEREST PRIMENUMBER PRIME EXPONENT In factorial FROMALL 10Mb 2OOOO 2 1999S 20,000 3 9996 25 WHERE d1 = b.number of interest 20,000 5 4,999 OR d2 = b.number of interest 20,000 7 3,332 OR n1 = b.number of interest 20,000 Many others... Many others... 20,000 19,991 1 OR n2 = b.number of interest) 20,000 19,993 1 GROUP BY n1c, n2c, d1 c, d2c: 20,000 19,997 1 30

A faster method that still yields results accurate to 30 Where the table ALL 10M contains the log value of all decimal places (as tested against published values, e.g., Wu) factorials 0 through 10,000,000:

NUMBER OF INTEREST LN FACTORIAL 2OOOO 178075.6217371987OO312867928.177311631722

40 uses a different form of the factorial table. Instead of express As a check on the internal representation on the values one ing factorials as prime exp pairs, the factorial is expressed as can see for this computation of large factorials the value a logarithm (any base). In this case the computation SQL can differs from 20001 only in the 31st decimal place. Steps may be written as: also be taken to represent logarithmic values in the best pos sible datatype/format for a given platform.

SELECT SUM (CASE WHEN n1c = number of interest SELECT EXP(a.LN FACTORIAL-B.LN FACTORIAL)|| THEN In factorial 50 factorial check FROM ELSE O (SELECT a.NUMBER OF INTEREST.a.LN FACTORIAL END) In factorial FROMALL. 10Ma WHE + SUM (CASE a.NUMBER OF INTEREST=20001)a, (SELECT a.NUMBER OF INTEREST.a.LN FACTORIAL HEN n2c = number of interest In factorial FROMALL. 10Ma WHERE THEN In factorial a.NUMBER OF INTEREST-20000) b ELSE O 55

- SUM (CASE FACTORIAL CHECK EN d1c = number of interest 20000.99999999999999999999999999999999855 EN In factorial In one embodiment, The table ALL 10M may be created EL EO 60 and populated in the following manner END) - SUM (CASE WHEN d2c = number of interest THEN In factorial CREATE TABLE ALL 10M ELSE O ( END) VALUE 65 NUMBER OF INTEREST NUMBER NOT NULL, INTOh accum LN FACTORIAL NUMBER NOT NULL, US 7,917,525 B2 23 -continued -continued CONSTRAINT PK ALL 10M Signal Drug PRIMARY KEY Cluster Population Condition Signal Drug A B (NUMBER OF INTEREST) ) Sub-Sub group Women Dizziness strong Ole ORGANIZATION INDEX on Thyroid NOLOGGING: drugs

, and then running a script as below. This may take less than 1 It should be noted that an adverse signal could be actually hour on a modest platform to complete. 10 reflecting the Suppression of an outcome by the other com paring drug. (e.g., if drug B happens to cure dizziness in women on thyroid drugs and drug A neither causes or Sup CREATE ORREPLACE PROCEDURE Factorial10 presses dizziness then drug Acould show a "dizziness' signal IS even though the symptom was no higher than the rate in -- fills all 10M table with pairs of number, LN(number!) 15 women taking thyroid medication in the general population). i NUMBER: last i NUMBER: Using Such techniques and those of the rest of this disclo BEGIN Sure, one may be capable of, among other things, rapidly: INSERT INTO ALL 1 OM a. Creating attribute clusters; VALUES (0, 0); -- 0:=1 and LN(1)=0 b. Creating outcome clusters; last i := LN (1): FOR IN 1... 1 OOOOOOO c. Scoring all permutations of attribute/outcome clusters LOOP (i.e. finding the Sub groups with the strongest signals for each INSERT INTO ALL 1 OM emergent condition); and VALUES (i, LN (i) + (last i)); last i := LN (i) + last i: d. Presenting only the most significant permutations. IF i? 100000=ROUND(i/100000) THEN -- commit every 100K 25 In one embodiment, a solution to problems associated with COMMIT; prior algorithms is, in general, two-fold: (1) eliminate the dbms outputput line(i); END IF; expensive initial factorial operation associated with hyper END LOOP: geometric calculations by using a table containing pre-gen COMMIT; erated factorials (e.g., caching the natural log (LN) or LOG 10 END Factorial 10; 30 value (or other differently-based log value) of each factorial), f and (2) code the recursion for the H(x) integration entirely in logarithms until a final cumulative P value is computed. Hypergeometric calculations have a wide range of uses. In step 202 of FIG. 2, one or more factorial tables are One such use pertains to the generation of probability scores pre-generated. In a preferred embodiment, these specialized for drug safety measurements. Consider that two populations 35 tables allow for the calculation of hypergeometric statistical are very similar except that one group has taken drug 'A' and results, and they ensure that such calculation proceeds the other group has taken drug “B” (or a placebo, or no drug swiftly. The factorial tables may be similar to look-up tables, at all). At the initiation of each drug the index date is defined. in which the logarithm of different factorial numbers are The index date is the beginning of the “outcome” period. listed for quick access. As explained more below, after cal Typically, a patient is exposed to a new drug, procedure, 40 culations or other operations are performed using the loga diagnosis, event, etc. on the index date. This new condition rithms, an exponential can be taken. could be as varied as “stopped Smoking.” “received coronary Search Administrative Healthcare Claims Data bypass Surgery.” “began taking drug X.” “was diagnosed with In step 204 of FIG. 2, administrative healthcare claims data a hernia,” “began working at a particular location or job.” is searched. In other embodiments, different types of data “began seeing a particular physician, or any other event, 45 may be searched using the same techniques. In fact, any medical or non-medical. The index date for each individual in searchable repository of data (e.g., database) may benefit both population set is then normalized to Zero. Prior to the from this disclosure, and particularly data for which hyper index date each member has a set of attributes including but geometric statistical results are desired. The actual execution not limited to, gender, age, region, diagnosis, procedure, drug of the search in step 204 depends upon what statistical result codes, etc. After the initiation of the drugs all following codes 50 or analysis the user is seeking to generate. But, in a preferred are tagged as emergent. These emergent conditions may or embodiment, the search step 204 entails the searching of may not be related to the index drugs. If the two populations records so that one Subset can be compared against another, as are well matched, one can Surmise that adverse drug effects detailed below. related to, say, drug A, will be more prevalent in the drug A In one embodiment, search step 204 may entail the search population immediately or after prolonged exposure to drug 55 ing of one or more SSTs, as detailed above and herein. For A. Certain subgroups and even 'Sub-Sub” groups may be example, search step 204 may encompass the searching of especially susceptible to these adverse effects. For instance one or more SSTs alone or in combination with the searching women may be more Susceptible to, say, dizziness when of raw data from one or more databases. In the situation in taking drug 'A' than men taking drug 'A'. which SSTs are being searched, the user may see even greater 60 speed and efficiency. Comparing One Subset Versus Others In step 206 of FIG. 2, one subset of administrative health Signal Drug care claims data is compared against other Subsets of the Cluster Population Condition Signal Drug A B administrative healthcare claims data in order to arrive at a ALL ALL Dizziness weak Ole 65 statistical result (see discussion below for embodiments in Subgroup Women Dizziness moderate Ole which the statistical result is based on a hypergeometric cal culation). In a preferred embodiment, information associated US 7,917,525 B2 25 26 with one patient orphysician is compared against information Or, for the simple case when x=0 associated with all other patients or physicians. The first Subset may be a single record or group of records associated with an individual, while the other subset may be a million or n(n - 1) ...(n - r + 1) more records associated with other individuals. In other 5 h(0; r. n, N) = xix. I will embodiments, the record or records associated with an indi m (N-r) vidual can be compared against the records of those who have N (n-r)! Something in common with the individual—for example, the where records of one physician can be compared against records of n = N - i. other physicians practicing in the same medical specialty. Or, 10 the record or records of an individual can be compared against Prior to recursion, if the starting pdf value (initial recursion the records of other patients who suffer from the same or point) is not Zero because, e.g.: similar medical condition. Those having ordinary skill in the 1. Summing the smaller tail would be faster O art will recognize that the selection of different searching 15 Subsets can be chosen at will, and according to desire, in 2. Zero is an undefined starting point, accord with the type of statistical result being sought. then, for these special cases, one may use the longer form: Calculating Hypergeometric Statistical Result In step 208 of FIG. 2, a hypergeometric statistical result is calculated based on the comparing step 206, and using the one SELECTSUM (CASE or more pre-generated factorial tables of step 202. WHEN n1c = number of interest THEN In factorial To detect fraudulent coding patterns, discussed more ELSE below, one may compare the medical coding of an individual END) doctor against the coding patterns of all other doctors in that + SUM (CASE 25 WHEN n2c = number of interest specialty, and all the doctors in the top three specialties that THENH In factorial have billed against the same diagnostic code. The current ELS state of the art hypergeometric method for doing this analysis END) -- SUM (CASE against approximately ten million lives would require ENnc = number of interest approximately 100 hours to execute on modern database H EN In factorial hardware. As a result, this sort of comprehensive fraud detec 30 EL S EO tion would be limited to specific physicians that are suspect of END -- SUM (CASE fraudulent behavior. WHEN nac = number of interest Using pre-generated factorial tables and the other innova THEN In factorial ELSEO tions of this disclosure, however, (1) eliminates the expensive 35 END) initial factorial operation by using a table containing pre - SUM (CASE generated factorials (e.g., caching the LOG 10 value of each WHEN d1 c = number of interest factorial) and, (2) coding the recursion for the hypergeomet THEN In factorial ELSEO ric values entirely in logarithms until a final cumulative value END) is computed. This process reduces the calculation time for the 40 - SUM (CASE same fraud Screen against approximation ten million lives to WHEN d2c = number of interest THEN In factorial approximately six minutes (a speed increase of about 1000 ELSEO times). As a result, a fraud screen can be continuously run END) against a much larger number of doctors, and new hypothesis - SUM (CASE 45 WHEN d3c = number of interest for fraud detection can be developed much more quickly (e.g., THEN In factorial six minutes to see if there is a useful signal, instead of 100 ELSEO hours). With this type of speed, one can apply different “what END) - SUM (CASE if tests to extremely large data sets and get answers in min WHEN d4-c = number of interest utes and hours, not weeks and months. THEN In factorial Computationally expensive processes in the hypergeomet 50 ELSEO ric calculations have been recoded by the inventors using END) - SUM (CASE factorial lookups and natural logarithms to yield accurate P WHEN d5c = number of interest values in milliseconds. As specified by Wu (shown here in the THEN In factorial two equation sets immediately below) and others, a specific ELSEO pdf can be obtained by direct calculation of factorials. Below, 55 END) VALUE N is the total population, n is a Subgroup of the population, r INTOh accum FROM (SELECT /*-i- index(b)*/ is the sample set taken form the population without replace n1 n1c, n2 n2c, n3 n3c, na nAic, d1 d1 c, d2 d2c.d3 d5c, ment, and X is the number of “hits” in the sample set. d4.d4c.d5 d5c, number of interest, In factorial 60 FROMALL. 10MB WHERE d1 = B.number of interest nir (N - n) (N -r) OR d2 = B.number of interest OR d3 = B.number of interest where ORd4 = B.number of interest OR d5 = B.number of interest max(0, r - N + n) six s min(r, n). 65 ORn1 = B.number of interest ORn2 = B.number of interest ORn3 = B.number of interest US 7,917,525 B2 27 28 -continued codes) associated with the physician against the billing prac tices of other groups of physicians (e.g., against other groups OR na = B.number of interest); in total, against groups having the same medical specialty, h Sum := EXP (h accum); h all := h accum; against groups sharing the same geographic region, against tail integration pt:=1 for typical h(O) starts; groups of similar age, etc.). In one embodiment, results of the or tail integration pt:=(n-(ntot-r))+1

CREATE TABLE CLUSTER BASELINE POPS The CLUSTER XD CD PREJOIN value} tables, such ( 45 as the two mentioned above, may be of the following example PRODUCT ID NUMBER NOT NULL, form: VERSION ID NUMBER NOT NULL, NEWINDV ID VARCHAR2(17 BYTE) NOT NULL, GROUPFLAG NUMBER(1) NOT NULL, ATTRIBUTE VARCHAR2(30 BYTE), CREATE TABLE CLUSTER XD XC PREJOIN ???? ATTRIBUTE VALUEVARCHAR2(64 BYTE) 50 ( PRODUCT ID NUMBER NOT NULL, VERSION ID NUMBER NOT NULL, NEWINDV ID VARCHAR2(17 BYTE) NOT NULL, Where product id, and version id are partitioning infor GROUPFLAG NUMBER(1) NOT NULL, mation ATTRIBUTE VARCHAR2(30 BYTE), 55 ATTRIBUTE VALUEVARCHAR2(64 BYTE), NewindV id is a unique patient identifier OUTCOME CLASS VARCHAR2(22 BYTE), OUTCOME TYPE VARCHAR2(4 BYTE), Groupfalg indicates the population set DAYS IN STUDY NUMBER Attribute and attribute value generalize all possible ) TABLESPACE USERS attributes, e.g.: newindiv id 1723423 2 might have: PCTUSED O 60 PCTFREE 10 INITRANS 1 MAXTRANS 255 Attribute Attribute Value PARTITION BY RANGE (PRODUCT ID, VERSION ID) (partition terms) Gender Male Pre-generated tables not related to drug pairs DX code 410 65 Additionally a couple other pre-generated tables are used. Region South ALL CODE XWALK US 7,917,525 B2 35 36 -continued 3. Aggregate this set counting unique patients in each attribute Contains DX/PX/RX code descriptions CREATE TABLE ALL CODE XWALK a. We'll call this Set “A” (Nd and Nc counts for each ( attribute) CODE TYPE CHAR(2 BYTE), 5 Set 'A' steps 1, 2, 3, 3a: CODE SET VARCHAR2(8 BYTE), CODE DESC VARCHAR2(220 BYTE) ) SELECT attribute.attribute value COUNT (UNIQUE CASE WHEN groupflag=1 THENnewindv id Where code type is PX, RX or DX, Code set is the actual 10 ELSE NULL END) ind code, and code desc is the description. Note this can be COUNT (UNIQUE CASE WHEN groupflag=0 THENnewindv id ELSE NULL END) inc extended to any attribute requiring a description term. FROM ALL 1 OM CLUSTER BASELINE POPS WHERE product id=2 AND version id=200503 This table contains the natural log of the factorial of each 15 number from 0 through 10,000,000. This table is called by AND newindv id IN (select distinct newindv id from rx baseline where drug desc like 96thyroid hormones%) - could be drug class, wu4 function 9biot to provide pre-generated factorials for NDC codes, etc. computation of the first P-value from which to integrate (usu GROUP BY attribute, attribute value: ally Zero but accommodates highly skewed sets where Zero is not a viable starting pint). 2O 4. Create a placeholder attribute called “Baseline' and collect all attributes belonging to these individuals in the appropriate pre-joined CLUSTER BASELINE POPS parti CREATE ORREPLACE PROCEDURE Factorial10 tion (using GENDER to prune) IS 25 -- fills all 10M table with pairs of number, LN(number!) 5. Aggregate this set counting unique patients in the “Base i NUMBER: line’ last i NUMBER: a. We'll call this Set “B” (Ndand Nc counts for “Baseline, BEGIN INSERT INTO ALL 1 OM i.e. filter only attribute) VALUES (0, 0); -- 0:=1, ln(1)=0 Set “B” steps 4, 5,5a: last i := LN (1): 30 FOR IN 1... 1 OOOOOOO LOOP INSERT INTO ALL 1 OM SELECT attribute.attribute value.SUM(nd) ind,SUM(nc) inc FROM VALUES (i, LN (i) + (last i)); ( last i := LN (i) + last i: SELECT Baseline attribute, attribute value IF i? 100000-ROUND(i/100000) THEN 35 COUNT (UNIQUE CASE WHEN groupflag=1 THENnewindv id COMMIT; ELSE NULL END) ind dbms output.put line(i): COUNT (UNIQUE CASE WHEN groupflag=0 THENnewindv id END IF; ELSE NULL END) inc END LOOP: FROM COMMIT; CLUSTER BASELINE POPS WHERE product id=2 END Factorial 10; 40 AND version id=200503 AND attribute="GENDER AND newindv id IN (select distinct CREATE TABLE ALL 10M newindv id from rx baseline where drug desc like 96thyroid ( hormones%) - could be drug class, NDC codes, etc. NUMBER OF INTEREST NUMBER NOT NULL, GROUP BY attribute, attribute value LN FACTORIAL NUMBER NOT NULL, 45 CONSTRAINT PK ALL 10M PRIMARY KEY 6. Merge Sets “A” and “B” into Set “AB” (all Nd and Nc (NUMBER OF INTEREST) ) counts) ORGANIZATION INDEX UNION the two sets together and tag the combined set NOLOGGING: 50 nd no. We have now generated the counts for every possible attribute. 7. Collect all permutation sets belonging to these individu Clustered Outcomes Processing Example als in the appropriate pre-joined CLUSTER XD XC PRE A client is interested in the outcomes associated with the JOIN DX OP partition. Now let's collect the outcome sets population of patients taking thyroid hormones prior to their 55 and count the outcome population for each attribute/outcome initiation date on the Ketek/Biaxin drug pair. The user creates pair. the filter “INThyroid hormones” and applies it to “Dx OUT” 8. Aggregate this set counting unique patients in each in the data mining section for the Ketek/Biaxin drug pair. attribute/outcome pairing Process will then commence in multiple stages (in order of a. This becomes Cluster Set “C” (Xd and Xc counts for operation): 60 each attribute/outcome pair) Identify, aggregate, join and filter attribute/outcome pairs Set “C SELECT attribute.attribute value, outcome class.out 1. Identify the population “in play' for the analysis, e.g., come type those taking thyroid hormones in the baseline period. COUNT (UNIQUE CASE WHEN groupflag-1 THEN 2. Collect all attributes belonging to these individuals in the 65 newindvid ELSE NULL END) xd appropriate pre-joined CLUSTER BASELINE POPS parti COUNT (UNIQUE CASE WHEN groupflag-O THEN tion newindvid ELSE NULL END) Xc US 7,917,525 B2 37 38 FROM GROUP BY attribute attribute value, outcome class.out CLUSTER XD XC PREJOIN DX OP one of three come type outcome tables ) GROUP BY attribute, attribute value, outcome class.ou teome type WHERE product id=2 AND version id=200503 AND 11. Merge Sets “C” and “D” into Set “CD (all Xd and Xc newindV id IN (select distinct newindV id from rx base count). line where drug desc like '%thyroid hormones%)—could Merge the two sets using a UNION and tag as xd cd be drug class, NDC codes, etc. 12. Join Set “AB” to Set “CD by attribute becoming Set GROUP BY attribute attribute value, outcome class.out ABCD’ Where Sets A and “C” form the clustered out come type comes while sets “B” and “D' comprise the dynamic base 9. Create a placeholder attribute called “Baseline' and " line. collect all outcome permutation sets belonging to these indi- 13. Discard trivial cases keeping only rows where nd 3 viduals in the appropriate pre-joined CLUSTER XD X- AND nod-3 AND (xd--xc)>3 AND ((nc+nd)-(xc+xd))>3 C PREJOIN DX OP partition (using GENDER to prune). Using the code below, one can join the two sets (into Again using GENDER to ensure a look at the entire baseline ABCD) and eliminate trivial cases (this is optional if one population we'll count the outcome population for each base- wants to keep and process such cases). line/outcome pair. WHERE nd nc.attribute Xd Xc.attribute AND 10.. AAggregate te thisun1S setSet counungti un1due pauenustients 1ni eac h nd nc.attribute value=Xd Xc.attribute value AND nd 3 “Baseline'/outcome pairi AND inc3 AND pairing, (xd+xc)>3 AND ((nc+nd)-(xc+xd))>3 a. This becomes Cluster Set D” (Xd and Xc counts for 14. Process RR and Yates CI for Set 'ABCD’. each “Baseline/outcome pair) 15. Suppress processing of uninteresting cases, processing Set “D’ only those where POWER(ABS(xd-(xd--xc)*nd/(nd+nc))- SELECT attribute, attribute value, outcome class.out- 0.5.2)>=(4*(xd--xc)*nd/(nd+nc)*(1-nd/(nd+nc))) (this is come type.SUM(xd) not done at step 13 due to an Oracle nuance) Xd.SUM(xc)xc as Begin Cumulative Hypergeometric computation (AKA FROM ( cdf) & -- - ? 99 16. Pass Xprime, Nprime, Xtot, Ntot rows from Set SELECT Baseline attribute, "ABCD' into the Oracle function wu4 function'9biot attribute value, outcome class.outcome type The code below works with the population sets in the COUNT (UNIQUE CASE WHEN groupflag-1 THEN attribute outcome pairs generated in steps 1 to 13. Drug and newindvid ELSE NULL END) xd comparator values are Swapped prior to hypergeometric com COUNT (UNIQUE CASE WHEN groupflag-O THEN putation if the comparator has the stronger outcome signal. newindvid ELSE NULL END) Xc The log 10 value of these signals are demarked with a flipped FROM score sign. Prior to the hypergeometric calculation Wu4 function9biot(), a case statement performs a secondary CLUSTER XD XC PREJOIN DX OP one of three 35 filter on uninteresting cases (those that will not have a signifi outcome tables cant score). This case statement can be removed if all cases WHERE product id=2 AND version id=200503 AND are desired to be processed. The Relative Risk and Yates attribute=GENDER confidence interval are also processed in the code as Supple Attach filter here AND newindvid IN (select distinct mentary information on the population set. The scores from newindiv id from rx baseline where drug desc like '%thy 40 the hypergeometric are converted by log10 for readability but roid hormones%)—could be drug class, NDC codes, etc. can be show in their original form if desired.

d.attributed.attribute value,a.code desc Attribute value desc.d.outcome class,d.outcome type.o.code desc “Outcome type desc.d.xd.d.nd.dxc,d.nc, CASE WHEN dxc-O THEN TO CHAR(d.xd'd.nc? (dxcd.ind), 9999990.9) ELSE -- END rr. CASE WHEN (x-plowerx)>O THEN (plower x/d.nd)/((x-plower x)/d.nc) ELSE O END rrlower, CASE WHEN (x-pupper*x)>O THEN (pupper*x/d.nd)/((x-pupper*x)/d.nc) ELSE O END rrupper, CASE WHEN cumul hyper-1e-128 THEN score signTRUNC(LOG (10.cd.cumul hyper)) ELSE-10*score sign END score FROM ( SELECT c.*,(-b high--SQRT(POWER(b high,2)-4*a*c high))/(2*a) pupper,(-b low-SQRT(POWER(b low,2)-4*a*c low))/(2*a) plower, CASE WHEN POWER(ABS (xd-(xd+xc)*nd/(nd+nc))-.5,2) >= (8*(xd--xc)*nd (nd+nc)*(1-nd (nd+nc))) THEN aperio.Wu4 FunctionSbiot(x primen prime.xd+xc,nd+nc) ELSE 1 END cumul hyper FROM ( SELECT b.*,1+da,-2*pd low-d b low,-2*pd high-d b high.POWER(pd low,2) c low. POWER(pd high.2) c high FROM ( SELECT a.*,xd+xcx,POWER(1.96.2)/(xd+xc) d,CXd-.5)/(xd+xc) US 7,917,525 B2 39 40 -continued

( SELECT ind nc.attribute,nd nc.attribute value.Xd Xc.outcome class.xd Xc.outcome type.Xd Xc.Xd,nd inc.nd.Xd Xc.Xc,nd nc.nc CASE WHEN xdd(xd--xc)*nd/(nd+nc) THEN xc ELSExd END x prime CASE WHEN xdd(xd--xc)*nd/(nd+nc) THENnc ELSE ind END in prime CASE WHEN xdd(xd--xc)*nd/(nd+nc) THEN-1 ELSE CASE WHEN xd= (xd--xc)*nd/(nd+nc) THEN 0 ELSE 1 END END score sign FROM Set ABCD

17. Wu.4 function 9biot will determine if the tail starting -continued point is zero or some other value (for sets where Zero is not 15 AND POWER(ABS(xc-(xd--xc)*nd/(nd+nc))-.5.2) >= valid) (8*(xd--xc)*nd (nd+nc)*(1-nd (nd+nc))) -- purposeful Cartesian since the a. For a Zero tail start the natural log of the hypergeometric top IV returns one row is computed directly in the All 10M SQL call per the Wu )) c -- one of three base score tables for filter (2.1) equation WHERE SUBSTR(d.attribute,1,2)=a.code type(+) AND d.attribute value=a.code set(+) AND b. For a non-Zero tail start the natural log of the hypergeo SUBSTR(d.outcome class.1.2)=o.code type(+) AND metric is computed directly in the All 10M SQL call per the d.outcome type=o.code set(+) Wu (1.8) equation AND d.outcome class= c.outcome class 18. These tail starting points are then recursively computed AND d.outcome type=c.outcome type AND ABS(CASE WHEN cumul hyper-1e-128 THEN and summed up to the natural log Xprime value per the Wu score signTRUNC(LOG (10,cumul hyper)) ELSE-10*score sign (1.2) equation. The ln(cdf) is then converted to calf using 25 END)>ABS(c.score) EXP() and truncated to an value. AND ABS(CASE WHEN cumul hyper-1e-128 THEN Output Filtering for Score and Dynamic Baseline score signTRUNC(LOG (10,cumul hyper)) ELSE-10*score sign At this point in the processing the data set contains inter END)>=3 mingled scored values for both clustered outcomes and dynamic baseline data. 30 19. Re-calculate the dynamic baseline Set “BD” as above Example 2 for use as a comparison source. Set “BD” is processed and scored. This step is not necessary unless (as shown) one wants The “master sql' in Example 3 accesses SSTs and com to implement logic Such as “discard any outcomes sets where bines them into attribute/outcome population sets. It then the baseline data scores higher than the sub-cluster'. Note 35 processes these sets into hypergeometric scores, relative risk that some of these sets (e.g., “B” and “D’) can be written to and Yates confidence intervals as shown in steps 1 through 21 use alternate methods of holding the processed data for later above. use in the query. Some of these include global temp tables, Example 4 creates the SSTs for use by the master SQL. WITH temp AS, and other constructs that can pool sets into These SSTs are packed by newind id so the population set in memory/disk can be used to curtail the reprocessing of the 40 play can be quickly retrieved. This is accomplished for clus baseline sets ter baseline pops by the SQL. 20. Filter out rows in Set 'ABCD that have scores below SELECT DISTINCT 3. base.product id,base.version id.base.newindvid, base 21. Remove clustered outcome rows with scores >=3 if the :groupflag.base.attribute,base.attribute value FROM dynamic baseline score is equal to or exceeds the clustered 45 (merged sets) outcome score for a given outcome. The following code sec The first two terms (product id, version id) land the data tion is optional and serves the purpose of tagging the in a partition and the newindv idfield then clusters the data attributes and outcomes with descriptive information, show by individual. This SST packing could also be accomplished ing only scores with an absolute values 3 or greater, and by using an ORDER BY clause, group by clause, a large discarding outcome sets that have a lower score than their 50 concatenated index containing all the terms, or by using an baseline counterparts. index organized table. Below, aspects of this example are explained in more detail. Let's step through some of the nuances in the creation )d, ALL CODE XWALKaALL CODE XWALK o of the CLUSTER BASELINE POPS table. This table is 55 used to Supply counts of the population in any attribute set and WHERE ndd3 AND inc-3 AND (xd--xc)>3 AND ((nc+nd)–(xc+xd))>3 via the SST structure will quickly provide processing code with the desired individuals attributes for rapid counting.

INSERT /*--append'? INTO CLUSTER BASELINE POPS SELECT DISTINCT base.product id,base.version id,base.newindV id,base.groupflag.base.- attribute base.attribute value FROM >>>This section identifies and tags individuals that were active at least >>>one day in the outcome period (the vast majority). The tag value 1 to >>>7 days will >>>be used as an independent attribute and treated in US 7,917,525 B2 41 42 -continued >>>the code as if were a “typical attribute like gender, age, taking drug >>>''A', etc. during the baseline period. SELECT product idversion id.b.NEWINDV ID,b.groupflag, Days since first dispensing attribute, 1 to 7 days attribute value,1 beginning,7 ending FROM COHORT b WHERE b.match=1 AND b.termdate-bindexdt >=1 AND product id=1 AND version i UNION >>> Well append on another section for those active at least 8 days into >>> the outcome period. etc for other days into segments SELECT product idversion id.b.NEWINDV ID,b.groupflag, Days since first dispensing attribute, 8 to 29 days attribute value.8 beginning,29 ending FROM COHORT b WHERE b.match=1 AND b.termdate-bindexdt >=8AND product id=1 AND version i >>>>> Now we | populate the gender attribute holding “Male or Female in >>> the attribute value SELECT product idversion id.NEWINDV ID,groupflag, “GENDER attribute.sex attribute value,0 beginning.99999 ending FROM COHORT WHERE match=1 AND product id=1 AND version id=200503 UNION >>> Likewise for age or age group (age can be grouped into logical ranges >>> as desired), SELECT product idversion id.NEWINDV ID. groupflag, AGE GROU P attribute,b.AGE GROUP name attribute value 0 beginning.99999 endin 9. FROM COHORT a, AGE GROUP b WHERE a.AGE BETWEEN b.AGE GROUP START YRAND b.AGE GROUP END YRAND a match=1 AND product id=1 AND version id=200503 UNION >>> Now codesets like diagnosis, procedure, prescription can be tagged >>> (diagnosis shown). Medical code sets can be either actual codes, >>> truncated codes, classes of codes, etc. There is no limitation on how >>> an attribute can be classified. SELECT product idversion id.NEWINDV ID. groupflag, DX Baseline attribute.dx attribute value 0 beginning.99999 ending FROM DXBASE WHERE match=1 AND product id=1 AND version id=200503 AND indexflag=0 and drhlinpat=1 >>> Note ANY attribute can be presented in this format including but not >>> limited to race, ethnicity, genetic information, family history, >>> height, weight, profession, Smoker, etc.

Let's step through some of the nuances in the creation of individual by individual. Each attribute/outcome pair is the CLUSTER XD XC PREJOIN DX OP table (The tagged by an individual id. This allows for complete flexibil other outcome tables are similar). The purpose of this table is ity when including or excluding sets of individuals by any to permeate all possible attribute with all possible outcomes desired criteria (single or multiple).

>>> again we are creating an SST using the distinct clause which will sort >>> by field order. The first two fields in this case will define a artition and the newindv id will cluster data into the blocks by >>> individual. INSERT /*-- append*/ INTO CLUSTER XD XC PREJOIN DX OP SELEC T DISTINCT base.product id,base.version id,base.newindV id,base.groupflag.base.- attribute,base.attribute value,emergent.outcome class.emergent.outcome type, emergent.days in study FROM >>> Outcomes will be expressed in terms of base attributes so this table >>> will fist find attributes much like what was done in >>>> c uster baseline pops. However this time we will tag each >>>> a tribute? outcome pair with an individual id instead of just each >>>> a tribute with an individual id. >>> This section is finding the outcomes for each person based on when the >>> outcome occurred. Each persons attribute are then permeated by that >>> person's outcomes via the join “base.newindv id=emergent.newindw id >>> and placed in the appropriate time range “emergent.days in study BETWEEN >>> base.beginning AND base.ending. Note in this example the DX codes are >>> truncated to widen the outcome group. This is optional depending on the >>> desired outcome granularity. Also we only use matched patients to US 7,917,525 B2 43 44 -continued >>> mitigate confounding of the results where each person in the study set >>> requires a pair sharing key attribute characteristics. (SE LECT DISTINCT product id, version id, a.newindv id, agroupflag, DX OUTPATIENT outcome class, CASE WHEN LENGTH (TRIM (a.dx emerg)) = 4 AND SUBSTR (dx emerg, 1, 1) <> E THEN SUBSTR (dx emerg, 1, 3) ELSE TRIM (dx emerg) END outcome type, days in study FROM dxemerg a WHERE amatch = 1 AND a product id=1 AND aversion id = 200503 AND siteflag = 2 ) emergent WH ERE base.newindV id=emergent.newindV idAND emergent.days in study BETWEEN base.beginning AND base.ending:

As described earlier, these tables can be used to quickly 20 Seth 1 Drug 'A' vs. Drug “B” Women experiencing process and “score” headache massive permutation sets of attribute/outcome pairs. Sta- Seth2 Drug "A" vs. Drug"B" Women experiencing vom tistically interesting combinations can easily by floated to the 1ting & G A “B” A11 top by Sorting on score. Coupled with the filtering capability 25 in E. 'A' vs. Drug “B” All on NSAIDs experienc to select any Sub-population this question can be extended S.4 Drug “A” vs. Drug “B” All on NSAIDs experienc 1ntO: 1ng Vomiting Are women who are on thyroid medications more likely to The Sub-cluster sets and syndrome processing would also have headaches when taking drug 'A' than drug “B”? test, score and present: Are males between 50 and 59 who have had bypass Surgery 30 Sethi5 Drug “A” vs. Drug “B” Women on NSAIDs expe more likely to have strokes when taking drug 'A' than drug riencing headache “B” Seth 6 Drug"A" vs. Drug “B” Women on NSAIDs expe Does drug “B” have an unforeseen benefit in reducing a riencing vomiting, &g common disease in women who live in the South (e.g. Sinusi- Seth 7 Drug “A” vs. Drug B Women on NSAIDs expe riencing headache and Vomiting tis)? 35 Setit-8 Drug 'A' vs. Drug “B” Women experiencing This powerful technique can also be extended to attribute headache and Vomiting Sub clusters and outcome pairs (AKA syndromes). In other Seth9 Drug “A” vs. Drug “B” All on NSAIDs experienc words we can permeate every combination of attributes and ing headache and Vomiting find these important pieces of information and float them to Seti 10 Drug 'A' vs. Drug “B” Women experiencing the top automatically. Also, pair of outcomes can be automati- " headache and Vomiting cally coupled into syndromes and mined. This sub-attribute The example below walks through the generation of seti5 clustering can also be coupled with syndromes to mine and and setti 6 (and all other possible permutations of attribute 1/ score for information as complex as: attribute2/outcome triplets) Women taking NSAIDs are more likely to show headache By sourcing the CLUSTER BASELINE POPS table and vomiting when taking drug “B” vs drug 'A'. In other twice, one can create and store all attribute sub-cluster per words, with no filtering criteria the previously mentioned mutations as shown below into table attribute/outcome pairs would automatically score the fol- ALL PROD2b DOUBLET NS. This table contains the lowing sets. attribute/attribute counts.

CREATE TABLE ALL PROD2b DOUBLET NSAS SELECT a.ATTRIBUTE a1, a.ATTRIBUTE VALUE av1.b.ATTRIBUTE a2, b.ATTRIBUTE VALUE av2 COUNT(UNIQUE CASE WHEN agroupflag=1 THEN a.newindv id ELSE NULL END) ind COUNT(UNIQUE CASE WHEN agroupflag=0 THEN a.newindv id ELSE NULL END) nC FROM CLUSTER BASELINE POPS a,CLUSTER BASELINE POPS b WHERE a product id=2 AND b.product id=2 AND a.ATTRIBUTE VALUE='FAND b. attribute value=WESTAND *. a.attribu ea...attribute value-b.attributeb.attribute value AND a.newindw id=b.newindw id AND a.attribute value-b.attribute value don't double return set GROUP BY a.ATTRIBUTE, a.ATTRIBUTE VALUE b.ATTRIBUTE, b.ATTRIBUTE VALUE US 7,917,525 B2 45 46 -continued HAVING COUNT(UNIQUE CASE WHEN agroupflag=1 THEN anewindv id ELSE NULL END)>3 AND COUNT(UNIQUE CASE WHEN agroupflag=0 THEN a.newindw id ELSE NULL END) >3

Now we can create the complimentary table containing all 10 the outcomes counts for each attribute/attribute pair. This table contains the attribute/attribute outcome counts.

CREATE TABLE ALL PROD2b DOUBLET XS SELECT * FROM ( SELECT a1av1a2...av2 outcome class.outcome type COUNT(UNIQUE CASE WHEN groupflag=1 THENnewindv id ELSE NULL END) xd COUNT(UNIQUE CASE WHEN groupflag=0 THENnewindv id ELSE NULL END) xC FROM ( SELECTX.attribute a1,x.attribute value av1.y.attribute a2.y.attribute value av2.X.outcome class.X.outcome type.X.newindV id.X.groupflag FROM

ELECT DISTINCT .ATTRIBUTE.a.ATTRIBUTE VALUE.a.outcome class...a...outcome type, ewindiv id:groupflag ROM CLUSTER XD XC PREJOIN DX OP a WHERE a...PRODUCT ID=2) ( SELECT DISTINCT a.ATTRIBUTE.a. ATTRIBUTE VALUE.a.outcome class,a.- outcome type,newindV id, groupflag FROM CLUSTER XD XC PREJOIN DX OP a WHERE a...PRODUCT ID=2) WHERE x.attribute||x.attribute valuesy.attributely.attribute value AND X.newindv id=y.newindv id f* don't double return set/ AND X.outcome class=y.outcome class AND X.Outcome type=y.outcome type GROUP BY alav1a2...av2 outcome class.Outcome type )

These two tables can then be combined and scored often -continued yielding powerful insight into hidden attribute combinations 45 that have interesting properties. CASE WHEN xdd(xd--xc)*nd/(nd+nc)> : THEN-1 ELSE CASE WHEN Xd=(xd+xc)*nd (nd+nc) THENO ELSE 1 END END score sign FROM ALL PROD2b DOUBLET XS iv,ALL PROD2b DOUBLET NS a WHERE iv.a1=a.a1 AND iv.a2=a.a2 AND iv.av1=a.av1 CREATE TABLE spir dx ip doublet scores AS 50 AND iv.av2=a.av2 AND SELECT iv.2.*, CASE WHEN (xd+xc)>3 ANDnd-3 AND inco-3 AND ((nc + nd) - (xc + xd)) > 3 AND POWER(ABS(Xd-(xd+xc)*nd (nd+nc))-.5.2) >= POWER(ABS(Xd-(xd+xc)*nd (nd+nc))-.5.2) >= (4*(xd--xc)*nd (nd+nc)*(1-nd (nd+nc))) THEN (4*(xd--xc)*nd (nd+nc)*(1-nd (nd+nc))) Wu4 FunctionSbiot(x primen prime.xd+xc,nd+nc) ELSE ) iv.2 ORDER BY cumul hyper ASC 1 END cumul hyper FROM 55 ( SELECT iv.*.a.n.d.a.inc CASE WHEN xdd(xd--xc)*nd/(nd+nc) THEN xc ELSExd END x prime Example 3 CASE WHEN xdd(xd--xc)*nd/(nd+nc) THENnc ELSE ind END in prime Master sql, including Hypergeometric Call

SELECT * FROM ( SELECT 2 product id.200503 version id, d.attributed.attribute value.a.code desc Attribute value desc.d.outcome class,d.outcome type.o.code desc “Outcome type US 7,917,525 B2 47 48 -continued desc.d.Xd.d.nd.d.xc,d.nc., CASE WHEN dxc-O THEN TO CHAR(d.xd*d.nc? (dxcd.ind), 9999990.9) ELSE - END rr, CASE WHEN (x-plowerx)>O THEN (plower x/d.nd)/((x-plowerx)/d.nc) ELSE O END rrlower, CASE WHEN (x-pupper*x)>O THEN (pupper*x/d.nd)/((x-pupper*x)/d.nc) ELSE O END rrupper, CASE WHEN cumul hypers-1e-128 THEN score signTRUNC(LOG (10.cd.cumul hyper)) ELSE-10*score sign END score FROM (

FROM ( SELECT b.*,1+da,-2*pd low-d b low,-2*pd high-d b high.POWER(pd low,2) c low. POWER(pd high.2) c high FROM ( SELECT a.*,xd+xcx,POWER(1.96.2)/(xd+xc) d,CXd-.5)/(xd+xc) pd low,(Xd+.5)/(xd+xc) pd high FROM ( SELECT ind nc.attribute,nd nc.attribute value.Xd Xc.outcome class.xd Xc.outcome type.Xd Xc.Xd,nd inc.nd.Xd Xc.Xc,nd nc.nc CASE WHEN xdd(xd--xc)*nd/(nd+nc) THEN xc ELSExd END x prime CASE WHEN xdd(xd--xc)*nd/(nd+nc) THENnc ELSE ind END in prime CASE WHEN xdd(xd--xc)*nd/(nd+nc) THEN-1 ELSE CASE WHEN xd= (xd--xc)*nd/(nd+nc) THEN 0 ELSE 1 END END score sign FROM ( SELECT attribute.attribute value COUNT (UNIQUE CASE WHEN groupflag=1 THENnewindv id ELSE NULL END) ind COUNT (UNIQUE CASE WHEN groupflag=0 THENnewindvid ELSE NULL END) C FROM CLUSTER BASELINE POPS WHERE product id=2 AND version id=200503 -- Attach filter here AND newindv id IN ( ) GROUP BY attribute, attribute value UNION SELECT attribute.attribute value.SUM(nd)nd,SUM(nc) inc FROM ( SELECT Baseline attribute, attribute value COUNT (UNIQUE CASE WHEN groupflag=1 THENnewindv id ELSE NULL END) ind COUNT (UNIQUE CASE WHEN groupflag=0 THENnewindv id ELSE NULL END) C FROM CLUSTER BASELINE POPS WHERE product id=2 AND version id=200503 AND attribute="GENDER -- Attach filter here AND newindv id IN ( ) GROUP BY attribute, attribute value ) GROUP BY attribute.attribute value ) nd inc s ( SELECT attribute.attribute value,outcome class,outcome type COUNT (UNIQUE CASE WHEN groupflag=1 THENnewindv id ELSE NULL END) xd COUNT (UNIQUE CASE WHEN groupflag=0 THENnewindv id ELSE NULL END) XC FROM CLUSTER XD XC PREJOIN DX OP-- one of three outcome tables WHERE product id=2 AND version id=200503 - Attach filter here AND newindv id IN ( ) GROUP BY attribute.attribute value...outcome class.outcome type UNION SELECT attribute, attribute value...outcome class.Outcome type.SUM(xd) Xd,SUM(xc) xc FROM ( SELECT Baseline attribute, attribute value,outcome class.outcome type COUNT (UNIQUE CASE WHEN groupflag=1 THENnewindv id ELSE NULL END) xd COUNT (UNIQUE CASE WHEN groupflag=0 THENnewindv id ELSE NULL END) XC FROM CLUSTER XD XC PREJOIN DX OP-- one of three outcome tables US 7,917,525 B2 49 50 -continued WHERE product id=2 AND version id=200503 AND attribute="GENDER -- Attach filter here AND newindv id IN ( ) GROUP BY attribute.attribute value...outcome class.outcome type ) GROUP BY attribute, attribute value...outcome class.outcome type ) Xd Xc WHERE no inc. attribute=xd Xc.attribute AND ind nc.attribute value=xd Xc.attribute value ANDndo-3 ANDnc-3 AND (xd+xc)>3 AND ((nc+nd)-(xc+xd))>3 ) a ) b ) c ) d, ALL CODE XWALKaALL CODE XWALKo ( SELECT outcome class.outcome type,CASE WHEN POWER(ABS(Xd-(xd+xc)*nd (nd+nc))-.5.2) >= (8*(xd--xc)*nd (nd+nc)*(1-nd (nd+nc))) THEN score signTRUNC(LOG (10,CREATEST(aperio. Wu4 FunctionSbiot(x primen prime Xd+xc,nd+nc), 1e-128))) ELSE O END score FROM (SELECT ind nc.attribute,nd nc.attribute value.Xd Xc.outcome class.xd Xc.outcome type.Xd Xc.Xd.nd inc.nd.Xd Xc.Xc,nd nc.nc CASE WHEN xdd(xd--xc)*nd/(nd+nc) THEN xc ELSExd END x prime CASE WHEN xdd(xd--xc)*nd/(nd+nc) THENnc ELSE ind END in prime CASE WHEN xdd(xd--xc)*nd/(nd+nc) THEN-1 ELSE CASE WHEN xd= (xd--xc)*nd/(nd+nc) THEN 0 ELSE 1 END END score sign FROM ( SELECT attribute.attribute value.SUM(nd)nd,SUM(nc) inc FROM ( SELECT Baseline attribute, attribute value COUNT (UNIQUE CASE WHEN groupflag=1 THENnewindv id ELSE NULL END) ind COUNT (UNIQUE CASE WHEN groupflag=0 THENnewindv id ELSE NULL END) l FROM CLUSTER BASELINE POPS WHERE product id=2 AND version id=200503 AND attribute="GENDER -- Attach filter here AND newindv id IN ( ) GROUP BY attribute, attribute value ) GROUP BY attribute.attribute value ) nd inc s ( SELECT attribute, attribute value...outcome class.Outcome type.SUM(xd) Xd,SUM(xc) xc FROM ( SELECT Baseline attribute, attribute value,outcome class.outcome type COUNT (UNIQUE CASE WHEN groupflag=1 THENnewindv id ELSE NULL END) xd COUNT (UNIQUE CASE WHEN groupflag=0 THENnewindv id ELSE NULL END) XC FROM CLUSTER XD XC PREJOIN DX OP-- one of three outcome tables WHERE product id=2 AND version id=200503 AND attribute="GENDER -- Attach filter here AND newindv id IN ( ) GROUP BY attribute.attribute value...outcome class.outcome type ) GROUP BY attribute, attribute value...outcome class.outcome type ) Xd Xc WHERE ndd3 AND inc-3 AND (xd--xc)>3 AND ((nc+nd)–(xc+xd))>3 AND POWER(ABS(Xd-(xd+xc)*nd (nd+nc))-.5.2) >= (8*(xd--xc)*nd (nd+nc)*(1-nd (nd+nc))) -- purposeful cartesian since the top IV returns one row )) c -- one of three base score tables for filter WHERE SUBSTR(d.attribute,1,2)=a.code type(+) AND d.attribute value=a.code set(+) AND SUBSTR(d.outcome class, 1.2)=o.code type(+) AND d.outcome type=o.code set(+) AND d.outcome class=c.outcome class AND d.outcome type=c.outcome type AND ABS(CASE WHEN cumul hypers-1e-128 THEN score signTRUNC(LOG (10,cumul hyper)) ELSE-10*score sign END)>ABS (c.score) AND ABS(CASE WHEN cumul hypers-1e-128 THEN score signTRUNC(LOG (10,cumul hyper)) ELSE-10*score sign END)>=3 US 7,917,525 B2 51 52 Example 4 Filename: raproduct idl opiprX.sql US 7,917,525 B2 53 54 delete from CLUSTER BASELINE POPS where product id=1 AND version id=200503; US 7,917,525 B2 55 56 commit; INSERT /*-append/INTO CLUSTER BASELINE POPS SELECT DISTINCT base-product id,base.version id,base.newindvid,base.groupflag,base.attribute,base.attri bute value FROM (SELECT product id,version id,b.NEWINDV ID,b,groupflag."Days since first dispensing' attribute, "1 to 7 days' attribute value, 1 beginning,7 ending FROM COHORT b WHERE b.match=1 AND b.termdate-b.index.dt >=1 AND product id=1 AND version id=200503 UNION SELECT product idversion id.b.NEWINDVID,b. groupflag."Days since first dispensing' attribute, "8 to 29 days' attribute value,8 beginning,29 ending FROM COHORT b WHERE b.matchel AND b.termdate-bindexdt >=8 AND product id=1 AND version id=200503 UNION SELECT product idversion id,b.NEWINDV ID,b. groupflag."Days since first dispensing' attribute, "30 to 89 days' attribute value.30 beginning,89 ending FROM COHORT b WHERE b.match=1 AND b. termidate-b.index.dt >=30 AND product id=1 AND version id=200503 UNION SELECT product id,version id.b.NEWINDVID,b. groupflag, Days since first dispensing' attribute, '90 to 365 days' attribute value,90 beginning,365 ending FROM COHORT b WHERE b.match=1 AND b.termdate-b.indexdt >=90 AND product id=1 AND version id=200503 UNION SELECT product id,version idNEWINDV ID,groupflag,'GENDER" attribute, sex attribute value,0 beginning.99999 ending FROM COHORT WHERE match=1 AND product id=1 AND version id=200503 UNION SELECT product idversion id.NEWINDVID,groupflag,'AGE GROUP attribute,b.AGE GROUP name attribute value,0 beginning.99999 ending FROM COHORT a, AGE GROUP b WHERE a.AGE BETWEEN b.AGE GROUP START YRAND b.AGE GROUP END YRAND a match=1 AND product id=1 AND version id=200503 UNION SELECT product idversion id,NEWINDV ID, groupflag,'REGION" attribute, region attribute value,0 beginning.99999 ending FROM COHORT WHERE match=1 AND product id=1 AND version id-200503 UNION SELECT product idversion id,NEWINDVID, groupflag,'DX Baseline' attribute,dx attribute value,0 beginning.99999 ending FROM DXBASE WHERE match=1 AND product id=1 AND version id-200503 AND indexflag-0 and drhlinpat=1 UNION US 7,917,525 B2 57 58

SELECT product idversion id,NEWINDV ID.groupflag,'PX Baseline' attribute, proc catcode attribute value 0 beginning.99999 ending FROM pxbase WHERE match=1 AND product id=1 AND version id-200503 AND indexflag-0 UNION SELECT product idversion id.NEWINDV ID, groupflag,'RX Baseline',THERSPC attribute value 0 beginning.99999 ending FROM rxbase WHERE match=1 AND product id=1 AND version id=200503 AND indexflag-0) base; commit; delete from CLUSTER XD XC PREJOIN DX OP where product id=1 AND version id-200503; commit; INSERT /*-- append? INTO CLUSTER XD XC PREJOIN DX OP SELECT DISTINCT base-product id,base.version id,base.newindvid,base, groupflag,base, attribute,base.attri bute value,emergent.outcome class,emergent.outcome type,emergent.days in study FROM (SELECT product idversion id.b.NEWINDVID,b,groupflag,'Days since first dispensing' attribute, "1 to 7 days' attribute value, 1 beginning,7 ending FROM COHORT b WHERE b.matchel AND b.termidate-bindexdt >=1 AND product id=1 AND version id=200503 UNION SELECT product id,version id,b.NEWINDV ID,b.groupflag, Days since first dispensing' attribute, "8 to 29 days' attribute value,8 beginning,29 ending FROM COHORT b WHERE b.match=1 AND b. termidate-bindex.dt >=8 AND product id=1 AND version id=200503 UNION SELECT product idversion id,b.NEWINDV ID,b. groupflag."Days since first dispensing' attribute, "30 to 89 days' attribute value,30 beginning,89 ending FROM COHORT b WHERE b.match=1 AND b. termdate-b.index.dt >=30 AND product id=1 AND version id=200503 UNION SELECT product id,version id.b.NEWINDVID,b. groupflag."Days since first dispensing' attribute, '90 to 365 days' attribute value,90 beginning,365 ending FROM COHORT b WHERE b.match=1 AND b. termdate-b.index.dt >=90 AND product id=1 AND version id=200503 UNION SELECT product idversion id.NEWINDVID, groupflag,'GENDER" attribute, sex attribute value,0 beginning.99999 ending FROM COHORT WHERE match=1 AND product id=1 AND version id=200503 UNION US 7,917,525 B2 59 60

SELECT product idversion id.NEWINDVID, groupflag,'AGE GROUP attribute,b.AGE GROUP name attribute value 0 beginning.99999 ending FROM COHORT a, AGE GROUP b WHERE a.AGE BETWEEN b.AGE GROUP START YR AND b.AGE GROUP END YRAND a match=1 AND product id=1 AND version id=200503 UNION SELECT product idversion id.NEWINDV ID, groupflag. REGION" attribute.region attribute value,0 beginning,99999 ending FROM COHORT WHERE match=1 AND product id=1 AND version id=200503 UNION SELECT product idversion id.NEWINDV ID. groupflag."DX Baseline' attribute,dx attribute value,0 beginning.99999 ending FROMDXBASE WHERE match=1 AND product id=1 AND version id=200503 AND indexflag F0 and drhlinpat=1 UNION SELECT product id,version id.NEWINDVID, groupflag,'PX Baseline' attribute.proc catcode attribute value,0 beginning.99999 ending FROMpxbase WHERE match=1 AND product id=1 AND version id-200503 AND indexflag-0 UNION SELECT product idversion id.NEWINDV ID,groupflag,'RX Baseline',THERSPC attribute value,0 beginning.99999 ending FROM rxbase WHERE match=1 AND product id=1 AND version id=200503 AND indexflag-0) base (SELECT DISTINCT product id, version id, a.newindvid, agroupflag, "DX OUTPATIENT' outcome class, CASE WHEN LENGTH (TRIM (a.dx emerg)) = 4 AND SUBSTR (dx emerg, 1, 1) <> 'E' THEN SUBSTR (dx emerg, 1, 3) ELSE TRIM (dx emerg) END outcome type, days in study FROM dxemerg a WHERE a match = 1 AND a product id=1 AND aversion id = 200503 AND siteflag = 2 ) emergent WHERE base.newindvid Femergent.newindvid AND emergent.days in study BETWEEN base.beginning AND base.ending; commit; delete from BASE SCORE FILTER 3DX OP where product id=1 AND version id-200503;

US 7,917,525 B2 63 64

COUNT (UNIQUE CASE WHEN groupflag-1 THENnewindvid ELSE NULL END) xd COUNT (UNIQUE CASE WHEN groupflag-O THEN newindvid ELSE NULL END) XC FROM CLUSTER XD XC PREJOIN DX OP WHERE product id=1 AND version id=200503 GROUP BY outcome class,outcome type ) Xd Xc WHERE nd-3 ANDnc>3 AND (xd-xc)>3 AND (inc+nd)-(xc+xd))>3 ) a WHERE POWER(Cx prime - (xd-xc) * (n prime/(nd+nc))),2)>= 1.6 * (xd-xc)* (n prime/(nd+nc)) * (1-(n prime/(nd+nc))) ) c ) d, ALL CODE XWALK o WHERE SUBSTR(d.outcome class, 1,2)=o.code type(+) AND d.outcome type-O, code set(t) commit; delete from NOFILTER PRESCORE DX OP where product id=1 AND version id=200503; commit; INSERT /*-append? INTO NOFILTER PRESCORE DX OP SELECT * FROM ( SELECT 1 product id.200503 version id, d.attribute,d.attribute value,a.code desc "Attribute value desc",d, outcome class,d.outcome type,o.code desc "Outcome type desc".d.Xd.d.nd.d.Xc,d.nc, CASE WHEN dxc50 THEN TO CHAR(d.xddinc/(d.xcd.ind),'9999990.9") ELSE'-- END rr, CASE WHEN (x-plowerx)>0 THEN (plower x/d.nd)/(x-plowerx)/d.nc) ELSE O END rrlower, CASE WHEN (x-pupper*x)>O THEN (pupper*x/d.nd)/((x-pupper*x)/d.nc) ELSE O END rrupper, CASE WHEN cumul hyper>1e-128 THEN score signTRUNC(LOG(10.cd.cumul hyper)) ELSE - 10*score sign END score FROM ( SELECT c.*,(-b high--SQRT(POWER(b high,2)-4*a*c high))/(2*a) pupper,(-b low SQRT(POWER(b low,2)-4*a*c low))/(2* a) plower, Wu4 Function.9biot(x prime,n prime,xd+xc,nd+nc) cumul hyper US 7,917,525 B2 65 66

FROM ( SELECT b.*,1+da,-2*pd low-d b low-2*pd high-d b high, POWER(pd low,2) c low,POWER(pd high,2) c high FROM ( SELECT a.*,xd+xc x,POWER(1.96.2)/(xd-xc) d,(Xd-.5)/(xd-xc) pd low,(Xd+.5)/(xd+xc) pd high FROM ( SELECT nd no.attribute,nd no.attribute value,xd Xc.outcome class,Xd Xc.outcome type,xd Xc.x d,nd inc.nd,Xd Xc.Xc,nd nc.nc CASE WHEN xdd(xd-i-xc)*nd/(nd+nc) THEN xc ELSExd END x prime CASE WHENxdd(xd+xc)*nd/(nd+nc) THENnc ELSE ind END in prime CASE WHENxdd(xd+xc)*nd/(nd+nc) THEN-1 ELSE CASE WHEN xd= (xd+xc)*nd/(nd+nc) THEN 0 ELSE 1 END END score sign FROM ( SELECT attribute, attribute value COUNT (UNIQUE CASE WHEN groupflag=1 THENnewindvid ELSE NULL END) nd COUNT (UNIQUE CASE WHEN groupflag-O THENnewindvid ELSE NULL END) C FROM CLUSTER BASELINE POPS WHERE product id=1 AND version id-200503 GROUP BY attribute, attribute value ) nd inc s ( SELECT attribute, attribute value, outcome class, outcome type COUNT (UNIQUE CASE WHEN groupflag-1 THENnewindvid ELSE NULL END) Xd COUNT (UNIQUE CASE WHEN groupflag-O THENnewindvid ELSE NULL END) XC FROM CLUSTER XD XC PREJOIN DX OP WHERE product id=1 AND version id-200503 GROUP BY attribute, attribute value,outcome class, outcome type ) Xd Xc WHERE nd nc.attribute-xd Xc.attribute AND nd no.attribute value=xd xc.attribute value ANDnd>3 AND inc-3 AND (xd-xc)>3 AND ((nc+nd)-(xchxd))>3 ) a WHERE POWER((x prime - (xd-xc) * (n prime/(nd+nc))),2)>= 1.6 * (xd-xc)* (n prime/(nd-inc) * (1-(n prime/(nd-inc))) ) c ) d, ALL CODE XWALKa, ALL CODE XWALK o,BASE SCORE FILTER 3DX OP c US 7,917,525 B2 67 68

WHERE SUBSTR(d.attribute, 1,2)-acode type(+) AND d.attribute value-Fa.code set(i) AND SUBSTR(d.outcome class, 1,2)-O.code type(+) AND d.outcome type=o.code set(+) AND d.outcome classi c.outcome class(+) AND d.outcome type-Fc.outcome type(+) AND c.version id(+)=200503 AND c. product id(+)=1 AND ABS(CASE WHEN cumul hyper-1e-128 THEN score sign TRUNC(LOG(10,cumul hyper)) ELSE-10*score sign END)>ABS(nvl(c.score,0)) AND ABS(CASE WHEN cumul hyper-le-128 THEN score sign"TRUNC(LOG(10,cumul hyper)) ELSE-10*score sign END)>-3 UNION SELECT * FROM BASE SCORE FILTER 3DX OP WHERE product id=1 AND version id=200503 AND ABS(score)>=3 ) ORDER BY ABS(score) DESC,xd+xc DESC; commit; delete from CLUSTER XD XC PREJOIN DX IP where product id=1 AND version id=200503; commit; INSERT /*-append? INTO CLUSTER XD XC PREJOIN DX IP SELECT DISTINCT base product id,base.version id,base.newindV id,base.groupflag, base.attribute,base.attri bute value,emergent.outcome class,emergent.outcome type,emergent.days in study FROM (SELECT product id,version id,b.NEWINDV ID,b. groupflag."Days since first dispensing attribute, "1 to 7 days' attribute value, 1 beginning,7 ending FROM COHORT b WHERE b.match= AND b.termidate-bindex.dt >=1 AND product id=1 AND version id=200503 UNION SELECT product id,version id,b.NEWINDV ID,b-groupflag."Days since first dispensing' attribute, "8 to 29 days' attribute value,8 beginning,29 ending FROM COHORT b WHERE b.match=1 AND b.termidate-bindex.dt >=8 AND product id=1 AND version id-200503 UNION SELECT product id,version id.b.NEWINDV ID,b. groupflag."Days since first dispensing' attribute, 30 to 89 days' attribute value.30 beginning,89 ending FROM COHORT b WHERE b.match=1 AND b. termidate-b.index.dt >=30 AND product id=1 AND version id-200503 UNION SELECT product id,version id.b.NEWINDV ID,b. groupflag."Days since first dispensing' attribute, '90 to 365 days' attribute value,90 beginning,365 ending US 7,917,525 B2 69 70

FROM COHORT b WHERE b.matchel AND b. termdate-bindex.dt >=90 AND product id=1 AND version id-200503 UNION SELECT product id,version id.NEWINDV ID,groupflag,"GENDER" attribute, sex attribute value,0 beginning.99999 ending FROM COHORT WHERE match=1 AND product id=1 AND version id=200503 UNION SELECT product idversion id.NEWINDV ID,groupflag,'AGE GROUP attribute,b.AGE GROUP name attribute value,0 beginning.99999 ending FROM COHORT a, AGE GROUP b WHERE a.AGE BETWEEN b.AGE GROUP START YRAND b.AGE GROUP END YR AND a match=1 AND product id=1 AND version id=200503 UNION SELECT product idversion id.NEWINDV ID,groupflag, REGION" attribute region attribute value,0 beginning.99999 ending FROM COHORT WHERE match-1 AND product id=1 AND version id=200503 UNION SELECT product id,version id.NEWINDVID, groupflag,'DX Baseline' attribute,dx attribute value,0 beginning.99999 ending FROM DXBASE WHERE match=1 AND product id=1 AND version id=200503 AND indexflag-0 and drhlinpat-l UNION SELECT product idversion id.NEWINDVID,groupflag,'PX Baseline' attribute, proc catcode attribute value,0 beginning.99999 ending FROMpxbase WHERE match=1 AND product id=1 AND version id=200503 AND indexflag-0 UNION SELECT product idversion idNEWINDV ID,groupflag,'RX Baseline',THERSPC attribute value,0 beginning.99999 ending FROMrxbase WHERE match-1 AND product id=1 AND version id=200503 AND indexflag FO) base (SELECT DISTINCT product id, version id, a.newindvid, a groupflag, "DX INPATIENT outcome class, CASE WHEN LENGTH (TRIM (a.dx emerg)) = 4 AND SUBSTR (dx emerg, 1, 1) <> 'E' THEN SUBSTR (dx emerg, 1, 3) ELSE TRIM (dx emerg) END outcome type, days in Study FROM dxemerg a WHERE a.match - 1 AND a product id=1 AND aversion id = 200503 AND siteflag = 1) emergent US 7,917,525 B2 71 72

WHERE base.newindvid Femergent.newindvid AND emergent.days in study BETWEEN base.beginning AND base.ending; commit; delete from BASE SCORE FILTER 3DX IP where product id=1 AND version id=200503; commit; INSERT /*-append/ INTO BASE SCORE FILTER 3DX IP SELECT 1 product id,200503 version id, Baseline' attribute," attribute value," "Attribute value desc",d.outcome class,d.outcome type,o.code desc "Outcome type desc",Xd,nd,xc,nc, CASE WHENxc0 THEN TO CHAR(d.xdd.nc/(d.xcd.ind),'9999990.9") ELSE'-- END rr, CASE WHEN (x-plowerx)>0 THEN (plower x/nd)/(x-plowerx)/nc) ELSE O END rrlower, CASE WHEN (x-pupper*x)>0 THEN (pupper*x/nd)/((x-pupper*x)/nc) ELSE 0 END rrupper, CASE WHEN cumul hyperd-1E-128 THEN score signTRUNC(LOGO10,cumul hyper)) ELSE -10*score sign END score FROM ( SELECT c.*,(-b high-i-SQRT(POWER(b high,2)-4*a*c high))/(2*a) pupper,(-b low SQRTOPOWER(b low,2)-4*a*c low))/(2*a) plower, Wu-4 Function.9biot(x primen prime,xd+xc,nd-inc) cumul hyper FROM ( SELECT b.*,1+da,-2*pd low-d b low,-2*pd high-d b high.POWER(pd low,2) c low,POWER(pd high.2) c high FROM ( SELECT a.*,xd+xcx,POWER(1.96.2)/(xd+xc) d(Xd-.5)/(xd+xc) pd low,(Xd+5)/(xd-xc) pd high FROM ( SELECT Xd Xc.outcome class,Xd Xc.outcome type,Xd Xc.Xd,nd nc.nd,Xd Xc.Xc,nd inc.nc CASE WHENxdd(xd+xc)*nd/(nd+nc) THEN xc ELSExd END x prime CASE WHEN xdd(xd+xc)*nd/(nd+nc) THEN inc ELSE ind END in prime CASE WHENxdd(xd+xc)*nd/(nd+nc) THEN-1 ELSE CASE WHEN xd= (xd+xc)*nd/(nd+nc) THEN 0 ELSE 1 END END score sign FROM ( SELECT COUNT (UNIQUE CASE WHEN groupflag-1 AND match-1 THENnewindvid ELSE NULL END) ind US 7,917,525 B2 73 74

COUNT (UNIQUE CASE WHEN groupflag-O AND match=1 THENnewindw id ELSE NULL END) inc FROM cohort WHERE product id=1 AND version id=200503 ) nd inc ( SELECT outcome class,outcome type COUNT (UNIQUE CASE WHEN groupflag-1 THENnewindvid ELSE NULL END) Xd COUNT (UNIQUE CASE WHEN groupflag-O THENnewindvid ELSE NULL END) XC FROM CLUSTER XD XC PREJOIN DX IP WHERE product id=1 AND version id-200503 GROUP BY outcome class,outcome type ) Xd Xc WHERE nd>3 AND inco-3 AND (xd-xc)>3 AND ((nc+nd)-(xc+xd))>3 ) a WHERE POWER(x prime - (xd-Exc) * (n prime/(nd+nc))) .2)>= 1.6 * (xd-xc)* (n prime/(nd+nc)) * (1-(n prime/(nd+nc))) ) c ) d, ALL CODE XWALKo WHERE SUBSTR(d.outcome class, 1,2)-O.code type(+) AND d.outcome type=o.code set() commit; delete from NOFILTER PRESCORE DX IP where product id=1 AND version id=200503; commit; INSERT /*-append'? INTO NOFILTER PRESCORE DX IP SELECT * FROM ( SELECT 1 product id.200503 version id, d.attributed.attribute value,a,code desc "Attribute value desc".d.outcome class,d.outcome type.o.code desc "Outcome type desc",d.Xd,d.nd.d.Xc,d.nc, CASE WHEN dxc50 THEN TO CHAR(d.xdd.nc/(d.xcd.ind),'9999990.9") ELSE'-- END rr, CASE WHEN (x-plowerx)>0 THEN (plower x/d.nd)/((x-plowerx)/d.nc) ELSE O END rrlower, US 7,917,525 B2 75 76

CASE WHEN (x-pupper*x)>0 THEN (pupper*x/d.nd)/(x-pupper*x)/d.nc) ELSE O END rrupper, CASE WHEN cumul hyper>1E-128 THEN score signTRUNC(LOG(10,d.cumul hyper)) ELSE-10*score sign END score FROM ( SELECT c.*,(-b high--SQRTOPOWER(b high,2)-4*a*c high))/(2*a) pupper,(-b low SQRT(POWER(b low,2)-4*a*c low))/(2*a) plower, Wu4 Function.9biot(x prime,n prime,xd+xc,nd-inc) cumul hyper FROM ( SELECT b.*,1+da,-2*pd low-db low,-2*pd high-d b high.POWER(pd low,2) c low,POWER(pd high.2) c high FROM ( SELECT a.*,xd+xcx,POWER(1.96,2)/(xd4xc) d,(Xd-.5)/(xdxc) pd low,(Xd+.5)/(xd+xc) pd high FROM ( SELECT nd no.attribute.nd inc.attribute value,Xd Xc.outcome class,xd Xc.outcome type,Xd Xc.X d.nd ne.nd,Xd Xc.Xc,nd inc.nc CASE WHEN xdd(xd-xc)*nd/(nd-inc) THEN xc ELSE xd END x prime CASE WHEN xdd(xd+xc)*nd/(nd+nc) THENnc ELSE ind END in prime CASE WHENxd5(xd+xc)*nd/(nd+nc) THEN-1 ELSE CASE WHENxd= (xd+xc)*nd/(nd+nc) THEN 0 ELSE 1 END END score sign FROM ( SELECT attribute,attribute value COUNT (UNIQUE CASE WHEN groupflag=1 THENnewindvid ELSE NULL END) ind COUNT (UNIQUE CASE WHEN groupflag-O THENnewindvid ELSE NULL END) C FROM CLUSTER BASELINE POPS WHERE product id=1 AND version id=200503 GROUP BY attribute, attribute value ) nd inc s ( SELECT attribute,attribute value,outcome class,outcome type COUNT (UNIQUE CASE WHEN groupflag=1 THENnewindvid ELSE NULL END) xd COUNT (UNIQUE CASE WHEN groupflag-O THENnewindvid ELSE NULL END) XC FROM CLUSTER XD XC PREJOIN DX IP WHERE product id=1 AND version id-200503 GROUP BY attribute,attribute value, outcome class, outcome type ) Xd Xc US 7,917,525 B2 77 78

WHERE nd nc.attribute-xd Xc.attribute AND nd nc.attribute value=xd Xc.attribute value ANDnd>3 ANDnc>3 AND (xd-xc)>3 AND ((nc-ind)-(xc+xd))>3 ) a WHERE POWER((x prime - (xd-xc) * (n prime/(nd+nc))) .2)>= 1.6 * (xd-xc)* (n prime/(nd+nc)) * (1-(n prime/(nd+nc))) ) b ) c ) d, ALL CODE XWALKaaLL CODE XWALK oBASE SCORE FILTER 3DX IP c WHERE SUBSTR(d.attribute, 1,2)-a.code type(+) AND d.attribute value=a.code set(+) AND SUBSTRCd.outcome class, 1,2)-O.code type(+) AND d.outcome type to.code set(+) AND d.outcome class- c.outcome class(+) AND d.outcome type-c.outcome type(t) AND c.version id(+)-200503 AND c. product id(+)-l AND ABS(CASE WHEN cumul hyperdl E-128 THEN score signTRUNC(LOG(10,cumul hyper)) ELSE -10*score sign END)>ABS(nvl(c.score,0)) AND ABS(CASE WHEN cumul hyperd1E-128 THEN score signTRUNC(LOG(10,cumul hyper)) ELSE-10*score sign END)>=3 UNION SELECT * FROM BASE SCORE FILTER 3DX IP WHERE product id=1 AND version id-200503 AND ABS(score)>=3 ) ORDER BY ABS(score) DESC,xd+xcDESC; commit; delete from CLUSTER XD XC PREJOIN RX where product id=1 AND version id=200503; commit, INSERT /*-- append/INTO CLUSTER XD XC PREJOIN rx SELECT DISTINCT base-product id,base.version id,base.newindvid,base, groupflag, base.attribute,base.attri bute value,emergent. Outcome class,emergent.outcome type,emergent.days in study FROM (SELECT product idversion id.b.NEWINDVID,b. groupflag."Days since first dispensing' attribute, "1 to 7 days' attribute value, 1 beginning,7 ending FROM COHORT b WHERE b.match=1 AND b. termdate-b.index.dt >=1 AND product id=1 AND version id=200503 UNION SELECT product id,version id,b.NEWINDV ID,b. groupflag."Days since first dispensing' attribute, "8 to 29 days' attribute value,8 beginning,29 ending FROM COHORT b WHERE b.match-1 AND b. termidate-b.indexdt >=8 AND product id=1 AND version id=200503 US 7,917,525 B2 79 80

UNION SELECT product idversion id.b.NEWINDV ID,b. groupflag."Days since first dispensing' attribute, "30 to 89 days' attribute value.30 beginning,89 ending FROM COHORT b WHERE b,match=1 AND b. termidate-bindex.dt >=30 AND product id=1 AND version id=200503 UNION SELECT product idversion id.b.NEWINDV ID,b. groupflag, Days since first dispensing' attribute, '90 to 365 days' attribute value,90 beginning,365 ending FROM COHORT b WHERE b.matchel AND b. termidate-bindex.dt >=90 AND product id=1 AND version id-200503 UNION SELECT product idversion id.NEWINDV ID, groupflag,'GENDER" attribute, sex attribute value,0 beginning.99999 ending FROM COHORT WHERE match=1 AND product id=1 AND version id=200503 UNION SELECT product id,version id.NEWINDV ID, groupflag,'AGE GROUP attribute,b.AGE GROUP name attribute value 0 beginning.99999 ending FROM COHORT a, AGE GROUP b WHERE a.AGE BETWEEN b.AGE GROUP START YRAND b.AGE GROUP END YR AND a match=1 AND product id=1 AND version id=200503 UNION SELECT product idversion id,NEWINDVID, groupflag,"REGION" attribute, region attribute value,0 beginning.99999 ending FROM COHORT WHERE match=1 AND product id=1 AND version id=200503 UNION SELECT product idversion id,NEWINDV ID, groupflag,'DX Baseline' attribute,dx attribute value,0 beginning.99999 ending FROMDXBASE WHERE match=1 AND product id=1 AND version id=200503 AND indexflag F0 and drhlinpat 1 UNION SELECT product idversion id.NEWINDV ID, groupflag,'PX Baseline' attribute, proc catcode attribute value 0 beginning.99999 ending FROM pxbase WHERE match=1 AND product id=1 AND version id=200503 AND indexflag-0 UNION SELECT product idversion idNEWINDV ID, groupflag,'RX Baseline',THERSPC attribute value 0 beginning.99999 ending FROM rxbase WHERE match=1 AND product id=1 AND version id-200503 AND indexflag F0) base (SELECT DISTINCT product id, version id, a.newindvid, a groupflag, "RX' outcome class, rX emerg Outcome type, days in study FROM rxemerg a WHERE amatch = 1 US 7,917,525 B2 81 82

AND a product id=1 AND aversion id = 200503 ) emergent WHERE base.newindvid Femergent.newindvid AND emergent, days in study BETWEEN base.beginning AND base.ending; commit; delete from BASE SCORE FILTER RX where product id=1 AND version id-200503; commit; INSERT /*-append/INTO BASE SCORE FILTER RX SELECT 1 product id.200503 version id, "Baseline' attribute," attribute value," "Attribute value desc",d, outcome class,d, outcome type,o.code desc "Outcome type desc",Xd,nd,Xc,nc, CASE WHENxc0 THEN TO CHAR(dxddinc/(d.xcid.nd),'9999990.9") ELSE'-- END rr, CASE WHEN (x-plowerx)>0 THEN (plower x/nd)/(x-plower'x)/nc) ELSE O END rrlower, CASE WHEN (x-pupper*x)>0 THEN (pupper*x/nd)/((x-pupper*x)/nc) ELSE O END rrupper, CASE WHEN cumul hyperdlE-128 THEN score sign:TRUNC(LOG(10,cumul hyper)) ELSE-10*score sign END score FROM ( SELECT c.*,(-b high--SQRTOPOWER(b high,2)-4*a*c high))/(2*a) pupper,(-b low SQRT(POWER(b low,2)-4*a*c low))/(2*a) plower, Wu4 Function9biot(x primen prime,xd+xc,nd-inc) cumul hyper FROM ( SELECT b.*, l+da,-2*pd low-db low,-2*pd high-d b high.POWER(pd low,2) c low,POWER(pd high,2) c high FROM ( SELECT a.*,xd+xcx,POWER(1.96,2)/(xd+xc) d,(Xd-.5)/(xd+xc) pd low,(Xd+5)/(xd+xc) pd high FROM ( SELECT xd Xc.outcome class,Xd Xc.outcome type.Xd Xc.xd,nd inc.nd,Xd Xc.xc,nd inc.nc CASE WHEN xdd(xd+xc)*nd/(nd+nc) THEN xc ELSExd ENDx prime CASE WHEN xdd(xd-xc)*nd/(nd+nc) THENnc ELSE nd ENDn prime CASE WHEN xdd(xd+xc)*nd/(nd+nc) THEN-1 ELSE CASE WHEN xd= (xd+xc)*nd/(nd-inc) THEN 0 ELSE 1 END END score sign FROM ( SELECT US 7,917,525 B2 83 84

COUNT (UNIQUE CASE WHEN groupflag–1 AND match=1 THENnewindvid ELSE NULL END) ind COUNT (UNIQUE CASE WHEN groupflag-O AND match=1 THENnewindvid ELSE NULL END) inc FROM cohort WHERE product id=1 AND version id=200503 ) nd inc s ( SELECT outcome class, outcome type COUNT (UNIQUE CASE WHEN groupflag-1 THENnewindvid ELSE NULL END) Xd COUNT (UNIQUE CASE WHEN groupflag-O THENnewindvid ELSE NULL END) XC FROM CLUSTER XD XC PREJOIN RX WHERE product id=1 AND version id=200503 GROUP BY outcome class,outcome type ) Xd Xc WHERE nd>3 AND inc>3 AND (xd-xc)>3 AND ((nc-ind)-(xc+xd))>3 ) a WHERE POWER(x prime - (xdhxc) * (n prime/(nd+nc))),2)>= 1.6 * (xd-xc)* (n prime/(nd+nc)) * (1-(n prime/(nd+nc))) ) c ) d, ALL CODE XWALK o WHERE SUBSTR(d.outcome class, 1,2)=o.code type(+) AND d.outcome typeFo.code set() commit; delete from NOFILTER PRESCORE RX where product id=1 AND version id=200503; commit; INSERT /*-append? INTO NOFILTER PRESCORE RX SELECT * FROM ( SELECT 1 product id,200503 version id, d.attributed.attribute value,a.code desc "Attribute value desc".d.outcome class,d.outcome type,o.code desc "Outcome type desc",d.Xd,d.nd,d.xc,d.nc, CASE WHEN dxc-0 THEN TO CHAR(dxdd.nc/(d.xcd.ind),'9999990.9") ELSE'-- END rr, US 7,917,525 B2 85 86

CASE WHEN (x-plowerx)>0 THEN (plower x/d.nd)/((x-plowerx)/d.nc) ELSE O END rrlower, CASE WHEN (x-pupper*x)>0 THEN (pupper*x/d.nd)/((x-pupper*x)/d.nc) ELSE O END rrupper, CASE WHEN cumul hyperd1E-128 THEN score signTRUNC(LOG(10,a,cumul hyper)) ELSE -10*score sign END score FROM ( SELECT c.*,(-b high-i-SQRT(POWER(b high,2)-4*a*c high))/(2*a) pupper,(-b low SQRT(POWER(b low,2)-4*a*c low))/(2*a) plower, Wu4 Function9biot(x primen prime,xd+xc,nd+nc) cumul hyper FROM ( SELECT b.*,1+da,-2*pd low-db low,-2*pd high-d b high, POWER(pd low,2) c low,POWER(pd high,2) c high FROM ( SELECT a.*,xd+xcx,POWER(1.96.2)/(xd+xc) d(Xd-.5)/(xd+xc) pd low,(Xdi,5)/(xd-xc) pd high FROM ( SELECT nd nc.attribute,nd inc.attribute value,Xd Xc.outcome class,Xd Xc.outcome type,Xd Xc.X d,nd inc.nd,Xd Xc.Xc,nd nc.nc CASE WHENxd>(xd+xc)*nd/(nd+nc) THEN xc ELSExd ENDx prime CASE WHEN xdd(xd+xc)*nd/(nd+nc) THENnc ELSE ind END in prime CASE WHENxdd(xd4xc)*nd/(nd+nc) THEN-1 ELSE CASE WHENxd= (xd+xc)*nd/(nd+nc) THEN 0 ELSE 1 END END score sign FROM ( SELECT attribute, attribute value COUNT (UNIQUE CASE WHEN groupflag-1 THENnewindvid ELSE NULL END) nd COUNT (UNIQUE CASE WHEN groupflag-O THENnewindvid ELSE NULL END) C FROM CLUSTER BASELINE POPS WHERE product id=1 AND version id=200503 GROUP BY attribute, attribute value ) nd inc s ( SELECT attribute,attribute value,outcome class, outcome type COUNT (UNIQUE CASE WHEN groupflag-1 THENnewindvid ELSE NULL END) xd COUNT (UNIQUE CASE WHEN groupflag-O THENnewindvid ELSE NULL END) XC FROM CLUSTER XD XC PREJOIN RX WHERE product id=1 AND version id-200503 US 7,917,525 B2 87 88

GROUP BY attribute, attribute value,outcome class, outcome type ) Xd Xc WHERE nd inc.attribute-xd Xc.attribute AND nd no.attribute value=Xd Xc.attribute value ANDnd 3 ANDnc-3 AND (xd+xc)>3 AND ((nc-ind)-(xc-xd))>3 ) a WHERE POWER(x prime - (xdhxc) * (n prime/(nd+nc))) .2)>= 1.6 * (xd+xc)* (n prime/(nd+nc)) * (1-(n prime/(nd+nc))) ) c ) d, ALL CODE XWALKaALL CODE XWALK oBASE SCORE FILTER RXc WHERE SUBSTR(d.attribute, 1,2)-a.code type(+) AND d.attribute value:ra.code set(+) AND SUBSTR(d.outcome class, 1,2)-o.code type(+) AND d.outcome type=o.code set(+) AND d.outcome class- c.outcome class(+) AND d.outcome type-c.outcome type(+) AND c.version id(+)-200503 AND c. product id(+)=1 AND ABS(CASE WHEN cumul hyper-1E-128 THEN score signTRUNC(LOG(10,cumul hyper)) ELSE -10*score sign END)>ABS(nvl(c.score,0)) AND ABS(CASE WHEN cumul hyper-1E-128 THEN score signiTRUNC(LOG(10,cumul hyper)) ELSE -10*score sign END)>=3 UNION SELECT * FROM BASE SCORE FILTER RX WHERE product id=1 AND version id=200503 AND ABS(score)>=3 ) ORDER BY ABS(score) DESC,xd+xcDESC; commit; SELECT DONE1' FROM DUAL; US 7,917,525 B2 89 90 Example 5 gender-M then individual id else null end) males count (unique case when gender-F then individual id else null Techniques of this disclosure address a number of chal end) females FROM ( Select individual id, gender from lenges in data manipulation encountered in both the medical attribute table where attribute=DIABETES space and other areas. INTERSECT DB forced to process gender here Select indi Data Acquisition vidual id, gender from attribute table where All Attributes and outcomes may be stored in a coded attribute="LIPITOR manner that allows complete flexibility to accommodate all MINUS DB forced to process gender here Select individu manner of terms (e.g., attributes like “Male”, “Valium” and al id, gender from attribute table where attribute=CABG’ “Age' can occupy the same attribute field and be processed 10 ) If, however, the SQL is coded with inline views as below, identically). SSTs provider rapid working sets of data, reduc the database does not have to match 'gender when determin ing physical I/OS by presenting the data in a "rich' format ing valid individuals i.e. only using “gender in the final where each database block holds many rows of the desired segmentation case statements. The predicate “WHERE data. For cluster investigations, SSTs can be created that hold iv 1.inidividual id=iv2.individual id' is more quickly doing all possible attribute/outcome permutation sets. Sub-clusters 15 the (above) INTERSECT and the “WHERE NOT EXISTS .. (double attributes) and syndromes (double outcomes) can . is more quickly doing the (above) MINUS operation. also be processed if desired. SELECT COUNT(UNIQUE a.individual id) Set based Data Reduction (MINUS/INTERSECT) count(unique case when Logarithms may be used in both the generation of the initial gender-M then individual id else null end) males count recursion point (logarithmic factorial table) and in the recur (unique case when gender-F then individual id else null ring of successive terms (to avoid over/under flow). Attribute/ end) females FROM ( Selectiv1.individual idiv 1.gender outcome sets are quickly mingled and counted using the FROM (Select individual id, gender from attribute table pre-permeated tables. Uninteresting cases may be filtered out where attribute=DIABETES)iv1.(Select individual id, prior to processing if desired. gender from attribute table where attribute="LIPITOR) iv2 Data Presentation 25 WHERE iv 1.inidividual id=iv2.individual id) iv 12 Scores sets can be presented to the user in order of signal WHERE NOT EXISTSSELECT *X FROM attribute table z strength. Geographic charts can be used to provider “birds WHERE Z.attribute=CABG and eye” views of disease clusters, provider clusters, and the Z.individual id=iv12.individual id trending of both. )

Problem (in order of explanation) Example Solution Attribute mixed forms (Male and DX = 250) Create attribute table that allow co mingling of all possible attribute types Physical Disk I/Os required for data Packed tables containing only the fields retrieval desired (DB blocks are “rich' and field efficient) Hash Join of attribute? outcomes Pre-permeated (and packed) attribute? outcome table sets Large number of permutation sets Elimination of uninteresting sets prior to processing (per Walker input) Hypergeometric: Numeric Overfunder flow Work entirely in logarithms Hypergeometric: Factorial generation Pre-generate, cached table of all factorials (values held in LOG form, 4 cached I/Os to create initial recursion point)

Example 6 In practice the two approaches might appear as below. The inline view version gives the same results as the set based Alternate Coding: Inline Views vs. Set Based SQL and, in this example, returns the data four times faster 50 than the set based SQL. Performance is likely to be even more Though SQL set based operations (MINUS, UNION, pronounced between the two approaches if more piggyback INTERSECT) are fast, it is possible to realize even faster variables are in play. results if the set variables are limited to only those needed for the set based operation. For example, if one is interested in the SELECT COUNT(UNIQUE a.individual id) Set based CaSC when gender distribution of patients with diabetes who are also on 55 (MINUS/INTERSECT) count(unique LIPITOR but have not had coronary bypass surgery in the last gender-M then individual id else null end) males count year, one can construct the SQL as below. A key point is that, (unique case when gender-F then individual id else null in such an example, the “individual id' is all that is required end) females.count(unique case when gender not in (F.M.) for the set intersection whilegender is a piggyback variable defined as useful in segmenting the counts (e.g. male and then individual id else null end) unknown FROM (SELECT female below). When the SQL is coded in this way, the 60 a.INDIVIDUAL ID, a.DOB, a.GENDER, a.ZIP5, SUBSTR database is forced to hash both patient id and gender when (a.ZIP5, 1, 3) ZIP3 FROM INDIVIDUAL ID a WHERE checking for individuals in both sets. Gender is redundant and not needed for this particular comparison operation. After the a. CODE KEY IN (DX5053481, DX5053484.81, comparisons are made, gender is need to count individuals in “DX505348484.81, DX505348484.91, DX505348491, the various segments. 65 DX 505348.511, DX5053485.1481, DX50534851491, SELECT COUNT(UNIQUE a.individual id) Set based DX505348551, DX505348561, DX505348571, (MINUS/INTERSECT) count(unique case when DX50535 1531, DX5 15355501, DX5 15450481, DX5 US 7,917,525 B2 91 92 1545048491, DX 51545452491, DX535656491, PX515 15349521, “PX51515349541, PX51515349551, “DX545256481, 'DX54525648481, DX54525648491, PX51515349561, “PX5 15153495 71, PX51515350491, “DX 545256485O1 PX51515350501, “PX51515350511, “PX51515351511, DX545256485 11, DX54525648521, DX555553481, PX51515351521, PX51515351531, 'PX51515351541) DX555553491, DX864956481, D X865555491, 5 AND Z.individual id=d.individual id); PX655353484.81, PX65535348491, “PX65535348501, With the benefit of the present disclosure, those having PX65535348.511, “PX65535348521, “PX65535348531, ordinary skill in the art will comprehend that techniques PX65535348541, PX6553534.8551, “PX655353485 61, claimed here may be modified and applied to a number of PX7148494.8561, PX71484.948,571, “PX83574952481, additional, different applications, achieving the same or a PX83574952491, “PX83575253531, “PX83575254481, 10 similar result. The claims cover all such modifications that “PX8357525.4531, 'PX8952545 1481) fall within the scope and spirit of this disclosure. INTERSECT SELECT a.INDIVIDUAL ID, a.DOB, a.GENDER, a.ZIP5, REFERENCES SUBSTR(a.ZIP5, 1, 3) ZIP3 FROM INDIVIDUAL ID a WHERE a.CODE KEY IN (RX-7101552337652, RX 15 Each of the following references is hereby incorporated by 710155341, RX-7101554037652, RX-7101562337652, reference in its entirety: RX-7101564037652, RX-7101572337652, RX Pat. No. 6,826,536 710157731, RX-7101582337652, RX-710158731) . Pat. No. 6,732,113 MINUS . Pat. No. 6,370,511 SELECT a.INDIVIDUAL ID, a.DOB, a.GENDER, a.ZIP5, . Pat. No. 6,253,186 SUBSTR(a.ZIP5, 1, 3) ZIP3 FROM INDIVIDUAL ID a . Pat. No. 6,223, 164 WHERE a. CODE KEY IN (PX48485354541, . Pat. No. 6,151,581 PX51515349481, PX51515349491, “PX5151534950, . Pat. No. 6,014,631 PX515153495 11, PX51515349521, PX51515349541, . Pat. No. 5,970,464 PX51515349551, PX51515349561, “PX51515349571, 25 . Pat. No. 5,970,463 “PX51515350491, PX51515350501, “PX51515350511, . Pat. No. 5,956,689 “PX515153515 11, PX51515351521, PX51515351531, Pat. No. 5,835,897 “PX51515351541) Pat. No. 5,557.514 ) a . Pat. No. 5,191,522 SELECT COUNT(UNIQUE individual id) Inline View 30 . Patent Publication N 2005O234740 based count(unique case when gender-M then individu . Patent Publication N 2005O228808 al id else null end) males count(unique case when . Patent Publication N 2005O228593 gender=F then individual id else null end) females count . Patent Publication N 2005O2O3776 (unique case when gender not in (F.M.) then individual id . Patent Publication N 2005O114334 else null end) unknown FROM (SELECT a.individual id, 35 . Patent Publication N 2005OO71189 agender FROM (SELECT a.INDIVIDUAL ID, a.DOB, . Patent Publication N 2005OO 10443 a.GENDER, a ZIP5, SUBSTR(a.ZIP5, 1, 3) ZIP3 FROM . Patent Publication N 2004O260577 INDIVIDUAL ID a WHERE a.CODE KEY IN . Patent Publication N 2004O24.9677 (DX5053481, 'DX505348481, DX505348484.81, . Patent Publication N 200402366O1 “DX505348484.91, DX505348491, DX 50534851 1, DX5053485.1481, DX50534851491, DX505348551, 40 . Patent Publication N 2004O210457 DX505348561, DX505348571, DX505351531, . Patent Publication N 2004O172293 DX515355501, DX515450481, DX51545048491, DX . Patent Publication N 2004OO93240 51545452491, DX535656491, DX545256481, . Patent Publication N 20040078236 “DX545256484.81, DX54525648491, DX 54525648501 . Patent Publication N 2004OO78220 DX54525648511, DX54525648521, DX555553481, 45 . Patent Publication N 2004OO78216 DX555553491, DX864956481, D X865555491, . Patent Publication N 2004OO44.654 PX655353484.81, PX65535348491, “PX65535348501, . Patent Publication N 2003OO65740 PX65535348.511, “PX65535348521, “PX65535348531, . Patent Publication N 2003OO46 114 PX65535348541, PX6553534.8551, “PX655353485 61, . Patent Publication N 2002O173990 PX7148494.8561, PX71484.948,571, “PX83574952481, 50 . Patent Publication N 2002O165762 PX83574952491, “PX83575253531, “PX83575254481, . Patent Publication N 2002O1383.06 “PX8357525.4531, 'PX8952545.1481) . Patent Publication N 2002OO77853 ) a, . Patent Publication N 2002OOO2474 ( Patent Publication No 2001OO34631 SELECT a.INDIVIDUAL ID, a.DOB, a.GENDER, a.ZIP5, 55 Walker, Alec; Detection Routine for Interesting Subgroups, SUBSTR(a.ZIP5, 1, 3) ZIP3 FROM INDIVIDUAL ID a Data Mining for Aperio, May 16, 2005 WHERE a.CODE KEY IN (RX-7101552337652, RX Wu, Trong: An Accurate Computation of the Hypergeometric 710155341, RX-7101554037652, RX-7101562337652, Distribution Function, ACM Transactions on Mathemati RX-7101564037652, RX-7101572337652, RX cal Software, Vol 19, No. 1, March 1993, Pages 33–43 710157731, RX-7101582337652, RX-710158731) 60 The invention claimed is: 1. A non-transitory computer readable medium comprising WHERE a.individual id=b.individual id computer executable instructions embodied therein that when ) d WHERE NOT EXISTS ( executed carry out a method comprising: SELECT X FROM INDIVIDUAL ID Z WHERE Z. searching administrative healthcare claims data; and CODE KEY IN (PX48485354541, PX5 65 comparing one Subset of the administrative healthcare 1515349481, PX5 1515349491, PX5 1515349501, PX5 claims data against a plurality of other Subsets of the 15153495 11 administrative healthcare claims data; and US 7,917,525 B2 93 94 calculating a hypergeometric statistical result based on the 10. The non-transitory computer readable medium of claim comparing step using one or more pre-generated facto 1, further comprising using the hypergeometric statistical rial tables, the factorial tables comprising logarithmic result to recruit a medical professional for use as an expert entries. witness for litigation. 2. The non-transitory computer readable medium of claim 5 11. A non-transitory computer readable medium compris 1, where calculating comprises one or more calculations ing computer executable instructions embodied therein that using the logarithmic entries followed by one or more expo when executed carry out a method comprising: nential operations. pre-generating one or more specialized searching tables 3. The non-transitory computer readable medium of claim using administrative healthcare claims data; 1, further comprising using the hypergeometric statistical 10 pre-generating one or more factorial tables, the factorial result to detect medical-related fraud. tables comprising logarithmic entries; 4. The non-transitory computer readable medium of claim searching the specialized searching tables; 3, where the one Subset comprises medical coding data asso identifying, through the searching step, one or more ciated with a first physician and the plurality of other subsets records within the administrative healthcare claims data comprises medical coding data associated with a plurality of 15 that matches one or more search criteria; other physicians. comparing the one or more records against a plurality of 5. The non-transitory computer readable medium of claim other records of the administrative healthcare claims 4, where the plurality of other physicians are selected to be data; within the same specialty as the first physician. calculating a hypergeometric statistical result based on the 6. The non-transitory computer readable medium of claim comparing step using the one or more factorial tables; 5, further comprising generating a customized report com and paring the first physician Versus the plurality of other physi generating a customized report using the one or more cians. records and the statistical result. 7. The non-transitory computer readable medium of claim 12. The non-transitory computer readable medium of claim 6, the customized report comprising a graph of utilization 25 11, where the one or more search criteria comprise one or percentage versus medical code for the first physician and the more exclusion or inclusion criteria selected using a Venn plurality of other physicians. diagram. 8. The non-transitory computer readable medium of claim 13. The non-transitory computer readable medium of claim 1, further comprising using the hypergeometric statistical 11, where calculating comprises one or more calculations result to rate one physician versus other physicians. 30 using the logarithmic entries followed by one or more expo 9. The non-transitory computer readable medium of claim nential operations. 1, further comprising using the hypergeometric statistical result to identify potential subjects for a clinical trial.