An Ontology-Based Approach for Facilitating Information Retrieval from Disparate Sources: Patent System as an Exemplar

Kincho H. Law
Professor of Civil and Environmental Engineering
Engineering Informatics Group
Stanford University

Collaborators:
Jay P. Kesan, Professor, College of Law, UIUC
Siddharth Taduri (Former Student), Stanford University
Gloria Lau, Consulting Assoc. Professor, Stanford University

Ontology Summit, March 10, 2016

Ref: S. Taduri, Information Retrieval Across Multiple Information Sources Using Knowledge-Based Approach, Engineering Degree Thesis, Stanford University, March 2012.

Motivation

• Patents: Can we obtain all relevant (validity, enforceability, and infringement) information related to patent(s) in a particular sector/category/market segment and analyze that information?
• In the patent context:
  • What are the issued patents in a given space?
  • What is the legal scope of protection for same/similar patents?
  • Who are the competitors?
  • Have any same/similar patents been challenged in court?
  • Are there any relevant scientific publications, prior court decisions, laws and regulations that could be used to challenge and invalidate some patent claims?
• Focus: Biomedical Patents
• Other Similar Problems: integrating administrative agencies, courts, technical/scientific literature, and technical product literature in a host of law and science areas (Pharmaceuticals; Biofuels; …)

Problem Statement

[Figure: siloed information sources: Issued Patents and Applications; File Wrappers; Court Cases; Technical Publications; Regulations and Laws]

Patent validity and infringement/enforcement questions involve analysis of documents in various domains: patents, USPTO file wrappers, court documents, scientific/technical publications, and technical product literature.
These documents are owned by disparate public (government) and private sectors.
The information is often available online, but siloed into several diverse information sources.
Today, the analysis is done manually and poorly by companies offering various patent research and strategy services.

Use-Case: (Repository)

• Synthetic production of the hormone has made it possible to treat diseases such as
• Core patents: U.S. Patents 5,621,080, 5,756,349, 5,955,422, 5,547,933, 5,618,698
• 135 directly related patents and over 3000 related publications
• Around 30 court cases, patent litigation involving major companies including Amgen, Hoechst Marion Roussel, Inc., Transkaryotic Therapies, Inc.
• Over 162,000 full-text scientific publications from 49 prominent journals in biomedicine from the TREC 2007 Genome Dataset (http://ir.ohsu.edu/genomics/2007protocol.html)
• Comprehensive domain knowledge available

Domain Terminology is Everywhere

Excerpt from scientific publication:
Regional variability in the incidence of end-stage renal disease: an epidemiological approach.
…. Regional variability in the incidence of end-stage renal disease (ESRD) in Austria is reported. Our aim was …. low rates in the state of Tyrol. …. ESRD incidence data were obtained from …. Between 1995 and 1999, 4811 new cases of ESRD were recorded; in the state of Tyrol (T) …. incidence of ESRD patients with type 2 diabetes mellitus …. the difference in the overall ESRD incidence …. prevalence of DM, a highly significant correlation was found between ESRD incidence and DM …. variability in the ESRD incidence in Austria is explained mainly by regional differences in DM-2. Data from similar studies …. allocation for ESRD …

Excerpt from U.S. Patent# 5,441,868:
Title: Production of recombinant erythropoietin
Abstract: Disclosed are novel polypeptides possessing part or all of the primary structural conformation and one or more of the biological properties of mammalian erythropoietin ("EPO") which are characterized in preferred forms by being the product of procaryotic or eucaryotic host expression of an exogenous DNA sequence. Illustratively, genomic DNA, cDNA and manufactured DNA sequences coding for part or all of the sequence of amino acid residues of EPO or for analogs thereof are incorporated into autonomously replicating plasmid or viral vectors employed to transform or transfect suitable procaryotic or eucaryotic host cells such as bacteria, yeast or vertebrate cells in culture. Upon isolation from culture media or cellular lysates or fragments, products of expression of the …

Excerpt from court case: Amgen, Inc. v/s Chugai Pharm.
…. On June 30, 1987, the United States Patent and Trademark Office (PTO) issued to Dr. Rodney Hewick U.S. Patent 4,677,195, entitled "Method for the Purification of Erythropoietin and Erythropoietin Compositions" (the '195 patent). The patent claims both homogeneous EPO and compositions thereof and a method for purifying human EPO using reverse phase high performance liquid chromatography. The method claims are not before us.

Problem Statement

[Figure: Knowledge Source 1 (Patent System Ontology) and Knowledge Source 2 (Bio Ontology, specific technical domain) support the integration of Issued Patents and Applications, File Wrappers, Court Cases, Technical Publications, and Regulations and Laws]

Sources are diverse in structure, formats, semantics and syntax

How to retrieve patent information in a particular technological space?
A knowledge-driven (ontology-based) approach:
• Knowledge of scientific/technical domain
• Knowledge of patent system domain

Why Ontology?

• An ontology is an explicit description of a domain:
  • concepts
  • properties and attributes of concepts
  • constraints on properties and attributes
• An ontology defines
  • a common vocabulary
  • a shared understanding

Domain (Bio) Ontologies

• Bio ontologies serve as standards for terminology in the biomedical (science) domain

(Ref: Bioportal.bioontology.org, accessed March 2012)

Using Concept Hierarchy to Determine Relevancy

[Figure: concept hierarchy used to determine relevancy]
Doc 1: … erythropoietin … colony stimulating factor …
Doc 2: … EPO … growth factor …
Bio Ontology hierarchy: Hematopoietic Growth Factor > Colony Stimulating Factor > Erythropoietin (EPO)
No direct similarity between Doc 1 and Doc 2; use of super class concept for relevancy

• Direct term-based matching cannot relate the two documents
• Bio-ontology reveals that EPO and erythropoietin are synonymous
• Class hierarchy provides concepts (such as colony stimulating factor) useful for determining relevance between documents (with an appropriate weighting scheme)

Expanded Query (with domain ontology)

Original Term: Erythropoietin

Synonyms: Erythropoietin, Recombinant Erythropoietin, erythropoietin receptor binding, Hematopoietin, Recombinant EPO, Erythrocyte Colony Stimulating Factor, Epoetin, EPO …

Children: Darbepoetin Alfa, Epoetin Beta …

Parents: Colony Stimulating Factors, cytokine receptor binding, recombinant hematopoietic growth factors…

Grand-Parents: hematopoietic growth factor, receptor binding, recombinant growth factor …

• An appropriate ranking function is applied to balance the more general terms. Heuristically, we assign a higher weight to synonyms, and progressively lower weights as we move away from the concept node.
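The weighting heuristic can be illustrated with a minimal sketch. It assumes a Lucene-style query syntax in which `^` attaches a boost to a group of terms. The weight values and the helper `build_expanded_query` are hypothetical and are not taken from the original system; the terms come from the expanded-query example.

```python
from typing import Dict, List, Optional

# Hypothetical weights: highest for synonyms, decreasing with distance
# from the concept node in the ontology hierarchy.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "synonyms": 0.9,
    "children": 0.7,
    "parents": 0.5,
    "grandparents": 0.3,
}


def quote(term: str) -> str:
    """Wrap a term in double quotes so multi-word terms act as phrases."""
    return '"' + term.replace('"', '\\"') + '"'


def build_expanded_query(
    original: str,
    related: Dict[str, List[str]],
    weights: Optional[Dict[str, float]] = None,
) -> str:
    """Combine the original term with weighted groups of related terms."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    clauses = [quote(original)]
    for relation, terms in related.items():
        if not terms:
            continue  # skip relations with no terms in the ontology
        group = " OR ".join(quote(t) for t in terms)
        clauses.append(f"({group})^{weights[relation]}")
    return " OR ".join(clauses)


if __name__ == "__main__":
    related_terms = {
        "synonyms": ["EPO", "Epoetin", "Hematopoietin"],
        "children": ["Epoetin Beta"],
        "parents": ["Colony Stimulating Factors"],
        "grandparents": ["hematopoietic growth factor"],
    }
    print(build_expanded_query("Erythropoietin", related_terms))
```

Grouping each relation under a single boost keeps the query short; a per-term weight would also be possible if the ranking function needed finer control.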

• Resulting Query: “original term” OR [synonyms]^weight OR [children]^weight OR …

Patent System Ontology (patent documents, court cases, file wrappers)

Competency Questions

Patent Domain:
• Return all patent documents which contain the phrase ‘recombinant erythropoietin receptor’ in the claims
• Return all the patent documents which contain the phrase ‘recombinant erythropoietin receptor’, have at least 3 claims, were issued before 02-02-1999, and are assigned to Genetics Inc.

Court Case Domain:
• Return all court cases which contain the term ‘erythropoietin’
• Return all court cases which involve the company Amgen Inc. either as the plaintiff or defendant, and from the District Court of Massachusetts

Multi-domain:
• Return all patents which contain the term ‘erythropoietin’ in their claims and which are involved in at least one court litigation.
• Return all court cases with the term ‘erythropoietin’. From these court cases, return the patents involved. From these patents, follow the backward and forward citations to identify more important patents.

Patent Documents

• More than 8 million U.S. patents (2.2 million in force today)
• In 2009, 485,312 patent applications were filed
• Information is contained in various sections of the documents; a full-text search alone is not sufficient, and other metrics such as classification and citations need to be considered
• Documents are available in HTML format and can be easily parsed

Patent System Ontology

Conceptual View of Patent Documents

Court Cases

• Court cases are not very well structured
• Comparatively more difficult to parse information
• PACER (public access to court electronic records), the database system for U.S. Courts, requires one to know judicial district, party/assignee name, case number/type, etc., which may not be known.
• Bloomberg Law is better but has limitations.

Excerpt:
927 F.2d 1200 (1991)
AMGEN, INC., Plaintiff/Cross-Appellant, v. CHUGAI PHARMACEUTICAL CO., LTD., and Genetics Institute, Inc., Defendants-Appellants.
Nos. 90-1273, 90-1275.
United States Court of Appeals, Federal Circuit.
March 5, 1991.
Suggestion for Rehearing Declined May 20, 1991.
…
Before MARKEY, LOURIE and CLEVENGER, Circuit Judges.
…
THE PATENTS
On June 30, 1987, the United States Patent and Trademark Office (PTO) issued to Dr. Rodney Hewick U.S. Patent 4,677,195, entitled "Method for the Purification of Erythropoietin and Erythropoietin Compositions" (the '195 patent). The patent claims both homogeneous EPO and compositions thereof and a method for purifying human EPO using reverse phase high performance liquid chromatography. The method claims are not before us. The relevant claims of the '195 patent are:
1. Homogeneous erythropoietin characterized by a molecular weight of about 34,000 daltons on SDS PAGE, movement as a single peak on reverse phase high performance liquid chromatography and a specific activity of at least 160,000 IU per absorbance unit at 280 nanometers.
* * * * * *
3. A pharmaceutical composition for the treatment of anemia comprising a therapeutically effective amount of the homogeneous erythropoietin of claim 1 in a pharmaceutically acceptable vehicle.
4. Homogeneous erythropoietin characterized by a molecular weight of about 34,000 daltons on SDS PAGE, movement as a single peak on reverse phase high performance liquid chromatography and a specific activity of at least about 160,000 IU per absorbance unit at 280 nanometers.

Patent System Ontology

Conceptual View of Court Cases

Patent File Wrappers

[Figure: file wrapper contents: Events, Text]

• File wrappers are folders which contain all documents exchanged between a patent applicant and the patent office
• Every file wrapper is different! Limited standardized ordering of events
• The relevant information is embedded within lots of irrelevant text
• File wrappers are available as images, requiring additional processing in order to extract the text

Patent System Ontology

Events Contained in a File Wrapper

Cross-Referencing

• There are many aspects of these documents which can be utilized, especially the cross-referencing between the documents

COURT CASE:
314 F.3d 1313 (2003)
AMGEN INC., Plaintiff-Cross Appellant v. HOECHST MARION ROUSSEL, INC. (now known as Aventis Pharmaceuticals, Inc.) and Transkaryotic Therapies, Inc., Defendants-Appellants.
…
Plaintiff-Cross Appellant Amgen Inc. is the owner of numerous patents directed to the production of erythropoietin ("EPO"), … alleging that TKT's Investigational New Drug Application ("INDA") infringed United States Patent Nos. 5,547,933; 5,618,698; and 5,621,080. The complaint was amended in October 1999 to include United States Patent Nos. 5,756,349 and 5,955,422, which issued after suit was filed.

REGULATIONS: U.S. Code Title 35, C.F.R. Title 37, M.P.E.P. …

Publication Database

FILE WRAPPER: U.S. Patent 5,955,422
…
Claims 61-63 are rejected under 35 U.S.C. § 103 as being unpatentable over any one of Miyake et al., 1977 (R) …
In accordance with the provisions of 37 C.F.R. §1.607, the present continuation is being filed for the purpose of …

PATENT:
United States Patent 5,955,422, September 21, 1999
Production of erythropoietin
Abstract: Disclosed are novel polypeptides possessing part or all of the primary structural conformation and one or more of the biological properties of mammalian erythropoietin ("EPO") …
Inventors: Lin; Fu-Kuen (Thousand Oaks, CA)
Assignee: Kirin-Amgen, Inc. (Thousand Oaks, CA)
Appl. No.: 08/100,197
Filed: August 2, 1993.

BIOPORTAL: DOMAIN KNOWLEDGE
Erythropoietin, Epoetin, EPO …

Patent System Ontology

Top Level Ontology for the Patent System

Parsing the Document to Instantiate the Ontology

• Documents are automatically parsed using a regular-expression-based script
• Separate scripts are needed for each document domain
• Ontology is automatically instantiated using the Protégé-OWL API
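A rough sketch of the regular-expression step for the court case domain follows. It is written in Python for brevity and emits plain (subject, property, value) tuples as a stand-in for instantiation through the Protégé-OWL API. The patterns, the function `parse_court_case`, and the identifier `Case1` are illustrative assumptions; the property names `hasPlaintiff`, `hasDefendant` and `patentsInvolved` follow the slides, and the sample text is taken from the Amgen v. Chugai excerpt.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Assumed caption layout: "<plaintiff>, Plaintiff..., v. <defendants>, Defendants..."
CAPTION_RE = re.compile(
    r"(?P<plaintiff>.+?),\s*Plaintiff[^,]*,\s*v\.\s*(?P<defendants>.+?),\s*Defendants"
)
# Patent numbers written as "U.S. Patent 4,677,195"
PATENT_RE = re.compile(r"U\.S\.\s+Patent\s+(\d{1,2},\d{3},\d{3})")


def parse_court_case(case_id: str, text: str) -> List[Triple]:
    """Extract parties and cited patents from court case text as triples."""
    triples: List[Triple] = [(case_id, "type", "CourtCase")]
    caption = CAPTION_RE.search(text)
    if caption:
        triples.append((case_id, "hasPlaintiff", caption.group("plaintiff").strip()))
        # Multiple defendants are joined by "and" in the caption.
        for name in re.split(r",?\s+and\s+", caption.group("defendants")):
            triples.append((case_id, "hasDefendant", name.strip()))
    for number in PATENT_RE.findall(text):
        # Normalize "4,677,195" to "US4677195".
        triples.append((case_id, "patentsInvolved", "US" + number.replace(",", "")))
    return triples


SAMPLE = (
    "AMGEN, INC., Plaintiff/Cross-Appellant, v. CHUGAI PHARMACEUTICAL CO., LTD., "
    "and Genetics Institute, Inc., Defendants-Appellants.\n"
    "On June 30, 1987, the United States Patent and Trademark Office (PTO) issued "
    "to Dr. Rodney Hewick U.S. Patent 4,677,195, entitled ..."
)

if __name__ == "__main__":
    for triple in parse_court_case("Case1", SAMPLE):
        print(triple)
```

A pattern this simple only handles captions that follow the assumed layout, which is one reason a separate script is needed for each document domain.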

[Figure: instantiated example: Case 1 hasPlaintiff Amgen ..; Case 1 hasDefendant Chugai ..]

Patent System Ontology

• Established semantics allow us to reason over the classes, properties and instances to infer new facts
• Documents can be connected to form a network similar to citation networks. Beyond citations, the network carries other metadata such as co-inventorship, technological classification and other cross-domain relevancy metrics between documents (e.g., patents occurring in court cases)
• Rules can be developed to perform additional inferences over the knowledge

Information Retrieval Framework

Prototype System Implementation

[Figure: prototype system architecture; labeled components include (Virtuoso) and (SWRL)]

Summary of the Implementation

• Jena libraries and triple store integration for modifying the patent system ontology through new constructs, cross-references, or rules
• Solr and Lucene libraries to create, update, and query the text indexes
• Generic API for integration with sources of domain knowledge such as BioPortal
• Automatic query generation, abstracting the syntactic details from the user
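Automatic query generation might look roughly like the sketch below, which assembles a SPARQL string from a keyword and an optional assignee so that the user never writes triple patterns by hand. The function `build_patent_query` and its parameters are hypothetical. The property names (`ont:hasAbstract`, `ont:resourceVal`, `ont:hasAssignee`, `ont:hasInventor`) and the `bif:contains` text predicate mirror the example query shown later in the slides, and PREFIX declarations are omitted as they are there.

```python
import re
from typing import Optional


def build_patent_query(keyword: str, assignee: Optional[str] = None, limit: int = 10) -> str:
    """Return a SPARQL query for patents whose abstract contains `keyword`.

    `assignee`, if given, is the local name of an ontology instance
    (for example "Amgen_Inc").
    """
    if '"' in keyword:
        raise ValueError("keyword must not contain double quotes")
    if assignee is not None and not re.fullmatch(r"[A-Za-z0-9_]+", assignee):
        raise ValueError("assignee must be a simple local name")

    patterns = [
        "?patent a ont:Patent .",
        "?patent ont:hasAbstract ?abs .",
        "?abs ont:resourceVal ?val .",
        f'?val bif:contains "{keyword}" .',
    ]
    if assignee is not None:
        patterns.append(f"?patent ont:hasAssignee ont:{assignee} .")
    patterns.append("?patent ont:hasInventor ?inventor .")

    body = "\n".join("  " + p for p in patterns)
    return f"SELECT DISTINCT ?patent ?inventor\nWHERE {{\n{body}\n}} LIMIT {limit}"


if __name__ == "__main__":
    print(build_patent_query("erythropoietin", assignee="Amgen_Inc"))
```

Validating the inputs before interpolation is a deliberate choice here, since the query is built by string concatenation rather than through a parameterized query API.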

Example Query

Expressing Competency Questions in SPARQL

Competency Question: Return all court cases which involve the company Amgen Inc. as the plaintiff and from the District Court of Massachusetts
SPARQL Query:
  SELECT ?case WHERE {
    ?case type CourtCase .
    ?case hasPlaintiff "Amgen Inc." .
    ?case hasCourt "District Court…"
  }

Competency Question: Return all patents which contain the phrase ‘recombinant erythropoietin receptor’ in the claims and IPC class “A61K”
SPARQL Query:
  SELECT ?pat WHERE {
    ?pat type Patent .
    ?pat hasClaim ?clm .
    ?clm hasTerm "recombinant …" .
    ?pat hasIPCClass "A61K" .
  }

Example Query

• Return all the patent documents which contain the keyword “erythropoietin” in the Claims and Assigned to “Amgen_Inc”.
• SPARQL Query:

  SELECT DISTINCT ?patent ?inventor
  FROM <…>
  WHERE {
    ?patent a ont:Patent .
    ?patent ont:hasAbstract ?abs .
    ?abs ont:resourceVal ?val .
    ?val bif:contains "erythropoietin" .
    ?patent ont:hasAssignee ont:Amgen_Inc .
    ?patent ont:hasInventor ?inventor
  } LIMIT 10

Results:
Patent  | Inventor
5856298 | Strickland_Thomas_W
5885574 | Elliott_Steven_G
7304150 | Egrie_Joan_C
7304150 | Elliott_Steven_G
7304150 | Browne_Jeffrey_K
7304150 | Sitney_Karen_C
7217689 | Elliott_Steven_G
7217689 | Byrne_Thomas_E
6319499 | Elliott_Steven_G
5756349 | Lin_Fu-Kuen

Example Query

SPARQL Query to Retrieve Information Related to “erythropoietin”

SELECT ?party ?pat ?class ?inventor ?assignee WHERE {
  # Retrieve all cases related to erythropoietin
  ?case a CourtCase .
  ?case hasBody ?body .
  FILTER REGEX (?body, "erythropoietin", "i")
  # Retrieve plaintiffs and defendants
  { {?case hasPlaintiff ?party .} UNION {?case hasDefendant ?party .} } .
  # Retrieve involved patents, US classification, inventors, assignees
  ?case patentsInvolved ?pat .
  ?pat hasUSClass ?class .
  ?pat hasInventor ?inventor .
  ?pat hasAssignee ?assignee .
} LIMIT 4

Example Query

Summary of Extracted Information

Plaintiffs/Defendants | Patents Involved in Cases | US Class | Inventor | Assignee
Amgen Inc. | 5,955,422 | 514/8 | Lin, Fu-Kuen | Kirin-Amgen, Inc.
Chugai Pharmaceuticals | 5,547,933 | 530/350 | Hewick, Rodney, M. | Amgen, Inc.
Hoechst Marion Roussel | 5,621,080 | 536/23.51 | Seehra, Jasbir, S. | Kirin-Amgen, Inc.
Genetics Inc. | 5,618,698 | 435/325 | Seehra, Jasbir, S. | Genetics Institute, Inc.

Example Query

SPARQL Query to Retrieve Information Related to U.S. Patent 5,955,422

SELECT ?pat1 ?pat2 ?case ?pub ?inv ?assg ?class WHERE {
  # Query court case
  ?case a CourtCase .
  ?case patentsInvolved US5955422 .
  ?case patentsInvolved ?pat1 .
  # Retrieve patents
  {?pat1 hasCitation ?pat2 .}
  {?pat1 hasInventor ?inv .}
  {?pat1 hasAssignee ?assg .}
  {?pat1 hasUSClass ?class .}
  # Retrieve information across domains
  ?pat1 hasClaim ?claim .
  ?pub a Publication .
  ?pub hasBody ?body .
  FILTER REGEX (?body, ?claim, "i")
}

Example Query

Actual Documents Retrieved by Querying Patent System Ontology

Multi-Domain Information Retrieval

Knowledge sources: Bioportal (bioportal.bioontology.org): Drug Ontology, Disease Ontology, Symptom Ontology; Patent System Ontology: U.S. Patent Classification, MEDLINE Metadata

Drug Ontology:
  Initial Features: {{erythropoietin, epo}, {epoetin alfa, epogen, procrit …} …}

Step I: Search TREC dataset (Disease Ontology)
  Initial Features: {{erythropoietin, epo}, {epoetin alfa, epogen, procrit …} …}
  Query: [DISEASE] AND {{erythropoietin, epo}, {epoetin alfa, epogen, procrit …} …}
  Extracted Features: {anemia, {aplastic anemia, hemolytic anemia, …}, {esrd, chronic kidney disease …} …}
  Acquired Features-I: {anemia, {aplastic anemia …}, {esrd, chronic disease …} …}

Step II: Search TREC dataset (TREC corpus, patent system ontology)
  Query: [AUTHOR] AND {anemia, {aplastic anemia, hemolytic anemia, …}} AND {{erythropoietin, epo} …}
  Extracted Features: {Miyake, {Goldwasser, Eugene Goldwasser} …}
  Acquired Features-II: {Miyake, {Goldwasser E., Eugene Goldwasser} …}

Step III: Patent Database (USPTO)
  Query: [PATENT] AND {{Goldwasser, Eugene G, …}, …} AND {{anemia, …}, …}

Results: the 5 core patents that originated from Amgen Inc. (U.S. Patents 5,621,080, 5,756,349, 5,955,422, 5,547,933, 5,618,698)

Summary: BIO-REGNET

[Figure: BIO-REGNET links patent documents, scientific publications, and court cases through two knowledge sources: the Bio Ontology (technical domain) and the Patent System Ontology (business/legal domain). Underlying sources: Issued Patents and Applications, File Wrappers, Court Cases, Technical Publications, Regulations and Laws. Caption: Siloed Patent System Information]

Summary and Discussion

• IP informatics: from research/development, patent filings to infringement and IP protection
• Knowledge-Driven Ontology-Based Approach
  • Technological ontologies
  • Patent system ontology
• Generalization: linking to other information sources (technical/scientific publications, product literature)
• User Interface: efficient presentation of relevant (semantic) information
• Comparative analysis of documents
• Scalability (Graph Database?)
• Experiment with more use cases in other technical domains outside of the biomedical domain

Acknowledgement

This research is partially supported by
• NSF Grant Number IIS-0811975 awarded to the University of Illinois at Urbana-Champaign
• NSF Grant Number IIS-0811460 to Stanford University
• NIST Award Number 60NANAB11D129 to Stanford University

Any opinions and findings are those of the authors, and do not necessarily reflect the views of the National Science Foundation (NSF) or the National Institute of Standards and Technology (NIST).

Certain identification of public or commercial systems in the paper/presentation does not imply recommendation or endorsement by NSF or NIST; nor does it imply that the products identified are necessarily the best available for the purpose.