(10) Patent No.: US 8140267 B2

US008140267B2 (12) UnitedO States Patent (10) Patent No.: US 8,140,267 B2 Boyer et al. (45) Date of Patent: Mar. 20, 2012 (54) SYSTEMAND METHOD FOR IDENTIFYING 2. 6. E: S. W et al. SMILARMOLECULES 7,206,7354. WW B2 4/2007 Menezesang et al. 7,260,568 B2 8/2007 Zhang et al. (75) Inventors: Stephen Kane Boyer, San Jose, CA 7.286,978 B2 10/2007 Huang et al. (US); Gregory Breyta, San Jose, CA 7,321,854 B2 1/2008 Sharma et al. (US); Tapas Kanungo, San Jose, CA 7.340,388 B2 3, 2008 Soricut et al. (US); Jeffrey Thomas Kreulen, San 7,346,5077,343,624 B1 3/2008 RihnNatarajan et al. et al. Jose, CA (US); James J. Rhodes, Los T.373.291 B2 5/2008 Garst Gatos, CA (US) 7,398,211 B2 7/2008 Wang 7.421418 B2 9, 2008 Nakano (73) Assignee: International Business Machines 2322 R 239: Stig et al. C orporation,tion, Armonk, NY (US)(US 7,707,206- - w B2 * 4/2010 Encinaakano et al. ................. 707/716 2002/0087508 A1* 7/2002 Hull et al. ......................... 707/1 (*) Notice: Subject to any disclaimer, the term of this 2002.0099536 A1 T, 2002 Sri et al. patent is extended or adjusted under 35 2003. O195890 A1 10, 2003 Oommen U.S.C. 154(b) by 1664 days. 2004/0042667 A1 3/2004 Lee et al. 2004/0044952 A1 3/2004 Jiang et al. 2004.0143574 A1 7/2004 Nakamura et al. (21) Appl. No.: 11/428,147 2004/0176915 A1 9, 2004 Williams et al. 2005/0013507 A1 1/2005 Lee et al. ...................... 382,284 (22) Filed: Jun. 30, 2006 2005/0203898 A1 9/2005 Boyer et al. 2005/024631.6 A1* 11/2005 Lawson et al. .................... 707/2 (65) Prior Publication Data 2007/0143322 A1* 6/2007 Kothari et al. ................ 707/101 US 2008/OOO4810 A1 Jan. 3, 2008 OTHER PUBLICATIONS (51) Int. Cl. U. Bandara et al., “Fast Algorithm for evaluating word sequence GOIN 33/48 (2006.01) statistics in large text corpora by Small computers'. IBM Technical GOIN33/50 (2006.01) Disclosure Bulletin, vol. 32, No. 10B, Mar. 1990, pp. 268-270. G06F 17/50 (2006.01) R. Kubota, "Lessening Index file for fulltext search”, IBMTechnical (52) U.S. Cl. .............................................. 702/19, 703/1 Disclosure Bulletin, vol. 38, No. 11, Nov. 1995, p. 321. (58) Field of Classification Search ........................ None “OpenEye Scientific Software", http://www.eyesopen.com/prod See application file for complete search history. ucts/toolkitsfogham.html, 2 pages, Jun. 28, 2006. "ACD/Name to Structure Batch'. http://www.acdlabs.com/products/ 56 Ref Cited name lab? rename? batch.html., 2Z pages.pageS, Jun. 28, 2006. (56) eerees e J. Brecher, "Name=Struct: A Practical Approach to the Sorry State of Real-Life Chemical Nomenclature'. J. Chem. Inf. Comput. Sci., vol. U.S. PATENT DOCUMENTS 39, 1999, pp. 943-950. 4,811,217 A 3, 1989 TokiZane et al. “Unofficial InChl FAQ'. http://wwmm.ch.cam.ac.uk/inchifaq'. 3 5,157,736 A 10/1992 Boyer et al. pages, Jun. 26, 2006. 5,418,951 A 5, 1995 Damashek “Distributed Structure-Searchable Toxicity (DSSTox) Public Data 5,647,058 A * 7/1997 Agrawal et al. ....................... 1f1 s 5,752,051 A 5, 1998 Cohen base Network”. U.S. Environmental Protection Agency, http://www. 5,845,049 A 12, 1998 Wu epa.gov/ncct/dsstox/MoreonInChl.html. 4 pages, Jun. 26, 2006. 5,949,961 A 9, 1999 Sharman D. Weininger, “SMILES, a Chemical Language and Information 5,970,453 A 10, 1999 Sharman System. 1. Introduction to Methodology and Encoding Rules”. J. 5,983,180 A 11/1999 Robinson Chem. Inf. Comput. Sci., vol. 28, 1988, pp. 31-36. 6,047,251 A 4/2000 Ponet al. 6,098,035 A 8/2000 Yamamoto et al. (Continued) 6,167,398 A 12/2000 Wyard et al. 6,169,969 B1 1/2001 Cohen 6,178,396 B1 ck 58: R f Primary Examiner — Jason Sims 6,189,002 B1 2/2001 Roitblat ................................ 1.1 riva 6,236,768 B1 ck 5, 2001 Rhodes et al. 382,306 (74) Attorney, Agent, Of Firm Daniel E. Johnson 6,311,152 B1 10, 2001 Bai et al. 6,314,399 B1 11/2001 Deligne et al. 6,332,138 B1 12/2001 Hull et al. (57) ABSTRACT 6,415,248 B1 7/2002 Bangalore et al. - 6,542,903 B2 4/2003 Hull et al. A vectorization process is employed in which chemical iden 6,574,597 B1 6, 2003 Mohri et al. tifier strings are converted into respective vectors. These vec 6,636,636 B1 10/2003 Takasu. tors may then be searched to identify molecules that are 6,865,5286,785,651 B1 3/20058/2004 WangHuang et al. 1dent1Calidentical Or S1m1arimil to eachh Other.other. Thee d1menS1OnSdi Off theth 7,013,264 B2 3, 2006 Dolan et al. vector space can be defined by sequences of symbols that 7,013,265 B2 3/2006 Huang et al. make up the chemical identifier strings. The International 7,016,830 B2 3/2006 Huang et al. Chemical Identifier (InChI) string defined by the Interna 7,031,908 B1 4/2006 Huang et al. tional Union of Pure and Applied Chemistry (IUPAC) is 7.050,9647,046,847 B2 5,5/2006 2006 MenzesHurst et etal. al. parucularlyarticularl wellll suitedSu1led forthIor unese meunods.thod 7,113.903 B1 9, 2006 Riccardi et al. 7,129,932 B1 10, 2006 Karlund et al. 7,143,091 B2 11/2006 Charnocket al. 30 Claims, 16 Drawing Sheets US 8,140,267 B2 Page 2 OTHER PUBLICATIONS Jerome R. Bellegarda, “Exploiting Latent Semantic Information in Statistical Language Modeling”. Proceedings of the IEEE vol. 88, D. Weininger et al., “SMILES. 2. Algorithm for Generation of No. 8, Aug. 2000, pp. 1279-1296. Unique SMILES Notation”. J. Chem. Inf. Comput. Sci., vol. 29. Jen-Tzung Chien, "Association Pattern Language Modeling”, IEEE 1989, pp. 97-101. Transactions on Audio, Speech and Language Processing, vol. 14. Jia Cui et al..."Investigating Linguistic Knowledge in a Maximum No. 5, Sep. 2006, pp. 1719-1728. Entropy Token-Based Language Model”, ASRU 2007 IEEE Work Dou Shen et al., “Text Classification Improved through Automati shop, Dec. 2007, pp. 171-176. cally Extracted Sequences'. Proceedings of the 22nd International Vesa Silivola et al., “A State-Space Method for Language Modeling”. Conference on Data Engineering, Apr. 3-7, 2006, pp. 121-123. ASRU 2003 IEEE Workshop, Nov. 30-Dec. 3, 2003, pp. 548-553. Solen Quiniou et al., “Statistical Language Models for On-line Hand Jia-Li You et al., “Improving Letter-To-Sound Conversion Perfor written Sentence Recognition'. Proceedings of the 2005 Eighth mance With Automatically Generated New Words”, ICASSP 2008 International Conference on Document Analysis and Recognition, IEEE International Conference, Mar. 31, 2008-Apr. 4, 2008, pp. vol. 1, Aug. 29-Sep. 1, 2005, pp. 516-520. 4653-4656. Ave Wrigley, "Parse Tree N-Grams for Spoken Language Model Wen Wang et al., “The Use of a Linguistically Motivated Language ling'. Grammatical Inference: Theory, Applications and Alterna Model in Conversational Speech Recognition'. Proceedings of IEEE tives, IEEE Colloquium 1993, pp. 26/1-26/6. International Conference on Acoustics, Speech and Signal Process P. O'Boyle et al., “Improving N-Gram Models by Incorporating ing (ICASSP 2004) vol. 1, No. 17-21, May 2004, pp. I-261-I-264. Enhanced Distributions'. Acoustics, Speech, and Signal Processing, Hiroki Moriet al., “Japanese Document Recognition Based on Inter ICASSP-96, Conference Proceedings 1996, IEEE International Con polated n-gram Model of Character'. Proceedings of the Third Inter ference Digital Object Identifier: 10.1109/ICASSP 1996.540317. national Conference Document on Analysis and Recognition vol. 1, Publication Year: 1996, vol. 1, pp. 168-171. No. 14-16, Aug. 1995, pp. 274-277. Tatsuya Kawahara et al., “Phrase Language Models for Detection and Jerome R. Bellegarda et al., “A Muitispan Language Modeling Verification-Based Speech Understanding”. Automatic Speech Rec Framework for Large Vocabulary Speech Recognition', IEEE Trans ognition and Understanding 1997, IEEE Workshop Digital Object actions on Speech and Audio Processing, vols. 6, No. 5, Sep. 1998, Identifier: 10.1109/ASRU. 1997.658977, Publication Year: 1997, pp. pp. 456-467. 49-56. Hui Mao et al., “Chinese Keyword Extraction Based on N-Gram and K. A. Papineni et al., “Maximum Likelihood and Discriminative Word Co-occurrence”, 2007 International Conference on Computa Training of Direct Translation Models'. Acoustics, Speech and Sig tional Intelligence and Security Workshops, Dec. 15-19, 2007, pp. nal Processing, Proceedings of the 1998, IEEE International Confer 152-155. ence Digital Object Identifier: 10.1109/ICASSP 1998.674399, Pub Mathew Palakalet al., “A Multi-level TextMining Method to Extract lication Year: 1998, vol. 1, pp. 189-192. Biological Relationships'. Proceedings of the IEEE Computer Soci ety Bioinformatics Conference 2002 Aug. 14-16, 2002, pp. 97-108. * cited by examiner U.S. Patent Mar. 20, 2012 Sheet 1 of 16 US 8,140,267 B2 O O 108 110 FIG. 1A U.S. Patent US 8,140,267 B2 BOZ) U.S. Patent Mar. 20, 2012 Sheet 3 of 16 US 8,140,267 B2 144a Markov Model Annotation Hydrogen Bi-Gram Oxygen Model 1 144b Markov Model Annotation 4-fluoro-2-(4-iod. Bi-Gram n-Cyclopropylmethoxy. Model 2 144C MarkOV Model Non-Annotation Bi-Gram Model FIG. 2 U.S. Patent Mar. 20, 2012 Sheet 4 of 16 US 8,140,267 B2 160 < S1 S2 ..... S > Training text where S; e a-Z A-Z 0-9 () - ..... } = Q 170 Construct Count Matrix Cilj = # {(Sk, Ski) S = q and Ski F qi, where qi, qi e Q} Ci = #8 (S.

(10) Patent No.: US 8140267 B2

Expanding the Scope of Thiophene Based Semiconductors: Perfluoroalkylated Materials and Fused Thienoacenes

Thiophene-Based Aldehyde Derivatives for Functionalizable

House Fly Attractants and Arrestante: Screening of Chemicals Possessing Cyanide, Thiocyanate, Or Isothiocyanate Radicals

Heterocyclic Chemistrychemistry

Pyrrole, Thiophene and Furan

PEDOT) Derivatives: Innovative Conductive Polymers for Bioelectronics

High-Temperature Unimolecular Decomposition Pathways for Thiophene Angayle K

Towards the Synthesis of Isocoronene

Essentials of Heterocyclic Chemistry-I Heterocyclic Chemistry

(Organic) Heteroaromatic Chemistry LECTURES 4 & 5 Pyrroles, Furans

Table of Contents

Chemistry of Polyhalogenated Nitrobutadienes, Part 11: Ipso-Formylation of 2-Chlorothiophenes Under Vilsmeier-Haack Conditions