Robust Biomedical Named Entity Recognition Based on Deep Learning

Robert Phan

A Thesis submitted in fulfilment of the requirements of the

Degree of Doctor of Philosophy

Faculty of Science and Technology & UC Health Research Institute

10 June 2020

Abstract

Biomedical Named Entity Recognition (Bio-NER) is a complex natural language processing (NLP) task for extracting important concepts (named entities), such as RNA (ribonucleic acid), protein, cell type, cell line, and DNA (deoxyribonucleic acid), from biomedical texts, and it attempts to discover biomedical knowledge and clinical concepts and terms automatically from text-based digital health records. For automatic recognition and discovery, researchers have extensively investigated different types of machine learning models, leading to sophisticated NER systems. However, most computer-based NER systems require manual annotations and handcrafted features specifically designed for each type of biomedical or clinical entity. The feature generation process for biomedical and clinical NLP texts requires extensive manual effort and significant background knowledge from biomedical, clinical, and linguistic experts; it suffers from a lack of generalization capability across different application contexts; and it is difficult to adapt to new biomedical entity types or clinical concepts and terms. Recently, there have been increasing efforts to apply different machine learning models to improve the performance and generalization capabilities of automatic Named Entity Recognition (NER) systems for different application contexts. However, their performance and robustness in biomedical and clinical contexts are far from optimal, due to the complex nature of medical texts, the lack of annotated dictionaries, and the need for multi-disciplinary experts to annotate massive training data for the manual feature generation process.

In this thesis, a novel computational framework based on robust machine learning models for the biomedical and clinical named entity recognition tasks is proposed. Validation of the proposed computational framework models based on shared representations, in both the Bio-NER and clinical-NER domains, resulted in improved NER performance. The innovative machine learning models based on shared representations were built using different types of machine learning algorithms, including traditional shallow machine learning algorithms based on MaxEnt (Maximum Entropy) and CRF (Conditional Random Fields), and several Deep Learning variants, including FFNs (Feedforward Networks), RNNs (Recurrent Neural Networks), and hybrid CNNs (Convolutional Neural Networks), enhanced with hyperparameter optimization techniques, allowing the characteristics of context-specific biomedical and clinical entity types to be captured from unstructured free text. The experimental validation of the proposed biomedical NER framework, based on large-scale deep machine learning algorithms, was carried out on Bio-NER and clinical-NER benchmark datasets representing different biomedical and clinical entity types, and it resulted in a significant improvement in performance over traditional NER systems based on shallow learning models, by a large margin, even with limited training data, unbalanced datasets, and the challenge of recognizing a large set of entities from small datasets. The reason behind the improved performance and generalization of the proposed deep learning-based computational framework models could be the embedding of context-specific shared information, at the character and word level, between different biomedical and clinical entities, as modelled by the deep learning architectures used. The contributions from this research can create new opportunities, in terms of a generalized, robust, high-performing computer-based NER framework that can work across a wide range of inter-related health domains with several polysemous names and entities, including biomedical, clinical, chemical, medical, and public health contexts.


Acknowledgments

After a long journey of research, I take this opportunity to thank all the staff who have helped me to accomplish this dissertation. I sincerely thank the supervisory panel: I am especially grateful to Professor Girija Chetty for her tremendous help and advice in keeping me on the right track throughout this research, and to Professor Roland Goecke, Associate Professor Dat Tran, and Adjunct Professor Leif Hanlen for their great support in making this long journey possible. I enjoyed being at the University of Canberra for my previous studies and for this current research; the staff are friendly and helpful.

I honestly thank my Uncle, who encouraged me during tough times but could not wait for me to complete this research course: he passed away in February 2018 due to rapidly progressing pancreatic cancer.

I am also grateful to my parents, brothers, and sisters for their unconditional support, patience, and encouragement. Without their help and support, the preparation of this dissertation would not have been possible.


Publications

The following is a list of my conference papers and peer-reviewed journal articles, published as contributions to this thesis. The full papers are available at the links below:

1. Phan, R, Luu, TM, Davey, R & Chetty, G 2016, ‘Clinical Name Entity Recognition Based on A Novel Machine Learning Scheme’, JNIT (Journal of Next Generation Information Technology), vol. 7, no. 2, June 2016, pp. 31-44, ISSN: 2092-8637, 2233-9388 (Print), http://www.globalcis.org/dl/citation.html?id=JNIT-384&Search=Clinical%20Name%20Entity%20Recognition%20Based%20on%20A%20Novel%20Machine%20Learning%20Scheme&op=Title

2. Phan, R, Luu, TM, Davey, R & Chetty, G 2018, ‘Deep Learning Based Biomedical NER Framework’, in 2018 IEEE Symposium Series on Computational Intelligence (SSCI), 18-21 Nov. 2018, Bangalore, India, DOI: 10.1109/SSCI.2018.8628740, https://ieeexplore.ieee.org/document/8628740

3. T. M. Luu, R. Phan, R. Davey and G. Chetty, ‘A Multilevel NER Framework for Automatic Clinical Name Entity Recognition’, 2017 IEEE International Conference on Data Mining Workshops (ICDMW), New Orleans, LA, 2017, pp. 1134-1143, DOI: 10.1109/ICDMW.2017.161, https://ieeexplore.ieee.org/document/8215793

4. Phan, R, Luu, TM, Davey, R & Chetty, G 2018, ‘Biomedical Named Entity Recognition Based on Hybrid Multistage CNN-RNN Learner’, in 2018 International Conference on Machine Learning and Data Engineering (iCMLDE), 3-7 Dec. 2018, Sydney, Australia, DOI: 10.1109/iCMLDE.2018.00032, https://ieeexplore.ieee.org/document/8614015

5. Phan, R, Luu, TM & Chetty, G 2019, ‘Enhancing Clinical Name Entity Recognition Based on Hybrid Deep Learning Scheme’, in Proceedings of the IEEE International Conference on Data Mining (ICDM'19), 08-11 November 2019, Beijing, China, https://ieeexplore.ieee.org/abstract/document/8955534

Table of Contents

Abstract ...... i

Certificate of Authorship of Thesis Form ...... iii

Retention and Use of Thesis by the University Form ...... iv

Acknowledgments ...... v

Publications ...... vii

Acronyms ...... xix

CHAPTER ONE ...... 1

Introduction ...... 1

1.1 Overview ...... 1
1.2 Introduction and Background ...... 1
1.3 Significance and Motivation ...... 4
1.4 Challenges and Research Gap ...... 12
1.4.1 Bio-NER challenges ...... 13
1.4.2 Clinical-NER challenges ...... 16
1.5 Research Questions ...... 19
1.6 Aims and Objectives ...... 20
1.7 Approach and Methodology ...... 21
1.8 Research Contributions ...... 23
1.9 Thesis Road Map ...... 25
1.10 Conclusion ...... 27

CHAPTER TWO ...... 29

Literature Review and Background Research ...... 29

2.1 Overview ...... 29
2.2 Bio-NER literature review ...... 29
2.3 Clinical-NER literature review ...... 32
2.4 Robust NER Framework Design ...... 36
2.5 Conclusion ...... 38

Declaration for Thesis Chapter [3] ...... 39

CHAPTER THREE ...... 41


Bio-NER Model Development based on Maximum Entropy Learner and Supervised Machine Learning ...... 41

3.1 Introduction ...... 41
3.2 Bio-NER model based on MaxEnt Algorithm ...... 41
3.3 Experimental Work ...... 47
3.4 Conclusion ...... 52

CHAPTER FOUR ...... 53

Bio-NER Model Development based on Conditional Random Fields with Semi-Supervised Machine Learning ...... 53

4.1 Overview ...... 53
4.2 Proposed Conditional Scheme ...... 54
4.3 Experimental Work ...... 59
4.4 Conclusion ...... 62

Declaration for Thesis Chapter [5, 6] ...... 63

CHAPTER FIVE ...... 65

Bio-NER Model based on Conditional Random Field Learner with Rich Feature Sets ...... 65

5.1 Overview ...... 65
5.2 Background and Related Work ...... 66
5.3 CRF Model Development ...... 67
5.3.1 CRF Learning ...... 67
5.3.2 Rich Text Feature Set ...... 69
5.4 Proposed Conditional Random Field Scheme ...... 73
5.4.1 BioNLP/NLPBA 2004 challenge dataset ...... 73
5.4.2 Model Building Approach and Methodology ...... 76
5.5 Experimental Results and Discussion ...... 78
5.6 Conclusion ...... 84

Declaration for Thesis Chapter [6] ...... 85

CHAPTER SIX ...... 87

Bio-NER Model Development Based on Deep Learning Approach ...... 87

6.1 Overview ...... 87
6.2 Deep Learning Background ...... 89
6.3 Proposed Deep Learning Bio-NER Scheme ...... 95
6.3.1 BioNLP 2004 Challenge dataset ...... 96
6.3.2 Deep Bio-NER Model ...... 96
6.4 Experimental Work ...... 102
6.5 Conclusion ...... 109

Declaration for Thesis Chapter [7] ...... 111

CHAPTER SEVEN ...... 113

Extending Bio-NER Models to Clinical-NER Based on Deep Learning and Hyperparameter Optimization Techniques ...... 113

7.1 Overview ...... 113
7.2 Hyper-parameter Optimization ...... 114
7.3 Hyperparameter optimization for Hybrid CNN-RNN Model ...... 117
7.4 Experimental Results and Discussions ...... 120
7.5 Conclusion ...... 123

CHAPTER EIGHT ...... 125

Software Platform Development and Deployment...... 125

8.1 Overview ...... 125
8.2 Software Development Framework ...... 126
8.3 The Experimental Results ...... 126
8.3.2 Main Webpage ...... 128
8.3.3 Open Sub-Webpage: CRF predicts Biomedical ...... 129
8.3.4 Open Sub-Webpage: DL predicts Biomedical ...... 132
8.3.5 Open Sub-Webpage: DL predicts NHO ...... 135
8.4 Conclusion ...... 138

CHAPTER NINE ...... 139

Conclusions and Further Plan ...... 139

APPENDIX 1 ...... 143

Details of Benchmark datasets used ...... 143

The first challenge dataset ...... 143
The second challenge dataset ...... 147

APPENDIX 2 ...... 149

Machine Learning Development Tools Used ...... 149


The Apache OpenNLP library ...... 152
The MALLET library ...... 153
Sklearn-crfsuite, a thin CRFsuite wrapper ...... 154
Weights&Biases (W&B) tool ...... 154

APPENDIX 3 ...... 157

The World of DNA and RNA ...... 157

References ...... 159


List of Figures

Figure 1: GIS Algorithm to train the MaxEnt model...... 45

Figure 2: The relationship between unlabelled data and their high confidence scores...... 46

Figure 3: Maximum Entropy Learner with Active/Incremental Learning...... 47

Figure 4: Results of MaxEnt based BioNER approach for 1978-1989 subset...... 49

Figure 5: Results of MaxEnt based BioNER approach for 1990-1999 set...... 49

Figure 6: Results of MaxEnt based BioNER approach for 2000-2001 set...... 50

Figure 7: Results of MaxEnt based Bio-NER approach for S1998-2001 set...... 50

Figure 8: Expected results for the proposed NER approach (Maximum entropy learner with active/incremental learning)...... 50

Figure 9: Comparative performance of proposed MaxEnt Bio-NER with Lee04 and Baseline systems...... 51

Figure 10: Semi-supervised algorithm using GE criteria for building the CRF model...... 56

Figure 11: The CRF learner based on semi-supervised learning with Generalized Expectation criteria and rich text features...... 58

Figure 12: Comparison performances of NER Pha18CRF with previous systems...... 61

Figure 13: Relationships between the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model in the context of the CRF...... 72

Figure 14: TF-IDF and Similarity feature sets obtained after pre-processing data...... 75

Figure 15: Combination of 3 feature sets: TF-IDF, Find Similar, and other rich features...... 77

Figure 16: Use Train dataset to create the Word2Vec Model ...... 79

Figure 17: CRF Model Building Workflow...... 80

Figure 18: Comparison of the performance of NER Pha18 CRFs with previous participating systems...... 83

Figure 19: Structure of FFN with 3 input nodes, a first hidden layer with 2 nodes, a second hidden layer with 3 nodes, and a final output layer with 2 nodes...... 92

Figure 20: Illustration of a CNN architecture for sentence classification...... 93

Figure 21: Basic CNN Model architecture...... 97

Figure 22: Character level of Convolutional Neural Network and ...... 98

Figure 23: The performance of the character-level CNN-RNN hybrid model on the training data set after 10 epochs...... 100

Figure 24: The ReLu activation function: R (z) = max (0, z) gives an output z if z is positive and 0 otherwise...... 102

Figure 25: Comparison of the performance of the Hybrid Pha18 (CNN) neural network with other BioNLP participating systems using F-score...... 108

Figure 26: Validating accuracy results for the Hybrid CNN-RNN model for 75 epochs. ... 119

Figure 27: Testing accuracy results for the Hybrid CNN model in 75 epochs...... 120

Figure 28: Comparison of F-score performance of the Hybrid PHA-19-CNN-RNN neural network with all other participating systems in the CLEF challenge task (bar chart)...... 122


Figure 29: Effective Communication between Client and Server...... 127

Figure 30: The main Webpage contains three sub-webpages...... 128

Figure 31: Before submitting, this sub-webpage contains “Value” as an empty result under the “details” heading...... 130

Figure 32: After submitting, this sub-webpage form contains expected results under the “details” heading...... 131

Figure 33: Before submitting, this sub-webpage form contains “Value” as an empty result under the “details” heading...... 133

Figure 34: After submitting, this sub-webpage form contains expected results under the “details” heading...... 134

Figure 35: Before submitting, this sub-webpage form contains “N/A” as an empty result under the “details” heading...... 136

Figure 36: After submitting, this sub-webpage form contains expected results under the “details” heading...... 137

Figure 37: Shows 6 class labels contained in the Training dataset as input...... 143

Figure 38: Illustrates 11 class labels as expected output predicted from the BioNLP challenge task...... 144

Figure 39: Illustrates 71 class labels as expected output from the second CLEF challenge task...... 145

Figure 40: Illustrates 36 input class labels contained in the Train dataset of CLEF eHealth...... 147


List of Tables

Table 1: Subset of clinical entities defined as per UMLS Semantic Groups and Semantic Types, used in the proposed MaxEnt learner...... 44

Table 2: Shared Task Report for the previously participating systems in the challenge...... 48

Table 3: Performance of the proposed NER approach with respect to the participating systems on each test subset...... 51

Table 4: Rich Text Features used to create the CRF Model...... 55

Table 5: Construction of an event List with Features (an Outcome and a Context)...... 55

Table 6: Performance of CRF_Model_1...... 59

Table 7: Performance of CRF_Model_2...... 59

Table 8: Performance of CRF_Model_3...... 59

Table 9: Performance of CRF_Model_4...... 60

Table 10: Performance of CRF_Model_5...... 60

Table 11: Comparison of the performance of CRF_Model_5 with respect to the participating systems on each test subset...... 61

Table 12: Performance of CRF_Model_1...... 81

Table 13: Performance of CRF_Model_2...... 81

Table 14: Performance of CRF_Model_3...... 81


Table 15: Comparison of the performance results of the three CRF Models with previous participating systems on each test subset...... 82

Table 16: Performance Results of Participating Systems in BioNLP challenge tasks...... 107

Table 17: Performance Results of All Participating Systems in CLEF challenge task...... 121

Table 18: Train data file contains Labelled data from BioNLP/NLPBA 2004 challenge task...... 146

Table 19: Test data file contains unlabelled data from Bio-NLP/NLPBA 2004 challenge task...... 146


Acronyms

ADAM: Adaptive Moment Estimation

AI: Artificial Intelligence

API: Application Programming Interface

AWS: Amazon Web Services

BioNER: Biomedical Named Entity Recognition

BioNLP: Biomedical Natural Language Processing

CLEF: Conference and Labs of the Evaluation Forum

CoNLL-2002: Conference on Computational Natural Language Learning-2002

CNNs: Convolutional Neural Networks

CPOE: computerized provider order entry

CQDB: Constant Quark Database

CRF: Conditional Random Field

DL4J: Deeplearning4j (Deep Learning for Java)

DNNs: Deep Neural Networks

EC2: Elastic Compute Cloud

eHealth: electronic Health

EMR: Electronic Medical Record

FFNs: Feedforward Networks

Flask: Python micro web framework (the name is not an acronym)

F-score: F-measure (harmonic mean of precision and recall)

GE: Generalized Expectation criteria

GIS: Generalized Iterative Scaling

GPU: Graphics Processing Unit

HMM: Hidden Markov Model

ICT: Information Communication Technology

IDF: Inverse Document Frequency

IE: Information Extraction

IIS: Internet Information Services

IV: Information Visualization

JNLPBA: Joint workshop on Natural Language Processing in Biomedicine and its

Applications

LSTM: Long Short-Term Memory

MALLET: Machine Learning for Language Toolkit

MaxEnt: Maximum Entropy


ML: Machine Learning

NLPBA: Natural Language Processing in Biomedicine and its Applications

NH: Nursing Handover

NHO: Nursing Handover Dataset

NHS: National Health System

NICTA: National Information and Communications Technology Australia

NLP: Natural Language Processing

NER: Named Entity Recognition

NNR: Neural networks with regression

O: Others

POS: Part-Of-Speech

RBM: Restricted Boltzmann Machine

RegExp: Regular Expression

ReLu: Rectified Linear Unit

RegExlib: Regular Expression Library

RNN: Recurrent Neural Network

SCTK: Scoring Toolkit

SGD: Stochastic Gradient Descent

Tanh: Hyperbolic Tangent Activation Function

TF: Term Frequency

TF-IDF: Term Frequency – Inverse Document Frequency

TM: Text Mining

UK: United Kingdom

UIMA: Unstructured Information Management Architecture

UMLS: Unified Medical Language System

URI: Uniform Resource Identifier

US: United States

VM: Virtual Machine

XML: Extensible Mark-up Language

Word2Vec: Word to Vector

WR: Word Representation

WSGI: Web Server Gateway Interface


CHAPTER ONE

Introduction

1.1 Overview

In this Chapter, the named entity recognition problem for the biomedical and clinical context is introduced first, and then the background, challenges, and research gaps are identified. The formulation of research questions to address the gaps, and the research plan in terms of aims, objectives, approach, and methodology for the thesis, is presented next. The Chapter concludes with the research contributions of this thesis and a road map for the rest of the work done in this research.

1.2 Introduction and Background

As the wealth of medical knowledge, either in the form of literature or in the form of electronic records, increases, there is a growing opportunity and need to mine this knowledge with new and effective natural language processing (NLP) based analysis approaches, so that efficient computer-based decision support tools and assistive technologies can be developed and deployed in health care systems. Computer-based named entity recognition (NER) is an important first step for text-based information extraction systems, and the objective of the NER task is to identify words and phrases in free text that belong to classes of interest, using NLP, text mining, machine learning, and Artificial Intelligence approaches.


The named entity recognition (NER) task, in general, involves the automated assignment of a named entity tag to a designated word by using rules and heuristics. The named entity can represent a human, a location, or an organization, and it needs to be recognized from a snippet of text [1]. In a typical NER task, nominal and numeric information is extracted from a document containing words that represent a person, location, organization, or date category [2]. NER systems are normally designed to classify all words in the document into the existing categories or a “none-of-the-above”/“unknown” category.
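The tagging view of NER described above can be sketched as sequence labelling with the widely used BIO scheme (B- begins an entity, I- continues it, O marks the “none-of-the-above” category). The sentence, tags, and helper function below are purely illustrative, not part of any system developed in this thesis:

```python
# Minimal sketch: recovering entity spans from BIO-tagged tokens.
# Tokens and tags here are illustrative, not output of a real tagger.

def extract_entities(tokens, tags):
    """Collect (entity_text, entity_type) spans from BIO tags."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                  # a new entity begins
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:    # current entity continues
            current.append(token)
        else:                                     # 'O': close any open entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

tokens = ["Barack", "Obama", "visited", "Canberra"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(extract_entities(tokens, tags))
# [('Barack Obama', 'PER'), ('Canberra', 'LOC')]
```

The same mechanics apply unchanged to biomedical tags such as B-protein or B-DNA; only the label set differs.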

The named entity recognition task for a block of text from the biomedical domain (termed the “Bio-NER” task in this thesis for convenience) is important for gaining an in-depth understanding, in terms of specific knowledge, of text containing crucial information about proteins and genes, such as RNA or DNA. Finding the named entities of genes in texts is a crucial but difficult task [3]. It is very similar to finding a company name or a person's name in newspapers, but recognizing biomedical named entities is more difficult than recognizing ordinary named entities in newspapers [4].

Like the biomedical domain, the NER task is equally important and crucial for the clinical domain (termed the clinical-NER task in this thesis). Clinical studies often require detailed patient information documented in clinical narratives and handover summaries [1]. The clinical-NER task aims to extract entities of interest (e.g., disease names, medication names, and lab tests) from clinical narratives and handover summaries, to provide insight into patients' treatment regimes and adverse drug reactions, to track and monitor the quality of care, and to facilitate clinical and translational research [2, 3].
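One simple baseline for such clinical entity extraction is a longest-match dictionary lookup. The sketch below uses a tiny, purely hypothetical vocabulary; a real system would draw on much larger resources such as the UMLS:

```python
# Toy sketch of dictionary-based clinical entity lookup (longest match first).
# The mini-vocabulary is illustrative only, not a real clinical dictionary.

VOCAB = {
    "congestive heart failure": "Disease",
    "heart failure": "Disease",
    "aspirin": "Medication",
    "serum creatinine": "LabTest",
}

def dictionary_ner(text):
    """Greedy longest-match lookup over lowercased token n-grams."""
    tokens = text.lower().split()
    max_len = max(len(k.split()) for k in VOCAB)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):  # try longest first
            phrase = " ".join(tokens[i:i + n])
            if phrase in VOCAB:
                found.append((phrase, VOCAB[phrase]))
                i += n
                break
        else:
            i += 1  # no entry starts here; move on
    return found

note = "Patient with congestive heart failure started on aspirin"
print(dictionary_ner(note))
# [('congestive heart failure', 'Disease'), ('aspirin', 'Medication')]
```

The brittleness of such lookups in the face of abbreviations, misspellings, and polysemous terms is precisely what motivates the machine learning approaches discussed in this thesis.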

A large body of research currently exists on methods for recognizing named entities in many health-related domains, including the biomedical and the clinical domain, but most of these methods are based on dictionary and rule-based approaches [5]. Recently, some studies on the use of computer-based approaches for mining medical text have also been reported in the literature, using a combination of rule-based, supervised machine learning, and text mining approaches. Some of these approaches for the Bio-NER task include methods involving Hidden Markov Models (HMMs) [6], decision trees [7], support vector machines (SVMs) [8], and conditional random fields (CRFs) [9, 10]. Supervised machine learning models usually involve building representation models during the training phase, with features extracted from annotated data based on various linguistic rules created from expert annotations, and are tested on unannotated evaluation/validation data to assess the performance and generalization possible in real deployment/test scenarios.
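The per-token features such supervised sequence learners consume can be sketched as follows. The feature-dictionary format mirrors the input expected by the sklearn-crfsuite library listed in Appendix 2, but the particular features shown are illustrative assumptions, not the exact set used in later chapters:

```python
# Illustrative handcrafted features for one token in a sentence, in the
# dict-per-token style consumed by CRF toolkits such as sklearn-crfsuite.
# The feature choices here are examples only.

def token_features(tokens, i):
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "prefix3": word[:3],           # character n-gram clues
        "suffix3": word[-3:],
        "has_hyphen": "-" in word,     # common in gene/protein names like IL-2
    }
    # Context features drawn from neighbouring tokens
    feats["prev.lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next.lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats

sentence = ["IL-2", "gene", "expression"]
print(token_features(sentence, 0))
```

The manual effort of designing, tuning, and maintaining such feature templates for each new entity type is exactly the burden that the deep learning models of later chapters aim to remove.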

For the clinical domain, similar computational models based on dictionary-based approaches, rule-based approaches, and combinations of supervised machine learning and rule-based approaches have been proposed as well. For the clinical-NER task, some of the well-known approaches and tools include MetaMap [4], MedLEE [5], and KnowledgeMap [6]; these are rule-based approaches relying on existing medical vocabularies and dictionaries for NER. The clinical NLP community has regularly organized challenges and shared tasks to further the state of the art, with benchmark datasets and evaluation criteria in each of these challenges [7-10]. Several of the top-performing systems in these challenges and shared tasks are primarily based on supervised machine learning models built with expert-annotated data made available by the challenge organizers, together with manually extracted and handcrafted NLP features [11-13]. A few other methods based on traditional supervised machine learning algorithms include ensemble models that combine multiple machine learning methods [14, 15], hybrid systems that combine machine learning with high-confidence rules [16], unsupervised features generated using clustering algorithms [17, 18], Brown clustering [19], and domain adaptation [20, 21] to leverage labeled corpora from other domains, albeit all requiring manual handcrafting of features and the creation of human-expert-annotated training corpora.

However, despite the significant progress that appears to have been made in the development of computer-based approaches and tools for the Bio-NER and clinical-NER tasks, there are several challenges in each of these domains that have been overlooked or not yet addressed. This could be due to the evolving, growing, and complex nature of the text sources/documents used in different sub-specialties within the health-related domains. For example, certain common terms used in the text documents of some sub-specialty domains appear as similar words, but they do not mean the same thing in the context of these individual domains.

The current computer-based approaches and tools cannot handle these complexities and cannot provide decision support and assistance without human expert intervention. There is a need to address this problem, in terms of better Bio-NER and clinical-NER approaches, and efficient computer-based decision support tools and assistive technologies, so that translational benefits can be realized in this health area. The next Section provides the rationale, in terms of the significance and motivation underpinning this research, and some strategies for addressing the problems associated with the current state of Bio-NER/clinical-NER systems.

1.3 Significance and Motivation

The healthcare industry collects enormous amounts of health data from diverse sources, including electronic healthcare records, hospital and census databases, medical sensors, and mobile devices. These large biomedical and clinical data sets provide opportunities to advance health care and biomedical research. Predictive analytics based on efficient machine learning techniques can leverage these large, heterogeneous data sets to further knowledge and foster discovery. Predictive modeling can facilitate appropriate and timely care by forecasting an individual's health risk and health outcomes. Approaches to predictive modeling include traditional approaches, such as structural equation models or descriptive-statistics-driven methods, as well as the recent machine learning methods. The power of machine learning methods comes from the ability to build data-driven models that improve automatically through experience [15]; examples include support vector machines (SVM), neural networks (NN), decision trees (DT), random forests (RF), and probabilistic and Bayesian Networks (BN). Compared to statistical methods, these machine learning techniques can increase prediction accuracy, sometimes doubling it, with less strict assumptions, e.g., on the data distribution [16-18].

In general, the focus of the traditional machine learning techniques used for Bio-NER and clinical-NER systems is on supervised learning (with SVM, NN, DT, RF, or BN learners), as it allows an expert-annotated training corpus to be used for model building. The model-building techniques mostly used for these systems are based on shallow machine learning approaches and normally involve several feature engineering and manual handcrafting steps, to obtain a good representation of the input data, reasonable prediction performance and robustness, and generalization during the real test/deployment phase with unannotated and unseen data. Further, the inherent data quality (goodness of data) of the text sources/documents used for building Bio-NER and clinical-NER systems can also dictate the performance, robustness, and generalization that can be achieved. In addition to the goodness of the data, the feature engineering and representation techniques used can have a significant impact on the performance of machine learners. For example, with poor-quality data, and weak and inefficient features extracted from that data, even an advanced machine learning algorithm cannot produce an efficient model. However, with a good data representation, a high-performance model can be produced even with a relatively simpler machine learning algorithm.
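As one concrete example of a data representation, the TF-IDF weighting that appears in the feature sets of Chapter 5 can be sketched as follows. This is a minimal, unsmoothed variant over a toy corpus, not the exact implementation used in the experiments (real toolkits differ in smoothing and normalization):

```python
# Minimal TF-IDF sketch: turning raw text into numeric features.
# Unsmoothed idf over a toy corpus; implementations vary in these details.
import math

def tf_idf(corpus):
    """Return one {term: tf-idf weight} dict per document."""
    n_docs = len(corpus)
    docs = [doc.lower().split() for doc in corpus]
    df = {}                                  # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in docs:
        weights = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)  # term frequency in this document
            idf = math.log(n_docs / df[term])
            weights[term] = tf * idf
        weighted.append(weights)
    return weighted

corpus = ["protein binds dna", "protein folds", "dna replication"]
vectors = tf_idf(corpus)
# 'protein' occurs in two of the three documents, so it is down-weighted
# relative to a term unique to one document, such as 'binds'.
```

This illustrates the point of the paragraph above: the representation, not just the learner, decides how much signal a discriminative term carries.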

Having said that, there are several problems associated even with good representations, due to implementation challenges (particularly in the biomedical and clinical data domains, with complex NLP text and a lack of appropriate annotated dictionaries for the medical sub-domains), and they require significant human intervention and handcrafting from experts. Some of these challenges include:

1. The complexity of the health data captured from heterogeneous, disparate sources, including Electronic Health Records, biomedical databases, clinical and hospital databases, census databases, and data from medical sensors and mobile devices, resulting in massive volumes of raw and unstructured data. Most of the traditional machine learning techniques based on shallow learning currently available cannot leverage the benefit of these massive volumes of complementary data, since the massive data needs human expert intervention from multiple disciplines before annotations can be created and predictive models built. This is evident from the lack of appropriate dictionaries and laid-down rules for building NER models. Another problem, discussed in the earlier section as well, is the problem of polysemous names and entities that exist in closely related health domains (for example, biomedical, clinical, pathological, chemical, and disease-related datasets all have similar-sounding words and names, but these carry different meanings in the different contexts of their respective domains). So, annotated corpus creation from raw data stores requires not just expert input from one discipline, but several knowledge engineers and experts from multiple health-related disciplines participating in the creation of the annotated corpus. This has been practically impossible in the past, though it is being addressed recently with shared tasks and evaluation challenges drawing participants from multiple disciplines. This situation has led to the development of several NER systems within each subspecialty that may work satisfactorily in a silo, but for broader deployment in the real world they have several performance-related issues, although they all appear to be based on machine learning technologies.

2. For analyzing massive volumes of data, traditional machine learning based on shallow learning methods with extensive feature engineering poses other unique challenges, including difficulty in dealing with moving and changing (streaming) data, its spatiotemporal aspects, highly distributed input sources, noisy and poor-quality data, high dimensionality, scalability of algorithms, imbalanced input data, unlabelled and un-categorized data, and limited supervised (labelled) data, etc.

3. The complications associated with NLP datasets for health contexts, in terms of adequate data storage, data indexing/tagging, and fast information retrieval, also make the existing machine learning approaches ineffective for Bio-NER and Clinical-NER tasks.

4. The other user-centric aspects such as quality issues associated with the NLP data

sources/documents coming from discharge summaries, handover notes and care

documents/reports available from the current systems, where there is plenty of

medical jargon, incomplete sentences, and short-hand abbreviations, leading to poor

quality data, and problems in extracting proper NLP features, for building


computer-based NER models, often resulting in propagation of medical errors in

the implementation pipeline in real-world operational scenarios.

5. The translational aspects arising due to problems with translating the discoveries

driven by findings from inter-disciplinary health related domains including biology,

physiology or genomic, into safer, more effective, and more personalized healthcare

solutions. As an illustration, it is important to highlight one of the problems related

to this scenario, the user-centric aspects in particular: Currently, patients and

the general public often find eHealth documents quite difficult to understand.

Even clinicians have problems in understanding the jargon of other professional

groups even though laws and policies emphasise the need to document care

comprehensively and provide further information on health conditions to help their

understanding. A simple example from a United States discharge document is “AP:

72 yo f w/ ESRD on HD, CAD, HTN, asthma p/w significant hyperkalaemia &

associated arrhythmias”. The clinical handover between nurses in hospital systems

involves similar discharge notes and use of such handover notes can lead to loss of

information, as computer-based systems and tools cannot interpret the named

entities in such text, and the clinical-NER/Bio-NER systems may not work

correctly. Further, coupled with this, the use of the public Internet as the source of

health-related information is a widespread phenomenon. Search engines are

commonly used to access health information available online; however, the

reliability, quality, and suitability of the information for the target audience varies

greatly. If an individual (for example, the patient in this case) tries to search for the

meaning of the discharge note above online, the search would not provide any

meaningful explanation. Previous research has shown that exposing people with no

or scarce medical knowledge to complex medical language may lead to erroneous

self-diagnosis and self-treatment. Hence, trying to access medical information on

the public Internet, which cannot handle such a complex discharge note as shown

above, can lead to the escalation of concerns about disease symptoms (e.g.,

cyberchondria). Current commercial search engines are far from effective

in answering such queries.

Use of better computer-based approaches, algorithms, and data-driven machine learning

techniques, without a need for multidisciplinary experts for annotation, corpus creation

and NLP feature engineering work (involving biomedical, clinical, linguistic,

computing experts), is needed to improve the state of the art in Bio-NER and clinical-

NER systems. This can then lead to the development of efficient decision support tools and

assistive technologies for empowering the patients, the care givers, the health

professionals, the regulatory bodies, and the researchers across the entire health

ecosystem. For example, the same text-based eHealth discharge

document (“AP: 72 yo f w/ ESRD on HD, CAD, HTN, asthma p/w significant

hyperkalaemia & associated arrhythmias”) would be much easier to understand, if an

efficient Bio-NER and/or clinical-NER system is available for translating this sentence,

by expanding shorthand and correcting the misspellings (if any), by normalising all

health conditions to standardised terminology, and by linking the words to a patient-

centric search on the Internet. This would turn the above-mentioned cryptic,

jargon-ridden sentence into a more user-friendly form that reads: “Description

of the patient's active problem: 72-year-old female with dependence on haemodialysis,

coronary heart disease, hypertensive disease, and asthma who is currently presenting

with the problem of significant hyperkalaemia and associated arrhythmias” with the


highlighted words linked to their definitions in Consumer Health Vocabulary and other

patient-friendly sources.
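The shorthand-expansion step sketched above can be illustrated with a minimal lookup-table approach. The abbreviation table and function below are purely illustrative assumptions, not the system developed in this thesis; a real system would normalise against standardised terminology rather than a hand-written dictionary:

```python
# Hypothetical shorthand-expansion sketch for clinical discharge notes.
# The abbreviation table is illustrative only.
ABBREVIATIONS = {
    "AP": "active problem",
    "yo": "year-old",
    "f": "female",
    "w/": "with",
    "ESRD": "end-stage renal disease",
    "HD": "haemodialysis",
    "CAD": "coronary artery disease",
    "HTN": "hypertension",
    "p/w": "presenting with",
}

def expand_shorthand(text: str) -> str:
    """Replace each known abbreviation token, preserving trailing punctuation."""
    out = []
    for tok in text.split():
        core = tok.rstrip(",;:")            # detach trailing punctuation
        suffix = tok[len(core):]
        out.append(ABBREVIATIONS.get(core, core) + suffix)
    return " ".join(out)

print(expand_shorthand("AP: 72 yo f w/ ESRD on HD, CAD, HTN, asthma"))
```

A production system would also handle case variants, misspellings, and multi-token abbreviations, which this sketch deliberately ignores.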

Further, the current Bio-NER/clinical-NER systems are weaker in their capabilities to

provide tailored and customised eHealth information in response to different target user

groups’ information needs in a timely manner, and ensure that this information is

reliable and accurate, and available for different usage scenarios and settings. Building

Clinical-NER and Bio-NER systems that can work with unlabelled data, and that do

not require a large set of annotated data, and can provide tailored and customised domain-

adaptation capabilities, and are applicable across different clinical domains and usage

scenarios would be a real breakthrough in this field. While it may not be feasible to

achieve all these objectives at once, it is possible to incrementally make progress in

achieving each objective and building on each capability for better NER systems to be

available for Biomedical and Clinical NER tasks. Some of these incremental steps could

include:

• Adapt, extend and enhance the Bio-NER and Clinical-NER models built with

shallow learning strategies, with novel NLP feature representations that can model

the latent information between words in sentences and blocks of text,

• Explore automatic discovery of NLP features and feature combinations, with less

reliance on dictionary and rule-based approaches,

• Use novel machine learning algorithms based on semi-supervised learning,

incremental learning, transfer learning, or the recent deep learning algorithms and

their hybrid combined learning architectures [19-21].


Using several of these strategies together could be helpful in building better NER systems for complex health contexts, and can result in improved Bio-NER and clinical-NER systems [22-23].

Another hurdle that can complicate the development of better Bio-NER and clinical-NER systems is the level of technological maturity and sophistication of the support infrastructure available for each. While there has been significant progress in the availability of support systems and resources for building Bio-NER systems, in terms of benchmark datasets, shared tasks, and automated computational frameworks for addressing different complexities and challenges in the biomedical domain, the situation for

Clinical-NER systems is less than ideal. There could be several reasons behind this, some of which are discussed in the next chapter on literature review, but there is a need for focussed research efforts in building better clinical-NER systems. One way of doing this is to leverage cross-domain modelling approaches: using shared representations, leveraging shared aspects within the subspecialty domains of a discipline, and exploiting the best-performing modelling techniques from the relatively mature and sophisticated Bio-NER field, for example, tailoring/customising them to build better models for the weaker and ill-addressed clinical-NER contexts. This could be a promising strategy, due to the similarities and shared objectives that exist between these two health-related domains. Some earlier studies on such cross-domain adaptation strategies for building NER systems in other application domains include an approach based on stacking up nonlinear feature extractors resulting in improved classification modelling [24], the use of better-quality samples obtained by generative probabilistic models [25], and the use of features based on the invariant property of data representations [26]. These cross-domain adaptation approaches have also produced outstanding results in other application contexts [27-28], including speech recognition [29-31] and natural language processing [32]. The ability of these cross-domain

machine learning strategies based on shared representations to extract high-level complex abstractions from heterogeneous, disparate data sources in different domains, and to condition the ill-structured data in the clinical domain, would make them an attractive option for building better tools for clinical-NER systems. The next section focusses on some of the challenges associated with cross-domain application contexts, particularly the two sub-domains of Bio-NER and clinical-NER, and identifies the research gap that exists in the automated computer-based approaches currently available for dealing with such contexts.

1.4 Challenges and Research Gap

There are several complexities and challenges the NER systems must address within the health-related sub-domains and contexts. Some of these arise from the commonality between these subspecialty domains, some from differences in the meanings and interpretation of similar-sounding names and entities within each of them, and some from the inability of current Bio-NER and Clinical-NER systems to deal with them. For example, making sense of polysemous names and entities, which leads to word sense ambiguity, is one of the major problems that current Bio-NER and clinical-NER systems cannot address. There are several other challenges in addition to this, and any innovation and research contributions need to take these challenges into account for improving the state of the art. Further details on some of these challenges are described in the next two sections of this chapter.


1.4.1 Bio-NER challenges

The named entity recognition (NER) task is a subtask of information extraction, which aims to classify all unregistered words appearing in texts and unstructured data collections. In general, the NER task involves eight categories: location, person, organization, date, time, percentage, monetary value, and “none-of-the-above” or “unknown” [12, 13]. A typical NER pipeline first finds named entities in sentences and then declares the category of each entity.

For example, in the following sentence:

“Apple [organization] CEO Tim Cook [Person] Introduces 2 New, Larger iPhones, Smart

Watch at Cupertino [Location] Flint Center [Organization] Event [14].”

Here “Apple” needs to be recognized as an organization name, not a fruit name, in terms of its context. Further, the words “Tim” and “Cook” need to be recognized together as a single entity: a person’s full name, referring to the CEO of the Apple company. “Cupertino” needs to be identified as a city name in California and tagged with a location label, and “Flint” and “Center” should be considered as a single name and recognized as the name of an organization. This might be easy for humans, but for computers, distinguishing each of these named entities based on the context in the sentence is far from easy.
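The tagging decisions described above are conventionally encoded token by token in the BIO scheme (B marks the beginning of an entity, I its continuation, O a token outside any entity). The following sketch shows the encoding, with labels hand-assigned for illustration, and how entity spans are recovered from it:

```python
# BIO-encoded view of the example sentence, with hand-assigned tags.
tagged = [
    ("Apple", "B-ORG"), ("CEO", "O"), ("Tim", "B-PER"), ("Cook", "I-PER"),
    ("Introduces", "O"), ("2", "O"), ("New", "O"), ("Larger", "O"),
    ("iPhones", "O"), (",", "O"), ("Smart", "O"), ("Watch", "O"),
    ("at", "O"), ("Cupertino", "B-LOC"), ("Flint", "B-ORG"),
    ("Center", "I-ORG"), ("Event", "O"),
]

def extract_entities(tokens):
    """Group consecutive B-/I- tags back into (entity text, label) spans."""
    entities, current, label = [], [], None
    for word, tag in tokens:
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [word], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(word)
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

print(extract_entities(tagged))
# [('Apple', 'ORG'), ('Tim Cook', 'PER'), ('Cupertino', 'LOC'), ('Flint Center', 'ORG')]
```

Note how “Tim Cook” and “Flint Center” come out as single multi-token entities, exactly the grouping a NER system must learn to make.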

Named entity recognition approaches have traditionally tried to make computers understand such sentences and subtle linguistic nuances in the text (which humans can do better) in different ways: by building models based on dictionary-based, rule-based, and machine learning-based approaches. For the dictionary-based approach, a list object is normally maintained for storing as many named entities as possible. This list-based approach, though it appears simple, has some limitations. Since the target words tend to be mainly proper nouns or unregistered words, making a computer understand them is normally difficult. Further, the

same word stream could mean different things in different contexts within the same domain and need to be recognized as different named entities in terms of their current context. In addition, if the domain changes (from biomedical to clinical for instance), the same set of words in a sentence could carry totally different meaning. So, dictionary-based methods, based on lists, developed for one context and domain, tend to perform poorly for different contexts and domains. Additionally, new words and word combinations get generated frequently and get updated for evolving NER fields, and even the same word stream would carry different meaning after updates and has to be recognised as a different named entity [15, 16].
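A dictionary (list-based) tagger of the kind just described can be sketched in a few lines. The gazetteer here is a hypothetical two-entry list, and the sketch makes the context-blindness of the approach immediately visible:

```python
# Minimal dictionary-based NER: every surface match receives the same
# label regardless of context -- the core weakness discussed above.
GAZETTEER = {"Apple": "ORG", "Cupertino": "LOC"}

def dictionary_tag(tokens):
    """Tag each token by direct gazetteer lookup; 'O' means no entity."""
    return [(tok, GAZETTEER.get(tok, "O")) for tok in tokens]

# "Apple" is tagged ORG even in a sentence about fruit:
print(dictionary_tag("He ate an Apple in Cupertino".split()))
```

Every new term, context, or domain requires the list to be extended by hand, which is why list-based methods transfer poorly.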

The second approach used for NER is the rule-based approach [17]. The success of a rule-based approach depends on the rules laid out and the patterns of named entities appearing in actual sentences. However, for rule-based approaches to be able to use context for solving the problem of multiple named entities, every rule needs to be written down before it is used. These rules also need to be changed when the context and domain change.
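A single hand-written rule of this kind might look as follows. The pattern is purely illustrative; real rule-based systems maintain large, domain-tuned rule sets that must be revised whenever the context or domain changes:

```python
import re

# One illustrative rule: one or two capitalized words following a title
# word are tagged as a person name. Every such pattern must be written,
# tested, and maintained by hand.
PERSON_RULE = re.compile(r"\b(?:CEO|Dr\.|Prof\.)\s+([A-Z][a-z]+\s[A-Z][a-z]+)")

match = PERSON_RULE.search("Apple CEO Tim Cook introduced the product.")
print(match.group(1))  # Tim Cook
```

The rule above already fails on lower-cased text or unseen title words, showing why rule maintenance does not scale across domains.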

The third approach, based on traditional shallow machine learning strategies, consists of tagging the named entities to words even without an explicit listing of words in a dictionary or a description of the context and domain definitions in a rule set. Some of these traditional machine learning approaches include Support Vector Machines (SVMs) [18],

Hidden Markov Models (HMMs) [6, 19], Maximum Entropy Markov Models (MEMMs) [20], and Conditional Random Fields (CRFs) [9, 10].

For the Bio-NER domain and the associated contexts, the research focus is normally on information extraction of protein, DNA, RNA, cell type, and cell line from biomedical literature [21–24]. Biomedical named entity recognition is an essential task in biomedical information extraction and forms the first stage of text mining in information extraction from biomedical documents and reports. Recognizing technical terms in the biomedical area from


the biomedical documents is in fact a very challenging task in natural language processing and has attracted significant interest from researchers for years [25]. For the Bio-NER task, in general, five categories (protein, DNA, RNA, cell type, and cell line) are used, instead of the normal NER categories used for general English language text. For example, a NER-tagged sentence in a Bio-NER task can be shown as:

“IL-2 [B-protein] responsiveness requires three distinct elements [B-DNA] within the enhancer [B-DNA].”

A biomedical NER system for extracting named entities from a sentence like the one shown above can face several difficulties. Firstly, due to the evolving and continuously changing nature of the biomedical domain, the number of new technical terms is rapidly and continuously increasing; building a dictionary or rule-based decision strategy that includes all the new and evolving terms could be very difficult. Secondly, the same words or expressions can mean different things depending on their context, and hence need to be classified as different named entities in terms of that context. Thirdly, for a complex entity with a long entity length and control characters such as hyphens (e.g., “12-o-tetradecanoylphorbol 13-acetate”), both dictionary- and rule-based NERs tend to fail or perform poorly. Fourthly, significant word-sense ambiguity is experienced in the biomedical area, due to the large proportion of abbreviated expressions frequently used; for instance, “TCF” could refer to “T cell factor” or to “Tissue Culture Fluid” [26, 27]. Finally, in the biomedical domain, normal or functional terms are often combined into one long biomedical term; for example, “HTLV-I-infected” and “HTLV-I-transformed” fuse the entity name with the functional terms “infected” and “transformed”, which carry special meaning within the compound. Even humans (other than biomedical science researchers) cannot always interpret such terms, let alone computers. Hence, most of the current Bio-NER approaches cannot cope with such named


entities. This is compounded by problems associated with spelling changes [28], and subsumption of one named entity category into another category [29].

1.4.2 Clinical-NER challenges

The clinical-NER task has mostly been approached as a sequence labelling problem, with researchers developing machine learning formulations that aim to find the best label sequence in a particular format (for example

BIO format labels) for a given input sentence (individual words from clinical text). Some of the machine learning models developed for the clinical-NER task include Conditional Random

Fields (CRFs) [22], Maximum Entropy (ME), and Structured Support Vector Machines

(SSVMs) [23]. The CRF model appears to be the most popular solution amongst the traditional shallow machine learning-based models used in several top-ranking systems. A typical state-of-the-art clinical NER approach involves the utilisation of NLP features extracted from different linguistic levels, including orthographic information (e.g., capitalization of letters, prefixes and suffixes), syntactic information (e.g., POS tags), word n-grams, and semantic information (e.g., the UMLS concept unique identifier). Some of the hybrid models [16] tend to further leverage the concepts and semantic types from existing clinical NLP systems such as MetaMap and cTAKES [24]. Further, for improved performance, some researchers have also explored combinations of different machine learning models, including ensemble models and re-ranking [14, 15]. A few other researchers, more recently, have examined unsupervised features extracted from large unlabelled corpora and generated word clusters using Brown clustering and random indexing [19]. The intensive and continuous contributions from the clinical NLP community have certainly boosted the performance of clinical NER systems.


However, some challenges could not be addressed adequately, leading to major gaps that hampered performance and generalisation capabilities. Some of these include:

Weaker feature representations:

The most commonly used feature representation for clinical NER tasks is based on the bag-of-words model [26, 27]. It is a simplified representation of a group of words in the text based on the presence/absence of its words. The bag-of-words representation tends to ignore word order, grammatical relations, and semantic information, and due to the sparsity problem it leads to weaker feature representations and poor performance for clinical NER. As an example, consider the two similar clinical entities:

“mildly dilated right atrium” and “somewhat enlarged left ventricle”. The bag-of-words representations of these two clinical entities show no commonality between them; they have nothing in common. However, clinicians would treat these two entities as related concepts. Hence, more robust feature representation models are needed for clinical NER tasks.
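The sparsity problem described above is easy to demonstrate: treating each entity as a bag of words, the two token sets are disjoint, so the representation assigns them zero similarity (a sketch only):

```python
# Bag-of-words view of two clinically related entities: the token sets
# share nothing, so a bag-of-words model sees zero similarity between them.
a = set("mildly dilated right atrium".split())
b = set("somewhat enlarged left ventricle".split())

overlap = a & b
print(sorted(overlap))  # [] -- no shared features despite the clinical relation
```

Dense representations (e.g., learned embeddings) address exactly this failure by placing related concepts near each other in a continuous space.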

Significant Human Intervention Needs:

Most of the Clinical-NER methods proposed so far are based on traditional shallow machine learning approaches, but tend to be rigid and task-specific, and require significant, time-consuming human feature engineering efforts [27, 28]. Two critical steps for a typical machine learning-based clinical NER system are the feature extraction step and the parameter optimization step. Feature extraction can be a very time-consuming step, as it heavily depends, for traditional shallow machine learning-based methods, on human intervention for handcrafted features, with machines involved only in handling the


parameter optimization supervised by the gold-standard annotations. Researchers’ manual intervention is also needed for screening positive/negative samples to identify possible features with special meanings (for example, tokens containing capitalized letters are more likely to have special meanings), and for designing optimal feature set combinations and sequences (for example, a body location followed by a disorder mention). These manual interventions are commonly referred to as “human feature engineering” and can lead to several problems. Mainly, the features extracted by human experts could be incomplete (humans sometimes cannot identify all possible features), inconsistent (different experts identify features differently, based on their individual subjective judgements or expertise), or over-specified; such inconsistencies or human errors can then get propagated down the information extraction pipeline.

Also, with human feature engineering, researchers must engineer the features again for every different task, data source, context or domain. Having a computer do these tasks can help objective, consistent and robust models to be developed, with minimal supervision and without time-consuming feature engineering.

Models with poor long-term dependencies:

Most of the popular clinical-NER systems surveyed were based on CRF models [28-30]. The CRF-based models use a word window over the input sequence of tokens. But word windows cannot model long-term dependencies and can lead to system prediction errors due to false negatives. It is not possible to model long-term dependencies by simply increasing the window size, as this may corrupt the signal with more noise entering the feature space and slow the model building process during the training phase. This problem can only be addressed by developing new and better architectures for clinical NER tasks, with capabilities to capture long-term dependencies from clinical texts.
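The word-window feature extraction used by such CRF baselines can be sketched as follows (the window size and feature names are illustrative). A token outside the window simply cannot influence the prediction for position i, which is exactly the long-term dependency limitation described above:

```python
def window_features(tokens, i, size=2):
    """Features for position i drawn only from a +/-size token window.

    Any token further than `size` positions away is invisible to the
    model, no matter how relevant it is to the label at position i.
    """
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset}]"] = tokens[j].lower()
    return feats

tokens = "The patient denied chest pain on admission".split()
print(window_features(tokens, 3))
```

Enlarging `size` admits more context but also more noise, which is the trade-off the passage above refers to; recurrent architectures sidestep it by carrying state across the whole sequence.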

To summarize, there is a huge body of work on NER approaches for different domains and contexts, including several health-related subspecialty domains. Despite the significant work done and the methods reported, including different dictionary-based, rule-based and machine learning-based formulations, there are several complex challenges associated with NER systems for clinical and biomedical scenarios, and most of the existing systems have several shortcomings in dealing with them. Better approaches need to be developed to further the state of the art in the Bio-NER and clinical-NER field. The research work reported in this thesis aims to address this gap and some of the challenges discussed. Three research questions have been formulated to guide the line of investigation and address the challenges and gaps in the state of the art in biomedical and clinical NER approaches. The next section describes the three research questions formulated for this thesis.

1.5 Research Questions

The overarching research question that will be addressed in this thesis is whether it is possible to develop a robust named entity recognition framework that can leverage the benefits of the well-resourced Bio-NER context for improving the performance of the low-resourced clinical-NER task.

This will be achieved by attempting to improve the state of the art for Bio-NER systems first, as it is a well-resourced context, by exploring some novel strategies for improving the baseline

Bio-NER task performance, and then attempting to improve the clinical-NER task performance by extending and adapting these models appropriately. This line of investigation then leads to three focussed research questions as outlined next:


Research Question 1:

Is it possible to enhance the performance of current Bio-NER approaches based on

shallow supervised machine learning techniques, with new NLP feature sets, and

alternate learning strategies, based on semi-supervised learning?

Research Question 2:

Is it possible to improve the performance of Bio-NER systems by using end-to-end fully

automated feature discovery and deep machine learning approaches, without

handcrafting and feature engineering requirements?

Research Question 3:

Is it possible to develop novel robust deep machine learning models with shared

representations, by leveraging the commonality between named entities from high

resource context (Bio-NER) to low resource context (clinical-NER), and improve the

performance of low resourced context?

1.6 Aims and Objectives

The aims and objectives of this research for addressing the three research questions formulated for this thesis are as outlined here:

1. To enhance the performance of current Bio-NER approaches based on shallow

supervised machine learning techniques, with new NLP feature sets, and alternate

learning strategies, based on semi-supervised learning.


2. To improve the performance of Bio-NER systems by using end-to-end fully automated

feature discovery and deep machine learning approaches, without handcrafting and

feature engineering requirements.

3. To develop robust deep machine learning models with shared representations, by

leveraging the commonality between named entities from high resource context (Bio-

NER) to low resource context (clinical-NER) and improve the performance of low

resourced context.

1.7 Approach and Methodology

The approach and methodology used for addressing the aims and objectives set for this thesis are described briefly here, with further details provided in the individual chapters in the rest of the thesis. They are as follows:

1. The Biomedical named entity recognition (Bio-NER) task chosen was the identification

of biomedical text terms (five classes), such as RNA, protein, cell type, cell line, and

DNA. For baseline comparison, traditional shallow machine learning models based on

supervised learning were examined first. Some of the shallow machine learning models

examined include the Maximum Entropy Models (MaxEnt), that use supervised

machine learning techniques. This serves as a baseline for comparing other models

developed in the thesis. For experimental evaluation and validation, the benchmark

GENIA dataset from BioNLP/NLPBA 2004 shared task was used, with model

development and testing conducted on training and evaluation set provided by the task

organizers.


2. The next step was to investigate the performance improvements possible for Bio-NER

task, with an alternative learning strategy based on semi-supervised learning with the

conditional random field (CRF) models, and evaluation and validation using the same

benchmark GENIA dataset from BioNLP/NLPBA 2004 shared task.

3. The third step was to investigate the performance improvements possible for Bio-NER

task, with the rich text feature set, consisting of a combination of several text features,

and build models based with semi-supervised learning and Conditional Random Field

(CRF) learner, and validate using the same benchmark GENIA dataset from

BioNLP/NLPBA 2004 shared task.

4. The next step was to investigate the Bio-NER performance improvements possible with

deep learning models based on different architectures, including Feedforward Network

(FFN), Recurrent Neural Network (RNN), hybrid Convolutional Neural Network

(CNN-RNN) models, and validate with the same benchmark GENIA dataset. This

concludes the investigations towards step-by-step iterative improvements for Bio-NER

model development in a well-resourced context (owing to the availability of a large

annotated dataset from a moderately complex scenario).

5. The final step involved extending and adapting the Bio-NER model developed in a

well-resourced context (Step 4 above) for the low-resourced context (Clinical-NER),

by enhancing the robustness of the deep-learning model using hyper-parameter

optimization. The clinical-NER task chosen was more complex as compared to the Bio-

NER task, and it involved identification of 35 different classes, for a smaller dataset,

which is sparse, unbalanced, synthetic, and noisy and represents a real-world nursing

handover context in the clinical domain. For experimental evaluation and validation,

the benchmark dataset from CLEF eHealth challenge 2016 Task 1 shared task on

handover information extraction was used, with validation/evaluation experiments


conducted on a training and evaluation set provided by the task organizers. As the

clinical-NER task considered in this work was more challenging as compared to the

Bio-NER task, new hyper-parameter optimization approaches for deep learning

networks were developed for improving the performance of the clinical-NER task.
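Generically, the hyper-parameter optimization step in this final stage can be sketched as a random search over a configuration space. The search space and the scoring function below are hypothetical placeholders, not the actual tuner developed for the clinical-NER model in this thesis:

```python
import random

# Hypothetical search space for a deep NER model (illustrative values).
SEARCH_SPACE = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "hidden_units": [64, 128, 256],
    "dropout": [0.2, 0.3, 0.5],
}

def evaluate(config):
    """Placeholder scorer; a real tuner would train the model and return dev-set F1."""
    random.seed(str(sorted(config.items())))   # deterministic pseudo-score per config
    return random.random()

def random_search(space, trials=10, seed=42):
    """Sample `trials` configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        config = {k: rng.choice(v) for k, v in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best, score = random_search(SEARCH_SPACE)
print(best, round(score, 3))
```

More sophisticated strategies (grid search, Bayesian optimization) replace the sampling loop but keep the same configure-train-score structure.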

The details of each step above are described in each chapter in the rest of the thesis. Next

Section presents the original research contributions made in this work.

1.8 Research Contributions

The present research aims to contribute towards a novel robust NER computational framework based on different machine learning techniques. Some of the innovative contributions made towards improving the state of the art in Bio-NER and clinical-NER system performance include:

1. Improvement in Bio-NER model performance by enhancing traditional supervised

shallow learning models with six different machine learning techniques involving a

combination of different NLP features, and with supervised, semi-supervised, and deep

learning approaches. These six algorithms include the Maximum Entropy with

supervised machine learning, the Conditional Random Field with semi-supervised

machine learning, the Conditional Random Field with supervised machine learning,

FFN and RNN based on deep learning, and hybrid CNN–RNN with hyper-parameter

optimization.

2. Extending the enhanced machine learning framework towards a robust NER

framework, involving shared representations, using character-level and word-level

representations from two different domains (Bio-NER and Clinical-NER) embedded

within deep learning models, allowing an improvement in the performance of low-

resourced clinical-NER task. The experimental validation on two different benchmark

datasets (GENIA-BioNLP/JNLBPA and CLEF eHealth 2016 task1 dataset) covering

major biomedical and clinical entity types shows that the robust shared modeling

technique proposed can outperform state-of-the-art systems based on single-domain

models by a large margin, even with limited training data available. This could be due

to large performance improvements possible by sharing character- and word-level

information across different biomedical and clinical entities.
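The character- and word-level sharing described above can be sketched schematically. The vector sizes and the hash-based stand-in for learned embeddings below are illustrative assumptions, not the thesis architecture: each token is represented by concatenating a word-level vector (shareable across domains) with a summary of its character-level vectors:

```python
import hashlib

WORD_DIM, CHAR_DIM = 4, 4  # illustrative dimensions only

def pseudo_embedding(key, dim):
    """Deterministic stand-in for a learned embedding lookup."""
    digest = hashlib.md5(key.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def token_representation(token):
    """Concatenate a word-level vector with a character-level summary."""
    word_vec = pseudo_embedding(token.lower(), WORD_DIM)
    # Character-level summary: average the per-character vectors, so
    # morphologically similar tokens (e.g., "HTLV-I-infected" vs.
    # "HTLV-I-transformed") get related character features.
    char_vecs = [pseudo_embedding(c, CHAR_DIM) for c in token]
    char_vec = [sum(col) / len(token) for col in zip(*char_vecs)]
    return word_vec + char_vec  # shared representation: concatenation

vec = token_representation("IL-2")
print(len(vec))  # 8
```

In the actual shared models, both components are learned jointly on the Bio-NER and clinical-NER corpora, which is what lets information flow between the two domains.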

These innovative research contributions have been peer-reviewed through several international conference and journal publications, validating the line of investigation used for the work reported in this thesis. The outcomes from this research, in the form of a novel robust

NER computational framework, can pave the way and create new opportunities for the development of automatic decision support tools and assistive health technologies for evolving/new contexts and domains that share certain commonalities (such as clinical, biomedical, chemical, medical and public health domains) for an improved performance, robustness and generalization.


1.9 Thesis Road Map

The research work reported in this thesis is organized into the following nine chapters:

Chapter 1: Introduction.

Chapter 2: Literature Review and Background Research.

Chapter 3: Bio-NER Model Development based on Maximum Entropy Learner and supervised machine learning.

Chapter 4: Bio-NER Model Development based on CRF learner with Semi-Supervised

Machine Learning.

Chapter 5: Bio-NER Model Development Based on CRF learner and rich NLP feature sets.

Chapter 6: Bio-NER Model Development Based on Deep Learning Approach.

Chapter 7: Extending Bio-NER Model to Clinical-NER Based on Deep Learning and Hyper-parameter Optimization techniques.

Chapter 8: Software Platforms Development and Deployment.

Chapter 9: Conclusions and Further Plan.

This chapter (Chapter one) introduces the research problem, and provides the background, rationale, significance and motivation, challenges and research gap, formulation of research questions, identifies the aims, objectives, approach and methodology for addressing the research questions, research contribution from this work to the scientific knowledge, and finally details the road map of the thesis on what each chapter entails. Chapter two provides a review of the literature and related work done and sets the background for determining the line

of the investigation, and the strategy for iterative model design, development, implementation, and test/evaluation for the proposed NER framework, in terms of different performance measures. Chapter three focusses on Biomedical Named Entity Recognition Based on MaxEnt

Machine Learning, a baseline model for comparison and assessment of improvements possible with other NER models developed in this thesis. Chapter four presents an enhanced Bio-NER model based on Conditional Random Fields and Semi-Supervised learning techniques. Chapter five describes an improved Bio-NER model with a combination of conditional random fields and rich NLP feature sets. Chapter six presents details of the enhanced Bio-NER model with an end-to-end deep learning approach without any feature engineering. It uses a combination of rich NLP features and unsupervised learned word2vec features and feedforward neural networks. Chapter seven describes robust Bio-NER model development with a hybrid deep learning architecture and hyperparameter optimization. Chapter eight provides the implementation of the proposed web service platform on AWS Amazon Web Services) Cloud, with a link for the public to test and assess our three decent models: CRF based supervised ML and tested with biomedical text, Deep Learning based on Neural Networks RNN and CNN and tested with biomedical text, and Hybrid Deep Learning based on Hyperparameter Optimization technique and tested with clinical / Nursing Handover text. Finally, chapter nine describes some of the conclusions drawn from this thesis and the suggestions for future research directions.


1.10 Conclusion

This chapter provided an overview of the entire thesis: it defined the research problem, reviewed the background, identified the rationale, significance and motivation, formulated the research questions, identified the aims, objectives, approach and methodology for addressing them, summarized the research contributions from this work, and provided a road map for the thesis. The next chapter presents the literature review and related work for this research.



CHAPTER TWO

Literature Review and Background Research

2.1 Overview

In this chapter, a review of the literature and prior research on biomedical and clinical named entity recognition is presented, setting the background and providing the rationale for the line of investigation chosen for the research reported in this thesis. First, the related work on Bio-NER approaches is reviewed, followed by the related work on clinical-NER approaches. After this brief review, the details of the design and iterative development of the proposed robust NER framework for enhancing the performance of NER systems are outlined.

2.2 Bio-NER literature review

The Named Entity Recognition (NER) task is a computerized procedure for recognizing and labelling entities in given texts. When used in the biomedical domain, the NER task is referred to as the Bio-NER task, and it is one of the most fundamental tasks in the biomedical text mining workflow. A Bio-NER system allows computer-based recognition and extraction of different biomedical entities (e.g., genes, proteins, chemicals, and diseases) from text documents [31, 32]. Several Bio-NER approaches have been proposed in the literature, with a focus on addressing problems related to specific applications, such as large-scale biomedical data analysis tasks. Some of these applications include biomedical network construction [32], gene prioritization [33], drug repositioning [34], finding literature support for experimental findings [35, 36, 37], hypothesis generation [38, 39, 40], and database curation [40]. Another set of Bio-NER approaches, regarded as very efficient, was aimed at identifying new gene names from text [40], with the Bio-NER processing step included as a primitive step for many downstream information extraction applications, such as relation extraction [41] and knowledge base completion [42, 43, 44, 45, 46, 47]. As the focus of the Bio-NER task was to identify named entities corresponding to chemical compounds, genes, proteins, viruses, disorders, DNAs and RNAs, with documents describing these details available in the many scientific papers, books and other publications produced annually, Bio-NER researchers had access to abundant resources in terms of large databases for investigating Bio-NER systems [4]. MEDLINE is one such large-scale resource in the biomedical domain, in which millions of articles are stored in a regularly updated database [5].

However, much of the Bio-NER research effort focussed on extracting biomedical named entities using handcrafted rule-based approaches [6], and manual building of the rules is a very time-consuming and expensive task. Due to this, Bio-NER researchers quickly adopted machine-learning techniques for identifying biomedical entities automatically, with most of the research focussed on supervised machine learning techniques, an easy option due to the large annotated databases available for this field [7-9]. Data size was not a problem, because, being a highly resourced environment, several large databases were available; nor was the availability of annotated corpora, with several annotated and labelled corpora available to Bio-NER researchers, including the SCAI [11] and GENIA [11] corpora. The problem was the poor performance generally achieved by Bio-NER systems. This could be due to the inherent complexity of Bio-NER terminology and jargon, leading to difficulties in extracting discriminative NLP features and building NER models based on supervised machine learning approaches. Biomedical entities tend to be more complicated than traditional named entities [13, 14]. Campos et al. [5] identified several reasons behind the complexity of the Bio-NER task. Firstly, biomedical entities tend to be very descriptive, such as “normal thymic epithelial cells”. Secondly, biomedical entities appear differently in texts even though they mean the same thing; for example, “N-acetylcysteine” can appear as “N-acetylcysteine” or “NAcetylCysteine”. Thirdly, biomedical abbreviations may sometimes refer to completely different entities; for example, the abbreviation ‘TCF’ may refer to “T-Cell Factor” or “Tissue Culture Fluid”. Finally, biomedical entities contain complex morphology, such as combinations of numbers and punctuation (e.g., 94-KDA). Hence, the Bio-NER task is an inherently complex and challenging task, and the NLP features extracted should have the ability to generalize and discriminate between occurrences of ambiguous biomedical entities as described above. This is because the NLP features play a significant role in building the models in supervised machine learning approaches, and in turn determine the effectiveness of the Bio-NER classification process. As good features need to be extracted for supervised machine learning methods to be successful and lead to good Bio-NER performance, most of the prior research focussed on developing good and efficient NLP features, and not much on different learning strategies (the supervised learning strategy dominated most studies).

Different types of NLP features proposed for the Bio-NER task in previous research works include morphological features (which can contain Boolean, numeric and nominal features), lexical features (containing nominal features only), dictionary-based features (containing Boolean features only), and distance-based features (containing numeric features only). As mentioned previously, compared to traditional named entities, biomedical named entities such as proteins, genes, DNA, and chemical compounds tend to be more complicated in terms of morphological structure. The reason behind the morphological complexity of these entities is the unusual character combinations used to describe them, such as Greek letters, digits, and special characters. For example, some biomedical entities consist of multi-word combinations separated by punctuation, such as ‘Hydro-Oxide’, resulting in poor performance of Bio-NER systems even with a large set of NLP features and supervised learning approaches. Therefore, there is a need to investigate alternative approaches for improving the performance of Bio-NER systems, either by finding better feature representations or improved learning strategies.
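To make the role of such features concrete, a minimal sketch of morphological feature extraction for a single token is given below. The feature names and the word-shape encoding are illustrative choices, not the feature sets used in the cited works:

```python
def morphological_features(token):
    """Extract simple Boolean/nominal morphological features for one token.
    Illustrative only: real Bio-NER systems use much richer feature sets."""
    return {
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        "all_caps": token.isupper(),
        "init_cap": token[:1].isupper(),
        "has_greek": any(g in token.lower() for g in ("alpha", "beta", "gamma", "kappa")),
        # Word shape: uppercase -> X, lowercase -> x, digit -> d, else kept as-is
        "word_shape": "".join("X" if c.isupper() else "x" if c.islower()
                              else "d" if c.isdigit() else c for c in token),
    }

# A morphologically complex biomedical token such as "94-kDa" mixes digits,
# punctuation and mixed case, which defeats simple lexical matching.
features = morphological_features("94-kDa")  # word_shape becomes "dd-xXx"
```

Features of this kind are Boolean or nominal, matching the feature typology described above, and are typically fed to a MaxEnt or CRF learner alongside lexical and dictionary features.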

2.3 Clinical-NER literature review

The prior work done on clinical-NER approaches can be broadly categorized into two main categories: rule-based approaches and machine learning approaches. Rule-based clinical-NER approaches mainly consist of a set of rules and an interpreter to apply the rules. The rules can be patterns of properties that need to be fulfilled by different positions in the EHR text documents, and they can be defined as regular expressions (regex), with a sequence of characters defining a search pattern. From the earlier research works reviewed on clinical-NER approaches, it was found that more than 65% of the approaches are rule-based, with the rules represented as regular expressions. Savova et al. [51] proposed a method based on regular expressions to identify peripheral arterial disease (PAD). A positive PAD was extracted if pre-defined patterns were matched (e.g., for the word sequence “severe atherosclerosis”, the word “severe” was matched with a list of modifiers associated with positive PAD evidence, and the word “atherosclerosis” was matched from a dictionary tailored to the specific task of PAD discovery). Some other works reviewed used logic rules for building clinical-NER models.
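A rule of the kind described above can be sketched as a regular expression. The modifier list and disease dictionary below are hypothetical illustrations, not the actual rules of Savova et al. [51]:

```python
import re

# Hypothetical modifier list and disease dictionary for positive-PAD evidence.
MODIFIERS = r"(severe|significant|extensive)"
DISEASES = r"(atherosclerosis|peripheral\s+arterial\s+disease|claudication)"
PAD_PATTERN = re.compile(MODIFIERS + r"\s+" + DISEASES, re.IGNORECASE)

def has_positive_pad_evidence(sentence):
    """Return True if the sentence matches a modifier + disease-term pattern."""
    return PAD_PATTERN.search(sentence) is not None

hit = has_positive_pad_evidence("Imaging showed severe atherosclerosis of the femoral artery.")
miss = has_positive_pad_evidence("Mild plaque was noted.")
```

The appeal of such rules is transparency; the drawback, as discussed below, is the manual effort needed to author and maintain the modifier lists and dictionaries.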

Sohn and Savova [57] proposed a method based on a set of logic rules for improving smoking status classification. In this approach, the smoking status was first extracted from each sentence, and precedence logic rules were then applied to determine a document-level smoking status. Current smoker has the highest precedence, followed by past smoker, smoker, non-smoker, and unknown; if current smoker was extracted from any sentence in a document, the document was labelled as current smoker. Similar logic rules were used for assigning the final patient-level smoking status: if there is a current smoker document, then, since current smoker has higher precedence than past smoker, even if there is also a past smoker document belonging to that patient, the patient is still assigned current smoker status. One major problem with such rule-based clinical-NER approaches developed by prior researchers is that the composition of the rules is done by a human knowledge engineer working from knowledge bases or with the direct involvement of an actual clinical expert. Both are very expensive and time-consuming, as manual feature engineering and collaboration with physicians are required, though it is an efficient approach, with the physician's knowledge and experience embedded in the rules. Several expensive knowledge base tools were constructed and made available by commercial software companies, with computerized database systems for storing complex structured information. These commercial software suites used similar NER approaches to extract named entities from unstructured free-text notes.
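The precedence logic described above can be sketched as follows. The label names come from the description; the code itself is an illustrative reconstruction, not Sohn and Savova's implementation:

```python
# Precedence order, highest first, as described for document-level resolution.
PRECEDENCE = ["current smoker", "past smoker", "smoker", "non-smoker", "unknown"]

def resolve_status(sentence_labels):
    """Collapse per-sentence smoking statuses into a single status by
    picking the label with the highest precedence that occurs."""
    for status in PRECEDENCE:
        if status in sentence_labels:
            return status
    return "unknown"

doc_status = resolve_status(["unknown", "past smoker", "current smoker", "non-smoker"])
# Patient-level resolution works the same way over document-level labels.
patient_status = resolve_status(["past smoker", "current smoker"])
```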
There were also some popular, open-source structured knowledge-based tools, including the UMLS meta-thesaurus for medical concepts, PheWAS for phenome-wide association studies (disease-gene relations) [90], and DrugBank for drug-gene interactions [91]. Other similar tools include MetaMap, used by Martinez et al. [69] for mapping phrases into UMLS medical concepts, and RadLex, used by Hassanpour and Langlotz [53] for identifying semantic classes for terms in radiology reports by using a controlled lexicon of radiology terminology. Elkin et al. [92] used the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT) medical terminology to code signs, symptoms, diseases, and other findings of influenza from encounter notes.

There were also some research efforts on machine learning-based clinical-NER approaches using shallow learning techniques, which appeared to be successful, with acceptable performance and efficiency in many shared tasks [93-96]. The shallow machine learning approach based on the Support Vector Machine (SVM) appears to be the method most frequently employed by researchers for clinical-NER tasks in the earlier work. An integrated approach combining feature-based SVM classification and template-based clinical-NER was proposed by Barrett et al. [97]. Roberts et al. [94] proposed an approach based on SVMs with various features to extract anatomic sites of appendicitis-related findings. An automatic text classification approach for detecting adverse drug reactions using SVMs was proposed by Sarker et al. [98]. A study to classify chronic obstructive pulmonary disease using SVMs was conducted by Himes et al. [99] for asthma patients recorded in electronic medical records. Another machine learning approach used by researchers in some previous studies is logistic regression (LR), particularly for entity and relation extraction and downstream processing.

Chen et al. [100] used the LR approach in a clinical-NER system for detecting geriatric competency exposures assessed from students' clinical notes, and Rochefort et al. [101] employed multivariate LR for detecting adverse events in EHRs. Another line of investigation, which relied on sequence-based algorithms due to the inherently sequential structure of NLP text, is the use of the Conditional Random Field (CRF) algorithm for clinical-NER tasks. Deleger et al. [23] used CRFs for extracting Paediatric Appendicitis Score (PAS) elements from clinical notes, and Li et al. [60] employed CRFs for detecting medication names and attributes from clinical notes to extract medication discrepancies automatically. In addition, some researchers used post-processing approaches on top of traditional machine learning-based algorithms. Yadav et al. [102] extracted NLP features corresponding to complete medical words in the text, and then utilized those features as input for a decision tree to classify emergency department computed tomography imaging reports. A few other research works reported comparative studies with different machine learning approaches. Zhou et al. [86] compared SVM, Generalized Nearest Neighbor (NNge), and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) for the identification of patients with depression in free-text clinical documents; using decision trees (DT) for post-processing, they found that the combination of DT and NNge gave the best F-measure at high confidence, while at intermediate confidence RIPPER outperformed the other approaches.
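Sequence labellers such as CRFs differ from per-token classifiers in that they score entire label sequences and decode with the Viterbi algorithm. A minimal decoding sketch with toy log-scores (hypothetical emission and transition tables, not a trained CRF) is:

```python
def viterbi(tokens, labels, emit, trans):
    """Find the highest-scoring label sequence for a token list.
    emit[(token, label)] and trans[(prev_label, label)] are log-scores;
    missing entries default to a large negative value (effectively disallowed)."""
    NEG = -1e9
    # best[l] = (score, path) over sequences ending in label l
    best = {l: (emit.get((tokens[0], l), NEG), [l]) for l in labels}
    for tok in tokens[1:]:
        new_best = {}
        for l in labels:
            score, path = max(
                (best[p][0] + trans.get((p, l), NEG) + emit.get((tok, l), NEG),
                 best[p][1] + [l])
                for p in labels)
            new_best[l] = (score, path)
        best = new_best
    return max(best.values())[1]

labels = ["O", "B-med", "I-med"]
emit = {("took", "O"): 0.0, ("aspirin", "B-med"): 0.0, ("daily", "O"): 0.0}
# Transitions omit O -> I-med, enforcing BIO-style constraints at decode time.
trans = {("O", "O"): 0.0, ("O", "B-med"): 0.0, ("B-med", "O"): 0.0,
         ("B-med", "I-med"): 0.0, ("I-med", "O"): 0.0}
path = viterbi(["took", "aspirin", "daily"], labels, emit, trans)
```

A real CRF learns the emission and transition scores from data; the decoding step, however, has exactly this structure.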

A comparative review of NER algorithms for biomedical vs. clinical tasks revealed that there is a significantly larger body of work in the Bio-NER field than in the clinical-NER field, and that, in general, clinical-NER research works had a narrow application focus, with model development done in a low-resource environment with small and sparse datasets, and lacked translational effects and generalization ability in the NER models. Comparatively, the earlier work on automated techniques and machine learning for Bio-NER led to better translational outcomes, and the Bio-NER models had better robustness and generalization abilities and could be easily adapted to different application contexts. This considerable gap could be due to the lack of openly available EHR data for NLP experts and machine learning engineers, owing to the privacy and security protocols associated with EHR data (particularly the Health Insurance Portability and Accountability Act (HIPAA) privacy rule and institutional concerns [169]), as compared to Bio-NER, where the data used by NLP experts is freely available in the public domain as publications and reports.


Further, in clinical-NER contexts, there are trust issues associated with relying on completely automated clinical-NER systems, and the clinical community in general prefers rule-based approaches, which allow a medical professional's knowledge base to be embedded within the system. Currently, more than 60% of the clinical-NER approaches surveyed are rule-based. Even the commercial clinical-NER systems developed by large vendors, such as IBM, SAP, and Microsoft, use completely rule-based approaches, since rule-based clinical-NER systems can incorporate domain knowledge from knowledge bases curated by medical experts.

One possible solution to address this low-resource situation, in terms of improving the state of the art for clinical-NER systems, is to use alternative strategies: leveraging the similarities in annotated common data available from another domain, or using a different type of learning strategy, based on semi-supervised learning or an appropriate deep machine learning technique, for example, instead of fully supervised shallow machine learning. The research reported in this thesis uses some of these innovative strategies and proposes a novel robust NER computational framework based on different types of machine learning techniques. The line of investigation and the design of this innovative computational framework, in terms of a step-by-step incremental and iterative development and implementation strategy, is outlined in the next section.

2.4 Robust NER Framework Design

The design and development of the proposed robust NER framework is as outlined below:

1. The biomedical named entity recognition (Bio-NER) task chosen was the identification of biomedical text terms (five classes): RNA, protein, cell type, cell line, and DNA. For a baseline comparison, traditional shallow machine learning models based on supervised learning were examined first, including the Maximum Entropy (MaxEnt) and Conditional Random Field (CRF) models. For experimental evaluation and validation, the benchmark GENIA dataset from the BioNLP/NLPBA 2004 shared task was used, with validation/evaluation experiments conducted on the training and evaluation sets provided by the task organizers.

2. The next step was to investigate the performance improvements possible for the Bio-NER task with an alternative learning strategy based on semi-supervised learning, using the same shallow learning models as in step (1), the MaxEnt and CRF models, with evaluation and validation on the same benchmark GENIA dataset.

3. The third step was to investigate the performance improvements possible for the Bio-NER task with recent deep learning models based on different architectures, including FFN, RNN, and hybrid CNN-RNN/LSTM models built by combining one or more deep and shallow learning architectures, with experimental evaluation on the same benchmark GENIA dataset. This concluded the investigations towards improving the Bio-NER system.

4. The final step involved enhancing the robustness of the Bio-NER models with automatic hyper-parameter optimization, and extending and adapting them for building clinical-NER models. The clinical-NER task represents a low-resourced and more complex context than the well-resourced Bio-NER context, due to the requirement to identify 35 different classes with a dataset that is small, sparse, synthetic, and noisy, representing a nursing handover context in the clinical domain. For experimental evaluation and validation, the benchmark dataset from the CLEF eHealth 2016 Task 1 shared task on handover information extraction was used, with validation/evaluation experiments conducted on the training and evaluation sets provided by the task organizers.

The development and implementation details of each of the steps described above are presented in the next few chapters of the thesis.

2.5 Conclusion

This chapter presented a review of relevant literature and related work, and has set the background for determining the line of investigation for this research. First, the prior work on Bio-NER approaches reported in the research literature was discussed, followed by the related work on clinical-NER approaches. Based on this thorough review of earlier studies, the problems that need to be addressed for improving the state of the art were identified, along with a strategy for the design and development of an innovative cross-domain NER computational framework to address this gap. The chapter concluded with a step-by-step implementation plan for the development of this framework, with brief details on what each implementation step entails. The next chapter describes the details of the first step of the proposed NER framework, which serves as the performance baseline for the rest of the research contributions in this thesis.


DECLARATION OF CO-AUTHORED PUBLICATIONS

For use in theses which include publications. This declaration must be completed for each co-authored publication and placed at the start of the thesis chapter in which the publication appears.

Declaration for Thesis Chapter [3]

Declaration by candidate

In the case of Chapter [4], the nature and extent of my contribution to the work was the following:

Nature of contribution: I managed to improve the performance of the named entity recognition system by using the Maximum Entropy and active learning algorithms based on a special formulation of supervised machine learning, which allowed analyzing the relationship between labelled and unlabelled data using bootstrapping techniques for building a good model. I was able to achieve a considerable improvement for the named entity recognition system compared to the other two previously participating systems in the Biomedical Natural Language Processing 2004 Challenge task. http://www.globalcis.org/jnit/ppl/JNIT384PPL.pdf

Contribution %: 57.35

The following co-authors contributed to the work:

Name: Dr Girija Chetty. Nature of contribution: Advised on approaches to achieve result models with the highest performance. Contributor is also a student at UC (Y/N): N

Name: Dr Thoai Man Luu. Nature of contribution: Advised on new technologies and assisted in various experiments to obtain a better result. Contributor is also a student at UC (Y/N): N

Candidate's Signature ______________ Date: 19/11/2019

DECLARATION BY CO-AUTHORS

The undersigned hereby certify that:

(1) the above declaration correctly reflects the nature and extent of the candidate's contribution to this work, and the nature of the contribution of each of the co-authors;

(2) they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation of at least that part of the publication in their field of expertise;

(3) they take public responsibility for their part of the publication, except for the responsible author who accepts overall responsibility for the publication;

(4) there are no other authors of the publication according to these criteria;

(5) potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and

(6) the original data are stored at the following location(s) and will be held for at least five years from the date indicated below:

[Please note that the location(s) must be institutional in nature, and should be indicated here as a department, centre, or institute, with specific campus identification where relevant.]

Location(s): University of Canberra

Signatures Date

20/12/2019

20/12/2019


CHAPTER THREE

Bio-NER Model Development Based on Maximum Entropy Learner and Supervised Machine Learning

3.1 Introduction

This chapter presents the details of the Bio-NER model development based on the MaxEnt machine learning algorithm and a bootstrapped incremental supervised learning technique. This model serves as a baseline reference system and allows assessment of the other, more advanced machine learning models proposed in this work. For model building and evaluation, the GENIA BioNLP/NLPBA 2004 shared task/challenge dataset, a benchmark biomedical named entity recognition dataset, was used. The performance of the Bio-NER model was assessed in terms of precision, recall, and F-measure. The details of the model development and evaluation are described in the next few sections of this chapter.

3.2 Bio-NER model based on MaxEnt Algorithm

The baseline Bio-NER system development involved the use of the MaxEnt algorithm and the benchmark GENIA dataset, which is described in detail in Appendix 1. The model building steps combined the maximum entropy (MaxEnt) principle with an active/incremental supervised learning approach. In this learning approach, the algorithm iteratively collects new data samples by obtaining unlabelled data during the training iterations, instead of using a large block of training data at the beginning.

The Maximum Entropy approach provides a good baseline framework for integrating information from many heterogeneous sources for classification and, combined with good NLP features, can lead to the construction of good statistical models for classification tasks. Several example applications using this framework can be found in the OpenNLP Tools Library [73]. The MaxEnt learning algorithm has been used in several systems to support common NLP-NER tasks, such as finding the names of people, organizations, or locations in news text; automatic classification of Twitter search results into categories; and suggestion of correct spellings of queries.

For building the baseline Bio-NER system with the Maximum Entropy (MaxEnt) learner, we focussed on ten types of clinical entities from the Unified Medical Language System (UMLS) Semantic Groups: Anatomy, Chemicals and Drugs, Devices, Disorders, Geographic Areas, Living Beings, Objects, Phenomena, Physiology, and Procedures [74]. Table 1 shows the subset of clinical entities, defined as per UMLS Semantic Group and Semantic Type members, used for building the proposed MaxEnt-based Bio-NER model. The entities were defined based on the UMLS meta-thesaurus.

The UMLS meta-thesaurus unifies concepts from several dozen terminologies in the biomedical domain, including linked terms and relations. The Semantic Network comprises Semantic Types and Semantic Relations, which are organized hierarchically. The 134 Semantic Types can be clustered into 15 Semantic Groups [75]. Each concept in the UMLS is assigned a Concept Unique Identifier (CUI), a set of terms, and one or more Semantic Types. Semantic Groups were designed so that each concept could be assigned to only one semantic category.

The GENIA BioNLP/NLPBA 2004 benchmark dataset provided by the challenge organizers contained annotations at two granularity levels: Semantic Groups and Concepts. The annotators used freely available tools for accessing the UMLS in English and in French in order to assess whether a mention in the text referred to a specific Semantic Group and to identify the specific concept.

The fundamental principle behind the MaxEnt learner is the Maximum Entropy Principle, a commonly used technique that provides the probability of a token belonging to a class. MaxEnt computes the probability p(o|h) for any outcome o from the space of all possible outcomes O, and for every history h from the space of all possible histories H. In NER, the history can be viewed as all information derivable from the training corpus relative to the current token. The computation of the probability p(o|h) of an outcome for a token in MaxEnt depends on a set of features that are helpful in making predictions about the outcome. Given a set of features and a training corpus, the MaxEnt estimation produces a model in which every feature f_i has a weight α_i. The conditional probability can then be computed as [71, 77, 78, 159]:

p(o|h) = (1/Z(h)) · Π_i α_i^f_i(h,o)

where Z(h) is the normalization constant, which forces the sum of p(o|h) over the different output class labels o to be one for every input h. In other words, the conditional probability of an outcome is the product of the weights of all active features, normalized over the corresponding products for all outcomes.
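The conditional probability just described can be sketched directly in code. The features and weights below are illustrative toy values, not a trained model:

```python
def maxent_prob(outcome, history, outcomes, features, alphas):
    """p(o|h): product of weights of active features, normalized by Z(h).
    features: list of binary functions f_i(h, o); alphas: their weights."""
    def unnorm(o):
        w = 1.0
        for f, a in zip(features, alphas):
            if f(history, o):
                w *= a
        return w
    z = sum(unnorm(o) for o in outcomes)  # Z(h): normalization constant
    return unnorm(outcome) / z

# Illustrative: one feature fires for capitalized tokens labelled B-protein,
# another for lowercase tokens labelled O.
features = [lambda h, o: h[:1].isupper() and o == "B-protein",
            lambda h, o: h.islower() and o == "O"]
alphas = [4.0, 4.0]
outcomes = ["B-protein", "O"]
p = maxent_prob("B-protein", "IL2", outcomes, features, alphas)  # 4 / (4 + 1) = 0.8
```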

The GENIA BioNLP/NLPBA 2004 shared task corpus used for model building and evaluation of the Bio-NER task contained annotations in BIO format, where ‘B-ne’ marks a word that begins a named entity (NE) of type ‘ne’, ‘I-ne’ marks the remaining words of that NE (if the NE contains more than one word), and ‘O’ marks non-name words. Some tag sequences can never occur. For example, ‘I-ne’ should not occur after an ‘O’ tag, and ‘I-ne2’ should not occur after a ‘B-ne1’ or ‘I-ne1’, where ‘ne1’ and ‘ne2’ are two different NE classes.
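These constraints can be checked mechanically; a small sketch of a BIO tag-sequence validator is:

```python
def is_valid_bio(tags):
    """Check BIO constraints: an 'I-ne' tag must directly follow a 'B-ne'
    or 'I-ne' tag of the same entity type 'ne'."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            ne = tag[2:]
            if prev not in ("B-" + ne, "I-" + ne):
                return False  # e.g. 'I-ne' after 'O', or 'I-ne2' after 'B-ne1'
        prev = tag
    return True

ok = is_valid_bio(["O", "B-protein", "I-protein", "O"])
bad_after_o = is_valid_bio(["O", "I-protein"])     # 'I-ne' after 'O'
bad_cross = is_valid_bio(["B-DNA", "I-protein"])   # 'I-ne2' after 'B-ne1'
```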


Table 1: Subset of clinical entities, defined as per UMLS Semantic Group and Semantic Type members, used in the proposed MaxEnt learner.


We used the Generalized Iterative Scaling (GIS) algorithm [77] to build the MaxEnt model, using a training subset of the GENIA corpus. Figure 1 shows the GIS Algorithm.

Figure 1: GIS Algorithm to train the MaxEnt model.

The advantage of this simple GIS-based approach is that it is able to bootstrap from the test data, eliminating low-confidence data, selecting only the high-confidence data, and combining it with the original training data to generate a good Maximum Entropy model. The GIS procedure should terminate [78] after a fixed number of iterations (100 in our case), or when the change in log-likelihood is negligible. Figure 2 shows the active/incremental learning process based on the GIS algorithm: the dashed line depicts the behavior of the posterior probability-based confidence metric, while the asterisks at the bottom denote the unlabelled data selected for their high confidence scores [79].
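A toy version of the GIS update can be sketched as follows. This is a simplified illustration with binary features and no correction feature (the constant C is approximated by the maximum feature count), not the implementation used in this work:

```python
import math

def gis_train(data, features, outcomes, iters=50):
    """Generalized Iterative Scaling for a tiny MaxEnt model.
    data: (history, outcome) pairs; features: binary functions f_i(h, o).
    Sketch only: proper GIS adds a correction feature so that the feature
    counts sum to the same constant C for every (h, o)."""
    C = max(sum(f(h, o) for f in features) for h, _ in data for o in outcomes)
    alphas = [1.0] * len(features)

    def prob(o, h):
        un = {oo: math.prod(a ** f(h, oo) for f, a in zip(features, alphas))
              for oo in outcomes}
        return un[o] / sum(un.values())

    # Empirical feature expectations from the training data
    emp = [sum(f(h, o) for h, o in data) / len(data) for f in features]
    for _ in range(iters):
        # Model feature expectations under the current weights
        mod = [sum(prob(o, h) * f(h, o) for h, _ in data for o in outcomes) / len(data)
               for f in features]
        for i in range(len(features)):
            if emp[i] > 0 and mod[i] > 0:
                alphas[i] *= (emp[i] / mod[i]) ** (1.0 / C)
    return prob

# Toy task: label capitalized tokens 'B', lowercase tokens 'O'.
train = [("Gene", "B"), ("the", "O"), ("Protein", "B"), ("cell", "O")]
feats = [lambda h, o: int(h[:1].isupper() and o == "B"),
         lambda h, o: int(h.islower() and o == "O")]
prob = gis_train(train, feats, ["B", "O"])
```

After training, `prob("B", "Xyz")` is close to 1 on this separable toy data, mirroring how repeated GIS updates pull the model expectations toward the empirical ones.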


Figure 2: Relationship between the unlabelled data and their high confidence scores.

The active/incremental learning algorithm for MaxEnt learner was implemented in a stepwise fashion as shown below:

i) First, the NLP feature set is extracted from the training dataset, and the Maximum Entropy learner uses these features to build a temporary draft model.

ii) The draft model is validated on an unlabelled test data subset, and the test data is classified into low-confidence and high-confidence groups, categorized as Test Result 1 and Test Result 2 respectively.

iii) The high-confidence unlabelled data [76, 80] (Test Result 2) is selected, combined with the original training data, and the Bio-NER model is rebuilt with the MaxEnt learning algorithm, yielding a refined version of the model.

iv) The refined model is evaluated on the complete test data set, and the final test results are obtained. The final test results provide an estimate of the performance of the model, assessed in terms of the evaluation measures suggested by the GENIA organizers: the quantitative measures precision, recall, and F-measure.

v) The special characteristic of this bootstrapping approach is that the model is refined iteratively, gaining more knowledge and producing better results with more iterations.
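The steps above can be sketched as a generic bootstrap loop. This is a minimal illustration, not the thesis implementation: `train_fn` and `predict_fn` stand in for the MaxEnt/GIS trainer and the posterior-probability confidence estimate, and the 0.9 threshold is an assumed value.

```python
def bootstrap(train_data, unlabelled, train_fn, predict_fn,
              iterations=3, threshold=0.9):
    """Active/incremental learning loop (steps i-v above).

    train_fn(pairs)       -> model, trained on (example, label) pairs
    predict_fn(model, x)  -> (label, confidence in [0, 1])
    """
    model = train_fn(train_data)                      # i) temporary draft model
    for _ in range(iterations):
        # ii)-iii) keep only the high-confidence classifications (Test Result 2)
        high_conf = [(x, predict_fn(model, x)[0])
                     for x in unlabelled
                     if predict_fn(model, x)[1] >= threshold]
        # rebuild the model on the original training data plus selected data
        model = train_fn(train_data + high_conf)      # v) iterative refinement
    return model
```

Low-confidence classifications (Test Result 1) are simply discarded; only the original labels plus the confidently self-labelled data reach the next training round.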

Figure 3: Maximum Entropy Learner with Active/Incremental Learning.

3.3 Experimental Work

The Bio-NER model based on a simple MaxEnt algorithm can serve as a good baseline for comparison with the other advanced algorithms proposed in this thesis. Figure 3 shows the overall architecture of the process for obtaining the Maximum Entropy learner based on the bootstrapped incremental learning algorithm. Table 2 shows the evaluation results of the previously participating systems' performance for each subset of data.

Table 2: Shared Task Report for the previously participating systems in the challenge.

The performance of each system, shown in Table 2, was assessed in terms of three related performance measures: precision, recall, and F-measure. The F-measure or F-score is the weighted harmonic mean of precision and recall [155]. Precision is the percentage of annotations that are correct, and recall is the percentage of the total named entities (NEs) that are successfully annotated. The general expression for the F-score is F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall); the value of β is taken as 1, giving F1 = 2 · Precision · Recall / (Precision + Recall).
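As a concrete check of the F-measure definition, the weighted harmonic mean can be computed directly from precision and recall:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta = 1 gives
    the balanced F1 score used in the GENIA shared-task evaluation."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, a system with precision 0.70 and recall 0.66 scores F1 of roughly 0.679, always between the two values and closer to the lower one.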

The BioNLP/NLPBA GENIA corpus used in this shared task is one of the benchmark databases, and is very popular for most Bio-NER tasks. As recently as 2016, most Bio-NER approaches used this corpus as a baseline for evaluating automatic named entity recognition based on novel machine learners [8].

The MaxEnt model building in this work uses an incremental learning strategy facilitated by bootstrapping techniques, which allows iterative model building and refinement using small subsets of data, instead of a single large block of training data that can suffer from performance and convergence issues. The subsets provided for training and testing were segregated into 10-year window blocks. The four subsets available and used in this work are the 1978-1989 set (which represents an old age from the viewpoint of models trained using the training set), the 1990-1999 set (which represents the same age as the training set), the 2000-2001 set (which represents a new age compared to the training set), and the S/1998-2001 set (which represents roughly a new age in a super domain). The last subset represents a super domain, and its abstracts were retrieved with the MeSH terms `blood cells' and `transcription factors' (without `human'). The S/1998-2001 set includes the whole 2000-2001 set. The performance reported by the previously participating systems in the shared task/challenge, in terms of percentage precision/recall/F-measure for each named entity class and the average performance over all classes for each subset, is shown in Figures 4 - 8 for reference.

Figure 4: Results of MaxEnt based BioNER approach for 1978-1989 subset.

Figure 5: Results of MaxEnt based BioNER approach for 1990-1999 set.


Figure 6: Results of MaxEnt based BioNER approach for 2000-2001 set.

Figure 7: Results of MaxEnt based Bio-NER approach for S1998-2001 set.

Figure 8: Expected results for the proposed NER approach (Maximum entropy learner with active/incremental learning).

The performance expectation of the proposed MaxEnt-based Bio-NER approach for each subset of data was to achieve at least the best average performance figures, over all named entity classes, of the previously participating systems in the challenge (shown circled in red in Figures 4 - 8). The actual performance achieved by the MaxEnt-based Bio-NER model is shown in Table 3 and Figure 9. As can be seen, the proposed Bio-NER model based on the MaxEnt approach (Pha16, in red in Table 3 and as the blue bars in Figure 9) results in an acceptable performance in terms of precision, recall, and F-measure, as compared to the other participating systems in the shared task/challenge.

Table 3: Performance of the proposed NER approach with respect to participating systems on each test subset.

Figure 9 is a graphical representation of the results, showing the improved performance of the proposed MaxEnt-based Bio-NER approach (Pha16) as compared to two other closely competing systems, the Lee04 NER and the Baseline NER system provided by the challenge organizers.

Figure 9: Comparative performance of proposed MaxEnt Bio-NER with Lee04 and Baseline systems.


3.4 Conclusion

This chapter presented the details of the Bio-NER model development based on the MaxEnt machine learning algorithm and a bootstrapped incremental supervised learning technique. This model serves as a baseline reference system and allows assessment of the other advanced machine learning models proposed in this work. For model building and evaluation, the GENIA BioNLP/NLPBA 2004 shared task/challenge dataset, a benchmark biomedical named entity recognition dataset, was used. The performance of the Bio-NER model was assessed in terms of precision, recall, and F-measure. The MaxEnt incremental/bootstrapped supervised learning approach for the Bio-NER task resulted in acceptable performance, better than the two closely competing systems in the challenge. This modeling approach serves as the baseline for comparing the other innovative approaches proposed in the thesis. The next chapter presents details of an enhanced Bio-NER approach based on Conditional Random Fields, semi-supervised learning, and NLP features.


CHAPTER FOUR

Bio-NER Model Development based on Conditional Random Fields with Semi-Supervised Machine Learning

4.1 Overview

In this chapter, we propose a novel approach for the Bio-NER task based on the Conditional Random Field (CRF) algorithm. The approach is based on a special formulation of semi-supervised Machine Learning (ML), which builds the Bio-NER model by analyzing the relationship between labeled and unlabelled data using the Generalized Expectation (GE) criteria method, and has the potential to improve NER system performance significantly. The model development, involving model building and evaluation, was done using the benchmark GENIA data set from the BioNLP/NLPBA 2004 shared task/challenge. The performance of the system was assessed in terms of several quantitative measures, including precision, recall, and final score (F-score). The details of this approach and the outcomes achieved are described in the rest of this chapter.

Semi-supervised learning, where a small amount of human annotation is combined with a large amount of unlabelled data to yield an accurate classifier, has received a significant amount of attention from the research community. GE (Generalised Expectation) criteria are a simple, robust, and scalable method for semi-supervised training using weakly-labeled data.


4.2 Proposed Conditional Random Field Scheme

Conditional Random Field (CRF) models are undirected graphical models based on a probabilistic scheme for labelling and segmenting structured data [10], such as sequences, trees, and lattices. A key advantage of CRFs is their great flexibility to include a wide variety of arbitrary, non-independent features of the input; for example, more unlabelled data can be combined with less labeled data, and a variety of features or deep domain knowledge can be added, improving the accuracy of the results.

The primary advantage of CRFs over other popular probabilistic models, such as hidden Markov models (HMMs), is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs [81-83] to ensure tractable inference. Moreover, the NER model built with the CRF algorithm, based on a semi-supervised learning algorithm, can be transformed into a large-scale machine learning algorithm suitable for complex health data. The results of the CRF model based on semi-supervised learning are described next.

For building the NER model based on the CRF learner, the BioNLP/NLPBA shared task corpus [7] was used for evaluating the CRF algorithm, using the BIO format, in which, to agree with the convention of the NER community:

B: denotes the beginning token of an NE,

I: denotes each following token inside the NE,

O: denotes a token that is not part of any NE.
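As an illustration (the sentence and its tags below are invented, not drawn from the corpus), a BIO-tagged sequence and a small helper that recovers entity spans from it might look like:

```python
# Hypothetical sentence with one DNA entity in the GENIA class scheme.
tokens = ["The", "IL-2", "gene", "is", "expressed", "."]
tags   = ["O", "B-DNA", "I-DNA", "O", "O", "O"]

def extract_entities(tokens, tags):
    """Recover (entity_text, entity_type) spans from BIO tags."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):              # a new entity starts here
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # continuation of the entity
            current.append(tok)
        else:                                  # O tag closes any open entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities
```

Here `extract_entities(tokens, tags)` recovers the single span ("IL-2 gene", "DNA"), which is the unit on which precision and recall are scored in the shared task.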


Table 18 and Table 19 (in Appendix 1) show the train and test data in the BIO format from the first dataset of the BioNLP challenge task.

Table 4 shows the rich text features used to build the models; these improve the performance of the CRF model, in terms of average word detections and F-scores, over previous models that used fewer features or no features at all. Table 5 shows the event list constructed while building the NER model, and Tables 6 - 9 show the outcomes obtained from these rich features.

Table 4: Rich Text Features used to create the CRF Model.

Table 5: Construction of an event List with Features (an Outcome and a Context).

We used a simple semi-supervised learning algorithm for named entity recognition (NER) based on conditional random fields (CRFs), which uses a wide variety of features [76]. This is achieved by using Generalized Expectation (GE) criteria to express a preference for parameter settings in which the model's distribution on data matches a target distribution. The use of GE criteria allows a dramatic reduction in annotation time [84] when limited human effort is available, and allows adapting the CRF-based base learner with semi-supervised learning to large-scale machine learning tasks that work well for complex health data.

The GE criteria were used to train our CRF model. Figure 10 shows how the semi-supervised learning algorithm with GE criteria allows learning of the NER model using CRFs.

Assume that we have a small amount of labeled data L and a classifier Ck that is trained on L. We exploit a large unlabelled corpus U from the test domain, from which we automatically and gradually add new training data D to L, such that L has two properties:

1) Accurately labeled, meaning that the labels assigned by automatic annotation of the selected unlabelled data are correct; and

2) Non-redundant, meaning that the new data is from regions in the feature space that the original training set does not adequately cover.

Thus, the classifier Ck is expected to improve monotonically as the training data gets updated.

Figure 10: Semi-supervised learning algorithm using GE criteria for building CRF model.


At each iteration, the classifier trained on the previous training data (using the features introduced in the previous section) is used to tag the unlabelled data. In addition, for each O token and NE segment, a confidence score is computed using the constrained forward-backward algorithm [85], which calculates LcX, the sum of the probabilities of all the paths passing through the constrained segment (constrained to be the assigned labels).

One way to increase the training data is to add all the tokens classified with high confidence to the training set. This scheme is unlikely to improve the accuracy of the classifier at the next iteration because the newly added data is unlikely to include new patterns. Instead, we use the high confidence data to tag other data by exploiting independent features.

Since the features of each token include features copied from its neighbours, in addition to those extracted from the token itself, its neighbours also need to be added to the training set. If the confidences of the neighbours are low, the neighbours are removed from the training data after copying their features to the token of interest. If the confidence scores of the neighbours are high, we extend further to the neighbours of the neighbours, until low-confidence tokens are reached. We then remove low-confidence neighbours to reduce the chance of adding training examples with false labels. Figure 11 shows the overall architecture of the NER system for building a decent CRF model.
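A simplified sketch of this neighbour rule, assuming each token is a feature dictionary with one confidence score; the thesis uses the constrained forward-backward scores, and the 0.9 threshold and the `nbr_` feature naming here are illustrative only (this sketch also stops at immediate neighbours rather than recursing):

```python
def expand_high_confidence(tokens, confidences, threshold=0.9):
    """Select high-confidence tokens for the training set, copying each
    immediate neighbour's features onto them. Low-confidence neighbours
    contribute their features but are not themselves added; high-confidence
    neighbours are selected in their own right by the same rule."""
    selected = []
    for i, feats in enumerate(tokens):
        if confidences[i] < threshold:
            continue                                   # skip low-confidence tokens
        merged = dict(feats)
        for j in (i - 1, i + 1):                       # immediate neighbours
            if 0 <= j < len(tokens):
                for name, value in tokens[j].items():  # copy neighbour features
                    merged.setdefault("nbr_" + name, value)
        selected.append(merged)
    return selected
```

The effect is that only confidently labelled tokens enter the augmented training set, while their context still informs the features they carry.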


Figure 11: The CRF learner based on semi-supervised learning with Generalized Expectation criteria and rich text features.

As can be seen in Table 11, the proposed strategy for adapting the CRF model and its evaluation with BioNLP/NLPBA 2004 shared task corpus, resulted in promising outcomes, and the results from five prominent experiments are described next.


4.3 Experimental Work

In the first experiment, only the training data (40% labelled, 60% unlabelled) was submitted to the CRF learner, without adding any text features, to obtain the first model, CRF_Model_1, with an F-score of 7.20%. Table 6 shows the performance of CRF_Model_1 in terms of recall, precision, and F-score.

Table 6: Performance of CRF_Model_1.

In the second experiment, we again used the training data (20% labelled, 80% unlabelled) and added the rich text features F1-F6 shown in Table 4. We observed that CRF_Model_2 improved, with an F-score of 32.51%. Table 7 shows the performance of CRF_Model_2 in terms of recall, precision, and F-score.

Table 7: Performance of CRF_Model_2.

In the third experiment, we increased to 40% labelled data and 60% unlabelled, with the rich text features F1-F6 shown in Table 4. We observed that CRF_Model_3 resulted in better performance, with an F-score of 65.32%. Table 8 shows the performance of CRF_Model_3.

Table 8: Performance of CRF_Model_3.

CHAPTER FOUR 59

In the fourth experiment, we increased to 100% labelled training data and 60% unlabelled test data, with the rich text features F1-F6 shown in Table 4. We observed that CRF_Model_4 again resulted in better performance, with an F-score of 68.72%. Table 9 shows the performance of CRF_Model_4.

Table 9: Performance of CRF_Model_4.

In the fifth and final experiment, we used the maximum available data from the subsets of the corpus: 100% labelled from the training data and 100% unlabelled from the test data, with the rich text features F1-F6 shown in Table 4. We observed that CRF_Model_5 resulted in superior performance, with an F-score of 68.74%. Table 10 shows the performance of CRF_Model_5.

Table 10: Performance of CRF_Model_5.

The details of the four test subsets are described under Evaluation Data in [7]. This research evaluated these test subsets and outperformed most of the previous systems in terms of recall, precision, and F-score. The results show that the proposed CRF model approach achieved an F-score of 68.74%, a substantially better performance than the other participating systems in the challenge [7]. Table 11 and Figure 12 show the entity recognition performance of each participating system on each test subset.


Table 11: Comparative performance of CRF_Model_5 with respect to participating systems on each test subset.

Figure 12: Comparative performance of the Pha18CRF NER with previous systems.


4.4 Conclusion

The innovation in the work proposed in this chapter involves the development of a large-scale machine learning technique based on a semi-supervised CRF learner, resulting in improved performance. The experimental work involved using a base learner, the CRF model, adapting it with a novel training algorithm (the semi-supervised learning algorithm), and evaluating it on the benchmark BioNLP/NLPBA shared task corpus. It was possible to achieve an F-score of 68.74% for the CRF model using semi-supervised learning, outperforming the other six participating systems in the BioNLP challenge task. In the next chapter, the combination of several rich text features investigated to further enhance the performance of the semi-supervised CRF learner is discussed.


DECLARATION OF CO-AUTHORED PUBLICATIONS

For use in theses which include publications. This declaration must be completed for each co-authored publication and to be placed at the start of the thesis chapter in which the publication appears.

Declaration for Thesis Chapter [5, 6]

Declaration by candidate: in the case of Chapters [6, 7], the nature and extent of my contribution to the work was the following:

Nature of contribution: I improved the performance of the multilevel named entity recognition scheme: CRF (Conditional Random Field) based on supervised machine learning, experimenting with abundant feature sets; and hybrid Deep Learning based on a variety of inputs, using word-embedding techniques to convert text to numeric form for processing. I was able to achieve a remarkable improvement in performance for both systems, with results (70.50 for CRF and 70.32 for Deep Learning) higher than the previously participating systems in the BioNLP/NLPBA 2004 challenge task.

https://ieeexplore.ieee.org/abstract/document/8614015

The following co-authors contributed to the work:

Dr Girija Chetty — advised on approaches to achieve result models with the highest performance. Contributor is also a student at UC: N.

Dr Thoai Man Luu — assisted in the experiments and reviewed the conference paper. Contributor is also a student at UC: N.

Candidate's Signature ______________ Date: 20/12/2019

DECLARATION BY CO-AUTHORS

The undersigned hereby certify that: (1) the above declaration correctly reflects the nature and extent of the candidate’s contribution to this work, and the nature of the contribution of each of the co-authors.


(2) they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation of at least that part of the publication in their field of expertise;

(3) they take public responsibility for their part of the publication, except for the responsible author who accepts overall responsibility for the publication;

(4) there are no other authors of the publication according to these criteria;

(5) potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and

(6) the original data are stored at the following location(s) and will be held for at least five years from the date indicated below:

[Please note that the location(s) must be institutional in nature, and should be indicated here as a department, centre or institute, with specific campus identification where relevant.]

Location(s): University of Canberra

Signatures Date

20/12/2019

20/12/2019


CHAPTER FIVE

Bio-NER Model based on Conditional Random Field Learner with Rich Feature Sets

5.1 Overview

In this chapter, we propose a novel approach for the Bio-NER task based on the Conditional Random Field (CRF) algorithm with supervised learning and a combination of several rich text feature sets. These feature sets include three distinctive features: word2vec, Term Frequency-Inverse Document Frequency (TF-IDF), and Regular Expression (RegExp) features, resulting in improved performance in terms of various measures, including precision, recall, and F-score. It was possible to achieve an F-score of 70.50% for the CRF learner based on supervised learning with large and rich feature sets, outperforming most of the other participating systems in the BioNLP challenge task.

We propose an approach based on combining a large feature set, including Word2vec features, TF-IDF features, and Regular Expression features, with a supervised CRF learner, leading to an F-score of around 70 percent, as shown at the entry Pha18CRF in Table 15. The rest of this chapter is organized as follows. Section 5.2 discusses the background and related work. Section 5.3 describes the CRF learner using the rich text feature set. Section 5.4 presents a brief description of the challenge dataset, details of data pre-processing, and the model building strategy using the CRF learner algorithm. The experimental evaluation is discussed in section 5.5, and the chapter concludes in section 5.6 with the outcomes obtained from this work and plans for further research.

5.2 Background and Related Work

Many techniques have been proposed to extract word representation (WR) text features, such as LSA (latent semantic analysis) [93], random indexing (RI) [94], Brown clustering (BC) [95], and neural language models (NLM) [96]. As per a review by L. Ratinov et al. [97], WR text features can be divided into three groups: firstly, word embeddings, such as NLM; secondly, clustering-based methods, such as BC; and thirdly, distributional representations, such as LSA and RI.

Recently, WR approaches have been widely used to improve various machine learning-based NLP tasks, such as entity recognition in biomedical text [98-99], and part-of-speech (POS) tagging, chunking, and NER in the newswire domain [97]. Word embeddings have also been applied to the biomedical domain and have shown improvement on entity recognition in biomedical literature [100]. Having said that, the contribution of diverse types of WR features to Bio-NER has not yet been extensively investigated. Therefore, this chapter investigates a variety of features and systematically evaluates three diverse types of feature sets: TF-IDF features created from the Weighing-each-Feature algorithm and model; Find Similarity/Nearest features created from the Word2vec algorithm and model; and other rich features created from Regular Expressions.

After pre-processing, we combined these three distinct types of feature sets and built the Bio-NER model using the supervised CRF learner. Our results showed that the supervised CRF learning algorithm, combined with a large set of complex and rich text features, can improve Bio-NER task performance. This could be due to complementarity and latent capabilities amongst these different feature sets, which support each other. By combining these features, including the TF-IDF feature set, the F-score obtained was above 70 percent, outperforming other participating systems in the Bio-NER challenge task [7].

5.3 CRF Model Development

5.3.1 CRF Learning

Conditional Random Fields are discriminative classifiers, and their underlying principle is that they apply Logistic Regression to sequential inputs. Machine Learning models have two common categorizations: Generative and Discriminative. Conditional Random Fields are a type of Discriminative classifier, and as such, they model the decision boundary between the different classes. Generative models, on the other hand, model how the data was generated, which, after having been learned, can be used to make classifications. As a simple example, Naive Bayes, a very simple and popular probabilistic classifier, is a Generative algorithm, and Logistic Regression, which is a classifier based on Maximum Likelihood estimation, is a Discriminative model.

In CRFs, our input data is sequential, and we have to take the previous context into account when making predictions on a data point. To model this behavior, we use feature functions that take multiple input values. Each feature function is based on the label of the previous word and the current word, and returns either a 0 or a 1.

The feature functions are the key components of a CRF. Each feature function takes as input:

- a sentence S,

- the position i of a word in the sentence we are predicting,

- the label li of the current word, and

- the label li-1 of the previous word in S.

It is defined as fi(S, i, li, li-1). The purpose of the feature function is to express a characteristic of the sequence that the word represents. To build the conditional random field, we next assign each feature function a set of weights (lambda values), which the algorithm is going to learn, as described in the figure "Probability Distribution for Conditional Random Fields" in the section "Mathematical overview of Conditional Random Fields" of [101].
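For instance, a single binary feature function of the form f(S, i, li, li-1) could fire when a capitalised word is labelled as the start of a protein name; this particular feature is invented for illustration and is not taken from the thesis feature set:

```python
def f_cap_protein(S, i, l_i, l_prev):
    """Binary feature function f(S, i, l_i, l_{i-1}): returns 1 when the
    word at position i is capitalised, labelled B-protein, and follows
    an O-labelled word; otherwise 0."""
    return 1 if (S[i][0].isupper() and l_i == "B-protein" and l_prev == "O") else 0
```

During training, the CRF learns a weight lambda_j for each such function; the score of a candidate label sequence is the sum of lambda_j * f_j over all positions and feature functions, normalised into a conditional probability.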

It is vital to understand the applications of CRFs. For example, given their ability to model sequential data, CRFs are often used in Natural Language Processing and have many applications in that area. One such application is Parts-of-Speech (POS) tagging. The parts of speech of a sentence rely on previous words, and by using feature functions that take advantage of this, we can use CRFs to learn how to distinguish which words of a sentence correspond to which POS. Another similar application is Named Entity Recognition, or extracting proper nouns from sentences. Conditional Random Fields can be used to predict any sequence in which multiple variables depend on each other.

In summary, we use Conditional Random Fields by first defining the feature functions needed, initializing the weights to random values, and then applying Gradient Descent iteratively until the parameter values (in this case, the lambdas) converge. We can see that CRFs are similar to Logistic Regression, since they use the conditional probability distribution, but we extend the algorithm by applying feature functions, as our input data is sequential.


5.3.2 Rich Text Feature Set

In this Bio-NER model development work, we included three types of features: TF-IDF features created from the Weighing-each-Feature model; Find Similar/Nearest features created from the Word2vec model; and other rich features created from Regular Expressions. Details of each type of feature are as follows.

Firstly, Term Frequency - Inverse Document Frequency [102] is a weighting scheme that is commonly used in information retrieval tasks. The goal is to model each document in a vector space, ignoring the exact ordering of the words in the document while retaining information about the occurrences of each word. It is important to understand how TF-IDF works [103]. The Term Frequency (TF) is a count of how many times a word occurs in a given document (synonymous with a bag of words). The Inverse Document Frequency (IDF) is based on the number of documents in a corpus in which a word occurs, and down-weights words that occur in many documents. TF-IDF is used to weight words according to how important they are: words that are used frequently in many documents will have a lower weighting, while infrequent ones will have a higher weighting. Section 2.1.2 of [104] illustrates exactly how TF-IDF works. TF-IDF is also used in several NLP techniques, such as text mining, search queries, and summarization. In information retrieval or text mining, TF-IDF is a well-known method to evaluate how important a word is in a document, and it is a very interesting way to convert the textual representation of information into sparse features.



Additionally, TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval searches, text mining, and user modeling. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. TF-IDF is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use TF-IDF. Variations of the TF-IDF weighting scheme are often used by search engines [105] as a central tool in scoring and ranking a document's relevance [106] given a user query. TF-IDF can also be successfully used for stop-word filtering [107] in various subject fields, including text summarization [108] and classification.
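The weighting can be made concrete with a minimal, textbook implementation; real systems add smoothing and normalisation, and this sketch assumes documents are simple token lists:

```python
import math

def tf_idf(term, doc, corpus):
    """tf-idf(t, d) = tf(t, d) * log(N / df(t)), where tf is the relative
    frequency of t in d, df is the number of documents containing t,
    and N is the number of documents in the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0
```

A term occurring in every document gets idf = log(1) = 0, so ubiquitous words are weighted down to zero, while a term unique to one document receives the maximum idf of log(N).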

Secondly, Word2vec is a two-layer neural net that processes text. Its input is a text corpus, and its output is a set of feature vectors for the words in that corpus [7]. While Word2vec is not a deep neural network, it turns text into a numerical feature set that the CRF algorithm can work with more effectively. There are several implementations of word2vec algorithms, including a distributed form of Word2vec for Java and Scala, which works on Spark with GPUs, as documented by the developers in [109].

Word2vec provides a computationally efficient predictive model for learning word embeddings from raw text. The underlying idea is mapping words to vectors of real numbers; for example, W("cat") = (0.2, -0.4, 0.7, ...), where W is the word embedding function and W("cat") is the word embedding of the word "cat". This can help in achieving better performance in natural language processing tasks by grouping similar words: in a word embedding, similar words are likely to have similar vectors. It can also capture various dimensions of meaning and phrase information relevant to the potential features of words within the vector.


Moreover, Word2vec comes with two options: the Skip-Gram model and the Continuous Bag-of-Words model (CBOW), as shown in Figure 13. Algorithmically, these models are similar, except that Skip-Gram predicts the source context words (e.g. 'the cat sits on the') from the target word ('mat'), while CBOW does the inverse and predicts the target word from the source context words. This inversion might seem like an arbitrary choice, but statistically it has the effect that CBOW smooths over a lot of the distributional information by treating an entire context as one observation; this turns out to be useful for smaller datasets. Skip-Gram, however, treats each context-target pair as a new observation, and this tends to do better when we have larger datasets.

We have utilized the Skip-Gram model for this work, as it is useful for predicting neighbouring words in a sentence or document: it predicts the neighbouring words, or context, when a single word is given. Skip-Gram captures the averaged co-occurrence of two words in a training set, and is an efficient method for learning high-quality distributed vector representations that capture many precise syntactic and semantic word relationships. Imagine a sliding window over the text that includes the central word currently in focus, together with the four words that precede it and the four words that follow it.


Figure 13: The Continuous Bag-of-Words (CBOW) and Skip-Gram model architectures in Word2vec.

It is important to note that the Skip-Gram model is constructed with the focus word as the single input vector, and the target context words are at the output layer. The model we train runs each word in the Skip-Gram through W to get a vector representing it, and feeds those into another 'module' called R, which tries to predict whether the Skip-Gram belongs to our dataset classes. CBOW is the opposite of the Skip-Gram model: the context words form the input layer, and each word is encoded in one-hot form, so if the vocabulary size is V, these will be V-dimensional vectors with just one of the elements set to one and the rest all zeros.
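The Skip-Gram training data can be illustrated by enumerating the (target, context) pairs produced by such a sliding window; the description above uses four words on each side, shortened here to keep the example readable:

```python
def skip_gram_pairs(tokens, window=2):
    """Generate (target, context) training pairs as in the Skip-Gram model:
    each word predicts its neighbours within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        # every position within the window, excluding the target itself
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs
```

Each pair becomes one training observation, which is why Skip-Gram sees many more (noisier) examples than CBOW for the same text.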

Thirdly, we also explored other rich features created from Regular Expressions, as shown in Table 4.
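A few surface-pattern features of this kind can be sketched as follows; the actual patterns used in this work are those of Table 4, and the ones below are representative examples only:

```python
import re

# Hypothetical surface-form patterns in the spirit of Table 4.
PATTERNS = {
    "ALLCAPS":   re.compile(r"^[A-Z]{2,}$"),           # e.g. "DNA"
    "HAS_DIGIT": re.compile(r"\d"),                    # e.g. "IL-2"
    "HAS_DASH":  re.compile(r"-"),
    "GREEK":     re.compile(r"alpha|beta|gamma|kappa"),
}

def regexp_features(token):
    """Return the names of all patterns that match a token."""
    return [name for name, pat in PATTERNS.items() if pat.search(token)]
```

Such binary surface features are cheap to compute and complement the TF-IDF and Word2vec features, since they capture orthographic cues (capitalisation, digits, Greek-letter fragments) that are common in biomedical entity names.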

The proposed Conditional Random Field scheme, including the challenge datasets and the approach and methodology used for building the CRF model, is presented in the next section.


5.4 Proposed Conditional Random Field Scheme

In this section, we describe the proposed Conditional Random Fields learning scheme used for biomedical named entity recognition, as well as the details of the experimental setup. The details of the BioNLP/NLPBA 2004 shared task corpus [7] are described first, to convey the complexity of the data, followed by the process and the model building approach and methodology used for recognizing genes and entities.

5.4.1 BioNLP/NLPBA 2004 challenge dataset

The BioNLP/NLPBA 2004 challenge dataset was developed for biomedical entity recognition and information extraction. The dataset details for performing named entity recognition are as follows:

1. Training Set: the training data came from the GENIA version 3.02 corpus. This was formed from a controlled search on MEDLARS using the MeSH terms “human”, “blood cells” and “transcription factors”. From this search, 2000 abstracts were selected and hand-annotated according to a small taxonomy of 48 classes based on a chemical classification. Among these, 36 terminal classes were used to annotate the GENIA corpus. For the shared task, however, the 36 classes were simplified to only five: protein, DNA, RNA, cell line, and cell type. The first three incorporate several subclasses from the original taxonomy, while the last two were retained to make the task realistic for post-processing by a potential template-filling application. The publication years of the training set range from 1990 to 1999.

For building the NER model based on the CRF learner, the BioNLP/NLPBA shared task corpus [7] was used to evaluate the CRF algorithm using the BIO format, where:

B: denotes the beginning token of an NE,

I: denotes each following token that is inside the NE, and

O: denotes a token that is outside of, and not part of, any NE,

in agreement with the convention of the NER community. Table 18 shows the training data based on the BioNLP/NLPBA shared task corpus, as shown in Appendix 1.
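To illustrate the BIO convention (with a hypothetical sentence and tags, not taken from the corpus), the following sketch shows how entity spans are recovered from a BIO-tagged token sequence:

```python
# Hypothetical BIO-tagged sentence: "IL-2 gene" is a DNA entity,
# "T cell" is a cell_type entity; all other tokens are outside (O).
tokens = ["The", "IL-2", "gene", "regulates", "T", "cell", "activation", "."]
tags   = ["O", "B-DNA", "I-DNA", "O", "B-cell_type", "I-cell_type", "O", "O"]

def extract_entities(tokens, tags):
    """Recover (entity_text, entity_type) spans from BIO tags."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity begins
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # entity continues
            current.append(tok)
        else:                               # outside any entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities
```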

2. Test Set: for testing purposes, we used a new annotated collection of MEDLARS online abstracts from the GENIA project. 404 abstracts were used, annotated for the same classes of entities. Most of the test set comprises abstracts retrieved with the same set of MeSH terms, with publication years ranging from 1978 to 2001. To see the effect of publication year, the test set was roughly divided into four subsets: a 1978-1989 set (which represents an old age from the viewpoint of models trained on the training set), a 1990-1999 set (the same age as the training set), a 2000-2001 set (a new age compared to the training set), and an S/1998-2001 set (roughly a new age in a super domain). The last subset represents a super domain, and its abstracts were retrieved with the MeSH terms `blood cells' and `transcription factors' (without `human'). (The S/1998-2001 set includes the whole 2000-2001 set.) Table 19 shows a test dataset based on the BioNLP/NLPBA shared task corpus, as shown in Appendix 1.

As shown in Figure 14, we first extract TF-IDF features and similarity features when pre-processing the raw text data, before extracting the regular expression features; these NLP features attempt to model different syntactic and semantic relationships between classes and entities in the text.


Figure 14: TF-IDF and Similarity feature sets obtained after pre-processing data.

Some of the features shown in Figure 14 include:

 Orthographic features: Orthographic information is used to identify whether a token is capitalized, an acronym, a pure number, punctuation, or has mixed letters and digits, etc.

 Joint features: Joint features are conjunctions or combinations of individual features. For example, if a token is in a biomedical name list and its suffix token is in a number list, the token will have a joint feature called Name + Number (e.g., HIV-1, HIV-2, virus-1, etc.).

 Neighbourhood features: After extracting the above features for each token, its features are copied to its neighbours (the previous two and next two tokens) with a position id. For example, if the previous token has a feature “Cap@0”, the current token will have a feature “Cap@-1”.

 Single/multiple-token feature lists: Each list is a collection of words that share a common semantic meaning, such as entity, tag/label/class, TF-IDF features, Word2vec similarity features, and regular expression features. Table 4 gives an example demonstrating the feature selection process used with the CRF algorithm.
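The orthographic and neighbourhood features above can be sketched as follows (a minimal illustration; feature names and the exact rules are assumptions, not the thesis implementation):

```python
import re

def orthographic(token):
    """Illustrative orthographic feature extractor."""
    feats = set()
    if token[:1].isupper():
        feats.add("Cap")
    if token.isupper() and len(token) > 1:
        feats.add("Acronym")
    if token.isdigit():
        feats.add("Number")
    if re.search(r"[A-Za-z]", token) and re.search(r"\d", token):
        feats.add("MixedAlnum")
    return feats

def with_neighbours(tokens, window=2):
    """Copy each token's features to its neighbours with a position id,
    e.g. the previous token's 'Cap' becomes 'Cap@-1' for the current token."""
    base = [orthographic(t) for t in tokens]
    out = []
    for i in range(len(tokens)):
        feats = set()
        for off in range(-window, window + 1):
            j = i + off
            if 0 <= j < len(tokens):
                feats |= {f"{f}@{off}" for f in base[j]}
        out.append(feats)
    return out

feats = with_neighbours(["HIV-1", "Infects", "cells"])
```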

5.4.2 Model Building Approach and Methodology

The approach and methodology utilized for building the CRF model can be outlined as follows:

We use the Word2vec technique for discovering optimal features for building the CRF model.

As reiterated before, Word2vec is a computationally efficient predictive model for learning word embeddings from raw text. It comes in two variants: the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model, as shown in Figure 13. We chose the Skip-Gram model for the work reported here, as it treats each context-target pair as a new observation, and this tends to do better on larger datasets.

The dataset is relatively sparse, with eleven classes (labels/tags), compared to the quality and quantity of text data available. Therefore, we must do some pre-processing, extracting different NLP features from the raw text. The pre-processing step allows better feature discovery during the feature extraction stages, such as using the Weighing Each Feature model to generate TF-IDF features, the Word2vec model to find similar/nearest features, and regular expressions to create other rich text features, all leading to better CRF models. Hence, during the pre-processing stage, three feature sets are extracted, as shown in Figure 15.

Figure 15: Combination of 3 features sets: TF-IDF, Find Similar, and other rich features.
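The TF-IDF weighting step can be sketched in pure Python as follows (an illustrative implementation with toy per-category documents; the thesis used its own Weighing Each Feature model, and libraries such as scikit-learn offer equivalent functionality):

```python
import math
from collections import Counter

def tfidf(docs):
    """Minimal TF-IDF: term frequency within a category document,
    scaled by log(N / document frequency) across categories."""
    df = Counter()
    for doc in docs.values():
        df.update(set(doc))          # document frequency per term
    n = len(docs)
    weights = {}
    for cat, doc in docs.items():
        tf = Counter(doc)
        weights[cat] = {w: (tf[w] / len(doc)) * math.log(n / df[w])
                        for w in tf}
    return weights

# Toy documents, one per entity category (contents are made up).
docs = {"protein": ["kinase", "binds", "receptor"],
        "DNA": ["promoter", "binds", "sequence"]}
w = tfidf(docs)
```

Note that a term appearing in every category document ("binds" here) receives weight zero, which is exactly the discriminative behaviour TF-IDF is used for.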

After the pre-processing stage, involving the extraction of different NLP text features from the raw text of the BioNLP challenge dataset, we submit the combination of these three feature sets (Figure 15) to the CRF algorithm to generate the final CRF model. Figure 17 shows the workflow for this process.

We believe that this novel approach, resulting in the CRF NER system (Table 15), can perform well at the large scale of machine learning needed for complex health data.


5.5 Experimental Results and Discussion

In this section, experimental results are discussed for the different CRF learning models (i.e., with or without the TF-IDF feature set) built with the BioNLP challenge dataset.

As described in Section 5.4.2, the dataset is relatively sparse, so during the pre-processing stage we extracted three different types of NLP feature sets (TF-IDF features, Word2vec similarity features, and regular expression features), as outlined in Figure 16.

After the pre-processing stage, we submit the combination of these three feature sets (Figure 16) to the CRF algorithm to generate the final CRF model. Figure 17 shows the workflow for this process. The details of the algorithm are summarized as follows:

 The training dataset is submitted to the Weighing Words using TF-IDF algorithm to obtain the weight of each feature for the six categories (DNA, RNA, Cell Line, Cell Type, Protein, O), as shown in Figure 14, producing six separate document vectors, one per category, as shown in Figure 15.

 The training dataset is submitted to the Word2vec model to find similar words, as shown in Figure 15.

 The training dataset is then submitted to the regular expressions to find other rich features, such as Capitalized, Hyphenated, Window-5, first token of a sentence, suffixes, and With Number, as shown in Table 4.

 The combination of these three feature sets (Figure 15) is finally submitted to the CRF algorithm to generate the final CRF model (Figure 17).

 The test data is then submitted to the final CRF model to generate the prediction files, which are evaluated against the reference text file to generate the expected results, as shown at the entry Pha18CRF in Table 15.

Figures 16 and 17 show the overall architecture of the NER system for building the final CRF model.

Figure 16: Use Train dataset to create the Word2Vec Model


Figure 17: CRF Model Building Workflow.

As can be seen in Table 15, the proposed strategy for adapting the CRF model, evaluated on the BioNLP/NLPBA 2004 shared task corpus [7], resulted in promising outcomes; the results of three prominent experiments are described next.

In the first experiment, the test data (with the other rich features, but without Word2vec and TF-IDF features) was submitted to the CRF learner to obtain CRF_Model_1, with an F-score of 65.60%. Table 12 shows the performance of the first model in terms of recall, precision, and F-score.

Table 12: Performance of CRF_Model_1.

In the second experiment, the test data (with the other rich features and Word2vec features) was submitted to the CRF learner to obtain CRF_Model_2, which improved the F-score to 69.39%. Table 13 shows the performance of the second model.

Table 13: Performance of CRF_Model_2.

In the third and final experiment, the test data (with the other rich features, Word2vec, and TF-IDF features) was submitted to the CRF learner to obtain CRF_Model_3, which further improved the F-score to 70.50%. Table 14 shows the performance of the third model.

Table 14: Performance of CRF_Model_3.

The details of the four test subsets are described under Evaluation Data in [7]. This research evaluated these test subsets and outperformed most of the previous systems in terms of recall, precision, and F-score. Results show that the proposed CRF model approach achieved an F-score of 70.50%, outperforming most of the other participating systems in the challenge. Table 15 and Figure 18 show the entity recognition performance of each participating system on each test subset.
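The F-score reported throughout this chapter is the harmonic mean of precision and recall, which can be computed as follows (the example values are illustrative):

```python
def f_score(precision, recall):
    """F1 score: harmonic mean of precision and recall, the metric
    used to compare the systems in Table 15."""
    return 2 * precision * recall / (precision + recall)

# A model with balanced precision and recall of 70.5% has F1 = 70.5%.
f_balanced = f_score(0.705, 0.705)
# Unbalanced precision/recall pulls F1 below their arithmetic mean.
f_unbalanced = f_score(0.8, 0.6)
```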


Table 15: Comparison of the performance of the three CRF models with previous participating systems on each test subset.


Figure 18: Comparison of the performance of the NER Pha18 CRFs with previous participating systems.


5.6 Conclusion

In this chapter, we examined a novel scheme for the Bio-NER task based on conditional random fields and a rich text feature set. The evaluation of three diverse types of feature sets with the BioNLP/NLPBA 2004 shared task corpus showed that not only were individual feature sets (Capitalized, Hyphenated, Window-5, first token of a sentence, suffixes, and With Number) beneficial to Bio-NER task performance, but also that the additional learned features (Word2vec) and statistical features (TF-IDF), when combined, can improve performance further. This confirms that a good learner and a powerful set of features can indeed enhance the performance of a complex text mining task involving the recognition of biomedical entities. However, rich text feature extraction is not an easy feat. Ideally, what is needed is an algorithm that can perform as well as this powerful combination of rich text features and machine learners, without the manual involvement of multi-disciplinary experts for feature engineering. Recent deep learning algorithms provide this opportunity, and are known to perform as well, or even better, with an appropriate choice of deep learning architecture. The next chapter discusses the details of Bio-NER model development using deep machine learning algorithms, which allow latent features to be extracted without feature engineering and handcrafting, removing the dependency on multi-disciplinary experts for creating annotated training datasets.


DECLARATION OF CO-AUTHORED PUBLICATIONS

For use in theses which include publications. This declaration must be completed for each co-authored publication and placed at the start of the thesis chapter in which the publication appears.

Declaration for Thesis Chapter [6]

Declaration by candidate

In the case of Chapter [6], the nature and extent of my contribution to the work was the following:

Nature of contribution: The current biomedical natural language processing datasets are unbalanced and difficult to manage when using traditional machine learning techniques, which may degrade results remarkably. To improve the performance of biomedical named entity recognition, I explored a variety of novel neural network algorithms and experimented with each of them on the Bio-Entity Recognition Task at BioNLP/NLPBA 2004 datasets; the results surpassed most of the previous participating systems in the challenge task. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8628740

Contribution %: 70.32

The following co-authors contributed to the work:

Name | Nature of Contribution | Contributor is also a student at UC (Y/N)
Dr Girija Chetty | Advised on approaches to achieve result models with the highest performance. | N
Dr Thoai Man Luu | Advised on new technologies and assisted in various experiments to obtain better results. | N

Candidate’s Signature: ____________________ Date: 20/12/2019

DECLARATION BY CO-AUTHORS

The undersigned hereby certify that: (1) the above declaration correctly reflects the nature and extent of the candidate’s contribution to this work, and the nature of the contribution of each of the co-authors.


(2) they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation of at least that part of the publication in their field of expertise;
(3) they take public responsibility for their part of the publication, except for the responsible author, who accepts overall responsibility for the publication;
(4) there are no other authors of the publication according to these criteria;
(5) potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and
(6) the original data are stored at the following location(s) and will be held for at least five years from the date indicated below:

[Please note that the location(s) must be institutional in nature, and should be indicated here as a department, centre or institute, with specific campus identification where relevant.]

Location(s): University of Canberra

Signatures Date

20/12/2019

20/12/2019


CHAPTER SIX

Bio-NER Model Development Based on Deep Learning Approach

6.1 Overview

This chapter describes a deep learning-based approach to the biomedical named entity recognition (Bio-NER) task. Three different deep learning architectures were investigated, including Feedforward Networks (FFNs), Recurrent Neural Networks (RNNs), and hybrid Convolutional Neural Networks (CNNs), to examine their latent feature discovery capabilities for Bio-NER tasks. The performance evaluation of the proposed deep learning approach on the same benchmark BioNLP dataset used in the experimental work of Chapter 5 led to promising performance when assessed in terms of F-measure (F-score), recall, and precision. The best performing deep learning algorithm was the one based on the hybrid CNN architecture, resulting in an F-score of 70.69%, matching and slightly surpassing the performance achieved with traditional shallow learning algorithms based on the CRF learner and rich text feature set. Though the performance in terms of the F-measure metric is similar, minimal feature engineering and handcrafting is required with the deep learning approach, and end-to-end automated model building was possible. The performance can be improved further with an appropriate choice of deep learning architecture. The rest of the chapter describes Bio-NER model development based on deep learning models.


The three deep learning techniques investigated for this task include a traditional deep machine learning technique, the Feedforward Network (FFN); a variant of the feedforward neural network with a feedback loop, the Recurrent Neural Network (RNN); and a hybrid combination of RNNs with another popular architecture for extracting latent features, the Convolutional Neural Network (CNN). The three deep machine learning models examined have shown the potential to model latent features differently, and proved their capability to discover and learn different feature representations directly from the raw data.

The algorithmic workflow for deep learning model development for the Bio-NER text mining task is similar to any machine learning model development, involving problem formulation as a sequence labelling problem, with the aim of assigning a label to each word in a sentence. The traditional approach to this text mining task is to use well-known NLP techniques, often requiring handcrafted text features (e.g., capitalization, prefix and suffix, POS tags, removal of stop words and duplicate words) built specifically for each entity type [110-113]. This NLP feature extraction process is very labour-intensive and often consumes most of the time and cost, particularly for Bio-NER model development, due to complex and obscure biological jargon [114]. Moreover, even after significant feature engineering efforts involving multidisciplinary NLP/linguistic, biomedical, and machine learning experts, it leads to highly specialized systems that cannot be directly used to recognize new, evolving entity types, and accuracy remains a limiting factor if the data is insufficient, unbalanced, or of poor quality [115].

The Bio-NER model development and evaluation for the three different deep learning architectures was done on the publicly available NLPBA/BioNLP 2004 challenge dataset. This benchmark dataset consists of biomedical text data and five major entity types to be recognized (DNA, RNA, protein, cell type, and cell line). As can be seen in Table 16, of the three deep learning architectures examined, the hybrid CNN model resulted in the best performance, with an F-score of 70.69%, matching the performance of the system based on shallow machine learning with the CRF learner and rich text feature set (discussed in Chapter 5).

The rest of this chapter is organized as follows. Sections 6.2 and 6.3 discuss the background and related work on different deep machine learning approaches. Section 6.4 presents the three deep learning algorithms used. The experimental work done to evaluate the different deep learning algorithms is presented in Section 6.5. Section 6.6 concludes the chapter by summarizing the outcomes of the work reported here and the plan for future research.

6.2 Deep Learning Background

Feedforward networks (FFNs) are the traditional deep learning networks [116], based on a series of algorithms, modelled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labelling, or clustering of raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text, or time series, must be translated. In general, neural networks allow clustering and classification layers to be implemented on top of the data we store and manage. FFNs and recurrent neural networks (RNNs) [117] are different types of artificial neural networks designed to recognize patterns in sequences of data, such as text, genomes, handwriting, spoken words, or numerical time series data emanating from sensors, stock markets, and government agencies.


RNNs or recurrent neural networks are unique in the sense that they are designed to utilize sequential information. Since input data are processed sequentially, recurrent computation is performed in the hidden units where the cyclic connection exists. Therefore, past information is implicitly stored in the hidden units called state vectors, and output for the current input is computed considering all previous inputs using these state vectors.

It is important to note the differences between these two neural networks. A feedforward network feeds information straight through the nodes of the network (never touching a given node twice), while in recurrent networks, it cycles through a loop. The feedforward network has no notion of order in time, and the only input it considers is the current example it has been exposed to. Feedforward networks are amnesiacs regarding their recent past; they only remember the formative moments of training. Recurrent networks, on the other hand, take as their input, not just the current input example they see, but also what they have perceived previously in time. In addition, recurrent networks can be distinguished from feedforward networks by the feedback loop connected to their past decisions, ingesting their own outputs moment after moment as input. It is often said that recurrent networks have memory. Adding memory to neural networks has a purpose: There is information in the sequence itself, and recurrent networks use it to perform tasks that feedforward networks cannot.

Further, the main difference between an RNN and an FNN is that in each neuron of an RNN, the output of the previous time step is fed in as input at the next time step. This makes the RNN aware of time (past, current, and future), while the FNN is not. For example, in handwritten digit classification, you have only the input and output; there is no difference between the 1st and the 100th presentation, because the network has the same input and output. An RNN, in contrast, is usually used to describe a sequence (a time sequence or a context). For example, while reading a text file corresponding to the BioNLP challenge dataset [7], the algorithm allows the machine


to output the next character. As it reads more and more data, it accumulates the knowledge through its previous time step, making the network aware of the context [118].
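The recurrence described above can be sketched with a scalar toy example (an illustration of the state update, not a trainable network): the hidden state at each step depends on both the current input and the previous state, so the final state encodes the whole sequence.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent step: h_t = tanh(w_x * x_t + w_h * h_prev + b).
    Scalar weights keep the sketch minimal; real RNNs use matrices."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

h = 0.0
for x in [1.0, 0.5, -0.3]:       # a toy input sequence
    h = rnn_step(x, h, w_x=0.8, w_h=0.5, b=0.0)
# h now summarizes the whole sequence, in order.
```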

Another popular deep learning architecture is the Convolutional Neural Network (CNN). The difference between a CNN and an FNN is that CNNs include several layers of convolutions, with nonlinear activation functions such as the Rectified Linear Unit (ReLU) [119] or the Hyperbolic Tangent function (Tanh) [120] applied to the results. In a traditional feedforward neural network, we connect each input neuron to each output neuron in the next layer; this is called a fully connected layer. CNNs work differently: we use convolutions over the input layer to compute the output. This results in local connections, where each region of the input is connected to a neuron in the output. Each layer applies different filters, typically like the ones shown in Figure 20, and combines their results. Due to this, the network has the capability to discover latent feature representations automatically.
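The local-connection idea can be illustrated with a minimal 1-D convolution followed by a ReLU activation (a sketch with a toy signal and kernel; deep learning libraries implement the same operation, usually as cross-correlation):

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)

def conv1d(signal, kernel):
    """Valid 1-D convolution with ReLU: each output value depends only
    on a local window of the input, not on all inputs at once."""
    k = len(kernel)
    return [relu(sum(signal[i + j] * kernel[j] for j in range(k)))
            for i in range(len(signal) - k + 1)]

out = conv1d([1.0, -2.0, 3.0, -1.0, 2.0], [0.5, 0.5])
# out == [0.0, 0.5, 1.0, 0.5]: a 5-element input and size-2 kernel
# give 4 locally connected outputs; negatives are clipped by ReLU.
```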

Figure 19 shows the architecture of an FNN, where a connection between two nodes is only permitted from nodes in layer i to nodes in layer i + 1 (hence the term feedforward; no backward or intra-layer connections are allowed).


Figure 19: Structure of FFN with 3 input nodes, a first hidden layer with 2 nodes, a second hidden layer with 3 nodes, and a final output layer with 2 nodes.

As can be seen from Figure 19, the nodes in layer i are fully connected to the nodes in layer i + 1; that is, every node in layer i connects to every node in layer i + 1. For example, there are a total of 3 x 2 = 6 connections between layer 0 and layer 1, which is where the term “fully connected” comes from. We normally use a sequence of integers to describe the number of nodes in each layer quickly and concisely. For instance, the network above is a 3-2-3-2 feedforward neural network, where:

 Layer 0 contains 3 inputs, our xi values. These could be raw pixel intensities or entries from a feature vector.

 Layers 1 and 2 are hidden layers, containing 2 and 3 nodes, respectively.

 Layer 3 is the output layer, or the visible layer, where we obtain the overall output classification from the feedforward network. The output layer normally has as many nodes as class labels: one node for each potential output. For example, for a Dogs vs. Cats classification, we would have two output nodes, one for “dog” and another for “cat”.
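The forward pass of this 3-2-3-2 network can be sketched as follows (random placeholder weights for illustration; a real network would learn them, and biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
# Layer sizes for the 3-2-3-2 network described above.
sizes = [3, 2, 3, 2]
# One weight matrix per pair of adjacent layers (fully connected).
weights = [rng.standard_normal((m, n)) for m, n in zip(sizes, sizes[1:])]

def forward(x):
    """Fully connected forward pass with tanh activations."""
    for w in weights:
        x = np.tanh(x @ w)
    return x

y = forward(np.array([0.1, 0.2, 0.3]))  # two outputs, one per class
```

Note that `weights[0]` has 3 x 2 = 6 entries, matching the 6 connections between layer 0 and layer 1 counted above.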


Figure 20: Illustration of a CNN architecture for sentence classification.

Figure 20 shows a CNN architecture for a sentence classification task. The convolution process involves three filter region sizes (2, 3, and 4), each of which has 2 filters. Every filter performs convolution on the sentence matrix and generates (variable-length) feature maps. Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus, a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final SoftMax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence show two possible output states [121].
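The 1-max pooling step above can be sketched as follows (the six toy feature maps are illustrative stand-ins for the variable-length maps produced by the six filters):

```python
def one_max_pool(feature_maps):
    """1-max pooling: keep only the largest value from each
    (variable-length) feature map, giving one fixed-size vector."""
    return [max(fm) for fm in feature_maps]

# Six variable-length maps, as produced by filter sizes 2, 3, and 4
# with 2 filters each (values are made up).
maps = [[0.1, 0.9], [0.4], [0.7, 0.2, 0.5], [0.3, 0.8], [0.6], [0.0, 0.2]]
vec = one_max_pool(maps)  # a 6-element vector for the penultimate layer
```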

The third type of deep neural model we examined in this work is the Recurrent Neural Network (RNN) model, essentially built from two smaller modules, W and R. This approach of building neural network models from smaller neural network “modules” that can be composed together is not very widespread, but it can lead to very successful outcomes in Natural Language Processing (NLP) contexts. A plain RNN can only have a fixed number of inputs. We can overcome this by adding an association module, A, which takes two word or phrase representations and merges them. By merging sequences of words, A takes us from representing words to representing phrases, or even whole sentences. And, because we can merge different numbers of words, we do not need a fixed number of inputs. It does not necessarily make sense to merge the words in a sentence linearly. Models that instead merge recursively, with the output of a module feeding into a module of the same type, are called “recursive neural networks” [122]; they are also sometimes called “tree-structured neural networks”.

The Convolutional Neural Networks (CNNs) [124] discussed before are mostly used for extracting or discovering features, are designed to process multiple data types, and are most popular for processing images. They can be used to classify images, cluster them by similarity (photo search), and perform object recognition within scenes. They can identify faces, individuals, street signs, eggplants, platypuses, and many other aspects of visual data. More recently, convolutional networks have also been applied directly to text analytics [125], as well as to graph data with graph convolutional networks. Other than the FFN, RNN, and CNN type deep learning models, there are other types of deep learning architectures, such as those based on Restricted Boltzmann Machines (RBMs) [123], and the Neural Network with Regression. The RBM is an algorithm useful for several tasks, including dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modelling. The Neural Network with Regression (NNR) is used for clustering, classification, or regression [126]. That is, such networks help to group unlabelled data, categorize that data, or predict continuous values after supervised training. For performing classification, a network typically uses a form of logistic regression in the final layer to convert continuous data into dummy variables like 0 and 1.

However, many of these complex deep learning algorithms are highly computationally intensive and require Graphics Processing Units (GPUs) to perform named entity recognition on large text corpora. Instead, by combining one or more of the FNN, RNN, or CNN architectures, it may be possible to obtain better performing deep learning models. We therefore explored a hybrid CNN model, combining an RNN and a CNN, that might perform well even without GPU power and without significant computational burden for a complex text processing task such as the Bio-NER problem. As the work reported in this chapter shows, this was indeed a good line of investigation. The next few sections describe the proposed deep learning Bio-NER model building in detail.

6.3 Proposed Deep Learning Bio-NER Scheme

In this section, we describe the proposed Deep machine learning Bio-NER model building scheme and its evaluation with the benchmark BioNLP challenge dataset, involving biomedical names (DNA, RNA, Cell line, Cell type, Protein, O) and entities.


6.3.1 BioNLP 2004 Challenge dataset

This is one of the most popular benchmark datasets in the Bio-NER domain and is used as a baseline for comparing the performance of different computer-based learning algorithms. The BioNLP/NLPBA 2004 shared task corpus contains a training subset with labelled data (tags and annotations), while the test subset contains unlabelled data, as shown in Table 18 and Table 19, respectively, in Appendix 1. The classes/tags/labels in this dataset are as follows:

 B: denotes the beginning token of an NE,

 I: denotes each following token that is inside the NE, and

 O: denotes a token that is not part of any NE, in agreement with the convention of the NER community.

6.3.2 Deep Bio-NER Model

The deep Bio-NER model development approach involved several methodological aspects, as described here. Figure 21 shows the basic architecture for deep learning model development, using the CNN architecture as an example.


Figure 21: Basic CNN Model architecture.


Figure 22: Character level of Convolutional Neural Network and Recurrent Neural Network.

After the pre-processing of raw data, we built three different neural network models based on the FFN, RNN, and hybrid CNN-RNN/LSTM architectures. The first architecture used was a multi-layer Feedforward Network (FFN) configuration with two-dimensional input; the performance of this model is shown in Table 16 at the entry “Pha18 FFN”. The second architecture used was a three-dimensional Recurrent Neural Network (RNN) architecture, in which the data is organized as a time series with time steps and a sequence length; the performance of this model is shown in Table 16 at the entry “Pha18 RNN”.

Figure 22 shows the third architecture, with character-level hybrid CNN-RNN processing together with the word embedding feature, followed by a bidirectional LSTM layer. The model consists of several stages and layers, with 3-dimensional character-level data in layer 1 feeding into a convolutional layer with a 3x3 kernel, followed by several layers of convolutional processing and learning of shared representations, culminating in a final output layer with 32 output channels and a rectified linear unit (ReLU) activation function.

The hybrid CNN model used several stages of character-level convolutional layer processing, with a character-level 1D max-pooling layer, and a character-level convolutional and max-pooling layer with 64 output channels. The output of the character-level CNN was concatenated with a word embedding layer and a casing embedding layer (the word broken down into a few casing features, for example hyphenated, numeric, punctuation, title case, and so on).
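The casing features feeding the casing embedding can be sketched as follows (the class names and precedence order are illustrative assumptions, not the exact set used in the thesis model):

```python
def casing(token):
    """Map a token to a coarse casing class for the casing embedding
    (class names and check order are illustrative)."""
    if token.isdigit():
        return "numeric"
    if "-" in token:
        return "hyphenated"
    if not any(c.isalnum() for c in token):
        return "punctuation"
    if token.istitle():
        return "title"
    if token.isupper():
        return "allCaps"
    if token.islower():
        return "lower"
    return "other"
```

Each class would then be mapped to a small learned embedding vector and concatenated with the word and character-level representations.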

The output of the concatenation layer was fed into two hidden Bidirectional RNN layers which processed the input with 200 hidden neurons. And it continued with a fully connected network before it ended at another full connection layer, which produced eleven classes as the output.

This hybrid CNN model building, and evaluation was done with the BioNLP/NLPBA 2004 shared task corpus. After 10 epochs of training the Hybrid-CNN model, we were able to achieve an accuracy of 99.26 percent on the training set as shown in Figure 23.

CHAPTER SIX 99

Figure 23: The performance of the character-level for the CNN-RNN hybrid model for training data set after 10 epochs.

The performance achieved by different deep learning models for the independent test set is shown in Table 16. The best performing model was the hybrid-CNN model resulting in an F- score of 70.69% followed by an RNN model with an F-score of 65.17%, and finally the simple

FFN model performance, with an F-score of 38.54%.

The performance of the hybrid CNN model (Figure 22), with an F-score of 70.69% matches the performance reported for the shallow learning model in the previous chapter 5, with CRF learner and rich text feature set, and surpasses the performance slightly. Further, this improved performance was achieved with no handcrafting or feature engineering, yielding a complete computer-based automated solution. This improvement in performance with the hybrid CNN model could be due to the efficient CNN-RNN structure, and deep discovery of latent feature sets by multiple processing layers in the hybrid-CNN model.

CHAPTER SIX 100

Further, it is possible to achieve better performance than this, with appropriate hyper-parameter selection, which is discussed in the next Chapter. Some hyper-parameters that are crucial and need to be chosen carefully could be the architectural components or the variables, including the type of activation function, number of hidden units, , drop out, etc. For example, choosing the right activation function might consist of choosing either ReLu [119] or

Tanh [120] activation functions. While ReLu is an activation function defined as the positive part of its argument: R(z) = max (0, z), where z is the input to a neuron, the Tanh is the simplest and the most successful activation function, as it is a hyperbolic analogue of the tan circular function used throughout trigonometry. Tanh is defined as the ratio of the corresponding hyperbolic sine and hyperbolic cosine functions. Though the Tanh converges faster than some of the activation functions in terms of substantial number of classes/labels but did not perform well for our BioNLP data, whereas the ReLu is quite similar, it also converges faster than some of the activation functions in terms of the substantial number of classes/labels. In addition, the

ReLu is a good approximator. It works most of the time as a general approximator, as any function can be approximated with combinations of ReLu.

For the Bio-NER model building attempts based on deep learning, we have utilized the ReLu activation function to get a feature map from input numeric data for feeding into the neural networks for further processing. Figure 24 shows the transfer function of the Rectified Linear unit (ReLu) activation function.

CHAPTER SIX 101

Figure 24: The ReLu activation function: R (z) = max (0, z) gives an output z if z is positive and 0 otherwise.

6.4 Experimental Work

In this section, experimental work and performance achieved for three different deep learning algorithms for Bio-NER model development are discussed.

The algorithm workflow for each deep learning-based Bio-NER model development (FFN,

RNN, and hybrid CNN-RNN/LSTM) proceeds in the following manner:

 Data preparation phase, in this phase, datasets are prepared by pre-processing each

sentence. All sentences from the training and testing dataset undergo the same pre-

processing, involving the removal of stop words and duplicate words. We use

numeric/index to represent the unique words produced. We do the same process for the

testing and the validation datasets. And we obtain three different unique word datasets.

 These three unique word datasets are combined into the word pool.

CHAPTER SIX 102

 Each unique word training, testing or validation dataset is then split into unique

sentences and put into separate string array for training, testing or validation dataset.

 These 3 separate string arrays are mapped with the word pool to generate the integer

array (i.e., each digit value in the array represents the unique word produced).

 In the combined dataset, we have 18654 sentences which is equivalent to 18654

elements in the array (i.e., one sentence represents one row/record which represents one

element of the array).

 Since each record in the string array has a different length, we use padding, so that

each record in the integer array has the same length.

 Now, we have the training integer array with 18654 records and 208 columns, and

testing integer array with 4650 records and 208 columns.

 We then append 11 labels/classes, with B-DNA, I-DNA, B-RNA, I-RNA, B-Cell line,

I-Cell line, B-Cell type, I-Cell type, B-Protein, I-Protein, and O tags(Figure 40) into the

training integer array with 18654 records and 208 columns; and testing integer array

with 4650 records and 208 columns.

 For creating, training, and evaluating deep learning neural network models, we

need to use different steps, such as defining the network, compiling the network, fitting

network, evaluating the network, and make predictions.

Firstly, we need to define which neural network we are going to use either FNN, RNN, or hybrid-CNN. We used Keras tools for building the deep learning models, where neural networks are defined as a sequence of layers. The container for these layers in Keras is the

Sequential class. The first step is to create an instance of the Sequential class. Then we can

CHAPTER SIX 103

create our layers and integrate them in the order that they should be connected. Now, we have the pre-processed text input at the input layer, we then specify the number of neurons in the hidden layer and a neuron in the output layer. We also define activation function for each layer, and the choice of activation function and the output layers depends on the type of classification task:

 Regression: For regression tasks, the linear activation function is used, and the number

of neurons matches the number of outputs.

 Binary Classification (2 classes): For binary classification tasks, the Logistic

activation function or Tanh: the hyperbolic functions are used and one neuron for the

output layer is needed for binary classification.

 Multiclass Classification (more than 2 classes): For multiclass classification, the

SoftMax activation function is used, and one output neuron per class value is used,

assuming a one-hot encoded output pattern.

Next, after defining the neural network, we need a pre-compute step for our network, as the compilation process is required after defining a model. This includes both before training using an optimization strategy as well as loading a set of pre-trained weights from a saved file. The reason is that the compilation step prepares an efficient representation of the network that is also required to make predictions on the hardware. The compilation step requires certain parameters to be specified, particularly tailored to training our network. Further, we use the optimization algorithm, involving minimization of a loss function, for building/training the model.

Some of the optimization algorithms available in Keras deep learning library, include:

CHAPTER SIX 104

 Stochastic Gradient Descent Optimizer (SGD): that requires the tuning of a learning

rate and momentum [133]. It is a popular algorithm for training a wide range of models

in machine learning, including (linear) support vector machines and logistic regression.

It is also known as incremental gradient descent and is an iterative method for

optimizing a differentiable objective function, a stochastic approximation of gradient

descent optimization. It is called stochastic because samples are selected randomly (or

shuffled) instead of as a single group (as in standard gradient descent) or in the order

they appear in the training set [134].

 RMSprop: an optimization method that requires the turning of the learning rate, and it

is also used to enhance gradient descent [135].

 ADAM Optimizer: Adam optimizer combines the advantages of two other extensions

of stochastic gradient descent optimizers [132], involving:

o Adaptive Gradient Algorithm (AdaGrad) that maintains a per-parameter

learning rate that improves performance on problems with sparse gradients (e.g.

natural language and computer vision problems).

o Root Mean Square Propagation (RMSProp) that also maintains per-parameter

learning rates that are adapted based on the average of recent magnitudes of the

gradients for the weight (e.g. how quickly it is changing). This means the

algorithm does well on online and non-stationary problems (e.g. noisy).

Adam realizes the benefits of both AdaGrad and RMSProp.

We have used the ADAM optimization algorithm for tuning of hyperparameters for three types of neural network models: FFN, RNN, and Hybrid-CNN. ADAM optimizer method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to the diagonal rescaling of the gradients, and is well suited for problems that are CHAPTER SIX 105

large in terms of data and/or parameters. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning; the name Adam is derived from adaptive moment estimation [132].

The third step involved fitting, where, once the network is compiled, it needs to be fitted, which means adapting the weights on a training dataset. Fitting the network requires the training data to be specified, both a matrix of input patterns X and an array of matching output patterns Y. The network is trained using the algorithm and optimized according to the optimization algorithm and its loss function specified when compiling the model. The backpropagation algorithm requires that the network be trained for a specified number of epochs. Each epoch can be partitioned into groups of input-output pattern pairs called batches. This defines the number of patterns that the network is exposed to before the weights are updated within an epoch. It is also requiring an efficient optimization, ensuring that not too many input patterns are loaded into memory at a time. Once it is fit, a history object is returned representing a trained model.

The fourth step involves evaluation, where once the network is trained, it needs to be evaluated. The network can be evaluated and trained by using the same training data, but this will not provide a useful indication of the performance of the network, in terms of the predictive power, as it has seen all the data beforehand. Hence, we evaluate the performance of the network on a separate dataset, unseen data during training, and can provide an estimate of the performance of the network at making predictions for unseen data in the future. The model evaluation is done in terms of the loss metric across all the test patterns, as well as other metrics specified when the model was compiled, such as accuracy, confusion matrix, precision, recall,

CHAPTER SIX 106

and F-measure. Table 16 shows the performance of each network in terms of different metrics we have defined for evaluating each deep learning algorithm for the Bio-NER model development. Figure 21 (Basic CNN) and Figure 22 (Hybrid CNN-RNN) shows the architecture for two CNN based deep learning algorithms. As can be seen in Table 16 and

Figure 25, the hybrid CNN (Pha18CNN) model has resulted in improved performance in terms of Precision, Recall and F-score: Pha18CNN (72.45, 68.30, 70.69), and surpasses the performance reported of standalone deep learning models based on FFN, RNN and CNN architectures, shown in Table 16 at the entry “Pha18 CNN”.

Table 16: Performance Results of Participating Systems in BioNLP challenge tasks.

CHAPTER SIX 107

Figure 25: Comparison of the performance between Neural Networks from Hybrid Pha18 (CNN) with other BioNLP participating systems using F-Score.

CHAPTER SIX 108

6.5 Conclusion

In this chapter we present the details of a novel strategy for Bio-NER model development based on different deep machine learning models. The three deep learning approaches based on feedforward networks (FFNs), recurrent neural networks (RNNs), and hybrid convolutional neural networks (Hybrid-CNNs) were examined. The hybrid-CNN model based on bidirectional LSTMs, character-level CNN embeddings, allowed latent feature discovery at a deep level to be discovered, and permitted automatic learning of structure from biomedical text, unlike the manual handcrafting and feature engineering stage required for traditional shallow learning algorithms based on CRF and rich text feature sets, described in Chapter 5.

Out of the different deep learning models examined, the Hybrid-CNN architecture resulted in better performance in terms of Precision, Recall and F-score for the CNN Pha18 (72.45, 68.30,

70.69), and surpassed most of other participating systems in the BioNLP challenge task, including our own model proposed in previous Chapter involving CRF learner and rich text feature set. The experimental evaluation was based on benchmark BioNLP/NLPBA 2004 shared task corpus.

Overall, the experimental work in this chapter has shown that the use of deep learning algorithms for complex Bio-NER tasks can be beneficial, particularly a hybrid deep learning model, based on an appropriate combination of different deep learning techniques. The hybrid deep learning model allows the automatic discovery of latent features contained in the structure of the biomedical NLP texts better. This can help in the development of fully automated machine learning pipelines, for any type of text data for any complex application domain, with unbalanced, sparse and unannotated data, as is the case with complex biomedical entity recognition tasks (i.e., the benchmark challenge dataset we used for Bio-NER modeling has six class labels as input and 11 class labels as output, and has poor quality data). CHAPTER SIX 109

To strengthen these findings, the next Chapter seven in this thesis describes the investigations done for the deep learning model development for a different and more complex dataset, involving a clinical handover scenario. The dataset used for this scenario involves data from

CLEF eHealth 2016 challenge task and is described in detail in the next Chapter.

.

CHAPTER SIX 110

DECLARATION OF CO-AUTHORED PUBLICATIONS

For use in theses which include publications. This declaration must be completed for each co-authored publication and to be placed at the start of the thesis chapter in which the publication appears.

Declaration for Thesis Chapter [7]

Declaration by candidate In the case of Chapter [7], the nature and extent of my contribution to the work was the following: Nature of contribution Contribution %

I was discovering a new Hyperparameter optimization machine learning approach to work with a variety of Hybrid Deep Machine learning algorithms by utilizing the clinical CLEF eHealth challenge 2016 dataset. And I was able to 89.00 achieve a better result for the clinical name entity recognition system as compared to the previous participating systems provided by the challenge task. This link will be providing soon after the conference 8-11 in Beijing, China.

The following co-authors contributed to the work: Name Nature of Contribution Contributor is also a student at UC Y/N Dr Girija Chetty Advise remarkable approaches to achieve N result models with highest performance. Dr Thoai Man Luu Advise new technologies, assist in various N experiments and review the conference paper.

______20__/__12__/_2019___ Candidate’s Signature Date

DECLARATION BY CO-AUTHORS

The undersigned hereby certify that: (1) the above declaration correctly reflects the nature and extent of the candidate’s contribution to this work, and the nature of the contribution of each of the co-authors.

111

(2) they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation, of at least that part of the publication in their field of expertise; (3) they take public responsibility for their part of the publication, except for the responsible author who accepts overall responsibility for the publication; (4) there are no other authors of the publication according to these criteria; (5) potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and (6) the original data are stored at the following location(s) and will be held for at least five years from the date indicated below:

[Please note that the location(s) must be institutional in nature, and should be indicated here as a department, centre or institute, with specific campus identification where relevant.]

Location(s): University of Canberra

Signatures Date

20/12/2019

20/12/2019

112

CHAPTER SEVEN

Extending Bio-NER Models to Clinical- NER Based on Deep Learning and Hyperparameter Optimization Techniques

7.1 Overview

As seen in the previous Chapter, deep learning approaches have shown to perform well for complex Bio-NER model development and relieved the manual feature engineering step required for shallow machine learning techniques. However, the choice of deep learning architecture, and hyperparameters are chosen for the architecture, can determine the performance improvements that can be achieved. Out of the three different deep learning architectures used in the previous Chapter, the hybrid-CNN-RNN architecture seemed to perform well, whereas the other two architectures, the one based on FFN and RNN did not perform well when compared with shallow machine learning technique as a baseline. One of the reasons behind the poor performance deep learning models (FFN/RNN here) could be the choice of hyperparameters, and difficulty in choosing multiple hyperparameters. As the deep learning architecture gets more complex (hybrid-CNN-RNN for example), there will be more hyper-parameters that need to be tuned, often requiring human expert intervention. As the aim of deep learning approaches is to minimize human intervention, and create end-to-end automated algorithm pipelines, the hyper-parameter selection and tuning needs to be automated as well. With appropriate automatic hyperparameter selection and tuning tools, it is not only possible to minimize human intervention and create fully automated pipelines, but it is also

CHAPTER SEVEN 113

possible to adapt and extend the deep learning algorithm pipeline to a different NER context.

This chapter describes the research efforts made in this direction and shows how the hybrid-

CNN-RNN deep learning model with hyperparameter optimization has been extended for the clinical-NER model development. The rest of the Chapter describes each step for this work done in detail. Next Section provides a background on hyper-parameter optimization. The clinical-NER problem is described in Section 7.3, and details of the approach used for hybrid-

CNN-RNN with hyperparameter optimization for Clinical-NER model development are described in Section 7.4. Section 7.5 presents the experimental work done for validating the proposed approach, and the Chapter ends with the outcomes achieved and plans for further research work.

7.2 Hyper-parameter Optimization

Hyperparameter optimization is a process for selecting suitable hyperparameters to work with an existing dataset or any new dataset to obtain better results for a shallow or deep machine learning algorithm [145]. Most deep learning algorithms explicitly define specific hyperparameters that control different aspects, such as memory or cost of execution. However, additional hyperparameters can be defined to adapt an algorithm to a specific scenario. practitioners typically spend quite a bit of time tuning hyperparameters in order to achieve the best performance for a specific model.

Various techniques have been proposed in the literature for automatic hyper-parameter optimization, including random search [136], Bayesian optimization [137-139], evolutionary optimization [140], meta-learning [141-142] and bandit-based methods [143]. All these techniques require a search space/the domain of the function to be optimized that specifies which hyperparameters to optimize and what ranges to consider for each of them. Currently CHAPTER SEVEN 114

these search spaces are designed by experienced machine learners, with decisions typically based on a combination of intuition for testing purposes.

Also, when it comes to selecting and optimizing hyperparameters, there are two basic approaches: manual and automatic approaches. Both approaches are useful, and the decision typically represents a tradeoff between the deep understandings of a machine learning model required to select hyperparameters manually versus the high computational cost required by automatic selection algorithms. The main objective of manual hyperparameter selection is to tune the effective capacity of a model to match the complexity of the target task. Machine learning models also have a concept of effective capacity. In that context, the effective capacity of a machine learning algorithm is determined by three main factors: (1) The representational capacity of the algorithm or the set of hypotheses that satisfy the training dataset; (2) The effectiveness of the algorithm to minimize its cost function; (3) The degree on which the cost function and training process minimize the test error. Tuning different hyperparameters is the key to find the optimal balance between those factors. However, optimizing hyperparameter manually can be an exhausting endeavor. To address that challenge, we can use algorithms that automatically work out a potential set of hyperparameters and attempt to optimize them while hiding that complexity from the developer. Algorithms such as Random Search and Grid

Search are some of the popular and well-performing algorithms for hyperparameter inference and optimization.

Essentially, model optimization is one of the toughest challenges in the implementation of machine learning solutions. Entire branches of machine learning and deep learning theory have been dedicated to the optimization of models. Typically, we think about model optimization as a process of regularly modifying the code of the model in order to minimize the testing error.

However, deep learning optimization often requires fine-tuning elements that live outside the

CHAPTER SEVEN 115

model but that can heavily influence its behavior. Deep learning often refers to those hidden elements as hyperparameters as they are one of the most critical components of any machine learning application.

It is also important to mention the tools for optimizing hyperparameters. Most deep learning frameworks such as Tensor-Flow or Keras natively include hyperparameters optimization algorithms. However, they do not give much freedom in tailoring it to different application contexts. Certain toolsets available recently (Weights and Biases Toolset for example) provide this autonomy. We selected Weights&Biases (W&B) as our Hyperparameter optimization toolset for tuning each component of the deep learning architecture to improve the performance and extend it to different application contexts. The W&B toolset is used by deep learning powerhouses such as OpenAI [146] and uses an advanced programming model for recording and tuning hyperparameters across different experiments.

In addition to W&B toolset, few other approaches are also used for hyper-parameter optimization, such as Speeding up the Hyperparameter optimization tools for Deep

Convolutional Neural network [147], which is based on using lower-dimensional data representations at the beginning, while increasing the dimensionality of the input later in the optimization process; Efficient Hyperparameter Optimization of Deep Learning Algorithms

Using Deterministic RBF Surrogates tools [148] aims to improve the performance of Bayesian optimization, as the probabilistic surrogates require accurate estimates of sufficient statistics of the error distribution and thus need many function evaluations with a sizeable number of hyperparameters; Sherpa library [149], is a free open-source hyperparameter optimization library for machine learning models. It is designed for problems with computationally expensive iterative function evaluations, such as the hyperparameter tuning of deep neural networks. With Sherpa, scientists can quickly optimize hyperparameters using a variety of

CHAPTER SEVEN 116

powerful and interchangeable algorithms. Sherpa library also allows implementing custom algorithms. This library can be run on either a single machine or a cluster via a grid scheduler with minimal configuration.

7.3 Hyperparameter optimization for Hybrid CNN-

RNN Model

In this section, we describe how the hyper-parameter optimization was done for the hybrid CNN-

RNN model, one of the best performing model in previous Chapter (Chapter 6), out of the three different deep learning models: the FFN, RNN and hybrid CNN-RNN, and how it was extended for a different NER model development, comprising clinical-NER task. Before discussing further details of the framework, it is important to understand the details of the benchmark challenge dataset, the CLEF eHealth 2016 Task 1 challenge shared task corpus, used for clinical-NER model development and validation. This dataset represents a more complex application scenario as compared to the Bio-NER problem discussed in previous Chapters and involved recognition of 36-class recognition with small training, validation, and test corpus.

The train data file format contained labeled data or annotated text, and the test data file contained unlabeled data or unstructured text. The details of tags or class labels can be outlined as follows:

 B: denotes the beginning character of an NE,

 I: denotes each following character that inside the NE, and

 O: denotes a character that is not part of any NE, to agree with the convention of the

NER community.

CHAPTER SEVEN 117

The deep learning model used is the best performing Hybrid-CNN-RNN model used in the previous Chapter 6, involving model development and evaluation using the benchmark

BioNLP 2004 dataset. The only difference is that we introduced the hyperparameter optimization for this model based on W&B toolset for automatic hyper-parameter selection and tuning of the Hybrid CNN-RNN model. We selected different types of hyperparameters using a systematic approach, as detailed next:

 We defined five different hyperparameters for tuning the hybrid CNN-RNN separately,

including batch and epoch-size parameter, optimizer, kernel regularization, dropout

rate, and number of outputs for the embedding layer.

 Each hyperparameter consists of different algorithms inside it, such as Batch size,

Nadam, Regu, Dropout rate, layEmb.

 Each algorithm within the chosen hyperparameter will perform cross-validation checks

independently and generate different results.

 After obtaining results from all algorithms for each hyperparameter, we select the best

result from the best algorithms and add it for building the Clinical-NER model.

 The same process applied to the remaining of the Hyperparameters. Finally, we obtain

an optimal Clinical-NER model built from the training subset of the dataset.

 The model is ready for validation with the validation subset and testing with test-subset.

Figure 22 showed the character level Hybrid CNN together with the word embedding feature followed by RNN/LSTM layers. We have used the hyperparameter optimization approach to tune five layers, such as Convolutional Neural Network (CNN), Word Embedding,

Bidirectional (LSTM) Hidden Layer 1, Bidirectional (LSTM) Hidden Layer 2, and the Dense or Linear Operation.

CHAPTER SEVEN 118

The hybrid CNN-RNN model was trained with a pre-processed dataset from the CLEF challenge corpus. After 75 epochs of validating the Hybrid CNN above model, we were able to achieve an accuracy of 94.87 percent on an independent validation set. The accuracy for each epoch is shown in Figure 26.

Figure 26: Validating accuracy results for the Hybrid CNN-RNN model for 75 epochs.

CHAPTER SEVEN 119

Further, after 75 epochs of testing the Hybrid CNN above model, we were able to achieve an accuracy of 89.07 percent on an independent test set. The accuracy for each epoch is shown in

Figure 27.

Figure 27: Testing accuracy results for the Hybrid CNN model in 75 epochs.

7.4 Experimental Results and Discussions

In this section, experimental results for three different deep learning-based Clinical-NER models, the FFN, RNN, and Hybrid CNN-RNN built using the CLEF eHealth 2016 challenge task 1 dataset and its’ comparison with other models are discussed [8]. The performance achieved with each model is shown in Table 17.

CHAPTER SEVEN 120

As can be seen in Table 17, the best performing deep learning model for Clinical-NER task is the Hybrid-CNN-RNN model, with 89% precision, recall and F-measure (entry PHA-19-CNN-

RNN (0.89, 0.89, 0.89)), and it is one of the best performing model, outperforming all others participating systems in the challenge task [8].

Table 17: Performance Results of All Participating Systems in the CLEF challenge task.

CHAPTER SEVEN 121

Figure 28: Comparison F-score performance between Neural Networks from Hybrid PHA- 19-CNN-RNN with all other participating Systems in the CLEF challenge task using Bar Chart.

Figure 28 shows the graphical representation of the performance achieved and shows the F- score of 89% for the Clinical-NER hybrid CNN-RNN model (PHA-19-CNN-RNN (89.0, 89.0, and 89.0)).

CHAPTER SEVEN 122

7.5 Conclusion

In this chapter, we presented the hyper-parameter optimization technique for deep learning models, and discussed it’s use for automating the entire deep learning algorithm pipeline and extending it to a different and more complex task, from Bio-NER task to Clinical-NER task.

The hyper-parameter optimization done for a well-designed hybrid-CNN-RNN deep learning model has allowed improved performance, resulting in a Precision, Recall, and F-score of 89%, outperformed all other participating systems in the same challenge task. Further, this study has shown that the efficient deep learning architectures for complex NER tasks with efficient hyperparameter tuning, are extensible and adapt well for addressing the challenges of different application contexts (from Bio-NER to Clinical-NER task), and are able to learn and understand the shared representations and the latent structure for the biomedical and medical

NLP texts better.

Next Chapter shows the software platform and implementation of the NER system based on the algorithms developed in this thesis.

CHAPTER SEVEN 123

124

CHAPTER EIGHT

Software Platform Development and Deployment

8.1 Overview

This chapter presents the Information Visualization stage, which is based on our three robust models: the CRF learner based on rich features; Bio-NER based on CNN and RNN neural networks; and Clinical-NER based on hybrid deep learning. We developed these three models in Chapters 5, 6, and 7 respectively, and deployed them on the Amazon Web Services (AWS) cloud.

To make these models available for public testing and assessment via the link ec2-34-203-212-72.compute-1.amazonaws.com, a web application was first developed, consisting of a main page and three sub-pages: Deep Learning predicts NHO (Nursing Handover), Deep Learning predicts Biomedical, and CRF predicts Biomedical. To deploy the web application and display predictive results to the client, we registered an account with Amazon Web Services (AWS) and configured a new Virtual Machine (VM) using Amazon Elastic Compute Cloud (Amazon EC2), a web service that supplies resizable, secure compute capacity in the cloud and is designed to make web-scale cloud computing easier for developers [152].


8.2 Software Development Framework

This research work used two popular software tools: the Flask framework and the AWS platform. Flask is a lightweight Web Server Gateway Interface (WSGI) web application framework. It is designed to make getting started quick and easy, with the ability to scale up to complex applications. Flask also provides a development web server that can be used for testing the web application in the development environment [153].
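As an illustration only (the actual application code is not reproduced here), a minimal Flask WSGI application of this shape might look like the following; the route names and the `run_model` helper are hypothetical stand-ins for the deployed pages and models:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def index():
    # Main page; the real application renders an HTML template
    # with the three prediction buttons.
    return "NER demo: DL predicts NHO | DL predicts Biomedical | CRF predicts Biomedical"

def run_model(text):
    # Hypothetical placeholder for the deployed CRF / deep learning model.
    return f"received {len(text.split())} tokens"

@app.route("/predict", methods=["POST"])
def predict():
    # The client posts free text; a real handler would call the trained model.
    text = request.form.get("text", "")
    return run_model(text)

if __name__ == "__main__":
    # Flask's built-in development server, as used for local testing.
    app.run(host="0.0.0.0", port=5000)
```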

Furthermore, once the web application had been successfully tested in the development stage, it was ready to be hosted in production for public access, allowing the proposed models to be tested and assessed. We chose AWS as the web server to host the web application. AWS stands for Amazon Web Services, a cloud computing platform provided by Amazon; its services offer a variety of tools, including database storage, compute power, and content delivery [154].

8.3 The Experimental Results

Three web applications were developed, based on the three best models developed in Chapter 5 (CRF), Chapter 6 (Neural Network), and Chapter 7 (Hybrid Deep Learning). To make these web applications available to the public, we registered with Amazon Web Services (AWS). Once both the server and Internet Information Services (IIS) were successfully configured, the applications were uploaded to the AWS services, which automatically provide the public link below for anyone who wishes to access and test the proposed models using either biomedical or clinical text: http://ec2-34-203-212-72.compute-1.amazonaws.com/


8.3.1 Client-Server Communication

Initially, the client submits either biomedical or clinical text as a query and clicks the Submit button to send a request to the server. Once the server receives the prediction from the model, it responds with the result, which the client displays under the column heading “Details”, as shown in Figures 32, 34, and 36. Figure 29 describes how clients and servers communicate with each other.
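The request/response cycle described above can be sketched with Python's standard library alone; the handler below is a toy stand-in for the deployed model server, and the “Details” payload is illustrative:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class PredictHandler(BaseHTTPRequestHandler):
    """Toy stand-in for the prediction server: it receives the submitted
    text and responds with a (fake) result for the "Details" column."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        body = f"Details: {len(text.split())} tokens tagged".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

def query_server(text, port):
    """Client side: submit the query text, return the server's result."""
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/", data=text.encode("utf-8"), method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Run the toy server on any free port and perform one round trip.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
result = query_server("patient stable overnight", server.server_address[1])
server.shutdown()
```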

Figure 29: Effective Communication between Client and Server.


8.3.2 Main Webpage

To open the main webpage, copy the provided link and paste it into your browser's address bar, as shown in the red box below.

Figure 30: The main Webpage contains three sub-webpages.

As shown on the main webpage above, three buttons are provided for predicting clinical and biomedical NER entities. The three buttons have different functions:

 DL predicts NHO stands for “Deep Learning predicts Nursing Handover text”.

 DL predicts Biomedical stands for “Deep Learning predicts biomedical text”.

 CRF predicts Biomedical stands for “Conditional Random Field predicts biomedical text”.


8.3.3 Open Sub Webpage: CRF predicts Biomedical

To predict biomedical text using the CRF learner, click the “CRF predicts Biomedical” button shown in figure 30; figure 31 is then displayed. To test this CRF model, enter the biomedical text into the large text box circled in green and click the Submit button circled in blue; the client thereby requests the server to access the model, which processes the biomedical text. Once the server has the predicted result from the model, it responds to the client with a table of results under the red headings “Patient Information” and “Details”.

It is important to note that before submitting, the form shown in figure 31 has empty results under the “Details” heading, and after submitting, the form shown in figure 32 has the full results the client has just received from the server.


Figure 31: Before submitting, this sub-webpage contains “Value” as an empty result under the “Details” heading.


Figure 32: After submitting, this sub-webpage form contains the expected results under the “Details” heading.


8.3.4 Open Sub-Webpage: DL predicts Biomedical

This sub-webpage is similar to the previous one. To predict biomedical text using the deep learning model, click the “DL predicts Biomedical” button shown in figure 30; figure 33 is then displayed. To test this deep learning model, enter the biomedical text into the large text box circled in green and click the Submit button circled in blue; the client thereby requests the server to communicate with the model, which processes the biomedical text. Once the server has the predicted result from the model, it responds to the client with a table of results under the red headings “Patient Information” and “Details”.

It is important to note that both forms are user friendly and display different results:

 The form before submitting, shown in figure 33, has empty results under the “Details” heading. After you click the Submit button, a small dialog box appears with the message “please wait … responding soon. Thank you”, indicating that the client should wait the usual response time of 5 to 15 seconds for the result. After clicking OK and waiting a few seconds, a second dialog box appears with “Result is ready now”, indicating that the result is available. Clicking OK then displays the full results under the “Details” heading (refer to figure 34).

 The form after submitting, shown in figure 34, has the full results under the “Details” heading, which the client has just received from the server.


Figure 33: Before submitting, this sub-webpage form contains “Value” as an empty result under the “Details” heading.


Figure 34: After submitting, this sub-webpage form contains the expected results under the “Details” heading.


8.3.5 Open Sub-Webpage: DL predicts NHO

Likewise, this sub-webpage is similar to the previous two. To predict clinical Nursing Handover text using the hybrid deep learning model, click the “DL predicts NHO” button shown in figure 30; figure 35 is then displayed. To test this hybrid deep learning model, enter the clinical text into the large text box circled in green and click the Submit button circled in blue; the client thereby requests the server to communicate with the model, which processes the clinical text. Once the server has the predicted result from the model, it responds to the client with a table of results under the red headings “Patient Information” and “Details”, as shown in figure 36.


Figure 35: Before submitting, this sub-webpage form contains “N/A” as an empty result under the “Details” heading.


Figure 36: After submitting, this sub-webpage form contains the expected results under the “Details” heading.


8.4 Conclusion

This chapter presented the software development of the machine learning Information Extraction and Information Visualization stages, based on the three best-performing NER models. The software platform expects the client to send a request to the server, which communicates with the model to predict biomedical or clinical named entities from unstructured free text. Once input text is provided, the server responds to the client via a web browser (e.g., Chrome, Firefox, or Internet Explorer), which displays the result on the form under the “Patient Information” and “Details” headings.

Chapter 9 concludes the findings of this thesis and outlines plans for further research.


CHAPTER NINE

Conclusions and Further Plan

The main objective of this work was to develop an innovative computational framework for biomedical named entity recognition that is robust, extensible, and adaptive across its sub-domains (biomedical and clinical). This was achieved through iterative, step-by-step development based on different shallow and deep machine learning models, with validation on different benchmark datasets.

Firstly, the Bio-NER model was developed with the MaxEnt machine learning algorithm and a bootstrapped incremental supervised learning technique. This model served as a baseline reference system and allowed assessment of the more advanced machine learning models developed later. Model building and evaluation were done with the GENIA BioNLP/NLPBA 2004 shared task/challenge dataset, a benchmark biomedical named entity recognition dataset. The performance of the Bio-NER model was assessed in terms of precision, recall, and F-measure. The MaxEnt incremental/bootstrapped supervised learning approach for the Bio-NER task resulted in an F-measure of 59.7%, better than the two most closely competing systems in the challenge.

The next step was Bio-NER model development with a semi-supervised CRF learner. The experimental work involved adapting a CRF learner with a novel training algorithm, the semi-supervised learning algorithm, and evaluating it on the benchmark BioNLP/NLPBA shared-task corpus. The CRF model using semi-supervised learning achieved an F-score of 68.74%, outperforming the other six participating systems in the BioNLP challenge task.

The next stage involved Bio-NER model development based on conditional random fields and a rich text feature set. The rich text feature set contained three different types of features: the NLP feature set, with distinct features for capitalization, hyphenation, a window of 5, first token of a sentence, suffixes, and presence of numbers; the learning feature set (Word2Vec); and the statistical feature set (TF-IDF), resulting in improved performance on the Bio-NER task. Experimental validation of this model on the same benchmark GENIA dataset showed a significant improvement in performance, with an F-score of around 70%. This confirmed that a good learner and a powerful set of features can indeed enhance the performance of a complex text mining task involving the recognition of biomedical entities. However, rich text feature extraction may not be possible or easy. Ideally, what is needed is an algorithm that can perform as well as this powerful combination of rich text features and machine learners, without the manual involvement of multi-disciplinary experts for feature engineering. Recent deep learning algorithms provide this opportunity and are known to perform as well, or even better, with an appropriate choice of deep learning architecture.
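As an illustration of the statistical feature set, a minimal TF-IDF computation (a sketch, not the thesis implementation) can be written as:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_index: {term: tf-idf score}} for tokenised documents.

    tf   = term count / document length
    idf  = log(N / document frequency of the term)
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = {}
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        scores[i] = {
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return scores

# Toy corpus of tokenised biomedical snippets (illustrative only).
docs = [
    ["IL-2", "gene", "expression"],
    ["IL-2", "receptor", "alpha", "gene"],
    ["T", "cell", "activation"],
]
scores = tf_idf(docs)
```

Terms that occur in every document receive a score of zero (log 1 = 0), so the weighting favours terms that are distinctive for a document, which is what makes it useful as a statistical feature.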

The next step was to build the Bio-NER model using different types of deep learning algorithms, which can extract latent features without feature engineering and handcrafting, removing the dependency on multi-disciplinary experts for creating annotated training datasets. Three different deep learning approaches were examined: feedforward networks (FFNs), recurrent neural networks (RNNs), and hybrid convolutional neural networks (Hybrid-CNNs). Of the different deep learning models examined, the Hybrid-CNN architecture (CNN Pha18) performed best, with Precision, Recall, and F-score of (72.45, 68.30, 70.69), surpassing most of the other participating systems in the BioNLP challenge task, including our own model proposed in the previous chapter involving the CRF learner and rich text feature set. Experimental validation was done on the benchmark BioNLP/NLPBA 2004 shared-task corpus.

This established the power of deep learning algorithms for complex Bio-NER tasks, particularly the hybrid CNN-RNN/LSTM model, based on an appropriate combination of different deep learning techniques. The hybrid deep learning model allows the automatic discovery of latent features contained in the structure of biomedical NLP texts. This can help in the development of fully automated machine learning pipelines for any type of text data in any complex application domain, even with unbalanced, sparse, and unannotated data, as is the case with complex biomedical entity recognition tasks (i.e., the benchmark challenge dataset we used for Bio-NER modeling has six class labels as input and 11 class labels as output, and has poor quality data).

To examine robustness and the capacity to generalize and adapt to a different sub-domain, the robust Bio-NER model was developed based on a hybrid CNN-RNN algorithm with hyper-parameter optimization, validated on the BioNLP GENIA corpus, and extended to the low-resourced Clinical-NER context. This resulted in improved performance, with Precision, Recall, and F-score of 89%, the best performance achieved among all participating systems in the same challenge task. Further, the findings in this chapter have shown that efficient deep learning architectures for complex NER tasks, with efficient hyper-parameter tuning, are extensible, adapt well to the challenges of different application contexts (from the Bio-NER to the Clinical-NER task), and are better able to learn the shared representations and latent structure of biomedical and medical NLP texts.

One future direction of this research is to investigate hybrid deep learning algorithms that are robust and work well in low-resourced contexts, as large annotated corpora are hard to find in several sub-domains of the medical field; in medical NLP, annotated examples are scarce. To improve NER model performance and robustness in such situations, enhancements that could be explored include layer-wise initialization with pre-trained weights, combining pre-training data from two different sub-domains, custom word embeddings, and optimizing out-of-vocabulary (OOV) words. Further, these improved algorithmic implementations should be validated for performance and robustness on several benchmark datasets drawn from low-resourced contexts. This can then lead to a seamlessly integrated, end-to-end automated algorithmic pipeline that works with few training/annotated examples and can be implemented in real-time online NER systems.


APPENDIX 1

Details of Benchmark datasets used

Data selection is defined as the process of determining the appropriate data type and source, as well as suitable instruments to collect data. This thesis collected and prepared two different datasets for the experiments: the BioNLP 2004 challenge task and the CLEF eHealth 2016 Task 1A, described next.

The first challenge dataset: the data were obtained from the biomedical entity recognition task at BioNLP/NLPBA 2004 [7]. The first step in preparing the data was checking its format, which, as expected, was the IOB format. The IOB format (short for inside, outside, beginning) is a common tagging format for tokens in a chunking task in computational linguistics, such as named entity recognition (NER). The NER task involved the recognition of six biomedical concepts: DNA, RNA, Protein, Cell-line, Cell-Type, and O (for Other), as shown in figure 37.

Figure 37: The six class labels contained in the training dataset as input.

In the IOB format, all datasets contain one word per line, with empty lines representing sentence boundaries. Each line also carries a label indicating whether the word is inside a named entity or not. Words labeled with O are outside of named entities; the B-XXX label is used for the first word of a named entity of type XXX, and I-XXX is used for all other words in named entities of type XXX. Figure 38 shows an example of the NER system's 11-class output in the IOB format.
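The IOB conventions above can be made concrete with a short parser; this is an illustrative sketch, and the sample tags are chosen to match the dataset's entity types:

```python
def read_conll(lines):
    """Parse two-column 'word LABEL' lines into a list of sentences,
    where each sentence is a list of (token, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:                      # empty line = sentence boundary
            if current:
                sentences.append(current)
                current = []
        else:
            token, tag = line.split()
            current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

def extract_entities(sentence):
    """Collapse B-XXX / I-XXX runs into (entity text, type) spans."""
    entities, tokens, etype = [], [], None
    for token, tag in sentence + [("", "O")]:   # sentinel flushes last span
        if tag.startswith("B-") or tag == "O" or (
            tag.startswith("I-") and tag[2:] != etype
        ):
            if tokens:                          # close the open span
                entities.append((" ".join(tokens), etype))
                tokens = []
            if tag.startswith("B-"):            # open a new span
                tokens, etype = [token], tag[2:]
        elif tag.startswith("I-"):              # continue the open span
            tokens.append(token)
    return entities

sample = ["IL-2 B-DNA", "gene I-DNA", "expression O", "",
          "T B-cell_type", "cells I-cell_type"]
sentences = read_conll(sample)
```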

This thesis concentrates on a classification technique, rather than a regression technique, predicting 11 class labels as the expected output (Figure 38) when using the first BioNLP dataset, and 71 class labels as the expected output (Figure 39) when using the second CLEF challenge dataset. This classification technique produced better results compared to previous participating systems in the same challenge tasks.

Figure 38: Illustrates 11 class labels as expected output predicted from the BioNLP challenge task.


Figure 39: Illustrates 71 class labels as expected output from the second CLEF challenge task.

The first BioNLP challenge task consists of two files: a training file and a test file. The data consist of two columns separated by a space: the first column of each line is a word, and the second column is the named entity class label. Table 18 shows an example sentence from the training dataset, and Table 19 shows the test data in the BioNLP format from the first dataset of the BioNLP challenge task.


Table 18: Train data file contains labeled data from the BioNLP/NLPBA 2004 challenge task

Table 19: Test data file contains unlabeled data from the BioNLP/NLPBA 2004 challenge task.


The second challenge dataset: the data were obtained from the CLEF eHealth 2016 challenge Task 1 [8]. The first step in preparing the data for the experimental work in this thesis involved using a benchmark dataset of synthetic data, collected initially from voice recordings simulating the nursing handover process by researchers from the NICTA Canberra Research Lab, Australia. The data were provided in XML format, and the data preparation step involved conversion into the Conference on Computational Natural Language Learning (CoNLL-2002) format [65]. The data consist of two columns separated by a single space: each word is on a separate line, with an empty line after each sentence; the first item on each line is a word and the second is the named entity tag. Figure 40 shows the 36 class labels contained as input in the training dataset of the CLEF challenge corpus.

Figure 40: Illustrates 36 input class labels contained in the Train dataset of CLEF eHealth.

B-YYY is used for the first word of a named entity of type YYY; words tagged with O are outside of named entities; and I-YYY is used for all other words in a named entity of type YYY. The task is to predict 71 classes as the expected output (Figure 41). In addition, this CLEF eHealth challenge task consists of three files: a training file, a validation file, and a testing file.


APPENDIX 2

Machine Learning Development Tools Used

The current research has used five well-known software development tools: the Apache OpenNLP toolkit, MAchine Learning for LanguagE Toolkit (MALLET), the sklearn-crfsuite toolkit, Keras (the Python deep learning library), and the Weights & Biases (W&B) tool; and six machine learning development schemes: Maximum Entropy, semi-supervised CRF, supervised CRF, and deep learning (FNN, RNN, and hybrid CNN-RNN). These machine learning algorithms allow the prototype outcomes to be translated into a production system for business deployment. The software development tools are described in the following:

The Apache OpenNLP library is a machine learning-based toolkit for the processing of natural language text [59]. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron-based machine learning. The goal of the OpenNLP project is to create a mature toolkit for the abovementioned tasks; an additional goal is to provide many pre-built models for a variety of languages, as well as the annotated text resources those models are derived from. The Apache OpenNLP library contains several components, enabling one to build a full natural language processing pipeline: sentence detector, tokenizer, name finder, document categorizer, part-of-speech tagger, chunker, parser, and coreference resolution. Components contain parts that enable one to execute the respective natural language processing task, to train a model, and often also to evaluate a model. Each of these facilities is accessible via its application program interface (API), and a command-line interface (CLI) is provided for the convenience of experiments and training. The Apache OpenNLP library also supports testing and model training. We have modified this toolkit and used it for solving the problem of NER in the biomedical context.

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text [60]. MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics. In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text; algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields, implemented in an extensible system for finite-state transducers. Furthermore, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of "pipes", which handle distinct tasks such as tokenizing strings, removing stop-words, and converting sequences into count vectors. A classifier is an algorithm that distinguishes between a fixed set of classes, such as "spam" vs. "non-spam", based on labeled training examples; MALLET provides tools for evaluating such classifiers.

The Sklearn-crfsuite toolkit is a thin CRFsuite (python-crfsuite) wrapper that provides a scikit-learn-compatible estimator. CRFsuite [67] is a fast implementation of Conditional Random Fields (CRFs) [10, 68, 69] for labelling sequential data. Among the various implementations of CRFs, this software provides the following features: (1) fast training and tagging; the primary mission of this software is to train and use CRF models as fast as possible. (2) A simple data format for training and tagging, similar to those used in other machine learning tools: each line consists of a label and the attributes (features) of an item, and consecutive lines represent a sequence of items (an empty line denotes the end of an item sequence). This means that users can design an arbitrary number of features for each item, which is impossible in CRF++. (3) A forward/backward algorithm using the scaling method [85], which appears faster than computing the forward/backward scores in the logarithm domain. (4) Linear-chain (first-order Markov) CRFs. (5) Performance evaluation during training: CRFsuite can output precision, recall, and F1 scores of the model evaluated on test data. (6) An efficient file format for storing/accessing CRF models using Constant Quark Database (CQDB); a tagger takes little time to start up, since preparation only requires reading the entire model file into a memory block, and retrieving the weight of a feature is also very quick.
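The per-item attribute format described in point (2) maps naturally onto the list-of-dicts input that sklearn-crfsuite's estimator accepts: one feature dictionary per token, one list per sentence. The feature names below are illustrative, not the thesis's exact feature set:

```python
def token_features(sentence, i):
    """Feature dict for token i, in the one-dict-per-token form
    that sklearn-crfsuite expects."""
    word = sentence[i]
    feats = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.has_digit": any(c.isdigit() for c in word),
        "word.has_hyphen": "-" in word,
        "suffix3": word[-3:],
    }
    # Neighbouring-word context (a window feature, as in the rich set).
    feats["prev.lower"] = sentence[i - 1].lower() if i > 0 else "<BOS>"
    feats["next.lower"] = (
        sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>"
    )
    return feats

sentence = ["IL-2", "gene", "expression"]
X = [token_features(sentence, i) for i in range(len(sentence))]
```

A list of such per-sentence feature lists, together with matching label lists, can then be passed to `sklearn_crfsuite.CRF().fit` for training.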

Keras, the Python deep learning library, is used to discover the latent feature sets proposed in the hybrid CNN model. Keras [63] is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation: being able to go from idea to result with the least possible delay is key to doing good research. The current research used Keras because we needed a deep learning library that: (1) allows easy and fast prototyping through user-friendliness, modularity, and extensibility:

 User-friendliness: Keras is an API designed for human beings, not machines. It puts the user experience front and center. Keras follows best practices for reducing cognitive load: it offers consistent and simple APIs, minimizes the number of user actions required for common use cases, and provides clear and actionable feedback upon user error.

 Modularity: a model is understood as a sequence or a graph of standalone, fully configurable modules that can be plugged together with as few restrictions as possible. In particular, neural layers, cost functions, optimizers, initialization schemes, activation functions, and regularization schemes are all standalone modules that can be combined to create new models.

 Easy extensibility: new modules are simple to add (as new classes and functions), and existing modules provide ample examples. Being able to easily create new modules allows for total expressiveness, making Keras suitable for advanced research.

(2) supports convolutional networks and recurrent networks, as well as combinations of the two; and (3) runs seamlessly on CPU and GPU.
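A hybrid CNN-RNN sequence tagger of the kind discussed can be sketched in Keras as follows; the vocabulary size, sequence length, layer sizes, and tag count are assumed values for illustration, not the configuration used in this thesis:

```python
from tensorflow.keras import layers, models

# Assumed sizes for illustration only.
VOCAB, TAGS, MAXLEN = 5000, 11, 50

inp = layers.Input(shape=(MAXLEN,))
x = layers.Embedding(VOCAB, 100)(inp)              # learned word embeddings
x = layers.Conv1D(128, 3, padding="same",
                  activation="relu")(x)            # CNN: local n-gram features
x = layers.Bidirectional(
    layers.LSTM(100, return_sequences=True))(x)    # RNN: sentence context
out = layers.TimeDistributed(
    layers.Dense(TAGS, activation="softmax"))(x)   # one tag per token

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

The convolutional layer captures local character- and word-window patterns, while the bidirectional LSTM models longer-range context; the `TimeDistributed` dense layer emits one tag distribution per token, matching the IOB labelling scheme.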



The MALLET Named Entity Recognition component is also known as the CRF Classifier. The software provides a general implementation of (arbitrary-order) linear-chain Conditional Random Field (CRF) sequence models; its goal is to make it very easy to apply a suite of linguistic analysis tools to a piece of text.

As noted earlier, sklearn-crfsuite provides an interface compatible with Scikit-Learn [62], machine learning in Python: simple and efficient tools for data mining and data analysis, accessible to everybody, reusable in various contexts, and open source and commercially usable (BSD license). Because the sklearn-crfsuite CRF is a Scikit-Learn-compatible estimator, the Scikit-Learn model selection utilities (cross-validation, hyperparameter optimization) can be used with it. In summary, sklearn-crfsuite is a very easy-to-use package for applying CRF algorithms.

The Weights & Biases (W&B) tool was used for hyperparameter optimization in this work. Investigation reveals that hyperparameter optimization algorithms are natively included in most deep learning platforms, such as Keras, TensorFlow, or Theano. Nevertheless, optimizing hyperparameters in real-world deep learning schemes requires advanced tooling [64] that can help developers conduct different experiments and track the results. Most deep learning platforms on the market include some basic form of hyperparameter optimization, but many are still too limited for large-scale deep learning schemes. A hyperparameter optimization toolset should work consistently across different deep learning frameworks while providing rich interfaces for modeling, running, and comparing experiments using different hyperparameter optimizations. Innovative hyperparameter optimization tools include SageMaker, Comet.ml, Weights & Biases, Deep Cognition, Azure ML, and Cloud ML. We utilized the W&B tool with our deep learning scheme, as W&B provides an advanced toolset and programming model for recording and tuning hyperparameters across different experiments. The solution records and visualizes the different steps in the execution of the model and correlates them with the configuration of its hyperparameters.


APPENDIX 3

The World of DNA and RNA

It is important to mention that DNA stands for deoxyribonucleic acid. It is the genetic code that determines all the characteristics of a living thing. Deoxyribonucleic acid is a large molecule in the shape of a double helix, a bit like a ladder that has been twisted many times. It is made up of repeating units called nucleotides. Each nucleotide contains a sugar and a phosphate molecule, which make up the ‘backbone’ of DNA. The information in DNA is stored as a code made up of four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). It is the specific order of A, G, C, and T within a DNA molecule that is unique to you and gives you your characteristics. Human DNA consists of about 3 billion bases, and more than 99 percent of those bases are the same in all people. The order, or sequence, of these bases determines the information available for building and maintaining an organism, similar to the way in which letters of the alphabet appear in a certain order to form words and sentences.

You get your DNA from your parents; it is called ‘hereditary material’ (information that is passed on to the next generation). Nobody else in the world will have the same DNA as you, unless you have an identical twin [156].

Furthermore, RNA stands for ribonucleic acid. It is one of the major biological macromolecules that is essential for all known forms of life. The nucleotides that comprise RNA include adenine (A), guanine (G), cytosine (C), and uracil (U). It performs various important biological roles related to protein synthesis, such as transcription, decoding, regulation, and expression of genes [157].


It is also very important to distinguish between DNA and RNA. Structurally, RNA is quite similar to DNA. However, whereas DNA molecules are typically long and double-stranded, RNA molecules are much shorter and typically single-stranded. RNA molecules perform a variety of roles in the cell but are mainly involved in the process of protein synthesis (translation) and its regulation. Four different nitrogenous bases are found in DNA and in RNA; however, there is one difference: the bases found in DNA are guanine, cytosine, adenine, and thymine, while the bases found in RNA are guanine, cytosine, adenine, and uracil. Remember that DNA and RNA differ slightly at the nucleotide level. Therefore, the process of transcribing DNA into RNA not only changes the information from a double-stranded into a single-stranded molecule, but also changes all the thymine bases into uracil ones [158].
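The thymine-to-uracil substitution described above is simple enough to state as code. This short helper is only an illustration of the rule as stated in the text (a literal T-to-U replacement on a base string), not a tool used in the thesis:

```python
def transcribe(dna):
    """Return the RNA reading of a DNA base sequence by replacing every
    thymine (T) with uracil (U); the other bases (A, G, C) are unchanged."""
    allowed = set("AGCT")
    if not set(dna) <= allowed:
        raise ValueError("sequence may contain only the bases A, G, C, T")
    return dna.replace("T", "U")
```

For example, `transcribe("GATTACA")` returns `"GAUUACA"`: the two thymines become uracils while the purine and other pyrimidine bases pass through untouched.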


References

1. C. Langlotz, X. Wang, Y. Zhang, X. Ren, Yuhao Zhang, M. Zitnik, J. Shang, J. Han. "Cross-type Biomedical Named Entity Recognition with Deep Multi-Task Learning". Technical Report. School of Medicine, Stanford University, Stanford, CA 94305, USA.

2. Smith, L., Tanabe, L. K., nee Ando, R. J., Kuo, C.-J., Chung, I.-F., Hsu, C.-N., Lin, Y.- S., Klinger, R., Friedrich, C. M., Ganchev, K., et al. (2008). "Overview of biocreative ii gene mention recognition". Genome Biology, 9 (S2), S2.

3. Cokol, M., Iossifov, I., Weinreb, C., and Rzhetsky, A. (2005). Emergent behavior of growing knowledge about molecular interactions. Nature biotechnology, 23 (10), 1243– 1247.

4. Morris, J. H.,Szklarczyk, D., Cook, H., Kuhn, M., Wyder, S., Simonovic, M., Santos, A., Doncheva, N. T., Roth, A., Bork, P., et al. (2017). "The string database in 2017: quality- controlled protein–protein association networks, made broadly accessible". Nucleic acids research, 45 (D1), D362–D368.

5. Wei, C.-H., Kao, H.-Y., and Lu, Z. (2013). Pubtator: a web-based text mining tool for assisting biocuration. Nucleic acids research, 41 (W1), W518–W522.

6. Xie, B., Ding, Q., Han, H., and Wu, D. (2013). mircancer: a microrna–cancer association database constructed by text mining on literature. Bioinformatics, 29 (5), 638–644.

7. Jenny Finkel, Burr Settles, GuoDong Zhou, Yu Song, Gary Geunbae Lee, Shaojun Zhao, “Report on Bio-Entity Recognition Task at BioNLP/NLPBA 2004”, last modification made on 14 October 2004 by Jin-Dong Kim.

8. CLEF eHealth 2016 Task 1: Handover Information Extraction; Available from: https://sites.google.com/site/clefehealth2016/. (Accessed 3rd July 2018).

9. Hanna Suominen, Liyuan Zhou, Leif Hanlen, Gabriela Ferraro, “Benchmarking Clinical Speech Recognition and Information Extraction: New Data, Methods, and Evaluations”, Journal of Machine Learning Research Number, 2014; Submitted 06/14; Published 2014.

10. John Lafferty, Andrew McCallum, and Fernando C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data”. Published in Proceedings of the 18th International Conference on Machine Learning 2001 (ICML 2001), pages 282-289, Carnegie Mellon University, Pittsburgh, PA 15213 USA, 2001.

11. K. Tomanek, J. Wermter, and U. Hahn. “A reappraisal of sentence and token splitting for life sciences documents”. In Medinfo 2007: Proceedings of the 12th World Congress on Health (Medical) Informatics; Building Sustainable Health Systems, page 524. IOS Press, 2007.

12. Stanford core NLP; Available from: http://nlp.stanford.edu/software/corenlp.shtml/ (Accessed 10th September 2014).

13. Metamap; Available from: http://metamap.nlm.nih.gov/ (accessed 10th September 2014).

14. CSIRO Ontoserver; Available from: http://ontoserver.csiro.au:8080/ (accessed 10th September 2014).

15. Steyerberg EW. “Clinical prediction models: a practical approach to development, validation, and updating”. New York: Springer; 2009.

16. Kuhn M, Johnson K. “Applied predictive modeling”. New York: Springer; 2013.

17. Axelrod RC, Vogel D. “Predictive modeling in health plans”. Dis Manag Health Outcomes. 2003;11(12):779–87.

18. Asadi H, Dowling R, Yan B, Mitchell P. Machine learning for outcome prediction of acute ischemic stroke post intra-arterial therapy. PLoS One. 2014;9(2): e88225.

19. Y. LeCun, Y. Bengio. “Scaling learning algorithms towards AI”. Published In: "Large- Scale Kernel Machines", L. Bottou, O. Chapelle, D. DeCoste, J. Weston (eds) MIT Press, 2007. The Courant Institute of Mathematical Sciences, New York University, New York, USA.

20. A. Courville, Y. Bengio, P. Vincent. “Representation learning: A review and new perspectives”. Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1789–1828, August 2013. doi:10.1109/TPAMI.2013.50.

21. I. Arel, D. C. Rose, T. P. Karnowski (2010). “Deep machine learning-a new frontier in artificial intelligence research”. Published in: IEEE Computational Intelligence Magazine [research frontier]. Vol. 5, pp 13-18. The University of Tennessee, USA.

22. G.E. Hinton, S. Osindero, Teh Y-W (2006). “A fast learning algorithm for deep belief nets”. Neural Comput 18(7):1527– 1554.

23. Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle. “Greedy layer-wise training of deep networks”. Vol. 19 (2007).

24. H. Larochelle, J. Louradour, P. Lamblin, Y. Bengio (2009). “Exploring strategies for training deep neural networks”. J Mach, Learn Res 10:1–40.

25. Salakhutdinov R, Hinton GE (2009). “Deep Boltzmann machines”. In: International Conference on Artificial Intelligence and Statistics. JMLR.org. pp 448–455.


26. Goodfellow I, Lee H, Le QV, Saxe A, Ng AY (2009). “Measuring invariances in deep networks”. Published In: Advances in Neural Information Processing Systems. Curran Associates, Inc. pp 646–654.

27. N. Dalal, B. Triggs. “Histograms of oriented gradients for human detection”. Published in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, California, USA, IEEE Vol. 1. pp. 886–893 (2005).

28. D. G. Lowe. “Object recognition from local scale-invariant features”. Published in 1999 Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, IEEE Vol. 2. pp 1150–1157.

29. Hinton G, Deng L, Yu D, Mohamed A-R, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath T, Dahl G, Kingsbury B (2012). “Deep neural networks for acoustic modelling in speech recognition: The shared views of four research groups”. Signal Process Mag IEEE 29(6):82–97.

30. Dahl GE, Yu D, Deng L, Acero A (2012). “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition”. Audio Speech Lang Process IEEE Trans 20 (1):30–42.

31. Hanna Suominen, Liyuan Zhou, Leif Hanlen, Gabriela Ferraro, “Benchmarking Clinical Speech Recognition and Information Extraction: New Data, Methods, and Evaluations”, Journal of Machine Learning Research Number (2014) pp-pp Submitted 06/14; Published MM/14.

32. Friedman, C., G. Hripcsak, W. DuMouchel, S. B. Johnson, and P. D. Clayton. “Natural language processing in an operational clinical information system”. Natural Language Engineering 1, 83-108, 1995.

33. A. L. Tarca, V. J. Carey, X-wen. Chen, et al. "Machine learning and its applications to biology". Published in: PLoS Computational Biology, 2007, vol. 3, pg. e116.

34. M. Garrouste-Orgeas, F. Philippart, C. Bruel, A. Max, N. Lau, and B. Misset. “Overview of medical errors and adverse events”. Published in: Ann Intensive Care. 2012 Feb 16;2(1):2. doi: 10.1186/2110-5820-2-2.

35. S. N. Weingart; R. Wilson; R. W. Gibberd; B. Harrison (2000). "Epidemiology of medical error". British Medical Journal (BMJ) Publishing Group. (Accessed 2nd February 2018)

36. R. A. Friedman. "CASES; Do Spelling and Penmanship Count? In Medicine, You Bet". The New York Times. (Accessed 29 August 2018).


37. P. A. Palmieri; P. R. DeLucia; L. T. Peterson; T. E. Ott; A. Green, (2008). "The anatomy and physiology of error in adverse healthcare events". Published in: Advances in Health Care Management. vol.7, pp. 33–68. (Accessed 29 August 2018).

38. “Difference Between side effects and adverse effects". Published in: DifferenceBetween.net. From: http://www.differencebetween.net/science/health/disease- health/difference-between-side-effects-and-adverse-effects/ (Accessed 8th August 2018).

39. Linda T. Kohn; Janet M. Corrigan; M. S. Donaldson. "To err is human: building a safer health system". Institute of Medicine. 2000; Committee on Quality of Health Care in America, Washington, DC: The National Academies Press.

40. J. A. Martin; BL. Smith; T. J. Mathews; S. J. Ventura. “Births and deaths: preliminary data for 1998”. Centers for Disease Control and Prevention, National Center for Health Statistics. National Vital Statistics Reports. 1999; 47 (25): 6.

41. T. A. Brennan; L. L. Leape; N. M. Laird. "Incidence of adverse events and negligence in hospitalized patients". Results of the Harvard Medical Practice Study I. The New England Journal of Medicine. 1991;324(6):370–376.

42. Leape LL, Brennan TA, Laird N, et al. "The nature of adverse events in hospitalized patients”. Results of the Harvard Medical Practice Study II. The New England Journal of Medicine. 1991;324(6):377–384.

43. DP. Phillips; N. Christenfeld; LM. Glynn. "Increase in US medication-error deaths between 1983 and 1993". Lancet. 1998;351(9103):643–644.

44. FR. Ernst; AJ. Grizzle. "Drug-related morbidity and mortality: updating the cost-of-illness model". J Am Pharm Assoc (Wash) 2001;41(2):192–199.

45. BJ. Kopp; BL. Erstad; ME. Allen; AA. Theodorou; G. Priestley. “Medication errors and adverse drug events in an intensive care unit: direct observation approach for detection”. Crit Care Med. 2006; 34:415–425. doi: 10.1097/01.CCM.0000198106.54306.D7.

46. AD. Calabrese; BL. Erstad; K. Brandl; JF. Barletta; SL. Kane; DS. Sherman. “Medication administration errors in adult patients in the ICU”. Intensive Care Med. 2001;27:1592– 1598. doi: 10.1007/s001340101065.

47. A. Valentin; M. Capuzzo; B. Guidet; R. Moreno; B. Metnitz; P. Bauer; P. Metnitz. “Errors in administration of parenteral drugs in intensive care units: multinational prospective study”. BMJ. 2009;338:b814. doi: 10.1136/bmj.b814.

48. Ridley SA, Booth SA, Thompson CM. “Prescription errors in UK critical care units”. Anaesthesia. 2004;59:1193–1200. doi: 10.1111/j.1365-2044.2004.03969.x.


49. Garrouste-Orgeas M, Timsit JF, Vesin A, Schwebel C, Arnodo P, Lefrant JY, Souweine B, Tabah A, Charpentier J, Gontier O. et al. “Selected medical errors in the intensive care unit: results of the IATROREF study: parts I and II on behalf of the Outcome area study group”. Am J Respir Crit Care Med. 2010;181:134–142. doi: 10.1164/rccm.200812- 1820OC.

50. Garrouste-Orgeas M, Soufir L, Tabah A, Schwebel C, Vesin A, Adrie C, Thuong M, Timsit JF. “A multifaceted program for improving quality of care in ICUs”. (IATROREF STUDY) on behalf of the Outcome area study group. Critical Care Med. in press.

51. E. M. Borycki, and A.W. Kushniruk. “Where do technology induced errors come from? Towards a model for conceptualizing and diagnosing errors caused by technology. In Human, social, and organizational aspects of health information systems”. London, Information science reference. 2008.

52. Y. Han, J. Carcillo, S. Venkatraman, R. Clark, R. Watson, T. Nguyen, H. Bayir & R. Orr. “Unexpected increased mortality after implementation of a commercially sold computerized physician order entry system”. Pediatrics, 116:1506-1512, 2005.

53. E. Coiera. “Lessons from the NHS National Programme for IT”. Med J Aust., 186:1, 3-4, 2007.

54. Biomedical natural language processing Tools and resources. Retrieved from: http://bio.nlplab.org/. (Accessed 9 July 2018).

55. B. Tang, H. Cao, X. Wang, Q. Chen, H. Xu. “Evaluating word representation features in biomedical named entity recognition tasks”. BioMed Res. Int. (2014).

56. P. A. Domingos. “A few useful things to know about machine learning”. Communications ACM, 2012, 55 (10).

57. M. Steinbach; P-N. Tan; V. Kumar. “Introduction to Data Mining”. First Edition, Addison-Wesley Longman Publishing Co, Inc, Boston, MA, 2005.

58. Difference Between Classification and Regression in Machine Learning. Retrieved from: https://machinelearningmastery.com/classification-versus-regression-in-machine- learning/. (Accessed 3rd March 2018).

59. The Apache Software Foundation. “OpenNLP: Welcome to Apache OpenNLP”. Licensed under the Apache License, Version 2.0. Retrieved from: http://opennlp.apache.org/. (Accessed 9th June 2018).

60. A. K. McCallum. "MALLET: A Machine Learning for Language Toolkit", Computer Science Department University of Massachusetts Amherst, MA 01003, USA.2002.


61. sklearn-crfsuite. Retrieved from: https://sklearn-crfsuite.readthedocs.io/en/latest/. (Accessed 12th March 2018).

62. Scikit-learn: Machine Learning in Python. Retrieved from: https://scikit-learn.org/stable/. (Accessed 12th March 2018).

63. Keras: The Python Deep Learning library. Available from: https://keras.io/ . (Accessed 13th March 2018).

64. Tools for Optimizing Hyperparameters. Retrieved from: https://towardsdatascience.com/understanding-hyperparameters-optimization-in-deep- learning-models-concepts-and-tools-357002a3338a (Accessed 17th January 2019).

65. E. F. Tjong Kim Sang. “Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition”. Retrieved from: http://www.aclweb.org/anthology/W02-2024. (Accessed 4th March 2018).

66. Han, J. and Kamber, M. “Data Mining: Concepts and Techniques”. 3rd ed. 2011, San Francisco: Morgan Kaufmann.

67. CRFsuite. Available from: http://www.chokkan.org/software/crfsuite/. (Accessed 4th March 2018).

68. Fei Sha and Fernando Pereira. “Shallow parsing with conditional random fields”. NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. 134-141. 2003.

69. Charles Sutton and Andrew McCallum. “An Introduction to Conditional Random Fields”. Foundations and Trends in Machine Learning, vol. 4, no. 4, pp. 267–373, 2012.

70. Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, David McClosky, “The Stanford CoreNLP Natural Language Processing Toolkit”, In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60, 2014.

71. Sujan Kumar Saha, Sudeshna Sarkar, Pabitra Mitra, “Feature selection techniques for maximum entropy based biomedical named entity recognition”, Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, West Bengal 721302, India.

72. Suominen, H., Salanterä, S., Velupillai, S., Chapman, W. W., Savova, G. K., Elhadad, N., Pradhan, S., South, B. R., Mowery, D., Jones, G. J. F., Leveling, J., Kelly, L., Goeuriot, L., Martínez, D., and Zuccon, G. “Overview of the ShARe/CLEF eHealth evaluation lab 2013”. In Forner, P., Müller, H., Paredes, R., Rosso, P., and Stein, B., editors, CLEF, volume 8138 of Lecture Notes in Computer Science, pages 212–231. Springer, 2013.


73. Gann Bierner, Jason Baldridge, Joern Kottmann, Thomas Morton, “The OpenNLP Maximum Entropy Package”, Last update 2013-04-11.

74. Alexa T. McCray, Anita Burgun, Oliver Bodenreider, “Aggregating UMLS semantic types for reducing conceptual complexity”, MEDINFO 2001, volume 84, pages 216–220, National Library of Medicine, Bethesda, Maryland USA.

75. Olivier Bodenreider, Alexa T. McCray (2003), “Exploring semantic groups through visual approaches”, Journal of Biomedical Informatics 36 (2003) 414–432. Department of Health and Human Services, National Institutes of Health, National Library of Medicine, Lister Hill National Centre for Biomedical Communications, MS 43, Bldg 38A Rm B1N28U, 8600 Rockville Pike, Bethesda, MD 20894, USA.

76. W. Liao; S. Weeramachaneni. “A Simple Semi-Supervised Algorithm for Named Entity Recognition”. Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing Pages 58-65. Stroudsburg, PA, USA, 2009.

77. Adam L. Berger, Stephen A. Della Pietra, Vincent J. Della Pietra, “A Maximum Entropy Approach to Natural Language Processing”. Journal, Computational Linguistics archive, Volume 22 Issue 1, March 1996, Pages 39-71, MIT Press Cambridge, MA, USA.

78. Adwait Ratnaparkhi. “A Simple Introduction to Maximum Entropy Models for Natural Language Processing”. Philadelphia, PA 19104-6228, USA. May 1998.

79. Rong Zhang, Alexander I. Rudnicky, “A new data selection approach for semi-supervised Acoustic Modelling”. Language Technologies Institute, School of Computer Science Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213. USA.

80. Andrew McCallum, “Efficiently Inducing Features of Conditional Random Fields”. Proceedings of the 19th Conference in Uncertainty in Artificial Intelligence (UAI-2003).

81. Collier N, Nobata C, Tsujii J. 2000. “Extracting the names of genes and gene products with a hidden Markov model”. In Proceedings of COLING 2000: p. 201–7.

82. Shen D, Zhang J, Zhou GD, Su J, Tan CL. Effective adaptation of a hidden markovmodel- based named entity recognizer for biomedical domain. In: Proceedings of ACL 2003 workshop on natural language processing in biomedicine; 2003: p. 49–56.

83. N. Ponomareva, F. Pla, A. Molina, P. Rosso, “Biomedical named entity recognition: a poor knowledge HMM-based approach”, LNCS, 4592 (2007), pp. 382–387.

84. Gideon S. Mann, Andrew McCallum. “Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data”. Department of Computer Science, 140 Governors Drive, University of Massachusetts Amherst, 01003; Journal of Machine Learning Research 11 (2010) 955-984, USA.


85. Terry Koo, Amir Globerson, Xavier Carreras, and Michael Collins. “Structured prediction models via the matrix-tree theorem”. In Proceedings of EMNLP-CoNLL, pages 141–150, June 2007.

86. U. Leser and J. Hakenberg, “What makes a gene name? Named entity recognition in the biomedical literature,” Briefings in Bioinformatics, vol. 6, no. 4, pp. 357–369, 2005.

87. Kim, J. D., Ohta, T., Tsuruoka, Y., Tateisi, Y., “Introduction to the bio-entity recognition task at JNLPBA”. In Proceedings of the Int. Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA 2004), pp. 70–75, Stroudsburg, PA, USA, 2004.

88. L. Smith, L. K. Tanabe, R. Ando et al. “Overview of BioCreative II gene mention recognition”. Genome Biology, vol. 9, 2, article S2, 2008.

89. R. Gaizauskas, G. Demetriou, and K. Humphreys, “Term Recognition and Classification in Biological Science Journal Articles”. In Proceedings of the Computational Terminology for Medical and Biological Applications Workshop of the 2nd International Conference on NLP, pp. 37–44, 2000.

90. K. Fukuda, A. Tamura, T. Tsunoda, and T. Takagi, “Toward information extraction: identifying protein names from biological papers”. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, pp. 707–718, 1998.

91. D. Proux, F. Rechenmann, L. Julliard, V. V. Pillet, and B. Jacq, “Detecting gene symbols and names in biological texts: a first step toward pertinent information extraction”. Genome Informatics Workshop. Genome Informatics, vol. 9, pp. 72–80, 1998.

92. C. Nobata, N. Collier, and J. Tsujii. “Automatic term identification and classification in biology texts”. In Proceedings of the 5th NLPRS, pp. 369–374, 1999.

93. T. Hofmann, “Probabilistic latent semantic analysis”. In Proceedings of the Uncertainty in Artificial Intelligence (UAI ’99), pp. 289–296, 1999.

94. J. Kristoferson, P. Kanerva, and A. Holst, “Random indexing of text samples for latent semantic analysis”. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, pp. 103–106, 2000.

95. V. J. D. Pietra, P. F. Brown, P. V. deSouza, R. L. Mercer, and J. C. Lai, “Class-based n-gram models of natural language”. Computational Linguistics, vol. 18, pp. 467–479, 1992.

96. P. Vincent, R. Ducharme, C. Jauvin, and Y. Bengio. “A neural probabilistic language model”. Journal of Machine Learning Research, vol. 3, no. 6, pp. 1137–1155, 2003.

97. L. Ratinov, Y. Bengio, and J. Turian. “Word representations: a simple and general method for semi-supervised learning”. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL ’10), pp. 384–394, Stroudsburg, Pa, USA, July 2010.

98. M. Jiang, H. Cao, Y. Wu, B. Tang, and H. Xu, “Clinical entity recognition using structural support vector machines with rich features". In Proceedings of the ACM 6th International Workshop on Data and Text Mining in Biomedical Informatics, pp. 13–20, New York, NY, USA, 2012.

99. H. Xu, M. Jiang, H. Cao, Y. Wu, and B. Tang, “Recognizing clinical entities in hospital discharge summaries using Structural Support Vector Machines with word representation features”. BMC Medical Informatics and Decision Making, vol. 13, no. supplement 1, p. S1, 2013.

100. J. C. Denny, Y. Chen, B. Tang, Y. Wu, H. Xu, and M. Jiang, “A hybrid system for temporal information extraction from clinical text”. Journal of the American Medical Informatics Association, 2013.

101. Ravish Chawla. "Overview of Conditional Random Fields". Machine Learning Graduate from Georgia Tech. August 8, 2017. Retrieved from: https://medium.com/ml2vec/overview-of-conditional-random-fields-68a2a20fa541. (Accessed 5th March 2018).

102. Alexander Crosson. “Summarize Documents using Tf-Idf”. Retrieved from: https://medium.com/@acrosson/summarize-documents-using-tf-idf-bdee8f60b71 (Accessed 6th March 2018).

103. Yohei Seki, "Sentence Extraction by tf-idf and Position Weighting from Newspaper Articles". National Institute of Informatics. Proceedings of the Third NTCIR Workshop. Dept. of Informatics, The Graduate University for Advanced Studies (Sokendai) 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan.

104. An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec. Retrieved from: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings- count-word2veec/ . (Accessed 5th February 2018).

105. Web search engine: Retrieved from: https://en.wikipedia.org/wiki/Web_search_engine, (Accessed 8th March 2018).

106. Relevance (information retrieval): (Accessed 9th March 2018) Retrieved from: https://en.wikipedia.org/wiki/Relevance (information_retrieval).

107. Stop words: Retrieved from: https://en.wikipedia.org/wiki/Stop_words, (Accessed 10th March 2018).

108. Automatic summarization: (Accessed 11th March 2018). Retrieved from: https://en.wikipedia.org/wiki/Automatic_summarization.

109. Word2Vec. Retrieved from: https://deeplearning4j.org/word2vec. (Accessed 12th March 2018).

110. Ando, R. K. (2007). Biocreative ii gene mention tagging system at IBM Watson. In Proceedings of the Second BioCreative Challenge Evaluation Workshop, volume 23, pages 101– 103.

111. Leaman, R. and Lu, Z. (2016). “TaggerOne: joint named entity recognition and normalization with semi-Markov models”. Bioinformatics, 32 (18), 2839–2846.

112. Zhou, G. and Su, J. (2004). “Exploring deep knowledge resources in biomedical name recognition”. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pages 96–99.

113. Lu, Y., Ji, D., Yao, X., Wei, X., and Liang, X. (2015). “Chemdner system with mixed conditional random fields and multi-scale word clustering”. Journal of cheminformatics, 7 (S1), S4.

114. U. Leser, and J. Hakenberg. (2005). “What makes a gene name? named entity recognition in the biomedical literature”. Briefings in bioinformatics, 6 (4), 357–369.

115. C-C. Huang, and Z. Lu (2015). Community challenges in biomedical text mining over 10 years: success, failure, and the future. Briefings in bioinformatics, 17 (1), 132–144.

116. Deep Neural Networks, 2017. Retrieved from deep learning: https://deeplearning4j.org/neuralnet-overview.html. (accessed 10th May 2017).

117. Recurrent Networks and LSTMs. Available from: https://deeplearning4j.org/lstm.html. (Accessed 11th May 2017).

118. LeCun Y, Bengio Y, Hinton G. “Deep learning”. Nature 2015;521(7553): 436–444.

119. Rectified linear unit. (Accessed 16th October 2017). Retrieved from: https://en.wikipedia.org/wiki/Rectifier_(neural_networks).

120. The hyperbolic tangent function. Available from: https://www.google.com.au/search?client=firefoxb&dcr=0&q=What+is+the+hyperbolic+tangent%3F&sa=X&ved=0ahUKEwi8gvqTlPXXAhXMyLwKHdeVAMsQzmcICA&biw=797&bih=603. (Accessed 12th November 2017).

121. B. Denny (2015). “Understanding Convolutional Neural Networks for Natural Language Processing (NLP)”. http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/. (Accessed 15th September 2017).

122. Deep Learning, NLP, and Representations. (Accessed 12th July 2017). Retrieved from: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/.


123. Restricted Boltzmann Machines. (Accessed 5th June 2017). Retrieved from: https://deeplearning4j.org/restrictedboltzmannmachine.

124. Convolutional Neural Networks. (Accessed 6th June 2017). Retrieved from: https://deeplearning4j.org/convolutionalnets.html.

125. H. Schwenk et al. “Learning phrase representations using RNN encoder–decoder for statistical machine translation”. arXiv preprint arXiv:1406.1078, 2014.

126. Neural Networks with Regression. Retrieved from: https://deeplearning4j.org/linear- regression.html. (Accessed 7th June 2017).

127. M. Mahmud, M.S. Kaiser, A. Hussain, S. Vassanelli. “Applications of Deep Learning and Reinforcement Learning to Biological Data”. IEEE Transactions on Neural Networks and Learning Systems. 2018, doi: 10.1109/TNNLS.2018.2790388.

128. Tom Young, Devamanyu Hazarika, Soujanya Poria, Erik Cambria. "Recent Trends in Deep Learning Based Natural Language Processing". School of Computing, National University of Singapore, Singapore.

129. Yongqin Xian, Christoph H. Lampert, Bernt Schiele, Zeynep Akata. "Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly". arXiv:1707.00600v3 [cs.CV], 9 Aug 2018.

130. Deng L., Liu Y. "A Joint Introduction to Natural Language Processing and to Deep Learning". 24 May 2018.

131. Deep Learning in Natural Language Processing. Retrieved from: https://pdfs.semanticscholar.org/9ec2/0b90593695e0f5a343dade71eace4a5145de.pdf. (Accessed 5th September 2018).

132. Diederik P. Kingma, Jimmy Lei Ba, “Adam: A Method for Stochastic Optimization”. University of Toronto. Published as a conference paper at ICLR 2015.

133. Bottou L, 1991. "Stochastic gradient learning in neural networks". In Proceedings of Neuro-Nîmes 1991;91(8).

134. Stochastic gradient descent. (Accessed 17th November 2017). Retrieved from: https://en.wikipedia.org/wiki/Stochastic_gradient_descent.

135. G. Hinton, N. Srivastava, K. Swersky, "RMSprop Optimization Algorithm for Gradient Descent with Neural Networks". University of Toronto, Canada, 2012.

136. Bergstra, J., Bengio, Y.: “Random search for hyper-parameter optimization”. Journal of Machine Learning Research 13(Feb), 281-305 (2012).


137. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: International Conference on Learning and Intelligent Optimization. pp. 507-523. Springer (2011)

138. Klein, A., Falkner, S., Bartels, S., Hennig, P., Hutter, F.: “Fast Bayesian optimization of machine learning hyperparameters on large datasets”. In: Proc. of AISTATS (2017).

139. Snoek, J., Larochelle, H., Adams, R.P.: “Practical Bayesian optimization of machine learning algorithms”. In: Advances in neural information processing systems 25. pp. 2951-2959 (2012).

140. Loshchilov, I., Hutter, F.: “CMA-ES for hyperparameter optimization of deep neural networks”. ArXiv [cs.NE] 1604.07269v1, 8 pages (2016).

141. Van Rijn, J.N., Abdulrahman, S.M., Brazdil, P., Vanschoren, J.: “Fast Algorithm Selection using Learning Curves”. In: Advances in Intelligent Data Analysis XIV. pp. 298-309. Springer (2015).

142. Soares, C., Brazdil, P.B., Kuba, P.: “A meta-learning method to select the kernel width in support vector regression”. Machine learning 54(3), 195-209 (2004).

143. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: “Hyperband: Bandit- Based Configuration Evaluation for Hyperparameter Optimization”. In: Proc. of ICLR (2017).

144. Phan R, Luu TM, Davey R, Chetty G. Deep Learning Based Biomedical NER Framework. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI) 2018 Nov 18 (pp. 33-40). IEEE.

145. Jesus Rodriguez (2018). "Understanding Hyperparameters Optimization in Deep Learning Models: Concepts and Tools". Retrieved from https://towardsdatascience.com/understanding-hyperparameters-optimization-in-deep- learning-models-concepts-and-tools-357002a3338a. (Accessed 16th January 2019).

146. I. Sutskever; et al. "Introduction to OpenAI", December 11, 2015. Retrieved from: https://blog.openai.com/introducing-openai/ (Accessed 16th February 2019).

147. T. Hinz, N. Navarro-Guerrero, S. Magg, S. Wermter. “Speeding up the hyperparameter optimization of deep convolutional neural networks”. International Journal of Computational Intelligence and Applications, Vol. 17, No. 2 (2018), 1850008 (15 pages). Published 18 June 2018.

148. I. Ilievski, T. Akhtar, J. Feng, C-A. Shoemaker. "Efficient Hyperparameter Optimization of Deep Learning Algorithms Using Deterministic RBF Surrogates". In proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (2017).


149. Lars Hertel, Julian Collado, Peter Sadowski, Pierre Baldi. "Sherpa: Hyperparameter Optimization for Machine Learning Models". In: 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

150. J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl. "Algorithms for Hyper-Parameter Optimization". In: Advances in Neural Information Processing Systems, 2011.

151. Hanna Suominen, L. Zhou, L. Goeuriot, and L. Kelly. "Task 1 of the CLEF eHealth Evaluation Lab 2016: Handover Information Extraction". The Australian National University (ANU), University of Canberra (UC), Canberra, ACT, Australia.

152. AWS EC2 Dashboard Resources. (Accessed 19th August 2019). Retrieved from: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Home:.

153. Flask Web App with Python. (Accessed 12th April 2019). Retrieved from: https://pythonspot.com/flask-web-app-with-python/.

154. AWS Web Services (AWS). (Accessed 20th August 2019). Retrieved from: https://searchaws.techtarget.com/definition/Amazon-Web-Services.

155. Y. Sasaki (2007). “The Truth of the F-measure”. Research Fellow. School of Computer Science, University of Manchester. MIB, 131 Princess Street, Manchester, M1 7DN.

156. Z. Gamble (2014). "Science made simple". Retrieved from: http://www.sciencemadesimple.co.uk/curriculum-blogs/biology-blogs/what-is-dna. (Accessed 30th April 2020)

157. B. Cuffari. "What is RNA?". Research fellow. News Medical Life Science. Retrieved from: https://www.news-medical.net/life-sciences/What-is-RNA.aspx. (Accessed 1st May 2020).

158. Biology 101: Intro to Biology. "Differences Between RNA and DNA & Types of RNA (mRNA, tRNA & rRNA)". Technical report. Retrieved from: https://study.com/academy/lesson/differences-between-rna-and-dna-types-of-rna-mrna- trna-rrna.html. (Accessed 2nd May 2020).

159. R. Jin, R. Yan, J. Zhang. “A Faster Iterative Scaling Algorithm for Conditional Exponential Model”. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, USA, 2003.
