Robust Biomedical Named Entity Recognition Based on Deep Learning

Robert Phan

A Thesis submitted in fulfilment of the requirements of the

Degree of Doctor of Philosophy

Faculty of Science and Technology & UC Health Research Institute

10 June 2020

Abstract

Biomedical Named Entity Recognition (Bio-NER) is a complex natural language processing (NLP) task for extracting important concepts (named entities), such as RNA (ribonucleic acid), protein, cell type, cell line, and DNA (deoxyribonucleic acid), from biomedical texts, and it attempts to discover biomedical knowledge and clinical concepts and terms automatically from text-based digital health records. For automatic recognition and discovery, researchers have extensively investigated different types of machine learning models, leading to sophisticated NER systems. However, most computer-based NER systems require manual annotations and handcrafted features specifically designed for each type of biomedical or clinical entity. The feature generation process for biomedical and clinical NLP texts requires extensive manual effort and significant background knowledge from biomedical, clinical, and linguistic experts; it suffers from a lack of generalization capability across different application contexts; and it is difficult to adapt to new biomedical entity types or clinical concepts and terms. Recently, there have been increasing efforts to apply different machine learning models to improve the performance and generalization capabilities of automatic Named Entity Recognition (NER) systems for different application contexts. However, their performance and robustness in biomedical and clinical contexts are far from optimal, due to the complex nature of medical texts, the lack of annotated dictionaries, and the need for multi-disciplinary experts to annotate massive training data for the manual feature generation process.

In this thesis, a novel computational framework based on robust machine learning models for the biomedical and clinical named entity recognition tasks is proposed. Validation of the proposed computational framework models based on shared representations, in both the Bio-NER and clinical-NER domains, resulted in improved NER performance. The innovative machine learning models based on shared representations were built using different types of machine learning algorithms, including traditional shallow machine learning algorithms based on MaxEnt (Maximum Entropy) and CRF (Conditional Random Fields), and several Deep Learning variants, including FFNs (Feedforward Networks), RNNs (Recurrent Neural Networks), and hybrid CNNs (Convolutional Neural Networks), enhanced with hyperparameter optimization techniques, allowing the characteristics of context-specific biomedical and clinical entity types to be captured from unstructured free text. The experimental validation of the proposed biomedical NER framework, based on large-scale deep machine learning algorithms, was carried out on Bio-NER and clinical-NER benchmark datasets representing different biomedical and clinical entity types, and it resulted in a significant improvement in performance over traditional NER systems based on shallow learning models, by a large margin, even with limited training data, unbalanced datasets, and the challenge of recognizing a large set of entities from small datasets. The reason behind the improved performance and generalization of the proposed deep learning-based computational framework models could be the embedding of context-specific shared information, at the character and word level, between different biomedical and clinical entities, as modelled by the deep learning architectures used. The contributions from this research can create new opportunities, in terms of a generalized, robust, high-performing computer-based NER framework that can work across a wide range of inter-related health domains with several polysemous names and entities, including biomedical, clinical, chemical, medical, and public health contexts.


Acknowledgments

After a long journey of research, I take this opportunity to thank all the staff who have helped me to accomplish this dissertation. I sincerely thank the supervisory panel: I am especially grateful to Professor Girija Chetty for her tremendous help and advice in keeping me on the right track throughout this research, and to Professor Roland Goecke, Associate Professor Dat Tran, and Adjunct Professor Leif Hanlen for their great support in making this long journey possible. I enjoyed being at the University of Canberra for my previous studies and for this current research; the staff are friendly and helpful.

I honestly thank my Uncle, who encouraged me during tough times but could not wait for me to complete this research course: he passed away in February 2018 due to rapidly progressing pancreatic cancer.

I am also grateful to my parents, brothers, and sisters for their unconditional support, patience, and encouragement. Without their help and support, the preparation of this dissertation would not have been possible.


Publications

The following is a list of my conference papers and peer-reviewed journal articles, published as contributions to this thesis. The full papers are available at the links below:

1. Phan, R, Luu, TM, Davey, R & Chetty, G 2016, ‘Clinical Name Entity Recognition Based on A Novel Machine Learning Scheme’, JNIT (Journal of Next Generation Information Technology), vol. 7, no. 2, June 2016, pp. 31-44, ISSN: 2092-8637, 2233-9388 (Print), http://www.globalcis.org/dl/citation.html?id=JNIT-384&Search=Clinical%20Name%20Entity%20Recognition%20Based%20on%20A%20Novel%20Machine%20Learning%20Scheme&op=Title

2. Phan, R, Luu, TM, Davey, R & Chetty, G 2018, ‘Deep Learning Based Biomedical NER Framework’, in 2018 IEEE Symposium Series on Computational Intelligence (SSCI), 18-21 Nov. 2018, Bangalore, India, DOI: 10.1109/SSCI.2018.8628740, https://ieeexplore.ieee.org/document/8628740

3. T. M. Luu, R. Phan, R. Davey and G. Chetty, ‘A Multilevel NER Framework for Automatic Clinical Name Entity Recognition’, 2017 IEEE International Conference on Data Mining Workshops (ICDMW), New Orleans, LA, 2017, pp. 1134-1143, DOI: 10.1109/ICDMW.2017.161, https://ieeexplore.ieee.org/document/8215793

4. Phan, R, Luu, TM, Davey, R & Chetty, G 2018, ‘Biomedical Named Entity Recognition Based on Hybrid Multistage CNN-RNN Learner’, in 2018 International Conference on Machine Learning and Data Engineering (iCMLDE), 3-7 Dec. 2018, Sydney, Australia, DOI: 10.1109/iCMLDE.2018.00032, https://ieeexplore.ieee.org/document/8614015

5. Phan, R, Luu, TM & Chetty, G 2019, ‘Enhancing Clinical Name Entity Recognition Based on Hybrid Deep Learning Scheme’, in Proceedings of the IEEE International Conference on Data Mining (ICDM'19), 08-11 November 2019, Beijing, China, https://ieeexplore.ieee.org/abstract/document/8955534

Table of Contents

Abstract ...... i

Certificate of Authorship of Thesis Form ...... iii

Retention and Use of Thesis by the University Form ...... iv

Acknowledgments ...... v

Publications ...... vii

Acronyms ...... xix

CHAPTER ONE ...... 1

Introduction ...... 1

1.1 Overview ...... 1
1.2 Introduction and Background ...... 1
1.3 Significance and Motivation ...... 4
1.4 Challenges and Research Gap ...... 12
1.4.1 Bio-NER challenges ...... 13
1.4.2 Clinical-NER challenges ...... 16
1.5 Research Questions ...... 19
1.6 Aims and Objectives ...... 20
1.7 Approach and Methodology ...... 21
1.8 Research Contributions ...... 23
1.9 Thesis Road Map ...... 25
1.10 Conclusion ...... 27

CHAPTER TWO ...... 29

Literature Review and Background Research ...... 29

2.1 Overview ...... 29
2.2 Bio-NER literature review ...... 29
2.3 Clinical-NER literature review ...... 32
2.4 Robust NER Framework Design ...... 36
2.5 Conclusion ...... 38

Declaration for Thesis Chapter [3] ...... 39

CHAPTER THREE ...... 41


Bio-NER Model Development based on Maximum Entropy Learner and Supervised Machine Learning ...... 41

3.1 Introduction ...... 41
3.2 Bio-NER model based on MaxEnt Algorithm ...... 41
3.3 Experimental Work ...... 47
3.4 Conclusion ...... 52

CHAPTER FOUR ...... 53

Bio-NER Model Development based on Conditional Random Fields with Semi-Supervised Machine Learning ...... 53

4.1 Overview ...... 53
4.2 Proposed Conditional Scheme ...... 54
4.3 Experimental Work ...... 59
4.4 Conclusion ...... 62

Declaration for Thesis Chapter [5, 6] ...... 63

CHAPTER FIVE ...... 65

Bio-NER Model based on Conditional Random Field Learner with Rich Feature Sets ...... 65

5.1 Overview ...... 65
5.2 Background and Related Work ...... 66
5.3 CRF Model Development ...... 67
5.3.1 CRF Learning ...... 67
5.3.2 Rich Text Feature Set ...... 69
5.4 Proposed Conditional Random Field Scheme ...... 73
5.4.1 BioNLP/NLPBA 2004 challenge dataset ...... 73
5.4.2 Model Building Approach and Methodology ...... 76
5.5 Experimental Results and Discussion ...... 78
5.6 Conclusion ...... 84

Declaration for Thesis Chapter [6] ...... 85

CHAPTER SIX ...... 87

Bio-NER Model Development Based on Deep Learning Approach ...... 87

6.1 Overview ...... 87
6.2 Deep Learning Background ...... 89
6.3 Proposed Deep Learning Bio-NER Scheme ...... 95
6.3.1 BioNLP 2004 Challenge dataset ...... 96
6.3.2 Deep Bio-NER Model ...... 96
6.4 Experimental Work ...... 102
6.5 Conclusion ...... 109

Declaration for Thesis Chapter [7] ...... 111

CHAPTER SEVEN ...... 113

Extending Bio-NER Models to Clinical-NER Based on Deep Learning and Hyperparameter Optimization Techniques ...... 113

7.1 Overview ...... 113
7.2 Hyper-parameter Optimization ...... 114
7.3 Hyperparameter optimization for Hybrid CNN-RNN Model ...... 117
7.4 Experimental Results and Discussions ...... 120
7.5 Conclusion ...... 123

CHAPTER EIGHT ...... 125

Software Platform Development and Deployment...... 125

8.1 Overview ...... 125
8.2 Software Development Framework ...... 126
8.3 The Experimental Results ...... 126
8.3.2 Main Webpage ...... 128
8.3.3 Open Sub-Webpage: CRF predicts Biomedical ...... 129
8.3.4 Open Sub-Webpage: DL predicts Biomedical ...... 132
8.3.5 Open Sub-Webpage: DL predicts NHO ...... 135
8.4 Conclusion ...... 138

CHAPTER NINE ...... 139

Conclusions and Further Plan ...... 139

APPENDIX 1 ...... 143

Details of Benchmark datasets used ...... 143

The first challenge dataset ...... 143
The second challenge dataset ...... 147

APPENDIX 2 ...... 149

Machine Learning Development Tools Used ...... 149


The Apache OpenNLP library ...... 152
The MALLET library ...... 153
Sklearn-crfsuite, a thin CRFsuite wrapper ...... 154
Weights&Biases (W&B) tool ...... 154

APPENDIX 3 ...... 157

The World of DNA and RNA ...... 157

References ...... 159


List of Figures

Figure 1: GIS Algorithm to train the MaxEnt model...... 45

Figure 2: The relationship between unlabelled data and their high confidence scores...... 46

Figure 3: Maximum Entropy Learner with Active/Incremental Learning...... 47

Figure 4: Results of MaxEnt based BioNER approach for 1978-1989 subset...... 49

Figure 5: Results of MaxEnt based BioNER approach for 1990-1999 set...... 49

Figure 6: Results of MaxEnt based BioNER approach for 2000-2001 set...... 50

Figure 7: Results of MaxEnt based Bio-NER approach for S1998-2001 set...... 50

Figure 8: Expected results for the proposed NER approach (Maximum entropy learner with active/incremental learning)...... 50

Figure 9: Comparative performance of proposed MaxEnt Bio-NER with Lee04 and Baseline systems...... 51

Figure 10: Semi-supervised algorithm using GE criteria for building the CRF model...... 56

Figure 11: The CRF learner based on semi-supervised learning with Generalized Expectation criteria and rich text features...... 58

Figure 12: Comparison performances of NER Pha18CRF with previous systems...... 61

Figure 13: Relationships between the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model in the context of the CRF...... 72

Figure 14: TF-IDF and Similarity feature sets obtained after pre-processing data...... 75

Figure 15: Combination of 3 feature sets: TF-IDF, Find Similar, and other rich features...... 77

Figure 16: Use Train dataset to create the Word2Vec Model ...... 79

Figure 17: CRF Model Building Workflow...... 80

Figure 18: Comparison of the performance of NER Pha18 CRFs with previous participating systems...... 83

Figure 19: Structure of FFN with 3 input nodes, a first hidden layer with 2 nodes, a second hidden layer with 3 nodes, and a final output layer with 2 nodes...... 92

Figure 20: Illustration of a CNN architecture for sentence classification...... 93

Figure 21: Basic CNN Model architecture...... 97

Figure 22: Character level of Convolutional Neural Network and ...... 98

Figure 23: The performance of the character-level CNN-RNN hybrid model on the training data set after 10 epochs...... 100

Figure 24: The ReLu activation function: R (z) = max (0, z) gives an output z if z is positive and 0 otherwise...... 102

Figure 25: Comparison of the performance of the Hybrid Pha18 (CNN) neural network with other BioNLP participating systems using F-score...... 108

Figure 26: Validating accuracy results for the Hybrid CNN-RNN model for 75 epochs. ... 119

Figure 27: Testing accuracy results for the Hybrid CNN model in 75 epochs...... 120

Figure 28: Comparison of F-score performance of the Hybrid PHA-19-CNN-RNN neural network with all other participating systems in the CLEF challenge task (bar chart)...... 122


Figure 29: Effective Communication between Client and Server...... 127

Figure 30: The main Webpage contains three sub-webpages...... 128

Figure 31: Before submitting, this sub-webpage contains “Value” as an empty result under the “details” heading...... 130

Figure 32: After submitting, this sub-webpage form contains expected results under the “details” heading...... 131

Figure 33: Before submitting, this sub-webpage form contains “Value” as an empty result under the “details” heading...... 133

Figure 34: After submitting, this sub-webpage form contains expected results under the “details” heading...... 134

Figure 35: Before submitting, this sub-webpage form contains “N/A” as an empty result under the “details” heading...... 136

Figure 36: After submitting, this sub-webpage form contains expected results under the “details” heading...... 137

Figure 37: Shows 6 class labels contained in the Training dataset as input...... 143

Figure 38: Illustrates 11 class labels as expected output predicted from the BioNLP challenge task...... 144

Figure 39: Illustrates 71 class labels as expected output from the second CLEF challenge task...... 145

Figure 40: Illustrates 36 input class labels contained in the Train dataset of CLEF eHealth...... 147


List of Tables

Table 1: Subset of clinical entities defined as per UMLS Semantic Groups and Semantic Types, used in the proposed MaxEnt learner...... 44

Table 2: Shared Task Report for the previously participating systems in the challenge...... 48

Table 3: Performance of the proposed NER approach with respect to the participating systems on each test subset...... 51

Table 4: Rich Text Features used to create the CRF Model...... 55

Table 5: Construction of an event List with Features (an Outcome and a Context)...... 55

Table 6: Performance of CRF_Model_1...... 59

Table 7: Performance of CRF_Model_2...... 59

Table 8: Performance of CRF_Model_3...... 59

Table 9: Performance of CRF_Model_4...... 60

Table 10: Performance of CRF_Model_5...... 60

Table 11: Comparison of the performance of CRF_Model_5 with respect to the participating systems on each test subset...... 61

Table 12: Performance of CRF_Model_1...... 81

Table 13: Performance of CRF_Model_2...... 81

Table 14: Performance of CRF_Model_3...... 81


Table 15: Comparison of the performance results of the three CRF Models with previous participating systems on each test subset...... 82

Table 16: Performance Results of Participating Systems in BioNLP challenge tasks...... 107

Table 17: Performance Results of All Participating Systems in CLEF challenge task...... 121

Table 18: Train data file contains Labelled data from BioNLP/NLPBA 2004 challenge task...... 146

Table 19: Test data file contains unlabelled data from Bio-NLP/NLPBA 2004 challenge task...... 146


Acronyms

ADAM: Adaptive Moment Estimation

AI: Artificial Intelligence

API: Application Programming Interface

AWS: Amazon Web Services

BioNER: Biomedical Named Entity Recognition

BioNLP: Biomedical Natural Language Processing

CLEF: Conference and Labs of the Evaluation Forum

CoNLL-2002: Conference on Computational Natural Language Learning-2002

CNNs: Convolutional Neural Networks

CPOE: computerized provider order entry

CQDB: Constant Quark Database

CRF: Conditional Random Field

DL4J: Deeplearning4j (Deep Learning for Java)

DNNs: Deep Neural Networks

EC2: Elastic Compute Cloud

eHealth: electronic Health

EMR: Electronic Medical Record

FFNs: Feedforward Networks

Flask: Python micro web framework (the name is not an acronym)

F-score: F-measure (harmonic mean of precision and recall)

GE: Generalized Expectation criteria

GIS: Generalized Iterative Scaling

GPU: Graphics Processing Unit

HMM: Hidden Markov Model

ICT: Information Communication Technology

IDF: Inverse Document Frequency

IE: Information Extraction

IIS: Internet Information Services

IV: Information Visualization

JNLPBA: Joint workshop on Natural Language Processing in Biomedicine and its

Applications

LSTM: Long Short-Term Memory

MALLET: Machine Learning for Language Toolkit

MaxEnt: Maximum Entropy


ML: Machine Learning

NLPBA: Natural Language Processing in Biomedicine and its Applications

NH: Nursing Handover

NHO: Nursing Handover Dataset

NHS: National Health System

NICTA: National Information and Communications Technology Australia

NLP: Natural Language Processing

NER: Named Entity Recognition

NNR: Neural networks with regression

O: Others

POS: Part-Of-Speech

RBM: Restricted Boltzmann Machine

RegExp: Regular Expression

ReLu: Rectified Linear Unit

RegExlib: Regular Expression Library

RNN: Recurrent Neural Network

SCTK: Scoring Toolkit

SGD: Stochastic Gradient Descent

Tanh: Hyperbolic Tangent Activation Function

TF: Term Frequency

TF-IDF: Term Frequency – Inverse Document Frequency

TM: Text Mining

UK: United Kingdom

UIMA: Unstructured Information Management Architecture

UMLS: Unified Medical Language System

URI: Uniform Resource Identifier

US: United States

VM: Virtual Machine

XML: Extensible Mark-up Language

Word2Vec: Word to Vector

WR: Word Representation

WSGI: Web Server Gateway Interface


CHAPTER ONE

Introduction

1.1 Overview

In this Chapter, the named entity recognition problem for the biomedical and clinical context is introduced first, and then the background, challenges, and research gaps are identified. The formulation of research questions to address the gaps, and the research plan in terms of aims, objectives, approach, and methodology for the thesis, is presented next. The Chapter concludes with the research contributions of this thesis and a road map for the rest of the work done in this research.

1.2 Introduction and Background

As the wealth of medical knowledge, either in the form of literature or in the form of electronic records, increases, there is a growing opportunity and need to mine this knowledge with new and effective natural language processing (NLP) based analysis approaches, so that efficient computer-based decision support tools and assistive technologies can be developed and deployed in health care systems. Computer-based named entity recognition (NER) is an important first step for text-based information extraction systems, and the objective of the NER task is to identify words and phrases in free text that belong to classes of interest, using NLP, text mining, machine learning, and Artificial Intelligence approaches.


The named entity recognition (NER) task, in general, involves the automated assignment of a named entity tag to a designated word by using rules and heuristics. The named entity can represent a human, a location, or an organization, and it needs to be recognized from a snippet of text [1]. In a typical NER task, nominal and numeric information is extracted from a document containing words that represent a person, location, organization, or date category [2]. NER systems are normally designed to classify all words in the document into the existing categories or a “none-of-the-above”/“unknown” category.
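The tagging view of NER described above can be sketched as sequence labelling with the widely used BIO scheme (B- begins an entity, I- continues it, O marks the “none-of-the-above” category). The sentence, tags, and helper function below are purely illustrative, not part of any system developed in this thesis:

```python
# Minimal sketch: recovering entity spans from BIO-tagged tokens.
# Tokens and tags here are illustrative, not output of a real tagger.

def extract_entities(tokens, tags):
    """Collect (entity_text, entity_type) spans from BIO tags."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                  # a new entity begins
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:    # current entity continues
            current.append(token)
        else:                                     # 'O': close any open entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

tokens = ["Barack", "Obama", "visited", "Canberra"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(extract_entities(tokens, tags))
# [('Barack Obama', 'PER'), ('Canberra', 'LOC')]
```

The same mechanics apply unchanged to biomedical tags such as B-protein or B-DNA; only the label set differs.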

The named entity recognition task for a block of text from the biomedical domain (termed the “Bio-NER” task in this thesis for convenience) is important for gaining an in-depth understanding, in terms of specific knowledge, of text containing crucial information about proteins and genes, such as RNA or DNA. Finding the named entities of genes in texts is a crucial but difficult task [3]. It is very similar to finding a company name or a person's name in newspapers, but recognizing biomedical named entities is more difficult than recognizing ordinary named entities in newspapers [4].

Like the biomedical domain, the NER task is equally important and crucial for the clinical domain (termed the clinical-NER task in this thesis). Clinical studies often require detailed patient information documented in clinical narratives and handover summaries [1]. The clinical-NER task aims to extract entities of interest (e.g., disease names, medication names, and lab tests) from clinical narratives and handover summaries, to provide insight into patients' treatment regimes and adverse drug reactions, to track and monitor the quality of care, and to facilitate clinical and translational research [2, 3].
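One simple baseline for such clinical entity extraction is a longest-match dictionary lookup. The sketch below uses a tiny, purely hypothetical vocabulary; a real system would draw on much larger resources such as the UMLS:

```python
# Toy sketch of dictionary-based clinical entity lookup (longest match first).
# The mini-vocabulary is illustrative only, not a real clinical dictionary.

VOCAB = {
    "congestive heart failure": "Disease",
    "heart failure": "Disease",
    "aspirin": "Medication",
    "serum creatinine": "LabTest",
}

def dictionary_ner(text):
    """Greedy longest-match lookup over lowercased token n-grams."""
    tokens = text.lower().split()
    max_len = max(len(k.split()) for k in VOCAB)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):  # try longest first
            phrase = " ".join(tokens[i:i + n])
            if phrase in VOCAB:
                found.append((phrase, VOCAB[phrase]))
                i += n
                break
        else:
            i += 1  # no entry starts here; move on
    return found

note = "Patient with congestive heart failure started on aspirin"
print(dictionary_ner(note))
# [('congestive heart failure', 'Disease'), ('aspirin', 'Medication')]
```

The brittleness of such lookups in the face of abbreviations, misspellings, and polysemous terms is precisely what motivates the machine learning approaches discussed in this thesis.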

A large body of research currently exists on methods for recognizing named entities in many health-related domains, including the biomedical and the clinical domain, but most of these methods are based on dictionary and rule-based approaches [5]. Recently, some studies on the use of computer-based approaches for mining medical text have also been reported in the literature, using a combination of rule-based, supervised machine learning, and text mining approaches. Some of these approaches for the Bio-NER task include methods involving Hidden Markov Models (HMMs) [6], decision trees [7], support vector machines (SVMs) [8], and conditional random fields (CRFs) [9, 10]. Supervised machine learning models usually involve building representation models during the training phase, with features extracted from annotated data based on various linguistic rules created from expert annotations, and are tested on unannotated evaluation/validation data to assess the performance and generalization possible in real deployment/test scenarios.
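The per-token features such supervised sequence learners consume can be sketched as follows. The feature-dictionary format mirrors the input expected by the sklearn-crfsuite library listed in Appendix 2, but the particular features shown are illustrative assumptions, not the exact set used in later chapters:

```python
# Illustrative handcrafted features for one token in a sentence, in the
# dict-per-token style consumed by CRF toolkits such as sklearn-crfsuite.
# The feature choices here are examples only.

def token_features(tokens, i):
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "prefix3": word[:3],           # character n-gram clues
        "suffix3": word[-3:],
        "has_hyphen": "-" in word,     # common in gene/protein names like IL-2
    }
    # Context features drawn from neighbouring tokens
    feats["prev.lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next.lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats

sentence = ["IL-2", "gene", "expression"]
print(token_features(sentence, 0))
```

The manual effort of designing, tuning, and maintaining such feature templates for each new entity type is exactly the burden that the deep learning models of later chapters aim to remove.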

For the clinical domain, similar computational models based on dictionary-based approaches, rule-based approaches, and combinations of supervised machine learning and rule-based approaches have been proposed as well. For the clinical-NER task, some of the well-known approaches and tools include MetaMap [4], MedLEE [5], and KnowledgeMap [6]; these are rule-based approaches relying on existing medical vocabularies and dictionaries for NER. The clinical NLP community has regularly organized challenges and shared tasks to further the state of the art, with benchmark datasets and evaluation criteria in each of these challenges [7-10]. Several of the top-performing systems in these challenges and shared tasks are primarily based on supervised machine learning models built with expert-annotated data made available by the challenge organizers, together with manually extracted and handcrafted NLP features [11-13]. A few other methods based on traditional supervised machine learning algorithms include ensemble models that combine multiple machine learning methods [14, 15], hybrid systems that combine machine learning with high-confidence rules [16], unsupervised features generated using clustering algorithms [17, 18], Brown clustering [19], and domain adaptation [20, 21] to leverage labeled corpora from other domains, albeit all requiring manual handcrafting of features and the creation of human-expert-annotated training corpora.

However, despite the significant progress that appears to have been made in the development of computer-based approaches and tools for the Bio-NER and clinical-NER tasks, there are several challenges in each of these domains that have been overlooked or not yet addressed. This could be due to the evolving, growing, and complex nature of the text sources/documents used in different sub-specialties within the health-related domains. For example, certain common terms used in the text documents of some sub-specialty domains appear as similar words, but they do not mean the same thing in the context of these individual domains.

The current computer-based approaches and tools cannot handle these complexities and cannot provide decision support and assistance without human expert intervention. There is a need to address this problem, in terms of better Bio-NER and clinical-NER approaches, and efficient computer-based decision support tools and assistive technologies, so that translational benefits can be realized in this health area. The next Section provides the rationale, in terms of the significance and motivation underpinning this research, and some strategies for addressing the problems associated with the current state of Bio-NER/clinical-NER systems.

1.3 Significance and Motivation

The healthcare industry collects enormous amounts of health data from diverse sources, including electronic healthcare records, hospital and census databases, medical sensors, and mobile devices. These large biomedical and clinical data sets provide opportunities to advance health care and biomedical research. Predictive analytics based on efficient machine learning techniques can leverage these large, heterogeneous data sets to further knowledge and foster discovery. Predictive modeling can facilitate appropriate and timely care by forecasting an individual's health risk and health outcomes. Approaches to predictive modeling include traditional approaches, such as structural equation models or descriptive-statistics-driven methods, as well as the recent machine learning methods. The power of machine learning methods comes from the ability to build data-driven models that improve automatically through experience [15]; examples include support vector machines (SVM), neural networks (NN), decision trees (DT), random forests (RF), and probabilistic and Bayesian Networks (BN). Compared to statistical methods, these machine learning techniques can increase prediction accuracy, sometimes doubling it, with less strict assumptions, e.g., on the data distribution [16-18].

In general, the focus of the traditional machine learning techniques used for Bio-NER and clinical-NER systems is on supervised learning (with SVM, NN, DT, RF, or BN learners), as it allows an expert-annotated training corpus to be used for model building. The model-building techniques mostly used for these systems are based on shallow machine learning approaches and normally involve several feature engineering and manual handcrafting steps, to obtain a good representation of the input data, reasonable prediction performance and robustness, and generalization during the real test/deployment phase with unannotated and unseen data. Further, the inherent data quality (goodness of data) of the text sources/documents used for building Bio-NER and clinical-NER systems can also dictate the performance, robustness, and generalization that can be achieved. In addition to the goodness of the data, the feature engineering and representation techniques used can have a significant impact on the performance of machine learners. For example, with poor-quality data, and weak and inefficient features extracted from that data, even an advanced machine learning algorithm cannot produce an efficient model. However, with a good data representation, a high-performance model can be produced even with a relatively simpler machine learning algorithm.
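As one concrete example of a data representation, the TF-IDF weighting that appears in the feature sets of Chapter 5 can be sketched as follows. This is a minimal, unsmoothed variant over a toy corpus, not the exact implementation used in the experiments (real toolkits differ in smoothing and normalization):

```python
# Minimal TF-IDF sketch: turning raw text into numeric features.
# Unsmoothed idf over a toy corpus; implementations vary in these details.
import math

def tf_idf(corpus):
    """Return one {term: tf-idf weight} dict per document."""
    n_docs = len(corpus)
    docs = [doc.lower().split() for doc in corpus]
    df = {}                                  # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in docs:
        weights = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)  # term frequency in this document
            idf = math.log(n_docs / df[term])
            weights[term] = tf * idf
        weighted.append(weights)
    return weighted

corpus = ["protein binds dna", "protein folds", "dna replication"]
vectors = tf_idf(corpus)
# 'protein' occurs in two of the three documents, so it is down-weighted
# relative to a term unique to one document, such as 'binds'.
```

This illustrates the point of the paragraph above: the representation, not just the learner, decides how much signal a discriminative term carries.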

Having said that, there are several problems associated even with good representations, due to implementation challenges (particularly in the biomedical and clinical data domains, with complex NLP text and a lack of appropriate annotated dictionaries for the medical sub-domains), and they require significant human intervention and handcrafting from experts. Some of these challenges include:

1. The complexity of the health data captured from heterogeneous, disparate sources, including Electronic Health Records, biomedical databases, clinical and hospital databases, census databases, and data from medical sensors and mobile devices, resulting in massive volumes of raw and unstructured data. Most of the traditional machine learning techniques based on shallow learning currently available cannot leverage the benefit of these massive volumes of complementary data, since the massive data needs human expert intervention from multiple disciplines before annotations can be created and predictive models built. This is evident from the lack of appropriate dictionaries and laid-down rules for building NER models. Another problem, discussed in the earlier section as well, is the problem of polysemous names and entities that exist in closely related health domains (for example, biomedical, clinical, pathological, chemical, and disease-related datasets all have similar-sounding words and names, but these carry different meanings in the different contexts of their respective domains). So, annotated corpus creation from raw data stores requires not just expert input from one discipline, but several knowledge engineers and experts from multiple health-related disciplines participating in the creation of the annotated corpus. This has been practically impossible in the past, though it is being addressed recently with shared tasks and evaluation challenges drawing participants from multiple disciplines. This situation has led to the development of several NER systems within each subspecialty that may work satisfactorily in a silo, but for broader deployment in the real world they have several performance-related issues, although they all appear to be based on machine learning technologies.

2. For analyzing massive volumes of data, traditional machine learning based on shallow learning methods with extensive feature engineering poses other unique challenges, including difficulty in dealing with moving and changing (streaming) data, its spatiotemporal aspects, highly distributed input sources, noisy and poor-quality data, high dimensionality, scalability of algorithms, imbalanced input data, unlabelled and un-categorized data, and limited supervised (labelled) data, etc.

3. The complications associated with NLP datasets for health contexts, in terms of adequate data storage, data indexing/tagging, and fast information retrieval, also make the existing machine learning approaches ineffective for Bio-NER and Clinical-NER tasks.

4. The other user-centric aspects such as quality issues associated with the NLP data

sources/documents coming from discharge summaries, handover notes and care

documents/reports available from the current systems, where there is plenty of

medical jargon, incomplete sentences, and short-hand abbreviations, leading to poor

quality data, and problems in extracting proper NLP features, for building


computer-based NER models, often resulting in propagation of medical errors in

the implementation pipeline in real-world operational scenarios.

5. The translational aspects arising due to problems with translating the discoveries

driven by findings from inter-disciplinary health related domains including biology,

physiology or genomic, into safer, more effective, and more personalized healthcare

solutions. As an illustration, it is important to highlight one of the problems related

to this scenario, the user-centric aspects in particular: Currently, patients and

the general public often find eHealth documents quite difficult to understand.

Even clinicians have problems in understanding the jargon of other professional

groups even though laws and policies emphasise the need to document care

comprehensively and provide further information on health conditions to help their

understanding. A simple example from a United States discharge document is “AP:

72 yo f w/ ESRD on HD, CAD, HTN, asthma p/w significant hyperkalaemia &

associated arrhythmias”. The clinical handover between nurses in hospital systems

involves similar discharge notes and use of such handover notes can lead to loss of

information, as computer-based systems and tools cannot interpret the named

entities in such text, and the clinical-NER/Bio-NER systems may not work

correctly. Further, coupled with this, the use of the public Internet as the source of

health-related information is a widespread phenomenon. Search engines are

commonly used to access health information available online; however, the

reliability, quality, and suitability of the information for the target audience varies

greatly. If an individual (for example, the patient in this case) tries to search for the

meaning of the discharge note above online, the search would not provide any

meaningful explanation. Previous research has shown that exposing people with no

or scarce medical knowledge to complex medical language may lead to erroneous

self-diagnosis and self-treatment. Hence, trying to access medical information on

the public Internet, which cannot handle such a complex discharge note as shown

above, can lead to the escalation of concerns about disease symptoms (e.g.,

cyberchondria). Current commercial search engines are far from effective

in answering such queries.

Use of better computer-based approaches, algorithms, and data-driven machine learning

techniques, without a need for multidisciplinary experts for annotation, corpus creation

and NLP feature engineering work (involving biomedical, clinical, linguistic,

computing experts), is needed to improve the state of the art in Bio-NER and clinical-

NER systems. This can then lead to the development of efficient decision support tools and

assistive technologies for empowering the patients, the care givers, the health

professionals, the regulatory bodies, and the researchers across the entire health

ecosystem. For example, the same text-based eHealth discharge

document (“AP: 72 yo f w/ ESRD on HD, CAD, HTN, asthma p/w significant

hyperkalaemia & associated arrhythmias”) would be much easier to understand, if an

efficient Bio-NER and/or clinical-NER system is available for translating this sentence,

by expanding shorthand and correcting the misspellings (if any), by normalising all

health conditions to standardised terminology, and by linking the words to a patient-

centric search on the Internet. This would turn the above-mentioned cryptic,

jargon-ridden sentence into a more user-friendly form that reads: “Description

of the patient's active problem: 72-year-old female with dependence on haemodialysis,

coronary heart disease, hypertensive disease, and asthma who is currently presenting

with the problem of significant hyperkalaemia and associated arrhythmias” with the


highlighted words linked to their definitions in Consumer Health Vocabulary and other

patient-friendly sources.
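The shorthand-expansion step sketched above can be illustrated with a minimal lookup-table approach. The abbreviation table and function below are purely illustrative assumptions, not the system developed in this thesis; a real system would normalise against standardised terminology rather than a hand-written dictionary:

```python
# Hypothetical shorthand-expansion sketch for clinical discharge notes.
# The abbreviation table is illustrative only.
ABBREVIATIONS = {
    "AP": "active problem",
    "yo": "year-old",
    "f": "female",
    "w/": "with",
    "ESRD": "end-stage renal disease",
    "HD": "haemodialysis",
    "CAD": "coronary artery disease",
    "HTN": "hypertension",
    "p/w": "presenting with",
}

def expand_shorthand(text: str) -> str:
    """Replace each known abbreviation token, preserving trailing punctuation."""
    out = []
    for tok in text.split():
        core = tok.rstrip(",;:")            # detach trailing punctuation
        suffix = tok[len(core):]
        out.append(ABBREVIATIONS.get(core, core) + suffix)
    return " ".join(out)

print(expand_shorthand("AP: 72 yo f w/ ESRD on HD, CAD, HTN, asthma"))
```

A production system would also handle case variants, misspellings, and multi-token abbreviations, which this sketch deliberately ignores.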

Further, the current Bio-NER/clinical-NER systems are weaker in their capabilities to

provide tailored and customised eHealth information in response to different target user

groups’ information needs in a timely manner, and ensure that this information is

reliable and accurate, and available for different usage scenarios and settings. Building

Clinical-NER and Bio-NER systems that can work with unlabelled data, and that do

not require a large set of annotated data, and can provide tailored and customised domain-

adaptation capabilities, and are applicable across different clinical domains and usage

scenarios would be a real breakthrough in this field. While it may not be feasible to

achieve all these objectives at once, it is possible to incrementally make progress in

achieving each objective and building on each capability for better NER systems to be

available for Biomedical and Clinical NER tasks. Some of these incremental steps could

include:

• Adapt, extend and enhance the Bio-NER and Clinical-NER models built with

shallow learning strategies, with novel NLP feature representations that can model

the latent information between words in sentences and blocks of text,

• Explore automatic discovery of NLP features and feature combinations, with less

reliance on dictionary and rule-based approaches,

• Use novel machine learning algorithms based on semi-supervised learning,

incremental learning, transfer learning, or the recent deep learning algorithms and

their hybrid combined learning architectures [19-21].


Using several of these strategies together could be helpful in building better NER systems for complex health contexts, and can result in improved Bio-NER and clinical-NER systems [22-23].

Another hurdle that can complicate the development of better Bio-NER and clinical-NER systems is the level of technological maturity and sophistication of the support infrastructure available for each. While there has been significant progress in the availability of support systems and resources for building Bio-NER systems, in terms of benchmark datasets, shared tasks, and automated computational frameworks for addressing different complexities and challenges in the biomedical domain, the situation for

Clinical-NER systems is less than ideal. There could be several reasons behind this, some of which are discussed in the next chapter on literature review, but there is a need for focussed research efforts in building better clinical-NER systems. One way of doing this is to leverage cross-domain modelling approaches: using shared representations, leveraging shared aspects within the subspecialty domains of a discipline, and exploiting the best-performing modelling techniques from the relatively mature and sophisticated Bio-NER field, for example, tailoring/customising them to build better models for the weaker and ill-addressed clinical-NER contexts. This could be a promising strategy, due to the similarities and shared objectives that exist between these two health-related domains. Some earlier studies on such cross-domain adaptation strategies for building NER systems in other application domains include an approach based on stacking up nonlinear feature extractors resulting in improved classification modelling [24], the use of better-quality samples obtained by generative probabilistic models [25], and the use of features based on the invariant property of data representations [26]. These cross-domain adaptation approaches have also produced outstanding results in other application contexts [27-28], including speech recognition [29-31] and natural language processing [32]. The ability of these cross-domain

machine learning strategies based on shared representations to extract high-level complex abstractions from heterogeneous, disparate data sources in different domains, and to condition the ill-structured data in the clinical domain, would make them an attractive option for building better tools for clinical-NER systems. The next section focusses on some of the challenges associated with cross-domain application contexts, particularly the two sub-domains of Bio-NER and clinical-NER, and identifies the research gap that exists in the automated computer-based approaches currently available for dealing with such contexts.

1.4 Challenges and Research Gap

There are several complexities and challenges the NER systems must address within the health-related sub-domains and contexts. Some of these arise from the commonality between these subspecialty domains, some from differences in the meanings and interpretation of similar-sounding names and entities within each of them, and some from the inability of current Bio-NER and Clinical-NER systems to deal with them. For example, making sense of polysemous names and entities, which leads to word sense ambiguity, is one of the major problems that current Bio-NER and clinical-NER systems cannot address. There are several other challenges in addition to this, and any innovation and research contributions need to take these challenges into account for improving the state of the art. Further details on some of these challenges are described in the next two sections of this chapter.


1.4.1 Bio-NER challenges

The named entity recognition (NER) task is a subtask of information extraction, which aims to classify all unregistered words appearing in texts and unstructured data collections. In general, the NER task involves eight categories: location, person, organization, date, time, percentage, monetary value, and “none-of-the-above” or “unknown” [12, 13]. A typical NER pipeline first finds named entities in sentences and then declares the category of each entity.

For example, in the following sentence:

“Apple [organization] CEO Tim Cook [Person] Introduces 2 New, Larger iPhones, Smart

Watch at Cupertino [Location] Flint Center [Organization] Event [14].”

Here “Apple” needs to be recognized as an organization name, not a fruit name, in terms of its context. Further, the words “Tim” and “Cook” need to be recognized together as a single entity: a person’s full name, referring to the CEO of the Apple company. “Cupertino” needs to be identified as a city name in California and tagged with a location label, and “Flint” and “Center” should be considered as a single name and recognized as the name of an organization. This might be easy for humans, but for computers, distinguishing each of these named entities based on the context in the sentence is far from easy.
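The tagging decisions described above are conventionally encoded token by token in the BIO scheme (B marks the beginning of an entity, I its continuation, O a token outside any entity). The following sketch shows the encoding, with labels hand-assigned for illustration, and how entity spans are recovered from it:

```python
# BIO-encoded view of the example sentence, with hand-assigned tags.
tagged = [
    ("Apple", "B-ORG"), ("CEO", "O"), ("Tim", "B-PER"), ("Cook", "I-PER"),
    ("Introduces", "O"), ("2", "O"), ("New", "O"), ("Larger", "O"),
    ("iPhones", "O"), (",", "O"), ("Smart", "O"), ("Watch", "O"),
    ("at", "O"), ("Cupertino", "B-LOC"), ("Flint", "B-ORG"),
    ("Center", "I-ORG"), ("Event", "O"),
]

def extract_entities(tokens):
    """Group consecutive B-/I- tags back into (entity text, label) spans."""
    entities, current, label = [], [], None
    for word, tag in tokens:
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [word], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(word)
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

print(extract_entities(tagged))
# [('Apple', 'ORG'), ('Tim Cook', 'PER'), ('Cupertino', 'LOC'), ('Flint Center', 'ORG')]
```

Note how “Tim Cook” and “Flint Center” come out as single multi-token entities, exactly the grouping a NER system must learn to make.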

Named entity recognition approaches have traditionally tried to make computers understand such sentences and subtle linguistic nuances in the text (which humans can do better) in different ways: by building models based on dictionary-based, rule-based, and machine learning-based approaches. For the dictionary-based approach, a list object is normally maintained for storing as many named entities as possible. This list-based approach, though it appears simple, has some limitations. Since the target words tend to be mainly proper nouns or unregistered words, making a computer understand them is normally difficult. Further, the

same word stream could mean different things in different contexts within the same domain and need to be recognized as different named entities in terms of their current context. In addition, if the domain changes (from biomedical to clinical for instance), the same set of words in a sentence could carry totally different meaning. So, dictionary-based methods, based on lists, developed for one context and domain, tend to perform poorly for different contexts and domains. Additionally, new words and word combinations get generated frequently and get updated for evolving NER fields, and even the same word stream would carry different meaning after updates and has to be recognised as a different named entity [15, 16].
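A dictionary (list-based) tagger of the kind just described can be sketched in a few lines. The gazetteer here is a hypothetical two-entry list, and the sketch makes the context-blindness of the approach immediately visible:

```python
# Minimal dictionary-based NER: every surface match receives the same
# label regardless of context -- the core weakness discussed above.
GAZETTEER = {"Apple": "ORG", "Cupertino": "LOC"}

def dictionary_tag(tokens):
    """Tag each token by direct gazetteer lookup; 'O' means no entity."""
    return [(tok, GAZETTEER.get(tok, "O")) for tok in tokens]

# "Apple" is tagged ORG even in a sentence about fruit:
print(dictionary_tag("He ate an Apple in Cupertino".split()))
```

Every new term, context, or domain requires the list to be extended by hand, which is why list-based methods transfer poorly.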

The second approach used for NER is the rule-based approach [17]. The success of a rule-based approach depends on the rules laid out and the patterns of named entities appearing in actual sentences. However, for rule-based approaches to be able to use context for solving the problem of multiple named entities, every rule needs to be written down before it is used. These rules also need to be changed when the context and domain change.
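A single hand-written rule of this kind might look as follows. The pattern is purely illustrative; real rule-based systems maintain large, domain-tuned rule sets that must be revised whenever the context or domain changes:

```python
import re

# One illustrative rule: one or two capitalized words following a title
# word are tagged as a person name. Every such pattern must be written,
# tested, and maintained by hand.
PERSON_RULE = re.compile(r"\b(?:CEO|Dr\.|Prof\.)\s+([A-Z][a-z]+\s[A-Z][a-z]+)")

match = PERSON_RULE.search("Apple CEO Tim Cook introduced the product.")
print(match.group(1))  # Tim Cook
```

The rule above already fails on lower-cased text or unseen title words, showing why rule maintenance does not scale across domains.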

The third approach, based on traditional shallow machine learning strategies, consists of tagging the named entities to words even without an explicit listing of words in a dictionary or a description of the context and domain definitions in a rule set. Some of these traditional machine learning approaches include Support Vector Machines (SVMs) [18],

Hidden Markov Models (HMMs) [6, 19], Maximum Entropy Markov Models (MEMMs) [20], and Conditional Random Fields (CRFs) [9, 10].

For the Bio-NER domain and the associated contexts, the research focus is normally on information extraction of protein, DNA, RNA, cell type, and cell line from biomedical literature [21–24]. Biomedical named entity recognition is an essential task in biomedical information extraction and forms the first stage of text mining in information extraction from biomedical documents and reports. Recognizing technical terms in the biomedical area from


the biomedical documents is in fact a very challenging task in natural language processing and has attracted significant interest from researchers for years [25]. For the Bio-NER task, in general, five categories (protein, DNA, RNA, cell type, and cell line) are used, instead of the normal NER categories used for general English language text. For example, a NER-tagged sentence in a Bio-NER task can be shown as:

“IL-2 [B-protein] responsiveness requires three distinct elements [B-DNA] within the enhancer [B-DNA].”

A biomedical NER system for extracting named entities from a sentence like the one shown above can face several difficulties. Firstly, due to the evolving and continuously changing nature of the biomedical domain, the number of new technical terms is rapidly and continuously increasing; building a dictionary or rule-based decision strategy that includes all the new and evolving terms could be very difficult. Secondly, the same words or expressions can mean different things depending on their context, and hence need to be classified as different named entities in terms of that context. Thirdly, for a complex entity with a long entity length and control characters such as hyphens (e.g., “12-o-tetradecanoylphorbol 13-acetate”), both dictionary- and rule-based NERs tend to fail or perform poorly. Fourthly, significant word-sense ambiguity is experienced in the biomedical area, due to the large proportion of abbreviated expressions frequently used; for instance, “TCF” could refer to “T cell factor” or to “Tissue Culture Fluid” [26, 27]. Finally, in the biomedical domain, normal or functional terms are often combined into one long biomedical term; for example, “HTLV-I-infected” and “HTLV-I-transformed” fuse the entity name with the functional terms “infected” and “transformed”, which carry special meaning within the compound. Even humans (other than biomedical science researchers) cannot always interpret such terms, let alone computers. Hence, most of the current Bio-NER approaches cannot cope with such named


entities. This is compounded by problems associated with spelling changes [28], and subsumption of one named entity category into another category [29].

1.4.2 Clinical-NER challenges

The clinical-NER task has mostly been approached as a sequence labelling problem, with researchers developing machine learning formulations that aim to find the best label sequence in a particular format (for example

BIO format labels) for a given input sentence (individual words from clinical text). Some of the machine learning models developed for the clinical-NER task include Conditional Random

Fields (CRFs) [22], Maximum Entropy (ME), and Structured Support Vector Machines

(SSVMs) [23]. The CRF model appears to be the most popular solution amongst the traditional shallow machine learning-based models used in several top-ranking systems. A typical state-of-the-art clinical NER approach involves the utilisation of NLP features extracted from different linguistic levels, including orthographic information (e.g., capitalization of letters, prefixes and suffixes), syntactic information (e.g., POS tags), word n-grams, and semantic information (e.g., the UMLS concept unique identifier). Some of the hybrid models [16] tend to further leverage the concepts and semantic types from existing clinical NLP systems such as MetaMap and cTAKES [24]. Further, for improved performance, some researchers have also explored combinations of different machine learning models, including ensemble models and re-ranking [14, 15]. A few other researchers, more recently, have examined unsupervised features extracted from large unlabelled corpora and generated word clusters using Brown clustering and random indexing [19]. The intensive and continuous contributions from the clinical NLP community have certainly boosted the performance of clinical NER systems.


However, some challenges could not be addressed adequately, leading to major gaps that hampered performance and generalisation capabilities. Some of these include:

Weaker feature representations:

The most commonly used feature representation for clinical NER tasks is based on the bag-of-words model [26, 27]. It is a simplified representation of a group of words in the text based on the presence/absence of its words. The bag-of-words representation tends to ignore word order, grammatical relations, and semantic information, and due to the sparsity problem it leads to weaker feature representations and poor performance for clinical NER. As an example, consider the two similar clinical entities:

“mildly dilated right atrium” and “somewhat enlarged left ventricle”. The bag-of-words representations of these two clinical entities show no commonality between them; they have nothing in common. However, clinicians would treat these two entities as related concepts. Hence, more robust feature representation models are needed for clinical NER tasks.
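The sparsity problem described above is easy to demonstrate: treating each entity as a bag of words, the two token sets are disjoint, so the representation assigns them zero similarity (a sketch only):

```python
# Bag-of-words view of two clinically related entities: the token sets
# share nothing, so a bag-of-words model sees zero similarity between them.
a = set("mildly dilated right atrium".split())
b = set("somewhat enlarged left ventricle".split())

overlap = a & b
print(sorted(overlap))  # [] -- no shared features despite the clinical relation
```

Dense representations (e.g., learned embeddings) address exactly this failure by placing related concepts near each other in a continuous space.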

Significant Human Intervention Needs:

Most of the Clinical-NER methods proposed so far are based on traditional shallow machine learning approaches, but tend to be rigid and task-specific, and require significant, time-consuming human feature engineering efforts [27, 28]. Two critical steps for a typical machine learning-based clinical NER system are the feature extraction step and the parameter optimization step. Feature extraction can be a very time-consuming step, as it heavily depends, for traditional shallow machine learning-based methods, on human intervention for handcrafted features, with machines involved only in handling the


parameter optimization supervised by the gold-standard annotations. Researchers’ manual intervention is also needed for screening positive/negative samples to identify possible features with special meanings (for example, tokens containing capitalized letters are more likely to have special meanings), and for designing optimal feature set combinations and sequences (for example, a body location followed by a disorder mention). These manual interventions are commonly referred to as “human feature engineering” and can lead to several problems. Mainly, the features extracted by human experts could be incomplete (humans sometimes cannot identify all possible features), inconsistent (different experts identify features differently, based on their individual subjective judgements or expertise), or over-specified; such inconsistencies or human errors can then get propagated down the information extraction pipeline.

Also, with human feature engineering, researchers must engineer the features again for every different task, data source, context or domain. Having a computer do these tasks can help objective, consistent and robust models to be developed, with minimal supervision and without time-consuming feature engineering.

Models with poor long-term dependencies:

Most of the popular clinical-NER systems surveyed were based on CRF models [28-30]. The CRF-based models use a word window over the input sequence of tokens. But word windows cannot model long-term dependencies and can lead to system prediction errors due to false negatives. It is not possible to model long-term dependencies by simply increasing the window size, as this may corrupt the signal with more noise entering the feature space and slow the model building process during the training phase. This problem can only be addressed by developing new and better architectures for clinical NER tasks, with capabilities to capture long-term dependencies from clinical texts.
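The word-window feature extraction used by such CRF baselines can be sketched as follows (the window size and feature names are illustrative). A token outside the window simply cannot influence the prediction for position i, which is exactly the long-term dependency limitation described above:

```python
def window_features(tokens, i, size=2):
    """Features for position i drawn only from a +/-size token window.

    Any token further than `size` positions away is invisible to the
    model, no matter how relevant it is to the label at position i.
    """
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset}]"] = tokens[j].lower()
    return feats

tokens = "The patient denied chest pain on admission".split()
print(window_features(tokens, 3))
```

Enlarging `size` admits more context but also more noise, which is the trade-off the passage above refers to; recurrent architectures sidestep it by carrying state across the whole sequence.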

To summarize, there is a huge body of work on NER approaches for different domains and contexts, including several health-related subspecialty domains. Despite the significant work done and the methods reported, including different dictionary-based, rule-based and machine learning-based formulations, there are several complex challenges associated with NER systems for clinical and biomedical scenarios, and most of the existing systems have several shortcomings in dealing with them. Better approaches need to be developed to further the state of the art in the Bio-NER and clinical-NER field. The research work reported in this thesis aims to address this gap and some of the challenges discussed. Three research questions have been formulated to guide the line of investigation and address the challenges and gaps in the state of the art in biomedical and clinical NER approaches. The next section describes the three research questions formulated for this thesis.

1.5 Research Questions

The overarching research question that will be addressed in this thesis is whether it is possible to develop a robust named entity recognition framework that can leverage the benefits of the well-resourced Bio-NER context for improving the performance of the low-resourced clinical-NER task.

This will be achieved by attempting to improve the state of the art for Bio-NER systems first, as it is a well-resourced context, by exploring some novel strategies for improving the baseline

Bio-NER task performance, and then attempting to improve the clinical-NER task performance by extending and adapting these models appropriately. This line of investigation then leads to three focussed research questions as outlined next:


Research Question 1:

Is it possible to enhance the performance of current Bio-NER approaches based on

shallow supervised machine learning techniques, with new NLP feature sets, and

alternate learning strategies, based on semi-supervised learning?

Research Question 2:

Is it possible to improve the performance of Bio-NER systems by using end-to-end fully

automated feature discovery and deep machine learning approaches, without

handcrafting and feature engineering requirements?

Research Question 3:

Is it possible to develop novel robust deep machine learning models with shared

representations, by leveraging the commonality between named entities from high

resource context (Bio-NER) to low resource context (clinical-NER), and improve the

performance of low resourced context?

1.6 Aims and Objectives

The aims and objectives of this research for addressing the three research questions formulated for this thesis are as outlined here:

1. To enhance the performance of current Bio-NER approaches based on shallow

supervised machine learning techniques, with new NLP feature sets, and alternate

learning strategies, based on semi-supervised learning.


2. To improve the performance of Bio-NER systems by using end-to-end fully automated

feature discovery and deep machine learning approaches, without handcrafting and

feature engineering requirements.

3. To develop robust deep machine learning models with shared representations, by

leveraging the commonality between named entities from high resource context (Bio-

NER) to low resource context (clinical-NER) and improve the performance of low

resourced context.

1.7 Approach and Methodology

The approach and methodology used for addressing the aims and objectives set for this thesis are described briefly here, with further details provided in the individual chapters in the rest of the thesis. They are as follows:

1. The Biomedical named entity recognition (Bio-NER) task chosen was the identification

of biomedical text terms (five classes), such as RNA, protein, cell type, cell line, and

DNA. For baseline comparison, traditional shallow machine learning models based on

supervised learning were examined first. Some of the shallow machine learning models

examined include the Maximum Entropy Models (MaxEnt), that use supervised

machine learning techniques. This serves as a baseline for comparing other models

developed in the thesis. For experimental evaluation and validation, the benchmark

GENIA dataset from BioNLP/NLPBA 2004 shared task was used, with model

development and testing conducted on training and evaluation set provided by the task

organizers.


2. The next step was to investigate the performance improvements possible for Bio-NER

task, with an alternative learning strategy based on semi-supervised learning with the

conditional random field (CRF) models, and evaluation and validation using the same

benchmark GENIA dataset from BioNLP/NLPBA 2004 shared task.

3. The third step was to investigate the performance improvements possible for Bio-NER

task, with the rich text feature set, consisting of a combination of several text features,

and build models based with semi-supervised learning and Conditional Random Field

(CRF) learner, and validate using the same benchmark GENIA dataset from

BioNLP/NLPBA 2004 shared task.

4. The next step was to investigate the Bio-NER performance improvements possible with

deep learning models based on different architectures, including Feedforward Network

(FFN), Recurrent Neural Network (RNN), hybrid Convolutional Neural Network

(CNN-RNN) models, and validate with the same benchmark GENIA dataset. This

concludes the investigations towards step-by-step iterative improvements for Bio-NER

model development in a well-resourced context (owing to the availability of a large

annotated dataset from a moderately complex scenario).

5. The final step involved extending and adapting the Bio-NER model developed in a

well-resourced context (Step 4 above) for the low-resourced context (Clinical-NER),

by enhancing the robustness of the deep-learning model using hyper-parameter

optimization. The clinical-NER task chosen was more complex as compared to the Bio-

NER task, and it involved identification of 35 different classes, for a smaller dataset,

which is sparse, unbalanced, synthetic, and noisy and represents a real-world nursing

handover context in the clinical domain. For experimental evaluation and validation,

the benchmark dataset from CLEF eHealth challenge 2016 Task 1 shared task on

handover information extraction was used, with validation/evaluation experiments


conducted on a training and evaluation set provided by the task organizers. As the

clinical-NER task considered in this work was more challenging as compared to the

Bio-NER task, new hyper-parameter optimization approaches for deep learning

networks were developed for improving the performance of the clinical-NER task.
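Generically, the hyper-parameter optimization step in this final stage can be sketched as a random search over a configuration space. The search space and the scoring function below are hypothetical placeholders, not the actual tuner developed for the clinical-NER model in this thesis:

```python
import random

# Hypothetical search space for a deep NER model (illustrative values).
SEARCH_SPACE = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "hidden_units": [64, 128, 256],
    "dropout": [0.2, 0.3, 0.5],
}

def evaluate(config):
    """Placeholder scorer; a real tuner would train the model and return dev-set F1."""
    random.seed(str(sorted(config.items())))   # deterministic pseudo-score per config
    return random.random()

def random_search(space, trials=10, seed=42):
    """Sample `trials` configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        config = {k: rng.choice(v) for k, v in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best, score = random_search(SEARCH_SPACE)
print(best, round(score, 3))
```

More sophisticated strategies (grid search, Bayesian optimization) replace the sampling loop but keep the same configure-train-score structure.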

The details of each step above are described in each chapter in the rest of the thesis. Next

Section presents the original research contributions made in this work.

1.8 Research Contributions

The present research aims to contribute towards a novel robust NER computational framework based on different machine learning techniques. Some of the innovative contributions made towards improving the state of the art in Bio-NER and clinical-NER system performance include:

1. Improvement in Bio-NER model performance by enhancing traditional supervised

shallow learning models with six different machine learning techniques involving a

combination of different NLP features, and with supervised, semi-supervised, and deep

learning approaches. These six algorithms include the Maximum Entropy with

supervised machine learning, the Conditional Random Field with semi-supervised

machine learning, the Conditional Random Field with supervised machine learning,

FFN and RNN based on deep learning, and hybrid CNN–RNN with hyper-parameter

optimization.

2. Extending the enhanced machine learning framework towards a robust NER

framework, involving shared representations, using character-level and word-level

representations from two different domains (Bio-NER and Clinical-NER) embedded

within deep learning models, allowing an improvement in the performance of low-

resourced clinical-NER task. The experimental validation on two different benchmark

datasets (GENIA-BioNLP/JNLBPA and CLEF eHealth 2016 task1 dataset) covering

major biomedical and clinical entity types shows that the robust shared modeling

technique proposed can outperform state-of-the-art systems based on single-domain

models by a large margin, even with limited training data available. This could be due

to large performance improvements possible by sharing character- and word-level

information across different biomedical and clinical entities.
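The character- and word-level sharing described above can be sketched schematically. The vector sizes and the hash-based stand-in for learned embeddings below are illustrative assumptions, not the thesis architecture: each token is represented by concatenating a word-level vector (shareable across domains) with a summary of its character-level vectors:

```python
import hashlib

WORD_DIM, CHAR_DIM = 4, 4  # illustrative dimensions only

def pseudo_embedding(key, dim):
    """Deterministic stand-in for a learned embedding lookup."""
    digest = hashlib.md5(key.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def token_representation(token):
    """Concatenate a word-level vector with a character-level summary."""
    word_vec = pseudo_embedding(token.lower(), WORD_DIM)
    # Character-level summary: average the per-character vectors, so
    # morphologically similar tokens (e.g., "HTLV-I-infected" vs.
    # "HTLV-I-transformed") get related character features.
    char_vecs = [pseudo_embedding(c, CHAR_DIM) for c in token]
    char_vec = [sum(col) / len(token) for col in zip(*char_vecs)]
    return word_vec + char_vec  # shared representation: concatenation

vec = token_representation("IL-2")
print(len(vec))  # 8
```

In the actual shared models, both components are learned jointly on the Bio-NER and clinical-NER corpora, which is what lets information flow between the two domains.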

These innovative research contributions have been peer-reviewed through several international conference and journal publications, validating the line of investigation used for the work reported in this thesis. The outcomes from this research, in the form of a novel robust

NER computational framework, can pave the way and create new opportunities for the development of automatic decision support tools and assistive health technologies for evolving/new contexts and domains that share certain commonalities (such as clinical, biomedical, chemical, medical and public health domains) for an improved performance, robustness and generalization.


1.9 Thesis Road Map

The research work reported in this thesis is organized into the following nine chapters:

Chapter 1: Introduction.

Chapter 2: Literature Review and Background Research.

Chapter 3: Bio-NER Model Development based on Maximum Entropy Learner and supervised machine learning.

Chapter 4: Bio-NER Model Development based on CRF learner with Semi-Supervised

Machine Learning.

Chapter 5: Bio-NER Model Development Based on CRF learner and rich NLP feature sets.

Chapter 6: Bio-NER Model Development Based on Deep Learning Approach.

Chapter 7: Extending Bio-NER Model to Clinical-NER Based on Deep Learning and Hyper-parameter Optimization techniques.

Chapter 8: Software Platforms Development and Deployment.

Chapter 9: Conclusions and Further Plan.

This chapter (Chapter one) introduces the research problem, and provides the background, rationale, significance and motivation, challenges and research gap, formulation of research questions, identifies the aims, objectives, approach and methodology for addressing the research questions, research contribution from this work to the scientific knowledge, and finally details the road map of the thesis on what each chapter entails. Chapter two provides a review of the literature and related work done and sets the background for determining the line

of the investigation, and the strategy for iterative model design, development, implementation, and test/evaluation for the proposed NER framework, in terms of different performance measures. Chapter three focusses on Biomedical Named Entity Recognition Based on MaxEnt

Machine Learning, a baseline model for comparison and assessment of improvements possible with other NER models developed in this thesis. Chapter four presents an enhanced Bio-NER model based on Conditional Random Fields and Semi-Supervised learning techniques. Chapter five describes an improved Bio-NER model with a combination of conditional random fields and rich NLP feature sets. Chapter six presents details of the enhanced Bio-NER model with an end-to-end deep learning approach without any feature engineering. It uses a combination of rich NLP features and unsupervised learned word2vec features and feedforward neural networks. Chapter seven describes robust Bio-NER model development with a hybrid deep learning architecture and hyperparameter optimization. Chapter eight provides the implementation of the proposed web service platform on AWS Amazon Web Services) Cloud, with a link for the public to test and assess our three decent models: CRF based supervised ML and tested with biomedical text, Deep Learning based on Neural Networks RNN and CNN and tested with biomedical text, and Hybrid Deep Learning based on Hyperparameter Optimization technique and tested with clinical / Nursing Handover text. Finally, chapter nine describes some of the conclusions drawn from this thesis and the suggestions for future research directions.


1.10 Conclusion

This chapter provided an overview of the entire thesis: it defined the research problem, reviewed the background, identified the rationale, significance and motivation, formulated the research questions, identified the aims, objectives, approach and methodology for addressing them, summarized the research contributions from this work, and provided a road map for the thesis. The next chapter presents the literature review and related work for this research.



CHAPTER TWO

Literature Review and Background Research

2.1 Overview

In this chapter, a review of the literature and prior research on biomedical and clinical named entity recognition is presented, setting the background and providing the rationale for the line of investigation chosen for the research reported in this thesis. First, the related work on Bio-NER approaches is reviewed, followed by the related work on clinical-NER approaches. After this brief review, the details of the design and iterative development of the proposed robust NER framework for enhancing the performance of NER systems are outlined.

2.2 Bio-NER literature review

The Named Entity Recognition (NER) task is a computerized procedure for recognizing and labelling entities in given texts. When used in the biomedical domain, the NER task is referred to as the Bio-NER task, and it is one of the most fundamental tasks in the biomedical text mining workflow. A Bio-NER system allows computer-based recognition and extraction of different biomedical entities (e.g., genes, proteins, chemicals, and diseases) from text documents [31, 32]. Several Bio-NER approaches have been proposed in the literature, with a focus on addressing problems related to specific applications, such as large-scale biomedical data analysis tasks. Some of these applications include biomedical network construction [32], gene prioritization [33], drug repositioning [34], finding literature support for experimental findings [35, 36, 37], hypothesis generation [38, 39, 40], and database curation [40]. Another set of Bio-NER approaches, regarded as very efficient, was aimed at identifying new gene names from text [40], with the Bio-NER processing step included as a primitive step for many downstream information extraction applications, such as relation extraction [41] and knowledge base completion [42, 43, 44, 45, 46, 47]. As the focus of the Bio-NER task was to identify named entities corresponding to chemical compounds, genes, proteins, viruses, disorders, DNAs and RNAs, with documents describing these details available in the many scientific papers, books and other publications produced annually, Bio-NER researchers had access to abundant resources in terms of large databases for investigating Bio-NER systems [4]. MEDLINE is one such large-scale resource in the biomedical domain, in which millions of articles are stored in a regularly updated database [5].

However, much of the Bio-NER research effort focussed on extracting biomedical named entities using handcrafted rule-based approaches [6], and manual building of the rules is a very time-consuming and expensive task. Due to this, Bio-NER researchers quickly adopted machine-learning techniques for identifying biomedical entities automatically, with most of the research focussed on supervised machine learning techniques, an easy option due to the large annotated databases available for this field [7-9]. Data size was not a problem, because, being a highly resourced environment, several large databases were available; nor was the availability of annotated corpora, with several annotated and labelled corpora available to Bio-NER researchers, including the SCAI [11] and GENIA [11] corpora. The problem was the poor performance generally achieved by Bio-NER systems. This could be due to the inherent complexity of Bio-NER terminology and jargon, leading to difficulties in extracting discriminative NLP features and building NER models based on supervised machine learning approaches. Biomedical entities tend to be more complicated than traditional named entities [13, 14]. Campos et al. [5] identified several reasons behind the complexity of the Bio-NER task. Firstly, biomedical entities tend to be very descriptive, such as “normal thymic epithelial cells”. Secondly, biomedical entities appear differently in texts even though they mean the same thing; for example, “N-acetylcysteine” can appear as “N-acetylcysteine” or “NAcetylCysteine”. Thirdly, biomedical abbreviations may sometimes refer to completely different entities; for example, the abbreviation ‘TCF’ may refer to “T-Cell Factor” or “Tissue Culture Fluid”. Finally, biomedical entities contain complex morphology, such as combinations of numbers and punctuation (e.g., 94-KDA). Hence, the Bio-NER task is an inherently complex and challenging task, and the NLP features extracted should have the ability to generalize and discriminate between occurrences of ambiguous biomedical entities as described above. This is because the NLP features play a significant role in building the models in supervised machine learning approaches, and in turn determine the effectiveness of the Bio-NER classification process. As good features need to be extracted for supervised machine learning methods to be successful and lead to good Bio-NER performance, most of the prior research focussed on developing good and efficient NLP features, and not much on different learning strategies (the supervised learning strategy dominated most studies).

Different types of NLP features proposed for the Bio-NER task in previous research works include morphological features (which can contain Boolean, numeric and nominal features), lexical features (containing nominal features only), dictionary-based features (containing Boolean features only), and distance-based features (containing numeric features only). As mentioned previously, compared to traditional named entities, biomedical named entities such as proteins, genes, DNA, and chemical compounds tend to be more complicated in terms of morphological structure. The reason behind the morphological complexity of these entities is the unusual character combinations used to describe them, such as Greek letters, digits, and special characters. For example, some biomedical entities consist of multi-word combinations separated by punctuation, such as ‘Hydro-Oxide’, resulting in poor performance of Bio-NER systems even with a large set of NLP features and supervised learning approaches. Therefore, there is a need to investigate alternative approaches for improving the performance of Bio-NER systems, either by finding better feature representations or improved learning strategies.
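To make the role of such features concrete, a minimal sketch of morphological feature extraction for a single token is given below. The feature names and the word-shape encoding are illustrative choices, not the feature sets used in the cited works:

```python
def morphological_features(token):
    """Extract simple Boolean/nominal morphological features for one token.
    Illustrative only: real Bio-NER systems use much richer feature sets."""
    return {
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        "all_caps": token.isupper(),
        "init_cap": token[:1].isupper(),
        "has_greek": any(g in token.lower() for g in ("alpha", "beta", "gamma", "kappa")),
        # Word shape: uppercase -> X, lowercase -> x, digit -> d, else kept as-is
        "word_shape": "".join("X" if c.isupper() else "x" if c.islower()
                              else "d" if c.isdigit() else c for c in token),
    }

# A morphologically complex biomedical token such as "94-kDa" mixes digits,
# punctuation and mixed case, which defeats simple lexical matching.
features = morphological_features("94-kDa")  # word_shape becomes "dd-xXx"
```

Features of this kind are Boolean or nominal, matching the feature typology described above, and are typically fed to a MaxEnt or CRF learner alongside lexical and dictionary features.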

2.3 Clinical-NER literature review

The prior work done on clinical-NER approaches can be broadly categorized into two main categories: rule-based approaches and machine learning approaches. Rule-based clinical-NER approaches mainly consist of a set of rules and an interpreter to apply the rules. The rules can be patterns of properties that need to be fulfilled by different positions in the EHR text documents, and they can be defined as regular expressions (regex), with a sequence of characters defining a search pattern. From the earlier research works reviewed on clinical-NER approaches, it was found that more than 65% of the approaches are rule-based, with the rules represented as regular expressions. Savova et al. [51] proposed a method based on regular expressions to identify peripheral arterial disease (PAD). A positive PAD was extracted if pre-defined patterns were matched (e.g., for the word sequence “severe atherosclerosis”, the word “severe” was matched with a list of modifiers associated with positive PAD evidence, and the word “atherosclerosis” was matched from a dictionary tailored to the specific task of PAD discovery). Some other works reviewed used logic rules for building clinical-NER models.
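A rule of the kind described above can be sketched as a regular expression. The modifier list and disease dictionary below are hypothetical illustrations, not the actual rules of Savova et al. [51]:

```python
import re

# Hypothetical modifier list and disease dictionary for positive-PAD evidence.
MODIFIERS = r"(severe|significant|extensive)"
DISEASES = r"(atherosclerosis|peripheral\s+arterial\s+disease|claudication)"
PAD_PATTERN = re.compile(MODIFIERS + r"\s+" + DISEASES, re.IGNORECASE)

def has_positive_pad_evidence(sentence):
    """Return True if the sentence matches a modifier + disease-term pattern."""
    return PAD_PATTERN.search(sentence) is not None

hit = has_positive_pad_evidence("Imaging showed severe atherosclerosis of the femoral artery.")
miss = has_positive_pad_evidence("Mild plaque was noted.")
```

The appeal of such rules is transparency; the drawback, as discussed below, is the manual effort needed to author and maintain the modifier lists and dictionaries.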

Sohn and Savova [57] proposed a method based on a set of logic rules for improving smoking status classification. In this approach, the smoking status was first extracted from each sentence, and precedence logic rules were then applied to determine a document-level smoking status. Current smoker has the highest precedence, followed by past smoker, smoker, non-smoker, and unknown; if current smoker was extracted from any sentence in a document, the document was labelled as current smoker. Similar logic rules were used for assigning the final patient-level smoking status: if there is a current smoker document, then, since current smoker has higher precedence than past smoker, even if there is also a past smoker document belonging to that patient, the patient is still assigned current smoker status. One major problem with such rule-based clinical-NER approaches developed by prior researchers is that the composition of the rules is done by a human knowledge engineer working from knowledge bases or with the direct involvement of an actual clinical expert. Both are very expensive and time-consuming, as manual feature engineering and collaboration with physicians are required, though it is an efficient approach, with the physician's knowledge and experience embedded in the rules. Several expensive knowledge base tools were constructed and made available by commercial software companies, with computerized database systems for storing complex structured information. These commercial software suites used similar NER approaches to extract named entities from unstructured free-text notes.
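The precedence logic described above can be sketched as follows. The label names come from the description; the code itself is an illustrative reconstruction, not Sohn and Savova's implementation:

```python
# Precedence order, highest first, as described for document-level resolution.
PRECEDENCE = ["current smoker", "past smoker", "smoker", "non-smoker", "unknown"]

def resolve_status(sentence_labels):
    """Collapse per-sentence smoking statuses into a single status by
    picking the label with the highest precedence that occurs."""
    for status in PRECEDENCE:
        if status in sentence_labels:
            return status
    return "unknown"

doc_status = resolve_status(["unknown", "past smoker", "current smoker", "non-smoker"])
# Patient-level resolution works the same way over document-level labels.
patient_status = resolve_status(["past smoker", "current smoker"])
```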
There were also some popular, open-source structured knowledge-based tools, including the UMLS meta-thesaurus for medical concepts, PheWAS for phenome-wide association studies (disease-gene relations) [90], and DrugBank for drug-gene interactions [91]. Other similar tools include MetaMap, used by Martinez et al. [69] for mapping phrases into UMLS medical concepts, and RadLex, used by Hassanpour and Langlotz [53] for identifying semantic classes for terms in radiology reports by using a controlled lexicon of radiology terminology. Elkin et al. [92] used the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT) medical terminology to code signs, symptoms, diseases, and other findings of influenza from encounter notes.

There were also some research efforts on machine learning-based clinical-NER approaches using shallow learning techniques, which appeared to be successful, with acceptable performance and efficiency in many shared tasks [93-96]. The shallow machine learning approach based on the Support Vector Machine (SVM) appears to be the method most frequently employed by researchers for clinical-NER tasks in the earlier work. An integrated approach combining feature-based SVM classification and template-based clinical-NER was proposed by Barrett et al. [97]. Roberts et al. [94] proposed an approach based on SVMs with various features to extract anatomic sites of appendicitis-related findings. An automatic text classification approach for detecting adverse drug reactions using SVMs was proposed by Sarker et al. [98]. A study to classify chronic obstructive pulmonary disease using SVMs was conducted by Himes et al. [99] for asthma patients recorded in electronic medical records. Another machine learning approach used by researchers in some previous studies is logistic regression (LR), particularly for entity and relation extraction and downstream processing.

Chen et al. [100] used the LR approach in a clinical-NER system for detecting geriatric competency exposures assessed from students' clinical notes, and Rochefort et al. [101] employed multivariate LR for detecting adverse events in EHRs. Another line of investigation, which relied on sequence-based algorithms due to the inherently sequential structure of NLP text, is the use of the Conditional Random Field (CRF) algorithm for clinical-NER tasks. Deleger et al. [23] used CRFs for extracting Paediatric Appendicitis Score (PAS) elements from clinical notes, and Li et al. [60] employed CRFs for detecting medication names and attributes from clinical notes to extract medication discrepancies automatically. In addition, some researchers used post-processing approaches on top of traditional machine learning-based algorithms. Yadav et al. [102] extracted NLP features corresponding to complete medical words in the text, and then utilized those features as input for a decision tree to classify emergency department computed tomography imaging reports. A few other research works reported comparative studies with different machine learning approaches. Zhou et al. [86] compared SVM, Generalized Nearest Neighbor (NNge), and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) for the identification of patients with depression in free-text clinical documents; using decision trees (DT) for post-processing, they found that the combination of DT and NNge gave the best F-measure at high confidence, while at intermediate confidence RIPPER outperformed the other approaches.
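Sequence labellers such as CRFs differ from per-token classifiers in that they score entire label sequences and decode with the Viterbi algorithm. A minimal decoding sketch with toy log-scores (hypothetical emission and transition tables, not a trained CRF) is:

```python
def viterbi(tokens, labels, emit, trans):
    """Find the highest-scoring label sequence for a token list.
    emit[(token, label)] and trans[(prev_label, label)] are log-scores;
    missing entries default to a large negative value (effectively disallowed)."""
    NEG = -1e9
    # best[l] = (score, path) over sequences ending in label l
    best = {l: (emit.get((tokens[0], l), NEG), [l]) for l in labels}
    for tok in tokens[1:]:
        new_best = {}
        for l in labels:
            score, path = max(
                (best[p][0] + trans.get((p, l), NEG) + emit.get((tok, l), NEG),
                 best[p][1] + [l])
                for p in labels)
            new_best[l] = (score, path)
        best = new_best
    return max(best.values())[1]

labels = ["O", "B-med", "I-med"]
emit = {("took", "O"): 0.0, ("aspirin", "B-med"): 0.0, ("daily", "O"): 0.0}
# Transitions omit O -> I-med, enforcing BIO-style constraints at decode time.
trans = {("O", "O"): 0.0, ("O", "B-med"): 0.0, ("B-med", "O"): 0.0,
         ("B-med", "I-med"): 0.0, ("I-med", "O"): 0.0}
path = viterbi(["took", "aspirin", "daily"], labels, emit, trans)
```

A real CRF learns the emission and transition scores from data; the decoding step, however, has exactly this structure.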

A comparative review of NER algorithms for biomedical vs. clinical tasks revealed that there is a significantly larger body of work in the Bio-NER field than in the clinical-NER field, and that, in general, clinical-NER research works had a narrow application focus, with model development done in a low-resource environment with small and sparse datasets, and lacked translational effects and generalization ability in the NER models. Comparatively, the earlier work on automated techniques and machine learning for Bio-NER led to better translational outcomes, and the Bio-NER models had better robustness and generalization abilities and could be easily adapted to different application contexts. This considerable gap could be due to the lack of openly available EHR data for NLP experts and machine learning engineers, owing to the privacy and security protocols associated with EHR data (particularly the Health Insurance Portability and Accountability Act (HIPAA) privacy rule and institutional concerns [169]), as compared to Bio-NER, where the data used by NLP experts is freely available in the public domain as publications and reports.


Further, in clinical-NER contexts, there are trust issues associated with relying on completely automated clinical-NER systems, and the clinical community in general prefers rule-based approaches, which allow a medical professional's knowledge base to be embedded within the system. Currently, more than 60% of the clinical-NER approaches surveyed are rule-based. Even the commercial clinical-NER systems developed by large vendors, such as IBM, SAP, and Microsoft, use completely rule-based approaches, since rule-based clinical-NER systems can incorporate domain knowledge from knowledge bases curated by medical experts.

One possible solution to address this low-resource situation, in terms of improving the state of the art for clinical-NER systems, is to use alternative strategies: leveraging the similarities in annotated common data available from another domain, or using a different type of learning strategy, based on semi-supervised learning or an appropriate deep machine learning technique, for example, instead of fully supervised shallow machine learning. The research reported in this thesis uses some of these innovative strategies and proposes a novel robust NER computational framework based on different types of machine learning techniques. The line of investigation and the design of this innovative computational framework, in terms of a step-by-step incremental and iterative development and implementation strategy, is outlined in the next section.

2.4 Robust NER Framework Design

The design and development of the proposed robust NER framework is as outlined below:

1. The biomedical named entity recognition (Bio-NER) task chosen was the identification of biomedical text terms (five classes): RNA, protein, cell type, cell line, and DNA. For a baseline comparison, traditional shallow machine learning models based on supervised learning were examined first, including the Maximum Entropy (MaxEnt) and Conditional Random Field (CRF) models. For experimental evaluation and validation, the benchmark GENIA dataset from the BioNLP/NLPBA 2004 shared task was used, with validation/evaluation experiments conducted on the training and evaluation sets provided by the task organizers.

2. The next step was to investigate the performance improvements possible for the Bio-NER task with an alternative learning strategy based on semi-supervised learning, using the same shallow learning models as in step (1), the MaxEnt and CRF models, with evaluation and validation on the same benchmark GENIA dataset.

3. The third step was to investigate the performance improvements possible for the Bio-NER task with recent deep learning models based on different architectures, including FFN, RNN, and hybrid CNN-RNN/LSTM models built by combining one or more deep and shallow learning architectures, with experimental evaluation on the same benchmark GENIA dataset. This concluded the investigations towards improving the Bio-NER system.

4. The final step involved enhancing the robustness of the Bio-NER models with automatic hyper-parameter optimization, and extending and adapting them for building clinical-NER models. The clinical-NER task represents a low-resourced and more complex context than the well-resourced Bio-NER context, due to the requirement to identify 35 different classes with a dataset that is small, sparse, synthetic, and noisy, representing a nursing handover context in the clinical domain. For experimental evaluation and validation, the benchmark dataset from the CLEF eHealth 2016 Task 1 shared task on handover information extraction was used, with validation/evaluation experiments conducted on the training and evaluation sets provided by the task organizers.

The development and implementation details of each of the steps described above are presented in the next few chapters of the thesis.

2.5 Conclusion

This chapter presented a review of relevant literature and related work, and has set the background for determining the line of investigation for this research. First, the prior work on Bio-NER approaches reported in the research literature was discussed, followed by the related work on clinical-NER approaches. Based on this thorough review of earlier studies, the problems that need to be addressed for improving the state of the art were identified, along with a strategy for the design and development of an innovative cross-domain NER computational framework to address this gap. The chapter concluded with a step-by-step implementation plan for the development of this framework, with brief details on what each implementation step entails. The next chapter describes the details of the first step of the proposed NER framework, which serves as the performance baseline for the rest of the research contributions in this thesis.


DECLARATION OF CO-AUTHORED PUBLICATIONS

For use in theses which include publications. This declaration must be completed for each co-authored publication and placed at the start of the thesis chapter in which the publication appears.

Declaration for Thesis Chapter [3]

Declaration by candidate

In the case of Chapter [4], the nature and extent of my contribution to the work was the following:

Nature of contribution: I managed to improve the performance of the named entity recognition system by using the Maximum Entropy and active learning algorithms based on a special formulation of supervised machine learning, which allowed analyzing the relationship between labelled and unlabelled data using bootstrapping techniques for building a good model. I was able to achieve a considerable improvement for the named entity recognition system compared to the other two previously participating systems in the Biomedical Natural Language Processing 2004 Challenge task. http://www.globalcis.org/jnit/ppl/JNIT384PPL.pdf

Contribution %: 57.35

The following co-authors contributed to the work:

Name: Dr Girija Chetty. Nature of contribution: Advised on approaches to achieve result models with the highest performance. Contributor is also a student at UC (Y/N): N

Name: Dr Thoai Man Luu. Nature of contribution: Advised on new technologies and assisted in various experiments to obtain a better result. Contributor is also a student at UC (Y/N): N

Candidate's Signature ______________ Date: 19/11/2019

DECLARATION BY CO-AUTHORS

The undersigned hereby certify that:

(1) the above declaration correctly reflects the nature and extent of the candidate's contribution to this work, and the nature of the contribution of each of the co-authors;

(2) they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation of at least that part of the publication in their field of expertise;

(3) they take public responsibility for their part of the publication, except for the responsible author who accepts overall responsibility for the publication;

(4) there are no other authors of the publication according to these criteria;

(5) potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and

(6) the original data are stored at the following location(s) and will be held for at least five years from the date indicated below:

[Please note that the location(s) must be institutional in nature, and should be indicated here as a department, centre, or institute, with specific campus identification where relevant.]

Location(s): University of Canberra

Signatures Date

20/12/2019

20/12/2019


CHAPTER THREE

Bio-NER Model Development Based on Maximum Entropy Learner and Supervised Machine Learning

3.1 Introduction

This chapter presents the details of the Bio-NER model development based on the MaxEnt machine learning algorithm and a bootstrapped incremental supervised learning technique. This model serves as a baseline reference system and allows assessment of the other, more advanced machine learning models proposed in this work. For model building and evaluation, the GENIA BioNLP/NLPBA 2004 shared task/challenge dataset, a benchmark biomedical named entity recognition dataset, was used. The performance of the Bio-NER model was assessed in terms of precision, recall, and F-measure. The details of the model development and evaluation are described in the next few sections of this chapter.

3.2 Bio-NER model based on MaxEnt Algorithm

The baseline Bio-NER system development involved the use of the MaxEnt algorithm and the benchmark GENIA dataset, which is described in detail in Appendix 1. The model building steps combined the maximum entropy (MaxEnt) principle with an active/incremental supervised learning approach. In this learning approach, the algorithm iteratively collects new data samples by obtaining unlabelled data during the training iterations, instead of using a large block of training data at the beginning.

The Maximum Entropy approach provides a good baseline framework for integrating information from many heterogeneous sources for classification and, combined with good NLP features, can lead to the construction of good statistical models for classification tasks. Several example applications using this framework can be found in the OpenNLP Tools Library [73]. The MaxEnt learning algorithm has been used in several systems to support common NLP-NER tasks, such as finding the names of people, organizations, or locations in news text; automatic classification of Twitter search results into categories; and suggestion of correct spellings of queries.

For building the baseline Bio-NER system with the Maximum Entropy (MaxEnt) learner, we focussed on ten types of clinical entities from the Unified Medical Language System (UMLS) Semantic Groups: Anatomy, Chemicals and Drugs, Devices, Disorders, Geographic Areas, Living Beings, Objects, Phenomena, Physiology, and Procedures [74]. Table 1 shows the subset of clinical entities, defined as per UMLS Semantic Group and Semantic Type members, used for building the proposed MaxEnt-based Bio-NER model. The entities were defined based on the UMLS meta-thesaurus.

The UMLS meta-thesaurus unifies concepts from several dozen terminologies in the biomedical domain, including linked terms and relations. The Semantic Network comprises Semantic Types and Semantic Relations, which are organized hierarchically. The 134 Semantic Types can be clustered into 15 Semantic Groups [75]. Each concept in the UMLS is assigned a Concept Unique Identifier (CUI), a set of terms, and one or more Semantic Types. Semantic Groups were designed so that each concept could be assigned to only one semantic category.

The GENIA BioNLP/NLPBA 2004 benchmark dataset provided by the challenge organizers contained annotations at two granularity levels: Semantic Groups and Concepts. The annotators used freely available tools for accessing the UMLS in English and in French in order to assess whether a mention in the text referred to a specific Semantic Group and to identify the specific concept.

The fundamental principle behind the MaxEnt learner is the Maximum Entropy Principle, a commonly used technique that provides the probability of a token belonging to a class. MaxEnt computes the probability p(o|h) for any outcome o from the space of all possible outcomes O, and for every history h from the space of all possible histories H. In NER, the history can be viewed as all information derivable from the training corpus relative to the current token. The computation of the probability p(o|h) of an outcome for a token in MaxEnt depends on a set of features that are helpful in making predictions about the outcome. Given a set of features and a training corpus, the MaxEnt estimation produces a model in which every feature f_i has a weight α_i. The conditional probability can then be computed as [71, 77, 78, 159]:

p(o|h) = (1/Z(h)) · Π_i α_i^f_i(h,o)

where Z(h) is the normalization constant, which forces the sum of p(o|h) over the different output class labels o to be one for every input h. In other words, the conditional probability of an outcome is the product of the weights of all active features, normalized over the corresponding products for all outcomes.
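The conditional probability just described can be sketched directly in code. The features and weights below are illustrative toy values, not a trained model:

```python
def maxent_prob(outcome, history, outcomes, features, alphas):
    """p(o|h): product of weights of active features, normalized by Z(h).
    features: list of binary functions f_i(h, o); alphas: their weights."""
    def unnorm(o):
        w = 1.0
        for f, a in zip(features, alphas):
            if f(history, o):
                w *= a
        return w
    z = sum(unnorm(o) for o in outcomes)  # Z(h): normalization constant
    return unnorm(outcome) / z

# Illustrative: one feature fires for capitalized tokens labelled B-protein,
# another for lowercase tokens labelled O.
features = [lambda h, o: h[:1].isupper() and o == "B-protein",
            lambda h, o: h.islower() and o == "O"]
alphas = [4.0, 4.0]
outcomes = ["B-protein", "O"]
p = maxent_prob("B-protein", "IL2", outcomes, features, alphas)  # 4 / (4 + 1) = 0.8
```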

The GENIA BioNLP/NLPBA 2004 shared task corpus used for model building and evaluation of the Bio-NER task contained annotations in BIO format, where ‘B-ne’ marks a word that begins a named entity (NE) of type ‘ne’, ‘I-ne’ marks the remaining words of that NE (if the NE contains more than one word), and ‘O’ marks non-name words. Some tag sequences can never occur. For example, ‘I-ne’ should not occur after an ‘O’ tag, and ‘I-ne2’ should not occur after a ‘B-ne1’ or ‘I-ne1’, where ‘ne1’ and ‘ne2’ are two different NE classes.
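These constraints can be checked mechanically; a small sketch of a BIO tag-sequence validator is:

```python
def is_valid_bio(tags):
    """Check BIO constraints: an 'I-ne' tag must directly follow a 'B-ne'
    or 'I-ne' tag of the same entity type 'ne'."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            ne = tag[2:]
            if prev not in ("B-" + ne, "I-" + ne):
                return False  # e.g. 'I-ne' after 'O', or 'I-ne2' after 'B-ne1'
        prev = tag
    return True

ok = is_valid_bio(["O", "B-protein", "I-protein", "O"])
bad_after_o = is_valid_bio(["O", "I-protein"])     # 'I-ne' after 'O'
bad_cross = is_valid_bio(["B-DNA", "I-protein"])   # 'I-ne2' after 'B-ne1'
```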


Table 1: Subset of clinical entities, defined as per UMLS Semantic Group and Semantic Type members, used in the proposed MaxEnt learner.


We used the Generalized Iterative Scaling (GIS) algorithm [77] to build the MaxEnt model, using a training subset of the GENIA corpus. Figure 1 shows the GIS Algorithm.

Figure 1: GIS Algorithm to train the MaxEnt model.

The advantage of this simple GIS-based approach is that it is able to bootstrap from the test data, eliminating low-confidence data, selecting only the high-confidence data, and combining it with the original training data to generate a good Maximum Entropy model. The GIS procedure should terminate [78] after a fixed number of iterations (100 in our case), or when the change in log-likelihood is negligible. Figure 2 shows the active/incremental learning process based on the GIS algorithm: the dashed line depicts the behavior of the posterior probability-based confidence metric, while the asterisks at the bottom denote the unlabelled data selected for their high confidence scores [79].
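A toy version of the GIS update can be sketched as follows. This is a simplified illustration with binary features and no correction feature (the constant C is approximated by the maximum feature count), not the implementation used in this work:

```python
import math

def gis_train(data, features, outcomes, iters=50):
    """Generalized Iterative Scaling for a tiny MaxEnt model.
    data: (history, outcome) pairs; features: binary functions f_i(h, o).
    Sketch only: proper GIS adds a correction feature so that the feature
    counts sum to the same constant C for every (h, o)."""
    C = max(sum(f(h, o) for f in features) for h, _ in data for o in outcomes)
    alphas = [1.0] * len(features)

    def prob(o, h):
        un = {oo: math.prod(a ** f(h, oo) for f, a in zip(features, alphas))
              for oo in outcomes}
        return un[o] / sum(un.values())

    # Empirical feature expectations from the training data
    emp = [sum(f(h, o) for h, o in data) / len(data) for f in features]
    for _ in range(iters):
        # Model feature expectations under the current weights
        mod = [sum(prob(o, h) * f(h, o) for h, _ in data for o in outcomes) / len(data)
               for f in features]
        for i in range(len(features)):
            if emp[i] > 0 and mod[i] > 0:
                alphas[i] *= (emp[i] / mod[i]) ** (1.0 / C)
    return prob

# Toy task: label capitalized tokens 'B', lowercase tokens 'O'.
train = [("Gene", "B"), ("the", "O"), ("Protein", "B"), ("cell", "O")]
feats = [lambda h, o: int(h[:1].isupper() and o == "B"),
         lambda h, o: int(h.islower() and o == "O")]
prob = gis_train(train, feats, ["B", "O"])
```

After training, `prob("B", "Xyz")` is close to 1 on this separable toy data, mirroring how repeated GIS updates pull the model expectations toward the empirical ones.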


Figure 2: Relationship between the unlabelled data and their high confidence scores.

The active/incremental learning algorithm for MaxEnt learner was implemented in a stepwise fashion as shown below:

i) First, the NLP feature set is extracted from the training dataset, and the Maximum Entropy learner uses these features to build a temporary draft model.

ii) The draft model is validated on an unlabelled test data subset, and the test data is classified into low-confidence and high-confidence groups, categorized as Test Result 1 and Test Result 2 respectively.

iii) The high-confidence unlabelled data [76, 80] (Test Result 2) is selected, combined with the original training data, and the Bio-NER model is rebuilt with the MaxEnt learning algorithm, yielding a refined version of the model.

iv) The refined model is evaluated on the complete test data set, and the final test results are obtained. The final test results provide an estimate of the performance of the model, assessed in terms of the evaluation measures suggested by the GENIA organizers: the quantitative measures precision, recall, and F-measure.

v) The special characteristic of this bootstrapping approach is that the model is refined iteratively, gaining more knowledge and producing better results with more iterations.
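The steps above can be sketched as a generic bootstrap loop. This is a minimal illustration, not the thesis implementation: `train_fn` and `predict_fn` stand in for the MaxEnt/GIS trainer and the posterior-probability confidence estimate, and the 0.9 threshold is an assumed value.

```python
def bootstrap(train_data, unlabelled, train_fn, predict_fn,
              iterations=3, threshold=0.9):
    """Active/incremental learning loop (steps i-v above).

    train_fn(pairs)       -> model, trained on (example, label) pairs
    predict_fn(model, x)  -> (label, confidence in [0, 1])
    """
    model = train_fn(train_data)                      # i) temporary draft model
    for _ in range(iterations):
        # ii)-iii) keep only the high-confidence classifications (Test Result 2)
        high_conf = [(x, predict_fn(model, x)[0])
                     for x in unlabelled
                     if predict_fn(model, x)[1] >= threshold]
        # rebuild the model on the original training data plus selected data
        model = train_fn(train_data + high_conf)      # v) iterative refinement
    return model
```

Low-confidence classifications (Test Result 1) are simply discarded; only the original labels plus the confidently self-labelled data reach the next training round.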

Figure 3: Maximum Entropy Learner with Active/Incremental Learning.

3.3 Experimental Work

The Bio-NER model based on a simple MaxEnt algorithm can serve as a good baseline for comparison with the other advanced algorithms proposed in this thesis. Figure 3 shows the overall architecture of the process for obtaining the Maximum Entropy learner based on the bootstrapped incremental learning algorithm. Table 2 shows the evaluation results of the previously participating systems' performance for each subset of data.

Table 2: Shared Task Report for the previously participating systems in the challenge.

The performance of each system, shown in Table 2, was assessed in terms of three related performance measures: precision, recall, and F-measure. The F-measure or F-score is the weighted harmonic mean of precision and recall [155]. Precision is the percentage of annotations that are correct, and recall is the percentage of the total named entities (NEs) that are successfully annotated. The general expression for the F-score is F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall); the value of β is taken as 1, giving F1 = 2 · Precision · Recall / (Precision + Recall).
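As a concrete check of the F-measure definition, the weighted harmonic mean can be computed directly from precision and recall:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta = 1 gives
    the balanced F1 score used in the GENIA shared-task evaluation."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, a system with precision 0.70 and recall 0.66 scores F1 of roughly 0.679, always between the two values and closer to the lower one.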

The BioNLP/NLPBA GENIA corpus used in this shared task is one of the benchmark databases, and is very popular for most Bio-NER tasks. As recently as 2016, most Bio-NER approaches used this corpus as a baseline for evaluating automatic named entity recognition based on novel machine learners [8].

The MaxEnt model building in this work uses an incremental learning strategy facilitated by bootstrapping techniques, which allows iterative model building and refinement using small subsets of data, instead of a single large block of training data that can suffer from performance and convergence issues. The subsets provided for training and testing were segregated into 10-year window blocks. The four subsets available and used in this work are the 1978-1989 set (which represents an old age from the viewpoint of models trained using the training set), the 1990-1999 set (which represents the same age as the training set), the 2000-2001 set (which represents a new age compared to the training set), and the S/1998-2001 set (which represents roughly a new age in a super domain). The last subset represents a super domain, and its abstracts were retrieved with the MeSH terms `blood cells' and `transcription factors' (without `human'). The S/1998-2001 set includes the whole 2000-2001 set. The performance reported by the previously participating systems in the shared task/challenge, in terms of percentage precision/recall/F-measure for each named entity class and the average performance over all classes for each subset, is shown in Figures 4 - 8 for reference.

Figure 4: Results of MaxEnt based BioNER approach for 1978-1989 subset.

Figure 5: Results of MaxEnt based BioNER approach for 1990-1999 set.


Figure 6: Results of MaxEnt based BioNER approach for 2000-2001 set.

Figure 7: Results of MaxEnt based Bio-NER approach for S1998-2001 set.

Figure 8: Expected results for the proposed NER approach (Maximum entropy learner with active/incremental learning).

The performance expectation of the proposed MaxEnt-based Bio-NER approach for each subset of data was to achieve at least the best average performance figures, over all named entity classes, of the previously participating systems in the challenge (shown circled in red in Figures 4 - 8). The actual performance achieved by the MaxEnt-based Bio-NER model is shown in Table 3 and Figure 9. As can be seen, the proposed Bio-NER model based on the MaxEnt approach (Pha16, in red in Table 3 and as the blue bars in Figure 9) results in an acceptable performance in terms of precision, recall, and F-measure, as compared to the other participating systems in the shared task/challenge.

Table 3: Performance of the proposed NER approach with respect to participating systems on each test subset.

Figure 9 is a graphical representation of the results, showing the improved performance of the proposed MaxEnt-based Bio-NER approach (Pha16) as compared to two other closely competing systems, the Lee04 NER and the Baseline NER system provided by the challenge organizers.

Figure 9: Comparative performance of proposed MaxEnt Bio-NER with Lee04 and Baseline systems.


3.4 Conclusion

This chapter presented the details of the Bio-NER model development based on the MaxEnt machine learning algorithm and a bootstrapped incremental supervised learning technique. This model serves as a baseline reference system and allows assessment of the other advanced machine learning models proposed in this work. For model building and evaluation, the GENIA BioNLP/NLPBA 2004 shared task/challenge dataset, a benchmark biomedical named entity recognition dataset, was used. The performance of the Bio-NER model was assessed in terms of precision, recall, and F-measure. The MaxEnt incremental/bootstrapped supervised learning approach for the Bio-NER task resulted in acceptable performance, better than the two closely competing systems in the challenge. This modeling approach serves as the baseline for comparing the other innovative approaches proposed in the thesis. The next chapter presents details of an enhanced Bio-NER approach based on Conditional Random Fields, semi-supervised learning, and NLP features.


CHAPTER FOUR

Bio-NER Model Development based on Conditional Random Fields with Semi-Supervised Machine Learning

4.1 Overview

In this chapter, we propose a novel approach for the Bio-NER task based on the Conditional Random Field (CRF) algorithm. The approach is based on a special formulation of semi-supervised Machine Learning (ML), which builds the Bio-NER model by analyzing the relationship between labeled and unlabelled data using the Generalized Expectation (GE) criteria method, and has the potential to improve NER system performance significantly. The model development, involving model building and evaluation, was done using the benchmark GENIA data set from the BioNLP/NLPBA 2004 shared task/challenge. The performance of the system was assessed in terms of several quantitative measures, including precision, recall, and final score (F-score). The details of this approach and the outcomes achieved are described in the rest of this chapter.

Semi-supervised learning, where a small amount of human annotation is combined with a large amount of unlabelled data to yield an accurate classifier, has received a significant amount of attention from the research community. GE (Generalised Expectation) criteria are a simple, robust, and scalable method for semi-supervised training using weakly-labeled data.


4.2 Proposed Conditional Random Field Scheme

Conditional Random Field (CRF) models are undirected graphical models based on a probabilistic scheme for labelling and segmenting structured data [10], such as sequences, trees, and lattices. A key advantage of CRFs is their great flexibility to include a wide variety of arbitrary, non-independent features of the input; for example, more unlabelled data can be combined with less labeled data, and a variety of features or deep domain knowledge can be added, improving the accuracy of the results.

The primary advantage of CRFs over other popular probabilistic models, such as hidden Markov models (HMMs), is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs [81-83] to ensure tractable inference. Moreover, the NER model built with the CRF algorithm, based on a semi-supervised learning algorithm, can be transformed into a large-scale machine learning algorithm suitable for complex health data. The results of the CRF model based on semi-supervised learning are described next.

For building the NER model based on the CRF learner, the BioNLP/NLPBA shared task corpus [7] was used for evaluating the CRF algorithm, using the BIO format, in which, to agree with the convention of the NER community:

B: denotes the beginning token of an NE,

I: denotes each following token inside the NE,

O: denotes a token that is not part of any NE.
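As an illustration (the sentence and its tags below are invented, not drawn from the corpus), a BIO-tagged sequence and a small helper that recovers entity spans from it might look like:

```python
# Hypothetical sentence with one DNA entity in the GENIA class scheme.
tokens = ["The", "IL-2", "gene", "is", "expressed", "."]
tags   = ["O", "B-DNA", "I-DNA", "O", "O", "O"]

def extract_entities(tokens, tags):
    """Recover (entity_text, entity_type) spans from BIO tags."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):              # a new entity starts here
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # continuation of the entity
            current.append(tok)
        else:                                  # O tag closes any open entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities
```

Here `extract_entities(tokens, tags)` recovers the single span ("IL-2 gene", "DNA"), which is the unit on which precision and recall are scored in the shared task.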


Table 18 and Table 19 (in Appendix 1) show the train and test data in the BIO format from the first dataset of the BioNLP challenge task.

Table 4 shows the rich text features used to build the models; these improve the performance of the CRF model, in terms of average word detections and F-scores, over previous models that used fewer features or no features at all. Table 5 shows the event list constructed while building the NER model, and Tables 6 - 9 show the outcomes obtained from these rich features.

Table 4: Rich Text Features used to create the CRF Model.

Table 5: Construction of an event List with Features (an Outcome and a Context).

We used a simple semi-supervised learning algorithm for named entity recognition (NER) based on conditional random fields (CRFs), which uses a wide variety of features [76]. This is achieved by using Generalized Expectation (GE) criteria to express a preference for parameter settings in which the model's distribution on data matches a target distribution. The use of GE criteria allows a dramatic reduction in annotation time [84] when limited human effort is available, and allows adapting the CRF-based base learner with semi-supervised learning to large-scale machine learning tasks that work well for complex health data.

The GE criteria were used to train our CRF model. Figure 10 shows how the semi-supervised learning algorithm with GE criteria allows learning of the NER model using CRFs.

Assume that we have a small amount of labeled data L and a classifier Ck that is trained on L. We exploit a large unlabelled corpus U from the test domain, from which we automatically and gradually add new training data D to L, such that L has two properties:

1) Accurately labeled, meaning that the labels assigned by automatic annotation of the selected unlabelled data are correct; and

2) Non-redundant, meaning that the new data is from regions in the feature space that the original training set does not adequately cover.

Thus, the classifier Ck is expected to improve monotonically as the training data gets updated.

Figure 10: Semi-supervised learning algorithm using GE criteria for building CRF model.


At each iteration, the classifier trained on the previous training data (using the features introduced in the previous section) is used to tag the unlabelled data. In addition, for each O token and NE segment, a confidence score is computed using the constrained forward-backward algorithm [85], which calculates LcX, the sum of the probabilities of all the paths passing through the constrained segment (constrained to be the assigned labels).

One way to increase the training data is to add all the tokens classified with high confidence to the training set. This scheme is unlikely to improve the accuracy of the classifier at the next iteration because the newly added data is unlikely to include new patterns. Instead, we use the high confidence data to tag other data by exploiting independent features.

Since the features of each token include features copied from its neighbours, in addition to those extracted from the token itself, its neighbours also need to be added to the training set. If the confidences of the neighbours are low, the neighbours are removed from the training data after copying their features to the token of interest. If the confidence scores of the neighbours are high, we extend further to the neighbours of the neighbours, until low-confidence tokens are reached. We then remove low-confidence neighbours to reduce the chance of adding training examples with false labels. Figure 11 shows the overall architecture of the NER system for building a decent CRF model.
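A simplified sketch of this neighbour rule, assuming each token is a feature dictionary with one confidence score; the thesis uses the constrained forward-backward scores, and the 0.9 threshold and the `nbr_` feature naming here are illustrative only (this sketch also stops at immediate neighbours rather than recursing):

```python
def expand_high_confidence(tokens, confidences, threshold=0.9):
    """Select high-confidence tokens for the training set, copying each
    immediate neighbour's features onto them. Low-confidence neighbours
    contribute their features but are not themselves added; high-confidence
    neighbours are selected in their own right by the same rule."""
    selected = []
    for i, feats in enumerate(tokens):
        if confidences[i] < threshold:
            continue                                   # skip low-confidence tokens
        merged = dict(feats)
        for j in (i - 1, i + 1):                       # immediate neighbours
            if 0 <= j < len(tokens):
                for name, value in tokens[j].items():  # copy neighbour features
                    merged.setdefault("nbr_" + name, value)
        selected.append(merged)
    return selected
```

The effect is that only confidently labelled tokens enter the augmented training set, while their context still informs the features they carry.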


Figure 11: The CRF learner based on semi-supervised learning with Generalized Expectation criteria and rich text features.

As can be seen in Table 11, the proposed strategy for adapting the CRF model and its evaluation with BioNLP/NLPBA 2004 shared task corpus, resulted in promising outcomes, and the results from five prominent experiments are described next.


4.3 Experimental Work

In the first experiment, only the training data (40% labelled, 60% unlabelled) was submitted to the CRF learner, without adding any text features, to obtain the first model, CRF_Model_1, with an F-score of 7.20%. Table 6 shows the performance of CRF_Model_1 in terms of recall, precision, and F-score.

Table 6: Performance of CRF_Model_1.

In the second experiment, we again used the training data (20% labelled, 80% unlabelled) and added the rich text features F1-F6 shown in Table 4. We observed that CRF_Model_2 improved, with an F-score of 32.51%. Table 7 shows the performance of CRF_Model_2 in terms of recall, precision, and F-score.

Table 7: Performance of CRF_Model_2.

In the third experiment, we increased to 40% labelled data and 60% unlabelled, with the rich text features F1-F6 shown in Table 4. We observed that CRF_Model_3 resulted in better performance, with an F-score of 65.32%. Table 8 shows the performance of CRF_Model_3.

Table 8: Performance of CRF_Model_3.

CHAPTER FOUR 59

In the fourth experiment, we increased to 100% labelled training data and 60% unlabelled test data, with the rich text features F1-F6 shown in Table 4. We observed that CRF_Model_4 again resulted in better performance, with an F-score of 68.72%. Table 9 shows the performance of CRF_Model_4.

Table 9: Performance of CRF_Model_4.

In the fifth and final experiment, we used the maximum available data from the subsets of the corpus: 100% labelled from the training data and 100% unlabelled from the test data, with the rich text features F1-F6 shown in Table 4. We observed that CRF_Model_5 resulted in superior performance, with an F-score of 68.74%. Table 10 shows the performance of CRF_Model_5.

Table 10: Performance of CRF_Model_5.

The details of the four test subsets are described under Evaluation Data in [7]. This research evaluated these test subsets and outperformed most of the previous systems in terms of recall, precision, and F-score. The results show that the proposed CRF model approach achieved an F-score of 68.74%, a substantially better performance than the other participating systems in the challenge [7]. Table 11 and Figure 12 show the entity recognition performance of each participating system on each test subset.


Table 11: Comparative performance of CRF_Model_5 with respect to participating systems on each test subset.

Figure 12: Comparative performance of the Pha18CRF NER with previous systems.


4.4 Conclusion

The innovation in the work proposed in this chapter involves the development of a large-scale machine learning technique based on a semi-supervised CRF learner, resulting in improved performance. The experimental work involved using a base learner, the CRF model, adapting it with a novel training algorithm (the semi-supervised learning algorithm), and evaluating it on the benchmark BioNLP/NLPBA shared task corpus. It was possible to achieve an F-score of 68.74% for the CRF model using semi-supervised learning, outperforming the other six participating systems in the BioNLP challenge task. In the next chapter, the combination of several rich text features investigated to further enhance the performance of the semi-supervised CRF learner is discussed.


DECLARATION OF CO-AUTHORED PUBLICATIONS

For use in theses which include publications. This declaration must be completed for each co-authored publication and to be placed at the start of the thesis chapter in which the publication appears.

Declaration for Thesis Chapter [5, 6]

Declaration by candidate: in the case of Chapters [6, 7], the nature and extent of my contribution to the work was the following:

Nature of contribution: I improved the performance of the multilevel named entity recognition scheme: CRF (Conditional Random Field) based on supervised machine learning, experimenting with abundant feature sets; and hybrid Deep Learning based on a variety of inputs, using word-embedding techniques to convert text to numeric form for processing. I was able to achieve a remarkable improvement in performance for both systems, with results (70.50 for CRF and 70.32 for Deep Learning) higher than the previously participating systems in the BioNLP/NLPBA 2004 challenge task.

https://ieeexplore.ieee.org/abstract/document/8614015

The following co-authors contributed to the work:

Dr Girija Chetty — advised on approaches to achieve result models with the highest performance. Contributor is also a student at UC: N.

Dr Thoai Man Luu — assisted in the experiments and reviewed the conference paper. Contributor is also a student at UC: N.

Candidate's Signature ______________ Date: 20/12/2019

DECLARATION BY CO-AUTHORS

The undersigned hereby certify that: (1) the above declaration correctly reflects the nature and extent of the candidate’s contribution to this work, and the nature of the contribution of each of the co-authors.


(2) they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation of at least that part of the publication in their field of expertise;

(3) they take public responsibility for their part of the publication, except for the responsible author who accepts overall responsibility for the publication;

(4) there are no other authors of the publication according to these criteria;

(5) potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and

(6) the original data are stored at the following location(s) and will be held for at least five years from the date indicated below:

[Please note that the location(s) must be institutional in nature, and should be indicated here as a department, centre or institute, with specific campus identification where relevant.]

Location(s): University of Canberra

Signatures Date

20/12/2019

20/12/2019


CHAPTER FIVE

Bio-NER Model based on Conditional Random Field Learner with Rich Feature Sets

5.1 Overview

In this chapter, we propose a novel approach for the Bio-NER task based on the Conditional Random Field (CRF) algorithm with supervised learning and a combination of several rich text feature sets. These feature sets include three distinctive features: word2vec, Term Frequency-Inverse Document Frequency (TF-IDF), and Regular Expression (RegExp) features, resulting in improved performance in terms of various measures, including precision, recall, and F-score. It was possible to achieve an F-score of 70.50% for the CRF learner based on supervised learning with large and rich feature sets, outperforming most of the other participating systems in the BioNLP challenge task.

We propose an approach based on combining a large feature set, including Word2vec features, TF-IDF features, and Regular Expression features, with a supervised CRF learner, leading to an F-score of around 70 percent, as shown at the entry Pha18CRF in Table 15. The rest of this chapter is organized as follows. Section 5.2 discusses the background and related work. Section 5.3 describes the CRF learner using the rich text feature set. Section 5.4 presents a brief description of the challenge dataset, details of data pre-processing, and the model building strategy using the CRF learner algorithm. The experimental evaluation is discussed in section 5.5, and the chapter concludes in section 5.6 with the outcomes obtained from this work and plans for further research.

5.2 Background and Related Work

Many techniques have been proposed to extract word representation (WR) text features, such as LSA (latent semantic analysis) [93], random indexing (RI) [94], Brown clustering (BC) [95], and neural language models (NLM) [96]. As per a review by L. Ratinov et al. [97], WR text features can be divided into three groups: firstly, word embeddings, such as NLM; secondly, clustering-based methods, such as BC; and thirdly, distributional representations, such as LSA and RI.

Recently, WR approaches have been widely used to improve various machine learning-based NLP tasks, such as entity recognition in biomedical text [98-99], and part-of-speech (POS) tagging, chunking, and NER in the newswire domain [97]. Word embeddings have also been applied to the biomedical domain and have shown improvement on entity recognition in biomedical literature [100]. Having said that, the contribution of diverse types of WR features to Bio-NER has not yet been extensively investigated. Therefore, this chapter investigates a variety of features and systematically evaluates three diverse types of feature sets: TF-IDF features created from the Weighing-each-Feature algorithm and model; Find Similarity/Nearest features created from the Word2vec algorithm and model; and other rich features created from Regular Expressions.

After pre-processing, we combined these three distinct types of feature sets and built the Bio-NER model using the supervised CRF learner. Our results showed that the supervised CRF learning algorithm, combined with a large set of complex and rich text features, can improve Bio-NER task performance. This could be due to complementarity and latent capabilities amongst these different feature sets, which support each other. By combining these features, including the TF-IDF feature set, the F-score obtained was above 70 percent, outperforming other participating systems in the Bio-NER challenge task [7].

5.3 CRF Model Development

5.3.1 CRF Learning

Conditional Random Fields are discriminative classifiers, and their underlying principle is that they apply Logistic Regression to sequential inputs. Machine Learning models have two common categorizations: Generative and Discriminative. Conditional Random Fields are a type of Discriminative classifier, and as such, they model the decision boundary between the different classes. Generative models, on the other hand, model how the data was generated, which, after having been learned, can be used to make classifications. As a simple example, Naive Bayes, a very simple and popular probabilistic classifier, is a Generative algorithm, and Logistic Regression, which is a classifier based on Maximum Likelihood estimation, is a Discriminative model.

In CRFs, our input data is sequential, and we have to take the previous context into account when making predictions on a data point. To model this behavior, we use feature functions that take multiple input values. Each feature function is based on the label of the previous word and the current word, and returns either a 0 or a 1.

The feature functions are the key components of a CRF. Each feature function takes as input:

- a sentence S,

- the position i of a word in the sentence we are predicting,

- the label li of the current word, and

- the label li-1 of the previous word in S.

It is defined as fi(S, i, li, li-1). The purpose of the feature function is to express a characteristic of the sequence that the word represents. To build the conditional random field, we next assign each feature function a set of weights (lambda values), which the algorithm is going to learn, as described in the figure "Probability Distribution for Conditional Random Fields" in the section "Mathematical overview of Conditional Random Fields" of [101].
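For instance, a single binary feature function of the form f(S, i, li, li-1) could fire when a capitalised word is labelled as the start of a protein name; this particular feature is invented for illustration and is not taken from the thesis feature set:

```python
def f_cap_protein(S, i, l_i, l_prev):
    """Binary feature function f(S, i, l_i, l_{i-1}): returns 1 when the
    word at position i is capitalised, labelled B-protein, and follows
    an O-labelled word; otherwise 0."""
    return 1 if (S[i][0].isupper() and l_i == "B-protein" and l_prev == "O") else 0
```

During training, the CRF learns a weight lambda_j for each such function; the score of a candidate label sequence is the sum of lambda_j * f_j over all positions and feature functions, normalised into a conditional probability.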

It is vital to understand the applications of CRFs. For example, given their ability to model sequential data, CRFs are often used in Natural Language Processing and have many applications in that area. One such application is Parts-of-Speech (POS) tagging. The parts of speech of a sentence rely on previous words, and by using feature functions that take advantage of this, we can use CRFs to learn how to distinguish which words of a sentence correspond to which POS. Another similar application is Named Entity Recognition, or extracting proper nouns from sentences. Conditional Random Fields can be used to predict any sequence in which multiple variables depend on each other.

In summary, we use Conditional Random Fields by first defining the feature functions needed, initializing the weights to random values, and then applying Gradient Descent iteratively until the parameter values (in this case, the lambdas) converge. We can see that CRFs are similar to Logistic Regression, since they use the conditional probability distribution, but we extend the algorithm by applying feature functions, as our input data is sequential.


5.3.2 Rich Text Feature Set

In this Bio-NER model development work, we included three types of features: TF-IDF features created from the Weighing-each-Feature model; Find Similar/Nearest features created from the Word2vec model; and other rich features created from Regular Expressions. Details of each type of feature are as follows.

Firstly, Term Frequency - Inverse Document Frequency [102] is a weighting scheme that is commonly used in information retrieval tasks. The goal is to model each document in a vector space, ignoring the exact ordering of the words in the document while retaining information about the occurrences of each word. It is important to understand how TF-IDF works [103]. The Term Frequency (TF) is a count of how many times a word occurs in a given document (synonymous with a bag of words). The Inverse Document Frequency (IDF) is based on the number of documents in a corpus in which a word occurs, and down-weights words that occur in many documents. TF-IDF is used to weight words according to how important they are: words that are used frequently in many documents will have a lower weighting, while infrequent ones will have a higher weighting. Section 2.1.2 of [104] illustrates exactly how TF-IDF works. TF-IDF is also used in several NLP techniques, such as text mining, search queries, and summarization. In information retrieval or text mining, TF-IDF is a well-known method to evaluate how important a word is in a document, and it is a very interesting way to convert the textual representation of information into sparse features.



Additionally, TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval searches, text mining, and user modeling. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. TF-IDF is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use TF-IDF. Variations of the TF-IDF weighting scheme are often used by search engines [105] as a central tool in scoring and ranking a document's relevance [106] given a user query. TF-IDF can also be successfully used for stop-word filtering [107] in various subject fields, including text summarization [108] and classification.
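The weighting can be made concrete with a minimal, textbook implementation; real systems add smoothing and normalisation, and this sketch assumes documents are simple token lists:

```python
import math

def tf_idf(term, doc, corpus):
    """tf-idf(t, d) = tf(t, d) * log(N / df(t)), where tf is the relative
    frequency of t in d, df is the number of documents containing t,
    and N is the number of documents in the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0
```

A term occurring in every document gets idf = log(1) = 0, so ubiquitous words are weighted down to zero, while a term unique to one document receives the maximum idf of log(N).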

Secondly, Word2vec is a two-layer neural net that processes text. Its input is a text corpus, and its output is a set of feature vectors for the words in that corpus [7]. While Word2vec is not a deep neural network, it turns text into a numerical feature set that the CRF algorithm can work with more effectively. There are several implementations of word2vec algorithms, including a distributed form of Word2vec for Java and Scala, which works on Spark with GPUs, as documented by the developers in [109].

Word2vec provides a computationally efficient predictive model for learning word embeddings from raw text. The underlying idea is mapping words to vectors of real numbers; for example, W("cat") = (0.2, -0.4, 0.7, ...), where W is the word embedding function and W("cat") is the word embedding of the word "cat". This can help in achieving better performance in natural language processing tasks by grouping similar words: in a word embedding, similar words are likely to have similar vectors. It can also capture various dimensions of meaning and phrase information relevant to the potential features of words within the vector.


Moreover, Word2vec comes with two options: the Skip-Gram model and the Continuous Bag-of-Words model (CBOW), as shown in Figure 13. Algorithmically, these models are similar, except that Skip-Gram predicts the source context words (e.g. 'the cat sits on the') from the target word ('mat'), while CBOW does the inverse and predicts the target word from the source context words. This inversion might seem like an arbitrary choice, but statistically it has the effect that CBOW smooths over a lot of the distributional information by treating an entire context as one observation; this turns out to be useful for smaller datasets. Skip-Gram, however, treats each context-target pair as a new observation, and this tends to do better when we have larger datasets.

We have utilized the Skip-Gram model for this work, as it is useful for predicting neighbouring words in a sentence or document: it predicts the neighbouring words, or context, when a single word is given. Skip-Gram captures the averaged co-occurrence of two words in a training set, and is an efficient method for learning high-quality distributed vector representations that capture many precise syntactic and semantic word relationships. Imagine a sliding window over the text that includes the central word currently in focus, together with the four words that precede it and the four words that follow it.


Figure 13: The Continuous Bag-of-Words (CBOW) and Skip-Gram model architectures in Word2vec.

It is important to note that the Skip-Gram model is constructed with the focus word as the single input vector, and the target context words are at the output layer. The model we train runs each word in the Skip-Gram through W to get a vector representing it, and feeds those into another 'module' called R, which tries to predict whether the Skip-Gram belongs to our dataset classes. CBOW is the opposite of the Skip-Gram model: the context words form the input layer, and each word is encoded in one-hot form, so if the vocabulary size is V, these will be V-dimensional vectors with just one of the elements set to one and the rest all zeros.
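The Skip-Gram training data can be illustrated by enumerating the (target, context) pairs produced by such a sliding window; the description above uses four words on each side, shortened here to keep the example readable:

```python
def skip_gram_pairs(tokens, window=2):
    """Generate (target, context) training pairs as in the Skip-Gram model:
    each word predicts its neighbours within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        # every position within the window, excluding the target itself
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs
```

Each pair becomes one training observation, which is why Skip-Gram sees many more (noisier) examples than CBOW for the same text.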

Thirdly, we also explored other rich features created from Regular Expressions, as shown in Table 4.
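A few surface-pattern features of this kind can be sketched as follows; the actual patterns used in this work are those of Table 4, and the ones below are representative examples only:

```python
import re

# Hypothetical surface-form patterns in the spirit of Table 4.
PATTERNS = {
    "ALLCAPS":   re.compile(r"^[A-Z]{2,}$"),           # e.g. "DNA"
    "HAS_DIGIT": re.compile(r"\d"),                    # e.g. "IL-2"
    "HAS_DASH":  re.compile(r"-"),
    "GREEK":     re.compile(r"alpha|beta|gamma|kappa"),
}

def regexp_features(token):
    """Return the names of all patterns that match a token."""
    return [name for name, pat in PATTERNS.items() if pat.search(token)]
```

Such binary surface features are cheap to compute and complement the TF-IDF and Word2vec features, since they capture orthographic cues (capitalisation, digits, Greek-letter fragments) that are common in biomedical entity names.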

The proposed Conditional Random Field scheme, including the challenge datasets and the approach and methodology used for building the CRF model, is presented in the next section.


5.4 Proposed Conditional Random Field Scheme

In this section, we describe the proposed Conditional Random Fields learning scheme used for biomedical named entity recognition, as well as the details of the experimental setup. The details of the BioNLP/NLPBA 2004 shared task corpus [7] are described first, to convey the complexity of the data, followed by the process and the model building approach and methodology used for recognizing genes and entities.

5.4.1 BioNLP/NLPBA 2004 challenge dataset

The BioNLP/NLPBA 2004 challenge dataset was developed for biomedical entity recognition and information extraction. The dataset details for performing named entity recognition are as follows:

1. Training Set: the training data came from the GENIA version 3.02 corpus. This was formed from a controlled search on MEDLARS using the MeSH terms “human”, “blood cells” and “transcription factors”. From this search, 2000 abstracts were selected and hand-annotated according to a small taxonomy of 48 classes based on a chemical classification. Among these, 36 terminal classes were used to annotate the GENIA corpus. For the shared task, however, the 36 classes were simplified to only five: protein, DNA, RNA, cell line, and cell type. The first three incorporate several subclasses from the original taxonomy, while the last two were retained to make the task realistic for post-processing by a potential template-filling application. The publication years of the training set range from 1990 to 1999.

For building the NER model based on the CRF learner, the BioNLP/NLPBA shared task corpus [7] was used to evaluate the CRF algorithm using the BIO format, where:

B: denotes the beginning token of an NE,

I: denotes each following token that is inside the NE, and

O: denotes a token that is outside of, and not part of, any NE,

in agreement with the convention of the NER community. Table 18 shows the training data based on the BioNLP/NLPBA shared task corpus, as shown in Appendix 1.
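To illustrate the BIO convention (with a hypothetical sentence and tags, not taken from the corpus), the following sketch shows how entity spans are recovered from a BIO-tagged token sequence:

```python
# Hypothetical BIO-tagged sentence: "IL-2 gene" is a DNA entity,
# "T cell" is a cell_type entity; all other tokens are outside (O).
tokens = ["The", "IL-2", "gene", "regulates", "T", "cell", "activation", "."]
tags   = ["O", "B-DNA", "I-DNA", "O", "B-cell_type", "I-cell_type", "O", "O"]

def extract_entities(tokens, tags):
    """Recover (entity_text, entity_type) spans from BIO tags."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity begins
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # entity continues
            current.append(tok)
        else:                               # outside any entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities
```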

2. Test Set: for testing purposes, we used a new annotated collection of MEDLARS online abstracts from the GENIA project. 404 abstracts were used, annotated for the same classes of entities. Most of the test set comprises abstracts retrieved with the same set of MeSH terms, with publication years ranging from 1978 to 2001. To see the effect of publication year, the test set was roughly divided into four subsets: a 1978-1989 set (which represents an old age from the viewpoint of models trained on the training set), a 1990-1999 set (the same age as the training set), a 2000-2001 set (a new age compared to the training set), and an S/1998-2001 set (roughly a new age in a super domain). The last subset represents a super domain, and its abstracts were retrieved with the MeSH terms `blood cells' and `transcription factors' (without `human'). (The S/1998-2001 set includes the whole 2000-2001 set.) Table 19 shows a test dataset based on the BioNLP/NLPBA shared task corpus, as shown in Appendix 1.

As shown in Figure 14, we first extract TF-IDF features and similarity features when pre-processing the raw text data, before extracting the regular expression features; these NLP features attempt to model different syntactic and semantic relationships between classes and entities in the text.


Figure 14: TF-IDF and Similarity feature sets obtained after pre-processing data.

Some of the features shown in Figure 14 include:

 Orthographic features: Orthographic information is used to identify whether a token is capitalized, an acronym, a pure number, punctuation, or has mixed letters and digits, etc.

 Joint features: Joint features are conjunctions or combinations of individual features. For example, if a token is in a biomedical name list and its suffix token is in a number list, the token will have a joint feature called Name + Number (e.g., HIV-1, HIV-2, virus-1, etc.).

 Neighbourhood features: After extracting the above features for each token, its features are copied to its neighbours (the previous two and next two tokens) with a position id. For example, if the previous token has a feature “Cap@0”, the current token will have a feature “Cap@-1”.

 Single/multiple-token feature lists: Each list is a collection of words that share a common semantic meaning, such as entity, tag/label/class, TF-IDF features, Word2vec similarity features, and regular expression features. Table 4 gives an example demonstrating the feature selection process used with the CRF algorithm.
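The orthographic and neighbourhood features above can be sketched as follows (a minimal illustration; feature names and the exact rules are assumptions, not the thesis implementation):

```python
import re

def orthographic(token):
    """Illustrative orthographic feature extractor."""
    feats = set()
    if token[:1].isupper():
        feats.add("Cap")
    if token.isupper() and len(token) > 1:
        feats.add("Acronym")
    if token.isdigit():
        feats.add("Number")
    if re.search(r"[A-Za-z]", token) and re.search(r"\d", token):
        feats.add("MixedAlnum")
    return feats

def with_neighbours(tokens, window=2):
    """Copy each token's features to its neighbours with a position id,
    e.g. the previous token's 'Cap' becomes 'Cap@-1' for the current token."""
    base = [orthographic(t) for t in tokens]
    out = []
    for i in range(len(tokens)):
        feats = set()
        for off in range(-window, window + 1):
            j = i + off
            if 0 <= j < len(tokens):
                feats |= {f"{f}@{off}" for f in base[j]}
        out.append(feats)
    return out

feats = with_neighbours(["HIV-1", "Infects", "cells"])
```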

5.4.2 Model Building Approach and Methodology

The approach and methodology utilized for building the CRF model can be outlined as follows:

We use the Word2vec technique for discovering optimal features for building the CRF model.

As reiterated before, Word2vec is a computationally efficient predictive model for learning word embeddings from raw text. It comes in two variants: the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model, as shown in Figure 13. We chose the Skip-Gram model for the work reported here, as it treats each context-target pair as a new observation, and this tends to do better on larger datasets.

The dataset is relatively sparse, with eleven classes (labels/tags), compared to the quality and quantity of text data available. Therefore, we must do some pre-processing, extracting different NLP features from the raw text. The pre-processing step allows better feature discovery during the feature extraction stages, such as using the Weighing Each Feature model to generate TF-IDF features, the Word2vec model to find similar/nearest features, and regular expressions to create other rich text features, all leading to better CRF models. Hence, during the pre-processing stage, three feature sets are extracted, as shown in Figure 15.

Figure 15: Combination of 3 features sets: TF-IDF, Find Similar, and other rich features.
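The TF-IDF weighting step can be sketched in pure Python as follows (an illustrative implementation with toy per-category documents; the thesis used its own Weighing Each Feature model, and libraries such as scikit-learn offer equivalent functionality):

```python
import math
from collections import Counter

def tfidf(docs):
    """Minimal TF-IDF: term frequency within a category document,
    scaled by log(N / document frequency) across categories."""
    df = Counter()
    for doc in docs.values():
        df.update(set(doc))          # document frequency per term
    n = len(docs)
    weights = {}
    for cat, doc in docs.items():
        tf = Counter(doc)
        weights[cat] = {w: (tf[w] / len(doc)) * math.log(n / df[w])
                        for w in tf}
    return weights

# Toy documents, one per entity category (contents are made up).
docs = {"protein": ["kinase", "binds", "receptor"],
        "DNA": ["promoter", "binds", "sequence"]}
w = tfidf(docs)
```

Note that a term appearing in every category document ("binds" here) receives weight zero, which is exactly the discriminative behaviour TF-IDF is used for.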

After the pre-processing stage, involving the extraction of different NLP text features from the raw text of the BioNLP challenge dataset, we submit the combination of these three feature sets (Figure 15) to the CRF algorithm to generate the final CRF model. Figure 17 shows the workflow for this process.

We believe that this novel approach, resulting in the CRF NER system (Table 15), can perform well at the large scale of machine learning needed for complex health data.


5.5 Experimental Results and Discussion

In this section, experimental results are discussed for the different CRF learning models (i.e., with or without the TF-IDF feature set) built with the BioNLP challenge dataset.

As described in Section 5.4.2, the dataset is relatively sparse, so during the pre-processing stage we extracted three different types of NLP feature sets (TF-IDF features, Word2vec similarity features, and regular expression features), as outlined in Figure 16.

After the pre-processing stage, we submit the combination of these three feature sets (Figure 16) to the CRF algorithm to generate the final CRF model. Figure 17 shows the workflow for this process. The details of the algorithm are summarized as follows:

 The training dataset is submitted to the Weighing Words using TF-IDF algorithm to obtain the weight of each feature for the six categories (DNA, RNA, Cell Line, Cell Type, Protein, O), as shown in Figure 14, producing six separate document vectors, one per category, as shown in Figure 15.

 The training dataset is submitted to the Word2vec model to find similar words, as shown in Figure 15.

 The training dataset is then submitted to the regular expressions to find other rich features, such as Capitalized, Hyphenated, Window-5, first token of a sentence, suffixes, and With Number, as shown in Table 4.

 The combination of these three feature sets (Figure 15) is finally submitted to the CRF algorithm to generate the final CRF model (Figure 17).

 The test data is then submitted to the final CRF model to generate the prediction files, which are evaluated against the reference text file to generate the expected results, as shown at the entry Pha18CRF in Table 15.

Figures 16 and 17 show the overall architecture of the NER system for building the final CRF model.

Figure 16: Use Train dataset to create the Word2Vec Model


Figure 17: CRF Model Building Workflow.

As can be seen in Table 15, the proposed strategy for adapting the CRF model, evaluated on the BioNLP/NLPBA 2004 shared task corpus [7], resulted in promising outcomes; the results of three prominent experiments are described next.

In the first experiment, the test data (with the other rich features, but without Word2vec and TF-IDF features) was submitted to the CRF learner to obtain CRF_Model_1, with an F-score of 65.60%. Table 12 shows the performance of the first model in terms of recall, precision, and F-score.

Table 12: Performance of CRF_Model_1.

In the second experiment, the test data (with the other rich features and Word2vec features) was submitted to the CRF learner to obtain CRF_Model_2, which improved the F-score to 69.39%. Table 13 shows the performance of the second model.

Table 13: Performance of CRF_Model_2.

In the third and final experiment, the test data (with the other rich features, Word2vec, and TF-IDF features) was submitted to the CRF learner to obtain CRF_Model_3, which further improved the F-score to 70.50%. Table 14 shows the performance of the third model.

Table 14: Performance of CRF_Model_3.

The details of the four test subsets are described under Evaluation Data in [7]. This research evaluated these test subsets and outperformed most of the previous systems in terms of recall, precision, and F-score. Results show that the proposed CRF model approach achieved an F-score of 70.50%, outperforming most of the other participating systems in the challenge. Table 15 and Figure 18 show the entity recognition performance of each participating system on each test subset.
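The F-score reported throughout this chapter is the harmonic mean of precision and recall, which can be computed as follows (the example values are illustrative):

```python
def f_score(precision, recall):
    """F1 score: harmonic mean of precision and recall, the metric
    used to compare the systems in Table 15."""
    return 2 * precision * recall / (precision + recall)

# A model with balanced precision and recall of 70.5% has F1 = 70.5%.
f_balanced = f_score(0.705, 0.705)
# Unbalanced precision/recall pulls F1 below their arithmetic mean.
f_unbalanced = f_score(0.8, 0.6)
```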


Table 15: Comparison of the performance of the three CRF models with previous participating systems on each test subset.


Figure 18: Comparison of the performance of the NER Pha18 CRFs with previous participating systems.


5.6 Conclusion

In this chapter, we examined a novel scheme for the Bio-NER task based on conditional random fields and a rich text feature set. The evaluation of three diverse types of feature sets with the BioNLP/NLPBA 2004 shared task corpus showed that not only were individual feature sets (Capitalized, Hyphenated, Window-5, first token of a sentence, suffixes, and With Number) beneficial to Bio-NER task performance, but also that the additional learned features (Word2vec) and statistical features (TF-IDF), when combined, can improve performance further. This confirms that a good learner and a powerful set of features can indeed enhance the performance of a complex text mining task involving the recognition of biomedical entities. However, rich text feature extraction is not an easy feat. Ideally, what is needed is an algorithm that can perform as well as this powerful combination of rich text features and machine learners, without the manual involvement of multi-disciplinary experts for feature engineering. Recent deep learning algorithms provide this opportunity, and are known to perform as well, or even better, with an appropriate choice of deep learning architecture. The next chapter discusses the details of Bio-NER model development using deep machine learning algorithms, which allow latent features to be extracted without feature engineering and handcrafting, removing the dependency on multi-disciplinary experts for creating annotated training datasets.


DECLARATION OF CO-AUTHORED PUBLICATIONS

For use in theses which include publications. This declaration must be completed for each co-authored publication and placed at the start of the thesis chapter in which the publication appears.

Declaration for Thesis Chapter [6]

Declaration by candidate

In the case of Chapter [6], the nature and extent of my contribution to the work was the following:

Nature of contribution: The current biomedical natural language processing datasets are unbalanced and difficult to manage when using traditional machine learning techniques, which may degrade results remarkably. To improve the performance of biomedical named entity recognition, I explored a variety of novel neural network algorithms and experimented with each of them on the Bio-Entity Recognition Task at BioNLP/NLPBA 2004 datasets; the results surpassed most of the previous participating systems in the challenge task. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8628740

Contribution %: 70.32

The following co-authors contributed to the work:

Name | Nature of Contribution | Contributor is also a student at UC (Y/N)
Dr Girija Chetty | Advised on approaches to achieve result models with the highest performance. | N
Dr Thoai Man Luu | Advised on new technologies and assisted in various experiments to obtain better results. | N

Candidate’s Signature: ____________________ Date: 20/12/2019

DECLARATION BY CO-AUTHORS

The undersigned hereby certify that: (1) the above declaration correctly reflects the nature and extent of the candidate’s contribution to this work, and the nature of the contribution of each of the co-authors.


(2) they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation of at least that part of the publication in their field of expertise;
(3) they take public responsibility for their part of the publication, except for the responsible author, who accepts overall responsibility for the publication;
(4) there are no other authors of the publication according to these criteria;
(5) potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and
(6) the original data are stored at the following location(s) and will be held for at least five years from the date indicated below:

[Please note that the location(s) must be institutional in nature, and should be indicated here as a department, centre or institute, with specific campus identification where relevant.]

Location(s): University of Canberra

Signatures Date

20/12/2019

20/12/2019


CHAPTER SIX

Bio-NER Model Development Based on Deep Learning Approach

6.1 Overview

This chapter describes a deep learning-based approach to the biomedical named entity recognition (Bio-NER) task. Three different deep learning architectures were investigated, including Feedforward Networks (FFNs), Recurrent Neural Networks (RNNs), and hybrid Convolutional Neural Networks (CNNs), to examine their latent feature discovery capabilities for Bio-NER tasks. The performance evaluation of the proposed deep learning approach on the same benchmark BioNLP dataset used in the experimental work of Chapter 5 led to promising performance when assessed in terms of F-measure (F-score), recall, and precision. The best performing deep learning algorithm was the one based on the hybrid CNN architecture, resulting in an F-score of 70.69%, matching and slightly surpassing the performance achieved with traditional shallow learning algorithms based on the CRF learner and rich text feature set. Though the performance in terms of the F-measure metric is similar, minimal feature engineering and handcrafting is required with the deep learning approach, and end-to-end automated model building was possible. The performance can be improved further with an appropriate choice of deep learning architecture. The rest of the chapter describes Bio-NER model development based on deep learning models.


The three deep learning techniques investigated for this task include a traditional deep machine learning technique, the Feedforward Network (FFN); a variant of the feedforward neural network with a feedback loop, the Recurrent Neural Network (RNN); and a hybrid combination of RNNs with another popular architecture for extracting latent features, the Convolutional Neural Network (CNN). The three deep machine learning models examined have shown the potential to model latent features differently, and proved their capability to discover and learn different feature representations directly from the raw data.

The algorithmic workflow for deep learning model development for the Bio-NER text mining task is similar to any machine learning model development, involving problem formulation as a sequence labelling problem, with the aim of assigning a label to each word in a sentence. The traditional approach to this text mining task is to use well-known NLP techniques, often requiring handcrafted text features (e.g., capitalization, prefix and suffix, POS tags, removal of stop words and duplicate words) built specifically for each entity type [110-113]. This NLP feature extraction process is very labour-intensive and often consumes most of the time and cost, particularly for Bio-NER model development, due to complex and obscure biological jargon [114]. Moreover, even after significant feature engineering efforts involving multidisciplinary NLP/linguistic, biomedical, and machine learning experts, it leads to highly specialized systems that cannot be directly used to recognize new, evolving entity types, and accuracy remains a limiting factor if the data is insufficient, unbalanced, or of poor quality [115].

The Bio-NER model development and evaluation for the three different deep learning architectures was done on the publicly available NLPBA/BioNLP 2004 challenge dataset. This benchmark dataset consists of biomedical text data and five major entity types to be recognized (DNA, RNA, protein, cell type, and cell line). As can be seen in Table 16, of the three deep learning architectures examined, the hybrid CNN model resulted in the best performance, with an F-score of 70.69%, matching the performance of the system based on shallow machine learning with the CRF learner and rich text feature set (discussed in Chapter 5).

The rest of this chapter is organized as follows. Sections 6.2 and 6.3 discuss the background and related work on different deep machine learning approaches. Section 6.4 presents the three deep learning algorithms used. The experimental work done to evaluate the different deep learning algorithms is presented in Section 6.5. Section 6.6 concludes the chapter by summarizing the outcomes of the work reported here and the plan for future research.

6.2 Deep Learning Background

Feedforward networks (FFNs) are the traditional deep learning networks [116], based on a series of algorithms, modelled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labelling, or clustering of raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text, or time series, must be translated. In general, neural networks allow clustering and classification layers to be implemented on top of the data we store and manage. FFNs and recurrent neural networks (RNNs) [117] are different types of artificial neural networks designed to recognize patterns in sequences of data, such as text, genomes, handwriting, spoken words, or numerical time series data emanating from sensors, stock markets, and government agencies.


RNNs or recurrent neural networks are unique in the sense that they are designed to utilize sequential information. Since input data are processed sequentially, recurrent computation is performed in the hidden units where the cyclic connection exists. Therefore, past information is implicitly stored in the hidden units called state vectors, and output for the current input is computed considering all previous inputs using these state vectors.

It is important to note the differences between these two neural networks. A feedforward network feeds information straight through the nodes of the network (never touching a given node twice), while in recurrent networks, it cycles through a loop. The feedforward network has no notion of order in time, and the only input it considers is the current example it has been exposed to. Feedforward networks are amnesiacs regarding their recent past; they only remember the formative moments of training. Recurrent networks, on the other hand, take as their input, not just the current input example they see, but also what they have perceived previously in time. In addition, recurrent networks can be distinguished from feedforward networks by the feedback loop connected to their past decisions, ingesting their own outputs moment after moment as input. It is often said that recurrent networks have memory. Adding memory to neural networks has a purpose: There is information in the sequence itself, and recurrent networks use it to perform tasks that feedforward networks cannot.

Further, the main difference between an RNN and an FNN is that in each neuron of an RNN, the output of the previous time step is fed in as input at the next time step. This makes the RNN aware of time (past, current, and future), while the FNN is not. For example, in handwritten digit classification, you have only the input and output; there is no difference between the 1st and the 100th presentation, because the network has the same input and output. An RNN, in contrast, is usually used to describe a sequence (a time sequence or a context). For example, while reading a text file corresponding to the BioNLP challenge dataset [7], the algorithm allows the machine


to output the next character. As it reads more and more data, it accumulates the knowledge through its previous time step, making the network aware of the context [118].
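The recurrence described above can be sketched with a scalar toy example (an illustration of the state update, not a trainable network): the hidden state at each step depends on both the current input and the previous state, so the final state encodes the whole sequence.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent step: h_t = tanh(w_x * x_t + w_h * h_prev + b).
    Scalar weights keep the sketch minimal; real RNNs use matrices."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

h = 0.0
for x in [1.0, 0.5, -0.3]:       # a toy input sequence
    h = rnn_step(x, h, w_x=0.8, w_h=0.5, b=0.0)
# h now summarizes the whole sequence, in order.
```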

Another popular deep learning architecture is the Convolutional Neural Network (CNN). The difference between a CNN and an FNN is that CNNs include several layers of convolutions, with nonlinear activation functions such as the Rectified Linear Unit (ReLU) [119] or the Hyperbolic Tangent function (Tanh) [120] applied to the results. In a traditional feedforward neural network, we connect each input neuron to each output neuron in the next layer; this is called a fully connected layer. CNNs work differently: we use convolutions over the input layer to compute the output. This results in local connections, where each region of the input is connected to a neuron in the output. Each layer applies different filters, typically like the ones shown in Figure 20, and combines their results. Due to this, the network has the capability to discover latent feature representations automatically.
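The local-connection idea can be illustrated with a minimal 1-D convolution followed by a ReLU activation (a sketch with a toy signal and kernel; deep learning libraries implement the same operation, usually as cross-correlation):

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)

def conv1d(signal, kernel):
    """Valid 1-D convolution with ReLU: each output value depends only
    on a local window of the input, not on all inputs at once."""
    k = len(kernel)
    return [relu(sum(signal[i + j] * kernel[j] for j in range(k)))
            for i in range(len(signal) - k + 1)]

out = conv1d([1.0, -2.0, 3.0, -1.0, 2.0], [0.5, 0.5])
# out == [0.0, 0.5, 1.0, 0.5]: a 5-element input and size-2 kernel
# give 4 locally connected outputs; negatives are clipped by ReLU.
```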

Figure 19 shows the architecture of an FNN, where a connection between two nodes is only permitted from nodes in layer i to nodes in layer i + 1 (hence the term feedforward; no backward or intra-layer connections are allowed).


Figure 19: Structure of FFN with 3 input nodes, a first hidden layer with 2 nodes, a second hidden layer with 3 nodes, and a final output layer with 2 nodes.

As can be seen from Figure 19, the nodes in layer i are fully connected to the nodes in layer i + 1; that is, every node in layer i connects to every node in layer i + 1. For example, there are a total of 3 x 2 = 6 connections between layer 0 and layer 1, which is where the term “fully connected” comes from. We normally use a sequence of integers to describe the number of nodes in each layer quickly and concisely. For instance, the network above is a 3-2-3-2 feedforward neural network, where:

 Layer 0 contains 3 inputs, our xi values. These could be raw pixel intensities or entries from a feature vector.

 Layers 1 and 2 are hidden layers, containing 2 and 3 nodes, respectively.

 Layer 3 is the output layer, or the visible layer, where we obtain the overall output classification from the feedforward network. The output layer normally has as many nodes as class labels: one node for each potential output. For example, for a Dogs vs. Cats classification, we would have two output nodes, one for “dog” and another for “cat”.
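The forward pass of this 3-2-3-2 network can be sketched as follows (random placeholder weights for illustration; a real network would learn them, and biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
# Layer sizes for the 3-2-3-2 network described above.
sizes = [3, 2, 3, 2]
# One weight matrix per pair of adjacent layers (fully connected).
weights = [rng.standard_normal((m, n)) for m, n in zip(sizes, sizes[1:])]

def forward(x):
    """Fully connected forward pass with tanh activations."""
    for w in weights:
        x = np.tanh(x @ w)
    return x

y = forward(np.array([0.1, 0.2, 0.3]))  # two outputs, one per class
```

Note that `weights[0]` has 3 x 2 = 6 entries, matching the 6 connections between layer 0 and layer 1 counted above.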


Figure 20: Illustration of a CNN architecture for sentence classification.

Figure 20 shows a CNN architecture for a sentence classification task. The convolution process involves three filter region sizes (2, 3, and 4), each of which has 2 filters. Every filter performs convolution on the sentence matrix and generates (variable-length) feature maps. Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus, a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final SoftMax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence show two possible output states [121].
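The 1-max pooling step above can be sketched as follows (the six toy feature maps are illustrative stand-ins for the variable-length maps produced by the six filters):

```python
def one_max_pool(feature_maps):
    """1-max pooling: keep only the largest value from each
    (variable-length) feature map, giving one fixed-size vector."""
    return [max(fm) for fm in feature_maps]

# Six variable-length maps, as produced by filter sizes 2, 3, and 4
# with 2 filters each (values are made up).
maps = [[0.1, 0.9], [0.4], [0.7, 0.2, 0.5], [0.3, 0.8], [0.6], [0.0, 0.2]]
vec = one_max_pool(maps)  # a 6-element vector for the penultimate layer
```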

The third type of deep neural model we examined in this work is the Recurrent Neural Network (RNN) model, essentially built from two smaller modules, W and R. This approach of building neural network models from smaller neural network “modules” that can be composed together is not very widespread, but it can lead to very successful outcomes in Natural Language Processing (NLP) contexts. A plain RNN can only have a fixed number of inputs. We can overcome this by adding an association module, A, which takes two word or phrase representations and merges them. By merging sequences of words, A takes us from representing words to representing phrases, or even whole sentences. And, because we can merge different numbers of words, we do not need a fixed number of inputs. It does not necessarily make sense to merge the words in a sentence linearly. Models that instead merge recursively, with the output of a module feeding into a module of the same type, are called “recursive neural networks” [122]; they are also sometimes called “tree-structured neural networks”.

The Convolutional Neural Networks (CNNs) [124] discussed before are mostly used for extracting or discovering features, are designed to process multiple data types, and are most popular for processing images. They can be used to classify images, cluster them by similarity (photo search), and perform object recognition within scenes. They can identify faces, individuals, street signs, eggplants, platypuses, and many other aspects of visual data. More recently, convolutional networks have also been applied directly to text analytics [125], as well as to graph data with graph convolutional networks. Other than the FFN, RNN, and CNN type deep learning models, there are other types of deep learning architectures, such as those based on Restricted Boltzmann Machines (RBMs) [123], and the Neural Network with Regression. The RBM is an algorithm useful for several tasks, including dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modelling. The Neural Network with Regression (NNR) is used for clustering, classification, or regression [126]. That is, such networks help to group unlabelled data, categorize that data, or predict continuous values after supervised training. For performing classification, a network typically uses a form of logistic regression in the final layer to convert continuous data into dummy variables like 0 and 1.

However, many of these complex deep learning algorithms are highly computationally intensive and require Graphics Processing Units (GPUs) to perform named entity recognition on large text corpora. Instead, by combining one or more of the FNN, RNN, or CNN architectures, it may be possible to obtain better performing deep learning models. We therefore explored a hybrid CNN model, combining an RNN and a CNN, that might perform well even without GPU power and without significant computational burden for a complex text processing task such as the Bio-NER problem. As the work reported in this chapter shows, this was indeed a good line of investigation. The next few sections describe the proposed deep learning Bio-NER model building in detail.

6.3 Proposed Deep Learning Bio-NER Scheme

In this section, we describe the proposed Deep machine learning Bio-NER model building scheme and its evaluation with the benchmark BioNLP challenge dataset, involving biomedical names (DNA, RNA, Cell line, Cell type, Protein, O) and entities.


6.3.1 BioNLP 2004 Challenge dataset

This is one of the most popular benchmark datasets in the Bio-NER domain and is used as a baseline for comparing the performance of different computer-based learning algorithms. The BioNLP/NLPBA 2004 shared task corpus contains a training subset with labelled data (tags and annotations), while the test subset contains unlabelled data, as shown in Table 18 and Table 19, respectively, in Appendix 1. The classes/tags/labels in this dataset are as follows:

 B: denotes the beginning token of an NE,

 I: denotes each following token that is inside the NE, and

 O: denotes a token that is not part of any NE, in agreement with the convention of the NER community.

6.3.2 Deep Bio-NER Model

The deep Bio-NER model development approach involved several methodological aspects, as described here. Figure 21 shows the basic architecture for deep learning model development, using the CNN architecture as an example.


Figure 21: Basic CNN Model architecture.


Figure 22: Character level of Convolutional Neural Network and Recurrent Neural Network.

After the pre-processing of raw data, we built three different neural network models based on the FFN, RNN, and hybrid CNN-RNN/LSTM architectures. The first architecture used was a multi-layer Feedforward Network (FFN) configuration with two-dimensional input; the performance of this model is shown in Table 16 at the entry “Pha18 FFN”. The second architecture used was a three-dimensional Recurrent Neural Network (RNN) architecture, in which the data is organized as a time series with time steps and a sequence length; the performance of this model is shown in Table 16 at the entry “Pha18 RNN”.

Figure 22 shows the third architecture, with character-level hybrid CNN-RNN processing together with the word embedding feature, followed by a bidirectional LSTM layer. The model consists of several stages and layers, with 3-dimensional character-level data in layer 1 feeding into a convolutional layer with a 3x3 kernel, followed by several layers of convolutional processing and learning of shared representations, culminating in a final output layer with 32 output channels and a rectified linear unit (ReLU) activation function.

The hybrid CNN model used several stages of character-level convolutional layer processing, with a character-level 1D max-pooling layer, and a character-level convolutional and max-pooling layer with 64 output channels. The output of the character-level CNN was concatenated with a word embedding layer and a casing embedding layer (the word broken down into a few casing features, for example hyphenated, numeric, punctuation, title case, and so on).
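The casing features feeding the casing embedding can be sketched as follows (the class names and precedence order are illustrative assumptions, not the exact set used in the thesis model):

```python
def casing(token):
    """Map a token to a coarse casing class for the casing embedding
    (class names and check order are illustrative)."""
    if token.isdigit():
        return "numeric"
    if "-" in token:
        return "hyphenated"
    if not any(c.isalnum() for c in token):
        return "punctuation"
    if token.istitle():
        return "title"
    if token.isupper():
        return "allCaps"
    if token.islower():
        return "lower"
    return "other"
```

Each class would then be mapped to a small learned embedding vector and concatenated with the word and character-level representations.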

The output of the concatenation layer was fed into two hidden Bidirectional RNN layers which processed the input with 200 hidden neurons. And it continued with a fully connected network before it ended at another full connection layer, which produced eleven classes as the output.

This hybrid CNN model building, and evaluation was done with the BioNLP/NLPBA 2004 shared task corpus. After 10 epochs of training the Hybrid-CNN model, we were able to achieve an accuracy of 99.26 percent on the training set as shown in Figure 23.

CHAPTER SIX 99

Figure 23: The performance of the character-level for the CNN-RNN hybrid model for training data set after 10 epochs.

The performance achieved by different deep learning models for the independent test set is shown in Table 16. The best performing model was the hybrid-CNN model resulting in an F- score of 70.69% followed by an RNN model with an F-score of 65.17%, and finally the simple

FFN model performance, with an F-score of 38.54%.

The performance of the hybrid CNN model (Figure 22), with an F-score of 70.69% matches the performance reported for the shallow learning model in the previous chapter 5, with CRF learner and rich text feature set, and surpasses the performance slightly. Further, this improved performance was achieved with no handcrafting or feature engineering, yielding a complete computer-based automated solution. This improvement in performance with the hybrid CNN model could be due to the efficient CNN-RNN structure, and deep discovery of latent feature sets by multiple processing layers in the hybrid-CNN model.

CHAPTER SIX 100

Further, it is possible to achieve better performance than this, with appropriate hyper-parameter selection, which is discussed in the next Chapter. Some hyper-parameters that are crucial and need to be chosen carefully could be the architectural components or the variables, including the type of activation function, number of hidden units, , drop out, etc. For example, choosing the right activation function might consist of choosing either ReLu [119] or

Tanh [120] activation functions. While ReLu is an activation function defined as the positive part of its argument: R(z) = max (0, z), where z is the input to a neuron, the Tanh is the simplest and the most successful activation function, as it is a hyperbolic analogue of the tan circular function used throughout trigonometry. Tanh is defined as the ratio of the corresponding hyperbolic sine and hyperbolic cosine functions. Though the Tanh converges faster than some of the activation functions in terms of substantial number of classes/labels but did not perform well for our BioNLP data, whereas the ReLu is quite similar, it also converges faster than some of the activation functions in terms of the substantial number of classes/labels. In addition, the

ReLu is a good approximator. It works most of the time as a general approximator, as any function can be approximated with combinations of ReLu.

For the Bio-NER model building attempts based on deep learning, we have utilized the ReLu activation function to get a feature map from input numeric data for feeding into the neural networks for further processing. Figure 24 shows the transfer function of the Rectified Linear unit (ReLu) activation function.

CHAPTER SIX 101

Figure 24: The ReLu activation function: R (z) = max (0, z) gives an output z if z is positive and 0 otherwise.

6.4 Experimental Work

In this section, experimental work and performance achieved for three different deep learning algorithms for Bio-NER model development are discussed.

The algorithm workflow for each deep learning-based Bio-NER model development (FFN,

RNN, and hybrid CNN-RNN/LSTM) proceeds in the following manner:

 Data preparation phase, in this phase, datasets are prepared by pre-processing each

sentence. All sentences from the training and testing dataset undergo the same pre-

processing, involving the removal of stop words and duplicate words. We use

numeric/index to represent the unique words produced. We do the same process for the

testing and the validation datasets. And we obtain three different unique word datasets.

 These three unique word datasets are combined into the word pool.

CHAPTER SIX 102

 Each unique word training, testing or validation dataset is then split into unique

sentences and put into separate string array for training, testing or validation dataset.

 These 3 separate string arrays are mapped with the word pool to generate the integer

array (i.e., each digit value in the array represents the unique word produced).

 In the combined dataset, we have 18654 sentences which is equivalent to 18654

elements in the array (i.e., one sentence represents one row/record which represents one

element of the array).

 Since each record in the string array has a different length, we use padding, so that

each record in the integer array has the same length.

 Now, we have the training integer array with 18654 records and 208 columns, and

testing integer array with 4650 records and 208 columns.

 We then append 11 labels/classes, with B-DNA, I-DNA, B-RNA, I-RNA, B-Cell line,

I-Cell line, B-Cell type, I-Cell type, B-Protein, I-Protein, and O tags(Figure 40) into the

training integer array with 18654 records and 208 columns; and testing integer array

with 4650 records and 208 columns.

 For creating, training, and evaluating deep learning neural network models, we

need to use different steps, such as defining the network, compiling the network, fitting

network, evaluating the network, and make predictions.

Firstly, we need to define which neural network we are going to use either FNN, RNN, or hybrid-CNN. We used Keras tools for building the deep learning models, where neural networks are defined as a sequence of layers. The container for these layers in Keras is the

Sequential class. The first step is to create an instance of the Sequential class. Then we can

CHAPTER SIX 103

create our layers and integrate them in the order that they should be connected. Now, we have the pre-processed text input at the input layer, we then specify the number of neurons in the hidden layer and a neuron in the output layer. We also define activation function for each layer, and the choice of activation function and the output layers depends on the type of classification task:

 Regression: For regression tasks, the linear activation function is used, and the number

of neurons matches the number of outputs.

 Binary Classification (2 classes): For binary classification tasks, the Logistic

activation function or Tanh: the hyperbolic functions are used and one neuron for the

output layer is needed for binary classification.

 Multiclass Classification (more than 2 classes): For multiclass classification, the

SoftMax activation function is used, and one output neuron per class value is used,

assuming a one-hot encoded output pattern.

Next, after defining the neural network, we need a pre-compute step for our network, as the compilation process is required after defining a model. This includes both before training using an optimization strategy as well as loading a set of pre-trained weights from a saved file. The reason is that the compilation step prepares an efficient representation of the network that is also required to make predictions on the hardware. The compilation step requires certain parameters to be specified, particularly tailored to training our network. Further, we use the optimization algorithm, involving minimization of a loss function, for building/training the model.

Some of the optimization algorithms available in Keras deep learning library, include:

CHAPTER SIX 104

 Stochastic Gradient Descent Optimizer (SGD): that requires the tuning of a learning

rate and momentum [133]. It is a popular algorithm for training a wide range of models

in machine learning, including (linear) support vector machines and logistic regression.

It is also known as incremental gradient descent and is an iterative method for

optimizing a differentiable objective function, a stochastic approximation of gradient

descent optimization. It is called stochastic because samples are selected randomly (or

shuffled) instead of as a single group (as in standard gradient descent) or in the order

they appear in the training set [134].

 RMSprop: an optimization method that requires the turning of the learning rate, and it

is also used to enhance gradient descent [135].

 ADAM Optimizer: Adam optimizer combines the advantages of two other extensions

of stochastic gradient descent optimizers [132], involving:

o Adaptive Gradient Algorithm (AdaGrad) that maintains a per-parameter

learning rate that improves performance on problems with sparse gradients (e.g.

natural language and computer vision problems).

o Root Mean Square Propagation (RMSProp) that also maintains per-parameter

learning rates that are adapted based on the average of recent magnitudes of the

gradients for the weight (e.g. how quickly it is changing). This means the

algorithm does well on online and non-stationary problems (e.g. noisy).

Adam realizes the benefits of both AdaGrad and RMSProp.

We have used the ADAM optimization algorithm for tuning of hyperparameters for three types of neural network models: FFN, RNN, and Hybrid-CNN. ADAM optimizer method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to the diagonal rescaling of the gradients, and is well suited for problems that are CHAPTER SIX 105

large in terms of data and/or parameters. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning; the name Adam is derived from adaptive moment estimation [132].

The third step involved fitting, where, once the network is compiled, it needs to be fitted, which means adapting the weights on a training dataset. Fitting the network requires the training data to be specified, both a matrix of input patterns X and an array of matching output patterns Y. The network is trained using the algorithm and optimized according to the optimization algorithm and its loss function specified when compiling the model. The backpropagation algorithm requires that the network be trained for a specified number of epochs. Each epoch can be partitioned into groups of input-output pattern pairs called batches. This defines the number of patterns that the network is exposed to before the weights are updated within an epoch. It is also requiring an efficient optimization, ensuring that not too many input patterns are loaded into memory at a time. Once it is fit, a history object is returned representing a trained model.

The fourth step involves evaluation, where once the network is trained, it needs to be evaluated. The network can be evaluated and trained by using the same training data, but this will not provide a useful indication of the performance of the network, in terms of the predictive power, as it has seen all the data beforehand. Hence, we evaluate the performance of the network on a separate dataset, unseen data during training, and can provide an estimate of the performance of the network at making predictions for unseen data in the future. The model evaluation is done in terms of the loss metric across all the test patterns, as well as other metrics specified when the model was compiled, such as accuracy, confusion matrix, precision, recall,

CHAPTER SIX 106

and F-measure. Table 16 shows the performance of each network in terms of different metrics we have defined for evaluating each deep learning algorithm for the Bio-NER model development. Figure 21 (Basic CNN) and Figure 22 (Hybrid CNN-RNN) shows the architecture for two CNN based deep learning algorithms. As can be seen in Table 16 and

Figure 25, the hybrid CNN (Pha18CNN) model has resulted in improved performance in terms of Precision, Recall and F-score: Pha18CNN (72.45, 68.30, 70.69), and surpasses the performance reported of standalone deep learning models based on FFN, RNN and CNN architectures, shown in Table 16 at the entry “Pha18 CNN”.

Table 16: Performance Results of Participating Systems in BioNLP challenge tasks.

CHAPTER SIX 107

Figure 25: Comparison of the performance between Neural Networks from Hybrid Pha18 (CNN) with other BioNLP participating systems using F-Score.

CHAPTER SIX 108

6.5 Conclusion

In this chapter we present the details of a novel strategy for Bio-NER model development based on different deep machine learning models. The three deep learning approaches based on feedforward networks (FFNs), recurrent neural networks (RNNs), and hybrid convolutional neural networks (Hybrid-CNNs) were examined. The hybrid-CNN model based on bidirectional LSTMs, character-level CNN embeddings, allowed latent feature discovery at a deep level to be discovered, and permitted automatic learning of structure from biomedical text, unlike the manual handcrafting and feature engineering stage required for traditional shallow learning algorithms based on CRF and rich text feature sets, described in Chapter 5.

Out of the different deep learning models examined, the Hybrid-CNN architecture resulted in better performance in terms of Precision, Recall and F-score for the CNN Pha18 (72.45, 68.30,

70.69), and surpassed most of other participating systems in the BioNLP challenge task, including our own model proposed in previous Chapter involving CRF learner and rich text feature set. The experimental evaluation was based on benchmark BioNLP/NLPBA 2004 shared task corpus.

Overall, the experimental work in this chapter has shown that the use of deep learning algorithms for complex Bio-NER tasks can be beneficial, particularly a hybrid deep learning model, based on an appropriate combination of different deep learning techniques. The hybrid deep learning model allows the automatic discovery of latent features contained in the structure of the biomedical NLP texts better. This can help in the development of fully automated machine learning pipelines, for any type of text data for any complex application domain, with unbalanced, sparse and unannotated data, as is the case with complex biomedical entity recognition tasks (i.e., the benchmark challenge dataset we used for Bio-NER modeling has six class labels as input and 11 class labels as output, and has poor quality data). CHAPTER SIX 109

To strengthen these findings, the next Chapter seven in this thesis describes the investigations done for the deep learning model development for a different and more complex dataset, involving a clinical handover scenario. The dataset used for this scenario involves data from

CLEF eHealth 2016 challenge task and is described in detail in the next Chapter.

.

CHAPTER SIX 110

DECLARATION OF CO-AUTHORED PUBLICATIONS

For use in theses which include publications. This declaration must be completed for each co-authored publication and to be placed at the start of the thesis chapter in which the publication appears.

Declaration for Thesis Chapter [7]

Declaration by candidate In the case of Chapter [7], the nature and extent of my contribution to the work was the following: Nature of contribution Contribution %

I was discovering a new Hyperparameter optimization machine learning approach to work with a variety of Hybrid Deep Machine learning algorithms by utilizing the clinical CLEF eHealth challenge 2016 dataset. And I was able to 89.00 achieve a better result for the clinical name entity recognition system as compared to the previous participating systems provided by the challenge task. This link will be providing soon after the conference 8-11 in Beijing, China.

The following co-authors contributed to the work: Name Nature of Contribution Contributor is also a student at UC Y/N Dr Girija Chetty Advise remarkable approaches to achieve N result models with highest performance. Dr Thoai Man Luu Advise new technologies, assist in various N experiments and review the conference paper.

______20__/__12__/_2019___ Candidate’s Signature Date

DECLARATION BY CO-AUTHORS

The undersigned hereby certify that: (1) the above declaration correctly reflects the nature and extent of the candidate’s contribution to this work, and the nature of the contribution of each of the co-authors.

111

(2) they meet the criteria for authorship in that they have participated in the conception, execution, or interpretation, of at least that part of the publication in their field of expertise; (3) they take public responsibility for their part of the publication, except for the responsible author who accepts overall responsibility for the publication; (4) there are no other authors of the publication according to these criteria; (5) potential conflicts of interest have been disclosed to (a) granting bodies, (b) the editor or publisher of journals or other publications, and (c) the head of the responsible academic unit; and (6) the original data are stored at the following location(s) and will be held for at least five years from the date indicated below:

[Please note that the location(s) must be institutional in nature, and should be indicated here as a department, centre or institute, with specific campus identification where relevant.]

Location(s): University of Canberra

Signatures Date

20/12/2019

20/12/2019

112

CHAPTER SEVEN

Extending Bio-NER Models to Clinical- NER Based on Deep Learning and Hyperparameter Optimization Techniques

7.1 Overview

As seen in the previous Chapter, deep learning approaches have shown to perform well for complex Bio-NER model development and relieved the manual feature engineering step required for shallow machine learning techniques. However, the choice of deep learning architecture, and hyperparameters are chosen for the architecture, can determine the performance improvements that can be achieved. Out of the three different deep learning architectures used in the previous Chapter, the hybrid-CNN-RNN architecture seemed to perform well, whereas the other two architectures, the one based on FFN and RNN did not perform well when compared with shallow machine learning technique as a baseline. One of the reasons behind the poor performance deep learning models (FFN/RNN here) could be the choice of hyperparameters, and difficulty in choosing multiple hyperparameters. As the deep learning architecture gets more complex (hybrid-CNN-RNN for example), there will be more hyper-parameters that need to be tuned, often requiring human expert intervention. As the aim of deep learning approaches is to minimize human intervention, and create end-to-end automated algorithm pipelines, the hyper-parameter selection and tuning needs to be automated as well. With appropriate automatic hyperparameter selection and tuning tools, it is not only possible to minimize human intervention and create fully automated pipelines, but it is also

CHAPTER SEVEN 113

possible to adapt and extend the deep learning algorithm pipeline to a different NER context.

This chapter describes the research efforts made in this direction and shows how the hybrid-

CNN-RNN deep learning model with hyperparameter optimization has been extended for the clinical-NER model development. The rest of the Chapter describes each step for this work done in detail. Next Section provides a background on hyper-parameter optimization. The clinical-NER problem is described in Section 7.3, and details of the approach used for hybrid-

CNN-RNN with hyperparameter optimization for Clinical-NER model development are described in Section 7.4. Section 7.5 presents the experimental work done for validating the proposed approach, and the Chapter ends with the outcomes achieved and plans for further research work.

7.2 Hyper-parameter Optimization

Hyperparameter optimization is a process for selecting suitable hyperparameters to work with an existing dataset or any new dataset to obtain better results for a shallow or deep machine learning algorithm [145]. Most deep learning algorithms explicitly define specific hyperparameters that control different aspects, such as memory or cost of execution. However, additional hyperparameters can be defined to adapt an algorithm to a specific scenario. practitioners typically spend quite a bit of time tuning hyperparameters in order to achieve the best performance for a specific model.

Various techniques have been proposed in the literature for automatic hyper-parameter optimization, including random search [136], Bayesian optimization [137-139], evolutionary optimization [140], meta-learning [141-142] and bandit-based methods [143]. All these techniques require a search space/the domain of the function to be optimized that specifies which hyperparameters to optimize and what ranges to consider for each of them. Currently CHAPTER SEVEN 114

these search spaces are designed by experienced machine learners, with decisions typically based on a combination of intuition for testing purposes.

Also, when it comes to selecting and optimizing hyperparameters, there are two basic approaches: manual and automatic approaches. Both approaches are useful, and the decision typically represents a tradeoff between the deep understandings of a machine learning model required to select hyperparameters manually versus the high computational cost required by automatic selection algorithms. The main objective of manual hyperparameter selection is to tune the effective capacity of a model to match the complexity of the target task. Machine learning models also have a concept of effective capacity. In that context, the effective capacity of a machine learning algorithm is determined by three main factors: (1) The representational capacity of the algorithm or the set of hypotheses that satisfy the training dataset; (2) The effectiveness of the algorithm to minimize its cost function; (3) The degree on which the cost function and training process minimize the test error. Tuning different hyperparameters is the key to find the optimal balance between those factors. However, optimizing hyperparameter manually can be an exhausting endeavor. To address that challenge, we can use algorithms that automatically work out a potential set of hyperparameters and attempt to optimize them while hiding that complexity from the developer. Algorithms such as Random Search and Grid

Search are some of the popular and well-performing algorithms for hyperparameter inference and optimization.

Essentially, model optimization is one of the toughest challenges in the implementation of machine learning solutions. Entire branches of machine learning and deep learning theory have been dedicated to the optimization of models. Typically, we think about model optimization as a process of regularly modifying the code of the model in order to minimize the testing error.

However, deep learning optimization often requires fine-tuning elements that live outside the

CHAPTER SEVEN 115

model but that can heavily influence its behavior. Deep learning often refers to those hidden elements as hyperparameters as they are one of the most critical components of any machine learning application.

It is also important to mention the tools for optimizing hyperparameters. Most deep learning frameworks such as Tensor-Flow or Keras natively include hyperparameters optimization algorithms. However, they do not give much freedom in tailoring it to different application contexts. Certain toolsets available recently (Weights and Biases Toolset for example) provide this autonomy. We selected Weights&Biases (W&B) as our Hyperparameter optimization toolset for tuning each component of the deep learning architecture to improve the performance and extend it to different application contexts. The W&B toolset is used by deep learning powerhouses such as OpenAI [146] and uses an advanced programming model for recording and tuning hyperparameters across different experiments.

In addition to W&B toolset, few other approaches are also used for hyper-parameter optimization, such as Speeding up the Hyperparameter optimization tools for Deep

Convolutional Neural network [147], which is based on using lower-dimensional data representations at the beginning, while increasing the dimensionality of the input later in the optimization process; Efficient Hyperparameter Optimization of Deep Learning Algorithms

Using Deterministic RBF Surrogates tools [148] aims to improve the performance of Bayesian optimization, as the probabilistic surrogates require accurate estimates of sufficient statistics of the error distribution and thus need many function evaluations with a sizeable number of hyperparameters; Sherpa library [149], is a free open-source hyperparameter optimization library for machine learning models. It is designed for problems with computationally expensive iterative function evaluations, such as the hyperparameter tuning of deep neural networks. With Sherpa, scientists can quickly optimize hyperparameters using a variety of

CHAPTER SEVEN 116

powerful and interchangeable algorithms. Sherpa library also allows implementing custom algorithms. This library can be run on either a single machine or a cluster via a grid scheduler with minimal configuration.

7.3 Hyperparameter optimization for Hybrid CNN-

RNN Model

In this section, we describe how the hyper-parameter optimization was done for the hybrid CNN-

RNN model, one of the best performing model in previous Chapter (Chapter 6), out of the three different deep learning models: the FFN, RNN and hybrid CNN-RNN, and how it was extended for a different NER model development, comprising clinical-NER task. Before discussing further details of the framework, it is important to understand the details of the benchmark challenge dataset, the CLEF eHealth 2016 Task 1 challenge shared task corpus, used for clinical-NER model development and validation. This dataset represents a more complex application scenario as compared to the Bio-NER problem discussed in previous Chapters and involved recognition of 36-class recognition with small training, validation, and test corpus.

The train data file format contained labeled data or annotated text, and the test data file contained unlabeled data or unstructured text. The details of tags or class labels can be outlined as follows:

 B: denotes the beginning character of an NE,

 I: denotes each following character that inside the NE, and

 O: denotes a character that is not part of any NE, to agree with the convention of the

NER community.

CHAPTER SEVEN 117

The deep learning model used is the best performing Hybrid-CNN-RNN model used in the previous Chapter 6, involving model development and evaluation using the benchmark

BioNLP 2004 dataset. The only difference is that we introduced the hyperparameter optimization for this model based on W&B toolset for automatic hyper-parameter selection and tuning of the Hybrid CNN-RNN model. We selected different types of hyperparameters using a systematic approach, as detailed next:

 We defined five different hyperparameters for tuning the hybrid CNN-RNN separately,

including batch and epoch-size parameter, optimizer, kernel regularization, dropout

rate, and number of outputs for the embedding layer.

 Each hyperparameter consists of different algorithms inside it, such as Batch size,

Nadam, Regu, Dropout rate, layEmb.

 Each algorithm within the chosen hyperparameter will perform cross-validation checks

independently and generate different results.

 After obtaining results from all algorithms for each hyperparameter, we select the best

result from the best algorithms and add it for building the Clinical-NER model.

 The same process applied to the remaining of the Hyperparameters. Finally, we obtain

an optimal Clinical-NER model built from the training subset of the dataset.

 The model is ready for validation with the validation subset and testing with test-subset.

Figure 22 showed the character level Hybrid CNN together with the word embedding feature followed by RNN/LSTM layers. We have used the hyperparameter optimization approach to tune five layers, such as Convolutional Neural Network (CNN), Word Embedding,

Bidirectional (LSTM) Hidden Layer 1, Bidirectional (LSTM) Hidden Layer 2, and the Dense or Linear Operation.

CHAPTER SEVEN 118

The hybrid CNN-RNN model was trained with a pre-processed dataset from the CLEF challenge corpus. After 75 epochs of validating the Hybrid CNN above model, we were able to achieve an accuracy of 94.87 percent on an independent validation set. The accuracy for each epoch is shown in Figure 26.

Figure 26: Validating accuracy results for the Hybrid CNN-RNN model for 75 epochs.

CHAPTER SEVEN 119

Further, after 75 epochs of testing the Hybrid CNN above model, we were able to achieve an accuracy of 89.07 percent on an independent test set. The accuracy for each epoch is shown in

Figure 27.

Figure 27: Testing accuracy results for the Hybrid CNN model in 75 epochs.

7.4 Experimental Results and Discussions

In this section, experimental results for three different deep learning-based Clinical-NER models, the FFN, RNN, and Hybrid CNN-RNN built using the CLEF eHealth 2016 challenge task 1 dataset and its’ comparison with other models are discussed [8]. The performance achieved with each model is shown in Table 17.

CHAPTER SEVEN 120

As can be seen in Table 17, the best performing deep learning model for Clinical-NER task is the Hybrid-CNN-RNN model, with 89% precision, recall and F-measure (entry PHA-19-CNN-

RNN (0.89, 0.89, 0.89)), and it is one of the best performing model, outperforming all others participating systems in the challenge task [8].

Table 17: Performance Results of All Participating Systems in the CLEF challenge task.

CHAPTER SEVEN 121

Figure 28: Comparison F-score performance between Neural Networks from Hybrid PHA- 19-CNN-RNN with all other participating Systems in the CLEF challenge task using Bar Chart.

Figure 28 shows the graphical representation of the performance achieved and shows the F- score of 89% for the Clinical-NER hybrid CNN-RNN model (PHA-19-CNN-RNN (89.0, 89.0, and 89.0)).

CHAPTER SEVEN 122

7.5 Conclusion

In this chapter, we presented the hyper-parameter optimization technique for deep learning models, and discussed it’s use for automating the entire deep learning algorithm pipeline and extending it to a different and more complex task, from Bio-NER task to Clinical-NER task.

The hyper-parameter optimization done for a well-designed hybrid-CNN-RNN deep learning model has allowed improved performance, resulting in a Precision, Recall, and F-score of 89%, outperformed all other participating systems in the same challenge task. Further, this study has shown that the efficient deep learning architectures for complex NER tasks with efficient hyperparameter tuning, are extensible and adapt well for addressing the challenges of different application contexts (from Bio-NER to Clinical-NER task), and are able to learn and understand the shared representations and the latent structure for the biomedical and medical

NLP texts better.

Next Chapter shows the software platform and implementation of the NER system based on the algorithms developed in this thesis.

CHAPTER SEVEN 123

124

CHAPTER EIGHT

Software Platform Development and Deployment

8.1 Overview

This chapter presents the Information Visualization stage, which is based on our three robust models: the CRF learner based on rich features; Bio-NER based on CNN and RNN neural networks; and Clinical-NER based on hybrid deep learning. We developed these three models in Chapters 5, 6, and 7 respectively, and deployed them on the Amazon Web Services (AWS) cloud.

To make these models available for public testing and assessment via the link ec2-34-203-212-72.compute-1.amazonaws.com, a web application was first developed, consisting of a main page and three sub-pages: Deep Learning predicts NHO (Nursing Handover), Deep Learning predicts Biomedical, and CRF predicts Biomedical. To deploy the web application and display predictive results to the client, we registered an account with Amazon Web Services (AWS) and configured a new Virtual Machine (VM) using Amazon Elastic Compute Cloud (Amazon EC2), a web service that supplies resizable, secure compute capacity in the cloud and is designed to make web-scale cloud computing easier for developers [152].


8.2 Software Development Framework

This research work used two popular software tools: the Flask framework and the AWS platform. Flask is a lightweight Web Server Gateway Interface (WSGI) web application framework. It is designed to make getting started quick and easy, with the ability to scale up to complex applications. Flask also provides a development web server that can be used for testing the web application in the development environment [153].
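As an illustration only (the actual application code is not reproduced here), a minimal Flask WSGI application of this shape might look like the following; the route names and the `run_model` helper are hypothetical stand-ins for the deployed pages and models:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def index():
    # Main page; the real application renders an HTML template
    # with the three prediction buttons.
    return "NER demo: DL predicts NHO | DL predicts Biomedical | CRF predicts Biomedical"

def run_model(text):
    # Hypothetical placeholder for the deployed CRF / deep learning model.
    return f"received {len(text.split())} tokens"

@app.route("/predict", methods=["POST"])
def predict():
    # The client posts free text; a real handler would call the trained model.
    text = request.form.get("text", "")
    return run_model(text)

if __name__ == "__main__":
    # Flask's built-in development server, as used for local testing.
    app.run(host="0.0.0.0", port=5000)
```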

Furthermore, once the web application had been successfully tested in the development stage, it was ready to be hosted in production for public access, allowing the proposed models to be tested and assessed. We chose AWS as the web server to host the web application. AWS stands for Amazon Web Services, a cloud computing platform provided by Amazon; its services offer a variety of tools, including database storage, compute power, and content delivery [154].

8.3 The Experimental Results

Three web applications were developed, based on the three best models developed in Chapter 5 (CRF), Chapter 6 (Neural Network), and Chapter 7 (Hybrid Deep Learning). To make these web applications available to the public, we registered with Amazon Web Services (AWS). Once both the server and Internet Information Services (IIS) were successfully configured, the applications were uploaded to the AWS services, which automatically provide the public link below for anyone who wishes to access and test the proposed models using either biomedical or clinical text: http://ec2-34-203-212-72.compute-1.amazonaws.com/


8.3.1 Client-Server Communication

Initially, the client submits either biomedical or clinical text as a query and clicks the Submit button to send a request to the server. Once the server receives the prediction from the model, it responds with the result, which the client displays under the column heading “Details”, as shown in Figures 32, 34, and 36. Figure 29 describes how clients and servers communicate with each other.
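The request/response cycle described above can be sketched with Python's standard library alone; the handler below is a toy stand-in for the deployed model server, and the “Details” payload is illustrative:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class PredictHandler(BaseHTTPRequestHandler):
    """Toy stand-in for the prediction server: it receives the submitted
    text and responds with a (fake) result for the "Details" column."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        body = f"Details: {len(text.split())} tokens tagged".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

def query_server(text, port):
    """Client side: submit the query text, return the server's result."""
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/", data=text.encode("utf-8"), method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Run the toy server on any free port and perform one round trip.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
result = query_server("patient stable overnight", server.server_address[1])
server.shutdown()
```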

Figure 29: Effective Communication between Client and Server.


8.3.2 Main Webpage

To open the main webpage, copy the provided link and paste it into your browser's address bar, as shown in the red box below.

Figure 30: The main Webpage contains three sub-webpages.

As shown on the main webpage above, three buttons are provided for predicting clinical and biomedical NER entities. The three buttons have different functions:

 DL predicts NHO stands for “Deep Learning predicts Nursing Handover text”.

 DL predicts Biomedical stands for “Deep Learning predicts biomedical text”.

 CRF predicts Biomedical stands for “Conditional Random Field predicts biomedical text”.


8.3.3 Open Sub Webpage: CRF predicts Biomedical

To predict biomedical text using the CRF learner, click the “CRF predicts Biomedical” button shown in figure 30; figure 31 is then displayed. To test this CRF model, enter the biomedical text into the large text box circled in green and click the Submit button circled in blue; the client thereby requests the server to access the model, which processes the biomedical text. Once the server has the predicted result from the model, it responds to the client with a table of results under the red headings “Patient Information” and “Details”.

It is important to note that before submitting, the form shown in figure 31 has empty results under the “Details” heading, and after submitting, the form shown in figure 32 has the full results the client has just received from the server.


Figure 31: Before submitting, this sub-webpage contains “Value” as an empty result under the “Details” heading.


Figure 32: After submitting, this sub-webpage form contains the expected results under the “Details” heading.


8.3.4 Open Sub-Webpage: DL predicts Biomedical

This sub-webpage is similar to the previous one. To predict biomedical text using the deep learning model, click the “DL predicts Biomedical” button shown in figure 30; figure 33 is then displayed. To test this deep learning model, enter the biomedical text into the large text box circled in green and click the Submit button circled in blue; the client thereby requests the server to communicate with the model, which processes the biomedical text. Once the server has the predicted result from the model, it responds to the client with a table of results under the red headings “Patient Information” and “Details”.

It is important to note that both forms are user friendly and display different results:

 The form before submitting, shown in figure 33, has empty results under the “Details” heading. After you click the Submit button, a small dialog box appears with the message “please wait … responding soon. Thank you”, indicating that the client should wait the usual response time of 5 to 15 seconds for the result. After clicking OK and waiting a few seconds, a second dialog box appears with “Result is ready now”, indicating that the result is available. Clicking OK then displays the full results under the “Details” heading (refer to figure 34).

 The form after submitting, shown in figure 34, has the full results under the “Details” heading, which the client has just received from the server.


Figure 33: Before submitting, this sub-webpage form contains “Value” as an empty result under the “Details” heading.


Figure 34: After submitting, this sub-webpage form contains the expected results under the “Details” heading.


8.3.5 Open Sub-Webpage: DL predicts NHO

Likewise, this sub-webpage is similar to the previous two. To predict clinical Nursing Handover text using the hybrid deep learning model, click the “DL predicts NHO” button shown in figure 30; figure 35 is then displayed. To test this hybrid deep learning model, enter the clinical text into the large text box circled in green and click the Submit button circled in blue; the client thereby requests the server to communicate with the model, which processes the clinical text. Once the server has the predicted result from the model, it responds to the client with a table of results under the red headings “Patient Information” and “Details”, as shown in figure 36.


Figure 35: Before submitting, this sub-webpage form contains “N/A” as an empty result under the “Details” heading.


Figure 36: After submitting, this sub-webpage form contains the expected results under the “Details” heading.


8.4 Conclusion

This chapter presented the software development of the machine learning Information Extraction and Information Visualization stages, based on the three best-performing NER models. The software platform expects the client to send a request to the server, which communicates with the model to predict biomedical or clinical named entities from unstructured free text. Once input text is provided, the server responds to the client via a web browser (e.g., Chrome, Firefox, or Internet Explorer), which displays the result on the form under the “Patient Information” and “Details” headings.

Chapter 9 concludes the findings of this thesis and outlines plans for further research.


CHAPTER NINE

Conclusions and Further Plan

The main objective of this work was to develop an innovative computational framework for biomedical named entity recognition that is robust, extensible, and adaptive across its sub-domains (biomedical and clinical). This was achieved through iterative, step-by-step development based on different shallow and deep machine learning models, with validation on different benchmark datasets.

Firstly, the Bio-NER model was developed with the MaxEnt machine learning algorithm and a bootstrapped incremental supervised learning technique. This model served as a baseline reference system and allowed assessment of the more advanced machine learning models developed later. Model building and evaluation were done with the GENIA BioNLP/NLPBA 2004 shared task/challenge dataset, a benchmark biomedical named entity recognition dataset. The performance of the Bio-NER model was assessed in terms of precision, recall, and F-measure. The MaxEnt incremental/bootstrapped supervised learning approach for the Bio-NER task resulted in an F-measure of 59.7%, better than the two most closely competing systems in the challenge.

The next step was Bio-NER model development with a semi-supervised CRF learner. The experimental work involved adapting a CRF learner with a novel training algorithm, the semi-supervised learning algorithm, and evaluating it on the benchmark BioNLP/NLPBA shared-task corpus. The CRF model using semi-supervised learning achieved an F-score of 68.74%, outperforming the other six participating systems in the BioNLP challenge task.

The next stage involved Bio-NER model development based on conditional random fields and a rich text feature set. The rich text feature set contained three different types of features: the NLP feature set, with distinct features for capitalization, hyphenation, a window of 5, first token of a sentence, suffixes, and presence of numbers; the learning feature set (Word2Vec); and the statistical feature set (TF-IDF), resulting in improved performance on the Bio-NER task. Experimental validation of this model on the same benchmark GENIA dataset showed a significant improvement in performance, with an F-score of around 70%. This confirmed that a good learner and a powerful set of features can indeed enhance the performance of a complex text mining task involving the recognition of biomedical entities. However, rich text feature extraction may not be possible or easy. Ideally, what is needed is an algorithm that can perform as well as this powerful combination of rich text features and machine learners, without the manual involvement of multi-disciplinary experts for feature engineering. Recent deep learning algorithms provide this opportunity and are known to perform as well, or even better, with an appropriate choice of deep learning architecture.
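As an illustration of the statistical feature set, a minimal TF-IDF computation (a sketch, not the thesis implementation) can be written as:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_index: {term: tf-idf score}} for tokenised documents.

    tf   = term count / document length
    idf  = log(N / document frequency of the term)
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = {}
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        scores[i] = {
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return scores

# Toy corpus of tokenised biomedical snippets (illustrative only).
docs = [
    ["IL-2", "gene", "expression"],
    ["IL-2", "receptor", "alpha", "gene"],
    ["T", "cell", "activation"],
]
scores = tf_idf(docs)
```

Terms that occur in every document receive a score of zero (log 1 = 0), so the weighting favours terms that are distinctive for a document, which is what makes it useful as a statistical feature.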

The next step was to build the Bio-NER model using different types of deep learning algorithms, which can extract latent features without feature engineering and handcrafting, removing the dependency on multi-disciplinary experts for creating annotated training datasets. Three different deep learning approaches were examined: feedforward networks (FFNs), recurrent neural networks (RNNs), and hybrid convolutional neural networks (Hybrid-CNNs). Of the different deep learning models examined, the Hybrid-CNN architecture (CNN Pha18) performed best, with Precision, Recall, and F-score of (72.45, 68.30, 70.69), surpassing most of the other participating systems in the BioNLP challenge task, including our own model proposed in the previous chapter involving the CRF learner and rich text feature set. Experimental validation was done on the benchmark BioNLP/NLPBA 2004 shared-task corpus.

This established the power of deep learning algorithms for complex Bio-NER tasks, particularly the hybrid CNN-RNN/LSTM model, based on an appropriate combination of different deep learning techniques. The hybrid deep learning model allows the automatic discovery of latent features contained in the structure of biomedical NLP texts. This can help in the development of fully automated machine learning pipelines for any type of text data in any complex application domain, even with unbalanced, sparse, and unannotated data, as is the case with complex biomedical entity recognition tasks (i.e., the benchmark challenge dataset we used for Bio-NER modeling has six class labels as input and 11 class labels as output, and has poor quality data).

To examine robustness and the capacity to generalize and adapt to a different sub-domain, the robust Bio-NER model was developed based on a hybrid CNN-RNN algorithm with hyper-parameter optimization, validated on the BioNLP GENIA corpus, and extended to the low-resourced Clinical-NER context. This resulted in improved performance, with Precision, Recall, and F-score of 89%, the best performance achieved among all participating systems in the same challenge task. Further, the findings in this chapter have shown that efficient deep learning architectures for complex NER tasks, with efficient hyper-parameter tuning, are extensible, adapt well to the challenges of different application contexts (from the Bio-NER to the Clinical-NER task), and are better able to learn the shared representations and latent structure of biomedical and medical NLP texts.

One future direction of this research is to investigate hybrid deep learning algorithms that are robust and work well in low-resourced contexts, as large annotated corpora are hard to find in several sub-domains of the medical field; in medical NLP, annotated examples are scarce. To improve NER model performance and robustness in such situations, enhancements that could be explored include layer-wise initialization with pre-trained weights, combining pre-training data from two different sub-domains, custom word embeddings, and optimizing out-of-vocabulary (OOV) words. Further, these improved algorithmic implementations should be validated for performance and robustness on several benchmark datasets drawn from low-resourced contexts. This can then lead to a seamlessly integrated, end-to-end automated algorithmic pipeline that works with few training/annotated examples and can be implemented in real-time online NER systems.


APPENDIX 1

Details of Benchmark datasets used

Data selection is defined as the process of determining the appropriate data type and source, as well as suitable instruments to collect data. This thesis collected and prepared two different datasets for the experiments: the BioNLP 2004 challenge task and the CLEF eHealth 2016 Task 1A, described next.

The first challenge dataset: the data were obtained from the biomedical entity recognition task at BioNLP/NLPBA 2004 [7]. The first step in preparing the data was checking its format, which, as expected, was the IOB format. The IOB format (short for inside, outside, beginning) is a common tagging format for tokens in a chunking task in computational linguistics, such as named entity recognition (NER). The NER task involved the recognition of six biomedical concepts: DNA, RNA, Protein, Cell-line, Cell-Type, and O (for Other), as shown in figure 37.

Figure 37: The six class labels contained in the training dataset as input.

In the IOB format, all datasets contain one word per line, with empty lines representing sentence boundaries. Each line also carries a label indicating whether the word is inside a named entity or not. Words labeled with O are outside of named entities; the B-XXX label is used for the first word of a named entity of type XXX, and I-XXX is used for all other words in named entities of type XXX. Figure 38 shows an example of the NER system's 11-class output in the IOB format.
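The IOB conventions above can be made concrete with a short parser; this is an illustrative sketch, and the sample tags are chosen to match the dataset's entity types:

```python
def read_conll(lines):
    """Parse two-column 'word LABEL' lines into a list of sentences,
    where each sentence is a list of (token, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:                      # empty line = sentence boundary
            if current:
                sentences.append(current)
                current = []
        else:
            token, tag = line.split()
            current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

def extract_entities(sentence):
    """Collapse B-XXX / I-XXX runs into (entity text, type) spans."""
    entities, tokens, etype = [], [], None
    for token, tag in sentence + [("", "O")]:   # sentinel flushes last span
        if tag.startswith("B-") or tag == "O" or (
            tag.startswith("I-") and tag[2:] != etype
        ):
            if tokens:                          # close the open span
                entities.append((" ".join(tokens), etype))
                tokens = []
            if tag.startswith("B-"):            # open a new span
                tokens, etype = [token], tag[2:]
        elif tag.startswith("I-"):              # continue the open span
            tokens.append(token)
    return entities

sample = ["IL-2 B-DNA", "gene I-DNA", "expression O", "",
          "T B-cell_type", "cells I-cell_type"]
sentences = read_conll(sample)
```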

This thesis concentrates on a classification technique, rather than a regression technique, predicting 11 class labels as the expected output (Figure 38) when using the first BioNLP dataset, and 71 class labels as the expected output (Figure 39) when using the second CLEF challenge dataset. This classification technique produced better results compared to previous participating systems in the same challenge tasks.

Figure 38: Illustrates 11 class labels as expected output predicted from the BioNLP challenge task.


Figure 39: Illustrates 71 class labels as expected output from the second CLEF challenge task.

The first BioNLP challenge task consists of two files: a training file and a test file. The data consist of two columns separated by a space: the first column of each line is a word, and the second column is the named entity class label. Table 18 shows an example sentence from the training dataset, and Table 19 shows the test data in the BioNLP format from the first dataset of the BioNLP challenge task.


Table 18: Train data file contains labeled data from the BioNLP/NLPBA 2004 challenge task

Table 19: Test data file contains unlabeled data from the BioNLP/NLPBA 2004 challenge task.


The second challenge dataset: the data were obtained from the CLEF eHealth 2016 challenge Task 1 [8]. The first step in preparing the data for the experimental work in this thesis involved using a benchmark dataset of synthetic data, collected initially from voice recordings simulating the nursing handover process by researchers from the NICTA Canberra Research Lab, Australia. The data were provided in XML format, and the data preparation step involved conversion into the Conference on Computational Natural Language Learning (CoNLL-2002) format [65]. The data consist of two columns separated by a single space: each word is on a separate line, with an empty line after each sentence; the first item on each line is a word and the second is the named entity tag. Figure 40 shows the 36 class labels contained as input in the training dataset of the CLEF challenge corpus.

Figure 40: Illustrates 36 input class labels contained in the Train dataset of CLEF eHealth.

B-YYY is used for the first word of a named entity of type YYY; words tagged with O are outside of named entities; and I-YYY is used for all other words in a named entity of type YYY. The task is to predict 71 classes as the expected output (Figure 41). In addition, this CLEF eHealth challenge task consists of three files: a training file, a validation file, and a testing file.


APPENDIX 2

Machine Learning Development Tools Used

The current research has used five well-known software development tools: the Apache OpenNLP toolkit, MAchine Learning for LanguagE Toolkit (MALLET), the sklearn-crfsuite toolkit, Keras (the Python deep learning library), and the Weights & Biases (W&B) tool; and six machine learning development schemes: Maximum Entropy, semi-supervised CRF, supervised CRF, and deep learning (FNN, RNN, and hybrid CNN-RNN). These machine learning algorithms allow the prototype outcomes to be translated into a production system for business deployment. The software development tools are described in the following:

The Apache OpenNLP library is a machine learning-based toolkit for the processing of natural language text [59]. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron-based machine learning. The goal of the OpenNLP project is to create a mature toolkit for the abovementioned tasks; an additional goal is to provide many pre-built models for a variety of languages, as well as the annotated text resources those models are derived from. The Apache OpenNLP library contains several components, enabling one to build a full natural language processing pipeline: sentence detector, tokenizer, name finder, document categorizer, part-of-speech tagger, chunker, parser, and coreference resolution. Components contain parts that enable one to execute the respective natural language processing task, to train a model, and often also to evaluate a model. Each of these facilities is accessible via its application program interface (API), and a command-line interface (CLI) is provided for the convenience of experiments and training. The Apache OpenNLP library also supports testing and model training. We have modified this toolkit and used it for solving the problem of NER in the biomedical context.

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text [60]. MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics. In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text; algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields, implemented in an extensible system for finite-state transducers. Furthermore, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of "pipes", which handle distinct tasks such as tokenizing strings, removing stop-words, and converting sequences into count vectors. A classifier is an algorithm that distinguishes between a fixed set of classes, such as "spam" vs. "non-spam", based on labeled training examples; MALLET provides tools for evaluating such classifiers.

The Sklearn-crfsuite toolkit is a thin CRFsuite (python-crfsuite) wrapper that provides a scikit-learn-compatible estimator. CRFsuite [67] is a fast implementation of Conditional Random Fields (CRFs) [10, 68, 69] for labelling sequential data. Among the various implementations of CRFs, this software provides the following features: (1) fast training and tagging; the primary mission of this software is to train and use CRF models as fast as possible. (2) A simple data format for training and tagging, similar to those used in other machine learning tools: each line consists of a label and the attributes (features) of an item, and consecutive lines represent a sequence of items (an empty line denotes the end of an item sequence). This means that users can design an arbitrary number of features for each item, which is impossible in CRF++. (3) A forward/backward algorithm using the scaling method [85], which appears faster than computing the forward/backward scores in the logarithm domain. (4) Linear-chain (first-order Markov) CRFs. (5) Performance evaluation during training: CRFsuite can output precision, recall, and F1 scores of the model evaluated on test data. (6) An efficient file format for storing/accessing CRF models using Constant Quark Database (CQDB); a tagger takes little time to start up, since preparation only requires reading the entire model file into a memory block, and retrieving the weight of a feature is also very quick.
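The per-item attribute format described in point (2) maps naturally onto the list-of-dicts input that sklearn-crfsuite's estimator accepts: one feature dictionary per token, one list per sentence. The feature names below are illustrative, not the thesis's exact feature set:

```python
def token_features(sentence, i):
    """Feature dict for token i, in the one-dict-per-token form
    that sklearn-crfsuite expects."""
    word = sentence[i]
    feats = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.has_digit": any(c.isdigit() for c in word),
        "word.has_hyphen": "-" in word,
        "suffix3": word[-3:],
    }
    # Neighbouring-word context (a window feature, as in the rich set).
    feats["prev.lower"] = sentence[i - 1].lower() if i > 0 else "<BOS>"
    feats["next.lower"] = (
        sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>"
    )
    return feats

sentence = ["IL-2", "gene", "expression"]
X = [token_features(sentence, i) for i in range(len(sentence))]
```

A list of such per-sentence feature lists, together with matching label lists, can then be passed to `sklearn_crfsuite.CRF().fit` for training.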

Keras, the Python deep learning library, is used to discover the latent feature sets proposed in the hybrid CNN model. Keras [63] is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation: being able to go from idea to result with the least possible delay is key to doing good research. The current research used Keras because we needed a deep learning library that: (1) allows easy and fast prototyping through user-friendliness, modularity, and extensibility:

 User-friendliness: Keras is an API designed for human beings, not machines. It puts the user experience front and center. Keras follows best practices for reducing cognitive load: it offers consistent and simple APIs, minimizes the number of user actions required for common use cases, and provides clear and actionable feedback upon user error.

 Modularity: a model is understood as a sequence or a graph of standalone, fully configurable modules that can be plugged together with as few restrictions as possible. In particular, neural layers, cost functions, optimizers, initialization schemes, activation functions, and regularization schemes are all standalone modules that can be combined to create new models.

 Easy extensibility: new modules are simple to add (as new classes and functions), and existing modules provide ample examples. Being able to easily create new modules allows for total expressiveness, making Keras suitable for advanced research.

(2) supports convolutional networks and recurrent networks, as well as combinations of the two; and (3) runs seamlessly on CPU and GPU.
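A hybrid CNN-RNN sequence tagger of the kind discussed can be sketched in Keras as follows; the vocabulary size, sequence length, layer sizes, and tag count are assumed values for illustration, not the configuration used in this thesis:

```python
from tensorflow.keras import layers, models

# Assumed sizes for illustration only.
VOCAB, TAGS, MAXLEN = 5000, 11, 50

inp = layers.Input(shape=(MAXLEN,))
x = layers.Embedding(VOCAB, 100)(inp)              # learned word embeddings
x = layers.Conv1D(128, 3, padding="same",
                  activation="relu")(x)            # CNN: local n-gram features
x = layers.Bidirectional(
    layers.LSTM(100, return_sequences=True))(x)    # RNN: sentence context
out = layers.TimeDistributed(
    layers.Dense(TAGS, activation="softmax"))(x)   # one tag per token

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

The convolutional layer captures local character- and word-window patterns, while the bidirectional LSTM models longer-range context; the `TimeDistributed` dense layer emits one tag distribution per token, matching the IOB labelling scheme.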



The MALLET Named Entity Recognition component is also known as the CRF Classifier. The software provides a general implementation of (arbitrary-order) linear-chain Conditional Random Field (CRF) sequence models; its goal is to make it very easy to apply a suite of linguistic analysis tools to a piece of text.

As noted earlier, sklearn-crfsuite provides an interface compatible with Scikit-Learn [62], machine learning in Python: simple and efficient tools for data mining and data analysis, accessible to everybody, reusable in various contexts, and open source and commercially usable (BSD license). Because the sklearn-crfsuite CRF is a Scikit-Learn-compatible estimator, the Scikit-Learn model selection utilities (cross-validation, hyperparameter optimization) can be used with it. In summary, sklearn-crfsuite is a very easy-to-use package for applying CRF algorithms.

The Weights & Biases (W&B) tool was used for hyperparameter optimization in this work. Investigation reveals that hyperparameter optimization algorithms are natively included in most deep learning platforms, such as Keras, TensorFlow, or Theano. Nevertheless, optimizing hyperparameters in real-world deep learning schemes requires advanced tooling [64] that can help developers conduct different experiments and track the results. Most deep learning platforms on the market include some basic form of hyperparameter optimization, but many are still too limited for large-scale deep learning schemes. A hyperparameter optimization toolset should work consistently across different deep learning frameworks while providing rich interfaces for modeling, running, and comparing experiments using different hyperparameter optimizations. Innovative hyperparameter optimization tools include SageMaker, Comet.ml, Weights & Biases, Deep Cognition, Azure ML, and Cloud ML. We utilized the W&B tool with our deep learning scheme, as W&B provides an advanced toolset and programming model for recording and tuning hyperparameters across different experiments. The solution records and visualizes the different steps in the execution of the model and correlates them with the configuration of its hyperparameters.


APPENDIX 3

The World of DNA and RNA

It is important to mention that DNA stands for deoxyribonucleic acid. It is the genetic code that determines all the characteristics of a living thing. Deoxyribonucleic acid is a large molecule in the shape of a double helix, a bit like a ladder that has been twisted many times. It is made up of repeating units called nucleotides. Each nucleotide contains a sugar and a phosphate molecule, which make up the ‘backbone’ of DNA. The information in DNA is stored as a code made up of four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). It is the specific order of A, G, C, and T within a DNA molecule that is unique to you and gives you your characteristics. Human DNA consists of about 3 billion bases, and more than 99 percent of those bases are the same in all people. The order, or sequence, of these bases determines the information available for building and maintaining an organism, similar to the way in which letters of the alphabet appear in a certain order to form words and sentences.

You get your DNA from your parents; it is called ‘hereditary material’ (information that is passed on to the next generation). Nobody else in the world will have the same DNA as you, unless you have an identical twin [156].

Furthermore, RNA stands for ribonucleic acid. It is one of the major biological macromolecules that is essential for all known forms of life. The nucleotides that comprise RNA include adenine (A), guanine (G), cytosine (C), and uracil (U). It performs various important biological roles related to protein synthesis, such as transcription, decoding, regulation, and expression of genes [157].


It is also very important to distinguish between DNA and RNA. Structurally, RNA is quite similar to DNA. However, whereas DNA molecules are typically long and double-stranded, RNA molecules are much shorter and typically single-stranded. RNA molecules perform a variety of roles in the cell but are mainly involved in the process of protein synthesis (translation) and its regulation. Four different nitrogenous bases are found in DNA and in RNA; however, there is one difference: the bases found in DNA are guanine, cytosine, adenine, and thymine, while the bases found in RNA are guanine, cytosine, adenine, and uracil. Remember that DNA and RNA differ slightly at the nucleotide level. Therefore, the process of transcribing DNA into RNA not only changes the information from a double-stranded into a single-stranded molecule, but also changes all the thymine bases into uracil ones [158].
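The thymine-to-uracil substitution described above is simple enough to state as code. This short helper is only an illustration of the rule as stated in the text (a literal T-to-U replacement on a base string), not a tool used in the thesis:

```python
def transcribe(dna):
    """Return the RNA reading of a DNA base sequence by replacing every
    thymine (T) with uracil (U); the other bases (A, G, C) are unchanged."""
    allowed = set("AGCT")
    if not set(dna) <= allowed:
        raise ValueError("sequence may contain only the bases A, G, C, T")
    return dna.replace("T", "U")
```

For example, `transcribe("GATTACA")` returns `"GAUUACA"`: the two thymines become uracils while the purine and other pyrimidine bases pass through untouched.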


References

1. C. Langlotz, X. Wang, Y. Zhang, X. Ren, Yuhao Zhang, M. Zitnik, J. Shang, J. Han. "Cross-type Biomedical Named Entity Recognition with Deep Multi-Task Learning". Technical Report. School of Medicine, Stanford University, Stanford, CA 94305, USA.

2. Smith, L., Tanabe, L. K., nee Ando, R. J., Kuo, C.-J., Chung, I.-F., Hsu, C.-N., Lin, Y.- S., Klinger, R., Friedrich, C. M., Ganchev, K., et al. (2008). "Overview of biocreative ii gene mention recognition". Genome Biology, 9 (S2), S2.

3. Cokol, M., Iossifov, I., Weinreb, C., and Rzhetsky, A. (2005). Emergent behavior of growing knowledge about molecular interactions. Nature biotechnology, 23 (10), 1243– 1247.

4. Morris, J. H.,Szklarczyk, D., Cook, H., Kuhn, M., Wyder, S., Simonovic, M., Santos, A., Doncheva, N. T., Roth, A., Bork, P., et al. (2017). "The string database in 2017: quality- controlled protein–protein association networks, made broadly accessible". Nucleic acids research, 45 (D1), D362–D368.

5. Wei, C.-H., Kao, H.-Y., and Lu, Z. (2013). Pubtator: a web-based text mining tool for assisting biocuration. Nucleic acids research, 41 (W1), W518–W522.

6. Xie, B., Ding, Q., Han, H., and Wu, D. (2013). mircancer: a microrna–cancer association database constructed by text mining on literature. Bioinformatics, 29 (5), 638–644.

7. Jenny Finkel, Burr Settles, GuoDong Zhou, Yu Song, Gary Geunbae Lee, Shaojun Zhao, “Report on Bio-Entity Recognition Task at BioNLP/NLPBA 2004”, last modification made on 14 October 2004 by Jin-Dong Kim.

8. CLEF eHealth 2016 Task 1: Handover Information Extraction; Available from: https://sites.google.com/site/clefehealth2016/. (Accessed 3rd July 2018).

9. Hanna Suominen, Liyuan Zhou, Leif Hanlen, Gabriela Ferraro, “Benchmarking Clinical Speech Recognition and Information Extraction: New Data, Methods, and Evaluations”, Journal of Machine Learning Research Number, 2014; Submitted 06/14; Published 2014.

10. John Lafferty, Andrew McCallum, and Fernando C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data”. Published in Proceedings of the 18th International Conference on Machine Learning 2001 (ICML 2001), pages 282-289, Carnegie Mellon University, Pittsburgh, PA 15213 USA, 2001.

11. K. Tomanek, J. Wermter, and U. Hahn. “A reappraisal of sentence and token splitting for life sciences documents”. In Medinfo 2007: Proceedings of the 12th World Congress on Health (Medical) Informatics; Building Sustainable Health Systems, page 524. IOS Press, 2007.

12. Stanford core NLP; Available from: http://nlp.stanford.edu/software/corenlp.shtml/ (Accessed 10th September 2014).

13. Metamap; Available from: http://metamap.nlm.nih.gov/ (accessed 10th September 2014).

14. CSIRO Ontoserver; Available from: http://ontoserver.csiro.au:8080/ (accessed 10th September 2014).

15. Steyerberg EW. “Clinical prediction models: a practical approach to development, validation, and updating”. New York: Springer; 2009.

16. Kuhn M, Johnson K. “Applied predictive modeling”. New York: Springer; 2013.

17. Axelrod RC, Vogel D. “Predictive modeling in health plans”. Dis Manag Health Outcomes. 2003;11(12):779–87.

18. Asadi H, Dowling R, Yan B, Mitchell P. Machine learning for outcome prediction of acute ischemic stroke post intra-arterial therapy. PLoS One. 2014;9(2): e88225.

19. Y. LeCun, Y. Bengio. “Scaling learning algorithms towards AI”. Published In: "Large- Scale Kernel Machines", L. Bottou, O. Chapelle, D. DeCoste, J. Weston (eds) MIT Press, 2007. The Courant Institute of Mathematical Sciences, New York University, New York, USA.

20. A. Courville, Y. Bengio, P. Vincent. “Representation learning: A review and new perspectives”. Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1789–1828, August 2013. doi:10.1109/TPAMI.2013.50.

21. I. Arel, D. C. Rose, T. P. Karnowski (2010). “Deep machine learning-a new frontier in artificial intelligence research”. Published in: IEEE Computational Intelligence Magazine [research frontier]. Vol. 5, pp 13-18. The University of Tennessee, USA.

22. G.E. Hinton, S. Osindero, Teh Y-W (2006). “A fast learning algorithm for deep belief nets”. Neural Comput 18(7):1527– 1554.

23. Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle. “Greedy layer-wise training of deep networks”. Vol. 19 (2007).

24. H. Larochelle, J. Louradour, P. Lamblin, Y. Bengio (2009). “Exploring strategies for training deep neural networks”. J Mach, Learn Res 10:1–40.

25. Salakhutdinov R, Hinton GE (2009). “Deep Boltzmann machines”. In: International Conference on Artificial Intelligence and Statistics. JMLR.org. pp 448–455.


26. Goodfellow I, Lee H, Le QV, Saxe A, Ng AY (2009). “Measuring invariances in deep networks”. Published In: Advances in Neural Information Processing Systems. Curran Associates, Inc. pp 646–654.

27. N. Dalal, B. Triggs. “Histograms of oriented gradients for human detection”. Published in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, California, USA, IEEE Vol. 1. pp. 886–893 (2005).

28. D. G. Lowe. “Object recognition from local scale-invariant features”. Published in 1999 Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, IEEE Vol. 2. pp 1150–1157.

29. Hinton G, Deng L, Yu D, Mohamed A-R, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath T, Dahl G, Kingsbury B (2012). “Deep neural networks for acoustic modelling in speech recognition: The shared views of four research groups”. Signal Process Mag IEEE 29(6):82–97.

30. Dahl GE, Yu D, Deng L, Acero A (2012). “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition”. Audio Speech Lang Process IEEE Trans 20 (1):30–42.

31. Hanna Suominen, Liyuan Zhou, Leif Hanlen, Gabriela Ferraro, “Benchmarking Clinical Speech Recognition and Information Extraction: New Data, Methods, and Evaluations”, Journal of Machine Learning Research Number (2014) pp-pp Submitted 06/14; Published MM/14.

32. Friedman, C., G. Hripcsak, W. DuMouchel, S. B. Johnson, and P. D. Clayton. “Natural language processing in an operational clinical information system”. Natural Language Engineering 1, 83-108, 1995.

33. A. L. Tarca, V. J. Carey, X-wen. Chen, et al. "Machine learning and its applications to biology". Published in: PLoS Computational Biology, 2007, vol. 3, pg. e116.

34. M. Garrouste-Orgeas, F. Philippart, C. Bruel, A. Max, N. Lau, and B. Misset. “Overview of medical errors and adverse events”. Published in: Ann Intensive Care. 2012 Feb 16;2(1):2. doi: 10.1186/2110-5820-2-2.

35. S. N. Weingart; R. Wilson; R. W. Gibberd; B. Harrison (2000). "Epidemiology of medical error". British Medical Journal (BMJ) Publishing Group. (Accessed 2nd February 2018)

36. R. A. Friedman. "CASES; Do Spelling and Penmanship Count? In Medicine, You Bet". The New York Times. (Accessed 29 August 2018).


37. P. A. Palmieri; P. R. DeLucia; L. T. Peterson; T. E. Ott; A. Green, (2008). "The anatomy and physiology of error in adverse healthcare events". Published in: Advances in Health Care Management. vol.7, pp. 33–68. (Accessed 29 August 2018).

38. “Difference Between side effects and adverse effects". Published in: DifferenceBetween.net. From: http://www.differencebetween.net/science/health/disease- health/difference-between-side-effects-and-adverse-effects/ (Accessed 8th August 2018).

39. Linda T. Kohn; Janet M. Corrigan; M. S. Donaldson. "To err is human: building a safer health system". Institute of Medicine. 2000; Committee on Quality of Health Care in America, Washington, DC: The National Academies Press.

40. J. A. Martin; BL. Smith; T. J. Mathews; S. J. Ventura. “Births and deaths: preliminary data for 1998”. Centers for Disease Control and Prevention, National Center for Health Statistics. National Vital Statistics Reports. 1999; 47 (25): 6.

41. T. A. Brennan; L. L. Leape; N. M. Laird. "Incidence of adverse events and negligence in hospitalized patients". Results of the Harvard Medical Practice Study I. The New England Journal of Medicine. 1991;324(6):370–376.

42. Leape LL, Brennan TA, Laird N, et al. "The nature of adverse events in hospitalized patients”. Results of the Harvard Medical Practice Study II. The New England Journal of Medicine. 1991;324(6):377–384.

43. DP. Phillips; N. Christenfeld; LM. Glynn. "Increase in US medication-error deaths between 1983 and 1993". Lancet. 1998;351(9103):643–644.

44. FR. Ernst; AJ. Grizzle. "Drug-related morbidity and mortality: updating the cost-of-illness model". J Am Pharm Assoc (Wash) 2001;41(2):192–199.

45. BJ. Kopp; BL. Erstad; ME. Allen; AA. Theodorou; G. Priestley. “Medication errors and adverse drug events in an intensive care unit: direct observation approach for detection”. Crit Care Med. 2006; 34:415–425. doi: 10.1097/01.CCM.0000198106.54306.D7.

46. AD. Calabrese; BL. Erstad; K. Brandl; JF. Barletta; SL. Kane; DS. Sherman. “Medication administration errors in adult patients in the ICU”. Intensive Care Med. 2001;27:1592– 1598. doi: 10.1007/s001340101065.

47. A. Valentin; M. Capuzzo; B. Guidet; R. Moreno; B. Metnitz; P. Bauer; P. Metnitz. “Errors in administration of parenteral drugs in intensive care units: multinational prospective study”. BMJ. 2009;338:b814. doi: 10.1136/bmj.b814.

48. Ridley SA, Booth SA, Thompson CM. “Prescription errors in UK critical care units”. Anaesthesia. 2004;59:1193–1200. doi: 10.1111/j.1365-2044.2004.03969.x.


49. Garrouste-Orgeas M, Timsit JF, Vesin A, Schwebel C, Arnodo P, Lefrant JY, Souweine B, Tabah A, Charpentier J, Gontier O. et al. “Selected medical errors in the intensive care unit: results of the IATROREF study: parts I and II on behalf of the Outcome area study group”. Am J Respir Crit Care Med. 2010;181:134–142. doi: 10.1164/rccm.200812- 1820OC.

50. Garrouste-Orgeas M, Soufir L, Tabah A, Schwebel C, Vesin A, Adrie C, Thuong M, Timsit JF. “A multifaceted program for improving quality of care in ICUs”. (IATROREF STUDY) on behalf of the Outcome area study group. Critical Care Med. in press.

51. E. M. Borycki, and A.W. Kushniruk. “Where do technology induced errors come from? Towards a model for conceptualizing and diagnosing errors caused by technology. In Human, social, and organizational aspects of health information systems”. London, Information science reference. 2008.

52. Y. Han, J. Carcillo, S. Venkatraman, R. Clark, R. Watson, T. Nguyen, H. Bayir & R. Orr. “Unexpected increased mortality after implementation of a commercially sold computerized physician order entry system”. Pediatrics, 116:1506-1512, 2005.

53. E. Coiera. “Lessons from the NHS National Programme for IT”. Med J Aust., 186:1, 3-4, 2007.

54. Biomedical natural language processing Tools and resources. Retrieved from: http://bio.nlplab.org/. (Accessed 9 July 2018).

55. B. Tang, H. Cao, X. Wang, Q. Chen, H. Xu. “Evaluating word representation features in biomedical named entity recognition tasks”. BioMed Res. Int. (2014).

56. P. A. Domingos. “A few useful things to know about machine learning”. Communications ACM, 2012, 55 (10).

57. M. Steinbach; P-N. Tan; V. Kumar. “Introduction to Data Mining”. First Edition, Addison-Wesley Longman Publishing Co, Inc, Boston, MA, 2005.

58. Difference Between Classification and Regression in Machine Learning. Retrieved from: https://machinelearningmastery.com/classification-versus-regression-in-machine- learning/. (Accessed 3rd March 2018).

59. The Apache Software Foundation. “OpenNLP: Welcome to Apache OpenNLP”. Licensed under the Apache License, Version 2.0. Retrieved from: http://opennlp.apache.org/. (Accessed 9th June 2018).

60. A. K. McCallum. "MALLET: A Machine Learning for Language Toolkit", Computer Science Department University of Massachusetts Amherst, MA 01003, USA.2002.


61. sklearn-crfsuite. Retrieved from: https://sklearn-crfsuite.readthedocs.io/en/latest/. (Accessed 12th March 2018).

62. Scikit-learn: Machine Learning in Python. Retrieved from: https://scikit-learn.org/stable/. (Accessed 12th March 2018).

63. Keras: The Python Deep Learning library. Available from: https://keras.io/ . (Accessed 13th March 2018).

64. Tools for Optimizing Hyperparameters. Retrieved from: https://towardsdatascience.com/understanding-hyperparameters-optimization-in-deep- learning-models-concepts-and-tools-357002a3338a (Accessed 17th January 2019).

65. E. F. Tjong Kim Sang. “Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition”. Retrieved from: http://www.aclweb.org/anthology/W02-2024. (Accessed 4th March 2018).

66. Han, J. and Kamber, M. “Data Mining: Concepts and Techniques”. 3rd ed. 2011, San Francisco: Morgan Kaufmann.

67. CRFsuite. Available from: http://www.chokkan.org/software/crfsuite/. (Accessed 4th March 2018).

68. Fei Sha and Fernando Pereira. “Shallow parsing with conditional random fields”. NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. 134-141. 2003.

69. Charles Sutton and Andrew McCallum. “An Introduction to Conditional Random Fields”. Foundations and Trends in Machine Learning, vol. 4, no. 4, pp. 267–373, 2012.

70. Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, David McClosky, “The Stanford CoreNLP Natural Language Processing Toolkit”, In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60, 2014.

71. Sujan Kumar Saha, Sudeshna Sarkar, Pabitra Mitra, “Feature selection techniques for maximum entropy based biomedical named entity recognition”, Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, West Bengal 721302, India.

72. Suominen, H., Salanterä, S., Velupillai, S., Chapman, W. W., Savova, G. K., Elhadad, N., Pradhan, S., South, B. R., Mowery, D., Jones, G. J. F., Leveling, J., Kelly, L., Goeuriot, L., Martínez, D., and Zuccon, G. “Overview of the ShARe/CLEF eHealth evaluation lab 2013”. In Forner, P., Müller, H., Paredes, R., Rosso, P., and Stein, B., editors, CLEF, volume 8138 of Lecture Notes in Computer Science, pages 212–231. Springer, 2013.


73. Gann Bierner, Jason Baldridge, Joern Kottmann, Thomas Morton, “The OpenNLP Maximum Entropy Package”, Last update 2013-04-11.

74. Alexa T. McCray, Anita Burgun, Oliver Bodenreider, “Aggregating UMLS semantic types for reducing conceptual complexity”, MEDINFO 2001, volume 84, pages 216–220, National Library of Medicine, Bethesda, Maryland USA.

75. Olivier Bodenreider, Alexa T. McCray (2003), “Exploring semantic groups through visual approaches”, Journal of Biomedical Informatics 36 (2003) 414–432. Department of Health and Human Services, National Institutes of Health, National Library of Medicine, Lister Hill National Centre for Biomedical Communications, MS 43, Bldg 38A Rm B1N28U, 8600 Rockville Pike, Bethesda, MD 20894, USA.

76. W. Liao; S. Weeramachaneni. “A Simple Semi-Supervised Algorithm for Named Entity Recognition”. Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing Pages 58-65. Stroudsburg, PA, USA, 2009.

77. Adam L. Berger, Stephen A. Della Pietra, Vincent J. Della Pietra, “A Maximum Entropy Approach to Natural Language Processing”. Journal, Computational Linguistics archive, Volume 22 Issue 1, March 1996, Pages 39-71, MIT Press Cambridge, MA, USA.

78. Adwait Ratnaparkhi. “A Simple Introduction to Maximum Entropy Models for Natural Language Processing”. Philadelphia, PA 19104-6228, USA. May 1998.

79. Rong Zhang, Alexander I. Rudnicky, “A new data selection approach for semi-supervised Acoustic Modelling”. Language Technologies Institute, School of Computer Science Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213. USA.

80. Andrew McCallum, “Efficiently Inducing Features of Conditional Random Fields”. Proceedings of the 19th Conference in Uncertainty in Artificial Intelligence (UAI-2003).

81. Collier N, Nobata C, Tsujii J. 2000. “Extracting the names of genes and gene products with a hidden Markov model”. In Proceedings of COLING 2000: p. 201–7.

82. Shen D, Zhang J, Zhou GD, Su J, Tan CL. Effective adaptation of a hidden markovmodel- based named entity recognizer for biomedical domain. In: Proceedings of ACL 2003 workshop on natural language processing in biomedicine; 2003: p. 49–56.

83. N. Ponomareva, F. Pla, A. Molina, P. Rosso, “Biomedical named entity recognition: a poor knowledge HMM-based approach”, LNCS, 4592 (2007), pp. 382–387.

84. Gideon S. Mann, Andrew McCallum. “Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data”. Department of Computer Science, 140 Governors Drive, University of Massachusetts Amherst, 01003; Journal of Machine Learning Research 11 (2010) 955-984, USA.


85. Terry Koo, Amir Globerson, Xavier Carreras, and Michael Collins. “Structured prediction models via the matrix-tree theorem”. In Proceedings of EMNLP-CoNLL, pages 141–150, June 2007.

86. U. Leser and J. Hakenberg, “What makes a gene name? Named entity recognition in the biomedical literature,” Briefings in Bioinformatics, vol. 6, no. 4, pp. 357–369, 2005.

87. Kim, J. D., Ohta, T., Tsuruoka, Y., Tateisi, Y., “Introduction to the bio-entity recognition task at JNLPBA”. In Proceedings of the Int. Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA 2004), pp. 70–75, Stroudsburg, PA, USA, 2004.

88. L. Smith, L. K. Tanabe, R. Ando et al. “Overview of BioCreative II gene mention recognition”. Genome Biology, vol. 9, 2, article S2, 2008.

89. R. Gaizauskas, G. Demetriou, and K. Humphreys, “Term Recognition and Classification in Biological Science Journal Articles”. In Proceedings of the Computational Terminology for Medical and Biological Applications Workshop of the 2nd International Conference on NLP, pp. 37–44, 2000.

90. K. Fukuda, A. Tamura, T. Tsunoda, and T. Takagi, “Toward information extraction: identifying protein names from biological papers”. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, pp. 707–718, 1998.

91. D. Proux, F. Rechenmann, L. Julliard, V. V. Pillet, and B. Jacq, “Detecting gene symbols and names in biological texts: a first step toward pertinent information extraction”. Genome Informatics Workshop. Genome Informatics, vol. 9, pp. 72–80, 1998.

92. C. Nobata, N. Collier, and J. Tsujii. “Automatic term identification and classification in biology texts”. In Proceedings of the 5th NLPRS, pp. 369–374, 1999.

93. T. Hofmann, “Probabilistic latent semantic analysis”. In Proceedings of the Uncertainty in Artificial Intelligence (UAI ’99), pp. 289–296, 1999.

94. J. Kristoferson, P. Kanerva, and A. Holst, “Random indexing of text samples for latent semantic analysis”. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, pp. 103–106, 2000.

95. V. J. D. Pietra, P. F. Brown, P. V. deSouza, R. L. Mercer, and J. C. Lai, “Class-based n-gram models of natural language”. Computational Linguistics, vol. 18, pp. 467–479, 1992.

96. P. Vincent, R. Ducharme, C. Jauvin, and Y. Bengio. “A neural probabilistic language model”. Journal of Machine Learning Research, vol. 3, no. 6, pp. 1137–1155, 2003.

97. L. Ratinov, Y. Bengio, and J. Turian. “Word representations: a simple and general method for semi-supervised learning”. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL ’10), pp. 384–394, Stroudsburg, Pa, USA, July 2010.

98. M. Jiang, H. Cao, Y. Wu, B. Tang, and H. Xu, “Clinical entity recognition using structural support vector machines with rich features". In Proceedings of the ACM 6th International Workshop on Data and Text Mining in Biomedical Informatics, pp. 13–20, New York, NY, USA, 2012.

99. H. Xu, M. Jiang, H. Cao, Y. Wu, and B. Tang, “Recognizing clinical entities in hospital discharge summaries using Structural Support Vector Machines with word representation features”. BMC Medical Informatics and Decision Making, vol. 13, no. supplement 1, p. S1, 2013.

100. J. C. Denny, Y. Chen, B. Tang, Y. Wu, H. Xu, and M. Jiang, “A hybrid system for temporal information extraction from clinical text”. Journal of the American Medical Informatics Association, 2013.

101. Ravish Chawla. "Overview of Conditional Random Fields". Machine Learning Graduate from Georgia Tech. August 8, 2017. Retrieved from: https://medium.com/ml2vec/overview-of-conditional-random-fields-68a2a20fa541. (Accessed 5th March 2018).

102. Alexander Crosson. “Summarize Documents using Tf-Idf”. Retrieved from: https://medium.com/@acrosson/summarize-documents-using-tf-idf-bdee8f60b71 (Accessed 6th March 2018).

103. Yohei Seki, "Sentence Extraction by tf-idf and Position Weighting from Newspaper Articles". National Institute of Informatics. Proceedings of the Third NTCIR Workshop. Dept. of Informatics, The Graduate University for Advanced Studies (Sokendai) 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan.

104. An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec. Retrieved from: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings- count-word2veec/ . (Accessed 5th February 2018).

105. Web search engine: Retrieved from: https://en.wikipedia.org/wiki/Web_search_engine, (Accessed 8th March 2018).

106. Relevance (information retrieval): (Accessed 9th March 2018) Retrieved from: https://en.wikipedia.org/wiki/Relevance (information_retrieval).

107. Stop words: Retrieved from: https://en.wikipedia.org/wiki/Stop_words, (Accessed 10th March 2018).

108. Automatic summarization: (Accessed 11th March 2018). Retrieved from: https://en.wikipedia.org/wiki/Automatic_summarization.

109. Word2Vec. Retrieved from: https://deeplearning4j.org/word2vec. (Accessed 12th March 2018).

110. Ando, R. K. (2007). Biocreative ii gene mention tagging system at IBM Watson. In Proceedings of the Second BioCreative Challenge Evaluation Workshop, volume 23, pages 101– 103.

111. Leaman, R. and Lu, Z. (2016). “TaggerOne: joint named entity recognition and normalization with semi-Markov models”. Bioinformatics, 32 (18), 2839–2846.

112. Zhou, G. and Su, J. (2004). “Exploring deep knowledge resources in biomedical name recognition”. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pages 96–99.

113. Lu, Y., Ji, D., Yao, X., Wei, X., and Liang, X. (2015). “Chemdner system with mixed conditional random fields and multi-scale word clustering”. Journal of cheminformatics, 7 (S1), S4.

114. U. Leser, and J. Hakenberg. (2005). “What makes a gene name? named entity recognition in the biomedical literature”. Briefings in bioinformatics, 6 (4), 357–369.

115. C-C. Huang, and Z. Lu (2015). Community challenges in biomedical text mining over 10 years: success, failure, and the future. Briefings in bioinformatics, 17 (1), 132–144.

116. Deep Neural Networks, 2017. Retrieved from deep learning: https://deeplearning4j.org/neuralnet-overview.html. (accessed 10th May 2017).

117. Recurrent Networks and LSTMs. Available from: https://deeplearning4j.org/lstm.html. (Accessed 11th May 2017).

118. LeCun Y, Bengio Y, Hinton G. “Deep learning”. Nature 2015;521(7553): 436–444.

119. Rectified linear unit. (Accessed 16th October 2017). Retrieved from: https://en.wikipedia.org/wiki/Rectifier_(neural_networks).

120. The hyperbolic tangent function. Available from: https://www.google.com.au/search?client=firefoxb&dcr=0&q=What+is+the+hyperbolic+tangent%3F&sa=X&ved=0ahUKEwi8gvqTlPXXAhXMyLwKHdeVAMsQzmcICA&biw=797&bih=603. (Accessed 12th November 2017).

121. B. Denny (2015). “Understanding Convolutional Neural Networks for Natural Language Processing (NLP)”. http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/. (Accessed 15th September 2017).

122. Deep Learning, NLP, and Representations. (Accessed 12th July 2017). Retrieved from: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/.


123. Restricted Boltzmann Machines. (Accessed 5th June 2017). Retrieved from: https://deeplearning4j.org/restrictedboltzmannmachine.

124. Convolutional Neural Networks. (Accessed 6th June 2017). Retrieved from: https://deeplearning4j.org/convolutionalnets.html.

125. H. Schwenk et al. “Learning phrase representations using RNN encoder–decoder for statistical machine translation”. arXiv preprint arXiv:1406.1078, 2014.

126. Neural Networks with Regression. Retrieved from: https://deeplearning4j.org/linear- regression.html. (Accessed 7th June 2017).

127. M. Mahmud, M.S. Kaiser, A. Hussain, S. Vassanelli. “Applications of Deep Learning and Reinforcement Learning to Biological Data”. IEEE Transactions on Neural Networks and Learning Systems. 2018, doi: 10.1109/TNNLS.2018.2790388.

128. Tom Young, Devamanyu Hazarika, Soujanya Poria, Erik Cambria. "Recent Trends in Deep Learning Based Natural Language Processing". School of Computing, National University of Singapore, Singapore.

129. Yongqin Xian, Christoph H. Lampert, Bernt Schiele, Zeynep Akata. "Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly". arXiv:1707.00600v3 [cs.CV], 9 Aug 2018.

130. Deng L., Liu Y. "A Joint Introduction to Natural Language Processing and to Deep Learning". 24 May 2018.

131. Deep Learning in Natural Language Processing. Retrieved from: https://pdfs.semanticscholar.org/9ec2/0b90593695e0f5a343dade71eace4a5145de.pdf. (Accessed 5th September 2018).

132. Diederik P. Kingma, Jimmy Lei Ba, “Adam: A Method for Stochastic Optimization”. University of Toronto. Published as a conference paper at ICLR 2015.

133. Bottou L, 1991. "Stochastic gradient learning in neural networks". In Proceedings of Neuro-Nîmes 1991;91(8).

134. Stochastic gradient descent. (Accessed 17th November 2017). Retrieved from: https://en.wikipedia.org/wiki/Stochastic_gradient_descent.

135. G. Hinton, N. Srivastava, K. Swersky, "RMSprop Optimization Algorithm for Gradient Descent with Neural Networks". University of Toronto, Canada, 2012.

136. Bergstra, J., Bengio, Y.: “Random search for hyper-parameter optimization”. Journal of Machine Learning Research 13(Feb), 281-305 (2012).


137. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: International Conference on Learning and Intelligent Optimization. pp. 507-523. Springer (2011)

138. Klein, A., Falkner, S., Bartels, S., Hennig, P., Hutter, F.: “Fast Bayesian optimization of machine learning hyperparameters on large datasets”. In: Proc. of AISTATS (2017).

139. Snoek, J., Larochelle, H., Adams, R.P.: “Practical Bayesian optimization of machine learning algorithms”. In: Advances in neural information processing systems 25. pp. 2951-2959 (2012).

140. Loshchilov, I., Hutter, F.: “CMA-ES for hyperparameter optimization of deep neural networks”. ArXiv [cs.NE] 1604.07269v1, 8 pages (2016).

141. Van Rijn, J.N., Abdulrahman, S.M., Brazdil, P., Vanschoren, J.: “Fast Algorithm Selection using Learning Curves”. In: Advances in Intelligent Data Analysis XIV. pp. 298-309. Springer (2015).

142. Soares, C., Brazdil, P.B., Kuba, P.: “A meta-learning method to select the kernel width in support vector regression”. Machine learning 54(3), 195-209 (2004).

143. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: “Hyperband: Bandit- Based Configuration Evaluation for Hyperparameter Optimization”. In: Proc. of ICLR (2017).

144. Phan R, Luu TM, Davey R, Chetty G. Deep Learning Based Biomedical NER Framework. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI) 2018 Nov 18 (pp. 33-40). IEEE.

145. Jesus Rodriguez (2018). "Understanding Hyperparameters Optimization in Deep Learning Models: Concepts and Tools". Retrieved from https://towardsdatascience.com/understanding-hyperparameters-optimization-in-deep- learning-models-concepts-and-tools-357002a3338a. (Accessed 16th January 2019).

146. I. Sutskever; et al. "Introduction to OpenAI", December 11, 2015. Retrieved from: https://blog.openai.com/introducing-openai/ (Accessed 16th February 2019).

147. T. Hinz, N. Navarro-Guerrero, S. Magg, S. Wermter. “Speeding up the hyperparameter optimization of deep convolutional neural networks”. International Journal of Computational Intelligence and Applications, Vol. 17, No. 2 (2018), 1850008 (15 pages). Published 18 June 2018.

148. I. Ilievski, T. Akhtar, J. Feng, C-A. Shoemaker. "Efficient Hyperparameter Optimization of Deep Learning Algorithms Using Deterministic RBF Surrogates". In proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (2017).


149. Lars Hertel, Julian Collado, Peter Sadowski, Pierre Baldi. "Sherpa: Hyperparameter Optimization for Machine Learning Models". In: 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

150. J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl. "Algorithms for Hyper-Parameter Optimization". In: Advances in Neural Information Processing Systems, 2011.

151. Hanna Suominen, L. Zhou, L. Goeuriot, and L. Kelly. "Task 1 of the CLEF eHealth Evaluation Lab 2016: Handover Information Extraction". The Australian National University (ANU), University of Canberra (UC), Canberra, ACT, Australia.

152. AWS EC2 Dashboard Resources. (Accessed 19th August 2019). Retrieved from: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Home:.

153. Flask Web App with Python. (Accessed 12th April 2019). Retrieved from: https://pythonspot.com/flask-web-app-with-python/.

154. AWS Web Services (AWS). (Accessed 20th August 2019). Retrieved from: https://searchaws.techtarget.com/definition/Amazon-Web-Services.

155. Y. Sasaki (2007). “The Truth of the F-measure”. Research Fellow. School of Computer Science, University of Manchester. MIB, 131 Princess Street, Manchester, M1 7DN.

156. Z. Gamble (2014). "Science made simple". Retrieved from: http://www.sciencemadesimple.co.uk/curriculum-blogs/biology-blogs/what-is-dna. (Accessed 30th April 2020)

157. B. Cuffari. "What is RNA?". Research fellow. News Medical Life Science. Retrieved from: https://www.news-medical.net/life-sciences/What-is-RNA.aspx. (Accessed 1st May 2020).

158. Biology 101: Intro to Biology. "Differences Between RNA and DNA & Types of RNA (mRNA, tRNA & rRNA)". Technical report. Retrieved from: https://study.com/academy/lesson/differences-between-rna-and-dna-types-of-rna-mrna- trna-rrna.html. (Accessed 2nd May 2020).

159. R. Jin, R. Yan, J. Zhang. “A Faster Iterative Scaling Algorithm for Conditional Exponential Model”. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, USA, 2003.
