Advancing Unsupervised and Weakly Supervised Learning With
Total Page:16
File Type:pdf, Size:1020Kb
Faculty of Science and Technology Department of Mathematics and Statistics Advancing Unsupervised and Weakly Supervised Learning with Emphasis on Data-Driven Healthcare — Karl Øyvind Mikalsen A dissertation for the degree of Philosophiae Doctor – November 2018 Abstract In healthcare, vast amounts of data are stored digitally in the electronic health records (EHRs). EHRs contain patient-specific data in the form of unstructured free text notes as well as structured lab tests, diagnosis codes, etc., and represent a largely untapped source of clinically relevant information, which combined with advances in machine learning, have the potential to transform healthcare into a more data-driven direction. Due to the complexity and poor quality of the EHRs, data-driven health- care is facing many challenges. In this thesis, we address the challenge posed by lack of ground-truth labels and provide methodological solutions to challenges related with missing data, temporality, and high dimensional- ity. Towards that end, we present four lines of work where we develop novel unsupervised and weakly supervised learning methodology. The first work presents a novel kernel for a type of data that frequently occur in the EHRs, namely multivariate time series with missing values. Key com- ponents in the method are clustering and ensemble learning, which ensure robustness to hyper-parameters and make the kernel well-suited as a com- ponent in unsupervised learning frameworks. Experiments on benchmark datasets demonstrate that the proposed kernel is robust to hyper-parameter choices and performs well in presence of missing data. Next, we present a novel dimensionality reduction method, which is de- signed to account for many of the challenges data-driven healthcare is fac- ing. One of them is high dimensionality, but in addition, the method is capable of exploiting noisy and partially labeled multi-label data, touching upon challenges related with lack of labels, domain complexity and noisy data. Extensive experiments on benchmark datasets, as well as a case study of patients suffering from chronic diseases, demonstrate the effectiveness of the proposed algorithm. A main motivation for the third work is to take advantage of the fact that missing values and patterns often contain rich information about the clinical outcome of interest. We present a multivariate time series kernel, capable of exploiting this information to learn useful representations of incompletely observed time series data. Moreover, we also propose a novel semi-supervised kernel, capable of taking advantage of incomplete label information. The ef- fectiveness of the proposed methods is demonstrated via experiments on benchmark data and a case study of patients suffering from infectious post- i operative complications. In the last work, we focus on another complication following major high- risk surgeries, namely postoperative delirium. It is a common complication among the elderly that often goes undetected, but might have serious con- sequences. We perform phenotyping using a weakly supervised learning framework, wherein clinical knowledge is used to generate a noisy labeled training set, which in turn is used to train classifiers. Experiments on a dataset collected from a Norwegian university hospital demonstrate the ef- ficiency of the framework. ii List of publications The thesis is based on the following original journal papers. I K. Ø. Mikalsen, F. M. Bianchi, C. Soguero-Ruiz and R. Jenssen, \Time series cluster kernel for learning similarities between multivariate time series with missing data", Pattern Recognition, Apr. 2018, Vol. 76, pp 569{581, doi: https://doi.org/10.1016/j.patcog.2017.11.030. II K. Ø. Mikalsen, C. Soguero-Ruiz, F. M. Bianchi and R. Jenssen, \Noisy multi-label semi-supervised dimensionality reduction", submitted to Pattern Recognition, Sept. 2018. III K. Ø. Mikalsen, C. Soguero-Ruiz, F. M. Bianchi, A. Revhaug and R. Jenssen, \Time series cluster kernels to exploit informative missingness and incomplete label information", submitted to Pattern Recognition, Nov. 2018. IV K. Ø. Mikalsen, C. Soguero-Ruiz, K. Jensen, K. Hindberg, M. Gran, A. Revhaug, R.-O. Lindsetmo, S. O. Skrøvseth, F. Godtliebsen and R. Jenssen, \Using anchors from free text in electronic health records to di- agnose postoperative delirium", Computer Methods and Programs in Biomedicine, 2017, Vol. 152, pp 105{114, doi: https://doi.org/10.1016/ j.cmpb.2017.09.014. Other papers The following journal papers, manuscripts under review and other peer- reviewed publications also contribute to this thesis, but are not included. 5. J. N. Myhre, K. Ø. Mikalsen, S. Løkse and R. Jenssen, \A robust clustering us- ing a kNN mode seeking ensemble", Pattern Recognition, Apr. 2018, Volume 76, pp 491{505. 6. K. Jensen, C. Soguero-Ruiz, K. Ø. Mikalsen, R. O. Lindsetmo, I. Kouskoumvekaki, M. Girolami, S. O. Skrovseth and K. M. Augestad, \Analysis of free text in electronic health records for identification of cancer patient trajectories", Scientific Reports, Apr. 2017, Volume 7. 7. K. Ø. Mikalsen, F. M. Bianchi, C. Soguero-Ruiz, R. Jenssen, \The time series cluster kernel", published in Proceedings of 2017 IEEE 27th International Work- shop on Machine Learning for Signal Processing (MLSP), Tokyo, Japan, Sep. 2017, pp. 1{6. 8. K. Ø. Mikalsen, C. Soguero-Ruiz, K. Jensen, K. Hindberg, M. Gran, A. Revhaug, R.-O. Lindsetmo, S. O. Skrøvseth, F. Godtliebsen and R. Jenssen, \Predicting postoperative delirium using anchors", poster presentation at NIPS 2015 Workshop on Machine Learning in Healthcare, Montreal, December 2015. iii 9. K. Ø. Mikalsen, F. M. Bianchi, C. Soguero-Ruiz, S. O. Skrøvseth, R.-O. Lind- setmo, A. Revhaug, R. Jenssen, \ Learning similarities between irregularly sampled short multivariate time series from EHRs", oral presentation at 3rd ICPR International Workshop on Pattern Recognition for Healthcare Analyt- ics, Cancun, Mexico, Dec. 2016, available at https://sites.google.com/site/ iwprha3/proceedings. 10. K. Ø. Mikalsen, C. Soguero-Ruiz, K. Jensen, K. Hindberg, M. Gran, A. Revhaug, R.-O. Lindsetmo, S. O. Skrøvseth, F. Godtliebsen and R. Jenssen, \Using anchors from free text to diagnose postoperative delirium from EHRs ", poster presentation at Regional helseforskningskonferanse 2016, Tromsø, Norway, Nov. 2016. 11. M. A. Hansen, K. Ø. Mikalsen, M. Kampffmeyer, C. Soguero-Ruiz and R. Jenssen, \Towards deep anchor learning", published in Proceedings of 2018 IEEE EMBS International Conference on Biomedical Health Informatics (BHI), Las Vegas, USA, Mar. 2018, pp 315{318. 12. A. Storvik Strauman, F. M. Bianchi, K. Ø. Mikalsen, M. Kampffmeyer, C. Soguero- Ruiz, R. Jenssen, \Classification of postoperative surgical site infections from blood measurements with missing data using recurrent neural net- works", published in Proceedings of 2018 IEEE EMBS International Conference on Biomedical Health Informatics (BHI), Las Vegas, USA, Mar. 2018, pp 307{310. 13. F. M. Bianchi, K. Ø. Mikalsen and R. Jenssen, \Learning compressed repre- sentations of blood samples time series with missing data ", published in Proceedings of 26th European Symposium on Artificial Neural Networks, Computa- tional Intelligence and Machine Learning (ESANN), Bruges, Belgium, Apr. 2018, 14. J. N. Myhre, K. Ø. Mikalsen, S. Løkse and R. Jenssen, \Consensus cluster- ing using kNN mode seeking", published in Proceedings of 19th Scandinavian Conference on Image Analysis, Copenhagen, Denmark, June 2015, pp 175{186. 15. J. N. Myhre, K. Ø. Mikalsen, S. Løkse and R. Jenssen, \Robust non-parametric mode clustering", published in NIPS workshop on Adaptive and Scalable Non- parametric Methods in Machine Learning, Barcelona, Spain, Dec 2016. https: //sites.google.com/site/nips2016adaptive/accepted-papers 16. K. Ø. Mikalsen, C. Soguero-Ruiz, I. Mora-Jim´enez,I. Caballero L´opez Fando, and R. Jenssen, \Using multi-anchors to identify patients suffering from multimorbidities", accepted for IEEE International Conference on Bioinformatics and Biomedicine (BIBM), BHI workshop, Madrid, Spain, Dec. 2018. 17. F. M. Bianchi, L. Livi, K. Ø. Mikalsen, M. Kampffmeyer and R. Jenssen, \Learning representations of multivariate time series with missing data using Tem- poral Kernelized Autoencoders", arXiv preprint arXiv:1805.03473, submitted to Pattern Recognition, Aug. 2018, https://arxiv.org/abs/1805.03473. 18. P. Kocbek, A. Stozer, C. Soguero-Ruiz, G. Stiglic, K. Ø. Mikalsen, N. Fijacko, P. Povalej Brzan, R. Jenssen, S. O. Skrøvseth, and U. Maver\, Maximizing in- terpretability and cost-effectiveness of Surgical Site Infection (SSI) pre- dictive models using feature-specific regularized logistic regression on preoperative temporal data", submitted to Computational and Mathematical Methods in Medicine, Sept. 2018. iv Acknowledgements I have been privileged to enjoy the support and encouragement of many people during the creation of this thesis. I am thankful to everyone who made this possible. First, I would like to express my sincere gratitude to my supervisor, Robert, for his guidance, support, constant encouragement, optimism and patience throughout this entire journey. I would also like to thank my co-supervisors Fred and Stein-Olav for their advice and helpful discussions. I would like to express my gratitude to everyone at the University Hospital of North-Norway and the Norwegian Centre for E-health Research that I have collaborated with. In particular, I would like to mention Arthur, Rolv-Ole, Knut-Magne, Kristian, Mads, Kasper, Anne-Torill and Stein-Olav. To Filippo,