Latent Probabilistic Topic Discovery for Text Documents Incorporating Segment Structure and Word Order

JAMEEL, Mohammad Shoaib

A Thesis Submitted in Partial Fulfilment

of the Requirements for the Degree of Doctor of Philosophy

in Systems Engineering and Engineering Management

The Chinese University of Hong Kong

July 2014

The Chinese University of Hong Kong holds the copyright of this thesis. Any person(s) intending to use a part or whole of the materials in the thesis in a proposed publication must seek copyright release from the Dean of the Graduate School.

  

Thesis/Assessment Committee

Professor CHENG, Chun Hung (Chair)
Professor LAM, Wai (Thesis Supervisor)
Professor MENG, Mei Ling Helen (Committee Member)
Professor Michael Chau (External Examiner)

Preface

This dissertation is submitted for the degree of Doctor of Philosophy at The Chinese University of Hong Kong. The research described herein was conducted under the supervision of Prof. LAM, Wai, between August 2009 and May 2014. This work is, to the best of my knowledge, original, except where acknowledgment and reference is made to previous work. Neither this, nor any substantially similar dissertation, has been or is being submitted for any degree, diploma or other qualification at any other university. Parts of this work have been published (and some are under review) in the following publications:

1. Lidong Bing, Wai Lam, Shoaib Jameel, and Chunliang Lu. “Website Community Mining from Query Logs with Two-Phase Clustering.” In Computational Linguistics and Intelligent Text Processing (CICLing), pp. 201–212. Springer Berlin Heidelberg, 2014.

2. Shoaib Jameel and Wai Lam. “A Nonparametric N-Gram Topic Model with Interpretable Latent Topics.” In Proceedings of the Ninth Asia Information Retrieval Societies Conference (AIRS), pp. 74–85. Springer Berlin Heidelberg, 2013.

3. Shoaib Jameel and Wai Lam. “An unsupervised topic segmentation model incorporating word order.” In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM-SIGIR), pp. 203–212. ACM, 2013.

4. Shoaib Jameel and Wai Lam. “An N-gram topic model for time-stamped documents.” In Proceedings of the 35th European Conference on Information Retrieval (ECIR), pp. 292–304. Springer Berlin Heidelberg, 2013.

5. Shoaib Jameel, Xiaojun Qian, and Wai Lam. “N-gram Fragment Sequence Based Unsupervised Domain-Specific Document Readability.” In Proceedings of the 24th International Conference on Computational Linguistics (COLING), pp. 1309–1326. ACL, 2012.

6. Shoaib Jameel, Wai Lam, Xiaojun Qian, and Ching-man Au Yeung. “An unsupervised technical difficulty ranking model based on conceptual terrain in the latent space.” In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries (ACM-JCDL), pp. 351–352. ACM, 2012.

7. Shoaib Jameel and Xiaojun Qian. “An Unsupervised Technical Readability Ranking Model by Building a Conceptual Terrain in LSI.” In Eighth International Conference on Semantics, Knowledge and Grids (SKG), pp. 39–46. IEEE, 2012.

8. Shoaib Jameel, Wai Lam, and Xiaojun Qian. “Ranking Text Documents Based on Conceptual Difficulty using Term Embedding and Sequential Discourse Cohesion.” In Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01 (WI-IAT), pp. 145–152. IEEE Computer Society, 2012.

9. Shoaib Jameel, Wai Lam, Ching-man Au Yeung, and Sheaujiun Chyan. “An unsupervised ranking method based on a technical difficulty terrain.” In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (ACM-CIKM), pp. 1989–1992. ACM, 2011.

10. Shoaib Jameel, Wai Lam, and Lidong Bing. “Nonparametric Topic Models with Word Order for Text Documents.” Journal paper, under review.

11. Shoaib Jameel, Wai Lam, and Lidong Bing. “Supervised Topic Models with Word Order Structure for Document Classification and Retrieval Learning.” Conference paper, under review.

12. Shoaib Jameel and Wai Lam. “N-gram Fragment Based Domain-Specific Readability Ranking Model for Text Documents.” Journal paper, under review.

Environment Friendly Thesis

I have found a common limitation in most of the theses that I have read so far: the authors of those theses do not make their documents environment friendly. What I mean by this is that one is forced to print the thesis for smooth reading. This is exemplified by the following scenario:

Imagine that you are reading a long thesis written by a person named Dr. Doolittle. Since you do not have a printer with you, or you want to save some paper by refraining from printing a long thesis, you try to read it on your computer. However, while reading the content and upon reaching some citation in the thesis, you are forced to check what research paper it is actually referring to. You then scroll down the thesis to look for the citation. You view that citation, and again try to locate the specific place in the document where you were reading previously. Repeatedly doing this results in a loss of interest, and thus forces you to make your task simpler. The better option, then, is to print the thesis. Some clever ones will only print the references section and read the text from the computer. But still something is printed and paper is lost.

I have made this thesis as environment friendly as I could. It means that I have worked extra hard to discourage people from printing it, considering that it is a long thesis. I have included special effects in this electronic version of the thesis, where hovering the mouse pointer over any citation pops up the full reference.

One can click on that pop-up text window and place it anywhere one likes in the text window. Remember to click the pop-up text window once and then release the mouse pointer; this will make the text window stick to the mouse pointer. The same is done for all abbreviations in this thesis. This will make the process of reading this thesis pleasant and smooth. Most importantly, paper is saved. One can argue about the heat generated by the computer monitor while reading this thesis and its effect on the environment. There are always some trade-offs in everything that one does.

JAMEEL, Mohammad Shoaib

Dedication

I would like to dedicate this small piece of work to my parents, and to all those who love and care for me. Without such wonderful people around me, it would not have been possible for me to learn many things, not only in science but also in general aspects of life, over the course of these five years and throughout my life.

I would also like to dedicate my “hard work” to all those innocent poverty-stricken people across the world who sleep with empty stomachs at night, hoping that the next day an agent from the beloved God will descend to Earth and bless them with food so that they never have to go back to sleep without it. I hope my hard work can help them in the future, and that I can one day become that agent from God.

Whatever you bestow in charity must go to parents and to kinsfolk, to the orphans and to the destitute and to the traveler in need. - The Glorious Quran, The Cow 2:215

Let him never turn away (a stranger) from his house, that is the rule. Therefore a man should by all means acquire much food, for (good) people say (to the stranger): There is food ready for him. - Taittiriyaka Upanishad, 3rd Valli, 10th Anuvaka:I

For I was hungry, and you gave me something to eat; I was thirsty, and you gave me something to drink; I was a stranger, and you took me in; I was sick and you took care of me, I was in prison and you visited me...I tell you whenever you do these things for the least important of these followers of mine, you did it for me. - Jesus: Matthew 25:35-40

The generous man is blessed. For he gives of his bread to the poor. - Old Testament, Proverbs 22:9

Acknowledgements

I would like to express my heartfelt thanks to my parents, the other members of my family, and my supervisor Prof. LAM, Wai, whose encouragement and support have made this work possible. Moreover, I would like to emphasize the human values, like simplicity and elegance, that I found in Prof. LAM, Wai, which have given me the inspiration to work in the Text Mining Group of the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong. Hence, I would like to add that this time has not only been scientifically beneficial for me, but I have also benefited as a person.

I can never forget my teachers who have taught me right from the very beginning. Miss Prema Subramaniam taught me to read and write, which is useful to me to this day. Thanks to my Mathematics teacher, Miss Oli Ganguly, for teaching me basic Mathematics, which I am still applying today and publishing papers out of. Thanks to my undergraduate research mentors Prof. Tejbanta Singh Chingtham, Professor at the Sikkim Manipal Institute of Technology, Sikkim, India, Dr. Marimuthu Muruganant (Cantab.), and Mr. Fredi B. Zarolia from Tata Steel Limited, Jamshedpur, India, whose love and support have helped me reach where I am today. Thanks to all my other teachers, both in my school and my undergraduate university. Special mention goes to Mr. Andrei Raevsky from New Smyrna Beach, Florida, for helping me proof-read my research and personal statements for applications to different universities for doctoral study. Special thanks to my thesis committee, which comprises Prof. Helen Meng and Prof. C.H. Cheng, whose inputs during the course of my doctoral study proved useful.

Due acknowledgment goes to Xiaojun Qian, doctoral student at the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, without whose valuable discussions and help in the early stages of my thesis work this thesis would not have been written. Thanks to Dr. Wei Gao, Scientist, Qatar Computing Research Institute, for his friendship and discussions. In addition, thanks to Ching-man Au Yeung for collaborating on some of my early works.

I would also like to thank the present and past members of my Text Mining Group, in particular Chen Bo and Lidong Bing, who played an instrumental role during the early stages of my research life. Also, thanks to Chunliang Lu for helping me solve some programming related problems. Members of the other research groups in my department also need to be thanked, in particular Li Binyang for proof-reading some of my research articles, and Merak Cheung, with whom I used to have research discussions. Thanks to Haiqin Yang from the Department of Computer Science and Engineering, The Chinese University of Hong Kong, for helping me during my initial research study. Thanks to Priyanka Garg from the Department of Computer Science and Engineering, The Chinese University of Hong Kong, with whom I used to have discussions about raw research ideas and how to pursue such ideas for future publications.

Writing an acknowledgment without mentioning the people whom one has met during conferences is unthinkable. I am thankful to Dmitry Ignatov from the Higher School of Economics, Moscow, Russia, for his help during my stay in the Russian Federation, and also for explaining to me the basic concepts of Formal Concept Analysis (FCA). Other folks from Russia need to be specially thanked in this thesis because of their wonderful support during my trip to Russia. Many thanks to Alexandra Kaminskaya, Yandex, Moscow, whose help before my conference trip to the Russian Federation proved very fruitful. I would like to specially thank Karina Gaynutdinova, student at the Higher School of Economics, Moscow, for her help

during my stay in Moscow. Without all these wonderful Russian people, the conference trip to Russia would have been a nightmare. In addition, thanks to people from Yandex, Moscow, in particular Eugene Kharitonov, Researcher at Yandex, for having some technical discussions about Yandex and its business model. Other people with whom I had productive technical discussions, and who are worth mentioning, are Prof. Stephen Robertson, currently a visiting Professor at University College London, Department of Computer Science, with whom I had discussions regarding the future of web search and how query log data could help simplify web search algorithms. Van Dang from the University of Massachusetts, Amherst (currently at Google), with whom I had discussions about learning-to-rank models. Prof. Mark Sanderson from the Royal Melbourne Institute of Technology, Australia, with whom I had some insightful research discussions related to text readability and its application to web search; this discussion happened even before I began my doctoral study. Henry Field from the University of Massachusetts, Amherst (currently Assistant Professor of Computer Science at Endicott College in Beverly, MA), with whom I had discussions about writing a research paper without using a lot of Mathematics, and how to get it accepted in conferences such as SIGIR. Elif Aktolga from the University of Massachusetts, Amherst (currently at Apple, Inc.), for teaching me how to find out whether a steak is properly cooked or not. Dell Zhang from Birkbeck College, University of London, who gave some useful comments about my research. Antonio Gulli, Research Scientist in Web Information Retrieval at Microsoft, for asking me (tough) questions consistently during my talks at a couple of conferences. Coincidentally, I would find him at all my talks whenever we were at the same conference. I would also like to thank others whom I met for a short while.

Many thanks to the Department of Systems Engineering and Engineering Management for funding my doctoral study. Thanks also to the Research Grants Council of the Hong Kong Special Administrative Region, China, the Microsoft-CUHK Joint Laboratory for Human-Centric Computing and Interface Technologies, Yandex, and the Association for Computing Machinery for funding my conference trips.

Abstract

Probabilistic topic models are a class of statistical techniques that bring out the latent representation in the data. These models are able to find new data correlations in a low-dimensional topic space. Unigram based probabilistic parametric topic models assume that the order of words in a document is not important. As a result, these models lose important structural information which is inherent in the document. This results in the generation of less interpretable topics with ambiguous words. In addition, such parametric topic models assume a pre-defined parameter space, i.e., they cannot grow with the data complexity. This is an important limitation to address, as in reality a user does not know the appropriate parameter space of the data a priori. Many topic models also do not consider the temporal aspect of the data together with word order, so they fail to capture how topics evolve over time in the data, as large datasets are usually collected over time, and topics rise and fall as time passes by. In addition, another limitation of unsupervised topic models for document classification is that they consider neither useful side information, for instance the class labels of documents, nor the word order structure in text documents that could interact with the class label information for solving the document classification task. Likewise, supervised topic models with word order structure have not been explored in document retrieval learning, where relevance assessments given by human annotators can be used as side information in the topic model itself.

This thesis presents novel structured topic models for text data which address the above shortcomings. The models proposed in this thesis maintain the word order structure of a text document. In addition, document structure such as paragraphs and sentences can be maintained. The prime motivation is that word order helps capture the semantic nature of text better than the unigram models, as it models the way humans write and read documents. In addition, it can capture both long- and short-range text co-occurrences. In doing so, the proposed models in this thesis obtain state-of-the-art results in several text mining tasks such as document retrieval learning and document classification. Qualitatively, the models generate better topical phrasal words than the previously proposed models. This thesis presents some of the groundbreaking work in structured topic discovery, which strengthens the claim that capturing the document's structure is of utmost importance rather than making the strong bag-of-words assumption.

Abstract (in Chinese)

[Chinese translation of the abstract above.]

Contents

Preface iv

Dedication viii

Acknowledgments ix

Abstract xii

Abstract in Chinese xiv

Contents xv

List of Tables xix

List of Figures xxi

List of Algorithms xxix

List of Abbreviations xxx

1 Introduction 1
1.1 Motivation ...... 2
1.2 Applications of Topic Models ...... 8
1.3 Contributions ...... 9
1.4 Outline ...... 11

2 Literature Survey 13
2.1 Unsupervised Parametric Topic Modeling with Exchangeability Assumption ...... 16
2.2 Unsupervised Parametric Topic Modeling with Word Order ...... 19
2.3 Unsupervised Nonparametric Topic Modeling with Exchangeability ...... 23
2.4 Unsupervised Nonparametric Topic Modeling with Word Order ...... 24

2.5 Unsupervised Parametric Topic Models for Temporal Data ...... 25
2.6 Supervised Parametric and Nonparametric Topic Models ...... 27
2.7 Supervised and Unsupervised Readability Prediction Models ...... 30

3 Background 37
3.1 Cluster Analysis ...... 38
3.2 Principal Component Analysis (PCA) ...... 41
3.2.1 Singular Value Decomposition (SVD) ...... 42
3.3 Latent Class Analysis (LCA) ...... 43
3.4 Latent Semantic Analysis (LSA) ...... 44
3.4.1 Limitations of the Latent Semantic Analysis (LSA) Model ...... 47
3.5 Probabilistic Latent Semantic Analysis (pLSA) ...... 47
3.5.1 Probabilistic Latent Semantic Analysis (pLSA) as a Matrix Factorization Model ...... 49
3.6 Dirichlet Distribution ...... 51
3.7 Probabilistic Unsupervised Topic Modeling ...... 54
3.7.1 Latent Dirichlet Allocation Model ...... 54
3.7.2 Latent Dirichlet Allocation (LDA) as a Matrix Factorization Scheme ...... 60
3.8 Topic Models with Word Order ...... 61
3.8.1 Bigram Topic Model (BTM) ...... 61
3.8.2 LDA-Collocation Model (Latent Dirichlet Allocation-Collocation (LDACOL)) ...... 63
3.8.3 Topical N-gram Model (TNG) ...... 67
3.9 Topic Segmentation Models ...... 71
3.10 Bayesian Nonparametrics in Topic Modeling ...... 77
3.10.1 Dirichlet Processes (DP) ...... 77
3.10.2 Stick Breaking Construction ...... 82
3.10.3 The Dirichlet Process Mixture Model (DPMM) ...... 84
3.10.4 Chinese Restaurant Process (CRP) ...... 86
3.10.5 Hierarchical Dirichlet Processes ...... 87
3.11 Supervised Topic Models ...... 92
3.12 MedLDA Model ...... 92
3.12.1 Posterior Inference using Gibbs Sampling ...... 100
3.13 Learning-to-Rank ...... 102
3.13.1 Features ...... 103

4 Topic Segmentation Model for Text Documents 104
4.1 The Case for Topic Segmentation with Word Order ...... 105
4.2 N-gram Topic Segmentation Model Description ...... 108
4.3 Posterior Inference ...... 113
4.4 Experiments and Results ...... 118
4.4.1 Correlation Graph ...... 119
4.4.2 Topic Segmentation Experiment ...... 120
4.4.3 Document Classification Experiment ...... 128

xvi 4.4.4 DocumentLikelihoodExperiment...... 132 4.5 ClosingRemarks ...... 134

5 Modeling Temporal Dynamics in Text Documents 135
5.1 The Case for Capturing N-grams over Time ...... 136
5.2 Our N-gram Temporal Topic Model ...... 138
5.2.1 Inference and Parameter Estimation ...... 140
5.3 Experiments and Results ...... 142
5.3.1 Data Sets and Comparative Method ...... 142
5.3.2 Experimental Results ...... 145
5.4 Closing Remarks ...... 154

6 Bayesian Nonparametric Topic Models for Text Data 155
6.1 The Case for Bayesian Nonparametric Topic Models With Order ...... 156
6.2 Nonparametric N-gram Collocation Model ...... 158
6.3 Posterior Inference ...... 160
6.3.1 The First Condition ...... 161
6.3.2 The Second Condition ...... 162
6.3.3 Sampling the Concatenation Indicator Variables ...... 163
6.4 Nonparametric N-gram Topic Models (NNTM) ...... 164
6.4.1 Model Description of NNTM-1 ...... 165
6.4.2 Model Description of NNTM-2 ...... 175
6.5 Experiments and Results ...... 181
6.5.1 Document Modeling Experiments ...... 181
6.5.2 Running Time Comparison ...... 190
6.5.3 Qualitative Results ...... 192
6.5.4 Document Classification Experiments ...... 196
6.6 Closing Remarks ...... 202

7 Supervised Probabilistic Topic Models 203
7.1 The Case for the Supervised Topic Models With Word Order ...... 204
7.2 Our Classification Model ...... 205
7.2.1 Model Description ...... 205
7.2.2 Posterior Inference ...... 209
7.3 Document Classification Experiments ...... 211
7.3.1 Experimental Setup ...... 211
7.3.2 Quantitative Results ...... 214
7.3.3 Qualitative Results ...... 217
7.4 Closing Remarks ...... 218

8 Document Retrieval Learning Models 219
8.1 The Case for the Supervised Document Retrieval Learning Topic Model ...... 220
8.2 Model Description ...... 221
8.2.1 Posterior Inference ...... 230
8.2.2 Ranking Unseen Documents ...... 232

8.3 Retrieval Learning Experiments ...... 233
8.3.1 Experimental Setup ...... 233
8.3.2 Quantitative Results ...... 236
8.3.3 Qualitative Analysis ...... 238
8.4 Closing Remarks ...... 239

9 Readability Prediction and Ranking 240
9.1 The Case for the Terrain Models in LSI ...... 241
9.2 The Terrain Model in the Concept Space ...... 245
9.2.1 Sequential Term Transition Model (STTM) ...... 245
9.2.2 Document Conceptual Difficulty Score ...... 252
9.2.3 Experiments and Results ...... 253
9.2.4 Results Discussion ...... 257
9.3 Closing Remarks ...... 260

10 Conclusions and Future Directions 261
10.1 Summary of the Methods ...... 262
10.2 Shortcomings of the Models ...... 263
10.3 Suggestions for Future Research ...... 264
10.4 Personal Research Experience ...... 266

A N-gram Topic Segmentation Model - Full Gibbs Sampling Derivation 268

B N-gram Topics Over Time Model - Full Gibbs Sampling Derivation 273

C Bigram Topic Model - Full Gibbs Sampling Derivation 277

D LDA-Collocation Model - Full Gibbs Sampling Derivation 280

E Proof of Bayes’ Theorem when Cast into an Optimization Problem 283

F MedLDA Model - Full Collapsed Gibbs Sampling Derivation 286

G Proof of Bayes' Theorem when Cast into an Optimization Problem for Our Bigram Supervised Topic Model 290

H Bigram Supervised Topic Model - Full Gibbs Sampling Derivation 293

Bibliography 296

List of Tables

1.1 Example of n-grams. We only depict unigrams, bigrams, and trigrams ...... 6
4.1 Classification results for N-gram Topic Segmentation (NTSeg) - Computer Dataset ...... 127
4.2 Classification results for NTSeg - Science Dataset ...... 127
4.3 Classification results for NTSeg - Politics Dataset ...... 128
4.4 Classification results for NTSeg - Sports Dataset ...... 128
5.1 Decade prediction results of our N-gram Topics Over Time (NTOT) ...... 151
6.1 Datasets used in evaluation of parametric and nonparametric topic models ...... 182
6.2 Perplexity results for parametric and nonparametric topic models ...... 185
6.3 Topics obtained from the tuning process in the parametric topic models ...... 186
6.4 Qualitative results for the topic “technology” from AQUAINT-1 for various nonparametric topic models ...... 194
6.5 Qualitative results for the topic “war” from AQUAINT-1 for various nonparametric topic models ...... 194
6.6 Qualitative results for the topic “neural networks” from NIPS for various nonparametric topic models ...... 194
6.7 Qualitative results for the topic “speech technology” from NIPS for various nonparametric topic models ...... 194
6.8 Qualitative results for the topic “cells” from OHSUMED for various nonparametric topic models ...... 195
6.9 Qualitative results for the topic “liver” from OHSUMED for various nonparametric topic models ...... 195
6.10 Qualitative results for the topic “finance” from Reuters for various nonparametric topic models ...... 195
6.11 Qualitative results for the topic “oil” from Reuters for various nonparametric topic models ...... 195
6.12 Document classification dataset used in nonparametric topic model experiments ...... 197
6.13 Classification results on the Computer Dataset using parametric and nonparametric topic models ...... 198
6.14 Classification results on the Science Dataset using parametric and nonparametric topic models ...... 198
6.15 Classification results on the Politics Dataset using parametric and nonparametric topic models ...... 199
6.16 Classification results on the Sports Dataset using parametric and nonparametric topic models ...... 199
6.17 Classification results on the OHSUMED Dataset using parametric and nonparametric topic models ...... 199
7.1 Classification performance of our supervised topic model ...... 214
7.2 The effect of the number of topics on document classification measured by F-measure ...... 215
7.3 Top five probable words from a topic from comp.graphics class of 20 Newsgroups dataset ...... 217
8.1 Features used in our discriminant function in our document retrieval learning model ...... 229
8.2 The performance of our model in comparison to other learning-to-rank models ...... 237
8.3 Results obtained from our models when the number of topics is varied ...... 238
8.4 Top five probable words from a topic from AQUAINT-1 collection ...... 239
9.1 Readability annotation guidelines to the human annotators ...... 256
9.2 Ranking performance of the traditional Information Retrieval (IR) models in solving the readability problem ...... 257
9.3 Ranking performance of different readability models in the Psychology domain ...... 258
9.4 Query-wise performance of different readability models in the Psychology domain ...... 258

List of Figures

1.1 Input, process and output of a bag-of-words unsupervised topic model ...... 3
1.2 An example of two topics formed by the LDA model ...... 4
1.3 Word ambiguity as a comical representation ...... 7
1.4 Figure depicting the applications of probabilistic topic models ...... 8
3.1 An illustration of cluster analysis ...... 39
3.2 Mixture model example ...... 40
3.3 Hierarchical and partitional clustering ...... 40
3.4 Principal components ...... 41
3.5 An example of Latent Class Analysis ...... 44
3.6 pLSA graphical model ...... 48
3.7 Simulating the Dirichlet with different α ...... 52
3.8 Simulating the Dirichlet with different α ...... 52
3.9 Simulating the Dirichlet with different α ...... 53
3.10 Latent Dirichlet Allocation model graphical model ...... 54
3.11 Explanation of the LDA graphical model ...... 56
3.12 Diagrammatic representation of a generative process ...... 57
3.13 Diagrammatic representation of an inference process ...... 57
3.14 Topic visualization using the LDA model ...... 59
3.15 Bigram Topic Model ...... 62
3.16 Graphical model of the LDACOL model ...... 64
3.17 Graphical model of the Topical N-gram (TNG) model ...... 68
3.18 Modified TNG model ...... 69
3.19 The graphical model of the Latent Dirichlet Allocation Segmentation (LDSEG) model in a plate diagram ...... 73
3.20 The graphical model of the LDSEG model in a plate diagram highlighting the portion that generates the super-topics ...... 74
3.21 The graphical model of the LDSEG model in a plate diagram highlighting the portion that generates the word-topics ...... 75
3.22 An illustration of how the LDSEG model segments a document into paragraphs ...... 76
3.23 An example showing partition of data ...... 78
3.24 Density plots for the Pólya Urn model for different α values ...... 79
3.25 Density plots for the Pólya Urn model for different α values ...... 80
3.26 Density plots for the Pólya Urn model for different α values ...... 81
3.27 Stick breaking construction representation ...... 82

xxi 3.28 Stick breaking construction representation ...... 83 3.29 Stick breaking construction representation ...... 83 3.30 Dirichlet Process Mixture Model (DPMM)graphicalmodel...... 85 3.31 The Hierarchical Dirichlet Processes (HDP) model in different metaphors. 88 3.32 Graphical model of the Maximum-Margin Entropy Discrimination La- tent Dirichlet Allocation (MedLDA)model...... 93 3.33 Graphical model of the MedLDA model when expanded to show the wordsinthedocument...... 94 3.34 High-level illustration of the learning-to-rank framework ...... 102 4.1 Our NTSeg modelinplatenotation ...... 106 4.2 Our NTSeg model depicting the portion which generates segments . . 107 4.3 Our NTSeg model depicting the portion that generates n-gram words 107 4.4 N-gramwordgeneration ...... 108 4.5 Depictionoflongerphrasegeneration ...... 110 4.6 Word-topic and segment-topic illustration ...... 110 4.7 Correlation graph of NTSeg ...... 120 4.8 Correlation graph of NTSeg ...... 121 4.9 Correlation graph of Correlated Topic Model (CTM) ...... 122 4.10 Topicsegmentationresults ...... 123 4.11 Topicsegmentationresults ...... 124 4.12 Topicsegmentationresults ...... 125 4.13 Topicsegmentationresults ...... 126 4.14 Document modeling results of NTSeg ...... 129 4.15 Document modeling results of NTSeg ...... 130 5.1 Anillustrationoftopicchangeovertime...... 137 5.2 NTOT model ...... 138 5.3 Histograms and topical words from NTOT model for the topic “Mexican War”...... 142 5.4 Histograms and topical words from Topics Over Time (TOT) model forthetopic“MexicanWar”...... 143 5.5 Histograms and topical words from NTOT model for the topic “Panama Canal”...... 143 5.6 Histograms and topical words from NTOT model for the topic “Panama Canal”...... 144 5.7 “Recurrent NNs” topic over time depiction via our model...... 146 5.8 “Recurrent NNs” topic over time depiction via the TOT model. . . . . 146 5.9 Topical words as they change over time in our model...... 148 5.10 Topical words as they change over time in the TOT model...... 149 5.11 Alternative depiction of the NTOT model...... 150 5.12 Co-occurrence with “classification” topic over time in TOT model. . . 152 5.13 Co-occurrence with “classification” topic over time in NTOT model. . . 153 6.1 Nonparametric N-gram HDP model...... 159 6.2 Graphical model of the NNTM-1 model...... 165 6.3 Chinese Restaurant Franchise (CRF)with Buddy Customers . . . . . 168 6.4 Our proposed NNTM-2 in Chinese Restaurant Franchise representation 176 6.5 The effect of the topic Dirichlet parameter in nonparametric topic models...... 187 6.6 Number of topics detected by nonparametric topic models in NIPS collection...... 188

6.7 Number of topics detected by nonparametric topic models in OHSUMED ...... 188
6.8 Number of topics detected by nonparametric topic models in Reuters ...... 188
6.9 Number of topics detected by nonparametric topic models in AQUAINT-1 ...... 189
6.10 Training and testing time comparisons on NIPS collection for nonparametric topic models ...... 191
6.11 Training and testing time comparisons on OHSUMED collection for nonparametric topic models ...... 191
6.12 Training and testing time comparisons on Reuters collection for nonparametric topic models ...... 191
6.13 Training and testing time comparisons on AQUAINT-1 collection for nonparametric topic models ...... 192
6.14 The effect of topic Dirichlet parameter on classification for different nonparametric topic models ...... 201
7.1 Graphical representation of our proposed document classification model ...... 207
7.2 Per-class distribution over topics in comp.graphics class of 20 Newsgroups dataset ...... 213
7.3 Per-class distribution over topics in Class 5 of OHSUMED-23 dataset ...... 214
7.4 CPU runtime performance for supervised topic models ...... 216
8.1 Our document retrieval learning topic model without word order ...... 222
8.2 Second graphical model of our document retrieval learning model where the order of words in queries is relaxed ...... 225
8.3 Document retrieval learning model with word order in both queries and the documents ...... 227
9.1 An illustration of different terrains ...... 243
9.2 An illustration of a readability segment ...... 251
9.3 Effect of varying the readability weight parameter ...... 260
10.1 A pie chart showing the approximate time that I have spent in different tasks during my research study ...... 267

List of Symbols

$x[\gamma_n]$: A vector of size $N \times 1$ which encapsulates the term weights in the concept space.

$\alpha$: In the case of NTSeg, $\alpha$ is a $K \times L$ matrix where each row represents the mixing proportion of the word-topics in a segment-topic; in the other parametric topic models, it is the prior for the document-topic multinomial distribution. It is assumed to be symmetric in this thesis.

$\alpha$: In the case of the Bayesian nonparametric topic models, it is the concentration parameter.

$\alpha_{y_s z_{si}}$: When we refer to an element of $\alpha$ in this way, it means the $z_{si}$-th component of $\alpha_{y_s}$.

$\bar{m}_{kl}$: The sample mean computed over all the segments assigned to the segment-topic $k$.

$\bar{v}_{kl}$: The sample variance computed over all the segments assigned to the segment-topic $k$.

$\beta$: In the parametric and nonparametric topic models, the parameter of the prior probability for the distribution of words conditioned on the word-topics.

$\beta$: In the readability model, the parameter controlling the relative contribution between term difficulty and cohesion.

$\kappa$: The mean of the classifier parameters $\eta$.

$\Theta = \{\theta^d\}_{d=1}^{D}$: The topic distributions for all documents.

$\mathbf{B} = \{\mathbf{b}^d\}_{d=1}^{D}$: Encodes the word order information.

$\mathbf{b}^d$: Takes the form $\{b^d_{n,n+1}\}_{n=1}^{N^d-1}$.

$\mathbf{f}(y,(d,q))$: A vector of features which are designed to be useful for retrieval.

$\mathbf{f}(y, z^d)$: An $MK$-dimensional vector whose elements from $(y-1)K$ to $yK$ are $z^d_k$ and the rest are all 0.

$\mathcal{W} = \{\mathbf{w}^d, y^d\}_{d=1}^{D}$: The training set.

$\mathbf{w}^d = \{w^d_n\}_{n=1}^{N^d}$: The words appearing in the document $d$.

$\mathcal{Z} = \{\mathbf{z}^d\}_{d=1}^{D}$: The topic assignments to all the words in the corpus.

$\delta$: The Dirichlet prior of $\sigma$.

$\eta$: In the case of the document retrieval learning model, it represents the model parameters, which are essentially feature weights.

$\eta$: In the supervised topic models, a random variable representing the parameter of the classification model.

$\gamma$: The Dirichlet prior of $\psi$.

$\gamma$: The prior distribution for $\psi$.

$\hat{n}_k$: The number of segments assigned to the segment-topic $k$.

$\hat{S}_k$: The set of segments assigned to the segment-topic $k$.

$\mathbb{I}(\cdot)$: An indicator function which equals 1 if the predicate holds and 0 otherwise.

$\mathbf{L}$: A matrix of dimension $f \times D$.

$\mathbf{R}$: A matrix of dimension $W \times f$.

$\mathbf{S}$: An $f \times f$ diagonal matrix of singular values.

$\mathbf{U}$: A $W \times f$ matrix of left singular vectors.

$\mathbf{V}$: A $D \times f$ matrix of right singular vectors.

$\mathbf{V}^{T}$: The matrix transpose of $\mathbf{V}$.

$\mathbf{w}$: A vector comprising words.

$\mathbf{x}$: A binary status variable vector, consisting of the status values of the words in sequence for a document.

$\mathbf{y}$: The segment-topic variable vector for a document.

$\mathbf{z}$: The word-topic variable. In the case of the NTSeg model we explicitly use the term word-topic, because of the two levels of topics that our model generates. In general it is synonymous with latent topics, or just topics, which we will use for the rest of the models.

$\nu(\vec{\Delta}_s, \vec{\Delta}_{s+1})$: The cosine similarity between the centroids of the two clusters to which the segments belong.

$\Omega$: The parameter of the Beta distribution.

$\Omega_{z^d_i 1}$: One of the shape parameters of the Beta distribution.

$\Omega_{z^d_i 2}$: The other shape parameter of the Beta distribution.

$\bar{z}^d$: An $L$-dimensional vector with each element $\bar{z}^d_k = \frac{1}{N^d} \sum_{n=1}^{N^d} \mathbb{I}(z^d_n = k)$.

$\bar{t}_z$: The sample mean.

$\phi$: In both the parametric and nonparametric topic models, the parameter of the multinomial distribution of words conditioned on the word-topics.

$\phi_{zw}$: The bigram word distribution.

$\pi$: The parameter of the Binomial distribution.

$\psi$: The Bernoulli distribution of the status variable $x^d_i$ with respect to the previous word.

$\rho$: The parameter of the Dirichlet prior on the segment-topics.

$\sigma$: The parameter of the multinomial distribution with the prior $\delta$.

$\tau$: In the NTSeg model, the mixing proportion of the segment-topics in a document.

$\tau$: In the terrain model, the average number of terms over all segments in the document.

$\theta^d$: In the other parametric topic models in this thesis, a matrix which contains the document-topic distributions.

$\theta^{(s)}$: In the case of NTSeg, the mixing proportion of the word-topics in the text segment $s$.

$\vec{\Delta}_s$: The centroid of the cluster in which a segment $Q_s$ lies.

$\vec{l}_j$: The document vector at column $j$ in $\mathbf{L}$.

$\vec{r}_x$: The term vector at row $x$ in the matrix $\mathbf{R}$.

$\xi^d$: The slack variable.

$\zeta_j$: The overall cohesion score of document $j$.

$b^d_{n,n+1}$: The words at positions $n$ and $n+1$ in the document $d$.

$c(w^d_i, d)$: The number of times the word $w^d_i$ appears in the document $d$.

$C$: A regularization constant.

$c^d_s$: A binary switching random variable which indicates whether there is a change in segment-topic from one segment to the next in the sequence of segments in the document.

$D$: The number of documents in the collection.

$d$: A document variable which represents one document.

$f \ll \min(W, D)$: The number of factors in the Singular Value Decomposition (SVD).

$G$: A random distribution which is distributed according to a Dirichlet Process (DP).

$H$: In the case of the Bayesian nonparametric topic models, the base distribution.

$J_n$: In the Normalized Discounted Cumulative Gain (NDCG) formula, the normalization constant such that a perfect list gets a score of 1.

$K$: The number of segment-topics for the entire corpus, which is specified by the user in advance.

$L$: The number of word-topics for the entire corpus, which is specified by the user in advance.

$l^d(y)$: The loss function for the label $y$.

$M$: The number of classes considered in the classification problem.

$m_{wv}$: The number of times the word $v$ is assigned as the second word of a bigram given the previous word $w$, and given the same topic as the previous word.

$m_{zwv}$: The number of times the word $v$ has been assigned to $z$ as the second term of a bigram when the previous word is given.

$n$: In the NDCG formula, the length of the ranked list.

$N^d$: The total number of words in the document.

$N^q$: The number of words in the query $q$.

$N^d_s$: The number of n-gram words in the document.

$n_{d0}$: The number of times the switching variable $c_s$ is set to 0 in the document $d$.

$n_{d1}$: The number of times the switching variable $c_s$ is set to 1 in the document $d$.

$n^d_{z_{si}}$: The number of times a word in segment $s$ of document $d$ is assigned to the word-topic $z$.

$n_{zw}$: The number of times the word $w$ is assigned to the word-topic $z$ as a unigram.

$P^d_{y^d_s}$: The number of times a segment in the document $d$ has been assigned to the segment-topic $y^d_s$.

$p_{zwt}$: The number of times the status variable $x = t$ (0 or 1) occurs with the same topic $z$ as the previous word $w$.

$Q_s$: The variable representing a segment.

$q_{dz}$: The number of times a word is assigned to topic $z$ in document $d$.

$S$: The number of segments in a document.

$s$: A segment variable.

$S_j$: The total number of segments in document $j$.

$s^2_z$: The biased sample variance of the time-stamps which belong to $z$.

$t^d_i$: The time-stamp variable of a document.

$W$: The number of words in the vocabulary.

$x^d_i$: A binary status variable which tells us whether two words in sequence, i.e. the words $w^d_{i-1}$ and $w^d_i$, form a bigram or not in the document $d$.

$x^d_{si}$: The bigram status variable, which tells us whether the word at position $i$ in segment $s$ of document $d$, denoted $w^d_{si}$, forms a bigram with the word $w^d_{si-1}$.

$y^d$: The class label, which takes on one of the values $Y = \{1, \ldots, M\}$.

$y^d_s$: The segment-topic that has been assigned to the paragraph $s$ in document $d$.

$y^d_{\neg s}$: The segment-topic assignments for all the segments except the current segment $s$.

$z^d_{si}$: The word-topic assignment for the word $w^d_{si}$ in segment $s$ of document $d$.

$\mathrm{Dirichlet}(\alpha)$: The Dirichlet distribution with parameter $\alpha$.

$r(i)$: In the NDCG formula, the rank label of the $i$-th document in the ranked list.

List of Algorithms

1 Inference algorithm for NTSeg...... 114

2 Inference algorithm for the NTOT model...... 140

3 Cohesion based on segmentation ...... 250

List of Abbreviations

ATM Author Topic Model.
BTKM Bigram Token Model.
BTM Bigram Topic Model.
CHM Conceptual Hop Model.
CRF Chinese Restaurant Franchise.
CTM Correlated Topic Model.
CTM Topic Model.
DPMM Dirichlet Process Mixture Model.
DTM Dynamic Topic Model.
DiscLDA Discriminative Latent Dirichlet Allocation.
EM Expectation-Maximization.
GD-LDA Generalized Dirichlet Distribution - Latent Dirichlet Allocation.
GTM Group Topic Model.
HDP Hierarchical Dirichlet Processes.
HMM Hidden Markov Model.
HPYP Hierarchical Pitman-Yor Process.
IDF Inverse Document Frequency.
LCA Latent Class Analysis.
LDACOL Latent Dirichlet Allocation-Collocation.
LDA Latent Dirichlet Allocation.

LDCC Latent Dirichlet Allocation Co-Clustering.
LDSEG Latent Dirichlet Allocation Segmentation.
LSA Latent Semantic Analysis.
LSI Latent Semantic Indexing.
MedLDA Maximum-Margin Entropy Discrimination Latent Dirichlet Allocation.
NHDP N-gram Hierarchical Dirichlet Process.
NMF Non-negative Matrix Factorization.
NNTM Nonparametric N-gram Topic Model.
NTOT N-gram Topics Over Time.
NTSeg N-gram Topic Segmentation.
PAM Pachinko Allocation Model.
PCA Principal Component Analysis.
PCFGs Probabilistic Context-Free Grammars.
PDLDA Phrase Discovering Latent Dirichlet Allocation.
PYP Pitman-Yor Process.
STTM Sequential Term Transition Model.
SVD Singular Value Decomposition.
SVM Support Vector Machine.
T-BTM Token-Bitopic Model.
TBM Topic-Bigram Model.
TF Term-Frequency.
TNG Topical N-gram.
TOT Topics Over Time.
gMedLDA Gibbs Maximum-Margin Entropy Discrimination Latent Dirichlet Allocation.
kNN k-Nearest Neighbour.
mcLDA Multi-Class Supervised Latent Dirichlet Allocation.
pLSA Probabilistic Latent Semantic Analysis.
pLSI Probabilistic Latent Semantic Indexing.
sLDA Supervised Latent Dirichlet Allocation.

vMedLDA Variational Maximum-Margin Entropy Discrimination Latent Dirichlet Allocation.

ARI Automated Readability Index.

BoW bag-of-words.

C-L Coleman-Liau. CLEF Cross-Language Evaluation Forum. CRP Chinese Restaurant Process.

DAG Directed Acyclic Graph. ddCRF Distance Dependent Chinese Restaurant Franchise. ddCRP Distance Dependent Chinese Restaurant Process. DP Dirichlet Process.

FAMCLASS Familiarity Classifier.

HDLM Hierarchical Dirichlet Language Model. HITS Hyperlink-Induced Topic Search.

IID Independent and identically distributed. IR Information Retrieval.

KL-Divergence Kullback-Leibler Divergence.

MCMC Markov Chain Monte Carlo.

NDCG Normalized Discounted Cumulative Gain. NLP Natural Language Processing.

PDF Probability Density Function.

SALSA Stochastic Approach for Link-Structure Analysis.

TREC Text Retrieval Conference.

VSM Vector Space Model.

Chapter One

Introduction

1.1 Motivation

Every day, a mammoth amount of text data is generated by users on the web. Even if only a small subset of this textual data is collected on a computer, its sheer size makes it impossible for humans to read it all and know what each document is talking about individually. Under such scenarios, we need automated techniques which can let a human user know what a particular collection of text documents is talking about in totality. Based on the output generated by such an automated system, a user can then select a small subset of the entire collection to read, rather than manually sift through each of the documents to find the ones that interest the user.

Did you know?

The Indexed Web contains at least 4.96 billion pages (as of Wednesday, 11 June 2014). – WorldWideWebSize.com

In order to get a gist of such voluminous data, probabilistic topic modeling can be a handy tool. A probabilistic topic model takes as input a document collection (preferably all in the same language) and creates a list of words for each topic which coherently summarizes the given document collection. In Figure 1.1, we depict the input, process and output sequence of an unsupervised bag-of-words probabilistic topic model. By “process” we mean the actual work that goes on behind the scenes in creating the latent topics along with the topical words. The topic model outputs a list of words for each topic, where the number of topics is pre-defined by the user. Typically, a topic is a probability distribution over words. The summary is in the form of the words thus formed in each topic, and words in one topic are closely related to each other. In this way, a user can know which topics pervade the document collection. In Figure 1.2, an example of a list of words formed by a simple topic model, LDA [23], is shown. This result has been obtained from the NIPS document collection.


Figure 1.1: The input, process and output of a bag-of-words unsupervised topic model. The user first selects the input type. Inputs could be documents, images, etc. The next task is to build a co-occurrence matrix of the input data, which falls in the “process” phase. Suppose the input data is a set of text documents. The co-occurrence matrix is a matrix in $\mathbb{R}^{W \times D}$, where $W$ is the vocabulary size and $D$ is the number of documents in the collection. Each entry of the matrix is the number of times a word appears in a document, irrespective of its position. This matrix is given as input to a topic model, which then creates a list of words in each topic.

The words are arranged in decreasing order of probability.
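To make this input representation concrete, the following minimal sketch builds the word-document count matrix described in Figure 1.1 using scikit-learn's CountVectorizer; the three example documents are made up for illustration, and any word-counting routine would serve equally well.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy collection; in practice this would be the full corpus.
docs = [
    "neural networks learn representations from data",
    "the white house issued a statement on trade policy",
    "recurrent neural networks model sequences of words",
]

vectorizer = CountVectorizer()           # bag-of-words: word positions are ignored
counts = vectorizer.fit_transform(docs)  # sparse D x W matrix (documents x vocabulary)

# Transpose to the W x D term-document layout used in Figure 1.1.
term_document = counts.T.toarray()
print(term_document.shape)                     # (vocabulary size W, number of documents D)
print(vectorizer.get_feature_names_out()[:5])  # a few vocabulary entries
```

A matrix built this way discards the word order entirely, which is exactly what a bag-of-words topic model consumes; the models developed later in this thesis keep the sequence information instead.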

Definition 1. A text collection, D, is defined as a group of documents.

Definition 2. A document, d, is described as a sequence of words which are selected from a vocabulary.

Definition 3. A vocabulary, W , is defined as a list of all the unique unigram words in the text collection.

Definition 4. A term or a word, w, is defined as one entry (a unigram) in the document which is selected from the vocabulary.

Definition 5. A topic or a latent topic is a probability distribution over words in the vocabulary.
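Definition 5 can be illustrated with a few lines of code: a topic is simply a normalized vector over the vocabulary, and drawing it from a symmetric Dirichlet mirrors how LDA-style models generate their topic-word distributions. The tiny vocabulary and the prior value β below are illustrative assumptions, not values used anywhere in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary = ["architecture", "recurrent", "network", "order", "first", "analysis"]

beta = 0.1                                    # illustrative symmetric Dirichlet prior
topic = rng.dirichlet([beta] * len(vocabulary))
assert abs(topic.sum() - 1.0) < 1e-9          # a topic is a probability distribution

# Topics are usually reported as words sorted by decreasing probability,
# as in Figure 1.2.
for word, prob in sorted(zip(vocabulary, topic), key=lambda pair: -pair[1]):
    print(f"{word:12s} {prob:.3f}")
```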

Topic models such as LDA, along with many other recently proposed models [251], [146], [112], have been widely used to find topics in a document collection.

Topic 1 (word, probability)      Topic 2 (word, probability)
architecture   0.05              order      0.09
recurrent      0.01              first      0.02
network        0.01              second     0.01
module         0.008             analysis   0.008
modules        0.0004            small      0.006

Figure 1.2: An illustration of two topics obtained from the LDA model on the NIPS document collection. Topic 1 mainly contains words from documents that collectively discuss “neural networks”, and Topic 2 focuses on documents that describe “first order logic”.

But the LDA model has been criticized for its bag-of-words assumption [117], as the model does not consider the structural information inherent in the text, which could help tap extra knowledge from the text. For example, LDA, due to its bag-of-words assumption, fails to capture a phrase such as “acquired immune deficiency syndrome”, which is one of the model's shortcomings. It is well known that the bag-of-words assumption is mainly a simplifying assumption made to reduce the complexity of the model [166], [115], [117], [88]. Processing documents with the word ordering kept intact, as in the existing parametric topic models mentioned earlier, does impose an additional computational burden; nonetheless, it gives an upper hand over traditional bag-of-words topic models [88]. One useful advantage is the discovery of more interpretable latent topics [261], [117], [116].

Some recent topic models have demonstrated better qualitative and quantitative performance when the bag-of-words assumption is relaxed [115], [136], [1], [166]. In order to address this shortcoming inherent in the LDA model, the authors in [261] introduced the TNG model to find n-gram words in topics. By n-gram we mean that a word can be a unigram, a bigram, a trigram, etc. We present an example of different n-grams in Table 1.1. The TNG model has the ability to decide whether to form a unigram or a bigram during the topic discovery process. The TNG model mainly extends the LDACOL model [87] and the Bigram Topic Model (BTM) [252]. All these models advocate that the word order in a document is essential. But one shortcoming of these models is that they lack the ability to consider the document's structure, such as paragraphs and sentences. Thus they cannot segment a document into coherent topics. This sometimes becomes essential in tasks such as tackling the word sense disambiguation problem as shown in [88], segmenting news articles and finding topics in each segment [220], topic detection and tracking [265], and a plethora of other tasks [218]. In order to address this limitation of the topic models, we proposed the topic segmentation model called NTSeg [117] to generate n-gram words and also segment a document into coherent topics.

Unigrams    Bi-grams             Tri-grams
thesis      doctoral thesis      random access memory
computer    computer science     chinese restaurant process
science     full moon            hierarchical dirichlet processes
moon        microsoft windows    latent dirichlet allocation
full        white house          nonparametric bayesian model

Table 1.1: Examples of unigram words (Column 1), bi-gram words (Column 2), and tri-gram words (Column 3).
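As a small illustration of Table 1.1, the sketch below enumerates the unigrams, bigrams, and trigrams of one sentence with scikit-learn's CountVectorizer. The sentence is made up for the example, and it simply enumerates every contiguous n-gram; the n-gram topic models discussed in this thesis instead decide probabilistically which word sequences to group together.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["the chinese restaurant process is a nonparametric bayesian model"]

# ngram_range=(1, 3) extracts unigrams, bigrams, and trigrams in one pass.
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(sentence)

for ngram in vectorizer.get_feature_names_out():
    print(ngram)  # e.g. "chinese", "chinese restaurant", "chinese restaurant process"
```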

LDA is an unsupervised probabilistic topic model which analyzes a high-dimensional term space and discovers a low-dimensional latent topic space [23]. Many other proposed variants of the LDA model are unsupervised topic models, and have been employed for tackling text mining problems including document classification [117] and document retrieval [266], [261]. These models can achieve better performance by detecting the latent topic structure and establishing a relationship between the latent topics and the goal of the problem. One limitation of unsupervised topic models for document classification is that the topic model itself does not consider useful side-information, for instance, class labels of documents. Another limitation of topic models for document classification is that the topic models do not exploit the word order structure of the documents. Some works attempt to integrate the class label information into a topic model for solving document classification. For example, Supervised Latent Dirichlet Allocation (sLDA) [21] is one model that captures the class label of a document as a real-valued regression response. Wang et al. [257] proposed Multi-Class Supervised Latent Dirichlet Allocation (mcLDA), which captures discrete labels of documents as a classification response. MedLDA [297] and its variants have been shown to improve document classification performance [299], [123]. The MedLDA model incorporates a maximum-margin principle instead of a likelihood-driven objective. However, one common limitation of the above models is that they do not make use of the word order structure in text documents that could interact with the class label information for solving the document classification task. The exchangeability assumption in probabilistic topic models helps to simplify the model and also reduces the computational complexity [166], [23], but it has several de-merits too; for example, words generated in topics are not that insightful [166] and are sometimes ambiguous [261]. We show a comical illustration of such ambiguity in Figure 1.3. In addition, the discovered topics are less interpretable to a user [261]. It is quite common to see words such as “house” in topics generated by the LDA model, which is a unigram based topic model. However, generating “house” in a topic does not give much meaning to a reader, as there is some doubt as to which “house” the word is referring to. Instead, generating a multi-word expression [20] such as “white house” in a topic clears many doubts that the user may have. Therefore, n-gram based topic models such as LDACOL [87], TNG [261], etc., help generate more interpretable topics. Some works in the topic modeling literature, such as [115] (and a similar model applied to blogs [1]), [261], [252], [136] and many others, have been shown to perform better than their traditional bag-of-words counterparts in empirical experimental analysis, such as held-out likelihood computations, information retrieval, and text classification.

Figure 1.3: Comical illustration of the case when a person hears a word which is ambiguous and which does not give much insight as to what it is actually referring to. For example, when the word “house” is shown to a person, the person is in doubt about this word and the context in which it has been used. In contrast, when someone talks about the “White House”, the residence of the President of the United States in Washington D.C. immediately comes to one's mind, thereby getting rid of any ambiguity that one may have.

Figure 1.4: Some of the areas where topic models have been successfully applied, including information retrieval, text mining, text segmentation, text clustering, classification, robotics, image data, medicine, bioinformatics, and gene expression data.

1.2 Applications of Topic Models

Topic models have become very popular since they were first introduced. The first latent topic model proposed was the LDA model, which has subsequently been applied to a wide range of tasks. We show some of the areas where topic models have been applied in Figure 1.4. Topic models have shown immense success in the field of text data, where they have been applied successfully in information retrieval [261], [266], [211], [90], [38], [282], [163], [74], [171], [267], [254], [75], in mining useful information from text [291], [27], [29], [89], [106], [289], in social networks [94], [296], in segmenting text into coherent topics [117], [220], in the field of medicine [205], in document classification [295], [162], [34], [200], [181], [294], in speech [44], [47], in [281], in image data [109], in robotics [82], [81], [80], and in many other interesting scenarios such as [178], [5], [213], [201], [170], [151].

The goal of this thesis is to contribute new structured topic models for text data. Overall, we propose the following topic models for text data:

• A new unsupervised topic discovery model, called N-gram Topic Segmentation (NTSeg) model, for a collection of text documents. NTSeg maintains the seg- ment structure of the document such as paragraphs and sentences. In addition, it preserves the word order in the document. NTSeg can help capture topical changes in the document from one segment to another. As a result, it can gen- erate two levels of topics of different granularity, namely, segment-topics and word-topics. In addition, it can generate n-gram words in each word-topic. This model is discussed in Chapter 4.

• A new model which can not only consider the local contextual information inherent in the document, but also capture the way in which the topic structure changes over time. Maintaining the word order in the document and capturing phrases in topics helps find words in topics which convey better meaning to the reader. The model uses a continuous distribution over time which is associated with each topic. Topics generate words and observed time-stamp values. The model automatically determines whether to form a unigram or combine with the previous word in each time-stamped document. The main innovation is capturing the evolution of n-gram words in each topic over time. We also present a collapsed Gibbs sampling procedure for efficient posterior inference (a generic collapsed Gibbs sampler for the basic LDA model is sketched after this list for orientation). This model is described in Chapter 5.

• Considering the limitations of the existing n-gram based topic models, three new nonparametric n-gram topic models for text data are proposed that can generate insightful n-gram words in topics. The proposed models can also automatically detect an appropriate number of latent topics from the characteristics of the text data. The n-gram nonparametric topic models assume a first-order Markovian structure on the order of the words in the documents. By introducing a set of binary random variables in the HDP model, and by doing some extra book-keeping during sampling, the models can capture topical n-gram words. In addition, corresponding posterior inference schemes for the three models based on the Chinese Restaurant Franchise methods are presented. These models are described in Chapter 6.

• A low-dimensional supervised latent topic model for document classification is presented. Class label information and word order structure are integrated into our supervised topic model, enabling more effective interaction among such information for solving the document classification problem. The derivation of the collapsed Gibbs sampler for the proposed model is also presented. This model is described in Chapter 7.

• We then present a new supervised topic model for the document retrieval learning problem. Specifically, a pointwise approach is designed in which available relevance assessments and word order structure are also integrated into the topic model itself. The model jointly considers the similarity between the query and the document under a low-dimensional topic space in a maximum-margin framework. One major difference between the proposed model and existing learning-to-rank models is that existing learning-to-rank models do not consider latent topic information in the learning framework. This model is presented in Chapter 8.

• We present readability models for domain-specific information retrieval that make use of the latent information generated by latent concept models to re-rank the search results obtained from a similarity based information retrieval system. Previous approaches have used an external vocabulary or a list of words to capture domain-specific terms, but we automatically obtain these words with high scores in the latent concept space. Different schemes, considering both word order and bag-of-words representations, are proposed and compared. The models are presented in Chapter 9.

1.4 Outline

This thesis is organized as follows. We will first review some of the closely related topic models in Chapter 2. In this chapter, we will present both traditional and current state-of-the-art topic models. In addition, a brief survey of readability methods will also be presented. We will also contrast how the models proposed in this thesis improve over the existing state-of-the-art. We will then present a detailed overview of the related existing state-of-the-art topic models in Chapter 3. This will help build the relevant background needed to understand the content in the subsequent chapters of this thesis. We will then present a topic segmentation model with word order in Chapter 4. In this chapter, we will also present different sets of experimental analysis, and show how the proposed topic segmentation model improves upon the state-of-the-art. In Chapter 5, we will present a topic model for capturing temporal dynamics in data. Specifically, we will capture how topics change over time. In addition, n-gram words that change over time are also captured by the model. We will then show how this model improves upon the existing unigram based topics over time model. In the models proposed in the earlier chapters, we have assumed that the dimension of the latent topic space is pre-defined. We relax this assumption in Chapter 6, where we present three nonparametric n-gram topic models. We will also present their inference algorithms using the Chinese Restaurant

Franchise scheme with Buddy Customers. We will present experimental analysis on both small and large text collections. The models proposed above do not consider side information or response variables which are available in some datasets. For example, datasets used in the supervised document classification problem have some form of class label associated with them. In Chapter 7, we will present an n-gram supervised topic model which considers the available annotation information, which is not considered in the models proposed in the earlier chapters. We will show that we obtain state-of-the-art performance using our proposed supervised topic model. In

Chapter 8, we will present a new topic model for document retrieval learning where relevance labels are used as side information. This model conducts learning-to-rank by incorporating the latent topic feature in the topic model. Then we will present the readability methods in the low-dimensional concept space in Chapter 9. Just like the latent topic models, which can be seen as a matrix factorization technique, we use a concept model, Latent Semantic Indexing (LSI), in order to compute the readability of a text document and conduct retrieval of documents based on readability, in addition to relevance. We will then conclude the thesis by describing some future directions in n-gram probabilistic topic modeling and readability with word order.

Chapter Two

Literature Survey

In this chapter, we will present a detailed literature survey about probabilistic topic models and text readability. We will present both parametric and nonparametric topic models. In addition, we will also present previously proposed works which are closely related to the models proposed in this thesis. Some of the closely related works will be summarized under the following sections.

• Unsupervised Parametric Topic Modeling with Exchangeability Assumption - Under this section, we will present a literature survey on probabilistic topic models which ignore the word order in the document and assume that the words in the document are exchangeable, i.e. the bag-of-words (BoW) assumption. We will present those models which have a pre-defined parameter space that needs to be explicitly set by the user. We will highlight some of the advantages and shortcomings of these models.

• Unsupervised Parametric Topic Modeling with Word Order - In this section, we will present probabilistic topic models that maintain the order of words in a text document. These models capture the document’s word order and generate phrasal terms instead of just unigrams. However, these models also require a pre-defined parameter space given by the user.

• Unsupervised Nonparametric Topic Modeling with Exchangeability - These topic models automatically find the number of latent topics based on the data characteristics. The complexity of these models grows or shrinks with the data characteristics. However, the models that will be discussed under this section are bag-of-words models.

• Unsupervised Nonparametric Topic Modeling with Word Order - These topic models find the number of topics automatically from the data characteristics. The difference is that these models follow the word order in the document and can generate n-gram words. The models in this section will show how the order of the words helps improve topic models considerably.

• Unsupervised Parametric Topic Models for Temporal Data - Topics rise and fall over time. Some topics are popular at one time, whereas at other times they are overshadowed by other topics. The topic models in this section are related to capturing the temporal dynamics of the data. We will present both unigram and n-gram based topic models. We will also present how the model proposed in this thesis is different from the previously proposed techniques.

• Supervised Parametric and Nonparametric Topic Models - The topic models in this section use extra side information or a response variable to generate more fine-grained latent topics. During model selection, the models make use of side information which usually comes from some external annotation process, for example, manually annotating data with the help of humans. The topic models that will be discussed under this section are both parametric and nonparametric. In addition, models which maintain and relax word order will be discussed.

• Supervised and Unsupervised Readability Prediction Models - We will present different readability models with and without word order. We will also present our models which make use of the latent concept space for readability prediction. We will present how the models proposed in this thesis are an innovation over the existing techniques. We will also present the limitations inherent in the current readability models.

In addition, we will also contrast how the models proposed in this thesis are advancements over the state-of-the-art methods.

2.1 Unsupervised Parametric Topic Modeling with Exchangeability Assumption

Parametric topic models assume that the number of parameters is fixed/pre-defined regardless of the sample size, and does not grow with the data [195]. Although this makes the model tractable and computationally efficient, it has some major shortcomings, such as over-fitting (when the number of topics arbitrarily specified is more than what the data can actually accommodate) and under-fitting (when the number of topics arbitrarily specified is less than what the data can actually accommodate). In a parametric topic model such as LDA [23], [237], [19], [16], the two main inputs are the term-document matrix and the number of topics supplied by the user. The model then outputs words in each topic with the associated probability (this paradigm has been exemplified graphically in Figures 1.1 and 1.2). The LDA model posits that documents exhibit multiple topics, and each document is composed of a mixture of topics. LDA has been popularly used to discover topics in a document collection. The main assumption of this model is that words in the documents are exchangeable [3]; it thus advocates that the order of words in the document is not important, and so loses the document’s logical order, which carries important information.
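As a concrete illustration of this input/output paradigm (a term-document representation plus a user-specified number of topics in, per-topic word distributions out), the following minimal sketch uses the gensim library; the toy corpus, the choice of two topics, and the training settings are illustrative assumptions rather than anything prescribed in this thesis.

\begin{verbatim}
# Minimal LDA sketch with gensim: bag-of-words corpus in, topical word
# distributions out. The toy documents and K = 2 are illustrative only.
from gensim import corpora, models

docs = [
    ["neural", "network", "training", "weights", "network"],
    ["stock", "market", "bank", "finance", "money"],
    ["deep", "neural", "network", "layers"],
    ["bank", "interest", "rate", "finance"],
]

dictionary = corpora.Dictionary(docs)              # vocabulary (term ids)
corpus = [dictionary.doc2bow(d) for d in docs]     # term-document counts

lda = models.LdaModel(corpus, id2word=dictionary,
                      num_topics=2,                # K must be pre-defined
                      passes=20, random_state=1)

for topic_id, words in lda.show_topics(num_topics=2, num_words=4,
                                        formatted=False):
    print(topic_id, [(w, round(p, 3)) for w, p in words])
\end{verbatim}

Note that, in line with the bag-of-words assumption discussed above, nothing in this pipeline retains the order of the words inside each document.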

Many topic models based on the bag-of-words assumption have been proposed after the seminal work on the LDA model. For example, in order to find topic correlations, Shafiei et al. [231] described a co-clustering method, known as Latent Dirichlet Allocation Co-Clustering (LDCC), which captures correlation between word-topics and document-topics (or super-topics). The LDCC model is a hierarchical topic model where unigram words are assigned to word-topics, and paragraphs are assigned to the document-topics. The model can also find correlations between word-topics and document-topics. In [161], the authors proposed the Pachinko Allocation Model (PAM), where the concept of a topic is extended to include not only distributions over words, but also distributions over topics. This model

assumes the structure of an arbitrary Directed Acyclic Graph (DAG) in which each leaf is associated with a word and each non-leaf node is a distribution over its children. The interior nodes are distributions over topics called super-topics. Recently, in [35], the authors presented a new model to find correlations among topics in a corpus using the Generalized Dirichlet distribution instead of the Dirichlet distribution. The model is called Generalized Dirichlet Distribution - Latent Dirichlet Allocation (GD-LDA).

Unigram based topic models have been used in the topic segmentation task too. The goal of topic segmentation is to group the segments in the document based on topical changes. A segment could be a sentence or a paragraph. In [183], the authors presented a method for topic segmentation based on topic modeling, where they used the LDA model, which assumes exchangeability among the words in a document, to segment texts into coherent topics. In [22], the authors described a topic segmentation method by unifying the segmenting Hidden Markov Model (HMM) in [185] and the aspect model in [102]. Recently, in [220], the authors presented TopicTiling, which is based on LDA. Their algorithm is very similar to the TextTiling [97] algorithm, and segments documents using the LDA topic model. Also, in [219], the authors presented methods in which topic models can help segmentation based methods by extending their own TopicTiling model. Similarly, in [230] the authors proposed a topic segmentation based topic model, known as LDSEG, where they assumed that the word order is not important. The authors introduced the notion of a topic hierarchy where sentences are assigned to the document-topics and unigrams are assigned to word-topics. The LDSEG model represents documents as a distribution over document-topics or super-topics in such a way that each segment is assigned a super-topic or document-topic, which is then used to choose the parameters of a document independent Dirichlet distribution from which word-topics for the segment are drawn. In order for consecutive segments to have similar word-topic distributions, an additional binary variable per segment encodes whether the document-topic is forced to be the same as that of the previous segment. In [48], the authors proposed a topic model based hierarchical segmentation approach where they assumed that the word order within the segment

is not important, and applied a variational Bayesian Expectation-Maximization (EM) procedure for computing the posterior inference. This model has been designed for segmenting speech data. In [63], the authors proposed a collapsed Gibbs sampler for the topic segmentation problem for faster posterior inference. They employ a hierarchical Pitman-Yor Process (PYP) to handle hierarchical modeling. In [241], the authors presented a topic segmentation model which does not find topics in a segment. In [46], the authors proposed a subsequence based topic segmentation approach which uses a suffix tree model for representing text, and measures coherence between sentences based on subsequences. Their model maintains the order of the words in each segment, but it does not find collocations in the text segment.
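The segmentation methods surveyed above differ considerably in their details, but a recurring recipe is to represent each sentence (or block of sentences) by its topic distribution and to place a boundary where adjacent representations diverge. The sketch below illustrates only that generic recipe; it is not TopicTiling or any of the cited models, and the per-sentence topic vectors are assumed to come from some previously fitted topic model.

\begin{verbatim}
# Simplified TopicTiling-style segmentation: place a boundary wherever the
# cosine similarity between adjacent sentence topic vectors drops below a
# threshold. `topic_vectors` (one row per sentence) is assumed to come from
# a fitted topic model; the threshold value is an illustrative choice.
import numpy as np

def segment_boundaries(topic_vectors, threshold=0.3):
    v = np.asarray(topic_vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)   # unit-normalise rows
    sims = np.sum(v[:-1] * v[1:], axis=1)              # cosine of neighbours
    # boundary after sentence i when similarity to sentence i+1 is low
    return [i + 1 for i, s in enumerate(sims) if s < threshold]

# toy example: three "sports-like" sentences followed by three "finance-like"
toy = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
       [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
print(segment_boundaries(toy))   # -> [3]
\end{verbatim}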

There are some models which use the LDA model to generate n-gram words. For example, Kim et al. [136] used the LDA model along with a frequent pattern mining approach to capture word order in the document. In their methodology, the authors first mine the frequent patterns from the data. These frequent patterns represent the semantic associations between the units in the collection. The frequent pattern information is then fed to the conventional bag-of-words topic model, which can further capture the semantic associations among the frequent patterns. An advantage of this approach is that the methodology is computationally efficient, as there are already several computationally efficient frequent pattern mining algorithms to capture such semantic associations. Unigram based topic models such as LDA have also undergone many algorithmic innovations. Several fast LDA algorithms have since been proposed [208], [264], among many others, which can scale to large datasets and can also show faster convergence.

The method of Kim et al. [136] also suffers from disadvantages in that one needs to adopt a two-step approach in order to capture the word dependencies in the data. This is not only time consuming in terms of labour hours, but is also heavily dependent on the quality of the frequent patterns generated from the data. There are several other challenges facing frequent pattern analysis, for example,

scaling the methods to very large datasets [92], which might generate an exponential number of frequent patterns. In addition, the number of topics in the LDA model must be fixed in advance.

In contrast to the frequent pattern mining approach, our proposed topic models are single step approaches that can take advantage of the order of the words in the document, and can capture the semantic associations between the words in the documents. Our methods can also scale to large datasets. One similarity between our work in [117] and the topic correlation models is that our NTSeg model also introduces two levels of topic assignments. For example, the word-topics and document-topics described by Shafiei et al. [230], [231] share the same notion as the word-topics and segment-topics described in our proposed approach. We adopt the name segment-topics because known text segments such as paragraphs or sentences are assigned to the segment-topics. However, all the correlation topic models mentioned above assume exchangeability among the words in a document. The importance of capturing n-gram words is that it reduces the ambiguity in the mind of the reader as to what a word in the correlation graph is referring to. For example, presenting the word “networks” in a topic is ambiguous, especially for a person who is not a domain expert. In contrast, showing the phrase “neural networks” in a topic significantly reduces ambiguity. In addition, popular topic models such as the LDA model lack the capability to capture correlations between the topics of the words in the document.

2.2 Unsupervised Parametric Topic Modeling with Word Order

Capturing word order and thus generating topical n-grams has caught some attention in the past. Wallach [252] proposed the BTM for text data, which maintains the order of the words in the document. The model incorporates the hierarchical Dirichlet language model [172] into the LDA model, and thus captures only bigram words in topics. The model achieves better predictive accuracy than the LDA model, and also alleviates another widely discussed problem of the LDA model, namely that common words dominate the topics. The model, however, requires the number of topics to be pre-defined by the user.

One limitation of the BTM model is that it always generates bigram words, which at times might not be intuitive because words do not always exist as bigrams. For example, “white house” may sometimes exist as a bigram, but in some other topic “white” and “house” may exist independently. In order to solve the problem inherent in the BTM model, Griffiths et al. [87] proposed the LDACOL model, which has the ability to generate both unigram and bigram words. The LDACOL model introduces a new set of binary random variables which indicate the bigram status of a word. An advantage of the model is that it can generate both unigram and bigram words, which are more insightful than just unigram words. However, it has a limitation in that only the first word in a bigram has a topic assignment. The model has other shortcomings which were then addressed by the TNG model proposed by Wang et al. [261], where both words in a bigram get a topic assignment. The TNG model has the ability to decide whether to generate a unigram or a bigram for the same two words depending on their nearby context. But it also has some limitations; for example, the words in a bigram do not share the same topic assignment.

Lindsey et al. [166] proposed Phrase Discovering Latent Dirichlet Allocation (PDLDA), an improvement over the LDA model which generates phrases by incorporating the Hierarchical Pitman-Yor Process (HPYP) into the LDA model. The model is well suited for finding topical phrases, and has been shown to generate more insightful phrases than the TNG or LDACOL models. Technical descriptions of the PYP and HPYP, and why they are suited for text data, can be found in [244]. The model proposed by Lindsey et al. [166] indeed gives a plausible solution to many problems

that are inherent in the n-gram topic models proposed so far in the literature; for example, words in a topical phrase share the same probability mass. The model outperforms closely related models in the phrase intrusion test. The model, however, suffers from some severe drawbacks. The main concern is that the model cannot scale to large text collections due to the HPYP model incorporated in the topic model. Bartlett et al. [9] found that the HPYP model is impractical for large datasets unless some refinements are made; they introduced such a refinement in the context of the sequence memoizer. But for capturing topical phrases, such solutions might not work. In [255] and [67], the authors presented a topical phrase extraction method and constructed a topical hierarchy of phrases. Their model also performs phrase ranking along with topical tree construction using recursive clustering. Their focus is primarily on topical phrase extraction and constructing topic trees from the extracted phrases.

Recently, in [8], the authors have presented three models for sequential data. The authors mainly designed their models for solving problems related to sequential data such as web data logs, customer purchase histories, and other problem domains where sequence matters very much. In one of the models, called the Bigram Token Model (BTKM), the authors adopted a first-order Markovian assumption on the order of the words and captured word dependencies in sequence. In the Topic-Bigram Model (TBM), the authors capture dependencies between the topics in sequence. The Token-Bitopic Model (T-BTM) does not consider word dependencies; rather it considers topic dependencies in sequence, where a word is generated not only by the current topic but also by the previous topic. One problem with these models is that, on text data, the Bigram Token Model will always generate bigrams, just like the BTM. Consequently, these models might not be very plausible models to consider. The models appear to be good for other applications as mentioned in the paper, but they are not well suited for text data where words appear as unigrams, bigrams, and even higher order n-grams.

Lau et al. [150] presented a study investigating whether word collocations can help improve topic models, and the impact of incorporating word collocations into them. Their model does not automatically generate word collocations; rather, the collocation discovery is done during the preprocessing stage, thus making it a two-step procedure similar to the one described in [136]. Instead of considering unigrams as the atomic input units to a topic model, the authors consider such collocations as the input to a topic model. The quality of the generated topics will primarily depend on the quality of the collocations formed.
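The two-step recipe described above (discover collocations during preprocessing, then run an ordinary bag-of-words topic model over the merged tokens) can be sketched as follows with gensim; the toy sentences and the Phrases thresholds are illustrative assumptions, and this is not the exact pipeline of [150] or [136].

\begin{verbatim}
# Two-step collocation approach: (1) mine collocations in preprocessing,
# (2) feed the merged tokens to a standard bag-of-words topic model.
from gensim.models.phrases import Phrases, Phraser
from gensim import corpora, models

sentences = [
    ["neural", "networks", "learn", "representations"],
    ["deep", "neural", "networks", "need", "data"],
    ["white", "house", "press", "briefing"],
    ["the", "white", "house", "statement"],
]

bigram = Phraser(Phrases(sentences, min_count=1, threshold=1))  # step 1
merged = [bigram[s] for s in sentences]      # e.g. "neural_networks"

dictionary = corpora.Dictionary(merged)      # step 2: plain LDA on merged tokens
corpus = [dictionary.doc2bow(s) for s in merged]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)
print(lda.print_topics(num_words=4))
\end{verbatim}

As noted above, the quality of the resulting topics is bounded by the quality of the collocations mined in the first step, which is precisely the limitation our single-step models avoid.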

In Johri et al. [129], the authors introduced a multi-word enhanced author-topic model for text data. The model can cluster authors with similar expertise and interests. The main advantage of the model is that it can find multi-word expressions in the data instead of unigrams. The model also retains the properties of the unigram based models, such as simplicity and comparable computational complexity. However, the model is designed for retrieving authors and their interests. It also has the disadvantage that one has to explicitly pre-define the number of latent topics a priori. In [280], the authors presented another n-gram topic model that is well suited to news thread extraction. This model maintains a background distribution over the corpus, and also a multinomial distribution over the hidden news threads. The model closely resembles the TNG model, but incorporates a background distribution which generates the most common words across news threads. The model generates more interpretable n-gram words in topics as compared with the traditional unigram based models. Lin et al. [165] proposed a model that utilizes the advantages of both the topic model and the n-gram language model. The basic intuition of the model is that the topic model is able to capture long range semantic relatedness which the n-gram language model cannot, whereas the n-gram language model can capture short range semantic relatedness. By combining the advantages of the two approaches, the authors obtained a better probability distribution of a word based on its context using the LDA inference procedure.

There are many differences between the models proposed above and our proposed topic models in this thesis. Our nonparametric topic models in Chapter 6 can automatically detect an appropriate number of latent topics from the characteristics of the text data. Our n-gram nonparametric models assume a first-order Markovian structure on the order of the words in the documents. By introducing a set of binary random variables in the HDP model, and by doing some extra book-keeping during sampling, we can capture topical n-gram words. We also present the corresponding posterior inference schemes for these models based on the CRF methods. Our models can also scale to large document collections. In Chapter 5, our n-gram topic model can capture n-gram words over time, which none of the above models can accomplish. Our NTSeg model described in Chapter 4 can segment a document into topics, and can also find n-gram topical words in each segment. Also, our NTSeg model gives the same topic assignment to all words in an n-gram, unlike TNG.

2.3 Unsupervised Nonparametric Topic Modeling with Exchangeability

Nonparametric topic models are models defined on an infinite-dimensional parameter space, where the complexity of the model grows with the sample size. It is not true to say that these models have no parameters; they indeed contain parameters, but those parameters are not bounded and can grow with the complexity of the data. In such topic models, the main input to the model is the term-document matrix. The advantage of the nonparametric approach is that one does not need to pre-define the number of latent topics, but these models are highly sensitive to the concentration parameters.

The seminal nonparametric topic model is the HDP model proposed by Teh et al. [247]. Although the model can be applied to a variety of tasks [246], it can also be used in topic modeling, where the number of topics is automatically determined by the data characteristics. Our description will be primarily based on its application in the domain of topic modeling. The model assumes that words in the documents are exchangeable and thus cannot capture short-range word dependencies. Several extensions to the HDP model have been proposed, for example [56], where the author captures both syntax and topical words in a single model. Nguyen et al. [192] apply Bayesian nonparametrics to segmenting speech discourse. Fox et al. [71] apply Bayesian nonparametrics to the speaker diarization task.
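The defining property of these nonparametric models, namely that the number of mixture components is unbounded and grows with the data, is often explained through the Chinese Restaurant Process metaphor used in their samplers. The following small simulation is a pedagogical sketch only, not the HDP sampler itself: each customer joins an occupied table in proportion to its popularity, or opens a new table in proportion to a concentration parameter.

\begin{verbatim}
# Chinese Restaurant Process simulation: customer n+1 sits at an existing
# table k with probability n_k / (n + alpha), or opens a new table with
# probability alpha / (n + alpha). The number of tables (clusters) grows
# slowly, and without bound, as more customers arrive.
import random

def crp(num_customers, alpha, seed=0):
    rng = random.Random(seed)
    table_counts = []                       # customers per table
    for n in range(num_customers):
        r = rng.random() * (n + alpha)
        cumulative = 0.0
        for k, count in enumerate(table_counts):
            cumulative += count
            if r < cumulative:
                table_counts[k] += 1        # join an occupied table
                break
        else:
            table_counts.append(1)          # open a new table
    return table_counts

for n in (10, 100, 1000):
    print(n, "customers ->", len(crp(n, alpha=1.0)), "tables")
\end{verbatim}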

In contrast to the above methods, our nonparametric topic models maintain the order of words in the document. This helps them capture both short- and long-range co-occurrences. Our models also generate more interpretable latent topics.

2.4 Unsupervised Nonparametric Topic Modeling with Word Order

Considering the order of the words in Bayesian nonparametrics has recently begun to attract some attention. Goldwater et al. [83] presented two nonparametric models for word segmentation. Observing that the ordering of the words could play a dominant role, Goldwater et al. extended the unigram based model to a bigram based model called the “Bigram HDP” model, which maintains the ordering in text. The model closely resembles the HPYP model and can capture dependencies. The model does not find topics, but it is well suited for the word segmentation task. The model has some shortcomings, and subsequently some corrections were proposed in [24]. Further refinements were then proposed in [84]. In [134], the author proposed a supervised topic model considering word order. The model adopts a nonparametric topic modeling approach, but makes use of an extra supervised signal during the inference procedure. The model is different from our nonparametric topic models in that

ours is an unsupervised topic model, and does not need human-annotated data.

Johnson [128] presented a connection between Probabilistic Context-Free Grammars (PCFGs) and the LDA model. The findings of Johnson suggest that the inference scheme that is used for PCFGs can also be used for the LDA model. Subsequently, by extending the model to incorporate nonparametric adaptor grammars, Johnson is able to discover word collocations instead of just unigrams. However, a disadvantage of Johnson's method is that he adopts a two-stage approach towards collocation discovery, whereas our nonparametric topic modeling approach is single step. In [57], the author introduced a nonparametric model that can extract phrasal terms based on the mutual rank relation. This model first extracts phrases, and subsequently ranks them. It employs a heuristic measure for the identification of phrasal terms. The model proposed is mainly a phrase ranking model and is based on heuristics, whereas our nonparametric methods are based on principled inference schemes. In [198], the authors introduced the notion of extension patterns, which is a formalization of the idea of extending lexical association measures defined for bigrams. In [284], the authors presented a Bayesian nonparametric model for symbolic chord sequences. Their model is designed to handle n-grams in chord sequences for music information retrieval.

2.5 Unsupervised Parametric Topic Models for Temporal Data

Blei et al. [18] introduced the Dynamic Topic Model (DTM) to capture the way topics evolve over time. They assumed that topics in one year are dependent on the topics of the previous year, which is a discrete-time assumption. The problem with time discretization is that one needs to explicitly select an appropriate time slice

value. In contrast, our model in Chapter 5 assumes a continuous distribution over time. Wang et al. [256] extended [18] and proposed a continuous time dynamic topic model where they used Brownian motion to model the sequential collection of documents, but they adopted a bag-of-words approach. The authors in [143] employed a Compound Topic Model (CTM) to model the temporal dependencies in data, but assumed a discrete distribution over time. In [86], the authors studied ordering documents in time and then slicing them into discrete time intervals to capture the temporal nature of the data. In the Group Topic Model (GTM) [262], the authors divided the UN voting records into segments and fit the group topic model to each segment, which is again a discrete time assumption. Swan et al. [240] described a method to capture time related information in a news corpus. The model constructs “overview timelines” of a set of news stories based on a discrete time assumption. Jo et al. [125] proposed a topics-over-time model in which they conceptualized a topic as a quantized unit of evolutionary change in content and then found temporal characteristics in a corpus. This helped build a topic chronology, which again relies on discretely selected time slices. Yin et al. [283] proposed latent periodic topic analysis, a variant of the LDA model, where their model exploits periodicity based on co-occurrence. This results in finding periodic topics.

In [133], the author introduced a trend analysis model to capture how topics evolve over time. The trend class has a probability distribution over temporal words and a continuous distribution over time, but the author adopts a bag-of-words approach. In [209], the authors presented a hierarchical Bayesian model to capture the temporal nature inherent in the data. Their model infers a change in the topic mixture weights as a function of time. The documents are each characterized by a topic, where topics are drawn from a mixture model. A major difference between their work and our n-gram temporal topic model is that they measure a change in the topic mixture weights over time. In contrast, we measure how topics evolve over time. In [193], the authors presented a continuous time model where the model is a Bayesian network. This model uses a Markovian assumption which our model

does not presume. Kleinberg [141] presented a burst and activity model that uses a probabilistic infinite automaton. The model assumes a Markov order on words with the aim of finding temporal patterns. The model operates on only one word at a time, whereas our model makes use of word co-occurrence patterns with the ability to form phrases. In [25], the authors proposed a segmented topic model, which is based on the Author Topic Model (ATM) [222], to integrate the temporal structure of the corpus into a topic model, but they assumed bag-of-words in each segment. Hong et al. [107] introduced a topic model where they incorporated the volume of terms into the temporal dynamics of topics. The authors combined state-space models with term volumes in a supervised method. In contrast, our n-gram temporal topic model requires no human supervision. In [174], the authors presented a Bayesian topics over time model and stated that the original TOT model [260] is likely to overfit to the time-stamp data; they applied a prior distribution to the Beta distribution in order to tackle this issue. A limitation of their model is that it is a highly complex graphical model which only considers unigrams in topics. A limitation of all the models discussed above is that they assume independence among the words in documents, and hence cannot form phrases in topics. This leads to sub-optimal word discovery in each topic, because many words may be ambiguous.
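As a concrete illustration of the continuous-time treatment adopted by topics-over-time style models (and, at a high level, by our model in Chapter 5), each topic can be associated with a Beta distribution over timestamps normalised to the unit interval, with the parameters often estimated by the method of moments. The sketch below is illustrative only; it is not the inference procedure of [260] or of our model, and the timestamp values are made up.

\begin{verbatim}
# Per-topic Beta distribution over normalised timestamps: estimate the Beta
# parameters for one topic by the method of moments, then score new times.
# The timestamp values below are illustrative assumptions.
import numpy as np
from scipy.stats import beta

def fit_beta_moments(timestamps):
    t = np.asarray(timestamps, dtype=float)    # assumed already in (0, 1)
    m, v = t.mean(), t.var()
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common      # (alpha, beta) estimates

# timestamps of words assigned to one topic (the topic "rises" late on)
topic_times = [0.62, 0.70, 0.75, 0.81, 0.86, 0.90]
a, b = fit_beta_moments(topic_times)

print("alpha=%.2f beta=%.2f" % (a, b))
print("density at t=0.8:", beta.pdf(0.8, a, b))   # high: topic active late
print("density at t=0.1:", beta.pdf(0.1, a, b))   # low:  topic inactive early
\end{verbatim}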

2.6 Supervised Parametric and Nonparametric Topic Models

Unsupervised and supervised topic models [23], [21], [259] have been used for document classification. An advantage that supervised topic models have over unsupervised ones is that supervised topic models consider the available side information as response variables in the topic model itself. This helps discover more predictive low-dimensional representations of the data for better classification [297]. Blei

et al. proposed the sLDA [21] model, which captures real-valued document ratings as a regression response. The model relies upon a maximum-likelihood based mechanism for parameter estimation. Wang et al. [257] proposed mcLDA, which directly captures discrete labels of documents as a classification response. The Discriminative Latent Dirichlet Allocation (DiscLDA) [147] model also performs classification, using a different mechanism from sLDA. Different from the above models, Zhu et al. [297], [298] proposed the maximum entropy discrimination LDA model, known as MedLDA, which directly minimizes a margin based loss derived from an expected prediction rule. The MedLDA model uses a variational inference method for parameter estimation. Subsequently, Markov Chain Monte Carlo (MCMC) techniques were proposed in [299], [301], [123], [300]. In [214], the authors proposed a supervised topic model which jointly models available tag-labels by defining a one-to-one correspondence between latent topics and user-tag information. This allows their model to directly learn word-tag correspondences in the topic model itself. What has not been studied in supervised topic modeling is the role that the word order structure in the text document could play, along with the side information, in the document classification task. Our proposed supervised topic model falls in the class of parametric topic models, where the number of latent topics has to be supplied by the user. Recently, Kawamae [134] presented a nonparametric supervised n-gram topic model for phrase extraction which takes advantage of labels during the training process. It places a PYP prior over words and extends the CRF scheme for automatically determining the number of topics. One of the basic differences between Kawamae's model and ours is that ours is a parametric model while Kawamae's model is nonparametric. Although one might argue that nonparametric topic models are more advantageous than parametric models because the former can automatically find the number of latent topics from the data characteristics, the number of topics is highly tied to the concentration parameters. One can adopt hyperparameter optimization techniques, but these can be computationally expensive and impractical, especially for large datasets. Also, Kawamae's model cannot perform document retrieval learning as ours can. Moreover, in [9] it has been stated that nonparametric models with PYP priors cannot scale to

large-scale datasets. There are other proposed supervised nonparametric topic modeling approaches, such as [196], [238], [149], [273], [164]. These models too cannot perform the document retrieval learning task. In addition, such nonparametric topic models are computationally very expensive.

Unsupervised topic models have also been used to perform document classification. As mentioned above, they do not make use of the available side information in the topic model itself. The LDA model is one example, and it achieves better performance than a Support Vector Machine (SVM) [286]. Our NTSeg model [117] is inspired by the BTM [252]. It relaxes the bag-of-words assumption, and generates collocations just like the LDACOL model [87]. In [223], the authors showed that a model that maintains the order of words in documents helps achieve better classification results than the state-of-the-art topic models.

Learning-to-rank models have been extensively investigated, and they can be categorized into pointwise, pairwise, and listwise approaches [167]. One early work used bag-of-features representations to train an SVM model in order to conduct document retrieval learning, which can be regarded as a pointwise approach to learning-to-rank [189]. This approach predicts the binary relevance assessment. Documents are then ranked based on the confidence scores given by the discriminative classifier. Subsequently, other discriminative learning-to-rank models have been proposed, such as those which handle multi-class relevance assessments [33], [160]. Many state-of-the-art learning-to-rank models have been proposed recently. For example, Wei et al. [76] recently presented a novel semi-supervised listwise learning-to-rank model which is extended to an adaptive ranker for domains where no training data is available. In [148], the authors presented a sparse learning-to-rank model for information retrieval. Dang et al. [55] proposed a two-stage learning-to-rank framework to address the problem of sub-optimal ranking when many relevant documents are excluded from the ranking list by bag-of-words retrieval models. However, a major difference between these learning-to-rank models and our proposed document

retrieval learning model in Chapter 8 is that our model considers the latent topic information unified within a discriminative framework.

Our graphical model in Chapter 8 bears some resemblance to the graphical models described in [42], [243]. However, the model in [42] cannot perform document retrieval learning. Different from [243], our model maintains the word order structure, and jointly considers documents and queries for document retrieval learning.
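The pointwise recipe mentioned above (train a classifier on query-document feature vectors against binary relevance labels, then rank candidate documents by the classifier's confidence score) can be sketched as follows; the features and labels are toy values, and this is neither the formulation of [189] nor our model in Chapter 8.

\begin{verbatim}
# Pointwise learning-to-rank sketch: fit a linear SVM on (query, document)
# feature vectors with binary relevance labels, then rank candidate
# documents for a new query by the decision-function (confidence) score.
# All feature values and labels below are toy, illustrative numbers.
import numpy as np
from sklearn.svm import LinearSVC

# each row: e.g. [retrieval score, title match, link-based prior]
X_train = np.array([[2.1, 1.0, 0.30], [0.4, 0.0, 0.10],
                    [1.8, 1.0, 0.70], [0.2, 0.0, 0.50],
                    [2.5, 0.0, 0.20], [0.1, 1.0, 0.05]])
y_train = np.array([1, 0, 1, 0, 1, 0])        # binary relevance assessments

ranker = LinearSVC(C=1.0).fit(X_train, y_train)

X_candidates = np.array([[1.9, 1.0, 0.40], [0.3, 0.0, 0.60], [1.0, 0.0, 0.20]])
scores = ranker.decision_function(X_candidates)
ranking = np.argsort(-scores)                 # most confident first
print("scores:", np.round(scores, 3), "ranking:", ranking)
\end{verbatim}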

2.7 Supervised and Unsupervised Readability Prediction Models

Unsupervised heuristic readability methods: Much research has been done on measuring the reading level of text [212]. A detailed description of important heuristic readability methods such as Dale-Chall [54], ARI [228], SMOG [176], Coleman-Liau [50], etc., can be found in [64]. These methods compute the vocabulary difficulty of a textual discourse. Their readability prediction is based on computing the number of syllables in a term, the number of characters, etc., which are surface-level features of text. Heuristic readability methods consist of two components, syntactic and semantic, linearly combined into a single formula. The syntactic component of the readability methods captures sentence length, word length, etc. The semantic component computes the number of syllables and poly-syllables, etc. They operate on the assumption that if the sentences are long, then the prime audience for whom the document is meant are experts in the field. In addition, if the choice of the terms in the document is such that most of them have low syllable counts, then the individual terms are simple for the reader. We present some of the popular readability formulae below, and highlight their semantic and syntactic components:

The Flesch reading ease score is given by the following formula:

\[
206.835 \;-\; \underbrace{1.015 \times \frac{\text{Number of words}}{\text{Number of sentences}}}_{\text{Syntactic component}} \;-\; \underbrace{84.6 \times \frac{\text{Number of syllables}}{\text{Number of words}}}_{\text{Semantic component}} \tag{2.1}
\]

The Flesch-Kincaid grade level formula is given by:

\[
\underbrace{0.39 \times \frac{\text{Number of words}}{\text{Number of sentences}}}_{\text{Syntactic component}} \;+\; \underbrace{11.8 \times \frac{\text{Number of syllables}}{\text{Number of words}}}_{\text{Semantic component}} \;-\; 15.59 \tag{2.2}
\]

The Gunning-Fog readability formula is given below. In the formula, poly-syllables are words which consist of three or more syllables.

\[
0.4 \times \left( \underbrace{\frac{\text{Number of words}}{\text{Number of sentences}}}_{\text{Syntactic component}} \;+\; \underbrace{100 \times \frac{\text{Number of poly-syllables}}{\text{Number of words}}}_{\text{Semantic component}} \right) \tag{2.3}
\]

The Automated Readability Index (ARI) readability formula is as follows:

\[
\underbrace{4.71 \times \frac{\text{Number of characters}}{\text{Number of words}}}_{\text{Syntactic component}} \;+\; \underbrace{0.5 \times \frac{\text{Number of words}}{\text{Number of sentences}}}_{\text{Syntactic component}} \;-\; 21.43 \tag{2.4}
\]

The SMOG readability formula is as follows:

\[
1.043 \times \underbrace{\left( 30 \times \frac{\text{Number of poly-syllables}}{\text{Number of sentences}} \right)^{\frac{1}{2}}}_{\text{Semantic component}} \;+\; 3.1291 \tag{2.5}
\]

Let L be the number of letters per 100 words and S the number of sentences per 100 words. The Coleman-Liau readability formula is as follows:

\[
0.0588 \times L \;-\; 0.296 \times S \;-\; 15.8 \tag{2.6}
\]
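All of the formulae above are built from the same small set of surface counts, so they can be computed directly once those counts are available. The sketch below simply transcribes equations (2.1)-(2.6); the counts themselves (words, sentences, syllables, and so on) are assumed to be computed elsewhere, and the example numbers are illustrative.

\begin{verbatim}
# Direct transcription of the readability formulas (2.1)-(2.6). Counting
# words, sentences, syllables, etc. is assumed to be done elsewhere.
import math

def flesch_reading_ease(words, sentences, syllables):
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words, sentences, polysyllables):
    return 0.4 * ((words / sentences) + 100.0 * (polysyllables / words))

def ari(characters, words, sentences):
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

def smog(polysyllables, sentences):
    return 1.043 * math.sqrt(30.0 * polysyllables / sentences) + 3.1291

def coleman_liau(letters_per_100_words, sentences_per_100_words):
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8

# toy counts for a short passage (illustrative numbers only)
print(round(flesch_reading_ease(words=120, sentences=8, syllables=180), 2))
\end{verbatim}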

These methods have long been in existence and still remain a dominant tool for computing the reading difficulty of traditional documents. In fact, many popular word processing packages use them today. However, readability methods tend to perform poorly on domain-specific texts [279] and web pages [52]. There are other shortcomings [30] which undermine their importance. In [187], the authors described an unsupervised method to re-rank the search results of a web search engine in descending order of their comprehensibility using the Japanese Wikipedia, but they failed to address the shortcomings of the readability formulae. Readability prediction has not remained constrained to English texts; recently, in [111], the authors addressed the problem for Bangla texts.

Why do readability methods underperform on domain-specific documents? Consider a short sentence: “In its simplest form, a star network consists of one central switch, hub or computer, which acts as a conduit to transmit messages.” The Flesch reading ease [139] score for this sentence is 62.11, which according to the score is not a difficult sentence. However, the sentence carries a deep technical meaning which requires domain-specific knowledge for proper comprehension. Terms such as “star”, “network”, and “switch” are domain-specific terms in this example, but the readability formula treats them as easy because it relies on surface-level features.

Domain-Specific Readability Methods: To address the shortcomings inherent in the heuristic readability methods, Yan et al. [279] proposed a concept based readability ranking method in which they used a domain-specific ontology to capture the domain-specific terms in a document. Their method has a serious drawback in that it requires an ontology for every domain, and the authors have only shown the application of their method in one domain. In [137], the authors described a concept

readability method in the medical domain. They used average term and concept familiarity scores from the OAC CHV knowledge base to compute the difficulty of terms and concepts. Zhao et al. [293] presented a domain-specific iterative readability method based on grade levels. Their method is influenced by two popular web link structure based algorithms, Hyperlink-Induced Topic Search (HITS) [142] and the Stochastic Approach for Link-Structure Analysis (SALSA) [153]. A limitation of their approach is that they need some seed concepts to initialize their algorithm. This can sometimes be cumbersome, as one has to search for a lexicon for every domain. In [188], the authors used Wikipedia to build a list of technical terms. In contrast, our proposed framework in this thesis does not require an ontology or seed concepts, which can be regarded as a major innovation. We have proposed some heuristic terrain models in the LSI space [118], [120], [119], [121], and computed the technical difficulty of text documents to re-rank the results obtained from a general purpose IR system. A limitation of the terrain models is that they cannot capture n-grams such as “random access memory”. Moreover, they lack a solid theoretical foundation. In [122], we presented a document readability and ranking model for text documents which maintains word order. We developed a novel framework to capture suitable n-gram fragments in a domain-specific document by optimizing n-gram fragment sequence connections and taking into account n-gram fragment specificity and cohesion. Also, our method does not require a domain-specific knowledge base. In [43], the authors also used the LSI method to compute word difficulty.

Supervised Methods for Readability: Although our proposed readability frameworks are completely unsupervised, some supervised methods for computing the reading difficulty of text have been proposed [72]. The supervised learning approach to readability can be considered a classification problem. In [168], the authors used an SVM [250] for recognizing the reading levels of texts from user queries. They used syntactic and vocabulary based features to train the classifier. Language modeling has been applied to readability [52], where the authors described a smoothed unigram model for computing the readability of text documents such as web pages. In [234], the authors also used a unigram language model to predict readability. Topic familiarity is different from traditional general readability [145], where the authors studied re-ranking of search engine results based on familiarity. They also studied the importance of stopwords in their familiarity classifier (FAMCLASS). In [155], classification of health related documents into three levels, namely Beginner, Intermediate and Advanced, is discussed. The authors achieved high classification accuracy using their classifier. In [227], [197] the authors combined word level features with other textual features. They used an SVM together with several word level features to classify documents based on readability. In [98], the authors introduced a k-Nearest Neighbour (kNN) classifier based on grammatical features such as sentence length and the patterns of the parse tree. Bendersky et al. [12] used several features, including readability, to improve relevance ranking of web search results.

Readability is a relative measure [249]. In order to cater results to individual users, methods using query log analysis have been proposed. Search engine query log mining and building individual user profile classifiers can also help to solve the problem to some extent, as done in [51], [242], [138]. But this requires confidential and proprietary query log data with private user session details [235], and many users might not want their sessions to be recorded or used due to privacy concerns [130].

Readability has also been studied in computational linguistics [154]. In [132], the authors used several linguistic and language model features to build a classifier to predict the readability of texts. Language model features were found to be important to their classifier. Pitler et al. [202] used several textual features in their classifier. Their results show that word features and average sentence length are strong predictors, but the strongest ones are the discourse features. One major limitation of the supervised methods is that one needs a large amount of expensive annotated data [131]. Language modeling approaches cannot capture domain-specific concepts [293]. In contrast, our proposed readability methods do not need any annotated data.

Our model departs completely from past readability approaches, and uses a conceptual model to compute the readability score of a document. Our model first finds a low-dimensional concept space of the original vector space, and then uses this space, together with word order in the document, to compute the readability score of the document. In our model, readability is computed for every word, and the readability scores for the same word vary from one document to another. This is because one document might contain a more readable usage of a term, while in another document the same term might be used in a context intended for domain experts.

Many psychologists have conducted user studies to investigate the nature of texts and the features that make them difficult. In [177], the author stated that texts which make the learning path of the reader simple are ideal for an average or below-average reader. Experts tend to look for documents which are technically sound and contain many difficult concepts so that they can further build upon their existing inventory of knowledge. According to the theory devised by Kintsch [140], there are three levels of cognitive representation. These levels lead a reader to comprehend a piece of discourse. The first level is the source code, the second is the text-base, and the third is the situation model. The organization of the words into sentences constitutes the source code. The surface meaning of the clauses present in the source code constitutes the text-base. The situation model is the mental model that the reader builds using background knowledge. If the concepts present in a document are semantically related, it indicates that the document confines itself to one topic only [278].

An information theoretic method for computing the semantic relatedness among pairs of concepts has been studied in [217]. An important notion that sprouts from semantic relatedness is the concept of cohesion in texts. Coherence and cohesion

[69], [37] are two other important concepts which have been extensively studied to measure text comprehension. Document cohesion is a state or quality in which the elements of a text “tend to hang together” [184]. Texts normally exhibit varying degrees of cohesion [91]; the start of the text will not be cohesive with the later sections of the text [91]. This finding forms an integral backbone of our work and is the main reason why we feel that maintaining the term order in the document is essential.

Chapter Three

Background

Chapter Summary

In this chapter, we will review some of the popular latent semantic and probabilistic topic models as the central frameworks for the remainder of the thesis. This chapter will acquaint the reader with some simple conceptual models proposed earlier, and subsequently the content will move on to the probabilistic topic models. The models described in this thesis, in general, perform what is known as clustering of data. Although the clustering models described in this thesis perform clustering based on dimension reduction, the idea stems from some simple clustering techniques.

3.1 Cluster Analysis

The task of clustering is to divide the data into groups, or clusters, of closely related units [113], [93], [206], [79]. Clustering has been applied in several fields such as medical science [66], [159], text mining [2], [6], astronomy [232], [224], databases [285], [276], etc. In Figure 3.1, we present a diagrammatic overview of clustering of data points in some space. In the figure, we can observe that initially some points lie in a space which gives us no information about what these points describe. The figure on the right shows the result of clustering these data points, where points which belong to one group are clustered together (denoted with the same colour). Therefore, just by looking at the clustered points, one can easily say what those points in this space describe, which would in general be very difficult when the points are not clustered. This is the prime motivation for clustering of points. In Figure 3.2, we also show another example of a clustering phenomenon which is cast as a mixture model, where points belong to a certain cluster with some weight.

The two kinds of clustering mechanisms in which we will be highly interested are hierarchical [96] and partitional clustering.

Figure 3.1: An illustration of cluster analysis. The figure on the left shows points scattered in some space, denoted by a big circle. The figure on the right shows the output generated by a clustering algorithm that groups the set of points based on some rule; points inside the big circle which are coloured in the same colour belong to one cluster.

In hierarchical clustering, clusters tend to have subclusters, so there is sharing among the elements in the clusters, whereas partitional clustering imposes a hard assignment of the data points or elements to a cluster, and the elements are not shared among clusters. In order to exemplify the idea further, we depict the notions of partitional and hierarchical clustering in Figure 3.3. In the figure, we see that in the case of hierarchical clustering the elements in the cluster are shared: child elements are shared with the parents. In contrast, in the partitional clustering mechanism, the elements in each cluster are distinct. In the figure, elements with the same colour belong to one cluster and share common properties with the other elements in the same cluster.

Covering everything regarding clustering and types of clusters, along with the latest state-of-the-art clustering algorithms, is out of the scope of this thesis, but interested readers may consult works which summarize state-of-the-art clustering algorithms; a few interesting and comprehensive ones are [26], [114], [186].

Figure 3.2: Another visualization of a clustering problem. The plot shows how some of the components are mixed with other components. This phenomenon is sometimes called soft clustering.

Figure 3.3: A figure illustrating the hierarchical and partitional clustering mechanisms.
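To make the partitional case in Figure 3.3 concrete, the following minimal sketch clusters a handful of two-dimensional points with k-means from scikit-learn; the points and the choice of three clusters are illustrative assumptions.

\begin{verbatim}
# Partitional (hard) clustering of toy 2-D points with k-means: every point
# is assigned to exactly one of the k clusters, with no sharing of elements.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],     # group A
                   [5.0, 5.2], [5.1, 4.8], [4.9, 5.0],     # group B
                   [9.0, 0.9], [9.2, 1.1], [8.8, 1.0]])    # group C

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("labels: ", km.labels_)           # one hard cluster id per point
print("centres:", np.round(km.cluster_centers_, 2))
\end{verbatim}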

Figure 3.4: A figure showing the principal components. The plot, titled “Principal Component Analysis”, shows points X1-X4 in the space spanned by the first two principal components (axes labelled Component 1 and Component 2).

3.2 Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a technique for identifying the hidden patterns in the data. When such patterns are known, we can express which patterns are similar and which are dissimilar. An advantage of the PCA paradigm is that it helps us focus on the “principal components”, i.e. the factors which are of prime interest. This is accomplished by reducing the number of dimensions of the data without losing much of the information that the data describes.

In Figure 3.4, we depict the idea behind principal component analysis. We observe that the plot shows some principal components computed from all the components in the high-dimensional data. The principal components point out some of the main components in the data that best describe the nature of the underlying information.

Models such as LSI, LSA, etc., that we shall discuss later in this section, compute the principal components of the data and reduce the high-dimensional space to a low dimension. This process in turn brings out some important latent classes that are largely hidden in the high-dimensional vector space.
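A minimal sketch of this dimension-reduction step is given below using scikit-learn; the synthetic data, in which one column is almost a scaled copy of another, is an illustrative assumption and is unrelated to the datasets used in this thesis.

\begin{verbatim}
# Project a small set of correlated 3-D observations onto the first two
# principal components; the explained-variance ratios show how much of the
# information survives the reduction. Data values are illustrative only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=100)
data = np.column_stack([x,
                        2.0 * x + rng.normal(scale=0.1, size=100),
                        rng.normal(size=100)])   # two correlated columns

pca = PCA(n_components=2)
reduced = pca.fit_transform(data)                # 100 x 2 matrix

print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("first projected point:   ", np.round(reduced[0], 3))
\end{verbatim}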

In this section, one popular principal component analysis method, known as SVD [85], will be described; it forms the basis of one of our models, which we shall discuss later.

3.2.1 Singular Value Decomposition (SVD)

Let A be a W × D matrix. Using SVD, this matrix can be factored as1:

\[
\mathbf{A} = \mathbf{U} \times \boldsymbol{\Sigma} \times \mathbf{V}^{T} \tag{3.1}
\]

where $\mathbf{U}$ is a $W \times W$ orthogonal matrix whose columns are the eigenvectors of $\mathbf{A} \times \mathbf{A}^T$, $\mathbf{V}$ is a $D \times D$ orthogonal matrix whose columns are the eigenvectors of $\mathbf{A}^T \times \mathbf{A}$, and $\boldsymbol{\Sigma}$ is a $W \times D$ diagonal matrix of the form:

\[
\boldsymbol{\Sigma} =
\begin{pmatrix}
\sigma_1 & & & \\
& \ddots & & \mathbf{0} \\
& & \sigma_r & \\
& \mathbf{0} & & \mathbf{0}
\end{pmatrix}
\]

with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ and $r = \mathrm{rank}(\mathbf{A})$. The values $\sigma_1, \sigma_2, \ldots, \sigma_r$ are the square roots of the eigenvalues of $\mathbf{A}^T \times \mathbf{A}$, and are termed the singular values of $\mathbf{A}$.

1 Some content has been borrowed from: http://www.cs.iastate.edu/~cs577/handouts/svd.pdf

The SVD factorization can be represented as follows:

\[
\underset{W \times D}{\mathbf{A}} \;=\; \underset{W \times W}{\mathbf{U}} \;\times\; \underset{W \times D}{\boldsymbol{\Sigma}} \;\times\; \underset{D \times D}{\mathbf{V}^{T}}
\]

\[
\mathbf{A} =
\begin{pmatrix} \mathbf{u}_1 & \cdots & \mathbf{u}_r & \mathbf{u}_{r+1} & \cdots & \mathbf{u}_W \end{pmatrix}
\begin{pmatrix}
\sigma_1 & & & \\
& \ddots & & \mathbf{0} \\
& & \sigma_r & \\
& \mathbf{0} & & \mathbf{0}
\end{pmatrix}
\begin{pmatrix} \mathbf{v}_1^T \\ \vdots \\ \mathbf{v}_r^T \\ \mathbf{v}_{r+1}^T \\ \vdots \\ \mathbf{v}_D^T \end{pmatrix}
\]

Based on the matrix factorization scheme described above, we can deduce the following about the SVD matrix factorization. In the scheme above, null(A) denotes the null space of the matrix A. We can also see in the decomposition that the rank of the matrix A is equal to the rank of the diagonal matrix of singular values Σ; this rank is equal to r. In the first matrix, U, the column space of A is spanned by the first r columns of the orthogonal matrix U, and the remaining columns span the null space of A^T. The row space of A is spanned by the first r columns of the orthogonal matrix V, and the null space of A is spanned by the last D − r columns of V.
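The factorisation above, and the rank-k truncation that LSA builds on in Section 3.4, can be reproduced numerically with NumPy; the small term-document matrix below is an illustrative assumption, not data used in this thesis.

\begin{verbatim}
# SVD of a small term-document count matrix A (rows = terms, columns =
# documents), followed by a rank-k truncation of the kind LSA uses.
import numpy as np

A = np.array([[8., 1., 1., 1., 4.],      # illustrative counts only
              [5., 1., 2., 0., 0.],
              [1., 0., 1., 0., 1.],
              [0., 0., 0., 1., 1.],
              [0., 0., 0., 0., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U * diag(s) * Vt
assert np.allclose(A, U @ np.diag(s) @ Vt)         # exact reconstruction

k = 2                                              # keep the top-k factors
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation

print("singular values:", np.round(s, 2))
print("rank of A:  ", np.linalg.matrix_rank(A))
print("rank of A_k:", np.linalg.matrix_rank(A_k))
\end{verbatim}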

3.3 Latent Class Analysis (LCA)

Latent Class Analysis (LCA) is a paradigm that relates a set of observed variables to latent, or hidden, variables [99]. A latent variable here is a discrete variable. The clusters or classes generated by the latent class analysis technique are called latent classes.


Figure 3.5: An example depicting the idea behind latent class analysis. Words such as "river" are observed variables, whereas "nature" and "finance" are latent variables: they are hidden, but the words in each cluster together point to these two semantic labels.

The idea behind the LCA model is depicted in Figure 3.5, where the lists of words generated in the clusters are observed variables. These lists of words, however, belong to some latent class; for example, one list belongs to the latent class "finance" and the other to "nature". Although these latent class names are not generated by the model, the words in a cluster together point to the subject or theme that has been manually assigned to that cluster.

3.4 Latent Semantic Analysis (LSA)

The input to the LSA model is a term-document matrix A. LSA uses the SVD algorithm (described in Section 3.2.1) along with a pre-defined number of latent factors, K², and reduces the high-dimensional vector space to a low-dimensional concept space. This new low-dimensional concept space brings about new term-term and term-document correlations which remain hidden in the original vector space. This also results in the removal of noise.

²The notation K used here for the total number of latent factors is different from K used in Chapter 4, which is the number of segment-topics.

LSA mainly performs a clustering of documents and terms in the low-dimensional latent concept space. This means that terms and documents which are "closely" or "semantically" related to each other cluster close to each other in the concept space. When the dimension of the space is small, it also helps alleviate the "curse of dimensionality", which is a problem especially in large datasets.

We show the LSA scheme in matrix form below. First, we begin with the high-dimensional original vector space. The term-document matrix has the terms $v \in \{1, \cdots, W\}$ of the vocabulary in the rows, and the documents $d \in \{1, \cdots, D\}$ in the collection in the columns.

\[
\begin{array}{c|ccccc}
 & d_1 & d_2 & \cdot & \cdot & d_D \\ \hline
v_1 & 8 & 1 & 1 & 1 & 4 \\
v_2 & 5 & 12 & 0 & 0 & 1 \\
\cdot & 1 & 0 & 1 & 0 & 1 \\
\cdot & 0 & 0 & 0 & 1 & 1 \\
\cdot & 0 & 0 & 0 & 0 & 1 \\
v_W & 0 & 0 & 0 & 0 & 1
\end{array}
\;\approx\; \mathbf{U}_k \times \Sigma_k \times \mathbf{V}_k^{T} \qquad (3.2)
\]

The term-document matrix is decomposed into three matrices as shown in Section 3.2.1, where the matrix is now approximated in a low-dimensional concept space with the number of latent concept factors pre-defined as K = 3, i.e. the dimension of the space has been pre-defined by the user. Also note that the discrete vector space is transformed into a continuous low-dimensional latent concept space, and some values can be negative. The three matrices $\mathbf{U}_k$, $\Sigma_k$ and $\mathbf{V}_k^{T}$ can be expanded as:

\[
\begin{array}{c|ccc}
 & k_1 & k_2 & k_3 \\ \hline
v_1 & 1.00 & 0.91 & 1.00 \\
v_2 & -0.44 & -0.57 & 0.84 \\
\cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot \\
v_W & 0.00 & 0.00 & 0.47
\end{array}
\;\approx\; \mathbf{U}_k \qquad (3.3)
\]

The diagonal matrix comprising the singular values is written as:

\[
\begin{array}{c|ccc}
 & k_1 & k_2 & k_3 \\ \hline
k_1 & 3.34 & 0.00 & 0.00 \\
k_2 & 0.00 & 2.54 & 0.00 \\
k_3 & 0.00 & 0.00 & 1.00
\end{array}
\;\approx\; \Sigma_k \qquad (3.4)
\]

The final orthogonal matrix $\mathbf{V}_k^{T}$ can be written as:

\[
\begin{array}{c|ccccc}
 & d_1 & d_2 & \cdot & \cdot & d_D \\ \hline
k_1 & -0.19 & -0.05 & \cdot & \cdot & 0.10 \\
k_2 & -0.01 & 0.43 & \cdot & \cdot & 0.52 \\
k_3 & -0.03 & 0.45 & \cdot & \cdot & -0.64
\end{array}
\;\approx\; \mathbf{V}_k^{T} \qquad (3.5)
\]

3.4.1 Limitations of the LSA Model

The limitations of the LSA model are as follows:

• The model lacks a solid methodological foundation when applied to matrices such as the term-document matrix; its application largely remains ad hoc.

• The resulting low-dimensional vectors can contain negative values, which do not generally make much sense in text processing.

• The model is unable to handle polysemy.

3.5 Probabilistic Latent Semantic Analysis (pLSA)

The pLSA model [101] attempts to address some of the shortcomings of LSA. pLSA follows the paradigm of the latent class model and is mainly an extension of the aspect model [104], [103]. The model can also be viewed as a matrix factorization method [62], [77], [61], which generates low-dimensional matrices that approximate the original vector space. However, unlike SVD, the elements of these matrices are all non-negative, just as in Non-negative Matrix Factorization (NMF) models [152], [41].

The pLSA model is characterized by the following generative process:

1. First select a document $d$ in the corpus with probability $P(d)$.

2. For each word $w_i^d$ in the document $d$, irrespective of its position in the document:

   (a) Select a latent variable $z_i^d$ from Multinomial($P(z_i^d \mid d)$).

   (b) Select a word $w_i^d$ from Multinomial($P(w_i^d \mid z_i^d)$).


Figure 3.6: Graphical model in standard plate notation of the pLSA model. In a graphical model, plates signify repetition of data, and circles depict variables. The circles which are shaded in black are observed variables.

We have presented the graphical model of the pLSA model in standard plate notation in Figure 3.6. Mathematically, the pLSA model can be described as follows:

\[
P(d, w_i^d) = P(d) \times P(w_i^d \mid d) \qquad (3.6)
\]

and $P(w_i^d \mid d)$ can be expressed as:

\[
P(w_i^d \mid d) = \sum_{z_i^d \in L} P(w_i^d, z_i^d \mid d) \qquad (3.7)
\]

The expression above can be equivalently written as:

\[
P(w_i^d \mid d) = \underbrace{\sum_{z_i^d \in L} \underbrace{P(w_i^d \mid d, z_i^d)}_{\text{Multinomials}} \times \underbrace{P(z_i^d \mid d)}_{\text{Mixing weights}}}_{\text{Multinomial mixture}} \qquad (3.8)
\]

Since the variables are conditionally independent (as shown in the graphical model), we can write:

\[
P(w_i^d \mid d) = \sum_{z_i^d \in L} P(w_i^d \mid z_i^d)\, P(z_i^d \mid d) \qquad (3.9)
\]

The joint distribution can be written as:

\[
P(w_i^d, d) = \sum_{z_i^d \in L} P(z_i^d) \times P(d \mid z_i^d) \times P(w_i^d \mid z_i^d) \qquad (3.10)
\]

The parameters of the model are $P(w_i^d \mid z_i^d)$ and $P(z_i^d \mid d)$. For $P(w_i^d \mid z_i^d)$, the number of parameters is $(W-1) \times L$, and for $P(z_i^d \mid d)$, the number of parameters is $D \times (L-1)$. We have $(W-1) \times L$ and $D \times (L-1)$ parameters, rather than $W \times L$ and $D \times L$, mainly because of the normalization constraint.
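For concreteness, the following is a minimal EM sketch for fitting these two parameter sets from a term-document count matrix. The function name `plsa_em`, the toy counts, and the fixed number of iterations are illustrative assumptions; a production implementation would also monitor the log-likelihood for convergence and temper the E-step to reduce overfitting.

```python
import numpy as np

def plsa_em(N, L, iters=100, seed=0):
    """A minimal EM sketch for pLSA on a term-document count matrix N (W x D)."""
    rng = np.random.default_rng(seed)
    W, D = N.shape
    p_w_z = rng.dirichlet(np.ones(W), size=L).T      # P(w|z), shape (W, L)
    p_z_d = rng.dirichlet(np.ones(L), size=D).T      # P(z|d), shape (L, D)
    for _ in range(iters):
        # E-step: responsibilities P(z | d, w) for every (word, document) pair.
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]        # shape (W, L, D)
        resp = joint / joint.sum(axis=1, keepdims=True)      # normalise over z
        # M-step: re-estimate the Multinomial parameters from expected counts.
        weighted = N[:, None, :] * resp                      # shape (W, L, D)
        p_w_z = weighted.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True)
        p_z_d = weighted.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    return p_w_z, p_z_d

N = np.array([[8, 1, 1], [5, 12, 0], [1, 0, 1], [0, 1, 1]], dtype=float)
p_w_z, p_z_d = plsa_em(N, L=2)
```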

3.5.1 pLSA as a Matrix Factorization Model

pLSA can also be viewed as a matrix factorization technique, just like the SVD in LSA. The difference lies in the resultant matrices, whose elements are all non-negative in the case of pLSA. In the scheme below, we present how pLSA can be viewed as a matrix factorization model in which we obtain three matrices that bring out better correlations among the vectors in the low-dimensional space.

Assume the number of factors in the pLSA model is pre-defined by the user as K = 3. The original term-document matrix can then be factorized as below:

\[
\begin{array}{c|ccccc}
 & d_1 & d_2 & \cdot & \cdot & d_D \\ \hline
v_1 & 8 & 1 & 1 & 1 & 4 \\
v_2 & 5 & 12 & 0 & 0 & 1 \\
\cdot & 1 & 0 & 1 & 0 & 1 \\
\cdot & 0 & 0 & 0 & 1 & 1 \\
\cdot & 0 & 0 & 0 & 0 & 1 \\
v_W & 0 & 0 & 0 & 0 & 1
\end{array}
\;\approx\; P(w \mid z) \times P(z) \times P(d \mid z) \qquad (3.11)
\]

The matrix comprising each word's contribution to each latent class is written as below:

\[
\begin{array}{c|ccc}
 & k_1 & k_2 & k_3 \\ \hline
v_1 & 1.00 & 0.91 & 1.00 \\
v_2 & 0.44 & 0.57 & 0.84 \\
\cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot \\
v_W & 0.00 & 0.00 & 0.47
\end{array}
\;\approx\; P(w \mid z) \qquad (3.12)
\]

The matrix comprising the topic probabilities $P(z)$ (analogous to the singular values in the SVD factorization) can be written as:

\[
\begin{array}{c|ccc}
 & k_1 & k_2 & k_3 \\ \hline
k_1 & 0.34 & 0.00 & 0.00 \\
k_2 & 0.00 & 0.54 & 0.00 \\
k_3 & 0.00 & 0.00 & 0.001
\end{array}
\;\approx\; P(z) \qquad (3.13)
\]

The final matrix in this model can be written as:

\[
\begin{array}{c|ccccc}
 & d_1 & d_2 & \cdot & \cdot & d_D \\ \hline
k_1 & 0.19 & 0.05 & \cdot & \cdot & 0.10 \\
k_2 & 0.01 & 0.43 & \cdot & \cdot & 0.52 \\
k_3 & 0.03 & 0.45 & \cdot & \cdot & 0.64
\end{array}
\;\approx\; P(d \mid z) \qquad (3.14)
\]

The pLSA model has some limitations: the number of parameters grows linearly with the size of the corpus, the model is not a proper generative model for unseen documents, and it tends to overfit.

3.6 Dirichlet Distribution

The probability density of an L-dimensional Dirichlet distribution over the Multinomial distribution $P = (P_1, P_2, \cdots, P_L)$ is defined as:

\[
\mathrm{Dir}(\alpha_1, \cdots, \alpha_L) = \frac{\Gamma\!\left(\sum_{z} \alpha_z\right)}{\prod_{z} \Gamma(\alpha_z)} \prod_{z=1}^{L} P_z^{\alpha_z - 1} \qquad (3.15)
\]

The parameters of the distribution shown above are $\alpha_1, \cdots, \alpha_L$. The Dirichlet distribution is a conjugate prior of the Multinomial, so it can be regarded as a convenient choice of prior, because this choice makes the mathematical derivation considerably easier. Each element of $\alpha_1, \cdots, \alpha_L$ is called a hyperparameter, and each hyperparameter $\alpha_z$ can be regarded as a form of prior count for the number of times a topic $z$ has been sampled in a document. For example, consider Equation 3.23, where $\alpha_{z_i^d}$ has been added as a prior count. The reason why a Dirichlet makes a convenient choice of prior is primarily that if the data points come from a Multinomial distribution and a Dirichlet prior is placed on the Multinomial parameters, then the posterior distribution of the parameters is also a Dirichlet distribution. When all $\alpha_z$ are set to the same value, the prior is termed symmetric.

$\alpha_1, \cdots, \alpha_L$ are also called the concentration hyperparameters. If the concentration hyperparameter has a value less than 1, then the probability mass will be very concentrated, assuming that the distribution is discrete. If the value is greater than 1, then the mass will be more evenly distributed. We illustrate this phenomenon in Figures 3.7, 3.8 and 3.9.


Figure 3.7: A plot showing 15 random draws from the Dirichlet distribution with α = 0.001 on the left side and α = 0.01 on the right. The dimension of the space is L = 10. We can see that the distribution tends to be very sparse, with only a few components having a high probability value; the others are almost 0.


Figure 3.8: A plot showing 15 random draws from the Dirichlet distribution with α = 0.1 on the left side and α = 1 on the right. The dimension of the space is L = 10. With α = 0.1 the draws are still fairly sparse, whereas with α = 1 the probability mass is spread more evenly across the components.


Figure 3.9: A plot showing 15 random draws from the Dirichlet distribution with α = 10 on the left side and α = 100 on the right. The dimension of the space is L = 10. With these large concentration values the draws are no longer sparse; the probability mass is spread almost evenly across all the components.

In all three figures, we randomly draw 15 samples from a Dirichlet distribution with the dimension of the space equal to 10. In the probabilistic topic modeling paradigm, this dimension can be regarded as the number of topics pre-defined by the user. In Figure 3.7, when we set α = 0.001, we notice that the distribution is very sparse, with just one or two factors having a high value while the rest are all close to zero. In topic modeling terms, this means that if we choose a very small value of α, then each document will be described by only one or two topics. If we instead choose a high value of α, as in Figure 3.9, each document will be described by many topics. In practice, this does not make much sense, as most documents generally describe one or two topics and rarely describe several topics incoherently in one place.
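The sparsity effect described above can be reproduced directly with a few NumPy draws; the particular α values below are illustrative choices spanning roughly the range used in the figures.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 10  # dimension of the distribution (e.g. the number of topics)

for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    # One symmetric Dirichlet draw per concentration value.
    theta = rng.dirichlet(np.full(L, alpha))
    # With alpha << 1 almost all mass sits on one or two components (sparse);
    # with alpha >> 1 the mass is spread nearly evenly across components.
    print(f"alpha={alpha:>6}: max weight = {theta.max():.3f}")
```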


Figure 3.10: Latent Dirichlet Allocation (LDA) model in plate notation shown on the left, and its expanded notation shown on the right.

3.7 Probabilistic Unsupervised Topic Modeling

3.7.1 Latent Dirichlet Allocation Model

Topic models such as the LDA model are based upon the idea that documents exhibit multiple topics [237]. A topic, typically, is a probability distribution over the words in the vocabulary. A topic model is a generative model for documents: the model specifies a simple probabilistic procedure by which documents are generated. To build a new document, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution and draws a word from that topic. Standard statistical techniques can be used to invert this process, inferring the set of topics that were responsible for generating a collection of documents. We present the simplest such topic model, LDA, which exemplifies the idea further. The graphical model of LDA is presented in Figure 3.10, where we first present the model in standard plate notation and then expand the plate notation to show the words in a document. In Figure 3.11, we illustrate the LDA model diagrammatically based on its generative process.

The LDA model, shown in Figure 3.10, posits that documents exhibit multiple topics. The model describes the following generative process for each document in the corpus:

1. For each topic $k \in \{1, \ldots, L\}$, draw $\phi_k$ from Dirichlet($\beta$), where $\phi_k$ is the word distribution for topic $k$ and $\beta$ is the parameter of the Dirichlet prior on the per-topic word distributions.

2. For each document $d \in \{1, \ldots, D\}$:

   (a) Draw a topic proportion $\theta^d$ from Dirichlet($\alpha$), where Dirichlet($\alpha$) is the Dirichlet distribution with parameter $\alpha$ on the per-document topic distributions.

   (b) For each word $w_i^d$ at position $i$ in the document $d$:

      i. Draw a topic $z_i^d$ from Multinomial($\theta^d$).

      ii. Draw a word $w_i^d$ from Multinomial($\phi_{z_i^d}$).
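A minimal simulation of this generative process is sketched below (Python/NumPy); the corpus sizes and hyperparameter values are arbitrary illustrative choices, not settings used in any experiment in this thesis.

```python
import numpy as np

def generate_corpus(D, L, W, N_d, alpha, beta, seed=0):
    """A minimal sketch of the LDA generative process described above."""
    rng = np.random.default_rng(seed)
    # Per-topic word distributions phi_k ~ Dirichlet(beta).
    phi = rng.dirichlet(np.full(W, beta), size=L)
    docs, assignments = [], []
    for _ in range(D):
        # Per-document topic proportions theta_d ~ Dirichlet(alpha).
        theta = rng.dirichlet(np.full(L, alpha))
        z = rng.choice(L, size=N_d, p=theta)                 # topic for each position
        w = np.array([rng.choice(W, p=phi[k]) for k in z])   # word drawn from its topic
        docs.append(w)
        assignments.append(z)
    return docs, assignments, phi

docs, z, phi = generate_corpus(D=5, L=3, W=50, N_d=20, alpha=0.1, beta=0.01)
```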

When the parameters of the model are given, the joint distribution of a topic mixture $\theta^d$, a set of topic assignments $z^d$ for that document $d$, and the words $w^d$ in that document is given by:

\[
P(\theta^d, z^d, w^d \mid \alpha, \beta) = P(\theta^d \mid \alpha) \prod_{i=1}^{N_d} P(z_i^d \mid \theta^d)\, P(w_i^d \mid z_i^d, \beta) \qquad (3.16)
\]



Figure 3.11: Explanation of the LDA graphical model.

By taking the product of the marginal probabilities of all the documents in the collection S, the probability for the entire collection can be written as:

\[
P(S \mid \alpha, \beta) = \prod_{d=1}^{D} \int P(\theta^d \mid \alpha) \left( \prod_{i=1}^{N_d} \sum_{z_i^d} P(z_i^d \mid \theta^d)\, P(w_i^d \mid z_i^d, \beta) \right) d\theta^d \qquad (3.17)
\]

Computing the exact posterior distributions in the LDA model is intractable. Hence one has to resort to approximation techniques. Methods such as variational inference [23], [248], Gibbs sampling [86], [263], collapsed Gibbs sampling [208], [272] and many other techniques have been used in order to compute an approximation. The posterior distribution inferred by the LDA model is written as:

\[
P(\Theta, Z, \Phi \mid W, \alpha, \beta) = \frac{P_0(\Theta, Z, \Phi \mid \alpha, \beta)\, P(W \mid \Theta, Z, \Phi)}{P(W \mid \alpha, \beta)} \qquad (3.18)
\]

where $P_0(\Theta, \Phi, Z) = \left( \prod_{d=1}^{D} P(\theta^d \mid \alpha) \prod_{i} P(z_i^d \mid \theta^d) \right) \prod_{k=1}^{L} P(\phi_k \mid \beta)$ is the joint distribution defined in the model.

In Figure 3.12, we present the process of document generation. The figure shows four topics, where each topic describes one thematically related content area. For example, Topic 4 deals with a "data collection methodology using human annotators". Note that the words are represented as a bag-of-words, where the order of the words is not important. We use a colour coding scheme to indicate which word comes from which topic. For instance, in Document 2 the word "game" comes from Topic 2, and "data" comes from Topic 4. These words are generated in the document according to certain probability values; for example, the word "game" could be generated in Document 2 with probability 0.001, and the word "data" could be generated with probability 0.05. Therefore, words in a document are generated from a topic depending upon the weight of the word in that topic, and if a document contains a mixture of several topics, then the document deals with multiple topics. This is the premise of the LDA model, which posits that a document exhibits multiple topics.


Figure 3.12: A figure showing the process of generation of documents from latent topics. In the figure we can see that documents are generated from words in topics. Some words in documents come from a mixture of topics, while some come from only one topic.


Figure 3.13: A figure showing the inference procedure in a topic model. During inference, we are given a set of documents with observed words. The task of the inference algorithm is to determine the probability of a word in a topic, the probability of a topic in a document, and also the topic assignment of each word in a document.

Figure 3.13 shows the process of statistical inference, where the words in the documents are the observed variables. The task is then to infer the topic model which actually generated the data, i.e. to fit the model. In this process, the probability distribution over words for each topic is computed, the distribution of topics over each document is obtained, and the topic assignment of each word is determined. Such inference procedures are mainly accomplished using sampling techniques, as computing the exact posterior distribution is intractable in probabilistic topic models.
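As a concrete illustration of such a sampling-based inference procedure, the following is a compact sketch of a collapsed Gibbs sampler for LDA, assuming symmetric scalar hyperparameters and documents given as arrays of word ids. It is a didactic sketch of the standard collapsed update $P(z_i = k \mid \cdot) \propto (n_{dk}+\alpha)(n_{kw}+\beta)/(n_k+W\beta)$ rather than an optimized implementation.

```python
import numpy as np

def lda_gibbs(docs, L, W, alpha, beta, iters=200, seed=0):
    """Collapsed Gibbs sampling sketch for LDA (docs: list of word-id arrays)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), L))          # topic counts per document
    n_kw = np.zeros((L, W))                  # word counts per topic
    n_k = np.zeros(L)                        # total words per topic
    z = [rng.integers(L, size=len(d)) for d in docs]   # random initial assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the counts.
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Full conditional over topics, up to a constant.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + W * beta)
                k = rng.choice(L, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    # Smoothed point estimates of the document-topic and topic-word distributions.
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + L * alpha)
    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + W * beta)
    return theta, phi
```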

In Figure 3.14, we present a topic visualization map that shows the words obtained in each topic and also how they are related to the words in other topics. The notion is that if two topics are coherently related to each other, they will lie close to each other in the topic space. Their Kullback-Leibler Divergence (KL-Divergence) value will thus be small in comparison to that of two topics which are semantically unrelated to each other.


Figure 3.14: A two-dimensional map diagrammatically showing the topics obtained from the NIPS document collection when the number of topics K is set to 50. The idea behind the figure is that latent topics which are semantically related to each other tend to cluster close to each other in the topic space, whereas substantially unrelated topical words tend to cluster far apart. This map has been created using the topic modeling toolbox in MATLAB.

3.7.2 LDA as a Matrix Factorization Scheme

When the LDA model is viewed as a matrix factorization scheme, the input to the model is the high-dimensional term-document matrix. Given a pre-defined number of latent topics, the model outputs two matrices: Θ, which consists of the topic distributions per document, and Φ, which consists of the word distributions per topic. In order to exemplify the idea further, we first show the input term-document matrix below:

\[
\begin{array}{c|ccccc}
 & d_1 & d_2 & \cdot & \cdot & d_D \\ \hline
v_1 & 8 & 1 & 1 & 1 & 4 \\
v_2 & 5 & 12 & 0 & 0 & 1 \\
\cdot & 1 & 0 & 1 & 0 & 1 \\
\cdot & 0 & 0 & 0 & 1 & 1 \\
\cdot & 0 & 0 & 0 & 0 & 1 \\
v_W & 0 & 0 & 0 & 0 & 1
\end{array}
\;\approx\; \Phi \times \Theta \qquad (3.19)
\]

Φ can be represented as:

\[
\begin{array}{c|ccc}
 & k_1 & k_2 & k_3 \\ \hline
v_1 & 1.00 & 0.91 & 1.00 \\
v_2 & 0.44 & 0.57 & 0.84 \\
\cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot \\
v_W & 0.00 & 0.00 & 0.47
\end{array}
\;\approx\; P(w \mid z) \approx \Phi \qquad (3.20)
\]

The matrix Θ can be written as:

\[
\begin{array}{c|ccccc}
 & d_1 & d_2 & \cdot & \cdot & d_D \\ \hline
k_1 & 0.19 & 0.05 & \cdot & \cdot & 0.10 \\
k_2 & 0.01 & 0.43 & \cdot & \cdot & 0.52 \\
k_3 & 0.03 & 0.45 & \cdot & \cdot & 0.64
\end{array}
\;\approx\; P(d \mid z) \approx \Theta \qquad (3.21)
\]

Just as in the pLSA and NMF models, all the elements in these matrices are non-negative.

3.8 Topic Models with Word Order

3.8.1 Bigram Topic Model (BTM)

Wallach [252] proposed an extension to the LDA model called the BTM. The BTM model incorporates the Hierarchical Dirichlet Language Model (HDLM) [172] into the LDA model, thereby introducing the notion of word order. Instead of taking the word-document co-occurrence matrix as input, the model takes the entire document as the input. The main idea behind the model is that a word is not only generated by its topic, but also by the previous word. This builds a notion of sequential word dependence, using the first-order Markov assumption.

The generative process of the model can be written as:

1. Draw a Multinomial distribution $\phi_{zw}$ from a Dirichlet prior ($\beta$) for each topic $z$ and each word $w$.


Figure 3.15: Graphical model showing the BTM.

2. For each document $d$:

   (a) Draw a Multinomial distribution $\theta^d$ from a Dirichlet prior ($\alpha$).

   (b) For each word $w_i^d$ in the document $d$:

      i. Draw $z_i^d$ from Multinomial($\theta^d$).

      ii. Draw $w_i^d$ from Multinomial($\phi_{z_i^d w_{i-1}^d}$).
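The role of the previous word in this process can be made concrete with a short simulation sketch; the dummy start word and all parameter values are illustrative assumptions.

```python
import numpy as np

def generate_btm_doc(N, L, W, alpha, beta, seed=0):
    """Sketch of the BTM generative process: each word depends on its topic
    and on the previous word (first-order Markov assumption)."""
    rng = np.random.default_rng(seed)
    # One word distribution per (topic, previous word) pair.
    phi = rng.dirichlet(np.full(W, beta), size=(L, W))
    theta = rng.dirichlet(np.full(L, alpha))      # document topic proportions
    words = [0]                                   # a dummy start word w_0
    for _ in range(N):
        z = rng.choice(L, p=theta)                # topic for this position
        w = rng.choice(W, p=phi[z, words[-1]])    # word given (topic, previous word)
        words.append(w)
    return words[1:]

doc = generate_btm_doc(N=30, L=4, W=100, alpha=0.5, beta=0.01)
```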

Computing the exact posterior distributions in the BTM model is intractable. Wallach resorted to the EM [59] technique in order to approximate the posterior. A limitation of the EM algorithm is that it can get stuck in local minima. For interested readers, we present the Gibbs sampling derivation for the BTM model in Appendix C. Recently, Noji et al. [194] have shown better sampling schemes for the bigram topic model. The posterior distribution inferred by the BTM model is written as:

\[
P(\Theta, Z, \Phi \mid W^{w_{i-1}}, W^{w_i}, \alpha, \beta) = \frac{P_0(\Theta, Z, \Phi \mid \alpha, \beta)\, P(W^{w_{i-1}}, W^{w_i} \mid \Theta, Z, \Phi)}{P(W^{w_{i-1}}, W^{w_i} \mid \alpha, \beta)} \qquad (3.22)
\]

where $P_0(\Theta, \Phi, Z) = \left( \prod_{d=1}^{D} P(\theta^d \mid \alpha) \prod_{i} P(z_i^d \mid \theta^d) \right) \prod_{k=1}^{K} P(\phi_k \mid \beta)$ is the joint distribution defined in the model.

3.8.2 LDA-Collocation Model (LDACOL)

The LDACOL model, as described in [87] and shown in Figure 3.16, is a topic model that also relaxes the bag-of-words assumption and finds words as "single chunks", for example "white house", rather than discovering words independently. The model adds a new set of random variables $x$ to the LDA model. These are binary variables which indicate the bigram status of a word, i.e. whether the word at position $i$, denoted $w_i^d$, forms a bigram with the word at position $i-1$, denoted $w_{i-1}^d$. If two words in sequence form a bigram then $x_i^d = 1$, else $x_i^d = 0$. In this way the model has the ability to generate both unigram and bigram words. Each word in the model has two assignments: one is the topic assignment and the other is the collocation assignment. If $x_i^d = 1$, then $w_i^d$ is generated from a distribution that depends on $w_{i-1}^d$, i.e. $P(w_i^d \mid w_{i-1}^d, x_i^d = 1)$; otherwise $w_i^d$ is generated from the distribution of its topic, $P(w_i^d \mid z_i^d, x_i^d = 0)$. The value of $x_i^d$ is chosen based on the previous word $w_{i-1}^d$. The LDACOL model is shown in Figure 3.16 in standard plate notation, where shaded variables denote observed variables and plates signify repetition; we can note that the words are connected to each other in sequence. Also, the bigram switch variables are generated by the preceding words as we move forward in the sequence.


Figure 3.16: The graphical model in plate notation of the LDACOL model.

What we infer from the graphical model is that the LDACOL model is capable of capturing dependencies between words in sequence. An advantage of the model is that it can generate both unigram and bigram words based on co-occurrences, and it has empirically been shown to perform better than the BTM model, which only generates bigrams. BTM is also an n-gram topic model, but it does not possess the ability to generate variable-length n-grams such as unigrams or bigrams based on local contextual information.

The generative process of the LDACOL model is defined as follows:

1. Draw Multinomial($\phi_z$) from Dirichlet($\beta$) for each topic $z$.

2. Draw Bernoulli($\psi_w$) from Beta($\gamma$) for each word $w$.

3. Draw Multinomial($\sigma_w$) from Dirichlet($\delta$) for each word $w$.

4. For each document $d$:

   (a) Draw Multinomial($\theta^d$) from Dirichlet($\alpha$).

   (b) For each word $w_i^d$ in the document $d$:

      i. Draw $x_i^d$ from Bernoulli($\psi_{w_{i-1}^d}$).

      ii. Draw $z_i^d$ from Multinomial($\theta^d$).

      iii. Draw $w_i^d$ from Multinomial($\sigma_{w_{i-1}^d}$) if $x_i^d = 1$, else draw $w_i^d$ from Multinomial($\phi_{z_i^d}$).

As one can notice from the graphical model of the LDACOL model, not every term in a bigram is assigned to a topic; only the first term has a topic assignment. Wang et al. [175] described a method to give a topic assignment to every term in a bigram in the LDACOL model; for example, they suggested that the topic of the first word in a bigram can be used as the topic of the rest of the n-gram. But this assumption might not generate reasonable topical words. For longer phrases, which can be obtained by concatenating consecutive bigram status variables (i.e. consecutive $x_i^d = 1$), we can assume the most frequently occurring topic as the topic of the entire n-gram or phrase. Readers who are interested in the full derivation are requested to see Appendix D.

The Gibbs update formulae for the LDACOL model can be written as:

\[
P(z_i^d \mid \mathbf{w}, \mathbf{z}_{\neg i}, \mathbf{x}, \alpha, \beta, \gamma, \delta) \propto
\begin{cases}
\dfrac{\alpha_{z_i^d} + q_{d z_i^d} - 1}{\sum_{z=1}^{K} (\alpha_z + q_{dz}) - 1} \times \dfrac{\beta_{w_i^d} + n_{z_i^d w_i^d} - 1}{\sum_{v=1}^{W} (\beta_v + n_{z_i^d v}) - 1} & \text{if } x_i^d = 0 \\[3ex]
\dfrac{\alpha_{z_i^d} + q_{d z_i^d} - 1}{\sum_{z=1}^{K} (\alpha_z + q_{dz}) - 1} \times \dfrac{\delta_{w_i^d} + m_{w_{i-1}^d w_i^d} - 1}{\sum_{v=1}^{W} (\delta_v + m_{w_{i-1}^d v}) - 1} & \text{if } x_i^d = 1
\end{cases}
\qquad (3.23)
\]
where the case $x_i^d = 0$ coincides with the standard LDA update, and the case $x_i^d = 1$ samples from the document-topic distribution together with the previous-word distribution.

and,

\[
P(x_i^d \mid \mathbf{w}, \mathbf{z}, \mathbf{x}_{\neg i}, \alpha, \beta, \gamma, \delta) \propto
\frac{\gamma_{x_i^d} + p_{w_{i-1}^d x_i^d} - 1}{\sum_{s=0}^{1} (\gamma_s + p_{w_{i-1}^d s}) - 1} \times
\begin{cases}
\dfrac{\beta_{w_i^d} + n_{z_i^d w_i^d} - 1}{\sum_{v=1}^{W} (\beta_v + n_{z_i^d v}) - 1} & \text{if } x_i^d = 0 \\[3ex]
\dfrac{\delta_{w_i^d} + m_{w_{i-1}^d w_i^d} - 1}{\sum_{v=1}^{W} (\delta_v + m_{w_{i-1}^d v}) - 1} & \text{if } x_i^d = 1
\end{cases}
\qquad (3.24)
\]

And, the posterior estimates of θ,φ,ψ,σ can be written as follows:

\[
\hat{\theta}^{d}_{z} = \frac{\alpha_z + q_{dz}}{\sum_{t=1}^{K}(\alpha_t + q_{dt})} \quad (3.25)
\qquad
\hat{\phi}_{zw} = \frac{\beta_w + n_{zw}}{\sum_{v=1}^{W}(\beta_v + n_{zv})} \quad (3.26)
\]
\[
\hat{\sigma}_{wv} = \frac{\delta_v + m_{wv}}{\sum_{v=1}^{W}(\delta_v + m_{wv})} \quad (3.27)
\qquad
\hat{\psi}_{ws} = \frac{\gamma_s + p_{ws}}{\sum_{s=0}^{1}(\gamma_s + p_{ws})} \quad (3.28)
\]
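Given the count matrices accumulated by such a sampler, the point estimates in Equations 3.25–3.28 amount to simple smoothed normalizations. The following sketch shows this, assuming symmetric scalar hyperparameters and count arrays produced by a run of the sampler (the argument names are illustrative).

```python
import numpy as np

def ldacol_posterior_estimates(q_dz, n_zw, p_ws, m_wv, alpha, beta, gamma, delta):
    """Posterior point estimates (Eqs. 3.25-3.28) from Gibbs count matrices.

    q_dz: document-topic counts, n_zw: topic-word counts,
    p_ws: previous-word/switch counts (s in {0, 1}),
    m_wv: previous-word/next-word counts."""
    theta = (q_dz + alpha) / (q_dz + alpha).sum(axis=1, keepdims=True)   # Eq. 3.25
    phi = (n_zw + beta) / (n_zw + beta).sum(axis=1, keepdims=True)       # Eq. 3.26
    sigma = (m_wv + delta) / (m_wv + delta).sum(axis=1, keepdims=True)   # Eq. 3.27
    psi = (p_ws + gamma) / (p_ws + gamma).sum(axis=1, keepdims=True)     # Eq. 3.28
    return theta, phi, psi, sigma
```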

3.8.3 Topical N-gram Model (TNG)

The TNG model addresses some shortcomings of the LDACOL model. The TNG model gives a topic assignment to every term in a bigram, although the topic assignments may differ. The intuitive idea of the TNG model is that it is able to decide whether to form a bigram for the same two consecutive words depending on co-occurrences. The model also incorporates the bigram switching binary variables $x$, which signify the bigram status of a word with respect to the previous word in sequence. At the beginning of each document the model assumes a dummy word $w_0^d$. The model can be regarded as a more powerful generalization of the bigram topic model proposed in [252] and the LDACOL model. The model adopts a post-processing strategy to give the same topic assignment to every term in an n-gram; for example, in [261] the authors assumed the topic of the phrase to be the topic of the "head noun". But the TNG model can be modified to give the same latent topic assignment to all the words in an n-gram. We present a modified version of the TNG model that gives the same topic assignment to all the words in an n-gram in Figure 3.18. In [166], the authors have suggested a way to solve some existing problems in the TNG model by incorporating the HPYP [203] in the n-gram topic model. We present the graphical model of the TNG model in Figure 3.17.

The generative process of the TNG model is described as:

1. Draw Multinomial (φz) from Dirichlet (β) for each topic z

2. Draw Bernoulli (ψzw) from Beta (γ) for each topic z and each word w

3. Draw Multinomial (σzw) from Dirichlet (δ) for each topic z and each word w

4. For each document d

(a) Draw Discrete($\theta^d$) from Dirichlet($\alpha$).


Figure 3.17: The graphical model in plate notation of the TNG model.


Figure 3.18: A graphical model showing a modified TNG model. This model, in contrast to the previous graphical model, has the ability to give the same topic assignment to the words in a bigram.

(b) For each word $w_i^d$ in document $d$:

   i. Draw $z_i^d$ from Multinomial($\theta^d$).

   ii. Draw $x_i^d$ from Bernoulli($\psi_{z_{i-1}^d w_{i-1}^d}$).

   iii. Draw $w_i^d$ from Multinomial($\sigma_{z_i^d w_{i-1}^d}$) if $x_i^d = 1$, else draw $w_i^d$ from Multinomial($\phi_{z_i^d}$).

The Gibbs update equations can be written as:

\[
P(z_i^d \mid \mathbf{w}, \mathbf{z}_{\neg i}, \mathbf{x}, \alpha, \beta, \gamma, \delta) \propto
\frac{\alpha_{z_i^d} + q_{d z_i^d} - 1}{\sum_{z=1}^{K} (\alpha_z + q_{dz}) - 1} \times
\begin{cases}
\dfrac{\beta_{w_i^d} + n_{z_i^d w_i^d} - 1}{\sum_{v=1}^{W} (\beta_v + n_{z_i^d v}) - 1} & \text{if } x_i^d = 0 \\[3ex]
\dfrac{\delta_{w_i^d} + m_{z_i^d w_{i-1}^d w_i^d} - 1}{\sum_{v=1}^{W} (\delta_v + m_{z_i^d w_{i-1}^d v}) - 1} & \text{if } x_i^d = 1
\end{cases}
\qquad (3.29)
\]

and,

\[
P(x_i^d \mid \mathbf{w}, \mathbf{z}, \mathbf{x}_{\neg i}, \alpha, \beta, \gamma, \delta) \propto
\frac{\gamma_{x_i^d} + p_{z_{i-1}^d w_{i-1}^d x_i^d} - 1}{\sum_{s=0}^{1} (\gamma_s + p_{z_{i-1}^d w_{i-1}^d s}) - 1} \times
\begin{cases}
\dfrac{\beta_{w_i^d} + n_{z_i^d w_i^d} - 1}{\sum_{v=1}^{W} (\beta_v + n_{z_i^d v}) - 1} & \text{if } x_i^d = 0 \\[3ex]
\dfrac{\delta_{w_i^d} + m_{z_i^d w_{i-1}^d w_i^d} - 1}{\sum_{v=1}^{W} (\delta_v + m_{z_i^d w_{i-1}^d v}) - 1} & \text{if } x_i^d = 1
\end{cases}
\qquad (3.30)
\]

The posterior estimates of θ,φ,ψ,σ can be written as follows:

\[
\hat{\theta}^{d}_{z} = \frac{\alpha_z + q_{dz}}{\sum_{t=1}^{K}(\alpha_t + q_{dt})} \quad (3.31)
\qquad
\hat{\phi}_{zw} = \frac{\beta_w + n_{zw}}{\sum_{v=1}^{W}(\beta_v + n_{zv})} \quad (3.32)
\]
\[
\hat{\sigma}_{zwv} = \frac{\delta_v + m_{zwv}}{\sum_{v=1}^{W}(\delta_v + m_{zwv})} \quad (3.33)
\qquad
\hat{\psi}_{zws} = \frac{\gamma_s + p_{zws}}{\sum_{s=0}^{1}(\gamma_s + p_{zws})} \quad (3.34)
\]

3.9 Topic Segmentation Models

We describe an existing topic segmentation model known as LDSEG [230], shown in Figure 3.19, which assumes that the order of words in a segment is not important. The LDSEG model detects the boundaries in text where topics change; the text within the respective boundaries is called a segment. Each segment is assigned a super-topic from a predefined number of super-topics. Then, each segment is modeled based on its word content. The latent topics in each segment are called word-topics. Each super-topic is assumed to be a mixture of word-topics, where the mixture coefficients uniquely specify the super-topic. Unigrams are assigned to word-topics. In the graphical model, $z$ is the word-topic variable for the corpus and $y$ is the super-topic variable for the corpus. $S$ denotes the number of segments, $N_s^d$ is the number of words in each segment, and $K$ is the number of super-topics. This model assumes "sentences" as the segment unit; hence the super-topics are assigned to sentences. LDSEG orders the sentences of each document and assumes a Markov structure on the topic distribution of sentences. There is a binary switching variable $c$ for the topic of each sentence. The word probabilities are modeled conditioned on the topics with an $L \times W$ matrix. The model assumes a Dirichlet prior for drawing the parameters of the word distribution. $\tau$ defines the mixing proportion of the super-topics in a document. $\theta$ is the mixing proportion of the word-topics in a text segment. $\pi$ is the parameter of the Binomial distribution. $\rho$ constitutes the parameter of the Dirichlet prior on the super-topics. $\Omega$ is defined as the prior of the Binomial distribution. $\alpha$ is a matrix of $K \times L$ dimensions, whose row $i$ represents the mixing proportion of the word-topics in super-topic $i$; the Dirichlet prior parameter $\alpha$ is estimated during the learning phase. $\beta$ is the parameter of the prior probability for the distribution of words conditioned on the word-topics. $\theta^{(s)}$ specifies the mixing proportion of the word-topics in segment $s$; this mixing proportion depends on the super-topic from which the current segment of text has been generated. Apart from segmentation, the model can also cluster documents based on the super-topics. Approximate inference in the model is achieved through a moment-matching algorithm.

In Figure 3.20, we also present the component of the model that generates the word-topics. We can see that this portion matches the LDA model, and hence it generates unigrams in each segment.

In Figure 3.21, the component that generates the super-topics is highlighted. The variable c gives the segment change-points in the document, and the ordering of this variable in the document tells us where the segment changes in the document.

The generative process of the model can be written as:

1. Draw $\tau$ from Dirichlet($\rho$), where $\tau$ is the mixing proportion of the segment-topics in a document, and $\rho$ is the parameter of the Dirichlet prior on the super-topics.

2. Draw Discrete($\phi_z$) from Dirichlet($\beta$) for each word-topic $z$, where $\phi$ is the parameter of the Multinomial distribution of words conditioned on the word-topics, and $\beta$ is the parameter of the prior probability for the distribution of words conditioned on the word-topics.

3. For each segment s from S

(a) Draw $y_s$ for $s$ as the same super-topic as its previous super-topic $y_{s-1}$ with probability $P(c_s = 1) = \pi$, where $\pi$ is the parameter of the Binomial distribution.


Figure 3.19: The graphical model of the LDSEG model in plate notation. The model depicts a hierarchy of two topic levels, one of which generates the super-topic and the other the word-topic. The variable c is the segment change-point variable.


Figure 3.20: The graphical model of the LDSEG model where the component that generates the super-topics is highlighted.


Figure 3.21: The graphical model of the LDSEG model where the component that generates the word-topics is highlighted.


Figure 3.22: An illustration of how the LDSEG model segments a document into two paragraphs with different super-topics. The super-topics in turn consist of a mixture of word-topics, and word-topics contain unigram words.

(b) Otherwise, draw a segment-topic $y_s$ for the segment from Multinomial($\tau$).

(c) Draw $\theta^{(s)}$ from Dirichlet($\alpha, y_s$), where $\alpha$ is a $K \times L$ matrix in which each row represents the mixing proportion of the word-topics in a super-topic.

(d) For each of the $N_s^d$ words in the segment $s$:

   i. Draw $z_{si}^d$ from Discrete($\theta^{(s)}$), where $\theta^{(s)}$ is the mixing proportion of the word-topics in the text segment $s$, and $z_{si}^d$ is the word-topic assignment for the word $w_{si}^d$ in segment $s$ of document $d$.

   ii. Draw $w_{si}^d$ from Discrete($\phi_{z_{si}^d}$).

In Figure 3.22, we present an illustration of the hierarchy of topics generated by the LDSEG model. In this illustration, we have considered a segment to be a paragraph. We see from the figure that two consecutive paragraphs in a document belong to different super-topics, meaning that the super-topic changes from one segment to the next. We also notice that each segment-topic comprises a mixture of word-topics, and word-topics consist of unigram words.

3.10 Bayesian Nonparametrics in Topic Modeling

Finding the adequate complexity of a real dataset using a parametric model becomes very difficult because such models assume a restricted functional form. Some have suggested using a cross-validation scheme [298] in order to find the number of topics that may represent the true underlying characteristics of the data. However, such cross-validation schemes might be too time-consuming and computationally demanding [56], especially when dealing with a large document collection. In contrast, nonparametric models do not assume such a restriction, and thus allow the model complexity to grow with the data. However, it should be noted that nonparametric models are not free of parameters. One needs parameters for a computationally tractable representation. But, as stated earlier, the number of parameters in a nonparametric model is not bounded and grows with the characteristics of the data. Nonparametric Bayesian models typically learn distributions on function spaces and can thus involve infinitely many parameters.

We will begin with a description of the related models, and subsequently use them as a basis to describe our model, which extends existing Bayesian nonparametric topic models.

3.10.1 Dirichlet Processes (DP)

A DP [68] is a distribution over distributions, in the sense that a draw from this distribution is itself a distribution. Dirichlet processes are often used in Bayesian nonparametrics. The Dirichlet process has Dirichlet-distributed finite-dimensional marginal distributions. Draws from a Dirichlet process are discrete and are not pre-defined by a fixed number of parameters.



Figure 3.23: An example of partitions induced by a Dirichlet process.

Let G be a random distribution which is distributed according to a DP, let H be a distribution over Θ, and let α be a positive real number. Suppose there is a finite measurable partition $A_1, A_2, \cdots, A_r$ of Θ, as shown in Figure 3.23. The vector $(G(A_1), G(A_2), \cdots, G(A_r))$ is also random since $G$ is random. We write $G \sim \mathrm{DP}(\alpha, H)$ with base distribution $H$ and concentration parameter $\alpha$; $G$ is Dirichlet process distributed if

\[
(G(A_1), G(A_2), \cdots, G(A_r)) \sim \mathrm{Dir}(\alpha H(A_1), \alpha H(A_2), \cdots, \alpha H(A_r))
\]

for every finite measurable partition $A_1, A_2, \cdots, A_r$ of Θ.

The base distribution H is the mean of the DP, and the concentration parameter can be regarded as an inverse variance. Inquisitive readers are requested to consult [245] for more technical details. The stick-breaking representation [229], the Pólya urn model [108], and the Chinese Restaurant Process (CRP) [3] metaphors can be used to describe the Dirichlet process.

The following illustration will exemplify the concept of the DP even further, and will help us understand some of the more sophisticated techniques later in this thesis. The code for the plots has been adapted from the source given below⁵ in order to explain the concept of the DP.

Consider a Pólya urn model which is run several times with the same base distribution H. In the plots shown below, the base distribution is considered as a unit normal distribution. In each run (out of a total of five runs), there will be a different distribution of colours in the urn, because the entire process is random. Consider the probability density plots generated by sampling from the Pólya urn model with the standard unit normal as the base distribution. The number of coloured balls chosen in the simulation is 100.

⁵http://blog.echen.me/2012/03/20/infinite-mixture-models-with-nonparametric-bayes-and-the-dirichlet-process/

Figure 3.24: The figure depicts the density plots obtained by sampling from the Pólya urn model where the base distribution is a standard unit normal. The concentration parameter α is varied to show its effect. If the concentration parameter is small, then more and more points tend to fall in the same cluster and this results in fewer clusters, but if the concentration parameter is large then points are almost evenly distributed across the clusters. In the figure above, when α = 0.001, we can see that points are concentrated in only a few clusters. Then we increase α to 0.01 and points tend to disperse.

Figures 3.24, 3.25 and 3.26 depict some interesting patterns obtained by varying the concentration parameter. We see that as α increases, which means that we are sampling more new coloured balls from our base distribution, the colours in the urn tend towards a unit normal, which is exactly our base distribution.
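A minimal version of this urn simulation can be written as follows; the function name and parameter values are illustrative, and only the qualitative behaviour (more distinct colours as α grows) matters here.

```python
import numpy as np

def polya_urn(n, alpha, seed=0):
    """Sketch of the Polya urn scheme with a standard unit-normal base distribution.

    With probability alpha/(alpha + i) a new "colour" is drawn from the base
    distribution; otherwise an existing ball is copied uniformly at random."""
    rng = np.random.default_rng(seed)
    balls = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            balls.append(rng.standard_normal())   # a brand-new colour from H
        else:
            balls.append(balls[rng.integers(i)])  # copy a previously drawn colour
    return np.array(balls)

draws = polya_urn(100, alpha=1.0)
print("number of distinct colours:", len(np.unique(draws)))
```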

Figure 3.25: The figure depicts the density plots obtained by sampling from the Pólya urn model where the base distribution is a standard unit normal. The concentration parameter α is varied to show its effect. If the concentration parameter is small, then more and more points tend to fall in the same cluster and this results in fewer clusters, but if the concentration parameter is large then points are almost evenly distributed across the clusters. In the figure above, we can see that with α = 1 the points are less dispersed than with α = 10.

Figure 3.26: The figure depicts the density plots obtained by sampling from the Pólya urn model where the base distribution is a standard unit normal. The concentration parameter α is varied to show its effect. If the concentration parameter is small, then more and more points tend to fall in the same cluster and this results in fewer clusters, but if the concentration parameter is large then points are almost evenly distributed across the clusters. In the figures above, we can see that in both cases the points are almost equally dispersed across the urns.


Figure 3.27: The figure depicts the stick breaking phenomenon where a unit length stick is repeatedly broken into smaller pieces. The concentration parameter controls the weights of the broken sticks, for example, in the figure above where the value of α is low, the stick weights are concentrated on the first few weights which suggests that the data points are concentrated on a few clusters.

3.10.2 Stick Breaking Construction

The stick-breaking metaphor can be described in the following way. Consider a stick of unit length. In order to determine where the stick is to be broken, we generate a random number from a Beta distribution, which gives the point at which the stick is broken. The remaining piece then becomes the stick that we break in the next sampling step. If we iteratively keep generating samples from the Beta distribution and keep breaking the sticks based on the numbers thus generated, in the end we obtain the proportion of points that should belong to each group.
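A truncated sketch of this construction is given below; truncating at a fixed number of sticks is an approximation made purely for illustration.

```python
import numpy as np

def stick_breaking(alpha, n_sticks=10, seed=0):
    """Truncated stick-breaking sketch: break a unit-length stick n_sticks times."""
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=n_sticks)      # break points from Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining                      # length of each broken-off piece
    return weights

# Small alpha -> mass concentrated on the first few sticks (few clusters);
# large alpha -> mass spread over many sticks.
for alpha in (1.0, 10.0, 100.0):
    w = stick_breaking(alpha)
    print(f"alpha={alpha:>5}: largest weight = {w.max():.3f}")
```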


Figure 3.28: The figure depicts the stick breaking phenomenon where a unit length stick is repeatedly broken into smaller pieces. The concentration parameter controls the weights of the broken sticks.


Figure 3.29: The figure depicts the stick breaking phenomenon where a unit length stick is repeatedly broken into smaller pieces. The concentration parameter controls the weights of the broken sticks, for example, in the figure above where the value of α is high, the stick weights are dispersed evenly. 84

3.10.3 The Dirichlet Process Mixture Model (DPMM)

In the Dirichlet process mixture model, the Dirichlet process (DP) is used as a nonparametric prior in a hierarchical Bayesian setting. When applied to the clustering problem, the number of parameters in this model grows with the complexity of the data. Let GEM(α) (the GEM distribution is named after Griffiths, Engen and McCloskey) denote the stick-breaking distribution with parameter α, and let H be the global base probability measure. The definition of the model, as shown in Figure 3.30 (a), is given below:

1. Draw a discrete distribution $\theta$ from GEM($\alpha$).

2. Draw $\phi_k$ from $H(\beta)$ for each mixture component $k$.

3. For each observed variable $w_i^d$ in the document $d$:

   (a) Draw the latent topic $z_i^d$ from $\theta$.

   (b) Draw $w_i^d$ from $F(\phi_{z_i^d})$.

In the model shown in Figure 3.30, words are sampled from some parameterized family $F(\phi)$. Each observation $w_i^d$ is based on an independently sampled parameter $\hat{\phi}_i^d$ (refer to Figure 3.30 (b)). Therefore:

\[
\hat{\phi}_i^d \sim G, \qquad w_i^d \sim F(\hat{\phi}_i^d) \qquad (3.35)
\]

In order to bring about higher flexibility and robustness in the Dirichlet process mixture model, a Dirichlet process prior, $G \sim \mathrm{DP}(\alpha, H)$, is placed on the latent parameter distribution.


Figure 3.30: Directed graphical representation of an infinite Dirichlet process mixture model in plate notation. The figure in (a) shows the stick-breaking representation, whereas the figure in (b) shows an alternative distributional form.

According to the stick-breaking construction:

\[
G(\phi) = \sum_{k=1}^{\infty} \theta_k\, \delta(\phi, \phi_k), \qquad \theta \sim \mathrm{GEM}(\alpha), \qquad \phi_k \sim H(\beta), \quad k = 1, 2, 3, \cdots
\]

In this model, $F(\phi)$ is generally chosen as an exponential family of probability densities and $H(\beta)$ as its corresponding conjugate prior.

Let $z_i^d$ denote the unique cluster index associated with $w_i^d$; the generative process of Equation 3.35 can then be equivalently expressed as (see Figure 3.30 (a)):

\[
z_i^d \sim \theta, \qquad w_i^d \sim F(\phi_{z_i^d})
\]

Marginalizing these indicator variables reveals an infinite mixture model of the following form:

\[
P(w_i^d \mid \theta, \phi_1, \phi_2, \cdots) = \sum_{k=1}^{\infty} \theta_k\, f(w_i^d \mid \phi_k) \qquad (3.36)
\]

The Pólya urn scheme is also known as the CRP. In the CRP, previously drawn values of φ have strictly positive probability of being re-drawn, which makes the underlying probability measure G discrete with probability one [68]. We obtain a Dirichlet process mixture model when we use a DP at the top of the hierarchical model.

3.10.4 Chinese Restaurant Process (CRP)

The CRP is a single-parameter distribution over partitions of the integers [204]. The process is described by considering a Chinese restaurant with an infinite number of tables. N customers arrive in sequence, labeled with the integers $\{1, 2, \cdots, N\}$, and each customer sits down at a randomly chosen table. The first customer sits at the first table. After N customers have sat down, their configuration at the tables represents a random partition. The probability of a customer sitting at a table is computed from the number of other customers already sitting at that table. Let each $z_i^d$ be distributed according to $G$, let $z_i^d$ denote the table assignment of the $i$-th customer, and assume that the customers $z_{1:(i-1)}^d$ occupy $K$ tables. Let $n_k$ be the number of customers currently sitting at table $k$. The $n$-th customer then sits at table $k$ with probability $\frac{n_k}{\alpha + n - 1}$ and at a new table with probability $\frac{\alpha}{\alpha + n - 1}$. More formally, this is written as:

\[
P(\text{occupied table } k \mid \text{previous customers}) = \frac{n_k}{\alpha + n - 1} \qquad (3.37)
\]
\[
P(\text{next unoccupied table} \mid \text{previous customers}) = \frac{\alpha}{\alpha + n - 1} \qquad (3.38)
\]

The conditional distribution of $z_i^d$ given $z_1^d, z_2^d, \cdots, z_{i-1}^d$, with $G$ integrated out, is given by:

\[
z_i^d \mid z_1^d, z_2^d, \cdots, z_{i-1}^d, \alpha, H \;\sim\; \frac{1}{i - 1 + \alpha} \sum_{l=1}^{i-1} \delta_{z_l^d} + \frac{\alpha}{i - 1 + \alpha}\, H \qquad (3.39)
\]

Let $z_1^d, z_2^d, \cdots, z_{i-1}^d$ take on $K$ distinct values $\hat{\phi}_1, \hat{\phi}_2, \cdots, \hat{\phi}_K$, and let $m_k$ be the number of $z_l^d$, $1 \le l < i$, that are equal to $\hat{\phi}_k$. The predictive distribution in Equation 3.39 then places probability $\frac{m_k}{i-1+\alpha}$ on each existing value $\hat{\phi}_k$ and probability $\frac{\alpha}{i-1+\alpha}$ on a new draw from $H$.
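The seating process can be simulated directly from Equations 3.37 and 3.38, as in the following sketch; the parameter values are illustrative.

```python
import numpy as np

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sketch of the CRP: seat customers according to Eqs. 3.37-3.38."""
    rng = np.random.default_rng(seed)
    tables = []                      # tables[k] = number of customers at table k
    seating = []
    for i in range(n_customers):
        # n_k/(alpha + i) for each occupied table, alpha/(alpha + i) for a new one.
        probs = np.array(tables + [alpha], dtype=float)
        probs /= alpha + i
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)         # open a new table
        else:
            tables[k] += 1
        seating.append(k)
    return seating, tables

seating, tables = chinese_restaurant_process(100, alpha=2.0)
print("number of occupied tables:", len(tables))
```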

3.10.5 Hierarchical Dirichlet Processes

In order to circumvent the limitation prevalent in parametric topic models such as the LDA model, Teh et al. [247] proposed the HDP model depicted in Figure 3.31. This model can be regarded as a nonparametric version of the LDA model [56]. HDP is a nonparametric Bayesian model, i.e. a Bayesian model on an ∞-dimensional parameter space; for nonparametric models, the number of parameters grows with the sample size. Incorporating the hierarchical structure in the DPMM introduces a mixed-membership property in which sharing among the clusters exists. This sharing among the clusters brings out a variety of relationships among the clusters in the topic space [215]. Inquisitive readers are requested to consult [247], [70], [239] for more details.

Given a collection of text documents, HDP is characterized by a set of random probability measures $G_d$, one for each document $d$ in the collection, together with a global random probability measure $G_0$ which is itself drawn from a Dirichlet Process (DP) with base probability measure $H$. The global measure $G_0$ selects all the possible topics from the base measure $H$, and then each $G_d$ draws the topics necessary for the document $d$ from $G_0$.


Figure 3.31: The depiction of the HDP model in three different schemes. The one on the top left is in the stick-breaking construction, and the top right figure depicts the HDP model in an alternative distributional form. The figure on the bottom depicts the Chinese Restaurant Franchise scheme.

The generative process of the model shown in Figure 3.31 (c) is described as:

1. Draw global probability measure G0 from DP(η,H)

2. For each document d in the collection

(a) Draw Gd from DP(α,G0)

(b) For each word $w_i^d$:

   i. Draw $z_i^d$ from $G_d$.

   ii. Draw $w_i^d$ from Multinomial($z_i^d$).

where $\eta$ and $\alpha$ are the concentration parameters that govern the variability around $G_0$ and $G_d$ respectively. The base probability measure $H$ provides the prior distribution for the factors $z_i^d$. Each $z_i^d$ is a factor corresponding to a single observation $w_i^d$, which is the word at position $i$ in the document $d$. The model in this distributional form is depicted in Figure 3.31 (c).

The HDP model is constructed by first sampling a global probability measure $G_0$ from a Dirichlet process DP($\eta$, $H$), which defines a set of shared clusters; this is shown in Figure 3.31 (c). The stick-breaking construction, which is another metaphor for the HDP process and is shown in Figure 3.31 (a), is a constructive definition of a DP-distributed random distribution in which we assume an infinite number of unit-length sticks that are imagined to be broken off successively. In order to determine where the stick is to be broken, a random variable $G_0(\phi)$ is sampled from a Beta distribution, which is written as:

\[
G_0(\phi) \sim \mathrm{Beta}(1, \eta) \qquad (3.41)
\]

where $\eta$ is a parameter that determines the peakiness of the distribution.

Each $\phi_z$ is drawn from a base distribution $H(\beta)$, which could be either discrete or continuous. When $\mu \sim \mathrm{GEM}(\eta)$, the $\phi_z$ are independently and identically distributed draws from the base measure $H$ (note that $H$ is depicted in Figure 3.31 (c)). $\delta_{\phi_z}$ is an atom, i.e. a point distribution at $\phi_z$, with $z \in \{1, \cdots, \infty\}$. Then the following is a discrete random distribution:

$$G_0(\phi) = \sum_{z=1}^{\infty} \mu_z\, \delta(\phi, \phi_z) \qquad (3.42)$$

In this way, each $G_0(\phi) \in (0, 1)$. A piece of the unit-length stick is broken off at $G_0(\phi_z)$, and the process continues: the remainder of the stick is rescaled to unit length, another $G_0(\phi_z)$ is drawn from the distribution, and the stick is broken again. It has been shown in [229] that the random distribution in Equation 3.42 is distributed according to DP($\eta$, $H$); with $\mu_z$ and $\phi_z$ marginalized out, $G_0$ is a draw from the Dirichlet Process (DP) [247].

The stick-breaking representation is given as:

1. Draw µ from GEM(η)

2. Draw φz from H(β)

3. For each document d in the collection

(a) Draw θd from DP(α, µ)

(b) For each word $w_i^d$ in the document $d$

i. Draw $z_i^d$ from $\theta^d$

ii. Draw $w_i^d$ from Multinomial($\phi_{z_i^d}$)
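As an illustration of the stick-breaking representation above, the following is a minimal Python sketch that generates a toy corpus using a truncated GEM construction. The truncation level T, the symmetric Dirichlet choice for the base measure, and the function names gem_weights and generate_hdp_corpus are assumptions of this sketch, not part of the thesis.

import numpy as np

def gem_weights(eta, T):
    # Truncated GEM(eta) stick-breaking weights (T sticks), renormalised.
    b = np.random.beta(1.0, eta, size=T)
    pieces = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    return pieces / pieces.sum()

def generate_hdp_corpus(D=5, N_d=50, V=1000, T=30, eta=1.0, alpha=1.0, beta=0.1):
    # Generate a small toy corpus following the stick-breaking scheme above.
    mu = gem_weights(eta, T)                          # global topic proportions, mu ~ GEM(eta)
    phi = np.random.dirichlet(beta * np.ones(V), T)   # phi_z ~ H(beta), a symmetric Dirichlet here
    corpus = []
    for _ in range(D):
        theta = np.random.dirichlet(alpha * mu)       # theta_d ~ DP(alpha, mu), truncated
        z = np.random.choice(T, size=N_d, p=theta)    # z_i^d drawn from theta_d
        w = [int(np.random.choice(V, p=phi[k])) for k in z]  # w_i^d ~ Multinomial(phi_{z_i^d})
        corpus.append((z.tolist(), w))
    return mu, phi, corpus

The truncation simply approximates the infinite stick; as T grows, the weights beyond the first few become negligible.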

One perspective on the HDP mechanism is the CRF [247], which is an extension of the CRP [3]. We depict this model in standard plate notation in Figure 3.31 (b). In order to describe the sharing among the groups, the notion of a “franchise” is introduced, in which the same global set of dishes is served in every restaurant. When applied to text data, each restaurant corresponds to a document, each customer corresponds to a word, and each dish corresponds to a topic. A customer sits at a table, one dish is ordered for that table, and all subsequent customers who sit at that table share that dish. The dishes are sampled from the base distribution $H$, which corresponds to discrete topic distributions. Multiple tables in multiple restaurants can serve the same dish, so the factor values are shared both within and between documents. For a complete mathematical derivation of the CRF metaphor, we direct the reader to [247]. As mentioned before, one major limitation of the HDP model is that it loses the document's structural information related to word ordering, and as a result it generates only unigram words in topics, which may not convey much insight for the user's interpretation of a topic.

The generative process in the Chinese Restaurant Franchise scheme is given as:

1. Draw µ from GEM(η)

2. Draw φz from H(β)

3. For each document d

(a) Draw θd from DP(α, µ)

(b) Draw $k_t^d$ from $\mu$

(c) For each word $w_i^d$ at position $i$ in the document $d$

i. Draw $t_i^d$ from $\theta^d$

ii. Draw $w_i^d$ from $\phi_{k^d_{t_i^d}}$

In [247], apart from the incremental Gibbs sampling scheme, the authors also proposed two more sampling schemes to reduce the significant book-keeping effort. One inference scheme is based on an Augmented Representation and the other is the Direct Assignment scheme. The Direct Assignment scheme significantly eases the implementation of the inference algorithm, but it changes the component membership of one data item at a time in each iteration. The other two schemes can change the memberships of multiple data items at once, because changing the component membership of one table changes the memberships of all the data items associated with that table. Therefore we expect the Direct Assignment scheme to be much slower in convergence than the other two schemes. A complete derivation of posterior sampling using the Augmented Representation and Direct Assignment schemes is provided in [247].

3.11 Supervised Topic Models

3.12 MedLDA Model

The prime motivation for designing the MedLDA [297] model is to apply the more discriminative maximum-margin learning technique within a probabilistic framework, which the existing methods do not do. The model makes use of extra information (also called side information) in order to improve the prediction task and to generate more interpretable topics. MedLDA integrates the power of a discriminative model with a generative model under a unified constrained optimization framework. Apart from finding a low-dimensional representation of the original vector space, the resulting model is also able to separate latent topical clusters that are more discriminative.

The MedLDA model can be readily applied to other problem domains in addition to text data. The graphical model of the MedLDA model is shown in Figure 3.32.

Figure 3.32: The graphical model of the MedLDA model in plate notation. The model depicts that the order of words in the document is not important. Different from the LDA model, we see that this model incorporates an observed variable $y^d$ for each document. This observed variable is a side information or response variable which is used during the inference step.

Figure 3.33: The graphical model of the MedLDA model in expanded plate notation. This helps us show the words in the document, but the order of the words in the document is not maintained. Different from the LDA model, we see that this model incorporates an observed variable $y^d$ for each document. This observed variable is a side information or response variable which is used during the inference step.

The graphical model in Figure 3.32 shows the compact structure of the model, whereas in Figure 3.33 the model is expanded and the words and their topical assignments are shown. It is noticeable that the order of the words is not considered in the MedLDA model. This in turn results in losing important document structure information, which could otherwise help the model in many text mining applications [117].

The MedLDA model as depicted in Figure 3.32 combines two models to form a hybrid model. One model is the LDA model shown in Figure 3.10 (which considers the document contents) and the other is a maximum-margin prediction classifier which considers the supervision labels. LDA is an unsupervised method to find a low-dimensional latent structure inherent in the data, whereas a maximum-margin approach such as the SVM [53] is a supervised learning approach which requires an initial set of training data with tagged information given by some oracle. Although the graphical model of MedLDA matches that of sLDA [21], there are many key differences in the way the two of them model the data [156], [157]. One of the key differences is that the MedLDA model imposes a discriminative constraint directly on the posterior distributions [123].

As depicted in Figure 3.32, the model assumes independence among the latent variables $z$. The response variable $Y$ attached to each document incorporates the supervised side information consisting of certain class labels. The maximum-margin principle is directly used to generate $Y$. Both maximum-margin learning and maximum likelihood estimation are used while learning the parameters of the model. Thus two stages are involved: in the first stage unsupervised topic discovery is performed, and then maximum-margin multi-class classification is performed.

The generative process of the MedLDA model can be written as follows:

1. Draw $\phi_z$ from Dirichlet($\beta$) for each topic $z$, where $\phi_z$ is the word distribution for topic $z$ and $z \in \{1, \ldots, L\}$. $\beta$ is the parameter of the Dirichlet prior on the

per-topic word distribution.

2. For each document d

(a) Draw a topic proportion θd for a document d where d ∈ [1,...,D] from Dirichlet (α), where Dirichlet (α) is the Dirichlet distribution with parameter α on the per-document topic distributions,

(b) For each word $w_i^d$ at position $i$ in the document $d$

i. Draw a topic $z_i^d$ for each word $w_i^d$ at position $i$ in the document $d$ from Multinomial($\theta^d$)

ii. Draw a word $w_i^d$ from Multinomial($\phi_{z_i^d}$)

3. Draw the response parameter $\eta$ from Normal($0, \eta_0$), where $\eta_0$ is the hyperparameter for $\eta$, and $\eta$ is sampled $M$ times

4. Draw a response variable $y^d \mid \bar{z}^d, \eta$ from $F(y^d, \bar{z}^d, \eta) = \eta_{y}^{\top}\bar{z}^d$, where $\eta_y$ is a class-specific $K$-dimensional vector associated with class $y$ and $\bar{z}^d = \frac{1}{N^d}\sum_{n=1}^{N^d} z_n^d$.

The MedLDA model can be used for both regression and classification. However, we will limit our discussion to the classification task. Subsequently, the MedLDA model will be extended to incorporate word order. We will then show how the model can be used in the ranking of text documents. The original papers on MedLDA [298], [297] contain the posterior inference description using variational methods, but our focus will mainly be on the Gibbs sampling procedure for the MedLDA model which has been proposed recently in [299], [123], and we will also use Gibbs sampling for our own model.

As shown in Figures 3.32 and 3.33, the MedLDA model comprises two parts. One of the parts is the LDA model and the other part is a classifier built on taking the

expectation of all the models and is mainly an averaging model under the Bayesian paradigm. We will present a short description of both the models separately and then describe a hybrid regularized Bayesian model which is a combination of a maximum margin approach and the LDA model.

Let $\mathcal{T} = \{(\mathbf{w}^d, y^d)\}_{d=1}^{D}$ be a given fully-labeled training set. Since we consider a multi-class classification problem, the values of the response variable come from a finite set $Y = \{1, \ldots, M\}$. We will present a brief description of each of the components in the MedLDA model.

One of the components is the LDA model, which we have already described in Section 3.7. We present a description of the few elements which will be used in the MedLDA model for further derivation.

When the parameters of the model are given, the joint distribution of a topic mixture θd, a set of topics zd for that document d, and the words in that document wd is given by:

$$P(\theta^d, \mathbf{z}^d, \mathbf{w}^d \mid \alpha, \beta) = P(\theta^d \mid \alpha) \prod_{i=1}^{N^d} P(z_i^d \mid \theta^d)\, P(w_i^d \mid z_i^d, \beta) \qquad (3.43)$$

By taking the product of the marginal probabilities of all the documents in the collection T , the probability for the entire collection can be written as:

$$P(\mathcal{T} \mid \alpha, \beta) = \prod_{d=1}^{D} \int P(\theta^d \mid \alpha) \left( \prod_{i=1}^{N^d} \sum_{z_i^d} P(z_i^d \mid \theta^d)\, P(w_i^d \mid z_i^d, \beta) \right) d\theta^d \qquad (3.44)$$

Computing the exact posterior distributions in the LDA model is intractable. Hence one has to resort to approximation techniques. Methods such as variational inference [23], Gibbs sampling [86], collapsed Gibbs sampling [208] and many other techniques have been used in order to compute an approximation. The posterior

distribution inferred by the LDA model is written as:

$$P(\Theta, \mathbf{Z}, \Phi \mid \mathbf{W}, \alpha, \beta) = \frac{P_0(\Theta, \mathbf{Z}, \Phi \mid \alpha, \beta)\, P(\mathbf{W} \mid \Theta, \mathbf{Z}, \Phi)}{P(\mathbf{W} \mid \alpha, \beta)} \qquad (3.45)$$

where $P_0(\Theta, \Phi, \mathbf{Z}) = \big( \prod_{d=1}^{D} P(\theta^d \mid \alpha) \prod_{n} P(z_n^d \mid \theta^d) \big) \prod_{k=1}^{K} P(\phi_k \mid \beta)$ is the joint distribution defined in the model. In [287], the author has shown that the posterior distribution given by Bayes' rule is the solution to an optimization problem. Therefore, the posterior distribution shown in Equation 3.45 can be transformed into an optimization problem which can be written as:

$$\underset{P(\Theta, \mathbf{Z}, \Phi) \in \mathcal{P}}{\operatorname{minimize}} \;\; \mathrm{KL}\big[ P(\Theta, \mathbf{Z}, \Phi) \,\|\, P_0(\Theta, \mathbf{Z}, \Phi \mid \alpha, \beta) \big] - \mathbb{E}_{P}\big[\log P(\mathbf{W} \mid \mathbf{Z}, \Phi)\big] \qquad \text{subject to } P(\Theta, \mathbf{Z}, \Phi) \in \mathcal{P}, \qquad (3.46)$$

where $\mathcal{P}$ is the probability distribution space, and $\mathrm{KL}(P \| P_0)$ is the Kullback-Leibler divergence from $P$ to $P_0$. We present the full proof of the above equation in Appendix E6.

The MedLDA model also comprises an expected classifier. The description of the classification component is as follows:

Consider a set of training documents $\mathcal{T}$. The task of the classifier is to achieve the lowest possible risk, which is approximated by the training error. The classifier selects a posterior distribution $P(h \mid \mathcal{T})$ over a hypothesis space $\mathcal{H}$ of classifiers in such a way that the $P$-weighted classifier $h_P(\mathbf{w}) = \operatorname{sign} \mathbb{E}_P[h(\mathbf{w})]$ has the lowest possible risk. Analogously, the posterior distribution $P(\eta, \Theta, \mathbf{Z}, \Phi \mid \mathcal{T})$ that needs to be computed by the classifier of the MedLDA model should have the lowest possible risk. This classifier for the MedLDA model can be written as:

$$\hat{y} = \operatorname{sign} F(\mathbf{w}) \qquad (3.47)$$

6 References were consulted to ascertain the correctness of the proof, from http://mark.reid.name/blog/bayesian-updating-as-optimisation.html

The risk is approximated by the training error:

$$T_{\mathcal{T}}(P) = \sum_{d} \mathbb{I}\big(\hat{y}^d \neq y^d\big) \qquad (3.48)$$

The discriminant function can be written as:

$$F(\mathbf{w}) = \mathbb{E}_{P(\eta, \mathbf{z} \mid \mathcal{T})}\big[F(\eta, \mathbf{z}; \mathbf{w})\big], \qquad F(\eta, \mathbf{z}; \mathbf{w}) = \eta^{\top}\bar{\mathbf{z}} \qquad (3.49)$$

where $\bar{\mathbf{z}}$ is a vector with elements $\bar{z}_k = \frac{1}{N^d}\sum_{n=1}^{N^d} \mathbb{I}(z_{nk} = 1)$, and $\mathbb{I}(\cdot)$ is an indicator function which equals $1$ if the predicate holds and $0$ otherwise.
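To illustrate how the expected discriminant of Equation 3.49 is used for prediction, the following is a small Python sketch. It assumes that a posterior mean of $\eta$ is available as a matrix with one row per class, and it uses the argmax rule for the multi-class case (the thesis states the binary sign rule in Equation 3.47); the function name and interface are assumptions of the sketch.

import numpy as np

def predict_class(z_assignments, eta, num_topics):
    # z_assignments : topic indices z_n^d for one document.
    # eta           : (M x K) matrix whose row y is the (posterior mean of the)
    #                 class-specific vector eta_y.
    z_bar = np.bincount(z_assignments, minlength=num_topics) / len(z_assignments)
    scores = eta @ z_bar          # score of class y is eta_y^T z_bar
    return int(np.argmax(scores)), scores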

$$\underset{p(\eta,\Theta,\mathbf{Z},\Phi)\in\mathcal{P},\;\psi}{\operatorname{minimize}} \;\; \mathrm{KL}\big[ q(\Theta,\mathbf{Z},\Phi) \,\|\, P_0(\Theta,\mathbf{Z},\Phi) \big] - \mathbb{E}_q\big[\log P(\mathbf{W}\mid\mathbf{Z},\Phi)\big] + \frac{C}{D}\sum_{d=1}^{D}\psi^d$$
$$\text{subject to } \mathbb{E}_p\big[ \eta^{\top}\big( f(y^d, \bar{\mathbf{z}}^d) - f(y, \bar{\mathbf{z}}^d) \big) \big] \ge l^d(y) - \psi^d, \;\; \psi^d \ge 0, \;\; \forall d, \forall y, \qquad (3.50)$$

Equation 3.50 can also be written as:

$$\underset{P(\eta,\Theta,\mathbf{Z},\Phi)\in\mathcal{P}}{\operatorname{minimize}} \;\; \mathrm{KL}\big[ q(\Theta,\mathbf{Z},\Phi) \,\|\, P_0(\Theta,\mathbf{Z},\Phi) \big] - \mathbb{E}_q\big[\log P(\mathbf{W}\mid\mathbf{Z},\Phi)\big] + \frac{C}{D}\sum_{d}\max_{y}\Big( l^d(y) - \mathbb{E}_p[\eta^{\top}]\big( f(y^d,\bar{\mathbf{z}}^d) - f(y,\bar{\mathbf{z}}^d) \big) \Big) \qquad (3.51)$$

The component $\frac{1}{D}\sum_{d}\max_{y}\big( l^d(y) - \mathbb{E}_p[\eta^{\top}]( f(y^d,\bar{\mathbf{z}}^d) - f(y,\bar{\mathbf{z}}^d) ) \big)$ is the hinge loss, which is defined as an upper bound of the prediction error on the training data.

3.12.1 Posterior Inference using Gibbs Sampling

Computing the exact posterior distribution in topic models is an intractable problem. Hence approximation methods such as variational methods and Markov Chain Monte Carlo (MCMC) have been commonly used. We will describe a collapsed Gibbs sampling method. In order to begin with the MCMC method, we first need to assume that:

P (η, Θ, Z, Φ)= P (η) × P (Θ, Z, Φ) (3.52)

Thus what is required now is to alternately solve Equation 3.51 in the following two steps. Step 1: Estimation of $P(\eta)$. This problem can be written in constrained form as:

$$\underset{P(\eta),\;\psi}{\operatorname{minimize}} \;\; \mathrm{KL}\big( P(\eta) \,\|\, P_0(\eta) \big) + \frac{C}{D}\sum_{d=1}^{D}\psi^d$$
$$\text{subject to } \mathbb{E}_p\big[ \eta^{\top}\big( f(y^d, \bar{\mathbf{z}}^d) - f(y, \bar{\mathbf{z}}^d) \big) \big] \ge l^d(y) - \psi^d, \;\; \psi^d \ge 0, \;\; \forall d, \forall y, \qquad (3.53)$$

The optimum posterior distribution can be computed by using Lagrangian methods with multipliers $\lambda$, as follows:

$$P(\eta) \propto P_0(\eta)\, \exp\!\Big( \eta^{\top} \sum_{d=1}^{D}\sum_{y} \lambda_y^d\, \Delta f\big(y, \mathbb{E}[\bar{\mathbf{z}}^d]\big) \Big) \qquad (3.54)$$

As described in the generative process for the MedLDA model, we chose the standard normal prior $P_0(\eta) = N(0, I)$. In this case, the optimum posterior distribution can be written as $P(\eta) = N(\kappa, I)$, and the dual of the problem then becomes:

$$\underset{\lambda}{\operatorname{maximize}} \;\; -\frac{1}{2}\kappa^{\top}\kappa + \sum_{d=1}^{D}\sum_{y}\lambda_y^d\, l^d(y) \qquad \text{subject to } \sum_{y}\lambda_y^d \in \Big[0, \frac{C}{D}\Big], \;\; \forall d \qquad (3.55)$$

where $\kappa = \sum_{d=1}^{D}\sum_{y} \lambda_y^d\, \Delta f\big(y, \mathbb{E}[\bar{\mathbf{z}}^d]\big)$ is the mean of the classifier parameters $\eta$. The element $\kappa_{yk}$ represents the contribution of topic $k$ in classifying a data point to category $y$.

Step 2: Estimation of $P(\Theta, \mathbf{Z}, \Phi)$. Given $P(\eta)$, this subproblem reduces to solving the following:

$$\underset{p(\eta,\Theta,\mathbf{Z},\Phi)\in\mathcal{P},\;\psi}{\operatorname{minimize}} \;\; \mathrm{KL}\big[ q(\Theta,\mathbf{Z},\Phi) \,\|\, P_0(\Theta,\mathbf{Z},\Phi) \big] - \mathbb{E}_q\big[\log P(\mathbf{W}\mid\mathbf{Z},\Phi)\big] + \frac{C}{D}\sum_{d=1}^{D}\psi^d$$
$$\text{subject to } (\kappa^{*})^{\top} \Delta f\big(y, \mathbb{E}_P[\bar{\mathbf{z}}^d]\big) \ge l^d(y) - \psi^d, \;\; \psi^d \ge 0, \;\; \forall d, \forall y, \qquad (3.56)$$

Equation 3.56 can be further rewritten in the following form, which will be used for updating the posterior estimates:

$$P(\Theta, \mathbf{Z}, \Phi) \propto P(\Theta, \mathbf{Z}, \Phi, \mathbf{W})\, \exp\!\Big( (\kappa^{*})^{\top} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{*}\, \Delta f(y, \bar{\mathbf{z}}^d) \Big) \qquad (3.57)$$

In Appendix F, we present the full derivation of the collapsed Gibbs sampling for the MedLDA model.

Figure 3.34: A figure depicting the high-level architecture of the learning-to-rank framework. The web crawler crawls the data, which is then indexed by the indexer. Then some subset of the data is used to learn a ranking model which is then used to rank unseen document-query pairs.

3.13 Learning-to-Rank

Ranking is an important problem in IR. Using machine learning techniques to learn a ranking function has shown promise in the recent past. Learning-to-rank has been applied to a vast range of problem domains such as IR [167], data mining [268], Natural Language Processing (NLP) [158], etc. Learning-to-rank, sometimes also called machine-learned ranking, can be characterized as supervised, unsupervised, or semi-supervised learning, where the objective is to construct a ranking function for IR. In this process, the training data consists of lists of items, in particular query-document pairs along with the corresponding relevance information/judgments, which are obtained using some external process such as a human annotation process, with some partial order specified between the items in each list. The objective is then to rank a list of unseen data which is independent and identically distributed (IID) with the training data.

In Figure 3.34, we depict a high-level architecture diagram of the learning-to-rank framework. The web crawler crawls web pages from the web space, which are then passed on to an indexer that performs preprocessing and indexing of the web pages. The index is stored in local storage for conducting retrieval. Some subset of the documents in the index can be manually annotated to obtain relevance assessments from humans, which can be used to learn a ranking function. This learned ranking model can then be used to re-rank the top-k results retrieved for the query entered by the user.
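As a concrete, deliberately simplified illustration of the learning stage in this framework, the following Python sketch trains a linear scoring function with a pairwise hinge update on query-document feature vectors. It is not any particular algorithm from the literature cited above; the function names, the learning rate, and the feature layout are assumptions of the sketch.

import numpy as np

def train_pairwise_ranker(X, y, qids, epochs=20, lr=0.1, rng=np.random.default_rng(0)):
    # X: (n, d) feature vectors of query-document pairs; y: (n,) relevance grades;
    # qids: (n,) query ids, so that only documents of the same query are compared.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for q in np.unique(qids):
            idx = np.where(qids == q)[0]
            rng.shuffle(idx)
            for i in idx:
                for j in idx:
                    if y[i] > y[j]:                      # document i should rank above j
                        margin = w @ (X[i] - X[j])
                        if margin < 1.0:                 # hinge update on violated pairs
                            w += lr * (X[i] - X[j])
    return w

def rank(X, w):
    # Return document indices ordered from highest to lowest score.
    return np.argsort(-(X @ w))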

3.13.1 Features

The features used to train the learning-to-rank models consist of both high-level and low-level features. In fact, there is an extensive list of features that is currently in use in learning-to-rank algorithms today. In order to study more about those extensive set of features, one is requested to consult [210] for more details. Chapter Four

Topic Segmentation Model for Text Documents

Chapter Summary

In this chapter, we present a new unsupervised topic discovery model, called N-gram Topic Segmentation model (NTSeg), for a collection of text documents. NTSeg maintains the structure of the document such as paragraphs and sentences. In addition, it preserves the word order in the document. NTSeg can help capture major topical changes in the document. As a result, it can generate two levels of topics of different granularity, namely, segment-topics and word-topics. In addition, it can generate n-gram words in each topic.

4.1 The Case for Topic Segmentation with Word Order

As we have studied in Chapters 1, 2 and 3, topic models such as LDA [23] have been widely used to find topics in a document collection. But the LDA model has been criticized for its bag-of-words assumption [252], as the model does not consider the structural information inherent in the text which could help tap extra knowledge from the text. It is well known that the bag-of-words assumption is mainly a simplifying assumption to reduce the complexity of the model [179], [115].

Some previous works demonstrate that considering the ordering of words is desirable [261], [252]. Maintaining the word order during the processing of documents introduces some computational overhead, but it allows us to achieve what bag-of-words models cannot do in general [88], [117]. In order to address this shortcoming inherent in the LDA model, the authors in [261] introduced the topical n-gram model (TNG) to find n-gram words in topics. This model has the ability to decide whether to form a unigram or a bigram during topic discovery by extending the LDACOL model [87] and the BTM [252].

Figure 4.1: Our proposed NTSeg model for topic segmentation and topical n-gram word generation. The model is represented as a graphical model in standard plate notation where plates signify repetition of variables.

All these models advocate that the word order in a document is essential. But one shortcoming of these models is that they lack the ability to consider the document's structure such as paragraphs and sentences. Thus they cannot segment a document into coherent topics. This sometimes becomes essential in tasks such as tackling the word sense disambiguation problem as shown in [88], segmenting news articles and finding topics in each segment [220], topic detection and tracking [265], and a plethora of other tasks which motivate us to explore deeper into the topic segmentation model.

Figure 4.2: Our proposed NTSeg model for topic segmentation. The un-dotted section of the graphical model performs topic segmentation of text by generating segment-topics. The ordering of $c$ in the document, which follows the document structure, portrays the segmentation of the document.

Figure 4.3: Our proposed NTSeg model for topic segmentation. The un-dotted section of the graphical model performs n-gram word generation where words share the same topic.

Figure 4.4: A component from the main graphical model shown in Figure 4.1, which depicts the n-gram word generation. We can see that the order of words is maintained, and words share the same topic.

4.2 N-gram Topic Segmentation Model Description

We depict our proposed NTSeg model in Figure 4.1 using a graphical model in plate notation, where shaded circles represent observed variables and unshaded ones are the latent variables. Plates signify repetition of the variables. The basic idea of our model is that a document comprises several topically coherent segments, such as paragraphs and sentences, and in each segment words occur in an order. Our model preserves this structure. In addition, it preserves the ordering of the words in order to capture collocations and their semantic information. Thus NTSeg is no longer invariant to the reshuffling of the words in each segment.

For proper discourse comprehension, documents are generally composed of coherent segments which are semantically linked to one another so that a reader can relate the storyline as one moves forward in the discourse [140]. Our model comprises two levels of topics of different granularity. One is the segment-topic, to which segments in the documents are assigned, so that their ordering defines the major topical changes in the document. In Figure 4.2, we depict the portion of our model that performs topic segmentation, i.e., the un-dotted lines. The other is the word-topic, to which n-gram words in the segment are assigned. In Figure 4.3, we present the portion of the model which generates word-topics. In Figure 4.4, we further narrow down to the portion of the graphical model that generates word-topics, where the order of words is maintained. The segment-topics come from a predefined number of segment-topics K. Each segment-topic comprises a mixture of several word-topics, where the mixture coefficients uniquely specify the segment-topic. Word-topics come from a predefined number of word-topics L. In general, the number of segment-topics will be less than the number of word-topics. The reason is that the number of segments (paragraphs or sentences) in a document is smaller than the number of words [231].

From the graphical model in Figure 4.1, one can note that our model, NTSeg, has the capability of deciding whether to generate a unigram or a bigram in a topic, and the topic assignments for the words in a bigram are the same. This aspect differentiates NTSeg from TNG. NTSeg makes a first-order Markov assumption, i.e., it is mainly a bigram model whose basic generation process produces unigrams or bigrams. However, NTSeg has the ability to produce higher-order n-grams (i.e., n > 2) by concatenating consecutive n-grams (unigrams or bigrams) that have the same topic and for which the bigram status variable between them is 1. In this way, the words in the n-gram share the same topic. In Figure 4.5, we illustrate the idea using a diagrammatic representation, where we generate the tri-gram “Irish cricket team”. We see in the figure that x = 1 between the words in sequence, and these words can then be concatenated together to form a tri-gram. In Figure 4.6, we exemplify the idea of the topic hierarchy generated by our model.

Figure 4.5: An example showing how the tri-gram “Irish cricket team” is generated by our model, where words in the tri-gram share the same word-topic, which in this case is word-topic number 7. We can see that words in sequence are only concatenated when they share the same word-topic and when the bigram binary status random variable x between them is set to 1.
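The concatenation rule illustrated in Figure 4.5 can be made concrete with the following Python sketch, which merges consecutive words into an n-gram whenever the bigram status variable is 1 and the word-topics agree; the function name merge_ngrams is an illustrative assumption and not part of the model's implementation.

def merge_ngrams(words, topics, x):
    # words, topics, x are aligned sequences for one segment; x[0] is unused,
    # since only a unigram is allowed at the beginning of a segment.
    ngrams = []
    current, current_topic = [words[0]], topics[0]
    for i in range(1, len(words)):
        if x[i] == 1 and topics[i] == current_topic:
            current.append(words[i])                   # extend the current n-gram
        else:
            ngrams.append((" ".join(current), current_topic))
            current, current_topic = [words[i]], topics[i]
    ngrams.append((" ".join(current), current_topic))
    return ngrams

# Reproducing Figure 4.5:
# merge_ngrams(["Irish", "cricket", "team"], [7, 7, 7], [0, 1, 1])
# -> [("Irish cricket team", 7)]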

Figure 4.6: An example illustrating the word-topic and segment-topic allocation paradigm. Paragraphs are generated and assigned to the segment-topics, whereas n-gram words are generated by the word-topics. The figure shows that our model generates topics in a hierarchy, that is, segment-topics consist of a mixture of word-topics. Word-topics generate n-grams.

Our proposed model NTSeg generates segment-topics, and each segment-topic comprises a mixture of word-topics. Word-topics consist of n-gram words instead of just unigrams. We see that paragraphs are assigned to the segment-topics in the figure. This again contrasts NTSeg with TNG, which analyzes each topical n-gram post hoc as if the topic of the final word in the n-gram were the topic assignment of the entire n-gram; this violates the principle of non-compositionality [166]. In each segment s ∈ S, NTSeg finds n-gram words in a word-topic z. It can also find correlations between both kinds of topics, i.e., word-topics z and segment-topics y. The segments of each document are assumed to follow a Markov structure on the topic distributions of each segment. We assume that there will be a high probability that the topic for the segment s in the document will be the same as that of the segment s − 1.

A segment binary switching variable $c_s^d$ for the segment-topic indicates whether there is a change of topic between the segments. The states of the switching variable correspond to the segmentation of the document into coherent topical units. Apart from the segment switching binary variable, NTSeg also incorporates another random variable known as the bigram switch variable $x$. The mechanism is that if $x_{si}^d = 1$, then $w_{s(i-1)}^d$ and $w_{si}^d$ form a bigram; otherwise they do not. Note that the input to NTSeg is the entire document.

It can be observed that the existing TNG model [261], LDSEG [230] and LDCC [231] are special cases derived from our model. For example, removing the segmentation scheme in Figure 4.1 along with the arrows pointing from $z_{s(i-1)}^d \to z_{si}^d$ and $x_{si}^d \to z_{si}^d$ reduces our model to the topical n-gram model. Imposing the bag-of-words assumption in each segment reduces it to the LDSEG model, and additionally removing the segmentation switch variable reduces it to the LDCC model. NTSeg has the ability to decide whether to form a unigram or a bigram based on context, which the LDSEG model cannot achieve.

The generative process of our model NTSeg for each document $d$ is as follows:

1. Draw $\tau$ from Dirichlet($\rho$), where $\rho$ is the parameter of the Dirichlet prior on the segment-topics

2. Draw Discrete(φz) from Dirichlet(β) for each word-topic z

3. Draw Bernoulli(ψzw) from Beta(γ) for each word-topic z and each word w

4. Draw Discrete(σzw) from Dirichlet(δ) for each topic z and each word w

5. For each segment s from S

(a) Draw $y_s^d$ for segment $s$ to be the same segment-topic as its previous segment-topic $y_{s-1}^d$ when there is no topic change, i.e., with probability $P(c_s^d = 0) = \pi$

(b) Otherwise, draw a segment-topic $y_s^d$ for the segment from Multinomial($\tau$)

(c) Draw $\theta^{(s)}$ from Dirichlet($\alpha_{y_s}$)

(d) For each of the $N_s^d$ words $w_{si}^d$ in the segment

i. Draw $x_{si}^d$ from Bernoulli($\psi_{z_{s(i-1)}^d w_{s(i-1)}^d}$)

ii. Draw $z_{si}^d$ from Discrete($\theta^{(s)}$) if $x_{si}^d = 0$; otherwise draw $w_{si}^d$ from the same topic as $w_{s(i-1)}^d$

iii. Draw $w_{si}^d$ from Discrete($\sigma_{z_{si}^d w_{s(i-1)}^d}$) if $x_{si}^d = 1$

iv. Otherwise, draw $w_{si}^d$ from Discrete($\phi_{z_{si}^d}$)

$c_s^d$ indicates whether there is a change in topic between the segments. If $c_s^d = 0$ then it means that $y_s^d = y_{s-1}^d$, i.e., the topic does not change between the segments. However, when $c_s^d = 1$, then $y_s^d$ is drawn from a Multinomial distribution parameterized by $\tau$. Computation of $P(y_s^d \mid c_s^d, \tau, y_{s-1}^d)$ is done based on two conditions, i.e., $\rho(y_s^d, y_{s-1}^d)$ when $c_s^d = 0$, or sampling from Multinomial($\tau$) when $c_s^d = 1$.

The segment distribution $P(y_s^d \mid c_s^d, \tau, y_{s-1}^d)$ is not properly defined for the first segment of every document. Therefore, $c_s^d = 1$ is fixed for the first segment, which is drawn from Multinomial($\tau$). Similarly, we assume that $x_{s1}^d$ is observed and only a unigram is allowed at the beginning of every segment.
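The segment-topic switching mechanism described above can be sketched as follows in Python, under the convention used in the rest of the chapter that $c_s = 1$ indicates a topic change (and $c_1 = 1$ for the first segment); the function name and the scalar change probability pi_change are assumptions of this sketch.

import numpy as np

def draw_segment_topics(num_segments, tau, pi_change, rng=np.random.default_rng(0)):
    # tau       : probabilities over the K segment-topics (must sum to 1).
    # pi_change : probability that the switch variable c_s = 1 (topic change).
    y = np.zeros(num_segments, dtype=int)
    c = np.zeros(num_segments, dtype=int)
    for s in range(num_segments):
        if s == 0:
            c[s] = 1                                   # first segment: c_s = 1 by definition
        else:
            c[s] = rng.random() < pi_change
        if c[s] == 1:
            y[s] = rng.choice(len(tau), p=tau)         # draw a new segment-topic from Multinomial(tau)
        else:
            y[s] = y[s - 1]                            # no change: keep the previous segment-topic
    return y, c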

4.3 Posterior Inference

The inference problem amounts to computing the posterior probability of the hidden variables when the input parameters $\beta, \gamma, \delta, \rho, \Omega$ and the observed variable $w$ are given. Also, an estimate of the $\alpha$ hyperparameter has to be made. It can be shown that computing the exact posterior in our model is intractable. Hence, we need to resort to approximation techniques such as Gibbs sampling [40]. Adopting Bayesian methods results in some hidden parameters being integrated out instead of being explicitly estimated. This process combined with the Gibbs sampling method is called collapsed Gibbs sampling. Algorithm 1 depicts the collapsed Gibbs sampling used in our approximate inference.

The target distribution is the posterior distribution of the word-topics, the segment-topics, the topic switching variables of the segments, and the bigram status variables. When we use the collapsed Gibbs sampling technique, in each iteration we sample from the conditional distribution of the word-topic of a word in a document conditioned on the topic assignments of all other words except the current word (Line 12 in Algorithm 1). In addition, we also sample the bigram status variable (Line 13). We then sample from the conditional distribution of the segment-topic for a segment, together with the corresponding switching variable, given the topic assignments of all other words in the current segment (Line 20).

In each iteration of the Gibbs sampling procedure, we only sample a subset of the variables which are directly related to the conditional probability and collapse out the nuisance variables.

Input: $\gamma, \delta, L, K, \rho, \Omega, \beta$, Corpus, MaxIteration
Output: n-gram words in each topic with the same topic assignments and segments in documents; an estimate of the $\alpha$ hyperparameter
1 Initialization: Randomly initialize the n-gram word-topic assignments for all the words, the segment-topic assignments, the topic switch variable for all the segments, and the bigram status variable for all the words;
2 Zero all count variables such as $m_{wv}$, $p_{lwt}$, $n_{lv}$, $n_l^d$, $n_{d_0}$, and $n_{d_1}$;
3 Compute $P_{dk}$ for all values of $k \in \{1, \cdots, K\}$ and all documents;
4 Compute $n_{lv}$ and $p_{lwt}$ for all values of $l \in \{1, \cdots, L\}$ and all words;
5 Compute $n_{ld}$ for all values of $l \in \{1, \cdots, L\}$ and all documents and their segments;
6 if performing parameter value estimation then
7     Initialize $\alpha$ using Equations 4.5, 4.6, 4.7, 4.8, 4.9;
8 end
9 Randomize the order of the documents in the corpus;
10 for iter $\leftarrow$ 1 to MaxIteration do
11     foreach word $i$ according to order do
12         Exclude word $i$ and its assigned topic $l$ from variables $n_{ld}$ and $n_{li}$;
13         (newl, newx) $\leftarrow$ sample a new word-topic for word $i$ and the bigram switching variable using Equation 4.3;
14         if newx == 1 then
15             Assign newl as the new word-topic;
16         end
17         Update variables $n_{ld}$, $n_{li}$, $p_{lit}$ and $m_{iv}$ using the new word-topic newl for word $i$;
18         if a new segment $j$ is entered then
19             Exclude segment $j$ and its assigned topic $k$ from variable $P_{dk}$;
20             (newk, newc) $\leftarrow$ sample a new segment-topic and segment switching variable for segment $j$ using Equation 4.4;
21             if newc == 1 then
22                 Assign newk as the new segment-topic for segment $j$;
23             end
24             Update variable $P_{dk}$ using the new segment-topic newk for segment $j$, and also $n_{d_c}$;
25             if performing parameter value estimation then
26                 Update $\alpha$ using Equations 4.5, 4.6, 4.7, 4.8, 4.9;
27             end
28         end
29     end
30 end
31 Compute posterior estimates using Equations 4.10, 4.11, 4.12, 4.13, 4.14, 4.15;
Algorithm 1: Inference algorithm for NTSeg.

We perform this step repeatedly until we arrive at some approximation. A variable is sampled from its conditional distribution given that the assignments of all other variables are known, which is a standard procedure in a Gibbs sampler. As the list of words is being scanned along with the bigram status variables, the sampler keeps track of any new segment being encountered. For each new segment, the sampler decides about the topic assignment of the segment, i.e., whether it should assign the current segment to the same topic as the previous segment or to a new segment-topic. If the segment has to be assigned to a new segment-topic, the sampler estimates the probability of assigning the segment to each segment-topic. These probabilities are computed from the conditional distribution for a segment given all other topic assignments to every other segment and all words in the segment, as depicted in Algorithm 1 from Lines 10 to 25.

We need to compute the two conditional distributions:

$$P(z_{si}^d, x_{si}^d \mid \mathbf{z}_{\neg si}^d, \mathbf{x}_{\neg si}^d, \mathbf{w}, \mathbf{c}, \mathbf{y}, \alpha, \beta, \gamma, \delta, \rho, \Omega) \qquad (4.1)$$

$$P(y_s^d, c_s^d \mid \mathbf{z}, \mathbf{y}_{\neg s}, \mathbf{c}_{\neg s}, \mathbf{w}, \mathbf{x}, \alpha, \beta, \gamma, \delta, \rho, \Omega) \qquad (4.2)$$

Note that $\mathbf{w}_{\neg si}^d$ denotes all the words in the segment except the current word $w_{si}^d$, and $\mathbf{z}_{\neg si}^d$ denotes the word-topic assignments of all other words except the current word $w_{si}^d$. Beginning with the joint probability of the dataset, and using the chain rule, we obtain the conditional probabilities conveniently. We obtain the following equations:

$$P(z_{si}^d, x_{si}^d \mid \mathbf{w}, \mathbf{z}_{\neg si}^d, \mathbf{x}_{\neg si}^d, \mathbf{y}, \mathbf{c}, \alpha, \beta, \gamma, \delta, \rho, \Omega) \propto (\alpha_{y_s^d z_{si}^d} + n_{z_{si}^d} - 1) \times (\gamma_{x_{si}^d} + p_{z_{s(i-1)}^d w_{s(i-1)}^d x_{si}^d} - 1) \times \begin{cases} \dfrac{\beta_{w_{si}^d} + n_{z_{si}^d w_{si}^d} - 1}{\sum_{v=1}^{W} \big( \beta_v + n_{z_{si}^d v} \big) - 1} & \text{if } x_{si}^d = 0 \\[2ex] \dfrac{\delta_{w_{si}^d} + m_{w_{s(i-1)}^d w_{si}^d} - 1}{\sum_{v=1}^{W} \big( \delta_v + m_{w_{s(i-1)}^d v} \big) - 1} & \text{if } x_{si}^d = 1 \end{cases} \qquad (4.3)$$

$$P(y_s^d, c_s^d \mid \mathbf{z}, \mathbf{y}_{\neg s}, \mathbf{c}_{\neg s}, \mathbf{w}, \mathbf{x}, \alpha, \beta, \gamma, \delta, \rho, \Omega) \propto \begin{cases} (\rho_{y_s^d} + P_{d y_s^d} - 1) \left( \dfrac{\prod_{l=1}^{L}\prod_{j=0}^{n_{sl}^d - 1} (\alpha_{y_s^d l} + j)}{\prod_{j=0}^{N_s^d - 1} \big( \sum_{l=1}^{L} \alpha_{y_s^d l} + j \big)} \right) \left( \dfrac{n_{d_1} + \Omega}{N_s^d + 2\Omega} \right) & \text{if } c_s = 1 \\[3ex] \left( \dfrac{\prod_{l=1}^{L}\prod_{j=0}^{n_{sl}^d - 1} (\alpha_{y_s^d l} + j)}{\prod_{j=0}^{N_s^d - 1} \big( \sum_{l=1}^{L} \alpha_{y_s^d l} + j \big)} \right) \left( \dfrac{n_{d_0} + \Omega}{N_s^d + 2\Omega} \right) & \text{if } c_s = 0 \ \&\ s > 1 \ \&\ y_s = y_{(s-1)} \\[3ex] 0 & \text{otherwise} \end{cases} \qquad (4.4)$$

In Equations 4.3 and 4.4, $n_{zw}$ is the number of times that the word $w$ is assigned to the word-topic $z$ as a unigram. For example, in Equation 4.3, $n_{z_{si}^d w_{si}^d}$ denotes the total number of times the word $w_{si}^d$ has been assigned to the word-topic $z_{si}^d$. $m_{wv}$ is the number of times the word $v$ is assigned as the second word of a bigram whose first word is $w$, i.e., the number of times the bigram switch variable $x_{si}^d$ is set between $w_{s(i-1)}^d$ and $w_{si}^d$ conditioned on the topic $z_{s(i-1)}^d$ of the previous word $w_{s(i-1)}^d$. $p_{zwt}$ denotes the number of times the status variable $x = t$ ($0$ or $1$) given the previous word $w$ and its topic $z$. $y_s^d$ is the segment-topic that has been assigned to the segment $s$ in document $d$. $n_{z_{si}^d}$ is the number of times a word in segment $s$ of document $d$ is assigned to word-topic $z$. $n_{d_0}$ and $n_{d_1}$ are the numbers of times the switching variable $c_s$ is set to $0$ and $1$ in the document $d$, respectively. $P_{d y_s^d}$ is the number of times a segment in the document $d$ has been assigned to the segment-topic $y_s^d$; when we write $P_{dk}$, it means the number of times a segment in the document $d$ has been assigned to the segment-topic $k$. $\mathbf{y}_{\neg s}^d$ denotes the segment-topic assignments of all the segments except the current segment $s$. $\alpha_{y_s z_{si}}$ is the $z_{si}$-th component of $\alpha_{y_s}$. We present the complete posterior inference derivation in Appendix A.

Note that in our model the hyperparameter α captures the relationships between the segment-topics and word-topics. This hyperparameter must be estimated from the data. Although there are many ways to estimate this hyperparameter [231], we adopt moment matching which is computationally less expensive [231], [161]. Therefore in each iteration of the collapsed Gibbs sampling (Line 25 of Algorithm 1), we update:

$$\bar{m}_{kl} = \frac{1}{\hat{n}_k} \sum_{\hat{s} \in \hat{S}_k} \frac{n_{\hat{s}l}}{N_{\hat{s}}} \qquad (4.5) \qquad\qquad \bar{v}_{kl} = \frac{1}{\hat{n}_k} \sum_{\hat{s} \in \hat{S}_k} \left( \frac{n_{\hat{s}l}}{N_{\hat{s}}} - \bar{m}_{kl} \right)^2 \qquad (4.6)$$

$$\alpha_{kl} \propto \bar{m}_{kl} \qquad (4.7) \qquad\qquad m_{kl} = \frac{\bar{m}_{kl}(1 - \bar{m}_{kl})}{\bar{v}_{kl}} - 1 \qquad (4.8)$$

$$\sum_{l=1}^{L} \alpha_{kl} = \exp\!\left( \frac{\sum_{l=1}^{L} \log(m_{kl})}{L - 1} \right) \qquad (4.9)$$

where $\hat{S}_k$ is the set of segments assigned to the segment-topic $k$, $\hat{n}_k$ is the number of segments assigned to the segment-topic $k$, and $\bar{m}_{kl}$ and $\bar{v}_{kl}$ are the sample mean and sample variance, respectively, computed over all the segments assigned to the segment-topic $k$.
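A minimal Python sketch of this moment-matching update, following the reconstruction of Equations 4.5-4.9 above, is given below. The data layout (a per-segment word-topic count matrix) and the handling of empty segment-topics are assumptions of the sketch, not part of the thesis.

import numpy as np

def moment_match_alpha(segment_topic, word_topic_counts, segment_lengths, K, L, eps=1e-12):
    segment_topic = np.asarray(segment_topic)
    counts = np.asarray(word_topic_counts, dtype=float)    # shape (S, L): n_{s,l}
    lengths = np.asarray(segment_lengths, dtype=float)     # shape (S,): N_s
    proportions = counts / lengths[:, None]                # n_{s,l} / N_s
    alpha = np.full((K, L), 1.0 / L)                       # uniform fallback for empty topics
    for k in range(K):
        members = proportions[segment_topic == k]          # segments in S_k
        if len(members) == 0:
            continue
        mean = members.mean(axis=0)                        # Eq. 4.5
        var = members.var(axis=0) + eps                    # Eq. 4.6
        m = np.maximum(mean * (1.0 - mean) / var - 1.0, eps)   # Eq. 4.8
        scale = np.exp(np.log(m).sum() / (L - 1))          # Eq. 4.9 fixes sum_l alpha_kl
        alpha[k] = (mean / max(mean.sum(), eps)) * scale   # Eq. 4.7: alpha_kl proportional to mean
    return alpha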

The posterior estimates for θ, φ, ψ, π, τ, σ are: 118

$$\hat{\theta}_z^{(s)} = \frac{\alpha_{yz}^d + n_{sz}}{\sum_{l=1}^{L} (\alpha_{yl}^d + n_{sl})} \qquad (4.10) \qquad\qquad \hat{\phi}_{zw} = \frac{\beta_w + n_{zw}}{\sum_{v=1}^{W} (\beta_v + n_{zv})} \qquad (4.11)$$

$$\hat{\psi}_{zwt} = \frac{\gamma_t + p_{zwt}}{\sum_{t=0}^{1} (\gamma_t + p_{zwt})} \qquad (4.12) \qquad\qquad \hat{\sigma}_{wv} = \frac{\delta_v + m_{wv}}{\sum_{v'=1}^{W} (\delta_{v'} + m_{wv'})} \qquad (4.13)$$

$$\hat{\pi}_c = \frac{\Omega_c + n_{d_c}}{\sum_{c=0}^{1} (\Omega_c + n_{d_c})} \qquad (4.14) \qquad\qquad \hat{\tau}_y = \frac{P_{dy} + \rho_y}{\sum_{k=1}^{K} (P_{dk} + \rho_k)} \qquad (4.15)$$

4.4 Experiments and Results

Evaluation of topic models is a challenging task [253]. Simply showing the highly probable n-gram words obtained from each topic may not be able to portray the underlying strengths or weaknesses of a topic model. Therefore, we evaluate our model on several text mining tasks including the ability to support fine grained topics with n-gram words in the correlation graph, the ability to segment a document into topically coherent sections, document classification, and document likelihood estimation.

In each experiment, we chose several existing closely related comparative methods for comparison purposes. We will describe those comparative methods in the subsections that follow. For our proposed framework, NTSeg, the segment granularity is basically a paragraph, because topical changes typically occur at paragraph boundaries, and this strategy is also used in [231]. Note that NTSeg can also work at the granularity of a sentence, which has also been used in one of our experiments (refer to Section 4.4.2). In our experiments, the number of iterations for the Gibbs sampler is 1000, which is the value of MaxIteration used in Algorithm 1. We have chosen the following hyperparameter values: β = 0.01, γ = 0.1, δ = 0.1, Ω = 0.1, and ρ = 0.1. Other topic models such as TNG, LDSEG, etc., also assume fixed hyperparameter values. We did not perform any stemming, but removed stopwords1 from the collection.

4.4.1 Correlation Graph

NTSeg produces two levels of topics, namely, segment-topics and word-topics. A word-topic is comprised of n-grams. We show the correlation graph for the purpose of depicting how our model finds correlations among the various segment-topics and word-topics. We only show a part of the correlation graph in Figures 4.7 and 4.8. We present the words of a word-topic in a box with some highly probable n-grams. We mainly follow the presentation outlined in the two works [161], [231]: each word-topic $z$ in the graph is shown as a box containing its most probable words, and for each segment-topic $k$ we rank the word-topics $z$ according to the Dirichlet parameter $\alpha_{kl}$.

We have used the OHSUMED2 and NIPS document collections to show the correlation graph. The OHSUMED collection is composed of 348,566 documents with 154,711 words in the vocabulary after removing stopwords. Our intention is also to show that our model can scale to large document collections. In comparison to the OHSUMED collection, the NIPS collection is significantly smaller.

We have experimented by varying both the number of word-topics L and the number of segment-topics K. L was varied from 50 to 200 in steps of 50, whereas K was varied from 50 to 150 in steps of 50. However, we did not observe a significant difference in the quality of the results. The resulting correlation graphs are shown in Figures 4.7 and 4.8, which are obtained by setting L = 200 and K = 100. We show the graph obtained from our NTSeg model.

1 http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
2 http://ir.ohsu.edu/ohsumed/ohsumed.html


Figure 4.7: Correlation identified by NTSeg between the word-topics and the segment- topics on OHSUMED collection considering L = 200 and K = 100. Each circle shows a segment-topic and each box corresponds to a word-topic. We can notice that a segment-topic can capture correlations between several word-topics.

Note that other models such as PAM, LDCC, LDSEG, GD-LDA [35] and CTM [15] only form unigrams in a topic, leading to ambiguous interpretation. For example, presenting the unigram “confidence” will not be that insightful in a correlation graph. In contrast, presenting the term “confidence interval” is more meaningful, as shown in Figure 4.7. We also show the correlation graph obtained from the previously proposed GD-LDA [35] model, where we only notice unigram words in the correlation graph.

4.4.2 Topic Segmentation Experiment

The purpose of this experiment is to show how well NTSeg generates segmentation of documents corresponding to coherent topical units. The segmentation information

is obtained via the segmentation switch variable $c_s^d$ which gives the segment-topic change-points in the document. In our problem setting we know the segment boundaries in advance, such as paragraphs or sentences, but we do not know the word and segment topics. Our purpose is thus to learn the segment and word topics from the document collection. The prediction output of the segment status variable will define the segmentation of a document. To evaluate the performance, we make use of the annotated segmentation information. We use two standard metrics, namely, Pk and WinDiff, which are widely used in the topic segmentation literature [230]. As described in [230], Pk is defined as the probability that two segments drawn randomly from a document are incorrectly identified as belonging to the same topic [10]. WinDiff [199] moves a sliding window across the text and counts the number of times the hypothesized and reference segment boundaries differ within the window. The lower the values obtained for these two metrics, the better the segmentation result.

Figure 4.8: Correlation identified by our model between the word-topics and the document-topics. Each circle shows a document-topic and each box corresponds to a word-topic. We can notice that a document-topic can capture correlations between several word-topics.
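A minimal Python sketch of the Pk and WinDiff metrics described above is given below, under the common convention that segmentations are represented as 0/1 boundary arrays and that the window size k defaults to half the average reference segment length; this is an illustrative approximation rather than the exact implementation used in our experiments.

import numpy as np

def _window_size(ref):
    # Half the average reference segment length (a common default choice).
    return max(1, int(round(0.5 * len(ref) / max(1, int(ref.sum())))))

def pk(ref, hyp, k=None):
    # ref, hyp: NumPy 0/1 arrays marking a boundary after each unit.
    k = k or _window_size(ref)
    denom = max(1, len(ref) - k)
    errors = sum((ref[i:i + k].sum() == 0) != (hyp[i:i + k].sum() == 0)
                 for i in range(len(ref) - k))
    return errors / denom

def windiff(ref, hyp, k=None):
    # Counts windows in which the number of boundaries differs.
    k = k or _window_size(ref)
    denom = max(1, len(ref) - k)
    errors = sum(ref[i:i + k].sum() != hyp[i:i + k].sum()
                 for i in range(len(ref) - k))
    return errors / denom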


Figure 4.9: Correlation identified by the CTM model between the word-topics and the document-topics. We can see that unigrams are generated by the model, which at times are not very insightful to a reader.

We use two publicly available datasets that contain segment boundaries corresponding to the topic changes. The first dataset, called Lectures in our experiment, consists of spoken lecture transcripts from an undergraduate physics class and a graduate artificial intelligence class. Each transcript corresponds to a 90-minute lecture recording and has 500 to 700 sentences with about 9000 words. Note that here the segment granularity is a sentence. More details about this dataset can be obtained from [230]. Our second dataset, called Books in our experiment, is the books3 dataset in which each document is a chapter extracted from a medical textbook.

3 http://groups.csail.mit.edu/rbg/code/bayesseg/

Figure 4.10: Topic segmentation results of our model compared with the TopicTiling model in terms of the Pk metric on the Books dataset (panels for 10, 30 and 50 segment-topics K; x-axis: number of word-topics Z).

Figure 4.11: Topic segmentation results of our model compared with the TopicTiling model in terms of the WinDiff metric on the Books dataset (panels for 10, 30 and 50 segment-topics K; x-axis: number of word-topics Z).

Figure 4.12: Topic segmentation results of our model compared with the TopicTiling model in terms of the Pk metric on the Lectures dataset (panels for 10, 30 and 50 segment-topics K; x-axis: number of word-topics Z).

Figure 4.13: Topic segmentation results of our model compared with the TopicTiling model in terms of the WinDiff metric on the Lectures dataset (panels for 10, 30 and 50 segment-topics K; x-axis: number of word-topics Z).

            Precision   Recall   F-Measure
LDSEG       0.580       0.420    0.487
PAM         0.550       0.450    0.495
LDACOL      0.400       0.300    0.343
TNG         0.490       0.420    0.452
PDLDA       0.580       0.500    0.537
NTSeg       0.640       0.520    0.574

Table 4.1: Document classification results for the Computer Dataset of the 20 Newsgroups corpus.

            Precision   Recall   F-Measure
LDSEG       0.440       0.400    0.419
PAM         0.500       0.330    0.398
LDACOL      0.420       0.370    0.393
TNG         0.560       0.470    0.511
PDLDA       0.580       0.510    0.543
NTSeg       0.620       0.560    0.588

Table 4.2: Document classification results for the Science Dataset of the 20 Newsgroups corpus.

We chose a recently proposed topic segmentation method, TopicTiling [220], which has outperformed many state-of-the-art models proposed in the literature, and we chose the best performing variant of TopicTiling from [220]. Note that TopicTiling only has the notion of word-topics. For each segment-topic and word-topic setting, we run the Gibbs sampler five times and take the average of the Pk and WinDiff values at the end of the fifth run.

We illustrate the segmentation results in Figures 4.10, 4.11, 4.12, and 4.13. From the results, we note that our model performs extremely well on both datasets compared to the state-of-the-art topic segmentation model. Using a two-tailed significance test, our results are statistically significant with p < 0.05 against TopicTiling. In the Books dataset, NTSeg performs better, but the improvement obtained is not very high considering both the Pk and WinDiff metrics. However, a good improvement is obtained in the Lectures dataset using both metrics.

            Precision   Recall   F-Measure
LDSEG       0.390       0.320    0.352
PAM         0.540       0.490    0.514
LDACOL      0.550       0.410    0.470
TNG         0.550       0.450    0.495
PDLDA       0.590       0.410    0.484
NTSeg       0.620       0.570    0.594

Table 4.3: Document classification results for the Politics Dataset of the 20 Newsgroups corpus.

            Precision   Recall   F-Measure
LDSEG       0.330       0.320    0.325
PAM         0.368       0.360    0.363
LDACOL      0.200       0.180    0.189
TNG         0.340       0.290    0.313
PDLDA       0.380       0.210    0.271
NTSeg       0.420       0.380    0.399

Table 4.4: Document classification results for the Sports Dataset of the 20 Newsgroups corpus.

4.4.3 Document Classification Experiment

We conduct a document classification experiment using topic models. In the training phase, a topic model is learned for each class using the set of training documents in that class. In testing, to classify a testing document, we compute the likelihood of the testing document against the trained topic model of each class. The testing document is assigned to the class whose model produces the highest likelihood. Note that this procedure is also used in [161].
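This procedure can be sketched as follows in Python; the log_likelihood interface on the trained per-class models is a hypothetical one used only for illustration.

def classify_by_likelihood(doc, class_models):
    # class_models: dict mapping class label -> trained model exposing a
    # log_likelihood(doc) method (a hypothetical interface).
    scores = {label: model.log_likelihood(doc) for label, model in class_models.items()}
    best = max(scores, key=scores.get)   # class whose model explains the document best
    return best, scores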

We measure the classification performance using precision, recall and F-measure. Precision for a class is the number of true positives divided by the total number of documents predicted to belong to that class. Recall is the number of true positives divided by the total number of documents that actually belong to that class in the gold standard. F-measure is the harmonic mean of precision and recall.

Figure 4.14: Document modeling results of our model compared to other topic models on the NIPS dataset: log-likelihood (×10^6) versus the number of word-topics Z, with panels for 10, 30 and 50 segment-topics K; the compared models are LDSEG, LDACOL, PDLDA, PAM, TNG and NTSeg.

Figure 4.15: Document modeling results of our model compared to other topic models on the OHSUMED dataset: log-likelihood (×10^7) versus the number of word-topics Z, with panels for 50, 100 and 150 segment-topics K; the compared models are LDSEG, LDACOL, PDLDA, PAM, TNG and NTSeg.

We use the 20 Newsgroups corpus4 and generated four datasets. The first dataset comprises documents related to computer technology (the “comp” directory in the dataset). It is composed of several classes such as “graphics”, “windows”, “hardware”, etc. Each of these classes consists of 1000 documents. We split the documents in each of these classes into 75% training and 25% test documents. For each class, we trained and tested the model by varying the number of word-topics from 10 to 100 in steps of 10 and the number of segment-topics from 10 to 50 in steps of 20. We compute precision and recall using the test set of each class for each word-topic and segment-topic value, and then we compute the average result for one class across all word-topics and segment-topics. Similarly, we follow the same precision and recall computation for all classes. Finally, we compute the average over all precision and recall values for all the classes. We then compute the F-measure from the obtained precision and recall values. The experimental setup is similar for the other three datasets, namely, “sci” (called Science Dataset), “politics” (called Politics Dataset), and “sports” (called Sports Dataset).

The comparative methods include LDSEG, PAM, LDACOL, TNG, and PDLDA. All these models are described in Chapter 2. Note that some of the comparative methods such as TNG, PDLDA, and LDACOL have no notion of segment-topics.

The classification performance results are presented in Tables 4.1, 4.2, 4.3 and 4.4. We can observe that in all the datasets our model, NTSeg, has outperformed all the comparative methods. Compared to all the comparative methods, our results are also statistically significant using the sign test with p < 0.05. The gain obtained in the Computer and Science datasets is larger than the gain in the Sports and Politics datasets. PDLDA also proved to be a better model in comparison to the other comparative methods.

4 http://qwone.com/~jason/20Newsgroups/

4.4.4 Document Likelihood Experiment

Another evaluation scheme to compare the relative performance of topic models is to study how well the models generalize to unseen data. The entire corpus is first split into a training and a testing set, where the training set generally contains more documents than the testing set. A model is first learned on the training data, and the testing set is used to measure the generalization performance of the topic models. In the topic modeling literature, metrics such as perplexity or log-likelihood have often been used. For example, PAM uses empirical log-likelihood [60] as an evaluation metric, and so does the recently proposed GD-LDA [35]. Log-likelihood has also been widely used as one of the evaluation metrics, for example in [15]. We chose the log-likelihood metric for comparing the topic models. The comparative methods here are LDSEG, PAM, LDACOL, TNG, and PDLDA.
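One simple way to compute such a held-out score, using point estimates of the topic distributions rather than the full Gibbs-sampling approximation of [86] used in our experiments, is sketched below in Python; the data layout and variable names are assumptions of the sketch.

import numpy as np

def heldout_log_likelihood(docs, theta, phi):
    # docs : list of held-out documents, each a list of word ids.
    # theta: (D, L) topic proportions inferred for the held-out documents.
    # phi  : (L, W) topic-word distributions estimated on the training set.
    total = 0.0
    for d, doc in enumerate(docs):
        for w in doc:
            total += np.log(theta[d] @ phi[:, w] + 1e-300)   # log sum_z theta_dz * phi_zw
    return total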

We use the NIPS dataset5. The NIPS collection is widely used in the topic modeling literature. Note that the original raw NIPS dataset consists of 17 years of conference papers, but we supplemented this dataset with some new raw NIPS documents6 so that it has 19 years of papers in total. Our NIPS collection consists of 2741 documents comprising 4,536,069 non-unique words and 94,961 words in the vocabulary. In addition to the NIPS collection we also use the OHSUMED collection.

In order to calculate the likelihood of held-out data, we must integrate out the sampled multinomials and sum over all possible topic assignments, which has no closed-form solution. Griffiths et al. [86] have used Gibbs sampling for computing such approximations. First, we randomly split each of the datasets into 80% training and 20% testing. We trained each of the topic models on the training set. We then tested the models on the testing set by running the inference algorithms five times for each word-topic and segment-topic pair. We then took the average value over all five runs.

5 http://www.cs.nyu.edu/~roweis/data.html
6 http://ai.stanford.edu/~gal/Data/NIPS/

We varied the number of segment-topics from 10 to 50 in steps of 20 and the number of word-topics from 20 to 200 in steps of 20 in the NIPS collection. As the OHSUMED collection is larger than the NIPS collection, we varied the number of segment-topics from 50 to 150 in steps of 50 and the number of word-topics from 150 to 490 in steps of 20.

From the results in Figures 4.14 and 4.15 we can see that in the NIPS collection, NTSeg performs better than the comparative methods, especially when the number of segment-topics is 10. However, its performance deteriorates a little when the number of segment-topics is increased, though it still remains competitive with the comparative methods. Moreover, we notice that as the number of word-topics increases, the performance of NTSeg deteriorates to some extent in the NIPS collection. However, in the OHSUMED collection, NTSeg again performs better than the comparative methods when the number of word-topics is increased. We can observe that NTSeg outperforms the comparative methods considerably when the number of segment-topics is 100. The results suggest that NTSeg can perform very well on large document collections, as large collections provide richer information about word co-occurrences.

4.5 Closing Remarks

We have presented a generative topic discovery model in this chapter, known as NTSeg, which maintains the document's structure, such as paragraphs and sentences, and also keeps the order of the words in the document intact. NTSeg incorporates the notion of word-topics and segment-topics. We have conducted extensive experiments and shown results using both qualitative analysis, where we show the n-gram words in the correlation graph, and quantitative performance. Experimental results demonstrate that relaxing the bag-of-words assumption in each segment improves the performance of the model.

Chapter Five

Modeling Temporal Dynamics in Text Documents

Chapter Summary

This chapter presents a topic model that captures the temporal dynamics in text data along with topical phrases. Previous approaches have relied upon the bag-of-words assumption to model this property of a corpus, which results in inferior performance and less interpretable topics. Our topic model not only captures the way the topic structure changes over time but also maintains important contextual information in the text data. Finding topical n-grams, when the context permits, instead of always presenting unigrams in topics does away with many of the ambiguities that individual words may carry. We derive a collapsed Gibbs sampler for posterior inference.

5.1 The Case for Capturing N-grams over Time

Popular text processing models such as LDA [23] and the TOT model [260] assume that the order of words in a document is not important. As a result, these models lose important collocation information in documents. For example, LDA, due to its bag-of-words assumption, fails to capture a phrase such as “acquired immune deficiency syndrome”, which is one of the model's shortcomings. Also, if one uses the TOT model on the NIPS document collection, then a word such as “networks” in a topic will not convey much insight to a human being, whereas presenting “neural networks” is more insightful. Thus presenting words along with their context in a topic can help a person obtain better insight about a word in a topic.

Data is ever evolving and so are topics. At one point in time, one topic may be more popular than others, but this popularity may eventually decline. In Figure 5.1, we present an example of such topical changes over time. For example, in the year 2010, “Burj Khalifa” in the United Arab Emirates was the most dominant topic all over the world. People then stopped discussing it after some period of time, and it was taken over by topics such as the “Earthquake in New Zealand”.

[Figure 5.1 content: entities such as Burj Khalifa, Wikipedia, Gaza Strip, Volcano, N.Z. Earthquake, Sachin Tendulkar, Manila Hostage, Osama bin Laden, China, Iraq War, Higgs Boson and Apple Inc. shown across the years 2010, 2011 and 2012.]

Figure 5.1: An illustration of how some entities come into existence and fade over time, and are then taken over by other entities. Topics tend to change over time, and what is dominant today might not remain dominant after some period of time.

Models such as LDA cannot capture such time-dependent changes in topics. In order to capture such structure in the data many models have been proposed, for example [18], [256], [133]; the TOT model is also one among them. The TOT model incorporates time along with the word co-occurrence patterns. A limitation of this model and other related models is that they fail to capture n-gram words or phrases in a topic. This in turn results in less coherent words in each topic, and topics tend to become less interpretable over time. A common limitation of n-gram topic models, such as [252], [117] and [166], is that they cannot capture how topics evolve over time.

We present a model which not only considers the local contextual information inherent in the document, but also captures the way in which the topic structure changes over time. Maintaining the word order in the document and capturing phrases in topics helps us find words in topics which convey better meaning to the reader. In our model, a continuous distribution over time is associated with each topic. Topics generate both words and observed time-stamp values. The model automatically determines whether to form a unigram or to combine a word with the previous word in each time-stamped document.


Figure 5.2: Graphical model of our proposed n-gram topics over time model.

5.2 Our N-gram Temporal Topic Model

The graphical model is shown in Figure 5.2, where $\delta$ is the Dirichlet prior of $\sigma$ and $\sigma$ is a discrete distribution. Our model is not just a simple extension of the TOT model, because it allows us to find topical phrases over time, which is not possible using a topic model that relies on a simple bag-of-words assumption. The input to our model is the entire document with word order kept intact, rather than the traditional term-document matrix. Our model has a bigram switch/status variable $x$ which keeps track of whether a word forms a bigram with the preceding word over time. If it is possible to form a bigram then $x_i^d$ is set to 1, else $x_i^d = 0$. Combining successive n-grams in sequence gives rise to higher order n-grams ($n > 2$) or phrases. $\gamma$ in Figure 5.2 is the Dirichlet prior of $\psi$, where $\psi$ is the Bernoulli distribution of the status variable $x_i^d$ with respect to the previous word, and $x_i^d$ is the bigram switch variable between $w_{i-1}^d$ and $w_i^d$ in document $d$. We assume a hypothetical unigram $w_h^d$ at the beginning of every document. Also, we assume that the first bigram switch variable $x_1^d$ is observed, and we allow only a unigram at the beginning of the document. The TNG and LDACOL models can also capture topical phrases in topics by introducing a bigram status variable, but they cannot capture temporal information. We assume a continuous distribution over time associated with each topic and find patterns which are localized over time in the corpus. The reason for adopting a continuous distribution is that it does away with the time discretization process, where a major hurdle is the selection of an appropriate time slice. From the graphical model, we can infer that topics are responsible for generating both words and observed time-stamps. Our model captures not only n-gram words in a document but also temporal information. Also note that we assume there is a time-stamp value associated with every word in a document; this time-stamp is simply the time-stamp of the document itself, and during model fitting the time-stamp values from the document are copied to the words in that document. Another point to note is that the topic allocations of the two terms in a phrase may differ. In order to tackle this, we take the topic assignment of the entire phrase to be the topic assigned to the “head noun” of that phrase. This assumption simplifies our model to some extent, which speeds up the inference algorithm without affecting the results considerably. In PDLDA, the authors have relaxed this assumption, but from their graphical model we note that the complexity of their model is rather increased. The generative procedure of our model is shown below.

1. Draw Discrete($\phi_z$) from Dirichlet($\beta$) for each topic $z$

2. Draw Bernoulli($\psi_{zw}$) from Beta($\gamma$) for each topic $z$ and each word $w$

3. Draw Discrete($\sigma_{zw}$) from Dirichlet($\delta$) for each topic $z$ and each word $w$

4. For every document $d$, draw Discrete($\theta^d$) from Dirichlet($\alpha$)

   (a) For each word $w_i^d$ in document $d$:

       i. Draw $x_i^d$ from Bernoulli($\psi_{z_{i-1}^d w_{i-1}^d}$)

       ii. Draw $z_i^d$ from Discrete($\theta^d$)

       iii. Draw $w_i^d$ from Discrete($\sigma_{z_i^d w_{i-1}^d}$) if $x_i^d = 1$

       iv. Otherwise, draw $w_i^d$ from Discrete($\phi_{z_i^d}$)

       v. Draw a time-stamp $t_i^d$ from Beta($\Omega_{z_i^d}$)
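The following self-contained Python sketch samples synthetic data from the generative procedure above. The vocabulary size, number of topics, hyperparameter values and the initial Beta shape parameters are illustrative assumptions rather than settings used in the thesis.

```python
# Illustrative sketch of the NTOT generative procedure (not the thesis code).
import numpy as np

rng = np.random.default_rng(0)
T, W = 5, 500                       # assumed number of topics and vocabulary size
alpha, beta, gamma, delta = 0.5, 0.1, 0.01, 0.01

phi   = rng.dirichlet(np.full(W, beta), size=T)           # step 1: per-topic word distributions
psi   = rng.beta(gamma, gamma, size=(T, W))               # step 2: bigram-switch probabilities
sigma = rng.dirichlet(np.full(W, delta), size=(T, W))     # step 3: per-(topic, word) successor distributions
omega = np.abs(rng.normal(2.0, 0.5, size=(T, 2))) + 1.0   # assumed Beta shape parameters per topic

def generate_document(n_words=50):
    theta = rng.dirichlet(np.full(T, alpha))              # step 4: document-topic proportions
    words, times, prev_w, prev_z = [], [], 0, 0
    for i in range(n_words):
        x = 0 if i == 0 else rng.binomial(1, psi[prev_z, prev_w])  # step i: bigram switch
        z = rng.choice(T, p=theta)                                  # step ii: topic
        if x == 1:
            w = rng.choice(W, p=sigma[z, prev_w])                   # step iii: word from successor dist.
        else:
            w = rng.choice(W, p=phi[z])                             # step iv: word from topic dist.
        t = rng.beta(omega[z, 0], omega[z, 1])                      # step v: time-stamp from Beta
        words.append(w); times.append(t)
        prev_w, prev_z = w, z
    return words, times
```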

Input: $\gamma, \delta, \alpha, T, \beta$, Corpus, MaxIteration
Output: Topic assignments for all the n-gram words with temporal information

Initialization: Randomly initialize the n-gram topic assignment for all words; zero all count variables.
for iteration ← 1 to MaxIteration do
    for d ← 1 to D do
        for w ← 1 to N_d according to word order do
            Draw $z_w^d$, $x_w^d$ as defined in Equation 5.1
            if $x_w^d = 0$ then update $n_{zw}$ else update $m_{zw}$
            Update $q_{dz}$, $p_{zw}$
        end
    end
    for z ← 1 to T do
        Update $\Omega_z$ by the method of moments as in Equations 5.6 and 5.7
    end
end
Compute the posterior estimates of $\alpha, \beta, \gamma, \delta$ defined in Equations 5.2, 5.3, 5.4, 5.5

Algorithm 2: Inference algorithm for the NTOT model

5.2.1 Inference and Parameter Estimation

We adopt collapsed Gibbs sampling in order to do posterior inference. Collapsed Gibbs sampling integrates out irrelevant (nuisance) parameters when conducting inference. This results in faster inference, especially for a complex graphical model such as ours, where the computational burden at each iteration is reduced considerably compared to the uncollapsed Gibbs sampling technique. In order to estimate the Beta distributions $\Omega_z$ we adopt a method of moments, where the distributions are estimated once per iteration. We present an overview of the collapsed Gibbs sampler in Algorithm 2. We also present the complete posterior inference derivation in Appendix B.

We again describe some notation which will be used later in the text. Let $W$ be the number of words in the vocabulary. $\mathbf{z}$ is the topic variable for the corpus, and $\mathbf{z}_{\neg i}^d$ is the topic assignment for all words except the current word $i$. A similar interpretation applies to $\mathbf{x}_{\neg i}^d$. Let $n_{zw}$ be the number of times word $w$ has been assigned to topic $z$ as a unigram; $m_{zwv}$ be the number of times word $v$ has been assigned to topic $z$ as the second term of a bigram given the previous word $w$; $p_{zwk}$ be the number of times the status variable $x = k$ given the previous word $w$ and the previous word's topic $z$; and $q_{dz}$ be the number of times a word is assigned to topic $z$ in document $d$. $\bar{t}_z$ is the sample mean and $s_z^2$ is the biased sample variance of the time-stamps which belong to topic $z$. $\Omega_{z_i^d 1}$ and $\Omega_{z_i^d 2}$ are the shape parameters of the Beta distribution. Count variables also include the assignment of the word being visited. In the collapsed Gibbs sampling procedure, we need to compute the following conditional distribution:

\[
P(z_i^d, x_i^d \mid \mathbf{w}, \mathbf{t}, \mathbf{x}_{\neg i}, \mathbf{z}_{\neg i}, \alpha, \beta, \gamma, \delta, \Omega) \propto
\bigl(\gamma_{x_i^d} + p_{z_{i-1}^d w_{i-1}^d x_i^d} - 1\bigr)\bigl(\alpha_{z_i^d} + q_{d z_i^d} - 1\bigr) \times
\begin{cases}
\dfrac{\beta_{w_i^d} + n_{z_i^d w_i^d} - 1}{\sum_{v=1}^{W}(\beta_v + n_{z_i^d v}) - 1} & \text{if } x_i^d = 0 \\[2ex]
\dfrac{\delta_{w_i^d} + m_{z_i^d w_{i-1}^d w_i^d} - 1}{\sum_{v=1}^{W}(\delta_v + m_{z_i^d w_{i-1}^d v}) - 1} & \text{if } x_i^d = 1
\end{cases}
\times \underbrace{\frac{(1 - t_i^d)^{\Omega_{z_i^d 1} - 1}\,(t_i^d)^{\Omega_{z_i^d 2} - 1}}{B(\Omega_{z_i^d 1}, \Omega_{z_i^d 2})}}_{\text{captures temporal information}} \quad (5.1)
\]

Simple manipulations help us arrive at the following posterior estimates of $\theta$, $\phi$, $\psi$, $\sigma$ and $\Omega$, which are shown in Equations 5.2–5.7.

\[
\hat{\theta}_z^d = \frac{\alpha_z + q_{dz}}{\sum_{t=1}^{T}(\alpha_t + q_{dt})} \quad (5.2) \qquad
\hat{\phi}_{zw} = \frac{\beta_w + n_{zw}}{\sum_{v=1}^{W}(\beta_v + n_{zv})} \quad (5.3) \qquad
\hat{\psi}_{zwk} = \frac{\gamma_k + p_{zwk}}{\sum_{k=0}^{1}(\gamma_k + p_{zwk})} \quad (5.4)
\]
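As a small illustrative sketch, the point estimates in Equations 5.2–5.4 can be computed directly from the final count matrices of the collapsed Gibbs sampler; the array shapes assumed below are an illustration, not the thesis implementation.

```python
# Sketch of the posterior point estimates in Equations 5.2-5.4.
import numpy as np

def posterior_estimates(q_dz, n_zw, p_zwk, alpha, beta, gamma):
    """q_dz: (D, T) document-topic counts; n_zw: (T, W) topic-word unigram counts;
       p_zwk: (T, W, 2) switch-variable counts; alpha/beta/gamma: symmetric priors."""
    theta = (q_dz + alpha) / (q_dz + alpha).sum(axis=1, keepdims=True)    # Eq. 5.2
    phi   = (n_zw + beta) / (n_zw + beta).sum(axis=1, keepdims=True)      # Eq. 5.3
    psi   = (p_zwk + gamma) / (p_zwk + gamma).sum(axis=2, keepdims=True)  # Eq. 5.4
    return theta, phi, psi
```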

[Figure 5.3 content: Our Model − Mexican War. Top n-gram words: 1. east bank, 2. american coins, 3. mexican flag, 4. separate independent, 5. american commonwealth, 6. mexican population, 7. texan troops, 8. military, 9. general herrera, 10. foreign coin, 11. military usurper, 12. mexican treasury, 13. invaded texas, 14. veteran troops. Histogram over the years 1800–2000.]

Figure 5.3: Histogram along with the high probability n-gram words, shown in a table alongside, obtained from our NTOT model. The histogram depicts how the topic is distributed over time and is fitted with the Beta PDF. We see that our model generates more localized topics over time with better event-specific n-gram words, which make more sense to a reader than unigram models.

\[
\hat{\sigma}_{zwv} = \frac{\delta_v + m_{zwv}}{\sum_{v=1}^{W}(\delta_v + m_{zwv})} \quad (5.5) \qquad
\hat{\Omega}_{z1} = \bar{t}_z \left( \frac{\bar{t}_z (1 - \bar{t}_z)}{s_z^2} - 1 \right) \quad (5.6) \qquad
\hat{\Omega}_{z2} = (1 - \bar{t}_z) \left( \frac{\bar{t}_z (1 - \bar{t}_z)}{s_z^2} - 1 \right) \quad (5.7)
\]
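The method-of-moments update for the per-topic Beta shape parameters in Equations 5.6 and 5.7 is simple enough to sketch directly; the snippet below is an illustration assuming the time-stamps have already been rescaled to (0, 1) and have non-zero variance.

```python
# Sketch of the method-of-moments Beta update (Equations 5.6 and 5.7).
import numpy as np

def beta_method_of_moments(timestamps):
    """timestamps: 1-D array of time-stamps (in (0, 1)) currently assigned to one topic."""
    t = np.asarray(timestamps, dtype=float)
    mean = t.mean()
    var = t.var()                       # biased sample variance, as in the text
    common = mean * (1.0 - mean) / var - 1.0
    omega1 = mean * common              # Equation 5.6
    omega2 = (1.0 - mean) * common      # Equation 5.7
    return omega1, omega2
```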

5.3 Experiments and Results

5.3.1 Data Sets and Comparative Method

We have conducted experiments on two datasets. Our first dataset comprises the U.S. Presidential State-of-the-Union1 speeches from 1790 to 2002. Our second dataset was derived from the NIPS conference papers. The speech dataset and the NIPS paper dataset have also been used in [260]. Some basic information on these datasets can be obtained from [209], [260].

1 http://infomotions.com/etexts/gutenberg/dirs/etext04/suall11.txt

[Figure 5.4 content: TOT − Mexican War. Top words: 1. mexico, 2. texas, 3. war, 4. mexican, 5. united, 6. country, 7. government, 8. territory, 9. army, 10. peace, 11. act, 12. policy, 13. foreign, 14. citizens. Histogram over the years 1800–2000.]

Figure 5.4: Histogram along with the high probability words, shown in a table alongside, obtained from the TOT model. The histogram depicts how the topic is distributed over time and is fitted with the Beta PDF. We see that the words in the topic are not very insightful when compared with our model.

[Figure 5.5 content: Our Model − Panama Canal. Top n-gram words: 1. panama canal, 2. isthmian canal, 3. isthmus panama, 4. republic panama, 5. united states government, 6. united states, 7. state panama, 8. united states senate, 9. french canal company, 10. caribbean sea, 11. panama canal bonds, 12. panama, 13. american control, 14. canal. Histogram over the years 1800–2000.]

Figure 5.5: Histogram along with the high probability n-gram words, shown in a table alongside, obtained from our NTOT model. The histogram depicts how the topic is distributed over time and is fitted with the Beta PDF. We see that our model generates more localized topics over time with better event-specific n-gram words, which make more sense to a reader than unigram models.

[Figure 5.6 content: TOT − Panama Canal. Top words: 1. government, 2. cuba, 3. islands, 4. international, 5. powers, 6. gold, 7. action, 8. spanish, 9. island, 10. act, 11. commission, 12. officers, 13. spain, 14. rico. Histogram over the years 1800–2000.]

Figure 5.6: Histogram along with the high probability words, shown in a table alongside, obtained from the TOT model. The histogram depicts how the topic is distributed over time and is fitted with the Beta PDF.

Note that the original raw NIPS dataset2 consists of 17 years of conference papers. To construct the second dataset, we supplemented this dataset with some new raw NIPS documents3, giving 19 years of papers in total. Our NIPS collection consists of 2740 documents comprising 4,536,069 non-unique words and 94,961 words in the vocabulary. Our closest comparative method is the TOT model. We have followed the same text pre-processing strategy as in TOT for both our datasets, but we maintain the order of terms in documents with stopwords removed. We have fixed the number of topics4 to 50 and assumed a symmetric Dirichlet distribution ($\alpha = 50/T$ and $\beta = 0.1$) for our model. In addition, in our model we have set $\gamma = 0.01$ and $\delta = 0.01$. For the TOT model, we have fixed the number of topics to 50 and also assumed a symmetric Dirichlet distribution ($\alpha = 50/T$ and $\beta = 0.1$) in all our experiments.

2 http://www.cs.nyu.edu/~roweis/data.html
3 http://ai.stanford.edu/~gal/Data/NIPS/
4 We have used the same number of topics and parameter values as used in the original TOT paper.

5.3.2 Experimental Results

We investigated qualitatively two topics, namely “Mexican War” and “Panama Canal”, from the State-of-the-Union dataset, which have also been studied in [260]. The result of our model NTOT for the “Mexican War” is shown in Figure 5.3. In the figure, the histogram depicts the way the topic is distributed over time, and it is fitted with a Beta probability density function. We have shown the top probable words next to the figure. The topic names, for example “Mexican War”, are our own interpretation of the topics. Just like the TOT model in Figure 5.4, our model has also captured the temporal information precisely, as seen from the histogram, where topics are narrowly focused in time around the period when the event occurred. However, the most noticeable observation is the words in each topic. The TOT model captures unigrams, some of which are ambiguous, such as “united” in the “Mexican War” topic. In contrast, our model has produced self-explanatory phrases, thereby removing ambiguities. It is interesting to note that, unlike TOT, our model has captured some entities popular during that time, such as “General Herrera”, who was a notable figure during the “Mexican War”.

For the topic “Panama Canal” in Figure 5.5, we also capture the same timeline as TOT, shown in Figure 5.6, i.e. from 1904 to 1914, where we note high peaks for this topic during this period. Our results are far superior, with more coherent and interpretable topics. Our model could capture “isthmian canal”, “french canal company”, etc., which the TOT model could not capture. These entities were popular during that time.

We show a qualitative result of our model using the NIPS collection. We depict the results obtained from our model in Figure 5.7, and the TOT model in Figure 5.8. In order to compute the distribution of topics based on time-stamps, we use Bayes' rule and compute $E(\theta_{z_i} \mid t) = P(z_i \mid t) \propto P(t \mid z_i)\,P(z_i)$, where $P(z_i)$ can be assumed to be uniform or estimated from data [260]. We show some of the top probable words

[Figure 5.7 content: 1. hidden unit, 2. neural net, 3. input layer, 4. recurrent network, 5. hidden layers, 6. learning algorithms, 7. error signals, 8. recurrent connections, 9. training pattern, 10. recurrent cascade.]

Figure 5.7: A topic related to “recurrent NNs” comprising n-gram words obtained from our model. The title names are given by us based on our interpretation. Histograms depict the way topics are distributed over time, and they are fitted with Beta probability density functions. We have shown the top probable words in the topic.

[Figure 5.8 content: 1. state, 2. time, 3. sequence, 4. states, 5. model, 6. sequences, 7. recurrent, 8. models, 9. markov, 10. transition. Histogram over the years 1988–2004.]

Figure 5.8: A topic related to “recurrent NNs” comprising words obtained from the TOT model. The title names are given by us based on our interpretation. Histograms depict the way topics are distributed over time, and they are fitted with Beta probability density functions. We have shown the top probable words in the topic.

from topics conditioned on the time-stamps. From the results, our model has captured words which are more insightful compared with the TOT model. For example, in NIPS-1988 TOT only finds “networks”, but our model finds “neural networks”. This removes ambiguities which could occur in the former case. In addition, just as in the TOT model, our model also begins with “neural networks” and then moves towards “classification” and “regression” topics in the end.
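The Bayes-rule computation used above, $E(\theta_{z} \mid t) \propto P(t \mid z)P(z)$ with a uniform $P(z)$, can be sketched as follows; the Beta parameter values are illustrative assumptions.

```python
# Sketch of computing the topic distribution conditioned on a time-stamp.
import numpy as np
from scipy.stats import beta

def topics_given_time(t, omega):
    """t: time-stamp in (0, 1); omega: (T, 2) array of per-topic Beta shape parameters."""
    likelihood = np.array([beta.pdf(t, a, b) for a, b in omega])  # P(t | z)
    return likelihood / likelihood.sum()                          # uniform P(z), normalised

omega = np.array([[2.0, 5.0], [5.0, 2.0], [3.0, 3.0]])            # three assumed topics
print(topics_given_time(0.8, omega))
```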

We show one topic from the NIPS collection in Figure 5.9 and compare the result directly with the TOT model, whose result is shown in Figure 5.10. Our model has captured topic localization similar to TOT. However, a major difference lies in the discovered high-probability topical phrases, which appear more insightful and coherent in our results.

We also perform a quantitative analysis. In [260] the authors showed the time-stamp prediction performance of their model in comparison to the LDA model. They had in fact used the alternative TOT model described in the same paper for such prediction. Our model can also be transformed to perform the same prediction task, where each time-stamp value, i.e. $t_{i-1}, t_i, t_{i+1}$, etc., connected with the corresponding latent variables, i.e. $z_{i-1}, z_i, z_{i+1}$, etc., in the graphical model is removed. Then we assume only a single time-stamp variable $t$, which is connected to $\theta$ and $\Omega$ with the arrow heads pointing towards $t$ (i.e. $\theta \rightarrow t$ and $\Omega \rightarrow t$). The time-stamp generation procedure then becomes equivalent to that of TOT. However, in contrast to the TOT model, our model computes the time-stamp probabilities of n-gram words from their corresponding topic-wise Beta distributions over time. We show this modified graphical model in Figure 5.11. Unlike [260] we do not discretize the time-stamp, as both models assume a continuous distribution over time-stamps. For simplicity, we again take the time-stamp probability of an entire n-gram word to be that of its “head noun”. Our goal here is to predict the time-stamp of the document by maximizing the posterior. The posterior is computed by multiplying the time-stamp probabilities of all phrases from their

[Figure 5.9 content: top phrases per year for NIPS-1987, NIPS-1988, NIPS-1995, NIPS-1996, NIPS-2004 and NIPS-2005, including, for example, orientation map, neural networks, hidden units, hidden layer, training set, learning algorithm, analog vlsi, optimal policy, reinforcement learning, kernel cca, empirical risk and covariance operator.]

Figure 5.9: An illustration of how topical words change over time as captured by our model. The figure shows the top ten probable phrases from the posterior inference in NIPS, year-wise. We have only selected some years, with some gaps in between, and show the top ten phrases/unigrams in each of those years.

[Figure 5.10 content: top words per year for NIPS-1987, NIPS-1988, NIPS-1995, NIPS-1996, NIPS-2004 and NIPS-2005, including, for example, cells, network, data, learning, model, algorithm, training, probability, distribution, state, kernel, policy and classification.]

Figure 5.10: An illustration of how topical words change over time as captured by the TOT model. The figure shows the top ten probable words from the posterior inference in NIPS, year-wise. We have only selected some years, with some gaps in between, and show the top ten unigrams in each of those years.


Figure 5.11: A figure showing an alternative view of the NTOT model in standard plate notation. The way the time-stamp variable is attached to the document-topic distribution variable makes it appear different from the NTOT model that we have shown previously.

              L1 Error    E(L1)    Accuracy
Our Model       1.60       1.65      0.25
TOT             1.95       1.99      0.20

Table 5.1: Results of decade prediction in the State-of-the-Union speeches dataset.

corresponding topic-wise distributions defined over time. We thus need to compute $\arg\max_t \prod_{i=1}^{N_s^d} P(t \mid \Omega_{z_i^d})$, where $N_s^d$ is the number of n-gram words in the document formed by our model. In the case of TOT, we adopt the same posterior computation method as in [260]. We have used the State-of-the-Union dataset, and our task is to determine the decade of the new document, as adopted in [260]. We have adopted the same three metrics as in [260]; their details are available therein. Comparison results are shown in Table 5.1. Compared to the TOT model, our model achieves better prediction accuracy.
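As an illustrative sketch, the decade prediction described above can be carried out by scoring each candidate time-stamp with the sum of the log Beta densities of the document's n-gram topic assignments; the candidate grid and Beta parameters below are assumptions for illustration.

```python
# Sketch of time-stamp (decade) prediction by maximising the posterior over candidates.
import numpy as np
from scipy.stats import beta

def predict_timestamp(topic_assignments, omega, candidates):
    """topic_assignments: topic index of each n-gram in the document;
       omega: (T, 2) Beta shape parameters per topic;
       candidates: candidate time-stamps rescaled to (0, 1), e.g. one per decade."""
    best_t, best_score = None, -np.inf
    for t in candidates:
        score = sum(beta.logpdf(t, *omega[z]) for z in topic_assignments)
        if score > best_score:
            best_t, best_score = t, score
    return best_t

omega = np.array([[2.0, 5.0], [6.0, 2.0]])
candidates = np.linspace(0.05, 0.95, 10)
print(predict_timestamp([0, 0, 1, 0], omega, candidates))
```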

In Figures 5.12 and 5.13, we examine the topic co-occurrences over time for the TOT and NTOT models respectively. As stated in [260], two topics tend to co-occur in a document if the topic proportions of both topics are greater than a certain threshold in that document. We can then count the number of documents in which a given pair of topics co-occurs. This can help to map how co-occurrence patterns change over time.
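A minimal sketch of this counting procedure is given below; the threshold value and the data layout are assumptions of the illustration rather than the thesis settings.

```python
# Sketch of counting topic co-occurrences per year under a proportion threshold.
from collections import Counter
from itertools import combinations

def cooccurrence_counts(doc_topic_props, doc_years, threshold=0.1):
    """doc_topic_props: list of per-document topic-proportion lists;
       doc_years: the year (time-stamp) of each document."""
    counts = Counter()
    for props, year in zip(doc_topic_props, doc_years):
        active = [z for z, p in enumerate(props) if p > threshold]
        for z1, z2 in combinations(sorted(active), 2):
            counts[(year, z1, z2)] += 1      # documents in which topics z1 and z2 co-occur
    return counts
```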

From the figures we can see that both models capture the topical trend in almost the same way. However, our model tends to capture more fine-grained topics over time, i.e. they tend to spike for a certain time and then diminish, which is not the case with the TOT model, which tends to keep capturing the topics over time even though the popularity of the topic has declined as time has progressed.

Figure 5.12: A plot showing the co-occurrence of other topics with respect to the “classification” topic in the NIPS dataset for the TOT model. The co-occurring topics, each drawn as a coloured line, are Neural Networks, Neural Networks Structure, Distance, Digits, Mixture Models, SVM, Boosting and NLP.

Figure 5.13: A plot showing the co-occurrence of other topics with respect to the “classification” topic in the NIPS dataset for the NTOT model. The co-occurring topics, each drawn as a coloured line, are Neural Networks, Neural Networks Structure, Distance, Digits, Mixture Models, SVM, Boosting and NLP.

5.4 Closing Remarks

We have presented an n-gram topic model which can capture both the temporal structure and the n-gram words in time-stamped documents. The topics found by our model are more interpretable, with better qualitative and quantitative performance on two publicly available datasets. We have derived a collapsed Gibbs sampler for faster posterior inference. An advantage of our model is that, by considering the temporal dynamics of the data, it does away with ambiguities that might appear among the words in topics.

Chapter Six

Bayesian Nonparametric Topic Models for Text Data

Chapter Summary

In this chapter, we present three nonparametric topic models with word order that can generate insightful n-gram words in topics. In addition, our models can automatically determine an appropriate number of latent topics from the characteristics of the text data. In contrast, most nonparametric topic models, such as HDP viewed as an infinite-dimensional extension of LDA, rely on the bag-of-words assumption. They thus lose the semantic ordering of the words inherent in the text, which could give extra leverage to the computational model, and consequently generate less interpretable latent topics.

6.1 The Case for Bayesian Nonparametric Topic Models With Order

Assuming exchangeability [3] among words (terms) in documents has been the holy grail in many areas of text processing, such as probabilistic topic modeling [23], [233] and many other models which are mainly unigram based topic models. One reason is that such an assumption simplifies the modeling [226] and has an advantage in computational efficiency [144]. However, this assumption has many disadvantages. One of the main disadvantages is that many unigram words discovered in the latent topics are not very insightful to a reader [166]. Another disadvantage is that the model is not able to exploit the extra semantic information that is conveyed by the order of the words in the document [117]. This results in inferior performance in some qualitative and quantitative tasks [136], [8].

Most of the topic models which maintain the order of the terms in the document, such as [261], [136], [166], [117], etc., are parametric models. This means that the parameter space is fixed and some parameters, such as the number of topics, need to be pre-defined by the user. This might be impractical because the user may not always know the true number of latent topics inherent in the data. One way to address this issue is to learn several models with different numbers of topics and choose the one that has the best performance measure [49]. But this is not a principled approach, and it is very time consuming, taking up immense computational resources [56]. Another way to deal with the problem is to automatically infer a desirable number of latent topics based on the characteristics of the text data in the document collection. Such models are known as nonparametric probabilistic topic models, which are characterized by an infinite-dimensional parameter space. Models such as HDP [247], when used as a topic model1, can automatically infer the number of latent topics based on the data characteristics, but HDP assumes exchangeability among words in the documents. It thus inherits some of the limitations of unigram based topic models.

Considering the above limitations of the existing probabilistic topic models, we propose three new n-gram based topic models for text data that can generate insightful n-gram words in topics. Also, our proposed models can automatically detect an appropriate number of latent topics from the characteristics of the text data. Our n-gram nonparametric models assume a first-order Markovian structure on the order of the words in the documents. By introducing a set of binary random variables in the HDP model and by doing some extra book-keeping during sampling, we can capture topical n-gram words. We also present the corresponding posterior inference schemes for the models based on the Chinese Restaurant Franchise methods. Our models can also scale to large document collections. We conduct extensive experiments on both small and large publicly available text collections for text mining tasks including document modeling and document classification.

1 Note that in [247], the authors only introduced the HDP model in general, and not for topic modeling in particular.

6.2 Nonparametric N-gram Collocation Model

We describe our n-gram nonparametric topic model which maintains the order of words, called N-gram Hierarchical Dirichlet Process (NHDP), which is an extension to the basic HDP model described in Chapter 3. Unlike the basic HDP model, our proposed NHDP model is no longer invariant to the reshuffling of words in a document.

We introduce a set of binary random variables $x$, which we term the concatenation indicator variables, that assume either of the two values 0 or 1. This variable indicates whether two words in consecutive order can be concatenated or not. Note that NHDP uses the first-order Markov assumption on the words. There are two assignments per word $w_i^d$ at position $i$ in the document $d$, with $1 \le i \le N^d$ where $N^d$ is the number of words (unigrams) in the document $d$. One assignment is the topic and the other is the concatenation indicator variable $x_i^d$, which relates to whether the word $w_i^d$ can be concatenated with the previous word $w_{i-1}^d$. If $x_i^d = 1$, then $w_i^d$ is part of a concatenation and the word is generated from a distribution that is dependent only on $w_{i-1}^d$; $x_i^d$ is drawn from $P(x_i^d \mid w_{i-1}^d)$. On the other hand, if $x_i^d = 0$, then $w_i^d$ is generated from the distribution associated with its topic. We assume that the first indicator variable $x_1^d$ in a document is observed and set to 1, and only a unigram is allowed at the beginning of the document. In fact, we can also enforce other constraints in the model. Some examples are: no concatenation is allowed across a sentence or paragraph boundary, and only a unigram is allowed after a position from which a stopword has been removed.

Note that NHDP can capture word dependencies in the document. The conditional probability $P(w_i^d \mid w_{i-1}^d)$ can be written as:


Figure 6.1: Our nonparametric n-gram topic model for generating word collocations shown in a standard plate diagram.

\[
P(w_i^d \mid w_{i-1}^d) = P(w_i^d \mid w_{i-1}^d, x_i^d = 1)\, P(x_i^d = 1 \mid w_{i-1}^d) + P(w_i^d \mid w_{i-1}^d, x_i^d = 0)\, P(x_i^d = 0 \mid w_{i-1}^d) \quad (6.1)
\]

We can observe that $P(w_i^d \mid w_{i-1}^d, x_i^d = 0)$ can be computed using the basic HDP model. The full definition of our NHDP model is given as follows:

1. $G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H)$

2. $G_d \mid \alpha, G_0 \sim \mathrm{DP}(\alpha, G_0)$

3. $z_i^d \mid G_d \sim G_d$

4. $x_i^d \mid w_{i-1}^d \sim \mathrm{Bernoulli}(\psi_{w_{i-1}^d})$

5. If $x_i^d = 1$ then $w_i^d \mid w_{i-1}^d \sim \mathrm{Multinomial}(\sigma_{w_{i-1}^d})$, otherwise $w_i^d \mid z_i^d \sim F(z_i^d)$

Note that in the definition of our model the hyperprior of $\sigma$ is $\delta$, and the hyperprior of $\psi$ is $\epsilon$. Just as in the HDP model described earlier, the distribution $F(z_i^d)$ in the above generative process is the Multinomial distribution. We can obtain higher order n-grams by concatenating the currently concatenated words with the next n-gram, based on the value taken by the next concatenation indicator variable. Although our model does not directly generate topic-wise n-grams, an n-gram can be associated with a topic via a simple post-processing strategy. One strategy is to take the topic of the first term in the n-gram as the topic for the whole n-gram; this technique has been used in [175] for the LDACOL model. Another strategy is to take the topic of the n-gram to be the most common topic among the words involved in that n-gram [175].
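The two post-processing strategies just mentioned are simple enough to sketch directly; the data layout below (a list of per-term topic indices for one extracted n-gram) is an assumption of the illustration.

```python
# Sketch of attaching a topic to an extracted n-gram after sampling.
from collections import Counter

def ngram_topic_first_term(term_topics):
    """Strategy 1: use the topic of the first term of the n-gram."""
    return term_topics[0]

def ngram_topic_majority(term_topics):
    """Strategy 2: use the most common topic among the n-gram's terms."""
    return Counter(term_topics).most_common(1)[0][0]

print(ngram_topic_first_term([3, 3, 7]))  # -> 3
print(ngram_topic_majority([3, 7, 7]))    # -> 7
```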

6.3 Posterior Inference

Our inference scheme is based on the Chinese Restaurant Franchise scheme [247] with some modifications. In our scheme, we have to handle two different conditions. The first condition is concerned with $x_i^d = 0$, whereas the second condition is concerned with $x_i^d = 1$. Note that for an observed $x_i^d$, only $z_i^d$ needs to be drawn.

In the document modeling setting, each document is referred to as a restaurant and words in the document are referred to as customers. The set of documents shares a global menu of topics. The words in the document are divided into groups, each of which shares a table. Each table is associated with a topic, and words around each table are associated with the table's topic.

6.3.1 The First Condition:

The first condition refers to $x_i^d = 0$. In this setting, most of the modeling resembles the HDP model as presented in [247], but in our case we need to derive updates of the HDP model for text data.

We will sample $t_{di}$, which is the table index for each word $w_i^d$ at position $i$ in the document $d$. We will then sample $k_{dt}$, which is the topic index variable for each table $t$ in $d$. $k_{d\hat{t}}$ is the new topic index variable created for a new table. Note that we will only sample the index variables here rather than the distributions themselves [56]. We define $\mathbf{w}$ as $(w_i^d : \forall d, i)$ and $\mathbf{w}_{dt}$ as $(w_i^d : \forall i \text{ with } t_{di} = t)$, $\mathbf{t}$ as $(t_{di} : \forall d, i)$ and $\mathbf{k}$ as $(k_{dt} : \forall d, t)$. In addition, we also define $\mathbf{x}$ as $(x_i^d : \forall d, i)$. When a superscript is attached to a set of variables or a count, for example $(\mathbf{k}^{\neg dt}, \mathbf{t}^{\neg di})$, it means that the variables corresponding to the superscripted index are removed from the set or from the calculation of the count. Each word whose $x_i^d = 0$ is assumed to be drawn from $F(z)$, whose density is written as $f(\cdot \mid \phi)$ ($f$ is just one component obtained from $F$). This density is the multinomial distribution with parameter $\phi$. The likelihood of $w_i^d$ for $t_{di} = t$, where $t$ is an existing table, denoted $f_k^{\neg w_i^d}(w_i^d)$, is the conditional density of $w_i^d$ given all words in topic $k$ except $w_i^d$:

\[
f_k^{\neg w_i^d}(w_i^d) = \frac{\int f(w_i^d \mid \phi_k) \prod_{d'i' \neq di,\, z_{d'i'} = k} f(w_{i'}^{d'} \mid \phi_k)\, h(\phi_k)\, d\phi_k}{\int \prod_{d'i' \neq di,\, z_{d'i'} = k} f(w_{i'}^{d'} \mid \phi_k)\, h(\phi_k)\, d\phi_k} \quad (6.2)
\]

where $h$ is the probability density function of $H$, and $H$ is a Dirichlet distribution over a fixed vocabulary of size $W$; $h(\cdot)$ is the Dirichlet density with parameter $\eta$.

$\phi_k$ is one of the global topics with which each table is associated, as indicated by a table-specific topic index $k_{dt}$. Furthermore, Equation 6.2 can be simplified as:

\[
f_k^{\neg w_i^d}(w_i^d = \vartheta) = \frac{n_{..k}^{\neg w_i^d, \vartheta} + \eta}{n_{..k}^{\neg w_i^d} + W\eta} \quad (6.3)
\]

where $n_{..k}^{\neg w_i^d}$ is the number of words in the corpus belonging to topic $k$ whose $x_i^d = 0$, excluding $w_i^d$, and $n_{..k}^{\neg w_i^d, \vartheta}$ is the number of times the word $\vartheta$ is assigned to topic $k$ with $x_i^d = 0$, excluding $w_i^d$. Furthermore, $W$ is the number of words in the vocabulary, which is typically fixed and known. The likelihood of $w_i^d$ for $t_{di} = \hat{t}$, where $\hat{t}$ is the new table being sampled, is written as:

\[
P(w_i^d \mid t_{di} = \hat{t}, \mathbf{t}^{\neg di}, \mathbf{k}) = \sum_{k=1}^{L} \frac{m_{.k}}{m_{..} + \gamma}\, f_k^{\neg w_i^d}(w_i^d) + \frac{\gamma}{m_{..} + \gamma}\, f_{\hat{k}}^{\neg w_i^d}(w_i^d) \quad (6.4)
\]

where $\hat{k}$ is the new topic being sampled, $m_{.k}$ is the number of tables belonging to topic $k$ in the corpus, and $m_{..}$ is the total number of tables in the corpus. $f_{\hat{k}}^{\neg w_i^d}(w_i^d) = \int f(w_i^d \mid \phi)\, h(\phi)\, d\phi$ is the prior density of $w_i^d$, and $\gamma$ is the concentration parameter as described. Since we follow the standard Chinese Restaurant Franchise sampling procedure, the conditional density of $t_{di}$ for Gibbs sampling and the conditional densities for $k_{d\hat{t}}$ and $k_{dt}$ can be found in [56].

6.3.2 The Second Condition:

The second condition refers to $x_i^d = 1$. We only need to sample the probability of a topic in a document, as the current word $w_i^d$ is generated by the previous word $w_{i-1}^d$. In order to do this, we proceed as follows:

\[
P(k_{dt} = k \mid \mathbf{t}, \mathbf{k}^{\neg dt}) \propto
\begin{cases}
m_{.k}^{\neg dt}\, f_k^{\neg w_{dt}}(\mathbf{w}_{dt}) & \text{if } k \text{ is already used} \\
\gamma\, f_{\hat{k}}^{\neg w_{dt}}(\mathbf{w}_{dt}) & \text{if } k = \hat{k}
\end{cases} \quad (6.5)
\]



where $f_k^{\neg w_{dt}}(\mathbf{w}_{dt})$, the conditional density of $\mathbf{w}_{dt}$ given all words associated with topic $k$ leaving out $\mathbf{w}_{dt}$, is defined as:

\[
f_k^{\neg w_{dt}}(\mathbf{w}_{dt}) = \frac{\Gamma(n_{..k}^{\neg w_{dt}} + W\eta)}{\Gamma(n_{..k}^{\neg w_{dt}} + n^{w_{dt}} + W\eta)} \times \frac{\prod_{\vartheta} \Gamma(n_{..k}^{\neg w_{dt}, \vartheta} + n^{w_{dt}, \vartheta} + \eta)}{\prod_{\vartheta} \Gamma(n_{..k}^{\neg w_{dt}, \vartheta} + \eta)} \quad (6.6)
\]

where $n^{w_{dt}}$ is the total number of words at table $t$ whose $x_i^d = 0$, $n^{w_{dt}, \vartheta}$ is the number of times the word $\vartheta$ appears at table $t$ with the assignment $x_i^d = 0$, and $n_{..k}^{\neg w_{dt}}$ is the number of words belonging to topic $k$ in the corpus except $\mathbf{w}_{dt}$.

6.3.3 Sampling the Concatenation Indicator Variables:

We present how to sample the values of the indicator variables. The idea is to compute the probabilities of how often two words occur consecutively in sequence. Then, based on the probability value, the indicator variable is set to either 0 or 1. Let $n_0^{w_{i-1}^d}$ and $n_1^{w_{i-1}^d}$ be the number of times word $w_{i-1}^d$ has been drawn from a topic or has formed part of a concatenation, respectively, where all counts exclude the current case. $\epsilon_0$ and $\epsilon_1$ are the priors of the binomial distribution. $n_{w_i^d}^{w_{i-1}^d}$ is the number of times the word $w_i^d$ comes after the word $w_{i-1}^d$. $n_{..k}^{\neg w_i^d, \vartheta}$ and $n_{..k}^{\neg w_i^d}$ have been defined in Equation 6.3.

\[
P(x_i^d = 0 \mid \mathbf{x}^{\neg di}, \mathbf{w}, \mathbf{k}) \propto \frac{n_0^{w_{i-1}^d} + \epsilon_0}{\sum_{c=0}^{1} n_c^{w_{i-1}^d} + \epsilon_0 + \epsilon_1} \times \frac{n_{..k}^{\neg w_i^d, \vartheta} + \eta}{n_{..k}^{\neg w_i^d} + W\eta} \quad (6.7)
\]

\[
P(x_i^d = 1 \mid \mathbf{x}^{\neg di}, \mathbf{w}, \mathbf{k}) \propto \frac{n_1^{w_{i-1}^d} + \epsilon_1}{\sum_{c=0}^{1} n_c^{w_{i-1}^d} + \epsilon_0 + \epsilon_1} \times \frac{n_{w_i^d}^{w_{i-1}^d} + \delta_{w_i^d}}{\sum_{v=1}^{W} n_v^{w_{i-1}^d} + W\delta} \quad (6.8)
\]

where $\delta$ is the same as described in Section 6.2.
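A small sketch of how one such indicator variable can be sampled from the two unnormalised probabilities in Equations 6.7 and 6.8 is given below; the way the counts are packed into tuples is an assumption of the illustration, not the thesis bookkeeping.

```python
# Sketch of sampling a concatenation indicator from Equations 6.7 and 6.8.
import random

def unnormalised_probs(n_prev, n_follow, n_topic, eps0, eps1, delta, eta, W):
    """n_prev: [count of x=0, count of x=1] for the previous word;
       n_follow: (count of current word after previous word, total successor count);
       n_topic: (count of current word under topic k, total words under topic k)."""
    denom_switch = sum(n_prev) + eps0 + eps1
    p0 = (n_prev[0] + eps0) / denom_switch * (n_topic[0] + eta) / (n_topic[1] + W * eta)
    p1 = (n_prev[1] + eps1) / denom_switch * (n_follow[0] + delta) / (n_follow[1] + W * delta)
    return p0, p1

def sample_indicator(p0, p1, rng=random):
    return 1 if rng.random() < p1 / (p0 + p1) else 0
```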

6.4 Nonparametric N-gram Topic Models (NNTM)

In this section, we present a detailed description of our two proposed nonparametric topic models called Nonparametric N-gram Topic Model (NNTM). In our model design, we try to address the following key questions:

1. How can we capture n-gram words without breaking the exchangeability assumption, so that we can take advantage of the sampling schemes of the unigram based HDP model?

2. Although it has been consistently stated in the literature that n-gram based models are computationally expensive and not easy to apply to large scale datasets, how can we make our scheme applicable to large datasets in a nonparametric setting?

HPYP priors are being widely used to capture longer order n-grams such as [134], [166], but these models are impractical for large datasets [9]. Thus new techniques have to be investigated for large scale text data.

In our two proposed nonparametric n-gram topic models with word order, we maintain most of the properties of the unigram based nonparametric topic model, and introduce some extra book-keeping for capturing n-gram words. As we shall show later in our empirical analysis, such extra book-keeping effort does not significantly impact the complexity of our proposed models. We will present our new Chinese Restaurant Franchise scheme with Buddy (friend) Customers, where two friends always take the same table in the restaurant. We first describe our first proposed nonparametric n-gram topic model, which we name NNTM-1, and present some of its advantages and disadvantages. We then extend it and propose NNTM-2.


Figure 6.2: Our NNTM-1 model in the CRF scheme with Buddy Customers.

6.4.1 Model Description of NNTM-1

Our next proposed model, the Nonparametric N-gram Topic Model NNTM-1, is a nonparametric topic model for n-gram generation in topics. We show the graphical model of our proposed model in Figure 6.2. The key idea of our model is to keep track of the word order in the document. In order to capture n-grams in topics, we introduce a binary random variable $x$ in the model. We name this random variable the concatenation indicator variable; it assumes either of the two values 0 or 1. This variable indicates whether two words in consecutive order can be concatenated or not to form a bigram. The model uses the first-order Markov assumption on the words. There are two assignments per word $w_i^d$ at position $i$ in the document $d$, with $1 \le i \le N_d$ where $N_d$ is the number of words (unigrams) in the document $d$. One assignment is the topic and the other is the concatenation indicator variable, which relates to whether the word $w_i^d$ can be concatenated with the previous word $w_{i-1}^d$. If $x_i^d = 1$, then $w_i^d$ is part of a concatenation, and the word is generated from a distribution that is dependent on $w_{i-1}^d$. On the other hand, if $x_i^d = 0$, then $w_i^d$ is generated from the distribution associated with its topic, just as in the HDP model. $x_i^d$ is drawn from $P(x_i^d \mid w_{i-1}^d, z_{i-1}^d)$. We assume that the first indicator variable $x_1^d$ in a document is observed and set to 1 (this is analogous to the first customer entering the restaurant and choosing the first table), and only a unigram is allowed at the beginning of the document. In fact, we can also enforce other constraints in the model. One constraint is that no concatenation is allowed across a sentence or paragraph boundary. Another constraint is that only a unigram is allowed after a position from which a stopword has been removed.

The key idea in our model is to store the word order information separately as a three-tuple comprising $((w_i^d, w_{i-1}^d), d, \{0,1\})$, which can be represented as a sparse matrix from which we can know whether the word $w_i^d$ is preceded by the word $w_{i-1}^d$ in the document $d$. We set the value to 1 when $w_{i-1}^d$ immediately precedes $w_i^d$; otherwise it is 0 and is removed from the sparse representation to conserve storage space. In this way, we do not need a strict constraint that customers enter the restaurant in the order of occurrence of the words in the document. This position matrix, following the first-order Markovian assumption, only stores the bigram information. So, during the buddy assignment, we only need to know which word preceded the other.
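A minimal sketch of such a sparse position structure is given below; the thesis does not specify a concrete data structure, so the set-of-keys representation here is an assumption of the illustration.

```python
# Sketch of the sparse bigram "position matrix" described above.
def build_position_matrix(corpus):
    """corpus: dict mapping a document id to its list of word ids, in order."""
    positions = set()
    for d, words in corpus.items():
        for prev_w, w in zip(words, words[1:]):
            positions.add(((w, prev_w), d))   # value 1 is implicit; absence means 0
    return positions

def is_preceded_by(positions, w, prev_w, d):
    return ((w, prev_w), d) in positions

corpus = {0: [5, 9, 9, 2], 1: [2, 5, 9]}
positions = build_position_matrix(corpus)
print(is_preceded_by(positions, 9, 5, 0))     # True: word 5 precedes word 9 in document 0
```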

Generative Process

We show the generative process of our first nonparametric n-gram model, NNTM-1, in the CRF scheme below:

1. Draw $\phi$ from $H(\beta)$

2. Draw $\mu$ from GEM($\eta$)

3. Draw Discrete($\sigma$) from Dirichlet($\delta$)

4. Draw Bernoulli($\psi$) from Beta($\gamma$)

5. For each document $d$:

   (a) Draw $\theta_d$ from $\alpha$

   (b) Draw $k_t^d$ from $\mu$

   (c) For each word $w_i^d$ at position $i$ in the document $d$:

       i. Draw $t_i^d$ from $\theta_d$ if $x_i^d = 0$; otherwise $t_i^d = t_{i-1}^d$

       ii. Draw $x_i^d$ from Bernoulli($\psi_{t_{i-1}^d w_{i-1}^d}$)

       iii. Draw $w_i^d$ from $\phi_{k_{t_i^d}^d}$ if $x_i^d = 0$; else draw $w_i^d$ from $\sigma_{w_{i-1}^d}$

Posterior Inference

To find the latent variables that best explain the observed data, we use Gibbs sampling, a widely used Markov Chain Monte Carlo inference technique [190]. However, inference for the nonparametric collocation model is more complicated than it is for parametric models. Teh et al. [247] described three distinct approaches to Gibbs sampling for the HDP, including one based on the CRF scheme. We adopt this scheme in our model with some modifications. In our scheme, we have to handle two different conditions. The first condition is concerned with $x_i^d = 0$, whereas the second condition is concerned with $x_i^d = 1$.

In the document modeling setting, each document is referred to as a restaurant and words in the document are referred to as customers. The set of documents shares


Figure 6.3: An illustration of the CRF scheme incorporating Buddy Customers. The figure depicts the state when all the customers are seated, either alone or with their buddies, at the tables. This illustration is an extension of the CRF scheme presented in [239] for the HDP model. As one can see, the variable responsible for buddy allocation is $x$.

a global menu of topics. The words in the document are divided into groups, each of which shares a table. Each table is associated with a topic, and words around each table are associated with the table's topic. CRP($\alpha$) is the distribution based on the Chinese Restaurant Process prior with parameter $\alpha$. The two-step generative process which will be used for Gibbs sampling for the NNTM-1 model can be written as:

1. {Partitioning} For each document $d \in D$:

   (a) For each word at position $i \in d$:

       i. Draw the table index $t_i^d \mid t_1^d, t_2^d, \cdots, t_{i-1}^d, \alpha \sim \mathrm{CRP}(\alpha)$ when $x_i^d = 0$; otherwise set the table index $t_i^d = t_{i-1}^d$

   (b) For each table $t \in d$:

       i. Draw the topic $k_t^d \mid k_1^1, k_2^1, \cdots, k_1^2, \cdots, k_{t-1}^d, \gamma \sim \mathrm{CRP}(\gamma)$

2. {Generating} For each document $d \in D$:

   (a) For each word at position $i \in d$:

       i. Draw the bigram indicator variable $x_i^d \sim \psi_{t_{i-1}^d w_{i-1}^d}$

       ii. Draw a word $w_i^d \sim \phi_{k_{t_i^d}^d}$ when $x_i^d = 0$; else draw a word $w_i^d \sim \sigma_{w_{i-1}^d}$

One can note from the algorithm above that our model is a hybrid between the bigram language model and the HDP in the CRF scheme. For instance, when all $x$'s in a document are 0, there is no ordering of the terms and the process simply follows the HDP sampling procedure. However, when all $x$'s in a document are 1, words are generated only by the previous word. The generation of n-grams in our model is controlled by the set of $x$ variables. Therefore, if we provide accurate information about n-gram formation to this binary random variable, it can effectively generate n-grams in addition to topical words. If we keep such word order information in an external matrix which can be consulted during the Gibbs sampling procedure, it helps cluster n-grams at the tables along with other words, while the property of exchangeability remains. Hence n-grams with the same topic will always sit together, and thus we term this scheme the Buddy Chinese Restaurant Franchise.

The metaphor of the Chinese Restaurant Franchise with Buddy Customers (CRF-BC) can be defined as follows. Consider a restaurant with an infinite set of tables. These tables are analogous to clusters. There is a restaurant franchise with a menu that is shared across the restaurants. In each restaurant, at each table, one dish is ordered from the menu by the first customer who takes that table. That dish is then shared among all other customers who sit at the table. In this scheme, customers are analogous to observations. Multiple tables in different restaurants can serve the same dish. When the next customer arrives at the restaurant, the customer tries to find his/her buddy, who might already be sitting there. If the customer finds his/her buddy, the customer sits at the same table as the buddy; otherwise the customer either sits at one of the other tables occupied by other customers, or chooses a new table altogether. A customer who later finds his/her buddy vacates the table where he/she was sitting and chooses another table together with the buddy. Instances may also arise when, upon the arrival of the buddy, the buddy customers decide to take a new table together. In the end, all buddies sit together at the same table. We show this metaphor diagrammatically in Figure 6.3.
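The buddy seating rule can be sketched as follows. For brevity this sketch weights existing tables by their plain customer counts, omitting the word-likelihood factor that appears in the full conditional (Equation 6.12), so it is an illustration of the seating metaphor rather than the complete sampler.

```python
# Sketch of the buddy-aware table choice in the CRF-BC scheme.
import random

def seat_customer(table_counts, prev_table, x, alpha, rng=random):
    """table_counts: number of customers at each existing table;
       prev_table: table of the previous word; x: concatenation indicator."""
    if x == 1:
        return prev_table                    # buddies always share a table
    weights = table_counts + [alpha]         # existing tables vs. opening a new table
    r = rng.random() * sum(weights)
    for t, w in enumerate(weights):
        r -= w
        if r <= 0:
            return t                         # index == len(table_counts) means "new table"
    return len(table_counts)
```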

Another way to effectively capture word order in a nonparametric setting would be to use a different metaphor known as the Distance Dependent Chinese Restaurant Franchise (ddCRF) [135], which is an extension of the Distance Dependent Chinese Restaurant Process (ddCRP) [17]. In this scheme, instead of customers sitting at tables with other customers, customers are assigned to other customers or not assigned to anyone. The probability of a new customer being assigned to a customer already sitting is proportional to a decreasing function of the distance between that customer and the new customer. The connected customers implicitly exhibit a clustering property. Our metaphor can effectively handle unigram and bigram words and assign them to appropriate tables, whereas in the ddCRF scheme n-grams will tend to sit at one table and unigrams belonging to the same cluster will tend to sit at other tables. This is primarily because connected components (two words in sequence where $x = 1$) sit together, as this co-occurrence determines the seating arrangement. Nevertheless, one can use the second (customer) level CRP as a ddCRP in our model to generate topical n-grams. Inquisitive readers are requested to consult [135] for more details. In fact, in our CRF-BC scheme, we also make modifications to the second level CRP in order to handle buddy assignments.

The First Condition

The first condition refers to $x_i^d = 0$. We will sample $t_i^d$, which is the table index for each word $w_i^d$ at position $i$ in the document $d$. We will then sample $k_t^d$, which is the topic index variable for each table $t$ in $d$. Note that we will only sample the index variables here rather than the distributions themselves [56]. We define $\mathbf{w}$ as $(w_i^d : \forall d, i)$ and $\mathbf{w}_t^d$ as $(w_i^d : \forall i \text{ with } t_i^d = t)$, $\mathbf{t}$ as $(t_i^d : \forall d, i)$ and $\mathbf{k}$ as $(k_t^d : \forall d, t)$. When the sign $\neg$ in the superscript is attached to a set of variables or a count, for example $(\mathbf{k}^{\neg dt}, \mathbf{t}^{\neg di})$, it means that the variables corresponding to the superscripted index are removed from the set or from the calculation of the count. The likelihood of $w_i^d$ for $t_i^d = t$, where $t$ is an existing table, denoted $f_k^{\neg w_i^d}(w_i^d)$, is the conditional density of $w_i^d$ given all words in the topic $k$ except $w_i^d$:

\[
f_k^{\neg w_i^d}(w_i^d) = \frac{\int f(w_i^d \mid \phi_k) \prod_{d'i' \neq di,\, z_{d'i'} = k} f(w_{i'}^{d'} \mid \phi_k)\, h(\phi_k)\, d\phi_k}{\int \prod_{d'i' \neq di,\, z_{d'i'} = k} f(w_{i'}^{d'} \mid \phi_k)\, h(\phi_k)\, d\phi_k} \quad (6.9)
\]

where each word whose $x_i^d = 0$ is assumed to be drawn from $F(\theta^d)$, whose density is written as $f(\cdot \mid \phi)$. This density is the multinomial distribution with parameter $\phi$, and $h(\cdot)$ is the Dirichlet distribution with parameter $\eta$. Furthermore, Equation 6.9 can be simplified as:

\[
f_k^{\neg w_i^d}(w_i^d = \vartheta) = \frac{n_{..k}^{\neg w_i^d, \vartheta} + \eta}{n_{..k}^{\neg w_i^d} + W\eta} \quad (6.10)
\]

where $n_{..k}^{\neg w_i^d}$ is the number of words in the corpus belonging to topic $k$ whose $x_i^d = 0$, excluding $w_i^d$, and $n_{..k}^{\neg w_i^d, \vartheta}$ is the number of times the word $\vartheta$ is assigned to topic $k$ with $x_i^d = 0$, excluding $w_i^d$. Furthermore, $W$ is the number of words in the vocabulary, which is typically fixed and known. The likelihood of $w_i^d$ for $t_i^d = \hat{t}$, where $\hat{t}$ is the new table being sampled, is written as:

\[
P(w_i^d \mid t_i^d = \hat{t}, \mathbf{t}^{\neg di}, \mathbf{k}) = \sum_{k=1}^{L} \frac{m_{.k}}{m_{..} + \gamma}\, f_k^{\neg w_i^d}(w_i^d) + \frac{\gamma}{m_{..} + \gamma}\, f_{\hat{k}}^{\neg w_i^d}(w_i^d) \quad (6.11)
\]

where $\hat{k}$ is the new topic being sampled, $m_{.k}$ is the number of tables belonging to topic $k$ in the corpus, and $m_{..}$ is the total number of tables in the corpus. $f_{\hat{k}}^{\neg w_i^d}(w_i^d) = \int f(w_i^d \mid \phi)\, h(\phi)\, d\phi$ is the prior density of $w_i^d$. The conditional density for $t_i^d$ is:

\[
P(t_i^d = t \mid \mathbf{t}^{\neg di}, \mathbf{k}) \propto
\begin{cases}
n_{dt.}^{\neg w_i^d}\, f_{k_{dt}}^{\neg w_i^d}(w_i^d) & \text{if } t \text{ is already used} \\
\alpha\, P(w_i^d \mid t_i^d = \hat{t}, \mathbf{t}^{\neg di}, \mathbf{k}) & \text{if } t = \hat{t}
\end{cases} \quad (6.12)
\]

where $n_{dt.}^{\neg w_i^d}$ is the number of words in document $d$ at table $t$ whose $x_i^d = 0$, excluding the current word.

If $t_i^d = \hat{t}$, then $k_{d\hat{t}}$ has to be sampled:

\[
P(k_{d\hat{t}} = k \mid \mathbf{t}, \mathbf{k}^{\neg d\hat{t}}) \propto
\begin{cases}
m_{.k}\, f_k^{\neg w_i^d}(w_i^d) & \text{if } k \text{ is already used} \\
\gamma\, f_{\hat{k}}^{\neg w_i^d}(w_i^d) & \text{if } k = \hat{k}
\end{cases} \quad (6.13)
\]



As described in [258], to sample $k_{dt}$ for each table in each document, we compute the conditional density of $\mathbf{w}_{dt}$ ($\mathbf{w}_{dt}$ is defined as all the words at table $t$ in document $d$) given all words assigned to topic $k$ excluding $\mathbf{w}_{dt}$:

\[
P(k_{dt} = k \mid \mathbf{t}, \mathbf{k}^{\neg dt}) \propto
\begin{cases}
m_{.k}^{\neg dt}\, f_k^{\neg w_{dt}}(\mathbf{w}_{dt}) & \text{if } k \text{ is already used} \\
\gamma\, f_{\hat{k}}^{\neg w_{dt}}(\mathbf{w}_{dt}) & \text{if } k = \hat{k}
\end{cases} \quad (6.14)
\]

where

\[
f_k^{\neg w_{dt}}(\mathbf{w}_{dt}) = \frac{\Gamma(n_{..k}^{\neg w_{dt}} + W\eta)}{\Gamma(n_{..k}^{\neg w_{dt}} + n^{w_{dt}} + W\eta)} \times \frac{\prod_{\vartheta} \Gamma(n_{..k}^{\neg w_{dt}, \vartheta} + n^{w_{dt}, \vartheta} + \eta)}{\prod_{\vartheta} \Gamma(n_{..k}^{\neg w_{dt}, \vartheta} + \eta)} \quad (6.15)
\]

where $n^{w_{dt}}$ is the total number of words at table $t$ whose $x_i^d = 0$, and $n^{w_{dt}, \vartheta}$ is the number of times the word $\vartheta$ appears at table $t$ with the assignment $x_i^d = 0$.
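Because Equation 6.15 (and the analogous Equation 6.6) is a ratio of Gamma functions, it is usually evaluated in log space; the sketch below is an illustration of that computation with assumed count-vector inputs.

```python
# Sketch of the table-level conditional density (Equation 6.15) in log space.
import numpy as np
from scipy.special import gammaln

def log_f_table(word_counts_at_table, topic_word_counts_excl, eta):
    """word_counts_at_table: (W,) counts of each vocabulary word at the table (x = 0 words);
       topic_word_counts_excl: (W,) counts of each word under topic k, excluding this table."""
    n_table = word_counts_at_table.sum()
    n_topic = topic_word_counts_excl.sum()
    W = len(word_counts_at_table)
    log_f = gammaln(n_topic + W * eta) - gammaln(n_topic + n_table + W * eta)
    log_f += np.sum(gammaln(topic_word_counts_excl + word_counts_at_table + eta)
                    - gammaln(topic_word_counts_excl + eta))
    return log_f
```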

The Second Condition

In this condition, the second term of a bigram shares the same topic as the first term. It means that the second term simply enters the restaurant and sits at the table where the buddy customer is already sitting. This can be expressed as:

\[
P(t_i^d = t \mid \mathbf{t}^{\neg di}, \mathbf{k}) \propto \mathbb{1}(t = t_{i-1}^d) \quad (6.16)
\]

Sampling the Concatenation Indicator Variables

We present how to sample the values of the concatenation indicator variables. These variables determine the buddy assignment. The idea is to compute the probabilities of how often two words occur consecutively in sequence. Then, based on the probability value, the indicator variable is set to either 0 or 1. Let $p_{t_{i-1}^d w_{i-1}^d x}$ be the number of times the concatenation indicator variable $x_i^d$ has been set to 0 or 1 given the previous word and the topic of the previous word. Note that a table is associated with a topic. $n_{w_i^d}^{w_{i-1}^d}$ is the number of times the word $w_i^d$ comes after the word $w_{i-1}^d$ in the entire corpus.

\[
P(x_i^d = 0 \mid \mathbf{x}^{\neg di}, \mathbf{w}, \mathbf{t}) = \frac{p_{t_{i-1}^d w_{i-1}^d 0} + \gamma_0}{\sum_{c=0}^{1} p_{t_{i-1}^d w_{i-1}^d c} + \gamma_0 + \gamma_1} \times f_k^{\neg w_{dt}}(\mathbf{w}_{dt}) \quad (6.17)
\]

\[
P(x_i^d = 1 \mid \mathbf{x}^{\neg di}, \mathbf{w}, \mathbf{t}) = \frac{p_{t_{i-1}^d w_{i-1}^d 1} + \gamma_1}{\sum_{c=0}^{1} p_{t_{i-1}^d w_{i-1}^d c} + \gamma_0 + \gamma_1} \times \frac{n_{w_i^d}^{w_{i-1}^d} + \delta_{w_i^d}}{\sum_{v=1}^{W} n_v^{w_{i-1}^d} + W\delta}, \quad \text{with } t_i^d = t_{i-1}^d \quad (6.18)
\]

6.4.2 Model Description of NNTM-2

The NNTM-1 model in Section 6.4.1 has the ability to generate collocations where the words share the same topic. However, the model has a shortcoming: it does not consider contextual information when forming collocations, i.e. it considers neither word nor topic context in generating collocations. This is exemplified by the following example. The NNTM-1 model can generate collocations related to “green house”, but it does not guarantee that this bigram will belong to the “global warming” topic with a high probability. This limitation is handled in our second model, NNTM-2, which considers the context information when generating a bigram. So, based on the context, if there is a possibility of bigram formation then it will generate a bigram; otherwise it will not, even for the same two words in different contexts. Thus we expect this model to be a more powerful generalization of our first model. However, the model has one drawback in that the number of parameters is larger than in our first proposed model. But we can notice that even such a slight variation in the topic model can bring an important benefit, which helps improve the qualitative results tremendously. Our proposed NNTM-2 model can do exactly what the TNG model proposed in [261], [175] can accomplish. One of the main differences is that the TNG model requires the user to pre-define the


number of latent topics, whereas our model does not. The TNG model does not give the same topic assignment to the two words in a bigram, and rather adopts an assumption where the topic of the head noun is chosen as the topic of the entire phrase.

Figure 6.4: Our proposed NNTM-2 model in the Chinese Restaurant Franchise representation.

The Generative Process

In this section, we present the generative process of the NNTM-2 model. The basic difference between our first proposed model NNTM-1 and this model is that the generative process also considers the topic of the word.

1. Draw φ from H(β)

2. Draw µ from GEM(η)

3. Draw the discrete distribution $\sigma$ from Dirichlet($\delta$)

4. Draw the Bernoulli distribution $\psi$ from Beta($\gamma$)

5. For each document $d$

(a) Draw $\theta^d$ from $\alpha$

(b) Draw $k_t^d$ from $\mu$ for each table $t$

(c) For each word $w_i^d$ at position $i$ in the document $d$

i. Draw $t_i^d$ from $\theta^d$ if $x_i^d = 0$, otherwise set $t_i^d = t_{i-1}^d$

ii. Draw $x_i^d$ from Bernoulli($\psi_{t_{i-1}^d w_{i-1}^d}$)

iii. Draw $w_i^d$ from $\phi_{k^d_{t_i^d}}$ if $x_i^d = 0$, else draw $w_i^d$ from $\sigma_{t_i^d w_{i-1}^d}$
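The per-word portion of this generative process can be sketched as follows. This is a simplified illustration that only handles existing tables (it omits the creation of new tables and new topics in the full CRP/CRF bookkeeping), and the variable names are assumptions rather than the thesis notation.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_word(psi_tw, theta_d, table_topics, phi, sigma, prev_table, prev_word):
    """One per-word step of NNTM-2 (simplified sketch, existing tables only).
    psi_tw       -- probability of concatenating with the previous word at its table
    theta_d      -- distribution over the document's existing tables
    table_topics -- topic index k_t of each table t
    phi          -- topic-word distributions, shape (num_topics, W)
    sigma        -- bigram distributions, shape (num_tables, W, W)"""
    x = rng.binomial(1, psi_tw)
    if x == 0:
        table = rng.choice(len(theta_d), p=theta_d)          # new customer picks a table
        word = rng.choice(phi.shape[1], p=phi[table_topics[table]])
    else:
        table = prev_table                                    # buddy customer joins the previous table
        word = rng.choice(sigma.shape[2], p=sigma[table, prev_word])
    return x, table, word
```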

Posterior Inference

The posterior inference scheme of this model is similar to that of our previous model, NNTM-1, but with some variation. The difference is that in this model a word is generated not only by the previous word but also by the topic of the current word, and the two words share the same topic. This results in some change to the inference scheme when the status variable $x_i^d$ between two words in sequence is 1. Therefore, the generation process can be described as:

1. {Partitioning} For each document d ∈ D

(a) For each word at position $i \in d$ with $x_i^d = 0$

i. Draw the table index $t_i^d \mid t_1^d, t_2^d, \cdots, t_{i-1}^d, \alpha \sim \mathrm{CRP}(\alpha)$; otherwise set the table index $t_i^d = t_{i-1}^d$

(b) For each table t ∈ d

i. Draw the topic $k_t^d \mid k_1^1, k_2^1, \cdots, k_{t-1}^d, \gamma \sim \mathrm{CRP}(\gamma)$

2. {Generating} For each document d ∈ D

(a) For each word at position i ∈ d

i. Draw the bigram indicator variable $x_i^d \sim \psi_{t_{i-1}^d w_{i-1}^d}$

ii. Draw a word $w_i^d \sim \phi_{k^d_{t_i^d}}$ when $x_i^d = 0$, else draw a word $w_i^d \sim \sigma_{t_i^d w_{i-1}^d}$

The First Condition  The first condition refers to $x_i^d = 0$. This sampling condition is the same as the one used in the NNTM-1 model.

The Second Condition  We have stated earlier that when words in sequence form a collocation, they share the same topic. With regard to the CRF scheme, this can be viewed as two customers who are buddies arriving at the restaurant together and choosing the same table.

In order to facilitate the sharing of the same topic for the words (customers), we know that one word (customer) is already assigned to a table in the restaurant. When a new word (customer) comes in, its assignment will be based on the previously assigned word. If the previous word and the current word form a collocation, then either the current word is assigned to the same table as the previous word, or both of them choose another table, or they open a new table altogether. This is where the dependence on the topic plays a role. In contrast, in our NNTM-1 model, the customer who finds the buddy simply sits at the table where the buddy is sitting, as the topic plays no role in the generation of words. Let us denote these two words in sequence as $B_{i-1,i}^d$, which represents a bigram. These two words together can be considered as a single entity. This scheme can be expressed as given below.

The likelihood of $B_{i-1,i}^d$ for $t_{i-1,i}^d = t$, where $t$ is an existing table, denoted as $f_k^{\neg B_{i-1,i}^d}(B_{i-1,i}^d)$, is the conditional density of $B_{i-1,i}^d$ given all words in the topic $k$ except the bigram $B_{i-1,i}^d$:

f_k^{\neg B_{i-1,i}^d}(B_{i-1,i}^d) = \frac{\int f(B_{i-1,i}^d \mid \phi_k) \prod_{d'i' \neq di,\, z_{d'i'} = k} f(B_{i'-1,i'}^{d'} \mid \phi_k)\, h(\phi_k)\, d\phi_k}{\int \prod_{d'i' \neq di,\, z_{d'i'} = k} f(B_{i'-1,i'}^{d'} \mid \phi_k)\, h(\phi_k)\, d\phi_k} \qquad (6.19)

Just as in the first case, the equation can be simplified as follows:

f_k^{\neg B_{i-1,i}^d}(B_{i-1,i}^d = \vartheta) = \frac{n_{..k}^{\neg B_{i-1,i}^d,\vartheta} + \eta}{n_{..k}^{\neg B_{i-1,i}^d} + W\eta} \qquad (6.20)

where $n_{..k}^{\neg B_{i-1,i}^d}$ is the number of words belonging to the topic $k$ in the corpus excluding $B_{i-1,i}^d$, and $n_{..k}^{\neg B_{i-1,i}^d,\vartheta}$ is the number of times the word $\vartheta$ is assigned to the topic $k$ excluding $B_{i-1,i}^d$. The likelihood of $B_{i-1,i}^d$ for $t_{i-1,i}^d = \hat{t}$, where $\hat{t}$ is the new table being sampled, is written as:

P(B_{i-1,i}^d \mid t_i^d = \hat{t}, \mathbf{t}^{\neg di}, \mathbf{k}) = \sum_{k=1}^{L} \frac{m_{.k}}{m_{..} + \gamma} f_k^{\neg B_{i-1,i}^d}(B_{i-1,i}^d) + \frac{\gamma}{m_{..} + \gamma} f_{\hat{k}}^{\neg B_{i-1,i}^d}(B_{i-1,i}^d) \qquad (6.21)

The probability of choosing a table is described as:

P(t_i^d = t \mid \mathbf{t}^{\neg di}, \mathbf{k}) \propto \begin{cases} \mathbb{I}(t_i^d = t_{i-1}^d) & \text{if } w_{i-1}^d \text{ is already sitting at } t \\ \alpha\, P(B_{i-1,i}^d \mid t_i^d = \hat{t}, \mathbf{t}^{\neg di}, \mathbf{k}) & \text{if } t = \hat{t} \end{cases} \qquad (6.22)



Also, note that we have to decrease the customer count by 1 when a buddy customer leaves the table. Hence, Equation 6.19 can be rewritten as the following modified conditional density, which is used in the subsequent iterations of the sampler:

\hat{f}_k^{\neg w_i^d}(w_i^d = \vartheta) = \frac{(n_{..k}^{\neg w_i^d,\vartheta} - 1) + \eta}{(n_{..k}^{\neg w_i^d} - 1) + W\eta} \qquad (6.23)

Sampling the Concatenation Indicator Variables

We present how to sample the values of the indicator variables. The idea is to compute the probabilities of how often two words occur consecutively in sequence. Then, based on the probability value, the indicator variable is set to either 0 or 1. $\hat{f}_k^{\neg w_i^d}(w_i^d = \vartheta)$ is defined in Equation 6.23. Let $n_{w_i^d w_{i-1}^d t_i^d}$ be the number of times the word $w_i^d$ appears as the second word of a bigram with the previous word $w_{i-1}^d$, where both words in the bigram are assigned to the same table $t_i^d$.

P(x_i^d = 0 \mid \mathbf{x}^{\neg di}, \mathbf{w}, \mathbf{t}) = \frac{p_{t_{i-1}^d w_{i-1}^d 0} + \gamma_0}{\sum_{c=0}^{1} p_{t_{i-1}^d w_{i-1}^d c} + \gamma_0 + \gamma_1} \times f_k^{\neg \mathbf{w}_{dt}}(\mathbf{w}_{dt}) \qquad (6.24)

P(x_i^d = 1 \mid \mathbf{x}^{\neg di}, \mathbf{w}, \mathbf{t}) = \frac{p_{t_{i-1}^d w_{i-1}^d 1} + \gamma_1}{\sum_{c=0}^{1} p_{t_{i-1}^d w_{i-1}^d c} + \gamma_0 + \gamma_1} \times \frac{n_{w_i^d w_{i-1}^d t_i^d} + \delta_{w_i^d}}{\sum_{v=1}^{W} n_{v w_{i-1}^d t_i^d} + W\delta} \quad \text{and} \quad t_i^d = t_{i-1}^d \qquad (6.25)

6.5 Experiments and Results

6.5.1 Document Modeling Experiments

We describe our first quantitative analysis where we conduct document modeling. In particular, we use perplexity as the measure of modeling quality. Perplexity has been widely used for document modeling in many works related to parametric and nonparametric topic modeling [247], [23]. It can be perceived as the uncertainty in predicting a single word according to the model [231]. To define perplexity, let $M$ be the number of test documents and let $\mathbf{w}_d$ be the vector of words in the document $d$.

Perplexity for the test collection Dtest is written as:

\mathrm{Perplexity}(D_{test}) = \exp\left( - \frac{\sum_{d=1}^{M} \log P(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right) \qquad (6.26)

Lower perplexity values represent better performance of the model. In the training phase, the word-topic and the document-topic matrices are learnt from the training data. This process finds the unknown parameters of each multinomial distribution over words. During the testing phase, the count matrices learnt from the training data are used as the starting point, to which the counts from the word-to-topic assignments made during testing are added. The detailed process of applying a learnt topic model to the testing dataset is explained in [45], [253].
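As a simple illustration of Equation 6.26, a held-out perplexity could be computed from per-document log-likelihoods as in the sketch below; the function and argument names are illustrative, and obtaining $\log P(\mathbf{w}_d)$ itself of course requires the trained model.

```python
import numpy as np

def perplexity(doc_log_likelihoods, doc_lengths):
    """Eq. (6.26): exp of the negative average per-token log-likelihood
    over the M test documents."""
    return float(np.exp(-np.sum(doc_log_likelihoods) / np.sum(doc_lengths)))

# Example: two test documents with 100 and 150 tokens
print(perplexity([-650.0, -930.0], [100, 150]))  # roughly 557
```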

Datasets

We use both small and large scale datasets for experimental evaluation. The statistics of the datasets are shown in Table 6.1.

Name        Number of documents   Total words    Words in vocabulary   Average document length
AQUAINT-1   1,033,461             221,751,420    80,201                429
NIPS        1,830                 2,532,958      5,349                 2,761
OHSUMED     233,448               20,116,637     18,722                166
Reuters     806,791               97,429,960     51,914                242

Table 6.1: Statistics of the datasets used in our document modeling experiment. "Total words" gives the total number of words in the entire collection. "Words in vocabulary" is the number of unique words in the collection. The last column gives the average number of words in a document.

Experimental Setup

We create five folds for each of these datasets and conduct five-fold cross validation. Each fold is created by randomly sampling 75% of the documents into the training set and placing the rest into the testing set.
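A minimal sketch of this splitting procedure is given below; it assumes documents are identified by integer indices and uses a fixed random seed purely for illustration.

```python
import numpy as np

def make_folds(num_docs, num_folds=5, train_frac=0.75, seed=0):
    """Create five random 75/25 train/test splits over document indices."""
    rng = np.random.default_rng(seed)
    folds = []
    for _ in range(num_folds):
        perm = rng.permutation(num_docs)
        cut = int(train_frac * num_docs)
        folds.append((perm[:cut], perm[cut:]))
    return folds
```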

The comparative methods that we use in experiments consist of both parametric and nonparametric topic models. The parametric topic models are: LDA [23], BTM [252], LDACOL [87], TNG [261], and our recently proposed method NTSeg [117]. The nonparametric topic models are HDP [247], and our recently proposed model NHDP [116].

Parametric topic models require the hyperparameter values to be given explicitly. Based on the notations used in [23], for LDA the hyperparameter values are $\alpha = 50/T$, where $T$ is the number of topics, and $\beta = 0.1$. In [260] the authors point out that in parametric topic models the sensitivity to the variation of hyperparameter values is not significant. Based on the notations used in [261], for BTM, $\alpha$ is the same as in the LDA model, but we set $\delta = 0.03$. The hyperparameter values for the LDACOL model are taken from the publicly available implementation2. The hyperparameter settings for the TNG model are the same as those used in its public implementation3. The values for NTSeg are already mentioned in [117]. In all the topic models, Gibbs sampling has been used for posterior inference.

As in [247], we use the same hyperparameter settings in our HDP implementation. For our models, in order to make a fair comparison, we use the same hyperparameter settings that are commonly shared between our models and the HDP model. The extra hyperparameter values that we assume are $\gamma = 0.001$ and $\delta = 0.01$ for both of our models. Since our models inherit the exchangeability property of the HDP model, the sampling of the hyperparameters of the first and the second level priors is the same as in [247]. Hyperparameter values for the NHDP model are the same as described in [116]. We have removed stopwords from the collections using a standard stopword list4, and performed stemming using Porter's stemmer. Note that we tested the performance of the models using both stemmed and unstemmed collections, and found that the stemmed collections performed better. All the experiments have been conducted by running the samplers for all the models five times for 1000 iterations in each fold. In order to present the final results, we computed the average of the five runs in each fold, and then computed the average over the five folds.

Since the NIPS dataset is small, we varied the number of topics from 10 to 100 in steps of 10 during the tuning process. For the large collections, namely OHSUMED, Reuters and AQUAINT-1, we varied the number of topics from 50 to 200 in steps of 10, as larger collections tend to have a larger number of topics [266].

2 http://psiexp.ss.uci.edu/research/programs data/toolbox.htm
3 http://mallet.cs.umass.edu/
4 http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop

In parametric topic models, it is not straightforward to determine the number of topics that best describes the collection. Researchers in the past have varied the number of topics and shown how this affects the final outcome, but the number of topics that best describes the data is rarely reported, so one is left wondering how many topics to select for a particular dataset. Simply selecting the best performing model from the results obtained by varying the number of topics is not a principled solution. Therefore, in order to determine the number of topics used for comparison in our experiments, we adopt a tuning process. In each fold, we first divide the training set into a development set, which contains 75% of the documents in the training set, and a tuning set containing the rest. We train the model on the development set while varying the number of topics, and compute the perplexity on the tuning set for each number of topics. Note that we also run the Gibbs sampler for 1000 iterations in each fold. We then choose the best performing setting, i.e. the number of topics with the lowest average perplexity. We repeat this five times and take the mode of the selected numbers of topics as the output of the tuning process. We then merge the development and tuning sets back into the training set and train the model with the number of topics obtained from the tuning process. Finally, we test the model with the same number of topics on the testing set in each fold, running five times and computing the average.
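The tuning procedure described above can be summarised by the following sketch; train_and_perplexity is a hypothetical helper that trains a model with T topics on the development documents and returns its perplexity on the tuning documents.

```python
from collections import Counter

def tune_num_topics(train_folds, candidate_topics, train_and_perplexity, dev_frac=0.75):
    """Pick the number of topics as the mode of the per-fold best values (a sketch)."""
    best_per_fold = []
    for train_docs in train_folds:
        cut = int(dev_frac * len(train_docs))
        dev_docs, tune_docs = train_docs[:cut], train_docs[cut:]
        scores = {T: train_and_perplexity(dev_docs, tune_docs, T) for T in candidate_topics}
        best_per_fold.append(min(scores, key=scores.get))  # lowest perplexity wins
    return Counter(best_per_fold).most_common(1)[0][0]     # mode across folds
```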

Results and Discussion

In Table 6.2, we present the perplexity results for the various datasets and the different comparative methods. The first column lists the comparative models along with our two proposed models at the bottom of the table. From the results obtained on the AQUAINT-1 collection, we see that our NNTM-1 model outperforms all the comparative methods. The performance of our NNTM-2 model on this collection is weaker than that of NNTM-1, but it is still better than the other comparative methods.

In contrast, on the NIPS collection, our NNTM-2 model performs better than the comparative methods, and the NNTM-1 model still performs better than the other models. On the OHSUMED collection, the improvement obtained by our models is not as strong, but they are still better than the other comparative methods. For the Reuters collection too, our models show a good improvement over the comparative models.

Model     AQUAINT-1   NIPS     OHSUMED   Reuters
LDA       4599.48     834.45   2305.32   3490.12
BTM       4578.57     833.75   2229.96   3411.98
LDACOL    4501.44     831.45   2398.22   3298.76
TNG       4423.76     828.32   2315.72   3108.43
HDP       4322.32     825.43   2240.23   3192.54
NHDP      4495.32     820.56   2299.45   3102.53
NNTM-1    4201.33     815.32   2200.65   3099.44
NNTM-2    4222.54     803.98   2201.47   3002.29

Table 6.2: Perplexity results of document modeling for various models. The lower the perplexity value, the better the model.

From the results obtained on all four datasets, we see that our models are the best performing ones. The HDP and NHDP models could not generalize well on the unseen collections. Parametric topic models also generalize poorly on unseen data; this could be due to good generalization on the tuning set but poor generalization on the actual testing data. Our models generalize well on the testing data, showing their robustness against the comparative methods. One reason why our models work better than the state-of-the-art models is that they perform better topic detection for the words in the corpus. By giving the same topic assignment to phrasal words in sequence, they can model text better than other models. Moreover, they are able to fit the data well using the hyperparameter sampling scheme, which automatically detects an appropriate number of topics based on the data characteristics. In contrast, for parametric topic models the number of topics needs to be determined. In Table 6.3 we present the results obtained from the tuning process for various parametric topic models, listing each model along with its tuned number of topics. One can note that if the collection is large, the tuned number of topics also tends to be large. This finding is consistent with [266], where the authors discovered that larger text collections tend to exhibit a larger number of topics. We have also seen from the results that even the tuning process may not obtain a number of topics that fits the data well.

Model     AQUAINT-1   NIPS   OHSUMED   Reuters
LDA       160         70     190       120
BTM       190         50     140       80
LDACOL    170         60     140       110
TNG       160         60     180       120

Table 6.3: Number of topics obtained from the tuning process for different parametric topic models.

In addition, we conduct a sensitivity analysis on the Dirichlet parameter η in the nonparametric topic models, to see how this parameter affects the overall perplexity of the model. Note that parametric topic models also have a Dirichlet parameter over topic distributions. We present the results of this analysis in Figure 6.5. On all the datasets we see that when η is small, the perplexity of most models is generally high. As η approaches the middle of the interval between 0 and 1, we notice a gradual fall in the perplexity, after which the perplexity increases again. The results also point out that η has a noticeable effect on the perplexity of the model when generalizing to the testing set.

As stated earlier, nonparametric topic models automatically detect the number of latent topics from the data characteristics. We now show the number of latent topics detected by the different nonparametric topic models on the different datasets. In order to determine the number of topics, we run the Gibbs sampler five times in each fold on both the training and the testing sets and record the number of topics at the end of the 1000th iteration. We then take the mode over the five runs. We do the same for all the folds, and in the end we take the mode over the five folds and report the result.


Figure 6.5: The effect of varying the topic Dirichlet parameter η on perplexity on the testing set for the nonparametric topic models (HDP, NHDP, NNTM-1 and NNTM-2) on the AQUAINT-1, NIPS, OHSUMED and Reuters datasets.


Figure 6.6: Number of topics detected by different nonparametric topic models on the training and the testing sets on NIPS collection.


Figure 6.7: Number of topics computed by different nonparametric topic models on the training and the testing sets on OHSUMED collection.


Figure 6.8: Number of topics computed by different nonparametric topic models on the training and the testing sets on Reuters collection.


Figure 6.9: Number of topics computed by different nonparametric topic models on the training and the testing sets on AQUAINT-1 collection.

In Figures 6.6, 6.7, 6.8 and 6.9, we present the number of latent topics automatically detected by the models on both the training and the testing sets. We notice that our NNTM-2 model generates the fewest topics on all the datasets. This is primarily because most of the n-grams which share the same topic are clustered into one topic, leading to the discovery of fewer topics. Our NNTM-1 model behaves in a similar manner to the NNTM-2 model. What is more noticeable is the uneven performance of the NHDP model, which gives a mixed set of results across the datasets. HDP, as expected, generates more topics than our proposed NNTM-1 and NNTM-2 models.

One might notice that the number of latent topics generated by the nonparametric topic models is smaller than the numbers obtained for the parametric topic models during the tuning process. We did test the parametric topic models with the number of topics generated by the nonparametric topic models, but we found that the performance of the parametric topic models was very weak when the number of topics was small. It is important to note that the values of the hyperparameters strongly affect all the topic models that we have used for comparison. In fact, some prior knowledge is needed to find a good value for such hyperparameters, which in reality is rarely available. Therefore, obtaining results by arbitrarily fixing a hyperparameter value may lead to different results for different models.

6.5.2 Running Time Comparison

In this section, we present the running time comparisons for different nonparametric topic models. An obvious question that can arise after seeing the structure of our proposed graphical models in Figure 6.2 and Figure 6.4 is whether our models are inherently so complex that they lead to high computational overhead. In reality this is not the case, as the running time comparisons of the different nonparametric topic models illustrate.

The extra overhead in our models lies in the Chinese Restaurant Franchise metaphor. Specifically, we need to search for friends inside the restaurant when a customer enters the restaurant. This searching time is directly proportional to the number of customers already inside the restaurant and also to the number of tables.

We show the running time comparison by computing the training and testing time taken by each of the nonparametric topic models. The running times are averaged over the five folds in all the datasets. We present our results in Figures 6.10, 6.11, 6.12 and 6.13. In all the results we notice that the HDP model runs faster than the n-gram models, which is expected. Our models run faster in testing time on the NIPS collection, and they remain competitive with the HDP model on the other datasets as well. We notice that on the large document collections, i.e. Reuters, AQUAINT-1 and OHSUMED, our models take more time in training, but the testing time is relatively low. Our objective is certainly not to show that our proposed n-gram models can run faster than the unigram-based HDP model. The reason for the fast computation time of HDP lies in its exchangeability assumption, but this assumption has been criticized for some time. Thus there is always a trade-off between time and improving the qualitative and quantitative results of the model. Our models have performed better than the HDP model, which brings out their importance in the field of text mining.


Figure 6.10: Training and testing time comparisons on NIPS collection for nonparametric topic models.


Figure 6.11: Training and testing time comparisons on OHSUMED collection for nonparametric topic models.


Figure 6.12: Training and testing time comparisons on Reuters collection for nonparametric topic models.


Figure 6.13: Training and testing time comparisons on AQUAINT-1 collection for nonparametric topic models.

6.5.3 Qualitative Results

We conduct a qualitative analysis where we show some high probability topical words obtained from our models and the comparative methods. The objective is to show that, in comparison with the comparative methods, our models generate better topical words which present more insightful topics than the unigram-based models.

In order to conduct the qualitative analysis, we only consider nonparametric topic models for comparison; parametric n-gram models are likewise expected to generate more interpretable topics than unigram-based models. The result presentation strategy is adopted from [261]: we present unigram and n-gram topical words separately, as we are comparing with the unigram-based HDP.

We present results from the AQUAINT collection in Tables 6.4 and 6.5. Results from the NIPS collection are presented in Tables 6.6 and 6.7. Similarly, results from OHSUMED are presented in Tables 6.8 and 6.9, and results from Reuters in Tables 6.10 and 6.11. Words in all the tables are ranked by their probabilities of occurrence in a latent topic, so the word with the highest probability is at the top. In all the results, we observe that the n-gram models appear to produce more meaningful topical words than the unigram models. This is because the n-gram models lend more interpretability to the latent topics by generating n-gram words which appear more insightful than unigram words. In Table 6.4, we select words from a technology related topic. When we look at the words obtained from the HDP model, we do not get much insight about the topic. For example, words such as "year", "new" and "team" make the topic very ambiguous. Another weakness that we notice in the HDP model is that some of the words are not at all related to the topic, for example "church". Therefore, we do not get a sense of what the topic is actually referring to. In contrast, the NHDP model generates slightly better unigrams than HDP, as most of the words are semantically related and tend to co-occur together in documents. NHDP also generates n-grams which give better insight into the topic; however, words such as "index html" are not desirable, and even the bigram "latin america" seems misplaced and does not fit the topical content. In contrast, our two models perform much better than the comparative methods. For NNTM-1, most of the generated unigrams are semantically close to each other and describe one theme, and the generated n-grams are all semantically related and describe a technology related theme. One can also note that by generating n-grams we remove many of the ambiguities in the words present in the topics. For example, "web site", "computer device", etc., make more sense to the reader than unigrams alone; the n-gram words generated by our model are extremely useful for identifying the theme of the topic. Similarly, NNTM-2 also generates better topic words than both NHDP and HDP. In the NIPS dataset too, our models generate better topical words than the comparative methods; even the unigrams generated by our NNTM-1 and NNTM-2 models appear more meaningful than those of the other models. We notice similar results in the remaining two datasets in the rest of the tables. The experimental results show that our models generate more fine-grained topical words than the rest of the models.

HDP NHDP NNTM-1 NNTM-2 Unigrams N-grams Unigrams N-grams Unigrams N-grams year test internet sale phone web site online software package game computer search engine digit cell phone mail multimedia game music year create search engine computers high technology program rate children software computer project internet user technology microsoft windows computers download free demo train modern index html information computer technology internet web site new service state department web computer device technology navigation system team software computer software mail laptop equipment software computer vision church internet computer bulletin user recognition software information computer software transit editor latin america online large comfortable keyboard site computer technology time technology talk real person network speech technology system microsoft windows Table 6.4: Top ten n-grams from a topic related to “technology” obtained from each of the models from AQUAINT collection.

HDP NHDP NNTM-1 NNTM-2 Unigrams N-grams Unigrams N-grams Unigrams N-grams dai bosnia nuclear submarine belgrade nato force military nato force kosovo solana nato countries troop nato hold war black sea mongolia muslim human suffering yugoslavia nato headquarters kosovo russian president pipeline war serbian border nato nato naval command nato nato headquarters stories bosnia checkpoint violent attack albanian nazi era force nato naval command serb arrest albanian peasant mirosev russian royal navy peace nuclear submarine mongolian russian russian modern submarine kosovo american vessel mirosev nato official albanian mexico nato official serb navy official belgrade russian official president rescue brussels headquarters army media report albanian russian navy spokesperson nato refuge nato naval command parliament vladimir putin bomb media report Table 6.5: Top ten n-grams from a topic related to “war” obtained from each of the models from AQUAINT collection.

HDP NHDP NNTM-1 NNTM-2 Unigrams N-grams Unigrams N-grams Unigrams N-grams model input neural response neurons training data train neural network image training neural population spike feature space time training set network network training corpus time neural computation synaptic learning rule neuron output content role input training step network firing rate word neuron initial plan network input layer input action potential sensors synaptic training instances output hidden unit neuron input neuron train time neural network neural neural network output propagation filter training figure hidden neuron weight neural net spike hidden neuron algorithm learning training set activation hidden neuron hidden neural information norm data neural systems back hidden unit layer training data Table 6.6: Top ten n-grams from a topic related to “neural networks” obtained from each of the models from NIPS collection.

HDP NHDP NNTM-1 NNTM-2 Unigrams N-grams Unigrams N-grams Unigrams N-grams function signal frequency division hmm speech enhancement utterance speech signal sample speaker speech intelligibility system spectral subtraction feature frequency channel time hmm neural tree speech test utterance hmm linear transformation active speech complex tone state single microphone speech speech recognition speech time speech parser sound human sound phoneme speech recognition system input frequency atr ix sources spectral cues output training images inputs mlp eig ts recognition noise suppression channel human sound example training human sound noise time alignment data speech enhancement set recognition ix el speaker speech enhancement frequency training case linear time test utterance time training images modeling mit press Table 6.7: Top ten n-grams from a topic related to “speech technology” obtained from each of the models from NIPS collection. 195

HDP NHDP NNTM-1 NNTM-2 Unigrams N-grams Unigrams N-grams Unigrams N-grams patient il monoclonal antibody cells bone marrow tissue bone marrow increase specific bone marrow factor monoclonal antibody cells cell clone infect human cell line effect cell line vitro red blood cell case cell human immunodeficiency virus activity peripheral blood beta monoclonal antibody injuries antibody present study vitro red cell human tumor cell cell marrow human specific mrna cell antigen tissue stimulating factor arteries bone cell antigen tissue red cell factor cell antigen receptors line cell activation electron cell clone microscopy cell activation radical virus mm hg collagen cell activation electron result suggest iga human results suggest light present study skin dna synthesis Table 6.8: Top ten n-grams from a topic related to “cells” obtained from each of the models from OHSUMED collection.

HDP NHDP NNTM-1 NNTM-2 Unigrams N-grams Unigrams N-grams Unigrams N-grams result failure renal transplant liver chronic liver disease kidney liver disease control kidney plasma membrane kidney portal vein liver renal transplantation liver hepatic liver biopsy transplant host disease renal portal vein report graft hepatocellular carcinoma patient alcoholic liver disease chronic chronic liver disease present renal host disease hepatic chronic active hepatitis glomerular liver biopsy express mm mm hg graft plasma membrane transplant alcholic liver disease flow hg body weight portal renal transplant serum hepatocellular carcinoma antagonist test angiotensin ii failure hepatocellular carcinoma graft graft survival vip patient mg days renal graft survival creatinine hepatic artery allergy liver control subjects chronic liver disease failure alcholic liver disease Table 6.9: Top ten n-grams from a topic related to “liver” obtained from each of the models from OHSUMED collection.

HDP NHDP NNTM-1 NNTM-2 Unigrams N-grams Unigrams N-grams Unigrams N-grams market bulgarian andhra bank market india rbi market interest rate percent banker european monetary union foreign term interest share india rbi limit commercial complex financial product growth andhra bank stock stock exchange crown local big deal economy interest rate growth dollar rate pct dealer steel giant insurance land rate percent fund term interest rate bank boe central bank buenos aires industry stock exchange management us dollar share anz oil sale development belgian central bank economy daily interest rate african percent russian trade system sector dollar rate sector government budget space yuan floating rate business contact samir shah financial netprofit bahrain newspaper german government fund financial institution insurance interest rate Table 6.10: Top ten n-grams from a topic related to “finance” obtained from each of the models from Reuters collection.

HDP NHDP NNTM-1 NNTM-2 Unigrams N-grams Unigrams N-grams Unigrams N-grams report year oil product oil oil price oil crude oil bank japan crude oil trade gulf war cargo domestic crude win iraq new oil product cargo oil stock barrel oil product pakistan oil january february high crude oil gasoline iraqi oil oil crude saudi arabia market domestic crude price crude tanker rate demand total product price iraq ambassador fuel oil storage net gasoline crude export fuel oil product crude indian oil french saudi gasoline distillation tonne indian oil trader chinese petroleum launch arabia thousand barrel crude run oil company limit iraq ambassador qatar uae oil import week world price tonne run oil company Table 6.11: Top ten n-grams from a topic related to “oil” obtained from each of the models from Reuters collection. 196

The qualitative results can be summarized as follows:

1. Generally, words generated by the HDP model are inferior to those of the other models.

2. Our models are likely to generate better topical words than the other two nonparametric topic models.

3. Between our proposed models, it is difficult to say which model is better than the other. This issue can be resolved through the quantitative results that we have presented in Sections 6.5.1 and 6.5.4, where we show the generalization ability of the models on unseen documents.

4. N-gram words appear more meaningful than unigram words, because they convey more insight to the reader. This observation concurs with several other works in the past, such as [166], [117].

6.5.4 Document Classification Experiments

In this section, we present document classification experimental results using topic models, as conducted in [117]. We consider both parametric and nonparametric topic models in this experiment, mainly choosing closely related state-of-the-art topic models for comparison. The purpose of this experiment is to show the performance of topic models in classifying an unseen document into its correct class. The comparison of topic-model-based document classification with common text classification models such as SVM has been investigated in [23], [223].

Datasets

We use two corpora in our experiments. One corpus is the 20 Newsgroups corpus5, which has been popularly used in several document classification tasks. The other is the OHSUMED-23 collection6. Note that the OHSUMED-23 dataset is different from the OHSUMED dataset described in Section 6.5.1 that we have used in our document modeling experiments. We present more details about the corpora in Table 6.12.

Name            Number of documents   Total words   Words in vocabulary   Average document length   Number of classes   Average number of documents in each class
20 Newsgroups   19,997                1,972,422     18,146                54                        20                  1000
OHSUMED         20,000                955,599       35,928                90                        23                  1016

Table 6.12: Statistics of the corpora used in our document classification experiment. "Total words" gives the total number of words in the entire collection. "Words in vocabulary" is the number of unique words in the collection. The fourth column gives the average number of words in a document.

Classification Method

In the training phase, a topic model is learned for each class using the set of training documents in that class. In testing, we compute the likelihood of the testing document under the trained topic model of each class. The testing document is assigned to the class whose model produces the highest likelihood. Note that this procedure is also used in [161], [117].
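This classification rule amounts to the following sketch, where class_models is assumed to hold one trained topic model per class and log_likelihood is a hypothetical routine that scores a held-out document under a trained model.

```python
def classify(doc, class_models, log_likelihood):
    """Assign the test document to the class whose model gives the highest likelihood."""
    scores = {label: log_likelihood(model, doc) for label, model in class_models.items()}
    return max(scores, key=scores.get)
```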

We measure the classification performance using precision, recall and F-measure. Precision for a class is the number of true positives divided by the total number of documents predicted for that class. Recall is the number of true positives divided by the total number of documents that actually belong to that class in the gold standard. F-measure is the harmonic mean of precision and recall.
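For completeness, these per-class metrics can be computed from true-positive, false-positive and false-negative counts as in this small sketch (the names are illustrative).

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F-measure as defined above."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```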

5 http://qwone.com/∼jason/20Newsgroups/
6 http://disi.unitn.it/moschitti/corpora.htm

Experimental Setup

In the 20 Newsgroups corpus, we generated four datasets. The first dataset comprises documents related to computer technology (the "comp" directory in the corpus). It is composed of several classes such as "graphics", "windows", "hardware", etc., each consisting of 1000 documents. The experimental setup is similar for the other three datasets, namely "sci" (called the Science Dataset), "politics" (called the Politics Dataset), and "sports" (called the Sports Dataset).

We create the training and the testing sets for each dataset. In the OHSUMED-23 collection, these splits are already provided. We train a topic model for each class and test the performance using the testing set. We conducted these experiments using five folds and averaged the results over all five folds. We use the same hyperparameter values that we have used in our earlier experiments; those hyperparameters, including the hyperparameter sampling scheme, are described in Section 6.5.1.

We adopt a similar text preprocessing strategy as before. Specifically, we removed stopwords from the collections, stemmed the collections, and also removed low frequency words, i.e. words which occurred fewer than 8 times in the entire collection.

Models    Precision   Recall   F-Measure
LDA       0.514       0.476    0.501
BTM       0.501       0.466    0.499
LDACOL    0.518       0.472    0.509
TNG       0.520       0.469    0.509
HDP       0.518       0.476    0.504
NHDP      0.496       0.491    0.483
NNTM-1    0.526       0.499    0.513
NNTM-2    0.501       0.438    0.509

Table 6.13: Computer Dataset

Models    Precision   Recall   F-Measure
LDA       0.416       0.392    0.392
BTM       0.401       0.376    0.376
LDACOL    0.405       0.322    0.394
TNG       0.411       0.339    0.399
HDP       0.416       0.401    0.405
NHDP      0.408       0.366    0.372
NNTM-1    0.415       0.405    0.405
NNTM-2    0.420       0.409    0.410

Table 6.14: Science Dataset

Models    Precision   Recall   F-Measure
LDA       0.412       0.401    0.376
BTM       0.415       0.401    0.398
LDACOL    0.416       0.402    0.389
TNG       0.411       0.399    0.399
HDP       0.418       0.401    0.405
NHDP      0.402       0.380    0.401
NNTM-1    0.416       0.401    0.402
NNTM-2    0.418       0.405    0.410

Table 6.15: Politics Dataset

Models    Precision   Recall   F-Measure
LDA       0.301       0.296    0.294
BTM       0.299       0.299    0.295
LDACOL    0.301       0.294    0.299
TNG       0.308       0.301    0.302
HDP       0.309       0.302    0.286
NHDP      0.302       0.296    0.292
NNTM-1    0.302       0.299    0.293
NNTM-2    0.303       0.301    0.303

Table 6.16: Sports Dataset

Models    Precision   Recall   F-Measure
LDA       0.514       0.468    0.499
BTM       0.512       0.466    0.501
LDACOL    0.512       0.472    0.472
TNG       0.516       0.473    0.486
HDP       0.516       0.482    0.492
NHDP      0.501       0.411    0.486
NNTM-1    0.514       0.476    0.511
NNTM-2    0.509       0.481    0.506

Table 6.17: Results obtained from the OHSUMED-23 corpus.

Result Analysis

We present our results in Tables 6.13, 6.14, 6.15 and 6.16. From the results, it is evident that our proposed models perform better than many parametric and nonparametric topic models in the task of classifying documents into their correct class. Between our own models, the results are mixed: in some cases our first proposed model performs better than the second. In the Computer dataset of the 20 Newsgroups corpus, shown in Table 6.13, we can see that our NNTM-1 model performs better than our NNTM-2 model, although NNTM-2 is still competitive compared with the other topic models. Nevertheless, our NNTM-1 model shows the best performance. The F-measure score of our model in Table 6.13 is statistically significant according to the sign test with p-value < 0.05 against each of the comparative models. In the Science dataset of the 20 Newsgroups corpus, shown in Table 6.14, both of our models show an improvement over the other topic models. Although NNTM-1 only performs on par with the HDP model, it is still competitive.

Our second model shows the best performance on this dataset. In the other two tables, i.e. Tables 6.15 and 6.16, our NNTM-2 model shows the best performance. The improvements shown by our NNTM-2 model in these tables are statistically significant according to the sign test with p-value < 0.05 against each of the topic models.

We also present document classification results for the OHSUMED-23 dataset in Table 6.17. We can see from this result that the NNTM-1 model shows better performance than the other topic models; our NNTM-2 model also performs better than the other topic models on this dataset. The improvement shown by our NNTM-1 model is statistically significant according to the sign test with p-value < 0.05 against each of the topic models. Overall, as stated in Section 6.4.2, our second model tends to give better document classification performance because it can capture topical words better than the NNTM-1 model.

In Figure 6.14, we present the effect of η on the classification performance. We only consider the nonparametric topic models in this comparison because they are all closely related. We can notice from the figure that η has some impact on the classification performance. The impact of η on the Science dataset is not that prominent, whereas on the other datasets η has a noticeable effect. On the Politics dataset, we can see that the HDP model almost matches the F-measure score of our model at η = 0.5. However, in most cases our two proposed models fare well even when η is varied. On the Science dataset, HDP performs better than our NNTM-1 model for many values of η. One reason for the weaker performance of our model in some cases might be that there are not many n-gram words in the documents; in such a case, our model almost matches the HDP model, as the potential of our model is unleashed only when plenty of n-gram words exist in the document collection.


Figure 6.14: The effect of varying the topic Dirichlet parameter η on the classification performance (F-measure) for the nonparametric topic models (HDP, NHDP, NNTM-1 and NNTM-2) on the Computer, Science, Politics, Sports and OHSUMED-23 datasets.

6.6 Closing Remarks

In this chapter, we have proposed two nonparametric topic models to capture n-gram words in topics. Our models introduce a binary concatenation indicator variable to capture n-gram topical words, and we propose the Chinese Restaurant Franchise scheme with buddy customers to capture n-gram words in topics in a nonparametric setting. One advantage of our models is that they do not require the user to specify the number of topics in advance, which is needed in parametric n-gram topic models. Our quantitative results show improved performance in document modeling and document classification tasks, and the topics generated by our models contain more meaningful topical words. We have also shown that our models can be used on both small and large document collections.

Chapter Seven

Supervised Probabilistic Topic Models

Chapter Summary

One limitation of most existing topic models for document classification is that the topic model itself does not consider useful side-information, namely, the class labels of documents. Topic models which do consider this side-information, popularly known as supervised topic models, do not consider the word order structure in documents. In this chapter, we investigate a low-dimensional latent topic model for document classification. Class label information and word order structure are integrated into a supervised topic model, enabling a more effective interaction among such information for solving document classification. We derive a collapsed Gibbs sampler for our model. Our experimental results suggest significant improvements over the state-of-the-art models.

7.1 The Case for the Supervised Topic Models With Word Order

Most existing topic models such as LDA are unsupervised probabilistic topic models which analyze a high dimensional term space and discover a low-dimensional topic space [23]. They have been employed for tackling text mining problems including document classification [117] and document retrieval [266], [261]. These models can achieve better performance by detecting the latent topic structure and establishing a relationship between the latent topics and the goal of the problem. One limitation of unsupervised topic models for document classification is that the topic model itself does not consider the class labels of documents. Another limitation is that they do not exploit the word order structure of the documents. Some works attempt to integrate the class label information into a topic model for solving document classification, for example sLDA [21], mcLDA [257], and MedLDA [298]. These models have been shown to improve document classification performance [299], [123]. However, one common limitation of the above models is that they do not make use of the word order structure in text documents, which could interact with the class label information for solving the document classification task.

Likewise, unsupervised topic models such as TNG and LDA have been used in developing document retrieval models [261], [266]. But they have not been explored for document retrieval learning, which can essentially be cast as a learning-to-rank problem. Learning-to-rank models make use of the available relevance judgment information of a document for a query in the training process. The task is then to predict a desired ordering of documents. Several learning-to-rank models have been introduced, but none of them consider the similarity between the document and the query under a low-dimensional topic space within the topic model itself.

In this chapter, we investigate a low-dimensional latent topic model for document classification. Class label information and word order structure are integrated into our supervised topic model, enabling more effective interaction among such information for solving document classification. We derive a collapsed Gibbs sampler for our model.

7.2 Our Classification Model

7.2.1 Model Description

We propose a document classification model based on a latent topic model that integrates the class label information and the word order structure into the topic model itself. It enables interaction among such information for more effective modeling for document classification. There are two main components. One component measures the goodness of fit for document content, similar to traditional topic models, with an extension to consider the word order structure. The motivation for incorporating the word order structure is its ability to capture the semantic associations between words in sequence [252], [117]. The second component deals with the prediction of class labels of documents. This component can be regarded as an extension of the Maximum Entropy Discrimination Latent Dirichlet Allocation (MedLDA) model [297], [298]. Essentially, this component finds a regularized posterior distribution of the predictive function in a space defined by a set of expected margin constraints. This expected classifier is generalized from the maximum-margin constraints. One fundamental difference between MedLDA and our proposed model is that our model exploits the word order structure of a document. The design of the above two components leads to a latent topic representation that is more discriminative and also advantageous for the supervised document classification learning problem.

We show the graphical model in plate notation in Figure 7.1. The document content modeling component of our model is primarily a bigram supervised topic model which captures dependencies among the words in sequence. Each topic is characterized by a distribution of bigrams. A document is denoted by $d \in [1,\ldots,D]$ where $D$ is the total number of documents. The training data is denoted by $\mathcal{T} = \{\mathbf{w}^d, y^d\}_{d=1}^{D}$. We define $\mathbf{w}^d = \{w_i^d\}_{i=1}^{N^d}$ as the words appearing in the document $d$. $y^d$ is the class label, which takes one of the values $\mathcal{Y} = \{1,\ldots,M\}$. Word generation is defined by the conditional distribution $P(w_i^d \mid w_{i-1}^d, z_i^d)$. The goal of our model is to generate a latent topic representation that is suitable for the classification task. We describe the generative procedure as follows:

1. Draw Multinomial distribution φzw from a Dirichlet prior β for each topic z and each word w.

2. For each document d

(a) Draw a topic proportion $\theta^d$ for the document $d$ from Dirichlet($\alpha$), where Dirichlet($\alpha$) is the Dirichlet distribution with parameter $\alpha$,


Figure 7.1: Graphical representation of our proposed document classification model.

(b) For each word $w_i^d$,

i. Draw a topic $z_i^d$ from Multinomial($\theta^d$)

ii. Draw a word $w_i^d$ from the distribution over words for the context defined by the topic $z_i^d$ and the previous word $w_{i-1}^d$, i.e. from Multinomial($\phi_{w_{i-1}^d z_i^d}$)

3. Draw the class label parameter $\eta$ from Normal($0$, $\eta_0$), where $\eta_0$ is the hyperparameter for $\eta$, and this is sampled $M$ times,

4. Draw a class label $y^d \mid (\mathbf{z}^d, \eta)$

Let $W = \{\mathbf{w}^d\}_{d=1}^{D}$ denote all documents in the training set. $Z = \{\mathbf{z}^d\}_{d=1}^{D}$ are the topic assignments to all the words. $\Theta = \{\theta^d\}_{d=1}^{D}$ are the topic distributions for all documents. $\Phi = \{\phi_{kv}\}_{v,k=1}^{W,K}$ are the word-topic distributions for the corpus. $KL(P\|P_0)$ is the Kullback-Leibler divergence from $P$ to $P_0$. The joint distribution defined in the model is $P_0(\Theta, \Phi, Z) = \left( \prod_{d=1}^{D} P(\theta^d \mid \alpha) \prod_{n} P(z_n^d \mid \theta^d) \right) \prod_{k=1}^{K} \prod_{v=1}^{W} P(\phi_{kv} \mid \beta)$. Similar to the supervised topic model which uses Gibbs sampling, i.e. Gibbs Maximum-Margin Entropy Discrimination Latent Dirichlet Allocation (gMedLDA) [123], our objective is to infer the joint distribution $P(\eta, \Theta, Z, \Phi \mid W)$, where $\eta$ is a random variable representing the parameter of the classification model. Let $\mathbf{b}^d$ denote $\{b_{n,n+1}^d\}_{n=1}^{N^d-1}$, where $b_{n,n+1}^d$ denotes the words at positions $n$ and $n+1$ in the document $d$. $B = \{\mathbf{b}^d\}_{d=1}^{D}$ is the word order information. In addition, we extend the formulation to handle this word order structure. Precisely, the following discriminant function is defined:

F(y, \eta, \mathbf{z}; \mathbf{b}^d) = \eta^{\top} f(y; \bar{\mathbf{z}}^d) \qquad (7.1)

where $\bar{\mathbf{z}}^d$ is an $L$-dimensional vector with each element $\bar{z}_k^d = \frac{1}{N^d} \sum_{n=1}^{N^d} \mathbb{I}(z_n^d = k)$. $\mathbb{I}(\cdot)$ is an indicator function which equals 1 if the predicate holds and 0 otherwise. $f(y, \bar{\mathbf{z}}^d)$ is an $MK$-dimensional vector whose elements from $(y-1)K$ to $yK$ are $\bar{z}_k^d$ and the rest are all 0.

F(y; \mathbf{b}^d) = \mathbb{E}_{P(\eta, \mathbf{z} \mid \mathbf{b}^d)}\left[ F(y, \eta, \mathbf{z}; \mathbf{b}^d) \right] \qquad (7.2)

The prediction rule incorporating the word order structure in the classification task is:

\hat{y} = \operatorname*{argmax}_{y} F(y; \mathbf{b}^d) \qquad (7.3)
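A small sketch of Equations 7.1 to 7.3 is given below: the class-specific feature vector places the empirical topic proportions in the block of the candidate class, and the predicted label maximises the resulting linear score. The 0-based class indexing and the function names are assumptions for illustration.

```python
import numpy as np

def feature_vector(y, zbar, num_classes):
    """f(y, z^d): an MK-dimensional vector whose y-th block of length K holds zbar."""
    K = len(zbar)
    f = np.zeros(num_classes * K)
    f[y * K:(y + 1) * K] = zbar
    return f

def predict(eta, zbar, num_classes):
    """Prediction rule of Eq. (7.3): argmax over y of eta^T f(y, z^d)."""
    scores = [eta @ feature_vector(y, zbar, num_classes) for y in range(num_classes)]
    return int(np.argmax(scores))
```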

Let $C$ be a regularization constant, $\xi_d$ be the slack variable and $l_d(y)$ be the loss function for the label $y$, all of which are positive. Following the idea in [123], the soft-margin problem for our model can be written as:

\begin{aligned}
\min_{P(\eta,\Theta,Z,\Phi)\in\mathcal{P}} \quad & KL[P(\eta, \Theta, Z, \Phi) \,\|\, P_0(\eta, \Theta, Z, \Phi)] - \mathbb{E}_P[\log P(B \mid Z, \Phi)] \\
& \qquad + \frac{C}{D} \sum_d \max_{y} \left( l_d(y) - \mathbb{E}_P\!\left[ \eta^{\top}\left( f(y^d, \bar{\mathbf{z}}^d) - f(y, \bar{\mathbf{z}}^d) \right) \right] \right) \\
\text{subject to} \quad & \mathbb{E}_P\!\left[ \eta^{\top}\left( f(y^d, \bar{\mathbf{z}}^d) - f(y, \bar{\mathbf{z}}^d) \right) \right] \ge l_d(y) - \xi_d, \quad \xi_d \ge 0, \ \forall d, \forall y
\end{aligned} \qquad (7.4)

7.2.2 Posterior Inference

We use collapsed Gibbs sampling for posterior inference in our model, in a manner similar to [123], but we extend it to handle the word order structure in the document. We factorize $P(\eta, \Theta, Z, \Phi) = P(\eta)P(\Theta, Z, \Phi)$. Then Equation 7.4 can be solved in two alternating steps. The first step is to estimate $P(\eta)$ given $P(\Theta, Z, \Phi)$. In the second step, we estimate $P(\Theta, Z, \Phi)$ given $P(\eta)$. The formulations are similar to those in [123], with a refinement for handling the word order structure.

Let $\Delta f(y^d, \bar{\mathbf{z}}^d) = f(y^d, \bar{\mathbf{z}}^d) - f(y, \bar{\mathbf{z}}^d)$. The formulation for updating the posterior estimates is as follows:

P(\Theta, Z, \Phi) \propto P(\Theta, Z, \Phi, B)\, e^{(\kappa^{*})^{\top} \sum_{d} \sum_{y} (\lambda_{y^d}^{d})^{*}\, \Delta f(y^d, \bar{\mathbf{z}}^d)} \qquad (7.5)

where $\lambda_{y^d}^{d}$ is the Lagrange multiplier. The problem now is to efficiently draw samples from $P(\Theta, Z, \Phi)$. In order to simplify the integrals, we can take advantage of conjugate priors. We can integrate out the intermediate variables $\Theta$ and $\Phi$ and build a Markov chain whose equilibrium distribution is the resulting marginal distribution $P(Z)$.

Let $m_{zwv}$ be the number of times the word $w$ is generated by the topic $z$ when preceded by the word $v$. $q_{dz}$ is the number of times a word is assigned to the topic $z$ in the document $d$. We define $\kappa = \sum_{d=1}^{D} \sum_{y^d} \lambda_{y^d}^{d}\, \Delta f(y^d, \mathbb{E}[\bar{\mathbf{z}}^d])$, where $\kappa$ is the mean of the classifier parameter $\eta$. The element $\kappa_{y^d k}$ represents the contribution of the topic $k$ in classifying a data point to the class $y^d$. The transition probability along with the maximum-margin constraint can be expressed as:

P(z_i^d \mid B, Z^{\neg di}, \alpha, \beta) \propto \frac{\alpha_{z_i^d} + q_{d z_i^d} - 1}{\sum_{z=1}^{K} (\alpha_z + q_{dz}) - 1} \times \frac{\beta_{w_i^d} + m_{z_i^d w_i^d w_{i-1}^d} - 1}{\sum_{v=1}^{W} (\beta_v + m_{z_i^d w_i^d v}) - 1} \times e^{(\kappa^{*})^{\top} \sum_{d=1}^{D} \sum_{y} (\lambda_{y}^{d})^{*}\, \Delta f(y, \bar{\mathbf{z}}^d)} \qquad (7.6)

Note that all the counts used above exclude the current case, i.e., the word being

visited during sampling.
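The structure of Equation 7.6 can be illustrated by the following sampling sketch. All counts are assumed to already exclude the word being visited, and margin_factor stands for the exponential max-margin term, which in a full implementation would be recomputed from κ and the Lagrange multipliers; the names are illustrative assumptions.

```python
import numpy as np

def sample_topic(q_d, alpha, m_word_given_prev, m_any_given_prev, beta, W, margin_factor,
                 rng=np.random.default_rng()):
    """One collapsed-Gibbs draw for z_i^d in the spirit of Eq. (7.6).
    q_d[z]               -- topic counts in the current document
    m_word_given_prev[z] -- times the current word was generated by topic z after the previous word
    m_any_given_prev[z]  -- times any word was generated by topic z after the previous word
    margin_factor[z]     -- exponential max-margin factor for assigning topic z"""
    doc_part = (alpha + q_d) / (alpha.sum() + q_d.sum())
    word_part = (beta + m_word_given_prev) / (W * beta + m_any_given_prev)
    p = doc_part * word_part * margin_factor
    p = p / p.sum()
    return rng.choice(len(p), p=p)
```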

Our prediction framework also follows a strategy similar to that in [123], but we consider the notion of word order. It is not difficult to derive the prediction formulae based on the formulations that we have presented in the sections above.

7.3 Document Classification Experiments

7.3.1 Experimental Setup

We conduct extensive experiments on document classification using existing benchmark test collections, and compare with many related comparative methods. In addition, we carry out a qualitative analysis showing how our model generates better topical words. In all our experiments involving topic models, we run the sampler for 1000 iterations. We have also removed stopwords1 and performed stemming using Porter's stemmer2. Five-fold cross validation is used as in [298]. In each fold, the macro-average across the classes is computed. Each model is run five times, and we take the average of the results obtained over all the runs and all the folds.

We use two popular datasets, namely the 20 Newsgroups dataset3 and the OHSUMED-23 dataset4. The 20 Newsgroups dataset contains 1000 documents in each of its 20 classes. OHSUMED-23 comprises 23 classes. As adopted in [126], we used the first 20,000 documents, divided into 10,000 training documents and 10,000 testing documents. In 20 Newsgroups, our training set comprised 75% of the total documents, and the rest was used for testing. We use Precision, Recall and F-measure to measure the

1 http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
2 We also tested the models without performing stemming. We found that stemmed collections fared better.
3 http://qwone.com/∼jason/20Newsgroups/
4 http://disi.unitn.it/moschitti/corpora.htm

classification performance. The definitions of these metrics for the classification task can be found in [117]. We solve the multiclass classification problem by decomposing it into binary classification problems, one for each class. We adopted a tuning process where we split the training set into a development set (75%) and a tuning set (the rest). We used the development set to train the model, and we varied the number of topics from 10 to 100 in steps of 10 as in [117]. We performed this procedure in each fold and computed the average F-measure. The number of topics which produced the best F-measure is the output of the tuning process. Then we used the original training set (i.e. the development set and the tuning set) to train the models with the number of topics obtained from the tuning process. In order to present the number of topics in Table 7.1, we took the mode of the number of topics among all five folds. We set the loss function $l_d(y)$ to the constant value 16, just as in [123]. For simplicity, we assume symmetric Dirichlet priors throughout, and we set the value of $\beta$ to 0.01. Consistent with the experiments in [260], we found little variation in results for different hyperparameter values. Hyperparameter values of the other topic models (supervised and unsupervised) are the same as used in their respective works and their publicly shared implementations. In [123], the authors conducted extensive experimentation to find the best $C$ value; we use the same $C$ value for a fair comparison.

We chose a wide range of comparative methods as follows.

1. gMedLDA [299] denoted as gMedLDA in the results.

2. Variational Maximum-Margin Entropy Discrimination Latent Dirichlet Allocation (vMedLDA) [297] denoted as vMedLDA.

3. sLDA denoted as sLDA [21].

4. DiscLDA [147], denoted as DiscLDA.

5. LDA [23].

[Figure: average θ plotted against the number of topics L for MedLDA (left) and our model (right).]

Figure 7.2: Per-class distribution over topics in comp.graphics class of 20 Newsgroups dataset.

6. We use LDA+SVM in the same way as described in [298].

7. Bigram Topic Model (BTM) [252].

8. Following the procedure adopted for LDA+SVM, we do the same for BTM+SVM.

9. LDA-Collocation model (LDACOL) [87].

10. LDACOL+SVM.

11. Topical N-gram (TNG) [261].

12. TNG+SVM [126].

13. Our recently proposed model NTSeg [117].

14. NTSeg+SVM.

15. SVM. The features for linear SVM are the same as those in [299].

[Figure: average θ plotted against the number of topics L for MedLDA (left) and our model (right).]

Figure 7.3: Per-class distribution over topics in Class 5 of OHSUMED-23 dataset.

Dataset: 20 Newsgroups
Model         Topics  Pre    Rec    F1
Our Model     80      0.945  0.916  0.930
gMedLDA       50      0.869  0.869  0.868
vMedLDA       30      0.865  0.865  0.857
sLDA          60      0.805  0.812  0.799
DiscLDA       70      0.756  0.780  0.741
LDA           50      0.859  0.859  0.858
LDA+SVM       50      0.835  0.920  0.862
BTM           80      0.877  0.848  0.862
BTM+SVM       80      0.835  0.920  0.862
LDACOL        60      0.843  0.914  0.862
LDACOL+SVM    70      0.845  0.932  0.864
TNG           70      0.845  0.932  0.865
TNG+SVM       60      0.832  0.866  0.861
NTSeg         60      0.766  0.905  0.866
NTSeg+SVM     60      0.869  0.845  0.858
SVM           -       0.825  0.910  0.852

Dataset: OHSUMED-23
Model         Topics  Pre    Rec    F1
Our Model     70      0.496  0.915  0.643
gMedLDA       40      0.456  0.814  0.633
vMedLDA       60      0.489  0.821  0.629
sLDA          60      0.456  0.802  0.620
DiscLDA       70      0.402  0.735  0.587
LDA           40      0.465  0.801  0.626
LDA+SVM       40      0.463  0.798  0.631
BTM           60      0.422  0.767  0.610
BTM+SVM       40      0.545  0.776  0.622
LDACOL        50      0.534  0.742  0.630
LDACOL+SVM    50      0.534  0.744  0.625
TNG           60      0.432  0.711  0.623
TNG+SVM       60      0.442  0.710  0.620
NTSeg         40      0.531  0.779  0.634
NTSeg+SVM     40      0.522  0.765  0.623
SVM           -       0.483  0.903  0.630

Table 7.1: Classification results. Pre stands for Precision, Rec for Recall, and F1 for F-Measure. The "Topics" column lists the number of topics obtained from the tuning process.

7.3.2 Quantitative Results

We present our main classification results in Table 7.1. For each model, we report the number of topics obtained from the tuning process, followed by Precision (Pre), Recall (Rec) and F-Measure (F1). We observe that our model has outperformed all the comparative methods. In both datasets, our F-measure results are statistically significant based on the sign test with a p-value < 0.05 against each of the comparative methods. Maintaining the word order and considering extra side information helps improve classification results to a great extent. Since we capture the inherent word order semantics in the document, just like other structured unsupervised topic models, we obtain a state-of-the-art improvement over the comparative methods.
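The following is a minimal sketch of the paired sign test used for the significance claim above, written generically for paired per-fold (or per-class) F-measure values; it is not the exact script used in the thesis. Ties are discarded and the two-sided p-value is computed from the binomial distribution with p = 0.5.

from math import comb

def sign_test(scores_a, scores_b):
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses                       # ties are ignored
    k = min(wins, losses)
    # Two-sided p-value: probability of a result at least this extreme under p = 0.5.
    p = sum(comb(n, i) for i in range(0, k + 1)) / (2 ** n) * 2
    return min(p, 1.0)

print(sign_test([0.93, 0.91, 0.94, 0.92, 0.95, 0.90, 0.93, 0.92],
                [0.86, 0.85, 0.87, 0.86, 0.88, 0.84, 0.86, 0.85]))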

20 Newsgroups (F-measure)
Model         10     20     30     40     50     60     70     80     90     100
Our Model     0.783  0.843  0.899  0.914  0.922  0.921  0.925  0.930  0.927  0.924
gMedLDA       0.424  0.694  0.826  0.859  0.868  0.866  0.858  0.869  0.852  0.850
vMedLDA       0.245  0.667  0.857  0.852  0.843  0.831  0.818  0.802  0.789  0.777
sLDA          0.301  0.505  0.578  0.789  0.800  0.799  0.766  0.698  0.653  0.493
DiscLDA       0.245  0.452  0.643  0.654  0.701  0.743  0.741  0.699  0.636  0.545
LDA           0.410  0.683  0.816  0.849  0.858  0.856  0.848  0.859  0.842  0.840
LDA+SVM       0.752  0.802  0.827  0.837  0.862  0.844  0.850  0.851  0.842  0.839
BTM           0.715  0.775  0.831  0.846  0.854  0.853  0.857  0.862  0.859  0.856
BTM+SVM       0.552  0.602  0.807  0.816  0.849  0.857  0.863  0.862  0.856  0.787
LDACOL        0.601  0.633  0.701  0.699  0.843  0.862  0.854  0.833  0.765  0.799
LDACOL+SVM    0.545  0.601  0.812  0.824  0.834  0.859  0.864  0.851  0.855  0.799
TNG           0.552  0.615  0.803  0.819  0.831  0.857  0.865  0.835  0.803  0.772
TNG+SVM       0.556  0.612  0.816  0.824  0.835  0.861  0.866  0.859  0.862  0.845
NTSeg         0.601  0.612  0.654  0.670  0.840  0.866  0.845  0.756  0.722  0.626
NTSeg+SVM     0.646  0.640  0.745  0.801  0.855  0.858  0.806  0.703  0.603  0.515

OHSUMED-23 (F-measure)
Model         10     20     30     40     50     60     70     80     90     100
Our Model     0.597  0.600  0.605  0.644  0.642  0.642  0.643  0.643  0.644  0.642
gMedLDA       0.543  0.555  0.580  0.633  0.621  0.613  0.588  0.590  0.574  0.534
vMedLDA       0.542  0.556  0.552  0.558  0.585  0.629  0.632  0.611  0.589  0.534
sLDA          0.543  0.545  0.512  0.555  0.534  0.620  0.613  0.603  0.603  0.585
DiscLDA       0.503  0.502  0.512  0.507  0.532  0.611  0.587  0.575  0.545  0.543
LDA           0.545  0.593  0.565  0.626  0.611  0.615  0.601  0.599  0.546  0.600
LDA+SVM       0.542  0.585  0.556  0.631  0.605  0.610  0.587  0.585  0.535  0.598
BTM           0.546  0.590  0.594  0.630  0.630  0.610  0.576  0.554  0.523  0.554
BTM+SVM       0.511  0.545  0.578  0.622  0.625  0.613  0.572  0.553  0.526  0.524
LDACOL        0.513  0.575  0.565  0.631  0.630  0.601  0.569  0.523  0.514  0.515
LDACOL+SVM    0.499  0.504  0.560  0.631  0.625  0.601  0.567  0.522  0.512  0.531
TNG           0.523  0.572  0.554  0.610  0.625  0.623  0.621  0.524  0.552  0.520
TNG+SVM       0.524  0.573  0.550  0.606  0.622  0.620  0.622  0.527  0.543  0.519
NTSeg         0.524  0.579  0.560  0.634  0.629  0.598  0.554  0.515  0.512  0.555
NTSeg+SVM     0.516  0.560  0.554  0.623  0.612  0.584  0.498  0.515  0.513  0.525

Table 7.2: The effect of the number of topics on document classification, measured by F-measure.

In Table 7.2, we study the effect of the number of topics on document classification performance, as measured by F-measure, when we vary the number of topics from 10 to 100 for the parametric topic models. Starting from L = 10 in the 20 Newsgroups dataset, we see that our model does not perform very well in the beginning; nevertheless, it still outperforms the other topic models. Our model performs very well for L ≥ 30. Similarly, in the OHSUMED-23 dataset, our model does not perform well for L ≤ 30, but it still outperforms the other topic models, and it then gains a good improvement as we increase the number of latent topics.

[Figure: prediction time in CPU-seconds plotted against the number of topics (10 to 50), for the OHSUMED-23 (left) and 20 Newsgroups (right) datasets.]

Figure 7.4: A plot showing the prediction time in CPU-seconds on the classification datasets for various models. The curves correspond to our proposed model, gMedLDA, vMedLDA, sLDA, DiscLDA, LDA, LDACOL, TNG and NTSeg.

It can be observed that the performance of the unsupervised n-gram topic models cannot be discarded. For example, the recently proposed NTSeg has done well compared to the other unsupervised topic models in the 20 Newsgroups dataset, and a similar pattern is observed in the OHSUMED-23 dataset. This suggests that word order plays an important role in influencing document classification performance. Our model can outperform the other comparative methods because it inherits the advantages of the n-gram unsupervised topic models as well as the supervised maximum-margin framework. Note that, as exemplified in [117] and many other works which follow word order, the computational complexity of models that follow word order is generally higher than that of their bag-of-words counterparts. Nevertheless, word order structure models have shown superior performance to the bag-of-words models [117]. Several attempts have been made recently to speed up the inference procedures for both supervised and unsupervised topic models, such as [300], [301], [208].

Figure 7.4 presents the total prediction time for all the topic models on the testing sets of the two document collections. We follow this computation strategy from

[123]. All models are implemented in the C programming language and were run on a machine with 16 GB of main memory and a dual-core Intel processor with a clock speed of 3.33 GHz. Since the number of topics has a direct impact on the prediction time, for a fair comparison we depict different numbers of topics and compare the prediction time on the testing set. The CPU seconds are averaged over five runs in each fold, and then averaged over all the five folds. One can note from the results that although our model follows word order, it almost matches the testing times of the unigram based supervised topic models most of the time. This shows that maintaining word order does not have an adverse effect on the prediction time of our model.

7.3.3 Qualitative Results

We qualitatively compare our model with some related n-gram and supervised topic models, including BTM [252], LDACOL [87], TNG [261], PDLDA [166], NTSeg [117], and MedLDA [298]. We present the top five most representative words from a topic describing a semantically similar theme from each model. The criteria for choosing the words from each topic are as follows. The sampler of each model was run for 1000 iterations. Then the perplexity at each latent topic value L was observed. We chose the words from the L that gave the lowest perplexity at the end of 1000 iterations. We chose the documents from the comp.graphics class for the qualitative experiments as adopted in [298]. From the results shown in Table 7.3, we can make two observations. First,

BTM | LDACOL | TNG | PDLDA | NTSeg | MedLDA | Our Model
compgraph path xref vga mode excel digit surface normal path bitmap draw xref compgraph compgraph routine remove orient message id routing video memory system distribution compgraph path pixmap public domain corporate college simple routing problem solving mark public domain draw line copyright date color gif fast purpose compgraph subject credit message id make group sender package zip

Table 7.3: Top five probable words from a topic from the comp.graphics class of the 20 Newsgroups dataset.

our model generates more fine-grained topical words compared to the other topic models. Second, our model generates more interpretable latent topics

than the other models. Words such as “video memory” appear more appealing to a reader, whereas the words from the other models in some instances make little sense. The other models tend to generate ambiguous n-grams, or unigrams which do not offer much understanding to the user. The phrase discovery topic model, PDLDA, fails to capture better phrases primarily because it assumes that documents contain many phrases, and it tends to underperform on small documents which have fewer phrasal terms, such as those in the comp.graphics dataset.

In Figures 7.2 and 7.3, we illustrate the discriminative power of our model in comparison to the MedLDA model. In order to construct these plots, we follow a similar procedure as described in [298]. We can see from the plots that our model leads to sharper, sparser and faster decaying per-class distributions over topics than the MedLDA model using Gibbs sampling. The plots show that our model has better discriminative power to differentiate between topics than its bag-of-words counterpart.

7.4 Closing Remarks

We have presented supervised topic models which maintain the word order structure in the document. We have proposed a bigram supervised topic model with a maximum-margin framework, and compared the performance of the model with several existing comparative methods. We saw through empirical analysis that our model outperforms many comparative methods. We can see from the experimental results that word order has helped improve document classification results considerably.

Chapter Eight

Document Retrieval Learning Models

Chapter Summary

In this chapter, we will present topic models for the document retrieval learning problem, which can essentially be cast as a learning-to-rank problem. Learning-to-rank models make use of the available relevance judgment information of a document for a query in the training process. The task is then to predict a desired ordering of documents. Several learning-to-rank models have been introduced, but none of them consider the similarity between the document and the query under a low-dimensional topic space within the topic model itself. We thus introduce a topic similarity feature in the learning-to-rank framework in a unified model.

8.1 The Case for the Supervised Document Retrieval Learning Topic Model

Unsupervised topic models such as TNG and LDA have been used in developing document retrieval models [261], [266], but they have not been explored for document retrieval learning, which can essentially be cast as a learning-to-rank problem. Learning-to-rank models make use of the available relevance judgment information of a document for a query in the training process. The task is then to predict a desired ordering of documents. Several learning-to-rank models have been introduced, but none of them consider the similarity between the document and the query under a low-dimensional topic space within the topic model itself.

Most learning-to-rank models have considered two types of document features: high-level features and low-level features, as described in the Background section (Chapter 3). What the learning-to-rank models do not consider is the latent topic feature. The motivation is that latent topic models have been shown to improve document retrieval performance in a traditional setting [261], [266], but have not been investigated under a learning-to-rank, or document retrieval learning, setting. We hope that if a learning-to-rank framework considers the latent topic information when learning a ranking function, there will be significant improvements over the state-of-the-art models.

Our contribution is that we propose a new supervised topic model for document retrieval learning which can be regarded as a pointwise model for tackling the learning-to-rank task. Available relevance assessments and word order structure are integrated into the topic model itself. We jointly model the similarity between the query and the document under a low-dimensional topic space in a maximum-margin framework. We conduct extensive experiments on several publicly available benchmark datasets, and show that our model improves upon the state-of-the-art models. One major difference between our model and existing learning-to-rank models is that existing learning-to-rank models do not consider latent topic information in the learning framework.

8.2 Model Description

We investigate a supervised low-dimensional latent topic model for document retrieval learning. Suppose that some relevance assessments of documents for some queries are available for training. Our goal is to learn a model that can predict the relevance of an unseen test query-document pair, and rank the documents based on the predicted relevance score. This problem setting is similar to the pointwise learning-to-rank problem. Manual relevance assessments can be modeled as a response variable in our topic model. In addition, the word order structure of the text content is also considered. The graphical model of our proposed document retrieval learning model is depicted in Figure 8.1. This model can be regarded as a basic document retrieval learning topic model, upon which we will build further to incorporate word order information.

[Graphical model: Dirichlet priors α and β; topic proportions θ^d and θ^q; topic assignments z and words w for the N_d words of each of the D documents and the N^q words of each of the Q queries; topic-word distributions φ over the L topics; and the response variable y(d,q), with classifier parameters η over the M classes, observed for the c(d,q) query-document pairs.]

Figure 8.1: The graphical model of our document retrieval learning model where the order of words is not maintained. This graphical model is the simplest illustration of our model. The model comprises two plates, where one plate holds the document information and the other is the query plate. The latent topic information of the document and the query are connected by the relevance label, which is supplied as a response variable.

Based on past empirical evaluations, we expect this topic model not to give better results because it loses important contextual information that is inherent in the document.

Similar to our proposed document classification model described in Chapter 7, there are two main components in our document retrieval learning model. One component is a topic model which measures the goodness of fit of the text content of documents and queries. Queries are modeled as short documents in a similar manner as in [271]. The second component deals with the relevance prediction within a maximum-margin framework. The dataset can be represented as ((d, q), y(d,q)), composed of query-document pairs (d, q) along with the relevance assessment label y(d,q), which signifies the relevance of the document d to the query q. Let c(d, q) be the total number of query-document pairs in the training set. Let the number of documents in the training set be D and the number of queries in the training set be Q. As adopted in [189], the confidence scores obtained from the discriminant function are used to rank documents in our proposed model. Let the words in the document d be represented by w^d and the words in the query q be represented by w^q. Let the set of topics used in the document d be represented as z^d, and the set of topics in the query q be represented by z^q. We describe the generative process of our simplest model as follows:

1. For each topic z = 1, ..., L

   (a) Draw φ_z from Dirichlet(β)

2. For each document d in the collection D

   (a) Draw the topic proportions θ^d for document d from Dirichlet(α)

   (b) For each word w_i^d in the document d

       i. Draw the topic assignment z_i^d from Multinomial(θ^d)

       ii. Draw the word w_i^d from Multinomial(φ_{z_i^d})

3. For each query q in the query collection Q

   (a) Draw the per-query topic proportions θ^q from Dirichlet(α)

   (b) For each word w_i^q in the query q

       i. Draw the topic assignment z_i^q from Multinomial(θ^q)

       ii. Draw the word w_i^q from Multinomial(φ_{z_i^q})

4. For each pair of document d and query q

   (a) Draw the response variable y_(d,q) | z_i^d, z_i^q, (d, q), η according to Equations 8.1 to 8.3.
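The following is a minimal sketch of the generative story above for the simplest model: shared topic-word distributions φ, per-document and per-query topic proportions θ, and a word drawn from the topic of each position. The max-margin response y(d,q) of Equations 8.1 to 8.3 is left out for brevity, and all sizes and hyperparameter values are toy choices, not those used in the thesis.

import numpy as np

rng = np.random.default_rng(1)
L, W = 3, 20            # number of topics and vocabulary size
alpha, beta = 0.1, 0.01

phi = rng.dirichlet([beta] * W, size=L)          # step 1: one word distribution per topic

def generate(length):
    theta = rng.dirichlet([alpha] * L)           # step 2(a) / 3(a): topic proportions
    z = rng.choice(L, size=length, p=theta)      # step (b)i: topic assignment per word
    w = np.array([rng.choice(W, p=phi[k]) for k in z])   # step (b)ii: draw each word
    return z, w

doc_topics, doc_words = generate(length=12)      # a document
qry_topics, qry_words = generate(length=3)       # a query, modelled as a short document
print(doc_words, qry_words)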

We will also present two other models in which we maintain the order of words. We will present the derivations for our most complex model, i.e. the third model; it would not be very difficult to derive the schemes for our other two models. In Figure 8.2, we present another model where we relax the order of words in the queries. We wish to study whether word order in short documents has any effect on the empirical results.

The generative process of our second model shown in Figure 8.2 is as follows:

1. For each topic z = 1, ..., L

   (a) Draw φ_zw from Dirichlet(β)

2. For each document d in the collection D

   (a) Draw the per-document topic proportions θ^d from Dirichlet(α)

   (b) For each word w_i^d in the document d

[Graphical model: as in Figure 8.1, but with word order maintained in the documents through the chain of topic assignments z_{i−1}^d, z_i^d, z_{i+1}^d over the words w_{i−1}^d, w_i^d, w_{i+1}^d; the queries remain bags of words w_i^q with topics z_i^q.]

Figure 8.2: Graphical model in plate notation of our document retrieval learning model where the order of words in the queries is relaxed. The model only maintains the order of words in the documents. Queries are mostly short, so it is interesting to study whether short documents have an impact on the empirical results.

       i. Draw the topic assignment z_i^d from Multinomial(θ^d)

       ii. Draw the word w_i^d from Multinomial(φ_{z_i^d, w_{i−1}^d})

3. For each query q in the query collection Q

   (a) Draw the per-query topic proportions θ^q from Dirichlet(α)

   (b) For each word w_i^q in the query q

       i. Draw the topic assignment z_i^q from Multinomial(θ^q)

       ii. Draw the word w_i^q from Multinomial(φ_{z_i^q})

4. For each pair of document d and query q

   (a) Draw the relevance label y_(d,q) | (z_i^d, z_i^q, (d, q), η) according to Equations 8.1 to 8.3.

We now describe the third variant of our model. This model follows the word order in both the query and the document. The graphical model is shown in Figure 8.3.

We describe the generative process of our third model as follows:

1. For each topic z = 1, ..., L

   (a) Draw φ_zw from Dirichlet(β)

2. For each document d in the collection D

   (a) Draw the per-document topic proportions θ^d from Dirichlet(α)

   (b) For each word w_i^d in the document d

[Graphical model: word order is maintained in both the documents (chain z_{n−1}^d, z_n^d, z_{n+1}^d over w_{n−1}^d, w_n^d, w_{n+1}^d) and the queries (chain z_{n−1}^q, z_n^q, z_{n+1}^q over w_{n−1}^q, w_n^q, w_{n+1}^q); the remaining variables are as in Figure 8.1.]

Figure 8.3: Graphical model of our document retrieval learning model with word order. The bag-of-words assumption is relaxed in both the queries and the documents.

       i. Draw the topic assignment z_i^d from Multinomial(θ^d)

       ii. Draw the word w_i^d from Multinomial(φ_{z_i^d, w_{i−1}^d})

3. For each query q in the query collection Q

   (a) Draw the per-query topic proportions θ^q from Dirichlet(α)

   (b) For each word w_i^q in the query q

       i. Draw the topic assignment z_i^q from Multinomial(θ^q)

       ii. Draw the word w_i^q from Multinomial(φ_{z_i^q, w_{i−1}^q})

4. For each pair of document d and query q

   (a) Draw the relevance label y_(d,q) | (z_i^d, z_i^q, (d, q), η) according to Equations 8.1 to 8.3.

The maximum-margin framework for relevance prediction follows the formulation in [123]. However, in our model, each input data instance consists of a pair of a document and a query instead of a single document. Also, in contrast to [123], the formulation of the response variable representing the relevance prediction is different.

The discriminant function of our model is designed as follows:

F (y, η, (d, q)) = η⊺f(y, (d, q)) (8.1)

where η represents the model parameters, which are essentially feature weights, and f(y, (d, q)) is a vector of features which are designed to be useful for retrieval. We design seven features as depicted in Table 8.1. c(w_i^q, d) is the number of times the word w_i^q appears in the document d, and N^q is the number of words in the query q.

1. Σ_{w_i^q ∈ q∩d} log( c(w_i^q, d) + 1 )
2. Σ_{w_i^q ∈ q∩d} log( 1 + c(w_i^q, d) / |d| )
3. Σ_{w_i^q ∈ q∩d} log( idf(w_i^q) )
4. Σ_{w_i^q ∈ q∩d} log( |D| / c(w_i^q, d) + 1 )
5. Σ_{w_i^q ∈ q∩d} log( 1 + (c(w_i^q, d) / |d|) · idf(w_i^q) )
6. Σ_{w_i^q ∈ q∩d} log( 1 + (c(w_i^q, d) / |d|) · (|D| / c(w_i^q, D)) )
7. Topic Similarity Feature: cosine(v^d, v^q)

Table 8.1: Features used in our discriminant function in our document retrieval learning model.

|·| denotes the length of the document or the query, and idf is the inverse document frequency. The last feature, called the topic similarity feature, is a similarity measure between the topics of the query and the document in the low-dimensional topic space. We first compute the average topic assignment, as described in [298], separately for the document and the query. Let v^d and v^q be the topic vectors for the document d and the query q respectively. This feature is formulated as the cosine similarity of v^d and v^q, denoted by cosine(v^d, v^q). The first six features have also been used in [189], [274].
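The following is a minimal sketch of computing the feature vector of Table 8.1 for one query-document pair, including the topic similarity feature. The idf table, the collection counts and the averaged topic vectors v_d and v_q are assumed to be precomputed, and the exact form of features 4 and 6 follows the reconstruction of the table above, so the sketch inherits those assumptions.

import math
import numpy as np

def features(query, doc, idf, num_docs, coll_count, v_d, v_q):
    d_len = len(doc)
    counts = {w: doc.count(w) for w in set(query) if w in doc}   # c(w, d) for w in q ∩ d
    f = [0.0] * 7
    for w, c in counts.items():
        f[0] += math.log(c + 1)
        f[1] += math.log(1 + c / d_len)
        f[2] += math.log(idf[w])
        f[3] += math.log(num_docs / c + 1)
        f[4] += math.log(1 + (c / d_len) * idf[w])
        f[5] += math.log(1 + (c / d_len) * num_docs / coll_count[w])
    # Feature 7: cosine similarity of the averaged topic vectors of d and q.
    f[6] = float(np.dot(v_d, v_q) / (np.linalg.norm(v_d) * np.linalg.norm(v_q)))
    return f

doc = "latent topic models for ranking text documents".split()
query = "topic ranking".split()
idf = {w: 2.0 for w in query}; coll_count = {w: 5 for w in query}
print(features(query, doc, idf, num_docs=1000, coll_count=coll_count,
               v_d=np.array([0.2, 0.5, 0.3]), v_q=np.array([0.1, 0.6, 0.3])))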

We can now take the expectation of the discriminant function in Equation 8.1 as follows:

F(y, (d, q)) = E[F(y, η, (d, q))]    (8.2)

The prediction rule is given in Equation 8.3.

ŷ = argmax_y F(y, (d, q))    (8.3)

The following maximum-margin constraints are imposed:

F (y(d,q), (d, q)) − F (y, (d, q)) ≥ l(d,q)(y) − ξ(d,q), ∀y ∈ Y, ∀(d, q) (8.4)

where l_(d,q)(y) is a non-negative loss function and ξ_(d,q) are non-negative slack variables which are meant for inseparable data instances. Let Z^d = {z^d}_{d=1}^{D} be the topic assignments to all the words of the training documents; Z^q = {z^q}_{q=1}^{Q} be the topic assignments to all the words in the training queries; Θ^d = {θ^d}_{d=1}^{D} be the topic distributions for all training documents; Θ^q = {θ^q}_{q=1}^{Q} be the topic distributions for all training queries; and Φ = {φ_{kv}}_{v,k=1}^{W,K} be the word-topic distribution. C is a positive regularization constant. Let b^d and b^q denote {b^d_{n,n+1}}_{n=1}^{N^d−1} and {b^q_{n,n+1}}_{n=1}^{N^q−1}, where b^d_{n,n+1} and b^q_{n,n+1} denote the words at positions n and n+1 in the document d and the query q respectively. Similarly, let B^d = {b^d}_{d=1}^{D} and B^q = {b^q}_{q=1}^{Q} denote the word order information for the entire document collection and the query set respectively. The soft-margin framework for our model is described below:

minimize over P(η, Θ^d, Θ^q, Z^d, Z^q, Φ) ∈ P and ξ:

KL[ P(Θ^d, Θ^q, Z^d, Z^q, Φ) || P_0(Θ^d, Θ^q, Z^d, Z^q, Φ) ] − E_P[ log P(B^d, B^q | Θ^d, Θ^q, Z^d, Z^q, Φ) ] + (C / c(d, q)) Σ_{(d,q)} ξ_(d,q)    (8.5)

subject to E_P[ η^⊺ ( f(y_(d,q), d, q) − f(y, d, q) ) ] ≥ l_(d,q)(y) − ξ_(d,q),   ξ_(d,q) ≥ 0,   ∀(d, q), ∀y

8.2.1 Posterior Inference

In order to proceed with the derivation of the collapsed Gibbs sampling, we need to define a joint distribution for words and the topics along with the regularization effects due to the maximum-margin posterior constraints. This joint distribution is written as:

P(Z^d, B^d, Z^q, B^q | α, β) = P(B^d | Z^d, β) × P(B^q | Z^q, β) × P(Z^d | α) × P(Z^q | α) × exp( κ^{(∗)⊺} Σ_{y=1}^{M} (λ^y_(d,q))^∗ ( f(y_(d,q), (d, q)) − f(y, (d, q)) ) )    (8.6)

κ in the case of document retrieval is:

κ = Σ_{(d,q)} Σ_{y=1}^{M} λ^y_(d,q) ( f(y_(d,q), (d, q)) − f(y, (d, q)) )    (8.7)

After some manipulations, we can come up with the following update equation:

P(Z^d, Z^q | B^d, B^q, Z^d_¬i, Z^q_¬i, α, β) ∝

[ (α_{z_i^d} + m^d_{z_i^d w_i^d} − 1) / (Σ_{z=1}^{K} (α_z + m^d_z) − 1) ] × [ (α_{z_i^q} + m^q_{z_i^q w_i^q} − 1) / (Σ_{z=1}^{K} (α_z + m^q_z) − 1) ] × exp( κ^{(∗)⊺} Σ_{y=1}^{M} (λ^y_(d,q))^∗ ( f(y_(d,q), (d, q)) − f(y, (d̂, q̂)) ) ) × [ (β_{w_i^d} + m^d_{z_i^d w_i^d w_{i−1}^d} − 1) / (Σ_{v=1}^{W} (β_v + m^d_{z_i^d w_i^d v}) − 1) ] × [ (β_{w_i^q} + m^q_{z_i^q w_i^q w_{i−1}^q} − 1) / (Σ_{v=1}^{W} (β_v + m^q_{z_i^q w_i^q v}) − 1) ]    (8.8)

where m_{zwv} is the number of times the word w is generated by the topic z when preceded by the word v, and is applicable to a document or a query when superscripted by d or q respectively. m_{zw} is the number of times a word w has been sampled in the topic z, and is likewise applicable to a document or a query when superscripted by d or q respectively.

One can argue that asymmetric priors may work better, especially on short documents such as queries. Many previous works on short documents have assumed asymmetric priors in their topic models, such as [277], [95], [105], [216], [124], but some have adopted symmetric priors too [110], which have been shown to give good results. Our model is flexible enough to accommodate asymmetric priors, but in this thesis we only test our model using symmetric priors for simplicity. In [189] the

author discussed some shortcomings of discriminative models for IR, in particular the handling of out-of-vocabulary words. The author also suggested a few ways of dealing with those shortcomings, and we follow those strategies in this work.

8.2.2 Ranking Unseen Documents

The prediction task on testing data using the prediction rule given in Equation 8.3 can be realized as follows. Let (q_new, d_new) be an unseen test query-document pair for which we need to predict the relevance label. The task is to compute the latent topic representations of q_new and d_new using the topic space that has been learned from the training data. These latent components for the unseen query and document can be obtained from Φ̂, which is the maximum a posteriori estimate of P(Φ) computed during the training process. Suppose there are J samples from a proposal distribution; Φ̂ is obtained using the samples from the following equation:

φ̂_{zwv} ∝ (1/J) Σ_{j=1}^{J} ( β_{w_i^d} + m^{d(j)}_{z_i^d w_i^d w_{i−1}^d} − 1 ) × ( β_{w_i^q} + m^{q(j)}_{z_i^q w_i^q w_{i−1}^q} − 1 )    (8.9)

where the counts are those assigned in the j-th sample. The latent components for the unseen document and the query can be computed as follows.

P(Z^{d_new}, Z^{q_new} | B^{d_new}, B^{q_new}, Z^{d_new}_¬i, Z^{q_new}_¬i, α, β) ∝

φ̂_{z_i^{d_new} w_i^{d_new} w_{i−1}^{d_new}} ( α_{z_i^{d_new}} + m_{z_i^{d_new}} − 1 ) × φ̂_{z_i^{q_new} w_i^{q_new} w_{i−1}^{q_new}} ( α_{z_i^{q_new}} + m_{z_i^{q_new}} − 1 )    (8.10)

where the count for the word being sampled is excluded. We compute the similarity between the query and the document in the latent topic space. Note that y(d,q) can be dropped during the prediction step. Then the expectation statistics can be

approximated as described in [123]. Once the similarity score has been computed, it can be used in Equation 8.1 to compute the prediction score. Documents can then be ranked based on this confidence score.
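The following is a minimal sketch of the ranking step just described: score each candidate document for a test query with the expected discriminant function F(y, (d, q)) = E[η]^⊺ f(y, (d, q)) and sort by that confidence score. `feature_fn` and `eta_mean` are hypothetical stand-ins for the learned feature extractor and the expected classifier parameters.

import numpy as np

def rank_documents(query, docs, eta_mean, feature_fn):
    scores = {doc_id: float(np.dot(eta_mean, feature_fn(query, doc)))
              for doc_id, doc in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage with two-dimensional features: [query-term overlap, topic similarity proxy].
docs = {"d1": "topic models for retrieval", "d2": "cooking recipes"}
feature_fn = lambda q, d: np.array([len(set(q.split()) & set(d.split())),
                                    1.0 if "topic" in d else 0.0])
print(rank_documents("topic retrieval", docs, eta_mean=np.array([0.7, 0.3]),
                     feature_fn=feature_fn))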

8.3 Retrieval Learning Experiments

8.3.1 Experimental Setup

We conduct extensive experiments on document retrieval learning using existing benchmark test collections. We also compare with many related comparative methods. In addition, we carry out a qualitative analysis showing how our model generates better topical words. In all our experiments for topic models, we run the sampler for 1000 iterations. We have also removed stopwords1 and performed stemming using Porter’s stemmer2. Five-fold cross validation is used. Each model is run five times, and we take the average of the results obtained over all the runs and all the folds.

We use three popular test collections for our experiments. We used the benchmark OHSUMED test collection (latest version3) from the LETOR [210] dataset. This dataset consists of 45 comprehensive features along with query-document pairs and their relevance judgments, and it has been used extensively in evaluating several learning-to-rank algorithms. We obtained the raw documents and queries of this dataset from the web4 in order to get the word order. The LETOR OHSUMED dataset contains the document-id along with the list of features, which helps us

1 http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
2 We also tested the models without performing stemming. We found that stemmed collections fared better.
3 Minka et al. [182] and some other researchers had pointed out a few shortcomings in the earlier LETOR releases.
4 http://ir.ohsu.edu/ohsumed/

relate which set of features comes from which document. In this dataset, our proposed feature, i.e. the topic similarity feature, is treated as one feature in addition to the existing 45 features. It has approximately 60% of the query-document pairs in the training set, 20% in the validation set, and the rest in the testing set in each of the five folds. For a particular fold, the queries involved in the training, validation, and testing sets are different. The validation set is used by the comparative learning-to-rank models for parameter tuning and for determining the number of iterations. Our second collection is AQUAINT-1, used during TREC HARD5. Basic details about this dataset can be found in [4]. Note that we only consider document-level relevance assessments in AQUAINT-1, and leave out the passage-level judgments. The third dataset is WT2G6, along with the standard relevance judgments and topics (401-450) obtained from the TREC site. Since our problem setting is similar to the pointwise learning-to-rank problem, in order to create the training, testing and validation sets for AQUAINT-1 and WT2G, we adopted the strategies popularly used in learning-to-rank problems. We chose the same percentage of query-document pairs in the training, testing and validation sets in each fold as in the LETOR OHSUMED dataset. The features used for the AQUAINT-1 and WT2G datasets are given in Table 8.1. Note that only the number of features differs between the datasets that we generated (WT2G and AQUAINT-1) and LETOR OHSUMED. Based on our proposed model, we also investigate another variant, called Variant 1, which we will test empirically and show its performance. In this model we ignore the word order structure in queries, but maintain the word order structure in documents; the reason is that queries are mostly short. We use NDCG@5 and NDCG@10 as our metrics, similar to the metrics used in [36]. NDCG is well suited for our task because it is defined with an explicit position discount factor and it can leverage judgments in terms of multiple ordered categories. In order to determine the number of

topics K, the parameter C, and the loss constant function l_(d,q)(y) in our model, we use the validation set. We first train our model on the training set, and measure

5 http://ciir.cs.umass.edu/research/hard/guidelines2003.html
6 http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html

NDCG@5 and NDCG@10 performance on the validation set. The number of topics and the model parameters can be determined from the validation process. We then test our model using the testing set. After the tuning process, we came up with

K = 250 for NDCG@5, K = 190 for NDCG@10, C = 100, and l_(d,q)(y) = 25 in AQUAINT-1. Similarly, in WT2G, we obtained K = 250 for NDCG@5 and

K = 170 for NDCG@10, C = 122, and l_(d,q)(y) = 23. Note that, for brevity, we present only the average C and l_(d,q)(y) values over the five folds for each dataset. In OHSUMED, we obtained K = 110 for both NDCG@5 and NDCG@10 for Our Model, and the same number of topics for Variant 1. For Variant 1, we came up with K = 230 for NDCG@5 and K = 190 for NDCG@10 in AQUAINT-1, and K = 250 for NDCG@5 and K = 110 for NDCG@10 in WT2G. We have again set a weak β prior of 0.01. We also found that varying the value of the hyperparameter does not drastically affect the results, and this finding is consistent with [260]. The experimental results are averaged over the five folds for all the models.
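For reference, the following is a minimal sketch of NDCG@k with the standard logarithmic position discount; the exact gain and discount variant used in the thesis experiments may differ in small details.

import math

def dcg_at_k(relevances, k):
    # Graded relevance labels of the documents in ranked order.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([2, 1, 0, 2, 0, 1], k=5), ndcg_at_k([2, 1, 0, 2, 0, 1], k=10))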

We compare the performance of our model with a range of comparative methods, including popular learning-to-rank models in RankLib7 such as MART [73], RankNet [31], AdaRank [274], Coordinate Ascent [180], LambdaRank [32], LambdaMART [270], ListNet [39], and Random Forests [28], which is a popular pointwise learning-to-rank model. In addition, we used Ranking SVM [127]8 and SVMMAP [286]9. The first six features in Table 8.1 are also used in these comparative methods, as in [189], for learning (the first 45 features in the case of LETOR OHSUMED). Note that the seventh feature (or the 46th in the case of LETOR OHSUMED) involves latent topic information which cannot be used in the comparative methods. In order to conduct the experiments for the comparative learning-to-rank models, we followed the standard learning-to-rank experimental procedure for each comparative method. Some models have standard published parameter values; for example, for LETOR OHSUMED, the values for

7 http://people.cs.umass.edu/~vdang/ranklib.html
8 http://olivier.chapelle.cc/primal/ranksvm.m
9 http://projects.yisongyue.com/svmmap/

Ranking SVM10 and SVMMAP11 are available online.

Note that we do not choose any unsupervised topic model for comparison, primarily because such models cannot make use of relevance judgment information during the training process. Thus they are always at a disadvantage when compared with learning-to-rank methods and our model, which explicitly uses the information of relevance labels during the training process. Also, supervised topic models such as sLDA cannot be directly used for comparison, as one needs to make significant changes to such a model to handle the document retrieval learning problem. In addition, learning-to-rank models have already shown state-of-the-art results in this task, and thus they can be regarded as strong comparative methods. Our model does not directly use word proximity features in the learning setup [173]. What our model does is to use word order for finding the best model to fit the data, as it has been shown in the literature that topic models with word order improve model selection [117], [134]. Such proximity features have indeed helped improve learning-to-rank performance, but in this work our objective is to present the robustness of our model.

8.3.2 Quantitative Results

We present the results obtained from all three test collections in Table 8.2. From the results, we can see that our model outperforms all the comparative methods. The improvements that we obtain are statistically significant according to the Wilcoxon signed rank test (with 95% confidence) against each of the comparative methods in Table 8.2 on all the datasets. Our results show that the latent topic information generated by our model, which is then used to compute the query-document similarity, plays a significant role. Word order too plays a role, as we are able to detect better topics than the unigram models.

10 http://research.microsoft.com/en-us/um/beijing/projects/letor/baselines/ranksvm-primal.html
11 http://www.yisongyue.com/results/svmmap_letor3/details.html

                     OHSUMED             AQUAINT-1           WT2G
Models              NDCG@5   NDCG@10   NDCG@5   NDCG@10   NDCG@5   NDCG@10
Our Model           0.483    0.461     0.454    0.460     0.511    0.511
Variant 1           0.474    0.460     0.450    0.452     0.448    0.473
MART                0.420    0.403     0.355    0.380     0.433    0.447
RankNet             0.395    0.384     0.401    0.408     0.313    0.370
RankBoost           0.424    0.436     0.423    0.433     0.395    0.401
AdaRank             0.469    0.445     0.406    0.418     0.409    0.431
Coordinate Ascent   0.472    0.455     0.422    0.438     0.433    0.450
LambdaRank          0.354    0.278     0.218    0.287     0.211    0.217
ListNet             0.443    0.441     0.366    0.376     0.245    0.226
Random Forests      0.380    0.411     0.415    0.421     0.422    0.432
Ranking SVM         0.461    0.454     0.365    0.387     0.367    0.372
LambdaMART          0.447    0.437     0.352    0.367     0.326    0.367
SVMMAP              0.453    0.432     0.389    0.401     0.406    0.415

Table 8.2: The performance of our model in comparison to other learning-to-rank models.

One interesting facet is to study the effect of the number of topics in the document retrieval learning experiments for our models. In Table 8.3, we vary the number of topics from 50 to 290 in steps of 20 and present the results. In the OHSUMED dataset, we can see that as we increase the number of topics, the results improve up to a certain number of topics and then begin to deteriorate as we keep increasing it. This gives us an insight into the dependence between the number of topics and the retrieval learning results for our models. We can also see that our Variant 1 model does not perform very well in most cases, suggesting that word order plays a dominant role. A similar trend is observed in the AQUAINT-1 and WT2G collections, where the results improve up to a certain point and then begin to deteriorate. Our Variant 1 model remains less effective than our proposed model.

It is quite interesting to see that our model outperforms some powerful learning-to-rank models. Our model performs consistently well both with more features (in LETOR OHSUMED) and with fewer features (in WT2G and AQUAINT-1).

                    OHSUMED               AQUAINT-1             WT2G
Topics  Metric      Our Model  Variant 1  Our Model  Variant 1  Our Model  Variant 1
50      NDCG@5      0.345      0.344      0.382      0.381      0.395      0.302
        NDCG@10     0.356      0.345      0.386      0.382      0.401      0.315
70      NDCG@5      0.401      0.400      0.389      0.381      0.433      0.431
        NDCG@10     0.367      0.367      0.395      0.396      0.437      0.432
90      NDCG@5      0.455      0.446      0.401      0.411      0.415      0.408
        NDCG@10     0.406      0.410      0.437      0.439      0.452      0.446
110     NDCG@5      0.483      0.474      0.450      0.450      0.481      0.472
        NDCG@10     0.461      0.460      0.459      0.455      0.483      0.473
130     NDCG@5      0.481      0.481      0.451      0.451      0.495      0.481
        NDCG@10     0.461      0.463      0.459      0.459      0.499      0.485
150     NDCG@5      0.482      0.486      0.451      0.449      0.501      0.486
        NDCG@10     0.466      0.461      0.459      0.458      0.504      0.485
170     NDCG@5      0.473      0.463      0.451      0.449      0.523      0.492
        NDCG@10     0.461      0.456      0.460      0.459      0.511      0.495
190     NDCG@5      0.465      0.459      0.452      0.448      0.523      0.491
        NDCG@10     0.445      0.442      0.460      0.452      0.512      0.495
210     NDCG@5      0.462      0.456      0.452      0.447      0.525      0.493
        NDCG@10     0.443      0.441      0.460      0.453      0.511      0.496
230     NDCG@5      0.441      0.420      0.455      0.450      0.524      0.485
        NDCG@10     0.430      0.413      0.460      0.452      0.520      0.486
250     NDCG@5      0.421      0.411      0.454      0.450      0.511      0.448
        NDCG@10     0.398      0.395      0.461      0.455      0.514      0.455
270     NDCG@5      0.423      0.401      0.452      0.448      0.492      0.432
        NDCG@10     0.398      0.365      0.460      0.455      0.499      0.441
290     NDCG@5      0.444      0.399      0.452      0.447      0.484      0.432
        NDCG@10     0.356      0.356      0.460      0.451      0.483      0.442

Table 8.3: Results obtained from our models when the number of topics is varied.

This shows that the generalization ability of our proposed model is very robust. The results suggest that incorporating topic similarity helps improve document retrieval performance. One reason why topic models help improve document retrieval performance is that we compare the similarity between the document and the query based on latent factors rather than just the words [266], [236]. Hence, this feature computed by our model is extremely important for the document retrieval task.

8.3.3 Qualitative Analysis

Our qualitative results are also very strong (the presentation of words follows the same procedure as described in Section 7.3.3). We can see from Table 8.4 that our model has generated words which appear more meaningful than those of the other models. We have

BTM | LDACOL | TNG | PDLDA | NTSeg | Our Model
foreign beggars today news corp foreign minister stevo today news viewership bt anton hebron www fundamental prerequisites atlanta foreign minister hk salem bosnian web jewish state restarts hk salem president nasser fundamental prerequisites foreign beggars news event reported exceptionally york times news service general news great stash atlanta york steaks york times news service bosnia resistance occurred

Table 8.4: Top five probable words from a topic from the AQUAINT-1 collection.

only considered words from the documents in order to present the results in this table. Our model uses the query-document relevance label for generating words.

8.4 Closing Remarks

In this chapter, we have presented our supervised topic models for the document retrieval learning task. The model takes as input query-document pairs, with relevance assessments given manually by annotators as the response variables. The model computes the query-document topic similarity in the low-dimensional latent topic space, which is then used as one of the features in the discriminative learning-to-rank framework. We saw through experimental analysis that our model outperformed many popular learning-to-rank models. We have also seen that relaxing the order of words in the queries degrades the performance of the model. Through qualitative analysis we showed how our model generates better topical words than the comparative methods.

Chapter Nine

Readability Prediction and Ranking

Chapter Summary

In this chapter, we propose a novel framework for determining the conceptual difficulty of a domain-specific text document without using any external lexicon. Conceptual difficulty relates to the reading difficulty of domain-specific documents. Previous approaches to tackling the domain-specific readability problem have relied heavily upon an external lexicon, which limits their scalability to other domains. Our model can be readily applied in domain-specific vertical search engines to re-rank documents according to their conceptual difficulty. We develop an unsupervised and principled approach for computing a term’s conceptual difficulty in the latent space.

9.1 The Case for the Terrain Models in LSI

Definition 6. Readability: It relates to the nature or quality of a text document that makes it easy or difficult to understand and read.

Definition 7. Domain: It is defined as a specific subject area in a particular field.

Definition 8. Domain-expert: A person who has a thorough background of a domain.

Definition 9. Domain-specific Readability: It relates to the nature or quality of a domain-specific text document that makes it easy or difficult to understand and read.

Definition 10. Domain-specific concept or word: It is a word or a phrase which has a specific meaning in a domain. For example, “Random access memory” is a domain-specific concept. It is also used as a technical term at some places in this thesis.

Definition 11. Conceptual Difficulty: It relates to the difficulty in understanding a domain-specific word.

Technical terms play a deep-seated role in determining the expertise level of a document. Consider two examples which illustrate this observation.

Example 1

1. A chemical bond is an interaction between atoms or molecules and allows the formation of polyatomic chemical compounds.

Example 2

2. Chemical bond refers to the forces holding atoms together to form molecules and solids.

We can see that the first text comprises domain-specific terms which are meant for experts in the field, whereas the second text explains the idea of a “chemical bond” in simple terms. We have highlighted some portions in both examples. In the first example, we note that most of the highlighted words may appear difficult to a beginner as they are domain-specific terms. In order to grasp what the definition of a “chemical bond” means in the first example, one needs to have some background knowledge in Chemistry. In contrast, the second definition is for general readers who have little or no background knowledge in Chemistry. We have also highlighted some words in the second example which give a very simple description of the idea of a “chemical bond”. Our goal is to design a computational model which can automatically learn the reading difficulty from the data, and predict the reading difficulty of new documents from the learned model under an unsupervised setting.

Figure 9.1: Figures depicting different terrain types. A terrain appears to be bumpy with plenty of ups and downs. A person walking on such a terrain landscape faces difficulty in going from one point to another owing to the large number of bumps in the path. Similar is the idea of our readability model, where a reader faces more difficulty if the text in a document consists of difficult domain-specific concepts.

Latent semantic models, such as NMF [152] and LSI [58], [13], have not been explored to tackle the problem of readability. The first latent semantic model introduced in the field of IR is LSI. In fact, LSI was the first model to generate concepts, which later came to be popularly known as topics. Models such as Probabilistic Latent Semantic Indexing (pLSI) [102], pLSA [101] and LDA have been inspired by this model. The LSI model can better be called a concept model rather than a topic model. We present a model in this thesis where we apply the LSI model to the retrieval of domain-specific documents based on readability with the consideration of word order. Works which have looked into the problem of domain-specific readability [279], [293], [292] have remained constrained to certain domains only, for instance the Medical domain, because of their required reliance on knowledge bases to find domain-specific terms in a document. To circumvent this limitation, we discuss a novel unsupervised framework for computing domain-specific document readability. The main factor that makes our proposal superior compared with the existing domain-specific readability methods is that our method does not require an external ontology or lexicon of domain-specific terms. We have also presented different terrain models [118], [119], [120], [121] in LSI which predict the technical difficulty of documents in domain-specific IR. These models are based on conceptual hops between the unigrams. We have also presented an n-gram extension to the document terrain modeling in [122]. The aggregated cost obtained at the end of the document corresponds to the readability cost of the document.

We have developed novel terrain models in the LSI space which maintain the document’s semantic structure in order to compute the reading difficulty of a document. We depict terrains in Figure 9.1. The idea is to convert the high-dimensional vector space of terms and documents to a low-dimensional concept space which brings out better word-document and word-word correlations, and then to convert the original document space into a series of cliffs, just like a terrain, where a large distance between consecutive low-dimensional term vectors in a document signifies the high energy that one has to spend in order to cover a conceptual leap, and the difficulty of a term is the cliff which a reader has to climb. The higher the term score, the higher the cliff, and hence the more difficult the concept.

9.2 The Terrain Model in the Concept Space

9.2.1 Sequential Term Transition Model (STTM)

Overview

Our proposed framework, which we term the Sequential Term Transition Model (STTM), considers two components for determining the accumulated conceptual difficulty of a text document. These two components are technical term difficulty and sequential segment cohesion. The reading difficulty of a document is directly proportional to the individual terms’ difficulty. The more cohesive the terms are, the more technically simple a document will be [279]. We group multiple terms in sequence into variable length segments and measure the similarity between the sequences of segments in a document.

The idea behind our model can be visualized in this way. Consider a person trying to cross a terrain. If there are high cliffs and ridges, then the amount of energy that the person has to expend will be large. If the region the person is passing through is not full of high cliffs and ridges, then the energy expended will be less. The more energy spent, the more difficult it is to cross the terrain, and vice versa.

The Latent Space

We make use of LSI to derive latent information that plays a major role in our framework. One essential component in LSI is the SVD. Consider a domain; the input to LSI is a W × D term-document matrix W, where W is the number of terms in the vocabulary and D is the number of documents in the collection. The term-document matrix can be constructed by considering the product of Term Frequency (TF) and Inverse Document Frequency (IDF). SVD factorizes W into three matrices as shown

in Equation 9.1.

W ≈ Ŵ = U S V^⊺    (9.1)

where U is a W × f matrix of left singular vectors, S is an f × f diagonal matrix of singular values, V is a D × f matrix of right singular vectors, f ≪ min(W, D) is the number of factors, and V^⊺ denotes the matrix transpose of V. One can also refer to the Background section (Chapter 3) for more details.
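The following is a minimal sketch of this LSI step on a toy TF-IDF matrix: keep the top f singular triplets and form the term matrix R = U S and the document matrix L = S V^⊺ used later in Equations 9.2 and 9.5. The matrix sizes are toy values, not those of the thesis collections.

import numpy as np

rng = np.random.default_rng(0)
W_mat = rng.random((50, 30))          # W x D term-document matrix (e.g. TF-IDF weights)
f = 5                                 # number of latent factors, f << min(W, D)

U, s, Vt = np.linalg.svd(W_mat, full_matrices=False)
U, S, Vt = U[:, :f], np.diag(s[:f]), Vt[:f, :]

R = U @ S                             # term vectors as rows, Equation 9.2
L = S @ Vt                            # document vectors as columns, Equation 9.5
W_hat = U @ S @ Vt                    # rank-f approximation of Equation 9.1
print(R.shape, L.shape, np.linalg.norm(W_mat - W_hat))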

The traditional Vector Space Model (VSM) [225] cannot find new structural relationships between terms and their documents in the collection [11]. By considering SVD, one of our aims is to reduce the dimension of the space, and thus reduce the effect of noise. Moreover, considering this scheme helps bring close some terms which are coherent to the document. For example, if a document is about Astronomy, terms such as “star” will come close to the document in the latent space [118], [120], [119], [121], [122]. We can then compute a term’s domain-specific importance, or its difficulty, in the document, which is not possible to measure using readability methods that rely on surface-level features for determining the difficulty of terms.

Technical Term Difficulty

Every term in a domain-specific document is characterized by a certain difficulty in the domain. Some terms, such as the technical terms of a domain, are shared less among documents as they are not commonly used, compared with common terms such as “because”, “composed” etc., which are general terms used in everyday language. The notion of technical term difficulty is similar to the notion of document scope in [279], where the difficulty of a concept is measured based on the depth of the

concept in the ontology tree. In contrast, we measure the scope of a domain-specific term without an ontology.

We formulate the computation of a term’s difficulty as a term embedding problem which embeds a term vector by a weighted linear combination of document vectors in the latent space. The low-dimensional representations of term and document vectors obtained via SVD are not normalized. Normalization ensures the numerical stability of our model; closeness is then completely measured by the angles between the vectors, and the effect of diverse magnitudes is discarded.

Recall the SVD factorization described in Equation 9.1, and suppose that U and S are the matrices in that SVD computation. Let R be a matrix of dimension W × f, computed by the matrix multiplication of U and S as depicted in Equation 9.2:

R = U × S    (9.2)

Let r⃗_x denote the term vector at row x in the matrix R. The dimension of r⃗_x is 1 × f. Equivalently, R can be expressed as in Equation 9.3.

R = [ r⃗_1 ; r⃗_2 ; ⋯ ; r⃗_x ; ⋯ ; r⃗_W ]    (9.3)

where the rows of R are the term vectors.

We normalize r⃗_x as follows:

r̂⃗_x = r⃗_x / ||r⃗_x||    (9.4)

Let L be a matrix of dimension f × D, computed by the matrix multiplication

of S and V^⊺ as depicted in Equation 9.5:

L = S × V^⊺    (9.5)

Let l⃗_j denote the document vector at column j in L. The dimension of l⃗_j is 1 × f. Equivalently, L can be expressed as depicted in Equation 9.6.

L = [ l⃗_1 ; l⃗_2 ; ⋯ ; l⃗_j ; ⋯ ; l⃗_D ]^⊺    (9.6)

We normalize each document vector ~lj as depicted in Equation 9.7.

l̂⃗_j = l⃗_j / ||l⃗_j||    (9.7)

In our approach, for each term in the vocabulary, we attempt to compute a scale factor associated with each document in which the term exists. Consider a term x from the vocabulary. Let the index set of documents that contain term x be denoted

as {q_1, q_2, ⋯, q_N}, where N is the total number of documents that contain the

term x. We construct a matrix L̂_x, in which each row corresponds to a normalized document vector l̂⃗_{q_i}. The dimension of L̂_x is N × f. As a result, L̂_x can be expressed as depicted in Equation 9.8:

L̂_x = [ l̂⃗_{q_1} ; l̂⃗_{q_2} ; ⋯ ; l̂⃗_{q_N} ]    (9.8)

The term linear embedding problem can be formulated as minimizing the distance expressed in Equation 9.9.

minimize over [γ_n^x]:   || r̂⃗_x − [γ_n^x]^⊺ L̂_x ||
subject to   Σ_{n=1}^{N} γ_n^x = 1,   γ_n^x ≥ 0    (9.9)

The weights encapsulated in [γ_n^x] by the linear synthesis in Equation 9.9 can be regarded as the technical contribution that the term makes to each document. The dimension of [γ_n^x] is N × 1. By adopting the optimization in Equation 9.9, we are finding a scale factor γ_n^x associated with document n for term x such that the scaled vector [γ_n^x]^⊺ L̂_x is as close as possible to the term vector r̂⃗_x. The linear combination coefficients of each document synthesized with the term are in [γ_n^x]. A coefficient will obtain a higher value if the document vector is close to the term vector in the latent space, and will be low when the document is far from the term. Therefore, domain-specific terms will come close to the document vector in the latent space. The closer they are, the rarer they are in the document collection, and therefore an average reader will find the term difficult to comprehend.
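The following is a minimal sketch of the term embedding problem of Equation 9.9: find non-negative weights γ that sum to one such that γ^⊺ L̂_x is as close as possible to the normalized term vector r̂⃗_x. It uses a generic SLSQP solver rather than the thesis’s own optimization routine, and the toy data is illustrative only.

import numpy as np
from scipy.optimize import minimize

def embed_term(r_hat_x, L_hat_x):
    N = L_hat_x.shape[0]
    objective = lambda g: np.linalg.norm(r_hat_x - g @ L_hat_x)
    constraints = [{"type": "eq", "fun": lambda g: g.sum() - 1.0}]
    result = minimize(objective, np.full(N, 1.0 / N), method="SLSQP",
                      bounds=[(0.0, 1.0)] * N, constraints=constraints)
    return result.x                    # gamma_n^x: per-document contributions of term x

rng = np.random.default_rng(2)
L_hat_x = rng.standard_normal((4, 5))                       # N documents containing term x
L_hat_x /= np.linalg.norm(L_hat_x, axis=1, keepdims=True)   # normalized document vectors
r_hat_x = L_hat_x[0] * 0.7 + L_hat_x[2] * 0.3               # toy term vector
print(embed_term(r_hat_x / np.linalg.norm(r_hat_x), L_hat_x))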

We conduct the optimization expressed in Equation 9.9 for each term in the vocabulary. Consider document j from the entire collection. Let N^d be the total number of terms in document j. Every term t_i in j will have a conceptual difficulty value denoted as γ_j^{t_i}. Then the difficulty score χ_j of document j can be formulated as:

χ_j = ( Σ_{i=1}^{N^d} γ_j^{t_i} ) / N^d    (9.10)

Sequential Segment Cohesion

As [91] pointed out, a document displays varying degrees of cohesion. The beginning of a text will not be cohesive with the later sections of the same text. The main hurdle in technical comprehensibility comes when a reader has to relate different technical storylines occurring in sequence, both of which deal with different thematic interpretations in the same document. Here a segment refers to multiple terms in sequence which belong to the same cluster in the LSI latent space. This notion is different from text segmentation approaches, where the prime focus is to measure change in the thematic ideas or topics in text [10], [117]. Our approach mainly considers change in the concept cluster membership in the latent concept space, and the segment lengths may be smaller compared with traditional text segmentation approaches.

Input: Collection of text documents, cluster information of terms.
Output: Cohesion score of a document.

ζ_j ← 0;  S_j ← 1
t_i ← READ a unigram from the document
π_i ← GetClusterMembership(t_i)
∆_i ← GetClusterCentroid(π_i)
while not at the end of this document do
    t_{i+1} ← READ a unigram from the document
    π_{i+1} ← GetClusterMembership(t_{i+1})
    ∆_{i+1} ← GetClusterCentroid(π_{i+1})
    if π_i ≠ π_{i+1} then
        α ← ν(∆_i, ∆_{i+1})
        ζ_j ← ζ_j + α
        ∆_i ← ∆_{i+1}
        S_j ← S_j + 1
        π_i ← π_{i+1}
    else
        go back to the beginning of the loop
    end
end
return (ζ_j / S_j) × τ

Algorithm 3: Cohesion based on segmentation.

Figure 9.2: One particular term sequence (t_1, t_2, ⋯, t_10) with two segments (Q_s, Q_{s+1}) in sequence.

Generally, the latent space obtained via SVD does not directly provide a reasonable cluster membership for every term in the space [275]; a clustering algorithm is needed. In [11], k-means is applied, followed by bottom-up clustering, to determine the cluster membership of terms in the latent space. We adopt a similar clustering technique because k-means is well suited to handling large datasets such as ours [114]. We cluster the low-dimensional term vectors in the latent space. The input to the

clustering algorithm is the set of normalized low-dimensional term vectors r̂⃗_x as depicted in Equation 9.4.

A segment is a sequence of terms in the document which belong to the same conceptual cluster in the latent space. We show one such example in Figure 9.2,

where Q_s represents a segment. Our model for finding cohesion traverses the sequence of terms in order in the latent concept space. We call this process “conceptual transitions” in the latent space. We keep moving forward in sequence until a change in the cluster membership of a term occurs. Let ν(∆⃗_s, ∆⃗_{s+1}) denote the cosine similarity between the centroids of the two clusters to which consecutive segments belong, where ∆⃗_s represents the centroid of the cluster in which segment Q_s exists, and ∆⃗_{s+1} represents the centroid of the cluster of the next segment. Let S_j be the total number of segments in document j and τ be the average number of terms over all segments in the document. Let ζ_j denote the overall cohesion score of document j. The cohesion score of document j is formulated as:

Sj −1 ν(∆~ , ∆~ ) ζ = s=1 s s+1 τ (9.11) j S P j 252

If the document is cohesive, then the majority of the terms in the document will belong to a single segment and τ will have a high value. If terms are not semantically associated with each other in the discourse, the number of segments S_j will be high, and as a result the overall cohesion will be lowered, indicating that the document is conceptually difficult. Hence, at each forward traversal in the document, the reader experiences a certain amount of conceptual leap.

We show the steps for computing cohesion as pseudo-code in Algorithm 3. We traverse the sequence of terms in a text document and, at each forward movement, ascertain the cluster membership of term t_i in the sequence (procedure GetClusterMembership()). If the terms in sequence come from the same cluster, this indicates that they are cohesive. We keep traversing forward until a change in the term's cluster membership occurs, which indicates a weakness in cohesion among the terms in sequence. We keep track of the number of segments in S_j. We measure segment cohesion by computing the cosine similarity (procedure CosineSimilarity()) between the two centroids (procedure GetClusterCentroid()) of the clusters to which the two segments belong. In the end, the document is segmented into several different segments, each of which contains one cohesive group of terms, and the cohesion score of the document is aggregated.
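A minimal Python sketch of Algorithm 3 is given below, assuming the cluster assignments and centroids come from a clustering step such as the one sketched earlier; cosine similarity plays the role of ν, and the synthetic usage example at the end is purely illustrative.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two centroid vectors (plays the role of nu).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def cohesion_score(doc_terms, cluster_of, centroids):
    """Cohesion of one document following Algorithm 3.

    doc_terms  : list of term ids in their original order.
    cluster_of : mapping from a term id to its concept cluster id.
    centroids  : array of cluster centroids in the latent space.
    Returns (zeta_j / S_j) * tau, with tau the average segment length.
    """
    if not doc_terms:
        return 0.0
    zeta, segments = 0.0, 1
    prev_cluster = cluster_of[doc_terms[0]]
    seg_lengths, current_len = [], 1
    for term in doc_terms[1:]:
        cur_cluster = cluster_of[term]
        if cur_cluster != prev_cluster:          # a conceptual transition
            zeta += cosine(centroids[prev_cluster], centroids[cur_cluster])
            segments += 1
            seg_lengths.append(current_len)
            current_len = 1
            prev_cluster = cur_cluster
        else:
            current_len += 1
    seg_lengths.append(current_len)
    tau = sum(seg_lengths) / len(seg_lengths)    # average terms per segment
    return (zeta / segments) * tau

# Tiny synthetic usage: 4 clusters in a 3-dimensional latent space.
rng = np.random.RandomState(0)
centroids = rng.randn(4, 3)
cluster_of = {t: t % 4 for t in range(20)}
print(cohesion_score([0, 4, 8, 1, 5, 2, 2, 3], cluster_of, centroids))
```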

9.2.2 Document Conceptual Difficulty Score

Our approach determines the relative "conceptual difficulty" of a document when hopping/traversing through the text sequentially, where the difficulty of documents is measured in the latent space as a combination of the deviation from common terms and the cohesion between segments. The overall conceptual difficulty of a document is directly proportional to the individual difficulties of the terms in the document and inversely proportional to its cohesion score. The more cohesion among the units of text, the lesser the conceptual difficulty in comprehending a technical discourse [140]. Therefore, the conceptual difficulty Φ_j of a document can be formulated as:

$$\Phi_j = \beta\chi_j + (1-\beta)\,\frac{1}{\zeta_j + 1} \tag{9.12}$$

where β (0 ≤ β ≤ 1) is the parameter controlling the relative contribution between term difficulty and cohesion. We have added 1 in the denominator of the cohesion score to handle the case when the centroids are orthogonal to each other. Φ_j gives an indication of the conceptual difficulty of document j. This score is used to re-rank the search results obtained from a similarity based IR system.
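For illustration, a small self-contained sketch of how Equations 9.10-9.12 combine into the re-ranking score; the χ_j and ζ_j values and the document names below are hypothetical placeholders.

```python
def conceptual_difficulty(chi, zeta, beta=0.5):
    # Phi_j (Equation 9.12): convex combination of term difficulty chi_j
    # (Equation 9.10) and inverse cohesion zeta_j (Equation 9.11);
    # the +1 guards against orthogonal (zero) cohesion.
    return beta * chi + (1.0 - beta) / (zeta + 1.0)

# Re-rank the top documents of a first-stage ranker from conceptually
# simple to difficult (ascending Phi_j). Values are illustrative only.
top_docs = {"doc_a": (0.31, 2.4), "doc_b": (0.72, 0.8), "doc_c": (0.55, 1.6)}
ranked = sorted(top_docs,
                key=lambda d: conceptual_difficulty(*top_docs[d], beta=0.5))
print(ranked)   # simplest first
```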

9.2.3 Experiments and Results

Data Preparation

Existing standard IR test collections such as those used in the Text Retrieval Conference (TREC) and the Cross-Language Evaluation Forum (CLEF) cannot fulfill our purpose of evaluation, as we need a conceptual difficulty judgment on each document. Hence we collected a large test collection of web pages of our own, as done by topical search engines. To ascertain the full operational characteristics of our model, we chose the Psychology domain. We crawled a large number of web pages from various resources. Enlisting every crawled source would take too long, but we name a few popular sites from which we crawled web pages: 1) Wikipedia, 2) Psychology.com, 3) Simple English Wikipedia, and some more related web sites. We crawled 167,400 web pages with 154,512 unique terms in the vocabulary. No term stemming was performed. We prepared two sets of documents, one with stopwords1 kept and another with stopwords removed. Removing stopwords breaks the natural semantic structure of the document, but it allows us to capture conceptual leaps between sequences of content words.

1http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stoplist/english.stop

To collect queries that an average user is likely to use when searching for information about a domain, we followed the INEX2 topic development guidelines. However, our topic creators were not domain experts. In all, we had 110 topics. Some sample information needs are: 1) depression, 2) fear of flying, 3) intimacy.

Experimental Setup

We refer to our model with stopwords kept as STTM(Stop) and with stopwords removed as STTM(No Stop). One of our aims was to test the role of stopwords in determining the conceptual difficulty of documents. We compared with other state-of-the-art approaches in terms of conceptual difficulty prediction and ranking. We used Zettair3 to conduct retrieval and obtained a ranked list using the Okapi BM25 [221] ranking function. We then selected the top ten documents for evaluation. The reason for selecting these documents is that we observed that these documents from the Zettair system were all relevant to the query and the list contained a mix of documents with different conceptual difficulty. These documents were then re-ranked automatically from conceptually simple to difficult using our proposed models as well as some existing models for comparison. A similar experimental setup and document re-ranking scheme have been adopted in [279], [188]. The reason for re-ranking from conceptually simple to advanced in our experiments is as follows. According to studies of the behavior of novice and expert searchers, an increasing number of users are searching for information in unfamiliar domains [14]. Hence, most of them will probably look for introductory level documents. A study has also found that domain experts employ complex search strategies, such as the usage of jargon and complex phrases, to successfully retrieve documents matching their expertise level [269]. Therefore, ranking from conceptually simple to advanced fits most users. As stated previously, the authors of [279], [188] also ranked documents from introductory to advanced when they

2http://www.inex.otago.ac.nz/tracks/adhoc/gtd.asp
3http://www.seg.rmit.edu.au/zettair/index.html

tested their models on users possessing an average level of knowledge about healthcare. In [278], the authors re-ranked documents based on decreasing specificity.

We set the value of β = 0.5 in our experiments, which means that equal weights are given to both components. The value of k in k-means was 150. We set f = 200 (defined in Section 9.2.1) because, in general, a low number of factors is ideal for effective results [65]. We used SeDuMi with YALMIP [169] to conduct the optimization in Equation 9.9. Our main model is STTM(Stop) because our objective is to test our model on the entire document structure without removing any of the features.
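As a rough illustration of the kind of per-term convex program solved here, the sketch below uses cvxpy in place of the SeDuMi/YALMIP toolchain used in our experiments. The least-squares objective and the nonnegativity constraint are assumptions made only for illustration and are not necessarily the exact formulation of Equation 9.9, which is defined earlier in this chapter.

```python
import cvxpy as cp
import numpy as np

# D_hat: f x N matrix whose columns are document vectors in the latent space.
# r_hat: f-dimensional term vector for one term x.
# Both are random placeholders here; in practice they come from the LSI step.
rng = np.random.RandomState(0)
f, N = 200, 500
D_hat = rng.randn(f, N)
r_hat = rng.randn(f)

gamma = cp.Variable(N, nonneg=True)            # one coefficient per document
                                               # (nonnegativity is an assumption)
objective = cp.Minimize(cp.sum_squares(D_hat @ gamma - r_hat))
problem = cp.Problem(objective)
problem.solve()                                # any installed QP/conic solver

coefficients = gamma.value                     # per-document coefficients for term x
```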

The existing unsupervised methods used as comparative methods include: 1) Okapi BM25 as described in [221], 2) the Dirichlet-smoothed query likelihood language model [288] (denoted as LM, with the default parameter as in Zettair), and 3) cosine similarity based retrieval [225]. In addition, we also compared with widely used unsupervised readability scores, namely: 1) ARI, 2) Coleman-Liau (denoted as C-L in the tables), 3) the Flesch Reading Ease formula, 4) Fog, 5) LIX, and 6) SMOG. More details about readability methods can be found in [64]. Each readability formula computes a readability score for every document; the documents are then re-ranked in descending order of the readability score. We also compare our model against one of our previously proposed methods, the Conceptual Hop Model (CHM) described in [118]. Our model works by considering only the semantic content of text. Readability methods contain both semantic and syntactic components; therefore, we only chose the semantic component of the readability methods.

It is important to note that readability methods and traditional ranking methods form the most suitable comparative methods because they are completely unsupervised. Domain-specific readability methods such as [279], [293] use an extra lexicon of technical terms.

Annotation Guidelines

The relative technical difficulty of the document that you are currently reading is - ?
  4  Very low
  3  Reasonably low
  2  Borderline
  1  Reasonably high
  0  Very high

Table 9.1: Conceptual difficulty judgment guidelines given to the human judges.

Evaluation Metric

To obtain a ground truth of the conceptual difficulty of documents for evaluation purposes, two human annotators, undergraduate students with varied backgrounds, were invited. They had basic knowledge about Psychology and were fluent in reading English passages. They gave annotations following the guidelines given in Table 9.1. They were also asked to read the articles sequentially without skipping any term in the document. In the beginning we acquainted them with the main aim of the study and also showed them some sample documents from our test collection so that they could get an idea of the relative difficulty levels of documents in the collection. The standard deviation of judgments among the annotators was 1.23.

We evaluate our method using NDCG. NDCG is widely used for IR ranking effectiveness measurement. NDCG is well suited for our task because it is defined by an explicit position discount factor and it can leverage the judgments in terms of multiple ordered categories. NDCG@i scores will directly correlate with the difficulty annotation of documents given by humans. Such scores can measure the quality of difficulty ranking of documents based on the difficulty judgments provided by humans with levels shown in Table 9.1. If NDCG is high, it means that the ranking function correlates better with the human judgments. The formula for NDCG is:

$$N(q_s) = \frac{1}{J_n} \sum_{i=1}^{n} \frac{2^{r(i)} - 1}{\log(1 + i)} \tag{9.13}$$

Method          NDCG@3   NDCG@5   NDCG@7   NDCG@10
Okapi BM25      0.429    0.462    0.500    0.526
LM              0.433    0.465    0.502    0.529
Cosine          0.542    0.581    0.599    0.654
STTM(Stop)      0.579    0.600    0.640    0.670
STTM(No Stop)   0.576    0.599    0.641    0.669

Table 9.2: Ranking performance of popular ranking models at different retrieval points. STTM has obtained a statistically significant result according to paired t-test (p < 0.05) against all models. STTM(Stop) is our model with stopwords kept and STTM(No Stop) is our model with stopwords removed. We have set β = 0.5 defined in Equation 9.12 so that equal weights come from both components of our model.

where r(i) is the rank label of the i-th document in the ranked list, n is the length of the ranked list, and J_n is the normalization constant such that a perfect list gets a score of 1.
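A small Python sketch of how NDCG@k in Equation 9.13 can be computed from the graded difficulty judgments; the label list in the usage example is hypothetical.

```python
import numpy as np

def dcg(labels):
    # Discounted cumulative gain with the (2^r - 1)/log(1+i) gain of Eq. 9.13.
    labels = np.asarray(labels, dtype=float)
    positions = np.arange(1, len(labels) + 1)
    return np.sum((2.0 ** labels - 1.0) / np.log(1.0 + positions))

def ndcg_at_k(rank_labels, k):
    # rank_labels: graded difficulty judgments (Table 9.1) of the documents
    # in the order produced by the ranking method being evaluated.
    observed = dcg(rank_labels[:k])
    ideal = dcg(sorted(rank_labels, reverse=True)[:k])   # J_n normalisation
    return observed / ideal if ideal > 0 else 0.0

# Example: a ranked list of ten documents with (made-up) human labels.
print(ndcg_at_k([4, 3, 4, 2, 1, 0, 2, 3, 1, 0], k=10))
```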

9.2.4 Results Discussion

We present the main results in Tables 9.2 and 9.3. Our model has significantly outperformed (using a paired t-test, p < 0.05) the traditional ranking functions in Table 9.2, which matches our general intuition that traditional ranking functions are not suitable for ranking documents by conceptual difficulty. One notable observation is the role of stopwords in our results: STTM(Stop) has performed relatively better than STTM(No Stop) in our experiments. The importance of stopwords has also been studied in [145], where the authors found that stopwords played an important role in their Familiarity Classifier (FAMCLASS).
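The significance tests reported here can be reproduced with a standard paired t-test over per-query scores; the sketch below uses scipy.stats.ttest_rel on placeholder per-query NDCG values, which are generated at random purely for illustration.

```python
import numpy as np
from scipy import stats

# Per-query NDCG@10 of STTM(Stop) and of a baseline over the same 110 queries.
# The numbers below are random placeholders, not the experimental results.
rng = np.random.RandomState(0)
ndcg_baseline = rng.uniform(0.3, 0.7, size=110)
ndcg_sttm = np.clip(ndcg_baseline + rng.normal(0.05, 0.05, size=110), 0, 1)

t_stat, p_value = stats.ttest_rel(ndcg_sttm, ndcg_baseline)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}, "
      f"significant at 0.05: {p_value < 0.05}")
```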

In Table 9.3 we compare our model against widely used readability formulae. Our model has also performed significantly better than every other comparative method (using a paired t-test, p < 0.05). This points to the fact that readability formulae fail to differentiate terms based on contextual usage and their difficulties. In Table 9.4, we present the query-wise performance of our model compared with the comparative

Method          NDCG@3   NDCG@5   NDCG@7   NDCG@10
ARI             0.515    0.548    0.582    0.618
C-L             0.525    0.553    0.584    0.612
Flesch          0.449    0.490    0.537    0.579
Fog             0.513    0.547    0.577    0.612
LIX             0.516    0.550    0.584    0.619
SMOG            0.517    0.550    0.579    0.616
CHM             0.465    0.456    0.473    0.482
STTM(Stop)      0.579    0.600    0.640    0.670
STTM(No Stop)   0.576    0.599    0.641    0.669

Table 9.3: Ranking performance of our models against popular readability models at different retrieval points. STTM has obtained a statistically significant result according to paired t-test (p< 0.05) against all models.

Method         Queries Improved               Average Improvement
               STTM(Stop)   STTM(No Stop)     STTM(Stop)   STTM(No Stop)
Okapi BM25     60           56                34.56%       30.45%
LM             59           53                32.71%       27.66%
Cosine         43           38                19.93%       14.91%
ARI            48           39                12.34%       10.12%
C-L            56           48                16.23%       12.43%
Flesch         58           40                15.65%       11.33%
Fog            58           50                16.44%       8.34%
LIX            51           43                13.98%       7.55%
SMOG           40           38                13%          9.46%
CHM            71           68                33%          23.54%

Table 9.4: Query-wise performance of our model compared with the comparative models.

methods. It can be seen that in most cases our model outperforms the comparative methods by a large margin. CHM did not perform very well due to its weak non-linear model.

We experimented with STTM by varying 0 ≤ β ≤ 1 in Equation 9.12. We show the results in Figure 9.3. We obtained statistically significant results using a paired t-test (p < 0.05) across all values of β against all methods. What can be observed from the two ends of the abscissa in Figure 9.3 is that a β close to 0 attains a greater NDCG@10. The contribution from difficulty is more uniform across all documents than that from cohesion; in other words, the usage of the terminologies is at the same level.

Through our study we have found that traditional ranking functions are not designed to handle ranking by document difficulty. We have also found that the readability formulae are not directly applicable to the problem of determining the conceptual difficulty of documents. What makes our model superior to the other models is that we are able to effectively capture the difficulties of domain-specific terms based on their contextual information. This means that if a term is used as a general term in one technical discourse, its difficulty will be low; however, the same term whose semantic fabric coherently matches the technical storyline of the document will have a high conceptual difficulty score. Our model also captures conceptual leaps during sequential term traversal in the document.

Figure 9.3: The effect of varying 0 ≤ β ≤ 1 (defined in Equation 9.12) on NDCG@10 for STTM(Stop) and STTM(No Stop). We obtained statistically significant results according to a paired t-test (p < 0.01) against all comparative methods.

9.3 Closing Remarks

We have presented our model STTM, which re-ranks text documents based on conceptual difficulty. Our major innovation lies in the way we have adopted a conceptual model to solve the problem. Traditional readability formulae cannot capture domain-specific jargon, for example, "star", "shock", etc. By maintaining term order in the document, our model captures inter-segment cohesion among neighboring terms. We have also shown that stopwords play some role in determining the reading difficulty of text documents.

Chapter Ten

Conclusions and Future Directions

10.1 Summary of the Methods

This thesis presented probabilistic parametric and nonparametric topic models for text data which take into consideration the word order of the document. The main motivation for incorporating word order information in a topic model is to capture the semantic or logical storyline inherent in the document, which is generally lost when we assume that the order of words in the document is not important. One of the models, NTSeg, maintains the document structure such as paragraphs and sentences (in general, segments) along with the word order in the document. The models proposed in this thesis assume a Markovian structure on the order of words and segments in the documents. The bigram status indicator variable enables the n-gram topic models to generate topical n-grams or phrasal terms, which leads to better topic interpretability. The temporal n-gram topic model proposed in this thesis can capture how n-gram topics have changed over time; it incorporates a time-stamp variable in the n-gram topic model and generates n-gram topical words over time. In all the models, topical phrases can be formed by concatenating the words in sequence based on the value of the bigram status variable. This thesis also described a supervised topic model with word order that considers side information while learning the parameters of the model. The side information can help the topic model learn more fine-grained topics compared with the topics generated by unsupervised topic models. In addition, the nonparametric topic models proposed in this thesis can automatically find the number of latent topics that pervade the document collection based on the characteristics of the data.

This thesis also presented the terrain models to predict the readability of domain-specific documents. The models compute the reading difficulty in the latent semantic indexing space, which brings better word-document correlations. The readability model proposed in this thesis computes the readability score of a document based on the conceptual hops in sequence, i.e., the model maintains the order of the words. We then used the notions of cohesion and term difficulty to compute the overall domain-specific reading difficulty score of a document. The main innovation of the readability model proposed in this thesis is that it can differentiate the difficulty of a domain-specific concept based on context, which other readability methods cannot do.

10.2 Shortcomings of the Models

The models proposed in this thesis indeed have many shortcomings, which lead to some interesting directions for future work.

1. The computational complexity of the models is a challenge which needs to be addressed. The complex graphical models, with their large parameter sets, make the models computationally demanding.

2. The generation of phrasal terms by concatenating the bigrams in sequence is still a very simplistic solution to the problem of generating topical n-grams. This method may not work all the time.

3. The nonparametric topic models proposed by us have a shortcoming in that one needs to store the word order information in an external matrix. This is again demanding in terms of space complexity.

4. Data sparsity might be a problem sometimes, and better models to handle sparse data are needed.

5. Our supervised bigram topic model always generates bigrams, and thus inherits the same limitations as the BTM [252].

6. Our NTOT model inherits the limitations currently present in the TNG model.

7. Words in a phrase in all our models do not share the same probability mass.

8. We need to work towards designing better latent semantic models for readability prediction, because the latent concept model that we have used in our work has many shortcomings when applied to text data.

10.3 Suggestions for Future Research

1. The n-gram topic models are more expensive in both space and time complexity than their unigram-based counterparts. This is because they have a larger set of parameters than unigram-based models and also follow the word order in the document. In addition, the input to these models is not a traditional vector space representation but the entire document with word order. If we consider smaller units of the document such as punctuation, then the processing speed becomes even slower.

• In order to speed up the inference process, we need to come up with different inference schemes. The Gibbs sampling procedure is inherently slow to converge even for a small dataset, and it is therefore generally very slow for large datasets with word order. Therefore, methods such as variational inference or stochastic variational inference [100] must be looked at in more detail. These methods not only scale to large datasets, but also converge quickly and can be parallelized.

• Another way to deal with the problem would be to come up with Gibbs sampling procedures which can scale [290]. For example, we can run the samplers on clusters of computers, i.e., in a distributed environment [191], [7].

2. Longer-order phrase formation in our models requires concatenation of the bigram status variables, which is not a powerful way to generate phrasal terms.

Also, the probability mass is not shared between the words in a phrase, although they share the same topic. The PDLDA model tried to handle exactly this issue by incorporating the HPYP model in the LDA model, but it ended up using HPYP priors, which also do not scale to large datasets.

• One way to deal with this problem is to do sampling in two stages. In the first stage, we can generate longer phrases by using some phrase generation technique; we can then treat those words as a single chunk and sample them under one topic as one chunk. In [122], we adopted a similar approach where we generated such chunks manually and performed concept space generation using these n-grams as single chunks. This manual work could be done away with if the model automatically found appropriate phrasal terms based on the characteristics of the data.

3. Text preprocessing is also an important component of the n-gram probabilistic topic models. It is often ignored, but during the course of the n-gram topic model development we found that text pre-processing is indeed important.

• Stemming has been shown to improve the performance of topic models.

• The role that stopwords play in n-gram topic modeling is not clear, as it has been seen that they degrade the performance. In the future, we need to look into the role that stopwords play. The BTM model has shown that by incorporating word order, one can get rid of frequently occurring words in the latent topics; these frequently occurring words dominate the topics generated by unigram topic models.

• Noisy terms in the document drastically deteriorate the performance of the n-gram topic models in both qualitative and quantitative analysis. In order to solve this issue, we need to adopt some aggressive techniques to remove noisy terms in the document. In the future, we need to come up with more robust models which are less susceptible to noise in text.

10.4 Personal Research Experience

The first research idea which I worked on was readability prediction. I found many problems both in the readability formulae and in the fact that web search engines do not retrieve documents based on the readability level of the user. Therefore, my idea was to completely digress from the current techniques and come up with something more innovative. My readability models were inspired by the Super Mario video game, which I used to play a lot when I was a child. Just like the way Mario moves around the terrain, I thought that the same idea could also be applied to readability prediction. Thus, I used to visualize the entire document as a terrain with Super Mario trying to cross from the first word to the last. This idea was liked by many reviewers, but where the papers lacked was in evaluation, as readability is still a very nascent field in information retrieval, despite the fact that readability of documents in general has been studied for quite a long time, since the 1920s. In addition, there are no standard test collections to conduct readability evaluations.

Personally, the research experience during my doctoral study has been full of challenges. Initially, topic modeling seemed to be a bit of an intimidating field for me, given the amount of Mathematics involved in this area. But after getting into this area and learning some of the basics, I could grasp the technicalities of topic modeling very well. I enjoy working on new probabilistic topic models, but I always keep in mind how the probabilistic topic models which I design can be applied to the readability problem and information retrieval. I sincerely hope that my work will make new and current researchers think twice before adopting strong bag-of-words assumptions in their models.

Figure 10.1: A pie chart depicting the approximate percentage of time that I spent on different tasks during my research study: Paper Writing 45%, Modeling 30%, Experiments 15%, Literature Review 10%. Surprisingly, most of my time was spent on writing and revising my manuscripts.

Appendix A

N-gram Topic Segmentation Model - Full Gibbs Sampling Derivation

This section shows the complete Gibbs sampling derivation of our model described in Chapter 4. In this model, our interest is in two conditional distributions, namely the word-topic distribution $P(z_{si}^{d}, x_{si}^{d} \mid z_{\neg si}^{d}, x_{\neg si}^{d}, c, y, w)$ and the segment-topic distribution $P(y_{s}^{d}, c_{s}^{d} \mid z, y_{\neg s}^{d}, c_{\neg s}^{d}, w)$. Exact inference is known to be intractable in the LDA family, and thus we resort to approximate inference using Gibbs sampling. This is achieved by integrating out the parameter random variables, taking advantage of conjugacy to derive a closed form for the Gibbs conditional distribution

$$P(z_{si}^{d}, x_{si}^{d}, y_{s}^{d}, c_{s}^{d} \mid z_{\neg si}^{d}, x_{\neg si}^{d}, y_{\neg s}^{d}, c_{\neg s}^{d}, w, \beta, \alpha, \gamma, \delta, \rho, \Omega).$$

As the parameter variables are integrated out during sampling, this is also known as collapsed Gibbs sampling. We start with the joint distribution:

$$P(z, x, w, y, c \mid \beta, \alpha, \gamma, \delta, \rho, \Omega) = P(w, x \mid z, \beta, \gamma, \delta) \times P(z \mid \alpha, y) \times P(y \mid c, \rho) \times P(c \mid \Omega) \tag{A.1}$$

Each of the components can be simplified as follows:

$$
\begin{aligned}
P(w, x \mid z, \beta, \gamma, \delta)
&= \int_{\Phi}\int_{\Sigma}\int_{\Psi} \prod_{d=1}^{D}\prod_{s=1}^{S}\prod_{n=1}^{n^{(ds)}}
   P(w_{si}^{d} \mid x_{si}^{d}, \phi_{z_{si}^{d}}, \sigma_{z_{si}^{d} w_{s(n-1)}^{d}})\,
   P(x_{si}^{d} \mid \psi_{z_{s(n-1)}^{d} w_{s(n-1)}^{d}})
   \prod_{l=1}^{L}\prod_{v=1}^{W} P(\sigma_{lv}\mid\delta)\, P(\psi_{lv}\mid\gamma)
   \prod_{l=1}^{L} P(\phi_l\mid\beta)\; d\Sigma\, d\Psi\, d\Phi \\
&\propto
\underbrace{\prod_{l=1}^{L}\frac{\prod_{v=1}^{W}\Gamma(n_{lv}+\beta_v)}{\Gamma\big(\sum_{v=1}^{W}(n_{lv}+\beta_v)\big)}}_{\text{unigram sampling}}
\;
\underbrace{\prod_{l=1}^{L}\prod_{w=1}^{W}\frac{\prod_{v=1}^{W}\Gamma(m_{lwv}+\delta_v)}{\Gamma\big(\sum_{v=1}^{W}(m_{lwv}+\delta_v)\big)}}_{\text{bigram sampling}}
\;
\underbrace{\prod_{l=1}^{L}\prod_{w=1}^{W}\frac{\prod_{t=0}^{1}\Gamma(p_{lwt}+\gamma_t)}{\Gamma\big(\sum_{t=0}^{1}(p_{lwt}+\gamma_t)\big)}}_{\text{bigram status sampling}}
\end{aligned}
\tag{A.2}
$$

$$
P(z \mid \alpha, y) = \int_{\Theta} \prod_{d=1}^{D}\prod_{s=1}^{S}\prod_{n=1}^{n^{(ds)}} P(z_{si}^{d} \mid \theta^{(ds)})\, P(\theta^{(ds)} \mid \alpha, y_{s}^{d})\; d\Theta
= \underbrace{\prod_{d=1}^{D}\prod_{s=1}^{S}
\frac{\Gamma\big(\sum_{l=1}^{L}\alpha_{y_{s}^{d} l}\big)}{\prod_{l=1}^{L}\Gamma(\alpha_{y_{s}^{d} l})}
\cdot
\frac{\prod_{l=1}^{L}\Gamma\big(\alpha_{y_{s}^{d} l}+n_{l}^{(ds)}\big)}{\Gamma\big(\sum_{l=1}^{L}(\alpha_{y_{s}^{d} l}+n_{l}^{(ds)})\big)}}_{\text{segment and word topic proportion sampling}}
\tag{A.3}
$$

$$
P(y \mid c, \rho) = \int_{\tau} \prod_{d=1}^{D}\prod_{s=1}^{S} P(y_{s}^{d} \mid \tau_d, c_{s}^{d})\, P(\tau_d \mid \rho)\; d\tau
= \underbrace{\prod_{d=1}^{D}\prod_{u\in U_1}
\frac{\Gamma\big(\sum_{k=1}^{K}\rho_k\big)}{\prod_{k=1}^{K}\Gamma(\rho_k)}
\cdot
\frac{\prod_{k=1}^{K}\Gamma\big(\rho_k+n_{k}^{(S_u)}\big)}{\Gamma\big(\sum_{k=1}^{K}(\rho_k+n_{k}^{(S_u)})\big)}}_{\text{segment sampling}}
\tag{A.4}
$$

$$
P(c \mid \Omega) = \int_{0}^{1} \prod_{d=1}^{D}\prod_{s=1}^{S} P(c_{s}^{d} \mid \pi_d)\, P(\pi_d \mid \Omega)\; d\Pi
= \underbrace{\prod_{d=1}^{D} \frac{\Gamma(2\Omega)}{\Gamma(\Omega)^2}\cdot
\frac{\Gamma(n_1+\Omega)\,\Gamma(n_0+\Omega)}{\Gamma(N_{s}^{d}+2\Omega)}}_{\text{segment status sampling}}
\tag{A.5}
$$

Using the chain rule and $\Gamma(\alpha) = (\alpha-1)\Gamma(\alpha-1)$, we can obtain the conditional probability conveniently:

$$
P(z_{si}^{d}, x_{si}^{d} \mid w, z_{\neg si}^{d}, x_{\neg si}^{d}, y, \alpha,\beta,\gamma,\delta)
= \frac{P(w_{si}^{d}, z_{si}^{d}, x_{si}^{d} \mid w_{\neg si}^{d}, z_{\neg si}^{d}, x_{\neg si}^{d}, y, \alpha,\beta,\gamma,\delta)}
       {P(w_{si}^{d} \mid w_{\neg si}^{d}, z_{\neg si}^{d}, x_{\neg si}^{d}, y, \alpha,\beta,\gamma,\delta)}
\tag{A.6}
$$

$$
\propto \big(\alpha_{y_{s}^{d} z_{si}^{d}} + n_{z_{si}^{d}}^{(ds)} - 1\big)\,
\big(\gamma_{x_{si}^{d}} + p_{z_{s(n-1)}^{d} w_{s(n-1)}^{d} x_{si}^{d}} - 1\big)
\times
\begin{cases}
\dfrac{\beta_{w_{si}^{d}} + n_{z_{si}^{d} w_{si}^{d}} - 1}{\sum_{v=1}^{W}\big(\beta_v + n_{z_{si}^{d} v}\big) - 1} & \text{if } x_{si}^{d} = 0\\[2ex]
\dfrac{\delta_{w_{si}^{d}} + m_{z_{si}^{d} w_{s(n-1)}^{d} w_{si}^{d}} - 1}{\sum_{v=1}^{W}\big(\delta_v + m_{z_{si}^{d} w_{s(n-1)}^{d} v}\big) - 1} & \text{if } x_{si}^{d} = 1
\end{cases}
\tag{A.7}
$$

Again, using the chain rule we obtain:

$$
P(y_{s}^{d}, c_{s}^{d} \mid z, y_{\neg s}^{d}, c_{\neg s}^{d}, w)
= \frac{P(z, x, y, c, w)}{P(z, x, y_{\neg s}^{d}, c_{\neg s}^{d}, w)}
\propto \frac{P(z\mid y)\,P(y\mid c)\,P(c)}{P(z_{\neg s}\mid y_{\neg s})\,P(y_{\neg s}\mid c_{\neg s})\,P(c_{\neg s})}
\tag{A.8}
$$

If $c_{s}^{d} = 0$, then we obtain the following:

$$
P(y_{s}^{d}, c_{s}^{d} \mid z, y_{\neg s}^{d}, c_{\neg s}^{d}, w) =
\frac{\Gamma\big(\sum_{l=1}^{L}\alpha_{y_{s}^{d} l}\big)}{\prod_{l=1}^{L}\Gamma(\alpha_{y_{s}^{d} l})}
\times
\frac{\prod_{l=1}^{L}\Gamma\big(\alpha_{y_{s}^{d} l}+n_{l}^{(ds)}\big)}{\Gamma\big(\sum_{l=1}^{L}(\alpha_{y_{s}^{d} l}+n_{l}^{(ds)})\big)}
\times
\frac{\prod_{k=1}^{K}\Gamma\big(\rho_k+n_{k}^{(S_u^0)}\big)}{\Gamma\big(\sum_{k=1}^{K}(\rho_k+n_{k}^{(S_u^0)})\big)}
\times
\frac{n_0+\Omega}{N_{s}^{d}+2\Omega}
\tag{A.9}
$$

If $c_{s}^{d} = 1$, then we obtain:

$$
P(y_{s}^{d}, c_{s}^{d} \mid z, y_{\neg s}^{d}, c_{\neg s}^{d}, w) =
\frac{\Gamma\big(\sum_{l=1}^{L}\alpha_{y_{s}^{d} l}\big)}{\prod_{l=1}^{L}\Gamma(\alpha_{y_{s}^{d} l})}
\times
\frac{\prod_{l=1}^{L}\Gamma\big(\alpha_{y_{s}^{d} l}+n_{l}^{(ds)}\big)}{\Gamma\big(\sum_{l=1}^{L}(\alpha_{y_{s}^{d} l}+n_{l}^{(ds)})\big)}
\times
\frac{\Gamma\big(\sum_{k=1}^{K}\rho_k\big)}{\prod_{k=1}^{K}\Gamma(\rho_k)}
\times
\frac{\prod_{k=1}^{K}\Gamma\big(\rho_k+n_{k}^{(S_{u-1}^1)}\big)}{\Gamma\big(\sum_{k=1}^{K}(\rho_k+n_{k}^{(S_{u-1}^1)})\big)}
\times
\frac{\prod_{k=1}^{K}\Gamma\big(\rho_k+n_{k}^{(S_u^1)}\big)}{\Gamma\big(\sum_{k=1}^{K}(\rho_k+n_{k}^{(S_u^1)})\big)}
\times
\frac{n_1+\Omega}{N_{s}^{d}+2\Omega}
\tag{A.10}
$$

Appendix B

N-gram Topics Over Time Model - Full Gibbs Sampling Derivation

The complete Gibbs sampling derivation in this section corresponds to the model proposed in Chapter 5. In order to derive the collapsed Gibbs sampling procedure for the n-gram topics-over-time model, we first begin with the joint distribution:

$$P(w, x, t, z \mid \alpha, \beta, \gamma, \delta, \Omega) \tag{B.1}$$

In order to simplify the integration process, the property of conjugate priors will be helpful. We start with the joint distribution:

$$
\underbrace{P(w, x, t, z \mid \alpha,\beta,\gamma,\delta,\Omega)}_{\text{full joint probability}} =
\underbrace{P(w, x \mid z, \beta, \delta, \gamma)}_{\text{word and bigram status probability}} \times
\underbrace{P(t \mid z, \Omega)}_{\text{temporal probability}} \times
\underbrace{P(z \mid \alpha)}_{\text{topic probability}}
\tag{B.2}
$$

Each of the components can be simplified as follows:

$$
\begin{aligned}
P(w, x \mid z, \beta, \delta, \gamma)
&= \int_{\Psi}\int_{\Sigma}\int_{\Phi} \prod_{d=1}^{D}\prod_{i=1}^{N^d}
   P(w_i^d \mid x_i^d, \phi_{z_i^d}, \sigma_{z_i^d w_{i-1}^d})\,
   P(x_i^d \mid \psi_{z_{i-1}^d w_{i-1}^d})
   \prod_{z=1}^{L}\prod_{v=1}^{W} P(\sigma_{zv}\mid\delta)\,P(\psi_{zv}\mid\gamma)
   \prod_{z=1}^{L} P(\phi_z\mid\beta)\; d\Sigma\, d\Psi\, d\Phi \\
&\propto
\underbrace{\prod_{z=1}^{L}\frac{\prod_{v=1}^{W}\Gamma(n_{zv}+\beta_v)}{\Gamma\big(\sum_{v=1}^{W}(n_{zv}+\beta_v)\big)}}_{\text{unigram sampling}}
\;
\underbrace{\prod_{z=1}^{L}\prod_{w=1}^{W}\frac{\prod_{v=1}^{W}\Gamma(m_{zwv}+\delta_v)}{\Gamma\big(\sum_{v=1}^{W}(m_{zwv}+\delta_v)\big)}}_{\text{bigram sampling}}
\;
\underbrace{\prod_{z=1}^{L}\prod_{w=1}^{W}\frac{\prod_{k=0}^{1}\Gamma(p_{zwk}+\gamma_k)}{\Gamma\big(\sum_{k=0}^{1}(p_{zwk}+\gamma_k)\big)}}_{\text{bigram status sampling}}
\end{aligned}
\tag{B.3}
$$

$$P(t \mid z, \Omega) = \prod_{d=1}^{D}\prod_{i=1}^{N^d} P(t_i^d \mid \Omega_{z_i^d}) \tag{B.4}$$

$$
P(z \mid \alpha) = \int_{\Theta} \prod_{d=1}^{D}\prod_{i=1}^{N^d} P(z_i^d \mid \theta_d)\, P(\theta_d \mid \alpha)\; d\Theta
\;\propto\; \prod_{d=1}^{D}\frac{\prod_{z=1}^{L}\Gamma(q_{dz}+\alpha_z)}{\Gamma\big(\sum_{z=1}^{L}(q_{dz}+\alpha_z)\big)}
\tag{B.5}
$$

Using the chain rule and Γ(α)=(α − 1)Γ(α − 1), we can obtain the conditional probability conveniently:

$$
P(z_i^d, x_i^d \mid w, t, z_{\neg i}^d, x_{\neg i}^d, \alpha,\beta,\gamma,\delta,\Omega)
= \frac{P(z_i^d, w_i^d, x_i^d, t_i^d \mid w_{\neg i}^d, t_{\neg i}^d, z_{\neg i}^d, \alpha,\beta,\gamma,\delta,\Omega)}
       {P(w_i^d, t_i^d \mid w_{\neg i}^d, x_{\neg i}^d, t_{\neg i}^d, z_{\neg i}^d, \alpha,\beta,\gamma,\delta,\Omega)}
\propto \frac{P(w, x, t, z \mid \alpha,\beta,\gamma,\delta,\Omega)}{P(w_{\neg i}^d, t_{\neg i}^d, z_{\neg i}^d, x_{\neg i}^d \mid \alpha,\beta,\gamma,\delta,\Omega)}
\tag{B.6}
$$

$$
\propto \big(\gamma_{x_i^d} + p_{z_{i-1}^d w_{i-1}^d x_i^d} - 1\big)\big(\alpha_{z_i^d} + q_{d z_i^d} - 1\big)
\times
\underbrace{\frac{(1-t_i^d)^{\Omega_{z_i^d 1}-1}\,(t_i^d)^{\Omega_{z_i^d 2}-1}}{B(\Omega_{z_i^d 1}, \Omega_{z_i^d 2})}}_{\text{temporal information sampling}}
\times
\begin{cases}
\dfrac{\beta_{w_i^d} + n_{z_i^d w_i^d} - 1}{\sum_{v=1}^{W}(\beta_v + n_{z_i^d v}) - 1} & \text{if } x_i^d = 0\\[2ex]
\dfrac{\delta_{w_i^d} + m_{z_i^d w_{i-1}^d w_i^d} - 1}{\sum_{v=1}^{W}(\delta_v + m_{z_i^d w_{i-1}^d v}) - 1} & \text{if } x_i^d = 1
\end{cases}
\tag{B.7}
$$
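The temporal information sampling factor in Equation B.7 is simply the Beta density of the (normalized, 0-1) document timestamp under the topic-specific shape parameters. A minimal sketch using scipy, with hypothetical parameter values:

```python
from scipy.stats import beta

def temporal_factor(t, omega1, omega2):
    # Beta density of the normalised document timestamp t under the
    # per-topic shape parameters Omega_{z,1} and Omega_{z,2} in Equation B.7.
    return beta.pdf(t, omega1, omega2)

# Example: a topic whose mass is concentrated late in the collection period.
print(temporal_factor(0.9, omega1=5.0, omega2=2.0))
```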

Appendix C

Bigram Topic Model - Full Gibbs Sampling Derivation

Computing the exact posterior estimate in the BTM model [252] is intractable due to integration over a large state space. Therefore, we resort to approximation techniques such as Gibbs sampling [78]. Other estimation techniques such as EM [59] have problems related to getting stuck in local optima, and some tempered algorithms [101, 102] have been proposed to smooth the parameters of the model for acceptable predictive performance, but they still do not solve the problems of over-fitting and local optima [23, 207]. We show here the Gibbs sampling for computing the posterior inference for the BTM. The training data here is the complete document with word order kept intact. We begin with the joint distribution $P(w, z \mid \alpha, \beta)$. In order to simplify the integrals, we can take advantage of conjugate priors1. Note that all the counts used below exclude the current case, i.e., the word being visited during sampling. When we use a ¬ sign in the subscript of a variable, it means that the variable corresponding to the subscripted index is removed from the calculation of the count. Variables in bold, for example $\mathbf{w}$, are defined as $w_i^d : \forall d, i$.

$$
\begin{aligned}
P(w, z \mid \alpha, \beta)
&= \int_{\Phi} \prod_{d=1}^{D}\prod_{i=1}^{N^d} P(w_i^d \mid \phi_{z_i^d w_{i-1}^d})
   \prod_{z=1}^{L}\prod_{v=1}^{W} P(\phi_{zv}\mid\beta)\; d\Phi
   \;\int_{\Theta} \prod_{d=1}^{D}\prod_{i=1}^{N^d} P(z_i^d \mid \theta_d)\,P(\theta_d\mid\alpha)\; d\Theta \\
&\propto
\underbrace{\prod_{z=1}^{L}\prod_{w=1}^{W}\frac{\prod_{v=1}^{W}\Gamma(m_{zwv}+\beta_v)}{\Gamma\big(\sum_{v=1}^{W}(m_{zwv}+\beta_v)\big)}}_{\text{probability of a bigram in a topic}}
\;
\underbrace{\prod_{d=1}^{D}\frac{\prod_{z=1}^{L}\Gamma(q_{dz}+\alpha_z)}{\Gamma\big(\sum_{z=1}^{L}(q_{dz}+\alpha_z)\big)}}_{\text{probability of a topic in a document}}
\end{aligned}
\tag{C.1}
$$

We know that Γ(α)=(α − 1)Γ(α − 1), so now we can obtain the following conditional probability distribution:

1We have shown the derivation of the model using conjugate and symmetric priors.

$$
P(z_i^d \mid w, z_{\neg i}^d, \alpha, \beta)
= \frac{P(w_i^d, z_i^d \mid w_{\neg i}^d, z_{\neg i}^d, \alpha, \beta)}{P(w_i^d \mid w_{\neg i}^d, z_{\neg i}^d, \alpha, \beta)}
= \frac{\alpha_{z_i^d} + q_{d z_i^d} - 1}{\sum_{z=1}^{L}(\alpha_z + q_{dz}) - 1}
\times
\frac{\beta_{w_i^d} + m_{z_i^d w_{i-1}^d w_i^d} - 1}{\sum_{v=1}^{W}(\beta_v + m_{z_i^d w_{i-1}^d v}) - 1}
\tag{C.2}
$$

Equivalently, the expression above can be written as:

$$
P(z_i^d \mid w, z_{\neg i}^d, \alpha, \beta)
= \underbrace{\frac{\alpha_{z_i^d} + q_{d z_i^d} - 1}{\sum_{z=1}^{L}(\alpha_z + q_{dz}) - 1}}_{\text{same as the LDA model}}
\times
\underbrace{\frac{\beta_{w_i^d} + m_{w_{i-1}^d w_i^d z_i^d} - 1}{\sum_{v=1}^{W}(\beta_v + m_{z_i^d w_{i-1}^d v}) - 1}}_{\text{bigrams in topics}}
\tag{C.3}
$$

The posterior estimates of θ, φ can be written as follows:

$$\hat{\theta}_z^d = \frac{\alpha_z + q_{dz}}{\sum_{t=1}^{L}(\alpha_t + q_{dt})} \tag{C.4}$$

$$\hat{\phi}_{zwv} = \frac{\beta_v + m_{zwv}}{\sum_{v'=1}^{W}(\beta_{v'} + m_{zwv'})} \tag{C.5}$$

where (C.4) is the posterior for the document topic proportions and (C.5) is the posterior for the bigram topic proportions.
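For illustration, a minimal collapsed Gibbs sampler built from the conditional in Equation C.3. This is a sketch, not the thesis implementation: symmetric priors are assumed, the handling of the first word of a document is a simplification, and the tiny synthetic corpus is purely illustrative.

```python
import numpy as np

def btm_gibbs(docs, V, L, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler in the spirit of Equation C.3.

    docs : list of documents, each a list of word ids in their original order.
    V    : vocabulary size, L : number of topics.
    Counts: q[d, z]    = topic assignments in document d,
            m[z, w, v] = word v generated after previous word w under topic z.
    """
    rng = np.random.RandomState(seed)
    q = np.zeros((len(docs), L))
    m = np.zeros((L, V, V))
    z_assign = [rng.randint(L, size=len(doc)) for doc in docs]

    # Initialise counts. The first word of a document has no previous word;
    # conditioning it on itself is a simplifying implementation choice here.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z = z_assign[d][i]
            prev = doc[i - 1] if i > 0 else w
            q[d, z] += 1
            m[z, prev, w] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z_old = z_assign[d][i]
                prev = doc[i - 1] if i > 0 else w
                # Remove the current assignment (this plays the role of the
                # "-1" terms in Equation C.3).
                q[d, z_old] -= 1
                m[z_old, prev, w] -= 1
                # Document-topic part times bigram-topic part of Equation C.3.
                p = (alpha + q[d, :]) * \
                    (beta + m[:, prev, w]) / (V * beta + m[:, prev, :].sum(axis=1))
                p /= p.sum()
                z_new = rng.choice(L, p=p)
                z_assign[d][i] = z_new
                q[d, z_new] += 1
                m[z_new, prev, w] += 1
    return q, m, z_assign

# Tiny synthetic usage example.
docs = [[0, 1, 2, 1, 0], [2, 3, 3, 1], [0, 0, 1, 2, 3]]
q, m, z = btm_gibbs(docs, V=4, L=2, iters=50)
```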

Appendix D

LDA-Collocation Model - Full Gibbs Sampling Derivation

The complete derivation of the Gibbs sampling for the LDACOL model is as follows:

$$
\begin{aligned}
P(w, x, z \mid \alpha,\beta,\gamma,\delta)
&= \int\!\!\int\!\!\int \prod_{d=1}^{D}\prod_{i=1}^{N^d}
   P(w_i^d \mid x_i^d, \phi_{z_i^d}, \sigma_{w_{i-1}^d})\,P(x_i^d \mid \psi_{w_{i-1}^d})
   \prod_{v=1}^{W} P(\sigma_v\mid\delta)\,P(\psi_v\mid\gamma)\; d\Sigma\, d\Psi \\
&\quad\times \int \prod_{z=1}^{L} P(\phi_z\mid\beta)\; d\Phi
   \;\int_{\Theta}\prod_{d=1}^{D}\prod_{i=1}^{N^d} P(z_i^d\mid\theta^d)\,P(\theta^d\mid\alpha)\; d\Theta \\
&\propto
\underbrace{\prod_{z=1}^{L}\frac{\prod_{v=1}^{W}\Gamma(n_{zv}+\beta_v)}{\Gamma\big(\sum_{v=1}^{W}(n_{zv}+\beta_v)\big)}}_{\text{same as LDA}}
\underbrace{\prod_{w=1}^{W}\frac{\prod_{v=1}^{W}\Gamma(m_{wv}+\delta_v)}{\Gamma\big(\sum_{v=1}^{W}(m_{wv}+\delta_v)\big)}}_{\text{capture collocations}}
\underbrace{\prod_{w=1}^{W}\frac{\prod_{s=0}^{1}\Gamma(p_{ws}+\gamma_s)}{\Gamma\big(\sum_{s=0}^{1}(p_{ws}+\gamma_s)\big)}}_{\text{bigram status indicator}}
\underbrace{\prod_{d=1}^{D}\frac{\prod_{z=1}^{L}\Gamma(q_{dz}+\alpha_z)}{\Gamma\big(\sum_{z=1}^{L}(q_{dz}+\alpha_z)\big)}}_{\text{as in LDA}}
\end{aligned}
$$

We know that Γ(α)=(α − 1)Γ(α − 1), so now we can obtain the following conditional probability distribution:

$$
\begin{aligned}
P(z_i^d, x_i^d \mid w, z_{\neg i}^d, x_{\neg i}^d, \alpha,\beta,\gamma,\delta)
&= \frac{P(w_i^d, z_i^d, x_i^d \mid w_{\neg i}^d, z_{\neg i}^d, x_{\neg i}^d, \alpha,\beta,\gamma,\delta)}{P(w_i^d \mid w_{\neg i}^d, x_{\neg i}^d, z_{\neg i}^d, \alpha,\beta,\gamma,\delta)} \\
&= \frac{\gamma_{x_i^d} + p_{w_{i-1}^d x_i^d} - 1}{\sum_{s=0}^{1}(\gamma_s + p_{w_{i-1}^d s}) - 1}
\times \frac{\alpha_{z_i^d} + q_{d z_i^d} - 1}{\sum_{z=1}^{L}(\alpha_z + q_{dz}) - 1}
\times
\begin{cases}
\dfrac{\beta_{w_i^d} + n_{z_i^d w_i^d} - 1}{\sum_{v=1}^{W}(\beta_v + n_{z_i^d v}) - 1} & \text{if } x_i^d = 0\\[2ex]
\dfrac{\delta_{w_i^d} + m_{w_{i-1}^d w_i^d} - 1}{\sum_{v=1}^{W}(\delta_v + m_{w_{i-1}^d v}) - 1} & \text{if } x_i^d = 1
\end{cases}
\end{aligned}
\tag{D.1}
$$

Equivalently, the expression above can be written as:

$$
P(z_i^d \mid w, z_{\neg i}^d, x, \alpha,\beta,\gamma,\delta) =
\begin{cases}
\underbrace{\dfrac{\alpha_{z_i^d} + q_{d z_i^d} - 1}{\sum_{z=1}^{L}(\alpha_z + q_{dz}) - 1}
\times \dfrac{\beta_{w_i^d} + n_{z_i^d w_i^d} - 1}{\sum_{v=1}^{W}(\beta_v + n_{z_i^d v}) - 1}}_{\text{same as in LDA}} & \text{if } x_i^d = 0\\[4ex]
\dfrac{\alpha_{z_i^d} + q_{d z_i^d} - 1}{\sum_{z=1}^{L}(\alpha_z + q_{dz}) - 1}
\times \dfrac{\delta_{w_i^d} + m_{w_{i-1}^d w_i^d} - 1}{\sum_{v=1}^{W}(\delta_v + m_{w_{i-1}^d v}) - 1} & \text{if } x_i^d = 1
\end{cases}
\tag{D.2}
$$

and,

$$
P(x_i^d \mid w, z, x_{\neg i}^d, \alpha,\beta,\gamma,\delta) =
\underbrace{\frac{\gamma_{x_i^d} + p_{w_{i-1}^d x_i^d} - 1}{\sum_{s=0}^{1}(\gamma_s + p_{w_{i-1}^d s}) - 1}}_{\text{sample bigram status}}
\times
\begin{cases}
\dfrac{\beta_{w_i^d} + n_{z_i^d w_i^d} - 1}{\sum_{v=1}^{W}(\beta_v + n_{z_i^d v}) - 1} & \text{if } x_i^d = 0\\[2ex]
\dfrac{\delta_{w_i^d} + m_{w_{i-1}^d w_i^d} - 1}{\sum_{v=1}^{W}(\delta_v + m_{w_{i-1}^d v}) - 1} & \text{if } x_i^d = 1
\end{cases}
\tag{D.3}
$$

The posterior estimates of θ,φ,ψ,σ can be written as follows:

$$\hat{\theta}_z^d = \frac{\alpha_z + q_{dz}}{\sum_{t=1}^{L}(\alpha_t + q_{dt})} \tag{D.4}$$

$$\hat{\phi}_{zw} = \frac{\beta_w + n_{zw}}{\sum_{v=1}^{W}(\beta_v + n_{zv})} \tag{D.5}$$

$$\hat{\sigma}_{wv} = \frac{\delta_v + m_{wv}}{\sum_{v'=1}^{W}(\delta_{v'} + m_{wv'})} \tag{D.6}$$

$$\hat{\psi}_{ws} = \frac{\gamma_s + p_{ws}}{\sum_{s'=0}^{1}(\gamma_{s'} + p_{ws'})} \tag{D.7}$$
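Once the count matrices are available, the posterior estimates in Equations D.4-D.7 are simple row-wise normalizations. A small sketch with symmetric priors and random placeholder counts:

```python
import numpy as np

def posterior_estimates(q, n, m, p, alpha, beta, delta, gamma):
    """Posterior means as in Equations D.4-D.7 (symmetric priors assumed).

    q: D x L document-topic counts, n: L x W topic-word counts,
    m: W x W previous-word/word counts, p: W x 2 bigram-status counts.
    """
    theta = (q + alpha) / (q + alpha).sum(axis=1, keepdims=True)   # D.4
    phi   = (n + beta)  / (n + beta).sum(axis=1, keepdims=True)    # D.5
    sigma = (m + delta) / (m + delta).sum(axis=1, keepdims=True)   # D.6
    psi   = (p + gamma) / (p + gamma).sum(axis=1, keepdims=True)   # D.7
    return theta, phi, sigma, psi

# Example with small random count matrices (placeholders only).
rng = np.random.RandomState(0)
theta, phi, sigma, psi = posterior_estimates(
    rng.randint(0, 5, (3, 4)), rng.randint(0, 5, (4, 10)),
    rng.randint(0, 5, (10, 10)), rng.randint(0, 5, (10, 2)),
    alpha=0.1, beta=0.01, delta=0.01, gamma=0.5)
```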

Appendix E

Proof of Bayes' Theorem when Cast into an Optimization Problem

Proof for Equation 3.46

From Equation 3.45, based on the formula of Bayes' Theorem, we can deduce that $P(\Theta, Z, \Phi \mid W, \alpha, \beta)$ is the posterior distribution that needs to be found. $P_0(\Theta, Z, \Phi \mid \alpha, \beta)$ is the prior distribution, $P(W \mid \Theta, Z, \Phi)$ is the likelihood, and the denominator $P(W \mid \alpha, \beta)$ is the marginal distribution over the data.

We know that the Kullback-Leibler Divergence (KLD) from a distribution p to a distribution q can be written as KLD(q||p). Suppose we consider an arbitrary distribution $Q(\Theta, Z, \Phi \mid W, \alpha, \beta)$. Our goal is to ensure that this distribution is equal to the posterior distribution $P(\Theta, Z, \Phi \mid W, \alpha, \beta)$. As in Bayes' rule, this posterior is obtained by iteratively updating the prior $P_0(\Theta, Z, \Phi \mid \alpha, \beta)$ with the likelihood $P(W \mid \Theta, Z, \Phi)$.

Suppose we want to minimize the divergence between the arbitrary distribution and the posterior distribution, so that the two distributions are as close as possible, i.e., they overlap. We can write this statement mathematically as:

$$\operatorname*{minimize}_{Q(\Theta,Z,\Phi)\in\mathcal{P}} \; \mathrm{KLD}\big[Q(\Theta,Z,\Phi)\,\|\,P(\Theta,Z,\Phi)\big] \tag{E.1}$$

We know from Equation 3.45 that:

$$P(\Theta,Z,\Phi \mid W,\alpha,\beta) = \frac{P_0(\Theta,Z,\Phi\mid\alpha,\beta)\,P(W\mid\Theta,Z,\Phi)}{P(W\mid\alpha,\beta)} \tag{E.2}$$

So, replacing $P(\Theta,Z,\Phi \mid W,\alpha,\beta)$ with the right-hand side formulation in Equation E.1, we obtain:

$$\operatorname*{minimize}_{Q(\Theta,Z,\Phi)\in\mathcal{P}} \; \mathrm{KLD}\left[Q(\Theta,Z,\Phi)\,\Big\|\,\frac{P_0(\Theta,Z,\Phi\mid\alpha,\beta)\,P(W\mid\Theta,Z,\Phi)}{P(W\mid\alpha,\beta)}\right] \tag{E.3}$$

This equation can equivalently be written as:

$$\mathbb{E}_Q\left[\ln \frac{Q(\Theta,Z,\Phi)}{\dfrac{P_0(\Theta,Z,\Phi\mid\alpha,\beta)\,P(W\mid\Theta,Z,\Phi)}{P(W\mid\alpha,\beta)}}\right] \tag{E.4}$$

The above formulation can now be written as:

$$\mathbb{E}_Q\left[\ln \frac{Q(\Theta,Z,\Phi)}{P_0(\Theta,Z,\Phi\mid\alpha,\beta)} - \ln P(W\mid\Theta,Z,\Phi)\right] + \ln P(W\mid\alpha,\beta) \tag{E.5}$$

This can be further written as:

$$\operatorname*{minimize}_{Q(\Theta,Z,\Phi)\in\mathcal{P}} \; \mathrm{KLD}\left[Q(\Theta,Z,\Phi)\,\Big\|\,\frac{P_0(\Theta,Z,\Phi\mid\alpha,\beta)\,P(W\mid\Theta,Z,\Phi)}{P(W\mid\alpha,\beta)}\right] \tag{E.6}$$

This now simplifies to:

$$\operatorname*{minimize}_{Q(\Theta,Z,\Phi)\in\mathcal{P}} \; \mathrm{KLD}\big(Q(\Theta,Z,\Phi)\,\|\,P_0(\Theta,Z,\Phi\mid\alpha,\beta)\big) - \mathbb{E}_Q[\ln P(W\mid\Theta,Z,\Phi)] + \ln P(W\mid\alpha,\beta) \tag{E.7}$$

We can in fact drop the last term in Equation E.7 because it does not depend on $\Theta, Z, \Phi$, so we get:

$$\operatorname*{minimize}_{Q(\Theta,Z,\Phi)\in\mathcal{P}} \; \mathrm{KLD}\big(Q(\Theta,Z,\Phi)\,\|\,P_0(\Theta,Z,\Phi\mid\alpha,\beta)\big) - \mathbb{E}_Q[\ln P(W\mid\Theta,Z,\Phi)] \tag{E.8}$$
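The claim that the minimizer of this KL objective is exactly the Bayes posterior can be illustrated with a tiny discrete example. The three-state prior and likelihood below are made-up numbers, and the coarse grid search is only for illustration.

```python
import numpy as np

# Discrete illustration: for a three-valued latent variable, the Q that
# minimises KLD(Q || prior * likelihood / evidence) is the Bayes posterior.
prior = np.array([0.5, 0.3, 0.2])
likelihood = np.array([0.1, 0.6, 0.3])          # P(data | state), made up
evidence = np.dot(prior, likelihood)            # P(data)
posterior = prior * likelihood / evidence       # Bayes' rule

def kld(q, p):
    return float(np.sum(q * np.log(q / p)))

# Scan candidate Q distributions on a coarse grid of the simplex.
best_q, best_val = None, np.inf
grid = np.linspace(0.01, 0.98, 50)
for a in grid:
    for b in grid:
        if a + b >= 0.99:
            continue
        q = np.array([a, b, 1.0 - a - b])
        val = kld(q, prior * likelihood / evidence)
        if val < best_val:
            best_q, best_val = q, val

print("posterior:", np.round(posterior, 3))
print("KL-minimising Q (grid search):", np.round(best_q, 3))
```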

Appendix F

MedLDA Model - Full Collapsed Gibbs Sampling Derivation

In order to proceed with the derivation of the collapsed Gibbs sampling scheme for the MedLDA model, we first need to define a joint distribution for the words and the topics along with the regularization effects due to the maximum-margin posterior constraints. This joint distribution is written as:

$$P(Z, W \mid \alpha, \beta) = P(W \mid Z, \beta) \times P(Z \mid \alpha) \times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast}\, \Delta f(y, z^d)} \tag{F.1}$$

Now we can separately calculate $P(W \mid Z, \beta)$ and $P(Z \mid \alpha)$ because each of them depends on $\Phi$ and $\Theta$ respectively. According to the paradigm used in the LDA model, we can write the following:

$$P(W \mid Z, \beta) = \int P(W \mid Z, \Phi)\, P(\Phi \mid \beta)\; d\Phi \tag{F.2}$$

$P(\Phi \mid \beta)$ is characterized by the Dirichlet distribution, so we can expand $P(\Phi \mid \beta)$ as:

$$P(\Phi \mid \beta) = \prod_{k=1}^{L} P(\phi_k \mid \beta) = \prod_{k=1}^{L} \frac{\Gamma\big(\sum_{i=1}^{|\beta|}\beta_i\big)}{\prod_{i=1}^{|\beta|}\Gamma(\beta_i)} \prod_{v=1}^{V} \phi_{k,v}^{\beta_v - 1} \tag{F.3}$$

and $P(W \mid Z, \Phi)$ is a multinomial distribution which can be written as:

$$P(W \mid Z, \Phi) = \prod_{i=1}^{W} \phi_{z_i^d, w_i^d} = \prod_{k=1}^{L}\prod_{v=1}^{V} \phi_{k,v}^{M_k^v} \tag{F.4}$$

where $M$ is the $L \times V$ count matrix, which is a low-dimensional representation of the words in the collection, and $M_k^v$ is the number of times topic $k$ is assigned to word $w_i^d = v$.

Based on the equations derived above:

$$P(W \mid Z, \beta) = \prod_{k=1}^{L} \int \frac{\Gamma\big(\sum_{i=1}^{|\beta|}\beta_i\big)}{\prod_{i=1}^{|\beta|}\Gamma(\beta_i)} \prod_{v=1}^{V} \phi_{k,v}^{M_k^v + \beta_v - 1}\; d\phi_k \tag{F.5}$$

Using the following property:

$$\int \prod_{k=1}^{L} f_k(\phi_k)\; d\phi_1\, d\phi_2 \cdots d\phi_L = \prod_{k=1}^{L} \int f_k(\phi_k)\; d\phi_k \tag{F.6}$$

we now proceed as follows, using Equation F.6:

$$P(W \mid Z, \beta) = \prod_{k=1}^{L} \frac{\Gamma\big(\sum_{i=1}^{|\beta|}\beta_i\big)}{\prod_{i=1}^{|\beta|}\Gamma(\beta_i)} \int \prod_{v=1}^{V} \phi_{k,v}^{M_k^v + \beta_v - 1}\; d\phi_k \tag{F.7}$$

The integration above can be computed in closed form as:

$$P(W \mid Z, \beta) = \prod_{k=1}^{L} \frac{\prod_{i=1}^{|M_k+\beta|}\Gamma(M_k^i + \beta_i)\;\big/\;\Gamma\big(\sum_{i=1}^{|M_k+\beta|}(M_k^i + \beta_i)\big)}{\prod_{i=1}^{|\beta|}\Gamma(\beta_i)\;\big/\;\Gamma\big(\sum_{i=1}^{|\beta|}\beta_i\big)} \tag{F.8}$$

where $M_k$ is the $k$-th row of matrix $M$.

Next we consider $P(\theta_d \mid \alpha)$, which also includes a regularization effect, and we proceed as follows:

$$
P(\Theta \mid \alpha) \times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast} \Delta f(y, z^d)}
= \prod_{d=1}^{D} P(\theta^d \mid \alpha) \times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast} \Delta f(y, z^d)}
= \prod_{d=1}^{D} \frac{\Gamma\big(\sum_{i=1}^{|\alpha|}\alpha_i\big)}{\prod_{i=1}^{|\alpha|}\Gamma(\alpha_i)} \prod_{k=1}^{L} \theta_{d,k}^{\alpha_k - 1} \times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast} \Delta f(y, z^d)}
\tag{F.9}
$$

We can now write the above formulation as:

$$
P(Z \mid \Theta) \times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast} \Delta f(y, z^d)}
= \prod_{d=1}^{D}\prod_{i=1}^{N^d} \theta_{d, z_i} \times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast} \Delta f(y, z^d)}
= \prod_{d=1}^{D}\prod_{k=1}^{L} \theta_{d,k}^{S_{d,k}} \times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast} \Delta f(y, z^d)}
\tag{F.10}
$$

$S$ is the count matrix; $S_{d,k}$ is the number of times topic $k$ has been assigned to a word in document $d$, and $S_d$ denotes the $d$-th row of $S$.

$$
P(Z \mid \alpha) \times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast} \Delta f(y, z^d)}
= \int P(Z \mid \Theta)\, P(\Theta \mid \alpha)\; d\Theta \times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast} \Delta f(y, z^d)}
= \prod_{d=1}^{D} \frac{\prod_{i=1}^{|S_d+\alpha|}\Gamma(S_{d,i} + \alpha_i)\;\big/\;\Gamma\big(\sum_{i=1}^{|S_d+\alpha|}(S_{d,i} + \alpha_i)\big)}{\prod_{i=1}^{|\alpha|}\Gamma(\alpha_i)\;\big/\;\Gamma\big(\sum_{i=1}^{|\alpha|}\alpha_i\big)} \times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast} \Delta f(y, z^d)}
\tag{F.11}
$$

Appendix G

Proof of Bayes' Theorem when Cast into an Optimization Problem for Our Bigram Supervised Topic Model

Proof for the optimization equation used in our Bigram Supervised Topic Model

Based on the formula of Bayes' Theorem, we can deduce that $P(\Theta, Z, \Phi \mid B, \alpha, \beta)$ is the posterior distribution that needs to be found. $P_0(\Theta, Z, \Phi \mid \alpha, \beta)$ is the prior distribution, $P(B \mid \Theta, Z, \Phi)$ is the likelihood, and the denominator $P(B \mid \alpha, \beta)$ is the marginal distribution over the data.

We know that the Kullback-Leibler Divergence (KLD) from a distribution p to a distribution q can be written as KLD(q||p). Suppose we consider an arbitrary distribution $Q(\Theta, Z, \Phi \mid B, \alpha, \beta)$. Our goal is to ensure that this distribution is equal to the posterior distribution $P(\Theta, Z, \Phi \mid B, \alpha, \beta)$. As in Bayes' rule, this posterior is obtained by iteratively updating the prior $P_0(\Theta, Z, \Phi \mid \alpha, \beta)$ with the likelihood $P(B \mid \Theta, Z, \Phi)$.

Suppose we want to minimize the divergence between the arbitrary distribution and the posterior distribution, so that the two distributions are as close as possible, i.e., they overlap. We can write this statement mathematically as:

$$\operatorname*{minimize}_{Q(\Theta,Z,\Phi)\in\mathcal{P}} \; \mathrm{KLD}\big[Q(\Theta,Z,\Phi)\,\|\,P(\Theta,Z,\Phi)\big] \tag{G.1}$$

We know that:

$$P(\Theta,Z,\Phi \mid B,\alpha,\beta) = \frac{P_0(\Theta,Z,\Phi\mid\alpha,\beta)\,P(B\mid\Theta,Z,\Phi)}{P(B\mid\alpha,\beta)} \tag{G.2}$$

So, replacing $P(\Theta,Z,\Phi \mid B,\alpha,\beta)$ with the right-hand side formulation in Equation G.1, we obtain:

$$\operatorname*{minimize}_{Q(\Theta,Z,\Phi)\in\mathcal{P}} \; \mathrm{KLD}\left[Q(\Theta,Z,\Phi)\,\Big\|\,\frac{P_0(\Theta,Z,\Phi\mid\alpha,\beta)\,P(B\mid\Theta,Z,\Phi)}{P(B\mid\alpha,\beta)}\right] \tag{G.3}$$

This equation can equivalently be written as:

$$\mathbb{E}_Q\left[\ln \frac{Q(\Theta,Z,\Phi)}{\dfrac{P_0(\Theta,Z,\Phi\mid\alpha,\beta)\,P(B\mid\Theta,Z,\Phi)}{P(B\mid\alpha,\beta)}}\right] \tag{G.4}$$

The above formulation can now be written as:

$$\mathbb{E}_Q\left[\ln \frac{Q(\Theta,Z,\Phi)}{P_0(\Theta,Z,\Phi\mid\alpha,\beta)} - \ln P(B\mid\Theta,Z,\Phi)\right] + \ln P(B\mid\alpha,\beta) \tag{G.5}$$

This can be further written as:

$$\operatorname*{minimize}_{Q(\Theta,Z,\Phi)\in\mathcal{P}} \; \mathrm{KLD}\left[Q(\Theta,Z,\Phi)\,\Big\|\,\frac{P_0(\Theta,Z,\Phi\mid\alpha,\beta)\,P(B\mid\Theta,Z,\Phi)}{P(B\mid\alpha,\beta)}\right] \tag{G.6}$$

This now simplifies to:

$$\operatorname*{minimize}_{Q(\Theta,Z,\Phi)\in\mathcal{P}} \; \mathrm{KLD}\big(Q(\Theta,Z,\Phi)\,\|\,P_0(\Theta,Z,\Phi\mid\alpha,\beta)\big) - \mathbb{E}_Q[\ln P(B\mid\Theta,Z,\Phi)] + \ln P(B\mid\alpha,\beta) \tag{G.7}$$

We can in fact drop the last term in Equation G.7 because it does not depend on $\Theta, Z, \Phi$, so we get:

$$\operatorname*{minimize}_{Q(\Theta,Z,\Phi)\in\mathcal{P}} \; \mathrm{KLD}\big(Q(\Theta,Z,\Phi)\,\|\,P_0(\Theta,Z,\Phi\mid\alpha,\beta)\big) - \mathbb{E}_Q[\ln P(B\mid\Theta,Z,\Phi)] \tag{G.8}$$

Appendix H

Bigram Supervised Topic Model - Full Gibbs Sampling Derivation

In order to proceed with the derivation of the collapsed Gibbs sampling scheme for our bigram supervised topic model, we first need to define a joint distribution for the words and the topics along with the regularization effects due to the maximum-margin posterior constraints. This joint distribution is written as:

$$P(Z, B \mid \alpha, \beta) = P(B \mid Z, \beta) \times P(Z \mid \alpha) \times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast}\, \Delta f(y, z^d)} \tag{H.1}$$

Now, we can proceed to expand the above equation:

$$
\begin{aligned}
P(w, z \mid \alpha, \beta) &= \int_{\Phi} \prod_{d=1}^{D}\prod_{i=1}^{N^d} P(w_i^d \mid \phi_{z_i^d w_{i-1}^d}) \prod_{z=1}^{L}\prod_{v=1}^{W} P(\phi_{zv} \mid \beta)\; d\Phi \\
&\quad\times \int_{\Theta} \prod_{d=1}^{D}\prod_{i=1}^{N^d} P(z_i^d \mid \theta^d)\, P(\theta^d \mid \alpha)\; d\Theta
\times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast} \Delta f(y, z^d)}
\end{aligned}
\tag{H.2}
$$

$$
\propto
\underbrace{\prod_{z=1}^{L}\prod_{w=1}^{W}\frac{\prod_{v=1}^{W}\Gamma(m_{zwv}+\beta_v)}{\Gamma\big(\sum_{v=1}^{W}(m_{zwv}+\beta_v)\big)}}_{\text{probability of a bigram in a topic}}
\;
\underbrace{\prod_{d=1}^{D}\frac{\prod_{z=1}^{L}\Gamma(q_{dz}+\alpha_z)}{\Gamma\big(\sum_{z=1}^{L}(q_{dz}+\alpha_z)\big)}}_{\text{probability of a topic in a document}}
\times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast} \Delta f(y, z^d)}
\tag{H.3}
$$

We know that Γ(α) = (α − 1)Γ(α − 1), so now we can obtain the following conditional probability distribution:

$$
P(z_i^d \mid w, z_{\neg i}^d, \alpha, \beta)
= \frac{P(w_i^d, z_i^d \mid w_{\neg i}^d, z_{\neg i}^d, \alpha, \beta)}{P(w_i^d \mid w_{\neg i}^d, z_{\neg i}^d, \alpha, \beta)}
= \frac{\alpha_{z_i^d} + q_{d z_i^d} - 1}{\sum_{z=1}^{L}(\alpha_z + q_{dz}) - 1}
\times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast} \Delta f(y, z^d)}
\times \frac{\beta_{w_i^d} + m_{z_i^d w_{i-1}^d w_i^d} - 1}{\sum_{v=1}^{W}(\beta_v + m_{z_i^d w_{i-1}^d v}) - 1}
\tag{H.4}
$$

Equivalently, the expression above can be written as:

$$
P(z_i^d \mid w, z_{\neg i}^d, \alpha, \beta)
= \underbrace{\frac{\alpha_{z_i^d} + q_{d z_i^d} - 1}{\sum_{z=1}^{L}(\alpha_z + q_{dz}) - 1}}_{\text{same as the LDA model}}
\times e^{\kappa^{(\ast)} \sum_{d=1}^{D}\sum_{y} (\lambda_y^d)^{\ast} \Delta f(y, z^d)}
\times \underbrace{\frac{\beta_{w_i^d} + m_{w_{i-1}^d w_i^d z_i^d} - 1}{\sum_{v=1}^{W}(\beta_v + m_{z_i^d w_{i-1}^d v}) - 1}}_{\text{bigrams in topics}}
\tag{H.5}
$$

Bibliography

[1] Brett Adams, Dinh Phung, and Svetha Venkatesh. Discovery of latent sub- communities in a blog’s readership. ACM Transactions on the Web, 4(3):12:1– 12:30, 2010. [2] Deepak Agnihotri, Kesari Verma, and Priyanka Tripathi. Pattern and clus- ter mining on text data. In Proceedings of the 4th International Conference on Communication Systems and Network Technologies, pages 428–432. IEEE, 2014. [3] D. Aldous. Exchangeability and related topics. Ecole d’Ete de Probabilites de Saint-Flour XIII-1983, pages 1–198, 1985. [4] James Allan. HARD track overview in TREC 2003 High Accuracy Retrieval from Documents. Technical report, DTIC Document, 2005. [5] Loulwah Alsumait, Pu Wang, Carlotta Domeniconi, and Daniel Barbar. Em- bedding Semantics in LDA Topic Models, pages 183–204. John Wiley & Sons, Ltd, 2010. [6] Masaki Aono and Mei Kobayashi. Text document cluster analysis through visualization of 3D projections. In Data Mining for Service, pages 271–291. Springer, 2014. [7] Arthur Asuncion, Padhraic Smyth, Max Welling, David Newman, Ian Porte- ous, and Scott Triglia. Distributed Gibbs sampling for latent variable models. 2012. [8] Nicola Barbieri, Giuseppe Manco, Ettore Ritacco, Marco Carnuccio, and Anto- nio Bevacqua. Probabilistic topic models for sequence data. Machine Learning, 93(1):5–29, 2013. [9] N. Bartlett, D. Pfau, and F. Wood. Forgetting counts : Constant memory inference for a dependent Hierarchical Pitman-Yor Process. In Proceedings of the 26th International Conference on Machine Learning, pages 63–70, 2010.


[10] Doug Beeferman, Adam Berger, and John Lafferty. Statistical models for text segmentation. Machine Learning, 34:177–210, 1999. [11] J.R. Bellegarda. Large vocabulary speech recognition with multispan statistical language models. IEEE Transactions on Speech and Audio Processing, 8(1):76 –84, jan 2000. [12] Michael Bendersky, W Bruce Croft, and Yanlei Diao. Quality-biased ranking of web documents. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining, pages 95–104. ACM, 2011. [13] Michael W Berry, Susan T Dumais, and Gavin W O’Brien. Using linear al- gebra for intelligent Information Retrieval. Society for Industrial and Applied Mathematics Review, 37(4):573–595, 1995. [14] Suresh K. Bhavnani. Domain-specific search strategies for the effective retrieval of healthcare and shopping information. In Human factors in Computing Sys- tems, pages 610–611, 2002. [15] David Blei and John Lafferty. Correlated topic models. Proceedings in the Advances in Neural Information Processing Systems, 18:147, 2006. [16] David M Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012. [17] David M. Blei and Peter I. Frazier. Distance dependent Chinese Restaurant Processes. The Journal of Machine Learning Research, 12:2461–2488, Novem- ber 2011. [18] David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, pages 113–120. ACM, 2006. [19] David M Blei and John D Lafferty. Topic models. Text mining: classification, clustering, and applications, 10:71, 2009. [20] David M Blei and John D Lafferty. Visualizing topics with multi-word expres- sions. arXiv preprint arXiv:0907.1013, 2009. [21] David M Blei and Jon D McAuliffe. Supervised topic models. In Proceedings of the Neural Information Processing Systems, volume 7, pages 121–128, 2007. [22] David M Blei and Pedro J Moreno. Topic segmentation with an aspect Hidden Markov Model. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 343– 348. ACM, 2001. [23] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003. 298

[24] Phil Blunsom, Trevor Cohn, Sharon Goldwater, and Mark Johnson. A note on the implementation of Hierarchical Dirichlet Processes. In Proceedings of the ACL-International Joint Conference on Natural Language Processing 2009 Conference Short Papers, ACLShort ’09, pages 337–340, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics. [25] Levent Bolelli, S¸eyda Ertekin, and C Lee Giles. Topic and trend detection in text collections using Latent Dirichlet Allocation. In Proceedings of the European Conference on Information Retrieval, pages 776–780. Springer, 2009. [26] Charles Bouveyron and Camille Brunet-Saumard. Model-based clustering of high-dimensional data: a review. Computational Statistics & Data Analysis, 71:52–78, 2014. [27] Jordan Boyd-Graber and David M. Blei. Multilingual topic models for un- aligned text. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 75–82, Arlington, Virginia, United States, 2009. AUAI Press. [28] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. [29] Bernard Brosseau-Villeneuve, Jian-Yun Nie, and Noriko Kando. Latent word context model for Information Retrieval. Information Retrieval, 17(1):21–51, 2014. [30] Bertram Bruce, Andee Rubin, and Kathleen S. Starr. Why readability formulas fail? IEEE Transactions on Professional Communication, pages 50–52, 1981. [31] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using Gradient Descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89–96. ACM, 2005. [32] Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In Proceedings of the Neural Information Pro- cessing Systems, volume 6, pages 193–200, 2006.

[33] R´obert Busa-Fekete, Bal´azs K´egl, Tam´as Eltet˝o,´ and Gy¨orgy Szarvas. Tune and mix: learning to rank using ensembles of calibrated multi-class classifiers. Machine Learning, 93(2-3):261–292, 2013. [34] Istvn Br and Jcint Szab. Latent Dirichlet Allocation for automatic docu- ment categorization. In Wray Buntine, Marko Grobelnik, Dunja Mladeni, and John Shawe-Taylor, editors, Machine Learning and Knowledge Discovery in Databases, volume 5782 of Lecture Notes in Computer Science, pages 430–441. Springer Berlin Heidelberg, 2009. [35] Karla L Caballero, Joel Barajas, and Ram Akella. The generalized Dirich- let distribution in enhanced topic detection. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 773–782. ACM, 2012. 299

[36] Peng Cai, Wei Gao, Aoying Zhou, and Kam-Fai Wong. Relevant knowledge helps in choosing right teacher: active query selection for ranking adaptation. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124. ACM, 2011. [37] Kate Cain. Text comprehension and its relation to coherence and cohesion in children’s fictional narratives. British Journal of Developmental Psychology, 21(3):335–351, 2003. [38] Juan Cao, Jintao Li, Yongdong Zhang, and Sheng Tang. LDA-based retrieval framework for semantic news video retrieval. In International Conference on Semantic Computing, pages 155–160, Sept 2007. [39] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136. ACM, 2007. [40] George Casella and Edward I. George. Explaining the Gibbs sampler. The American Statistician, 46(3):pp. 167–174, 1992. [41] Ali Taylan Cemgil. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, 2009, 2009. [42] Jonathan Chang and David M Blei. Relational topic models for document net- works. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 81–88, 2009. [43] Tao-Hsing Chang, Yao-Ting Sung, and Yao-Tung Lee. Evaluating the difficulty of concepts on domain knowledge using latent semantic analysis. In Proceedings of the International Conference on Asian Language Processing, pages 193–196. IEEE, 2013. [44] Berlin Chen. Word topic models for spoken document retrieval and transcrip- tion. 8(1):2:1–2:27, March 2009. [45] Changyou Chen, Lan Du, and Wray Buntine. Sampling table configurations for the Hierarchical Poisson-Dirichlet Process. In Machine Learning and Knowl- edge Discovery in Databases, pages 296–311. Springer, 2011. [46] Xi Chen and Shihong Chen. Subsequence-based text segmentation and label- ing. In Proceedings of the First International Workshop on Education Tech- nology and Computer Science, volume 1, pages 582–587. IEEE, 2009. [47] Jen-Tzung Chien and Chuang-Hua Chueh. Latent Dirichlet language model for speech recognition. In Spoken Language Technology Workshop, pages 201–204, Dec 2008. [48] J.T. Chien and C.H. Chueh. Topic-based hierarchical segmentation. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):55–66, 2012. [49] Gerda Claeskens and Nils Lid Hjort. Model selection and model averaging. Cambridge Books, 1993. 300

[50] Meri Coleman and T. L. Liau. A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2):283–284, 1975. [51] Kevyn Collins-Thompson, Paul N Bennett, Ryen W White, Sebastian de la Chica, and David Sontag. Personalizing web search results by reading level. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 403–412. ACM, 2011. [52] Kevyn Collins-Thompson and Jamie Callan. Predicting reading difficulty with statistical language models. Journal of the Association for Information Science and Technology, 56(13):1448–1462, November 2005. [53] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learn- ing, 20(3):273–297, 1995. [54] Edgar Dale and Jeanne S. Chall. A formula for predicting readability: Instruc- tions. Educational Research Bulletin, 27(2):pp. 37–54, 1948. [55] Van Dang, Michael Bendersky, and W Bruce Croft. Two-stage learning to rank for Information Retrieval. In Proceedings of the European Conference in Information Retrieval, pages 423–434. 2013. [56] W.M. Darling. Generalized Probabilistic Topic and Syntax Models for Natural Language Processing. PhD thesis, 2012. [57] Paul Deane. A nonparametric method for extraction of candidate phrasal terms. In Proceedings of the 43rd Annual Meeting on Association for Compu- tational Linguistics, ACL ’05, pages 605–613, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics. [58] Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Fur- nas, and Richard A. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990. [59] Arthur P Dempster, Nan M Laird, Donald B Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977. [60] P.J. Diggle and R.J. Gratton. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society, pages 193–227, 1984. [61] Chris Ding, Tao Li, and Wei Peng. NMF and PLSI: Equivalence and a hybrid algorithm. In Proceedings of the 29th Annual International ACM SIGIR Con- ference on Research and Development in Information Retrieval, pages 641–642. ACM, 2006. [62] Chris Ding, Tao Li, and Wei Peng. On the equivalence between Non-Negative Matrix Factorization and Probabilistic Latent Semantic Indexing. Computa- tional Statistics & Data Analysis, 52(8):3913–3927, 2008. [63] L. Du, W. Buntine, and H. Jin. A segmented topic model based on the two- parameter Poisson-Dirichlet Process. Machine Learning, 81(1):5–19, 2010. 301

[64] William H. Dubay. The principles of readability. Costa Mesa, CA: Impact Information, 2004.
[65] Susan T. Dumais. Latent Semantic Indexing (LSI): TREC-3 report. In Overview of the Third Text REtrieval Conference, pages 219–230, 1995.
[66] Michael B Eisen, Paul T Spellman, Patrick O Brown, and David Botstein. Cluster analysis and display of Genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, 1998.
[67] Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare Voss, and Jiawei Han. Scalable topical phrase mining from text corpora. Computation and Language - arXiv, 2014.
[68] Thomas S Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pages 209–230, 1973.
[69] Evelyn C Ferstl and D Yves von Cramon. The role of coherence and cohesion in text comprehension: an event-related fMRI study. Cognitive Brain Research, 11(3):325–340, 2001.
[70] E.B. Fox. Bayesian Nonparametric Learning of Complex Dynamical Phenomena. Ph.D. thesis, MIT, Cambridge, MA, 2009.
[71] Emily B Fox, Erik B Sudderth, Michael I Jordan, and Alan S Willsky. A sticky HDP-HMM with application to speaker diarization. The Annals of Applied Statistics, 5(2A):1020–1056, 2011.
[72] Thomas François and Eleni Miltsakaki. Do NLP and machine learning improve traditional readability formulas? In Proceedings of the 1st Workshop on Predicting and Improving Text Readability for Target Reader Populations, pages 49–57, Montréal, Canada, June 2012.
[73] Jerome H Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
[74] Debasis Ganguly, Johannes Leveling, and Gareth J.F. Jones. Topical relevance model. In Yuexian Hou, Jian-Yun Nie, Le Sun, Bo Wang, and Peng Zhang, editors, Information Retrieval Technology, volume 7675 of Lecture Notes in Computer Science, pages 326–335. Springer Berlin Heidelberg, 2012.
[75] Jianfeng Gao, Kristina Toutanova, and Wen-tau Yih. Clickthrough-based Latent Semantic Models for web search. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, pages 675–684, New York, NY, USA, 2011. ACM.
[76] Wei Gao and Pei Yang. Democracy is good for ranking: Towards multi-view rank learning and adaptation in web search. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM ’14, pages 63–72, New York, NY, USA, 2014. ACM.

[77] Eric Gaussier and Cyril Goutte. Relation between PLSA and NMF and impli- cations. In Proceedings of the 28th Annual International ACM SIGIR Confer- ence on Research and Development in Information Retrieval, pages 601–602. ACM, 2005. [78] Stuart Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):721–741, 1984. [79] Kleanthi Georgala, Aris Kosmopoulos, and George Paliouras. Spam filtering: An Active Learning approach using incremental clustering. In Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics, page 23. ACM, 2014. [80] Yogesh Girdhar and Gregory Dudek. Exploring underwater environments with curiosity. In Canadian Conference on Computer and Robot Vision (CRV), pages 104–110. IEEE, 2014. [81] Yogesh Girdhar, Philippe Gigu`ere, and Gregory Dudek. Autonomous adaptive exploration using realtime online spatiotemporal topic modeling. The Inter- national Journal of Robotics Research, page 0278364913507325, 2013. [82] Yogesh A. Girdhar, David Whitney, and Gregory Dudek. Curiosity based exploration for learning terrain models. CoRR, abs/1310.6767, 2013. [83] Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. Contextual depen- dencies in unsupervised word segmentation. In Proceedings of the 21st Inter- national Conference on Computational Linguistics and the 44th annual meet- ing of the Association for Computational Linguistics, ACL-44, pages 673–680, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. [84] Sharon Goldwater, Thomas L Griffiths, and Mark Johnson. A Bayesian frame- work for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54, 2009. [85] Gene H Golub and Christian Reinsch. Singular Value Decomposition and Least Squares solutions. Numerische Mathematik, 14(5):403–420, 1970. [86] Thomas L Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National academy of Sciences of the United States of America, 101(Suppl 1):5228–5235, 2004. [87] Thomas L Griffiths, Mark Steyvers, and Joshua B Tenenbaum. Topics in semantic representation. Psychological Review, 114(2):211, 2007. [88] Amit Gruber, Yair Weiss, and Michal Rosen-Zvi. Hidden topic Markov Models. In Proceedings of the Artificial Intelligence and Statistics, pages 163–170, 2007. [89] Viet Ha-Thuc and Padmini Srinivasan. Topic models and a revisit of text- related applications. In Proceedings of the 2nd PhD Workshop on Information and Knowledge Management, PIKM ’08, pages 25–32, New York, NY, USA, 2008. ACM. 303

[90] Viet Ha-Thuc and Padmini Srinivasan. A latent Dirichlet framework for rel- evance modeling. In GaryGeunbae Lee, Dawei Song, Chin-Yew Lin, Akiko Aizawa, Kazuko Kuriyama, Masaharu Yoshioka, and Tetsuya Sakai, editors, Information Retrieval Technology, volume 5839 of Lecture Notes in Computer Science, pages 13–25. Springer Berlin Heidelberg, 2009. [91] Michael A.K. Halliday and Ruqaiya Hasan. Cohesion in English. Longman, 1976. [92] Jiawei Han, Hong Cheng, Dong Xin, and Xifeng Yan. Frequent pattern mining: current status and future directions. Data Mining and Knowledge Discovery, 15(1):55–86, 2007. [93] John A Hartigan. Clustering algorithms. John Wiley & Sons, Inc., 1975. [94] Morgan Harvey, Ian Ruthven, and Mark Carman. Ranking social bookmarks using topic models. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM ’10, pages 1401–1404, New York, NY, USA, 2010. ACM. [95] Eva Hasler, Phil Blunsom, Philipp Koehn, and Barry Haddow. Dynamic topic adaptation for phrase-based MT. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothen- burg, Sweden, 2014. [96] Trevor Hastie, Robert Tibshirani, Jerome Friedman, T Hastie, J Friedman, and R Tibshirani. The elements of statistical learning, volume 2. Springer, 2009. [97] Marti A Hearst. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64, 1997. [98] Michael Heilman, Kevyn Collins-Thompson, and Maxine Eskenazi. An analysis of statistical models and features for reading difficulty prediction. In Proceed- ings of the 3rd Workshop on Innovative Use of NLP for Building Educational Applications, pages 71–79. Association for Computational Linguistics, 2008. [99] Neil W Henry. Latent structure analysis. Encyclopedia of Statistical Sciences, 1983. [100] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303– 1347, 2013. [101] Thomas Hofmann. Probabilistic Latent Semantic Indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57. ACM, 1999. [102] Thomas Hofmann. Unsupervised learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42:177–196, 2001. [103] Thomas Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89–115, 2004. 304

[104] Thomas Hofmann and Jan Puzicha. Latent class models for collaborative filtering. In Proceedings of the International Joint Conferences on Artificial Intelligence, volume 99, pages 688–693, 1999. [105] Liangjie Hong and Brian D Davison. Empirical study of topic modeling in Twitter. In Proceedings of the 1st Workshop on Social Media Analytics, pages 80–88. ACM, 2010. [106] Liangjie Hong, Byron Dom, Siva Gurumurthy, and Kostas Tsioutsiouliklis. A time-dependent topic model for multiple text streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, pages 832–840, New York, NY, USA, 2011. ACM. [107] Liangjie Hong, Dawei Yin, Jian Guo, and Brian D Davison. Tracking trends: incorporating term volume into temporal topic models. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 484–492. ACM, 2011. [108] Fred M Hoppe. P´olya-like urns and the Ewens’ sampling formula. Journal of Mathematical Biology, 20(1):91–94, 1984. [109] Eva H¨orster, Rainer Lienhart, and Malcolm Slaney. Image retrieval on large- scale image databases. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval, CIVR ’07, pages 17–24, New York, NY, USA, 2007. ACM. [110] Guangyan Huang, Jing He, Yanchun Zhang, Wanlei Zhou, Hai Liu, Peng Zhang, Zhiming Ding, Yue You, and Jian Cao. Mining streams of short text for analysis of world-wide event evolutions. World Wide Web, pages 1–17, 2014. [111] Zahurul Islam, Md Rashedur Rahman, and Alexander Mehler. Readability classification of bangla texts. In Computational Linguistics and Intelligent Text Processing, pages 507–518. Springer, 2014. [112] Tomoharu Iwata and Hiroshi Sawada. Topic model for analyzing purchase data with price information. Data Mining and Knowledge Discovery, 26(3):559–573, 2013. [113] Anil K Jain and Richard C Dubes. Algorithms for clustering data. Prentice- Hall, Inc., 1988. [114] Anil K Jain, M Narasimha Murty, and Patrick J Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999. [115] Shoaib Jameel and Wai Lam. An n-gram topic model for time-stamped doc- uments. In Proceedings of the 35th European Conference on Information Re- trieval, pages 292–304. Springer, 2013. [116] Shoaib Jameel and Wai Lam. A nonparametric n-gram topic model with in- terpretable latent topics. In Proceedings of the 9th Asia Information Retrieval Societies Conference, pages 74–85. Springer, 2013. 305

[117] Shoaib Jameel and Wai Lam. An unsupervised topic segmentation model incorporating word order. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 203– 212. ACM, 2013. [118] Shoaib Jameel, Wai Lam, Ching-man Au Yeung, and Sheaujiun Chyan. An un- supervised ranking method based on a technical difficulty terrain. In Proceed- ings of the 20th ACM International Conference on Information and Knowledge Management, pages 1989–1992. ACM, 2011. [119] Shoaib Jameel, Wai Lam, and Xiaojun Qian. Ranking text documents based on conceptual difficulty using term embedding and sequential discourse cohesion. In Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology-Volume 01, pages 145– 152. IEEE Computer Society, 2012. [120] Shoaib Jameel, Wai Lam, Xiaojun Qian, and Ching-man Au Yeung. An unsu- pervised technical difficulty ranking model based on conceptual terrain in the latent space. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 351–352. ACM, 2012. [121] Shoaib Jameel and Xiaojun Qian. An unsupervised technical readability rank- ing model by building a conceptual terrain in LSI. In Proceedings of the 8th International Conference on Semantics, Knowledge and Grids, pages 39–46. IEEE, 2012. [122] Shoaib Jameel, Xiaojun Qian, and Wai Lam. N-gram fragment sequence based unsupervised domain-specific document readability. In Proceedings of the 24th International Conference on Computational Linguistics, pages 1309– 1326. ACL, 2012. [123] Qixia Jiang, Jun Zhu, Maosong Sun, and Eric P Xing. Monte Carlo methods for maximum margin supervised topic models. In Proceedings of the Neural Information Processing Systems, pages 1601–1609, 2012. [124] Ou Jin, Nathan N Liu, Kai Zhao, Yong Yu, and Qiang Yang. Transferring top- ical knowledge from auxiliary long texts for short text clustering. In Proceed- ings of the 20th ACM International Conference on Information and Knowledge Management, pages 775–784. ACM, 2011. [125] Yookyung Jo, John E Hopcroft, and Carl Lagoze. The web of topics: discov- ering the topology of topic evolution in a corpus. In Proceedings of the 20th International Conference on World Wide Web, pages 257–266. ACM, 2011. [126] Thorsten Joachims. Text categorization with Support Vector Machines: Learn- ing with many relevant features. In Proceedings of the European Conference on Machine Learning, volume 1398, pages 137–142. 1998. [127] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142. ACM, 2002. 306

[128] Mark Johnson. PCFGs, topic models, adaptor grammars and learning topical collocations and the structure of proper names. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 1148–1157, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. [129] Nikhil Johri, Dan Roth, and Yuancheng Tu. Experts’ retrieval with multiword- enhanced author topic model. In Proceedings of the The North American Chap- ter of the Association for Computational Linguistics HLT 2010 Workshop on Semantic Search, SS ’10, pages 10–18, Stroudsburg, PA, USA, 2010. Associa- tion for Computational Linguistics. [130] Rosie Jones, Ravi Kumar, Bo Pang, and Andrew Tomkins. I know what you did last summer: query logs and user privacy. In Proceedings of the 16th ACM International Conference on Information and Knowledge Management, pages 909–914. ACM, 2007. [131] Tapas Kanungo and David Orr. Predicting the readability of short web sum- maries. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, pages 202–211. ACM, 2009. [132] Rohit J Kate, Xiaoqiang Luo, Siddharth Patwardhan, Martin Franz, Radu Flo- rian, Raymond J Mooney, Salim Roukos, and Chris Welty. Learning to predict readability using diverse linguistic features. In Proceedings of the 23rd Interna- tional Conference on Computational Linguistics, pages 546–554. Association for Computational Linguistics, 2010. [133] Noriaki Kawamae. Trend analysis model: trend consists of temporal words, topics, and timestamps. In Proceedings of the 4th ACM International Confer- ence on Web Search and Data Mining, pages 317–326. ACM, 2011. [134] Noriaki Kawamae. Supervised n-gram topic model. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM ’14, pages 473–482, New York, NY, USA, 2014. ACM. [135] Dongwoo Kim and Alice Oh. Accounting for data dependencies within a Hi- erarchical Dirichlet Process Mixture Model. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11, pages 873–878, New York, NY, USA, 2011. ACM. [136] Hyun Duk Kim, Dae Hoon Park, Yue Lu, and ChengXiang Zhai. Enriching text representation with frequent pattern mining for probabilistic topic modeling. Journal of American Society for Information Science and Technology, 49(1):1– 10, 2012. [137] Jin Young Kim, Kevyn Collins-Thompson, Paul N Bennett, and Susan T Du- mais. Characterizing web content, user interests, and search behavior by read- ing level and topic. In Proceedings of the 5th ACM International Conference on Web Search and Data Mining, pages 213–222. ACM, 2012. 307

[138] Jin Young Kim, Kevyn Collins-Thompson, Paul N. Bennett, and Susan T. Dumais. Characterizing web content, user interests, and search behavior by reading level and topic. In Proceedings of the 5th ACM International Conference on Web Search and Data Mining, pages 213–222, 2012.
[139] J. Peter Kincaid, Robert P. Fishburne, Richard L. Rogers, and Brad S. Chissom. Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease formula) for Navy enlisted personnel. 1975.
[140] W. Kintsch. The role of knowledge in discourse comprehension: a construction-integration model. Psychological Review, 95(2):163, 1988.
[141] Jon Kleinberg. Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery, 7(4):373–397, 2003.
[142] Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[143] Dan Knights, Michael C Mozer, and Nicolas Nicolov. Detecting topic drift with compound topic models. In Proceedings of the International AAAI Conference on Weblogs and Social Media, 2009.
[144] Joël Kuiper and Gert van Valkenhoef. Top-level MeSH disease terms are not linearly separable in clinical trial abstracts. In Niels Peek, Roque Marín Morales, and Mor Peleg, editors, Artificial Intelligence in Medicine, volume 7885 of Lecture Notes in Computer Science, pages 130–134. Springer Berlin Heidelberg, 2013.
[145] Giridhar Kumaran, Rosie Jones, and Omid Madani. Biasing web search results for topic familiarity. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 271–272. ACM, 2005.
[146] Takeshi Kurashima, Tomoharu Iwata, Takahide Hoshide, Noriko Takaya, and Ko Fujimura. Geo topic model: joint modeling of user’s activity area and interests for location recommendation. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining, WSDM ’13, pages 375–384, New York, NY, USA, 2013. ACM.
[147] Simon Lacoste-Julien, Fei Sha, and Michael I Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Proceedings of the Neural Information Processing Systems, volume 83, page 85, 2008.
[148] Hanjiang Lai, Yan Pan, Cong Liu, Liang Lin, and Jie Wu. Sparse learning-to-rank via an efficient primal-dual algorithm. IEEE Transactions on Computers, 62(6):1221–1233, 2013.
[149] Balaji Lakshminarayanan and Raviv Raich. Inference in supervised latent Dirichlet allocation. In IEEE International Workshop on Machine Learning for Signal Processing, pages 1–6. IEEE, 2011.
[150] Jey Han Lau, Timothy Baldwin, and David Newman. On collocations and topic models. ACM Transactions on Speech and Language Processing, 10(3):10:1–10:14, 2013.

[151] Jey Han Lau, Karl Grieser, David Newman, and Timothy Baldwin. Auto- matic labelling of topic models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 1536–1545, Stroudsburg, PA, USA, 2011. Associ- ation for Computational Linguistics. [152] Daniel D Lee and H Sebastian Seung. Algorithms for Non-Negative Matrix Factorization. In Proceedings of the Neural Information Processing Systems, pages 556–562, 2000. [153] R. Lempel and S. Moran. SALSA: The Stochastic approach for Link-structure Analysis. ACM Transactions on Information Systems, 19(2):131–160, April 2001. [154] Gondy Leroy and James E. Endicott. Combining NLP with evidence-based methods to find text metrics related to perceived and actual text difficulty. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Sympo- sium, pages 749–754, 2012. [155] Gondy Leroy, Trudi Miller, Graciela Rosemblat, and Allen Browne. A balanced approach to health information evaluation: A vocabulary-based Naive Bayes classifier and readability formulas. Journal of the Association for Information Science and Technology, 59(9):1409–1419, July 2008. [156] Dingcheng Li, Swapna Somasundaran, and Amit Chakraborty. A combination of topic models with max-margin learning for relation detection. In Proceedings of TextGraphs-6: Graph-based Methods for Natural Language Processing, pages 1–9. Association for Computational Linguistics, 2011. [157] Dingcheng Li, Swapna Somasundaran, and Amit Chakraborty. ERD-MedLDA: Entity relation detection using supervised topic models with maximum margin learning. Natural Language Engineering, 18(2):263, 2012. [158] Hang Li. Learning to rank for Information Retrieval and Natural Language Processing. Synthesis Lectures on Human Language Technologies, 4(1):1–113, 2011. [159] Li Li, Chang Liu, Fang Wang, Wei Miao, Jie Zhang, Zhiqian Kang, Yihan Chen, and Luying Peng. Unraveling the hidden heterogeneities of breast cancer based on functional miRNA cluster. PLOS ONE, 9(1):e87601, 2014. [160] Ping Li, Christopher JC Burges, Qiang Wu, JC Platt, D Koller, Y Singer, and S Roweis. Mcrank: Learning to rank using multiple classification and gradi- ent boosting. In Proceedings of the Neural Information Processing Systems, volume 7, pages 845–852, 2007. [161] Wei Li and Andrew McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the International Conference on Machine Learning, pages 577–584, 2006. [162] Wenbo Li, Le Sun, Yuanyong Feng, and Dakun Zhang. Smoothing LDA model for text categorization. In Hang Li, Ting Liu, Wei-Ying Ma, Tetsuya Sakai, 309

Kam-Fai Wong, and Guodong Zhou, editors, Information Retrieval Technology, volume 4993 of Lecture Notes in Computer Science, pages 83–94. Springer Berlin Heidelberg, 2008.
[163] Zhisheng Li, Chong Wang, Xing Xie, Xufa Wang, and Wei-Ying Ma. Exploring LDA-based document model for Geographic Information Retrieval. In Carol Peters, Valentin Jijkoun, Thomas Mandl, Henning Müller, Douglas W. Oard, Anselmo Peñas, Vivien Petras, and Diana Santos, editors, Advances in Multilingual and Multimodal Information Retrieval, volume 5152 of Lecture Notes in Computer Science, pages 842–849. Springer Berlin Heidelberg, 2008.
[164] Renjie Liao, Jun Zhu, and Zengchang Qin. Nonparametric Bayesian upstream supervised multi-modal topic models. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM ’14, pages 493–502, New York, NY, USA, 2014. ACM.
[165] Xiaojun Lin, Dan Li, and Xihong Wu. A joint topical n-gram language model based on LDA. In Proceedings of the 2010 2nd International Workshop on Intelligent Systems and Applications, pages 1–4, 2010.
[166] Robert V Lindsey, William P Headden III, and Michael J Stipicevic. A phrase-discovering topic model using Hierarchical Pitman-Yor Processes. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 214–222. Association for Computational Linguistics, 2012.
[167] Tie-Yan Liu. Learning to rank for Information Retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
[168] Xiaoyong Liu, W Bruce Croft, Paul Oh, and David Hart. Automatic recognition of reading levels from user queries. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 548–549. ACM, 2004.
[169] J. Löfberg. YALMIP: a toolbox for modeling and optimization in MATLAB. In 2004 IEEE International Symposium on Computer Aided Control Systems Design, pages 284–289, September 2004.
[170] Stacy K. Lukins, Nicholas A. Kraft, and Letha H. Etzkorn. Bug localization using Latent Dirichlet Allocation. Information and Software Technology, 52(9):972–990, 2010.
[171] Dashun Ma, Lan Rao, and Ting Wang. An empirical study of SLDA for Information Retrieval. In Mohamed Vall Mohamed Salem, Khaled Shaalan, Farhad Oroumchian, Azadeh Shakery, and Halim Khelalfa, editors, Information Retrieval Technology, volume 7097 of Lecture Notes in Computer Science, pages 84–92. Springer Berlin Heidelberg, 2011.
[172] David JC MacKay and Linda C Bauman Peto. A hierarchical Dirichlet language model. Natural Language Engineering, 1(3):289–308, 1995.
[173] Craig MacDonald, Rodrygo LT Santos, and Iadh Ounis. The whens and hows of learning to rank for web search. Information Retrieval, 16(5):584–628, 2013.

[174] Tomonari Masada, Daiji Fukagawa, Atsuhiro Takasu, Yuichiro Shibata, and Kiyoshi Oguri. Modeling topical trends over continuous time with priors. In Advances in Neural Networks - ISNN 2010, pages 302–311. 2010. [175] Andrew McCallum and Xuerui Wang. A note on topical N-grams. Department of Computer Science, University of Massachusetts, Amherst, 2005. [176] G Harry McLaughlin. SMOG grading: A new readability formula. Journal of Reading, 12(8):639–646, 1969. [177] Danielle S McNamara, Eileen Kintsch, Nancy Butler Songer, and Walter Kintsch. Are good texts always better? interactions of text coherence, back- ground knowledge, and levels of understanding in learning from text. Cognition and Instruction, 14(1):1–43, 1996. [178] Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai. Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD Interna- tional Conference on Knowledge Discovery and Data Mining, KDD ’07, pages 490–499, New York, NY, USA, 2007. ACM. [179] Donald Metzler and W Bruce Croft. A Markov Random Field model for term dependencies. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 472– 479. ACM, 2005. [180] Donald Metzler and W Bruce Croft. Linear feature-based models for Informa- tion Retrieval. Information Retrieval, 10(3):257–274, 2007. [181] David Mimno and Andrew McCallum. Expertise modeling for matching papers with reviewers. In Proceedings of the 13th ACM SIGKDD International Con- ference on Knowledge Discovery and Data Mining, KDD ’07, pages 500–509, New York, NY, USA, 2007. ACM. [182] Tom Minka and Stephen Robertson. Selection bias in the LETOR datasets. In Special Interest Group on Information Retrieval Workshop on Learning to Rank for Information Retrieval, pages 48–51, 2008. [183] Hemant Misra, Fran¸cois Yvon, Joemon M Jose, and Olivier Cappe. Text seg- mentation via topic modeling: An analytical study. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1553– 1556. ACM, 2009. [184] Jane Morris and Graeme Hirst. Lexical cohesion computed by thesaural re- lations as an indicator of the structure of text. Computational Linguistics, 17(1):21–48, 1991. [185] P. Van Mulbregt, I. Carp, L. Gillick, S. Lowe, and J. Yamron. Text segmen- tation and topic tracking on broadcast news via a Hidden Markov Model ap- proach. In Proceedings of the 6th International Conference on Spoken Language Processing, pages 2519–2522, 1998. 311

[186] Arpita Nagpal, Aman Jatain, and Deepti Gaur. Review based on data clus- tering algorithms. In Proceedings of the IEEE Conference on Information & Communication Technologies, pages 298–303. IEEE, 2013. [187] Makoto Nakatani, Adam Jatowt, and Katsumi Tanaka. Easiest-first search: towards comprehension-based web search. In Proceedings of the 18th ACM Con- ference on Information and Knowledge Management, pages 2057–2060. ACM, 2009. [188] Makoto Nakatani, Adam Jatowt, and Katsumi Tanaka. Adaptive ranking of search results by considering user’s comprehension. In Proceedings of the 4th International Conference on Uniquitous Information Management and Com- munication, pages 27:1–27:10, 2010. [189] Ramesh Nallapati. Discriminative models for Information Retrieval. In Pro- ceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 64–71. ACM, 2004. [190] R.M. Neal. Markov chain sampling methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000. [191] Duc Thien Nguyen, William Yeoh, and Hoong Chuin Lau. Distributed Gibbs: A memory-bounded sampling-based DCOP algorithm. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Sys- tems, pages 167–174. International Foundation for Autonomous Agents and Multiagent Systems, 2013. [192] Viet-An Nguyen, Jordan Boyd-Graber, and Philip Resnik. SITS: a hierar- chical nonparametric model using speaker identity for topic segmentation in multiparty conversations. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL ’12, pages 78–87, Stroudsburg, PA, USA, 2012. Association for Computational Lin- guistics. [193] Uri Nodelman, Christian R Shelton, and Daphne Koller. Continuous time Bayesian networks. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pages 378–387. Morgan Kaufmann Publishers Inc., 2002. [194] Hiroshi Noji, Daichi Mochihashi, and Yusuke Miyao. Improvements to the Bayesian topic n-gram models. In Proceedings of the Conference on Empirical Methods on Natural Language Processing, pages 1180–1190, 2013. [195] Peter Orbanz. Infinite-Dimensional Exponential Families in Cluster Analysis of Structured Data. PhD thesis, ETH Zurich, 2008. [196] Adler J Perotte, Frank Wood, No´emie Elhadad, and Nicholas Bartlett. Hier- archically supervised Latent Dirichlet Allocation. In Proceedings of the Neural Information Processing Systems, pages 2609–2617, 2011. [197] Sarah E. Petersen and Mari Ostendorf. A machine learning approach to reading level assessment. Computer Speech and Language, 23(1):89–106, January 2009. 312

[198] Saša Petrović, Jan Šnajder, and Bojana Dalbelo Bašić. Extending lexical association measures for collocation extraction. Computer Speech and Language, 24(2):383–394, 2010.
[199] Lev Pevzner and Marti A. Hearst. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1):19–36, 2002.
[200] Xuan-Hieu Phan, Le-Minh Nguyen, and Susumu Horiguchi. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th International Conference on World Wide Web, WWW ’08, pages 91–100, New York, NY, USA, 2008. ACM.
[201] Do Viet Phuong and Tu Minh Phuong. A keyword-topic model for contextual advertising. In Proceedings of the 3rd Symposium on Information and Communication Technology, SoICT ’12, pages 63–70, New York, NY, USA, 2012. ACM.
[202] Emily Pitler and Ani Nenkova. Revisiting readability: A unified framework for predicting text quality. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 186–195. Association for Computational Linguistics, 2008.
[203] J. Pitman and M. Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855–900, 1997.
[204] Jim Pitman. Combinatorial stochastic processes, volume 1875. Springer-Verlag, 2006.
[205] Tamara Polajnar and Mark Girolami. Application of lexical topic models to protein interaction sentence prediction. 2009.
[206] Mark Polczynski and Michael Polczynski. Using the k-Means clustering algorithm to classify features for Choropleth Maps. Cartographica: The International Journal for Geographic Information and Geovisualization, 49(1):69–75, 2014.
[207] Alexandrin Popescul, David M. Pennock, and Steve Lawrence. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, UAI’01, pages 437–444, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[208] Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. Fast collapsed Gibbs sampling for Latent Dirichlet Allocation. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’08, pages 569–577, New York, NY, USA, 2008. ACM.
[209] I. Pruteanu-Malinici, Lu Ren, J. Paisley, E. Wang, and L. Carin. Hierarchical Bayesian modeling of topics in time-stamped documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(6):996–1011, 2010.

[210] Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A benchmark collection for research on learning to rank for Information Retrieval. Information Retrieval, 13(4):346–374, 2010.
[211] Zengchang Qin, Marcus Thint, and Zhiheng Huang. Ranking answers by hierarchical topic models. In Been-Chian Chien, Tzung-Pei Hong, Shyi-Ming Chen, and Moonis Ali, editors, Next-Generation Applied Intelligence, volume 5579 of Lecture Notes in Computer Science, pages 103–112. Springer Berlin Heidelberg, 2009.
[212] Rani Qumsiyeh and Yiu-Kai Ng. ReadAid: A robust and fully-automated readability assessment tool. In Proceedings of the 23rd IEEE International Conference on Tools with Artificial Intelligence, pages 539–546, 2011.
[213] Dheeraj Rajagopal, Daniel Olsher, Erik Cambria, and Kenneth Kwok. Commonsense-based topic modeling. In Proceedings of the 2nd International Workshop on Issues of Sentiment Discovery and Opinion Mining, WISDOM ’13, pages 6:1–6:8, New York, NY, USA, 2013. ACM.
[214] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, pages 248–256. Association for Computational Linguistics, 2009.
[215] Santu Rana, Dinh Phung, and Svetha Venkatesh. Split-merge augmented Gibbs sampling for Hierarchical Dirichlet Processes. In Jian Pei, Vincent S. Tseng, Longbing Cao, Hiroshi Motoda, and Guandong Xu, editors, Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, volume 7819 of Lecture Notes in Computer Science, pages 546–557. Springer Berlin Heidelberg, 2013.
[216] Joseph Reisinger, Austin Waters, Bryan Silverthorn, and Raymond J Mooney. Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 903–910, 2010.
[217] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI’95, pages 448–453, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
[218] Jeffrey C Reynar. Topic segmentation: Algorithms and applications. PhD thesis, University of Pennsylvania, 1998.
[219] Martin Riedl and Chris Biemann. How text segmentation algorithms gain from topic models. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 553–557. Association for Computational Linguistics, 2012.
[220] Martin Riedl and Chris Biemann. TopicTiling: a text segmentation algorithm based on LDA. In Proceedings of the Association for Computational Linguistics 2012 Student Research Workshop, pages 37–42. Association for Computational Linguistics, 2012.

[221] S.E. Robertson, S. Walker, S. Jones, M.M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. pages 109–126, 1996. [222] Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494. AUAI Press, 2004. [223] Timothy N Rubin, America Chambers, Padhraic Smyth, and Mark Steyvers. Statistical topic models for multi-label document classification. Machine Learn- ing, 88(1-2):157–208, 2012. [224] Andres Salcedo, AA Berlind, and A Maller. Dark matter halo clustering in the lasdamas simulations. In American Astronomical Society Meeting Abstracts, volume 223, 2014. [225] G. Salton, A. Wong, and C. S. Yang. A Vector Space Model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975. [226] Kim Schouten and Flavius Frasincar. A dependency graph isomorphism for news sentence searching. In Elisabeth Mtais, Farid Meziane, Mohamad Saraee, Vijayan Sugumaran, and Sunil Vadera, editors, Natural Language Processing and Information Systems, volume 7934 of Lecture Notes in Computer Science, pages 384–387. Springer Berlin Heidelberg, 2013. [227] Sarah E Schwarm and Mari Ostendorf. Reading level assessment using Support Vector Machines and statistical language models. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 523–530. Association for Computational Linguistics, 2005. [228] R.J. Senter and E.A. Smith. Automated Readability Index. Cincinnati Uni- versity Ohio, 1967. [229] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994. [230] M. Shafiei and Evangelos Milios. A statistical model for topic segmentation and clustering. In Proceedings of the Advances in Artificial Intelligence, volume 5032, pages 283–295, 2008. [231] M Mahdi Shafiei and Evangelos E Milios. Latent Dirichlet co-clustering. In Proceedings of the 6th International Conference on Data Mining, pages 542– 551. IEEE, 2006. [232] PA Shaver. The clustering of Quasars. Astronomy and Astrophysics, 136:L9, 1984. [233] Jianfeng Si, Qing Li, Tieyun Qian, and Xiaotie Deng. Users interest grouping from online reviews based on topic frequency and order. World Wide Web, pages 1–22, 2013. 315

[234] Luo Si and Jamie Callan. A statistical model for scientific readability. In Pro- ceedings of the 10th International Conference on Information and Knowledge Management, pages 574–576. ACM, 2001. [235] Craig Silverstein, Hannes Marais, Monika Henzinger, and Michael Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 33(1):6–12, September 1999. [236] Alessandro Sordoni, Jing He, and Jian-Yun Nie. Modeling latent topic inter- actions using quantum interference for Information Retrieval. In Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management, pages 1197–1200. ACM, 2013. [237] Mark Steyvers and Tom Griffiths. Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424–440, 2007. [238] Amos J. Storkey and Andrew Dai. The supervised hierarchical Dirichlet Pro- cess. IEEE Transactions on Pattern Analysis and Machine Intelligence, 99, 2014. [239] Erik Blaine Sudderth. Graphical models for visual object recognition and track- ing. PhD thesis, Massachusetts Institute of Technology, 2006. [240] Russell Swan and James Allan. Extracting significant time varying features from text. In Proceedings of the 8th International Conference on Information and Knowledge Management, pages 38–45. ACM, 1999. [241] Andrea Tagarelli and George Karypis. A segment-based approach to clustering multi-topic documents. Knowledge and Information Systems, 34(3):563–595, 2013. [242] Chenhao Tan, Evgeniy Gabrilovich, and Bo Pang. To each his own: person- alized content selection based on text comprehensibility. In Proceedings of the 5th ACM International Conference on Web Search and Data Mining, pages 233–242. ACM, 2012. [243] Jian Tang, Ning Liu, Jun Yan, Yelong Shen, Shaodan Guo, Bin Gao, Shuicheng Yan, and Ming Zhang. Learning to rank audience for behavioral targeting in display ads. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 605–610. ACM, 2011. [244] Yee Whye Teh. A hierarchical Bayesian language model based on Pitman-Yor Processes. In Proceedings of the 21st International Conference on Computa- tional Linguistics and the 44th Annual Meeting of the Association for Com- putational Linguistics, ACL-44, pages 985–992, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. [245] Yee Whye Teh. Dirichlet Process, 2010. [246] Yee Whye Teh and Michael I Jordan. Hierarchical Bayesian nonparametric models with applications. Bayesian Nonparametrics: Principles and Practice, pages 158–207, 2010. 316

[247] Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hier- archical Dirichlet Processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006. [248] Yee Whye Teh, David Newman, and Max Welling. A collapsed variational Bayesian inference algorithm for Latent Dirichlet Allocation. In Proceedings of the Neural Information Processing Systems, volume 6, pages 1378–1385, 2006. [249] Philip Van Oosten and V´eronique Hoste. Readability annotation: Replacing the expert by the crowd. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, pages 120–129, 2011. [250] Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc., 1995. [251] Ivan Vulic and Marie-Francine Moens. A unified framework for monolingual and cross-lingual relevance modeling based on probabilistic topic models. In Proceedings of the 35th European Conference on Information Retrieval, volume 7814 of Lecture Notes in Computer Science, pages 98–109. Springer Berlin Heidelberg, 2013. [252] Hanna M Wallach. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, pages 977–984. ACM, 2006. [253] Hanna M Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic models. In Proceedings of the 26th Annual Inter- national Conference on Machine Learning, pages 1105–1112. ACM, 2009. [254] Ai Wang, YaoDong Li, and Wei Wang. Cross language information retrieval based on LDA. In IEEE International Conference on Intelligent Computing and Intelligent Systems, volume 3, pages 485–490, Nov 2009. [255] Chi Wang, Marina Danilevsky, Nihit Desai, Yinan Zhang, Phuong Nguyen, Thrivikrama Taula, and Jiawei Han. A phrase mining framework for recursive construction of a topical hierarchy. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, pages 437–445, New York, NY, USA, 2013. ACM. [256] Chong Wang, David Blei, and David Heckerman. Continuous time dynamic topic models. In Proceedings of the Uncertainty in Artificial Intelligence, pages 579–586, 2008. [257] Chong Wang, David Blei, and Fei-Fei Li. Simultaneous image classification and annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1903–1910. IEEE, 2009. [258] Chong Wang and David M. Blei. A split-merge MCMC algorithm for the Hierarchical Dirichlet Process. Computing Research Repository, abs/1201.1657, 2012. 317

[259] Sheng Wang, Fangtao Li, and Ming Zhang. Supervised topic model with con- sideration of user and item. In Workshops at the 27th AAAI Conference on Artificial Intelligence, 2013. [260] Xuerui Wang and Andrew McCallum. Topics over time: a non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Min- ing, pages 424–433. ACM, 2006. [261] Xuerui Wang, Andrew McCallum, and Xing Wei. Topical n-grams: Phrase and topic discovery, with an application to Information Retrieval. In Proceedings of the 7th IEEE International Conference on Data Mining, pages 697–702. IEEE, 2007. [262] Xuerui Wang, Natasha Mohanty, and Andrew McCallum. Group and topic discovery from relations and their attributes. Proceedings of the Neural Infor- mation Processing Systems, 18:1449, 2006. [263] Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y Chang. PLDA: Parallel Latent Dirichlet Allocation for large-scale applications. In Algorithmic Aspects in Information and Management, pages 301–314. Springer, 2009. [264] Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and EdwardY. Chang. PLDA: Parallel latent Dirichlet allocation for large-scale applications. In An- drew V. Goldberg and Yunhong Zhou, editors, Algorithmic Aspects in Infor- mation and Management, volume 5564 of Lecture Notes in Computer Science, pages 301–314. Springer Berlin Heidelberg, 2009. [265] Yu Wang, Eugene Agichtein, and Michele Benzi. TM-LDA: efficient online modeling of latent topic transitions in social media. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 123–131. ACM, 2012. [266] Xing Wei and W Bruce Croft. LDA-based document models for ad-hoc re- trieval. In Proceedings of the 29th Annual International ACM SIGIR Confer- ence on Research and Development in Information Retrieval, pages 178–185. ACM, 2006. [267] Xing Wei and W. Bruce Croft. Investigating retrieval performance with manually-built topic models. In Large Scale Semantic Access to Content (Text, Image, Video, and Sound), RIAO ’07, pages 333–349, Paris, France, France, 2007. [268] Jason Weston, Hector Yee, and Ron J Weiss. Learning to rank recommenda- tions with the k-order statistic loss. In Proceedings of the 7th ACM conference on Recommender systems, pages 245–248. ACM, 2013. [269] Ryen W White, Susan T Dumais, and Jaime Teevan. Characterizing the in- fluence of domain expertise on web search behavior. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, pages 132–141. ACM, 2009. 318

[270] Qiang Wu, Christopher JC Burges, Krysta M Svore, and Jianfeng Gao. Adapting boosting for Information Retrieval measures. Information Retrieval, 13(3):254–270, 2010. [271] Wensheng Wu and Tingting Zhong. Searching the deep web using proactive phrase queries. In Proceedings of the 22nd International Conference on World Wide Web Companion, pages 137–138. International World Wide Web Con- ferences Steering Committee, 2013. [272] Han Xiao and Thomas Stibor. Efficient collapsed gibbs sampling for Latent Dirichlet Allocation. Journal of Machine Learning Research-Proceedings Track, 13:63–78, 2010. [273] Boyi Xie and Rebecca J Passonneau. Supervised HDP using prior knowl- edge. In Natural Language Processing and Information Systems, pages 197– 202. Springer, 2012. [274] Jun Xu and Hang Li. AdaRank: A boosting algorithm for Information Re- trieval. In Proceedings of the 30th Annual International ACM SIGIR Confer- ence on Research and Development in Information Retrieval, pages 391–398. ACM, 2007. [275] Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on Non- Negative Matrix Factorization. In Proceedings of the 26th Annual Interna- tional ACM SIGIR Conference on Research and Development in Informaion Retrieval, pages 267–273. ACM, 2003. [276] Wael MS Yafooz, Siti ZZ Abidin, Nasiroh Omar, and Rosenah A Halim. Shared-table for textual data clustering in distributed relational databases. In Proceedings of the 1st International Conference on Advanced Data and In- formation Engineering, pages 49–57. Springer, 2014. [277] Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. A biterm topic model for short texts. In Proceedings of the 22nd International Conference on World Wide Web, pages 1445–1456. International World Wide Web Confer- ences Steering Committee, 2013. [278] Xin Yan, Raymond YK Lau, Dawei Song, Xue Li, and Jian Ma. Toward a semantic granularity model for domain-specific Information Retrieval. ACM Transactions on Information Systems, 29(3):15, 2011. [279] Xin Yan, Dawei Song, and Xue Li. Concept-based document readability in do- main specific Information Retrieval. In Proceedings of the 15th ACM Interna- tional Conference on Information and Knowledge Management, pages 540–549. ACM, 2006. [280] Zehua Yan and Fang Li. News thread extraction based on topical n-gram model with a background distribution. In Bao-Liang Lu, Liqing Zhang, and James Kwok, editors, Proceedings of the Neural Information Processing, volume 7063 of Lecture Notes in Computer Science, pages 416–424. Springer Berlin Heidel- berg, 2011. 319

[281] Chyi-Kwei Yau, Alan Porter, Nils Newman, and Arho Suominen. Clustering scientific documents with topic modeling. Scientometrics, pages 1–20, 2014. [282] Xing Yi and James Allan. Evaluating topic models for Information Retrieval. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 1431–1432, New York, NY, USA, 2008. ACM. [283] Zhijun Yin, Liangliang Cao, Jiawei Han, Cheng Xiang Zhai, and Thomas Huang. LPTA: A probabilistic model for latent periodic topic analysis. In Proceedings of the IEEE 11th International Conference on Data Mining, pages 904–913. IEEE, 2011. [284] Kazuyoshi Yoshii and Masataka Goto. A vocabulary-free infinity-gram model for nonparametric Bayesian chord progression analysis. In Proceedings of the International Society for Music Information Retrieval, pages 645–650, 2011. [285] Dingrong Yuan, Yuwei Cuan, and Yaqiong Liu. An effective clustering al- gorithm for transaction databases based on k-Mean. Journal of Computers, 9(4):812–816, 2014. [286] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A Sup- port Vector method for optimizing average precision. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 271–278. ACM, 2007. [287] Arnold Zellner. Optimal information processing and Bayes’s theorem. The American Statistician, 42(4):278–280, 1988. [288] Chengxiang Zhai and John Lafferty. A study of smoothing methods for lan- guage models applied to Information Retrieval. ACM Transactions on Infor- mation Systems, 22(2):179–214, April 2004. [289] Lidon Zhai, Zhaoyun Ding, Yan Jia, and Bin Zhou. A word position-related LDA model. International Journal of Pattern Recognition and Artificial Intel- ligence, 25(06):909–925, 2011. [290] Ce Zhang and Christopher R´e. Towards high-throughput Gibbs sampling at scale: A study across storage managers. In Proceedings of the 2013 Interna- tional Conference on Management of Data, pages 397–408. ACM, 2013. [291] Chenyi Zhang and Jianling Sun. Large scale microblog mining using distributed MB-LDA. In Proceedings of the 21st International Conference Companion on World Wide Web, WWW ’12 Companion, pages 1035–1042, New York, NY, USA, 2012. ACM. [292] Xiangmin Zhang, Jingjing Liu, Michael Cole, and Nicholas Belkin. Predicting users’ domain knowledge in Information Retrieval using multiple regression analysis of search behaviors. Journal of the Association for Information Science and Technology, 2014. [293] Jin Zhao and Min-Yen Kan. Domain-specific iterative readability computation. In Proceedings of the 10th Annual Joint Conference on Digital Libraries, pages 205–214. ACM, 2010. 320

[294] Shibin Zhou, Kan Li, and Yushu Liu. Text categorization based on topic model. In Guoyin Wang, Tianrui Li, JerzyW. Grzymala-Busse, Duoqian Miao, An- drzej Skowron, and Yiyu Yao, editors, Rough Sets and Knowledge Technology, volume 5009 of Lecture Notes in Computer Science, pages 572–579. Springer Berlin Heidelberg, 2008. [295] Shibin Zhou, Kan Li, and Yushu Liu. Text categorization based on topic model. International Journal of Computational Intelligence Systems, 2(4):398– 409, 2009. [296] TomChao Zhou, MichaelRung-Tsong Lyu, Irwin King, and Jie Lou. Learning to suggest questions in social media. Knowledge and Information Systems, pages 1–28, 2014. [297] Jun Zhu, Amr Ahmed, and Eric P Xing. MedLDA: maximum margin su- pervised topic models for regression and classification. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1257–1264. ACM, 2009. [298] Jun Zhu, Amr Ahmed, and Eric P Xing. MedLDA: maximum margin super- vised topic models. The Journal of Machine Learning Research, 13(1):2237– 2278, 2012. [299] Jun Zhu, Ning Chen, Hugh Perkins, and Bo Zhang. Gibbs max-margin topic models with fast sampling algorithms. In Proceedings of the 30th International Conference on Machine Learning, pages 124–132, 2013. [300] Jun Zhu, Xun Zheng, and Bo Zhang. Improved Bayesian logistic supervised topic models with data augmentation. In Proceedings of the Association for Computational Linguistics, pages 187–195, 2013. [301] Jun Zhu, Xun Zheng, Li Zhou, and Bo Zhang. Scalable inference in max- margin topic models. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 964–972. ACM, 2013. The End

"If we knew what it was we were doing, it would not be called research, would it?" - Albert Einstein