Corpus-Based Analysis of Mediæval Chinese Literature: Case Study on the Chinese Buddhist Canon 數據庫為本之中古漢語文獻攷察: 以大藏經為例
CITY UNIVERSITY OF HONG KONG 香港城市大學

Corpus-based Analysis of Mediæval Chinese Literature: Case Study on the Chinese Buddhist Canon 數據庫為本之中古漢語文獻攷察: 以大藏經為例

Submitted to Department of Linguistics and Translation 翻譯及語言學系 in Partial Fulfilment of the Requirements for the Degree of Doctor of Philosophy 哲學博士學位

by Wong Tak Sum 黃得森

May 2018 二零一八年五月

ABSTRACT

Treebanks, collections of syntactically analyzed sentences, have been applied more and more frequently for data-mining in recent decades. This study built a treebank of the entire Tripiţaka Koreana, the edition of the Chinese Buddhist canon stored in Korea, to provide a tool for systematic analysis of its content. The treebank provides three levels of annotation (word boundaries, parts of speech and dependency relations), which also serve as an aid for reading the original text. In this dissertation, the author demonstrates the potential of applying this tool to analyze both the content and the language of the canon.

The Chinese Buddhist canon is a remarkable religious text whose tradition has over 250 million followers worldwide. The sheer volume of this text, approximately 50 million Chinese characters in total, makes it very difficult for any research team, let alone an individual scholar, to analyze the contents or linguistic features of even a small fraction of the whole canon. Fortunately, in this new era, corpora and treebanks provide tools for data-mining raw text, and thus a basis for distant reading. At present, large-scale treebanks of religious texts have been compiled for only two of the major religions of the world. The current work fills this gap with a dependency treebank of the Chinese Buddhist canon built from limited manually annotated data.
The treebank was then applied to analyze different aspects of the original text, namely (i) the protagonists and locations, (ii) conversational networks, and (iii) the diachronic syntax of Mediæval Chinese.

The author first built the Tripiţaka Koreana Treebank, of 46 million characters, using only 50 thousand characters of manually annotated data. The small-scale treebank of Chinese Buddhist text developed by Lee and Kong (2016) served as the training data for the model used to parse the treebank produced in this study. For this small-scale treebank, the guidelines of the Penn Chinese Treebank by Xue et al. (2005) and the Stanford Dependencies for Chinese by Chang et al. (2009) were adopted, with five syntactic relations added for grammatical structures that exist in Classical Chinese but not in Modern Standard Chinese. The author trained a conditional random field model to tag word boundaries and parts of speech automatically. A minimum spanning tree parser was then trained to parse the treebank automatically. Syntactic trees of the whole Tripiţaka Koreana were thus derived, labelled with word boundaries, parts of speech and dependency relations. The whole Tripiţaka Koreana Treebank was released on the internet for open public access in CoNLL format.

The author also developed a methodology for the quantitative analysis of the protagonists and locations in a literary text using techniques from natural language processing, and applied it to the treebank. The grammatical relations of the words were used to extract all the predicates, together with their subjects and objects, for further analysis. As a result, the most frequent verbs, and the characters that most frequently serve as nominal subjects, were derived. The most significant differences between the three most popular epithets of Śākyamuni Buddha were also discovered by means of the log-likelihood statistic.
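The log-likelihood comparison mentioned above can be illustrated with a minimal sketch. It assumes the two-cell form of the statistic that is common in corpus-linguistic keyword comparison; the abstract does not say which variant the author used, and the function name and the counts below are hypothetical.

```python
import math


def log_likelihood(freq_a, freq_b, total_a, total_b):
    """Two-cell log-likelihood (G2) for one word in two sub-corpora.

    freq_a, freq_b: occurrences of the word in sub-corpus A and B.
    total_a, total_b: total word counts of sub-corpus A and B.
    """
    combined = freq_a + freq_b
    # Expected counts if the word were spread evenly by corpus size.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


# Hypothetical counts: a verb occurs 20 times among 1,000 words governed by
# one epithet and 5 times among 1,000 words governed by another.
score = log_likelihood(20, 5, 1000, 1000)
print(round(score, 2))
```

A score above 3.84, the chi-square critical value for one degree of freedom at p < 0.05, is conventionally read as a significant difference in usage between the two sub-corpora.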
In addition to the protagonists, the locations that appear most frequently in the canon and those that Śākyamuni Buddha visited were also derived. Furthermore, the Mahāyāna and Hīnayāna sections of the canon were contrasted to reveal the significant differences between them in terms of locations and characters. The full database of protagonists and locations in the Tripiţaka Koreana was released on the internet.

The above analysis shows that the most frequent verbs in the treebank are quotative verbs. For this reason, the author also derived an algorithm to extract all the direct speech in the treebank and analyzed the quotative verbs, speakers and listeners, and thus the whole conversational network. As a result, statistics were produced on the most frequent speakers, their most frequent listeners, the speech length, and the most frequent quotative verbs used when the Buddha is the speaker or the listener. The honorific use of quotative verbs was also discovered in the canon. This study thus made use of the special properties of these verbs to induce a hierarchy of characters. It was found that Śākyamuni Buddha occupies the top spot while the bodhisattvas rank second. The disciples, deities, and kings were treated as characters of lower status in the canon.

In addition to analyzing the content, this study also used the treebank to show one way of applying a dependency treebank to diachronic linguistic research. The author studied the nature and genre of the Chinese Buddhist canon from the perspective of syntactic change in the following constructions: (i) the use of demonstratives, (ii) classifier constructions, (iii) nominal constructions, (iv) the disposal construction, (v) prepositional phrases, and (vi) polar questions. It was found that the texts translated before the tenth century were vernacular in general.
However, in the texts translated in the tenth and eleventh centuries, the language lacked many vernacular elements, which had been removed by the Stylists.

Keywords: Buddhism, Chinese Buddhist canon, Classical Chinese literature, corpus linguistics, treebank, quantitative method

CITY UNIVERSITY OF HONG KONG
Qualifying Panel and Examination Panel

Surname: WONG
First Name: Tak Sum
Degree: Doctor of Philosophy
College/Department: Department of Linguistics and Translation

The Qualifying Panel of the above student is composed of:

Supervisor
Dr LEE John Sie Yuen, Department of Linguistics and Translation, City University of Hong Kong

Qualifying Panel Members
Dr FANG Chengyu Alex, Department of Linguistics and Translation, City University of Hong Kong
Dr KIT Chun Yu, Department of Linguistics and Translation, City University of Hong Kong

This thesis has been examined and approved by the following examiners:
Dr LUN Suen Caesar, Department of Linguistics and Translation, City University of Hong Kong
Dr LEE John Sie Yuen, Department of Linguistics and Translation, City University of Hong Kong
Prof. KWONG Oi Yee, Department of Translation, The Chinese University of Hong Kong
Prof. ZHU Qingzhi, Department of Chinese Language Studies, The Education University of Hong Kong

ACKNOWLEDGEMENTS

During these three and a half years of study, I have greatly profited from the academic atmosphere at City University of Hong Kong. At this point, I would like to express my deep appreciation to all those who have helped me in the production of this work.

First of all, I must express my sincere gratitude and deep appreciation to my advisor, Dr John S. Y. Lee, for his patience with my shortcomings from the beginning, when I served as a research assistant, for his unfailing support during these three and a half years, and for his invitation to join this PhD programme. I am also grateful to Prof. Lewis R.
Lancaster, from the University of California, Berkeley, who kindly provided an electronic version of the complete Tripiţaka Koreana, which made this research project possible. I would also like to thank Dr Mable Chan for her efficient proofreading service. The comments and suggestions provided by the examiners are very much appreciated. I am also indebted to my great colleagues: Dr Wàn Míngyú, for her generous help in submitting the first version of this dissertation to the department during my research trip in Paris; and Dr Yeung Chak Yan, for designing the tools for drawing the graphs of conversational networks in this dissertation. Last but not least, I would also like to thank my family, other teachers, classmates and the supporting staff in the Department of Linguistics and Translation for their teaching, support and understanding.

This dissertation was made possible by postgraduate studentships received from City University of Hong Kong. The studentships were initially accepted by the university as block grants, and by Dr John Lee as a part of his grant from the Early Career Scheme (Project Number 155412), both of which were offered by the University Grants Council of the Hong Kong Government. Without this financial aid, it would have been more difficult for me to finish my research degree at the university.

Wong Tak-sum
June, 2562 Buddhist Era

TABLE OF CONTENTS

Abstract
Qualifying Panel and Examination Panel
Acknowledgements
Table of Contents
List of Figures