Conversational Network in the Chinese Buddhist Canon
Tak-sum Wong and John Sie Yuen Lee
City University of Hong Kong
Conference on Digital Humanities 2015

Application of corpus
• It is common to apply linguistic annotation to a corpus in order to study the language therein.
• Can we apply dependency relations to analyze the characters in a literary text?

Outline
• What is a treebank?
• Construction of our treebank
• Conversational network of Buddhist text
  – Goddess of Mercy
  – Who spoke most?
  – Mahāyāna vs Hīnayāna

Syntactic tree
[Figure]

Treebanks: What?
• Many types of parse trees
  – Example: Stanford dependency parse tree
• A treebank is a collection of syntactically analyzed sentences
  – Typically in the form of parse trees
• A dependency tree represents grammatical relations between words
  – Example: ‘Bills’ is the child of ‘submitted’ in the relation nsubjpass
• A tree also includes part-of-speech tags
  – Critical for Chinese since it has no inflectional morphology
  – Dependency relation: ‘monk’ is a direct object of ‘see’
  – Part-of-speech tag: ‘monk’ is a noun
  – Word glosses of the example sentence: ‘not’ ‘hear’ ‘Buddha’ ‘sutra’ ‘and’ ‘not’ ‘see’ ‘monk’
  – Translation: [He] has neither heard about the Buddhist scriptures nor seen any monk.

Treebanks: Why?
• Quickly find examples to support linguistic research
  – E.g., in the passive structure 為…所…, 為 is sometimes dropped in Buddhist Chinese text
    • A feature of Buddhist Chinese
  – Easy to search for passive sentences in a treebank
• Characterize the “profile” of a word
  – What can you pray ‘for’, and who can you pray ‘to’?
  – Word Sketch [Kilgarriff et al. 2004]
• Sketch differences between ‘clever’ and ‘intelligent’
  – Compare the meaning of “clever” and “intelligent” by looking at adjectives that collocate with them
  – Word Sketch [Kilgarriff et al. 2004]

Treebank development
• Training data
  – Small-scale treebank created by Lee & Kong (2014)
  – 50k characters
  – Finely tagged by Buddhist specialists
  – POS-tag set: adapted from the Penn Chinese Treebank
  – Dependency labels: largely follow Stanford Dependencies for Chinese, plus 5 new relations for MC
• Pre-processing:
  – Transplanted punctuation to the Tripiṭaka Koreana 高麗藏 from the Taishō edition 大正藏
• No parser for Classical Chinese
  – Word segmentation and POS-tagging by CRF++
  – Dependency parsing by MST parser
  – External dictionaries
    • Soothill-Hodous Dictionary of Chinese Buddhist Terms
    • Person and Place Authority Databases from DDBC

Interesting problems
• How close are the characters in the Buddhist world?
• We aim to answer this question by examining the conversations in Buddhist texts.

Most Frequent Say verbs
• 言 yán ‘to say’ (10979)
• 告 gào ‘to tell; to announce to’ (5401)
• 白……言 bái… yán ‘to address … and say’ (5015)
• 答曰 dáyuē ‘to reply and say’ (2157)
• 曰 yuē ‘to say’ (2126)
• 問 wèn ‘to inquire’ (2091)
• 告……言 gào… yán ‘to tell… and say’ (737)
• 白 bái ‘to address’ (475)
• 語 yù ‘to say’ (453)

Extraction of speaker and listener
[Figure]

The case of Goddess of Mercy
• Avalokiteśvara
• Kwun Yam, Gwan-eum, Kanon, Guānyīn, and Quan Âm 觀音

The case of Goddess of Mercy: as speaker
Distribution of listeners of the Goddess of Mercy (N=195)
• Buddha 92%
• bodhisattva 6%
• others 2%
Distribution of type of saying verbs, the Goddess of Mercy as speaker (N=195)
• address 白 86%
• unmarked 11%
• reply 2%
• tell 1%

The case of Goddess of Mercy: as listener
Distribution of speakers of the Goddess of Mercy (N=143)
• Buddha 90%
• bodhisattva 9%
• others 1%
Distribution of type of saying verbs, the Goddess of Mercy as listener (N=143)
• tell 告 55%
• unmarked 41%
• ask/reply 3%
• address 1%

Visualization of conversational network

Conversational network
[Figure: Conversational network of the CBC, showing edges with 200 utterances or more]

Protagonists
Who talked the most?
• Subhūti
• Maudgalyāyana
• Avalokiteśvara
• Śākyamuni Buddha

Interlocutors of the protagonists
[Figure]

Speak to Listen Ratio
[Figure]

Buddhist network without Buddha
[Figure]

Mahāyāna and Hīnayāna

Mahāyāna section
[Figure: Conversational network of the Mahāyāna section of the Chinese Buddhist Canon, showing edges with more than 280 utterances. Labels in the figure: Theory of wisdom endowed with insight into emptiness; Absolute fundamental reality; Perfection of wisdom]

Hīnayāna section
[Figure: Conversational network of the Hīnayāna section of the Chinese Buddhist Canon, showing edges with 100 or more utterances]

Mahāyāna              Hīnayāna
釋迦牟尼佛  17457     釋迦牟尼佛  15154
須菩提      5013      比丘        10096
文殊菩薩    3553      阿難        2273
舍利弗      3316      人          833
阿難        2702      比丘尼      682
菩薩        1772      舍利弗      612
天子        1333      王          564
比丘        924       婆羅門      457
帝釋天      878       優波離      456
彌勒菩薩    690       摩訶迦葉    365

Conclusion
• Built the first corpus of the CBC (46 million characters) semi-automatically, using only 50k characters of manually annotated data
• Demonstrated how to exploit linguistic annotations to analyze the characters in large-scale Chinese literary texts by using dependency relations
  – Studied conversational network
  – Statistics, e.g. protagonists, interlocutors of protagonists
  – Mahāyāna versus Hīnayāna

Thank you! Q&A.
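
Appendix: a sketch of the extraction and network steps

The slides describe extracting speaker and listener from dependency parses around say verbs, keeping only edges above an utterance threshold, and computing a speak-to-listen ratio. The following is a minimal illustration of that idea, not the authors' pipeline. The token layout (id, form, head, relation), the choice of nsubj for the speaker and dobj for the listener, the function names, and the two toy parses are all assumptions made for this example.

```python
from collections import Counter

# Assumption: a parsed sentence is a list of (id, form, head_id, deprel) tuples
# with Stanford-style labels. Single-character say verbs from the slides:
SAY_VERBS = {"言", "告", "白", "曰", "問", "語"}


def extract_pairs(sentence):
    """Return (speaker, listener) pairs, one per say verb that has a subject.

    Assumption: the speaker is the verb's nsubj child and the listener its
    dobj child; the listener is None when no dobj is present ("unmarked").
    """
    pairs = []
    for tok_id, form, _head, _rel in sentence:
        if form not in SAY_VERBS:
            continue
        # Map each relation label to the form of the child attached to this verb.
        children = {rel: f for (_i, f, head, rel) in sentence if head == tok_id}
        if "nsubj" in children:
            pairs.append((children["nsubj"], children.get("dobj")))
    return pairs


def build_network(pairs, min_utterances=1):
    """Count utterances per directed (speaker, listener) edge and apply a threshold."""
    counts = Counter((s, l) for s, l in pairs if l is not None)
    return {edge: n for edge, n in counts.items() if n >= min_utterances}


def speak_listen_ratio(edges, name):
    """Utterances spoken by `name` divided by utterances addressed to `name`."""
    spoken = sum(n for (s, _l), n in edges.items() if s == name)
    heard = sum(n for (_s, l), n in edges.items() if l == name)
    return spoken / heard if heard else float("inf")


# Toy parses (hypothetical, for illustration only): 佛告阿難 and 阿難白佛.
s1 = [(1, "佛", 2, "nsubj"), (2, "告", 0, "root"), (3, "阿難", 2, "dobj")]
s2 = [(1, "阿難", 2, "nsubj"), (2, "白", 0, "root"), (3, "佛", 2, "dobj")]

pairs = extract_pairs(s1) + extract_pairs(s1) + extract_pairs(s2)
edges = build_network(pairs)
```

Raising `min_utterances` plays the role of the thresholds quoted in the figure captions (for example, 200 utterances or more), so that only the most frequent conversational links are drawn.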