Conversational Network in the Chinese Buddhist Canon
Tak-sum Wong and John Sie Yuen Lee
City University of Hong Kong
Conference on Digital Humanities 2015
Application of corpus
• It is common to apply linguistic annotation to a corpus to study the language therein.
• Can we apply dependency relations to analyze the characters in a literary text?
Outline
• What is a treebank?
• Construction of our treebank
• Conversational network of Buddhist text
  – Goddess of Mercy
  – Who spoke most?
  – Mahāyāna vs Hīnayāna
Syntactic tree
Treebanks: What?
• Many types of parse trees
  – Example: Stanford dependency parse tree
Treebanks: What?
• A treebank is a collection of syntactically analyzed sentences
  – Typically in the form of parse trees
Treebanks: What?
• A dependency tree represents grammatical relations between words
  – Example: ‘Bills’ is the child of ‘submitted’ in the relation nsubjpass
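As a minimal sketch of this representation, a dependency tree can be stored as a list of tokens, each pointing to its parent with a relation label. The class and field names below are illustrative assumptions, not the treebank's actual storage format. The three-word fragment uses Stanford Dependencies labels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    index: int   # 1-based position in the sentence
    form: str    # surface word
    pos: str     # part-of-speech tag
    head: int    # index of the parent token; 0 marks the root
    deprel: str  # dependency relation to the parent


def children_of(tokens: List[Token], head_index: int) -> List[Token]:
    """Return every token whose parent is the token at head_index."""
    return [t for t in tokens if t.head == head_index]


# Illustrative fragment: 'Bills' is the child of 'submitted' (nsubjpass).
sentence = [
    Token(1, "Bills", "NNS", 3, "nsubjpass"),
    Token(2, "were", "VBD", 3, "auxpass"),
    Token(3, "submitted", "VBN", 0, "root"),
]

for child in children_of(sentence, 3):
    print(f"{child.deprel}(submitted, {child.form})")
```

Storing only the parent index and the label is enough to recover the whole tree, which is why one-token-per-row formats are common for treebanks.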
Treebanks: What?
• A tree also includes part-of-speech tags
  – Critical for Chinese since it has no inflectional morphology
Dependency relation: ‘monk’ is a direct object of ‘see’
Part-of-speech tag: ‘monk’ is a noun
‘not’ ‘hear’ ‘Buddha’ ‘sutra’ ‘and’ ‘not’ ‘see’ ‘monk’
[He] has neither heard about the Buddhist scriptures nor seen any monk.
Treebanks: Why?
• Quickly find examples to support linguistic research
  – E.g., in the passive structure 為…所…, 為 is sometimes dropped in Buddhist Chinese text
    • A feature of Buddhist Chinese
  – Easy to search for passive sentences in a treebank
Treebanks: Why?
• Characterize the “profile” of a word
What can you pray ‘for’, and who can you pray ‘to’?
Word Sketch [Kilgarriff et al. 2004]
Treebanks: Why?
• Sketch differences between ‘clever’ and ‘intelligent’
Compare the meaning of “clever” and “intelligent” by looking at adjectives that collocate with them
Word Sketch [Kilgarriff et al. 2004]
Treebank development
• Training data
  – Small-scale treebank created by Lee & Kong (2014)
  – 50k characters
  – Finely tagged by Buddhist specialists
  – POS tag set: adapted from the Penn Chinese Treebank
  – Dependency labels: largely follow Stanford Dependencies for Chinese, plus 5 new relations for MC
Treebank development
• Pre-processing:
  – Transferred punctuation from the Taishō edition 大正藏 to the Tripiṭaka Koreana 高麗藏
• No parser for Classical Chinese
  – Word segmentation and POS tagging by CRF++
  – Dependency parsing by MST parser
  – External dictionaries
    • Soothill-Hodous Dictionary of Chinese Buddhist Terms
    • Person and Place Authority Databases from DDBC
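To illustrate the segmentation step, the sketch below converts a segmented sentence into the one-token-per-line, whitespace-separated column layout that CRF++ reads, with the label in the last column and a blank line between sentences. The B/M/E/S tag scheme, the function names, and the sample segmentation are assumptions made for illustration; the slides do not state which tag scheme or feature columns were used.

```python
from typing import List


def words_to_char_tags(words: List[str]) -> List[str]:
    """Turn a segmented sentence into 'char<TAB>tag' lines.

    Assumed tag scheme: B = word-initial, M = word-internal,
    E = word-final, S = single-character word.
    """
    lines = []
    for word in words:
        if len(word) == 1:
            lines.append(f"{word}\tS")
            continue
        for i, char in enumerate(word):
            if i == 0:
                tag = "B"
            elif i == len(word) - 1:
                tag = "E"
            else:
                tag = "M"
            lines.append(f"{char}\t{tag}")
    return lines


def to_training_block(sentences: List[List[str]]) -> str:
    """Join sentences with a blank line, one character per line."""
    blocks = ["\n".join(words_to_char_tags(s)) for s in sentences]
    return "\n\n".join(blocks) + "\n"


# Hypothetical segmentation of a short phrase, for illustration only.
print(to_training_block([["佛", "告", "阿難"]]))
```

Casting segmentation as per-character tagging lets the same sequence labeller handle both segmentation and POS tagging, for example by appending the POS tag to the boundary tag.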
Interesting problems
• How close are the characters in the Buddhist world?
• We aim to answer this question by examining the conversations in Buddhist texts.
Most frequent ‘say’ verbs
• 言 yán ‘to say’ (10979)
• 告 gào ‘to tell; to announce to’ (5401)
• 白……言 bái… yán ‘to address … and say’ (5015)
• 答曰 dáyuē ‘to reply and say’ (2157)
• 曰 yuē ‘to say’ (2126)
• 問 wèn ‘to inquire’ (2091)
• 告……言 gào… yán ‘to tell … and say’ (737)
• 白 bái ‘to address’ (475)
• 語 yù ‘to say’ (453)
Extraction of speaker and listener
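One way such an extraction could work is sketched below: find each ‘say’ verb in a parsed sentence and read the speaker and listener off its dependents. This is a simplified sketch under stated assumptions, not the authors' actual rules: it assumes the speaker is the verb's nsubj child and the listener its dobj child, and the `Token` type, the function name, and the sample parse are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    index: int   # 1-based position
    form: str    # word form
    head: int    # index of the parent; 0 marks the root
    deprel: str  # relation to the parent


# Say verbs taken from the frequency list above.
SAY_VERBS = {"言", "告", "白", "答曰", "曰", "問", "語"}


def extract_pairs(
    tokens: List[Token],
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (verb, speaker, listener) for each say verb in a sentence.

    Assumption: the speaker is the verb's nsubj child and the listener
    is its dobj child; either may be missing (None).
    """
    results = []
    for tok in tokens:
        if tok.form not in SAY_VERBS:
            continue
        speaker = listener = None
        for child in tokens:
            if child.head != tok.index:
                continue
            if child.deprel == "nsubj":
                speaker = child.form
            elif child.deprel == "dobj":
                listener = child.form
        results.append((tok.form, speaker, listener))
    return results


# Hypothetical parse of 佛告阿難 ('The Buddha told Ānanda').
parsed = [
    Token(1, "佛", 2, "nsubj"),
    Token(2, "告", 0, "root"),
    Token(3, "阿難", 2, "dobj"),
]
print(extract_pairs(parsed))
```

A fuller version would also check part-of-speech tags, since some of these characters are not always verbs, and would handle two-verb patterns such as 白……言, where the listener attaches to the first verb.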
The case of the Goddess of Mercy
Avalokiteśvara
Kwun Yam, Gwan-eum, Kannon, Guānyīn, and Quan Âm (觀音)
The case of the Goddess of Mercy
Distribution of listeners of the Goddess of Mercy (N=195)
• Buddha: 92%
• bodhisattva: 6%
• others: 2%
Distribution of type of saying verbs, the Goddess of Mercy as speaker (N=195)
• address (白): 86%
• unmarked: 11%
• reply: 2%
• tell: 1%
The case of the Goddess of Mercy
Distribution of speakers addressing the Goddess of Mercy (N=143)
• Buddha: 90%
• bodhisattva: 9%
• others: 1%
Distribution of type of saying verbs, the Goddess of Mercy as listener (N=143)
• tell (告): 55%
• unmarked: 41%
• ask/reply: 3%
• address: 1%
Visualization of conversational network
Conversational network
Conversational network of the Chinese Buddhist Canon (CBC), showing edges with 200 utterances or more
Protagonists
Who talked the most?
• Subhūti
• Maudgalyāyana
• Avalokiteśvara
• Śākyamuni Buddha
Protagonists
Interlocutors of the protagonists
Speak-to-listen ratio
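The slides do not define the ratio, so the sketch below assumes the simplest reading: the number of utterances a character speaks divided by the number addressed to that character. The function name and the toy utterance list are illustrative only.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def speak_listen_ratio(utterances: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map each character to (times as speaker) / (times as listener).

    Characters who never appear as listener are skipped to avoid
    division by zero.
    """
    spoken = Counter()
    heard = Counter()
    for speaker, listener in utterances:
        spoken[speaker] += 1
        heard[listener] += 1
    return {name: spoken[name] / count for name, count in heard.items()}


# Toy (speaker, listener) pairs, made up for illustration.
toy = [("Buddha", "Ānanda"), ("Buddha", "Ānanda"), ("Ānanda", "Buddha")]
print(speak_listen_ratio(toy))
```

If the N values on the Goddess of Mercy slides count utterances (195 as speaker, 143 as listener), this definition would give Avalokiteśvara a ratio of 195/143, or about 1.36.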
Buddhist network without Buddha
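The network views in these slides keep only edges above an utterance threshold, and this one also leaves out the Buddha. A minimal sketch of both filters follows, assuming the network is stored as utterance counts per directed (speaker, listener) edge. The function names and the toy counts are assumptions for illustration, not figures from the corpus.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, Tuple

Edge = Tuple[str, str]  # (speaker, listener)


def build_network(utterances: Iterable[Edge]) -> Counter:
    """Count utterances for each directed speaker -> listener edge."""
    return Counter(utterances)


def filter_network(
    edges: Dict[Edge, int],
    min_count: int = 1,
    exclude: FrozenSet[str] = frozenset(),
) -> Dict[Edge, int]:
    """Keep edges with at least min_count utterances whose speaker and
    listener are both outside the excluded set."""
    return {
        (speaker, listener): count
        for (speaker, listener), count in edges.items()
        if count >= min_count
        and speaker not in exclude
        and listener not in exclude
    }


# Toy counts, made up for illustration.
toy = build_network(
    [("Buddha", "Ānanda")] * 3
    + [("Ānanda", "Buddha")] * 2
    + [("Śāriputra", "Ānanda")] * 2
    + [("Ānanda", "Śāriputra")]
)
print(filter_network(toy, min_count=2, exclude=frozenset({"Buddha"})))
```

Removing a dominant node before thresholding makes it easier to see which of the remaining characters talk to one another directly.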
Mahāyāna and Hīnayāna
Mahāyāna section
Theory of wisdom endowed with insight into emptiness
Absolute fundamental reality
Perfection of wisdom
Conversational network of the Mahāyāna section of the Chinese Buddhist Canon, showing edges with more than 280 utterances
Hīnayāna section
Conversational network of the Hīnayāna section of the Chinese Buddhist Canon, showing edges with 100 or more utterances
Mahāyāna and Hīnayāna
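Mahāyāna section
• 釋迦牟尼佛 (Śākyamuni Buddha): 17457
• 須菩提 (Subhūti): 5013
• 文殊菩薩 (Mañjuśrī): 3553
• 舍利弗 (Śāriputra): 3316
• 阿難 (Ānanda): 2702
• 菩薩 (bodhisattva): 1772
• 天子 (devaputra): 1333
• 比丘 (monk): 924
• 帝釋天 (Śakra): 878
• 彌勒菩薩 (Maitreya): 690
Hīnayāna section
• 釋迦牟尼佛 (Śākyamuni Buddha): 15154
• 比丘 (monk): 10096
• 阿難 (Ānanda): 2273
• 人 (person): 833
• 比丘尼 (nun): 682
• 舍利弗 (Śāriputra): 612
• 王 (king): 564
• 婆羅門 (brahmin): 457
• 優波離 (Upāli): 456
• 摩訶迦葉 (Mahākāśyapa): 365
Conclusion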
• Built the first corpus of the CBC, of 46 million characters, semi-automatically with limited manually annotated data of 50k characters
• Demonstrated how to exploit linguistic annotations, using dependency relations, to analyze the characters in large-scale Chinese literary texts
  – Studied the conversational network
  – Statistics, e.g. protagonists, interlocutors of protagonists
  – Mahāyāna versus Hīnayāna
Thank you!
Q&A