Conversational Network in the Chinese Buddhist Canon

Tak-sum Wong and John Sie Yuen Lee
City University of Hong Kong
Conference on Digital Humanities 2015

Application of corpus

• It is common to apply linguistic annotation to a corpus to study the language therein.
• Can we apply dependency relations to analyze the characters in a literary text?

Outline

• What is a treebank?
• Construction of our treebank
• Conversational network of Buddhist text
  – Goddess of Mercy
  – Who spoke most?
  – Mahāyāna vs Hīnayāna

Syntactic tree

Treebanks: What?

• Many types of parse trees
  – Example: Stanford dependency parse tree

Treebanks: What?

• A treebank is a collection of syntactically analyzed sentences
  – Typically in the form of parse trees

Treebanks: What?

• A dependency tree represents grammatical relations between words

‘Bills’ is the child of ‘submitted’ in the relation nsubjpass
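To make the head and child terminology concrete, here is a minimal Python sketch of a dependency tree stored as a list of tokens that each point to their head. The `Token` class, the `children_of` helper, and the three-word example sentence are illustrative assumptions, not the treebank's actual format; only the `nsubjpass` relation between ‘Bills’ and ‘submitted’ comes from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    index: int      # 1-based position in the sentence
    form: str       # surface word
    head: int       # index of the head token; 0 marks the root
    relation: str   # dependency label linking this token to its head


# Hypothetical short sentence: "Bills were submitted"
sentence = [
    Token(1, "Bills", 3, "nsubjpass"),
    Token(2, "were", 3, "auxpass"),
    Token(3, "submitted", 0, "root"),
]


def children_of(tokens: List[Token], head_index: int) -> List[Token]:
    """Return every token whose head is the token at head_index."""
    return [t for t in tokens if t.head == head_index]


# Print each dependent of 'submitted' as relation(head, child).
for child in children_of(sentence, 3):
    print(f"{child.relation}({sentence[2].form}, {child.form})")
```

Storing one head pointer per token keeps the structure a tree by construction, since every word has exactly one parent.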

Treebanks: What?

• A tree also includes part-of-speech tags
  – Critical for Chinese since it has no inflectional morphology

Dependency relation: ‘monk’ is a direct object of ‘see’
Part-of-speech tag: ‘monk’ is a noun

‘not’ ‘hear’ ‘Buddha’ ‘’ ‘and’ ‘not’ ‘see’ ‘monk’
[He] has neither heard about the Buddhist scriptures nor seen any monk.

Treebanks: Why?

• Quickly find examples to support linguistic research
  – E.g., in the passive structure 為…所…, 為 is sometimes dropped in Buddhist Chinese text
    • A feature of Buddhist Chinese
  – Easy to search for passive sentences in a treebank
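A search of this kind can be sketched as a filter over dependency labels. The sketch below is only illustrative: the tuple layout, the `find_by_relation` helper, and the toy English sentences are assumptions, and the label the treebank actually uses for the 為…所… passive is not stated here, so `nsubjpass` stands in for it.

```python
from typing import List, Tuple

# A token is (form, head_index, relation); head_index 0 marks the root.
Sentence = List[Tuple[str, int, str]]


def find_by_relation(treebank: List[Sentence], label: str) -> List[Sentence]:
    """Return the sentences containing at least one token with the given label."""
    return [s for s in treebank if any(rel == label for _, _, rel in s)]


# Hypothetical two-sentence treebank.
toy_treebank: List[Sentence] = [
    [("Bills", 3, "nsubjpass"), ("were", 3, "auxpass"), ("submitted", 0, "root")],
    [("She", 2, "nsubj"), ("spoke", 0, "root")],
]

passives = find_by_relation(toy_treebank, "nsubjpass")
for sent in passives:
    print(" ".join(form for form, _, _ in sent))
```

Because the query targets a relation label rather than a surface string, it would still match sentences where a marker such as 為 has been dropped, provided the parse assigns the same label.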

Treebanks: Why?

• Characterize the “profile” of a word

What can you pray ‘for’, and who can you pray ‘to’?
Word Sketch [Kilgarriff et al. 2004]

Treebanks: Why?

• Sketch differences between ‘clever’ and ‘intelligent’

Compare the meaning of “clever” and “intelligent” by looking at adjectives that collocate with them

Word Sketch [Kilgarriff et al. 2004]

Treebank development

• Training data
  – Small-scale treebank created by Lee & Kong (2014)
  – 50k characters
  – Finely tagged by Buddhist specialists
  – POS tag set: adapted from the Penn Chinese Treebank
  – Dependency labels: largely follow Stanford Dependencies for Chinese, plus 5 new relations for MC

Treebank development

• Pre-processing:
  – Transplanted punctuation to the Tripiṭaka Koreana 高麗藏 from the Taishō edition 大正藏
• No parser for
  – Word segmentation and POS tagging by CRF++
  – Dependency parsing by MST parser
  – External dictionaries
    • Soothill-Hodous Dictionary of Chinese Buddhist Terms
    • Person and Place Authority Databases from DDBC

Interesting problems

• How close are the characters in the Buddhist world?

• We aim to answer this question by examining the conversations in the Chinese Buddhist Canon.

Most Frequent Say verbs

• 言 yán ‘to say’ (10979)
• 告 gào ‘to tell; to announce to’ (5401)
• 白……言 bái… yán ‘to address … and say’ (5015)
• 答曰 dáyuē ‘to reply and say’ (2157)
• 曰 yuē ‘to say’ (2126)
• 問 wèn ‘to inquire’ (2091)
• 告……言 gào… yán ‘to tell… and say’ (737)
• 白 bái ‘to address’ (475)
• 語 yù ‘to say’ (453)

Extraction of speaker and listener
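One way such an extraction could work is to look at the dependents of each saying verb. The sketch below is an assumption, not the method used in this work: it treats the verb's `nsubj` child as the speaker and its `dobj` child as the listener, handles only single-word verbs from the frequency list, and uses a hypothetical parse of 佛告阿難 ("The Buddha tells Ānanda"). The `Token` type and `extract_turns` function are illustrative names.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    index: int      # 1-based position in the sentence
    form: str       # surface word
    head: int       # index of the head token; 0 marks the root
    relation: str   # dependency label linking this token to its head


# Single-word saying verbs from the frequency list; discontinuous
# patterns such as 白……言 are not handled in this sketch.
SAY_VERBS = {"言", "告", "白", "答曰", "曰", "問", "語"}


def extract_turns(
    sentence: List[Token],
) -> List[Tuple[Optional[str], str, Optional[str]]]:
    """Return one (speaker, verb, listener) triple per saying verb.

    Assumption: the speaker is the verb's `nsubj` child and the listener
    is its `dobj` child; either is None when it is not expressed.
    """
    turns = []
    for verb in sentence:
        if verb.form not in SAY_VERBS:
            continue
        speaker = listener = None
        for tok in sentence:
            if tok.head != verb.index:
                continue
            if tok.relation == "nsubj":
                speaker = tok.form
            elif tok.relation == "dobj":
                listener = tok.form
        turns.append((speaker, verb.form, listener))
    return turns


# Hypothetical parse of 佛告阿難.
example = [
    Token(1, "佛", 2, "nsubj"),
    Token(2, "告", 0, "root"),
    Token(3, "阿難", 2, "dobj"),
]
print(extract_turns(example))
```

Returning `None` for a missing speaker or listener keeps unmarked cases visible so they can be counted separately rather than silently dropped.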

The case of Goddess of Mercy

Avalokiteśvara

Kwun Yam, Gwan-eum, Kanon, Guānyīn, and Quan Âm 觀音

The case of Goddess of Mercy

[Pie chart] Distribution of listeners of the Goddess of Mercy (N=195): Buddha 92%, others 2%, and a 6% slice whose label is not recoverable.

[Pie chart] Distribution of type of saying verbs, the Goddess of Mercy as speaker (N=195): address (白) 86%, unmarked 11%, reply 2%, tell 1%.

The case of Goddess of Mercy

[Pie chart] Distribution of type of saying verbs, the Goddess of Mercy as listener (N=143): tell (告) 55%, unmarked 41%, ask/reply 3%, address 1%.

[Pie chart] Distribution of speakers of the Goddess of Mercy (N=143): Buddha 90%, bodhisattva 9%, others 1%.

Visualization of conversational network

Conversational network

Conversational network of the CBC, showing edges with 200 utterances or more

Protagonists

Who Talked the most?

• Subhūti
• Maudgalyāyana
• Avalokiteśvara
• Śākyamuni Buddha

Protagonists

Interlocutors of the protagonists

Speak to Listen Ratio
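The ratio can be sketched as a simple count comparison. The definition used below (utterances spoken by a character divided by utterances addressed to that character) is an assumption, as are the function name `speak_listen_ratio` and the placeholder interlocutor "other"; the counts 195 and 143 are the Goddess of Mercy's totals as speaker and as listener reported earlier.

```python
from collections import Counter
from typing import Iterable, Tuple


def speak_listen_ratio(turns: Iterable[Tuple[str, str]], who: str) -> float:
    """Utterances spoken by `who` divided by utterances addressed to `who`."""
    spoken: Counter = Counter()
    heard: Counter = Counter()
    for speaker, listener in turns:
        spoken[speaker] += 1
        heard[listener] += 1
    if heard[who] == 0:
        raise ValueError(f"{who} never appears as a listener")
    return spoken[who] / heard[who]


# Synthetic (speaker, listener) pairs that reproduce the reported totals:
# 195 utterances as speaker, 143 as listener.
turns = [("Avalokiteśvara", "other")] * 195 + [("other", "Avalokiteśvara")] * 143
ratio = speak_listen_ratio(turns, "Avalokiteśvara")
print(round(ratio, 2))
```

Under this definition a value above 1 marks a character who speaks more often than they are spoken to; with the totals above the ratio comes to about 1.36.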

Buddhist network without Buddha
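The network views in this deck, including the edge thresholds and the version with the Buddha removed, can be approximated with a weighted directed edge count. The sketch below is a minimal illustration under stated assumptions: the function names, the toy utterance list, and the small threshold are hypothetical, and the actual figures were presumably produced with a dedicated visualization tool.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, Tuple

Edge = Tuple[str, str]  # (speaker, listener)


def build_network(turns: Iterable[Edge]) -> Counter:
    """Count utterances for each directed (speaker, listener) pair."""
    return Counter(turns)


def filter_edges(
    network: Counter,
    min_utterances: int = 0,
    exclude: FrozenSet[str] = frozenset(),
) -> Dict[Edge, int]:
    """Keep edges with enough utterances whose endpoints are not excluded."""
    return {
        (speaker, listener): count
        for (speaker, listener), count in network.items()
        if count >= min_utterances
        and speaker not in exclude
        and listener not in exclude
    }


# Hypothetical toy data; real edge weights run into the hundreds or thousands.
toy_turns = (
    [("Buddha", "Ānanda")] * 3
    + [("Ānanda", "Buddha")] * 2
    + [("Śāriputra", "Ānanda")] * 2
    + [("Ānanda", "Śāriputra")] * 1
)
network = build_network(toy_turns)

frequent = filter_edges(network, min_utterances=2)
without_buddha = filter_edges(network, exclude=frozenset({"Buddha"}))
print(frequent)
print(without_buddha)
```

Removing the dominant hub before drawing exposes the exchanges among the remaining characters, which are otherwise visually crowded out by edges to and from the Buddha.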

Mahāyāna and Hīnayāna

Mahāyāna section

Theory of wisdom endowed with insight into emptiness

Absolute fundamental reality

Perfection of wisdom

Conversational network of the Mahāyāna section of the Buddhist Canon, showing edges with more than 280 utterances.

Hīnayāna section

Conversational network of the Hīnayāna section of the Chinese Buddhist Canon, showing edges with 100 or more utterances

Mahāyāna and Hīnayāna

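Mahāyāna section
• 釋迦牟尼佛 (Śākyamuni Buddha): 17457
• 須菩提 (Subhūti): 5013
• 文殊菩薩 (Mañjuśrī): 3553
• 舍利弗 (Śāriputra): 3316
• 阿難 (Ānanda): 2702
• 菩薩 (bodhisattva): 1772
• 天子 (devaputra): 1333
• 比丘 (monk): 924
• 帝釋天 (Śakra): 878
• 彌勒菩薩 (Maitreya): 690

Hīnayāna section
• 釋迦牟尼佛 (Śākyamuni Buddha): 15154
• 比丘 (monk): 10096
• 阿難 (Ānanda): 2273
• 人 (person): 833
• 比丘尼 (nun): 682
• 舍利弗 (Śāriputra): 612
• 王 (king): 564
• 婆羅門 (brahmin): 457
• 優波離 (Upāli): 456
• 摩訶迦葉 (Mahākāśyapa): 365

Conclusion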

• Built the first corpus of the CBC, 46 million characters, semi-automatically from limited manually annotated data of 50k characters
• Demonstrated how to exploit linguistic annotations, specifically dependency relations, to analyze the characters in large-scale Chinese literary texts
  – Studied conversational network
  – Statistics, e.g. protagonists, interlocutors of protagonists
  – Mahāyāna versus Hīnayāna

Thank you!

Q&A