Speaker Identification and Its Application to Social Network Construction for Chinese Novels

2020 International Conference on Asian Language Processing (IALP 2020), Kuala Lumpur, Dec 4-6, 2020

Yuxiang Jia¹, Huayi Dou¹, Shuai Cao¹,², Hongying Zan¹
1. School of Information Engineering, Zhengzhou University, Zhengzhou, P.R. China
2. Zhengzhou Zoneyet Technology Co., Ltd., Zhengzhou, P.R. China
[email protected]; [email protected]

Abstract—Character is one of the three elements of a novel, and conversation is an important way to describe characters. The personality, emotion, and interpersonal relationships of characters are reflected in conversations. Thus, extracting conversations, speakers and other conversation elements from novels is crucial for character analysis and content understanding. We start with Jin Yong's novels, annotate the largest corpus for speaker identification with 9721 quotes, propose a machine learning-based speaker identification method, and design feature templates that perform well. As an application of speaker identification, we construct the social network of characters in Jin Yong's novels based on dialogue chains, which lays the foundation for analyzing the relationships between characters in long novels.

Keywords—fictional character, dialogue, speaker identification, feature templates, social network

I. INTRODUCTION

Dialogue is an important means of shaping characters in novels. Characters' personalities, emotions and relationships are all reflected through dialogue. The extraction of dialogue elements in novels includes the extraction of dialogues and their speakers. It is an effective way to construct a dialogue corpus and can be used to construct the social network of characters in novels [1], which supports text mining in novels. It can also be used to synthesize multi-character audio books [2][3], giving personalized voices to different characters in literary works, and it can help with the adaptation of novels into scripts. For example, Danescu [4] used movie dialogue to build a number of chat corpora. It can also be used for writing recommendation [5], providing reference material for character quotation. In the field of journalism, information or opinions are verified by tracing the source of a quotation [6].

According to statistics on Jin Yong's 15 novels (published between 1955 and 1970), 50.05% of the sentences contain dialogue, and 40.24% of the words belong to quotations. Dialogue is also widely distributed in other types of fiction. Therefore, dialogue extraction and speaker identification are of great significance to character analysis [7][8] and to the text understanding of novels. The novel texts used in this paper are all from www.jinyongwang.com.

To identify speakers in literary works, we extract the dialogue content in the text and attribute it to one of several speakers in the context. We consider dialogues in the form of direct quotation, such as:

杨铁心(M1)尚未断气，见到郭靖(M2)后嘴边露出一丝笑容，说道：“……”(Q1)眼光望着丘处机(M3)道：“……”(Q2) ---《射雕英雄传》杨铁心 (E1)

Yang Tiexin (M1), who was about to die, saw Guo Jing (M2) and, with a smile on his lips, said: "..." (Q1) He looked at Qiu Chuji (M3) and said, "..." (Q2) -- 'The Legend of Condor Heroes', Yang Tiexin (E1)

In the example above, Q1 and Q2 are quotes of characters, which are marked by double quotation marks in Chinese novels. Speaker identification is also called quote attribution. Speakers are divided into two layers: the Mention layer and the Entity layer. A Mention is a speaker at the sentence level, such as M1: Yang Tiexin, M2: Guo Jing, M3: Qiu Chuji. An Entity is a character object in the literary world, such as E1: Yang Tiexin. An Entity can correspond to different mentions, such as an alias, a reference, or even an ellipsis. Mention-level speaker identification can serve as an intermediate phase, culminating in the identification of Entity-level speakers.

The contributions of this paper are as follows:

(1) We build the largest Chinese corpus for speaker identification of fictional conversations and release it at https://github.com/huayi-dou/The-speakeridentification-corpus-of-Jin-Yong-novels.

(2) We propose a machine learning-based entity-level speaker identification method and design feature templates that perform well.

(3) After speaker identification, we construct a social network between fictional characters based on dialogue chains, which is more precise than a social network based on paragraph co-occurrence.

II. RELATED WORK

The general process by which a machine identifies a speaker is to locate the conversation, search for candidate speakers in the context of the conversation, and identify the speaker among the candidates. In Chinese novels, dialogue can be located through double quotation marks, and candidate speakers can be extracted with resources or technologies such as a character list, named entity recognition, co-reference resolution [9], and entity linking. Determining the speaker can be regarded as classifying or ranking the candidate speakers and then selecting the best candidate.

Speaker identification methods can be divided into rule-based methods [10] and machine learning-based methods [11-14]. A rule-based method quantifies the influence of different features on the candidate speakers, accumulates a score for each candidate according to the features that occur, and chooses the candidate with the highest score. Machine learning-based methods usually adopt a classification model with two or more categories. The binary formulation classifies each candidate speaker as speaker or non-speaker. The multi-class formulation assigns the dialogue directly to one of all the characters in the novel. At present, most work is based on traditional machine learning methods, where the key is to design feature templates and extract features from the dialogue and its context. Chaganty
et al. [15] and Schmerling [16] conduct speaker identification based on a multi-layer perceptron and an RNN, respectively; both consider only the conversation and not the context. The performance of such methods, which rely solely on speaking style and content, is unsatisfactory. Vishnubhotla et al. [17] classify the characters of a drama according to their lines and examine how distinguishable the characters are. The results show that the characters created by a good writer (such as George Bernard Shaw) are more distinguishable.

Among corpora for speaker identification, QuoteLi3 [13] includes three English novels: Pride and Prejudice, Emma, and The Steppe. It annotates 3103 dialogue sentences at two levels: Mention and Entity. Lee and Yeung [18] annotate speakers and listeners in the New Testament. Chen et al. [19] annotate the speakers in the Chinese novel World of Plainness, covering Entity-level speakers of 2,548 dialogues. Prasad et al. [20] annotate the speakers of the Penn Discourse TreeBank, which includes a lot of indirect speech. Pareti et al. [21] study speaker identification for indirect speech, which is ubiquitous in news. The simultaneous recognition of speaker and listener is another research direction of speaker identification [22,23]; it establishes the association between speaker and listener and then studies the social network between characters.

978-1-7281-7689-5/20/$31.00 © 2020 IEEE

III. CORPUS CONSTRUCTION

Jin Yong's novels are among the most influential literary works in the Chinese world. They have created many classic characters and are of great literary value. A series of computational literature studies have been carried out on Jin Yong's novels, including a comparative study of their language style [24], a comparative study of the editions of the novels [25], and a study of the relationships between characters in the novels [26]. However, there is no computational study of the dialogue between characters in Jin Yong's novels. Therefore, starting from Jin Yong's novel The Legend of Condor Heroes, we annotated 9721 quotations, randomly divided them into 75% for training and 25% for testing, and constructed a dialogue corpus for research on speaker identification and character analysis.

In Chinese novels, dialogue is marked by double quotation marks. According to the GB/T 15834-2011 standard on punctuation usage, a quotation mark is a type of label that identifies content directly quoted in a paragraph or elements that need to be specified. Therefore, the content in double quotation marks can be either a quotation or a specific element. An obvious sign of a quotation is that it ends with a punctuation mark such as '，。？！……', while specific elements usually carry no punctuation; this feature identifies the vast majority of quotations. Because of the characteristics of Chinese, a specific element is part of a sentence and cannot be segmented out. We replace the double quotation marks around specific elements with brackets to distinguish them from quotations.

The website of Jin Yong's novels lists 85 main characters of The Legend of Condor Heroes; using the results of named entity recognition, we expand the list to 94. The same character can appear under different names, such as Yang Kang -- Wanyan Kang, Yan Lie -- Wanyan Honglie, Yang Tiexin -- Mu Yi, TemuJin -- Genghis Khan. We merge these names to build a global character list.

The sentences preceding and following each dialogue are extracted as the context window, and the local candidate list is constructed from the global character list. The candidate list changes with the size of the context window. When the context window is 10, 5 and 1 sentences, the proportion of speakers who are not in the candidate list is 0.26%, 0.30% and 8.71%, respectively; when only the preceding context is considered, the proportions are 0.41%, 0.80% and 9.24%. We finally choose the five sentences before and the five sentences after the quotation as context.

Using the period and double quotation marks, we divide the original text into dialogue sentences and narrative sentences. In The Legend of Condor Heroes, 10,723 dialogues are identified. We annotate the Entity-level speaker of each quote: given a quote and its context, the annotator selects one person from the candidate list as the speaker. The annotation was completed by two graduate students. An annotation is accepted as the correct answer if the two annotators agree. The agreement rate of the two annotators is 93.95%, leaving 649 inconsistent sentences. Inconsistencies are usually caused by a misunderstanding of a reference or by ambiguity in the original text; in these cases a third person joins the discussion until an agreement is reached.

At the Mention level, there are five categories of speakers, as shown in Table 1.

TABLE I. SPEAKER CATEGORIES
Category | Example | Mention | Entity | Proportion
Name | 铁木真见是大将博尔术，甚喜道：“…” (When Temujin saw the general Boershu, he was delighted and said, "...") | TemuJin | TemuJin | 87.96%
Pronouns | 韩小莹把耳朵凑到他嘴边，只听得他说道：“…” (Han Xiaoying put her ear close to his mouth, only to hear him say, "...") | him | Zhang Asheng | 2.69%
Common noun | 带队的军官下令停箭，叫道：“…” (The officer ordered the arrows to stop, shouting, "...") | officer | Null | 8.56%
Multiple persons | 郭啸天和杨铁心齐感诧异，同声问道：“…” (Guo Xiaotian and Yang Tiexin felt surprised and asked simultaneously, "...") | Guo Xiaotian, Yang Tiexin | Guo Xiaotian, Yang Tiexin | 0.51%
Ellipsis | “张先生，你可是从北方来吗？” ("Mr. Zhang, are you from the north?") | Null | Guo Xiaotian | 0.27%

The first type is a person's name. The second type is a pronoun. The third type is a common noun, which does not necessarily refer to a main character. The fourth type is more than one person. The fifth type is ellipsis, where the speaker of the current conversation is not explicitly mentioned; in this case there are often multiple consecutive conversations, and the speaker must be judged from the dialogue chain. Class 3 speakers are not main characters, class 4 is not a single person, and for class 5 the speaker cannot be found in the candidate list; in addition, our subsequent applications analyze the quotes of main characters. We therefore set aside classes 3, 4 and 5 for now and consider only classes 1 and 2, resulting in a total of 9,721 sentences of dialogue.

We count the number of dialogues of the 94 characters and find that Huang Rong and Guo Jing, as the main characters in
the novel, account for 21.23% and 18.34%, respectively, of the total number of class 1 and class 2 dialogues. Hong Qigong, in third place, accounts for 5.73%; Ying Gu, in twentieth place, accounts for 1.02%; and the characters after the twentieth each account for less than 1%. The number of dialogues in the novel thus follows a typical long-tail distribution. The number of dialogues of the first 20 characters in The Legend of Condor Heroes is shown in Figure 1.

Fig. 1. The number of quotes of the main characters

IV. SPEAKER IDENTIFICATION

The instance format of the data set is {quote, context, candidate list, speaker}, where the candidate list contains zero or more characters, and the speaker identification task is to select one character from the candidate list as the speaker. If the speaker is not in the candidate list, the speaker is set to “Others”.

A. Methodology

The speaker identification process is shown in Figure 2. We first extract the quotation Q from the novel text, then select K (K=5) sentences before and after the dialogue as the context, and use the character list to obtain the candidate speakers in the context. The context features of each candidate speaker are extracted, and a machine learning classification model classifies each candidate as speaker or non-speaker. The candidates are ranked by the probability of being the speaker, and the candidate with the highest probability is selected as the predicted speaker. The average number of candidates per conversation in the corpus is 3.56.

Fig. 2. Speaker identification method.

B. Feature Templates

The design of the feature template is shown in Table 2. There are 20 features in total, divided into three categories: 8 boolean features, 4 distance features, and 8 statistical (count) features. The boolean features mainly describe the sentences immediately before and after the current quotation and whether the candidate speaker is used with a verb; a candidate used with a verb is likely to be the speaker. The distance features consider the distance between the candidate's Mention and the quotation, as well as the relative order of the candidate speakers. The statistical features consider the number and frequency of occurrences of the candidate speakers in the context, distinguishing occurrences in quotations from occurrences in narrative sentences, together with the frequency of the character in the whole corpus, which is a global feature.

TABLE II. FEATURE TEMPLATE
Id | Feature | Type
1 | pre_line_is_quote | Bool
2 | pre_line_endwith_colon | Bool
3 | mention_in_pre_line | Bool
4 | next_line_is_quote | Bool
5 | next_line_endwith_colon | Bool
6 | mention_in_next_line | Bool
7 | pre_word_is_verb | Bool
8 | next_word_is_verb | Bool
9 | mention_distance_in_context_pre | Distance
10 | mention_distance_in_context_next | Distance
11 | latest_mention_order_in_pre | Distance
12 | latest_mention_order_in_next | Distance
13 | count_in_context_pre_quote | Count
14 | count_in_context_next_quote | Count
15 | count_in_context_pre_narrative | Count
16 | count_in_context_next_narrative | Count
17 | count_in_quote | Count
18 | count_context | Count
19 | candidate_num | Count
20 | global_proportion | Count

C. Experimental Setup

The classification model is a Multi-layer Perceptron (MLP), which outputs the probability that each candidate is the speaker; the candidate with the highest probability is selected as the predicted speaker. The MLP has two hidden layers, the first with 50 nodes and the second with 10 nodes, and the activation function is ReLU. Because the data set is small, the solver is the L-BFGS algorithm, which converges faster. The maximum number of iterations is 200. The L2 penalty (regularization term) parameter is 0.0001. The probabilities of the candidates are further normalized by a softmax layer. We experimented with three comparison methods: Support Vector Machine (SVM), the Latest_Mention method
and the Random method. The SVM model uses an RBF kernel function and probability estimation. The Latest_Mention method selects the Entity corresponding to the nearest Mention as the speaker. The Random method randomly selects a character from the candidate list as the speaker.

D. Experimental Results and Analysis

Of the 9721 dialogues in the corpus, 75% were randomly selected as the training set and 25% as the test set. Accuracy is used as the evaluation metric, that is, the percentage of dialogues whose speaker is correctly identified among all dialogues. The experimental results are shown in Table 3.

TABLE III. THE RESULTS OF THE SPEAKER IDENTIFICATION EXPERIMENT
Method | MLP | SVM | Latest_mention | Random
Accuracy | 95.64% | 94.49% | 86.63% | 33.72%

The classification models MLP and SVM deal better with ambiguity in context, reaching accuracies of 95.64% and 94.49% respectively (Table 3). MLP is 9.01 percentage points higher than Latest_Mention, and SVM is 7.86 percentage points higher.

When the structure of a sentence is complex, for example when the subject is absent, pronouns are present, or multiple characters are present, the model is more likely to make errors. This indicates that the model needs to make further use of context information and to integrate syntactic analysis and co-reference resolution.

Among the 2431 instances in the test set, speakers of the first and second categories account for 96.83% and 3.17%, respectively. The accuracy on the two categories of speakers in the test set is shown in Table 4. For class 1 speakers, there is no significant difference between the two models, while for class 2 speakers, MLP is 15.58 percentage points more accurate than SVM. In the future, we will use more precise anaphora resolution algorithms to improve the accuracy on the second type of speaker.

TABLE IV. ACCURACY OF CATEGORY 1 AND 2
Method | MLP | SVM
Category 1 | 96.39% | 95.83%
Category 2 | 66.23% | 50.65%

To verify the contribution of each feature type to accuracy, we remove the bool, distance and count features in turn; the accuracy drops to 79.72%, 93.07% and 94.07% respectively. Boolean features have the greatest impact on the accuracy of speaker identification, with a reduction of 15.92 percentage points after removal. The accuracy decreases by 2.57 percentage points after removal of the distance features. The count features have the least impact, with a drop of 1.57 percentage points after removal. The results of the feature ablation are shown in Table 5.

TABLE V. FEATURES ABLATION RESULTS
Method | MLP | -bool | -distance | -count
Accuracy | 95.64% | 79.72% | 93.07% | 94.07%

V. SOCIAL NETWORK CONSTRUCTION OF FICTIONAL CHARACTERS BASED ON DIALOGUE CHAIN

For the social network construction of fictional characters in Chinese literary works, the current mainstream method detects relationships between characters based on text co-occurrence, that is, different characters appearing together in the same paragraph or the same chapter [26][27] are treated as an interactive event. However, characters often refer in their quotes to people who are not interacting with them in the scene, and it is inaccurate to treat these mentioned characters as interaction partners. Dialogue between characters in a novel imitates the real world: an alternating dialogue between different characters can be regarded as a conversational event, while a change of scene or event in the novel needs to be described in several narrative sentences. Consider the following dialogue:

黄蓉道：“我瞧他们也稀松平常，跟人家动手，三招两式之间便中毒受伤。”洪七公道：“是吗？那都是王重阳的徒弟了。听说他七个弟子中丘处机武功最强，但终究还不及他们师叔周伯通。”黄蓉听了周伯通的名字微微一惊，开口想说话，却又忍住。郭靖一直在旁听两人谈论，这时插口道：“是，马道长说过他们有个师叔，但没有提到这位前辈道长的名号。”

Huang Rong says: "I find their martial arts quite ordinary; when they fight with others, they are soon poisoned and injured." Hong Qigong says: "Really? Those are Wang Chongyang's apprentices. I heard that Qiu Chuji is the strongest of his seven disciples in martial arts, but still not as good as their uncle Zhou Botong." Hearing the name of Zhou Botong, Huang Rong feels slightly surprised; she wants to speak but stops. Guo Jing, who has been listening to the conversation, interjects: "Yes, Master Ma said they had an uncle, but did not mention his name."

In this example, the character connection constructed by the paragraph co-occurrence method is 'Huang Rong - Hong Qigong - Wang Chongyang - Qiu Chuji - Zhou Botong - Guo Jing', and the character connection constructed by the dialogue chain method is 'Huang Rong - Hong Qigong - Guo Jing'. Wang Chongyang, Qiu Chuji and Zhou Botong do not appear in this scene and have no storyline connection with the other three characters. The dialogue chain based on speaker identification accurately reflects the three participants of the conversational event, so the character social network constructed from dialogue chains is closer to reality.

Based on the result of speaker identification, we can attribute each quote to a certain main character and thus detect all conversational events of that character. We propose a novel character social network based on dialogue chains.

A dialogue chain represents multiple rounds of dialogue between two or more characters who meet the following three criteria:

(1) The characters appear in the same scene at the same time.

(2) The different characters alternate in speaking.

(3) Each character speaks to the other characters in this chain of dialogue.

Quotations separated by more than N consecutive narrative sentences are divided into different dialogue chains. The influence of different thresholds on the distribution of dialogue chains is shown in Table 6. In The Legend of Condor Heroes, when the threshold is 3, the distribution of dialogue chains conforms to the development of the story. The average number of quotes per dialogue chain is 6.67, the average number of characters is 2.14,
and the maximum number of participants is 9. In detail, the dialogue chains that involve only decent characters account for 66.14%, and the dialogue chains between decent characters and villains account for 23.28%. This indicates that the plot of The Legend of Condor Heroes mainly revolves around the interaction of decent characters, and that the villains appear mostly because of their conflicts with decent characters. The dialogue between decent characters and villains reflects the conflicts in the literary work, which are also the highlight of the novel.

TABLE VI. THE DISTRIBUTION OF THE DIALOGUE CHAIN
N | Number | Average quotation number | Average character number | Decent-Villain | Decent-Decent | Villain-Villain
3 | 1456 | 6.67 | 2.14 | 23.28% | 66.14% | 10.58%
5 | 719 | 13.52 | 2.79 | 32.68% | 58.55% | 8.76%

To filter out minor characters, we set a node threshold on the number of a character's quotes and an edge threshold on the number of conversations between two characters. With a dialogue chain threshold of 5, a node threshold of 10 and an edge threshold of 20, the resulting social network of characters is shown in Figure 3, where the node size is directly proportional to the degree centrality, which reflects the importance of the character. The distance between nodes indicates the degree of intimacy between characters. A node's color represents the faction of the character, and a line's color represents the relationship between characters, such as friends or rivals. Currently, factions and relationships are defined manually in a dictionary. In the future, we will carry out relationship extraction based on character quotations to automatically extract and analyze the relationships between characters in long literary works.

Fig. 3. Social Network of Characters in Jin Yong's Novels based on Dialogue Chain (Part)

If the co-occurrence of different characters in the same paragraph is regarded as an interactive event, the character social network constructed by the paragraph co-occurrence method is shown in Figure 4. The node threshold is set to 10 and the edge threshold to 20; the node size is the number of occurrences of the character, and the edge thickness is the number of co-occurrences of the two characters. The character social network based on dialogue chains has 59 nodes and 192 edges, while the social network based on paragraph co-occurrence has 78 nodes and 308 edges. The difference between the two social networks is shown in Figure 5, with 64 nodes and 140 edges.

Fig. 4. Social Network of Characters based on the Co-occurrence of Paragraphs (Part)

Fig. 5. The Difference Set of Two Social Networks (Part)

Some characters have no dialogue and are only mentioned by other characters, such as 'Wang Chongyang'. Some characters appear in the same paragraph but never interact in the storyline. For example, 'Hong Qigong - Lu Youjiao' both belong to the Beggars' Sect but never appear in the same scene at the same time; 'Guo Xiaotian - Guo Jing' are father and son, but Guo Xiaotian died before Guo Jing was born, so the two characters never interact in the story. The character social network based on dialogue chains has 24 unique edges, such as 'Hong Qigong - Ke Zhene', 'Yang Tiexin - Zhang Shiwu', and 'Duan Tiande - Ku Mu'. These are new character relationships that the paragraph co-occurrence method does not find.

Based on the degree centrality of each node, the Spearman correlation coefficient of the two networks is 0.9426, indicating a strong rank correlation. This shows that dialogue chains built from speaker identification results can serve as a method for denoising character social networks, and that character relationship extraction based on dialogue content can be used to construct character social networks more accurately.
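
The dialogue chain construction described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the text has already been split into sentences, each tagged either as a quote with a predicted speaker or as narrative. The names `Sentence`, `split_dialogue_chains` and `build_network` are hypothetical, and so is the rule that every pair of distinct speakers in a chain counts as one conversation between them.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Sentence:
    """One sentence of the novel; speaker is None for narrative sentences."""
    text: str
    speaker: Optional[str] = None


def split_dialogue_chains(sentences: List[Sentence], n: int) -> List[List[Sentence]]:
    """Group quotes into chains; more than n consecutive narrative
    sentences between two quotes start a new chain."""
    chains: List[List[Sentence]] = []
    current: List[Sentence] = []
    gap = 0  # narrative sentences seen since the last quote
    for s in sentences:
        if s.speaker is None:
            gap += 1
            continue
        if current and gap > n:
            chains.append(current)
            current = []
        current.append(s)
        gap = 0
    if current:
        chains.append(current)
    return chains


def build_network(
    chains: List[List[Sentence]],
    node_threshold: int = 1,
    edge_threshold: int = 1,
) -> Tuple[Set[str], Dict[Tuple[str, str], int]]:
    """Count quotes per character and shared chains per character pair
    (an assumed edge weight), then apply the node and edge thresholds."""
    quote_counts = Counter(s.speaker for chain in chains for s in chain)
    edge_counts: Counter = Counter()
    for chain in chains:
        speakers = sorted({s.speaker for s in chain})
        for pair in combinations(speakers, 2):
            edge_counts[pair] += 1
    nodes = {c for c, k in quote_counts.items() if k >= node_threshold}
    edges = {
        pair: w
        for pair, w in edge_counts.items()
        if w >= edge_threshold and pair[0] in nodes and pair[1] in nodes
    }
    return nodes, edges


# Toy input: three quotes, a long narrative gap, then two more quotes.
example = [
    Sentence("...", "Huang Rong"),
    Sentence("...", "Hong Qigong"),
    Sentence("narrative"),
    Sentence("narrative"),
    Sentence("...", "Guo Jing"),
    Sentence("narrative"),
    Sentence("narrative"),
    Sentence("narrative"),
    Sentence("narrative"),
    Sentence("...", "Guo Jing"),
    Sentence("...", "Huang Rong"),
]
example_chains = split_dialogue_chains(example, n=3)
example_nodes, example_edges = build_network(example_chains)
```

Because only speakers become nodes, a character who is merely mentioned inside a quote (such as Wang Chongyang in the example earlier in this section) never enters the network, which is the difference from paragraph co-occurrence that the section describes. Raising `node_threshold` and `edge_threshold` corresponds to the filtering of minor characters.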

VI. CONCLUSION

This paper studies speaker identification and its application in fiction. Starting from Jin Yong's novel The Legend of Condor Heroes, we extract the dialogues and annotate the speakers, forming the largest speaker identification corpus to our knowledge. A machine learning-based speaker identification method is proposed, and feature templates with good performance are designed. A social network of fictional characters based on dialogue chains is constructed, which lays a foundation for analyzing the relationships between characters in Chinese literary works.

Our future work is to apply the proposed speaker identification method to Jin Yong's other novels to carry out cross-novel character analysis. To enhance the robustness of speaker identification, we will consider applying deep learning-based models to more styles of novels. Quotation-based analysis of the language style, personality and emotion of characters, and interpersonal relation classification, are also directions worth exploring.

ACKNOWLEDGEMENTS

This work was supported in part by National Key Research and Development Project under Grant 2017YFB1002101, Major Program of National Social Science Foundation of China under Grant 17ZDA138 and 18ZDA295, China Postdoctoral Science Foundation under Grant 2019TQ0286, Science and Technique Program of Henan Province under Grant 192102210260, Medical Science and Technique Program Co-sponsored by Henan Province and Ministry under Grant SB201901021, Key Scientific Research Program of Higher Education of Henan Province under Grant 19A520003 and 20A520038, the MOE Layout Foundation of Humanities and Social Sciences under Grant 20YJA740033, Henan Social Science Planning Project under Grant 2019BYY016.

REFERENCES

[1] Labatut V, Bost X. Extraction and Analysis of Fictional Character Networks: A Survey[J]. ACM Computing Surveys (CSUR), 2019, 52(5): 1-40.
[2] Zhang J Y, Black A W, Sproat R. Identifying speakers in children's stories for speech synthesis[C]//Eighth European Conference on Speech Communication and Technology. 2003.
[3] Greene E, Mishra T, Haffner P, et al. Predicting character-appropriate voices for a TTS-based storyteller system[C]//Thirteenth Annual Conference of the International Speech Communication Association. 2012.
[4] Danescu-Niculescu-Mizil C, Lee L. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs[C]//Proceedings of the 2nd workshop on cognitive modeling and computational linguistics. Association for Computational Linguistics, 2011: 76-87.
[5] Tan J, Wan X, Liu H, et al. QuoteRec: Toward quote recommendation for writing[J]. ACM Transactions on Information Systems (TOIS), 2018, 36(3): 1-36.
[6] Pavllo D, Piccardi T, West R. Quootstrap: Scalable Unsupervised Extraction of Quotation-Speaker Pairs from Large News Corpora via Bootstrapping[C]//Twelfth International AAAI Conference on Web and Social Media. 2018.
[7] Wu Yufeng, Wu Shengtao, Zhu Tingshao, et al. Analysis of Literary Intelligence of Characters in Novels: A Case Study of Ordinary World[J]. Journal of Chinese Information Processing, 2018, 32(7): 128-136. (In Chinese)
[8] Zhang Chenlin, Wang Mingming, Tan Yiming, et al. Research on the Doubt of "True and False Monkey King" based on emotion analysis[J]. Journal of Chinese Information Processing, 2019, 33(3): 118-125,135. (In Chinese)
[9] Almeida M S C, Almeida M B, Martins A F T. A joint model for quotation attribution and coreference resolution[C]//Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. 2014: 39-48.
[10] Glass K, Bangay S. A naive salience-based method for speaker identification in fiction books[C]//Proceedings of the 18th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA'07). 2007: 1-6.
[11] Elson D K, McKeown K R. Automatic attribution of quoted speech in literary narrative[C]//Twenty-Fourth AAAI Conference on Artificial Intelligence. 2010.
[12] He H, Barbosa D, Kondrak G. Identification of speakers in novels[C]//Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2013: 1312-1320.
[13] Muzny G, Fang M, Chang A, et al. A two-stage sieve approach for quote attribution[C]//Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. 2017: 460-470.
[14] O'Keefe T, Pareti S, Curran J R, et al. A sequence labelling approach to quote attribution[C]//Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012: 790-799.
[15] Chaganty A, Muzny G. Quote attribution for literary text with neural networks[R/OL]. Available at http://cs224d.stanford.edu/reports/ChagantyArun.pdf.
[16] Schmerling E. Whose Line Is It?–Quote Attribution through Recurrent Neural Networks[R/OL]. Available at https://cs224d.stanford.edu/reports/edward.pdf.
[17] Vishnubhotla K, Hammond A, Hirst G. Are Fictional Voices Distinguishable? Classifying Character Voices in Modern Drama[C]//Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. 2019: 29-34.
[18] Lee J, Yeung C Y. An annotated corpus of direct speech[C]//Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). 2016: 1059-1063.
[19] Chen J X, Ling Z H, Dai L R. A Chinese Dataset for Identifying Speakers in Novels[J]. Proc. Interspeech 2019, 2019: 1561-1565.
[20] Prasad R, Dinesh N, Lee A, et al. Annotating attribution in the Penn discourse treebank[C]//Proceedings of the Workshop on Sentiment and Subjectivity in Text. Association for Computational Linguistics, 2006: 31-38.
[21] Pareti S, O'Keefe T, Konstas I, et al. Automatically detecting and attributing indirect quotations[C]//Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013: 989-999.
[22] Yeung C Y, Lee J. Identifying speakers and listeners of quoted speech in literary works[C]//Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2017: 325-329.
[23] Ek A, Wirén M, Östling R, et al. Identifying speakers and addressees in dialogues extracted from literary fiction[C]//Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). 2018.
[24] Liu Ying, Xiao Tianjiu. Research on Quantitative Style of Jin Yong and Gu Long's Novels[J]. Journal of Tsinghua University (Philosophy and Social Sciences), 2014, 5: 135-147. (In Chinese)
[25] Jia Y, Wang L, Zan H. Text Rewriting Pattern Mining Based on Monolingual Alignment[C]//Workshop on Chinese Lexical Semantics. Springer, Cham, 2018: 551-558.
[26] Zhang Xuan, Liang Xun, Li Zhiyu, et al. Recognition and Analysis of complex Love Patterns of protagonists in Jin Yong's novels[J]. Journal of Chinese Information Processing, 2019, 33(4): 109-119. (In Chinese)
[27] Zhao Jingsheng, Zhang Li, Zhu Qiaoming, et al. Social Network Extraction and Analysis in Chinese Literary Works[J]. Journal of Chinese Information Processing, 2017, 31(2): 99-106. (In Chinese)
