arXiv:2104.05964v3 [cs.CL] 7 May 2021
Restoring and Mining the Records of the Joseon Dynasty via Neural Language Modeling and Machine Translation

Kyeongpil Kang (Scatter Lab, Seoul, South Korea)
Kyohoon Jin (Chung-Ang University, Seoul, South Korea)
Soyoung Yang (KAIST, Daejeon, South Korea)
Soojin Jang (Chung-Ang University, Seoul, South Korea)
Jaegul Choo (KAIST, Daejeon, South Korea)
Youngbin Kim (Chung-Ang University, Seoul, South Korea)

Abstract

Understanding voluminous historical records provides clues on the past in various aspects, such as social and political issues and even natural science facts. However, it is generally difficult to fully utilize the historical records, since most of the documents are not written in a modern language and part of the contents are damaged over time. As a result, restoring the damaged or unrecognizable parts as well as translating the records into modern languages are crucial tasks. In response, we present a multi-task learning approach to restore and translate historical documents based on a self-attention mechanism, specifically utilizing two Korean historical records, which are among the most voluminous historical records in the world. Experimental results show that our approach significantly improves the accuracy of the translation task over baselines without multi-task learning. In addition, we present an in-depth exploratory analysis on our translated results via topic modeling, uncovering several significant historical events.

1 Introduction

Historical records are invaluable sources of information on the lifestyle and scientific records of our ancestors. Humankind has learned how to handle social and political problems by learning from the past. The historical records also serve as evidence of the intellectual accomplishments of humanity over time. Given such importance, there have been extensive nationwide efforts to preserve these historical records. For instance, UNESCO protects world heritage sites, and experts from all around the world have been converting and restoring historical records in a digital form for long-term preservation. A representative example is the Google Books Library Project.¹ However, despite the importance of historical records, it has been challenging to properly utilize the records for the following reasons. First, a nontrivial portion of the documents is partially damaged and unrecognizable due to unfortunate historical events or environments, such as wars and disasters, as well as the weak durability of paper documents. These factors make the records difficult to translate and understand. Second, as most of the records are written in ancient and outdated languages, it is difficult for non-experts to read and understand them. Thus, for their in-depth analysis, it is crucial to recover the damaged parts and properly translate them into modern languages.

To address these issues existing in historical records, we formulate them as language modeling tasks, specifically recovery and neural machine translation, by leveraging advanced neural networks. Moreover, we apply topic modeling to the translated historical records to efficiently discover the important historical events over the last several hundred years. In particular, we utilize two representative Korean historical records: the Annals of the Joseon Dynasty and the Diaries of the Royal Secretariat (hereafter we refer to them as AJD and DRS, respectively). These records, which contain 50 million and 243 million characters respectively, are recognized as the largest historical records in the world. Considering their high value, UNESCO recognized them as the Memory of the World.²,³ These two historical corpora contain the contents of five hundred years from the fourteenth century to the early twentieth century. In detail, AJD consists of administrative affairs with national events, and DRS contains events that occurred around the kings of the Joseon Dynasty. These corpora are valuable as they contain diverse information including international relations and natural disasters. In addition, the contents of the records are objective since the writing rules are so strict that political intervention, even from the kings, is not allowed by their independent institution.

Figure 1: Overview of the proposed approach of recovering, translating, and mining historical documents. (Panel labels: Large-scale ancient documents; Transformer; Translation/Restoration; Text mining via topic modeling.)

Although DRS contains a much larger amount of information than AJD, only 10–20% of DRS has been translated into the modern Korean language by a few dozen experts over the last twenty years. The complete translation of DRS is currently expected to take more than 30–40 additional years if only human experts continue to translate it.

Applying neural machine translation models to the historical records raises several issues. First, the pre-trained models for Chinese are not suitable for DRS and AJD, mainly because of the differences between Hanja and the Chinese language. In the past, Korean historiographers borrowed Chinese characters to write the sentences spoken by Koreans. As a result, diverse characters were modified or created, and considerable grammatical differences exist between the Chinese language and Hanja. Furthermore, several parts of those records are damaged and require restoration as shown in Fig. 2. Therefore, these damaged parts should be restored in order to translate them correctly. In order to address these issues, we propose a model suitable for the historical documents using the self-attention mechanism.

(a) 川□□李時益, 行己悖戾
(b) 尹知敬, □□李敏求·趙誠立
(c) □□悶迫不得已先赴母所
(d) 抄同趙□陪臣來時

Figure 2: Examples of damaged documents. Those characters that should be put in rectangles are damaged or unrecognizable.

Overall, we propose a novel multi-task approach to restore the damaged parts and translate the records into a modern language. Afterward, we extract the meaningful historical topics from the world's largest historical records as shown in Fig. 1. This study makes the following contributions:

• We design a model based on the self-attention mechanism with multi-task learning to restore and translate the historical records. Results demonstrate that our methods are effective in restoring the damaged characters and translating the records into a modern language.
• We translate all the untranslated sentences in DRS. We believe that this dataset would be invaluable for researchers in various fields.⁴
• We present a case study that extracts meaningful historical events by applying topic modeling, highlighting the importance of analysis of historical documents.

¹ https://support.google.com/websearch/answer/9690276
² http://www.unesco.org/new/en/communication-and-information/memory-of-the-world/register/full-list-of-registered-heritage/registered-heritage-page-8/the-annals-of-the-choson-dynasty/
³ http://www.unesco.org/new/en/communication-and-information/memory-of-the-world/register/full-list-of-registered-heritage/registered-heritage-page-8/seungjeongwon-ilgi-the-diaries-of-the-royal-secretariat/
⁴ The code, trained model, and datasets are accessible via https://github.com/Kyeongpil/deep-joseon-record-analysis.

2 Related Work

This work broadly incorporates three different tasks: document restoration, machine translation, and document analysis. Therefore, this section describes studies related to the restoration of damaged documents, neural machine translation, and the analysis of historical records.

2.1 Neural Machine Translation

Recently, neural machine translation (NMT) has achieved outstanding results. Based on the encoder-decoder architecture, the attention mechanism (Bahdanau et al., 2015) significantly improves the performance of NMT by calculating the target context vector at the current time step via dynamically combining the encoding vectors of source words. The self-attention-based networks (Vaswani et al., 2017) consider the correlations among all word pairs in the source and target sentences. Based on the success of self-attention networks, Trans-

quent translation tasks. To address this problem, several studies focus on normalizing the misspelled words (Tang et al., 2018; Domingo and Nolla, 2018), and others further apply language modeling to restore the parts of the documents via deep neural networks (DNNs) (Caner and Haritaoglu, 2010; Assael et al., 2019). Recently, the Cloze-style approach of machine reading comprehension (masked language modeling; MLM) predicts the original tokens for those positions where the words in the original sentence are randomly chosen and masked or replaced (Hermann et al., 2015). Several studies significantly improved the model performance by pre-training the model via the Cloze-style approach. By utilizing the MLM approach with the self-attention mechanism and the large-scale training dataset, numerous models improve the performances of various down-