Restoring and Mining the Records of the Joseon Dynasty via Neural Language Modeling and Machine Translation

Kyeongpil Kang (Scatter Lab, Seoul, South Korea)
Kyohoon Jin (Chung-Ang University, Seoul, South Korea)
Soyoung Yang (KAIST, Daejeon, South Korea)
Soojin Jang (Chung-Ang University, Seoul, South Korea)
Jaegul Choo (KAIST, Daejeon, South Korea)
Youngbin Kim (Chung-Ang University, Seoul, South Korea)

arXiv:2104.05964v3 [cs.CL] 7 May 2021

Abstract

Understanding voluminous historical records provides clues on the past in various aspects, such as social and political issues and even natural science facts. However, it is generally difficult to fully utilize the historical records, since most of the documents are not written in a modern language and part of the contents has been damaged over time. As a result, restoring the damaged or unrecognizable parts and translating the records into modern languages are crucial tasks. In response, we present a multi-task learning approach to restore and translate historical documents based on a self-attention mechanism, specifically utilizing two Korean historical records, which are among the most voluminous historical records in the world. Experimental results show that our approach significantly improves the accuracy of the translation task over baselines without multi-task learning. In addition, we present an in-depth exploratory analysis of our translated results via topic modeling, uncovering several significant historical events.

1 Introduction

Historical records are invaluable sources of information on the lifestyle and scientific records of our ancestors. Humankind has learned how to handle social and political problems by learning from the past. The historical records also serve as evidence of the intellectual accomplishments of humanity over time. Given such importance, there have been many nationwide efforts to preserve these historical records. For instance, UNESCO protects world heritage sites, and experts from all around the world have been converting and restoring historical records in a digital form for long-term preservation. A representative example is the Google Books Library Project¹. However, despite the importance of the historical records, it has been challenging to properly utilize them for the following reasons. First, a nontrivial amount of the documents is partially damaged and unrecognizable due to unfortunate historical events or environments, such as wars and disasters, as well as the weak durability of paper documents. These factors make it difficult to translate and understand the records. Second, as most of the records are written in ancient and outdated languages, non-experts have difficulty reading and understanding them. Thus, for their in-depth analysis, it is crucial to recover the damaged parts and properly translate them into modern languages.

To address these issues in historical records, we formulate them as language modeling tasks, specifically restoration and neural machine translation, by leveraging advanced neural networks. Moreover, we apply topic modeling to the translated historical records to efficiently discover the important historical events over the last several hundred years. In particular, we utilize two representative Korean historical records: the Annals of the Joseon Dynasty and the Diaries of the Royal Secretariat (hereafter AJD and DRS, respectively). These records, which contain 50 million and 243 million characters respectively, are recognized as the largest historical records in the world. Considering their high value, UNESCO recognized them as the Memory of the World.²,³ These two historical corpora cover five hundred years, from the fourteenth century to the early twentieth century. In detail, AJD consists of administrative affairs with national events, and DRS contains events that occurred around the kings of the Joseon Dynasty. These corpora are valuable as they contain diverse information including international relations and natural disasters. In addition, the contents of the records are regarded as objective, since the strict writing rules, enforced by an independent institution, did not allow political intervention, even from the kings.

[Figure 1 panels: large-scale ancient documents → Transformer (translation/restoration) → text mining via topic modeling]

Figure 1: Overview of the proposed approach of recovering, translating, and mining historical documents.

Although DRS contains a much larger amount of information than AJD, only 10–20% of DRS has been translated into the modern Korean language by a few dozen experts over the last twenty years. The complete translation of DRS is currently expected to take an additional 30–40 years or more if only human experts continue to translate it.

Applying neural machine translation models to the historical records raises several issues. First, the pre-trained models for Chinese are not suitable for DRS and AJD, mainly because of the differences between Hanja and the Chinese language. In the past, Korean historiographers borrowed Chinese characters to write the sentences spoken by Koreans. As a result, diverse characters were modified or created, and considerable grammatical differences exist between the Chinese language and Hanja. Furthermore, several parts of those records are damaged and require restoration, as shown in Fig. 2. Therefore, these damaged parts should be restored in order to translate them correctly. To address these issues, we propose a model suitable for the historical documents using the self-attention mechanism.

Overall, we propose a novel multi-task approach to restore the damaged parts and translate the records into a modern language. Afterward, we extract meaningful historical topics from the world's largest historical records, as shown in Fig. 1. This study makes the following contributions:

• We design a model based on the self-attention mechanism with multi-task learning to restore and translate the historical records. Results demonstrate that our methods are effective in restoring the damaged characters and translating the records into a modern language.
• We translate all the untranslated sentences in DRS. We believe that this dataset would be invaluable for researchers in various fields.⁴
• We present a case study that extracts meaningful historical events by applying topic modeling, highlighting the importance of analyzing historical documents.

¹ https://support.google.com/websearch/answer/9690276
² http://www.unesco.org/new/en/communication-and-information/memory-of-the-world/register/full-list-of-registered-heritage/registered-heritage-page-8/the-annals-of-the-choson-dynasty/
³ http://www.unesco.org/new/en/communication-and-information/memory-of-the-world/register/full-list-of-registered-heritage/registered-heritage-page-8/seungjeongwon-ilgi-the-diaries-of-the-royal-secretariat/
⁴ The codes, trained model, and datasets are accessible via https://github.com/Kyeongpil/deep-joseon-record-analysis.

2 Related Work

This work broadly incorporates three different tasks: document restoration, machine translation, and document analysis. Therefore, this section describes studies related to neural machine translation, the restoration of damaged documents, and the analysis of historical records.

2.1 Neural Machine Translation

Recently, neural machine translation (NMT) has achieved outstanding results. Based on the encoder-decoder architecture, the attention mechanism (Bahdanau et al., 2015) significantly improves the performance of NMT by calculating the target context vector in the current time step via dynamically combining the encoding vectors of source words. The self-attention-based networks (Vaswani et al., 2017) consider the correlations among all word pairs in the source and target sentences. Based on the success of self-attention networks, the Transformer architecture for language modeling has been proposed, showing state-of-the-art performance (Devlin et al., 2019; Radford et al., 2019). In particular, the pre-training approaches further improve the performance, since they train the model robustly with several tasks using a large document corpus. In addition, lightweight models, such as ALBERT (Lan et al., 2019), have been proposed to reduce the model size while preserving the model performance. However, as most of the recent approaches focus on pre-training with documents written in a modern language, no such model exists for historical datasets. Therefore, we adopt a lightweight model in the same manner as ALBERT to efficiently reconstruct and translate millions of documents.

Regarding the translation task for historical documents, several studies attempt to translate ancient Chinese documents into the modern Chinese language (Zhang et al., 2019b; Liu et al., 2019a). However, as they mainly attempt to translate archaic characters into the modern language using a paired corpus, they do not fully utilize the unpaired corpus. Therefore, we improve the performance of machine translation for historical corpora through multi-task learning with the translation and restoration tasks, which fully utilizes the paired and unpaired corpora.

2.2 Restoration of Historical Documents

Unfortunately, many characters in the historical records are damaged or misspelled. As shown in Fig. 2, the damaged parts are prevalent in DRS, which significantly degrades the quality of subsequent translation tasks. To address this problem, several studies focus on normalizing the misspelled words (Tang et al., 2018; Domingo and Nolla, 2018), and others further apply language modeling to restore parts of the documents via deep neural networks (DNNs) (Caner and Haritaoglu, 2010; Assael et al., 2019).

(a) 川□□李時益, 行己悖戾
(b) 尹知敬, □□李敏求·趙誠立
(c) □□悶迫不得已先赴母所
(d) 抄同趙□陪臣來時

Figure 2: Examples of damaged documents. The characters shown as rectangles are damaged or unrecognizable.

Recently, the Cloze-style approach of machine reading comprehension (masked language modeling; MLM) predicts the original tokens for those positions where the words in the original sentence are randomly chosen and masked or replaced (Hermann et al., 2015). Several studies significantly improved the model performance by pre-training the model via the Cloze-style approach. By utilizing the MLM approach with the self-attention mechanism and a large-scale training dataset, numerous models improve the performance of various downstream tasks including the NMT task (Baevski et al., 2019; Devlin et al., 2019; Zhang et al., 2019a; Conneau and Lample, 2019; Liu et al., 2019c; Clark et al., 2019). However, to our knowledge, few studies apply such an MLM approach to restore the damaged parts.

Motivated by these studies, we design our model using masked language modeling based on the self-attention architecture to recover the damaged documents considering their contexts.

2.3 Analysis on Historical Records

Various studies apply machine learning approaches to analyze historical records (Zhao et al., 2014; Kumar et al., 2014; Mimno, 2012; Kim et al., 2015; Bak and Oh, 2015, 2018). In addition, researchers adopt neural networks, such as convolutional neural networks and autoencoders, for page segmentation and optical character recognition to convert the historical records into a digital form (Chen et al., 2017; Clanuwat et al., 2019). Given such digital-form records, analysts attempt to utilize topic modeling to discover historically meaningful events (Yang et al., 2011).

In particular, using the translated AJD, researchers discover historical events such as magnetic storm activities (Yoo et al., 2015; Hayakawa et al., 2017), meteors (Lee et al., 2009), and solar activities (Jeon et al., 2018). In political science, researchers analyze the decision patterns of a royal family in the Joseon Dynasty (Bak and Oh, 2015, 2018). Besides, the dietary patterns and dynamic social relations among key figures during the Joseon Dynasty have been investigated (Ki et al., 2018). However, existing studies mainly rely on the documents translated by human experts. Therefore, we first translate the documents in AJD and DRS. Afterward, we apply topic modeling approaches to mine meaningful historical events over large-scale data.

3 Proposed Methods

This section describes a multi-task learning approach based on the Transformer networks to effectively restore and translate the historical records. The overview of our model is shown in Fig. 3.

[Figure 3 blocks: Hanja input embedding → shared encoder → restoration encoder → linear output; Korean input embedding → translation decoder → linear output; each Transformer module is built from multi-head attention, feed-forward, and add & norm layers]

Figure 3: Overview of the proposed model for the restoration and translation tasks.

The AJD and DRS datasets consist of Hanja sentences $H = \{h_1, \ldots, h_N\}$ and Korean sentences $K = \{k_1, \ldots, k_N\}$, where each Korean sentence is translated from its corresponding Hanja sentence. Here, Hanja refers to the Chinese characters borrowed to write the Korean language in the past. In addition, DRS contains additional Hanja sentences $\tilde{H} = \{h_{N+1}, \ldots, h_M\}$ that are not translated yet. Hence, we have in total $M$ Hanja sentences in the Hanja corpus such that $\hat{H} = H \cup \tilde{H}$ and $N$ Korean sentences in the Korean corpus $K$.

Considering the properties of AJD and DRS, we design a multi-task learning approach with document restoration and machine translation, based on the Transformer networks. As shown in Fig. 3, our model consists of embedding and output layers for Hanja and Korean, and three Transformer modules: the shared encoder, the restoration encoder, and the translation decoder. The restoration encoder is an encoder for the restoration task. The translation decoder is used for translating Hanja sentences into modern Korean sentences, and the shared encoder is used for both the restoration and translation tasks. By sharing the encoder module for both tasks, the shared encoder is trained with a large-scale corpus, i.e., the Hanja-Korean paired dataset and the additional unpaired Hanja dataset. The parameter sharing technique helps the model learn rich information from the Hanja corpus.

We apply the cross-layer parameter-sharing technique in the same manner as used in ALBERT (Lan et al., 2019), which shares the attention parameters across the layers of each Transformer encoder and decoder module to reduce the model size and the inference time.

3.1 Restoration of Damaged Documents

The restoration task for damaged documents is similar to the MLM approach, which masks randomly chosen tokens in the input sentence and then predicts their original tokens in the corresponding positions. We apply the MLM technique to restore the damaged documents, using the Hanja sentences $\hat{H}$.

For word indices $(w_1^{h_i}, \ldots, w_{L_i}^{h_i})$ in the Hanja sentence $h_i$, where $L_i$ is the length of the $i$-th sequence, several words are randomly selected and replaced by a [MASK] token. We extract word embedding vectors $(e_1^{h_i}, \ldots, e_{L_i}^{h_i})$, each in $\mathbb{R}^{d_{emb}}$, from the Hanja embedding layer combined with positional embedding vectors, where $d_{emb}$ represents the dimension of the embedding space. Here, we apply the factorized embedding parameterization technique to reduce model parameters (Lan et al., 2019). These embedding vectors are projected onto the $d_{model}$-dimensional embedding space through a linear layer. Subsequently, the embedding vectors are transformed into the Hanja context vectors $(\hat{s}_1^{h_i}, \ldots, \hat{s}_{L_i}^{h_i})$ via the shared encoder and the restoration encoder as

$$ s_1^{h_i}, \ldots, s_{L_i}^{h_i} = f_S(e_1^{h_i}, \ldots, e_{L_i}^{h_i}), \tag{1} $$

$$ \hat{s}_1^{h_i}, \ldots, \hat{s}_{L_i}^{h_i} = f_R(s_1^{h_i}, \ldots, s_{L_i}^{h_i}), \tag{2} $$

where the functions $f_S$ and $f_R$ represent the shared encoder and the restoration encoder, respectively. Each Hanja context vector $\hat{s}_k^{h_i}$ is non-linearly transformed into the output vector $z_k^{h_i} \in \mathbb{R}^{d_{emb}}$ via the output layer. We also apply the factorized embedding parameterization technique to the output layers for parameter reduction. We calculate the probability $P(\hat{w}_{k,m}^{h_i} \mid w_1^{h_i}, \ldots, w_{L_i}^{h_i})$ for the index $m$ of the original token $\hat{w}_k^{h_i}$, using the softmax function as

$$ P(\hat{w}_{k,m}^{h_i} \mid w_1^{h_i}, \ldots, w_{L_i}^{h_i}) = \frac{\exp\left({W_m^{h}}^{\top} z_k^{h_i}\right)}{\sum_{j=1}^{|V_h|} \exp\left({W_j^{h}}^{\top} z_k^{h_i}\right)}, \tag{3} $$

where $|V_h|$ is the size of the Hanja vocabulary.
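The masking step and the softmax of Eq. 3 can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the function names (`mask_tokens`, `output_logits`, `top_k_candidates`), the 15% masking ratio, and the toy weights in the usage lines are hypothetical, and the Transformer encoders that would produce the output vector `z` are left out.

```python
import math
import random

MASK = "[MASK]"


def mask_tokens(chars, mask_ratio=0.15, rng=None):
    """Replace randomly chosen character tokens with [MASK].

    mask_ratio is a hypothetical setting; the paper does not report it.
    Returns the masked sequence and the sorted masked positions.
    """
    rng = rng or random.Random()
    n_mask = max(1, round(len(chars) * mask_ratio))
    positions = sorted(rng.sample(range(len(chars)), n_mask))
    masked = list(chars)
    for pos in positions:
        masked[pos] = MASK
    return masked, positions


def output_logits(W, z):
    """Scores W_j^T z for every vocabulary entry j (the exponents in Eq. 3)."""
    return [sum(w * x for w, x in zip(row, z)) for row in W]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(s - peak) for s in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_candidates(logits, vocab, k=10):
    """Return the k most probable (token, probability) pairs for one position."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    sentence = list("上在昌慶宮停常參經筵")
    masked, positions = mask_tokens(sentence, rng=random.Random(0))
    print(masked, positions)
    # Toy output layer (3 vocabulary entries, 2 dimensions) and output vector.
    W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(top_k_candidates(output_logits(W, [2.0, 1.0]), ["慶", "熙", "德"], k=2))
```

Returning a ranked list rather than a single prediction mirrors the restoration use case, in which an analyst picks the correct character from a handful of candidates.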
3.2 Neural Machine Translation for Paired Hanja Unpaired Hanja Korean #(Train data) 239,226 1,377,320 239,226 Historical Records #(Test data) 20,000 20,000 20,000 In order to facilitate the training of our transla- 1st Quartile 26 27 22 Mean 143.81 165.66 123.68 tion module, we exploit the Hanja-Korean paired 3rd Qquatile 106 113 80 dataset {(hi, ki)|hi ∈ H, ki ∈ K}. As shown Median 52 55 40 in Fig.3, we first extract the Hanja context vec- Vocab size 8,742 8,742 24,000 tors (shi , . . . , shi ) from the word tokens in the 1 Li Table 1: Dataset summary. The third to the sixth rows Hanja sentence hi, using the shared encoder in indicate the statistics for the length of each document. the same manner as in Eq.1. Utilizing the Hanja context vectors and previously predicted Korean Our model is optimized by using the rectified ki ki words (w1 , . . . , wt−1), we subsequently calculate Adam (Liu et al., 2019b) with the layer-wise adap- ki the dmodel-dimensional Korean context vector st tive rate scheduling technique (You et al., 2017). for the current time step t as We also apply the gradient accumulation technique and update our model for each loss asynchronously, ski = f (shi , . . . , shi , wki , . . . , wki ), (4) t D 1 Li 1 t−1 to increase the batch size and efficiently manage the GPU memory. where f represents the translation decoder lay- D After training the model, the damaged tokens are ers. After calculating the Korean context vector ki replaced by the [MASK] token during the restora- st , we non-linearly transform the context vector to k tion stage, and the model obtains the top-K char- the output vector z i ∈ demb , through the output t R acters with the highest probabilities, among which layer, along with the above-mentioned factorized users can choose and confirm a correct characters embedding parameterization for parameter reduc- in the position of the damaged parts. In addition, tion. 
Finally, we yield the probability that the word we translate all the Hanja records that are not yet V is generated from the t-th step as m translated for further in-depth analysis. When trans- > k ki lating the Hanja sentence, we additionally apply ki ki exp(Wm zt ) P (wt,m|hi, w1:t−1) = , (5) P|Vk| k> ki beam search with length normalization. The trans- j exp(Wj zt ) lation task for all the untranslated records using 20 where |Vk| is the size of the vocabulary for the V100 GPUs had a duration of approximately five k |V |×d Korean corpus, and W ∈ R k emb is the output days. layer for the Korean corpus. As previously mentioned, we employ the param- 4 Experiments eter sharing approach for the encoder module, (i.e., This section first describes our datasets and experi- the shared encoder), thus enhancing the robustness mental settings. of our model, especially with the Hanja dataset. 4.1 Datasets and Preprocessing 3.3 Training and Inference To train our model, we collect most of the docu- In order to train our model, we use the cross- ments of AJD and DRS, including those manually entropy loss to maximize the probability of the translated to date, provided by the National Insti- original token indices for the masked tokens and tute of the Korean History5. The records contain the target sentence for the translation task as approximately 250K documents for AJD and 1.4M 1 P  hi  documents for DRS. Lrst = − M h ∈Hˆ Ek∼ξ(hi) log P (wk |hi) , (6) i After collecting documents, we tokenize each 1 PN  1 P|ki| ki ki  Ltrs = − P (w |hi, w ) , (7) N i=1 |ki| t=1 t 1:t−1 Hanja sentence into the character-level tokens, sim- where ξ(·) is an operator that randomly selects the ilar to previous studies (Zhang et al., 2014; Li et al., tokens from each sentence for MLM. 
In this study, 2018), and also tokenize each Korean sentence we apply not only unigram masking but also the based on the unigram language model (Kudo, 2018) 6 n-gram masking techniques (i.e., bigrams and tri- provided by Google’s SentencePiece library. Here, grams), as previously applied (Zhang et al., 2019a). we included those words appearing more than ten Finally, the total loss is defined as times in the Hanja vocabulary, the size of which 5http://www.history.go.kr/ L = Lrst + Ltrs. (8) 6https://github.com/google/sentencepiece is about 8.7K words. For the Korean corpus, we ist K topics in the corpus. The term-date matrix limit the size of the Korean vocabulary to 24K. The V is decomposed into the term-topic weight ma- V ×K out-of-vocabulary words are replaced with UNK trix W ∈ R and the date-topic weight matrix D×K (unknown) tokens. To improve the stability and effi- H ∈ R as ciency during the training stage, we filter out those > 2 W, H = arg min kV − WH kF + α · ψ(W, H), (9) Hanja sentences with less than four tokens or more W,H≥0 than 350 tokens and those Korean sentences with where k·kF represents the Frobenius norm, and ψ less than four tokens or more than 300 tokens. Note and α represent the L1 regularization function and that the portion of sentences filtered out from each the regularization weight, respectively. We set the dataset is less than 10%. number of topics K as 208 and the regularization To evaluate the performance of our model, we weight α as 0.1. randomly select 20K sentences as a test dataset for each of the paired and the unpaired sets. The 5 Experimental Results sizes of the training set for the Hanja-Korean paired This section describes the results of the perfor- corpus and the unpaired Hanja corpus are 240K and mances of our model for restoration and translation, 1.38M, respectively. The statistics of the dataset are followed by qualitative examples of each task as summarized in Table1. well as topic modeling results. 
4.2 Hyper-parameter Settings 5.1 Document Restoration We set hyper-parameters similarly to the BERT (De- We evaluate the performance of our model on vlin et al., 2019) base model. We set the size of the the document restoration task on the test dataset. embedding dimension demb, the hidden vector di- We also compare performance between the model mension dmodel, and the dimension of the position- trained with and without multi-task learning. Ta- wise feed-forward layers as 256, 768, and 3,072, ble3 shows the results of top-K (HITS@K). The respectively. The shared encoder, the translation top-10 accuracy of our proposed model is almost decoder, and the restoration encoder consist of 12, 89%, which indicates the high performance of our 12, and 6 layers, respectively. We use 12 attention model and demonstrates that our model provides heads for each multi-head attention layer. Overall, analysts with appropriate options. However, the the total number of parameters is around 168.8M. baseline model, trained without multi-task learning, performs slightly better than the one with multi- 4.3 Mining Historical Records via Topic task learning. This shows that the baseline model Modeling is more specialized in the document restoration After obtaining machine-translated outputs of the task. However, although our model performance is remaining records, we apply topic modeling to the slightly lower than the baseline model, the benefits full set of documents for exploratory analysis of of the multi-task learning approach are significantly historical events. To be specific, the full set of docu- manifested in the NMT task as shown in Table5. ments include all of the manually translated records As our model shows the acceptable performances as well as machine-translated records by our model. 
on both the restoration and the translation tasks, we By using each translated record ki and its written conclude that our model learns the purpose of our date information di, we first parse the document research well via multi-task learning. We will fur- into morphemes and then use the only noun and ther discuss the main benefits of multi-task learning adjective tokens. Afterward, we build the term-date in Section 5.2. V ×D matrix V ∈ R where V is the vocabulary size We further investigate the qualitative results of and D is the number of dates in the total set of the document restoration task. Table2 shows four historical documents. randomly sampled, example pairs. As shown in the In this study, we utilize non-negative matrix fac- first three rows of this table, the model also has the torization (NMF) (Lee and Seung, 2001) as a topic ability to predict bi-gram and tri-gram character- 7 modeling method . We first assume that there ex- level tokens because the model is trained using 7Topic modeling includes several methods such as latent the results of NMF are slightly better than those of LDA. Dirichlet allocation (LDA) (Blei et al., 2003)-based and non- 8We set the number of topics as 20 after we conducted negative matrix factorization-based models (Lee and Seung, experiments by varying the topic numbers, such as 10, 15, 20, 2001). We additionally tested topic modeling with LDA, but 30, and 50. Original 上在慶德宮. 停常參 · 經筵. Predicted 上在慶熙宮. 停常參 · 經筵. Original 右承旨李世用疏曰云云. 省疏具悉. 疏辭, 下該曹稟處. Predicted 右承旨李世用疏曰云云. 省疏具悉. 疏辭, 下該曹稟處. Original 玉堂子. 答曰, 省具悉. 辭當採用焉. 內下記草 Predicted 玉堂子. 答曰, 省具悉. 辭當採用焉. 內下記草 Original 又曰, 假注書金基龍, 身病猝重, 勢難察任, 今姑改差, 何如? 傳曰, 允. Predicted 又曰, 假注書李基淳, 身病猝重, 勢難察任, 今姑改差, 何如? 傳曰, 允. Table 2: Our model prediction results. Blue- and red-colored letters represent masked and predicted ones, respec- tively.

HITS@1 HITS@5 HITS@10 Furthermore, we generate sentences using the Baseline 77.83% 88.29% 90.89% beam search method with the length normalization. Full model 75.20% 86.21% 89.09% In this study, we compare the greedy search and the beam search with a beam size of 3. As shown Table 3: Top-K accuracies for the restoration task. in Table5, results obtained with a beam size of 3 are slightly better than the greedy search method. n-gram-based MLM. Furthermore, although each Finally, the BLEU score of our model is obtained character is not exactly the same as the original one, as 0.5410, which indicates that our model performs the last example in the table shows that our model reasonably well, compared to other recent models restores the proper format of the name part. How- trained in other languages. ever, predicting the exact name is a difficult task for We additionally compared our model to the human experts, even when considering the context model trained via the pretraining-then-finetuning of the sentence, as prior knowledge is necessary approach. As shown in Table6, the BLEU score to predict the exact name. Therefore, we quanti- of this approach is 0.3755, which is 5.9% higher tatively measured the model performance on the than that of the model trained from scratch but proper nouns, e.g. person and location names, using 28.7% lower than our multi-task learning approach. 200 samples of them. The average top-10 accuracy The results can be explained for two reasons. is only 8.3%, significantly lower than the overall First, as the size of unpaired data is much larger accuracy, which is larger than 89%. We conjecture than that of paired data, the multi-task learning that the degradation is mainly due to the difficulty fully utilizes the paired and unpaired data for the in maintaining the information of the proper nouns, translation task, compared to the pretraining-then- which would require external knowledge. We leave finetuning approach. 
Second, The pretraining-then- it as our future work. finetuning approach has a catastrophic forgetting 5.2 Machine Translation Quality problem (Chen et al., 2020). In other words, the finetuning step can fail to maintain the knowledge To investigate the performance of the machine acquired at the pretraining step. However, as both translation task, we translate the Hanja sentences reconstruction and translation tasks are crucial for in the test dataset and then evaluate the model per- historical documents, such a forgetting issue is crit- formance. As shown in Table5, the results for the ical to our tasks. translation task are evaluated by BLEU (Papineni We also tested the quality of the Hanja-Korean et al., 2002), METEOR (Banerjee and Lavie, 2005), translation task using a Chinese-Korean machine and ROUGE-L (Lin, 2004). In this result, “Full” translator. As few publicly available machine trans- represents our proposed model trained by multi- lation models for Chinese-Korean exist, we used task learning of the translation and the restoration Google Translate9 instead. The translator failed tasks. Therefore, the model is trained to take both to translate given Hanja sentences in most cases, the translated and untranslated sentences. On the mainly because Hanja and Chinese have different other hand, “Base” represents the model trained properties in terms of grammar and word meanings. only by the translation task, and thus, the model To investigate the translation performance qual- is trained to accept only the translated sentences. itatively, we sampled translated samples. Table4 Our model outperforms the baseline model with a significant margin. 9https://translate.google.co.kr Original 上在昌慶宮. 停常參 · 經筵. Predicted 상이 창경궁에 있었다. 상참과 경연을 정지하였다. Predicted (Eng.) King was in the Changkyeong palace. He stopped the discussion of political affairs with other officers. Original 答大司憲南龍翼疏曰, 省疏具悉. 內局提調之任, 當勉副, 卿其勿辭, 救護母病, 從速上來察職. Predicted 대사헌 남용익의 상소에 답하기를,“상소를 보고 잘 알았다. 
내국 제조의 직임은 부지런히 마지못해 경의 뜻을 따라주니, 경은 사직하지 말고 어미를 구호하는 데에 속히 올라와 직임을 살피라.” 하였다. Predicted (Eng.) Replying to the Prosecutor General Namyongik’s memorial, the king said, “I looked at the memorial and thoroughly understood what it meant. As the position of the director at the office of the royal physicians cannot help but agree to your message, you should not resign your position, care for your mother’s illness, and come back to be responsible for your duties quickly.” Original 夜一更, 月暈. 五更, 西方坤方, 有氣如火光. Predicted 밤 1경에 달무리가 졌다. 5경에 서방, 곤방에 화광 같은 기운이 있었다. Predicted (Eng.) The moon has a ring around it at 7-9 PM. At 3-5 AM, there was the light of the fire in the west and south-west.

Table 4: Examples of original Hanja sentences, ground-truth sentences, and predicted sentences. For readability, we appended English sentences corresponding to the predicted sentences in each row.

shows the sentences translated from the untranslated documents by our model. For readability, we append English sentences corresponding to the predicted sentences in each row. Each result indicates that our model generates modern sentences that correspond to the context of the source Hanja sentences. Interestingly, the third example in the table is related to the astronomical observation of an aurora. We later found prior studies confirming that the red energy mentioned in our document was an aurora (Zhang, 1985; Stephenson and Willis, 2008). This highlights the importance of the machine translation task for the historical records, since researchers in various fields, such as astrophysics and geology, need to survey them. Therefore, we further analyze the documents with the topic modeling approach.

            BLEU      METEOR    ROUGE-L
Base (1)    0.3547    0.3488    0.6082
Base (3)    0.3536    0.3482    0.6127
Full (1)    0.5269    0.4594    0.7463
Full (3)    0.5410    0.4719    0.7606

Table 5: Performance on the translation task. “Base” and “Full” denote the model trained only on the machine translation task and the model trained with multi-task learning on both the machine translation and restoration tasks, respectively.

        Multi-task    Scratch    Pipelining
BLEU    0.5410        0.3536     0.3755

Table 6: BLEU scores of the models trained with multi-task learning, from scratch, and with the pretraining-then-finetuning approach (pipelining), respectively.

5.3 Results of Topic Modeling

As described in Section 4.3, we calculate the term-topic weight matrix W and the date-topic weight matrix H. We select three interesting topics from the total of K topics and visualize the term-topic weights in W as a word cloud and the date-topic weights in H as a smoothed time-series graph for each topic. Fig. 4 shows the results.

The first topic is related to troops and military exercises. As shown in the red dashed box in the time-series graph, the weights decrease dramatically in 1882, whereas they increase continuously after the biggest war in 1592. In fact, a coup attempt by the old-fashioned soldiers occurred in 1882, causing intervention by neighboring countries and the decline of self-reliant defense.

The fifteenth topic is related to war and national defense. Although this topic is related to the preceding military topic, it is more concerned with international relations than the first one. In the early years of the dynasty, northern enemies and pirates frequently invaded Joseon, which appears as large topical weights at the beginning.

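[Figure 4 panels: Topic 1, Topic 15, Topic 18. Events annotated in the time-series graphs: foreign invasion; national security and foreign intervention; Leonids meteor; Maunder minimum.]

Figure 4: Three topics extracted from topic modeling. We translated topic keywords into English for readability.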
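The factorization into a term-topic matrix W and a date-topic matrix H described at the start of Section 5.3 can be illustrated with a minimal sketch. It assumes a plain non-negative matrix factorization with the multiplicative updates of Lee and Seung (2001); the actual formulation in Section 4.3 may differ. The toy vocabulary, the count matrix `A`, and the helper names `nmf`, `top_terms`, and `smooth` are hypothetical and serve only as illustration.

```python
import numpy as np


def nmf(A, k, n_iter=300, seed=0, eps=1e-9):
    """Approximate a non-negative (terms x dates) matrix A as W @ H.T.

    W (terms x k) holds term-topic weights and H (dates x k) holds
    date-topic weights. Assumption: Lee-Seung multiplicative updates
    for the Frobenius-norm objective, not necessarily the paper's solver.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((n, k)) + eps
    for _ in range(n_iter):
        # Alternate the two updates; eps guards against division by zero.
        W *= (A @ H) / (W @ (H.T @ H) + eps)
        H *= (A.T @ W) / (H @ (W.T @ W) + eps)
    return W, H


def top_terms(W, vocab, topic, n=3):
    """Return the n highest-weighted terms of one topic (word-cloud input)."""
    order = np.argsort(-W[:, topic])[:n]
    return [vocab[i] for i in order]


def smooth(series, window=3):
    """Centered moving average for smoothing a date-topic time series."""
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="same")


# Hypothetical toy corpus: 6 terms counted over 8 time bins.
vocab = ["troops", "drill", "invasion", "pirates", "meteor", "halo"]
A = np.array([
    [5, 4, 0, 0, 1, 0, 0, 0],
    [4, 5, 0, 0, 1, 0, 0, 0],
    [0, 0, 6, 5, 0, 0, 1, 0],
    [0, 0, 5, 6, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 4, 0, 5],
    [0, 0, 0, 0, 0, 5, 0, 4],
], dtype=float)

W, H = nmf(A, k=3)
for t in range(3):
    # Top terms feed a word cloud; the smoothed column of H is the time series.
    print(t, top_terms(W, vocab, t, n=2), np.round(smooth(H[:, t]), 2))
```

In this sketch, each column of W would feed a word cloud and each smoothed column of H would be plotted over time. The moving-average window is an arbitrary choice; the source does not specify how the time series in Fig. 4 was smoothed.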

The weights of the fifteenth topic increase in the late sixteenth century and remain at a high level until 1637, by which time three great wars had broken out in Joseon.

The eighteenth topic is related to astronomical observations such as halos and meteor showers. In the mid-sixteenth century, people observed the Leonids, as shown in the first red box of the graph. We later found that astronomers had also identified this event using AJD (Yang et al., 2005). Moreover, from the mid-seventeenth century to the early eighteenth century, the number of sunspots was low. Solar observers call this period the Maunder minimum (Eddy, 1976; Shindell et al., 2001). This event caused abnormal climate phenomena, such as the third example in Table 4, as shown in the second red box of the graph. This topic demonstrates the importance of using historical records, since it is otherwise difficult to identify phenomena that occurred centuries ago.

Note that previous studies mainly attempted to exploit only AJD or the translated parts of DRS. In contrast, we utilize both AJD and the majority of the DRS records by applying advanced NMT techniques. When we performed topic modeling using only the manually translated sentences, the model failed to include topics such as the health of the royal families and actions against treason sinners, which were revealed by our approach. This is because the voluminous documents that have not been manually translated contain their own topics. Thus, we extract several valuable topics even with no special knowledge of Hanja. Translating the historical records into modern languages expands our knowledge base, and analyzing the records with machine translation and text mining techniques may help analysts explore the historical records effectively.

6 Conclusions

In this paper, we proposed a novel approach to translate and restore the historical records of the Joseon dynasty by formulating a multi-task learning problem based on the self-attention mechanism. Our approach significantly increases the translation quality by learning the rich contents of large documents. We anticipate that these tasks are the first steps towards translating the ancient Korean historical records into modern languages such as English. Furthermore, the model effectively predicts the original words from the damaged parts of the documents, which is an essential step for restoring the damaged documents. Results from the text mining approaches show that our approach has the potential to support analysts in effectively exploring the large volume of historical documents. We also expect that researchers from diverse domains can explore the documents and discover historical findings, such as astronomical phenomena and undiscovered international affairs, with no special domain knowledge. As future work, we will leverage the transfer learning approach to translate historical documents into other languages, such as English or French. We also plan to apply knowledge graph-based machine learning approaches, e.g., knowledge graph embedding and graph neural networks, to discover historical events and relations.

Acknowledgements

This work was supported by Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2020-0-00368, A Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques, No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST), and No.2021-0-01341, Artificial Intelligence Graduate School Program (Chung-Ang University)) and the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No.NRF-2019R1A2C4070420).

References

Yannis Assael, Thea Sommerschield, and Jonathan Prag. 2019. Restoring ancient text using deep learning: a case study on Greek epigraphy. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6368–6375, Hong Kong, China. Association for Computational Linguistics.

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven pretraining of self-attention networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5360–5369, Hong Kong, China. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of the International Conference on Learning Representations (ICLR).

JinYeong Bak and Alice Oh. 2015. Five centuries of monarchy in Korea: mining the text of the Annals of the Joseon Dynasty. In Proceedings of the SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities.

JinYeong Bak and Alice Oh. 2018. Conversational decision-making model for predicting the king’s decision in the Annals of the Joseon Dynasty. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research.

Gulcin Caner and Ismail Haritaoglu. 2010. Shape-DNA: effective character restoration and enhancement for Arabic text documents. In International Conference on Pattern Recognition.

Kai Chen, Mathias Seuret, Jean Hennebert, and Rolf Ingold. 2017. Convolutional neural networks for page segmentation of historical document images. In International Conference on Document Analysis and Recognition.

Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. 2020. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7870–7881, Online. Association for Computational Linguistics.

Tarin Clanuwat, Alex Lamb, and Asanobu Kitamoto. 2019. KuroNet: Pre-modern Japanese kuzushiji character recognition with deep learning. In 2019 International Conference on Document Analysis and Recognition (ICDAR).

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2019. ELECTRA: Pre-training text encoders as discriminators rather than generators. In Proc. of the International Conference on Learning Representations (ICLR).

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Proc. of the Advances in Neural Information Processing Systems (NIPS).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of the North American Chapter of the Association for Computational Linguistics (NAACL).

Miguel Domingo and Francisco Casacuberta Nolla. 2018. Spelling normalization of historical documents by using a machine translation approach. In Proceedings of the Conference of the European Association for Machine Translation.

John A Eddy. 1976. The Maunder minimum. Science.

Hisashi Hayakawa, Kiyomi Iwahashi, Yusuke Ebihara, Harufumi Tamazawa, Kazunari Shibata, Delores J Knipp, Akito D Kawamura, Kentaro Hattori, Kumiko Mase, Ichiro Nakanishi, et al. 2017. Long-lasting extreme magnetic storm activities in 1770 found in historical documents. The Astrophysical Journal.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proc. of the Advances in Neural Information Processing Systems (NIPS).

Junhyeok Jeon, Sung-Jun Noh, and Dong-Hee Lee. 2018. Relationship between lightning and solar activity for recorded between CE 1392–1877 in Korea. Journal of Atmospheric and Solar-Terrestrial Physics.

Ho Chul Ki, Eun-Kyoung Shin, Eun Jin Woo, Eunju Lee, Jong Ha Hong, and Dong Hoon Shin. 2018. Horse-riding accidents and injuries in historical records of Joseon dynasty, Korea. International Journal of Paleopathology.

Hannah Kim, Jaegul Choo, Jingu Kim, Chandan K Reddy, and Haesun Park. 2015. Simultaneous discovery of common and discriminative topics via joint nonnegative matrix factorization. In Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Neethu S Kumar, Dinesh S Kumar, S Swathikiran, and Alex Pappachen James. 2014. Ancient Indian document analysis using cognitive memory network. In International Conference on Advances in Computing, Communications and Informatics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. In Proc. of the International Conference on Learning Representations (ICLR).

Daniel D Lee and H Sebastian Seung. 2001. Algorithms for non-negative matrix factorization. In Proc. of the Advances in Neural Information Processing Systems (NIPS).

Ki-Won Lee, Hong-Jin Yang, and Myeong-Gu Park. 2009. Orbital elements of comet C/1490 Y1 and the Quadrantid shower. Monthly Notices of the Royal Astronomical Society.

Yanyang Li, Tong Xiao, Yinqiao Li, Qiang Wang, Changming Xu, and Jingbo Zhu. 2018. A simple and effective approach to coverage-aware neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 292–297, Melbourne, Australia. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Dayiheng Liu, Kexin Yang, Qian Qu, and Jiancheng Lv. 2019a. Ancient–modern Chinese translation with a new large training dataset. ACM Transactions on Asian and Low-Resource Language Information Processing.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2019b. On the variance of the adaptive learning rate and beyond. In Proc. of the International Conference on Learning Representations (ICLR).

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019c. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.

David Mimno. 2012. Computational historiography: Data mining in a century of classics journals. Journal on Computing and Cultural Heritage.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Drew T Shindell, Gavin A Schmidt, Michael E Mann, David Rind, and Anne Waple. 2001. Solar forcing of regional climate change during the Maunder minimum. Science.

F Richard Stephenson and David M Willis. 2008. ‘Vapours like fire light’ are Korean aurorae. Astronomy & Geophysics.

Gongbo Tang, Fabienne Cap, Eva Pettersson, and Joakim Nivre. 2018. An evaluation of neural machine translation models on historical spelling normalization. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1320–1331, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of the Advances in Neural Information Processing Systems (NIPS).

Hong-Jin Yang, Changbom Park, and Myeong-Gu Park. 2005. Analysis of historical meteor and meteor shower records: Korea, China, and Japan. Icarus.

Tze-I Yang, Andrew Torget, and Rada Mihalcea. 2011. Topic modeling on historical newspapers. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 96–104, Portland, OR, USA. Association for Computational Linguistics.

Chulsang Yoo, Minkyu Park, Hyeon Jun Kim, Juhee Choi, Jiye Sin, and Changhyun Jun. 2015. Classification and evaluation of the documentary-recorded storm events in the Annals of the Choson Dynasty (1392–1910), Korea. Journal of Hydrology.

Yang You, Igor Gitman, and Boris Ginsburg. 2017. Large batch training of convolutional networks. arXiv:1708.03888.

Meishan Zhang, Yue Zhang, Wanxiang Che, and Ting Liu. 2014. Character-level Chinese dependency parsing. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019a. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451, Florence, Italy. Association for Computational Linguistics.

Zhiyuan Zhang, Wei Li, and Qi Su. 2019b. Automatic translating between ancient Chinese and contemporary Chinese with limited aligned corpora. In CCF International Conference on Natural Language Processing and Chinese Computing.

ZW Zhang. 1985. Korean auroral records of the period AD 1507-1747 and the SAR arcs. Journal of the British Astronomical Association.

Huidong Zhao, Bin Wu, Haoyu Wang, and Chuan Shi. 2014. Sentiment analysis based on transfer learning for Chinese ancient literature. In International Conference on Behavioral, Economic, and Socio-Cultural Computing.