Natural Language Processing and Information Retrieval
NATURAL LANGUAGE GENERATION

Title Generation for Machine-Translated Documents

Rong Jin, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, 15213, [email protected]
Alexander G. Hauptmann, Dept. of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 15213, [email protected]

Abstract

In this paper, we present and compare automatically generated titles for machine-translated documents using several different statistics-based methods. A Naïve Bayesian, a K-Nearest Neighbour, a TF-IDF and an iterative Expectation-Maximization method for title generation were applied to 1000 original English news documents and again to the same documents translated from English into Portuguese, French or German and back to English using SYSTRAN. The AutoSummarization function of Microsoft Word was used as a baseline. Results on several metrics show that the statistics-based methods of title generation for machine-translated documents are fairly language independent, and that title generation is possible at a level approaching the accuracy of titles generated for the original English documents.

1. Introduction

Before we discuss generating target language titles for documents in a different source language, let us consider in general the complex task of creating a title for a document: one has to understand what the document is about, one has to know what is characteristic of this document with respect to other documents, and one has to know how a good title sounds to catch attention and how to distill the essence of the document into a title of just a few words. Generating a title for a machine-translated document is even more challenging, because we have to deal with syntactic and semantic translation errors produced by the automatic translation system.

Generating text titles for machine-translated documents is very worthwhile because it produces a very compact target language representation of the original foreign language document, which helps readers to understand the important information contained in the document quickly, without requiring a translation of the complete document. From the viewpoint of machine learning, studies on how well general title generation methods can be adapted to errorful machine-translated documents, and which methods perform better than others, will be very helpful towards a general understanding of how to discover knowledge from errorful or corrupted data and how to apply learned knowledge in 'noisy' environments.

Historically, the task of title generation is strongly connected to more traditional document summarization tasks (Goldstein et al., 1999), because title generation can be thought of as extremely short summarization. Traditional summarization has emphasized the extractive approach, using selected sentences or paragraphs from a document to provide a summary (Strzalkowski et al., 1998; Salton et al., 1997; Mitra et al., 1997). Most notably, McKeown et al. (1995) have developed systems that extract phrases and recombine elements of phrases into titles.

More recently, some researchers have moved toward "learning approaches" that take advantage of training data (Witbrock and Mittal, 1999). Their key idea was to estimate the probability of generating a particular title word given a word in the document. In their approach, they ignore all document words that do not appear in the title. Only document words that effectively reappear in the title of a document are counted when they estimate the probability of generating a title word wt given a document word wd as P(wt|wd), where wt = wd. While the Witbrock/Mittal Naïve Bayesian approach is not in principle limited to this constraint, our experiments show that it is a very useful restriction. Kennedy and Hauptmann (2000) explored a generative approach with an iterative Expectation-Maximization algorithm using most of the document vocabulary. Jin and Hauptmann (2000a) extended this research with a comparison of several statistics-based title word selection methods.

In our approach to the title generation problem we will assume the following: First, the system will be given a set of target language training data. Each datum consists of a document and its corresponding title. After exposure to the training corpus, the system should be able to generate a title for any unseen document. All source language documents will be translated into the target language before any title generation is attempted.

We decompose the title generation problem into two phases:
• learning and analysis of the training corpus, and
• generating a sequence of words using the learned statistics to form the title for a new document.

For learning and analysis of the training corpus, we present five different learning methods for comparison. All the approaches are described in detail in Section 2.
• A Naïve Bayesian approach with limited vocabulary. This closely mirrors the experiments reported by Witbrock and Mittal (1999).
• A Naïve Bayesian approach with full vocabulary. Here, we compute the probability of generating a title word given a document word for all words in the training data, not just those document words that reappear in the titles.
• A KNN (k nearest neighbors) approach, which treats title generation as a special classification problem. We consider the titles in the training corpus as a fixed set of labels, and the task of generating a title for a new document is essentially the same as selecting an appropriate label (i.e. title) from the fixed set of training labels. The task reduces to finding the document in the training corpus which is most similar to the current document to be titled. Standard document similarity vectors can be used. The new document's title is set to the title of the training document most similar to the current document.
• An iterative Expectation-Maximization approach. This duplicates the experiments reported by Kennedy and Hauptmann (2000).
• A term frequency and inverse document frequency (TFIDF) method (Salton and Buckley, 1988). TFIDF is a popular method used by the Information Retrieval community for measuring the importance of a term relative to a document. We use the TFIDF score of a word as a measure of the potential that this document word will be adopted into the title, since important words have a higher chance of being used in the title.

For the title-generating phase, we can further decompose the issues involved as follows:
• Choosing appropriate title words. This is done by applying the learned knowledge from one of the six methods examined in this paper.
• Deciding how many title words are appropriate for this document title. In our experiments we simply fixed the length of generated titles to the average expected title length for all comparisons.
• Finding the correct sequence of title words that forms a readable title 'sentence'. This was done by applying a language model of title word trigrams to order the newly generated title word candidates into a linear sequence.

Finally, as a baseline, we also used the extractive summarization approach implemented as AutoSummarize in Microsoft Word, which selects the "best" sentence from the document as a title.

The outline of this paper is as follows: This section gave an introduction to the title generation problem. Details of our approach and experiments are presented in Section 2. The results and their analysis are presented in Section 3. Section 4 discusses our conclusions drawn from the experiment and suggests possible improvements.

2. Title Generation Experiment across Multiple Languages

We will first describe the data used in our experiments, and then explain and justify our evaluation metrics. The six learning approaches will then be described in more detail at the end of this section.

2.1 Data Description

The experimental dataset comes from a CD of 1997 broadcast news transcriptions published by Primary Source Media (1997). There were a total of roughly 50,000 news documents and corresponding titles in the dataset. The training dataset was formed by randomly picking four document-title pairs from every five pairs in the original dataset. The size of the training corpus was therefore 40,000 documents and their titles. We fixed the size of the test collection at 1,000 items drawn from the unused document-title pairs. By separating training data and test data in this way, we ensure strong overlap in topic coverage between the training and test datasets, which gives the learning algorithms a chance to play a bigger role.

Since we did not have a large set of documents with titles in multiple parallel languages, we approximated this by creating machine-translated documents from the 1000 original test documents with titles as follows: Each of the 1000 test documents was submitted to the SYSTRAN machine translation system (http://babelfish.altavista.com) and translated into French. The French translation was again submitted to the SYSTRAN translation system and translated back into English. This final retranslation resulted in our French machine translation data. The procedure was repeated on the original documents for translation into Portuguese and German to obtain two more machine-translated sets of identical documents. For all languages, the average vocabulary overlap between the translated documents and the original documents was around 70%. This implies that at least 30% of the English words were translated incorrectly during translation from English to a foreign language and back to English.

2.2 Evaluation Metric

In this paper, two evaluations are used, one to measure selection quality and another to measure the accuracy of the sequential ordering of the title words.

word DW, if DW is the same as TW (i.e. DW = TW). To generate a title, we merely apply the statistics to the test documents for generating titles and
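The limited-vocabulary Naïve Bayesian idea described above can be sketched very simply: estimate, from document-title pairs, the probability that a document word reappears in the title, counting only words that do reappear in some title. This is a minimal illustration, not the authors' implementation; the function names and the top-k selection of title words are assumptions for the sketch.

```python
from collections import defaultdict

def train_limited_nb(pairs):
    """Estimate P(word appears in title | word appears in document)
    from (document_words, title_words) pairs, keeping only words
    that reappear in at least one title (the 'limited vocabulary'
    restriction described in the text)."""
    in_doc = defaultdict(int)   # number of documents containing w
    in_both = defaultdict(int)  # number of those whose title also contains w
    for doc, title in pairs:
        title_set = set(title)
        for w in set(doc):
            in_doc[w] += 1
            if w in title_set:
                in_both[w] += 1
    return {w: in_both[w] / in_doc[w] for w in in_both}

def generate_title_words(model, doc, k):
    """Pick the k document words most likely to reappear in a title."""
    scores = {w: model.get(w, 0.0) for w in set(doc)}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

On a toy corpus, a word like "stocks" that recurs in titles scores 1.0, while a word like "today" that never appears in a title is excluded from the model entirely.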
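The KNN approach reduces title generation to nearest-neighbor retrieval: represent documents as term vectors, find the most similar training document, and reuse its title. A minimal sketch follows, assuming raw term-frequency vectors and cosine similarity with k = 1; the paper says only that "standard document similarity vectors" are used, so the exact weighting is an assumption.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_title(training_pairs, doc_words):
    """Return the title of the most similar training document (k = 1)."""
    query = Counter(doc_words)
    best_title, best_sim = None, -1.0
    for doc, title in training_pairs:
        sim = cosine(query, Counter(doc))
        if sim > best_sim:
            best_title, best_sim = title, sim
    return best_title
```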
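The TFIDF method scores each document word by term frequency times inverse document frequency and adopts the highest-scoring words as title words, with the title length fixed to the average training title length as described above. This sketch uses the plain tf * log(N/df) weighting; Salton and Buckley (1988) discuss several variants, so treat this as one illustrative choice.

```python
import math
from collections import Counter

def idf_table(corpus):
    """Inverse document frequency log(N / df) over tokenized documents."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    return {t: math.log(n / df[t]) for t in df}

def tfidf_title_words(doc, idf, k):
    """Score each document word by tf * idf and keep the top k."""
    tf = Counter(doc)
    scores = {t: tf[t] * idf.get(t, 0.0) for t in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Note that a word occurring in every document gets idf 0 and can never be selected, which matches the intuition that ubiquitous words make poor title words.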
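For the ordering step, the paper applies a trigram language model over title words to arrange the chosen words into a readable sequence. One naive way to realize this, feasible only because generated titles are short, is to score every permutation of the candidate words under a trigram model trained on training titles; the boundary markers, the probability floor for unseen contexts, and exhaustive permutation search are all assumptions of this sketch, not details from the paper.

```python
import itertools
from collections import Counter

START, END = "<s>", "</s>"

def train_trigrams(titles):
    """Count trigrams (and their bigram contexts) over training titles,
    padded with boundary markers."""
    tri, bi = Counter(), Counter()
    for title in titles:
        seq = [START, START] + title + [END]
        for i in range(len(seq) - 2):
            bi[tuple(seq[i:i + 2])] += 1
            tri[tuple(seq[i:i + 3])] += 1
    return tri, bi

def best_order(words, tri, bi, floor=1e-6):
    """Return the permutation of the chosen title words that the
    trigram model scores highest."""
    def score(seq):
        seq = [START, START] + list(seq) + [END]
        p = 1.0
        for i in range(len(seq) - 2):
            ctx = tuple(seq[i:i + 2])
            p *= tri[tuple(seq[i:i + 3])] / bi[ctx] if bi[ctx] else floor
        return p
    return max(itertools.permutations(words), key=score)
```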
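The ~70% vocabulary overlap statistic for the round-trip-translated documents can be measured per document pair. The paper does not give the exact formula, so the definition below, the fraction of the original document's word types that survive translation and back-translation, is one plausible reading.

```python
def vocab_overlap(original_words, translated_words):
    """Fraction of the original document's word types that also occur
    in the round-trip-translated document (an assumed definition of
    'vocabulary overlap'; the paper does not spell one out)."""
    orig, trans = set(original_words), set(translated_words)
    return len(orig & trans) / len(orig) if orig else 0.0
```

Averaging this value over all 1000 test documents for a given pivot language would reproduce the kind of per-language figure the paper reports.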