Position-Invariant Truecasing with a Word-and-Character Hierarchical Recurrent Neural Network
Hao Zhang and You-Chi Cheng and Shankar Kumar and Mingqing Chen and Rajiv Mathews
Google
{haozhang,youchicheng,shankarkumar,mingqing,mathews}@google.com

arXiv:2108.11943v2 [cs.CL] 1 Sep 2021

Abstract

Truecasing is the task of restoring the correct case (uppercase or lowercase) of noisy text generated either by an automatic system for speech recognition or machine translation or by humans. It improves the performance of downstream NLP tasks such as named entity recognition and language modeling. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model, the first of its kind for this problem. Using sequence distillation, we also address the problem of truecasing while ignoring token positions in the sentence, i.e. in a position-invariant manner.

1 Introduction

Automatically generated texts such as speech recognition (ASR) transcripts as well as user-generated texts from mobile applications such as Twitter Tweets (Nebhi et al., 2015) often violate the grammatical rules of casing in English and other western languages (Grishina et al., 2020). The process of restoring the proper case, often known as tRuEcasIng (Lita et al., 2003), is not only important for the ease of consumption by end-users (e.g. in the scenario of automatic captioning of videos), but is also crucial to the success of downstream NLP components. For instance, named entity recognition (NER) relies heavily on case features (Mayhew et al., 2020). In language modeling (LM), truecasing is crucial for both pre- and post-processing. It has been used for restoring case and improving the recognition performance for ASR (post-processing, Gravano et al., 2009; Beaufays and Strope, 2013) as well as case normalization of user text prior to LM training (pre-processing, Chen et al., 2019). The latter employs Federated Learning (McMahan et al., 2017), a privacy-preserving learning paradigm, on distributed devices. The need for normalizing text on a variety of mobile devices makes it imperative to develop a fast, accurate and compact truecaser.

From a modeling perspective, there has been a divide between character-based and word-based approaches. Most of the earlier works are word-based (Lita et al., 2003; Chelba and Acero, 2004). In the word-based view, the task is to classify each word in the input into one of a few classes (Lita et al., 2003): all lowercase (LC), first letter uppercase (UC), all letters uppercase (CA), and mixed case (MC). The main drawback of the word-based approach is that it does not generalize well on unseen words and mixed case words. The development of character-based neural network modeling techniques (Santos and Zadrozny, 2014) led to the introduction of character-based truecasers (Susanto et al., 2016; Ramena et al., 2020). In the character-based view, the task is to classify each character in the input into one of two classes: U and L, for uppercase and lowercase respectively. The shortcoming of character-based models is inefficiency. As word-level features are always important for classification, character-based models need to be deep enough to capture high-level features, which exacerbates their slowness.

We view truecasing as a special case of text normalization (Zhang et al., 2019) with similar trade-offs for accuracy and efficiency. Thus, we classify input words into one of two classes: SELF and OTHER. Words labeled SELF will be copied as-is to the output. Words labeled OTHER will be fed to a sub-level character-based neural network along with the surrounding context, which is encoded. The task of the sub-network is to perform non-trivial transductions such as: iphone → iPhone, mcdonald's → McDonald's and hewlett-packard → Hewlett-Packard. The two-level hierarchical network is fast while also being accurate.

Lita et al. (2003) highlighted the uneventful uppercasing of the first letter in a sentence, and informally defined a new problem: "a more interesting problem from a truecasing perspective is to learn how to predict the correct case of the first word in a sentence (i.e. not always UC)". We address this problem with a modified sequence distillation (Kim and Rush, 2016) method. The intuition is inspired by the distinction between positional and intrinsic capitalization made by Beaufays and Strope (2013). We first train a teacher model on positionally capitalized data and then distill sub-sequences from the teacher model when training the student model. By doing so, the student model is only exposed to intrinsically capitalized examples. The method accomplishes both model compression and the position-invariant property which is desirable for mobile NLP applications.

Section 2 discusses related work. Section 3 defines the machine learning problem and presents the word- and character-based hierarchical recurrent neural network architecture. Section 4 introduces the modified sequence distillation method for position-invariant truecasing, as well as a curated Wikipedia data set for evaluation.[1] Section 5 shows the results and analysis.

[1] https://github.com/google-research-datasets/wikipedia-intrinsic-capitalization
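To make the SELF/OTHER scheme above concrete, the short sketch below derives word-level classes and character-level targets from a truecased reference sentence. It is a minimal illustration rather than the authors' data pipeline, and the function name and output format are hypothetical.

```python
# A minimal sketch (not the authors' pipeline): deriving word-level SELF/OTHER
# labels and character-level targets from a truecased reference sentence.

def derive_labels(truecased_sentence: str):
    """Return (lowercased word, word class, character-level target) triples."""
    examples = []
    for word in truecased_sentence.split():
        lowered = word.lower()
        if word == lowered:
            # Already all lowercase: the word is copied through unchanged.
            examples.append((lowered, "SELF", None))
        else:
            # Needs a non-trivial case transduction, handled by the
            # character-level sub-network.
            examples.append((lowered, "OTHER", word))
    return examples


print(derive_labels("iPhone sales at McDonald's and Hewlett-Packard"))
# [('iphone', 'OTHER', 'iPhone'), ('sales', 'SELF', None), ('at', 'SELF', None),
#  ("mcdonald's", 'OTHER', "McDonald's"), ('and', 'SELF', None),
#  ('hewlett-packard', 'OTHER', 'Hewlett-Packard')]
```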
2 Related Work

Word-based truecasing has been the dominant approach for a long time since the introduction of the task by Lita et al. (2003). Word-based models can be further categorized into generative models such as HMMs (Lita et al., 2003; Gravano et al., 2009; Beaufays and Strope, 2013; Nebhi et al., 2015) and discriminative models such as Maximum-Entropy Markov Models (Chelba and Acero, 2004), Conditional Random Fields (Wang et al., 2006), and most recently Transformer neural network models (Nguyen et al., 2019; Rei et al., 2020; Sunkara et al., 2020). Word-based models need to refine the class of mixed case words because there is a combinatorial number of possibilities of case mixing for a word (e.g., LaTeX). Lita et al. (2003) and Chelba and Acero (2004) suggested using either all of the forms or the most frequent form of mixed case words in the training data. The large-scale finite state transducer (FST) models by Beaufays and Strope (2013) used all known forms of mixed case words to build the "capitalization" FST. Wang et al. (2006) used a heuristic GEN function to enumerate a superset of all forms seen in the training data. But others (Nguyen et al., 2019; Rei et al., 2020) have chosen to simplify the problem by mapping mixed case words to first letter uppercase words. Sunkara et al. (2020) only evaluated word class F1, without refining the class of MC.

Character-based models have been explored largely after the dawn of modern neural network models. Susanto et al. (2016) first introduced a character-based LSTM for this task and completely solved the mixed case word problem. Recently, Grishina et al. (2020) compared character-based n-gram (n up to 15) language models with the character LSTM of Susanto et al. (2016). Ramena et al. (2020) advanced the state of the art with a character-based CNN-LSTM-CRF model which introduced local output label dependencies.

Text normalization is the process of transforming text into a canonical form. Examples of text normalization include but are not limited to written-to-spoken text normalization for speech synthesis (Zhang et al., 2019), spoken-to-written text normalization for speech recognition (Peyser et al., 2019), social media text normalization (Han et al., 2013), and historical text normalization (Makarov and Clematide, 2020). Truecasing is a problem that appears in both spoken-to-written and social media text normalization.

3 Formulation and Model Architecture

The input is a sequence of all lowercase words $\tilde{X} = (x_1, \ldots, x_l)$. The output is a sequence of words with proper casing $\tilde{Y} = (y_1, \ldots, y_l)$. We introduce a latent sequence of class labels $\tilde{C} = (c_1, \ldots, c_l)$, where $c_i \in \{\mathrm{S{=}SELF}, \mathrm{O{=}OTHER}\}$. We also use the notation $x_i^j$ and $y_i^j$ to represent the $j$-th character within the $i$-th word.

The model is trained to predict the probability:

$$P(\tilde{Y} \mid \tilde{X}) = \sum_{\tilde{C}} P(\tilde{Y} \mid \tilde{X}, \tilde{C}) \cdot P(\tilde{C} \mid \tilde{X}), \qquad (1)$$

where $P(\tilde{C} \mid \tilde{X})$ is a word-level model that predicts if a word should stay all lowercase (SELF) or change to a different case form (OTHER), which assumes label dependency between $c_1, \ldots, c_{i-1}$ and $c_i$:

$$P(\tilde{C} \mid \tilde{X}) = \prod_{i=1}^{l} P(c_i \mid c_1, \ldots, c_{i-1}, \tilde{X}). \qquad (2)$$

The label sequence $\tilde{C}$ works as a gating mechanism,

$$P(\tilde{Y} \mid \tilde{X}, \tilde{C}) = \prod_{i=1}^{l} \left[ \delta(c_i, \mathrm{O})\, P(y_i \mid \tilde{X}) + \delta(c_i, \mathrm{S})\, \delta(x_i, y_i) \right], \qquad (3)$$

where $P(y_i \mid \tilde{X})$ is a character-level model that predicts each output character, assuming dependency between the characters within a word, $y_i^1, \ldots, y_i^{j-1}$ and $y_i^j$, but no cross-word dependency between $y_1, \ldots, y_{i-1}$ and $y_i$; S and O denote SELF and OTHER respectively:

$$P(y_i \mid \tilde{X}) = \prod_{j=1}^{|x_i|} P(y_i^j \mid y_i^1, \ldots, y_i^{j-1}, \tilde{X}). \qquad (4)$$

Given that $\delta(c_i, \mathrm{S}) \equiv \delta(x_i, y_i)$, we can derive the log
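To make the factorization in Equations (1)-(4) concrete, the following is a minimal PyTorch sketch of the two-level word-and-character hierarchy: a word-level recurrent tagger scores SELF vs. OTHER per word, and a character-level recurrent decoder, conditioned on the word-level context vector, predicts output characters for words gated as OTHER. This is my own illustration rather than the authors' implementation; layer sizes, the bidirectional encoder, and tensor layouts are assumptions, and the label dependency of Equation (2) is deliberately simplified away.

```python
# Illustrative sketch only, not the paper's architecture or code.
import torch
import torch.nn as nn


class HierarchicalTruecaser(nn.Module):
    """Two-level sketch: word-level SELF/OTHER tagger + character-level decoder."""

    def __init__(self, word_vocab: int, char_vocab: int, dim: int = 64):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, dim)
        # Word-level recurrent encoder; bidirectionality is an assumption.
        self.word_rnn = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.class_out = nn.Linear(2 * dim, 2)  # logits for SELF vs. OTHER
        self.char_emb = nn.Embedding(char_vocab, dim)
        # Character-level decoder, conditioned on the word-level context vector.
        self.char_rnn = nn.LSTM(3 * dim, dim, batch_first=True)
        self.char_out = nn.Linear(dim, char_vocab)

    def forward(self, word_ids: torch.Tensor, char_ids: torch.Tensor):
        # word_ids: (batch, num_words); char_ids: (batch, num_words, word_len),
        # holding previous output characters under teacher forcing.
        word_ctx, _ = self.word_rnn(self.word_emb(word_ids))   # (B, W, 2*dim)
        # Per-word class scores; the dependency on previous labels in Eq. (2)
        # is omitted in this simplified sketch.
        class_logits = self.class_out(word_ctx)                # (B, W, 2)

        B, W, L = char_ids.shape
        chars = self.char_emb(char_ids)                        # (B, W, L, dim)
        # Broadcast each word's context vector over its characters.
        ctx = word_ctx.unsqueeze(2).expand(B, W, L, word_ctx.size(-1))
        char_in = torch.cat([chars, ctx], dim=-1).reshape(B * W, L, -1)
        char_hidden, _ = self.char_rnn(char_in)
        # Next-character scores, roughly playing the role of Eq. (4).
        char_logits = self.char_out(char_hidden).reshape(B, W, L, -1)
        return class_logits, char_logits


model = HierarchicalTruecaser(word_vocab=10000, char_vocab=100)
class_logits, char_logits = model(torch.zeros(2, 5, dtype=torch.long),
                                  torch.zeros(2, 5, 12, dtype=torch.long))
```

At inference time the class logits implement the gate of Equation (3): words predicted SELF are copied through unchanged, and only words predicted OTHER are decoded character by character.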
A word appearing at the beginning position of a sentence can be capitalized due to either one of the two reasons: positional or intrinsic capitalization. But it is more interesting to recover the intrinsic capitalization, which is independent of word positions. For example, in "apples are nutritious", the intrinsic capitalization form of the word apples is lowercase. But in "Apple is a personal computer company", the first letter of Apple should be capitalized. A truecasing model that recovers the position-invariant word forms can benefit downstream NLP components such as NER and LM models.
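The sub-sequence distillation recipe sketched in the introduction can be illustrated as follows: a teacher truecaser, trained on positionally capitalized text, labels full lowercased sentences, and student examples are drawn only from sub-sequences that start after the first word, so any capitalization the student sees is intrinsic. This is a hedged reading of that recipe, not the authors' exact procedure; the `teacher_truecase` callable, the random span sampling, and all parameter values are assumptions.

```python
# A hedged sketch of sub-sequence distillation for position-invariant training.
import random
from typing import Callable, List, Tuple


def distill_examples(
    sentences: List[str],
    teacher_truecase: Callable[[str], List[str]],  # lowercased text -> cased words
    samples_per_sentence: int = 2,
    min_len: int = 3,
) -> List[Tuple[List[str], List[str]]]:
    examples = []
    for sentence in sentences:
        # The teacher, trained on positionally capitalized data, labels the
        # full lowercased sentence.
        cased = teacher_truecase(sentence.lower())
        if len(cased) <= min_len:
            continue
        for _ in range(samples_per_sentence):
            # Start strictly after position 0, so the first word of the
            # sub-sequence cannot carry positional capitalization.
            start = random.randint(1, len(cased) - min_len)
            end = random.randint(start + min_len, len(cased))
            target = cased[start:end]
            source = [w.lower() for w in target]
            examples.append((source, target))
    return examples
```

The resulting (lowercased source, cased target) pairs would then be used to train the two-level student model described in Section 3.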