Universal Language Model Fine-tuning for Text Classification

Jeremy Howard*   Sebastian Ruder*   (*equal contribution)

Transfer learning for NLP: status quo
- Best practice: initialise the first layer with pretrained word embeddings.
- Recent approaches (McCann et al., 2017; Peters et al., 2018) use pretrained embeddings as fixed features; Peters et al. (2018) is task-specific.
- Why not initialise the remaining parameters as well?
- Dai and Le (2015) first proposed fine-tuning a LM. However: no pretraining, and naive fine-tuning requires millions of in-domain documents.

Universal Language Model Fine-tuning (ULMFiT)
A 3-step recipe for state-of-the-art performance on any text classification task:
1. Train a language model (LM) on general-domain data.
2. Fine-tune the LM on target task data.
3. Train a classifier on the labeled data.

(a) General-domain LM pretraining
Train the LM on a large general-domain corpus, e.g. WikiText-103.

(b) Target task LM fine-tuning
Discriminative fine-tuning
Different layers capture different types of information, so they should be fine-tuned to different extents with different learning rates (see the sketch below):

θ_t^l = θ_{t-1}^l − η^l · ∇_{θ^l} J(θ)

Slanted triangular learning rates
The model should converge quickly to a suitable region of the parameter space and then refine its parameters (a sketch of the schedule follows the discriminative fine-tuning example).
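As a concrete illustration of discriminative fine-tuning, here is a minimal PyTorch-style sketch. The `ToyLM` module, its hidden sizes, and the top-layer learning rate are illustrative stand-ins rather than the released AWD-LSTM setup; only the per-layer decay factor of 2.6 follows the paper.

```python
from torch import nn, optim

# Hypothetical stand-in for the AWD-LSTM language model used in the paper:
# an embedding layer, three stacked LSTMs, and a decoder.
class ToyLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=400, hidden_dim=1150):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnns = nn.ModuleList([
            nn.LSTM(emb_dim, hidden_dim, batch_first=True),
            nn.LSTM(hidden_dim, hidden_dim, batch_first=True),
            nn.LSTM(hidden_dim, emb_dim, batch_first=True),
        ])
        self.decoder = nn.Linear(emb_dim, vocab_size)

    def forward(self, tokens):
        h = self.embedding(tokens)
        for rnn in self.rnns:
            h, _ = rnn(h)
        return self.decoder(h)

model = ToyLM()

# Discriminative fine-tuning: one optimizer parameter group per layer,
# decaying the learning rate by a factor of 2.6 per layer going down
# (the factor reported in the paper); top_lr is only illustrative.
layer_groups = [model.embedding, *model.rnns, model.decoder]
top_lr = 4e-3
param_groups = [
    {"params": g.parameters(), "lr": top_lr / 2.6 ** (len(layer_groups) - 1 - i)}
    for i, g in enumerate(layer_groups)
]
optimizer = optim.SGD(param_groups, lr=top_lr)  # each group keeps its own η^l
```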

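The slanted triangular schedule can be sketched as the piecewise-linear warm-up/decay below; the default values of cut_frac, ratio, and η_max follow the settings reported in the paper, while the 100-iteration usage example is purely illustrative.

```python
import math

def slanted_triangular_lr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at iteration t out of T total iterations.

    Short linear warm-up for the first cut_frac of training, then a long
    linear decay; ratio controls how much smaller the lowest learning
    rate is than eta_max (defaults follow the paper).
    """
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio

# Example: the schedule for a short run of 100 iterations peaks at eta_max
# after the warm-up and decays back towards eta_max / ratio.
schedule = [slanted_triangular_lr(t, T=100) for t in range(100)]
print(min(schedule), max(schedule))
```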
(c) Target task classifier fine-tuning
Train classification layers on top of the LM.

Concat pooling
Concatenate the final hidden state with pooled representations of all hidden states to capture long document contexts (a minimal sketch appears at the end of the poster):

h_c = [h_T, maxpool(H), meanpool(H)]

Gradual unfreezing
Gradually unfreeze the layers, starting from the last layer, to prevent catastrophic forgetting (also included in the sketch at the end).

Bidirectional language model
Pretrain both forward and backward LMs and fine-tune them independently.

Experiments
Comparison against the state of the art (SOTA) on six widely studied text classification datasets. Test error rates (%):

Dataset     Previous SOTA   ULMFiT
IMDb        5.9             4.6
TREC-6      3.9             3.6
AG News     6.57            5.01
DBpedia     0.84            0.80
Yelp-bi     2.64            2.16
Yelp-full   30.58           29.98

Analysis
Low-shot learning
- With 100 labeled examples, ULMFiT matches the performance of training from scratch with 10x and 20x more data (on IMDb and AG News, respectively).
- With 100 labeled examples plus 50k-100k unlabeled examples, ULMFiT matches the performance of training from scratch with 50x and 100x more data (on IMDb and AG News, respectively).

Pretraining
- Most useful for small- and medium-sized datasets.
[Figure: validation error with vs. without LM pretraining on IMDb, TREC-6, and AG News.]

LM quality
- Even a vanilla LM can perform well with fine-tuning.
[Figure: validation error of a vanilla LM vs. the AWD-LSTM LM on IMDb, TREC-6, and AG News.]

LM fine-tuning
- Most useful for larger datasets.
[Figure: validation error on IMDb, TREC-6, and AG News without LM fine-tuning vs. regular and discriminative LM fine-tuning.]

Classifier fine-tuning
- ULMFiT works well across all datasets.
[Figure: validation error on IMDb, TREC-6, and AG News for training from scratch and for classifier fine-tuning variants (e.g. freezing + discriminative fine-tuning) vs. full ULMFiT.]

Fine-tuning behaviour
- No catastrophic forgetting; stable even across a large number of epochs.

Models and code: http://nlp.fast.ai/ulmfit
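To make the classifier fine-tuning stage in panel (c) concrete, here is a minimal PyTorch-style sketch of a concat-pooling classifier head and a gradual unfreezing helper. The class and function names, hidden sizes, and tensor shapes are illustrative assumptions, not the released implementation; the top-down, one-layer-group-per-epoch unfreezing schedule follows the description in the paper.

```python
import torch
from torch import nn

class ConcatPoolClassifier(nn.Module):
    """Classifier head sketch using concat pooling over the LM's hidden states.

    `hidden_states` is assumed to have shape (batch, seq_len, hidden): the head
    concatenates the last hidden state with max- and mean-pooled states,
    h_c = [h_T, maxpool(H), meanpool(H)], and classifies the result.
    """
    def __init__(self, hidden=400, n_classes=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden, 50), nn.ReLU(), nn.Linear(50, n_classes)
        )

    def forward(self, hidden_states):
        h_T = hidden_states[:, -1]                  # last time step
        max_pool = hidden_states.max(dim=1).values  # max over time
        mean_pool = hidden_states.mean(dim=1)       # mean over time
        h_c = torch.cat([h_T, max_pool, mean_pool], dim=1)
        return self.classifier(h_c)

def gradual_unfreeze(layer_groups, epoch):
    """Unfreeze one more layer group per epoch, starting from the last (top)
    group, so lower layers stay frozen early in classifier fine-tuning."""
    n_unfrozen = min(epoch + 1, len(layer_groups))
    for i, group in enumerate(reversed(layer_groups)):
        requires_grad = i < n_unfrozen
        for p in group.parameters():
            p.requires_grad = requires_grad

# Usage example with random hidden states for a batch of 8 documents.
head = ConcatPoolClassifier(hidden=400, n_classes=2)
H = torch.randn(8, 200, 400)   # (batch, seq_len, hidden)
logits = head(H)               # (8, 2)
```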