Universal Language Model Fine-Tuning for Text Classification

Jeremy Howard* and Sebastian Ruder* (* equal contribution)

Transfer learning for NLP: status quo
- Best practice: initialise the first layer with pretrained word embeddings.
- Recent approaches (McCann et al., 2017; Peters et al., 2018) use pretrained embeddings as fixed features; Peters et al. (2018) is task-specific.
- Why not initialise the remaining parameters as well?
- Dai and Le (2015) first proposed fine-tuning a LM. However: no pretraining, and naive fine-tuning requires millions of in-domain documents.

Universal Language Model Fine-tuning (ULMFiT)
A 3-step recipe for state-of-the-art results on any text classification task:
1. Train a language model (LM) on general-domain data.
2. Fine-tune the LM on target task data.
3. Train a classifier on the labeled data.

(a) General-domain LM pretraining
Train the LM on a large general-domain corpus, e.g. WikiText-103.

(b) Target task LM fine-tuning
- Discriminative fine-tuning: Different layers capture different types of information, so they should be fine-tuned to different extents with different learning rates:
  θ_t^l = θ_{t-1}^l − η^l ⋅ ∇_{θ^l} J(θ)
- Slanted triangular learning rates: The model should converge quickly to a suitable region of parameter space and then refine its parameters.

(c) Target task classifier fine-tuning
Train a classification layer on top of the LM.
- Concat pooling: Concatenate pooled representations of the hidden states to capture long document contexts:
  h_c = [h_T, maxpool(H), meanpool(H)]
- Gradual unfreezing: Gradually unfreeze the layers starting from the last layer to prevent catastrophic forgetting.

Bidirectional language model
Pretrain both forward and backward LMs and fine-tune them independently.

(Minimal code sketches of discriminative fine-tuning, slanted triangular learning rates, concat pooling, and gradual unfreezing are given below.)

Experiments
Comparison against the state of the art (SOTA) on six widely studied text classification datasets.
[Chart: test error rate (%) of previous SOTA vs. ULMFiT on IMDb, TREC-6, AG News, DBpedia, Yelp-bi, and Yelp-full; ULMFiT improves on the previous SOTA on all six datasets.]

Analysis
- Low-shot learning:
  - With 100 labeled examples, ULMFiT matches the performance of training from scratch with 10x and 20x more data (on IMDb and AG News).
  - With 100 labeled examples plus 50-100k unlabeled examples, ULMFiT matches the performance of training from scratch with 50x and 100x more data (on IMDb and AG News).
- Pretraining: most useful for small and medium-sized datasets.
- LM quality: even a vanilla LM can perform well with fine-tuning (compared to the AWD-LSTM LM).
- LM fine-tuning: most useful for larger datasets.
- Classifier fine-tuning: ULMFiT works well across all datasets.
- Fine-tuning behaviour: no catastrophic forgetting; stable even across a large number of epochs.
[Charts: error rates (%) on IMDb, TREC-6, and AG News for the ablations above: with/without pretraining, vanilla LM vs. AWD-LSTM LM, LM fine-tuning variants (none, regular, + discriminative, + slanted triangular), and classifier fine-tuning variants (from scratch, regular, freezing, + discriminative, ULMFiT).]

Models and code: http://nlp.fast.ai/ulmfit
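The discriminative fine-tuning rule above maps naturally onto per-layer optimizer parameter groups. Below is a minimal PyTorch sketch, assuming a three-layer LSTM encoder and the paper's suggested factor of 2.6 between the learning rates of adjacent layers; the module sizes, names, and base learning rate are illustrative assumptions, not the authors' implementation.

    from torch import nn, optim

    # Hypothetical 3-layer LSTM encoder; sizes and layers are illustrative only.
    encoder = nn.ModuleList([
        nn.LSTM(400, 1150, batch_first=True),
        nn.LSTM(1150, 1150, batch_first=True),
        nn.LSTM(1150, 400, batch_first=True),
    ])

    base_lr = 0.01   # eta for the top layer (assumed value)
    decay = 2.6      # eta^{l-1} = eta^l / 2.6, as suggested in the paper

    layers = list(encoder)  # layers[0] is the lowest layer, layers[-1] the top one
    param_groups = [
        {"params": layer.parameters(),
         "lr": base_lr / decay ** (len(layers) - 1 - i)}  # eta^l for layer l
        for i, layer in enumerate(layers)
    ]
    optimizer = optim.SGD(param_groups, lr=base_lr)  # each group uses its own eta^l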
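The slanted triangular schedule can be computed per update step: the learning rate rises linearly for a short warm-up and then decays linearly. The sketch below follows the schedule's published form; the default hyperparameter values (cut_frac, ratio, lr_max) are assumptions taken from the paper rather than from this poster.

    import math

    def slanted_triangular_lr(t: int, T: int, lr_max: float = 0.01,
                              cut_frac: float = 0.1, ratio: int = 32) -> float:
        """Learning rate at update step t out of T total updates.

        Rises linearly for the first cut_frac fraction of steps, then decays
        linearly, so the model converges quickly to a suitable region and
        then refines its parameters.
        """
        cut = math.floor(T * cut_frac)
        if t < cut:
            p = t / cut
        else:
            p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
        return lr_max * (1 + p * (ratio - 1)) / ratio

    # Example: learning rates for 10,000 total updates.
    lrs = [slanted_triangular_lr(t, 10_000) for t in range(10_000)]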
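Concat pooling is a small operation over the LM's hidden states. A minimal sketch, assuming batch-first tensors of shape (batch, seq_len, hidden_dim); the function name is mine, not the authors'.

    import torch

    def concat_pool(hidden_states: torch.Tensor) -> torch.Tensor:
        """Concat pooling: h_c = [h_T, maxpool(H), meanpool(H)].

        hidden_states: LM outputs H with shape (batch, seq_len, hidden_dim).
        Returns a tensor of shape (batch, 3 * hidden_dim).
        """
        last = hidden_states[:, -1, :]                 # h_T: last time step
        max_pool = hidden_states.max(dim=1).values     # element-wise max over time
        mean_pool = hidden_states.mean(dim=1)          # element-wise mean over time
        return torch.cat([last, max_pool, mean_pool], dim=1)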
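Gradual unfreezing amounts to making one additional layer group trainable per epoch, starting from the last (top) layer. A minimal sketch; how the model is split into layer groups is an assumption, and the training call is left as a placeholder.

    from torch import nn

    def freeze_all(layer_groups):
        for group in layer_groups:
            for p in group.parameters():
                p.requires_grad = False

    def unfreeze_last_n(layer_groups, n):
        # Unfreeze the n top-most groups; everything below them stays frozen.
        for group in layer_groups[-n:]:
            for p in group.parameters():
                p.requires_grad = True

    # Illustrative layer groups (embedding, LSTM stack, classifier head are assumptions).
    layer_groups = [nn.Embedding(30_000, 400),
                    nn.LSTM(400, 1150, batch_first=True),
                    nn.LSTM(1150, 400, batch_first=True),
                    nn.Linear(3 * 400, 2)]

    for epoch in range(len(layer_groups)):
        freeze_all(layer_groups)
        unfreeze_last_n(layer_groups, epoch + 1)  # one more group becomes trainable
        # ... run one epoch of classifier fine-tuning here ...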

