
Perplexity (lower is better) of Neural Language Models

Scores from methods that exploit test-data statistics (e.g., cache mechanisms) are excluded.

Last updated: 2018/9/28
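As a reminder of the metric itself: perplexity is the exponential of the average negative log-likelihood per token, so a lower value means the model assigns higher probability to the held-out text. A minimal sketch (function name and inputs are illustrative, not from any listed paper):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative natural-log
    probability the model assigned to each token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Sanity check: a model that is uniform over a 10-word vocabulary
# assigns log(1/10) to every token, giving perplexity 10.
uniform = [math.log(1 / 10)] * 5
print(round(perplexity(uniform), 6))  # 10.0
```

The same quantity is often reported as `exp(cross-entropy loss)` when the loss is averaged per token in nats.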

Penn Treebank (PTB)

| Publication | Model | Parameters | Valid | Test |
|---|---|---|---|---|
| Mikolov and Zweig ’12 | RNN | - | - | 124.7 |
| Mikolov and Zweig ’12 | RNN + LDA + Kneser-Ney smoothing | - | - | 98.3 |
| Zaremba et al. ’14 | LSTM (medium) | 20M | 86.2 | 82.7 |
| Gal and Ghahramani ’16 | Variational LSTM (medium) + Word Tying | 20M | 81.8 ± 0.2 | 79.2 ± 0.1 |
| Kim et al. ’16 | CharCNN + LSTM | 19M | - | 78.9 |
| Zaremba et al. ’14 | LSTM (large) | 66M | 82.2 | 78.4 |
| Gal and Ghahramani ’16 | Variational LSTM (large) + Word Tying | 66M | 77.3 ± 0.2 | 75.0 ± 0.1 |
| Gal and Ghahramani ’16 | Variational LSTM (large) + Word Tying + MC dropout | 66M | - | 73.4 ± 0.0 |
| Zaremba et al. ’14 | LSTM (large) Ensemble | 2.5G | 71.9 | 68.7 |
| Zilly et al. ’17 | Variational RHN + Word Tying | 23M | 67.9 | 65.4 |
| Takase et al. ’17 | Variational RHN + Word Tying + IOG | 29M | 67.0 | 64.4 |
| Zoph and Le ’17 | Neural Architecture Search + Word Tying | 54M | - | 62.4 |
| Takase et al. ’17 | Variational RHN + IOG Ensemble | 326M | 64.1 | 61.4 |
| Melis et al. ’18 | LSTM with skip connections | 24M | 60.9 | 58.3 |
| Merity et al. ’18 | AWD-LSTM | 24M | 60.0 | 57.3 |
| Yang et al. ’18 | AWD-LSTM-MoS | 22M | 56.54 | 54.44 |
| Gong et al. ’18 | AWD-LSTM-MoS + FRAGE | 22M | 55.52 | 53.31 |
| Takase et al. ’18 | AWD-LSTM-DOC | 23M | 54.12 | 52.38 |
| Takase et al. ’18 | AWD-LSTM-DOC Ensemble | 114M | 48.63 | 47.17 |