Deep Learning and Neural Machine Translation
If you are starting from scratch with neural networks and neural machine translation, here are my personal ("biased" - if you already get this pun, feel free to skip the general introductions to neural networks!) recommendations.
Did I miss something important or do you have a comment? Please write to mmueller AT cl.uzh.ch.
Fundamentals
Before we even get to neural networks, here are some pointers to the foundations deep learning builds on:
- Statistics: you should be familiar with concepts like mean, variance, distributions, sampling. A good book is "Statistics" (Freedman, Pisani and Purves), and Khan Academy has great videos about statistics: https://www.khanacademy.org/math/statistics-probability.
- Math: tl;dr: You need to be comfortable with partial derivatives and matrix operations. Work through the Khan Academy videos on Differential Calculus, then Multivariable Calculus and Linear Algebra. If short videos are not for you, read Gilbert Strang's Introduction to Linear Algebra. Then work through "The Matrix Calculus You Need for Deep Learning" (Parr and Howard) or through parts of "Calculus" (Spivak). A small gradient-checking exercise follows after this list.
- Programming: Given the current landscape of academic and commercial applications, it makes sense for you to learn Python. There are plenty of courses, for instance "Introduction to Computational Thinking and Data Science" from MIT. Justin Johnson wrote a "Python Numpy Tutorial" for cs231n.
- Machine Learning: A good introductory book is "Introduction to Machine Learning with Python" (Müller and Guido). For formal and thorough introductions, read "The Elements of Statistical Learning" (Hastie, Tibshirani and Friedman) and "Pattern Recognition and Machine Learning" (Bishop).
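A good self-check for the calculus and linear algebra above is to compare an analytical gradient against a numerical one. Here is a minimal numpy sketch of this idea (purely illustrative, not taken from any of the resources listed) for the function f(x) = ||Wx||^2:

```python
import numpy as np

# f(x) = ||W x||^2, a simple scalar-valued function of a vector x.
# Its analytical gradient with respect to x is 2 * W^T (W x).

def f(W, x):
    return np.sum((W @ x) ** 2)

def analytical_grad(W, x):
    return 2.0 * W.T @ (W @ x)

def numerical_grad(W, x, eps=1e-6):
    # Central finite differences, one coordinate of x at a time.
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[i] += eps
        x_minus[i] -= eps
        grad[i] = (f(W, x_plus) - f(W, x_minus)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)

print(np.allclose(analytical_grad(W, x), numerical_grad(W, x), atol=1e-5))  # True
```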
Neural Networks in General
Resources that introduce you to neural networks in general, largely without applying them to NLP problems yet:
- First Encounters: A good book as a very first introduction is "Neural Networks and Deep Learning" (Nielsen); after that, look at Chris Olah's blog and the recent book "Deep Learning with Python" (Chollet).
- Tour de force introductions: Work through the Stanford course cs231n; make sure to look at the 2016 version of the course, taught by the brilliant Andrej Karpathy. Finally, you could read "Deep Learning" (Goodfellow, Bengio and Courville) - be warned, it is a more advanced book.
Suggestions for papers that discuss specific methods that 1) are very widely used and 2) are a big part of what actually makes deep learning work well:
- Initialization: "Understanding the difficulty of training deep feedforward neural networks" (Glorot and Bengio), and "Delving Deep into Rectifiers ..." (He et al.) for the relationship between the type of activation and initialization (see the initialization sketch after this list).
- Normalization: Batch Normalization (Ioffe and Szegedy), Weight Normalization (Salimans and Kingma) and Layer Normalization (Ba, Kiros and Hinton)
- Dropout: "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (Srivastava et al.) and Bayesian Dropout (Gal and Ghahramani); a minimal dropout sketch follows after this list.
- Optimization: "An overview of gradient descent optimization algorithms" by Sebastian Ruder, and Adam (Kingma and Ba), a very popular optimizer.
- Recurrent networks: "On the difficulty of training recurrent neural networks" (Pascanu et al.) on vanishing and exploding gradients (a gradient clipping sketch follows after this list) and "How to construct deep recurrent neural networks" (Pascanu et al.) on notions of depth.
- Variants of RNNs: the most widely used RNN cell variant is the LSTM unit, introduced in "Long Short-Term Memory" (Hochreiter and Schmidhuber); a viable alternative is the Gated Recurrent Unit (GRU).
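To make the initialization papers a bit more concrete, here is a rough numpy sketch of Glorot (Xavier) and He initialization for a fully connected layer. The exact scaling variants differ slightly between papers and libraries, so treat this as an illustration rather than a reference implementation:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    # Glorot and Bengio: keep the variance of activations and gradients
    # roughly constant across layers; suited to tanh/sigmoid activations.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    # He et al.: scale the variance by 2 / fan_in; suited to ReLU
    # activations, which zero out roughly half of their inputs.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(42)
W_tanh = glorot_uniform(512, 512, rng)
W_relu = he_normal(512, 512, rng)
```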
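Dropout itself fits into a few lines. The sketch below uses the "inverted" formulation common in modern libraries, where activations are rescaled at training time so that nothing needs to change at test time; it illustrates the idea and is not code from the paper:

```python
import numpy as np

def dropout(x, p_drop, rng, train=True):
    # Inverted dropout: randomly zero activations at training time and
    # rescale the survivors by 1 / (1 - p_drop), so the expected value
    # of each activation is unchanged and test time needs no rescaling.
    if not train or p_drop == 0.0:
        return x
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 5))
print(dropout(h, p_drop=0.3, rng=rng))
```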
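The exploding gradient problem discussed by Pascanu et al. is commonly addressed with gradient norm clipping, which is simple enough to sketch here (again an illustration, not their original code):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients jointly if their global L2 norm exceeds
    # max_norm; this is the standard remedy for exploding gradients
    # in recurrent networks.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / global_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.ones((3, 3)) * 10.0, np.ones(3) * 10.0]
clipped = clip_by_global_norm(grads, max_norm=5.0)
```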
Neural Networks in NLP
Moving on to how neural networks are used in NLP:
- Overview: "Neural Network Methods for Natural Language Processing" (Goldberg) gives a good overview; a slimmer, free version is "A Primer on Neural Network Models for Natural Language Processing" (Goldberg). There is also a Stanford course by Chris Manning and Richard Socher, "CS224n: Natural Language Processing with Deep Learning".
- Word Embeddings: dense vector representations of words, useful for virtually all NLP problems. Well-known works are "Efficient Estimation of Word Representations in Vector Space" (Mikolov et al.) and "GloVe: Global Vectors for Word Representation" (Pennington, Socher and Manning). A small similarity example follows below.
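Once words are mapped to dense vectors, semantic similarity reduces to vector similarity. A toy sketch with made-up vectors (real word2vec or GloVe embeddings typically have 100-300 dimensions and are learned from large corpora):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors: 1.0 means the
    # vectors point in the same direction, 0.0 means they are orthogonal.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional "embeddings", invented for illustration only.
embeddings = {
    "king":  np.array([0.8, 0.1, 0.7, 0.2]),
    "queen": np.array([0.7, 0.2, 0.8, 0.1]),
    "apple": np.array([0.1, 0.9, 0.0, 0.8]),
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```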
Neural Machine Translation
Neural networks are used extensively in machine translation. Here are resources you should read to get up to speed. It is assumed that you are familiar with machine translation in general.
- Overview: Philipp Koehn's chapter draft on neural machine translation, Graham Neubig's Tutorial on Neural Machine Translation, Lecture Notes by Kyunghyun Cho: "Natural Language Understanding with Distributed Representation"
- Encoder-Decoder Architecture: a very popular approach that reads the source sentence with an encoder, attends to the encoded source with soft attention and produces the translation with a decoder network (see the attention sketch after this list). Important papers: "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau, Cho and Bengio), "Sequence to Sequence Learning with Neural Networks" (Sutskever, Vinyals and Le) and "Effective Approaches to Attention-based Neural Machine Translation" (Luong, Pham and Manning).
- Non-recurrent models: NMT systems do not need to be recurrent; they can also be convolutional, as in "Convolutional Sequence to Sequence Learning" (Gehring et al.), or self-attentional Transformers, as in "Attention Is All You Need" (Vaswani et al.). A scaled dot-product attention sketch follows after this list.
- Attention networks: For a similar model that demonstrates the versatility of attention, see "Show, Attend and Tell" (Xu et al.), a paper about image captioning.
- Exploration of Hyperparameters: "Massive Exploration of Neural Machine Translation Architectures" (Britz et al.).
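As a rough illustration of the soft attention step in the encoder-decoder approach (a simplified sketch of the idea in Bahdanau, Cho and Bengio, not their exact model): at each decoder step, every encoder state is scored against the current decoder state, the scores are turned into a probability distribution with a softmax, and the weighted average of encoder states serves as the context vector.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)          # numerical stability
    e = np.exp(x)
    return e / np.sum(e)

def soft_attention(decoder_state, encoder_states):
    # Score each encoder state against the current decoder state
    # (here a simple dot product; Bahdanau et al. use a small
    # feed-forward network instead), normalize with softmax, and
    # return the weighted average of encoder states as the context.
    scores = encoder_states @ decoder_state   # (src_len,)
    weights = softmax(scores)                 # attention distribution
    context = weights @ encoder_states        # (hidden_size,)
    return context, weights

rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(6, 8))   # 6 source positions, hidden size 8
decoder_state = rng.normal(size=8)
context, weights = soft_attention(decoder_state, encoder_states)
print(weights.sum())  # 1.0
```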
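The Transformer uses the same basic idea in matrix form: scaled dot-product attention over queries, keys and values. A minimal numpy sketch that ignores multiple heads, masking and the learned linear projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # "Attention Is All You Need": softmax(Q K^T / sqrt(d_k)) V,
    # computed for all positions at once.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (len_q, len_k)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                   # (len_q, d_v)

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 16))   # 5 positions, model dimension 16
# In self-attention, Q, K and V are linear projections of the same input;
# here we simply reuse X for all three to keep the sketch short.
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (5, 16)
```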
Suggestions for papers about specific methods, usually within the encoder-decoder framework but independent of the type of network (recurrent or not):
- Subword Units: pre-processing your data with byte-pair encoding (BPE) is essential to 1) control the vocabulary size and 2) be able to generate words unseen in the training data. See "Neural Machine Translation of Rare Words with Subword Units" (Sennrich, Haddow and Birch) and the BPE sketch after this list.
- Backtranslation: a mechanism that lets you incorporate monolingual data into the training of NMT models: "Improving Neural Machine Translation Models with Monolingual Data" (Sennrich, Haddow and Birch).
- Deep Models: Like other kinds of networks, RNNs profit from deeper transformations; see "Deep Architectures for Neural Machine Translation" (Miceli Barone et al.).
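The core of byte-pair encoding for subword segmentation is easy to sketch: repeatedly count adjacent symbol pairs in the vocabulary and merge the most frequent one. The snippet below follows the learning procedure described by Sennrich, Haddow and Birch in spirit; for real experiments, use their subword-nmt tool or a comparable implementation:

```python
import re
import collections

def get_stats(vocab):
    # Count how often each adjacent symbol pair occurs in the vocabulary.
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Replace every occurrence of the chosen pair with a merged symbol.
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Words are represented as space-separated symbols with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for _ in range(10):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)
```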
Deep Learning Libraries (and actual code!)
You are encouraged to scrutinize and write actual code as you learn about NMT:
- Deep Learning with numpy: forward passes of several flavours of RNN using only numpy (disclosure: this is my own repository); a tiny example of such a forward pass follows after this list. Joel Grus has live-coded a tiny deep learning library using numpy only; watch the video and look at similar code (my repository, heavily inspired by Joel).
- Libraries: reasonable libraries that have Python bindings are TensorFlow (see especially the sequence-to-sequence tutorial), MXNet (look at the Gluon tutorial) and DyNet (tutorial slides and code). PyTorch has also gained popularity in recent years (see for instance their seq2seq tutorial). We do not recommend Theano, since its development has been discontinued.
- NMT Tools: to mention just a few popular systems: Nematus (code, paper), Sockeye (code, paper), Fairseq (code, paper), seq2seq (code, docs), Neural Monkey (code, paper), OpenNMT-py (code, paper) and XNMT (code). There are also auxiliary tools such as "NMT Alignment Visualizations" (Rikters, Fishel and Bojar).
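To give a flavour of what an RNN forward pass in plain numpy looks like, here is a vanilla (Elman) RNN applied to a random input sequence; this is a generic sketch, not code copied from the repositories linked above:

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    # Vanilla (Elman) RNN: at each time step, combine the current input
    # with the previous hidden state and squash through tanh.
    hidden_size = W_hh.shape[0]
    h = np.zeros(hidden_size)
    states = []
    for x_t in inputs:
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)     # (seq_len, hidden_size)

rng = np.random.default_rng(3)
seq_len, input_size, hidden_size = 7, 10, 20
inputs = rng.normal(size=(seq_len, input_size))
W_xh = rng.normal(size=(input_size, hidden_size)) * 0.1
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1
b_h = np.zeros(hidden_size)

states = rnn_forward(inputs, W_xh, W_hh, b_h)
print(states.shape)  # (7, 20)
```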