In NMT, a common way to exploit monolingual data in the target language is back-translation (https://arxiv.org/pdf/1511.06709.pdf). Back-translation means training a target-to-source model and using it to generate a pseudo source sentence for each monolingual sentence in the target language; the resulting synthetic sentence pairs are then added to the genuine parallel training data. Back-translation often improves translation quality substantially.
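The back-translation procedure described above can be sketched as follows. This is a minimal illustration, not a real pipeline: `build_synthetic_parallel` and `toy_reverse_model` are hypothetical names, and the "model" is a placeholder callable standing in for a trained target-to-source NMT system.

```python
from typing import Callable, List, Tuple


def build_synthetic_parallel(
    target_sentences: List[str],
    translate_target_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each genuine target sentence with a machine-generated pseudo source."""
    pairs = []
    for target in target_sentences:
        # The source side is synthetic; the target side is genuine text.
        pseudo_source = translate_target_to_source(target)
        pairs.append((pseudo_source, target))
    return pairs


def toy_reverse_model(sentence: str) -> str:
    """Placeholder for a trained target-to-source model (assumption: it just upper-cases)."""
    return sentence.upper()


if __name__ == "__main__":
    monolingual_target = ["das ist ein haus", "guten morgen"]
    for source, target in build_synthetic_parallel(monolingual_target, toy_reverse_model):
        print(f"{source}\t{target}")
```

In an actual setup the callable would wrap the inference step of whatever toolkit is used, and the synthetic pairs would be concatenated with the genuine parallel corpus before training the source-to-target model.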
More recently, it was found that adding noise to back-translations improves performance even further (https://arxiv.org/pdf/1808.09381.pdf). The authors argue that noised back-translation achieves this by increasing source diversity. The authors of http://www.statmt.org/wmt19/pdf/WMT0074.pdf, on the other hand, offer a different explanation: they argue that the gains come from signalling to the model that back-translated text is a different kind of data.
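To make "adding noise" concrete, here is one possible noising function for a tokenized pseudo source sentence. It is a sketch under assumptions: the three operations (dropping tokens, replacing tokens with a filler, and a local shuffle) and all default values are illustrative choices, and the name `add_noise` and the `<BLANK>` filler are hypothetical. Consult the paper for the exact scheme the authors used.

```python
import random
from typing import List


def add_noise(
    tokens: List[str],
    rng: random.Random,
    p_drop: float = 0.1,
    p_blank: float = 0.1,
    max_shift: int = 3,
    filler: str = "<BLANK>",
) -> List[str]:
    """Return a noised copy of `tokens` (illustrative defaults, not tuned values)."""
    # 1. Drop each token independently with probability p_drop.
    kept = [tok for tok in tokens if rng.random() >= p_drop]
    # 2. Replace each remaining token with a filler with probability p_blank.
    blanked = [filler if rng.random() < p_blank else tok for tok in kept]
    # 3. Local shuffle: sort by (index + uniform noise in [0, max_shift + 1)),
    #    which keeps every token within max_shift positions of where it was.
    keys = [i + rng.uniform(0, max_shift + 1) for i in range(len(blanked))]
    order = sorted(range(len(blanked)), key=lambda i: keys[i])
    return [blanked[i] for i in order]


if __name__ == "__main__":
    rng = random.Random(13)
    sentence = "this is a back-translated source sentence".split()
    print(" ".join(add_noise(sentence, rng)))
```

Passing an explicit `random.Random` instance keeps the noise reproducible, which matters when the same noised corpus has to be regenerated for several experiments.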
In this project, you will put the latter view to the test by probing model states: you will train classifiers that take NMT encoder states as input and assign each input to one of two classes: genuine source sentence or back-translation. The main idea is to test the extent to which encoder states contain information that helps discriminate between the two kinds of input.
See https://www.aclweb.org/anthology/D18-1313 for more information on probing tasks.
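A probing classifier of the kind described above could look like the following minimal sketch. Several things here are assumptions for illustration: encoder states are taken to be already extracted as one vector per source token, each sentence is summarized by mean pooling, and the probe is a plain logistic regression trained by batch gradient descent. All function names are hypothetical, and extracting states from an actual NMT toolkit is not shown.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(states: Sequence[Vector]) -> Vector:
    """Average the per-token encoder states of one sentence into a single vector."""
    dim = len(states[0])
    return [sum(state[d] for state in states) / len(states) for d in range(dim)]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(
    features: Sequence[Vector], labels: Sequence[int], epochs: int = 200, lr: float = 0.5
) -> Tuple[Vector, float]:
    """Fit logistic regression; label 0 = genuine source, 1 = back-translation."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for d in range(dim):
                grad_w[d] += error * x[d]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights: Vector, bias: float, x: Vector) -> int:
    """Return 1 (back-translation) if the linear score is positive, else 0."""
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)


def accuracy(weights: Vector, bias: float, features: Sequence[Vector], labels: Sequence[int]) -> float:
    correct = sum(predict(weights, bias, x) == y for x, y in zip(features, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Toy stand-ins for encoder states: two tokens per sentence, two dimensions.
    genuine = [[[-1.0, 0.2], [-0.8, 0.1]], [[-1.2, -0.1], [-1.0, -0.1]]]
    back_translated = [[[1.0, 0.1], [0.8, -0.1]], [[1.3, 0.4], [1.1, 0.2]]]
    X = [mean_pool(s) for s in genuine + back_translated]
    y = [0] * len(genuine) + [1] * len(back_translated)
    w, b = train_probe(X, y)
    print("training accuracy:", accuracy(w, b, X, y))
```

For a real experiment the probe should be evaluated on held-out sentences and compared against a suitable baseline, since accuracy on the training set says little about what the encoder states actually encode.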
- Theoretical knowledge of NMT models
- Experience with training machine translation models, for example with a toolkit such as Sockeye