Neural Machine Translation by Jointly Learning to Align and Translate
[1409.0473] Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau (Jacobs University Bremen, Germany), KyungHyun Cho and Yoshua Bengio (Université de Montréal). Yoshua Bengio is a CIFAR Senior Fellow.
Abstract
Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder–decoders and encode a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
1 Introduction
Neural machine translation is a newly emerging approach to machine translation, recently proposed by Kalchbrenner and Blunsom (2013), Sutskever et al. (2014) and Cho et al. (2014b). Unlike the traditional phrase-based translation system (see, e.g., Koehn et al., 2003), which consists of many small sub-components that are tuned separately, neural machine translation attempts to build and train a single, large neural network that reads a sentence and outputs a correct translation.
Most of the proposed neural machine translation models belong to a family of encoder–decoders (Sutskever et al., 2014; Cho et al., 2014a), with an encoder and a decoder for each language, or involve a language-specific encoder applied to each sentence whose outputs are then compared (Hermann and Blunsom, 2014). An encoder neural network reads and encodes a source sentence into a fixed-length vector. A decoder then outputs a translation from the encoded vector. The whole encoder–decoder system, which consists of the encoder and the decoder for a language pair, is jointly trained to maximize the probability of a correct translation given a source sentence.
A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus. Cho et al. (2014b) showed that indeed the performance of a basic encoder–decoder deteriorates rapidly as the length of an input sentence increases.
In order to address this issue, we introduce an extension to the encoder–decoder model which learns to align and translate jointly. Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.
The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.
In this paper, we show that the proposed approach of jointly learning to align and translate achieves significantly improved translation performance over the basic encoder–decoder approach. The improvement is more apparent with longer sentences, but can be observed with sentences of any length. On the task of English-to-French translation, the proposed approach achieves, with a single model, a translation performance comparable, or close, to the conventional phrase-based system. Furthermore, qualitative analysis reveals that the proposed model finds a linguistically plausible (soft-)alignment between a source sentence and the corresponding target sentence.
2 Background: Neural Machine Translation
From a probabilistic perspective, translation is equivalent to finding a target sentence $\mathbf{y}$ that maximizes the conditional probability of $\mathbf{y}$ given a source sentence $\mathbf{x}$, i.e., $\arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$. In neural machine translation, we fit a parameterized model to maximize the conditional probability of sentence pairs using a parallel training corpus. Once the conditional distribution is learned by a translation model, given a source sentence a corresponding translation can be generated by searching for the sentence that maximizes the conditional probability.
Recently, a number of papers have proposed the use of neural networks to directly learn this conditional distribution (see, e.g., Kalchbrenner and Blunsom, 2013; Cho et al., 2014a; Sutskever et al., 2014; Cho et al., 2014b; Forcada and Ñeco, 1997). This neural machine translation approach typically consists of two components, the first of which encodes a source sentence $\mathbf{x}$ and the second decodes to a target sentence $\mathbf{y}$. For instance, two recurrent neural networks (RNN) were used by Cho et al. (2014a) and Sutskever et al. (2014) to encode a variable-length source sentence into a fixed-length vector and to decode the vector into a variable-length target sentence.
Despite being a quite new approach, neural machine translation has already shown promising results. Sutskever et al. (2014) reported that the neural machine translation based on RNNs with long short-term memory (LSTM) units achieves close to the state-of-the-art performance of the conventional phrase-based machine translation system on an English-to-French translation task. Adding neural components to existing translation systems, for instance, to score the phrase pairs in the phrase table (Cho et al., 2014a) or to re-rank candidate translations (Sutskever et al., 2014), has made it possible to surpass the previous state-of-the-art performance level.
Footnote 1: By state-of-the-art performance, we mean the performance of the conventional phrase-based system without using any neural network-based component.
2.1 RNN Encoder–Decoder
Here, we describe briefly the underlying framework, called RNN Encoder–Decoder, proposed by Cho et al. (2014a) and Sutskever et al. (2014), upon which we build a novel architecture that learns to align and translate simultaneously.
In the Encoder–Decoder framework, an encoder reads the input sentence, a sequence of vectors $\mathbf{x} = (x_1, \cdots, x_{T_x})$, into a vector $c$. The most common approach is to use an RNN such that
$$h_t = f(x_t, h_{t-1}) \tag{1}$$
and
$$c = q(\{h_1, \cdots, h_{T_x}\}),$$
where $h_t \in \mathbb{R}^n$ is a hidden state at time $t$, and $c$ is a vector generated from the sequence of the hidden states. $f$ and $q$ are some nonlinear functions. Sutskever et al. (2014) used an LSTM as $f$ and $q(\{h_1, \cdots, h_T\}) = h_T$, for instance.
Footnote 2: Although most of the previous works (see, e.g., Cho et al., 2014a; Sutskever et al., 2014; Kalchbrenner and Blunsom, 2013) encoded a variable-length input sentence into a fixed-length vector, this is not necessary, and it may even be beneficial to have a variable-length vector, as we will show later.
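As a minimal sketch of Eq. (1) with the choice $q(\{h_1, \cdots, h_T\}) = h_T$, the following plain-Python example runs a toy recurrence over inputs of two different lengths. The elementwise tanh cell, the scalar weights `w_x` and `w_h`, and the names `rnn_step` and `encode_fixed` are illustrative assumptions, not the units used in the paper. The point is only that the context $c$ has the same size however long the input is.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8):
    """One recurrence h_t = f(x_t, h_{t-1}); f is a toy elementwise tanh cell (assumption)."""
    return [math.tanh(w_x * x + w_h * h) for x, h in zip(x_t, h_prev)]

def encode_fixed(xs, n):
    """Run the recurrence over the whole input and return (all states, c = h_T)."""
    h = [0.0] * n
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h)
        states.append(h)
    return states, states[-1]

short_input = [[1.0, 0.0], [0.0, 1.0]]   # 2 input vectors
long_input = short_input * 10            # 20 input vectors
states_short, c_short = encode_fixed(short_input, n=2)
states_long, c_long = encode_fixed(long_input, n=2)

# The number of hidden states grows with the input, the context vector does not.
print(len(states_short), len(states_long), len(c_short), len(c_long))
```

Under these assumptions the twenty-step input and the two-step input are both summarized by a vector of the same dimension, which is the compression the paper identifies as a potential bottleneck.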
The decoder is often trained to predict the next word $y_{t'}$ given the context vector $c$ and all the previously predicted words $\{y_1, \cdots, y_{t'-1}\}$. In other words, the decoder defines a probability over the translation $\mathbf{y}$ by decomposing the joint probability into the ordered conditionals:
$$p(\mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \cdots, y_{t-1}\}, c), \tag{2}$$
where $\mathbf{y} = (y_1, \cdots, y_{T_y})$. With an RNN, each conditional probability is modeled as
$$p(y_t \mid \{y_1, \cdots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c), \tag{3}$$
where $g$ is a nonlinear, potentially multi-layered, function that outputs the probability of $y_t$, and $s_t$ is the hidden state of the RNN. It should be noted that other architectures such as a hybrid of an RNN and a de-convolutional neural network can be used (Kalchbrenner and Blunsom, 2013).
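The factorization in Eq. (2) is usually evaluated in log space, where the product of conditionals becomes a sum. The sketch below scores two candidate translations this way and picks the more probable one. The per-step probabilities and the candidate sentences are made up for illustration; in a real system they would come from the decoder output $g$.

```python
import math

def sequence_log_prob(conditionals):
    """log p(y) = sum over t of log p(y_t | y_1..y_{t-1}, c), the log of Eq. (2)."""
    return sum(math.log(p) for p in conditionals)

# Hypothetical per-step probabilities assigned by a decoder to two candidates.
candidates = {
    "le chat dort": [0.5, 0.4, 0.5],
    "la chat dort": [0.2, 0.4, 0.5],
}
scores = {y: sequence_log_prob(ps) for y, ps in candidates.items()}
best = max(scores, key=scores.get)
print(best, math.exp(scores[best]))
```

Summing logarithms avoids numerical underflow when the sentence is long and each conditional probability is small.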
3 Learning to Align and Translate
In this section, we propose a novel architecture for neural machine translation. The new architecture consists of a bidirectional RNN as an encoder (Sec. 3.2) and a decoder that emulates searching through a source sentence while decoding a translation (Sec. 3.1).
3.1 Decoder: General Description
Figure 1: The graphical illustration of the proposed model trying to generate the $t$-th target word $y_t$ given a source sentence $(x_1, x_2, \dots, x_T)$.
In a new model architecture, we define each conditional probability in Eq. (2) as:
$$p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i), \tag{4}$$
where $s_i$ is an RNN hidden state for time $i$, computed by
$$s_i = f(s_{i-1}, y_{i-1}, c_i).$$
It should be noted that unlike the existing encoder–decoder approach (see Eq. (2)), here the probability is conditioned on a distinct context vector $c_i$ for each target word $y_i$.
The context vector $c_i$ depends on a sequence of annotations $(h_1, \cdots, h_{T_x})$ to which an encoder maps the input sentence. Each annotation $h_i$ contains information about the whole input sequence with a strong focus on the parts surrounding the $i$-th word of the input sequence. We explain in detail how the annotations are computed in the next section.
The context vector $c_i$ is, then, computed as a weighted sum of these annotations $h_j$:
$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j. \tag{5}$$
The weight $\alpha_{ij}$ of each annotation $h_j$ is computed by
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \tag{6}$$
where
$$e_{ij} = a(s_{i-1}, h_j)$$
is an alignment model which scores how well the inputs around position $j$ and the output at position $i$ match. The score is based on the RNN hidden state $s_{i-1}$ (just before emitting $y_i$, Eq. (4)) and the $j$-th annotation $h_j$ of the input sentence.
We parametrize the alignment model $a$ as a feedforward neural network which is jointly trained with all the other components of the proposed system. Note that unlike in traditional machine translation, the alignment is not considered to be a latent variable. Instead, the alignment model directly computes a soft alignment, which allows the gradient of the cost function to be backpropagated through. This gradient can be used to train the alignment model as well as the whole translation model jointly.
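To illustrate what a feedforward alignment model can look like, here is a minimal sketch with a single tanh hidden layer, $e_{ij} = v^\top \tanh(W s_{i-1} + U h_j)$. This section only states that $a$ is a feedforward network, so the specific form, the matrices `W`, `U`, the vector `v`, and all numeric values below are assumptions chosen for illustration.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def alignment_score(s_prev, h_j, W, U, v):
    """Energy e_ij = v . tanh(W s_{i-1} + U h_j) for one decoder state and one annotation."""
    ws = matvec(W, s_prev)
    uh = matvec(U, h_j)
    hidden = [math.tanh(a + b) for a, b in zip(ws, uh)]
    return sum(vk * hk for vk, hk in zip(v, hidden))

# Toy parameters (hypothetical): decoder state of size 2, annotations of size 3.
W = [[0.1, -0.2], [0.3, 0.4]]
U = [[0.5, 0.0, -0.5], [0.2, 0.1, 0.0]]
v = [1.0, -1.0]
s_prev = [0.2, -0.1]
annotations = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]

energies = [alignment_score(s_prev, h, W, U, v) for h in annotations]
print(energies)
```

Every operation here is differentiable, which is what lets the gradient of the translation cost flow back into the alignment parameters.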
We can understand the approach of taking a weighted sum of all the annotations as computing an expected annotation, where the expectation is over possible alignments. Let $\alpha_{ij}$ be a probability that the target word $y_i$ is aligned to, or translated from, a source word $x_j$. Then, the $i$-th context vector $c_i$ is the expected annotation over all the annotations with probabilities $\alpha_{ij}$.
The probability $\alpha_{ij}$, or its associated energy $e_{ij}$, reflects the importance of the annotation $h_j$ with respect to the previous hidden state $s_{i-1}$ in deciding the next state $s_i$ and generating $y_i$. Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.
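Equations (5) and (6) can be sketched in a few lines of plain Python: a softmax over the energies gives the weights $\alpha_{ij}$, and the context vector is the resulting weighted sum of annotations. The toy annotations and energies are made up, and subtracting the maximum energy before exponentiating is a standard numerical-stability step that is not part of the paper's formula (it leaves the result unchanged).

```python
import math

def softmax(energies):
    """Eq. (6): alpha_ij = exp(e_ij) / sum_k exp(e_ik)."""
    m = max(energies)  # shift for numerical stability; the ratios are unaffected
    exps = [math.exp(e - m) for e in energies]
    total = sum(exps)
    return [x / total for x in exps]

def context_vector(energies, annotations):
    """Eq. (5): c_i = sum_j alpha_ij * h_j, returned together with the weights."""
    alphas = softmax(energies)
    dim = len(annotations[0])
    c = [sum(a * h[k] for a, h in zip(alphas, annotations)) for k in range(dim)]
    return alphas, c

annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # h_1, h_2, h_3 (toy values)
energies = [2.0, 0.0, 0.0]                          # e_i1, e_i2, e_i3 (toy values)
alphas, c_i = context_vector(energies, annotations)
print(alphas, c_i)
```

Because the weights are non-negative and sum to one, $c_i$ is a convex combination of the annotations, which matches the "expected annotation" reading given above.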
3.2 Encoder: Bidirectional RNN for Annotating Sequences
The usual RNN, described in Eq. (1), reads an input sequence $\mathbf{x}$ in order starting from the first symbol $x_1$ to the last one $x_{T_x}$. However, in the proposed scheme, we would like the annotation of each word to summarize not only the preceding words, but also the following words. Hence, we propose to use a bidirectional RNN (BiRNN, Schuster and Paliwal, 1997), which has been successfully used recently in speech recognition (see, e.g., Graves et al., 2013).
A BiRNN consists of forward and backward RNNs. The forward RNN $\overrightarrow{f}$ reads the input sequence as it is ordered (from $x_1$ to $x_{T_x}$) and calculates a sequence of forward hidden states $(\overrightarrow{h}_1, \cdots, \overrightarrow{h}_{T_x})$. The backward RNN $\overleftarrow{f}$ reads the sequence in the reverse order (from $x_{T_x}$ to $x_1$), resulting in a sequence of backward hidden states $(\overleftarrow{h}_1, \cdots, \overleftarrow{h}_{T_x})$.
We obtain an annotation for each word $x_j$ by concatenating the forward hidden state $\overrightarrow{h}_j$ and the backward one $\overleftarrow{h}_j$, i.e., $h_j = \left[\overrightarrow{h}_j^\top ; \overleftarrow{h}_j^\top\right]^\top$. In this way, the annotation $h_j$ contains the summaries of both the preceding words and the following words. Due to the tendency of RNNs to better represent recent inputs, the annotation $h_j$ will be focused on the words around $x_j$. This sequence of annotations is used by the decoder and the alignment model later to compute the context vector (Eqs. (5)–(6)).
See Fig. 1 for the graphical illustration of the proposed model.
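The construction of the annotations can be sketched with two toy scalar RNNs, one reading left to right and one reading right to left, whose states are concatenated per source position. The tanh cell, the weights, and the names `run_rnn` and `birnn_annotations` are assumptions for illustration only; the paper uses gated hidden units.

```python
import math

def run_rnn(xs, w_x, w_h):
    """Toy scalar RNN: returns the hidden state after each input, in reading order."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def birnn_annotations(xs):
    """h_j = [forward_j ; backward_j] for every source position j."""
    forward = run_rnn(xs, w_x=0.7, w_h=0.5)
    # Read the sequence reversed, then flip the states back so index j refers to x_j.
    backward = run_rnn(xs[::-1], w_x=-0.6, w_h=0.5)[::-1]
    return [[f, b] for f, b in zip(forward, backward)]

xs = [1.0, -1.0, 0.5, 2.0]
H = birnn_annotations(xs)

# Changing only the last input leaves the forward half of h_1 untouched
# but changes its backward half, which summarizes the following words.
H_changed = birnn_annotations([1.0, -1.0, 0.5, -2.0])
print(H[0], H_changed[0])
```

The flip after the backward pass is the step that is easy to get wrong: without it, the backward state stored at index $j$ would describe position $T_x - j + 1$ instead of position $j$.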
4 Experiment Settings
We evaluate the proposed approach on the task of English-to-French translation. We use the bilingual, parallel corpora provided by ACL WMT '14. As a comparison, we also report the performance of an RNN Encoder–Decoder which was proposed recently by Cho et al. (2014a). We use the same training procedures and the same dataset for both models.
Footnote 3: http://www.statmt.org/wmt14/translation-task.html
Footnote 4: Implementations are available at https://github.com/lisa-groundhog/GroundHog.
Figure 2: The BLEU scores of the generated translations on the test set with respect to the lengths of the sentences. The results are on the full test set which includes sentences having unknown words to the models.
4.1 Dataset
WMT '14 contains the following English-French parallel corpora: Europarl (61M words), news commentary (5.5M), UN (421M) and two crawled corpora of 90M and 272.5M words respectively, totaling 850M words. Following the procedure described in Cho et al. (2014a), we reduce the size of the combined corpus to have 348M words using the data selection method by Axelrod et al. (2011). We do not use any monolingual data other than the mentioned parallel corpora, although it may be possible to use a much larger monolingual corpus to pretrain an encoder. We concatenate news-test-2012 and news-test-2013 to make a development (validation) set, and evaluate the models on the test set (news-test-2014) from WMT '14, which consists of 3003 sentences not present in the training data.
Footnote 5: Available online at http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/.
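As a quick arithmetic check of the corpus sizes quoted above, the snippet below sums the five parallel corpora and computes the share of the combined corpus that remains after data selection. The dictionary keys for the two crawled corpora are placeholder names, since the text does not name them.

```python
# Corpus sizes in millions of words, as quoted in the text.
corpora_m_words = {
    "Europarl": 61.0,
    "news commentary": 5.5,
    "UN": 421.0,
    "crawled corpus 1": 90.0,    # placeholder name
    "crawled corpus 2": 272.5,   # placeholder name
}
total_m = sum(corpora_m_words.values())
selected_m = 348.0
kept_fraction = selected_m / total_m
print(f"total: {total_m:.0f}M words, kept after selection: {kept_fraction:.1%}")
```

The sum reproduces the stated 850M words, and the 348M-word selection corresponds to roughly 41% of the combined corpus.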
After a usual tokenization, we use a shortlist of 30,000 most frequent words in each language to train our models. Any word not included in the shortlist is mapped to a special token ([UNK]). We do not apply any other special preprocessing, such as lowercasing or stemming, to the data.
Footnote 6: We used the tokenization script from the open-source machine translation package, Moses.
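The shortlist step can be sketched as follows: count token frequencies over an already tokenized corpus, keep the most frequent `size` words, and map everything else to `[UNK]`. The helper names `build_shortlist` and `apply_shortlist` and the three-sentence corpus are hypothetical; the paper uses a shortlist of 30,000 words per language, while the toy example uses 3. In line with the text, no lowercasing or stemming is applied.

```python
from collections import Counter

UNK = "[UNK]"

def build_shortlist(tokenized_sentences, size):
    """Return the set of the `size` most frequent tokens in the corpus."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(size)}

def apply_shortlist(sentence, shortlist):
    """Replace every token outside the shortlist with the special UNK token."""
    return [tok if tok in shortlist else UNK for tok in sentence]

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
shortlist = build_shortlist(corpus, size=3)
mapped = apply_shortlist(["the", "dog", "sat", "quietly"], shortlist)
print(mapped)
```

In this toy corpus "dog" is in the training data but falls outside the shortlist, so it is mapped to `[UNK]` just like the unseen word "quietly".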
Figure 3 (panels a–d): Four sample alignments found by RNNsearch-50. The x-axis and y-axis of each plot correspond to the words in the source sentence (English) and the generated translation (French), respectively. Each pixel shows the weight $\alpha_{ij}$ of the annotation of the $j$-th source word for the $i$-th target word (see Eq. (6)), in grayscale (0: black, 1: white). (a) an arbitrary sentence. (b–d) three randomly selected samples among the sentences without any unknown words and of length between 10 and 20 words from the test set.
4.2 Models
We train two types of models. The first one is an RNN Encoder–Decoder (RNNencdec, Cho et al., 2014a), and the other is the proposed model, to which we refer as RNNsearch. We train each model twice: first with the sentences of length up to 30 words (RNNencdec-30, RNNsearch-30) and then with the sentences of length up to 50 words (RNNencdec-50, RNNsearch-50).
The encoder and decoder of the RNNencdec have 1000 hidden units each. The encoder of the RNNsearch consists of forward and backward recurrent neural networks (RNN) each having 1000 hidden units. Its decoder has 1000 hidden units. In both cases, we use a multilayer network with a single maxout (Goodfellow et al., 2013) hidden layer to compute the conditional probability of each target word (Pascanu et al., 2014).
Footnote 7: In this paper, by a "hidden unit", we always mean the gated hidden unit (see Appendix A.1.1).
We use a minibatch stochastic gradient descent (SGD) algorithm together with Adadelta (Zeiler, 2012) to train each model. Each SGD update direction is computed using a minibatch of 80 sentences. We trained each model for approximately 5 days.
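For readers unfamiliar with Adadelta, the sketch below applies its update rule (Zeiler, 2012) to a single scalar parameter on a toy quadratic loss. This section does not give the hyperparameters, so the decay `rho=0.95` and `eps=1e-6` are assumed values, and the loss $(w-3)^2$ stands in for the translation cost purely for illustration.

```python
import math

def adadelta_step(param, grad, state, rho=0.95, eps=1e-6):
    """One Adadelta update for a scalar parameter; rho and eps are assumed defaults."""
    # Running average of squared gradients.
    state["g2"] = rho * state["g2"] + (1 - rho) * grad * grad
    # Step scaled by the ratio of RMS of past updates to RMS of gradients.
    delta = -math.sqrt(state["d2"] + eps) / math.sqrt(state["g2"] + eps) * grad
    # Running average of squared updates.
    state["d2"] = rho * state["d2"] + (1 - rho) * delta * delta
    return param + delta

# Toy problem: minimize (w - 3)^2 starting from w = 0.
w, state = 0.0, {"g2": 0.0, "d2": 0.0}
start_loss = (w - 3.0) ** 2
for _ in range(100):
    grad = 2.0 * (w - 3.0)
    w = adadelta_step(w, grad, state)
end_loss = (w - 3.0) ** 2
print(start_loss, end_loss)
```

Note that no global learning rate appears: the step size is derived from the running statistics, which is the property that makes the method convenient for long training runs.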
Once a model is trained, we use a beam search to find a translation that approximately maximizes the conditional probability (see, e.g., Graves, 2012; Boulanger-Lewandowski et al., 2013). Sutskever et al. (2014) used this approach to generate translations from their neural machine translation model.
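A generic beam search can be sketched as below. It keeps the `beam_size` best partial translations by total log-probability and extends each by one token per step. The `toy_model` table, the `EOS` marker, and the function names are hypothetical stand-ins for a trained decoder, and this is a simplified sketch rather than the procedure used in the experiments (for example, it applies no length normalization).

```python
import math

EOS = "EOS"  # hypothetical end-of-sentence marker

def toy_model(prefix):
    """Made-up next-token log-probabilities given the tokens generated so far."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "x": 0.1},
    }
    probs = table.get(tuple(prefix), {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(next_log_probs, beam_size, max_len):
    """Return the (tokens, log-probability) pair with the highest score found."""
    beams = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == EOS:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])

greedy_tokens, greedy_score = beam_search(toy_model, beam_size=1, max_len=5)
beam_tokens, beam_score = beam_search(toy_model, beam_size=2, max_len=5)
print(greedy_tokens, beam_tokens)
```

In this toy distribution the greedy choice (beam size 1) commits to "a" and ends with probability 0.30, while a beam of size 2 keeps "b" alive and finds the sequence with probability 0.36. The search is still approximate: a wider beam only reduces, and does not remove, the chance of missing the best sentence.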
For more details on the architectures of the models and training procedure used in the experiments, see Appendices A and B.
5 Results
5.1 Quantitative Results
| Model | All | No UNK∘ |
| RNNencdec-30 | 13.93 | 24.19 |
| RNNsearch-30 | 21.50 | 31.44 |
| RNNencdec-50 | 17.82 | 26.71 |
| RNNsearch-50 | 26.75 | 34.16 |
| RNNsearch-50⋆ | 28.45 | 36.15 |
| Moses | 33.30 | 35.63 |
Table 1: BLEU scores of the trained models computed on the test set. The second and third columns show, respectively, the scores on all the sentences and on the sentences without any unknown word in themselves and in the reference translations. Note that RNNsearch-50⋆ was trained much longer until the performance on the development set stopped improving. (∘) We did not allow the models to generate [UNK] tokens when only the sentences having no unknown words were evaluated (last column).
In Table 1, we list the translation performances measured in BLEU score. It is clear from the table that in all the cases, the proposed RNNsearch outperforms the conventional RNNencdec. More importantly, the performance of the RNNsearch is as high as that of the conventional phrase-based translation system (Moses), when only the sentences consisting of known words are considered. This is a significant achievement, considering that Moses uses a separate monolingual corpus (418M words) in addition to the parallel corpora we used to train the RNNsearch and RNNencdec.
One of the motivations behind the proposed approach was the use of a fixed-length context vector in the basic encoder–decoder approach. We conjectured that this limitation may cause the basic encoder–decoder approach to underperform with long sentences. In Fig. 2, we see that the performance of RNNencdec dramatically drops as the length of the sentences increases. On the other hand, both RNNsearch-30 and RNNsearch-50 are more robust to the length of the sentences. RNNsearch-50, especially, shows no performance deterioration even with sentences of length 50 or more. This superiority of the proposed model over the basic encoder–decoder is further confirmed by the fact that the RNNsearch-30 even outperforms RNNencdec-50 (see Table 1).
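The comparisons drawn from Table 1 can be made explicit with a few lines of arithmetic. The scores below are copied from the table; the helper `gain` is a hypothetical name introduced here.

```python
# BLEU scores from Table 1: (all sentences, sentences without unknown words).
bleu = {
    "RNNencdec-30": (13.93, 24.19),
    "RNNsearch-30": (21.50, 31.44),
    "RNNencdec-50": (17.82, 26.71),
    "RNNsearch-50": (26.75, 34.16),
}

def gain(max_len):
    """BLEU improvement of RNNsearch over RNNencdec for one training length."""
    search = bleu[f"RNNsearch-{max_len}"]
    encdec = bleu[f"RNNencdec-{max_len}"]
    return tuple(round(s - e, 2) for s, e in zip(search, encdec))

gains = {n: gain(n) for n in (30, 50)}
search30_beats_encdec50 = bleu["RNNsearch-30"][0] > bleu["RNNencdec-50"][0]
print(gains, search30_beats_encdec50)
```

On all sentences the gain is 7.57 BLEU for the 30-word models and 8.93 BLEU for the 50-word models, and RNNsearch-30 (21.50) exceeds RNNencdec-50 (17.82) even though it was trained on shorter sentences.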
5.2 Qualitative Analysis
5.2.1 Alignment
The proposed approach provides an intuitive way to inspect the (soft-)alignment between the words in a generated translation and those in a source sentence. This is done by visualizing the annotation weights $\alpha_{ij}$ from Eq. (6), as in Fig. 3. Each row of a matrix in each plot indicates the weights associated with the annotations. From this we see which positions in the source sentence were considered more important when generating the target word.
We can see from the alignments in Fig. 3 that the alignment of words between English and French is largely monotonic. We see strong weights along the diagonal of each matrix. However, we also observe a number of non-trivial, non-monotonic alignments. Adjectives and nouns are typically ordered differently between French and English, and we see an example in Fig. 3 (a). From this figure, we see that the model correctly translates a phrase [European Economic Area] into [zone économique européenne]. The RNNsearch was able to correctly align [zone] with [Area], jumping over the two words ([European] and [Economic]), and then looked one word back at a time to complete the whole phrase [zone économique européenne].
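Reading such a plot amounts to looking along each row of the weight matrix for the largest entries. The sketch below does this for the [European Economic Area] phrase. The weights are invented for illustration and are not read from Fig. 3; only the word order of the two phrases comes from the text.

```python
source = ["European", "Economic", "Area"]
target = ["zone", "économique", "européenne"]

# Made-up soft-alignment weights; row i belongs to target word i and sums to 1.
alpha = [
    [0.05, 0.10, 0.85],
    [0.10, 0.80, 0.10],
    [0.85, 0.10, 0.05],
]

def strongest_alignment(alpha, source, target):
    """For each target word, return the source word with the largest weight."""
    pairs = []
    for i, row in enumerate(alpha):
        j = max(range(len(row)), key=row.__getitem__)
        pairs.append((target[i], source[j]))
    return pairs

pairs = strongest_alignment(alpha, source, target)
print(pairs)
```

With these weights the largest entries lie on the anti-diagonal, which is the non-monotonic pattern described above: the first target word attends to the last source word, and the model then steps back one word at a time. Taking only the maximum discards the soft part of the alignment, so this is a reading aid rather than something the model itself does.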
The strength of the soft-alignment, opposed to a hard-alignment, is evident, for instance, from Fig. 3 (d). Consider the source phrase [the man] which was translated into [l' homme]. Any hard alignment will map [the] to [l'] and [man] to [homme]. This is not helpful for translation, as one must consider the word following [the] to determine whether it should be translated into [le], [la], [les] or [l']. Our soft-alignment solves this issue naturally by letting the model look at both [the] and [man], and in this example, we see that the model was able to correctly translate [the] into [l']. We observe similar behaviors in all the presented cases in Fig. 3. An additional benefit of the soft alignment is that it naturally deals with source and target phrases of different lengths, without requiring a counter-intuitive way of mapping some words to or from nowhere ([NULL]) (see, e.g., Chapters 4 and 5 of Koehn, 2010).
5.2.2 Long Sentences
As is clearly visible from Fig. 2, the proposed model (RNNsearch) is much better than the conventional model (RNNencdec) at translating long sentences. This is likely due to the fact that the RNNsearch does not require encoding a long sentence into a fixed-length vector perfectly, but only accurately encoding the parts of the input sentence that surround a particular word.
As an example, consider this source sentence from the test set:
An admitting privilege is the right of a doctor to admit a patient to a hospital or a medical centre to carry out a diagnosis or a procedure, based on his status as a health care worker at a hospital.
The RNNencdec-50 translated this sentence into:
Un privilège d'admission est le droit d'un médecin de reconnaître un patient à l'hôpital ou un centre médical d'un diagnostic ou de prendre un diagnostic en fonction de son état de santé.
The RNNencdec-50 correctly translated the source sentence until [a medical center]. However, from there on (underlined), it deviated from the original meaning of the source sentence. For instance, it replaced [based on his status as a health care worker at a hospital] in the source sentence with [en fonction de son état de santé] ("based on his state of health").
On the other hand, the RNNsearch-50 generated the following correct translation, preserving the whole meaning of the input sentence without omitting any details:
Un privilège d'admission est le droit d'un médecin d'admettre un patient à un hôpital ou un centre médical pour effectuer un diagnostic ou une procédure, selon son statut de travailleur des soins de santé à l'hôpital.
Let us consider another sentence from the test set:
This kind of experience is part of Disney's efforts to "extend the lifetime of its series and build new relationships with audiences via digital platforms that are becoming ever more important," he added.
The translation by the RNNencdec-50 is:
Ce type d'expérience fait partie des initiatives du Disney pour "prolonger la durée de vie de ses nouvelles et de développer des liens avec les lecteurs numériques qui deviennent plus complexes.
As with the previous example, the RNNencdec began deviating from the actual meaning of the source sentence after generating approximately 30 words (see the underlined phrase). After that point, the quality of the translation deteriorates, with basic mistakes such as the lack of a closing quotation mark.
Again, the RNNsearch-50 was able to translate this long sentence correctly:
Ce genre d'expérience fait partie des efforts de Disney pour "prolonger la durée de vie de ses séries et créer de nouvelles relations avec des publics via des plateformes numériques de plus en plus importantes", a-t-il ajouté.
In conjunction with the quantitative results presented already, these qualitative observations confirm our hypotheses that the RNNsearch architecture enables far more reliable translation of long sentences than the standard RNNencdec model.
In Appendix C, we provide a few more sample translations of long source sentences generated by the RNNencdec-50, RNNsearch-50 and Google Translate along with the reference translations.
6 Related Work
6.1 Learning to Align
A similar approach of aligning an output symbol with an input symbol was proposed recently by Graves (2013) in the context of handwriting synthesis. Handwriting synthesis is a task where the model is asked to generate handwriting of a given sequence of characters. In his work, he used a mixture of Gaussian kernels to compute the weights of the annotations, where the location, width and mixture coefficient of each kernel were predicted from an alignment model. More specifically, his alignment was restricted to predict the location such that the location increases monotonically.
The main difference from our approach is that, in Graves (2013), the modes of the weights of the annotations only move in one direction. In the context of machine translation, this is a severe limitation, as (long-distance) reordering is often needed to generate a grammatically correct translation (for instance, English-to-German).
Our approach, on the other hand, requires computing the annotation weight of every word in the source sentence for each word in the translation. This drawback is not severe with the task of translation in which most input and output sentences are only 15–40 words. However, this may limit the applicability of the proposed scheme to other tasks.
6.2 Neural Networks for Machine Translation
Since Bengio et al. (2003) introduced a neural probabilistic language model which uses a neural network to model the conditional probability of a word given a fixed number of the preceding words, neural networks have widely been used in machine translation. However, the role of neural networks has been largely limited to simply providing a single feature to an existing statistical machine translation system or to re-rank a list of candidate translations provided by an existing system.
For instance, Schwenk (2012) proposed using a feedforward neural network to compute the score of a pair of source and target phrases and to use the score as an additional feature in the phrase-based statistical machine translation system. More recently, Kalchbrenner and Blunsom (2013) and Devlin et al. (2014) reported the successful use of the neural networks as a sub-component of the existing translation system. Traditionally, a neural network trained as a target-side language model has been used to rescore or rerank a list of candidate translations (see, e.g., Schwenk et al., 2006).
Although the above approaches were shown to improve the translation performance over the state-of-the-art machine translation systems, we are more interested in a more ambitious objective of designing a completely new translation system based on neural networks. The neural machine translation approach we consider in this paper is therefore a radical departure from these earlier works. Rather than using a neural network as a part of the existing system, our model works on its own and generates a translation from a source sentence directly.
7
Conclusion
The conventional approach to neural machine translation, called an encoder–decoder approach, encodes a whole input sentence into a fixed-length vector from which a translation will be decoded. We conjectured that the use of a fixed-length context vector is problematic for translating long sentences, based on a recent empirical study reported by Cho et al. (2014b) and Pouget-Abadie et al. (2014).
In this paper, we proposed a novel architecture that addresses this issue. We extended the basic encoder–decoder by letting a model (soft-)search for a set of input words, or their annotations computed by an encoder, when generating each target word. This frees the model from having to encode a whole source sentence into a fixed-length vector, and also lets the model focus only on information relevant to the generation of the next target word. This has a major positive impact on the ability of the neural machine translation system to yield good results on longer sentences. Unlike with the traditional machine translation systems, all of the pieces of the translation system, including the alignment mechanism, are jointly trained towards a better log-probability of producing correct translations.
We tested the proposed model, called RNNsearch, on the task of English-to-French translation. The experiment revealed that the proposed RNNsearch outperforms the conventional encoder–decoder model (RNNencdec) significantly, regardless of the sentence length, and that it is much more robust to the length of a source sentence. From the qualitative analysis where we investigated the (soft-)alignment generated by the RNNsearch, we were able to conclude that the model can correctly align each target word with the relevant words, or their annotations, in the source sentence as it generated a correct translation.
Perhaps more importantly, the proposed approach achieved a translation performance comparable to the existing phrase-based statistical machine translation. It is a striking result, considering that the proposed architecture, or the whole family of neural machine translation, has only been proposed as recently as this year. We believe the architecture proposed here is a promising step toward better machine translation and a better understanding of natural languages in general.
One of the challenges left for the future is to better handle unknown, or rare, words. This will be required for the model to be more widely used and to match the performance of current state-of-the-art machine translation systems in all contexts.
Acknowledgments
The authors would like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012). We acknowledge the support of the following agencies for research funding and computing support: NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR. Bahdanau thanks the support from Planet Intelligent Systems GmbH. We also thank Felix Hill, Bart van Merriënboer, Jean Pouget-Abadie, Coline Devin and Tae-Ho Kim.
References
Axelrod et al. (2011)
Axelrod, A., He, X., and Gao, J. (2011).
Domain adaptation via pseudo in-domain data selection.
In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.
Bastien et al. (2012)
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012).
Theano: new features and speed improvements.
Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
Bengio et al. (1994)
Bengio, Y., Simard, P., and Frasconi, P. (1994).
Learning long-term dependencies with gradient descent is difficult.
IEEE Transactions on Neural Networks, 5(2), 157–166.
Bengio et al. (2003)
Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003).
A neural probabilistic language model.
J. Mach. Learn. Res., 3, 1137–1155.
Bergstra et al. (2010)
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010).
Theano: a CPU and GPU math expression compiler.
In Proceedings of the Python for Scientific Computing Conference (SciPy).
Oral Presentation.
Boulanger-Lewandowski et al. (2013)
Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2013).
Audio chord recognition with recurrent neural networks.
In ISMIR.
Cho et al. (2014a)
Cho, K., van Merriënboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a).
Learning phrase representations using RNN encoder-decoder for statistical machine translation.
In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014).
to appear.
Cho et al. (2014b)
Cho, K., van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014b).
On the properties of neural machine translation: Encoder–Decoder approaches.
In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
to appear.
Devlin et al. (2014)
Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014).
Fast and robust neural network joint models for statistical machine translation.
In Association for Computational Linguistics.
Forcada and Ñeco (1997)
Forcada, M. L. and Ñeco, R. P. (1997).
Recursive hetero-associative memories for translation.
In J. Mira, R. Moreno-Díaz, and J. Cabestany, editors, Biological and Artificial Computation: From Neuroscience to Technology, volume 1240 of Lecture Notes in Computer Science, pages 453–462. Springer Berlin Heidelberg.
Goodfellow et al. (2013)
Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013).
Maxout networks.
In Proceedings of The 30th International Conference on Machine Learning, pages 1319–1327.
Graves (2012)
Graves, A. (2012).
Sequence transduction with recurrent neural networks.
In Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
Graves (2013)
Graves, A. (2013).
Generating sequences with recurrent neural networks.
arXiv:1308.0850 [cs.NE].
Graves et al. (2013)
Graves, A., Jaitly, N., and Mohamed, A.-R. (2013).
Hybrid speech recognition with deep bidirectional LSTM.
In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 273–278.
Hermann and Blunsom (2014)
Hermann, K. and Blunsom, P. (2014).
Multilingual distributed representations without word alignment.
In Proceedings of the Second International Conference on Learning Representations (ICLR 2014).
Hochreiter (1991)
Hochreiter, S. (1991).
Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München.
Hochreiter and Schmidhuber (1997)
Hochreiter, S. and Schmidhuber, J. (1997).
Long short-term memory.
Neural Computation, 9(8), 1735–1780.
Kalchbrenner and Blunsom (2013)
Kalchbrenner, N. and Blunsom, P. (2013).
Recurrent continuous translation models.
In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709. Association for Computational Linguistics.
Koehn (2010)
Koehn, P. (2010).
Statistical Machine Translation.
Cambridge University Press, New York, NY, USA.
Koehn et al. (2003)
Koehn, P., Och, F. J., and Marcu, D. (2003).
Statistical phrase-based translation.
In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, pages 48–54, Stroudsburg, PA, USA. Association for Computational Linguistics.
Pascanu et al. (2013a)
Pascanu, R., Mikolov, T., and Bengio, Y. (2013a).
On the difficulty of training recurrent neural networks.
In ICML’2013.
Pascanu et al. (2013b)
Pascanu, R., Mikolov, T., and Bengio, Y. (2013b).
On the difficulty of training recurrent neural networks.
In Proceedings of the 30th International Conference on Machine Learning (ICML 2013).
Pascanu et al. (2014)
Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2014).
How to construct deep recurrent neural networks.
In Proceedings of the Second International Conference on Learning Representations (ICLR 2014).
Pouget-Abadie et al. (2014)
Pouget-Abadie, J., Bahdanau, D., van Merriënboer, B., Cho, K., and Bengio, Y. (2014).
Overcoming the curse of sentence length for neural machine translation using automatic segmentation.
In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
to appear.
Schuster and Paliwal (1997)
Schuster, M. and Paliwal, K. K. (1997).
Bidirectional recurrent neural networks.
Signal Processing, IEEE Transactions on, 45(11), 2673–2681.
Schwenk (2012)
Schwenk, H. (2012).
Continuous space translation models for phrase-based statistical machine translation.
In M. Kay and C. Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080. Indian Institute of Technology Bombay.
Schwenk et al. (2006)
Schwenk, H., Déchelotte, D., and Gauvain, J.-L. (2006).
Continuous space language models for statistical machine translation.
In Proceedings of the COLING/ACL on Main conference poster sessions, pages 723–730. Association for Computational Linguistics.
Sutskever et al. (2014)
Sutskever, I., Vinyals, O., and Le, Q. (2014).
Sequence to sequence learning with neural networks.
In Advances in Neural Information Processing Systems (NIPS 2014).
Zeiler (2012)
Zeiler, M. D. (2012).
ADADELTA: An adaptive learning rate method.
arXiv:1212.5701 [cs.LG].
Appendix A
Model Architecture
A.1
Architectural Choices
The proposed scheme in Section 3 is a general framework where one can freely define, for instance, the activation functions $f$ of recurrent neural networks (RNN) and the alignment model $a$. Here, we describe the choices we made for the experiments in this paper.
A.1.1
Recurrent Neural Network
For the activation function $f$ of an RNN, we use the gated hidden unit recently proposed by Cho et al. (2014a). The gated hidden unit is an alternative to the conventional simple units such as an element-wise $\tanh$. This gated unit is similar to a long short-term memory (LSTM) unit proposed earlier by Hochreiter and Schmidhuber (1997), sharing with it the ability to better model and learn long-term dependencies. This is made possible by having computation paths in the unfolded RNN for which the product of derivatives is close to 1. These paths allow gradients to flow backward easily without suffering too much from the vanishing effect (Hochreiter, 1991; Bengio et al., 1994; Pascanu et al., 2013a). It is therefore possible to use LSTM units instead of the gated hidden unit described here, as was done in a similar context by Sutskever et al. (2014).
The new state $s_i$ of the RNN employing $n$ gated hidden units (see footnote 8) is computed by

$$s_i = f(s_{i-1}, y_{i-1}, c_i) = (1 - z_i) \circ s_{i-1} + z_i \circ \tilde{s}_i,$$

where $\circ$ is an element-wise multiplication, and $z_i$ is the output of the update gates (see below). The proposed updated state $\tilde{s}_i$ is computed by

$$\tilde{s}_i = \tanh\left(W e(y_{i-1}) + U\left[r_i \circ s_{i-1}\right] + C c_i\right),$$

where $e(y_{i-1}) \in \mathbb{R}^{m}$ is an $m$-dimensional embedding of a word $y_{i-1}$, and $r_i$ is the output of the reset gates (see below). When $y_i$ is represented as a 1-of-$K$ vector, $e(y_i)$ is simply a column of an embedding matrix $E \in \mathbb{R}^{m \times K}$. Whenever possible, we omit bias terms to make the equations less cluttered.
The update gates $z_i$ allow each hidden unit to maintain its previous activation, and the reset gates $r_i$ control how much and what information from the previous state should be reset. We compute them by

$$\begin{aligned}
z_i &= \sigma\left(W_z e(y_{i-1}) + U_z s_{i-1} + C_z c_i\right), \\
r_i &= \sigma\left(W_r e(y_{i-1}) + U_r s_{i-1} + C_r c_i\right),
\end{aligned}$$

where $\sigma(\cdot)$ is a logistic sigmoid function.
Footnote 8: Here, we show the formula of the decoder. The same formula can be used in the encoder by simply ignoring the context vector $c_i$ and the related terms.
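The gated update can be illustrated with a short NumPy sketch. It is not the implementation used in the experiments; the names `gated_step` and `init_params`, the dictionary of weights and the tiny dimensions are assumptions made for illustration, and bias terms are omitted as in the equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_step(s_prev, e_prev, c, p):
    """One decoder step of the gated hidden unit (biases omitted).

    s_prev: previous state s_{i-1}, shape (n,)
    e_prev: embedding e(y_{i-1}), shape (m,)
    c:      context vector c_i, shape (2n,)
    p:      hypothetical dict of weights: W, Wz, Wr (n, m); U, Uz, Ur (n, n); C, Cz, Cr (n, 2n)
    """
    z = sigmoid(p["Wz"] @ e_prev + p["Uz"] @ s_prev + p["Cz"] @ c)   # update gate z_i
    r = sigmoid(p["Wr"] @ e_prev + p["Ur"] @ s_prev + p["Cr"] @ c)   # reset gate r_i
    s_tilde = np.tanh(p["W"] @ e_prev + p["U"] @ (r * s_prev) + p["C"] @ c)
    return (1.0 - z) * s_prev + z * s_tilde                          # new state s_i

def init_params(n, m, rng):
    # Illustrative small Gaussian weights, not the initialization of the experiments.
    shapes = {"W": (n, m), "Wz": (n, m), "Wr": (n, m),
              "U": (n, n), "Uz": (n, n), "Ur": (n, n),
              "C": (n, 2 * n), "Cz": (n, 2 * n), "Cr": (n, 2 * n)}
    return {k: 0.01 * rng.standard_normal(s) for k, s in shapes.items()}

rng = np.random.default_rng(0)
n, m = 4, 3
params = init_params(n, m, rng)
s_prev = rng.standard_normal(n)
s_next = gated_step(s_prev, rng.standard_normal(m), rng.standard_normal(2 * n), params)
```

Because $z_i$ lies in $(0, 1)$, each component of the new state is a convex combination of the old state and the candidate, which is what lets a unit keep its previous activation when its update gate is close to zero.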
At each step of the decoder, we compute the output probability (Eq. (4)) as a multi-layered function (Pascanu et al., 2014). We use a single hidden layer of maxout units (Goodfellow et al., 2013) and normalize the output probabilities (one for each word) with a softmax function (see Eq. (6)).
A.1.2
Alignment Model
The alignment model should be designed considering that the model needs to be evaluated $T_x \times T_y$ times for each sentence pair of lengths $T_x$ and $T_y$. In order to reduce computation, we use a single-layer multilayer perceptron such that

$$a(s_{i-1}, h_j) = v_a^{\top} \tanh\left(W_a s_{i-1} + U_a h_j\right),$$

where $W_a \in \mathbb{R}^{n \times n}$, $U_a \in \mathbb{R}^{n \times 2n}$ and $v_a \in \mathbb{R}^{n}$ are the weight matrices. Since $U_a h_j$ does not depend on $i$, we can pre-compute it in advance to minimize the computational cost.
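The saving from pre-computing $U_a h_j$ can be checked with a small NumPy sketch. The function name `alignment_scores`, the random stand-in decoder states `S` and the toy dimensions are assumptions for illustration; in the real model each state depends on the previous context vector, which is ignored here.

```python
import numpy as np

def alignment_scores(s_prev, Uh, W_a, v_a):
    """Scores a(s_{i-1}, h_j) for all j, given precomputed rows Uh[j] = U_a h_j."""
    return np.tanh(W_a @ s_prev + Uh) @ v_a      # shape (T_x,)

rng = np.random.default_rng(0)
n, T_x, T_y = 4, 5, 3
W_a = rng.standard_normal((n, n))
U_a = rng.standard_normal((n, 2 * n))
v_a = rng.standard_normal(n)
H = rng.standard_normal((T_x, 2 * n))   # annotations h_1..h_{T_x}, one per row
S = rng.standard_normal((T_y, n))       # stand-in decoder states s_0..s_{T_y - 1}

Uh = H @ U_a.T                          # computed once per sentence, shape (T_x, n)
E = np.stack([alignment_scores(s, Uh, W_a, v_a) for s in S])   # shape (T_y, T_x)

# Reference: naive evaluation that recomputes U_a h_j for every (i, j) pair.
E_naive = np.array([[v_a @ np.tanh(W_a @ s + U_a @ h) for h in H] for s in S])
```

Both versions produce the same $T_y \times T_x$ score matrix, but the first multiplies by $U_a$ only $T_x$ times instead of $T_x \times T_y$ times.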
A.2
Detailed Description of the Model
A.2.1
Encoder
In this section, we describe in detail the architecture of the proposed model (RNNsearch) used in the experiments (see Sec. 4–5). From here on, we omit all bias terms in order to increase readability.
The model takes a source sentence of 1-of-K coded word vectors as input

$$\mathbf{x} = (x_1, \ldots, x_{T_x}), \quad x_i \in \mathbb{R}^{K_x}$$

and outputs a translated sentence of 1-of-K coded word vectors

$$\mathbf{y} = (y_1, \ldots, y_{T_y}), \quad y_i \in \mathbb{R}^{K_y},$$

where $K_x$ and $K_y$ are the vocabulary sizes of source and target languages, respectively. $T_x$ and $T_y$ respectively denote the lengths of source and target sentences.
First, the forward states of the bidirectional recurrent neural network (BiRNN) are computed:

$$\overrightarrow{h}_i = \begin{cases}
(1 - \overrightarrow{z}_i) \circ \overrightarrow{h}_{i-1} + \overrightarrow{z}_i \circ \overrightarrow{\underline{h}}_i & \text{, if } i > 0 \\
0 & \text{, if } i = 0
\end{cases}$$

where

$$\begin{aligned}
\overrightarrow{\underline{h}}_i &= \tanh\left(\overrightarrow{W}\,\overline{E} x_i + \overrightarrow{U}\left[\overrightarrow{r}_i \circ \overrightarrow{h}_{i-1}\right]\right) \\
\overrightarrow{z}_i &= \sigma\left(\overrightarrow{W}_z \overline{E} x_i + \overrightarrow{U}_z \overrightarrow{h}_{i-1}\right) \\
\overrightarrow{r}_i &= \sigma\left(\overrightarrow{W}_r \overline{E} x_i + \overrightarrow{U}_r \overrightarrow{h}_{i-1}\right).
\end{aligned}$$

$\overline{E} \in \mathbb{R}^{m \times K_x}$ is the word embedding matrix. $\overrightarrow{W}, \overrightarrow{W}_z, \overrightarrow{W}_r \in \mathbb{R}^{n \times m}$, $\overrightarrow{U}, \overrightarrow{U}_z, \overrightarrow{U}_r \in \mathbb{R}^{n \times n}$ are weight matrices. $m$ and $n$ are the word embedding dimensionality and the number of hidden units, respectively. $\sigma(\cdot)$ is as usual a logistic sigmoid function.
The backward states $(\overleftarrow{h}_1, \cdots, \overleftarrow{h}_{T_x})$ are computed similarly. We share the word embedding matrix $\overline{E}$ between the forward and backward RNNs, unlike the weight matrices.
We concatenate the forward and backward states to obtain the annotations $(h_1, h_2, \cdots, h_{T_x})$, where

$$h_i = \begin{bmatrix} \overrightarrow{h}_i \\ \overleftarrow{h}_i \end{bmatrix} \tag{9}$$
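A minimal NumPy sketch of the encoder follows, assuming the backward RNN reads the sentence from $x_{T_x}$ to $x_1$ with its own weights. The helper names `run_gated_rnn`, `encode` and `init`, the weight dictionaries and the toy sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_gated_rnn(X, p):
    """Run a gated RNN over embedded inputs X (T, m) from h_0 = 0; returns states (T, n)."""
    h = np.zeros(p["U"].shape[0])
    states = []
    for x in X:
        z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)            # update gate
        r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)            # reset gate
        h_cand = np.tanh(p["W"] @ x + p["U"] @ (r * h))   # candidate state
        h = (1.0 - z) * h + z * h_cand
        states.append(h)
    return np.stack(states)

def encode(word_ids, E_bar, fwd, bwd):
    # Multiplying E_bar by a 1-of-K vector selects one column, so index columns directly.
    X = E_bar[:, word_ids].T                       # shared embeddings, shape (T_x, m)
    h_fwd = run_gated_rnn(X, fwd)                  # reads x_1 .. x_{T_x}
    h_bwd = run_gated_rnn(X[::-1], bwd)[::-1]      # reads x_{T_x} .. x_1, re-aligned to positions
    return np.concatenate([h_fwd, h_bwd], axis=1)  # annotations h_j, shape (T_x, 2n)

def init(n, m, rng):
    shapes = {"W": (n, m), "Wz": (n, m), "Wr": (n, m),
              "U": (n, n), "Uz": (n, n), "Ur": (n, n)}
    return {k: 0.1 * rng.standard_normal(s) for k, s in shapes.items()}

rng = np.random.default_rng(0)
n, m, K_x = 4, 3, 10
E_bar = rng.standard_normal((m, K_x))
fwd, bwd = init(n, m, rng), init(n, m, rng)
H = encode([2, 7, 7, 1, 5], E_bar, fwd, bwd)
```

Each row of `H` therefore summarizes both the words preceding and the words following position $j$: the forward half at the first position depends only on the first word, while the backward half at that position has seen the whole sentence.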
A.2.2
Decoder
The hidden state $s_i$ of the decoder given the annotations from the encoder is computed by

$$s_i = (1 - z_i) \circ s_{i-1} + z_i \circ \tilde{s}_i,$$

where

$$\begin{aligned}
\tilde{s}_i &= \tanh\left(W E y_{i-1} + U\left[r_i \circ s_{i-1}\right] + C c_i\right) \\
z_i &= \sigma\left(W_z E y_{i-1} + U_z s_{i-1} + C_z c_i\right) \\
r_i &= \sigma\left(W_r E y_{i-1} + U_r s_{i-1} + C_r c_i\right)
\end{aligned}$$

$E$ is the word embedding matrix for the target language. $W, W_z, W_r \in \mathbb{R}^{n \times m}$, $U, U_z, U_r \in \mathbb{R}^{n \times n}$, and $C, C_z, C_r \in \mathbb{R}^{n \times 2n}$ are weights. Again, $m$ and $n$ are the word embedding dimensionality and the number of hidden units, respectively.
The initial hidden state $s_0$ is computed by

$$s_0 = \tanh\left(W_s \overleftarrow{h}_1\right),$$

where $W_s \in \mathbb{R}^{n \times n}$.
The context vector $c_i$ is recomputed at each step by the alignment model:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j,$$

where

$$\begin{aligned}
\alpha_{ij} &= \frac{\exp\left(e_{ij}\right)}{\sum_{k=1}^{T_x} \exp\left(e_{ik}\right)} \\
e_{ij} &= v_a^{\top} \tanh\left(W_a s_{i-1} + U_a h_j\right),
\end{aligned}$$

and $h_j$ is the $j$-th annotation in the source sentence (see Eq. (9)). $v_a \in \mathbb{R}^{n'}$, $W_a \in \mathbb{R}^{n' \times n}$ and $U_a \in \mathbb{R}^{n' \times 2n}$ are weight matrices. Note that the model becomes RNN Encoder–Decoder (Cho et al., 2014a), if we fix $c_i$ to $\overrightarrow{h}_{T_x}$.
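For one decoder step, the computation of $e_{ij}$, $\alpha_{ij}$ and $c_i$ can be sketched as below. This is an illustrative NumPy version under assumed names (`softmax`, `context_vector`) and toy dimensions; subtracting the maximum score before exponentiating is a standard numerical precaution that is not mentioned in the text and does not change the weights.

```python
import numpy as np

def softmax(e):
    e = e - np.max(e)              # shift for numerical stability; the result is unchanged
    w = np.exp(e)
    return w / w.sum()

def context_vector(s_prev, H, W_a, U_a, v_a):
    """Return (c_i, alpha_i) for one decoder step.

    s_prev: s_{i-1}, shape (n,);  H: annotations, shape (T_x, 2n)
    W_a: (n', n);  U_a: (n', 2n);  v_a: (n',)
    """
    e = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a   # e_{ij} for j = 1..T_x
    alpha = softmax(e)                            # alpha_{ij}, sums to 1 over j
    return alpha @ H, alpha                       # c_i = sum_j alpha_{ij} h_j

rng = np.random.default_rng(0)
n, n_prime, T_x = 4, 6, 5
H = rng.standard_normal((T_x, 2 * n))
W_a = rng.standard_normal((n_prime, n))
U_a = rng.standard_normal((n_prime, 2 * n))
v_a = rng.standard_normal(n_prime)
c, alpha = context_vector(rng.standard_normal(n), H, W_a, U_a, v_a)
```

The context vector is a weighted average of the annotations, so it always has the dimensionality $2n$ of a single annotation regardless of the source length $T_x$.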
With the decoder state $s_{i-1}$, the context $c_i$ and the last generated word $y_{i-1}$, we define the probability of a target word $y_i$ as

$$p(y_i \mid s_i, y_{i-1}, c_i) \propto \exp\left(y_i^{\top} W_o t_i\right),$$

where

$$t_i = \left[\max\left\{\tilde{t}_{i,2j-1}, \tilde{t}_{i,2j}\right\}\right]_{j=1,\ldots,l}^{\top}$$

and $\tilde{t}_{i,k}$ is the $k$-th element of a vector $\tilde{t}_i$ which is computed by

$$\tilde{t}_i = U_o s_{i-1} + V_o E y_{i-1} + C_o c_i.$$

$W_o \in \mathbb{R}^{K_y \times l}$, $U_o \in \mathbb{R}^{2l \times n}$, $V_o \in \mathbb{R}^{2l \times m}$ and $C_o \in \mathbb{R}^{2l \times 2n}$ are weight matrices. This can be understood as having a deep output (Pascanu et al., 2014) with a single maxout hidden layer (Goodfellow et al., 2013).
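The deep output with a maxout layer can be sketched in NumPy as follows. The function names `maxout_pairs` and `output_distribution` and the toy sizes are assumptions for illustration; the normalization is written as an explicit softmax over all $K_y$ words, which is what the proportionality implies.

```python
import numpy as np

def maxout_pairs(t_tilde):
    """t_j = max(t_tilde[2j-1], t_tilde[2j]) in 1-based indexing; maps shape (2l,) to (l,)."""
    return t_tilde.reshape(-1, 2).max(axis=1)

def output_distribution(s_prev, e_prev, c, U_o, V_o, C_o, W_o):
    """Distribution over the K_y target words for one decoder step (biases omitted).

    U_o: (2l, n), V_o: (2l, m), C_o: (2l, 2n), W_o: (K_y, l); e_prev = E y_{i-1}, shape (m,)
    """
    t_tilde = U_o @ s_prev + V_o @ e_prev + C_o @ c
    t = maxout_pairs(t_tilde)
    logits = W_o @ t                     # y^T W_o t for every 1-of-K vector y
    logits = logits - logits.max()       # stabilise the softmax
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
n, m, l, K_y = 4, 3, 2, 7
p = output_distribution(
    rng.standard_normal(n), rng.standard_normal(m), rng.standard_normal(2 * n),
    rng.standard_normal((2 * l, n)), rng.standard_normal((2 * l, m)),
    rng.standard_normal((2 * l, 2 * n)), rng.standard_normal((K_y, l)))
```

The maxout layer halves the dimensionality from $2l$ to $l$ by keeping the larger entry of each consecutive pair, which is why $U_o$, $V_o$ and $C_o$ have $2l$ rows while $W_o$ has $l$ columns.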
A.2.3
Model Size
For all the models used in this paper, the size of a hidden layer $n$ is 1000, the word embedding dimensionality $m$ is 620 and the size of the maxout hidden layer in the deep output $l$ is 500. The number of hidden units in the alignment model $n'$ is 1000.
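As a rough illustration of what these sizes imply, the weight counts of the decoder components follow directly from the matrix shapes given in A.2.2. The helper names below are hypothetical, biases and embedding matrices are excluded, and the target vocabulary size $K_y$ is left as an argument because it is not stated in this appendix.

```python
def decoder_recurrent_params(n, m):
    """Weights in W, W_z, W_r (n x m), U, U_z, U_r (n x n) and C, C_z, C_r (n x 2n)."""
    return 3 * (n * m + n * n + n * 2 * n)

def alignment_params(n, n_prime):
    """Weights in W_a (n' x n), U_a (n' x 2n) and v_a (n')."""
    return n_prime * n + n_prime * 2 * n + n_prime

def output_params(n, m, l, K_y):
    """Weights in U_o (2l x n), V_o (2l x m), C_o (2l x 2n) and W_o (K_y x l)."""
    return 2 * l * n + 2 * l * m + 2 * l * 2 * n + K_y * l

n, m, l, n_prime = 1000, 620, 500, 1000
print(decoder_recurrent_params(n, m))   # 10860000
print(alignment_params(n, n_prime))     # 3001000
```

With these settings the alignment model adds about three million weights, a modest amount next to the roughly eleven million in the decoder's gated recurrent layer.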
| Model | Updates ($\times 10^5$) | Epochs | Hours | GPU | Train NLL | Dev. NLL |
| --- | --- | --- | --- | --- | --- | --- |
| RNNenc-30 | 8.46 | 6.4 | 109 | TITAN BLACK | 28.1 | 53.0 |
| RNNenc-50 | 6.00 | 4.5 | 108 | Quadro K-6000 | 44.0 | 43.6 |
| RNNsearch-30 | 4.71 | 3.6 | 113 | TITAN BLACK | 26.7 | 47.2 |
| RNNsearch-50 | 2.88 | 2.2 | 111 | Quadro K-6000 | 40.7 | 38.1 |
| RNNsearch-50⋆ | 6.67 | 5.0 | 252 | Quadro K-6000 | 36.7 | 35.2 |
Table 2: Learning statistics and relevant information. Each update corresponds to updating the parameters once using a single minibatch. One epoch is one pass through the training set. NLL is the average conditional log-probabilities of the sentences in either the training set or the development set. Note that the lengths of the sentences differ.
Appendix B
Training Procedure
B.1
Parameter Initialization
We initialized the recurrent weight matrices $U, U_z, U_r, \overleftarrow{U}, \overleftarrow{U}_z, \overleftarrow{U}_r, \overrightarrow{U}, \overrightarrow{U}_z$ and $\overrightarrow{U}_r$ as random orthogonal matrices. For $W_a$ and $U_a$, we initialized them by sampling each element from the Gaussian distribution of mean $0$ and variance $0.001^2$. All the elements of $V_a$ and all the bias vectors were initialized to zero. Any other weight matrix was initialized by sampling from the Gaussian distribution of mean $0$ and variance $0.01^2$.
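One way to realize this initialization is sketched below. The text does not say how the random orthogonal matrices were drawn; taking the Q factor of a QR decomposition of a Gaussian matrix is a common choice and is an assumption here, as are the helper names and the toy sizes.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))   # fix column signs so the result does not depend on QR conventions

def gaussian(shape, std, rng):
    return std * rng.standard_normal(shape)

rng = np.random.default_rng(0)
n, m = 8, 5
U = random_orthogonal(n, rng)            # recurrent matrices U, U_z, U_r, ...
W_a = gaussian((n, n), 0.001, rng)       # standard deviation 0.001, i.e. variance 0.001^2
U_a = gaussian((n, 2 * n), 0.001, rng)
v_a = np.zeros(n)                        # written V_a in the text
W = gaussian((n, m), 0.01, rng)          # any other weight matrix: variance 0.01^2
```

An orthogonal recurrent matrix preserves the norm of the vector it multiplies, which is one reason it is a popular starting point for recurrent weights.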
B.2
Training
We used the stochastic gradient descent (SGD) algorithm. Adadelta (Zeiler, 2012) was used to automatically adapt the learning rate of each parameter ($\epsilon = 10^{-6}$ and $\rho = 0.95$). We explicitly normalized the $L_2$-norm of the gradient of the cost function each time to be at most a predefined threshold of $1$, when the norm was larger than the threshold (Pascanu et al., 2013b). Each SGD update direction was computed with a minibatch of 80 sentences.
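The two ingredients of an update, norm clipping and the Adadelta rule, can be sketched for a single parameter array as follows. The names `clip_by_norm` and `Adadelta` are illustrative, and applying the clipping to one array (rather than to the gradient of all parameters jointly) is a simplification made here.

```python
import numpy as np

def clip_by_norm(grad, threshold=1.0):
    """Rescale grad so its L2 norm is at most threshold; leave it unchanged otherwise."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad

class Adadelta:
    """Adadelta update for a single parameter array (Zeiler, 2012)."""

    def __init__(self, shape, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.avg_sq_grad = np.zeros(shape)   # running average of g^2
        self.avg_sq_step = np.zeros(shape)   # running average of (delta x)^2

    def step(self, param, grad):
        self.avg_sq_grad = self.rho * self.avg_sq_grad + (1 - self.rho) * grad ** 2
        delta = -np.sqrt(self.avg_sq_step + self.eps) / np.sqrt(self.avg_sq_grad + self.eps) * grad
        self.avg_sq_step = self.rho * self.avg_sq_step + (1 - self.rho) * delta ** 2
        return param + delta

g = clip_by_norm(np.array([3.0, 4.0]))    # norm 5, rescaled to norm 1
opt = Adadelta(shape=(2,))
x = opt.step(np.array([1.0, -1.0]), g)    # one update in the direction opposite to g
```

Clipping keeps the direction of the gradient and only shrinks its length, so a single unusually large gradient cannot dominate the running averages that Adadelta maintains.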
At each update our implementation requires time proportional to the length of the longest sentence in a minibatch. Hence, to minimize the waste of computation, before every 20-th update, we retrieved 1600 sentence pairs, sorted them according to the lengths and split them into 20 minibatches. The training data was shuffled once before training and was traversed sequentially in this manner.
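The batching scheme described above can be sketched in a few lines of Python. The function name `length_sorted_minibatches` is hypothetical, and sorting by source length with target length as a tie-breaker is an assumption, since the text only says the pairs were sorted according to their lengths.

```python
def length_sorted_minibatches(pairs, batch_size=80, batches_per_chunk=20):
    """Yield minibatches of (source, target) pairs, sorted by length within each chunk."""
    chunk_size = batch_size * batches_per_chunk      # 1600 sentence pairs with the defaults
    for start in range(0, len(pairs), chunk_size):
        chunk = sorted(pairs[start:start + chunk_size],
                       key=lambda st: (len(st[0]), len(st[1])))
        for b in range(0, len(chunk), batch_size):
            yield chunk[b:b + batch_size]

# Toy data: sentence pairs whose source and target both have k tokens.
pairs = [([0] * k, [0] * k) for k in [5, 1, 4, 2, 3, 9, 7, 8, 6]]
batches = list(length_sorted_minibatches(pairs, batch_size=2, batches_per_chunk=2))
lengths = [[len(src) for src, _ in batch] for batch in batches]
```

Sentences of similar length end up in the same minibatch, so less computation is spent on padding up to the longest sentence, while sorting only within chunks keeps the overall order close to the shuffled one.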
In Table 2 we present the statistics related to training all the models used in the experiments.
Appendix C
Translations of Long Sentences
Source
An admitting privilege is the right of a doctor to admit a patient to a hospital or a medical centre to carry out a diagnosis or a procedure, based on his status as a health care worker at a hospital.
Reference
Le privilège d’admission est le droit d’un médecin, en vertu de son statut de membre soignant d’un hôpital, d’admettre un patient dans un hôpital ou un centre médical afin d’y délivrer un diagnostic ou un traitement.
RNNenc-50
Un privilège d’admission est le droit d’un médecin de reconnaître un patient à l’hôpital ou un centre médical d’un diagnostic ou de prendre un diagnostic en fonction de son état de santé.
RNNsearch-50
Un privilège d’admission est le droit d’un médecin d’admettre un patient à un hôpital ou un centre médical pour effectuer un diagnostic ou une procédure, selon son statut de travailleur des soins de santé à l’hôpital.
Google Translate
Un privilège admettre est le droit d’un médecin d’admettre un patient dans un hôpital ou un centre médical pour effectuer un diagnostic ou une procédure, fondée sur sa situation en tant que travailleur de soins de santé dans un hôpital.
Source
This kind of experience is part of Disney’s efforts to “extend the lifetime of its series and build new relationships with audiences via digital platforms that are becoming ever more important,” he added.
Reference
Ce type d’expérience entre dans le cadre des efforts de Disney pour “étendre la durée de vie de ses séries et construire de nouvelles relations avec son public grâce à des plateformes numériques qui sont de plus en plus importantes”, a-t-il ajouté.
RNNenc-50
Ce type d’expérience fait partie des initiatives du Disney pour “prolonger la durée de vie de ses nouvelles et de développer des liens avec les lecteurs numériques qui deviennent plus complexes.
RNNsearch-50
Ce genre d’expérience fait partie des efforts de Disney pour “prolonger la durée de vie de ses séries et créer de nouvelles relations avec des publics via des plateformes numériques de plus en plus importantes”, a-t-il ajouté.
Google Translate
Ce genre d’expérience fait partie des efforts de Disney à “étendre la durée de vie de sa série et construire de nouvelles relations avec le public par le biais des plates-formes numériques qui deviennent de plus en plus important”, at-il ajouté.
Source
In a press conference on Thursday, Mr Blair stated that there was nothing in this video that might constitute a “reasonable motive” that could lead to criminal charges being brought against the mayor.
Reference
En conférence de presse, jeudi, M. Blair a affirmé qu’il n’y avait rien dans cette vidéo qui puisse constituer des “motifs raisonnables” pouvant mener au dépôt d’une accusation criminelle contre le maire.
RNNenc-50
Lors de la conférence de presse de jeudi, M. Blair a dit qu’il n’y avait rien dans cette vidéo qui pourrait constituer une “motivation raisonnable” pouvant entraîner des accusations criminelles portées contre le maire.
RNNsearch-50
Lors d’une conférence de presse jeudi, M. Blair a déclaré qu’il n’y avait rien dans cette vidéo qui pourrait constituer un “motif raisonnable” qui pourrait conduire à des accusations criminelles contre le maire.
Google Translate
Lors d’une conférence de presse jeudi, M. Blair a déclaré qu’il n’y avait rien dans cette vido qui pourrait constituer un “motif raisonnable” qui pourrait mener à des accusations criminelles portes contre le maire.
Table 3: The translations generated by RNNenc-50 and RNNsearch-50 from long source sentences (30 words or more) selected from the test set. For each source sentence, we also show the gold-standard translation. The translations by Google Translate were made on 27 August 2014.