There are many similarities between the Transformer encoder and decoder, such as their implementation of multi-head attention, layer normalization, and a fully connected feed-forward network as their final sub-layer. Having implemented the Transformer encoder, we will now proceed to apply our knowledge in implementing the Transformer decoder as a further step toward implementing the complete Transformer model. Our end goal remains to apply the complete model to Natural Language Processing (NLP).
In this tutorial, you will discover how to implement the Transformer decoder from scratch in TensorFlow and Keras.
After completing this tutorial, you will know:
- The layers that form part of the Transformer decoder.
- How to implement the Transformer decoder from scratch.
Let's get started.
Implementing the Transformer Decoder From Scratch in TensorFlow and Keras
Photograph by François Kaiser, some rights reserved.
Tutorial Overview
This tutorial is divided into three parts; they are:
- Recap of the Transformer Architecture
- The Transformer Decoder
- Implementing the Transformer Decoder From Scratch
- The Decoder Layer
- The Transformer Decoder
- Testing Out the Code
Prerequisites
For this tutorial, we assume that you are already familiar with:
- The Transformer model
- The scaled dot-product attention
- The multi-head attention
- The Transformer positional encoding
- The Transformer encoder
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure: the encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The Encoder-Decoder Structure of the Transformer Architecture
Taken from “Attention Is All You Need”
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.
We had seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. This tutorial will be exploring these similarities.
The Transformer Decoder
Similar to the Transformer encoder, the Transformer decoder also consists of a stack of $N$ identical layers. The Transformer decoder, however, implements an additional multi-head attention block, for a total of three main sub-layers:
- The first sub-layer comprises a multi-head attention mechanism that receives the queries, keys, and values as inputs.
- The second sub-layer comprises a second multi-head attention mechanism.
- The third sub-layer comprises a fully connected feed-forward network.

The Decoder Block of the Transformer Architecture
Taken from “Attention Is All You Need”
Each of these three sub-layers is also followed by layer normalization, where the input to the layer normalization step is its corresponding sub-layer input (through a residual connection) and output.
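As a point of reference, this Add & Norm operation was implemented as its own small layer in the Transformer encoder tutorial. A minimal sketch of such a layer, assuming Keras' built-in LayerNormalization, could look as follows:

from tensorflow.keras.layers import Layer, LayerNormalization

# A minimal sketch of an Add & Norm sub-layer: the residual connection (x + sublayer_x)
# is followed by layer normalization
class AddNormalization(Layer):
    def __init__(self, **kwargs):
        super(AddNormalization, self).__init__(**kwargs)
        self.layer_norm = LayerNormalization()

    def call(self, x, sublayer_x):
        # Sum the sub-layer input and output, then normalize the result
        return self.layer_norm(x + sublayer_x)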
On the decoder side, the queries, keys, and values that are fed into the first multi-head attention block also represent the same input sequence. However, this time around, it is the target sequence that is embedded and augmented with positional information before being supplied to the decoder. The second multi-head attention block, on the other hand, receives the encoder output in the form of keys and values, and the normalized output of the first decoder attention block as the queries. In both cases, the dimensionality of the queries and keys remains equal to $d_k$, while the dimensionality of the values remains equal to $d_v$.
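As a brief reminder from the scaled dot-product attention tutorial, each head in these attention blocks computes the same function; only the source of the queries, keys, and values differs between the two decoder blocks:

$$\text{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$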
Vaswani et al. also introduce regularization into the model on the decoder side by applying dropout to the output of each sub-layer (before the layer normalization step), as well as to the positional encodings before these are fed into the decoder.
Let's now see how to implement the Transformer decoder from scratch in TensorFlow and Keras.
Implementing the Transformer Decoder From Scratch
The Decoder Layer
Since we have already implemented the required sub-layers when we covered the implementation of the Transformer encoder, we will create a class for the decoder layer that makes use of these sub-layers directly:
from tensorflow.keras.layers import Layer, Dropout
from multihead_attention import MultiHeadAttention
from encoder import AddNormalization, FeedForward

class DecoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(DecoderLayer, self).__init__(**kwargs)
        self.multihead_attention1 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.multihead_attention2 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout3 = Dropout(rate)
        self.add_norm3 = AddNormalization()
        ...
Notice here that, since my code for the different sub-layers had been saved into several Python scripts (namely, multihead_attention.py and encoder.py), it was necessary to import them in order to use the required classes.
As we had done for the Transformer encoder, we will now proceed to create the class method, call(), that implements all of the decoder sub-layers:
...
    def call(self, x, encoder_output, lookahead_mask, padding_mask, training):
        # Multi-head attention layer
        multihead_output1 = self.multihead_attention1(x, x, x, lookahead_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output1 = self.dropout1(multihead_output1, training=training)

        # Followed by an Add & Norm layer
        addnorm_output1 = self.add_norm1(x, multihead_output1)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by another multi-head attention layer
        multihead_output2 = self.multihead_attention2(addnorm_output1, encoder_output, encoder_output, padding_mask)

        # Add in another dropout layer
        multihead_output2 = self.dropout2(multihead_output2, training=training)

        # Followed by another Add & Norm layer
        addnorm_output2 = self.add_norm2(addnorm_output1, multihead_output2)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output2)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout3(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm3(addnorm_output2, feedforward_output)
The multi-head attention sub-layers can also each receive a padding mask or a look-ahead mask. As a brief reminder of what we had said in a previous tutorial, the padding mask is necessary to suppress the zero padding in the input sequence from being processed along with the actual input values. The look-ahead mask prevents the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.
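We will only build these masks when we assemble the complete Transformer model, but as an illustration, a look-ahead mask for a sequence of length 4 could be sketched as follows, here using the convention that a value of 1 marks a position to be suppressed (the exact convention depends on the MultiHeadAttention implementation from the earlier tutorial):

import tensorflow as tf

# A minimal sketch of a look-ahead mask: entry (i, j) is 1 when j > i, i.e. when
# word j comes after word i and should therefore not be attended to
def lookahead_mask(seq_length):
    return 1 - tf.linalg.band_part(tf.ones((seq_length, seq_length)), -1, 0)

print(lookahead_mask(4))
# tf.Tensor(
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]], shape=(4, 4), dtype=float32)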
The same call() class method can also receive a training flag to only apply the Dropout layers during training, when the value of this flag is set to True.
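As a quick standalone illustration (not part of the decoder code), the snippet below shows how the training flag changes the behaviour of a Keras Dropout layer:

import tensorflow as tf
from tensorflow.keras.layers import Dropout

dropout = Dropout(0.5)
x = tf.ones((1, 6))

# During training, roughly half of the values are zeroed out and the remainder scaled up
print(dropout(x, training=True))
# At inference time, dropout is inactive and the input passes through unchanged
print(dropout(x, training=False))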
The Transformer Decoder
The Transformer decoder takes the decoder layer that we have just implemented and replicates it identically $N$ times.
We will be creating the following Decoder() class to implement the Transformer decoder:
from tensorflow.keras.layers import Layer, Dropout
from positional_encoding import PositionEmbeddingFixedWeights

class Decoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.decoder_layer = [DecoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]
        ...
As in the Transformer encoder, the input to the first multi-head attention block on the decoder side receives its input sequence after it has undergone a process of word embedding and positional encoding. For this purpose, an instance of the PositionEmbeddingFixedWeights class (covered in this tutorial) is initialized and its output assigned to the pos_encoding variable.
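If you would like to sanity-check this step in isolation, the class can be exercised on its own with dummy data. The sketch below assumes, as in the test further down, that it accepts an array of shape (batch_size, sequence_length) and returns a tensor of shape (batch_size, sequence_length, d_model):

from numpy import random
from positional_encoding import PositionEmbeddingFixedWeights

# Dummy check of the embedding-plus-positional-encoding step, using the same
# argument order as in the Decoder constructor above
pos_encoding = PositionEmbeddingFixedWeights(5, 20, 512)
dummy_target = random.random((64, 5))
print(pos_encoding(dummy_target).shape)  # expected: (64, 5, 512)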
The final step is to create a class method, call(), that applies word embedding and positional encoding to the input sequence and feeds the result, together with the encoder output, to $N$ decoder layers:
...
    def call(self, output_target, encoder_output, lookahead_mask, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(output_target)
        # Expected output shape = (number of sentences, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass the positionally encoded values on to each decoder layer
        for layer in self.decoder_layer:
            x = layer(x, encoder_output, lookahead_mask, padding_mask, training)

        return x
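Note that the same look-ahead and padding masks are passed to every decoder layer: within each DecoderLayer, the look-ahead mask is applied in the first (self-attention) block, while the padding mask is applied in the second block, which attends over the encoder output.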
The code listing for the full Transformer decoder is the following:
from tensorflow.keras.layers import Layer, Dropout
from multihead_attention import MultiHeadAttention
from positional_encoding import PositionEmbeddingFixedWeights
from encoder import AddNormalization, FeedForward

# Implementing the Decoder Layer
class DecoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(DecoderLayer, self).__init__(**kwargs)
        self.multihead_attention1 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.multihead_attention2 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout3 = Dropout(rate)
        self.add_norm3 = AddNormalization()

    def call(self, x, encoder_output, lookahead_mask, padding_mask, training):
        # Multi-head attention layer
        multihead_output1 = self.multihead_attention1(x, x, x, lookahead_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output1 = self.dropout1(multihead_output1, training=training)

        # Followed by an Add & Norm layer
        addnorm_output1 = self.add_norm1(x, multihead_output1)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by another multi-head attention layer
        multihead_output2 = self.multihead_attention2(addnorm_output1, encoder_output, encoder_output, padding_mask)

        # Add in another dropout layer
        multihead_output2 = self.dropout2(multihead_output2, training=training)

        # Followed by another Add & Norm layer
        addnorm_output2 = self.add_norm2(addnorm_output1, multihead_output2)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output2)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout3(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm3(addnorm_output2, feedforward_output)

# Implementing the Decoder
class Decoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.decoder_layer = [DecoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]

    def call(self, output_target, encoder_output, lookahead_mask, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(output_target)
        # Expected output shape = (number of sentences, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass the positionally encoded values on to each decoder layer
        for layer in self.decoder_layer:
            x = layer(x, encoder_output, lookahead_mask, padding_mask, training)

        return x
Testing Out the Code
We will be working with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the decoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers
...
As for the input sequence, we will be working with dummy data for the time being, until we arrive at the stage of training the complete Transformer model in a separate tutorial, at which point we will be using actual sentences:
...
dec_vocab_size = 20  # Vocabulary size for the decoder
input_seq_length = 5  # Maximum length of the input sequence

input_seq = random.random((batch_size, input_seq_length))
enc_output = random.random((batch_size, input_seq_length, d_model))
...
Next, we will create a new instance of the Decoder class, assigning its output to the decoder variable, subsequently passing in the input arguments and printing the result. We will be setting the padding and look-ahead masks to None for the time being, but we will return to these when we implement the complete Transformer model:
...
decoder = Decoder(dec_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(decoder(input_seq, enc_output, None, None, True))
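If you would like to exercise the masking path before then, you could pass in a look-ahead mask built as sketched earlier instead of None. This assumes that the MultiHeadAttention layer from the earlier tutorial broadcasts a (sequence_length, sequence_length) mask across the batch and head dimensions:

...
import tensorflow as tf

# A hedged variant of the call above, with a look-ahead mask in place of None
lookahead_mask = 1 - tf.linalg.band_part(tf.ones((input_seq_length, input_seq_length)), -1, 0)
print(decoder(input_seq, enc_output, lookahead_mask, None, True))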
Tying everything together produces the following code listing:
from numpy import random

# The Decoder class implemented above is assumed to be available here
# (for example, defined in the same script or imported from it)

dec_vocab_size = 20  # Vocabulary size for the decoder
input_seq_length = 5  # Maximum length of the input sequence

h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the decoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers

input_seq = random.random((batch_size, input_seq_length))
enc_output = random.random((batch_size, input_seq_length, d_model))

decoder = Decoder(dec_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(decoder(input_seq, enc_output, None, None, True))
Running this code produces an output of shape (batch size, sequence length, model dimensionality). Note that you will likely see a different output due to the random initialization of the input sequence and the parameter values of the Dense layers.
tf.Tensor(
[[[-0.04132953 -1.7236308   0.5391184  ... -0.76394725  1.4969798   0.37682498]
  [ 0.05501875 -1.7523409   0.58404493 ... -0.70776534  1.4498456   0.32555297]
  [ 0.04983566 -1.8431275   0.55850077 ... -0.68202156  1.4222856   0.32104644]
  [-0.05684051 -1.8862512   0.4771412  ... -0.7101341   1.431343    0.39346313]
  [-0.15625843 -1.7992781   0.40803364 ... -0.75190556  1.4602519   0.53546077]]

 ...

 [[-0.58847624 -1.646842    0.5973466  ... -0.47778523  1.2060764   0.34091905]
  [-0.48688865 -1.6809179   0.6493542  ... -0.41274604  1.188649    0.27100053]
  [-0.49568555 -1.8002801   0.61536175 ... -0.38540334  1.2023914   0.24383534]
  [-0.59913146 -1.8598882   0.5098136  ... -0.3984461   1.2115746   0.3186561 ]
  [-0.71045107 -1.7778647   0.43008155 ... -0.42037937  1.2255307   0.47380894]]], shape=(64, 5, 512), dtype=float32)
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
- Advanced Deep Learning with Python, 2019.
- Transformers for Natural Language Processing, 2021.
Papers
- Attention Is All You Need, 2017.
Summary
In this tutorial, you discovered how to implement the Transformer decoder from scratch in TensorFlow and Keras.
Specifically, you learned:
- The layers that form part of the Transformer decoder.
- How to implement the Transformer decoder from scratch.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.