Having seen how to implement the scaled dot-product attention and integrate it within the multi-head attention of the Transformer model, let's progress one step further toward implementing a complete Transformer model by implementing its encoder. Our end goal remains to apply the complete model to Natural Language Processing (NLP).
In this tutorial, you will discover how to implement the Transformer encoder from scratch in TensorFlow and Keras.
After completing this tutorial, you will know:
- The layers that form part of the Transformer encoder.
- How to implement the Transformer encoder from scratch.
Let's get started.
Implementing the Transformer Encoder From Scratch in TensorFlow and Keras
Photo by ian dooley, some rights reserved.
Tutorial Overview
This tutorial is divided into three parts; they are:
- Recap of the Transformer Architecture
- The Transformer Encoder
- Implementing the Transformer Encoder From Scratch
  - The Fully Connected Feed-Forward Neural Network and Layer Normalization
  - The Encoder Layer
  - The Transformer Encoder
  - Testing Out the Code
Prerequisites
For this tutorial, we assume that you are already familiar with:
- The Transformer model
- The scaled dot-product attention
- The multi-head attention
- The Transformer positional encoding
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The Encoder-Decoder Structure of the Transformer Architecture
Taken from “Attention Is All You Need”
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.
You have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. In this tutorial, we will be focusing on the components that form part of the Transformer encoder.
The Transformer Encoder
The Transformer encoder consists of a stack of $N$ identical layers, where each layer further consists of two main sub-layers:
- The first sub-layer comprises a multi-head attention mechanism that receives the queries, keys, and values as inputs.
- A second sub-layer comprises a fully-connected feed-forward network.

The Encoder Block of the Transformer Architecture
Taken from “Attention Is All You Need”
Following each of these two sub-layers is layer normalization, into which the sub-layer input (through a residual connection) and output are fed. The output of each layer normalization step is the following:
LayerNorm(Sublayer Input + Sublayer Output)
In order to facilitate such an operation, which involves an addition between the sublayer input and output, Vaswani et al. designed all sub-layers and embedding layers in the model to produce outputs of dimension $d_{\text{model}} = 512$.
Also, recall the queries, keys, and values as the inputs to the Transformer encoder.
Here, the queries, keys, and values carry the same input sequence after it has been embedded and augmented by positional information, where the queries and keys are of dimensionality $d_k$, while the dimensionality of the values is $d_v$.
Furthermore, Vaswani et al. also introduce regularization into the model by applying dropout to the output of each sub-layer (before the layer normalization step), as well as to the positional encodings before they are fed into the encoder.
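Combining the residual connection, dropout, and layer normalization described above, each encoder sub-layer therefore computes the following, where $\text{Sublayer}(\cdot)$ stands for either the multi-head attention or the feed-forward network:

$$\text{LayerNorm}\big(x + \text{Dropout}(\text{Sublayer}(x))\big)$$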
Let's now see how to implement the Transformer encoder from scratch in TensorFlow and Keras.
Implementing the Transformer Encoder From Scratch
The Fully Connected Feed-Forward Neural Network and Layer Normalization
Let's begin by creating classes for the Feed Forward and Add & Norm layers that are shown in the diagram above.
Vaswani et al. tell us that the fully connected feed-forward network consists of two linear transformations with a ReLU activation in between. The first linear transformation produces an output of dimensionality $d_{ff} = 2048$, while the second linear transformation produces an output of dimensionality $d_{\text{model}} = 512$.
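In the notation of the paper, this feed-forward sub-layer computes:

$$\text{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2$$

where $W_1$ is of shape $(d_{\text{model}}, d_{ff})$ and $W_2$ is of shape $(d_{ff}, d_{\text{model}})$, so the output regains the model dimensionality of 512.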
For this purpose, let's first create the class, FeedForward, which inherits from the Layer base class in Keras, and initialize the dense layers and the ReLU activation:
class FeedForward(Layer):
    def __init__(self, d_ff, d_model, **kwargs):
        super(FeedForward, self).__init__(**kwargs)
        self.fully_connected1 = Dense(d_ff)  # First fully connected layer
        self.fully_connected2 = Dense(d_model)  # Second fully connected layer
        self.activation = ReLU()  # ReLU activation layer
        ...
We will add to it the class method, call(), that receives an input and passes it through the two fully connected layers with ReLU activation, returning an output of dimensionality equal to 512:
...
    def call(self, x):
        # The input is passed into the two fully connected layers, with a ReLU in between
        x_fc1 = self.fully_connected1(x)

        return self.fully_connected2(self.activation(x_fc1))
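As a quick stand-alone check (not part of the original listing, and assuming the Keras imports shown in the complete listing further below), we can pass a dummy tensor of shape (batch size, sequence length, $d_{\text{model}}$) through the layer and confirm that the output keeps the model dimensionality:

from numpy import random

# Dummy input: a batch of 64 sequences of length 5, already projected to d_model = 512
dummy = random.random((64, 5, 512))

feed_forward = FeedForward(2048, 512)
print(feed_forward(dummy).shape)  # expected: (64, 5, 512)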
The next step is to create another class, AddNormalization, which also inherits from the Layer base class in Keras, and initialize a layer normalization layer:
class AddNormalization(Layer):
    def __init__(self, **kwargs):
        super(AddNormalization, self).__init__(**kwargs)
        self.layer_norm = LayerNormalization()  # Layer normalization layer
        ...
In it, we will include the following class method that sums its sub-layer's input and output, which it receives as inputs, and applies layer normalization to the result:
...
    def call(self, x, sublayer_x):
        # The sublayer input and output need to be of the same shape to be summed
        add = x + sublayer_x

        # Apply layer normalization to the sum
        return self.layer_norm(add)
The Encoder Layer
Next, we will implement the encoder layer, which the Transformer encoder will replicate identically $N$ times.
For this purpose, let's create the class, EncoderLayer, and initialize all of the sub-layers that it consists of:
class EncoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(EncoderLayer, self).__init__(**kwargs)
        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        ...
Here you may notice that we have initialized instances of the FeedForward and AddNormalization classes, which we have just created in the previous section, and assigned their output to the respective variables, feed_forward and add_norm (1 and 2). The Dropout layer is self-explanatory, where rate defines the frequency at which the input units are set to 0. We created the MultiHeadAttention class in a previous tutorial, and if you saved its code into a separate Python script, then do not forget to import it. I saved mine in a Python script named multihead_attention.py, and for this reason I need to include the line of code, from multihead_attention import MultiHeadAttention.
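For reference, the snippets in this section assume the following imports at the top of the script (they also appear in the complete listing further below; the module names multihead_attention and positional_encoding simply reflect the filenames under which the code from the earlier tutorials was saved):

from tensorflow.keras.layers import LayerNormalization, Layer, Dense, ReLU, Dropout
from multihead_attention import MultiHeadAttention
from positional_encoding import PositionEmbeddingFixedWeights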
Let's now proceed to create the class method, call(), that implements all of the encoder sub-layers:
...
    def call(self, x, padding_mask, training):
        # Multi-head attention layer
        multihead_output = self.multihead_attention(x, x, x, padding_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output = self.dropout1(multihead_output, training=training)

        # Followed by an Add & Norm layer
        addnorm_output = self.add_norm1(x, multihead_output)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout2(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm2(addnorm_output, feedforward_output)
In addition to the input data, the call() method can also receive a padding mask. As a brief reminder of what we said in a previous tutorial, the padding mask is necessary to suppress the zero padding in the input sequence from being processed along with the actual input values.
The same class method can receive a training flag which, when set to True, will only apply the Dropout layers during training.
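As a quick stand-alone check of the encoder layer (not part of the original listing), we can feed it dummy data that is assumed to be already embedded and positionally encoded, leaving the padding mask as None just as in the test at the end of this tutorial:

from numpy import random

# Dummy input that is assumed to be already embedded and positionally encoded:
# shape = (batch_size, sequence_length, d_model)
x = random.random((64, 5, 512))

# Arguments: h, d_k, d_v, d_model, d_ff, rate
encoder_layer = EncoderLayer(8, 64, 64, 512, 2048, 0.1)

print(encoder_layer(x, None, True).shape)  # expected: (64, 5, 512)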
The Transformer Encoder
The last step is to create a class for the Transformer encoder, which we shall be naming Encoder:
class Encoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Encoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.encoder_layer = [EncoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]
        ...
The Transformer encoder receives an input sequence after it has undergone a process of word embedding and positional encoding. In order to compute the positional encoding, we will make use of the PositionEmbeddingFixedWeights class described by Mehreen Saeed in this tutorial.
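For completeness, the sketch below shows one way such a layer can be written, assuming it sums a word embedding and a position embedding whose weights are frozen to the fixed sinusoidal encoding; refer to the linked tutorial for the exact implementation, which is assumed here to be saved as positional_encoding.py:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Layer, Embedding

class PositionEmbeddingFixedWeights(Layer):
    def __init__(self, sequence_length, vocab_size, output_dim, **kwargs):
        super(PositionEmbeddingFixedWeights, self).__init__(**kwargs)
        # Both embeddings are non-trainable, initialized with the sinusoidal encoding
        word_matrix = self.get_position_encoding(vocab_size, output_dim)
        position_matrix = self.get_position_encoding(sequence_length, output_dim)
        self.word_embedding_layer = Embedding(input_dim=vocab_size, output_dim=output_dim,
                                              weights=[word_matrix], trainable=False)
        self.position_embedding_layer = Embedding(input_dim=sequence_length, output_dim=output_dim,
                                                  weights=[position_matrix], trainable=False)

    def get_position_encoding(self, seq_len, d, n=10000):
        # Standard sinusoidal encoding: sine on even indices, cosine on odd indices
        P = np.zeros((seq_len, d))
        for k in range(seq_len):
            for i in range(d // 2):
                denominator = np.power(n, 2 * i / d)
                P[k, 2 * i] = np.sin(k / denominator)
                P[k, 2 * i + 1] = np.cos(k / denominator)
        return P

    def call(self, inputs):
        # Look up the fixed word encoding and add the fixed position encoding
        position_indices = tf.range(tf.shape(inputs)[-1])
        embedded_words = self.word_embedding_layer(inputs)
        embedded_indices = self.position_embedding_layer(position_indices)
        return embedded_words + embedded_indices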
As we have similarly done in the previous sections, here we will also create a class method, call(), that applies word embedding and positional encoding to the input sequence and feeds the result to $N$ encoder layers:
...
    def call(self, input_sentence, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(input_sentence)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass on the positional encoded values to each encoder layer
        for i, layer in enumerate(self.encoder_layer):
            x = layer(x, padding_mask, training)

        return x
The code listing for the full Transformer encoder is the following:
from tensorflow.keras.layers import LayerNormalization, Layer, Dense, ReLU, Dropout
from multihead_attention import MultiHeadAttention
from positional_encoding import PositionEmbeddingFixedWeights

# Implementing the Add & Norm Layer
class AddNormalization(Layer):
    def __init__(self, **kwargs):
        super(AddNormalization, self).__init__(**kwargs)
        self.layer_norm = LayerNormalization()  # Layer normalization layer

    def call(self, x, sublayer_x):
        # The sublayer input and output need to be of the same shape to be summed
        add = x + sublayer_x

        # Apply layer normalization to the sum
        return self.layer_norm(add)

# Implementing the Feed-Forward Layer
class FeedForward(Layer):
    def __init__(self, d_ff, d_model, **kwargs):
        super(FeedForward, self).__init__(**kwargs)
        self.fully_connected1 = Dense(d_ff)  # First fully connected layer
        self.fully_connected2 = Dense(d_model)  # Second fully connected layer
        self.activation = ReLU()  # ReLU activation layer

    def call(self, x):
        # The input is passed into the two fully connected layers, with a ReLU in between
        x_fc1 = self.fully_connected1(x)

        return self.fully_connected2(self.activation(x_fc1))

# Implementing the Encoder Layer
class EncoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(EncoderLayer, self).__init__(**kwargs)
        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()

    def call(self, x, padding_mask, training):
        # Multi-head attention layer
        multihead_output = self.multihead_attention(x, x, x, padding_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output = self.dropout1(multihead_output, training=training)

        # Followed by an Add & Norm layer
        addnorm_output = self.add_norm1(x, multihead_output)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout2(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm2(addnorm_output, feedforward_output)

# Implementing the Encoder
class Encoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Encoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.encoder_layer = [EncoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]

    def call(self, input_sentence, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(input_sentence)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass on the positional encoded values to each encoder layer
        for i, layer in enumerate(self.encoder_layer):
            x = layer(x, padding_mask, training)

        return x
Testing Out the Code
We will be working with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers
...
As for the input sequence, we will be working with dummy data for the time being until we arrive at the stage of training the complete Transformer model in a separate tutorial, at which point we will be using actual sentences:
...
enc_vocab_size = 20  # Vocabulary size for the encoder
input_seq_length = 5  # Maximum length of the input sequence

input_seq = random.random((batch_size, input_seq_length))
...
Next, we will create a new instance of the Encoder class, assigning its output to the encoder variable, subsequently feeding in the input arguments and printing the result. We will be setting the padding mask argument to None for the time being, but we will return to this when we implement the complete Transformer model:
...
encoder = Encoder(enc_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(encoder(input_seq, None, True))
Tying everything together produces the following code listing:
from numpy import random

enc_vocab_size = 20  # Vocabulary size for the encoder
input_seq_length = 5  # Maximum length of the input sequence
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers

input_seq = random.random((batch_size, input_seq_length))

encoder = Encoder(enc_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(encoder(input_seq, None, True))
Running this code produces an output of shape (batch size, sequence length, model dimensionality). Note that you will likely see a different output due to the random initialization of the input sequence and the parameter values of the Dense layers.
tf.Tensor(
[[[-0.4214715  -1.1246173  -0.8444572  ...  1.6388322  -0.1890367   1.0173352 ]
  [ 0.21662089 -0.61147404 -1.0946581  ...  1.4627445  -0.6000164  -0.64127874]
  [ 0.46674493 -1.4155326  -0.5686513  ...  1.1790234  -0.94788337  0.1331717 ]
  [-0.30638126 -1.9047263  -1.8556844  ...  0.9130118  -0.47863355  0.00976158]
  [-0.22600567 -0.9702025  -0.91090447 ...  1.7457147  -0.139926   -0.07021569]]

 ...

 [[-0.48047638 -1.1034104  -0.16164204 ...  1.5588069   0.08743562 -0.08847156]
  [-0.61683714 -0.8403657  -1.0450369  ...  2.3587787  -0.76091915 -0.02891812]
  [-0.34268388 -0.65042275 -0.6715749  ...  2.8530657  -0.33631966  0.5215888 ]
  [-0.6288677  -1.0030932  -0.9749813  ...  2.1386387   0.0640307  -0.69504136]
  [-1.33254    -1.2524267  -0.230098   ...  2.515467   -0.04207756 -0.3395423 ]]], shape=(64, 5, 512), dtype=float32)
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
- Advanced Deep Learning with Python, 2019.
- Transformers for Natural Language Processing, 2021.
Papers
- Attention Is All You Need, 2017.
Summary
In this tutorial, you discovered how to implement the Transformer encoder from scratch in TensorFlow and Keras.
Specifically, you learned:
- The layers that form part of the Transformer encoder.
- How to implement the Transformer encoder from scratch.
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.