Building an English-Spanish Translator with Transformers
Language barriers have always been a challenge in our increasingly connected world. While traditional translation methods rely on complex rule-based systems or statistical models, modern deep learning approaches—particularly transformer architectures—have revolutionized machine translation. In this post, I'll walk you through building a sequence-to-sequence transformer model that translates between English and Spanish.
The Power of Transformers
Transformers have become the gold standard for natural language processing tasks. Unlike recurrent neural networks (RNNs) that process sequences sequentially, transformers use attention mechanisms to capture relationships between words regardless of their distance in the sentence. This parallel processing capability makes them both faster and more effective at understanding context.
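To make the attention idea concrete, here is a minimal sketch of scaled dot-product attention, the operation at the heart of every transformer layer. It is illustrative only; the model we build below relies on Keras's built-in MultiHeadAttention layer rather than this hand-rolled version:

import tensorflow as tf

def scaled_dot_product_attention(query, key, value):
    # Similarity of every query position to every key position: (batch, seq_q, seq_k)
    scores = tf.matmul(query, key, transpose_b=True)
    # Scale by sqrt(key depth) so the softmax stays well-behaved for large dimensions
    depth = tf.cast(tf.shape(key)[-1], tf.float32)
    weights = tf.nn.softmax(scores / tf.sqrt(depth), axis=-1)
    # Each output position is a weighted average over all value vectors
    return tf.matmul(weights, value)

# Toy self-attention: one "sentence" of 5 token vectors, each of size 8
x = tf.random.normal((1, 5, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (1, 5, 8)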
Project Overview
We'll build an end-to-end translation system through four main steps:
- Data Preprocessing - Download and prepare parallel text datasets
- Model Architecture - Implement positional embeddings, encoder, and decoder layers
- Training - Train the model on English-Spanish sentence pairs
- Inference - Use the trained model to translate new sentences
Setting Up the Environment
First, let's import the necessary libraries. We'll be using TensorFlow and Keras for building our neural network:
import pathlib
import random
import string
import re
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
Step 1: Data Preprocessing
Downloading the Dataset
We'll use a publicly available English-to-Spanish translation dataset from Anki, which contains roughly 120,000 parallel sentence pairs:
text_file = keras.utils.get_file(
fname="spa-eng.zip",
origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
extract=True,
)
text_file = pathlib.Path(text_file).parent / "spa-eng" / "spa.txt"
Parsing Sentence Pairs
Each line in our dataset contains an English sentence and its Spanish translation, separated by a tab. We'll add special [start] and [end] tokens to the Spanish sentences to help the model learn when to begin and end translations:
with open(text_file) as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    eng, spa = line.split("\t")
    spa = "[start] " + spa + " [end]"
    text_pairs.append((eng, spa))
Here are some example pairs from our dataset:
for _ in range(5):
    print(random.choice(text_pairs))
('I think Tom is working now.', '[start] Creo que ahora Tomás trabaja. [end]')
("I'm very interested in classical literature.", '[start] Me interesa mucho la literatura clásica. [end]')
('I appreciate you.', '[start] Te tengo cariño. [end]')
('Do you want to watch this program?', '[start] ¿Quieres ver este programa? [end]')
('We just have to stick together.', '[start] Sólo tenemos que permanecer juntos. [end]')
Splitting the Data
Let's divide our data into training, validation, and test sets. We'll use 70% for training, 15% for validation, and 15% for testing:
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
num_test_samples = len(text_pairs) - num_val_samples - num_train_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]
print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")
118964 total pairs
83276 training pairs
17844 validation pairs
17844 test pairs
Vectorizing Text Data
Neural networks work with numbers, not text. We'll use Keras's TextVectorization layer to convert our sentences into sequences of integers, where each integer represents a word in our vocabulary.
The English layer uses the default preprocessing (lowercasing, stripping punctuation, and splitting on whitespace). The Spanish layer needs a custom standardization that also strips the inverted question mark (¿) while preserving the square brackets, so the [start] and [end] tokens survive intact. Its output is also one token longer than the English side, because the Spanish sequence will later be split into decoder inputs and one-step-shifted targets:
strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")
vocab_size = 15000
sequence_length = 20
batch_size = 64
def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, "[%s]" % re.escape(strip_chars), "")
eng_vectorization = TextVectorization(
max_tokens=vocab_size,
output_mode="int",
output_sequence_length=sequence_length,
)
spa_vectorization = TextVectorization(
max_tokens=vocab_size,
output_mode="int",
output_sequence_length=sequence_length + 1,
standardize=custom_standardization,
)
train_eng_texts = [pair[0] for pair in train_pairs]
train_spa_texts = [pair[1] for pair in train_pairs]
eng_vectorization.adapt(train_eng_texts)
spa_vectorization.adapt(train_spa_texts)
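As an optional sanity check, we can run a sentence through each step. Note how the custom standardization lowercases the text and strips ? and ¿ but keeps the brackets, so [start] and [end] survive as tokens (the example sentences here are mine, not from the dataset):

print(custom_standardization(tf.constant("[start] ¿Qué hora es? [end]")))
# -> "[start] qué hora es [end]"
print(eng_vectorization(["What time is it?"]))
# -> a (1, 20) int64 tensor of token ids, zero-padded after the last real word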
Formatting the Dataset
At each training step, the model predicts the next word in the target sequence using the source sentence and all previous target words (teacher forcing). We therefore feed it the full English sequence plus the Spanish sequence without its last token, and use the Spanish sequence shifted one step ahead as the labels:
def format_dataset(eng, spa):
    eng = eng_vectorization(eng)
    spa = spa_vectorization(spa)
    return (
        {
            "encoder_inputs": eng,
            "decoder_inputs": spa[:, :-1],
        },
        spa[:, 1:]
    )
def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    # Cache before shuffling so each epoch sees the pairs in a fresh order
    return dataset.cache().shuffle(2048).prefetch(16)
train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)
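Before moving on, it's worth taking one batch and confirming that the shapes line up; with the settings above, each of these should print (64, 20):

for inputs, targets in train_ds.take(1):
    print(inputs["encoder_inputs"].shape)
    print(inputs["decoder_inputs"].shape)
    print(targets.shape)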
Step 2: Building the Transformer
Positional Embedding Layer
Since transformers don't inherently understand word order, we need positional embeddings to encode each word's position in the sequence:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)
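To check the layer does what we expect, here is a tiny smoke test with made-up token ids (the ids themselves are arbitrary): every token gets a 256-dimensional vector, and compute_mask flags the zero padding so downstream attention layers can ignore it:

demo_embed = PositionalEmbedding(sequence_length=20, vocab_size=15000, embed_dim=256)
demo_ids = tf.constant([[5, 42, 7, 0, 0]])    # one zero-padded "sentence"
print(demo_embed(demo_ids).shape)             # (1, 5, 256)
print(demo_embed.compute_mask(demo_ids))      # [[ True  True  True False False]]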
Transformer Encoder
The encoder processes the source language (English) and creates contextual representations:
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential([
            layers.Dense(dense_dim, activation="relu"),
            layers.Dense(embed_dim),
        ])
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, tf.newaxis, :], dtype="int32")
        else:
            padding_mask = None
        attention_output = self.attention(
            query=inputs, value=inputs, key=inputs, attention_mask=padding_mask
        )
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)
Transformer Decoder
The decoder generates the target language (Spanish) one word at a time, attending to both the encoder output and previously generated words:
class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential([
            layers.Dense(latent_dim, activation="relu"),
            layers.Dense(embed_dim),
        ])
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        else:
            padding_mask = causal_mask
        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        out_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)
        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)],
            axis=0,
        )
        return tf.tile(mask, mult)
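To see what the causal mask looks like, here is its value for a toy batch with a sequence of length 4: row i has ones only in columns 0 through i, so each position can attend to itself and to earlier positions, never to later ones:

demo_decoder = TransformerDecoder(embed_dim=256, latent_dim=2048, num_heads=8)
print(demo_decoder.get_causal_attention_mask(tf.zeros((1, 4, 256)))[0])
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]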
Complete Transformer Model
Finally, we combine all components into a complete sequence-to-sequence model:
embed_dim = 256
latent_dim = 2048
num_heads = 8
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)
encoder = keras.Model(encoder_inputs, encoder_outputs)
decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoded_seq_inputs)
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)
decoder_outputs = decoder([decoder_inputs, encoder_outputs])
transformer = keras.Model(
[encoder_inputs, decoder_inputs], decoder_outputs, name="transformer"
)
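It's a good habit to print a summary at this point and confirm the encoder and decoder are wired together as expected (the exact parameter count depends on the vocabulary size and embedding dimension chosen above):

transformer.summary()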
Step 3: Training the Model
Now we can train our transformer. We'll use sparse categorical crossentropy as our loss function since we're predicting word indices:
transformer.compile(
optimizer="rmsprop",
loss="sparse_categorical_crossentropy",
metrics=["accuracy"]
)
transformer.fit(train_ds, epochs=30, validation_data=val_ds)
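Thirty epochs can take a while, so in practice I like to checkpoint the best weights and stop early once validation loss stops improving. Here is a sketch of that variant of the fit call above (the file name translator.weights.h5 is just an example):

callbacks = [
    # Keep only the weights with the best validation loss seen so far
    keras.callbacks.ModelCheckpoint(
        "translator.weights.h5", save_best_only=True, save_weights_only=True
    ),
    # Give up if validation loss hasn't improved for 5 consecutive epochs
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5),
]
transformer.fit(train_ds, epochs=30, validation_data=val_ds, callbacks=callbacks)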
Step 4: Inference - Translating New Sentences
To translate a new sentence, we use a technique called "greedy decoding" - generating one word at a time by always choosing the most probable next word:
spa_vocab = spa_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20
def decode_sequence(input_sentence):
    tokenized_input_sentence = eng_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = spa_vectorization([decoded_sentence])[:, :-1]
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence.replace("[start] ", "").replace(" [end]", "")
Let's test our model with some example translations:
test_sentences = [
"I love machine learning",
"The weather is beautiful today",
"Can you help me with this project?",
"What time is the meeting?"
]
for sentence in test_sentences:
    translation = decode_sequence(sentence)
    print(f"English: {sentence}")
    print(f"Spanish: {translation}\n")
Key Takeaways
Building a transformer-based translation system taught me several important lessons:
- Data preprocessing is crucial - Clean, well-formatted parallel text makes a huge difference in translation quality
- Attention mechanisms are powerful - They allow the model to focus on relevant parts of the input when generating each output word
- Positional encoding matters - Since transformers process all words simultaneously, they need explicit position information
- Inference requires careful implementation - Greedy decoding is simple but effective for generating translations
Next Steps
This implementation serves as a solid foundation, but there's room for improvement:
- Implement beam search instead of greedy decoding for better translations (a rough sketch follows this list)
- Add attention visualization to understand what the model focuses on
- Try byte-pair encoding (BPE) to handle unknown words better
- Experiment with pre-trained models like BERT or GPT for transfer learning
- Scale to larger datasets for improved performance on diverse text
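As a starting point for the first item, here is a rough beam search sketch built on the same greedy-decoding scaffolding. Instead of committing to one word per step, it keeps the beam_width best partial translations ranked by cumulative log-probability (beam_search_decode and everything inside it are illustrative names, not part of the code above):

def beam_search_decode(input_sentence, beam_width=3):
    tokenized_input = eng_vectorization([input_sentence])
    # Each hypothesis is (partial translation, cumulative log-probability, finished?)
    beams = [("[start]", 0.0, False)]
    for i in range(max_decoded_sentence_length):
        candidates = []
        for text, score, finished in beams:
            if finished:
                candidates.append((text, score, True))
                continue
            tokenized_target = spa_vectorization([text])[:, :-1]
            predictions = transformer([tokenized_input, tokenized_target])
            log_probs = np.log(np.asarray(predictions[0, i, :]) + 1e-9)
            # Expand this hypothesis with its beam_width most likely next words
            for token_index in np.argsort(log_probs)[-beam_width:]:
                token = spa_index_lookup[int(token_index)]
                candidates.append(
                    (text + " " + token, score + log_probs[token_index], token == "[end]")
                )
        # Keep only the overall best beam_width hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(finished for _, _, finished in beams):
            break
    best = beams[0][0]
    return best.replace("[start] ", "").replace(" [end]", "")

print(beam_search_decode("What time is the meeting?"))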
Neural machine translation has come a long way, and transformers have been at the forefront of this revolution. Whether you're building a production translation system or just exploring NLP, understanding these architectures opens up a world of possibilities in language AI.
The complete code for this project is available on my GitHub. Feel free to experiment with different languages, dataset sizes, and model architectures!