AI Project: Manual Text Translation (Seq2Seq Model)

By Ahmed Nabil | May 15, 2026 | May 3, 2026

[Image: 3D isometric cutaway of a machine encoding text into energy and decoding it back into a new language, representing a Seq2Seq model.]

In a previous Hugging Face project, we used the translation pipeline, which is fast and easy. But what if you need more control, or want to see how it works under the hood? This is where translating manually with Hugging Face comes in handy.

The “manual” way involves loading the Tokenizer and the Model (a Sequence-to-Sequence model) separately.

Step 1: Installation

pip install transformers torch sentencepiece
# 'sentencepiece' is the tokenizer library these Helsinki-NLP models rely on
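Before moving on, it can help to confirm that the three packages are actually visible to your Python environment. The snippet below is a minimal sketch using only the standard library; the helper name installed_versions is our own invention for this check and is not part of any of the installed packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the three packages installed in Step 1
for name, ver in installed_versions(["transformers", "torch", "sentencepiece"]).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any line prints NOT INSTALLED, rerun the pip command in the same environment you use to run Python.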

Step 2: Load the Model and Tokenizer

We’ll use the same “Helsinki-NLP” model as before.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-en-es" # English to Spanish

# 1. Load the Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Load the Model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
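The model name above ends in the source and target language codes (en and es). As a rough sketch, a name for another pair can be built the same way. The helper opus_mt_model_name is hypothetical, and it assumes the pair you want follows this naming pattern; not every language pair has a published checkpoint, so check the Hugging Face Hub before relying on a generated name.

```python
def opus_mt_model_name(source_lang, target_lang):
    """Build a Helsinki-NLP OPUS-MT checkpoint name from two language codes.

    Assumption: the checkpoint follows the 'opus-mt-<source>-<target>' pattern
    used by the model in this tutorial. The name is not checked against the Hub.
    """
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


print(opus_mt_model_name("en", "es"))  # Helsinki-NLP/opus-mt-en-es
```

The returned string can be passed to both from_pretrained calls in place of model_name.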

Step 3: Tokenize, Generate, Decode

This is the three-step process:

Tokenize: Convert our English text into numbers (input IDs) the model understands.
Generate: Feed those IDs into the model, which outputs a new sequence of IDs representing the Spanish translation.
Decode: Convert the output IDs back into human-readable Spanish text.
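To make "text into numbers" concrete, here is a toy sketch of the Tokenize and Decode steps only. The vocabulary, the whole-word splitting, and the function names toy_encode and toy_decode are all made up for illustration. The real tokenizer works on subword pieces from a far larger vocabulary, and the Generate step in between needs the actual model.

```python
# Made-up vocabulary for illustration only; not the real model's vocabulary
VOCAB = {"<unk>": 0, "</s>": 1, "python": 2, "is": 3, "the": 4, "best": 5, "language": 6}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}
SPECIAL_TOKENS = {"<unk>", "</s>"}


def toy_encode(text):
    """Lowercase, split on spaces, map each word to an ID, and append end-of-sequence."""
    ids = [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]
    return ids + [VOCAB["</s>"]]


def toy_decode(ids, skip_special_tokens=True):
    """Map IDs back to words, optionally dropping the special tokens."""
    tokens = [ID_TO_TOKEN[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return " ".join(tokens)


ids = toy_encode("Python is the best language")
print(ids)              # [2, 3, 4, 5, 6, 1]
print(toy_decode(ids))  # python is the best language
```

The skip_special_tokens flag here mirrors the idea behind the argument of the same name used in the real decode call: markers such as end-of-sequence are useful to the model but should not appear in the final text.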

text = "Python is the best programming language for data science."

# 1. Tokenize (convert text to numbers)
# 'return_tensors="pt"' tells it to give us PyTorch Tensors
inputs = tokenizer(text, return_tensors="pt")

# 2. Generate (the model produces the token IDs of the translation)
output_tokens = model.generate(**inputs)

# 3. Decode (convert numbers back to text)
# batch_decode returns one string per sequence; [0] takes our single translation
translation = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]

print("--- Manual Translation ---")
print(f"Original: {text}")
print(f"Translation: {translation}")

Output:

--- Manual Translation ---
Original: Python is the best programming language for data science.
Translation: Python es el mejor lenguaje de programación para la ciencia de datos.
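The same three steps also work on several sentences at once, which is why the decode call is named batch_decode. The function below is a minimal sketch, not the only way to do this; translate_batch is a name chosen for this example, and it expects the tokenizer and model objects loaded in Step 2 to be passed in.

```python
def translate_batch(texts, tokenizer, model):
    """Translate a list of strings with a single generate() call."""
    # padding=True pads shorter sentences so the batch forms one rectangular tensor
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    output_tokens = model.generate(**inputs)
    # batch_decode returns one string per input sentence, in the same order
    return tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
```

With the objects from Step 2, calling translate_batch(["Good morning.", "How are you?"], tokenizer, model) should return a list of two Spanish strings.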

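This method gives you far more control: you can pass generation parameters straight to model.generate(), such as max_length (a cap on the length of the output) or num_beams (beam search, which keeps several candidate translations in play and often improves quality at the cost of speed).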
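One way to experiment with these parameters is to wrap the three steps in a small function that forwards extra keyword arguments to generate(). This is a sketch under the assumption that tokenizer and model are the objects loaded in Step 2; the function name translate is our own choice.

```python
def translate(text, tokenizer, model, **generate_kwargs):
    """Tokenize, generate, and decode one string.

    Any extra keyword arguments (for example num_beams or max_length)
    are forwarded unchanged to model.generate().
    """
    inputs = tokenizer(text, return_tensors="pt")
    output_tokens = model.generate(**inputs, **generate_kwargs)
    return tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]
```

For example, translate(text, tokenizer, model, num_beams=4, max_length=64) runs a beam search with four beams and limits the output to 64 tokens. Both values are arbitrary starting points for experimentation, not recommendations.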

Key Takeaways

Translating manually with Hugging Face gives you more control than the translation pipeline.
You load the Tokenizer and the Model separately, here using the Helsinki-NLP English-to-Spanish model.
The translation process consists of three steps: Tokenize, Generate, and Decode.
Tokenize converts English text into input IDs, Generate produces the IDs of the Spanish translation, and Decode turns them back into readable text.
Passing parameters such as max_length and num_beams to the generate call lets you tune output length and search quality.

Ahmed Nabil

Python Engineer and founder of Python Pro Hub. With a focus on modern data science (Polars), backend architecture (FastAPI/Django), and automation, he builds production-grade tutorials designed to take developers from absolute beginners to advanced software engineers.