
In a previous Hugging Face project, we used the translation pipeline, which is fast and easy. But what if you need more control, or want to see how it works under the hood? This is where the manual translation approach comes in handy.
The “manual” way involves loading the Tokenizer and the Model (a Sequence-to-Sequence model) separately.
Step 1: Installation
pip install transformers torch sentencepiece # 'sentencepiece' is the tokenizer used by these models
Step 2: Load the Model and Tokenizer
We’ll use the same “Helsinki-NLP” model as before.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = "Helsinki-NLP/opus-mt-en-es"  # English to Spanish
# 1. Load the Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# 2. Load the Model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
Step 3: Tokenize, Generate, Decode
This is the three-step process:
- Tokenize: Convert our English text into numbers (input IDs) the model understands.
- Generate: Feed those numbers into the model, which outputs new numbers (the Spanish translation).
- Decode: Convert those new numbers back into human-readable Spanish text.
text = "Python is the best programming language for data science."
# 1. Tokenize (convert text to numbers)
# 'return_tensors="pt"' tells it to give us PyTorch Tensors
inputs = tokenizer(text, return_tensors="pt")
# 2. Generate (the AI "thinks")
# This generates the new token IDs
output_tokens = model.generate(**inputs)
# 3. Decode (convert numbers back to text)
translation = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]
print("--- Manual Translation ---")
print(f"Original: {text}")
print(f"Translation: {translation}")
Output:
--- Manual Translation ---
Original: Python is the best programming language for data science.
Translation: Python es el mejor lenguaje de programación para la ciencia de datos.
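To make the "numbers" from the Tokenize step concrete, here is a small sketch that returns the input IDs alongside the subword pieces they stand for. The name show_tokens is a hypothetical helper, not part of transformers; it assumes the tokenizer loaded in Step 2 and relies on the tokenizer's convert_ids_to_tokens method.

```python
def show_tokens(text, tokenizer):
    """Return (input_ids, tokens) for `text` so the mapping can be inspected.

    Hypothetical helper; `tokenizer` is assumed to be the one from Step 2.
    """
    # Without return_tensors, the tokenizer returns plain Python lists
    input_ids = tokenizer(text)["input_ids"]
    # Map each ID back to the subword piece it represents
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    return input_ids, tokens
```

Printing the result of show_tokens(text, tokenizer) should show one ID per subword piece, typically with an end-of-sequence token at the end.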
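The manual approach gives you far more control than the pipeline: you can pass generation parameters such as max_length (a cap on the output length) or num_beams (the beam search width, which usually improves quality at the cost of speed) directly to model.generate().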
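As a minimal sketch of how those parameters might be passed, the helper below wraps the three steps and forwards num_beams and max_length to model.generate(). The name translate_with_options and its default values are assumptions for illustration only; tokenizer and model are the objects loaded in Step 2.

```python
def translate_with_options(text, tokenizer, model, num_beams=4, max_length=64):
    """Translate one string, forwarding search settings to generate().

    Hypothetical helper: the defaults are illustrative, not recommendations.
    """
    # 1. Tokenize: text -> PyTorch tensors of input IDs
    inputs = tokenizer(text, return_tensors="pt")
    # 2. Generate with explicit search settings
    output_tokens = model.generate(
        **inputs,
        num_beams=num_beams,    # number of candidate sequences kept per step
        max_length=max_length,  # upper bound on the output length in tokens
    )
    # 3. Decode: token IDs -> text (first and only item in the batch)
    return tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]
```

With the tokenizer and model from Step 2, calling translate_with_options(text, tokenizer, model, num_beams=1) would normally fall back to greedy decoding, which is faster but may give a less fluent result.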
Key Takeaways
- The article discusses Hugging Face Manual Translation for more control over translation tasks.
- It covers the method of loading the Tokenizer and Model separately, specifically using the Helsinki-NLP model.
- The translation process consists of three steps: Tokenize, Generate, and Decode.
- Tokenize converts English text into numbers for the model, Generate produces new numbers for the target language, and Decode turns them back into readable text.
- This method offers flexibility, allowing adjustments to parameters like max_length and num_beams for improved quality.





