
We’ve used tokenizers to prepare data for AI models. But how do they work? If you’re interested in a deeper understanding, we’ll explore how Hugging Face Tokenizers process and split text and why “PythonProHub” becomes ['python', '##pro', '##hub']?
This is the magic of Subword Tokenization. Old tokenizers just split on spaces, which created millions of unique words (e.g., “run”, “running”, “ran”).
Modern tokenizers like BPE (Byte-Pair Encoding) and WordPiece (used by BERT) are smarter. They break words down into common sub-pieces.
Why Subwords are Better
- Smaller Vocabulary: The model only needs to know ~30,000 subword pieces instead of 1,000,000+ words.
- Handles Unknown Words: It can understand a new word like “PythonProHub” by breaking it into
python+pro+hub. - Understands Morphology: It sees that “running” and “ran” both share the “run” token.
How a Tokenizer Really Works
Let’s look at the full process.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Welcome to PythonProHub!"
# 1. The .tokenize() method gives you the string tokens
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# Output: ['welcome', 'to', 'python', '##pro', '##hub', '!']
# Notice the "##" which means "attached to the previous word".
# 2. The .encode() method gives you the final numbers
ids = tokenizer.encode(text)
print(f"Token IDs: {ids}")
# Output: [101, 7315, 2000, 18750, 22290, 22739, 999, 102]
# The '101' is the [CLS] (start) token and '102' is the [SEP] (end) token.
# 3. The .decode() method goes backwards
decoded_text = tokenizer.decode(ids)
print(f"Decoded: {decoded_text}")
# Output: [CLS] welcome to pythonprohub! [SEP]This process of converting text to numbers is the fundamental first step for every modern NLP model.
Key Takeaways
- Tokenizers prepare data for AI models by breaking words into subword pieces through techniques like Subword Tokenization.
- Unlike traditional tokenizers, modern ones like BPE and WordPiece reduce vocabulary size and manage unknown words effectively.
- Subwords enable models to understand variations of words, improving their efficiency and comprehension in NLP tasks.





