
You’ve used Hugging Face pipelines to run pre-trained models, and that will take you a long way. But the real power comes when you “fine-tune” a model on your own data.
Fine-tuning takes a general-purpose model (like GPT-2) and turns it into an expert at a specific task (like classifying your company’s support tickets).
This is an advanced topic, so we’ll split it up. Part 1 is all about preparing the data.
Step 1: Install datasets
transformers works best with its companion library, datasets.
pip install transformers datasets
Step 2: Load a Dataset from the Hub
Let’s load the “imdb” dataset, a classic for sentiment analysis.
from datasets import load_dataset
# This downloads the dataset from the Hugging Face hub
dataset = load_dataset("imdb")
print(dataset)

Output:

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        ...
    })
})

Step 3: Load the Tokenizer
The data must be pre-processed with the exact same tokenizer as the model we want to fine-tune. Let’s use DistilBERT.
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
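As a quick sanity check (optional, and not part of the main pipeline), you can run the tokenizer on a single sentence to see what it produces. The snippet below reloads the tokenizer so it runs on its own:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize one sentence and inspect the result
encoded = tokenizer("This movie was great!")

# 'input_ids' are numeric token IDs; 'attention_mask' marks which positions are real tokens
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```

Notice the special [CLS] and [SEP] tokens the tokenizer adds at the start and end — the model expects them.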
Step 4: Create a Tokenizing Function
We need a function that will take a batch of our data and run the tokenizer on it.
def tokenize_function(batch):
    # 'padding="max_length"' ensures all sequences are the same size
    # 'truncation=True' cuts off sequences that are too long
    return tokenizer(batch["text"], padding="max_length", truncation=True)

Step 5: map() the Function
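Before running the function over the whole dataset, you can try it on a small hand-made batch in the same shape the dataset provides. This standalone sketch repeats the tokenizer and function definitions so it runs on its own:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

# A tiny fake batch in the same {'text': [...]} shape the dataset uses
batch = {"text": ["A wonderful film.", "Terrible. I walked out."]}
output = tokenize_function(batch)

# Thanks to padding="max_length", both examples come out the same length
print(len(output["input_ids"][0]), len(output["input_ids"][1]))
```

Both sentences are padded to the model's maximum length, so every row in the processed dataset has an identical shape.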
The .map() method does the heavy lifting: it runs our tokenize_function over the entire dataset. Passing batched=True feeds the function batches of examples rather than one at a time, which makes it much faster.
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# The tokenizer's outputs ('input_ids', 'attention_mask') are added as new columns
print(tokenized_datasets["train"][0])
Your data is now fully pre-processed and ready for training. In the next part, we’ll feed this into a Trainer to create our own custom AI model.





