
AI Project: Multi-Label Text Classification with Hugging Face

3D isometric illustration of a document receiving multiple category tags simultaneously from robotic arms, representing multi-label classification.

We’ve done single-label classification (e.g., “POSITIVE” or “NEGATIVE”). But what if a text can be both? A news article could be about “POLITICS” and “FINANCE”. This is Multi-Label Classification. In this guide, we’ll explore how to do multi-label classification with Hugging Face for texts that belong to several categories at once.

Let’s build a model that can read a toxic comment and assign multiple labels (e.g., “toxic”, “insult”, “obscene”).

Step 1: Install & Load

We’ll use a model specifically trained for this.

pip install transformers torch

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "bhadresh-savani/distilbert-base-uncased-toxic-comment-multi-label"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

Step 2: Tokenize and Predict

text = "You are a stupid, idiotic jerk. I hate you."

# 1. Tokenize the text
inputs = tokenizer(text, return_tensors="pt")

# 2. Get the raw logits from the model
with torch.no_grad():
    logits = model(**inputs).logits

Step 3: Interpret the Results (Sigmoid)

For multi-label, we don’t use softmax. We use sigmoid to get the independent probability for each class.
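To see why sigmoid is the right choice here, it helps to compare the two activations on the same numbers. A quick standalone sketch with made-up logits (not from the real model): softmax forces all scores to compete for a total of 1.0, while sigmoid scores each label independently, so several labels can be “on” at once.

```python
import torch

# Hypothetical raw logits for three labels -- illustrative only,
# not output from the actual toxic-comment model.
logits = torch.tensor([[2.0, 1.0, -1.0]])

# Softmax: scores compete and must sum to 1,
# so effectively only one label can "win".
softmax_probs = torch.softmax(logits, dim=-1)
print(softmax_probs.sum())  # always sums to 1

# Sigmoid: each label gets its own independent 0-to-1 score,
# so multiple labels can exceed a threshold at the same time.
sigmoid_probs = torch.sigmoid(logits)
print(sigmoid_probs)
```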

# 3. Apply sigmoid to get probabilities (0.0 to 1.0) for each label
probabilities = torch.sigmoid(logits)
print(probabilities)
# Output: tensor([[0.99, 0.98, 0.99, 0.05, 0.99, 0.95]])

# 4. Get the model's labels
labels = model.config.id2label
# Output: {0: 'toxic', 1: 'severe_toxic', 2: 'obscene', ...}

# 5. Set a threshold (e.g., > 0.5) to see what's "on"
threshold = 0.5
results = {}
for i, prob in enumerate(probabilities[0]):
    if prob > threshold:
        results[labels[i]] = prob.item()

print(results)
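As an aside, the loop above can also be written as a vectorized boolean mask. A minimal sketch with stand-in values (in the article, `probabilities` comes from `torch.sigmoid(logits)` and `labels` from `model.config.id2label`; the label set below mirrors the one shown earlier):

```python
import torch

# Stand-in values for illustration only.
probabilities = torch.tensor([[0.99, 0.98, 0.99, 0.05, 0.99, 0.95]])
labels = {0: "toxic", 1: "severe_toxic", 2: "obscene",
          3: "threat", 4: "insult", 5: "identity_hate"}

threshold = 0.5
# Boolean mask of labels that cleared the threshold
mask = probabilities[0] > threshold
results = {labels[i]: probabilities[0][i].item()
           for i in mask.nonzero().flatten().tolist()}
print(results)  # "threat" (0.05) is filtered out
```

Both versions produce the same dictionary; the mask form just avoids the explicit Python loop.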

Final Output:

{'toxic': 0.99, 'severe_toxic': 0.98, 'obscene': 0.99, 'insult': 0.99, 'identity_hate': 0.95}

The model correctly identified the text as being toxic, obscene, and an insult, all at the same time.
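If you don’t need the raw logits, a similar result can be reached with the high-level `pipeline` API. Passing `top_k=None` returns a score for every label instead of just the best one, and `function_to_apply="sigmoid"` makes the activation explicit. A sketch (the 0.5 threshold filter at the end is our addition, not part of the API):

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="bhadresh-savani/distilbert-base-uncased-toxic-comment-multi-label",
    top_k=None,                    # return a score for every label
    function_to_apply="sigmoid",   # independent per-label probabilities
)

scores = classifier("You are a stupid, idiotic jerk. I hate you.")[0]
# Keep only the labels above our 0.5 threshold
results = {s["label"]: s["score"] for s in scores if s["score"] > 0.5}
print(results)
```

This trades fine-grained control (you no longer see the logits) for a much shorter script.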

Key Takeaways

  • This article discusses multi-label classification, where a text can have multiple relevant labels.
  • Building a model involves installing and loading a pre-trained model designed for this task.
  • The process includes tokenizing the text and using sigmoid instead of softmax to determine independent probabilities for each label.
  • The final output demonstrates the model successfully identifying a toxic comment as ‘toxic’, ‘obscene’, and an ‘insult’.
