Transformers library, model hub, tokenizers, datasets, pipelines, fine-tuning, and deployment with HF.
The Hugging Face Transformers library provides thousands of pre-trained models for NLP, vision, audio, and multimodal tasks. It is one of the most widely used open-source ML libraries, with well over 100K GitHub stars.
from transformers import (AutoTokenizer, AutoModel, AutoModelForSequenceClassification,
AutoModelForCausalLM, AutoModelForMaskedLM, pipeline)
# ── Load a Pre-trained Model ──
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# ── Text Classification ──
classifier = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english"
)
# ── Text Generation (LLM) ──
generator = AutoModelForCausalLM.from_pretrained("gpt2")
generator_tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = generator_tokenizer("Once upon a time", return_tensors="pt")
outputs = generator.generate(**inputs, max_length=50, do_sample=True,
temperature=0.7, top_k=50, top_p=0.9)
print(generator_tokenizer.decode(outputs[0]))
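The sampling flags above (temperature, top_k, top_p) can be illustrated with a pure-Python sketch over a toy next-token distribution (illustrative logits, not GPT-2's real outputs):

```python
import math, random

# Toy next-token logits (hypothetical values for illustration)
logits = {"dragon": 2.0, "princess": 1.5, "the": 1.0, "xylophone": -3.0}

def sample(logits, temperature=0.7, top_k=2, seed=0):
    # Lower temperature sharpens the distribution before filtering
    scaled = {t: l / temperature for t, l in logits.items()}
    # top-k: keep only the k highest-scoring tokens
    top = dict(sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k])
    # Softmax over the survivors, then sample
    z = sum(math.exp(v) for v in top.values())
    probs = {t: math.exp(v) / z for t, v in top.items()}
    rng = random.Random(seed)
    return rng.choices(list(probs), weights=list(probs.values()))[0]

print(sample(logits))  # one of the two top-k tokens; "xylophone" is filtered out
```

top_p works the same way except the cutoff is cumulative probability mass rather than a fixed count.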
# ── Fill-Mask (MLM) ──
from transformers import pipeline
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
result = fill_mask("The capital of France is [MASK].")
# Paris (0.98), Lyon (0.01), ...
| Architecture | Type | Key Models | Best For |
|---|---|---|---|
| BERT | Encoder | bert-base, distilbert, roberta, deberta | Classification, NER, QA, feature extraction |
| GPT-2/GPT-J | Decoder | gpt2, gpt-j-6b, gpt-neo | Text generation, story writing |
| T5 | Encoder-Decoder | t5-small, t5-base, t5-large, flan-t5 | Translation, summarization, multi-task |
| BART | Encoder-Decoder | bart-large, facebook/bart-large-cnn | Summarization, text generation |
| LLaMA 2/3 | Decoder | meta-llama/Meta-Llama-3-8B | Chat, instruction following, reasoning |
| Mistral | Decoder | mistralai/Mistral-7B | Efficient open-source chat models |
| ViT | Vision Transformer | google/vit-base-patch16-224 | Image classification |
| Whisper | Encoder-Decoder | openai/whisper-large-v3 | Speech-to-text (multilingual) |
| CLIP | Vision+Text | openai/clip-vit-large-patch14 | Zero-shot image classification |
| Stable Diffusion | Diffusion | stabilityai/stable-diffusion-xl | Image generation |
Tokenizers convert text to numbers that models understand. Hugging Face supports WordPiece (BERT), BPE (GPT), SentencePiece (T5, LLaMA), and Unigram tokenization.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# ── Basic Tokenization ──
text = "Hello, how are you doing today?"
inputs = tokenizer(text)
print(tokens := tokenizer.tokenize(text))
# ['hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']
print(f"Token IDs: {inputs['input_ids']}")
print(f"Attention mask: {inputs['attention_mask']}")
print(f"Vocab size: {tokenizer.vocab_size}") # 30522
# ── Decode back to text ──
decoded = tokenizer.decode(inputs['input_ids'], skip_special_tokens=True)
print(decoded) # "hello, how are you doing today?" (without skip_special_tokens, [CLS]/[SEP] appear)
# ── Batch Tokenization ──
texts = ["First sentence here.", "Second sentence is longer."]
batch = tokenizer(texts, padding=True, truncation=True,
max_length=16, return_tensors="pt")
print(batch['input_ids'].shape) # torch.Size([2, 16])
# ── Special Tokens ──
print(f"PAD: {tokenizer.pad_token}") # [PAD]
print(f"UNK: {tokenizer.unk_token}") # [UNK]
print(f"CLS: {tokenizer.cls_token}") # [CLS]
print(f"SEP: {tokenizer.sep_token}") # [SEP]
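A toy sketch of how the special tokens, padding, and attention masks fit together (hypothetical six-entry vocabulary, not BERT's real WordPiece vocab):

```python
# Hypothetical vocabulary for illustration only
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4, "world": 5}

def encode(words, max_length):
    # Bracket the sequence with [CLS]/[SEP], map unknowns to [UNK]
    ids = [VOCAB["[CLS]"]] + [VOCAB.get(w, VOCAB["[UNK]"]) for w in words] + [VOCAB["[SEP]"]]
    mask = [1] * len(ids)
    while len(ids) < max_length:   # right-pad to max_length
        ids.append(VOCAB["[PAD]"])
        mask.append(0)             # padded positions are masked out of attention
    return ids, mask

ids, mask = encode(["hello", "world"], max_length=6)
print(ids)   # [2, 4, 5, 3, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

This is exactly the contract behind `input_ids` and `attention_mask` in the batch example above.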
# ── Custom Tokenizer (Training from scratch) ──
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["file1.txt", "file2.txt"], trainer)
| Type | Algorithm | Used By | Handling Unknown Words |
|---|---|---|---|
| WordPiece | Greedy longest-match first | BERT, DistilBERT | Subword tokenization, rare words split into pieces |
| BPE | Frequency-based merging | GPT-2, GPT-J, RoBERTa | Iteratively merges most frequent pairs |
| SentencePiece | BPE or Unigram on raw text | T5, LLaMA, Gemini | No pre-tokenization needed, works on any script |
| Unigram | Probabilistic, remove least likely | XLNet, ALBERT | Probabilistic model of subword occurrences |
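To make the BPE row concrete, here is a pure-Python sketch of a single merge step — find the most frequent adjacent pair and fuse it (toy corpus; not the real `tokenizers` implementation):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol tuples."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the fused symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus: word -> frequency, each word as a tuple of characters
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("hugs"): 5}
pair = most_frequent_pair(corpus)     # ('u', 'g') appears 20 times
corpus = merge_pair(corpus, pair)
print(list(corpus))                   # [('h', 'ug'), ('p', 'ug'), ('h', 'ug', 's')]
```

Real BPE training simply repeats this step until the target vocab size is reached; WordPiece differs mainly in scoring merges by likelihood rather than raw frequency.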
The pipeline API is the highest-level interface in Transformers, offering one-line solutions for common NLP, vision, and audio tasks.
from transformers import pipeline
# ── Sentiment Analysis ──
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product! It's amazing.")
# [{'label': 'POSITIVE', 'score': 0.999}]
# Batch processing
results = classifier(["Great movie!", "Terrible experience.", "It was okay."])
# ── Text Generation ──
generator = pipeline("text-generation", model="gpt2")
output = generator("In the year 2050,", max_length=100, num_return_sequences=2)
# ── Question Answering ──
qa = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")
answer = qa(question="Who invented Python?", context="Python was created by Guido van Rossum.")
# {'answer': 'Guido van Rossum', 'score': 0.95}
# ── Named Entity Recognition ──
ner = pipeline("ner", aggregation_strategy="simple")  # merges subword tokens into entities
entities = ner("Elon Musk founded Tesla in Palo Alto, California.")
# [{'entity_group': 'PER', 'word': 'Elon Musk'}, {'entity_group': 'ORG', 'word': 'Tesla'}, ...]
# ── Summarization ──
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer("Long text to summarize...", max_length=50, min_length=25)
# ── Translation ──
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Hello, how are you today?")
# ── Image Classification ──
vision = pipeline("image-classification", model="google/vit-base-patch16-224")
result = vision("photo.jpg") # [{"score": 0.89, "label": "golden retriever"}, ...]
# ── Text-to-Speech ──
tts = pipeline("text-to-speech", model="suno/bark")
audio = tts("Hello, I am an AI assistant.")
| Task | Pipeline Name | Default Model | Input |
|---|---|---|---|
| Sentiment Analysis | sentiment-analysis | distilbert-base-uncased-finetuned-sst-2-english | Text string/list |
| Text Generation | text-generation | gpt2 | Prompt string |
| NER | ner | dbmdz/bert-large-cased-finetuned-conll03-english | Text string |
| Question Answering | question-answering | distilbert-base-cased-distilled-squad | Question + context |
| Summarization | summarization | bart-large-cnn | Long text |
| Translation | translation_xx_to_yy | opus-mt models | Text string |
| Feature Extraction | feature-extraction | bert-base | Text/list |
| Text Classification | text-classification | distilbert | Text string |
| Zero-Shot Classification | zero-shot-classification | bart-large-mnli | Text + labels |
| Image Classification | image-classification | vit-base | Image file/URL |
The Hugging Face Datasets library provides access to 100K+ datasets with a unified API for loading, processing, and sharing datasets.
from datasets import load_dataset, Dataset, DatasetDict
# ── Load from Hub ──
dataset = load_dataset("imdb")
print(dataset) # DatasetDict with train/test splits
print(dataset['train'][0]) # First example
print(dataset['train'].features) # Column types
# ── Load specific splits ──
dataset = load_dataset("squad", split="train")
# ── Streaming for large datasets ──
dataset = load_dataset("allenai/c4", "en", streaming=True)  # c4 now lives under allenai/c4
for example in dataset.take(5):
    print(example['text'][:100])
# ── Custom Dataset ──
import pandas as pd
df = pd.DataFrame({"text": texts, "label": labels})  # texts, labels: your own lists
dataset = Dataset.from_pandas(df)
# From list of dicts
data = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
dataset = Dataset.from_list(data)
# ── Process / Map ──
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length",
                     truncation=True, max_length=128)
tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=4)
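With `batched=True`, the map function receives a dict of columns (lists) rather than single rows, which is what lets tokenizers process many texts at once. A pure-Python sketch of that contract (not the datasets library itself):

```python
# Rows as the datasets library stores them conceptually
rows = [{"text": "good movie", "label": 1}, {"text": "bad movie", "label": 0}]

def to_batch(rows):
    # Columnar view: {"text": [...], "label": [...]}
    return {k: [r[k] for r in rows] for k in rows[0]}

def add_length(batch):
    # A batched "map" function: reads lists, returns new columns as lists
    batch["n_chars"] = [len(t) for t in batch["text"]]
    return batch

batch = add_length(to_batch(rows))
print(batch["n_chars"])  # [10, 9]
```

Returned columns are merged back into the dataset, which is how `tokenize_function` above adds `input_ids` and `attention_mask` columns.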
# ── Filter and Shuffle ──
filtered = dataset.filter(lambda x: len(x['text']) > 100)
shuffled = dataset.shuffle(seed=42)
# ── Train/Val Split ──
split = dataset['train'].train_test_split(test_size=0.2, seed=42)
train_data = split['train']
val_data = split['test']
# ── Save and Load ──
dataset.save_to_disk("my_dataset")
loaded = Dataset.load_from_disk("my_dataset")
| Dataset | Task | Size | Description |
|---|---|---|---|
| SQuAD 2.0 | QA | 150K QA pairs | Reading comprehension with unanswerable questions |
| GLUE | Classification | 340K examples | Benchmark: SST-2, MNLI, QQP, QNLI, etc. |
| SuperGLUE | Advanced NLP | Various | Advanced NLU tasks |
| IMDB | Sentiment | 50K reviews | Binary sentiment (positive/negative) |
| CNN/DailyMail | Summarization | 300K articles | News article summarization |
| WMT | Translation | 40M sentence pairs | Machine translation (multiple languages) |
| Common Crawl | Pre-training | Petabytes | Web crawl data for LLM pretraining |
| The Pile | Pre-training | 825 GB | Diverse pre-training dataset by EleutherAI |
The Trainer API provides a high-level interface for training and fine-tuning models with minimal code. It handles training loops, evaluation, logging, and checkpointing.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
TrainingArguments, Trainer, DataCollatorWithPadding)
from datasets import load_dataset
import evaluate
import numpy as np
# ── Load Model & Tokenizer ──
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# ── Prepare Dataset ──
dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=512)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch")
# ── Load Model ──
model = AutoModelForSequenceClassification.from_pretrained(
model_name, num_labels=2
)
# ── Metrics ──
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
# ── Training Arguments ──
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
learning_rate=2e-5,
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="accuracy",
greater_is_better=True,
logging_steps=100,
fp16=True, # Mixed precision
gradient_accumulation_steps=2,
warmup_steps=500,
report_to="tensorboard",
)
# ── Train ──
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["test"],
compute_metrics=compute_metrics,
data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()
# ── Evaluate & Save ──
results = trainer.evaluate()
print(results)
trainer.save_model("./fine-tuned-model")
trainer.push_to_hub()  # repo name comes from TrainingArguments(hub_model_id=..., push_to_hub=True)
| Parameter | Default | Description | Recommended |
|---|---|---|---|
| per_device_train_batch_size | 8 | Batch size per GPU | 16-32 (fit GPU memory) |
| learning_rate | 5e-5 | Peak learning rate | 1e-5 to 5e-5 for fine-tuning |
| num_train_epochs | 3 | Training epochs | 3-5 for fine-tuning |
| weight_decay | 0 | L2 regularization | 0.01 for preventing overfitting |
| warmup_steps | 0 | LR warmup steps | 100-500 for stability |
| fp16 | False | Mixed precision | True on NVIDIA GPUs (V100+) |
| gradient_accumulation_steps | 1 | Accumulate gradients | 2-8 for larger effective batch |
| eval_strategy | no | When to evaluate | epoch or steps |
| lr_scheduler_type | linear | LR schedule | cosine for longer training |
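Several of these knobs interact: the effective batch size is the product of per-device batch size, gradient-accumulation steps, and GPU count, and warmup is usually sized as a fraction of total optimizer steps. A back-of-the-envelope sketch (illustrative numbers, not values from any particular run):

```python
per_device_batch = 16
grad_accum_steps = 2
num_gpus = 4

# Effective batch = what the optimizer actually sees per update
effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 128

# Rule of thumb: warm up for roughly 5-10% of total optimizer steps
dataset_size = 25_000
epochs = 3
steps_per_epoch = dataset_size // effective_batch
total_steps = steps_per_epoch * epochs
warmup_steps = int(0.06 * total_steps)
print(total_steps, warmup_steps)  # 585 35
```

This is why raising `gradient_accumulation_steps` is the standard trick for simulating a larger batch when GPU memory is the constraint.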
The Hugging Face Hub is the central platform for sharing models, datasets, and ML demos. Hosting hundreds of thousands of models and datasets, it is often described as the GitHub of machine learning.
# ── Install Hub library ──
# pip install huggingface_hub
from huggingface_hub import (HfApi, create_repo, upload_folder,
login, snapshot_download, ModelCard)
# ── Login ──
login(token="hf_xxxx") # Or: huggingface-cli login
# ── Create Repository ──
create_repo(repo_id="username/my-awesome-model", private=False)
# ── Upload Model ──
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("./local_model")
tokenizer = AutoTokenizer.from_pretrained("./local_model")
model.push_to_hub("username/my-awesome-model")
tokenizer.push_to_hub("username/my-awesome-model")
# ── Upload Folder ──
upload_folder(
repo_id="username/my-model",
folder_path="./model_files",
commit_message="Add model files",
)
# ── Download Model ──
model = AutoModelForCausalLM.from_pretrained("username/my-awesome-model")
# ── Download specific files ──
snapshot_download(
repo_id="username/my-model",
allow_patterns=["*.safetensors", "tokenizer.json"],
)
# ── Search Models ──
from huggingface_hub import list_models
models = list_models(search="text-classification", limit=10)
for m in models:
    print(f"{m.id}: {m.downloads} downloads, likes: {m.likes}")
PEFT (Parameter-Efficient Fine-Tuning) methods like LoRA enable training massive models on consumer GPUs by training only a small fraction of parameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# ── QLoRA: 4-bit quantization + LoRA ──
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id, quantization_config=bnb_config, device_map="auto",
)
model = prepare_model_for_kbit_training(model)
# ── LoRA Configuration ──
lora_config = LoraConfig(
r=16, # Rank (8-64 typical)
lora_alpha=32, # Scaling factor (2x rank)
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable: 0.12% of total parameters
# ── Train with SFTTrainer ──
dataset = load_dataset("tatsu-lab/alpaca", split="train")
training_args = SFTConfig(
    output_dir="./llama-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    max_seq_length=512,
    dataset_text_field="text",  # in recent trl, this lives in SFTConfig
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=lora_config,
)
trainer.train()
# ── Save & Merge ──
model.save_pretrained("./lora-adapter") # ~100MB
# Merge into full model
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(model_id)
merged = PeftModel.from_pretrained(base_model, "./lora-adapter")
merged = merged.merge_and_unload()
merged.save_pretrained("./merged-model") # full ~16GB model in fp16
| Method | Trainable Params | Memory | Quality | Best For |
|---|---|---|---|---|
| LoRA | 0.1-1% | Low (single GPU) | Good | General PEFT, most popular |
| QLoRA | 0.1-1% | Very Low (4-bit) | Good | Consumer GPUs, 7B-70B models |
| Prefix Tuning | ~0.1% | Low | Moderate | Generation tasks, small datasets |
| P-Tuning v2 | ~0.1% | Low | Moderate | NLU tasks, understanding |
| Adapter Layers | 1-5% | Moderate | Good | Multi-task, modular adapters |
| Full Fine-tuning | 100% | Very High | Best | Maximum quality, large compute |
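The tiny trainable fraction in the LoRA rows falls out of simple arithmetic: a rank-r adapter on a d_in x d_out weight adds only r*(d_in + d_out) parameters next to the frozen d_in*d_out base weight. A sketch with illustrative Llama-scale dimensions (hidden size 4096, r=16, single projection):

```python
d_in, d_out, r = 4096, 4096, 16

full_params = d_in * d_out        # frozen base weight W
lora_params = r * (d_in + d_out)  # adapter matrices A (r x d_in) + B (d_out x r)

print(full_params)                          # 16777216
print(lora_params)                          # 131072
print(f"{lora_params / full_params:.2%}")   # 0.78%
```

Summed over the targeted projections of every layer, this is where numbers like the "trainable: 0.12%" printout from `print_trainable_parameters()` come from.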
Essential Hugging Face interview questions covering the Transformers library, training, tokenization, and best practices.