Transformers library, model hub, tokenizers, datasets, pipelines, fine-tuning, and deployment with HF.
The Hugging Face Transformers library provides thousands of pre-trained models for NLP, vision, audio, and multimodal tasks. It is one of the most widely used open-source ML libraries, with well over 100K GitHub stars.
from transformers import (AutoTokenizer, AutoModel, AutoModelForSequenceClassification,
AutoModelForCausalLM, AutoModelForMaskedLM, pipeline)
# ── Load a Pre-trained Model ──
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# ── Text Classification ──
classifier = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english"
)
# ── Text Generation (LLM) ──
generator = AutoModelForCausalLM.from_pretrained("gpt2")
generator_tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = generator_tokenizer("Once upon a time", return_tensors="pt")
outputs = generator.generate(**inputs, max_length=50, do_sample=True,
temperature=0.7, top_k=50, top_p=0.9)
print(generator_tokenizer.decode(outputs[0]))
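The sampling flags above (temperature, top_k, top_p) can be illustrated with a pure-Python sketch over a toy next-token distribution (illustrative logits, not GPT-2's real outputs):

```python
import math, random

# Toy next-token logits (hypothetical values for illustration)
logits = {"dragon": 2.0, "princess": 1.5, "the": 1.0, "xylophone": -3.0}

def sample(logits, temperature=0.7, top_k=2, seed=0):
    # Lower temperature sharpens the distribution before filtering
    scaled = {t: l / temperature for t, l in logits.items()}
    # top-k: keep only the k highest-scoring tokens
    top = dict(sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k])
    # Softmax over the survivors, then sample
    z = sum(math.exp(v) for v in top.values())
    probs = {t: math.exp(v) / z for t, v in top.items()}
    rng = random.Random(seed)
    return rng.choices(list(probs), weights=list(probs.values()))[0]

print(sample(logits))  # one of the two top-k tokens; "xylophone" is filtered out
```

top_p works the same way except the cutoff is cumulative probability mass rather than a fixed count.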
# ── Fill-Mask (MLM) ──
from transformers import pipeline
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
result = fill_mask("The capital of France is [MASK].")
# Paris (0.98), Lyon (0.01), ...
| Architecture | Type | Key Models | Best For |
|---|---|---|---|
| BERT | Encoder | bert-base, distilbert, roberta, deberta | Classification, NER, QA, feature extraction |
| GPT-2/GPT-J | Decoder | gpt2, gpt-j-6b, gpt-neo | Text generation, story writing |
| T5 | Encoder-Decoder | t5-small, t5-base, t5-large, flan-t5 | Translation, summarization, multi-task |
| BART | Encoder-Decoder | bart-large, facebook/bart-large-cnn | Summarization, text generation |
| LLaMA 2/3 | Decoder | meta-llama/Meta-Llama-3-8B | Chat, instruction following, reasoning |
| Mistral | Decoder | mistralai/Mistral-7B | Efficient open-source chat models |
| ViT | Vision Transformer | google/vit-base-patch16-224 | Image classification |
| Whisper | Encoder-Decoder | openai/whisper-large-v3 | Speech-to-text (multilingual) |
| CLIP | Vision+Text | openai/clip-vit-large-patch14 | Zero-shot image classification |
| Stable Diffusion | Diffusion | stabilityai/stable-diffusion-xl | Image generation |
Tokenizers convert text to numbers that models understand. Hugging Face supports WordPiece (BERT), BPE (GPT), SentencePiece (T5, LLaMA), and Unigram tokenization.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# ── Basic Tokenization ──
text = "Hello, how are you doing today?"
inputs = tokenizer(text)
print(tokens := tokenizer.tokenize(text))
# ['hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']
print(f"Token IDs: {inputs['input_ids']}")
print(f"Attention mask: {inputs['attention_mask']}")
print(f"Vocab size: {tokenizer.vocab_size}") # 30522
# ── Decode back to text ──
decoded = tokenizer.decode(inputs['input_ids'], skip_special_tokens=True)
print(decoded) # "hello, how are you doing today?" (without skip_special_tokens, [CLS]/[SEP] appear)
# ── Batch Tokenization ──
texts = ["First sentence here.", "Second sentence is longer."]
batch = tokenizer(texts, padding=True, truncation=True,
max_length=16, return_tensors="pt")
print(batch['input_ids'].shape) # torch.Size([2, 16])
# ── Special Tokens ──
print(f"PAD: {tokenizer.pad_token}") # [PAD]
print(f"UNK: {tokenizer.unk_token}") # [UNK]
print(f"CLS: {tokenizer.cls_token}") # [CLS]
print(f"SEP: {tokenizer.sep_token}") # [SEP]
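A toy sketch of how the special tokens, padding, and attention masks fit together (hypothetical six-entry vocabulary, not BERT's real WordPiece vocab):

```python
# Hypothetical vocabulary for illustration only
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4, "world": 5}

def encode(words, max_length):
    # Bracket the sequence with [CLS]/[SEP], map unknowns to [UNK]
    ids = [VOCAB["[CLS]"]] + [VOCAB.get(w, VOCAB["[UNK]"]) for w in words] + [VOCAB["[SEP]"]]
    mask = [1] * len(ids)
    while len(ids) < max_length:   # right-pad to max_length
        ids.append(VOCAB["[PAD]"])
        mask.append(0)             # padded positions are masked out of attention
    return ids, mask

ids, mask = encode(["hello", "world"], max_length=6)
print(ids)   # [2, 4, 5, 3, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

This is exactly the contract behind `input_ids` and `attention_mask` in the batch example above.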
# ── Custom Tokenizer (Training from scratch) ──
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["file1.txt", "file2.txt"], trainer)
| Type | Algorithm | Used By | Handling Unknown Words |
|---|---|---|---|
| WordPiece | Greedy longest-match first | BERT, DistilBERT | Subword tokenization, rare words split into pieces |
| BPE | Frequency-based merging | GPT-2, GPT-J, RoBERTa | Iteratively merges most frequent pairs |
| SentencePiece | BPE or Unigram on raw text | T5, LLaMA, Gemini | No pre-tokenization needed, works on any script |
| Unigram | Probabilistic, remove least likely | XLNet, ALBERT | Probabilistic model of subword occurrences |
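To make the BPE row concrete, here is a pure-Python sketch of a single merge step — find the most frequent adjacent pair and fuse it (toy corpus; not the real `tokenizers` implementation):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol tuples."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the fused symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus: word -> frequency, each word as a tuple of characters
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("hugs"): 5}
pair = most_frequent_pair(corpus)     # ('u', 'g') appears 20 times
corpus = merge_pair(corpus, pair)
print(list(corpus))                   # [('h', 'ug'), ('p', 'ug'), ('h', 'ug', 's')]
```

Real BPE training simply repeats this step until the target vocab size is reached; WordPiece differs mainly in scoring merges by likelihood rather than raw frequency.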
The pipeline API is the highest-level interface in Transformers, offering one-line solutions for common NLP, vision, and audio tasks.
from transformers import pipeline
# ── Sentiment Analysis ──
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product! It's amazing.")
# [{'label': 'POSITIVE', 'score': 0.999}]
# Batch processing
results = classifier(["Great movie!", "Terrible experience.", "It was okay."])
# ── Text Generation ──
generator = pipeline("text-generation", model="gpt2")
output = generator("In the year 2050,", max_length=100, num_return_sequences=2)
# ── Question Answering ──
qa = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")
answer = qa(question="Who invented Python?", context="Python was created by Guido van Rossum.")
# {'answer': 'Guido van Rossum', 'score': 0.95}
# ── Named Entity Recognition ──
ner = pipeline("ner", aggregation_strategy="simple")  # merges subword tokens into entities
entities = ner("Elon Musk founded Tesla in Palo Alto, California.")
# [{'entity_group': 'PER', 'word': 'Elon Musk'}, {'entity_group': 'ORG', 'word': 'Tesla'}, ...]
# ── Summarization ──
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer("Long text to summarize...", max_length=50, min_length=25)
# ── Translation ──
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Hello, how are you today?")
# ── Image Classification ──
vision = pipeline("image-classification", model="google/vit-base-patch16-224")
result = vision("photo.jpg") # [{"score": 0.89, "label": "golden retriever"}, ...]
# ── Text-to-Speech ──
tts = pipeline("text-to-speech", model="suno/bark")
audio = tts("Hello, I am an AI assistant.")
| Task | Pipeline Name | Default Model | Input |
|---|---|---|---|
| Sentiment Analysis | sentiment-analysis | distilbert-base-uncased-finetuned-sst-2-english | Text string/list |
| Text Generation | text-generation | gpt2 | Prompt string |
| NER | ner | dbmdz/bert-large-cased-finetuned-conll03-english | Text string |
| Question Answering | question-answering | distilbert-base-cased-distilled-squad | Question + context |
| Summarization | summarization | bart-large-cnn | Long text |
| Translation | translation_xx_to_yy | opus-mt models | Text string |
| Feature Extraction | feature-extraction | bert-base | Text/list |
| Text Classification | text-classification | distilbert | Text string |
| Zero-Shot Classification | zero-shot-classification | bart-large-mnli | Text + labels |
| Image Classification | image-classification | vit-base | Image file/URL |
The Hugging Face Datasets library provides access to 100K+ datasets with a unified API for loading, processing, and sharing datasets.
from datasets import load_dataset, Dataset, DatasetDict
# ── Load from Hub ──
dataset = load_dataset("imdb")
print(dataset) # DatasetDict with train/test splits
print(dataset['train'][0]) # First example
print(dataset['train'].features) # Column types
# ── Load specific splits ──
dataset = load_dataset("squad", split="train")
# ── Streaming for large datasets ──
dataset = load_dataset("allenai/c4", "en", streaming=True)  # c4 now lives under allenai/c4
for example in dataset.take(5):
    print(example['text'][:100])
# ── Custom Dataset ──
import pandas as pd
df = pd.DataFrame({"text": texts, "label": labels})  # texts, labels: your own lists
dataset = Dataset.from_pandas(df)
# From list of dicts
data = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
dataset = Dataset.from_list(data)
# ── Process / Map ──
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length",
                     truncation=True, max_length=128)
tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=4)
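With `batched=True`, the map function receives a dict of columns (lists) rather than single rows, which is what lets tokenizers process many texts at once. A pure-Python sketch of that contract (not the datasets library itself):

```python
# Rows as the datasets library stores them conceptually
rows = [{"text": "good movie", "label": 1}, {"text": "bad movie", "label": 0}]

def to_batch(rows):
    # Columnar view: {"text": [...], "label": [...]}
    return {k: [r[k] for r in rows] for k in rows[0]}

def add_length(batch):
    # A batched "map" function: reads lists, returns new columns as lists
    batch["n_chars"] = [len(t) for t in batch["text"]]
    return batch

batch = add_length(to_batch(rows))
print(batch["n_chars"])  # [10, 9]
```

Returned columns are merged back into the dataset, which is how `tokenize_function` above adds `input_ids` and `attention_mask` columns.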
# ── Filter and Shuffle ──
filtered = dataset.filter(lambda x: len(x['text']) > 100)
shuffled = dataset.shuffle(seed=42)
# ── Train/Val Split ──
split = dataset['train'].train_test_split(test_size=0.2, seed=42)
train_data = split['train']
val_data = split['test']
# ── Save and Load ──
dataset.save_to_disk("my_dataset")
loaded = Dataset.load_from_disk("my_dataset")
| Dataset | Task | Size | Description |
|---|---|---|---|
| SQuAD 2.0 | QA | 150K QA pairs | Reading comprehension with unanswerable questions |
| GLUE | Classification | 340K examples | Benchmark: SST-2, MNLI, QQP, QNLI, etc. |
| SuperGLUE | Advanced NLP | Various | Advanced NLU tasks |
| IMDB | Sentiment | 50K reviews | Binary sentiment (positive/negative) |
| CNN/DailyMail | Summarization | 300K articles | News article summarization |
| WMT | Translation | 40M sentence pairs | Machine translation (multiple languages) |
| Common Crawl | Pre-training | Petabytes | Web crawl data for LLM pretraining |
| The Pile | Pre-training | 825 GB | Diverse pre-training dataset by EleutherAI |
The Trainer API provides a high-level interface for training and fine-tuning models with minimal code. It handles training loops, evaluation, logging, and checkpointing.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
TrainingArguments, Trainer, DataCollatorWithPadding)
from datasets import load_dataset
import evaluate
import numpy as np
# ── Load Model & Tokenizer ──
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# ── Prepare Dataset ──
dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=512)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch")
# ── Load Model ──
model = AutoModelForSequenceClassification.from_pretrained(
model_name, num_labels=2
)
# ── Metrics ──
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
# ── Training Arguments ──
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
learning_rate=2e-5,
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="accuracy",
greater_is_better=True,
logging_steps=100,
fp16=True, # Mixed precision
gradient_accumulation_steps=2,
warmup_steps=500,
report_to="tensorboard",
)
# ── Train ──
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["test"],
compute_metrics=compute_metrics,
data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()
# ── Evaluate & Save ──
results = trainer.evaluate()
print(results)
trainer.save_model("./fine-tuned-model")
trainer.push_to_hub()  # repo name comes from TrainingArguments(hub_model_id=..., push_to_hub=True)
| Parameter | Default | Description | Recommended |
|---|---|---|---|
| per_device_train_batch_size | 8 | Batch size per GPU | 16-32 (fit GPU memory) |
| learning_rate | 5e-5 | Peak learning rate | 1e-5 to 5e-5 for fine-tuning |
| num_train_epochs | 3 | Training epochs | 3-5 for fine-tuning |
| weight_decay | 0 | L2 regularization | 0.01 for preventing overfitting |
| warmup_steps | 0 | LR warmup steps | 100-500 for stability |
| fp16 | False | Mixed precision | True on NVIDIA GPUs (V100+) |
| gradient_accumulation_steps | 1 | Accumulate gradients | 2-8 for larger effective batch |
| eval_strategy | no | When to evaluate | epoch or steps |
| lr_scheduler_type | linear | LR schedule | cosine for longer training |
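Several of these knobs interact: the effective batch size is the product of per-device batch size, gradient-accumulation steps, and GPU count, and warmup is usually sized as a fraction of total optimizer steps. A back-of-the-envelope sketch (illustrative numbers, not values from any particular run):

```python
per_device_batch = 16
grad_accum_steps = 2
num_gpus = 4

# Effective batch = what the optimizer actually sees per update
effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 128

# Rule of thumb: warm up for roughly 5-10% of total optimizer steps
dataset_size = 25_000
epochs = 3
steps_per_epoch = dataset_size // effective_batch
total_steps = steps_per_epoch * epochs
warmup_steps = int(0.06 * total_steps)
print(total_steps, warmup_steps)  # 585 35
```

This is why raising `gradient_accumulation_steps` is the standard trick for simulating a larger batch when GPU memory is the constraint.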
The Hugging Face Hub is the central platform for sharing models, datasets, and ML demos. Hosting hundreds of thousands of models and datasets, it is often described as the GitHub of machine learning.
# ── Install Hub library ──
# pip install huggingface_hub
from huggingface_hub import (HfApi, create_repo, upload_folder,
login, snapshot_download, ModelCard)
# ── Login ──
login(token="hf_xxxx") # Or: huggingface-cli login
# ── Create Repository ──
create_repo(repo_id="username/my-awesome-model", private=False)
# ── Upload Model ──
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("./local_model")
tokenizer = AutoTokenizer.from_pretrained("./local_model")
model.push_to_hub("username/my-awesome-model")
tokenizer.push_to_hub("username/my-awesome-model")
# ── Upload Folder ──
upload_folder(
repo_id="username/my-model",
folder_path="./model_files",
commit_message="Add model files",
)
# ── Download Model ──
model = AutoModelForCausalLM.from_pretrained("username/my-awesome-model")
# ── Download specific files ──
snapshot_download(
repo_id="username/my-model",
allow_patterns=["*.safetensors", "tokenizer.json"],
)
# ── Search Models ──
from huggingface_hub import list_models
models = list_models(search="text-classification", limit=10)
for m in models:
    print(f"{m.id}: {m.downloads} downloads, likes: {m.likes}")
PEFT (Parameter-Efficient Fine-Tuning) methods like LoRA enable training massive models on consumer GPUs by training only a small fraction of parameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# ── QLoRA: 4-bit quantization + LoRA ──
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id, quantization_config=bnb_config, device_map="auto",
)
model = prepare_model_for_kbit_training(model)
# ── LoRA Configuration ──
lora_config = LoraConfig(
r=16, # Rank (8-64 typical)
lora_alpha=32, # Scaling factor (2x rank)
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable: 0.12% of total parameters
# ── Train with SFTTrainer ──
dataset = load_dataset("tatsu-lab/alpaca", split="train")
training_args = SFTConfig(
    output_dir="./llama-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    max_seq_length=512,
    dataset_text_field="text",  # in recent trl, this lives in SFTConfig
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=lora_config,
)
trainer.train()
# ── Save & Merge ──
model.save_pretrained("./lora-adapter") # ~100MB
# Merge into full model
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(model_id)
merged = PeftModel.from_pretrained(base_model, "./lora-adapter")
merged = merged.merge_and_unload()
merged.save_pretrained("./merged-model") # full ~16GB model in fp16
| Method | Trainable Params | Memory | Quality | Best For |
|---|---|---|---|---|
| LoRA | 0.1-1% | Low (single GPU) | Good | General PEFT, most popular |
| QLoRA | 0.1-1% | Very Low (4-bit) | Good | Consumer GPUs, 7B-70B models |
| Prefix Tuning | ~0.1% | Low | Moderate | Generation tasks, small datasets |
| P-Tuning v2 | ~0.1% | Low | Moderate | NLU tasks, understanding |
| Adapter Layers | 1-5% | Moderate | Good | Multi-task, modular adapters |
| Full Fine-tuning | 100% | Very High | Best | Maximum quality, large compute |
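The tiny trainable fraction in the LoRA rows falls out of simple arithmetic: a rank-r adapter on a d_in x d_out weight adds only r*(d_in + d_out) parameters next to the frozen d_in*d_out base weight. A sketch with illustrative Llama-scale dimensions (hidden size 4096, r=16, single projection):

```python
d_in, d_out, r = 4096, 4096, 16

full_params = d_in * d_out        # frozen base weight W
lora_params = r * (d_in + d_out)  # adapter matrices A (r x d_in) + B (d_out x r)

print(full_params)                          # 16777216
print(lora_params)                          # 131072
print(f"{lora_params / full_params:.2%}")   # 0.78%
```

Summed over the targeted projections of every layer, this is where numbers like the "trainable: 0.12%" printout from `print_trainable_parameters()` come from.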
Essential Hugging Face interview questions covering the Transformers library, training, tokenization, and best practices.