Text Preprocessing, Tokenization, Word Embeddings, Transformers, Sentiment Analysis, NER, Text Generation — NLP mastery.
Text preprocessing transforms raw text into clean, structured data suitable for NLP models. Proper preprocessing is crucial — garbage in, garbage out.
import re
import string
import unicodedata
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# ── Basic Text Cleaning ──
text = "I LOOOVE this product!!! It's amazing. Check out https://example.com 🎉"
# Lowercase
text = text.lower()
# Remove URLs
text = re.sub(r'https?://\S+|www\.\S+', '', text)
# Remove HTML tags
text = re.sub(r'<[^>]+>', '', text)
# Remove special characters and punctuation
text = re.sub(r'[^\w\s]', '', text) # Keep alphanumeric + whitespace
# Alternative: strip punctuation only (redundant after the regex above)
text = text.translate(str.maketrans('', '', string.punctuation))
# Remove extra whitespace
text = re.sub(r'\s+', ' ', text).strip()
# Remove numbers (optional)
text = re.sub(r'\d+', '', text)
# Remove emojis
text = re.sub(r'[^ -~]+', '', text) # Keep printable ASCII only
# Or keep text but remove emoji:
emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags
    "]+", flags=re.UNICODE)
text = emoji_pattern.sub(r'', text)
# ── Unicode Normalization ──
text = unicodedata.normalize('NFKC', text) # Normalize special chars
# ── Remove Stopwords ──
stop_words = set(stopwords.words('english'))
# Add custom stopwords
stop_words.update(['said', 'would', 'could', 'also', 'us', 'one'])
tokens = word_tokenize(text)
filtered = [w for w in tokens if w not in stop_words]
# ── Stemming vs Lemmatization ──
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ['running', 'better', 'geese', 'corpora', 'am', 'are']
stemmed = [stemmer.stem(w) for w in words]
# ['run', 'better', 'gees', 'corpora', 'am', 'are']
lemmatized = [lemmatizer.lemmatize(w, pos='v') for w in words]
# pos='v' lemmatizes verbs: 'running' -> 'run', 'am' -> 'be', 'are' -> 'be'
# Nouns need pos='n': lemmatizer.lemmatize('geese', pos='n') -> 'goose',
# lemmatizer.lemmatize('corpora', pos='n') -> 'corpus'

| Technique | Description | Pros | Cons | When to Use |
|---|---|---|---|---|
| Lowercasing | Convert all text to lowercase | Reduces vocabulary size | Loses information (US vs us) | Almost always, unless case matters |
| Stopword Removal | Remove common words (the, is, at) | Reduces noise, focuses on content | Can lose sentiment info (not, no) | Information retrieval, classification |
| Stemming | Reduce words to root form (running -> run) | Fast, reduces vocabulary | Produces non-words, aggressive | Search engines, fast prototyping |
| Lemmatization | Reduce to dictionary form (geese -> goose) | Valid words, context-aware | Slower, needs POS tag | Quality-sensitive tasks, analysis |
| Tokenization | Split text into tokens (words/subwords) | Required for all NLP models | Language-specific challenges | Always, before any other processing |
| Spell Correction | Fix typos automatically | Improves model input quality | Can change meaning, slow | User-generated content, search |
Tokenization is the process of breaking text into discrete units (tokens) that can be processed by models. Modern NLP uses subword tokenization to keep vocabularies manageable while still handling out-of-vocabulary (OOV) words.
# ── Word Tokenization ──
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
text = "I can't believe it's not butter! 2.5 million users."
print(word_tokenize(text))
# ["I", "ca", "n't", "believe", "it", "'s", "not", "butter", "!", "2.5", "million", "users", "."]
# ── Sentence Tokenization ──
from nltk.tokenize import sent_tokenize
doc = "Hello world. How are you? I'm fine!"
print(sent_tokenize(doc))
# ["Hello world.", "How are you?", "I'm fine!"]
# ── Regular Expression Tokenizer ──
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+') # Only alphanumeric tokens
print(tokenizer.tokenize("Hello, world! 123"))
# ['Hello', 'world', '123']
# ── Subword Tokenization (HuggingFace Transformers) ──
from transformers import AutoTokenizer
# BERT Tokenizer (WordPiece)
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = bert_tokenizer("Hello, I love NLP!")
print(tokens.tokens())
# ['[CLS]', 'hello', ',', 'i', 'love', 'nl', '##p', '!', '[SEP]']
print(tokens.input_ids) # [101, 7592, 1010, 1045, 2293, 17953, 2361, 999, 102]
print(tokens.attention_mask) # [1, 1, 1, 1, 1, 1, 1, 1, 1]
# GPT-2 Tokenizer (BPE)
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = gpt_tokenizer("Hello, I love NLP!")
print(tokens.tokens())
# ['Hello', ',', 'ĠI', 'Ġlove', 'ĠNLP', '!']
# T5 Tokenizer (SentencePiece)
t5_tokenizer = AutoTokenizer.from_pretrained("t5-base")
tokens = t5_tokenizer("Hello, I love NLP!")
print(tokens.tokens())

| Algorithm | Vocab Type | Handles OOV? | Example | Used By |
|---|---|---|---|---|
| Word-level | Full words | No (UNK token) | ["I", "love", "NLP"] | Classic NLTK, spaCy |
| Character-level | Characters | Yes | ["H", "e", "l", "l", "o"] | Character CNNs, some TTS |
| BPE (Byte-Pair Encoding) | Subwords | Yes | ["lo", "v", "ing"] | GPT-2, GPT-3, RoBERTa |
| WordPiece | Subwords | Yes (##prefix) | ["un", "##believ", "##able"] | BERT, DistilBERT |
| SentencePiece (Unigram) | Subwords | Yes | ["▁Hello", "▁world"] | T5, XLNet, LLaMA |
| Tiktoken (BPE variant) | Byte-level | Yes | Byte-level BPE, efficient | GPT-4, Claude |
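The BPE row above can be made concrete with a toy merge-learning loop. This is a minimal sketch of the algorithm's core idea — repeatedly merge the most frequent adjacent symbol pair — not a production tokenizer; the corpus and merge count are made up:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a tiny corpus (illustration only)."""
    # Represent each word as a tuple of symbols, weighted by frequency
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Apply the merge everywhere it occurs
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = bpe_merges(["low", "lower", "lowest", "low", "slot"], num_merges=3)
print(merges[:2])  # [('l', 'o'), ('lo', 'w')] — frequent pairs become subwords
```

Real BPE tokenizers (GPT-2, RoBERTa) learn tens of thousands of merges over gigabytes of text, then apply them greedily at encoding time.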
Word embeddings are dense vector representations of words that capture semantic meaning. Words with similar meanings have similar embeddings, enabling models to understand relationships between words.
# ── Word2Vec (Gensim) ──
from gensim.models import Word2Vec
# Train Word2Vec
sentences = [
['the', 'cat', 'sat', 'on', 'the', 'mat'],
['the', 'dog', 'sat', 'on', 'the', 'floor'],
['the', 'bird', 'flew', 'over', 'the', 'house'],
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
workers=4, sg=1, epochs=100)
# sg=1: Skip-gram, sg=0: CBOW
# Word similarity
model.wv.most_similar('cat', topn=5)
# [('dog', 0.92), ('bird', 0.78), ...]
# Word arithmetic (king - man + woman ≈ queen)
model.wv.most_similar(positive=['king', 'woman'], negative=['man'])
# ── Pre-trained GloVe Embeddings ──
import numpy as np
def load_glove_embeddings(glove_path):
embeddings = {}
with open(glove_path, 'r', encoding='utf-8') as f:
for line in f:
parts = line.strip().split()
word = parts[0]
vector = np.array(parts[1:], dtype=np.float32)
embeddings[word] = vector
return embeddings
glove = load_glove_embeddings('glove.6B.300d.txt')
print(f"Vocabulary size: {len(glove)}") # 400,000
# ── Pre-trained FastText Embeddings ──
import fasttext
# FastText handles OOV by using character n-grams
ft_model = fasttext.load_model('cc.en.300.bin')
ft_model.get_word_vector('hello') # 300-dim vector
ft_model.get_word_vector('unheard') # Still works (subword info)

# ── Embedding Layer in PyTorch ──
import torch
import torch.nn as nn
# Random embedding layer
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=300)
input_ids = torch.tensor([1, 23, 456, 789, 0]) # Token IDs
embedded = embedding(input_ids)
print(embedded.shape) # (5, 300) - each token is a 300-dim vector
# ── Pre-trained Embedding Matrix ──
# Create embedding matrix from GloVe
vocab_size = 10000
embed_dim = 300
embedding_matrix = np.zeros((vocab_size, embed_dim))
word2idx = {}
for i, word in enumerate(sorted(glove.keys())[:vocab_size]):
word2idx[word] = i
embedding_matrix[i] = glove[word]
# Load into nn.Embedding
embedding = nn.Embedding.from_pretrained(
torch.tensor(embedding_matrix, dtype=torch.float),
    freeze=False # freeze=True keeps the pretrained vectors fixed (no fine-tuning)
)
# ── Sentence Embeddings (Mean Pooling) ──
def sentence_embedding(tokens, embedding_layer):
"""Average word embeddings to get sentence embedding"""
embedded = embedding_layer(tokens) # (seq_len, embed_dim)
mask = (tokens != 0).unsqueeze(-1).float() # Ignore padding
summed = (embedded * mask).sum(dim=0)
counts = mask.sum(dim=0).clamp(min=1e-9)
return summed / counts # (embed_dim,)

| Model | Dimensions | Vocab Size | OOV Handling | Best For |
|---|---|---|---|---|
| Word2Vec (Skip-gram) | 100-300 | Custom | No (UNK) | Task-specific word similarity |
| Word2Vec (CBOW) | 100-300 | Custom | No (UNK) | Faster training, frequent words |
| GloVe | 50/100/200/300 | 400K/2.2M | No (UNK) | General-purpose, pre-trained |
| FastText | 100-300 | 2M+ (subword) | Yes (character n-grams) | Morphological languages, OOV |
| ELMo | 1024 | Custom (char CNN) | Yes | Contextual embeddings (LSTM-based) |
| BERT embeddings | 768/1024 | 30K WordPiece | Yes (##subwords) | Contextual, sentence-level tasks |
| Sentence-BERT | 384/768 | BERT vocab | Yes | Sentence similarity, clustering |
| OpenAI text-embedding-3 | 1536/3072 | BPE | Yes | RAG, semantic search, clustering |
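"Similar words have similar embeddings" boils down to cosine similarity, and the famous analogies are vector arithmetic plus a nearest-neighbor search. A self-contained sketch with made-up 3-dimensional vectors (real embeddings have 100+ dimensions and learned values):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors, hand-made for illustration only
vec = {
    'king':  [0.8, 0.7, 0.1],
    'queen': [0.8, 0.1, 0.7],
    'man':   [0.2, 0.9, 0.1],
    'woman': [0.2, 0.2, 0.8],
    'apple': [0.9, 0.0, 0.0],
}

# king - man + woman should land nearest to queen
target = [k - m + w for k, m, w in zip(vec['king'], vec['man'], vec['woman'])]
best = max((w for w in vec if w not in ('king', 'man', 'woman')),
           key=lambda w: cosine(target, vec[w]))
print(best)  # 'queen'
```

Gensim's `most_similar(positive=..., negative=...)` shown earlier does exactly this search, over the full vocabulary and with normalized vectors.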
The Transformer architecture, introduced in "Attention Is All You Need" (2017), is the foundation of modern NLP. Self-attention mechanisms enable parallel processing and capture long-range dependencies.
from transformers import (AutoTokenizer, AutoModel, AutoModelForSequenceClassification,
AutoModelForTokenClassification, pipeline)
import torch
# ── BERT for Text Classification ──
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased", num_labels=2
)
# Encode input (a single string; passing two strings would encode them
# as a sentence *pair*, which changes the input format)
inputs = tokenizer(
    "This movie was absolutely fantastic and moving! I highly recommend it.",
    padding=True, truncation=True, max_length=512,
    return_tensors="pt"
)
# Predict
with torch.no_grad():
outputs = model(**inputs)
logits = outputs.logits
probs = torch.softmax(logits, dim=-1)
pred = torch.argmax(probs, dim=-1).item()
print(f"Prediction: {pred}, Probability: {probs[0, pred]:.4f}")
# ── BERT for Token Classification (NER) ──
ner_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
ner_tokenizer = AutoTokenizer.from_pretrained(ner_name)  # must match the model
ner_model = AutoModelForTokenClassification.from_pretrained(ner_name)
inputs = ner_tokenizer("John lives in New York City", return_tensors="pt")
outputs = ner_model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
tokens = ner_tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# ── Pipeline API (Simplest interface) ──
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("I love this product! Best purchase ever.")
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Named Entity Recognition
ner_pipe = pipeline("ner", grouped_entities=True)
entities = ner_pipe("Apple CEO Tim Cook announced the iPhone in California.")
# [{'entity_group': 'ORG', 'word': 'Apple'}, {'entity_group': 'PER', 'word': 'Tim Cook'}, ...]
# Question Answering
qa_pipe = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
answer = qa_pipe(question="What is the capital of France?",
context="France is a country in Europe. Its capital is Paris.")
# {'answer': 'Paris', 'score': 0.97, 'start': 56, 'end': 61}
# Text Generation
gen_pipe = pipeline("text-generation", model="gpt2")
output = gen_pipe("The future of AI is", max_length=50, num_return_sequences=1)

| Model | Type | Architecture | Parameters | Key Feature | Best For |
|---|---|---|---|---|---|
| BERT | Encoder | Bidirectional encoder | 110M/340M | Bidirectional context | Classification, NER, QA |
| GPT-4 | Decoder | Autoregressive decoder | ~1.8T (MoE) | Next token prediction | Text generation, chat |
| T5 | Encoder-Decoder | Text-to-text | 60M-11B | All tasks as text generation | Translation, summarization |
| BART | Encoder-Decoder | Denoising autoencoder | 140M-400M | Reconstruction pretraining | Summarization, generation |
| RoBERTa | Encoder | Optimized BERT | 125M/355M | Better training recipe | Classification (replaces BERT) |
| DeBERTa | Encoder | Disentangled attention | 86M/400M | Content + position separation | High-accuracy classification |
| LLaMA 3 | Decoder | Autoregressive decoder | 8B-70B | Open weights, efficient | Open-source LLM |
| Mistral | Decoder | Sliding window attention | 7B-8x22B | Efficient long context | Open-source LLM |
Sentiment analysis determines the emotional tone of text — positive, negative, or neutral. It is widely used for brand monitoring, customer feedback analysis, and social media analysis.
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# ── Quick Sentiment Analysis (Pipeline) ──
classifier = pipeline(
"sentiment-analysis",
model="distilbert-base-uncased-finetuned-sst-2-english"
)
texts = [
"This product is amazing! Best purchase I ever made.",
"Terrible experience. The food was cold and service was slow.",
"The movie was okay, nothing special but not bad either."
]
results = classifier(texts, batch_size=32)
for text, result in zip(texts, results):
print(f"{result['label']}: {result['score']:.4f} | {text[:50]}...")
# ── Fine-grained Sentiment (5-star) ──
star_classifier = pipeline(
"sentiment-analysis",
model="nlptown/bert-base-multilingual-uncased-sentiment"
)
result = star_classifier("The hotel room was clean but the breakfast was disappointing.")
print(result) # [{'label': '3 stars', 'score': 0.85}]
# ── Aspect-Based Sentiment Analysis ──
# Identify sentiment for specific aspects within a text
aspects = {
'food': 'The pizza was delicious but the pasta was bland.',
'service': 'The waiter was rude and inattentive.',
'ambiance': 'The restaurant had beautiful decor and nice lighting.',
}
for aspect, review in aspects.items():
result = classifier(review)[0]
print(f"Aspect [{aspect}]: {result['label']} ({result['score']:.3f})")
# ── Custom Sentiment Model Training ──
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize_function(examples):
return tokenizer(examples["text"], padding="max_length",
truncation=True, max_length=256)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
training_args = TrainingArguments(
output_dir="./sentiment-model",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
num_train_epochs=3,
weight_decay=0.01,
save_strategy="epoch",
load_best_model_at_end=True,
)
trainer = Trainer(
    model=AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2),
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)
# trainer.train()

| Approach | Accuracy | Speed | Context-Aware | Use Case |
|---|---|---|---|---|
| Lexicon (VADER) | Low-Medium | Very Fast | No | Social media, quick analysis |
| Rule-based | Medium | Fast | No | Domain-specific (finance, healthcare) |
| TF-IDF + SVM | Medium-High | Fast | No | Balanced accuracy/speed |
| BERT (fine-tuned) | High | Medium | Yes | Production sentiment analysis |
| RoBERTa (fine-tuned) | Very High | Medium | Yes | State-of-the-art accuracy |
| GPT-4 (zero-shot) | High | Slow | Yes | Custom criteria, no training data |
| LLM (few-shot) | Very High | Slow | Yes | Complex sentiment, nuanced analysis |
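The lexicon row in the table can be illustrated with a toy VADER-style scorer. The mini-lexicon and the single negation rule below are simplified stand-ins — real lexicons (VADER, AFINN) have thousands of scored entries and more heuristics:

```python
# Hypothetical mini-lexicon for illustration only
LEXICON = {'love': 2, 'amazing': 3, 'great': 2, 'okay': 0,
           'slow': -1, 'terrible': -3, 'cold': -1, 'rude': -2}
NEGATORS = {'not', 'no', 'never'}

def lexicon_sentiment(text):
    """Sum per-word scores, flipping the sign of a word after a negator."""
    score, negate = 0, False
    for word in text.lower().split():
        word = word.strip('.,!?')
        if word in NEGATORS:
            negate = True
            continue
        if word in LEXICON:
            score += -LEXICON[word] if negate else LEXICON[word]
        negate = False  # negation only affects the next word
    return 'positive' if score > 0 else 'negative' if score < 0 else 'neutral'

print(lexicon_sentiment("The service was amazing!"))  # positive
print(lexicon_sentiment("The food was not great."))   # negative
```

This is why the table lists lexicon methods as fast but not context-aware: "not great" is handled by a hand-written rule, while sarcasm, intensifiers, and long-range context are not.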
NER identifies and classifies named entities (people, organizations, locations, dates, etc.) in text. It is fundamental for information extraction, knowledge graphs, and search.
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
import spacy
# ── spaCy NER (Fast, Production-Ready) ──
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976.")
for ent in doc.ents:
print(f"{ent.text:25s} {ent.label_:10s} {spacy.explain(ent.label_)}")
# Apple Inc. ORG Companies, agencies, institutions
# Steve Jobs PERSON People, including fictional
# Cupertino GPE Countries, cities, states
# California GPE Countries, cities, states
# April 1, 1976 DATE Absolute or relative dates
# ── HuggingFace NER Pipeline ──
ner_pipe = pipeline("ner", grouped_entities=True)
text = "Elon Musk is the CEO of Tesla, headquartered in Austin, Texas."
entities = ner_pipe(text)
for ent in entities:
print(f"{ent['word']:25s} {ent['entity_group']:10s} {ent['score']:.3f}")
# Elon Musk PER 0.999
# Tesla ORG 0.998
# Austin LOC 0.997
# Texas LOC 0.994
# ── Fine-tune NER Model ──
from transformers import AutoModelForTokenClassification, TrainingArguments
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC',
'B-DATE', 'I-DATE', 'B-MISC', 'I-MISC']
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

| Entity | Label | Description | Examples |
|---|---|---|---|
| Person | PER | People (real or fictional) | Barack Obama, Sherlock Holmes |
| Organization | ORG | Companies, agencies, institutions | Google, United Nations, NASA |
| Location | LOC / GPE | Geographic locations (cities, countries) | New York, Germany, Mount Everest |
| Date | DATE | Absolute and relative dates | January 2024, last Monday, 1999-12-31 |
| Time | TIME | Times of day | 3:45 PM, noon, midnight |
| Money | MONEY | Currency amounts | $50, 10 million euros, $4.99 |
| Percentage | PERCENT | Percentage values | 50%, one third, 75 percent |
| Facility | FAC | Buildings, airports, highways | JFK Airport, Golden Gate Bridge |
| Product | PRODUCT | Products, vehicles, food | iPhone, Toyota Camry, Coca-Cola |
| Event | EVENT | Named events | Olympics, World War II, CES 2024 |
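Token-level NER models emit per-token BIO tags (as in the `label_list` above); turning those into entity spans is a small post-processing step, which `grouped_entities=True` performs internally. A plain-Python sketch of that grouping:

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_text, entity_type) spans."""
    spans, current, ent_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):          # B- starts a new entity
            if current:
                spans.append((' '.join(current), ent_type))
            current, ent_type = [token], tag[2:]
        elif tag.startswith('I-') and current and tag[2:] == ent_type:
            current.append(token)         # I- continues the current entity
        else:                             # O (or an inconsistent I-) ends it
            if current:
                spans.append((' '.join(current), ent_type))
            current, ent_type = [], None
    if current:
        spans.append((' '.join(current), ent_type))
    return spans

tokens = ['John', 'lives', 'in', 'New', 'York', 'City']
tags   = ['B-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'I-LOC']
print(bio_to_spans(tokens, tags))
# [('John', 'PER'), ('New York City', 'LOC')]
```

Real pipelines additionally merge WordPiece subtokens (`##` pieces) back into whole words before this step.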
Text generation produces coherent text by predicting the next token autoregressively. Modern approaches use large language models with sophisticated decoding strategies.
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch
# ── Simple Text Generation ──
generator = pipeline("text-generation", model="gpt2")
# Basic generation
output = generator("The future of AI is", max_length=50, num_return_sequences=3)
for i, seq in enumerate(output):
print(f"Sequence {i+1}: {seq['generated_text']}")
# ── Generation Strategies ──
# Greedy (always picks highest probability — repetitive)
output = generator("Once upon a time", max_length=100, do_sample=False)
# Beam Search (explores multiple paths — good for factual text)
output = generator("The capital of France is", max_length=50,
num_beams=5, early_stopping=True)
# Temperature + Top-k (controlled randomness)
output = generator("Write a haiku about coding:", max_length=100,
do_sample=True, temperature=0.7, top_k=50)
# Top-p (nucleus) sampling — most popular approach
output = generator("Explain quantum computing to a 5-year-old:",
max_length=200, do_sample=True,
temperature=0.8, top_p=0.95, top_k=0,
repetition_penalty=1.2)
# ── Chat Templates (for conversational models) ──
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct",
torch_dtype=torch.float16)
messages = [
{"role": "system", "content": "You are a helpful programming assistant."},
{"role": "user", "content": "Write a Python function to check if a string is a palindrome."}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.7,
top_p=0.9,
do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

| Strategy | How It Works | Quality | Diversity | Best For |
|---|---|---|---|---|
| Greedy | Always picks highest probability token | Medium (repetitive) | None | Deterministic outputs |
| Beam Search | Keeps top-k sequences at each step | High (factual) | Low | Translation, summarization |
| Top-k Sampling | Sample from top k most likely tokens | Good | Good | Creative writing (k=50) |
| Top-p (Nucleus) | Sample from smallest set with cumulative prob >= p | Best | Best | General-purpose (p=0.9) |
| Temperature | Scales logits before softmax (higher = more random) | Variable | Variable | Control randomness (0.3-1.0) |
| Repetition Penalty | Penalize tokens already generated | Medium | Medium | Reduce repetition (1.1-1.3) |
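Temperature and top-k from the table can be sketched in a few lines of plain Python. The logits below are made up; this shows the mechanics, not any library's internals:

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token id from raw logits with temperature and top-k filtering."""
    # Temperature: divide logits before softmax (lower = sharper distribution)
    scaled = [l / temperature for l in logits]
    # Top-k: keep only the k highest-scoring token ids
    ids = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        ids = ids[:top_k]
    # Softmax over the surviving tokens (shifted by max for stability)
    m = max(scaled[i] for i in ids)
    exps = [math.exp(scaled[i] - m) for i in ids]
    total = sum(exps)
    return rng.choices(ids, weights=[e / total for e in exps], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # fake logits over a 4-token vocabulary
random.seed(0)
token = sample_next(logits, temperature=0.7, top_k=2)
# With top_k=2 only token ids 0 and 1 can ever be sampled;
# as temperature -> 0 this degenerates to greedy decoding (always id 0)
```

Top-p (nucleus) sampling differs only in how the candidate set is cut: instead of a fixed k, it keeps the smallest prefix of sorted tokens whose cumulative probability exceeds p.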
Essential NLP interview questions with detailed answers.
Question: What is the difference between BERT and GPT?
Answer: BERT is a bidirectional encoder — it reads the entire sequence at once and builds context from both left and right sides. It is pretrained with masked language modeling (predict masked words). Best for: classification, NER, QA (understanding tasks).
GPT is a unidirectional (left-to-right) decoder — it predicts the next token given previous tokens. It is pretrained with next token prediction. Best for: text generation, chatbots, code completion (generation tasks).
Key insight: BERT sees the full context for each word. GPT only sees past context (causal masking). This makes BERT better at understanding, GPT better at generating.
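The causal-vs-bidirectional distinction is literally an attention mask. A minimal sketch, independent of any library:

```python
def attention_masks(seq_len):
    """Bidirectional (BERT-style) vs causal (GPT-style) attention masks.
    mask[i][j] == 1 means position i may attend to position j."""
    bidirectional = [[1] * seq_len for _ in range(seq_len)]
    causal = [[1 if j <= i else 0 for j in range(seq_len)]
              for i in range(seq_len)]
    return bidirectional, causal

bi, causal = attention_masks(4)
for row in causal:
    print(row)
# [1, 0, 0, 0]   <- position 0 sees only itself
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]   <- the last position sees everything before it
```

The lower-triangular causal mask is what lets GPT train on next-token prediction without leaking future tokens, while BERT's all-ones mask gives every position full context.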
Question: What are word embeddings and why do they work?
Answer: Word embeddings are dense, low-dimensional vectors (typically 100-300 dimensions) that represent words in a continuous vector space. They are learned from large corpora so that semantically similar words are close together.
Why they work: Distributional hypothesis — words that appear in similar contexts have similar meanings. "king" and "queen" appear in similar contexts (royal, palace, throne), so their embeddings are close.
Analogies: The vector space captures relationships: king - man + woman = queen (gender), Paris - France + Italy = Rome (capital-city). This shows embeddings encode semantic relationships as vector arithmetic.
Question: What is attention and why is it important?
Answer: Attention computes relevance scores between every pair of positions in a sequence, allowing the model to focus on the most relevant parts of the input when processing each element.
Self-attention: Each element in a sequence attends to all other elements. For "The cat sat on the mat because it was tired," attention helps the model learn that "it" refers to "cat," not "mat."
Why important: (1) Captures long-range dependencies without distance limitations of RNNs. (2) Parallelizable (unlike sequential RNNs). (3) Interpretable (attention weights show what the model focuses on). (4) Enables multi-head attention for capturing different types of relationships simultaneously.
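The mechanism reduces to softmax(QK^T / sqrt(d_k)) V. A plain-Python sketch with made-up toy matrices (real models use learned projections and many attention heads):

```python
import math

def softmax(xs):
    m = max(xs)  # shift by max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Relevance of this query against every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)  # attention weights sum to 1
        # Output = weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy 3-token sequence, 2-dim queries/keys/values (numbers are made up)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(Q, K, V)  # one output vector per query, each a mix of V rows
```

Because every query attends to every key in one matrix product, the whole sequence is processed in parallel — the property the "parallelizable" point above refers to.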
Question: How do you handle class imbalance in text classification?
Answer: Text classification imbalance is common (e.g., 95% negative reviews, 5% positive). Common remedies: resample the data (oversample the minority class or undersample the majority), weight the loss by inverse class frequency, use stratified train/validation splits, and evaluate with precision/recall/F1 instead of accuracy, which is misleading under imbalance.
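One standard remedy for imbalance is inverse-frequency class weights — the formula scikit-learn uses for `class_weight='balanced'`. A minimal sketch:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: n_samples / (n_classes * count_c).
    Same formula as sklearn's class_weight='balanced'."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# 95/5 imbalance: the minority class gets ~19x the majority's weight
labels = ['neg'] * 95 + ['pos'] * 5
print(class_weights(labels))  # {'neg': 0.526..., 'pos': 10.0}
```

These weights are then passed to the loss function (e.g., the `weight` argument of PyTorch's `CrossEntropyLoss`) so that mistakes on the rare class cost proportionally more.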
Question: How does RoBERTa differ from BERT?
Answer: RoBERTa (Robustly Optimized BERT Pretraining Approach) uses the exact same architecture as BERT but with a better training recipe: roughly 10x more pretraining data (160GB vs 16GB of text), longer training with larger batches, dynamic masking (a new mask pattern each time a sequence is seen, instead of one fixed at preprocessing), no next sentence prediction (NSP) objective, and a larger byte-level BPE vocabulary.
Result: RoBERTa outperforms BERT on every benchmark with the same architecture, proving training methodology matters as much as model design.