Text Preprocessing, Tokenization, Word Embeddings, Transformers, Sentiment Analysis, NER, Text Generation — NLP mastery.
Text preprocessing transforms raw text into clean, structured data suitable for NLP models. Proper preprocessing is crucial — garbage in, garbage out.
import re
import string
import unicodedata
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# ── Basic Text Cleaning ──
text = "I LOOOVE this product!!! It's amazing. Check out https://example.com 🎉"
# Lowercase
text = text.lower()
# Remove URLs
text = re.sub(r'https?://\S+|www\.\S+', '', text)
# Remove HTML tags
text = re.sub(r'<[^>]+>', '', text)
# Remove special characters and punctuation
text = re.sub(r'[^\w\s]', '', text) # Keep alphanumeric + whitespace
# Alternative: strip punctuation only (redundant after the regex above)
text = text.translate(str.maketrans('', '', string.punctuation))
# Remove extra whitespace
text = re.sub(r'\s+', ' ', text).strip()
# Remove numbers (optional)
text = re.sub(r'\d+', '', text)
# Remove emojis
text = re.sub(r'[^ -~]+', '', text) # Keep printable ASCII only
# Or keep text but remove emoji:
emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags
    "]+", flags=re.UNICODE)
text = emoji_pattern.sub(r'', text)
# ── Unicode Normalization ──
text = unicodedata.normalize('NFKC', text) # Normalize special chars
# ── Remove Stopwords ──
stop_words = set(stopwords.words('english'))
# Add custom stopwords
stop_words.update(['said', 'would', 'could', 'also', 'us', 'one'])
tokens = word_tokenize(text)
filtered = [w for w in tokens if w not in stop_words]
# ── Stemming vs Lemmatization ──
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ['running', 'better', 'geese', 'corpora', 'am', 'are']
stemmed = [stemmer.stem(w) for w in words]
# ['run', 'better', 'gees', 'corpora', 'am', 'are']
lemmatized = [lemmatizer.lemmatize(w, pos='v') for w in words]
# pos='v' lemmatizes verbs: 'running' -> 'run', 'am' -> 'be', 'are' -> 'be'
# Nouns need pos='n': lemmatizer.lemmatize('geese', pos='n') -> 'goose',
# lemmatizer.lemmatize('corpora', pos='n') -> 'corpus'

| Technique | Description | Pros | Cons | When to Use |
|---|---|---|---|---|
| Lowercasing | Convert all text to lowercase | Reduces vocabulary size | Loses information (US vs us) | Almost always, unless case matters |
| Stopword Removal | Remove common words (the, is, at) | Reduces noise, focuses on content | Can lose sentiment info (not, no) | Information retrieval, classification |
| Stemming | Reduce words to root form (running -> run) | Fast, reduces vocabulary | Produces non-words, aggressive | Search engines, fast prototyping |
| Lemmatization | Reduce to dictionary form (geese -> goose) | Valid words, context-aware | Slower, needs POS tag | Quality-sensitive tasks, analysis |
| Tokenization | Split text into tokens (words/subwords) | Required for all NLP models | Language-specific challenges | Always, before any other processing |
| Spell Correction | Fix typos automatically | Improves model input quality | Can change meaning, slow | User-generated content, search |
Tokenization is the process of breaking text into discrete units (tokens) that can be processed by models. Modern NLP uses subword tokenization to keep vocabularies manageable while still handling out-of-vocabulary (OOV) words.
# ── Word Tokenization ──
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
text = "I can't believe it's not butter! 2.5 million users."
print(word_tokenize(text))
# ["I", "ca", "n't", "believe", "it", "'s", "not", "butter", "!", "2.5", "million", "users", "."]
# ── Sentence Tokenization ──
from nltk.tokenize import sent_tokenize
doc = "Hello world. How are you? I'm fine!"
print(sent_tokenize(doc))
# ["Hello world.", "How are you?", "I'm fine!"]
# ── Regular Expression Tokenizer ──
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+') # Only alphanumeric tokens
print(tokenizer.tokenize("Hello, world! 123"))
# ['Hello', 'world', '123']
# ── Subword Tokenization (HuggingFace Transformers) ──
from transformers import AutoTokenizer
# BERT Tokenizer (WordPiece)
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = bert_tokenizer("Hello, I love NLP!")
print(tokens.tokens())
# ['[CLS]', 'hello', ',', 'i', 'love', 'nl', '##p', '!', '[SEP]']
print(tokens.input_ids) # [101, 7592, 1010, 1045, 2293, 17953, 2361, 999, 102]
print(tokens.attention_mask) # [1, 1, 1, 1, 1, 1, 1, 1, 1]
# GPT-2 Tokenizer (BPE)
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = gpt_tokenizer("Hello, I love NLP!")
print(tokens.tokens())
# ['Hello', ',', 'ĠI', 'Ġlove', 'ĠNLP', '!']
# T5 Tokenizer (SentencePiece)
t5_tokenizer = AutoTokenizer.from_pretrained("t5-base")
tokens = t5_tokenizer("Hello, I love NLP!")
print(tokens.tokens())

| Algorithm | Vocab Type | Handles OOV? | Example | Used By |
|---|---|---|---|---|
| Word-level | Full words | No (UNK token) | ["I", "love", "NLP"] | Classic NLTK, spaCy |
| Character-level | Characters | Yes | ["H", "e", "l", "l", "o"] | Character CNNs, some TTS |
| BPE (Byte-Pair Encoding) | Subwords | Yes | ["lo", "v", "ing"] | GPT-2, GPT-3, RoBERTa |
| WordPiece | Subwords | Yes (##prefix) | ["un", "##believ", "##able"] | BERT, DistilBERT |
| SentencePiece (Unigram) | Subwords | Yes | ["▁Hello", "▁world"] | T5, XLNet, LLaMA |
| Tiktoken (BPE variant) | Byte-level | Yes | Byte-level BPE, efficient | GPT-4, Claude |
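The BPE row above can be made concrete with a toy merge-learning loop. This is a minimal sketch of the algorithm's core idea — repeatedly merge the most frequent adjacent symbol pair — not a production tokenizer; the corpus and merge count are made up:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a tiny corpus (illustration only)."""
    # Represent each word as a tuple of symbols, weighted by frequency
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Apply the merge everywhere it occurs
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = bpe_merges(["low", "lower", "lowest", "low", "slot"], num_merges=3)
print(merges[:2])  # [('l', 'o'), ('lo', 'w')] — frequent pairs become subwords
```

Real BPE tokenizers (GPT-2, RoBERTa) learn tens of thousands of merges over gigabytes of text, then apply them greedily at encoding time.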
Word embeddings are dense vector representations of words that capture semantic meaning. Words with similar meanings have similar embeddings, enabling models to understand relationships between words.
# ── Word2Vec (Gensim) ──
from gensim.models import Word2Vec
# Train Word2Vec
sentences = [
['the', 'cat', 'sat', 'on', 'the', 'mat'],
['the', 'dog', 'sat', 'on', 'the', 'floor'],
['the', 'bird', 'flew', 'over', 'the', 'house'],
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
workers=4, sg=1, epochs=100)
# sg=1: Skip-gram, sg=0: CBOW
# Word similarity
model.wv.most_similar('cat', topn=5)
# [('dog', 0.92), ('bird', 0.78), ...]
# Word arithmetic (king - man + woman ≈ queen)
model.wv.most_similar(positive=['king', 'woman'], negative=['man'])
# ── Pre-trained GloVe Embeddings ──
import numpy as np
def load_glove_embeddings(glove_path):
embeddings = {}
with open(glove_path, 'r', encoding='utf-8') as f:
for line in f:
parts = line.strip().split()
word = parts[0]
vector = np.array(parts[1:], dtype=np.float32)
embeddings[word] = vector
return embeddings
glove = load_glove_embeddings('glove.6B.300d.txt')
print(f"Vocabulary size: {len(glove)}") # 400,000
# ── Pre-trained FastText Embeddings ──
import fasttext
# FastText handles OOV by using character n-grams
ft_model = fasttext.load_model('cc.en.300.bin')
ft_model.get_word_vector('hello') # 300-dim vector
ft_model.get_word_vector('unheard') # Still works (subword info)

# ── Embedding Layer in PyTorch ──
import torch
import torch.nn as nn
# Random embedding layer
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=300)
input_ids = torch.tensor([1, 23, 456, 789, 0]) # Token IDs
embedded = embedding(input_ids)
print(embedded.shape) # (5, 300) - each token is a 300-dim vector
# ── Pre-trained Embedding Matrix ──
# Create embedding matrix from GloVe
vocab_size = 10000
embed_dim = 300
embedding_matrix = np.zeros((vocab_size, embed_dim))
word2idx = {}
for i, word in enumerate(sorted(glove.keys())[:vocab_size]):
word2idx[word] = i
embedding_matrix[i] = glove[word]
# Load into nn.Embedding
embedding = nn.Embedding.from_pretrained(
torch.tensor(embedding_matrix, dtype=torch.float),
    freeze=False # freeze=True keeps the pretrained vectors fixed (no fine-tuning)
)
# ── Sentence Embeddings (Mean Pooling) ──
def sentence_embedding(tokens, embedding_layer):
"""Average word embeddings to get sentence embedding"""
embedded = embedding_layer(tokens) # (seq_len, embed_dim)
mask = (tokens != 0).unsqueeze(-1).float() # Ignore padding
summed = (embedded * mask).sum(dim=0)
counts = mask.sum(dim=0).clamp(min=1e-9)
return summed / counts # (embed_dim,)

| Model | Dimensions | Vocab Size | OOV Handling | Best For |
|---|---|---|---|---|
| Word2Vec (Skip-gram) | 100-300 | Custom | No (UNK) | Task-specific word similarity |
| Word2Vec (CBOW) | 100-300 | Custom | No (UNK) | Faster training, frequent words |
| GloVe | 50/100/200/300 | 400K/2.2M | No (UNK) | General-purpose, pre-trained |
| FastText | 100-300 | 2M+ (subword) | Yes (character n-grams) | Morphological languages, OOV |
| ELMo | 1024 | Custom (char CNN) | Yes | Contextual embeddings (LSTM-based) |
| BERT embeddings | 768/1024 | 30K WordPiece | Yes (##subwords) | Contextual, sentence-level tasks |
| Sentence-BERT | 384/768 | BERT vocab | Yes | Sentence similarity, clustering |
| OpenAI text-embedding-3 | 1536/3072 | BPE | Yes | RAG, semantic search, clustering |
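"Similar words have similar embeddings" boils down to cosine similarity, and the famous analogies are vector arithmetic plus a nearest-neighbor search. A self-contained sketch with made-up 3-dimensional vectors (real embeddings have 100+ dimensions and learned values):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors, hand-made for illustration only
vec = {
    'king':  [0.8, 0.7, 0.1],
    'queen': [0.8, 0.1, 0.7],
    'man':   [0.2, 0.9, 0.1],
    'woman': [0.2, 0.2, 0.8],
    'apple': [0.9, 0.0, 0.0],
}

# king - man + woman should land nearest to queen
target = [k - m + w for k, m, w in zip(vec['king'], vec['man'], vec['woman'])]
best = max((w for w in vec if w not in ('king', 'man', 'woman')),
           key=lambda w: cosine(target, vec[w]))
print(best)  # 'queen'
```

Gensim's `most_similar(positive=..., negative=...)` shown earlier does exactly this search, over the full vocabulary and with normalized vectors.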
The Transformer architecture, introduced in "Attention Is All You Need" (2017), is the foundation of modern NLP. Self-attention mechanisms enable parallel processing and capture long-range dependencies.
from transformers import (AutoTokenizer, AutoModel, AutoModelForSequenceClassification,
AutoModelForTokenClassification, pipeline)
import torch
# ── BERT for Text Classification ──
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased", num_labels=2
)
# Encode input (a single string; passing two strings would encode them
# as a sentence *pair*, which changes the input format)
inputs = tokenizer(
    "This movie was absolutely fantastic and moving! I highly recommend it.",
    padding=True, truncation=True, max_length=512,
    return_tensors="pt"
)
# Predict
with torch.no_grad():
outputs = model(**inputs)
logits = outputs.logits
probs = torch.softmax(logits, dim=-1)
pred = torch.argmax(probs, dim=-1).item()
print(f"Prediction: {pred}, Probability: {probs[0, pred]:.4f}")
# ── BERT for Token Classification (NER) ──
ner_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
ner_tokenizer = AutoTokenizer.from_pretrained(ner_name)  # must match the model
ner_model = AutoModelForTokenClassification.from_pretrained(ner_name)
inputs = ner_tokenizer("John lives in New York City", return_tensors="pt")
outputs = ner_model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
tokens = ner_tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# ── Pipeline API (Simplest interface) ──
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("I love this product! Best purchase ever.")
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Named Entity Recognition
ner_pipe = pipeline("ner", grouped_entities=True)
entities = ner_pipe("Apple CEO Tim Cook announced the iPhone in California.")
# [{'entity_group': 'ORG', 'word': 'Apple'}, {'entity_group': 'PER', 'word': 'Tim Cook'}, ...]
# Question Answering
qa_pipe = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
answer = qa_pipe(question="What is the capital of France?",
context="France is a country in Europe. Its capital is Paris.")
# {'answer': 'Paris', 'score': 0.97, 'start': 56, 'end': 61}
# Text Generation
gen_pipe = pipeline("text-generation", model="gpt2")
output = gen_pipe("The future of AI is", max_length=50, num_return_sequences=1)

| Model | Type | Architecture | Parameters | Key Feature | Best For |
|---|---|---|---|---|---|
| BERT | Encoder | Bidirectional encoder | 110M/340M | Bidirectional context | Classification, NER, QA |
| GPT-4 | Decoder | Autoregressive decoder | ~1.8T (MoE) | Next token prediction | Text generation, chat |
| T5 | Encoder-Decoder | Text-to-text | 60M-11B | All tasks as text generation | Translation, summarization |
| BART | Encoder-Decoder | Denoising autoencoder | 140M-400M | Reconstruction pretraining | Summarization, generation |
| RoBERTa | Encoder | Optimized BERT | 125M/355M | Better training recipe | Classification (replaces BERT) |
| DeBERTa | Encoder | Disentangled attention | 86M/400M | Content + position separation | High-accuracy classification |
| LLaMA 3 | Decoder | Autoregressive decoder | 8B-70B | Open weights, efficient | Open-source LLM |
| Mistral | Decoder | Sliding window attention | 7B-8x22B | Efficient long context | Open-source LLM |
Sentiment analysis determines the emotional tone of text — positive, negative, or neutral. It is widely used for brand monitoring, customer feedback analysis, and social media analysis.
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# ── Quick Sentiment Analysis (Pipeline) ──
classifier = pipeline(
"sentiment-analysis",
model="distilbert-base-uncased-finetuned-sst-2-english"
)
texts = [
"This product is amazing! Best purchase I ever made.",
"Terrible experience. The food was cold and service was slow.",
"The movie was okay, nothing special but not bad either."
]
results = classifier(texts, batch_size=32)
for text, result in zip(texts, results):
print(f"{result['label']}: {result['score']:.4f} | {text[:50]}...")
# ── Fine-grained Sentiment (5-star) ──
star_classifier = pipeline(
"sentiment-analysis",
model="nlptown/bert-base-multilingual-uncased-sentiment"
)
result = star_classifier("The hotel room was clean but the breakfast was disappointing.")
print(result) # [{'label': '3 stars', 'score': 0.85}]
# ── Aspect-Based Sentiment Analysis ──
# Identify sentiment for specific aspects within a text
aspects = {
'food': 'The pizza was delicious but the pasta was bland.',
'service': 'The waiter was rude and inattentive.',
'ambiance': 'The restaurant had beautiful decor and nice lighting.',
}
for aspect, review in aspects.items():
result = classifier(review)[0]
print(f"Aspect [{aspect}]: {result['label']} ({result['score']:.3f})")
# ── Custom Sentiment Model Training ──
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize_function(examples):
return tokenizer(examples["text"], padding="max_length",
truncation=True, max_length=256)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
training_args = TrainingArguments(
output_dir="./sentiment-model",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
num_train_epochs=3,
weight_decay=0.01,
save_strategy="epoch",
load_best_model_at_end=True,
)
trainer = Trainer(
    model=AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2),
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)
# trainer.train()

| Approach | Accuracy | Speed | Context-Aware | Use Case |
|---|---|---|---|---|
| Lexicon (VADER) | Low-Medium | Very Fast | No | Social media, quick analysis |
| Rule-based | Medium | Fast | No | Domain-specific (finance, healthcare) |
| TF-IDF + SVM | Medium-High | Fast | No | Balanced accuracy/speed |
| BERT (fine-tuned) | High | Medium | Yes | Production sentiment analysis |
| RoBERTa (fine-tuned) | Very High | Medium | Yes | State-of-the-art accuracy |
| GPT-4 (zero-shot) | High | Slow | Yes | Custom criteria, no training data |
| LLM (few-shot) | Very High | Slow | Yes | Complex sentiment, nuanced analysis |
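The lexicon row in the table can be illustrated with a toy VADER-style scorer. The mini-lexicon and the single negation rule below are simplified stand-ins — real lexicons (VADER, AFINN) have thousands of scored entries and more heuristics:

```python
# Hypothetical mini-lexicon for illustration only
LEXICON = {'love': 2, 'amazing': 3, 'great': 2, 'okay': 0,
           'slow': -1, 'terrible': -3, 'cold': -1, 'rude': -2}
NEGATORS = {'not', 'no', 'never'}

def lexicon_sentiment(text):
    """Sum per-word scores, flipping the sign of a word after a negator."""
    score, negate = 0, False
    for word in text.lower().split():
        word = word.strip('.,!?')
        if word in NEGATORS:
            negate = True
            continue
        if word in LEXICON:
            score += -LEXICON[word] if negate else LEXICON[word]
        negate = False  # negation only affects the next word
    return 'positive' if score > 0 else 'negative' if score < 0 else 'neutral'

print(lexicon_sentiment("The service was amazing!"))  # positive
print(lexicon_sentiment("The food was not great."))   # negative
```

This is why the table lists lexicon methods as fast but not context-aware: "not great" is handled by a hand-written rule, while sarcasm, intensifiers, and long-range context are not.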
NER identifies and classifies named entities (people, organizations, locations, dates, etc.) in text. It is fundamental for information extraction, knowledge graphs, and search.
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
import spacy
# ── spaCy NER (Fast, Production-Ready) ──
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976.")
for ent in doc.ents:
print(f"{ent.text:25s} {ent.label_:10s} {spacy.explain(ent.label_)}")
# Apple Inc. ORG Companies, agencies, institutions
# Steve Jobs PERSON People, including fictional
# Cupertino GPE Countries, cities, states
# California GPE Countries, cities, states
# April 1, 1976 DATE Absolute or relative dates
# ── HuggingFace NER Pipeline ──
ner_pipe = pipeline("ner", grouped_entities=True)
text = "Elon Musk is the CEO of Tesla, headquartered in Austin, Texas."
entities = ner_pipe(text)
for ent in entities:
print(f"{ent['word']:25s} {ent['entity_group']:10s} {ent['score']:.3f}")
# Elon Musk PER 0.999
# Tesla ORG 0.998
# Austin LOC 0.997
# Texas LOC 0.994
# ── Fine-tune NER Model ──
from transformers import AutoModelForTokenClassification, TrainingArguments
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC',
'B-DATE', 'I-DATE', 'B-MISC', 'I-MISC']
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

| Entity | Label | Description | Examples |
|---|---|---|---|
| Person | PER | People (real or fictional) | Barack Obama, Sherlock Holmes |
| Organization | ORG | Companies, agencies, institutions | Google, United Nations, NASA |
| Location | LOC / GPE | Geographic locations (cities, countries) | New York, Germany, Mount Everest |
| Date | DATE | Absolute and relative dates | January 2024, last Monday, 1999-12-31 |
| Time | TIME | Times of day | 3:45 PM, noon, midnight |
| Money | MONEY | Currency amounts | $50, 10 million euros, $4.99 |
| Percentage | PERCENT | Percentage values | 50%, one third, 75 percent |
| Facility | FAC | Buildings, airports, highways | JFK Airport, Golden Gate Bridge |
| Product | PRODUCT | Products, vehicles, food | iPhone, Toyota Camry, Coca-Cola |
| Event | EVENT | Named events | Olympics, World War II, CES 2024 |
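Token-level NER models emit per-token BIO tags (as in the `label_list` above); turning those into entity spans is a small post-processing step, which `grouped_entities=True` performs internally. A plain-Python sketch of that grouping:

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_text, entity_type) spans."""
    spans, current, ent_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):          # B- starts a new entity
            if current:
                spans.append((' '.join(current), ent_type))
            current, ent_type = [token], tag[2:]
        elif tag.startswith('I-') and current and tag[2:] == ent_type:
            current.append(token)         # I- continues the current entity
        else:                             # O (or an inconsistent I-) ends it
            if current:
                spans.append((' '.join(current), ent_type))
            current, ent_type = [], None
    if current:
        spans.append((' '.join(current), ent_type))
    return spans

tokens = ['John', 'lives', 'in', 'New', 'York', 'City']
tags   = ['B-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'I-LOC']
print(bio_to_spans(tokens, tags))
# [('John', 'PER'), ('New York City', 'LOC')]
```

Real pipelines additionally merge WordPiece subtokens (`##` pieces) back into whole words before this step.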
Text generation produces coherent text by predicting the next token autoregressively. Modern approaches use large language models with sophisticated decoding strategies.
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch
# ── Simple Text Generation ──
generator = pipeline("text-generation", model="gpt2")
# Basic generation
output = generator("The future of AI is", max_length=50, num_return_sequences=3)
for i, seq in enumerate(output):
print(f"Sequence {i+1}: {seq['generated_text']}")
# ── Generation Strategies ──
# Greedy (always picks highest probability — repetitive)
output = generator("Once upon a time", max_length=100, do_sample=False)
# Beam Search (explores multiple paths — good for factual text)
output = generator("The capital of France is", max_length=50,
num_beams=5, early_stopping=True)
# Temperature + Top-k (controlled randomness)
output = generator("Write a haiku about coding:", max_length=100,
do_sample=True, temperature=0.7, top_k=50)
# Top-p (nucleus) sampling — most popular approach
output = generator("Explain quantum computing to a 5-year-old:",
max_length=200, do_sample=True,
temperature=0.8, top_p=0.95, top_k=0,
repetition_penalty=1.2)
# ── Chat Templates (for conversational models) ──
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct",
torch_dtype=torch.float16)
messages = [
{"role": "system", "content": "You are a helpful programming assistant."},
{"role": "user", "content": "Write a Python function to check if a string is a palindrome."}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.7,
top_p=0.9,
do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

| Strategy | How It Works | Quality | Diversity | Best For |
|---|---|---|---|---|
| Greedy | Always picks highest probability token | Medium (repetitive) | None | Deterministic outputs |
| Beam Search | Keeps top-k sequences at each step | High (factual) | Low | Translation, summarization |
| Top-k Sampling | Sample from top k most likely tokens | Good | Good | Creative writing (k=50) |
| Top-p (Nucleus) | Sample from smallest set with cumulative prob >= p | Best | Best | General-purpose (p=0.9) |
| Temperature | Scales logits before softmax (higher = more random) | Variable | Variable | Control randomness (0.3-1.0) |
| Repetition Penalty | Penalize tokens already generated | Medium | Medium | Reduce repetition (1.1-1.3) |
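Temperature and top-k from the table can be sketched in a few lines of plain Python. The logits below are made up; this shows the mechanics, not any library's internals:

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token id from raw logits with temperature and top-k filtering."""
    # Temperature: divide logits before softmax (lower = sharper distribution)
    scaled = [l / temperature for l in logits]
    # Top-k: keep only the k highest-scoring token ids
    ids = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        ids = ids[:top_k]
    # Softmax over the surviving tokens (shifted by max for stability)
    m = max(scaled[i] for i in ids)
    exps = [math.exp(scaled[i] - m) for i in ids]
    total = sum(exps)
    return rng.choices(ids, weights=[e / total for e in exps], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # fake logits over a 4-token vocabulary
random.seed(0)
token = sample_next(logits, temperature=0.7, top_k=2)
# With top_k=2 only token ids 0 and 1 can ever be sampled;
# as temperature -> 0 this degenerates to greedy decoding (always id 0)
```

Top-p (nucleus) sampling differs only in how the candidate set is cut: instead of a fixed k, it keeps the smallest prefix of sorted tokens whose cumulative probability exceeds p.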
Essential NLP interview questions with detailed answers.
Question: What is the difference between BERT and GPT?
Answer: BERT is a bidirectional encoder — it reads the entire sequence at once and builds context from both left and right sides. It is pretrained with masked language modeling (predict masked words). Best for: classification, NER, QA (understanding tasks).
GPT is a unidirectional (left-to-right) decoder — it predicts the next token given previous tokens. It is pretrained with next token prediction. Best for: text generation, chatbots, code completion (generation tasks).
Key insight: BERT sees the full context for each word. GPT only sees past context (causal masking). This makes BERT better at understanding, GPT better at generating.
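The causal-vs-bidirectional distinction is literally an attention mask. A minimal sketch, independent of any library:

```python
def attention_masks(seq_len):
    """Bidirectional (BERT-style) vs causal (GPT-style) attention masks.
    mask[i][j] == 1 means position i may attend to position j."""
    bidirectional = [[1] * seq_len for _ in range(seq_len)]
    causal = [[1 if j <= i else 0 for j in range(seq_len)]
              for i in range(seq_len)]
    return bidirectional, causal

bi, causal = attention_masks(4)
for row in causal:
    print(row)
# [1, 0, 0, 0]   <- position 0 sees only itself
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]   <- the last position sees everything before it
```

The lower-triangular causal mask is what lets GPT train on next-token prediction without leaking future tokens, while BERT's all-ones mask gives every position full context.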
Question: What are word embeddings and why do they work?
Answer: Word embeddings are dense, low-dimensional vectors (typically 100-300 dimensions) that represent words in a continuous vector space. They are learned from large corpora so that semantically similar words are close together.
Why they work: Distributional hypothesis — words that appear in similar contexts have similar meanings. "king" and "queen" appear in similar contexts (royal, palace, throne), so their embeddings are close.
Analogies: The vector space captures relationships: king - man + woman = queen (gender), Paris - France + Italy = Rome (capital-city). This shows embeddings encode semantic relationships as vector arithmetic.
Question: What is attention and why is it important?
Answer: Attention computes relevance scores between every pair of positions in a sequence, allowing the model to focus on the most relevant parts of the input when processing each element.
Self-attention: Each element in a sequence attends to all other elements. For "The cat sat on the mat because it was tired," attention helps the model learn that "it" refers to "cat," not "mat."
Why important: (1) Captures long-range dependencies without distance limitations of RNNs. (2) Parallelizable (unlike sequential RNNs). (3) Interpretable (attention weights show what the model focuses on). (4) Enables multi-head attention for capturing different types of relationships simultaneously.
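The mechanism reduces to softmax(QK^T / sqrt(d_k)) V. A plain-Python sketch with made-up toy matrices (real models use learned projections and many attention heads):

```python
import math

def softmax(xs):
    m = max(xs)  # shift by max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Relevance of this query against every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)  # attention weights sum to 1
        # Output = weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy 3-token sequence, 2-dim queries/keys/values (numbers are made up)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(Q, K, V)  # one output vector per query, each a mix of V rows
```

Because every query attends to every key in one matrix product, the whole sequence is processed in parallel — the property the "parallelizable" point above refers to.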
Question: How do you handle class imbalance in text classification?
Answer: Text classification imbalance is common (e.g., 95% negative reviews, 5% positive). Common remedies: resample the data (oversample the minority class or undersample the majority), weight the loss by inverse class frequency, use stratified train/validation splits, and evaluate with precision/recall/F1 instead of accuracy, which is misleading under imbalance.
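One standard remedy for imbalance is inverse-frequency class weights — the formula scikit-learn uses for `class_weight='balanced'`. A minimal sketch:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: n_samples / (n_classes * count_c).
    Same formula as sklearn's class_weight='balanced'."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# 95/5 imbalance: the minority class gets ~19x the majority's weight
labels = ['neg'] * 95 + ['pos'] * 5
print(class_weights(labels))  # {'neg': 0.526..., 'pos': 10.0}
```

These weights are then passed to the loss function (e.g., the `weight` argument of PyTorch's `CrossEntropyLoss`) so that mistakes on the rare class cost proportionally more.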
Question: How does RoBERTa differ from BERT?
Answer: RoBERTa (Robustly Optimized BERT Pretraining Approach) uses the exact same architecture as BERT but with a better training recipe: roughly 10x more pretraining data (160GB vs 16GB of text), longer training with larger batches, dynamic masking (a new mask pattern each time a sequence is seen, instead of one fixed at preprocessing), no next sentence prediction (NSP) objective, and a larger byte-level BPE vocabulary.
Result: RoBERTa outperforms BERT on every benchmark with the same architecture, proving training methodology matters as much as model design.