LLM Fundamentals, Prompt Engineering, OpenAI API, LangChain, RAG, Fine-Tuning, AI Agents, Responsible AI — GenAI mastery.
Large Language Models are transformer-based neural networks trained on massive text corpora. They generate text by predicting the next token, and can perform a wide range of tasks through in-context learning.
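The generation loop underneath all of this is repeated next-token sampling. A toy sketch of temperature-scaled sampling (the vocabulary and logits here are invented for illustration, not from a real model):

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float = 1.0, seed: int = 0) -> str:
    """Temperature-scaled softmax over logits, then sample one token."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())                      # subtract max for numerical stability
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    rng = random.Random(seed)                             # fixed seed for reproducibility
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

toy_logits = {"cat": 2.0, "dog": 1.5, "car": 0.1}         # hypothetical next-token logits
print(sample_next_token(toy_logits, temperature=0.1))     # low temperature → near-greedy: cat
```

Lower temperatures sharpen the distribution toward the highest-logit token; higher temperatures flatten it, producing more diverse (and less reliable) output.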
| Model | Developer | Parameters | Context | Key Feature | License |
|---|---|---|---|---|---|
| GPT-4o | OpenAI | ~1.8T (MoE) | 128K | Multimodal, function calling | Proprietary |
| o3 | OpenAI | Unknown | 200K | Deep reasoning, coding | Proprietary |
| Claude 4 Sonnet | Anthropic | Unknown | 200K | Speed/quality balance | Proprietary |
| Claude 4 Opus | Anthropic | Unknown | 200K | Best reasoning, safety | Proprietary |
| Gemini 2.5 Pro | Google | Unknown | 1M+ | Massive context, multimodal | Proprietary |
| LLaMA 3.1 405B | Meta | 405B | 128K | Open weights, largest open model | Llama 3.1 License |
| Mistral Large | Mistral | 123B | 128K | Efficient, strong reasoning | Mistral Research License |
| DeepSeek V3 | DeepSeek | 671B (MoE) | 128K | Efficient MoE, strong coding | MIT |
| Qwen 2.5 72B | Alibaba | 72B | 128K | Strong multilingual, coding | Apache 2.0 |
| Mixtral 8x22B | Mistral | 141B (MoE) | 64K | MoE efficiency, open weights | Apache 2.0 |
| Stage | Data | Goal | Compute | Duration |
|---|---|---|---|---|
| Pretraining | Trillions of tokens from web, books, code | Learn language, facts, reasoning | Thousands of GPU-years | Weeks-Months |
| Supervised Fine-Tuning (SFT) | High-quality instruction-response pairs (10K-1M) | Learn to follow instructions, format | Hundreds of GPU-hours | Hours-Days |
| RLHF | Human preference comparisons (1K-100K) | Align with human values, reduce harmful output | Hundreds of GPU-hours | Hours-Days |
| DPO (Direct Preference Optimization) | Preference pairs (chosen vs rejected) | Simpler alternative to RLHF, no reward model | Similar to RLHF | Hours-Days |
| Constitutional AI | Self-critique + revision | Self-alignment without human feedback | Moderate | Hours-Days |
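The DPO row above can be made concrete. A minimal sketch of the per-pair DPO loss in pure Python, with scalar log-probs standing in for the sequence log-likelihoods under the policy and the frozen reference model:

```python
import math

def dpo_loss(logp_chosen_policy: float, logp_rejected_policy: float,
             logp_chosen_ref: float, logp_rejected_ref: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = ((logp_chosen_policy - logp_chosen_ref)
              - (logp_rejected_policy - logp_rejected_ref))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy favors the chosen response more strongly than the reference does → low loss
print(round(dpo_loss(-10.0, -20.0, -15.0, -15.0), 3))  # → 0.313
```

The loss shrinks as the policy assigns relatively more probability to the chosen response than the reference does, which is why no separate reward model is needed.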
Prompt engineering is the practice of crafting inputs to LLMs to elicit optimal outputs. It is the most accessible way to improve LLM performance without retraining.
# ── System + User Prompt Pattern ──
system_prompt = """You are an expert Python developer. Follow these rules:
1. Always include type hints
2. Write docstrings for functions
3. Handle errors gracefully
4. Use modern Python 3.12+ features
5. Return only the code, no explanations"""
user_prompt = """Write a function that:
- Takes a list of dictionaries with 'name' and 'age' keys
- Returns a filtered list of people over 25
- Sorts by age descending
- Raises ValueError if input is not a list"""
# ── Chain-of-Thought (CoT) ──
cot_prompt = """Solve this step by step:
A store has a 20% off sale. An item costs $80 before discount.
Sales tax is 8%. What is the final price?
Step 1: Calculate the discount amount
Step 2: Apply the discount
Step 3: Calculate the tax
Step 4: Calculate the final price"""
# ── Few-Shot Prompting ──
few_shot_prompt = """Classify the sentiment of each review.
Review: "Amazing product, love it!" → POSITIVE
Review: "Terrible quality, broke after one day" → NEGATIVE
Review: "It's okay, nothing special" → NEUTRAL
Review: "Best purchase I've made this year" → """
# ── Structured Output (JSON Mode) ──
json_prompt = """Extract the following information from the text and return as JSON:
{
"company": "string",
"founded": "number (year)",
"headquarters": "string",
"revenue": "string",
"employees": "number"
}
Text: "Apple was founded in 1976 and is headquartered in Cupertino, CA.
The company reported $394 billion in revenue with 164,000 employees."
Return ONLY valid JSON, no markdown fences."""
# ── Role-Based Prompting ──
role_prompt = """You are a senior security auditor conducting a code review.
Analyze the following code for security vulnerabilities.
For each finding, provide:
- Severity (Critical/High/Medium/Low)
- Description
- Recommendation
- Fixed code snippet"""

| Technique | Description | When to Use | Example Pattern |
|---|---|---|---|
| Zero-shot | Direct instruction with no examples | Simple, well-defined tasks | Classify this as positive/negative: {text} |
| Few-shot | Provide 2-5 examples before the task | Specific format/style needed | Q: ... A: ... Q: ... A: ... |
| Chain-of-Thought | Ask for step-by-step reasoning | Math, logic, complex reasoning | Think step by step before answering. |
| Self-Consistency | Generate multiple answers, take majority | Reducing reasoning errors | Solve 5 ways, pick most common answer. |
| Tree-of-Thought | Explore multiple reasoning paths | Planning, strategy problems | Consider 3 approaches. Evaluate trade-offs. |
| ReAct | Reason + Act with tools | Agents, multi-step tasks | Thought: I need to search. Action: search("...") |
| Reflexion | Self-evaluate and retry | Iterative improvement | Evaluate your answer. If wrong, try again. |
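Self-consistency in the table reduces to a majority vote over independently sampled answers. A minimal sketch (the sampled answers below are hypothetical outputs for the discount problem above):

```python
from collections import Counter

def self_consistent_answer(samples: list[str]) -> str:
    """Majority vote over final answers from independently sampled reasoning chains."""
    return Counter(samples).most_common(1)[0][0]

# Hypothetical final answers from five CoT samples of the sale-price problem
votes = ["$69.12", "$69.12", "$70.00", "$69.12", "$68.00"]
print(self_consistent_answer(votes))  # → $69.12
```

The majority answer is usually correct even when individual chains make arithmetic slips, at the cost of N times more tokens.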
The OpenAI API provides programmatic access to GPT-4, GPT-4o, o3, and other models. Understanding the API parameters, pricing, and patterns is essential for building production AI applications.
from openai import OpenAI
import json
client = OpenAI() # Uses OPENAI_API_KEY env variable
# ── Chat Completions ──
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful data science tutor."},
{"role": "user", "content": "Explain the difference between precision and recall."}
],
temperature=0.7,
max_tokens=500,
top_p=0.95,
frequency_penalty=0.0,
presence_penalty=0.0,
seed=42,
)
answer = response.choices[0].message.content
usage = response.usage
print(f"Tokens: {usage.total_tokens} (prompt: {usage.prompt_tokens}, completion: {usage.completion_tokens})")
# ── Streaming Response ──
stream = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Write a haiku about Python"}],
stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
# ── Function Calling (Tool Use) ──
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["city"],
},
},
}
]
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
tools=tools,
tool_choice="auto",
)
# Check if model wants to call a function
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments)
    # Execute the function and send the result back in a follow-up request
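The execute-and-reply step can be sketched as a plain dispatch table. The `get_weather` body here is a stand-in, and a real tool message must also carry the `tool_call_id` from the assistant's tool-call message:

```python
import json

def get_weather(city: str, unit: str = "celsius") -> str:
    """Stand-in implementation; a real app would call a weather API here."""
    return f"Weather in {city}: 18 {unit}"

AVAILABLE_TOOLS = {"get_weather": get_weather}

def execute_tool_call(name: str, arguments_json: str) -> dict:
    """Run the requested tool and wrap the result as a tool-role message."""
    args = json.loads(arguments_json)
    result = AVAILABLE_TOOLS[name](**args)
    # A real message must also include "tool_call_id": tool_call.id
    return {"role": "tool", "content": result}

tool_msg = execute_tool_call("get_weather", '{"city": "Tokyo"}')
print(tool_msg["content"])  # → Weather in Tokyo: 18 celsius
```

Append the assistant's tool-call message plus `tool_msg` to `messages` and call `client.chat.completions.create` again to get the final natural-language answer.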
# ── JSON Mode (Structured Output) ──
response = client.chat.completions.create(
model="gpt-4o",
response_format={"type": "json_object"},
messages=[
{"role": "system", "content": "Extract entities as JSON with keys: people, organizations, locations."},
{"role": "user", "content": "Tim Cook visited the Apple Park in Cupertino."}
]
)

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Best For |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K | General purpose, multimodal |
| GPT-4o-mini | $0.15 | $0.60 | 128K | Cheap, fast, simple tasks |
| o3-mini | $1.10 | $4.40 | 200K | Reasoning, coding (cheaper) |
| o3 | $10.00 | $40.00 | 200K | Deep reasoning, hard problems |
| GPT-4.1 | $2.00 | $8.00 | 1M | Long context, instructions |
| GPT-4.1-mini | $0.40 | $1.60 | 1M | Fast, long context, cheap |
| text-embedding-3-small | $0.02 | - | - | Embeddings, semantic search |
| text-embedding-3-large | $0.13 | - | - | Higher-quality embeddings |
| whisper-1 | $0.006/min | - | - | Speech-to-text |
| tts-1 | $15/1M chars | - | - | Text-to-speech |
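Budgeting against this table is simple arithmetic. A small helper, using the GPT-4o rates from the table above:

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request given per-million-token prices."""
    return (prompt_tokens / 1e6) * input_per_m + (completion_tokens / 1e6) * output_per_m

# GPT-4o rates from the table: $2.50 in / $10.00 out per 1M tokens
print(f"${estimate_cost(50_000, 5_000, 2.50, 10.00):.4f}")  # → $0.1750
```

Multiplying by expected daily request volume gives a quick monthly cost projection before committing to a model tier.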
LangChain is a framework for building applications with LLMs. It provides chains, agents, memory, retrieval, and tool integration — the building blocks for production LLM apps.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain_community.vectorstores import Chroma
from langchain.chains import create_retrieval_chain
from langchain.memory import ConversationBufferMemory
# ── Initialize Components ──
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# ── Prompt Template ──
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful AI assistant. Answer concisely."),
("human", "{input}"),
])
chain = prompt | llm | StrOutputParser()
result = chain.invoke({"input": "What is machine learning?"})
# ── Multi-Step Chain (LCEL) ──
from langchain_core.runnables import RunnablePassthrough
# Step 1: Generate
gen_prompt = ChatPromptTemplate.from_template(
"Write a short summary about: {topic}"
)
summary_chain = gen_prompt | llm | StrOutputParser()
# Step 2: Translate
trans_prompt = ChatPromptTemplate.from_template(
"Translate to French:
{summary}"
)
translate_chain = trans_prompt | llm | StrOutputParser()
# Compose chains
full_chain = (
{"topic": RunnablePassthrough(), "summary": summary_chain}
| translate_chain
)
result = full_chain.invoke("Artificial Intelligence")
# ── Conversational Chain with Memory ──
memory = ConversationBufferMemory(return_messages=True)
chat_prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful chatbot."),
MessagesPlaceholder(variable_name="history"),
("human", "{input}"),
])
chat_chain = chat_prompt | llm | StrOutputParser()
# First message
response1 = chat_chain.invoke({"input": "Hi, I'm Alice", "history": memory.chat_memory.messages})
memory.save_context({"input": "Hi, I'm Alice"}, {"output": response1})
# Second message (remembers context)
response2 = chat_chain.invoke({"input": "What's my name?", "history": memory.chat_memory.messages})
# ── Structured Output (Pydantic) ──
from pydantic import BaseModel, Field
class SentimentResult(BaseModel):
    sentiment: str = Field(description="POSITIVE, NEGATIVE, or NEUTRAL")
    confidence: float = Field(description="Confidence score 0-1")
    reason: str = Field(description="Brief reason for classification")
structured_chain = prompt | llm.with_structured_output(SentimentResult)
result = structured_chain.invoke({"input": "I love this product!"})
# SentimentResult(sentiment='POSITIVE', confidence=0.95, reason='Expresses strong positive emotion')

| Component | Purpose | Key Classes | Use Case |
|---|---|---|---|
| Models (LLMs) | Interface to LLM providers | ChatOpenAI, ChatAnthropic, ChatGoogle | Chat completions, generation |
| Prompts | Dynamic prompt templates | ChatPromptTemplate, MessagesPlaceholder | System/user/assistant messages |
| Chains | Sequential operations | LCEL pipe operator (|) | Multi-step processing |
| Memory | Conversation history | ConversationBufferMemory, ConversationSummaryMemory | Chatbots, context retention |
| Retrievers | Search relevant documents | VectorStoreRetriever, BM25Retriever | RAG, document search |
| Vector Stores | Embedding storage & search | Chroma, FAISS, Pinecone, Weaviate | Semantic search, RAG |
| Agents | LLM + tool use loops | create_tool_calling_agent, AgentExecutor | Autonomous task execution |
| Tools | Functions LLM can call | Search, Calculator, SQL, Python REPL | Augmenting LLM capabilities |
| Output Parsers | Structure LLM output | StrOutputParser, JsonOutputParser, PydanticOutputParser | Reliable structured output |
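The pipe operator in the Chains row is ordinary operator overloading: each runnable's `__or__` returns a composed runnable. A toy re-implementation (not the real LangChain classes) shows the mechanics:

```python
class TinyRunnable:
    """Toy stand-in for LangChain runnables: `a | b` pipes a's output into b."""
    def __init__(self, fn):
        self.fn = fn
    def invoke(self, value):
        return self.fn(value)
    def __or__(self, other):
        # Composition: invoke self first, feed the result to the next runnable
        return TinyRunnable(lambda value: other.invoke(self.invoke(value)))

prompt = TinyRunnable(lambda topic: f"Summarize: {topic}")
fake_llm = TinyRunnable(str.upper)          # stand-in for a model call
parser = TinyRunnable(str.strip)
chain = prompt | fake_llm | parser
print(chain.invoke("RAG"))  # → SUMMARIZE: RAG
```

This is why LCEL chains read left to right: each `|` builds a new runnable whose `invoke` threads the value through every stage in order.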
RAG enhances LLM responses by retrieving relevant documents from a knowledge base and including them in the prompt. This grounds the model in factual data and reduces hallucinations.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader, TextLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# ── Step 1: Load Documents ──
loader = DirectoryLoader(
"./documents/",
glob="**/*.pdf",
loader_cls=PyPDFLoader
)
documents = loader.load()
# ── Step 2: Split into Chunks ──
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Characters per chunk
chunk_overlap=200, # Overlap between chunks
length_function=len,
separators=["\n\n", "\n", ". ", " ", ""], # Split priority
)
chunks = text_splitter.split_documents(documents)
# ── Step 3: Create Embeddings & Vector Store ──
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 5} # Retrieve top 5 chunks
)
# ── Step 4: Create RAG Chain ──
rag_prompt = ChatPromptTemplate.from_template("""
Answer the question based ONLY on the provided context.
If the context doesn't contain the answer, say "I don't have enough information to answer this."
Context:
{context}
Question: {input}
Answer (be specific and cite sources when possible):
""")
llm = ChatOpenAI(model="gpt-4o", temperature=0.2) # Low temp for factual answers
# Method 1: Using create_stuff_documents_chain
document_chain = create_stuff_documents_chain(llm, rag_prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)
response = retrieval_chain.invoke({"input": "What is the refund policy?"})
# Method 2: Using LCEL (more flexible)
rag_chain = (
{"context": retriever, "input": RunnablePassthrough()}
| rag_prompt
| llm
| StrOutputParser()
)

| Technique | Description | Impact | Implementation |
|---|---|---|---|
| Chunk Size Tuning | Optimize chunk size and overlap | High | Test 500/1000/2000 chars with 100-200 overlap |
| Hybrid Search | Combine dense + sparse (BM25) search | High | EnsembleRetriever with BM25 + vector |
| Reranking | Re-score retrieved documents | High | Cohere Rerank, CrossEncoder, bge-reranker |
| Query Transformation | Rewrite user query for better retrieval | Medium | HyDE (hypothetical document), multi-query |
| Metadata Filtering | Filter by document metadata before search | Medium | vectorstore.as_retriever(search_kwargs={"filter": {...}}) |
| Parent Document | Store small chunks, retrieve with context | Medium | ParentDocumentRetriever |
| Context Window Mgmt | Fit more relevant info in context | Medium | Compress retrieved docs, iterative retrieval |
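Chunk size tuning is easiest to reason about with a bare-bones chunker. This simplified stand-in for RecursiveCharacterTextSplitter ignores separator priorities and just slides a fixed window:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-window character chunking with overlap (no separator awareness)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each chunk starts `step` chars after the last
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("word " * 500, chunk_size=1000, overlap=200)
print(len(chunks))  # 2500 chars → 3 chunks; adjacent chunks share 200 chars
```

The overlap ensures a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated storage and retrieval.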
# ── RAG Evaluation (RAGAS) ──
from ragas import evaluate
from ragas.metrics import (
faithfulness, answer_relevancy, context_precision, context_recall
)
# Prepare evaluation data
eval_data = {
"question": ["What is the company refund policy?"],
"answer": [response["answer"]],
"contexts": [[doc.page_content for doc in response["context"]]],
"ground_truth": ["Full refund within 30 days of purchase..."],
}
# RAGAS expects a HuggingFace Dataset rather than a plain dict
from datasets import Dataset
eval_dataset = Dataset.from_dict(eval_data)
# Run evaluation
results = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(f"Faithfulness: {results['faithfulness']:.3f}") # Is answer grounded in context?
print(f"Answer Relevancy: {results['answer_relevancy']:.3f}") # Is answer relevant to question?
print(f"Context Precision: {results['context_precision']:.3f}") # Are retrieved chunks relevant?
print(f"Context Recall: {results['context_recall']:.3f}")  # Did we retrieve all needed info?

Fine-tuning adapts a pre-trained model to your specific task or domain by training on task-specific data. It is more effective than prompting for specialized tasks with sufficient training data.
# ── OpenAI Fine-Tuning (GPT-4o-mini) ──
from openai import OpenAI
import json
client = OpenAI()
# Step 1: Prepare training data (JSONL format)
# Each line: {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}
training_data = [
{
"messages": [
{"role": "system", "content": "You are a medical terminology expert."},
{"role": "user", "content": "What does CBC stand for?"},
{"role": "assistant", "content": "CBC stands for Complete Blood Count. It is a common blood test that measures: 1) Red blood cells (RBC), 2) White blood cells (WBC), 3) Hemoglobin, 4) Hematocrit, 5) Platelets."}
]
},
# ... more examples (minimum 50, ideally 500+)
]
# Save as JSONL
with open("medical_training.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")
# Step 2: Upload training file
file = client.files.create(
file=open("medical_training.jsonl", "rb"),
purpose="fine-tune"
)
# Step 3: Create fine-tuning job
fine_tune = client.fine_tuning.jobs.create(
training_file=file.id,
model="gpt-4o-mini-2024-07-18",
hyperparameters={"n_epochs": "auto", "batch_size": "auto"}
)
# Step 4: Monitor status
import time
while True:
    job = client.fine_tuning.jobs.retrieve(fine_tune.id)
    print(f"Status: {job.status}")
    if job.status in ["succeeded", "failed"]:
        break
    time.sleep(60)
# Step 5: Use fine-tuned model
response = client.chat.completions.create(
model=fine_tune.fine_tuned_model,
messages=[
{"role": "system", "content": "You are a medical terminology expert."},
{"role": "user", "content": "What does MRI stand for?"}
]
)

# ── HuggingFace Fine-Tuning (LoRA / QLoRA) ──
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          TrainingArguments, Trainer, DataCollatorForSeq2Seq)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
# Step 1: Load model with 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B-Instruct",
quantization_config=bnb_config,
device_map="auto",
)
# Step 2: Configure LoRA (Low-Rank Adaptation)
lora_config = LoraConfig(
r=16, # Rank (8-64 typical)
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj"], # Which layers to adapt
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable: 0.1% of total parameters
# Step 3: Prepare data
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
def format_instruction(sample):
    return f"### Instruction:\n{sample['instruction']}\n\n### Response:\n{sample['output']}<|end_of_text|>"
# Apply format_instruction and tokenize the dataset to produce `tokenized_dataset` (used below)
# Step 4: Train
training_args = TrainingArguments(
output_dir="./lora-finetuned",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
num_train_epochs=3,
fp16=True,
logging_steps=10,
save_strategy="epoch",
optim="paged_adamw_8bit",
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset,
data_collator=DataCollatorForSeq2Seq(tokenizer, padding=True),
)
trainer.train()
# Step 5: Save & Merge
model.save_pretrained("./lora-adapter")  # Save only LoRA weights (~50MB)

| Approach | Data Needed | Cost | Quality | When to Use |
|---|---|---|---|---|
| Prompt Engineering | 0 samples | Low (API only) | Medium | Quick prototyping, simple tasks |
| OpenAI Fine-Tuning | 50-10K examples | Medium ($100-1000) | Good | Domain-specific, style, format |
| Full Fine-Tuning | 10K+ examples | Very High (GPUs) | Best | Fundamentally change model behavior |
| LoRA / QLoRA | 1K+ examples | Low (single GPU) | Good-Very Good | Cost-effective, open models |
| RAG | Documents (no labeled) | Medium (infra + API) | Good (factual) | Knowledge-heavy, up-to-date info |
| RLHF / DPO | 1K-100K preferences | High (compute + human) | Excellent | Align model with human values |
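The LoRA row's cost advantage comes from how few parameters the adapters add: each adapted weight matrix W (d_out x d_in) gains only the low-rank factors A (r x d_in) and B (d_out x r). A quick count (the layer and dimension numbers below are assumptions for illustration of an 8B-class model):

```python
def lora_trainable_params(d_in: int, d_out: int, r: int,
                          n_layers: int, matrices_per_layer: int) -> int:
    """Parameters added by LoRA: A (r x d_in) plus B (d_out x r) per adapted matrix."""
    per_matrix = r * d_in + d_out * r
    return per_matrix * matrices_per_layer * n_layers

# Assumed config: hidden size 4096, r=16, q_proj + v_proj adapted in 32 layers
print(lora_trainable_params(4096, 4096, 16, 32, 2))  # → 8388608, roughly 0.1% of 8B
```

Training ~8M parameters instead of 8B is what lets QLoRA fit on a single consumer GPU.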
AI Agents are LLM-powered systems that can plan, use tools, and autonomously complete multi-step tasks. They combine reasoning with action — the frontier of AI application design.
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.tools import tool
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_core.prompts import ChatPromptTemplate
# ── Define Custom Tools ──
@tool
def calculate(expression: str) -> str:
    """Evaluate a mathematical expression. Example: '2 + 3 * 4'"""
    try:
        # eval() is unsafe on untrusted input; use a math parser in production
        result = eval(expression)
        return f"The result of {expression} = {result}"
    except Exception as e:
        return f"Error: {str(e)}"
@tool
def get_current_weather(city: str) -> str:
    """Get current weather for a city. Example: 'San Francisco'"""
    # In production, call a weather API
    return f"Weather in {city}: 72F, Partly Cloudy, Humidity 65%"
@tool
def search_web(query: str) -> str:
    """Search the web for information. Example: 'latest AI news 2025'"""
    search = DuckDuckGoSearchRun()
    return search.run(query)
tools = [calculate, get_current_weather, search_web]
# ── Create Agent ──
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant. Use tools to answer questions. "
"Think step by step about which tools to use."),
("human", "{input}"),
("placeholder", "{agent_scratchpad}"), # Agent thinking space
])
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
agent=agent,
tools=tools,
verbose=True,
max_iterations=10,
return_intermediate_steps=True,
)
# Run the agent
result = agent_executor.invoke({
"input": "What's the weather in Tokyo? If it's above 70F, calculate 15% tip on a $85 bill."
})
# Agent will: 1) Call get_current_weather("Tokyo")
# 2) Check temperature
# 3) Call calculate("85 * 0.15") if above 70F
# 4) Provide final answer

| Framework | Creator | Approach | Strengths | Best For |
|---|---|---|---|---|
| LangChain Agents | LangChain | Tool-calling agent loop | Rich ecosystem, many integrations | RAG + tools, multi-step tasks |
| LangGraph | LangChain | Graph-based agent workflows | State management, cycles, branching | Complex agent workflows |
| CrewAI | CrewAI | Multi-agent collaboration | Role-based agents, task delegation | Team of specialized agents |
| AutoGen | Microsoft | Multi-agent conversation | Code execution, human-in-loop | Coding agents, research |
| OpenAI Assistants | OpenAI | Built-in agent platform | Code interpreter, file search | Simple agent apps |
| Semantic Kernel | Microsoft | Enterprise agent framework | Planners, connectors, enterprise | Enterprise applications |
| Pattern | Description | Example | Complexity |
|---|---|---|---|
| ReAct | Reason then Act, loop until done | Think about query -> Search -> Answer | Low |
| Plan & Execute | Plan all steps first, then execute | Make a plan -> Execute each step | Medium |
| Multi-Agent | Delegate subtasks to specialized agents | Researcher + Writer + Reviewer | High |
| Human-in-the-Loop | Ask human for approval at key steps | Execute step -> Ask human -> Continue | Medium |
| Reflection | Self-evaluate and retry on failure | Attempt -> Evaluate -> Improve -> Retry | Medium |
| Router | Route to different agents based on input | Classify query -> Route to expert agent | Low |
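The Router row can be sketched with plain keyword matching (the handler names are hypothetical; production routers typically use an LLM classification call instead of keywords):

```python
def route(query: str) -> str:
    """Keyword router: classify the query, return the name of the agent to handle it."""
    q = query.lower()
    if any(word in q for word in ("invoice", "refund", "billing")):
        return "billing_agent"
    if any(word in q for word in ("error", "crash", "bug")):
        return "support_agent"
    return "general_agent"

print(route("I need a refund for my order"))  # → billing_agent
```

Routing keeps each downstream agent's prompt small and specialized, which usually beats one giant do-everything prompt on both accuracy and cost.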
Responsible AI ensures AI systems are fair, transparent, private, and safe. As AI becomes more powerful, the ethical implications grow more significant.
| Challenge | Description | Impact | Mitigation |
|---|---|---|---|
| Hallucination | LLM generates plausible but false information | Misinformation, bad decisions | RAG, fact-checking, low temperature, explicit uncertainty |
| Bias & Fairness | Model reflects biases in training data | Discriminatory outcomes | Diverse training data, fairness metrics, red-teaming |
| Privacy | Model memorizes and leaks personal data | Data breaches, PII exposure | Differential privacy, data anonymization, PII detection |
| Security | Prompt injection, jailbreaking, adversarial inputs | Data theft, harmful outputs | Input validation, system prompts, content filtering |
| Over-reliance | Users trust AI blindly without verification | Errors in critical decisions | Confidence calibration, uncertainty display, human review |
| Deepfakes | AI generates realistic fake content | Misinformation, fraud, reputation harm | Watermarking, detection tools, provenance tracking |
| Copyright | Training data may include copyrighted content | Legal risk, ethical concerns | Licensed training data, attribution, opt-out mechanisms |
# ── Input Validation & Safety ──
import re
from openai import OpenAI
client = OpenAI()
def safe_generate(user_input: str, system_prompt: str) -> dict:
    """Generate with safety guardrails."""
    # 1. Input validation
    if not user_input or len(user_input) > 5000:
        return {"error": "Invalid input length"}
    # 2. PII detection (basic)
    email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    phone_pattern = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
    if re.search(email_pattern, user_input) or re.search(phone_pattern, user_input):
        return {"error": "Input contains potential PII. Please remove personal information."}
    # 3. Prompt injection detection (keyword heuristics catch only naive attacks)
    injection_patterns = [
        r'ignore previous instructions',
        r'you are now',
        r'system prompt',
        r'forget everything',
    ]
    for pattern in injection_patterns:
        if re.search(pattern, user_input, re.IGNORECASE):
            return {"error": "Potential prompt injection detected"}
    # 4. Generate with safety system prompt
    safety_system = f"""{system_prompt}
SAFETY RULES:
- Never generate harmful, illegal, or dangerous content
- If unsure, say "I'm not sure about that"
- Do not share private information
- Be honest about limitations"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": safety_system},
            {"role": "user", "content": user_input}
        ],
        temperature=0.3,
    )
    return {"response": response.choices[0].message.content}
# ── Content Moderation (OpenAI) ──
moderation = client.moderations.create(input="Sample text to check")
if moderation.results[0].flagged:
    categories = moderation.results[0].categories.model_dump()
    print(f"Flagged categories: {categories}")

| Regulation | Jurisdiction | Key Requirements | Status |
|---|---|---|---|
| EU AI Act | European Union | Risk-based classification, transparency, conformity assessment | Enacted Aug 2024, phased rollout 2025-2027 |
| NIST AI RMF | United States | Risk management framework, governance, mapping, measuring | Voluntary framework, widely adopted |
| Executive Order 14110 | United States | Safety testing, red-teaming, watermarking for AI | Signed Oct 2023, agencies implementing |
| AI Safety Institute | UK/US | Frontier model evaluation, safety research | Active, evaluating major models |
| GDPR Art. 22 | European Union | Right to explanation for automated decisions | Enforced since 2018 |