AI Cheat Sheet

Quick reference for working with LLMs. Copy-paste patterns, parameter settings, and practical formulas.


Contents

- Prompt Patterns
- API Parameters
- Token Estimation
- Model Selection
- Cost Optimization
- Evaluation Metrics
- Common Patterns
- Troubleshooting
- Quick Reference Cards

Prompt Patterns

Basic Structure

[ROLE/PERSONA]
You are a [role] who [key trait]. Your goal is to [objective].

[CONTEXT]
Background information the model needs.

[TASK]
Specific instruction for what to do.

[FORMAT]
How to structure the output.

[CONSTRAINTS]
What NOT to do or boundaries to respect.

System Prompt Template

You are [role] with expertise in [domain].

Your responsibilities:
- [Primary task]
- [Secondary task]

Guidelines:
- [Behavior 1]
- [Behavior 2]

Constraints:
- Never [prohibited action]
- Always [required action]

Output format: [format specification]

Few-Shot Template

Here are examples of the task:

Example 1:
Input: [example input]
Output: [example output]

Example 2:
Input: [example input]
Output: [example output]

Now complete this:
Input: [actual input]
Output:
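
The template above drops straight into code. A minimal sketch, assuming your examples are (input, output) pairs; build_few_shot_prompt is an illustrative helper, not a library function:

# Build the few-shot template above from (input, output) pairs.
def build_few_shot_prompt(examples, actual_input):
    parts = ["Here are examples of the task:", ""]
    for i, (inp, out) in enumerate(examples, start=1):
        parts += [f"Example {i}:", f"Input: {inp}", f"Output: {out}", ""]
    parts += ["Now complete this:", f"Input: {actual_input}", "Output:"]
    return "\n".join(parts)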

Chain-of-Thought Triggers

| Trigger Phrase | Use Case |
|---|---|
| "Let's think step by step" | General reasoning |
| "First, let's break this down" | Complex problems |
| "Walk through your reasoning" | Explanations needed |
| "Consider each option carefully" | Decision making |
| "Show your work" | Math/calculations |

Output Format Specifiers

# JSON
Respond in JSON format:
{"field1": "value", "field2": "value"}

# Markdown
Use markdown with headers and bullet points.

# XML
Wrap your response in <response></response> tags.

# Structured
Use this exact format:
ANALYSIS: [your analysis]
RECOMMENDATION: [your recommendation]
CONFIDENCE: [high/medium/low]
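
Requesting JSON is only half the job; models sometimes wrap it in prose or code fences. A defensive parsing sketch (the function name is illustrative):

import json

def parse_json_response(raw):
    # Strip optional ```json fences before parsing.
    cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None  # caller can retry with a stricter format reminder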

API Parameters

Temperature Guide

| Temperature | Behavior | Use Case |
|---|---|---|
| 0.0 | Deterministic | Factual Q&A, code, data extraction |
| 0.3 | Low variance | Technical writing, analysis |
| 0.7 | Balanced | General conversation, creative but grounded |
| 1.0 | High variance | Brainstorming, creative writing |
| 1.5+ | Very random | Experimental, often incoherent |

Top-p (Nucleus Sampling)

| Top-p | Effect |
|---|---|
| 0.1 | Very focused, limited vocabulary |
| 0.5 | Moderately diverse |
| 0.9 | Standard setting, good diversity |
| 1.0 | Consider all tokens |

Common Parameter Combinations

| Task | Temperature | Top-p | Max Tokens |
|---|---|---|---|
| Code generation | 0.0-0.2 | 0.95 | 2000-4000 |
| Data extraction | 0.0 | 1.0 | 500-1000 |
| Summarization | 0.3 | 0.9 | 500-1500 |
| Creative writing | 0.8-1.0 | 0.95 | 2000+ |
| Chat/conversation | 0.7 | 0.9 | 500-1000 |
| Analysis/reasoning | 0.2-0.5 | 0.95 | 1000-2000 |
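
The table translates directly into reusable presets. A sketch with values mirroring the table; the commented-out call shows one way to apply a preset with the OpenAI Python SDK (an assumption, not a requirement):

PRESETS = {
    "code":      {"temperature": 0.1, "top_p": 0.95, "max_tokens": 3000},
    "extract":   {"temperature": 0.0, "top_p": 1.0,  "max_tokens": 800},
    "summarize": {"temperature": 0.3, "top_p": 0.9,  "max_tokens": 1000},
    "creative":  {"temperature": 0.9, "top_p": 0.95, "max_tokens": 2000},
    "chat":      {"temperature": 0.7, "top_p": 0.9,  "max_tokens": 800},
    "analysis":  {"temperature": 0.3, "top_p": 0.95, "max_tokens": 1500},
}

# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[{"role": "user", "content": "..."}],
#     **PRESETS["summarize"],
# )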

Stop Sequences

# Common stop sequences
stop_sequences = [
    "\n\n",           # Double newline
    "Human:",         # Conversation boundary
    "```",            # End of code block
    "</response>",    # XML tag closure
    "---",            # Section break
]

Token Estimation

Quick Rules of Thumb

| Content Type | Rule of Thumb |
|---|---|
| English text | ~1.3 tokens/word (1 token ≈ 0.75 words) |
| Code | ~0.4 tokens/character |
| JSON | ~1.3× the token count of equivalent prose |
| Non-English text | 1.5-4× the tokens of English |

Estimation Formulas

# English text
tokens ≈ words × 1.3
tokens ≈ characters / 4

# Code
tokens ≈ characters / 2.5

# Quick estimate
tokens ≈ len(text.split()) * 1.3
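
For exact counts with OpenAI-family models, the tiktoken library beats any rule of thumb. A minimal sketch; the encoding name depends on the model, and cl100k_base covers GPT-4-era tokenizers:

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    return len(enc.encode(text))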

Context Window Budgeting

Total Context = System Prompt + Conversation History + Retrieved Context + User Message + Output Buffer

Example for 128K context:
- System prompt: 2,000 tokens
- Conversation history: 10,000 tokens  
- Retrieved context (RAG): 50,000 tokens
- User message: 1,000 tokens
- Output buffer: 4,000 tokens
- Safety margin (10%): 12,800 tokens
─────────────────────────────────
Remaining headroom: ~48,200 tokens (128,000 - 67,000 allocated - 12,800 safety margin)
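
The same budgeting is easy to script. A sketch following the breakdown above; all numbers are illustrative:

def retrieval_budget(context_window, system, history, user_msg,
                     output_buffer, safety_fraction=0.10):
    # Everything left after fixed allocations and the safety margin
    # is available for retrieved context.
    margin = int(context_window * safety_fraction)
    return context_window - margin - system - history - user_msg - output_buffer

# The 128K example above: ~98,200 tokens available for retrieval,
# i.e. ~48,200 of headroom beyond the 50,000 already allocated.
print(retrieval_budget(128_000, 2_000, 10_000, 1_000, 4_000))  # 98200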

Model Context Limits (January 2026)

| Model | Context Window | Output Limit |
|---|---|---|
| GPT-4 Turbo | 128K | 4K |
| GPT-4o | 128K | 16K |
| Claude 3.5 Sonnet | 200K | 8K |
| Claude 3 Opus | 200K | 4K |
| Gemini 1.5 Pro | 1M | 8K |
| Llama 3.1 405B | 128K | 4K |

Model Selection

Decision Matrix

| Need | Best Choice | Why |
|---|---|---|
| Highest quality | Claude 3 Opus, GPT-4 | Best reasoning |
| Best value | Claude 3.5 Sonnet, GPT-4o | Quality/cost balance |
| Speed critical | Claude 3 Haiku, GPT-4o mini | Low latency |
| Long documents | Gemini 1.5 Pro | 1M context |
| Code generation | Claude 3.5 Sonnet | Best benchmarks |
| Self-hosted | Llama 3.1 70B | Open weights |
| Cost sensitive | Llama 3.1 8B, Mixtral | Free/cheap |

Pricing Quick Reference (per 1M tokens, approximate)

| Model | Input | Output |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Haiku | $0.25 | $1.25 |
| Gemini 1.5 Pro | $1.25 | $5.00 |
| Llama 3.1 (hosted) | $0.50-2.00 | $0.50-2.00 |

Prices change frequently — verify current rates


Cost Optimization

Strategies

| Strategy | Savings | Trade-off |
|---|---|---|
| Prompt caching | 50-90% | Cache invalidation |
| Smaller models for routing | 70-90% | Accuracy on complex tasks |
| Batch processing | 50% | Latency |
| Output length limits | Variable | May truncate |
| Semantic caching | 30-60% | Cache misses |

Prompt Caching Pattern

# Structure prompts with static content first so providers can cache the
# shared prefix. Note that "#" inside an f-string is literal text, not a
# comment, so the cached/not-cached labels belong outside the string:
prompt = (
    STATIC_SYSTEM_PROMPT    # cached
    + STATIC_EXAMPLES       # cached
    + STATIC_INSTRUCTIONS   # cached
    + "\n---\n"
    + dynamic_user_input    # not cached
)

Cost Estimation Formula

cost = (input_tokens / 1_000_000 * input_price) \
     + (output_tokens / 1_000_000 * output_price)

# Example: GPT-4o, 2,000 input tokens, 500 output tokens
cost = (2000 / 1_000_000 * 2.50) + (500 / 1_000_000 * 10.00)
# = 0.005 + 0.005 = $0.01 per request
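
Wrapping the formula around the pricing table gives a quick per-request estimator. A sketch; the model keys are informal labels and the prices are the approximate figures above, so verify current rates:

PRICES = {  # (input, output) in USD per 1M tokens, approximate
    "gpt-4o":            (2.50, 10.00),
    "gpt-4o-mini":       (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3-haiku":    (0.25, 1.25),
}

def estimate_cost(model, input_tokens, output_tokens):
    input_price, output_price = PRICES[model]
    return (input_tokens / 1_000_000 * input_price
            + output_tokens / 1_000_000 * output_price)

print(estimate_cost("gpt-4o", 2000, 500))  # 0.01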

Evaluation Metrics

Classification Metrics (for LLM outputs)

| Metric | Formula | Use When |
|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classes |
| Precision | TP / (TP + FP) | False positives costly |
| Recall | TP / (TP + FN) | False negatives costly |
| F1 Score | 2 × (P × R) / (P + R) | Balance P and R |
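
The formulas translate directly into code. A sketch that takes confusion-matrix counts from a labeled eval set:

def classification_metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / total,
            "precision": precision, "recall": recall, "f1": f1}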

Generation Quality Metrics

| Metric | What It Measures | Range |
|---|---|---|
| BLEU | N-gram overlap with reference | 0-1 (higher = better) |
| ROUGE | Recall of reference n-grams | 0-1 (higher = better) |
| Perplexity | Model confidence | Lower = better |
| BERTScore | Semantic similarity | 0-1 (higher = better) |

RAG-Specific Metrics

| Metric | Formula | Target |
|---|---|---|
| Context Precision | Relevant chunks / Retrieved chunks | > 0.8 |
| Context Recall | Retrieved relevant / Total relevant | > 0.9 |
| Faithfulness | Claims supported by context / Total claims | > 0.95 |
| Answer Relevancy | Semantic similarity to question | > 0.8 |
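
The three ratio metrics compute directly from judged counts; how relevance and claim support are judged (human labels or LLM-as-judge) is up to you. A sketch:

def rag_metrics(retrieved, retrieved_relevant, total_relevant,
                supported_claims, total_claims):
    return {
        "context_precision": retrieved_relevant / retrieved,
        "context_recall": retrieved_relevant / total_relevant,
        "faithfulness": supported_claims / total_claims,
    }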

LLM-as-Judge Template

You are evaluating an AI response on a scale of 1-5.

Criteria:
- Accuracy: Is the information correct?
- Completeness: Does it fully address the question?
- Clarity: Is it well-organized and easy to understand?
- Relevance: Does it stay on topic?

Response to evaluate:
{response}

Original question:
{question}

Provide scores for each criterion and an overall score with brief justification.
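
A thin wrapper can fill the template and pull scores out of the reply. A sketch; llm_complete is a hypothetical stand-in for your model call, and the regex assumes the judge reports scores as bare digits 1-5:

import re

def judge(question, response, judge_template, llm_complete):
    # llm_complete() is hypothetical: any callable that takes a prompt
    # string and returns the model's text completion.
    raw = llm_complete(judge_template.format(response=response, question=question))
    return [int(s) for s in re.findall(r"\b[1-5]\b", raw)]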

Common Patterns

RAG Pattern

# Pseudocode sketch: embed(), vector_db, format_chunks(), and llm are
# stand-ins for your embedding model, vector store, formatter, and client.

# 1. Embed the query
query_embedding = embed(user_query)

# 2. Retrieve relevant chunks
chunks = vector_db.search(query_embedding, top_k=5)

# 3. Construct prompt
prompt = f"""
Use the following context to answer the question.
If the context doesn't contain the answer, say so.

Context:
{format_chunks(chunks)}

Question: {user_query}

Answer:
"""

# 4. Generate response
response = llm.generate(prompt)

ReAct Agent Pattern

Thought: I need to [reasoning about what to do]
Action: tool_name(param1="value1", param2="value2")
Observation: [result from tool]
Thought: Based on this, I should [next reasoning]
Action: another_tool(param="value")
Observation: [result]
Thought: I now have enough information
Answer: [final response to user]
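
Behind that trace is a simple loop: generate, parse for an Action or Answer, run the tool, append the Observation, repeat. A minimal sketch; the regexes, tool registry, and llm_complete callable are illustrative assumptions:

import re

def react_loop(question, tools, llm_complete, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm_complete(transcript)  # model emits Thought/Action or Answer
        transcript += step + "\n"
        answer = re.search(r"Answer:\s*(.*)", step, re.DOTALL)
        if answer:
            return answer.group(1).strip()
        action = re.search(r"Action:\s*(\w+)\((.*)\)", step)
        if action:
            name, arg_string = action.groups()
            result = tools[name](arg_string)  # naive: args passed as a raw string
            transcript += f"Observation: {result}\n"
    return "Max steps reached without a final answer."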

Tool Definition (OpenAI Format)

tools = [{
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search the product database for items",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query"
                },
                "category": {
                    "type": "string",
                    "enum": ["electronics", "clothing", "home"],
                    "description": "Product category filter"
                },
                "max_results": {
                    "type": "integer",
                    "default": 10
                }
            },
            "required": ["query"]
        }
    }
}]
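
Passing the definition to the API and dispatching the model's tool call looks roughly like this with the OpenAI v1 Python SDK (a sketch; tool_calls is None when the model answers directly):

import json
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Find a budget laptop"}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
# result = search_database(**args)  # dispatch to your own implementation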

Retry with Exponential Backoff

import time
import random

# RateLimitError here assumes the openai v1 SDK; swap in your SDK's
# rate-limit exception if you use a different client.
from openai import RateLimitError

def call_with_retry(func, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Double the delay each attempt, plus jitter to avoid thundering herds.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

Troubleshooting

Common Issues and Fixes

| Problem | Likely Cause | Solution |
|---|---|---|
| Repetitive output | Temperature too low | Increase temperature to 0.7+ |
| Nonsense/gibberish | Temperature too high | Decrease temperature to 0.3-0.5 |
| Ignores instructions | Prompt too long | Move key instructions to end |
| Wrong format | Unclear specification | Add explicit format example |
| Hallucinations | No grounding | Add RAG or fact-checking |
| Cuts off mid-response | max_tokens too low | Increase max_tokens |
| Rate limit errors | Too many requests | Add retry logic with backoff |
| Context overflow | Input too long | Summarize or chunk input |

Prompt Debugging Checklist

□ Is the instruction clear and unambiguous?
□ Is there an example of the expected output?
□ Are constraints explicitly stated?
□ Is the most important instruction near the end?
□ Is the context relevant and not too long?
□ Are delimiters used to separate sections?
□ Is the output format specified?
□ Are edge cases addressed?

Error Messages Quick Reference

| Error | Meaning | Fix |
|---|---|---|
| context_length_exceeded | Input + output > limit | Reduce input or max_tokens |
| rate_limit_exceeded | Too many requests | Add delays, use backoff |
| invalid_api_key | Auth failed | Check API key |
| model_not_found | Wrong model name | Verify model string |
| content_filter | Safety triggered | Rephrase request |
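
Mapped to code, the table becomes a try/except ladder. A sketch assuming the openai v1 SDK's exception classes; other SDKs name these differently:

from openai import OpenAI, AuthenticationError, BadRequestError, RateLimitError

client = OpenAI()
try:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
    )
except RateLimitError:
    pass  # rate_limit_exceeded: back off and retry (see the retry pattern above)
except AuthenticationError:
    pass  # invalid_api_key: check the key and environment
except BadRequestError as err:
    pass  # includes context_length_exceeded and bad model names; inspect err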

Quick Reference Cards

Prompt Engineering Principles

1. Be specific, not vague
2. Show, don't just tell (use examples)
3. Structure with clear sections
4. Put critical instructions last
5. Specify what NOT to do
6. Request step-by-step for complex tasks
7. Define output format explicitly
8. Test with edge cases

Token-Saving Tips

1. Remove redundant phrases ("I want you to", "Please")
2. Use abbreviations in system prompts
3. Compress examples to minimum needed
4. Summarize conversation history
5. Use structured formats over prose
6. Cache static prompt components
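
Tip 4 is the one that usually needs code. A sketch that keeps only the most recent messages fitting a token budget, reusing the rough words-based estimate from the Token Estimation section:

def estimate_tokens(text):
    return len(text.split()) * 1.3

def trim_history(messages, budget):
    kept, used = [], 0.0
    for msg in reversed(messages):  # walk newest-first
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))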

Notes

This cheat sheet is maintained for quick reference rather than deep explanation.

Feedback and suggestions are welcome!

Last updated: January 2026