
Models Family (G5)

These elements are the raw intelligence that powers everything else.

All the orchestration, retrieval, and validation in the world doesn't matter without capable models underneath. This family covers the spectrum from general-purpose LLMs to specialized variants.

Element | Name | Row | Description
--- | --- | --- | ---
Lg | LLM | Primitives | The core reasoning engines
Mm | Multi-modal | Compositions | Models that process text, images, audio
Sm | Small Models | Deployment | Fast, cheap, efficient alternatives
Th | Thinking Models | Emerging | Models that reason before answering

Lg — LLM

Position in Periodic Table:

G5: Models Family
┌──────────────────────────┐
│ → [LLM] │ Row 1: Primitives
│ Multi-modal │ Row 2: Compositions
│ Small Models │ Row 3: Deployment
│ Thinking Models │ Row 4: Emerging
└──────────────────────────┘

What It Is

Large Language Models (LLMs) are the core reasoning engines: GPT-4, Claude, Gemini, Llama, and others. Trained on vast text corpora, they are the primitive capability that everything else builds on.

Why It Matters

LLMs are the foundation of modern AI:

  • All other elements in the periodic table depend on them
  • They provide the reasoning that powers agents, RAG, and more
  • Understanding their capabilities and limitations is essential
  • Model selection impacts cost, quality, and capabilities

How LLMs Work (High Level)

  1. Training: Learn patterns from massive text datasets
  2. Prediction: Given input tokens, predict the next token
  3. Generation: Repeat prediction to produce text (see the sketch after this list)
  4. Instruction tuning: Fine-tuned to follow instructions
  5. RLHF: Refined via human feedback
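
A minimal sketch of the generation loop in steps 2 and 3, assuming a hypothetical predict_next_token function that returns the model's most likely next token:

def generate(prompt_tokens, max_new_tokens=50, eos_token="<eos>"):
    # Start from the prompt and repeatedly append the predicted next token
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # hypothetical model call
        if next_token == eos_token:              # stop at end-of-sequence
            break
        tokens.append(next_token)
    return tokens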

Key Properties

Property | Description
--- | ---
Parameters | Model size (7B, 70B, 175B, etc.)
Context window | How much text it can process
Training data | What knowledge it has
Knowledge cutoff | How recent its information is
Capabilities | Reasoning, coding, creativity, etc.
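
Context windows are measured in tokens, not characters, so it helps to count tokens before sending a prompt. A short sketch using the tiktoken tokenizer (the encoding name and the 128k limit are assumptions; check your model's documentation):

import tiktoken

def fits_in_context(prompt: str, max_context_tokens: int = 128_000) -> bool:
    # Tokenize with a GPT-style encoding; other model families use different tokenizers
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(prompt)) <= max_context_tokens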

Major Model Families (2026)

Provider | Models | Notes
--- | --- | ---
OpenAI | GPT-4, GPT-4 Turbo | Strong general capabilities
Anthropic | Claude 3.5, Claude 3 Opus | Strong reasoning, longer context
Google | Gemini Pro, Ultra | Multimodal, large context
Meta | Llama 3 | Open weights
Mistral | Mixtral, Mistral Large | Efficient, European

Model Selection Factors

Factor | Consideration
--- | ---
Task fit | Which model excels at your task?
Cost | Price per token varies 100x
Latency | Response time requirements
Context | How much input you need
Privacy | Self-hosted vs. API
Features | Tool use, vision, etc.

Capabilities and Limitations

What LLMs Can Do Well:

  • Text generation and transformation
  • Summarization and extraction
  • Code generation and explanation
  • Question answering (with context)
  • Creative writing and brainstorming
  • Following complex instructions
  • Reasoning through problems

Known Limitations:

Limitation | Description
--- | ---
Hallucination | Generating plausible-sounding false information
Knowledge cutoff | No awareness of recent events
Math errors | Unreliable arithmetic
Inconsistency | Different answers to the same question
No memory | Each conversation is independent
Context limits | Can't process unlimited text

When to Use Which

Use Case | Recommendation
--- | ---
Complex reasoning | Frontier models (GPT-4, Claude Opus)
High volume, simple | Smaller/cheaper models
Privacy-critical | Self-hosted (Llama, Mistral)
Long documents | Large context models (Claude, Gemini)
Multimodal | Vision-capable models
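
A minimal routing sketch based on the table above; the model names and the mapping are illustrative placeholders, not recommendations:

# Map a use case to a model (illustrative names only)
ROUTES = {
    "complex_reasoning": "frontier-model",
    "high_volume_simple": "small-model",
    "privacy_critical": "self-hosted-llama",
    "long_documents": "large-context-model",
    "multimodal": "vision-model",
}

def route_model(use_case: str) -> str:
    # Default to the most capable model when the use case is unknown
    return ROUTES.get(use_case, "frontier-model")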

Tier Relevance

Tier | Expectation
--- | ---
Foundation | Understand capabilities, limitations, and hallucination risks
Practitioner | Select appropriate models for use cases
Expert | Optimize model selection for cost/quality tradeoffs

Mm — Multi-modal

Position in Periodic Table:

G5: Models Family
┌──────────────────────────┐
│ LLM │ Row 1: Primitives
│ → [Multi-modal] │ Row 2: Compositions
│ Small Models │ Row 3: Deployment
│ Thinking Models │ Row 4: Emerging
└──────────────────────────┘

What It Is

Multi-modal models process multiple input types—text, images, audio, video. See a chart and explain it. Hear a question and answer it. Unified intelligence across modalities.

Why It Matters

The world is multi-modal. Limiting AI to text-only means missing:

  • Visual understanding (charts, diagrams, screenshots)
  • Audio processing (speech, music, sounds)
  • Video analysis (demonstrations, surveillance)
  • Document understanding (PDFs with layouts)

Multi-modal capabilities open entirely new use cases.

Modality Types

Modality | Input Examples | Capabilities
--- | --- | ---
Vision | Images, screenshots, diagrams | Description, analysis, OCR
Audio | Speech, music, sounds | Transcription, understanding
Video | Recordings, streams | Scene understanding, action recognition
Documents | PDFs, scans | Layout-aware extraction

Multi-modal Model Architectures

Vision-Language Models (VLMs):

  • Image encoder + language model
  • Examples: GPT-4V, Claude 3 Vision, Gemini Pro Vision

Speech-Language Models:

  • Audio encoder + language model
  • Examples: Whisper + GPT, Gemini

Unified Models:

  • Single model handles multiple modalities
  • Examples: Gemini, GPT-4o
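
Conceptually, the vision-language pattern looks like the sketch below; the encoder, projection, and decoder are placeholders standing in for real model components:

def vlm_answer(image, question):
    # 1. Encode the image into a sequence of embedding vectors (placeholder encoder)
    image_embeddings = image_encoder(image)

    # 2. Project image embeddings into the language model's token-embedding space
    projected = projection_layer(image_embeddings)

    # 3. Prepend the projected image tokens to the question and decode an answer
    text_embeddings = embed_text(question)
    return language_model.decode(projected + text_embeddings)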

Use Cases

Vision:

Use Case | Example
--- | ---
Chart analysis | "Explain the trends in this graph"
UI understanding | "What does this screenshot show?"
Document extraction | "Extract the table from this PDF"
Image description | "Describe what's happening in this photo"
Visual QA | "What color is the car in the image?"
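
As a concrete illustration of visual QA, here is a sketch of sending an image and a question to a vision-capable model using the Anthropic Messages API (the model name is illustrative; other providers use a similar but not identical request shape):

import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model name
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text", "text": "Explain the trends in this chart."},
        ],
    }],
)
print(response.content[0].text)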

Audio:

Use Case | Example
--- | ---
Transcription | Convert speech to text
Translation | Translate spoken language
Summarization | Summarize a meeting recording
Analysis | "What emotion is expressed?"

Video:

Use Case | Example
--- | ---
Summarization | "What happens in this video?"
Action recognition | "Is the person walking or running?"
Temporal QA | "What happens after the door opens?"

Considerations

Image Quality:

Factor | Impact
--- | ---
Resolution | Higher = more detail, more tokens
Clarity | Blurry images = worse understanding
Relevance | Crop to relevant content
Format | JPEG, PNG widely supported

Token Costs: Images are tokenized differently than text:

  • A typical image = 85-1700 tokens depending on size/detail
  • Video = many frames = many tokens
  • Cost can add up quickly
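
One published scheme (OpenAI's vision pricing at the time of writing) charges a base cost plus a per-tile cost after the image is scaled; a rough estimator under that assumption, which should be checked against your provider's current rules:

import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    # Assumed scheme: 85 tokens flat for low detail; 85 + 170 per 512px tile for high detail
    if detail == "low":
        return 85
    # Scale so the image fits in 2048x2048, then so the short side is at most 768px
    scale = min(2048 / max(width, height), 768 / min(width, height), 1.0)
    w, h = int(width * scale), int(height * scale)
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles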

Limitations:

  • Hallucination: Models may "see" things not present
  • OCR errors: Text in images may be misread
  • Spatial reasoning: Understanding layouts can be imperfect
  • Small details: Fine print may be missed

Tier Relevance

Tier | Expectation
--- | ---
Foundation | Understand multi-modal capabilities
Practitioner | Build features using image or audio input
Expert | Optimize multi-modal pipelines for cost and quality

Sm — Small Models

Position in Periodic Table:

G5: Models Family
┌──────────────────────────┐
│ LLM │ Row 1: Primitives
│ Multi-modal │ Row 2: Compositions
│ → [Small Models] │ Row 3: Deployment
│ Thinking Models │ Row 4: Emerging
└──────────────────────────┘

What It Is

Small models are distilled, specialized models: fast, cheap, and efficient. They run on phones, edge devices, or at high volume. When you don't need frontier capability, a small model can often deliver most of the value at a small fraction of the cost.

Why It Matters

Not every task needs GPT-4:

  • Cost: Small models are 10-100x cheaper
  • Latency: Faster inference, better user experience
  • Privacy: Can run locally, no data leaves device
  • Scale: Affordable at high volume
  • Availability: Self-hosted means no API dependencies

Size Spectrum

Category | Parameters | Examples
--- | --- | ---
Tiny | Under 1B | DistilBERT, TinyLlama
Small | 1-10B | Llama 3 8B, Mistral 7B
Medium | 10-70B | Mixtral 8x7B, Llama 3 70B
Large | 70B+ | Llama 3.1 405B
Frontier | Undisclosed (estimated well over 100B) | GPT-4, Claude Opus, Gemini Ultra

How Small Models Are Created

Distillation: Train a small model to mimic a large model's behavior

Quantization: Reduce numerical precision (FP32 to INT8 to INT4)

Pruning: Remove less important weights

Architecture optimization: Efficient designs from the start (Mamba, etc.)
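
To make the quantization idea concrete, here is a minimal sketch of symmetric INT8 weight quantization in NumPy; real toolchains (llama.cpp, bitsandbytes, and others) are considerably more sophisticated:

import numpy as np

def quantize_int8(weights: np.ndarray):
    # Map float weights onto the signed 8-bit range [-127, 127] with a single scale factor
    scale = np.abs(weights).max() / 127.0
    return np.round(weights / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Approximate reconstruction; the difference from the original is the quantization error
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
max_error = np.abs(weights - dequantize(q, scale)).max()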

Capability Tradeoffs

Capability | Large Models | Small Models
--- | --- | ---
Complex reasoning | Strong | Weaker
Following instructions | Excellent | Good
Knowledge breadth | Very wide | Narrower
Creative writing | High quality | Adequate
Code generation | Strong | Good for common patterns
Consistency | More consistent | More variance

When to Use Small Models

Good Fit:

Use Case | Why Small Works
--- | ---
Classification | Task is well-defined
Extraction | Pattern matching
Simple Q&A | FAQ-style responses
Embeddings | Specialized models exist
High volume | Cost matters at scale
Edge deployment | Device constraints
Privacy-critical | Keep data local

Poor Fit:

Use Case | Why Large Is Better
--- | ---
Complex reasoning | Needs more capability
Novel tasks | Needs generalization
Long documents | Context limitations
High stakes | Quality requirements

Running Small Models

Self-Hosted Options:

Tool | Purpose
--- | ---
Ollama | Easy local model running
vLLM | High-performance serving
llama.cpp | CPU-optimized inference
TensorRT-LLM | NVIDIA GPU optimization
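
For example, a model served locally with Ollama can be called over its HTTP API (this assumes Ollama is running on its default port and the llama3 model has already been pulled):

import requests

# Ollama exposes a local REST API on port 11434 by default
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Classify this support ticket: 'My invoice total is wrong.'",
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(response.json()["response"])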

Cloud Options:

Provider | Offering
--- | ---
Together AI | Open model hosting
Anyscale | Scalable endpoints
Replicate | Simple model deployment
Hugging Face | Inference endpoints

Cost Comparison Example

Task: Process 1M customer support tickets

Frontier model (GPT-4):
~500 tokens/ticket x 1M = 500M tokens
~$15,000 for input + output

Small model (Llama 3 8B, self-hosted):
Server cost: ~$500/month
Can process 1M+ tickets/month

Savings: 95%+ reduction in ongoing costs
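
The arithmetic behind the comparison, as a quick sketch (the per-token price and server cost are illustrative assumptions, not quotes):

tickets = 1_000_000
tokens_per_ticket = 500

# Frontier API: assume a blended price of about $30 per million tokens
api_cost = tickets * tokens_per_ticket / 1_000_000 * 30  # ~$15,000

# Self-hosted small model: assume a ~$500/month server handles the volume
self_hosted_cost = 500

savings = 1 - self_hosted_cost / api_cost  # ~0.97
print(f"API: ${api_cost:,.0f}  Self-hosted: ${self_hosted_cost}  Savings: {savings:.0%}")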

Tier Relevance

Tier | Expectation
--- | ---
Foundation | Understand when small models are appropriate
Practitioner | Demonstrate model selection with cost/performance analysis
Expert | Design systems with optimal model routing

Th — Thinking Models

Position in Periodic Table:

G5: Models Family
┌──────────────────────────┐
│ LLM │ Row 1: Primitives
│ Multi-modal │ Row 2: Compositions
│ Small Models │ Row 3: Deployment
│ → [Thinking Models] │ Row 4: Emerging
└──────────────────────────┘

What It Is

Thinking models reason before answering. Chain-of-thought is built into their architecture. They spend compute time thinking, not just generating. The smartest models today use this approach.

Examples: OpenAI's o1, Claude's extended thinking mode.

Why It Matters

Traditional LLMs generate the first plausible response. Thinking models:

  • Consider alternatives before committing
  • Catch errors through internal verification
  • Handle complexity that stumps regular models
  • Show their work (sometimes) for transparency

For hard problems, thinking models significantly outperform standard models.

How Thinking Models Differ

Standard LLM:

Input → Generate tokens → Output
(fast, but may miss nuances)

Thinking Model:

Input → Reason internally → Verify → Refine → Output
(slower, but more accurate on hard problems)

Characteristics

Aspect | Thinking Models | Standard Models
--- | --- | ---
Latency | Higher (seconds to minutes) | Lower (milliseconds to seconds)
Cost | Higher (more compute) | Lower
Simple tasks | Overkill | Efficient
Complex reasoning | Excels | Struggles
Math/logic | Strong | Unreliable
Transparency | Can show reasoning | Limited visibility

When Thinking Helps

Task Type | Benefit
--- | ---
Math problems | High: verifies calculations
Logic puzzles | High: explores possibilities
Complex code | High: considers edge cases
Planning | High: thinks through steps
Simple Q&A | Low: unnecessary overhead
Creative writing | Variable: may overthink

Trade-offs

Latency vs. Quality:

Simple question: "What's the capital of France?"
├─ Standard model: 200ms, "Paris" ✓
└─ Thinking model: 5s, "Paris" ✓ (wasted time)

Complex problem: "Prove this mathematical theorem"
├─ Standard model: 500ms, often wrong ✗
└─ Thinking model: 60s, usually correct ✓

Cost Considerations: Thinking models use more tokens internally:

  • A problem that takes 100 tokens to state
  • May require 5,000+ tokens of internal reasoning
  • Billed accordingly

Use strategically on problems that benefit.

Design Patterns

Selective Reasoning: Route simple queries to fast models, complex to thinking:

def answer_query(query):
    # Score how hard the query is (assess_complexity is a placeholder heuristic)
    complexity = assess_complexity(query)

    if complexity < 0.5:
        return fast_model.complete(query)      # cheap, low-latency path
    else:
        return thinking_model.complete(query)  # slower, deeper reasoning

Hybrid Approaches: Use thinking for planning, fast models for execution:

# Thinking model creates the plan
plan = thinking_model.complete(f"Create a plan to: {goal}")

# Fast model executes each step
for step in plan.steps:
    result = fast_model.complete(f"Execute: {step}")

Verification Loops: Use thinking model to verify fast model outputs:

def answer_with_verification(query):
    # Fast model produces a draft
    draft = fast_model.complete(query)

    # Thinking model checks the draft
    verification = thinking_model.complete(
        f"Verify this response is correct: {draft}"
    )

    if verification.has_issues:
        return thinking_model.complete(query)  # Redo properly
    return draft

Current State (2026)

Thinking models are relatively new:

  • o1 (OpenAI): Released late 2024, shows strong reasoning
  • Extended thinking (Anthropic): Claude's reasoning mode
  • Gemini thinking: Google's approach
  • Research: Rapid progress in this area

Expect significant advances in coming years.

Limitations

  • Not always better: Overkill for simple tasks
  • Costly: Token usage can be 10-100x higher
  • Latency: Inappropriate for real-time applications
  • Opaque reasoning: Internal thoughts often hidden
  • New failure modes: Can reason themselves into wrong answers

Tier Relevance

Tier | Expectation
--- | ---
Foundation | Understand what thinking models are
Practitioner | Know when to use thinking vs. standard models
Expert | Design systems optimizing reasoning vs. speed tradeoffs