
Retrieval Family (G2)

These elements handle memory and knowledge—how AI systems store, find, and adapt information.

Three time scales of memory: runtime (context), persistent (vector databases), and baked-in (fine-tuning).

| Element | Name | Row | Description |
|---------|------|-----|-------------|
| Em | Embeddings | Primitives | Numerical representations of meaning |
| Vx | Vector DB | Compositions | Databases for storing and querying embeddings |
| Ft | Fine-tuning | Deployment | Adapting models by training on specific data |
| Sy | Synthetic Data | Emerging | AI-generated training data |

Em — Embeddings

Position in Periodic Table:

G2: Retrieval Family
┌──────────────────────┐
│ → [Embeddings] │ Row 1: Primitives
│ Vector DB │ Row 2: Compositions
│ Fine-tuning │ Row 3: Deployment
│ Synthetic Data │ Row 4: Emerging
└──────────────────────┘

What It Is

Embeddings are numerical representations of meaning. Text becomes vectors (lists of numbers) where similar meanings have similar numbers. "Happy" and "joyful" will have similar embeddings, even though they share no letters.

Why It Matters

Embeddings unlock semantic understanding. Instead of searching for exact keyword matches, you can search by meaning. This enables:

  • Finding relevant documents even when words differ
  • Clustering similar content automatically
  • Measuring how related two pieces of text are
  • Building the foundation for RAG systems

How Embeddings Work

  1. Text goes into an embedding model
  2. Model outputs a vector (e.g., 1536 numbers for OpenAI's text-embedding-3-small)
  3. Vectors can be compared using distance metrics
  4. Closer vectors = more similar meaning
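
A minimal sketch of this loop, assuming the openai Python package with an OPENAI_API_KEY in the environment; the example strings are illustrative:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Steps 1-2: text in, vectors out (text-embedding-3-small returns 1536 numbers)
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["happy", "joyful", "spreadsheet"],
)
happy, joyful, spreadsheet = (np.array(d.embedding) for d in resp.data)

# Steps 3-4: compare vectors with a distance metric; closer = more similar
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(happy, joyful))       # high: similar meaning, no shared letters
print(cosine(happy, spreadsheet))  # lower: unrelated meaning
```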

Similarity Metrics

| Metric | Description | Use Case |
|--------|-------------|----------|
| Cosine similarity | Angle between vectors | Most common; when only direction matters |
| Euclidean distance | Straight-line distance | When magnitude matters |
| Dot product | Combined magnitude and direction | Fast option for normalized vectors (equivalent to cosine) |
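
In code, all three metrics are one-liners; a sketch with NumPy, using toy 3-dimensional vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.9])

# Cosine similarity: angle only, magnitude ignored (1.0 = same direction)
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line gap, magnitude matters (0.0 = identical)
euclid = np.linalg.norm(a - b)

# Dot product: magnitude and direction combined; equals cosine similarity
# when both vectors are normalized to unit length
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
dot = a_unit @ b_unit

print(cos_sim, euclid, dot)
```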

Embedding Models

| Model | Dimensions | Notes |
|-------|------------|-------|
| OpenAI text-embedding-3-small | 1536 | Good balance of quality/cost |
| OpenAI text-embedding-3-large | 3072 | Higher quality, more cost |
| Cohere embed-v3 | 1024 | Strong multilingual |
| Open source (e5, bge) | Varies | Self-hosted option |

Common Pitfalls

  • Comparing across models: Embeddings from different models are incompatible
  • Ignoring chunking: Long documents need to be split strategically (see the sketch after this list)
  • Assuming perfection: Embeddings capture semantic similarity, not factual accuracy
  • Forgetting updates: Embedding models improve; re-embed periodically
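
On the chunking pitfall: a minimal fixed-size chunker with overlap. The sizes are illustrative; production splitters usually respect sentence or section boundaries instead of raw character counts:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks before embedding.

    The overlap keeps sentences that straddle a boundary retrievable
    from both neighboring chunks.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks
```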

Tier Relevance

| Tier | Expectation |
|------|-------------|
| Foundation | Conceptual understanding of semantic similarity |
| Practitioner | Generate, store, and query embeddings |
| Expert | Optimize embedding strategies for specific domains |

Vx — Vector DB

Position in Periodic Table:

G2: Retrieval Family
┌──────────────────────┐
│ Embeddings │ Row 1: Primitives
│ → [Vector DB] │ Row 2: Compositions
│ Fine-tuning │ Row 3: Deployment
│ Synthetic Data │ Row 4: Emerging
└──────────────────────┘

What It Is

Vector databases are purpose-built for storing and querying embeddings: they hold millions of vectors and find the most semantically similar ones in milliseconds. Traditional databases search by exact values; vector databases search by similarity.

Why It Matters

You can't do semantic search at scale without specialized storage. Vector databases enable:

  • Finding relevant documents among millions
  • Real-time similarity search for recommendations
  • Efficient k-nearest-neighbor queries
  • The persistence layer for RAG systems

How Vector Search Works

  1. Index: Vectors are organized using specialized algorithms (HNSW, IVF, etc.)
  2. Query: A query vector is compared against the index
  3. Approximate nearest neighbors: Trade perfect accuracy for speed
  4. Top-k results: Return the k most similar vectors
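
A sketch of the full loop using Chroma (listed under the databases below); the collection name and documents are illustrative, and Chroma embeds the text with its default model unless you supply vectors:

```python
import chromadb

client = chromadb.Client()  # in-memory instance, good for prototyping
collection = client.create_collection("docs")

# Index: Chroma embeds the documents and builds the index internally
collection.add(
    ids=["1", "2", "3"],
    documents=[
        "Embeddings turn text into vectors.",
        "Vector databases search by similarity.",
        "Postgres is a relational database.",
    ],
)

# Query + top-k: return the 2 nearest neighbors to the query vector
results = collection.query(query_texts=["semantic search"], n_results=2)
print(results["documents"])
```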

Key Components

| Component | Purpose |
|-----------|---------|
| Vectors | The numerical embeddings |
| Metadata | Additional info attached to each vector (source, date, etc.) |
| Index | Data structure enabling fast search |
| Namespace | Logical separation of vector sets |

Popular Vector Databases

| Database | Type | Notes |
|----------|------|-------|
| Pinecone | Managed | Easy to start, scales well |
| Weaviate | Self-hosted/managed | Strong hybrid search |
| Chroma | Embedded | Great for prototyping |
| Qdrant | Self-hosted/managed | High performance |
| pgvector | PostgreSQL extension | Use existing Postgres |
| Milvus | Self-hosted | Enterprise scale |

Key Considerations

  • Embedding consistency: Use the same model for indexing and querying
  • Metadata filtering: Combine vector search with traditional filters (example after this list)
  • Chunking strategy: How you split documents affects retrieval quality
  • Update patterns: Some indexes are expensive to update
  • Cost: Managed services charge by storage and queries
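
For example, metadata filtering in Chroma combines a `where` clause with the similarity search; the field names here are hypothetical:

```python
import chromadb

collection = chromadb.Client().create_collection("reports")

# Attach metadata at indexing time...
collection.add(
    ids=["1", "2"],
    documents=["Quarterly revenue grew 12%.", "New hiring policy announced."],
    metadatas=[{"source": "finance"}, {"source": "hr"}],
)

# ...then combine vector search with a traditional filter at query time
results = collection.query(
    query_texts=["revenue growth"],
    n_results=1,
    where={"source": "finance"},  # only finance vectors are searched
)
print(results["documents"])
```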

Tier Relevance

| Tier | Expectation |
|------|-------------|
| Foundation | Understand what vector DBs do |
| Practitioner | Set up, populate, and query vector databases |
| Expert | Optimize indexing and retrieval performance |

Ft — Fine-tuning

Position in Periodic Table:

G2: Retrieval Family
┌──────────────────────┐
│ Embeddings │ Row 1: Primitives
│ Vector DB │ Row 2: Compositions
│ → [Fine-tuning] │ Row 3: Deployment
│ Synthetic Data │ Row 4: Emerging
└──────────────────────┘

What It Is

Fine-tuning is adapting a base model by training on specific data. It bakes knowledge directly into the model's weights. Domain expertise becomes part of the model itself. Unlike RAG (which retrieves at runtime), fine-tuning permanently modifies the model.

Why It Matters

Fine-tuning enables:

  • Consistent style or tone across outputs
  • Domain-specific knowledge without retrieval
  • Faster inference (no retrieval step)
  • Behavior changes that prompting alone can't reliably produce

Fine-tuning vs. RAG

| Aspect | Fine-tuning | RAG |
|--------|-------------|-----|
| Knowledge | Baked into weights | Retrieved at runtime |
| Updates | Requires retraining | Update database anytime |
| Cost | Upfront training cost | Per-query retrieval cost |
| Latency | No retrieval overhead | Retrieval adds latency |
| Best for | Style, behavior, static knowledge | Dynamic, frequently updated info |

When to Fine-tune

Good candidates:

  • Consistent output format or style
  • Domain-specific terminology
  • Behavior the base model resists
  • High-volume, similar queries

Poor candidates:

  • Rapidly changing information
  • Information that needs citations
  • One-off customization needs
  • When you lack quality training data

Fine-tuning Process

  1. Prepare data: Create prompt-completion pairs
  2. Format: Convert to required format (JSONL typically)
  3. Upload: Send to model provider
  4. Train: Provider fine-tunes the model
  5. Evaluate: Test on held-out examples
  6. Deploy: Use your custom model endpoint
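
A sketch of steps 1-4 against the OpenAI fine-tuning API, assuming the openai package; the training example and base model name are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()

# 1-2. Prepare prompt-completion pairs in the chat JSONL format
examples = [
    {"messages": [
        {"role": "user", "content": "Summarize: revenue up 12% this quarter."},
        {"role": "assistant", "content": "Revenue: +12% QoQ."},
    ]},
]
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# 3. Upload the training file
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

# 4. Kick off the job; the provider trains asynchronously
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4o-mini-2024-07-18",  # illustrative base model
)
print(job.id)  # poll this job; when done, deploy via its fine_tuned_model name
```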

Common Pitfalls

  • Overfitting: Model memorizes training data, fails on new inputs
  • Catastrophic forgetting: Loses general capabilities
  • Poor data quality: Model learns bad habits
  • Insufficient examples: Not enough signal to learn
  • Neglecting evaluation: No way to know if it worked

Tier Relevance

| Tier | Expectation |
|------|-------------|
| Foundation | Understand when to choose fine-tuning vs. RAG |
| Practitioner | Evaluate whether fine-tuning is appropriate |
| Expert | Prepare datasets and execute fine-tuning projects |

Sy — Synthetic Data

Position in Periodic Table:

G2: Retrieval Family
┌──────────────────────┐
│ Embeddings │ Row 1: Primitives
│ Vector DB │ Row 2: Compositions
│ Fine-tuning │ Row 3: Deployment
│ → [Synthetic Data] │ Row 4: Emerging
└──────────────────────┘

What It Is

Synthetic data is AI-generated training data for AI. When real examples are scarce or expensive, synthetic data fills the gap. Use one model to generate data that trains another.

Why It Matters

Quality training data is often the bottleneck in AI development. Synthetic data enables:

  • Bootstrapping when real data is limited
  • Augmenting datasets for better coverage
  • Generating edge cases that rarely occur naturally
  • Protecting privacy (no real user data needed)

Types of Synthetic Data

| Type | Description | Use Case |
|------|-------------|----------|
| Augmentation | Variations of real data | Expand limited datasets |
| Generation | Entirely AI-created examples | Bootstrap new domains |
| Distillation | Capturing larger model behavior | Train smaller models |
| Simulation | Environment-generated data | Robotics, games |

Generation Techniques

  1. Prompt-based: Ask an LLM to generate examples
  2. Template-based: Fill in templates with variations
  3. Model distillation: Have a strong model label data
  4. Paraphrasing: Rewrite existing examples
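
A minimal prompt-based sketch, assuming the openai package; the prompt, model name, and JSON parsing are illustrative, and varying the topic and temperature per call helps with the homogeneity risk covered below:

```python
import json
from openai import OpenAI

client = OpenAI()

def generate_examples(topic: str, n: int = 5, temperature: float = 1.0) -> list[dict]:
    """Ask an LLM for synthetic question-answer training pairs."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",       # illustrative generator model
        temperature=temperature,   # vary per call to diversify output
        messages=[{
            "role": "user",
            "content": (
                f"Generate {n} question-answer pairs about {topic}. "
                'Reply with a JSON array of {"question": ..., "answer": ...} objects only.'
            ),
        }],
    )
    # Naive parsing; real pipelines should validate and retry on bad JSON
    return json.loads(resp.choices[0].message.content)

pairs = generate_examples("PostgreSQL indexing", n=5, temperature=1.1)
```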

Quality Challenges

| Risk | Mitigation |
|------|------------|
| Homogeneity | Diverse prompts, multiple generators |
| Errors propagate | Human validation on samples |
| Model collapse | Mix with real data |
| Bias amplification | Audit for bias patterns |

Best Practices

  1. Always validate: Humans should review a sample
  2. Mix with real data: Don't train on 100% synthetic
  3. Diversify generation: Multiple prompts, temperatures
  4. Track provenance: Know what's real vs. synthetic
  5. Watch for leakage: Ensure test data isn't synthetic
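
A sketch of practices 2 and 4 together: tag provenance on every record and cap the synthetic share of the training set. The 50% cap is an arbitrary illustration, not a recommended ratio:

```python
import random

def build_training_set(real: list[dict], synthetic: list[dict],
                       max_synthetic_ratio: float = 0.5) -> list[dict]:
    """Mix real and synthetic examples, tracking provenance on each record."""
    for ex in real:
        ex["provenance"] = "real"
    for ex in synthetic:
        ex["provenance"] = "synthetic"

    # Cap how many synthetic examples may join, relative to the real data
    budget = int(len(real) * max_synthetic_ratio / (1 - max_synthetic_ratio))
    mixed = real + random.sample(synthetic, min(budget, len(synthetic)))
    random.shuffle(mixed)
    return mixed
```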

Tier Relevance

| Tier | Expectation |
|------|-------------|
| Foundation | Understand what synthetic data is |
| Practitioner | Generate and validate synthetic examples |
| Expert | Design synthetic data pipelines with quality controls |