Embeddings

Embeddings are dense vector representations that encode semantic information about data in a format that machines can process. They transform complex data like text, images, or audio into arrays of numbers while preserving meaningful relationships.

How Embeddings Work

Embeddings are created by neural networks trained to understand relationships in data. For example, a text embedding model learns that:

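Words with similar meanings have nearby vectors
Semantic relationships appear as vector offsets (e.g., "king" - "man" + "woman" ≈ "queen", a property first popularized by static word embeddings such as Word2Vec)
Context matters in contextual models (the same word in different contexts gets different embeddings)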
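
The vector-arithmetic analogy can be illustrated with a toy example. The 2-dimensional vectors below are hand-picked for illustration (axis 0 loosely stands for "royalty", axis 1 for "gender"); they are an assumption, not real model output. Real embeddings are learned, have hundreds of dimensions, and satisfy such relations only approximately.

```python
import math

# Hand-made toy vectors (assumption for illustration, not real model output).
# Axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```

Excluding the three input words from the candidates is a common convention in analogy evaluation; without it, the nearest vector is often one of the inputs.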

Types of Embeddings

Text Embeddings

Convert text into vectors:

Word embeddings: Individual words (Word2Vec, GloVe)
Sentence embeddings: Complete sentences (Sentence-BERT)
Document embeddings: Entire documents (Doc2Vec)

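# Example: creating a text embedding with Sentence Transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # example model; outputs 768-dimensional vectors
embedding = model.encode("Retrieval-Augmented Generation")
# Result: NumPy array like [0.23, -0.15, 0.87, ..., 0.45] with shape (768,)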

Image Embeddings

Represent visual content:

Capture features like colors, textures, objects
Enable image similarity search
Power recommendation systems

Multimodal Embeddings

Unified representations across data types:

CLIP: Joint text-image embeddings
ImageBind: Embeddings across vision, text, audio

Key Properties

Dimensionality

Common embedding dimensions:

Small (128-384): Faster search, less memory, may lose fine-grained detail
Medium (512-768): Balanced quality and cost
Large (1024-3072): Often higher accuracy, at the cost of more memory and compute

Distance Metrics

Measuring similarity between embeddings:

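import numpy as np

# embedding1 and embedding2 are assumed to be 1-D NumPy arrays of equal length

# Cosine similarity (most common): compares direction, ignores magnitude
similarity = np.dot(embedding1, embedding2) / (
    np.linalg.norm(embedding1) * np.linalg.norm(embedding2)
)

# Euclidean distance: smaller means more similar
distance = np.linalg.norm(embedding1 - embedding2)

# Dot product: equals cosine similarity when both vectors have unit length
score = np.dot(embedding1, embedding2)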
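
As a worked example, the three metrics can be computed on small hand-written vectors using only the standard library. The vectors and helper names (`dot`, `norm`, `euclidean_distance`, `normalize`) are illustrative assumptions. The sketch also checks a useful identity: for unit-length vectors, squared Euclidean distance equals 2 - 2 × cosine similarity, so both metrics rank neighbors in the same order.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [4.0, 3.0]

print(dot(a, b))                 # 24.0
print(cosine_similarity(a, b))   # 0.96
print(euclidean_distance(a, b))  # about 1.414 (square root of 2)

# For unit vectors: squared distance == 2 - 2 * cosine similarity
ua, ub = normalize(a), normalize(b)
squared = euclidean_distance(ua, ub) ** 2
identity_holds = math.isclose(squared, 2 - 2 * cosine_similarity(ua, ub))
print(identity_holds)            # True
```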

Creating Embeddings

Pre-trained Models

Use existing models:

OpenAI: text-embedding-3-small, text-embedding-3-large
Sentence Transformers: all-MiniLM-L6-v2
Cohere: embed-english-v3.0

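// Requires the "openai" npm package and an ES module context (for top-level await).
import { OpenAI } from "openai";

// The client reads the API key from the OPENAI_API_KEY environment variable.
const openai = new OpenAI();

const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: "Your text here",
});

// An array of numbers (1536 dimensions by default for this model).
const embedding = response.data[0].embedding;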

Fine-tuning

Customize embeddings for your domain:

Train on domain-specific data
Optimize for your use case
Improve accuracy for specialized vocabulary

Use Cases

Semantic Search
- Find documents by meaning, not just keywords
- A query for "machine learning" can match a document about "artificial intelligence"
Recommendation Systems
- Find similar products/content
- Personalized suggestions
Clustering and Classification
- Group similar items
- Categorize content automatically
RAG Systems
- Retrieve relevant context
- Match queries to knowledge base
Anomaly Detection
- Identify outliers
- Detect unusual patterns
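
Semantic search and RAG retrieval both reduce to the same operation: embed the query, then return the stored vectors closest to it. The sketch below uses brute-force cosine similarity over toy 3-dimensional vectors that stand in for real model output; the document IDs, the vectors, and the `top_k` helper are assumptions for illustration. Larger collections typically use an approximate nearest-neighbor index instead of scanning every vector.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, corpus, k=2):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors standing in for real model output (assumption for illustration).
corpus = {
    "intro-to-ml":  [0.9, 0.1, 0.0],
    "ai-overview":  [0.8, 0.3, 0.1],
    "banana-bread": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]  # pretend embedding of the query "machine learning"

results = top_k(query, corpus, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

In a RAG system, the text of the returned documents would then be passed to the language model as context.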

Best Practices

Choosing an Embedding Model

Consider:

Domain: General vs. specialized
Language: Multilingual support needed?
Size: Balance quality vs. speed
Cost: API costs vs. self-hosted

Optimizing Performance

Batch processing: Embed multiple texts at once
Caching: Store computed embeddings
Dimension reduction: Use PCA or UMAP if needed
Normalization: Scale vectors to unit length so that a plain dot product equals cosine similarity
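
Batching and caching can be combined in a thin wrapper around the model call. In the sketch below, `fake_embed_batch` is a hypothetical stand-in for a real model or API call (it derives a deterministic 4-dimensional vector from a hash so the example runs offline), and `CachedEmbedder` is an illustrative name, not part of any library.

```python
import hashlib

def fake_embed_batch(texts):
    """Hypothetical stand-in for a real model/API call: one vector per text."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([byte / 255.0 for byte in digest[:4]])
    return vectors

class CachedEmbedder:
    """Embeds texts in batches and reuses vectors for texts seen before."""

    def __init__(self, embed_batch, batch_size=32):
        self.embed_batch = embed_batch
        self.batch_size = batch_size
        self.cache = {}       # text -> vector
        self.model_calls = 0  # number of batched calls made to the model

    def embed(self, texts):
        # Deduplicate while preserving order, then keep only uncached texts.
        missing = [t for t in dict.fromkeys(texts) if t not in self.cache]
        for start in range(0, len(missing), self.batch_size):
            batch = missing[start:start + self.batch_size]
            for text, vector in zip(batch, self.embed_batch(batch)):
                self.cache[text] = vector
            self.model_calls += 1
        return [self.cache[t] for t in texts]

embedder = CachedEmbedder(fake_embed_batch, batch_size=2)
first = embedder.embed(["a", "b", "c", "a"])  # 3 unique texts -> 2 batched calls
second = embedder.embed(["a", "c"])           # fully cached -> no new calls
print(embedder.model_calls)  # 2
```

An in-memory dictionary is the simplest possible cache; a persistent store keyed by a hash of the text and the model name would survive restarts and avoid mixing vectors from different models.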

Common Pitfalls

Using a distance metric other than the one the model was trained with
Comparing unnormalized embeddings with dot product when cosine similarity is intended
Mixing embeddings from different models
Ignoring context length limits
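
One way to avoid silently truncated inputs is to split long text into overlapping chunks and embed each chunk separately. The sketch below counts whitespace-separated words as a rough proxy for tokens, which is an assumption: real models count tokens with their own tokenizer, so word counts only approximate the true limit. The function name and the `max_tokens` and `overlap` parameters are illustrative.

```python
def chunk_words(text, max_tokens=8, overlap=2):
    """Split text into word windows of at most max_tokens, sharing `overlap` words."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the text
    return chunks

text = "one two three four five six seven eight nine ten eleven twelve"
chunks = chunk_words(text, max_tokens=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap keeps sentences that straddle a boundary represented in at least one chunk, at the cost of embedding some words twice.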

Technical Details

Embedding Generation Process

Tokenization: Split text into tokens
Encoding: Pass through neural network
Pooling: Combine token embeddings into a single vector (mean, max, or the [CLS] token)
Normalization: Scale to unit length
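
The pooling and normalization steps can be sketched in plain Python. Tokenization and encoding are omitted; the token vectors below are made-up stand-ins for encoder output, and the helper names are illustrative assumptions.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors element-wise into one sentence vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]

def max_pool(token_vectors):
    """Take the element-wise maximum across token vectors."""
    return [max(column) for column in zip(*token_vectors)]

def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    length = math.sqrt(sum(x * x for x in vector))
    return vector if length == 0 else [x / length for x in vector]

# Made-up encoder outputs for a 3-token input (assumption for illustration).
token_vectors = [
    [1.0, 0.0],
    [3.0, 4.0],
    [5.0, 8.0],
]

pooled = mean_pool(token_vectors)  # [3.0, 4.0]
unit = l2_normalize(pooled)        # [0.6, 0.8]
print(pooled, unit, max_pool(token_vectors))
```

CLS pooling would instead take the vector of the first token (`token_vectors[0]` here), which only makes sense for models trained to summarize the input in that position.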

Storage Requirements

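For 1 million documents with 768-dimensional embeddings (one vector per document):

Memory: ~3 GB (float32: 1,000,000 × 768 × 4 bytes ≈ 3.07 GB)
Disk: ~3 GB (lossless compression saves little on dense float vectors)
Index: Roughly an additional 1-2 GB, depending on the index type and its parameters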
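
The memory figure follows from simple arithmetic: number of vectors × dimensions × bytes per value. The sketch below reproduces it and, as an assumption beyond the figures above, shows how the raw size would change with 2-byte (float16) or 1-byte (int8 quantized) values. Index overhead is not included.

```python
def embedding_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of a dense embedding matrix in bytes (float32 = 4 bytes per value)."""
    return num_vectors * dimensions * bytes_per_value

raw = embedding_bytes(1_000_000, 768)      # float32
half = embedding_bytes(1_000_000, 768, 2)  # float16
int8 = embedding_bytes(1_000_000, 768, 1)  # 8-bit quantized

for label, size in [("float32", raw), ("float16", half), ("int8", int8)]:
    print(f"{label}: {size / 1e9:.2f} GB")
# float32: 3.07 GB
# float16: 1.54 GB
# int8: 0.77 GB
```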

Advanced Topics

Contextual Embeddings

Modern models like BERT create context-aware embeddings:

Same word, different meaning → different embeddings
"Apple" (fruit) vs. "Apple" (company)

Cross-lingual Embeddings

Shared embedding space across languages:

Search in English, find results in any language
Machine translation applications

Sparse vs. Dense Embeddings

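Dense: Most values non-zero (typical output of neural encoders)
Sparse: Mostly zeros (traditional IR term weights such as TF-IDF, or learned sparse models such as SPLADE)
Hybrid: Combine both, which often improves retrieval over either one alone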
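
One simple way to combine the two is a weighted sum of a dense score and a sparse score. The sketch below is an illustration under assumptions: sparse vectors are stored as `{term: weight}` dictionaries, the weight `alpha` and all example values are made up, and the two scores are assumed to be on comparable scales. Real systems often normalize scores first or fuse ranked lists instead.

```python
import math

def dense_score(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def sparse_score(a, b):
    """Dot product of sparse vectors stored as {term: weight}; absent terms are zero."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dictionary
    return sum(weight * b[term] for term, weight in a.items() if term in b)

def hybrid_score(dense_q, dense_d, sparse_q, sparse_d, alpha=0.5):
    """Weighted sum: alpha * dense + (1 - alpha) * sparse."""
    return alpha * dense_score(dense_q, dense_d) + (1 - alpha) * sparse_score(sparse_q, sparse_d)

# Made-up query and document representations (assumption for illustration).
query_dense = [1.0, 0.0]
doc_dense = [1.0, 0.0]
query_sparse = {"embedding": 1.0, "vector": 0.5}
doc_sparse = {"embedding": 0.8, "search": 0.3}

score = hybrid_score(query_dense, doc_dense, query_sparse, doc_sparse, alpha=0.5)
print(round(score, 3))  # 0.9
```

Setting `alpha` to 1.0 gives pure dense retrieval and 0.0 gives pure sparse retrieval, which makes the weight easy to tune on a validation set.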