Embeddings
Dense vector representations of data (text, images, etc.) that capture semantic meaning in a numerical format suitable for machine learning.
Embeddings turn complex data such as text, images, or audio into fixed-length arrays of numbers that machines can process. They preserve meaningful relationships, so items with similar meaning end up close together in vector space.
How Embeddings Work
Embeddings are created by neural networks trained to understand relationships in data. For example, a text embedding model learns that:
- Similar words have similar vectors
- Semantic relationships are preserved (e.g., "king" - "man" + "woman" ≈ "queen")
- Context matters in contextual models (the same word in different contexts gets a different embedding), whereas static models such as Word2Vec assign one vector per word
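The analogy above can be illustrated with a small sketch. The 3-dimensional vectors below are invented for illustration (real models learn hundreds of dimensions from data), and `analogy` is a hypothetical helper, not part of any library. It only shows how vector arithmetic plus a nearest-neighbor lookup produces the result.

```python
import numpy as np

# Toy vectors, invented for illustration: [royalty, maleness, femaleness]
vectors = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "prince": np.array([0.8, 0.9, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
print(result)
```

With these toy values the nearest remaining word is "queen". Learned embeddings only approximate this behavior.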
Types of Embeddings
Text Embeddings
Convert text into vectors:
- Word embeddings: Individual words (Word2Vec, GloVe)
- Sentence embeddings: Complete sentences (Sentence-BERT)
- Document embeddings: Entire documents (Doc2Vec)
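# Example: creating a text embedding (assumes the sentence-transformers package is installed)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-mpnet-base-v2")  # this model outputs 768-dimensional vectors
embedding = model.encode("Retrieval-Augmented Generation")
# Result: a NumPy array such as [0.23, -0.15, 0.87, ..., 0.45] with 768 values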
Image Embeddings
Represent visual content:
- Capture features like colors, textures, objects
- Enable image similarity search
- Power recommendation systems
Multimodal Embeddings
Unified representations across data types:
- CLIP: Joint text-image embeddings
- ImageBind: Embeddings across vision, text, audio
Key Properties
Dimensionality
Common embedding dimensions:
- Small (128-384): Fast, less memory
- Medium (512-768): Balanced performance
- Large (1024-3072): Often higher accuracy, at the cost of more memory and compute
Distance Metrics
Measuring similarity between embeddings:
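import numpy as np
# embedding1 and embedding2 are assumed to be 1-D NumPy arrays of equal length
# Cosine similarity (most common): compares direction and ignores magnitude
similarity = np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))
# Euclidean distance: straight-line distance, where lower means more similar
distance = np.linalg.norm(embedding1 - embedding2)
# Dot product: equals cosine similarity when both vectors have unit length
score = np.dot(embedding1, embedding2)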
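The three metrics are closely related once vectors are normalized. The sketch below uses two made-up vectors in place of real embeddings and checks that, for unit-length vectors, the dot product equals cosine similarity and the squared Euclidean distance equals 2 - 2 × cosine. The `normalize` helper is defined here for illustration.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    return v / np.linalg.norm(v)

# Two made-up vectors standing in for real embeddings
a = normalize(np.array([3.0, 4.0, 0.0]))
b = normalize(np.array([4.0, 3.0, 0.0]))

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot = float(np.dot(a, b))
euclidean = float(np.linalg.norm(a - b))

# On unit vectors: dot == cosine, and euclidean**2 == 2 - 2 * cosine
print(cosine, dot, euclidean ** 2, 2 - 2 * cosine)
```

Because of this relationship, all three metrics produce the same ranking on normalized vectors. They can disagree when vectors are not normalized.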
Creating Embeddings
Pre-trained Models
Use existing models:
- OpenAI: text-embedding-3-small, text-embedding-3-large
- Sentence Transformers: all-MiniLM-L6-v2
- Cohere: embed-english-v3.0
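import { OpenAI } from "openai";

// The client reads the API key from the OPENAI_API_KEY environment variable
const openai = new OpenAI();

const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: "Your text here",
});

// One result per input; each embedding is an array of numbers
const embedding = response.data[0].embedding;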
Fine-tuning
Customize embeddings for your domain:
- Train on domain-specific data
- Optimize for your use case
- Improve accuracy for specialized vocabulary
Use Cases
- Semantic Search
  - Find documents by meaning, not keywords
  - "machine learning" matches "artificial intelligence"
- Recommendation Systems
  - Find similar products/content
  - Personalized suggestions
- Clustering and Classification
  - Group similar items
  - Categorize content automatically
- RAG Systems
  - Retrieve relevant context
  - Match queries to knowledge base
- Anomaly Detection
  - Identify outliers
  - Detect unusual patterns
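Semantic search and RAG retrieval both reduce to the same core step: rank stored vectors by similarity to a query vector and keep the top few. A minimal brute-force sketch follows. The `top_k` helper and the 3-dimensional vectors are made up for illustration, and a production system would typically use a vector index rather than scanning every document.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, cosine score) pairs for the k most similar documents."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    order = np.argsort(-scores)[:k]     # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Made-up vectors standing in for real document embeddings
docs = ["intro to neural networks", "gardening tips", "deep learning basics"]
doc_vecs = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]])
query_vec = np.array([1.0, 0.2, 0.0])   # stands in for an embedded query

hits = top_k(query_vec, doc_vecs)
for index, score in hits:
    print(docs[index], round(score, 3))
```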
Best Practices
Choosing an Embedding Model
Consider:
- Domain: General vs. specialized
- Language: Multilingual support needed?
- Size: Balance quality vs. speed
- Cost: API costs vs. self-hosted
Optimizing Performance
- Batch processing: Embed multiple texts at once
- Caching: Store computed embeddings
- Dimension reduction: Use PCA or UMAP if needed
- Normalization: Normalize vectors for cosine similarity
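Batching and caching can be combined in one small wrapper. The sketch below is one possible design, not a library API: `embed_batch` is a hypothetical stand-in for a real model or API call, and `EmbeddingCache` is an illustrative in-memory cache that embeds only texts it has not seen before, in batches.

```python
import hashlib

def embed_batch(texts):
    """Hypothetical stand-in for a real model or API call; returns one vector per text."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]

class EmbeddingCache:
    """In-memory cache keyed by a hash of the text; only misses are embedded."""

    def __init__(self, embed_fn, batch_size=32):
        self.embed_fn = embed_fn
        self.batch_size = batch_size
        self.store = {}
        self.calls = 0  # number of batched calls made to embed_fn

    def _key(self, text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, texts):
        # Collect texts not yet cached, deduplicated while preserving order
        misses = list(dict.fromkeys(t for t in texts if self._key(t) not in self.store))
        for start in range(0, len(misses), self.batch_size):
            batch = misses[start:start + self.batch_size]
            for text, vec in zip(batch, self.embed_fn(batch)):
                self.store[self._key(text)] = vec
            self.calls += 1
        return [self.store[self._key(t)] for t in texts]

cache = EmbeddingCache(embed_batch, batch_size=2)
first = cache.get(["a", "b", "c", "a"])   # 3 unique texts, embedded in 2 batches
second = cache.get(["a", "c"])            # served entirely from the cache
```

A persistent store (a database or files on disk) would replace the dictionary when embeddings need to survive restarts.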
Common Pitfalls
- Using the wrong distance metric
- Not normalizing embeddings
- Mixing embeddings from different models
- Ignoring context length limits
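One common way to respect context length limits is to split long text into overlapping chunks and embed each chunk separately. The sketch below approximates tokens by whitespace-separated words, which is an assumption: real models use subword tokenizers, so actual token counts will differ. The `chunk_text` helper is hypothetical.

```python
def chunk_text(text, max_tokens=8, overlap=2):
    """Split text into overlapping chunks of at most max_tokens whitespace tokens."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(20))
chunks = chunk_text(text, max_tokens=8, overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap keeps sentences that straddle a boundary visible in both neighboring chunks, at the cost of embedding some text twice.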
Technical Details
Embedding Generation Process
- Tokenization: Split text into tokens
- Encoding: Pass through neural network
- Pooling: Combine token embeddings (mean, max, CLS)
- Normalization: Scale to unit length
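The pooling and normalization steps can be sketched with NumPy. The token embeddings and attention mask below are made-up values standing in for real encoder output, and mean pooling is shown as one of the pooling options listed above.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, None].astype(float)     # shape (tokens, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    count = max(mask.sum(), 1.0)                     # avoid division by zero
    return summed / count

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = np.linalg.norm(vec)
    return vec if norm == 0 else vec / norm

# Made-up encoder output: 4 token positions x 2 dimensions, last position is padding
token_embeddings = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0], [9.0, 9.0]])
attention_mask = np.array([1, 1, 1, 0])

pooled = mean_pool(token_embeddings, attention_mask)
sentence_vec = l2_normalize(pooled)
```

Masking matters here: without it, the padding row would pull the average toward values that carry no meaning.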
Storage Requirements
For 1 million documents with 768-dimensional embeddings:
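- Memory: ~3 GB (1,000,000 × 768 × 4 bytes as float32)
- Disk: ~3 GB for raw float32 vectors (less with float16 or int8 quantization)
- Index: Roughly 1-2 GB additional, depending on the index type and its parameters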
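The memory figure follows from simple arithmetic: number of vectors × dimensions × bytes per value. The helper below is a hypothetical sketch that counts only the raw vectors and ignores index, metadata, and file-format overhead.

```python
def embedding_storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone (no index, metadata, or file overhead)."""
    return num_vectors * dimensions * bytes_per_value

raw = embedding_storage_bytes(1_000_000, 768)                      # float32
half = embedding_storage_bytes(1_000_000, 768, bytes_per_value=2)  # float16
int8 = embedding_storage_bytes(1_000_000, 768, bytes_per_value=1)  # int8 quantized

print(f"float32: {raw / 1e9:.2f} GB")
print(f"float16: {half / 1e9:.2f} GB")
print(f"int8:    {int8 / 1e9:.2f} GB")
```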
Advanced Topics
Contextual Embeddings
Modern models like BERT create context-aware embeddings:
- Same word, different meaning → different embeddings
- "Apple" (fruit) vs. "Apple" (company)
Cross-lingual Embeddings
Shared embedding space across languages:
- Search in English, find results in other languages the model supports
- Machine translation applications
Sparse vs. Dense Embeddings
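- Dense: Most values non-zero (produced by neural networks)
- Sparse: Mostly zeros (traditional IR such as TF-IDF and BM25, or learned sparse models such as SPLADE)
- Hybrid: Combine both, which often improves retrieval quality over either alone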
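One simple way to combine dense and sparse results is a weighted sum of rescaled scores. The sketch below uses min-max rescaling and a weight `alpha`. Both are illustrative choices rather than a standard, and other fusion methods exist. The scores and document names are made up, and `hybrid_rank` is a hypothetical helper.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant set of scores maps to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(dense, sparse, alpha=0.5):
    """Rank documents by alpha * dense + (1 - alpha) * sparse after rescaling."""
    d, s = min_max(dense), min_max(sparse)
    docs = set(d) | set(s)
    combined = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0) for doc in docs}
    return sorted(combined, key=combined.get, reverse=True)

# Made-up scores: dense = cosine similarity, sparse = a BM25-style keyword score
dense = {"doc_a": 0.82, "doc_b": 0.79, "doc_c": 0.40}
sparse = {"doc_a": 2.1, "doc_b": 7.5, "doc_c": 0.3}

ranking = hybrid_rank(dense, sparse, alpha=0.5)
print(ranking)
```

Rescaling is needed because dense and sparse scores live on different scales. Without it, the larger keyword scores would dominate the sum.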
Examples
- Converting 'cat' and 'kitten' to similar vectors because they have related meanings
- Representing a document as a 768-dimensional vector
- Image embeddings that capture visual features