
Embeddings

Dense vector representations of data (text, images, etc.) that capture semantic meaning in a numerical format suitable for machine learning.

Core Concepts

An embedding maps a piece of data, such as text, an image, or audio, to a fixed-length array of floating-point numbers. Items with similar meaning end up close together in the vector space, so software can compare them numerically while the meaningful relationships between them are preserved.

How Embeddings Work

Embeddings are created by neural networks trained to understand relationships in data. For example, a text embedding model learns that:

  • Similar words have similar vectors
  • Semantic relationships are preserved (e.g., "king" - "man" + "woman" ≈ "queen")
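  • Context matters: contextual models such as BERT give the same word different embeddings in different contexts, whereas static word embeddings such as Word2Vec assign one vector per word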
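
As a toy illustration of the vector arithmetic above, the sketch below uses hand-made 3-dimensional vectors. The values are invented for this example and are not the output of any real model; real embeddings have hundreds of dimensions and the analogy only holds approximately.

```python
import numpy as np

# Hand-made toy vectors (hypothetical values, not produced by a real model).
# The dimensions loosely stand for: royalty, maleness, femaleness.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king" - "man" + "woman" should land near "queen".
target = vectors["king"] - vectors["man"] + vectors["woman"]

# Rank the words that were not part of the query by similarity to the target.
candidates = {w: v for w, v in vectors.items() if w not in ("king", "man", "woman")}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(nearest)
```

Excluding the query words from the candidates is a common convention in analogy tests, since the nearest vector to the result is otherwise often one of the inputs.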

Types of Embeddings

Text Embeddings

Convert text into vectors:

  • Word embeddings: Individual words (Word2Vec, GloVe)
  • Sentence embeddings: Complete sentences (Sentence-BERT)
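  • Document embeddings: Entire documents (Doc2Vec)

# Example: creating a text embedding (assumes the sentence-transformers package)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # produces 768-dimensional vectors
embedding = model.encode("Retrieval-Augmented Generation")
# Result: a NumPy array of 768 floats, e.g. [0.23, -0.15, 0.87, ..., 0.45]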

Image Embeddings

Represent visual content:

  • Capture features like colors, textures, objects
  • Enable image similarity search
  • Power recommendation systems

Multimodal Embeddings

Unified representations across data types:

  • CLIP: Joint text-image embeddings
  • ImageBind: Embeddings across vision, text, audio

Key Properties

Dimensionality

Common embedding dimensions:

  • Small (128-384): Fast, less memory
  • Medium (512-768): Balanced performance
  • Large (1024-3072): High accuracy, more resources

Distance Metrics

Measuring similarity between embeddings:

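import numpy as np

# embedding1 and embedding2 are 1-D NumPy arrays of the same length.

# Cosine similarity (most common): dot product divided by the product of the norms
similarity = np.dot(embedding1, embedding2) / (
    np.linalg.norm(embedding1) * np.linalg.norm(embedding2)
)

# Euclidean distance (lower means more similar)
distance = np.linalg.norm(embedding1 - embedding2)

# Dot product (equals cosine similarity when both vectors have unit length)
score = np.dot(embedding1, embedding2)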

Creating Embeddings

Pre-trained Models

Use existing models:

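  • OpenAI: text-embedding-3-small, text-embedding-3-large
  • Sentence Transformers: all-MiniLM-L6-v2
  • Cohere: embed-english-v3.0

// Example: requesting an embedding with the OpenAI Node.js SDK
// (the client reads the API key from the OPENAI_API_KEY environment variable)
import { OpenAI } from "openai";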

const openai = new OpenAI();

const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: "Your text here",
});

const embedding = response.data[0].embedding;

Fine-tuning

Customize embeddings for your domain:

  • Train on domain-specific data
  • Optimize for your use case
  • Improve accuracy for specialized vocabulary

Use Cases

  1. Semantic Search

    • Find documents by meaning, not keywords
    • A query for "machine learning" can surface documents about "artificial intelligence"
  2. Recommendation Systems

    • Find similar products/content
    • Personalized suggestions
  3. Clustering and Classification

    • Group similar items
    • Categorize content automatically
  4. RAG Systems

    • Retrieve relevant context
    • Match queries to knowledge base
  5. Anomaly Detection

    • Identify outliers
    • Detect unusual patterns
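
Semantic search and RAG retrieval from the list above both reduce to the same operation: rank stored vectors by similarity to a query vector. The sketch below shows a brute-force version with NumPy. The `top_k` helper and the 3-dimensional vectors are hypothetical stand-ins for real model output, and a production system would typically use a vector index rather than scanning every document.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Return indices and scores of the k documents most similar to the query."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # argsort is ascending, so reverse it and keep the first k entries.
    order = np.argsort(scores)[::-1][:k]
    return order.tolist(), scores[order].tolist()

# Toy 3-dimensional "embeddings" (hypothetical values for illustration).
docs = ["intro to machine learning", "neural network basics", "banana bread recipe"]
doc_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.0, 0.1, 0.9],
])
query_vec = np.array([1.0, 0.2, 0.0])  # stands in for "artificial intelligence"

indices, scores = top_k(query_vec, doc_vecs)
for i, s in zip(indices, scores):
    print(f"{docs[i]}: {s:.3f}")
```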

Best Practices

Choosing an Embedding Model

Consider:

  • Domain: General vs. specialized
  • Language: Multilingual support needed?
  • Size: Balance quality vs. speed
  • Cost: API costs vs. self-hosted

Optimizing Performance

  • Batch processing: Embed multiple texts at once
  • Caching: Store computed embeddings
  • Dimension reduction: Use PCA or UMAP if needed
  • Normalization: Normalize vectors for cosine similarity
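
A minimal sketch of how batching, caching, and normalization can fit together. `EmbeddingCache` and `fake_embed_batch` are hypothetical names introduced for this example; the stand-in embedder would be replaced by a real model or API call.

```python
import hashlib
import numpy as np

def normalize(vectors):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors

class EmbeddingCache:
    """In-memory cache keyed by a hash of the input text."""

    def __init__(self, embed_batch):
        self._embed_batch = embed_batch  # callable: list[str] -> 2-D array-like
        self._store = {}

    def get(self, texts):
        keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
        # Embed only the texts not seen before, in a single batch call.
        missing = {k: t for k, t in zip(keys, texts) if k not in self._store}
        if missing:
            vecs = normalize(np.asarray(self._embed_batch(list(missing.values()))))
            self._store.update(zip(missing.keys(), vecs))
        return np.stack([self._store[k] for k in keys])

# Stand-in embedder (hypothetical): replace with a real model call.
calls = []
def fake_embed_batch(texts):
    calls.append(len(texts))  # record the batch size of each call
    return [[float(len(t)), 1.0] for t in texts]

cache = EmbeddingCache(fake_embed_batch)
first = cache.get(["alpha", "beta"])
second = cache.get(["beta", "gamma"])  # only "gamma" is embedded here
print(calls)
```

A persistent store (a database or files on disk) would follow the same pattern; the in-memory dictionary is used here only to keep the sketch short.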

Common Pitfalls

  • Using the wrong distance metric
  • Not normalizing embeddings
  • Mixing embeddings from different models
  • Ignoring context length limits
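
To see why the metric and normalization pitfalls matter, the toy example below (with invented 2-dimensional vectors) ranks two candidates against a query. On unnormalized vectors, the raw dot product rewards magnitude, while cosine similarity rewards direction, so the two metrics pick different winners.

```python
import numpy as np

query = np.array([1.0, 0.0])
aligned = np.array([0.9, 0.1])    # points almost the same way as the query
long_vec = np.array([3.0, 3.0])   # points elsewhere but has a large magnitude

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pick the better match under each metric.
dot_winner = "aligned" if np.dot(query, aligned) > np.dot(query, long_vec) else "long_vec"
cos_winner = "aligned" if cosine(query, aligned) > cosine(query, long_vec) else "long_vec"
print(dot_winner, cos_winner)
```

After scaling every vector to unit length, the two metrics produce the same ranking, which is why normalizing before indexing is a common safeguard.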

Technical Details

Embedding Generation Process

  1. Tokenization: Split text into tokens
  2. Encoding: Pass through neural network
  3. Pooling: Combine token embeddings (mean, max, CLS)
  4. Normalization: Scale to unit length
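
Steps 3 and 4 can be sketched with NumPy on a made-up encoder output. The token vectors below are hypothetical; a real model would produce them in step 2, and row 0 is assumed to be the [CLS] token.

```python
import numpy as np

# Hypothetical encoder output for a 4-token input: one 3-dimensional vector per token.
token_embeddings = np.array([
    [0.2, 0.4, 0.4],   # [CLS]
    [1.0, 0.0, 2.0],
    [0.0, 2.0, 0.0],
    [3.0, 0.0, 0.0],
])

# Pooling: three common ways to collapse token vectors into one vector.
mean_pooled = token_embeddings.mean(axis=0)  # average over tokens
max_pooled = token_embeddings.max(axis=0)    # element-wise maximum over tokens
cls_pooled = token_embeddings[0]             # take the [CLS] vector as-is

def l2_normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

# Normalization: the final embedding has length 1.
embedding = l2_normalize(mean_pooled)
print(mean_pooled, max_pooled, np.linalg.norm(embedding))
```

When inputs are padded to a common length, real implementations usually exclude the padding positions from the mean; that masking is omitted here for brevity.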

Storage Requirements

For 1 million documents with 768-dimensional embeddings:

  • Memory: ~3 GB (float32)
  • Disk: ~3 GB compressed
  • Index: Additional 1-2 GB
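
The memory figure follows from simple arithmetic: number of vectors × dimensions × bytes per value. The helper below (a hypothetical name, introduced for illustration) reproduces it and shows the effect of a smaller data type. Index overhead and compression ratios depend on the storage engine, so they are not modeled here.

```python
def embedding_storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone; float32 uses 4 bytes per value."""
    return num_vectors * dimensions * bytes_per_value

raw_float32 = embedding_storage_bytes(1_000_000, 768)
raw_float16 = embedding_storage_bytes(1_000_000, 768, bytes_per_value=2)

print(f"float32: {raw_float32 / 1e9:.2f} GB")  # 3.07 GB
print(f"float16: {raw_float16 / 1e9:.2f} GB")  # 1.54 GB
```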

Advanced Topics

Contextual Embeddings

Modern models like BERT create context-aware embeddings:

  • Same word, different meaning → different embeddings
  • "Apple" (fruit) vs. "Apple" (company)

Cross-lingual Embeddings

Shared embedding space across languages:

  • Search in English, find results in any language
  • Machine translation applications

Sparse vs. Dense Embeddings

  • Dense: Most values non-zero (neural networks)
  • Sparse: Mostly zeros (traditional IR, SPLADE)
  • Hybrid: Combine both for best results
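
One simple way to combine the two, shown here as an assumption rather than the standard method, is a weighted sum of the dense and sparse scores. The `alpha` weight, the helper names, and all vector values are hypothetical; other fusion schemes, such as rank-based fusion, are also in use.

```python
import numpy as np

def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(w * b.get(term, 0.0) for term, w in a.items())

def hybrid_score(dense_q, dense_d, sparse_q, sparse_d, alpha=0.5):
    """Weighted sum of dense and sparse similarity (alpha weights the dense part)."""
    dense = float(np.dot(dense_q, dense_d))  # assumes unit-length dense vectors
    sparse = sparse_dot(sparse_q, sparse_d)
    return alpha * dense + (1 - alpha) * sparse

# Toy query and document in both representations (invented values).
dense_q = np.array([0.6, 0.8])
dense_d = np.array([0.8, 0.6])
sparse_q = {"vector": 1.0, "search": 0.5}
sparse_d = {"search": 0.4, "index": 0.7}

score = hybrid_score(dense_q, dense_d, sparse_q, sparse_d)
print(round(score, 2))
```

Dense and sparse scores often live on different scales, so in practice each is usually rescaled before being combined.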

Examples

  • Converting 'cat' and 'kitten' to similar vectors because they have related meanings
  • Representing a document as a 768-dimensional vector
  • Image embeddings that capture visual features
