Skip to content

๐Ÿ” Vector Search with Monggregate

MongoDB Atlas provides powerful vector search capabilities through the $vectorSearch stage, enabling approximate nearest neighbor (aNN) search on vector embeddings. Monggregate makes these advanced vector search features accessible through an intuitive Python interface.

๐Ÿ’ก Vector search allows you to find documents with similar vector embeddings to a query vector, enabling semantic search, recommendations, and AI-powered applications.

Atlas Vector Search offers:

  • ๐Ÿง  Semantic similarity search using vector embeddings
  • ๐Ÿ” Approximate nearest neighbor (aNN) algorithms for efficient vector comparison
  • โšก Fast retrieval of similar items from large collections
  • ๐Ÿงฉ Pre-filtering to narrow search scope and improve relevance
  • ๐Ÿ”„ Integration with AI models like OpenAI, Hugging Face, and others

๐Ÿ“š Vector search is particularly useful for applications like: - Semantic text search that understands meaning, not just keywords - Image similarity search - Recommendation systems - AI-powered chatbots and RAG (Retrieval Augmented Generation)

Before using vector search with Monggregate, you need to:

  1. ๐Ÿ“Š Create an Atlas Vector Search index on your collection
  2. ๐Ÿงช Generate vector embeddings for your documents using an embedding model
  3. ๐Ÿ’พ Store these embeddings in your MongoDB documents

โš ๏ธ Vector search is only available on MongoDB Atlas clusters running v6.0.11 or v7.0.2 and later.

Creating a vector search query with Monggregate is straightforward:

from monggregate import Pipeline

# Generate or obtain your query vector (embedding)
query_vector = [0.1, 0.2, 0.3, 0.4, ...]  # Your vector dimensions here

# Build the vector search pipeline
pipeline = Pipeline()
pipeline.vector_search(
    index="vector_index",        # Name of your Atlas Vector Search index
    path="embedding",            # Field containing vector embeddings
    query_vector=query_vector,   # Your search vector
    num_candidates=100,          # Number of candidates to consider
    limit=10                     # Number of results to return
)

๐Ÿ“˜ This query will find the 10 documents whose embedding vectors are most similar to your query vector, considering 100 nearest neighbors during the search.

๐Ÿ” Filtering Vector Search Results

You can narrow your vector search with filters:

from monggregate import Pipeline

pipeline = Pipeline()
pipeline.vector_search(
    index="product_embeddings",
    path="product_vector",
    query_vector=query_vector,
    num_candidates=200,
    limit=20,
    filter={
        "category": "electronics",
        "price": {"$lt": 1000}
    }
)

๐Ÿ” This search will only consider products in the "electronics" category with a price less than 1000.

๐ŸŒŸ Retrieving Search Scores

To include the similarity score in your results:

pipeline = Pipeline()
pipeline.vector_search(
    index="text_embeddings",
    path="content_vector",
    query_vector=query_vector,
    num_candidates=150,
    limit=10
).project(
    content=1,
    metadata=1,
    score={"$meta": "vectorSearchScore"}  # Include the vector similarity score
)

๐Ÿ’ฏ Atlas Vector Search assigns a score between 0 and 1 to each result, with higher scores indicating greater similarity.

Here's a comprehensive example that uses vector search for a semantic search application:

import numpy as np
from sentence_transformers import SentenceTransformer
from monggregate import Pipeline
import pymongo

# Connect to MongoDB
client = pymongo.MongoClient("mongodb+srv://...")
db = client["knowledgebase"]

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embedding for user query
user_query = "How do I implement authentication in my application?"
query_embedding = model.encode(user_query).tolist()

# Create vector search pipeline
pipeline = Pipeline()
pipeline.vector_search(
    index="document_vectors",
    path="embedding",
    query_vector=query_embedding,
    num_candidates=100,
    limit=5,
    filter={
        "document_type": "article",
        "status": "published"
    }
).project(
    title=1,
    content=1,
    url=1,
    score={"$meta": "vectorSearchScore"}
)

# Execute search
results = list(db.documents.aggregate(pipeline.export()))

# Display results
for doc in results:
    print(f"Title: {doc['title']}")
    print(f"Score: {doc['score']:.4f}")
    print(f"URL: {doc['url']}")
    print("-" * 40)

๐Ÿ”ฌ Technical Details

  • ๐Ÿ”ข Vector dimensions: Your query vector must have the same number of dimensions as the vectors in your indexed field
  • ๐ŸŽฏ numCandidates: Should be greater than the limit for better accuracy, typically 10-20x for optimal recall
  • โšก Performance tuning: Adjust numCandidates to balance between search quality and speed
  • ๐Ÿ”„ Filtering: Only works on indexed fields marked as the "filter" type in your vector search index
  • ๐Ÿ“Š Scoring: For cosine and dotProduct similarities, scores are normalized using the formula: score = (1 + cosine/dot_product(v1,v2)) / 2

๐Ÿ”œ Next Steps