
Ollama Embedding in Practice: Local Vector Search and RAG Setup

Last week I dug through over 200 PDF documents on my computer, trying to find details from a technical proposal I’d read six months ago. Keyword search? Useless—I remembered the meaning, not the exact words. Took me nearly an hour to track it down. That’s when I thought: what if I had a search tool that actually “understood” the semantics?

The bigger problem? These documents involved internal company architecture. Uploading them to the cloud for vector search? Not happening. Privacy constraints were non-negotiable.

Then I discovered Ollama’s Embedding feature. Exactly what I needed—runs locally, keeps data private, and handles semantic search. Spent a few days experimenting, tested three models (mxbai, nomic, Qwen3), and dove deep into vector database options. Honestly, there were plenty of pitfalls—pick the wrong model and search quality tanks; pick too small a database and scaling becomes a nightmare later.

This article consolidates my hard-won experience. By the end, you’ll be able to build your own local RAG system—from document processing to semantic search—with complete code examples.

Ollama Embedding Models: The Full Picture

Ollama now supports several Embedding models, and honestly, I didn’t know which one to pick at first. The docs claim mxbai outperforms OpenAI’s text-embedding-3-large—sounds impressive, but how does it actually perform in practice? I tested them all, so here’s a straight answer.

First, check this comparison table to get oriented:

| Model | Vector Dimensions | Context Length | Model Size | Key Features |
| --- | --- | --- | --- | --- |
| mxbai-embed-large | 1024 | 512 tokens | 670M | General-purpose, top MTEB rankings |
| nomic-embed-text | 768 | 8192 tokens | 274M | Long context support, extensible |
| Qwen3 Embedding | 1024 | 8192 tokens | ~600M | Released 2026, Chinese-optimized |

mxbai-embed-large is my go-to choice. Why? Simple and reliable. Its 1024-dimensional vectors work well in most scenarios, and it ranks high on the MTEB (Massive Text Embedding Benchmark)—indeed a bit higher than OpenAI’s text-embedding-3-large. For everyday document retrieval and code search, this is your safest bet.

nomic-embed-text stands out with its 8192-token context window. Super useful for processing entire articles or long conversation logs. It’s also smaller at 274M parameters, running faster than mxbai. The tradeoff? Vector dimensions drop to 768, theoretically reducing semantic expressiveness—though in practice, short text retrieval shows minimal difference. For long documents, nomic is the better choice.

Qwen3 Embedding is Alibaba’s April 2026 release. The Chinese performance is genuinely good—I tested several technical articles and found “distributed system fault tolerance” matched well with “fault-tolerant design,” whereas mxbai struggled a bit. If you primarily process Chinese content, this one’s worth trying.

My recommendation? Start with mxbai for general use—no need to overthink it. Go with nomic for long texts, and Qwen3 for Chinese-heavy content. Honestly, testing all three takes maybe 30 minutes. Real-world results are what matter.
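Whichever model you land on, the comparison you care about downstream is cosine similarity between embedding vectors. A minimal helper, shown here with toy four-dimensional vectors standing in for real 1024-dimensional model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings from ollama.embed
query_vec = [0.1, 0.3, 0.5, 0.1]
doc_a = [0.1, 0.3, 0.5, 0.1]   # same direction as the query
doc_b = [0.9, -0.2, 0.0, 0.1]  # unrelated direction

print(round(cosine_similarity(query_vec, doc_a), 3))  # 1.0
print(round(cosine_similarity(query_vec, doc_b), 3))  # 0.072
```

In practice you'd feed it two vectors returned by ollama.embed: a score near 1 means near-identical meaning, a score near 0 means unrelated.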

Vector Database Selection Guide

Picked your model, but where do you store the data? Vector database selection is even trickier than model choice. Pick too small and scaling becomes painful later; pick too big and you waste resources. I compared three mainstream options:

| Database | Use Case | Data Scale | Characteristics |
| --- | --- | --- | --- |
| ChromaDB | Development, personal projects | < 100K records | Zero-config, plug-and-play |
| FAISS | High-performance single-machine, research | 100K-1M records | Meta open-source, blazing fast |
| Milvus | Production deployment, enterprise | Million+ records | Distributed, scalable, feature-rich |

ChromaDB is my top recommendation for beginners. Installation is a single command: pip install chromadb. The API design is intuitive—store data, query data, just a few lines of code. It uses HNSW (Hierarchical Navigable Small World) indexing, and retrieval speed is perfectly adequate for small datasets. The downside? Single-machine deployment—performance starts degrading beyond 100K records.

FAISS is Meta’s battle-tested open-source tool. Pure C++ implementation, speed is genuinely impressive. I tested 500K records with retrieval latency consistently in milliseconds. But it’s more of a vector search library than a full database—you manage storage and index files yourself. Great for tinkerers or performance-critical scenarios.
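To make "vector search library" concrete: FAISS's simplest index, IndexFlatL2, is conceptually just an exhaustive scan over squared Euclidean distances, implemented in heavily optimized C++. A pure-Python sketch of that same idea, with toy 2-D vectors (illustrative only, not FAISS itself):

```python
def l2_squared(a, b):
    """Squared Euclidean distance, the metric behind a flat L2 index."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(query, vectors, k=2):
    """Return the indices of the k nearest vectors, closest first."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_squared(query, vectors[i]))
    return ranked[:k]

store = [
    [0.0, 0.0],  # id 0
    [1.0, 1.0],  # id 1
    [0.1, 0.1],  # id 2
]
print(brute_force_search([0.08, 0.08], store, k=2))  # [2, 0]
```

FAISS's value is doing exactly this (plus smarter index structures like IVF and HNSW) millions of times per second; the logic itself is this simple.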

Milvus is different—built for production from the ground up. Supports distributed deployment, persistent storage, multiple index types, and has a cloud service version (Zilliz Cloud). But configuration is complex, deployment costs are high. Worth the investment only for million-scale data, high availability needs, or team collaboration.

My strategy: Use ChromaDB for personal projects—get something running quickly. For research or performance-sensitive work, go FAISS. For actual production, Milvus or cloud services from the start. Don’t plan on migrating from ChromaDB to Milvus later—different data formats, different APIs, substantial migration cost.

Complete RAG Workflow Implementation

Enough theory—let’s see actual code. Below is a complete local RAG implementation, from PDF documents to semantic search, using Ollama + ChromaDB.

Environment Setup

First, install dependencies:

```bash
pip install ollama chromadb langchain langchain-community pypdf
```

Make sure Ollama is running and models are pulled:

```bash
ollama pull mxbai-embed-large
ollama pull qwen2.5:7b  # For generating answers
```

Code Implementation

```python
import ollama
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import chromadb

# 1. Load PDF document
loader = PyPDFLoader("./your_document.pdf")
docs = loader.load()

# 2. Split document into chunks—this is crucial
# Too large = poor retrieval; too small = information loss
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,     # 800 characters per chunk
    chunk_overlap=100,  # 100-character overlap to preserve context
)
chunks = splitter.split_documents(docs)

# 3. Generate embeddings and store in ChromaDB
client = chromadb.Client()
collection = client.create_collection("my_docs")

for i, chunk in enumerate(chunks):
    # Call Ollama API to generate vector
    response = ollama.embed(
        model="mxbai-embed-large",
        input=chunk.page_content,
    )
    embedding = response["embeddings"][0]

    # Store in vector database
    collection.add(
        ids=[str(i)],
        embeddings=[embedding],
        documents=[chunk.page_content],
        metadatas=[{"source": chunk.metadata.get("source", "unknown")}],
    )

print(f"Indexed {len(chunks)} document chunks")

# 4. Semantic search
query = "What are the fault tolerance mechanisms in distributed systems?"
query_embedding = ollama.embed(
    model="mxbai-embed-large",
    input=query,
)["embeddings"][0]

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,  # Return top 3 most relevant chunks
)

# 5. Generate answer from retrieved results
context = "\n\n".join(results["documents"][0])
response = ollama.chat(
    model="qwen2.5:7b",
    messages=[
        {
            "role": "system",
            "content": "Answer based on the following document content. If the document doesn't contain relevant information, say so honestly.",
        },
        {"role": "user", "content": f"Document content: {context}\n\nQuestion: {query}"},
    ],
)

print(f"Answer: {response['message']['content']}")
```

Give it a try. The overall flow isn’t complicated: chunk documents → generate vectors → store → query → assemble answer.

A few pitfalls to watch out for:

First, don’t set chunk_size randomly. I tried 200 characters per chunk—retrieval returned fragmented information that couldn’t form complete answers. The 500-1000 range is more reliable.
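If you're curious what the splitter actually does, fixed-size chunking with overlap is just a sliding window. A simplified character-based sketch (LangChain's RecursiveCharacterTextSplitter additionally prefers to break at separators like paragraph boundaries, which this toy version ignores):

```python
def chunk_text(text, chunk_size=800, overlap=100):
    """Split text into fixed-size chunks; each chunk re-reads the
    last `overlap` characters of the previous one to preserve context."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 2000  # stand-in for real document text
print([len(c) for c in chunk_text(doc, chunk_size=800, overlap=100)])  # [800, 800, 600]
```

The overlap is why a sentence straddling a chunk boundary still appears whole in at least one chunk, which is exactly what fragmented 200-character chunks fail to guarantee.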

Second, batch processing. With large document volumes, calling the Ollama API one by one is slow. Batch them together:

```python
# Batch generate embeddings for significant speedup
batch_texts = [chunk.page_content for chunk in chunks[:50]]
batch_embeddings = ollama.embed(
    model="mxbai-embed-large",
    input=batch_texts,
)["embeddings"]
```

Third, similarity threshold. ChromaDB returns n_results by default, regardless of relevance. Some scenarios need filtering:

```python
# Custom distance threshold filtering
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=10,
)
# Keep only results with distance < 0.3 (lower = more similar)
filtered = [
    doc for doc, dist in zip(results["documents"][0], results["distances"][0])
    if dist < 0.3
]
```

Run this code, and you’ve got your own local RAG system. Swap in your PDFs, change the query, and you’re set.

Performance Tuning and Practical Tips

System’s running—now for optimization. A few parameters directly impact results. Here are the pitfalls I’ve encountered.

What chunk_size should you use?

500-1000 characters is my empirical sweet spot. Too small and you lose semantic completeness—a sentence cut in half won’t match queries well. Too large and retrieval gets noisy—a single chunk covering multiple topics blurs the boundaries.

Different document types have different needs. Technical documentation with clear structure can be split by paragraphs; fragmented content like chat logs works better with 500-character chunks. Test in your actual scenario—no universal answer.

Batch processing for speed

Calling the Ollama API individually means network round-trip for every request. Batch 50-100 together and speed multiplies. Just don’t go too large—embedding models have input length limits, and exceeding them throws errors.
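Factoring the batching out into a helper makes it hard to get wrong (plain Python; the batch size of 64 is just an arbitrary default inside the 50-100 range above):

```python
def batched(items, batch_size=64):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"chunk {i}" for i in range(150)]  # stand-ins for chunk texts
print([len(batch) for batch in batched(texts)])  # [64, 64, 22]
```

Each batch then goes into a single ollama.embed(model=..., input=batch) call, as in the batch snippet from the previous section.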

How to set similarity threshold

0.7-0.85 range, depending on your accuracy needs. Higher threshold (say 0.85) keeps only highly relevant results—low recall but precise. Lower threshold (0.7) recalls more but may include noise. Clean document collection with clear queries? Set it higher. Vague questions needing more information? Go lower.
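One wrinkle: ChromaDB's query results report distances, while the figures above are similarities. If you create the collection with cosine distance (`metadata={"hnsw:space": "cosine"}`, a ChromaDB option, not its default), the conversion is simply similarity = 1 − distance, and filtering looks like this:

```python
def filter_by_similarity(documents, distances, min_similarity=0.8):
    """Keep documents whose cosine similarity (1 - cosine distance)
    clears the threshold. Assumes the collection uses cosine distance,
    e.g. create_collection("my_docs", metadata={"hnsw:space": "cosine"})."""
    return [
        doc
        for doc, dist in zip(documents, distances)
        if 1.0 - dist >= min_similarity
    ]

docs = ["highly relevant", "borderline", "noise"]
dists = [0.05, 0.22, 0.6]  # hypothetical cosine distances from a query
print(filter_by_similarity(docs, dists, min_similarity=0.8))
# ['highly relevant'] — similarity 0.95 passes; 0.78 and 0.4 do not
```

With the default L2 space the numbers mean something different, so check how your collection was created before reusing a threshold.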

One tip: Run 50-100 documents first to gauge retrieval quality before deciding on parameters. Tuning after loading the full dataset is time-consuming. Iterative optimization—run, check, adjust.

Summary

After all that, it really comes down to a few key points:

Model selection depends on your scenario—mxbai for general use, nomic for long texts, Qwen3 for Chinese. Database: ChromaDB for beginners, FAISS for performance, Milvus for production. Complete workflow code is all here—adapt it and run.

The advantages are clear: local deployment, data privacy under your control; Ollama models are free, zero cost; ChromaDB is simple, low barrier to entry. Drawbacks exist too—single-machine performance has limits, and scaling beyond a million records means upgrading your solution.

My suggestion: Run through the code in this article first with 50 documents. See if retrieval quality and answer accuracy meet your needs. Once parameters are tuned, expand to the full dataset.

For a deeper dive into combining LangChain with Ollama for more complex applications, check out my earlier piece “LangChain + Ollama Integration in Practice”—that one covers conversation chains and tool calling. This article focuses on vector retrieval. Together, they form a complete roadmap for local LLM application development.

Build a Local RAG System

Set up a local vector search system using Ollama + ChromaDB

⏱️ Estimated time: 30 min

1. Step 1: Install dependencies and prepare models

    Run the following commands to install dependencies:

    ```bash
    pip install ollama chromadb langchain langchain-community pypdf
    ollama pull mxbai-embed-large
    ollama pull qwen2.5:7b
    ```

    Make sure the Ollama service is running.

2. Step 2: Load and chunk documents

    Use PyPDFLoader to load PDFs, RecursiveCharacterTextSplitter to chunk:

    ```python
    loader = PyPDFLoader("./your_document.pdf")
    docs = loader.load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=100,
    )
    chunks = splitter.split_documents(docs)
    ```

    Recommended chunk_size: 500-1000 characters.

3. Step 3: Generate vectors and store in database

    Call the Ollama API to generate embeddings and store them in ChromaDB:

    ```python
    client = chromadb.Client()
    collection = client.create_collection("my_docs")

    for i, chunk in enumerate(chunks):
        response = ollama.embed(
            model="mxbai-embed-large",
            input=chunk.page_content,
        )
        embedding = response["embeddings"][0]
        collection.add(
            ids=[str(i)],
            embeddings=[embedding],
            documents=[chunk.page_content],
        )
    ```

    Use batch processing for large document volumes.

4. Step 4: Semantic search and answer generation

    Convert the query to a vector, retrieve relevant documents, then use the LLM to generate an answer:

    ```python
    query_embedding = ollama.embed(
        model="mxbai-embed-large",
        input=query,
    )["embeddings"][0]

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3,
    )

    context = "\n\n".join(results["documents"][0])
    response = ollama.chat(
        model="qwen2.5:7b",
        messages=[
            {"role": "system", "content": "Answer based on document content"},
            {"role": "user", "content": f"Document: {context}\nQuestion: {query}"},
        ],
    )
    ```

    Adjust the similarity threshold to filter results as needed.

FAQ

**Which Ollama Embedding model is the best?**

No absolute best—depends on your scenario:

• mxbai-embed-large: General-purpose choice, stable performance, suitable for most scenarios
• nomic-embed-text: Long text scenarios, supports 8192 tokens
• Qwen3 Embedding: Chinese-friendly, released in 2026

Recommend trying all three—real-world testing trumps theory.

**ChromaDB, FAISS, or Milvus—which should I choose?**

Choose based on data scale and use case:

• ChromaDB: Best for beginners, zero configuration, suitable for under 100K records
• FAISS: Performance-focused, single-machine million-scale, requires managing storage yourself
• Milvus: Production environments, distributed deployment, million+ records

Personal projects: ChromaDB. Production: Milvus.

**What should I set chunk_size to?**

Recommended range: 500-1000 characters. Too small = incomplete semantics; too large = retrieval noise. Technical docs can be split by paragraphs; chat logs work well with 500-character chunks. Test in your actual scenario.

**How can I speed up embedding generation?**

Use batch processing. Accumulate 50-100 items and call the Ollama API once—several times faster than individual calls. Be careful with batch size to avoid exceeding model input length limits.

**What should the similarity threshold be?**

Adjust within the 0.7-0.85 range. High threshold (0.85) = accurate but low recall; low threshold (0.7) = high recall but potential noise. Clean document collection with clear queries? Go higher. Vague questions needing more information? Go lower.

**What are the main advantages and disadvantages of local RAG?**

Advantages: Data privacy under your control, free models with zero cost, simple deployment and quick start.

Disadvantages: Single-machine performance limits, need to upgrade solution for million+ records, must maintain the service yourself.

8 min read · Published on: Apr 8, 2026 · Modified on: Apr 8, 2026
