LangChain + Ollama Integration Guide: Complete Local LLM App Development
Last month, I took a look at my OpenAI bill—$52.30. Honestly, I felt a bit crushed. As an individual developer occasionally dabbling with AI, spending that much seemed excessive. Then I remembered Ollama running locally on my machine, with free Llama 3.1 just waiting to be used.
But here’s the problem: calling the Ollama API directly to build applications requires a lot of code. Request formatting, response parsing, error handling… every time I wrote it, it felt tedious. That’s where LangChain comes in—a set of well-encapsulated interfaces that work for Chat, RAG, and Agent scenarios, plus the ability to switch models with a single line of code.
This article is the framework integration chapter of the Ollama Local LLM Practical Guide series. It will take you from the basics of the langchain-ollama package all the way to three practical scenarios: Chat, RAG, and Agent. If you’ve read the previous articles in the series (API calls, multi-model deployment), this piece will tie together those scattered pieces into a complete development framework.
Getting Started with the langchain-ollama Package
Let me start with a pitfall I fell into. I was previously using langchain_community.llms.Ollama, and the code ran fine, but something felt off—every time I checked the documentation, I saw references to the “langchain-ollama” package. Later I learned that LangChain officially spun off the Ollama integration into a standalone package, which is langchain-ollama.
Why use the official package?
Better type hints and more friendly IDE autocompletion. Maintenance is synchronized with LangChain’s main version, so you don’t need to worry about compatibility issues. Also, community packages might be deprecated at any time, while official packages are a long-term solution—this is important, as I’ve been burned by deprecated community packages before.
Installation is a single command:
```bash
pip install langchain-ollama
```
Once installed, you’ll find this package provides three core classes, each corresponding to different scenarios:
| Class Name | Purpose | Typical Scenarios |
|---|---|---|
| ChatOllama | Chat model | Multi-turn conversations, Q&A systems |
| OllamaLLM | Text completion | One-time generation, text continuation |
| OllamaEmbeddings | Vector embeddings | RAG, semantic search |
In my experience, ChatOllama is sufficient for 90% of scenarios. The chat model supports multi-turn interactions and streaming output—that effect of text appearing character by character provides a much better user experience than returning the entire response at once.
Here’s a minimal example to show you how simple it is:
```python
from langchain_ollama import ChatOllama

# Initialize model
llm = ChatOllama(
    model="llama3.1:8b",  # Model name, needs to be pulled in Ollama first
    temperature=0.7       # Randomness parameter, between 0 and 1
)

# Send message
response = llm.invoke("Hello, please introduce yourself")
print(response.content)
```
Before running this code, make sure you’ve pulled the model with ollama pull llama3.1:8b. If you haven’t installed Ollama yet, you can refer back to the first article in the series.
The embedding model OllamaEmbeddings works similarly, mainly used for converting text into vectors. We’ll use it in detail in the RAG section later, but here’s a simple example:
```python
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Single text embedding
vector = embeddings.embed_query("This is a test text")
print(f"Vector dimensions: {len(vector)}")  # Usually 768 or higher

# Batch-embed multiple texts
vectors = embeddings.embed_documents([
    "First text segment",
    "Second text segment"
])
```
nomic-embed-text is currently a mainstream embedding model, specifically designed for semantic retrieval. The vector dimensions are high (typically 768+), and retrieval quality is much better than general-purpose models.
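Once text is mapped to vectors, "relevance" is usually measured with cosine similarity. Here is a minimal, framework-free sketch of the idea; the toy 3-dimensional vectors are made up for illustration, while real nomic-embed-text vectors would come from `embed_query` and have hundreds of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, just to show the ranking)
query_vec = [0.1, 0.9, 0.2]
doc_vecs = {
    "doc_a": [0.1, 0.8, 0.3],  # points in a similar direction -> high score
    "doc_b": [0.9, 0.1, 0.0],  # points elsewhere -> low score
}
scores = {name: cosine_similarity(query_vec, v) for name, v in doc_vecs.items()}
best = max(scores, key=scores.get)
print(best)  # doc_a
```

This ranking step is exactly what the vector store does internally when you later call a retriever, just at scale and with an index instead of a loop.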
Chat Application in Practice: Multi-turn Conversations and Streaming Output
Making a single API call is simple, but real chat scenarios are much more complex—users ask follow-up questions, and the model needs to remember previous conversation content. LangChain handles this with a message list.
Implementing Multi-turn Conversations
LangChain has three message types:
- SystemMessage: sets the model's role and behavior (e.g., "You are a professional programming assistant")
- HumanMessage: user input
- AIMessage: model response
Here’s the code:
```python
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage

llm = ChatOllama(model="llama3.1:8b", temperature=0.7)

# Build conversation history
messages = [
    SystemMessage(content="You are a developer assistant skilled at explaining technical concepts. Keep your answers concise and easy to understand."),
    HumanMessage(content="What is a REST API?"),
    AIMessage(content="REST API is a web service interface design style that uses HTTP methods (GET/POST/PUT/DELETE) to manipulate resources. Simply put, it's accessing data through URLs."),
    HumanMessage(content="How does GraphQL differ from it?")
]

# The model generates a response based on the entire conversation history
response = llm.invoke(messages)
print(response.content)
```
In this code, the model can see the previous answer and knows the user is asking about the difference between GraphQL and REST. Without that AIMessage record, the model might explain GraphQL from scratch, breaking the user experience flow.
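One practical caveat: the message list grows with every turn, and a local model's context window is finite. A common pattern is to always keep the system message and only the most recent turns. Here is a framework-free sketch using plain `(role, content)` tuples; with LangChain you would apply the same logic to the message objects themselves:

```python
def trim_history(messages, max_turns=3):
    """Keep the system message plus the last `max_turns` human/AI
    exchanges (2 messages per turn). Older turns are dropped."""
    system = [m for m in messages if m[0] == "system"]
    rest = [m for m in messages if m[0] != "system"]
    return system + rest[-(max_turns * 2):]

history = [("system", "You are a developer assistant.")]
for i in range(5):  # simulate five turns of conversation
    history.append(("human", f"question {i}"))
    history.append(("ai", f"answer {i}"))

trimmed = trim_history(history, max_turns=3)
print(len(trimmed))  # 7: 1 system message + last 3 turns
```

Keeping a fixed window like this is the simplest memory strategy; summarizing old turns is another option when earlier context still matters.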
Streaming Output: Making Responses Feel More “Alive”
The benefit of streaming output is that users don’t stare at a blank screen waiting for results—text appears one character at a time, like someone typing. This experience is especially important for long responses.
```python
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1:8b")

# Streaming output
print("Model response: ", end="", flush=True)
for chunk in llm.stream("Write a quicksort algorithm in Python and explain the principle"):
    print(chunk.content, end="", flush=True)
print()  # Final newline
```
The stream() method returns an iterator, where each chunk contains a small piece of text. flush=True ensures content displays immediately without being buffered.
I’ve tested this and found that streaming output feels much less delayed than returning everything at once—especially when responses exceed 100 words. Users feel the system is “thinking” rather than being “stuck.”
RAG Application in Practice: Local Knowledge Base Retrieval
RAG (Retrieval-Augmented Generation) is one of the most practical LLM application scenarios today. Simply put: first retrieve relevant content from a document library, then have the model generate an answer based on that content. This way, the model can “know” information outside its training data.
Breaking Down the RAG Process
A complete RAG system involves five steps:
- Load Documents — Read in PDF, TXT, Markdown and other files
- Split Text — Documents are too long, split into small chunks for easier retrieval
- Generate Vectors — Use embedding model to convert text into numerical vectors
- Store Index — Store in a vector database (we’ll use ChromaDB)
- Retrieve and Generate — When user asks a question, retrieve relevant chunks and have the model answer
Here’s the complete code—I’ve run it once and it works properly:
```python
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# === 1. Load Documents ===
# Supports multiple formats: PDF, TXT, Markdown, etc.
loader = TextLoader("./my_document.txt")  # Replace with your document path
docs = loader.load()

# === 2. Split Text ===
# chunk_size=1000 is a common setting: each chunk is about 1000 characters
# chunk_overlap=200 lets adjacent chunks overlap to avoid cutting information mid-thought
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
splits = text_splitter.split_documents(docs)

# === 3 & 4. Generate Vectors and Store ===
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory="./chroma_db"  # Persistent storage path
)

# === 5. Build Retriever ===
# search_kwargs={"k": 4} means retrieve the 4 most relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# === 6. Build the RAG Chain ===
template = """Answer the question based on the following context. If there's no relevant information in the context, clearly state "No relevant information found in the documents."

Context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
llm = ChatOllama(model="llama3.1:8b")

# LangChain's LCEL syntax, chaining components with the | operator
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# === 7. Query ===
response = rag_chain.invoke("What is the main content of the document?")
print(response)
```
This code looks a bit long, but broken down it’s actually quite clear. The key is building the rag_chain—using LangChain’s LCEL (LangChain Expression Language) syntax to chain the retriever, prompt template, model, and output parser.
Some practical parameter adjustment suggestions:
- `chunk_size`: for dense content (technical documentation), drop to about 800; for prose-like content, 1000-1500 works fine.
- `k` (number of retrieved chunks): 3-5 is usually enough. Too many dilutes relevance; too few risks missing key information.
- `persist_directory`: always set it, otherwise you'll rebuild the vector store on every restart, which is time-consuming and wasteful.
The first time I ran RAG without setting persistence, I had to re-run embeddings every time I changed code—it was painfully slow. Later I added persist_directory, and the built vector store loaded directly, starting up in seconds.
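To build intuition for what `chunk_size` and `chunk_overlap` actually control, here is a deliberately simplified character-level splitter. The real `RecursiveCharacterTextSplitter` additionally tries to break on paragraph and sentence boundaries, which this sketch ignores:

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Naive sliding-window split: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so adjacent chunks share chunk_overlap chars."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 2500  # stand-in for a real document
chunks = split_text(doc, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The 200-character overlap is why a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is exactly the "information break" problem the parameter exists to avoid.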
Agent Application in Practice: JSON-based Tool Calling
The biggest difference between Agent and regular chat is: Agents can call external tools.
For example, when a user asks “What’s the weather in Beijing today?”, a regular chat model can only make up an answer. An Agent will first call a weather query tool, then answer based on real data.
The Challenge of Ollama Tool Calling
I need to be honest here: Ollama’s tool calling support isn’t as mature as OpenAI’s. OpenAI models natively support function calling, accurately recognizing when to call tools and how to pass parameters. Ollama models (including Llama 3.1) are still catching up in this area.
So what’s the solution? LangChain officially offers a workaround: JSON-based Agent.
The idea is to have the model output structured JSON, which the Agent framework parses to decide which tool to call. In practice, it works quite well—though not as smooth as OpenAI’s native tool calling, it can complete basic tasks.
Custom Tool Example
First, let’s define a few simple tool functions:
```python
from langchain_ollama import ChatOllama
from langchain_core.tools import tool

# Define tools
@tool
def get_weather(city: str) -> str:
    """Get weather information for the specified city"""
    # Mock data; could be wired to a real weather API
    weather_data = {
        "Beijing": "Sunny, 25°C, good air quality",
        "Shanghai": "Cloudy, 22°C, chance of light rain",
        "Shenzhen": "Hot, 30°C, strong UV"
    }
    return weather_data.get(city, f"Weather data not found for {city}")

@tool
def calculate(expression: str) -> str:
    """Perform mathematical calculations"""
    try:
        result = eval(expression)  # Note: production code needs a safer implementation
        return f"Calculation result: {result}"
    except Exception:
        return "Calculation error, please check the expression"

@tool
def search_local_docs(query: str) -> str:
    """Search the local document library"""
    # Could be wired to the retriever from the RAG section above
    return f"Search results for '{query}': found 3 relevant records"
```
The @tool decorator turns a regular function into a LangChain tool. The function’s docstring automatically becomes the tool description—the model uses this description to determine when to use which tool.
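A word on that `eval` warning in the calculate tool: `eval` will happily execute arbitrary Python, which is dangerous when the input ultimately comes from a user. One possible hardening (a sketch, not the only approach) is to parse the expression with the standard library's `ast` module and only evaluate whitelisted arithmetic nodes:

```python
import ast
import operator

# Whitelisted arithmetic operators
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_calculate(expression: str) -> float:
    """Evaluate a pure arithmetic expression; raise ValueError on anything else."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval"))

print(safe_calculate("23 + 45"))        # 68
print(safe_calculate("(2 + 3) * 4.5"))  # 22.5
```

Anything that isn't a number or a whitelisted operator (function calls, attribute access, names) raises `ValueError` instead of executing, so `safe_calculate` can be dropped into the tool body in place of `eval`.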
Then create the JSON Agent:
```python
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOllama(model="llama3.1:8b")
tools = [get_weather, calculate, search_local_docs]

# Create the prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that can use tools to complete tasks."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

# Create the Agent
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Execute a task
response = agent_executor.invoke({
    "input": "Help me check the weather in Beijing today, then calculate 23 + 45"
})
print(response["output"])
```
verbose=True will print the Agent’s thinking process, helpful for debugging. You’ll see the model’s decision-making process for tool calls.
Real-world experience:
I ran several tasks, and the JSON Agent's success rate is about 70-80%. Simple tasks (checking weather, doing math) work fine, but complex tasks (combining multiple tools) occasionally fail—wrong parameter formats, or the wrong tool chosen. This is a common issue with local LLM Agents right now; they're not as stable as OpenAI's.
If you have high requirements for Agents, consider:
- Using stronger models (like Qwen 2.5 or DeepSeek)
- Simplifying task workflows and reducing the number of tools
- Or just using OpenAI’s native tool calling—higher cost, but much better stability
OpenAI vs Ollama: Switching with One Line of Code
Many people ask me: “I’ve written my LangChain code pretty well, can I use both OpenAI and Ollama?” The answer is: yes, and it’s so simple you won’t believe it.
Method 1: Change the Import
Suppose you have code using OpenAI:
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0.7)
response = llm.invoke("Explain quantum computing")
```
To switch to Ollama, just change one line of import:
```python
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1:8b", temperature=0.7)
response = llm.invoke("Explain quantum computing")
```
All other code—prompt templates, Chain construction, output parsing—doesn’t need to change at all. LangChain’s abstraction layer is well-designed enough that model switching is almost transparent to business logic.
Method 2: OpenAI-Compatible API
Ollama also provides a “pretend to be OpenAI” approach—the OpenAI-Compatible API. The benefit is you don’t even need to change the import:
```python
from langchain_openai import ChatOpenAI

# Only base_url and api_key change; everything else stays the same
llm = ChatOpenAI(
    model="llama3.1:8b",
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama"  # Any value works; Ollama doesn't verify it
)
response = llm.invoke("Explain quantum computing")
```
When is this approach suitable? When your project already extensively uses ChatOpenAI and you don’t want to change the code structure, but want to test local model performance.
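One way to make that switch friction-free is to drive the constructor kwargs from an environment variable, so the same `ChatOpenAI` call hits either backend. A minimal sketch—the `USE_LOCAL_LLM` variable name is my own convention, not anything Ollama defines:

```python
import os

def chat_model_kwargs() -> dict:
    """Return ChatOpenAI constructor kwargs: pointed at local Ollama
    when USE_LOCAL_LLM=1, at OpenAI's API otherwise."""
    if os.environ.get("USE_LOCAL_LLM") == "1":
        return {
            "model": "llama3.1:8b",
            "base_url": "http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
            "api_key": "ollama",                      # Ollama ignores the key
        }
    return {"model": "gpt-4", "api_key": os.environ.get("OPENAI_API_KEY", "")}

os.environ["USE_LOCAL_LLM"] = "1"
print(chat_model_kwargs()["base_url"])  # http://localhost:11434/v1
```

With this in place, `llm = ChatOpenAI(**chat_model_kwargs())` works unchanged in development and production; flipping the environment variable is the whole migration.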
Comparison Summary
With all these switching methods discussed, when should you use Ollama and when should you use OpenAI? I’ve organized the key differences into a table:
| Comparison Point | OpenAI (GPT-4) | Ollama (Llama 3.1) |
|---|---|---|
| Cost | $0.03/1K input tokens | Free (consumes local GPU electricity) |
| Privacy | Data uploaded to cloud, cautious in compliance-sensitive scenarios | Processed locally, data never leaves |
| Tool Calling | Native support, stable and reliable | Requires JSON Agent, 70-80% success rate |
| Response Speed | Fast (cloud optimized, 1-3 sec first token) | Depends on local GPU (3-10 sec variable) |
| Model Capability | GPT-4 is among the strongest | Llama 3.1 8B is mid-tier, usable but not GPT-4 level |
My recommendations:
- Personal learning, prototype development: Use Ollama, save money and experiment freely
- Production environment, high concurrency: Use OpenAI, guaranteed stability and response speed
- Privacy-sensitive data: Use Ollama, data never leaves local machine
- Complex Agent tasks: Use OpenAI, tool calling is more stable
The ideal state is having both—use Ollama during development to save costs, switch to OpenAI in production for stability. The switching cost is only one line of code, so why not?
Summary
By now, you have a complete map of the LangChain + Ollama integration path.
We started with the langchain-ollama package basics, learning about the three core classes: ChatOllama, OllamaLLM, and OllamaEmbeddings. Then we practiced three scenarios: Chat multi-turn conversations (streaming output makes the experience smoother), RAG knowledge base retrieval (turning local documents into intelligent Q&A), and Agent tool calling (JSON Agent is the current compromise). Finally, we compared OpenAI and Ollama switching strategies—one line of code to switch, costs dropping from $50/month to free.
When to choose Ollama?
In one sentence: when you want to save money, protect privacy, or just want to learn LLM development. Run it locally, experiment freely, no worries about exploding bills.
When do you still need OpenAI?
Complex Agent tasks, high-concurrency production environments, scenarios requiring high response speed and stability. Local LLMs still can’t replace the cloud-based experience at this point.
If you haven’t tried it hands-on yet, I suggest starting with the Chat scenario—the code is simplest and the effect is most intuitive. After getting it working, try RAG, connecting local documents to experience that “aha moment” when the model can read your materials. Save Agent for later, as there are still quite a few pitfalls with tool calling that require patient debugging.
There’s more content coming in the series: multi-model deployment (how to switch between different models in LangChain), performance tuning (making local LLMs run faster), production deployment (turning local applications into usable services). Stay tuned if you’re interested.
Feel free to discuss in the comments, or find me directly on GitHub. I’ve tested all the code examples—they should work properly. If you encounter errors, it’s probably because the model isn’t pulled or dependencies aren’t installed; just troubleshoot based on the error messages.
LangChain + Ollama Integration Development
From installation and configuration to Chat, RAG, and Agent practical scenarios—master local LLM application development in one go
⏱️ Estimated time: 60 min
Step 1: Install the langchain-ollama package
Run the installation command:
```bash
pip install langchain-ollama
```
Ensure Ollama is installed and the model is pulled (e.g., `ollama pull llama3.1:8b`).
Step 2: Create a Chat Application
Initialize ChatOllama and send message:
```python
from langchain_ollama import ChatOllama
llm = ChatOllama(model="llama3.1:8b", temperature=0.7)
response = llm.invoke("Hello")
print(response.content)
```
Supports multi-turn conversations and streaming output.
Step 3: Build a RAG Knowledge Base
Five-step process:
• Load documents (TextLoader / PyPDFLoader)
• Split text (RecursiveCharacterTextSplitter)
• Generate vectors (OllamaEmbeddings)
• Store index (ChromaDB)
• Retrieve and generate (RAG Chain)
Key parameters: chunk_size=1000, k=4; persist_directory must be set.
Step 4: Implement Agent Tool Calling
Define tool functions and create JSON Agent:
```python
@tool
def get_weather(city: str) -> str:
    """Get weather information"""
    ...
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools)
```
Success rate is around 70-80%; OpenAI is recommended for complex tasks.
Step 5: Switch Between OpenAI / Ollama
Method 1: Change import
```python
from langchain_ollama import ChatOllama # Use Ollama
from langchain_openai import ChatOpenAI # Use OpenAI
```
Method 2: OpenAI-Compatible API (no import change)
```python
llm = ChatOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)
```
FAQ
- What's the difference between langchain-ollama and langchain_community.llms.Ollama?
- Should I use ChatOllama or OllamaLLM?
- How should I set chunk_size and k values in a RAG system?
- Why is Ollama's tool calling less stable than OpenAI's?
- How do I switch between OpenAI and Ollama?
- Is Ollama suitable for production environments?
12 min read · Published on: Apr 7, 2026 · Modified on: Apr 8, 2026
Related Posts
Ollama Embedding in Practice: Local Vector Search and RAG Setup
Ollama Multi-Model Deployment: Running Qwen, Llama, and DeepSeek in Parallel
Ollama Modelfile Parameters Explained: A Complete Guide to Creating Custom Models