Ollama API Practice: Python and Node.js Client Development Guide
It’s 1 AM. You type ollama run gemma3 in your terminal, and the screen spits out its first response. Feels good.
But then you start wondering: Can I integrate this into my own project? No API key needed, no payments, running locally. Sounds perfect.
Honestly, I thought the same thing at first. After digging through the docs, I discovered that the official SDKs for Python and JavaScript are straightforward—you can even use the OpenAI SDK with just two lines of code changed. Simpler than expected.
But simple doesn’t mean pitfall-free. How do you accumulate streaming responses? How do you write an Agent Loop for tool calling? How do you separate reasoning from the answer in thinking mode? These are all traps I’ve stepped into.
This article fills in those gaps. We’ll compare Python and Node.js side by side, covering both native SDK and OpenAI compatibility approaches, giving you a complete client development guide.
By the way, if you haven’t installed Ollama yet, check out the first article in our series LangChain + Ollama Integration Guide to get your local model running first.
Chapter 1: Ollama API Basics
Let’s start with how the API works.
Ollama starts a REST API service locally by default, with endpoints under http://localhost:11434/api. Open http://localhost:11434 in your browser and you’ll see a terse “Ollama is running” message, which means the service is working.
Main Endpoints
You need to remember two core endpoints:
| Endpoint | Use Case | Features |
|---|---|---|
| `/api/chat` | Multi-turn conversations | Supports a `messages` array, can carry conversation context |
| `/api/generate` | Single-turn generation | Simple and direct, good for one-off tasks |
There’s also /v1/chat/completions, the OpenAI-compatible endpoint. If you have existing OpenAI projects, just change the base_url and you’re good to go—we’ll cover this in detail later.
Try It with curl
Let’s test the API the old-school way:
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "gemma3",
  "messages": [
    { "role": "user", "content": "Why is the sky blue?" }
  ]
}'
```
The terminal will spit out a bunch of JSON. The key field is message.content—that’s where the model’s response lives.
The response structure looks like this:
```json
{
  "model": "gemma3",
  "created_at": "2026-04-18T01:23:45.678Z",
  "message": {
    "role": "assistant",
    "content": "The sky appears blue mainly because..."
  },
  "done": true
}
```
The done field is important. In streaming responses, each chunk has done set to false, and only the last chunk has true. We’ll use this when handling streaming responses.
Streaming Responses
By default, the API waits for the model to finish generating before returning everything at once. But if you want users to see a “typewriter effect,” you need to add stream: true:
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "gemma3",
  "messages": [{ "role": "user", "content": "Why is the sky blue?" }],
  "stream": true
}'
```
This time the terminal will output JSON line by line. Each line is a small chunk, and you need to collect them to piece together the complete answer.
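To see what that collection involves, here is a minimal sketch of the accumulation logic in Python. The sample lines are hard-coded stand-ins for real streamed output, so nothing here requires a running server:

```python
import json

def accumulate_chunks(ndjson_lines):
    """Piece together the full answer from streamed NDJSON lines."""
    full_text = ''
    for line in ndjson_lines:
        if not line.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        full_text += chunk['message']['content']
        if chunk.get('done'):  # the final chunk signals completion
            break
    return full_text

# Sample lines shaped like Ollama's streaming output
sample = [
    '{"message": {"role": "assistant", "content": "The sky "}, "done": false}',
    '{"message": {"role": "assistant", "content": "is blue."}, "done": true}',
]
print(accumulate_chunks(sample))  # The sky is blue.
```

In a real client you would read these lines from the HTTP response body as they arrive, but the parsing and accumulation logic stays the same.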
Yeah, handling these chunks manually is a pain. That’s why the official SDKs exist—they encapsulate these details for you.
Chapter 2: Python SDK Complete Practice
Python’s SDK is officially maintained and super easy to install:
pip install ollama
Ready to use right away. Supports Python 3.8+, which is quite friendly.
Basic Calling
The simplest call, done in one line:
```python
from ollama import chat

response = chat(
    model='gemma3',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}]
)
print(response.message.content)
```
That simple. The chat() function is a shortcut provided by the SDK, which internally creates a default Client connecting to the local Ollama service.
If you want to customize connection parameters—for example, if Ollama is running on another machine—you can create your own Client:
```python
from ollama import Client

client = Client(host='http://192.168.1.100:11434')
response = client.chat(model='gemma3', messages=[...])
```
Streaming Responses
Streaming is the key feature. You want users to see text appearing gradually, not waiting forever and then suddenly getting everything at once.
```python
from ollama import chat

stream = chat(
    model='gemma3',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
```
Here’s a pitfall: in older versions of the SDK each chunk was a plain dictionary, so you had to write chunk['message']['content'], and attribute access like chunk.message.content would blow up. Newer releases return typed response objects that accept both styles. I stepped right into this trap on an older install and spent ages figuring out the error.
Async Client
If your application uses an async architecture—for instance, with FastAPI or aiohttp—you need the async client:
```python
import asyncio
from ollama import AsyncClient

async def main():
    client = AsyncClient()

    # Non-streaming
    response = await client.chat(
        model='gemma3',
        messages=[{'role': 'user', 'content': 'Hello'}]
    )
    print(response.message.content)

    # Streaming
    stream = await client.chat(
        model='gemma3',
        messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)

asyncio.run(main())
```
The async streaming response returns an async generator, which you iterate with async for. Same logic as the synchronous version, just with await and async added.
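The accumulate-as-you-iterate pattern is easy to exercise without a live model. Here is a self-contained sketch where a stubbed async generator stands in for the real stream:

```python
import asyncio

async def fake_stream():
    # Stand-in for AsyncClient().chat(..., stream=True)
    for piece in ['Why ', 'is the sky ', 'blue?']:
        yield {'message': {'content': piece}}

async def collect():
    # Accumulate streamed pieces exactly as you would with the real client
    text = ''
    async for chunk in fake_stream():
        text += chunk['message']['content']
    return text

print(asyncio.run(collect()))  # Why is the sky blue?
```

Swapping `fake_stream()` for the real `client.chat(..., stream=True)` call is the only change needed in production code.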
Cloud Models
Interestingly, the Ollama SDK also supports cloud models. Some large models won’t run locally—like that 120B gpt-oss—but they’re available in the cloud.
```python
from ollama import chat

response = chat(
    model='gpt-oss:120b-cloud',
    messages=[{'role': 'user', 'content': 'Hello'}]
)
```
Models with the -cloud suffix will use the cloud API. Of course, you need an Ollama cloud account and API key—configuration is different from local. Check the official docs if you’re interested.
Honestly, this feature is quite practical. Small models run locally to save money, large models run in the cloud to save hardware. A hybrid approach works well.
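One way to wire up that hybrid split is a tiny router that picks a model name per request. Everything below (the model names, the length heuristic, the threshold) is an illustrative assumption, not an Ollama feature:

```python
def pick_model(prompt: str, *, cloud_threshold: int = 2000) -> str:
    """Route short prompts to a local model, long ones to a cloud model.

    Hypothetical routing policy: the names and the length heuristic
    are placeholders you would tune for your own workload.
    """
    if len(prompt) > cloud_threshold:
        return 'gpt-oss:120b-cloud'  # heavy lifting goes to the cloud
    return 'gemma3'                  # everyday tasks stay local

print(pick_model('Hello'))     # gemma3
print(pick_model('x' * 5000))  # gpt-oss:120b-cloud
```

In practice you might route on task type or expected context length rather than raw prompt length, but the shape of the router stays the same.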
Chapter 3: Node.js SDK Complete Practice
Node.js SDK is just as clean:
npm i ollama
Note that this package supports both Node.js and browser environments. The browser version needs a separate import:
```javascript
// Node.js
import ollama from 'ollama'

// Browser
import ollama from 'ollama/browser'
```
Basic Calling
Node.js is async by default, so it feels more natural:
```javascript
import ollama from 'ollama'

const response = await ollama.chat({
  model: 'gemma3',
  messages: [{ role: 'user', content: 'Why is the sky blue?' }],
})
console.log(response.message.content)
```
Compare this to the Python version: Python passes messages as a list of dictionaries, Node.js as an array of objects. Parameter names are basically the same, so you don’t need to relearn concepts when switching languages.
Streaming Responses
Node.js streaming is naturally an async generator:
```javascript
import ollama from 'ollama'

const stream = await ollama.chat({
  model: 'gemma3',
  messages: [{ role: 'user', content: 'Why is the sky blue?' }],
  stream: true,
})

for await (const chunk of stream) {
  process.stdout.write(chunk.message.content)
}
```
Here we use process.stdout.write instead of console.log because console.log automatically adds a newline—you don’t want a newline after every character, right?
Custom Configuration
The SDK supports custom hosts and headers:
```javascript
import { Ollama } from 'ollama'

// Custom host
const client = new Ollama({ host: 'http://192.168.1.100:11434' })

// Add headers (for authentication behind a proxy)
const authed = new Ollama({
  host: 'http://192.168.1.100:11434',
  headers: { Authorization: 'Bearer xxx' },
})

const response = await authed.chat({
  model: 'gemma3',
  messages: [{ role: 'user', content: 'Hello' }],
})
```
The headers option is useful. If your Ollama service sits behind an authentication proxy, you can pass the token when constructing the client.
Canceling Streaming Generation
There’s an abort() method to cancel ongoing streaming:
```javascript
import ollama from 'ollama'

const stream = await ollama.chat({
  model: 'gemma3',
  messages: [{ role: 'user', content: 'Write a long essay...' }],
  stream: true,
})

// Simulate the user clicking a stop button after one second;
// in a real UI this would live in an event handler
setTimeout(() => ollama.abort(), 1000)

for await (const chunk of stream) {
  // The loop ends early once abort() fires
  process.stdout.write(chunk.message.content)
}
```
This feature is essential for chat interfaces. If users don’t want to wait for the model to finish rambling, they can click a button to stop it.
Browser Version
Browser usage is similar, with some differences:
```javascript
import ollama from 'ollama/browser'

// The browser build only supports streaming requests
const stream = await ollama.chat({
  model: 'gemma3',
  messages: [{ role: 'user', content: 'Hello' }],
  stream: true,
})

for await (const chunk of stream) {
  document.getElementById('output').textContent += chunk.message.content
}
```
The browser version has a limitation: you must use streaming mode. Because Ollama API’s non-streaming requests return large JSON all at once, cross-origin requests easily timeout or get blocked. Streaming requests come in chunks, which causes fewer problems.
This design makes sense. When building a chat UI in the browser, you want streaming display anyway.
Chapter 4: Tool Calling Practice
Tool calling is the foundation for building Agents. Ollama supports letting the model call functions you define, then continuing to generate answers based on the function results.
Python SDK has a handy feature: you can pass Python functions directly as tools, and the SDK automatically parses the function’s docstring and parameter types.
Python Function Auto-Parsing
```python
from ollama import chat

def get_weather(city: str) -> str:
    """Get weather information for a specified city

    Args:
        city: City name, like "Beijing" or "Shanghai"

    Returns:
        Weather description string
    """
    # Simulated data
    weather_data = {
        'Beijing': 'Sunny, 18°C',
        'Shanghai': 'Cloudy, 22°C',
        'Guangzhou': 'Rainy, 26°C',
    }
    return weather_data.get(city, f'Weather data for {city} not found')

response = chat(
    model='qwen3',
    messages=[{'role': 'user', 'content': "What's the weather in Beijing today?"}],
    tools=[get_weather],
)
print(response.message.content)
```
The SDK automatically converts the function to a tool definition format: name comes from the function name, description from the docstring, parameters from type annotations. Saves you the trouble of writing JSON Schema by hand.
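To get a feel for what the SDK is doing for you, here is a rough sketch of that conversion. This is not the SDK's actual implementation; the helper name and the simplified type mapping are mine, and the real parser handles far more cases:

```python
import inspect

def function_to_tool(func):
    """Build an OpenAI-style tool definition from a plain Python function.

    Simplified sketch: only a handful of types, first docstring line only,
    and every parameter treated as required.
    """
    type_names = {str: 'string', int: 'integer', float: 'number', bool: 'boolean'}
    sig = inspect.signature(func)
    properties = {
        name: {'type': type_names.get(param.annotation, 'string')}
        for name, param in sig.parameters.items()
    }
    return {
        'type': 'function',
        'function': {
            'name': func.__name__,
            'description': (func.__doc__ or '').strip().split('\n')[0],
            'parameters': {
                'type': 'object',
                'properties': properties,
                'required': list(properties),
            },
        },
    }

def get_weather(city: str) -> str:
    """Get weather information for a specified city"""
    return f'{city}: Sunny'

tool = function_to_tool(get_weather)
print(tool['function']['name'])                      # get_weather
print(tool['function']['parameters']['properties'])  # {'city': {'type': 'string'}}
```

Hand-writing this JSON Schema for every tool is exactly the boilerplate the SDK's auto-parsing spares you.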
Agent Loop Pattern
But it’s not that simple. The model might call multiple tools, or want to continue calling tools after getting results. You need a loop to handle this.
That’s the Agent Loop:
```python
from ollama import chat

def add(a: int, b: int) -> int:
    """Addition operation"""
    return a + b

def multiply(a: int, b: int) -> int:
    """Multiplication operation"""
    return a * b

tools = [add, multiply]
tool_map = {'add': add, 'multiply': multiply}

messages = [{'role': 'user', 'content': 'Calculate (3 + 5) * 2'}]

while True:
    response = chat(model='qwen3', messages=messages, tools=tools)
    if response.message.tool_calls:
        # Keep the assistant's tool-call message in the history,
        # otherwise the model loses track of what it asked for
        messages.append(response.message)
        for call in response.message.tool_calls:
            func_name = call.function.name
            func_args = call.function.arguments
            result = tool_map[func_name](**func_args)
            # Add the tool result to the message history
            messages.append({
                'role': 'tool',
                'content': str(result),
                'tool_name': func_name,
            })
    else:
        # No more tool calls means we're done
        print(response.message.content)
        break
```
Here’s the logic:
- Send the message to the model along with the tool definitions
- If the model returns `tool_calls`, execute the corresponding functions
- Put the function results back into the message history and send it to the model again
- Loop until the model stops calling tools
This pattern is essential when building Agents. You define a bunch of tool functions, and the model decides when to call them, which ones to call, and in what order.
Thinking Mode
Some models—like qwen3—support thinking mode. The model “thinks” first, then gives an answer.
```python
from ollama import chat

stream = chat(
    model='qwen3',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True,
    think=True,
)

thinking = ''
content = ''
for chunk in stream:
    if chunk.message.thinking:
        thinking += chunk.message.thinking
    elif chunk.message.content:
        content += chunk.message.content

print('=== Thinking Process ===')
print(thinking)
print('=== Final Answer ===')
print(content)
```
In thinking mode, chunks have an extra thinking field. You need to accumulate thinking content and final answer separately.
This feature is quite interesting. You can see how the model derives the answer step by step. Very useful for educational applications or debugging prompts.
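If you want the split to be testable, you can factor it into a small helper. The chunks below are plain dicts standing in for the SDK's chunk objects, so this runs without a live model:

```python
def split_thinking(chunks):
    """Separate reasoning text and answer text from a stream of chunks."""
    thinking, content = '', ''
    for chunk in chunks:
        msg = chunk.get('message', {})
        thinking += msg.get('thinking') or ''
        content += msg.get('content') or ''
    return thinking, content

# Fake chunks shaped like thinking-mode output
fake_chunks = [
    {'message': {'thinking': 'Rayleigh scattering...', 'content': ''}},
    {'message': {'thinking': '', 'content': 'The sky is blue because '}},
    {'message': {'content': 'short wavelengths scatter more.'}},
]
reasoning, answer = split_thinking(fake_chunks)
print(reasoning)  # Rayleigh scattering...
print(answer)     # The sky is blue because short wavelengths scatter more.
```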
Chapter 5: Native SDK vs OpenAI Compatibility API
Now you have two choices:
- Use Ollama native SDK (what we covered earlier)
- Use OpenAI SDK, just change the address to connect to Ollama
Which is better? Depends on your situation.
OpenAI Compatibility Solution
If you have existing OpenAI projects, the lowest migration cost is to change the base_url:
```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # Required but ignored
)

response = client.chat.completions.create(
    model='gemma3',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
)
print(response.choices[0].message.content)
```
That simple. The OpenAI SDK has no idea it’s talking to Ollama behind the scenes—it just knows it’s talking to an “OpenAI API”.
Node.js version works the same way:
```javascript
import OpenAI from 'openai'

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
})

const completion = await client.chat.completions.create({
  model: 'gemma3',
  messages: [{ role: 'user', content: 'Why is the sky blue?' }],
})
console.log(completion.choices[0].message.content)
Comparison of Both Approaches
| Aspect | Native SDK | OpenAI Compatible |
|---|---|---|
| Installation | `pip install ollama` / `npm i ollama` | Use existing OpenAI SDK |
| Tool Calling | Auto-parses function docstrings | Hand-write JSON Schema |
| Streaming | SDK-native chunk format | Standard OpenAI format |
| Cloud Models | Supported | Not supported |
| Migration Cost | None for new projects | Minimal for existing projects |
Selection Recommendations
New Projects: Use native SDK.
Reasons:
- Tool calling is more convenient, just pass Python functions as tools
- Supports more features (Cloud Models, thinking mode)
- Documentation and examples are official, easier to troubleshoot
Migrating Existing OpenAI Projects: Use OpenAI compatibility.
Reasons:
- Two lines of code and you’re running
- No need to rewrite any logic
- Easy to switch back to OpenAI later
In short: native SDK has more features, OpenAI compatibility has faster migration. Choose based on your needs.
Honestly, I’ve tried both. The native SDK’s tool calling really saves effort—no need to write JSON Schema definitions yourself, just write clear function docstrings. But if your project is already running on OpenAI, there’s no need to completely refactor just to use Ollama.
Wrapping Up
After all this, let’s summarize the key points:
Basic Calling: Both Python and Node.js SDKs are well-designed, running in just a few lines. Remember to use stream=True for streaming responses.
Tool Calling: Agent Loop is the core pattern—loop through tool_calls until the model stops calling. Python SDK lets you pass functions directly as tools, saving the trouble of writing JSON Schema.
Thinking Mode: Supported by qwen3 and other models, lets you see the model’s reasoning process. Need to handle thinking and content fields separately in chunks.
Solution Selection: Use native SDK for new projects, more features; use compatibility for existing OpenAI projects, just change the address.
Next steps:
- If you haven’t set up Ollama yet, check out the first article in the series to get your local model running
- Pick the approach that fits your project (native or OpenAI compatible), and try it out
- Official docs keep updating with new features, worth browsing when you have time
Running LLMs locally is getting easier. Ollama hides the complexity behind simple APIs. You just need to know how to call it, and let it handle the rest.
That wraps up the second article in our series. The next one will cover Modelfile customization—how to tune models to behave exactly the way you want.
Ollama API Client Development
Complete guide to using Python or Node.js SDK to call Ollama local model API
⏱️ Estimated time: 45 min
Step 1: Install SDK and test basic calling
Python users run `pip install ollama`, Node.js users run `npm i ollama`.
After installation, test the connection with the simplest code:
```python
from ollama import chat
response = chat(model='gemma3', messages=[{'role': 'user', 'content': 'Hello'}])
print(response.message.content)
```
Make sure the Ollama service is running (default port 11434) and you've downloaded the corresponding model.
Step 2: Implement streaming response
Enable streaming mode to let users see character-by-character output:
```python
from ollama import chat
stream = chat(model='gemma3', messages=[...], stream=True)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
```
Access chunk content with `chunk['message']['content']`.
Step 3: Configure tool calling (optional)
Define Python functions as tools, SDK auto-parses docstrings and type annotations:
```python
def get_weather(city: str) -> str:
    """Get city weather information"""
    return f'{city}: Sunny'

response = chat(model='qwen3', messages=[...], tools=[get_weather])
```
Implement an Agent Loop to handle multiple tool calls until the model returns the final answer.
Step 4: Choose native or OpenAI compatible approach
For new projects, native SDK is recommended with more features (Cloud Models, thinking mode).
For existing OpenAI projects, just change two lines:
```python
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
```
Migration cost is minimal, can switch back to OpenAI anytime.
11 min read · Published on: Apr 18, 2026 · Modified on: Apr 18, 2026