Ollama API Practice: Python and Node.js Client Development Guide
It’s 1 AM. You type ollama run gemma3 in your terminal, and the screen spits out its first response. Feels good.
But then you start wondering: Can I integrate this into my own project? No API key needed, no payments, running locally. Sounds perfect.
Honestly, I thought the same thing at first. After digging through the docs, I discovered that the official SDKs for Python and JavaScript are straightforward—you can even use the OpenAI SDK with just two lines of code changed. Simpler than expected.
But simple doesn’t mean pitfall-free. How do you accumulate streaming responses? How do you write an Agent Loop for tool calling? How do you separate reasoning from the answer in thinking mode? These are all traps I’ve stepped into.
This article fills in those gaps. We’ll compare Python and Node.js side by side, covering both native SDK and OpenAI compatibility approaches, giving you a complete client development guide.
By the way, if you haven’t installed Ollama yet, check out the first article in our series LangChain + Ollama Integration Guide to get your local model running first.
Chapter 1: Ollama API Basics
Let’s start with how the API works.
Ollama starts a REST API service locally by default, with endpoints under http://localhost:11434/api. Open http://localhost:11434 in your browser and you’ll see a terse “Ollama is running” message, which means the service is working.
Main Endpoints
You need to remember two core endpoints:
| Endpoint | Use Case | Features |
|---|---|---|
| `/api/chat` | Multi-turn conversations | Supports a `messages` array, can carry conversation context |
| `/api/generate` | Single-turn generation | Simple and direct, good for one-off tasks |
There’s also /v1/chat/completions, the OpenAI-compatible endpoint. If you have existing OpenAI projects, just change the base_url and you’re good to go—we’ll cover this in detail later.
Try It with curl
Let’s test the API the old-school way:
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "gemma3",
  "messages": [
    { "role": "user", "content": "Why is the sky blue?" }
  ]
}'
```
The terminal will spit out a bunch of JSON. The key field is message.content—that’s where the model’s response lives.
The response structure looks like this:
```json
{
  "model": "gemma3",
  "created_at": "2026-04-18T01:23:45.678Z",
  "message": {
    "role": "assistant",
    "content": "The sky appears blue mainly because..."
  },
  "done": true
}
```
The done field is important. In streaming responses, each chunk has done set to false, and only the last chunk has true. We’ll use this when handling streaming responses.
Streaming Responses
By default, the API waits for the model to finish generating before returning everything at once. But if you want users to see a “typewriter effect,” you need to add stream: true:
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "gemma3",
  "messages": [{ "role": "user", "content": "Why is the sky blue?" }],
  "stream": true
}'
```
This time the terminal will output JSON line by line. Each line is a small chunk, and you need to collect them to piece together the complete answer.
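To see what that collection involves, here is a minimal sketch of the accumulation logic in Python. The sample lines are hard-coded stand-ins for real streamed output, so nothing here requires a running server:

```python
import json

def accumulate_chunks(ndjson_lines):
    """Piece together the full answer from streamed NDJSON lines."""
    full_text = ''
    for line in ndjson_lines:
        if not line.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        full_text += chunk['message']['content']
        if chunk.get('done'):  # the final chunk signals completion
            break
    return full_text

# Sample lines shaped like Ollama's streaming output
sample = [
    '{"message": {"role": "assistant", "content": "The sky "}, "done": false}',
    '{"message": {"role": "assistant", "content": "is blue."}, "done": true}',
]
print(accumulate_chunks(sample))  # The sky is blue.
```

In a real client you would read these lines from the HTTP response body as they arrive, but the parsing and accumulation logic stays the same.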
Yeah, handling these chunks manually is a pain. That’s why the official SDKs exist—they encapsulate these details for you.
Chapter 2: Python SDK Complete Practice
Python’s SDK is officially maintained and super easy to install:
pip install ollama
Ready to use right away. Supports Python 3.8+, which is quite friendly.
Basic Calling
The simplest call, done in one line:
```python
from ollama import chat

response = chat(
    model='gemma3',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}]
)
print(response.message.content)
```
That simple. The chat() function is a shortcut provided by the SDK, which internally creates a default Client connecting to the local Ollama service.
If you want to customize connection parameters—for example, if Ollama is running on another machine—you can create your own Client:
```python
from ollama import Client

client = Client(host='http://192.168.1.100:11434')
response = client.chat(model='gemma3', messages=[...])
```
Streaming Responses
Streaming is the key feature. You want users to see text appearing gradually, not waiting forever and then suddenly getting everything at once.
```python
from ollama import chat

stream = chat(
    model='gemma3',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
```
Here’s a pitfall: in older versions of the SDK each chunk was a plain dictionary, so you had to write chunk['message']['content'], and attribute access like chunk.message.content would blow up. Newer releases return typed response objects that accept both styles. I stepped right into this trap on an older install and spent ages figuring out the error.
Async Client
If your application uses an async architecture—for instance, with FastAPI or aiohttp—you need the async client:
```python
import asyncio
from ollama import AsyncClient

async def main():
    client = AsyncClient()

    # Non-streaming
    response = await client.chat(
        model='gemma3',
        messages=[{'role': 'user', 'content': 'Hello'}]
    )
    print(response.message.content)

    # Streaming
    stream = await client.chat(
        model='gemma3',
        messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)

asyncio.run(main())
```
The async streaming response returns an async generator, which you iterate with async for. Same logic as the synchronous version, just with await and async added.
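The accumulate-as-you-iterate pattern is easy to exercise without a live model. Here is a self-contained sketch where a stubbed async generator stands in for the real stream:

```python
import asyncio

async def fake_stream():
    # Stand-in for AsyncClient().chat(..., stream=True)
    for piece in ['Why ', 'is the sky ', 'blue?']:
        yield {'message': {'content': piece}}

async def collect():
    # Accumulate streamed pieces exactly as you would with the real client
    text = ''
    async for chunk in fake_stream():
        text += chunk['message']['content']
    return text

print(asyncio.run(collect()))  # Why is the sky blue?
```

Swapping `fake_stream()` for the real `client.chat(..., stream=True)` call is the only change needed in production code.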
Cloud Models
Interestingly, the Ollama SDK also supports cloud models. Some large models won’t run locally—like that 120B gpt-oss—but they’re available in the cloud.
```python
from ollama import chat

response = chat(
    model='gpt-oss:120b-cloud',
    messages=[{'role': 'user', 'content': 'Hello'}]
)
```
Models with the -cloud suffix will use the cloud API. Of course, you need an Ollama cloud account and API key—configuration is different from local. Check the official docs if you’re interested.
Honestly, this feature is quite practical. Small models run locally to save money, large models run in the cloud to save hardware. A hybrid approach works well.
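One way to wire up that hybrid split is a tiny router that picks a model name per request. Everything below (the model names, the length heuristic, the threshold) is an illustrative assumption, not an Ollama feature:

```python
def pick_model(prompt: str, *, cloud_threshold: int = 2000) -> str:
    """Route short prompts to a local model, long ones to a cloud model.

    Hypothetical routing policy: the names and the length heuristic
    are placeholders you would tune for your own workload.
    """
    if len(prompt) > cloud_threshold:
        return 'gpt-oss:120b-cloud'  # heavy lifting goes to the cloud
    return 'gemma3'                  # everyday tasks stay local

print(pick_model('Hello'))     # gemma3
print(pick_model('x' * 5000))  # gpt-oss:120b-cloud
```

In practice you might route on task type or expected context length rather than raw prompt length, but the shape of the router stays the same.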
Chapter 3: Node.js SDK Complete Practice
Node.js SDK is just as clean:
npm i ollama
Note that this package supports both Node.js and browser environments. The browser version needs a separate import:
```javascript
// Node.js
import ollama from 'ollama'

// Browser
import ollama from 'ollama/browser'
```
Basic Calling
Node.js is async by default, so it feels more natural:
```javascript
import ollama from 'ollama'

const response = await ollama.chat({
  model: 'gemma3',
  messages: [{ role: 'user', content: 'Why is the sky blue?' }],
})
console.log(response.message.content)
```
Compare this to the Python version: Python passes messages as a list of dictionaries, Node.js as an array of objects. Parameter names are basically the same, so you don’t need to relearn concepts when switching languages.
Streaming Responses
Node.js streaming is naturally an async generator:
```javascript
import ollama from 'ollama'

const stream = await ollama.chat({
  model: 'gemma3',
  messages: [{ role: 'user', content: 'Why is the sky blue?' }],
  stream: true,
})

for await (const chunk of stream) {
  process.stdout.write(chunk.message.content)
}
```
Here we use process.stdout.write instead of console.log because console.log automatically adds a newline—you don’t want a newline after every character, right?
Custom Configuration
The SDK supports custom hosts and headers:
```javascript
import { Ollama } from 'ollama'

// Custom host
const client = new Ollama({ host: 'http://192.168.1.100:11434' })

// Add headers (for authentication behind a proxy)
const authed = new Ollama({
  host: 'http://192.168.1.100:11434',
  headers: { Authorization: 'Bearer xxx' },
})

const response = await authed.chat({
  model: 'gemma3',
  messages: [{ role: 'user', content: 'Hello' }],
})
```
The headers option is useful. If your Ollama service sits behind an authentication proxy, you can pass the token when constructing the client.
Canceling Streaming Generation
There’s an abort() method to cancel ongoing streaming:
```javascript
import ollama from 'ollama'

const stream = await ollama.chat({
  model: 'gemma3',
  messages: [{ role: 'user', content: 'Write a long essay...' }],
  stream: true,
})

// Simulate the user clicking a stop button after one second;
// in a real UI this would live in an event handler
setTimeout(() => ollama.abort(), 1000)

for await (const chunk of stream) {
  // The loop ends early once abort() fires
  process.stdout.write(chunk.message.content)
}
```
This feature is essential for chat interfaces. If users don’t want to wait for the model to finish rambling, they can click a button to stop it.
Browser Version
Browser usage is similar, with some differences:
```javascript
import ollama from 'ollama/browser'

// The browser build only supports streaming requests
const stream = await ollama.chat({
  model: 'gemma3',
  messages: [{ role: 'user', content: 'Hello' }],
  stream: true,
})

for await (const chunk of stream) {
  document.getElementById('output').textContent += chunk.message.content
}
```
The browser version has a limitation: you must use streaming mode. Because Ollama API’s non-streaming requests return large JSON all at once, cross-origin requests easily timeout or get blocked. Streaming requests come in chunks, which causes fewer problems.
This design makes sense. When building a chat UI in the browser, you want streaming display anyway.
Chapter 4: Tool Calling Practice
Tool calling is the foundation for building Agents. Ollama supports letting the model call functions you define, then continuing to generate answers based on the function results.
Python SDK has a handy feature: you can pass Python functions directly as tools, and the SDK automatically parses the function’s docstring and parameter types.
Python Function Auto-Parsing
```python
from ollama import chat

def get_weather(city: str) -> str:
    """Get weather information for a specified city

    Args:
        city: City name, like "Beijing" or "Shanghai"

    Returns:
        Weather description string
    """
    # Simulated data
    weather_data = {
        'Beijing': 'Sunny, 18°C',
        'Shanghai': 'Cloudy, 22°C',
        'Guangzhou': 'Rainy, 26°C',
    }
    return weather_data.get(city, f'Weather data for {city} not found')

response = chat(
    model='qwen3',
    messages=[{'role': 'user', 'content': "What's the weather in Beijing today?"}],
    tools=[get_weather],
)
print(response.message.content)
```
The SDK automatically converts the function to a tool definition format: name comes from the function name, description from the docstring, parameters from type annotations. Saves you the trouble of writing JSON Schema by hand.
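To get a feel for what the SDK is doing for you, here is a rough sketch of that conversion. This is not the SDK's actual implementation; the helper name and the simplified type mapping are mine, and the real parser handles far more cases:

```python
import inspect

def function_to_tool(func):
    """Build an OpenAI-style tool definition from a plain Python function.

    Simplified sketch: only a handful of types, first docstring line only,
    and every parameter treated as required.
    """
    type_names = {str: 'string', int: 'integer', float: 'number', bool: 'boolean'}
    sig = inspect.signature(func)
    properties = {
        name: {'type': type_names.get(param.annotation, 'string')}
        for name, param in sig.parameters.items()
    }
    return {
        'type': 'function',
        'function': {
            'name': func.__name__,
            'description': (func.__doc__ or '').strip().split('\n')[0],
            'parameters': {
                'type': 'object',
                'properties': properties,
                'required': list(properties),
            },
        },
    }

def get_weather(city: str) -> str:
    """Get weather information for a specified city"""
    return f'{city}: Sunny'

tool = function_to_tool(get_weather)
print(tool['function']['name'])                      # get_weather
print(tool['function']['parameters']['properties'])  # {'city': {'type': 'string'}}
```

Hand-writing this JSON Schema for every tool is exactly the boilerplate the SDK's auto-parsing spares you.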
Agent Loop Pattern
But it’s not that simple. The model might call multiple tools, or want to continue calling tools after getting results. You need a loop to handle this.
That’s the Agent Loop:
```python
from ollama import chat

def add(a: int, b: int) -> int:
    """Addition operation"""
    return a + b

def multiply(a: int, b: int) -> int:
    """Multiplication operation"""
    return a * b

tools = [add, multiply]
tool_map = {'add': add, 'multiply': multiply}

messages = [{'role': 'user', 'content': 'Calculate (3 + 5) * 2'}]

while True:
    response = chat(model='qwen3', messages=messages, tools=tools)
    if response.message.tool_calls:
        # Keep the assistant's tool-call message in the history,
        # otherwise the model loses track of what it asked for
        messages.append(response.message)
        for call in response.message.tool_calls:
            func_name = call.function.name
            func_args = call.function.arguments
            result = tool_map[func_name](**func_args)
            # Add the tool result to the message history
            messages.append({
                'role': 'tool',
                'content': str(result),
                'tool_name': func_name,
            })
    else:
        # No more tool calls means we're done
        print(response.message.content)
        break
```
Here’s the logic:
- Send the message to the model along with the tool definitions
- If the model returns `tool_calls`, execute the corresponding functions
- Put the function results back into the message history and send it to the model again
- Loop until the model stops calling tools
This pattern is essential when building Agents. You define a bunch of tool functions, and the model decides when to call them, which ones to call, and in what order.
Thinking Mode
Some models—like qwen3—support thinking mode. The model “thinks” first, then gives an answer.
```python
from ollama import chat

stream = chat(
    model='qwen3',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True,
    think=True,
)

thinking = ''
content = ''
for chunk in stream:
    if chunk.message.thinking:
        thinking += chunk.message.thinking
    elif chunk.message.content:
        content += chunk.message.content

print('=== Thinking Process ===')
print(thinking)
print('=== Final Answer ===')
print(content)
```
In thinking mode, chunks have an extra thinking field. You need to accumulate thinking content and final answer separately.
This feature is quite interesting. You can see how the model derives the answer step by step. Very useful for educational applications or debugging prompts.
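If you want the split to be testable, you can factor it into a small helper. The chunks below are plain dicts standing in for the SDK's chunk objects, so this runs without a live model:

```python
def split_thinking(chunks):
    """Separate reasoning text and answer text from a stream of chunks."""
    thinking, content = '', ''
    for chunk in chunks:
        msg = chunk.get('message', {})
        thinking += msg.get('thinking') or ''
        content += msg.get('content') or ''
    return thinking, content

# Fake chunks shaped like thinking-mode output
fake_chunks = [
    {'message': {'thinking': 'Rayleigh scattering...', 'content': ''}},
    {'message': {'thinking': '', 'content': 'The sky is blue because '}},
    {'message': {'content': 'short wavelengths scatter more.'}},
]
reasoning, answer = split_thinking(fake_chunks)
print(reasoning)  # Rayleigh scattering...
print(answer)     # The sky is blue because short wavelengths scatter more.
```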
Chapter 5: Native SDK vs OpenAI Compatibility API
Now you have two choices:
- Use Ollama native SDK (what we covered earlier)
- Use OpenAI SDK, just change the address to connect to Ollama
Which is better? Depends on your situation.
OpenAI Compatibility Solution
If you have existing OpenAI projects, the lowest migration cost is to change the base_url:
```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # Required but ignored
)

response = client.chat.completions.create(
    model='gemma3',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
)
print(response.choices[0].message.content)
```
That simple. The OpenAI SDK has no idea it’s talking to Ollama behind the scenes—it just knows it’s talking to an “OpenAI API”.
Node.js version works the same way:
```javascript
import OpenAI from 'openai'

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
})

const completion = await client.chat.completions.create({
  model: 'gemma3',
  messages: [{ role: 'user', content: 'Why is the sky blue?' }],
})
console.log(completion.choices[0].message.content)
Comparison of Both Approaches
| Aspect | Native SDK | OpenAI Compatible |
|---|---|---|
| Installation | `pip install ollama` / `npm i ollama` | Use existing OpenAI SDK |
| Tool Calling | Auto-parses function docstrings | Hand-write JSON Schema |
| Streaming | SDK-native chunk format | Standard OpenAI format |
| Cloud Models | Supported | Not supported |
| Migration Cost | None for new projects | Minimal for existing projects |
Selection Recommendations
New Projects: Use native SDK.
Reasons:
- Tool calling is more convenient, just pass Python functions as tools
- Supports more features (Cloud Models, thinking mode)
- Documentation and examples are official, easier to troubleshoot
Migrating Existing OpenAI Projects: Use OpenAI compatibility.
Reasons:
- Two lines of code and you’re running
- No need to rewrite any logic
- Easy to switch back to OpenAI later
In short: native SDK has more features, OpenAI compatibility has faster migration. Choose based on your needs.
Honestly, I’ve tried both. The native SDK’s tool calling really saves effort—no need to write JSON Schema definitions yourself, just write clear function docstrings. But if your project is already running on OpenAI, there’s no need to completely refactor just to use Ollama.
Wrapping Up
After all this, let’s summarize the key points:
Basic Calling: Both Python and Node.js SDKs are well-designed, running in just a few lines. Remember to use stream=True for streaming responses.
Tool Calling: Agent Loop is the core pattern—loop through tool_calls until the model stops calling. Python SDK lets you pass functions directly as tools, saving the trouble of writing JSON Schema.
Thinking Mode: Supported by qwen3 and other models, lets you see the model’s reasoning process. Need to handle thinking and content fields separately in chunks.
Solution Selection: Use native SDK for new projects, more features; use compatibility for existing OpenAI projects, just change the address.
Next steps:
- If you haven’t set up Ollama yet, check out the first article in the series to get your local model running
- Pick the approach that fits your project (native or OpenAI compatible), and try it out
- Official docs keep updating with new features, worth browsing when you have time
Running LLMs locally is getting easier. Ollama hides the complexity behind simple APIs. You just need to know how to call it, and let it handle the rest.
That wraps up the second article in our series. The next one will cover Modelfile customization—how to tune models to behave exactly the way you want.
Ollama API Client Development
Complete guide to using Python or Node.js SDK to call Ollama local model API
⏱️ Estimated time: 45 min
Step 1: Install SDK and test basic calling
Python users run `pip install ollama`, Node.js users run `npm i ollama`.
After installation, test the connection with the simplest code:
```python
from ollama import chat
response = chat(model='gemma3', messages=[{'role': 'user', 'content': 'Hello'}])
print(response.message.content)
```
Make sure the Ollama service is running (default port 11434) and you've downloaded the corresponding model.
Step 2: Implement streaming response
Enable streaming mode to let users see character-by-character output:
```python
from ollama import chat
stream = chat(model='gemma3', messages=[...], stream=True)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
```
Access chunk content with `chunk['message']['content']`.
Step 3: Configure tool calling (optional)
Define Python functions as tools, SDK auto-parses docstrings and type annotations:
```python
def get_weather(city: str) -> str:
    """Get city weather information"""
    return f'{city}: Sunny'

response = chat(model='qwen3', messages=[...], tools=[get_weather])
```
Implement an Agent Loop to handle multiple tool calls until the model returns the final answer.
Step 4: Choose native or OpenAI compatible approach
For new projects, native SDK is recommended with more features (Cloud Models, thinking mode).
For existing OpenAI projects, just change two lines:
```python
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
```
Migration cost is minimal, can switch back to OpenAI anytime.
11 min read · Published on: Apr 18, 2026 · Modified on: Apr 18, 2026