Ollama API Calls: From curl to OpenAI SDK Compatible Interface
3 AM. I stared at that curl command in my terminal, replaying the pitfalls I'd just stumbled into. I had followed the official docs exactly, so why was the JSON only returning half a word?
Turned out it was streaming responses. Ollama streams content back by default, each JSON object carrying just a few characters. That one ate two hours of my time.
Honestly, for local LLM deployment, Ollama really lowered the barrier. You don’t need to mess with Docker configs or worry about GPU drivers—download, install, run, three steps done. But calling the API? That’s where lots of people (including me) get confused at first: how many ways to call it? What’s the difference between native REST API and OpenAI SDK compatible interface? How to handle streaming responses?
This article is just me organizing the pitfalls I encountered, hoping to help you avoid some detours. We’ll chat about two calling methods, from basic curl commands to zero-code migration with OpenAI SDK, plus those little details the docs don’t explicitly mention.
Ollama API Has Two Interfaces
Man, this really confused me for a while. Ollama actually provides two completely different API interfaces:
Native REST API: `http://localhost:11434/api/*`
- Endpoints: `/api/generate` (text generation), `/api/chat` (conversation), `/api/tags` (model list)
- Streams responses by default (that's what I hit at 3 AM)
- Direct HTTP calls, no SDK needed
OpenAI Compatible Interface: `http://localhost:11434/v1/*`
- Endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/models`
- Fully compatible with the OpenAI SDK (Python and JavaScript work directly)
- Supports the existing OpenAI tool ecosystem
You might wonder: why two sets? Each has its use. The native API is lighter and more direct, and fits when you write your own HTTP client; the OpenAI compatible interface lets you run existing OpenAI SDK code as-is, changing nothing but the `base_url`.
Honestly, this design is pretty smart. It accommodates developers who want simple calls, and teams who already have OpenAI ecosystem code.
Native REST API: Starting with curl
Let’s talk native API first. This part is actually pretty straightforward, just a standard REST interface.
Basic curl Calls
Simplest example—text generation:
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```
Yeah, notice that `stream: false`. By default Ollama streams content back; if you want a complete JSON response, you need to disable streaming explicitly. Otherwise you'll see output like this:
```
{"model":"llama3.2","response":"That","done":false}
{"model":"llama3.2","response":"'","done":false}
{"model":"llama3.2","response":"s","done":false}
{"model":"llama3.2","response":" a","done":false}
...
{"model":"llama3.2","response":"!","done":true}
```
Each JSON object contains only a few characters, emitted token by token. This is the so-called NDJSON (Newline-Delimited JSON) format: one JSON object per line. At 3 AM I simply didn't notice, parsed it as regular JSON, and got only the "That" from the first object.
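If it helps, here's what accumulating that NDJSON stream looks like in plain Python. The sample lines are modeled on the output above; the exact text fragments are made up for the demo:

```python
import json

# Sample NDJSON stream, Ollama-style: one JSON object per line,
# each carrying a small "response" fragment. (Fragments invented for the demo.)
ndjson = (
    '{"model":"llama3.2","response":"That","done":false}\n'
    '{"model":"llama3.2","response":"\'s","done":false}\n'
    '{"model":"llama3.2","response":" a","done":false}\n'
    '{"model":"llama3.2","response":" good question!","done":true}\n'
)

# Parse line by line and accumulate the fragments into the full reply.
reply = ""
for line in ndjson.splitlines():
    if line.strip():
        chunk = json.loads(line)
        reply += chunk.get("response", "")

print(reply)  # That's a good question!
```

Treat each line as its own JSON document and the parsing problem disappears.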
Chat Mode is More Practical
Single generation fits simple tasks, but chat mode is what’s really commonly used:
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Hello!" }
  ],
  "stream": false
}'
```
You maintain a messages array and pass in the full conversation history, so the model can keep the context in mind. This matters especially when building chat applications.
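A minimal sketch of that bookkeeping, assuming a hypothetical `add_turn` helper and a hard-coded assistant reply (neither is part of Ollama's API; the real reply would come from the server):

```python
import json

# Hypothetical helper: append one message to the history and return it.
def add_turn(history, role, content):
    history.append({"role": role, "content": content})
    return history

history = []
add_turn(history, "user", "Hello!")
# In a real app you'd POST the history to /api/chat here, then record the reply:
add_turn(history, "assistant", "Hi! How can I help?")  # hard-coded stand-in
add_turn(history, "user", "What did I just say?")

# This is the payload /api/chat would receive: the model sees the full history.
payload = json.dumps({"model": "llama3.2", "messages": history, "stream": False})
print(len(history))  # 3
```

Every request resends the whole array; the server holds no conversation state between calls.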
Viewing Installed Models
Sometimes you want to see which models are locally available:
```bash
curl http://localhost:11434/api/tags
```
The returned JSON lists every downloaded model, including its size, modification time, and quantization level. Pretty handy.
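For a quick overview you can pick the interesting fields out of that JSON. The response below is a trimmed, made-up example in the style `/api/tags` returns (sizes invented):

```python
# Trimmed example of an /api/tags response; sizes here are made up.
tags = {
    "models": [
        {"name": "llama3.2:latest", "size": 2019393189, "modified_at": "2025-01-01T00:00:00Z"},
        {"name": "qwen2.5:7b", "size": 4683087332, "modified_at": "2025-01-02T00:00:00Z"},
    ]
}

# Print each model's name with its size converted to GB.
for m in tags["models"]:
    print(f'{m["name"]}: {m["size"] / 1e9:.1f} GB')

names = [m["name"] for m in tags["models"]]
```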
Handling Streaming Responses
This part deserves its own discussion. Ollama's streaming responses don't return the complete content at once; they emit it token by token.
Python Handling Streaming
Using Python’s requests library to handle streaming responses:
```python
import requests
import json

url = "http://localhost:11434/api/chat"
payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Write a short poem"}],
    "stream": True
}

response = requests.post(url, json=payload, stream=True)
for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk.get("message", {}).get("content", ""), end="", flush=True)
```
The key is `response.iter_lines()`, which lets you read the NDJSON stream line by line. Each chunk may carry only a few characters; accumulate them to get the complete reply.
JavaScript Streaming Handling
Frontend using fetch API is similar:
```javascript
const response = await fetch('http://localhost:11434/api/chat', {
  method: 'POST',
  body: JSON.stringify({
    model: 'llama3.2',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let reply = '';
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  // A network chunk may hold several NDJSON lines, or end mid-line:
  // split on newlines and keep the trailing partial line in the buffer.
  const lines = buffer.split('\n');
  buffer = lines.pop();
  for (const line of lines) {
    if (line.trim()) reply += JSON.parse(line).message?.content ?? '';
  }
}
```
Honestly, streaming handling is more work than non-streaming, but the user experience is way better: you watch the model "think" and produce output in real time, instead of waiting forever for one big block of text.
OpenAI SDK Compatible Interface: Zero-Code Migration
This part is my favorite. Ollama provides complete OpenAI API compatible interface, you can almost seamlessly migrate existing code.
Python OpenAI SDK Example
Directly use OpenAI’s official SDK:
```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama'  # not validated locally; any value works
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
See? The only changes are setting `base_url` and filling in a placeholder `api_key`. The rest of the code stays exactly as it was.
Development and Production Environment Switching
This feature is super practical. You can use local Ollama in dev environment, OpenAI in production:
```bash
# Development .env
OPENAI_API_KEY=anyrandomtext
LLM_ENDPOINT="http://localhost:11434/v1"
MODEL=llama3.2

# Production .env
OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXXXXXX
LLM_ENDPOINT="https://api.openai.com/v1"
MODEL=gpt-3.5-turbo
```
Code just needs to read from environment variables:
```python
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv('LLM_ENDPOINT'),
    api_key=os.getenv('OPENAI_API_KEY')
)
```
This way you don't spend money on the OpenAI API during development; local testing is enough. At launch, switching to the real OpenAI is just a change of environment variables.
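The switch can be wrapped in a tiny helper. This sketch only reads the environment; the variable names match the `.env` files above, while the local-Ollama fallback defaults are my own assumption:

```python
import os

# Resolve LLM settings from environment variables, falling back to
# local Ollama defaults (defaults are an assumption, not an Ollama convention).
def llm_config():
    return {
        "base_url": os.getenv("LLM_ENDPOINT", "http://localhost:11434/v1"),
        "api_key": os.getenv("OPENAI_API_KEY", "ollama"),
        "model": os.getenv("MODEL", "llama3.2"),
    }

cfg = llm_config()
# client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
print(cfg["base_url"])
```

With the defaults in place, an empty environment still points at local Ollama, so a fresh clone runs without any setup.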
Supported Endpoints
Ollama’s OpenAI compatible interface supports these endpoints:
| Endpoint | Function | Support Level |
|---|---|---|
| `/v1/chat/completions` | Chat generation | Fully supported |
| `/v1/completions` | Text completion | Fully supported |
| `/v1/models` | Model list | Fully supported |
| `/v1/embeddings` | Text embeddings | Fully supported |
| `/v1/responses` | New response API | Fully supported |
There's also an experimental `/v1/images/generations` endpoint, but it isn't stable yet.
Model Aliases
There's a small trick: you can give models aliases. Say you want your code to look like it's calling GPT-3.5:
```bash
ollama cp llama3.2 gpt-3.5-turbo
```
Now your code asks for `model="gpt-3.5-turbo"` while actually running the local llama3.2. That comes in handy when migrating code.
How to Choose Between Two Methods?
After all that, you might be wondering: which one should I use?
Native REST API Scenarios
Fits these situations:
- You want lightest calling method
- Don’t need OpenAI SDK ecosystem
- Write your own HTTP client (like embedded devices, special environments)
- Need precise control over streaming response details
Native API is more direct, more low-level. If you’re familiar with HTTP protocol, using it will be smooth.
OpenAI SDK Compatible Interface Scenarios
Fits these situations:
- You already have OpenAI SDK based code
- Need quick migration to local deployment
- Using the OpenAI toolchain (LangChain, LlamaIndex, and the like)
- Need to switch between dev/production environments
Basically, if you’re a “pragmatist”, don’t want to change code, then use OpenAI compatible interface.
My Suggestion
Honestly, I prefer the OpenAI SDK compatible interface during development: fewer code changes, the toolchain just works, debugging is convenient. But in some scenarios the native API really is the better fit, like writing a minimal CLI tool or calling from an environment where the OpenAI SDK isn't available.
Practical Code Snippets
Finally sharing a few code snippets I commonly use.
Python Streaming Chat (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama'
)

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
JavaScript Complete Conversation (Native API)
```javascript
async function chat(messages) {
  const response = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3.2',
      messages: messages,
      stream: false
    })
  });
  return await response.json();
}

// Maintain conversation history
let conversation = [
  { role: 'user', content: 'Hello!' }
];

const result = await chat(conversation);
conversation.push({
  role: 'assistant',
  content: result.message.content
});
console.log(result.message.content);
```
Tool Calling Example
Ollama also supports Function Calling (tool calling):
```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1/', api_key='ollama')

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools
)

if response.choices[0].message.tool_calls:
    print("Model wants to call:", response.choices[0].message.tool_calls[0].function.name)
```
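The dispatch step after that check can be sketched offline with plain dicts. The `get_weather` stub and the dict shapes below are illustrative stand-ins for the SDK's `tool_calls` objects (which expose the same field names as attributes):

```python
import json

# Local function matching the tool schema above; the result is stubbed.
def get_weather(location):
    return {"location": location, "forecast": "sunny"}

TOOLS = {"get_weather": get_weather}

# Offline stand-in for response.choices[0].message.tool_calls[0],
# as a plain dict with the same field names.
tool_call = {
    "function": {
        "name": "get_weather",
        "arguments": '{"location": "Tokyo"}',  # arguments arrive as a JSON string
    }
}

fn = TOOLS[tool_call["function"]["name"]]
args = json.loads(tool_call["function"]["arguments"])
result = fn(**args)
print(result)  # {'location': 'Tokyo', 'forecast': 'sunny'}
```

Note that `arguments` is a JSON string, not a dict, so it needs a `json.loads` before you can unpack it into the function call.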
Having Said All This
Ollama's API design strikes a decent balance: you get the directness and lightness of a native REST API, plus the convenience and ecosystem compatibility of the OpenAI SDK. Each method has its scenarios; it comes down to your needs.
If you’re just starting with Ollama, I suggest first try OpenAI SDK compatible interface—quick to pick up, less code changes. After you get familiar, then decide whether to use native API based on specific needs.
Oh, one more thing: streaming response handling really is easy to trip over. Remember that streaming is the default; if you don't need it, set `stream: false` explicitly, or you'll end up like me at 3 AM, staring blankly at half a word.
Two Ways to Call Ollama API
Complete calling flow from the curl native API to the OpenAI SDK compatible interface (estimated time: 10 min).

Step 1: Confirm Ollama Is Installed and Running
First check that Ollama is running normally:
- In a terminal, run `ollama list` (view downloaded models)
- Or visit `http://localhost:11434` (should return "Ollama is running")
- Default port: 11434

Step 2: Choose a Calling Method
Choose based on your scenario:
- Native REST API: fits lightweight calls and custom clients
- OpenAI SDK compatible: fits existing OpenAI code and quick migration

Step 3: Use the Native REST API (curl)
The most basic curl calls:
- Text generation: `curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "...", "stream": false}'`
- Chat mode: `curl http://localhost:11434/api/chat -d '{"model": "llama3.2", "messages": [...], "stream": false}'`
- Note: responses stream by default; set `stream: false` to disable

Step 4: Use the OpenAI SDK Compatible Interface
Python OpenAI SDK calls:
- Set `base_url='http://localhost:11434/v1/'`
- `api_key` can be anything (not validated locally)
- All other code is identical to OpenAI
- Environment switch: just change `base_url` (local in dev, OpenAI in production)

Step 5: Handle Streaming Responses
Key points:
- Python: `response.iter_lines()` reads NDJSON line by line
- JavaScript: `response.body.getReader()` for stream reading
- Each chunk carries only a few characters; accumulate them for the complete reply
- Non-streaming: set `stream: false` to get complete JSON
FAQ

What's the difference between Ollama's native API and the OpenAI SDK compatible interface?
- The native API lives under `/api/*` and streams by default; the OpenAI compatible interface under `/v1/*` works directly with the existing OpenAI SDKs and toolchain.

Why did calling the Ollama API only return half a word?
- Responses stream by default as NDJSON. Set `stream: false` to get a complete JSON response,
- or handle the NDJSON stream properly: read line by line and accumulate the content.

How do I use local Ollama in development and OpenAI in production?
- Read `base_url`, `api_key`, and the model name from environment variables and change the values per environment.

What OpenAI endpoints does Ollama support?
- `/v1/chat/completions`, `/v1/completions`, `/v1/models`, `/v1/embeddings`, `/v1/responses`, plus an experimental `/v1/images/generations`.

Can I give Ollama models aliases?
- Yes: `ollama cp llama3.2 gpt-3.5-turbo` makes the local model answer to the new name.

Does Ollama support tool calling (Function Calling)?
- Yes, via the `tools` parameter on the chat completions endpoint.
8 min read · Published on: Apr 3, 2026 · Modified on: Apr 5, 2026