
Ollama API Calls: From curl to OpenAI SDK Compatible Interface

3 AM. I stared at the curl command in my terminal, replaying the pitfall I’d just hit: I had followed the official docs exactly, so why was the JSON coming back with only half a word?

Turned out it was streaming responses. Ollama streams by default, emitting content as a sequence of JSON objects, each carrying just a few characters. That one ate two hours of my time.

Honestly, Ollama has really lowered the barrier for local LLM deployment. You don’t need to fiddle with Docker configs or worry about GPU drivers: download, install, run, done. But the API is where a lot of people (me included) get confused at first: how many ways are there to call it? What’s the difference between the native REST API and the OpenAI SDK compatible interface? How do you handle streaming responses?

This article is just me writing up the pitfalls I ran into, in the hope of saving you some detours. We’ll walk through both calling methods, from basic curl commands to zero-code migration with the OpenAI SDK, plus the little details the docs don’t spell out.


Ollama API Has Two Interfaces

Man, this confused me for a while. Ollama actually exposes two completely different API interfaces:

Native REST API: http://localhost:11434/api/*

  • Endpoints: /api/generate (text generation), /api/chat (conversation), /api/tags (model list)
  • Default streaming responses (that’s what I hit at 3 AM)
  • Direct HTTP calls, no SDK needed

OpenAI Compatible Interface: http://localhost:11434/v1/*

  • Endpoints: /v1/chat/completions, /v1/completions, /v1/models
  • Fully compatible with OpenAI SDK (Python, JavaScript work directly)
  • Supports existing OpenAI tool ecosystem

You might wonder: why two sets? Each has its place. The native API is lighter and more direct, a good fit when you’re writing your own HTTP client; the OpenAI compatible interface lets you reuse existing OpenAI SDK code unchanged, you just point base_url at Ollama.

Honestly, this design is pretty smart: it serves developers who want simple calls as well as teams that already have code built on the OpenAI ecosystem.


Native REST API: Starting with curl

Let’s start with the native API. This part is actually pretty straightforward: it’s a standard REST interface.

Basic curl Calls

The simplest example, text generation:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Yeah, note that stream: false. By default Ollama streams content back; if you want a complete JSON response, you have to disable streaming explicitly. Otherwise you’ll see output like this:

{"model":"llama3.2","response":"That","done":false}
{"model":"llama3.2","response":"'","done":false}
{"model":"llama3.2","response":"s","done":false}
{"model":"llama3.2","response":" a","done":false}
...
{"model":"llama3.2","response":"!","done":true}

Each JSON object carries only a few characters, emitted token by token. This is the so-called NDJSON (Newline-Delimited JSON) format: one JSON object per line. That night at 3 AM, I simply hadn’t noticed this; I parsed the body as regular JSON and ended up with only the “That” from the first object.
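
To see the failure mode concretely, here’s a minimal pure-Python sketch (the sample stream content is hypothetical): parsing the whole body with one json.loads call fails, because it’s several JSON documents rather than one, while parsing line by line and joining the pieces recovers the text.

```python
import json

# A sample NDJSON body like the one above (hypothetical content).
ndjson = (
    '{"model":"llama3.2","response":"That","done":false}\n'
    '{"model":"llama3.2","response":"\'s","done":false}\n'
    '{"model":"llama3.2","response":" a","done":false}\n'
)

# json.loads(ndjson) raises "Extra data": several JSON documents, not one.
# Parse one object per line and accumulate instead:
text = "".join(json.loads(line)["response"] for line in ndjson.splitlines() if line)
print(text)  # That's a
```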

Chat Mode is More Practical

Single-shot generation is fine for simple tasks, but chat mode is what you’ll actually use most of the time:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Hello!" }
  ],
  "stream": false
}'

You maintain a messages array and pass in the whole conversation history, so the model can keep context. This matters a lot when building chat applications.
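
As a sketch, here’s one way to wrap that pattern in a tiny Python helper using requests. The function name chat and the URL default are my own choices, not anything Ollama defines:

```python
import requests

def chat(history, url="http://localhost:11434/api/chat", model="llama3.2"):
    """Send the full history so the model keeps context; return the assistant's reply."""
    resp = requests.post(url, json={"model": model, "messages": history, "stream": False})
    resp.raise_for_status()
    message = resp.json()["message"]
    history.append(message)  # remember the assistant turn for the next round
    return message["content"]

# Usage (with a local Ollama running):
# history = [{"role": "user", "content": "My name is Ada."}]
# chat(history)
# history.append({"role": "user", "content": "What's my name?"})
# print(chat(history))  # the model saw the first turn, so it can answer
```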

Viewing Installed Models

Sometimes you want to see which models are locally available:

curl http://localhost:11434/api/tags

The returned JSON lists every downloaded model, including its size, modification time, and quantization level. Pretty handy.
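
For scripting, you can parse that list with a few lines of Python. The sample below is a hypothetical, trimmed response; the field names match current Ollama releases but may vary across versions:

```python
import json

# Trimmed, hypothetical sample of a /api/tags response.
sample = '{"models": [{"name": "llama3.2:latest", "size": 2019393189, "modified_at": "2024-09-25T12:00:00Z"}]}'

data = json.loads(sample)
for m in data["models"]:
    print(f'{m["name"]}  {m["size"] / 1e9:.1f} GB')

# Live version: data = requests.get("http://localhost:11434/api/tags").json()
```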


Handling Streaming Responses

This part deserves its own discussion. Ollama’s streaming responses have one defining trait: they don’t return the complete content in one go, they emit it token by token.

Handling Streaming in Python

Using Python’s requests library to handle streaming responses:

import requests
import json

url = "http://localhost:11434/api/chat"
payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Write a short poem"}],
    "stream": True
}

response = requests.post(url, json=payload, stream=True)
for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk.get("message", {}).get("content", ""), end="", flush=True)

The key is response.iter_lines(), which reads the NDJSON stream line by line. Each chunk may carry only a few characters, so you need to accumulate them to get the complete reply.
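
If you also need the complete reply, say to append it to your conversation history, accumulate while streaming. A small helper (my own sketch, decoupled from requests so it works on any iterable of NDJSON lines):

```python
import json

def collect_stream(lines):
    """Accumulate /api/chat NDJSON chunks and return the full assistant reply."""
    parts = []
    for line in lines:
        if not line:
            continue  # skip blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk carries done: true
    return "".join(parts)

# With requests: full_reply = collect_stream(response.iter_lines())
```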

Handling Streaming in JavaScript

On the frontend, the fetch API works similarly:

const response = await fetch('http://localhost:11434/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.2',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // a chunk can end mid-line; keep the remainder
  for (const line of lines) {
    if (!line.trim()) continue;
    const { message } = JSON.parse(line);
    // Append message.content to the UI as it arrives
  }
}

Honestly, streaming is more work than non-streaming, but the user experience is far better: you watch the model “think” and produce output in real time, instead of staring at nothing and then getting a wall of text.


OpenAI SDK Compatible Interface: Zero-Code Migration

This part is my favorite. Ollama ships a complete OpenAI API compatible interface, so you can migrate existing code almost seamlessly.

Python OpenAI SDK Example

Directly use OpenAI’s official SDK:

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama'  # not validated locally; any value works
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)

See? The only changes are setting base_url and filling in a dummy api_key. Everything else stays exactly the same.

Development and Production Environment Switching

This feature is super practical: use local Ollama in development and OpenAI in production.

# Development .env
OPENAI_API_KEY=anyrandomtext
LLM_ENDPOINT="http://localhost:11434/v1"
MODEL=llama3.2

# Production .env
OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXXXXXX
LLM_ENDPOINT="https://api.openai.com/v1"
MODEL=gpt-3.5-turbo

Code just needs to read from environment variables:

import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv('LLM_ENDPOINT'),
    api_key=os.getenv('OPENAI_API_KEY')
)

That way you don’t burn money on the OpenAI API during development, local testing is enough; when you launch, you just flip the environment variables to point at the real OpenAI.

Supported Endpoints

Ollama’s OpenAI compatible interface supports these endpoints:

  • /v1/chat/completions: chat generation (fully supported)
  • /v1/completions: text completion (fully supported)
  • /v1/models: model list (fully supported)
  • /v1/embeddings: text embeddings (fully supported)
  • /v1/responses: the newer Responses API (fully supported)

There’s also an experimental /v1/images/generations endpoint, but it isn’t stable enough yet.
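
Since /v1/embeddings is on the list, here’s a sketch of calling it through the same SDK. embed is my own helper name, nomic-embed-text is just one embedding model you might have pulled, and the client is passed in so the function stays easy to test:

```python
def embed(texts, client, model="nomic-embed-text"):
    """Return one embedding vector per input text via /v1/embeddings."""
    resp = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in resp.data]

# Usage (assumes the model was pulled with `ollama pull nomic-embed-text`):
# from openai import OpenAI
# client = OpenAI(base_url='http://localhost:11434/v1/', api_key='ollama')
# vectors = embed(["hello", "world"], client)
```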

Model Aliases

One small trick: you can give models aliases. Say you want your code to look like it’s calling GPT-3.5:

ollama cp llama3.2 gpt-3.5-turbo

Now your code can say model="gpt-3.5-turbo" while actually running the local llama3.2. This trick comes in handy when migrating code.


How to Choose Between Two Methods?

After all that, you might be wondering: which one should I use?

Native REST API Scenarios

It fits these situations:

  • You want the lightest possible calling method
  • You don’t need the OpenAI SDK ecosystem
  • You’re writing your own HTTP client (embedded devices, unusual environments)
  • You need precise control over streaming-response details

The native API is more direct and lower-level. If you’re comfortable with the HTTP protocol, it will feel smooth.

OpenAI SDK Compatible Interface Scenarios

It fits these situations:

  • You already have code built on the OpenAI SDK
  • You need a quick migration to local deployment
  • You use the OpenAI toolchain (LangChain, LlamaIndex, and friends)
  • You need to switch between dev and production environments

Basically, if you’re a pragmatist who doesn’t want to touch existing code, use the OpenAI compatible interface.

My Suggestion

Honestly, I prefer the OpenAI SDK compatible interface during development: fewer code changes, the toolchain just works, and debugging is convenient. But some scenarios genuinely favor the native API, say a minimal CLI tool, or an environment where the OpenAI SDK isn’t available.


Practical Code Snippets

Finally, a few code snippets I use regularly.

Python Streaming Chat (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama'
)

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

JavaScript Complete Conversation (Native API)

async function chat(messages) {
  const response = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3.2',
      messages: messages,
      stream: false
    })
  });
  return await response.json();
}

// Maintain conversation history
let conversation = [
  { role: 'user', content: 'Hello!' }
];

const result = await chat(conversation);
conversation.push({
  role: 'assistant',
  content: result.message.content
});

console.log(result.message.content);

Tool Calling Example

Ollama also supports Function Calling (tool calling):

from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1/', api_key='ollama')

tools = [
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {"type": "string"}
        },
        "required": ["location"]
      }
    }
  }
]

response = client.chat.completions.create(
  model="llama3.2",
  messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
  tools=tools
)

if response.choices[0].message.tool_calls:
  print("Model wants to call:", response.choices[0].message.tool_calls[0].function.name)
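
The example stops at detecting the call; in practice you then run the function yourself and send the result back, following the OpenAI tool-message convention (role "tool" plus tool_call_id). A minimal sketch of the full round trip; run_tool_call and funcs are my names, and the client is passed in:

```python
import json

def run_tool_call(client, messages, tools, funcs, model="llama3.2"):
    """One tool-calling round trip: model picks a tool, we run it, model answers."""
    first = client.chat.completions.create(model=model, messages=messages, tools=tools)
    msg = first.choices[0].message
    if not msg.tool_calls:
        return msg.content  # model answered directly, no tool needed
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = funcs[call.function.name](**args)  # run our local Python function
    messages.append(msg)  # the assistant turn that requested the tool
    messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    second = client.chat.completions.create(model=model, messages=messages, tools=tools)
    return second.choices[0].message.content

# Usage with the client and tools defined above (my_get_weather is yours to write):
# answer = run_tool_call(client, messages, tools, {"get_weather": my_get_weather})
```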

Having Said All This

Ollama’s API design strikes a nice balance: you get the directness and lightness of a native REST API, plus the convenience and ecosystem compatibility of the OpenAI SDK. Each method has its scenarios; it comes down to your needs.

If you’re just getting started with Ollama, I suggest trying the OpenAI SDK compatible interface first: it’s quick to pick up and needs few code changes. Once you’re comfortable, decide whether the native API suits your specific needs.

Oh, one more thing: streaming responses really are an easy place to trip. Remember that streaming is the default; if you don’t need it, set stream: false explicitly, or you’ll end up like me at 3 AM, staring blankly at half a word.

Two Ways to Call Ollama API

Complete calling flow from curl native API to OpenAI SDK compatible interface

⏱️ Estimated time: 10 min

  1. Step 1: Confirm Ollama Is Installed and Running

    First check that Ollama is running normally:

    • In a terminal, run: ollama list (shows downloaded models)
    • Or visit: http://localhost:11434 (should return "Ollama is running")
    • Default port: 11434
  2. Step 2: Choose a Calling Method

    Pick based on your scenario:

    • Native REST API: lightweight calls, custom clients
    • OpenAI SDK compatible: existing OpenAI code, quick migration
  3. Step 3: Use the Native REST API (curl)

    The most basic curl calls:

    • Text generation: curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "...", "stream": false}'
    • Chat mode: curl http://localhost:11434/api/chat -d '{"model": "llama3.2", "messages": [...], "stream": false}'
    • Note: responses stream by default; set stream: false to disable
  4. Step 4: Use the OpenAI SDK Compatible Interface

    Calling through the Python OpenAI SDK:

    • Set base_url='http://localhost:11434/v1/'
    • api_key can be anything (not validated locally)
    • Everything else is identical to OpenAI
    • Environment switch: just change base_url (local for dev, OpenAI for production)
  5. Step 5: Handle Streaming Responses

    Key points for streaming:

    • Python: response.iter_lines() reads the NDJSON line by line
    • JavaScript: response.body.getReader() for stream reads
    • Each chunk carries only a few characters; accumulate them for the complete reply
    • Non-streaming: set stream: false to get one complete JSON response

FAQ

What's the difference between Ollama's native API and OpenAI SDK compatible interface?
The native API is lighter and more direct, defaults to streaming responses (NDJSON format), and suits custom HTTP clients. The OpenAI SDK compatible interface is fully compatible with the OpenAI SDK, needs only a base_url change, and suits quick migration of existing OpenAI code.
Why did calling Ollama API only return half a word?
That’s the default streaming behavior. Ollama emits NDJSON token by token, so each JSON object carries only a few characters. Solutions:

• Set stream: false to disable streaming and get one complete JSON response
• Or handle the NDJSON stream properly: read line by line and accumulate the content
How to use local Ollama in dev environment and OpenAI in production?
Use environment variables: in development set LLM_ENDPOINT="http://localhost:11434/v1", in production set it to "https://api.openai.com/v1". The code just reads base_url from the environment; everything else stays unchanged.
What OpenAI endpoints does Ollama support?
Fully supported: /v1/chat/completions, /v1/completions, /v1/models, /v1/embeddings, /v1/responses. Experimental: /v1/images/generations (not yet stable).
Can I give Ollama models aliases?
Yes. Run: ollama cp llama3.2 gpt-3.5-turbo. After that, code written with model="gpt-3.5-turbo" actually uses the local llama3.2. Useful when migrating code.
Does Ollama support tool calling (Function Calling)?
Yes. Through the OpenAI SDK compatible interface, pass a tools parameter with your function schemas; the model returns a tool_calls field indicating which function to call.


8 min read · Published on: Apr 3, 2026 · Modified on: Apr 5, 2026
