The Python + AI Stack Everyone Is Adopting Right Now
Ollama, FastAPI, LangChain — your new power trio.
I’ll be honest: the Python world hasn’t seen a shift this massive since virtual environments stopped ruining everyone’s morning. But right now? There’s a new stack quietly winning over serious AI developers, indie hackers, startups, and even the “I only run GPT-4 from the cloud” crowd. And yes — it’s powerful enough to make you feel like you’ve unlocked a cheat code. Today, I’ll show you rare, actually useful Python scripts you won’t find in YouTube tutorials or 15-year-olds’ GitHub repos. Let’s talk Ollama, LangChain, FastAPI — and how they fuse into a system every serious Python developer should be building with. Let’s get into the rare stuff.
1. Spin Up a Local LLM API with Ollama (No Cloud Bill, No Waiting)
Most developers don’t know Ollama exposes a built-in server that can be hijacked (nicely) into your own Python pipeline. Here’s how you run a model like llama3, mistral, or codellama with one rare trick: you can set system prompts per request, not globally.
ollama pull llama3
ollama serve
And now your Python backend can talk to it:
import requests

def run_llm(prompt):
    payload = {
        "model": "llama3",
        "prompt": prompt,
        "system": "You are a senior Python engineer. Give concise, practical answers.",
        "stream": False  # return one JSON object instead of a token stream
    }
    r = requests.post("http://localhost:11434/api/generate", json=payload)
    return r.json()["response"]

print(run_llm("Explain the difference between asyncio.gather and asyncio.wait."))
Why this is rare: Most tutorials use the CLI… and completely ignore the API that turns Ollama into a production-ready local inference server.
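To make the per-request system prompt concrete, here’s a small variation on the helper above (run_llm_as is a hypothetical name of mine, not from any library): the system prompt becomes an argument, so each call can use a different persona against the same loaded model.

import requests

# Hypothetical helper: same endpoint, but the system prompt is chosen
# per call instead of being hard-coded.
def run_llm_as(system, prompt, model="llama3"):
    payload = {
        "model": model,
        "prompt": prompt,
        "system": system,
        "stream": False,  # single JSON response
    }
    r = requests.post("http://localhost:11434/api/generate", json=payload)
    return r.json()["response"]

print(run_llm_as("You are a strict code reviewer.", "Review: x = eval(input())"))
print(run_llm_as("You are a patient Python tutor.", "What does eval() do?"))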
2. Build an AI Endpoint with FastAPI in < 12 Lines (Yes, Really)
FastAPI + Ollama is the quiet power combo. You get speed, async, auto-docs, and local LLM inference.
from fastapi import FastAPI
import requests

app = FastAPI()

@app.post("/ask")
def ask_llm(prompt: str):
    res = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},
    )
    return {"reply": res.json()["response"]}
Run it (assuming you saved the file as app.py): uvicorn app:app --reload
You now have your own ChatGPT-style endpoint: running for free, locally, and fully hackable.
Most devs don’t realize how tiny this API can be.
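Because prompt is declared as a plain str, FastAPI reads it from the query string. A quick smoke test against the running server (assuming uvicorn’s default port 8000):

import requests

# Smoke test for the /ask endpoint above.
# "prompt" is a query parameter because the endpoint declares it as a plain str.
resp = requests.post(
    "http://localhost:8000/ask",
    params={"prompt": "Give me three uses for FastAPI middleware."},
)
print(resp.json()["reply"])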
3. LangChain + Ollama: The “Hidden Recipes” Nobody Shows
Everyone uses LangChain wrong. They build huge, bloated pipelines because they follow 40-minute YouTube tutorials narrated by someone who sounds like GPT-2. You? You’re getting a rare, production-level snippet.
Structured Output + Ollama. This is poorly documented in most tutorials, but Ollama gives you JSON-mode-like behavior when prompted correctly (and recent versions also accept a native JSON format option).
from langchain_ollama import OllamaLLM
from langchain.prompts import PromptTemplate
import json

llm = OllamaLLM(model="llama3")

template = """
Extract the fields below. Respond with JSON only.
Text: "{text}"
Return:
{{
  "tech": "",
  "difficulty": "",
  "summary": ""
}}
"""

prompt = PromptTemplate.from_template(template)
# json.loads only works if the model actually returns valid JSON
response = llm.invoke(prompt.format(text="FastAPI accelerates backend dev."))
print(json.loads(response))
You just built an AI data extractor that costs $0 to run.
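If prompt-only JSON feels fragile, recent Ollama versions also accept a format option on /api/generate that constrains the output to valid JSON. A minimal sketch against the raw endpoint (treat the exact behavior as version-dependent):

import json
import requests

# Sketch: ask Ollama for JSON-constrained output directly.
# Assumes a recent Ollama version where /api/generate accepts "format": "json".
payload = {
    "model": "llama3",
    "prompt": 'Extract {"tech": "", "difficulty": "", "summary": ""} from: '
              '"FastAPI accelerates backend dev." Respond with JSON only.',
    "format": "json",
    "stream": False,
}
r = requests.post("http://localhost:11434/api/generate", json=payload)
print(json.loads(r.json()["response"]))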
4. The Rare Trick: Local RAG With ZERO Vector Databases
Forget Pinecone. Forget Chroma. You don’t need them. Let’s build a RAG system in <25 lines using FAISS, which the AI community quietly uses because it’s faster and simpler than 90% of the hype.
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings, OllamaLLM

texts = [
    "FastAPI is an async Python web framework.",
    "LangChain is used to build LLM pipelines.",
    "Ollama runs LLMs locally."
]

# requires: ollama pull nomic-embed-text
emb = OllamaEmbeddings(model="nomic-embed-text")
db = FAISS.from_texts(texts, emb)
llm = OllamaLLM(model="llama3")

query = "How do I run AI locally?"
docs = db.similarity_search(query)
context = "\n".join(d.page_content for d in docs)

print(llm.invoke(f"Based on this context: {context}\nAnswer the question: {query}"))
This setup is extremely uncommon because:
- FAISS is typically faster than Chroma for small, in-memory corpora like this
- Local embeddings + local LLM = zero external calls
- You can ship this as a desktop app or local service
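And since the last point is about shipping it: the FAISS index can be written to disk and reloaded, so the app doesn’t re-embed on every start. A minimal sketch, assuming a recent langchain_community version (the folder name rag_index is arbitrary):

from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings

emb = OllamaEmbeddings(model="nomic-embed-text")

# Build once and save to disk ("rag_index" is just an example folder name).
db = FAISS.from_texts(["Ollama runs LLMs locally."], emb)
db.save_local("rag_index")

# Later (e.g. on app startup), load instead of re-embedding.
restored = FAISS.load_local(
    "rag_index",
    emb,
    allow_dangerous_deserialization=True,  # required in newer versions; only load indexes you created
)
print(restored.similarity_search("local AI", k=1)[0].page_content)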
5. Stream LLM Tokens (Like OpenAI) Using Python + Ollama
Almost nobody knows Ollama supports token streaming via Python.
import json
import requests

def stream(prompt):
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": True},
        stream=True
    )
    # Ollama streams newline-delimited JSON objects; print just the token text.
    for chunk in r.iter_lines():
        if chunk:
            data = json.loads(chunk.decode())
            print(data.get("response", ""), end="", flush=True)

stream("Explain vector embeddings like I'm a backend developer.")
Now your LLM output behaves like real-time ChatGPT.
6. The Most Advanced Script: Your Own Micro ChatGPT Server (With Memory)
Okay, here’s the script that separates beginners from developers who actually ship things. We’re going to build:
- a local LLM
- FastAPI
- rolling conversation memory
- streaming responses (see the sketch after the snippet)
- system instructions
- auto-trimming memory
from fastapi import FastAPI
from pydantic import BaseModel
import requests
from collections import deque

app = FastAPI()
memory = deque(maxlen=10)  # last 10 messages

class Query(BaseModel):
    message: str

def call_ollama(prompt):
    res = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": prompt,
            "stream": False
        }
    )
    return res.json()["response"]

@app.post("/chat")
def chat(q: Query):
    memory.append(f"User: {q.message}")
    conversation = "\n".join(memory)
    system = "You are a senior Python engineer. Keep answers short, smart, and practical."
    full_prompt = f"{system}\n\n{conversation}\nAI:"
    reply = call_ollama(full_prompt)
    memory.append(f"AI: {reply}")
    return {"reply": reply}
You just built your own AI assistant with rolling, auto-trimmed conversation memory. Locally. In under 40 lines. Good luck finding a tutorial that shows this.
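One item from the list above, streaming responses, isn’t in the snippet itself. Here’s a hedged sketch of how the same idea could stream tokens through FastAPI’s StreamingResponse, reusing the newline-delimited JSON trick from section 5 (the /chat-stream route name and the omission of the memory handling are mine, not from the original snippet):

import json

import requests
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    message: str

def ollama_token_stream(prompt):
    # Same streaming pattern as section 5: Ollama emits newline-delimited
    # JSON objects, each carrying a "response" chunk.
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": True},
        stream=True,
    )
    for line in r.iter_lines():
        if line:
            yield json.loads(line.decode()).get("response", "")

@app.post("/chat-stream")  # illustrative route name
def chat_stream(q: Query):
    # Plain-text token stream; wire in the deque-based memory the same way as /chat.
    return StreamingResponse(ollama_token_stream(q.message), media_type="text/plain")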
7. Bonus: Run 4 Local Models in Parallel (Yeah, This Is Real)
If you’re like me and keep 4 LLMs loaded like a gamer switching weapons mid-battle:
import asyncio
import aiohttp

async def ask(model, prompt):
    async with aiohttp.ClientSession() as s:
        async with s.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False}
        ) as r:
            data = await r.json()
            return model, data["response"]

async def main():
    tasks = [
        ask("llama3", "Explain async."),
        ask("mistral", "Explain async."),
        ask("codellama", "Explain async."),
        ask("phi3", "Explain async."),
    ]
    results = await asyncio.gather(*tasks)
    for model, response in results:
        print(f"\n=== {model.upper()} ===\n{response}")

asyncio.run(main())
Most devs don’t know Ollama can run multiple models in parallel if your RAM allows.
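If RAM is the constraint, one guard worth sketching is an asyncio.Semaphore that caps how many generations run at once (the limit of 2 below is an arbitrary example; tune it for your machine):

import asyncio
import aiohttp

# Cap concurrent generations instead of firing every request at once.
SEM = asyncio.Semaphore(2)  # at most 2 models queried simultaneously

async def ask_limited(model, prompt):
    async with SEM:
        async with aiohttp.ClientSession() as s:
            async with s.post(
                "http://localhost:11434/api/generate",
                json={"model": model, "prompt": prompt, "stream": False},
            ) as r:
                data = await r.json()
                return model, data["response"]

async def main():
    models = ["llama3", "mistral", "codellama", "phi3"]
    results = await asyncio.gather(*(ask_limited(m, "Explain async.") for m in models))
    for model, response in results:
        print(f"\n=== {model.upper()} ===\n{response}")

asyncio.run(main())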
Read the full article here: https://ai.plainenglish.io/the-python-ai-stack-everyone-is-adopting-right-now-61429ae2c968