
The Python + AI Stack Everyone Is Adopting Right Now

From JOHNWICK

Ollama, FastAPI, LangChain — your new power trio.

I’ll be honest: the Python world hasn’t seen a shift this massive since virtual environments stopped ruining everyone’s morning.
But right now? 
There’s a new stack quietly winning over serious AI developers, indie hackers, startups, and even the “I only run GPT-4 from the cloud” crowd. And yes — it’s powerful enough to make you feel like you’ve unlocked a cheat code. Today, I’ll show you rare, actually useful Python scripts you won’t find in YouTube tutorials or on 15-year-olds’ GitHub repos.
Let’s talk Ollama, LangChain, FastAPI — and how they fuse into a system every serious Python developer should be building with. Let’s get into the rare stuff.

1. Spin Up a Local LLM API with Ollama (No Cloud Bill, No Waiting)

Most developers don’t know Ollama exposes a built-in server that can be hijacked (nicely) into your own Python pipeline. Here’s how you run a model like llama3, mistral, or codellama with one rare trick: you can set system prompts per request, not globally.

ollama pull llama3
ollama serve

And now your Python backend can talk to it:

import requests

def run_llm(prompt):
    payload = {
        "model": "llama3",
        "prompt": prompt,
        # Per-request system prompt (the "rare trick"): no global config needed
        "system": "You are a senior Python engineer. Give concise, practical answers.",
        # /api/generate streams by default; ask for a single JSON response instead
        "stream": False,
    }
    r = requests.post("http://localhost:11434/api/generate", json=payload)
    r.raise_for_status()
    return r.json()["response"]

print(run_llm("Explain the difference between asyncio.gather and asyncio.wait."))

Why this is rare:
Most tutorials use the CLI… and completely ignore the API that turns Ollama into a production-ready local inference server.

2. Build an AI Endpoint with FastAPI in < 12 Lines (Yes, Really)

FastAPI + Ollama is the quiet power combo.
You get speed, async, auto-docs, and local LLM inference.

from fastapi import FastAPI
import requests

app = FastAPI()

@app.post("/ask")
def ask_llm(prompt: str):  # FastAPI reads a bare str argument from the query string
    res = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},  # stream=False -> one JSON reply
    )
    return {"reply": res.json()["response"]}

Run it: uvicorn app:app --reload


You now have your own ChatGPT, but running for free, locally, and hackable. Most devs don’t realize how tiny this API can be.
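
To sanity-check it, here’s a quick client call. This is just a sketch, assuming uvicorn’s default port 8000; because prompt is a plain function argument, FastAPI reads it from the query string:

import requests

# Hypothetical quick test of the /ask endpoint defined above
resp = requests.post(
    "http://localhost:8000/ask",
    params={"prompt": "Give me three FastAPI performance tips."},
)
print(resp.json()["reply"])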

3. LangChain + Ollama: The “Hidden Recipes” Nobody Shows

Everyone uses LangChain wrong.
They build huge, bloated pipelines because they follow 40-minute YouTube tutorials narrated by someone who sounds like GPT-2. You?
You’re getting a rare, production-level snippet.

Structured Output + Ollama

This isn’t well documented, but Ollama supports JSON-mode-like behavior when prompted correctly.

from langchain_ollama import OllamaLLM
from langchain.prompts import PromptTemplate
import json

llm = OllamaLLM(model="llama3")

template = """
Return JSON only. No prose, no markdown fences.
Text: "{text}"
Return:
{{
    "tech": "",
    "difficulty": "",
    "summary": ""
}}
"""

prompt = PromptTemplate.from_template(template)
# invoke() is the current LangChain call style; calling llm(...) directly is deprecated
response = llm.invoke(prompt.format(text="FastAPI accelerates backend dev."))
print(json.loads(response))

You just built an AI data extractor that costs $0 to run.
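
If the model still wraps the JSON in extra prose, Ollama’s REST API also accepts a format field that constrains it to emit syntactically valid JSON. A minimal sketch against the raw /api/generate endpoint (same idea, no LangChain):

import json
import requests

payload = {
    "model": "llama3",
    "prompt": 'Extract tech, difficulty, and summary as JSON from: "FastAPI accelerates backend dev."',
    "format": "json",   # constrains the output to syntactically valid JSON
    "stream": False,
}
r = requests.post("http://localhost:11434/api/generate", json=payload)
print(json.loads(r.json()["response"]))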

4. The Rare Trick: Local RAG With ZERO Vector Databases

Forget Pinecone.
Forget Chroma.
You don’t need them. Let’s build a RAG system in under 25 lines using FAISS, an in-process vector index the AI community quietly relies on because it’s faster and simpler than most of the hyped alternatives.

# pip install faiss-cpu langchain-community langchain-ollama
# ollama pull nomic-embed-text  (the local embedding model used below)
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings, OllamaLLM

texts = [
    "FastAPI is an async Python web framework.",
    "LangChain is used to build LLM pipelines.",
    "Ollama runs LLMs locally.",
]

emb = OllamaEmbeddings(model="nomic-embed-text")
db = FAISS.from_texts(texts, emb)

llm = OllamaLLM(model="llama3")
query = "How do I run AI locally?"
docs = db.similarity_search(query)
context = "\n".join(d.page_content for d in docs)
print(llm.invoke(f"Based on this context: {context}\nAnswer the question: {query}"))

This setup is extremely uncommon because:

  • FAISS is a lean, in-process index, typically faster than Chroma for small corpora
  • Local embeddings + local LLM = zero external calls
  • You can ship this as a desktop app or local service (see the persistence sketch below)
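
To make that shippable, you’ll want the index on disk so embeddings aren’t recomputed on every start. A minimal persistence sketch, assuming a recent langchain_community release (the folder name rag_index is just an example):

from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings

emb = OllamaEmbeddings(model="nomic-embed-text")

# Build once and write the index to disk next to your app
db = FAISS.from_texts(["Ollama runs LLMs locally."], emb)
db.save_local("rag_index")

# On startup, reload instead of re-embedding everything.
# allow_dangerous_deserialization is required because the index is pickled;
# only load index files you created yourself.
db = FAISS.load_local("rag_index", emb, allow_dangerous_deserialization=True)
print(db.similarity_search("local AI")[0].page_content)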

5. Stream LLM Tokens (Like OpenAI) Using Python + Ollama

Almost nobody knows Ollama supports token streaming via Python.

import json
import requests

def stream(prompt):
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": True},
        stream=True,
    )
    # Ollama streams newline-delimited JSON; print just the token text from each chunk
    for chunk in r.iter_lines():
        if chunk:
            data = json.loads(chunk)
            print(data.get("response", ""), end="", flush=True)

stream("Explain vector embeddings like I'm a backend developer.")

Now your LLM output behaves like real-time ChatGPT.

6. The Most Advanced Script: Your Own Micro ChatGPT Server (With Memory)

Okay, here’s the script that separates beginners from developers who actually ship things. We’re going to build:

  • a local LLM
  • FastAPI
  • in-memory conversation history that persists across requests
  • streaming-ready responses (see the sketch after the code)
  • system instructions
  • and auto-trimming memory

from fastapi import FastAPI
from pydantic import BaseModel
import requests
from collections import deque

app = FastAPI()
memory = deque(maxlen=10)  # auto-trims: only the last 10 messages are kept

class Query(BaseModel):
    message: str

def call_ollama(prompt):
    res = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": prompt,
            "stream": False,
        },
    )
    return res.json()["response"]

@app.post("/chat")
def chat(q: Query):
    memory.append(f"User: {q.message}")
    # Rebuild the rolling transcript and prepend the system instructions
    conversation = "\n".join(memory)
    system = "You are a senior Python engineer. Keep answers short, smart, and practical."
    full_prompt = f"{system}\n\n{conversation}\nAI:"
    reply = call_ollama(full_prompt)
    memory.append(f"AI: {reply}")
    return {"reply": reply}

You just built your own AI assistant with rolling conversation memory.
Locally.
In under 40 lines.
Good luck finding a tutorial that shows this.
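
And if you want the streaming behavior from section 5 on this server, FastAPI’s StreamingResponse can relay Ollama’s token stream straight to the client. A rough sketch of an extra route, reusing app, Query, and memory from the script above (the /chat/stream path is just an example; error handling omitted):

import json
import requests
from fastapi.responses import StreamingResponse

@app.post("/chat/stream")   # hypothetical extra route alongside /chat
def chat_stream(q: Query):
    memory.append(f"User: {q.message}")
    system = "You are a senior Python engineer. Keep answers short, smart, and practical."
    full_prompt = f"{system}\n\n" + "\n".join(memory) + "\nAI:"

    def token_gen():
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3", "prompt": full_prompt, "stream": True},
            stream=True,
        )
        parts = []
        for line in r.iter_lines():
            if line:
                token = json.loads(line).get("response", "")
                parts.append(token)
                yield token  # push each token to the client as it arrives
        memory.append("AI: " + "".join(parts))  # remember the full reply once done

    return StreamingResponse(token_gen(), media_type="text/plain")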

7. Bonus: Run 4 Local Models in Parallel (Yeah, This Is Real)

If you’re like me and keep 4 LLMs loaded like a gamer switching weapons mid-battle:

import asyncio
import aiohttp

async def ask(model, prompt):
    async with aiohttp.ClientSession() as s:
        async with s.post(
            "http://localhost:11434/api/generate",
            # stream=False so each call returns a single JSON object
            json={"model": model, "prompt": prompt, "stream": False},
        ) as r:
            data = await r.json()
            return model, data["response"]

async def main():
    tasks = [
        ask("llama3", "Explain async."),
        ask("mistral", "Explain async."),
        ask("codellama", "Explain async."),
        ask("phi3", "Explain async."),
    ]
    results = await asyncio.gather(*tasks)
    for model, response in results:
        print(f"\n=== {model.upper()} ===\n{response}")

asyncio.run(main())

Most devs don’t know Ollama can run multiple models in parallel if your RAM allows.
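
If RAM is tight, one option is to cap concurrency with an asyncio.Semaphore instead of firing every model at once. A small sketch reusing the ask() coroutine above (the limit of 2 is arbitrary):

import asyncio

# Hypothetical throttle: allow at most 2 models to generate at once
limit = asyncio.Semaphore(2)

async def ask_throttled(model, prompt):
    async with limit:
        return await ask(model, prompt)   # reuses ask() from the script above

async def main_throttled():
    models = ["llama3", "mistral", "codellama", "phi3"]
    results = await asyncio.gather(*(ask_throttled(m, "Explain async.") for m in models))
    for model, response in results:
        print(f"\n=== {model.upper()} ===\n{response}")

asyncio.run(main_throttled())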

Read the full article here: https://ai.plainenglish.io/the-python-ai-stack-everyone-is-adopting-right-now-61429ae2c968