
I Added ‘ChatGPT-Like’ Search To Our SaaS And Support Tickets 10x’d

From JOHNWICK

A cautionary tale about confusing magic with reliability. Based on real experiences building AI features in B2B SaaS — names and details changed to protect the innocent (and guilty).

Monday morning: Marketing popped champagne bottles. Friday afternoon: Support threatened to quit en masse.

The crime? I shipped a “conversational AI search” powered by RAG that transformed our reliable keyword filter into an eloquent liar. NPS nosedived. ARR wobbled. The CEO asked if I’d “considered pivoting my career toward fiction writing.”

This is the autopsy of how I turned a boring-but-profitable B2B SaaS into a trust-destroying slot machine. If you’ve stared at ChatGPT and thought “we should build that for our users,” grab popcorn. This is your warning shot wrapped in dark comedy.

The Scene: A Search Box Nobody Hated

The product: DocuVault — document management for legal teams. 10K users. $2M ARR. Boring as oatmeal, profitable as a parking lot.

The old search:

  • Elasticsearch with faceted filters
  • 200ms p95, deterministic, predictable
  • Query: “contract amendment 2023” → Result: exactly that
  • User satisfaction: high because it just worked

The “problem” (according to the sales team): “LegalTech competitors have AI chatbots. We look like we’re stuck in 2015.”

The actual problem: We didn’t have one. We had a PowerPoint anxiety disorder. But when your VP of Sales forwards a competitor demo with the subject line “THIS IS WHY WE’RE LOSING DEALS,” you don’t push back. You engineer your own downfall with confidence.

The Itch: Why I Built My Own Nightmare

Three forces created the perfect storm:

1. Board-level FOMO
Every investor deck started asking: “What’s your AI strategy?” We had grep and Elasticsearch. That wasn’t sexy enough for slide 3.

2. Engineer ego
I’d been reading RAG papers at midnight like they were mystery novels. Benchmarks looked clean. Twitter was unanimous: “RAG is production-ready.” (Narrator: It wasn’t.)

3. Competitive paranoia
ClerkHQ launched “Ask Your Documents Anything™” and our renewal pipeline developed a nervous twitch.

The prototype stack:

  • LangChain for orchestration (first mistake: trusting abstractions I couldn’t debug)
  • OpenAI embeddings (ada-002) — because everyone said so
  • Pinecone vector store — $399/month seemed reasonable
  • GPT-4 for answers — “enterprise needs the best”
  • Streaming responses for that authentic ChatGPT cosplay

I chunked every document (500 tokens, 50-token overlap), embedded the corpus, built a slick chat UI with typing animations, and feature-flagged it behind “AI Search Beta.”
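For the curious, the ingestion half looked roughly like the sketch below, reconstructed from memory against the 2023-era LangChain and Pinecone clients (both APIs have changed since); load_documents() stands in for our internal loader:

```python
# Ingestion sketch: chunk (500 tokens, 50 overlap), embed with ada-002,
# upsert into Pinecone. Assumes 2023-era LangChain/Pinecone client APIs;
# load_documents() is a hypothetical placeholder for the document source.
import pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500, chunk_overlap=50)           # token-based, not char-based
embedder = OpenAIEmbeddings(model="text-embedding-ada-002")

pinecone.init(api_key="...", environment="us-east1-gcp")
index = pinecone.Index("docuvault-chunks")

for doc_id, text in load_documents():           # hypothetical loader
    chunks = splitter.split_text(text)
    vectors = embedder.embed_documents(chunks)   # one 1536-dim vector per chunk
    index.upsert(vectors=[
        (f"{doc_id}-{i}", vec, {"doc_id": doc_id, "text": chunk})
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ])
```

The rollout: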

  • Week 1: 5% to power users (they loved it)
  • Week 2: 25% to enthusiasts (cautiously optimistic)
  • Week 3: 100% because Marketing tweeted a launch thread

Week 5: I stopped answering Slack messages after 6 PM.

The Metrics That Lied With Confidence

Dashboard victory lap:

  • Engagement ↑ 40% (users clicking the shiny toy)
  • Session duration ↑ 60% (or were they just… confused?)
  • Search queries ↑ 80% (spoiler: they were rephrasing because answers were wrong)

Support queue apocalypse:

  • Tickets ↑ 900% in 30 days
  • CSAT ↓ from 4.2 to 2.1
  • Churn risk alerts lighting up like a Christmas tree in a fireworks factory

Sample tickets that aged me five years:

“The AI said our NDA expires in 2027. It’s actually 2025. Did we get hacked?”

“It claims we have 47 active contracts with Acme Corp. We have 4. Is this a data breach or incompetence?”

“Search used to take 2 seconds. Now I wait 18 seconds for fiction. Can I have the old one back? I have court Monday.”

“Your AI invented a confidentiality clause that doesn’t exist. I almost sent it to opposing counsel. Fix this or we’re gone.”

That last one came CCed to our CEO, Legal, and — I assume — her therapist.

A senior paralegal spent three hours fact-checking AI results before realizing the entire summary was creative writing. She described it as “like working with an intern who’s confidently wrong about everything.”

In legal tech, “confidently wrong” isn’t a personality quirk. It’s a liability claim waiting to happen.

The Meeting Where Reality Arrived With A Lawsuit Threat

The war room: CEO, CTO, VP Sales, VP Support, Chief Legal Officer, and me with my laptop like a defendant at sentencing.

CEO: (calm, terrifying) “Walk me through what happened.”

Me: “RAG retrieves relevant document chunks, GPT synthesizes them into natural language. It’s the same approach everyone — “

CTO: “Why is it inventing information?”

Me: “LLMs can hallucinate. It’s a known characteristic of the technology — “

Legal: (ice-cold) “We sell software to law firms. ‘Known hallucination risk’ is not a product feature. It’s malpractice insurance.”

VP Support: “Two enterprise clients are threatening to leave. One already escalated to their legal team about a compliance error they blame on our AI.”

VP Sales: “Prospects are explicitly asking if we’re liable for AI mistakes. I have no answer that doesn’t kill the deal.”

The silence had weight. You could hear the air conditioning and the faint sound of my career flatlining.

CEO: “Can we make it reliable?”

Me: “We can tune prompts, add confidence thresholds, show source citations — “

CTO: “Timeline?”

Me: “…months. Maybe a quarter. And even then, LLMs aren’t deterministic. Perfect accuracy isn’t technically possible.”

CEO: (closes laptop) “Kill it. Today. Restore the old search. Issue a statement apologizing for the ‘beta experience.’”

By 4 PM, AI Search was a disabled feature flag and a Slack channel renamed #ai-postmortem.

By 5 PM, I was in HR explaining “lessons learned.”

Five Mistakes That Turned Hype Into Hazard

1. I Built For Demos, Not Workflows
“ChatGPT-like” sounds incredible in a sales pitch. In production, it’s a support ticket generator. I optimized for wow factor; users needed zero extra work.

2. I Ignored The One Failure Mode That Mattered
In B2B SaaS for high-stakes industries, a wrong answer is worse than no answer, and no answer is worse than the right one: right answer > no answer > wrong answer. Precision beats recall. Boring beats clever. I shipped poetry to people who wanted SQL.

3. I Misread My Audience
Legal teams don’t want “pretty good” or “mostly accurate.” They want WHERE clause = exact_match RETURN guaranteed_result. I gave them probabilistic outputs. They gave us one-star reviews.

4. I Treated Latency Like A Nice-To-Have
The old search: 200ms. The new search: 12–18 seconds.

Users would rather wait 2 seconds for certainty than 15 for maybe. I destroyed workflows chasing impressiveness.

5. I Broke Trust At Scale
One hallucination → user double-checks the next result. Two hallucinations → user questions the whole product. Three hallucinations → user questions the whole company.

Trust takes months to build. Our AI destroyed it in a handful of queries.

What RAG Actually Changed (And Broke)

Changed (technically):

  • Semantic search worked beautifully. “Find IP assignment language” understood intent.
  • Natural language queries felt futuristic.
  • Cross-document synthesis was genuinely impressive (when it wasn’t lying).

Destroyed (operationally):

  • Trust. Every result became suspect.
  • Speed. Embedding + retrieval + generation = workflow killer.
  • Cost. $8K/month Pinecone + $15K/month OpenAI = CFO’s least favorite Slack ping.
  • Predictability. Same question ≠ same answer. Non-determinism broke enterprise expectations.

Didn’t fix:

  • The real problem. Users wanted better filters and metadata, not a chatbot.
  • Underlying data issues. Bad tagging and incomplete records; RAG just wrote prettier lies about them.

The Ironic Part (If You’re Not Paying The Bill)

  • Marketing’s “The Future of Legal Search” blog post went live the same day we killed the feature. They edited it to past tense by lunch.
  • Our competitor ClerkHQ quietly removed AI search six weeks later (we heard through a shared customer).
  • The Support VP printed my launch announcement and hung it in the break room with a hand-drawn sad face.
  • Our biggest customer (22% of ARR) told us: “We chose you because you were boring and reliable. Please go back to that immediately.”

The kicker: When we flipped the kill switch and restored keyword search, NPS climbed 1.8 points in two weeks.

Users sent thank-you emails. For making the product dumber.

One user literally wrote: “Thank God. I thought you’d been acquired by someone who didn’t understand legal work.”

The Better Path I Should’ve Followed

If you’re RAG-curious (and sometimes you should be), here’s the playbook that doesn’t end in apology emails:

1) Start With Hybrid, Not Generative
Combine keyword + semantic embeddings for retrieval. Show ranked results — no generation, no hallucination risk. Test whether smarter retrieval alone moves metrics before you add an LLM that makes things up.
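A minimal sketch of the fusion step, assuming you already have ranked doc IDs from Elasticsearch and a vector store; reciprocal rank fusion is one common way to merge them, and the example IDs are made up:

```python
# Hybrid retrieval sketch: merge keyword and semantic rankings with
# reciprocal rank fusion (RRF). No LLM anywhere in this path.
def rrf_merge(keyword_hits: list[str], semantic_hits: list[str],
              k: int = 60, top_n: int = 10) -> list[str]:
    """Score each doc ID by its summed reciprocal rank across both lists."""
    scores: dict[str, float] = {}
    for hits in (keyword_hits, semantic_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# "c12" ranks high in both lists, so it wins the fusion.
print(rrf_merge(["c12", "a07", "b33"], ["c12", "d91", "a07"]))
```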

2) Make Generation Opt-In, Never Default
Let power users click “Summarize these 5 documents.” Default to showing sources. If you must generate text, make users consciously request it.

3) Citations Aren’t Optional — They’re The Product
Every sentence needs inline links to source documents. If the model can’t cite a claim, don’t display it. Transparency > fluency.
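One way to enforce that rule is a display gate between generation and the UI. A toy sketch, where the token-overlap test is a crude stand-in for a real attribution check (exact quoting, NLI, etc.):

```python
# Citation-gating sketch: render a generated sentence only if it can be
# tied back to a retrieved chunk; otherwise suppress it.
def cited(sentence: str, chunks: list[str], min_shared: int = 5) -> bool:
    words = {w for w in sentence.lower().split() if len(w) > 3}
    return any(len(words & set(c.lower().split())) >= min_shared
               for c in chunks)

chunks = ["This NDA between Acme Corp and DocuVault expires on 1 March 2025."]
claims = [
    "The NDA between Acme Corp and DocuVault expires on 1 March 2025.",
    "The agreement includes a broad non-compete covering all affiliates.",
]
for claim in claims:
    print("SHOW" if cited(claim, chunks) else "SUPPRESS", "-", claim)
```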

4) Build The “I Don’t Know” Path First
Teach your system to fail gracefully. “No documents match your query” is better than confident fiction. Set strict confidence thresholds. When in doubt, say nothing.
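In code, that’s a gate in front of the LLM call, not a prompt instruction. A sketch, assuming similarity scores from the vector store; the 0.75 threshold is made up and needs tuning against labeled queries:

```python
# Abstention sketch: refuse to generate when retrieval confidence is low.
def answer_or_abstain(hits: list[tuple[str, float]],
                      threshold: float = 0.75) -> str:
    """hits = [(chunk_text, similarity_score), ...] from the vector store."""
    confident = [text for text, score in hits if score >= threshold]
    if not confident:
        return "No documents match your query."   # graceful failure
    # Only now would context go to the LLM (call omitted in this sketch).
    return f"Would generate from {len(confident)} chunk(s)."

print(answer_or_abstain([("NDA clause 4.2 ...", 0.62)]))  # abstains
print(answer_or_abstain([("NDA clause 4.2 ...", 0.91)]))  # proceeds
```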

5) Constrain Before You Generate
Use structured outputs (function calling, JSON mode, strict schemas) instead of free-form text generation. Guardrails aren’t limitations — they’re product requirements.
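For example, OpenAI’s JSON mode forces syntactically valid JSON that you can validate before anything reaches the user. A sketch; the schema, model choice, and sample context are illustrative, not our production setup:

```python
# Constrained-generation sketch: JSON mode plus a "null if not verbatim"
# instruction, validated before display.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
retrieved_context = "NDA between Acme Corp and DocuVault, expires 2025-03-01."

resp = client.chat.completions.create(
    model="gpt-4o-mini",                        # illustrative model choice
    response_format={"type": "json_object"},    # forces valid JSON output
    messages=[
        {"role": "system", "content": (
            "Extract contract facts as JSON with keys "
            "'party', 'expiry_date', 'source_quote'. "
            "Use null for anything not stated verbatim in the context."
        )},
        {"role": "user", "content": "Context:\n" + retrieved_context},
    ],
)
facts = json.loads(resp.choices[0].message.content)
# Validate 'facts' against a strict schema (e.g., pydantic) before display.
```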

6) Measure Trust, Not Engagement
Track what actually matters:

  • Result acceptance rate (did they use it?)
  • Verification time (how long do they fact-check?)
  • Support tickets per search
  • User-reported inaccuracies

If trust metrics collapse, engagement is just users trying to figure out what went wrong.
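Concretely, these are a few lines over your search event log. A sketch with made-up field names; adapt them to whatever your analytics pipeline emits:

```python
# Trust-metric sketch: acceptance rate, verification time, tickets/search.
events = [  # illustrative event log
    {"clicked_result": True,  "verify_secs": 40,  "ticket": False},
    {"clicked_result": False, "verify_secs": 0,   "ticket": True},
    {"clicked_result": True,  "verify_secs": 310, "ticket": False},
]

n = len(events)
accepted = [e for e in events if e["clicked_result"]]
acceptance_rate = len(accepted) / n
avg_verify_secs = sum(e["verify_secs"] for e in accepted) / max(1, len(accepted))
tickets_per_search = sum(e["ticket"] for e in events) / n

print(f"acceptance={acceptance_rate:.0%}  "
      f"verify={avg_verify_secs:.0f}s  "
      f"tickets/search={tickets_per_search:.2f}")
```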

7) Pilot With The Forgiving 5%
Find early adopters who want to experiment. Don’t force AI on the 95% who just want to finish their work and go home. Beta groups exist for good reason.

Postmortem: The Parts We Don’t Put On LinkedIn

What Went Well:

  • Rollback strategy saved the company
  • Feature flags contained the damage to weeks, not months
  • We learned RAG’s limits without losing the business

What Went Badly:

  • Built for the pitch deck, not the user
  • Ignored risk profiles of our customer base
  • Treated “AI” as a checkbox instead of carefully evaluating fit

What We’re Doing Now:

  • Hybrid search (embeddings for retrieval, no generation)
  • Better filters based on actual user research
  • AI summarization as opt-in for non-critical workflows only
  • If we ever touch generation again: strict citations, confidence thresholds, and lawyers review first

A Brutally Honest Go/No-Go Checklist

Before you ship RAG to real users:

✓ Low-stakes failure: Wrong answer = minor inconvenience, not compliance violation
✓ Users explicitly asked for it: Solving a real problem, not a conference talk
✓ Every claim can be cited: Model shows receipts or stays silent
✓ Sub-3-second p95 latency: If users can brew coffee while waiting, redesign
✓ One-click rollback: Kill switch should work in under 60 seconds
✓ Support is trained: They can explain model behavior to angry customers
✓ Cost model scales: At $0.12 per query × 50K daily searches, that’s $6K/day (~$180K/month); do that math before launch
✓ Trust measurement in place: You’re tracking verification time and error reports

If you can’t check every box, improve your filters and save the LLM for your weekend project.

The Aftermath: Eating Humble Pie In The Infrastructure Team

HR didn’t fire me. They “redeployed my talents to foundational systems work.” (Translation: exiled from anything customer-facing.)

But six months later, we did ship AI features. Smarter ones:

  • Auto-tagging on upload: Useful, low-risk, verifiable
  • Similar document suggestions: Helpful, deterministic
  • Meeting notes summarization: Opt-in, clearly labeled as experimental, non-critical

No chatbots. No “ask me anything” hubris. No conversational search.

Users loved them. Because they solved actual problems without pretending to be magic.

The Unglamorous Lesson

I still believe in RAG. I still think LLMs are genuinely transformative. But I learned that reliability beats impressiveness in systems people depend on for their livelihoods.

The best engineers aren’t the ones who chase the newest tech — they’re the ones who ship improvements without breaking the invisible contract of trust. Hype is easy; user confidence takes years to build and seconds to destroy.

If you’re feeling the RAG itch, ask yourself one question:

Am I solving a user problem, or am I solving a pitch deck problem?

If it’s the latter, build boring things that work. If it’s the former, start with retrieval, add generation later (maybe never), and absolutely never ship a feature that makes things up to people who trust you with their careers.

The old search wasn’t exciting. But it was honest. And in B2B SaaS, honest beats impressive every single time.

Read the full article here: https://medium.com/@theabhishek.040/rag-ai-search-saas-failure-support-disaster-894b4b59abd2