Generative AI in Production: Lessons From Building 3 AI Products

Figure: Generative AI system architecture with LLM APIs, caching, and monitoring layers

Everyone's Building AI. Few Are Running It in Production.

There's a massive gap between "I made a ChatGPT wrapper" and "I run an AI product serving real users."

I've built three AI products in the last 18 months:

Calling Agent — AI voice campaign platform (1000+ leads/day)
AI Agent Builder — Drag-and-drop platform to deploy custom AI agents
AI Integration Layer — LLM-powered customer interaction system

Here's everything I've learned that nobody writes about.

Lesson 1: Prompt Engineering Is Software Engineering

Your prompt is not a note you leave for the AI. It's code. Treat it that way.

What this means practically:

Version control your prompts — I store prompts in the database with version IDs
Test prompts like unit tests — Run the same inputs against new prompt versions before deploying
Separate concerns — System prompt (persona/rules), context prompt (dynamic data), user message

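// compressHistory and MAX_TOKENS are assumed to be defined elsewhere (not shown)
const buildPrompt = (agent, lead, history) => ({
  // Persona and rules: stable per agent, versioned with the prompt
  system: `${agent.persona}\n\nRules:\n${agent.rules.join("\n")}`,
  // Dynamic data for this lead
  context: `Lead Info: ${JSON.stringify(lead)}\nObjective: ${agent.objective}`,
  // Prior turns, trimmed to fit the token budget
  history: compressHistory(history, MAX_TOKENS),
});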
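
To make "test prompts like unit tests" concrete, here is a minimal sketch of a regression check that runs fixed inputs against a prompt version. The names `promptVersions`, `regressionCases`, `runRegression`, and the injected `complete` function are all hypothetical; this is one way to structure it, not the exact harness I use.

```javascript
// Hypothetical in-memory prompt store; in production these live in a database.
const promptVersions = {
  v1: {
    persona: "You are a polite sales assistant.",
    rules: ["Never quote prices."],
  },
  v2: {
    persona: "You are a polite sales assistant.",
    rules: ["Never quote prices.", "Offer a callback if the lead is busy."],
  },
};

// Fixed inputs, each with a check the completion must pass.
const regressionCases = [
  { input: "Call me later", check: (reply) => /callback|later/i.test(reply) },
  { input: "How much is it?", check: (reply) => !/\$\d/.test(reply) },
];

// `complete(system, input)` is injected so the harness works with any
// LLM client, or with a stub in CI.
async function runRegression(versionId, complete) {
  const prompt = promptVersions[versionId];
  if (!prompt) throw new Error(`Unknown prompt version: ${versionId}`);
  const system = `${prompt.persona}\n\nRules:\n${prompt.rules.join("\n")}`;
  const results = [];
  for (const testCase of regressionCases) {
    const reply = await complete(system, testCase.input);
    results.push({
      versionId,
      input: testCase.input,
      passed: testCase.check(reply),
    });
  }
  return results;
}
```

Running `runRegression("v2", complete)` before a deploy and blocking on any `passed: false` entry gives prompts the same gate that code changes get.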

Lesson 2: Latency Will Kill Your UX

GPT-4 Turbo averages 2–4 seconds for a full completion. In a voice call, 4 seconds of silence feels like a dropped call.

What I did:

Streaming responses — Start speaking as tokens arrive, don't wait for the full completion
Parallel API calls — Run sentiment analysis and lead scoring in parallel, not sequentially
Cache common responses — Greetings, objection handlers, closing lines — cache them in Redis

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Streaming reduces perceived latency dramatically
const stream = await openai.chat.completions.create({
  model: "gpt-4-turbo",
  messages: prompt, // must be an array of { role, content } messages
  stream: true,
});

for await (const chunk of stream) {
  const token = chunk.choices[0]?.delta?.content;
  if (token) sendToTwilioStream(token); // Speak tokens as they arrive
}
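
The other two points, caching and parallel calls, can be sketched in a few lines. This is a simplified illustration: a `Map` stands in for Redis so the example runs on its own, and `getCachedReply`, `analyzeTurn`, and the injected `analyzeSentiment` / `scoreLead` functions are hypothetical names.

```javascript
// In-memory stand-in for Redis so the sketch is self-contained.
const cache = new Map();

// Cache-aside lookup: return a stored reply, or generate and store one.
async function getCachedReply(key, generate) {
  if (cache.has(key)) return { reply: cache.get(key), cached: true };
  const reply = await generate();
  cache.set(key, reply);
  return { reply, cached: false };
}

// Run independent analyses concurrently. Promise.allSettled means one
// failure does not discard the other result.
async function analyzeTurn(text, analyzeSentiment, scoreLead) {
  const [sentiment, score] = await Promise.allSettled([
    analyzeSentiment(text),
    scoreLead(text),
  ]);
  return {
    sentiment: sentiment.status === "fulfilled" ? sentiment.value : null,
    leadScore: score.status === "fulfilled" ? score.value : null,
  };
}
```

With both analyses started at once, the wait is roughly the slower of the two calls rather than their sum.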

Lesson 3: Cost Will Surprise You

GPT-4 Turbo at $0.01/1K input tokens sounds cheap. At 1000 calls/day with 500 tokens per turn and 5 turns per call, that's 2.5M input tokens, or $25/day in input costs alone. That is before output tokens (billed at a higher rate), Twilio, Redis, or hosting.
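
The arithmetic is worth keeping as a small function so you can rerun it when volumes or prices change. This is a rough estimate only: it assumes a flat 500 input tokens per turn (in practice context grows as history accumulates) and ignores output tokens. The function name `estimateDailyInputCost` is mine, for illustration.

```javascript
// Daily input-token cost: calls * turns * tokens per turn, priced per 1K tokens.
function estimateDailyInputCost({ callsPerDay, turnsPerCall, tokensPerTurn, pricePer1K }) {
  const tokensPerDay = callsPerDay * turnsPerCall * tokensPerTurn;
  return { tokensPerDay, costUsd: (tokensPerDay / 1000) * pricePer1K };
}

const estimate = estimateDailyInputCost({
  callsPerDay: 1000,
  turnsPerCall: 5,
  tokensPerTurn: 500,
  pricePer1K: 0.01,
});

console.log(estimate.tokensPerDay, estimate.costUsd.toFixed(2)); // prints: 2500000 25.00
```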

Optimizations I implemented:

GPT-3.5 for simple turns (greetings, confirmations), GPT-4 only for complex reasoning
Token budgeting — Hard limit on context window per call
Batching — Group non-urgent API calls

Cost dropped 60% without a noticeable quality difference.
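
Here is a minimal sketch of the first two optimizations, model routing and token budgeting. The intent labels, the 4-characters-per-token heuristic, and the function names are assumptions for illustration; a real tokenizer gives more accurate counts.

```javascript
// Assumed labels for turns that do not need heavy reasoning.
const SIMPLE_INTENTS = new Set(["greeting", "confirmation", "goodbye"]);

// Route cheap turns to the smaller model, everything else to the larger one.
function pickModel(intent) {
  return SIMPLE_INTENTS.has(intent) ? "gpt-3.5-turbo" : "gpt-4-turbo";
}

// Rough heuristic: about 4 characters per token for English text.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Walk backwards from the newest message and keep what fits in the budget.
function trimHistory(messages, maxTokens) {
  const kept = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}
```

Dropping the oldest turns first is the simplest policy. Summarizing them instead preserves more context at the cost of an extra model call.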

Lesson 4: Hallucinations Are a Product Problem, Not Just an AI Problem

When an AI agent gives wrong information to a real customer, that's a business risk.

My mitigation strategy:

Constrain the output — Force JSON responses for structured data
Verification layer — Run a second prompt to check if the first response contradicts known facts
Confidence scoring — If the model's logprobs indicate uncertainty, fall back to a safe default
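
A rough sketch of the output-constraint and confidence checks follows. It assumes per-token log probabilities shaped like `{ token, logprob }`, which is what the OpenAI chat completions API returns under `choices[0].logprobs.content` when `logprobs: true` is set. The 0.7 threshold, the fallback text, and the function names are illustrative assumptions, not tuned values.

```javascript
// Parse a JSON reply and confirm required fields exist; null signals a fallback.
function parseStructuredReply(raw, requiredFields) {
  try {
    const data = JSON.parse(raw);
    if (data === null || typeof data !== "object") return null;
    return requiredFields.every((field) => field in data) ? data : null;
  } catch {
    return null;
  }
}

// exp(mean logprob) is the geometric mean of the token probabilities.
function averageConfidence(tokenLogprobs) {
  if (tokenLogprobs.length === 0) return 0;
  const meanLogprob =
    tokenLogprobs.reduce((sum, t) => sum + t.logprob, 0) / tokenLogprobs.length;
  return Math.exp(meanLogprob);
}

// Hypothetical safe default for low-confidence answers.
const SAFE_DEFAULT = "Let me connect you with a teammate who can confirm that.";

// Use the model's reply only when confidence clears the threshold.
function chooseReply(reply, tokenLogprobs, threshold = 0.7) {
  return averageConfidence(tokenLogprobs) >= threshold ? reply : SAFE_DEFAULT;
}
```

Token-level confidence is a weak signal on its own: a model can be fluent and wrong. It works best combined with the verification prompt rather than as a replacement for it.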

Lesson 5: Observability Is Non-Negotiable

You cannot debug an AI product without logs. I built a logging layer that captured:

Every prompt sent (with version ID)
Every completion received
Latency per API call
Token usage per session
Final outcome (qualified / not qualified / transferred)

This data revealed that 40% of failed calls happened because the agent didn't handle "call me later" variants correctly — a fixable prompt issue invisible without logging.
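
A logging layer like this can start as a thin wrapper around the completion call. The sketch below is a simplified, hypothetical version: `callLogs` is an in-memory array standing in for a database, and the injected `complete` function is assumed to resolve to `{ text, totalTokens }`.

```javascript
// In-memory sink for the sketch; production code would write to a database.
const callLogs = [];

// Wrap any completion function so each call records prompt, completion,
// latency, and token usage, tagged with the prompt version.
async function loggedCompletion({ sessionId, promptVersion, prompt }, complete) {
  const startedAt = Date.now();
  const result = await complete(prompt); // assumed shape: { text, totalTokens }
  const entry = {
    sessionId,
    promptVersion,
    prompt,
    completion: result.text,
    latencyMs: Date.now() - startedAt,
    totalTokens: result.totalTokens,
  };
  callLogs.push(entry);
  return entry;
}

// Sum token usage across all logged calls for one session.
function tokensForSession(sessionId) {
  return callLogs
    .filter((entry) => entry.sessionId === sessionId)
    .reduce((sum, entry) => sum + entry.totalTokens, 0);
}
```

The final outcome (qualified, not qualified, transferred) would be written once per session when the call ends, keyed by the same `sessionId`.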

The Stack That Works

LLM:         OpenAI GPT-4 Turbo (reasoning) + GPT-3.5 (simple turns)
Voice:       Twilio Programmable Voice + Deepgram STT
Queue:       Bull on Redis (rate limiting + retries)
Real-time:   Socket.io (WebSocket events to dashboard)
Storage:     MongoDB (conversations) + S3 (recordings)
Frontend:    Next.js + Tailwind CSS
Monitoring:  Custom dashboard + Sentry

See all my AI projects: buildbysandeep.dev/projects