Most AI hires can prompt an API. Few have shipped LLM products to thousands of real users, debugged hallucinations at scale, or orchestrated GPU inference for paying customers. I've shipped Riya.chat (7,500 users in 7 days), SoftPod (self-hosted voice AI), TalkPod (offline LLM on edge), and indexed 10M+ documents with vector embeddings + GPT trait extraction.
The AI hiring market is flooded with people who've read Karpathy threads and built a Streamlit demo. You need someone who's debugged production traffic, handled hallucination tickets at 2 AM, owned cost-per-inference budgets, and knows when a foundation model API beats a fine-tune. Here's what I've actually owned in production:
Riya.chat — AI companion with state management for 7,500+ users in week 1. Memory architecture, drift mitigation, and escalation logic for when the LLM should call tools versus respond from context.
SoftPod — GPU orchestration on RunPod/Vast.ai. Loaded Whisper, Bark, vLLM, and custom models. Per-minute cost modeling, latency budgets, fallback chains.
Indexed 10M+ stories across multiple Indian languages. Evaluated embedding models on a golden test set. Sub-second retrieval, ground truth eval, hallucination guardrails.
Trait extraction across the same 10M-doc corpus. Schema design, batched inference economics, quality eval loops. Powered an IP licensing portal and content moderation system on the same backbone.
Speech-to-text → LLM with tool calls → text-to-speech at sub-second latency. Production calls, not benchmarks.
TalkPod — Raspberry Pi voice AI with wake word detection and local LLM response. For environments where cloud latency or data privacy kills the deal.
Built solo. WhatsApp Business API + OpenAI + Redis state + MongoDB. Designed the agentic conversation layer: context memory, persona consistency, tool-call escalation. Now also live on Telegram.
Self-hosted voice AI platform. Load open-source voice + LLM models on GPUs of choice — Whisper, Coqui, Bark, vLLM. Built the orchestration panel, per-minute cost dashboard, and real-time agentic pipelines.
Vector embeddings + GPT-based trait extraction over 10M+ multilingual stories. Powered three production systems on the same infra: reader semantic search, content moderation (plagiarism + inappropriate content), and an IP licensing intelligence portal.
Voice AI device that works without internet. Wake word detection, local LLM response. For regulated environments where cloud is a non-starter — privacy, latency, or compliance.
LLM apps fail differently than CRUD apps. I've debugged hallucinations, rate-limit storms, token budget overruns, and prompt-injection attempts in production.
OpenAI / Anthropic / open-source. Self-hosted GPU stacks. RAG architectures. Vector DBs (Pinecone, pgvector, Weaviate). I pick based on the constraint, not the hype.
Tool-calling, agent handoffs, voice pipelines combining STT + LLM + TTS at low latency. Not just a single-shot prompt wrapped in a Streamlit app.
Golden test sets, hallucination metrics, quality eval loops, ground truth datasets. AI quality is measurable — I treat it that way.
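As a minimal sketch of what a golden-set check can look like: the function name, the query and document ids, and the choice of k below are all illustrative assumptions, not code or data from the projects above. It computes recall@k, the fraction of golden queries whose human-approved document shows up in the retriever's top k results.

```python
from typing import Dict, List


def recall_at_k(golden: Dict[str, str], retrieved: Dict[str, List[str]], k: int = 3) -> float:
    """Fraction of golden queries whose expected doc id appears in the top-k results."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(
        1
        for query, expected_id in golden.items()
        if expected_id in retrieved.get(query, [])[:k]
    )
    return hits / len(golden)


# Hypothetical golden set: query -> id of the document a human judged correct.
golden_set = {"q1": "doc_a", "q2": "doc_b", "q3": "doc_c", "q4": "doc_d"}

# Hypothetical retriever output: query -> ranked list of doc ids.
results = {
    "q1": ["doc_a", "doc_x"],
    "q2": ["doc_y", "doc_b"],
    "q3": ["doc_z", "doc_w", "doc_v", "doc_c"],
    "q4": [],
}

score = recall_at_k(golden_set, results, k=3)
print(f"recall@3 = {score:.2f}")
```

Tracking a number like this per embedding model, on the same golden set, is what turns "this one feels better" into a comparison you can defend.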
I think about token budgets, GPU rental costs, batched vs streaming inference. Not just "does it work" but "does it work at the unit economics that close the deal."
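A back-of-the-envelope version of that unit-economics question, as a hedged sketch: every price, token count, and throughput figure here is a made-up placeholder, and the names `Pricing`, `api_cost_per_request`, and `gpu_cost_per_request` are hypothetical. It compares the per-request cost of a hosted API against a rented GPU.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    input_per_million: float   # USD per 1M input tokens (placeholder value)
    output_per_million: float  # USD per 1M output tokens (placeholder value)


def api_cost_per_request(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Token-based cost of one request against a hosted API."""
    total = input_tokens * p.input_per_million + output_tokens * p.output_per_million
    return total / 1_000_000


def gpu_cost_per_request(hourly_rate: float, requests_per_hour: float) -> float:
    """Amortized cost of one request on a rented GPU at a given sustained throughput."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return hourly_rate / requests_per_hour


# Placeholder numbers, chosen only to make the arithmetic easy to follow.
api = Pricing(input_per_million=1.0, output_per_million=4.0)
api_cost = api_cost_per_request(api, input_tokens=1500, output_tokens=500)
gpu_cost = gpu_cost_per_request(hourly_rate=0.90, requests_per_hour=600)

cheaper = "self-hosted GPU" if gpu_cost < api_cost else "hosted API"
print(f"API: ${api_cost:.4f}/req, GPU: ${gpu_cost:.4f}/req, cheaper: {cheaper}")
```

The GPU figure only holds if the card stays busy; at low traffic the same hourly rate is spread over far fewer requests and the answer can flip.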
I build with Cursor, Claude Code, and agentic pipelines every day. I move at small-team velocity solo, which means I ship your AI roadmap faster than your current team can scope it.
20-minute call. Bring me the hallucination problem, the latency budget, the on-prem constraint. I'll tell you how I'd solve it.