AI Engineer · Production AI Builder

Production AI.
Not GitHub stars.

Most AI hires can prompt an API. Few have shipped LLM products to thousands of real users, debugged hallucinations at scale, or orchestrated GPU inference for paying customers. I've shipped Riya.chat (7,500 users in 7 days), SoftPod (self-hosted voice AI), TalkPod (offline LLM on edge), and indexed 10M+ documents with vector embeddings + GPT trait extraction.


The hire you actually need

Anyone can prompt. Few can ship.

The AI hiring market is flooded with people who've read Karpathy threads and built a Streamlit demo. You need someone who's debugged production traffic, handled hallucination tickets at 2 AM, owned cost-per-inference budgets, and knows when a foundation model API beats a fine-tune. Here's what I've actually owned in production:

Persona consistency across multi-turn conversations

Riya.chat — AI companion with state management for 7,500+ users in week 1. Memory architecture, drift mitigation, and escalation logic for when the LLM should call tools versus respond from context.

Self-hosted inference at enterprise scale

SoftPod — GPU orchestration on RunPod/Vast.ai. Loaded Whisper, Bark, vLLM, and custom models. Per-minute cost modeling, latency budgets, fallback chains.
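
For readers who want to see what a fallback chain looks like in practice, here is a minimal Python sketch. It is an illustration under stated assumptions, not SoftPod's code: the backend names and the two stub functions are hypothetical, and a real deployment would also enforce a timeout on each call instead of relying on the backend to raise.

```python
import time
from typing import Callable, Sequence


class AllBackendsFailed(RuntimeError):
    """Raised when every backend in the chain raised an exception."""


def run_with_fallback(
    backends: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
) -> tuple[str, str, float]:
    """Try each backend in order.

    Returns (backend_name, output, latency_seconds) for the first backend
    that returns without raising. Any exception moves on to the next one.
    """
    errors = []
    for name, call in backends:
        start = time.perf_counter()
        try:
            output = call(prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc!r}")
            continue
        return name, output, time.perf_counter() - start
    raise AllBackendsFailed("; ".join(errors))


# Hypothetical stand-ins for a self-hosted model and a hosted API.
def flaky_gpu_backend(prompt: str) -> str:
    raise TimeoutError("GPU worker did not respond")


def hosted_api_backend(prompt: str) -> str:
    return f"echo: {prompt}"
```

Calling `run_with_fallback([("self_hosted_gpu", flaky_gpu_backend), ("hosted_api", hosted_api_backend)], "hello")` skips the failing GPU stub and returns the hosted stub's answer along with its measured latency, which is the number a latency budget would be checked against.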

RAG over millions of multilingual documents

Indexed 10M+ stories across multiple Indian languages. Evaluated embedding models on a golden test set. Sub-second retrieval, ground truth eval, hallucination guardrails.
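
A golden-test-set evaluation of retrieval can be sketched as below. This is a minimal illustration, assuming a golden set that maps each query to the document ids a human marked as relevant. The metric choice (recall@k and mean reciprocal rank) and the `search` callable are assumptions for the example, not a description of the pipeline used on that corpus.

```python
from typing import Callable, Mapping, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant doc ids that show up in the top-k results."""
    if not relevant:
        raise ValueError("each golden query needs at least one relevant doc id")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    golden: Mapping[str, set[str]],
    search: Callable[[str], Sequence[str]],
    k: int = 5,
) -> dict[str, float]:
    """Average both metrics over every query in the golden set."""
    if not golden:
        raise ValueError("golden set is empty")
    recalls, ranks = [], []
    for query, relevant in golden.items():
        retrieved = search(query)  # ranked list of doc ids, best first
        recalls.append(recall_at_k(retrieved, relevant, k))
        ranks.append(reciprocal_rank(retrieved, relevant))
    n = len(golden)
    return {"recall_at_k": sum(recalls) / n, "mrr": sum(ranks) / n}
```

Running `evaluate` once per candidate embedding model, with `search` wired to an index built from that model, gives directly comparable numbers on the same queries.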

GPT-based structured extraction at scale

Trait extraction across the same 10M-doc corpus. Schema design, batched inference economics, quality eval loops. Powered an IP licensing portal and content moderation system on the same backbone.
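
Schema design for extraction mostly comes down to refusing anything the model returns that does not fit. The sketch below shows one way to do that with the Python standard library. The field names and the genre list are hypothetical placeholders, not the schema used on that corpus.

```python
import json
from dataclasses import dataclass

# Hypothetical label set, for illustration only.
ALLOWED_GENRES = {"romance", "thriller", "drama", "comedy"}


@dataclass(frozen=True)
class StoryTraits:
    genre: str
    themes: tuple[str, ...]
    mature_content: bool


def parse_traits(raw: str) -> StoryTraits:
    """Validate one model response against the schema.

    Raises ValueError on anything that does not match, so the caller can
    retry the request or route the document to manual review.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    genre = data.get("genre")
    if not isinstance(genre, str) or genre not in ALLOWED_GENRES:
        raise ValueError(f"unknown genre: {genre!r}")

    themes = data.get("themes")
    if not isinstance(themes, list) or not all(isinstance(t, str) for t in themes):
        raise ValueError("themes must be a list of strings")

    mature = data.get("mature_content")
    if not isinstance(mature, bool):
        raise ValueError("mature_content must be a boolean")

    return StoryTraits(genre=genre, themes=tuple(themes), mature_content=mature)
```

Counting how often `parse_traits` rejects a response per batch is a cheap first quality signal before any deeper eval loop runs.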

Real-time voice AI pipelines

Speech-to-text → LLM with tool calls → text-to-speech at sub-second latency. Production calls, not benchmarks.

Edge / offline LLM inference

TalkPod — Raspberry Pi voice AI with wake word detection and local LLM response. For environments where cloud latency or data privacy kills the deal.

Selected AI work

Live products. Real users. Real revenue.

Consumer AI · WhatsApp + Telegram

Riya.chat — AI companion at 7,500-user scale

7,500 users week 1 · 100+ paid subs · 7 days idea to live

Built solo. WhatsApp Business API + OpenAI + Redis state + MongoDB. Designed the agentic conversation layer: context memory, persona consistency, tool-call escalation. Now also live on Telegram.

Voice AI · Self-Hosted Infra

SoftPod — enterprise voice AI on customer GPUs

~80% cost cut vs APIs · 1.5 yrs consulting · Sub-1s e2e latency

Self-hosted voice AI platform. Load open-source voice and LLM models (Whisper, Coqui, Bark, vLLM) on the GPUs of your choice. Built the orchestration panel, per-minute cost dashboard, and real-time agentic pipelines.

RAG · Multilingual NLP

10M-document semantic search + GPT trait extraction

10M+ docs indexed · Multiple Indian langs · Sub-sec retrieval

Vector embeddings + GPT-based trait extraction over 10M+ multilingual stories. Powered three production systems on the same infra: reader semantic search, content moderation (plagiarism + inappropriate content), and an IP licensing intelligence portal.

Edge AI · Hardware

TalkPod — offline LLM on edge hardware

0 ms cloud latency · Local inference

Voice AI device that works without internet. Wake word detection, local LLM response. For regulated environments where cloud is a non-starter — privacy, latency, or compliance.

Why I fit your AI team

Production reps. Not paper.

I've owned the on-call rotation

LLM apps fail differently than CRUD apps. I've debugged hallucinations, rate-limit storms, token budget overruns, and prompt-injection attempts in production.

Foundation models AND infra

OpenAI / Anthropic / open-source. Self-hosted GPU stacks. RAG architectures. Vector DBs (Pinecone, pgvector, Weaviate). I pick based on the constraint, not the hype.

Multi-agent orchestration

Tool-calling, agent handoffs, voice pipelines combining STT + LLM + TTS at low latency. Not just a single-shot prompt wrapped in a Streamlit app.

I evaluate, I don't vibe-check

Golden test sets, hallucination metrics, quality eval loops, ground truth datasets. AI quality is measurable — I treat it that way.

Cost-per-inference economics

I think about token budgets, GPU rental costs, batched vs streaming inference. Not just "does it work" but "does it work at the unit economics that close the deal."
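
The arithmetic behind that comparison is simple enough to sketch. The two functions below are a minimal illustration: they assume a rented GPU billed by the hour that serves several audio streams at once, and every number you feed them is your own input. Nothing here reproduces figures from the projects above.

```python
def self_hosted_cost_per_minute(
    gpu_hourly_usd: float,
    concurrent_streams: int,
    utilization: float,
) -> float:
    """Effective USD cost of one served stream-minute on a rented GPU.

    utilization is the fraction of rented time actually spent serving
    traffic (0 < utilization <= 1). Idle time still gets billed, so low
    utilization pushes the per-minute cost up.
    """
    if gpu_hourly_usd < 0:
        raise ValueError("gpu_hourly_usd must be non-negative")
    if concurrent_streams <= 0:
        raise ValueError("concurrent_streams must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    served_minutes_per_hour = 60 * concurrent_streams * utilization
    return gpu_hourly_usd / served_minutes_per_hour


def savings_vs_api(api_usd_per_minute: float, self_hosted_usd_per_minute: float) -> float:
    """Fractional saving from self-hosting; negative means the API is cheaper."""
    if api_usd_per_minute <= 0:
        raise ValueError("api_usd_per_minute must be positive")
    return 1 - self_hosted_usd_per_minute / api_usd_per_minute
```

The utilization term is the one that usually decides the answer: the same GPU that looks cheap at full load can cost more per minute than a hosted API when it sits idle most of the day.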

AI-augmented delivery itself

I build with Cursor, Claude Code, and agentic pipelines every day, so I move at small-team velocity solo. That means I ship your AI roadmap faster than your current team can scope it.

Let's talk

Tell me about the AI feature you can't get to production.

20-minute call. Bring me the hallucination problem, the latency budget, the on-prem constraint. I'll tell you how I'd solve it.

WhatsApp me · Email: goureshramrakhyani@gmail.com
