Cekura Raises $2.4M to Automate AI Agent Testing as Enterprises Struggle With Reliability
Y Combinator-backed startup simulates thousands of voice and chat conversations to catch hallucinations and prompt injections before production, targeting a market where 40% of agent projects fail within two years.
Cekura has raised $2.4 million to build automated testing infrastructure for conversational AI agents, addressing a critical deployment gap as enterprises race to ship voice and chat bots at scale. The Y Combinator Fall 2024 company automates quality assurance across the agent lifecycle, from pre-production simulation to production monitoring, and now serves 75+ customers across healthcare, banking, logistics, recruitment, and retail.
Founded in 2024 by Tarush Agarwal, Shashij Gupta, and Sidhant Kabra, the San Francisco-based startup emerged from personal frustration with manual testing. The founding team spent weeks manually dialing into a healthcare voice assistant they had built, yet a critical failure still slipped through to a real call. That experience crystallized the need for systematic testing infrastructure as conversational AI moved from prototype to production.
The Enterprise Reliability Crisis
The timing reflects mounting enterprise pressure to deploy AI agents that function reliably in customer-facing environments. According to Gartner, more than 40% of agentic AI projects will fail or be canceled by end of 2027 due to escalating costs, unclear business value, or insufficient risk controls. Quality remains the biggest barrier to production, with one third of respondents citing quality as their primary blocker, according to a LangChain survey covering agent deployment challenges.
Enterprises are realizing that while AI agents promise unprecedented scale and efficiency for customer support, sales, and internal operations, their inherent unpredictability—fluently answering one minute, hallucinating critical information the next—poses significant business risks. Production agents surveyed rely 74% on human evaluation, according to research published in arXiv covering 306 practitioners building AI agents in production. Manual QA doesn’t scale when enterprises plan to deploy hundreds of agent prototypes.
Simulation at Scale
Cekura’s platform automates testing by simulating thousands of realistic, real-world conversational scenarios—from ordering food and booking appointments to conducting interviews—using custom and AI-generated datasets, detailed workflows, and dynamic persona simulations to uncover edge cases. The system generates synthetic users that interact with agents the way real users do, then deploys AI-generated metrics to track custom evaluations, check for instruction following, tool calls, and conversational metrics including interruptions and latency.
Testing agents speak like real callers, covering a wide range of accents, background noise profiles, hesitation patterns, and speech speeds, including slow speakers, interrupters, non-native speakers, and custom cloned voices. Cekura generates and executes thousands of conversations based on detailed workflows and personalized datasets, with scenarios including realistic situations and dynamic personas like frustrated users who interrupt or change languages, according to an analysis by El Ecosistema Startup.
The founding team brings complementary technical depth: CEO Tarush Agarwal spent three years in quantitative finance developing ultra-low latency systems and helped transform a loss-making trading strategy into a successful desk doing millions in monthly recurring revenue. CTO Shashij Gupta researched NLP at Google Research and has a first-author paper on testing transformers with 50+ citations from his work at ETH Zurich. President Sidhant Kabra comes from consulting, advising CEOs at Fortune 500 companies, led a 100+ member team handling customer experience, and drove product and growth at an edtech startup scaling from 0 to 200K+ users in six months.
Security Testing Layer
Beyond functional testing, Cekura launched red teaming capabilities targeting compliance-heavy sectors. The platform runs thousands of adversarial simulations in minutes, simulating sophisticated prompt injection attacks to see if agents will ignore instructions or reveal system prompts. Tests check for hidden biases in financial, medical, or recruitment advice to ensure compliance, attempt to provoke agents into unprofessional or offensive behavior, and try to extract sensitive data like credit card numbers or internal keys.
The Security focus addresses acute enterprise concerns. Prompt injections are the number one security vulnerability on the OWASP Top 10 for LLM Applications, according to IBM. Over 73% of production AI deployments assessed during security audits showed prompt injection vulnerabilities, according to a 2025 OWASP report. Enterprise security teams are blocking deployments because manual, vibe-based testing doesn’t provide enough assurance against adversarial users, whether it’s bypassing a paywall, tricking a bot into giving legal advice, or social-engineering it into leaking company secrets.
- Jailbreaking: Sophisticated prompt injection to override system instructions
- Bias & Fairness: Hidden biases in financial, medical, or recruitment advice
- Toxicity: Provoking agents into unprofessional or offensive responses
- PII Leakage: Attempting to extract credit card numbers, internal keys, or user data
Integration and Observability
Cekura helps with QA across the agent lifecycle, from pre-production simulation and evaluations to production call monitoring to assisting customers in setting up CI/CD pipelines. The platform connects to voice agents across Vapi, Retell, LiveKit, Pipecat, ElevenLabs, and other providers, supporting API-based execution and GitHub-driven CI workflows that allow test runs to trigger on model updates, prompt changes, or infrastructure shifts.
For production monitoring, the platform provides conversational analytics including customer sentiment, interruptions, latency and call analytics, identifies instances where agents fail to follow instructions, analyzes when and why users abandon calls, supports custom metrics for personalized call analysis, and proactively notifies users of critical issues like latency spikes or missed functions. Teams use Cekura to measure latency, barge-in, instruction-following, regressions, and more across phone, chat, SMS, and web.
Market Position
Cekura operates in an emerging category of AI testing infrastructure as enterprises shift from treating conversational AI as novelty to mission-critical systems. Expectations from conversational AI are shifting from novelty to mission-critical, and when a voice bot handles banking transactions or a chat AI assists in a medical query, failure is not an option. By 2028, 33% of enterprise software applications will contain agentic AI capabilities, rising from less than 1% in 2024, according to Gartner.
The company faces competition from platforms including Future AGI, Hamming, Bluejay, and Coval in Voice AI simulation. Cekura is one of the fastest-growing companies in its Y Combinator batch, with strong revenue traction, according to a job posting on the Y Combinator site. The Fall 2024 batch began in San Francisco on September 29, with Demo Day held in early December, according to Y Combinator.
“We’re helping make conversational AI agents more reliable and secure. So it goes without saying that we’d want Cartesia, a leader in the space, available for businesses to test on the platform.”
— Sidhant Kabra, Co-founder, Cekura
What to Watch
The $2.4 million round positions Cekura to expand headcount as enterprises accelerate agent deployments. The company is using funding to rapidly expand its team to move closer to its vision of making voice and chat AI agents reliable and secure. The hiring focus includes product engineers, forward deployed engineers, and sales roles to support growing enterprise demand.
Three factors will determine whether testing infrastructure becomes a standalone category or gets absorbed into agent development platforms: whether enterprises treat testing as a compliance gate or continuous process, how quickly open-source alternatives emerge for simulation and red teaming, and whether LLM providers build native testing capabilities into their APIs. Cekura’s compliance certifications—the platform is HIPAA and SOC compliant, providing redaction of transcripts and audio, role-based access, and audit trails—suggest the company is positioning for regulated industries where external validation carries weight.
The broader question is whether systematic testing can solve the fundamental reliability problem in agentic systems, or whether it merely surfaces issues that require architectural changes to LLM reasoning. For organizations with 10,000+ employees, write-in responses pointed to hallucinations and consistency of outputs as the biggest challenge in ensuring agent quality, according to LangChain’s State of AI Agents report. If testing reveals systematic failures across agent architectures, enterprises may need to rethink how much autonomy to delegate rather than simply adding more test coverage.