AI Evaluation Specialist - QA


Kore.ai is a pioneering force in enterprise AI transformation, empowering organizations through our comprehensive agentic AI platform. With innovative offerings across "AI for Service," "AI for Work," and "AI for Process," we enable more than 400 Global 2000 companies to fundamentally reimagine their operations, customer experiences, and employee productivity.

Our end-to-end platform enables enterprises to build, deploy, manage, monitor, and continuously improve agentic applications at scale. We automate more than 1 billion interactions every year with voice and digital AI in customer service, and we have transformed the experiences of tens of thousands of employees through productivity and AI-driven workflow automation.

Recognized as a leader by Gartner, Forrester, IDC, ISG, and Everest, Kore.ai has secured Series D funding of $150M, including strategic investment from NVIDIA to drive Enterprise AI innovation. Founded in 2014 and headquartered in Florida, we maintain a global presence with offices in India, UK, Germany, Korea, and Japan.

You can find full press coverage at https://kore.ai/press/


POSITION:
Senior AI Evaluation Specialist


POSITION SUMMARY:
We are seeking a Senior AI Evaluation Specialist to design and execute robust evaluation methodologies for Generative and Agentic AI systems. This role bridges AI product quality, evaluation science, and responsible AI governance, ensuring every AI feature, agent, and model release is measured, benchmarked, and validated using standardized frameworks.

The ideal candidate combines a QA mindset, ML evaluation rigor, and hands-on coding expertise to benchmark LLMs, multi-agent workflows, and GenAI APIs, driving consistent, measurable, and safe AI product performance.


LOCATION: Hyderabad (Work from Office)


RESPONSIBILITIES:


1. AI Evaluation & Benchmarking

Build and maintain end-to-end evaluation pipelines for Generative and Agentic AI features (e.g., chat, reasoning agents, RAG workflows, summarization, classification).

Implement standardized evaluation frameworks such as RAGAS, G-Eval, HELM, PromptBench, MT-Bench, or custom evaluation harnesses (a minimal custom-harness sketch follows this list).

Define and measure core AI quality metrics — accuracy, groundedness, coherence, contextual recall, hallucination rate, and response time.

Create reproducible benchmarks, leaderboards, and regression tracking for models and agents across multiple releases or providers (OpenAI, Anthropic, Mistral, etc.).
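For illustration only, below is a minimal sketch of the kind of custom evaluation harness described above, assuming a tiny in-memory test set and a stubbed model function; exact-match accuracy and latency stand in for the richer metrics (groundedness, coherence, hallucination rate) a production harness would compute.

    import time
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class EvalCase:
        prompt: str
        expected: str  # reference answer for exact-match scoring

    def evaluate(model_fn: Callable[[str], str], cases: List[EvalCase]) -> dict:
        """Run every case through the model and aggregate simple quality metrics."""
        correct, latencies = 0, []
        for case in cases:
            start = time.perf_counter()
            answer = model_fn(case.prompt)
            latencies.append(time.perf_counter() - start)
            # Exact match stands in for richer metrics (groundedness, coherence, etc.).
            correct += int(answer.strip().lower() == case.expected.strip().lower())
        return {
            "accuracy": correct / len(cases),
            "avg_latency_s": sum(latencies) / len(latencies),
            "n_cases": len(cases),
        }

    if __name__ == "__main__":
        # Stubbed model so the harness runs end to end without any API keys.
        def echo_model(prompt: str) -> str:
            return "Paris" if "France" in prompt else "unknown"

        cases = [
            EvalCase("What is the capital of France?", "Paris"),
            EvalCase("What is the capital of Japan?", "Tokyo"),
        ]
        print(evaluate(echo_model, cases))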


2. Agentic AI Evaluation

Evaluate multi-agent systems and autonomous AI workflows, measuring task success rates, reasoning trace quality, and tool-use efficiency (see the trace-scoring sketch after this list).

Assess Agentic AI behaviors such as planning accuracy, goal completion rate, context handoff success, and inter-agent communication reliability.

Validate decision-making transparency and error recovery mechanisms in autonomous agent frameworks (LangGraph, AutoGen, CrewAI, etc.).

Design agent-specific evaluation scenarios — simulated environments, user-in-the-loop testing, and “mission-based” performance scoring.
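As a simplified sketch of mission-based scoring over recorded agent traces, the following assumes a hypothetical trace format with a goal-completion flag and per-tool-call outcomes; real traces from LangGraph, AutoGen, or CrewAI would need to be adapted into this shape.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ToolCall:
        name: str
        succeeded: bool

    @dataclass
    class AgentTrace:
        mission: str
        goal_completed: bool
        tool_calls: List[ToolCall] = field(default_factory=list)

    def score_traces(traces: List[AgentTrace]) -> dict:
        """Aggregate task success rate and tool-use efficiency over a batch of traces."""
        completed = sum(t.goal_completed for t in traces)
        all_calls = [c for t in traces for c in t.tool_calls]
        successful_calls = sum(c.succeeded for c in all_calls)
        return {
            "task_success_rate": completed / len(traces),
            "tool_call_success_rate": successful_calls / len(all_calls) if all_calls else None,
            "avg_tool_calls_per_mission": len(all_calls) / len(traces),
        }

    if __name__ == "__main__":
        traces = [
            AgentTrace("book a meeting", True, [ToolCall("calendar.create", True)]),
            AgentTrace("summarize ticket", False,
                       [ToolCall("crm.fetch", True), ToolCall("crm.fetch", False)]),
        ]
        print(score_traces(traces))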


3. Experimentation & Automation

Develop Python-based evaluation scripts to automate testing using OpenAI, Anthropic, and Hugging Face APIs.

Conduct large-scale comparative studies across prompts, models, and fine-tuned variants, analyzing quantitative and qualitative differences (a small comparison sketch follows this list).

Integrate evaluations into CI/CD pipelines to enable continuous AI quality monitoring.

Visualize results using dashboards (Plotly, Streamlit, Dash, or Grafana).
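A minimal sketch of a comparative study script is shown below, assuming the openai Python SDK (v1.x) with an OPENAI_API_KEY set in the environment; the prompts, model names, and print-based reporting are placeholders, not Kore.ai's actual benchmark suite.

    from openai import OpenAI  # assumes `pip install openai` (v1.x)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPTS = [
        "Summarize in one sentence: The order shipped on Monday and arrived Friday.",
        "Classify the sentiment (positive/negative): 'The agent resolved my issue quickly.'",
    ]
    MODELS = ["gpt-4o-mini", "gpt-4o"]  # placeholder names; substitute the models your account exposes

    def run_comparison() -> None:
        for model in MODELS:
            for prompt in PROMPTS:
                resp = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0,  # reduce randomness for more reproducible comparisons
                )
                answer = resp.choices[0].message.content
                print(f"[{model}] {prompt[:40]}... -> {answer[:80]}")

    if __name__ == "__main__":
        run_comparison()

In practice the responses would be scored and logged to the dashboards mentioned above rather than printed.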


4. Quality Governance & Reporting

Define and enforce AI acceptance thresholds before deployment (a minimal threshold-gate sketch follows this list).

Collaborate with Responsible AI teams to evaluate bias, fairness, safety, and privacy implications.

Produce detailed evaluation reports and audit logs for model releases and governance boards.

Present findings to Product, Data Science, and Executive stakeholders — transforming metrics into actionable insights.
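Below is a minimal sketch of how acceptance thresholds might be enforced as a CI quality gate, assuming pytest and a hypothetical eval_results.json file produced by an earlier evaluation step in the pipeline; the threshold values are illustrative, not actual governance policy.

    # test_quality_gate.py -- run with `pytest` as a step in the CI pipeline.
    import json
    from pathlib import Path

    # Hypothetical file written by an earlier evaluation step in the pipeline.
    RESULTS_FILE = Path("eval_results.json")

    # Example thresholds only; real acceptance criteria would come from governance policy.
    THRESHOLDS = {
        "accuracy": 0.90,           # minimum acceptable
        "hallucination_rate": 0.05  # maximum acceptable
    }

    def load_results() -> dict:
        return json.loads(RESULTS_FILE.read_text())

    def test_accuracy_meets_threshold():
        results = load_results()
        assert results["accuracy"] >= THRESHOLDS["accuracy"], (
            f"accuracy {results['accuracy']:.3f} is below the release threshold"
        )

    def test_hallucination_rate_within_budget():
        results = load_results()
        assert results["hallucination_rate"] <= THRESHOLDS["hallucination_rate"], (
            f"hallucination rate {results['hallucination_rate']:.3f} exceeds the budget"
        )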


5. Collaboration & Continuous Improvement

Work closely with Prompt Engineers, ML Scientists, and QA Engineers to close the loop between testing and improvement.

Support Product teams in defining evaluation-driven release criteria.

Mentor junior evaluators in AI testing methodologies, benchmarking, and analysis.

Keep abreast of advances in LLM evaluation research, Agentic AI frameworks, and tool-calling reliability testing.


QUALIFICATIONS / SKILLS REQUIRED:


  • Programming: Python (Pandas, NumPy, LangChain, LangGraph, OpenAI/Anthropic SDKs)
  • Evaluation Frameworks: RAGAS, HELM, G-Eval, MT-Bench, PromptBench, custom scoring pipelines
  • GenAI APIs: OpenAI GPT-4/5, Claude, Gemini, Mistral, Azure OpenAI
  • Agentic AI: Understanding of multi-agent orchestration, tool use, reasoning traces, and planning frameworks (AutoGen, CrewAI, LangGraph)
  • Metrics Knowledge: BLEU, ROUGE, cosine similarity, factuality, coherence, bias, toxicity, reasoning success rate
  • Data & Analytics: JSON parsing, prompt dataset curation, result visualization
  • Tooling: Git, Jupyter/Colab, Jira, Confluence, evaluation dashboards
  • Soft Skills: Analytical communication, documentation excellence, cross-team collaboration
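
As a small worked example of one metric from the skills list above, cosine similarity between two embedding vectors can be computed directly with NumPy; the vectors here are toy values, whereas in practice they would come from an embedding model.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Toy 4-dimensional "embeddings"; real ones would come from an embedding model.
    reference = np.array([0.1, 0.3, 0.5, 0.7])
    candidate = np.array([0.1, 0.25, 0.55, 0.65])
    print(round(cosine_similarity(reference, candidate), 4))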


EDUCATION & EXPERIENCE:

  • Bachelor’s or Master’s degree in Computer Science, AI, Data Science, or related discipline.
  • 5 to 10 years of total experience, with at least 3 years in AI evaluation, GenAI QA, or LLM quality analysis.
  • Strong understanding of AI/ML model lifecycle, prompt engineering, and RAG or agentic architectures.
  • Experience contributing to AI safety, reliability, or responsible AI initiatives.
