Seeking experienced researchers and technical experts to support a frontier-model evaluation project focused on agentic workflows. You will design and validate challenging benchmark tasks in data science, machine learning, finance, and coding to help identify reasoning and problem-solving gaps in advanced STEM models. The role involves building real-world tasks with executable tests and analyzing model or agent behavior.
Design challenging, real-world STEM problems
Implement each task within an agentic development environment using Python
© 2026 Qureos. All rights reserved.