Model Evals Jobs
Model evaluation, red-teaming, and safety assessment engineering roles.
50 open positions
Senior Staff ML Engineer
Graphcore
Senior Staff ML Engineer role at Graphcore focused on testing, validating, and benchmarking a complex ML software stack across AI accelerator hardware. The ideal candidate has deep experience with ML frameworks, model training/execution, and debugging functional and performance issues while collaborating across software and hardware teams.
Senior Staff ML Engineer
Graphcore
Senior Staff ML Engineer role at Graphcore focused on testing, validating, and benchmarking complex ML software stacks across AI accelerator hardware. The position requires deep expertise in ML systems, hands-on debugging of functional and performance issues, and close collaboration with software and hardware teams to ensure reliability and correctness across modern AI workloads.
2026 Graduate Software Engineer - AI/ ML Test Systems
Graphcore
This is a 2026 graduate software engineer role at Graphcore focused on AI/ML test systems and reliability testing. The position emphasizes hands-on debugging, performance validation, and problem-solving for AI accelerator hardware, ideal for recent graduates seeking practical exposure to AI/ML infrastructure and testing methodologies.
Tech Lead Manager - Multi Modal Foundation Models (Language)
Wayve
Lead a team developing language grounding and reasoning capabilities for Wayve's multimodal foundation models used in autonomous driving. Drive foundational research at the intersection of large-scale pretraining, language understanding, and embodied agent alignment, with focus on grounded understanding in real-world contexts.
Tech Lead - AI Validation Systems Engineer
Wayve
Tech Lead role at Wayve focused on building AI validation systems for autonomous driving. Responsible for designing test scenarios and safety assessments for end-to-end learned driving models, while leading a team of validation engineers. Bridges AI/ML expertise with systems engineering to establish trust and safety standards for the autonomous vehicle industry.
Principal Machine Learning Engineer, App SW
Wayve
Principal ML Engineer role focused on developing state-of-the-art driving models for autonomous vehicles, spanning model architecture, data pipelines, and real-world deployment. Responsibilities include leading personalized and collaborative driving projects while collaborating across AI Platform, Simulation, and Robot Software teams to deliver scalable, production-ready systems.
Senior Research Scientist (Must be based in UK)
PolyAI
PolyAI is seeking a Senior Research Scientist to lead cutting-edge research on large language model post-training for conversational voice assistants. The role focuses on novel approaches including conversational reinforcement learning, audio-native LLMs, streaming turn-taking, and reasoning model distillation. This is a research-focused leadership position requiring expertise in LLMs and dialogue systems, based in the UK.
Gulf Arabic/Bahraini Language Specialist - Part time contract (Must be based in UK)
PolyAI
PolyAI seeks a Gulf Arabic/Bahraini language specialist for part-time contract work (16 hours/week) based in the UK to improve voice assistant language understanding through training data creation and system prompt refinement. The role involves quality assurance testing, voice recording management, and iterative refinement of conversational AI systems for Gulf Arabic speakers.
Applied Research Intern
Labelbox
Design and build evaluation systems and benchmarks for frontier LLMs and multimodal models, including reasoning, code, and agent-use tasks. Create post-training datasets and prototype RLHF/RLAIF/DPO-style training loops to measure and improve model performance on real-world tasks.
Applied Research Engineer
Labelbox
Develop cutting-edge systems for creating and leveraging high-quality human feedback data to train frontier AI models using techniques like RLHF and DPO. Design advanced methods to align human preferences with AI training processes and measure the quality and impact of human-in-the-loop data. Work at the intersection of applied research and engineering to solve critical data-centric challenges in modern AI development.
Senior Software Engineer, AI - Simulation
Brex
Lead the design and implementation of Brex's simulation and validation platform for AI-powered financial products. Own continuous testing infrastructure using synthetic data and scenario generation to ensure AI agents behave correctly under realistic conditions before reaching customers. Drive product quality through creative testing, regression detection, and safe capability validation across the organization.
Senior Staff Machine Learning Engineer
Reddit
Reddit seeks a Senior Staff ML Engineer to lead the Relevance team and architect next-generation search and answer systems powered by modern AI. The role involves designing large-scale systems spanning query understanding, retrieval, ranking, and LLM-based answers to deliver highly relevant search experiences across Reddit's massive content corpus.
Senior Research Engineer, Post-training & Evaluation
Reddit
Reddit is seeking a Senior Research Engineer to own the post-training and evaluation pipeline for their Reddit-native Large Language Models. You will architect evaluation suites, build internal benchmarks (Reddit Benchmark), and execute fine-tuning workflows to ensure models are safe, performant, and culturally aligned with Reddit communities. This role bridges applied research and massive-scale infrastructure, sitting at the core of Reddit's AI foundation.
Director of Machine Learning, Safety & Mods
Reddit
Reddit seeks a Director of Machine Learning to lead safety and moderation ML initiatives, building industry-leading systems that detect and prevent harmful content at scale. The role combines strategic leadership with hands-on ML expertise in fine-tuned LLMs and transformer models, requiring cross-functional collaboration across product, engineering, and AI/ML platform teams to protect global users.
Head of Efficacy Research
Duolingo
Lead Duolingo's Efficacy Research Lab to evaluate whether learning outcomes actually persist for 500M+ learners, expanding from language into math and new subjects. This is a hands-on leadership role combining strategic vision-setting with direct involvement in rigorous research design, team management, and cross-functional partnership with product and company leadership.
Head of Efficacy Research
Duolingo
Lead Duolingo's Efficacy Research Lab to measure learning outcomes at scale across language and math education. Set strategic vision, manage a team of 3 researchers, and partner with product leaders to ensure rigorous research drives product improvements for 500M+ learners worldwide.
Software Engineer, Fee Insights
Stripe
Build AI-powered fee explainability experiences for Stripe merchants using agentic systems and NLP to help users understand complex billing and pricing. Work across full-stack engineering—backend services, frontend dashboards, data pipelines, and AI agents—while ensuring reliability and accuracy at scale through robust evaluation frameworks.
Machine Learning Engineer, Supportability
Stripe
Stripe seeks an ML Engineer to design, build, and deploy state-of-the-art AI/ML models for supportability detection and compliance across their financial platform. The role focuses on scaling LLM-based systems using agentic approaches while balancing high-precision detection with system scalability for millions of merchants. You'll collaborate across teams to operationalize ML systems in production and drive innovation in financial AI applications.
Machine Learning Engineer, Stripe Assistant
Stripe
Build end-to-end ML and agent architecture for an intelligent Stripe Assistant that executes high-trust actions, provides accurate analytics, and orchestrates multi-tool capabilities. Establish rigorous evaluation frameworks, governance models, and human-in-the-loop execution patterns while driving improvements in quality, latency, cost, and availability.
Sr. MLE, GAI Search Relevance - JB0069884
Moveworks
Sr. Machine Learning Engineer focused on improving search relevance for a QA platform using modern information retrieval and NLP techniques. Responsible for developing ranking features, designing experimentation platforms, and optimizing search quality metrics while addressing challenges like semantic understanding and data sparsity. Requires deep expertise in ML-based ranking models, text processing, and search systems with experience analyzing query logs to drive product improvements.
Sr. MLE, GAI Search Platform - JB0070751
Moveworks
This is a Staff-level Machine Learning Engineering role leading Moveworks' GenAI Search Platform team, focusing on building conversational search and question-answering systems that combine traditional ML with LLMs. The position requires deep expertise in information retrieval, ranking, and search architecture, combined with strong technical leadership capabilities to mentor engineers and drive product innovation at scale.
Senior Machine Learning Engineer II, NLU & Agentic AI
Moveworks
Senior Machine Learning Engineer role focused on advancing NLU and agentic AI capabilities at Moveworks. Requires expertise in LLM fine-tuning, conversational agents, compound AI systems, and production-grade ML infrastructure to build reliable enterprise copilot experiences. Strong emphasis on balancing model quality with latency, reliability, and end-to-end system performance.
Senior Machine Learning Engineer II, NLU & Agentic AI
Moveworks
Senior ML engineer role focused on building production NLU and agentic AI systems at scale. You'll work on LLM fine-tuning, reasoning strategies, multimodal agents, and compound AI system design while collaborating with annotation and product teams to deliver enterprise-grade conversational AI.
Principal Product Manager, Search Platform
Moveworks
Principal-level product management role leading the search platform and infrastructure for an advanced enterprise AI platform. Requires 5+ years of technical product management experience building ML/generative AI products, with deep expertise in NLP, semantic search, and real-time ML performance optimization. This position demands strong technical foundations, cross-functional collaboration skills, and the ability to translate cutting-edge AI research into differentiated product capabilities.
MLE Manager, GAI Search Relevance - JB0070741
Moveworks
This is a senior engineering manager role leading a team focused on improving search relevance for AI-powered enterprise applications using LLMs and RAG techniques. The role combines technical leadership in machine learning with team management responsibilities, requiring expertise in search ranking, model development, evaluation methodologies, and cross-functional coordination.
AI Engineer- Gen AI/SWE- Weights & Biases
CoreWeave
Senior AI Engineer role at Weights & Biases focused on building production-grade GenAI applications and agentic systems. You'll design end-to-end workflows from prompting through RAG to agents and evals, then publish reference implementations and teach the community. This is applied AI work at the research-to-product boundary, emphasizing reproducibility, responsible deployment, and shipping production systems.
Vendor and Contract Manager, Safeguards
Anthropic
Own the end-to-end lifecycle of safety-critical vendor and partner relationships at Anthropic, including selection, contract negotiation, onboarding, and performance management. Build scalable procurement processes while navigating novel partnership structures like research collaborations and red-teaming engagements. Work across legal, finance, and technical teams to manage vendor performance and spending across verification, threat intelligence, and capability evaluation categories.
Technical Program Manager, Safeguards (Infrastructure & Evals)
Anthropic
Own operational health and reliability of Anthropic's AI safety infrastructure, including classifiers, detection pipelines, and evaluation platforms. Drive incident response, SLO management, and coordinate platform investments while ensuring safety-critical systems remain reliable and performant.
Technical Policy Manager, Cyber Harms
Anthropic
Lead Anthropic's cyber safety efforts by managing a team that develops threat models and evaluation frameworks to detect harmful cyber behaviors in AI systems. Design safety systems and policies that enable legitimate security research while preventing misuse by malicious actors, serving as the primary domain expert on cyber harms across the organization.
Staff Research Engineer, Discovery Team
Anthropic
Staff-level research engineering role at Anthropic focused on building AI systems capable of scientific discovery. Requires 8+ years of ML experience with deep expertise in large-scale LLM training, distributed systems, and performance optimization to address bottlenecks in developing scientific AI agents.
Solutions Architect, National Security
Anthropic
Solutions Architect role focused on advising U.S. national security customers on Claude AI integration and deployment. Requires deep technical expertise in LLM systems, solution architecture, and the ability to translate complex AI capabilities into mission-critical applications while maintaining safety standards.
Solutions Architect, Applied AI (State and Local Government, West)
Anthropic
Pre-sales Solutions Architect role at Anthropic focused on helping state and local government agencies architect and deploy Claude LLM solutions. Requires deep technical expertise in LLMs and AI systems combined with strong customer-facing skills, from initial discovery through deployment and performance evaluation.
Solutions Architect, Applied AI (Startups)
Anthropic
Solutions Architect role focused on guiding startups through LLM adoption on Claude's platform, requiring 3+ years of technical customer-facing experience. The position bridges sales engineering and product expertise, demanding deep knowledge of LLM capabilities, evaluation frameworks, and AI architecture design while serving as a trusted technical advisor.
Solutions Architect, Applied AI (Startups)
Anthropic
A senior-level Solutions Architect role focused on helping startups successfully adopt Claude by providing deep technical expertise, winning evaluations, and guiding LLM architecture decisions. The position requires 5+ years of technical customer-facing experience and strong startup ecosystem knowledge, blending sales engineering with product-level technical insight.
Solutions Architect, Applied AI (National Security)
Anthropic
A pre-sales solutions architect role focused on helping national security agencies adopt and integrate Claude LLMs into their technology stacks. The position requires deep technical expertise in LLM systems combined with strong customer-facing skills to architect scalable solutions, develop evaluation frameworks, and guide deployment from discovery through production. Success demands the ability to communicate complex AI concepts across technical and executive audiences while coordinating cross-functional teams.
Solutions Architect, Applied AI (Industries)
Anthropic
This role requires a senior technical architect with deep LLM expertise to serve as a pre-sales trusted advisor for enterprise Claude adoption. The position bridges technical and business domains, requiring the ability to design scalable AI solutions, architect integrations, and communicate complex capabilities to both engineering teams and executives. Success depends on translating customer requirements into technical implementations while coordinating across internal teams to drive customer success.
Solutions Architect, Applied AI (Industries)
Anthropic
Pre-sales Solutions Architect role focused on guiding enterprise customers through Claude AI adoption from discovery to deployment. Requires deep LLM technical expertise combined with customer-facing skills to design scalable architectures, develop evaluation frameworks, and serve as a trusted technical advisor bridging sales, product, and engineering teams.
Solutions Architect, Applied AI (Industries)
Anthropic
Solutions Architect role at Anthropic focused on pre-sales technical advisory for enterprise customers adopting Claude LLM. Responsibilities include translating business requirements into technical solutions, guiding architecture decisions, developing evaluation frameworks, and serving as the primary technical advisor throughout the customer adoption journey from discovery through deployment.
Solutions Architect, Applied AI (Government Technology)
Anthropic
Pre-sales Solutions Architect role at Anthropic focused on helping government and enterprise customers integrate Claude LLM into their technology stacks. The position requires deep technical expertise in LLMs and system architecture combined with customer-facing skills to advise on solution design, evaluation frameworks, and deployment strategies across the customer adoption journey.
Solutions Architect, Applied AI (Federal Civilian)
Anthropic
This pre-sales solutions architect role focuses on helping federal civilian agencies adopt Claude by serving as a trusted technical advisor throughout their deployment journey. The position requires translating customer requirements into LLM-based solutions, creating evaluation frameworks, and architecting scalable integrations while maintaining alignment between business objectives and technical implementation.
Solutions Architect, Applied AI (Digital Native Business)
Anthropic
This is a pre-sales solutions architect role at Anthropic focused on helping enterprise customers adopt Claude AI through technical advisory and architecture guidance. The position bridges technical expertise with customer-facing skills, requiring deep LLM knowledge, evaluation design, and the ability to translate complex technical concepts for both engineering and executive audiences.
Solutions Architect, Applied AI (Beneficial Deployments)
Anthropic
This is a pre-sales Solutions Architect role focused on helping enterprise customers across India understand and deploy Claude LLMs into their technology stacks. The role requires deep technical expertise in LLM systems combined with strong customer-facing skills to architect solutions, develop evaluations, and guide customers from discovery through successful deployment.
Solutions Architect, Applied AI
Anthropic
This pre-sales Solutions Architect role focuses on helping enterprise customers understand and integrate Claude LLMs into their technology stacks. The position requires deep technical expertise in LLM capabilities, architecture design, and evals, combined with strong customer-facing and communication skills to bridge business objectives with technical implementation across the entire adoption lifecycle.
Solutions Architect, Applied AI
Anthropic
A pre-sales solutions architect role at Anthropic focused on helping large Japanese enterprises integrate Claude into their technology stacks. The position combines deep LLM technical expertise with customer-facing skills to architect scalable AI solutions, develop evaluation frameworks, and serve as a trusted technical advisor throughout the adoption journey.
Solutions Architect, Applied AI
Anthropic
A pre-sales Solutions Architect role at Anthropic India focused on guiding enterprise customers through Claude LLM adoption, from technical discovery to deployment. The position requires deep LLM expertise combined with strong customer-facing skills to architect scalable AI solutions and serve as a technical advisor bridging Sales, Product, and Engineering teams.
Software Engineer, Safeguards Infrastructure
Anthropic
Anthropic is seeking a Software Engineer to build foundational safeguards infrastructure for monitoring AI models, detecting unwanted behaviors, and preventing misuse at scale. The role requires 4-10+ years of software engineering experience with strong Python skills and the ability to work across the full stack, with preference for candidates experienced in trust/safety systems and metrics infrastructure.
Software Engineer, Cloud Inference Safeguards
Anthropic
Build and operate real-time safety guardrails for Claude on third-party cloud platforms, embedding detection classifiers and enforcement mechanisms directly into the inference serving path. Own the architecture, telemetry pipelines, and operational systems that enable safe deployment of frontier models at CSP partners while maintaining privacy and data residency compliance.
Senior+ Software Engineer, Research Tools
Anthropic
Build full-stack research tools and infrastructure that enable AI safety researchers to conduct experiments and extract insights from frontier AI systems. Work at the intersection of product thinking and engineering, directly partnering with research teams to understand workflows and rapidly ship solutions that accelerate productivity. High agency, independent project ownership with no prior ML or research experience required.
Senior Research Scientist, Reward Models
Anthropic
Lead research scientist role focused on advancing reward modeling techniques for large language models, particularly RLHF training and LLM-based evaluation methods. The position emphasizes developing novel architectures, mitigating reward specification gaming, and translating research into production improvements while driving AI alignment and safety initiatives.
Safeguards Policy Analyst, Fraud & Scams
Anthropic
This role focuses on designing and executing fraud and scam mitigation policies for Anthropic's AI products, requiring expertise in threat modeling, policy development, and enforcement workflows. The analyst will serve as the subject matter expert translating fraud ecosystem knowledge into scalable policies and guidelines that power both AI classifiers and human review processes. The position involves cross-functional collaboration to detect emerging fraud threats and maintain robust safeguards across Anthropic's product ecosystem.