Model Evals Jobs

Model evaluation, red-teaming, and safety assessment engineering roles.

50 open positions

Senior Staff ML Engineer

Graphcore

Senior Staff ML Engineer role at Graphcore focused on testing, validating, and benchmarking a complex ML software stack across AI accelerator hardware. The ideal candidate has deep experience with ML frameworks, model training/execution, and debugging functional and performance issues while collaborating across software and hardware teams.

MACHINE LEARNING · MLOPS INFRA · EVALS

Senior Staff ML Engineer

Graphcore

Cambridge, UK · staff · 19h ago

Senior Staff ML Engineer role at Graphcore focused on testing, validating, and benchmarking complex ML software stacks across AI accelerator hardware. The position requires deep expertise in ML systems, hands-on debugging of functional and performance issues, and close collaboration with software and hardware teams to ensure reliability and correctness across modern AI workloads.

MACHINE LEARNING · MLOPS INFRA · EVALS

2026 Graduate Software Engineer - AI/ML Test Systems

Graphcore

Bristol, UK · junior · 19h ago

This is a 2026 graduate software engineer role at Graphcore focused on AI/ML test systems and reliability testing. The position emphasizes hands-on debugging, performance validation, and problem-solving for AI accelerator hardware, ideal for recent graduates seeking practical exposure to AI/ML infrastructure and testing methodologies.

MACHINE LEARNING · SYSTEM DESIGN · EVALS

Tech Lead Manager - Multi Modal Foundation Models (Language)

Wayve

Sunnyvale · senior · 19h ago

Lead a team developing language grounding and reasoning capabilities for Wayve's multimodal foundation models used in autonomous driving. Drive foundational research at the intersection of large-scale pretraining, language understanding, and embodied agent alignment, with focus on grounded understanding in real-world contexts.

NLP LLMS · MACHINE LEARNING · COMPUTER VISION · FINE TUNING · +2

Tech Lead - AI Validation Systems Engineer

Wayve

Sunnyvale · senior · 19h ago

Tech Lead role at Wayve focused on building AI validation systems for autonomous driving. Responsible for designing test scenarios and safety assessments for end-to-end learned driving models, while leading a team of validation engineers. Bridges AI/ML expertise with systems engineering to establish trust and safety standards for the autonomous vehicle industry.

MACHINE LEARNING · COMPUTER VISION · REINFORCEMENT LEARNING · EVALS · +1

Principal Machine Learning Engineer, App SW

Wayve

Sunnyvale · principal · 19h ago

Principal ML Engineer role focused on developing state-of-the-art driving models for autonomous vehicles, spanning model architecture, data pipelines, and real-world deployment. Responsibilities include leading personalized and collaborative driving projects while collaborating across AI Platform, Simulation, and Robot Software teams to deliver scalable, production-ready systems.

MACHINE LEARNING · COMPUTER VISION · REINFORCEMENT LEARNING · SYSTEM DESIGN · +1

Senior Research Scientist (Must be based in UK)

PolyAI

London, United Kingdom · senior · 19h ago

PolyAI is seeking a Senior Research Scientist to lead cutting-edge research on large language model post-training for conversational voice assistants. The role focuses on novel approaches including conversational reinforcement learning, audio-native LLMs, streaming turn-taking, and reasoning model distillation. This is a research-focused leadership position requiring expertise in LLMs and dialogue systems, based in the UK.

NLP LLMS · FINE TUNING · REINFORCEMENT LEARNING · EVALS

Gulf Arabic/Bahraini Language Specialist - Part time contract (Must be based in UK)

PolyAI

United Kingdom · mid · 19h ago

PolyAI seeks a Gulf Arabic/Bahraini language specialist for part-time contract work (16 hours/week) based in the UK to improve voice assistant language understanding through training data creation and system prompt refinement. The role involves quality assurance testing, voice recording management, and iterative refinement of conversational AI systems for Gulf Arabic speakers.

NLP LLMS · EVALS

Applied Research Intern

Labelbox

San Francisco Bay Area · junior · 19h ago

Design and build evaluation systems and benchmarks for frontier LLMs and multimodal models, including reasoning, code, and agent-use tasks. Create post-training datasets and prototype RLHF/RLAIF/DPO-style training loops to measure and improve model performance on real-world tasks.

NLP LLMS · EVALS · FINE TUNING · REINFORCEMENT LEARNING

Applied Research Engineer

Labelbox

San Francisco Bay Area · mid · 19h ago

Develop cutting-edge systems for creating and leveraging high-quality human feedback data to train frontier AI models using techniques like RLHF and DPO. Design advanced methods to align human preferences with AI training processes and measure the quality and impact of human-in-the-loop data. Work at the intersection of applied research and engineering to solve critical data-centric challenges in modern AI development.

NLP LLMS · REINFORCEMENT LEARNING · FINE TUNING · EVALS · +1

Senior Software Engineer, AI - Simulation

Brex

Seattle, Washington, United States · senior · 19h ago

Lead the design and implementation of Brex's simulation and validation platform for AI-powered financial products. Own continuous testing infrastructure using synthetic data and scenario generation to ensure AI agents behave correctly under realistic conditions before reaching customers. Drive product quality through creative testing, regression detection, and safe capability validation across the organization.

AGENTS · EVALS · SYSTEM DESIGN · MLOPS INFRA

Senior Staff Machine Learning Engineer

Reddit

Remote - United States · Remote · staff · 19h ago

Reddit seeks a Senior Staff ML Engineer to lead the Relevance team and architect next-generation search and answer systems powered by modern AI. The role involves designing large-scale systems spanning query understanding, retrieval, ranking, and LLM-based answers to deliver highly relevant search experiences across Reddit's massive content corpus.

NLP LLMS · MACHINE LEARNING · RAG SYSTEMS · SYSTEM DESIGN · +1

Senior Research Engineer, Post-training & Evaluation

Reddit

Remote - United States · Remote · senior · 19h ago

Reddit is seeking a Senior Research Engineer to own the post-training and evaluation pipeline for their Reddit-native Large Language Models. You will architect evaluation suites, build internal benchmarks (Reddit Benchmark), and execute fine-tuning workflows to ensure models are safe, performant, and culturally aligned with Reddit communities. This role bridges applied research and massive-scale infrastructure, sitting at the core of Reddit's AI foundation.

NLP LLMS · FINE TUNING · EVALS · MLOPS INFRA

Director of Machine Learning, Safety & Mods

Reddit

Remote - United States · Remote · principal · 19h ago

Reddit seeks a Director of Machine Learning to lead safety and moderation ML initiatives, building industry-leading systems that detect and prevent harmful content at scale. The role combines strategic leadership with hands-on ML expertise in fine-tuned LLMs and transformer models, requiring cross-functional collaboration across product, engineering, and AI/ML platform teams to protect global users.

MACHINE LEARNING · NLP LLMS · FINE TUNING · MLOPS INFRA · +2

Head of Efficacy Research

Duolingo

New York, NY · senior · 19h ago

Lead Duolingo's Efficacy Research Lab to evaluate whether learning outcomes actually persist for 500M+ learners, expanding from language into math and new subjects. This is a hands-on leadership role combining strategic vision-setting with direct involvement in rigorous research design, team management, and cross-functional partnership with product and company leadership.

STATISTICS · MACHINE LEARNING · EVALS

Head of Efficacy Research

Duolingo

Pittsburgh, PA · senior · 19h ago

Lead Duolingo's Efficacy Research Lab to measure learning outcomes at scale across language and math education. Set strategic vision, manage a team of 3 researchers, and partner with product leaders to ensure rigorous research drives product improvements for 500M+ learners worldwide.

STATISTICS · MACHINE LEARNING · EVALS

Software Engineer, Fee Insights

Stripe

Bengaluru · senior · 19h ago

Build AI-powered fee explainability experiences for Stripe merchants using agentic systems and NLP to help users understand complex billing and pricing. Work across full-stack engineering—backend services, frontend dashboards, data pipelines, and AI agents—while ensuring reliability and accuracy at scale through robust evaluation frameworks.

AGENTS · RAG SYSTEMS · NLP LLMS · SYSTEM DESIGN · +2

Machine Learning Engineer, Supportability

Stripe

Toronto · mid · 19h ago

Stripe seeks an ML Engineer to design, build, and deploy state-of-the-art AI/ML models for supportability detection and compliance across their financial platform. The role focuses on scaling LLM-based systems using agentic approaches while balancing high-precision detection with system scalability for millions of merchants. You'll collaborate across teams to operationalize ML systems in production and drive innovation in financial AI applications.

MACHINE LEARNING · NLP LLMS · AGENTS · MLOPS INFRA · +2

Machine Learning Engineer, Stripe Assistant

Stripe

Seattle; San Francisco; New York City; Remote · Remote · senior · 19h ago

Build end-to-end ML and agent architecture for an intelligent Stripe Assistant that executes high-trust actions, provides accurate analytics, and orchestrates multi-tool capabilities. Establish rigorous evaluation frameworks, governance models, and human-in-the-loop execution patterns while driving improvements in quality, latency, cost, and availability.

NLP LLMS · AGENTS · RAG SYSTEMS · EVALS · +1

Sr. MLE, GAI Search Relevance - JB0069884

Moveworks

Mountain View, CA · senior · 19h ago

Sr. Machine Learning Engineer focused on improving search relevance for a QA platform using modern information retrieval and NLP techniques. Responsible for developing ranking features, designing experimentation platforms, and optimizing search quality metrics while addressing challenges like semantic understanding and data sparsity. Requires deep expertise in ML-based ranking models, text processing, and search systems with experience analyzing query logs to drive product improvements.

NLP LLMS · MACHINE LEARNING · RAG SYSTEMS · EVALS

Sr. MLE, GAI Search Platform - JB0070751

Moveworks

Mountain View, CA · staff · 19h ago

This is a Staff-level Machine Learning Engineering role leading Moveworks' GenAI Search Platform team, focusing on building conversational search and question-answering systems that combine traditional ML with LLMs. The position requires deep expertise in information retrieval, ranking, and search architecture, combined with strong technical leadership capabilities to mentor engineers and drive product innovation at scale.

NLP LLMS · MACHINE LEARNING · RAG SYSTEMS · AGENTS · +2

Senior Machine Learning Engineer II, NLU & Agentic AI

Moveworks

Mountain View, CA · staff · 19h ago

Senior Machine Learning Engineer role focused on advancing NLU and agentic AI capabilities at Moveworks. Requires expertise in LLM fine-tuning, conversational agents, compound AI systems, and production-grade ML infrastructure to build reliable enterprise copilot experiences. Strong emphasis on balancing model quality with latency, reliability, and end-to-end system performance.

NLP LLMS · AGENTS · FINE TUNING · EVALS · +2

Senior Machine Learning Engineer II, NLU & Agentic AI

Moveworks

San Francisco, CA · senior · 19h ago

Senior ML engineer role focused on building production NLU and agentic AI systems at scale. You'll work on LLM fine-tuning, reasoning strategies, multimodal agents, and compound AI system design while collaborating with annotation and product teams to deliver enterprise-grade conversational AI.

NLP LLMS · AGENTS · FINE TUNING · EVALS · +2

Principal Product Manager, Search Platform

Moveworks

Mountain View, CA · principal · 19h ago

Principal-level product management role leading the search platform and infrastructure for an advanced enterprise AI platform. Requires 5+ years of technical product management experience building ML/generative AI products, with deep expertise in NLP, semantic search, and real-time ML performance optimization. This position demands strong technical foundations, cross-functional collaboration skills, and the ability to translate cutting-edge AI research into differentiated product capabilities.

NLP LLMS · RAG SYSTEMS · MACHINE LEARNING · AGENTS · +2

MLE Manager, GAI Search Relevance - JB0070741

Moveworks

Mountain View, CA · senior · 19h ago

This is a senior engineering manager role leading a team focused on improving search relevance for AI-powered enterprise applications using LLMs and RAG techniques. The role combines technical leadership in machine learning with team management responsibilities, requiring expertise in search ranking, model development, evaluation methodologies, and cross-functional coordination.

NLP LLMS · MACHINE LEARNING · RAG SYSTEMS · EVALS · +1

AI Engineer - Gen AI/SWE - Weights & Biases

CoreWeave

Livingston, NJ / New York, NY / San Francisco, CA / Sunnyvale, CA / Bellevue, WA / Remote - US · senior · 19h ago

Senior AI Engineer role at Weights & Biases focused on building production-grade GenAI applications and agentic systems. You'll design end-to-end workflows from prompting through RAG to agents and evals, then publish reference implementations and teach the community. This is applied AI work at the research-to-product boundary, emphasizing reproducibility, responsible deployment, and shipping production systems.

NLP LLMS · RAG SYSTEMS · AGENTS · EVALS · +1

Vendor and Contract Manager, Safeguards

Anthropic

San Francisco, CA | New York City, NY | Washington, DC · mid · 2d ago

Own the end-to-end lifecycle of safety-critical vendor and partner relationships at Anthropic, including selection, contract negotiation, onboarding, and performance management. Build scalable procurement processes while navigating novel partnership structures like research collaborations and red-teaming engagements. Work across legal, finance, and technical teams to manage vendor performance and spending across verification, threat intelligence, and capability evaluation categories.

EVALS · SYSTEM DESIGN

Technical Program Manager, Safeguards (Infrastructure & Evals)

Anthropic

San Francisco, CA | New York City, NY | Seattle, WA · senior · 2d ago

Own operational health and reliability of Anthropic's AI safety infrastructure, including classifiers, detection pipelines, and evaluation platforms. Drive incident response, SLO management, and coordinate platform investments while ensuring safety-critical systems remain reliable and performant.

MLOPS INFRA · EVALS · SYSTEM DESIGN

Technical Policy Manager, Cyber Harms

Anthropic

Remote-Friendly (Travel-Required) | San Francisco, CA | Washington, DC · Remote · senior · 2d ago

Lead Anthropic's cyber safety efforts by managing a team that develops threat models and evaluation frameworks to detect harmful cyber behaviors in AI systems. Design safety systems and policies that balance enabling legitimate security research while preventing misuse by malicious actors, serving as the primary domain expert on cyber harms across the organization.

NLP LLMS · EVALS · SYSTEM DESIGN

Staff Research Engineer, Discovery Team

Anthropic

San Francisco, CA · staff · 2d ago

Staff-level research engineering role at Anthropic focused on building AI systems capable of scientific discovery. Requires 8+ years of ML experience with deep expertise in large-scale LLM training, distributed systems, and performance optimization to address bottlenecks in developing scientific AI agents.

NLP LLMS · MACHINE LEARNING · MLOPS INFRA · AGENTS · +2

Solutions Architect, National Security

Anthropic

Washington, DC · senior · 2d ago

Solutions Architect role focused on advising U.S. national security customers on Claude AI integration and deployment. Requires deep technical expertise in LLM systems, solution architecture, and the ability to translate complex AI capabilities into mission-critical applications while maintaining safety standards.

NLP LLMS · SYSTEM DESIGN · EVALS

Solutions Architect, Applied AI (State and Local Government, West)

Anthropic

San Francisco, CA · senior · 2d ago

Pre-sales Solutions Architect role at Anthropic focused on helping state and local government agencies architect and deploy Claude LLM solutions. Requires deep technical expertise in LLMs and AI systems combined with strong customer-facing skills, from initial discovery through deployment and performance evaluation.

NLP LLMS · RAG SYSTEMS · EVALS · SYSTEM DESIGN

Solutions Architect, Applied AI (Startups)

Anthropic

San Francisco, CA | New York City, NY · mid · 2d ago

Solutions Architect role focused on guiding startups through LLM adoption on Claude's platform, requiring 3+ years of technical customer-facing experience. The position bridges sales engineering and product expertise, demanding deep knowledge of LLM capabilities, evaluation frameworks, and AI architecture design while serving as a trusted technical advisor.

NLP LLMS · RAG SYSTEMS · AGENTS · EVALS · +1

Solutions Architect, Applied AI (Startups)

Anthropic

London, UK · senior · 2d ago

A senior-level Solutions Architect role focused on helping startups successfully adopt Claude by providing deep technical expertise, winning evaluations, and guiding LLM architecture decisions. The position requires 5+ years of technical customer-facing experience and strong startup ecosystem knowledge, blending sales engineering with product-level technical insight.

NLP LLMS · RAG SYSTEMS · EVALS · SYSTEM DESIGN

Solutions Architect, Applied AI (National Security)

Anthropic

Washington, DC · senior · 2d ago

A pre-sales solutions architect role focused on helping national security agencies adopt and integrate Claude LLMs into their technology stacks. The position requires deep technical expertise in LLM systems combined with strong customer-facing skills to architect scalable solutions, develop evaluation frameworks, and guide deployment from discovery through production. Success demands the ability to communicate complex AI concepts across technical and executive audiences while coordinating cross-functional teams.

NLP LLMS · RAG SYSTEMS · SYSTEM DESIGN · EVALS

Solutions Architect, Applied AI (Industries)

Anthropic

New York City, NY; San Francisco, CA | New York City, NY | Seattle, WA · senior · 2d ago

This role requires a senior technical architect with deep LLM expertise to serve as a pre-sales trusted advisor for enterprise Claude adoption. The position bridges technical and business domains, requiring the ability to design scalable AI solutions, architect integrations, and communicate complex capabilities to both engineering teams and executives. Success depends on translating customer requirements into technical implementations while coordinating across internal teams to drive customer success.

NLP LLMS · RAG SYSTEMS · EVALS · SYSTEM DESIGN

Solutions Architect, Applied AI (Industries)

Anthropic

London, UK · senior · 2d ago

Pre-sales Solutions Architect role focused on guiding enterprise customers through Claude AI adoption from discovery to deployment. Requires deep LLM technical expertise combined with customer-facing skills to design scalable architectures, develop evaluation frameworks, and serve as a trusted technical advisor bridging sales, product, and engineering teams.

NLP LLMS · RAG SYSTEMS · EVALS · SYSTEM DESIGN

Solutions Architect, Applied AI (Industries)

Anthropic

Munich, Germany · mid · 2d ago

Solutions Architect role at Anthropic focused on pre-sales technical advisory for enterprise customers adopting Claude LLM. Responsibilities include translating business requirements into technical solutions, guiding architecture decisions, developing evaluation frameworks, and serving as the primary technical advisor throughout the customer adoption journey from discovery through deployment.

NLP LLMS · RAG SYSTEMS · EVALS · SYSTEM DESIGN

Solutions Architect, Applied AI (Government Technology)

Anthropic

New York City, NY; Washington, DC · senior · 2d ago

Pre-sales Solutions Architect role at Anthropic focused on helping government and enterprise customers integrate Claude LLM into their technology stacks. The position requires deep technical expertise in LLMs and system architecture combined with customer-facing skills to advise on solution design, evaluation frameworks, and deployment strategies across the customer adoption journey.

NLP LLMS · RAG SYSTEMS · EVALS · SYSTEM DESIGN

Solutions Architect, Applied AI (Federal Civilian)

Anthropic

New York City, NY; Washington, DC · mid · 2d ago

This pre-sales solutions architect role focuses on helping federal civilian agencies adopt Claude by serving as a trusted technical advisor throughout their deployment journey. The position requires translating customer requirements into LLM-based solutions, creating evaluation frameworks, and architecting scalable integrations while maintaining alignment between business objectives and technical implementation.

NLP LLMS · RAG SYSTEMS · EVALS · SYSTEM DESIGN

Solutions Architect, Applied AI (Digital Native Business)

Anthropic

San Francisco, CA | New York City, NY · senior · 2d ago

This role is a pre-sales solutions architect at Anthropic focused on helping enterprise customers adopt Claude AI through technical advisory and architecture guidance. The position bridges technical expertise with customer-facing skills, requiring deep LLM knowledge, evaluation design, and the ability to translate complex technical concepts for both engineering and executive audiences.

NLP LLMS · RAG SYSTEMS · EVALS · SYSTEM DESIGN

Solutions Architect, Applied AI (Beneficial Deployments)

Anthropic

Bangalore, India · senior · 2d ago

This is a pre-sales Solutions Architect role focused on helping enterprise customers across India understand and deploy Claude LLMs into their technology stacks. The role requires deep technical expertise in LLM systems combined with strong customer-facing skills to architect solutions, develop evaluations, and guide customers from discovery through successful deployment.

NLP LLMS · RAG SYSTEMS · SYSTEM DESIGN · EVALS

Solutions Architect, Applied AI

Anthropic

Sydney, Australia · senior · 2d ago

This pre-sales Solutions Architect role focuses on helping enterprise customers understand and integrate Claude LLMs into their technology stacks. The position requires deep technical expertise in LLM capabilities, architecture design, and evals, combined with strong customer-facing and communication skills to bridge business objectives with technical implementation across the entire adoption lifecycle.

NLP LLMS · RAG SYSTEMS · SYSTEM DESIGN · EVALS

Solutions Architect, Applied AI

Anthropic

Tokyo, Japan · senior · 2d ago

A pre-sales solutions architect role at Anthropic focused on helping large Japanese enterprises integrate Claude into their technology stacks. The position combines deep LLM technical expertise with customer-facing skills to architect scalable AI solutions, develop evaluation frameworks, and serve as a trusted technical advisor throughout the adoption journey.

NLP LLMS · RAG SYSTEMS · EVALS · SYSTEM DESIGN

Solutions Architect, Applied AI

Anthropic

Bangalore, India · mid · 2d ago

A pre-sales Solutions Architect role at Anthropic India focused on guiding enterprise customers through Claude LLM adoption, from technical discovery to deployment. The position requires deep LLM expertise combined with strong customer-facing skills to architect scalable AI solutions and serve as a technical advisor bridging Sales, Product, and Engineering teams.

NLP LLMS · RAG SYSTEMS · EVALS · SYSTEM DESIGN

Software Engineer, Safeguards Infrastructure

Anthropic

London, UK · mid · 2d ago

Anthropic is seeking a Software Engineer to build foundational safeguards infrastructure for monitoring AI models, detecting unwanted behaviors, and preventing misuse at scale. The role requires 4-10+ years of software engineering experience with strong Python skills and the ability to work across the full stack, with preference for candidates experienced in trust/safety systems and metrics infrastructure.

MLOPS INFRA · SYSTEM DESIGN · EVALS

Software Engineer, Cloud Inference Safeguards

Anthropic

San Francisco, CA | Seattle, WA · mid · 2d ago

Build and operate real-time safety guardrails for Claude on third-party cloud platforms, embedding detection classifiers and enforcement mechanisms directly into the inference serving path. Own the architecture, telemetry pipelines, and operational systems that enable safe deployment of frontier models at CSP partners while maintaining privacy and data residency compliance.

NLP LLMS · MLOPS INFRA · SYSTEM DESIGN · EVALS

Senior+ Software Engineer, Research Tools

Anthropic

San Francisco, CA | New York City, NY · senior · 2d ago

Build full-stack research tools and infrastructure that enable AI safety researchers to conduct experiments and extract insights from frontier AI systems. Work at the intersection of product thinking and engineering, directly partnering with research teams to understand workflows and rapidly ship solutions that accelerate productivity. High agency, independent project ownership with no prior ML or research experience required.

MLOPS INFRA · SYSTEM DESIGN · NLP LLMS · EVALS

Senior Research Scientist, Reward Models

Anthropic

Remote-Friendly (Travel Required) | San Francisco, CA · Remote · senior · 2d ago

Lead research scientist role focused on advancing reward modeling techniques for large language models, particularly RLHF training and LLM-based evaluation methods. The position emphasizes developing novel architectures, mitigating reward specification gaming, and translating research into production improvements while driving AI alignment and safety initiatives.

NLP LLMS · REINFORCEMENT LEARNING · FINE TUNING · EVALS · +1

Safeguards Policy Analyst, Fraud & Scams

Anthropic

Remote-Friendly (Travel-Required) | San Francisco, CA | New York City, NY · Remote · mid · 2d ago

This role focuses on designing and executing fraud and scam mitigation policies for Anthropic's AI products, requiring expertise in threat modeling, policy development, and enforcement workflows. The analyst will serve as the subject matter expert translating fraud ecosystem knowledge into scalable policies and guidelines that power both AI classifiers and human review processes. The position involves cross-functional collaboration to detect emerging fraud threats and maintain robust safeguards across Anthropic's product ecosystem.

NLP LLMS · EVALS · SYSTEM DESIGN