DevRadar
🤗 Hugging Face · Significant

Hugging Face Open-Sources "ml-intern": An AI Agent That Replicates ML Research Workflows

Hugging Face open-sourced "ml-intern", an AI agent that replicates ML research workflows. The agent completed a post-training internship task by implementing DeepMind research on test-time compute scaling, improving accuracy from 45% to 65% (+20 pp) with a last-step PRM prediction scoring strategy that outperformed greedy decoding, majority vote, and standard Best-of-N baselines. Released artifacts include a trained model on Math500, a weighted-results dataset, a Docker Space on a T4 GPU, and the take-home test template on GitHub.

Aksel · Thursday, April 23, 2026 · Original source


Summary

Hugging Face released "ml-intern", an autonomous AI agent that completed a post-training research internship by replicating a DeepMind baseline on test-time compute scaling. The agent improved accuracy from 45% to 65% (+20 pp) using last-step PRM prediction scoring, outperforming greedy decoding, majority vote, and standard Best-of-N baselines. Full implementation artifacts are now open source.
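The scoring strategies compared in the report can be sketched in a few lines. The candidate answers and per-step PRM scores below are illustrative stand-ins, not outputs of the released model; only the strategy names (majority vote, last-step PRM scoring) come from the source.

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common final answer across sampled completions."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n_last_step(answers, step_scores):
    """Score each completion by the PRM's prediction on its *last*
    reasoning step only; keep the answer with the highest such score."""
    best = max(range(len(answers)), key=lambda i: step_scores[i][-1])
    return answers[best]

# Toy example: 4 sampled solutions with assumed per-step PRM scores.
answers = ["12", "15", "12", "15"]
step_scores = [
    [0.9, 0.4],   # strong start, low-confidence final step
    [0.6, 0.95],  # high-confidence final step
    [0.5, 0.3],
    [0.7, 0.8],
]
print(majority_vote(answers))                     # "12" (first-seen wins the tie)
print(best_of_n_last_step(answers, step_scores))  # "15" (0.95 is the best last step)
```

On this toy input the two strategies disagree, which is exactly the situation the report's baseline comparison probes: last-step PRM scoring can override a (tied or wrong) majority when the reward model is confident in one completion's final step.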

Integration Strategy

When to Use This?

  • Research acceleration: Automating baseline replication for new papers
  • Benchmark evaluation: Systematic comparison of inference strategies
  • Post-training research: Iterating on reward model and scoring strategies
  • Reproducibility pipelines: Standardized research workflow automation

The take-home test template (github.com/huggingface/post-training-takehome) provides a reusable framework for evaluating agent capabilities on ML research tasks.

How to Integrate?

Available Artifacts:

  • Documentation: huggingface.co/blog/cmpatino/ml-intern-takehome (full technical report)
  • Trained Model: huggingface.co/cmpatino/math500-bon-exercise (PRM model on Math500)
  • Results Dataset: huggingface.co/datasets/cmpatino/math500-bon-weighted-results (experimental data)
  • Docker Space: Hugging Face Spaces, T4 GPU (live demo deployment)
  • Test Template: github.com/huggingface/post-training-takehome (evaluation framework)

Deployment: The Docker Space runs on T4 GPU hardware, providing accessible inference for experimentation without local GPU requirements.
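The "weighted results" in the released dataset refer to weighted Best-of-N, where reward-model scores are summed per unique final answer instead of selecting a single completion. A minimal sketch, with illustrative scores in place of real PRM outputs (the dataset's actual schema is not described in this report):

```python
from collections import defaultdict

def weighted_vote(answers, weights):
    """Weighted Best-of-N: sum reward-model scores per unique final
    answer and return the answer with the highest total weight."""
    totals = defaultdict(float)
    for ans, w in zip(answers, weights):
        totals[ans] += w
    return max(totals, key=totals.get)

# Illustrative values; real weights would come from PRM scores.
answers = ["12", "15", "12", "15"]
weights = [0.4, 0.95, 0.3, 0.8]
print(weighted_vote(answers, weights))  # "15" (0.95 + 0.8 > 0.4 + 0.3)
```

Unlike plain majority vote, this lets two confident completions outweigh two low-confidence ones even when the raw counts are tied.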

Compatibility

  • Framework: Hugging Face ecosystem (Transformers, PEFT)
  • Evaluation: Math500 benchmark
  • Infrastructure: Docker containers, Hugging Face Spaces
  • Integration points: Model Hub, Datasets Hub, Spaces deployment

Source

Source: @huggingface

References:
  • ml-intern Full Documentation
  • Math500 Trained Model
  • Weighted Results Dataset
  • Post-Training Takehome Test

Published: November 2024
DevRadar Analysis Date: 2026-04-23

Tags: #OpenSource, #LLM, #Inference, #Research, #Agentic