HUD Documentation — Evaluations and RL Environments.

This page covers how to build datasets, run evaluations, and publish results.

Quick Start

# Run a local tasks file (interactive agent selection)
hud eval tasks.json

# Run a hosted dataset with Claude
hud eval hud-evals/SheetBench-50 claude --full

SDK (Context Manager)

import hud

# Single task evaluation
async with hud.eval("hud-evals/SheetBench-50:0") as ctx:
    agent = MyAgent()
    result = await agent.run(ctx)
    ctx.reward = result.reward

# All tasks with variants
async with hud.eval(
    "hud-evals/SheetBench-50:*",
    variants={"model": ["claude-sonnet", "gpt-4o"]},
    group=3,
    max_concurrent=50,
) as ctx:
    agent = create_agent(model=ctx.variants["model"])
    result = await agent.run(ctx)
    ctx.reward = result.reward

SDK (Batch Execution)

from hud.datasets import run_tasks
from hud.types import AgentType
from hud.utils.tasks import load_tasks

tasks = load_tasks("hud-evals/SheetBench-50")
results = await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    max_concurrent=50,
)

Build Benchmarks

Explore Evaluators

hud analyze hudpython/hud-remote-browser:latest

Create Tasks

import uuid

web_tasks = []
web_tasks.append({
    "id": str(uuid.uuid4()),
    "prompt": "Navigate to the documentation page",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.ai/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-remote-browser:latest"
            }
        }
    },
    "setup_tool": {"name": "setup", "arguments": {"name": "navigate", "arguments": {"url": "https://example.com"}}},
    "evaluate_tool": {"name": "evaluate", "arguments": {"name": "url_match", "arguments": {"pattern": ".*/docs.*"}}},
    "metadata": {"difficulty": "easy", "category": "navigation"}
})

Save to HuggingFace

from hud.utils.tasks import save_tasks

save_tasks(
  web_tasks,
  repo_id="web-navigation-benchmark",
  private=False,
  tags=["web", "navigation", "automation"],
)

Leaderboards

After running, visit your dataset leaderboard and publish a scorecard:

from hud.datasets import run_tasks
from hud.types import AgentType
from hud.utils.tasks import load_tasks

tasks = load_tasks("hud-evals/SheetBench-50")
results = await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    name="Claude Sonnet SheetBench",
)

# Open https://hud.ai/leaderboards/hud-evals/SheetBench-50
# Click "My Jobs" to see runs and create a scorecard

Best Practices

Clear, measurable prompts (binary or graded)
Isolated task state and deterministic setup
Use metadata tags (category, difficulty)
Validate locally, then parallelize
Version datasets; include a system_prompt.txt

Documentation Index

​Quick Start

​Build Benchmarks

​Explore Evaluators

​Create Tasks

​Save to HuggingFace

​Leaderboards

​Best Practices

​See Also

Quick Start

Build Benchmarks

Explore Evaluators

Create Tasks

Save to HuggingFace

Leaderboards

Best Practices

See Also