HUD Documentation — Evaluations and RL Environments.

The HUD SDK provides the LegacyTask class for defining agent objectives and dataset utilities for managing task collections.

LegacyTask is deprecated. For new code, use env("scenario_name", **args) to create Task objects. See Environments for the recommended approach.

LegacyTask Class

from hud.datasets import LegacyTask

Pydantic model that defines an agent’s objective, setup, and evaluation criteria. Fields:

Field	Type	Description	Default
`id`	`str \| None`	Unique identifier (UUID recommended)	`None`
`prompt`	`str`	Task instruction for the agent	Required
`mcp_config`	`dict[str, Any]`	MCP server configuration	Required
`setup_tool`	`MCPToolCall \| list[MCPToolCall] \| None`	Tool(s) to prepare environment	`None`
`evaluate_tool`	`MCPToolCall \| list[MCPToolCall] \| None`	Tool(s) to score performance	`None`
`agent_config`	`BaseAgentConfig \| dict[str, Any] \| None`	Agent configuration	`None`
`metadata`	`dict[str, Any]`	Extra task metadata	`{}`

Environment Variable Substitution

The mcp_config field automatically resolves environment variables using ${VAR_NAME} syntax:

task = LegacyTask(
    prompt="Navigate to the dashboard",
    mcp_config={
        "browser": {
            "url": "${HUD_MCP_URL:https://mcp.hud.ai/v3/mcp}",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    }
)

Variables are resolved when LegacyTask is created from a dict - this is why datasets should store raw dictionaries.

Running Tasks

run_single_task

Execute a single task with tracing. This is the core execution primitive.

from hud.datasets import run_single_task
from hud.types import AgentType

result = await run_single_task(
    task=task,
    agent_type=AgentType.CLAUDE,
    agent_params={"checkpoint_name": "claude-sonnet-4-5"},
    max_steps=20,
    trace_name="My Task",
)
print(f"Reward: {result.reward}")

Parameters:

Parameter	Type	Description	Default
`task`	`Task`	Task to execute	Required
`agent_type`	`AgentType`	Agent type to use	Required
`agent_params`	`dict`	Parameters for `agent.create()`	`None`
`max_steps`	`int`	Maximum steps	`10`
`job_id`	`str`	Job ID for telemetry	`None`
`task_id`	`str`	Task ID for telemetry	`None`
`group_id`	`str`	Group ID for variance runs	`None`
`trace_name`	`str`	Name for the trace	`None`
`metadata`	`dict`	Additional trace metadata	`None`

run_tasks

Run multiple tasks with automatic job tracking. This is the primary evaluation function.

from hud.datasets import run_tasks
from hud.types import AgentType

results = await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    agent_params={"checkpoint_name": "claude-sonnet-4-5"},
    name="SheetBench Evaluation",
    max_concurrent=50,
    max_steps=50,
)

Parameters:

Parameter	Type	Description	Default
`tasks`	`list[Task]`	Tasks to execute	Required
`agent_type`	`AgentType`	Agent type to use	Required
`agent_params`	`dict`	Parameters for `agent.create()`	`None`
`name`	`str`	Job name for tracking	`"Evaluation"`
`max_concurrent`	`int`	Maximum concurrent tasks	`30`
`metadata`	`dict`	Job metadata	`None`
`max_steps`	`int`	Max steps per task	`10`
`group_size`	`int`	Runs per task (variance estimation)	`1`
`remote`	`bool`	Submit to HUD platform	`False`

Returns:

If remote=True: Empty list (fire-and-forget)
If group_size=1: List of Trace results
If group_size>1: List of statistics dicts

Examples:

# Run filtered tasks locally
all_tasks = load_tasks("hud-evals/SheetBench-50")
selected = [t for t in all_tasks if "login" in t.prompt.lower()]
results = await run_tasks(selected, AgentType.CLAUDE)

# Run with variance estimation
stats = await run_tasks(tasks, AgentType.CLAUDE, group_size=3)

# Submit for remote execution
await run_tasks(tasks, AgentType.CLAUDE, remote=True)

Remote Execution

Submit tasks to the HUD platform for execution in the cloud:

from hud.datasets import run_tasks
from hud.types import AgentType

# Submit and return immediately
await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    agent_params={"checkpoint_name": "claude-sonnet-4-5"},
    name="Remote Evaluation",
    remote=True,
)
# Monitor at https://hud.so/jobs/{job_id}

Cancellation

Cancel running remote jobs:

from hud.datasets.utils import cancel_job, cancel_task, cancel_all_jobs

# Cancel specific job
await cancel_job(job_id)

# Cancel specific task
await cancel_task(job_id, task_id)

# Cancel ALL your active jobs (panic button)
await cancel_all_jobs()

Or use the CLI:

hud cancel --job <job_id>
hud cancel --task <job_id> <task_id>
hud cancel --all

Saving Tasks

from hud.datasets import save_tasks

# IMPORTANT: Pass dicts, not Task objects!
task_dicts = [task.model_dump() for task in tasks]
save_tasks(
    tasks=task_dicts,
    repo_id="my-tasks",
    private=False,
)

Always pass dictionaries to preserve environment variable templates. Task objects have already resolved ${VAR} to actual values!

Agent Config Options

The agent_config field on tasks supports:

Option	Type	Description
`system_prompt`	`str`	Appended to agent’s default system prompt
`allowed_tools`	`list[str]`	Tools the agent can use
`disallowed_tools`	`list[str]`	Tools to hide from the agent
`append_setup_output`	`bool`	Include setup output in first message
`initial_screenshot`	`bool`	Take screenshot before first action

task = LegacyTask(
    prompt="Complete the form",
    mcp_config={...},
    agent_config={
        "system_prompt": "Be careful with form fields",
        "allowed_tools": ["computer", "search"],
    }
)

Best Practices

Use UUIDs for task IDs - Required for HuggingFace datasets
Save dictionaries, not objects - Preserves env var templates
Start with single tasks - Debug before scaling up
Use group_size for variance - Run each task 3-5 times for reliable metrics
Use remote for large evals - Better parallelization and monitoring

Documentation Index

​LegacyTask Class

​Environment Variable Substitution

​Running Tasks

​run_single_task

​run_tasks

​Remote Execution

​Cancellation

​Saving Tasks

​Agent Config Options

​Best Practices

​See Also

LegacyTask Class

Environment Variable Substitution

Running Tasks

run_single_task

run_tasks

Remote Execution

Cancellation

Saving Tasks

Agent Config Options

Best Practices

See Also