Documentation Index
Fetch the complete documentation index at: https://hud-f5fd7c15-feat-agent-orchestrator-cookbook.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
The HUD SDK provides the LegacyTask class for defining agent objectives and dataset utilities for managing task collections.
LegacyTask is deprecated. For new code, use env("scenario_name", **args) to create Task objects. See Environments for the recommended approach.
LegacyTask Class
from hud.datasets import LegacyTask
Pydantic model that defines an agent’s objective, setup, and evaluation criteria.
Fields:
| Field | Type | Description | Default |
|---|
id | str | None | Unique identifier (UUID recommended) | None |
prompt | str | Task instruction for the agent | Required |
mcp_config | dict[str, Any] | MCP server configuration | Required |
setup_tool | MCPToolCall | list[MCPToolCall] | None | Tool(s) to prepare environment | None |
evaluate_tool | MCPToolCall | list[MCPToolCall] | None | Tool(s) to score performance | None |
agent_config | BaseAgentConfig | dict[str, Any] | None | Agent configuration | None |
metadata | dict[str, Any] | Extra task metadata | {} |
Environment Variable Substitution
The mcp_config field automatically resolves environment variables using ${VAR_NAME} syntax:
task = LegacyTask(
prompt="Navigate to the dashboard",
mcp_config={
"browser": {
"url": "${HUD_MCP_URL:https://mcp.hud.ai/v3/mcp}",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-browser:latest"
}
}
}
)
Variables are resolved when LegacyTask is created from a dict - this is why datasets should store raw dictionaries.
Running Tasks
run_single_task
Execute a single task with tracing. This is the core execution primitive.
from hud.datasets import run_single_task
from hud.types import AgentType
result = await run_single_task(
task=task,
agent_type=AgentType.CLAUDE,
agent_params={"checkpoint_name": "claude-sonnet-4-5"},
max_steps=20,
trace_name="My Task",
)
print(f"Reward: {result.reward}")
Parameters:
| Parameter | Type | Description | Default |
|---|
task | Task | Task to execute | Required |
agent_type | AgentType | Agent type to use | Required |
agent_params | dict | Parameters for agent.create() | None |
max_steps | int | Maximum steps | 10 |
job_id | str | Job ID for telemetry | None |
task_id | str | Task ID for telemetry | None |
group_id | str | Group ID for variance runs | None |
trace_name | str | Name for the trace | None |
metadata | dict | Additional trace metadata | None |
run_tasks
Run multiple tasks with automatic job tracking. This is the primary evaluation function.
from hud.datasets import run_tasks
from hud.types import AgentType
results = await run_tasks(
tasks=tasks,
agent_type=AgentType.CLAUDE,
agent_params={"checkpoint_name": "claude-sonnet-4-5"},
name="SheetBench Evaluation",
max_concurrent=50,
max_steps=50,
)
Parameters:
| Parameter | Type | Description | Default |
|---|
tasks | list[Task] | Tasks to execute | Required |
agent_type | AgentType | Agent type to use | Required |
agent_params | dict | Parameters for agent.create() | None |
name | str | Job name for tracking | "Evaluation" |
max_concurrent | int | Maximum concurrent tasks | 30 |
metadata | dict | Job metadata | None |
max_steps | int | Max steps per task | 10 |
group_size | int | Runs per task (variance estimation) | 1 |
remote | bool | Submit to HUD platform | False |
Returns:
- If
remote=True: Empty list (fire-and-forget)
- If
group_size=1: List of Trace results
- If
group_size>1: List of statistics dicts
Examples:
# Run filtered tasks locally
all_tasks = load_tasks("hud-evals/SheetBench-50")
selected = [t for t in all_tasks if "login" in t.prompt.lower()]
results = await run_tasks(selected, AgentType.CLAUDE)
# Run with variance estimation
stats = await run_tasks(tasks, AgentType.CLAUDE, group_size=3)
# Submit for remote execution
await run_tasks(tasks, AgentType.CLAUDE, remote=True)
Remote Execution
Submit tasks to the HUD platform for execution in the cloud:
from hud.datasets import run_tasks
from hud.types import AgentType
# Submit and return immediately
await run_tasks(
tasks=tasks,
agent_type=AgentType.CLAUDE,
agent_params={"checkpoint_name": "claude-sonnet-4-5"},
name="Remote Evaluation",
remote=True,
)
# Monitor at https://hud.so/jobs/{job_id}
Cancellation
Cancel running remote jobs:
from hud.datasets.utils import cancel_job, cancel_task, cancel_all_jobs
# Cancel specific job
await cancel_job(job_id)
# Cancel specific task
await cancel_task(job_id, task_id)
# Cancel ALL your active jobs (panic button)
await cancel_all_jobs()
Or use the CLI:
hud cancel --job <job_id>
hud cancel --task <job_id> <task_id>
hud cancel --all
Saving Tasks
from hud.datasets import save_tasks
# IMPORTANT: Pass dicts, not Task objects!
task_dicts = [task.model_dump() for task in tasks]
save_tasks(
tasks=task_dicts,
repo_id="my-tasks",
private=False,
)
Always pass dictionaries to preserve environment variable templates. Task objects have already resolved ${VAR} to actual values!
Agent Config Options
The agent_config field on tasks supports:
| Option | Type | Description |
|---|
system_prompt | str | Appended to agent’s default system prompt |
allowed_tools | list[str] | Tools the agent can use |
disallowed_tools | list[str] | Tools to hide from the agent |
append_setup_output | bool | Include setup output in first message |
initial_screenshot | bool | Take screenshot before first action |
task = LegacyTask(
prompt="Complete the form",
mcp_config={...},
agent_config={
"system_prompt": "Be careful with form fields",
"allowed_tools": ["computer", "search"],
}
)
Best Practices
- Use UUIDs for task IDs - Required for HuggingFace datasets
- Save dictionaries, not objects - Preserves env var templates
- Start with single tasks - Debug before scaling up
- Use group_size for variance - Run each task 3-5 times for reliable metrics
- Use remote for large evals - Better parallelization and monitoring
See Also