Skip to main content

Documentation Index

Fetch the complete documentation index at: https://hud-f5fd7c15-feat-agent-orchestrator-cookbook.mintlify.app/llms.txt

Use this file to discover all available pages before exploring further.

The HUD SDK provides the LegacyTask class for defining agent objectives and dataset utilities for managing task collections.
LegacyTask is deprecated. For new code, use env("scenario_name", **args) to create Task objects. See Environments for the recommended approach.

LegacyTask Class

from hud.datasets import LegacyTask
Pydantic model that defines an agent’s objective, setup, and evaluation criteria. Fields:
FieldTypeDescriptionDefault
idstr | NoneUnique identifier (UUID recommended)None
promptstrTask instruction for the agentRequired
mcp_configdict[str, Any]MCP server configurationRequired
setup_toolMCPToolCall | list[MCPToolCall] | NoneTool(s) to prepare environmentNone
evaluate_toolMCPToolCall | list[MCPToolCall] | NoneTool(s) to score performanceNone
agent_configBaseAgentConfig | dict[str, Any] | NoneAgent configurationNone
metadatadict[str, Any]Extra task metadata{}

Environment Variable Substitution

The mcp_config field automatically resolves environment variables using ${VAR_NAME} syntax:
task = LegacyTask(
    prompt="Navigate to the dashboard",
    mcp_config={
        "browser": {
            "url": "${HUD_MCP_URL:https://mcp.hud.ai/v3/mcp}",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    }
)
Variables are resolved when LegacyTask is created from a dict - this is why datasets should store raw dictionaries.

Running Tasks

run_single_task

Execute a single task with tracing. This is the core execution primitive.
from hud.datasets import run_single_task
from hud.types import AgentType

result = await run_single_task(
    task=task,
    agent_type=AgentType.CLAUDE,
    agent_params={"checkpoint_name": "claude-sonnet-4-5"},
    max_steps=20,
    trace_name="My Task",
)
print(f"Reward: {result.reward}")
Parameters:
ParameterTypeDescriptionDefault
taskTaskTask to executeRequired
agent_typeAgentTypeAgent type to useRequired
agent_paramsdictParameters for agent.create()None
max_stepsintMaximum steps10
job_idstrJob ID for telemetryNone
task_idstrTask ID for telemetryNone
group_idstrGroup ID for variance runsNone
trace_namestrName for the traceNone
metadatadictAdditional trace metadataNone

run_tasks

Run multiple tasks with automatic job tracking. This is the primary evaluation function.
from hud.datasets import run_tasks
from hud.types import AgentType

results = await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    agent_params={"checkpoint_name": "claude-sonnet-4-5"},
    name="SheetBench Evaluation",
    max_concurrent=50,
    max_steps=50,
)
Parameters:
ParameterTypeDescriptionDefault
taskslist[Task]Tasks to executeRequired
agent_typeAgentTypeAgent type to useRequired
agent_paramsdictParameters for agent.create()None
namestrJob name for tracking"Evaluation"
max_concurrentintMaximum concurrent tasks30
metadatadictJob metadataNone
max_stepsintMax steps per task10
group_sizeintRuns per task (variance estimation)1
remoteboolSubmit to HUD platformFalse
Returns:
  • If remote=True: Empty list (fire-and-forget)
  • If group_size=1: List of Trace results
  • If group_size>1: List of statistics dicts
Examples:
# Run filtered tasks locally
all_tasks = load_tasks("hud-evals/SheetBench-50")
selected = [t for t in all_tasks if "login" in t.prompt.lower()]
results = await run_tasks(selected, AgentType.CLAUDE)

# Run with variance estimation
stats = await run_tasks(tasks, AgentType.CLAUDE, group_size=3)

# Submit for remote execution
await run_tasks(tasks, AgentType.CLAUDE, remote=True)

Remote Execution

Submit tasks to the HUD platform for execution in the cloud:
from hud.datasets import run_tasks
from hud.types import AgentType

# Submit and return immediately
await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    agent_params={"checkpoint_name": "claude-sonnet-4-5"},
    name="Remote Evaluation",
    remote=True,
)
# Monitor at https://hud.so/jobs/{job_id}

Cancellation

Cancel running remote jobs:
from hud.datasets.utils import cancel_job, cancel_task, cancel_all_jobs

# Cancel specific job
await cancel_job(job_id)

# Cancel specific task
await cancel_task(job_id, task_id)

# Cancel ALL your active jobs (panic button)
await cancel_all_jobs()
Or use the CLI:
hud cancel --job <job_id>
hud cancel --task <job_id> <task_id>
hud cancel --all

Saving Tasks

from hud.datasets import save_tasks

# IMPORTANT: Pass dicts, not Task objects!
task_dicts = [task.model_dump() for task in tasks]
save_tasks(
    tasks=task_dicts,
    repo_id="my-tasks",
    private=False,
)
Always pass dictionaries to preserve environment variable templates. Task objects have already resolved ${VAR} to actual values!

Agent Config Options

The agent_config field on tasks supports:
OptionTypeDescription
system_promptstrAppended to agent’s default system prompt
allowed_toolslist[str]Tools the agent can use
disallowed_toolslist[str]Tools to hide from the agent
append_setup_outputboolInclude setup output in first message
initial_screenshotboolTake screenshot before first action
task = LegacyTask(
    prompt="Complete the form",
    mcp_config={...},
    agent_config={
        "system_prompt": "Be careful with form fields",
        "allowed_tools": ["computer", "search"],
    }
)

Best Practices

  1. Use UUIDs for task IDs - Required for HuggingFace datasets
  2. Save dictionaries, not objects - Preserves env var templates
  3. Start with single tasks - Debug before scaling up
  4. Use group_size for variance - Run each task 3-5 times for reliable metrics
  5. Use remote for large evals - Better parallelization and monitoring

See Also