Documentation Index
Fetch the complete documentation index at: https://hud-f5fd7c15-feat-agent-orchestrator-cookbook.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
This page covers how to build datasets, run evaluations, and publish results.
Quick Start
# Run a local tasks file (interactive agent selection)
hud eval tasks.json
# Run a hosted dataset with Claude
hud eval hud-evals/SheetBench-50 claude --full
import hud
# Single task evaluation
async with hud.eval("hud-evals/SheetBench-50:0") as ctx:
agent = MyAgent()
result = await agent.run(ctx)
ctx.reward = result.reward
# All tasks with variants
async with hud.eval(
"hud-evals/SheetBench-50:*",
variants={"model": ["claude-sonnet", "gpt-4o"]},
group=3,
max_concurrent=50,
) as ctx:
agent = create_agent(model=ctx.variants["model"])
result = await agent.run(ctx)
ctx.reward = result.reward
from hud.datasets import run_tasks
from hud.types import AgentType
from hud.utils.tasks import load_tasks
tasks = load_tasks("hud-evals/SheetBench-50")
results = await run_tasks(
tasks=tasks,
agent_type=AgentType.CLAUDE,
max_concurrent=50,
)
Build Benchmarks
Explore Evaluators
hud analyze hudpython/hud-remote-browser:latest
Create Tasks
import uuid
web_tasks = []
web_tasks.append({
"id": str(uuid.uuid4()),
"prompt": "Navigate to the documentation page",
"mcp_config": {
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-remote-browser:latest"
}
}
},
"setup_tool": {"name": "setup", "arguments": {"name": "navigate", "arguments": {"url": "https://example.com"}}},
"evaluate_tool": {"name": "evaluate", "arguments": {"name": "url_match", "arguments": {"pattern": ".*/docs.*"}}},
"metadata": {"difficulty": "easy", "category": "navigation"}
})
Save to HuggingFace
from hud.utils.tasks import save_tasks
save_tasks(
web_tasks,
repo_id="web-navigation-benchmark",
private=False,
tags=["web", "navigation", "automation"],
)
Leaderboards
After running, visit your dataset leaderboard and publish a scorecard:
from hud.datasets import run_tasks
from hud.types import AgentType
from hud.utils.tasks import load_tasks
tasks = load_tasks("hud-evals/SheetBench-50")
results = await run_tasks(
tasks=tasks,
agent_type=AgentType.CLAUDE,
name="Claude Sonnet SheetBench",
)
# Open https://hud.ai/leaderboards/hud-evals/SheetBench-50
# Click "My Jobs" to see runs and create a scorecard
Best Practices
- Clear, measurable prompts (binary or graded)
- Isolated task state and deterministic setup
- Use metadata tags (category, difficulty)
- Validate locally, then parallelize
- Version datasets; include a
system_prompt.txt
See Also