HUD gives you three things: a unified API for every model, a way to turn your code into agent-callable tools, and infrastructure to run evaluations at scale.
A production API is one live instance with shared state, so you can’t run 1,000 parallel tests without them stepping on each other. Environments spin up fresh for every evaluation: isolated, deterministic, reproducible. Each evaluation also generates training data.

Turn your code into tools agents can call. Define scenarios that evaluate what agents do:

    from hud import Environment

    env = Environment("my-env")

    # `db` stands in for your own data layer; it is not provided by hud.
    @env.tool()
    def search(query: str) -> str:
        """Search the knowledge base."""
        return db.search(query)

    @env.scenario("find-answer")
    async def find_answer(question: str):
        # First yield: send the prompt out and receive the agent's answer.
        answer = yield f"Find the answer to: {question}"
        # Second yield: return the score for that answer.
        yield 1.0 if "correct" in answer.lower() else 0.0

Scenarios define the prompt (first yield) and the scoring logic (second yield). The agent runs in between.

→ More on Environments
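To make the two-yield shape concrete, here is a minimal sketch of how such a generator can be driven, using only the standard library. It does not import `hud` and is not HUD's actual runner: `run_scenario` and `fake_agent` are hypothetical names introduced for illustration, and the scenario is a plain async generator rather than one registered with `@env.scenario`.

```python
import asyncio

# A plain async generator with the same two-yield shape as a scenario.
async def find_answer(question: str):
    answer = yield f"Find the answer to: {question}"   # prompt out, answer in
    yield 1.0 if "correct" in answer.lower() else 0.0  # score out

async def run_scenario(scenario, agent):
    """Hypothetical driver: fetch the prompt, run the agent, send the answer back."""
    prompt = await scenario.__anext__()   # runs the generator up to the first yield
    answer = await agent(prompt)          # the agent works "in between"
    score = await scenario.asend(answer)  # resumes it; runs up to the second yield
    await scenario.aclose()               # finalize the generator
    return prompt, score

async def fake_agent(prompt: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    return "The correct answer is 4."

prompt, score = asyncio.run(run_scenario(find_answer("What is 2+2?"), fake_agent))
print(prompt, score)
```

The value passed to `asend` becomes the result of the first `yield` expression, which is why the scoring line can read `answer` directly.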
Run your scenario with different models. Compare results:

    import hud

    task = env("find-answer", question="What is 2+2?")

    # `client` is assumed to be an OpenAI-compatible async client created elsewhere.
    # Run this inside an async function (or a notebook with top-level await).
    async with hud.eval(task, variants={"model": ["gpt-4o", "claude-sonnet-4-5"]}, group=5) as ctx:
        response = await client.chat.completions.create(
            model=ctx.variants["model"],
            messages=[{"role": "user", "content": ctx.prompt}],
        )
        await ctx.submit(response.choices[0].message.content)

Variants test different configurations. Groups repeat each variant (here, 5 times) so you can see the distribution of scores rather than a single run. Results show up on hud.ai with scores, traces, and side-by-side comparisons.

→ More on A/B Evals
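The arithmetic behind variants and groups can be sketched in plain Python. This is an illustration, not how `hud.eval` is implemented: `expand_runs` and `summarize` are hypothetical helpers, the sketch assumes that multiple variant keys would combine as a Cartesian product, and the scores are made-up placeholders, not real results for any model.

```python
from itertools import product
from statistics import mean, pstdev

def expand_runs(variants: dict, group: int) -> list:
    """Every combination of variant values, each repeated `group` times."""
    keys = list(variants)
    combos = [dict(zip(keys, values)) for values in product(*variants.values())]
    return [dict(combo) for combo in combos for _ in range(group)]

def summarize(scores_by_variant: dict) -> dict:
    """Mean and population standard deviation of the scores for each variant."""
    return {label: (mean(s), pstdev(s)) for label, s in scores_by_variant.items()}

# Two models, five repeats each: ten runs in total.
runs = expand_runs({"model": ["gpt-4o", "claude-sonnet-4-5"]}, group=5)
print(len(runs))

# Placeholder scores for illustration only (not measured results).
placeholder_scores = {
    "variant-a": [1.0, 0.0, 1.0, 1.0, 0.0],
    "variant-b": [1.0, 1.0, 1.0, 0.0, 1.0],
}
summary = summarize(placeholder_scores)
print(summary)
```

Repeating each variant matters because a single run of a stochastic agent says little; the spread across a group shows whether a difference between variants is larger than run-to-run noise.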