Skip to content

Config API

EvalConfig contains benchmark-agnostic evaluation settings. Benchmark-specific settings go in env_kwargs; model-specific inference settings go in policy_kwargs.

EvalConfig

python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Mapping


@dataclass(frozen=True)
class EvalConfig:
    num_eval_per_task: int
    output_dir: str | Path
    task: str | None = None
    task_ids: tuple[int, ...] | None = None
    num_parallel_env: int = 1
    seed: int = 42
    record_episodes_per_task: int = 0
    step_timeout_sec: float | None = None
    rollout_failure_retries: int = 1
    debug_verbose: bool = False
    policy_kwargs: Mapping[str, Any] = field(default_factory=dict)
    env_kwargs: Mapping[str, Any] = field(default_factory=dict)
    metadata: Mapping[str, Any] = field(default_factory=dict)
    runtime_hooks: EvalRuntimeHooks = field(default_factory=EvalRuntimeHooks)
FieldMeaning
num_eval_per_taskNumber of episodes per resolved task.
output_dirDirectory for results.json and media artifacts.
taskBenchmark task selector.
task_idsOptional subset after benchmark task expansion.
num_parallel_envParallel environment lanes.
seedStarting seed for deterministic episode indexing.
record_episodes_per_taskNumber of videos to save per task.
step_timeout_secOptional timeout for rollout steps.
rollout_failure_retriesRetry count for failed rollout waves.
debug_verboseLet benchmark runtimes print more detail.
policy_kwargsExtra kwargs forwarded to policy inference.
env_kwargsBenchmark-specific options.
metadataCaller-provided metadata written into result artifacts.
runtime_hooksOptional heartbeat callbacks.

EvalResult

python
@dataclass(frozen=True)
class EvalResult:
    overall: Mapping[str, Any]
    per_task: Mapping[str, Mapping[str, Any]]
    per_group: Mapping[str, Mapping[str, Any]] = field(default_factory=dict)
    artifacts: Mapping[str, Any] = field(default_factory=dict)
    metadata: Mapping[str, Any] = field(default_factory=dict)

overall, per_task, and per_group contain metrics. artifacts contains paths such as results_path, output_dir, and media_dir. metadata includes evaluator mode and environment type.

Config Helpers

The package also exposes small helpers for config parsing:

  • env_type_from_cfg(...)
  • optional_env_task(...)
  • optional_env_task_ids(...)
  • env_kwargs_without_type_task(...)
  • resolve_policy_kwargs(...)

They are intended for integration code that adapts external config systems into EvalConfig.

Released under the Apache-2.0 License.