Evaluation Loop
Evaluation starts with:
python
from praxis_eval import EvalConfig, LocalPolicy, evaluate
policy = LocalPolicy(my_policy)
config = EvalConfig(task="libero_10", num_eval_per_task=5, output_dir="eval/libero")
result = evaluate("libero", policy=policy, config=config)evaluate(...) resolves the benchmark driver, builds benchmark-specific environment config from EvalConfig, prepares artifact directories, runs rollouts, writes results.json, and returns an EvalResult.
Driver Dispatch
Built-in drivers are registered lazily. A benchmark driver exposes:
contract: anEnvContractcontaining observation keys and anActionSpec.evaluate(policy, config): the method that runs rollout orchestration.
Most current-environment benchmarks use the async pool path:
- Resolve tasks from the benchmark-specific config.
- Build a persistent environment pool.
- Reset policy state.
- Send batched observations to the policy adapter.
- Validate batched actions with
ActionSpec. - Step environments, collect terminal metrics, and optionally record videos.
- Aggregate overall, per-group, and per-task metrics.
SimplerEnv and MS-HAB use a subprocess path. The evaluator starts or uses a praxis-remote policy endpoint, runs the external simulator command in the dedicated runtime, then reads task metrics and artifacts from disk.
EvalConfig
EvalConfig contains generic evaluation settings:
| Field | Purpose |
|---|---|
num_eval_per_task | Number of episodes per resolved task. |
output_dir | Directory for results and media. |
task | Benchmark task selector. |
task_ids | Optional task subset after task resolution. |
num_parallel_env | Parallel environment lanes for each wave. |
seed | Start seed for deterministic episode indexing. |
record_episodes_per_task | Number of videos to save per task. |
step_timeout_sec | Optional rollout step timeout. |
rollout_failure_retries | Retry count for rollout failures. |
policy_kwargs | Extra kwargs forwarded to the policy adapter. |
env_kwargs | Benchmark-specific environment settings. |
Benchmark-specific options belong in env_kwargs, not in the policy adapter.