Skip to content

Results And Artifacts

evaluate(...) returns an EvalResult:

python
print(result.overall)
print(result.per_group)
print(result.per_task)
print(result.artifacts)
print(result.metadata)

Result Shape

FieldContents
overallAggregate metrics across all evaluated tasks.
per_taskMetrics keyed by group/task_id.
per_groupGrouped metrics when the benchmark exposes task groups.
artifactsPaths to results.json, output directory, media directory, and benchmark outputs.
metadataEvaluation mode, environment type, and caller-provided metadata.

Common metrics include:

  • success_rate;
  • avg_episode_length;
  • avg_reward;
  • avg_sum_reward;
  • avg_max_reward;
  • n_episodes;
  • eval_s;
  • eval_ep_s;
  • video_paths.

MS-HAB also reports success_once_rate, success_at_end_rate, and avg_return_per_step.

Artifact Layout

The default artifact layout is:

text
output_dir/
  results.json
  media/
    <task-or-benchmark-media>

results.json includes the aggregated result payload and an _meta block with timestamp, seed, mode, rollout settings, and caller metadata.

When record_episodes_per_task > 0, videos are written below the media directory. SimplerEnv and MS-HAB may run a dedicated single-env recording pass to avoid unstable or oversized multi-env video capture.

Released under the Apache-2.0 License.