
Results Gallery

All figures on this page are generated from the repository codebase. Core figures come from scripts/build_docs_figures.py; the end-to-end figures are emitted by the tutorial notebooks in notebooks/.

Bandit estimator comparison

Bandit estimator comparison with uncertainty
Bandit OPE estimates with uncertainty for IS and WIS compared to ground truth.

What it shows:

  • Relative bias and uncertainty across estimators.
  • Ground-truth reference line from the synthetic benchmark.
  • Practical spread between IS and WIS under the same data.

Why it matters:

  • Reveals how stable each estimator is before you act on a point estimate.
  • Highlights the value of diagnostics and confidence intervals in small samples.
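The IS/WIS spread is easy to reproduce on a toy problem. A minimal sketch, with a made-up two-action bandit rather than the repository's benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-action bandit (illustrative, not the repo's benchmark).
# Behavior policy plays action 1 with probability 0.3; target with 0.8.
n, p_b, p_t = 500, 0.3, 0.8
actions = rng.binomial(1, p_b, size=n)                       # logged actions
rewards = rng.normal(loc=actions.astype(float), scale=1.0)   # action 1 pays ~1

# Importance ratios pi_target(a) / pi_behavior(a) for the logged actions
w = np.where(actions == 1, p_t / p_b, (1 - p_t) / (1 - p_b))

is_est = np.mean(w * rewards)              # ordinary IS: unbiased, higher variance
wis_est = np.sum(w * rewards) / np.sum(w)  # WIS: self-normalized, biased but stabler

# Ground truth under the target policy: 0.8 * 1.0 + 0.2 * 0.0 = 0.8
print(f"IS={is_est:.3f}  WIS={wis_est:.3f}  truth=0.800")
```

Both estimates should land near 0.8, with WIS typically tighter at this sample size; the self-normalization trades a small bias for lower variance.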

MDP estimator comparison

MDP estimator comparison with uncertainty
MDP OPE estimates for IS, WIS, PDIS, and DR with uncertainty bands.

What it shows:

  • Trajectory-based estimators across multiple horizons.
  • The effect of model-based correction (DR) on variance.
  • A direct comparison against ground truth.

Why it matters:

  • Demonstrates why horizon length makes diagnostics essential.
  • Provides a baseline for choosing estimators on real data.
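Per-decision IS (PDIS) is a small change to full-trajectory IS: each reward is weighted only by the action ratios up to its own time step, which removes the later steps' weight noise from earlier rewards. A sketch on synthetic trajectories (policies, horizon, and reward model here are invented for illustration; DR would additionally subtract a learned value-model baseline, which is omitted to keep the sketch short):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic trajectories (illustrative): binary actions at each of H steps.
# Behavior plays action 1 w.p. 0.5 per step; target plays it w.p. 0.7.
n, H, p_b, p_t = 500, 10, 0.5, 0.7
a = rng.binomial(1, p_b, size=(n, H))
r = rng.normal(loc=a.astype(float), scale=0.5)     # per-step rewards

step_ratio = np.where(a == 1, p_t / p_b, (1 - p_t) / (1 - p_b))
rho = np.cumprod(step_ratio, axis=1)               # cumulative weights rho_t

is_est = np.mean(rho[:, -1] * r.sum(axis=1))       # full-trajectory IS
pdis_est = np.mean((rho * r).sum(axis=1))          # per-decision IS

# Ground truth: each step pays 0.7 in expectation under the target, so 7.0.
print(f"IS={is_est:.2f}  PDIS={pdis_est:.2f}  truth=7.00")
```

The gap between the two widens with H: the full-trajectory weight rho[:, -1] compounds mismatch over all ten steps, while PDIS only pays that cost for late rewards.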

Overlap diagnostics

Overlap ratios histogram
Distribution of target/behavior importance ratios for bandit data.

What it shows:

  • Whether target actions are supported by the behavior policy.
  • Heavy tails that signal unstable importance weighting.

Why it matters:

  • Poor overlap is the fastest path to unreliable OPE.
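A cheap way to screen overlap before trusting any estimator is to inspect the logged ratios directly. A sketch with hypothetical propensities (`pi_b` and `pi_t` below are synthetic stand-ins for the logged behavior and target propensities of each observed action):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical propensities of each logged action under the two policies.
n = 1000
pi_b = rng.uniform(0.05, 0.95, size=n)   # behavior propensity (bounded away from 0)
pi_t = rng.uniform(0.0, 1.0, size=n)     # target propensity for the same action
ratio = pi_t / pi_b

# Two cheap overlap diagnostics:
ess = ratio.sum() ** 2 / (ratio ** 2).sum()   # effective sample size
top = np.argsort(ratio)[-n // 100:]           # heaviest 1% of ratios
tail_share = ratio[top].sum() / ratio.sum()   # their share of total weight

print(f"ESS {ess:.0f}/{n}, top-1% carries {tail_share:.0%} of total weight")
```

A low ESS or a large top-1% share is the histogram's heavy tail in numeric form: a handful of samples will dominate any importance-weighted estimate.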

Effective sample size over time

Effective sample size by time step
ESS drops over time for MDP trajectories under importance weighting.

What it shows:

  • How effective sample size decays with horizon.
  • The variance cost of long sequences.

Why it matters:

  • Long horizons require careful estimator choice and diagnostics.
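The decay is simple to reproduce: step ratios with mean one still compound into an ever-wider weight distribution. A sketch with synthetic lognormal step ratios (the distribution and its parameters are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Step-wise ratios with mean 1 but nonzero variance; cumulative products
# spread out over the horizon, so ESS decays even with no systematic drift.
# lognormal(-0.02, 0.2) has mean exp(-0.02 + 0.2**2 / 2) = 1.
n, H = 500, 20
step_ratio = rng.lognormal(mean=-0.02, sigma=0.2, size=(n, H))
rho = np.cumprod(step_ratio, axis=1)

# ESS at each time step from the cumulative weights
ess_t = rho.sum(axis=0) ** 2 / (rho ** 2).sum(axis=0)
print(f"ESS at t=1: {ess_t[0]:.0f}, at t={H}: {ess_t[-1]:.0f} (of {n})")
```

ESS falls roughly geometrically in the horizon, which is the quantitative version of "long sequences cost variance".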

Weighted reward distribution

Weighted rewards histogram
Importance-weighted rewards plotted on a log scale.

What it shows:

  • How a few large weights can dominate estimates.
  • The tail behavior that affects estimator stability.

Why it matters:

  • Motivates clipping, diagnostics, and sensitivity checks.
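The tail's dominance can be measured directly, and weight clipping shows the bias/variance trade it motivates. A sketch with synthetic heavy-tailed weights (the Pareto tail and the cap of 10 are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Heavy-tailed weights and bounded rewards (both synthetic).
n = 2000
w = rng.pareto(2.0, size=n) + 1.0      # Pareto tail: rare, very large weights
r = rng.uniform(0.0, 1.0, size=n)

raw = np.mean(w * r)
clipped = np.mean(np.clip(w, None, 10.0) * r)   # cap weights at 10

top = np.argsort(w)[-n // 100:]                 # heaviest 1% of weights
share = (w[top] * r[top]).sum() / (w * r).sum()

print(f"raw={raw:.3f}  clipped={clipped:.3f}, top-1% carries {share:.0%}")
```

Clipping can only pull the estimate down here (weights shrink, rewards are nonnegative), so the raw/clipped gap is itself a useful sensitivity check on the tail.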

Sensitivity bounds

Sensitivity bounds curve
Bounded-confounding sensitivity curve for bandit OPE.

What it shows:

  • Lower and upper bounds as confounding strength increases.
  • A compact summary of robustness to unobserved bias.

Why it matters:

  • Sensitivity curves help quantify uncertainty beyond point estimates.
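As a rough illustration of the shape of such a curve, one can let every nominal weight be off by a worst-case factor in [1/gamma, gamma] and bound the self-normalized estimate under that perturbation. This is a deliberately crude, conservative sketch, not the method behind the figure above:

```python
import numpy as np

rng = np.random.default_rng(5)

# Nominal importance weights and rewards (synthetic placeholders).
n = 1000
w = rng.lognormal(0.0, 0.5, size=n)
r = rng.uniform(0.0, 1.0, size=n)

def crude_bounds(w, r, gamma):
    """Bounds on the self-normalized IS estimate when each weight may be off
    by any factor in [1/gamma, gamma]. Uniform worst-case scaling of the
    numerator and denominator gives a valid (if loose) envelope for r >= 0."""
    lo = np.sum((w / gamma) * r) / np.sum(w * gamma)
    hi = np.sum((w * gamma) * r) / np.sum(w / gamma)
    return lo, hi

for gamma in (1.0, 1.25, 1.5, 2.0):
    lo, hi = crude_bounds(w, r, gamma)
    print(f"gamma={gamma:.2f}: [{lo:.3f}, {hi:.3f}]")
```

At gamma = 1 the interval collapses to the point estimate; widening it as gamma grows traces a sensitivity curve of the same general shape as the figure.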

End-to-end bandit workflow

End-to-end bandit estimator comparison
Estimator comparison from the bandit end-to-end notebook.

What it shows:

  • A full pipeline run with diagnostics and sensitivity outputs.
  • How estimates line up with the synthetic ground truth.

Why it matters:

  • Mirrors the workflow researchers use to audit new policies.

Long-horizon MDP comparison

Long-horizon MDP estimator comparison
Estimator comparison in a longer-horizon MDP.

What it shows:

  • Increased variance for IS/PDIS as horizon grows.
  • Stabilization from MIS/DICE and model-based estimators.

Why it matters:

  • Motivates long-horizon diagnostics and estimator selection.