
Results Gallery

All figures on this page are generated from the repository codebase. Core figures come from scripts/build_docs_figures.py; the end-to-end figures are emitted by the tutorial notebooks in notebooks/.

Bandit estimator comparison

Bandit estimator comparison with uncertainty
Bandit OPE estimates with uncertainty for IS and WIS compared to ground truth.

What it shows:

  • Relative bias and uncertainty across estimators.
  • Ground-truth reference line from the synthetic benchmark.
  • Practical spread between IS and WIS under the same data.

Why it matters:

  • Reveals how stable each estimator is before you act on a point estimate.
  • Highlights the value of diagnostics and confidence intervals in small samples.
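The IS/WIS spread is easy to reproduce on a toy problem. A minimal sketch, with a made-up two-action bandit rather than the repository's benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-action bandit (illustrative, not the repo's benchmark).
# Behavior policy plays action 1 with probability 0.3; target with 0.8.
n, p_b, p_t = 500, 0.3, 0.8
actions = rng.binomial(1, p_b, size=n)                       # logged actions
rewards = rng.normal(loc=actions.astype(float), scale=1.0)   # action 1 pays ~1

# Importance ratios pi_target(a) / pi_behavior(a) for the logged actions
w = np.where(actions == 1, p_t / p_b, (1 - p_t) / (1 - p_b))

is_est = np.mean(w * rewards)              # ordinary IS: unbiased, higher variance
wis_est = np.sum(w * rewards) / np.sum(w)  # WIS: self-normalized, biased but stabler

# Ground truth under the target policy: 0.8 * 1.0 + 0.2 * 0.0 = 0.8
print(f"IS={is_est:.3f}  WIS={wis_est:.3f}  truth=0.800")
```

Both estimates should land near 0.8, with WIS typically tighter at this sample size; the self-normalization trades a small bias for lower variance.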

MDP estimator comparison

MDP estimator comparison with uncertainty
MDP OPE estimates for IS, WIS, PDIS, and DR with uncertainty bands.

What it shows:

  • Trajectory-based estimators across multiple horizons.
  • The effect of model-based correction (DR) on variance.
  • A direct comparison against ground truth.

Why it matters:

  • Demonstrates why horizon length makes diagnostics essential.
  • Provides a baseline for choosing estimators on real data.
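Per-decision IS (PDIS) is a small change to full-trajectory IS: each reward is weighted only by the action ratios up to its own time step, which removes the later steps' weight noise from earlier rewards. A sketch on synthetic trajectories (policies, horizon, and reward model here are invented for illustration; DR would additionally subtract a learned value-model baseline, which is omitted to keep the sketch short):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic trajectories (illustrative): binary actions at each of H steps.
# Behavior plays action 1 w.p. 0.5 per step; target plays it w.p. 0.7.
n, H, p_b, p_t = 500, 10, 0.5, 0.7
a = rng.binomial(1, p_b, size=(n, H))
r = rng.normal(loc=a.astype(float), scale=0.5)     # per-step rewards

step_ratio = np.where(a == 1, p_t / p_b, (1 - p_t) / (1 - p_b))
rho = np.cumprod(step_ratio, axis=1)               # cumulative weights rho_t

is_est = np.mean(rho[:, -1] * r.sum(axis=1))       # full-trajectory IS
pdis_est = np.mean((rho * r).sum(axis=1))          # per-decision IS

# Ground truth: each step pays 0.7 in expectation under the target, so 7.0.
print(f"IS={is_est:.2f}  PDIS={pdis_est:.2f}  truth=7.00")
```

The gap between the two widens with H: the full-trajectory weight rho[:, -1] compounds mismatch over all ten steps, while PDIS only pays that cost for late rewards.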

Overlap diagnostics

Overlap ratios histogram
Distribution of target/behavior importance ratios for bandit data.

What it shows:

  • Whether target actions are supported by the behavior policy.
  • Heavy tails that signal unstable importance weighting.

Why it matters:

  • Poor overlap is the fastest path to unreliable OPE.
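A cheap way to screen overlap before trusting any estimator is to inspect the logged ratios directly. A sketch with hypothetical propensities (`pi_b` and `pi_t` below are synthetic stand-ins for the logged behavior and target propensities of each observed action):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical propensities of each logged action under the two policies.
n = 1000
pi_b = rng.uniform(0.05, 0.95, size=n)   # behavior propensity (bounded away from 0)
pi_t = rng.uniform(0.0, 1.0, size=n)     # target propensity for the same action
ratio = pi_t / pi_b

# Two cheap overlap diagnostics:
ess = ratio.sum() ** 2 / (ratio ** 2).sum()   # effective sample size
top = np.argsort(ratio)[-n // 100:]           # heaviest 1% of ratios
tail_share = ratio[top].sum() / ratio.sum()   # their share of total weight

print(f"ESS {ess:.0f}/{n}, top-1% carries {tail_share:.0%} of total weight")
```

A low ESS or a large top-1% share is the histogram's heavy tail in numeric form: a handful of samples will dominate any importance-weighted estimate.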

Effective sample size over time

Effective sample size by time step
ESS drops over time for MDP trajectories under importance weighting.

What it shows:

  • How effective sample size decays with horizon.
  • The variance cost of long sequences.

Why it matters:

  • Long horizons require careful estimator choice and diagnostics.
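The decay is simple to reproduce: step ratios with mean one still compound into an ever-wider weight distribution. A sketch with synthetic lognormal step ratios (the distribution and its parameters are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Step-wise ratios with mean 1 but nonzero variance; cumulative products
# spread out over the horizon, so ESS decays even with no systematic drift.
# lognormal(-0.02, 0.2) has mean exp(-0.02 + 0.2**2 / 2) = 1.
n, H = 500, 20
step_ratio = rng.lognormal(mean=-0.02, sigma=0.2, size=(n, H))
rho = np.cumprod(step_ratio, axis=1)

# ESS at each time step from the cumulative weights
ess_t = rho.sum(axis=0) ** 2 / (rho ** 2).sum(axis=0)
print(f"ESS at t=1: {ess_t[0]:.0f}, at t={H}: {ess_t[-1]:.0f} (of {n})")
```

ESS falls roughly geometrically in the horizon, which is the quantitative version of "long sequences cost variance".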

Weighted reward distribution

Weighted rewards histogram
Importance-weighted rewards plotted on a log scale.

What it shows:

  • How a few large weights can dominate estimates.
  • The tail behavior that affects estimator stability.

Why it matters:

  • Motivates clipping, diagnostics, and sensitivity checks.
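The tail's dominance can be measured directly, and weight clipping shows the bias/variance trade it motivates. A sketch with synthetic heavy-tailed weights (the Pareto tail and the cap of 10 are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Heavy-tailed weights and bounded rewards (both synthetic).
n = 2000
w = rng.pareto(2.0, size=n) + 1.0      # Pareto tail: rare, very large weights
r = rng.uniform(0.0, 1.0, size=n)

raw = np.mean(w * r)
clipped = np.mean(np.clip(w, None, 10.0) * r)   # cap weights at 10

top = np.argsort(w)[-n // 100:]                 # heaviest 1% of weights
share = (w[top] * r[top]).sum() / (w * r).sum()

print(f"raw={raw:.3f}  clipped={clipped:.3f}, top-1% carries {share:.0%}")
```

Clipping can only pull the estimate down here (weights shrink, rewards are nonnegative), so the raw/clipped gap is itself a useful sensitivity check on the tail.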

Sensitivity bounds

Sensitivity bounds curve
Bounded-confounding sensitivity curve for bandit OPE.

What it shows:

  • Lower and upper bounds as confounding strength increases.
  • A compact summary of robustness to unobserved bias.

Why it matters:

  • Sensitivity curves help quantify uncertainty beyond point estimates.
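As a rough illustration of the shape of such a curve, one can let every nominal weight be off by a worst-case factor in [1/gamma, gamma] and bound the self-normalized estimate under that perturbation. This is a deliberately crude, conservative sketch, not the method behind the figure above:

```python
import numpy as np

rng = np.random.default_rng(5)

# Nominal importance weights and rewards (synthetic placeholders).
n = 1000
w = rng.lognormal(0.0, 0.5, size=n)
r = rng.uniform(0.0, 1.0, size=n)

def crude_bounds(w, r, gamma):
    """Bounds on the self-normalized IS estimate when each weight may be off
    by any factor in [1/gamma, gamma]. Uniform worst-case scaling of the
    numerator and denominator gives a valid (if loose) envelope for r >= 0."""
    lo = np.sum((w / gamma) * r) / np.sum(w * gamma)
    hi = np.sum((w * gamma) * r) / np.sum(w / gamma)
    return lo, hi

for gamma in (1.0, 1.25, 1.5, 2.0):
    lo, hi = crude_bounds(w, r, gamma)
    print(f"gamma={gamma:.2f}: [{lo:.3f}, {hi:.3f}]")
```

At gamma = 1 the interval collapses to the point estimate; widening it as gamma grows traces a sensitivity curve of the same general shape as the figure.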

End-to-end bandit workflow

End-to-end bandit estimator comparison
Estimator comparison from the bandit end-to-end notebook.

What it shows:

  • A full pipeline run with diagnostics and sensitivity outputs.
  • How estimates line up with the synthetic ground truth.

Why it matters:

  • Mirrors the workflow researchers use to audit new policies.

Long-horizon MDP comparison

Long-horizon MDP estimator comparison
Estimator comparison in a longer-horizon MDP.

What it shows:

  • Increased variance for IS/PDIS as horizon grows.
  • Stabilization from MIS/DICE and model-based estimators.

Why it matters:

  • Motivates long-horizon diagnostics and estimator selection.