Results Gallery
All figures on this page are generated from the repository codebase. Core
figures come from scripts/build_docs_figures.py, while end-to-end notebook
figures are emitted by the tutorial notebooks in notebooks/.
Bandit estimator comparison
What it shows:
- Relative bias and uncertainty across estimators.
- Ground-truth reference line from the synthetic benchmark.
- The practical spread between importance sampling (IS) and weighted importance sampling (WIS) on the same data.
Why it matters:
- Lets you check how stable each estimator is before you trust its estimate.
- Highlights the value of diagnostics and confidence intervals (CIs) in small samples.
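To make the IS versus WIS spread concrete, here is a minimal sketch on a hypothetical two-armed bandit. The policies, reward means, and the function name `is_and_wis` are illustrative assumptions and are not taken from the repository. IS divides the weighted reward sum by the sample count, while WIS divides it by the sum of the weights.

```python
import random


def is_and_wis(rewards, weights):
    """Return (IS, WIS) estimates from per-sample rewards and importance weights."""
    weighted_sum = sum(w * r for w, r in zip(weights, rewards))
    is_estimate = weighted_sum / len(rewards)   # normalize by sample count
    wis_estimate = weighted_sum / sum(weights)  # normalize by total weight
    return is_estimate, wis_estimate


# Hypothetical synthetic benchmark (all numbers are assumptions).
rng = random.Random(0)
behavior = [0.8, 0.2]     # logging policy over two actions
target = [0.2, 0.8]       # policy being evaluated
mean_reward = [0.3, 0.7]  # Bernoulli reward mean per action
true_value = sum(p * m for p, m in zip(target, mean_reward))

actions = [0 if rng.random() < behavior[0] else 1 for _ in range(50)]
rewards = [1.0 if rng.random() < mean_reward[a] else 0.0 for a in actions]
weights = [target[a] / behavior[a] for a in actions]

is_est, wis_est = is_and_wis(rewards, weights)
print(f"truth={true_value:.3f}  IS={is_est:.3f}  WIS={wis_est:.3f}")
```

With rewards in [0, 1], the WIS estimate always stays in [0, 1], whereas the IS estimate can leave that range when a few samples carry large weights.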
MDP estimator comparison
What it shows:
- Trajectory-based estimators across multiple horizons.
- The effect of the doubly robust (DR) model-based correction on variance.
- A direct comparison against ground truth.
Why it matters:
- Demonstrates why diagnostics become essential as the horizon grows and importance weights compound across steps.
- Provides a baseline for choosing estimators on real data.
Overlap diagnostics
What it shows:
- Whether target actions are supported by the behavior policy.
- Heavy tails that signal unstable importance weighting.
Why it matters:
- Poor overlap is the fastest path to unreliable off-policy evaluation (OPE).
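The two checks above can be sketched in a few lines. This is an illustrative helper, not the repository's diagnostic: the name `overlap_summary` and the returned keys are assumptions. The inputs are the probabilities that each policy assigns to the logged action of every sample.

```python
def overlap_summary(target_probs, behavior_probs):
    """Summarize support violations and weight concentration.

    target_probs[i] and behavior_probs[i] are the probabilities of the
    logged action i under the target and behavior policies.
    """
    # Samples the target policy can produce but the behavior policy cannot.
    unsupported = sum(
        1 for t, b in zip(target_probs, behavior_probs) if t > 0 and b == 0
    )
    # Importance weights are only defined where the behavior policy has mass.
    weights = [t / b for t, b in zip(target_probs, behavior_probs) if b > 0]
    total = sum(weights)
    max_weight = max(weights) if weights else 0.0
    return {
        "unsupported": unsupported,
        "max_weight": max_weight,
        # Share of total weight held by the single largest weight.
        "max_weight_share": max_weight / total if total > 0 else 0.0,
    }


summary = overlap_summary([0.5, 0.75, 0.25], [0.5, 0.25, 0.0])
print(summary)
```

A nonzero `unsupported` count or a `max_weight_share` close to 1 both suggest that importance-weighted estimates rest on very few samples.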
Effective sample size over time
What it shows:
- How effective sample size decays with horizon.
- The variance cost of long sequences.
Why it matters:
- Long horizons require careful estimator choice and diagnostics.
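As a rough illustration of this decay, the sketch below computes the effective sample size of cumulative importance weights at each step, using the common definition ESS = (sum of weights)^2 / (sum of squared weights). The two-action policies and the helper names are assumptions made for this example only.

```python
import random


def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum(w^2); equals len(weights) for uniform weights."""
    total = sum(weights)
    squared = sum(w * w for w in weights)
    return total * total / squared if squared > 0 else 0.0


def ess_by_step(step_ratios):
    """ESS of the cumulative weights after each step.

    step_ratios[i][t] is the target/behavior probability ratio for
    trajectory i at step t; all trajectories share the same horizon.
    """
    cumulative = [1.0] * len(step_ratios)
    curve = []
    for t in range(len(step_ratios[0])):
        cumulative = [c * traj[t] for c, traj in zip(cumulative, step_ratios)]
        curve.append(effective_sample_size(cumulative))
    return curve


# Hypothetical logged data: 200 trajectories, horizon 10, two actions.
rng = random.Random(0)
behavior, target = [0.6, 0.4], [0.4, 0.6]
horizon = 10
step_ratios = []
for _ in range(200):
    actions = [0 if rng.random() < behavior[0] else 1 for _ in range(horizon)]
    step_ratios.append([target[a] / behavior[a] for a in actions])

ess_curve = ess_by_step(step_ratios)
for t, ess in enumerate(ess_curve, start=1):
    print(f"step {t:2d}: ESS = {ess:6.1f}")
```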
Weighted reward distribution
What it shows:
- How a few large weights can dominate estimates.
- The tail behavior that affects estimator stability.
Why it matters:
- Motivates clipping, diagnostics, and sensitivity checks.
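A small sketch of weight clipping, under the assumption that weights are capped at a fixed threshold before averaging. The helper names and the toy numbers are illustrative and not taken from the repository. Clipping reduces the influence of extreme weights at the cost of introducing bias.

```python
def clipped_is(rewards, weights, cap):
    """Importance sampling estimate with each weight capped at `cap`."""
    clipped = [min(w, cap) for w in weights]
    return sum(w * r for w, r in zip(clipped, rewards)) / len(rewards)


def top_weight_share(weights, k=1):
    """Fraction of the total weight carried by the k largest weights."""
    ordered = sorted(weights, reverse=True)
    return sum(ordered[:k]) / sum(weights)


# Toy data: one sample carries most of the weight.
rewards = [1.0, 1.0, 1.0, 1.0]
weights = [0.5, 0.5, 0.5, 10.0]

print("share of largest weight:", top_weight_share(weights))
print("unclipped IS:", clipped_is(rewards, weights, cap=float("inf")))
print("clipped IS (cap=2):", clipped_is(rewards, weights, cap=2.0))
```

Rerunning the estimate over several caps is one simple form of the sensitivity check mentioned above: if the estimate moves a lot as the cap changes, a few samples are driving it.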
Sensitivity bounds
What it shows:
- Lower and upper bounds as confounding strength increases.
- A compact summary of robustness to unobserved bias.
Why it matters:
- Sensitivity curves show how far an estimate could move under unobserved confounding, which a point estimate alone cannot convey.
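One way such bounds can be computed is sketched below. It assumes a simple sensitivity model in which each true importance weight may differ from its nominal value by at most a factor `gamma`, and it bounds the self-normalized (WIS-style) estimate under that assumption. The repository's sensitivity model may differ; the function name and parameterization here are hypothetical. Weights are assumed to be strictly positive.

```python
def sensitivity_bounds(rewards, weights, gamma):
    """Bounds on sum(w*r)/sum(w) when each w_i lies in [w_i/gamma, w_i*gamma]."""
    if gamma < 1.0:
        raise ValueError("gamma must be >= 1")
    low_w = [w / gamma for w in weights]
    high_w = [w * gamma for w in weights]
    n = len(rewards)

    def extreme(order, pick):
        # The extreme value of a weighted mean over box-constrained weights
        # has a threshold form: the first k samples in `order` get inflated
        # weights and the rest get deflated weights. Try every k.
        best = None
        for k in range(n + 1):
            inflated = set(order[:k])
            num = den = 0.0
            for i in range(n):
                w = high_w[i] if i in inflated else low_w[i]
                num += w * rewards[i]
                den += w
            value = num / den
            best = value if best is None else pick(best, value)
        return best

    by_reward_desc = sorted(range(n), key=lambda i: rewards[i], reverse=True)
    upper = extreme(by_reward_desc, max)       # inflate high-reward samples
    lower = extreme(by_reward_desc[::-1], min)  # inflate low-reward samples
    return lower, upper


rewards = [1.0, 0.0, 1.0, 0.0, 1.0]
weights = [0.5, 2.0, 1.0, 1.5, 0.8]
for gamma in (1.0, 1.5, 2.0, 3.0):
    lo, hi = sensitivity_bounds(rewards, weights, gamma)
    print(f"gamma={gamma:.1f}: [{lo:.3f}, {hi:.3f}]")
```

At `gamma = 1` the interval collapses to the nominal self-normalized estimate, and it widens as `gamma` grows, which is the shape a sensitivity curve traces out.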
End-to-end bandit workflow
What it shows:
- A full pipeline run with diagnostics and sensitivity outputs.
- How estimates line up with the synthetic ground truth.
Why it matters:
- Mirrors the workflow researchers use to audit new policies.
Long-horizon MDP comparison
What it shows:
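- Increased variance for IS and per-decision importance sampling (PDIS) as the horizon grows.
- Stabilization from marginalized importance sampling (MIS), DICE (distribution correction estimation), and model-based estimators.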
Why it matters:
- Motivates long-horizon diagnostics and estimator selection.
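To show where the IS and PDIS estimators differ, here is a minimal single-trajectory sketch. Trajectory-wise IS multiplies the whole return by the product of all per-step ratios, while PDIS weights each reward only by the ratios up to that step. The function names and the `discount` parameter are assumptions for illustration; a full estimate would average these values over all logged trajectories.

```python
def trajectory_is(ratios, rewards, discount=1.0):
    """Full-trajectory importance weight times the discounted return."""
    weight = 1.0
    for rho in ratios:
        weight *= rho
    discounted_return = sum(discount ** t * r for t, r in enumerate(rewards))
    return weight * discounted_return


def per_decision_is(ratios, rewards, discount=1.0):
    """Each reward is weighted by the product of ratios up to its own step."""
    total = 0.0
    cumulative = 1.0
    for t, (rho, r) in enumerate(zip(ratios, rewards)):
        cumulative *= rho
        total += discount ** t * cumulative * r
    return total


# Toy trajectory: the reward arrives early, the large ratios arrive later.
ratios = [1.0, 2.0, 2.0, 2.0]
rewards = [1.0, 0.0, 0.0, 0.0]
print("IS contribution:  ", trajectory_is(ratios, rewards))
print("PDIS contribution:", per_decision_is(ratios, rewards))
```

In the toy trajectory, later ratios inflate the IS contribution even though they cannot have influenced the early reward, which is one intuition for why per-step weighting tends to have lower variance.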