Quickstart: MDP OPE

This walkthrough runs off-policy evaluation (OPE) on a synthetic MDP: it estimates the value of a target policy from logged trajectory data.

1) Generate data

from crl.benchmarks.mdp_synth import SyntheticMDP, SyntheticMDPConfig

# Build the benchmark MDP and sample 200 logged trajectories.
benchmark = SyntheticMDP(SyntheticMDPConfig(seed=0))
dataset = benchmark.sample(num_trajectories=200, seed=1)
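
A quick sanity check on what was sampled. Only the discount and horizon attributes, which step 2 reuses, appear in this quickstart; any other dataset attribute would be an assumption about the API:

print("discount:", dataset.discount)
print("horizon:", dataset.horizon)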

Discrete actions

The built-in pipeline assumes discrete action spaces. If your problem has continuous actions, you will need to implement a custom estimator; a minimal skeleton is sketched below.
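
A minimal skeleton, assuming only the interface visible in this quickstart: an estimand passed to the constructor and an estimate(dataset) method returning an object with .value and .diagnostics. The class and report names are hypothetical, not crl APIs:

from dataclasses import dataclass, field

@dataclass
class MyReport:
    # Hypothetical stand-in for the library's report object; only the
    # .value / .diagnostics fields used elsewhere here are assumed.
    value: float
    diagnostics: dict = field(default_factory=dict)

class MyContinuousActionEstimator:
    def __init__(self, estimand):
        self.estimand = estimand

    def estimate(self, dataset) -> MyReport:
        # Replace this stub with real logic, e.g. kernel-smoothed
        # importance sampling over the continuous action density ratio.
        value = 0.0
        return MyReport(value=value, diagnostics={"note": "stub"})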

2) Define the estimand

The estimand bundles what you want to estimate (the discounted value of the target policy) with the assumptions under which that quantity is identifiable from the logged data.

from crl.assumptions import AssumptionSet
from crl.assumptions_catalog import BEHAVIOR_POLICY_KNOWN, MARKOV, OVERLAP, Q_MODEL_REALIZABLE, SEQUENTIAL_IGNORABILITY
from crl.estimands.policy_value import PolicyValueEstimand

estimand = PolicyValueEstimand(
    policy=benchmark.target_policy,  # the policy being evaluated
    discount=dataset.discount,       # match the logged data's discount
    horizon=dataset.horizon,         # and its horizon
    assumptions=AssumptionSet([SEQUENTIAL_IGNORABILITY, OVERLAP, BEHAVIOR_POLICY_KNOWN, MARKOV, Q_MODEL_REALIZABLE]),
)
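
For orientation, the policy value defined here is the expected discounted return of the target policy, V(pi) = E[sum_t gamma^t * r_t], summed up to the horizon. A plain-Python sketch of the per-trajectory quantity (the rewards sequence is illustrative, not necessarily the dataset's API):

def discounted_return(rewards, discount):
    # V(pi) is the expectation of this quantity over trajectories
    # generated by the target policy.
    return sum(discount**t * r for t, r in enumerate(rewards))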

3) Run estimators

from crl.estimators.dr import DoublyRobustEstimator
from crl.estimators.fqe import FQEEstimator
from crl.estimators.importance_sampling import ISEstimator, PDISEstimator, WISEstimator

estimators = [
    ISEstimator(estimand),
    WISEstimator(estimand),
    PDISEstimator(estimand),
    DoublyRobustEstimator(estimand),
    FQEEstimator(estimand),
]

for estimator in estimators:
    report = estimator.estimate(dataset)
    print(report.value, report.diagnostics)
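
For a side-by-side comparison it helps to keep the reports around, keyed by estimator class; only report.value from the loop above is assumed:

# Run the estimators again, collecting one report per class name.
reports = {type(e).__name__: e.estimate(dataset) for e in estimators}
for name, report in reports.items():
    print(f"{name:>22}: {report.value:.4f}")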

4) Compare to ground truth

# Ground truth is available because the benchmark is synthetic.
true_value = benchmark.true_policy_value(benchmark.target_policy)
print("true", true_value)

Interpret the results

  • If overlap is weak (the behavior policy rarely takes the actions the target policy prefers), the IS and PDIS importance weights blow up and the estimates become high-variance.
  • DR and FQE are usually more stable, but they inherit bias from a misspecified Q-model.
  • Always read the diagnostics before trusting an estimate; a common red flag is a low effective sample size of the importance weights, as in the sketch below.
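
As one concrete diagnostic, the effective sample size (ESS) of the trajectory importance weights flags weak overlap. This is a generic computation, not a crl API; how you obtain the weights depends on your pipeline:

import numpy as np

def effective_sample_size(weights) -> float:
    # ESS = (sum w)^2 / sum w^2: equals n for uniform weights and
    # approaches 1 when a single trajectory dominates the estimate.
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / (w**2).sum())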

Next steps