# Quickstart: MDP OPE

This walkthrough estimates the value of a target policy from logged trajectory data (off-policy evaluation, OPE).
## 1) Generate data

```python
from crl.benchmarks.mdp_synth import SyntheticMDP, SyntheticMDPConfig

benchmark = SyntheticMDP(SyntheticMDPConfig(seed=0))
dataset = benchmark.sample(num_trajectories=200, seed=1)
```
> **Discrete actions.** The built-in pipeline assumes discrete action spaces. If you have continuous actions, you will need to implement custom estimators.
## 2) Define the estimand

```python
from crl.assumptions import AssumptionSet
from crl.assumptions_catalog import (
    BEHAVIOR_POLICY_KNOWN,
    MARKOV,
    OVERLAP,
    Q_MODEL_REALIZABLE,
    SEQUENTIAL_IGNORABILITY,
)
from crl.estimands.policy_value import PolicyValueEstimand

estimand = PolicyValueEstimand(
    policy=benchmark.target_policy,
    discount=dataset.discount,
    horizon=dataset.horizon,
    assumptions=AssumptionSet(
        [SEQUENTIAL_IGNORABILITY, OVERLAP, BEHAVIOR_POLICY_KNOWN, MARKOV, Q_MODEL_REALIZABLE]
    ),
)
```
## 3) Run estimators

```python
from crl.estimators.dr import DoublyRobustEstimator
from crl.estimators.fqe import FQEEstimator
from crl.estimators.importance_sampling import ISEstimator, PDISEstimator, WISEstimator

estimators = [
    ISEstimator(estimand),
    WISEstimator(estimand),
    PDISEstimator(estimand),
    DoublyRobustEstimator(estimand),
    FQEEstimator(estimand),
]

for estimator in estimators:
    report = estimator.estimate(dataset)
    print(report.value, report.diagnostics)
```
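To build intuition for what the importance-sampling family (IS, WIS, PDIS) computes, here is a minimal self-contained sketch of the underlying arithmetic on toy trajectories. This is not the crl API; the data layout and function names are illustrative only.

```python
# Each trajectory is a list of per-step tuples:
#   (reward, target_policy_prob, behavior_policy_prob)
# These functions show the arithmetic behind IS, WIS, and PDIS.

def step_ratios(traj):
    """Per-step importance ratios pi_target(a|s) / pi_behavior(a|s)."""
    return [pt / pb for _, pt, pb in traj]

def discounted_return(traj, gamma):
    return sum(gamma**t * r for t, (r, _, _) in enumerate(traj))

def is_estimate(trajs, gamma=1.0):
    """Ordinary IS: weight each trajectory's return by the product of all ratios."""
    total = 0.0
    for traj in trajs:
        w = 1.0
        for rho in step_ratios(traj):
            w *= rho
        total += w * discounted_return(traj, gamma)
    return total / len(trajs)

def wis_estimate(trajs, gamma=1.0):
    """Weighted IS: normalize by the sum of weights instead of n (lower variance, small bias)."""
    weights, rets = [], []
    for traj in trajs:
        w = 1.0
        for rho in step_ratios(traj):
            w *= rho
        weights.append(w)
        rets.append(discounted_return(traj, gamma))
    return sum(w * g for w, g in zip(weights, rets)) / sum(weights)

def pdis_estimate(trajs, gamma=1.0):
    """Per-decision IS: the reward at step t is weighted only by ratios up to t."""
    total = 0.0
    for traj in trajs:
        w = 1.0
        for t, (r, pt, pb) in enumerate(traj):
            w *= pt / pb
            total += gamma**t * w * r
    return total / len(trajs)
```

A quick sanity check: when the data is on-policy (target and behavior probabilities coincide), all three estimators reduce to the plain Monte Carlo mean return.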
## 4) Compare to ground truth
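For a synthetic benchmark, a ground-truth policy value can always be estimated by Monte Carlo: roll the target policy out in the simulator and average the discounted returns, then compare each estimator's report against that number. The sketch below is self-contained and uses a hypothetical toy MDP, not the crl benchmark interface.

```python
import random

def rollout_value(step_fn, policy, gamma, horizon, num_episodes, seed=0):
    """Monte Carlo estimate of a policy's value: average discounted return
    over fresh rollouts of the policy in the simulator."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_episodes):
        s = 0
        ret, disc = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s, rng)
            s, r = step_fn(s, a, rng)
            ret += disc * r
            disc *= gamma
        total += ret
    return total / num_episodes

# Toy deterministic MDP for illustration: action 1 yields reward 1, action 0 yields 0.
def toy_step(s, a, rng):
    return s, float(a)

greedy = lambda s, rng: 1
value = rollout_value(toy_step, greedy, gamma=0.9, horizon=3, num_episodes=10)
# 1 + 0.9 + 0.81 = 2.71
```

With enough rollouts, the Monte Carlo value is an unbiased reference; an OPE estimate far outside its sampling error indicates a violated assumption or a poorly fit model.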
## Interpret the results

- If overlap is weak, IS and PDIS estimates will be high-variance and unstable.
- DR and FQE can be more stable, but they depend on how well their models fit.
- Always read the diagnostics before trusting an estimate.
## Next steps

- Run the full quickstart script: `python -m examples.quickstart.mdp_ope`
- See the decision tree: Estimator Selection Guide
- Learn how to read diagnostics: Diagnostics Interpretation