Double Reinforcement Learning (MDP)¶
Implementation: `crl.estimators.drl.DRLEstimator`
Estimand¶
$V^\pi = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$, the expected discounted return of the target policy $\pi$ over horizon $T$ with discount factor $\gamma$.
Assumptions¶
- Sequential ignorability
- Overlap/positivity
- Markov property
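Writing $\mu$ for the behavior (logging) policy, overlap can be stated formally as follows (one common formalization, not necessarily the library's exact condition):

$$
\pi(a \mid s) > 0 \;\Rightarrow\; \mu(a \mid s) > 0 \quad \text{for every state–action pair reachable under } \pi.
$$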
Requires¶
A `TrajectoryDataset` with `state_space_n` set.
Algorithm¶
- Estimate a Q-function for the target policy.
- Estimate per-time state-action density ratios.
- Combine the Q-model and the ratio-weighted TD correction into a single doubly robust estimate (see the formula below).
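Concretely, the combined estimate takes the doubly robust form below (notation adapted from Kallus & Uehara (2020); $\hat q_t$ is the fitted Q-function, $\hat w_t$ the time-$t$ state-action density ratio, $\hat v_t(s) = \mathbb{E}_{a \sim \pi(\cdot\mid s)}[\hat q_t(s, a)]$, and $\hat v_T \equiv 0$):

$$
\hat V^{\pi}_{\mathrm{DRL}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat v_0\!\left(s^{(i)}_0\right) + \sum_{t=0}^{T-1} \gamma^{t}\, \hat w_t\!\left(s^{(i)}_t, a^{(i)}_t\right) \left( r^{(i)}_t + \gamma\, \hat v_{t+1}\!\left(s^{(i)}_{t+1}\right) - \hat q_t\!\left(s^{(i)}_t, a^{(i)}_t\right) \right) \right]
$$

The correction terms have mean zero when the Q-function is correct, and the weighted sum recovers $V^\pi$ when the density ratios are correct, so the estimate remains consistent if either nuisance is estimated well.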
Uncertainty¶
- Normal-approximation CI by default (see the sketch after this list).
- Bootstrap CI available via `bootstrap=True`.
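For intuition, a minimal sketch of how a normal-approximation interval can be formed from per-trajectory contributions (plain NumPy/SciPy, not the library's internal code; `phi` stands for the bracketed per-trajectory terms in the formula above):

```python
import numpy as np
from scipy import stats

def normal_approx_ci(phi: np.ndarray, alpha: float = 0.05):
    """Point estimate and (1 - alpha) CI from per-trajectory terms `phi`.

    `phi[i]` is trajectory i's contribution to the DRL estimate, so the point
    estimate is the sample mean and its standard error is s / sqrt(n).
    """
    n = len(phi)
    point = phi.mean()
    se = phi.std(ddof=1) / np.sqrt(n)
    z = stats.norm.ppf(1 - alpha / 2)
    return point, (point - z * se, point + z * se)
```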
Failure modes¶
- Sparse coverage can destabilize ratio estimates.
- Misspecified nuisances can induce bias: the estimator stays consistent if either the Q-model or the density ratios are correct, but not if both are wrong.
Minimal example¶
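A hedged end-to-end sketch under stated assumptions: apart from `DRLEstimator`, `TrajectoryDataset`, `state_space_n`, and `bootstrap=True`, which appear above, the import path, constructor arguments, method names, and result fields below are illustrative assumptions rather than verified API.

```python
import numpy as np

# Hypothetical usage sketch: the TrajectoryDataset import path, constructor
# signature, fit/estimate method names, and result fields are assumptions.
from crl.data import TrajectoryDataset
from crl.estimators.drl import DRLEstimator

rng = np.random.default_rng(0)
n_traj, horizon, n_states, n_actions = 200, 10, 5, 2

# Synthetic behavior-policy trajectories over a small finite MDP.
states = rng.integers(n_states, size=(n_traj, horizon))
actions = rng.integers(n_actions, size=(n_traj, horizon))
rewards = rng.random(size=(n_traj, horizon))

dataset = TrajectoryDataset(  # constructor arguments assumed
    states=states,
    actions=actions,
    rewards=rewards,
    state_space_n=n_states,
)

# Uniform target policy as a (state, action) probability table (assumed format).
target_policy = np.full((n_states, n_actions), 1.0 / n_actions)

estimator = DRLEstimator(target_policy=target_policy, gamma=0.99)
result = estimator.fit(dataset).estimate(bootstrap=True)  # bootstrap CI (flag from above)
print(result.value, result.ci)
```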
References¶
- Kallus, N., & Uehara, M. (2020). Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes. Journal of Machine Learning Research, 21.