Estimands
Estimands for causal reinforcement learning.
PolicyContrastEstimand
dataclass
Contrast between two policy values.
Estimand
V^{\pi_{\text{treatment}}} - V^{\pi_{\text{control}}}.
Assumptions: Same as PolicyValueEstimand, applied to both policies.
Inputs:
    treatment: Target policy value estimand.
    control: Control policy value estimand.
Outputs: Contrast specification used by estimators or reports.
Failure modes: If the treatment and control estimands rely on different assumptions, the contrast may not be identified.
to_dict()
Return a dictionary representation.
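As an illustration of the contrast specification and its dictionary form, here is a minimal, self-contained sketch. It does not reproduce the library's classes: `SketchPolicyValue`, `SketchPolicyContrast`, the `policy_name` field, and the string assumption labels are hypothetical stand-ins. Raising an error when the two assumption sets differ is one possible reading of the failure mode described above, not confirmed behavior.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SketchPolicyValue:
    """Hypothetical stand-in for a policy value estimand (not the library class)."""

    policy_name: str                   # assumed identifier for the target policy
    discount: float                    # discount factor gamma
    horizon: Optional[int] = None      # optional horizon for finite episodes
    assumptions: Tuple[str, ...] = ()  # assumed string labels for identification conditions


@dataclass(frozen=True)
class SketchPolicyContrast:
    """Hypothetical stand-in for the contrast V^{pi_treatment} - V^{pi_control}."""

    treatment: SketchPolicyValue
    control: SketchPolicyValue

    def __post_init__(self) -> None:
        # Design choice of this sketch: fail loudly if the two estimands
        # declare different assumptions, since the contrast may then not
        # be identified.
        if set(self.treatment.assumptions) != set(self.control.assumptions):
            raise ValueError(
                "treatment and control estimands declare different assumptions"
            )

    def to_dict(self) -> dict:
        # asdict recurses into the nested dataclasses.
        return asdict(self)


shared = ("sequential_ignorability", "positivity")
contrast = SketchPolicyContrast(
    treatment=SketchPolicyValue("new_policy", 0.9, 10, shared),
    control=SketchPolicyValue("old_policy", 0.9, 10, shared),
)
spec = contrast.to_dict()
```

In this sketch `spec` is a plain nested dictionary with `treatment` and `control` keys, which is convenient for logging or attaching to a report.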
PolicyValueEstimand
dataclass
Policy value estimand under intervention.
Estimand
V^{\pi} = \mathbb{E}\big[\sum_t \gamma^t R_t \mid \mathrm{do}(A_t \sim \pi(\cdot \mid S_t))\big].
Assumptions: Sequential ignorability, positivity/overlap, and correct data contract.
Inputs:
    policy: Target policy.
    discount: Discount factor.
    horizon: Optional horizon for finite episodes.
    assumptions: AssumptionSet describing identification conditions.
Outputs: Estimand specification used by estimators.
Failure modes: If required assumptions are missing, estimators should refuse to run.
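To make the quantity inside the expectation concrete, the sketch below computes the discounted return sum_t gamma^t R_t for a single trajectory and shows one way an estimator might refuse to run when assumptions are missing. The helper names `discounted_return` and `check_assumptions`, the `REQUIRED` set, and the string labels are hypothetical and not part of the documented API. Averaging such returns estimates V^pi directly only when the trajectories were generated by following pi itself; with logged data from another policy, the identification assumptions above are what justify any estimate.

```python
from typing import Iterable, Optional, Sequence

# Assumed labels for illustration; the documented API uses an AssumptionSet instead.
REQUIRED = frozenset({"sequential_ignorability", "positivity"})


def discounted_return(
    rewards: Sequence[float],
    discount: float,
    horizon: Optional[int] = None,
) -> float:
    """Return sum_t discount**t * rewards[t], truncated at `horizon` if given."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must lie in [0, 1]")
    if horizon is not None:
        rewards = rewards[:horizon]
    return sum(discount**t * r for t, r in enumerate(rewards))


def check_assumptions(declared: Iterable[str]) -> None:
    """Raise if any required identification assumption is not declared."""
    missing = REQUIRED - set(declared)
    if missing:
        raise ValueError(f"missing required assumptions: {sorted(missing)}")


# One trajectory with reward 1 at each of three steps and gamma = 0.5:
# 1 + 0.5 + 0.25 = 1.75
full = discounted_return([1.0, 1.0, 1.0], discount=0.5)
# Truncated to a horizon of two steps: 1 + 0.5 = 1.5
truncated = discounted_return([1.0, 1.0, 1.0], discount=0.5, horizon=2)
```

Calling `check_assumptions(["positivity"])` raises a `ValueError` in this sketch, mirroring the documented expectation that estimators refuse to run without the required assumptions.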