
Estimators

Estimator results are returned as EstimatorReport objects with a stable schema and export utilities:

  • report.to_dict() includes schema_version, value, stderr, ci, uncertainty, and diagnostics.
  • report.to_dataframe() produces a one-row pandas table.
  • report.save_json(path) and report.save_html(path) persist reports.

Estimators for off-policy evaluation.

BootstrapConfig dataclass

Configuration for bootstrap confidence intervals.

DRCrossFitConfig dataclass

Configuration for cross-fitting.

Estimand

Not applicable.

  • Assumptions: None.
  • Inputs:
      • num_folds: Number of cross-fitting folds.
      • num_iterations: Bellman iteration count for linear Q.
      • ridge: Ridge regularization strength.
      • seed: RNG seed for fold splitting.
  • Outputs: Configuration object.
  • Failure modes: None.

DRLConfig dataclass

Configuration for Double Reinforcement Learning (DRL).

DRLEstimator

Bases: OPEEstimator

Double Reinforcement Learning estimator for discrete MDPs.

Estimand

PolicyValueEstimand for the target policy.

  • Assumptions: Sequential ignorability, overlap, Markov property.
  • Inputs: TrajectoryDataset with discrete state_space_n.
  • Outputs: EstimatorReport with value and diagnostics.
  • Failure modes: Requires adequate state-action coverage to estimate occupancy ratios.

DiagnosticsConfig dataclass

Configuration for diagnostics thresholds.

Estimand

Not applicable.

  • Assumptions: None.
  • Inputs:
      • min_behavior_prob: Minimum behavior probability threshold.
      • max_weight: Optional clipping threshold for importance weights.
      • ess_threshold: Minimum ESS ratio before warnings.
      • weight_tail_quantile: Quantile for tail summary.
      • weight_tail_threshold: Threshold to flag heavy tails.
  • Outputs: Configuration object.
  • Failure modes: None.
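For intuition on the ESS diagnostic, the standard effective-sample-size ratio can be sketched as below. This assumes `ess_threshold` is compared against the normalized form (sum w)² / (n · sum w²), which is the common definition; the library's exact formula may differ:

```python
import numpy as np

def ess_ratio(weights: np.ndarray) -> float:
    """Normalized effective sample size: (sum w)^2 / (n * sum w^2).

    Equals 1.0 for uniform weights and approaches 1/n when a single
    weight dominates, which is the heavy-tail regime the diagnostics flag.
    """
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / (len(w) * (w ** 2).sum()))

r_uniform = ess_ratio(np.ones(100))                      # well-behaved weights
r_skewed = ess_ratio(np.array([100.0] + [0.01] * 99))    # one dominant weight
```

With a threshold like 0.1, the skewed case above would trigger a warning while the uniform case would not.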

DoubleRLConfig dataclass

Configuration for Double RL cross-fitting.

DoubleRLEstimator

Bases: OPEEstimator

Double RL estimator for contextual bandits (Kallus & Uehara, 2020).

DoublyRobustEstimator

Bases: OPEEstimator

Doubly robust estimator with cross-fitting.

Estimand

PolicyValueEstimand for the target policy.

  • Assumptions: Sequential ignorability, overlap, Markov property, and known behavior propensities.
  • Inputs: TrajectoryDataset (n, t).
  • Outputs: EstimatorReport with value and diagnostics.
  • Failure modes: Bias if both the Q model and propensities are misspecified.

estimate(data)

Estimate policy value via cross-fitted DR.
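For intuition, the one-step (bandit) form of the doubly robust identity can be sketched as below; the trajectory estimator applies this correction per decision, and cross-fitting means the Q-model predictions come from folds the evaluation point was held out of. All names here are illustrative:

```python
import numpy as np

def dr_bandit(rewards, weights, q_hat, q_pi):
    """One-step doubly robust estimate:

        V_DR = mean( q_pi + w * (r - q_hat) )

    q_hat: Q-model prediction at the logged action.
    q_pi:  Q-model value averaged over the target policy's actions.
    Unbiased if either the weights or the Q model is correct.
    """
    return float(np.mean(q_pi + weights * (rewards - q_hat)))

# If the Q model is exact (q_hat == r), the IS correction term vanishes,
# so noisy importance weights contribute no variance.
r = np.array([1.0, 0.0, 1.0, 1.0])
w = np.array([2.0, 0.5, 1.0, 1.5])
v = dr_bandit(r, w, q_hat=r, q_pi=np.full(4, 0.75))
```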

DualDICEConfig dataclass

Configuration for DualDICE.

DualDICEEstimator

Bases: OPEEstimator

DualDICE estimator (Nachum et al., 2019) for discrete MDPs.

EstimatorReport dataclass

Report returned by estimators.

Estimand

Policy value for the estimator's target policy.

  • Assumptions: Recorded in the estimand; warnings highlight flagged issues.
  • Outputs:
      • value: Estimated policy value.
      • stderr: Estimated standard error, if available.
      • ci: Optional confidence interval (low, high).
      • diagnostics: Dictionary of diagnostic metrics.
      • assumptions_checked: Assumptions required by the estimator.
      • assumptions_flagged: Assumptions flagged by diagnostics.
      • warnings: List of warning strings.
      • metadata: Extra metadata (fit details, configs).
  • Failure modes: diagnostics may be None if diagnostics are disabled.

save_html(path)

Write report contents to an HTML file.

save_json(path)

Write report contents to a JSON file.

to_dataframe()

Return a one-row pandas DataFrame if pandas is available.

to_dict()

Return a pandas-friendly dict representation.

to_html()

Return a self-contained HTML report representation.

to_json()

Return a JSON string representation.

FQEConfig dataclass

Configuration for FQE training.

Estimand

Not applicable.

  • Assumptions: None.
  • Inputs:
      • hidden_sizes: Hidden layer sizes for the Q network.
      • learning_rate: Optimizer learning rate.
      • batch_size: Mini-batch size.
      • num_epochs: Epochs per iteration.
      • num_iterations: Number of fitted Q iterations.
      • weight_decay: L2 penalty.
      • seed: RNG seed for torch and numpy.
  • Outputs: Configuration object.
  • Failure modes: None.

FQEEstimator

Bases: OPEEstimator

Fitted Q Evaluation estimator for finite-horizon MDPs.

Estimand

PolicyValueEstimand for the target policy.

  • Assumptions: Sequential ignorability, overlap, Markov property, and Q-model realizability.
  • Inputs: TrajectoryDataset (n, t).
  • Outputs: EstimatorReport with value and diagnostics.
  • Failure modes: Extrapolation error for out-of-distribution actions.

estimate(data)

Estimate policy value via FQE.
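The core of FQE is backward induction on the Bellman evaluation equation. The library fits a neural Q network (see FQEConfig), but the recursion is easiest to see in a tabular sketch; all names below are illustrative:

```python
import numpy as np

def fqe_tabular(transitions, pi, n_states, n_actions, horizon):
    """Tabular fitted Q evaluation by backward induction.

    transitions: list of (t, s, a, r, s_next) tuples from logged data.
    pi: (n_states, n_actions) target-policy action probabilities.
    Regression targets r + E_{a'~pi} Q_{t+1}(s', a') reduce to per-cell
    averages in the tabular case.
    """
    Q = np.zeros((horizon + 1, n_states, n_actions))
    for t in range(horizon - 1, -1, -1):
        num = np.zeros((n_states, n_actions))
        den = np.zeros((n_states, n_actions))
        for (tt, s, a, r, s_next) in transitions:
            if tt != t:
                continue
            target = r + pi[s_next] @ Q[t + 1, s_next]
            num[s, a] += target
            den[s, a] += 1.0
        Q[t] = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return Q

# One state, one action, reward 1 per step, horizon 2 -> value 2.
transitions = [(0, 0, 0, 1.0, 0), (1, 0, 0, 1.0, 0)]
pi = np.array([[1.0]])
Q = fqe_tabular(transitions, pi, n_states=1, n_actions=1, horizon=2)
```

State-action cells never visited in the data keep Q = 0, which is the tabular analogue of the extrapolation failure mode noted above.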

GenDICEConfig dataclass

Configuration for GenDICE.

GenDICEEstimator

Bases: OPEEstimator

GenDICE estimator (generalized density ratio).

HighConfidenceISConfig dataclass

Configuration for high-confidence lower bounds.

HighConfidenceISEstimator

Bases: OPEEstimator

High-confidence lower bound based on IS (Thomas et al., 2015).

ISEstimator

Bases: OPEEstimator

Trajectory-level importance sampling estimator.

Estimand

PolicyValueEstimand for the target policy.

  • Assumptions: Sequential ignorability, overlap/positivity, and known behavior propensities.
  • Inputs: LoggedBanditDataset (n,) or TrajectoryDataset (n, t).
  • Outputs: EstimatorReport with value and diagnostics.
  • Failure modes: High variance under weak overlap.

estimate(data)

Estimate policy value via IS.
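The trajectory-level IS estimate weights each logged return by the product of per-step probability ratios; the sketch below (illustrative names, not the library API) shows why weak overlap inflates variance — a single small behavior probability blows up the whole trajectory's weight:

```python
import numpy as np

def trajectory_is(pi_probs, mu_probs, returns):
    """Trajectory-level IS: mean over i of (prod_t pi_t / mu_t) * G_i.

    pi_probs, mu_probs: (n, t) per-step probabilities of the logged
    actions under the target and behavior policies.
    returns: (n,) trajectory returns.
    """
    w = np.prod(np.asarray(pi_probs) / np.asarray(mu_probs), axis=1)
    return float(np.mean(w * np.asarray(returns)))

pi = np.array([[0.9, 0.9], [0.1, 0.1]])   # target-policy probabilities
mu = np.array([[0.5, 0.5], [0.5, 0.5]])   # behavior-policy probabilities
v = trajectory_is(pi, mu, returns=np.array([1.0, 0.0]))
```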

MAGICConfig dataclass

Configuration for MAGIC.

MAGICEstimator

Bases: OPEEstimator

MAGIC estimator that mixes truncated DR estimators.

MRDRConfig dataclass

Configuration for MRDR.

MRDREstimator

Bases: OPEEstimator

MRDR estimator (Farajtabar et al., 2018).

MarginalizedImportanceSamplingEstimator

Bases: OPEEstimator

MIS estimator (Xie et al., 2019) for discrete state-action spaces.

OPEEstimator

Bases: ABC

Base class for off-policy evaluation estimators.

Estimand

PolicyValueEstimand.

  • Assumptions: Each estimator declares required assumptions.
  • Inputs: Dataset-specific objects such as TrajectoryDataset or LoggedBanditDataset.
  • Outputs: EstimatorReport with value, diagnostics, and metadata.
  • Failure modes: Raises ValueError if required assumptions are missing.

estimate(data) abstractmethod

Estimate policy value from data.

PDISEstimator

Bases: OPEEstimator

Per-decision importance sampling estimator.

Estimand

PolicyValueEstimand for the target policy.

  • Assumptions: Sequential ignorability, overlap/positivity, and known behavior propensities.
  • Inputs: TrajectoryDataset (n, t).
  • Outputs: EstimatorReport with value and diagnostics.
  • Failure modes: Variance grows with horizon under weak overlap.

estimate(data)

Estimate policy value via PDIS.
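Per-decision IS weights each reward only by the cumulative ratio up to that step, rather than the full-trajectory product, which is why its variance grows more slowly with horizon than trajectory-level IS. A minimal sketch with illustrative names:

```python
import numpy as np

def pdis(pi_probs, mu_probs, rewards, gamma=1.0):
    """Per-decision IS: mean over i of sum_t gamma^t * rho_{0:t} * r_t,
    where rho_{0:t} is the running product of per-step ratios up to t.

    pi_probs, mu_probs, rewards: (n, t) arrays.
    """
    rho = np.cumprod(np.asarray(pi_probs) / np.asarray(mu_probs), axis=1)
    discounts = gamma ** np.arange(rho.shape[1])
    return float(np.mean(np.sum(discounts * rho * np.asarray(rewards), axis=1)))

pi_p = np.array([[0.9, 0.9]])
mu_p = np.array([[0.5, 0.5]])
r = np.array([[1.0, 1.0]])
v = pdis(pi_p, mu_p, r)
```

An overlap failure at a late step inflates only the weights on rewards at or after that step; earlier rewards keep their smaller cumulative weights.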

UncertaintySummary dataclass

Structured summary of estimator uncertainty.

WDRConfig dataclass

Configuration for weighted doubly robust estimation.

WISEstimator

Bases: OPEEstimator

Weighted importance sampling estimator.

Estimand

PolicyValueEstimand for the target policy.

  • Assumptions: Sequential ignorability, overlap/positivity, and known behavior propensities.
  • Inputs: LoggedBanditDataset (n,) or TrajectoryDataset (n, t).
  • Outputs: EstimatorReport with value and diagnostics.
  • Failure modes: Bias from normalization in small samples.

estimate(data)

Estimate policy value via WIS.
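WIS replaces ordinary IS's mean by a self-normalized weighted mean, which bounds the estimate by the observed returns at the cost of small-sample bias. A minimal sketch with illustrative names:

```python
import numpy as np

def wis(weights, returns):
    """Weighted IS: sum(w * G) / sum(w).

    Normalizing by the weight sum keeps the estimate inside the range
    of observed returns, trading a small-sample bias (the normalizer is
    random) for much lower variance than ordinary IS under weak overlap.
    """
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * np.asarray(returns)) / np.sum(w))

v = wis(weights=[3.24, 0.04], returns=[1.0, 0.0])
```

Compare with ordinary IS on the same data, which averages the weighted returns over n and can leave the range of observed returns entirely.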

WeightedDoublyRobustEstimator

Bases: OPEEstimator

Weighted doubly robust estimator (Thomas & Brunskill, 2016).