
Benchmarks

Synthetic benchmarks for CRL.

ConfoundedBandit

Binary-action confounded bandit with proxies.

ConfoundedBanditConfig dataclass

Configuration for the confounded bandit benchmark.

SyntheticBandit

Synthetic bandit with discrete contexts and known reward means.

Estimand

Policy value under intervention for a target policy.

Assumptions: None (ground-truth generator).

Inputs

config: SyntheticBanditConfig.

Outputs: Methods provide sampled datasets and ground-truth values. Failure modes: None.

sample(num_samples, seed=None)

Sample a logged bandit dataset.

Inputs

num_samples: Number of logged samples. seed: Optional RNG seed.

Outputs: LoggedBanditDataset with propensities. Failure modes: None.

true_policy_value(policy)

Compute the ground-truth value for a policy.

Inputs

policy: TabularPolicy.

Outputs: Expected reward scalar. Failure modes: None.
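For a tabular bandit with known reward means, the ground-truth policy value reduces to a weighted average over contexts and actions. A minimal sketch, not the library's implementation; `context_probs`, `reward_means`, and the dense `policy` array are illustrative stand-ins for the benchmark's internal state:

```python
import numpy as np

def true_policy_value(policy, context_probs, reward_means):
    """Ground-truth value of a tabular policy.

    policy[s, a]: action probabilities pi(a|s); rows sum to 1.
    context_probs[s]: context distribution p(s).
    reward_means[s, a]: known mean reward mu(s, a).
    """
    # E[R] = sum_s p(s) * sum_a pi(a|s) * mu(s, a)
    return float(np.einsum("s,sa,sa->", context_probs, policy, reward_means))

context_probs = np.array([0.5, 0.5])
reward_means = np.array([[1.0, 0.0], [0.0, 1.0]])
policy = np.array([[1.0, 0.0], [0.0, 1.0]])  # always picks the better arm
true_policy_value(policy, context_probs, reward_means)  # -> 1.0
```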

SyntheticBanditConfig dataclass

Configuration for the synthetic bandit benchmark.

Estimand

Not applicable.

Assumptions: None.

Inputs

num_contexts: Number of discrete contexts. num_actions: Number of discrete actions. reward_noise_std: Reward noise standard deviation. seed: Random seed for benchmark generation.

Failure modes: None.

SyntheticMDP

Synthetic finite-horizon MDP with tabular dynamics.

Estimand

Policy value under intervention for a target policy.

Assumptions: None (ground-truth generator).

Inputs

config: SyntheticMDPConfig.

Outputs: Methods provide sampled datasets and ground-truth values. Failure modes: None.

sample(num_trajectories, seed=None)

Sample trajectories from the behavior policy.

Inputs

num_trajectories: Number of trajectories. seed: Optional RNG seed.

Outputs: TrajectoryDataset with propensities. Failure modes: None.

true_policy_value(policy)

Compute the ground-truth policy value via dynamic programming.

Inputs

policy: TabularPolicy.

Outputs: Expected discounted return. Failure modes: None.
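For a tabular finite-horizon MDP, ground-truth evaluation by dynamic programming amounts to backward induction over the horizon. The following is a minimal illustration under the tabular Markov assumptions, not the library's code; the array names and shapes are hypothetical:

```python
import numpy as np

def policy_value_dp(P, R, policy, init_dist, horizon, discount):
    """Finite-horizon policy value via backward induction.

    P[s, a, s']: transition probabilities.
    R[s, a]: mean rewards.
    policy[s, a]: pi(a|s).
    init_dist[s]: initial-state distribution.
    """
    V = np.zeros(P.shape[0])
    for _ in range(horizon):
        # Q(s, a) = R(s, a) + discount * sum_{s'} P(s'|s, a) V(s')
        Q = R + discount * (P @ V)
        # V(s) = sum_a pi(a|s) Q(s, a)
        V = np.einsum("sa,sa->s", policy, Q)
    return float(init_dist @ V)
```

With a single state and action, reward 1, discount 0.5, and horizon 3, this returns 1 + 0.5 + 0.25 = 1.75, matching the geometric sum.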

SyntheticMDPConfig dataclass

Configuration for the synthetic MDP benchmark.

Estimand

Not applicable.

Assumptions: None.

Inputs

num_states: Number of discrete states. num_actions: Number of discrete actions. horizon: Episode horizon. discount: Discount factor. reward_noise_std: Reward noise standard deviation. seed: Random seed for benchmark generation.

Failure modes: None.

run_all_benchmarks(num_samples=1000, num_trajectories=200, seed=0)

Run all benchmarks and return a combined result table.

Estimand

Policy value under intervention for each benchmark target policy.

Assumptions: Sequential ignorability, overlap, and known behavior propensities (plus the Markov property for the MDP benchmark).

Inputs

num_samples: Number of bandit samples. num_trajectories: Number of MDP trajectories. seed: Random seed for sampling.

Outputs: Combined list of result dictionaries. Failure modes: Small samples can yield unstable estimates.

run_bandit_benchmark(num_samples=1000, seed=0, config=None)

Run IS/WIS on the synthetic bandit benchmark.

Estimand

Policy value under intervention for the benchmark target policy.

Assumptions: Sequential ignorability, overlap, and known behavior propensities.

Inputs

num_samples: Number of logged bandit samples. seed: Random seed for sampling. config: Optional SyntheticBanditConfig override.

Outputs: List of result dictionaries with estimate and true value. Failure modes: Small samples can yield high-variance estimates.
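The IS and WIS estimators run by this benchmark can be sketched in a few lines. This is an illustrative stand-alone version, not the library's implementation; the input arrays are hypothetical logged data, where each probability is the chance the given policy assigns to the logged action:

```python
import numpy as np

def is_wis_estimates(rewards, behavior_probs, target_probs):
    """Importance-sampling (IS) and weighted IS (WIS) estimates of policy value.

    rewards[i]: logged reward for sample i.
    behavior_probs[i]: behavior propensity of the logged action.
    target_probs[i]: target policy's probability of the logged action.
    """
    w = target_probs / behavior_probs        # importance weights
    is_est = float(np.mean(w * rewards))     # unbiased, high variance
    wis_est = float(np.sum(w * rewards) / np.sum(w))  # self-normalized, biased
    return is_est, wis_est
```

WIS trades a small bias for lower variance by normalizing with the realized weight sum, which is why it is commonly reported alongside IS.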

run_mdp_benchmark(num_trajectories=200, seed=0, config=None)

Run IS/WIS/PDIS/DR/FQE on the synthetic MDP benchmark.

Estimand

Policy value under intervention for the benchmark target policy.

Assumptions: Sequential ignorability, overlap, the Markov property, and known behavior propensities.

Inputs

num_trajectories: Number of logged trajectories. seed: Random seed for sampling. config: Optional SyntheticMDPConfig override.

Outputs: List of result dictionaries with estimate and true value. Failure modes: Small samples can yield unstable estimates.
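Among the sequential estimators, per-decision importance sampling (PDIS) weights each step's reward by the cumulative importance ratio up to that step rather than the full-trajectory ratio, which typically lowers variance. A minimal sketch under the stated assumptions, not the library's code; the array shapes are hypothetical:

```python
import numpy as np

def pdis_estimate(rewards, behavior_probs, target_probs, discount):
    """Per-decision importance sampling over equal-length trajectories.

    rewards, behavior_probs, target_probs: shape [num_trajectories, horizon],
    where each probability is the policy's chance of the logged action.
    """
    ratios = target_probs / behavior_probs
    cum_w = np.cumprod(ratios, axis=1)                 # rho_{0:t} per step
    discounts = discount ** np.arange(rewards.shape[1])
    # V_hat = mean over trajectories of sum_t gamma^t * rho_{0:t} * r_t
    return float(np.mean(np.sum(discounts * cum_w * rewards, axis=1)))
```

When the behavior and target policies coincide (all ratios equal 1), PDIS reduces to the on-policy Monte Carlo return.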