Benchmarks¶
Synthetic benchmarks for CRL.
ConfoundedBandit¶
Binary-action confounded bandit with proxies.
ConfoundedBanditConfig dataclass¶
Configuration for the confounded bandit benchmark.
SyntheticBandit¶
Synthetic bandit with discrete contexts and known reward means.
Estimand
Policy value under intervention for a target policy.
Assumptions: None (ground-truth generator).
Inputs: config: SyntheticBanditConfig.
Outputs: Methods provide sampled datasets and ground-truth values.
Failure modes: None.
sample(num_samples, seed=None)¶
Sample a logged bandit dataset.
Inputs
num_samples: Number of logged samples.
seed: Optional RNG seed.
Outputs: LoggedBanditDataset with propensities.
Failure modes: None.
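The logging scheme described above can be sketched for a tabular bandit. This is a minimal self-contained illustration, not the library's actual `sample` implementation: the helper name, the plain-list parameters, and the dict record fields are all illustrative, and reward noise is omitted for clarity.

```python
import random

def sample_logged_bandit(context_probs, behavior_policy, reward_means,
                         num_samples, seed=None):
    """Sketch of bandit logging: draw a context, act with the behavior
    policy, observe the mean reward (noiseless here for clarity), and
    record the behavior propensity of the logged action."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_samples):
        x = rng.choices(range(len(context_probs)), weights=context_probs)[0]
        probs = behavior_policy[x]                  # P(a | x) under logging
        a = rng.choices(range(len(probs)), weights=probs)[0]
        r = reward_means[x][a]
        data.append({"context": x, "action": a, "reward": r,
                     "propensity": probs[a]})
    return data
```

Recording the propensity at logging time is what lets downstream importance-sampling estimators reweight without re-estimating the behavior policy.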
true_policy_value(policy)¶
Compute the ground-truth value for a policy.
Inputs
policy: TabularPolicy.
Outputs: Expected reward scalar.
Failure modes: None.
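For a tabular bandit with known reward means, the ground-truth value is just the expectation of the mean reward under the target policy, V(π) = Σₓ p(x) Σₐ π(a|x) μ(x, a). A minimal sketch, assuming plain-list representations (the function name is illustrative, not the library's API):

```python
def true_bandit_value(context_probs, reward_means, policy):
    """Ground-truth policy value: sum over contexts and actions of
    p(x) * pi(a|x) * E[r | x, a]."""
    return sum(
        p_x * sum(policy[x][a] * reward_means[x][a]
                  for a in range(len(policy[x])))
        for x, p_x in enumerate(context_probs)
    )
```

Because the generator knows `reward_means` exactly, this quantity needs no estimation; benchmarks compare estimator output against it directly.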
SyntheticBanditConfig dataclass¶
Configuration for the synthetic bandit benchmark.
Estimand
Not applicable.
Assumptions: None.
Inputs
num_contexts: Number of discrete contexts.
num_actions: Number of discrete actions.
reward_noise_std: Reward noise standard deviation.
seed: Random seed for benchmark generation.
Failure modes: None.
SyntheticMDP¶
Synthetic finite-horizon MDP with tabular dynamics.
Estimand
Policy value under intervention for a target policy.
Assumptions: None (ground-truth generator).
Inputs: config: SyntheticMDPConfig.
Outputs: Methods provide sampled datasets and ground-truth values.
Failure modes: None.
sample(num_trajectories, seed=None)¶
Sample trajectories from the behavior policy.
Inputs
num_trajectories: Number of trajectories.
seed: Optional RNG seed.
Outputs: TrajectoryDataset with propensities.
Failure modes: None.
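Trajectory sampling in a tabular finite-horizon MDP can be sketched as repeated rollouts of the behavior policy, logging the propensity at every step. The helper name, nested-list dynamics `P[s][a][s']`, and dict record fields below are illustrative assumptions, not the library's data structures, and reward noise is omitted for clarity:

```python
import random

def sample_trajectories(P, R, behavior_policy, init_probs, horizon,
                        num_trajectories, seed=None):
    """Sketch: roll out the behavior policy for `horizon` steps per
    episode, recording the behavior propensity of each logged action."""
    rng = random.Random(seed)
    trajectories = []
    for _ in range(num_trajectories):
        s = rng.choices(range(len(init_probs)), weights=init_probs)[0]
        steps = []
        for _ in range(horizon):
            probs = behavior_policy[s]              # P(a | s) under logging
            a = rng.choices(range(len(probs)), weights=probs)[0]
            steps.append({"state": s, "action": a, "reward": R[s][a],
                          "propensity": probs[a]})
            s = rng.choices(range(len(P[s][a])), weights=P[s][a])[0]
        trajectories.append(steps)
    return trajectories
```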
true_policy_value(policy)¶
Compute the ground-truth policy value via dynamic programming.
Inputs
policy: TabularPolicy.
Outputs: Expected discounted return.
Failure modes: None.
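The dynamic-programming computation mentioned above amounts to finite-horizon backward induction: V_T = 0, then V_t(s) = Σₐ π(a|s) [R(s, a) + γ Σ_{s'} P(s'|s, a) V_{t+1}(s')]. A minimal sketch under the same illustrative nested-list conventions (not the library's implementation):

```python
def policy_value_dp(P, R, policy, init_probs, horizon, discount):
    """Backward induction over the horizon, then average the
    time-0 value over the initial-state distribution."""
    num_states = len(P)
    v_next = [0.0] * num_states          # terminal value V_T = 0
    for _ in range(horizon):
        v = []
        for s in range(num_states):
            q = 0.0
            for a, pi_a in enumerate(policy[s]):
                cont = sum(P[s][a][s2] * v_next[s2]
                           for s2 in range(num_states))
                q += pi_a * (R[s][a] + discount * cont)
            v.append(q)
        v_next = v                       # V_t computed from V_{t+1}
    return sum(p0 * v_next[s] for s, p0 in enumerate(init_probs))
```

With known tabular dynamics this is exact, which is what makes it usable as a ground-truth reference for the estimators.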
SyntheticMDPConfig dataclass¶
Configuration for the synthetic MDP benchmark.
Estimand
Not applicable.
Assumptions: None.
Inputs
num_states: Number of discrete states.
num_actions: Number of discrete actions.
horizon: Episode horizon.
discount: Discount factor.
reward_noise_std: Reward noise standard deviation.
seed: Random seed for benchmark generation.
Failure modes: None.
run_all_benchmarks(num_samples=1000, num_trajectories=200, seed=0)¶
Run all benchmarks and return a combined result table.
Estimand
Policy value under intervention for each benchmark target policy.
Assumptions: Sequential ignorability, overlap, and known behavior propensities (plus the Markov property for the MDP).
Inputs
num_samples: Number of bandit samples.
num_trajectories: Number of MDP trajectories.
seed: Random seed for sampling.
Outputs: Combined list of result dictionaries.
Failure modes: Small samples can yield unstable estimates.
run_bandit_benchmark(num_samples=1000, seed=0, config=None)¶
Run IS/WIS on the synthetic bandit benchmark.
Estimand
Policy value under intervention for the benchmark target policy.
Assumptions: Sequential ignorability, overlap, and known behavior propensities.
Inputs
num_samples: Number of logged bandit samples.
seed: Random seed for sampling.
config: Optional SyntheticBanditConfig override.
Outputs: List of result dictionaries with estimate and true value.
Failure modes: Small samples can yield high-variance estimates.
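The IS and WIS estimators this benchmark runs can be sketched directly from logged records carrying propensities. IS divides the weighted reward sum by the sample count; WIS divides by the sum of the weights, trading a small bias for lower variance. The helper name and dict record fields are illustrative, not the library's API:

```python
def is_wis_estimates(data, target_policy):
    """IS and WIS estimates of the target policy's value from logged
    (context, action, reward, propensity) records."""
    weights = [target_policy[d["context"]][d["action"]] / d["propensity"]
               for d in data]
    weighted = sum(w * d["reward"] for w, d in zip(weights, data))
    is_est = weighted / len(data)                 # ordinary IS
    total_w = sum(weights)
    wis_est = weighted / total_w if total_w > 0 else 0.0  # self-normalized
    return {"IS": is_est, "WIS": wis_est}
```

When the target policy matches the behavior policy, every weight is 1 and both estimators reduce to the empirical mean reward.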
run_mdp_benchmark(num_trajectories=200, seed=0, config=None)¶
Run IS/WIS/PDIS/DR/FQE on the synthetic MDP benchmark.
Estimand
Policy value under intervention for the benchmark target policy.
Assumptions: Sequential ignorability, overlap, Markov property, and known behavior propensities.
Inputs
num_trajectories: Number of logged trajectories.
seed: Random seed for sampling.
config: Optional SyntheticMDPConfig override.
Outputs: List of result dictionaries with estimate and true value.
Failure modes: Small samples can yield unstable estimates.
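Of the estimators this benchmark runs, PDIS is the simplest to sketch: each reward is weighted only by the cumulative importance ratio up to its own time step, rather than the full-trajectory ratio, which reduces variance at later steps. A minimal sketch over illustrative dict-based trajectory records (the function name and record fields are assumptions, not the library's API):

```python
def pdis_estimate(trajectories, target_policy, discount):
    """Per-decision importance sampling: discount each reward and weight
    it by the product of importance ratios up to that step only."""
    total = 0.0
    for traj in trajectories:
        rho = 1.0                        # cumulative ratio up to step t
        g = 0.0
        for t, step in enumerate(traj):
            rho *= (target_policy[step["state"]][step["action"]]
                    / step["propensity"])
            g += (discount ** t) * rho * step["reward"]
        total += g
    return total / len(trajectories)
```

Full-trajectory IS would instead multiply every reward by the final cumulative ratio; PDIS's per-step weighting is what the "per-decision" in the name refers to.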