Datasets
Causal Inference Samplers
These samplers generate or load causal inference datasets with treatment (x), outcome (y), and covariates (v). They are used with the CausalBGM model family.
- class bayesgm.datasets.base_sampler.Base_sampler(x, y, v, batch_size=32, normalize=False, random_seed=123)[source]
Mini-batch sampler for causal inference datasets.
Stores treatment \(X\), outcome \(Y\), and covariates \(V\) and provides an infinite mini-batch iterator that cycles through the data.
- Parameters:
x (array-like) – Treatment variable with shape (n,) or (n, 1).
y (array-like) – Outcome variable with shape (n,) or (n, 1).
v (array-like) – Covariates with shape (n, v_dim).
batch_size (int, default=32) – Number of samples per mini-batch.
normalize (bool, default=False) – If True, covariates \(V\) are standardised (zero mean, unit variance) before storage.
random_seed (int, default=123) – Random seed used for shuffling.
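The behaviour described above (optional standardisation of \(V\), then an endless shuffled mini-batch stream) can be sketched in plain NumPy. This is an illustrative sketch only, not the Base_sampler implementation: the helper names standardise and cycle_batches are hypothetical, and dropping the last partial batch of each pass is an assumption.

```python
import numpy as np

def standardise(v):
    """Scale each covariate column to zero mean and unit variance."""
    v = np.asarray(v, dtype=float)
    std = v.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (v - v.mean(axis=0)) / std

def cycle_batches(x, y, v, batch_size=32, normalize=False, random_seed=123):
    """Yield (x, y, v) mini-batches forever, reshuffling after each pass."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)  # accept (n,) or (n, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    v = standardise(v) if normalize else np.asarray(v, dtype=float)
    n = len(x)
    if batch_size > n:
        raise ValueError("batch_size must not exceed the number of samples")
    rng = np.random.default_rng(random_seed)
    while True:
        order = rng.permutation(n)
        # Assumption: the last partial batch of each pass is dropped.
        for start in range(0, n - batch_size + 1, batch_size):
            idx = order[start:start + batch_size]
            yield x[idx], y[idx], v[idx]

# Toy data standing in for a real causal dataset.
data_rng = np.random.default_rng(0)
x = data_rng.normal(size=100)
y = data_rng.normal(size=100)
v = data_rng.normal(size=(100, 5))

batches = cycle_batches(x, y, v, batch_size=32, normalize=True)
xb, yb, vb = next(batches)
```

Because the generator never terminates, a training loop would call next(batches) a fixed number of times rather than iterate to exhaustion.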
- class bayesgm.datasets.causal_samplers.Sim_Hirano_Imbens_sampler(batch_size=32, N=20000, v_dim=200, seed=0)[source]
Bases: Base_sampler
Hirano Imbens simulation dataset (continuous treatment) sampler (inherited from Base_sampler).
- Parameters:
batch_size (int, default=32) – Batch size for mini-batch training.
N (int, default=20000) – Sample size.
v_dim (int, default=200) – Dimension of the covariates.
seed (int, default=0) – Random seed.
Examples
>>> from bayesgm.datasets.causal_samplers import Sim_Hirano_Imbens_sampler
>>> ds = Sim_Hirano_Imbens_sampler(batch_size=32, N=20000, v_dim=200, seed=0)
- class bayesgm.datasets.causal_samplers.Sim_Sun_sampler(batch_size=32, N=20000, v_dim=200, seed=0)[source]
Bases: Base_sampler
Sun simulation dataset (continuous treatment) sampler (inherited from Base_sampler).
- Parameters:
batch_size (int, default=32) – Batch size for mini-batch training.
N (int, default=20000) – Sample size.
v_dim (int, default=200) – Dimension of the covariates.
seed (int, default=0) – Random seed.
Examples
>>> from bayesgm.datasets.causal_samplers import Sim_Sun_sampler
>>> ds = Sim_Sun_sampler(batch_size=32, N=20000, v_dim=200, seed=0)
- class bayesgm.datasets.causal_samplers.Sim_Colangelo_sampler(batch_size=32, N=20000, v_dim=100, seed=0, rho=0.5, offset=[-1, 0, 1], d=1, a=3, b=0.75)[source]
Bases: Base_sampler
Colangelo simulation dataset (continuous treatment) sampler (inherited from Base_sampler).
- Parameters:
batch_size (int, default=32) – Batch size for mini-batch training.
N (int, default=20000) – Sample size.
v_dim (int, default=100) – Dimension of the covariates.
seed (int, default=0) – Random seed.
Examples
>>> from bayesgm.datasets.causal_samplers import Sim_Colangelo_sampler
>>> ds = Sim_Colangelo_sampler(batch_size=32, N=20000, v_dim=100, seed=0)
- class bayesgm.datasets.causal_samplers.Semi_Twins_sampler(batch_size=32, seed=0, path='../data/Twins')[source]
Bases: Base_sampler
Twins semi-synthetic dataset sampler (inherited from Base_sampler).
- Parameters:
batch_size (int, default=32) – Batch size for mini-batch training.
seed (int, default=0) – Random seed.
path (str, default='../data/Twins') – Path to the original data.
Examples
>>> from bayesgm.datasets.causal_samplers import Semi_Twins_sampler
>>> ds = Semi_Twins_sampler(batch_size=32, path='../data/Twins')
- class bayesgm.datasets.causal_samplers.Semi_acic_sampler(batch_size=32, path='../data/ACIC_2018', ufid='d5bd8e4814904c58a79d7cdcd7c2a1bb')[source]
Bases: Base_sampler
ACIC 2018 competition dataset (binary treatment) sampler (inherited from Base_sampler).
- Parameters:
batch_size (int, default=32) – Batch size for mini-batch training.
path (str, default='../data/ACIC_2018') – Path to the original dataset.
ufid (str, default='d5bd8e4814904c58a79d7cdcd7c2a1bb') – Unique ID of a specific semi-synthetic setting.
Examples
>>> from bayesgm.datasets.causal_samplers import Semi_acic_sampler
>>> ds = Semi_acic_sampler(path='../data/ACIC_2018', ufid='d5bd8e4814904c58a79d7cdcd7c2a1bb')
Prior / Distribution Samplers
These samplers generate data from known distributions. They are used as latent-space priors or benchmark datasets for the BGM model family.
- class bayesgm.datasets.prior_samplers.Gaussian_sampler(mean, sd=1, N=20000)[source]
Multivariate Gaussian sampler.
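Generates samples from \(\mathcal{N}(\mu, \sigma^2 I)\) and stores a pre-sampled dataset of size N for batch training.
- Parameters:
mean – Mean \(\mu\) of the Gaussian.
sd (default=1) – Standard deviation \(\sigma\).
N (default=20000) – Number of pre-sampled data points.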
- class bayesgm.datasets.prior_samplers.GMM_indep_sampler(N, sd, dim, n_components, weights=None, bound=1)[source]
Independent Gaussian Mixture Model (GMM) sampler.
Each dimension is sampled independently from a 1-D Gaussian mixture with n_components equally spaced centres.
- Parameters:
N (int) – Total number of pre-sampled data points.
sd (float) – Standard deviation of each mixture component.
dim (int) – Dimensionality of the data.
n_components (int) – Number of mixture components per dimension.
weights (array-like or None, optional) – Component weights (uniform if None).
bound (float, default=1) – Mixture centres are placed uniformly in [-bound, bound].
- get_density(data)[source]
Evaluate the exact GMM density at given points.
- Parameters:
data (np.ndarray) – Query points with shape (m, dim).
- Returns:
Density values with shape (m,).
- Return type:
np.ndarray
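Because the dimensions are independent, the joint density is the product of the per-dimension 1-D mixture densities. The sketch below illustrates that computation in NumPy. It is not the library's get_density: the function name gmm_indep_density is hypothetical, and placing the centres with np.linspace (endpoints included) is an assumption about how "uniformly in [-bound, bound]" is realised.

```python
import numpy as np

def gmm_indep_density(data, sd, n_components, weights=None, bound=1.0):
    """Density of a GMM whose dimensions are independent 1-D mixtures."""
    data = np.asarray(data, dtype=float)  # shape (m, dim)
    # Assumption: equally spaced centres including both endpoints.
    centres = np.linspace(-bound, bound, n_components)
    if weights is None:
        weights = np.full(n_components, 1.0 / n_components)
    weights = np.asarray(weights, dtype=float)
    # Gaussian component densities, shape (m, dim, n_components).
    z = (data[:, :, None] - centres) / sd
    comp = np.exp(-0.5 * z ** 2) / (sd * np.sqrt(2.0 * np.pi))
    per_dim = comp @ weights      # 1-D mixture density, shape (m, dim)
    return per_dim.prod(axis=1)   # independence: multiply over dimensions

# Two components at -1 and +1 with unit sd, evaluated at the origin.
p1 = gmm_indep_density(np.zeros((1, 1)), sd=1.0, n_components=2)
p2 = gmm_indep_density(np.zeros((3, 2)), sd=1.0, n_components=2)
```

At the origin both components contribute equally, so the 1-D value equals the standard normal density at 1, and the 2-D value is its square.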
- class bayesgm.datasets.prior_samplers.Swiss_roll_sampler(N, theta=6.283185307179586, scale=2, sigma=0.4)[source]
Swiss-roll distribution sampler.
Generates 2-D data along the curve \((r \sin(s \cdot r),\, r \cos(s \cdot r))\) plus isotropic Gaussian noise.
- Parameters:
Simulation Functions
Functions for generating synthetic datasets for BGM experiments.
- bayesgm.datasets.simulators.simulate_regression(n_samples, n_features, n_targets, effective_rank=None, variance=None, random_state=123)[source]
Simulate a linear regression dataset with optional low-rank design.
Generates \(X\) (optionally low-rank) and \(Y = X_{\text{aug}} \beta + \varepsilon\) where \(X_{\text{aug}}\) includes an intercept column.
- Parameters:
n_samples (int) – Number of samples.
n_features (int) – Number of features (columns of \(X\)).
n_targets (int) – Number of target (response) variables.
effective_rank (int or None, optional) – If provided, the design matrix \(X\) is generated as a low-rank matrix with this effective rank. Otherwise \(X\) is i.i.d. standard normal.
variance (np.ndarray or None, optional) – Per-sample noise variance. If None, defaults to 0.01 * mean(X^2) per sample.
random_state (int, default=123) – Random seed for reproducibility.
- Returns:
X (np.ndarray) – Feature matrix with shape (n_samples, n_features).
Y (np.ndarray) – Response matrix with shape (n_samples, n_targets).
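The generative model \(Y = X_{\text{aug}} \beta + \varepsilon\) with the default per-sample noise variance can be illustrated as follows. This is a sketch under stated assumptions, not the body of simulate_regression: the function name linear_regression_sketch is hypothetical, the intercept column is assumed to be prepended, \(\beta\) is assumed to be drawn from a standard normal, and the low-rank design option is omitted.

```python
import numpy as np

def linear_regression_sketch(n_samples, n_features, n_targets, random_state=123):
    """Draw X i.i.d. standard normal and Y = X_aug @ beta + eps."""
    rng = np.random.default_rng(random_state)
    X = rng.standard_normal((n_samples, n_features))
    # Assumption: the intercept is the first column of X_aug.
    X_aug = np.hstack([np.ones((n_samples, 1)), X])
    # Assumption: coefficients are standard normal draws.
    beta = rng.standard_normal((n_features + 1, n_targets))
    # Default noise variance from the text: 0.01 * mean(X^2) per sample.
    variance = 0.01 * np.mean(X ** 2, axis=1)
    eps = rng.standard_normal((n_samples, n_targets)) * np.sqrt(variance)[:, None]
    Y = X_aug @ beta + eps
    return X, Y

X_reg, Y_reg = linear_regression_sketch(n_samples=200, n_features=10, n_targets=3)
```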
- bayesgm.datasets.simulators.simulate_low_rank_data(n_samples=10000, z_dim=2, x_dim=4, rank=2, sigma_z=False, random_state=123)[source]
Simulate low-rank observed data with latent variables.
The generator is:
- Z ~ N(0, I)
- X | Z ~ N(mu(Z), Sigma(Z))
- Parameters:
n_samples (int, default=10000) – Number of samples.
z_dim (int, default=2) – Dimension of latent variable Z.
x_dim (int, default=4) – Dimension of observed variable X.
rank (int, default=2) – Rank for the low-rank covariance component.
sigma_z (bool, default=False) – If True, covariance Sigma depends on Z (scaled by z[0]). If False, Sigma is constant across samples.
random_state (int, default=123) – Random seed for reproducibility.
- Returns:
X (np.ndarray) – Observed data with shape (n_samples, x_dim).
Z (np.ndarray) – Latent variables with shape (n_samples, z_dim).
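One way to sample from a Gaussian with a low-rank-plus-diagonal covariance, without forming Sigma explicitly, is to push rank-dimensional noise through a factor matrix. The sketch below shows that technique for the two-stage generator above. Everything specific here is an assumption for illustration: the linear mean map A, the factor W, the diagonal noise scale, and the use of |z[0]| as the covariance scale are hypothetical and may differ from simulate_low_rank_data.

```python
import numpy as np

def low_rank_gaussian_sketch(n_samples=1000, z_dim=2, x_dim=4, rank=2,
                             sigma_z=False, random_state=123):
    """Sample Z ~ N(0, I), then X | Z ~ N(mu(Z), Sigma(Z))."""
    rng = np.random.default_rng(random_state)
    A = rng.standard_normal((z_dim, x_dim))  # hypothetical linear map for mu(Z)
    W = rng.standard_normal((x_dim, rank))   # hypothetical low-rank factor
    diag_sd = 0.1                            # hypothetical diagonal noise scale
    Z = rng.standard_normal((n_samples, z_dim))
    mu = Z @ A
    # Sigma(Z) = s^2 * W W^T + diag_sd^2 * I, with s = |z[0]| when sigma_z is
    # True and s = 1 otherwise (constant covariance across samples).
    s = np.abs(Z[:, :1]) if sigma_z else np.ones((n_samples, 1))
    low_rank_noise = rng.standard_normal((n_samples, rank)) @ W.T
    diag_noise = diag_sd * rng.standard_normal((n_samples, x_dim))
    X = mu + s * low_rank_noise + diag_noise
    return X, Z

X_lr, Z_lr = low_rank_gaussian_sketch(n_samples=500, sigma_z=True)
```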
- bayesgm.datasets.simulators.simulate_heteroskedastic_data(n=1000, d=5, seed=42)[source]
Simulate a heteroskedastic regression dataset.
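The noise standard deviation depends on the second feature \(X_2\): sigma = 0.5 + 0.5 * sin(2 * pi * X_2) (clipped to [0.1, 2.0] outside \(|X_2| > 2\)).
- Parameters:
n (int, default=1000) – Number of samples.
d (int, default=5) – Number of features.
seed (int, default=42) – Random seed.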
- Returns:
X (np.ndarray) – Feature matrix with shape (n, d).
Y (np.ndarray) – Response vector with shape (n,).
sigma (np.ndarray) – True noise standard deviation with shape (n,).
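The noise rule can be made concrete with a few lines of NumPy. The following is a simplified sketch rather than the function's actual code: noise_sd is a hypothetical helper, the clip to [0.1, 2.0] is applied everywhere instead of only in the region named above, and the feature distribution and mean function are placeholders.

```python
import numpy as np

def noise_sd(x2):
    """sigma = 0.5 + 0.5 * sin(2*pi*x2), clipped to [0.1, 2.0]."""
    return np.clip(0.5 + 0.5 * np.sin(2.0 * np.pi * x2), 0.1, 2.0)

rng = np.random.default_rng(42)
n, d = 1000, 5
X_het = rng.uniform(-1.0, 1.0, size=(n, d))  # hypothetical feature distribution
sigma_het = noise_sd(X_het[:, 1])            # second feature, zero-based index 1
mean_het = X_het[:, 0]                       # hypothetical mean function
Y_het = mean_het + sigma_het * rng.standard_normal(n)
```

Before clipping, the sine term keeps sigma in [0, 1], so in this sketch the clip only raises the floor to 0.1. Returning sigma alongside X and Y makes it possible to compare a model's predicted noise level with the truth.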
- bayesgm.datasets.simulators.simulate_z_hetero(n=20000, k=3, d=19, seed=42)[source]
Simulate a latent-factor heteroskedastic regression dataset.
Observed features \(X\) are a noisy low-rank projection of a \(k\)-dimensional latent variable \(Z\). The response \(Y\) depends nonlinearly on \(Z\) with heteroskedastic noise.
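- Parameters:
n (int, default=20000) – Number of samples.
k (int, default=3) – Dimension of the latent variable \(Z\).
d (int, default=19) – Number of observed features.
seed (int, default=42) – Random seed.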
- Returns:
X (np.ndarray) – Observed feature matrix with shape (n, d).
Y (np.ndarray) – Response vector with shape (n,).