Datasets

Causal Inference Samplers

These samplers generate or load causal inference datasets with treatment (x), outcome (y), and covariates (v). They are used with the CausalBGM model family.

class bayesgm.datasets.base_sampler.Base_sampler(x, y, v, batch_size=32, normalize=False, random_seed=123)[source]

Mini-batch sampler for causal inference datasets.

Stores treatment \(X\), outcome \(Y\), and covariates \(V\) and provides an infinite mini-batch iterator that cycles through the data.

Parameters:

x (array-like) – Treatment variable with shape (n,) or (n, 1).
y (array-like) – Outcome variable with shape (n,) or (n, 1).
v (array-like) – Covariates with shape (n, v_dim).
batch_size (int, default=32) – Number of samples per mini-batch.
normalize (bool, default=False) – If True, covariates \(V\) are standardised (zero mean, unit variance) before storage.
random_seed (int, default=123) – Random seed used for shuffling.

next_batch()[source]

Return the next mini-batch of (x, y, v).

Returns:

data_x (np.ndarray) – Treatment batch with shape (batch_size, 1).
data_y (np.ndarray) – Outcome batch with shape (batch_size, 1).
data_v (np.ndarray) – Covariates batch with shape (batch_size, v_dim).

load_all()[source]

Return the full dataset.

Returns:

data_x (np.ndarray) – Treatment variable with shape (n, 1).
data_y (np.ndarray) – Outcome variable with shape (n, 1).
data_v (np.ndarray) – Covariates with shape (n, v_dim).
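The cycling behaviour described above can be pictured with a small NumPy sketch. `CyclingSampler` is a hypothetical stand-in written for illustration, not the `Base_sampler` source. Reshuffling at the start of each pass and skipping the leftover samples of a pass are assumptions.

```python
import numpy as np

class CyclingSampler:
    """Hypothetical stand-in illustrating an infinite mini-batch iterator."""

    def __init__(self, x, y, v, batch_size=32, normalize=False, random_seed=123):
        # Store treatment and outcome as column vectors of shape (n, 1).
        self.x = np.asarray(x, dtype=np.float32).reshape(-1, 1)
        self.y = np.asarray(y, dtype=np.float32).reshape(-1, 1)
        self.v = np.asarray(v, dtype=np.float32)
        if normalize:
            # Standardise each covariate column to zero mean, unit variance.
            self.v = (self.v - self.v.mean(axis=0)) / self.v.std(axis=0)
        self.batch_size = batch_size
        self.rng = np.random.default_rng(random_seed)
        self.order = self.rng.permutation(len(self.x))
        self.pos = 0

    def next_batch(self):
        # Start a new shuffled pass once a full batch no longer fits (assumption).
        if self.pos + self.batch_size > len(self.order):
            self.order = self.rng.permutation(len(self.x))
            self.pos = 0
        idx = self.order[self.pos:self.pos + self.batch_size]
        self.pos += self.batch_size
        return self.x[idx], self.y[idx], self.v[idx]

    def load_all(self):
        return self.x, self.y, self.v

rng = np.random.default_rng(0)
sampler = CyclingSampler(rng.normal(size=100), rng.normal(size=100),
                         rng.normal(size=(100, 5)), batch_size=32, normalize=True)
# 100 samples with batch_size=32 give 3 full batches per pass; the 4th call starts a new pass.
batches = [sampler.next_batch() for _ in range(4)]
```

Because `next_batch()` never raises a stop condition, a training loop decides the number of iterations itself rather than iterating over epochs.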

class bayesgm.datasets.causal_samplers.Sim_Hirano_Imbens_sampler(batch_size=32, N=20000, v_dim=200, seed=0)[source]

Bases: Base_sampler

Sampler for the Hirano-Imbens simulation dataset (continuous treatment).

Parameters:

batch_size (int, default=32) – Number of samples per mini-batch.
N (int, default=20000) – Sample size.
v_dim (int, default=200) – Dimension of the covariates.
seed (int, default=0) – Random seed.

Examples

>>> from bayesgm.datasets.causal_samplers import Sim_Hirano_Imbens_sampler
>>> ds = Sim_Hirano_Imbens_sampler(batch_size=32, N=20000, v_dim=200, seed=0)

class bayesgm.datasets.causal_samplers.Sim_Sun_sampler(batch_size=32, N=20000, v_dim=200, seed=0)[source]

Bases: Base_sampler

Sampler for the Sun simulation dataset (continuous treatment).

Parameters:

batch_size (int, default=32) – Number of samples per mini-batch.
N (int, default=20000) – Sample size.
v_dim (int, default=200) – Dimension of the covariates.
seed (int, default=0) – Random seed.

Examples

>>> from bayesgm.datasets.causal_samplers import Sim_Sun_sampler
>>> ds = Sim_Sun_sampler(batch_size=32, N=20000, v_dim=200, seed=0)

class bayesgm.datasets.causal_samplers.Sim_Colangelo_sampler(batch_size=32, N=20000, v_dim=100, seed=0, rho=0.5, offset=[-1, 0, 1], d=1, a=3, b=0.75)[source]

Bases: Base_sampler

Sampler for the Colangelo simulation dataset (continuous treatment).

Parameters:

batch_size (int, default=32) – Number of samples per mini-batch.
N (int, default=20000) – Sample size.
v_dim (int, default=100) – Dimension of the covariates.
seed (int, default=0) – Random seed.

Examples

>>> from bayesgm.datasets.causal_samplers import Sim_Colangelo_sampler
>>> ds = Sim_Colangelo_sampler(batch_size=32, N=20000, v_dim=100, seed=0)

class bayesgm.datasets.causal_samplers.Semi_Twins_sampler(batch_size=32, seed=0, path='../data/Twins')[source]

Bases: Base_sampler

Sampler for the Twins semi-synthetic dataset.

Parameters:

batch_size (int, default=32) – Number of samples per mini-batch.
seed (int, default=0) – Random seed.
path (str, default='../data/Twins') – Path to the original data.

Examples

>>> from bayesgm.datasets.causal_samplers import Semi_Twins_sampler
>>> ds = Semi_Twins_sampler(batch_size=32, path='../data/Twins')

class bayesgm.datasets.causal_samplers.Semi_acic_sampler(batch_size=32, path='../data/ACIC_2018', ufid='d5bd8e4814904c58a79d7cdcd7c2a1bb')[source]

Bases: Base_sampler

Sampler for the ACIC 2018 competition dataset (binary treatment).

Parameters:

batch_size (int, default=32) – Number of samples per mini-batch.
path (str, default='../data/ACIC_2018') – Path to the original dataset.
ufid (str, default='d5bd8e4814904c58a79d7cdcd7c2a1bb') – Unique ID of a specific semi-synthetic setting.

Examples

>>> from bayesgm.datasets.causal_samplers import Semi_acic_sampler
>>> ds = Semi_acic_sampler(path='../data/ACIC_2018', ufid='d5bd8e4814904c58a79d7cdcd7c2a1bb')

Prior / Distribution Samplers

These samplers generate data from known distributions. They are used as latent-space priors or benchmark datasets for the BGM model family.

class bayesgm.datasets.prior_samplers.Gaussian_sampler(mean, sd=1, N=20000)[source]

Multivariate Gaussian sampler.

Generates samples from \(\mathcal{N}(\mu, \sigma^2 I)\) and stores a pre-sampled dataset of size N for batch training.

Parameters:

mean (array-like) – Mean vector of length d.
sd (float, default=1) – Scalar standard deviation applied to every dimension.
N (int, default=20000) – Size of the pre-sampled dataset.

train(batch_size, label=False)[source]

Return a random batch from the pre-sampled dataset.

Parameters:

batch_size (int) – Number of samples to return.
label (bool, default=False) – Unused. Kept for API compatibility.

Returns:

Batch with shape (batch_size, d).

Return type:

np.ndarray

get_batch(batch_size)[source]

Draw fresh samples from the Gaussian distribution.

Parameters:

batch_size (int) – Number of samples to draw.

Returns:

Samples with shape (batch_size, d), dtype float32.

Return type:

np.ndarray

load_all()[source]

Return the full pre-sampled dataset.

Returns:

Dataset with shape (N, d).

Return type:

np.ndarray
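The difference between the pre-sampled pool and fresh draws can be sketched in plain NumPy. This is an illustration of sampling from \(\mathcal{N}(\mu, \sigma^2 I)\) under assumed values, not the `Gaussian_sampler` source. The seed, the batch size of 64, and sampling the batch without replacement are choices made for this example.

```python
import numpy as np

mean = np.array([0.0, 2.0, -1.0])   # mean vector of length d = 3
sd, N = 0.5, 20000
rng = np.random.default_rng(0)      # seed chosen for this example only

# Pre-sample N points from N(mean, sd^2 I): scale standard normals, then shift.
data = (mean + sd * rng.standard_normal((N, mean.size))).astype(np.float32)

# A random batch taken from the pre-sampled pool, in the spirit of train(batch_size).
batch = data[rng.choice(N, size=64, replace=False)]

# Fresh draws from the distribution, in the spirit of get_batch(batch_size).
fresh = (mean + sd * rng.standard_normal((64, mean.size))).astype(np.float32)
```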

class bayesgm.datasets.prior_samplers.GMM_indep_sampler(N, sd, dim, n_components, weights=None, bound=1)[source]

Independent Gaussian Mixture Model (GMM) sampler.

Each dimension is sampled independently from a 1-D Gaussian mixture with n_components equally-spaced centres.

Parameters:

N (int) – Total number of pre-sampled data points.
sd (float) – Standard deviation of each mixture component.
dim (int) – Dimensionality of the data.
n_components (int) – Number of mixture components per dimension.
weights (array-like or None, optional) – Component weights (uniform if None).
bound (float, default=1) – Mixture centres are placed uniformly in [-bound, bound].

get_density(data)[source]

Evaluate the exact GMM density at given points.

Parameters:

data (np.ndarray) – Query points with shape (m, dim).

Returns:

Density values with shape (m,).

Return type:

np.ndarray

train(batch_size)[source]

Return a random batch from the training split.

Parameters:

batch_size (int) – Number of samples to return.

Returns:

Batch with shape (batch_size, dim).

Return type:

np.ndarray

load_all()[source]

Return the full pre-sampled dataset.

Returns:

X (np.ndarray) – Dataset with shape (N, dim).
Y (None) – Placeholder (always None).
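Because the dimensions are independent, the joint density factorises into a product of 1-D mixture densities. The sketch below illustrates that idea with two hypothetical helpers, `sample_indep_gmm` and `indep_gmm_density`. They are not part of bayesgm, and placing the centres with `np.linspace` (endpoints included) is an assumption about what "equally-spaced" means here.

```python
import numpy as np

def sample_indep_gmm(N, sd, dim, n_components, weights=None, bound=1.0, seed=0):
    """Hypothetical helper: draw N points, each coordinate from its own 1-D mixture."""
    rng = np.random.default_rng(seed)
    centres = np.linspace(-bound, bound, n_components)  # equally spaced centres
    if weights is None:
        weights = np.full(n_components, 1.0 / n_components)
    weights = np.asarray(weights, dtype=float)
    # Pick a component independently for every sample and every dimension.
    comp = rng.choice(n_components, size=(N, dim), p=weights)
    return centres[comp] + sd * rng.standard_normal((N, dim)), centres, weights

def indep_gmm_density(data, centres, weights, sd):
    """Exact density: product over dimensions of the 1-D mixture density."""
    diff = data[:, :, None] - centres[None, None, :]           # (m, dim, K)
    comp_pdf = np.exp(-0.5 * (diff / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    per_dim = (comp_pdf * weights).sum(axis=2)                 # (m, dim)
    return per_dim.prod(axis=1)                                # (m,)

X, centres, weights = sample_indep_gmm(N=5000, sd=0.1, dim=2, n_components=3)
density = indep_gmm_density(X[:10], centres, weights, sd=0.1)
```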

class bayesgm.datasets.prior_samplers.Swiss_roll_sampler(N, theta=6.283185307179586, scale=2, sigma=0.4)[source]

Swiss-roll distribution sampler.

Generates 2-D data along the curve \((r \sin(s \cdot r),\, r \cos(s \cdot r))\) plus isotropic Gaussian noise.

Parameters:

N (int) – Number of pre-sampled data points.
theta (float, default=2*pi) – Maximum parameter value along the spiral.
scale (float, default=2) – Frequency scaling of the spiral.
sigma (float, default=0.4) – Standard deviation of the additive Gaussian noise.

train(batch_size, label=False)[source]

Return a random batch from the pre-sampled dataset.

Parameters:

batch_size (int) – Number of samples to return.
label (bool, default=False) – Unused. Kept for API compatibility.

Returns:

Batch with shape (batch_size, 2).

Return type:

np.ndarray

get_density(x_points)[source]

Evaluate the approximate density via kernel density estimation on the noiseless curve.

Parameters:

x_points (np.ndarray) – Query points with shape (m, 2).

Returns:

Density values with shape (m,).

Return type:

np.ndarray

load_all()[source]

Return the full pre-sampled dataset.

Returns:

X (np.ndarray) – Dataset with shape (N, 2).
Y (None) – Placeholder (always None).
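The curve formula in the class description can be turned into a short NumPy sketch. `sample_swiss_roll` is a hypothetical helper, not the library code. How the curve parameter \(r\) is distributed is not stated above, so drawing it uniformly on [0, theta] is an assumption.

```python
import numpy as np

def sample_swiss_roll(N, theta=2 * np.pi, scale=2.0, sigma=0.4, seed=0):
    """Hypothetical helper following the curve formula in the class description."""
    rng = np.random.default_rng(seed)
    # Assumption: the curve parameter r is uniform on [0, theta].
    r = rng.uniform(0.0, theta, size=N)
    curve = np.stack([r * np.sin(scale * r), r * np.cos(scale * r)], axis=1)
    # Isotropic Gaussian noise with standard deviation sigma in both coordinates.
    return curve + sigma * rng.standard_normal((N, 2)), curve

X, curve = sample_swiss_roll(N=2000)
```

Each noiseless point lies at distance \(r\) from the origin, so `theta` bounds the radius of the spiral while `scale` controls how many turns it makes.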

Simulation Functions

Functions for generating synthetic datasets for BGM experiments.

bayesgm.datasets.simulators.simulate_regression(n_samples, n_features, n_targets, effective_rank=None, variance=None, random_state=123)[source]

Simulate a linear regression dataset with optional low-rank design.

Generates \(X\) (optionally low-rank) and \(Y = X_{\text{aug}} \beta + \varepsilon\) where \(X_{\text{aug}}\) includes an intercept column.

Parameters:

n_samples (int) – Number of samples.
n_features (int) – Number of features (columns of \(X\)).
n_targets (int) – Number of target (response) variables.
effective_rank (int or None, optional) – If provided, the design matrix \(X\) is generated as a low-rank matrix with this effective rank. Otherwise \(X\) is i.i.d. standard normal.
variance (np.ndarray or None, optional) – Per-sample noise variance. If None, defaults to 0.01 * mean(X^2) per sample.
random_state (int, default=123) – Random seed for reproducibility.

Returns:

X (np.ndarray) – Feature matrix with shape (n_samples, n_features).
Y (np.ndarray) – Response matrix with shape (n_samples, n_targets).
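For the full-rank case, the model \(Y = X_{\text{aug}} \beta + \varepsilon\) can be sketched as follows. This is an illustration rather than the `simulate_regression` source: the distribution of \(\beta\), the position of the intercept column, and the small problem size are assumptions, and the low-rank design is not covered.

```python
import numpy as np

n_samples, n_features, n_targets = 500, 10, 2
rng = np.random.default_rng(123)

# Full-rank case: i.i.d. standard normal design matrix.
X = rng.standard_normal((n_samples, n_features))

# Augment with a leading column of ones so the first row of beta acts as the intercept.
X_aug = np.hstack([np.ones((n_samples, 1)), X])
beta = rng.standard_normal((n_features + 1, n_targets))  # assumed coefficient draw

# Default noise level from the parameter list: 0.01 * mean(X^2) per sample.
variance = 0.01 * np.mean(X ** 2, axis=1)
eps = rng.standard_normal((n_samples, n_targets)) * np.sqrt(variance)[:, None]

Y = X_aug @ beta + eps
```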

bayesgm.datasets.simulators.simulate_low_rank_data(n_samples=10000, z_dim=2, x_dim=4, rank=2, sigma_z=False, random_state=123)[source]

Simulate low-rank observed data with latent variables.

The generator is:

\(Z \sim \mathcal{N}(0, I)\)
\(X \mid Z \sim \mathcal{N}(\mu(Z), \Sigma(Z))\)

Parameters:

n_samples (int, default=10000) – Number of samples.
z_dim (int, default=2) – Dimension of latent variable Z.
x_dim (int, default=4) – Dimension of observed variable X.
rank (int, default=2) – Rank for the low-rank covariance component.
sigma_z (bool, default=False) – If True, covariance Sigma depends on Z (scaled by z[0]). If False, Sigma is constant across samples.
random_state (int, default=123) – Random seed for reproducibility.

Returns:

X (np.ndarray) – Observed data with shape (n_samples, x_dim).
Z (np.ndarray) – Latent variables with shape (n_samples, z_dim).
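A rough picture of this kind of generator, for the constant-covariance case (`sigma_z=False`), is sketched below. The functional forms are not given above, so everything specific here is assumed: a linear mean map `A`, a random low-rank factor `W`, and a diagonal jitter of 0.1 that keeps the covariance positive definite.

```python
import numpy as np

n_samples, z_dim, x_dim, rank = 10000, 2, 4, 2
rng = np.random.default_rng(123)

Z = rng.standard_normal((n_samples, z_dim))     # Z ~ N(0, I)

A = rng.standard_normal((z_dim, x_dim))         # assumed linear mean map
mu = Z @ A                                      # mu(Z), shape (n_samples, x_dim)

W = rng.standard_normal((x_dim, rank))          # assumed low-rank factor
Sigma = W @ W.T + 0.1 * np.eye(x_dim)           # low-rank part plus diagonal jitter

# Sample X | Z ~ N(mu(Z), Sigma) using the Cholesky factor of Sigma.
L = np.linalg.cholesky(Sigma)
X = mu + rng.standard_normal((n_samples, x_dim)) @ L.T
```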

bayesgm.datasets.simulators.simulate_heteroskedastic_data(n=1000, d=5, seed=42)[source]

Simulate a heteroskedastic regression dataset.

The noise standard deviation depends on the second feature \(X_2\): \(\sigma = 0.5 + 0.5 \sin(2 \pi X_2)\), clipped to [0.1, 2.0] where \(|X_2| > 2\).

Parameters:

n (int, default=1000) – Number of samples.
d (int, default=5) – Number of features.
seed (int, default=42) – Random seed.

Returns:

X (np.ndarray) – Feature matrix with shape (n, d).
Y (np.ndarray) – Response vector with shape (n,).
sigma (np.ndarray) – True noise standard deviation with shape (n,).
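The noise model can be illustrated in a few lines of NumPy. Only the formula for the noise standard deviation comes from the description above. The feature distribution and the placeholder mean function are assumptions, and the clip is applied to every sample for simplicity, which may differ from the library.

```python
import numpy as np

n, d = 1000, 5
rng = np.random.default_rng(42)
X = rng.standard_normal((n, d))                  # assumed feature distribution

x2 = X[:, 1]                                     # second feature, X_2
sigma = 0.5 + 0.5 * np.sin(2 * np.pi * x2)       # noise sd from the formula above
sigma = np.clip(sigma, 0.1, 2.0)                 # simplification: clip every sample

mean = X[:, 0]                                   # placeholder mean function (assumption)
Y = mean + sigma * rng.standard_normal(n)        # per-sample noise scale
```

Returning `sigma` alongside `X` and `Y` lets a fitted model's predicted uncertainty be compared against the true per-sample noise level.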

bayesgm.datasets.simulators.simulate_z_hetero(n=20000, k=3, d=19, seed=42)[source]

Simulate a latent-factor heteroskedastic regression dataset.

Observed features \(X\) are a noisy low-rank projection of a \(k\)-dimensional latent variable \(Z\). The response \(Y\) depends nonlinearly on \(Z\) with heteroskedastic noise.

Parameters:

n (int, default=20000) – Number of samples.
k (int, default=3) – Dimension of the latent variable \(Z\).
d (int, default=19) – Number of observed features.
seed (int, default=42) – Random seed.

Returns:

X (np.ndarray) – Observed feature matrix with shape (n, d).
Y (np.ndarray) – Response vector with shape (n,).