Datasets

Causal Inference Samplers

These samplers generate or load causal inference datasets with treatment (x), outcome (y), and covariates (v). They are used with the CausalBGM model family.

class bayesgm.datasets.base_sampler.Base_sampler(x, y, v, batch_size=32, normalize=False, random_seed=123)[source]

Mini-batch sampler for causal inference datasets.

Stores treatment \(X\), outcome \(Y\), and covariates \(V\) and provides an infinite mini-batch iterator that cycles through the data.

Parameters:
  • x (array-like) – Treatment variable with shape (n,) or (n, 1).

  • y (array-like) – Outcome variable with shape (n,) or (n, 1).

  • v (array-like) – Covariates with shape (n, v_dim).

  • batch_size (int, default=32) – Number of samples per mini-batch.

  • normalize (bool, default=False) – If True, covariates \(V\) are standardised (zero mean, unit variance) before storage.

  • random_seed (int, default=123) – Random seed used for shuffling.

next_batch()[source]

Return the next mini-batch of (x, y, v).

Returns:

  • data_x (np.ndarray) – Treatment batch with shape (batch_size, 1).

  • data_y (np.ndarray) – Outcome batch with shape (batch_size, 1).

  • data_v (np.ndarray) – Covariates batch with shape (batch_size, v_dim).

load_all()[source]

Return the full dataset.

Returns:

  • data_x (np.ndarray) – Treatment variable with shape (n, 1).

  • data_y (np.ndarray) – Outcome variable with shape (n, 1).

  • data_v (np.ndarray) – Covariates with shape (n, v_dim).
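
The cycling mini-batch behaviour described above can be illustrated with a minimal NumPy sketch. The class name CyclingSampler and its internals (float32 storage, reshuffling at the end of each pass) are assumptions made for illustration and are not taken from the bayesgm source.

```python
import numpy as np


class CyclingSampler:
    """Hypothetical stand-in for a cycling mini-batch sampler (not bayesgm code)."""

    def __init__(self, x, y, v, batch_size=32, normalize=False, random_seed=123):
        # Treatment and outcome are stored as column vectors of shape (n, 1).
        self.x = np.asarray(x, dtype="float32").reshape(-1, 1)
        self.y = np.asarray(y, dtype="float32").reshape(-1, 1)
        self.v = np.asarray(v, dtype="float32")
        if normalize:
            # Standardise each covariate column (assumes no constant columns).
            self.v = (self.v - self.v.mean(axis=0)) / self.v.std(axis=0)
        self.batch_size = batch_size  # assumed to be <= n
        self.rng = np.random.default_rng(random_seed)
        self.order = self.rng.permutation(len(self.x))
        self.pos = 0

    def next_batch(self):
        # Start a new shuffled pass when the current one cannot fill a batch.
        if self.pos + self.batch_size > len(self.order):
            self.order = self.rng.permutation(len(self.x))
            self.pos = 0
        idx = self.order[self.pos:self.pos + self.batch_size]
        self.pos += self.batch_size
        return self.x[idx], self.y[idx], self.v[idx]

    def load_all(self):
        return self.x, self.y, self.v


rng = np.random.default_rng(0)
demo = CyclingSampler(rng.normal(size=100), rng.normal(size=100),
                      rng.normal(size=(100, 5)), batch_size=32, normalize=True)
bx, by, bv = demo.next_batch()
print(bx.shape, by.shape, bv.shape)  # (32, 1) (32, 1) (32, 5)
```

Because the iterator never raises StopIteration, a training loop built on this pattern decides for itself how many batches to draw.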

class bayesgm.datasets.causal_samplers.Sim_Hirano_Imbens_sampler(batch_size=32, N=20000, v_dim=200, seed=0)[source]

Bases: Base_sampler

Hirano Imbens simulation dataset (continuous treatment) sampler (inherited from Base_sampler).

Parameters:
  • batch_size – Int object denoting the batch size for mini-batch training. Default: 32.

  • N – Sample size. Default: 20000.

  • v_dim – Int object denoting the dimension for covariates. Default: 200.

  • seed – Int object denoting the random seed. Default: 0.

Examples

>>> from bayesgm.datasets.causal_samplers import Sim_Hirano_Imbens_sampler
>>> ds = Sim_Hirano_Imbens_sampler(batch_size=32, N=20000, v_dim=200, seed=0)

class bayesgm.datasets.causal_samplers.Sim_Sun_sampler(batch_size=32, N=20000, v_dim=200, seed=0)[source]

Bases: Base_sampler

Sun simulation dataset (continuous treatment) sampler (inherited from Base_sampler).

Parameters:
  • batch_size – Int object denoting the batch size for mini-batch training. Default: 32.

  • N – Sample size. Default: 20000.

  • v_dim – Int object denoting the dimension for covariates. Default: 200.

  • seed – Int object denoting the random seed. Default: 0.

Examples

>>> from bayesgm.datasets.causal_samplers import Sim_Sun_sampler
>>> ds = Sim_Sun_sampler(batch_size=32, N=20000, v_dim=200, seed=0)

class bayesgm.datasets.causal_samplers.Sim_Colangelo_sampler(batch_size=32, N=20000, v_dim=100, seed=0, rho=0.5, offset=[-1, 0, 1], d=1, a=3, b=0.75)[source]

Bases: Base_sampler

Colangelo simulation dataset (continuous treatment) sampler (inherited from Base_sampler).

Parameters:
  • batch_size – Int object denoting the batch size for mini-batch training. Default: 32.

  • N – Sample size. Default: 20000.

  • v_dim – Int object denoting the dimension for covariates. Default: 100.

  • seed – Int object denoting the random seed. Default: 0.

Examples

>>> from bayesgm.datasets.causal_samplers import Sim_Colangelo_sampler
>>> ds = Sim_Colangelo_sampler(batch_size=32, N=20000, v_dim=100, seed=0)

class bayesgm.datasets.causal_samplers.Semi_Twins_sampler(batch_size=32, seed=0, path='../data/Twins')[source]

Bases: Base_sampler

Twins semi-synthetic dataset sampler (inherited from Base_sampler).

Parameters:
  • batch_size – Int object denoting the batch size for mini-batch training. Default: 32.

  • seed – Int object denoting the random seed. Default: 0.

  • path – Str object denoting the path to the original data. Default: '../data/Twins'.

Examples

>>> from bayesgm.datasets.causal_samplers import Semi_Twins_sampler
>>> ds = Semi_Twins_sampler(batch_size=32, path='../data/Twins')

class bayesgm.datasets.causal_samplers.Semi_acic_sampler(batch_size=32, path='../data/ACIC_2018', ufid='d5bd8e4814904c58a79d7cdcd7c2a1bb')[source]

Bases: Base_sampler

ACIC 2018 competition dataset (binary treatment) sampler (inherited from Base_sampler).

Parameters:
  • batch_size – Int object denoting the batch size for mini-batch training. Default: 32.

  • path – Str object denoting the path to the original dataset.

  • ufid – Str object denoting the unique id of a specific semi-synthetic setting.

Examples

>>> from bayesgm.datasets.causal_samplers import Semi_acic_sampler
>>> ds = Semi_acic_sampler(path='../data/ACIC_2018', ufid='d5bd8e4814904c58a79d7cdcd7c2a1bb')

Prior / Distribution Samplers

These samplers generate data from known distributions. They are used as latent-space priors or benchmark datasets for the BGM model family.

class bayesgm.datasets.prior_samplers.Gaussian_sampler(mean, sd=1, N=20000)[source]

Multivariate Gaussian sampler.

Generates samples from \(\mathcal{N}(\mu, \sigma^2 I)\) and stores a pre-sampled dataset of size N for batch training.

Parameters:
  • mean (array-like) – Mean vector of length d.

  • sd (float, default=1) – Scalar standard deviation applied to every dimension.

  • N (int, default=20000) – Size of the pre-sampled dataset.

train(batch_size, label=False)[source]

Return a random batch from the pre-sampled dataset.

Parameters:
  • batch_size (int) – Number of samples to return.

  • label (bool, default=False) – Unused. Kept for API compatibility.

Returns:

Batch with shape (batch_size, d).

Return type:

np.ndarray

get_batch(batch_size)[source]

Draw fresh samples from the Gaussian distribution.

Parameters:

batch_size (int) – Number of samples to draw.

Returns:

Samples with shape (batch_size, d), dtype float32.

Return type:

np.ndarray

load_all()[source]

Return the full pre-sampled dataset.

Returns:

Dataset with shape (N, d).

Return type:

np.ndarray
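
The difference between drawing from a pre-sampled dataset (train) and drawing fresh samples (get_batch) can be sketched in plain NumPy. This is an illustration under assumptions, not the library's code: the variable names, the float32 cast of the stored dataset, and the choice to index the stored dataset with replacement are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])   # mean vector of length d = 3
sd = 1.0                          # scalar standard deviation
N = 20000                         # size of the pre-sampled dataset

# Pre-sampled dataset drawn once from N(mu, sd^2 I).
dataset = (mu + sd * rng.standard_normal((N, mu.size))).astype("float32")


def train(batch_size):
    # Random batch taken from the stored dataset (with replacement: assumption).
    idx = rng.integers(0, N, size=batch_size)
    return dataset[idx]


def get_batch(batch_size):
    # Fresh draws that are not part of the stored dataset.
    fresh = mu + sd * rng.standard_normal((batch_size, mu.size))
    return fresh.astype("float32")


print(train(64).shape, get_batch(16).shape)  # (64, 3) (16, 3)
```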

class bayesgm.datasets.prior_samplers.GMM_indep_sampler(N, sd, dim, n_components, weights=None, bound=1)[source]

Independent Gaussian Mixture Model (GMM) sampler.

Each dimension is sampled independently from a 1-D Gaussian mixture with n_components equally-spaced centres.

Parameters:
  • N (int) – Total number of pre-sampled data points.

  • sd (float) – Standard deviation of each mixture component.

  • dim (int) – Dimensionality of the data.

  • n_components (int) – Number of mixture components per dimension.

  • weights (array-like or None, optional) – Component weights (uniform if None).

  • bound (float, default=1) – Mixture centres are placed uniformly in [-bound, bound].

get_density(data)[source]

Evaluate the exact GMM density at given points.

Parameters:

data (np.ndarray) – Query points with shape (m, dim).

Returns:

Density values with shape (m,).

Return type:

np.ndarray

train(batch_size)[source]

Return a random batch from the training split.

Parameters:

batch_size (int) – Number of samples to return.

Returns:

Batch with shape (batch_size, dim).

Return type:

np.ndarray

load_all()[source]

Return the full pre-sampled dataset.

Returns:

  • X (np.ndarray) – Dataset with shape (N, dim).

  • Y (None) – Placeholder (always None).
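
Because every dimension is sampled independently, the joint density factorises into a product of 1-D mixture densities. The sketch below illustrates both the sampling and the exact density under stated assumptions: the function names are hypothetical, and the centres are placed with np.linspace(-bound, bound, n_components), endpoints included, which may differ from the library's placement.

```python
import numpy as np


def sample_indep_gmm(n, dim, n_components, sd, bound=1.0, weights=None, seed=0):
    """Draw n points whose coordinates are independent 1-D Gaussian mixtures."""
    rng = np.random.default_rng(seed)
    centres = np.linspace(-bound, bound, n_components)  # assumption: endpoints included
    if weights is None:
        w = np.full(n_components, 1.0 / n_components)   # uniform component weights
    else:
        w = np.asarray(weights, dtype=float)
    comp = rng.choice(n_components, size=(n, dim), p=w)  # component index per coordinate
    data = centres[comp] + sd * rng.standard_normal((n, dim))
    return data, centres, w


def indep_gmm_density(data, centres, w, sd):
    """Exact density: product over dimensions of the 1-D mixture density."""
    data = np.asarray(data, dtype=float)
    diff = data[:, :, None] - centres[None, None, :]     # shape (m, dim, n_components)
    pdf = np.exp(-0.5 * (diff / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    per_dim = (pdf * w).sum(axis=2)                      # 1-D mixture density per coordinate
    return per_dim.prod(axis=1)                          # shape (m,)


X, centres, w = sample_indep_gmm(n=1000, dim=2, n_components=3, sd=0.1)
dens = indep_gmm_density(X, centres, w, sd=0.1)
print(X.shape, dens.shape)  # (1000, 2) (1000,)
```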

class bayesgm.datasets.prior_samplers.Swiss_roll_sampler(N, theta=6.283185307179586, scale=2, sigma=0.4)[source]

Swiss-roll distribution sampler.

Generates 2-D data along the curve \((r \sin(s \cdot r),\, r \cos(s \cdot r))\) plus isotropic Gaussian noise.

Parameters:
  • N (int) – Number of pre-sampled data points.

  • theta (float, default=2*pi) – Maximum parameter value along the spiral.

  • scale (float, default=2) – Frequency scaling of the spiral.

  • sigma (float, default=0.4) – Standard deviation of the additive Gaussian noise.

train(batch_size, label=False)[source]

Return a random batch from the pre-sampled dataset.

Parameters:
  • batch_size (int) – Number of samples to return.

  • label (bool, default=False) – Unused. Kept for API compatibility.

Returns:

Batch with shape (batch_size, 2).

Return type:

np.ndarray

get_density(x_points)[source]

Evaluate the (approximate) density via kernel density on the noiseless curve.

Parameters:

x_points (np.ndarray) – Query points with shape (m, 2).

Returns:

Density values with shape (m,).

Return type:

np.ndarray

load_all()[source]

Return the full pre-sampled dataset.

Returns:

  • X (np.ndarray) – Dataset with shape (N, 2).

  • Y (None) – Placeholder (always None).
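
The curve formula above can be turned into a short sampling sketch. The function name is hypothetical, and drawing the curve parameter r uniformly from [0, theta] is an assumption; the documentation only states that theta is the maximum parameter value.

```python
import numpy as np


def sample_swiss_roll(n, theta=2 * np.pi, scale=2.0, sigma=0.4, seed=0):
    """Sample noisy points along (r sin(s r), r cos(s r)), with s = scale."""
    rng = np.random.default_rng(seed)
    r = rng.uniform(0.0, theta, size=n)  # assumption: uniform curve parameter
    curve = np.stack([r * np.sin(scale * r), r * np.cos(scale * r)], axis=1)
    noise = sigma * rng.standard_normal((n, 2))  # isotropic Gaussian noise
    return curve + noise, r


points, r = sample_swiss_roll(1000)
print(points.shape)  # (1000, 2)
```

With sigma=0 every point lies exactly on the spiral, so its distance from the origin equals r; this gives a quick sanity check for the parameterisation.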

Simulation Functions

Functions for generating synthetic datasets for BGM experiments.

bayesgm.datasets.simulators.simulate_regression(n_samples, n_features, n_targets, effective_rank=None, variance=None, random_state=123)[source]

Simulate a linear regression dataset with optional low-rank design.

Generates \(X\) (optionally low-rank) and \(Y = X_{\text{aug}} \beta + \varepsilon\) where \(X_{\text{aug}}\) includes an intercept column.

Parameters:
  • n_samples (int) – Number of samples.

  • n_features (int) – Number of features (columns of \(X\)).

  • n_targets (int) – Number of target (response) variables.

  • effective_rank (int or None, optional) – If provided, the design matrix \(X\) is generated as a low-rank matrix with this effective rank. Otherwise \(X\) is i.i.d. standard normal.

  • variance (np.ndarray or None, optional) – Per-sample noise variance. If None, defaults to 0.01 * mean(X^2) per sample.

  • random_state (int, default=123) – Random seed for reproducibility.

Returns:

  • X (np.ndarray) – Feature matrix with shape (n_samples, n_features).

  • Y (np.ndarray) – Response matrix with shape (n_samples, n_targets).
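
The generative model \(Y = X_{\text{aug}} \beta + \varepsilon\) can be written out in a few lines of NumPy. This is a sketch of the full-rank case only, with the default per-sample noise variance described above. Drawing \(\beta\) from a standard normal is an assumption, since the documentation does not say how the coefficients are chosen.

```python
import numpy as np

rng = np.random.default_rng(123)
n_samples, n_features, n_targets = 500, 10, 2

# Full-rank design: i.i.d. standard normal entries.
X = rng.standard_normal((n_samples, n_features))

# Augment with an intercept column of ones.
X_aug = np.hstack([np.ones((n_samples, 1)), X])

# Coefficients, including the intercept row (assumption: standard normal).
beta = rng.standard_normal((n_features + 1, n_targets))

# Default per-sample noise variance: 0.01 * mean(X^2) for each row.
variance = 0.01 * np.mean(X ** 2, axis=1)
eps = rng.standard_normal((n_samples, n_targets)) * np.sqrt(variance)[:, None]

Y = X_aug @ beta + eps
print(X.shape, Y.shape)  # (500, 10) (500, 2)
```

Since the noise is small relative to the signal, an ordinary least-squares fit of Y on X_aug recovers beta closely, which is a convenient check on a simulated dataset.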

bayesgm.datasets.simulators.simulate_low_rank_data(n_samples=10000, z_dim=2, x_dim=4, rank=2, sigma_z=False, random_state=123)[source]

Simulate low-rank observed data with latent variables.

The generator is:

  • Z ~ N(0, I)

  • X | Z ~ N(mu(Z), Sigma(Z))

Parameters:
  • n_samples (int, default=10000) – Number of samples.

  • z_dim (int, default=2) – Dimension of latent variable Z.

  • x_dim (int, default=4) – Dimension of observed variable X.

  • rank (int, default=2) – Rank for the low-rank covariance component.

  • sigma_z (bool, default=False) – If True, covariance Sigma depends on Z (scaled by z[0]). If False, Sigma is constant across samples.

  • random_state (int, default=123) – Random seed for reproducibility.

Returns:

  • X (np.ndarray) – Observed data with shape (n_samples, x_dim).

  • Z (np.ndarray) – Latent variables with shape (n_samples, z_dim).
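
One way to realise a generator of this form, for the constant-covariance case (sigma_z=False), is sketched below. The documentation does not specify mu(Z) or Sigma, so everything here is hypothetical: a linear mean map mu(Z) = Z A, and a covariance Sigma = W W^T + noise_var * I built from a low-rank factor W plus a small diagonal term.

```python
import numpy as np


def simulate_low_rank_sketch(n_samples=10000, z_dim=2, x_dim=4, rank=2,
                             noise_var=0.1, seed=123):
    """Hypothetical generator: Z ~ N(0, I), X | Z ~ N(Z A, W W^T + noise_var I)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((z_dim, x_dim))        # assumed linear mean map
    W = 0.5 * rng.standard_normal((x_dim, rank))   # assumed low-rank factor
    Sigma = W @ W.T + noise_var * np.eye(x_dim)

    Z = rng.standard_normal((n_samples, z_dim))
    mu = Z @ A
    # Noise with covariance Sigma: low-rank part plus isotropic part.
    low_rank_noise = rng.standard_normal((n_samples, rank)) @ W.T
    iso_noise = np.sqrt(noise_var) * rng.standard_normal((n_samples, x_dim))
    X = mu + low_rank_noise + iso_noise
    return X, Z, A, Sigma


X, Z, A, Sigma = simulate_low_rank_sketch()
print(X.shape, Z.shape)  # (10000, 4) (10000, 2)
```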

bayesgm.datasets.simulators.simulate_heteroskedastic_data(n=1000, d=5, seed=42)[source]

Simulate a heteroskedastic regression dataset.

The noise standard deviation depends on the second feature \(X_2\): \(\sigma = 0.5 + 0.5 \sin(2 \pi X_2)\) (clipped to [0.1, 2.0] outside \(|X_2| > 2\)).

Parameters:
  • n (int, default=1000) – Number of samples.

  • d (int, default=5) – Number of features.

  • seed (int, default=42) – Random seed.

Returns:

  • X (np.ndarray) – Feature matrix with shape (n, d).

  • Y (np.ndarray) – Response vector with shape (n,).

  • sigma (np.ndarray) – True noise standard deviation with shape (n,).
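
The noise model can be sketched as follows. Two simplifications are assumptions of this sketch rather than documented behaviour: the clip to [0.1, 2.0] is applied to every sample, and the mean of Y is taken to be the first feature, because the documentation does not state the mean function.

```python
import numpy as np


def heteroskedastic_sketch(n=1000, d=5, seed=42):
    """Hypothetical generator whose noise scale depends on the second feature."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    x2 = X[:, 1]  # second feature (index 1)

    # sigma = 0.5 + 0.5 * sin(2 * pi * X_2), clipped to [0.1, 2.0] everywhere.
    sigma = np.clip(0.5 + 0.5 * np.sin(2.0 * np.pi * x2), 0.1, 2.0)

    mean = X[:, 0]  # assumed mean function, for illustration only
    Y = mean + sigma * rng.standard_normal(n)
    return X, Y, sigma


X, Y, sigma = heteroskedastic_sketch()
print(X.shape, Y.shape, sigma.shape)  # (1000, 5) (1000,) (1000,)
```

Returning sigma alongside the data makes it possible to compare a model's predicted uncertainty with the true noise level for each sample.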

bayesgm.datasets.simulators.simulate_z_hetero(n=20000, k=3, d=19, seed=42)[source]

Simulate a latent-factor heteroskedastic regression dataset.

Observed features \(X\) are a noisy low-rank projection of a \(k\)-dimensional latent variable \(Z\). The response \(Y\) depends nonlinearly on \(Z\) with heteroskedastic noise.

Parameters:
  • n (int, default=20000) – Number of samples.

  • k (int, default=3) – Dimension of the latent variable \(Z\).

  • d (int, default=19) – Number of observed features.

  • seed (int, default=42) – Random seed.

Returns:

  • X (np.ndarray) – Observed feature matrix with shape (n, d).

  • Y (np.ndarray) – Response vector with shape (n,).
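
A generator with this structure can be sketched as below. The loading matrix, the observation noise level, the nonlinear mean, and the noise scale are all hypothetical choices made for illustration; the documentation only fixes the overall structure (low-rank X from a k-dimensional Z, nonlinear heteroskedastic Y). The sketch also returns Z for inspection, which the documented function does not.

```python
import numpy as np


def z_hetero_sketch(n=20000, k=3, d=19, seed=42):
    """Hypothetical latent-factor generator (requires k >= 2)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, k))                   # latent variable
    A = rng.standard_normal((k, d))                   # assumed loading matrix
    X = Z @ A + 0.1 * rng.standard_normal((n, d))     # noisy low-rank projection

    mean = np.sin(Z[:, 0]) + Z[:, 1] ** 2             # assumed nonlinear mean
    sigma = 0.1 + 0.5 * np.abs(Z[:, -1])              # assumed Z-dependent noise scale
    Y = mean + sigma * rng.standard_normal(n)
    return X, Y, Z


X, Y, Z = z_hetero_sketch(n=2000)
print(X.shape, Y.shape, Z.shape)  # (2000, 19) (2000,) (2000, 3)
```

The low-rank structure shows up in the singular values of X: the first k are much larger than the rest, which reflect only the observation noise.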