BlockInvGPFA Class

class blockinvgpfa.BlockInvGPFA(bin_size=0.02, gp_kernel=None, z_dim=3, min_var_frac=0.01, em_tol=1e-05, em_max_iters=500, freq_ll=5, verbose=0)[source]

BlockInvGPFA is a high-performance implementation of Gaussian Process Factor Analysis (GPFA) is a probabilistic dimensionality reduction technique that extracts smooth latent trajectories from noisy, high-dimensional time series data. GPFA combines Factor Analysis (FA) for dimensionality reduction with Gaussian Processes (GPs) to model the time courses of the latent factors in a unified, probabilistic framework.

GPFA operates on a set of time-series data, where each time series is given by an x_dim x bins matrix. x_dim needs to be the same across all time-series, but their individual durations, as determined by bins, can vary across time-series. When fitting the model to data, GPFA assumes all model parameters to be shared across the fitted time-series.

Please consult the main page for more details on the underlying probabilistic model.

This implementation, BlockInvGPFA, builds on the core model described in [Yu et al., 2009] and is adapted from the Elephant’s toolkit, with substantial algorithmic enhancements:

  1. It uses block-matrix identities and Schur complement updates to efficiently share kernel inversion computations across variable-length trials.

  2. It introduces a new variance-explained metric to evaluate BlockInvGPFA model fits

Parameters:
bin_sizefloat, optional, default=0.02

Size of time-series data bins in seconds.

gp_kernelGP kernel instance, optional, default=None

Please consult the scikit-learn GP documentation for a list of available GP kernels.

If None is passed, the kernel defaults to:

ConstantKernel(1-0.001, constant_value_bounds='fixed')
* RBF(length_scale=0.1)
+ ConstantKernel(0.001, constant_value_bounds='fixed')
* WhiteKernel(noise_level=1, noise_level_bounds='fixed')

where only kernel hyperparameters not marked as fixed are learned - in the above case only the length scale of the RBF kernel.

Note

The gp_kernel can either be a single kernel (in which case it will be replicated across all latent dimensions) or a sequence of kernels, one per latent dimension.

z_dimint, optional, default=3

Dimensionality of latent space.

min_var_fracfloat, optional, default=0.01

Fraction of overall data variance for each observed dimension to set as the private variance floor. This is used to combat Heywood cases, where ML parameter learning returns one or more zero private variances (see Martin & McDonald, Psychometrika, Dec 1975).

em_tolfloat, optional, default=1e-5

Stopping criterion for Expectation Maximization (EM) algorithm.

em_max_itersint, optional, default=500

Maximum number of EM iterations.

freq_llint, optional, default=5

Frequency with which data likelihood is updated across EM iterations.

verboseint, optional, default=0

Determines whether to display status messages when calling fit(). - verbose <= 0: no output - verbose >= 1: output of EM iteration statistics - verbose >= 2: additional fitting information.

Attributes:
C_numpy.ndarray, shape (x_dim, z_dim)

Loading matrix parameters, mapping observed data space to latent trajectory space.

d_numpy.ndarray, shape (x_dim, 1)

Observation mean parameter.

R_numpy.ndarray, shape (x_dim, x_dim)

Observation noise covariance parameter.

gp_kernel.thetanumpy.ndarray

Flattened and log-transformed non-fixed GP hyperparameters for optimization.

fit_info_dict

Parameter fitting information. Updated with each call to fit().

  • iteration_timelist

    Runtime for each EM iteration.

  • log_likelihoodslist

    Log likelihoods after each EM iteration.

train_latent_seqs_numpy.recarray

Copy of latent variable time-courses, inferred for training data, and augmented with additional fields.

Z_munumpy.ndarray, shape (z_dim, bins)

Latent variable posterior means for each time bin.

Z_covnumpy.ndarray, shape (z_dim, z_dim, bins)

Latent variable posterior covariances across different time bins.

Z_covGPnumpy.ndarray, shape (bins, bins, z_dim)

Posterior covariance over time for each latent variable.

valid_data_names_tuple of str

Sequence of valid names for returned_data argument of predict() method.

Methods

fit:

Fits model parameters to training data.

predict:

Provides inferred latent variable time-series.

score:

Returns the log-likelihood scores.

variance_explained:

Computes the total explained variance regression score.

Notes

Two major uses cases of GPFA are:

  1. Inferring latent variable time-courses: GPFA fits the data, consisting of a set of high-dimensional time-series, using the fit() method. It then returns the inferred, and, on request, orthonormalized latent variable time-courses by calling predict().

  2. Identifying the latent space dimensionality: GPFA’s main hyperparameter is the dimensionality z_dim of the latent space. The standard approach to identify the best value for z_dim is to perform cross validation. First, one splits the data into a training and a test set, and fits GPFA to the training set only, using fit(). Second, one applies predict() to the test set, to compute the test set data log-likelihood. This procedure is repeated for different z_dim values, to identify the z_dim leading to the best test set data log-likelihood.

    This GPFA class support the use of the cross-validation (CV) functions of sklearn.model_selection.

In both cases, the data provided to fit() is a sequence of matrices \(\boldsymbol{X}_1, \boldsymbol{X}_2, \dots\) of arbitrary order, each containing a separate time-series. Each matrix \(\boldsymbol{X}_n\) is of size x_dim x bins. Here, the observation dimensionality x_dim needs to be shared across all matrices, but their duration bins can differ.

The latent variable time-courses returned by predict() have a similar structure. They are again a sequence of matrices, \(\boldsymbol{Z}_1, \boldsymbol{Z}_2, \dots\), where each \(\boldsymbol{Z}_n\) corresponds to a provided \(\boldsymbol{X}_n\). The size of each \(\boldsymbol{Z}_n\) is z_dim x bins, where bins equals that of the associated \(\boldsymbol{X}_n\).

GPFA fits the model parameters \(\boldsymbol{C}\), \(\boldsymbol{d}\), and \(\boldsymbol{R}\) (stored in attributes C_, d_, and R_) and the GP kernel parameters (stored log-transformed in attribute gp_kernel.theta). Please consult the main page for how these attributes parameterize the probabilistic model underlying GPFA. Parameter fitting is performed by the Expectation Maximization method. See fit() for details.

Examples

>>> import numpy as np
>>> from blockinvgpfa import BlockInvGPFA
>>> X = np.array([[
...     [3, 1, 0, 4, 1, 2, 1, 3, 4, 2, 2, 1, 2, 0, 0, 2, 2, 5, 1, 3],
...     [1, 0, 2, 0, 2, 1, 4, 2, 0, 1, 4, 4, 1, 2, 8, 3, 2, 1, 3, 1],
...     [2, 2, 1, 1, 3, 2, 3, 2, 2, 0, 3, 4, 1, 2, 3, 1, 4, 1, 0, 1]
... ]])
>>> 
>>> blockinv_gpfa = BlockInvGPFA(z_dim=1)
>>> # Fit the model
>>> blockinv_gpfa.fit(X)
Initializing parameters using factor analysis...
Fitting GP parameters by EM...
>>> # Infere latent variable time-courses for the same data
>>> Zs, _ = blockinv_gpfa.predict()
>>> print(Zs)
[array([[-1.18405633, -2.00878   , -0.01470251, -2.3143544 ,  0.03651376,
         -1.06948736,  2.05355342, -0.16920794, -2.26437342, -1.21934552,
          1.98656088,  2.13305066, -1.14113106,  0.11319858,  6.14998095,
          0.89241818,  0.03509708, -1.3936327 ,  0.84781358, -1.2281484 ]])]
>>> # Display the loading matrix (C_) and observation mean (d_) parameters
>>> print(blockinv_gpfa.C_)
[[-0.78376501]
[ 1.77773876]
[ 0.51134474]]
>>> print(blockinv_gpfa.d_)
[1.95470037 2.08933859 1.89693338]
>>> # Obtaining log-likelihood scores
>>> llhs = blockinv_gpfa.score()
>>> print(llhs)
[-117.54588379661148, -107.17193271370158, ..., -100.13200154180569]
>>> # Evaluate the total explained variance regression score
>>> print(blockinv_gpfa.variance_explained())
(0.6581475357501596, array([0.65814754]))
fit(X, use_cut_trials=False)[source]

Fit the GPFA model parameters to the given training data using the Expectation Maximization algorithm. The function also computes the orthonormalization transform of the loading matrix for subsequent use by predict().

Parameters:
Xarray-like, default=None

An array-like sequence of high-dimensional time-series. Each element \(\boldsymbol{X}_n\) in this sequence is an x_dim x bins matrix containing a sequence of x_dim-dimensional observations along its columns. The observation dimensionality x_dim needs to be the same across each \(\boldsymbol{X}_n\), but bins can differ across them. The order in which the elements \(\boldsymbol{X}_1, \boldsymbol{X}_2, \dots\) are provided in the sequence is arbitrary and has no impact on the fitted parameters.

use_cut_trialsbool, optional, default=False

If True, long time-series are cut into multiple shorter ones to potentially speed up training.

Attention

Cutting long time-series might worsen parameter fits, in particular as it removes any long-distance temporal dependencies that might exist in the data.

Returns:
selfobject

Returns the instance itself.

Raises:
ValueError

If covariance matrix of input data is rank deficient.

Notes

The fit() method finds the model parameters that best explain the provided data using Expectation Maximization (EM), using the following steps:

  1. Initialization: The emission model parameters, \(\boldsymbol{C}\), \(\boldsymbol{d}\), and \(\boldsymbol{R}\) are initialized using Factor Analysis while leaving the latent variable time-course unconstrained. GP kernel parameters are left at their default values.

  2. Expecation step: Given current parameter estimates, the Expectation step computes the posterior distribution over the latent variables \(\boldsymbol{Z}\), and the complete data log-likelihood.

  3. Maximization step: Given the \(\boldsymbol{Z}\)-posterior, the Maximization step corresponds to finding the emission model parameters (see first step) that maximize the expected complete data log-likelihood analytically, and optimizes the GP kernel parameters using gradient descent.

  4. EM iteration: Steps 2 and 3 are repeated until either the change in complete data likelihood drops below the set threshold, or if the maximum number of interactions is reached.

Orthonormalization: Finally, this function computes an orthonormalization transform to the loading matrix \(\boldsymbol{C}\) that is in turn used by predict() to return an orthonormalized set of latent variables.

predict(X=None, returned_data=['Z_mu_orth'])[source]

Provides the inferred latent variable time-courses and other moments thereof, if requested.

Parameters:
Xarray-like, default=None

An array-like sequence of time-series. The format is the same as for fit().

Note

If X=None, the latent state estimates for the last X that fit() was called with are returned.

returned_datalist of str, default [‘Z_mu_orth’]

Determines which moments of the inferred latent variable time-courses, and other data are returned. Valid strings are:

'Z_mu' : posterior mean of latent variables before orthonormalization

'Z_mu_orth' : orthonormalized posterior mean of latent variable

'Z_cov' : posterior covariance between latent variables

'Z_covGP' : posterior covariance over time for each latent variable

'X' : time-series data used to estimate the GPFA model parameters

Returns:
numpy.ndarray or dict

Returns a single numpy.ndarray if only a single string is provided to the returned_data argument, containing the requested data. If multiple strings are provided, then the method returns a dictionaries whose keys match the strings provided to returned_data.

Either way, the \(n\) th item in either of the returned numpy.ndarray provides the latent state moments associated with the \(n\) th time-series in the provided X. Its size differs depending on the requested moment, and is

Z_mu: numpy.ndarray of shape (z_dim x bins)

Z_mu_orth: numpy.ndarray of shape (z_dim x bins)

Z_cov: numpy.ndarray of shape (z_dim x z_dim x bins)

Z_covGP: numpy.ndarray of shape (bins x bins x z_dim)

X : numpy.ndarray of shape (x_dim x bins)

Note

Note that the number of bins (bins) can vary across elements in the sequence, reflecting the length of the time series in the provided time-series data.

llsfloat

data log likelihoods

Raises:
ValueError

If returned_data contains strings that aren’t present in self.valid_data_names_.

score(X=None)[source]

Returns the log-likelihood scores. If X = None, the training data log-likelihood scores will be returned.

Parameters:
Xan array-like, default=None

An array-like sequence of time-series. The format is the same as for fit`().

Note

If X=None, the log-likelihoods for the last X that fit() was called with are returned.

Returns:
log_likelihoodlist

List of log-likelihoods, one per element in X.

set_fit_request(*, use_cut_trials: bool | None | str = '$UNCHANGED$') BlockInvGPFA

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
use_cut_trialsstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for use_cut_trials parameter in fit.

Returns:
selfobject

The updated object.

set_predict_request(*, returned_data: bool | None | str = '$UNCHANGED$') BlockInvGPFA

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
returned_datastr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for returned_data parameter in predict.

Returns:
selfobject

The updated object.

variance_explained(X=None)[source]

Computes the total explained variance regression score using \(R^2\) (coefficient of determination) for the overall X.

The total explained variance is decomposed into individual contributions by each orthonormalized latent trajectory.

Returns:
R2float

The total \(R^2\) score, i.e., the total variance explained. Calculated as (total variance - residual variance) / total variance.

latent_R2snumpy.array

An array containing the variance explained by each latent trajectory.

Notes

This function calculates the coefficient of determination (\(R^2\)) to measure how well the latent trajectories explain the variance in the observed data. It follows these steps:

  1. Compute the total explained variance (var_total) across all bins and trials.

  2. Calculate the residual variance (var_res) by decomposing the total explained variance (var_total).

  3. Compute the explained sum of squares (SSreg) for each latent trajectory.

  4. Sum up the SSreg values to obtain the total \(R^2\) score.

  5. Calculate the variance explained by each latent trajectory and store it in latent_R2s.