BlockInvGPFA Class
- class blockinvgpfa.BlockInvGPFA(bin_size=0.02, gp_kernel=None, z_dim=3, min_var_frac=0.01, em_tol=1e-05, em_max_iters=500, freq_ll=5, verbose=0)[source]
BlockInvGPFA is a high-performance implementation of Gaussian Process Factor Analysis (GPFA) is a probabilistic dimensionality reduction technique that extracts smooth latent trajectories from noisy, high-dimensional time series data. GPFA combines Factor Analysis (FA) for dimensionality reduction with Gaussian Processes (GPs) to model the time courses of the latent factors in a unified, probabilistic framework.
GPFA operates on a set of time-series data, where each time series is given by an
x_dimxbinsmatrix.x_dimneeds to be the same across all time-series, but their individual durations, as determined bybins, can vary across time-series. When fitting the model to data, GPFA assumes all model parameters to be shared across the fitted time-series.Please consult the main page for more details on the underlying probabilistic model.
This implementation, BlockInvGPFA, builds on the core model described in [Yu et al., 2009] and is adapted from the Elephant’s toolkit, with substantial algorithmic enhancements:
It uses block-matrix identities and Schur complement updates to efficiently share kernel inversion computations across variable-length trials.
It introduces a new variance-explained metric to evaluate BlockInvGPFA model fits
- Parameters:
- bin_sizefloat, optional, default=0.02
Size of time-series data bins in seconds.
- gp_kernelGP kernel instance, optional, default=None
Please consult the scikit-learn GP documentation for a list of available GP kernels.
If
Noneis passed, the kernel defaults to:ConstantKernel(1-0.001, constant_value_bounds='fixed') * RBF(length_scale=0.1) + ConstantKernel(0.001, constant_value_bounds='fixed') * WhiteKernel(noise_level=1, noise_level_bounds='fixed')
where only kernel hyperparameters not marked as
fixedare learned - in the above case only thelength scaleof the RBF kernel.Note
The
gp_kernelcan either be a single kernel (in which case it will be replicated across all latent dimensions) or a sequence of kernels, one per latent dimension.- z_dimint, optional, default=3
Dimensionality of latent space.
- min_var_fracfloat, optional, default=0.01
Fraction of overall data variance for each observed dimension to set as the private variance floor. This is used to combat Heywood cases, where ML parameter learning returns one or more zero private variances (see Martin & McDonald, Psychometrika, Dec 1975).
- em_tolfloat, optional, default=1e-5
Stopping criterion for Expectation Maximization (EM) algorithm.
- em_max_itersint, optional, default=500
Maximum number of EM iterations.
- freq_llint, optional, default=5
Frequency with which data likelihood is updated across EM iterations.
- verboseint, optional, default=0
Determines whether to display status messages when calling
fit(). - verbose <= 0: no output - verbose >= 1: output of EM iteration statistics - verbose >= 2: additional fitting information.
- Attributes:
- C_numpy.ndarray, shape (x_dim, z_dim)
Loading matrix parameters, mapping observed data space to latent trajectory space.
- d_numpy.ndarray, shape (x_dim, 1)
Observation mean parameter.
- R_numpy.ndarray, shape (x_dim, x_dim)
Observation noise covariance parameter.
- gp_kernel.thetanumpy.ndarray
Flattened and log-transformed non-fixed GP hyperparameters for optimization.
- fit_info_dict
Parameter fitting information. Updated with each call to
fit().- iteration_timelist
Runtime for each EM iteration.
- log_likelihoodslist
Log likelihoods after each EM iteration.
- train_latent_seqs_numpy.recarray
Copy of latent variable time-courses, inferred for training data, and augmented with additional fields.
- Z_munumpy.ndarray, shape (z_dim, bins)
Latent variable posterior means for each time bin.
- Z_covnumpy.ndarray, shape (z_dim, z_dim, bins)
Latent variable posterior covariances across different time bins.
- Z_covGPnumpy.ndarray, shape (bins, bins, z_dim)
Posterior covariance over time for each latent variable.
- valid_data_names_tuple of str
Sequence of valid names for
returned_dataargument ofpredict()method.
Methods
fit:
Fits model parameters to training data.
predict:
Provides inferred latent variable time-series.
score:
Returns the log-likelihood scores.
variance_explained:
Computes the total explained variance regression score.
Notes
Two major uses cases of GPFA are:
Inferring latent variable time-courses: GPFA fits the data, consisting of a set of high-dimensional time-series, using the
fit()method. It then returns the inferred, and, on request, orthonormalized latent variable time-courses by callingpredict().Identifying the latent space dimensionality: GPFA’s main hyperparameter is the dimensionality
z_dimof the latent space. The standard approach to identify the best value forz_dimis to perform cross validation. First, one splits the data into a training and a test set, and fits GPFA to the training set only, usingfit(). Second, one appliespredict()to the test set, to compute the test set data log-likelihood. This procedure is repeated for differentz_dimvalues, to identify thez_dimleading to the best test set data log-likelihood.This GPFA class support the use of the cross-validation (CV) functions of sklearn.model_selection.
In both cases, the data provided to
fit()is a sequence of matrices \(\boldsymbol{X}_1, \boldsymbol{X}_2, \dots\) of arbitrary order, each containing a separate time-series. Each matrix \(\boldsymbol{X}_n\) is of sizex_dimxbins. Here, the observation dimensionalityx_dimneeds to be shared across all matrices, but their durationbinscan differ.The latent variable time-courses returned by
predict()have a similar structure. They are again a sequence of matrices, \(\boldsymbol{Z}_1, \boldsymbol{Z}_2, \dots\), where each \(\boldsymbol{Z}_n\) corresponds to a provided \(\boldsymbol{X}_n\). The size of each \(\boldsymbol{Z}_n\) isz_dimxbins, wherebinsequals that of the associated \(\boldsymbol{X}_n\).GPFA fits the model parameters \(\boldsymbol{C}\), \(\boldsymbol{d}\), and \(\boldsymbol{R}\) (stored in attributes
C_,d_, andR_) and the GP kernel parameters (stored log-transformed in attributegp_kernel.theta). Please consult the main page for how these attributes parameterize the probabilistic model underlying GPFA. Parameter fitting is performed by the Expectation Maximization method. Seefit()for details.Examples
>>> import numpy as np >>> from blockinvgpfa import BlockInvGPFA
>>> X = np.array([[ ... [3, 1, 0, 4, 1, 2, 1, 3, 4, 2, 2, 1, 2, 0, 0, 2, 2, 5, 1, 3], ... [1, 0, 2, 0, 2, 1, 4, 2, 0, 1, 4, 4, 1, 2, 8, 3, 2, 1, 3, 1], ... [2, 2, 1, 1, 3, 2, 3, 2, 2, 0, 3, 4, 1, 2, 3, 1, 4, 1, 0, 1] ... ]]) >>> >>> blockinv_gpfa = BlockInvGPFA(z_dim=1)
>>> # Fit the model >>> blockinv_gpfa.fit(X) Initializing parameters using factor analysis... Fitting GP parameters by EM...
>>> # Infere latent variable time-courses for the same data >>> Zs, _ = blockinv_gpfa.predict() >>> print(Zs) [array([[-1.18405633, -2.00878 , -0.01470251, -2.3143544 , 0.03651376, -1.06948736, 2.05355342, -0.16920794, -2.26437342, -1.21934552, 1.98656088, 2.13305066, -1.14113106, 0.11319858, 6.14998095, 0.89241818, 0.03509708, -1.3936327 , 0.84781358, -1.2281484 ]])]
>>> # Display the loading matrix (C_) and observation mean (d_) parameters >>> print(blockinv_gpfa.C_) [[-0.78376501] [ 1.77773876] [ 0.51134474]]
>>> print(blockinv_gpfa.d_) [1.95470037 2.08933859 1.89693338]
>>> # Obtaining log-likelihood scores >>> llhs = blockinv_gpfa.score() >>> print(llhs) [-117.54588379661148, -107.17193271370158, ..., -100.13200154180569]
>>> # Evaluate the total explained variance regression score >>> print(blockinv_gpfa.variance_explained()) (0.6581475357501596, array([0.65814754]))
- fit(X, use_cut_trials=False)[source]
Fit the GPFA model parameters to the given training data using the Expectation Maximization algorithm. The function also computes the orthonormalization transform of the loading matrix for subsequent use by
predict().- Parameters:
- Xarray-like, default=None
An array-like sequence of high-dimensional time-series. Each element \(\boldsymbol{X}_n\) in this sequence is an
x_dimxbinsmatrix containing a sequence ofx_dim-dimensional observations along its columns. The observation dimensionalityx_dimneeds to be the same across each \(\boldsymbol{X}_n\), butbinscan differ across them. The order in which the elements \(\boldsymbol{X}_1, \boldsymbol{X}_2, \dots\) are provided in the sequence is arbitrary and has no impact on the fitted parameters.- use_cut_trialsbool, optional, default=False
If True, long time-series are cut into multiple shorter ones to potentially speed up training.
Attention
Cutting long time-series might worsen parameter fits, in particular as it removes any long-distance temporal dependencies that might exist in the data.
- Returns:
- selfobject
Returns the instance itself.
- Raises:
- ValueError
If covariance matrix of input data is rank deficient.
Notes
The
fit()method finds the model parameters that best explain the provided data using Expectation Maximization (EM), using the following steps:Initialization: The emission model parameters, \(\boldsymbol{C}\), \(\boldsymbol{d}\), and \(\boldsymbol{R}\) are initialized using Factor Analysis while leaving the latent variable time-course unconstrained. GP kernel parameters are left at their default values.
Expecation step: Given current parameter estimates, the Expectation step computes the posterior distribution over the latent variables \(\boldsymbol{Z}\), and the complete data log-likelihood.
Maximization step: Given the \(\boldsymbol{Z}\)-posterior, the Maximization step corresponds to finding the emission model parameters (see first step) that maximize the expected complete data log-likelihood analytically, and optimizes the GP kernel parameters using gradient descent.
EM iteration: Steps 2 and 3 are repeated until either the change in complete data likelihood drops below the set threshold, or if the maximum number of interactions is reached.
Orthonormalization: Finally, this function computes an orthonormalization transform to the loading matrix \(\boldsymbol{C}\) that is in turn used by
predict()to return an orthonormalized set of latent variables.
- predict(X=None, returned_data=['Z_mu_orth'])[source]
Provides the inferred latent variable time-courses and other moments thereof, if requested.
- Parameters:
- Xarray-like, default=None
An array-like sequence of time-series. The format is the same as for
fit().Note
If
X=None, the latent state estimates for the lastXthatfit()was called with are returned.- returned_datalist of str, default [‘Z_mu_orth’]
Determines which moments of the inferred latent variable time-courses, and other data are returned. Valid strings are:
'Z_mu': posterior mean of latent variables before orthonormalization'Z_mu_orth': orthonormalized posterior mean of latent variable'Z_cov': posterior covariance between latent variables'Z_covGP': posterior covariance over time for each latent variable'X': time-series data used to estimate the GPFA model parameters
- Returns:
- numpy.ndarray or dict
Returns a single
numpy.ndarrayif only a single string is provided to thereturned_dataargument, containing the requested data. If multiple strings are provided, then the method returns a dictionaries whose keys match the strings provided toreturned_data.Either way, the \(n\) th item in either of the returned numpy.ndarray provides the latent state moments associated with the \(n\) th time-series in the provided
X. Its size differs depending on the requested moment, and isZ_mu: numpy.ndarray of shape (z_dim x bins)Z_mu_orth: numpy.ndarray of shape (z_dim x bins)Z_cov: numpy.ndarray of shape (z_dim x z_dim x bins)Z_covGP: numpy.ndarray of shape (bins x bins x z_dim)X: numpy.ndarray of shape (x_dim x bins)Note
Note that the number of bins (
bins) can vary across elements in the sequence, reflecting the length of the time series in the provided time-series data.- llsfloat
data log likelihoods
- Raises:
- ValueError
If returned_data contains strings that aren’t present in self.valid_data_names_.
- score(X=None)[source]
Returns the log-likelihood scores. If
X = None, the training data log-likelihood scores will be returned.- Parameters:
- Xan array-like, default=None
An array-like sequence of time-series. The format is the same as for
fit`().Note
If
X=None, the log-likelihoods for the lastXthatfit()was called with are returned.
- Returns:
- log_likelihoodlist
List of log-likelihoods, one per element in
X.
- set_fit_request(*, use_cut_trials: bool | None | str = '$UNCHANGED$') BlockInvGPFA
Request metadata passed to the
fitmethod.Note that this method is only relevant if
enable_metadata_routing=True(seesklearn.set_config()). Please see User Guide on how the routing mechanism works.The options for each parameter are:
True: metadata is requested, and passed tofitif provided. The request is ignored if metadata is not provided.False: metadata is not requested and the meta-estimator will not pass it tofit.None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (
sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a
Pipeline. Otherwise it has no effect.- Parameters:
- use_cut_trialsstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for
use_cut_trialsparameter infit.
- Returns:
- selfobject
The updated object.
- set_predict_request(*, returned_data: bool | None | str = '$UNCHANGED$') BlockInvGPFA
Request metadata passed to the
predictmethod.Note that this method is only relevant if
enable_metadata_routing=True(seesklearn.set_config()). Please see User Guide on how the routing mechanism works.The options for each parameter are:
True: metadata is requested, and passed topredictif provided. The request is ignored if metadata is not provided.False: metadata is not requested and the meta-estimator will not pass it topredict.None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (
sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a
Pipeline. Otherwise it has no effect.- Parameters:
- returned_datastr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for
returned_dataparameter inpredict.
- Returns:
- selfobject
The updated object.
- variance_explained(X=None)[source]
Computes the total explained variance regression score using \(R^2\) (coefficient of determination) for the overall X.
The total explained variance is decomposed into individual contributions by each orthonormalized latent trajectory.
- Returns:
- R2float
The total \(R^2\) score, i.e., the total variance explained. Calculated as (total variance - residual variance) / total variance.
- latent_R2snumpy.array
An array containing the variance explained by each latent trajectory.
Notes
This function calculates the coefficient of determination (\(R^2\)) to measure how well the latent trajectories explain the variance in the observed data. It follows these steps:
Compute the total explained variance (var_total) across all bins and trials.
Calculate the residual variance (var_res) by decomposing the total explained variance (var_total).
Compute the explained sum of squares (SSreg) for each latent trajectory.
Sum up the SSreg values to obtain the total \(R^2\) score.
Calculate the variance explained by each latent trajectory and store it in latent_R2s.