Mentions légales du service

Skip to content
Snippets Groups Projects
Commit 3053dab4 authored by Luigi Antelmi's avatar Luigi Antelmi
Browse files

[heterogeneous data] add exercises in mcvae: 1) check alpha parameter; 2) same...

[heterogeneous data] add exercises in mcvae: 1) check alpha parameter; 2) same exercise as the one in PLS section
parent 46b94cb6
No related branches found
No related tags found
No related merge requests found
Pipeline #201274 passed
%% Cell type:code id: tags:
 
``` python
## Uncomment the following lines to set up the python environment
## Probably not needed depending on where you are running this notebook
 
#!pip install torch
#!pip install nilearn
#!pip install nibabel
#!pip install pandas
#!pip install sklearn
#!pip install matplotlib
#!pip install numpy
#!pip install torch torchvision
!git clone https://gitlab.inria.fr/epione_ML/mcvae.git
```
 
%% Output
 
fatal: destination path 'mcvae' already exists and is not an empty directory.
 
%% Cell type:code id: tags:
 
``` python
!git clone https://gitlab.inria.fr/epione_ML/mcvae.git
import sys
import os
sys.path.append(os.getcwd() + '/mcvae/src/')
import mcvae
print(mcvae.__version__)
print('Mcvae version:' + mcvae.__version__)
```
 
%% Output
 
2.0.0
Cloning into 'mcvae'...
remote: Enumerating objects: 359, done.
remote: Counting objects: 100% (359/359), done.
remote: Compressing objects: 100% (224/224), done.
remote: Total 359 (delta 160), reused 296 (delta 114), pack-reused 0
Receiving objects: 100% (359/359), 4.72 MiB | 4.27 MiB/s, done.
Resolving deltas: 100% (160/160), done.
Checking connectivity... done.
Mcvae version:2.0.0
 
%% Cell type:code id: tags:
 
``` python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cross_decomposition import PLSCanonical, CCA
```
 
%% Cell type:markdown id: tags:
 
# Random data generation
 
%% Cell type:markdown id: tags:
 
Our journey in the analysis of heterogeneous data starts with the generation of synthetic data. We first need to generate multivarite correlated random variables X and Y. To do so, we rely on the generative model we have seen during lesson:
 
$$ z\sim\mathcal{N}(0,1),$$
$$ X = z w_x,$$
$$ Y = z w_y.$$
 
%% Cell type:code id: tags:
 
``` python
# #############################################################################
 
# N subjects
n = 500
# here we define 2 Gaussian latents variables z = (l_1, l_2)
np.random.seed(42)
l1 = np.random.normal(size=n)
l2 = np.random.normal(size=n)
 
latents = np.array([l1, l2]).T
 
# We define two random transformations from the latent space to the space of X and Y respectively
transform_x = np.random.randint(-8,8, size = 10).reshape([2,5])
transform_y = np.random.randint(-8,8, size = 10).reshape([2,5])
 
# We compute data X = z w_x, and Y = z w_y
X = latents.dot(transform_x)
Y = latents.dot(transform_y)
 
# We add some random Gaussian noise
X = X + 2*np.random.normal(size = n*5).reshape((n, 5))
Y = Y + 2*np.random.normal(size = n*5).reshape((n, 5))
 
 
print('The latent space has dimension ' + str(latents.shape))
print('The transformation for X has dimension ' + str(transform_x.shape))
print('The transformation for Y has dimension ' + str(transform_y.shape))
 
print('X has dimension ' + str(X.shape))
print('Y has dimension ' + str(Y.shape))
```
 
%% Output
 
The latent space has dimension (500, 2)
The transformation for X has dimension (2, 5)
The transformation for Y has dimension (2, 5)
X has dimension (500, 5)
Y has dimension (500, 5)
 
%% Cell type:code id: tags:
 
``` python
dimension_to_plot = 0
 
plt.scatter(X[:,1], Y[:,2])
plt.xlabel('dimension X' + str(dimension_to_plot))
plt.ylabel('dimension Y' + str(dimension_to_plot))
plt.title('Generated data')
plt.plot()
```
 
%% Output
 
[]
 
 
%% Cell type:markdown id: tags:
 
## PLS and scikit-learn: basic use
 
 
Our newly generated data can be already used to test the PLS and CCA provided by standard machie learning packages, such as scikit-learn.
 
%% Cell type:code id: tags:
 
``` python
##########################################################
# We first split the data in trainig and validation sets
 
# The training set is composed by a random sample of dimension N/2
train_idx = np.random.choice(range(X.shape[0]), size = int(X.shape[0]/2), replace = False)
X_train = X[train_idx, :]
 
# The testing set is composed by the remaining subjects
test_idx = np.where(np.in1d(range(X.shape[0]), train_idx, assume_unique =True, invert = True))[0]
X_test = X[test_idx, :]
 
# We reuse the same indices to split the data Y
Y_train = Y[train_idx, :]
Y_test = Y[test_idx, :]
```
 
%% Cell type:code id: tags:
 
``` python
#######################################
# We fit PLS as provided by scikit-learn
 
#Defining PLS object
plsca = PLSCanonical(n_components=2)
 
#Fitting on train data
plsca.fit(X_train, Y_train)
 
#We project the training data in the latent dimension
X_train_r, Y_train_r = plsca.transform(X_train, Y_train)
#We project the testing data in the latent dimension
X_test_r, Y_test_r = plsca.transform(X_test, Y_test)
```
 
%% Cell type:markdown id: tags:
 
We note that the projections in the latent space retrieved by PLS are indeed correlated. The different dimensions of the projections are however uncorrelated.
 
%% Cell type:code id: tags:
 
``` python
# Scatter plot of scores
# ~~~~~~~~~~~~~~~~~~~~~~
# 1) On diagonal plot X vs Y scores on each components
plt.figure(figsize=(12, 8))
plt.subplot(221)
plt.scatter(X_train_r[:, 0], Y_train_r[:, 0], label="train",
marker="o", c="b", s=25)
plt.scatter(X_test_r[:, 0], Y_test_r[:, 0], label="test",
marker="o", c="r", s=25)
plt.xlabel("x scores")
plt.ylabel("y scores")
plt.title('Comp. 1: X vs Y (test corr = %.2f)' %
np.corrcoef(X_test_r[:, 0], Y_test_r[:, 0])[0, 1])
 
plt.legend(loc="best")
 
plt.subplot(224)
plt.scatter(X_train_r[:, 1], Y_train_r[:, 1], label="train",
marker="o", c="b", s=25)
plt.scatter(X_test_r[:, 1], Y_test_r[:, 1], label="test",
marker="o", c="r", s=25)
plt.xlabel("x scores")
plt.ylabel("y scores")
plt.title('Comp. 2: X vs Y (test corr = %.2f)' %
np.corrcoef(X_test_r[:, 1], Y_test_r[:, 1])[0, 1])
 
plt.legend(loc="best")
 
# 2) Off diagonal plot components 1 vs 2 for X and Y
plt.subplot(222)
plt.scatter(X_train_r[:, 0], X_train_r[:, 1], label="train",
marker="*", c="b", s=50)
plt.scatter(X_test_r[:, 0], X_test_r[:, 1], label="test",
marker="*", c="r", s=50)
plt.xlabel("X comp. 1")
plt.ylabel("X comp. 2")
plt.title('X comp. 1 vs X comp. 2 (test corr = %.2f)'
% np.corrcoef(X_test_r[:, 0], X_test_r[:, 1])[0, 1])
plt.legend(loc="best")
 
 
 
plt.subplot(223)
plt.scatter(Y_train_r[:, 0], Y_train_r[:, 1], label="train",
marker="*", c="b", s=50)
plt.scatter(Y_test_r[:, 0], Y_test_r[:, 1], label="test",
marker="*", c="r", s=50)
plt.xlabel("Y comp. 1")
plt.ylabel("Y comp. 2")
plt.title('Y comp. 1 vs Y comp. 2 , (test corr = %.2f)'
% np.corrcoef(Y_test_r[:, 0], Y_test_r[:, 1])[0, 1])
plt.legend(loc="best")
 
plt.show()
```
 
%% Output
 
 
%% Cell type:code id: tags:
 
``` python
#We can check the estimated projections
print('X projections: \n' + str(plsca.x_weights_))
 
print('Y projections: \n' + str(plsca.y_weights_))
```
 
%% Output
 
X projections:
[[ 0.48433579 -0.25654367]
[-0.51415729 0.28837788]
[ 0.57083361 0.41523051]
[-0.1600194 0.72474008]
[-0.38678665 -0.39161076]]
Y projections:
[[ 0.36148438 0.39913339]
[-0.39571466 0.70730378]
[ 0.49905721 -0.35689321]
[-0.48904463 -0.31141709]
[-0.4738314 -0.34067659]]
 
%% Cell type:code id: tags:
 
``` python
# Prediction for the training data
 
predicted_Y_train = plsca.predict(X_train)
 
plt.scatter(predicted_Y_train[:,dimension_to_plot], Y_train[:,dimension_to_plot])
plt.ylabel('target Y' + str(dimension_to_plot))
plt.xlabel('predicted Y' + str(dimension_to_plot))
 
plt.show()
```
 
%% Output
 
 
%% Cell type:code id: tags:
 
``` python
plsca_4 = PLSCanonical(n_components=4)
plsca_4.fit(X_train, Y_train)
 
predicted_Y_train = plsca_4.predict(X_train)
 
print('X projections with 2 components: \n' + str(plsca.x_weights_))
 
print('Y projections with 2 components: \n' + str(plsca.y_weights_))
 
print('X projections with 4 components: \n' + str(plsca_4.x_weights_))
 
print('Y projections with 4 components: \n' + str(plsca_4.y_weights_))
```
 
%% Output
 
X projections with 2 components:
[[ 0.48433579 -0.25654367]
[-0.51415729 0.28837788]
[ 0.57083361 0.41523051]
[-0.1600194 0.72474008]
[-0.38678665 -0.39161076]]
Y projections with 2 components:
[[ 0.36148438 0.39913339]
[-0.39571466 0.70730378]
[ 0.49905721 -0.35689321]
[-0.48904463 -0.31141709]
[-0.4738314 -0.34067659]]
X projections with 4 components:
[[ 0.48433579 -0.25654367 -0.42402831 0.54660349]
[-0.51415729 0.28837788 0.30201819 0.74306133]
[ 0.57083361 0.41523051 0.64183921 -0.06465157]
[-0.1600194 0.72474008 -0.51490022 -0.22931801]
[-0.38678665 -0.39161076 0.22782708 -0.30383862]]
Y projections with 4 components:
[[ 0.36148438 0.39913339 -0.5105896 0.60783454]
[-0.39571466 0.70730378 0.48505202 0.26714451]
[ 0.49905721 -0.35689321 0.58283628 0.50546537]
[-0.48904463 -0.31141709 0.17782666 0.30291833]
[-0.4738314 -0.34067659 -0.36428334 0.46034359]]
 
%% Cell type:code id: tags:
 
``` python
# We can also predict Y from X
 
predicted_Y_test = plsca.predict(X_test)
 
plt.scatter(predicted_Y_test[:,dimension_to_plot], Y_test[:,dimension_to_plot])
plt.ylabel('target Y' + str(dimension_to_plot))
plt.xlabel('predicted Y' + str(dimension_to_plot))
 
plt.show()
```
 
%% Output
 
 
%% Cell type:markdown id: tags:
 
**Exercise.** Generate a synthetic dataset of 200 observations.
- Each observation is composed by 3 modalities X1, X2, and X3, of dimensions respectively of 4, 7, and 9.
- Chose the appropriate latent variable representation for the generative model.
- These modalities are corrupted with Gaussian noise, with different noise variance for each modality.
- Fit a PLS model to predict X3 from X1.
 
%% Cell type:code id: tags:
 
``` python
```
 
%% Cell type:markdown id: tags:
 
# Into the guts of latent variable models
 
After playing with the builti-in implementation of PLS in scikit-learn, we are going to implement our own version based on the NIPALS method.
 
## NIPALS for PLS
 
%% Cell type:code id: tags:
 
``` python
# Nipals method for PLS
 
n_components = 3
 
 
# Defining empty arrays where to store results
 
# Reconstruction from latent space to data
loading_x = np.ndarray([X.shape[1],n_components])
loading_y = np.ndarray([Y.shape[1],n_components])
 
# Projections into the latent space
weight_x = np.ndarray([X.shape[1],n_components])
weight_y = np.ndarray([Y.shape[1],n_components])
 
# Latent variables
scores_x = np.ndarray([X.shape[0],n_components])
scores_y = np.ndarray([Y.shape[0],n_components])
 
 
# Initialization of data matrices
current_X = X
current_Y = Y
 
for i in range(n_components):
# Initialization of current latent variables as a data column
t_x = current_X[:,0]
 
# NIPALS iterations
for _ in range(100):
# estimating Y weights given data Y and latent variables from X
w_y = current_Y.transpose().dot(t_x)/(t_x.transpose().dot(t_x))
# normalizing Y weights
w_y = w_y/np.sqrt(np.sum(w_y**2))
 
# estimating latent variables from Y given data Y and Y weights
t_y = current_Y.dot(w_y)
# estimating X weights given data X and latent variables from Y
w_x = current_X.transpose().dot(t_y)/(t_y.transpose().dot(t_y))
# normalizing X weights
w_x = w_x/np.sqrt(np.sum(w_x**2))
 
# estimating latent variables from X given data X and X weights
t_x = current_X.dot(w_x)
 
# Weights are such that X * weights = t
weight_x[:,i] = w_x
weight_y[:,i] = w_y
 
# Latent variables
scores_x[:,i] = t_x
scores_x[:,i] = t_y
 
# Loadings obtained by regressing X on t (X = t * loadings)
 
loading_x[:,i] = np.dot(current_X.T, t_x)/t_x.transpose().dot(t_x)
loading_y[:,i] = np.dot(current_Y.T, t_y)/t_y.transpose().dot(t_y)
 
# Deflation = current_data - current_reconstruction
 
current_X = current_X - t_x.reshape(len(t_x),1).dot(w_x.reshape(1,len(w_x)))
current_Y = current_Y - t_y.reshape(len(t_y),1).dot(w_y.reshape(1,len(w_y)))
 
```
 
%% Cell type:code id: tags:
 
``` python
print('The estimated projections for X are: \n' + str(weight_x))
 
print('\n The estimated projections for Y are: \n' + str(weight_y))
```
 
%% Output
 
The estimated projections for X are:
[[ 0.36753226 0.32695545 0.31901594]
[-0.54017132 -0.51142438 -0.13635813]
[ 0.73596167 -0.54668695 -0.38095079]
[-0.07131382 -0.54068145 0.43213359]
[-0.16251074 0.20085364 -0.74011644]]
The estimated projections for Y are:
[[ 0.22469208 -0.22321827 0.22848154]
[-0.42360052 -0.82303817 -0.17028297]
[ 0.34143196 0.23277327 -0.37414229]
[-0.53070486 0.30658588 -0.6691619 ]
[-0.60979721 0.35299218 0.57536058]]
 
%% Cell type:code id: tags:
 
``` python
plt.figure(figsize=(12, 12))
plt.subplot(3,2,1)
plt.scatter(scores_x[:,0], latents[:,0])
plt.xlabel('PLS 0')
plt.ylabel('ground truth 0')
plt.subplot(3,2,2)
plt.scatter(scores_x[:,0], latents[:,1])
plt.xlabel('PLS 0')
plt.ylabel('ground truth 1')
plt.subplot(3,2,3)
plt.scatter(scores_x[:,1], latents[:,0])
plt.xlabel('PLS 1')
plt.ylabel('ground truth 0')
plt.subplot(3,2,4)
plt.scatter(scores_x[:,1], latents[:,1])
plt.xlabel('PLS 1')
plt.ylabel('ground truth 1')
plt.subplot(3,2,5)
plt.scatter(scores_x[:,2], latents[:,0])
plt.xlabel('PLS 2')
plt.ylabel('ground truth 0')
plt.subplot(3,2,6)
plt.scatter(scores_x[:,2], latents[:,1])
plt.xlabel('PLS 2')
plt.ylabel('ground truth 1')
plt.show()
```
 
%% Output
 
 
%% Cell type:markdown id: tags:
 
Once that the PLS parameters are estimated, we can solve the regression problem for predicting Y from X. We adopt the scheme used in scikit-learn, where a rotation matrix is first estimated to accoung for non-cummutativity between projection (weights) and reconstruction (loadings).
 
 
%% Cell type:code id: tags:
 
``` python
# Identifying rotation from X to t
# t_x * loadings_x = X
# t_x * loadings_x.T * weight = X * weight
# t_x = X * weight * (loadings_x.T * weight)^-1 = X * rotations_x
 
rotations_x = weight_x.dot(np.linalg.pinv(loading_x.T.dot(weight_x)))
 
# Solving the regression from X to Y
# Y = t_x * loadings_y.T
# Y = X * rotations_x * loadings_y.T
 
regression_coef = np.dot(rotations_x, loading_y.T)
```
 
%% Cell type:code id: tags:
 
``` python
plt.figure(figsize=(12, 18))
for i in range(Y.shape[1]):
plt.subplot(Y.shape[1], 1, i+1)
plt.scatter(X.dot(regression_coef)[:,i], Y[:,i])
plt.xlabel('predicted dimension ' + str(i))
plt.ylabel('target dimension ' + str(i))
plt.show()
```
 
%% Output
 
 
%% Cell type:code id: tags:
 
``` python
# Comparing with SVD of covariance matrix
 
eig_val_x, eig_vect, eig_val_y = np.linalg.svd(X.transpose().dot(Y))
```
 
%% Cell type:code id: tags:
 
``` python
print('Eigenvalues for X \n' + str(np.real(eig_val_x[:,:3])))
print('Estimated weights for X\n' + str(np.real(weight_x[:,:3])))
```
 
%% Output
 
Eigenvalues for X
[[-0.36753226 0.32695545 0.31901594]
[ 0.54017132 -0.51142438 -0.13635813]
[-0.73596167 -0.54668695 -0.38095079]
[ 0.07131382 -0.54068145 0.43213359]
[ 0.16251074 0.20085364 -0.74011644]]
Estimated weights for X
[[ 0.36753226 0.32695545 0.31901594]
[-0.54017132 -0.51142438 -0.13635813]
[ 0.73596167 -0.54668695 -0.38095079]
[-0.07131382 -0.54068145 0.43213359]
[-0.16251074 0.20085364 -0.74011644]]
 
%% Cell type:code id: tags:
 
``` python
print('Eigenvalues for Y \n' + str(np.real(eig_val_y.T[:,:3])))
print('Estimated weights for Y\n' + str(np.real(weight_y[:,:3])))
```
 
%% Output
 
Eigenvalues for Y
[[-0.22469208 -0.22321827 0.22848154]
[ 0.42360052 -0.82303817 -0.17028297]
[-0.34143196 0.23277327 -0.37414229]
[ 0.53070486 0.30658588 -0.6691619 ]
[ 0.60979721 0.35299218 0.57536058]]
Estimated weights for Y
[[ 0.22469208 -0.22321827 0.22848154]
[-0.42360052 -0.82303817 -0.17028297]
[ 0.34143196 0.23277327 -0.37414229]
[-0.53070486 0.30658588 -0.6691619 ]
[-0.60979721 0.35299218 0.57536058]]
 
%% Cell type:code id: tags:
 
``` python
# PLS in scikit-learn
 
plsca = PLSCanonical(n_components=3, scale = False)
plsca.fit(X, Y)
```
 
%% Output
 
PLSCanonical(algorithm='nipals', copy=True, max_iter=500, n_components=3,
scale=False, tol=1e-06)
 
%% Cell type:code id: tags:
 
``` python
print(plsca.x_weights_)
print(plsca.y_weights_)
```
 
%% Output
 
[[ 0.36745572 -0.32646087 -0.30267791]
[-0.5399922 0.51144303 0.11763542]
[ 0.73610243 0.54625149 0.37318486]
[-0.07130735 0.54144866 -0.38043117]
[-0.16264435 -0.20072864 0.78137902]]
[[ 0.22485955 0.22212135 -0.2169345 ]
[-0.42328296 0.82287418 0.1493648 ]
[ 0.34172183 -0.23372426 0.32376757]
[-0.53074024 -0.30704996 0.68434385]
[-0.60976283 -0.35303468 -0.59789433]]
 
%% Cell type:markdown id: tags:
 
## NIPALS for CCA
 
%% Cell type:code id: tags:
 
``` python
import scipy
 
# Nipals method for CCA
 
# Defining empty arrays where to store results
 
# Reconstruction from latent space to data
loading_x_cca = np.ndarray([X.shape[1],n_components])
loading_y_cca = np.ndarray([Y.shape[1],n_components])
 
# Projections into the latent space
scores_x_cca = np.ndarray([X.shape[0],n_components])
scores_y_cca = np.ndarray([Y.shape[0],n_components])
 
# Latent variables
weight_x_cca = np.ndarray([X.shape[1],n_components])
weight_y_cca = np.ndarray([Y.shape[1],n_components])
 
# Initialization of data matrices
current_X = X
current_Y = Y
 
for i in range(n_components):
# Initialization of current latent variables as a data column
t_x = current_X[:,0]
 
# NIPALS iterations
for _ in range(500):
## CCA variant
# estimating Y weights given data Y and latent variables from X
Y_pinv = np.linalg.pinv(current_Y)
#Y_pinv = np.linalg.solve(current_Y.dot(current_Y.T),current_Y).T
w_y = Y_pinv.dot(t_x)
 
# normalizing Y weights
w_y = w_y/np.sqrt(np.sum(w_y**2))
# estimating latent variables from Y given data Y and Y weights
t_y = current_Y.dot(w_y)
 
## CCA variant
# estimating X weights given data X and latent variables from Y
X_pinv = np.linalg.pinv(current_X)
#X_pinv = np.linalg.solve(current_X.dot(current_X.T),current_X).T
w_x = X_pinv.dot(t_y)
 
# normalizing X weights
w_x = w_x/np.sqrt(np.sum(w_x**2))
# estimating latent variables from X given data X and X weights
t_x = current_X.dot(w_x)
 
 
# Weights are such that X * weights = t
weight_x_cca[:,i] = w_x
weight_y_cca[:,i] = w_y
 
# Latent dimensions
scores_x_cca[:,i] = t_x
scores_x_cca[:,i] = t_y
 
# Loadings obtained by regressing X on t (X = t * loadings)
 
loading_x_cca[:,i] = np.dot(current_X.T, t_x)/t_x.transpose().dot(t_x)
loading_y_cca[:,i] = np.dot(current_Y.T, t_y)/t_y.transpose().dot(t_y)
 
# Deflation = current_data - current_reconstruction
 
current_X = current_X - t_x.reshape(len(t_x),1).dot(w_x.reshape(1,len(w_x)))
current_Y = current_Y - t_y.reshape(len(t_y),1).dot(w_x.reshape(1,len(w_y)))
 
 
```
 
%% Cell type:code id: tags:
 
``` python
print(weight_x_cca)
 
plt.scatter(X.dot(weight_x_cca)[:,2], Y.dot(weight_y_cca)[:,2])
```
 
%% Output
 
[[ 0.24391049 0.36361822 0.79534955]
[-0.47437693 -0.58135481 0.07701894]
[ 0.82921211 -0.45865491 -0.17520398]
[-0.04561038 -0.56190597 0.56872597]
[-0.16062744 0.06087473 -0.0856826 ]]
 
<matplotlib.collections.PathCollection at 0xa19d5a2e8>
 
 
%% Cell type:code id: tags:
 
``` python
cca = CCA(n_components=3, scale = False)
cca.fit(X,Y)
print(cca.x_weights_)
 
plt.scatter(X.dot(weight_x_cca)[:,2], Y.dot(weight_y_cca)[:,2])
```
 
%% Output
 
[[ 0.2415487 -0.32772785 -0.24927466]
[-0.47054284 0.58859583 0.10763174]
[ 0.83233244 0.44252343 0.28013453]
[-0.04077413 0.58734795 -0.35211385]
[-0.16063576 -0.07310815 0.85077496]]
 
<matplotlib.collections.PathCollection at 0x10624a160>
 
 
%% Cell type:markdown id: tags:
 
## Reduced Rank Regression
 
We finally review reduced rank regression through eigen-decomposition.
 
%% Cell type:code id: tags:
 
``` python
# Reduced Rank Regression
 
n_components = 2
Gamma = np.eye(n_components)
 
SYX = np.dot(Y.T,X)
 
SXX = np.dot(X.T,X)
 
U, S, V = np.linalg.svd(np.dot(SYX, np.dot(np.linalg.pinv(SXX), SYX.T)))
 
A = V[0:n_components, :].T
 
B = np.dot(np.dot(A.T,SYX), np.linalg.pinv(SXX))
```
 
%% Cell type:code id: tags:
 
``` python
A
```
 
%% Output
 
array([[-0.24936409, -0.19522003],
[ 0.3233369 , -0.86712778],
[-0.31172187, 0.27190596],
[ 0.56300866, 0.2411439 ],
[ 0.64739595, 0.27909733]])
 
%% Cell type:code id: tags:
 
``` python
B
```
 
%% Output
 
array([[-0.17847633, 0.34992334, -0.88917092, -0.05742634, 0.16716151],
[ 0.33903723, -0.60088957, -0.58551613, -0.63995952, 0.10694492]])
 
%% Cell type:code id: tags:
 
``` python
regression_coef_rrr = np.dot(A,B)
```
 
%% Cell type:code id: tags:
 
``` python
plt.scatter(np.dot(X,regression_coef)[:,0],Y[:,0])
plt.xlabel('predicted dimension 0')
plt.ylabel('target dimension 0')
plt.title('RRR prediction')
plt.show()
```
 
%% Output
 
 
%% Cell type:markdown id: tags:
 
## Sparsity in latent variable models
 
We now focus on the effect of spurious variables in mutivariate models. To explore this new setting, we are going to add spurious random features to our data matrices X and Y.
 
%% Cell type:code id: tags:
 
``` python
## Adding 3 random dimensions
## No association is expected from these features
 
X_ext = np.hstack([X,np.random.randn(n*3).reshape([n,3])])
Y_ext = np.hstack([Y,np.random.randn(n*3).reshape([n,3])])
```
 
%% Cell type:code id: tags:
 
``` python
plt.scatter(X_ext[:,0], Y_ext[:,-1])
plt.xlabel('X dimension 0')
plt.ylabel('Y random dimension')
plt.show()
```
 
%% Output
 
 
%% Cell type:code id: tags:
 
``` python
from sklearn import linear_model
 
n_components = 3
 
#### Sparse PLS via regularization in NIPALS [Waaijenborg, et al 2007]
# Everything is as for the standard NIPALS algorithm, with the added sparse estimation step
 
loading_x_sparse = np.ndarray([X_ext.shape[1],n_components])
loading_y_sparse = np.ndarray([Y_ext.shape[1],n_components])
 
scores_x_sparse = np.ndarray([X_ext.shape[0],n_components])
scores_y_sparse = np.ndarray([Y_ext.shape[0],n_components])
 
weight_x_sparse = np.ndarray([X_ext.shape[1],n_components])
weight_y_sparse = np.ndarray([Y_ext.shape[1],n_components])
 
current_X = X_ext
current_Y = Y_ext
 
## Penalty parameter for regularization
penalty = 10
 
eps = 1e-4
 
for i in range(n_components):
t_x = current_X[:,0]
for _ in range(100):
w_y = current_Y.transpose().dot(t_x)/(t_x.transpose().dot(t_x))
w_y = w_y/np.sqrt(np.sum(w_y**2))
t_y = current_Y.dot(w_y)
w_x = current_X.transpose().dot(t_y)/(t_y.transpose().dot(t_y))
w_x = w_x/np.sqrt(np.sum(w_x**2))
t_x = current_X.dot(w_x)
 
## Estimating sparse model for the weights of X
lasso_x = linear_model.Lasso(alpha = penalty)
lasso_x.fit(t_x.reshape(-1, 1), current_X)
 
## Estimating sparse model for the weights of Y
lasso_y = linear_model.Lasso(alpha = penalty)
lasso_y.fit(t_y.reshape(-1, 1), current_Y)
 
# Replacing the original weights with the sparse estimation
w_x = (lasso_x.coef_ / (np.sqrt(np.sum(lasso_x.coef_**2) + eps))).reshape([X_ext.shape[1]])
w_y = (lasso_y.coef_ / (np.sqrt(np.sum(lasso_y.coef_**2) + eps))).reshape([Y_ext.shape[1]])
 
# Weights are such that X * weights = t
weight_x_sparse[:,i] = w_x
weight_y_sparse[:,i] = w_y
 
# Latent dimensions
scores_x_sparse[:,i] = t_x
scores_x_sparse[:,i] = t_y
 
# Loadings obtained by regressing X on t (X = t * loadings)
 
loading_x_sparse[:,i] = np.dot(current_X.T, t_x)/t_x.transpose().dot(t_x)
loading_y_sparse[:,i] = np.dot(current_Y.T, t_y)/t_y.transpose().dot(t_y)
 
# Deflation
 
current_X = current_X - t_x.reshape(len(t_x),1).dot(w_x.reshape(1,len(w_x)))
current_Y = current_Y - t_y.reshape(len(t_y),1).dot(w_y.reshape(1,len(w_y)))
 
```
 
%% Cell type:markdown id: tags:
 
We observe that the new weights are similar to the ones estimated before. However the parameters associated with the spurious dimension are entirely set to zero. This indicates that the model does not find these features necessary to explain the common variability between X and Y. There are also other weights which are set to zero, corresponding to the third latent dimension. This makes sense, as our synthetic data was indeed created with only two latent dimensions.
 
%% Cell type:code id: tags:
 
``` python
weight_x_sparse
```
 
%% Output
 
array([[ 0.35947916, 0.09613335, 0. ],
[-0.57208323, -0.5017898 , 0. ],
[ 0.73260855, -0.60111564, 0. ],
[-0.01764855, -0.61404951, 0. ],
[-0.0795538 , 0. , 0. ],
[ 0. , 0. , 0. ],
[ 0. , 0. , 0. ],
[ 0. , 0. , -0. ]])
 
%% Cell type:code id: tags:
 
``` python
weight_y_sparse
```
 
%% Output
 
array([[ 0.19008698, -0.03395428, -0. ],
[-0.3361836 , -0.95121052, 0. ],
[ 0.29093832, 0.05788278, 0. ],
[-0.56391994, 0.17865107, 0. ],
[-0.66936112, 0.24197983, 0. ],
[-0. , -0. , -0. ],
[ 0. , -0. , -0. ],
[-0. , -0. , -0. ]])
 
%% Cell type:code id: tags:
 
``` python
## Non-sparse parameters previously estimated
 
print(weight_x)
print(weight_y)
```
 
%% Output
 
[[ 0.36753226 0.32695545 0.31901594]
[-0.54017132 -0.51142438 -0.13635813]
[ 0.73596167 -0.54668695 -0.38095079]
[-0.07131382 -0.54068145 0.43213359]
[-0.16251074 0.20085364 -0.74011644]]
[[ 0.22469208 -0.22321827 0.22848154]
[-0.42360052 -0.82303817 -0.17028297]
[ 0.34143196 0.23277327 -0.37414229]
[-0.53070486 0.30658588 -0.6691619 ]
[-0.60979721 0.35299218 0.57536058]]
 
%% Cell type:code id: tags:
 
``` python
## PLS result from scikit-learn PLS on the data augmented with spurious dimensions
 
plsca = PLSCanonical(n_components=3, scale = False)
plsca.fit(X_ext, Y_ext)
print(plsca.x_weights_)
print(plsca.y_weights_)
```
 
%% Output
 
[[ 0.36744392 -0.32636735 -0.3736012 ]
[-0.53999014 0.51133222 0.13293972]
[ 0.73606624 0.54628291 0.34582048]
[-0.07131057 0.54141809 -0.49749784]
[-0.16262691 -0.20075143 0.52023542]
[ 0.00653338 -0.00825341 0.39402462]
[ 0.00339241 -0.00798747 0.07331524]
[ 0.0038984 -0.00566521 0.21066051]]
[[ 0.22481823 0.22219588 -0.39318404]
[-0.4232868 0.82282067 0.19587236]
[ 0.34171379 -0.2337191 0.39240653]
[-0.53068906 -0.30708105 0.46282095]
[-0.6097158 -0.353037 -0.46080397]
[-0.00986828 0.00097623 -0.2131351 ]
[ 0.00469737 0.00268147 -0.28959681]
[-0.00361132 0.00533588 -0.31180287]]
 
%% Cell type:markdown id: tags:
 
## Cross-validating components
 
In addition to sparsity, the optimal number of components in latent variable models can be identified by cross-validation.
A common strategy is to train the model on a subset of the data and to quantify the *predicted residual error sum of squares* (PRESS) in non-overlapping testing data. We can finally choose the number of latent dimensions leading to the lowest average PRESS.
 
%% Cell type:code id: tags:
 
``` python
n_cross_valid_run = 200
 
# max number of components to test
n_components = 5
 
rep_results = []
for i in range(n_components):
rep_results.append([])
 
for k in range(n_components):
for i in range(n_cross_valid_run):
 
# Sampling disjoint set of indices for splitting the data
batch1_idx = np.random.choice(range(X_ext.shape[0]), size = int(X_ext.shape[0]/2), replace = False)
batch2_idx = np.where(np.in1d(range(X_ext.shape[0]), batch1_idx, assume_unique=True, invert = True))[0]
 
# Creating independent data batches for X
X_1 = X_ext[batch1_idx, :]
X_2 = X_ext[batch2_idx, :]
 
# Creating independent data batches for Y
Y_1 = Y_ext[batch1_idx, :]
Y_2 = Y_ext[batch2_idx, :]
 
# Creating a model for each data batch
plsca1 = PLSCanonical(n_components = k+1, scale = False)
plsca2 = PLSCanonical(n_components = k+1, scale = False)
 
# Fitting a model for each data batch
plsca1.fit(X_1,Y_1)
plsca2.fit(X_2,Y_2)
 
# Quantifying the prediction error on the unseen data batch
err1 = np.sum((plsca1.predict(X_2) - Y_2)**2)
err2 = np.sum((plsca2.predict(X_1) - Y_1)**2)
 
rep_results[k].append(np.mean([err1,err2]))
```
 
%% Cell type:code id: tags:
 
``` python
plt.plot(range(1,n_components+1),np.mean(rep_results, 1))
plt.xlabel('# components')
plt.ylabel('prediction error')
plt.show()
```
 
%% Output
 
 
%% Cell type:markdown id: tags:
 
# Multi-channel Variational Autoencoder
 
The last part of this tutorial concerns the use of the *multi-channel variational autoencoder* (https://gitlab.inria.fr/epione_ML/mcvae), a more advanced methods for the joint analysis and prediction of several modalities.
 
## VAE
The Variational Autoencoer is a latent variable model composed by one encoder and one decoder associated to a single channel.
The latent distribution and the decoding distribution are implemented as follows:
 
$$q(\mathbf{z|x}) = \mathcal{N}(\mathbf{z|\mu_x; \Sigma_x})$$
 
$$p(\mathbf{x|z}) = \mathcal{N}(\mathbf{x|\mu_z; \Sigma_z})$$
 
They are Gaussians with moments parametrized by Neural Networks (or a linear transformation layer in a simple case).
 
__Exercise__: why is convenient to use $\log$ values for the parametrization the variance networks output (W_logvar, W_out_logvar)?
<img src="https://gitlab.inria.fr/epione/flhd/-/raw/master/heterogeneous_data/img/vae.svg" alt="img/vae.svg">
 
### __Exercise__
Why is convenient to use $\log$ values for the parametrization the variance networks output (W_logvar, W_out_logvar)?
## Sparse VAE
To favour sparsity in the latent space we implement the latent distribution as follows:
 
$$q(\mathbf{z|x}) = \mathcal{N}(\mathbf{z|\mu_x; \alpha \odot \mu_x^2})$$
$$q(\mathbf{z|x}) = \mathcal{N}(\mathbf{z|\mu_x; \alpha \odot \mu_x^2}),$$
 
Tha parameter $\alpha$ represents to the [odds](https://en.wikipedia.org/wiki/Odds) of pruning the $i$-th latent dimension according to (element-wise):
where:
$$
\mathbf{\alpha \odot \mu_x^2} =
\begin{bmatrix}
\ddots & & 0 \\
& \alpha_i [\mathbf{\mu_x}]_i^2 & \\
0 & & \ddots
\end{bmatrix}.
$$
Tha parameter $\alpha_i$ represents the [odds](https://en.wikipedia.org/wiki/Odds) of pruning the $i$-th latent dimension according to:
 
$$\alpha_i = \frac{p_i}{1 - p_i}$$
 
<img src="https://gitlab.inria.fr/epione/flhd/-/raw/master/heterogeneous_data/img/sparse_vae.svg" alt="img/sparse_vae.svg">
 
## MCVAE
The MultiChannel VAE is built by stacking multiple VAEs and allowing the decoding distributions to be computed from every input channel.
 
<img src="https://gitlab.inria.fr/epione/flhd/-/raw/master/heterogeneous_data/img/mcvae.svg" alt="img/mcvae.svg">
 
__Excercise__: sketch the Sparse MCVAE with 3 channels.[[solution]](https://gitlab.inria.fr/epione/flhd/-/raw/master/heterogeneous_data/img/sparse_mcvae_3ch.svg)
### __Exercise__
Sketch the Sparse MCVAE with 3 channels. [[Solution]](https://gitlab.inria.fr/epione/flhd/-/raw/master/heterogeneous_data/img/sparse_mcvae_3ch.svg)
 
%% Cell type:code id: tags:
 
``` python
import torch
from mcvae.models import Mcvae
from mcvae.models.utils import DEVICE, load_or_fit
from mcvae.diagnostics import *
from mcvae.plot import lsplom
 
print(f"Running on {DEVICE}") # cpu or gpu
 
FORCE_REFIT = False # Set this to 'True' if ou want to refit the models instead of loading them from file.
```
 
%% Output
 
Running on cpu
 
%% Cell type:code id: tags:
 
``` python
### To initialization of the mcvae model you should provide:
 
# 1 - a list with the different data channels
data = [X, Y]
data = [torch.Tensor(_) for _ in data] #warning: data matrices must be converted to type torch.Tensor
 
# 1 - a dictionary with the data and model characteristics
init_dict = {
'data': data,
'lat_dim': n_components,
}
```
 
%% Cell type:code id: tags:
 
``` python
# Here we initialize an instance of the model
 
# Multi-Channel VAE
torch.manual_seed(24)
model = Mcvae(**init_dict)
model.to(DEVICE)
print(model)
```
 
%% Output
 
Mcvae(
(vae): ModuleList(
(0): VAE(
(W_mu): Linear(in_features=5, out_features=5, bias=True)
(W_logvar): Linear(in_features=5, out_features=5, bias=True)
(W_out): Linear(in_features=5, out_features=5, bias=True)
)
(1): VAE(
(W_mu): Linear(in_features=5, out_features=5, bias=True)
(W_logvar): Linear(in_features=5, out_features=5, bias=True)
(W_out): Linear(in_features=5, out_features=5, bias=True)
)
)
)
 
%% Cell type:code id: tags:
 
``` python
###################
## Model Fitting ##
###################
 
# Set up the optimizer
adam_lr = 1e-2
n_epochs = 5000
model.optimizer = torch.optim.Adam(model.parameters(), lr=adam_lr)
 
# Fit (or load if already trained)
load_or_fit(model=model, data=data, epochs=n_epochs, ptfile='model.pt', force_fit=FORCE_REFIT)
```
 
%% Output
 
Loading model.pt
 
%% Cell type:code id: tags:
 
``` python
## Plotting model convergence
plot_loss(model, skip=100)
```
 
%% Output
 
skipping first 100 epochs where losses might be very high
 
 
%% Cell type:markdown id: tags:
 
The plot above indicates that the model converged smoothly.
 
%% Cell type:code id: tags:
 
``` python
# We can plot the original data, dimension x dimension
lsplom(data, title = 'Original data')
```
 
%% Output
 
 
%% Cell type:code id: tags:
 
``` python
# We can estimate the reconstructed data (decoding from the latent space)
x_hat = model.reconstruct(data)
 
# Plotting the reconstructed data across dimensions
lsplom(x_hat, title = 'Reconstructed data')
```
 
%% Output
 
 
%% Cell type:code id: tags:
 
``` python
## Plotting the weights of the encoder
plot_weights(model, side = 'encoder')
```
 
%% Output
 
 
%% Cell type:code id: tags:
 
``` python
## Plotting the weights of the decoder
plot_weights(model, side = 'decoder')
```
 
%% Output
 
 
%% Cell type:code id: tags:
 
``` python
# Inspecting model parameters
 
# Decoding parameters
weights_decoding_X = model.vae[0].W_out.weight.data
weights_decoding_Y = model.vae[1].W_out.weight.data
 
print(weights_decoding_X)
print(weights_decoding_Y)
```
 
%% Output
 
tensor([[ 0.1148, -0.0684, 0.5098, 0.0365, -1.6451],
[-0.3095, 0.0943, -0.8667, 0.2497, 2.4173],
[-0.2510, -0.1134, -2.1274, -0.2352, -2.1068],
[ 0.6572, 0.0243, -1.3619, -0.0370, 0.7790],
[-0.3694, 0.0762, 0.7014, -0.0429, 0.3934]])
tensor([[ 0.1395, -0.1859, -1.0957, 0.0222, -0.5188],
[-0.1095, 0.1167, -2.1759, -0.0054, 2.1793],
[-0.4565, -0.0929, 0.2947, -0.0118, -1.3441],
[-0.2269, -0.0548, 1.7775, -0.0096, 1.4515],
[ 0.1131, -0.1738, 2.0298, 0.1213, 1.6703]])
 
%% Cell type:code id: tags:
 
``` python
# Encoding parameters
weights_encoding_X = model.vae[0].W_mu.weight.data
weights_encoding_Y = model.vae[1].W_mu.weight.data
 
print(weights_encoding_X)
print(weights_encoding_Y)
```
 
%% Output
 
tensor([[ 0.0909, -0.3644, -0.3227, 0.5998, -0.3149],
[-0.0150, -0.1016, -0.0240, 0.1764, 0.2570],
[ 0.0469, -0.1533, -0.2561, -0.1465, 0.0691],
[ 0.7675, 0.7032, -0.1042, -0.3557, -0.6180],
[-0.1278, 0.1636, -0.1328, 0.0766, 0.0469]])
tensor([[ 0.1005, -0.1762, -0.4199, -0.2145, 0.1101],
[-1.1709, -0.0119, -0.3710, 0.0791, -0.7206],
[-0.1125, -0.1749, 0.0331, 0.1045, 0.1293],
[ 0.4766, -0.0190, 0.0869, -0.2600, 0.4588],
[-0.0017, 0.2063, -0.0976, 0.1288, 0.1290]])
 
%% Cell type:code id: tags:
 
``` python
# Here we compute the encoding and plot the latent dimensions against our original ground truth for the syntetic data
 
encoding = model.encode(data)
 
# Remember: encodings are distributions
print(f'Encoding distribution q for Channel 0: {encoding[0]}')
 
# with the method "loc" we extract the mean of the distributions
encoding_x = encoding[0].loc.data.numpy()
encoding_y = encoding[1].loc.data.numpy()
```
 
%% Output
 
Encoding distribution q for Channel 0: Normal(loc: torch.Size([500, 5]), scale: torch.Size([500, 5]))
 
%% Cell type:markdown id: tags:
 
We note that the estimated encoding is correlated with the original latent dimensions. There seems to be however some redundancy. This motivates the use of a *sparse model*.
 
%% Cell type:code id: tags:
 
``` python
plt.figure(figsize=(12, 12))
for idx,k in enumerate(range(n_components)):
plt.subplot(n_components,2,2*idx+1)
plt.scatter(encoding_x[:,k], latents[:,0])
plt.xlabel(str('mcvae dim ') + str(k))
plt.ylabel('ground truth 0')
plt.subplot(n_components,2,2*idx+2)
plt.scatter(encoding_x[:,k], latents[:,1])
plt.xlabel(str('mcvae dim ') + str(k))
plt.ylabel('ground truth 1')
plt.tight_layout()
plt.show()
```
 
%% Output
 
 
%% Cell type:code id: tags:
 
``` python
# Initialize sparse mcvae
 
model_sparse1 = Mcvae(sparse=True, **init_dict)
model_sparse1.to(DEVICE)
print(model_sparse1)
print('Is the log_alpha parameter the same for every channel?')
log_alpha = model_sparse1.vae[0].log_alpha
print(np.all([log_alpha is vae.log_alpha for vae in model_sparse1]))
```
 
%% Output
 
Mcvae(
(vae): ModuleList(
(0): VAE(
(W_mu): Linear(in_features=5, out_features=5, bias=True)
(W_out): Linear(in_features=5, out_features=5, bias=True)
)
(1): VAE(
(W_mu): Linear(in_features=5, out_features=5, bias=True)
(W_out): Linear(in_features=5, out_features=5, bias=True)
)
)
)
Is the log_alpha parameter the same for every channel?
 
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-64-51039ebdfc7f> in <module>
7 print('Is the log_alpha parameter the same for every channel?')
8 log_alpha = model_sparse1.vae[0].log_alpha
----> 9 print(np.all([log_alpha is vae.log_alpha for vae in model_sparse1]))
 
TypeError: 'Mcvae' object is not iterable
 
%% Cell type:markdown id: tags:
__Exercise.__ Check if the sparsity parameter log_alpha is the same for every VAE in the MCvae model.
- Hint: you can acces the model modules and parameters with the dot notation (model.submodel.parameter)
%% Cell type:code id: tags:
 
``` python
# Optimize
model_sparse1.optimizer = torch.optim.Adam(model_sparse1.parameters(), lr=adam_lr)
load_or_fit(model=model_sparse1, data=data, epochs=n_epochs, ptfile='model_sparse1.pt', force_fit=FORCE_REFIT)
```
 
%% Output
 
Loading model_sparse1.pt
 
%% Cell type:markdown id: tags:
 
The sparse model estimates a probability of redundancy associated to each dimension. This means that we can retain only the dimensions with low probability of redundancy. In this case the model correctly identifies only 2 meaningful latent dimensions.
 
%% Cell type:code id: tags:
 
``` python
print('Probability of redundancy: ', model_sparse1.dropout.data)
plot_dropout(model_sparse1, sort=False)
```
 
%% Output
 
Probability of redundancy: tensor([[0.5597, 0.5693, 0.0177, 0.0296, 0.5730]])
 
 
%% Cell type:code id: tags:
 
``` python
# We fix a redundancy threshold
dropout_threshold = 0.20
```
 
%% Cell type:code id: tags:
 
``` python
# We plot the remaining latent dimensions
keep = (model_sparse1.dropout.squeeze() < dropout_threshold).tolist()
kept_comps = [i for i, kept in enumerate(keep) if kept]
print(f'kept components: {kept_comps}')
 
plot_latent_space(model_sparse1, data=data, comp=kept_comps);
```
 
%% Output
 
kept components: [2, 3]
 
 
 
%% Cell type:code id: tags:
 
``` python
# We repeate the same exercise with the synthetic data with redundant dimensions
 
data_sparse = [X_ext, Y_ext]
data_sparse = [torch.Tensor(_) for _ in data_sparse]
 
init_dict = {
'data': data_sparse,
'lat_dim': n_components + 3,
}
 
model_sparse = Mcvae(sparse=True, **init_dict)
model_sparse.to(DEVICE)
print(model_sparse)
```
 
%% Output
 
Mcvae(
(vae): ModuleList(
(0): VAE(
(W_mu): Linear(in_features=8, out_features=8, bias=True)
(W_out): Linear(in_features=8, out_features=8, bias=True)
)
(1): VAE(
(W_mu): Linear(in_features=8, out_features=8, bias=True)
(W_out): Linear(in_features=8, out_features=8, bias=True)
)
)
)
 
%% Cell type:code id: tags:
 
``` python
# Fit (or load model if existing)
model_sparse.optimizer = torch.optim.Adam(model_sparse.parameters(), lr=adam_lr)
load_or_fit(model=model_sparse, data=data_sparse, epochs=n_epochs, ptfile='model_sparse.pt', force_fit=FORCE_REFIT)
```
 
%% Output
 
Loading model_sparse.pt
 
%% Cell type:markdown id: tags:
 
We see that the model recognises again only two meaningful latent dimensions
 
%% Cell type:code id: tags:
 
``` python
print('Probability of redundancy: ', model_sparse.dropout.data)
indices = np.where(model_sparse.dropout.data.numpy().flatten() < dropout_threshold)[0]
non_redundant_comps = indices.tolist()
print(f'Non-redundant components: {non_redundant_comps}')
plot_dropout(model_sparse, sort=False)
```
 
%% Output
 
Probability of redundancy: tensor([[0.7038, 0.7494, 0.7669, 0.0166, 0.8116, 0.0525, 0.6920, 0.7680]])
Non-redundant components: [3, 5]
 
 
%% Cell type:code id: tags:
 
``` python
x_hat = model_sparse.reconstruct(data_sparse, dropout_threshold=dropout_threshold)
lsplom(x_hat, title = 'Reconstructed data')
```
 
%% Output
 
 
%% Cell type:code id: tags:
 
``` python
## Plotting the weights of the decoder
plt.figure(figsize=(12, 8))
plot_weights(model_sparse, side = 'decoder')
```
 
%% Output
 
 
 
%% Cell type:code id: tags:
 
``` python
plt.figure(figsize=(12, 8))
plot_weights(model_sparse, side = 'encoder')
```
 
%% Output
 
 
 
%% Cell type:code id: tags:
 
``` python
# PLotting estimated encoding vs ground truth
 
encoding = model_sparse.encode(data_sparse)
 
encoding_x = encoding[0].loc.detach().numpy()
encoding_y = encoding[1].loc.detach().numpy()
 
plt.figure(figsize=(12, 12))
for idx,k in enumerate(indices):
plt.subplot(len(indices),2,2*idx+1)
plt.scatter(encoding_x[:,k], latents[:,0])
plt.xlabel(str('mcvae dim ') + str(k))
plt.ylabel('ground truth 0')
plt.subplot(len(indices),2,2*idx+2)
plt.scatter(encoding_x[:,k], latents[:,1])
plt.xlabel(str('mcvae dim ') + str(k))
plt.ylabel('ground truth 1')
 
plt.show()
```
 
%% Output
 
 
%% Cell type:code id: tags:
 
``` python
plot_latent_space(model_sparse, data_sparse, comp=non_redundant_comps);
```
 
%% Output
 
 
 
%% Cell type:markdown id: tags:
 
### Increasing the number of channels
 
In this section we explore the use of the model on data with multiple modalities (or channels)
 
%% Cell type:code id: tags:
 
``` python
# generating a new modality Z
# This modality has meaningful as well as redundant dimensions
 
size_z = 10
 
size_z_redundant = 4
 
np.random.seed(37)
Z_redundant = np.random.randn(n, size_z_redundant) # pure noise
 
transform_z = np.random.randint(-8, 8, size=(2, size_z))
 
Z = latents.dot(transform_z) + 2*np.random.normal(size=(n, size_z))
Z = np.hstack([Z, Z_redundant])
 
print(X_ext.shape,Y_ext.shape, Z.shape)
```
 
%% Output
 
(500, 8) (500, 8) (500, 14)
 
%% Cell type:code id: tags:
 
``` python
# Initialize the model
data_multi = [X_ext, Y_ext, Z]
data_multi = [torch.Tensor(_) for _ in data_multi]
 
init_dict = {
'data': data_multi, # [X, Y, Z]
'lat_dim': n_components + 3,
}
 
model_multi = Mcvae(sparse=True, **init_dict)
model_multi.to(DEVICE)
print(model_multi)
```
 
%% Output
 
Mcvae(
(vae): ModuleList(
(0): VAE(
(W_mu): Linear(in_features=8, out_features=8, bias=True)
(W_out): Linear(in_features=8, out_features=8, bias=True)
)
(1): VAE(
(W_mu): Linear(in_features=8, out_features=8, bias=True)
(W_out): Linear(in_features=8, out_features=8, bias=True)
)
(2): VAE(
(W_mu): Linear(in_features=14, out_features=8, bias=True)
(W_out): Linear(in_features=8, out_features=14, bias=True)
)
)
)
 
%% Cell type:code id: tags:
 
``` python
# Fit (or load)
model_multi.optimizer = torch.optim.Adam(model_multi.parameters(), lr=adam_lr)
load_or_fit(model=model_multi, data=data_multi, epochs=n_epochs, ptfile='model_multi.pt', force_fit=FORCE_REFIT)
```
 
%% Output
 
Loading model_multi.pt
 
%% Cell type:code id: tags:
 
``` python
print('Probability of redundancy: ', model_multi.dropout.data.numpy())
indices = np.where(model_multi.dropout.data.numpy().flatten() < dropout_threshold)[0]
print('Non-redundant components: ', indices)
plot_dropout(model_multi, sort=False)
 
encoding = model_multi.encode(data_multi)
 
encoding_x = encoding[0].loc.detach().numpy()
encoding_y = encoding[1].loc.detach().numpy()
encoding_z = encoding[2].loc.detach().numpy()
 
plt.figure(figsize=(12, 12))
for idx,k in enumerate(indices):
plt.subplot(len(indices),2,2*idx+1)
plt.scatter(encoding_z[:,k], latents[:,0])
plt.xlabel(str('mcvae dim ') + str(k))
plt.ylabel('ground truth 0')
plt.subplot(len(indices),2,2*idx+2)
plt.scatter(encoding_z[:,k], latents[:,1])
plt.xlabel(str('mcvae dim ') + str(k))
plt.ylabel('ground truth 1')
 
plt.show()
```
 
%% Output
 
Probability of redundancy: [[0.03285351 0.39449248 0.00762853 0.6011913 0.59263474 0.6564989
0.25333416 0.3858575 ]]
Non-redundant components: [0 2]
 
 
 
%% Cell type:markdown id: tags:
 
The multi-channel variational autoencoder allows to predict each channel from any other.
 
%% Cell type:code id: tags:
 
``` python
# We compute the reconstruction of the data from the encoding
z = [_.sample() for _ in encoding] # sample the encoding distributions
p = model_multi.decode(z) # compute the decoding distributions
 
# This variable has several dimensions over two indices:
# the first index indicates the modality to decode(0:x, 1:y, 3:z, ...)
# the second index indicates the modality from which the encoding is done (0:x, 1:y, 3:z, ...)
 
# p[x][z]: p(x|z)
decoding_x_from_x = p[0][0].loc.data.numpy()
decoding_x_from_y = p[0][1].loc.data.numpy()
decoding_x_from_z = p[0][2].loc.data.numpy()
```
 
%% Cell type:code id: tags:
 
``` python
plt.figure(figsize=(18, 10))
plt.subplot(2,3,1)
plt.scatter(X_ext[:,0], decoding_x_from_x[:,0])
plt.title('decoding X from X')
plt.xlabel('X0')
plt.ylabel('X0 from latent X')
plt.subplot(2,3,2)
plt.scatter(X_ext[:,0], decoding_x_from_y[:,0])
plt.title('decoding X from Y')
plt.ylabel('X0 from latent Y')
plt.xlabel('X0')
plt.subplot(2,3,3)
plt.scatter(X_ext[:,0], decoding_x_from_z[:,0])
plt.title('decoding X from Z')
plt.xlabel('X0')
plt.ylabel('X0 from latent Z')
plt.subplot(2,3,4)
plt.scatter(decoding_x_from_x[:,0], decoding_x_from_y[:,0])
plt.title('decoding X vs decoding Y')
plt.xlabel('X0 from latent X')
plt.ylabel('X0 from latent Y')
plt.subplot(2,3,5)
plt.scatter(decoding_x_from_x[:,0], decoding_x_from_z[:,0])
plt.title('decoding Y vs decoding Z')
plt.xlabel('X0 from latent X')
plt.ylabel('X0 from latent Z')
plt.subplot(2,3,6)
plt.scatter(decoding_x_from_y[:,0], decoding_x_from_z[:,0])
plt.title('decoding Y vs decoding Z')
plt.xlabel('X0 from latent Y')
plt.ylabel('X0 from latent Z')
plt.tight_layout()
plt.show()
```
 
%% Output
 
 
%% Cell type:markdown id: tags:
 
__Exercise.__ Compare Mcvae, Sparse Mcvae, and PLS
- Reuse the observations X1, X2, and X3 created earlier.
- Fit a Mcvae model to predict X3 from X1.
- Fit a Mcvae Sparse model to predict X3 from X1.
- Fit a PLS model to predict X3 from X1.
- Compare the predictions for all the three methods.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Application to (pseudo-) neurological data
 
We are going to load volumetric and cognitive data for a synthetic sample generated from the ADNI dataset.
The exercise consists in applying the methods seen so far to understand the relationship between this kind of variables.
 
%% Cell type:code id: tags:
 
``` python
import pandas as pd
 
adni = pd.read_csv('https://marcolorenzi.github.io/material/pseudo_adni.csv')
 
brain_volume_cols = ['WholeBrain.bl', 'Ventricles.bl', 'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
cognition_cols = ['CDRSB.bl', 'ADAS11.bl', 'MMSE.bl', 'RAVLT.immediate.bl', 'RAVLT.learning.bl', 'RAVLT.forgetting.bl', 'FAQ.bl']
 
adni[brain_volume_cols]
```
 
%% Output
 
WholeBrain.bl Ventricles.bl Hippocampus.bl MidTemp.bl Entorhinal.bl
0 0.684331 0.012699 0.003786 0.012678 0.002214
1 0.735892 0.012803 0.004866 0.015071 0.003041
2 0.738731 0.030492 0.004300 0.012419 0.002316
3 0.696179 0.032797 0.004720 0.012312 0.002593
4 0.841806 0.004030 0.006820 0.016948 0.002896
.. ... ... ... ... ...
995 0.767153 0.011417 0.005209 0.012879 0.002208
996 0.695168 0.011908 0.004641 0.012534 0.002197
997 0.628691 0.041537 0.003478 0.010870 0.001939
998 0.714763 0.020461 0.004713 0.013989 0.001981
999 0.691858 0.030349 0.004237 0.011439 0.002419
[1000 rows x 5 columns]
 
%% Cell type:code id: tags:
 
``` python
volumes_value = adni[brain_volume_cols].values
 
# Standardization of volumetric measures
volumes_value = (volumes_value - volumes_value.mean(0)) / volumes_value.std(0)
```
 
%% Cell type:code id: tags:
 
``` python
adni[cognition_cols]
```
 
%% Output
 
CDRSB.bl ADAS11.bl MMSE.bl RAVLT.immediate.bl RAVLT.learning.bl \
0 1 8 27.0 23.739439 4.0
1 0 0 30.0 64.933800 9.0
2 0 8 24.0 36.987722 3.0
3 0 3 29.0 50.314425 5.0
4 0 0 30.0 57.217830 9.0
.. ... ... ... ... ...
995 1 2 29.0 61.896022 8.0
996 0 1 29.0 62.083170 8.0
997 3 14 24.0 22.289059 2.0
998 0 13 26.0 31.650504 2.0
999 0 15 28.0 29.089863 3.0
RAVLT.forgetting.bl FAQ.bl
0 5.821573 3
1 4.001653 0
2 6.876316 0
3 4.733481 3
4 7.225401 0
.. ... ...
995 1.663102 0
996 5.241477 1
997 5.437600 7
998 1.669603 4
999 7.703384 4
[1000 rows x 7 columns]
 
%% Cell type:code id: tags:
 
``` python
cognition_value = adni[cognition_cols].values
 
# Standardization of cognitive measures
cognition_value = (cognition_value - cognition_value.mean(0)) / cognition_value.std(0)
```
 
%% Cell type:code id: tags:
 
``` python
plsca_adni = PLSCanonical(n_components=3, scale = False)
plsca_adni.fit(cognition_value,volumes_value)
print(plsca_adni.x_weights_)
print(plsca_adni.y_weights_)
```
 
%% Output
 
[[ 0.38442715 -0.0914859 0.3212761 ]
[ 0.46751116 -0.13784339 0.29848234]
[-0.42755384 0.04820369 0.06450696]
[-0.41248938 -0.01061314 0.43066352]
[-0.3504303 -0.18746317 0.64581907]
[ 0.08267813 0.96398703 0.23059686]
[ 0.38866726 -0.07602536 0.38444838]]
[[-0.38267896 0.49605567 0.326012 ]
[ 0.30522362 -0.19887222 0.19846392]
[-0.51586765 -0.39589381 0.68942719]
[-0.48952454 0.48804833 -0.30834445]
[-0.5046203 -0.56520397 -0.53286217]]
 
%% Cell type:code id: tags:
 
``` python
n_cross_valid_run = 200
 
# number of components to test
n_components = 5
 
rep_results = []
for i in range(n_components):
rep_results.append([])
 
for k in range(n_components):
for i in range(n_cross_valid_run):
 
# Sampling disjoint set of indices for splitting the data
batch1_idx = np.random.choice(range(cognition_value.shape[0]), size = int(cognition_value.shape[0]/2), replace = False)
batch2_idx = np.where(np.in1d(range(cognition_value.shape[0]), batch1_idx, assume_unique=True, invert = True))[0]
 
# Creating independent data batches for X
X_1 = cognition_value[batch1_idx, :]
X_2 = cognition_value[batch2_idx, :]
 
# Creating independent data batches for Y
Y_1 = volumes_value[batch1_idx, :]
Y_2 = volumes_value[batch2_idx, :]
 
# Creating a model for each data batch
cca_adni1 = CCA(n_components = k+1, scale = False)
cca_adni2 = CCA(n_components = k+1, scale = False)
 
# Fitting a model for each data batch
cca_adni1.fit(X_1,Y_1)
cca_adni2.fit(X_2,Y_2)
 
# Quantifying the prediction error on the unseen data batch
err1 = np.sum((cca_adni1.predict(X_2) - Y_2)**2)
err2 = np.sum((cca_adni2.predict(X_1) - Y_1)**2)
 
rep_results[k].append(np.mean([err1,err2]))
```
 
%% Cell type:code id: tags:
 
``` python
plt.plot(range(1,n_components+1),np.mean(rep_results, 1))
plt.xlabel('# components')
plt.ylabel('prediction error')
plt.show()
```
 
%% Output
 
 
%% Cell type:code id: tags:
 
``` python
plsca_adni = PLSCanonical(n_components=1, scale = False)
plsca_adni.fit(cognition_value,volumes_value)
 
f = plt.figure(figsize=(15,10))
f.add_subplot(2,1,1)
plt.bar(np.arange(len(plsca_adni.x_weights_[:,0])), plsca_adni.x_weights_[:,0], tick_label = cognition_cols)
plt.title('Component x')
f.add_subplot(2,1,2)
plt.bar(np.arange(len(plsca_adni.y_weights_[:,0])), plsca_adni.y_weights_[:,0], tick_label = brain_volume_cols)
plt.title('Component y')
plt.show()
```
 
%% Output
 
 
%% Cell type:code id: tags:
 
``` python
plt.scatter(plsca_adni.x_scores_[:,0], plsca_adni.y_scores_[:,0], c = cognition_value[:,1])
plt.show()
```
 
%% Output
 
 
%% Cell type:markdown id: tags:
 
## McVAE
 
In this last application we apply the multichannel autoencoder to the pseudo-ADNI data, for jointly modeling different modalities across individuals. We focus on the joint analysis of:
 
- brain volumes;
- sociodemographic information (e.g. age, sex, scholarity);
- cognition;
- apoe genotype;
- fluid biomarkers (abeta and tay concentrations in the CSF).
 
%% Cell type:markdown id: tags:
 
We first import and standardize the different data modalities.
 
%% Cell type:code id: tags:
 
``` python
adni = pd.read_csv('https://marcolorenzi.github.io/material/pseudo_adni.csv')
 
normalize = lambda _: (_ - _.mean(0))/_.std(0)
 
volume_cols = ['WholeBrain.bl', 'Ventricles.bl', 'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
volumes_value = adni[volume_cols].values
volumes_value = normalize(volumes_value)
 
demog_cols = ['SEX', 'AGE', 'PTEDUCAT']
demog_value = adni[demog_cols].values
demog_value = normalize(demog_value)
 
cognition_cols = ['CDRSB.bl', 'ADAS11.bl', 'MMSE.bl', 'RAVLT.immediate.bl', 'RAVLT.learning.bl', 'RAVLT.forgetting.bl', 'FAQ.bl']
cognition_value = adni[cognition_cols].values
cognition_value = normalize(cognition_value)
 
apoe_cols = ['APOE4']
apoe_value = adni[apoe_cols].values
apoe_value = normalize(apoe_value)
 
fluid_cols = ['ABETA.MEDIAN.bl', 'PTAU.MEDIAN.bl', 'TAU.MEDIAN.bl']
fluid_value = adni[fluid_cols].values
fluid_value = normalize(fluid_value)
 
# Creating a list with multimodal data
data_adni = [volumes_value, demog_value, cognition_value, apoe_value, fluid_value]
 
# Transform as a pytorch Tensor for compatibility
data_adni = [torch.Tensor(_) for _ in data_adni]
 
print(f'We have {len(data_adni)} channels in total as an input for the model')
```
 
%% Output
 
We have 5 channels in total as an input for the model
 
%% Cell type:code id: tags:
 
``` python
# Defining data and model characteristics
 
init_dict = {
'data': data_adni,
'lat_dim': 5, # We fit 5 latent dimensions
}
```
 
%% Cell type:code id: tags:
 
``` python
# Initialize the model
 
model_adni = Mcvae(sparse=True, **init_dict)
model_adni.to(DEVICE)
print(model_adni)
```
 
%% Output
 
Mcvae(
(vae): ModuleList(
(0): VAE(
(W_out_logvar): Parameter(shape=torch.Size([1, 5]))
(log_alpha): Parameter(shape=torch.Size([1, 5]))
(W_mu): Linear(in_features=5, out_features=5, bias=True)
(W_out): Linear(in_features=5, out_features=5, bias=True)
)
(1): VAE(
(W_out_logvar): Parameter(shape=torch.Size([1, 3]))
(log_alpha): Parameter(shape=torch.Size([1, 5]))
(W_mu): Linear(in_features=3, out_features=5, bias=True)
(W_out): Linear(in_features=5, out_features=3, bias=True)
)
(2): VAE(
(W_out_logvar): Parameter(shape=torch.Size([1, 7]))
(log_alpha): Parameter(shape=torch.Size([1, 5]))
(W_mu): Linear(in_features=7, out_features=5, bias=True)
(W_out): Linear(in_features=5, out_features=7, bias=True)
)
(3): VAE(
(W_out_logvar): Parameter(shape=torch.Size([1, 1]))
(log_alpha): Parameter(shape=torch.Size([1, 5]))
(W_mu): Linear(in_features=1, out_features=5, bias=True)
(W_out): Linear(in_features=5, out_features=1, bias=True)
)
(4): VAE(
(W_out_logvar): Parameter(shape=torch.Size([1, 3]))
(log_alpha): Parameter(shape=torch.Size([1, 5]))
(W_mu): Linear(in_features=3, out_features=5, bias=True)
(W_out): Linear(in_features=5, out_features=3, bias=True)
)
)
)
 
%% Cell type:code id: tags:
 
``` python
adam_lr = 1e-2
n_epochs = 6000
 
model_adni.optimizer = torch.optim.Adam(model_adni.parameters(), lr=adam_lr)
load_or_fit(model=model_adni, data=data_adni, epochs=n_epochs, ptfile='model_adni.pt', force_fit=FORCE_REFIT)
```
 
%% Output
 
Loading model_adni.pt
 
%% Cell type:markdown id: tags:
 
The model identified only a significant dimension, that we are going to store and analyze:
 
%% Cell type:code id: tags:
 
``` python
print('Rendundancy: ', model_adni.dropout.detach().numpy())
significant_dim = np.where(model_adni.dropout.detach().numpy()<0.5)[1]
print('Significant dimensions: ', significant_dim)
plot_dropout(model_adni, sort=False)
```
 
%% Output
 
Rendundancy: [[0.47644782 0.99128443 0.9764916 0.9853288 0.9925022 ]]
Significant dimensions: [0]
 
 
%% Cell type:code id: tags:
 
``` python
# Here we store in a list the deconding weights estimated for each modality.
# We are interested in the decoding weights corresponding to the non-redundant dimension
 
decoding_weights = []
decoding_weights = [vae.W_out.weight.data.numpy()[:, significant_dim[0]] for vae in model_adni]
 
decoding_weights[0] # Channel 0: Brain Volumes
```
 
%% Output
 
array([ 0.2002608 , -0.14793997, 0.26713213, 0.236022 , 0.24407719],
dtype=float32)
 
%% Cell type:markdown id: tags:
 
The weights give us a nice way to interpret how the different modalities interact together.
 
%% Cell type:code id: tags:
 
``` python
plt.figure(figsize=(12, 12))
plt.subplot(5,1,1)
plt.bar(np.arange(len(decoding_weights[0])), decoding_weights[0], tick_label = volume_cols)
plt.subplot(5,1,2)
plt.bar(np.arange(len(decoding_weights[1])), decoding_weights[1], tick_label = demog_cols)
plt.subplot(5,1,3)
plt.bar(np.arange(len(decoding_weights[2])), decoding_weights[2], tick_label = cognition_cols)
plt.subplot(5,1,4)
plt.bar(np.arange(len(decoding_weights[3])), decoding_weights[3], tick_label = apoe_cols)
plt.subplot(5,1,5)
plt.bar(np.arange(len(decoding_weights[4])), decoding_weights[4], tick_label = fluid_cols)
```
 
%% Output
 
<BarContainer object of 3 artists>
 
 
%% Cell type:markdown id: tags:
 
Once the model is learnt we can use it for prediction. For example we can predict brain volumes from the cognitive data:
 
%% Cell type:code id: tags:
 
``` python
# Predicting volumes (channel 0) from cognition (channel 2)
predictions = model_adni.reconstruct(data_adni, reconstruct_from=[2])
decoding_volume_from_cognition = predictions[0].data.numpy()
```
 
%% Cell type:code id: tags:
 
``` python
plt.figure(figsize=(12, 28))
 
# Plotting predictions for each volumetric feature
for i in range(5):
plt.subplot(5,1,i+1)
plt.scatter(decoding_volume_from_cognition[:,i], volumes_value[:,i])
plt.title('reconstruction ' + volume_cols[i])
plt.xlabel('predicted')
plt.ylabel('target')
plt.show()
```
 
%% Output
 
 
%% Cell type:markdown id: tags:
 
We are finally going to compare the multichannel model with the standard PLS modeling cognition and brain volumes jointly:
 
%% Cell type:code id: tags:
 
``` python
plsca = PLSCanonical(n_components=1, scale = False)
plsca.fit(cognition_value,volumes_value)
print(plsca.x_weights_)
print(plsca.y_weights_)
```
 
%% Output
 
[[ 0.38442715]
[ 0.46751116]
[-0.42755384]
[-0.41248938]
[-0.3504303 ]
[ 0.08267813]
[ 0.38866726]]
[[-0.38267896]
[ 0.30522362]
[-0.51586765]
[-0.48952454]
[-0.5046203 ]]
 
%% Cell type:markdown id: tags:
 
We obtain predictions from the PLS model and compare them with the predictions of mcvae. We observe a strong correlation between predictions. However, it can be noted that as we increase the number of dimensions of CCA, the difference between prediction increases, as CCA tends to overfit.
 
%% Cell type:code id: tags:
 
``` python
predicted_plsca = plsca.predict(cognition_value)
 
plt.figure(figsize=(26, 18))
 
for i in range(5):
plt.subplot(10,2,2*i+1)
plt.scatter(predicted_plsca[:,i], volumes_value[:,i])
plt.title('reconstruction' + volume_cols[i])
plt.xlabel('predicted')
plt.ylabel('target')
 
for i in range(5):
plt.subplot(10,2,2*i+2)
plt.scatter(predicted_plsca[:,i], decoding_volume_from_cognition[:,i])
plt.title('reconstruction' + volume_cols[i])
plt.xlabel('cca')
plt.ylabel('mcvae')
 
plt.show()
```
 
%% Output
 
 
%% Cell type:code id: tags:
 
``` python
# Comparing average reconstruction errors between models
 
print('Reconstruction error:')
print(f'CCA: {((predicted_plsca-volumes_value)**2).sum()}')
print(f'mcvae: {((decoding_volume_from_cognition-volumes_value)**2).sum()}')
```
 
%% Output
 
Reconstruction error:
CCA: 4631.245631476227
mcvae: 3964.8306431191213
 
%% Cell type:markdown id: tags:
 
It is interesting to notice that mcvae leads to a lower reconstruction error than CCA, as the prediction of mcvae benefits from training from other modalities as well.
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment