%% Cell type:markdown id: tags:
In this notebook, we explore compressive clustering on a 2-D toy example dataset: the whole dataset is first compressed into a single sketch vector, and the cluster centroids are then learned from that sketch alone.
%% Cell type:code id: tags:
``` python
# General imports
import numpy as np
import matplotlib.pyplot as plt
# We import the pycle toolbox for sketched learning; we will need three submodules
import pycle
from pycle import sketching, compressive_learning, utils
# Fix the random seed for reproducibility
np.random.seed(0)
```
%% Cell type:markdown id: tags:
Let's start by generating a toy example dataset from a Gaussian mixture model.
%% Cell type:code id: tags:
``` python
d = 2 # Dimension
K = 5 # Number of Gaussians
n = int(1e5) # Number of samples we want to generate
# We use the generatedataset_GMM method from pycle (entries normalized to [-1,1], imbalanced clusters, well-separated means)
X = pycle.utils.generatedataset_GMM(d,K,n,normalize='l_inf-unit-ball',balanced=False, separation_min=2)
# Bounds on the dataset, necessary for compressive k-means
bounds = np.array([-np.ones(d),np.ones(d)]) # The data was normalized to lie in [-1,1]^d above
# Visualize the dataset
plt.figure(figsize=(5,5))
plt.title("Full dataset")
plt.scatter(X[:,0],X[:,1],s=1, alpha=0.1)
plt.show()
```
%% Output
[Figure: "Full dataset" scatter plot]
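%% Cell type:markdown id: tags:
For reference, here is a minimal NumPy sketch of a comparable generator, in case `pycle` is not at hand. The cluster scales, mixture weights, and normalization below are assumptions for illustration; the actual implementation of `generatedataset_GMM` may differ.
%% Cell type:code id: tags:
``` python
# Minimal GMM sampler (illustrative stand-in, NOT pycle's generatedataset_GMM)
def sample_gmm(d, K, n, rng=None):
    rng = np.random.default_rng(rng)
    means = rng.uniform(-0.7, 0.7, size=(K, d))  # cluster centers
    scales = rng.uniform(0.02, 0.1, size=K)      # isotropic std per cluster
    weights = rng.dirichlet(np.ones(K))          # imbalanced mixture weights
    labels = rng.choice(K, size=n, p=weights)    # cluster assignment per sample
    X = means[labels] + scales[labels, None]*rng.standard_normal((n, d))
    return X/np.max(np.abs(X))                   # scale to the l_inf unit ball

X_alt = sample_gmm(d, K, n, rng=0)
```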
%% Cell type:markdown id: tags:
We first compress the dataset into a single sketch vector. Let's define the parameters of the feature map $\Phi$ first.
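Here, $\Phi$ is a random Fourier feature map. Concretely (standard RFF notation, which we assume matches pycle's conventions), the sketch is the empirical average of $\Phi$ over the dataset:
$$z = \frac{1}{n}\sum_{i=1}^{n} \Phi(x_i), \qquad \Phi(x) = \exp(\mathrm{i}\, W^\top x) \in \mathbb{C}^m,$$
where the complex exponential is applied entrywise and the $m$ frequency vectors stacked in $W$ are drawn i.i.d. from $\mathcal{N}(0, \sigma^{-2} I_d)$, the standard choice for approximating a Gaussian kernel with squared bandwidth $\sigma^2$.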
%% Cell type:code id: tags:
``` python
# Pick the sketch dimension m (5*K*d is usually (just) enough for clustering; we take 10*K*d for comfort)
m = 10*K*d
# Kernel bandwidth (squared)
sigma2 = 0.05
# We want m Gaussian frequencies in dimension d, with squared kernel bandwidth sigma2
W = pycle.sketching.drawFrequencies("Gaussian",d,m,sigma2)
# To generate the map, we provide a nonlinearity rho (here complex exponential for RFF) and the projections W
Phi = pycle.sketching.SimpleFeatureMap("ComplexExponential",W)
# We sketch X with Phi: we map a 100000x2 dataset -> a 100-dimensional complex vector
z = pycle.sketching.computeSketch(X,Phi)
print("Dataset size: ", X.shape)
print("Sketch size: ", z.shape)
```
%% Output
Dataset size: (100000, 2)
Sketch size: (100,)
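%% Cell type:markdown id: tags:
Since the sketch is an empirical average (assuming `computeSketch` returns the mean of $\Phi$ over the rows of $X$, as in the formula above), it can be computed in a streaming or distributed fashion: sketch chunks of the data separately, then merge the partial sketches by a weighted average. A quick sanity check:
%% Cell type:code id: tags:
``` python
# Sketch two halves of the data separately, then merge (equal weights since the halves have equal size)
z_first = pycle.sketching.computeSketch(X[:n//2], Phi)
z_second = pycle.sketching.computeSketch(X[n//2:], Phi)
z_merged = 0.5*z_first + 0.5*z_second
print("Max deviation from the full sketch:", np.max(np.abs(z_merged - z)))  # expected ~0, up to floating-point error
```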
%% Cell type:markdown id: tags:
Now, to solve k-means from the sketch, we call the CLOMPR algorithm.
%% Cell type:code id: tags:
``` python
(weights,centroids) = pycle.compressive_learning.CLOMPR("k-means",z,Phi,K,bounds,nRepetitions=5)
```
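%% Cell type:markdown id: tags:
As background (following the CLOMP-R literature, not pycle-specific documentation): for k-means, the learned model is a weighted mixture of $K$ Diracs at the centroids $c_1,\dots,c_K$, whose sketch is $\sum_k \alpha_k \Phi(c_k)$. CLOMPR greedily adds candidate centroids that best match the residual of the sketch, then fine-tunes all weights and centroids by locally minimizing
$$\min_{\alpha \geq 0,\ c_1,\dots,c_K \in \text{bounds}} \Big\| z - \sum_{k=1}^{K} \alpha_k \Phi(c_k) \Big\|_2 .$$
The `nRepetitions=5` argument presumably runs the randomly initialized algorithm five times and keeps the best fit.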
%% Cell type:markdown id: tags:
Let's see how well we did:
%% Cell type:code id: tags:
``` python
# Visualize the centroids (we re-use the dataset for visual comparison)
plt.figure(figsize=(5,5))
plt.title("Compressively learned centroids")
plt.scatter(X[:,0],X[:,1],s=1, alpha=0.15)
plt.scatter(centroids[:,0],centroids[:,1],s=1000*weights)
plt.legend(["Data","Centroids"])
plt.show()
print("SSE from sketch: {}".format(pycle.utils.SSE(X,centroids)))
# Compare to k-means
from sklearn.cluster import KMeans
kmeans_estimator = KMeans(n_clusters=K)
kmeans_estimator.fit(X)
centroids_kmeans = kmeans_estimator.cluster_centers_
print("SSE from k-means: {}".format(pycle.utils.SSE(X,centroids_kmeans)))
```
%% Output
[Figure: "Compressively learned centroids" scatter plot of the data with centroids overlaid]
SSE from sketch: 1018.8545093592352
SSE from k-means: 990.8373999288126
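%% Cell type:markdown id: tags:
The centroids estimated from the 100-dimensional sketch reach an SSE less than 3% above that of k-means run on the full $10^5$-point dataset, while the sketch itself can be computed in a single pass over the data.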