Note
Go to the end to download the full example code.
MNIST Examples
We illustrate some basic considerations while manipulating the class codpy.kernel.Kernel classes , applying it to the MNIST problem.
It illustrates the following interpolation / extrapolation method
\[f_{k,\theta}(\cdot) = K(\cdot, Y) \theta, \quad \theta = K(X, Y)^{-1} f(X),\]
- where
\(K(X, Y)\) is the Gram matrix, see
codpy.kernel.Kernel.Knm()\(K(X, Y)^{-1} = (K(Y, X)K(X, Y))^{-1}K(Y,X)\) is computed as a least-square method without regularization terms, see
codpy.kernel.Kernel.get_knm_inv().
This notebook illustrates various choices+ for the set \(Y\)
import os
import random
import numpy as np
import pandas as pd
# A multi scale kernel method.
from sklearn.metrics import confusion_matrix
import codpy.core as core
from codpy.clustering import MiniBatchkmeans
# We use a custom hot encoder for performances reasons.
from codpy.data_processing import hot_encoder
# Standard codpy kernel class.
from codpy.kernel import KernelClassifier
os.environ["OPENBLAS_NUM_THREADS"] = "32"
os.environ["OMP_NUM_THREADS"] = "32"
We pick-up MNIST data using tensorflow (tf). pip install tf on your installation prior running this notebook !
Note :
The MNIST corresponds to features \(X\in \mathbb{R}^{60000,784}\). Each image, represented as \(28\times 28\) black and white pixel is a feature described as a vector in dimension \(D=784\).
- We hot encode the MNIST classes\(f(X) \in \mathbb{R}^{60000,10}\) is defined as
\[f(x) = (\delta_i(c(x)), \quad i=1,\ldots,10,\]where \(c(x)\) is the indice of the label of the class, and \(\delta_i(j) = \{i==j\}\).
def get_MNIST_data(N=-1):
import tensorflow as tf
(x, fx), (z, fz) = tf.keras.datasets.mnist.load_data()
x, z = x / 255.0, z / 255.0
x, z, fx, fz = (
x.reshape(len(x), -1),
z.reshape(len(z), -1),
fx.reshape(len(fx), -1),
fz.reshape(len(fz), -1),
)
fx, fz = (
hot_encoder(pd.DataFrame(data=fx), cat_cols_include=[0], sort_columns=True),
hot_encoder(pd.DataFrame(data=fz), cat_cols_include=[0], sort_columns=True),
)
x, fx, z, fz = (x, fx.values, z, fz.values)
if N != -1:
indices = random.sample(range(x.shape[0]), N)
x, fx = x[indices], fx[indices]
return x, fx, z, fz
We perform basic tests on MNIST results : confusion matrix and scores.
def show_confusion_matrix(z, fz, predictor=None, cm=True):
f_z = predictor(z)
fz, f_z = fz.argmax(axis=-1), f_z.argmax(axis=-1)
out = confusion_matrix(fz, f_z)
if cm:
print("confusion matrix:")
print(out)
print("score MNIST:", np.trace(out) / np.sum(out))
pass
Run codpy silently on/off.
core.kernel_interface.set_verbose(False)
Set variables and pick MNIST data for the test. N_MNIST_pics is used to pick a smaller set than the original one. The training set is x,fx, the test set is z,fz.
N_clusters = 100
N_MNIST_pics = 5000
x, fx, z, fz = get_MNIST_data(N_MNIST_pics)
First pick \(Y\) at random. Output confusion matrix for two sets : the training set \(X\) and the test set \(Z\)
indices = np.random.choice(range(x.shape[0]), size=N_clusters)
y, fy = x[indices], fx[indices]
predictor = KernelClassifier(x=x, y=y, fx=fx)
print("Output with the training set - reproductibility test:")
show_confusion_matrix(x, fx, predictor, cm=False)
print("Output with the test set :")
show_confusion_matrix(z, fz, predictor)
print("Discrepancy(x,y):", predictor.discrepancy(y))
Output with the training set - reproductibility test:
score MNIST: 0.8732
Output with the test set :
confusion matrix:
[[ 958 0 0 0 0 5 8 1 8 0]
[ 0 1088 7 3 1 2 4 1 29 0]
[ 15 18 879 26 19 0 11 26 37 1]
[ 11 2 39 864 1 23 8 17 34 11]
[ 1 9 6 1 847 4 16 1 11 86]
[ 24 9 8 55 18 707 20 16 28 7]
[ 20 5 9 0 19 20 875 1 9 0]
[ 4 28 22 4 20 0 2 910 6 32]
[ 5 7 11 66 10 21 18 9 812 15]
[ 16 9 6 12 94 7 1 22 11 831]]
score MNIST: 0.8771
Discrepancy(x,y): 0.006751273339628827
Select \(Y\) having a lowest discrepancy with a greedy algorithm.
predictor = KernelClassifier(x=x, fx=fx).greedy_select(
N=N_clusters, all=True, start_indices={indices[0]}
)
print("Reproductibility test:")
show_confusion_matrix(x, fx, predictor, cm=False)
print("Performance test:")
show_confusion_matrix(z, fz, predictor, cm=False)
print("Discrepancy(x,y):", predictor.discrepancy(predictor.get_y()))
Reproductibility test:
score MNIST: 0.8782
Performance test:
score MNIST: 0.8827
Discrepancy(x,y): 0.005728688575252161
Select \(Y\) adapted to \(f(x)\) using a greedy algorithm.
predictor = KernelClassifier(x=x, fx=fx).greedy_select(N=N_clusters, fx=fx, all=True)
print("Reproductibility test:")
show_confusion_matrix(x, fx, predictor, cm=False)
print("Performance test:")
show_confusion_matrix(z, fz, predictor, cm=False)
print("Discrepancy(x,y):", predictor.discrepancy(predictor.get_y()))
Reproductibility test:
score MNIST: 0.8976
Performance test:
score MNIST: 0.891
Discrepancy(x,y): 0.008942885551873503
Select \(Y\) using a k-means algorithm.
y = MiniBatchkmeans(x, N=N_clusters).cluster_centers_
predictor = KernelClassifier(x=x, y=y, fx=fx)
print("Reproductibility test:")
show_confusion_matrix(x, fx, predictor, cm=False)
print("Performance test:")
show_confusion_matrix(z, fz, predictor, cm=False)
print("Discrepancy(x,y):", predictor.discrepancy(predictor.get_y()))
Reproductibility test:
score MNIST: 0.9156
Performance test:
score MNIST: 0.9181
Discrepancy(x,y): 0.05833591511492753
Total running time of the script: (0 minutes 28.764 seconds)