Supervised Learning Using Kernel Density Estimation

06 Mar 2023

In this post I want to demonstrate that the distinction between supervised and unsupervised learning is somewhat arbitrary. Specifically, I want to solve a supervised learning problem (binary classification) using an unsupervised learning algorithm (kernel density estimation).

Let’s start by using sklearn.datasets.make_classification to generate a classification problem.

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=400, n_features=2, n_informative=2,
    n_redundant=0, n_repeated=0, n_classes=2,
    random_state=42,
)

The dataset has just two features so it is easy to visualize.
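A minimal matplotlib sketch along these lines should reproduce a plot similar to the figure below (the color map and styling are my own choices, not necessarily what the notebook uses):

import matplotlib.pyplot as plt

# Scatter the two features, colored by class label (blue = positive, red = negative).
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm_r", edgecolor="k")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.show()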

Binary Classification

Joint Distribution $p(x, y)$

Chapter 5 of Deep Learning by Goodfellow et al. shows that we can solve the supervised learning problem of estimating $p(y|x)$ as

\[p(y|x) = \frac{p(x, y)}{\sum_{\forall y' \in y}p(x, y')}\]

by learning the joint distribution $p(x, y)$. In our example, we will use kernel density estimation to learn $p(x, y)$. You can check out my notebook to see how the algorithm works.

First we need to transform our data X and y: we treat the ground truth array y of class labels as just another feature and concatenate it with X. This gives us an array of shape $400 \times 3$, because there are 400 samples and 2 original features plus the new ground truth feature. We can then pass the concatenated (and transposed, as required by the API) array to the kernel density estimation function.

import numpy as np
from scipy.stats import gaussian_kde

X_with_y = np.column_stack([X, y])
dist = gaussian_kde(X_with_y.T)

We now have a dist object representing the estimate of the joint distribution $p(x, y)$. The distribution is visualized in the figure below. Blue contours indicate regions more likely belonging to the positive class; red contours indicate the negative class. We can see that the algorithm successfully identified the bimodal distribution of the data.
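Contours like these can be drawn by evaluating dist.pdf on a grid of feature values, once with the label feature fixed to 1 and once with it fixed to 0. A rough sketch of the idea (the grid ranges and color maps are my assumptions):

import matplotlib.pyplot as plt

# Build a grid over the feature space.
xx, yy = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200))
grid = np.column_stack([xx.ravel(), yy.ravel()])

# Evaluate p(x, y = pos) and p(x, y = neg) at every grid point.
pdf_pos = dist.pdf(np.column_stack([grid, np.ones(len(grid))]).T).reshape(xx.shape)
pdf_neg = dist.pdf(np.column_stack([grid, np.zeros(len(grid))]).T).reshape(xx.shape)

# Blue contours for the positive class, red for the negative class.
plt.contour(xx, yy, pdf_pos, cmap="Blues")
plt.contour(xx, yy, pdf_neg, cmap="Reds")
plt.show()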

Joint PDF

Building the Classifier

Now it’s time to build the actual classifier. We define a class KDEClassifier with three methods and an API similar to sklearn estimators. You can see the complete implementation in the following snippet.

class KDEClassifier:

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        # Treat the class label as an extra feature and estimate the joint density p(x, y).
        X_with_y = np.column_stack([X, y])
        self.dist = gaussian_kde(X_with_y.T)

    def predict_proba_density(self, X: np.ndarray) -> np.ndarray:
        # Append a constant label column: all ones (positive) or all zeros (negative).
        y_pos = np.ones(len(X))
        y_neg = y_pos - 1
        X_with_pos = np.column_stack([X, y_pos]).T
        X_with_neg = np.column_stack([X, y_neg]).T

        # p(y|x) = p(x, y) / (p(x, y=neg) + p(x, y=pos)), computed for both labels.
        proba_densities = [
            self.dist.pdf(X_with_)
            / (self.dist.pdf(X_with_neg) + self.dist.pdf(X_with_pos))
            for X_with_ in (X_with_neg, X_with_pos)
        ]

        # Column 0 holds the negative class, column 1 the positive class.
        return np.column_stack(proba_densities)

    def predict(self, X: np.ndarray) -> np.ndarray:
        # Pick the class with the higher probability density for each sample.
        proba_densities = self.predict_proba_density(X)
        return np.argmax(proba_densities, axis=1)

The fit method learns the joint distribution $p(x, y)$ using the training data. It’s the same two lines of code we used previously to compute the kernel density estimation of $p(x, y)$.

The predict_proba_density method implements formula 5.2 from the Deep Learning book. The first four lines create an array X_with_pos containing all the input samples with an added class label feature always set to the positive label, and an array X_with_neg with the label feature set to the negative label.

The next four lines are the formula itself, written as a list comprehension over the pair (X_with_neg, X_with_pos). The comprehension is there because we apply the formula twice to get both $p(y = \mathrm{pos}|x)$ and $p(y = \mathrm{neg}|x)$ probability densities for each sample. The iteration over X_with_pos computes

\[p(y = \mathrm{pos}|x) = \frac{p(x, y = \mathrm{pos})}{\sum_{\forall y' \in y}p(x, y')}\]

which translates to Python as

self.dist.pdf(X_with_pos)
/ (self.dist.pdf(X_with_neg) + self.dist.pdf(X_with_pos))

As there are only two classes, the denominator of the formula $\sum_{\forall y' \in y}p(x, y')$ is just the two densities added together. The same calculation happens for $p(y = \mathrm{neg}|x)$ in the other iteration of the comprehension. The end result produced by np.column_stack(proba_densities) is a $400 \times 2$ array that contains the negative class probability density for each sample in the first column and the positive class probability density in the second column.
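Because both columns share the same denominator, each row of the output sums to one, which gives a quick sanity check. A small illustrative snippet (not part of the original notebook):

clf = KDEClassifier()
clf.fit(X, y)

probas = clf.predict_proba_density(X)
print(probas.shape)                          # (400, 2)
print(np.allclose(probas.sum(axis=1), 1.0))  # True: the two densities are normalized per sample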

The predict method is simple. It takes the array from predict_proba_density and, if the density in the second (positive class) column for a given sample is higher than the density in the first column, it assigns the positive class to the sample. Otherwise, it assigns the negative class. The output is a flat array of 400 predicted class labels.

Evaluation

How well does our classifier work? We can compare its precision and recall with those of a logistic regression classifier.

from sklearn.metrics import classification_report
from sklearn.linear_model import SGDClassifier

kde_classifier = KDEClassifier()
kde_classifier.fit(X, y)

lr_classifier = SGDClassifier(loss="log_loss", random_state=0)
lr_classifier.fit(X, y)

print(classification_report(y, lr_classifier.predict(X)))
print(classification_report(y, kde_classifier.predict(X)))

The KDE classifier outperforms the logistic regression in precision and overall accuracy because it is capable of finding a non-linear separation between the classes, as we can see in the figure below. The lightest contour in the middle is the decision threshold between the positive and negative class. Dark blue and red contours define regions where the models are more confident about their decision.
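A comparison figure like the one below can be drawn by evaluating each model's positive class probability on a grid. Here is a rough sketch of the idea (the grid ranges, figure layout, and color maps are my assumptions):

import matplotlib.pyplot as plt

# Grid over the feature space.
xx, yy = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200))
grid = np.column_stack([xx.ravel(), yy.ravel()])

# Positive class probability (density) of each model at every grid point.
kde_surface = kde_classifier.predict_proba_density(grid)[:, 1].reshape(xx.shape)
lr_surface = lr_classifier.predict_proba(grid)[:, 1].reshape(xx.shape)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, surface, title in zip(axes, (kde_surface, lr_surface), ("KDE classifier", "Logistic regression")):
    ax.contourf(xx, yy, surface, levels=10, cmap="coolwarm_r")
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm_r", edgecolor="k", s=15)
    ax.set_title(title)
plt.show()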

Classifier Comparison

The notebook with all the code is available on my GitHub.