06 Mar 2023
In this post I want to demonstrate that the distinction between supervised and unsupervised learning is somewhat arbitrary. Specifically, I want to solve a supervised learning problem (binary classification) using an unsupervised learning algorithm (kernel density estimation).
Let’s start by using sklearn.datasets.make_classification to generate a classification problem.
from sklearn.datasets import make_classification
X, y = make_classification(
    n_samples=400, n_features=2, n_informative=2,
    n_redundant=0, n_repeated=0, n_classes=2,
    random_state=42,
)
The dataset has just two features so it is easy to visualize.
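A quick scatter plot is enough to see both classes, for example with matplotlib (the styling here is an arbitrary choice):

import matplotlib.pyplot as plt

# Plot the two features, colored by class label (arbitrary styling).
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.show()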
Chapter 5 of Deep Learning by Goodfellow et al. says we can solve a supervised learning problem of estimating $p(y|x)$ as
\[p(y|x) = \frac{p(x, y)}{\sum_{\forall y' \in y}p(x, y')}\]
by learning the joint distribution $p(x, y)$. The formula is just the definition of conditional probability combined with the law of total probability: the denominator, the joint distribution summed over all classes, is the marginal probability $p(x)$. In our example, we will use kernel density estimation to learn $p(x, y)$. You can check out my notebook to see how the algorithm works.
First we need to transform our data X and y by treating the ground truth array y of class labels as just another feature and concatenating it with X. This gives us an array of shape $400 \times 3$, because there are 400 samples with the 2 original features plus the new ground truth feature. We can then pass the concatenated (and transposed, as required by the API) array to the kernel density estimation function.
import numpy as np
from scipy.stats import gaussian_kde
X_with_y = np.column_stack([X, y])
dist = gaussian_kde(X_with_y.T)
We now have a dist object representing the estimate of the joint distribution $p(x, y)$. The distribution is visualized in the figure below. Blue contours indicate regions more likely belonging to the positive class, red contours indicate the negative class. We can see that the algorithm successfully identified the bimodal distribution of the data.
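If you want to draw similar contours yourself, a rough sketch looks like the following: fix the class feature to 1 (positive) or 0 (negative) and evaluate the joint density on a grid of the two original features. The grid range and colormaps are my own choices, not the ones used for the figure.

import matplotlib.pyplot as plt

# Grid over the two original features (range chosen by eye).
xx, yy = np.meshgrid(np.linspace(-4, 4, 100), np.linspace(-4, 4, 100))
grid = np.column_stack([xx.ravel(), yy.ravel()])

# Evaluate p(x, y=pos) and p(x, y=neg) on the grid.
p_pos = dist.pdf(np.column_stack([grid, np.ones(len(grid))]).T).reshape(xx.shape)
p_neg = dist.pdf(np.column_stack([grid, np.zeros(len(grid))]).T).reshape(xx.shape)

plt.contour(xx, yy, p_pos, cmap="Blues")
plt.contour(xx, yy, p_neg, cmap="Reds")
plt.show()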
Now it’s time to build the actual classifier. We define a class KDEClassifier with three methods and an API similar to sklearn estimators. You can see the complete implementation in the following snippet.
class KDEClassifier:
    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        # Treat the labels as an extra feature and estimate the joint density p(x, y).
        X_with_y = np.column_stack([X, y])
        self.dist = gaussian_kde(X_with_y.T)

    def predict_proba_density(self, X: np.ndarray) -> np.ndarray:
        # Append the positive (1) and negative (0) label to every sample.
        y_pos = np.ones(len(X))
        y_neg = y_pos - 1
        X_with_pos = np.column_stack([X, y_pos]).T
        X_with_neg = np.column_stack([X, y_neg]).T
        # p(y|x) = p(x, y) / (p(x, neg) + p(x, pos)), computed for both classes.
        proba_densities = [
            self.dist.pdf(X_with_)
            / (self.dist.pdf(X_with_neg) + self.dist.pdf(X_with_pos))
            for X_with_ in (X_with_neg, X_with_pos)
        ]
        # Column 0 holds p(y=neg|x), column 1 holds p(y=pos|x).
        return np.column_stack(proba_densities)

    def predict(self, X: np.ndarray) -> np.ndarray:
        # Pick the class with the higher conditional density.
        proba_densities = self.predict_proba_density(X)
        return np.argmax(proba_densities, axis=1)
The fit method learns the joint distribution $p(x, y)$ from the training data. It’s the same two lines of code we used previously to compute the kernel density estimate of $p(x, y)$.
The predict_proba_density method implements formula 5.2 from the Deep Learning book. The first few lines create an array X_with_pos containing all the input points with an added class label feature always set to the positive label. The same is repeated for the X_with_neg array, this time adding the negative label.
The list comprehension that follows is the formula itself, applied once to X_with_neg and once to X_with_pos. The loop is there because we apply the formula twice to get both the $p(y = \mathrm{pos}|x)$ and $p(y = \mathrm{neg}|x)$ probability densities for each sample.
For the positive class the formula becomes
\[p(y = \mathrm{pos}|x) = \frac{p(x, y = \mathrm{pos})}{p(x, y = \mathrm{neg}) + p(x, y = \mathrm{pos})}\]
which translates to Python as
self.dist.pdf(X_with_pos)
/ (self.dist.pdf(X_with_neg) + self.dist.pdf(X_with_pos))
As there are only two classes, the denominator of the formula $\sum_{\forall y' \in y}p(x, y')$ is just the two joint densities added together. The same calculation happens for $p(y = \mathrm{neg}|x)$ in the other iteration. The end result produced by np.column_stack(proba_densities) is a $400 \times 2$ array: because the loop iterates over X_with_neg first, the first column contains the negative class probability density for each sample and the second column the positive class probability density.
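Because both classes share the same denominator, the two columns sum to one for every sample, so they can be read as class probabilities. A quick sanity check:

classifier = KDEClassifier()
classifier.fit(X, y)
proba = classifier.predict_proba_density(X)

print(proba.shape)                           # (400, 2)
# p(y=neg|x) + p(y=pos|x) should equal 1 for every sample.
print(np.allclose(proba.sum(axis=1), 1.0))   # True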
The predict method is simple. It takes the array from predict_proba_density, and if the density in the second column for a given sample is higher than the density in the first column, it assigns the positive class to the sample. Otherwise, it assigns the negative class. The output is a flat array of 400 predicted class labels.
How well does our classifier work? We can compare its precision and recall with a logistic regression classifier.
from sklearn.metrics import classification_report
from sklearn.linear_model import SGDClassifier
kde_classifier = KDEClassifier()
kde_classifier.fit(X, y)
lr_classifier = SGDClassifier(loss="log_loss", random_state=0)
lr_classifier.fit(X, y)
print(classification_report(y, lr_classifier.predict(X)))
print(classification_report(y, kde_classifier.predict(X)))
The classifier outperforms the logistic regression in precision and overall accuracy because it is capable of finding a non-linear separation between the classes, as we can see in the figure below. The lightest contour in the middle is the decision threshold between the positive and negative class. Dark blue and red contours define regions where the models are more confident about their decision.
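The decision regions can be sketched in the same way as the joint density above: evaluate predict_proba_density on a grid and plot the positive class probability (again, the grid range and colormap are arbitrary choices):

import matplotlib.pyplot as plt

xx, yy = np.meshgrid(np.linspace(-4, 4, 100), np.linspace(-4, 4, 100))
grid = np.column_stack([xx.ravel(), yy.ravel()])

# Second column holds p(y=pos|x) for every grid point.
p_pos = kde_classifier.predict_proba_density(grid)[:, 1].reshape(xx.shape)

plt.contourf(xx, yy, p_pos, levels=20, cmap="coolwarm_r")
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k")
plt.show()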
The notebook with all the code is available on my GitHub.