Generative Classification Models
Max / 2023-02-15
Generative Classification methods
Theory
In a generative approach to classification, we do not model the posterior class probability \(\pi_k(x) = P(y = k | x)\), i.e., the probability of class membership given a certain feature vector, directly. Instead, we model the “other” conditional distribution \(p(x | y = k)\), i.e., the likelihood of observing the feature vector \(x\) under the assumption that the class is \(k\). Bayes’ rule then gives
\[ \pi_k(x) \propto \pi_k \cdot p(x | y = k). \]
The distribution defined by the parameters \(\pi_k\) is called the prior and represents our a priori knowledge about the frequencies of the target classes. In our setting we can estimate this prior directly from the relative class frequencies in the training data, where \(n_k\) is the number of observations in class \(k\) and \(n\) the total number of observations:
\[ \hat{\pi}_k = \frac{n_k}{n}. \]
With prior and likelihood specified, we write \(\pi_k(x) = \alpha \pi_k \cdot p(x | y = k)\) with a normalizing constant \(\alpha\) and use the fact that the posterior probabilities of all \(g\) classes need to sum to one:
\[ 1 = \sum^{g}_{j=1} \pi_j(x) = \sum^{g}_{j=1}\alpha \pi_j \cdot p(x | y = j) \iff \alpha = \frac{1}{\sum^{g}_{j=1}\pi_j \cdot p (x | y = j)}. \]
From this, we see that \(\pi_k(x)\) can be expressed as
\[ \pi_k(x) = \frac{\pi_k \cdot p(x | y = k)}{\sum^{g}_{j=1}\pi_j \cdot p(x | y = j)}. \]
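The normalization step can be made concrete with a small sketch in base R. It is only an illustration: the univariate normal likelihood on `Sepal.Width` and the query point `x_new` are assumptions made for this example, not the models used later in this demo.

```r
# Sketch: posterior class probabilities from prior and likelihood (Bayes' rule).
# Assumption for illustration: univariate normal likelihood on Sepal.Width.
data(iris)
x_new <- 3.0                     # hypothetical query point
classes <- levels(iris$Species)

# Prior: relative class frequencies n_k / n
prior <- as.numeric(table(iris$Species)[classes]) / nrow(iris)
names(prior) <- classes

# Likelihood p(x | y = k), evaluated at x_new for every class
lik <- sapply(classes, function(k) {
  x_k <- iris$Sepal.Width[iris$Species == k]
  dnorm(x_new, mean = mean(x_k), sd = sd(x_k))
})

# Posterior: normalize prior * likelihood so that the values sum to one
unnormalized <- prior * lik
posterior <- unnormalized / sum(unnormalized)
posterior
```

Dividing by `sum(unnormalized)` plays the role of the constant \(\alpha\): it rescales the products of prior and likelihood into proper probabilities.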
Data
In this code demo we’re looking at the iris data set again:
library(ggplot2)
data(iris)
target <- "Species"
features <- c("Sepal.Width", "Sepal.Length")
iris_train <- iris[, c(target, features)]
target_levels <- levels(iris_train[, target])
ggplot(iris_train, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(aes(color = Species))
To estimate the models we will mostly use the mlr3 package, so we first have to define a task:
library(mlr3)
library(mlr3learners)
iris_task <- TaskClassif$new(
  id = "iris_train",
  backend = iris_train,
  target = target
)
Models (& More Theory)
Linear discriminant analysis (LDA)
In LDA, we model the likelihood as a multivariate normal distribution, i.e.,
\[ p(x | y = k) = \frac{1}{(2\pi)^{\frac{p}{2}} |\Sigma|^{\frac{1}{2}}}\exp\left(- \frac{1}{2} (x-\mu_k)^T\Sigma^{-1}(x-\mu_k)\right), \]
where \(p\) in the exponent denotes the number of features. The parameters are estimated as:
- \(\hat{\mu}_k = \frac{1}{n_k}\sum_{i: y^{(i)} = k} x^{(i)},\)
- \(\hat{\Sigma} = \frac{1}{n - g} \sum_{k=1}^g\sum_{i: y^{(i)} = k} (x^{(i)} - \hat{\mu}_k)(x^{(i)} - \hat{\mu}_k)^T.\)
For every class, it is assumed that data is normally distributed with the same covariance matrix \(\Sigma\) for all classes but different mean vectors \(\mu_k\).
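As a sketch of what the two estimators compute, the class means and the pooled covariance matrix can be written out in base R. The variable names (`mu_hat`, `sigma_hat`, `scatter`) are chosen here for illustration; this is not how the learner used in this demo is implemented internally.

```r
# Sketch: LDA parameter estimates "by hand" (illustrative variable names).
data(iris)
features <- c("Sepal.Width", "Sepal.Length")
X <- as.matrix(iris[, features])
y <- iris$Species
n <- nrow(X)      # number of observations
g <- nlevels(y)   # number of classes

# Class means: one row per class, one column per feature
mu_hat <- t(sapply(levels(y), function(k) colMeans(X[y == k, , drop = FALSE])))

# Pooled covariance: sum the within-class scatter matrices, divide by n - g
scatter <- Reduce(`+`, lapply(levels(y), function(k) {
  X_k <- X[y == k, , drop = FALSE]
  centered <- sweep(X_k, 2, colMeans(X_k))  # subtract the class mean
  crossprod(centered)                       # sum of outer products
}))
sigma_hat <- scatter / (n - g)

mu_hat
sigma_hat
```

The divisor \(n - g\) makes the pooled estimate a weighted average of the per-class sample covariance matrices, with weights \(n_k - 1\).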
We train the model:
iris_lda_learner <- lrn("classif.lda", predict_type = "prob")
iris_lda_learner$train(task = iris_task)
We create a general framework for likelihoods, so that we are able to visualize them:
library(mvtnorm)
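get_mvgaussian_lda <- function(data, target, level, features) {
  # fit an LDA on the requested features only
  classif_task <- TaskClassif$new(
    id = "mvg_task",
    backend = data[, c(features, target)],
    target = target
  )
  lda_learner <- lrn("classif.lda")
  lda_learner$train(task = classif_task)
  list(
    # estimated mean vector of the requested class
    mean = lda_learner$model$means[level, features],
    # recover the shared covariance matrix from the scaling matrix S of the
    # fitted model: S spheres the within-class covariance, so for a square S
    # we have Sigma = (S S^T)^{-1}
    sigma = solve(tcrossprod(lda_learner$model$scaling[features, ])),
    type = "mv_gaussian",
    features = features
  )
}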
# build a likelihood object from its definition
likelihood <- function(likelihood_def, data) {
  switch(likelihood_def$type,
    mvgaussian_lda = get_mvgaussian_lda(
      data, likelihood_def$target,
      likelihood_def$level,
      likelihood_def$features
    )
  )
}
# evaluate a likelihood object at the points in x
predict_likelihood <- function(likelihood, x) {
  switch(likelihood$type,
    mv_gaussian = dmvnorm(x,
      mean = likelihood$mean,
      sigma = likelihood$sigma
    )
  )
}
We write a plot function for multivariate likelihood functions with two features:
library(reshape2)
plot_2D_likelihood <- function(likelihoods, data, X1, X2, target,
                               lengthX1 = 100, lengthX2 = 100) {
  # regular grid spanning the observed range of both features
  gridX1 <- seq(
    min(data[, X1]),
    max(data[, X1]),
    length.out = lengthX1
  )
  gridX2 <- seq(
    min(data[, X2]),
    max(data[, X2]),
    length.out = lengthX2
  )
  grid_data <- expand.grid(gridX1, gridX2)
  features <- c(X1, X2)
  target_levels <- names(likelihoods)
  names(grid_data) <- features
  # evaluate every class likelihood on the grid (one column per class)
  lik <- sapply(target_levels, function(level) {
    likelihood <- likelihoods[[level]]
    predict_likelihood(likelihood, grid_data[, likelihood$features])
  })
  grid_data <- cbind(grid_data, lik)
  # long format: one row per grid point and class
  to_plot <- melt(grid_data, id.vars = features)
  ggplot() +
    geom_contour(
      data = to_plot,
      aes_string(x = X1, y = X2, z = "value", color = "variable")
    ) +
    geom_point(data = data, aes_string(x = X1, y = X2, color = target))
}
lda_liks <- sapply(target_levels, function(level)
  likelihood(
    likelihood_def = list(
      type = "mvgaussian_lda", target = target,
      level = level, features = features
    ),
    iris_train
  ),
  simplify = FALSE
)
plot_2D_likelihood(lda_liks, iris_train, "Sepal.Width", "Sepal.Length", target)
We clearly see that all class distributions are modeled with the same covariance matrix: the shape and orientation of the contour lines are the same for all 3 class distributions; only their centers differ.
library(mlr3viz)
plot_learner_prediction(iris_lda_learner, iris_task) +
  guides(alpha = "none", shape = "none")
The resulting decision boundaries are linear – even though we can’t really clearly see that in the contour plot above.
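The linearity can also be checked on paper. A short derivation, using only the LDA likelihood and Bayes' rule from above: for two classes \(k\) and \(l\), the normalizing constants and the factor in front of the exponential cancel in the log-ratio of the posteriors, so that
\[ \log\frac{\pi_k(x)}{\pi_l(x)} = \log\frac{\pi_k}{\pi_l} - \frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l). \]
Expanding the quadratic forms, the term \(x^T\Sigma^{-1}x\) appears once with each sign and cancels because both classes share the same \(\Sigma\):
\[ \log\frac{\pi_k(x)}{\pi_l(x)} = \log\frac{\pi_k}{\pi_l} - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \frac{1}{2}\mu_l^T\Sigma^{-1}\mu_l + x^T\Sigma^{-1}(\mu_k - \mu_l). \]
The decision boundary between the two classes is the set of points where this expression equals zero. Since the expression is linear in \(x\), that set is a hyperplane (a straight line for two features).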
Quadratic discriminant analysis (QDA)
In QDA, we model the likelihood as a multivariate normal distribution, i.e.,
\[ p(x | y = k) = \frac{1}{(2\pi)^{\frac{p}{2}} |\Sigma_k|^{\frac{1}{2}}}\exp\left(- \frac{1}{2} (x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\right), \]
where \(p\) in the exponent denotes the number of features. The parameters are estimated as:
- \(\hat{\mu}_k = \frac{1}{n_k}\sum_{i: y^{(i)} = k} x^{(i)},\)
- \(\hat{\Sigma}_k = \frac{1}{n_k - 1} \sum_{i: y^{(i)} = k} (x^{(i)} - \hat{\mu}_k)(x^{(i)} - \hat{\mu}_k)^T.\)
This means we estimate a different mean vector and a different covariance matrix for every class. We train the model:
iris_qda_learner <- lrn("classif.qda", predict_type = "prob")
iris_qda_learner$train(task = iris_task)
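As a sketch of what the per-class covariance estimator computes, it can be reproduced in base R and compared across classes. The names `sigma_hat_k` and `manual` are illustrative; the fitted learner is not inspected here.

```r
# Sketch: QDA covariance estimates "by hand" (illustrative variable names).
data(iris)
features <- c("Sepal.Width", "Sepal.Length")

# One sample covariance matrix per class (cov() uses the divisor n_k - 1)
sigma_hat_k <- lapply(split(iris[, features], iris$Species), cov)

# The same estimate for one class, written out as in the formula above
X_k <- as.matrix(iris[iris$Species == "setosa", features])
centered <- sweep(X_k, 2, colMeans(X_k))         # subtract the class mean
manual <- crossprod(centered) / (nrow(X_k) - 1)  # scaled sum of outer products

sigma_hat_k
```

Printing `sigma_hat_k` shows three different matrices, in contrast to the single pooled matrix that LDA uses.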
We extend our likelihood framework with a QDA variant and plot the resulting class densities:
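get_mvgaussian_qda <- function(data, target, level, features) {
  # fit a QDA on the requested features only
  classif_task <- TaskClassif$new(
    id = "mvg_task",
    backend = data[, c(features, target)],
    target = target
  )
  qda_learner <- lrn("classif.qda")
  qda_learner$train(task = classif_task)
  list(
    # estimated mean vector of the requested class
    mean = qda_learner$model$means[level, features],
    # the fitted model stores one scaling matrix per class; invert its
    # cross-product to recover the covariance matrix of this class
    sigma = solve(tcrossprod(qda_learner$model$scaling[features, , level])),
    type = "mv_gaussian",
    features = features
  )
}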
# redefine the constructor so that it also knows the QDA variant
likelihood <- function(likelihood_def, data) {
  switch(likelihood_def$type,
    mvgaussian_lda = get_mvgaussian_lda(
      data, likelihood_def$target,
      likelihood_def$level,
      likelihood_def$features
    ),
    mvgaussian_qda = get_mvgaussian_qda(
      data, likelihood_def$target,
      likelihood_def$level,
      likelihood_def$features
    )
  )
}
liks <- sapply(target_levels, function(level)
  likelihood(list(
    type = "mvgaussian_qda", target = target,
    level = level, features = features
  ), iris_train),
  simplify = FALSE
)
plot_2D_likelihood(liks, iris_train, "Sepal.Width", "Sepal.Length", target)
As we can see, the covariance matrix now differs between classes: the contour lines of the three class distributions have different shapes and orientations.