
Collection of end-to-end regression problems (in-depth: linear regression, logistic regression, poisson regression) 📈

paulinamoskwa/GLMs

Generalized Linear Models (GLM)

In this repository I delve into three different types of regression.

📖 About

This is a collection of end-to-end regression problems. Topics are introduced theoretically in the README.md and tested practically in the notebooks linked below.

First, I tested the theory on toy simulations. I built four simulations in Python using the sklearn and statsmodels libraries:

After that, I moved on to real-world data and developed three end-to-end projects:

Further details can be found in the 'Practical Examples' section below in this README.md.

Note. A good dataset resource for linear/logistic/poisson regression, multinomial responses, survival data.
Note. To further explore feature selection: source 1, source 2, source 3, source 4, source 5.

📚 Theoretical Overview

A generalized linear model (GLM) is a flexible generalization of ordinary linear regression. It relates the linear model to the response variable through a link function. In a GLM, the outcome $\boldsymbol{Y}$ (dependent variable) is assumed to be generated from a distribution in the exponential family (e.g. Normal, Binomial, Poisson, Gamma). The mean $\boldsymbol{\mu}$ of the distribution depends on the independent variables $\boldsymbol{X}$ through the relation:

$$\mathbb{E}[\boldsymbol{Y}|\boldsymbol{X}] = \boldsymbol{\mu} = g^{-1}(\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})$$

where $\mathbb{E}[\boldsymbol{Y}|\boldsymbol{X}]$ is the expected value of $\boldsymbol{Y}$ conditional on $\boldsymbol{X}$, $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta}$ is the linear predictor, and $g(\cdot)$ is the link function. The unknown parameters $\boldsymbol{\beta}$ are typically estimated by maximum likelihood, usually computed with iteratively reweighted least squares (IRLS).
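To make the IRLS idea concrete, here is a minimal NumPy sketch for one specific case: a Poisson outcome with a log link. The function name `irls_poisson`, the simulated data, and the coefficients in `beta_true` are all assumptions made for illustration; they are not taken from the notebooks in this repository. Each iteration solves a weighted least-squares problem on a "working response" built from the current estimate.

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-10):
    """Illustrative IRLS for a Poisson GLM with log link (not production code)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta               # linear predictor X beta
        mu = np.exp(eta)             # mean via the inverse link
        W = mu                       # working weights: (dmu/deta)^2 / Var(Y) = mu
        z = eta + (y - mu) / mu      # working response
        XtW = X.T * W                # same as X.T @ diag(W), without the n x n matrix
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Hypothetical simulated data: intercept plus one standard-normal feature
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true))

beta_hat = irls_poisson(X, y)
print("estimated beta:", beta_hat)
```

At convergence the score equations $\boldsymbol{X}^\top(\boldsymbol{y} - \boldsymbol{\mu}) = \boldsymbol{0}$ hold, which is the maximum-likelihood condition for this model. Other distributions and links only change how `mu`, `W` and `z` are computed.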

🟥 For the sake of clarity, from now on we consider the case of the scalar outcome, $Y$.

Every GLM consists of three elements:

  1. a distribution (from the exponential family) for modeling $Y$
  2. a linear predictor $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta}$
  3. a link function $g(\cdot)$ such that $\mathbb{E}[Y|\boldsymbol{X}] = \mu = g^{-1}(\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})$

The following are the most commonly used examples.

| Distribution | Support | Typical uses | $\mathbb{E}[Y \mid \boldsymbol{X}]$ | Link function $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = g(\mu)$ | Link name | Mean function |
|---|---|---|---|---|---|---|
| Normal $(\mu,\sigma^2)$ | $(-\infty, \infty)$ | Linear-response data | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \mu$ | Identity | $\mu = \boldsymbol{X}\hspace{1pt}\boldsymbol{\beta}$ |
| Gamma $(\mu,\nu)$ | $(0,\infty)$ | Exponential-response data | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = -\mu^{-1}$ | Negative inverse | $\mu = -(\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})^{-1}$ |
| Inverse-Gaussian $(\mu,\sigma^2)$ | $(0, \infty)$ | | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \mu^{-2}$ | Inverse squared | $\mu = (\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})^{-1/2}$ |
| Poisson $(\mu)$ | $\lbrace 0, 1, 2, \dots \rbrace$ | Count of occurrences in a fixed amount of time/space | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \ln(\mu)$ | Log | $\mu = \exp(\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})$ |
| Bernoulli $(\mu)$ | $\lbrace 0, 1 \rbrace$ | Outcome of a single yes/no occurrence | $\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \ln(\frac{\mu}{1-\mu})$ | Logit | $\mu = \frac{1}{1+\exp(-\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})}$ |
| Binomial $(n, \mu)$ | $\lbrace 0, 1, \dots, n \rbrace$ | Count of yes/no in $n$ occurrences | $n\hspace{1pt}\mu$ | $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} = \ln(\frac{\mu}{1-\mu})$ | Logit | $\mu = \frac{1}{1+\exp(-\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta})}$ |
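As a quick sanity check on the link and mean functions listed above, the following sketch writes each pair as plain NumPy functions and verifies that the mean function inverts the link. The dictionary `LINKS` and its keys are names chosen here for illustration only.

```python
import numpy as np

# (link g, mean function g^{-1}) pairs, written out from the formulas above
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "negative_inverse": (lambda mu: -1.0 / mu, lambda eta: -1.0 / eta),
    "inverse_squared": (lambda mu: mu ** -2.0, lambda eta: eta ** -0.5),
    "log": (np.log, np.exp),
    "logit": (
        lambda mu: np.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + np.exp(-eta)),
    ),
}

# Values in (0, 1) are valid means for every link in the dictionary
mu = np.array([0.1, 0.25, 0.5, 0.9])
for name, (g, g_inv) in LINKS.items():
    eta = g(mu)                               # map the mean to the linear-predictor scale
    round_trip_ok = np.allclose(g_inv(eta), mu)
    print(f"{name:>16}: g^-1(g(mu)) == mu -> {round_trip_ok}")
```

The choice of link mainly determines which values of the linear predictor map to valid means: the log link keeps $\mu$ positive and the logit link keeps it in $(0, 1)$, whatever the value of $\boldsymbol{X}\hspace{1pt}\boldsymbol{\beta}$.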

📂 Practical Examples

As already mentioned, let $Y$ be the outcome (dependent variable) and $\mathbf{X}$ be the independent variables. The three types of regression I analyzed (Linear, Logistic and Poisson) differ in the nature of $Y$. For each type, I collected an ad-hoc dataset to experiment with.


📑 Linear Regression

In linear regression, $Y$ is a real number, modeled as:

$$\begin{cases} \hspace{4pt} Y\sim N(\mu,\sigma^2)\\ \hspace{4pt} \mu = \boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} \end{cases}$$

As a case study for linear regression, I analyzed a dataset of human brain weights.
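For the Normal model with identity link, maximizing the likelihood in $\boldsymbol{\beta}$ is the same as minimizing the sum of squared residuals, so ordinary least squares gives the estimate. The sketch below shows this on simulated data; the data, the seed, and the values in `beta_true` are hypothetical and unrelated to the brain-weight dataset.

```python
import numpy as np

# Hypothetical data: Y ~ N(mu, sigma^2) with mu = X beta
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])          # intercept column + one feature
beta_true = np.array([2.0, 0.5])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# Least-squares solution, which is also the maximum-likelihood estimate of beta
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - X.shape[1])   # unbiased variance estimate

print("beta_hat:", beta_hat)
print("sigma2_hat:", sigma2_hat)
```

The residuals are orthogonal to the columns of `X` (the normal equations), which is a convenient check that the fit did what it should.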


📑 Logistic Regression

In logistic regression, $Y$ is a binary value ($0$ or $1$), modeled as:

$$\begin{cases} \hspace{4pt} Y \sim \text{Bernoulli}(\mu)\\ \hspace{4pt} \log\left(\frac{\mu}{1-\mu}\right) = \boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} \end{cases}$$

As a case study for logistic regression, I analyzed an HR dataset.

For Advanced Classification techniques with Scikit-Learn check out Breast Cancer: End-to-End Machine Learning Project.


📑 Poisson Regression

In Poisson regression, $Y$ is a non-negative integer (count), modeled as:

$$\begin{cases} \hspace{4pt} Y \sim \text{Poisson}(\mu)\\ \hspace{4pt}\log(\mu) = \boldsymbol{X}\hspace{1pt}\boldsymbol{\beta} \end{cases}$$

As a case study for Poisson regression, I analyzed a dataset of smoking and lung cancer.


⚖️ Python sklearn vs. statsmodels

Which library should be used? In general, scikit-learn is designed for machine learning, while statsmodels is made for rigorous statistics. Both libraries have their uses, so before selecting one over the other it is best to consider the purpose of the model. A model designed for prediction is best fit using scikit-learn, while statsmodels is best employed for explanatory models. To completely disregard one for the other would do a great disservice to an excellent Python library.

To summarize some key differences:

  • OLS efficiency: scikit-learn is faster at linear regression, and the difference is more apparent for larger datasets
  • Logistic regression efficiency: when only a single core is used, statsmodels is faster at logistic regression
  • Visualization: statsmodels provides a summary table of the fitted model (coefficients, standard errors, p-values)
  • Solvers/methods: in general, statsmodels provides a greater variety
  • Logistic regression: scikit-learn regularizes by default, while statsmodels does not
  • Additional linear models: scikit-learn provides more models for regularization, while statsmodels helps correct for broken OLS assumptions