Expectation Maximization Algorithm
Powerful ideas in Statistical Inference.
By Vamshi Jandhyala in mathematics
November 23, 2020
Key Idea
The key idea of the EM algorithm is to iterate between taking an expectation and then maximizing. Suppose we have data $Y$ whose density $f(y;\theta)$ leads to a log-likelihood that is hard to maximize. But suppose we can find another random variable $Z$ such that $f(y; \theta) = \int f(y,z;\theta)\,dz$ and such that the likelihood based on $f(y,z;\theta)$ is easy to maximize. In other words, the model of interest is the marginal of a model with a simpler likelihood. In this case, we call $Y$ the observed data and $Z$ the hidden (or latent or missing) data. If we could just “fill in” the missing data, we would have an easy problem. Conceptually, the EM algorithm works by filling in the missing data, maximizing the log-likelihood, and iterating.
The Algorithm
- Pick a starting value $\theta^0$. Now for $j=1,2,\dots,$ repeat steps $1$ and $2$ below.
- (The E-step): Calculate
$$
J(\theta|\theta^j) = \mathbb{E}_{\theta^j}\left(\log \frac{f(Y^n, Z^n; \theta)}{f(Y^n, Z^n; \theta^j)} \,\Big|\, Y^n = y^n\right).
$$
The expectation is over the missing data $Z^n$, treating $\theta^j$ and the observed data $Y^n$ as fixed.
- (The M-step): Find $\theta^{j+1}$ to maximize $J(\theta|\theta^j)$.
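In code, each EM iteration has the following shape. This is only a schematic sketch: `e_step` and `m_step` are hypothetical problem-specific callables, the first returning whatever conditional expectations are needed to form $J(\theta|\theta^j)$, the second returning its maximizer.

```python
import numpy as np

def run_em(y, theta0, e_step, m_step, tol=1e-8, max_iter=10_000):
    """Generic EM loop: alternate E- and M-steps until the parameter
    vector stops changing (total squared difference below tol)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        expectations = e_step(y, theta)                  # E-step
        theta_new = np.asarray(m_step(y, expectations))  # M-step
        if np.sum((theta_new - theta) ** 2) < tol:
            return theta_new
        theta = theta_new
    return theta
```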
Problem
Data on a random sample of $n = 8085$ movies is provided in the movies.csv file. Descriptions of the variables in the data are as follows.
• imdb_title_id: Identifier.
• title: Name of the movie.
• year: Which year the movie was released.
• genre: Which genre(s) the movie fits in.
• duration: Runtime of the movie.
• country: Which country the movie originated in.
• language: Primary language of the movie.
• avg_vote: Overall rating, calculated as a weighted average.
• votes: Number of people who rated the movie.
The objective of this exercise is to identify how ratings differ between big-budget and small-budget movies. Unfortunately, information on the budget of the movies is not provided in the data, so we are going to make use of a latent variable. In particular, let $X_i$ denote the rating for the $i^{th}$ movie and let $Z_i$ be a binary variable ($k = 0, 1$ representing small-budget and big-budget, respectively) corresponding to that movie. Assume that $\mathbb{P}(Z_i = k) = p_k$ where $p_0 + p_1 = 1$. Further, assume that $X_i \mid Z_i = k \sim N(\mu_k, \sigma_k^2)$. Let $\theta = (p_0, p_1, \mu_0, \mu_1, \sigma_0^2, \sigma_1^2)$ and $\mathbf{X} = (X_1, \dots , X_n)$. Note that the $Z_i$'s are unobserved latent variables and we want to find the MLE of $\theta$ using only $\mathbf{X}$. We will follow an iterative procedure.
(a) Find $\gamma_i(k) = \mathbb{P}(Z_i = k|X_i)$.
(b) Write down the expression for the complete log-likelihood $\log f(\mathbf{X}, Z; \theta)$.
(c) Show that the conditional expectation $\mathbb{E}[\log f(\mathbf{X}, Z; \theta)|\mathbf{X}]$ is a function of the $\gamma_i(k)$'s and $\theta$. Then, assuming that $\gamma_i(k)$ is known for all $i$ and $k$, find the MLE for $\theta$.
(d) Execute the following iterative procedure for the given data:
– Step 1: Fix a very small value of $\epsilon$.
– Step 2: Take some initial values for $\theta$.
– Step 3: Evaluate $\gamma_i(k)$ for all $i = 1, \dots , n$ and for $k = 0, 1$ using the current values of $\theta$.
– Step 4: Estimate new values of $\theta$ using the values of $\gamma_i(k)$ from the previous step.
– Step 5: Repeat steps $3$ and $4$ until the total squared difference between the new estimates of $\theta$ and the previous estimates of $\theta$ is less than $\epsilon$.
(e) Report your estimates for $\theta$ from the iterative algorithm and discuss the findings.
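Before turning to the solution, a quick sanity check is possible: simulate data from this two-component model with known parameters, and verify later that the iterative procedure recovers them. A minimal sketch, where the parameter values below are arbitrary placeholders rather than estimates from the movie data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8085
p1_true = 0.7            # P(Z_i = 1), i.e. big-budget
mu_true = (4.0, 6.5)     # (mu_0, mu_1)
sigma_true = (1.0, 0.8)  # (sigma_0, sigma_1)

# Z_i ~ Bernoulli(p_1), X_i | Z_i = k ~ N(mu_k, sigma_k^2)
z = rng.binomial(1, p1_true, size=n)
x_sim = rng.normal(np.where(z == 1, mu_true[1], mu_true[0]),
                   np.where(z == 1, sigma_true[1], sigma_true[0]))
```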
Solution
a) We have
$$
\begin{align*}
\gamma_i(k) &= \mathbb{P}(Z_i = k|X_i)\\
&= \frac{f(x_i|Z_i=k)\mathbb{P}(Z_i=k)}{f(x_i|Z_i=1)\mathbb{P}(Z_i=1) + f(x_i|Z_i=0)\mathbb{P}(Z_i=0)} \\
& = \frac{\phi(x_i;\mu_k, \sigma_k^2)p_k}{\phi(x_i;\mu_1, \sigma_1^2)p_1 + \phi(x_i;\mu_0, \sigma_0^2)p_0}
\end{align*}
$$
where $\phi(x;\mu, \sigma^2)$ denotes the normal density with mean $\mu$ and variance $\sigma^2$.
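For concreteness, the responsibilities $\gamma_i(k)$ can be evaluated in a vectorized way with `scipy.stats.norm` (its `scale` argument is the standard deviation $\sigma_k$); a small sketch:

```python
import numpy as np
from scipy.stats import norm

def responsibilities(x, p0, p1, mu0, mu1, sigma0, sigma1):
    """Return (gamma_i(0), gamma_i(1)) for every observation x_i."""
    w0 = p0 * norm.pdf(x, mu0, sigma0)  # p_0 * phi(x_i; mu_0, sigma_0^2)
    w1 = p1 * norm.pdf(x, mu1, sigma1)  # p_1 * phi(x_i; mu_1, sigma_1^2)
    total = w0 + w1
    return w0 / total, w1 / total
```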
b) We can write
$$ f(x,z) = f(z)f(x|z) = p_0^{1-z}\phi(x;\mu_0,\sigma_0^2)^{1-z}\,p_1^z\phi(x;\mu_1,\sigma_1^2)^z $$
Hence the likelihood is
$$ \prod_{i=1}^n p_0^{1-z_i}\phi(x_i;\mu_0,\sigma_0^2)^{1-z_i}\, p_1^{z_i}\phi(x_i;\mu_1,\sigma_1^2)^{z_i} $$
The complete log-likelihood is then
$$
\begin{align*}
&\log(p_0)\sum_{i=1}^n(1-z_i) + \log(p_1)\sum_{i=1}^n z_i \\
&-\frac{n}{2}\log (2\pi) -\log \sigma_0 \sum_{i=1}^n (1-z_i) -\log \sigma_1 \sum_{i=1}^n z_i \\
&-\frac{1}{2\sigma_0^2}\sum_{i=1}^n (1-z_i)(x_i - \mu_0)^2 - \frac{1}{2\sigma_1^2}\sum_{i=1}^nz_i(x_i - \mu_1)^2
\end{align*}
$$
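As a check on this expression, the complete log-likelihood can be evaluated numerically for any given assignment of the $z_i$'s; a small sketch using `norm.logpdf`:

```python
import numpy as np
from scipy.stats import norm

def complete_log_likelihood(x, z, p0, p1, mu0, mu1, sigma0, sigma1):
    """log f(x, z; theta) with binary z_i (0 = small budget, 1 = big budget)."""
    ll0 = np.log(p0) + norm.logpdf(x, mu0, sigma0)  # terms with z_i = 0
    ll1 = np.log(p1) + norm.logpdf(x, mu1, sigma1)  # terms with z_i = 1
    return np.sum((1 - z) * ll0 + z * ll1)
```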
c) Up to an additive term that does not depend on $\theta$, the conditional expectation is
$$
\begin{align*}
&J(\theta | \theta^j) = \mathbb{E}[\log f(\mathbf{X}, Z)|\mathbf{X}, \theta^j] \\
&= \log(p_0)\sum_{i=1}^n(1-\mathbb{E}[Z_i|X_i, \theta^j]) + \log(p_1)\sum_{i=1}^n \mathbb{E}[Z_i|X_i, \theta^j] \\
&-\frac{n}{2}\log (2\pi) - \log(\sigma_0)\sum_{i=1}^n(1-\mathbb{E}[Z_i|X_i, \theta^j]) \\
&- \log(\sigma_1)\sum_{i=1}^n\mathbb{E}[Z_i|X_i, \theta^j]\\
&-\frac{1}{2\sigma_0^2}\sum_{i=1}^n (1-\mathbb{E}[Z_i|X_i, \theta^j])(x_i - \mu_0)^2 \\
&- \frac{1}{2\sigma_1^2}\sum_{i=1}^n \mathbb{E}[Z_i|X_i, \theta^j](x_i - \mu_1)^2 \\
&= \log(p_0)\sum_{i=1}^n \gamma_i(0) + \log(p_1)\sum_{i=1}^n \gamma_i(1) \\
&-\frac{n}{2}\log (2\pi) - \log(\sigma_0)\sum_{i=1}^n \gamma_i(0) - \log(\sigma_1)\sum_{i=1}^n \gamma_i(1)\\
&-\frac{1}{2\sigma_0^2}\sum_{i=1}^n \gamma_i(0)(x_i - \mu_0)^2 - \frac{1}{2\sigma_1^2}\sum_{i=1}^n \gamma_i(1)(x_i - \mu_1)^2
\end{align*}
$$
as $\mathbb{E}(Z_i|X_i) = \mathbb{P}(Z_i = 1|X_i) = \gamma_i(1)$.
Taking the derivatives of $J(\theta | \theta^j)$ with respect to $\mu_0, \mu_1, \sigma_0, \sigma_1$ and, subject to the constraint $p_0 + p_1 = 1$, with respect to $p_0$ and $p_1$, and setting them equal to $0$, we get
$$
\begin{align*}
\hat{\mu}_0^{j+1} &= \frac{\sum_{i=1}^n \gamma_i(0)x_i}{\sum_{i=1}^n\gamma_i(0)} \\
\hat{\mu}_1^{j+1} &= \frac{\sum_{i=1}^n \gamma_i(1)x_i}{\sum_{i=1}^n \gamma_i(1)} \\
\hat{p}_0^{j+1} &= \frac{\sum_{i=1}^n \gamma_i(0)}{n}\\
\hat{p}_1^{j+1} &= \frac{\sum_{i=1}^n \gamma_i(1)}{n} \\
\hat{\sigma}_0^{j+1} &= \sqrt{\frac{\sum_{i=1}^n \gamma_i(0)(x_i - \hat{\mu}_0^{j+1})^2}{\sum_{i=1}^n \gamma_i(0)}}\\
\hat{\sigma}_1^{j+1} &= \sqrt{\frac{\sum_{i=1}^n \gamma_i(1)(x_i - \hat{\mu}_1^{j+1})^2}{\sum_{i=1}^n \gamma_i(1)}}
\end{align*}
$$
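Assuming the $\gamma_i(k)$'s have already been computed, these closed-form updates translate directly into a few lines of NumPy; a sketch:

```python
import numpy as np

def m_step(x, gamma1):
    """Closed-form maximizers of J(theta | theta^j); gamma1 holds gamma_i(1)."""
    gamma0 = 1.0 - gamma1
    p1 = np.mean(gamma1)
    p0 = 1.0 - p1
    mu0 = np.sum(gamma0 * x) / np.sum(gamma0)
    mu1 = np.sum(gamma1 * x) / np.sum(gamma1)
    sigma0 = np.sqrt(np.sum(gamma0 * (x - mu0) ** 2) / np.sum(gamma0))
    sigma1 = np.sqrt(np.sum(gamma1 * (x - mu1) ** 2) / np.sum(gamma1))
    return p0, p1, mu0, mu1, sigma0, sigma1
```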
d) Python code implementing the iterative EM algorithm is given below.
```python
import numpy as np
import pandas as pd
from scipy.stats import norm
from math import sqrt

movies = pd.read_csv('movies.csv')
x = movies['avg_vote'].to_numpy()

max_iterations = 100000
e = 0.0000001

# initial values for theta
mu0, mu1, sigma0, sigma1, p0, p1 = 4, 7, 1, 1, 0.5, 0.5

for _ in range(max_iterations):
    # E-step: gamma_i(1) for every movie, using the current parameters
    phi0, phi1 = norm.pdf(x, mu0, sigma0), norm.pdf(x, mu1, sigma1)
    gamma = (p1*phi1)/(p1*phi1 + p0*phi0)
    # M-step: closed-form updates derived in part (c)
    mu0_n = np.sum((1-gamma)*x)/np.sum(1-gamma)
    mu1_n = np.sum(gamma*x)/np.sum(gamma)
    p1_n = np.sum(gamma)/x.shape[0]
    p0_n = 1 - p1_n
    sigma0_n = sqrt(np.sum((1-gamma)*(x-mu0_n)**2)/np.sum(1-gamma))
    sigma1_n = sqrt(np.sum(gamma*(x-mu1_n)**2)/np.sum(gamma))
    theta_j = np.array([mu0, mu1, sigma0, sigma1, p0, p1])
    theta = np.array([mu0_n, mu1_n, sigma0_n, sigma1_n, p0_n, p1_n])
    mu0, mu1, sigma0, sigma1, p0, p1 = theta
    # stop once the total squared change in the parameters is below epsilon
    if np.sum((theta - theta_j)**2) < e:
        break

print("mu0 = %f, mu1 = %f" % (mu0, mu1))
print("sigma0 = %f, sigma1 = %f" % (sigma0, sigma1))
print("p0 = %f, p1 = %f" % (p0, p1))
```
e) The estimates for $\theta$ from the iterative algorithm are given below.
$$
\begin{align*}
\mu_0 &= 4.04 \\
\mu_1 &= 6.39 \\
\sigma_0 &= 1.09 \\
\sigma_1 &= 0.83 \\
p_0 &= 0.26 \\
p_1 &= 0.74
\end{align*}
$$
Under the latent-budget interpretation, roughly $74\%$ of the movies fall in the higher-rated component (mean rating $6.39$, standard deviation $0.83$), while the remaining $26\%$ fall in the lower-rated component, which has a lower mean rating ($4.04$) and a larger spread ($1.09$).
Reference
All of Statistics - Larry Wasserman