Chi square tests

Powerful ideas in Statistical Inference.

By Vamshi Jandhyala in mathematics

November 26, 2020

The Chi-Square Goodness-of-Fit Test

Problem

A survey was conducted among $320$ families, each with $5$ children. The gender distribution among the children is provided below. Is the data consistent with the hypothesis that male and female births are equally probable?

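| No. of boys | $5$ | $4$ | $3$ | $2$ | $1$ | $0$ |
| --- | --- | --- | --- | --- | --- | --- |
| No. of girls | $0$ | $1$ | $2$ | $3$ | $4$ | $5$ |
| No. of families | $14$ | $56$ | $110$ | $88$ | $40$ | $12$ |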

Solution

Let $p_b$ be the proportion of children in the population that are boys and $p_g$ be the proportion that are girls.

The null hypothesis is $H_0$: the number of boys in a family of $5$ children follows a binomial distribution with $p_b=p_g=0.5$. The alternative hypothesis is $H_a$: $H_0$ is not true.

Test Statistic and p-value computation

| Category | Observed ($O_i$) | Expected ($E_i$) |
| --- | --- | --- |
| $5B,0G$ | $14$ | ${5 \choose 0}\cdot 0.5^5\cdot 320 = 10$ |
| $4B,1G$ | $56$ | ${5 \choose 1}\cdot 0.5^5\cdot 320 = 50$ |
| $3B,2G$ | $110$ | ${5 \choose 2}\cdot 0.5^5\cdot 320 = 100$ |
| $2B,3G$ | $88$ | ${5 \choose 2}\cdot 0.5^5\cdot 320 = 100$ |
| $1B,4G$ | $40$ | ${5 \choose 1}\cdot 0.5^5\cdot 320 = 50$ |
| $0B,5G$ | $12$ | ${5 \choose 0}\cdot 0.5^5\cdot 320 = 10$ |

The value of the Chi-square test statistic is

$$ \sum_{i=1}^6 \frac{(O_i-E_i)^2}{E_i} = 7.16 $$

The number of degrees of freedom is $6-1=5$.

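The p-value is $\mathbb{P}(\mathcal{X}_5^2 > 7.16) \approx 0.21$. Since this exceeds $0.05$, we fail to reject $H_0$ at the $5\%$ level of significance: the data are consistent with male and female births being equally probable.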
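The computation above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `chi2_sf` is an assumption made for this illustration, and it evaluates the chi-square upper tail from the standard closed-form series for integer degrees of freedom rather than calling a statistics package.

```python
from math import comb, erfc, exp, pi, sqrt


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) + 2*phi(chi) * sum_{r=1}^{(df-1)/2} chi^(2r-1) / (1*3*...*(2r-1))
    chi = sqrt(x)
    phi = exp(-x / 2) / sqrt(2 * pi)  # standard normal density at chi
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return erfc(chi / sqrt(2)) + 2 * phi * total


# Observed number of families with 5, 4, 3, 2, 1, 0 boys (from the survey table).
observed = [14, 56, 110, 88, 40, 12]
n_families, n_children, p_boy = sum(observed), 5, 0.5

# Expected counts under H0: Binomial(5, 0.5) probabilities times the number of families.
expected = [
    n_families * comb(n_children, k) * p_boy**k * (1 - p_boy) ** (n_children - k)
    for k in range(n_children, -1, -1)
]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = chi2_sf(statistic, df)

print("expected:", expected)
print(f"statistic = {statistic:.2f}, df = {df}, p-value = {p_value:.3f}")
```

If SciPy is available, `scipy.stats.chisquare(observed, expected)` should give the same statistic and p-value in a single call.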

The Chi-Square Test of Homogeneity

Suppose that we have independent observations from $J$ multinomial distributions, each of which has $I$ cells, and that we want to test whether the cell probabilities of the multinomials are equal—that is, to test the homogeneity of the multinomial distributions.

If the probability of the $i$th category of the $j$th multinomial is denoted $\pi_{ij}$, the null hypothesis to be tested is $H_0: \pi_{i1} = \pi_{i2} = \cdots = \pi_{iJ}$ for $i = 1, \dots, I$. We may view this as a goodness-of-fit test: Does the model prescribed by the null hypothesis fit the data?

To test goodness of fit, we compare observed values with expected values using Pearson’s chi-square statistic. We assume that the data consist of independent samples from each multinomial distribution, and we denote the count in the $i$th category of the $j$th multinomial by $n_{ij}$. Under $H_0$, each of the $J$ multinomials has the same probability for the $i$th category, say $\pi_i$.

The following theorem shows that the maximum likelihood estimate (MLE) of $\pi_i$ is simply $n_i/n$, the obvious estimate. Here, $n_i$ is the total count in the $i$th category, $n$ is the grand total count, and $n_{.j}$ is the total count for the $j$th multinomial.

Theorem

Under $H_0$, the MLEs of the parameters $\pi_1, \pi_2, \dots, \pi_I$ are

$$ \hat{\pi}_i = \frac{n_i}{n}, \quad i = 1, \dots, I $$

where $n_i$ is the total number of responses in the $i$th category and $n$ is the grand total number of responses.
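One way to see why this holds is sketched below, assuming the $J$ samples are independent. The Lagrange multiplier $\lambda$ is introduced only for this sketch. Under $H_0$, the log-likelihood depends on the data only through the category totals $n_i$, and maximizing it subject to $\sum_i \pi_i = 1$ gives the stated estimate.

$$ \begin{align*} \ell(\pi_1, \dots, \pi_I) &= \text{const} + \sum_{j=1}^J\sum_{i=1}^I n_{ij}\log \pi_i = \text{const} + \sum_{i=1}^I n_i \log \pi_i\\
\frac{\partial}{\partial \pi_i}\left[\sum_{k=1}^I n_k \log \pi_k - \lambda\left(\sum_{k=1}^I \pi_k - 1\right)\right] &= \frac{n_i}{\pi_i} - \lambda = 0 \implies \pi_i = \frac{n_i}{\lambda}\\
\sum_{i=1}^I \pi_i = 1 &\implies \lambda = n \implies \hat{\pi}_i = \frac{n_i}{n} \end{align*} $$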

For the $j$th multinomial, the expected count in the $i$th category is the estimated probability of that cell times the total number of observations for the $j$th multinomial, or

$$ E_{ij} = \frac{n_in_{.j}}{n} $$

Pearson’s chi-square statistic is therefore

$$ \begin{align*} \mathcal{X}^2 &= \sum_{i=1}^I\sum_{j=1}^J \frac{(O_{ij}-E_{ij})^2}{E_{ij}}\\
&= \sum_{i=1}^I\sum_{j=1}^J \frac{(n_{ij}-n_{i}n_{.j}/n)^2}{n_{i}n_{.j}/n} \end{align*} $$

For large sample sizes, the approximate null distribution of this statistic is chi-square. (The usual recommendation concerning the sample size necessary for this approximation to be reasonable is that the expected counts should all be greater than 5.)

The degrees of freedom are the number of independent counts minus the number of independent parameters estimated from the data. Each multinomial has $I - 1$ independent counts, since the totals are fixed, and $I - 1$ independent parameters have been estimated. The degrees of freedom are therefore

$$ df = J (I - 1) - (I - 1) = (I - 1)(J - 1) $$
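The three ingredients just described (expected counts, Pearson's statistic, and the degrees of freedom) translate directly into a few lines of code. The sketch below is illustrative only: the function name `homogeneity_test_pieces` and the small $2 \times 2$ table are hypothetical, chosen so the arithmetic is easy to verify by hand.

```python
def homogeneity_test_pieces(counts):
    """Return (expected counts, Pearson statistic, df) for a table of counts.

    counts[i][j] is n_ij: the count in category i for multinomial (sample) j.
    """
    category_totals = [sum(row) for row in counts]      # n_i
    sample_totals = [sum(col) for col in zip(*counts)]  # n_.j
    n = sum(category_totals)                            # grand total

    # E_ij = n_i * n_.j / n
    expected = [[r * c / n for c in sample_totals] for r in category_totals]

    # Sum of (O_ij - E_ij)^2 / E_ij over all cells
    statistic = sum(
        (o - e) ** 2 / e
        for observed_row, expected_row in zip(counts, expected)
        for o, e in zip(observed_row, expected_row)
    )

    df = (len(category_totals) - 1) * (len(sample_totals) - 1)
    return expected, statistic, df


# Hypothetical table with I = 2 categories and J = 2 samples.
table = [[30, 20],
         [20, 30]]
expected, statistic, df = homogeneity_test_pieces(table)
print(expected, statistic, df)
```

Here every row and column total is $50$, so each expected count is $50 \cdot 50 / 100 = 25$, each cell contributes $(\pm 5)^2/25 = 1$, and the statistic is $4$ with $1$ degree of freedom.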

Problem

A public opinion poll surveyed a simple random sample of $1000$ voters. Respondents were classified by gender (male or female) and by voting preference (Republican, Democrat or Independent). Based on the contingency table below, use an appropriate statistical technique to determine whether men's voting preferences differ significantly from women's.

| | Republican | Democrat | Independent | Total |
| --- | --- | --- | --- | --- |
| Male | $213$ | $141$ | $54$ | $408$ |
| Female | $251$ | $299$ | $42$ | $592$ |
| Total | $464$ | $440$ | $96$ | $1000$ |

Solution

In the above problem, we have $J=2$ as there are two multinomial distributions, one each for male and female. As there are three categories, Republican, Democrat and Independent, $I=3$.

We have

$$ \begin{align*} E_{RM} &= \frac{464 \cdot 408}{1000} = 189.3\\
E_{RF} &= \frac{464 \cdot 592}{1000} = 274.7\\
E_{DM} &= \frac{440 \cdot 408}{1000} = 179.5\\
E_{DF} &= \frac{440 \cdot 592}{1000} = 260.5\\
E_{IM} &= \frac{96 \cdot 408}{1000} = 39.17\\
E_{IF} &= \frac{96 \cdot 592}{1000} = 56.8 \end{align*} $$

The following table gives the observed count and, below it, the expected count for each party, for both males and females.

| | Male | Female |
| --- | --- | --- |
| Republican Observed | $213$ | $251$ |
| Republican Expected | $189.3$ | $274.7$ |
| Democrat Observed | $141$ | $299$ |
| Democrat Expected | $179.5$ | $260.5$ |
| Independent Observed | $54$ | $42$ |
| Independent Expected | $39.17$ | $56.8$ |

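The value of the chi-square statistic is approximately $28.46$ when computed from unrounded expected counts (the rounded expected counts give roughly $28.4$). The number of degrees of freedom is $(3-1)(2-1) = 2$.

$\mathbb{P}(\mathcal{X}_2^2 > 28.46) \approx 7 \times 10^{-7}$. Since the p-value is far below the significance level ($0.05$), we reject the null hypothesis and conclude that men's and women's voting preferences differ.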
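The whole calculation for this problem can be reproduced with a short script. This is a sketch using only the standard library. It relies on the fact that for exactly $2$ degrees of freedom the chi-square upper tail is $e^{-x/2}$, so the p-value line would need a different formula for other table sizes. Because it uses unrounded expected counts, the statistic it prints may differ in the second decimal place from a hand computation with rounded values.

```python
from math import exp

parties = ["Republican", "Democrat", "Independent"]
observed = {"Male": [213, 141, 54], "Female": [251, 299, 42]}

# Column (party) totals and the grand total from the contingency table.
party_totals = [sum(col) for col in zip(*observed.values())]
n = sum(party_totals)

statistic = 0.0
for gender, row in observed.items():
    gender_total = sum(row)
    for party, o, party_total in zip(parties, row, party_totals):
        e = party_total * gender_total / n  # expected count for this cell
        contribution = (o - e) ** 2 / e     # this cell's share of the statistic
        statistic += contribution
        print(f"{gender:6} {party:11} O={o:3d} E={e:7.2f} contribution={contribution:.2f}")

df = (len(parties) - 1) * (len(observed) - 1)
p_value = exp(-statistic / 2)  # exact chi-square upper tail, valid for df = 2 only
print(f"statistic = {statistic:.2f}, df = {df}, p-value = {p_value:.1e}")
```

The per-cell contributions show which cells drive the result; the largest come from the Democrat and Independent columns. If SciPy is available, `scipy.stats.chi2_contingency` applied to the same $2 \times 3$ table of counts should agree with these numbers.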