Chi-square test

The chi-square test is a non-parametric test, i.e. it does not require assumptions about normality or equal variances in the populations from which the samples were drawn.

Purpose:
The general purpose of the chi-square test is to compare discrete categorical data (count data). For example, a product may be classified into two categories, such as defective/non-defective, or into more than two categories, such as excellent, good, fair, and poor. Chi-square tests are ideally suited to a data set in which both of the variables to be compared are categorical. The chi-square test compares observed values with theoretical expected values.

Scope:
Non-parametric tests such as the chi-square test are generally less powerful than parametric tests, i.e. they are less likely to reject the null hypothesis when it is false.

Applications

  • The chi-square test for goodness of fit is used to decide whether there is any difference between the observed (experimental) values and the expected (theoretical) values. For example, it can be used to determine whether the data follow the same distribution as in the past.
  • The chi-square test for independence of two attributes is used to check whether two characteristics are independent. It is used to determine whether a categorical outcome variable (Y) is related to, or associated with, another categorical predictor variable (X).
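As a rough illustration of the goodness-of-fit case, the sketch below compares observed counts with the counts expected from a reference distribution. The function name `goodness_of_fit_statistic`, the quality-grade counts and the historical proportions are all hypothetical and chosen only for illustration; they do not come from the example used elsewhere in this article.

```python
def goodness_of_fit_statistic(observed, expected_proportions):
    """Return (chi-square statistic, degrees of freedom) for a goodness-of-fit test."""
    total = sum(observed)
    # Expected count per category under the reference (e.g. historical) distribution.
    expected = [total * p for p in expected_proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # One degree of freedom is lost because the counts must add up to the total.
    df = len(observed) - 1
    return statistic, df


# Hypothetical data: 200 items graded excellent, good, fair, poor,
# compared with assumed historical proportions of the same grades.
observed_counts = [50, 80, 50, 20]
historical_proportions = [0.30, 0.40, 0.20, 0.10]

gof_statistic, gof_df = goodness_of_fit_statistic(observed_counts, historical_proportions)
print(gof_statistic, gof_df)
```

A large statistic relative to the chi-square distribution with the returned degrees of freedom would suggest that the current data no longer follow the reference distribution.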

Basic assumptions and requirements:

  • The sample is drawn randomly from the population. This is required if we want to generalize the result to the entire population.
  • Data are reported as raw frequencies (counts, not percentages).
  • Observations are independent.
  • Categories are mutually exclusive (an individual cannot be assigned to more than one category) and exhaustive (they include all possible contexts or categories).
  • Frequencies are not too small (n must be relatively large). In a 2 x 2 table, the chi-square test should not be used if n is less than 20 or if any expected cell count is less than 5.

Methodology
There are several types of chi-square tests, depending on the way the data were collected and the hypothesis being tested. We'll begin with the simplest case: a 2 x 2 contingency table. Suppose you conducted a drug trial on a group of animals and hypothesized that the animals receiving the drug would show increased heart rates compared to those that did not receive the drug. You conduct the study and collect the following data:

                 Heart rate increased   Heart rate not increased   Total
  Treated                 36                       14                 50
  Not treated             30                       25                 55
  Total                   66                       39                105

Statistical hypotheses:
H0: Observations are distributed equally among contexts (observed frequencies do not deviate from expected frequencies), i.e. the proportion of animals whose heart rate increased is independent of drug treatment.
HA: Observations are not distributed equally among contexts (observed frequencies do deviate from expected frequencies), i.e. the proportion of animals whose heart rate increased is associated with drug treatment.

Test statistic:

χ² = Σ (O − E)² / E

where O is the observed frequency and E is the expected frequency of a cell, and the sum runs over all cells of the table.

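Calculating the chi-square statistic:
The expected frequency of each cell is (row total × column total) / grand total. For the observed counts 36 and 14 (treated) and 30 and 25 (not treated), the expected counts are 31.43 and 18.57 (treated) and 34.57 and 20.43 (not treated), so

χ² = (36 − 31.43)²/31.43 + (14 − 18.57)²/18.57 + (30 − 34.57)²/34.57 + (25 − 20.43)²/20.43 ≈ 3.418

Degrees of freedom = (number of columns or contexts − 1) × (number of rows or categories − 1) = (2 − 1)(2 − 1) = 1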
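The calculation can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a statistics package: the function name `chi_square_independence` is made up for this sketch, the counts are those of the drug trial example (36 and 14 for treated animals, 30 and 25 for untreated animals), and no continuity correction is applied.

```python
def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Sum (O - E)^2 / E over every cell of the table.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


# Rows: treated, not treated. Columns: heart rate increased, not increased.
observed = [[36, 14], [30, 25]]
statistic, df, expected = chi_square_independence(observed)
print(round(statistic, 3), df)  # 3.418 1
```

The returned expected counts can also be inspected to confirm that none of them is too small for the test to be appropriate.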

Odds Ratio
The odds ratio is another measure of association for 2 × 2 contingency tables. It occurs as a parameter in the most important type of model for categorical data. For a probability of success π, the odds of success are defined to be

odds = π/(1 − π)

For instance, if π = 0.75, then the odds of success equal 0.75/0.25 = 3. The odds are nonnegative, with a value greater than 1.0 when a success is more likely than a failure. When odds = 4.0, a success is four times as likely as a failure.
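The definition translates directly into code. The helper names `odds_from_probability` and `probability_from_odds` below are illustrative only; the second function simply inverts the formula, which follows algebraically from odds = π/(1 − π).

```python
def odds_from_probability(p):
    """Odds of success for a probability of success p, with 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def probability_from_odds(odds):
    """Invert odds = p / (1 - p) to recover the probability of success."""
    if odds < 0:
        raise ValueError("odds must be nonnegative")
    return odds / (1 + odds)


print(odds_from_probability(0.75))  # 3.0
print(probability_from_odds(4.0))   # 0.8
```

Odds of 4.0 correspond to a success probability of 0.8 and a failure probability of 0.2, which is the "four times as likely" statement above.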

Therefore, Odds Ratio = (Odds of success in category A) / (Odds of success in category B)

In the example, the odds of an increase in heart rate for treated animals = 36/14 = 2.57, and the odds of an increase in heart rate for untreated animals = 30/25 = 1.2. The odds ratio is the ratio of these two odds, i.e. 2.57/1.2 = 2.14. This means that the odds of an increased heart rate for a treated animal were 2.14 times the odds for an untreated animal. Note that this is a ratio of odds, not of probabilities.

Shortcut to calculate the odds ratio:
Let's consider the above example, where we have to calculate the odds ratio of an increased heart rate in the treated category relative to the not-treated category. In the table below, each cell value is represented by a letter.

                 Heart rate increased   Heart rate not increased
  Treated               a = 36                   b = 14
  Not treated           c = 30                   d = 25

Then the odds ratio is (a*d)/(b*c). For our example, odds ratio = (36*25)/(14*30) = 2.14 (the same as shown before!)
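The two routes to the odds ratio can be checked against each other with a short sketch. The function names are hypothetical; the cell values are a = 36, b = 14, c = 30 and d = 25, as used in the calculations above.

```python
import math


def odds(successes, failures):
    """Odds of success from counts of successes and failures."""
    return successes / failures


def odds_ratio_cross_product(a, b, c, d):
    """Cross-product shortcut (a*d)/(b*c) for a 2 x 2 table."""
    return (a * d) / (b * c)


# a, b: treated animals with / without an increased heart rate.
# c, d: untreated animals with / without an increased heart rate.
a, b, c, d = 36, 14, 30, 25

ratio_of_odds = odds(a, b) / odds(c, d)
shortcut = odds_ratio_cross_product(a, b, c, d)

print(round(ratio_of_odds, 2), round(shortcut, 2))  # 2.14 2.14
print(math.isclose(ratio_of_odds, shortcut))        # True
```

Both give the same value because (a/b)/(c/d) simplifies algebraically to (a*d)/(b*c).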

Interpretation
Compare the calculated χ² statistic to a critical χ² value in order to determine whether to reject the null hypothesis.
1) If calculated χ² > critical χ², then p < 0.05, which indicates that there is statistically significant evidence in support of rejecting the null hypothesis. If the null hypothesis were true, there would be less than a 5% probability of obtaining a result at least this extreme by chance, which is conventionally accepted as an acceptable level of error for experiments.
2) If calculated χ² ≤ critical χ², then p ≥ 0.05, which indicates that there is no statistically significant evidence in support of rejecting the null hypothesis. If the null hypothesis were true, there would be at least a 5% probability of obtaining a result at least this extreme by chance, which exceeds the conventionally accepted level of error for experiments.
In our example, we now have our chi-square statistic (χ² = 3.418), our predetermined alpha level of significance (0.05), and our degrees of freedom (df = 1). Entering the chi-square distribution table with 1 degree of freedom and reading along the row, we find that our value of χ² (3.418) lies between 2.706 (p = 0.10) and 3.841 (p = 0.05). The corresponding probability is about 0.065. Since a p-value of 0.065 is greater than the conventionally accepted significance level of 0.05 (i.e. p > 0.05), we fail to reject the null hypothesis. In other words, there is no statistically significant difference between treated and untreated animals in the proportion whose heart rate increased.
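Instead of reading the table, the p-value for 1 degree of freedom can be computed with the Python standard library. The sketch below relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so it is valid only for df = 1; the function name `chi_square_p_value_df1` is made up for illustration, and the critical value 3.841 is the tabulated value quoted above.

```python
import math


def chi_square_p_value_df1(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # For df = 1: P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)),
    # where Z is a standard normal variable.
    return math.erfc(math.sqrt(statistic / 2))


chi2_statistic = 3.418
alpha = 0.05
critical_value = 3.841  # tabulated critical value for df = 1, alpha = 0.05

p_value = chi_square_p_value_df1(chi2_statistic)
reject_null = chi2_statistic > critical_value

print(p_value)      # a little under 0.065
print(reject_null)  # False
```

Comparing the statistic with the critical value and comparing the p-value with alpha are two views of the same decision, so both lead to failing to reject the null hypothesis here. For other degrees of freedom, a statistics library would be the more practical choice.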