Medical statistics and Data Science: Epidemiology

Commonly used statistical tests

Hosmer–Lemeshow test

The Hosmer–Lemeshow test is often used to assess goodness of fit after logistic regression.

Before explaining why the Hosmer–Lemeshow test should be used, we must understand what a covariate pattern is and how to calculate the number of covariate patterns.

Suppose we have three independent (explanatory) variables:

  • Smoking: 2 groups (group 1: yes; group 2: no)
  • Coffee: 3 groups (group 1: <=3 cups; group 2: 4-9 cups; group 3: >=10 cups)
  • Age: 4 groups (group 1: <10 years; group 2: 10-20 years; group 3: 20-30 years; group 4: 30-40 years)

Suppose we fit a univariate logistic regression with smoking as the only independent variable. The number of covariate patterns will be 2. Why? Simply because the variable smoking consists of 2 groups.

Now, I hope you can see that the number of covariate patterns will be 3 if we fit a univariate logistic regression with coffee as the only independent variable. Why? Yes, you are right: it is because the variable coffee consists of 3 groups. By now this should be a piece of cake for you: the number of covariate patterns will of course be 4 if we fit a univariate logistic regression with age as the only independent variable, because the variable age consists of 4 groups.

The key question is: what will the number of covariate patterns be if we fit a multiple logistic regression that includes all three independent variables (smoking, coffee and age)?

Covariate pattern   Smoking   Coffee   Age
 1                  1         1        1
 2                  1         1        2
 3                  1         1        3
 4                  1         1        4
 5                  1         2        1
 6                  1         2        2
 7                  1         2        3
 8                  1         2        4
 9                  1         3        1
10                  1         3        2
11                  1         3        3
12                  1         3        4
13                  2         1        1
14                  2         1        2
15                  2         1        3
16                  2         1        4
17                  2         2        1
18                  2         2        2
19                  2         2        3
20                  2         2        4
21                  2         3        1
22                  2         3        2
23                  2         3        3
24                  2         3        4

If we look at the table, the maximum number of covariate patterns is the number of possible combinations of the groups of all three variables: 2*3*4 = 24.

Now imagine that the variable age consists of 20 groups (11, 12, 13, ..., 30). The maximum number of covariate patterns will then be 2*3*20 = 120.
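The counting rule above can be checked with a few lines of Python. This is only a minimal sketch: the dictionary `levels` and the helper `max_covariate_patterns` are names made up for this illustration, and the group labels are the ones from the example.

```python
from itertools import product
from math import prod

# Group labels taken from the example (hypothetical variable names).
levels = {
    "smoking": [1, 2],
    "coffee": [1, 2, 3],
    "age": [1, 2, 3, 4],
}


def max_covariate_patterns(levels):
    """Upper bound on the number of covariate patterns:
    the product of the number of groups of each variable."""
    return prod(len(groups) for groups in levels.values())


# Enumerate every combination explicitly; each tuple is one covariate pattern.
patterns = list(product(*levels.values()))

print(max_covariate_patterns(levels))  # 24
print(len(patterns))                   # 24
print(patterns[:4])                    # [(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 1, 4)]

# Age recorded as 20 groups (11, 12, ..., 30) instead of 4.
levels_fine_age = {**levels, "age": list(range(11, 31))}
print(max_covariate_patterns(levels_fine_age))  # 120
```

The enumeration order produced by `product` matches the table: the last variable (age) changes fastest.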

The Pearson χ² test can be used to assess goodness of fit after logistic regression. However, as the number of covariate patterns increases, and especially when it gets close to the sample size, the Pearson χ² test becomes questionable, because each covariate pattern then contains too few observations for the χ² approximation to be reliable.

One way to avoid the situation in which the number of covariate patterns is close to the sample size is to group the data.
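To see how quickly the number of covariate patterns can approach the sample size, here is a small sketch that counts the distinct patterns actually observed in a data set. The six subjects are invented purely for illustration, and `count_observed_patterns` is a hypothetical helper, not part of any package.

```python
def count_observed_patterns(rows):
    """Number of distinct covariate patterns present in the data.
    Each row is a tuple of covariate values for one subject."""
    return len(set(rows))


# Six made-up subjects: (smoking group, coffee group, age group).
rows_grouped = [
    (1, 1, 2), (1, 1, 2),
    (2, 3, 2), (2, 3, 2),
    (1, 2, 3), (1, 2, 3),
]

# The same six subjects, but with age recorded in years instead of groups.
rows_exact_age = [
    (1, 1, 11), (1, 1, 14),
    (2, 3, 12), (2, 3, 19),
    (1, 2, 23), (1, 2, 27),
]

n = len(rows_grouped)
print(count_observed_patterns(rows_grouped), "patterns for", n, "subjects")    # 3 patterns for 6 subjects
print(count_observed_patterns(rows_exact_age), "patterns for", n, "subjects")  # 6 patterns for 6 subjects
```

With age in years, every subject forms a covariate pattern of their own, which is exactly the situation in which the number of covariate patterns equals the sample size.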

Several grouping strategies have been proposed. Lemeshow and Hosmer (1988) showed that grouping based on percentiles of the estimated probabilities is preferable, especially when the majority of the estimated probabilities are small. When this preferred strategy is applied, usually with 10 groups, the groups are referred to as the "deciles of risk". “This term comes from health science research where the outcome y=1 often represents the occurrence of some disease”. Some logistic regression software packages provide the Hosmer–Lemeshow test, usually based on 10 groups.
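The percentile-based grouping can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation of any particular package: the function names `hosmer_lemeshow` and `chi2_sf_even_df` are hypothetical, subjects are simply sorted by estimated probability and cut into near-equal-sized groups (real software may handle ties differently), and the statistic is compared with a χ² distribution on (number of groups - 2) degrees of freedom, which is the usual approximation when the model was fitted to the same data. To keep the sketch free of external libraries, the p-value helper only supports an even number of degrees of freedom, which covers the common case of 10 groups.

```python
import math


def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, degrees_of_freedom).
    y: observed outcomes (0 or 1); p: estimated probabilities, strictly in (0, 1)."""
    if len(y) != len(p):
        raise ValueError("y and p must have the same length")
    if groups < 3:
        raise ValueError("need at least 3 groups")
    if any(not 0.0 < pi < 1.0 for pi in p):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    n = len(y)
    # Sort subjects by estimated probability (assumption: ties keep input order).
    order = sorted(range(n), key=lambda i: p[i])
    statistic = 0.0
    for g in range(groups):
        # Cut the sorted subjects into near-equal-sized groups.
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue
        observed = sum(y[i] for i in idx)  # observed events in the group
        expected = sum(p[i] for i in idx)  # expected events in the group
        statistic += (observed - expected) ** 2 / (expected * (1 - expected / len(idx)))
    return statistic, groups - 2


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability, closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch only supports positive even df")
    half = x / 2
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Toy data, invented for illustration: 16 subjects and 4 groups.
p_hat = [0.25] * 8 + [0.75] * 8
y_obs = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0]
stat, df = hosmer_lemeshow(y_obs, p_hat, groups=4)
print(stat, df, chi2_sf_even_df(stat, df))  # statistic about 1.33 on 2 df, p-value about 0.51
```

A large p-value, as in this toy example, gives no evidence of lack of fit; a small one suggests that the observed and expected numbers of events disagree in at least some of the risk groups.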

Reference list:

  • Hosmer, D. W., Jr., S. A. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: Wiley.
