Journal of Modern Mathematics and Statistics

ISSN: Online
ISSN: Print 1994-5388
Alternative Goodness-of-Fit Test in Logistic Regression Models

C.A. Udomboso, A.U. Chukwu, E.I. Enang and M.E. Nja
Page: 43-46 | Received 21 Sep 2022, Published online: 21 Sep 2022

Abstract

The Deviance and the Pearson chi-square are two traditional goodness-of-fit tests in generalized linear models, of which the logistic model is a special case. The effort involved in computing either the Deviance or the Pearson chi-square statistic is substantial, and this provides a reason for seeking an alternative goodness-of-fit test in logistic regression models with discrete predictor variables. The Deviance is based on the log likelihood function, while the Pearson chi-square derives from the discrepancies between observed and predicted counts. Replacing observed and predicted counts with observed proportions and predicted probabilities, respectively, in a cross-classification data arrangement, the standard error of estimate is proposed as an alternative goodness-of-fit test in logistic regression models. The illustrative example returns favourable comparisons with the Deviance and the Pearson chi-square statistics.


INTRODUCTION

The goodness-of-fit of a model measures how well the model describes the response variable. Assessing goodness-of-fit involves investigating how close the values predicted by the model are to the observed values. A goodness-of-fit test is a test of the explanatory power of a model. In general linear models, this power can be tested using the analysis of variance under the global null hypothesis. The coefficient of determination, $R^2$, and the standard error of estimate are also available for model evaluation in general linear models. In logistic regression models, a special case of the generalized linear model, the Deviance and the Pearson chi-square statistics are the two traditional goodness-of-fit statistics. They are distributed as chi-square with k-m-1 degrees of freedom, where k is the number of categories or subpopulations and m is the number of parameters to be estimated (Jennifer et al., 1996). These logistic regression tests are based on the assumption that the covariates involved in the model are all discrete.

In the presence of continuous covariates, Bewick et al. conclude that the data are often too sparse to use the Deviance or the Pearson chi-square. In that case, they proposed the Hosmer-Lemeshow goodness-of-fit test. Pulkstenis and Robinson (2002) also designed two goodness-of-fit tests for logistic regression models in the presence of continuous explanatory variables, using a methodology similar to that of the Hosmer-Lemeshow goodness-of-fit test. For cases where the assumptions of independence and identical distribution are violated, Archer et al. (2007) developed goodness-of-fit tests for logistic regression models when data are collected using a complex sampling design. Deng et al. (2009) designed an improved goodness-of-fit test for logistic regression models based on case-control data by random partition. The statistic has an asymptotic chi-square distribution.

This study proposes the use of the standard error of estimate as a goodness-of-fit test in logistic regression models. The standard error of estimate, $S_e$, is a measure of the average amount by which the actual observations vary around the regression plane (Webster, 1992). It is a measure of the average amount of error associated with the model:

$$S_e = \sqrt{\frac{\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}{m-k-1}}$$

Where, $y_i$ and $\hat{y}_i$ are the observed and model-predicted values, k is the number of parameters to be estimated and m is the sample size. The smaller the standard error of estimate, the better the fit. This statistic is closely similar to the Pearson chi-square statistic and is distributed as chi-square with m-k-1 degrees of freedom under the null hypothesis that the predicted probabilities closely approximate the observed proportions.

The test is justified on the premise that the observed proportions and the predicted probabilities are continuous and follow a normal distribution, as shown by the normal P-P plot (a plot of expected cumulative probability against observed cumulative probability). The cross-classification data on coronary artery disease (Koch et al., 1985) are used to demonstrate that the proposed alternative method compares favourably with the traditional Deviance and Pearson chi-square statistics. In the illustrative example, sex and ECG status are the independent variables, while the occurrence of coronary artery disease (success) is the response variable. The Newton-Raphson iterative scheme is employed in fitting the logistic model using the SPSS software.
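The Newton-Raphson fit mentioned above can be sketched in a few lines of plain Python. This is only an illustration, not the SPSS procedure used in the study: the group counts below are an assumption (the table itself is not reproduced here), the 0/1 coding of sex and ECG is an assumption, and `solve` and `fit_logistic` are hypothetical helper names.

```python
import math

# ASSUMED counts (not taken from the text): (sex, ecg, diseased, total)
# ASSUMED coding: sex 0 = female, 1 = male; ecg 0 = ECG < 0.1, 1 = ECG >= 0.1
GROUPS = [(0, 0, 4, 15), (0, 1, 8, 18), (1, 0, 9, 18), (1, 1, 21, 27)]

def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_logistic(groups, tol=1e-10, max_iter=25):
    """Newton-Raphson for a main-effects logistic model on grouped data."""
    beta = [0.0, 0.0, 0.0]  # intercept, sex effect, ECG effect
    for _ in range(max_iter):
        score = [0.0] * 3                      # gradient of the log likelihood
        info = [[0.0] * 3 for _ in range(3)]   # information matrix
        for sex, ecg, y, n in groups:
            x = (1.0, float(sex), float(ecg))
            eta = sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for r in range(3):
                score[r] += x[r] * (y - n * p)
                for c in range(3):
                    info[r][c] += n * p * (1.0 - p) * x[r] * x[c]
        step = solve(info, score)              # Newton step: info^{-1} score
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

beta = fit_logistic(GROUPS)
print([round(b, 4) for b in beta])
```

Each iteration updates the coefficients by the inverse information matrix times the score vector, which is the Newton-Raphson step for this likelihood. The printed estimates depend entirely on the assumed counts and coding.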

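The product binomial distribution: There are two independent and identically distributed explanatory variables, $X_1$ and $X_2$, representing sex and ECG, respectively. For this reason, the product binomial distribution (Stokes et al., 1979) $\Pr(n_{kij})$ is given as:

$$\Pr(n_{kij}) = \prod_{k=1}^{2}\prod_{i=1}^{2}\frac{n_{ki+}!}{n_{ki1}!\,n_{ki2}!}\,\theta_{ki}^{\,n_{ki1}}\left(1-\theta_{ki}\right)^{n_{ki2}}$$

Where, $n_{ki1}$ is the number of persons of the kth sex and ith ECG with coronary artery disease, $n_{ki2}$ is the number of persons of the kth sex and ith ECG without coronary artery disease, k = 1 for females, k = 2 for males; i = 1 for ECG<0.1, i = 2 for ECG≥0.1, and $n_{ki+} = n_{ki1} + n_{ki2}$. $\theta_{ki}$ is the probability that a person of the kth sex with an ith ECG status has coronary artery disease:

$$\frac{\theta_{ki}}{1-\theta_{ki}} = \text{odds of coronary artery disease for the group of the } k\text{th sex and } i\text{th ECG}$$

Let $m_{kij}$ = model-predicted counts, defined as:

$$m_{ki1} = n_{ki+}\,\hat{\theta}_{ki}, \qquad m_{ki2} = n_{ki+}\left(1-\hat{\theta}_{ki}\right)$$

Where, $\hat{\theta}_{ki}$ is the model-predicted value of $\theta_{ki}$ and $n_{kij}$ = number of persons of the kth sex and ith ECG with jth disease status.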
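As a rough illustration of the product binomial set-up, the sketch below computes model-predicted counts and the product binomial log likelihood for a main-effects logistic model. The counts, the 0/1 coding and the coefficient values are all assumptions chosen for illustration, and `theta` is a hypothetical helper name.

```python
import math

# ASSUMED counts: (sex, ecg) -> (diseased n_ki1, not diseased n_ki2)
# ASSUMED coding: sex 0 = female, 1 = male; ecg 0 = ECG < 0.1, 1 = ECG >= 0.1
counts = {(0, 0): (4, 11), (0, 1): (8, 10), (1, 0): (9, 9), (1, 1): (21, 6)}

# ASSUMED coefficients (intercept, sex, ecg), for illustration only
alpha, b_sex, b_ecg = -1.1747, 1.2770, 1.0545

def theta(sex, ecg):
    """Model probability of disease for one sex-by-ECG group."""
    return 1.0 / (1.0 + math.exp(-(alpha + b_sex * sex + b_ecg * ecg)))

log_lik = 0.0
predicted = {}
for (sex, ecg), (n1, n2) in counts.items():
    n_plus = n1 + n2
    t = theta(sex, ecg)
    # Predicted counts: group total times predicted probability
    predicted[(sex, ecg)] = (n_plus * t, n_plus * (1.0 - t))
    # One binomial factor of the product binomial likelihood, on the log scale
    log_lik += (math.log(math.comb(n_plus, n1))
                + n1 * math.log(t) + n2 * math.log(1.0 - t))

for group, (m1, m2) in predicted.items():
    print(group, round(m1, 3), round(m2, 3))
print("log likelihood:", round(log_lik, 4))
```

The odds for a group can be read off the same quantities as `theta(sex, ecg) / (1 - theta(sex, ecg))`.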

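The Deviance: The Deviance is used to compare two models in generalized linear models and is similar to the residual variance in ANOVA (McCullagh and Nelder, 1989). Let $\ell(\hat{\theta}_{\max}; y)$ be the maximum achievable log likelihood and $\ell(\hat{\theta}; y)$ be the log likelihood of the model under consideration. The deviance is defined as:

$$\text{Deviance} = 2\left[\ell(\hat{\theta}_{\max}; y) - \ell(\hat{\theta}; y)\right]$$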
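For grouped binomial data, twice the gap between the maximum achievable and the fitted log likelihood reduces to a sum of observed counts times the log ratio of observed to predicted counts. The sketch below uses that form. The counts and predicted probabilities are assumed values for illustration, and `deviance` is a hypothetical helper name.

```python
import math

# ASSUMED counts per sex-by-ECG group: (diseased, not diseased)
observed = [(4, 11), (8, 10), (9, 9), (21, 6)]
# ASSUMED model-predicted probabilities of disease for the same groups
theta_hat = [0.2360, 0.4700, 0.5256, 0.7608]

def deviance(observed, theta_hat):
    """2 * sum of n * log(n / m) over all cells, with m the predicted count."""
    total = 0.0
    for (n1, n2), t in zip(observed, theta_hat):
        n_plus = n1 + n2
        m1, m2 = n_plus * t, n_plus * (1.0 - t)
        for n, m in ((n1, m1), (n2, m2)):
            if n > 0:  # an empty cell contributes nothing to the sum
                total += n * math.log(n / m)
    return 2.0 * total

dev = deviance(observed, theta_hat)
print(round(dev, 4))
```

If the assumed inputs resemble the data analysed in the illustrative example, the result should land close to the Deviance reported there.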

The Pearson chi-square: The Pearson chi-square ($\chi^2$) is the other traditional goodness-of-fit test in logistic regression models. It tests the null hypothesis that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution (Chernoff and Lehmann, 1954). The null distribution of the Pearson statistic with j rows and k columns is approximated by the chi-square distribution with (k-1)(j-1) degrees of freedom (Plackett, 1983):

$$\chi^2 = \sum_{k}\sum_{i}\sum_{j}\frac{\left(n_{kij} - m_{kij}\right)^2}{m_{kij}}$$

Where, $n_{kij}$ and $m_{kij}$ are as defined earlier.
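The Pearson statistic can be computed directly from observed and predicted counts. The following is a minimal sketch under assumed inputs: the counts and predicted probabilities are illustrative values, not taken from the text, and `pearson_chi_square` is a hypothetical helper name.

```python
# ASSUMED counts per sex-by-ECG group: (diseased, not diseased)
observed = [(4, 11), (8, 10), (9, 9), (21, 6)]
# ASSUMED model-predicted probabilities of disease for the same groups
theta_hat = [0.2360, 0.4700, 0.5256, 0.7608]

def pearson_chi_square(observed, theta_hat):
    """Sum of (observed - predicted)^2 / predicted over every cell."""
    total = 0.0
    for (n1, n2), t in zip(observed, theta_hat):
        n_plus = n1 + n2
        m1, m2 = n_plus * t, n_plus * (1.0 - t)
        total += (n1 - m1) ** 2 / m1 + (n2 - m2) ** 2 / m2
    return total

x2 = pearson_chi_square(observed, theta_hat)
print(round(x2, 4))
```

Both the diseased and the non-diseased cells of each group enter the sum, which is why every group contributes two terms.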

THE ALTERNATIVE GOODNESS-OF-FIT TEST

Let $\theta_{ki}$ be as defined before and $p_{ki}$ be the proportion of success in the kth sex level and ith ECG level:

$$p_{ki} = \frac{n_{ki1}}{n_{ki+}}$$

Table 1: Coronary artery disease data

Let $d_i = \hat{\theta}_i - p_i$ be the discrepancy between the ith model-predicted response probability and the ith proportion of success, so that $D = \{d_1, \ldots, d_m\}$ is a sequence of discrepancies. The proposed alternative goodness-of-fit test is the standard error of estimate, $S_e$. This is given as:

$$S_e = \sqrt{\frac{\sum_{i=1}^{m} d_i^2}{m-k-1}}$$

Where:

m = Number of categories or sub-populations
k = Number of parameters to be estimated
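One plausible reading of the proposed statistic is the square root of the summed squared discrepancies divided by m-k-1. The sketch below follows that reading. The proportions and predicted probabilities are assumed values for illustration, `standard_error_of_estimate` is a hypothetical helper name, and the number it prints depends on these assumptions and need not match the figure reported in the illustrative example.

```python
import math

def standard_error_of_estimate(proportions, probabilities, n_params):
    """sqrt(sum of squared discrepancies / (m - k - 1)), under the assumed reading."""
    m = len(proportions)
    df = m - n_params - 1
    if df <= 0:
        raise ValueError("m - k - 1 must be positive")
    sse = sum((q - p) ** 2 for p, q in zip(proportions, probabilities))
    return math.sqrt(sse / df)

# ASSUMED observed proportions of success in the four sex-by-ECG groups
p_obs = [4 / 15, 8 / 18, 9 / 18, 21 / 27]
# ASSUMED model-predicted probabilities for the same groups
p_hat = [0.2360, 0.4700, 0.5256, 0.7608]

se = standard_error_of_estimate(p_obs, p_hat, n_params=2)
print(round(se, 4))
```

With m = 4 groups and k = 2 parameters, the divisor is a single degree of freedom, so the statistic here is just the root of the summed squared discrepancies.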

ILLUSTRATIVE EXAMPLE

It is required to assess the goodness-of-fit associated with the Newton-Raphson method applied in the logistic regression modeling of the data shown in Table 1.

PROPOSED METHOD

The following results were obtained:

With m = 4 and k = 2, $S_e$ = 0.2928, with a p value (Pr > $\chi^2$) of 0.6908. Tested against chi-square ($\chi^2$) with 1 degree of freedom at the 0.05 level of significance, the null hypothesis of a good fit is not rejected. The Deviance = 0.2141 with a p value of 0.6436. The Pearson chi-square ($\chi^2$) statistic = 0.2151 with a p value of 0.6425.
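The p values quoted for the Deviance and the Pearson chi-square can be checked against the chi-square distribution with 1 degree of freedom. For that special case the upper-tail probability has a closed form in terms of the complementary error function, which the sketch below uses. The helper name `chi2_sf_1df` is hypothetical, and the identity holds only for 1 degree of freedom.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability P(X > x) for chi-square with 1 degree of freedom.

    A chi-square variable with 1 df is the square of a standard normal, so
    P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    if x < 0:
        raise ValueError("statistic must be non-negative")
    return math.erfc(math.sqrt(x / 2.0))

# Statistics quoted in the text
p_deviance = chi2_sf_1df(0.2141)
p_pearson = chi2_sf_1df(0.2151)

print(round(p_deviance, 4), round(p_pearson, 4))
# Both exceed 0.05, so a good fit is not rejected at that level
print(p_deviance > 0.05 and p_pearson > 0.05)
```

For other degrees of freedom a general chi-square routine from a statistics library would be needed instead.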

DISCUSSION

One of the goodness-of-fit tests in general linear models is the standard error of estimate, which assesses the overall goodness-of-fit of a model. It is a measure of the average amount by which the actual observations vary around the regression line. When the assumption of normality applies to a set of observed values, this measure can be applied reliably for model evaluation. The P-P plot, a plot of the expected cumulative probability against the observed cumulative probability, shows that the observed proportions are approximately normally distributed.

Several advantages are associated with the use of the standard error of estimate. Like the Pearson chi-square, the standard error of estimate is distributed as chi-square, so the measure can be tested for significance at m-k-1 chi-square degrees of freedom.

The test can be supported by the coefficient of determination ($R^2$), which measures the explanatory power of the model. From the illustrative example, the standard error of estimate is given as:

$S_e$ = 0.2928, with a p value of 0.6908.

This compares favourably with the Deviance, which has a value of 0.2141 and a p value of 0.6436, and with the Pearson chi-square statistic, which is 0.2151 with a p value of 0.6425. Tested against the chi-square distribution with m-k-1 = 4-2-1 = 1 degree of freedom at the 0.05 level of significance, the null hypothesis of a good fit is not rejected. The Deviance, Pearson chi-square values and parameter estimates were obtained using the SPSS software.

CONCLUSION

The proposed alternative goodness-of-fit test observes the assumptions of generalized linear models in general. It therefore extends beyond the scope of logistic regression models. Its computational ease renders it more user-friendly than the Deviance or the Pearson chi-square. It is an attempt to unify the general linear model and the generalized linear model with respect to goodness-of-fit testing.

How to cite this article:

C.A. Udomboso, A.U. Chukwu, E.I. Enang and M.E. Nja. Alternative Goodness-of-Fit Test in Logistic Regression Models.
DOI: https://doi.org/10.36478/jmmstat.2011.43.46
URL: https://www.makhillpublications.co/view-article/1994-5388/jmmstat.2011.43.46