Journal of Modern Mathematics and Statistics

Abstract

The logistic regression model describes the relationship between a dependent variable and one or more independent variables. It provides a method for modeling a binary response variable which takes the values 1 and 0. In this study, a brief review of the underlying theory is presented, and a logistic regression model to estimate the graduating Cumulative Grade Point Average (CGPA) of graduates was fitted and tested. Data were collected from the Faculty of Science, University of Lagos. The study reveals that, among all the variables considered, the final year Grade Point Average (GPA) of the graduands has a significant effect.

INTRODUCTION

The logistic regression model, introduced in the late 1960s and early 1970s, became routinely available in statistical packages in the early 1980s. It has also found many applications in fields like the social sciences (Chuang, 1997) and in educational research, especially in higher education (Austin et al., 1992). Logistic regression analysis extends the techniques of multiple regression analysis to research situations in which the outcome variable is categorical. There is a binary response of interest and the predictor variables are used to model the probability of that response.

Situations involving categorical outcomes are quite common in practice. In educational programmes, predictions are made for the binary outcome of success/failure. In the same vein, operating units in industry can be classified as successful or not successful according to some objective criteria.

The characteristics of the units can be measured, and logistic regression analysis can be used to determine which characteristics best predict success. Similarly, in medicine, the outcome might be the presence or absence of a particular disease. This research gives a brief review of the underlying theory of logistic regression and applies it to the graduating Cumulative Grade Point Average (CGPA) of the 2007/2008 graduates of the Faculty of Science, University of Lagos. It derives motivation from the study by Peng et al. (2002). Further support comes from Karp (2007), who argued that logistic regression is an increasingly popular analytic tool used to predict the probability that the event of interest will occur, with the log-odds modeled as a linear function of one (or more) continuous and/or dichotomous independent variables. Logistic regression models have been applied in a number of contexts. Examples include adjusting for bias when comparing two groups in observational studies (Rosenbaum and Rubin, 1983). Efron (1975) compared logistic regression to discriminant analysis (which assumes the explanatory variables are multivariate normal at each level of the response variable). Logistic regression has also been applied in a study investigating the risk factors for low birth weight babies (Hosmer and Lemeshow, 1989). Other applications include using logistic regression analysis to determine the factors that affect green card usage for health services (Senol and Ulutagay, 2006). Logistic regression has also been extended to cases where the dependent variable has more than two categories, known as multinomial or polytomous logistic regression; Tabachnick and Fidell (1996) use the term polychotomous.

University of Lagos, Nigeria: The University of Lagos was established in 1962. It is made up of two campuses: the main campus located in Akoka, Yaba and the College of Medicine in Idi-Araba, Surulere, Nigeria. The institution started with 131 students but today it has an annual intake of 39,000 students.

In addition, it has a total staff strength of 3,365: administrative and technical staff (1,386), junior staff (1,164) and academic staff (813). The university has nine faculties and a college of medicine. The faculties are: Arts, Business Administration, Education, Engineering, Environmental Sciences, Law, Pharmacy, Sciences and Social Sciences. These faculties offer a total of 117 programmes. The university also offers Master's and Doctorate degrees in most of its programmes. The distance learning institute of the university offers courses in Accounting, Business Administration, Science Education and Library/Information Sciences.

The vision of the university is to be a top-class institution for the pursuit of excellence in knowledge through learning and research as well as in character and service to humanity. Its mission is to provide a conducive teaching, learning, research and development environment where staff and students can interact and compete effectively with their counterparts both nationally and internationally in terms of intellectual competence and zeal to add value to the world.

In the spirit of this vision and mission the university recently rewarded 19 of its researchers for their outstanding excellence in the 2005 Research Conference and Fair.

Logistic regression model and general theory: Logistic regression analysis belongs to the category of statistical models known as generalized linear models. It consists of fitting a logistic regression model to an observed proportion in order to measure the relationship between the response variable and a set of explanatory variables (Lavange et al., 1986).

Let X denote the vector of predictors {x₁, x₂, ..., x_k} and let the conditional probability that the outcome is present be denoted by:

$$\pi(x) = P(Y = 1 \mid X = x) \tag{1}$$

The logistic regression model (Harrell, 2001) is given by:

$$\pi(x) = \frac{e^{x\beta}}{1 + e^{x\beta}} \tag{2}$$

π (x)	=	The success probability at value x
xβ	=	Stands for β₀+ β₁x₁+ β₂x₂+ ....+ β_kx_k
e	=	The base of the system of natural logarithms

It can be transformed to give a new interpretation. Specifically, we define the odds as the following ratio:

$$\text{odds} = \frac{\pi(x)}{1 - \pi(x)} \tag{3}$$

The logistic regression model has a linear form for the logit of this probability:

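$$\operatorname{logit}[\pi(x)] = \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \tag{4}$$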

Thus:

Equation 4 is in the same form as the multiple linear regression equation. The inverse transformation of Eq. 4 is the logistic function of the form:

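$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}} \tag{5}$$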

With Eq. 5, one predicts the probability of the occurrence of the outcome of interest. According to Eq. 4, the relationship between logit(Y) and X is linear. Yet, according to Eq. 5, the relationship between the probability of Y and X is non-linear. Thus, the natural log transformation of the odds in Eq. 4 is necessary to make the relationship between a categorical outcome variable and its predictor(s) linear.
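The two directions of this transformation can be illustrated with a short Python sketch. The function names `logit` and `inverse_logit` are illustrative choices, and the coefficients `b0` and `b1` are hypothetical values that do not come from the study.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Probability that corresponds to a linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients, for illustration only.
b0, b1 = -2.0, 0.5

for x in (0.0, 2.0, 4.0, 6.0):
    eta = b0 + b1 * x        # the logit changes linearly with x
    p = inverse_logit(eta)   # the probability follows an S-shaped curve
    print(f"x={x:.1f}  logit={eta:+.2f}  probability={p:.3f}")
```

Equal steps in x move the logit by equal amounts, while the corresponding change in probability depends on where on the curve the step is taken.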

The value of the coefficient β determines the direction of the relationship between X and the logit of Y. When β is >0, larger (or smaller) X values are associated with larger (or smaller) logits of Y. Conversely, if β is <0, larger (or smaller) X values are associated with smaller (or larger) logits of Y.

Fitting the logistic regression model to data: The unknown parameters β_j in the logistic regression model are estimated by the method of maximum likelihood. The likelihood equations for the coefficients β_j have no closed-form solution, so the estimates and their standard errors are obtained by iterative numerical methods.

These values, in turn are used to evaluate the fit of one or more models. The statistical significance for individual logistic regression coefficients is evaluated using the Wald test:

$$z = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \tag{6}$$

The Wald test statistic has an approximately standard normal distribution when β = 0. For the logistic regression model, the hypothesis H₀: β = 0 states that the probability of success is independent of X.

The usefulness of the model as a whole (Dayton, 1992) can be assessed by testing the hypothesis that all of the partial logistic regression coefficients are simultaneously 0, that is H₀: β₁ = β₂ = ... = β_k = 0.

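Goodness of fit: Goodness of fit shows how effectively the fitted model describes the outcome variable. Independent variables are selected from the available list according to how important they are in describing the dependent variable. A log-likelihood is calculated for a candidate model by summing, over each case i, the log of the predicted probability associated with the actual outcome:

$$LL = \sum_{i=1}^{n} \left[ y_i \ln(\hat{\pi}_i) + (1 - y_i) \ln(1 - \hat{\pi}_i) \right] \tag{7}$$

where y_i is the observed outcome (1 or 0) and π̂_i is the predicted probability of success for case i.

The comparison of observed to predicted values using the likelihood function is based on the statistic D, known as the deviance. The resulting deviance is:

$$D = -2\,LL = -2 \sum_{i=1}^{n} \left[ y_i \ln(\hat{\pi}_i) + (1 - y_i) \ln(1 - \hat{\pi}_i) \right] \tag{8}$$

The value of D is compared with and without the independent variable in the equation, as given below, which aids in assessing the significance of an independent variable:

$$G = D(\text{model without the variable}) - D(\text{model with the variable}) \tag{9}$$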
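The log-likelihood, the deviance and the with/without comparison described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are illustrative, and the outcomes and fitted probabilities are hypothetical values, not the study data or the output of an actual maximum likelihood fit.

```python
import math

def log_likelihood(y, p):
    """Sum over cases of y*ln(p) + (1 - y)*ln(1 - p); each p must lie strictly in (0, 1)."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def deviance(y, p):
    """Deviance D = -2 * log-likelihood for binary outcomes."""
    return -2.0 * log_likelihood(y, p)

def lr_statistic(y, p_without, p_with):
    """G = D(model without the variable) - D(model with the variable)."""
    return deviance(y, p_without) - deviance(y, p_with)

# Hypothetical outcomes and fitted probabilities, for illustration only.
y = [1, 1, 0, 1, 0, 1]
p_null = [sum(y) / len(y)] * len(y)        # intercept-only model: overall proportion
p_model = [0.9, 0.8, 0.3, 0.7, 0.2, 0.6]   # assumed fitted values from a larger model

G = lr_statistic(y, p_null, p_model)
print(f"D(null)={deviance(y, p_null):.3f}  D(model)={deviance(y, p_model):.3f}  G={G:.3f}")
```

A larger G means that adding the variable reduces the deviance more; G is usually referred to a Chi-square distribution with degrees of freedom equal to the number of added parameters.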

This goodness of fit process evaluates predictors that are eliminated from the full model. In general, as predictors are added/deleted, the deviance (-2 log-likelihood) decreases/increases. Logistic regression in SPSS reports three R²-like measures (Cox and Snell's, Nagelkerke's and McFadden's) together with the Hosmer and Lemeshow Chi-square test of goodness of fit.

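The Cox and Snell measure is based on the log-likelihood. Equation 10 provides the method of calculation for Cox and Snell's R²:

$$R^2_{CS} = 1 - \left( \frac{L_0}{L_M} \right)^{2/n} \tag{10}$$

where L₀ is the likelihood of the intercept-only model, L_M is the likelihood of the fitted model and n is the sample size.

However, Cox and Snell's R² cannot achieve a maximum value of 1. Nagelkerke's R², a modification of the Cox and Snell measure, ensures that a value of 1 can be achieved. In order to obtain a measure that ranges from 0-1, Nagelkerke's R² divides Cox and Snell's R² by its maximum. Equation 11 provides the measure for Nagelkerke's R²:

$$R^2_N = \frac{R^2_{CS}}{R^2_{\max}} \tag{11}$$

Where:

$$R^2_{\max} = 1 - L_0^{2/n} \tag{12}$$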

The McFadden's R² is a less common pseudo-R² variant, based on log-likelihood kernels for the full versus the intercept-only models (McFadden, 1974).

The Hosmer and Lemeshow Chi-square test evaluates goodness of fit by creating 10 groups of subjects, ordered by predicted probability. It then compares the number of events actually observed in each group to the number predicted by the logistic regression model. A good model fit is indicated by a non-significant Chi-square value.
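The grouping and comparison just described can be sketched in Python. This is a simplified illustration, assuming equal-size groups formed by sorting on the predicted probability and the usual statistic Σ (O_g − E_g)² / [n_g π̄_g (1 − π̄_g)] with (groups − 2) degrees of freedom; the function name is illustrative, the data are hypothetical, and the handling of ties and group boundaries in SPSS may differ.

```python
def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, degrees of freedom) for binary outcomes y and predicted probabilities p."""
    order = sorted(range(len(y)), key=lambda i: p[i])   # cases ordered by predicted probability
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue                                    # skip empty groups in very small samples
        n_g = len(idx)
        observed = sum(y[i] for i in idx)               # events actually observed in the group
        expected = sum(p[i] for i in idx)               # events predicted by the model
        mean_p = expected / n_g
        if 0.0 < mean_p < 1.0:
            stat += (observed - expected) ** 2 / (n_g * mean_p * (1.0 - mean_p))
    return stat, groups - 2

# Hypothetical toy data with two groups, for illustration only.
p = [0.25] * 4 + [0.75] * 4
y_calibrated = [0, 0, 0, 1, 1, 1, 1, 0]   # observed counts match the predicted counts
y_extreme = [0, 0, 0, 0, 1, 1, 1, 1]      # observed counts differ from the predicted counts

print(hosmer_lemeshow(y_calibrated, p, groups=2)[0])
print(hosmer_lemeshow(y_extreme, p, groups=2)[0])
```

The statistic is zero when observed and predicted counts agree in every group and grows as they diverge, which is why a non-significant value is read as adequate fit.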

Application of logistic regression model: Logistic regression analysis was applied to the CGPA of the 2007/2008 graduates of the Faculty of Science, University of Lagos (Table 1). The data set was obtained from the result record office of the Faculty and includes other pieces of information (e.g., age, gender and UME score) concerning the students who entered the Faculty of Science in the 2003/2004 session and were scheduled to graduate in the 2007/2008 academic year. The data were first entered in Microsoft Excel and then loaded into SPSS for analysis.

The characteristics of the data set are as follows: the dependent binary variable Y takes the value 1 for students who graduated with CGPA >2.4 and the value 0 for students who graduated with CGPA <2.4.

The explanatory variables, used to predict whether or not an individual student graduated with CGPA above or below 2.4 were the students’ final year GPA, UME score, age, gender (0 = female student, 1 = male student). About 75.87% of the students (261 students) had CGPA >2.4 while 24.13% (83 students) <2.4 as shown in Table 2.

Table 1:	Logistic regression analysis of faculty of science students’ CGPA

Table 2:	Sample data for gender and graduation of the students with CGPA below or above 2.4

The gender predictor was coded as 0 = male and 1 = female with 57.27% (n = 197) males and 42.73% (n = 147) females. A male's odds of graduating with CGPA <2.4 were assessed relative to a female's odds. The result is an odds ratio of 1.10, which suggests that the odds of males graduating with CGPA <2.4 are 1.10 times those of females. The odds ratio is derived from two odds:

Its natural logarithm, that is log_e(1.10), is a logit equal to approximately 0.095. Using GPA as the predictor, the logistic equation for the log-odds of CGPA >2.4 is (the SPSS logistic regression is provided in Table 2).

The equation is exponentiated to estimate odds:

The probability is obtained by:

In the data, GPA in the final year ranged from 1.49-4.95. Thus, for the lowest GPA recorded, 1.49, the log-odds, odds and probability of CGPA >2.4 are -0.31366, 0.731 and 0.422, respectively. At the other extreme, for a GPA of 4.95, the log-odds of CGPA >2.4 are 3.0287, the odds are 20.6703 and the probability is 0.954. The relation between GPA and CGPA is shown in Fig. 1. Figure 1 was constructed by systematically varying GPA from 1.00-5.00 (shown on the abscissa) and calculating the estimated probability of CGPA >2.4 (shown on the ordinate).
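The reported log-odds, odds and probabilities can be reproduced with a short Python sketch. The slope 0.966 is the coefficient reported in this study for final-year GPA. The intercept of -1.753 is an assumption: it is not stated in the text and is back-calculated here from the two reported log-odds (-0.31366 at GPA 1.49 and 3.0287 at GPA 4.95). The function names are illustrative.

```python
import math

B0 = -1.753   # assumed intercept, back-calculated from the reported log-odds (not stated in the text)
B1 = 0.966    # coefficient reported for final-year GPA

def log_odds(gpa):
    """Estimated log-odds of graduating with CGPA > 2.4."""
    return B0 + B1 * gpa

def odds(gpa):
    """Estimated odds, obtained by exponentiating the log-odds."""
    return math.exp(log_odds(gpa))

def probability(gpa):
    """Estimated probability, obtained as odds / (1 + odds)."""
    o = odds(gpa)
    return o / (1.0 + o)

for gpa in (1.49, 4.95):
    print(f"GPA={gpa:.2f}  log-odds={log_odds(gpa):.5f}  "
          f"odds={odds(gpa):.4f}  probability={probability(gpa):.3f}")
```

Evaluating `probability` over a grid of GPA values from 1.00 to 5.00 traces an S-shaped curve of the kind described for Fig. 1.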

Table 3:	Log-odds and probabilities for various combinations of the predictor


Fig. 1:	Relation between GPA and CGPA

It is evident from the graph that students with a CGPA over 3.5 have estimated odds of 0.5 while π is estimated to be equal to 0.3 (Table 3). The estimated logistic regression coefficient for GPA in the final year is 0.966 and the exponential of this value is e^0.966 = 2.63. This indicates that for a one-unit increase in GPA in the final year, the odds in favor of CGPA above 2.4 are estimated to increase by a multiplicative factor of 2.63.

The reported standard error for b is 0.146 and statistical significance is assessed by the Wald Chi-squared statistic, (0.966/0.146)² = 43.78, which with 1 degree of freedom is significant at conventional levels (the empirical two-tailed p-value is 0.0000 in Table 2). Thus, this study supports the conclusion that the GPA in the final year is a useful predictor of student performance upon graduation.
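The arithmetic behind these figures can be checked with a few lines of Python. The sketch uses only the reported coefficient and standard error; the variable names are illustrative, and the p-value is computed from the standard normal distribution, which for a single coefficient is equivalent to referring the squared statistic to a Chi-square distribution with 1 degree of freedom.

```python
import math

b = 0.966    # reported coefficient for final-year GPA
se = 0.146   # reported standard error

z = b / se                                     # Wald z statistic
wald_chi2 = z ** 2                             # Wald Chi-squared statistic, 1 degree of freedom
p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-tailed p-value under the standard normal
odds_ratio = math.exp(b)                       # multiplicative change in odds per unit of GPA

print(f"z={z:.3f}  chi2={wald_chi2:.2f}  p={p_value:.2e}  odds ratio={odds_ratio:.2f}")
```

The p-value is far below 0.0001, which is consistent with the value of 0.0000 that SPSS prints to four decimal places.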

As shown in Table 4, a classification table is constructed by predicting CGPA >2.4 or <2.4 of each student based on whether or not the odds for CGPA >2.4 are greater or less than 1.0 and comparing these predictions to the actual outcome for each student.

The percentages of correct decisions are 93.1 for students who graduated with CGPA >2.4, 28.9 for students who graduated with CGPA <2.4 and 77.6 overall. This overall result is compared with a rate of 75.9% that would be obtained by simply predicting CGPA >2.4 as the outcome for every student (i.e., since 261 of the 344 graduated with CGPA >2.4, this is the prediction with the greater likelihood of being correct). A four-predictor logistic model was fitted to the data. The result showed that the logistic regression equation for log-odds is estimated to be:

Table 4:	The observed and predicted frequencies for graduating grades by logistic regression

To estimate the odds, the equation is exponentiated as:

The probability of success is obtained by applying the logistic transformation:

The Wald Chi-squared statistics are non-significant for UME score, gender and age (that is p-values of 0.166, 0.307 and 0.176, respectively) while the chi-squared value for GPA in final year is statistically significant at the 0.05 level (that is p-value of 0.000). Thus, given that the other predictors remain in the model, removing the GPA in final year as a predictor would result in significantly poorer predictive efficiency, although removing any of the other predictors does not have a significant impact (Table 3).

The Hosmer-Lemeshow test yielded a χ² statistic of 6.569 with 8 degrees of freedom, which was not significant (p-value = 0.584), indicating that the model fit is good. The Cox and Snell and Nagelkerke R² are two statistics that measure the usefulness of the model. Their values for the data are 0.149 and 0.223, respectively. In addition to these measures, the output includes a classification table (Table 4) that documents the validity of the predicted probabilities.
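As a rough consistency check, the reported fit statistics can be recomputed in Python from figures given in the text. This is a sketch under stated assumptions: the helper `chi2_sf_even_df` is an illustrative name that uses the closed-form upper tail of the Chi-square distribution for even degrees of freedom, and the counts of correctly classified students are inferred by rounding the reported percentages, since the body of Table 4 is not available here.

```python
import math

n_high, n_low = 261, 83   # students with CGPA > 2.4 and CGPA < 2.4, as reported
n = n_high + n_low

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a Chi-square variable; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

# Hosmer-Lemeshow p-value for the reported statistic 6.569 with 8 degrees of freedom.
hl_p = chi2_sf_even_df(6.569, 8)

# Nagelkerke R-squared: Cox and Snell R-squared divided by its maximum, 1 - L0^(2/n),
# where the intercept-only log-likelihood follows from the outcome counts alone.
ll_null = n_high * math.log(n_high / n) + n_low * math.log(n_low / n)
r2_max = 1.0 - math.exp(2.0 * ll_null / n)
nagelkerke = 0.149 / r2_max

# Classification rates: counts inferred from the reported percentages (an assumption).
correct_high = round(0.931 * n_high)
correct_low = round(0.289 * n_low)
overall = (correct_high + correct_low) / n
baseline = n_high / n   # accuracy of always predicting the more frequent outcome

print(f"HL p={hl_p:.3f}  Nagelkerke={nagelkerke:.3f}  "
      f"overall={overall:.3f}  baseline={baseline:.3f}")
```

Under these assumptions the recomputed values agree with the reported p-value of 0.584, Nagelkerke R² of 0.223 and overall classification rate of 77.6%.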

CONCLUSION

Logistic regression provides a useful means for modeling the dependence of a binary response variable on one or more explanatory variables, where the latter can be either categorical or continuous. The model suggests this conclusion: given that the other predictors remain in the model, removing the GPA in the final year as a predictor would result in significantly poorer predictive efficiency, whereas removing any of the other predictors does not have a significant impact. The only factor that contributed significantly to the model is the final year GPA.

How to cite this article:

O. Okafor Ray, O. Abass and E. Ahani. Application of Logistic Regression Model to Graduating CGPA of University Graduates.
DOI: https://doi.org/10.36478/jmmstat.2010.58.62
URL: https://www.makhillpublications.co/view-article/1994-5388/jmmstat.2010.58.62
