In this study, we consider estimation of response bias based on information from both the initial sample and the subsample in double sampling with unequal inclusion probabilities, in contrast to estimation of response bias based on information from the subsample only. The condition under which either of the two estimators can be preferred to the other is determined. The relative significance of the contribution of the correlated factors to the total response variance, compared to the contribution of the other sources of response variation, noted earlier by Hansen under a different sample design, is confirmed. The effect of replicating the survey K>1 times is also discussed.
INTRODUCTION
The term survey is used in this study to refer both to an investigation, whether complete or incomplete, which is concerned mainly with the counting of units (of analysis) under study, technically known as a census (Warwick and Lininger, 1975; Raj, 1968), and to a complete or incomplete investigation which is concerned with the collection of detailed information from each unit under study, technically known as a survey (Warwick and Lininger, 1975). We postulate that the objective of a statistician in designing a survey is to minimize the total survey error, the components of which can be classified broadly, by error type, as sampling errors and nonsampling errors.
One source of survey errors comprises all the activities that constitute survey planning, or determination of the fixed conditions under which the survey can be conducted (though these are often ignored as a source of error). These conditions include population coverage and the survey protocol (Udofia, 2006). The planning operations take place in the office. The other source of survey errors is the process of measurement of the study variable on the units of analysis. This process consists of the data collection operations (or field work) and the data processing operations, which are another phase of office work (Hansen et al., 1961).
Sampling error arises from failure of the selected (random) sample to cover the entire population. It is, therefore, associated with certain aspects of the survey planning, namely the choice of sample design and the determination of sample size. This error is a variable, the magnitude of which is inversely related to the sample size. This relationship enables the investigator to reduce the sampling error by increasing the sample size. The close relationship between this class of survey errors and the sample size gives rise to the term sampling error. Nonsampling errors are associated with the other aspects of planning (such as the selection and training of the survey staff and the survey time schedule), with the measurement operations and with the processing operations.
Nonsampling errors, the reduction of which has received increasing attention in recent times, are distortions of the true value of the study variable. They consist of biases, or systematic distortions, and random errors, both of which are commonly referred to as response errors.
Random errors appear in equal magnitudes in both positive and negative directions and hence cancel out in summation. Biases, on the other hand, are systematic distortions which pile up in the same direction and therefore do not cancel out in summation. As indicated above, these errors are not functions of the sample size. They remain the same in both complete and incomplete surveys. As noted by Kish (1965) and by Udofia (2008), they can be reduced only by improving the related survey operations in the planning, measurement and data processing stages.
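As a minimal numerical sketch of the inverse relationship described above, the snippet below uses the textbook standard error of a sample mean under simple random sampling with replacement, sigma / sqrt(n). The formula, the function name and the value of sigma are illustrative assumptions and are not taken from the design studied here.

```python
import math


def standard_error_of_mean(sigma: float, n: int) -> float:
    """Standard error of a sample mean under simple random sampling
    with replacement (textbook formula, used here only as an illustration)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return sigma / math.sqrt(n)


sigma = 10.0  # hypothetical population standard deviation

# Quadrupling the sample size halves the standard error.
for n in (25, 100, 400):
    print(n, standard_error_of_mean(sigma, n))
```

Under these assumptions the sampling error can be made as small as desired by enlarging the sample, which is the property that the nonsampling errors discussed next do not share.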
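The contrast between random errors and biases can be illustrated with a small sketch. The error values below are hypothetical and chosen only so that the random errors are symmetric about zero while the bias has the same sign on every response.

```python
# Hypothetical random errors: equal magnitudes in both directions.
random_errors = [-3, -2, -1, 1, 2, 3]

# Hypothetical systematic bias of +2 on every response.
biases = [2] * len(random_errors)

sum_random = sum(random_errors)           # cancels out in summation
sum_bias = sum(biases)                    # piles up in one direction
mean_bias = sum_bias / len(biases)        # unchanged by the number of responses

print(sum_random, sum_bias, mean_bias)
```

Adding more responses with the same bias leaves the mean bias unchanged, which is why an increase in sample size does not reduce it.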
It is known (Kish, 1965; Talukder, 1975) that in many survey situations, response bias is the major component of response error. Raj (1965) notes that sometimes it is so large that its elimination is more important than increasing the sample size. For example, Hansen et al. (1961) conclude that response bias was the important contribution to the errors of the 1950 U.S. Census statistics.
Measurement of response bias is a difficult problem, the solution of which demands more than statistical theory (Udofia, 2008). Hansen et al. (1961) conclude that there is no reasonably satisfactory approach for measurement of response bias. After the pioneering research by Mahalanobis (1946) on measurement of response variation, some useful methods of measurement of response bias have been developed and discussed, notably by Hansen et al. (1951, 1953, 1961), Fellegi (1964), Raj (1968), Koch (1973) and Talukder (1975). Hansen et al. (1951) give an illustration of the use of double sampling with equal probabilities of selection in each phase to measure response bias. It is of interest to see how information from both the initial sample and the second sample can be used together for the same purpose and to determine the condition under which either estimator can be preferred. This study is an attempt in this direction.
Suppose that the population, π1, under study consists of N elements which can be divided into H identifiable and independent groups or subpopulations, which may be geographic, demographic or professional groups. Let,
![]() |
denote the number of potential respondents or units of analysis in the hth group (Gh). There exists also a population, π2 of M potential interviewers which can be divided also into H independent groups with Mh,
![]() |
potential interviewers in the hth group (Qh) such that units in Gh can be interviewed only by units in Qh. As noted by Hansen et al. (1951), the discussion in this study applies also to a situation where the elements of π2 can be divided into L, L ≠ H, subpopulations. An example of the arrangement described above is given in Udofia (2008). The number of potential respondents in Gh that are available to be interviewed by each of the Mh units of Qh is
![]() |
It is assumed that information on the measure of size, Z, is available for each of the population elements, Uj, j = 1, 2, ..., N, before the start of the survey. An initial sample, S(n1h), of size n1h,
![]() |
is drawn from Gh, h = 1, 2, ..., H, with probability
![]() |
where Zhj is the value of Z for unit j, and with replacement. Also, a sample S(m1h) of m1h
![]() |
interviewers is drawn by Simple Random Sampling Without Replacement (SRSWOR) from Qh, h = 1, 2, ..., H and
![]() |
units of S(n1h) are assigned at random to each of the m1h interviewers so that
The study variable is measured on S(n1h) under a set, A, of essential survey conditions (Udofia, 2006) typical of a large survey. Let, Xhijk denote the value of the study variable for unit j in Gh obtained by interviewer i in Qh on the kth, k = 1, 2, ..., K visit under this set of essential survey conditions. This means that a survey can be repeated K, K>1, times (Hansen et al., 1951; Raj, 1968). The repetition can be done in such a way that information received from an individual on one visit does not influence information received from that individual on any subsequent visit.
During the second phase of the survey, a second sample, S(n2h) of size n2h,
![]() |
n2h<n1h, is drawn by SRSWOR from S(n1h). Also, a sample S(m2h) of m2h,
![]() |
interviewers is drawn by SRSWOR from the initial m1h interviewers (or from another population of interviewers that is different from π2) and
![]() |
units of S(n2h) are assigned at random to each of the m2h interviewers. The study variable is measured once on S(n2h) under a set, B, of highly effective essential survey conditions that can produce only the desired true or precise responses (Udofia, 2006). Let, Yhij denote the value of the study variable obtained by interviewer i, i ∈ S(m2h) from unit j, j ∈ S(n2h) under set B of essential survey conditions. These are the desired true values.
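The two-phase selection described above can be sketched for a single group h as follows. This is only an illustration under stated assumptions: the first-phase selection probability is taken to be proportional to the size measure, Zhj divided by the group total of Z (the usual probability-proportional-to-size form, assumed here because the exact expression is not reproduced in the text), and all names, sizes and the seed are hypothetical.

```python
import random


def draw_pps_with_replacement(units, sizes, n1, rng):
    """Draw n1 units with replacement, unit j having probability Z_j / sum(Z).
    The proportional-to-size form is an assumption of this sketch."""
    return rng.choices(units, weights=sizes, k=n1)


def draw_srswor(items, size, rng):
    """Simple random sample without replacement of the given size."""
    return rng.sample(items, size)


def assign_at_random(units, interviewers, rng):
    """Shuffle the units and give each interviewer an equal-sized block."""
    if len(units) % len(interviewers) != 0:
        raise ValueError("sample size must be a multiple of the number of interviewers")
    shuffled = list(units)
    rng.shuffle(shuffled)
    block = len(shuffled) // len(interviewers)
    return {iv: shuffled[pos * block:(pos + 1) * block]
            for pos, iv in enumerate(interviewers)}


rng = random.Random(2009)  # fixed seed so that the sketch is repeatable

# Hypothetical group h: 10 respondents with size measures Z, 5 interviewers.
respondents = [f"u{j}" for j in range(1, 11)]
size_measures = [5, 8, 2, 9, 4, 7, 3, 6, 1, 5]
interviewer_pool = [f"q{i}" for i in range(1, 6)]

# Phase 1: n1h = 6 draws (with replacement), m1h = 3 interviewers.
phase1_units = draw_pps_with_replacement(respondents, size_measures, 6, rng)
phase1_interviewers = draw_srswor(interviewer_pool, 3, rng)
phase1_assignment = assign_at_random(phase1_units, phase1_interviewers, rng)

# Phase 2: n2h = 4 of the phase-1 draws, m2h = 2 of the phase-1 interviewers.
phase2_units = draw_srswor(phase1_units, 4, rng)
phase2_interviewers = draw_srswor(phase1_interviewers, 2, rng)
phase2_assignment = assign_at_random(phase2_units, phase2_interviewers, rng)
```

Because the first phase is with replacement, the same respondent may appear more than once among the phase-1 draws; the sketch treats each draw as a separate entry when the subsample is taken.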
Consider the following linear response model of the type discussed earlier by Hansen et al. (1953), Raj (1968) and Talukder (1975):
![]() |
(1) |
where:
Xhijk and Yhij | = | Retain their earlier definitions |
αhi | = | Bias due to interviewer i, i ∈ Qh |
βhj | = | Bias due to respondent j, j ∈ Gh |
dhijk | = | Random error in the response by unit j received by interviewer i on occasion k |
We assume that
![]() |
(2) |
We assume also that there is no correlation between elements of the set {Yhij, αhi, βhj, dhijk}. Under these assumptions, response bias is defined as:
![]() |
(3) |
or
![]() |
(4) |
since,
![]() |
by (2). Equation (3) clearly presents total bias, B, as the sum of all biases (Kish, 1965).
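A small numerical sketch may help to fix ideas. It assumes that the linear response model takes the additive form X = Y + α + β + d, which is consistent with the symbol definitions given above but is an assumption here. All values are hypothetical, and the plain unweighted mean difference used below is not the estimator of this study, which involves the selection probabilities.

```python
from statistics import mean


def observed_response(y, alpha, beta, d):
    """Assumed additive form of the response model: X = Y + alpha + beta + d."""
    return y + alpha + beta + d


# Hypothetical values for one group h, with K = 2 visits per respondent.
true_value = {"j1": 10.0, "j2": 12.0, "j3": 14.0, "j4": 16.0}    # Y_hij
interviewer_of = {"j1": "i1", "j2": "i1", "j3": "i2", "j4": "i2"}
interviewer_bias = {"i1": 0.5, "i2": 1.5}                         # alpha_hi
respondent_bias = {"j1": 1.0, "j2": -1.0, "j3": 2.0, "j4": 0.0}   # beta_hj
random_error = {1: 0.25, 2: -0.25}                                # d, zero mean over visits

differences = []
for j, y in true_value.items():
    for k, d in random_error.items():
        x = observed_response(y, interviewer_bias[interviewer_of[j]],
                              respondent_bias[j], d)
        differences.append(x - y)

# Unweighted mean of X - Y over all respondents and visits.
mean_difference = mean(differences)

# Sum of the mean interviewer bias and the mean respondent bias.
sum_of_mean_biases = (mean([interviewer_bias[interviewer_of[j]] for j in true_value])
                      + mean(list(respondent_bias.values())))

print(mean_difference, sum_of_mean_biases)
```

In this toy setting the zero-mean random errors drop out of the average, so the mean difference between the observed and the true values equals the sum of the mean biases.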
Under the above assumptions, we consider the following unbiased estimator of B based on information from the subsample S(n2h) only:
![]() |
(5) |
with variance
![]() |
(6) |
This variance can also be expressed in terms of its simple components that reflect the different sources of response variation as:
![]() |
(7) |
Contributions by the correlated factors are represented in Eq. 7 by the terms with as a factor and which, as noted by Hansen et al. (1961), can be large.
METHOD OF ESTIMATION OF RESPONSE BIAS FROM BOTH SAMPLES IN DOUBLE SAMPLING WITH UNEQUAL PROBABILITIES
Under the sample design and with the data described above, we propose the following alternative estimator of response bias, which includes information from both phases of the double sampling design:
![]() |
(8) |
By using theorems on conditional expectations and conditional variances (Raj, 1956, 1968), we obtain the expected value and the variance of the above estimator as follows:
By the conditional expectation equation:
![]() |
(9) |
Where:
![]() |
(10) |
and hence,
![]() |
which is the same as Eq. (4) where,
![]() |
The variance of the estimator in Eq. (8) is given by the conditional variance equation:
![]() |
(11) |
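The conditional variance argument rests on the standard decomposition of a total variance into the variance of the conditional expectation plus the expectation of the conditional variance, written in the notation of this study as V1E2 + E1V2. The sketch below checks that identity exactly on a small hypothetical two-phase experiment; the outcome values are invented and every first-phase outcome is taken to be equally likely.

```python
from fractions import Fraction


def mean(values):
    """Exact mean of equally likely values."""
    return sum(values, Fraction(0)) / len(values)


def variance(values):
    """Exact variance of equally likely values."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


# Hypothetical estimator values: one row per first-phase outcome, and within
# each row the equally likely values produced by the second phase.
second_phase_values = [
    [Fraction(1), Fraction(3)],
    [Fraction(2), Fraction(6)],
    [Fraction(4), Fraction(8)],
]

e2 = [mean(row) for row in second_phase_values]      # conditional expectations
v2 = [variance(row) for row in second_phase_values]  # conditional variances

v1_e2 = variance(e2)  # variance over the first phase of the conditional expectation
e1_v2 = mean(v2)      # expectation over the first phase of the conditional variance

# Unconditional variance: all six outcomes are equally likely because every
# row has the same number of second-phase values.
all_values = [v for row in second_phase_values for v in row]
total_variance = variance(all_values)

print(total_variance, v1_e2 + e1_v2)
```

Exact fractions are used so that the two sides can be compared without rounding error.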
From Eq. (10), we obtain
![]() |
(12) |
Now,
![]() |
(13) |
We rewrite Eq. (13) as:
![]() |
(14) |
Where:
![]() |
(15) |
Where:
![]() |
Also,
![]() |
Where:
![]() |
so that
![]() |
(16) |
Substitution of Eq. (14-16) in Eq. (12) gives the result
![]() |
(17) |
From Eq. (8):
![]() |
Let,
![]() |
so that
![]() |
Then
![]() |
and
![]() |
(18) |
Substitution of Eq. (17) and (18) in Eq. (11) gives the result
![]() |
(19) |
The significant role of the correlated factors which are terms with factors and K-1, respectively is also obvious in this result.
For the purpose of decomposition of the terms in Eq. (13), we have under the model assumptions,
![]() |
(20) |
![]() |
(21) |
![]() |
(22) |
Under the model assumptions (2), ρp(dhijk, dhijw)hi = 0, for all k ≠ w. Also, for any given j, σp2(Y)hi = 0 for all k ≠ w, since Yhij is the same for all k ≠ w under fixed i and j.
![]() |
(23) |
Substitution of Eq. 20-23 in Eq. 13 gives, under the above assumptions, the following result.
![]() |
(24) |
The 2nd term in Eq. 12 retains its definition in Eq. 15. The last term in Eq. 12, when Xhijk is expressed in terms of its components as in Eq. (1), becomes:
![]() |
(25) |
Substitution of Eq. 15, 24 and 25 in Eq. 12 gives the result
![]() |
(26) |
Now V1E2 remains the same as in Eq. 18. Substitution of Eq. 18 and 26 in Eq. 11 gives the following expression for the variance of the estimator in Eq. 8 in terms of simple response variances.
![]() |
(27) |
The terms on the right-hand-side of the above equation, which we refer to as the simple response variances, reflect contributions from the main sources of response variation which are the biases, variation of the true values and the correlated factors.
With regard to unbiased estimation of the variance in Eq. 19, we give the following, similarly to Raj (1968):
![]() |
or
![]() |
or
![]() |
or
![]() |
![]() |
or
![]() |
or
![]() |
or
![]() |
For unbiased estimation of
![]() |
we give the following definitions. The calculation of estimates of the simple response variances in Eq. (27) can be done by considering the following definitions on S(n2h):
![]() |
(28) |
![]() |
(29) |
![]() |
(30) |
![]() |
(31) |
Hence, from Eq. (28) and (29):
![]() |
(32) |
![]() |
and hence
![]() |
(33) |
Where:
![]() |
Let,
![]() |
Then,
![]() |
and hence
![]() |
![]() |
or
![]() |
or
![]() |
(34) |
Where:
![]() |
and
![]() |
Let,
![]() |
Then, since σ2(α)h cannot be estimated unbiasedly, we find an unbiased estimator of
![]() |
by considering
![]() |
From Eq. (32)
![]() |
or
![]() |
RESULTS AND DISCUSSION
We regret that it is difficult to find data from a survey based on the design discussed in this study for illustration of the various computations suggested by the previous results. As observed by Hansen et al. (1961), experiments that can generate the necessary data for such an illustration are often highly expensive. The main aim of this study, therefore, is to provide the mathematical framework within which statistical inferences can be made about the contributions of specified sources of survey error to response variance in double sampling with unequal probabilities. To facilitate application of the method discussed in the study, we suggest that the survey data should be tabulated in a 4-dimensional table with h-k as the column headings.
For the purpose of choice between the estimators in Eq. 5 and Eq. 8, we subtract the variance in Eq. 27 from the variance in Eq. 6 and obtain the result
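One possible way to hold such a 4-dimensional table in software is a mapping keyed by (h, i, j, k). The layout below is only a sketch: the records are hypothetical and the helper name is not part of the method discussed in this study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (group h, interviewer i, respondent j, visit k) -> X_hijk
responses = {
    (1, 1, 1, 1): 11.0, (1, 1, 1, 2): 12.0,
    (1, 1, 2, 1): 9.0,  (1, 1, 2, 2): 10.0,
    (1, 2, 3, 1): 15.0, (1, 2, 3, 2): 17.0,
    (2, 1, 1, 1): 20.0, (2, 1, 1, 2): 22.0,
}


def mean_over_visits(table):
    """Average X_hijk over the K visits for each (h, i, j) combination."""
    by_unit = defaultdict(list)
    for (h, i, j, k), x in table.items():
        by_unit[(h, i, j)].append(x)
    return {key: mean(values) for key, values in by_unit.items()}


unit_means = mean_over_visits(responses)
print(unit_means[(1, 1, 1)], unit_means[(1, 2, 3)])
```

Keeping all four indices in the key makes it straightforward to collapse the table over visits, respondents or interviewers when the different variance components are computed.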
![]() |
This result shows that if, in every interviewer assignment,
![]() |
The implication of this is that if, in every interviewer assignment, the biases are minimal so that Xhij has a high positive correlation with Yhij, i.e., both Xhij and Yhij have the same distribution with σp (X) ≈ σp (Y), the estimator in Eq. (5), which is based on information from the subsample only, should be preferred to the estimator in Eq. (8), which is based on information from both samples. If, as a result of non-stationary distribution of indiscriminate biases in X, σp (X)hi>>σp (Y), the observed value of ρp (Xhij, Yhij) may be
![]() |
within the interviewer assignments. In such a situation, which may be common to most practical social survey situations, the estimator in Eq. (8) is preferable to the estimator in Eq. (5) above. The choice of the estimator in Eq. (8) in this case enables the investigator to include in his estimation model more of the highly variable measurements {Xhijk} than the precise and less variable measurements {Yhij}. This agrees with the precision rule in stratified random sampling with proportional allocation.
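The two quantities that drive this comparison, the within-assignment correlation between X and Y and the ratio of their standard deviations, can be computed from pilot data as sketched below. The pilot values and function names are hypothetical, one X value per respondent is assumed, and no decision threshold is coded because the exact condition is not reproduced in the text.

```python
import math


def _centered_sums(xs, ys):
    """Return the sums of squares and cross-products about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy


def correlation(xs, ys):
    """Pearson correlation between X and Y within one interviewer assignment."""
    sxx, syy, sxy = _centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def sd_ratio(xs, ys):
    """Ratio of the standard deviation of X to that of Y (the divisor cancels)."""
    sxx, syy, _ = _centered_sums(xs, ys)
    return math.sqrt(sxx / syy)


# Hypothetical pilot data: interviewer -> (observed X values, true Y values)
pilot = {
    "i1": ([12.0, 14.0, 16.0, 18.0], [11.0, 12.0, 13.0, 14.0]),  # X tracks Y closely
    "i2": ([10.0, 30.0, 5.0, 35.0], [11.0, 12.0, 13.0, 14.0]),   # X heavily distorted
}

diagnostics = {iv: (correlation(xs, ys), sd_ratio(xs, ys))
               for iv, (xs, ys) in pilot.items()}

for iv, (rho, ratio) in diagnostics.items():
    print(iv, round(rho, 3), round(ratio, 3))
```

In the second hypothetical assignment the distortions in X lower the correlation and inflate the standard deviation ratio, which is the kind of situation described above as favouring the estimator based on both samples.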
CONCLUSION
Equations 19 and 27 show that under the sample design used in this study, the contribution from correlated factors to the total response variance is also substantial, as observed by Hansen et al. (1961) under a different sample design. Repetition of the survey K>1 times is necessary for the calculation of the error terms. However, the magnitude of K affects the contribution from correlated factors positively more than it affects the contribution from each of the other sources of variation. For this reason, K can be limited to 2. Since the ratio of σp (X) to σp (Y) suggests the amount of distortion that makes the distribution of X different from that of Y, which is the desired distribution, the ratio σp (X)/σp (Y) can be calculated from pilot survey data as a basis of choice between the estimator in Eq. 5 and the estimator in Eq. 8 above.
Godwin A. Udofia. Estimation of Response Bias from Both Samples in Double Sampling with Unequal Probabilities.
DOI: https://doi.org/10.36478/jmmstat.2009.29.37
URL: https://www.makhillpublications.co/view-article/1994-5388/jmmstat.2009.29.37