© 1978, 1988, 2006, 2007, 2009 Joseph George Caldwell. All Rights Reserved.
Posted at Internet website http://www.foundationwebsite.org.
Sample survey is concerned with the problem of making inferences about the characteristics of a collection of items -- a "population" -- based on the characteristics of a portion of the population -- a "sample." If the sample is selected from the population in such a way that the probabilities of selection are known, then the theory of statistics may be used to make these inferences, i.e., to estimate population characteristics and to test hypotheses about the population.
There are several reasons for employing proper statistical design techniques in a research study. First, the use of statistical sample survey design techniques not only assures sample estimates having satisfactory levels of precision, but it also enables measurement of what that level of precision is. The usual measure of precision is the standard error of the estimate. Second, the use of sound sample survey design procedures improves the ratio of cost to level of precision, by enabling higher precision for a given level of cost, or lower cost for a desired level of precision. The increase in precision or decrease in cost is effected by allocating the sampling effort in ways that take into account the costs of sampling, the variability of the target population, and special features of the target population (such as the occurrence of natural "clusters" of the population, e.g., cities, schools, and households). Third, the use of sound sample survey procedures assures high validity of the estimates, i.e., the estimates will have low bias. A final advantage in the use of statistical design concepts is in enhancing the usefulness of the survey results by assuring (prior to implementation of the survey) that the level of precision of the sample estimates will be sufficient to permit meaningful interpretation.
(Note: Precision, or reliability refers to the variability of an estimate -- i.e., the extent to which the estimate fluctuates around the average value obtained through repeated sampling. Validity refers to how close this average value (obtained in repeated sampling) is to the "true" (population) value of the parameter being estimated (and also to how well that parameter represents the concept that is desired to be measured). The difference between the average value and the population value is called the bias. Precision is usually measured by the standard error of the estimate. Validity is measured by the bias. Accuracy refers to a combined measure of precision and validity. Accuracy is usually measured by the mean squared error, equal to the square of the standard error plus the square of the bias.)
Construction of an efficient survey design is accomplished by taking into account the characteristics of the population being surveyed, the costs of sampling, and the precision requirements of the study. It is important to keep in mind the study requirements, in order to assure that the precision obtained for key variables is sufficient, but not unnecessarily high. Furthermore, it is desirable to obtain comparable levels of precision for estimates of variables that are of comparable importance.
The development of a sample design is accomplished by taking into account the identification of the basic estimates of interest together with prior information concerning the variability of the population with respect to the variables of interest and the intercorrelations between these variables. Because most projects have multiple inference objectives, the design is usually not "fine-tuned" to provide maximum precision for a single estimate. Rather, the design is generally structured to provide acceptable accuracy for a number of relationships of interest.
There are two basic types of sample surveys – descriptive sample surveys and analytical sample surveys. In a descriptive sample survey, the objective is to obtain estimates of basic descriptive characteristics (such as means or proportions) for the whole population or particular subpopulations of interest. In an analytical sample survey, the primary objective is to be able to estimate the relationship of a particular variable (a "dependent" variable) to other variables (the "independent," or "explanatory," variables): for example, the relationship of client income to program inputs, client characteristics, and regional demographic and economic characteristics.
The distinction between a descriptive and an analytical survey has a profound effect on the nature of the survey design. In a descriptive survey, attention usually centers on the estimation of the state of a finite population at some point in time. The estimation objectives of the survey are simple -- usually just estimation of means and proportions for the population and various subpopulations, with perhaps some estimation of basic relationships, usually through presentation of results by stratum or by crosstabulations. The sample design process usually proceeds without difficulty -- the basic requirement is that the sample size is sufficiently large for all subpopulations of interest.
In an analytical survey, on the other hand, attention centers on the estimation of relationships between (dependent and independent) variables, for a conceptually infinite population. Attention centers on making inferences about the process generating or acting on the population, not about the population itself. The purpose of the survey is to develop a model of the process. Usually, a parametric statistical model such as the general linear model (regression model) is used. The analysis involves the iterative steps of model identification, estimation, tests of model adequacy, and respecification.
The major differences between an analytical survey and a descriptive survey are the following. In a descriptive survey, we want large sample sizes in subpopulations of interest. In an analytical survey, on the other hand, we want variation and balance in the explanatory variables, and orthogonality (low correlation) between variables that are not causally related. In addition, the finite population correction factor is irrelevant in an analytical survey. This fact can have a substantial impact on sample size determination.
A sample survey that has the estimation of relationships as its primary objective differs from the majority of sample surveys, whose objective is simply to compute descriptive statistics (such as percentages) about the populations and subpopulations under study. The relationships between the study variables are often complex, and there are numerous instances in which the independent variables may be correlated to a substantial degree. In some instances, the relationship of the correlated independent variables to a dependent variable may be quite difficult to assess. In order to obtain a good estimate of these relationships, the sample design must be structured to "orthogonalize" these "confounded" independent variables, i.e., to reduce the correlation between them.
We shall now describe a general procedure for determining a sample design. This description is in two parts. First, we discuss general procedures that apply to both descriptive and analytical survey design. That discussion is followed by a discussion of additional considerations that apply to the development of an analytical survey design.
A standardized procedure for developing a sample design is presented below. This general procedure may be followed in developing the sample design for most applications.
The Elements of Survey Design
1. Specify population of interest
2. Specify units of analysis and estimates of interest
3. Specify precision objectives of the survey; resource constraints; political constraints
4. Specify other variables of interest (explanatory variables, stratification variables)
5. Review population characteristics (distributional, cost)
6. Develop instrumentation (development, pretest, pilot test, reliability and validity analysis)
7. Develop sample design
8. Determine sample size and allocation
9. Specify sample selection procedure
10. Specify field procedures
11. Determine data processing procedures
12. Develop data analysis plan
13. Outline final report
The paragraphs that follow describe each of the steps listed above.
1. Specify Population of Interest The population about which it is desired to
make inferences is called the target
population. It is defined by four
quantities -- content, units, extent, and time.
(For example, we may be interested in estimating the income, of
In many surveys, there are several different units of analysis (e.g., hospitals and patients, or schools and students). The survey population is a subset of the target population, that recognizes practical constraints, such as inaccessibility or lack of data, cost, and political or legal constraints.
Prior to development of a good sample design, it is helpful to summarize what information is known about the survey population, such as means and variances of the population and subpopulations of interest; sampling cost information; and the cluster structure of the population.
2. Define Estimates of Interest The variables that are of interest must be identified (e.g., age, race, sex, earnings, status). The design effort should specify subpopulations that are of interest (e.g., women, the unemployed). Surrogate variables must be identified if desired variables cannot be obtained (e.g., earnings vs. income). The method of observation (record scan, personal interview, telephone interview, or mail questionnaire) must be identified, as must the measures of interest (e.g., mean earnings change in the past year).
3. Specify Objectives and Constraints The precision objectives of the survey may be specified in terms of standard errors or in terms of confidence intervals on key parameter estimates, for high-interest populations. Resource constraints, political and legal constraints, and data measurement constraints must be identified.
4. Specify Other Variables of Interest Variables which can be used to stratify (categorize) the population should be identified. Stratification can be used to improve precision, to reduce costs, to assure adequate precision for subpopulations of interest, and to enable measurement of precision. Also, explanatory variables that can be used in crosstabulations, ratio and regression estimates, and nonresponse analysis should be specified.
5. Review Population Characteristics Whatever information is available on distributional breakdowns (frequency distributions, crosstabulations) should be reviewed. This information can suggest what stratifications might be of value in the design stage. Information on within-stratum variances and intracluster correlation coefficients can also assist the design effort, as can information on sampling costs.
6. Develop Instrumentation The major steps in developing the
survey instrumentation are:
o Determination of the type of instrumentation
o Development of the instruments
o Pretest of the instruments
o Pilot test of the instruments
In developing the survey instruments, a number of factors
should be kept in mind. These factors are as follows:
o Question content
o Question order
o Question wording
o Question structure/format (open or closed, number of categories)
o Questionnaire length
o Questionnaire layout
o Questionnaire instructions
o Opportunities for interviewer comment on validity of response
7. Develop Sample Design The process of determining the specific sample design is a creative one. The variety of choice is essentially limitless, and the number of design objectives are usually multiple and conflicting. The recommended procedure is to synthesize several design alternatives which emphasize different design objectives, to characterize the alternatives in terms of precision, cost, and operational problems, and to achieve a consensus on the best overall design. The design alternatives include various combinations of stratification, clustering, multistage sampling, and double sampling.
8. Determine Sample Size and Allocation This step involves specification of items such as the number of first-stage and second-stage sample units, the number of sample units per stratum, and the proportion of the panel to be replaced in panel sampling.
9. Specify Sample Selection Procedures It must be decided how the sample units are to be selected -- with or without replacement, probability-proportional-to-size sampling, systematic sampling, Rao-Hartley-Cochran sampling, or controlled selection.
10. Specify Field Procedures "Field procedures" include the number and spacing of questionnaire waves, nature of the initial contact, recall procedures, field edit procedures, and transmittal to central headquarters.
11. Specify Data Processing Procedures "Data processing procedures" include procedures for logging, manual edit, coding, keying, machine edit, data base design, treatment of non-response and missing values, and data base documentation.
12. Develop Data Analysis Plan The data analysis plan should include consideration of a preliminary descriptive analysis of the data, nonresponse counts, and treatment of nonresponse. The directed analysis of the data includes computation of estimates of interest, crosstabs, and tests of hypotheses. In an analytical survey, the iterative process of model building must be specified.
13. Report Preparation The report of the analysis should include estimates, tables, charts, commentary, and interpretation. Also, the report may contain tables of generalized variances, a characterization of nonresponse, and discussion of possible sources of bias.
1. Careful development of the survey instruments, including consideration of item and instrument reliability and validity (with special attention to layout or forms design, and to question content, wording, and order); compatibility of answer categories with traditional government response categories; inclusion of only those data elements required to satisfy the survey objectives; and capitalization on experience with similar questions / instruments in earlier surveys.
2. Careful development of the survey sample design, with full consideration of objectives, sampling costs, and the nature of the target population in developing an efficient sample
3. Meticulous attention to the Instructions for Requesting OMB Approval under the Federal Reports Act," as revised. Figure 1 contains a list of the items required in the approval application.
4. Consideration of the terms of relevant legislation, such as the Privacy Act of 1974, the Freedom of Information Act, the Department of Commerce's "Directives for the Conduct of Federal Statistical Activities," and agency regulations in developing the survey instruments and procedures.
5. Close coordination with agency staff, OMB liaison staff, agency counsel, OMB clearance officials, and other government officials before submission of the clearance request form.
6. Consultation with appropriate interested parties. A key item in most surveys is item 5(c), "Consultation with State Government Officials." This consultation is required. Failure to consider the interests and concerns of interested organizations (advocacy groups, professional associations, interstate commissions) can cause serious delays – even total failure -- in large-scale survey efforts.
By following the above guidelines, we
anticipate no problems in obtaining OMB approval of the survey instruments and design. In essence, our approach consists in
capitalizing on previous experience in this area. A key resource in the preparation of the
design-related portions of the qualification statement is the fact that one of
the principals of
Survey Design Report
A final survey design report generally includes revised pretested instrument, a recommended sample design (including a description of the sample selection procedure and the probabilities of selection for the sample units), and an OMB clearance request package. The instruments and OMB package are included as appendices.
As mentioned earlier, the purpose of an analytical study is essentially to develop a model that will describe the relationship of one variable (the dependent variable) to a number of other (independent or explanatory) variables. In order to be successful in this objective, we need to include in the model all variables that are expected to have a substantial impact on the dependent variable. The list of potential variables that could reasonably be included in the model may be long.
In order to be able to develop a model that adequately describes the relationship of a dependent variable to various independent variables, it is necessary that the independent variables exhibit two basic properties:
1. There is good "balance" -- that is, adequate variation -- in the independent variables (i.e., there are comparable numbers of "high" and "low" values of each independent variable); and
2. There is a high degree of orthogonality between independent variables that are not causally related.
The problem that arises in the development of an analytical survey design (i.e., a survey design to support development of an analytical model) is that there are usually several independent variables of interest, each represented by several levels, and the total combination of levels of variables (or "cells") is usually far larger than the allowed sample size. The standard sampling approach of multiple stratification hence cannot be applied (since the number of stratum "cells" will exceed the sample size).
An approach that has often been used in the past in this situation is to select a "judgment" sample of observations. This has a serious drawback, however, in that the theory of statistics can no longer be applied to make inferences from the sample data. Moreover, judgment sampling may easily be avoided, since there is a statistical procedure for sample selection that possesses all of the flexibility of judgment sampling, but has none of the associated drawbacks. This procedure is called controlled selection.
A final note concerning the sample selection concerns ways in which prior information on dependent variables or surrogate dependent variables (i.e., manpower levels) can be used in the sample selection process. In "descriptive" surveys, this kind of information is often used as a basis for stratification. In "analytical" surveys, this kind of information is usually not used -- the design generally centers on the independent variables, not on the dependent variables.
For analytical surveys, we generally utilize the statistical sampling procedure of controlled selection as the basis for the sample design. This procedure was developed for the situation in which multiple stratification results in a number of cells that exceeds the allowable sample size. The procedure is illustrated in Figure 4. In this example, there are a total of three variables of stratification. Two of the variables have three levels, and one of the variables has two levels. There are a total of 3 x 3 x 2 = 18 cells in the multiple stratification, but three of the cells contain no population elements.
It is desired to select a sample of n=10 units from the 15 occupied cells of the table. The way that this is done is to specify two "sample patterns," either one of which is satisfactory from an analysis point of view (i.e., has adequate variation and balance in each variable, and adequate orthogonality between the variables), and all of which "cover" the entire cross-stratification grid table (i.e., each cell is contained in at least one pattern). Probabilities of selection are then specified for each pattern, and one of the patterns is selected. The selected pattern is the sample to be used. Since every sample pattern was specified such that it was "desirable" from an analysis point of view, the sample pattern that is selected is guaranteed to be completely acceptable. Since every unit was contained in at least one pattern, every unit has a nonzero probability of selection, and probability theory can hence be used to support analysis of the data.
We see then, that controlled selection possesses all of the advantages of judgment sampling (i.e., every pattern is a "judgment" sample, and exhibits the desired variation in the variables), but none of the disadvantages (selection biases, inability to apply statistical inference to a non-probability
The design task is concerned with determining exactly which variables are to be included in the sample design, and the number of categories of each variable.
Specification of the Regression Model
An analytical survey is generally concerned with using the sample survey data to estimate the parameters of a linear regression model, such as a multiple regression model. The
general form of a regression model is:
y = b0 + b1x1 + b2x2 + ... + bmxm + e
y = dependent variable
x = independent variable, or explanatory variable
b = regression coefficient (model parameter), to be
estimated in the regression analysis
e = model error term, representing random variations
in the dependent variable that are not explained
by the rest of the model
In regression analysis, there are generally two approaches. One is to attempt to develop a single "combined" regression model that represents the relationship for all members of the population. The other is to break the population into a number of different categories, and to develop a different model for each category. This latter approach (the "separate regressions" approach) is used if the nature of the relationship is quite different for the different categories. If the total number of observations in this analysis is probably not very great, it is prudent to avoid the use of separate regressions, since the precision of the coefficient estimates would generally be unsatisfactorily low. Differences in the nature of the relationship among different categories is generally treated by including "interaction" terms in the model.
Sample surveys that
o The 1976 Survey of Institutionalized Persons
o Survey of the Impact of National Health Insurance on Bureau of Community Health Service Users
o Nationwide Hospital Cost Data Study
o Professional Services Review Organization (PSRO) Data Base Development Study
o Follow-up Study of Vocational Rehabilitation Clients
o Elementary and Secondary School Civil Rights Survey
o Needs Assessment of State Social Services Programs
o Survey of Discrimination in US Sales and Rental Markets
to Collect Data for the
of Coffee Production in
The Hospital Cost Data Study, the PSRO Data Base Development Study, the Housing Market Discrimination Study, and the US Highway Capacity Manual Study involved the development of analytical survey design; the others involved descriptive survey design.
In addition to the above one-time sample surveys, Vista staff developed the sampling manuals used by states to collect data for the Social Services Reporting Requirements, for Utilization Review of Medicaid, and for Office of Child Support Enforcement Reporting Requirements.
In addition to experience in the design of
1. Cochran, W. G., Sampling Techniques, 3rd edition, Wiley,
3. Des Raj, The Design of Sample Surveys, McGraw Hill, 1972
5. Caldwell, Joseph George, Vista’s Approach to Evaluation, http://www.foundationwebsite.org/ApproachToEvaluation.htm or http://www.foundationwebsite.org/ApproachToEvaluation.pdf .
6. Caldwell, Joseph George, Sample Survey Design for Evaluation, http://www.foundationwebsite.org/SampleSurveyDesignForEvaluation.htm or http://www.foundationwebsite.org/SampleSurveyDesignForEvaluation.pdf