Vista’s Approach to Evaluation Design
© 1978, 1988, 2006, 2007 Joseph George Caldwell.
All Rights Reserved.
Posted at http://www.foundationwebsite.org
and http://www.foundation.bw
Contents
3. Vista's Approach to Evaluation
Selected References in Evaluation
Evaluation is concerned with determining
what the effects of a project or program are, and how those effects
relate to specified variables, such as project inputs, client
characteristics, or environmental characteristics.
At first glance, the problem of evaluating
a project or program (henceforth we refer only to projects) may appear straightforward. The principles of statistical experimental design,
as set forth by Sir Ronald A. Fisher in the 1920s, may be used to randomly
assign "treatments" (program inputs) to "experimental
units" (members of the target population), and the techniques of
statistical analysis (e.g., the analysis of variance) may be used to determine
an unbiased estimate of the treatment effects.
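The logic of randomized assignment can be illustrated with a small simulation. The following Python sketch is illustrative only: the population size, outcome scale, noise levels, and the function name simulate_randomized_trial are assumptions made for the example, not taken from any particular evaluation. It assigns treatment to a randomly chosen half of the units and estimates the effect as the difference in group means.

```python
import numpy as np

def simulate_randomized_trial(n_units=2000, true_effect=5.0, seed=0):
    """Randomly assign treatment to half the units and estimate its effect."""
    rng = np.random.default_rng(seed)
    # Outcome each unit would have without the project (assumed scale).
    baseline = rng.normal(50.0, 10.0, size=n_units)
    # Randomized assignment: a random half of the units is treated.
    treated = np.zeros(n_units, dtype=bool)
    treated[rng.permutation(n_units)[: n_units // 2]] = True
    noise = rng.normal(0.0, 5.0, size=n_units)
    outcome = baseline + true_effect * treated + noise
    # Difference in group means estimates the treatment effect.
    return outcome[treated].mean() - outcome[~treated].mean()

# Repeat the experiment many times; the estimates should center on 5.0.
estimates = [simulate_randomized_trial(seed=s) for s in range(200)]
mean_estimate = float(np.mean(estimates))
print(f"average estimate over 200 trials: {mean_estimate:.2f} (true effect 5.0)")
```

Any single trial misses the true effect by some sampling error, but because assignment is random the errors average out over repeated trials, which is what is meant by an unbiased estimate.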
Through the 1930s, 1940s, and 1950s,
statisticians, led by Dr. R. C. Bose, made great progress in the development of
sophisticated experimental designs, such as balanced incomplete block (BIB)
designs, partially balanced incomplete block (PBIB) designs, orthogonal Latin
square designs, and fractional factorial designs. These designs could be used to determine simultaneously
the effects of a large number of project variables on project outcome,
using only a modest number of experimental units (each treated with a particular combination
of levels of the treatment variables).
Despite the availability of the science of
statistical experimental design, evaluation research has experienced a rocky
road. Even when sound statistical
experimental designs could have been applied to obtain unequivocal results,
they often were not used. In many cases,
simple after-the-fact "case studies" were applied to intuitively
assess the project results. In other
cases, comparison ("control") groups were formed by
"matching" the comparison units to the treatment units on
pre-measures of project outcome, or on various socioeconomic variables. While this procedure may
appear reasonable (since it produces comparison groups that are similar to the
treatment groups), it can produce disastrous results. It introduces what are known as
"regression effects" -- biases in the estimated treatment effects
caused by the (nonrandom) selection of units based on a variable that contains
measurement error. (Note: in this
discussion, we generally use the term "control group" to refer to a comparison
group formed by randomized assignment, or to a naturally-assembled collection
of experimental units (e.g., classroom, village) in the case in which the
groups selected for treatment are randomly selected from the population of such
groups; the term "comparison group" refers to a group formed by any
procedure -- e.g., randomized assignment, randomized selection of
pre-existing groups, or matching. This
usage is not universal.)
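A regression effect of this kind can be reproduced in a short simulation. The Python sketch below is a hypothetical illustration, not a reconstruction of any actual study: the two population means (40 and 50), the equal variances of true score and measurement error, and the function name matched_comparison_bias are all assumptions. Treatment units come from a lower-scoring population, comparison units are matched to them on an error-prone pre-measure drawn from a higher-scoring population, and the true treatment effect is set to zero.

```python
import numpy as np

def matched_comparison_bias(n_treat=2000, n_pool=20000, seed=0):
    """Estimate a treatment effect that is truly zero, using a comparison
    group matched on a pre-measure that contains measurement error."""
    rng = np.random.default_rng(seed)
    # True scores: treatment units come from a lower-scoring population.
    true_t = rng.normal(40.0, 10.0, size=n_treat)
    true_c = rng.normal(50.0, 10.0, size=n_pool)
    # Pre-measures are the true scores plus measurement error.
    pre_t = true_t + rng.normal(0.0, 10.0, size=n_treat)
    pre_c = true_c + rng.normal(0.0, 10.0, size=n_pool)
    # Nearest-neighbour match (with replacement) on the pre-measure.
    order = np.argsort(pre_c)
    sorted_pre = pre_c[order]
    idx = np.clip(np.searchsorted(sorted_pre, pre_t), 1, n_pool - 1)
    left_closer = (pre_t - sorted_pre[idx - 1]) < (sorted_pre[idx] - pre_t)
    match = order[np.where(left_closer, idx - 1, idx)]
    # Post-measures: fresh measurement error and NO treatment effect.
    post_t = true_t + rng.normal(0.0, 10.0, size=n_treat)
    post_c = true_c[match] + rng.normal(0.0, 10.0, size=n_treat)
    return post_t.mean() - post_c.mean()

bias = matched_comparison_bias()
print(f"estimated effect when the true effect is zero: {bias:.2f}")
```

Although the matched groups look alike on the pre-measure, each group's post-measure regresses toward its own population mean, so the treatment appears harmful. Under the assumed values the theoretical bias is about -5 points: half of the 10-point gap between population means, because the pre-measure has a reliability of 0.5.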
Although the nature of regression effects
has been known to statisticians since the time of Sir Francis Galton, behavioral scientists and economists have routinely
ignored the problem, and often used this type of matching to construct
comparison groups. This practice has
resulted in a number of evaluation "disasters," such as the Westinghouse
/
The advantage of using a statistical
experimental design is that, if treatments (and non-treatment) are assigned
randomly to experimental units, it is possible to obtain an unbiased estimate
of program effect. Notwithstanding this
tremendous benefit, however, there are many situations in which it is not
practical or possible to assign treatments randomly. For example, in a study on smoking it is not
possible to select a sample of human subjects and force some to smoke and some
not to smoke (the assignment to the smoking and nonsmoking groups being made
randomly). Or, in a social services
program, federal law may prescribe who is eligible for benefits; benefits may
not legally be withheld from randomly selected target populations for the
purpose of conducting an evaluation.
Although political, ethical, or natural constraints have often made it impossible to apply the
randomization principle of experimental design, there are also numerous
instances in social and economic evaluation where randomization could have been
applied to produce unequivocal evaluation results, and was not. There are two major reasons for this. First, the determination of what community
receives an experimental program (e.g., a health or education program) may be
political (e.g., the "worst" region gets the project). Second, the evaluation design effort may be
initiated after the project has already begun, so that the evaluation researcher
has no control over the treatment allocation.
Since many project managers are not evaluation specialists, often no attempt
is made to formulate the project design to permit unbiased estimation of the project
effects. The evaluation
"design" must be formulated after the fact, given the treatment
allocation.
Because the use of statistical experimental
design is not always feasible, alternative ways of conducting evaluations have
been considered. In 1963, Donald T.
Campbell and Julian C. Stanley published a monograph entitled, Experimental and Quasi-experimental Designs
for Research, which described sixteen "quasi-experimental"
designs for research. These designs
attempted to reduce some of the threats to validity (biases) that result from
the lack of randomized assignment of treatments (biases due to the effects of
history, maturation of subjects, testing, instrumentation, regression, selection,
and mortality). Some of these designs
are based on "before-and-after" comparisons, whereas others are based
on the use of a "comparison" group that is not formed by randomized
assignment of individuals to the treatment and comparison groups.
Many years have passed since
The quasi-experimental designs that seem
most immune from threats to validity (biases caused by a lack of randomization)
are the interrupted-time-series design and the regression-discontinuity
design. These designs are
better because, in theory, if a linear statistical model can be specified
that describes treatment effect as a function of various explanatory variables
in such a fashion that the model error terms are not correlated with the
explanatory variables or with each other, and if there are no measurement errors
in the explanatory variables, then the usual method of estimation (ordinary
least squares) can be used to produce unbiased estimates of the model
coefficients. (The model coefficients
indicate the average change in treatment effect per unit change in the
explanatory variable if the explanatory variables are uncorrelated.) The regression-discontinuity design and the
interrupted-time-series design are examples of linear statistical models.
The regression-discontinuity design is
simply a linear regression model that contains a number of explanatory variables,
one or more of which are treatment variables.
The rationale for use of this design is the fact that the explanatory
variables (other than the treatment variables) will account for, or
"explain," most of the difference between the treatment and
comparison groups, and that the unexplained difference will be due to the
treatments. While this assertion cannot
be proved from the data, it is logically plausible if the analyst can reasonably
argue that the explanatory variables probably do explain differences in the
treatment and nontreatment populations. It is generally considered, however, that the
"adjustment" made to the effect estimate
by accounting for the other variables is not sufficient, so that while the bias
may be reduced, it is not eliminated.
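The role of the explanatory variables in such a model can be sketched in Python. The example assumes the common form of the design in which units are assigned to treatment by a cutoff on an observed score; the cutoff of 50, the coefficients, and the function name cutoff_design_estimates are illustrative assumptions rather than details from the text. It compares the raw difference in group means with the treatment coefficient from an ordinary least squares fit that includes the score.

```python
import numpy as np

def cutoff_design_estimates(n=5000, true_effect=4.0, cutoff=50.0, seed=0):
    """Return (naive difference in means, regression-adjusted effect)."""
    rng = np.random.default_rng(seed)
    score = rng.normal(50.0, 10.0, size=n)
    # Assumed assignment rule: units scoring below the cutoff are treated.
    treated = (score < cutoff).astype(float)
    outcome = (10.0 + 0.8 * score + true_effect * treated
               + rng.normal(0.0, 5.0, size=n))
    # Naive comparison ignores that the groups differ on the score.
    naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()
    # Linear model: intercept, centered score, and treatment indicator.
    X = np.column_stack([np.ones(n), score - cutoff, treated])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return naive, coef[2]

naive_effect, adjusted_effect = cutoff_design_estimates()
print(f"naive difference in means: {naive_effect:.2f}")
print(f"regression-adjusted effect: {adjusted_effect:.2f} (true effect 4.0)")
```

In this sketch the model is correctly specified by construction, so the adjusted coefficient recovers the true effect while the naive difference even has the wrong sign. With real data the specification cannot be verified in this way, which is the reason for the caution expressed above.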
In the 1960s, Profs.
George E. P. Box and Gwilym M. Jenkins introduced a
family of time series models (autoregressive integrated moving average (ARIMA) models)
that have gained wide acceptance in explaining time series phenomena. In 1970 they published a text entitled, Time Series Analysis, Forecasting and
Control, and in 1975 Box and G. C. Tiao published
a paper entitled, "Intervention Analysis with Applications to Economic and
Environmental Problems." That
paper shows how to use time series analysis to estimate the effect of program
"interventions," or treatments, in situations in which the variable
to be affected by the program is measured at frequent, regular intervals over
time (e.g., unemployment). The procedure
they describe is a generalized version of the interrupted-time-series
design. It has gained wide acceptance as
a means of estimating the effect of program interventions. Its validity rests on the same premises as the
regression-discontinuity design, namely, that the other variables of the model
(those other than the treatment variables) explain all differences -- other
than treatment -- between the treatment and nontreatment
populations, that the model error terms are uncorrelated with the model explanatory
variables (in particular, the treatment intervention) and with each other, and that
measurement errors are not present in the explanatory variables. If a major, one-time change occurs in a nontreatment explanatory variable or in the model error
term simultaneous with the introduction of the program intervention, the design
cannot distinguish between the effect of that change and the effect of the treatment variables. If the treatment variables are varied over
time, however, this is unlikely to occur, and, if the model is properly
specified, the estimation of the treatment effect coefficients will be
unbiased.
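A much simplified version of an intervention analysis can be sketched in Python. This is not the Box-Tiao procedure itself, which models the noise with a full ARIMA structure and allows dynamic intervention responses; the sketch assumes a permanent step change at a known time and first-order autoregressive noise, and the function name and parameter values are hypothetical. The size of the step is estimated by least squares on an intercept and a step indicator.

```python
import numpy as np

def interrupted_series_shift(n=400, start=200, true_shift=3.0, phi=0.5, seed=0):
    """Return (estimated level shift, lag-1 autocorrelation of residuals)."""
    rng = np.random.default_rng(seed)
    shocks = rng.normal(0.0, 1.0, size=n)
    # Assumed noise model: first-order autoregressive, AR(1).
    noise = np.empty(n)
    noise[0] = shocks[0]
    for t in range(1, n):
        noise[t] = phi * noise[t - 1] + shocks[t]
    # Step indicator: 0 before the intervention, 1 from `start` onward.
    step = (np.arange(n) >= start).astype(float)
    series = 20.0 + true_shift * step + noise
    X = np.column_stack([np.ones(n), step])
    coef, *_ = np.linalg.lstsq(X, series, rcond=None)
    residuals = series - X @ coef
    lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
    return coef[1], lag1

shift, lag1 = interrupted_series_shift()
print(f"estimated shift: {shift:.2f} (true shift 3.0)")
print(f"lag-1 residual autocorrelation: {lag1:.2f}")
```

The least squares estimate of the shift remains unbiased when the noise is autocorrelated, but the usual least squares standard errors would understate its uncertainty. The residual autocorrelation is reported as a reminder that a complete analysis would model the noise explicitly.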
In evaluating the effects of economic
programs, econometricians often use linear statistical models ("econometric
models") to represent the relationship of the program effects to various
variables (program inputs, client characteristics, environmental variables,
macroeconomic variables). (Many
econometric models are "cross sectional" models, i.e., they do not
include time series representations as do the Box-Jenkins or Kalman filter (state space) models.) This approach is an extension of the
regression-discontinuity design. As
discussed above, this approach is valid if the model is properly specified
(i.e., the model error terms are uncorrelated with the explanatory variables
and each other, and there are no measurement errors in the explanatory variables). This condition cannot be empirically verified
from the data, and the quality of the results depends in large measure on the
model specification skill of the modeler, and the availability of data to
measure the model variables.
While this approach has gained substantial
support from economists over the past two decades, it has a serious shortcoming. The problem that arises is that regression models
developed from passively observed
data cannot be used to predict what changes will occur when forced changes are made in the
explanatory variables, even if the model
is well-specified. Such models measure
only associations, not causal relationships.
They describe only what will probably happen to the dependent variable
(program effect) if the explanatory variables operate in the same way as they
did in the past. If forced changes are made to the explanatory variables in a
way that differs from how they varied in the past, there is no way of knowing whether they
will produce the same results as were observed in the past. (This limitation accounts for the fact that
econometricians have been relatively unsuccessful in predicting the effect of
changes in economic control variables on the economy -- the econometric models,
although very elaborate (i.e., containing many variables and specification
equations) were developed from passively-observed data (rather than from data
in which forced changes were made to the explanatory variables).) The problem caused by incorporating
passively-observed data into evaluation models is not as serious as for
econometric forecasting models, because the program inputs are generally specified
by the project planners (i.e., they are "forced", or
"control," variables).
Nevertheless, the evaluation project analyst must take special care in
the interpretation of any model coefficients corresponding to variables whose values
were not independently specified.
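The distinction between passively observed and forced changes can be illustrated with a Python sketch. It shows one mechanism by which the two can differ, namely an unobserved factor that drives both the explanatory variable and the outcome; this mechanism, the coefficient values, and the helper name ols_slope are assumptions chosen for illustration. The same causal relationship is simulated twice: once with the explanatory variable moving together with the hidden factor, and once with its values set independently.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from a least squares fit of y on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

rng = np.random.default_rng(0)
n = 20000
causal_effect = 1.0                      # true effect of x on y
hidden = rng.normal(0.0, 1.0, size=n)    # unobserved factor (assumed)

# Passive observation: x moves together with the hidden factor.
x_passive = hidden + rng.normal(0.0, 1.0, size=n)
y_passive = causal_effect * x_passive + 3.0 * hidden + rng.normal(0.0, 1.0, size=n)

# Forced change: x is set independently of the hidden factor.
x_forced = rng.normal(0.0, np.sqrt(2.0), size=n)
y_forced = causal_effect * x_forced + 3.0 * hidden + rng.normal(0.0, 1.0, size=n)

slope_passive = ols_slope(x_passive, y_passive)
slope_forced = ols_slope(x_forced, y_forced)
print(f"slope from passively observed data: {slope_passive:.2f}")
print(f"slope when x is set independently:  {slope_forced:.2f}")
```

Under these assumptions the passive slope converges to 2.5, a valid description of the association, while forcing a one-unit change in x changes y by only 1.0. A coefficient fitted to passive data therefore cannot be read as the result of an intervention.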
The most widely used procedures for
collecting data for evaluation include experimental designs, quasi-experimental
designs, and sample surveys. Issues
dealing with the use of experimental and quasi-experimental designs were
discussed above. With regard to the use
of sample survey design, there are special problems that occur in the field of
evaluation. The first problem is that
the goal in some evaluation studies is to produce an analytical (e.g.,
regression) model that describes the relationship of program effects to various
explanatory variables (program inputs, client characteristics, regional
demographic and economic characteristics).
The sample survey design that is needed to produce data suitable for the
development of a regression model is called an "analytical" survey
design. It is quite different from the
usual sample survey design – a "descriptive" survey design, which is
intended simply to describe the program effects in terms of major demographic
or program-related variables. The
approach to analytical survey design is quite different from that for
descriptive survey design. In an
analytical survey design, the objective is to develop a sample design that
introduces substantial variation in the explanatory
(independent) variables of the model.
The objective in a descriptive survey design, on the other hand, is to
develop a sample design that introduces substantial variation in the dependent variable (e.g., through stratification
of the target population into internally-homogeneous categories, or
"strata"). Standard sample
survey design texts address the design of descriptive surveys, not analytical
surveys.
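The value of variation in the explanatory variables follows from the standard error of a regression slope, which in simple linear regression with constant error variance is sigma divided by the square root of the sum of squared deviations of x from its mean. The Python sketch below applies that formula to two hypothetical samples of equal size drawn from an assumed population: a simple random sample, and a sample deliberately taken from the low and high ends of the explanatory variable. The population, sample size, and function name are illustrative assumptions.

```python
import numpy as np

def slope_standard_error(x, sigma=1.0):
    """Standard error of the OLS slope in simple linear regression,
    given the sampled x values and the error standard deviation."""
    x = np.asarray(x, dtype=float)
    return sigma / np.sqrt(np.sum((x - x.mean()) ** 2))

rng = np.random.default_rng(0)
population_x = rng.uniform(0.0, 100.0, size=10000)
n = 100

# Design A: simple random sample, so x varies as it does in the population.
srs_x = rng.choice(population_x, size=n, replace=False)

# Design B: half the sample from the lowest x values, half from the highest.
sorted_x = np.sort(population_x)
spread_x = np.concatenate([sorted_x[: n // 2], sorted_x[-(n // 2):]])

se_srs = slope_standard_error(srs_x)
se_spread = slope_standard_error(spread_x)
print(f"slope standard error, simple random sample: {se_srs:.4f}")
print(f"slope standard error, spread-out sample:    {se_spread:.4f}")
```

The spread-out sample gives a smaller standard error for the same number of observations. A design that uses only the extremes cannot detect departures from linearity, so a practical analytical design would normally also include intermediate values.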
The second important consideration in
sample survey design for evaluation is the fact that use of the "finite
population correction" (FPC) factor is generally not appropriate in evaluation
applications. The FPC is a factor that
reduces the variance of sample estimates in sampling without replacement from
finite populations. Since the target populations
for evaluation studies are finite, it might appear at first glance that this
factor should be applied. It generally
should not be applied, however, since the conceptual framework in an evaluation
study is such that the goal is to make inferences about a process (i.e., the program),
not a particular set of program recipients.
This fact has been generally misunderstood in sample survey design for
evaluation. The wrongful use of the FPC
causes two problems. First, the
estimated precision of the reported estimates may be grossly overstated. Second, the sample size estimates determined
in the survey design phase to achieve desired precision levels may be grossly
underestimated.
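Both problems can be seen in a short calculation. The Python sketch below uses the standard textbook formulas for the standard error of a sample mean and for the sample size needed to reach a given margin of error, each with and without the FPC; the numerical values (a standard deviation of 10, a population of 500, a 95 percent confidence coefficient) and the function names are assumptions chosen for illustration.

```python
import math

def standard_error_of_mean(s, n, N=None):
    """Standard error of a sample mean; applies the FPC only if N is given."""
    se = s / math.sqrt(n)
    if N is not None:
        se *= math.sqrt(1.0 - n / N)   # finite population correction
    return se

def required_sample_size(s, margin, z=1.96, N=None):
    """Sample size for a given margin of error; FPC applied only if N is given."""
    n0 = (z * s / margin) ** 2
    if N is not None:
        n0 = n0 / (1.0 + n0 / N)
    return math.ceil(n0)

s, n, N = 10.0, 400, 500
se_plain = standard_error_of_mean(s, n)
se_fpc = standard_error_of_mean(s, n, N)
n_plain = required_sample_size(s, margin=1.0)
n_fpc = required_sample_size(s, margin=1.0, N=N)
print(f"standard error without FPC: {se_plain:.3f}, with FPC: {se_fpc:.3f}")
print(f"required sample size without FPC: {n_plain}, with FPC: {n_fpc}")
```

With these values the FPC cuts the reported standard error from 0.500 to about 0.224 and the planned sample size from 385 to 218. If the inference is about the program as a process rather than about these 500 recipients, the uncorrected figures are the relevant ones.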
When the evaluation design can be done as
part of the initial project design, the evaluation design can usually be much stronger
than if the evaluation design is developed after initiation of the project
(i.e., after the treatment allocation has been accomplished). The latter situation is very common, however,
and
Given the often severe shortcomings of
evaluation designs that are not based on a randomized assignment of treatments (program
inputs), it is reasonable to ask what can be done if randomization
is not possible or was not done. Such is
often the case, since many evaluations begin after the project is well under way
or even completed. The point is,
however, that the use of the best quasi-experimental design or a cross-sectional
regression model based on retrospectively-obtained data can be vastly superior
to a poor alternative, such as an ex-post case study evaluation.
This approach has worked well. In the USAID-funded Economic and Social Impact Analysis / Women in Development project
in the Philippines, Vista was responsible for identifying indicators, research
designs, measurement instruments (data collection forms and questionnaires),
and sampling plans for eighteen development project evaluations.
With regard to the choice of evaluation
design, a principal factor to consider is whether the evaluation is to be essentially
descriptive or analytical in nature. The
descriptive approach simply addresses what happened (in terms of project
results). The analytical approach
attempts to determine the relationship of project outcome to various explanatory
variables, including project control inputs as well as exogenous variables such
as macroeconomic conditions, regional demographic characteristics, and client characteristics. Most evaluations are essentially descriptive
in nature. They assess the project
outcome, but provide only limited insight concerning the determinants of project
outcome. The accomplishment of an
analytical evaluation requires a substantially greater investment of resources
for design (e.g., an analytical survey design vs. a descriptive design),
data collection (i.e., collection of data on all of the explanatory variables
of an analytical model), and analysis (e.g., regression analysis vs. crosstabulation analysis).
A key decision to be made in evaluation
design concerns the choice of indicators, i.e., measures of project outcome. Ideally, what is wanted is a measure of effectiveness (MOE), which
indicates what happened in terms of the ultimate goals of the project (e.g.,
employment, earnings, health, mortality).
Often, however, it is not possible to measure the ultimate outcome
(e.g., too many variables affecting employment may be operating concomitantly
with the project variables to permit an unequivocal assessment of the effect of
the project on unemployment). In this
case, the evaluation centers on project outputs -- measures of performance (MOPs) -- which
are logically linked to the ultimate effectiveness measure. For example, it may be possible to measure
"number of meals delivered" to a target population, but not feasible
to measure resultant decreases in mortality.
In this case the linkage between improved nutrition and decreased
mortality is accepted as a basis for using "number of meals
delivered" as a measure of performance.
The preceding paragraphs have described
some of the technical aspects of evaluation.
In addition to the technical aspects, other aspects such as
organizational and political aspects must also be considered. In the past, these other aspects were often
ignored, leading to evaluation failures.
To avoid this problem,
In the 1970s, it was realized that the
investment in evaluation of government programs was not leading to more successful
policies and programs, and a concerted effort was undertaken to determine
why. Joseph S. Wholey
and others working in the program evaluation group of The Urban Institute
eventually identified a number of conditions which, if present, generally
disabled attempts to evaluate performance.
They developed the concept of "evaluability
assessment" -- a descriptive and analytic process intended to produce a
reasoned basis for proceeding with an evaluation of use to both management and
policymakers. They developed a set of
criteria which must be satisfied before proceeding to a full evaluation. This approach begins by obtaining management's
description of the program. The
description is then systematically analyzed to determine whether it meets the
following requirements:
o it is complete;
o it is acceptable to policymakers;
o it is a valid representation of the program as it actually exists;
o the expectations for the program are plausible;
o the evidence required by management can be reliably produced;
o the evidence required by management is feasible to collect; and
o management's intended use of the information can realistically be expected to affect performance.
The object of an evaluability
assessment is to arrive at a program description that is evaluable. If even one of the criteria is not met, the
program is judged to be unevaluable, meaning that
there is a high risk that management will not be able to demonstrate or achieve
program success in terms acceptable to policymakers. The conduct of an evaluability
assessment is considered to be a necessary prerequisite to evaluation.
In summary, then,
Selected References in Evaluation
1. Struening, E. L. and M. Guttentag, Handbook of Evaluation Research, 2 Volumes, Sage Publications, 1975
2. Campbell, D. T. and J. C. Stanley, Experimental and Quasi-Experimental Designs for Research, Publishing Company, 1963
3. Klein, R. E., M. S. Read, H. W. Riecken, J. A. Brown, Jr., A. Pradilla, and C. H. Daza, Evaluating the Impact of Nutrition and Health Programs, Plenum Press, 1979
4. Weiss, Carol H., Evaluation Research, Prentice-Hall, 1972
5. Suchman, Edward A., Evaluative Research, Russell Sage Foundation, 1967
6. Wholey, Joseph S., J. W. Scanlon, H. G. Duffy, J. S. Fukumoto, and L. M. Vogt, Federal Evaluation Policy, The Urban Institute, 1975
7. Schmidt, R. E., J. W. Scanlon, and J. B. Bell, Evaluability Assessment: Making Public Programs Work Better, Human Services Monograph Series No. 14, Project Department of Health and Human Services, November 1979
8. Smith, K. F., Design and Evaluation of AID-Assisted Projects, US Agency for International Development, 1980
9. Cochran, W. G., Sampling Techniques, 3rd edition, Wiley, 1977
10. Cochran, W. G., and G. M. Cox, Experimental Designs, 2nd edition, Wiley, 1957
11. Box, G. E. P., and G. M. Jenkins, Time Series Analysis, Forecasting and Control, Holden Day, 1970
12. Draper, N. and H. Smith, Applied Regression Analysis, Wiley, 1966
13.


