SAMPLE SURVEY DESIGN AND ANALYSIS:
A COMPREHENSIVE THREE-DAY COURSE
WITH APPLICATION TO MONITORING AND EVALUATION
by
Joseph George Caldwell, PhD
(001)(864)541-7324
http://www.foundationwebsite.org
COURSE NOTES: DAY ONE
BASIC CONCEPTS OF SAMPLE SURVEY
(Updated 4 May 2007, 26 March 2009, 10 January 2010)
NO RECORDING DEVICES ALLOWED
© 1980 - 2009 Joseph George Caldwell. All rights reserved.
Posted at Internet website http://www.foundationwebsite.org/SampleSurvey3DayCourseDayOne.pdf . May be copied or reposted for noncommercial use (or for evaluation by those considering attending the course), with attribution.
Introduction
These notes are intended to accompany a lecture, using a board or projector to augment the oral presentation. They have been prepared so that the student may listen to the presentation without having to take notes.
The lecture is accompanied by examples and handouts, which are not included in these notes.
The course also includes in-class student exercises.
The course may be covered in three six-hour days (three hours in morning, three hours in afternoon), or in five half days (three and one-half hours per day). The split-up sessions are intended to accommodate clients whose employees would find it inconvenient or impractical to allocate an entire day, or three days in sequence, to a course.
The course is intended for any class size, but a smaller class size (e.g., 10-30 students) is better for interactive discussion (responses to student questions, clarifications, additional examples).
The topics covered in the three-day course are:
Day 1: Basic concepts of sample survey
Day 2: How to design surveys and analyze survey data
Day 3: Special topics; practical problems in survey design
In Day 1, basic principles of statistics and sampling theory are presented and the major types of sample design are described, and the rationales for selecting each type of design are discussed. Day 2 is concerned with the problem of constructing a design of each major type (i.e., determining sample sizes and sample selection methods). Day 3 is concerned with special topics, such as the use of sample survey design in program monitoring and evaluation.
The level and scope of the course; managing expectations
This course is an introductory course on the design and analysis of sample surveys. It assumes that the student has taken a prerequisite course in “college math,” but it does not require prior knowledge of calculus. For students having knowledge of calculus (or some background in probability and statistics, say from an elementary course in statistics), some additional information is presented. This additional material is marked with the notation “optional.” These optional sections (few in number) are omitted from the course presentation.
Attendees should be somewhat familiar with basic statistical concepts, such as probability, the mean and variance of a distribution, the normal distribution, the binomial distribution, estimation and hypothesis testing, confidence intervals, and regression and correlation. Needed material from these topics is reviewed, but this review is not sufficiently thorough for a person having no previous knowledge of probability or statistics. Ideally, a person attending this course would have previously taken an elementary course in statistics. A person with no previous training in statistics could follow much of the lecture, but it would be expecting a lot to absorb the basic concepts of statistics “on the fly,” in addition to the material specific to sample survey.
This course is intended to cover a broad range of topics in sample survey design. To do so, it does not cover each topic in great detail. The concern is with known results and how to apply them, not with proving them.
The course is introductory and elementary, but relatively comprehensive, and certainly intensive. At the end of the course, a person with some mathematical ability should be able to recognize which basic type of sampling is appropriate in a given situation, be able to estimate the sample size required to produce a specified level of precision, and be able to conduct standard analyses of the collected sample data.
There is no way, however, that a three-day course will make an “instant survey statistician” out of anyone. In a survey design situation that is complex or that will involve large amounts of time, effort, or money, the advice of an expert sample survey statistician should be sought.
The course is basically conceptual, with some time spent on working through detailed examples, including numerical calculation of formulas. Someone wishing to construct an actual survey design and analyze the survey data would likely want to consult a reference text to review detailed examples and gain expertise by working through exercises.
This course is an ideal introduction for a project director or government technical (project) officer who wishes to understand the basic concepts of sample survey in order to effectively manage or monitor a project involving sample survey. With the background of this course, the project manager should be able to sense what type of survey design is appropriate in a given situation, and be able to converse meaningfully with a consulting survey statistician on a project involving a sample survey.
The course has been presented a number of times, both on an “advertised” basis at commercial hotels, and on an “in-house” basis at the US Bureau of Labor Statistics. Overall, the evaluation sheets returned by course attendees have been very favorable, but in a few instances it was attended by persons with limited mathematical background and in those cases the material was considered too complicated. While it is possible to present a course on sample survey with virtually no reference to mathematical symbology, such a course would not be of use to a person who actually wanted to design a survey and analyze survey data. This course is not a “no-math” course. While it is elementary and introductory and does not require knowledge of calculus, it does require some familiarity with mathematics at the “college math” level. Persons with little mathematical background could attend the course and understand much of the lecture material, but they would be unable to follow the mathematical formulas and work out numerical examples.
Nobody likes unpleasant surprises. One of the purposes for publishing these notes on the Internet is so that prospective students may quickly peruse them and assess whether the material is too advanced for them, given their present background in mathematics.
The course covers a lot of material in a short time. These notes will enable the student to pay attention to the lecture without having to take notes. It is not expected that everything will “sink in” in a three-day course, and it is recommended that the student who wishes to apply the techniques in practice acquire a reference text for study, or attend a formal course in which many homework exercises will “fix” the concepts.
Each attendee to the course is asked to complete a course evaluation. One of the questions asked is whether the course should spend less time on many topics (as it does), or concentrate on a small number of designs. The overwhelming response from attendees is that they liked the course as it is – a broad overview of many topics, with less time spent on any particular design or topic.
The course is comprehensive, but it is certainly not exhaustive. It provides an introduction to the major aspects of sample survey design and analysis. There are many specialized topics that it does not cover, and it does not address every possible combination of survey design elements. For example, it includes stratified sampling and ratio estimation, but does not include stratified sampling with ratio estimation – the student is referred to a reference text for the information on that particular combination. Also, the course includes cluster sampling and stratified sampling, but it skips discussion of stratified cluster sampling.
These notes are available for review by anyone considering enrolling in the course. The notes do not contain all of the exercises, examples, and handouts that are included in the presentation.
The student is not expected to memorize the various formulas presented, but by the end of the course some of the major ones should be familiar and recognizable (e.g., the formulas for a mean, a variance, a weighted average, and a confidence interval). A certain amount of course material (e.g., examples, supplementary material, details) is included as “background,” to place the essential concepts in context. The student is not expected to remember all of the material presented; the really important concepts will be identified and stressed.
In a usual academic course, the material covered here would be written out by the professor, over the course of 16 one-hour class sessions. If all of the material covered here were written out, it would not be possible to cover it in a three-day course. Hence, in addition to obviating the need for taking notes, the course notes enable much more material to be covered than would be possible in a usual course. It is recognized that there is a learning advantage to the student’s writing his own notes, but this benefit has been sacrificed in order to cover much material in a short time. The material presented in the notes is available in a variety of reference texts on sample survey. The essential feature of the course is the lecture and in-class interaction, not the notes. The notes are made available simply to enable the student to take full advantage of these aspects.
The course lasts 16-18 hours (Day 3 is usually cut a little short, to accommodate travel arrangements). This is about 1/3 of the class time of a “three-unit” college semester (three hours per week for 16 weeks, or 48 hours). The college course, however, would include substantial amounts of homework, which this course does not include.
Sample survey involves a lot of formulas. There are a number of different designs and estimation techniques, and each of them involves its own formulas (or procedures, such as resampling) for calculating estimates and errors of estimation. These course notes include many formulas, for reference, but not a lot of class time is spent in working with the formulas. They are too many and too complicated to learn well in a three-day course. Most of the class time is spent in discussing concepts, examples, and approaches, not with working through complicated estimation formulas. A few detailed numerical examples will be worked out in the early part of the course, so that the student may become familiar with the computational requirements of the estimation formulas. After that, formulas will be shown in order to illustrate concepts and general forms, but no further calculations will be made using them.
Note on course content. If presented on an advertised basis (individual enrollments), the course follows these notes closely. If presented for a single client, the content may be modified somewhat to suit the client’s interests. For example, an overseas client may have no interest in information about the process for obtaining OMB approval for a questionnaire to be used in a survey funded by the
The pace of the course, the selection of topics, and the time spent on various topics may be adjusted a little by the instructor, in order to address specific concerns or interests of the students.
While these notes parallel the lecture, not every item included in the notes is necessarily included in the lecture, and not every item included in the lecture is included in the notes. The notes are intended to reduce the requirement for the student to take copious notes during the lecture. They are not intended to be a detailed recording of the lecture. For additional detail and examples, the student should consult a sample survey reference textbook.
This course focuses mainly on estimation (point and interval estimation), not on hypothesis testing. The reason for this focus is that in sampling from finite populations, subpopulations almost always have different parameters, and so the test of the hypothesis of equality of parameters is irrelevant. We do consider hypothesis testing in applications of sample survey to evaluation, where the assumption of a conceptually infinite population (which produced the particular finite population) is reasonable. Application of sample survey to monitoring and evaluation is addressed in Day 3 of the course.
(A similar situation (regarding finite and infinite populations) occurs in the field of statistical quality control. On the one hand, we may be interested in estimating the percentage of defectives in a particular lot of goods, to decide whether to accept the lot. In this case (acceptance sampling), we are interested in estimating the characteristics of a particular finite population (i.e., the lot). On the other hand, a quality control manager will view this lot as a single sample from the process that generated it and many other lots. In this case, the lot is viewed as a single sample from a conceptually infinite population of lots, and we are interested in estimating the characteristics of this conceptually infinite population.)
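The acceptance-sampling case described above can be illustrated with a minimal sketch. The lot size, the number of defectives, the sample size, and the function name `estimate_defective_fraction` are all hypothetical values chosen for illustration; they are not taken from the course. The sketch treats the lot as a particular finite population and estimates its fraction defective from a simple random sample drawn without replacement.

```python
import random

def estimate_defective_fraction(lot, sample_size, seed=0):
    """Estimate the fraction of defectives in one finite lot.

    `lot` is a list of 0/1 flags (1 = defective). The sample is drawn
    without replacement, as when inspecting items from a single lot.
    """
    rng = random.Random(seed)              # fixed seed so the sketch is repeatable
    sample = rng.sample(lot, sample_size)  # simple random sample, no replacement
    return sum(sample) / sample_size

# Hypothetical lot of 1,000 items, 50 of them defective (true fraction 0.05).
lot = [1] * 50 + [0] * 950

p_hat = estimate_defective_fraction(lot, sample_size=100)
print(f"estimated fraction defective: {p_hat:.3f}")

# Inspecting every item (a census) recovers the lot's true fraction exactly.
p_census = estimate_defective_fraction(lot, sample_size=len(lot))
print(f"census fraction defective: {p_census:.3f}")
```

Under the quality-control manager's view, the same lot would instead be treated as one realization of the process that generates lots, which is the conceptually infinite population mentioned above.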
In the past, the course was presented by Dr. Caldwell and his colleague, Dhirendra N. Ghosh.
Course Pricing
The course is no longer given on an advertised basis, but only “in-house” at a client’s facility.
The current price for the course, conducted over a three-day period at a client’s facility, is USD15,000. The price if conducted over five days (one half-day session each day for five days, with some material dropped from the “day
This price is all-inclusive, subject to the following limitations. Half payment is requested in advance, and half upon completion. We will absorb travel and accommodation expense for the presenter(s) (one or two persons) up to USD5,000. If this limit is exceeded, the client will be billed for the presenters’ travel and per diem (meals, lodging and incidental) expense, in accordance with US Government maximum travel per diem allowances (or international-organization allowances) for the travel (from the presenter’s home base to the client’s location, time spent at the client’s location, and return to the presenter’s home base), and requested to pay the amount in excess of USD5,000.
It is agreed that the client will download the course notes from the Internet website http://www.foundationwebsite.org , and print sufficient copies for all attendees. Note: The Internet version of the course notes does not include all handouts. These supplementary items (as computer files) will be e-mailed to the client prior to the course. If the client does not print the course notes or the supplementary items, the course will be presented without course notes. This is not the intended format, or the format that has been used successfully in the past. As discussed, much material is presented, and it is not possible to write out this material during a three-day course. At the same time, restricting the course to a lecture, without benefit of the notes, would lose much. The course is intended to be a lecture supplemented with the Course Notes.
The client is expected to provide a comfortable environment conducive to learning. If the client does not have suitable accommodations at its own facility, it is recommended that facilities be procured at a local commercial hotel, many of which have excellent facilities for seminars. It is requested that the client provide a computer (with a Microsoft XP or
It is requested that the client provide snacks and drinks for the breaks. The client is encouraged to provide lunch to presenters and attendees for full-day sessions, but this is at the client’s discretion. (This was the practice when the course was presented on an advertised basis at a commercial hotel, and it works well (it keeps the class together, and avoids late returns to class after lunch).)
COURSE SCHEDULE
Day 1: Basic Concepts of Sample Survey
3:10 - 3:40 Double Sampling
Third Days; Questions and Answers
Day 2: How to Design Surveys and Analyze Survey Data
Part One: How to Design Descriptive Surveys
of Survey Design; Distinctions between Descriptive and Analytical Surveys
9:15 - 9:30 General Procedure for Designing a Descriptive Sample Survey
Part Two: How to Design Analytical Surveys
Part Three: How to Analyze Survey Data
3:20 - 3:40 Standard Estimation Procedures for Analytical Surveys
Topics for Third Day
Day 3: Special Topics/Practical Problems in Survey Design
Procedures
11:15 - 12:00 Sample Frame Problems
1:15 - 2:00 Treatment of Nonresponse
3:15 - 3:45 Major National Surveys
COURSE SYLLABUS
Day 1: Basic Concepts of Sample Survey
1. Introduction
· Course Objectives and Outline
· Overview of First Day's Course Content
2. Concepts of a statistical distribution (mean, variance, percentiles; examples: normal, binomial)
3. Types of sampling
· Purposive (judgment)
· Haphazard
· Quota
· Probability Sampling
4. Concepts of statistical inference from samples
· Sample
· Estimators of population parameters (measures of central tendency; other parameters (e.g., p))
· Properties of estimators: variance, bias; precision vs. trueness; accuracy (MSE)
· Central limit theorem
· Sample moments vs. population moments
· Distribution of sample statistic vs. population distribution
5. Simple random sampling
· When to use
· How to select a sample
o Target population, sampling population, sampling frame
o Random numbers -- how to use, generated vs. tabled
o Systematic Sampling (from randomly ordered files)
o Sampling with and without replacement
· Types of Estimators
o Simple
o Ratio
o Regression
o Bayes (mention)
o Resampling (Jackknife, Bootstrap) (mention)
· Variance formulas
· Variance estimates
o Formulas
o Resampling (mention)
· Sampling for means vs. sampling for proportions
· Confidence intervals
· Determining sample sizes
6. The concept of sample design
· Precision/cost ratio; design effect
· Ways of departing from simple random sampling
o Variations in the probability of selection
o Dropping the independence assumption (systematic, cluster, replacement, controlled selection, matching)
· Optimal design
· Auxiliary variables
o Correlated with variables of interest
o Cost information
7. Stratified sampling
· Description
· When to use
· How to select sample
· Estimation formulas
· Self-weighting case
· Variance formulas
· Variance estimates
· Construction of strata
· Multiple stratification
· Stratification to the limit
· Cross-stratification
· Certainty stratum
· Optimal allocation
· Determination of sample size
· Stratification when the variable of stratification is inaccurate
· Post-stratification
8. Cluster sampling
· Description
· When to use
· Intracluster correlation coefficient
9. Systematic random sampling
· Description
· When to use
· How to select sample (integer sampling interval, noninteger sampling interval; random start; random starts)
· Estimation formulas
· Variance formulas
· Variance estimation (paired selections, successive differences)
· Replicated subsamples
10. Multistage sampling
· Description
· When to use
· Intracluster correlation coefficient
· Estimation formulas
· Self-weighting sample
· Methods of sample selection
o 1st stage: PPS, PPMS, equal probs., w/rep, wo/rep
o 2nd stage: fixed sample size, variable sample size
o self-weighting
§ 1st -- PPS, 2nd -- equal probs. (adv./disadv.)
§ 1st -- equal, 2nd -- proportional (adv./disadv.)
· Impact of ICC on selection method
o If rho fixed (e.g., equal-sized units)
o If rho variable (e.g., variable-sized units)
· PPS selection
· Certainty stratum
· Variance formulas
· Variance estimation
· Systematic selection ok for 2nd stage units under certain circumstances
· RHC method for sampling wo replacement
· Determination of sample size (design)
o First stage
o Second stage
· Need frame only for 1st stage units and selected 2nd stage units
· Generalized variances (mention)
· PPMS
11. Two-phase (double) sampling
· Description
· When to use
· Estimation formulas
· How to select sample
· Variance formulas
· Estimation of variance
· Determination of sample size (1st and 2nd phases)
12. Survey of References; Outline of Topics for 2nd and 3rd Days; Questions and Answers
Day 2: How to Design Surveys and Analyze Survey Data
Part One: How to Design Descriptive Surveys
1. Introduction
· Overview of Second Day's Course Content
· The Elements of Survey Design
· Distinctions between Descriptive and Analytical Surveys
2. General Procedures for Designing a Descriptive Survey
· Specify population of interest
· Define estimates of interest
· Specify precision objectives of survey; resource constraints
· Specify other variables of interest
· Develop instrumentation
· Develop sample design
· Determine sample size and allocation
· Specify sample selection procedures
· Specify field procedures
· Specify data processing procedures
· Develop data analysis plan
· Outline report
3. When and How to Use Simple Random Sampling
· Nature of situation which warrants use of simple random sample
· How to select a simple random sample
· Sampling without replacement
· How to select a simple random sample without replacement
4. When and How to Use Systematic Sampling
· Reasons for using systematic sampling
· Nature of situation which warrants use of systematic sampling
· How to select a systematic sample
5. When and How to Use Stratification
· Nature of situation which warrants use of stratified sampling
· The use of a certainty stratum
· How to determine the number of strata, and the stratum boundaries
· Stratification to the limit
· Collapsed strata
· Post-stratification
· Errors in classification
· Multiple stratification: cross stratification
· Multiple stratification: nested stratification
· How to allocate sample sizes to strata, when costs and variances are known
· How to allocate sample sizes to strata, when costs and variances are unknown
· Self-weighting design
· General recommendations regarding stratification
6. When and How to Use Cluster Sampling
· Nature of situation which warrants use of cluster sampling
· The "cluster" effect
· Determining sample size in cluster sampling (equal-size clusters)
· Variable-size clusters: sampling with probabilities proportional to size (PPS)
· Variable-size clusters: sampling with probabilities proportional to a measure of size (PPMS)
· Stratification of clusters; the use of a certainty stratum of clusters
· Construction of clusters
· Variable-size clusters; determination of sample size
· Replacement vs. non-replacement sampling of clusters
· Situations in which clustering improves precision
· Self-weighting design
· Sample frame considerations
· General recommendations regarding cluster sampling
7. When and How to Use Multistage Sampling (Two-Stage)
· Nature of situation which warrants use of a multistage design
· Determining sample sizes in two-stage sample (equal-sized primary units)
· The use of nonreplacement sampling (equal-size primary units)
· The use of systematic sampling for selection of second-stage units
· Determining sample sizes in two-stage sampling (unequal size primary units, selection with equal probabilities)
· PPS sampling of primary units (unequal-size primary units)
· Determining sample size in PPS sampling
· The use of nonreplacement sampling (unequal-size primary units)
· Stratification of primary units; the use of a certainty stratum
· Self-weighting design
· Sample frame considerations
· General recommendations regarding two-stage designs
8. When and How to Use Double Sampling
· Nature of situation which warrants the use of double sampling
· Determination of sample size in double sampling
9. How to Resolve Conflicting / Multiple Survey Design Objectives
Part Two: How to Design Analytical Surveys
1. Review of Regression Analysis
2. General Procedures for Designing an Analytical Survey
· Sample survey design for analysis
· Essential problems in design of an analytical survey
· Two conceptual approaches to design of analytical surveys
· Methods for the design of analytical surveys
3. Illustration of Methods for the Design of Analytical Surveys
Part Three: How to Analyze Survey Data
1. Standard Estimation Procedures for Descriptive Surveys
· Preliminary analysis
· Planned analysis
· Special analysis
2. Standard Estimation Procedures for Analytical Surveys
· Preliminary analysis
· Planned analysis
· Tests of model adequacy/model revision
3. Computer Programs for Analysis of Survey Data; Outline of Topics for Third Day
Day 3: Special Topics/Practical Problems in Survey Design
1. Survey Design for Monitoring and Evaluation
2. Instrumentation, Data Collection, and Survey Field Procedures
· Selection of Data Collection Procedures
· Questionnaire Development
· Development of Field Procedures (Treatment of Nonresponse, In-place Interviews vs. Travelling Team, Incentive Payments)
· Pretesting and Pilot Testing
· Editing, Coding, Data Base Design and Development
3. Preparation of OMB Clearance Forms
4. Longitudinal Surveys
5. Sample Frame Problems
6. Sampling for Rare Elements
7. Treatment of Nonresponse
8. Nonsampling Errors
9. Randomized Responses
10. Random Digit Dialing
11. Major National Surveys
12. Questions and Answers
Course Critique Form
Dear Participant:
We appreciate your attendance and are interested in your comments in order to improve our course. Please answer the following questions, adding additional comments as necessary, and send the form back in the attached envelope. Thank you.
Date of course_________________ Location of course_______________________________
Course Content
1. How useful do you consider the information?_______________________________
2. Was the material presented in sufficient detail?_____________________________
3. Were there some topics you would have preferred more discussion on? Yes__ No___
If so, which ones?_________________________________________________________
Course Delivery
1. Were the presentations effective?____________________________________________
2. Were the visual aids helpful?___________________________________________
3. Were the course notes sufficiently detailed?________________________________
Facilities
1. Was the seating arrangement satisfactory?_____________________________________
2. Were the meals satisfactory?_________________________________________________
3. Was parking adequate?______________________________________________________
4. Is the location convenient?__________________________________________________
General
1. How did you find out about this course?____________________________________
Brochure in mail_________________
Organizational channels___________
Associate_______________________
Internet_________________________
Other (specify)___________________
2. Did you have sufficient registration time?___________________
3. Did you feel the course was as you expected it to be, from the flyer?
____________________________________________________________________
4. Did you feel the course was as you expected it to be, from the Course Notes (if examined on the Internet)?______________________________________________
5. If from out of town: Did you stay at the hotel where the course was presented? _____
6. This course was presented to provide a broad overview of Sample Survey Design Techniques. Would you have preferred to concentrate on a few specific designs?__________________________________________________________
7. Have you ever attended a course on sampling before? Yes_____ No_____
8. Would you prefer a more detailed course of 5 days,_____ or a less detailed course of 2 days?_____
9. Would you prefer a more advanced course,_____ or a less advanced course?_____
10. Compared to other short courses with which you are familiar, was the cost of this course:
About right______________
Rather high______________
Lower than expected______
11. What additional seminars might you be interested in?
Time Series Analysis, Forecasting and Control________
Biostatistics_______________
Experimental Design________
Quality Control_____________
Evaluation Research________
Introduction to Statistics and Data Analysis____________
Simulation and Modeling______________
Optimization_______________
Other (specify)______________
Additional Comments:_____________________________________________________________
Name (optional)______________________________________________________________
Organization (optional)__________________________________________________________
DAY 1: BASIC CONCEPTS IN SAMPLE SURVEY
INTRODUCTION; COURSE OBJECTIVES AND OUTLINE; OVERVIEW OF FIRST DAY’S COURSE CONTENT
BASIC CONCEPTS IN SAMPLE SURVEY
POPULATION: THE POPULATION IS A WELL-DEFINED COLLECTION OF ELEMENTS (MEMBERS, ITEMS, OBJECTS), a1, a2, …,aN (POPULATION SIZE = N).
IN MOST OF THIS COURSE, THE POPULATIONS WILL BE FINITE. AT ONE POINT (DEALING WITH EVALUATION RESEARCH) WE WILL CONSIDER CONCEPTUALLY INFINITE POPULATIONS.
EXAMPLES:
THE POPULATION IS DEFINED BY FOUR QUANTITIES: CONTENT, UNITS, EXTENT AND TIME (E.G., THE INCOME, OF
WE ARE INTERESTED IN DESCRIBING CERTAIN CHARACTERISTICS (ATTRIBUTES, FEATURES, PROPERTIES) OF THE POPULATION. LET Xi DENOTE AN ARBITRARY NUMERICAL ATTRIBUTE THAT CAN BE DETERMINED FOR A POPULATION ELEMENT (SUCH AS GENDER, AGE, INCOME, HIV STATUS, SCHOOL SIZE, HOSPITAL OWNERSHIP).
FOR EXAMPLE, IN EXAMPLE (1), WE MAY WISH TO DESCRIBE THE PREVIOUS YEAR’S EARNINGS AND CURRENT EMPLOYMENT STATUS OF ALL RESIDENTS (ON JULY 1), BY AGE CATEGORY, GENDER, AND MARITAL STATUS.
THE PROBLEM OF SAMPLE SURVEY (“SAMPLING”) IS TO ESTIMATE THE VALUE OF POPULATION CHARACTERISTICS (E.G., A MEAN, PROPORTION OR TOTAL) FROM A SUBSET (PART, PORTION, “SAMPLE”) OF THE POPULATION.
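AS AN ILLUSTRATION OF THE ESTIMATION PROBLEM, HERE IS A MINIMAL SKETCH. THE POPULATION OF 200 INCOMES, THE SAMPLE SIZE OF 25, AND THE FUNCTION NAME `srs_estimates` ARE HYPOTHETICAL, CHOSEN ONLY FOR ILLUSTRATION. THE SKETCH ASSUMES A SIMPLE RANDOM SAMPLE DRAWN WITHOUT REPLACEMENT, USES THE SAMPLE MEAN TO ESTIMATE THE POPULATION MEAN, AND USES N TIMES THE SAMPLE MEAN TO ESTIMATE THE POPULATION TOTAL.

```python
import random

def srs_estimates(population, n, seed=0):
    """Estimate the population mean and total from a simple random sample."""
    rng = random.Random(seed)            # fixed seed so the sketch is repeatable
    sample = rng.sample(population, n)   # n distinct elements, without replacement
    N = len(population)
    mean_hat = sum(sample) / n           # sample mean estimates the population mean
    total_hat = N * mean_hat             # expansion estimate of the population total
    return mean_hat, total_hat

# Hypothetical population of 200 incomes (illustrative values only).
incomes = [20000 + 500 * i for i in range(200)]

mean_hat, total_hat = srs_estimates(incomes, n=25)
true_mean = sum(incomes) / len(incomes)
print(f"estimated mean {mean_hat:.0f}, true mean {true_mean:.0f}")
print(f"estimated total {total_hat:.0f}, true total {sum(incomes)}")
```

IN PRACTICE THE TRUE MEAN AND TOTAL ARE UNKNOWN; THEY ARE COMPUTED HERE ONLY TO SHOW HOW CLOSE AN ESTIMATE FROM A SUBSET CAN BE.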
WHY A SUBSET?
THE POPULATION TO BE SAMPLED (THE SAMPLED POPULATION) MAY DIFFER FROM THE POPULATION OF INTEREST (THE TARGET POPULATION), FOR PRACTICAL REASONS.
BEFORE SELECTING A SUBSET, THE SAMPLED POPULATION IS DIVIDED INTO SAMPLING UNITS (NONOVERLAPPING, EXHAUSTIVE). A LIST OF ALL OF THE SAMPLING UNITS IS CALLED A FRAME (OR SAMPLE FRAME OR SAMPLING FRAME). A SAMPLE (TECHNICAL DEFINITION) IS A COLLECTION OF SAMPLING UNITS DRAWN FROM A FRAME.
EXAMPLE: WANT A SAMPLE OF PUBLIC-SCHOOL STUDENTS. ALL STUDENTS ARE IN SCHOOLS, SO WE MAY DEFINE THE SAMPLING UNIT AS A SCHOOL, AND SELECT A SAMPLE OF SCHOOLS TO OBTAIN A SAMPLE OF STUDENTS. WE ARE MUCH MORE LIKELY TO BE ABLE TO OBTAIN A LIST OF SCHOOLS (SCHOOL FRAME) THAN A LIST OF STUDENTS (STUDENT FRAME).
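THE SCHOOL EXAMPLE ABOVE CAN BE SKETCHED AS FOLLOWS. THE FRAME CONTENTS (FOUR SCHOOLS AND THEIR STUDENT IDENTIFIERS) AND THE FUNCTION NAME `sample_students_via_schools` ARE HYPOTHETICAL. THE SKETCH ASSUMES THAT SCHOOLS ARE SELECTED WITH EQUAL PROBABILITY AND THAT ALL STUDENTS IN A SELECTED SCHOOL ENTER THE SAMPLE; OTHER SELECTION RULES ARE POSSIBLE.

```python
import random

# Hypothetical school frame: each sampling unit (school) lists its elements (students).
school_frame = {
    "School A": ["A1", "A2", "A3"],
    "School B": ["B1", "B2"],
    "School C": ["C1", "C2", "C3", "C4"],
    "School D": ["D1"],
}

def sample_students_via_schools(frame, n_schools, seed=0):
    """Select schools at random from the frame and return their students."""
    rng = random.Random(seed)                        # fixed seed for repeatability
    selected = rng.sample(sorted(frame), n_schools)  # sample of sampling units
    students = [s for school in selected for s in frame[school]]
    return selected, students

selected, students = sample_students_via_schools(school_frame, n_schools=2)
print("selected schools:", selected)
print("sampled students:", students)
```

ONLY A LIST OF SCHOOLS IS NEEDED TO DRAW THE SAMPLE; A LIST OF STUDENTS IS NEEDED ONLY FOR THE SCHOOLS THAT ARE SELECTED.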
AFTER SELECTION OF THE SAMPLE, MEASUREMENTS ARE MADE ON THE SAMPLE ELEMENTS (AND ALSO PERHAPS ON THE SAMPLING UNITS) (E.G., A STUDENT’S AGE; A TEACHER’S LEVEL OF EDUCATION; A SCHOOL’S TYPE OF OWNERSHIP; A HOSPITAL’S ANNUAL INCOME).
TWO MAJOR TYPES OF MEASUREMENT SCALES (“VARIABLES”): DISCRETE AND CONTINUOUS.
DISCRETE (NOMINAL/CATEGORICAL, ORDINAL/RANKING): CAN BE COUNTED (E.G., INTEGERS). EXAMPLES: GENDER (M OR F); EMPLOYMENT STATUS (EMPLOYED OR UNEMPLOYED); FAMILY SIZE; EDUCATIONAL LEVEL.
SPECIAL CASE: FOR A BINARY VARIABLE THE Xi’s ARE 0 OR 1 (E.G., MALE=0, FEMALE=1; ABSENCE OF SOME CONDITION = 0, PRESENCE OF THE CONDITION = 1).
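A SHORT SKETCH OF WHY THE 0/1 CODING IS CONVENIENT: THE SUM OF A BINARY VARIABLE IS A COUNT, AND ITS MEAN IS A PROPORTION. THE TEN VALUES BELOW ARE HYPOTHETICAL.

```python
# Hypothetical 0/1 codes for ten population elements (1 = condition present).
x = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

count_with_condition = sum(x)   # the total of a 0/1 variable is a count
proportion = sum(x) / len(x)    # the mean of a 0/1 variable is a proportion

print(count_with_condition, proportion)
```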
CONTINUOUS (INTERVAL, RATIO): DISTANCES / DIFFERENCES CAN BE MEASURED ON AN INTERVAL SCALE (REAL NUMBERS); EXAMPLES: AGE, HEIGHT, TEMPERATURE, BLOOD COUNT, INCOME
STATISTICAL THEORY GUIDES US IN SUMMARIZING AND ANALYZING THE SAMPLE, TO MAKE INFERENCES ABOUT THE POPULATION. IT ALSO GUIDES US IN THE DESIGN OF THE SURVEY, THE SAMPLE SELECTION PROCEDURES, AND THE SURVEY INSTRUMENTS (QUESTIONNAIRES, DATA COLLECTION FORMS).
THE ELEMENTS OF SURVEY DESIGN
1. SPECIFY POPULATION OF INTEREST
2. SPECIFY UNITS OF ANALYSIS AND ESTIMATES OF INTEREST
3. SPECIFY PRECISION OBJECTIVES OF THE SURVEY; RESOURCE CONSTRAINTS; POLITICAL CONSTRAINTS
4. SPECIFY OTHER VARIABLES OF INTEREST (EXPLANATORY VARIABLES, STRATIFICATION VARIABLES)
5. REVIEW POPULATION CHARACTERISTICS (DISTRIBUTIONAL, COST)
6. DEVELOP INSTRUMENTATION (DEVELOPMENT, PRETEST, PILOT TEST, RELIABILITY AND VALIDITY ANALYSIS)
7. DEVELOP SAMPLE DESIGN
8. DETERMINE SAMPLE SIZE AND ALLOCATION
9. SPECIFY SAMPLE SELECTION PROCEDURE
10. SPECIFY FIELD PROCEDURES
11. DETERMINE DATA PROCESSING PROCEDURES
12. DEVELOP DATA ANALYSIS PLAN
13. OUTLINE FINAL REPORT
(FROM “
DESCRIPTION (CHARACTERISTICS) OF A FINITE POPULATION OF SIZE N
LET X DENOTE A (NUMERICAL-VALUED) CHARACTERISTIC, SUCH AS AGE OR INCOME (X IS A “CONCEPT”). LET x DENOTE A PARTICULAR VALUE OF X (SUCH AS AN AGE OF 43).
(FOR BINARY DATA, THE VARIANCE IS $\sigma^2 = p(1-p)$, WHERE $p$ DENOTES THE PROPORTION OF 1’s)
ALSO ![]()
THE MEAN AND MEDIAN ARE MEASURES OF LOCATION, OR CENTRAL TENDENCY; THE VARIANCE AND STANDARD DEVIATION ARE MEASURES OF SPREAD, VARIATION, OR DISPERSION.
THE PRECEDING QUANTITIES ARE SINGLE-VALUED ATTRIBUTES (CHARACTERISTICS, “PARAMETERS”) THAT SUMMARIZE THE LOCATION AND SPREAD OF THE ATTRIBUTE. IN ADDITION, WE CAN SUMMARIZE THE POPULATION USING MORE COMPLEX REPRESENTATIONS, SUCH AS FREQUENCY DISTRIBUTIONS, CROSSTABULATIONS, AND TABLES OF MEANS.
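AS AN ILLUSTRATION OF THESE SUMMARY PARAMETERS, THE FOLLOWING PYTHON SKETCH COMPUTES THE MEAN, MEDIAN, VARIANCE AND STANDARD DEVIATION OF A SMALL FINITE POPULATION, USING THE STANDARD-LIBRARY `statistics` MODULE. THE DATA VALUES AND VARIABLE NAMES ARE HYPOTHETICAL, AND THE VARIANCE IS ASSUMED TO USE THE DIVISOR N (EACH ELEMENT WEIGHTED BY 1/N). FOR A 0/1 ATTRIBUTE THE MEAN IS THE PROPORTION OF 1’s, p, AND THE DIVISOR-N VARIANCE IS p(1 - p).

```python
from statistics import mean, median, pstdev, pvariance

# Hypothetical finite population of N = 8 ages (illustrative values only).
ages = [23, 35, 41, 43, 43, 52, 60, 71]

mu = mean(ages)            # population mean (measure of location)
med = median(ages)         # population median (measure of location)
sigma2 = pvariance(ages)   # population variance, divisor N (measure of spread)
sigma = pstdev(ages)       # population standard deviation

# Hypothetical binary attribute (0 = condition absent, 1 = condition present).
status = [1, 0, 0, 1, 1, 0, 1, 1]
p = mean(status)                      # proportion of 1's
binary_variance = pvariance(status)   # agrees with p * (1 - p)

print("mean:", mu, "median:", med, "variance:", sigma2, "sd:", sigma)
print("p:", p, "variance:", binary_variance, "p(1-p):", p * (1 - p))
```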
POPULATION PARAMETERS ARE USUALLY DENOTED BY LOWER-CASE GREEK LETTERS (E.G., $\mu$, $\sigma^2$) OR BY UPPER-CASE LATIN LETTERS (E.G., $\bar{X}$, $X$). USE OF AN UPPER-CASE LATIN LETTER FOR THE POPULATION TOTAL (X) MAY BE CONFUSING, HOWEVER, SINCE THAT IS THE SAME SYMBOL USED TO DENOTE THE UNDERLYING RANDOM VARIABLE (ALSO X). WE WILL USUALLY USE GREEK LETTERS TO DENOTE PARAMETERS, BUT NOT ALWAYS, IN ORDER TO FAMILIARIZE THE STUDENT WITH ALTERNATIVE NOTATION THAT IS IN COMMON USE.
NOTE ON FONTS:
TO ENHANCE READABILITY (ON THE COMPUTER SCREEN AND ON WALL PROJECTIONS), THESE NOTES ARE PRESENTED IN BLOCK LETTERS, USING THE MICROSOFT ARIAL FONT. MATHEMATICAL SYMBOLS ARE ITALICIZED, TO MAKE THEM EASIER TO DISTINGUISH FROM THE TEXT. MATHEMATICAL EXPRESSIONS ARE CONSTRUCTED USING MICROSOFT EQUATION EDITOR 3.0, WHICH USES THE MICROSOFT TIMES NEW ROMAN FONT, ITALICIZED. THERE ARE HENCE SOME SLIGHT DIFFERENCES BETWEEN SYMBOL FONTS IN THE TEXT AND IN THE FORMULAS (E.G., E(X) IN THE TEXT VS. $E(X)$ IN A FORMULA; f(x) AND g(x) IN TEXT VS. $f(x)$ AND $g(x)$ IN A FORMULA).
(THE USE OF TIMES FONT FOR THE TEXT WOULD DECREASE READABILITY, AND THE USE OF THE EQUATION EDITOR TO REPRESENT ALL SYMBOLS IN THE TEXT WOULD INTRODUCE VARIATIONS IN LINE SPACING, SIGNIFICANTLY INCREASE THE TIME REQUIRED TO TYPE THESE NOTES, AND SIGNIFICANTLY INCREASE THE FILE SIZE AND INTERNET DOWNLOAD TIME (SINCE FORMULAS ARE STORED AS SEPARATE FILES IN .htm DOCUMENTS).)
DESCRIPTION OF A FINITE POPULATION (CONT.)
FREQUENCY DISTRIBUTION, TABULAR FORM:

| INTERVAL | FREQUENCY | RELATIVE FREQUENCY |
|---|---|---|
| a0 – a1 | f1 | f1/N |
| a1 – a2 | f2 | f2/N |
| a2 – a3 | f3 | f3/N |
| … | … | … |
| ak-1 – ak | fk | fk/N |

N (POPULATION SIZE) = f1 + f2 + … + fk
VALUES FALLING ON AN INTERVAL BOUNDARY ARE ASSIGNED TO THE LOWER INTERVAL (I.E., THE VALUE a1 IS ASSIGNED TO THE CATEGORY a0 - a1, NOT TO a1 - a2).
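THE TABULATION RULE CAN BE SKETCHED IN PYTHON AS FOLLOWS. THE FUNCTION NAME `frequency_table`, THE BOUNDARY VALUES AND THE DATA ARE HYPOTHETICAL; THE ONLY RULE TAKEN FROM THE NOTES IS THAT A VALUE FALLING ON AN INTERVAL BOUNDARY IS COUNTED IN THE LOWER INTERVAL.

```python
from bisect import bisect_left

def frequency_table(values, boundaries):
    """Tabulate values into intervals a0-a1, a1-a2, ..., a(k-1)-ak.

    A value falling exactly on a boundary is assigned to the lower interval.
    Returns a list of (interval label, frequency, relative frequency) rows.
    """
    k = len(boundaries) - 1
    freq = [0] * k
    for v in values:
        if not boundaries[0] <= v <= boundaries[-1]:
            raise ValueError(f"value {v} lies outside the tabulated range")
        # bisect_left gives the index of the first boundary >= v, so a value
        # equal to boundary a_j is counted in the interval a_(j-1) - a_j.
        j = max(bisect_left(boundaries, v), 1)
        freq[j - 1] += 1
    n_total = len(values)
    return [
        (f"{boundaries[i]} - {boundaries[i + 1]}", freq[i], freq[i] / n_total)
        for i in range(k)
    ]

# Hypothetical ages; 18 and 64 fall on boundaries and go to the lower interval.
ages = [5, 18, 19, 30, 64, 65]
table = frequency_table(ages, [0, 18, 64, 120])
for label, f, rel in table:
    print(f"{label:>10}  {f:3d}  {rel:.2f}")
```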
EXAMPLE: AGE DISTRIBUTION OF THE POPULATION
| INTERVAL | FREQUENCY | RELATIVE FREQUENCY (PROPORTION) |
|---|---|---|
| 0-18 | 247 | .27 |
| 19-64 | 549 | .60 |
| 65+ | 113 | .12 |
| TOTAL | 909 | 1.00 |
SPECIAL CASE: DISCRETE VARIABLE HAVING A SMALL NUMBER OF CATEGORIES (SUCH AS GENDER, EMPLOYMENT STATUS, OR HOUSEHOLD SIZE). IN THIS CASE THE INTERVALS MAY INCLUDE A SINGLE NUMBER:
EXAMPLE: GENDER DISTRIBUTION OF THE POPULATION
| GENDER | FREQUENCY | RELATIVE FREQUENCY |
|---|---|---|
| MALE | 110 | .48 |
| FEMALE | 117 | .52 |
| TOTAL | 227 | 1.00 |
EXAMPLE: DISTRIBUTION OF HOUSEHOLD SIZE
| HOUSEHOLD SIZE | FREQUENCY |
|---|---|
| 1 | f1 |
| 2 | f2 |
| 3 | f3 |
| 4 | f4 |
| 5 | f5 |
| 6 | f6 |
| 7 | f7 |
| 8 | f8 |
| 9 | f9 |
| 10 | f10 |
| 11, 12, 13,… | f11, f12, f13,… |
DESCRIPTION OF A FINITE POPULATION (CONT.)
FREQUENCY DISTRIBUTIONS, GRAPHICAL FORM:
DISCRETE VARIABLES
FREQUENCY DISTRIBUTION OF GENDER (THE SUM OF THE FREQUENCIES IS N)

PROBABILITY DENSITY FUNCTION OF GENDER (THE SUM OF THE PROBABILITIES IS 1)

CONTINUOUS VARIABLES (OR ORDERED DISCRETE VARIABLES HAVING MANY VALUES)
FREQUENCY DISTRIBUTION OF AGE (HISTOGRAM)

PROBABILITY DENSITY FUNCTION OF AGE

DESCRIPTION OF A FINITE POPULATION (CONT.)
CROSSTABULATIONS (TABLES OF COUNTS AND MEANS)
(JOINT) FREQUENCY DISTRIBUTION OF POPULATION BY GENDER AND AGE

| AGE | MALE | FEMALE | BOTH SEXES |
|---|---|---|---|
| 0-18 | 20 | 150 | 170 |
| 19-64 | 150 | 550 | 700 |
| 65+ | 30 | 100 | 130 |
| ALL AGES | 200 | 800 | 1,000 |
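TABLE OF MEAN ANNUAL INCOME BY GENDER AND AGE (MARGINAL MEANS ARE WEIGHTED BY THE COUNTS IN THE PRECEDING TABLE)

| AGE | MALE | FEMALE | TOTAL |
|---|---|---|---|
| 0-18 | 1,000 | 800 | 824 |
| 19-64 | 30,000 | 35,000 | 33,929 |
| 65+ | 10,000 | 10,000 | 10,000 |
| TOTAL | 24,100 | 25,463 | 25,190 |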
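A MARGINAL (ROW OR COLUMN) MEAN IN A TABLE OF MEANS IS A COUNT-WEIGHTED AVERAGE OF THE CELL MEANS, NOT A SIMPLE AVERAGE OF THEM, SO AN OVERALL MEAN MUST ALWAYS LIE BETWEEN THE SMALLEST AND LARGEST OF THE GROUP MEANS IT COMBINES. THE FOLLOWING PYTHON SKETCH SHOWS THE CALCULATION, ASSUMING THAT THE CELL COUNTS OF THE JOINT FREQUENCY TABLE AND THE CELL MEANS OF THE INCOME TABLE DESCRIBE THE SAME POPULATION (AN ASSUMPTION MADE HERE FOR ILLUSTRATION). THE FUNCTION NAME `marginal_mean` IS HYPOTHETICAL.

```python
# Cell counts (age group, gender) from the joint frequency table.
counts = {
    ("0-18", "MALE"): 20, ("0-18", "FEMALE"): 150,
    ("19-64", "MALE"): 150, ("19-64", "FEMALE"): 550,
    ("65+", "MALE"): 30, ("65+", "FEMALE"): 100,
}
# Cell means of annual income from the table of means.
cell_means = {
    ("0-18", "MALE"): 1_000, ("0-18", "FEMALE"): 800,
    ("19-64", "MALE"): 30_000, ("19-64", "FEMALE"): 35_000,
    ("65+", "MALE"): 10_000, ("65+", "FEMALE"): 10_000,
}

def marginal_mean(cells):
    """Count-weighted mean of the cell means over the given cells."""
    cells = list(cells)
    total_count = sum(counts[c] for c in cells)
    total_income = sum(counts[c] * cell_means[c] for c in cells)
    return total_income / total_count

male_mean = marginal_mean(c for c in counts if c[1] == "MALE")
female_mean = marginal_mean(c for c in counts if c[1] == "FEMALE")
overall_mean = marginal_mean(counts)

print(male_mean, female_mean, overall_mean)
```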
STATISTICAL MODELS: REGRESSION EQUATIONS:
INCOME AS A FUNCTION OF EDUCATION: FORMULA OR TABLE
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e$
WHERE
y = INCOME
x1 = HAS HIGH SCHOOL DIPLOMA (0 OR 1)
x2 = HAS COLLEGE DEGREE (0 OR 1)
x7 = PARENTS HAVE COLLEGE DEGREE (0 OR 1)
x11 = NUMBER OF YEARS OF WORK EXPERIENCE
e = ERROR TERM
| Income \ Education | <12 years | HSD | BA/BS | MS | PhD | MD | Other Prof Degree | Other Degree | Other |
|---|---|---|---|---|---|---|---|---|---|
| <50K | | | | | | | | | |
| 50K-100K | | | | | | | | | |
| 100K-200K | | | | | | | | | |
| >200K | | | | | | | | | |
SOCIAL AND ECONOMIC IMPACT OF AN ECONOMIC DEVELOPMENT PROGRAM: FORMULA OR TABLE
| Gender \ Program Participation | Non-Participant | Participant |
|---|---|---|
| Female | | |
| Male | | |
FACTORS AFFECTING SAMPLE SURVEY DESIGN
THE DESIGN OF THE SAMPLE SURVEY (E.G., CHOICE OF SAMPLING UNITS, SAMPLE SIZES) WILL DEPEND ON WHAT THE ESTIMATION OBJECTIVES ARE, AND THE COSTS INVOLVED (E.G., INSTRUMENT PREPARATION, PRETESTING, SAMPLE DESIGN COSTS, SAMPLING COSTS, ANALYSIS COSTS).
THE OBJECTIVE OF SAMPLE SURVEY DESIGN IS TO ENABLE THE PRODUCTION OF ESTIMATES, OF DESIRED QUANTITIES, THAT ARE OF ADEQUATE ACCURACY (HIGH PRECISION, LOW BIAS) AND ACCEPTABLE COST, TO SUPPORT DECISIONS / ACTIONS.
TWO MAIN CLASSES OF SAMPLE SURVEYS: DESCRIPTIVE SURVEYS (ENUMERATIVE SURVEYS) AND ANALYTICAL SURVEYS (TO SUPPORT MODEL DEVELOPMENT – SIMILAR TO DESIGN OF EXPERIMENTS). THIS COURSE WILL ADDRESS BOTH TYPES OF SURVEYS.
DESCRIPTIVE SURVEYS FOCUS ON ESTIMATION OF OVERALL POPULATION (OR SUBPOPULATION) CHARACTERISTICS (SUCH AS MEANS OR TOTALS). ANALYTICAL SURVEYS FOCUS ON ESTIMATION OF RELATIONSHIPS AMONG VARIABLES AND ON TESTS OF HYPOTHESIS (E.G., IS IT REASONABLE TO CONCLUDE THAT TWO POPULATIONS COULD HAVE BEEN GENERATED BY THE SAME PROBABILITY DISTRIBUTION; OR, DOES AN ECONOMIC DEVELOPMENT PROGRAM HAVE A POSITIVE ECONOMIC IMPACT).
TYPES OF SAMPLING
SIMPLEST FORM OF RANDOM SAMPLING: SIMPLE RANDOM SAMPLING WITHOUT REPLACEMENT
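AS A MINIMAL SKETCH OF SIMPLE RANDOM SAMPLING WITHOUT REPLACEMENT, THE FOLLOWING PYTHON CODE DRAWS n DISTINCT SAMPLING UNITS FROM A FRAME, EVERY SUBSET OF SIZE n BEING EQUALLY LIKELY. THE FRAME OF SCHOOL IDENTIFIERS, THE FUNCTION NAME `srswor` AND THE SEED ARE HYPOTHETICAL; THE SEED IS FIXED ONLY SO THAT THE SELECTION CAN BE REPRODUCED.

```python
import random

def srswor(frame, n, seed=None):
    """Draw a simple random sample of n units, without replacement."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical frame of 20 schools.
frame = [f"SCHOOL-{i:03d}" for i in range(1, 21)]
sample = srswor(frame, 5, seed=2010)
print(sample)
```

BECAUSE THE SAMPLING IS WITHOUT REPLACEMENT, NO UNIT CAN APPEAR IN THE SAMPLE MORE THAN ONCE.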
SOME BASIC CONCEPTS OF PROBABILITY AND STATISTICS
PROBABILITY THEORY
CONSIDER AN EXPERIMENT, WHICH, WHEN PERFORMED, HAS AN OUTCOME (THE RESULT OF THE EXPERIMENT)
SAMPLE SPACE: THE SET OF ALL POSSIBLE OUTCOMES OF AN EXPERIMENT
EXAMPLES:
COIN-TOSSING EXPERIMENT: HEAD, TAIL
A PERSON SELECTED IN A SURVEY: JOHN SMITH, MARY JONES,…
THE GENDER OF A PERSON SELECTED IN A SURVEY: M, F
A HOUSEHOLD SELECTED IN A SURVEY: THE SMITH FAMILY, THE JONES FAMILY
THE SIZE OF A HOUSEHOLD SELECTED IN A SURVEY: 0, 1, 2, 3, 4,…
AN OPINION: DISAGREE STRONGLY, DISAGREE MILDLY, NEITHER AGREE NOR DISAGREE, AGREE MILDLY, AGREE STRONGLY
IN SAMPLE SURVEY, THE EXPERIMENT IS THE SELECTION (“DRAWING”) OF A SAMPLE UNIT. THE SAMPLE SPACE IS THE SET (COLLECTION) OF ALL SAMPLING UNITS.
THE PROBABILITY ASSOCIATED WITH A SAMPLE UNIT IS THE RELATIVE FREQUENCY WITH WHICH THAT UNIT WOULD BE SELECTED, IN REPEATED DRAWINGS.
IN THE SIMPLEST CASE, THE PROBABILITY OF SELECTION OF EACH SAMPLE UNIT IS THE SAME (I.E., 1/N IN THE CASE OF A SINGLE DRAW). THIS IS REFERRED TO AS SAMPLING WITH EQUAL PROBABILITIES.
IN SAMPLE SURVEY, IT IS FREQUENTLY THE CASE THAT THE SAMPLE UNITS ARE SELECTED WITH UNEQUAL PROBABILITIES. HOW TO SPECIFY THOSE PROBABILITIES, AND HOW TO SELECT THE SAMPLE ACCORDINGLY, IS THE CENTRAL PROBLEM OF SAMPLE SURVEY DESIGN.
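THE RELATIVE-FREQUENCY INTERPRETATION OF SELECTION PROBABILITIES CAN BE CHECKED BY SIMULATION. IN THE FOLLOWING PYTHON SKETCH, A SINGLE UNIT IS DRAWN REPEATEDLY WITH UNEQUAL PROBABILITIES, AND THE OBSERVED RELATIVE FREQUENCIES ARE COMPARED WITH THE SPECIFIED PROBABILITIES. THE UNITS, THEIR "SIZES" AND THE CHOICE OF PROBABILITIES PROPORTIONAL TO SIZE ARE ASSUMPTIONS MADE FOR ILLUSTRATION ONLY.

```python
import random
from collections import Counter

# Hypothetical sampling units with a measure of size (e.g., enrollment).
units = ["A", "B", "C", "D"]
sizes = [10, 20, 30, 40]

# Selection probabilities proportional to size (they sum to one).
probs = [s / sum(sizes) for s in sizes]

rng = random.Random(12345)   # fixed seed for reproducibility
draws = 100_000
counts = Counter(rng.choices(units, weights=probs, k=draws))

# Relative frequency of selection of each unit in repeated drawings.
rel_freq = {u: counts[u] / draws for u in units}
for u, p in zip(units, probs):
    print(u, "specified:", round(p, 3), "observed:", round(rel_freq[u], 3))
```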
OPTIONAL (SOME ADDITIONAL INFORMATION ABOUT PROBABILITIES, INCLUDED FOR STUDENTS HAVING MATHEMATICAL BACKGROUND):
EVENT: A SUBSET (PART) OF THE SAMPLE SPACE (A COLLECTION OF OUTCOMES). EXAMPLES: HEAD (IN A COIN-TOSSING EXPERIMENT). AN INCOME OF $50,000 (OF A RESPONDENT TO A SURVEY). USUALLY DENOTED BY A, B, C,….
EVENT SPACE: THE COLLECTION OF ALL EVENTS.
AN OUTCOME IS REFERRED TO AS A “SIMPLE EVENT”
THE PROBABILITY OF AN EVENT: THE RELATIVE FREQUENCY WITH WHICH THE EVENT OCCURS IN REPETITIONS OF AN EXPERIMENT.
NOTATION:
OUTCOMES (“SIMPLE EVENTS”) a, b, c,… or A, B, C,…
PROBABILITY OF AN EVENT, A = Prob(A) = Pr(A) = P(A)
COMPOUND EVENTS:
PROBABILITY OF A OR B = P(A UNION B) = P(A + B)
PROBABILITY OF A AND B = P(A INTERSECT B) = P(AB)
THE PROBABILITY FUNCTION, P(.), SPECIFIES THE
PROBABILITY OF EACH EVENT. ITS
DEFINITION OF CONDITIONAL PROBABILITY:
PROBABILITY OF A GIVEN (CONDITIONAL ON) B = P(A|B) = P(AB)/P(B) IF P(B)>0
DEFINITION OF INDEPENDENT EVENTS:
EVENTS A AND B ARE INDEPENDENT IF ANY ONE OF THE FOLLOWING THREE CONDITIONS HOLDS:
P(AB) = P(A)P(B)
P(A|B) = P(A) IF P(B)>0
P(B|A) = P(B) IF P(A)>0
RULES FOR WORKING WITH PROBABILITIES:
P(A + B) = P(A) + P(B) – P(AB)
P(AB) = P(A|B)P(B) = P(A)P(B|A)
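THESE RULES CAN BE VERIFIED ON A FINITE POPULATION. THE FOLLOWING PYTHON SKETCH USES THE CELL COUNTS OF THE JOINT FREQUENCY DISTRIBUTION BY GENDER AND AGE GIVEN EARLIER IN THESE NOTES, AND ASSUMES THAT ONE PERSON IS DRAWN WITH EQUAL PROBABILITY 1/N. THE EVENTS A ("PERSON IS MALE") AND B ("PERSON IS AGED 0-18") AND THE FUNCTION NAMES ARE CHOSEN HERE FOR ILLUSTRATION. EXACT FRACTIONS ARE USED SO THAT THE IDENTITIES HOLD WITHOUT ROUNDING ERROR.

```python
from fractions import Fraction

# Cell counts (age group, gender) from the joint frequency table.
counts = {
    ("0-18", "MALE"): 20, ("0-18", "FEMALE"): 150,
    ("19-64", "MALE"): 150, ("19-64", "FEMALE"): 550,
    ("65+", "MALE"): 30, ("65+", "FEMALE"): 100,
}
N = sum(counts.values())

def prob(event):
    """P(event) for one equal-probability draw; event is a test on a cell."""
    return Fraction(sum(n for cell, n in counts.items() if event(cell)), N)

def is_male(cell):
    return cell[1] == "MALE"

def is_young(cell):
    return cell[0] == "0-18"

p_a = prob(is_male)                                   # P(A)
p_b = prob(is_young)                                  # P(B)
p_ab = prob(lambda c: is_male(c) and is_young(c))     # P(AB)
p_a_or_b = prob(lambda c: is_male(c) or is_young(c))  # P(A + B)
p_a_given_b = p_ab / p_b                              # P(A|B)

print("P(A + B) =", p_a_or_b, "=", p_a + p_b - p_ab)
print("P(A|B) =", p_a_given_b, " P(A) =", p_a)
print("independent:", p_ab == p_a * p_b)
```

IN THIS POPULATION P(A|B) DIFFERS FROM P(A), SO GENDER AND AGE GROUP ARE NOT INDEPENDENT.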
RANDOM VARIABLES AND DISTRIBUTION FUNCTIONS
RANDOM VARIABLE: A NUMERICAL-VALUED FUNCTION WHOSE VALUE DEPENDS ON THE OUTCOME OF AN EXPERIMENT. (IN MATHEMATICS: “A REAL-VALUED FUNCTION DEFINED ON A SAMPLE SPACE”; IT IS NOT A VARIABLE, BUT A FUNCTION.) DENOTED BY X(.) OR X.
A PARTICULAR VALUE OF THE RANDOM VARIABLE WILL BE DENOTED IN LOWER CASE (E.G., x IS A PARTICULAR VALUE (RESULT OF A PARTICULAR EXPERIMENT) OF THE RANDOM VARIABLE X).
EXAMPLES OF RANDOM VARIABLES:
DISCRETE RANDOM VARIABLES (NOMINAL, ORDINAL; SMALL INTEGERS):
HIV STATUS OF A PERSON IN A SURVEY: NOT INFECTED: 0; INFECTED: 1
GENDER OF INDIVIDUALS IN A SURVEY: FEMALE: 0; MALE: 1
SIZE OF A FAMILY IN A SURVEY: 0, 1, 2, 3, 4, 5, …
AGE CATEGORY: 0-17: 0; 18-64: 1; 65+: 2
INCOME CATEGORY: 0-$50,000 / YR: 1; 50,001 – 100,000 /YR: 2; 100,001 + /YR: 3
OPINION RESPONSE (“LIKERT SCALE”): DISAGREE STRONGLY: 1; DISAGREE MILDLY: 2; NEITHER AGREE NOR DISAGREE: 3; AGREE MILDLY: 4; AGREE STRONGLY: 5
CONTINUOUS RANDOM VARIABLES (INTERVAL OR RATIO SCALES OF MEASUREMENT):
THE AGE OF SOMEONE SELECTED IN A SURVEY (IN YEARS)
THE ANNUAL INCOME OF A FAMILY SELECTED IN A SURVEY (IN DOLLARS; NOT REALLY CONTINUOUS (DOLLARS OR THOUSANDS OF DOLLARS), BUT CLOSE ENOUGH – IT IS THE CONCEPTUAL MEASUREMENT SCALE THAT COUNTS)
A RANDOM VARIABLE IS A FUNCTION (OF THE OUTCOME) THAT HAS A NUMERICAL VALUE, WHEREAS THE OUTCOME OF AN EXPERIMENT MAY SIMPLY BE A NON-NUMERICAL ABSTRACT CONCEPT, SUCH AS A “HEAD” OR “TAIL” IN A COIN-TOSSING EXPERIMENT, A SAMPLE UNIT (SCHOOL, HOSPITAL, PERSON) SELECTED IN A SURVEY, OR GENDER (MALE, FEMALE) OF A PERSON IN A SURVEY.
RANDOM VARIABLES ARE USUALLY DENOTED BY UPPER-CASE LETTERS NEAR THE END OF THE ALPHABET, SUCH AS X, Y, Z,….
NEXT: PROPERTIES OF RANDOM VARIABLES:
EXPECTATION
VARIANCE
PROBABILITY DISTRIBUTION
OPTIONAL: CUMULATIVE DISTRIBUTION FUNCTION, FX(.), OF A RANDOM VARIABLE, X: FX(x) = P(X ≤ x) = Prob(THE SET OF ALL OUTCOMES, a, SUCH THAT X(a) ≤ x) FOR EVERY REAL NUMBER x.
![]()
EXAMPLES OF CUMULATIVE DISTRIBUTION FUNCTIONS:
EXAMPLE 1, DISCRETE DISTRIBUTION: COIN TOSSING (TAIL=0, HEAD=1):

EXAMPLE 2, DISCRETE DISTRIBUTION: HOUSEHOLD INCOME IN A SURVEY OF HOUSEHOLDS

EXAMPLE 3, CONTINUOUS DISTRIBUTION: HOUSEHOLD INCOME IN A SURVEY OF HOUSEHOLDS

PROBABILITY DENSITY FUNCTIONS
DISCRETE RANDOM VARIABLES
THE PROBABILITY FUNCTION, OR DISCRETE DENSITY FUNCTION, OF A RANDOM VARIABLE, X, HAVING VALUES x1, x2, x3,…IS DEFINED AS:
$f_X(x) = P(X = x_j)$ IF $x = x_j$, $j = 1, 2, 3, \ldots$, AND $f_X(x) = 0$ OTHERWISE.
EXAMPLE: PROBABILITY FUNCTION OF HOUSEHOLD SIZE IN SURVEY OF HOUSEHOLDS

THE SUM OF THE PROBABILITIES IS EQUAL TO ONE.
OPTIONAL: CONTINUOUS RANDOM VARIABLES
LET FX(.) BE THE CUMULATIVE DISTRIBUTION FUNCTION OF THE RANDOM VARIABLE X. THE RANDOM VARIABLE X IS CONTINUOUS IF THERE EXISTS A FUNCTION fX(.) SUCH THAT
$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$
FOR EVERY REAL NUMBER x. THE FUNCTION fX(.) IS CALLED THE PROBABILITY DENSITY FUNCTION OF X.
EXAMPLE: PROBABILITY DENSITY FUNCTION OF HOUSEHOLD INCOME IN SURVEY OF HOUSEHOLDS

THE AREA UNDER THE CURVE IS EQUAL TO ONE.
EXPECTATION AND VARIANCE OF A RANDOM VARIABLE
EXPECTATION, OR MEAN, OR EXPECTED VALUE:
DISCRETE CASE: $E(X) = \mu_X = \sum_j x_j f_X(x_j)$
OPTIONAL: CONTINUOUS CASE: $E(X) = \mu_X = \int_{-\infty}^{\infty} x f_X(x)\,dx$
THE MEAN IS THE CENTER OF GRAVITY (CENTROID) OF THE UNIT MASS DETERMINED BY THE DENSITY FUNCTION.
VARIANCE: (EXPECTATION OF SQUARED DEVIATIONS FROM THE MEAN):
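DISCRETE CASE: $\mathrm{var}(X) = \sigma_X^2 = \sum_j (x_j - \mu_X)^2 f_X(x_j)$
OPTIONAL: CONTINUOUS CASE: $\mathrm{var}(X) = \sigma_X^2 = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\,dx$
STANDARD DEVIATION = SQUARE ROOT OF VARIANCE: $\sigma_X = \sqrt{\mathrm{var}(X)}$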
NOTE: THE ABOVE FORMULAS FOR THE MEAN AND VARIANCE PRODUCE THE SAME RESULTS (IN THE DISCRETE CASE) AS THE FORMULAS GIVEN EARLIER FOR THE MEAN AND VARIANCE OF THE FINITE POPULATION (WHERE THE PROBABILITY ASSIGNED TO EACH MEMBER OF THE POPULATION IS 1/N). ALL THAT IS DIFFERENT IS THAT THE ABOVE FORMULAS ARE BASED ON THE PROBABILITY DENSITY FUNCTION OF A RANDOM VARIABLE, WHEREAS THE ORIGINAL FORMULAS WERE INTRODUCED BEFORE THE CONCEPTS OF PROBABILITY, RANDOM VARIABLE, AND THE PROBABILITY DENSITY FUNCTION OF A RANDOM VARIABLE WERE INTRODUCED. THE FORMULAS ARE DIFFERENT, BUT THE RESULTS ARE EXACTLY THE SAME.
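THE EQUIVALENCE OF THE TWO SETS OF FORMULAS CAN BE CHECKED NUMERICALLY. THE FOLLOWING PYTHON SKETCH COMPUTES THE MEAN AND VARIANCE OF A SMALL FINITE POPULATION IN TWO WAYS: DIRECTLY FROM THE N ELEMENTS (EACH WITH PROBABILITY 1/N), AND FROM THE PROBABILITY FUNCTION OF THE DISTINCT VALUES. THE HOUSEHOLD SIZES ARE HYPOTHETICAL, AND EXACT FRACTIONS ARE USED SO THAT THE COMPARISON IS EXACT.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical population of N = 10 household sizes.
household_sizes = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
N = len(household_sizes)

# (1) Finite-population formulas: each element has probability 1/N.
mu_pop = Fraction(sum(household_sizes), N)
var_pop = sum((x - mu_pop) ** 2 for x in household_sizes) / N

# (2) Probability function of the distinct values: f(x) = frequency / N.
f = {x: Fraction(c, N) for x, c in Counter(household_sizes).items()}
mu_pmf = sum(x * fx for x, fx in f.items())
var_pmf = sum((x - mu_pmf) ** 2 * fx for x, fx in f.items())

print("sum of probabilities:", sum(f.values()))
print("mean:", mu_pop, mu_pmf)
print("variance:", var_pop, var_pmf)
```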
THE REASON FOR INTRODUCING THE STATISTICAL THEORY IS NOT TO COMPLICATE THINGS UNNECESSARILY, BUT TO LEAD TO A BETTER UNDERSTANDING OF THE CONCEPTS TO BE INTRODUCED NEXT: STATISTICS, ESTIMATORS, AND SAMPLING DISTRIBUTIONS.
THE PRIMARY GOAL OF SAMPLE SURVEY IS TO OBTAIN ESTIMATES OF THE MEAN AND VARIANCE (AND OTHER QUANTITIES) OF POPULATIONS OF INTEREST, BASED ON SAMPLES FROM THOSE POPULATIONS, AND TO MAKE STATEMENTS ABOUT THE ACCURACY OF THOSE ESTIMATES. THE THEORY OF STATISTICS ENABLES US TO DO THIS.
OPTIONAL: SOME RULES FOR WORKING WITH RANDOM VARIABLES:
IF X AND Y ARE TWO RANDOM VARIABLES, THEN
E(cX) = c E(X)
var(cX) = c² var(X)
E(X + Y) = E(X) + E(Y).
IF X AND Y ARE INDEPENDENT, THEN var(X + Y) = var(X) + var(Y).
IF g(.) IS A FUNCTION, THEN THE EXPECTATION OF g(X) IS DEFINED AS
$E(g(X)) = \sum_j g(x_j) f_X(x_j)$ IF X IS DISCRETE, AND
$E(g(X)) = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx$ IF X IS CONTINUOUS.
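CHEBYCHEV (TCHEBYCHEFF) INEQUALITY: IF g(.) IS A NONNEGATIVE FUNCTION, THEN
$P(g(X) \ge k) \le E(g(X))/k$ FOR EVERY k>0.
IF WE SET $g(x) = (x - \mu)^2$ AND $k = r^2\sigma^2$, WHERE μ AND σ² DENOTE THE MEAN AND VARIANCE OF X, WE OBTAIN:
$P((X - \mu)^2 \ge r^2\sigma^2) \le 1/r^2$ FOR EVERY r>0
OR
$P(|X - \mu| < r\sigma) \ge 1 - 1/r^2$.
FOR r = 2, WE HAVE $P(\mu - 2\sigma < X < \mu + 2\sigma) \ge 3/4$, OR FOR ANY RANDOM VARIABLE X HAVING FINITE VARIANCE AT LEAST THREE-FOURTHS OF THE PROBABILITY FALLS WITHIN TWO STANDARD DEVIATIONS OF THE MEAN. (THIS BOUND IS NOT VERY SHARP, SINCE IT MUST HOLD FOR ALL FINITE-VARIANCE RANDOM VARIABLES.)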
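THE CHEBYCHEV BOUND CAN BE COMPARED WITH THE ACTUAL SHARE OF A POPULATION LYING WITHIN r STANDARD DEVIATIONS OF THE MEAN. THE FOLLOWING PYTHON SKETCH DOES THIS FOR A DELIBERATELY SKEWED, HYPOTHETICAL POPULATION OF INCOMES (IN THOUSANDS); THE FUNCTION NAME `share_within` IS ALSO HYPOTHETICAL. THE STANDARD DEVIATION USES THE DIVISOR N, SO THE INEQUALITY APPLIES EXACTLY.

```python
from statistics import mean, pstdev

# Hypothetical, strongly skewed population of incomes (thousands).
incomes = [12, 15, 18, 20, 22, 25, 30, 35, 60, 250]
mu = mean(incomes)
sigma = pstdev(incomes)   # divisor N

def share_within(r):
    """Proportion of the population strictly within r standard deviations."""
    inside = sum(1 for x in incomes if abs(x - mu) < r * sigma)
    return inside / len(incomes)

for r in (1.5, 2, 3):
    bound = 1 - 1 / r ** 2
    print(f"r = {r}: actual share {share_within(r):.2f}, bound {bound:.2f}")
```

THE ACTUAL SHARE IS NEVER BELOW THE BOUND, BUT IT IS TYPICALLY WELL ABOVE IT, WHICH IS WHY THE BOUND IS OF LIMITED PRACTICAL USE.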
JENSEN’S INEQUALITY. IF g(.) IS A CONVEX FUNCTION, THEN $E(g(X)) \ge g(E(X)) = g(\mu)$, WHERE μ DENOTES THE EXPECTATION OF X, E(X).
IF $a_i$ DENOTES THE i-TH ELEMENT OF A FINITE POPULATION (SAMPLE SPACE), AND Prob($a_i$) = $p_i$, LET US DEFINE THE RANDOM VARIABLE X(.) AS $X(a_i) = x_i/p_i$, WHERE $x_i$ DENOTES SOME ATTRIBUTE (SUCH AS INCOME). THEN $E(X) = \sum_{i=1}^{N} p_i\,(x_i/p_i) = \sum_{i=1}^{N} x_i$, THE POPULATION TOTAL.
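THIS RESULT HOLDS FOR ANY SET OF NONZERO SELECTION PROBABILITIES, AS THE FOLLOWING PYTHON SKETCH ILLUSTRATES. THE ATTRIBUTE VALUES, THE TWO SETS OF PROBABILITIES AND THE FUNCTION NAME ARE HYPOTHETICAL. THE EXPECTATION OF $x_i/p_i$ EQUALS THE POPULATION TOTAL IN BOTH CASES, ALTHOUGH THE VARIANCE DEPENDS ON THE PROBABILITIES CHOSEN.

```python
from fractions import Fraction

# Hypothetical attribute values x_i for a population of N = 4 elements.
x = [12, 30, 8, 50]
total = sum(x)

def expectation_and_variance(p):
    """E(X) and var(X) for X(a_i) = x_i / p_i, where Prob(a_i) = p_i."""
    values = [Fraction(xi) / pi for xi, pi in zip(x, p)]
    e = sum(pi * v for pi, v in zip(p, values))
    var = sum(pi * (v - e) ** 2 for pi, v in zip(p, values))
    return e, var

unequal = [Fraction(1, 10), Fraction(3, 10), Fraction(2, 10), Fraction(4, 10)]
equal = [Fraction(1, 4)] * 4

e_unequal, var_unequal = expectation_and_variance(unequal)
e_equal, var_equal = expectation_and_variance(equal)
print("total:", total)
print("unequal probabilities:", e_unequal, var_unequal)
print("equal probabilities:  ", e_equal, var_equal)
```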
(DESCRIPTIVE) SAMPLING THEORY: SAMPLE; PROBABILITY SAMPLE
WE WILL DRAW CONCLUSIONS (MAKE INFERENCES) ABOUT THE POPULATION BASED ON A SAMPLE SELECTED FROM THE POPULATION. THIS IS INDUCTIVE INFERENCE, NOT DEDUCTIVE INFERENCE, SINCE OUR CONCLUSIONS ARE NOT MADE WITH CERTAINTY.
A SAMPLE IS A COLLECTION OF SAMPLING UNITS, X1, X2,…,Xn DRAWN FROM A FRAME. THE SIZE (NUMBER OF UNITS) OF THE SAMPLE IS DENOTED BY n. THE SAMPLE WILL BE DRAWN IN A SPECIAL WAY, DEPENDING ON THE OBJECTIVES OF THE SURVEY.
MOST AREAS OF APPLIED STATISTICS (EXPERIMENTAL DESIGN, QUALITY CONTROL, RELIABILITY) DEAL WITH RANDOM SAMPLES. A RANDOM SAMPLE IS USUALLY DEFINED AS A SET OF INDEPENDENT AND IDENTICALLY DISTRIBUTED RANDOM VARIABLES (ALL DEFINED ON THE SAME SAMPLE SPACE).
THIS IS NOT THE TYPE OF SAMPLE THAT IS TYPICALLY USED
IN (DESCRIPTIVE) SAMPLE SURVEY. IN
SAMPLE SURVEY, SOME SAMPLE ELEMENTS MAY BE SELECTED FROM SUBSETS OF THE SAMPLE
SPACE (POPULATION), AND THEY ARE NOT NECESSARILY INDEPENDENT. FURTHERMORE, THE PROBABILITY DENSITY FUNCTION
USUALLY HAS NO SPECIFIC FORM (E.G.,
THE SAMPLES IN SAMPLE SURVEY ARE PROBABILITY SAMPLES (EACH UNIT OF THE SAMPLE IS SELECTED WITH A KNOWN, NONZERO PROBABILITY), BUT NOT THE USUAL “RANDOM SAMPLES” OF STATISTICS (INDEPENDENT AND IDENTICALLY DISTRIBUTED RANDOM VARIABLES).
THE INFERENCES ARE BEING MADE ABOUT THE PARTICULAR FINITE POPULATION AT HAND, NOT ABOUT A HYPOTHETICAL PROCESS THAT MAY HAVE GENERATED IT (CREATED IT AS A SAMPLE, OR “REALIZATION,” FROM A “SUPERPOPULATION” OF POPULATIONS). (IN DAY 2 WE WILL CONSIDER ANALYTICAL SURVEY DESIGNS, WHICH ARE BASED ON A MODEL OF A HYPOTHETICAL PROCESS THAT MAY BE CONSIDERED TO HAVE GENERATED THE PARTICULAR POPULATION AT HAND.)
(FOR DISCUSSION OF THIS CONCEPT, SEE “HISTORY AND DEVELOPMENT OF THE THEORETICAL FOUNDATIONS OF SURVEY BASED ESTIMATION AND ANALYSIS,” BY J. N. K. RAO AND D. R. BELLHOUSE, SURVEY METHODOLOGY, VOL. 16, NO. 1, PP. 3-29 (JUNE 1990) STATISTICS CANADA. RAO AND BELLHOUSE DISCUSS THREE APPROACHES TO SAMPLE SURVEY: DESIGN-BASED (CORRESPONDING TO DESCRIPTIVE SURVEYS), MODEL-DEPENDENT AND MODEL-BASED (OR MODEL-ASSISTED) (THE LATTER TWO CORRESPONDING TO ANALYTICAL SURVEYS). A SAMPLING TEXT THAT INCLUDES DISCUSSION OF THESE CONCEPTS IS SAMPLING: DESIGN AND ANALYSIS BY SHARON L. LOHR (DUXBURY PRESS, 1999). MOST OLDER BOOKS ON SAMPLING DO NOT DISCUSS THESE CONCEPTS.)
SAMPLING THEORY (CONT.): STATISTIC; ESTIMATOR
FOR EACH POPULATION QUANTITY OF INTEREST (E.G., POPULATION MEAN, SUBPOPULATION MEANS), WE WISH TO ESTIMATE THE QUANTITY FROM THE SAMPLE.
A STATISTIC IS ANY QUANTITY THAT CAN BE CALCULATED FROM THE SAMPLE (I.E., A FUNCTION OF THE SAMPLE). (A STATISTIC IS A RANDOM VARIABLE; IT DOES NOT DEPEND ON ANY UNKNOWN PARAMETERS, SUCH AS μ OR σ².)
AN ESTIMATOR IS A STATISTIC USED TO ESTIMATE A POPULATION CHARACTERISTIC (PARAMETER, SUCH AS A MEAN OR TOTAL). (IF A STATISTIC IS USED TO ESTIMATE A POPULATION PARAMETER, THEN IT IS AN ESTIMATOR.)
EXAMPLE OF AN ESTIMATOR: USE THE SAMPLE MEAN TO ESTIMATE THE POPULATION MEAN. (THE VALUE OF THE ESTIMATOR, CALCULATED FROM A PARTICULAR SAMPLE, IS CALLED THE ESTIMATE. THE ESTIMATOR IS A FUNCTION (FORMULA); THE ESTIMATE IS A NUMBER.)
TWO TYPES OF ESTIMATION (ESTIMATORS): POINT ESTIMATION AND INTERVAL ESTIMATION. WE CONSIDER POINT ESTIMATION FIRST.
THE ACADEMIC DISCIPLINE OF SAMPLE SURVEY IS CONCERNED WITH IDENTIFYING “GOOD” ESTIMATORS (ESTIMATORS THAT ARE IN SOME SENSE “CLOSE” TO THE POPULATION VALUES THEY ARE INTENDED TO ESTIMATE), AND WITH DETERMINING SAMPLE DESIGNS THAT ASSURE THAT THE ESTIMATES ARE AS “CLOSE” AS DESIRED.
THE APPLIED FIELD OF SAMPLE SURVEY DESIGN AND ANALYSIS IS CONCERNED WITH KNOWING THOSE ESTIMATORS AND SAMPLE DESIGN PROCEDURES.
SAMPLING THEORY (CONT.): PRECISION, TRUENESS/BIAS, ACCURACY
PROPERTIES OF ESTIMATORS:
PRECISION: HOW MUCH VARIATION IS THERE IN THE ESTIMATE, FROM SAMPLE TO SAMPLE
TRUENESS: ON AVERAGE, HOW “CLOSE” IS THE ESTIMATE TO THE POPULATION CHARACTERISTIC BEING ESTIMATED
ACCURACY: A COMBINATION OF PRECISION AND TRUENESS
ISO-5725 (“ACCURACY (TRUENESS AND PRECISION) OF MEASUREMENT METHODS AND RESULTS”) USES THE TERM “TRUENESS”; STATISTICIANS USUALLY USE THE TERM “BIAS” (WHICH HAS THE REVERSE HIGH/LOW SENSE OF TRUENESS).
OTHER TERMS FOR THE CONCEPT OF PRECISION ARE REPEATABILITY AND RELIABILITY (ALL IN THE SAME HIGH/LOW SENSE); AND VARIABILITY, SPREAD AND DISPERSION (REVERSE SENSE).
OTHER TERMS FOR THE CONCEPT OF TRUENESS ARE VALIDITY AND UNBIASEDNESS (SAME HIGH/LOW SENSE); AND BIAS (REVERSE SENSE).
GRAPHIC ILLUSTRATION OF RELATIONSHIP BETWEEN PRECISION, BIAS AND ACCURACY.

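THE PRECISION (OF AN ESTIMATOR, X) WILL BE MEASURED BY THE VARIANCE:
$\mathrm{var}(X) = \sigma_X^2 = E(X - \mu_X)^2$; WHERE $\mu_X = E(X)$, OR THE STANDARD DEVIATION: $\sigma_X = \sqrt{\mathrm{var}(X)}$.
THE TRUENESS OF AN ESTIMATOR WILL BE MEASURED BY THE BIAS, WHICH IS DEFINED AS THE DIFFERENCE BETWEEN THE EXPECTATION OF THE ESTIMATOR AND THE POPULATION PARAMETER OF WHICH IT IS AN ESTIMATE:
$\mathrm{bias}(X) = E(X) - \theta$, WHERE θ DENOTES THE POPULATION PARAMETER.
AN ESTIMATOR IS UNBIASED IF ITS EXPECTATION IS EQUAL TO THE POPULATION PARAMETER OF WHICH IT IS AN ESTIMATE.
ACCURACY WILL BE MEASURED BY THE MEAN SQUARED ERROR:
$\mathrm{MSE}(X) = E(X - \theta)^2 = \mathrm{var}(X) + (\mathrm{bias}(X))^2$.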
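THE MEAN SQUARED ERROR COMBINES PRECISION AND TRUENESS: IT EQUALS THE VARIANCE PLUS THE SQUARE OF THE BIAS. THE FOLLOWING PYTHON SKETCH COMPUTES THESE QUANTITIES FOR TWO HYPOTHETICAL ESTIMATORS OF A PARAMETER θ = 10, EACH DESCRIBED BY AN ASSUMED SAMPLING DISTRIBUTION (POSSIBLE ESTIMATES AND THEIR PROBABILITIES). THE DISTRIBUTIONS AND THE FUNCTION NAME `summarize` ARE INVENTED FOR ILLUSTRATION.

```python
from fractions import Fraction as F

def summarize(sampling_dist, theta):
    """Return (expectation, variance, bias, mse) of an estimator.

    sampling_dist maps each possible value of the estimate to its probability.
    """
    expectation = sum(v * p for v, p in sampling_dist.items())
    variance = sum((v - expectation) ** 2 * p for v, p in sampling_dist.items())
    bias = expectation - theta
    mse = sum((v - theta) ** 2 * p for v, p in sampling_dist.items())
    return expectation, variance, bias, mse

theta = 10
estimator_a = {8: F(1, 4), 10: F(1, 2), 12: F(1, 4)}   # unbiased, less precise
estimator_b = {10: F(1, 2), 11: F(1, 2)}               # biased, more precise

result_a = summarize(estimator_a, theta)
result_b = summarize(estimator_b, theta)
for name, (e, var, bias, mse) in (("A", result_a), ("B", result_b)):
    print(name, "E =", e, "var =", var, "bias =", bias, "MSE =", mse)
```

IN THIS EXAMPLE THE BIASED ESTIMATOR HAS THE SMALLER MEAN SQUARED ERROR, WHICH IS WHY BOTH PRECISION AND BIAS MUST BE CONSIDERED TOGETHER.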
WE WOULD LIKE ESTIMATORS THAT HAVE LOW VARIANCE (HIGH PRECISION) AND LOW BIAS (COMPARED TO OTHER ESTIMATORS THAT HAVE THE SAME SAMPLE SIZE OR SAMPLING COST).
THERE ARE OTHER PROPERTIES OF ESTIMATORS, SUCH AS CONSISTENCY (THE TENDENCY FOR AN ESTIMATE OF A PARAMETER TO APPROACH THE PARAMETER VALUE AS THE SAMPLE SIZE INCREASES). THE MOST IMPORTANT PROPERTIES OF ESTIMATORS ARE PRECISION AND BIAS.
PRINCIPAL ITEMS OF INTEREST FOR EACH SAMPLE DESIGN AND ESTIMATION METHOD:
IN WHAT FOLLOWS, WE SHALL PRESENT FORMULAS FOR VARIOUS SAMPLE ESTIMATES OF POPULATION PARAMETERS.
WE SHALL INDICATE WHETHER AN ESTIMATOR IS BIASED OR UNBIASED. WE SHALL ALSO PRESENT FORMULAS FOR THE TRUE VARIANCE OF THE ESTIMATE AND THE SAMPLE ESTIMATE OF THE VARIANCE OF THE ESTIMATE (AND ITS SQUARE ROOT, THE ESTIMATED STANDARD ERROR OF THE ESTIMATE).
THE TRUE VALUE IS OF INTEREST TO HELP US DECIDE ON SAMPLE SIZES, DURING THE COURSE OF DESIGNING A SURVEY.
THE ESTIMATED VALUE IS OF INTEREST TO INDICATE THE PRECISION OF AN ESTIMATE, AFTER THE SURVEY IS COMPLETED AND THE DATA ANALYZED. THE ESTIMATED STANDARD ERROR IS USED TO CONSTRUCT CONFIDENCE INTERVALS.
SAMPLING THEORY (CONT.): NOTES
THERE ARE VARIOUS METHODS FOR DETERMINING ESTIMATORS (METHOD OF MOMENTS, LEAST-SQUARES, MAXIMUM LIKELIHOOD, BAYESIAN METHODS, RAO-BLACKWELL METHOD, MINIMUM CHI-SQUARE, MINIMUM-DISTANCE) AND VARIOUS CRITERIA FOR COMPARING ESTIMATORS (BIAS, VARIANCE, MEAN SQUARED ERROR, CONSISTENCY, SUFFICIENCY, LOCATION/SCALE INVARIANCE). THESE METHODS WILL NOT BE ADDRESSED IN THIS COURSE.
NOTE ON SCOPE OF COURSE: IN THIS COURSE, WE RESTRICT
ATTENTION TO STANDARD ESTIMATORS OF COMMON POPULATION PARAMETERS SUCH AS MEANS
AND TOTALS, OR RATIOS OR DIFFERENCES AMONG THEM, AND USE LINEAR ESTIMATION
TECHNIQUES (LINEAR COMBINATIONS OF THE SAMPLE VALUES). FOR MORE COMPLEX PARAMETERS, SUCH AS SIMPLE
AND PARTIAL CORRELATION COEFFICIENTS, MEDIANS, QUANTILES, REGRESSION
COEFFICIENTS, MORE ADVANCED METHODS (NONLINEAR ESTIMATION PROCEDURES) ARE
REQUIRED. SINCE THE SAMPLING METHODS
USED IN SAMPLE SURVEY ARE COMPLEX, THESE ESTIMATION METHODS ARE COMPLICATED
(E.G.,
SAMPLING THEORY (CONT.): EXAMPLE
CONSIDER THE ESTIMATOR THAT IS THE SAMPLE MEAN, FROM A SIMPLE RANDOM SAMPLE DRAWN WITH REPLACEMENT (“SRSWR”).
POPULATION ELEMENTS: x1, x2, …, xN (LOWER CASE SIGNIFIES ACTUAL NUMERICAL VALUES)
SAMPLE: X1, X2, …, Xn (UPPER CASE SIGNIFIES RANDOM VARIABLES; FUNCTIONS; CONCEPTUAL)
SAMPLE: x1, x2, …, xn (LOWER CASE SIGNIFIES NUMBERS, IN A SPECIFIC CASE)
NOTE: THE ITEMS x1, x2, …, xn OF THE SAMPLE ARE NOT (NECESSARILY) THE FIRST n ITEMS (x1, x2, …, xn ) OF THE POPULATION.
POPULATION MEAN: 
SAMPLE MEAN: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ OR 
SAMPLING THEORY (CONT.): EXAMPLE (CONT.)
NOTE: THERE IS INCONSISTENCY IN THE FIELD OF STATISTICS IN THE USE OF CAPITAL LETTERS AND LOWER-CASE (“SMALL”) LETTERS. SOME AUTHORS USE CAPITAL LETTERS TO DENOTE RANDOM VARIABLES (FUNCTIONS), AND SMALL LETTERS TO DENOTE REAL NUMBERS (ELEMENTS OF A POPULATION OR SAMPLE, OBSERVED VALUES OF AN ESTIMATOR). OTHERS USE CAPITAL LETTERS TO REFER TO POPULATION PARAMETERS (MEAN, TOTAL) AND SMALL LETTERS TO REFER TO POPULATION ELEMENTS, SAMPLE ELEMENTS, AND SAMPLE ESTIMATORS, WITHOUT DISTINGUISHING BETWEEN RANDOM VARIABLES (FUNCTIONS) AND OBSERVED NUMERICAL VALUES.
WHAT IS EVEN MORE CONFUSING IS THAT MANY AUTHORS USE THE SAME SYMBOL INTERCHANGEABLY AS A RANDOM VARIABLE OR A REAL NUMBER, WITHOUT COMMENT. FOR EXAMPLE, IN THE EXPRESSION $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\bar{x}$ AND $x_i$ ARE USED EITHER AS RANDOM VARIABLES OR AS NUMBERS FROM A PARTICULAR SAMPLE. THIS PRACTICE MUST BE VERY CONFUSING TO THE NEW STUDENT, BUT IT IS NOT UNUSUAL. (ANOTHER CONFUSING ITEM IS THE USE OF THE TERM “RANDOM VARIABLE” TO DESCRIBE A FUNCTION.)
IN MATHEMATICAL STATISTICS, THIS DISTINCTION IS VERY IMPORTANT. FOR EXAMPLE, IN THE STATEMENT, Prob(X = x), X REFERS TO A RANDOM VARIABLE (A FUNCTION), AND x REFERS TO A REAL NUMBER. FOR EXAMPLE, Prob(AGE = 27). IN THIS CASE, THE EXPRESSION E(X) REFERS TO THE EXPECTATION (EXPECTED VALUE, MEAN VALUE, AVERAGE VALUE) OF THE RANDOM VARIABLE X, WHICH IS THE MEAN AGE OF THE MEMBERS OF THE POPULATION. THE EXPRESSION E(x) IS SIMPLY THE EXPECTATION OF THE REAL NUMBER, x, WHICH IS x. IN THE EXAMPLE WHERE X = AGE AND x = 27, E(AGE) = μX = 40.3, SAY, BUT E(27)=27.
IN THIS COURSE, CAPITAL LETTERS WILL REFER TO POPULATION PARAMETERS AND TO RANDOM VARIABLES, AND LOWER-CASE LETTERS WILL REFER TO POPULATION ELEMENTS, SAMPLE ELEMENTS, AND THE CALCULATED VALUES OF SAMPLE STATISTICS. LOWER-CASE LETTERS WILL NOT REFER TO RANDOM VARIABLES (UNLESS SPECIFICALLY STATED). ALTERNATIVE NOTATIONS WILL BE PRESENTED, TO FAMILIARIZE THE STUDENT WITH NOTATIONS FOUND IN DIFFERENT REFERENCE TEXTS.
WHILE THIS DISTINCTION IS IMPORTANT CONCEPTUALLY (MATHEMATICALLY), IT OFTEN IS IGNORED IN PRACTICAL APPLICATIONS. FOR EXAMPLE, IN RECALLING THE FORMULA USED TO CALCULATE THE SAMPLE MEAN, IT DOES NOT MATTER WHETHER ONE RECALLS $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ (A FORMULA INVOLVING RANDOM VARIABLES) OR $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (A FORMULA INVOLVING REAL NUMBERS FROM A PARTICULAR SAMPLE). IN THE INTEREST OF SIMPLICITY, WE SHALL OFTEN USE THE LATTER TYPE OF EXPRESSION (LOWER-CASE LETTERS, SAMPLE VALUES) FOR FORMULAS.
SAMPLING THEORY (CONT.): EXAMPLE (CONT.)
IT CAN BE SHOWN THAT:
$E(\bar{X}) = \mu_X$
AND
$\mathrm{var}(\bar{X}) = \sigma_X^2/n$
WHERE $\mu_X$ IS THE POPULATION MEAN AND $\sigma_X^2$ IS THE POPULATION VARIANCE.
$\mu_X$ IS ESTIMATED (IN SRSWR) BY $\bar{x}$.
$\sigma_X^2$ IS ESTIMATED (IN SRSWR) BY:
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$.
(THE DIVISOR n-1 IS USED INSTEAD OF n SO THAT $s^2$ IS UNBIASED. THE BEST FORMULAS FOR SAMPLE ESTIMATES OF POPULATION PARAMETERS ARE OFTEN NOT IDENTICAL IN FORM TO THE POPULATION FORMULAS. IN FACT, IN SOME CASES, SUCH AS ESTIMATING A POWER SPECTRUM IN TIME SERIES ANALYSIS, USING THE POPULATION FORMULA PRODUCES A TERRIBLE ESTIMATE (IT IS NOT EVEN CONSISTENT).)
$\mathrm{var}(\bar{X}) = \sigma_X^2/n$ IS ESTIMATED (IN SRSWR) BY:
$s_{\bar{x}}^2 = s^2/n$.
THE QUANTITY $\sqrt{\mathrm{var}(\bar{X})} = \sigma_X/\sqrt{n}$ IS CALLED THE STANDARD ERROR OF THE ESTIMATE, $\bar{x}$, DENOTED $\sigma_{\bar{X}}$. (THE STANDARD ERROR OF THE ESTIMATE IS SIMPLY THE STANDARD DEVIATION OF THE ESTIMATE, BUT THE TERM “STANDARD ERROR” IS USED INSTEAD OF “STANDARD DEVIATION” WHEN REFERRING TO THE STANDARD DEVIATION OF ESTIMATES OF POPULATION PARAMETERS.)
IT IS ESTIMATED (IN SRSWR) BY $s_{\bar{x}} = s/\sqrt{n}$.
SAMPLING THEORY (CONT.): EXAMPLE (CONT.)
HENCE THE SAMPLE MEAN, $\bar{X}$, OF A SIMPLE RANDOM SAMPLE DRAWN WITH REPLACEMENT IS UNBIASED, AND ITS VARIANCE, $\sigma_X^2/n$, IS PROPORTIONAL TO 1/n: IT DECREASES AS THE SAMPLE SIZE n INCREASES.
IT CAN BE SHOWN THAT THE SAMPLE MEAN OF A SIMPLE RANDOM SAMPLE DRAWN WITH REPLACEMENT HAS THE MINIMUM VARIANCE OF ALL UNBIASED ESTIMATORS (OF THE POPULATION MEAN) THAT ARE LINEAR FUNCTIONS OF THE SAMPLE (“BEST LINEAR UNBIASED ESTIMATE,” “BLUE”).
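FOR A VERY SMALL POPULATION, THE UNBIASEDNESS OF THE SAMPLE MEAN AND THE 1/n BEHAVIOR OF ITS VARIANCE CAN BE VERIFIED BY LISTING EVERY POSSIBLE SAMPLE. THE FOLLOWING PYTHON SKETCH ENUMERATES ALL ORDERED SAMPLES OF SIZE n = 3 DRAWN WITH REPLACEMENT FROM A HYPOTHETICAL POPULATION OF N = 4 VALUES (EACH ORDERED SAMPLE BEING EQUALLY LIKELY UNDER SRSWR) AND COMPARES THE MEAN AND VARIANCE OF THE SAMPLE MEAN WITH THE POPULATION MEAN AND WITH THE POPULATION VARIANCE DIVIDED BY n.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population of N = 4 values.
population = [2, 4, 6, 12]
N = len(population)
n = 3

mu = Fraction(sum(population), N)                       # population mean
sigma2 = sum((x - mu) ** 2 for x in population) / N     # population variance

# Under SRSWR every ordered sample of size n has probability 1 / N**n.
samples = list(product(population, repeat=n))
sample_means = [Fraction(sum(s), n) for s in samples]

expected_mean = sum(sample_means) / len(sample_means)
variance_of_mean = (
    sum((m - expected_mean) ** 2 for m in sample_means) / len(sample_means)
)

print("E(sample mean) =", expected_mean, " population mean =", mu)
print("var(sample mean) =", variance_of_mean, " sigma^2 / n =", sigma2 / n)
```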
NOTE: THE PRECEDING ESTIMATION FORMULAS APPLY TO SIMPLE RANDOM SAMPLING WITH REPLACEMENT. FOR OTHER METHODS OF SAMPLING, THE FORMULAS ARE DIFFERENT.
NOTE: THE STANDARD ERROR OF THE ESTIMATE IS OF INTEREST MAINLY FOR CONSTRUCTING INTERVAL ESTIMATES (TO BE EXAMINED SHORTLY).
NOTE ALSO:
POPULATION TOTAL = $\sum_{i=1}^{N} x_i$ (= X) = $N\mu_X$
SAMPLE ESTIMATE OF POPULATION TOTAL = $N\bar{x}$
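THE SRSWR ESTIMATES DISCUSSED IN THIS EXAMPLE (SAMPLE MEAN, SAMPLE VARIANCE WITH DIVISOR n-1, ESTIMATED STANDARD ERROR OF THE MEAN, AND ESTIMATED POPULATION TOTAL) CAN BE COMPUTED AS IN THE FOLLOWING PYTHON SKETCH. THE SAMPLE VALUES AND THE POPULATION SIZE N = 1,000 ARE HYPOTHETICAL, AND THE VARIABLE NAMES ARE CHOSEN HERE FOR ILLUSTRATION.

```python
from math import sqrt
from statistics import mean, variance

N = 1_000                                   # hypothetical population size
sample = [34, 27, 45, 61, 19, 38, 52, 44]   # hypothetical sample of ages
n = len(sample)

x_bar = mean(sample)          # estimate of the population mean
s2 = variance(sample)         # estimate of the population variance (divisor n-1)
se_x_bar = sqrt(s2 / n)       # estimated standard error of the sample mean
total_hat = N * x_bar         # estimate of the population total

print("sample mean:", x_bar)
print("sample variance:", round(s2, 2))
print("estimated standard error:", round(se_x_bar, 2))
print("estimated total:", total_hat)
```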
SAMPLING THEORY (CONT.): SAMPLING DISTRIBUTION
SAMPLING DISTRIBUTION OF THE ESTIMATOR (THE PROBABILITY DISTRIBUTION OF A RANDOM VARIABLE).
CONSIDER THE HYPOTHETICAL UNIVERSE OF ALL POSSIBLE SAMPLES FOR ANY METHOD OF SAMPLING. THERE IS A NUMERICAL VALUE OF THE STATISTIC (OR ESTIMATE) FOR EVERY POSSIBLE SAMPLE. THE PROBABILITY DISTRIBUTION OF THIS STATISTIC (ESTIMATE) IS THE SAMPLING DISTRIBUTION OF THE STATISTIC.

(WEAK) LAW OF LARGE NUMBERS: AS THE SAMPLE SIZE INCREASES, THE SAMPLE MEAN OF A SIMPLE RANDOM SAMPLE DRAWN WITH REPLACEMENT BECOMES VERY CLOSE TO THE POPULATION MEAN (THE PROBABILITY THAT THE SAMPLE MEAN DIFFERS BY ANY SPECIFIED AMOUNT FROM THE POPULATION MEAN DECREASES TO ZERO AS THE SAMPLE SIZE INCREASES TO INFINITY).
CENTRAL LIMIT THEOREM: THE SAMPLING DISTRIBUTION OF THE SAMPLE MEAN (IN SIMPLE RANDOM SAMPLING WITH REPLACEMENT) APPROACHES THE NORMAL DISTRIBUTION (THE “BELL-SHAPED CURVE”) AS THE SAMPLE SIZE APPROACHES INFINITY. IF μX AND σX² DENOTE THE MEAN AND VARIANCE OF THE POPULATION, THEN, FOR LARGE n, THE MEAN AND VARIANCE OF THE APPROXIMATING NORMAL DISTRIBUTION ARE μX AND σX²/n. WE WILL DISCUSS THE NORMAL DISTRIBUTION IN GREATER DETAIL SHORTLY.
THE TWO PRECEDING RESULTS ARE TRUE FOR ANY FINITE POPULATION (THEY HOLD FOR RANDOM SAMPLING FROM ANY PROBABILITY DENSITY WITH FINITE VARIANCE).
THESE RESULTS ARE VERY IMPORTANT IN SAMPLE SURVEY. THEY ARE THE BASIS FOR THE ESTIMATION FORMULAS THAT ARE USED TO ANALYZE THE DATA.
THEY ARE VERY USEFUL SINCE THE FORM OF THE PROBABILITY DISTRIBUTION IS USUALLY NOT KNOWN (BUT ALWAYS HAS A FINITE VARIANCE).
THEY ARE APPLICABLE FOR LARGE SAMPLE SIZES (E.G., n>30).
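THE PRACTICAL MEANING OF THE CENTRAL LIMIT THEOREM CAN BE SEEN IN A SMALL SIMULATION. IN THE FOLLOWING PYTHON SKETCH, REPEATED SIMPLE RANDOM SAMPLES ARE DRAWN WITH REPLACEMENT FROM A STRONGLY SKEWED POPULATION, AND THE PROPORTION OF SAMPLE MEANS FALLING WITHIN 1.96 STANDARD ERRORS OF THE POPULATION MEAN IS RECORDED; IF THE NORMAL APPROXIMATION IS ADEQUATE, THIS PROPORTION SHOULD BE CLOSE TO 0.95. THE POPULATION (GENERATED FROM AN EXPONENTIAL DISTRIBUTION), THE SAMPLE SIZE, THE NUMBER OF REPETITIONS AND THE SEED ARE ALL ASSUMPTIONS MADE FOR ILLUSTRATION.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(2009)   # fixed seed for reproducibility

# Hypothetical skewed population of N = 2,000 incomes.
population = [rng.expovariate(1 / 30_000) for _ in range(2_000)]
mu = mean(population)
sigma = pstdev(population)

n = 50          # sample size
reps = 4_000    # number of repeated samples
half_width = 1.96 * sigma / sqrt(n)

hits = 0
for _ in range(reps):
    sample = rng.choices(population, k=n)   # SRSWR
    if abs(sum(sample) / n - mu) < half_width:
        hits += 1

coverage = hits / reps
print("proportion of sample means within 1.96 standard errors:", coverage)
```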
SAMPLING THEORY (CONT.): INTERVAL ESTIMATION
THE PRECISION OF AN ESTIMATE WILL BE DETERMINED, USING THE THEORY OF STATISTICS, BASED ON INFORMATION IN THE SAMPLE.
THE FORMULAS USED TO ESTIMATE THE POPULATION MEAN (OR OTHER CHARACTERISTIC) WILL DEPEND ON THE SAMPLE DESIGN AND SAMPLE SELECTION METHOD.
THE PRECEDING MATERIAL ON ESTIMATION HAS BEEN CONCERNED WITH POINT ESTIMATION OF POPULATION PARAMETERS. WE WILL NOW ADDRESS INTERVAL ESTIMATION.
A POINT ESTIMATE SPECIFIES A SINGLE, “LIKELY,” VALUE AS THE ESTIMATE. WE KNOW SOMETHING ABOUT ITS PRECISION FROM ITS STANDARD ERROR (AND THE THEORY OF STATISTICS).
AN INTERVAL ESTIMATE OF A PARAMETER IS AN INTERVAL THAT HAS A SPECIFIED PROBABILITY OF INCLUDING THE PARAMETER (THE INTERVAL IS THE RANDOM QUANTITY, NOT THE PARAMETER).
CONFIDENCE INTERVAL. LET θ DENOTE A POPULATION PARAMETER (E.G., THE MEAN, μ). LET T1 = t1(X1, …,Xn) AND T2 = t2(X1, …,Xn) BE TWO STATISTICS SATISFYING T1 <= T2 FOR WHICH P(T1 < θ < T2) = α, WHERE α DOES NOT DEPEND ON θ. THEN THE RANDOM INTERVAL (T1,T2) IS CALLED A 100α PERCENT CONFIDENCE INTERVAL FOR θ.
α IS CALLED THE CONFIDENCE COEFFICIENT. T1 AND T2 ARE CALLED THE LOWER AND UPPER CONFIDENCE LIMITS, RESPECTIVELY.
A VALUE (t1, t2) OF THE RANDOM INTERVAL (T1, T2) IS ALSO CALLED A 100α PERCENT CONFIDENCE INTERVAL FOR θ.
(SIMILAR DEFINITIONS FOR ONE-SIDED LOWER AND UPPER CONFIDENCE INTERVALS AND LIMITS.)
NOTE: CONFIDENCE INTERVALS ARE RELATED TO TESTS OF HYPOTHESIS, TO BE DISCUSSED IN DAY 2.
SAMPLING THEORY (CONT.): CONFIDENCE INTERVALS
HOW TO DETERMINE CONFIDENCE INTERVALS: NEED TO USE INFORMATION ABOUT THE SAMPLING DISTRIBUTION OF CERTAIN STATISTICS.
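EXAMPLE: FROM PROBABILITY THEORY: CHEBYCHEV INEQUALITY (FOR ANY STATISTIC FROM A FINITE POPULATION, NOT JUST $\bar{X}$):
$P(|\bar{X} - \mu_X| < r\sigma_{\bar{X}}) \ge 1 - 1/r^2$
SO (SINCE $\sigma_{\bar{X}} = \sigma_X/\sqrt{n}$) $(\bar{x} - r\sigma_X/\sqrt{n},\ \bar{x} + r\sigma_X/\sqrt{n})$ IS AN APPROXIMATE $100(1 - 1/r^2)$ PERCENT CONFIDENCE INTERVAL FOR $\mu_X$ (THE COVERAGE PROBABILITY IS AT LEAST $1 - 1/r^2$).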
THE CHEBYCHEV-INEQUALITY METHOD OF CONSTRUCTING CONFIDENCE INTERVALS IS NOT VERY GOOD.
IT IS USUALLY MUCH BETTER TO OBTAIN CONFIDENCE INTERVALS FROM THE KNOWLEDGE THAT, FOR LARGE SAMPLES, THE SAMPLING DISTRIBUTION OF $\bar{X}$ TENDS TO A NORMAL DISTRIBUTION WITH MEAN $\mu_X$ AND VARIANCE $\sigma_X^2/n$.
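THE DIFFERENCE BETWEEN THE TWO APPROACHES CAN BE QUANTIFIED BY COMPARING THE NUMBER OF STANDARD ERRORS NEEDED ON EACH SIDE OF THE ESTIMATE FOR A GIVEN CONFIDENCE COEFFICIENT. THE FOLLOWING PYTHON SKETCH SOLVES 1 - 1/r² = LEVEL FOR THE CHEBYCHEV MULTIPLIER r AND USES THE STANDARD-LIBRARY `statistics.NormalDist` CLASS FOR THE NORMAL MULTIPLIER. THE FUNCTION NAMES ARE HYPOTHETICAL.

```python
from math import sqrt
from statistics import NormalDist

def chebychev_multiplier(level):
    """r such that 1 - 1/r**2 equals the confidence coefficient."""
    return sqrt(1 / (1 - level))

def normal_multiplier(level):
    """z such that the central area under the standard normal is `level`."""
    return NormalDist().inv_cdf((1 + level) / 2)

for level in (0.75, 0.90, 0.95):
    r = chebychev_multiplier(level)
    z = normal_multiplier(level)
    print(f"level {level:.2f}: Chebychev r = {r:.3f}, normal z = {z:.3f}")
```

FOR A 95% INTERVAL THE CHEBYCHEV INTERVAL IS MORE THAN TWICE AS WIDE AS THE NORMAL-THEORY INTERVAL.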
SAMPLING THEORY (CONT.): THE NORMAL DISTRIBUTION
THE NORMAL DISTRIBUTION:
PROBABILITY DENSITY FUNCTION: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$

THE STANDARD NORMAL DISTRIBUTION IS THE DISTRIBUTION OF $Z = (X - \mu)/\sigma$. IT IS A NORMAL DISTRIBUTION WITH μ = 0 AND σ = 1: $\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$

TABLES AVAILABLE.
95% OF THE AREA IS CONTAINED WITHIN THE INTERVAL μ ± 1.96σ. 90% OF THE AREA IS CONTAINED WITHIN THE INTERVAL μ ± 1.645σ.
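IN PLACE OF PRINTED TABLES, THESE AREAS CAN BE COMPUTED DIRECTLY. THE FOLLOWING PYTHON SKETCH USES THE STANDARD-LIBRARY `statistics.NormalDist` CLASS; THE PARTICULAR MEAN AND STANDARD DEVIATION (40 AND 12) ARE HYPOTHETICAL, AND THE AREAS DO NOT DEPEND ON THEM.

```python
from statistics import NormalDist

mu, sigma = 40, 12                 # hypothetical mean and standard deviation
dist = NormalDist(mu=mu, sigma=sigma)

def area_within(k):
    """Area under the normal density between mu - k*sigma and mu + k*sigma."""
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

print("area within 1.96 sigma: ", round(area_within(1.96), 4))
print("area within 1.645 sigma:", round(area_within(1.645), 4))

# Standardizing: P(X <= x) equals P(Z <= (x - mu) / sigma) for standard normal Z.
x = 55
z = (x - mu) / sigma
print(round(dist.cdf(x), 6), round(NormalDist().cdf(z), 6))
```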
SAMPLING THEORY (CONT.): CONFIDENCE INTERVALS
CONFIDENCE INTERVAL FOR $\mu_X$:

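SO (REARRANGING AND SETTING $\sigma_{\bar{X}} = \sigma_X/\sqrt{n}$)
$P\left(z_{(1-\alpha)/2} < \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}} < z_{(1+\alpha)/2}\right) \approx \alpha$
HENCE
$P\left(\bar{X} - z_{(1+\alpha)/2}\,\sigma_X/\sqrt{n} < \mu_X < \bar{X} - z_{(1-\alpha)/2}\,\sigma_X/\sqrt{n}\right) \approx \alpha$
HENCE
$\left(\bar{x} + z_{(1-\alpha)/2}\,\sigma_X/\sqrt{n},\ \bar{x} + z_{(1+\alpha)/2}\,\sigma_X/\sqrt{n}\right)$
IS A 100α PERCENT CONFIDENCE INTERVAL FOR $\mu_X$, WHERE $z_p$ DENOTES THE 100p-TH PERCENTILE POINT OF THE STANDARD NORMAL DISTRIBUTION (SO THAT $z_{(1-\alpha)/2} = -z_{(1+\alpha)/2}$).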
FOR EXAMPLE, z.025 = -1.96 AND z.975 = 1.96, SO
$\left(\bar{x} - 1.96\,\sigma_X/\sqrt{n},\ \bar{x} + 1.96\,\sigma_X/\sqrt{n}\right)$
IS A 95% CONFIDENCE INTERVAL FOR μX.
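IN PRACTICE, WE DO NOT KNOW THE VALUE OF $\sigma_X$, AND WE USE THE SAMPLE ESTIMATE, $s$ (THE SAMPLE STANDARD DEVIATION), IN ITS PLACE. IN THIS CASE, HOWEVER, THE PROBABILITY DISTRIBUTION OF $(\bar{X} - \mu_X)/(s/\sqrt{n})$ IS NOT EXACTLY A STANDARD NORMAL DISTRIBUTION. IT IS APPROXIMATELY A STUDENT’S t DISTRIBUTION WITH n - 1 DEGREES OF FREEDOM (EXACTLY SO WHEN THE POPULATION ITSELF IS NORMAL), WHICH IS A LITTLE “WIDER” THAN THE STANDARD NORMAL DISTRIBUTION. A REASONABLE AND CONVENIENT APPROXIMATION IS TO REPLACE THE FACTOR 1.96 BY 2, AND USE
$\left(\bar{x} - 2s/\sqrt{n},\ \bar{x} + 2s/\sqrt{n}\right)$
AS A 95% CONFIDENCE INTERVAL.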
THE QUANTITY $2s/\sqrt{n}$ IS CALLED THE “BOUND ON THE ERROR OF ESTIMATION OF $\mu_X$”.
THE PRECEDING ILLUSTRATED THE CONSTRUCTION OF A CONFIDENCE INTERVAL FOR THE POPULATION MEAN, $\mu_X$, IN THE CASE OF SIMPLE RANDOM SAMPLING WITH REPLACEMENT, USING $\bar{x}$ AS THE ESTIMATE OF THE POPULATION MEAN, $\mu_X$, AND $s/\sqrt{n}$ AS ITS ESTIMATED STANDARD ERROR. IN GENERAL, FOR AN ARBITRARY SAMPLE DESIGN (AND PARAMETER TO BE ESTIMATED), WE CONSTRUCT THE CONFIDENCE INTERVAL USING THE APPROPRIATE PARAMETER ESTIMATE AND ITS ESTIMATED STANDARD ERROR.
SAMPLING THEORY
(CONT.): DIFFERENT SAMPLING METHODS
WE WILL NOW EXAMINE SEVERAL DIFFERENT TYPES OF SAMPLING USED IN SAMPLE SURVEY:
SIMPLE RANDOM SAMPLING, WITH AND WITHOUT REPLACEMENT
STRATIFIED SAMPLING
CLUSTER SAMPLING
SYSTEMATIC SAMPLING
MULTISTAGE SAMPLING
DOUBLE SAMPLING (TWO-PHASE SAMPLING)
AND TWO ALTERNATIVE TYPES OF ESTIMATION (IN ADDITION TO THE USUAL LINEAR ESTIMATORS):
RATIO ESTIMATORS
REGRESSION ESTIMATORS.
IN DAY ONE OF THE COURSE, WE EXAMINE THE SAMPLING METHODS (DEFINITIONS, SAMPLE SELECTION METHODS, AND ESTIMATION FORMULAS FOR POINT ESTIMATES AND CONFIDENCE INTERVALS).
IN DAY TWO WE SHOW HOW TO DETERMINE WHICH ONE TO USE, AND HOW TO TAILOR IT TO THE PARTICULAR APPLICATION (I.E., DESIGN THE SURVEY).
A NOTE ON NOTATION…IT IS CUSTOMARY IN STATISTICS TO USE X TO SPECIFY AN ARBITRARY (SINGLE) RANDOM VARIABLE, AND TO USE X1, X2,… OR X, Y, Z,… FOR SEQUENCES OR SETS OF RANDOM VARIABLES. WHEN ONE VARIABLE DEPENDS ON ANOTHER IN SOME WAY (SUCH AS INCOME DEPENDING ON AGE OR EDUCATION), IT IS CUSTOMARY TO USE Y FOR THE “DEPENDENT” VARIABLE AND X’s FOR THE EXPLANATORY (“INDEPENDENT”) VARIABLES. IN SAMPLE SURVEY, IT IS CUSTOMARY TO USE Y FOR AN ARBITRARY RANDOM VARIABLE AND FOR A DEPENDENT VARIABLE.
WHILE IT DOES NOT MATTER THEORETICALLY WHAT SYMBOL IS USED TO REPRESENT A RANDOM VARIABLE, WE SHALL DEFER TO CONVENTION AND HENCEFORTH USE Y, INSTEAD OF X, TO DENOTE AN ARBITRARY RANDOM VARIABLE. (THE CONVENTIONAL NOTATION IS GENERALLY GOOD AND HELPFUL, AND THERE IS NO GOOD REASON TO DEPART FROM IT.) WHETHER X OR Y IS USED IS NOT RELEVANT – ALL THAT MATTERS IS HOW THE RANDOM VARIABLE IS DEFINED. THE CONVENTION OF USING Y TO DENOTE A DEPENDENT VARIABLE (E.G., IN A MULTIPLE REGRESSION EQUATION) IS WELL ESTABLISHED, HOWEVER, AND THERE IS NO REASON FOR DEPARTING FROM IT.
THE NOTATION IN THIS COURSE CLOSELY FOLLOWS THE NOTATION IN SAMPLING TECHNIQUES, 3rd EDITION BY WILLIAM G. COCHRAN (WILEY, 1977) OR ELEMENTARY SURVEY SAMPLING, 2nd EDITION, BY RICHARD L. SCHEAFFER, WILLIAM MENDENHALL AND LYMAN OTT (DUXBURY PRESS, 1979). (COCHRAN HAS MORE FORMULAS, AND IS A MATHEMATICS TEXT. SCHEAFFER IS MUCH SIMPLER, AND PRESENTS JUST THE BASIC RESULTS, WITHOUT PROOF.)
SIMPLE RANDOM
SAMPLING
(FROM A FINITE POPULATION)
POPULATION SIZE = N,
SAMPLE SIZE = n
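POPULATION MEAN = $\mu = \frac{1}{N}\sum_{i=1}^{N} y_i$
POPULATION VARIANCE = $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - \mu)^2$
ALSO $S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \mu)^2 = \frac{N}{N-1}\,\sigma^2$
POPULATION TOTAL = $Y = \sum_{i=1}^{N} y_i = N\mu$
SAMPLE MEAN (POINT ESTIMATOR OF POP. MEAN) = $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ (A RANDOM VARIABLE) OR $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ (A NUMBER, CALCULATED FROM A PARTICULAR SAMPLE).
SAMPLE VARIANCE = $\frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ (A RANDOM VARIABLE) OR $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$ (A NUMBER, CALCULATED FROM AN ACTUAL SAMPLE).
(TRUE) VARIANCE OF $\bar{y}$:
SAMPLING WITH REPLACEMENT: $V(\bar{y}) = \sigma^2/n$
SAMPLING WITHOUT REPLACEMENT: $V(\bar{y}) = \frac{S^2}{n}\cdot\frac{N-n}{N}$
ESTIMATED VARIANCE OF $\bar{y}$:
SAMPLING WITH REPLACEMENT: $\hat{V}(\bar{y}) = s^2/n$
SAMPLING WITHOUT REPLACEMENT: $\hat{V}(\bar{y}) = \frac{s^2}{n}\cdot\frac{N-n}{N}$
ESTIMATED STANDARD ERROR OF $\bar{y}$: $\sqrt{\hat{V}(\bar{y})}$ (IN EITHER CASE)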
THE FACTOR (N – n)/N = 1 – n/N IS CALLED THE FINITE POPULATION CORRECTION (fpc).
IT SHOWS HOW MUCH LOWER THE VARIANCE OF THE ESTIMATE IS WITH SAMPLING WITHOUT REPLACEMENT, COMPARED TO SAMPLING WITH REPLACEMENT.
(NOTE: HERE, AND IN THE SAMPLING METHODS THAT FOLLOW, IF THE TOTAL POPULATION SIZE IS NOT KNOWN, THEN REPLACE THE fpc BY 1.)
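95% CONFIDENCE INTERVAL FOR THE POPULATION MEAN: $\bar{y} \pm 2\sqrt{\hat{V}(\bar{y})}$ (THE SAMPLE MEAN PLUS OR MINUS TWICE ITS ESTIMATED STANDARD ERROR).
AS NOTED EARLIER, THE TERM $2\sqrt{\hat{V}(\bar{y})}$ IS CALLED THE “BOUND ON THE ERROR OF ESTIMATION OF μ.”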
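THE CALCULATIONS OF THIS SECTION CAN BE ILLUSTRATED BY A SHORT PYTHON SKETCH. IT ASSUMES THAT THE ESTIMATED VARIANCE OF THE SAMPLE MEAN IS s²/n, MULTIPLIED BY THE fpc (N - n)/N WHEN SAMPLING IS WITHOUT REPLACEMENT, AND THAT THE BOUND IS TWICE THE ESTIMATED STANDARD ERROR. THE FUNCTION NAME srs_mean_estimate AND THE DATA VALUES ARE HYPOTHETICAL.

```python
import math

def srs_mean_estimate(sample, population_size=None):
    """Point estimate, estimated variance, and bound on the error of
    estimation for the population mean under simple random sampling.

    If population_size is None (unknown N, or sampling with replacement),
    the finite population correction is replaced by 1.
    """
    n = len(sample)
    ybar = sum(sample) / n
    # Sample variance s^2 with divisor n - 1.
    s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)
    fpc = 1.0 if population_size is None else (population_size - n) / population_size
    est_var = (s2 / n) * fpc
    bound = 2.0 * math.sqrt(est_var)  # twice the estimated standard error
    return ybar, est_var, bound

# Hypothetical sample of n = 5 from a population of N = 50.
sample = [12.0, 15.0, 9.0, 14.0, 10.0]
ybar, est_var, bound = srs_mean_estimate(sample, population_size=50)
print(ybar, est_var, bound)
print("approximate 95% CI:", (ybar - bound, ybar + bound))
```

CALLING THE FUNCTION WITHOUT population_size GIVES THE WITH-REPLACEMENT RESULT, WHICH HAS A SLIGHTLY LARGER ESTIMATED VARIANCE.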
SIMPLE RANDOM
SAMPLING WITH BINARY DATA
(SAME FORMULAS APPLY, BUT THEY SIMPLIFY)
EACH yi = 0 OR 1
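POPULATION MEAN = $P = \frac{1}{N}\sum_{i=1}^{N} y_i$ (THE POPULATION PROPORTION)
POPULATION VARIANCE = σ2 = $P(1-P)$
ALSO
$S^2 = \frac{N}{N-1}\,P(1-P)$
SAMPLE MEAN = $p = \frac{1}{n}\sum_{i=1}^{n} y_i$ (THE SAMPLE PROPORTION)
SAMPLE VARIANCE = $s^2 = \frac{n}{n-1}\,p(1-p)$
(DROP THE SUBSCRIPT Y.)
(TRUE) VARIANCE OF $p$:
SAMPLING WITH REPLACEMENT: $V(p) = \frac{P(1-P)}{n}$
SAMPLING WITHOUT REPLACEMENT: $V(p) = \frac{P(1-P)}{n}\cdot\frac{N-n}{N-1}$
ESTIMATED VARIANCE OF $p$:
SAMPLING WITH REPLACEMENT: $\hat{V}(p) = \frac{p(1-p)}{n-1}$
SAMPLING WITHOUT REPLACEMENT: $\hat{V}(p) = \frac{p(1-p)}{n-1}\cdot\frac{N-n}{N}$
ESTIMATED STANDARD ERROR OF $p$: $\sqrt{\hat{V}(p)}$ (IN EITHER CASE)
95% CONFIDENCE INTERVAL FOR P: $p \pm 2\sqrt{\hat{V}(p)}$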
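FOR BINARY DATA THE SAME CALCULATION REDUCES TO A FEW LINES. THE FOLLOWING PYTHON SKETCH ASSUMES THE ESTIMATED VARIANCE p(1 - p)/(n - 1), MULTIPLIED BY THE fpc (N - n)/N WHEN SAMPLING IS WITHOUT REPLACEMENT. THE FUNCTION NAME AND THE COUNTS (40 “YES” ANSWERS OUT OF 100, FROM A POPULATION OF 1,000) ARE HYPOTHETICAL.

```python
import math

def srs_proportion_estimate(successes, n, population_size=None):
    """Estimated proportion, its estimated variance, and an approximate
    95% confidence interval (estimate +/- 2 standard errors)."""
    p = successes / n
    # Replace the fpc by 1 when N is unknown or sampling is with replacement.
    fpc = 1.0 if population_size is None else (population_size - n) / population_size
    est_var = p * (1.0 - p) / (n - 1) * fpc
    bound = 2.0 * math.sqrt(est_var)
    return p, est_var, (p - bound, p + bound)

p, est_var, interval = srs_proportion_estimate(40, 100, population_size=1000)
print(p, est_var, interval)
```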
SELECTING RANDOM
SAMPLES
RANDOMNESS IS A PROPERTY OF THE PROCESS GENERATING THE “RANDOM” NUMBERS – IT CANNOT BE PROVED FROM THE NUMBERS THEMSELVES.
TABLES OF RANDOM NUMBERS
ENTER TABLE RANDOMLY
DOCUMENT STARTING POINT AND RECORD SELECTED NUMBERS
USE A TABLE FROM AN ACCESSIBLE (IN-PRINT) SOURCE (SO THAT YOUR SELECTION CAN BE DOCUMENTED AND VERIFIED BY OTHERS, BY ACCESSING THAT SOURCE AND SEEING THE NUMBERS YOU SELECTED).
SYSTEMATIC SAMPLING (WILL BE TREATED IN GREATER DETAIL LATER)
SELECT EVERY k-th ITEM FROM A LIST OF THE SAMPLE UNITS (FRAME).
APPROPRIATE IF LIST IS IN RANDOM ORDER (E.G., PREPARED ARBITRARILY), BUT IT OFTEN PRODUCES INCREASES IN PRECISION IF THE LIST IS NOT IN RANDOM ORDER (E.G., A TREND IS PRESENT).
IF LIST IS NOT IN RANDOM ORDER, THEN THE SAMPLE IS NOT RANDOM, AND RESULTS MAY BE BIASED. THE GREATEST DANGER IS IF THERE IS SOME SORT OF PERIODICITY IN THE LIST.
EVERY k-th UNIT IS SELECTED FROM THE LIST, k=N/n.
IF IT IS SUSPECTED THAT THE LIST IS NOT IN RANDOM ORDER, THEN SELECT A NUMBER OF SYSTEMATIC SAMPLES (FOR EXAMPLE, TEN SYSTEMATIC SAMPLES, EACH WITH INTERVAL 10k, AND EACH STARTING FROM A NEW RANDOM STARTING POINT).
COMPUTER-GENERATED (“PSEUDORANDOM”) NUMBERS. GENERATED BY MATHEMATICAL FORMULAS, STARTING WITH A “SEED”: REPRODUCIBLE, DOCUMENTABLE.
MATHEMATICAL / STATISTICAL SOFTWARE PACKAGES (E.G., PROC SURVEYSELECT IN SAS).
RANDOM NUMBERS GENERATED BY A HAND CALCULATOR ARE GENERALLY NOT REPRODUCIBLE. OK FOR “PERSONAL USE,” BUT NOT FOR PAID WORK FOR A CLIENT.
REASON FOR DOCUMENTATION:
· REVIEW OF WORK (TO ENSURE / ESTABLISH CORRECTNESS OF SAMPLING PROCEDURES)
· LEGAL TESTIMONY (TO PROVE CORRECTNESS IN A COURT OF LAW)
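A MINIMAL PYTHON SKETCH OF REPRODUCIBLE, DOCUMENTABLE SAMPLE SELECTION FOLLOWS. IT USES THE STANDARD-LIBRARY random MODULE WITH AN EXPLICIT SEED, SO THAT THE SAME SEED AND FRAME ALWAYS PRODUCE THE SAME SELECTION. THE FUNCTION NAMES, THE SEED VALUE AND THE FRAME OF 100 NUMBERED UNITS ARE HYPOTHETICAL; THIS IS AN ILLUSTRATION, NOT THE PROCEDURE USED BY ANY PARTICULAR PACKAGE.

```python
import random

def select_srs_without_replacement(frame, n, seed):
    """Simple random sample of n units without replacement.
    Recording the seed documents the selection and makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def select_systematic(frame, n, seed):
    """Systematic sample: a random start, then every k-th unit, k = N // n."""
    rng = random.Random(seed)
    k = len(frame) // n
    start = rng.randrange(k)  # random starting point in the first interval
    return [frame[start + j * k] for j in range(n)]

frame = list(range(1, 101))  # hypothetical frame: units numbered 1..100
seed = 20100110              # record this value in the survey documentation
print(select_srs_without_replacement(frame, 10, seed))
print(select_systematic(frame, 10, seed))
```

RERUNNING THE SCRIPT WITH THE SAME SEED AND FRAME REPRODUCES THE SAME UNITS, WHICH IS THE PROPERTY NEEDED FOR REVIEW OR TESTIMONY.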
ESTIMATION OF
SAMPLE SIZE
SIMPLE RANDOM SAMPLING WITH REPLACEMENT (“SRSWR”)
A RECOMMENDED SAMPLE SIZE MAY BE DETERMINED BY SPECIFYING A BOUND (LIMIT, NUMERICAL VALUE) ON THE STANDARD ERROR OF THE ESTIMATE (OR THE SIZE OF A 95% CONFIDENCE INTERVAL), SETTING THIS VALUE EQUAL TO THE THEORETICAL FORMULA FOR THE BOUND, AND SOLVING FOR n.
EXAMPLE:
SUPPOSE THAT WE WANT A 95% CONFIDENCE INTERVAL FOR THE POPULATION MEAN TO BE OF SIZE ±E.
THEN, SINCE THE FORMULA FOR A 95% CONFIDENCE INTERVAL (IN SRSWR) IS $\bar{y} \pm 2\sigma_Y/\sqrt{n}$, WE SET $E = 2\sigma_Y/\sqrt{n}$ AND SOLVE FOR n:
$n = 4\sigma_Y^2/E^2$
TO USE THIS FORMULA, WE NEED TO SPECIFY E, AND WE NEED AN ESTIMATE OF THE STANDARD DEVIATION, σY.
THIS IS OBTAINED FROM PREVIOUS SURVEYS, REPORTS, OR JUDGMENT (E.G., IF WE JUDGE THAT MOST OF THE POPULATION COVERS A RANGE OF 200,000, THEN WE COULD ESTIMATE σY = 200,000/4 = 50,000).
FOR EXAMPLE, IF σY = 50,000 AND E=5,000, THEN n = 400.
THE PRECEDING METHOD FOR DETERMINING SAMPLE SIZE DOES NOT TAKE COST (BUDGET RESTRICTIONS) INTO ACCOUNT. IT IS APPROPRIATE FOR SRSWR OR FOR SRSWOR IF N IS LARGE.
NOTE THAT IN DETERMINING SAMPLE SIZES, WE USE THE FORMULAS FOR THE POPULATION (TRUE) VALUES OF THE VARIANCE OR STANDARD ERROR OF THE ESTIMATE – WE DO NOT USE THE SAMPLE FORMULAS SINCE WE ARE NOT USING SAMPLE DATA (WE DO NOT YET HAVE A SAMPLE!).
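THE SAMPLE-SIZE CALCULATION FOR A MEAN CAN BE SKETCHED IN PYTHON AS FOLLOWS, ASSUMING THE RELATION n = 4σY²/E² (A 95% INTERVAL OF HALF-WIDTH E UNDER SRSWR) AND THE RANGE/4 RULE OF THUMB FOR σY. THE FUNCTION NAMES ARE HYPOTHETICAL; THE NUMBERS ARE THOSE OF THE EXAMPLE ABOVE.

```python
import math

def sigma_from_range(value_range):
    """Rough standard deviation from a judged range (range / 4)."""
    return value_range / 4.0

def sample_size_for_mean(sigma, E):
    """Smallest n for which 2 * sigma / sqrt(n) <= E (SRSWR, 95% interval)."""
    return math.ceil(4.0 * sigma ** 2 / E ** 2)

sigma = sigma_from_range(200000.0)      # 50,000
n = sample_size_for_mean(sigma, 5000.0)
print(sigma, n)                         # prints: 50000.0 400
```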
GENERAL NOTE ON SAMPLE SIZE DETERMINATION FOR DESCRIPTIVE
SURVEYS (NOT JUST SIMPLE RANDOM SAMPLING)
A FREE COMPUTER PROGRAM FOR DETERMINING SAMPLE SIZES, BOTH FOR SIMPLE RANDOM SAMPLING AND FOR MORE COMPLEX DESIGNS, IS AVAILABLE FROM BRIXTON HEALTH (A PUBLIC HEALTH AND EPIDEMIOLOGY CONSULTANCY IN LIVERPOOL, ENGLAND, UK) AT http://www.brixtonhealth.com/samplexs.html .
THE COMPUTER PROGRAM USED BY THE AUTHOR TO DETERMINE SAMPLE SIZES FOR SURVEYS (BOTH SIMPLE AND COMPLEX) IS POSTED AT http://www.foundationwebsite.org/JGCSampleSizeProgram.mdb (A MICROSOFT ACCESS PROGRAM). IT IS DESIGNED PRIMARILY FOR USE IN DETERMINING SAMPLE SIZES FOR ANALYTICAL SURVEYS. FOR DESCRIPTIVE SURVEYS, THE USUAL APPROACH TO SAMPLE SIZE DETERMINATION IS TO SPECIFY A DESIRED LEVEL OF PRECISION (FOR AN ESTIMATE OF INTEREST) AND TO DETERMINE THE SAMPLE SIZE THAT PRODUCES THAT LEVEL OF PRECISION. FOR ANALYTICAL SURVEYS, THE USUAL APPROACH IS TO SPECIFY A DESIRED LEVEL OF POWER FOR A SPECIFIED TEST OF HYPOTHESIS (E.G., ABOUT THE SIZE OF A “DOUBLE DIFFERENCE” ESTIMATE), AND TO DETERMINE THE SAMPLE SIZE THAT PRODUCES THAT LEVEL OF POWER.
THERE ARE MANY OTHER SOURCES OF INFORMATION ABOUT SAMPLE SURVEY DESIGN AND SAMPLING ON THE INTERNET WORLD WIDE WEB.
ESTIMATION OF
SAMPLE SIZE
SIMPLE RANDOM SAMPLING WITH REPLACEMENT (“SRSWR”)
IN SAMPLING FOR PROPORTIONS, $\sigma^2 = P(1-P)$ (DROPPING THE SUBSCRIPT Y), SO $\sigma_p = \sqrt{P(1-P)/n}$.
TO USE THIS FORMULA, WE NEED TO SPECIFY THE PROPORTION, P.
SINCE P(1-P) ASSUMES ITS MAXIMUM VALUE FOR P=.5, THE MAXIMUM SIZE FOR A 95% CONFIDENCE INTERVAL, $\pm 2\sqrt{P(1-P)/n}$, IS $\pm 1/\sqrt{n}$, AND SO, SETTING $E = 1/\sqrt{n}$ AND SOLVING FOR n, WE OBTAIN THE REQUIRED SAMPLE SIZE AS:
$n = 1/E^2$.
FOR EXAMPLE, IF E =.05, THEN n = 400. IF E = .03 THEN n = 1,111. (NOTE: MANY TELEVISION OPINION POLLS HAVE n = 1,000, AND HAVE AN ERROR OF ESTIMATION OF ABOUT .03.)
NOTE: AN ADVANTAGE OF DETERMINING SAMPLE SIZES FOR SAMPLING FOR PROPORTIONS IS THAT THE VARIANCE (OF THE ESTIMATE OF THE MEAN) IS A FUNCTION OF THE (TRUE) MEAN. HENCE, WE CAN DETERMINE THE SAMPLE SIZE SIMPLY BY SPECIFYING THE MEAN. THE SAMPLE SIZE IS OFTEN (BUT NOT ALWAYS) DETERMINED BY SETTING THE VALUE OF p EQUAL TO .5 (SINCE THIS VALUE PRODUCES THE MAXIMUM VALUE OF n).
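A SHORT PYTHON SKETCH OF THE SAMPLE-SIZE CALCULATION FOR A PROPORTION FOLLOWS. IT ASSUMES THE RELATION n = 4P(1 - P)/E², WHICH REDUCES TO n = 1/E² AT THE CONSERVATIVE CHOICE P = .5. THE FUNCTION NAME IS HYPOTHETICAL, AND THE FUNCTION RETURNS THE UNROUNDED VALUE SO THAT THE ROUNDING RULE IS LEFT TO THE USER.

```python
def sample_size_for_proportion(E, P=0.5):
    """Unrounded n for which 2 * sqrt(P * (1 - P) / n) equals E (SRSWR).
    The default P = 0.5 gives the largest (most conservative) n."""
    return 4.0 * P * (1.0 - P) / E ** 2

for E in (0.05, 0.03):
    n = sample_size_for_proportion(E)
    print(E, round(n, 1))
```

FOR E = .03 THE UNROUNDED VALUE IS ABOUT 1,111.1, WHICH THE NOTES QUOTE AS 1,111; ROUNDING UP TO 1,112 GUARANTEES THAT THE BOUND IS NOT EXCEEDED. A SMALLER n RESULTS IF P IS KNOWN TO BE FAR FROM .5.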
ESTIMATION OF THE
POPULATION TOTAL
(SIMPLE RANDOM SAMPLING WITHOUT REPLACEMENT)
THE POPULATION MEAN IS $\mu = \frac{1}{N}\sum_{i=1}^{N} y_i$.
THE POPULATION TOTAL IS $Y = \sum_{i=1}^{N} y_i = N\mu$.
AN ESTIMATOR OF THE POPULATION TOTAL IS $\hat{Y} = N\bar{y}$.
THE ESTIMATED VARIANCE OF $\hat{Y}$ IS $\hat{V}(\hat{Y}) = N^2\,\frac{s^2}{n}\cdot\frac{N-n}{N}$.
THE “BOUND ON THE ERROR OF ESTIMATION” IS $2\sqrt{\hat{V}(\hat{Y})}$.
FOR SIMPLE RANDOM SAMPLING WITH REPLACEMENT, SIMPLY DROP THE FINITE POPULATION CORRECTION (fpc), $(N-n)/N$.
THE CALCULATION OF SAMPLE SIZE PROCEEDS AS IN THE CASE OF ESTIMATION OF THE POPULATION MEAN, BUT WITH THE DIFFERENT FORMULA FOR THE VARIANCE.
SAMPLING VARIANCE
OF OTHER STATISTICS (SRSWR)
(MEDIAN, PERCENTILES, STANDARD DEVIATION, COEFFICIENT OF VARIATION)
IN THE PRECEDING, WE HAVE GIVEN FORMULAS FOR THE VARIANCES (TRUE AND ESTIMATED) OF THE SAMPLE MEAN, $\bar{y}$, AND THE ESTIMATED POPULATION TOTAL, $\hat{Y}$.
HERE ARE STANDARD ERRORS FOR SOME OTHER STATISTICS:
IF THE PARENT POPULATION IS ![]()
IF THE PARENT POPULATION IS
, WHERE CV DENOTES THE POPULATION COEFFICIENT OF
VARIATION.
IF THE PARENT POPULATION IS
.
SUMMARY OF MAIN
RESULTS
THE GOAL IS ESTIMATION OF FINITE-POPULATION PARAMETERS: MEAN ($\mu$), TOTAL ($Y$). THE FOLLOWING ARE THE MAIN FORMULAS INVOLVED:
FORMULAS FOR ESTIMATORS OF THE POPULATION PARAMETERS: SAMPLE MEAN ($\bar{y}$), ESTIMATED POPULATION TOTAL ($\hat{Y} = N\bar{y}$).
(FROM THIS POINT ON, WE WILL USE $\tau$ RATHER THAN Y, TO DENOTE THE POPULATION TOTAL, TO AVOID CONFUSION WITH THE SYMBOL Y USED TO DENOTE A RANDOM VARIABLE.)
FORMULAS FOR THE TRUE VALUES OF THE VARIANCES OR STANDARD DEVIATIONS (STANDARD ERRORS) OF THE ESTIMATORS. (THESE ARE USED IN THE ESTIMATION OF THE SAMPLE SIZES REQUIRED TO ACHIEVE SPECIFIED LEVELS OF PRECISION.)
FORMULAS FOR ESTIMATING THE STANDARD ERRORS OF THE SAMPLE ESTIMATES: STANDARD ERROR OF $\bar{y}$, STANDARD ERROR OF $\hat{\tau}$. (THESE ARE USED TO INDICATE THE LEVEL OF PRECISION OF THE SAMPLE ESTIMATES.)
BOUND ON THE ERROR OF ESTIMATION (TWICE THE ESTIMATED STANDARD ERROR OF THE ESTIMATE)
95% CONFIDENCE INTERVAL: THE ESTIMATE ± TWICE THE ESTIMATED STANDARD ERROR OF THE ESTIMATE
THE ABOVE SUMMARY WILL APPLY NOT JUST TO SIMPLE RANDOM SAMPLING (WITH OR WITHOUT REPLACEMENT), BUT TO MANY OF THE OTHER TYPES OF SAMPLING TO BE DISCUSSED.
SO, IN GENERAL, ALL WE REALLY NEED TO KNOW ABOUT EACH SURVEY DESIGN IS THE FORMULA FOR A GOOD ESTIMATE (OF A SPECIFIED POPULATION PARAMETER, SUCH AS THE POPULATION MEAN OR TOTAL), AND THE FORMULA FOR THE STANDARD ERROR OF THE ESTIMATE. FROM THESE WE CAN STATE THE “BOUND ON THE ERROR OF ESTIMATION” (TWICE THE STANDARD ERROR OF THE ESTIMATE) AND A 95% CONFIDENCE INTERVAL.
USING
SUPPLEMENTARY (AUXILIARY) INFORMATION TO ASSIST SURVEY DESIGN
IF NOTHING IS KNOWN ABOUT THE POPULATION IN ADVANCE OF SAMPLING (EXCEPT FOR A LIST OF SAMPLING UNITS), THEN SIMPLE RANDOM SAMPLING IS ALL THAT CAN BE DONE.
INFORMATION KNOWN ABOUT THE POPULATION PRIOR TO SAMPLING CAN ENABLE THE CONSTRUCTION OF AN IMPROVED (MORE EFFICIENT) SAMPLE DESIGN: HIGHER PRECISION FOR THE SAME SAMPLING EFFORT, OR ACHIEVEMENT OF A SPECIFIED LEVEL OF PRECISION FOR LESS SAMPLING EFFORT (SMALLER SAMPLE, LOWER COST).
THIS INFORMATION MAY BE ABOUT THE PRIMARY VARIABLE(S) OF INTEREST, OR ABOUT VARIABLES RELATED TO THEM (E.G., IN AN INCOME SURVEY, MAY KNOW LAST YEAR’S INCOME, OR VARIABLES RELATED TO INCOME, SUCH AS QUALITY OF NEIGHBORHOOD, OR AGE, OR EDUCATION).
THIS INFORMATION MAY BE QUALITATIVE OR QUANTITATIVE, BUT IT MUST BE DETERMINABLE FOR EACH SAMPLE UNIT IN THE FRAME.
QUALITATIVE (NOMINAL):
RICH OR POOR (NEIGHBORHOODS, SOIL REGIONS)
ADVANCED OR RETARDED (ECONOMIC REGIONS)
MORE OR LESS DENSELY POPULATED
URBAN OR RURAL
RESIDENCE, ETHNICITY
QUANTITATIVE (ORDINAL, INTERVAL):
INCOME DATA WHEN SURVEYING EXPENDITURE DATA
HEIGHT WHEN ESTIMATING WEIGHT
AGE WHEN ESTIMATING BLOOD PRESSURE
SEX WHEN ESTIMATING MARKET PREFERENCES
EDUCATIONAL LEVEL OR NATIONALITY WHEN SURVEYING ATTITUDES
POLITICAL AFFILIATION WHEN SURVEYING POLITICAL OPINIONS
SAMPLING COSTS (URBAN, RURAL) WHEN SURVEYING SCHOOLS
THIS INFORMATION WILL ENABLE US TO CONSTRUCT A VARIETY OF SURVEY DESIGNS THAT ARE MORE EFFICIENT THAN SIMPLE RANDOM SAMPLING:
STRATIFIED SAMPLING
CLUSTER SAMPLING
MULTISTAGE SAMPLING
DOUBLE SAMPLING (TWO-PHASE SAMPLING)
STRATIFIED RANDOM
SAMPLING
DEFINITION: A STRATIFIED RANDOM SAMPLE IS ONE OBTAINED BY SEPARATING THE SAMPLE UNITS (POPULATION ELEMENTS) INTO NONOVERLAPPING GROUPS, CALLED STRATA, AND SELECTING A SIMPLE RANDOM SAMPLE FROM EACH STRATUM.
THE STRATA ARE DEFINED ON THE BASIS OF AUXILIARY INFORMATION.
REASONS (INDICATIONS) FOR STRATIFICATION:
1. POPULATION ELEMENTS ARE MORE HOMOGENEOUS (LESS VARIABLE) WITHIN STRATA THAN IN THE GENERAL POPULATION (WITH RESPECT TO THE VARIABLES OF INTEREST).
2. COST OF SAMPLING MAY BE LOWER IN SOME STRATA (ADMINISTRATIVE CONVENIENCE).
3. ESTIMATES OF POPULATION PARAMETERS (I.E., MEANS, PROPORTIONS, VARIANCES) CAN BE READILY OBTAINED FOR EACH STRATUM.
EXAMPLE 1:
IN A COUNTY CONTAINING FIVE VOTING DISTRICTS, WE WISH TO ESTIMATE THE PROPORTION OF REGISTERED VOTERS WHO FAVOR A PARTICULAR ELECTION CANDIDATE. WE WANT AN OVERALL ESTIMATE FOR THE COUNTY AND ESTIMATES FOR EACH VOTING DISTRICT. A LIST OF REGISTERED VOTERS IS AVAILABLE. A SAMPLE OF 100 VOTERS IS SELECTED FROM EACH DISTRICT.
THE DATA ARE COLLECTED AND ANALYZED (USING FORMULAS TO BE PRESENTED), AND THE PROPORTIONS ARE ESTIMATED FOR THE COUNTY AND EACH DISTRICT.
EXAMPLE 2:
IT IS DESIRED TO ESTIMATE THE TOTAL NUMBER OF COMPUTERS IN
ALL SCHOOLS IN
IN THIS CASE, USE OF A STRATIFIED SAMPLE DESIGN, WITH STRATIFICATION BY SCHOOL SIZE, URBAN/RURAL STATUS, AND OWNERSHIP STATUS, WOULD PROBABLY BE A GOOD CHOICE. (NOTE: FEW PRACTICAL SAMPLE SURVEYS COLLECT DATA ON ONLY ONE VARIABLE. WHILE THIS DESIGN MAY BE GOOD FOR ESTIMATING THE TOTAL NUMBER OF COMPUTERS, IT WOULD NOT NECESSARILY BE THE BEST FOR ESTIMATING SOME OTHER PARAMETER, SUCH AS THE TYPE OF WATER SOURCE FOR THE SCHOOL (NONE, WELL, PUMP, MUNICIPAL WATER). ALL OF THE SURVEY OBJECTIVES AND CONSTRAINTS MUST BE CONSIDERED TOGETHER IN DESIGNING THE SURVEY.)
STRATIFIED RANDOM
SAMPLING
NOTATION / ESTIMATION FORMULAS
NUMBER OF STRATA: L
WILL USE THE SUFFIX / SUBSCRIPT h TO DENOTE AN ARBITRARY STRATUM, AND SUFFIX / SUBSCRIPT i TO DENOTE AN ARBITRARY UNIT WITHIN A STRATUM.
NUMBER OF UNITS (IN STRATUM h) = “SIZE” OF STRATUM h: Nh
NUMBER OF UNITS IN SAMPLE (IN STRATUM h): nh
VALUE OF THE i-th UNIT: yhi
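STRATUM WEIGHT: $W_h = N_h/N$
SAMPLING FRACTION: $f_h = n_h/N_h$
TRUE MEAN: $\mu_h = \frac{1}{N_h}\sum_{i=1}^{N_h} y_{hi}$
SAMPLE MEAN: $\bar{y}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} y_{hi}$
TRUE VARIANCE: $S_h^2 = \frac{1}{N_h - 1}\sum_{i=1}^{N_h} (y_{hi} - \mu_h)^2$
(WILL USE S2 RATHER THAN σ2, SINCE (1) IT IS CUSTOMARY IN SAMPLE SURVEY; AND (2) THE FORMULAS ARE A LITTLE SIMPLER.)
SAMPLE VARIANCE: $s_h^2 = \frac{1}{n_h - 1}\sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2$
TOTAL POPULATION SIZE: N = N1 + N2 + … + NL
TOTAL SAMPLE SIZE: n = n1 + n2 + … + nL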
THE POPULATION TOTAL, MEAN AND VARIANCE ARE DEFINED THE SAME AS BEFORE.
STRATIFIED RANDOM
SAMPLING
ESTIMATION FORMULAS
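ESTIMATE OF THE POPULATION MEAN (st STANDS FOR “STRATIFIED”):
$\bar{y}_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h \bar{y}_h = \sum_{h=1}^{L} W_h \bar{y}_h$
THIS IS NOT, IN GENERAL, THE SAME AS THE SAMPLE MEAN, WHICH IS:
$\bar{y} = \frac{1}{n}\sum_{h=1}^{L} n_h \bar{y}_h$
THE ESTIMATE $\bar{y}_{st}$ IS EQUAL TO $\bar{y}$ ONLY IF:
$n_h/n = N_h/N$ FOR EVERY STRATUM h (EQUIVALENTLY, $n_h/N_h = n/N$),
I.E., ONLY IF THE SAMPLE IS ALLOCATED TO THE STRATA PROPORTIONALLY (SAMPLE SIZES PROPORTIONAL TO STRATUM SIZES). SUCH A SAMPLE IS SAID TO BE “SELF-WEIGHTING.”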
STRATIFIED RANDOM
SAMPLING
ESTIMATION FORMULAS / MAJOR RESULTS
MAJOR RESULTS:
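THE ESTIMATE $\bar{y}_{st}$ IS AN UNBIASED ESTIMATE OF THE POPULATION MEAN, $\mu$, I.E.,
$E(\bar{y}_{st}) = \mu$
THE TRUE VARIANCE OF $\bar{y}_{st}$ IS:
$V(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2\,\frac{S_h^2}{n_h}\,(1 - f_h)$, WHERE $W_h = N_h/N$ AND $f_h = n_h/N_h$
AN UNBIASED ESTIMATE OF THE VARIANCE OF $\bar{y}_{st}$ IS:
$v(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2\,\frac{s_h^2}{n_h}\,(1 - f_h)$
APPROXIMATE 95% CONFIDENCE LIMITS ARE HENCE AS FOLLOWS:
FOR THE POPULATION MEAN:
$\bar{y}_{st} \pm 2\sqrt{v(\bar{y}_{st})}$
FOR THE POPULATION TOTAL:
$N\bar{y}_{st} \pm 2N\sqrt{v(\bar{y}_{st})}$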
THE ESTIMATION OF SAMPLE SIZES IS COMPLICATED FOR STRATIFIED SAMPLING, BECAUSE THERE ARE MANY SAMPLE SIZES INVOLVED (I.E., ONE FOR EACH STRATUM).
IN MANY APPLICATIONS, THE STRATA ARE SUBPOPULATIONS OF SPECIAL INTEREST, SUCH AS URBAN/RURAL OR MALE/FEMALE OR COUNTRY REGIONS, AND SEPARATE ESTIMATES ARE DESIRED FOR EACH SUCH STRATUM. IN THESE CASES, THE SAMPLE-SIZE FORMULAS FOR SIMPLE RANDOM SAMPLING APPLY TO DETERMINE THE SAMPLE SIZE FOR EACH SUCH STRATUM.
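THE STRATIFIED ESTIMATE CAN BE ILLUSTRATED WITH A SHORT PYTHON SKETCH. IT ASSUMES THE USUAL FORMULAS: THE ESTIMATE OF THE MEAN IS THE SUM OVER STRATA OF (Nh/N) TIMES THE STRATUM SAMPLE MEAN, AND ITS ESTIMATED VARIANCE IS THE SUM OF (Nh/N)² (sh²/nh)(1 - nh/Nh). THE FUNCTION NAME AND THE TWO-STRATUM DATA SET ARE HYPOTHETICAL.

```python
import math

def stratified_mean_estimate(strata):
    """strata: list of (N_h, sample_values) pairs, one pair per stratum.
    Returns (estimated mean, estimated variance, bound, N)."""
    N = sum(N_h for N_h, _ in strata)
    mean_st = 0.0
    var_st = 0.0
    for N_h, values in strata:
        n_h = len(values)
        W_h = N_h / N                       # stratum weight
        ybar_h = sum(values) / n_h          # stratum sample mean
        s2_h = sum((y - ybar_h) ** 2 for y in values) / (n_h - 1)
        mean_st += W_h * ybar_h
        var_st += W_h ** 2 * (s2_h / n_h) * (1.0 - n_h / N_h)
    bound = 2.0 * math.sqrt(var_st)         # twice the estimated standard error
    return mean_st, var_st, bound, N

# Hypothetical data: stratum sizes 60 and 40, samples of 3 and 4 units.
strata = [(60, [10.0, 12.0, 14.0]), (40, [20.0, 24.0, 28.0, 32.0])]
mean_st, var_st, bound, N = stratified_mean_estimate(strata)
print("mean:", mean_st, "+/-", bound)
print("total:", N * mean_st, "+/-", N * bound)
```

IN THIS EXAMPLE THE UNWEIGHTED MEAN OF THE SEVEN OBSERVATIONS IS 20, WHILE THE STRATIFIED ESTIMATE IS ABOUT 17.6, BECAUSE THE SECOND STRATUM IS OVERSAMPLED RELATIVE TO ITS SIZE.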
STRATIFIED RANDOM
SAMPLING
ALLOCATION OF SAMPLE TO STRATA
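PROPORTIONAL ALLOCATION (SELF-WEIGHTING):
$n_h = n\,N_h/N$
OPTIMAL (MINIMUM VARIANCE), IF COST OF SAMPLING IS THE SAME IN ALL STRATA, BUT THE STRATUM VARIANCES MAY DIFFER:
$n_h = n\,\frac{N_h S_h}{\sum_{k=1}^{L} N_k S_k}$
THIS IS CALLED THE “NEYMAN” ALLOCATION.
OPTIMAL (MINIMUM VARIANCE), WITH SAMPLING COST FUNCTION (“LINEAR” COST FUNCTION):
COST = $C = c_0 + \sum_{h=1}^{L} c_h n_h$, WHERE $c_0$ IS THE FIXED (OVERHEAD) COST AND $c_h$ IS THE COST PER SAMPLE UNIT IN STRATUM h
$n_h = n\,\frac{N_h S_h/\sqrt{c_h}}{\sum_{k=1}^{L} N_k S_k/\sqrt{c_k}}$
GUIDELINES: TAKE A LARGER SAMPLE IN A STRATUM IF THE STRATUM IS LARGER (LARGE $N_h$), IF THE STRATUM IS MORE VARIABLE INTERNALLY (LARGE $S_h$), OR IF SAMPLING IS CHEAPER IN THE STRATUM (SMALL $c_h$).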
THE MORE WE KNOW ABOUT THE VARIABLE OF INTEREST (y), THE BETTER JOB WE CAN DO OF STRATIFICATION. IN SOME CASES, IT MAY BE WORTHWHILE TO CONDUCT A PRELIMINARY SAMPLE TO OBTAIN SOME INFORMATION THAT WOULD ASSIST STRATIFICATION (TWO-PHASE SAMPLING, OR DOUBLE SAMPLING).
IN SAMPLING FOR PROPORTIONS, THE FORMULAS BECOME A LITTLE SIMPLER – SEE A TEXT ON SAMPLE SURVEY FOR THE FORMULAS IN THAT CASE.
SYSTEMATIC SAMPLING: WHEN THE UNITS ARE ARRANGED IN DESCENDING OR ASCENDING ORDER, THEN SYSTEMATIC SAMPLING HAS A SIMILAR EFFECT AS STRATIFICATION. (WE ADDRESS SYSTEMATIC SAMPLING LATER.)
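THE THREE ALLOCATION RULES OF THIS SECTION (PROPORTIONAL, NEYMAN, AND OPTIMAL WITH A LINEAR COST FUNCTION) CAN BE COMPARED WITH A SHORT PYTHON SKETCH. IT ASSUMES THE STANDARD FORMS: nh PROPORTIONAL TO Nh, TO Nh·Sh, AND TO Nh·Sh/√ch, RESPECTIVELY. THE FUNCTION NAMES AND THE THREE-STRATUM VALUES ARE HYPOTHETICAL, AND THE RESULTS ARE LEFT UNROUNDED.

```python
import math

def proportional_allocation(n, N_h):
    """n_h proportional to stratum size N_h."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]

def neyman_allocation(n, N_h, S_h):
    """n_h proportional to N_h * S_h (equal unit costs in all strata)."""
    weights = [Nh * Sh for Nh, Sh in zip(N_h, S_h)]
    total = sum(weights)
    return [n * w / total for w in weights]

def cost_optimal_allocation(n, N_h, S_h, c_h):
    """n_h proportional to N_h * S_h / sqrt(c_h) (linear cost function)."""
    weights = [Nh * Sh / math.sqrt(ch) for Nh, Sh, ch in zip(N_h, S_h, c_h)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical strata: sizes, within-stratum standard deviations, unit costs.
N_h = [600, 300, 100]
S_h = [10.0, 20.0, 60.0]
c_h = [1.0, 4.0, 9.0]
n = 100
print(proportional_allocation(n, N_h))
print(neyman_allocation(n, N_h, S_h))
print(cost_optimal_allocation(n, N_h, S_h, c_h))
```

IN THIS EXAMPLE THE SMALL BUT HIGHLY VARIABLE THIRD STRATUM RECEIVES 10 UNITS UNDER PROPORTIONAL ALLOCATION BUT ABOUT 33 UNDER NEYMAN ALLOCATION; ITS HIGHER UNIT COST THEN REDUCES ITS SHARE TO ABOUT 18. IN PRACTICE THE nh ARE ROUNDED TO WHOLE NUMBERS.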
GAINS IN PRECISION
FROM STRATIFICATION
THE GAIN IN PRECISION FROM STRATIFICATION OVER SIMPLE RANDOM SAMPLING DEPENDS MAINLY (APART FROM COST CONSIDERATIONS) ON HOW MUCH LESS THE VARIATION WITHIN STRATA IS COMPARED TO THE VARIATION OVER THE GENERAL POPULATION.
A SIMPLE MODEL.
CONSIDER THE CASE IN WHICH THE OBSERVED RANDOM VARIABLE, X, IS THE SUM OF TWO INDEPENDENT RANDOM VARIABLES, A “STRATUM” COMPONENT, XS, AND A “WITHIN-STRATUM” COMPONENT, XW:
X = XS + XW
SUPPOSE THAT E(XS) = μS, E(XW) = 0, V(XS) = σS2 AND V(XW) = σW2.
THEN V(X) = V(XS) + V(XW), OR σ2 = σS2 + σW2.
SUPPOSE THAT THERE ARE L STRATA AND THAT AN EQUAL NUMBER OF UNITS, nS, IS SELECTED FROM EACH STRATUM. THE TOTAL SAMPLE SIZE IS n = LnS.
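THE ESTIMATOR OF THE POPULATION MEAN IS:
$\bar{x}_{st} = \frac{1}{L}\sum_{h=1}^{L} \bar{x}_h$
WHERE $\bar{x}_h = \frac{1}{n_S}\sum_{i=1}^{n_S} x_{hi}$ IS THE SAMPLE MEAN IN STRATUM h.
ITS VARIANCE IS:
$V(\bar{x}_{st}) = \frac{1}{L^2}\sum_{h=1}^{L} \frac{\sigma_W^2}{n_S} = \frac{\sigma_W^2}{L\,n_S} = \frac{\sigma_W^2}{n}$.
THE VARIANCE OF A SIMPLE RANDOM SAMPLE (WITH REPLACEMENT) OF SIZE n = LnS IS:
$V(\bar{x}) = \frac{\sigma^2}{n} = \frac{\sigma_S^2 + \sigma_W^2}{n}$.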
SO THE RATIO OF THE VARIANCES OF STRATIFIED RANDOM SAMPLING TO SIMPLE RANDOM SAMPLING IN THIS CASE IS σW2/σ2.
ALTERNATIVE ESTIMATION TECHNIQUES: RATIO AND REGRESSION ESTIMATORS
RATIO ESTIMATORS (IN SIMPLE RANDOM SAMPLING)
VARIABLE OF PRIMARY INTEREST (“RESPONSE” VARIABLE): Y
SUPPOSE THAT THERE IS ANOTHER (“AUXILIARY”) VARIATE, X, CORRELATED WITH Y, AND KNOWN FOR EACH UNIT OF THE SAMPLE, AND FOR WHICH WE KNOW THE POPULATION TOTAL, $\tau_x$.
SUPPOSE FURTHER THAT THE RELATIONSHIP BETWEEN THE RESPONSE VARIABLE AND THE AUXILIARY VARIABLE IS LINEAR THROUGH THE ORIGIN:
$Y = \beta X + e$, WHERE e IS A RANDOM ERROR WITH MEAN ZERO.
THEN A RATIO ESTIMATOR IS A GOOD CHOICE. IT IS, IN FACT, THE BEST CHOICE IF THE VARIANCE OF THE RESPONSE VARIABLE, Y, ABOUT THE LINE IS PROPORTIONAL TO X.
THE RATIO ESTIMATE OF $\tau_y$, THE POPULATION TOTAL (FOR Y) IS:
$\hat{\tau}_y = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}\,\tau_x = r\,\tau_x$
(THE PRECEDING IS SCHEAFFER’S NOTATION. IN COCHRAN’S NOTATION, USING X AND Y TO DENOTE THE POPULATION TOTALS, THIS FORMULA IS:
$\hat{Y}_R = \frac{y}{x}\,X$
WHERE x AND y DENOTE THE SAMPLE TOTALS OF THE xi AND yi, RESPECTIVELY, AND X DENOTES THE POPULATION TOTAL FOR THE xi. THIS NOTATION IS CONFUSING, SINCE X REFERS BOTH TO A RANDOM VARIABLE AND A POPULATION TOTAL (OF THE xi), AND x, WHICH WOULD NORMALLY REFER TO A SPECIFIC VALUE OF THE X RANDOM VARIABLE, INSTEAD REFERS TO THE SAMPLE TOTAL OF THE xi.)
THE RATIO ESTIMATE OF $\mu_y$, THE POPULATION MEAN (FOR Y) IS:
$\hat{\mu}_y = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}\,\mu_x = r\,\mu_x$
(OR, IN COCHRAN’S NOTATION: $\hat{\bar{Y}}_R = \frac{\bar{y}}{\bar{x}}\,\bar{X}$).
THE RATIO ESTIMATE IS BIASED, BUT THE BIAS IS NEGLIGIBLE IN LARGE SAMPLES. IT IS CONSISTENT, I.E., ITS AVERAGE TENDS TO THE TRUE VALUE AS THE SAMPLE SIZE INCREASES.
NOTE: FROM THIS POINT ON IN THIS COURSE, WE WILL NOT PRESENT FORMULAS FOR THE TRUE VARIANCES OF ALL OF THE SAMPLE ESTIMATES DISCUSSED. THE TRUE VARIANCE IS NEEDED TO DETERMINE SAMPLE SIZE, BUT IT IS NOT USED IN THE ANALYSIS OF THE SAMPLE DATA. THE FORMULAS FOR THE ESTIMATED VARIANCES (BASED ON THE SAMPLE DATA) WILL ALWAYS BE PRESENTED (SINCE THEY ARE ALWAYS NEEDED IN THE ANALYSIS OF THE SAMPLE DATA), BUT THE TRUE FORMULAS WILL BE PRESENTED ONLY WHEN ADDRESSING THE PROBLEM OF DETERMINING SAMPLE SIZE (IN ADVANCE OF THE SURVEY). STANDARD REFERENCE TEXTS MAY BE CONSULTED FOR THE FORMULAS FOR THE TRUE VARIANCES, IN THOSE CASES IN WHICH THEY ARE NOT PRESENTED HERE.
SOME DISCUSSION OF DETERMINING SAMPLE SIZES WAS PRESENTED EARLIER, IN THE CASE OF SIMPLE RANDOM SAMPLING. DETERMINING SAMPLE SIZES FOR OTHER SAMPLE DESIGNS IS ADDRESSED IN DAY TWO OF THE COURSE.
THE MAJOR PROBLEM IN DETERMINING SAMPLE SIZES IN COMPLEX SURVEYS BY SETTING THE DESIRED NUMERICAL VALUE OF AN ERROR BOUND EQUAL TO THE THEORETICAL (FORMULA) VALUE IS THAT THE VARIANCES INVOLVED IN THE FORMULA ARE USUALLY NOT KNOWN (PRIOR TO CONDUCTING THE SURVEY), EVEN APPROXIMATELY. FOR THIS REASON, A DIFFERENT APPROACH IS USED TO DETERMINE SAMPLE SIZES FOR COMPLEX SURVEYS. IT IS BASED ON A FUNCTION CALLED THE “DESIGN EFFECT,” OR KISH’S “DEFF.” THIS APPROACH IS DISCUSSED IN DAY TWO OF THE COURSE.
THE FORMULAS THAT ARE PRESENTED IN THE FOLLOWING ARE USED TO ANALYZE THE DATA, BUT THEY ARE NOT VERY USEFUL FOR ESTIMATION OF SAMPLE SIZES, BECAUSE THE VALUES OF THE PARAMETERS INVOLVED ARE USUALLY NOT KNOWN UNTIL AFTER THE SURVEY HAS BEEN CONDUCTED.
THE FORMULAS THAT ARE PRESENTED IN THIS COURSE ARE USEFUL FOR CALCULATING ESTIMATES IN THE CASE OF HIGHLY STRUCTURED SURVEY DESIGNS. IN MANY LARGE-SCALE SURVEYS, IT IS NECESSARY TO DEPART FROM HIGHLY STRUCTURED DESIGNS, AND THE FORMULAS DO NOT APPLY. IN SUCH CASES, NUMERICAL METHODS (BASED ON SIMULATION, OR “RESAMPLING”) ARE AVAILABLE TO DO THE ESTIMATION. THESE ARE MENTIONED IN THIS COURSE, BUT NOT DESCRIBED IN DETAIL.
RATIO ESTIMATORS
IN SIMPLE RANDOM SAMPLING
SUMMARY OF RESULTS
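ESTIMATOR OF THE POPULATION RATIO, R:
$r = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i} = \frac{\bar{y}}{\bar{x}}$
ESTIMATED VARIANCE OF r:
$\hat{V}(r) = \frac{N-n}{nN}\cdot\frac{1}{\mu_x^2}\cdot\frac{\sum_{i=1}^{n} (y_i - r x_i)^2}{n-1}$
BOUND ON THE ERROR OF ESTIMATION OF r:
$2\sqrt{\hat{V}(r)}$
IF THE POPULATION MEAN FOR X, μX, IS UNKNOWN, USE THE SAMPLE ESTIMATE, $\bar{x}$, TO APPROXIMATE $\mu_x$.
RATIO ESTIMATOR OF THE POPULATION TOTAL:
$\hat{\tau}_y = r\,\tau_x$
ESTIMATED VARIANCE OF $\hat{\tau}_y$:
$\hat{V}(\hat{\tau}_y) = \tau_x^2\,\hat{V}(r)$.
NOTE THAT IT IS NECESSARY TO KNOW $\tau_x$, THE POPULATION TOTAL FOR X, IN ORDER TO ESTIMATE $\tau_y$ BY THE RATIO ESTIMATION METHOD.
AS USUAL, AN APPROXIMATE 95% CONFIDENCE INTERVAL FOR A POPULATION PARAMETER IS GIVEN BY:
PARAMETER ESTIMATE PLUS/MINUS 2 (ESTIMATED STANDARD ERROR OF THE PARAMETER ESTIMATE),
WHERE THE ESTIMATED STANDARD ERROR OF THE PARAMETER ESTIMATE IS THE SQUARE ROOT OF ITS ESTIMATED VARIANCE. IN THE PRECEDING CASE:
$\hat{\tau}_y \pm 2\sqrt{\hat{V}(\hat{\tau}_y)}$
RATIO ESTIMATOR OF THE POPULATION MEAN:
$\hat{\mu}_y = r\,\mu_x$
ESTIMATED VARIANCE OF $\hat{\mu}_y$:
$\hat{V}(\hat{\mu}_y) = \mu_x^2\,\hat{V}(r) = \frac{N-n}{nN}\cdot\frac{\sum_{i=1}^{n} (y_i - r x_i)^2}{n-1}$.
NOTE THAT IT IS NECESSARY TO KNOW $\mu_x$, THE POPULATION MEAN FOR X, IN ORDER TO ESTIMATE $\mu_y$ BY THE RATIO ESTIMATION METHOD.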
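THE RATIO-ESTIMATION CALCULATIONS CAN BE SKETCHED IN PYTHON AS FOLLOWS. THE SKETCH ASSUMES THE FORMULAS IN SCHEAFFER’S NOTATION FOR SIMPLE RANDOM SAMPLING WITHOUT REPLACEMENT: r = Σyi/Σxi, AN ESTIMATED VARIANCE OF r EQUAL TO ((N - n)/(nN))(1/μx²)Σ(yi - r·xi)²/(n - 1), AND VARIANCES OF THE ESTIMATED TOTAL AND MEAN OBTAINED BY MULTIPLYING BY τx² AND μx². THE FUNCTION NAME AND THE DATA ARE HYPOTHETICAL.

```python
import math

def ratio_estimates(x, y, N, tau_x):
    """Ratio estimates of the population ratio, total and mean of y,
    given the known population total tau_x of the auxiliary variable."""
    n = len(x)
    r = sum(y) / sum(x)
    mu_x = tau_x / N
    # Sum of squared deviations of y about the fitted line through the origin.
    resid_ss = sum((yi - r * xi) ** 2 for xi, yi in zip(x, y))
    var_r = ((N - n) / (n * N)) * (1.0 / mu_x ** 2) * resid_ss / (n - 1)
    return {
        "r": r,
        "var_r": var_r,
        "total": r * tau_x,
        "var_total": tau_x ** 2 * var_r,
        "mean": r * mu_x,
        "var_mean": mu_x ** 2 * var_r,
    }

# Hypothetical sample of n = 4 units from N = 100, with tau_x = 2600.
x = [10.0, 20.0, 30.0, 40.0]
y = [22.0, 38.0, 63.0, 77.0]
est = ratio_estimates(x, y, N=100, tau_x=2600.0)
bound_total = 2.0 * math.sqrt(est["var_total"])
print(est["total"], "+/-", bound_total)
```

IF μx IS NOT KNOWN, THE SAMPLE MEAN OF x MAY BE SUBSTITUTED IN THE VARIANCE OF r, AS NOTED ABOVE; THE ESTIMATES OF THE TOTAL AND MEAN STILL REQUIRE THE KNOWN POPULATION VALUE.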
RATIO ESTIMATORS
IN STRATIFIED RANDOM SAMPLING
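TWO APPROACHES: THE SEPARATE RATIO ESTIMATE (A RATIO ESTIMATE IS COMPUTED WITHIN EACH STRATUM, AND THE STRATUM ESTIMATES ARE THEN COMBINED) AND THE COMBINED RATIO ESTIMATE (A SINGLE RATIO IS COMPUTED FROM THE STRATIFIED ESTIMATES OF THE TOTALS OF Y AND X).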
WHICH METHOD IS PREFERRED DEPENDS ON THE NATURE OF THE POPULATION AND THE DESIGN. IF THE RATIO VARIES FROM STRATUM TO STRATUM, THE SEPARATE ESTIMATE IS USUALLY BETTER (MORE PRECISE). IF THE SAMPLE SIZE IS SMALL IN EACH STRATUM, THE COMBINED RATIO ESTIMATE IS USUALLY BETTER.
THE FORMULAS FOR RATIO ESTIMATORS IN STRATIFIED RANDOM SAMPLING ARE SOMEWHAT COMPLICATED. REFER TO COCHRAN, SAMPLING TECHNIQUES FOR THE FORMULAS. THERE ARE TECHNIQUES AVAILABLE TO REDUCE OR REMOVE THE BIAS, AND TO IMPROVE THE VARIANCE OF THE ESTIMATE.
PRODUCT ESTIMATORS:
IF X AND Y TAKE ONLY POSITIVE VALUES AND THE CORRELATION IS NEGATIVE, THEN A RATIO ESTIMATE IS NOT APPROPRIATE, BUT A SIMILAR ESTIMATE, CALLED A PRODUCT ESTIMATOR, IS INDICATED:
$\hat{\mu}_y = \bar{y}\,\bar{x}/\mu_x$,
(OR, IN COCHRAN’S NOTATION: $\hat{\bar{Y}}_p = \bar{y}\,\bar{x}/\bar{X}$).
REGRESSION
ESTIMATORS IN SIMPLE RANDOM SAMPLING
VARIABLE OF PRIMARY INTEREST (“RESPONSE” VARIABLE): Y
SUPPOSE THAT THERE IS ANOTHER (“AUXILIARY”) VARIATE, X, CORRELATED WITH Y, AND KNOWN FOR EACH UNIT OF THE SAMPLE, AND FOR WHICH WE KNOW THE POPULATION TOTAL, $\tau_x$.
SUPPOSE FURTHER THAT THE RELATIONSHIP BETWEEN THE RESPONSE VARIABLE AND THE AUXILIARY VARIABLE IS LINEAR, BUT NOT NECESSARILY THROUGH THE ORIGIN, AS WAS ASSUMED IN THE CASE OF RATIO ESTIMATION:
$Y = \alpha + \beta X + e$, WHERE e IS A RANDOM ERROR WITH MEAN ZERO.
THEN USE OF A LINEAR REGRESSION ESTIMATOR IS APPROPRIATE.
FOR EXAMPLE, WE MAY KNOW LAST YEAR’S SCHOOL BUDGET FOR EACH SCHOOL IN THE COUNTRY (FROM AN ANNUAL SCHOOL CENSUS), AND WANT TO OBTAIN A PRELIMINARY ESTIMATE OF THIS YEAR’S BUDGET FROM A SAMPLE OF SCHOOLS.
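(LINEAR) REGRESSION ESTIMATOR OF A POPULATION MEAN, $\mu_y$:
$\hat{\mu}_{yL} = \bar{y} + b\,(\mu_x - \bar{x})$
WHERE
$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
ESTIMATED VARIANCE OF $\hat{\mu}_{yL}$:
$\hat{V}(\hat{\mu}_{yL}) = \frac{N-n}{nN}\cdot\frac{1}{n-2}\left[\sum_{i=1}^{n} (y_i - \bar{y})^2 - b^2 \sum_{i=1}^{n} (x_i - \bar{x})^2\right]$
BOUND ON THE ERROR OF ESTIMATION: $2\sqrt{\hat{V}(\hat{\mu}_{yL})}$.
FOR AN ESTIMATE OF THE POPULATION TOTAL (FOR Y), USE $N\hat{\mu}_{yL}$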
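A SHORT PYTHON SKETCH OF THE REGRESSION ESTIMATOR FOLLOWS. IT ASSUMES THE ESTIMATOR ȳ + b(μx - x̄), WITH b THE LEAST-SQUARES SLOPE, AND THE ESTIMATED VARIANCE ((N - n)/(nN)) · [Σ(yi - ȳ)² - b²Σ(xi - x̄)²]/(n - 2), AS GIVEN IN SCHEAFFER’S TEXT. THE FUNCTION NAME AND THE DATA ARE HYPOTHETICAL.

```python
import math

def regression_estimate(x, y, N, mu_x):
    """Linear regression estimate of the population mean of y, given the
    known population mean mu_x of the auxiliary variable.
    Returns (estimate, estimated variance, bound, slope b)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    syy = sum((yi - ybar) ** 2 for yi in y)
    b = sxy / sxx                               # least-squares slope
    mu_hat = ybar + b * (mu_x - xbar)
    resid_ss = max(syy - b ** 2 * sxx, 0.0)     # guard against rounding below zero
    est_var = ((N - n) / (n * N)) * resid_ss / (n - 2)
    bound = 2.0 * math.sqrt(est_var)
    return mu_hat, est_var, bound, b

# Hypothetical sample of n = 5 from N = 50, with known mu_x = 3.5.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 10.0, 10.0]
mu_hat, est_var, bound, b = regression_estimate(x, y, N=50, mu_x=3.5)
print(mu_hat, "+/-", bound)
print("estimated total:", 50 * mu_hat)
```

THE SAMPLE MEAN OF y (7.0) IS ADJUSTED UPWARD HERE BECAUSE THE SAMPLE MEAN OF x (3.0) FALLS BELOW THE KNOWN POPULATION MEAN OF x.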
CLUSTER SAMPLING
(SINGLE-STAGE
CLUSTER SAMPLING)
A CLUSTER SAMPLE IS A SIMPLE RANDOM SAMPLE IN WHICH EACH SAMPLING UNIT IS A COLLECTION, OR CLUSTER, OF ELEMENTS. IN THIS CASE THE SAMPLING UNITS ARE THE CLUSTERS, AND THE ELEMENTS WITHIN THE UNITS ARE CALLED SUBUNITS.
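CLUSTER SAMPLING IS FAR MORE COST-EFFECTIVE THAN SIMPLE RANDOM SAMPLING OR STRATIFIED SAMPLING, IF A FRAME LISTING THE INDIVIDUAL ELEMENTS IS UNAVAILABLE OR VERY COSTLY TO CONSTRUCT (WHILE A FRAME OF CLUSTERS IS READILY AVAILABLE), OR IF THE COST OF COLLECTING DATA RISES WITH THE DISTANCE BETWEEN THE SAMPLED ELEMENTS.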
EXAMPLES:
IN MANY COUNTRIES THERE ARE NO COMPLETE, UP-TO-DATE LISTS OF HOUSEHOLDS OR FARMS, AND THE COST OF CONSTRUCTING A FRAME OF ALL UNITS OF THE POPULATION WOULD BE PROHIBITIVE. IT IS MUCH CHEAPER TO DIVIDE THE COUNTRY INTO GEOGRAPHIC AREAS (I.E., CONSTRUCT AN AREA FRAME), SELECT A RANDOM SAMPLE OF GEOGRAPHIC AREAS, AND OBSERVE ALL OF THE ELEMENTS (HOUSEHOLDS, FARMS) WITHIN EACH SELECTED AREA.
SUPPOSE THAT IN A CITY, CITY BLOCKS CONTAIN AN AVERAGE OF 20 HOUSEHOLDS EACH. INTERVIEWING ALL HOUSEHOLDS IN A SAMPLE OF 50 BLOCKS WILL COST SUBSTANTIALLY LESS THAN INTERVIEWING A SIMPLE RANDOM SAMPLE OF 1,000 HOUSEHOLDS. ALSO, A FRAME OF CITY BLOCKS MAY BE READILY AVAILABLE, WHEREAS A FRAME OF HOUSEHOLDS MAY NOT.
OTHER EXAMPLES:
IN CLUSTER SAMPLING, IT IS DESIRED THAT CLUSTERS BE INTERNALLY HETEROGENEOUS (WITH RESPECT TO THE CHARACTERISTICS BEING MEASURED). IF ALL OF THE ELEMENTS WITHIN A CLUSTER ARE VERY SIMILAR THEN RELATIVELY LIT