IMPROVED ALGORITHMS

FOR ESTIMATION,

PREDICTION AND CONTROL

 

Final Report

 

 

January 26, 1986

 

 

 

 

Contract No. N00014-85-C-0814

 

 

 

 

 

 

Submitted to:

 

Dr. Neal D. Glassman

Office of Naval Research

Department of the Navy

800 N. Quincy Street

Arlington, Virginia 22217-5000

 

 

 

Submitted by:

 

Vista Research Corporation

4540 Cerco del Coraz$on

Tucson, Arizona 85718

(602) 299-0286

 

 

Note: This report was originally prepared in WordPerfect 4.2.  It has been converted to Microsoft Word 2013, using the best available electronic copy.  The conversion did not go well, and much reformatting (e.g., subscripts) is required.  This copy of the report is not complete.

 

 

Copyright (C) 1986 Vista Research Corporation

 


FOREWORD

 

This report was prepared under Contract No. N00014-85-C-0814 with the Office of Naval Research (ONR), under Phase I of the Small Business Innovation Research (SBIR) Program.  Vista Research Corporation expresses its appreciation to ONR, to the SBIR Program, and to Dr. Neal D. Glassman, ONR Scientific Officer, for their support in sponsoring this work.  The author of this report was Dr. J. George Caldwell.

 


 

Contents

FOREWORD.. 2

I. INTRODUCTION AND SUMMARY.. 4

II. BACKGROUND.. 10

III. PROJECT APPROACH.. 19

IV. SIMULATION RESULTS.. 28

V. CONCLUSIONS AND RECOMMENDATIONS.. 29

 

 

Note: This version of the document is incomplete, missing the following items:

 

References. . . . . . . . . . . . . . . . . . .   

 

Appendices

 

A.   Computer Program Source Code Listings. . .A-1

 

B.   Computer Program Output. . . . . . . . . .B-1

 


I. INTRODUCTION AND SUMMARY

 

 

A. _Study Purpose_

 

This report describes the results of a study to examine the feasibility of developing fast algorithms for estimation, prediction and control.  The objective of the study was to assess the likelihood of finding procedures which could be faster, in terms of computer running time, than the classical least-squares method of parameter estimation, used extensively to develop models of stochastic phenomena.  While the least-squares method has proved its worth in over a century of use, it has some serious drawbacks, stemming from the fact that it requires the inversion of matrices.  For "large" problems such as the problem of tracking many missiles or processing data from multiple intelligence sensors in real-time, the computational burden of the least-squares method can overwhelm even today's powerful computing systems.  Previous attempts to solve this problem have centered on the development of faster computers (e.g., array processors), the improvement of algorithms for matrix inversion, or the simplification of the model to produce a matrix that is easier to invert.  In general, these approaches have not been successful in solving the problem.  Modern sensor exploitation systems, for example, still cannot operate in real-time or even near-real-time.

 

The present study proposed to adopt a totally different approach to the problem.  In particular, it was proposed to investigate methods which would avoid the computationally-intensive process of matrix inversion.  Avoiding this procedure could reap tremendous benefits.  For example, the problem of tracking a missile can involve the inversion of a nine-dimensional matrix at each instant that a radar pulse is received, if a nine-component state vector is used to represent missile position, velocity, and acceleration.

 

This project was supported as a Phase I project of the Small Business Innovation Research (SBIR) program.  The objective of a Phase I SBIR project is to assess the feasibility of a proposed concept and to develop a plan for developing the concept.  If, based on the Phase I effort, the proposed concept appears to have merit, then development of the concept may be funded under Phase II of the SBIR program.

 

B. _Study Results_

 

This Phase I project has successfully accomplished all of the tasks identified in the proposal, and established the feasibility of the proposed concept.  Seven tasks were proposed to be accomplished in this study:

 

     Task 1. Development of Criteria for Comparing

             Alternative Estimation Schemes

 

     Task 2. Development of Test Cases

 

     Task 3. Implementation of the Single-Variable Linear-

             Model Case

 

     Task 4. Extension to the Multiple-Variable Linear-

             Model Case

 

     Task 5. Comparison of Methods

 

     Task 6. Design of Phase II Work Plan

 

     Task 7. Preparation of Final Report

 

The results of each of the seven study tasks are summarized in the paragraphs that follow.

 

_Task 1. Development of Criteria for Comparing Alternative Estimation Schemes_

 

A total of seven criteria were identified and retained as useful for describing the performance of alternative estimation algorithms.  These criteria address the following performance aspects:

 

     o computational speed

 

     o computer storage requirements

 

     o precision of the model parameter estimates

 

     o bias of the model parameter estimates

 

     o accuracy of the model parameter estimates

 

     o precision of model-based predictions

 

     o numerical stability of the algorithm

 

"Computational speed" refers to the time required to analyze the data and produce estimates of the model being used to describe the data.  "Computer storage requirements" refers to the total amount of direct-access memory ("core") required to implement the algorithm.  "Estimate precision" refers to the amount of variability of the estimates in repeated data samples.  "Estimate bias" refers to the difference between the expected (average) value of the parameter estimates in repeated data samples and the true values of the parameters.  "Estimate accuracy" is a combined measure of precision and bias.  "Prediction precision" refers to the closeness of the model-based predictions to actual future values.  "Numerical stability" refers to the ability of an algorithm to converge to a desired answer.

 

Specific measures were determined for each of the preceding concepts.  Because of project resource limitations, however, it was not possible to develop computer software to determine numerical values for all of the measures.

 

_Task 2. Development of Test Cases_

 

It was decided to test the performance of alternative algorithms on sixteen data sets.  All of these data sets involve a single dependent variable ("y") and a number (m) of independent variables ("x's").  In each case the model used to generate the data is of the form:

 

 

     yj = bo +  bi xij + ej

 

 

        = bo + _x_j'_b_ + ej

 

 

This model, called a linear statistical model, specifies the relationship of the dependent variable y to m independent, or explanatory, variables (x1j,x2j,...,xmj).  The variables bo and b1,b2,...bm are constants, called regression coefficients.  They are specified when generating the test data, and are to be estimated by the estimation algorithm.  The ej are called model error terms, or model residuals.  They are a sequence of uncorrelated random variables with mean zero and standard deviation SIG.  The m x's are also random variables with zero mean and variance matrix  *SIGMA2 (where the asterisk denotes multiplication).  The test cases vary in the ratio of SIG to SIGMA (models for which the ratio SIG/SIGMA is low are easier to estimate), and the strength of the correlations among the x's (if the x's are uncorrelated, i.e.,   is the identity matrix, the estimation is easy; if the x's are highly correlated, the estimation is difficult).  The values of SIG, SIGMA,   , bo, and _b_ are collectively referred to as the "model parameters."

 

The test cases considered were as follows:

 

     Test Cases 1-4: m=1 x, value of SIG/SIGMA varies

         from low to high

 

     Test Cases 5-8: m=3 x's, correlation of x's varies

         from none to high

 

     Test Cases 9-12: m=6 x's, correlation of x's from

          none to high

 

     Test Cases 13-16: m=10 x's, correlation of x's from

          none to high

 

The number (m) of independent variables (x's) represents the "dimensionality" of the estimation problem.  In the classical least-squares approach, it is necessary to invert a matrix of order m in order to estimate the model parameters.

 

In social science applications, the dimensionality of a linear regression model can be quite high, e.g., m = 25 to 50 (e.g., there can be a model coefficient corresponding to every possible response to a socioeconomic or demographic question included in a survey questionnaire).  In industrial and scientific applications, the number of explanatory variables may vary from small to large, but the researcher is often able to specify the values of the independent variables, so that even if there are many of them, special procedures (e.g., fractional factorial experimental designs) can be used to avoid the explicit inversion of a matrix.  In military applications, the value of m is often moderate or small.  For example, in the application of tracking a missile, the position, velocity, and acceleration of the missile can be specified by a nine-component state vector, and the estimation of the parameters of the statistical model (called a Kalman filter) requires inversion of a nine by nine matrix.  The test cases specified above include the usual dimensionality range of interest for many military applications.  For all test cases, the number of observations generated (i.e., the sample size) was n = 100.  Later study should examine the effect of varying sample size; with the resources available to this project, it was not possible to examine a very large number of test cases, and the decision was made to hold sample size constant for all test cases.

 

A computer program, called SIMULA, was written to implement the generation of the test case data.  A source code listing of that program is presented in Appendix A.

 

_Task 3. Implementation of the Single-Variable Linear-Model Case_

 

In the 1940's and 1950's, some work was done in investigating model estimation procedures that were alternatives to the classical least-squares procedure.  Two of these procedures are referred to as the "Wald" method and the "Bartlett" method.  They were designed for application to the case in which there was a single explanatory variable (i.e., m = 1).  A computer program was written in this project to implement both of these procedures.

 

As one of the first steps in this study, it was proposed to compare the performance of the Wald and Bartlett estimation procedures to the performance of the classical least squares method, in order to illustrate the utility of the criteria and associated performance measures proposed to compare alternative estimation procedures.

 

This comparison demonstrated that several of the suggested criteria could be applied to measure algorithm performance.  For the single-explanatory-variable example, however, the comparison is not very revealing.  The estimation of parameters for single-explanatory-variable (m = 1) models is very easy and fast with any of the methods, so that there is almost no variation in the performance measures.

 

It was not possible to implement all of the performance measures in this project, because of resource limitations.  For example, one problem that arose was that, for the microcomputer software used in this study, the system timer could not be accessed by the FORTRAN compiler.  It was decided not to allocate project resources to the development of an assembly-language timer that could be linked to the FORTRAN-compiled object modules.  Instead, timing measurements were made externally (manually, by direct visual observation), and they are hence approximate.  For other performance measures (e.g., the parameter accuracy measures), it would have been necessary to replicate a large number of sample cases to determine numerical estimates of these measures.  Once again, project resources were not sufficiently ample to accomplish this.  Although it was not possible to implement all of the suggested performance measures in this Phase I effort, it is quite feasible to do so with some additional resources, and this should be done as part of Phase II, if Phase II is funded.

 

_Task 4. Extension to the Multiple-Variable Linear-Model_

 

Task 4 was the central task of the proposed study.  The objective of this task was to determine fast algorithms for estimating parameters of models containing more than one explanatory variable.  In this project, we sythesized and analyzed a number of algorithms that represent extensions of the Wald-Bartlett methods.  They are iterative methods, and will be referred to as "iterative Wald-Bartlett" methods.  Several variations of this method were considered, and the one that worked best was selected for detailed examination.  This method is described in detail in this report, and all of the performance assessments that are presented in this report are for this method.

 

The thrust of this project was to compare the performance of new estimation techniques to the performance of the classical least-squares technique.  In order to do this, a computer program was required that could perform the classical least-squares computations.  We implemented the least-squares estimation procedure by using the Gauss-Jordan method of solving the normal equations (i.e., inverting the correlation matrix).  This procedure is described in Chapter 3 of Cooley and Lohnes, _Multivariate Procedures for the Behavioral Sciences_ (Reference 8), and is the basis for the least-squares estimation procedure presented in the IBM Scientific Subroutine Package (Reference 9).  The least-squares algorithm used in this project was based on subroutines available from the IBM Scientific Subroutine Package, adapted to the Microsoft (R) FORTRAN compiler that was available to the microcomputer used in this project.  Thorough documentation of the subroutines, including commented code, is included in Reference 9.  A source code listing of the algorithm used in this project is presented in Appendix A.  The listing in Appendix A does not include any comments in the subroutines, in order to avoid possible copyright infringement.

 

The computer program source code for the iterative Wald-Bartlett method and the classical least-squares method is also presented in Appendix A.  These programs are written in FORTRAN II, an early, unstructured version of FORTRAN.  That version was used since it was the version implemented by the Microsoft FORTRAN compiler used on this project.

 

_Task 5. Comparison of Methods_

 

Task 5 was concerned with comparison of the performance of the iterative Wald-Bartlett method and the classical least-squares method, applied to estimate the parameters of the sixteen test cases developed in Task 2.  The results were very interesting.

 

First, the performance of both the iterative Wald-Bartlett and the classical least-squares method depends on the nature of the problem.  For "easy" problems, in which there are few x's or they are uncorrelated, both methods work well.  Second, in problems of low to moderate difficulty, there does not appear to be an appreciable speed difference between the classical least-squares algorithm and the alternative algorithm synthesized in this study.

 

Third, both the classical least-squares and the iterative Wald-Bartlett methods have difficulties with very difficult problems (m large and the x's highly correlated).  The really significant result that was observed in this case was that, whereas the classical method may fail catastrophically, producing totally-absurd results, the iterative Wald-Bartlett method is not particularly fast, but it determines an estimated model that produces reasonably close predictions.  (Note: The original number of test cases planned to be examined was 12.  It was after observing this result that we added four more test cases, representing singular covariance matrices, to examine this phenomenon in greater detail.)

 

The implications of this result are very significant, and probably outweighs the importance of speed in most applications, particularly since the processing speed differences between the iterative Wald-Bartlett and the classical least squares algorithm are not great.  For example, in an embedded-computer application (e.g., a tracker on an unmanned missile) it may be very desirable to have a "robust" estimation or prediction procedure -- one that does not fail catastrophically.  Moreover, the problem causing the catastrophic failure of the least-squares method (the failure of the algorithm to be able to invert a "nearly-singular" matrix) is one that has plagued data analysts for years  -- ever since the widespread use of digital computers to implement the least-squares methodology.  Matrix inversion problems in statistical analysis were not severe or widespread prior to the 1960's, when most computations were done using mechanical calculators.  Most statistical calculations were done by masters-level mathematicians, and the problems were kept small or designed to avoid the inversion of large matrices.  Starting in the 1960's, however, computer packages and computing resources were widely available, and were being applied in many cases by researchers who had little or no appreciation of the matrix-inversion problem inherent in the least-squares method.  With the large number of non-statisticians using statistical linear-model packages (e.g., multiple regression packages) and the low precision of many microcomputers, this problem arises frequently.  Consulting statisticians are often called in on regression analysis investigations when regression analysis packages produce meaningless results due to a matrix-inversion problem caused by a near-singularity in a correlation matrix or a linear dependency in the x's.  With the growing number of microcomputers, and the increasing number of non-statisticians using regression analysis programs (and the many other statistical procedures that require matrix inversion), there is a need for statistical estimation procedures that do not fail when near-singularities or linear dependencies are present in the data.  The demand for "robust" statistical estimators probably represents a far greater commercial value than the demand for high-speed algorithms, since the demand for real-time processing represents only a small portion of the total demand for statistical estimation.  This aspect should be explored further in Phase II.

 

In general, Task 5 succeeded in demonstrating the feasibility of the proposed approach.  Although the iterative Wald-Bartlett method does not appear to be substantially faster than the classical least-squares method, it appears to be much more "robust" than the classical least-squares method.  Also, since the iterative Wald-Bartlett method is based on "order statistics," it would be much less sensitive to data errors, such as "outliers," than the classical method.  The results of the Phase I study suggest that additional exploration of this area would be very beneficial.

 

_Task 6. Design of Phase II Work Plan_

 

The results of this Phase I study have demonstrated the feasibility of determining fast, robust estimators.  This study centered on the synthesis of a particular class of estimator, however, and that estimator certainly does not represent a final solution to the general problem.  Although the iterative method considered in this study does not fail catastrophically in difficult or ill-conditioned problems, as does the classical least-squares method, it is not very fast in such cases.  The present feasibility study has revealed the promise of the proposed approach, but substantial development effort is necessary to produce an estimation method that works well (i.e., is both fast and robust) in all cases.  We believe that the resources available in Phase II can accomplish that development effort, and have outlined a research plan for implementing the effort.

 

It is proposed that the Phase II effort be redirected from the goal originally proposed for Phase I.  The Phase I research was directed solely to the problem of finding fast algorithms.  Based on the Phase I results, however, it appears that there may be more potential (both in terms of project success and commercial value) in attempting the development of robust algorithms.  The serendipitous discovery that the synthesized iterative method could produce solutions to problems for which the classical least-squares method failed catastrophically probably outweighs the promise of a fast algorithm, in terms of both military and commercial/industrial significance.  Embedded processors in military weapon systems require software that is robust, i.e., does not fail catastrophically under certain circumstances.  The current effort has demonstrated that it is indeed possible to develop such estimators.  The growing use of microcomputers in commercial and industrial applications will create a growing demand for estimation procedures that do not fail in ill-conditioned problems, and do not require the participation of a professional statistician to assure their convergence to a correct solution.  These applications include not only multiple linear regression models of the sort addressed in this study, but the whole range of modern-day data analysis procedures (multivariate analysis of variance, factor analysis, canonical correlation analysis, discriminant analysis, and time series analysis models), since all of these methods involve matrix inversion, which is the source of the slowness and potential for catastrophic failure of these methods.  The availability of robust algorithms would be particularly beneficial in time series applications (e.g., "Box-Jenkins" analysis), where iterative estimation methods are often employed, and convergence problems are encountered by non-statistician users.

 

It is proposed that Phase II be a two-year effort, staffed at approximately 2-5 persons per year.  A description of the proposed tasks to be addressed in the Phase II effort is included in this report.

 

_Task 7. Preparation of Final Report_

 

This document describes the activities, results, and conclusions of the Phase I study.  In addition, it identifies the effort that should be implemented in Phase II, in order to complete the development of fast, robust algorithms for estimation, prediction, and control.

 

This Phase I study has accomplished each of the seven tasks identified and described in the proposal, in the proposed six-month time frame.  We believe that our success in accomplishing all of the proposed tasks, on schedule, augurs well for the accomplishment of the Phase II objectives as well.

 

C. _Organization of the Report_

 

The remainder of this report consists of four additional sections and two appendices.  Section II ("Background") describes the motivation for the proposed research effort, both relative to the requirement for fast algorithms and for robust algorithms.  Section III ("Project Approach") describes the methodology for conducting the Phase I investigation.  The methodology consists of the specification of performance criteria, the simulation of test-case data with which to test the performance of alternative algorithms, the synthesis of one or more candidate "fast algorithms," and the comparison of the performance of the synthesized method to the classical least-squares method using the specified criteria and the test-case data.  Section IV ("Simulation Results") presents the results of this simulation study.  Section V ("Conclusions and Recommendations") summarizes the study conclusions and describes the additional research and development thata is needed to develop and implement the proposed concept.  The Appendices contain source code listings of all computer programs used in this study (Appendix A), and detailed output listings (Appendix B).

 

 

II. BACKGROUND

 

A. _General Motivation for the Proposed Study_

 

This report describes the results of a research study to assess the feasibility of developing fast algorithms for real-time estimation, prediction and control. Such algorithms would provide a solution to a critical problem faced in both industrial and military applications -- the fact that the algorithms used to implement state-of-the-art statistical estimation, prediction and control techniques are far too slow for many real-time or near-real-time applications of high interest, even using the fastest computers.

 

The slowness of statistical correlation/tracking techniques such as  the Kalman-Bucy filter was one of the principal reasons for the failure of the ballistic missile defense program of the l960's.  This problem has still not been satisfactorily solved, even with substantial improvements in computer processing speed and direct-access storage capability.  Modern command, control, communications, and intelligence (C3I) systems such as those intended to support the AirLand Battle concept, the Air Force's Tactical Air Control Center, or Naval tactical data systems require multisensor correlation/fusion to be performed in real time or near real time.  The success of such systems is in jeopardy to the extent that they rely on data processing by traditional statistical algorithms.

 

In addition to military applications, the availability of fast prediction and control algorithms would assist control of rapid, time-varying processes occurring in industry and commerce (e.g., hot steel finishing mills and air traffic control), for which available prediction and control methods are generally heuristic, because of the current inability to conduct on-line system identification (and thereby use model-based predictors/controllers).

 

B.  _The Requirement for Fast Prediction/Control Algorithms_

 

Over the past two decades, tremendous advances have been made in the development of powerful statistical estimation techniques. The applications associated with these techniques cover a wide variety of fields, including scientific, economic, industrial, and military applications. In general, present-day statistical estimation procedures represent extensions of the work of Gauss, who developed the least-squares method of estimating the position of an asteroid, based on observations which contained measurement errors.  Gauss's method, or the method of least squares, consists in determining a set of model parameters in such a way that the sum of squares of the differences between the actual measurements and the position estimates based on the model is minimized.

 

A salient characteristic of the method of least squares is that it requires computation of a "cross-products" matrix (or a covariance matrix or a correlation matrix), and it requires the inversion of a matrix whose order depends on the number of parameters being estimated.  In many applications, this presents no problems, since statistical analysis need not be done in "real-time."  Instead, in most applications the analyst may collect the observed data, and then determine the parameter estimates "off-line" in the hours, days, or weeks that follow.  In some applications, however, such as those involving the tracking of fast-moving objects (such as missiles, airplanes, or satellites), or the control of rapidly-changing systems (such as certain industrial processes), it is necessary to estimate the parameters "on-line," in real-time or near-real-time.  In problems in which attention centers on only one or a few processes at a time (e.g., a small number of objects are being tracked, or a small number of electromagnetic emitters are being monitored), the computational burden is not severe.  If numerous tracks (or emitters) are involved, or the underlying system changes "too fast," the method of least-squares breaks down -- the computational requirements may saturate even the most powerful (fastest, largest) computers available.

 

In an attempt to remedy this problem, a tremendous amount of effort has been expended on the development of computationally efficient algorithms for determining least-squares estimates.  In l960, Kalman and Bucy developed a recursive solution to the least-squares estimation problem, now called the Kalman filter (or Kalman-Bucy filter).  Kailath (Reference 1) provided a comprehensive survey of over 600 references on filtering.  Aggarwal (Reference 2) describes the problem of developing least-squares algorithms that are numerically stable and computationally efficient.

 

One approach to reducing the computational complexity of tracking algorithms is linearization.  While this procedure helps, it is not sufficient.  As an example of the inadequacy of linearization, it is noted that the ballistic missile trackers proposed in the 1960's were in fact simplified nine-component Kalman filters in which the equations of motion had been linearized, and the covariance matrix radically simplified.  Yet this approach still failed, because of the tremendous computational requirements of the general linear model and the least-squares estimates.

 

As a further example of the inadequacy of linearization, it is noted that neither the Kalman filter nor the Box-Jenkins (autoregressive-integrated-moving-average time series) models could outperform the heuristic alpha-beta tracker, in air traffic control studies of the early 1970's.  As a final example, it appears that in complex correlation/tracking problems (e.g., satellite ocean surveillance, intelligence analysis of electromagnetic emissions for unit identification), heuristic nonlinear "algorithmic" procedures work better than procedures derived from linear-model theory.

 

The potential exists to reduce the computational burden of prediction/control algorithms by a factor of several orders of magnitude.  Furthermore, this reduction can probably be achieved at very modest cost -- a few person-years of research effort.  This investment is negligible, when compared to the massive research investment that will be required to develop even a hundred-fold increase in computer speed, through the development of an operational large-scale parallel-processor or bio-chip technology.

 

In addition to improvements in the computational efficiency of least-squares algorithms, tremendous gains have been made in the speed of the computers which perform the least-squares computations.  Advances such as Very Large-Scale Integration (VLSI) technology have increased computer processing speeds and direct-access storage capabilities by a factor of one thousand since the mid-1960's, when the feasibility of performing least-squares tracking of incoming ballistic missiles was seen to exceed available computational capabilities.

 

Despite the tremendous computational gains of the last fifteen years, however, it appears that further advances may be elusive.  Computer processing technology is now running up against physical limits, such as the time required for electronic signals to propagate along the wires inside the computer.  As reported in a recent issue of _Defense Science_ (Reference 3), advances in computational speed are leveling off (see Figure 1).  In order to achieve processing speed increases of two orders of magnitude or more with physical devices, it appears that parallel processing architectures will be required.  Unfortunately, formidable problems are associated with the development of large-scale parallel processors and associated software, and a substantial amount of basic research and development will be required.

 

To achieve further increases will probably require the development of "bio-chips," in which synthetic organic molecules perform the binary switching functions of present-day physical switches such as silicon or gallium arsenide field-effect transistors.  In principle, the use of bio-chip molecular switches could lead to circuit elements one thousand times smaller than may be achieved by conventional semiconductors.Developments in this area may be slow, however, since even the most promising of the bio-chips -- the soliton switch -- is still theoretical.

 

In summary, it does not appear likely to significantly improve the computational speed of least-squares algorithms, and the likelihood of realizing substantial increases in computer speed very soon is not high.  In order to achieve substantial gains in processing speed in the near term, then, it appears that the most promising approach is to develop estimation procedures which impose a substantially reduced computational burden.

 

C.  _Early Accomplishments in the Area of Low-Compute Estimation Procedures_

 

1.  _Experimental Design Applications_

 

In the 1930's and 1940's, tremendous advances were made in the development and application of the general linear statistical model, to solving problems in statistical experimental design.  In those days, however, no computers (other than human beings) were available for solving large-scale systems of linear equations, and ingenious methods were developed to determine designs for which the estimators could be determined without the need for explicit matrix inversion.

 

The advantage in the field of experimental design, of course, is that the statistician has control over the values of the explanatory variables of the model.  The popular experimental designs developed in the 1930's and 1940's (randomized blocks, Latin squares, fractional factorial, partially-balanced incomplete blocks) were models in which the designer introduced various degrees of orthogonality into the "design matrix" of the model, so that the equations of estimation could be easily solved.

 

The point to be recognized here is that, in the face of a strong requirement to develop low-compute estimation procedures in experimental-design applications, tremendous advances were made.  The underlying theory was complex (Galois theory and projective geometries), but the computational algorithms that resulted were extremely simple, allowing for the rapid hand-solution of estimation problems containing large numbers of variables.

 

2.  _Regression-Model Applications_

 

In regression analysis, the statistician does not always have control over the values of the explanatory variables, as is the case in experimental design.  Nevertheless, in the period 1920-1950 significant advances were made in the field of developing low-compute procedures for soving regression problems.  The _Biometrika Tables for Statisticians_, which were published in 1953 to "reduce the labour of statistical arithmetic," included tables of orthogonal polynomials, which vastly reduced the amount of computation required to produce estimates of linear contrasts.  These tables were of invaluable aid to statisticians in computing estimates of regression model parameters until about 1960, when high-speed digital computers became generally available in research facilities.  At that time, it appears that just about all effort directed toward computational simplicity ceased.  A digital computer could, in minutes, invert large matrices that simply could not be inverted manually.

 

 

It is interesting to note that it was at just about this same time that Kalman and Bucy developed their recursive scheme for determining estimates and predictions for a linear time-series model.  The computational requirements of their method were staggering from a manual perspective, but offered no difficulties when tackled by a digital computer.  Over the next twenty years, advances were made in the development of more rapid or more precise algorithms for implementing the Kalman-Bucy filter, but these methods accepted the basic linear-model-estimation formulas of the recursive filter as a starting point.  With this mind-set, the computational requirements of the Kalman-Bucy filter were never substantially reduced.

 

In a serendipitous fashion, however, it is interesting to observe that some advances were inadvertently made in the development of low-compute estimators for the non-time-series linear (regression) model.  These developments, by Wald and Bartlett, are described in the paragraphs that follow.

 

In its simplest form, the general linear statistical model (which forms the basis for modern estimation, prediction, and control algorithms) may be written as:

 

     _y_ = X'_p_ + _e_,

 

where

 

     _y_ = vector of observations;

 

     _p_ = vector of parameters;

 

     X = matrix of explantory variables ("data matrix");

 

     _e_ = error vector,

 

and the prime (') denotes matrix transposition.  In correlation/tracking and fusion problems, the form of the equations changes somewhat (e.g., there are "model" and "observation" errors, and the representation is usually in terms of a state vector), but the elementary form given above will serve to illustrate the nature of the estimation algorithms.

 

The least-squares estimate of the parameter _p_ is given by:

     _p_ = (XX')*X_y_

 

where the asterisk denotes a conditional (generalized) inverse of a matrix.  The key point to note with the least-squares estimate is the fact that it involves matrix products and matrix inversion.  (As the model becomes more complex, the number of matrix operations increases.)

 

In the simple example of regression analysis (fitting a straight line), the above model reduces to:

 

     yi = p1 + p2xi + ei,

 

where the index i denotes the i-th observation (i = 1,2,...,n), and the least-squares estimates of the parameters (the intercept and slope of the line) are:

                      _ _       _ _

                (xi - x)(yi - y)

     p2 =     _                  _

                            _ _

                      (xi - x)2

 

and

          _ _     _ _

     p1 = y - p2x

       _ _      _ _

where x and y denote the means of the observed x's and y's, respectively.

 

In 1940, Abraham Wald (the "father" of statistical decision theory and sequential analysis) proposed (Reference 5) a much simpler estimator as an alternative to p2:

            _ _    _ _    _ _    _ _

     p2 =  (y2 - y1)/(x2 - x1)

       _ _       _ _

where x1 and x2  denote the means of the x-values above and below the median (of the x's) and y1 and y2 denote the means of the corresponding y-values (i.e., the y-values associated with the x-values).  Wald originally proposed this estimator as a solution to the problem in which both the x variable and the y variable are subject to error.  It is interesting to observe, however, that Wald's estimate requires substantially less computation than the least-squares estimator -- 4n additions and one division versus 4n additions, 2n multiplications and one division, where n denotes the number of observations.  This represents a reduction in computer time by an order of magnitude.

 

Wald's estimator possesses the desirable statistical property of consistency (which the least squares estimate does not, in the errors-in-variables problem), but the sampling variance of the estimator is larger than for the least-squares estimate.  This inefficiency may be overcome by taking a slightly larger sample, in which case the Wald estimate still has the computational advantage.

 

In 1949, M. S. Bartlett (Reference 6) modified Wald's estimator by dividing the ranked x-variable into three equal-sized groups, and forming the estimate

           _ _    _ _    _ _    _ _

     p2 = (y3 - y1)/(x3 - x1)

 

        _ _    _ _

where x1, y1 are the means corresponding to the low-value group of x's, and x3, y3 are the means corresponding to the high-value group of x's.  Bartlett's estimator is more efficient (i.e., has lower sampling variance) than Wald's estimator, and requires 1/3 less computation.

 

In 1958, J. W. Hooper and H. Theil (Reference 7) extended the Wald/Bartlett method of grouping to the case of multiple linear regression (in which there is more than one x-variable).  The method was judged somewhat tedious to implement, however, and was essentially abandoned.  Note that this was about the time when high-speed digital computers (e.g., the IBM 650) were becoming generally available (at least in the major universities and research centers), and so there was at that time no longer an incentive to prefer the Wald-type estimators to the least-squares estimators on computational grounds.  (For the errors-in-variables problem, for which the Wald's estimator was orginally developed, a method by J. Durbin, introduced in 1954, was generally adopted as a perferred method.  It is more (statistically) efficient than the Wald and Bartlett estimators, but is not relevant to the present problem because it requires the same amount of computation as the least-squares estimation procedure.)

 

It is interesting to observe that both the Wald and Bartlett estimators may be derived from the formula for the least-squares estimates, by replacing the values of the explanatory variables by +1's and -1's in the case of the Wald estimator, and by +1's, -1's, and 0's in the case of the Bartlett estimator.  This procedure is analogous to the procedure of determining an experimental design: the x-values are set at values which enable the equations of estimation to be solved without explicit matrix inversion.  (The difference is, of course, that in the case of experimental design, the specified x-values are actually used in the experimentation process, whereas in the regression case, the actual (continuous) x-value is replaced by the simpler discrete value.)

 

The point to the above is that, with a little ingenuity, it is possible to develop estimators that have drastically reduced computational requirements, over those of the least-squares estimates.  The preceding example addresses the simplest situation (one dependent variable, one explanatory variable), with no matrix multiplications or inversions required, and yet a reduction in computational requirenents of an order of magnitude were realized.  In the general case of several or many variables (e.g., a Kalman filter for a nine-component state vector), involving many matrix multiplications and inversions, the potential for dramatic computational reductions is tremendous.

 

A review of the statistical literature of the past two decades reveals a fixation with least-squares estimation.  To be sure, some new estimators have been introduced (e.g., jacknife estimators), but they are computationally similar to the least-squares estimates (requiring matrix multiplications and inversions), and have similar computational efficiencies.  It appears that the effort to improve computational efficiency has been conditional on use of the least-squares approach, rather than on centering on novel estimation procedures.  The criteria against which the procedures are invariably judged is the error-variance of the minimum-variance linear unbiased estimator.  While restriction to this criterion may be reasonable for off-line statistical estimation, it is not reasonable to restrict attention to this single criterion for the large-scale on-line (real-time) estimation situation.  We believe that, once the criteria against which the estimator are to be judged are appropriately modified, significant and substantial improvements will follow.

 

E.  _Parallel with Optimization Theory_

 

In a sense, the situation with respect to the use of the general linear statistical model and the least-squares estimates parallels the situation that existed in the 1950's in the field of optimization theory.  At that time, the principal optimization procedure was linear programming.  The framework of linear programming did not suit many practical applications, however, and so alternative nonlinear programming methods were sought.  In 1963, the Generalized Lagrange Multipliers method was introduced by H. E. Everett.  This powerful method produced very fast solutions to very large optimization problems in which the objective function could be nonlinear, non-convex, and discontinuous. The GLM method had a tremendous advantage over previous optimization procedures, in that it did not require the solution of a large-scale system of equations.  Unfortunately, the GLM method is restricted to problems in which the objective function is "separable," i.e., may be expressed as a certain sum.  Furthermore, it was not possible to guarantee convergence of the method, and in some cases convergence could be slow.

 

In the late sixties, Fiacco and McCormick promoted the use of "quadratic penalty fuctions" to solve constrained optimization problems.  This approach (called the Sequential Unconstrained Minimization Technique, or SUMT) worked well for a larger class of problems, but in general it did not possess the great speed of the GLM method.  Also, the method fails in a fairly wide class of problems in which the Hessian matrix is "ill-conditioned" (not positive definite).

 

Finally, also in the late 1960's, the two methods were combined into what is now known as the "generalized Lagrangian method."  Hestenes and Powell suggested adding a quadratic penalty function to the _Lagrangian_ function, instead of the _objective_ function, as is done in the SUMT method.  The generalized Lagrangian method converges fast, and does not exhibit the ill-conditioning that frequently occurs in the original penalty-function method (SUMT).

 

Thus, in a span of about ten years, a tremendous leap forward was made in solving constrained optimization problems.  The interesting fact to note, however, is that whereas the linear programming solution (the simplex method) had a very well-behaved theory associated with it, and was guaranteed to converge in a predetermined number of steps, the generalized Lagrangian methods possessed no such property -- the methods are defined as algorithms (for adjusting the values of Lagrange multipliers), and no definitive statement can be made about the rate of convergence.

 

The fact remains, however, that by moving out of the very restrictive linear-model framework, tremendous advances were quickly realized.  Furthermore, satisfactory solutions were not achieved by "linearizing" nonlinear problems, but by developing heuristic algorithmic procedures, which were demonstrated to work well.

 

While the problems of estimation, prediction, and control are statistical in nature, they are also optimization problems, and it is reasonable to conjecture that a tremendous advance in speed may be realized by applying the ingenuity and heuristic methods that worked so well in the field of constrained optimization.  The current situation -- a near-total dependence on the general linear statistical model -- is analogous to the situation in which the field of constrained optimization theory found itself twenty-five years ago.  It would appear that much can be done.

 

F. _The Need for a Robust Estimator_

 

During the course of this Phase I study, it was observed that the algorithm that had been synthesized as a candidate "fast" algorithm possessed a remarkable property -- it produced reasonable results when applied to "difficult" problems, in which the classical least-squares method failed catastrophically.  The "difficult" problems were ones in which the correlation matrix that had to be inverted in the classical least-squares method was "near-singular" (i.e., had a small determinant).  This situation arises whenever the explanatory variables of the model are highly correlated.  In this case, inverting the correlation matrix (which is central to the classical least-squares method) is difficult (in a numerical analysis sense), particularly for large or even moderately large matrices (e.g., m = 10 explanatory variables).  What happens is that roundoff errors (due to the truncated representation of real numbers in the computer) ruin the matrix inversion, and the matrix inversion algorithm fails to produce the desired inverse.  The least-squares method in fact fails "catastrophically," in the sense that the produced results are generally totally wrong -- the resultant model produced by the method may predict even worse than the "trivial" model that predicts the mean value of the dependent variable for all values of the independent variables.  This problem is ameliorated somewhat by using double precision arithmetic, but it is a particularly troublesome problem in a microcomputing environment, where the computer word length is short (and the precision of computation is low).  The problem is a particularly insidious one, since many statistical analysis programs do not warn the user that the method has failed, and if the results do not appear to be patently absurd, they may be accepted as correct.

 

In addition to the problem of near-singular correlation matrices, the problem of catastrophic failure may arise if a a problem is "ill-conditioned," or "ill-specified."  This happens, for example, if there is a linear dependency among the independent variables (the x's).  This situation often arises in social science applications.  In a survey questionnaire, for example, the respondent may be asked to select an answer in one of a number (e.g., five) of categories.  In the analysis of the survey data, the response for the selected category is coded as a "1," and the response for the other categories are coded as "0's."  A social science researcher who is not aware of the matrix inversion to be done in a regression analysis may include all five responses as variables in the data base, and in a regression model.  These variables are linearly dependent, however, since the sum of all five category responses must equal 1.  The presence of this linear dependency in the dependent variable will cause the correlation matrix to be singular.  Once again, the least-squares method, if applied to this problem, will fail catastrophically.  This case is often not as troublesome as the "near-singular" case, however, since the presence of an exact linear dependency may result in a computed matrix determinant of exactly zero, and some computer programs check for this condition.  Because of roundoff errors, however, the computer algorithm may fail to recognize the singularity, compute a nonzero numerical value for the determinant, and produce a "solution," which, unfortunately, is totally wrong.

 

This situation is a very real problem.  Many users of statistical program packages are not trained in the theory underlying the computations and are unaware of the pitfalls of the least-squares method.  Researchers in social science are now trained in university curricula in how to apply the statistical procedure of multiple regression analysis, but they are not expected to know matrix algebra and are generally not trained in it as part of their introduction to statistics.  As a professional consulting statistician, the author of this report has been retained on more than one occasion to "explain" absurd results obtained because of the problem of linear dependencies in regression analysis applications.  In technical terms, the classical least-squares method is not "robust" with respect to the presence of linear dependencies or near-dependencies (high correlations) in the explanatory variables.

 

Statistical theory can handle the presence of linear dependencies, if they are recognized.  In such cases, the inverse of the correlation matrix is replaced by a "generalized inverse" or "conditional inverse."  Although this theory is generally taught to graduate statistics majors, however, it is not known to most data analysts.  Moreover, most of the major statistical software packages do not offer this capability.

 

The present study began as an attempt to determine the feasibility of finding fast algorithms for estimation, prediction, and control.  In the course of the study, an algorithm was developed that did not fail catastrophically in "difficult" problems.  Upon observing this, it was decided to explore the performance of the algorithm in ill-conditioned problems containing linear dependencies.  The method works well in such cases.  The significance of this result could be very great from a commercial viewpoint, in view of the extensive use of statistical multiple regression analysis.  Moreover, virtually all multivariable statistical analysis procedures involve matrix inversion.  All of these methods are subject to catastrophic failure, and are candidates for application of a method that does not require matrix inversion.

 

III. PROJECT APPROACH

 

 

A. _Summary of Approach_

 

The approach proposed for this study consisted of four major steps:

 

     1. Development of Criteria for Comparing Alternative

        Estimation Algorithms

 

     2. Generation of Test Cases

 

     3. Synthesis of Candidate Algorithms

 

     4. Comparison of the Performance of Candidate

        Algorithms to the Perfomance of the Classical

        Least-Squares Algorithm

 

Each of these steps is described in detail in the following subsections.

 

B. _Development of Criteria for Comparing Alternative Estimation Algorithms_

 

In order to assess the performance of alternative procedures for estimation, prediction and control, it is necessary to identify a number of quantitative, measurable descriptors of performance.  Alternative procedures may differ in a number of respects, such as processing speed and accuracy.  It is desirable to identify a set of performance measures, or criteria, which afford a relatively comprehensive description of algorithm performance, and yet is not overly redundant.

 

It is noted that the performance measures that are appropriate for an algorithm may vary, depending on whether the algorithm is intended for use for estimation, or for prediction, or for control.  In an estimation problem, attention centers on estimating the model parameters (bo, _b_,  , SIG, SIGMA) as closely as possible.  In a prediction or control problem, attention centers on using the model for predicting a new value of y corresponding to a specified set of x-values.  How close the parameter estimates are to the true values is of secondary interest in this situation.  (In a "prediction" problem, the x's are either passively observed or actively controlled; in a "control" problem, the x's are actively controlled.  The problem of predicting the state of the economy is essentially a prediction problem; the problems of controlling a steel production process or directing a "smart" bomb to a target are examples of "control" problems.)  The intended application of the model -- estimation, prediction or control -- should influence the sample design for the collection of the data from which the model is to be estimated.  For example, if a model is to be used to predict how the dependent variable (y) will respond to forced changes in the independent variables (x's), then the data should correspond to the case in which forced changes are made in the x's.  A model developed from passively-observed x's is not appropriate for predicting how the system will respond if forced changes are made in the x's (although this is often done, with dissapointing results!).

 

After consideration of the various uses (estimation, prediction and control) of the models under study, and of the various properties of algorithms that may be of interest to model developers, it was decided that the following seven concepts characterized the algorithm performance in reasonably comprehensive fashion:

 

     1. Computer running speed

 

     2. Computer storage requirements

 

     3. Precison of the parameter estimates

 

     4. Bias of the parameter estimates

 

     5. Mean squared error of the parameter estimates

 

     6. Precision of model-based predictions

 

     7. Numerical stability of the algorithm

 

Having decided on these concepts as comprising a relatively comprehensive characterization of the performance of an algorithm, it remained to determine quantitative measures of each concept.  These concepts and their associated measures are described in the paragraphs that follow.  Note that, although the preceding concepts were identified as part of the Phase I report, it was not possible with the Phase I resources to numerically determine values for all of the measures.  That numerical determination can be accomplished with some additional programming, and should be done as one of the first steps in the Phase II study.

 

1. _Computer Running Speed_

 

The primary motivation for the proposed study was to determine the feasibility of determining estimation, prediction and control algorithms that had faster running times than the classical least-squares method.  The measure of speed that was used in this study was the total elapsed time required to read the data from the file on which it was stored, compute the parameter estimates, and print out the results.  The time required to "set up" the run (e.g., specify the data file name, number of parameters, number of observations, etc.) was not included, since this time consists mainly of the time of the human operator to enter data through the microcomputer keyboard, and does not reflect algorithm performance.

 

All of the computer programming and processing on this project was done using a Radio Shack Model II microcomputer using a TRSDOS Version 2.0 operating system.  This microcomputer utilizes a 4 MHz Z80 microprocessor, and has 64 kilobytes of direct access memory.  The programming was done in FORTRAN II, using a Microsoft (R) FORTRAN compiler to produce the object code.  Unfortunately, the available Microsoft FORTRAN did not permit access to the system timer, and so the timing was done manually (external to the program), not automatically (internal to the program).  The running times presented in this report are hence approximate.  In addition to algorithm processing time, they include the time required to print the results.  The amount of printed output for the various methods is comparable.  For the simpler cases examined, the algorithm processing time was very short compared to the print time, and so the total measured running time is not an accurate reflection of the processing time.  For the more difficult cases, the algorithm processing time is large relative to the print time, and so the measured running time is a relatively valid indicator of algorithm processing time.

 

It is recognized that the speed measure used in this study is very crude, particularly for problems having a small number of explanatory variables (since the processing time for these problems is small compared to the data access and printout time).  In Phase II, we propose to develop an assembly-language timer that can be called internal to the program, so that accurate measures of processing time can be determined.

 

The time measurements for the iterative algorithm were biased high in many cases, because the algorithm was forced to make at least ten iterations even if convergence had already occurred.  In Phase II, a test for convergence should be developed, so that the algorithm stops when convergence occurs.

 

2. _Computer Storage Requirements_

 

The amount of direct-access computer memory required to implement an algorithm is a concern, since it determines how "large" a problem can be handled with available memory, or how much memory is required to handle a problem of a given size.  The size of an estimation problem is determined primarily by two factors -- the number of observations and the number of explanatory variables (we are speaking here only the univariate case, in which there is but a single dependent variable).  The classical least-squares method has an advantage in that the observations may be read and processed one-at-a-time, and do not need to be stored simultaneously in memory.  The particular "fast algorithm" that was considered in this study required that all of the data be stored in memory.

 

Although the required amount of direct-access memory required depends on the problem size, the computer programs developed in this project did not dynamically adjust the memory requirement to the size of the problem.  Instead, the "dimensions" of the program variables were set to allow for storage of all observations and variables, corresponding to the largest problem analyzed -- 100 observations and ten independent variables (and a single dependent variable).  Under these conditions, the core requirements of the classical least-squares algorithm and the particular "fast algorithm" investigated in this study were as follows:

 

     Least-squares algorithm:  20,984 bytes

 

     Fast algorithm:           21,759 bytes,

 

i.e., the core requirements are approximately the same.

 

3. _Precision of the Parameter Estimates_

 

In an estimation problem, it is desired to determine ("estimate") the values of the model parameters as "closely" or "accurately" as possible.  The model parameters are bo, _b_,  , SIGMA, and SIG.  With regard to estimation, however, we are generally interested only in bo, _b_, and SIG, not in   or SIGMA.  The reason for this limitation is that the parameters   and SIGMA are not generally of concern in an estimation problem once the x's are known.  The usual objective in a data analysis is to estimate the values of bo, _b_, and SIG, or to predict the value of a new y given specified values of the x's.  While the fact that the x's are random variables can affect some analysis procedures (e.g., hypothesis testing), the least-squares estimates are the same whether the x's are fixed numbers or random variables.  For this reason, we shall restrict attention only to measures of precison of the estimates of bo, _b_, and SIGMA.  This restriction applies also to the next two performance concepts considered (bias and accuracy).

 

Accuracy is usually reflected in two concepts -- precision and bias.  The "precision" (or reliability) of an estimator (i.e., an estimation formula or algorithm) refers to the degree of reproducibility, or variation, of the estimates, if repeated data samples were selected and the estimator used to determine the estimate value for each sample.  The usual measure of precision is the standard deviation (square root of the variance, or second central moment).  The "bias" of a parameter estimator is the amount of systematic error in the estimate, measured as the difference between the average value obtained for the parameter estimate in repeated sampling and the true value of the parameter.  (Note: in the preceding discussion, we have used the distinction that an "estimate" is a numerical value that represents our guess as to the value of the parameter, whereas the term "estimator" refers to the formula or algorithm for producing that numerical value.  This distinction in usage of these terms is not strict, either in the field of statistics or in this report.)

 

In the present study, it was possible to estimate the precision of the parameter estimates for the classical least-squares method, but project resources dis not permit the estimation of precision of the parameter estimates for the fast algorithm studied.  Closed-form mathematical formulas are available for estimating the precision of the least-squares estimates, but such is not the case for estimating the precision of the fast algorithm estimates.  Instead, the precision has to be measured empirically, by selecting many independent data samples and computing the standard deviation of the parameter estimates over these samples.  Although it was not possible to implement this procedure in Phase I (because of resource limitations), the estimated standard deviation of the parameter estimates is considered to be an important measure of algorithm performance, and this procedure should be implemented in the Phase II effort.

 

While the precision and bias of the parameter estimates are of high concern in estimation problems, they are of secondary conern in prediction and control problems.  In the latter types of problems, it is the accuracy of the prediction that is of primary concern.  It is possible to have a model in which the parameter estimates are not very accurate, and yet the accuracy of the predictions based on that model is comparable to those obtained from a model in which the parameter estimates are substantially more accurate.  (This may be the case, for example, if the explanatory variables are highly correlated.)

 

The precision of most statistical estimates increases as the sample size (number of data observations) increases, usually by the factor 1/ n (for the standard deviation).

 

4. _Bias of Parameter Estimates_

The bias of an estimate is the expected value (over repeated data samples) of the difference between the expected value of the parameter estimate and the true value of the parameter.  The bias of an estimator is of greater concern in estimation probalems than in prediction and control problems.

 

As was the case in the measurement of precision, Phase I project resources did not permit the generation an analysis of a large number of data samples (corresponding to the same model parameters) to estimate the bias.

 

The bias of an estimator may or may not decrease as the sample size increases.

 

5. _Mean Squared Error_

 

The mean squared error is a measure of accuracy -- i.e., it is a "combined" measure of precision and bias.  The definition of the mean squared error of a parameter estimator is:

 

     Mean Squared Error = variance + bias2

 

or

 

     E(!p - p)2 = E(!p - E(!p))2 + (E(!p) - p)2,

 

where p is the parameter value and !p is the estimate, i.e., the mean squared error is the expected value of the square of the difference between the estimated parameter value and the true parameter value, over repeated data samples.

 

Once more, Phase I resources did not permit the numerical determination of the mean squared error, but this can be done in Phase II.

 

6. _Precision of Model-Based Prediction_

 

The three preceding measures are concerned wtih measurement of the "closeness" of the parameter estimates to the true values, or to the amount of variability in the parameter estimates in repeated data samples.  For prediction problems, the primary concern is how close the model-based predictions (of y, the dependent variable) corresponding to a specified set of x-values will be to a newly-sampled y corresponding to those x-values.  An interesting fact is that there can be fairly substantial errors in the estimation of the model parameters, and yet predictions based on the (erroneous) model may be almost as good as those based on a much-more-nearly-correct model.  In some applications (e.g., econometric modelling), attention centers very much on estimation of the model parameters.  In other applications, the parameter values are of incidental interest -- all that matters is the error of prediction.

 

A standard indicator of the prediction error is the estimated variance or standard deviation of the model "residuals," or error terms (differences between the observed y-values and those predicted by the model).  This is not the same as the standard deviation of the prediction error for a particular set of x's, which depends also on the specified values of the x's.  The standard deviation of the prediction error is proportional to the standard devaition of the residuals, however, and so the latter is a good indicator of the predictive ability of the model.

 

Another indicator of the predictive power of a model is the reduction in variance between predictions based on the trivial model that predicts that each new y will equal the mean (of the y's), and the variance of the predictions based on the estimated model.  (This measure is valid only for application of the model to predict y-values from x-values that were produced in the same fashion -- e.g., passively observed, forcibly changed -- as were those from which the model parameters were derived.  Also, its use assumes that the theoretical variance of y is finite, which is often not the case for time series data.)  The standard error of the residuals is best estimated from a sample other than that from which the model parameters were derived, but as long as the number of parameters is very small relative to the number of observations, it is common practice to use the same data set for both purposes.

 

The reduction in variance for predictions based on the "mean model" compared to predictions based on the estimated model is called the "coefficient of determination."  It can be defined as:

 

     CD = 1 - (variance of predictions using "mean model")

           /(variance of predictions using estimated model)

 

The coefficient of determination was determined both for the classical least-squares model and the fast algorithm algorithm studied in this project.

 

6. _Numerical Stability of the Algorithm_

 

The three preceding measures of algorithm performance are appropriate in most cases.  In some situations, however, an algorithm may fail to operate as intended because of computer roundoff errors or because of an intrinsic weakness in the algorithm, such as a failure to converge to an answer close to the desired answer, or a failure to converge at all.  The length of time required for convergence to a desired answer is considered under performance measure 1 (computer processing speed), if the process converges.  (An additional measure of performance that is of interest in studying the performance of an iterative algorithm is the "rate" of convergence, or the number of iterations required for convergence.)

 

Some of the test cases examined in the project present difficulties, for both the classical least-squares and the fast algorithm.  In some such cases, the algorithm may fail catastrophically, i.e., produce totally wrong results.  In other such cases, the algorithm may produce an answer (i.e., set of parameter estimates) that is not very close to the correct answer (i.e., the true parameter values).  Those cases are noted, with an indication of the nature of the failure.  For example, the classical least-squares method may fail because of an inability to invert a near-singular matrix.  Or, a "fast" algorithm may converge very slowly.

 

2. _Generation of Test Cases_

The performance of an estimation algorithm may vary, depending on the nature of the data set to which it is applied.  We proposed to develop a set of test cases to which candidate algorithms could be applied, and to measure their performance relative to each test case.  Although there is an infinite variety of test cases that might be considered, it was possible in the present project to generate and analyze only a few.  It was decided to examine sixteen cases in all.  The nature of these test cases was described in Section I of this report.  The test cases differ in terms of the number of variables included in the model, and in terms of the complexity of the model, as reflected in the covariance matrix of the explanatory variables (x's).

 

The general model considered in this study was of the following form:

 

 

     yj = bo + _x_j'_b_ + ej     j=1,2,...,n

 

or

 

     _y_ = bo_1_ + X'_b_ + _e_

 

 

where

 

     yj = dependent variable

 

     _y_' = (y1,y2,...,yn) = vector of all

 

            observed y's

 

     _x_j' = (x1j,x2j,...,xmj) = vector of inde-

 

               pendent variables corresponding to the

               j-th observation

 

     X = (_x_1,_x_2,...,_x_n) = data matrix

 

     _e_' = (e1,e2,...,en) = vector of model error

 

            terms corresponding to all observations

 

     bo = y-intercept

 

     _b_' = (b1,b2,...,bm) = vector of regression

 

            coefficients

 

     var(_x_j) =   SIGMA2

 

     var(_e_) = I SIG2

 

     _1_' = (1,1,...,1) = n-component vector of all 1's

 

where   is an mxm covariance matrix and I is an nxn identity matrix.  The various test cases correspond to four different values of m (m = 1, 3, 6, and 10), various values of SIGMA and SIG, and various covariance matrices  .

 

The x's are random variables that are generated as follows.  First, m independent random variables, f1,f2,...,fm, normally distributed with mean 0 and variance SIGMA, are sampled.  An mxm coefficient matrix, C, is then specified:

 

     C = {cij} .

 

The x's are computed as

 

     _x_ = C _f_ .

 

The covariance matrix of _x_ is

 

     var_x_ = CC' SIGMA2 =   SIGMA2 ,

 

where   = CC'.

 

To generate yj corresponding to a specified _x_ (say, _x_j), a standardized normal deviate, ej, is sampled, multiplied by SIG, and the result added to bo + _x_'_b_:

 

 

     yj = bo + _x_j'_b_ + SIG ej .

 

From the preceding, it is seen that the variance matrix    is not explicitly specified.  Instead, it is the coefficient matrix C that is actually specified.  The matrix   has the value CC'.

 

The coefficient matrices and values of bo, _b_, SIG, and SIGMA for the various test cases are specified in the computer printouts of Appendix B.  The values of the corresponding covariance matrices   are also specified in those printouts, as part of a larger matrix,  A.  The covariance matrix,  A, specified in the printouts includes an additional row and column, which contains the variance and covariances of y with each x, i.e.,

 

 

       A = var(y,_x_')

 

         = var(bo_1_ + _x_'_b_ + _e_, _x_')

 

             _b_'  _b_ + SIG2      _b_'  

         =

                _b_                      

 

The normally distributed random numbers were generated by the algorithm

                12

     f = SIGMA    ui

               i=1

 

where u1,u2,...,u12, are a sequence of independent uniformly distributed random numbers.  The number ui was generated by the multiplicative congruential method, specified for the 16-bit Radio Shack Model II microprocessor by the algorithm:

 

     IXtent = IXold (216 + 3) = IXold 259

 

                IXtent if IXtent    0

     IXnew  =

                IXtent + (216-1)+1 = IXtent + 32767+1

 

                                          if IXtent    0

 

     ui = IXnew (2-15) = IXnew (.30517578E-4)

 

     IXold = IXnew

 

The multiplicative congrential method for generating pseudo-random numbers is described in Reference 10.

 

The rationale for the specification of the test cases is as follows.  For the case of a single independent variable (m = 1), the four test cases involved four different values of the ratio SIG/SIGMA: .1, .5, 1.0, and 5.  The case with SIG/SIGMA = .1 is easiest, since there is a substantial amount of variation in x and a small model error term.

 

For the cases with m greater than one, the value of SIG/SIGMA was set equal to .1, and the complexity of the variance matrix   was varied.  For the first subcase,   was set equal to the identity matrix, i.e., the x's were orthogonal (uncorrelated).  For the second subcase, the coefficient matrix used to generate the x's (and hence  ) was specified to correspond to a low-to-moderate degree of correlation among the x's.  For the third subcase, the coefficient matrix was specified to produce a moderate-to-high degree of correlation among the x's.  In the final subcase, a linear dependency was introduced among the x's: the last x was set equal to the sum of the two preceding x's.

 

The values of the regression coefficients, _b_, were chosen to make the estimation difficult.  The correlations among the x's were all positive, and so the regression coefficients were specified to be plus ones and minus ones.  This caused convergence of the iterative algorithm to be slow, because each time an adjustment was made in one direction for a particular coefficient, a reverse adjustment would be required on the next iteration for coefficients having the opposite sign.

 

3. _Synthesis of a Candidate Algorithm_

 

In order to demonstrate the feasibility of the proposed concept of developing a fast estimation algorithm that could be used as an alternative to the classical least-squares algorithm, it was proposed to synthesize one or more alternative algorithms, and to examine their performance.  The class of algorithms that was synthesized in this project are extensions of the Wald and Bartlett estimators, extended to the case of more than one independent variable.

 

The algorithm is iterative.  At the k-th iteration, the residuals, ejk-1, from the preceding iteration are regressed on a particular independent variable, xi, using either the Wald or Bartlett method:

 

      ejk-1 = bok + b1kxij + ejk

 

where bok and b1k are the Wald or Bartlett estimates.  In the regression process of determining bok and b1k, of course, the value of ejk is unknown.  Once the values of bok and b1k are determined, the vaule of ejk may be computed as follows:

 

     ejk = ejk-1 - bok - b1kxij

 

The process cycles through all of the x's in order.  That is, the first iteration regresses y on x1.  The second iteration regresses the residuals from the first regression on x2.  The third iteration regresses the residuals from the second regression on x3, and so on.  If there are m independent variables, then the (m+1)-st regression begins over again, and regresses the residuals from the m-th regression on x1, and so on.

 

The preceding iterative procedure is conjectured to produce consistent parameter estimates of the model parameters (i.e., estimates that converge in probability to the true parameter values as the sample size increases).  Determination of whether this is true, or of conditions under which it is true, should be addressed in Phase II.

 

In the present study, several modifications of the preceding procedure were examined.  For example, a "stepwise" procedure was considered, in which the regression at each iteration was performed on the independent variable, xr, which resulted in the greatest reduction in the criterion:

 

     br IQr

 

where IQr denotes the interquartile range of xr, and br denotes the Wald regression coefficient of the residuals at that iteration on xr.  Another approach considered was to regress the residuals separately on all x's, and then to adjust all m regression coefficents by a specified fraction ("stepsize") of the indicated adjustment.  No modified algorithm was found, however, that performed faster than the one specified above.

 

Both the Wald and Bartlett procedures were applied to the test cases for which m = 1, but only the Wald procedure was applied to the other test cases.  The iterative procedure described above will be referred to as an "iterative Wald-Bartlett" estimation method.

 

E. _Comparison of the Performance of the Candidate Algorithm to the Performance of the Classical Least-Squares Algorithm_

 

Each of the sixteen test cases was analyzed using both the classical least-squares (CLS) and the iterative Wald-Bartlett (IWB) algorithms.  The performance of each case was determined, using the processing speed and the coefficient of determination as the measures of performance, or observing whether the method failed catastrophically.  The closeness of the residual standard error to the parameter SIG was also noted.  The results of these simulations are described in the next section of this report.

 

 

IV. SIMULATION RESULTS

 

 

A. _Test Cases Involving a Single Independent Variable_

 

The four test cases with m = 1 independent variable were analyzed using both the Wald and Bartlett estimates, and the classical least-squares estimator.  In general, all cases ran so fast that speed differences could not be reliably determined among the methods.  All methods produced parameter estimates close to the true values, and all produced similar values for the residual standard error and the coefficient of determination.  A summary of the results of the four cases with m = 1, and all of the other test cases as well, is presented in Figure 2.

 

B. _Test Cases Involving Three Independent Variables_

 

For the test cases with m = 3 independent variables, the data were analyzed with both the iterative Wald-Bartlett (IWB) and classical least-squares (CLS) methods.  The results are presented in Figure 2.  Both the CLS and IYB methods appear to be of comparable speed, in cases in which the CLS method does not fail.  The CLS method fails if the correlations among the x's are high, or if there is a linear dependency among the x's.  The IWB method succeeds in these cases, but does not converge very fast.  The slow convergence is probably due to the fact that the regression coefficients were purposely specified to cause slow convergence.

 

C. _Test Cases Involving Six Explanatory Variables_

 

The results for the cases involving six explanatory variables are as follows.  For cases in which the CLS method does not fail, it is somewhat faster than the IYB method, for the difficult estimation problems represented by the test cases.  For cases in which the correlations among the x's is high, the CLS method fails.  In these cases, the IWB method is slow to converge.  Once again, the slow convergence is probably due to the "pathological" specification of the regression coefficients.

 

D. _Test Cases Involving Ten Independent Variables_

 

The results for the test cases involving ten independent variables were as follows.  For cases in which the CLS method succeeded, it is faster than the IWB method for the test cases studied.  For the cases in which the correlation among the x's is high, the CLS method fails, and the IYB method is slow.  In the worst case (case 15), the process was terminated after 1800 seconds.  At that point, the model had reached a solution close to the correct solution, but not correct -- the residual standard error was 2.03, compared to the true value of 1.0.

 

In summary, the IYB method appears to be comparable in speed to the CLS method for problems with a small number of explanatory variables or low correlations among the x's.  For difficult problems (with high correlations among the x's), the CLS method fails.  The IYB method converged in all of the difficult cases but one.

 

 

V. CONCLUSIONS AND RECOMMENDATIONS

 

 

A. _Conclusions_

 

This study has demonstrated the feasibility of developing an algorithm for estimating the parameters of a linear statistical model and making predictions based on the estimated model, that is comparable in speed to the classical least-squares method for problems of low to moderate difficulty, and is definitely more robust, in the sense that it is less subject to catastrophic failure.  The feasibility was established by synthesizing an algorithm -- an extension of the estimation procedures of Wald and Bartlett -- which often outperformed the classical method.

 

The availability of a fast, robust estimation procedure would be beneficial to both military and nonmilitary applications.  In a military context, there is a growing need for faster estimation procedures -- current procedures cannot accomplish tracking of large numbers of objects in real-time, or accomplish large-scale sensor exploitation in real-time.  Also, embedded-processor estimation algorithms are non-interactive (i.e., must perform without human intervention), and are potentially subject to catastrophic failure if based on the classical least-squares procedure, if highly correlated data are entered into the data input stream.

 

In both military and commercial/industrial applications, there is a requirement for "fail-safe" estimation algorithms.  The classical least-squares algorithm that is currently in use is not fail-safe.  For problems involving moderate or large numbers of variables, computer roundoff errors can ruin the estimates, and the user may be totally unaware of the failure.  The danger of this occurrence is particularly strong in microcomputers having short word lengths (e.g., 16 bit microprocessors), and is a problem even for 32-bit machines.  Furthermore, many persons using statistical software packages are not aware that linear dependencies in the variables can cause the complete failure of the methods.  Once again, roundoff errors may obscure the problem, so that the user has no reason to believe that the estimation algorithm failed.

 

While the present project has demonstrated the feasibility of developing fast, robust algorithms, it has not accomplished the ultimate goal of developing such an algorithm.  The algorithm that was sythesized in this project is not considered to be a final solution to this problem -- it does not outperform the classical least-squares method in every case, and its theoretical properties (e.g., convergence, consistency of estimates) are unknown.  Continuation of the effort to develop improved (fast, robust) estimators will require substantial additional effort.  The present study suggests, however, that the potential for success of such an effort is high.

 

B. _Recommendations_

 

It is recommended that a Phase II study be conducted, oriented toward the goal of developing improved algorithms for estimation, prediction and control.  It is further recommended that the Phase II effort should address the following tasks.

 

     1. Extension of the results of the Phase I study, to

        include analysis of a wider range of test cases,

        and measurement of the full set of performance

        measures identified in this study.

 

     2. Development of a broader class of algorithms,

        additional to the iterative Wald-Bartlett method

        developed in this Phase I effort.

 

     3. Extension of the algorithm to consider estimation

        problems additional to the multiple linear

        regression problem considered in the Phase I effort.

        Consideration should be given to developing fast,

        robust methods for the full range of problems

        currently addressed by least-squares methods,

        such as multivariate analysis of variance and

        time series analysis procedures (currently done

        by Box-Jenkins, Kalman filter, and state-space

        methods).

 

     4. Analyze the theoretical numerical and statistical

        properties of candidate algorithms, such as

        convergence conditions and consistency.

 

If successful, it is expected that there would be a substantial military and non-military demand for a statistical estimation computer program package based on the improved methods.  With respect to military applications, such methods offer the potential for fast, fail-safe processing of tracking and sensor exploitation data.  With respect to non-military applications, the methods are much more "user-friendly" than the least-squares method, in that the user would be protected from catastrophic failures, and would not need to understand matrix algebra concepts such as linear dependencies, singularities, and numerical stability problems in matrix inversion, to be assured of successful application of the procedures.  In view of the large number of persons involved in data analysis, and the growing use of microcomputers, the development of such methods is considered to be a very significant contribution to the field of data analysis.

 

The determination that the iterative Wald-Bartlett method avoided the catastrophic failure problem of the classical least-squares method was serendipitously discovered during the course of the Phase I investigation.  Since this property of the algorithm is judged to be probably more important than speed in many applications, it is recommended that the title of the Phase II study be changed from "Fast Algorithms for Estimation, Prediction and Control," to "Improved Algorithms for Estimation, Prediction and Control."

 

Reduction of the danger of obtaining wrong answers from regression analyses represents an area of potentially great benefit, to a wide class of data analysts.  The original concept of this study was to develop fast algorithms, primarily for real-time applications such as tracking, sensor exploitation, or industrial process control.  Those applications, while important, concern relatively few data analysts.  The discovery of the robustness of the iterative Wald-Bartlett algorithm, however, could have substantial impact for a wide class of data analysts.  For example, a typical logistics application involves the determination of parametric cost estimating relationships.  These relationships are estimated by linear regression analysis.  Since the models developed are empirical in nature, they involve the analysis of a large number of cost-related variables.  The presence of a linear dependency in the data, or the occurrence of a roundoff-error-caused matrix inversion failure caused by high correlation among some of the explanatory variables, could produce incorrect results.  The availability of "fail-safe" algorithms would benefit this and many other similar applications.  In view of the substantial amount of funds expended by the Office of Naval Research on parametric cost analysis and other data analysis, the benefits of improved estimation methods would be substantial.