Below is the introduction to a technical report (without references) describing some of the algorithm work in which I have been involved. The whole report is technical report TR-1030, T. Wallace, An All-neighbor classification rule based on correlated distance combination, 1996, Lincoln Laboratory, MIT.
There exists a substantial group of pattern recognition problems in which complicated signals are measured, representing multiple classes. For example, consider the problem of categorizing the state of an internal combustion engine by examining an audio recording of its operation. If the analyst has access to the detailed design of the engine, it is possible that the various sounds may be decomposed into frequencies representing motions or vibrations of specific engine parts, which could then be studied at some level. But what if such knowledge is lacking, and instead a database of engine recordings is available, representing possibly a range of engine states from the set (new, worn, damaged)?
Or consider monitoring an earth-orbiting spacecraft using a narrowband radar. Such a radar records an instantaneous total radar cross-section value that fluctuates with time as the spacecraft moves with respect to the radar. Again, if the structure of the spacecraft (as deployed) is exactly known, and the orbit and motion of the spacecraft are precisely known, in principle one can predict the signature using electromagnetic prediction methods. If the radar wavelength is large enough compared to the size of the spacecraft, one may even expect to do this with a tolerable amount of processing on a large computer. However, in reality one or more of these criteria is usually not satisfied, so the only recourse may be to base one's assessment on a collected database of historic signatures of the same spacecraft or class of spacecraft.
This type of problem has often historically been attacked using nonparametric methods, such as a nearest-neighbor (NN) classifier [ Fukunaga 1990 ] or neural network, [ Lippmann ] which selects a single class as the most likely. The NN method has been demonstrated to be an effective pattern recognition technique in many experiments. An oft-cited reference [ Cover Hart ] proved that the probability of error of this method is bounded by twice the Bayes error. However, this is only true asymptotically, in the infinite sample case, so this result is rarely applicable in practice. In practice, an undersampled situation usually exists in which it may be helpful to consider more than the nearest neighbor of a vector. And of course one may be able to do better than twice the Bayes error. The k-nearest-neighbor method [ Fukunaga 1990, Fukunaga/Hostetler, ] and its variants [ Dudani, Keller/Gray/Givens ] are sometimes used to give the classifier more information, although both theoretical and experimental work [ Hart, Baily/Jain Schmidt/Levelt/Duin ] suggest that this approach is sometimes less effective than the simple NN.
The neural network is architecturally different from the NN approach, but the performance has proved to be similar in many experiments. It has been shown [ Ruck/Rogers, Kanaya/Miiyake, Barnard ] that in the limit, the performance of neural networks approaches the Bayesian limit, which sounds twice as good as the NN situation. However, both these results are for an idealized case of infinite data, and most reported experiments comparing the two methods [ Schmidt/Levelt/Duin, Geva/Sitte, Weideman/Manry, Rubin ] show the NN methods to perform slightly better in most cases. The training phase of network construction evidently creates boundaries in hyperspace that are similar to those defined by the NN method. The output is usually comparable to the output of the NN algorithms as well, consisting of a simple classification.
There are difficulties encountered in applying these methods to certain real problems. The confidence in a classification is usually unavailable, except as a statistical performance over an entire ensemble of data. It is always desirable to be able to explain to the user the rationale behind the system's assessment, as is done in some rule-based systems. Explanatory evidence tends to be absent or lacking with these methods, although a NN system can exhibit the nearest neighbor itself as evidence to support the classification. In either case the user can be provided with an assessment of the overall performance of the classifier based on certain experiments or training runs. It would be better to report the confidence in each classification, however, since certain input patterns can in principle be determined not to match anything in the database well enough for high confidence in the class assignment, while others may generate extremely good matches.
Another common complication in this type of problem involves varying dimensions of the data vectors. For example, what if most of the engine recordings last 10 s, but a small group of potentially valuable examples last only 3 s.? What if radar satellite signatures inhabit a total look angle range from 20\(de to 160\(de, with some short, some long, and some not even overlapping in look angle? The NN and neural network methods normally require input vectors of constant dimension, so some method must be found to meet this requirement
One additional problem is that there is no guarantee that all the possible classes of the system under analysis are represented by vectors in the database. In fact, many malfunctions probably are not represented in the database. In some cases it may be impossible or dangerous to collect data on malfunctioning systems. Work has been reported [ Dubuisson/Masson ] on general methods of deciding that a vector does not belong to any class represented in the database or that it is ambiguous. While these methods may indeed be useful, they are mainly applicable to the standard case of vectors in n-space.
An interesting paper by Denoeux describes a method of combining the Dempster-Shafer formalism with the k-NN algorithm. This technique also automatically generates an assessment of the uncertainty in the classification, as expected for Dempster-Shafer methods. This is definitely in the spirit of the current method, but k is fixed, so not all the available information is used. Although large k may be used with this method, in the case of a heterogeneous database in which different input vectors have different numbers of comparable database vectors, it is difficult to select k a priori without sometimes having fewer than k available distances.
In light of the preceding discussion, one can itemize some desirable properties of a system for assessing the state of an object, given a sample vector and a database of historic vectors. None of the methods in the literature seems to have all of the following properties, which assume a system based on some kind of direct (NN) or indirect (neural net) comparison of vectors. The comparison algorithm should be able to handle input vectors of different dimension. (Some minimum size may be necessary.) This rules out using a simple feature vector in n-dimensional space, since some of the dimensions may be missing. In fact, while A may be comparable to B, and B to C, it is possible that there is no way to compare A to C. One way to deal with this problem is to work with distances between vector pairs, where possible, rather than with the original vectors. If a large subset of the database is comparable to the new vector, then all the evidence generated from each possible comparison should be used by the system. The NN approach would not satisfy this requirement, but even k-NN methods fail to utilize all the potential evidence. It follows that the system must be able to assess a new vector based on comparisons to a database subset of varying size. It is expected that the confidence in a given class assignment would be lower on the average when fewer database vectors are available for comparison. To support individual assessments and explanatory evidence, the methods should be either statistical or ``fuzzy,'' generating a continuum of confidences rather than just dividing a pattern space into disjoint regions. The system should not assume that all the possible classes are represented in the database. A ``none-of-the-above'' hypothesis should exist, or individual hypotheses of particular class membership (or not) should be tested. This requirement is compatible with the desired numerical confidence assessment of point (c).
It is easy to write down this wish list, but it is difficult to develop methods that satisfy each point. A statistical approach is obviously one way to proceed, but the individual distances between pattern vectors are correlated, and the dimension of the necessary joint probability density functions might be as large as dozens or hundreds. Given a database of some hundreds or thousands of vectors, it appears insurmountable to take a Bayesian approach. Even if a few thousand vectors were comparable and available, and hence perhaps ten million distances could be computed and analyzed, it appears intractable to estimate a joint density that might accurately represent the likelihood of a distance vector of several hundred or a thousand values.
In the method described below, this problem is approached by first computing a numeric distance between vectors, wherever possible, which is designed to capture their similarity. This is the problem-specific part of the method, and will not be dwelled upon in this report. Non-parametric methods are then used to estimate single-distance probability density functions (PDFs). This procedure is necessary since nonlinear preprocessing is typically used prior to computation of any distances, and sometimes combinatorial methods are used in the distance measure itself, making it difficult to predict the form of the PDFs. The non-parametric estimates show quite a bit of complexity, which further discourages analytic efforts.
Rather than classifying the unknown vector into one of the database classes, we test the hypothesis that the vector belongs to each possible class. To obtain distances representing the null hypothesis, we have used two methods. The first is to simulate them, as described in the next section. If simulation is possible, large numbers of simulated distances may be generated to largely solve the problem. If simulation is not possible, a large group of inter-class distances may be used to represent the null hypothesis. Given these two groups of distances, a likelihood function giving us the probability of class membership may be estimated. Often there is some a priori information about the class of the unknown vector. For example, in the satellite monitoring situation it is fairly certain which satellite is being tracked, so a single assessment of the confidence that the satellite is nominal is produced. If the situation is less certain, multiple confidences of membership in various classes may be produced.
But the single-distance problem is really not the heart of the matter. The difficulty is in dealing with multiple correlated distances. In the course of this research a new technique was developed, combining the evidence for n distances, given fixed n. Further analysis of this statistic for varying n produced a useful analytic expression for evidence combination, which is a function of n. Methods of optimizing the parameters of this expression based on large-scale classification experiments have been used with good success. The result is not only an overall confidence of class assignment, but also an easy recapitulation of the individual evidence going into the overall assessment. This is very useful to convince a skeptical user that the confidences are correct.
We believe that this method is of general applicability. Initially it was applied to narrowband radar signatures of low-earth-orbit satellites with both low and high inter-vector correlations. More recently it has been applied it to wideband radar range profiles of geosynchronous satellites, which represent completely different physics, as well as different satellite configurations, operations, and orbits. The technique has worked well in both cases. Work is underway to apply the method to photometric data.