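Parzen windows are commonly used to approximate densities. Given $N$ training data $\{(x^k, y^k)\}_{k=1}^{N}$, we can approximate
$$P(x) \approx \frac{1}{N} \sum_{k=1}^{N} G(x; x^k, \sigma)$$
where
$$G(x; x^k, \sigma) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left(-\frac{1}{2\sigma^2}\,\|x - x^k\|^2\right)$$
is a multidimensional, properly normalized Gaussian in $D$ input dimensions, centered at the data point $x^k$ with variance $\sigma^2$. It has been shown (Duda and Hart, 1973) that Parzen windows approximate densities arbitrarily well for $N \rightarrow \infty$, if $\sigma$ is appropriately scaled.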
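As an illustration of the Parzen estimate described above, the following is a minimal sketch in plain Python. The function names `gaussian` and `parzen_density` and the toy data are assumptions made for this example; they do not come from the original work.

```python
import math

def gaussian(x, center, sigma):
    """Isotropic, properly normalized Gaussian in D = len(x) dimensions."""
    d = len(x)
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    norm = (2.0 * math.pi * sigma ** 2) ** (d / 2.0)
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) / norm

def parzen_density(x, data, sigma):
    """Parzen estimate: the average of Gaussians centered at the training inputs."""
    return sum(gaussian(x, xk, sigma) for xk in data) / len(data)

# Toy training inputs (hypothetical): two points close together, one further away.
data = [(0.0, 0.0), (1.0, 0.5), (0.9, 0.4)]
p_near = parzen_density((1.0, 0.5), data, sigma=0.3)  # query inside the cluster
p_far = parzen_density((5.0, 5.0), data, sigma=0.3)   # query far from all data
```

The estimate is larger near the training points than far from them, and the width `sigma` controls how quickly it falls off.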
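Using Parzen windows we may write
$$E(y|x^c) \approx \frac{\sum_{k=1}^{N} \int NN_w(x^c, x^u)\, G(x^c, x^u; x^k, \sigma)\, dx^u}{\sum_{k=1}^{N} G(x^c; x^{c,k}, \sigma)}$$
where $x^c$ denotes the known and $x^u$ the unknown inputs, $NN_w$ is the network, and where we have used the fact that
$$\int G(x^c, x^u; x^k, \sigma)\, dx^u = G(x^c; x^{c,k}, \sigma)$$
and where $G(x^c; x^{c,k}, \sigma)$ is a Gaussian projected onto the known input dimensions (obtained by simply leaving out the unknown dimensions in the exponent and in the normalization; see Ahmad and Tresp, 1993). $x^{c,k}$ are the components of the training data corresponding to the known input (compare Figure 1).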
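Now, if we assume that the network prediction is approximately constant over the ``width'' of the Gaussians, $\sigma$, we can approximate
$$\int NN_w(x^c, x^u)\, G(x^c, x^u; x^k, \sigma)\, dx^u \approx NN_w(x^c, x^{u,k})\, G(x^c; x^{c,k}, \sigma)$$
where $NN_w(x^c, x^{u,k})$ is the network prediction which we obtain if we substitute the corresponding components of the training data, $x^{u,k}$, for the unknown inputs.
With this approximation,
$$E(y|x^c) \approx \frac{\sum_{k=1}^{N} NN_w(x^c, x^{u,k})\, G(x^c; x^{c,k}, \sigma)}{\sum_{k=1}^{N} G(x^c; x^{c,k}, \sigma)}$$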
Interestingly, we have obtained a network of normalized Gaussians which are centered at the known components of the data points. The ``output weights'' consist of the neural network predictions in which the corresponding components of the training data points have been substituted for the unknown inputs. Note that we have obtained an approximation which has the same structure as the solution for normalized Gaussian basis functions (Ahmad and Tresp, 1994).
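The normalized-Gaussian approximation can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: missing inputs are marked with `None`, the network is any callable on a complete input vector, and `predict_with_missing`, `toy_net` and the toy data are hypothetical names introduced here.

```python
import math

def predict_with_missing(net, x, data, sigma):
    """Approximate the expected output given only the known entries of x.

    net   : callable mapping a complete input vector (list of floats) to a prediction
    x     : query input; entries that are None are treated as unknown
    data  : training inputs, each a sequence of the same length as x
    sigma : width of the Parzen Gaussians
    """
    known = [i for i, v in enumerate(x) if v is not None]
    numerator = 0.0
    denominator = 0.0
    for xk in data:
        # Gaussian projected onto the known dimensions. Its normalization
        # constant is the same for every training point and cancels in the ratio.
        sq_dist = sum((x[i] - xk[i]) ** 2 for i in known)
        weight = math.exp(-sq_dist / (2.0 * sigma ** 2))
        # Substitute the training point's components for the unknown inputs.
        filled = [xk[i] if v is None else v for i, v in enumerate(x)]
        numerator += weight * net(filled)
        denominator += weight
    return numerator / denominator

def toy_net(x):
    """Hypothetical stand-in for a trained network."""
    return x[0] + 2.0 * x[1]

train_inputs = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
# The second input is missing; the first is known to be 1.0.
estimate = predict_with_missing(toy_net, [1.0, None], train_inputs, sigma=0.1)
```

The sketch makes one network evaluation per training point, regardless of how many inputs are missing. If the query lies very far from all training data relative to `sigma`, every weight can underflow to zero and the division fails, so a practical version would need to guard against that case.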
In many applications it might be easy to select a reasonable value for $\sigma$ using prior knowledge, but there are also two simple ways to obtain a good estimate for $\sigma$ using leave-one-out methods. The first method consists of removing the $k$-th pattern from the training data and calculating $P(x^k) \approx \frac{1}{N-1} \sum_{l=1, l \neq k}^{N} G(x^k; x^l, \sigma)$. Then select the $\sigma$ for which the log likelihood $\sum_{k} \log P(x^k)$ is maximum. The second method consists of treating an input of the $k$-th training pattern as missing and then testing how well our algorithm (Equation 2) can predict the target. Select the $\sigma$ which gives the best performance. In this way it would even be possible to select input-dimension-specific widths $\sigma_i$, leading to ``elliptical'', axis-parallel Gaussians (Ahmad and Tresp, 1993).
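The first leave-one-out method can be sketched as a search over a small set of candidate widths. The candidate grid, the toy data and the function names below are assumptions made for illustration; the text does not prescribe how candidate values are generated.

```python
import math

def gaussian(x, center, sigma):
    """Isotropic, properly normalized Gaussian in D = len(x) dimensions."""
    d = len(x)
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    norm = (2.0 * math.pi * sigma ** 2) ** (d / 2.0)
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) / norm

def loo_log_likelihood(data, sigma):
    """Sum over k of log P(x^k), where P is estimated without pattern k."""
    n = len(data)
    total = 0.0
    for k, xk in enumerate(data):
        p = sum(gaussian(xk, xl, sigma)
                for l, xl in enumerate(data) if l != k) / (n - 1)
        if p == 0.0:
            # The density underflowed: this width is far too small for the data.
            return float("-inf")
        total += math.log(p)
    return total

def select_sigma(data, candidates):
    """Return the candidate width with the highest leave-one-out log likelihood."""
    return max(candidates, key=lambda s: loo_log_likelihood(data, s))

# Hypothetical one-dimensional data forming two tight clusters.
data = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,)]
candidates = [0.01, 0.1, 1.0, 10.0]
best_sigma = select_sigma(data, candidates)
```

Leaving out the pattern being evaluated is what makes the criterion useful: without it, the likelihood would grow without bound as the width shrinks towards zero.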
Note that the complexity of the solution is independent of the number of missing inputs! In contrast, the complexity of the solution for feedforward networks suggested in Tresp, Ahmad and Neuneier (1994) grows exponentially with the number of missing inputs. Although the solution is similar in character to the one for normalized RBFs, here we have no restrictions on the network architecture, which allows us to choose the network most appropriate for the application.
If the amount of training data is large, one can use the following approximations:
Note that the solution which substitutes the components of the training data closest to the input seems biologically plausible.