Dimensional Reduction
Dimensional reduction is the process of assigning high-dimensional stimuli to locations in a lower-dimensional space in such a way that the relations between the high-dimensional stimuli are recreated or preserved by their low-dimensional analogs. It has many applications, including information compression, noise attenuation, the induction of similarity metrics, and data visualization. There are many ways of performing dimensional reduction. In this post I describe a new method that involves assigning stimuli to low-dimensional locations that allow the high-dimensional stimuli to be optimally reconstructed.
The Principle Of Reconstruction
The approach I am describing involves assigning the high-dimensional stimulus patterns, along with a set of learned bases, to locations in the low-dimensional space, and then reconstructing each high-dimensional stimulus pattern as a weighted sum over the bases. The bases have the same dimensionality as the stimulus patterns, and the weight each basis receives in a reconstruction sum is a decreasing function of the distance, in the low-dimensional space, between the stimulus pattern being reconstructed and that basis. The overarching goal is to learn locations for the stimulus patterns and the bases, and values for the bases themselves, that allow the high-dimensional patterns to be reconstructed with minimum error. The hope is that this approach will assign 'similar' stimuli to nearby locations in the low-dimensional space, because similar stimuli can be reconstructed from similar bases. A notable feature of this approach is that it defines a generative model for the high-dimensional stimulus patterns, which distinguishes it from some other dimensional reduction methods.
Model
Let's assume we have an \(N \times H\) matrix \(S\) containing our high-dimensional stimuli, where \(N\) is the number of stimulus objects, \(H\) is the number of dimensions the stimulus objects have, and \(S_{sh}\) denotes the value of pattern \(s\) on dimension \(h\). We also have an \(M \times H\) matrix \(B\) consisting of the reconstruction bases, where \(M\) is the number of bases we elect to use in the reconstruction process (e.g. 25), and \(B_{bh}\) denotes the value that basis \(b\) has on dimension \(h\).
In addition to this information the model needs to represent the low-dimensional locations assigned to both the stimuli and the bases. Let's assume the locations of the stimulus objects are stored in an \(N \times L\) matrix \(P\), where \(L\) is the number of dimensions we want our lower-dimensional space to possess (e.g., 2), and \(P_{sl}\) represents the location of stimulus object \(s\) on dimension \(l\). In addition, let's assume the locations of the bases are stored in an \(M \times L\) matrix \(Q\), where \(Q_{bl}\) represents the location of basis \(b\) on dimension \(l\).
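To make the setup concrete, here is a minimal NumPy sketch of the four matrices. The sizes are hypothetical stand-ins; only \(S\) would hold real data, while the other three matrices start random and are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

N, H = 1000, 784   # number of stimuli and their (high) dimensionality (hypothetical sizes)
M, L = 25, 2       # number of bases and the target (low) dimensionality

S = rng.random((N, H))        # high-dimensional stimuli (real data in practice)
B = rng.normal(size=(M, H))   # reconstruction bases, learned
P = rng.normal(size=(N, L))   # low-dimensional locations of the stimuli, learned
Q = rng.normal(size=(M, L))   # low-dimensional locations of the bases, learned
```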
If we denote the reconstruction of stimulus object \(s\) on dimension \(h\) as \(R_{sh}\) then we have
$$R_{sh} = \sum\limits_{b=1}^{M} w_{sb} B_{bh}$$
which is to say that each feature of a stimulus object is reconstructed via a weighted sum over the corresponding feature values of the bases, where the weight assigned to each basis in the sum is a monotonically decreasing function of the distance between the stimulus object and the basis in the hypothesized lower-dimensional space. In other words,
$$w_{sb} = w_{bs} = \exp(-d_{sb})$$
where
$$d_{sb} = \sqrt{\sum\limits_{l=1}^{L} (P_{sl} - Q_{bl})^2}$$
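Putting these equations together, a minimal NumPy sketch of the reconstruction (forward) pass might look as follows; the function names are my own, and the sketch continues from the matrices defined above.

```python
def reconstruct(P, Q, B):
    """Reconstruct the stimuli from the bases, weighting each basis by a
    decreasing function of its low-dimensional distance from each stimulus."""
    # d[s, b]: Euclidean distance between stimulus s and basis b in the low-dimensional space
    d = np.sqrt(((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=-1))
    w = np.exp(-d)   # w[s, b]: reconstruction weight for stimulus s and basis b
    R = w @ B        # R[s, h] = sum_b w[s, b] * B[b, h]
    return R, w, d

def total_error(S, P, Q, B):
    """Summed squared reconstruction error over all stimuli and features."""
    R, _, _ = reconstruct(P, Q, B)
    return ((R - S) ** 2).sum()
```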
Gradient Equations
Learning in this model involves finding the low-dimensional locations of the stimulus objects and the bases, and the high-dimensional feature values of the bases. This can be accomplished via gradient descent by defining the reconstruction error of the stimulus objects and then finding the gradient of this error with respect to each of these parameters. In what follows, to keep the notation simple, we consider only the reconstruction of a single high-dimensional feature of a single stimulus object (i.e. we drop the \(s\) and \(h\) subscripts); the full gradient equations can easily be recovered from what we write here by summing appropriately over stimuli and features.
Let's start with the squared reconstruction error, \(E\), where \(r\) represents the reconstructed value and \(p\) the true value of the stimulus feature:
$$E = (r-p)^2$$
The gradient of the reconstruction error with respect to \(q_b\), the relevant feature value of basis \(b\), is
$$\frac{\partial E}{\partial q_b} = 2(r-p)w_b$$
The gradient of the reconstruction error with respect to the location of the stimulus object being reconstructed, on the low-dimensional dimension \(x\), is
$$\frac{\partial E}{\partial x_p} = 2 (r-p) \frac{\partial}{\partial x_p} \left( \sum\limits_b w_b q_b \right) = 2 (r-p) \left( \sum\limits_b \frac{q_b(x_b-x_p)w_b}{d_b} \right)$$
And the gradient of the reconstruction error with respect to the location of basis \(b\), on the low-dimensional dimension \(x\), is
$$\frac{\partial E}{\partial x_b} = 2 (r-p) \left( \frac{q_b(x_p-x_b)w_b}{d_b} \right)$$
Note that, modulo the summing over \(b\), this is just the negative of the previous gradient equation, which reflects the underlying symmetry of the reconstruction model: moving the location of a basis one unit in some direction has the same effect as moving the location of the object being reconstructed one unit in the opposite direction, as far as that single basis and object are concerned.
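Continuing the NumPy sketch above, the three gradients can be vectorized over all stimuli, bases, and features, and plugged into plain gradient descent. The learning rate and step count are hypothetical choices, and a small constant is added to the distances to avoid division by zero when a stimulus sits exactly on a basis.

```python
def gradients(S, P, Q, B, eps=1e-12):
    """Analytic gradients of the summed squared reconstruction error."""
    R, w, d = reconstruct(P, Q, B)
    err = R - S                     # (N, H) residuals, i.e. (r - p) for every stimulus and feature
    grad_B = 2.0 * w.T @ err        # dE/dB[b, h] = sum_s 2 (r - p) w_sb

    A = err @ B.T                   # A[s, b] = sum_h (r - p) * B[b, h]
    C = 2.0 * A * w / (d + eps)     # factor shared by both location gradients
    grad_P = C @ Q - C.sum(axis=1, keepdims=True) * P   # dE/dP[s, l]
    grad_Q = C.T @ P - C.sum(axis=0)[:, None] * Q       # dE/dQ[b, l]
    return grad_P, grad_Q, grad_B

lr = 1e-3                           # hypothetical learning rate
for step in range(1000):
    gP, gQ, gB = gradients(S, P, Q, B)
    P -= lr * gP
    Q -= lr * gQ
    B -= lr * gB
```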
Performance On MNIST Data
The MNIST data set is a digitized collection of handwritten digits (0 to 9) commonly used to benchmark machine learning classification algorithms. We can use a model like the one described above to assign low-dimensional locations to these high-dimensional images, and by examining the learned low-dimensional locations we can visualize the structure of the digits.
The model we applied to the MNIST data differed from the model described above in a few respects:
- Instead of exponential decay, we used a sigmoidal decay function to model the relationship between low-dimensional distance and reconstruction weight (\(w_{sb} = \frac{1}{1 + e^{d_{sb} - 5}}\)). This means that there is very little weight decay in the immediate vicinity of a basis, which acts as an attractor space for stimuli similar to the basis.
- We normalized the weight activations of the bases, as experimentation revealed that this produced better reconstructions and visualizations. This changes (complicates) the gradient equations a little; deriving the new equations is left as an exercise for the interested reader.
- We applied an L2 regularization penalty to the low-dimensional stimulus and basis locations, as this appeared to help the gradient descent avoid local minima and resulted in better reconstruction solutions. (A sketch combining these modifications appears after this list.)
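In case it is useful, here is a rough sketch of the modified weight function and objective described in the list above. The offset of 5 comes from the sigmoid formula given earlier; the penalty strength `lam` is a hypothetical value, and with normalized weights the gradients are most easily obtained with an automatic differentiation library rather than by hand.

```python
def sigmoid_weights(P, Q, offset=5.0):
    """Sigmoidal decay of weight with low-dimensional distance, normalized per stimulus."""
    d = np.sqrt(((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=-1))
    w = 1.0 / (1.0 + np.exp(d - offset))      # nearly flat close to a basis, decaying beyond the offset
    return w / w.sum(axis=1, keepdims=True)   # normalize the weights for each stimulus

def regularized_error(S, P, Q, B, lam=1e-3):
    """Reconstruction error plus an L2 penalty on the low-dimensional locations."""
    R = sigmoid_weights(P, Q) @ B
    return ((R - S) ** 2).sum() + lam * ((P ** 2).sum() + (Q ** 2).sum())
```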
The chart below shows the locations assigned to the MNIST handwritten digits by this reconstruction model. The locations are color-coded by digit class (and the locations of the bases are shown in black). As you can see, the model does a pretty reasonable job of compressing the high-dimensional stimuli into low-dimensional locations while allowing the digits to be reconstructed by the learned bases and preserving information about the identity of the digits.