Concentration on the sphere – The curse of dimensionality

Before starting anything, it is very important to know its limits ahead of time. Machine learning methods may not work well with very high-dimensional (hyper-multidimensional) data, a problem referred to as the curse of dimensionality. Ultimately, this is the same situation as the ugly duckling theorem described earlier. Simply put, such data is ‘meaningless’ data.

From a philosophical perspective, this is a limitation of human existence rather than of machine learning. Human knowledge is always knowledge from some ‘point of view’. Therefore, when people want to know things in depth, they need to change their point of view and analyse them from there, which involves discarding information. If one tries to deal with all information at once, it becomes meaningless input. (The all-encompassing universe is a huge meaningless thing, and we create meaning by using only a small part of it.) Therefore, no technology can be developed to overcome this, as it is a cosmic principle.

Mathematically, the curse of dimensionality is described as Concentration on the sphere. Concentration on the sphere means that, taking any given data point as the centre, the distances from that centre to the other data points become approximately equal as the dimension increases. As a result, the differences in distance between data points are no longer significant, whichever pair is chosen. Importantly, the centre here is each data point itself, not the origin or the centre of gravity of the data space.
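The effect described above can be checked with a small simulation. The following is only a minimal sketch under stated assumptions: the points are drawn uniformly at random from the unit cube (an arbitrary choice, not taken from any particular data set), the first point is used as the centre, and the function name `relative_spread` is made up for this illustration. It measures how much the distances from the centre to all other points differ, relative to their mean.

```python
import math
import random

def relative_spread(dim, n_points=200, seed=0):
    """(max - min) / mean of the distances from one data point to all others.

    Assumption for illustration: points are uniform in the unit cube [0, 1]^dim.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    centre = points[0]  # the centre is a data point, not the origin
    dists = [math.dist(centre, p) for p in points[1:]]
    mean = sum(dists) / len(dists)
    return (max(dists) - min(dists)) / mean

# The spread shrinks as the dimension grows: distances become nearly equal.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_spread(dim), 3))
```

With this setup the printed value should fall steadily as `dim` increases, which is the concentration phenomenon seen from one data point. Other distributions will give different numbers, so the figures should be read as an illustration only.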

Ultimately, if the distances between all data points are equal, clustering, for example, becomes impossible. Actual data does not reach this extreme, but hyper-multidimensional data comes close to it.
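To make the consequence for clustering concrete, here is a rough sketch under assumptions of my own choosing: two clusters that differ in only one informative coordinate, to which a growing number of pure-noise coordinates is added. The function name `between_within_ratio`, the cluster centres 0 and 3, and the Gaussian noise are all hypothetical choices for this illustration. The ratio of between-cluster to within-cluster distance shows how clearly the clusters stand apart.

```python
import math
import random

def between_within_ratio(noise_dims, n_per_cluster=50, seed=0):
    """Mean between-cluster distance divided by mean within-cluster distance."""
    rng = random.Random(seed)

    def make_point(centre):
        # One informative coordinate, followed by noise_dims meaningless ones.
        return [rng.gauss(centre, 0.1)] + [rng.gauss(0.0, 1.0) for _ in range(noise_dims)]

    a = [make_point(0.0) for _ in range(n_per_cluster)]
    b = [make_point(3.0) for _ in range(n_per_cluster)]

    def mean_dist(xs, ys):
        # Skip the zero distance of a point to itself.
        d = [math.dist(x, y) for x in xs for y in ys if x is not y]
        return sum(d) / len(d)

    within = (mean_dist(a, a) + mean_dist(b, b)) / 2
    between = mean_dist(a, b)
    return between / within

# A ratio far above 1 means clear clusters; a ratio near 1 means they are drowned out.
for noise_dims in (0, 10, 100, 1000):
    print(noise_dims, round(between_within_ratio(noise_dims), 3))
```

In this toy setting the ratio starts large and approaches 1 as noise dimensions are added, so a distance-based clustering method has less and less to work with, even though the informative coordinate is still present in the data.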

If we think about it in terms of figures, a figure in two-dimensional space in which all points are equally distant from each other is an equilateral triangle, and in three-dimensional space it is a regular tetrahedron. I am not a mathematician, but as far as I know such a figure is called a regular simplex: in d dimensions it has d+1 vertices, all equidistant from one another. Although we use a spherical surface to explain how distances approach equality, we should not forget that in reality space expands explosively as the dimension increases. When the number of dimensions approaches the number of data points, the data points are sparsely scattered over a huge space.
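The claim that d+1 mutually equidistant points fit into d dimensions can be verified numerically. The sketch below uses one convenient construction, which is an assumption of this example rather than the only possibility: the d+1 standard basis vectors of (d+1)-dimensional space. They all lie in the d-dimensional hyperplane where the coordinates sum to 1, so they form an equidistant figure of dimension d. The helper names `simplex_vertices` and `pairwise_distances` are made up for this illustration.

```python
import itertools
import math

def simplex_vertices(d):
    """Return d + 1 mutually equidistant points.

    They are written with d + 1 coordinates for convenience, but all lie in
    one d-dimensional hyperplane (coordinates sum to 1).
    """
    n = d + 1
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def pairwise_distances(points):
    """Distances between every unordered pair of points."""
    return [math.dist(p, q) for p, q in itertools.combinations(points, 2)]

# d = 2 is the triangle, d = 3 the tetrahedron; larger d works the same way.
for d in (2, 3, 10):
    dists = pairwise_distances(simplex_vertices(d))
    print(d, len(simplex_vertices(d)), min(dists), max(dists))
```

For every d the minimum and maximum pairwise distance coincide, which is the extreme case that high-dimensional data drifts towards: with n data points one needs at least n-1 dimensions before all of them can be exactly equidistant.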

A typical misconception about “Concentration on the sphere” is the claim that hyper-multidimensional data has a spherical topology. You may come across papers claiming that a spherical SOM eliminates the curse of dimensionality, but I think this is completely wrong and a kind of pseudo-science. The sphere in Concentration on the sphere means that the data points are approximately equidistant from one another, not the familiar sphere in three-dimensional space.

We can see constellations in the night sky. However, this is how the stars are arranged as seen from Earth, and the constellations seen from another star will be different from those seen from Earth.
