Talking about the No Free Lunch Theorem is a bit risky, because it can be taken as a denial of machine learning algorithms. It is a theorem in the area of combinatorial optimization, formulated by the physicists David H. Wolpert and William G. Macready. It says, roughly, that “any algorithm that searches for extreme values of a cost function will show the same performance when its results are averaged over all possible cost functions.” The implication is that every model has its strengths and weaknesses, so we should use the right model in the right place.
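For readers who want the formal version: in Wolpert and Macready’s 1997 paper the statement is roughly the following (a paraphrase of their Theorem 1, where d is the sequence of cost values produced by m evaluations, f is the cost function, and a1 and a2 are any two search algorithms):

```latex
% Wolpert & Macready (1997), Theorem 1, paraphrased:
% for any two search algorithms a_1 and a_2, summed over all cost functions f,
% the probability of obtaining a given sequence of cost values d_m^y
% after m evaluations is identical.
\sum_{f} P\!\left(d_m^{y} \mid f, m, a_1\right) \;=\; \sum_{f} P\!\left(d_m^{y} \mid f, m, a_2\right)
```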
The same holds in machine learning for classification and prediction, where it is said that the algorithm must be adjusted to the situation rather than applied as a general-purpose learning algorithm as is. Presumably this is why the terms “data science” and “data scientist” came into use. However, improving machine learning algorithms is work at the level of graduate-school research, a level most data scientists who carry the title within their companies never reach. In this regard, I have argued for a decade that the designation “data scientist” is inappropriate.
What more than 95% of data scientists do is use the various algorithms available in machine learning libraries. At most, they set the options and select the variables to include in the model. It would be good if they could also experiment with various variable weightings, but that may be the limit of what they can do. Working in Python, for example, requires coding a series of tasks such as reading data, cleaning it, scaling it, and applying a learning algorithm, which many data scientists might refer to as “algorithm development.” That is a far cry from real algorithm development.
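For concreteness, that routine work often looks like the following. This is a minimal sketch assuming pandas and scikit-learn; the file name and column names are hypothetical, and the learning algorithm is taken from the library as is.

```python
# A minimal sketch of routine "data science" work: load, clean, scale, fit.
# Assumes pandas and scikit-learn; "customers.csv" and its columns are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("customers.csv")                      # read data
df = df.dropna(subset=["age", "income", "churned"])    # simple cleaning

X = df[["age", "income"]]                              # select variables for the model
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "algorithm" comes straight from the library; only its options are set.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(C=1.0)),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```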
To illustrate the difference between a general-purpose learning algorithm and an algorithm specific to an individual problem, take Fisher’s Iris data as an example. Fisher famously used this data when developing his discriminant analysis. Discriminant analysis may not normally be classified as machine learning, but it is exactly that: an ancestor of machine learning. It is a method of linear separation of classes: the observed data values constitute a multidimensional space, and linear equations divide that space. But have you noticed that there is a different approach available for the Iris data?
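Before turning to that different approach, it may help to see the standard one in code. This is a minimal sketch using scikit-learn’s bundled copy of the Iris data and its LinearDiscriminantAnalysis class; the tooling is an assumption for illustration, not how Fisher computed it.

```python
# Linear discriminant analysis on the Iris data: the general-purpose approach.
# Minimal sketch using scikit-learn's bundled copy of Fisher's data.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)        # 4 raw attributes, 3 species
lda = LinearDiscriminantAnalysis()       # linear separation of the classes
scores = cross_val_score(lda, X, y, cv=5)
print(scores.mean())                     # typically around 0.98 accuracy
```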
Biological classification can often be based on the proportions of each part; humans, too, differ in body proportions from one population to another. The Iris data has four attributes: Sepal Length, Sepal Width, Petal Length, and Petal Width. Normally these are used as they are, but new values can be calculated from them, for example (Sepal Length / Sepal Width), (Petal Length / Petal Width), and (Sepal Length / Petal Length). This search for new values that are useful for the model is called feature engineering.
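A minimal sketch of that calculation, again assuming pandas and scikit-learn’s bundled data; the ratio columns are exactly the ones listed above.

```python
# Feature engineering on Iris: derive proportion (ratio) features from the raw measurements.
# Minimal sketch; column names follow scikit-learn's load_iris feature names.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame

df["sepal_ratio"] = df["sepal length (cm)"] / df["sepal width (cm)"]
df["petal_ratio"] = df["petal length (cm)"] / df["petal width (cm)"]
df["sepal_petal_ratio"] = df["sepal length (cm)"] / df["petal length (cm)"]

# The new ratio columns can be fed to a model instead of (or alongside) the raw attributes.
print(df[["sepal_ratio", "petal_ratio", "sepal_petal_ratio", "target"]].head())
```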
The example above is a calculation that makes sense only for the Iris data, whereas discriminant analysis is a modeling method that can be applied broadly and generally. In other words, feature engineering is the discovery of a law specific to the problem. Sometimes the features created this way are sufficient by themselves to achieve the purpose of the model.
Real-world problems often require the discovery of such problem-specific laws as well as general-purpose learning algorithms. In classification problems, when one class is distributed concentrically around another, there is a technique of mapping the data onto a quadratic surface, which can also be regarded as a kind of feature calculation.
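A minimal sketch of that idea, using scikit-learn’s synthetic make_circles data as a stand-in for the concentric situation: on the raw coordinates a linear model is near chance, but after adding quadratic terms the two rings become linearly separable.

```python
# When one class surrounds another concentrically, a linear model fails on the raw
# coordinates but succeeds after mapping onto a quadratic surface (x^2, y^2, x*y, ...).
# Minimal sketch; make_circles is a synthetic stand-in for the concentric case.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

linear = LogisticRegression()
quadratic = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())

print(cross_val_score(linear, X, y, cv=5).mean())     # near chance on raw coordinates
print(cross_val_score(quadratic, X, y, cv=5).mean())  # close to 1.0 with quadratic features
```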
The No Free Lunch Theorem demonstrates the importance of using prior knowledge to discover the laws of each individual problem. In practice, however, it is usually interpreted as advice to try as many modeling methods as possible. In reality there are not that many methods to choose from in machine learning libraries, since the type of problem and the type of data limit which methods are applicable. Still, using several different modeling methods and combining them into an ensemble for prediction or classification can be expected to have a certain effect. Self-Organizing Maps are useful for analyzing where in the data space each of several different models is locally strong or weak.
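A minimal sketch of such an ensemble, assuming scikit-learn’s VotingClassifier; the three base models and the dataset are arbitrary common choices, and the SOM-based analysis mentioned above is not shown here.

```python
# Ensemble of several different modeling methods via soft voting.
# Minimal sketch with scikit-learn; the three base models are arbitrary common choices.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))),
        ("forest", RandomForestClassifier(random_state=0)),
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
    ],
    voting="soft",   # average predicted probabilities across the three models
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```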
Data are often inconsistent, and it is not uncommon for the prediction or classification performance of machine learning to top out at 70% or 80% correct. If mistakes occur 20% or 30% of the time in real work, the model is not suitable for practical use. Depending on whether false negatives or false positives are more costly to the business, the model can therefore be adjusted by shifting it toward one or the other. Even so, it is said that only 10% to 15% of all data science projects in the world have advanced to practical use. Whether that is a lot or a little is a matter of opinion: even a 10% chance of success is well worth the challenge if it gives a company an overwhelming advantage over its competitors.
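One common way to shift a model in this sense is simply to move its decision threshold. A minimal sketch, assuming scikit-learn; the 0.3 cut-off is illustrative, not a recommendation.

```python
# Shifting a classifier toward fewer false negatives (at the cost of more false positives)
# by lowering the decision threshold. Minimal sketch; the 0.3 cut-off is illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
default = (proba >= 0.5).astype(int)   # the usual 0.5 cut-off
shifted = (proba >= 0.3).astype(int)   # lower cut-off: fewer false negatives, more false positives

print(confusion_matrix(y_test, default))
print(confusion_matrix(y_test, shifted))
```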
Google’s changes to its algorithms can be a cause for grief. Unreasonable things often happen, such as drastic swings in search rankings, ads that no longer appear, or YouTuber accounts being demonetized. It is very annoying, because many people gain or lose unexpectedly at the slightest whim of Google’s data scientists. If even an advanced company like Google, which excels at data science, has such problems all the time, one can easily guess what the state of data science is like in the general corporate world.
A ray of light in terms of feature extraction is deep learning. In deep learning neural networks for image recognition, features can be seen to form automatically in the hidden layers. However, the extent to which this can be applied widely awaits further research. Another direction is to make more use of unsupervised learning and exploratory analysis: efforts to understand the data better should not be neglected. Current machine learning and data science may be too heavily biased toward classification and prediction.
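To make the “features in the hidden layer” point concrete, the hidden activations of a small network can be inspected directly. A minimal sketch using scikit-learn’s MLPClassifier on the small digits images, which stand in for image recognition here; a real deep network would use a dedicated framework.

```python
# Inspecting the features a neural network forms in its hidden layer.
# Minimal sketch with scikit-learn's MLPClassifier on the small digits images;
# a real image-recognition network would be deeper and use a dedicated framework.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0                                   # scale pixel values to [0, 1]

net = MLPClassifier(hidden_layer_sizes=(32,), activation="relu",
                    max_iter=1000, random_state=0)
net.fit(X, y)

# Recompute the hidden-layer activations from the learned weights:
# these 32 values per image are the automatically created features.
hidden = np.maximum(0, X @ net.coefs_[0] + net.intercepts_[0])
print(hidden.shape)                            # (1797, 32)
```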