Machine learning (ML) is among the most significant technological developments in recent history. The exponential growth of data, together with advances in hardware, has made it possible for algorithms to learn from data and extract insights. One common misconception is that ML can solve any given problem. In truth, it is not uncommon for an ML project to fail.
So what are some weak spots of machine learning?
Since ML depends so heavily on data, it is reasonable to expect that inadequate or inaccurate data will lead to model failure.
Not enough data
If there is a lack of data, the model will not be able to generalize. Generalization refers to the model's ability to adapt to previously unseen data. This problem occurs not only when there is too little data but also when the data is not representative enough. Machine learning engineers therefore have to make sure that the available data covers the entire distribution of the problem being solved: the more representative the examples of the objects the model may encounter, the more accurate its predictions will be.
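To make the representativeness point concrete, here is a minimal sketch with an entirely hypothetical task: a cubic polynomial is fit to noisy samples of sin(x), once using training data drawn from only a narrow slice of the input range and once using data from the full range, and both fits are evaluated over the full range.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(x)

# Test set spanning the full input range the model will face in production.
x_test = np.linspace(0, 10, 200)
y_test = target(x_test)

# Case 1: training data covers only a narrow slice of the distribution.
x_narrow = rng.uniform(0, 2, 100)
# Case 2: same amount of data, but representative of the whole range.
x_wide = rng.uniform(0, 10, 100)

def fit_and_eval(x_train):
    y_train = target(x_train) + rng.normal(0, 0.1, x_train.shape)
    coeffs = np.polyfit(x_train, y_train, deg=3)  # simple cubic model
    y_pred = np.polyval(coeffs, x_test)
    return np.mean((y_pred - y_test) ** 2)

mse_narrow = fit_and_eval(x_narrow)
mse_wide = fit_and_eval(x_wide)
print(f"MSE (narrow training range): {mse_narrow:.3f}")
print(f"MSE (representative range):  {mse_wide:.3f}")
```

Both models see the same number of examples; only the coverage of the input distribution differs, and the narrowly trained model's error explodes outside the region it was trained on.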
Low-quality data
Before designing a complex algorithm, machine learning engineers always need to examine the data carefully. Real-world data usually contains outliers, noise, and irrelevant examples; if these are not handled correctly, the predictions will most likely be flawed.
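As one simple illustration of such a cleaning step (on synthetic data, using the common 1.5 x IQR rule of thumb rather than any particular production pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)
# Mostly well-behaved measurements with a few corrupted readings mixed in.
values = np.concatenate([rng.normal(50, 5, 500), [500.0, -300.0, 999.0]])

# Flag points outside 1.5 * IQR of the quartiles -- a common rule of thumb.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
mask = (values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)
clean = values[mask]

print(f"kept {clean.size} of {values.size} points")
print(f"mean before: {values.mean():.1f}, after: {clean.mean():.1f}")
```

A handful of corrupted readings is enough to drag the raw mean far from the true value; after filtering, the summary statistics reflect the underlying distribution again.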
Domain-specific representations
The features given as input to the model should be aligned with the task's domain. In Natural Language Processing, for instance, words are usually represented by word embeddings. Commonly used Word2Vec embeddings are trained on general-purpose corpora and therefore tend to underperform on domain-specific tasks. Cognitiv+ has trained domain-specific embeddings and found that they deliver a significant gain in overall performance.
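Cognitiv+'s embedding pipeline is not described here, so as a neutral illustration of learning representations directly from a domain corpus, the sketch below builds count-based embeddings (an LSA-style alternative to Word2Vec: a co-occurrence matrix factorized with truncated SVD) on a tiny, invented legal-sounding corpus:

```python
import numpy as np

# Tiny, hypothetical legal-domain corpus (illustration only).
corpus = [
    "the party shall indemnify the other party",
    "the agreement shall terminate upon notice",
    "the party may terminate the agreement",
    "notice shall be given to the other party",
]

tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Word-word co-occurrence counts within a +/-2 word window.
cooc = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if j != i:
                cooc[idx[w], idx[sent[j]]] += 1

# Truncated SVD of the counts yields dense vectors per word.
u, s, _ = np.linalg.svd(cooc)
emb = u[:, :5] * s[:5]

def similarity(a, b):
    va, vb = emb[idx[a]], emb[idx[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

print("terminate ~ agreement:", round(similarity("terminate", "agreement"), 3))
```

The same procedure run on a general-purpose corpus would position these words differently, which is exactly why embeddings trained on in-domain text can outperform off-the-shelf ones.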
The model used for prediction and insight extraction also needs to be carefully selected and trained.
Model overfitting
Overfitting happens when a machine learning model fits the training data too closely: the model learns the detail and noise in the training data, causing poor performance on unseen data. An overfit model has low generalization capability, which is one common reason why models fail in production.
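The classic demonstration is polynomial regression: given the same noisy data, a high-degree polynomial can drive the training error close to zero by memorizing noise, while typically doing worse than a simpler model on held-out points. A minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

# Hold out every third point as a test set.
test = np.arange(x.size) % 3 == 0
x_tr, y_tr = x[~test], y[~test]
x_te, y_te = x[test], y[test]

def errors(deg):
    c = np.polyfit(x_tr, y_tr, deg)
    tr = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
    te = np.mean((np.polyval(c, x_te) - y_te) ** 2)
    return tr, te

tr3, te3 = errors(3)     # modest capacity
tr15, te15 = errors(15)  # enough capacity to memorise the noise
print(f"deg 3:  train={tr3:.3f}  test={te3:.3f}")
print(f"deg 15: train={tr15:.3f}  test={te15:.3f}")
```

The degree-15 fit always achieves a lower training error than the cubic, yet its test error sits well above its own training error: the gap between the two is the signature of overfitting.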
Model selection
No single model performs best across all machine learning tasks. For example, researchers in Natural Language Processing (NLP) have found that different NLP tasks, such as text classification and named entity recognition, are each best served by different state-of-the-art models.
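In practice this means candidate models should be compared empirically on the task at hand rather than chosen by reputation. One standard way to do this, sketched here with scikit-learn (assumed available) on a synthetic stand-in dataset, is cross-validated comparison:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real task; replace with your own dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validation gives a fairer comparison than a single split.
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")

best = max(results, key=results.get)
print("selected:", best)
```

Which candidate wins depends entirely on the data, which is the point: the same two models can rank differently on a different task.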