Practical Applications of Machine Learning Models: When and Where to Use Which Model
In 2014, the article “Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?” reported the performance of 179 classifiers on 121 datasets. Its answer: there is no single best classifier, only the most suitable one for a given dataset.
Overall, Random Forest showed the best results, but it was the top performer on only 9.9% of the datasets. SVM, Neural Networks, and Boosting techniques followed behind. The pattern conforms to our expectations: the higher the dimensionality of the dataset, the better SVM and Random Forest perform compared to Boosting techniques. The larger the dataset, the better Neural Networks perform.
Drawing on multiple sources, I have summarized the most common and popular model applications in different industries. I hope this will help with your interviews and your own field of work. You can make an educated guess about what the interviewers are looking for and steer the conversation in that direction. Generally, data science can be combined with different industries, which has produced the new concepts of DS+Finance, DS+Media, DS+Retail, DS+Energy, DS+Government, and DS+Healthcare. Let’s take a look at the first three one by one (the rest require extensive domain knowledge).
The scenarios in the finance industry are mostly fraud detection (in credit card applications, loan applications, and credit card transactions) and risk analysis. These are usually classification problems on imbalanced datasets, and the common practice is to use Boosting together with Logistic Regression.
There are a lot of regulations in the finance industry, which require banks and financial institutions to use highly interpretable models. The features used by boosting and logistic regression models can be clearly explained in this situation, so most banks do not use complex neural networks. You might think this is something you already know, but not everyone manages to avoid talking about CNN or KNN in a finance industry interview.
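To make the two points above concrete, here is a minimal sketch of a class-weighted logistic regression on an imbalanced dataset, written in plain Python so it runs without extra packages. Everything in it is an assumption for illustration: the data is synthetic, the feature names `amount_z` and `is_foreign` are made up, and the "balanced" weighting is just one common way to handle the imbalance. It covers only the logistic regression half of the practice described above, not boosting.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_weighted_logistic(X, y, epochs=500, lr=0.5):
    """Fit logistic regression by batch gradient descent with balanced class weights."""
    n, d = len(X), len(X[0])
    n_pos = sum(y)
    n_neg = n - n_pos
    # Balanced weights: the rare class counts as much as the common one overall.
    w_pos, w_neg = n / (2.0 * n_pos), n / (2.0 * n_neg)
    coef, intercept = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_c, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(intercept + sum(c * v for c, v in zip(coef, xi)))
            err = (w_pos if yi == 1 else w_neg) * (p - yi)
            grad_b += err
            for j in range(d):
                grad_c[j] += err * xi[j]
        intercept -= lr * grad_b / n
        coef = [c - lr * g / n for c, g in zip(coef, grad_c)]
    return coef, intercept


def make_data(seed=0, n_legit=400, n_fraud=20):
    """Synthetic transactions: [standardized amount, foreign flag]; label 1 = fraud."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_legit):
        X.append([rng.gauss(0, 1), 1.0 if rng.random() < 0.1 else 0.0])
        y.append(0)
    for _ in range(n_fraud):
        X.append([rng.gauss(2, 1), 1.0 if rng.random() < 0.7 else 0.0])
        y.append(1)
    return X, y


X, y = make_data()
coef, intercept = fit_weighted_logistic(X, y)

# Each coefficient maps to an odds ratio, which is what makes the model easy to explain.
for name, c in zip(["amount_z", "is_foreign"], coef):
    print(f"{name}: coefficient={c:.2f}, odds ratio={math.exp(c):.2f}")
```

The odds ratio is the part a regulator or a risk team can read directly: a value above 1 means the feature raises the odds of fraud, holding the other feature fixed. In practice you would more likely reach for a library implementation, but the interpretation of the coefficients is the same.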
The media industry, including social media, requires a broader range of ML applications. In practice, some of the top companies survive on a very advanced recommendation system (TikTok, for example). For social media platforms like Facebook, Instagram, Twitter, and TikTok, the flow goes like this:
There is a great article illustrating how the recommendation system at ByteDance works. Unfortunately, it is not in English. I will probably write an article about it in the future.
A lot of companies choose to use Collaborative Filtering. Some schools probably covered part of it in their customer analytics courses, but they probably did not tell you it is called “Collaborative Filtering.” There are three types of CF: user-based CF, item-based CF, and model-based CF. They calculate the similarities between users, items, or models, sort the similarities, and calculate the interest scores. Finally, they use the “Top-N” method to make recommendations. The similarity score is calculated differently depending on the type of CF you choose.
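The steps above (compute similarities, score the candidates, keep the Top-N) can be sketched for item-based CF in a few lines of plain Python. This is only an illustration under my own assumptions: the ratings are made up, cosine similarity is one of several possible similarity measures, and the function names are hypothetical.

```python
import math

# Hypothetical ratings: user -> {item: rating}. Items a user has not rated are absent.
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 2},
    "carol": {"A": 1, "C": 5, "D": 4},
    "dave": {"B": 4, "C": 1},
}


def item_vectors(ratings):
    """Transpose user -> item ratings into item -> user ratings."""
    items = {}
    for user, rated in ratings.items():
        for item, score in rated.items():
            items.setdefault(item, {})[user] = score
    return items


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(user, ratings, n=2):
    """Score each unrated item by similarity-weighted ratings and return the top n."""
    items = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in items:
        if candidate in seen:
            continue
        # Similarity of the candidate to every item the user already rated.
        sims = [(cosine(items[candidate], items[i]), r) for i, r in seen.items()]
        total = sum(s for s, _ in sims)
        if total > 0:
            scores[candidate] = sum(s * r for s, r in sims) / total
    return sorted(scores, key=scores.get, reverse=True)[:n]


print(recommend("dave", ratings))
```

Here "dave" liked item B, and B is rated much like A by the other users, so A ends up ranked ahead of D. User-based CF follows the same pattern with the roles of users and items swapped.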
Small recommendation systems tend to use item-based CF, while large-scale recommendation systems use user-based or model-based CF. Machine learning models are applied in model-based CF, which might include:
Note: A restricted Boltzmann machine is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs (source: Wikipedia).
NLP-based social media analytics
Content recommendation is a large part of ML in the social media industry. The LSA and LDA mentioned in CF are also popular for building user personas. Data scientists should also be familiar with other NLP tools such as Doc2Vec, topic modeling, and POS (part-of-speech) tagging.
There are multiple approaches to sales prediction, including the SCAN-PRO model (regression) and time series prediction (LSTM in particular). However, in most situations, sales prediction can be tough because of omitted variable bias. Generally speaking, preparing a regression model can be a great choice when applying for a sales analyst position.
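As a rough idea of what such a regression could look like, here is a minimal log-log sketch in plain Python that estimates a price elasticity from weekly sales. It is not the full SCAN-PRO specification; it is a simplified single-variable version, and the prices and unit sales are made up for illustration.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with a single regressor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical weekly data: shelf price and units sold.
prices = [2.0, 2.2, 2.5, 2.8, 3.0, 3.3]
units = [400, 330, 256, 204, 178, 147]

# Regress log(units) on log(price); the slope is the price elasticity.
intercept, elasticity = fit_line([math.log(p) for p in prices],
                                 [math.log(u) for u in units])


def predict_units(price):
    """Predicted unit sales at a given price under the fitted log-log model."""
    return math.exp(intercept + elasticity * math.log(price))


print(f"price elasticity: {elasticity:.2f}")
print(f"predicted units at 2.60: {predict_units(2.6):.0f}")
```

A model this small is exactly the kind that suffers from omitted variable bias: promotions, seasonality, and competitor prices are all left out, so the estimated elasticity absorbs their effects. Adding those as extra regressors is the natural next step.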
The CF model and logistic model are also popular in the retail industry. They are mainly used to build customer personas, predict consumer preferences, and launch precise promotions.
Prescriptive Analytics & Operations Research (OR)
This is not really related to machine learning. When it comes to logistics optimization and supply chain management, knowledge of OR and prescriptive analytics can help a lot. Programmers should be familiar with Python packages such as PuLP and SciPy for these kinds of problems.
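For a feel of what these problems look like, here is a small transportation problem solved with SciPy's `linprog`. The warehouses, stores, costs, supplies, and demands are all hypothetical numbers I made up; the sketch only shows how a logistics question turns into a linear program.

```python
from scipy.optimize import linprog

# Hypothetical shipping cost per unit from each warehouse to each store.
# Decision variables, in order: W1->S1, W1->S2, W2->S1, W2->S2.
cost = [4, 6, 5, 3]

# Supply limits: each warehouse can ship at most this many units.
A_ub = [[1, 1, 0, 0],
        [0, 0, 1, 1]]
b_ub = [80, 70]

# Demand requirements: each store must receive exactly this many units.
A_eq = [[1, 0, 1, 0],
        [0, 1, 0, 1]]
b_eq = [60, 50]

# Minimize total shipping cost subject to supply, demand, and non-negativity.
result = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                 bounds=[(0, None)] * 4, method="highs")

print("shipments:", [round(v, 1) for v in result.x])
print("total cost:", round(result.fun, 1))
```

PuLP expresses the same model with named variables and constraints, which tends to read better once the problem grows beyond a handful of routes.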
What will be popular in the future?
The basic models that students learn in school are not enough for practical situations. For example, most schools do not teach students how to build or optimize a recommendation system, how to use AutoML techniques, or how to apply reinforcement learning the way advanced internet companies do. AutoML is a fast-growing field because it can build ML models automatically, which will reduce labor costs and make it harder for data scientists to find jobs. Instead of only working through the formulas behind the current ML models, focusing on trending technologies could benefit our careers. Last but not least, a strong command of databases (SQL and NoSQL), big data tools (Hadoop, Spark, etc.), and cloud computing platforms (GCP, AWS, etc.) will help a lot.
Thank you for reading my article. If you liked it, please give it a thumbs-up.