Getting the most out of Machine Learning: 7 key points to build effective ML systems
Building machine learning models is one thing; making the most of them is another. A lot of iterative work goes into building an effective ML application that ultimately solves the business problem. The gap between the problem and the desired goal is often caused by the lack of a proper framework for approaching machine learning systems.
In this article, I want to talk about 7 key points that can help on the path to building effective ML systems. I should mention that I am still learning this art myself, and I have found that one of the best ways to make it stick is to write about it. If it helps you, let me know, and I will know I did a great job.
Let's get started!
1. Formulating the problem
ML tends to succeed when we are solving a simple and clear problem. Where does that problem come from and how do we define it? Generally, it all starts from the business side (perhaps excluding pet projects) looking to optimize or automate a given part of the business.
Here are the categories that most problems fall into:
- Classification problems in which the goal is to answer yes or no, or make a choice between multiple things. An example can be to identify if there is a car in a picture.
- Regression problems in which the goal is to determine how much or how many of a given quantity or variable. An example is projecting revenue two years ahead. Another classic example of this category is predicting the price of a house given its region, size, number of bedrooms...
- Anomaly detection in which the goal is to spot unusual scenarios, such as someone trying to use a stolen credit card. The fraud can be detected because the card is used in an unusual location or for a questionable amount of money. ML is often very good at spotting these anomalies.
- Clustering where the objective is to group people (such as customers) or things: Say you want to group different types of equipment in a factory that share some given characteristics. Another popular example for this category is market/customer segmentation, where you can try to group customers to give some groups a promotion based on their purchasing history.
- Another category that is still finding its way into practice is Reinforcement Learning, where the goal is to maximize the rewards.
If you can't formulate the problem well enough for it to fall into one of those categories, it's likely that machine learning won't solve it. And that's okay: some problems can be solved without ML.
2. Learn about the dataset that you're working with
This is a very important step, because it is tempting to jump straight into modelling. This is where you 'become one with the data' (CC: Andrej Karpathy).
Go through some values, plot some features, and try to understand the correlations between them. If the work is vision-related, visualize some images and spot what's missing. Are there enough images, and are they diverse enough? What kinds of scenarios could you add? Which images could mislead the model?
Here are other things that you would want to inspect in the data:
- Duplicate values
- Corrupted data or data in unsupported formats (e.g., an image saved as .txt, or with a size of 0 kilobytes (KB))
- Class imbalances
- Biases
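The checks above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the record layout (a list of dictionaries with `label` and `size_bytes` fields) and the function name `inspect_dataset` are assumptions made up for the example, not part of any particular library.

```python
from collections import Counter


def inspect_dataset(records, label_key="label", size_key="size_bytes"):
    """Return simple data-quality counts for a list of dict records.

    The field names are hypothetical; adapt them to your own dataset.
    """
    # Duplicates: identical records (same keys and values) seen more than once.
    seen = Counter(tuple(sorted(r.items())) for r in records)
    duplicates = sum(count - 1 for count in seen.values())

    # Possibly corrupted: files whose reported size is 0 bytes.
    empty_files = sum(1 for r in records if r.get(size_key) == 0)

    # Class balance: how many examples each label has.
    class_counts = Counter(r[label_key] for r in records)
    largest = max(class_counts.values())
    smallest = min(class_counts.values())

    return {
        "duplicates": duplicates,
        "empty_files": empty_files,
        "class_counts": dict(class_counts),
        "imbalance_ratio": largest / smallest,
    }


records = [
    {"file": "a.jpg", "size_bytes": 2048, "label": "car"},
    {"file": "a.jpg", "size_bytes": 2048, "label": "car"},  # exact duplicate
    {"file": "b.jpg", "size_bytes": 0, "label": "car"},     # empty file
    {"file": "c.jpg", "size_bytes": 4096, "label": "no_car"},
]
report = inspect_dataset(records)
print(report)
```

Checks for bias are harder to automate and usually need a closer look at how the data was collected, so they are left out of this sketch.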
3. Be strategic during data manipulation
Real-world datasets are messy. You will have to spend an incredible amount of time cleaning the data. The right manipulation strategy depends on the problem at hand and the size of the available dataset.
Here are questions that can guide you as you clean the data in a systematic way:
- When should you remove or keep missing values?
- Should you remove or keep a given feature?
- Can you aggregate some features?
- Can you produce a new feature from a set of features?
Answering these questions will be helpful during and after the project. For example, if a feature has high predictive power but can introduce biases or privacy issues, it's worth thinking about whether you should drop it.
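To make the questions above concrete, here is a small sketch of two common choices: filling a missing value with the median instead of dropping the row, and deriving a new feature from two existing ones. The housing fields (`size_sqm`, `bedrooms`) and the helper names are hypothetical, and median imputation is just one reasonable option, not the right answer for every dataset.

```python
from statistics import median


def impute_median(rows, key):
    """Return copies of rows with missing (None) values of `key` filled with the median."""
    known = [r[key] for r in rows if r[key] is not None]
    fill = median(known)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]


def add_size_per_bedroom(rows):
    """Derive a new feature from two existing ones (hypothetical example)."""
    return [{**r, "sqm_per_bedroom": r["size_sqm"] / r["bedrooms"]} for r in rows]


houses = [
    {"size_sqm": 80.0, "bedrooms": 2},
    {"size_sqm": None, "bedrooms": 3},   # missing size
    {"size_sqm": 120.0, "bedrooms": 4},
]
cleaned = add_size_per_bedroom(impute_median(houses, "size_sqm"))
for row in cleaned:
    print(row)
```

Both helpers return new rows and leave the input untouched, which makes it easier to compare different cleaning strategies on the same raw data.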
4. Start model development quickly
The standard way of modelling (seen mostly in research and academia) is to spend an enormous amount of time tuning the model to generalize on a fixed dataset.
You have probably heard the popular notion that a good model comes from good data. So get a first model running early, keep it simple, and do all you can to improve the data while aiming to reduce the error at every step you take.
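One way to start quickly is to set up a trivial baseline first, so that every later change to the data or the model can be compared against a known error. The sketch below is one possible way to do that, not a prescribed method: a majority-class predictor written with the standard library, with made-up labels and hypothetical helper names.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: most_common_label


def error_rate(predict, features, labels):
    """Fraction of examples the predictor gets wrong."""
    wrong = sum(1 for x, y in zip(features, labels) if predict(x) != y)
    return wrong / len(labels)


# Made-up data for illustration only.
train_labels = ["spam", "ham", "ham", "ham"]
val_features = [[0.1], [0.9], [0.4], [0.7]]
val_labels = ["ham", "spam", "ham", "spam"]

baseline = majority_baseline(train_labels)
baseline_error = error_rate(baseline, val_features, val_labels)
print(f"baseline validation error: {baseline_error:.2f}")
```

Any real model should beat this number; if it does not, the problem is more likely in the data or the problem formulation than in the model architecture.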
5. Keep things simple
There is no reason to complicate things when there is a simpler alternative. For example, instead of building a neural network from scratch, you can leverage pre-trained models. Often a simple model will be all you need.
Below is a real-life example (source in the references):
In the process of applying machine learning to AirBnB search, the team started with a complex deep learning model. The team was quickly overwhelmed by its complexity and ended up consuming development cycles. After several failed deployment attempts the neural network architecture was drastically simplified: a single hidden layer NN with 32 fully connected ReLU activations. Even such a simple model had value, as it allowed the building of a whole pipeline of deploying ML models in a production setting while providing reasonably good performance. Over time the model evolved, with a second hidden layer being added, but it still remained fairly simple, never reaching the initially intended level of complexity.
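For a sense of how small such a network is, here is a rough sketch of the forward pass of a network with one hidden layer of 32 ReLU units. It is not Airbnb's model: the input size, the random weights, the single output score, and all function names are assumptions for illustration.

```python
import random


def relu(x):
    return max(0.0, x)


def init_layer(n_inputs, n_outputs, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
               for _ in range(n_outputs)]
    biases = [0.0] * n_outputs
    return weights, biases


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(features, hidden, output):
    """Input -> 32 ReLU units -> single score."""
    hidden_activations = [relu(z) for z in dense(features, *hidden)]
    return dense(hidden_activations, *output)[0]


rng = random.Random(0)
n_features = 8                              # assumed input size
hidden = init_layer(n_features, 32, rng)    # hidden layer with 32 units
output = init_layer(32, 1, rng)             # one output score

score = forward([0.5] * n_features, hidden, output)
print(score)
```

With 8 inputs, the whole model has 8 × 32 + 32 + 32 + 1 = 321 parameters, which is small enough to debug by hand while the rest of the pipeline is being built.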
6. Deploy only if you have achieved the desired performance
If you are still getting high errors on the training set, the right thing to do may be to improve the data iteratively rather than rushing the model deployment.
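A simple way to make this rule explicit is a small check that runs before any deployment step. The function below is a sketch under assumptions: the name `deployment_decision`, the target error, and the allowed gap between training and validation error are all hypothetical and would need to come from your own project. The check on the training/validation gap is an addition beyond the training-error rule stated above.

```python
def deployment_decision(train_error, val_error, target_error, max_gap=0.05):
    """Suggest a next step from training/validation error (thresholds are assumptions)."""
    if train_error > target_error:
        # The model does not even fit the training data well enough.
        return "improve data or model: training error too high"
    if val_error - train_error > max_gap:
        # Fits the training data but does not generalize.
        return "improve data or model: large train/validation gap"
    if val_error > target_error:
        return "improve data or model: validation error too high"
    return "candidate for deployment"


print(deployment_decision(train_error=0.20, val_error=0.22, target_error=0.05))
print(deployment_decision(train_error=0.02, val_error=0.15, target_error=0.05))
print(deployment_decision(train_error=0.02, val_error=0.04, target_error=0.05))
```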
The data is likely to change during this process (say you get a new feature). For that reason, it's okay to delay deployment and save the time and resources that would otherwise be burned. You may also find out that your approach to the problem is very unlikely to work.
This also applies to data pipelines. Instead of perfecting pipelines in the early days of a project, keep in mind that the data can change, and when it does you will have to rework everything built on top of it.
This point does not contradict reproducibility. Always keeping the work reproducible will pay off.
7. Matching the results with the business needs
The expectation of an ML project is to produce bottom-line results that satisfy the business needs. It is thus important to avoid the scenario of 'but I did well on the test set' (CC: Andrew Ng). Getting 99% test accuracy is not always an indication that the business problem is solved.
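A small worked example shows how this can happen. The numbers below are made up: 1000 transactions of which 10 are fraudulent, and a "model" that never flags anything. Accuracy looks excellent, yet the business goal of catching fraud is not met at all.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model caught."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    caught = sum(1 for p in predictions_for_positives if p == positive)
    return caught / len(predictions_for_positives)


# 1000 transactions, 10 of them fraudulent (label 1). Numbers are made up.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a "model" that never flags fraud

print(f"accuracy: {accuracy(y_true, y_pred):.2%}")
print(f"fraud recall: {recall(y_true, y_pred):.2%}")
```

The accuracy here is 99% while the recall on fraud is 0%, so the metric you report should be chosen to reflect what the business actually cares about.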
References and Further Learning
- 'Become one with the data': A Recipe for Training Neural Networks, Andrej Karpathy.
- Real-life example, Airbnb: Challenges in Deploying Machine Learning: a Survey of Case Studies, Section 4.1, Page 6.
- Machine Learning Operations Engineering, DeepLearning.AI
I actively share content around ML ideas, tips and best practices on Twitter and LinkedIn. If you would like to connect, you're welcome. Every day, I share one or two things that you will find insightful.