8 Lessons Learned From An AI Project
I recently built a web application that uses artificial intelligence (AI) to provide diagnostic feedback on x-rays for a veterinarian.
The application is specifically meant to work on x-rays of the ventrodorsal (VD) view of the abdomen of dogs, and can assign any of 15 diagnoses in a multi-label fashion.
To start, the application I built is absolutely not meant to replace feedback from veterinary radiologists, or to be solely relied upon to examine x-rays. Specifically, the app is meant to provide useful feedback the veterinarian can consider when working on a case.
This is because the machine learning (ML) model I deployed in the application suffers from performance and explainability issues I’ll talk about below.
Regardless, I still learned lots by engineering this ML project from end to end.
Below, you’ll find a discussion of some of the lessons I learned, which serve both as a reminder to myself and, hopefully, as something of value if you’re interested in AI.
You’ll also find some of the silly mistakes I made during the course of the project :)
1. Data, Data, Data
The most important part of any ML project is probably data.
The dataset you have dictates whether or not the ambitious project you’ve planned with AI is possible.
Ideally, the dataset you have at your disposal will consist of large amounts of high-quality data which, for a classification task, is distributed evenly among all the classes.
Now, reality often falls short of these expectations, and it did for me in this project. More on that later.
Since data is so important to ML projects, a high priority at the beginning of any ML project should be to become one with the dataset.
Before programming the user interface (UI). Before coding out the backend. Before designing a continuous training pipeline.
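To make that concrete, here's a minimal sketch of the kind of first-pass audit I should have run on day one. The file name and column layout are hypothetical, purely for illustration:

```python
from collections import Counter

import pandas as pd

# A first-pass audit of a multi-label dataset. "labels.csv" and its
# semicolon-separated "diagnoses" column are hypothetical placeholders.
df = pd.read_csv("labels.csv")

# count how often each diagnosis appears across all images
counts = Counter(
    label
    for row in df["diagnoses"].dropna()
    for label in row.split(";")
)

print(f"{len(df)} images, {len(counts)} distinct labels")
for label, n in counts.most_common():
    print(f"{label:30s} {n:5d}")
```

Ten minutes with a script like this tells you immediately whether the class balance can support the scope you have in mind.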
On this project, I had a super ambitious scope at the beginning:
“I’m going to design a generic multilabel classifier capable of scanning x-ray images of dogs and cats!!!!”
That was the scope I had in mind, and I didn’t think about it too hard until around the halfway point of the project, by which point I had already finished a bunch of engineering work on the frontend and backend.
Which was a mistake…
When I started properly looking at the dataset I collected for the first time, I thought:
“What is a .dcm extension?? What do VD, LL, and RL mean?? Hmm, the scope is extremely ambitious for the amount of data I have…
OKAY, I think I need to reevaluate what I’m doing.”
Which is exactly what I had to do.
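In case you're wondering too: .dcm is the extension for DICOM, the standard format for medical images, and the metadata inside each file can answer a lot of these questions. Here's a sketch of the kind of inspection that would have helped me earlier, using the pydicom library; which attributes are actually populated varies by scanner, so treat it as illustrative:

```python
import pydicom

# Peek at the metadata inside a DICOM file; "sample.dcm" is a
# hypothetical file name, and attribute availability varies by scanner.
ds = pydicom.dcmread("sample.dcm")

print(ds.get("ViewPosition", "unknown"))      # e.g. VD, LL, RL
print(ds.get("BodyPartExamined", "unknown"))  # e.g. ABDOMEN

pixels = ds.pixel_array  # the image itself, as a numpy array
print(pixels.shape, pixels.dtype)
```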
Now luckily, all the work I did wasn’t wasted: after consultation, I was able to scope the project down appropriately, focusing on just the VD view of the abdomen of dogs.
But I did have to go back and communicate the change and get the new scope approved, which wasn’t ideal, and potentially could’ve been worse.
Even with the narrowing of the scope, the dataset I had at my disposal was still too small and imbalanced, leading to generalization issues in the trained model.
Synthesizing artificial data helped slightly, but the biggest improvement in the model, and thus the application overall, would come from getting more real data.
Specifically, obtaining more high-quality data distributed evenly among the 15 classes recognized by the algorithm.
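For reference, by synthesizing artificial data I mean augmentation along these lines; the exact transforms and parameters below are illustrative rather than the ones I settled on:

```python
from torchvision import transforms

# Illustrative augmentations for x-ray images: mild rotation,
# translation, and contrast jitter. Flips are deliberately left out,
# since they would break left/right anatomy in a VD view.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05)),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
])
```

Augmentation stretches a small dataset, but it can't manufacture the variation that new real images would bring.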
2. Explainability is *really* important, especially for health-based AI applications
Here’s a thought experiment.
A competition is held between a deep learning algorithm and a hundred of the top human experts, to see who can achieve the best performance in examining x-rays of dogs.
In this competition, let's say the algorithm outperforms the human experts on every image.
The only caveat is that the algorithm cannot explain why it made predictions the way that it did, while the human experts can.
The algorithm just receives an image and returns diagnoses for that image.
Most likely, if you were a vet, you would have more confidence in the human experts even though they didn’t perform as well, because they could explain why a specific prediction was made. That explanation would probably matter to you if you were going to make important health decisions on the basis of such feedback.
In other words, for many AI applications, human users need to understand why predictions were made before they will trust and act on them.
Unfortunately, explainability isn’t something I considered heavily in or before this project, nor is it easy to come by with the opaque, very deep neural networks popular these days, and it remains an issue with the final application I deployed.
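For what it's worth, post-hoc techniques exist that would have helped. Grad-CAM is one I could have bolted onto the model to highlight which regions of an x-ray drove a prediction; here's a minimal sketch against a stock ResNet50, not the project's actual code:

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

# capture the activations and gradients of the last conv block
store = {}
layer = model.layer4[-1]
layer.register_forward_hook(lambda m, i, o: store.update(act=o))
layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

x = torch.randn(1, 3, 224, 224)  # placeholder image
logits = model(x)
logits[0, logits[0].argmax()].backward()  # backprop the top prediction

# weight each activation channel by its average gradient, keep positives
weights = store["grad"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))

# upsample to input resolution to get a heatmap over the image
heatmap = F.interpolate(cam, size=x.shape[2:], mode="bilinear")
print(heatmap.shape)  # torch.Size([1, 1, 224, 224])
```

A heatmap like this doesn't fully explain a deep network, but it gives a vet something concrete to sanity-check a prediction against.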
3. Privacy
When orchestrating an ML project, you should account for privacy concerns regarding the dataset you’re handling, and for how those concerns will change how you manage the project.
For this project, for privacy reasons, all the x-ray data had to stay local: I couldn’t keep any data in the cloud, or anywhere other than memory sticks and my laptop.
Which was unfortunate, because it meant I couldn’t take advantage of the cloud :(
4. ML-based applications are 5% ML and 95% engineering
This one is probably a cliché, but it's absolutely true.
Understanding the deep intricacies of ML theory is important, but in the context of ML engineering projects, the majority of your time will go towards engineering.
It definitely did for me during this project.
This involves tasks like:
- Designing a continuous training pipeline to orchestrate the training and deployment of the algorithm you're using
- Orchestrating that pipeline to run on a schedule to continuously train the algorithm as new labelled data becomes available
- Developing methods to monitor the performance of the algorithm in production, which can trigger the continuous training pipeline when performance is detected to degrade
- Implementing continuous integration (CI) / continuous delivery (CD)
That’s on top of the time you’ll need to engineer the frontend and backend, and to design appropriate unit tests, integration tests, acceptance tests, and so on.
5. Continuous training pipelines
Expanding on the point above: I learned it's important to put, at minimum, all your model training, model analysis, and model deployment code into modules, and then design a pipeline that executes those modules in the appropriate sequence.
This pipeline can be orchestrated to run on a schedule whenever there is a sufficient amount of new labelled data available, which allows the algorithm running in production to be constantly updated.
Ideally, you also want CI/CD on the pipeline itself so that updates to it are rolled out automatically, which I didn’t implement in this project.
In addition, you'll want to devise methods to monitor the performance of the algorithm in production. When performance degrades for whatever reason, perhaps concept drift, you can trigger the pipeline to run and train on all the new labelled data you've (hopefully) collected.
For large-scale applications, this pipeline and associated code can get pretty complex.
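At its simplest though, the skeleton looks something like the sketch below. Every function body is a hypothetical placeholder; in a real project, each step would live in its own module:

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    f1_macro: float

def train_model(data_dir: str) -> str:
    # placeholder: load the new labelled data and fit the model
    print(f"training on data in {data_dir}")
    return "candidate-model"

def evaluate_model(model: str) -> Metrics:
    # placeholder: score the candidate on a held-out evaluation set
    print(f"evaluating {model}")
    return Metrics(f1_macro=0.71)

def deploy_model(model: str) -> None:
    # placeholder: swap the candidate into the serving environment
    print(f"deploying {model}")

def run_pipeline(data_dir: str, production_f1: float) -> None:
    """Train, analyze, deploy: promote the candidate only if it wins."""
    model = train_model(data_dir)
    metrics = evaluate_model(model)
    if metrics.f1_macro > production_f1:  # only deploy on improvement
        deploy_model(model)

if __name__ == "__main__":
    # a scheduler, or a monitor that detects degraded performance,
    # would be the thing calling this in production
    run_pipeline("data/new_labelled/", production_f1=0.68)
```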
6. ML Models
I used a ResNet50 pretrained on ImageNet as a fixed feature extractor, stacking logistic regression units on top for multi-label classification.
For unstructured datasets (images, videos, text), starting the project off with a pretrained deep learning algorithm is typically advisable as these algorithms excel on unstructured datasets.
I could’ve used a more complex pretrained algorithm, but decided against it because ResNets are still quite performant and I’m familiar with them, having implemented the ResNet paper in PyTorch.
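Concretely, the setup looks roughly like this; the shapes and training details below are simplified for illustration:

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_LABELS = 15  # the 15 diagnoses the app recognizes

# ResNet50 pretrained on ImageNet, used as a frozen feature extractor
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in backbone.parameters():
    param.requires_grad = False

# replace the ImageNet classifier with one logistic regression unit
# per label; only this layer gets trained
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_LABELS)

# BCEWithLogitsLoss applies a sigmoid per label, giving multi-label output
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)

x = torch.randn(4, 3, 224, 224)                   # placeholder batch
y = torch.randint(0, 2, (4, NUM_LABELS)).float()  # multi-hot labels
loss = criterion(backbone(x), y)
loss.backward()
optimizer.step()
```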
One idea I had, but didn’t have much time to explore, was gathering a large public dataset of x-rays and training my pretrained algorithm on them in a self-supervised fashion, so the model would get better at recognizing generic features of x-rays.
Then, afterwards, feeding my small labelled dataset to the algorithm to fine-tune the parameters.
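Had I explored it, one simple version of that idea is rotation prediction as the pretext task (contrastive methods like SimCLR would be another option). A rough sketch, purely illustrative:

```python
import torch
import torch.nn as nn
from torchvision import models

# Self-supervised pretext task: rotate each unlabelled x-ray by a
# random multiple of 90 degrees and train the network to predict which.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Linear(backbone.fc.in_features, 4)  # 0/90/180/270

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

x = torch.randn(8, 3, 224, 224)  # placeholder unlabelled x-rays
k = torch.randint(0, 4, (8,))    # random rotation class per image
rotated = torch.stack(
    [torch.rot90(img, int(r), dims=(1, 2)) for img, r in zip(x, k)]
)

loss = criterion(backbone(rotated), k)
loss.backward()
optimizer.step()
# afterwards, swap backbone.fc for the 15-label head and fine-tune
# on the small labelled dataset
```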
7. Ensembles can be useful
Generic advice when designing ML systems is to avoid ensemble models in online systems.
One of the reasons why is because the main performance metric for an online system is latency.
If you deploy an ensemble, it will in most cases take longer to process inputs than a single model would, meaning your system will be slower as a result.
In other words, when deploying models online, the small increase in predictive performance you get from an ensemble often isn’t worth the large increase in latency your system experiences.
There is a caveat I learned from this project though: when you’re deploying an algorithm in an online system for a business whose users are happy to wait longer for better predictions, ensembling can be a useful tool for gaining extra predictive performance.
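The mechanics themselves are simple; here's a minimal sketch of probability averaging across ensemble members, with tiny placeholder models standing in for real trained classifiers:

```python
import torch
import torch.nn as nn

def ensemble_predict(members, x):
    """Average per-label probabilities across the ensemble members."""
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(m(x)) for m in members])
        return probs.mean(dim=0)

# tiny placeholder "models"; real members would be trained classifiers
members = [nn.Linear(2048, 15) for _ in range(3)]
features = torch.randn(4, 2048)  # placeholder feature vectors

print(ensemble_predict(members, features).shape)  # torch.Size([4, 15])
```

The latency cost is plain from the loop: every member runs a full forward pass per input.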
In the end, I decided not to use an ensemble in this project, and though the performance gain would likely have been small, it might still have been worthwhile to experiment with one.
The biggest improvement for the project overall would still come from getting a larger and more varied dataset though, regardless of any model iterations.
8. UIs are important
Lastly, this point is a reminder to myself that spending the time to design and develop a good UI is important.
No matter how impressive and advanced your server-side engineering and AI algorithms are, your application will be subpar if your users don’t enjoy interacting with the interface you’ve created for them.
As someone who typically enjoys working on AI algorithms and the backend more than the frontend, I need this reminder :)
So those are the main lessons I learned, or had reinforced, through completing this project. Hopefully you found something of use above.
If you have any questions or feedback or would just like to connect, I'd love to hear from you! Feel free to send me a LinkedIn message or drop me an email at jesse@jessekhaira.com :)