My name is Pablo Damasceno. I’m a Senior Data Scientist at Janssen Pharmaceuticals, which is part of Johnson & Johnson.
I worked at UCSF for the past six years developing, testing, and deploying machine learning algorithms in the clinic, working very closely with doctors to make sure ML and AI algorithms were actually being used in practice.
Recently, I joined Janssen Pharmaceuticals in the computer vision division of data science.
In this article, I’ll talk about two examples of clinical use of ML and AI: predicting COVID-19 outcomes with federated learning, and segmenting and monitoring brain tumors.
Let’s dive in 👇
COVID-19 Project
The problem
One of the major issues hospitals had because of COVID was the lack of ventilators. The question on everyone’s mind was: can we predict whether a patient who comes in today is going to need a ventilator? If you can predict that, you can try to manage the resources.
Even better, if you can predict that a patient is likely to die in the next 24 to 48 hours, you can manage resources and make sure that patient receives the care they need.
The problem was that nobody had enough data. Each hospital had a few hundred cases; even if you had thousands of cases, perhaps only a few hundred of those had chest X-rays you could use to train your model.
The solution (the federated learning model)
NVIDIA got in contact with us about contributing to a platform for federated learning. They shipped us a Docker container to run on a local machine and train the model on our data. Only the weights of this model were sent to an AWS server; the data itself stayed at each hospital.
The central server then averaged those weights, possibly a weighted average based on how much data each site contributed, and NVIDIA sent the updated weights back to us. In the end, you have a model that was effectively trained across a bunch of different institutions.
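The talk doesn’t spell out the exact aggregation used by NVIDIA’s platform, but the core step it describes is a FedAvg-style weighted average of each site’s weights. Here is a minimal sketch of that step, assuming each hospital ships its model’s state dict and its sample count (the function and variable names are mine, not from the actual platform):

```python
import torch

def federated_average(site_updates):
    """Weighted average of model weights from multiple sites (FedAvg-style).

    site_updates: list of (state_dict, num_samples) tuples, one per hospital.
    Returns a single averaged state_dict; the raw patient data never moves.
    """
    total = sum(n for _, n in site_updates)
    averaged = {}
    for key in site_updates[0][0]:
        averaged[key] = sum(
            state[key].float() * (n / total) for state, n in site_updates
        )
    return averaged

# One round with three hospitals holding different amounts of local data:
# global_model.load_state_dict(federated_average([
#     (hospital_a_state, 450), (hospital_b_state, 120), (hospital_c_state, 800),
# ]))
```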
We were part of a network of 20 institutions that used the model, which was called ‘Deep and Cross’. What we were predicting, in the end, was: what’s the probability that this person needs a ventilator, and what’s the probability that this person is going to die in the next 24 or 48 hours?
We had chest X-rays for the patients, and we also had a lot of electronic health record data, and with our model we were able to combine the two.
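The talk doesn’t detail the fusion architecture, so treat the following as a toy sketch of the general idea of combining an image encoder with tabular EHR features. The backbone choice, layer sizes, and output heads are my illustrative assumptions, not the published model:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ChestXrayEHRModel(nn.Module):
    """Toy fusion model: CNN features from the chest X-ray + a vector of EHR values."""

    def __init__(self, num_ehr_features=20, num_outputs=2):
        super().__init__()
        # Image branch: a standard ResNet backbone with the classifier removed.
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()
        self.image_encoder = backbone          # outputs 512-d features
        # EHR branch: small MLP over labs, vitals, demographics, etc.
        self.ehr_encoder = nn.Sequential(
            nn.Linear(num_ehr_features, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU()
        )
        # Joint head: e.g. P(needs ventilator), P(death within 24-48 hours).
        self.head = nn.Linear(512 + 64, num_outputs)

    def forward(self, xray, ehr):
        fused = torch.cat([self.image_encoder(xray), self.ehr_encoder(ehr)], dim=1)
        return self.head(fused)

# Shape check with random data:
# logits = ChestXrayEHRModel()(torch.randn(4, 3, 224, 224), torch.randn(4, 20))
```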
The results
The federated learning model consistently worked better than the models trained at individual institutions: it had higher accuracy and it was also more robust. We went to three other hospitals that were not part of the federated learning and applied the model, and it worked better there too.
It turns out that sending only 30% of the weights between the institutions and the central site gave basically 90% of the performance. If you’re concerned about your data, or in this case the weights, being moved around, you can send just 30% of those weights and keep a high accuracy for your model.
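The talk doesn’t say how that 30% was chosen; one common way to share only a fraction of an update is to keep the largest-magnitude weight changes and leave the rest local. A hedged sketch of that idea, not necessarily what the NVIDIA platform did:

```python
import torch

def sparsify_update(local_state, global_state, fraction=0.3):
    """Keep only the largest `fraction` of weight changes before sending them out."""
    sparse = {}
    for key, local in local_state.items():
        delta = local.float() - global_state[key].float()
        k = max(1, int(fraction * delta.numel()))
        # Threshold at the k-th largest absolute change; smaller changes stay local.
        threshold = delta.abs().flatten().topk(k).values.min()
        sparse[key] = torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta))
    return sparse
```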
It took us eight months from beginning to end. In the academic world, that’s unheard of: getting 20 institutions from all over the world to come together, solve a problem, and write a paper in only eight months. The work itself took eight months, but it then took NVIDIA a year to get the paper reviewed and published, which was the longest and hardest part of the work.
Tumor segmentation in neuroscience
The problem
Not being a clinician, I was amazed to learn that a lot of people can have a brain tumor the size of a golf ball and live their daily lives completely normally. If the tumor is growing very slowly, the brain can adapt: the neural connections route around the tumor and, because the growth is slow, they can plastically rearrange.
The doctors decide not to remove the tumor unless it starts to grow exponentially, at the stage where it begins to break a lot of connections. They won’t remove it right away, because there’s always a chance that something that shouldn’t be cut gets cut during surgery.
So they monitor the patient: every three months they scan them and check whether the tumor is growing exponentially or staying about the same size.
You would imagine they measure this with some kind of analytical calculation of the volume, but it’s actually very qualitative. The radiology report that gets passed to the neurosurgeon contains very qualitative sentences like “this is suggestive of progression” or “there’s some slow growth”, because they’re being vague on purpose.
They wanted a model that could tell if there was progression or not. It needed to be accurate, interpretable, and fast (to help the workflow).
The solution
The point here was to take things that are already known to work, like a U-Net for segmentation, and figure out where the bottleneck is if I want to apply this in the clinic. Then we take the images, run them through the U-Net, get segmentations, and give the clinicians a plot of what’s happening to the tumor volume over time.
I used a set of MRI images resampled to the same resolution so the same neural network could be used on all of them. We had around 600 images, and I asked a radiologist to help me segment the tumor in those 600 images. They did not want to do that, so I had to come up with a different approach.
That approach was active learning. We compromised: they segmented 20 images, I trained a neural net that then segmented another 100, and correcting those 100 pre-segmentations was a lot faster for them than segmenting from scratch. Then they did another 100, and we slowly converged to around 600 manually segmented images.
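The loop itself is simple to write down. This is a schematic version in which `train`, `predict`, and `radiologist_corrects` are placeholders passed in by the caller, standing in for the real U-Net training and the manual review step:

```python
def active_learning_loop(train, predict, radiologist_corrects,
                         unlabeled_scans, seed_labels, rounds=5, batch_size=100):
    """Start from a small seed set, let the model pre-segment the next batch,
    and have the radiologist correct those pre-segmentations instead of
    drawing every contour from scratch."""
    labeled = dict(seed_labels)                  # e.g. the first 20 manual segmentations
    model = train(labeled)                       # fit the segmentation model on the seeds
    for _ in range(rounds):
        remaining = [s for s in unlabeled_scans if s not in labeled]
        if not remaining:
            break
        batch = remaining[:batch_size]           # could also rank scans by model uncertainty
        proposals = {scan: predict(model, scan) for scan in batch}
        corrections = radiologist_corrects(proposals)   # much faster than from scratch
        labeled.update(corrections)
        model = train(labeled)                   # retrain with the enlarged labeled set
    return model, labeled
```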
As I mentioned, I used a 3D U-Net and did some data augmentation. The deployment part was also a collaboration with NVIDIA, using their Clara tool, which provides orchestration for getting data from the scanner, kicking off a pipeline, and pushing the results to a database.
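The talk names a 3D U-Net with augmentation but not the exact framework. As one concrete possibility, here is a minimal sketch using the open-source MONAI library; the channel sizes and transform choices are my assumptions, not the deployed configuration:

```python
import torch
from monai.networks.nets import UNet
from monai.transforms import Compose, RandFlipd, RandAffined, RandGaussianNoised

# Simple spatial/intensity augmentation for dictionaries of {"image", "label"}.
augment = Compose([
    RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=0),
    RandAffined(keys=["image", "label"], prob=0.5, rotate_range=(0.1, 0.1, 0.1)),
    RandGaussianNoised(keys=["image"], prob=0.2, std=0.01),
])

# 3D U-Net: one input channel (MRI), two output channels (background / tumor).
model = UNet(
    spatial_dims=3,
    in_channels=1,
    out_channels=2,
    channels=(16, 32, 64, 128, 256),
    strides=(2, 2, 2, 2),
    num_res_units=2,
)

# Shape check on a dummy 3D volume.
print(model(torch.randn(1, 1, 96, 96, 96)).shape)  # (1, 2, 96, 96, 96)
```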
The results
With the Dice score, which basically measures how much my segmentation agrees with the manual segmentation, we got around 87% agreement.
When I take a set of patients that progressed and did not progress, I get a sensitivity and specificity of about 0.7.
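For reference, here is how those two kinds of metrics are typically computed, in a small NumPy sketch that matches the definitions used above:

```python
import numpy as np

def dice_score(pred_mask, true_mask):
    """Overlap between predicted and manual segmentations: 2|A∩B| / (|A| + |B|)."""
    pred, true = pred_mask.astype(bool), true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    return 2.0 * intersection / (pred.sum() + true.sum())

def sensitivity_specificity(pred_labels, true_labels):
    """Per-patient progression calls: sensitivity = TP/(TP+FN), specificity = TN/(TN+FP)."""
    pred, true = np.asarray(pred_labels, bool), np.asarray(true_labels, bool)
    tp = np.sum(pred & true)
    tn = np.sum(~pred & ~true)
    fn = np.sum(~pred & true)
    fp = np.sum(pred & ~true)
    return tp / (tp + fn), tn / (tn + fp)
```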
How this works in practice
A radiologist in the office has a patient who just came in and had a scan. They want to see the tumor volume for this patient, so they send the image to a machine in a private cloud within UCSF. This kicks off a pipeline that takes the image and first checks that it really is a brain MRI, so our model isn’t run on some different kind of data.
The whole process takes about five minutes: getting the image, running the segmentation, and sending the result back to a database the doctor can access through the browser. From the browser, they can open the image again, confirm it’s the same patient, and then load the segmentation.
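One piece of that pipeline is the sanity check that the incoming study really is a brain MRI before the model runs. A hedged sketch of such a check using standard DICOM header fields via pydicom; the actual checks in the UCSF pipeline may differ:

```python
import pydicom

def looks_like_brain_mri(dicom_path):
    """Reject studies the tumor model was never trained on (e.g. CT, chest MRI)."""
    ds = pydicom.dcmread(dicom_path, stop_before_pixels=True)
    modality = getattr(ds, "Modality", "")
    body_part = str(getattr(ds, "BodyPartExamined", "")).upper()
    description = str(getattr(ds, "StudyDescription", "")).upper()
    return modality == "MR" and (
        "BRAIN" in body_part or "HEAD" in body_part or "BRAIN" in description
    )
```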
Before, they were doing this by eye. Here, when they make edits, they’re editing structures on the order of one to two millimeters, whereas by eye the errors were on the order of centimeters. So we gave them the ability to edit those segmentations and give us feedback, and of course we can use that data to retrain the model.
Essentially, in the end, we presented them with segmentations that they could change if they wanted to. The system was accurate, fast, interpretable, and editable.
Limitations
Sometimes there were edge cases, and sometimes there were mistakes. One issue was that the scanners kept changing. A patient from Reno would go to their local hospital every three months, and in the alternating three months they would come to UCSF. At their local hospital the scanner quality was lower, and neither the model nor even a radiologist looking at that image can get the same volume you would get from a high-quality scanner, because the pixels simply aren’t there and you cannot see where the boundaries are.
So as a stopgap solution, we introduced a human in the loop who looks at the images before they go to the radiologist to make their decision. If your model is going to fail, make sure it doesn’t fail in a way they can do nothing about; make sure it fails in a way where they can still understand what’s going on and fold that into their normal workflow.
With the human in the loop, a technician can manually curate the data and select which scans to use. Finally, the result is presented as a volume plot that the radiologist can use and send to the neurosurgeon, who decides whether to operate or not.
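For concreteness, the volume plot handed to the radiologist is essentially a longitudinal curve of segmented tumor volume per visit. A minimal sketch with made-up, purely illustrative numbers:

```python
import matplotlib.pyplot as plt

# Hypothetical longitudinal volumes (mL) derived from the automated segmentations.
visit_dates = ["2020-01", "2020-04", "2020-07", "2020-10", "2021-01"]
tumor_volume_ml = [11.2, 11.5, 11.9, 13.4, 16.8]

plt.plot(visit_dates, tumor_volume_ml, marker="o")
plt.xlabel("Visit")
plt.ylabel("Tumor volume (mL)")
plt.title("Tumor volume over time (from automated segmentation)")
plt.show()
```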
Can we go one step further?
We have all this information about the tumor. Can the models do something that perhaps the doctor could not, like predict what’s going to happen at the next scan? If so, I could tell the doctor to make sure the patient doesn’t miss their visit, and perhaps some intervention should already be planned because we expect the patient to have progressed by the next visit.
This is a slightly tricky problem because patients’ visits are not always equally spaced. To deal with this, you create a model that predicts whether they’re going to progress by the next visit. The simplest approach is to take the delta between today’s scan and the previous one and use it as a feature. However, delta images are complicated.
A better approach is to first compute a set of features. Once you have a segmentation, you can compute radiomics, basically hundreds of features per scan. With hundreds of features, you can take the delta between visits and do standard machine learning: a lasso to keep only the important features, or a support vector classifier, for example.
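A sketch of that feature-based approach, using the pyradiomics extractor and an L1-penalized (lasso-style) classifier from scikit-learn. The feature filtering and the classifier choice here are illustrative assumptions, not the exact pipeline from the talk:

```python
import numpy as np
from radiomics import featureextractor
from sklearn.linear_model import LogisticRegression

extractor = featureextractor.RadiomicsFeatureExtractor()

def radiomic_vector(image_path, mask_path):
    """Hundreds of shape/intensity/texture features from one scan + its segmentation."""
    features = extractor.execute(image_path, mask_path)
    # Keep the numeric feature values, dropping the diagnostics entries.
    return np.array(
        [v for k, v in features.items() if k.startswith("original_")], dtype=float
    )

def delta_features(current_visit, previous_visit):
    """Change in each radiomic feature between two visits, used as the model input."""
    return radiomic_vector(*current_visit) - radiomic_vector(*previous_visit)

# X: one delta-feature row per patient, y: progressed by the next visit or not.
# clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
```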
Out of eight patients that radiology marked as progressed, the model predicted that seven would progress and got one wrong. Out of seven patients that did not progress, the model got three wrong. It turns out the model got those three “wrong” because they did progress, just not within three months but within six.
Summary
Sometimes the clinical challenges we have with respect to machine learning are new problems that require new solutions, like the COVID case where no single hospital has enough data but a combination of hospitals does. Then you need a new solution like federated learning, where you bring the model to the data and share only the weights, which works really well for this particular problem.
Other times, the problem is more about orchestrating existing solutions, as in the case of the segmentation we discussed.
Keeping the human in the loop and using interpretable models is, I hope, an approach that will still be around for a while.
Let’s also, as a community, think about going beyond just automating the boring work. I started with the problem of automating the hard work of measuring tumor volume, which is expensive, tricky, and hard, but can we take it one step further and predict what’s going to happen at the next visit?
Questions
Is this model in production? Are you helping more radiologists or brain surgeons?
The first problem we came across while working on COVID had to do with production. We worked on it for about six months and got a model, but when we tried to productize it to help people coming into the hospital, we realized we could not put it into production. The problem was that we needed the images (the chest X-rays) and the electronic health records in real time, plus a way to send the results back to the doctor.
There was no communication between the hospital’s different departments. We presented this blind spot to the university, and they’re now trying to fix it. Out of the 20 institutions, only Harvard was able to somewhat put the model into production.
For oncology, we’re primarily helping the radiologists, who have a huge workload. When they brought the volume metrics to the neurosurgeons, they were very happy, because now they can see exactly where they should be removing the mass.
So in the end, it ended up helping the neurosurgeons too. It’s in production at UCSF: they’re using it, but everything always goes first to a technician who looks at the segmentation and edits it if needed, then to the radiologist, who makes a decision and informs the neurosurgeon.
Do you do data augmentation, like view transformations and data manipulation, to create less biased training data for medical images?
There are a lot of interesting things going on in this field, particularly in self-supervised learning and adversarial networks. Adding an adversarial branch to the network helps a lot with biases: as long as you have an idea of what the bias might be, you can try to introduce that into the network.
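The idea of an adversarial branch for a known bias (for example, which scanner or site an image came from) can be sketched with a gradient-reversal layer. This is a generic illustration of the technique, not the specific setup used at UCSF:

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward pass,
    so the encoder learns features the bias head cannot exploit."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class DebiasedClassifier(nn.Module):
    def __init__(self, encoder, feat_dim, num_classes, num_scanners):
        super().__init__()
        self.encoder = encoder                                   # any feature extractor
        self.task_head = nn.Linear(feat_dim, num_classes)        # e.g. progression vs. not
        self.bias_head = nn.Linear(feat_dim, num_scanners)       # e.g. scanner / site ID

    def forward(self, x):
        features = self.encoder(x)
        task_logits = self.task_head(features)
        bias_logits = self.bias_head(GradientReversal.apply(features))
        # Train both heads jointly; the reversal pushes the encoder to drop bias information.
        return task_logits, bias_logits
```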
Unfortunately, we usually learn about a bias after the fact, when we run something that didn’t work. Not every scanner is like the UCSF multimillion-dollar scanner, and that’s where the human in the loop is important, to make sure those mistakes don’t make it into production.