More and more medical devices are using artificial intelligence to diagnose patients more precisely and to treat them more effectively. Although many such devices have already been approved (e.g. by the FDA), many regulatory questions remain unanswered.
This article describes what manufacturers whose devices are based on artificial intelligence techniques should pay attention to.
The terms artificial intelligence (AI), machine learning and deep learning are often used imprecisely or even synonymously.
The term “artificial intelligence” (AI) itself leads to discussions about, for example, whether machines are actually intelligent.
We will use the definition below:
“A machine’s ability to make decisions and perform tasks that simulate human intelligence and behavior.”
So it is about a machine's ability to take on tasks or make decisions in a way that simulates human intelligence and behavior.
A lot of artificial intelligence techniques use machine learning, which is defined as follows:
“A facet of AI that focuses on algorithms, allowing machines to learn and change without being programmed when exposed to new data.”
And deep learning is, in turn, part of machine learning and is based on neural networks (see Fig. 1).
“The ability for machines to autonomously mimic human thought patterns through artificial neural networks composed of cascading layers of information.”
Source: HCIT Experts, among others
This gives us the following taxonomy:
Fig. 1: Artificial intelligence is based on numerous techniques, of which machine learning is only one. Neural networks, and with them deep learning, are part of machine learning.
The assumption that artificial intelligence in medicine mainly uses neural networks is not correct. A study by Jiang et al. showed that support vector machines are used most frequently (see Fig. 2). Some medical devices use several methods at the same time.
Manufacturers use artificial intelligence, especially machine learning, for tasks such as the following:
Task | Data used
Detecting a retinopathy | Images of the eye fundus
Counting and recognizing certain cell types | Images of histological sections
Diagnosis of heart infarctions, Alzheimer's, cancer, etc. | Radiology images, e.g. CT, MRI
(not specified) | Speech, movement patterns
Selection and dosage of medicines | Diagnoses, gene data, etc.
Diagnosis of heart diseases, degenerative brain diseases, etc. | ECG or EEG signals
(not specified) | Laboratory values, environmental factors, etc.
Time-of-death prognosis for intensive care patients | Vital signs, laboratory values and other data in the patient's records
Table 1: Comparison of the tasks that can be performed with artificial intelligence and the data used for these tasks
Other applications include:
Fig. 3: Segmentation of organs (here a kidney) with the help of artificial intelligence (Source)
The FDA has published an extensive list of AI-based medical devices that will be very helpful for manufacturers wanting to:
It is interesting to note that the number of newly authorized AI-based devices is not increasing any further.
The techniques are used for the purpose of classification or regression.
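The difference between the two purposes can be illustrated with a minimal Python sketch: a classifier returns one of a fixed set of categories, whereas a regressor returns a continuous value. The data, the midpoint rule and all function names are invented for illustration and do not come from any actual device.

```python
def fit_threshold_classifier(samples):
    """Learn a cut-off on one feature: the midpoint between the class means.

    Assumes (for this toy example) that class 1 has the higher feature values.
    """
    positives = [x for x, label in samples if label == 1]
    negatives = [x for x, label in samples if label == 0]
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    return (mean_pos + mean_neg) / 2


def classify(x, cutoff):
    """Classification: the output is a category (here 0 or 1)."""
    return 1 if x > cutoff else 0


def fit_line(points):
    """Regression: ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data: (feature, label) pairs for classification ...
labelled = [(0.2, 0), (0.3, 0), (0.4, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
cutoff = fit_threshold_classifier(labelled)

# ... and (input, measured value) pairs for regression.
measurements = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
slope, intercept = fit_line(measurements)

predicted_class = classify(0.75, cutoff)    # a category
predicted_value = slope * 4.0 + intercept   # a continuous value
```

In both cases the model is fitted to example data; only the type of the output differs.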
Examples of classification
Examples of regression
There are currently no laws or harmonized standards that specifically regulate the use of artificial intelligence in medical devices. However, these devices must meet existing regulatory requirements, such as:
Unlike the European legislators, the FDA has published its view on artificial intelligence on its website.
In addition, the FDA published a “Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD)” in April 2019.
This document talks about the challenge of continuously learning systems. However, it observes that previously approved medical devices based on AI procedures worked with “locked algorithms”.
The FDA tries to explain, for the two types of algorithm modification, when:
The new “framework” is based on well-known approaches:
The FDA recognizes that, according to its own regulations, a self-learning or continuously-learning algorithm that is in use would need to be inspected and approved again. But that seems too strict even for the FDA. Therefore, it looks at the objectives of a modification to the algorithm and distinguishes between:
The FDA wants to use these objectives to decide on the need for new submissions.
The FDA considers there to be four pillars that manufacturers can use as a basis for ensuring the safety and benefit of their devices, including for modifications:
Fig. 4a: Algorithm Change Protocol (ACP) from the FDA's proposed regulatory framework for software that uses machine learning
Fig. 4b: Decision tree the FDA uses to decide whether modifications to software based on machine learning make a re-approval necessary
The FDA gives examples of when a manufacturer may change a software algorithm without asking it for approval. The first of these examples is a software program used in an intensive care unit that uses monitoring data (e.g., blood pressure, ECG, pulse-oximetry) to detect patterns that occur at the onset of physiologic instability in patients.
The manufacturer plans to change the algorithm, for example to reduce false alarms. If this change is already described in the pre-specifications (SCS) and the FDA has approved them along with the Algorithm Change Protocol (ACP), the manufacturer can make these changes without a new “approval”.
If, however, the manufacturer notices that it can also claim that the algorithm now generates a warning 15 minutes before the onset of physiologic instability (it now also specifies a period of time), this would be an extension of the intended use. This modification would require FDA approval.
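The reasoning in this example can be written down as a small decision function. The following Python sketch is a strongly simplified reading of the example above, not the FDA's actual decision tree; the class `PlannedChange`, its fields and the change-type labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedChange:
    description: str
    change_type: str            # hypothetical label, e.g. "performance"
    changes_intended_use: bool  # True if the claim or intended use is extended


def needs_new_submission(change, prespecified_types, acp_followed):
    """Return True if, in this simplified reading, a new submission is needed.

    prespecified_types: change types that were described in advance and
                        accepted together with the ACP (assumption).
    acp_followed:       whether the change was made according to the ACP.
    """
    if change.changes_intended_use:
        return True   # extending the intended use always goes back to the FDA
    if change.change_type not in prespecified_types:
        return True   # the change was never pre-specified
    return not acp_followed


approved_types = {"performance"}

fewer_false_alarms = PlannedChange(
    "Retrain to reduce false alarms", "performance", changes_intended_use=False
)
early_warning_claim = PlannedChange(
    "Claim a warning 15 minutes before onset", "performance", changes_intended_use=True
)

for change in (fewer_false_alarms, early_warning_claim):
    verdict = needs_new_submission(change, approved_types, acp_followed=True)
    print(f"{change.description}: new submission needed = {verdict}")
```

The point of the sketch is the order of the checks: a change to the intended use overrides everything else, no matter how well the change itself was planned.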
The FDA discusses how to deal with continuously learning systems. However, it has still not answered the question of what the best practices are for evaluating and approving a “frozen” algorithm based on AI processes.
Guidelines, “Good Machine Learning Practices” as the FDA calls them, are still lacking. Therefore, the Johner Institute is developing such a guideline together with a notified body.
The FDA’s idea of not requiring a new submission based on pre-approved procedures for algorithm modifications has its charms. We would like to see such specificity from the European legislators and authorities.
Manufacturers regularly find it difficult to prove that the requirements placed on the device, e.g. with regard to accuracy, correctness and robustness, have been met.
Dr. Rich Caruana, one of Microsoft's leading minds in artificial intelligence, advised against the use of a neural network he had developed himself to propose an appropriate therapy for pneumonia patients:
“I said no. I said we don’t understand what it does inside. I said I was afraid.” (Dr. Rich Caruana, Microsoft)
The questions that auditors should ask manufacturers include, for example:
How did you reach the assumption that your training data has no bias?
Otherwise the results would be wrong or only correct under certain conditions.
How did you avoid overfitting your model?
Otherwise, the algorithm would only correctly predict the data it was trained with.
What makes you assume that the results are not just correct by chance?
For example, an algorithm may correctly decide that an image contains a house, yet it may not have recognized the house at all, but the sky. Another example (a Chihuahua and a muffin) is shown in Fig. 4.
What requirements must the data meet for your system to classify it correctly or to predict the results correctly? Which boundary conditions must be observed?
Since the model was trained with a certain quantity of data, it can only make correct predictions for data coming from the same population.
Would you not have achieved a better result with another model or with other hyperparameters?
Manufacturers must minimize risks as far as possible. These also include risks resulting from incorrect predictions made by sub-optimal models.
Why do you assume that you have used enough training data?
Collecting, processing and “labeling” training data is time-consuming. The more data that is used to train a model, the more powerful it can be.
What gold standard did you use when labeling the training data? Why do you consider the chosen standard to be the gold standard?
Particularly if the machine starts to be superior to people, it becomes difficult to determine whether a physician, a group of “normal” physicians, or the world's best experts in a discipline are the reference.
How can you ensure reproducibility if your system continues to learn?
Continuous Learning Systems (CLS), in particular, must ensure that the further training does not, at the very least, reduce performance.
Have you validated systems that you are using to collect, prepare, and analyze data, and to train and validate your models?
An essential part of the work consists of collecting and processing the training data and using it to train the model. The software needed for this is not part of the medical device. However, it is subject to the requirements for computerized systems validation (CSV).
Table 2: Aspects that should be addressed in the review of medical devices, with associated explanations
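The overfitting question in Table 2 can be made tangible with a small experiment. The following Python sketch compares a model that memorizes its training data (1-nearest-neighbour) with a simple threshold rule. The data are invented, two training labels are deliberately "noisy", and the cut-off of 5.0 is an assumed true rule, so this is only an illustration of the symptom, not a validation method.

```python
def nearest_neighbour_predict(train, x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def threshold_predict(x, cutoff=5.0):
    """Simple rule: class 1 above the (assumed) cut-off, class 0 otherwise."""
    return 1 if x > cutoff else 0


def accuracy(predict, data):
    """Share of samples for which the prediction matches the label."""
    hits = sum(1 for x, label in data if predict(x) == label)
    return hits / len(data)


# Hypothetical training data; (3, 1) and (7, 0) are mislabelled on purpose.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
# Held-out data that the models never saw during training.
held_out = [(1.5, 0), (2.9, 0), (3.2, 0), (5.5, 1), (6.8, 1), (7.2, 1)]


def memoriser(x):
    return nearest_neighbour_predict(train, x)


for name, model in (("1-NN", memoriser), ("threshold", threshold_predict)):
    print(name, "train:", accuracy(model, train), "held-out:", accuracy(model, held_out))
```

The memorizing model is perfect on its training data but clearly worse on the held-out data, while the simpler rule shows the opposite pattern. A large gap between training and held-out performance is the typical warning sign an auditor would ask about.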
The questions are typically also discussed as part of the ISO 14971 risk management process and the clinical evaluation according to MEDDEV 2.7.1 Revision 4.
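Table 2 also asks which requirements input data must meet, because a model is only reliable for data from the population it was trained on. One very crude safeguard is to record the range of each feature in the training data and to flag inputs outside that range. The following Python sketch shows the idea; the feature names and values are hypothetical, and a range check is only a rough proxy for "same population".

```python
def fit_feature_ranges(training_rows):
    """Record the minimum and maximum seen for every feature during training."""
    names = training_rows[0].keys()
    return {
        name: (
            min(row[name] for row in training_rows),
            max(row[name] for row in training_rows),
        )
        for name in names
    }


def out_of_range_features(row, ranges):
    """Return the (sorted) names of features that lie outside the training range."""
    return sorted(
        name for name, (low, high) in ranges.items() if not low <= row[name] <= high
    )


# Hypothetical training data with two features.
training_rows = [
    {"age": 34, "heart_rate": 62},
    {"age": 58, "heart_rate": 88},
    {"age": 71, "heart_rate": 75},
]
ranges = fit_feature_ranges(training_rows)

typical_patient = {"age": 45, "heart_rate": 70}
unusual_patient = {"age": 8, "heart_rate": 140}

print(out_of_range_features(typical_patient, ranges))
print(out_of_range_features(unusual_patient, ranges))
```

A device could refuse to predict, or at least warn the user, when such a check fires. Which reaction is appropriate is a risk management decision.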
Fig. 4: Input data that only happens to look like a certain pattern. In this example, a Chihuahua and a muffin (source)
Auditors should no longer be generally satisfied with the statement that machine learning techniques are black boxes. The current research literature shows how manufacturers can explain and make transparent the functionality and "inner workings" of devices for users, authorities and notified bodies alike.
For example, using Layer-wise Relevance Propagation (LRP) it is possible to recognize which input data (“features”) were decisive for the algorithm, e.g. for classification.
Fig. 5 shows, in the left picture, that the algorithm can rule out a “6” primarily because of the pixels marked dark blue. This makes sense, because with a “6” this area typically does not contain any pixels. On the other hand, the right image shows in red the pixels that reinforce the algorithm's assumption that the digit is a “1”.
The algorithm evaluates the pixels in the rising part of the digit as counting against classification as a “1”. This is because it was trained with images where the “1” is written as a simple vertical line, as is the case in the USA. This shows how important it is for the result that the training data is representative of the data that is to be classified later.
Fig. 5: Layer-wise Relevance Propagation determines which input is responsible for which share of the result. The data are visualized here as a heat map (source).
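The principle behind such a heat map can be shown on a toy network. The following Python sketch applies the basic LRP rule (with a small stabilizer, often called LRP-epsilon) to a hypothetical two-layer ReLU network without bias terms. The weights and inputs are invented, and real implementations work on image-sized networks and use further rules, so this is only a minimal sketch of the idea that the output is redistributed backwards onto the inputs.

```python
def relu(value):
    return value if value > 0.0 else 0.0


def forward(x, w_hidden, w_out):
    """Forward pass: pre-activations, hidden activations and scalar output."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in w_hidden]
    hidden = [relu(z) for z in pre]
    output = sum(w * h for w, h in zip(w_out, hidden))
    return pre, hidden, output


def stabilise(z, eps=1e-9):
    """Push z slightly away from zero to avoid division by zero."""
    return z + eps if z >= 0 else z - eps


def lrp(x, w_hidden, w_out):
    """Redistribute the output backwards, layer by layer, onto the inputs."""
    pre, hidden, output = forward(x, w_hidden, w_out)

    # Output layer -> hidden layer: share proportional to each contribution.
    z_out = stabilise(output)
    r_hidden = [h * w / z_out * output for h, w in zip(hidden, w_out)]

    # Hidden layer -> input layer.
    r_input = [0.0] * len(x)
    for j, row in enumerate(w_hidden):
        if hidden[j] == 0.0:
            continue  # inactive units carry no relevance
        z_j = stabilise(pre[j])
        for i, (w, xi) in enumerate(zip(row, x)):
            r_input[i] += xi * w / z_j * r_hidden[j]
    return output, r_input


# Hypothetical inputs ("pixels") and weights.
x = [1.0, 2.0, 0.5]
w_hidden = [[0.5, -0.25, 1.0], [-1.0, 1.0, 0.0]]
w_out = [2.0, 1.0]

output, relevance = lrp(x, w_hidden, w_out)
print("output:", output)
print("relevance per input:", relevance)
```

Positive relevance values correspond to the inputs that support the result (red in the heat map), negative values to those that speak against it (blue). Without bias terms the relevances add up to the network output, apart from the tiny stabilizer.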
The free online book “Interpretable Machine Learning” by Christoph Molnar, who is one of the keynote speakers at Institute Day 2019, is particularly worth a read.
The guideline for the use of artificial intelligence (AI) in medical devices is now available on GitHub at no cost.
We developed this guideline with notified bodies, manufacturers and AI experts.
Use the Excel version of the guideline, which is also available for free. With it, you can filter the requirements of the guideline, transfer them into your own specification document and adjust them to your specific situation.
When we were writing it, it was important to us to give the manufacturers and notified bodies precise test criteria to provide for a clear and undisputed assessment. The process approach is also in the foreground. The requirements of the guideline are grouped along these processes.
Artificial intelligence is currently receiving a lot of hype. Many “articles” praise it as either the solution to every medical problem or the start of a dystopia in which machines will take over. We are facing a period of disillusionment. “Dr. Watson versagt” [“Dr. Watson fails”] was the title of an article in issue 32/2018 of Der Spiegel on the use of AI in medicine.
It has to be expected that the media will write over-the-top and scandalized reports on cases where bad AI decisions have tragic consequences. But over time, the use of AI will become just as normal and indispensable as the use of electricity. We can no longer afford and no longer want to pay for medical staff to perform tasks that computers can do better and faster.
The regulatory framework and best practices lag behind the use of AI. This leads to risks for patients (medical devices are less safe) and for manufacturers (audits and approval procedures seem to reach arbitrary conclusions).
In 2019, the Johner Institute, together with notified bodies, published a guideline for the safe development and use of artificial intelligence, comparable to the IT Security Guideline.