Manufacturers of medical devices that use machine learning face the difficult task of demonstrating the conformity of their devices.
Many of them know the laws. But which standards and best practices do they need to pay attention to in order to demonstrate conformity and talk to authorities and notified bodies on an equal footing?
This article will save you hours of research: it gives you an overview of the most important regulations and best practices you need to know, sparing you hundreds of pages of reading.
If you pay attention to these regulations, you will be well prepared for your next audit.
There are currently no laws or harmonized standards that specifically regulate the use of machine learning in medical devices. Obviously, these devices have to comply with existing regulatory requirements set out in the MDR and IVDR, such as:
The MDR and IVDR allow conformity to be demonstrated using harmonized standards and “common specifications.” For medical devices that use machine learning techniques, manufacturers should observe the following standards:
These standards contain specific requirements that are also relevant for medical devices with machine learning, e.g.:
Please read the article on the validation of ML libraries.
The FDA has established comparable requirements, especially in 21 CFR Part 820 (including § 820.30 on design controls). Numerous FDA guidance documents, including the documents on "software validation", the use of off-the-shelf software (OTSS) and cybersecurity, are mandatory reading for companies that want to sell medical devices that are or contain software in the USA.
The FDA draft “Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD)” is also mandatory reading.
You can find a detailed description of this framework in the article on “Artificial Intelligence in Medicine.”
The Chinese NMPA has released the draft of the “Technical Guiding Principles of Real-World Data for Clinical Evaluation of Medical Devices” for comment.
However, the document is currently only available in Chinese. We have therefore had the table of contents machine-translated for you:
Download: China-NMPA-AI-Medical-Device
The document addresses:
The authority is also building up its staff and has established an “AI Medical Device Standardization Unit”. This unit is responsible for the standardization of terminology, technology and processes for development and quality assurance.
The Japanese “Ministry of Health, Labour and Welfare” is also working on AI standards. Unfortunately, the authority only publishes the progress reports on these efforts in Japanese. (Translation programs will help though.) No concrete results have been published yet.
COCIR published the document "Artificial Intelligence in Healthcare" in April 2019. It refers to existing requirements rather than establishing new ones and recommends the development of standards.
Conclusion: not very helpful
Technical Report IEC/TR 60601-4-1 gives guidance for “Medical electrical equipment and medical electrical systems employing a degree of autonomy.” This guidance, however, is not specific to medical devices that use machine learning.
Conclusion: slightly helpful
Xavier University has published the document "Perspectives and Good Practices for AI and Continuously Learning Systems in Healthcare."
As the title makes clear, it is (also) about continuously learning systems. Nevertheless, many of the best practices mentioned can also be transferred to systems that do not learn continuously:
This traceability/interpretability, in particular, is a challenge for many manufacturers.
The training videos in Auditgarant introduce other important techniques, such as LRP (layer-wise relevance propagation), LIME (local interpretable model-agnostic explanations), the visualization of neural network activations, and counterfactuals.
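To give an idea of how such a technique is applied in practice, here is a minimal sketch of a LIME explanation for a single prediction of a tabular classifier. The dataset and model are placeholders chosen for illustration, not a recommendation; the sketch assumes the `lime` and `scikit-learn` Python packages are installed.

```python
# Minimal LIME sketch: explain one prediction of a tabular classifier.
# Placeholder dataset and model; assumes: pip install lime scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    training_data=data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)

# Which features pushed the model towards (or away from) its decision
# for this one sample?
explanation = explainer.explain_instance(
    data.data[0], model.predict_proba, num_features=5
)
for feature, weight in explanation.as_list():
    print(f"{feature}: {weight:+.3f}")
```

Such local explanations are one way of providing the traceability mentioned above; they do not, however, replace a systematic evaluation of the model.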
The document also discusses exciting questions, such as whether patients have to be informed when an algorithm has been updated and could come to a better or even a different diagnosis.
The guidelines contained in this document have been incorporated in the Johner Institute's AI guidelines.
Conclusion: helpful, especially for continuously learning systems
This document from Xavier University, which the Johner Institute helped draft, looks at best practices in the field of explainability. It provides useful guidance on what information has to be provided, for example, for "technical stakeholders" in order to meet explainability requirements.
Conclusion: at least partially helpful
The title of this BSI/AAMI document sounds promising. But, ultimately, it is only a position paper that you can download free of charge from the AAMI store. The position paper calls for the development of new standards with the involvement of the BSI and AAMI. Results are expected by the end of 2020.
Conclusion: not very helpful
The standard DIN SPEC 92001 “Artificial Intelligence – Life Cycle Processes and Quality Requirements – Part 1: Quality Meta Model” is also available for free download.
It presents a meta-model but does not give any specific requirements for the development of AI/ML systems. The document is not specific to any particular sector.
“Part 2: Technical and Organizational Requirements” is currently not available.
Conclusion: not very helpful
The standard ISO/IEC CD TR 29119-11 “Software and systems engineering – Software testing – Part 11: Testing of AI-based systems” is still under development.
Conclusion: still too early, worth keeping an eye on
The Korean “Software Testing Qualification Board” has made a syllabus for testing AI systems entitled “Certified Tester AI Testing – Testing AI-Based Systems (AIT – TAI) Foundation Level Syllabus” available for download.
From chapter 3.8 onwards, the syllabus provides information on quality assurance for AI systems, most of which can also be found in the Johner Institute's guidelines.
In addition, chapter 6 of the document contains guidelines for the black box testing of AI models, such as combinatorial testing and “metamorphic testing”. The tips on neural network testing, for example, using “neuron coverage” and tools such as DeepXplore, are particularly worth looking at.
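To make metamorphic testing more concrete, here is a minimal sketch of such a test. The model interface (`model.predict` returning class probabilities) and the chosen metamorphic relation (a small brightness change must not alter the predicted class) are assumptions for illustration; they are not taken from the syllabus.

```python
# Sketch of a metamorphic test for an image classifier.
# Assumed model API: model.predict(images) returns class probabilities.
import numpy as np

def predicted_classes(model, images: np.ndarray) -> np.ndarray:
    """Return the predicted class index for each image."""
    return model.predict(images).argmax(axis=1)

def test_brightness_invariance(model, images: np.ndarray, delta: float = 0.05):
    """Metamorphic relation: slightly brighter images -> same classes."""
    original = predicted_classes(model, images)
    brightened = np.clip(images + delta, 0.0, 1.0)  # pixel values in [0, 1]
    follow_up = predicted_classes(model, brightened)
    agreement = np.mean(original == follow_up)
    assert agreement >= 0.99, f"only {agreement:.1%} of predictions were stable"
```

The appeal of this approach is that it needs no labeled test data: the expected output for the follow-up input is derived from the model's own output for the original input.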
Conclusion: recommended
The ANSI has published several standards together with the CTA (Consumer Technology Association):
As the titles suggest, the standards provide definitions. Nothing more and nothing less.
The CTA is currently working on additional specific standards, including one on “trustworthiness”.
Conclusion: only helpful as a collection of definitions
The IEEE is currently working on a whole family of standards:
Conclusion: still too early, worth keeping an eye on
Several working groups at ISO are also working on AI/ML-specific standards:
The first standards have already been completed (such as the one described below).
Conclusion: still too early, worth keeping an eye on
ISO/IEC TR 24028 is entitled "Information Technology – Artificial Intelligence (AI) – Overview of trustworthiness in artificial intelligence." It is not specific to any particular domain, but it does give examples for the healthcare sector.
The standard summarizes important hazards and threats as well as common risk minimization measures (see Fig. 1).
Fig. 1: ISO-IEC-24028-2020 chapter structure (mind map, available for download)
However, the standard remains quite generic: it does not give any concrete recommendations or establish any specific requirements. It is useful as an overview, as an introduction, and as a pointer to other sources.
Conclusion: recommended, with conditions
The WHO and ITU (International Telecommunication Union) are developing a specific framework for the use of AI in healthcare, in particular for diagnosis, triage and treatment support.
This AI4H initiative includes several topic groups covering various medical fields as well as working groups on cross-cutting topics. The Johner Institute is an active member of the working group on regulatory requirements.
This working group is developing a guideline that is based on the Johner Institute’s previous guideline and will potentially replace it. The plan is to coordinate the results with the IMDRF.
If you would like to know more about this initiative, please contact the ITU or the Johner Institute.
Conclusion: highly recommended for the future
Notified bodies and authorities have still not agreed on a uniform approach and common requirements for medical devices with machine learning.
Therefore, manufacturers regularly find it difficult to prove that the requirements placed on the device, e.g. with regard to accuracy, correctness and robustness, have been met.
Dr. Rich Caruana, one of Microsoft's leading minds on artificial intelligence, even advised against the use of a neural network he himself had developed to propose the appropriate therapy for pneumonia patients:
“I said no. I said we don’t understand what it does inside. I said I was afraid.”
Dr. Rich Caruana, Microsoft
The existence of machines that users do not understand is nothing new. You can use a PCR machine without understanding it; at least there are people who know how the device works and what is inside it. With artificial intelligence, however, this is no longer always the case.
The questions that auditors should ask manufacturers include:
| Key question | Background |
| --- | --- |
| Why do you think that your device represents the state of the art? | A classic opening question. Your answer should cover both technical and medical aspects. |
| How did you reach the assumption that your training data has no bias? | Otherwise, the results would be wrong or only correct under certain conditions. |
| How did you avoid overfitting your model? | Otherwise, the algorithm would only correctly predict the data it was trained with (a minimal check is sketched after this table). |
| What makes you assume that the results are not just randomly correct? | For example, an algorithm could correctly decide that an image contains a house, but without actually having recognized the house: it may have recognized the sky instead. Another example is shown in Fig. 3. |
| What requirements does the data have to meet so that your system classifies it correctly or predicts the correct results? Which framework conditions have to be observed? | Since the model was trained with a certain set of data, it can only make correct predictions for data coming from the same population. |
| Would you not have achieved a better result with another model or with other hyperparameters? | Manufacturers must minimize risks as far as possible. These risks include risks resulting from incorrect predictions made by sub-optimal models. |
| Why do you assume that you have used enough training data? | Collecting, processing, and labeling training data is time-consuming. The bigger the dataset used to train a model, the more powerful the model can be. |
| Which standard did you use when labeling the training data? Why do you consider the chosen standard to be the gold standard? | Particularly when machines start to outperform humans, it becomes difficult to determine whether a single physician, a group of "average" physicians, or the world's best experts in a discipline should serve as the reference. |
| How can you ensure reproducibility if your system continues to learn? | Continuously learning systems (CLS), in particular, must ensure that further training does not, at the very least, reduce performance. |
| Have you validated the systems you use to collect, prepare, and analyze data, and to train and validate your models? | An essential part of the work consists of collecting and processing the training data and using it to train the model. The software needed for this is not part of the medical device, but it is subject to the requirements for the validation of computerized systems. |

Table 1: Potential audit questions for medical devices that use machine learning, with background information
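To illustrate the overfitting question from Table 1: a rudimentary (and by no means sufficient) first check is to compare the accuracy on the training data with a cross-validated accuracy. The dataset and model in this sketch are placeholders; it assumes scikit-learn is installed.

```python
# Rudimentary overfitting check: compare training accuracy with
# cross-validated accuracy. A large gap suggests overfitting.
# Placeholder dataset and model; assumes scikit-learn is installed.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

train_accuracy = model.fit(X, y).score(X, y)    # accuracy on the training data
cv_scores = cross_val_score(model, X, y, cv=5)  # accuracy on held-out folds

print(f"training accuracy:        {train_accuracy:.3f}")
print(f"cross-validated accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
# A training accuracy near 1.0 combined with a much lower
# cross-validated accuracy would indicate overfitting.
```

In an audit, manufacturers would of course be expected to go further, e.g., by evaluating the model on a separate test dataset that was never used during development.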
The questions mentioned above are typically also discussed in the course of the risk management process according to ISO 14971 and the clinical evaluation according to MEDDEV 2.7/1 Revision 4 (or the performance evaluation of IVD medical devices).
Tips on how manufacturers can meet these regulatory requirements for medical devices with machine learning can be found in the “Artificial Intelligence in Medicine” article.
The regulatory requirements are clear. However, it is still not clear to manufacturers and, in some cases, even authorities and notified bodies how they should be interpreted and implemented for medical devices that use machine learning methods.
As a result, a lot of institutions feel obliged to help by publishing “best practices.” Unfortunately, a lot of these documents are only of limited use:
Unfortunately, no improvement is in sight. On the contrary: more and more guidelines are being developed. The OECD, for example, recommends the development of AI/ML-specific standards and is currently working on one itself. The same is true of the IEEE, DIN, and numerous other organizations.
Conclusion:
Medical device manufacturers need quality, not quantity, from best practices and standards on machine learning.
Best practices and standards should provide guidance for actions and set verifiable requirements. The fact that the WHO is using the Johner Institute's guidelines as a basis gives us reason for cautious optimism.
It would be nice if notified bodies, authorities, and possibly also the MDCG were more actively involved in the (further) development of these standards. This process should be transparent. We have seen on several occasions recently what working in back rooms without (external) quality assurance can lead to.
A joint approach would make it possible to achieve a common understanding of how medical devices that use machine learning should be developed and tested. There would only be winners.
Notified bodies and authorities are cordially invited to participate in the further development of guidelines. Just an email to the Johner Institute is enough.
Manufacturers looking for support in the development and authorization of ML-based devices (e.g., for the review of the technical documentation or the validation of ML libraries) can get in touch with us by email or via the contact form.
With thanks to Pat Baird for the helpful input.