Automated quality control of T1-weighted brain MRI scans for clinical research datasets: methods comparison and design of a quality prediction classifier
Bhalerao G., Gillis G., Dembele M., Suri S., Ebmeier K., Klein J., Hu M., Mackay C., Griffanti L.
Abstract

T1-weighted (T1w) MRI is widely used in clinical neuroimaging for studying brain structure and its changes, including those related to neurodegenerative diseases, and as an anatomical reference for analysing other modalities. Ensuring high-quality T1w scans is vital, as image quality affects the reliability of outcome measures. However, visual inspection can be subjective and time-consuming, especially with large datasets, and the effectiveness of automated quality control (QC) tools on clinical cohorts remains uncertain. In this study, we used T1w scans from elderly participants within ageing and clinical populations to test the accuracy of existing QC tools against visual QC and to establish a new quality prediction framework for clinical research use. Four datasets acquired from multiple scanners and sites were used (N = 2438, 11 sites, 39 scanner manufacturer models, 3 field strengths – 1.5T, 3T, 2.9T; patients and controls; average age 71 ± 8 years). All structural T1w scans were processed with two standard automated QC pipelines (MRIQC and CAT12). Agreement of the accept/reject ratings was compared between the two automated pipelines and against visual QC. We then designed a quality prediction framework that combines the QC measures from the existing automated tools and is trained on clinical research datasets. We tested the classifier's performance using cross-validation on data from all sites together, also examining performance across diagnostic groups. We then tested the generalisability of our approach in a leave-one-site-out setting, explored how well it generalises to data from a scanner manufacturer and/or field strength different from those used for training, and evaluated it on an unseen dataset of healthy young participants with movement-related artefacts.
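The leave-one-site-out protocol described above can be sketched generically: each site is held out in turn, the classifier is fit on the remaining sites' QC features and visual-QC labels, and predictions are collected for the held-out site. The sketch below is illustrative only; function and variable names (`records`, `fit`, `predict`) are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of leave-one-site-out evaluation for a binary QC
# classifier. Each record pairs a scan's QC features (e.g. from MRIQC/CAT12)
# with its visual-QC label and acquisition site.

def leave_one_site_out(records, fit, predict):
    """records: list of (site, features, label) tuples.
    fit(train)          -> model, where train is a list of (features, label).
    predict(model, x)   -> predicted label for features x.
    Returns {site: [(prediction, true_label), ...]} per held-out site."""
    sites = sorted({site for site, _, _ in records})
    results = {}
    for held_out in sites:
        # train on every site except the held-out one
        train = [(x, y) for s, x, y in records if s != held_out]
        test = [(x, y) for s, x, y in records if s == held_out]
        model = fit(train)
        results[held_out] = [(predict(model, x), y) for x, y in test]
    return results
```

The per-site prediction/label pairs can then be scored (e.g. with balanced accuracy) and averaged across held-out sites, matching the style of the summary statistics reported below.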
Our results show significant agreement between automated QC tools and visual QC (Kappa = 0.30 with MRIQC predictions; Kappa = 0.28 with CAT12's ratings) when considering the entire dataset, but the agreement was highly variable across datasets. Our proposed random undersampling boosting (RUSBoost) classifier achieved 87.7% balanced accuracy on test data combined from different sites (86.6% and 88.3% balanced accuracy on scans from patients and controls, respectively). The classifier also generalised across different combinations of training and test datasets (average balanced accuracy: leave-one-site-out = 78.2%; exploratory models on field strengths and manufacturers = 77.7%; movement-related artefact dataset when including 1% of its scans in the training = 88.5%). While existing QC tools may not be robustly applicable to datasets comprised of older adults, they produce quality metrics that can be leveraged to train a more robust quality control classifier for ageing and clinical cohorts.
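The two evaluation metrics reported above — Cohen's kappa for inter-rater agreement and balanced accuracy for classifier performance on imbalanced accept/reject labels — have simple closed forms. A minimal dependency-free sketch (names are illustrative, not from the paper):

```python
def cohens_kappa(r1, r2):
    """Cohen's kappa between two raters' label lists (e.g. accept/reject)."""
    n = len(r1)
    # observed proportion of agreement
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # agreement expected by chance, from each rater's marginal proportions
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in set(r1) | set(r2))
    return (po - pe) / (1 - pe)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; for binary labels this is the average of
    sensitivity and specificity, robust to class imbalance."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```

Balanced accuracy is the natural choice here because rejected scans are typically a small minority, so plain accuracy would reward a classifier that accepts everything.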