Machine Learning

Classification of Alzheimer’s Disease Stages using Radiology Imaging and Longitudinal Clinical Data – Part 4

Implementation, Evaluation and Results of Alzheimer’s Disease Progression Models and Development of Web-based Application


There are a few conditions and prerequisites for building classification models, e.g., sample size, type of data, correlation, number of classes and type of problem. Studies suggest building multiple models and selecting the one with the best performance (Kruthika et al., 2019b), (Goyal et al., 2018). This section gives a short description of the data set, which consists of subjects from the U.S.A. and Canada, before detailing the preparation for modelling. The multiple clinical stages are combined into three stages, i.e., normal, mild cognitive impairment (MCI) and dementia. Machine learning techniques are applied to handle data imbalance, missing values and features with high magnitudes. Feature selection, feature engineering (e.g., monthly changes in specific biomarkers), random grid search to tune hyperparameters and multiple algorithms are implemented before the models are evaluated using a normalized confusion matrix, the AUROC (Area Under the Receiver Operating Characteristics) score and the AUROC curve. AUROC can also be written as AUC (Area Under the Curve)-ROC (Receiver Operating Characteristics). The output from the models is explained, and a web-based application is developed and deployed to the cloud. This is a continuation from here.

Description and Analysis of the Data Acquired

The project utilizes data from The Alzheimer’s Disease Prediction of Longitudinal Evolution (TADPOLE) challenge to predict the future clinical stage (normal, MCI or dementia) of a patient. The data set is gathered from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). ADNI was launched in 2003 and so far there have been three stages of data collection, namely ADNI, followed by ADNI-GO and ADNI-2. The subjects are recruited from over 50 sites across the U.S.A. and Canada to help research the application of neuroimaging, e.g., magnetic resonance imaging (MRI), and other biomarkers. The code to generate the standard data set is openly available in a GitHub repository.

The data collected from ADNI are merged into a CSV file and consist of:

• Cerebrospinal fluid (CSF) markers of amyloid-beta and tau deposition, measured in the fluid rather than in the cerebral cortex. CSF is a clear, colourless fluid which acts as a buffer for the brain. An abnormal concentration of proteins, e.g., tau, is an early indicator of Alzheimer’s Disease.

• The figure below shows what happens to the Hippocampus.


Different types of radiological images including:

  • Magnetic resonance imaging (MRI), an example of which is shown in the figure below, measures the volume of grey and white matter of the brain. It is a good indicator of progression because these volumes become abnormal as the disease advances. In the data set, there are three types of markers for 3D sub-regions: volumes, cortical thickness and surface areas. The markers are extracted by aligning the images with each other using software called Freesurfer, with either a cross-sectional pipeline (each subject visit is independent) or a longitudinal pipeline (uses information from all the visits of a subject).
  • Positron emission tomography (PET) detects gamma rays emitted by a radioactive tracer that is introduced into the body on a biologically active molecule. 3D images are reconstructed by computer analysis. PET scans are of different types depending on the cellular and molecular processes they measure. The molecular processes are the first to become abnormal and are hence useful for judging whether a healthy control will progress to MCI or not. However, PET has lower spatial resolution than MRI. The images in the data set have their frames processed, averaged across the dynamic range and standardized. Standardised uptake value ratio (SUVR) measures for relevant regions-of-interest are extracted after registering the PET images to the corresponding MRI using the SPM5 software.
  • Diffusion tensor imaging (DTI) measures the degeneration of white matter. The scans are corrected for head motion and eddy-current distortion, skull-stripped, EPI-corrected, and finally aligned to the T1 scans.
  • Cognitive tests such as ADAS11 and the Mini-Mental State Examination (MMSE), acquired in the presence of a clinical expert. Cognitive tests measure the decline in a direct and quantifiable manner.
  • Genetic information such as apolipoprotein E4 (APOE4) status.
  • General demographic information such as age, gender and education. Age is an important factor. Women are also more likely to develop the disease than men. Other factors such as smoking, diabetes, depression and head injuries also increase the risk of developing the disease.

Each row in the data represents a visit of a subject, and each column represents either information about the subject or a biomarker from the visit. Duplicated rows are removed so that the most recent information is kept. There is a total of 1737 subjects, of which 957 are males and 780 are females. The numbers of subjects with APOE4 values of zero, one and two are 522, 298 and 62 respectively. 204, 171 and 128 subjects have 16, 18 and 20 years of education respectively, and relationship status varies, e.g., 653 subjects are married, 33 subjects never married and 99 subjects are widowed.

Data Preparation

The project is a multiclass classification problem and therefore presents different challenges from binary classification. It also has missing data and the classes are not distributed equally. The following section discusses the common approach used before modelling to answer the research question.

Combining Multiple Clinical Stages to Three Classes

The data set includes different clinical stages in addition to normal, MCI and dementia. There are also missing values for the clinical stage for different reasons, e.g., a subject not returning for another examination. 3,837 rows are excluded because the stage of the disease is not recorded. MCI is the most common stage in the data set, followed by normal. The values for “normal to mild cognitive impairment” and “dementia to mild cognitive impairment” are replaced by “mild cognitive impairment”. Similarly, the values for “mild cognitive impairment to dementia” and “normal to dementia” are substituted with “dementia”, and “mild cognitive impairment to normal” is replaced with “normal”. The count of clinical stages after renaming is shown in the figure below.
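The renaming described above can be sketched with pandas as follows. This is a minimal illustration only: the column name "stage", the exact label strings and the helper combine_stages are assumptions made for the example and may not match the real TADPOLE file.

```python
import pandas as pd

# Assumed label strings; the spelling in the real data file may differ.
STAGE_MAP = {
    "normal": "normal",
    "mild cognitive impairment": "mild cognitive impairment",
    "dementia": "dementia",
    "normal to mild cognitive impairment": "mild cognitive impairment",
    "dementia to mild cognitive impairment": "mild cognitive impairment",
    "mild cognitive impairment to dementia": "dementia",
    "normal to dementia": "dementia",
    "mild cognitive impairment to normal": "normal",
}


def combine_stages(df, column="stage"):
    """Drop visits with no recorded stage, then collapse transitions to three classes."""
    labelled = df.dropna(subset=[column]).copy()
    labelled[column] = labelled[column].map(STAGE_MAP)
    return labelled


# Tiny made-up frame standing in for the visit-level data.
visits = pd.DataFrame({"stage": [
    "normal",
    "normal to dementia",
    None,  # stage not recorded, so the row is excluded
    "dementia to mild cognitive impairment",
    "mild cognitive impairment to normal",
]})
combined = combine_stages(visits)
print(combined["stage"].value_counts())
```

Keeping the mapping in one dictionary makes the class definitions easy to audit and to change in a single place.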

Dealing with Imbalanced Data

The figure above shows that the data is imbalanced. The classes are not equally represented, and hence certain machine learning techniques are difficult to apply. Handling imbalance is also an open research area. Synthetic Minority Over-sampling Technique (SMOTE) (Lemaitre G. et al., 2017) is applied to handle class imbalance. SMOTE uses a nearest-neighbour algorithm to generate new synthetic data for the training set. No synthetic samples are generated for the test set, so that the evaluation shows how well the model generalizes to real data.
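The interpolation idea behind SMOTE can be sketched with NumPy alone. This is a simplified illustration, not the implementation used in the project: the function smote_sketch and the toy data are hypothetical, and in practice the SMOTE class of the imbalanced-learn library (its fit_resample method) would be applied to the training split only.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating towards neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances within the minority class only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a base sample and one of its k nearest neighbours for each new point.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # random position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Made-up minority-class rows taken from a training split.
X_train_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_train_minority, n_new=6, k=2)
print(synthetic.shape)
```

Every synthetic point lies on a line segment between two real minority samples, which is why the new data stays inside the region already occupied by the minority class.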

Dealing with Missing Data

Missing values are a common issue when working with longitudinal data (Mehdipour Ghazi et al., 2019) and result in an error from scikit-learn estimators, as most of the algorithms expect numeric values. To handle the problem, incomplete rows or columns can be deleted, or the missing values can be imputed from the known values of the data. SimpleImputer is applied with the mean or most frequent strategy, which fills each missing value with the mean or the most frequent value of its column.
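A minimal sketch of the two SimpleImputer strategies is shown below, using small made-up arrays. Fitting on the training data and reusing the learned statistics on the test data is an assumption about the workflow rather than something stated in the text.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training and test matrices with missing entries.
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [3.0, 30.0]])
X_test = np.array([[np.nan, np.nan]])

# Mean strategy: each gap is filled with the mean of its column.
mean_imputer = SimpleImputer(strategy="mean")
X_train_mean = mean_imputer.fit_transform(X_train)  # column means are learned here
X_test_mean = mean_imputer.transform(X_test)        # and reused for unseen rows

# Most frequent strategy: each gap is filled with the mode of its column.
mode_imputer = SimpleImputer(strategy="most_frequent")
X_train_mode = mode_imputer.fit_transform(X_train)

print(X_test_mean)
print(X_train_mode)
```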

Dealing with Features Varying in Order of Magnitude

Features with high magnitude bear more weight than features with low magnitude. Feature scaling by standardization is performed to normalize the range of the features, so that differences in magnitude do not influence the machine learning algorithms. StandardScaler from the scikit-learn library replaces the values with their Z-scores, so that each feature has a mean of zero and a standard deviation of one.
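The effect of StandardScaler can be seen on a small made-up matrix whose two columns differ by three orders of magnitude. Fitting the scaler on the training data only is an assumption about the workflow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features: the second column is roughly 1000 times larger than the first.
X_train = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]])
X_test = np.array([[2.0, 4000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # z = (x - mean) / std, per column
X_test_scaled = scaler.transform(X_test)        # uses the training mean and std

print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
print(X_test_scaled)
```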

Applying One-vs-the-rest Strategy on Multiclass Classification

The strategy adopted for multiclass classification is different from binary classification. Multiclass assumes that each sample is assigned to one and only one class, e.g., a stage can be either MCI or dementia but not both at the same time. The One-vs-the-rest (OvR) strategy is applied, which fits one classifier per class, with that class fitted against all the other classes. The advantages include interpretability, computational efficiency and the ability to gain information about a class by inspecting its corresponding classifier.
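A minimal sketch of the OvR strategy with scikit-learn follows. The synthetic data and the choice of logistic regression as the base estimator are assumptions for illustration; the text does not say which base algorithms were wrapped.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic three-class data standing in for normal, MCI and dementia.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# One binary classifier is fitted per class, each against all other classes.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print(len(ovr.estimators_))            # one fitted estimator per class
print(ovr.estimators_[0].coef_.shape)  # coefficients of the class 0 vs rest model
print(ovr.predict_proba(X[:5]).shape)
```

Inspecting ovr.estimators_[i] is what makes the strategy interpretable: each fitted binary model describes a single class against the rest.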

Interpreting Machine Learning Model

Modelling is built upon the principle that minor differences can be exploited to discover patterns that distinguish the various classes. However, models developed in an ideal situation do not produce the same results in real life (Zhang et al., 2017). Models with good performance, e.g., an ensemble of classifiers, are often complex and difficult to explain. Further, a proper understanding of why a model makes a prediction is crucial to gain the confidence of the people who use it. A framework called Shapley Additive Explanations (SHAP) is applied for interpreting the model output. SHAP assigns each feature an importance value for a prediction based on cooperative game theory (Lundberg and Lee, 2017). The Shapley value of a feature is defined as its average marginal contribution across all possible combinations of features.
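The definition of the Shapley value can be made concrete with a small worked example in plain Python. This is not the SHAP library itself; the function shapley_values, the payoff function toy_value and the feature names are hypothetical and chosen only to illustrate the averaging of marginal contributions.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical payoff of a set of features (not a real model)."""
    score = 0.0
    if "MMSE" in coalition:
        score += 0.4
    if "hippocampus" in coalition:
        score += 0.2
    if {"MMSE", "hippocampus"} <= coalition:
        score += 0.2  # interaction, shared between the two features
    return score


phi = shapley_values(["MMSE", "hippocampus", "age"], toy_value)
print(phi)
```

The values add up to the payoff of the full feature set, and a feature that never changes the payoff ("age" in this toy game) receives a value of zero. Exact enumeration grows factorially with the number of features, so it is only practical for very small examples like this one.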

Metrics to Evaluate the Performance of Machine Learning Algorithm

The project is a multiclass classification problem with an imbalanced data set. Therefore, accuracy is not a proper evaluation metric. The confusion matrix is a better technique for summarizing the performance of a model that classifies different classes. The metrics used for evaluation are also used by other studies, e.g., (Mehdipour Ghazi et al., 2019), (Cui and Liu, 2019), and include the normalized confusion matrix, the average multiclass AUROC (Area Under the Receiver Operating Characteristics) score, the multiclass AUROC score dictionary and the AUROC curve. The receiver operating characteristic is a probability curve, and the area under the curve measures how well the model separates the classes. A confusion matrix is a table that describes the performance of a classification model.

It is useful for calculating other metrics, such as the F1-score and the AUROC curve, for classification problems. It can also be normalized so that the numbers are between zero and one, which gives the proportion of correctly classified samples for each class.

• TP (true positives): Cases that are predicted as true and are actually true

• TN (true negatives): Cases that are predicted as false and are actually false

• FP (false positives): Cases that are predicted as true but are actually false

• FN (false negatives): Cases that are predicted as false but are actually true
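The four counts can be read off a multiclass confusion matrix by treating each class in turn as the positive class. The sketch below uses made-up labels and scikit-learn's confusion_matrix; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["normal", "MCI", "dementia"]
# Made-up true and predicted stages for eight visits.
y_true = ["normal", "normal", "MCI", "MCI", "MCI", "MCI", "dementia", "dementia"]
y_pred = ["normal", "MCI", "MCI", "MCI", "MCI", "dementia", "dementia", "MCI"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
# normalize="true" divides each row by the number of true samples in that class.
cm_norm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

# Per-class counts, with each class treated as positive against the rest.
tp = np.diag(cm)
fn = cm.sum(axis=1) - tp
fp = cm.sum(axis=0) - tp
tn = cm.sum() - tp - fn - fp

print(cm)
print(cm_norm)
print(tp, fn, fp, tn)
```

The diagonal of the row-normalized matrix is the fraction of each true class that was classified correctly.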


The AUROC score for one class (ci) against another class (cj) is:

A(ci | cj) = (Si - ni (ni + 1) / 2) / (ni nj)

where ni and nj are the “number of points belonging to each class and Si is the sum of ranks of the class i test points after ranking all the class i and j data points in increasing likelihood of belonging to class i” (Azvan et al., 2018), (Mehdipour Ghazi et al., 2019).
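The rank-based score described here can be checked on a few numbers. The sketch below assumes the standard rank-sum form of the pairwise AUROC, (Si - ni (ni + 1) / 2) / (ni nj), and does not handle tied scores; the function name pairwise_auc and the example scores are hypothetical.

```python
import numpy as np


def pairwise_auc(scores_i, scores_j):
    """AUROC of class i against class j from estimated likelihoods of class i."""
    n_i, n_j = len(scores_i), len(scores_j)
    combined = np.concatenate([scores_i, scores_j])
    # Rank 1 goes to the lowest likelihood of belonging to class i (no ties assumed).
    ranks = np.empty(len(combined))
    ranks[np.argsort(combined)] = np.arange(1, len(combined) + 1)
    s_i = ranks[:n_i].sum()  # sum of ranks of the class i points
    return (s_i - n_i * (n_i + 1) / 2) / (n_i * n_j)


# Made-up likelihoods of belonging to class i.
scores_i = np.array([0.9, 0.8, 0.4])  # points that truly belong to class i
scores_j = np.array([0.7, 0.3])       # points that truly belong to class j
auc_ij = pairwise_auc(scores_i, scores_j)
print(auc_ij)
```

The result equals the fraction of (class i, class j) pairs in which the class i point is ranked higher, which is 5 of the 6 pairs in this example.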

To use the metrics for multiclass, “micro” averaging or “macro” averaging can be used after implementing the One-vs-the-rest strategy. The project uses “macro” averaging because it treats all classes equally by calculating the metric independently for each class and then taking the average. “Macro” averaging is:

PRE_macro = (PRE_1 + PRE_2 + ... + PRE_k) / k

where PRE_1, ..., PRE_k are the performance scores of the k individual classes. The average AUROC score is obtained by applying “macro” averaging to the per-class AUROC scores. The AUROC curve is a 2D graph where the x-axis is the false positive rate (FPR), or fall-out, while the y-axis is the true positive rate (TPR), or recall.

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

A model with high performance has a higher AUROC score. The top left corner of the plot is the ideal point, where FPR is zero and TPR is one, so the aim is to maximize TPR while minimizing FPR. The AUROC curve is commonly used to visualize the performance of a binary classifier.

The project is a multiclass classification problem and hence it is necessary to binarize the output. One AUROC curve is produced per class. The challenge provides a longitudinal data set called “D1” for training, and the predictions are tested on another longitudinal data set called “D2”. “D2” includes subjects who rolled over from previous ADNI studies to the prospective ADNI-3 study (Azvan et al., 2018).
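Binarizing the output and producing one AUROC curve per class can be sketched with scikit-learn as follows. The synthetic data, the logistic regression model and the variable names are assumptions for illustration and stand in for the project's own models and the D1 and D2 data sets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic three-class data standing in for normal, MCI and dementia.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)

classes = [0, 1, 2]
y_bin = label_binarize(y_test, classes=classes)  # one 0/1 column per class

auc_per_class = {}  # the per-class AUROC score dictionary
curves = {}         # (FPR, TPR) points of each class's curve, ready for plotting
for i, c in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_bin[:, i], proba[:, i])
    curves[c] = (fpr, tpr)
    auc_per_class[c] = roc_auc_score(y_bin[:, i], proba[:, i])

# Macro average: unweighted mean of the per-class scores.
macro_auc = float(np.mean(list(auc_per_class.values())))
# The same number from scikit-learn's built-in one-vs-rest macro average.
sklearn_macro = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")

print(auc_per_class)
print(macro_auc, sklearn_macro)
```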

The report continues here.
