Estimating individual treatment effect on disability progression in multiple sclerosis using deep learning

The study protocol was originally approved by the McGill University Health Center’s Research Ethics Board – Neurosciences-Psychiatry (IRB00010120) and then transferred and approved by the McGill University Faculty of Medicine and Health Sciences Institutional Review Board (A03-M14-22A).


Data were taken from six randomized clinical trials (n = 3830): OPERA I8, OPERA II8, BRAVO9, ORATORIO10, OLYMPUS11, and ARPEGGIO12 (ClinicalTrials.gov numbers NCT01247324, NCT01412333, NCT00605215, NCT01194570, NCT00087529, and NCT02284568, respectively). Informed consent and participant compensation (if any) were handled by the individual clinical trials. We excluded participants who spent less than 24 weeks in the trial, who had fewer than two clinical visits, or who were missing one or more input features at the baseline visit. The data included in our work are therefore not an exact reproduction of those used in the clinical trials.

All clinical/demographic and MRI features that were consistently recorded in all six clinical trials (19 features in total) were used to train our model. Values were recorded at the baseline visit (immediately before randomized treatment allocation) and are a combination of binary (sex), ordinal (EDSS, FSS), discrete (Gad count), and continuous variables (age, height, weight, disease duration, T25FW, 9-hole peg test (9HPT), T2 lesion volume, and NBV). Disease duration was estimated from the time of symptom onset.

Lesion segmentation and volumetric measurements are derived from ground-truth lesion masks, which were generated independently (by an image analysis center outside of this study) during the course of each clinical trial. A fully manual or a semi-automatic segmentation strategy was used during clinical trial analysis for each trial. This analysis began with automated segmentation and was followed by manual correction by experts. The resulting segmentation masks are the best available approximation to ground truth, but would not be expected to be identical between each expert and reading center in part due to differences in the approach to lesion segmentation between reading centers (school effects). To account for any difference between the trial sites’ segmentation pipelines and improve model optimization dynamics33, we scaled the segmentation-based metrics into a common reference range. To do so, we first isolated the subset of samples that fulfilled the intersection of inclusion criteria for all trials. Then, we scaled all MRI metrics such that their range from −3 SD to +3 SD matches that of a reference trial (in the same interval of ±3 SD) obtained from the training set. The reference trials were selected on the basis of sample size (ORATORIO for the PPMS trials, and OPERA I/II for the RRMS trials). The range was clamped at ±3 SD for the scaling to be robust to extreme outliers.
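The rescaling step can be read in more than one way. The following is a minimal sketch of one plausible reading: each value is clamped to the trial's mean ± 3 SD interval and then mapped linearly onto the same interval of the reference trial. The function name and the use of the sample SD are assumptions made for illustration. Restricting the statistics to the subset that meets all trials' inclusion criteria, and to the training set, is omitted here.

```python
from statistics import mean, stdev


def harmonize_to_reference(values, reference_values, n_sd=3.0):
    """Map `values` so that its [mean - n_sd*SD, mean + n_sd*SD] interval
    coincides with the same interval of `reference_values`.

    Hypothetical helper: values outside the interval are clamped first,
    so extreme outliers cannot stretch the mapping.
    """
    mu, sd = mean(values), stdev(values)
    ref_mu, ref_sd = mean(reference_values), stdev(reference_values)
    lower, upper = mu - n_sd * sd, mu + n_sd * sd

    harmonized = []
    for v in values:
        clamped = min(max(v, lower), upper)
        z = (clamped - mu) / sd          # position in SD units, within [-n_sd, n_sd]
        harmonized.append(ref_mu + z * ref_sd)
    return harmonized
```

Under this reading, the transformation is a z-score rescaling with clamping, so the harmonized values of a non-reference trial share the reference trial's mean and spread.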

The following right-skewed distributions were log-transformed: NBV, T2 lesion volume, T25FW, and 9HPT. Gad counts were binned into bins of 0, 1, 2, 3, 4, 5-6, 7-9, 10-14, 15-19 and 20+ lesions. Finally, to improve convergence during gradient descent, all non-binary features were standardized by subtracting the mean and dividing by the standard deviation, both calculated from the training dataset33.
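As an illustration of these preprocessing steps, the sketch below bins Gad counts into the ten categories listed above, applies a log transform, and standardizes a feature using training-set statistics only. The helper names, the choice of the natural logarithm, and the use of the sample SD are assumptions and are not stated in the text.

```python
import math
from bisect import bisect_right
from statistics import mean, stdev

# Lower edges of the bins 0, 1, 2, 3, 4, 5-6, 7-9, 10-14, 15-19, 20+
GAD_BIN_LOWER_EDGES = [0, 1, 2, 3, 4, 5, 7, 10, 15, 20]


def bin_gad_count(count):
    """Return the ordinal bin index (0 to 9) for a non-negative Gad lesion count."""
    return bisect_right(GAD_BIN_LOWER_EDGES, count) - 1


def log_transform(value):
    """Natural log of a strictly positive, right-skewed measurement (assumed base)."""
    return math.log(value)


def fit_standardizer(train_values):
    """Return a function that standardizes with the training mean and SD."""
    mu, sd = mean(train_values), stdev(train_values)
    return lambda v: (v - mu) / sd
```

The standardizer is fitted once on the training split and then reused unchanged on the validation and test splits, so that no information from held-out data leaks into the scaling.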

Outcome definition

The primary outcome used in clinical trials assessing the efficacy of therapeutic agents on disease progression is the time to confirmed disability progression (CDP) at 12 or 24 weeks. We use CDP24 because it is a more robust indication that disability accrual will be maintained after 5 years34. CDP24 is most commonly based on the EDSS, a scale going from 0 (no disability) to 10 (death), in discrete 0.5 increments (except for a 1.0 increment between 0.0 and 1.0). A CDP24 event is defined as a 24-week sustained increase in the EDSS of 0.5 for baseline EDSS values > 5.5, of 1.5 for a baseline EDSS of 0, and of 1.0 for EDSS values in between. This difference in the increment required to confirm disability progression is commonly adopted in clinical trials, and partially accounts for the finding that patients transition through the EDSS scores at different rates35.
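The baseline-dependent increment can be written compactly as follows. This is a minimal sketch with hypothetical function names. It only checks whether a single follow-up score meets the required increase and does not implement the 24-week confirmation.

```python
def required_edss_increase(baseline_edss):
    """Minimum EDSS increase that counts as progression for a given baseline."""
    if baseline_edss == 0:
        return 1.5
    if baseline_edss > 5.5:
        return 0.5
    return 1.0


def meets_progression_threshold(baseline_edss, followup_edss):
    """True if the follow-up EDSS exceeds baseline by the required increment.

    Confirmation that the increase is sustained for 24 weeks is not checked here.
    """
    return followup_edss - baseline_edss >= required_edss_increase(baseline_edss)
```

For example, a participant with a baseline EDSS of 6.0 reaches the threshold at 6.5, whereas a participant with a baseline of 3.0 needs to reach 4.0.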

While it is possible to predict time-to-event using traditional machine learning methods if workarounds are used to address right-censored data or using machine learning frameworks specifically developed to model survival data (reviewed elsewhere36), we chose not to model time-to-CDP24 because of limitations inherent in this metric. As outlined by Healy et al.37, CDP reflects not only the rate of progression but also the baseline stage of the disease, which is problematic because the stage is represented by a discretized EDSS at a single baseline visit. This results in a noisy outcome label which could make it harder for a model to learn a representation that relates to the progressive biology which we are trying to model.

We therefore model the rate of progression directly by fitting a linear regression model onto the EDSS values of each individual participant over multiple visits (see Supplementary Methods 2 for details) and take its slope to be the outcome label that our MLP uses for training. One advantage of the slope outcome over time-to-CDP24 is that it can be modeled using any type of regression model. We revert to using time-to-CDP24 for model evaluation to facilitate comparison with treatment effect survival metrics reported in the original clinical trial publications.
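A minimal sketch of the slope outcome is given below, assuming an ordinary least-squares fit of EDSS against visit time. The exact procedure is described in Supplementary Methods 2 and may differ. The function name and the time unit are assumptions.

```python
def edss_slope(visit_times, edss_values):
    """Ordinary least-squares slope of EDSS over time for one participant.

    `visit_times` are in an arbitrary but consistent unit (e.g. years, assumed);
    the slope is expressed in EDSS points per that unit.
    """
    n = len(visit_times)
    if n < 2 or n != len(edss_values):
        raise ValueError("need at least two visits with matching EDSS values")

    t_mean = sum(visit_times) / n
    y_mean = sum(edss_values) / n
    s_tt = sum((t - t_mean) ** 2 for t in visit_times)
    if s_tt == 0:
        raise ValueError("visit times must not all be identical")

    s_ty = sum((t - t_mean) * (y - y_mean)
               for t, y in zip(visit_times, edss_values))
    return s_ty / s_tt
```

A participant whose EDSS rises from 2.0 to 3.5 in steady 0.5 steps every half year would receive a slope of 1.0 EDSS points per year, while a stable participant would receive a slope of 0.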

Treatment effect modeling

To enrich clinical trials with individuals predicted to have an increased response to treatment, it is helpful to begin with the definition of individual treatment effect (ITE) according to the Neyman/Rubin Potential Outcome Framework38. Let the ITE for individual i be \({\tau }_{i}\), then

$${\tau }_{i} {:}={Y}_{i}(1)-{Y}_{i}(0),$$


where \({Y}_{i}(1)\) and \({Y}_{i}(0)\) represent the outcome of individual i when given treatment and control medications, respectively. The Fundamental Problem of Causal Inference39 states that the ITE is unobservable because only one of the two outcomes is realized in any given patient, dictated by their treatment allocation. \({Y}_{i}(1)\) and \({Y}_{i}(0)\) are therefore termed potential outcomes or, alternatively, factual (observed) and counterfactual (not observed) outcomes.

Ground truth can nevertheless be observed at the group level in specific situations, such as randomized controlled trials, because treatment allocation is independent of the potential outcomes. We provide a detailed discussion of two important estimands, the average treatment effect (ATE) and the conditional average treatment effect (CATE), in Supplementary Methods 1. Briefly, the ATE represents the average effect over the entire population, while the CATE considers a sub-population defined by certain characteristics (e.g., 40-year-old women with 2 Gad lesions at baseline). We use CATE estimation to frame the problem of predicting treatment response for individuals.

The best estimator for the CATE is conveniently also the best estimator for the ITE in terms of mean squared error (MSE)6. Several frameworks have been developed to model the CATE, but a simple meta-learning approach, which decomposes the estimation into sub-tasks that can be solved using any supervised machine learning model, provides a flexible starting point6. For a broader overview of methods, see the survey on uplift modeling by Gutierrez et al.4 (the uplift literature has contributed extensively to the field of causal inference, particularly when dealing with randomized experiments from an econometrics perspective).

In this work, an MLP was selected as the base model due to its high expressive power and flexibility to be integrated into larger end-to-end-trainable neural networks consisting of different modules (such as convolutional neural networks). We used a multi-headed architecture, with a common trunk and two output heads: one for modeling the potential outcome on treatment, \({\hat{\mu }}_{1}(x)\), and the other to model the potential outcome on placebo, \({\hat{\mu }}_{0}(x)\). For inference, the CATE estimate \(\hat{\tau }(x)\) given a feature vector x can be computed as:

$$\hat{\tau }(x)={\hat{\mu }}_{1}(x)-{\hat{\mu }}_{0}(x).$$


We use \(\hat{\tau }(x)\) as the predicted treatment effect for an individual with characteristics x. Note that we multiplied all \(\hat{\tau }(x)\) values by −1 in this paper to simplify interpretation in the section “Results”, such that a positive effect indicates improvement, while a negative effect indicates worsening on treatment.

This multi-headed approach can be seen as a variant of the T-Learner described, for example, by Kunzel et al.6, except that the two base models in our case share weights in the common trunk. Our network is similar to that conceptualized by Alaa et al.40, but without the propensity network used to correct for any conditional dependence between the treatment allocation and the outcome given the input features, since our data come from randomized trials.

To decrease the size of the hyperparameter search space, we fixed the number of layers and only tuned the layer width. We used one common hidden layer and one treatment-specific hidden layer. Additional common or treatment-specific layers could be used if necessary, but given the low dimensionality of our feature space and the relatively small sample size, the network’s depth was kept small to avoid overfitting. The inductive bias behind our choice of a multi-headed architecture is that disability progression can have both disease-specific and treatment-specific predictors, which can be encoded into the common and treatment-specific hidden layer representations, respectively. Consequently, the common hidden layer can learn from all the available data, irrespective of treatment allocation. Rectified linear unit (ReLU) activation functions were used at hidden layers for non-linearity.


The model was trained in two phases, depicted in Fig. 3. In the first phase, a 5-headed MLP was pre-trained on an RRMS dataset to predict the slope outcome on each treatment arm. In the second phase, the parameters of the common layers were frozen, and the output heads were replaced with two new randomly initialized output heads for fine-tuning on the PPMS dataset to predict the same outcome.
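The architecture and the two training phases described above can be sketched in PyTorch as follows. This is an illustrative sketch, not the authors' implementation. The class and method names, the layer width, the dropout placement, and the ordering of the output heads are assumptions. L2 regularization, the max-norm constraint, and the training loop are omitted.

```python
import torch
from torch import nn


class MultiHeadMLP(nn.Module):
    """One common hidden layer (trunk) and one treatment-specific hidden layer per head."""

    def __init__(self, n_features, width, n_heads, dropout=0.1):
        super().__init__()
        self.width, self.dropout = width, dropout
        self.trunk = nn.Sequential(
            nn.Linear(n_features, width), nn.ReLU(), nn.Dropout(dropout)
        )
        self.heads = self._make_heads(n_heads)

    def _make_heads(self, n_heads):
        # Each head predicts the potential outcome (EDSS slope) on one treatment arm.
        return nn.ModuleList([
            nn.Sequential(
                nn.Linear(self.width, self.width), nn.ReLU(),
                nn.Dropout(self.dropout), nn.Linear(self.width, 1),
            )
            for _ in range(n_heads)
        ])

    def forward(self, x):
        h = self.trunk(x)
        # Output shape: (batch, n_heads), one column per treatment arm.
        return torch.cat([head(h) for head in self.heads], dim=1)

    def prepare_for_fine_tuning(self, n_heads):
        """Freeze the common layer and attach new, randomly initialized heads."""
        for param in self.trunk.parameters():
            param.requires_grad = False
        self.heads = self._make_heads(n_heads)


def estimate_cate(model, x, treated_head=1, control_head=0):
    """CATE estimate mu_1(x) - mu_0(x); the head indices are assumed."""
    model.eval()
    with torch.no_grad():
        mu = model(x)
    return mu[:, treated_head] - mu[:, control_head]
```

In this sketch, pre-training would use `MultiHeadMLP(19, width, n_heads=5)` on the RRMS data, after which `prepare_for_fine_tuning(2)` leaves only the two new heads trainable for the PPMS data. The sign flip applied to the reported treatment effects is not included.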

Fig. 3: Multi-headed multilayer perceptron (MLP) architecture for CATE estimation.

The MLP was first pre-trained on a relapsing-remitting multiple sclerosis dataset (top), followed by fine tuning on a primary progressive multiple sclerosis dataset (bottom). Subtraction symbols indicate which treatment and control are being subtracted for the CATE estimate. Gray-colored layers indicate the common layers that are transferred from the pre-trained MLP to the fine-tuning MLP, at which point their parameters are frozen and only the parameters of the blue-colored layers are updated. The orange-colored layers are discarded after the pre-training step. x: Feature vector. \({\hat{\tau }}_{t}(x)\): CATE estimate for treatment t given feature vector x. \({\hat{\mu }}_{t}(x)\): predicted potential outcome on treatment t. IFNb-1a = Interferon beta-1a.

Optimization was done using mini-batch gradient descent with momentum. To prevent overfitting, the validation loss was monitored during 4-fold cross-validation (CV), and training was stopped early at the epoch with the lowest validation MSE, up to a maximum of 100 epochs. Dropout and L2 regularization were used, along with a max-norm constraint on the weights41, to further prevent overfitting.

Mini-batches were sampled in a stratified fashion to preserve the proportions of participants receiving active treatment and placebo. Backpropagation was done using the MSE calculated at the output head that corresponds to the treatment that the patient was allocated to, \({t}_{i}\) (the output head with available ground truth). The squared errors from each output head were then weighted by \({n}_{s}/(m{n}_{t})\), where \({n}_{s}\) represents the total number of participants in the training split, \({n}_{t}\) represents the number of participants in the treatment arm corresponding to the output head of interest, and m represents the total number of treatment arms. This compensates for the treatment allocation imbalance in the dataset.
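The arm weighting can be illustrated with a short sketch. The function names are hypothetical, and the number of arms m is taken here to be the number of distinct arms present in the training split, which is an assumption.

```python
from collections import Counter


def arm_weights(treatment_ids):
    """Per-arm weight n_s / (m * n_t) computed from the training split."""
    n_s = len(treatment_ids)          # participants in the training split
    counts = Counter(treatment_ids)   # n_t for each treatment arm
    m = len(counts)                   # number of treatment arms (assumed: arms present)
    return {arm: n_s / (m * n_t) for arm, n_t in counts.items()}


def weighted_factual_mse(y_true, y_pred_factual, treatment_ids):
    """Mean of squared errors at the factual head, weighted by treatment arm."""
    weights = arm_weights(treatment_ids)
    total = sum(
        weights[t] * (y - p) ** 2
        for y, p, t in zip(y_true, y_pred_factual, treatment_ids)
    )
    return total / len(y_true)
```

With six participants on active treatment and two on placebo, the weights are 8/(2·6) ≈ 0.67 and 8/(2·2) = 2.0, so each arm contributes equally to the loss in total and the weights sum to the number of participants.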

We aimed to reduce variance by using the early stopped models obtained from each CV fold as members of an ensemble. This ensemble’s prediction is the mean of its members’ predictions, and is used for inference on the unseen test set.

A random search was used to identify the hyperparameters with the best validation performance (learning rate, momentum, L2 regularization coefficient, hidden layer width, max norm, dropout probability). We used CV aggregation, or crogging42, to improve the generalization error estimate using our validation metrics. Crogging involves aggregating all validation set predictions (rather than the validation metrics) and computing one validation metric for the entire CV procedure. The best model during hyperparameter tuning was selected during CV on the basis of two validation metrics: the MSE of the factual predictions, and the ADwabc (described in detail in Supplementary Methods 3). We combine both validation metrics during hyperparameter tuning by choosing the model with the highest ADwabc among all models that fall within 1 SD of the best performing model based on the MSE loss. The SD of the best performing model’s MSE is calculated from the loss values obtained in the individual CV folds.
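The aggregation and selection rule can be sketched as follows. The ADwabc values are treated as precomputed inputs, since their computation is described in Supplementary Methods 3. The data layout, the function names, and the use of the sample SD are assumptions.

```python
from statistics import stdev


def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def crogged_mse(folds):
    """Pool validation predictions from all CV folds, then compute a single MSE.

    `folds` is a list of (y_true, y_pred) pairs, one pair per validation fold.
    """
    all_true = [y for y_true, _ in folds for y in y_true]
    all_pred = [p for _, y_pred in folds for p in y_pred]
    return mse(all_true, all_pred)


def select_model(candidates):
    """Pick the highest ADwabc among models within 1 SD of the best MSE.

    Each candidate is a dict with the keys 'name', 'mse' (crogged),
    'fold_mses' (one value per CV fold) and 'ad_wabc' (precomputed).
    """
    best = min(candidates, key=lambda c: c["mse"])
    threshold = best["mse"] + stdev(best["fold_mses"])
    eligible = [c for c in candidates if c["mse"] <= threshold]
    return max(eligible, key=lambda c: c["ad_wabc"])
```

Pooling the predictions weights every participant equally, which differs from averaging per-fold metrics when folds are of unequal size or difficulty.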

Baseline models

The performance of the multi-headed MLP was compared to that of ridge regression and Cox proportional hazards (CPH) models. Both models were used as part of a T-learner configuration (as defined by Kunzel et al.6). Hyperparameter tuning was done on the same folds and with the same metrics as for the MLP.

Statistical analysis

Hazard ratios were calculated using CPH models, and the associated p-values were obtained from log-rank tests. Sample size estimation for CPH assumes a two-sided test and was based on Rosner43, as implemented by the Lifelines library (version 0.27.0)44.


All experiments were implemented in Python 3.8 (ref. 45). MLPs were implemented using the PyTorch library (version 1.7.1)46. Scikit-Learn (version 0.24.2)47 was used for the implementation of ridge regression, while Lifelines (version 0.27.0)44 was used for CPH. For reproducibility, the same random seed was used for data splitting and model initialization across all experiments.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.
