Setup and notation
We assume we have training data consisting of high-dimensional features from multiple omic types (e.g. gene expression, DNA methylation, somatic mutations, etc.), measured on n samples, and an outcome of interest (e.g. cancer vs. normal tissue). The goal is to use the training data to build a model that can be used to predict the outcome for new samples based on their corresponding multi-omic profiles. This is a standard supervised learning setting with the particularity that the features are from multiple omic types.
For simplicity, we focus our presentation on a binary outcome and omic features of two types, but the multi-tuning parameter regression framework can be applied to other outcomes (e.g. multinomial, continuous, time to event, etc.) and more than two data types. For sample i, i = 1, …, n, we denote the binary outcome by yi ∈ {0, 1} and its corresponding omic profile by the partitioned vector $x_i = (x_i^{(1)}, x_i^{(2)})$, where $x_i^{(1)}$ and $x_i^{(2)}$ represent the features from the first and second omic type, respectively. The number of features of each type is usually much larger than the sample size n, ranging from thousands to millions in typical studies. For example, gene expression is measured for thousands of genes/features, DNA methylation for hundreds of thousands of CpGs, and genotypes for millions of SNPs. We let p1 and p2 be the number of features of the first and second omic type respectively, and p = p1 + p2 be the total number of features. In typical applications, there is also a low-dimensional vector of personal and/or clinical features (e.g. age, gender, etc.) that may also be predictive and included in the feature set, but for simplicity we omit any additional non-omic features here.
We denote by y = (y1, …, yn) the vector of outcomes and by X(1) and X(2) the design matrices whose ith rows are the sample vectors $x_i^{(1)}$ and $x_i^{(2)}$ defined above. We denote by X = [X(1) | X(2)] the matrix containing the full set of features, obtained by concatenating the columns of X(1) and X(2) side by side.
In addition to constructing a model that generalizes well, i.e. accurately predicts y based on X in new subjects, a parsimonious prediction model based on a small subset of the full feature set is generally preferred. A model with fewer features is more interpretable and more easily translated into a custom assay deployable in a clinical setting.
Regularized regression with a sparsity inducing penalty is an effective way to simultaneously perform feature selection and parameter estimation to build a predictive model with a small subset of features. The simplest and most commonly used sparsity inducing penalty is the L1 norm, which gives rise to LASSO regression [11]. Regularization with a combination of the L1 and L2 norms, i.e. elastic net regression, typically outperforms feature selection with the LASSO in settings with correlated features [18]. Many additional extensions and variations of the LASSO have been proposed to, for example, reduce over-shrinkage of larger coefficients (adaptive LASSO [19]) or to handle features with additional structure (e.g. group LASSO [15], fused LASSO [20], group LASSO with overlap [21], graph LASSO [22]). For definiteness, in this paper we focus on the elastic net, which includes the LASSO as a special case. However, customized tuning parameters for each omic type can be used with any type of regularized regression.
The elastic net regularization penalty is a weighted mixture of the LASSO (L1 norm) and ridge (square of L2 norm) penalties given by

$$P_\alpha(\beta) = \alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2,$$

where α, 0 ≤ α ≤ 1, is the weight given to the LASSO penalty and 1 − α the weight given to the ridge penalty. Both the LASSO (α = 1) and the ridge (α = 0) penalties are particular cases of the elastic net penalty.
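To make the penalty concrete, the following is a minimal Python sketch of its computation (the function name `enet_penalty` is ours for illustration; the paper's analyses use the glmnet R package):

```python
def enet_penalty(beta, alpha):
    """Elastic-net penalty P_alpha(beta) = alpha*||beta||_1 + ((1-alpha)/2)*||beta||_2^2."""
    l1 = sum(abs(b) for b in beta)          # LASSO component
    l2sq = sum(b * b for b in beta)         # squared L2 (ridge) component
    return alpha * l1 + (1 - alpha) / 2 * l2sq

beta = [0.5, -1.0, 0.0, 2.0]
print(enet_penalty(beta, 1.0))  # alpha = 1: pure LASSO, equals ||beta||_1 = 3.5
print(enet_penalty(beta, 0.0))  # alpha = 0: pure ridge, equals 0.5*||beta||_2^2 = 2.625
```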
Standard elastic-net logistic regression solves the penalized regression problem given by

$$\min_{\beta_0,\,\beta}\; -\frac{1}{n}\,\ell(\beta_0, \beta) + \lambda\, P_\alpha(\beta), \tag{1}$$

where $\ell(\beta_0, \beta)$ is the standard logistic log-likelihood function and the regularization parameter λ ≥ 0 controls the degree of penalization applied to the vector of regression coefficients β (except the intercept β0, which is typically not penalized). The single regularization parameter λ is common to all features and is usually tuned by cross-validation.
In this paper, we propose using separate tuning parameters for each omic type by solving the penalized regression problem given by

$$\min_{\beta_0,\,\beta}\; -\frac{1}{n}\,\ell(\beta_0, \beta) + \lambda_1\, P_\alpha(\beta^{(1)}) + \lambda_2\, P_\alpha(\beta^{(2)}), \tag{2}$$

where the vector of regression coefficients β = (β(1), β(2)) is partitioned according to omic type, conformably to X = [X(1) | X(2)]. The regularization parameters λ1 ≥ 0 and λ2 ≥ 0 are now specific to each omic type. Our hypothesis is that a ‘custom’ degree of regularization for each type can account for intrinsic differences between the data types and lead to a selected model with better prediction performance.
Model fitting and parameter tuning
Although the two-tuning parameter model (2) is non-standard, it can be fitted using elastic-net regression software provided it has an option to use a weighted elastic-net penalty of the form

$$\lambda \sum_{k=1}^{p} w_k \left[ \alpha\, |\beta_k| + \frac{1-\alpha}{2}\, \beta_k^2 \right].$$

Using a weighted penalty, the multi-tuning parameter elastic-net penalty in (2) can be equivalently written as

$$\lambda_1\, P_\alpha(\beta^{(1)}) + \lambda_2\, P_\alpha(\beta^{(2)}) = \lambda \sum_{k=1}^{p} w_k \left[ \alpha\, |\beta_k| + \frac{1-\alpha}{2}\, \beta_k^2 \right],$$

with a p-dimensional weight vector w with 1 in its first p1 entries (corresponding to the p1 features of the first omic type and the dimension of β(1)) and κ in the next p2 entries (corresponding to the p2 features of the second omic type and the dimension of β(2)). Thus, the two-penalty model can be alternatively specified using the tuning parameters λ = λ1 and κ = λ2/λ1, where λ controls the overall shrinkage on both types of omic features, and κ controls the shrinkage ratio between the two types. For κ = 1, the model reduces to the standard elastic net with a single tuning parameter.
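The equivalence between the two parameterizations is easy to verify numerically; the Python sketch below (purely illustrative, with arbitrary toy values) checks that the two-penalty form matches the weighted single-λ form:

```python
# Illustrative check: lambda1*P(beta1) + lambda2*P(beta2) equals the weighted
# single-lambda penalty with weights w = (1,...,1, kappa,...,kappa), kappa = lambda2/lambda1.
def enet(beta, alpha):
    return sum(alpha * abs(b) + (1 - alpha) / 2 * b * b for b in beta)

alpha, lam1, lam2 = 0.5, 0.3, 0.12
beta1 = [0.7, -0.2]         # toy coefficients for omic type 1 (p1 = 2)
beta2 = [1.1, 0.0, -0.4]    # toy coefficients for omic type 2 (p2 = 3)

two_penalty = lam1 * enet(beta1, alpha) + lam2 * enet(beta2, alpha)

lam, kappa = lam1, lam2 / lam1   # lambda = lambda1, kappa = lambda2/lambda1
weights = [1.0] * len(beta1) + [kappa] * len(beta2)
weighted = lam * sum(w * (alpha * abs(b) + (1 - alpha) / 2 * b * b)
                     for w, b in zip(weights, beta1 + beta2))

print(abs(two_penalty - weighted))  # agrees up to floating-point error
```

In glmnet, this weight vector is exactly what the "penalty.factor" argument supplies.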
To investigate the performance of the multi-tuning parameter elastic net regression (MTP EN) for building predictive models based on multi-omic features we conduct a series of simulations under a range of scenarios. To fit the MTP EN we use the efficient and widely used elastic-net implementation in the glmnet R package [23], which allows for a user-specified weighted penalty via the “penalty.factor” argument. Additional file 1 contains an R script implementing a complete and self-contained analysis example using MTP EN.
We set the elastic net penalty α to ½ and tune the overall shrinkage, λ, and shrinkage ratio parameter, κ, by k-fold cross-validation (CV), with the area under the ROC curve (AUC) as the performance metric. The training data is randomly split into k equally sized folds, each with (approximately) the same proportion of positive (y = 1) and negative (y = 0) samples as the full training data. The MTP EN is trained on k − 1 folds for all values of (λ, κ) in a grid of possible tuning parameter values, and the AUC is computed on the held-out validation fold. This is repeated, using each of the folds in turn as the held-out validation fold. The ‘optimal’ tuning parameters are the values κ = κmax and λ = λmax that maximize the average AUC across all folds. The optimal tuning parameter values are then applied to a fully independent test set to obtain an unbiased assessment of the model AUC. For model tuning we used both 5-fold and 10-fold CV: 5-fold for the analysis of real data, because of the small sample size in one class, and 10-fold for the simulation study.
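The two building blocks of this tuning loop, stratified fold construction and the AUC, can be sketched in a few lines of Python (illustrative helpers, not from the paper's R code; the AUC uses the rank/Mann-Whitney formulation):

```python
import random

def stratified_folds(y, k, seed=0):
    """Split sample indices into k folds, preserving the class proportions of y."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        idx = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):          # deal samples of this class round-robin
            folds[j % k].append(i)
    return folds

def auc(y_true, scores):
    """AUC via the Mann-Whitney statistic: fraction of (pos, neg) pairs ranked correctly."""
    pos = [s for s, yi in zip(scores, y_true) if yi == 1]
    neg = [s for s, yi in zip(scores, y_true) if yi == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In the full procedure, each (λ, κ) grid point is fit on k − 1 folds and scored with `auc` on the held-out fold, and the average across folds is maximized.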
Simulation study
We evaluate the MTP elastic-net through simulation, exploring varying proportions and effect sizes of the relevant features in each omic type, varying dimensionalities of the feature sets, and a range of correlation structures. Specifically, the binary outcome was generated based on a logistic regression model of the form

$$\log \frac{P(y_i = 1)}{1 - P(y_i = 1)} = \beta_0 + x_i^{(1)\top} \beta^{(1)} + x_i^{(2)\top} \beta^{(2)},$$

where we set qj features among the pj in omic type j = 1, 2 (qj ≪ pj) to be predictive by arranging the vector of regression coefficients β(j) to be sparse, with qj non-zero and pj − qj zero entries. We set the effect size of the predictive features, i.e. the non-zero entries in omic type j = 1, 2, to a common value βj.
We generated feature data X = [X(1) | X(2)] by sampling from a multivariate normal distribution X ~ N(0, Σ), where Σ is a population covariance matrix with the following structure: i) diag(Σ) = 1, i.e. the X’s are already standardized and have marginal variances equal to one; ii) rj features among the qj informative ones in omic type j = 1, 2 have a common pairwise correlation ρj and correlation ρ12 with the counterpart set of features in the other omic type; iii) all remaining correlations are zero. The simulation scenarios are summarized in Table 1. For each scenario, we simulated 400 replicate data sets: 200 training and 200 test sets. Every data set included 500 features on 200 samples, with an expected 100 samples in each class (β0 = 0).
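For the independent-features scenarios (ρ1 = ρ2 = ρ12 = 0), one replicate of this generative model can be sketched in pure Python as below. This is an illustration, not the paper's simulation code; in particular, the even 250/250 split of the 500 features between the two omic types is our assumption, and the correlated scenarios would additionally require sampling from the block-structured Σ (not shown):

```python
import math
import random

def simulate(n, p1, p2, q1, q2, b1, b2, beta0=0.0, seed=1):
    """One replicate under the independent-features scenarios:
    X ~ N(0, I), sparse coefficients (q_j non-zero per type), logistic outcome."""
    rng = random.Random(seed)
    # Sparse coefficient vector: q1 entries equal b1, q2 entries equal b2, rest zero.
    beta = [b1] * q1 + [0.0] * (p1 - q1) + [b2] * q2 + [0.0] * (p2 - q2)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p1 + p2)]
        eta = beta0 + sum(bj * xj for bj, xj in zip(beta, x))
        prob = 1.0 / (1.0 + math.exp(-eta))       # logistic model for P(y=1|x)
        X.append(x)
        y.append(1 if rng.random() < prob else 0)
    return X, y, beta

# e.g. scenario 1: 200 samples, 500 features, 5 + 20 informative features
X, y, beta = simulate(n=200, p1=250, p2=250, q1=5, q2=20, b1=0.6, b2=0.8)
```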
Table 1.
Summary of simulation scenarios
Independent features (scenarios 1–5): ρ1 = ρ2 = ρ12 = 0

| Scenario # | β1 | β2 | q1 | q2 | Optimal penalty ratio (κ*) |
|---|---|---|---|---|---|
| 1 | 0.6 | 0.8 | 5 | 20 | 0.55 |
| 2 | 0.6 | 0.6 | 5 | 20 | 0.70 |
| 3 | 0.8 | 0.6 | 5 | 20 | 0.75 |
| 4 | 0.8 | 0.8 | 5 | 5 | 1 |
| 5 | 0.8 | 0.6 | 5 | 5 | 1 |

Correlated features (scenarios 6–7): β1 = 0.8, β2 = 0.6, q1 = 5, q2 = 20, r1 = r2 = 3

| Scenario # | ρ1 | ρ2 | ρ12 | Optimal penalty ratio (κ*) |
|---|---|---|---|---|
| 6 | 0.4 | 0.2 | 0 | 0.85 |
| 7 | 0.4 | 0.2 | 0.4 | 0.9 |
Model tuning parameters (λ, κ) were selected using the training data and applied to the test data set to estimate prediction performance. To further reduce the dimensionality of the selected features beyond what is achieved by maximizing the training-data prediction performance, we set λ to λ1se, the largest value of λ such that the AUC is within one standard error of the maximum (achieved at λ = λmax). This strategy, proposed by Friedman et al. [12], yields predictive performance similar to that achieved by setting λ = λmax but with a more parsimonious model. Thus, the parameters κ = κmax and λ = λ1se are used to estimate the AUC from the independent test data set.
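The one-standard-error rule is a short computation on the CV curve; a minimal Python sketch (with made-up CV values for illustration) is:

```python
def lambda_1se(lambdas, mean_auc, se_auc):
    """Largest lambda whose mean CV AUC is within one SE of the best mean AUC."""
    best = max(range(len(lambdas)), key=lambda i: mean_auc[i])
    threshold = mean_auc[best] - se_auc[best]
    eligible = [lam for lam, m in zip(lambdas, mean_auc) if m >= threshold]
    return max(eligible)   # more shrinkage => sparser model

# Toy CV curve over a lambda grid (values are illustrative only)
lams = [0.01, 0.05, 0.10, 0.20, 0.50]
mean = [0.80, 0.85, 0.86, 0.85, 0.70]
se   = [0.02, 0.02, 0.02, 0.02, 0.03]
print(lambda_1se(lams, mean, se))  # 0.20: largest lambda with mean AUC >= 0.86 - 0.02
```

In glmnet's cv.glmnet this quantity is reported as lambda.1se.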
In addition to the AUC, we also report accuracy (1 − misclassification error) and the sensitivity and specificity of feature selection. The accuracy was calculated as the number of samples correctly classified in the test data set divided by the total number of samples. The sensitivity (specificity) was calculated as the number of informative (uninformative) features correctly included in (excluded from) the final model divided by the total number of informative (uninformative) features.
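These selection metrics amount to simple set operations on feature indices; a small illustrative Python helper (names and toy inputs are ours):

```python
def selection_sens_spec(true_support, selected, p):
    """Sensitivity and specificity of feature selection.

    true_support: indices of truly informative features
    selected:     indices of features selected by the fitted model
    p:            total number of features
    """
    true_support, selected = set(true_support), set(selected)
    tp = len(true_support & selected)                       # informative, selected
    tn = len(set(range(p)) - true_support - selected)       # uninformative, excluded
    sens = tp / len(true_support)
    spec = tn / (p - len(true_support))
    return sens, spec

# 10 features; features 0-2 are informative; the model selects {0, 1, 5}
sens, spec = selection_sens_spec([0, 1, 2], [0, 1, 5], 10)
print(sens, spec)  # 2/3 of informative features found; 6/7 uninformative excluded
```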