Seven datasets on lung cancer patients were made available: six with genetic information, and one with clinical data. The genetic datasets consisted of data on mutations, copy number variation, methylation, micro-RNA expression, messenger RNA expression and protein expression.
Together, these datasets cover more than 1,000 patients, about half of whom had the non-small-cell subtype and half the squamous subtype. Because these subtypes differ substantially, they were analysed separately. The analysis and results below concern the non-small-cell variant.
Merging the Data
Not every patient appears in every dataset, so a subset of the available tables was selected to maximise the number of patients present in all of them. On this criterion, the micro-RNA and protein expression data were dropped, since they were available for far fewer patients. Where a table contained multiple measurements for one patient, these were combined by averaging or by transposition, so that each patient is represented by a single row of measurements. The merged table has approximately 140,000 variables.
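As an illustration of the merging step, the following is a minimal sketch in Python with pandas. The library choice, the toy tables, and all column names (`patient_id`, `gene_a`, `cpg_1`, and so on) are assumptions made for this example and do not come from the original analysis. It averages repeated measurements per patient and then keeps only patients who appear in both tables.

```python
import pandas as pd

# Hypothetical toy tables; names and values are illustrative only.
mrna = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2", "P3"],  # P1 was measured twice
    "gene_a": [2.0, 4.0, 1.0, 5.0],
    "gene_b": [0.5, 1.5, 2.0, 3.0],
})
methylation = pd.DataFrame({
    "patient_id": ["P1", "P2", "P4"],
    "cpg_1": [0.1, 0.7, 0.9],
})


def one_row_per_patient(table: pd.DataFrame) -> pd.DataFrame:
    """Average repeated measurements so each patient has a single row."""
    return table.groupby("patient_id").mean()


# An inner join keeps only the patients present in both tables.
merged = one_row_per_patient(mrna).join(
    one_row_per_patient(methylation), how="inner"
)
print(merged)
```

The inner join is what makes the choice of tables matter: every additional table can only shrink the set of patients that survive, which is why tables covering few patients are candidates for dropping.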
This merged table still contained many missing values, so several steps were taken to obtain a complete table. First, variables with more than 20% missing values were dropped, leaving around 60,000 variables. The remaining missing values were then estimated with a K-Nearest Neighbour algorithm, which fills in each missing value from patients who have similar properties and for whom the relevant variable is known.
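The two cleaning steps can be sketched as follows. This is a simplified NumPy illustration, not the implementation used in the analysis: the function names, the choice of `k=2`, the distance measure (mean squared difference over the variables both patients have), and the toy matrix are all assumptions. Library implementations such as scikit-learn's `KNNImputer` follow the same idea.

```python
import numpy as np


def drop_sparse_columns(X, max_missing=0.20):
    """Keep only the columns whose fraction of NaN values is at most max_missing."""
    keep = np.isnan(X).mean(axis=0) <= max_missing
    return X[:, keep], keep


def knn_impute(X, k=2):
    """Fill each NaN with the mean of that column over the k nearest
    patients (rows) for whom the column is observed."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i, j in zip(*np.where(missing)):
        candidates = []
        for r in range(X.shape[0]):
            if r == i or missing[r, j]:
                continue  # a neighbour must have the variable observed
            shared = ~missing[i] & ~missing[r]
            if not shared.any():
                continue  # no common variables to compare on
            # Mean squared difference over the variables both rows have.
            dist = np.mean((X[i, shared] - X[r, shared]) ** 2)
            candidates.append((dist, X[r, j]))
        if candidates:
            candidates.sort(key=lambda c: c[0])
            filled[i, j] = np.mean([value for _, value in candidates[:k]])
    return filled


# Hypothetical toy matrix: rows are patients, columns are variables.
X = np.array([
    [1.0, 10.0, np.nan, 5.0],
    [1.1, 11.0, np.nan, np.nan],
    [5.0, 50.0, np.nan, 9.0],
    [1.2, 12.0, 3.0, 5.2],
    [5.1, 51.0, np.nan, 9.1],
    [0.9, 9.0, np.nan, 4.8],
])

X_kept, kept_mask = drop_sparse_columns(X)  # third column is mostly missing
X_complete = knn_impute(X_kept, k=2)
print(X_complete)
```

In this toy example the third column is dropped, and the one remaining gap is filled from the two patients whose other values are closest to those of the patient with the missing entry.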
Because the algorithms applied to these data are sensitive to the scale of the values, all variables were standardised so that they have the same mean and the same variance. This prevents variables with a relatively large scale from being weighted disproportionately in the training process.
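A minimal sketch of such a standardisation in NumPy is given below. It assumes the common convention of rescaling every variable to mean 0 and variance 1 (z-scores); the text above does not state which target values were used, and the function name and toy data are illustrative.

```python
import numpy as np


def standardise(X):
    """Rescale each column to mean 0 and variance 1 (z-scores)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid division by zero
    return (X - mean) / std


# Two hypothetical variables carrying the same pattern on very different scales.
X = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
])
Z = standardise(X)
print(Z)
```

After rescaling, the two columns are identical, so neither dominates a distance or gradient computation merely because of the units in which it was recorded.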