Introduction
Early cancer detection is a crucial step towards a non-invasive treatment of a malignant tumor with few side effects. Even though successful therapies for late-stage cancer exist, their side effects are particularly harmful. Early prediction leads to better care and a higher survival rate. However, the complex mechanisms of cancer make early detection a challenging task. Doctors have therefore turned to the machine learning community to develop pattern recognition algorithms that detect cancer development with high sensitivity from the values of simple blood tests.
Cancer Burden in Numbers
According to the World Health Organization, cancer was responsible for over 8.8 million deaths in 2015. This represents about a sixth of all deaths in the world and makes cancer the second leading cause of death. Cancer Research UK states that the number of new cases is growing steadily each year and estimates that it will reach 23.6 million by 2030. Moreover, the economic burden of cancer cannot be disregarded, since it represented an annual cost of 1.16 trillion US dollars in 2010. Although cancer can look like a hard-to-cure, inevitable pathology, 30% to 50% of cancers can be avoided or treated efficiently, especially if diagnosed early.
One way to reduce the cancer burden is the early detection of abnormal behavior in the human body, or the efficient interpretation of a preventive screening.
Datasets
For this problem, we can consider two types of datasets. On the one hand, we can predict the development of cancer from blood test results coupled with other patient information (family history, age, gender, etc.). In particular, scientists look for abnormal levels of specific genetic biomarkers within the human body, since biologists have shown that a change in the genetic expression of some cells can cause cancer. The dataset we obtain from the blood test and the other measurements is a matrix whose columns are either real-valued or factor (categorical) variables.
In the following table, we present some of the risk factors for different types of cancer:
Cancer type | Genetic biomarkers | Other measurements |
Breast cancer | BRCA1, BRCA2 | Family history, Ionizing radiation, Smoking |
Colorectal cancer | FAP, MACC1 | Age, Obesity, Inflammatory bowel disease |
Lung cancer | EGFR, KRAS, ALK | Smoking, Passive smoking, Ionizing radiation |
Prostate cancer | BRCA1, BRCA2 | Infections, STD, Age, Race, Family history |
Leukemia | SPRED1 | Radiation |
As all of these measurements can be obtained in a non-invasive way (e.g. blood tests), doctors can perform them frequently for high-risk individuals. The Machine Learning algorithms will then determine whether a more thorough check-up is recommended.
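To make the "matrix of real-valued and factor variables" concrete, here is a minimal sketch of how such records could be turned into a purely numeric matrix by one-hot encoding the factors. The field names and values are hypothetical and chosen for illustration only; they do not come from any of the cited datasets.

```python
# Hypothetical patient records: field names and values are illustrative only.
records = [
    {"age": 52, "biomarker_level": 1.8, "smoker": "yes", "family_history": "no"},
    {"age": 67, "biomarker_level": 3.1, "smoker": "no", "family_history": "yes"},
    {"age": 45, "biomarker_level": 0.9, "smoker": "no", "family_history": "no"},
]

NUMERIC = ["age", "biomarker_level"]    # real-valued variables
FACTORS = ["smoker", "family_history"]  # factor (categorical) variables


def build_matrix(rows):
    """Return (column_names, matrix) with every factor one-hot encoded."""
    # Sort the levels of each factor so that the column order is stable.
    levels = {f: sorted({r[f] for r in rows}) for f in FACTORS}
    columns = list(NUMERIC) + [f"{f}={lv}" for f in FACTORS for lv in levels[f]]
    matrix = []
    for r in rows:
        numeric_part = [float(r[name]) for name in NUMERIC]
        factor_part = [1.0 if r[f] == lv else 0.0 for f in FACTORS for lv in levels[f]]
        matrix.append(numeric_part + factor_part)
    return columns, matrix


columns, X = build_matrix(records)
for name, value in zip(columns, X[0]):
    print(f"{name}: {value}")
```

Each row of `X` is one patient and each column is either a measurement or an indicator for one level of a factor, which is the form most of the algorithms discussed in this article expect as input.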
On the other hand, doctors can rely on medical imaging for early detection. In its early stages, a cancer is often asymptomatic and can grow silently into a more advanced tumor, so regular imaging examinations are beneficial. For instance, women are advised to undergo regular mammograms after a certain age to screen for breast cancer. However, these images are particularly hard and time-consuming to analyze.
Fortunately, Convolutional Neural Networks (CNN), Extreme Learning Machines (ELM) and Support Vector Machines (SVM) have proven to be very effective at detecting tumors from medical images and cancer-type-specific attributes.
Methods
We train Artificial Neural Networks (ANN) on the expression of the genetic biomarkers and the patient’s measurements in order to determine whether a cell has become cancerous. We also rely on Extreme Learning Machines (ELM), which are faster to train and slightly more efficient.
As its name suggests, an ANN mimics a circuit of neurons. An example is given in the figure. Each connection between neurons carries a weight, determined by a back-propagation optimization algorithm. To move from one layer to the next, each neuron computes a weighted sum of the outputs of the previous layer and passes it through an activation function, much like a neural message is transmitted through a synapse. The “signal” propagates this way from the input layer to the output layer, where we get the class id: +1 for a malignant tumor or -1 for a benign one.
This scheme models the complex structure of the task at hand and predicts efficiently. An ELM is a feedforward neural network with a single hidden layer, characterized by its fast training phase and its good generalization. Unlike in a standard ANN, the weights feeding the hidden layer are set randomly and are not trained; only the output weights are fitted, typically in a single least-squares step. This leads to a faster algorithm that is not as rigid as an ANN.
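The layer-to-layer propagation described above can be sketched in a few lines. This is only the forward pass of a tiny 2-3-1 network with a tanh activation; the weights are hand-picked placeholders rather than values learned by back-propagation, and the function names are illustrative.

```python
import math


def forward(x, layers):
    """Propagate the input vector x through a list of (weights, biases) layers."""
    activation = x
    for weights, biases in layers:
        # Each neuron: weighted sum of the previous layer, then tanh activation.
        activation = [
            math.tanh(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation


def predict(x, layers):
    """Map the single output neuron to a class id: +1 malignant, -1 benign."""
    (output,) = forward(x, layers)
    return 1 if output >= 0 else -1


# Hypothetical, hand-picked weights for a 2-3-1 network (not trained values).
layers = [
    ([[0.5, -0.2], [0.1, 0.8], [-0.7, 0.3]], [0.0, 0.1, -0.1]),  # input -> hidden
    ([[1.0, -1.5, 0.5]], [0.2]),                                 # hidden -> output
]

print(predict([1.0, 2.0], layers))   # class id for a first input
print(predict([2.0, -1.0], layers))  # class id for a second input
```

In a real model, training would adjust every entry of `layers` so that the predicted class ids match the labelled examples as closely as possible.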
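As a rough illustration of why ELM training is fast, the sketch below draws the hidden-layer weights at random, leaves them fixed, and fits only the output weights with a pseudo-inverse (a least-squares solve). It assumes NumPy is available; the toy data, the hidden-layer size and the function names are assumptions made for the example, not the setup used in the cited work.

```python
import numpy as np


def train_elm(X, y, n_hidden, seed=0):
    """Fit an ELM: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never updated
    b = rng.normal(size=n_hidden)                # random hidden biases, never updated
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta = np.linalg.pinv(H) @ y                 # output weights in one linear solve
    return W, b, beta


def predict_elm(X, W, b, beta):
    """Return class ids in {-1, +1} from the sign of the network output."""
    return np.where(np.tanh(X @ W + b) @ beta >= 0, 1, -1)


# Toy data (an assumption for the example): two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(-2.0, 0.5, size=(20, 2)),  # class -1
    rng.normal(2.0, 0.5, size=(20, 2)),   # class +1
])
y = np.array([-1] * 20 + [1] * 20)

W, b, beta = train_elm(X, y, n_hidden=10)
accuracy = float(np.mean(predict_elm(X, W, b, beta) == y))
print(f"training accuracy: {accuracy:.2f}")
```

There is no iterative optimization loop here: the only fitted quantity is `beta`, which is what makes the training phase so much shorter than back-propagation.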
As for the medical imaging datasets, we train Convolutional Neural Networks (CNN). They consist of an input and output layer as well as multiple convolutional and pooling layers.
During a convolution step, we consider a weight matrix (the filter) that is smaller than the image and whose entries are obtained through the learning phase. This filter slides across the image so that each entry (i.e. pixel) is covered at least once, and at each position it produces a weighted sum of the pixels it covers. The pooling step aims at reducing the size of the image: we divide the convolved output into blocks and take the maximum of each block as the corresponding entry of the pooled matrix. ELM has also proven to be quite effective at predicting the type of tumor, with the major advantage of a much faster training phase.
Lastly, Support Vector Machines (SVM) rely on a linear separation in an augmented feature space, in other words on a complex combination of the original features. They can therefore model the non-linear aspect of the task efficiently.
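The two steps can be sketched on a tiny 5x5 "image". The filter values below are hypothetical (a real CNN learns them), and the sliding window is the unflipped version used by most CNN libraries; the pooling uses non-overlapping 2x2 blocks.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the weighted pixels."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Split the map into non-overlapping size x size blocks and keep each maximum."""
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 3, 1, 0, 1],
]
kernel = [[1, 0], [0, -1]]  # hypothetical 2x2 filter, not a learned one

feature_map = convolve2d(image, kernel)  # 4x4 output
pooled = max_pool(feature_map)           # 2x2 output
print(feature_map)
print(pooled)  # [[0, 1], [0, 2]]
```

The 5x5 input shrinks to 4x4 after the convolution and to 2x2 after pooling, which is the size reduction the pooling step is meant to achieve.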
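The idea of separating linearly in an augmented feature space can be illustrated on the classic XOR pattern, which no straight line separates in two dimensions. The sketch below does not train an SVM: it adds one product feature by hand and uses a hand-picked hyperplane, whereas an SVM with a suitable kernel would find such a separator on its own.

```python
def augment(x1, x2):
    """Map a 2-D point to a 3-D feature space by adding the product x1 * x2."""
    return (x1, x2, x1 * x2)


# Hand-picked hyperplane in the augmented space (an assumption for illustration;
# a trained SVM would choose the separator that maximizes the margin).
WEIGHTS = (1.0, 1.0, -2.0)
BIAS = -0.5


def classify(x1, x2):
    """Linear decision rule applied to the augmented features."""
    score = sum(w * f for w, f in zip(WEIGHTS, augment(x1, x2))) + BIAS
    return 1 if score > 0 else -1


# XOR labels: +1 when exactly one coordinate is 1, else -1.
xor_points = {(0, 0): -1, (0, 1): 1, (1, 0): 1, (1, 1): -1}
predictions = {point: classify(*point) for point in xor_points}
print(predictions == xor_points)  # True
```

A rule that is linear in `(x1, x2, x1 * x2)` is non-linear in the original `(x1, x2)`, which is how the method captures non-linear structure.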
Sensitivity and Specificity
We compare the models with respect to their sensitivity and specificity. The sensitivity, or true positive rate, is the proportion of cancer cases correctly classified as such. Similarly, the specificity, or true negative rate, is the proportion of healthy individuals correctly classified as such.
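As a small worked example, both rates can be computed directly from true and predicted class ids. The labels below are made up for illustration, using the +1 (cancer) and -1 (healthy) convention.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (true positive rate, true negative rate) for labels in {-1, +1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)    # cancer, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)   # cancer, missed
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)  # healthy, cleared
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)   # healthy, flagged
    return tp / (tp + fn), tn / (tn + fp)


# Made-up example: 4 cancer cases and 6 healthy individuals.
y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, -1, -1, 1, -1]

sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
# sensitivity = 0.75, specificity = 0.83
```

Here 3 of the 4 cancer cases are caught and 5 of the 6 healthy individuals are correctly cleared.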
An algorithm is sensitive if its true positive rate is high, and specific if its true negative rate is high. However, there is a trade-off between these two metrics: for a given model, increasing the sensitivity generally decreases the specificity, and vice versa.
Since it is more harmful to misclassify a sick patient as healthy than the reverse, the algorithms should be extremely sensitive while keeping a reasonable specificity.
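One common way to act on this preference, assuming the model outputs a risk score rather than a hard class, is to lower the decision threshold until a target sensitivity is reached and then read off the specificity that remains. The scores and labels below are made up, and the helper names are illustrative.

```python
def rates_at(threshold, scores, labels):
    """Sensitivity and specificity when predicting +1 for scores >= threshold."""
    preds = [1 if s >= threshold else -1 for s in scores]
    tp = sum(1 for p, l in zip(preds, labels) if l == 1 and p == 1)
    tn = sum(1 for p, l in zip(preds, labels) if l == -1 and p == -1)
    n_pos = sum(1 for l in labels if l == 1)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg


def pick_threshold(scores, labels, min_sensitivity):
    """Return the highest threshold whose sensitivity meets the target."""
    for t in sorted(set(scores), reverse=True):
        sens, spec = rates_at(t, scores, labels)
        if sens >= min_sensitivity:
            return t, sens, spec
    raise ValueError("no threshold reaches the requested sensitivity")


# Made-up risk scores: the first four belong to cancer cases (+1).
scores = [0.95, 0.80, 0.60, 0.35, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]

print(rates_at(0.5, scores, labels))        # default-style threshold
print(pick_threshold(scores, labels, 1.0))  # threshold that misses no cancer case
```

On this toy data, moving the threshold from 0.5 down to 0.35 raises the sensitivity from 0.75 to 1.0 while the specificity drops from about 0.83 to about 0.67, which is the trade-off in action.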
We also consider the Area Under the Receiver Operating Characteristic Curve (AUC). The ROC curve shows how the sensitivity behaves as the specificity varies, and the AUC summarizes this curve in a single number between 0 and 1. The higher the AUC, the better our algorithm predicts.
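The AUC has a convenient equivalent reading: it is the probability that a randomly chosen cancer case receives a higher score than a randomly chosen healthy individual, with ties counted as one half. The sketch below computes it that way by comparing every pair; the scores are made up, and this pairwise version is meant for small examples rather than large datasets.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == -1]
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up risk scores: the first four belong to cancer cases (+1).
scores = [0.95, 0.80, 0.60, 0.35, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]

print(auc(scores, labels))  # 0.875
```

Of the 24 possible pairs, 21 are ordered correctly, giving 0.875. A model that ranks every cancer case above every healthy individual would reach 1.0, and random scores would hover around 0.5.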
Results
Applying the different algorithms to the different datasets (details in the references), we obtain the following promising results:
Model and dataset | Sensitivity | Specificity | AUC | Accuracy |
CNN Breast Cancer [1] | 0.82 | 0.82 | 0.82 | |
CNN Lung Cancer [2] | 0.87 | 0.991 | 0.94 | |
ELM Breast Cancer [3] | See ROC curve below | See ROC curve below | 0.93 | |
ANN Breast Cancer [4] | 0.96 | | | |
SVM Cervical Cancer [4] | 0.68 | | | |
SVM Breast Cancer [4] | 0.95 | | | |
SVM Oral Cancer [4] | 0.97 | | | |
The best results are obtained for lung and breast cancer, with a specificity exceeding 90% and a slightly lower sensitivity. For the other types of cancer, the datasets are not large enough to be conclusive (cf. References).
Conclusion
Early cancer detection is still a major field of research in both the medical and the machine learning communities. The two need to work closely together, as new discoveries on either side lead to impactful advances.
Expert knowledge combined with efficient, more adaptable algorithms can improve the current models and reach higher sensitivity and specificity. This can ultimately save millions of lives as well as alleviate the economic burden of cancer.
Slim Kammoun
References
[1] Y. J. Tan, K. S. Sim and F. F. Ting, 2017, “Breast cancer detection using convolutional neural networks for mammogram imaging system”
[2] Mehdi Fatan Serj, Bahram Lavi, Gabriela Hoff and Domenec Puig Valls, 2018, “A Deep Convolutional Neural Network for Lung Cancer Diagnostic”
[3] Ahsan Malik and Jamshed Iqbal, 2016, “Extreme learning machine based approach for diagnosis and analysis of breast cancer”
[4] Konstantina Kourou, Themis P. Exarchos, Konstantinos P. Exarchos, Michalis V. Karamouzis and Dimitrios I. Fotiadis, 2015, “Machine learning applications in cancer prognosis and prediction”