Precision medicine instruments generate enormous volumes of data, sometimes measured in terabytes or even petabytes. Machine learning (ML), a subfield of artificial intelligence, learns patterns from that data and uses them to make predictions.

Supervised ML algorithms learn from labeled training data to classify new observations. For example, ML can detect low-prevalence mutations in circulating tumor DNA (ctDNA) by distinguishing real mutations from artifacts introduced by sequencing technologies. The algorithm learns from examples of mutations and artifacts provided in the training data. Its predictions are then evaluated on independent data that was not part of the training set. This lets scientists determine whether the algorithm generalizes beyond the initial data and whether the patterns it has learned can be applied to an individual’s ctDNA to produce more accurate results.
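The train-then-evaluate workflow described above can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the method used for real ctDNA calling: the two features (mean base quality and strand bias), their distributions, and the nearest-centroid classifier are all assumptions chosen to keep the example short.

```python
import random

def make_calls(rng, n, label):
    """Generate n synthetic variant calls for one class (assumed distributions)."""
    # label 1 = real mutation, label 0 = sequencing artifact
    quality_mean, bias_mean = (35.0, 0.1) if label == 1 else (20.0, 0.6)
    return [((rng.gauss(quality_mean, 2.0), rng.gauss(bias_mean, 0.05)), label)
            for _ in range(n)]

def fit_centroids(examples):
    """Learn one mean feature vector per class from labeled examples."""
    centroids = {}
    for label in {y for _, y in examples}:
        rows = [x for x, y in examples if y == label]
        centroids[label] = tuple(sum(col) / len(rows) for col in zip(*rows))
    return centroids

def predict(centroids, x):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    return min(centroids,
               key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])))

def accuracy(centroids, examples):
    """Fraction of examples whose predicted label matches the true label."""
    hits = sum(predict(centroids, x) == y for x, y in examples)
    return hits / len(examples)

rng = random.Random(0)
data = make_calls(rng, 200, 1) + make_calls(rng, 200, 0)
rng.shuffle(data)

# Hold out 25% of the calls; the model never sees them during training.
split = int(0.75 * len(data))
train, held_out = data[:split], data[split:]

model = fit_centroids(train)
print(f"held-out accuracy: {accuracy(model, held_out):.2f}")
```

The point of the sketch is the split: accuracy is reported only on calls the model did not learn from. A real pipeline would also put the features on a common scale before measuring distances, which is skipped here for brevity.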

ML offers a robust way to harness the predictive qualities of genomic data and advance precision oncology. The first advantage is speed: ML can process volumes of data that would be impractical to review manually. Large datasets then become an advantage rather than a hindrance, as they provide more information to teach the system.

ML can also generate new knowledge and improve accuracy. The complex relationships embedded in data can be challenging to unwind, making it difficult to draw conclusions. However, with the right algorithms, we can identify those relationships to make accurate predictions and to expand our biological understanding. 

Collectively, these advantages improve precision oncology’s cost-effectiveness and our confidence in molecular testing. Better predictions from existing data can accelerate the identification of genetic biomarkers that match patients with targeted treatments demonstrated to improve outcomes. They can also reduce the need for expensive and time-consuming orthogonal validation tests.

Applications in Oncology

Projecting tumor evolution is one example of how ML predictions can be brought into patient care. Phylogenetic trees describe this process, as early mutations give rise to tumor subclones and then to subclones of those subclones. It is difficult for humans to reconstruct how a tumor has evolved and to anticipate where its evolution will go next, but ML is well suited to making those predictions.
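To make the tree structure concrete, here is a small sketch of a clone tree in Python. The clone names and the genes assigned to them are placeholders for illustration only, and the dictionary layout is an assumption, not a standard format. Each clone records its parent and the mutations it gained, so a subclone's full mutation set is everything collected on the path back to the founder.

```python
# Hypothetical clone tree: each clone maps to (parent, mutations gained).
clone_tree = {
    "founder": (None, {"TP53"}),
    "subclone_A": ("founder", {"KRAS"}),
    "subclone_B": ("founder", {"PIK3CA"}),
    "subclone_A1": ("subclone_A", {"EGFR"}),
}

def mutations_in(clone, tree):
    """Collect every mutation a clone carries by walking up to the root."""
    carried = set()
    while clone is not None:
        parent, gained = tree[clone]
        carried |= gained
        clone = parent
    return carried

# A subclone of a subclone inherits everything its ancestors acquired.
print(sorted(mutations_in("subclone_A1", clone_tree)))
```

Predicting evolution then amounts to inferring this tree from sequencing data and estimating which branches are likely to grow, which is a much harder problem than the lookup shown here.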

One of the most important ML applications is predicting which mutations may be driving a particular cancer. This is not always clear-cut, as there may be several oncogenes and each one could potentially play the lead role. ML can help untangle which ones are dominant and should be targeted for treatment.

The next step is drug selection. In this scenario, scientists supply a model with genomic information from the tumor along with a drug’s chemical structure, and the model predicts whether that therapy is likely to be effective.

The same approach can be applied to combination therapies, as ML could delineate how two or more drugs work together against cancer. It can help predict whether a specific combination will be effective, as well as the side effects a patient might expect. In addition to improving patient care, this could also accelerate clinical trials.

Another important application is prognostics. On a statistical level, cancer recurs quite frequently; however, on an individual patient level, whether a given patient’s cancer will return is far from certain. Clinicians are often forced to take a wait-and-see approach, using scans, biopsies and ctDNA tests to determine if tumors are back.

Machine learning could help alleviate this uncertainty, providing more accurate prognostics to guide treatment. If a patient is at much higher risk for a recurrence, clinicians might pursue more aggressive treatments or other strategies that could mitigate risk. On the other hand, a patient with a low chance of recurrence could avoid unnecessary treatments.

Next Steps

One of the major bottlenecks for ML is data labeling, during which biologists and/or computer scientists tag the data, providing context to help the ML algorithm learn. Unfortunately, data labeling is often a manual process that can be cumbersome, expensive and time-consuming. It forces scientists to make choices: Do I take the extra time to include the maximum amount of data to inform the ML model, or do I move forward, more rapidly, with what I have?

Canexia Health and others are exploring semi-supervised learning, in which unlabeled and labeled data are combined into training sets. Even though much of the data is unlabeled, combining it with the labeled variety gives the algorithm important information to make accurate decisions.
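One common way to combine labeled and unlabeled data is self-training, sketched below. This is a generic illustration, not a description of any particular group's method: the one-dimensional "score" feature, the `margin` threshold, and the function names are all hypothetical. The model is fit on the labeled examples, assigns provisional labels to the unlabeled points it is confident about, and is refit with those points included.

```python
def fit_centroids(examples):
    """Return the mean score of each class from (score, label) pairs."""
    means = {}
    for label in (0, 1):
        scores = [x for x, y in examples if y == label]
        means[label] = sum(scores) / len(scores)
    return means

def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """Grow the training set with confidently pseudo-labeled points."""
    train = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        means = fit_centroids(train)
        confident, unsure = [], []
        for x in pool:
            d0, d1 = abs(x - means[0]), abs(x - means[1])
            # Only trust a pseudo-label when one class is clearly closer.
            if abs(d0 - d1) >= margin:
                confident.append((x, 0 if d0 < d1 else 1))
            else:
                unsure.append(x)
        if not confident:
            break  # nothing new to learn from
        train.extend(confident)
        pool = unsure
    return fit_centroids(train), pool

labeled = [(3.0, 0), (8.0, 1)]                       # two hand-labeled examples
unlabeled = [1.0, 2.0, 2.5, 7.0, 7.5, 9.0, 5.4]      # cheap, unlabeled scores
means, still_unlabeled = self_train(labeled, unlabeled)
print(means, still_unlabeled)
```

With only two labeled points, the class centers start at 3.0 and 8.0. After the confident unlabeled points are absorbed, they move to 2.125 and 7.875, while the ambiguous point at 5.4 is left unlabeled rather than guessed.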

Another downside of classical machine learning is the need for human intervention to transform the raw data into a set of engineered features usable for model training. That is less of a problem with deep learning, a form of machine learning that uses artificial neural networks (algorithms loosely inspired by the structure of biological brains). Deep learning models can be trained end-to-end, meaning that they learn to extract relevant features directly from the raw data.
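As an illustration of what hand-engineered features look like, the sketch below turns raw read counts at a candidate mutation site into a few summary numbers a classical model could train on. The record layout, the feature names, and the choice of features are assumptions made for this example; a deep learning model would instead be given the reads themselves and left to learn its own representation.

```python
from dataclasses import dataclass

@dataclass
class PileupCounts:
    """Raw read counts at one candidate site (hypothetical record layout)."""
    ref_forward: int
    ref_reverse: int
    alt_forward: int
    alt_reverse: int

def engineer_features(counts):
    """Summarize raw counts as hand-picked features for a classical model."""
    alt = counts.alt_forward + counts.alt_reverse
    depth = alt + counts.ref_forward + counts.ref_reverse
    # Variant allele fraction: share of reads supporting the mutation.
    vaf = alt / depth if depth else 0.0
    # 0.0 when mutant reads are split evenly across strands, 1.0 when one-sided.
    strand_imbalance = abs(counts.alt_forward / alt - 0.5) * 2 if alt else 0.0
    return {"depth": depth, "vaf": vaf, "strand_imbalance": strand_imbalance}

site = PileupCounts(ref_forward=480, ref_reverse=500, alt_forward=12, alt_reverse=8)
print(engineer_features(site))
```

Every feature here encodes a human decision about what matters, which is exactly the manual step that end-to-end models aim to remove.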

This approach could be particularly useful to improve ctDNA “limits of detection,” the lowest concentrations of detectable genetic material that will provide accurate results, which could expand the number of patients who can benefit from this technology. 
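A rough way to see why low concentrations are hard to detect is to treat sequencing as random sampling. The sketch below assumes a simple binomial model in which each read independently carries the mutation with probability equal to the allele fraction, and a mutation is called only when a minimum number of mutant reads is seen. The model ignores sequencing errors and the function and parameter names are hypothetical, so treat the numbers as illustrative only.

```python
from math import comb

def detection_probability(depth, allele_fraction, min_alt_reads):
    """P(at least min_alt_reads mutant reads) under a binomial sampling model."""
    p_miss = sum(
        comb(depth, k) * allele_fraction ** k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_miss

# Same depth and calling threshold, decreasing mutant allele fraction.
for fraction in (0.01, 0.005, 0.001):
    p = detection_probability(depth=2000, allele_fraction=fraction, min_alt_reads=5)
    print(f"allele fraction {fraction}: detection probability {p:.3f}")
```

Under this model, lowering the number of supporting reads required would raise sensitivity, but only if real mutations can still be told apart from artifacts at that threshold, which is where a better classifier helps.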

Ultimately, the goal is to combine multi-modal data from diverse sources, such as ctDNA and imaging, to better understand where a cancer is now and where it could be going. This is a challenging application, but we are excited to be working on it and look forward to sharing more in the future.

