Most well-studied cancer mutations are in the so-called “coding” region of the genome, which provides instructions for building the proteins that allow cells to carry out their functions. However, the human genome contains around 50 times more “non-coding” DNA than coding DNA. Among its many roles, non-coding DNA affects the 3D structure of the genome, which in turn partially determines which genes in the coding region are turned on. Most cancers have numerous mutations in their non-coding DNA, but, with few exceptions, the impact of these mutations on cancer is not known. Researchers in the Ma and Van Allen labs are developing a machine-learning algorithm to identify mutations in non-coding DNA that, through their effect on the genome’s 3D structure, allow tumor cells to proliferate and outcompete healthy cells. They will train the software to recognize these mutations using a dataset of 2,500 cancer patients whose entire genomes have been sequenced as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG). By comparing mutations in these patients’ tumors with publicly available data on the relationship between DNA sequence and 3D structure, they expect the algorithm to identify which mutations are most likely to affect cancer proliferation through the changes they induce in genome structure. The researchers hope that this will become a model system for using machine learning to understand the cancer genome and that it will allow scientists to design experiments to confirm the role of these types of mutations in cancer development. In the future, a deeper understanding of mutations in non-coding DNA may help scientists develop new anti-cancer therapies and give doctors new tools to identify tumor subtypes and determine the best therapies for their patients.
Dietlein F, Weghorn D, Taylor-Weiner A, Richters A, Reardon B, Liu D, Lander ES, Van Allen EM, Sunyaev SR. Identification of cancer driver genes based on nucleotide context. Nat Genet. 2020.
Zhang Y, Xiao Y, Yang M, Ma J. Cancer mutational signatures representation by large-scale context embedding. Bioinformatics. 2020.