bioRxiv preprint doi: https://doi.org/10.1101/2020.06.25.170431; this version posted June 26, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Significant non-existence of sequences in genomes and proteomes Grigorios Koulouras1 # and Martin C. Frith1, 2, 3 * 1Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan. 2Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Chiba, Japan. 3Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), AIST, Shinjuku-ku, Tokyo, Japan. # Present Address: Cancer Research UK Beatson Institute, Glasgow G61 1BD, UK * Correspondence:
[email protected] bioRxiv preprint doi: https://doi.org/10.1101/2020.06.25.170431; this version posted June 26, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Abstract Nullomers are minimal-length oligomers absent from a genome or proteome. Although research has shown that artificially synthesized nullomers have deleterious effects, there is still a lack of a strategy for the prioritisation and classification of non-occurring sequences as potentially malicious or benign. In this work, by using Markovian models with multiple-testing correction, we reveal significant absent oligomers which are statistically expected to exist.