Version dated: October 7, 2018

Compatibility and saturated data

Circumstances in which parsimony but not compatibility will be provably misleading

Robert W. Scotland¹ and Mike Steel²

¹ Department of Plant Sciences, Oxford University, Oxford OX1 3RB, UK
² Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand

Corresponding author: Mike Steel, Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
E-mail: [email protected]
Fax: +64-3-3642587; Phone: +64-21329705

arXiv:1409.3274v2 [q-bio.PE] 20 Jan 2015

Abstract. Phylogenetic methods typically rely on an appropriate model of how data evolved in order to infer an accurate phylogenetic tree. For molecular data, standard statistical methods have provided an effective strategy for extracting phylogenetic information from aligned sequence data when each site (character) is subject to a common process. However, for other types of data (e.g. morphological data), characters can be too ambiguous, homoplastic or saturated to develop models that are effective at capturing the underlying process of change. To address this, we examine the properties of a classic but neglected method for inferring splits in an underlying tree, namely, maximum compatibility. By adopting a simple and extreme model in which each character either fits perfectly on some tree, or is entirely random (but it is not known which class any character belongs to), we are able to derive exact and explicit formulae regarding the performance of maximum compatibility. We show that this method is able to identify a set of non-trivial homoplasy-free characters when the number n of taxa is large, even when the number of random characters is large.
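To make the idea of maximum compatibility concrete, the following is a minimal sketch and not the authors' implementation. It assumes binary (two-state) characters, for which pairwise compatibility (the four-gamete test) is known to imply that the whole set fits a single tree; this does not hold in general for multi-state characters. The function names `compatible` and `max_compatible_subset`, the six-taxon data set, and the brute-force search are all illustrative assumptions, and the search is only practical for a handful of characters.

```python
from itertools import combinations


def compatible(c1, c2):
    """Four-gamete test for two binary characters (strings of '0'/'1').

    The characters are compatible iff at most three of the four
    possible state pairs (00, 01, 10, 11) occur across the taxa.
    """
    return len(set(zip(c1, c2))) < 4


def max_compatible_subset(chars):
    """Return indices of a largest pairwise-compatible subset.

    Brute force over subsets, largest first (exponential time);
    intended only as an illustration on tiny inputs.
    """
    n = len(chars)
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all(compatible(chars[i], chars[j])
                   for i, j in combinations(subset, 2)):
                return list(subset)
    return []


# Hypothetical data on six taxa (a, b, c, d, e, f).
# Indices 0, 2, 4 are homoplasy-free splits ab|cdef, cd|abef, ef|abcd;
# indices 1 and 3 stand in for "random" characters.
chars = ["110000", "101010", "001100", "011001", "000011"]
print(max_compatible_subset(chars))  # [0, 2, 4]
```

In this toy example the two noisy characters conflict with every other character, so the largest compatible set is exactly the three tree-like characters, which mirrors the behaviour the abstract describes for large numbers of taxa.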