TRANSACTIONS OF THE AMERICAN MATHEMATICAL SOCIETY, SERIES B Volume 8, Pages 1–38 (February 2, 2021) https://doi.org/10.1090/btran/56
PERSISTENT OBSTRUCTION THEORY FOR A MODEL CATEGORY OF MEASURES WITH APPLICATIONS TO DATA MERGING
ABRAHAM D. SMITH, PAUL BENDICH, AND JOHN HARER
Abstract. Collections of measures on compact metric spaces form a model category (“data complexes”), whose morphisms are marginalization integrals. The fibrant objects in this category represent collections of measures in which there is a measure on a product space that marginalizes to any measures on pairs of its factors. The homotopy and homology for this category allow measurement of obstructions to finding measures on larger and larger product spaces. The obstruction theory is compatible with a fibrant filtration built from the Wasserstein distance on measures. Despite the abstract tools, this is motivated by a widespread problem in data science. Data complexes provide a mathematical foundation for semi- automated data-alignment tools that are common in commercial database software. Practically speaking, the theory shows that database JOIN oper- ations are subject to genuine topological obstructions. Those obstructions can be detected by an obstruction cocycle and can be resolved by moving through a filtration. Thus, any collection of databases has a persistence level, which measures the difficulty of JOINing those databases. Because of its general formulation, this persistent obstruction theory also encompasses multi-modal data fusion problems, some forms of Bayesian inference, and probability cou- plings.
1. Introduction We begin this paper with an abstraction of a problem familiar to any large enterprise. Imagine that the branch offices within the enterprise have access to many data sources. The data sources exposed to each office are related and overlapping but non-identical. Each office attempts to merge its own data sources into a cohesive whole, and reports its findings to the head office. This is done by humans, often aided by ad-hoc data-merging software solutions. Presumably, each office does a good job of merging the data that they see. Now, the head office receives these cohesive reports, and must combine them into an overall understanding. This paper provides a mathematical foundation combining methods from mea- sure theory, simplicial homotopy, obstruction theory, and persistent cohomology (Section 1(a) gives an overview) for semi-automated data-table-alignment tools (e.g, HumMer [14]) that are common in commercial database software. Data tables are
Received by the editors December 4, 2019, and, in revised form, August 7, 2020. 2020 Mathematics Subject Classification. Primary 55U10; Secondary 55S35. Work by all three authors was partially supported by the DARPA Simplex Program, under contract # HR001118C0070. The last two authors were also partially supported by the Air Force Office of Scientific Research under grant AFOSR FA9550-18-1-0266.