AI Data Wrangling with Associative Arrays

Jeremy Kepner1,2,3, Vijay Gadepally1,2, Hayden Jananthan1,4, Lauren Milechin5, Siddharth Samsi1 1MIT Lincoln Laboratory Supercomputing Center, 2MIT Computer Science & AI Laboratory, 3MIT Mathematics Department, 4Vanderbilt University, 5MIT Department of Earth, Atmospheric and Planetary Sciences

Abstract—The AI revolution is data driven. AI “data wran- of the matrix representation of AI neural networks [7]–[10]. gling” is the process by which unusable data is transformed to Associative array algebra generalizes set theory and matrices support AI algorithm development (training) and deployment to further unify SQL, NoSQL, and NewSQL databases [11]– (inference). Significant time is devoted to translating diverse data representations supporting the many query and analysis [13]. Polystore databases leverage associative array mathemat- steps found in an AI pipeline. Rigorous mathematical repre- ics to simplify the interchange of data among diverse databases sentations of these data enables data translation and analysis [14]–[16]. optimization within and across steps. Associative array algebra Hierarchical data structures, such as JavaScript Object No- provides a mathematical foundation that naturally describes the tation (JSON) [17] and Extensible Markup Language (XML) tabular structures and set mathematics that are the basis of databases. Likewise, the matrix operations and corresponding [18], are important to AI pipelines. Spreadsheets, and their inference/training calculations used by neural networks are also corresponding pivot tables, are available to over 100 million well described by associative arrays. More surprisingly, a general users and are an important tool for presenting AI results. denormalized form of hierarchical formats, such as XML and This work presents rigorous mathematical representations of JSON, can be readily constructed. Finally, pivot tables, which these data as associative arrays, enabling data translation and are among the most widely used data analysis tools, naturally emerge from associative array constructors. A common founda- analysis optimization within and across AI pipeline steps. tion in associative arrays provides interoperability guarantees, proving that their operations are linear systems with rigorous II.ASSOCIATIVE ARRAY MATHEMATICS mathematical properties, such as, associativity, commutativity, The full mathematics of associative arrays and the ways and distributivity that are critical to reordering optimizations. they encompass matrix mathematics and relational algebra are I.INTRODUCTION described in the aforementioned references [12], [13], [16]. The essence of associative array algebra is three operations: AI models require data for their construction and training element-wise addition (database table union), element-wise [1]. The broad applicability of AI to diverse domains involves multiplication (database table intersection), and array multi- integration and conditioning of diverse data. AI “data wran- plication (database table transformation). These operations are gling” is the process by which unusable data is transformed illustrated as spreadsheet operations in Figure 1. In brief, an to support AI algorithm development (training) and deploy- associative array A is defined as a mapping from sets of keys ment (inference) [2]. Significant time is devoted to translating to values diverse data representations supporting the many query and analysis steps found in an AI pipeline [3]. Typical steps include A : K1 × K2 → V parsing data from a raw hierarchical format to approximately where K1 are the row keys and K2 are the column keys and tabular spreadsheet files, ingesting into databases, querying can be any sortable set, such as sets of integers, real numbers, for specific AI analysis, construction of vectors for training and strings. The row keys are equivalent to the sequence ID models, and storing matrix representations of neural networks. in a relational database table. The column keys are equivalent Each of these steps can be represented using a plethora of to the column names in a database table. V is a set of values data structures and formats and a major component of AI data arXiv:2001.06731v1 [cs.DB] 18 Jan 2020 that forms a semiring (V, ⊕, ⊗, 0, 1) with addition operation wrangling is harmonizing these representations to align with ⊕, multiplication operation ⊗, additive identity/multiplicative the different operations of an AI pipeline. annihilator 0, and multiplicative identity 1. The values can take Theory can play a significant role in harmonizing data on many forms, such as numbers, strings, and sets. One of the representations. Set theory forms the basis of the mathematical most powerful features of associative arrays is that addition duality enabling the declarative SQL user interface to be and multiplication can be a wide variety of operations. Some implemented with any procedural language in a relational of the common combinations of addition and multiplication database [4]–[6]. Graph theory and linear algebra are the basis operations that have proven valuable are standard arithmetic addition and multiplication +.×, union and intersection ∪.∩ This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702- used in database operations, and various tropical algebras that 15-D-0001 and National Science Foundation CCF-1533644. Any opinions, are important in finance and neural networks: max.+, min.+, findings, conclusions or recommendations expressed in this material are those max.×, min.×, max.min, and min.max. Because associative of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering or the National Science arrays are typically constructed as semirings their operations Foundation. are linear systems with rigorous mathematical properties, such NxM T A = (k1,k2,v,Å) (k1,k2,v) = A C = A C = A Å B C = A Ä B C = A B = A Å.Ä B elementwise matrix construct deconstruct transpose addition multiplication multiplication

Fig. 1. Core associative array operations and their corresponding spreadsheet operations depicted in Microsoft Excel [19]–[22].

Hierarchical JSON Sparse Table dataset/accessLevel dataset/keyword dataset/modified dataset/description dataset/distribution/title dataset/identifier dataset/publisher/name {"conformsTo": "https://project-open- 1001 data.cio.gov/v1.1/schema", "describedBy": "https://project- 1001/0001 public Agency IT Policy Archive 8/31/15 A list of all public documents relating to SBA IT policy. SBA-OCIO-2015-08-002 1001/0001/0001 SBA IT Policy Archive U.S. Small Business Administration open-data.cio.gov/v1.1/schema/catalog.json", "@context": 1001/0001/0002 SBA IT Policy Archive Policy Zip 1001/0002 public Bureau IT Leadership Directory 8/15/15 A directory of all SBA officials with the title of CIO or dutiesSBA-OCIO-2015-08-001 of a CIO. "https://project-open- 1001/0002/0001 Bureau IT Leadership Directory U.S. Small Business Administration data.cio.gov/v1.1/schema/catalog.jsonld", "@type": 1001/0002/0002 Bureau IT Leadership Directory Json 1001/0003 public CIO role on program governance8/30/15 boards Information about SBA CIO‚Äôs membership in governanceSBA-OCIO-2015-08-003 boards. "dcat:Catalog", "dataset": [{"@type": "dcat:Dataset", "title": 1001/0003/0001 CIO Governanace Board U.S. Small Business Administration "SBA IT Policy Archive", "description": "A list of all public 1001/0003/0002 CIO Governance Board Membership List 1001/0004 non-public Loan 6/30/17 Not available to the public. Additional information is availableSBA-OCA-2017-11-001 on an as needed basis due to security concerns. documents relating to SBA IT policy.", "modified": "2015-08- 1001/0004/0001 ExtractOrigination U.S. Small Business Administration 31", "accessLevel": "public", "identifier": "SBA-OCIO-2015- 1001/0005 non-public Loan 6/30/17 Not available to the public. Additional information is availableSBA-OCA-2017-11-002 on an as needed basis due to security concerns. 1001/0005/0001 ExtractServicing U.S. Small Business Administration 08-002", "landingPage": "https://www.sba.gov/about- 1001/0006 non-public Loan 6/30/17 Not available to the public. Additional information is availableSBA-OCA-2017-11-003 on an as needed basis due to security concerns. 1001/0006/0001 Originate3 U.S. Small Business Administration sba/sba-performance/open-government/digital-sba/digital- 1001/0007 non-public Loan 6/30/17 Not available to the public. Additional information is availableSBA-OCA-2017-11-004 on an as needed basis due to security concerns. strategy/it-policy-archive", "license": 1001/0007/0001 OriginateStatus U.S. Small Business Administration 1001/0008 non-public Loan 6/30/17 Not available to the public. Additional information is availableSBA-OCA-2017-11-005 on an as needed basis due to security concerns. "http://www.usa.gov/publicdomain/label/1.0/", "publisher": 1001/0008/0001 OrigBypass U.S. Small Business Administration {"@type": "org:Organization", "name": "U.S. Small Business 1001/0009 non-public Loan 6/30/17 Not available to the public. Additional information is availableSBA-OCA-2017-11-006 on an as needed basis due to security concerns. 1001/0009/0001 OrigScore U.S. Small Business Administration Administration"}, "distribution": [{"@type": 1001/0010 non-public Loan 6/30/17 Not available to the public. Additional information is availableSBA-OCA-2017-11-007 on an as needed basis due to security concerns. "dcat:Distribution", "accessURL": … 1001/0010/0001 OrigUpdate U.S. Small Business Administration 1001/0011 non-public Loan 6/30/17 Not available to the public. Additional information is availableSBA-OCA-2017-11-008 on an as needed basis due to security concerns.

Fig. 2. Hierarchical JSON metadata records from Data.Gov Small Business Administration [23] and their corresponding denormalized sparse representation. as, associativity, commutativity, and distributivity that are array constructor (see Figure 2). As sparse associative arrays critical to reordering optimizations. hierarchical data can be readily ingested into AI pipelines. Spreadsheets are used by over 100 million people each day. III.AIPIPELINES The results of AI pipelines are often presented in spreadsheet form. Figure 1 illustrates the spreadsheet equivalents of the Hierarchical data structures, such as JSON and XML, are core associative arrays operations. Interestingly, pivot tables, playing an increasingly important role in AI pipelines because which are among the most widely used spreadsheet analysis of their ability to capture diverse data of the type that lends tools, naturally emerge from associative array constructors. itself to AI applications. JSON and XML also align well with Associative arrays span the data and operations found in object oriented programming as arrays of objects built from most AI pipelines, providing a powerful tool for harmonizing other arrays objects can be written in a human readable format. and optimizing AI pipeline data ﬂow. However, most data analysis environments and AI pipelines depend upon data being in an approximately tabular format. ACKNOWLEDGEMENT Fortunately, hierarchical data can also be represented as sparse The authors wish to acknowledge the following individuals associative arrays by traversing the hierarchy and increment- for their contributions and support: Bob Bond, Alan Edel- ing/appending the row counters and column keys to emit man, Laz Gordon, Charles Leiserson, Dave Martinez, Mimi row/column/value triples directly passable to an associative McClure, Michael Wright, William Arcand, David Bestor, William Bergeron, Chansup Byun, Matthew Hubbell, Michael Houle, Michael Jones, Anna Klein, Peter Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Charles Yee, and Albert Reuther.

REFERENCES [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. [2] W. McKinney, Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. ” O’Reilly Media, Inc.”, 2012. [3] “The State of AI and Machine Learning.” https://www.figure- eight.com/the-state-of-ai-and-machine-learning-report/. Accessed: 2019- 12-26. [4] E. F. Codd, “A relational model of data for large shared data banks,” Communications of the ACM, vol. 13, no. 6, pp. 377–387, 1970. [5] D. Maier, The theory of relational databases, vol. 11. Computer science press Rockville, 1983. [6] E. F. Codd, The relational model for database management: version 2. Addison-Wesley Longman Publishing Co., Inc., 1990. [7] J. Kepner and J. Gilbert, Graph algorithms in the language of linear algebra. SIAM, 2011. [8] J. Kepner, P. Aaltonen, D. Bader, A. Bulu, F. Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenke, S. McMillan, C. Yang, J. D. Owens, M. Zalewski, T. Mattson, and J. Moreira, “Math- ematical foundations of the graphblas,” in 2016 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–9, Sep. 2016. [9] J. Kepner, M. Kumar, J. Moreira, P. Pattnaik, M. Serrano, and H. Tufo, “Enabling massive deep neural networks with the graphblas,” in 2017 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–10, Sep. 2017. [10] M. Kumar, W. P. Horn, J. Kepner, J. E. Moreira, and P. Pattnaik, “Ibm power9 and cognitive computing,” IBM Journal of Research and Development, vol. 62, pp. 10:1–10:12, July 2018. [11] V. Gadepally, J. Bolewski, D. Hook, D. Hutchison, B. Miller, and J. Kepner, “Graphulo: Linear algebra graph kernels for nosql databases,” in Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International, pp. 822–830, IEEE, 2015. [12] J. Kepner, V. Gadepally, D. Hutchison, H. Jananthan, T. Mattson, S. Samsi, and A. Reuther, “Associative array model of sql, nosql, and newsql databases,” in High Performance Extreme Computing Conference (HPEC), IEEE, 2016. [13] J. Kepner and H. Jananthan, Mathematics of Big Data. MIT Press, 2018. [14] V. Gadepally, P. Chen, J. Duggan, A. Elmore, B. Haynes, J. Kepner, S. Madden, T. Mattson, and M. Stonebraker, “The bigdawg polystore system and architecture,” in 2016 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6, Sep. 2016. [15] T. Mattson, V. Gadepally, Z. She, A. Dziedzic, and J. Parkhurst, “Demonstrating the bigdawg polystore system for ocean metagenomics analysis.,” in CIDR, 2017. [16] H. Jananthan, Z. Zhou, V. Gadepally, D. Hutchison, S. Kim, and J. Kepner, “Polystore mathematics of relational algebra,” in Big Data Workshop on Methods to Manage Heterogeneous Big Data and Polystore Databases, IEEE, 2017. [17] D. Crockford, “The application/json media type for javascript object notation (json),” 2006. [18] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau, “Extensible markup language (xml) 1.0,” 2000. [19] “Create a PivotTable to analyze worksheet data.” https://support.office.com/en-us/article/create-a-pivottable-to-analyze- worksheet-data-a9a84538-bfe9-40a9-a8e9-f99134456576. Accessed: 2019-12-26. [20] “Transpose Excel data from rows to columns, or vice versa.” https://www.techrepublic.com/blog/microsoft-office/transpose-excel- data-from-rows-to-columns-or-vice-versa/. Accessed: 2019-12-26. [21] “Addition & Subtraction of Matrices using Excel.” https://www.youtube.com/watch?v=TAElTGKlvTM. Accessed: 2019- 12-26. [22] “Do matrix multiplication and inverse in MS Exce.” https://ms- office.wonderhowto.com/how-to/do-matrix-multiplication-and-inverse- ms-excel-357088/. Accessed: 2019-12-26. [23] “SBA Public Datasets.” https://catalog.data.gov/dataset/sba-public- datasets. Accessed: 2019-12-26.