Arxiv:2011.03076V2 [Cs.PL] 12 Jan 2021 Epeetapicpe Hoeia Rmwr O Inferring for Framework Theoretical Principled a Present We Oki Eei N Aiyextensible
Total Page:16
File Type:pdf, Size:1020Kb
Towards a more perfect union type Michał J. Gajda Abstract derive JSON types [2]1. This approach requires considerable We present a principled theoretical framework for inferring engineering time due to the implementation of unit tests in and checking the union types, and show its work in practice a case-by-case mode, instead of formulating laws applying on JSON data structures. to all types. Moreover, this approach lacks a sound underly- The framework poses a union type inference as a learn- ing theory. Regular expression types were also used to type ing problem from multiple examples. The categorical frame- XML documents [13], which does not allow for selecting al- work is generic and easily extensible. ternative representation. In the present study, we generalize previously introduced approaches and enable a systematic CCS Concepts: • Software and its engineering → Gen- addition of not only value sets, but inference subalgorithms, eral programming languages; • Social and professional to the union type system. topics → History of programming languages. 1.1.2 Frameworks for describing type systems. Type ACM Reference Format: systems are commonly expressed as partial relation of typ- Michał J. Gajda. 2020. Towards a more perfect union type. In Pro- ing. Their properties, such as subject reduction are also ex- ceedings of 32nd International Symposium on Implementation and pressed relatively to the relation (also partial) of reduction Application of Functional Languages (IFL2020). ACM, New York, NY, USA, 14 pages. within a term rewriting system. General formulations have been introduced for the Damas-Milner type systems param- eterized by constraints [23]. It is also worth noting that tradi- 1 Introduction tional Damas-Milner type disciplines enjoy decidability, and Typing dynamic languages has been long considered a chal- embrace the laws of soundness, and subject-reduction. How- lenge [3]. The importance of the task grown with the ubiq- ever these laws often prove too strict during type system uity cloud application programming interfaces (APIs) utiliz- extension, dependent type systems often abandon subject- ing JavaScript object notation (JSON), where one needs to reduction, and type systems of widely used programming infer the structure having only a limited number of sample languages are either undecidable [21], or even unsound [27]. documents available. Previous research has suggested it is Early approaches used lattice structure on the types [25], possible to infer adequate type mappings from sample data which is more stringent than ours since it requires idem- [2, 8, 14, 20]. potence of unification (as join operation), as well as com- In the present study, we expand on these results. We pro- plementary meet operation with the same properties. Se- pose a modular framework for type systems in program- mantic subtyping approach provides a characterization of ming languages as learning algorithms, formulate it as equa- a set-based union, intersection, and complement types [9, tional identities, and evaluate its performance on inference 10], which allows model subtype containment on first-order of Haskell data types from JSON API examples. types and functions. This model relies on building a model using infinite sets in set theory, but its rigidity fails to gen- 2 arXiv:2011.03076v2 [cs.PL] 12 Jan 2021 1.1 Related work eralize to non-idempotent learning . We are also not aware of a type inference framework that consistently and com- 1.1.1 Union type providers. The earliest practical effort pletely preserve information in the face of inconsistencies to apply union types to JSON inference to generate Haskell nor errors, beyond using bottom and expanding to infamous types [14]. It uses union type theory, but it also lacks an ex- undefined behaviour [5]. tensible theoretical framework. F# type providers for JSON We propose a categorical and constructive framework that facilitate deriving a schema automatically; however, a type preserves the soundness in inference while allowing for con- system does not support union of alternatives and is given sistent approximations. Indeed our experience is that most shape inference algorithm, instead of design driven by de- of the type system implementation may be generic. sired properties [20]. The other attempt to automatically in- fer schemas has been introduced in the PADS project [8]. Nevertheless, it has not specified a generalized type-system design methodology. One approach uses Markov chains to 1This approach uses Markov chains to infer best of alternative type representations. 2Which would allow extension with machine learning techniques like IFL2020, September, 2020, Markov chains to infer optimal type representation from frequency of oc- 2020. curing values[2]. IFL2020, September, 2020, Michał J. Gajda 2 Motivation data Example = Example { f_6408f5 :: O_6408f5 Here, we consider several examples similar to JSON API de- , f_54fced :: O_6408f5 scriptions. We provide these examples in the form of a few , f_6c9589 :: O_6408f5 } JSON objects, along with desired representation as Haskell data O_6408f5 = O_6408f5 { size, height :: Int data declaration. , difficulty :: Double , previous :: String } 1. Subsets of data within a single constructor: However, when this object has multiple keys with values a. API argument is an email – it is a subset of valid of the same structure, the best representation is that of a String values that can be validated on the client- mapping shown below. This is also an example of when user side. may decide to explicitly add evidence for one of the alter- b. The page size determines the number of results to re- native representations in the case when input samples are turn (min: 10, max:10,000) – it is also a subset of in- insufficient. (like when input samples only contain a single teger values (Int) between 10, and 10, 000 element dictionary.) c. The date field contains ISO8601 date – a record field data ExampleMap = ExampleMap ( Map Hex ExampleElt) represented as a String that contains a calendar data ExampleElt = ExampleElt { date in the format "2019-03-03" size :: Int , height :: Int 2. Optional fields: The page size is equal to 100 by default , difficulty :: Double – it means we expect to see the record like {"page_size": , previous :: String } 50} or an empty record {} that should be interpreted in the same way as {"page_size": 100} 2.1 Goal of inference 3. Variant fields: Answer to a query is either a number Given an undocumented (or incorrectly labelled) JSON API, of registered objects, or String "unavailable" - this is we may need to read the input of Haskell encoding and integer value (Int)ora String (Int :|: String) avoid checking for the presence of unexpected format devia- 4. Variant records: Answer contains either a text message tions. At the same time, we may decide to accept all known 3 with a user identifier or an error. – That can be repre- valid inputs outright so that we can use types to ensure sented as one of following options: that the input is processed exhaustively. Accordingly, we can assume that the smallest non-singleton {"message": "Where can I submit proposal?", "uid" : 1014}set is a better approximation type than a singleton set. We {"message": "Submit it to HotCRP", "uid":call 317} it minimal containing set principle. {"error" : "Authorization failed", "code": 401}Second, we can prefer types that allow for a fewer num- {"error" : "User not found", "code":ber 404} of degrees of freedom compared with the others, while data Example4 = Message { message :: String conforming to a commonly occurring structure. We denote , uid :: Int } it as an information content principle. | Error { error :: String Given these principles, and examples of frequently oc- , code :: Int } curring patterns, we can infer a reasonable world of types that approximate sets of possible values. In this way, we 5. Arrays corresponding to records: can implement type system engineering that allows deriving type system design directly from the information about data [ [1, "Nick", null ] structures and the likelihood of their occurrence. , [2, "George", "2019-04-11"] , [3, "Olivia", "1984-05-03"] ] 3 Problem definition 6. Maps of identical objects (example from [2]): As we focus on JSON, we utilize Haskell encoding of the { "6408f5": { "size": 969709 , "height":JSON 510599 term for convenient reading(from Aeson package [1]); , "difficulty": 866429.732, "previous":specified"54fced" as follows:}, "54fced": { "size": 991394 , "height": 510598data Value = Object (Map String Value) , "difficulty": 866429.823, "previous": "6c9589" }, | Array [ Value ] "6c9589": { "size": 990527 , "height": 510597 | Null , "difficulty": 866429.931, "previous": "51a0cb" } }| Number Scientific | String Text It should be noted that the last example presented above | Bool Bool requires Haskell representation inference to be non-monotonic, as an example of object with onlya single key would be best represented by a record type: 3Compiler feature of checking for unmatched cases. Towards a more perfect union type IFL2020, September, 2020, 3.1 Defining type inference correspond to a statically assigned and a narrow type. In 3.1.1 Information in the type descriptions. If an infer- this setting, mempty would be fully polymorphic type ∀0.0. ence fails, it is always possible to correct it by introducing an Languages with dynamic type discipline will treat beyond additional observation (example). To denote unification op- as untyped, dynamic value and mempty again is an entirely eration, or information fusion between two type descrip- unknown, polymorphic value (like a type of an element of 6 tions, we use a Semigroup interface operation <> to merge an empty array) . types inferred from different observations. If the semigroup class (Monoid t , Eq t, Show t) is a semilattice, then <> is meet operation (least upper bound). => Typelike t where Note that this approach is dual to traditional unification that beyond :: t -> Bool narrows down solutions and thus is join operation (greatest Besides, the standard laws for a commutative Monoid, lower bound). We use a neutral element of the Monoid to we state the new law for the beyond set: The beyond set indicate a type corresponding to no observations.