
Why Not to Use Binary Floating Point Datatypes in RDF

Jan Martin Keil
Heinz Nixdorf Chair for Distributed Information Systems, Institute for Computer Science, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany
Email: [email protected]

arXiv:2011.08077v1 [cs.DB] 13 Nov 2020

Abstract: The XSD binary floating point datatypes are regularly used for precise numeric values in RDF. However, the use of these datatypes for knowledge representation can systematically impair the quality of data and, compared to the XSD decimal datatype, increases the probability of data processing producing false results. We argue why in most cases the XSD decimal datatype is better suited to represent numeric values in RDF.

Index Terms: Datatype, Knowledge Graph, Numerical Stability, Ontology Engineering, RDF, XSD.

I. INTRODUCTION

The Resource Description Framework (RDF) is the fundamental building block of knowledge graphs and the Semantic Web. In RDF, values are represented as literals. A literal consists of a lexical form, a datatype, and possibly a language tag. The use of XML Schema (XSD) built-in datatypes [1] is recommended by the RDF standard [2, Sec. 5.1]. For numeric values, this includes the primitive types float, double and decimal as well as all variations of integer¹, which are derived from decimal.

The datatype decimal can represent numbers with arbitrary precision, whereas the datatypes float and double represent binary floating point values of limited range and precision [1]. However, in practice, the binary floating point datatypes are regularly used for precise numeric values, even though these datatypes cannot represent such values accurately. In a comparison study [3] of nine unit ontologies, five ontologies used binary floating point values for unit conversion factors, two ontologies used decimal values for conversion factors, and two ontologies did not provide conversion data.² Even a popular ontology guideline [4] and a World Wide Web Consortium (W3C) working group note [5] use binary floating point datatypes for precise numeric values in examples.

A finite decimal with sufficient precision can represent every possible binary floating point value. In contrast, a finite binary floating point value cannot exactly represent every possible decimal value. Therefore, the use of binary floating point datatypes for precise numeric values can cause rounding errors in the values actually represented, compared to the original values provided as decimals. Subsequently, error accumulation may significantly falsify the result of the processing of these values. Disasters such as the Patriot Missile Failure, which resulted in 28 deaths, illustrate the potential impact of accumulated errors in real world applications [6]. The increasing relevance of knowledge graphs for real world applications calls for general awareness of these issues in the Semantic Web community.

In this paper, we discuss advantages and disadvantages of the different numeric datatypes. We aim to raise awareness of the implications of datatype selection in RDF and to enable a more informed choice in the future. This work is structured as follows: In Section II, we give an overview of relevant standards and related work, followed by a comparison of the properties of the binary floating point and decimal datatypes in Section III. In Section IV, we discuss the implications of the datatype properties in different use cases. An approach for automatic problem detection is sketched in Section V. Finally, we summarize the implications of the datatype properties in Section VI.

¹ integer, long, int, short, byte, nonNegativeInteger, positiveInteger, unsignedLong, unsignedInt, unsignedShort, unsignedByte, nonPositiveInteger, and negativeInteger
² using float/double: OM 1, OM 2, QU, QUDT, SWEET; using decimal: OBOE, WD

II. BACKGROUND

Each datatype in RDF consists of a lexical space, a value space, and a lexical-to-value mapping. This is compatible with datatypes in XSD [2].

The value space of a datatype is the set of values for that datatype [1, 2].

The lexical space of a datatype is the prescribed set of strings which the lexical mapping for that datatype maps to values of that datatype. The members of the lexical space are lexical representations (XSD) or lexical forms (RDF) of the values to which they are mapped [1, 2].

The lexical mapping (XSD) or lexical-to-value mapping (RDF) for a datatype is a prescribed relation which maps from the lexical space of the datatype into its value space [1, 2].

RDF reuses many of the XSD datatypes [2]. For non-integer numbers, XSD provides the datatypes decimal, float and double. The datatype decimal represents a subset of the real numbers [1].

Value space of decimal: The set of numbers that can be obtained by dividing an integer by a non-negative power of ten: $\frac{i}{10^n}$ with $i \in \mathbb{Z}$, $n \in \mathbb{N}_0$; precision is not reflected [1].

Lexical space of decimal: The set of all decimal numbers with or without a decimal point [1].

Lexical mapping of decimal: Set $i$ according to the decimal digits of the lexical representation and the leading sign, and set $n$ according to the position of the period or 0, if the period is omitted. If the sign is omitted, "+" is assumed [1].

The XSD datatype float is aligned with the IEEE 32-bit binary floating point datatype [7]³; the XSD datatype double is aligned with the IEEE 64-bit binary floating point datatype [7]. Both represent subsets of the rational numbers. They only differ in their three defining constants [1].

Value space of float (double): The set of the special values positiveZero, negativeZero, positiveInfinity, negativeInfinity, and notANumber and the numbers that can be obtained by multiplying an integer $m$ whose absolute value is less than $2^{24}$ (double: $2^{53}$) with a power of two whose exponent $e$ is an integer between $-149$ (double: $-1074$) and $104$ (double: $971$): $m \times 2^e$ [1].

Lexical space of float (double): The set of all decimal numbers with or without a decimal point, numbers in exponential notation, and the literals INF, +INF, -INF, and NaN [1].

Lexical mapping of float (double): Set either the corresponding numeric value (including rounding, if necessary), or the corresponding special value. An implementation might choose between different rounding variants that satisfy the requirements of the IEEE specification.

Some RDF serialization or query languages provide a shorthand syntax for numeric literals without explicit datatype specification. In Turtle, TriG and SPARQL, by default, numbers without fractions are integers, numbers with fractions are decimals, and numbers in exponential notation are doubles [8, 9, 10]. In JSON-LD, by default, numbers without fractions are integers and numbers with fractions are doubles, to align with the common interpretation of numbers in JSON [11].

In addition to the core XSD datatypes, a W3C working group note introduces the precisionDecimal datatype [12]. It is aligned with the IEEE decimal floating-point datatypes [7] and represents a subset of the real numbers. It retains precision and permits the special values positiveZero, negativeZero, positiveInfinity, negativeInfinity, and notANumber. Further, it supports exponential notation. The precision and exponent values of the precisionDecimal datatype are unbounded, but can be restricted in derived datatypes to comply with an actual IEEE decimal floating-point datatype. However, even though the RDF standard permits the use of precisionDecimal, it [...]

[...]ment, and constant. According to the note, the usually appropriate datatypes are (derived datatypes of) the integer datatype for counts, binary floating point datatypes for measurements, and the decimal datatype for constants [5].

The digital representation or computation of numerical values can cause numerical problems: An overflow error occurs if a represented value exceeds the maximum positive or negative value in the value space of a datatype. An underflow error occurs if a represented value is smaller than the minimum positive or negative value different from zero in the value space of a datatype. A rounding error occurs if a represented value is not in the value space of a datatype and is represented by a nearby value in the value space that is determined by a rounding scheme. A cancellation is caused by the subtraction of nearly equal values and eliminates leading accurate digits. This will expose rounding errors in the input values of the subtraction. Error accumulation is the insidious growth of errors due to the use of a numerically unstable sequence of operations.

III. PROPERTIES OF BINARY FLOATING POINT AND DECIMAL DATATYPES IN RDF

Binary floating point and decimal datatypes in the context of RDF have individual properties, which make them more or less suitable for specific use cases:

Float and double permit the use of positive and negative infinite values. Decimal supports neither positive nor negative infinite values.

Float and double permit the use of exponential notation. Especially for numbers with many leading or trailing zeros, this is more convenient and less error-prone for humans to read or write. Decimal does not permit the use of exponential notation. There is no actual reason for this limitation. For example, Wikibase⁴ also accepts exponential notation for decimal.⁵

The value spaces of float and double only provide partial coverage of the lexical space. Therefore, the lexical mapping (1 in Figure 1) might require rounding to a possible binary representation, and the actual value might slightly differ from the lexical representation. For example, float has no exact binary representation of 0.1 and, if using the default roundTiesToEven rounding scheme [7], the lexical representation 0.1 of float actually maps to a slightly higher binary representation of 0.10000000149011612. Depending on the RDF framework used, it might be possible to preserve the exact value of the lexical representation by implementing a custom mapping to decimal (2 in Figure 1). However, this causes additional development effort and introduces non-standard behavior.
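The rounding of the lexical representation 0.1 under the float datatype can be reproduced with a minimal sketch. It assumes Python's struct module as a stand-in for the float parser of an RDF framework; the helper name to_float32 is hypothetical and not part of any RDF library. The sketch also decomposes the stored value into the form m × 2^e used in the value space definition of float.

```python
import struct
from decimal import Decimal
from fractions import Fraction


def to_float32(lexical: str) -> float:
    """Map a decimal lexical form to a nearby IEEE 754 binary32 value.

    Hypothetical helper: the string is first parsed as binary64 and then
    narrowed to binary32, which is sufficient for this illustration but may
    differ from direct rounding to binary32 in rare edge cases.
    """
    return struct.unpack("<f", struct.pack("<f", float(lexical)))[0]


value = to_float32("0.1")

# Exact decimal expansion of the binary value that is actually stored.
exact = Decimal(value)
print(exact)        # 0.100000001490116119384765625

# Shortest binary64 representation of the same value, as quoted in the text.
print(repr(value))  # 0.10000000149011612

# The stored value has the form m * 2**e with |m| < 2**24.
ratio = Fraction(value)
m, e = ratio.numerator, -(ratio.denominator.bit_length() - 1)
print(m, e)         # 13421773 -27

# A decimal literal keeps the lexical value exactly.
print(Decimal("0.1") == exact)  # False
```

The last comparison shows the difference between the value of the decimal literal and the value of the float literal with the same lexical representation.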
