1

Why Not to Use Binary Floating Point Datatypes in RDF Jan Martin Keil Heinz Nixdorf Chair for Distributed Information Systems, Institute for Computer Science, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany Email: [email protected]

Abstract—The XSD binary floating point datatypes are regu- decimals. Subsequently, error accumulation may significantly larly used for precise numeric values in RDF. However, the use of falsify the result of the processing of these values. Disasters, these datatypes for knowledge representation can systematically such as the Patriot Missile Failure, which resulted in 28 deaths, impair the quality of data and, compared to the XSD decimal datatype, increases the probability of data processing producing illustrate the potential impact of accumulated errors in real false results. We argue why in most cases the XSD decimal world applications [6]. The increasing relevance of knowledge datatype is better suited to represent numeric values in RDF. graphs for real world applications calls for general awareness of these issues in the community. Index Terms—Datatype, Knowledge Graph, Numerical Stabil- In this paper, we discuss advantages and disadvantages of ity, , RDF, XSD. the different numeric datatypes. We aim to raise awareness of the implications of datatype selection in RDF and to enable a more informed choice in the future. This work is structured I.INTRODUCTION as follows: In Section II, we give an overview of relevant The Resource Description Framework (RDF) is the funda- standards and related work, followed by a comparison of the mental building block of knowledge graphs and the Semantic properties of the binary floating point and decimal datatypes Web. In RDF, values are represented as literals. A literal in Section III. In Section IV, we discuss the implications of consists of a lexical form, a datatype, and possibly a language the datatype properties in different use cases. An approach for tag. The use of XML Schema (XSD) built-in datatypes [1] is automatic problem detection is sketched in Section V. Finally, recommended by the RDF standard [2, Sec. 5.1]. For numeric we summarize the implications of the datatype properties in values, this includes the primitive types float, double and Section VI. decimal as well as all variations of integer1 which are derived from decimal. II.BACKGROUND The datatype decimal allows to represent numbers with Each datatype in RDF consists of a lexical space, a value arbitrary precision, whereas the datatypes float and double space, and a lexical-to-value mapping. This is compatible with allow to represent binary floating point values of limited range datatypes in XSD [2]. and precision [1]. However, in practice, the binary floating The value space of a datatype is the set of values point datatypes are regularly used for precise numeric values, for that datatype [1, 2]. even though the datatype can not accurately represent these The lexical space of a datatype is the prescribed values. In a comparison study [3] of nine unit ontologies, five set of strings which the lexical mapping for that ontologies used binary floating point values for unit conversion arXiv:2011.08077v1 [cs.DB] 13 Nov 2020 datatype maps to values of that datatype. The mem- factors, two ontologies used decimal values for conversion bers of the lexical space are lexical representations factors, and two ontologies did not provide conversion data.2 (XSD) or lexical forms (RDF) of the values to which Even a popular ontology guideline [4] and a they are mapped [1, 2]. Consortium (W3C) working group note [5] use binary floating point datatypes for precise numeric values in examples. The lexical mapping (XSD) or lexical-to-value A finite decimal with sufficient precision can represent every mapping (RDF) for a datatype is a prescribed re- possible binary floating point value. In contrast, a finite binary lation which maps from the lexical space of the floating point can not exactly represent every possible decimal datatype into its value space [1, 2]. value. Therefore, the use of binary floating point datatypes for RDF reuses many of the XSD datatypes [2]. For non- precise numeric values can cause rounding errors in the values integer numbers, XSD provides the datatypes decimal, float actually represented, compared to original values provided as and double. The datatype decimal represents a subset of the real numbers [1]. 1integer, long, int, short, byte, nonNegativeInteger, positiveInteger, un- Value space of decimal: The set of numbers that can signedLong, unsignedInt, unsignedShort, unsignedByte, nonPositiveInteger, be obtained by dividing an integer by a non-negative and negativeInteger i 2using float/double: OM 1, OM 2, QU, QUDT, SWEET; using decimal: power of ten: 10n with i ∈ Z, n ∈ N0, precision is OBOE, WD not reflected [1]. 2

Lexical space of decimal: The set of all decimal ment, and constant. According to the note, the usually appro- numbers with or without a decimal point [1]. priate datatypes are (derived datatypes of) the integer datatype Lexical mapping of decimal: Set i according to the for counts, binary floating point datatypes for measurements, decimal digits of the lexical representation and the and the decimal datatype for constants [5]. leading sign, and set n according to the position of The digital representation or computation of numerical val- the period or 0, if the period is omitted. If the sign ues can cause numerical problems: An overflow error occurs, is omitted,“+” is assumed [1]. if a represented value exceeds the maximum positive or nega- The XSD datatype float is aligned with the IEEE 32-bit tive value in the value space of a datatype. An underflow error binary floating point datatype [7]3, the XSD datatype double occurs, if a represented value is smaller than the minimum is aligned to the IEEE 64-bit binary floating point datatype positive or negative value different from zero in the value [7]. Both represent subsets of the rational numbers. They only space of a datatype. A rounding error occurs, if a represented differ in their three defining constants [1]. value is not in the value space of a datatype and is represented Value space of float (double): The set of the special by a nearby value in the value space that is determined by a values positiveZero, negativeZero, positiveInfinity, rounding scheme. A cancellation is caused by the subtraction negativeInfinity, and notANumber and the numbers of nearly equal values and eliminates leading accurate digits. that can be obtained by multiplying an integer m This will expose rounding errors in the input values of the whose absolute value is less than 224 (double: 253) subtraction. Error accumulation is the insidious growth of with a power of two whose exponent e is an integer errors due to the use of a numerically instable sequence of between −149 (double: −1074) and 104 (double: operations. 971): m × 2e [1]. Lexical space of float (double): The set of all III.PROPERTIES OF BINARY FLOATING POINTAND decimal numbers with or without a decimal point, DECIMAL DATATYPES IN RDF numbers in exponential notation, and the literals Binary floating point and decimal datatypes in the context INF, +INF, -INF, and NaN [1]. of RDF have individual properties, which make them more or Lexical mapping of float (double): Set either less suitable for specific use cases: the according numeric value (including rounding, if Float and double permit the use of positive and negative necessary), or the according special value. An imple- infinite values. Decimal supports neither positive nor negative mentation might choose between different rounding infinite values. variants that satisfy the requirements of the IEEE Float and double permit the use of exponential notation. specification. Especially in case of numbers with many leading or trailing Some RDF serialization or query languages provide a zeros, this is more convenient and less error-prone to read or shorthand syntax for numeric literals without explicit datatype write for humans. Decimal does not permit the use of expo- specification. In , TriG and SPARQL, by default, num- nential notation. There is no actual reason for this limitation. bers without fractions are integers, numbers with fractions are For example, Wikibase4 also accepts exponential notation for decimals, and numbers in exponential notation are doubles [8, decimal.5 9, 10]. In JSON-LD, by default, numbers without fractions are The value spaces of float and double only provide partial integers and numbers with fractions are doubles, to align with coverage of the lexical space. Therefore, the lexical map- the common interpretation of numbers in JSON [11]. ping ( 1 in Figure 1) might require rounding to a possible In addition to the core XSD datatypes, a W3C working binary representation and the actual value might slightly differ group note introduces the precisionDecimal datatype from the lexical representation. For example, float has no [12]. It is aligned to the IEEE decimal floating-point datatypes exact binary representation of 0.1 and, if using the default [7] and represents a subset of real numbers. It retains precision roundTiesToEven rounding scheme [7], the lexical represen- and permits the special values positiveZero, negativeZero, tation 0.1 of float actually maps to a slightly higher binary positiveInfinity, negativeInfinity, and notANumber. Further, it representation of 0.10000000149011612. Depending on the supports exponential notation. The precision and exponent val- used RDF framework, it might be possible to preserve the ues of the precisionDecimal datatype are unbounded, but exact value of the lexical representation by implementing can be restricted in derived datatypes to comply with an actual a custom mapping to decimal ( 2 in Figure 1). However, IEEE decimal floating-point datatype. However, even though this causes additional development effort and introduces non the RDF standard permits the use of precisionDecimal, it standard conform behavior. The value space of decimal covers does not demand its support in implementations [2]. Therefore, the whole lexical space. Therefore, the lexical mapping ( 3 in RDF frameworks can not be expected to always support Figure 1) always provides the exact value described in the precisionDecimal. lexical representation without any rounding. Another W3C working group note addresses the selection of The accuracy of calculations based on literals with the proper numeric datatypes for different use cases. It identified datatypes float or double ( 4 in Figure 1) is limited, as a three relevant use cases of numeric values: count, measure- 4https://wikiba.se/, SPARQL endpoint example: https://query.wikidata.org 3As the XSD recommendation refers to IEEE 754-2008 version of the 5Example SPARQL query: SELECT ("1e-9"ˆˆxsd:decimal AS standard, we do not refer to the subsequent IEEE 754-2019 version. ?d) WHERE {} 3

3 5 Decimal Lexical Default Decimal Exact* Decimal Decimal Value Representation Lexical Mapping Calculation Result Value

2 Custom Binary Floating Lexical Mapping Point Cast 6

4 Binary Floating 1 Approximate Binary Default Binary Floating Binary Floating Point Lexical Floating Point Lexical Mapping Point Value Point Calculation Representation Result Value

Default Handling Decimal Cast 7 Custom Handling (Additional Effort)

8 Approximate Exact Value Approximate Decimal Decimal Decimal Value Calculation Approximate Value Result Value

* Except for calculations with infinite decimal expansion of (intermediate) results

Fig. 1. Possible processing paths of numeric literals depending on their datatype. properly implemented RDF framework will use binary floating very large and very small values and reduces the risk of typing point arithmetic by default. For example, this happens during errors due to missing or additional zeros. Further, they provide the execution of SPARQL queries that include arithmetic a representation of positive and negative infinite, which might functions or aggregations. Therefore, the calculations might be necessary in a few cases. be affected by various numeric problems, i.e. underflow er- However, in most cases, knowledge deals with exact decimal rors, overflow errors, rounding errors, cancellation, and error numbers or ranges of decimal values. Ranges are typically accumulation, as demonstrated in Figure 2. Calculations based described with two exact decimal numbers, either with a on decimals ( 5 in Figure 1) will by default use a decimal minimum value and a maximum value or an value and an error arithmetic with arbitrary precision. Therefore, they might only range. Binary floating point datatypes do not allow the accurate be affected by rounding errors in case of (intermediate) results representation of exactly known or defined numbers in many with infinite decimal expansion, as well as accumulations of cases. In addition, they entail the risk to fool data curators these rounding errors, as demonstrated in Figure 2. However, into believing that they stated the exact number, as the lexical calculations using decimal arithmetic with arbitrary precision representation on first sight appears to be exact. This becomes probably are significantly slower, compared to calculations even more critical, if double was used unintentionally due to using binary floating point arithmetic with limited precision. the shorthand syntax in Turtle, TriG, SPARQL, or JSON-LD. Depending on the used RDF framework, it might be possible to Therefore, binary floating point datatypes are not suitable to cast between the datatypes ( 6 and 7 in Figure 1). However, fulfill the requirements for knowledge representation. a value cast from binary floating point to decimal ( 7 in In consequence, the knowledge can not be used for exact Figure 1) is still affected by the rounding error of the floating calculations without programming overhead. The possible point value caused by the lexical mapping and subsequent small rounding errors of binary floating point input values calculations will still ( 8 in Figure 1) result in approximate might accumulate to significant errors in calculation results. results only. In contrast, the results of calculations based on Disasters, as the Patriot Missile Failure, illustrate the potential a value cast from decimal to floating point ( 6 in Figure 1) impact of accumulated errors in real world applications [6]. and based on an initial floating point value ( 1 in Figure 1) do This contradicts a W3C working group note [5], stating that not differ, if the same rounding method is used. The SPARQL binary floating point datatypes are appropriate for measure- query in Figure 2 demonstrates different numerical problems ments. It provided the following example representation of a of the datatypes. measurement in the interval of 73.0 to 73.2:

IV. DISCUSSION _:w eg:value "73.1"ˆˆxsd:float . _:w eg:errorRange "0.1"ˆˆxsd:float . The traditional use case of RDF is the representation of knowledge. The floating point datatypes provide two advan- However, if using the default roundTiesToEven tages for the knowledge representation compared to the deci- rounding scheme [7], this example actually represents mal datatype: Exponential notation eases the representation of a measurement in the interval 72.99999847263098388 4

SELECT ?datatype (xsd:decimal(STRDT("0.1", ?datatype)) AS ?rounded) (xsd:decimal(STRDT("1", ?datatype)/STRDT("3", ?datatype)) AS ?roundedInfinit) (xsd:decimal(STRDT("1.0000001", ?datatype) - STRDT("1.0000000", ?datatype)) AS ?cancellation) (STRDT("1000000000000000000000000000000000000000", ?datatype) AS ?overflow) (STRDT("0.0000000000000000000000000000000000000000000001", ?datatype) AS ?underflow) WHERE {VALUES ?datatype {xsd:float xsd:decimal}} datatype rounded roundedInfinit cancellation xsd:float 0.10000000149011612 0.3333333432674408 0.00000011920928955078125 xsd:decimal 0.1 0.33333333333333333333 0.0000001 overflow underflow Infinity 0.0 1000000000000000000000000000000000000000 0.0000000000000000000000000000000000000000000001 Fig. 2. Top: A SPARQL query that demonstrates differing numerical problems of the datatypes float and decimal. Bottom: The corresponding query output, as on http://query.wikidata.org. to 73.19999847561121612, as 73.1 and 0.1 are not in into decimal for the RDF representation and back would be the value space of float.6 In consequence, the actual superfluous. As the original value can only contain values with represented error interval misses to cover all points between an exact binary representation, the corruption of data with 73.19999847561121612 and 73.2. A common solution for rounding is impossible. Thus, the use of floating point data- this problem is the use of different rounding schemes for types for the exchange of computational results is reasonable. the calculation of the upper and lower bound of the interval (outward rounding) [13]. However, this is not provided in V. AUTOMATIC DISTORTION DETECTION current RDF frameworks and causes additional programming The automatic detection of quality issues is key to an effec- effort. The example shows that also in case of measurements tive quality assurance. Therefore, RDF editors, like Proteg´ e´7, binary floating point datatypes have clear disadvantages or evaluation tools, like the OntOlogy Pitfall Scanner! [14], compared to the decimal datatype. would ideally warn data curators, if the use of binary floating Further, the use of binary floating point values restricts the point datatypes would distort numeric values. selection of the used arithmetic for calculations, as it causes a A simple test can be implemented by comparing the results notable overhead for the application of decimal arithmetic with of the default mapping to a binary floating point value ( 1 in arbitrary precision. However, the selection of an arithmetic Figure 1) and a custom mapping to a decimal value ( 2 in must be up to the application, not to the input data, as Figure 1). The SPARQL query in Figure 3 demonstrates the applications might widely vary regarding the required accuracy approach. and the numerical conditioning of the underlying problem. The same problem arises in use cases that involve the comparison of values: Binary floating point datatypes might VI.CONCLUSION impair ontology matching results, because of a decreased The use of binary floating point datatypes in RDF can similarity of values that should have been equal. Further, systematically impair the quality of data and falsify the results they might impair data validation based on ontologies, as of data processing. This can cause serious impacts in real constraints become blurred due to rounding. For example, world applications. Therefore, binary floating point datatypes if using the default roundTiesToEven rounding scheme, an should only be used for numeric values if (a) a representation upper bound of "0.1"ˆˆxsd:float still permits a value of infinity is required, or (b) the original source provides binary of 0.100000001. Thus, the use of binary floating point floating point values. In general, the decimal datatype must datatypes for knowledge representation can systematically be the first choice. Semantic Web teaching materials should impair the quality of data and increases the probability of data name the disadvantages of the binary floating point datatypes, processing producing false results. shorthand syntaxes should prioritize the decimal datatype, and In other use cases, RDF might be used for the exchange of Semantic Web tools should hint to use decimal. initially binary floating point values, as computational results However, the datatype decimal also has some disadvantages. or the output of analog-to-digital converters. If the data to It would be desirable to introduce in RDF mandatory support exchange are binary floating point values, the transformation for (a) an exponential notation for the decimal datatype, and (b) a decimal datatype that supports infinite values, like 6Lexical mappings (roundTiesToEven rounding scheme): 73.1 → precisionDecimal, to eliminate these disadvantages. 73.0999984741211 and 0.1 → 0.10000000149011612, Interval calcula- tions: 73.0999984741211 ± 0.10000000149011612 7https://protege.stanford.edu/ 5

SELECT (xsd:decimal(?defaultMapping) AS ?float) (?customMapping AS ?decimal) (xsd:decimal(?defaultMapping) != ?customMapping AS ?distorted) WHERE { VALUES ?lexical {"10" "1" "0.1" "0.5"} BIND(STRDT(?lexical,xsd:float) AS ?defaultMapping) BIND(STRDT(?lexical,xsd:decimal) AS ?customMapping) } float decimal distorted 0.5 0.5 false 0.10000000149011612 0.1 true 1 1 false 10 10 false Fig. 3. Top: A SPARQL query that demonstrates an approach to detect number distortion. Bottom: The corresponding query output, as on http://query. wikidata.org.

ACKNOWLEDGMENT 7. IEEE. IEEE 754-2008 Standard for Floating-Point Arithmetic. Standard. Aug. 29, 2008. DOI: 10.1109/IEEESTD.2008.4610935. Many thanks to Sheeba Samuel, Sirko Schindler, Eberhard 8. D. Beckett, T. Berners-Lee, E. Prud’hommeaux, and G. Carothers. RDF Zehendner, and my supervisor Birgitta Konig-Ries¨ for very 1.1 Turtle: Terse RDF Triple Language. Ed. by E. Prud’hommeaux and helpful comments on earlier drafts of this manuscript. G. Carothers. W3C Recommendation. Feb. 25, 2014. URL: https://www. w3.org/TR/2014/REC-turtle-20140225/. 9. C. Bizer and R. Cyganiak. RDF 1.1 TriG: RDF Dataset Language. Ed. by REFERENCES G. Carothers and A. Seaborne. W3C Recommendation. Feb. 25, 2014. URL: https://www.w3.org/TR/2014/REC-trig-20140225/. 1. D. Peterson, S. Gao, A. Malhotra, et al., eds. W3C XML Schema Def- 10. S. Harris and A. Seaborne, eds. SPARQL 1.1 Query Language. W3C inition Language (XSD) 1.1 Part 2: Datatypes. W3C Recommendation. Recommendation. Mar. 21, 2013. URL: https://www.w3.org/TR/2013/ May 5, 2012. URL: http://www.w3.org/TR/2012/REC-xmlschema11-2- REC-sparql11-query-20130321/. 20120405/. 11. M. Sporny, D. Longley, G. Kellogg, et al. JSON-LD 1.1: A JSON-based 2. R. Cyganiak, D. Wood, M. Lanthaler, et al., eds. RDF 1.1 Concepts Serialization for . Ed. by G. Kellogg, P.-A. Champin, D. and Abstract Syntax. W3C Recommendation. Feb. 25, 2014. URL: http: Longley, et al. W3C Recommendation. July 16, 2020. URL: https://www. //www.w3.org/TR/2014/REC-rdf11-concepts-20140225/. w3.org/TR/2020/REC-json-ld11-20200716/. 3. J. M. Keil and S. Schindler. “Comparison and evaluation of ontologies 12. D. Peterson and C. M. Sperberg-McQueen, eds. An XSD datatype for for units of measurement”. In: Semantic Web 10.1 (2019), pp. 33–51. IEEE floating-point decimal. W3C Working Group Note. June 9, 2011. DOI: 10.3233/SW-180310. URL: https : / / www. w3 . org / TR / 2011 / NOTE - xsd - precisionDecimal - 4. N. F. Noy and D. L. McGuinness. Ontology Development 101: A 20110609/. Guide to Creating Your First Ontology. Tech. rep. KSL-01-05/SMI-2001- 13. A. Neumaier. Introduction to Numerical Analysis. Cambridge University 0880. Stanford Knowledge Systems Laboratory and Stanford Medical Press, Aug. 23, 2012. 366 pp. Informatics, Mar. 2001. URL: http://www.ksl.stanford.edu/people/dlm/ 14. M. Poveda-Villalon,´ A. Gomez-P´ erez,´ and M. C. Suarez-Figueroa.´ papers/ontology-tutorial-noy-mcguinness-abstract.html. “OOPS! (OntOlogy Pitfall Scanner!): An On-line Tool for Ontology 5. J. J. Carroll and J. Z. Pan, eds. XML Schema Datatypes in RDF and Evaluation”. In: International Journal on Semantic Web and Information OWL. W3C Working Group Note. Mar. 14, 2006. URL: https://www.w3. Systems 10.2 (2014), pp. 7–34. DOI: 10.4018/ijswis.2014040102. org/TR/2006/NOTE-swbp-xsch-datatypes-20060314/. 6. Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia. Tech. rep. GAO/IMTEC-92-26. General Ac- counting Office, Information Management and Technology Division, Feb. 4, 1992. 20 pp. URL: https://www.gao.gov/products/IMTEC-92-26.