Risk Analytics using Knowledge Graphs / FIBO with Deep Learning

Live Date: October 21, 2020

Featuring: Greg Steck, Director of Data Engineering, FI Consulting; Thomas Cook, Director of Sales, Cambridge Semantics

Recording: bit.ly/3ofmn4Y
Presentation: bit.ly/37IHIOl

edmcouncil.org | cambridgesemantics.com

WEBINAR Q&A:

Where do we start and what is the business case?
We start by finding the right use case and presenting some or all of the business challenges we described, such as integrating disparate datasets and building a flexible model for frequently changing business requirements. The business case is built around the time and cost savings for data engineering teams as well as the faster and deeper insights for business teams.

Where/how is the pretty, organized FIBO mapped to the ugly world of production systems?
We started by building a basic model of our domain (mortgage loans) as the organization sees it and then layered in FIBO terms where applicable. Sometimes the FIBO model made more sense than our initial model, and other times the concepts did not yet exist in FIBO (e.g., loan accounting rules).
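A minimal sketch of that layering, using rdflib: a local class is declared a subclass of the corresponding FIBO concept where one exists, and the model is extended locally where FIBO has no term yet. The FIBO IRI and all local names here are illustrative, not the project's actual model.

```python
# Sketch only: layering FIBO terms onto a local mortgage model.
# The FIBO namespace/IRI shown is illustrative, not authoritative.
from rdflib import Graph, Namespace, RDF, RDFS

FIBO_LOAN = Namespace("https://spec.edmcouncil.org/fibo/ontology/LOAN/LoansGeneral/Loans/")
EX = Namespace("http://example.com/mortgage#")

g = Graph()
g.bind("fibo-loan", FIBO_LOAN)
g.bind("ex", EX)

# Reuse the FIBO concept where one exists...
g.add((EX.MortgageLoan, RDFS.subClassOf, FIBO_LOAN.Loan))

# ...and extend the model where FIBO has no concept yet
# (e.g., loan accounting rules).
g.add((EX.LoanAccountingRule, RDF.type, RDFS.Class))
g.add((EX.MortgageLoan, EX.governedBy, EX.LoanAccountingRule))

print(g.serialize(format="turtle"))
```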

Can we create a knowledge graph from financial reports in iXBRL format? Are there any challenges or points that we should take into consideration?
Anzo Data Fabric will automatically create an ontology from XML that you can then blend with other parts of the knowledge graph. XBRL is an XML format. If you need to process iXBRL and cannot access the report in the pure-XML XBRL format, you will need to do some custom pre-processing to strip the XBRL content out of the HTML, and you could use NLP to process any additional data in the HTML portion of the document.
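As a rough illustration of that pre-processing step (a sketch only; the file name is hypothetical), the tagged facts in an iXBRL document live in the ix: namespace and can be pulled out of the XHTML with an XML parser such as lxml:

```python
# Sketch: extract tagged iXBRL facts from the surrounding HTML.
from lxml import etree

IX = "http://www.xbrl.org/2013/inlineXBRL"  # inline XBRL 1.1 namespace

# iXBRL filings are XHTML; for messier HTML, pass an etree.HTMLParser.
tree = etree.parse("report.xhtml")  # hypothetical filing

facts = []
# ix:nonFraction carries numeric facts, ix:nonNumeric the rest.
for tag in ("nonFraction", "nonNumeric"):
    for el in tree.iter(f"{{{IX}}}{tag}"):
        facts.append({
            "concept": el.get("name"),      # e.g. a us-gaap concept
            "context": el.get("contextRef"),
            "value": (el.text or "").strip(),
        })

print(facts[:5])
```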

We also recommend looking at the RDF Data Cube ontology. We have modeled some financial reporting data using that ontology, and it has worked quite well.
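For instance, a single reported figure can be modeled as a qb:Observation. A minimal sketch with rdflib, where the dataset, dimension, and measure names are illustrative:

```python
# Sketch: one financial reporting figure as an RDF Data Cube observation.
from rdflib import Graph, Namespace, RDF, Literal, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.com/reporting#")

g = Graph()
g.bind("qb", QB)
g.bind("ex", EX)

g.add((EX.q3Revenue, RDF.type, QB.Observation))
g.add((EX.q3Revenue, QB.dataSet, EX.revenueDataSet))
g.add((EX.q3Revenue, EX.period, Literal("2020-Q3")))         # dimension
g.add((EX.q3Revenue, EX.amountUSD,
       Literal("1250000.00", datatype=XSD.decimal)))          # measure

print(g.serialize(format="turtle"))
```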


Can you illustrate more about data ingestion? How was data ingested into the KG for this case study? Manually? Automatically from an RDBMS? Automatically from unstructured data? Something else?
For this use case we used the following methods to get data into the graph:
• AnzoGraph Graph Data Interface (GDI) to access and load data from the FRED API
• Sent JSON-LD objects to AnzoGraph using Python
• PySpark / RDFlib – the large Fannie Mae loan data was being cleansed using PySpark, so we wrote a Spark UDF to write the Spark dataframe to RDF directly via RDFlib (a sketch follows below)
An easy way to get large data into AnzoGraph is to read RDF or CSV files directly. Beyond that, 200+ data source types can be accessed directly from the Graph Data Interface (GDI), including JDBC, HTTP, Elasticsearch, Kafka, etc.
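A minimal sketch of that PySpark / RDFlib step, shown here with mapPartitions rather than a Spark UDF for simplicity; the column names, IRIs, and file paths are illustrative, not the project's actual code:

```python
# Sketch: convert each partition of a cleansed loan dataframe to
# N-Triples with rdflib, producing RDF files that can be bulk-loaded.
from pyspark.sql import SparkSession
from rdflib import Graph, Namespace, Literal, XSD

EX = Namespace("http://example.com/loan#")

def partition_to_ntriples(rows):
    g = Graph()
    for row in rows:
        loan = EX[f"loan/{row.loan_id}"]
        g.add((loan, EX.originalBalance,
               Literal(row.orig_balance, datatype=XSD.decimal)))
        g.add((loan, EX.state, Literal(row.state)))
    # One N-Triples chunk per partition.
    yield g.serialize(format="nt")

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("loans.csv", header=True)  # hypothetical input
df.rdd.mapPartitions(partition_to_ntriples).saveAsTextFile("loans_nt")
```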

Have you worked with any domains or applications that try to crowdsource data for the KG?
No, we have not come across a use case for crowdsourcing data into the KG, but we would potentially look at the CKAN data catalog, as it has some capabilities to ingest RDF data from users.

How does it work with online information? Is it possible to use data that is loaded in real time?
AnzoGraph DB supports a cron-like mechanism where queries can be scheduled internally from within the database. One of the primary use cases for AnzoGraph cron jobs is to schedule and regularly pull data from an external data or message streaming service into AnzoGraph. A prime example is Apache Kafka, an open-source messaging platform that many customers have incorporated into their data pipeline architectures. Intervals are typically set to 30 seconds or 1 minute, and data is loaded from streams in micro-batches. AnzoGraph comes with a Kafka reader service, but other streaming sources would require a custom service.
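To make the micro-batch pattern concrete, here is a generic sketch (not the built-in AnzoGraph Kafka reader or its cron syntax): poll Kafka with kafka-python on a roughly 30-second interval, convert the messages to triples, and POST a SPARQL INSERT to an update endpoint. The topic, endpoint, and message shape are all assumptions.

```python
# Sketch: Kafka -> SPARQL endpoint in ~30-second micro-batches.
import json
import requests
from kafka import KafkaConsumer  # kafka-python

ENDPOINT = "http://localhost:7070/sparql"  # hypothetical update endpoint
consumer = KafkaConsumer("trades",
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v))

while True:
    # Collect whatever arrived in the last interval.
    batch = consumer.poll(timeout_ms=30_000)
    triples = []
    for records in batch.values():
        for rec in records:
            msg = rec.value
            triples.append(
                f'<http://example.com/trade/{msg["id"]}> '
                f'<http://example.com/price> "{msg["price"]}" .')
    if triples:
        update = "INSERT DATA { " + " ".join(triples) + " }"
        requests.post(ENDPOINT, data={"update": update})
```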

How much time did it take to build the initial ontology?
For a POC at a client, we spent three to four weeks iterating on the ontology to get the analytics up and running.

How were the FIBO terms for mortgage loan etc. consumed? I think the example terms are in the non-Production-level parts of FIBO; were there any issues using these? (They are conceptually complete but not optimized for OWL-based applications; I would assume that's not a problem, but it would be good to hear about this.)
It was not a problem to use some of the mortgage terms from FIBO; we were not using a lot of inferencing with the large data (although AnzoGraph supports RDFS+ inferencing).
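One common lightweight substitute for full RDFS inference (not necessarily what was done in this project) is a SPARQL property path over rdfs:subClassOf, which retrieves instances of a class and all of its subclasses without running an inference engine:

```python
# Sketch: subclass "reasoning" via a SPARQL 1.1 property path.
from rdflib import Graph

g = Graph()
g.parse("loans.ttl", format="turtle")  # hypothetical data + ontology

q = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?loan WHERE {
  ?loan a/rdfs:subClassOf* <http://example.com/mortgage#MortgageLoan> .
}
"""
for row in g.query(q):
    print(row.loan)
```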

What tool did you use for the visualizations of portions of FIBO?
We used the tool Metaphactory, specifically its Ontodia functionality, to create diagrams for ontology visualization. This can also be accomplished with the Anzo Data Fabric, but that tool was not used for this use case.


You are making a compelling case for graph over relational. What are the benefits to using graph over document DBs (e.g. MongoDB)?
There are a couple of key benefits to using graph over document databases. The first is the use of global identifiers in RDF, which allows for better scalability and interoperability in an enterprise; the other is graph's native ability to run multi-hop queries and traverse the graph more easily. Document databases and graph databases solve fundamentally different problems.

Relationships defined within the graph are a key element of a graph database, as are the ontologies that define the structure of the data. Queries can access and connect any piece of data with any other to perform complex analytics like those available in any SQL database, combined with the insights derived from the relationships in the data. Graph algorithms and other graph analytics provide additional insights from those relationships. Direct data access to 200+ sources (including document databases like MongoDB), data virtualization, geospatial, data science, and linear algebra libraries, and many other extensions to the SPARQL language (including user-defined extensions) extend the functionality and flexibility of AnzoGraph as a data management and integration platform that supports real-time analytics at scale.
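A minimal sketch of the kind of multi-hop traversal that is awkward in a document store but natural in SPARQL; the data file, IRIs, and property names are illustrative:

```python
# Sketch: three "hops" in a single query -
# borrower -> loan -> property -> region.
from rdflib import Graph

g = Graph()
g.parse("risk.ttl", format="turtle")  # hypothetical data

q = """
PREFIX ex: <http://example.com/loan#>
SELECT ?borrower ?region WHERE {
  ?borrower ex:holds ?loan .          # hop 1
  ?loan ex:securedBy ?property .      # hop 2
  ?property ex:locatedIn ?region .    # hop 3
}
"""
for row in g.query(q):
    print(row.borrower, row.region)
```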

"Data" comes from systems. Do you also look at the systems that produce the data? Yes, as part of the metadata that we capture we look at the systems that produce the data. We typically have microservices producing or transforming the data and we log as part of the data lineage what each microservice is doing to the data in the graph. That way we can see the full picture of the data—what system produced the data I am looking at.

How does the FIBO ontology align with BIAN Service Domains and BOMs?
Our loan ontology that includes FIBO terms doesn't connect directly with BIAN; however, the metadata we capture as the basis for our data lineage does make reference to BIAN Service Domains. The microservices that send the data lineage are defined using the BIAN methodology to identify service domains.

Are the dependencies in the data lineage visualization constructed automatically?
Yes, the dependencies in the data lineage are connected automatically. We do it with a combination of SPARQL insert queries and a set of Python decorators that send JSON-LD objects to the knowledge graph API, capturing lineage as the process is being executed (a sketch of the decorator approach follows below). Data provenance (lineage) is also captured in the Anzo Data Fabric from Cambridge Semantics, but that was not showcased in this use case.
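A minimal sketch of such a decorator, assuming a hypothetical lineage endpoint and using PROV-O terms for the JSON-LD payload; the project's actual vocabulary and API were not shown in the webinar:

```python
# Sketch: wrap a pipeline step so a JSON-LD lineage record is POSTed
# to the knowledge graph API as the step executes.
import functools
import time
import requests

LINEAGE_API = "http://localhost:8080/lineage"  # hypothetical endpoint

def capture_lineage(step_name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            started = time.time()
            result = fn(*args, **kwargs)
            record = {
                "@context": {"prov": "http://www.w3.org/ns/prov#"},
                "@id": f"http://example.com/activity/{step_name}/{int(started)}",
                "@type": "prov:Activity",
                "prov:startedAtTime": started,
                "prov:endedAtTime": time.time(),
            }
            requests.post(LINEAGE_API, json=record)
            return result
        return wrapper
    return decorator

@capture_lineage("cleanse_loans")
def cleanse_loans(df):
    ...  # transformation logic
```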


How does AnzoGraph work for handling data compression and data security for row-level sensitivity?
AnzoGraph has an IRI dictionary that internally compresses full IRI strings down to an integer format. Access control lists (ACLs) are a new feature arriving in an upcoming release that will provide data security.
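A conceptual sketch of dictionary compression in general (not AnzoGraph's actual implementation): each full IRI string is stored once and replaced internally by a small integer id, so repeated IRIs cost almost nothing.

```python
# Conceptual sketch of an IRI dictionary.
class IriDictionary:
    def __init__(self):
        self._to_id = {}
        self._to_iri = []

    def encode(self, iri: str) -> int:
        if iri not in self._to_id:
            self._to_id[iri] = len(self._to_iri)
            self._to_iri.append(iri)
        return self._to_id[iri]

    def decode(self, ident: int) -> str:
        return self._to_iri[ident]

d = IriDictionary()
t = (d.encode("http://example.com/loan/123"),
     d.encode("http://example.com/price"),
     d.encode("http://example.com/loan/123"))  # repeated IRI reuses id 0
print(t)  # (0, 1, 0)
```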

Is Solidatus a third-party lineage tool?
Yes, Solidatus is a third-party tool geared towards regulatory lineage for banks and other financial institutions. It has a SPARQL integration that allows us to feed the data we captured in our risk analytics pipeline to Solidatus for visualization.

Is Solidatus tightly coupled here with Anzo? What options are there for reporting lineage on top of Anzo?
Solidatus is a separate product from Anzo and AnzoGraph, and they are not integrated. The Anzo Data Fabric does have a graphical provenance (data lineage) report for datasets that shows which layers and transformations were used, tracking the lineage of elements.

If I have a data catalogue available, is there a simple way to create an ontology in Anzo? How did you open the initial discussion with sponsors to showcase the value of moving to graph along with FIBO? Did you start with a small POC?
With the Anzo Data Fabric, we make it very easy to connect to enterprise data sources (databases and files), import the metadata from those sources, and automatically generate an ontology from them. This includes inferring the relationships that exist between classes.

We also recommend looking at the Data Catalog Vocabulary (DCAT); it is a good ontology for describing data catalogs and works well with tools like CKAN.
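A minimal sketch of a DCAT description built with rdflib; the dataset names and download URL are illustrative:

```python
# Sketch: one catalog entry described with DCAT.
from rdflib import Graph, Namespace, RDF, Literal, URIRef

DCAT = Namespace("http://www.w3.org/ns/dcat#")
DCT = Namespace("http://purl.org/dc/terms/")
EX = Namespace("http://example.com/catalog#")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCT)

g.add((EX.loanPerformance, RDF.type, DCAT.Dataset))
g.add((EX.loanPerformance, DCT.title, Literal("Loan performance data")))
g.add((EX.loanPerformance, DCAT.distribution, EX.loanPerformanceCsv))
g.add((EX.loanPerformanceCsv, RDF.type, DCAT.Distribution))
g.add((EX.loanPerformanceCsv, DCAT.downloadURL,
       URIRef("http://example.com/data/loans.csv")))

print(g.serialize(format="turtle"))
```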

We were able to show the value of moving to the graph by starting with a small POC that incorporated data from several upstream models that had relied on a complex ETL process. We simplified the process and showed how each upstream process could insert its data directly into the graph, so that the graph was built as the processes executed.
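For example, an upstream scoring process could push its results straight into the graph with a SPARQL INSERT rather than handing them to an ETL job; a sketch against a hypothetical update endpoint:

```python
# Sketch: an upstream process writing its output directly to the graph.
import requests

ENDPOINT = "http://localhost:7070/sparql"  # hypothetical update endpoint

update = """
PREFIX ex: <http://example.com/loan#>
INSERT DATA {
  ex:loan123 ex:defaultProbability "0.031" ;
             ex:scoredBy ex:creditModelV2 .
}
"""
requests.post(ENDPOINT, data={"update": update})
```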

What are the lessons learned and 2-3 key takeaways for the EDM webinar participants?
Knowledge graphs can save data engineering teams time and money by making it easier to integrate disparate datasets and to make frequent changes to the model. Business teams will see insights from changing business requirements faster and get data provenance baked directly into their pipeline for regulatory compliance.

EDM Council 77 Water St., 8th Fl., New York, NY 10005 | +1 646 722 4381 | edmcouncil.org