Understanding Data Lineage

By Mark Bands, Head of Product Strategy and Regulatory Intelligence

Last month I attended a Client Data Management working group where much of the discussion, prompted by the data quality framework requirements of BCBS 239, was focussed on data quality, data lakes and data provenance or lineage. The BCBS 239 guidelines on data-quality frameworks note that banks must establish data-quality management, including data profiling, data lineage, monitoring, reporting and escalation procedures. In addition to the impetus provided by these guidelines, it is worth noting that understanding the data supply chain, the data’s provenance or lineage, is fundamental for all businesses that are moving to utilise their data as a strategic enterprise asset.

What is Data Lineage?

To be clear on what we are talking about, it is worth looking at applicable definitions of the lineage/provenance concept. Online resources define data lineage as a “lifecycle” view of data that includes its origins and where it has moved over time. It describes what happens to data as it passes through diverse processes, and therefore helps provide visibility into analytics and simplifies tracing errors back to their sources. Lineage also enables replaying specific portions or inputs of the dataflow for step-wise debugging or regenerating lost output.
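To make that idea concrete, here is a minimal, purely illustrative Python sketch of lineage captured alongside a value as it moves through a dataflow. The class names, transformation steps and sources are hypothetical assumptions for the example, not taken from any particular tool or standard; the point is simply that each step records what happened, from where, and when, so errors can be traced back and steps replayed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable

@dataclass
class LineageRecord:
    """One step in a data element's lifecycle: what happened, from where, and when."""
    step: str              # name of the transformation applied (hypothetical)
    source: str            # where the input or rule came from (hypothetical)
    applied_at: datetime   # when the step ran
    input_value: Any       # value before the step (enables replay and debugging)
    output_value: Any      # value after the step

@dataclass
class TracedValue:
    """A value that carries its full lineage with it through a dataflow."""
    value: Any
    lineage: list[LineageRecord] = field(default_factory=list)

    def apply(self, step_name: str, source: str, fn: Callable[[Any], Any]) -> "TracedValue":
        """Apply a transformation and append a lineage record describing it."""
        new_value = fn(self.value)
        record = LineageRecord(
            step=step_name,
            source=source,
            applied_at=datetime.now(timezone.utc),
            input_value=self.value,
            output_value=new_value,
        )
        return TracedValue(value=new_value, lineage=self.lineage + [record])

# Example: trace a client's legal name from capture through two cleansing steps.
raw = TracedValue("  acme holdings LTD ")
cleaned = (raw
           .apply("trim_whitespace", "onboarding_form", str.strip)
           .apply("normalise_case", "cleansing_rules_v2", str.title))

for record in cleaned.lineage:
    print(f"{record.applied_at.isoformat()} {record.step}: "
          f"{record.input_value!r} -> {record.output_value!r}")
```

Reading the lineage list from the end back to the start is exactly the “trace errors back to their sources” exercise described above, and replaying the recorded inputs reproduces any intermediate output.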

In his book “Data Resource Simplexity”, technology author Michael Brackett states that “Data Provenance is provenance applied to the organisation’s data resource. The data provenance principle states that the source of data, how the data were captured, the meaning of the data when they were first captured, where the data were stored, the path of those data to the current location, how the data were moved along that path, and how those data were altered along that path must be documented to ensure the authenticity of those data and their appropriateness for supporting the business”.

Why is knowing lineage important?

Beyond the fact that regulators are now asking financial services organisations to demonstrate evidence of their data provenance (and thereby the extent to which their data supply chain can be trusted), what benefit is there in knowing your business data lineage? Industry experts correctly assert that firms are now managing large amounts of data that is functionally interconnected across the organisation (same data, many uses). In this context, any data error or omission at any point across those many uses and systems has the potential to cause negative repercussions for the business.

How do we practically achieve traceable provenance?

As noted, with the business depending on timely and accurate data, managing your data properly becomes imperative. One of the first steps towards better data management is proper data modelling. Data modelling builds an essential understanding of a firm’s data by showing how each piece of data is linked and used across the enterprise. Once data is properly modelled and the business usage context understood, firms can (as the sketch after the list below illustrates):

  • Assign appropriate data ownership
  • Make sure the correct data is used for the specific context
  • Keep track of how data moves through the organisation
  • Measure data quality
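As a rough illustration of the kind of context a data model can capture, the Python sketch below shows one hypothetical catalogue entry that touches all four points. The field names, systems and check names are assumptions made for the example, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class DataElement:
    """Catalogue entry for one modelled data element (illustrative fields only)."""
    name: str                       # e.g. "client_legal_name"
    owner: str                      # assigned data owner
    approved_contexts: set[str]     # business contexts the element is approved for
    systems: list[str]              # systems the element flows through, in order
    quality_checks: dict[str, bool] = field(default_factory=dict)  # check -> last result

    def quality_score(self) -> float:
        """Proportion of quality checks passing; a crude data-quality measure."""
        if not self.quality_checks:
            return 0.0
        return sum(self.quality_checks.values()) / len(self.quality_checks)

# Example entry: ownership, usage context, movement and quality held together.
client_name = DataElement(
    name="client_legal_name",
    owner="Client Data Management",
    approved_contexts={"onboarding", "regulatory_reporting"},
    systems=["onboarding_form", "client_master", "reporting_warehouse"],
    quality_checks={"not_null": True, "matches_registry": False},
)

assert "onboarding" in client_name.approved_contexts        # correct data for the context
print(" -> ".join(client_name.systems))                     # how it moves through the firm
print(f"quality score: {client_name.quality_score():.2f}")  # measurable data quality
```

Even this toy model makes the four capabilities tangible: ownership is an explicit field, approved contexts guard usage, the ordered list of systems records movement, and the check results give something measurable.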

Data and the information it provides are some of the biggest assets of most modern enterprises. In today’s information age, almost every enterprise decision is based on a detailed analysis of data recorded from diverse sources, including internal structured databases and external data sources. To ensure that data retrieved from different sources is used appropriately and within context, it is imperative that the provenance of the data be recorded and made available to its users.

In the legal entity and client data space, systems like iMeta’s Onboarding and Client Lifecycle Management Platform provide clear tracking from data source to ultimate usage, with a detailed audit trail of what changes were made to each data element along the way, who made them and when.
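To give a flavour of what such an audit trail records, here is a generic Python sketch; it does not describe iMeta’s actual implementation, and the entry fields, user and example values are hypothetical. The essential idea is an append-only log of what changed, who changed it and when.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditEntry:
    """One immutable audit-trail entry for a change to a single data element."""
    element: str          # which data element changed
    old_value: str        # value before the change
    new_value: str        # value after the change
    changed_by: str       # who made the change
    changed_at: datetime  # when it was made
    source: str           # where the new value came from

audit_trail: list[AuditEntry] = []

def record_change(element: str, old: str, new: str, user: str, source: str) -> None:
    """Append an audit entry; the trail itself is only ever appended to."""
    audit_trail.append(AuditEntry(
        element=element,
        old_value=old,
        new_value=new,
        changed_by=user,
        changed_at=datetime.now(timezone.utc),
        source=source,
    ))

# Hypothetical example: a registered address corrected from a public registry.
record_change("registered_address", "1 Old St", "2 New Sq", "j.smith", "companies_house")
for entry in audit_trail:
    print(f"{entry.changed_at.isoformat()} {entry.changed_by} changed {entry.element}: "
          f"{entry.old_value!r} -> {entry.new_value!r} (source: {entry.source})")
```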

Beyond the data, and also relevant in the current milieu, firms now need an audit trail for the source and application of regulatory policy, specific to a multitude of jurisdictions and different regulation types. Again, platforms like iMeta’s Assassin go a long way towards providing an operational context in which responsible teams and individuals are able to track adherence to policy and provide a demonstrable audit of regulatory compliance lineage and application – but the understanding and detail of this adjunct context is a subject for another blog.