As data becomes more complex and dynamic, handling it effectively can seem arduous and complicated, especially as the amount of data that companies handle, process and store keeps growing.
That’s why understanding the lineage of data is such a key aspect for organizations that depend on data.
Learning where the data comes from and how it has evolved brings measurable benefits. Even the simplest data lineage can help companies build better processes and improve their operations.
What Is Data Lineage?
Data lineage is the life cycle of data, from its origin to the end of the flow, including every stage it passes through along the way.
Companies nowadays depend on data, so understanding, tracking and visualizing the entire data flow lets them evaluate the quality of their information and check every way in which the data was transformed and changed along the cycle.
Data lineage is the data’s entire record. It gives a map of the data’s journey, showing the source, how the data reached a specific point and all the steps in between, with detailed explanations of every transformation that happened during the process or as the data travelled through different systems.
As the data moves through the flow and changes, adapts and evolves, organizations can follow the process up close. This helps them better understand their own operation, formulate new processes, track errors, change what is ineffective and learn from what is effective.
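As a rough illustration of the "map of the data journey" idea, lineage can be stored as a list of steps, each recording a source, a target and the transformation applied. The sketch below is only one possible representation; the dataset names, the `LineageStep` class and the `trace_upstream` helper are all hypothetical and invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageStep:
    source: str          # dataset the step reads from
    target: str          # dataset the step writes to
    transformation: str  # short description of what changed


# Hypothetical data flow, invented for illustration only.
STEPS = [
    LineageStep("crm.customers", "staging.customers", "copy raw export"),
    LineageStep("staging.customers", "warehouse.customers", "deduplicate by email"),
    LineageStep("shop.orders", "warehouse.orders", "convert currency to EUR"),
    LineageStep("warehouse.customers", "reports.revenue_by_customer", "join with orders"),
    LineageStep("warehouse.orders", "reports.revenue_by_customer", "join with customers"),
]


def trace_upstream(dataset, steps):
    """Return every step that contributed to `dataset`, nearest steps first."""
    found = []
    seen = set()
    frontier = [dataset]
    while frontier:
        current = frontier.pop(0)
        for step in steps:
            if step.target == current and step not in seen:
                seen.add(step)
                found.append(step)
                frontier.append(step.source)
    return found


if __name__ == "__main__":
    for step in trace_upstream("reports.revenue_by_customer", STEPS):
        print(f"{step.source} -> {step.target}: {step.transformation}")
```

Walking the steps backwards from a report is what makes it possible to see how the data got to a specific point and which transformations it went through on the way.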
Why Is Data Lineage Important?
Data lineage is extremely important for companies that depend on data, not only because it tracks data sources, performance, updates, systems and possible operational changes, but also because of the knowledge the company can draw from this data.
Knowing how a change was made, who made it, which process was involved, how the updates were completed and which steps were used tells the company which efforts and decisions were the most successful. It’s also a way to preserve the integrity, security and quality of the data throughout its lifecycle.
When talking about data, transparency is vital. Tracking the lineage of data keeps all the vital knowledge recorded and easy to access, so organizations can create and improve their internal policies and have the information at hand when needed.
Data lineage brings direct benefits in the following areas:
- Better adherence to policies, governance processes and legislation.
- All areas can strategically rely on the data recorded.
- The organization can make better decisions for the present and the future, as it can rely on previously tracked datasets.
- Systems can be improved and new ones created, because the benefits, risks and impacts can be analyzed beforehand.
- The detailed data recorded can be a strong tool for improving risk management in all areas and projects.
- Sources of errors and problems are easier to locate and fix.
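To make the point about analyzing impacts more concrete, here is a minimal sketch of an impact analysis over recorded lineage: given a dataset that is about to change, it lists everything that depends on it. The edge list and the `downstream` function are assumptions made up for this example, not part of any specific tool.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (source, target), invented for illustration.
EDGES = [
    ("crm.customers", "warehouse.customers"),
    ("shop.orders", "warehouse.orders"),
    ("warehouse.customers", "reports.revenue"),
    ("warehouse.orders", "reports.revenue"),
    ("warehouse.orders", "reports.shipping"),
]


def downstream(dataset, edges):
    """Return all datasets that directly or indirectly depend on `dataset`."""
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    impacted = []
    seen = {dataset}
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


if __name__ == "__main__":
    # Which datasets are affected if shop.orders changes or contains an error?
    print(downstream("shop.orders", EDGES))
```

The same traversal, run in the opposite direction, helps locate the source of an error that shows up in a report.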
What Are the Key Components of Data Lineage?
When it comes to the key components of data lineage, there is no agreed list of which components should be considered the most important.
The industry has some reference guides and official publications that give an idea of what should be a key component when working with the lineage of data. They come from DAMA International (DAMA-DMBOK), the EDM (Enterprise Data Management) Council (DCAM) and standard number 239 of the Basel Committee on Banking Supervision.
DAMA International (DAMA-DMBOK)
According to DAMA-DMBOK, the key components of data lineage and data flow should be the ones that map and record all the relationships happening between the data and:
- The business processes and operations used by the company.
- All the places and environments where the data is stored, such as data stores and databases.
- All the security applications and network segments.
- The business roles that need contact with the data, whether for creating, updating, analyzing or deleting it.
- Locations, especially where local differences happen.
So, for DAMA-DMBOK, the key components of data lineage are the IT system components themselves: applications, databases, network segments and business processes.
EDM (Enterprise Data Management) Council (DCAM)
The EDM Council has published a standard glossary that describes what should be considered important when working with data lineage.
According to the glossary, data lineage is about recording, in chronological order, the details and ownership of the data’s location and custody. It states that all the changes in the data from one system to another should be compiled as a visual map.
So, for the EDM Council, the key components of data lineage are the systems being used, ownership and control of data, and metadata.
The Basel Committee on Banking Supervision (BCBS239)
The Basel Committee on Banking Supervision reviewed the key principles and components for data lineage in its standard number 239.
According to BCBS239, companies should have documented business processes and business dictionaries in place, so that reports and documentation clearly explain all the risks in the data and are distributed only to the relevant parties.
BCBS239 also stresses how crucial metadata is. According to the standard, every company should record its business metadata, supported by clear ownership of risk data and information in both the business and the IT functions.
The last key component for data lineage according to BCBS239 is good data quality controls. All data errors should be identified, reported and explained through detailed reports. Monitoring and measuring the accuracy and quality of the data should be a top priority in order to work at appropriate security levels.
What Are the Main Data Lineage Techniques and Examples?
When it comes to data lineage, there are many techniques and approaches to outline, understand and diagram the movement of the data and all of its changes and transformations. Some of the most common are:
Pattern-Based Lineage
The pattern-based technique builds the lineage without looking at any of the code used to create and change the data. Instead, it reads and evaluates the metadata of tables, columns and reports.
Some companies even describe pattern-based lineage as a type of artificial intelligence (AI), as it investigates the data to find patterns. For example, if the solution finds two columns with the same or similar information, it assumes that they probably hold the same data at different stages of the data flow, so it links them together and creates a chart of the data lineage.
This is a great solution when the code is not available and only hidden logic patterns need to be read. But the pattern-based technique is not 100% accurate and may not work for those that deal with data privacy and data risk. Its results can suffer because it doesn’t look into finer details, so it can miss links between datasets.
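The column-matching idea can be sketched in a few lines. The example below is a simplified assumption of how such a tool might work, not a description of any real product: it compares samples of column values using Jaccard overlap and proposes a link when the overlap is high. The column names, sample values, threshold and the `infer_links` function are all hypothetical.

```python
# Hypothetical column samples: (table, column) -> set of sampled values.
COLUMNS = {
    ("staging.customers", "customer_email"): {"a@x.com", "b@x.com", "c@x.com"},
    ("warehouse.customers", "email"): {"a@x.com", "b@x.com", "c@x.com"},
    ("warehouse.orders", "order_total"): {"10.00", "25.50"},
}


def value_overlap(left, right):
    """Jaccard overlap between two value samples (0.0 = disjoint, 1.0 = identical)."""
    if not left or not right:
        return 0.0
    return len(left & right) / len(left | right)


def infer_links(columns, threshold=0.8):
    """Propose links between columns in different tables whose samples overlap."""
    links = []
    keys = sorted(columns)
    for index, left in enumerate(keys):
        for right in keys[index + 1:]:
            if left[0] == right[0]:
                continue  # skip columns of the same table
            overlap = value_overlap(columns[left], columns[right])
            if overlap >= threshold:
                links.append((left, right, overlap))
    return links


if __name__ == "__main__":
    for left, right, overlap in infer_links(COLUMNS):
        print(f"{left} <-> {right} (overlap {overlap:.2f})")
```

The sketch also shows why the technique is not fully accurate: two unrelated columns that happen to share values would be linked, and a column whose values were heavily transformed would be missed.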
Manual Lineage
As the name suggests, manual lineage is simply the mapping and documentation of the knowledge held by the people working in the company. This can be done through interviews with the process owners and everyone who has contact with useful data.
Normally, the lineage is compiled in spreadsheets, so the data mapping is easy to look at. But this technique is risky: the information can be contradictory, and large pieces of the picture can be missing if someone forgets to provide a piece of information.
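The risks of a manually compiled spreadsheet can be partly checked automatically. The following is a minimal sketch that assumes the spreadsheet is exported as CSV with the columns `target`, `source`, `owner` and `transformation`. That layout, the sample rows and the `review_sheet` function are invented for this example.

```python
import csv
import io
from collections import defaultdict

REQUIRED = ("target", "source", "owner", "transformation")

# Hypothetical export of a manually maintained lineage spreadsheet.
SHEET = """target,source,owner,transformation
warehouse.customers,staging.customers,Ana,deduplicate by email
warehouse.customers,staging.customers,Ben,filter inactive accounts
reports.revenue,warehouse.orders,Ana,
"""


def review_sheet(text):
    """Return (conflicts, incomplete) for a lineage sheet given as CSV text.

    conflicts:  source/target pairs described with different transformations
    incomplete: line numbers of rows with at least one empty required cell
    """
    descriptions = defaultdict(set)
    incomplete = []
    reader = csv.DictReader(io.StringIO(text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        cells = {column: (row.get(column) or "").strip() for column in REQUIRED}
        if not all(cells.values()):
            incomplete.append(line_number)
        if cells["transformation"]:
            pair = (cells["source"], cells["target"])
            descriptions[pair].add(cells["transformation"])
    conflicts = {
        pair: sorted(texts) for pair, texts in descriptions.items() if len(texts) > 1
    }
    return conflicts, incomplete


if __name__ == "__main__":
    conflicts, incomplete = review_sheet(SHEET)
    print("Contradictory descriptions:", conflicts)
    print("Rows with missing information:", incomplete)
```

A check like this only flags inconsistencies inside the sheet. It cannot detect a data flow that nobody mentioned in the interviews.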
Lineage by Data Tagging
The data tagging technique works on the assumption that each piece of data moving through the flow is tagged or labelled by an engine, so that the label can be tracked from the start of the process until the end of it.
This technique needs all the data to be stored in a system or tool that can control all of its movements. Data tagging is a great solution and most companies consider it promising, but it still needs improvement: it can only perform the lineage of data on closed systems, so it excludes all the information outside the technology or the selected method.
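A minimal sketch of the tagging idea follows. It assumes a hypothetical `TaggingEngine` that attaches a tag when data enters the system and carries it through every transformation the engine itself performs. The class names, the tag format and the `normalize_email` step are illustrative assumptions only.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class TaggedRecord:
    payload: dict
    tag: str                                      # label assigned once, at ingestion
    history: list = field(default_factory=list)   # steps applied so far


class TaggingEngine:
    """Hypothetical engine: every record it handles carries its tag along."""

    def ingest(self, payload, origin):
        tag = f"{origin}:{uuid.uuid4().hex[:8]}"
        return TaggedRecord(dict(payload), tag, [f"ingested from {origin}"])

    def transform(self, record, name, func):
        # The tag is copied unchanged; only the payload and history evolve.
        new_payload = func(dict(record.payload))
        return TaggedRecord(new_payload, record.tag, record.history + [name])


def normalize_email(payload):
    payload["email"] = payload["email"].strip().lower()
    return payload


if __name__ == "__main__":
    engine = TaggingEngine()
    raw = engine.ingest({"email": " A@X.COM "}, origin="crm")
    clean = engine.transform(raw, "normalize email", normalize_email)
    print(clean.tag, clean.history, clean.payload)
```

The limitation described above is visible here too: a record changed outside `TaggingEngine` has no tag and no history, so the lineage stops at the boundary of the closed system.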
Self-Contained Lineage
Some companies have data environments that provide storage, processing and master data management all in one place. Such an environment keeps all the data changes and movements in one place.
So this self-contained technique provides the lineage without the need for other tools and external systems. The lineage is built as the data moves through the cycle.
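The idea that lineage is built as a by-product of normal processing can be sketched with a toy in-memory environment. The `SelfContainedStore` class below is a hypothetical illustration, not a real product: every load and every derived table passes through it, so it can log the lineage itself.

```python
class SelfContainedStore:
    """Toy environment that stores tables and records lineage on every write."""

    def __init__(self):
        self._tables = {}
        self.lineage = []  # (inputs, output, operation), in execution order

    def load(self, name, rows):
        self._tables[name] = list(rows)
        self.lineage.append(((), name, "load"))

    def derive(self, output, inputs, operation, func):
        # Processing happens inside the environment, so lineage is captured
        # automatically, without any external tool.
        arguments = [self._tables[name] for name in inputs]
        self._tables[output] = list(func(*arguments))
        self.lineage.append((tuple(inputs), output, operation))

    def read(self, name):
        return list(self._tables[name])


def drop_negative(rows):
    return [row for row in rows if row["amount"] >= 0]


if __name__ == "__main__":
    store = SelfContainedStore()
    store.load("orders", [{"amount": 10}, {"amount": -5}, {"amount": 7}])
    store.derive("valid_orders", ["orders"], "drop negative amounts", drop_negative)
    for inputs, output, operation in store.lineage:
        print(f"{list(inputs)} -> {output}: {operation}")
```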