Designing the curation experience for an AI-driven data lineage tool
Data Lineage is a critical module in Informatica's cloud data governance and privacy product. This project aimed to investigate and imagine how AI can be leveraged to predict data connections within the data lineage, particularly in cases where the data itself is not readily available.
- Stakeholder interviews & discussions
- Documenting the ask, business goals, and user goals
- Storyboarding use case & scenario
- Concept development
- Low- and high-fidelity wireframes
- Prototyping and client walkthroughs
12 Weeks (Jun - Aug 2022)
Ranjeet Tayi, Jill Blue Lin
What is data lineage?
Data lineage is the process of understanding, recording, and visualizing data as it flows from data sources to consumption. It shows where the data originated, how it has changed, and its ultimate destination within the data pipeline.
What is it used for?
Apart from helping organizations keep a clear record of data’s movements and transformations, it is also used for:
Assurance of Data Integrity
To make sure data elements in your report are trustworthy
To check data stores for personal user information and ensure data privacy law compliance.
To analyze the impact of changes you make upstream or downstream in your data system.
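As a rough illustration of the impact-analysis use above, lineage can be modelled as a directed graph whose edges point from a source to the asset derived from it. The sketch below is a minimal Python example under that assumption; the asset names and the `downstream` and `upstream` helpers are hypothetical and are not part of Informatica's product.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source asset, derived asset).
EDGES = [
    ("claims_raw", "claims_clean"),
    ("claims_clean", "revenue_report"),
    ("partners_feed", "revenue_report"),
]


def downstream(edges, start):
    """Return every asset reachable from `start`, i.e. what a change to it could affect."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in children[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def upstream(edges, start):
    """Return every asset `start` depends on, by walking the same edges in reverse."""
    return downstream([(dst, src) for src, dst in edges], start)
```

Calling `downstream(EDGES, "claims_raw")` lists the assets that a change to the raw claims data could affect, while `upstream(EDGES, "revenue_report")` lists everything the report depends on.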
Unparseable code creates gaps in dataflow diagram
The automated data lineage process creates connection assignments by parsing code logic. When the code logic or a suitable parser is unavailable, the result is a gap in the data flow diagram, especially in legacy systems without custom data lineage scanners.
User roles affected by the problem:
Data Catalog Admin
Responsible for keeping data source documentation up to date within the organization.
Missing Lineage Requests Lead to Data Silos
Data Steward
Plans data management strategies and standards for optimal data usage.
Manual Lineage Curation Is Cumbersome
Business Analyst
Creates and presents the analysis reports on which critical business decisions are based.
Unable to Verify Credibility of Report
How might we facilitate data lineage documentation in these cases through the power of machine learning?
Use Case 1: Create Job for Curation
As a Data Catalog Admin, I want to set up a collection of sources and targets for the CLAIRE machine learning engine to efficiently curate the inferred data lineage.
Streamlined Project Creation and Assignment
Reduce the hassle of documenting and assigning work through third-party apps by creating and assigning projects in-app.
Use Case 2: Accurate Lineage Curation
As a Data Steward, I want to curate an accurate inferred lineage map for business analysts to make data decisions for the organization.
Job Summary for Planning
A clear job summary with recommendations and their breakdown for easier project planning and management.
Granularity for precise decision making
Drill down from data-set-level to column-level recommendations to make decisions at different lineage levels.
ML Recommendations to aid decision making
The recommendation card provides prediction parameters and a confidence score to supplement decision making and build trust with the user. Acting on a recommendation refines the ML model.
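To make the recommendation card more concrete, here is a minimal Python sketch of the data it might carry and of how a steward's decision could be turned into feedback for the model. Every name here (`Recommendation`, `decide`, the field names, the 0.0 to 1.0 confidence range) is an assumption made for illustration; it does not describe CLAIRE's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Recommendation:
    source_column: str
    target_column: str
    confidence: float                               # assumed range: 0.0 to 1.0
    parameters: dict = field(default_factory=dict)  # hypothetical prediction parameters
    decision: str = "pending"                       # "pending", "accepted" or "rejected"


def decide(rec, accept):
    """Record the steward's decision and return a labelled example for later retraining."""
    rec.decision = "accepted" if accept else "rejected"
    return {
        "source": rec.source_column,
        "target": rec.target_column,
        "label": 1 if accept else 0,
    }


# Hypothetical usage: a steward accepts one inferred column-level link.
rec = Recommendation(
    source_column="partners_feed.member_id",
    target_column="claims_clean.member_id",
    confidence=0.87,
    parameters={"name_similarity": 0.95},
)
feedback = decide(rec, accept=True)
```

Each decision yields one labelled example, which is one plausible way that acting on a recommendation could feed back into the model.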
Toggle Lineage views
Toggling the lineage view from map to list enables quick search and comparison of recommendations for accurate decision making.
Collaborate through the Comments Panel
The comments panel provides a space to discuss and consult with other data stewards about a recommendation decision within context.
Use Case 3: Data Lineage Transparency
As a Business Analyst, I want a transparent view of the data lineage that distinguishes inferred from derived lineage so that I can make robust business decisions.
Toggle between Inferred and Derived Lineage
The Inferred lineage toggle indicates that there is some ML-based inferred lineage documentation curated by the data steward.
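One way to picture the toggle is as a filter over lineage edges that are tagged by origin. The Python sketch below assumes that "derived" means lineage obtained by parsing code and "inferred" means ML-suggested lineage curated by a steward; the `Edge` type, the `visible_edges` function, and the asset names are hypothetical.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    source: str
    target: str
    kind: str  # "derived" (assumed: parsed from code) or "inferred" (assumed: ML-suggested)


def visible_edges(edges, show_inferred):
    """Return the edges to draw: always the derived ones, plus inferred ones when toggled on."""
    return [e for e in edges if e.kind == "derived" or show_inferred]


def has_inferred(edges):
    """True when the toggle should signal that some inferred lineage exists."""
    return any(e.kind == "inferred" for e in edges)


LINEAGE = [
    Edge("claims_raw", "claims_clean", "derived"),
    Edge("partners_feed", "claims_clean", "inferred"),
]
```

Keeping the origin on each edge, rather than in a separate store, lets the same view show either the derived lineage alone or the combined picture.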
4D Design Process
This project followed the 4D Design Process, an approach that alternates between diverging and converging phases.
AI Engine - CLAIRE as Part of Cloud Offerings
At the start of the project, I met with stakeholders to understand the business goals of the project. The following business goals allowed me to define my own design goals:
- Detailed column-level lineage for Informatica customers.
- Promoting cloud product adoption with CLAIRE AI.
- Establishing a competitive advantage through the first AI-powered lineage solution.
Stakeholder Interviews for Understanding Data Systems
Due to time and budget constraints, direct user engagement was not possible during the concept development phase. Instead, I relied on the project manager's and fellow designers' expertise to understand the user. I conducted three stakeholder interviews with the following research goals.
I needed to learn:
- Current documentation processes of data lineage
- Scale of data systems for inferred lineage
- Use cases for inferred lineage
- Usage cycle of data lineage (systems, analysis, creation, feedback)
- Experience/expectations of using an AI/ML engine
Plotting the Curation Experience
Scenario: NYC Health+ hospitals use Informatica's data lineage tool to manage patient data. They also receive data from external partners, but due to data-sharing restrictions they do not have the transformation logic and path for that data. They need a way to view their end-to-end data lineage so they can verify the information in their revenue report.
Machine learning approach is hard to debug
Share prediction details with users to build trust
Data Lineage is used for critical business decisions
Distinguish derived vs. inferred to show information reliability
Data systems too vast for one person to know entirely
Inferred lineage needs to be a collaborative tool
User Roles in Inferred Lineage Curation
Findings from stakeholder interviews and secondary research were distilled into three role-based user personas: Data Catalog Administrator, Data Steward, and Business Analyst.
Understanding AI-Powered Data Lineage Through Storyboarding
To validate my understanding of the scenario, I created a storyboard that served as a tangible representation, facilitating communication and shaping the project's direction.
Garnering Customer Feedback
Due to time and budget constraints, user interviews were not possible at the start of the project. Therefore, I used the customer validation calls to gain more insight into the lineage systems and their users. I spoke with two clients. In the first half of each call, I examined their existing data systems and lineage in their real-time environment; in the second half, I shared my ideas and prototype concepts and collected feedback.
To garner feedback on my concepts, I met with data stewards from two clients: Elevance Health (an insurance provider) and Thrivent (a not-for-profit financial organization). They gave the following feedback:
- Both customers appreciated the ability to drill down from the data set to column-level lineage and take actions at all those levels.
- An interesting new use case came up: the same two data sets could be assigned to multiple projects, so cross-project collaboration could cause overlaps in lineage curation.
- They also already have some documentation for existing mappings, so it would be good to have the ability to upload existing mappings directly into a lineage inference project.
- They also really liked the list view and said it could be used to incrementally build their lineage using placeholders when things still need to be added to the product catalog.
A Mock-up’s worth a Thousand Words
Low-fidelity concept mock-ups were the best way for me to explain initial ideas, design requirements, feasibility, and timeline to the cross-functional stakeholders.
Define the Scenario
Use case scenarios need to be as detailed as possible to design robust workflows, because data management and governance practices can differ from one industry to another.
Human Discretion for AI Probability
Because AI is based on probabilities instead of definitive answers, human discretion becomes crucial for training and refining the AI system, which makes traditional approaches like bulk actions less applicable.