Navigating the symbiosis: How Generative AI and Data Governance can support each other
In 2022, OpenAI sparked huge demand for Generative AI by releasing ChatGPT, built on the GPT-3.5 family of large language models. Generative AI refers to a class of machine learning models, built on deep neural networks, whose goal is to generate new content such as text, images, or audio. These models are usually trained on large unstructured datasets, like internet content. This does not mean, however, that we can compromise on Data Governance best practices when working with these models. Rather, Data Governance practices need to adapt to the new reality of GenAI, including leveraging GenAI itself for enhanced data quality and traceability.
What is Data Governance exactly?
Data Governance is a set of policies, processes, practices, and recommendations that assure data quality and safety. In simple terms, we need to ensure that consumers can trust the data. It should be reliable, complete, up-to-date, and consistent. Moreover, data lineage (or “data provenance”) needs to be well understood, so that every business insight can be traced back to its source information. Sensitive data, on the other hand, should only be available to those with a “need to know”. Now, how does Generative AI fit into this picture?
Ensuring secure and high quality GenAI models through data governance
GenAI primarily utilizes semi-structured and unstructured data: text, JSON, CSV, audio, and video files. This deviates significantly from past decades’ norm, where quality data meant structured, tabular information housed in relational databases. Ensuring that GenAI can train on appropriate data and access reliable information necessitates a standardized approach to quality assurance for non-structured data, forcing existing Data Governance practices to adapt to a new reality. This is especially crucial since the impending AI Act explicitly mandates Data Governance practices for all high-risk AI applications [1]. Let’s now dive into some practical use-cases of applying Data Governance to the GenAI realm:
1/ Prepare source artifacts for AI solutions
Large language models (LLMs) are designed to learn from diverse content. However, the absence of appropriate quality and metadata can significantly diminish the expected outcomes. The remedy lies in maintaining source documents with the understanding that they will be utilized by AI in the future. This requires that documents are up-to-date, consistent, and accompanied by a comprehensive metadata set. It can be achieved by assigning ownership of data assets and promoting relevant education among all personnel.
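To make the idea of a comprehensive metadata set concrete, here is a minimal sketch of a metadata envelope that an AI pipeline could rely on. All field names, values, and the email address are illustrative assumptions, not a formal standard.

```python
# A minimal sketch of source-document metadata for AI consumption.
# Every field name and value here is illustrative, not a standard.
from dataclasses import dataclass, field, asdict
from datetime import date

@dataclass
class DocumentMetadata:
    title: str
    owner: str              # accountable data-asset owner
    last_reviewed: date     # supports the "up-to-date" requirement
    classification: str     # e.g. "public", "internal", "confidential"
    source_system: str      # supports lineage / provenance
    tags: list = field(default_factory=list)

doc = DocumentMetadata(
    title="Refund policy v3",
    owner="legal-team@example.com",
    last_reviewed=date(2024, 5, 1),
    classification="internal",
    source_system="sharepoint",
    tags=["policy", "customer-service"],
)
print(asdict(doc))
```

A structured envelope like this is what lets an ingestion pipeline filter stale or restricted documents automatically instead of relying on manual review.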
2/ Enable models to learn policies
Data Governance experts should formulate a suite of rules, policies, and processes that are understandable to humans and machines alike. Enterprises can use prompt engineering (pre-prompts) and fine-tuning techniques such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to apply rules and constraints to what specific models generate [2]. This ensures that the output adheres to established governance policies and maintains the standards of ethical and responsible AI. There is an art to striking the right balance between model performance and alignment with policies, since an overly aligned model can quickly become useless to your business.
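As a sketch of the pre-prompt approach, governance policies can be injected as a system message before any user input reaches the model. The policy wording and message shape below are illustrative assumptions; a real deployment would pair this with fine-tuning rather than rely on prompting alone.

```python
# Sketch: governance policies injected as a pre-prompt (system message).
# The policy text is illustrative, not a real corporate policy.
GOVERNANCE_PREPROMPT = (
    "You are a corporate assistant. Follow these policies:\n"
    "1. Never reveal customer PII.\n"
    "2. Cite the internal source document for factual claims.\n"
    "3. Refuse requests that conflict with the data classification policy."
)

def build_messages(user_input: str) -> list:
    """Prepend the governance pre-prompt to every conversation turn."""
    return [
        {"role": "system", "content": GOVERNANCE_PREPROMPT},
        {"role": "user", "content": user_input},
    ]

messages = build_messages("Summarize last quarter's churn report.")
```

Centralizing the pre-prompt in one function makes the policy auditable and versionable, which is itself a governance win.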
3/ Define standard metadata for AI models
In the realm of AI, establishing standards for model metadata is a pivotal step towards transparency and reliability. Notable references include Google’s model cards [3] and Twilio’s AI Nutrition Facts [4]. AI Nutrition Facts, for example, draw direct inspiration from food labels and offer essential model information in an understandable format. They detail the model’s ‘ingredients’, such as training data, model architecture, optimization techniques, and performance metrics. Such an approach enhances understanding, helps assess a model’s suitability, and facilitates model maintenance at scale.
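The sketch below shows what machine-readable model metadata in this spirit could look like. The field names and values are illustrative assumptions, not Google’s or Twilio’s actual schema.

```python
# Sketch of a model "card" in the spirit of model cards / AI Nutrition
# Facts. Fields and values are illustrative, not a vendor schema.
MODEL_CARD = {
    "name": "support-summarizer-v2",
    "architecture": "decoder-only transformer (fine-tuned)",
    "training_data": "internal support tickets, 2020-2023, PII-scrubbed",
    "optimization": "SFT followed by RLHF",
    "intended_use": "summarizing customer support threads",
    "limitations": "English only; not for legal advice",
    "metrics": {"rougeL": 0.41, "toxicity_rate": 0.002},
}

def render_card(card: dict) -> str:
    """Render the card as a human-readable 'label'."""
    lines = [f"== {card['name']} =="]
    for key, value in card.items():
        if key != "name":
            lines.append(f"{key}: {value}")
    return "\n".join(lines)

print(render_card(MODEL_CARD))
```

Keeping the card machine-readable (a plain dict here) means the same metadata can feed both a human-facing label and automated checks, e.g. blocking deployment when `limitations` or `metrics` are missing.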
4/ Manage PII data in the context of LLMs
To prevent unintended disclosure of Personally Identifiable Information (PII) by LLMs, consider employing techniques such as differential privacy [5]. This method introduces a controlled amount of ‘noise’ into the model’s training data, preserving individual privacy while maintaining the model’s business utility. Another technique to consider is model response scrubbing, which uses rule- or ML-based classifiers to validate model responses for PII leakage.
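A minimal rule-based response scrubber might look like the sketch below: regexes detect and redact common PII patterns in model output before it reaches the user. The patterns are deliberately simplified illustrations; production systems would combine broader rules with ML-based classifiers.

```python
# Sketch of rule-based model response scrubbing. The regexes below are
# simplified illustrations, not production-grade PII detectors.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub(response: str) -> str:
    """Replace detected PII spans with typed redaction tokens."""
    for label, pattern in PII_PATTERNS.items():
        response = pattern.sub(f"[REDACTED-{label.upper()}]", response)
    return response

out = scrub("Contact jane.doe@example.com or 555-123-4567.")
print(out)
```

Typed redaction tokens (rather than blanket deletion) preserve the response’s readability and make leakage incidents easy to count and audit.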
5/ Invest in Explainable AI (XAI)
Explainable AI (XAI) introduces the concept of “data lineage” into AI models, effectively moving us away from a “black box” paradigm. It’s like opening the hood of a car; you can see how the engine works, and track why it behaves the way it does.
“Global Explainable AI market is expected to be worth USD 16.2 billion by 2028, growing at a CAGR of 20.9% during the forecast period.”
One of the reasons for that rapid market growth might be the growing importance of XAI in the context of Generative AI - already dubbed Explainable Generative AI (GenXAI) [7] - as it helps to understand the complex and diverse solutions that GenAI relies on. In the realm of Generative AI, explainability is achieved through a variety of techniques.
Feature Attribution techniques [8], for instance, assign relevance scores to each input feature, such as a word or pixel. In GenAI, this could mean identifying which parts of an input text or image were most influential in generating the output. Sample-based techniques [9], on the other hand, investigate how different inputs affect the output. In the context of GenAI, this could involve exploring how slight variations in input data lead to drastically different generated outputs.
Probing-based methods [10] aim to understand what knowledge a GenAI model has captured. This could involve training a classifier on a model’s activations to distinguish different types of inputs and outputs, providing insights into the model’s understanding. Mechanistic interpretability [11] goes a step further, aiming to reverse-engineer model components into human-understandable algorithms. This involves viewing models as graphs and identifying subgraphs (or circuits) that yield certain functionality.
Each of these techniques offers unique insights into the inner workings of GenAI models. They allow us to peek under the hood of these complex systems, providing a deeper understanding of how they generate their outputs and offering the potential to improve their performance, reliability, and safety.
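To make feature attribution tangible, the sketch below implements the simplest occlusion-style variant: remove each input token and measure how much the model’s score changes. The “model” here is a trivial keyword scorer standing in for a real LLM, purely to keep the mechanics self-contained.

```python
# Toy occlusion-based feature attribution. The keyword scorer is a
# stand-in for a real model; only the attribution mechanics matter.
def toy_model_score(tokens: list) -> float:
    """Pretend model: scores text via a few sentiment keywords."""
    weights = {"great": 2.0, "good": 1.0, "bad": -1.5}
    return sum(weights.get(t, 0.0) for t in tokens)

def occlusion_attribution(tokens: list) -> dict:
    """Relevance of each token = score change when it is removed."""
    base = toy_model_score(tokens)
    scores = {}
    for i, tok in enumerate(tokens):
        occluded = tokens[:i] + tokens[i + 1:]
        scores[tok] = base - toy_model_score(occluded)
    return scores

attrib = occlusion_attribution(["the", "food", "was", "great"])
print(attrib)  # "great" carries all the relevance here
```

Against a real generative model, the same loop would occlude tokens or image patches and compare output likelihoods, which is computationally heavier but conceptually identical.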
6/ Test the quality of your GenAI chains on a regular basis
Just as data quality gates are vital to the data governance practice, so is the regular testing of the quality delivered by GenAI solutions. Several tools, benchmarks and frameworks can help automate these evaluations:
RAGTruth: RAGTruth is a word-level hallucination corpus designed for analyzing hallucinations in various domains and tasks within the Retrieval-augmented generation (RAG) setting [12].
RAGAS: RAGAS is an open-source tool designed for RAG metrics evaluation. RAGAS helps measure the performance of RAG solutions and is useful for monitoring and improving these kinds of applications in production [13].
MLCommons: MLCommons provides a broader framework for evaluating machine learning models. It focuses on benchmarking and standardizing performance across various tasks, including comprehensive AI safety benchmarks [14].
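Whatever framework you adopt, the underlying pattern is a recurring quality gate: run a golden question set through the chain and fail loudly if quality drops. The sketch below shows that pattern in framework-neutral form; `rag_chain` and the golden set are hypothetical stand-ins for your actual pipeline and test data.

```python
# Sketch of a recurring quality gate for a GenAI chain. `rag_chain` is a
# hypothetical stand-in for a real retrieval + generation pipeline.
def rag_chain(question: str) -> str:
    # Stand-in: a real chain would retrieve documents and call an LLM.
    canned = {"What is our refund window?": "Refunds are accepted within 30 days."}
    return canned.get(question, "I don't know.")

# Golden set: (question, keywords the answer must contain).
GOLDEN_SET = [
    ("What is our refund window?", ["30 days"]),
]

def keyword_accuracy(chain, golden) -> float:
    """Fraction of golden answers containing all required keywords."""
    hits = sum(
        all(kw in chain(q) for kw in keywords) for q, keywords in golden
    )
    return hits / len(golden)

score = keyword_accuracy(rag_chain, GOLDEN_SET)
assert score >= 0.9, f"RAG quality gate failed: {score:.2f}"
```

Run on a schedule or in CI, a gate like this catches silent regressions, e.g. when a document update or model upgrade degrades answers, which one-off manual spot checks routinely miss.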
Ensuring secure and high quality data through GenAI models
Machine Learning and statistical analysis can play a significant role in enhancing data quality, particularly in imputing missing data and identifying outliers. While these techniques are well-established and extensively utilized, Generative AI has introduced a novel toolkit that can augment, and potentially fully automate, laborious and manual tasks. Let’s delve into some representative use-cases:
1/ Metadata enrichment
The task of ensuring accurate labeling, often viewed as tedious, can be automated using advanced language models. These can draw understanding from the data, then create and attach metadata to both structured and unstructured data sets. In this context, GenAI is a gamechanger, helping remove one of the key impediments to implementing Data Governance at scale.
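The sketch below shows the enrichment loop in miniature. A real deployment would ask an LLM to propose tags; here a keyword-based rule stands in for the model call so the example stays self-contained, and the taxonomy is an illustrative assumption.

```python
# Sketch of automated metadata enrichment. The keyword rule below is a
# deterministic stand-in for an LLM call; the taxonomy is illustrative.
def suggest_tags(text: str) -> list:
    """Stand-in for an LLM proposing metadata tags for a document."""
    taxonomy = {
        "invoice": ["finance", "billing"],
        "contract": ["legal"],
        "incident": ["operations", "security"],
    }
    tags = set()
    for keyword, labels in taxonomy.items():
        if keyword in text.lower():
            tags.update(labels)
    return sorted(tags)

record = {"body": "Attached is the signed contract and final invoice."}
record["tags"] = suggest_tags(record["body"])
print(record["tags"])
```

Swapping the rule for an actual LLM call keeps the surrounding pipeline unchanged, which is exactly why the enrichment step is worth isolating behind a small function.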
2/ Data integration and transformation
Data integration often involves costly and time-consuming manual work. Generative AI, on the other hand, is capable of automatically matching information based on a broader context, and executing transformations and enrichments in an intelligent manner [15] [16] [17], helping streamline ETL and ELT implementation.
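As a concrete, minimal illustration of automated schema matching, the sketch below maps source columns to target columns by string similarity using the standard library’s difflib. This is a deterministic stand-in for an LLM-based matcher, which could additionally weigh column descriptions and sample values.

```python
# Sketch of automated schema matching for data integration. difflib's
# string similarity stands in here for an LLM judging semantic matches.
import difflib

def match_columns(source: list, target: list, cutoff: float = 0.6) -> dict:
    """Map each source column to its closest target column, if any."""
    mapping = {}
    for col in source:
        candidates = difflib.get_close_matches(col, target, n=1, cutoff=cutoff)
        mapping[col] = candidates[0] if candidates else None
    return mapping

mapping = match_columns(
    ["cust_name", "order_dt", "amt"],
    ["customer_name", "order_date", "amount", "region"],
)
print(mapping)
```

String similarity already resolves abbreviations like `amt` → `amount`; an LLM-backed matcher earns its keep on the cases this fails, e.g. semantically equivalent but lexically unrelated names like `zip` and `postal_code`.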
3/ Security and compliance assurance
Advanced algorithms can aid in monitoring and alerting against data privacy breaches by consistently scrutinizing content against established rules and policies. Traditional statistical and heuristic systems are static, necessitating frequent redefinitions. In contrast, machine learning is dynamic and capable of self-adjusting to new policies and incoming information to better identify “zero day” threats. This adaptability results in higher accuracy and immediate compliance with new requirements and processes. Generative AI models in particular - like Variational Autoencoders (VAEs) - show remarkable performance in anomaly detection scenarios and can be fully trained in an unsupervised manner. [18]
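The core idea - learn what “normal” looks like, then flag strong deviations - can be sketched very simply. Below, a per-feature z-score test stands in for a VAE’s reconstruction error; the baseline data is invented for illustration, but the monitoring pattern is the same.

```python
# Deliberately simplified anomaly detection sketch: a z-score test
# stands in for a VAE's reconstruction error. Baseline data is invented.
import statistics

def fit(baseline: list) -> tuple:
    """Learn per-feature mean and stdev from normal traffic."""
    cols = list(zip(*baseline))
    return ([statistics.mean(c) for c in cols],
            [statistics.stdev(c) for c in cols])

def is_anomalous(record: list, means: list, stdevs: list, z: float = 3.0) -> bool:
    """Flag a record if any feature deviates more than z stdevs."""
    return any(abs(x - m) > z * s for x, m, s in zip(record, means, stdevs))

# Features per record: (requests/min, bytes transferred)
baseline = [(10, 500), (12, 520), (11, 480), (9, 510), (10, 495)]
means, stdevs = fit(baseline)
print(is_anomalous([50, 500], means, stdevs))  # spike in request rate
```

A VAE replaces the hand-picked features and thresholds with a learned latent representation, which is what lets it generalize to high-dimensional content and self-adjust as the notion of “normal” drifts.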
4/ Synthetic data generation
Synthetic data refers to artificially generated data that mimics the statistical properties of real-world information without containing any personally identifiable information (PII) that can be traced back to real data. It provides realistic testing and training data without incurring data privacy/GDPR-related risks, and with much better protection than masking or pseudonymization. Synthetic data is not new; however, GenAI models like Generative Adversarial Networks (GANs) can now produce it more efficiently than “traditional solutions” like oversampling or undersampling [19].
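The principle can be sketched in a few lines: fit a distribution to the real values, then sample fresh, untraceable records from it. Fitting independent Gaussians, as below, is a deliberately simplified stand-in for a GAN, which would also capture correlations and non-Gaussian shapes; the salary figures are invented.

```python
# Simplified synthetic data sketch: fit a Gaussian to real values and
# sample new records. A GAN would replace the Gaussian fit; the salary
# figures below are invented for illustration.
import random
import statistics

def fit_and_sample(real_values: list, n: int, seed: int = 42) -> list:
    """Sample n synthetic values matching the real mean and stdev."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)  # seeded for reproducible test data
    return [rng.gauss(mu, sigma) for _ in range(n)]

real_salaries = [52_000, 61_000, 58_500, 49_000, 66_000, 57_000]
synthetic = fit_and_sample(real_salaries, n=1000)
print(round(statistics.mean(synthetic)))  # close to the real mean
```

Because no synthetic value is copied from a real record, the sample can be shared with test and analytics teams without the re-identification risk that masked or pseudonymized data still carries.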
This is merely the tip of the iceberg. Generative AI is rapidly evolving, and each domain may harbor unique ideas that are yet to be explored and validated. Consequently, the symbiotic relationship between Data Governance and GenAI will only deepen.
Closing thoughts
Implementing Data Governance policies creates a foundation for Responsible AI, which is based on six key principles: accountability, inclusiveness, reliability and safety, fairness, transparency, and privacy and security. These factors are crucial from engineering, business, and legal perspectives – upcoming regulations will even make them mandatory. Responsible AI, aided by the right data governance framework, can effectively enable AI to benefit organizations while safeguarding users and society.
At the same time, as with every big technology breakthrough, quality and safety tend to lag behind the pace of innovation. Just as the first airplanes did not have oxygen masks and life vests included as standard equipment, there is currently no formal standard as to how exactly GenAI should be governed. However, a trusted advisor like Exerizon can help you stay in touch with the latest developments in this dynamic space.
References:
[1] “Article 10: Data and Governance”, EU Artificial Intelligence Act
[3] Modelcards.withgoogle.com, Google
[4] Nutrition-facts.ai, Twilio
[5] Differential Privacy, Harvard University
[8] G. Varma, “Feature Attribution in Explainable AI”, September 2021
[9] P. Mishra, “8 Types of Sampling Techniques”, June 2021
[10] A. Chawla, “The Probe Method: A Reliable and Intuitive Feature Selection Technique”, September 2023
[11] L. Bereska & E. Gavves, “Mechanistic Interpretability for AI Safety — A Review”, April 2024
[13] Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
[14] MLCommons | Better AI for Everyone
[15] C. J. Yang, “Improving LLMs: ETL to “ECL” (Extract-Contextualize-Load)”, March, 2024
[17] M. Rai, “Using generative AI to overhaul data integration? Start here”, InfoWorld, Jan 2024
[18] J. Rocca, “Understanding Variational Autoencoders (VAEs)”, September 2019
[19] “Generative Adversarial Network (GAN)”, geeksforgeeks.org