Policy-driven Safeguards Comparison
Jan 14, 2025
Introduction
In our ongoing meta-modeling research and development at Nace, we encountered challenges in verifying machine-generated outputs such as LLM assistant responses against task-specific guidelines. This issue particularly affects our enterprise partners who need automated, policy-compliant text generation.
To address these challenges, we propose Nace Automated Verification Intelligence (NAVI), a verification system for reviewing policy-aware outputs and identifying violating text. This blog post illustrates that current solutions, which focus on factual groundedness, are insufficient for detecting policy violations. Additionally, merely prompting off-the-shelf LLMs proves inadequate for this specific task.
To thoroughly assess performance in policy alignment verification, we developed a comprehensive evaluation procedure around this task, the key component of which is a collection of human-curated policy-aware generation examples, Policy Alignment Verification (PAV) Dataset. Using this evaluation, we show that our Safety Model surpasses all existing solutions, while maintaining enterprise-grade latency and reliability. In efforts to advance the progress on this task and be transparent in our evaluation, we are open-sourcing a subset of the PAV Dataset.
In this blogpost, we'll cover:
List of current solutions and a high-level overview of NAVI
Our methodology for creating a custom validation dataset for the verifiers
Description of experimental setup
A performance comparison, which demonstrates NAVI Safety Model's effectiveness
Existing solutions
In evaluating NAVI's performance, we compared it to several existing solutions for policy alignment verification and factual grounding. Here's a brief overview of each solution:
AWS Bedrock Guardrail
Amazon Bedrock Guardrails is a tool designed to improve the safety of generative AI applications. It evaluates both user inputs and model responses against configured policies, including content filters, denied topics, sensitive information filters, and word filters. While primarily focused on content safety and factual verification, it can be adapted for policy alignment tasks.
Azure Groundedness Detection
Microsoft's Azure Groundedness service aims to identify whether AI-generated content is grounded in the input data. It analyzes output content by comparing it to the provided context, flagging inconsistencies or hallucinations. Although designed for factual verification, its approach to detecting ungrounded information can be applied to policy alignment verification.
NeMo Guardrails with GPT-4o
NVIDIA's NeMo Guardrails is an open-source toolkit for programmably guardrailing different types of inputs and outputs in AI applications. They feature moderation applied to input, dialog, retrieval, tool use and outputs. Several modes here can be evaluated for policy alignment verification. Any LLM can be used with the toolkit, in our evaluation GPT-4o is used as a strong reasoning model.
Prompt-based Adaptation of Off-the-shelf LLMs
This approach involves using carefully crafted prompts with existing large language models to perform policy alignment verification. It offers flexibility and can be implemented with various LLMs, potentially achieving competitive performance.
We find that most of the solutions focus on specifically factual grounding. By evaluating NAVI against this diverse set of solutions, we aim to provide a comprehensive comparison that extends beyond just factual verification to encompass broader policy alignment tasks.
How NAVI works
NAVI verification process comprises four main stages:
Retrieval Pipeline: This stage processes submitted policy documents into a representation suitable for effective decision-making.
Safety Model: Responsible for the primary decision-making, it outputs a verification result indicating whether the input is “Compliant” or “Noncompliant” with the policies.
Explanation Module: Provides reasoning behind why the input is labeled “Noncompliant” with given policies.
Reference Module: Supplies specific references to the sections of policies that have been violated.
While the system provides comprehensive output, this evaluation will focus primarily on the Safety Model's performance, as it is the core component producing the verification result, and no suitable alternatives for other components were found in existing solutions.
Policy Alignment Verification Dataset
Figure 1. Demonstration of policy alignment verification task against the more common factual verification (task explanation - Google Drawings)
In policy compliance checking, certain phrases or wordings can inadvertently trick models and cause them to misinterpret or incorrectly apply policies. This occurs because models were not trained to operate within a limited space of select policies. To evaluate how our solution performs compared to existing ones, we recognized the need to build a validation dataset that accurately reflects real-world scenarios. Existing open-source datasets, which primarily focus on fact-checking, do not meet our specific requirements for policy verification. Therefore, we manually constructed our own dataset, guided by the following criteria:
Diversity and Realism: The dataset must include a wide range of cases that accurately reflect real-world scenarios, ensuring it is applicable to various contexts and challenges encountered in practice.
Complexity: The dataset should be sufficiently complex to present a significant challenge for large language models, pushing their capabilities in understanding and compliance verification.
Specificity: The dataset needs to be tailored to address the specific requirements of policy verification, as opposed to the general fact-checking focus of existing datasets.
Table 1. Description of Policy Alignment Verification Dataset fields.
Data collection process
Our system's applications span diverse industries and specialized capabilities. The industry-specific subsets include Aviation, Communications, Finance, Legal, Retail, etc. Capability-specific subsets aim at evaluating model behavior against some specific input formats or features like short policies, informal or highly irrelevant inputs.
We model industry-specific cases by focusing on specific representative companies that have publicly available policy documents.
For each subset, multiple violating and policy-following texts are created manually (“response”).
The annotation criteria were iteratively refined until achieving over 95% inter-annotator agreement across all subsets.
We processed “response” through an embedding based approach and retrieved the context. The extraction is not used during evaluation, and is shown for demonstration purposes only.
For the open-source version, we pick 6 high-quality industry-specific subsets as detailed below.
The resulting open-source dataset consists of 125 examples across 6 industry-specific subsets, out of which 70 have a positive label and 55 have negative (`Noncompliant` is positive). Table 2 features a breakdown of the examples by subset. To evaluate the efficiency of identifying the correct policy in a multi-policy, multi-document context, each example includes a list of policy files where target policy extraction should be performed.
Table 2. Subset statistics of the open source dataset.
The dataset is accessible on Hugginface with the link: https://huggingface.co/datasets/nace-ai/policy-alignment-verification-dataset/
Experimental Setup
5.1. Relevant policies
A key factor in NAVI's superior performance in real-world settings is our Retrieval Pipeline, which comprehensively presents related policy information across multiple related policy documents. Since different vendors have varying context length requirements and lack native support for document processing, we supply all competing solutions with processed policy outputs from our pipeline instead of raw documents. By doing so, we ensure that during performance comparison, we are evaluating solely the verifiers' ability to identify policy violations, without the influence of document processing capabilities.
5.2. AWS Bedrock Guardrails
Method
We create a guardrail using AWS platform with Relevance and Groundness flags following their documentation. We run our passages against extracted relevant policies and compute 2 scores:
Relevance score
Groundness score
Then we find the most optimal thresholds for relevance and groundness scores to achieve the highest F1 score tuning directly on the target dataset. This is done deliberately to ensure we record the best possible performance. We try all threshold combinations from 0.01 to 0.99 with step 0.01 for both relevance and groundness.
Figure 2. AWS Bedrock Guardrails 3D surface plot of performance.
5.3 Azure Groundedness Detection
Method
We create a Content Safety guard on the Azure platform following their documentation.
To ensure compatibility with Azure Groundedness Detection, we needed to adjust our dataset. Since not all examples included queries, we crafted queries by asking an LLM to generate a plausible query given a response.
Azure Groundedness Detection offers two modes: QnA and Summarization. We tested both and found that the Summarization mode performs better on our validation set. Therefore, we report results based on this mode.
5.4. NeMO with GPT-4o
Description
NeMo Guardrails is a content guardrail service offered by Nvidia. It processes each request in multiple steps:
Analyzing user intent
Determining appropriate action based on defined rules
Generating policy-compliant responses
Key characteristics:
Requires configuration files specifying models, rules, and action flows
Uses GPT-4o as the underlying model for both generation and verification
Appends policy rules to each prompt
Performs multiple API calls per request for comprehensive checking
Limitations:
Configuration files need to be updated for each new policy check
Policy descriptions must be concise due to context length restrictions
Higher latency due to multiple API calls per request
Limited ability to handle complex, multi-document policy scenarios
Method
For environment setup with NeMo, we install Nemo Guardrails following their documentation
Set up OpenAI API access for GPT-4o.
We then configure guardrails with appropriate policies and rules.
Given the challenges of submitting arbitrary model responses to guardrails, we have reframed the problem to focus on verifying user-provided text against established policies.
5.5. Prompt-Based Safety
Method
We selected state-of-the-art general reasoning models for evaluating under the prompt-based setting:
gpt-4o-2024-11-20
claude-3-5-sonnet-202410222
We have tried various prompting approaches such as:
Zero-shot learning - giving a description of the task
Rule learning - giving our labeling criteria for determining whether a passage is compliant or not
Few-shot learning - we manually crafted 3 examples that we used as demonstrations to the model
For each model, we ran several improvement iterations in prompt styles to avoid suboptimal performance.
We record best results under a few-shot setting, which are the reported results.
5.6. Evaluation Framework
We selected the F1 score as our evaluation metric for performance since we want to account for both precision and recall in this classification problem. Our focus was specifically on the Noncompliant class, as identifying these cases is crucial for companies—missing instances of noncompliance could have serious consequences, so we prioritized maximizing the detection of these cases, To ensure transparency, all results reported in this post are based on the open-source version of our evaluation dataset.
For latency estimations, we use Average Model Latency in milliseconds, measured on the same machine. The time for processing each example is recorded, and the average is calculated across all examples.
Results and Analysis
Table 3. Performance and latency results
Figure 3. Performance and Latency comparison
The results highlight NAVI's clear advantages:
Performance: NAVI achieved an F1 score of 90.4%, significantly outperforming AWS Bedrock Guardrail (76.5%), gpt-4o/Claude Sonnet solutions (~75%), and AzureGroundedness Detection/NeMo implementations (71.2%). This indicates NAVI's robust ability to accurately identify policy violations, making it a highly reliable choice for enterprises requiring precise compliance checks.
Speed: With an average processing time of 488 ms, NAVI matches the speed of AWS Bedrock Guardrail and Azure Groundedness Detection, while being markedly faster than NeMo (2669 ms) and Claude (2926 ms) implementations. This efficiency ensures that NAVI can handle large-scale applications without sacrificing performance, essential for real-world deployments.
These findings suggest that purpose-built systems like NAVI, specifically designed for policy alignment verification, provide substantial advantages over general-purpose guardrails and prompt engineering approaches. The tailored architecture of NAVI not only enhances performance but also maintains speed, positioning it as a leading solution for enterprises in need of reliable policy compliance verification.
Conclusion
The NAVI Safety Model has demonstrated remarkable performance, outperforming existing solutions like AWS Bedrock Guardrail. These results underscore the importance of specialized systems in achieving higher accuracy and faster processing speeds compared to general-purpose and factuality-driven models. As organizations increasingly seek automated solutions for policy compliance, NAVI stands out as a robust and efficient tool, capable of meeting the demands of enterprise-level applications. Our findings pave the way for further advancements in policy verification technology and, more broadly, in multi-stage AI applications.