Red Teaming AI Systems for Security Validation
DOI:
https://doi.org/10.63282/3050-9416.IJAIBDCMS-V6I1P112Keywords:
AI Red Teaming, Adversarial Machine Learning, OWASP LLM Top-10, MITRE ATLAS, NIST AI RMF, ISO/IEC 42001, ISO/IEC 23894, EU AI Act, Jailbreak Detection, Prompt Injection, Automated Red Teaming, Security Validation, Continuous Monitoring, Governance IntegrationAbstract
There is a growing trend in which artificial intelligence (AI) is embedded in safety-critical, regulatory-driven, or high impact domains such as healthcare, finance, public infrastructure, enterprise automation. The larger these systems become, however, the more apparent the risks such systems pose from adversarial exploitation, model manipulability, and the unsafe integration of tools. AI models are different from traditional software in that they exhibit probabilistic and context-dependency nature and hence exhibit unique vulnerabilities including prompt injection, jailbreaks, data poisoning, model extraction, and unsafe autonomous decision-making. These threats require special testing and assurance methodologies, beyond traditional penetration tests.
Red teaming is becoming an important technique to evaluate AI systems under real world adversarial settings. It’s the systematic stress-testing of models to identify holes before bad actors can exploit them. Nevertheless, contemporary red team approaches are fragmented with little shared taxonomies, common benchmarks or integration with governance. To fill in these gaps, this paper presents COMPASS-RT, a model-agnostic and deployment-agnostic framework for AI red teaming.
The framework is comprised of six pillars: (i) risk-based scoping, consistent with the NIST AI Risk Management Framework and ISO/IEC 42001; (ii) threat modelling using MITRE ATLAS and the OWASP Top-10 adapted for LLM applications; (iii) hybrid adversarial testing, combining human expertise with automated, LLM-driven attack generation; (iv) benchmark-based validation, using standardized corpora such as AdvBench, HarmBench, and JailbreakBench; (v) governance integration to ensure that findings map to risk registers, mitigation workflows, and regulatory compliance under regimes like the EU AI Act; and (vi) continuous validations, providing sustained measurement and regression testing across model updates.
This paper makes a threefold contribution: it unifies existing best practices from adversarial AI testing, operationalizes best practices in governance, and does so in a way which incorporates reporting templates that organizations can use for audit-ready assurance. By incorporating COMPASS-RT into enterprise security and compliance programs, enterprises can establish defensible processes for demonstrating the robustness of AI systems, eroding the efficacy of attacks and accelerating response. Risk Management: You cannot eliminate risk completely, but disciplined and repeatable red teaming greatly strengthens security posture and confidence in AI deployments
References
[1] National Institute of Standards and Technology (NIST), Artificial Intelligence Risk Management Framework (AI RMF 1.0), Jan. 2023. Available: https://doi.org/10.6028/NIST.AI.100-1
[2] National Institute of Standards and Technology (NIST), AI RMF Generative AI Profile (AI 600-1), 2024. Available: https://doi.org/10.6028/NIST.AI.600-1
[3] ISO/IEC 23894:2023, Information Technology — Artificial Intelligence — Guidance on Risk Management, International Organization for Standardization, Geneva, 2023.
[4] ISO/IEC 42001:2023, Artificial Intelligence — Management System, International Organization for Standardization, Geneva, 2023.
[5] UK National Cyber Security Centre (NCSC) and Cybersecurity and Infrastructure Security Agency (CISA), Guidelines for Secure AI System Development, Nov. 2023. Available: https://www.ncsc.gov.uk/collection/secure-ai
[6] Google, Introducing Google’s Secure AI Framework (SAIF), Jun. 2023. Available: https://cloud.google.com/secure-ai-framework
[7] MITRE, Adversarial Threat Landscape for Artificial-Intelligence Systems (ATLAS) Fact Sheet, 2024. [Online]. Available: https://atlas.mitre.org
[8] OWASP, Top 10 for Large Language Model Applications v1.1, Open Worldwide Application Security Project, 2023–2024. Available: https://owasp.org/www-project-top-10-for-large-language-model-applications
[9] [9] OpenAI, GPT-4 System Card, Mar. 2023. Available: https://cdn.openai.com/papers/gpt-4-system-card.pdf
[10] OpenAI, GPT-4o System Card, Aug. 2024. Available: https://cdn.openai.com/papers/GPT-4o-system-card.pdf
[11] E. Perez, S. Ringer, K. Kaplun, et al., “Red Teaming Language Models with Language Models,” arXiv preprint arXiv:2202.03286, 2022.
[12] D. Ganguli, A. Askell, J. Clark, et al., “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviours, and Lessons Learned,” Anthropic Research Report, 2022.
[13] Y. Liu, X. Xu, A. Zhang, et al., “Formalizing and Benchmarking Prompt Injection Attacks and Defenses,” in Proc. USENIX Security Symposium, 2024.
[14] P. Chao, H. Jin, A. Zhang, et al., “JailbreakBench: An Open Robustness Benchmark for Jailbreaking LLMs,” in Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2024.
[15] T. Mazeika, H. Zhang, S. He, et al., “HarmBench: A Standardized Evaluation Framework for Red Teaming and Robust Refusal,” arXiv preprint arXiv:2402.04249, 2024.
[16] WalledAI Project, AdvBench Dataset, 2024. Available: https://github.com/walledai/AdvBench
[17] CoAI Group, Tsinghua University, SafetyBench: A Comprehensive Safety Benchmark for LLMs, 2023–2024. Available: https://github.com/thu-coai/SafetyBench
[18] Z. Chen, H. Zhou, Y. Liu, et al., “AgentPoison: Red-teaming LLM Agents via Poisoning Long-term Memory or RAG,” in Proc. NeurIPS, 2024.
[19] European Union, Regulation (EU) 2024/1689 of the European Parliament and of the Council on Artificial Intelligence (EU AI Act), OJ L, 12 Jul. 2024.
[20] European Union Agency for Cybersecurity (ENISA), ENISA Threat Landscape 2024, 2024. [Online]. Available: https://www.enisa.europa.eu/publications