Strata-Sword is a hierarchical Chinese-English jailbreak safety benchmark developed in-house by Alibaba-AAIG. It treats "reasoning complexity" as an evaluable safety dimension and proposes several Chinese-specific attack methods, in order to systematically evaluate the safety boundaries of LLMs and LRMs under different levels of reasoning complexity and to offer new ideas for improving model safety.

Alibaba-AAIG/Strata-Sword

Strata-Sword: A Hierarchical Safety Evaluation towards LLMs based on Reasoning Complexity of Jailbreak Instructions

Strata-Sword is a multi-level safety evaluation benchmark proposed by the Alibaba AAIG team. It aims to assess models' safety capabilities more comprehensively when they face jailbreak instructions of varying reasoning complexity, helping model developers better understand each model's safety boundaries.

  🤗 Hugging Face   |   🤖 ModelScope   |   📄 Arxiv   


🧩 Our Approach — Strata-Sword

Core Contribution

  1. Reasoning complexity as a safety evaluation dimension: We define and quantify "reasoning complexity" as an evaluable safety dimension, and categorize harmful jailbreak instructions into three tiers (basic instructions, simple reasoning, and complex reasoning) based on three key elements of reasoning complexity.

  2. Tiered jailbreak evaluation dataset construction: We classify 15 different jailbreak attack methods into 3 levels according to reasoning complexity. The dataset includes a total of 700 jailbreak prompts.

  3. Language-specific jailbreak attack methods: Strata-Sword also accounts for language characteristics, customizing attack methods for both Chinese and English, and for the first time introduces three Chinese-specific jailbreak attack methods: acrostic-poem attack, lantern-riddle attack, and Chinese-character decomposition attack.
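The three-tier organization described above can be pictured as a small data model. The following is a minimal sketch and not the repository's actual schema: the `Level` enum, the `JailbreakPrompt` fields, and the `group_by_level` helper are hypothetical names introduced for illustration, and the `method_a`/`method_b`/`method_c` labels are placeholders rather than the benchmark's real method list. The records carry metadata only, no prompt text.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """The three reasoning-complexity tiers named in the README."""
    BASIC_INSTRUCTION = 1
    SIMPLE_REASONING = 2
    COMPLEX_REASONING = 3


@dataclass(frozen=True)
class JailbreakPrompt:
    """Hypothetical record layout; the real dataset fields may differ."""
    prompt_id: str
    language: str   # "zh" or "en"
    method: str     # attack method label (placeholder values below)
    level: Level


def group_by_level(prompts):
    """Return a dict mapping each Level to the list of prompts in that tier."""
    groups = defaultdict(list)
    for p in prompts:
        groups[p.level].append(p)
    return dict(groups)


# Placeholder records (assumed, not taken from the dataset).
sample = [
    JailbreakPrompt("p1", "en", "method_a", Level.BASIC_INSTRUCTION),
    JailbreakPrompt("p2", "zh", "method_b", Level.SIMPLE_REASONING),
    JailbreakPrompt("p3", "zh", "method_c", Level.COMPLEX_REASONING),
    JailbreakPrompt("p4", "en", "method_c", Level.COMPLEX_REASONING),
]

# Count how many prompts fall into each tier.
counts = {level.name: len(items) for level, items in group_by_level(sample).items()}
print(counts)
```

Keeping the level as an explicit field on each record, rather than inferring it from the method name, makes it straightforward to slice results by tier, by method, or by language.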

Evaluation Results

We systematically evaluate 23 mainstream open-source and closed-source commercial large language models, characterizing models' safety capability boundaries from the perspective of reasoning complexity.

[Figure: Strata-Sword evaluation results]

We also provide statistics for the 15 jailbreak attack methods used in Strata-Sword, evaluating each method's overall performance.

[Figure: Strata-Sword results for each attack method]
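Per-level and per-method summaries like the ones above can be produced by grouping judged model outputs and averaging. The sketch below assumes the reported quantity is an attack success rate (the fraction of prompts for which the model was judged to be jailbroken); this section does not name the metric, so treat that as an assumption. The `attack_success_rate` function, the record fields, and the sample values are all hypothetical and are not results from the benchmark.

```python
from collections import defaultdict


def attack_success_rate(records, key):
    """Compute the fraction of successful attacks per group.

    records: iterable of dicts, each with a boolean "jailbroken" field
    key:     name of the field to group by, e.g. "level" or "method"
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        successes[r[key]] += int(r["jailbroken"])
    return {group: successes[group] / totals[group] for group in totals}


# Made-up judged outputs, for illustration only.
judged = [
    {"level": 1, "method": "method_a", "jailbroken": False},
    {"level": 1, "method": "method_a", "jailbroken": False},
    {"level": 2, "method": "method_b", "jailbroken": True},
    {"level": 2, "method": "method_b", "jailbroken": False},
    {"level": 3, "method": "method_c", "jailbroken": True},
    {"level": 3, "method": "method_c", "jailbroken": True},
]

asr_by_level = attack_success_rate(judged, "level")
asr_by_method = attack_success_rate(judged, "method")
print(asr_by_level)
print(asr_by_method)
```

Using one grouping function for both views keeps the per-level and per-method numbers consistent, since both are derived from the same judged records.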

🚀 Quick Start

1. Environment installation: install the required dependencies

pip install -r requirements.txt

2. Test: run the Chinese and English jailbreak prompt sets for the three Strata-Sword levels

python strata_sword.py

📚 Citation

If you use Strata-Sword in your research, please cite the following paper:

@article{Strata-Sword,
  title={Strata-Sword: A Hierarchical Safety Evaluation towards LLMs based on Reasoning Complexity of Jailbreak Instructions},
  author={Zhao, Shiji and Duan, Ranjie and Liu, Jiexi and Jia, Xiaojun and Wang, Fengxiang and Wei, Cheng and Cheng, Ruoxi and Xie, Yong and Liu, Chang and Guo, Qing and Tao, Jialing and Chen, YueFeng and Xue, Hui and Wei, Xingxing},
  year={2025},
  url={https://github.com/Alibaba-AAIG/Strata-Sword}
}

🤝 Contribution

We welcome collaboration and discussion on safety evaluation and alignment. Red-teaming is an ongoing effort, and Strata-Sword will continue to release new versions. We invite red-team developers working on large models to propose new jailbreak attack methods for inclusion in future Strata-Sword evaluation sets. Feel free to open Issues to report problems and to join Discussions to share ideas.


📄 License

This project is licensed under the Apache 2.0 License.


🙏 Acknowledgments

We thank the open-source community and the researchers advancing AI safety.

Strata is part of Alibaba AAIG's commitment to responsible AI.

“The LLM is my oyster, which I with Strata-Sword will open.”
