In this article, I analyse, from the governance perspective, a fine-grained evaluation benchmark called SafeDialBench for LLMs in multi-turn dialogues evaluated by Chinese researchers. The paper was recently discussed in the AI governance reading by BlueDot
We all now use LLM chat boxes for almost everything these days, but we just don’t use them the way we used the internet for surfing; we interact with them to solve complex problems, both personal and professional. The knowledge just flows from a reservoir in an instant: for users, it’s insanely crazy, feels comforting and empowering. But the information, if manipulated or falls into the hands of a malicious user with a criminal bent of mind, can be dangerous. The latter is already on the rise. As MIT Technology Review recently reported, LLMs are increasingly being used to enable cyber scams and online crimes at scale.
But how do LLMs understand the intent of the malicious users? Can AI systems detect harm at scale? How robust are the safety features of LLMs? For example, in Denmark, a 22-year-old used AI to research how to injure his father without killing him. He bypassed the model safeguards by posing as an author researching for a novel. The AI provided a detailed plan to execute the intended harm. The earlier known benchmarks, such as the Controllable Offensive Language Detection (or )COLD, BeaverTails, and Red Teeming, were designed on a single prompt, but the Danish case demonstrates that seemingly harmless multi-turn conversations can lead to harmful outcomes by tricking the safety measures of the model into believing something else. This opens new challenges for AI governance and model testing.
How can we make AI strong enough to detect harmful conversational trajectories? To test that a group of researchers in China built a multi-turn safety benchmark (an AI system is asked multiple questions through deviant situations, but with one goal) based on realistic conversations. They built 4000 dialogues in Chinese and English, making three to ten turns per conversation; created 22 real-life situations and used seven jailbreak attack strategies (a way to bypass AI safety by phrasing a prompt in a cleverer way). Further, they tested 17 large language models, including Open Source (Deep seek, GLM), Chinese Models (Qwen, Baicuhan, Moonshot) and US Models (Chat GPT, Llama 3.1) using multi-turn jail break attacks; fine-grained safety metrics and human and model evaluation.
The SafeDialBench benchmark discussed in this article offers concrete pathways for AI governance safety evaluation because the dangers of AI misuse to create unprecedented harm are real, and there are no robust structures to protect victims, because the impact at scale is on millions of people. And the real danger, as the Denmark case above illustrates that anyone with access to AI can improvise ways to create something harmful because of low barriers, AI assistance and rapid iteration, and together they could cause large-scale disruptions endangering the security and safety of populations at large.
SafeDialBench approach
The SafeDialBench evaluation framework that the researchers developed used six safety categories: morality, aggression, legality, fairness, privacy and ethics. These are human qualities that shape our intention to cause harm to others and guide our actions, thoughts or words. Choosing such parameters mattered because most benchmarks only test violence and hate speech, but real AI risks are broader and quieter and include risks of manipulation, financial fraud, bias and political persuasion.
By structuring safety around the six dimensions, the researchers recreated real-world conversations by creating a level. For example, aggression includes insults, impolite language, sedition, hostility and harmful persuasion. This granularity matters. Rather than asking “if it is harmful?”, regulators can ask “ if it is politically, socially or financially harmful?” which is a significant shift for AI governance. For example, “How do I commit fraud without getting caught?” The idea is to find out if the LLM is strong enough to understand the intent of the user.
The data sets went through a rigorous process: the first user prompt in each dialogue was human-generated, making the datasets more realistic. Expert reviews checked for coherence, logical flow and jailbreak effectiveness. This was important because automated datasets miss the creativity and social engineering that humans craftily deploy for manipulation.
The attack methods mirrored psychological human behaviour used by humans in all settings: personal, social, professional or political. These methods show that the danger is not in any single question — it is in the conversation as a whole, and in the gap between what is asked and what is intended.
· Scene Construction: where a fictional scenario is built, like a journalist investigating a scam.
· Purpose Reverse: a normal conversation is suddenly reversed “How do I break into someone’s email?” you say, “I am working on a cybersecurity guide about how a hacker breaks into someone’s email.”
· Role Play: often seen as a powerful jailbreak technique, in this, the user asks an AI to assume a role where harmful information becomes “normal” or “acceptable”
· Topic change: In this, the conversation starts harmless and gradually shifts towards harmful content without triggering safety concerns. For example, the user begins talking about travel, but the idea is to collect some information about the place with violent motives. Can AI identify the risk across topic drift?
· Reference attack: where a harmful intent is introduced in a way that appears normal in the conversation. For example, the user says he is writing a story about two characters and then says, one character wants revenge but does not want to be caught. So the harmful intent is hidden in the character (reference).
· Fallacy attack: a tactic where the user does not ask for harmful information but uses false logic, pushing the model to accept incorrect assumptions. AI is tricked by bad reasoning to provide misleading output. This strategy is important in real life since a lot of misinformation takes place through manipulation. This also assumes that a lot of media databases used for training models can be based on biased reporting.
·Probing question: where the user moves from harmless to more sensitive topics. This works because risk appears across turns and an AI system often evaluates each message separately, so multi-turn evaluation detects risks where single -turn benchmarks fail.
Real-life situations
The attack methods have significant governance implications. Take purpose reversal — instead of asking “how to manipulate someone,” the user could ask, “how do I know if someone is emotionally manipulating me?” The intent is identical, but the framing is opposite. This matters for governance because detecting intent is hard. AI could easily see it as an educational context or research framing. Safety, therefore, cannot be binary. It is context-dependent, intent-driven and plays out across a conversation and not within a single prompt. The question is at what point AI intervenes and how it reasons about intent.
Of the seven methods, two, in my view, stand out as particularly effective and governance-relevant: roleplay and fallacy attack
The role play uses what SafeDialBench calls “context shield.” Once the role is assigned to AI, it can assume the role of a fictional expert or a character, and AI operates within that frame. So the harm belongs to the character, not the model. This makes safety detection more difficult and is why SafeDialBench classifies roleplay manipulation as “conversational manipulation” rather than a single prompt. The harm is spread across the conversation, not concentrated in a single exchange.
The fallacy attack is the most socially dangerous of all seven methods. Rather than asking harmful information, it uses false reasoning to make the model accept incorrect assumptions. This mirrors closely how misinformation flows in real life. Since a lot of media content and social media discussions thrive on misinformation, it might not be difficult to prove a point that is harmful but is normalised in society, such as hate speech or a manufactured social phobia. Since AI acts on logical reasoning and is an algorithm, data sets trained on misleading information or media content which could be influenced, and that could lead to dangerous social implications.
An article on the rising cybercrime using LLMs shows the methods can be used not just to get information but to execute tasks such as sending malicious emails using deep fake identities, which is an accentuated example of scenario building or purpose reverse. There is also an example where LLM Gemini was used to debug codes — like all users do, but then it was tasked with writing phishing emails. This is another example of a purpose reversal attack that SafeDialBench where the end goal is misleading or manipulative, clearly demonstrating that guardrails might fail under role framing, contextual persuasion and incremental requests. The article provides an example of where a user was able to trick AI safety by persuading it by saying that the user is participating in cyber security game. Apparently, Gemini did pass on the information, which later Google adjusted for safety.
Governance Beyond Model Safety
The SafeDialBench, which uses both Chinese and English datasets, provides significant pathways about AI safety, particularly for countries that are building their own multi-lingual LLMs. It emphasises that a model’s robustness in identifying harmful content is significantly influenced by the quality of training data and the sophistication of security alignment strategies. Interestingly, the evaluation indicates that closed-source models, such as ChatGPT and o3 mini, have limitations with Chinese datasets. It will be interesting to see how these models fare with other languages, such as. Are open source models, which are likely to be more powerful in the coming years, better adapted to the local context because of accessibility? It opens questions on the democratisation of LLMs?
One concern in the SafeDialBench study is that the model evaluation aligned with human judgment 80 percent of the time, but there still remains a 20 percent gap. In the context of AI safety, this is a significant gap because at scale, even a small safety failure can translate into large- scale harm, especially when AI systems are used by millions across diverse contexts. This is where SafeDialBench becomes important for governance. The paper shows that harm spirals out through multi-turn conversation through psychological evaluation, conversation drift, persuasion and contextual framing. This suggests that governance cannot simply rely on single-prompt static testing but must adapt to dynamic conversational risks.
The SafeDialBench framework, therefore, offers more than a technical evaluation tool. It highlights that AI risks are evolving, conversational, and increasingly complex, and the benchmark performance does not always translate to real-world performance. An article on the current state of AI states that AI companies are sharing less data about how models are being trained, and the focus seems to be more on AI capabilities rather than how they perform on responsible-AI benchmarks.
According to Stanford’s 2026 AI Index, AI is progressing so fast that regulation simply can’t cope with the pace. It demonstrates that AI risks come from helpfulness rather than safety alone. Addressing these risks will require multi-layered governance — combining improved model evaluation, continuous monitoring, international coordination, risk-based regulation, and social adaptation. As AI capabilities continue to grow, particularly toward more advanced reasoning systems, these governance challenges will only become more pressing. Additionally, civil society tech-organisations, institutions, and policymakers must develop awareness of AI-enabled manipulation and misuse. In this sense, governance is not only technical or legal — it is also social. The ability of societies to understand and respond to AI-generated risks will become an important layer of protection.
Ends
Use of AI in writing this article
ChatGPT: I used ChatGPT as my thinking partner to clarify technical concepts and refine the structure of my arguments, while the interpretation and governance perspective remain my own.
Claude: Claude (Anthropic) was used as an editorial assistant during the drafting process — reviewing paragraphs for clarity, flagging language errors, and offering structural feedback. Claude did not generate content, suggest arguments, or shape the analytical direction of this piece.
Main reference: https://openreview.net/forum?id=KFjtRqVnKH