In the ancient Chinese game of Go, state-of-the-art artificial intelligence has generally been able to defeat the best human players since at least 2016. But in the last few years, researchers have discovered flaws in these top-level AI Go algorithms that give humans a fighting chance. By using unorthodox “cyclic” strategies—ones that even a beginning human player could detect and defeat—a crafty human can often exploit gaps in a top-level AI’s strategy and fool the algorithm into a loss.
Researchers at MIT and FAR AI wanted to see if they could improve this “worst case” performance in otherwise “superhuman” AI Go algorithms, testing a trio of methods to harden the top-level KataGo algorithm’s defenses against adversarial attacks. The results show that creating truly robust, unexploitable AIs may be difficult, even in areas as tightly controlled as board games.
Three failed strategies
In the pre-print paper “Can Go AIs be adversarially robust?”, the researchers aim to create a Go AI that is truly “robust” against any and all attacks. That means an algorithm that can’t be fooled into “game-losing blunders that a human would not commit” but also one that would require any competing AI algorithm to spend significant computing resources to defeat it. Ideally, a robust algorithm should also be able to overcome potential exploits by using additional computing resources when confronted with unfamiliar situations.
The researchers tried three methods to generate such a robust Go algorithm. In the first, they simply fine-tuned the KataGo model using more examples of the unorthodox cyclic strategies that previously defeated it, hoping that KataGo could learn to detect and defeat these patterns after seeing more of them.
This strategy initially seemed promising, letting KataGo win 100 percent of games against a cyclic “attacker.” But after the attacker itself was fine-tuned (a process that used much less computing power than KataGo’s fine-tuning), that win rate fell back down to 9 percent against a slight variation on the original attack.
For their second defense attempt, the researchers iterated a multi-round “arms race” in which new adversarial models discover novel exploits and new defensive models seek to plug those newly discovered holes. After 10 rounds of such iterative training, the final defending algorithm still won only 19 percent of games against a final attacking algorithm that had discovered a previously unseen variation on the exploit. This was true even as the updated algorithm maintained an edge against the earlier attackers it had been trained against.
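The shape of that arms race can be illustrated with a deliberately tiny toy model. This is a sketch under heavy assumptions, not the researchers' training procedure: there are no neural networks here, integers stand in for exploit variants, a "defender" is just the set of variants it has been trained against, and the function name `run_arms_race` and its parameters are hypothetical. It also assumes the space of possible variants is much larger than the number of training rounds.

```python
def run_arms_race(num_variants, rounds):
    """Toy model: exploits are integers; a defender is the set it has patched."""
    patched = set()
    attackers = []  # the exploit variant found by each round's attacker
    for _ in range(rounds):
        # Attacker step: find the first variant the defender has not patched.
        exploit = next((v for v in range(num_variants) if v not in patched), None)
        if exploit is None:
            break  # every variant is patched, so the attacker has nothing left
        attackers.append(exploit)
        # Defender step: train against that exploit, which patches it.
        patched.add(exploit)
    # A final attacker searches once more after the last defensive update.
    final_exploit = next((v for v in range(num_variants) if v not in patched), None)
    return patched, attackers, final_exploit


patched, attackers, final_exploit = run_arms_race(num_variants=1000, rounds=10)
print(all(a in patched for a in attackers))  # defender holds against earlier attackers
print(final_exploit in patched)              # the newest variant is still unpatched
```

In this toy, the defender only ever closes holes it has already seen, so an attacker that moves last always has an opening as long as unpatched variants remain.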
In their final attempt, the researchers trained a completely new model based on vision transformers, hoping to avoid what might be “bad inductive biases” in the convolutional neural networks that KataGo is built on. This method also failed, winning only 22 percent of the time against a variation on the cyclic attack that “can be replicated by a human expert,” the researchers wrote.
Will anything work?
In all three defense attempts, the KataGo-beating adversaries didn’t represent some new, previously unseen height in general Go-playing ability. Instead, these attacking algorithms were laser-focused on discovering exploitable weaknesses in an otherwise performant AI algorithm, even if those simple attack strategies would lose to most human players.
Those exploitable holes highlight the importance of evaluating “worst-case” performance in AI systems, even when the “average-case” performance can seem downright superhuman. On average, KataGo can dominate even high-level human players using traditional strategies. But in the worst case, otherwise “weak” adversaries can find holes in the system that make it fall apart.
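The gap between the two measures can be made concrete with a short sketch. The two attacker win rates below are the figures reported above for the first, fine-tuned defense (100 percent against the original attacker, 9 percent against the fine-tuned one). The "strong human" entry is an illustrative assumption, not a number from the paper, and weighting all opponents equally in the average is also an assumption. The function names are hypothetical.

```python
from statistics import mean

# Win rates for the fine-tuned KataGo defense. The two attacker figures come
# from the article; the "strong human" figure is an illustrative assumption.
win_rates = {
    "original cyclic attacker": 1.00,
    "fine-tuned cyclic attacker": 0.09,
    "strong human (assumed)": 0.99,
}


def average_case(rates):
    """Mean win rate across all opponents, weighted equally."""
    return mean(rates.values())


def worst_case(rates):
    """Return (opponent, win rate) for the opponent the defender does worst against."""
    return min(rates.items(), key=lambda item: item[1])


print(f"average case: {average_case(win_rates):.2f}")
opponent, rate = worst_case(win_rates)
print(f"worst case: {rate:.2f} against {opponent}")
```

Under these numbers the average looks respectable while the worst case is dismal, which is why a single exploitable opponent can matter more than the overall record.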
It’s easy to extend this kind of thinking to other types of generative AI systems. LLMs that can succeed at some complex creative and reference tasks might still utterly fail when confronted with trivial math problems (or even get “poisoned” by malicious prompts). Visual AI models that can describe and analyze complex photos may nonetheless fail horribly when presented with basic geometric shapes.
Improving these kinds of “worst case” scenarios is key to avoiding embarrassing mistakes when rolling an AI system out to the public. But this new research shows that determined “adversaries” can often discover new holes in an AI algorithm’s performance much more quickly and easily than that algorithm can evolve to fix those problems.
And if that’s true in Go—a monstrously complex game that nonetheless has tightly defined rules—it might be even more true in less controlled environments. “The key takeaway for AI is that these vulnerabilities will be difficult to eliminate,” FAR CEO Adam Gleave told Nature. “If we can’t solve the issue in a simple domain like Go, then in the near-term there seems little prospect of patching similar issues like jailbreaks in ChatGPT.”
Still, the researchers aren’t despairing. While none of their methods were able to “make [new] attacks impossible” in Go, their strategies were able to plug up unchanging “fixed” exploits that had been previously identified. That suggests “it may be possible to fully defend a Go AI by training against a large enough corpus of attacks,” they write, with proposals for future research that could make this happen.
Regardless, this new research shows that making AI systems more robust against worst-case scenarios might be at least as valuable as chasing new human-level or superhuman capabilities.