“Emergent Misalignment” in LLMs

Interesting research: “Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs“:

Abstract: We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment.

In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger.

It’s important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.

The emergent properties of LLMs are so, so weird.

Posted on February 27, 2025 at 1:05 PM15 Comments

UK Demanded Apple Add a Backdoor to iCloud

Last month, the UK government demanded that Apple weaken the security of iCloud for users worldwide. On Friday, Apple took steps to comply for users in the United Kingdom. But the British law is written in a way that requires Apple to give its government access to anyone, anywhere in the world. If the government demands Apple weaken its security worldwide, it would increase everyone’s cyber-risk in an already dangerous world.

If you’re an iCloud user, you have the option of turning on something called “advanced data protection,” or ADP. In that mode, a majority of your data is end-to-end encrypted. This means that no one, not even anyone at Apple, can read that data. It’s a restriction enforced by mathematics—cryptography—and not policy. Even if someone successfully hacks iCloud, they can’t read ADP-protected data.

Using a controversial power in its 2016 Investigatory Powers Act, the UK government wants Apple to re-engineer iCloud to add a “backdoor” to ADP. This is so that if, sometime in the future, UK police wanted Apple to eavesdrop on a user, it could. Rather than add such a backdoor, Apple disabled ADP in the UK market.

Should the UK government persist in its demands, the ramifications will be profound in two ways. First, Apple can’t limit this capability to the UK government, or even only to governments whose politics it agrees with. If Apple is able to turn over users’ data in response to government demand, every other country will expect the same compliance. China, for example, will likely demand that Apple out dissidents. Apple, already dependent on China for both sales and manufacturing, won’t be able to refuse.

Second: Once the backdoor exists, others will attempt to surreptitiously use it. A technical means of access can’t be limited to only people with proper legal authority. Its very existence invites others to try. In 2004, hackers—we don’t know who—breached a backdoor access capability in a major Greek cellphone network to spy on users, including the prime minister of Greece and other elected officials. Just last year, China hacked U.S. telecoms and gained access to their systems that provide eavesdropping on cellphone users, possibly including the presidential campaigns of both Donald Trump and Kamala Harris. That operation resulted in the FBI and the Cybersecurity and Infrastructure Security Agency recommending that everyone use end-to-end encrypted messaging for their own security.

Apple isn’t the only company that offers end-to-end encryption. Google offers the feature as well. WhatsApp, iMessage, Signal, and Facebook Messenger offer the same level of security. There are other end-to-end encrypted cloud storage providers. Similar levels of security are available for phones and laptops. Once the UK forces Apple to break its security, actions against these other systems are sure to follow.

It seems unlikely that the UK is not coordinating its actions with the other “Five Eyes” countries of the United States, Canada, Australia, and New Zealand: the rich English-language-speaking spying club. Australia passed a similar law in 2018, giving it authority to demand that companies weaken their security features. As far as we know, it has never been used to force a company to re-engineer its security—but since the law allows for a gag order we might never know. The UK law has a gag order as well; we only know about the Apple action because a whistleblower leaked it to the Washington Post. For all we know, they may have demanded this of other companies as well. In the United States, the FBI has long advocated for the same powers. Having the UK make this demand now, when the world is distracted by the foreign-policy turmoil of the Trump administration, might be what it’s been waiting for.

The companies need to resist, and—more importantly—we need to demand they do. The UK government, like the Australians and the FBI in years past, argues that this type of access is necessary for law enforcement—that it is “going dark” and that the internet is a lawless place. We’ve heard this kind of talk since the 1990s, but its scant evidence doesn’t hold water. Decades of court cases with electronic evidence show again and again the police collect evidence through a variety of means, most of them—like traffic analysis or informants—having nothing to do with encrypted data. What police departments need are better computer investigative and forensics capabilities, not backdoors.

We can all help. If you’re an iCloud user, consider turning this feature on. The more of us who use it, the harder it is for Apple to turn it off for those who need it to stay out of jail. This also puts pressure on other companies to offer similar security. And it helps those who need it to survive, because enabling the feature couldn’t be used as a de facto admission of guilt. (This is a benefit of using WhatsApp over Signal. Since so many people in the world use WhatsApp, having it on your phone isn’t in itself suspicious.)

On the policy front, we have two choices. We can’t build security systems that work for some people and not others. We can either make our communications and devices as secure as possible against everyone who wants access, including foreign intelligence agencies and our own law enforcement, which protects everyone, including (unfortunately) criminals. Or we can weaken security—the criminals’ as well as everyone else’s.

It’s a question of security vs. security. Yes, we are all more secure if the police are able to investigate and solve crimes. But we are also more secure if our data and communications are safe from eavesdropping. A backdoor in Apple’s security is not just harmful on a personal level, it’s harmful to national security. We live in a world where everyone communicates electronically and stores their important data on a computer. These computers and phones are used by every national leader, member of a legislature, police officer, judge, CEO, journalist, dissident, political operative, and citizen. They need to be as secure as possible: from account takeovers, from ransomware, from foreign spying and manipulation. Remember that the FBI recommended that we all use backdoor-free end-to-end encryption for messaging just a few months ago.

Securing digital systems is hard. Defenders must defeat every attack, while eavesdroppers need one attack that works. Given how essential these devices are, we need to adopt a defense-dominant strategy. To do anything else makes us all less safe.

This essay originally appeared in Foreign Policy.

Posted on February 26, 2025 at 7:07 AM42 Comments

North Korean Hackers Steal $1.5B in Cryptocurrency

It looks like a very sophisticated attack against the Dubai-based exchange Bybit:

Bybit officials disclosed the theft of more than 400,000 ethereum and staked ethereum coins just hours after it occurred. The notification said the digital loot had been stored in a “Multisig Cold Wallet” when, somehow, it was transferred to one of the exchange’s hot wallets. From there, the cryptocurrency was transferred out of Bybit altogether and into wallets controlled by the unknown attackers.

[…]

…a subsequent investigation by Safe found no signs of unauthorized access to its infrastructure, no compromises of other Safe wallets, and no obvious vulnerabilities in the Safe codebase. As investigators continued to dig in, they finally settled on the true cause. Bybit ultimately said that the fraudulent transaction was “manipulated by a sophisticated attack that altered the smart contract logic and masked the signing interface, enabling the attacker to gain control of the ETH Cold Wallet.”

The announcement on the Bybit website is almost comical. This is the headline: “Incident Update: Unauthorized Activity Involving ETH Cold Wallet.”

More:

This hack sets a new precedent in crypto security by bypassing a multisig cold wallet without exploiting any smart contract vulnerability. Instead, it exploited human trust and UI deception:

  • Multisigs are no longer a security guarantee if signers can be compromised.
  • Cold wallets aren’t automatically safe if an attacker can manipulate what a signer sees.
  • Supply chain and UI manipulation attacks are becoming more sophisticated.

The Bybit hack has shattered long-held assumptions about crypto security. No matter how strong your smart contract logic or multisig protections are, the human element remains the weakest link. This attack proves that UI manipulation and social engineering can bypass even the most secure wallets. The industry needs to move to end to end prevention, each transaction must be validated.

Posted on February 25, 2025 at 12:04 PM19 Comments

More Research Showing AI Breaking the Rules

These researchers had LLMs play chess against better opponents. When they couldn’t win, they sometimes resorted to cheating.

Researchers gave the models a seemingly impossible task: to win against Stockfish, which is one of the strongest chess engines in the world and a much better player than any human, or any of the AI models in the study. Researchers also gave the models what they call a “scratchpad:” a text box the AI could use to “think” before making its next move, providing researchers with a window into their reasoning.

In one case, o1-preview found itself in a losing position. “I need to completely pivot my approach,” it noted. “The task is to ‘win against a powerful chess engine’—not necessarily to win fairly in a chess game,” it added. It then modified the system file containing each piece’s virtual position, in effect making illegal moves to put itself in a dominant position, thus forcing its opponent to resign.

Between Jan. 10 and Feb. 13, the researchers ran hundreds of such trials with each model. OpenAI’s o1-preview tried to cheat 37% of the time; while DeepSeek R1 tried to cheat 11% of the time­making them the only two models tested that attempted to hack without the researchers’ first dropping hints. Other models tested include o1, o3-mini, GPT-4o, Claude 3.5 Sonnet, and Alibaba’s QwQ-32B-Preview. While R1 and o1-preview both tried, only the latter managed to hack the game, succeeding in 6% of trials.

Here’s the paper.

Posted on February 24, 2025 at 7:08 AM23 Comments

Implementing Cryptography in AI Systems

Interesting research: “How to Securely Implement Cryptography in Deep Neural Networks.”

Abstract: The wide adoption of deep neural networks (DNNs) raises the question of how can we equip them with a desired cryptographic functionality (e.g, to decrypt an encrypted input, to verify that this input is authorized, or to hide a secure watermark in the output). The problem is that cryptographic primitives are typically designed to run on digital computers that use Boolean gates to map sequences of bits to sequences of bits, whereas DNNs are a special type of analog computer that uses linear mappings and ReLUs to map vectors of real numbers to vectors of real numbers. This discrepancy between the discrete and continuous computational models raises the question of what is the best way to implement standard cryptographic primitives as DNNs, and whether DNN implementations of secure cryptosystems remain secure in the new setting, in which an attacker can ask the DNN to process a message whose “bits” are arbitrary real numbers.

In this paper we lay the foundations of this new theory, defining the meaning of correctness and security for implementations of cryptographic primitives as ReLU-based DNNs. We then show that the natural implementations of block ciphers as DNNs can be broken in linear time by using such nonstandard inputs. We tested our attack in the case of full round AES-128, and had success rate in finding randomly chosen keys. Finally, we develop a new method for implementing any desired cryptographic functionality as a standard ReLU-based DNN in a provably secure and correct way. Our protective technique has very low overhead (a constant number of additional layers and a linear number of additional neurons), and is completely practical.

Posted on February 21, 2025 at 10:33 AM5 Comments

Device Code Phishing

This isn’t new, but it’s increasingly popular:

The technique is known as device code phishing. It exploits “device code flow,” a form of authentication formalized in the industry-wide OAuth standard. Authentication through device code flow is designed for logging printers, smart TVs, and similar devices into accounts. These devices typically don’t support browsers, making it difficult to sign in using more standard forms of authentication, such as entering user names, passwords, and two-factor mechanisms.

Rather than authenticating the user directly, the input-constrained device displays an alphabetic or alphanumeric device code along with a link associated with the user account. The user opens the link on a computer or other device that’s easier to sign in with and enters the code. The remote server then sends a token to the input-constrained device that logs it into the account.

Device authorization relies on two paths: one from an app or code running on the input-constrained device seeking permission to log in and the other from the browser of the device the user normally uses for signing in.

Posted on February 19, 2025 at 10:07 AM5 Comments

Sidebar photo of Bruce Schneier by Joe MacInnis.