Your chatbot might be leaky. According to recent reports, user conversations with AI chatbots such as OpenAI's ChatGPT and xAI's Grok "have been exposed in search engine results." Similarly, prompts on the Meta AI app may be appearing on a public feed. But what if those queries and chats can be protected, boosting privacy in the process?
That's what Duality, a company specializing in privacy-enhancing technologies, hopes to accomplish with its private large language model (LLM) inference framework. Behind the framework lies a technology called fully homomorphic encryption, or FHE, a cryptographic technique that makes it possible to compute on encrypted data without ever decrypting it.
Duality's framework first encrypts a user prompt or query using FHE, then sends the encrypted query to an LLM. The LLM processes the query without decrypting it, generates an encrypted reply, and transmits that back to the user, who alone holds the secret key needed to read it.
"They can decrypt the results and get the benefit of running the LLM without actually revealing what was asked or what was responded," says Kurt Rohloff, cofounder and CTO at Duality.
As a prototype, the framework supports only smaller models, particularly Google's BERT models. The team tweaked the models to make them compatible with FHE, for instance by replacing some complex mathematical functions with approximations that are more efficient to compute on encrypted data. Even with these slight alterations, however, the AI models operate just as a normal LLM would.
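BERT's GELU activation is one example of a function with no exact homomorphic equivalent, so it gets swapped for a polynomial fit. Here's a hedged sketch of that kind of substitution using OpenFHE's Chebyshev interpolation; the interval and degree are illustrative choices, not Duality's.

```cpp
#include "openfhe.h"
#include <cmath>
#include <iostream>

using namespace lbcrypto;

int main() {
    // CKKS can only add and multiply, so nonlinearities like GELU are
    // approximated by polynomials. EvalChebyshevFunction fits a Chebyshev
    // series to a function over an interval and evaluates it homomorphically.
    CCParams<CryptoContextCKKSRNS> params;
    params.SetMultiplicativeDepth(8);  // polynomial evaluation consumes depth
    params.SetScalingModSize(50);
    CryptoContext<DCRTPoly> cc = GenCryptoContext(params);
    cc->Enable(PKE);
    cc->Enable(KEYSWITCH);
    cc->Enable(LEVELEDSHE);
    cc->Enable(ADVANCEDSHE);  // required for Chebyshev function evaluation

    auto keys = cc->KeyGen();
    cc->EvalMultKeyGen(keys.secretKey);

    std::vector<double> x = {-2.0, -0.5, 0.0, 0.5, 2.0};
    auto ct = cc->Encrypt(keys.publicKey, cc->MakeCKKSPackedPlaintext(x));

    // Approximate GELU(x) = 0.5*x*(1 + erf(x/sqrt(2))) on [-4, 4] with a
    // degree-13 Chebyshev polynomial (both are illustrative parameters).
    auto gelu = [](double v) { return 0.5 * v * (1.0 + std::erf(v / std::sqrt(2.0))); };
    auto ctGelu = cc->EvalChebyshevFunction(gelu, ct, -4.0, 4.0, 13);

    Plaintext out;
    cc->Decrypt(keys.secretKey, ctGelu, &out);
    out->SetLength(x.size());
    std::cout << "approx GELU: " << out << std::endl;
    return 0;
}
```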
"Whatever we do on the inference does not require retraining. In our approach, we still want to make sure that training happens the usual way, and it's the inference that we essentially try to make more efficient," says Yuriy Polyakov, vice president of cryptography at Duality.
FHE is considered quantum-proof, meaning it's expected to withstand attacks even from quantum computers. Yet despite its high level of security, the cryptographic method can be slow. "Fully homomorphic encryption algorithms are heavily memory-bound," says Rashmi Agrawal, cofounder and CTO at CipherSonic Labs, a company that spun out of her doctoral research at Boston University on accelerating homomorphic encryption. She explains that FHE relies on lattice-based cryptography, which is built on hard math problems involving vectors on a high-dimensional grid. "Because of that lattice-based encryption scheme, you blow up the data size," she adds. The result is huge ciphertexts (the encrypted versions of your data) and keys that require lots of memory.
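Rough arithmetic shows the scale of that blowup. The sketch below uses typical CKKS parameter sizes, assumed here for illustration rather than taken from Duality or CipherSonic Labs.

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Back-of-the-envelope CKKS ciphertext size, illustrating the
    // "data blowup" Agrawal describes. Parameter values are typical
    // for deep computations, not drawn from any specific product.
    const uint64_t ringDim   = 1ULL << 16;  // polynomial ring dimension N
    const uint64_t numLimbs  = 20;          // RNS limbs (one per modulus)
    const uint64_t wordBytes = 8;           // each coefficient limb is 64 bits

    // A CKKS ciphertext is a pair of degree-(N-1) polynomials.
    uint64_t ctBytes = 2 * ringDim * numLimbs * wordBytes;

    // It encrypts at most N/2 real numbers (8-byte doubles).
    uint64_t ptBytes = (ringDim / 2) * 8;

    std::cout << "plaintext:  " << ptBytes / 1024.0 << " KiB\n";            // 256 KiB
    std::cout << "ciphertext: " << ctBytes / (1024.0 * 1024.0) << " MiB\n"; // 20 MiB
    std::cout << "expansion:  " << static_cast<double>(ctBytes) / ptBytes << "x\n"; // 80x
    return 0;
}
```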
Another computational bottleneck is an operation called bootstrapping, which is needed to periodically remove noise from ciphertexts, Agrawal says. "This particular operation is really expensive, and that is why FHE has been slow so far."
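Concretely, every homomorphic multiplication adds noise and consumes a "level" of the ciphertext, and bootstrapping refreshes an exhausted ciphertext so computation can continue. The sketch below follows OpenFHE's simple-ckks-bootstrapping example with toy parameters; API details can vary slightly across library versions.

```cpp
#include "openfhe.h"
#include <iostream>

using namespace lbcrypto;

int main() {
    CCParams<CryptoContextCKKSRNS> params;
    SecretKeyDist skDist = UNIFORM_TERNARY;
    params.SetSecretKeyDist(skDist);
    params.SetSecurityLevel(HEStd_NotSet);  // INSECURE toy ring size, demo only
    params.SetRingDim(1 << 12);
    params.SetScalingModSize(59);
    params.SetFirstModSize(60);

    // Levels consumed inside bootstrapping itself, plus the levels we
    // want left over for useful work after the refresh.
    std::vector<uint32_t> levelBudget = {4, 4};
    uint32_t levelsAfterBootstrap = 10;
    uint32_t depth = levelsAfterBootstrap + FHECKKSRNS::GetBootstrapDepth(levelBudget, skDist);
    params.SetMultiplicativeDepth(depth);

    CryptoContext<DCRTPoly> cc = GenCryptoContext(params);
    cc->Enable(PKE);
    cc->Enable(KEYSWITCH);
    cc->Enable(LEVELEDSHE);
    cc->Enable(ADVANCEDSHE);
    cc->Enable(FHE);

    cc->EvalBootstrapSetup(levelBudget);
    auto keys = cc->KeyGen();
    cc->EvalMultKeyGen(keys.secretKey);
    uint32_t numSlots = cc->GetRingDimension() / 2;
    cc->EvalBootstrapKeyGen(keys.secretKey, numSlots);  // large, one-time keys

    // Encode at the last level to mimic a ciphertext that has used up
    // its multiplicative budget mid-computation.
    std::vector<double> x = {0.25, 0.5, 0.75, 1.0};
    Plaintext pt = cc->MakeCKKSPackedPlaintext(x, 1, depth - 1);
    auto ctExhausted = cc->Encrypt(keys.publicKey, pt);

    // The expensive refresh: the output has levels to spend again.
    auto ctFresh = cc->EvalBootstrap(ctExhausted);

    Plaintext out;
    cc->Decrypt(keys.secretKey, ctFresh, &out);
    out->SetLength(x.size());
    std::cout << "after bootstrap: " << out << std::endl;
    return 0;
}
```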
To overcome these challenges, the team at Duality is making algorithmic improvements to an FHE scheme known as CKKS (Cheon-Kim-Kim-Song) that's well-suited for machine learning applications. "This scheme can work with large vectors of real numbers, and it achieves very high throughput," says Polyakov. Part of those improvements involves integrating a recent advancement dubbed functional bootstrapping. "That allows us to do a very efficient homomorphic comparison operation of large vectors," Polyakov adds.
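Duality's functional bootstrapping itself isn't reproduced here, but a common baseline conveys what "homomorphic comparison" means in CKKS: approximate the discontinuous sign function with a polynomial and evaluate it slot-wise across packed vectors. A sketch with illustrative parameters (accuracy degrades where the two inputs are nearly equal):

```cpp
#include "openfhe.h"
#include <iostream>

using namespace lbcrypto;

int main() {
    // Baseline homomorphic comparison: sign(a - b) via Chebyshev fit,
    // applied to every slot of the packed vectors at once.
    CCParams<CryptoContextCKKSRNS> params;
    params.SetMultiplicativeDepth(10);
    params.SetScalingModSize(50);
    CryptoContext<DCRTPoly> cc = GenCryptoContext(params);
    cc->Enable(PKE);
    cc->Enable(KEYSWITCH);
    cc->Enable(LEVELEDSHE);
    cc->Enable(ADVANCEDSHE);

    auto keys = cc->KeyGen();
    cc->EvalMultKeyGen(keys.secretKey);

    std::vector<double> a = {0.9, 0.1, 0.45, 0.7};
    std::vector<double> b = {0.2, 0.8, 0.55, 0.1};
    auto ctA = cc->Encrypt(keys.publicKey, cc->MakeCKKSPackedPlaintext(a));
    auto ctB = cc->Encrypt(keys.publicKey, cc->MakeCKKSPackedPlaintext(b));

    // Fit sign(x) on [-1, 1] with a degree-27 Chebyshev polynomial
    // (inputs assumed pre-scaled to [-1, 1]; degree is illustrative).
    // Result is ~ +1 where a > b and ~ -1 where a < b.
    auto diff = cc->EvalSub(ctA, ctB);
    auto sign = [](double v) { return v < 0 ? -1.0 : (v > 0 ? 1.0 : 0.0); };
    auto ctCmp = cc->EvalChebyshevFunction(sign, diff, -1.0, 1.0, 27);

    Plaintext out;
    cc->Decrypt(keys.secretKey, ctCmp, &out);
    out->SetLength(a.size());
    std::cout << "approx sign(a-b): " << out << std::endl;
    return 0;
}
```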
All of these implementations are available on OpenFHE, an open-source library that Duality contributes to and helps maintain. "This is a complicated and sophisticated problem that requires community effort. We're making those tools available so that, together with the community, we can push the state of the art and enable inference for large language models," says Polyakov.
Hardware acceleration also plays a part in speeding up FHE for LLM inference, especially for bigger AI models. "They can be accelerated by two to three orders of magnitude using specialized hardware acceleration devices," Polyakov says. Duality is building with this in mind and has added a hardware abstraction layer to OpenFHE for switching from a default CPU backend to swifter ones such as GPUs and application-specific integrated circuits (ASICs).
Agrawal agrees that GPUs, as well as field-programmable gate arrays (FPGAs), are a good fit for FHE-protected LLM inference because they're fast and connect to high-bandwidth memory. She adds that FPGAs in particular can be tailored for fully homomorphic encryption workloads.
For its next steps, Duality is moving the private LLM inference framework from prototype to production. The company is also working on safeguarding other AI operations, including fine-tuning, which adapts pretrained models to specialized data for specific tasks, and semantic search, which uncovers the context and meaning behind a query rather than just matching keywords.
FHE forms part of a broader privacy-preserving toolbox for LLMs, alongside techniques such as differential privacy and confidential computing. Differential privacy introduces controlled noise or randomness into datasets, obscuring individual details while maintaining collective patterns. Confidential computing, meanwhile, employs a trusted execution environment: a secure, isolated area within a CPU for processing sensitive data.
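As a point of contrast with FHE's heavyweight machinery, differential privacy is simple enough to sketch directly. Below is a generic Laplace-mechanism example, not tied to any product named in this article.

```cpp
#include <iostream>
#include <random>
#include <vector>

// Laplace mechanism: add noise scaled to sensitivity/epsilon so any one
// individual's record barely shifts the released statistic.
double laplaceNoise(double scale, std::mt19937& rng) {
    // The difference of two i.i.d. exponentials is Laplace-distributed.
    std::exponential_distribution<double> expDist(1.0 / scale);
    return expDist(rng) - expDist(rng);
}

int main() {
    std::mt19937 rng(42);
    std::vector<double> salaries = {52000, 61000, 47500, 90000, 58000};

    double trueMean = 0;
    for (double s : salaries) trueMean += s;
    trueMean /= salaries.size();

    // Sensitivity of the mean: one person can move it by at most
    // maxValue / n (values assumed clamped to [0, 200000]).
    double sensitivity = 200000.0 / salaries.size();
    double epsilon = 1.0;  // privacy budget: smaller = more private

    double dpMean = trueMean + laplaceNoise(sensitivity / epsilon, rng);
    std::cout << "true mean: " << trueMean << ", DP mean: " << dpMean << "\n";
    return 0;
}
```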
Confidential computing has been around longer than FHE, and Agrawal considers it FHE's "head-to-head competition." However, she notes that confidential computing can't support GPUs, making it an ill match for LLMs.
"FHE is strongest when you need noninteractive end-to-end confidentiality because nobody is able to see your data anywhere in the whole process of computing," Agrawal says.
A fully encrypted LLM using FHE opens up a realm of possibilities. In health care, for instance, clinical results can be analyzed without revealing sensitive patient records. Financial institutions can check for fraud without disclosing bank account information. Enterprises can outsource computing to cloud environments without unveiling proprietary data. User conversations with AI assistants can be protected, too.
"We're entering into a renaissance of the applicability and usability of privacy technologies to enable secure data collaboration," says Rohloff. "We all have data. We don't necessarily have to choose between exposing our sensitive data and getting the best insights possible from that data."