New algorithm compresses an LLM's reams of data to increase privacy

Leaner Large Language Models (LLMs) could enable efficient local use on phones and laptops


Since the inception of large language models (LLMs), they have increasingly been used to automate tasks like translation, text classification, and customer service. To use an LLM, users typically send requests to a centralized server, which processes them and returns a response.

However, this method is expensive, energy-intensive, and often slow. And because the data is stored on those servers, it is exposed to potential leaks or loss in the event of a system failure. To overcome these challenges, researchers have developed a technique for compressing the reams of data that make up an LLM.

Engineers at Princeton and Stanford Engineering have proposed a new algorithm that trims redundancies from an LLM's layers and reduces the precision with which they are stored. The compressed model can then be kept and run locally on a device like a phone or laptop.

Trimming redundancy means removing excess information that does not actively contribute to the output. Reducing the precision of layers means storing the remaining numbers with fewer bits (for example, dropping from 16-bit or 8-bit formats to lower ones) while yielding nearly the same results.
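
To make these two ideas concrete, here is a minimal sketch in Python. It is illustrative only, not the researchers' code: a truncated SVD stands in for trimming redundancy, and a simple uniform quantizer stands in for reducing precision. The matrix size, rank, and bit-width are arbitrary stand-ins.

```python
# Illustrative sketch of the two ideas (not the CALDERA implementation).
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a weight matrix; real LLM weights have far more structure
# than this random example.
W = rng.standard_normal((512, 512)).astype(np.float32)

# Trim redundancy: keep only the top-k singular components (low-rank approximation).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 64
W_lowrank = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Reduce precision: round the remaining values to a small number of levels.
def quantize(x, bits=4):
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    codes = np.round((x - lo) / (hi - lo) * levels)   # small integer codes
    return codes * (hi - lo) / levels + lo            # dequantized approximation

W_compressed = quantize(W_lowrank, bits=4)
print("relative error:", np.linalg.norm(W - W_compressed) / np.linalg.norm(W))
```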

The algorithm would not just provide performance nearly as accurate as the uncompressed version, but also increase privacy, save energy, and lower costs. The new algorithm, CALDERA (Calibration Aware Low precision DEcomposition with low Rank Adaptation), will be presented in December.

“When you use ChatGPT, whatever request you give it goes to the back-end servers of OpenAI, which process all of that data, and that is very expensive,” said coauthor Rajarshi Saha.

“So, you want to be able to do this LLM inference using consumer GPUs [graphics processing units], and the way to do that is by compressing these LLMs,” he added.


While CALDERA is not the first algorithm to compress LLMs, it combines two properties: low precision and low rank. Low precision reduces the number of bits used to store the model's weights, while the low-rank framework reduces redundancies in the LLM weight matrices.
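
As a rough sketch of how the two properties can work together (a simplification, not CALDERA's calibration-aware procedure, which also quantizes the low-rank factors): quantize the weight matrix to get a coarse backbone, then fit a small low-rank correction to the quantization error.

```python
# Simplified sketch of combining low precision and low rank (not CALDERA itself):
# approximate W as Q + L @ R, where Q is a coarsely quantized backbone and
# L @ R is a low-rank correction fitted to the quantization error.
import numpy as np

def quantize(x, bits=2):
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    codes = np.round((x - lo) / (hi - lo) * levels)
    return codes * (hi - lo) / levels + lo

def compress(W, bits=2, rank=32):
    Q = quantize(W, bits)                               # low-precision part
    U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
    L = U[:, :rank] * s[:rank]                          # tall low-rank factor
    R = Vt[:rank, :]                                    # wide low-rank factor
    return Q, L, R

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
Q, L, R = compress(W)
print("error with low-rank correction:   ",
      np.linalg.norm(W - (Q + L @ R)) / np.linalg.norm(W))
print("error from quantization alone:    ",
      np.linalg.norm(W - Q) / np.linalg.norm(W))
```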

“Using both of these properties together, we are able to get much more compression than either of these techniques can achieve individually,” said Saha.

To develop the CALDERA algorithm, the researchers used the large collections of information that are used to train LLMs. These data sets, like the models themselves, are composed of matrices, grids of numbers used to store data.
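
The scale of those grids is why the bits per weight matter so much for running a model on a phone or laptop. The figures below are a back-of-envelope illustration, not results from the paper:

```python
# Back-of-envelope storage estimate (illustrative figures, not from the paper).
def model_size_gb(num_params, bits_per_weight):
    return num_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> gigabytes

params = 7e9  # e.g., a model with roughly 7 billion weights
for bits in (16, 8, 4, 2):
    print(f"{bits:>2} bits per weight -> {model_size_gb(params, bits):.2f} GB")
```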

The researchers tested their algorithm with open-source large language models released by Meta AI. The team found that the low-rank framework can further improve methods that rely on low precision alone.

The engineers evaluated the performance of the compressed language models on several sets of benchmark tasks and observed improvements of up to 5%, which is significant for these metrics.

“I think it’s encouraging and a bit surprising that we were able to get such good performance in this compression scheme,” said coauthor Andrea Goldsmith. “By taking advantage of the weight matrix rather than just using a generic compression algorithm for the bits that are representing the weight matrix, we were able to do much better.”


Journal Reference

  1. Saha, R., Sagan, N., Srivastava, V., Goldsmith, A. J., & Pilanci, M. (2024). Compressing Large Language Models using Low Rank and Low Precision Decomposition. arXiv. DOI: 10.48550/arXiv.2405.18886