New way allows AI chatbots to converse all day without failing

A simple yet effective solution for a puzzling problem.


When people talk to AI chatbots like ChatGPT for a long time, the bot’s responses sometimes start to go downhill. These bots are powered by large language models, big machine-learning programs that learn from lots of data.

These models break down the words in your questions into smaller pieces called tokens. They use an attention mechanism to determine how these tokens fit together and make new sentences.

Usually, the chatbot stores the recent tokens it has seen in a memory called a KV cache (short for key-value cache). But if this memory gets too big, it slows things down. And if a conversation or document needs more tokens than the cache can hold, the bot’s performance drops.

For example, imagine a chatbot that can remember up to 4,096 tokens, but a typical academic paper has around 10,000 tokens. That’s a problem!

A team of scientists from MIT and other places found a surprising reason why chatbots start to mess up after talking for a while. But they also devised a simple fix to keep the conversation going smoothly.

Their trick involves a small change to how the chatbot’s cache handles a long conversation. When the cache has to hold more information than it has room for, some methods throw out the earliest pieces of data to make space. This can cause the model to fail.

But with the new method, called StreamingLLM, the chatbot keeps those first few pieces of data in memory. This helps it keep chatting even during super long conversations, like those running past 4 million tokens!

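Compared with another method that avoids failure by repeatedly recomputing part of the earlier conversation, StreamingLLM is more than 22 times faster. That would let a chatbot carry on long conversations throughout the day without needing to be restarted, making it a more practical helper for writing, editing, or even coding.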

Guangxuan Xiao, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on StreamingLLM, said, “Now, with this method, we can persistently deploy these large language models. By making a chatbot that we can always chat with and that can always respond to us based on our recent conversations, we could use these chatbots in some new applications.”

In their new paper, the researchers also figured out why keeping the first token in the cache helps the model stay on track.

Even though the first word of a conversation might not seem necessary for predicting the next word, it helps the model better understand the context. It’s like having a starting point to make sense of everything that comes after it.

So, by ensuring the chatbot remembers the first token, it can keep up its performance even when dealing with a lot of information.

Some models employ a softmax operation in their attention mechanism, which gives each token a score indicating how strongly it relates to every other token. The scores need to add up to 1. Since most tokens aren’t closely related, most of their scores are low, so the model puts any leftover attention into the first token.

This first token that collects extra attention is dubbed an “attention sink” by the researchers.
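To make the softmax idea concrete, here is a minimal sketch in plain Python. The raw scores are made-up illustrative numbers, not values from any real model, and the `softmax` helper is written here just for the example. It shows only that the weights always sum to 1, and what happens to the other weights when the token holding most of the attention is taken away.

```python
import math

def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one new token looking back at five cached tokens.
# Index 0 stands in for the first token (the "attention sink"); the other four
# are only weakly related to the new token.
raw_scores = [2.0, -1.0, -1.0, -1.0, -1.0]

weights = softmax(raw_scores)
print("with first token:   ", [round(w, 3) for w in weights])
print("sum of weights:     ", round(sum(weights), 6))

# If the first token is evicted, the same weakly related tokens must now
# absorb all of the attention, because the weights still have to sum to 1.
without_first = softmax(raw_scores[1:])
print("without first token:", [round(w, 3) for w in without_first])
```

In this toy case the first token soaks up most of the attention, and removing it forces the remaining tokens to share the full weight equally. That sudden shift is a simplified picture of why dropping the attention sink can throw the model off.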

Song Han, an associate professor in EECS, a member of the MIT-IBM Watson AI Lab, and a distinguished scientist of NVIDIA, said, “We need an attention sink, and the model decides to use the first token as the attention sink because it is globally visible: every other token can see it. We must always keep the attention sink in the cache to maintain the model dynamics.”

In creating StreamingLLM, the researchers found that keeping four attention sink tokens at the beginning of the cache works best. They also found that position information matters as new tokens are added and old ones are removed: according to the paper, StreamingLLM assigns positions based on where each token sits in the cache rather than where it appeared in the original text.
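The cache policy described above can be sketched in a few lines of Python. This is a simplified illustration, not the researchers’ implementation: the `SinkCache` class and its parameter names are hypothetical, and plain integers stand in for the key and value data a real KV cache would hold.

```python
from collections import deque

class SinkCache:
    """Toy cache that keeps the first `num_sinks` tokens forever,
    plus a rolling window of the most recent `window` tokens."""

    def __init__(self, num_sinks=4, window=4):
        self.num_sinks = num_sinks
        self.sinks = []                      # the first tokens, never evicted
        self.recent = deque(maxlen=window)   # oldest entry drops out automatically

    def append(self, token):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token)
        else:
            self.recent.append(token)

    def contents(self):
        return self.sinks + list(self.recent)

# Stream 20 tokens (numbered 0..19) through both kinds of cache.
sink_cache = SinkCache(num_sinks=4, window=4)
plain_window = deque(maxlen=8)  # same total size, but no protected sink tokens
for token in range(20):
    sink_cache.append(token)
    plain_window.append(token)

print("sinks + recent:", sink_cache.contents())
print("plain window:  ", list(plain_window))
```

Both caches hold eight tokens, so memory use stays fixed however long the stream runs. The difference is that the plain sliding window has lost tokens 0 to 3, while the sink cache still holds them alongside the four most recent tokens.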

Combining these two tricks lets StreamingLLM keep chatting continuously without slowing down, and it outperforms another method that has to redo some of its calculations.

For example, with a cache of 256 tokens, the other method takes 63 milliseconds to produce a new token, while StreamingLLM takes only 31 milliseconds. But if the cache grows to 4,096 tokens, the other method takes 1,411 milliseconds, while StreamingLLM stays fast, needing just 65 milliseconds.
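As a quick check on those figures, the short Python snippet below works out the ratios. The numbers are the four timings quoted above; the dictionary layout and variable names are just for illustration.

```python
# Time per new token in milliseconds, using the figures quoted in the article.
latency_ms = {
    256:  {"recompute": 63,   "streaming": 31},
    4096: {"recompute": 1411, "streaming": 65},
}

# How many times faster StreamingLLM is at each cache size.
speedup = {
    size: times["recompute"] / times["streaming"]
    for size, times in latency_ms.items()
}

# How much slower each method gets when the cache grows from 256 to 4,096.
growth = {
    method: latency_ms[4096][method] / latency_ms[256][method]
    for method in ("recompute", "streaming")
}

for size, ratio in speedup.items():
    print(f"cache size {size}: StreamingLLM is about {ratio:.1f}x faster")
for method, ratio in growth.items():
    print(f"{method}: time per token grows about {ratio:.1f}x")
```

The gap widens from about 2 times at the small cache size to more than 20 times at the large one. Put another way, making the cache 16 times bigger slows the other method by a factor of more than 20, while StreamingLLM’s time per token only about doubles.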

Yang You, a presidential young professor of computer science at the National University of Singapore, who was not involved with this work, said, “The innovative approach of StreamingLLM, centered around the attention sink mechanism, ensures stable memory usage and performance, even when processing texts up to 4 million tokens in length. This capability is not just impressive; it’s transformative, enabling StreamingLLM to be applied across various AI applications. The performance and versatility of StreamingLLM mark it as an up-and-coming technology, poised to revolutionize how we approach AI-driven generation applications.”

Tianqi Chen, an assistant professor in the machine learning and computer science departments at Carnegie Mellon University, who also was not involved with this research, agreed, saying, “StreamingLLM enables the smooth extension of the conversation length of large language models. We have been using it to enable the deployment of Mistral models on iPhones with great success.”

Journal Reference:

  1. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han and Mike Lewis. Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453v3

