created: 2025-01-06 updated: 2025-01-13
Large language models like ChatGPT & Claude have become ubiquitous in today’s world. I visited a friend for New Year’s Eve and was impressed by how extensively ChatGPT was being used within the family. The friend’s little kid is going to grow up in a world where turning to large language models (LLMs) for quick answers is the norm rather than the exception.
LLMs do have their fair share of pitfalls. The primary one is hallucination, where the LLM spits out factually incorrect answers. There has been (and will be) a lot of work on understanding when and how these models hallucinate, but I wanted to understand the output of LLMs from an information-theoretic perspective.
Now, how does an LLM work? Language models are probability distributions over sequences of tokens/words (in Natural Language Processing, tokens are the building blocks of text, but they can be used interchangeably with “word” in our context for now). Most of the time, we are dealing with so-called autoregressive language models. The easiest way to think about one is as a black box: if you give it a sequence of tokens (or words), it spits out the next token (or word). You then append this generated token to the previous sequence and send it to the black box again. It spits out the next one, and so on. You get the idea!
LLMs have been made possible by the concept of attention [2] and a specific neural network architecture called the transformer. For interested readers, I would divert your attention to the wonderful talk by Grant Sanderson [3] on the same topic.
The important thing to remember is that the next token produced by the LLM (black box) is not deterministic. Rather, the model outputs a probability distribution over all the tokens in its vocabulary, and the next token is drawn from it. The vocabulary might seem like a really large set, but in practice most of the probability is concentrated on only a few tokens, while the rest have a negligible probability of appearing as the next token.
For the rest of the post, let us use ChatGPT as an example to understand the concepts. When I type in the following text and look for the next 3 possible tokens
The protagonist went around the corner and …
We can already see that the model finds certain tokens/words far more probable than the rest.
When a probability distribution enters the picture, there are lots of different ways to think about reasoning with it. Here, I wanted to explore it from the perspective of entropy.
Entropy is a tricky concept, and it can mean different things in different contexts. Quanta Magazine has a nice post on what entropy means [4]. I would particularly recommend playing around with the animation by Jonas Parnow & Mark Belan in that post to get an intuitive understanding of what a measure of disorder actually means. For the sake of this post, we will stick to an information-theoretic understanding and borrow Shannon’s definition of entropy for a probability distribution. It is defined as
\[\text{Shannon's entropy}=-\sum\limits_xp(x)\times\log{p(x)}\]
where \(x\) ranges over the available tokens and \(p(x)\) is the probability of that token appearing next. Now, what does the value of entropy say about the underlying probability distribution? If all the states/tokens are equally probable, the entropy is at its highest. If the probability is concentrated on very few tokens, the entropy is low. In other words, a high entropy can be interpreted as the model not being sure of the next token, while a low entropy means that the model is more confident about which token should come next.
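To make the two extremes concrete, here is a minimal sketch of the formula above. The two distributions are made up for illustration, and the natural logarithm is an assumption on my part (the definition above does not fix the base, and a different base only rescales the values).

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a normalized probability distribution."""
    # Tokens with p == 0 are skipped: p * log(p) tends to 0 as p tends to 0.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token distributions over a 4-token vocabulary.
uniform = [0.25, 0.25, 0.25, 0.25]  # no preference: highest possible entropy
peaked = [0.97, 0.01, 0.01, 0.01]   # one dominant token: low entropy

print(shannon_entropy(uniform))  # equals log(4) for four equally likely tokens
print(shannon_entropy(peaked))
```

The uniform case gives the largest value any 4-token distribution can reach, while the peaked case lands much closer to zero.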
Let’s put this into practice. Let’s ask ChatGPT the following
write a haiku about ai …
Now, for each token in the output sequence, we obtain a probability distribution for which the entropy can be calculated.
Note: ChatGPT’s API only exposes the probability values of the top 20 tokens. This means that we might be missing next tokens with non-zero probability beyond this range. The definition of entropy over a probability distribution also assumes that the distribution is normalized (meaning all the probability values of a single distribution sum to 1). So, we normalize the probability values of the top 20 tokens before calculating the entropy. Now, looking at the result…
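The normalization step described in the note can be sketched roughly as follows. I am assuming here that the values arrive as log probabilities, so they are exponentiated first. The function name and the example numbers are hypothetical, and this is not necessarily how the code behind the plots does it.

```python
import math

def entropy_from_top_logprobs(logprobs):
    """Entropy (in nats) of the top-k candidates after renormalizing them."""
    probs = [math.exp(lp) for lp in logprobs]  # log probability -> probability
    total = sum(probs)  # below 1 whenever some mass lies outside the top k
    normalized = [p / total for p in probs]    # now sums to 1
    return -sum(p * math.log(p) for p in normalized if p > 0)

# Hypothetical top-3 candidates at one position; they only cover 0.9 of the mass.
example = [math.log(0.6), math.log(0.2), math.log(0.1)]
print(entropy_from_top_logprobs(example))
```

Note that renormalizing only the visible candidates ignores whatever probability sits in the tail, so the result is an approximation of the entropy over the full vocabulary.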
This is interesting! The entropy is low (meaning the model is relatively sure of the next token) where a sentence ends, or where the next token is the continuation of a word that was split across tokens (Wis-dom above).
What if we did it for a longer text? Asking ChatGPT to
Write me an essay about Claude Shannon …
Too many tokens to visualize. Let’s just look at a histogram of the entropy values
In most cases, the model seems to be pretty sure of the next token. Since the plot above is too erratic to make sense of, I tried taking the average over a moving window to smooth it. Below are plots for different window sizes.
The red lines that you see mark the tokens which are or contain a full stop (interpreted as the end of a sentence). The interpretation from the first simple example still holds: the entropy drops where a sentence ends. I wanted to see what kind of tokens were selected when the entropy was high and low respectively. Looking at the top 10 tokens with the lowest entropies …
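A moving-window average of this kind can be sketched as below. The exact windowing behind the plots is not spelled out here, so treat this as one plausible version that only averages over full windows; the entropy values are made up.

```python
def moving_average(values, window):
    """Mean of each full sliding window of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])  # sum of the first window
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide by one: add the entering value, drop the leaving one.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages

entropies = [0.1, 1.2, 0.3, 2.0, 0.0, 0.9]  # hypothetical per-token entropies
print(moving_average(entropies, 3))
```

A larger window gives a smoother curve at the cost of blurring short dips, such as the ones at sentence ends.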
194 | Institute | of | Michigan | th | 191 | Technology | | of | ,
and comparing it with the top 10 tokens with the highest entropy cases …
led | wrote | considered | established | revolution | technology | , | is | regarded | conducted
it can be crudely interpreted that factual statements (involving proper nouns) come with low entropy, since the model has to be sure of what comes next, while descriptive statements admit several synonyms, which enlarges the pool of probable next tokens.
Next, I asked ChatGPT what kind of analysis I could build on this, and it told me to look at the clusters surrounding the high-entropy tokens (the words before and after):
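Picking out the two lists compared above comes down to sorting tokens by their entropy. A small sketch, with a hypothetical helper name and made-up values:

```python
def rank_by_entropy(tokens, entropies, k=10):
    """Return (k lowest-entropy tokens, k highest-entropy tokens)."""
    pairs = sorted(zip(entropies, tokens), key=lambda pair: pair[0])
    lowest = [token for _, token in pairs[:k]]
    highest = [token for _, token in pairs[-k:][::-1]]  # most uncertain first
    return lowest, highest

# Hypothetical tokens with made-up entropies, for illustration only.
tokens = [" Institute", " of", " led", " wrote", " Technology"]
entropies = [0.01, 0.02, 2.1, 1.9, 0.03]
lowest, highest = rank_by_entropy(tokens, entropies, k=2)
print(lowest, highest)
```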
… and innovative thinking led to groundbreaking contributions …
… time that Shannon wrote his master’s thesis …
… This work, considered one of the …
… paper, Shannon established the principles of …
… research that would revolution ize the field …
… have profoundly influenced technology, telecommunications, …
… This work , considered one of …
… circuit design and is regarded as a cornerstone …
… where he conducted research that would …
Yeah. I honestly didn’t know what to make of it. But it was interesting to let ChatGPT itself contribute to this post.
I wanted to try a text completion in Tamil [5] to see how ChatGPT performs. I asked
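For anyone who wants to reproduce snippets like the ones above, extracting the neighbourhood of a token is a small slicing exercise. The helper below is a hypothetical sketch, and it assumes that tokens carry their own leading spaces so that they can be joined without a separator.

```python
def context_window(tokens, index, width=3):
    """Join the tokens within `width` positions of tokens[index]."""
    start = max(0, index - width)               # clamp at the start of the text
    end = min(len(tokens), index + width + 1)   # clamp at the end of the text
    return "".join(tokens[start:end])

# Hypothetical tokenization of one of the snippets above.
tokens = [" This", " work", ",", " considered", " one", " of", " the"]
print(context_window(tokens, 3, width=2))
```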
Finish the rest… எப்பொருள் யார்யார்வாய்க் கேட்பினும்
And it gave me
எப்பொருள் யார்யார்வாய்க் கேட்பினும் அப்பொருள்
மெய்ப்பொருள் காண்பது அறிவு.
This is a couplet from the ancient Tamil text “Thirukkural,” written by the poet Thiruvalluvar. It emphasizes the importance of discerning the truth regardless of who speaks it.
And the probability distribution of tokens looks like …
Here, you can clearly see that the tokens are individual letters. While this is not ideal for autoregressive language models, since more tokens per word means more compute time, it can actually be helpful for fine-grained evaluation of model performance.
Why did I do the things above? I found it interesting that OpenAI’s text generation API for ChatGPT models returns the next-token probabilities, and I wanted to play around with them from the point of view of entropy. While no substantial conclusions can be drawn from this little experiment, it is of paramount importance that we pay heed to the Thirukkural couplet above and discern the truth regardless of who (or what) speaks it!
(You can find the code used to create all the plots here)
Hacker News discussion - https://news.ycombinator.com/item?id=42649315
Relevant and interesting paper mentioned in the comments there - Creativity Has Left the Chat: The Price of Debiasing Language Models by Behnam Mohammadi