Tolkien And Tokens

Or should I say 51, 144221, 1958, 17951—at least for all the large language models in the room.

Wait, what?

Let’s back up.

Large language models (LLMs) don’t write the way you and I do—thinking of words, writing individual characters, and arranging them into sentences.

To see this in real time, type “hello world” here.

In other words, 51, 144221, 1958, 17951 is how an LLM would represent Tolkien and Tokens.

So, tell me in plain English: what is tokenization?

Tokenization is the process of breaking text into smaller units—words, subwords, or even individual characters—before an AI model processes it. Instead of “reading” words like humans do, the model works with numerical representations (tokens).
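If you want to poke at this yourself, here is a minimal sketch using OpenAI’s open-source tiktoken library. This is my own illustration, not necessarily the tokenizer behind the numbers quoted above, so the IDs you get will likely differ.

```python
# A minimal sketch, assuming the tiktoken package is installed (pip install tiktoken).
# Different models use different encodings, so the IDs printed here will
# likely differ from the numbers quoted in this post.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models

ids = enc.encode("Tolkien and Tokens")
print(ids)                               # a short list of integer token IDs
print([enc.decode([i]) for i in ids])    # the text chunk each ID maps back to
```

Try it with a few different phrases: common words tend to map to a single ID, while rarer ones get split into several pieces.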

Just like you might choose between multiple paths when hiking a mountain, there are multiple paths to tokenization (a short code sketch follows the list):

• Word-based: Each word is a token (e.g., “Tolkien” = 1 token).

• Subword-based (most common in LLMs): Breaks words into smaller chunks learned from data (e.g., “To”, “lki”, “en”).

• Character-based: Each letter is a token (e.g., “T”, “o”, “l”, “k”, “i”, “e”, “n”).
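To make the contrast concrete, here is a toy sketch in plain Python. The subword split is hand-picked for illustration; a real subword tokenizer (such as BPE) learns its splits from data and would likely choose differently.

```python
# Toy illustration of the three approaches, using plain Python only.
text = "Tolkien"

word_tokens = text.split()            # word-based: ["Tolkien"]
subword_tokens = ["To", "lki", "en"]  # subword-based: hand-picked split for illustration
char_tokens = list(text)              # character-based: ["T", "o", "l", "k", "i", "e", "n"]

print(word_tokens)
print(subword_tokens)
print(char_tokens)
```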

Why is this important?

Without tokenization, AI models couldn’t process or generate text at all: a model only operates on numbers, and tokenization is the step that turns raw text into the numerical IDs it can actually work with.

It would be like saying “no cap” to a millennial; they wouldn’t get it.
