From the course: LLaMa for Developers


The LLaMA tokenizer


- [Instructor] So you've probably heard about LLMs using tokens, but what exactly are they? Typically, when we explain tokens, we say that they're roughly similar to words, but let's dive deeper. A token is a way to represent a component of a model's input. That component could be a character, a word, a sentence, or a number; in some multimodal models, it can also be an image, a sound, or a video. In the past, tokenizers were hand-coded from language rules, typically built around punctuation or word boundaries. But the idea of treating tokenization as its own distinct step was first introduced in 1992. Since then, many tokenizers have been trained, with billions of words used to find the best segmentation. Some of these tokenizers have been character-based, others have been prefix- or suffix-based, and others have been word-based, but each tokenizer is a little bit different. In the case of LLaMA, the model uses the SentencePiece tokenizer, which was developed by Google. Now, how do you get the representation in…
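To make the idea concrete, here is a minimal sketch of subword tokenization in plain Python. This is not the real SentencePiece algorithm (which learns its vocabulary from data); the tiny vocabulary and the greedy longest-match strategy below are invented purely for illustration, showing how one word can be split into several subword tokens.

```python
# Toy subword tokenizer: greedily match the longest known piece at each
# position. The vocabulary is hand-picked for this demo only; a real
# tokenizer like SentencePiece learns its vocabulary from a large corpus.
VOCAB = {"token", "iz", "ation", "er", "word", "s", "ing"}

def tokenize(text: str) -> list[str]:
    """Split text into subword tokens via greedy longest-match over VOCAB."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then progressively shorter.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                tokens.append(piece)
                i = j
                break
        else:
            # Unknown character: emit it on its own (real tokenizers often
            # use a byte-level fallback instead).
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("tokenization"))  # one word becomes several subword tokens
```

Running this splits "tokenization" into the pieces ["token", "iz", "ation"], which is the essential behavior a subword tokenizer provides: frequent fragments get their own tokens, so unseen words can still be represented.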
