From the course: LLaMa for Developers
Quantizing LLaMA
- [Instructor] In the previous video, we talked a little bit about fitting LLaMA onto your hardware. In this video, we're going to dive deeper into quantizing LLaMA. So to start off, why is quantization important? There are four main reasons. The first is that it allows you to run more powerful models by reducing the memory footprint. The second is that it allows you to train more powerful models. The third is that it reduces energy consumption, and the fourth is that it advances computer science.

Now, recapping quantization. From the previous video, we saw this chart. Depending on the precision at which we store our model, we require a different amount of memory. Original LLaMA would require 28 gigabytes of memory, while 4-bit quantized LLaMA only requires 3 1/2. Now, quantization of large language models is fairly new. We only achieved reliable 8-bit quantization for large language models in 2022. So with that said, let's review some of the important blog posts and papers about quantizing large language models. So…
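The memory figures quoted above can be reproduced with a little arithmetic. The sketch below is only an illustration, not something shown in the video: it assumes a model with 7 billion parameters (a size inferred from the 28-gigabyte figure, not stated in the transcript), assumes the unquantized model is stored at 32 bits per parameter, and counts only the weights, ignoring activations and any other runtime overhead. The function name `model_memory_gb` is hypothetical.

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights, in gigabytes (10^9 bytes)."""
    total_bytes = num_params * bits_per_param / 8  # 8 bits per byte
    return total_bytes / 1e9


# Assumption: 7 billion parameters, inferred from the 28 GB figure above.
NUM_PARAMS = 7e9

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, 32-bit storage comes to 28 GB and 4-bit storage to 3.5 GB, which matches the numbers in the transcript: each halving of the precision halves the memory needed for the weights.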