Maker.io main logo

How to Use Optimum-Intel to Accelerate LLaMA 3.1 on LattePanda Sigma

28

2025-08-15 | By DFRobot

LattePanda

Editor's Note:

LLaMA 3.1 is now widely running on various local devices, but mostly on CPUs. This article introduces a method using Intel Optimum-Intel to optimize its performance. The author utilized this method on a LattePanda Sigma, an x86 single board computer/server, leveraging the integrated GPU to accelerate LLaMA 3.1's inference speed, with impressive results. This method can be applied to devices with Intel integrated graphics.

Introduction:

Those who have read previous articles know that we can use the LattePanda Sigma to run large language models with around 10B parameters. Today, let's explore how to use the Optimum-Intel tool from OpenVINO to perform inference with the integrated GPU on the LLaMA-3.1-8B-Instruct model.

Optimum Intel:

Optimum-Intel serves as an interface layer between the Transformers and Diffusers libraries and various optimization tools provided by Intel. It offers developers a simplified way to use these libraries with Intel's hardware-optimized technologies, such as OpenVINO™, IPEX, etc., accelerating inference performance of AI models based on Transformer or Diffusion architectures on Intel hardware.

Optimum intel code repository

Image of How to Use Optimum-Intel to Accelerate LLaMA 3.1 on LattePanda Sigma

Optimum-intel

LLaMA 3.1 Overview:

LLaMA 3.1 with 405B parameters supports a context length of 128K tokens and was trained on 150 trillion tokens using over 16,000 H100 GPUs. Researchers have found that LLaMA 3.1 405B is comparable to top industry models like GPT-4, Claude 3.5 Sonnet, and Gemini Ultra, based on evaluations across more than 150 benchmark datasets.

Download the pre-trained weights of the Meta-Llama-3.1-8B-Instruct model provided by the ModelScope community using the following command. If you already have it, you can skip this step.

Copy Code
git clone --depth=1 https://www.modelscope.cn/LLM-Research/Meta-Llama-3.1-8B-Instruct.git

LattePanda Sigma Overview:

The LattePanda Sigma is equipped with a 13th-gen Intel Core i5-1340P processor, featuring 12 cores and 16 threads, with a turbo boost of up to 4.6GHz, delivering extreme performance and multitasking capabilities. This single-board computer uses Intel Iris Xe integrated graphics, which has 80 execution units (EU) and a maximum dynamic frequency of 1.4GHz, providing excellent graphics performance and rendering quality. We can leverage its integrated GPU to perform LLM inference.

Image of How to Use Optimum-Intel to Accelerate LLaMA 3.1 on LattePanda Sigma

LattePanda Sigma CPU

Image of How to Use Optimum-Intel to Accelerate LLaMA 3.1 on LattePanda Sigma

LattePanda Sigma GPU

Setting Up the Development Environment:

Please download and install Anaconda, then create and activate a virtual environment named llama31 using the following commands, and install Optimum Intel and its dependencies, OpenVINO and NNCF:

Copy Code
conda create -n llama31 python=3.11 # Create virtual environment
conda activate llama31 # Activate virtual environment 
git clone https://gitee.com/Pauntech/llama3.1-model.git # git clone the repo frem gitee 
python -m pip install --upgrade pip # Upgrade pip to the latest version 
pip install optimum-intel[openvino,nncf] # Install Optimum Intel and its dependencies OpenVINO and NNCF
pip install -U transformers # Upgrade transformers library to the latest version

Image of How to Use Optimum-Intel to Accelerate LLaMA 3.1 on LattePanda Sigma

Install Result

Using Optimum-CLI to Quantize the LLaMA 3.1 Model to INT4:

Optimum-CLI is a cross-platform command-line tool that comes with Optimum-Intel, allowing you to perform model quantization without writing code. You can use it to quantize the LLaMA 3.1 model and convert it to the OpenVINO format:

Copy Code
optimum-cli export openvino --model C:\Meta-Llama-3.1-8B-Instruct --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 0.8 --sym llama31_int4

The meanings of the parameters in the optimum-cli commands are as follows:

--model specifies the path of the model to be quantized.

--task specifies the task type.

--weight-format specifies the precision of the model parameters.

--group-size defines the group size during the quantization process.

--ratio determines the proportion of weights retained during quantization.

--sym indicates that symmetric quantization is used.

Image of How to Use Optimum-Intel to Accelerate LLaMA 3.1 on LattePanda Sigma

Quantization Result

Building a Chatbot Based on the LLaMA 3.1 Model:

First, install the required packages:

Copy Code
pip install gradio mdtex2html streamlit -i https://mirrors.aliyun.com/pypi/simple/

Then run python llama31_chatbot.py, and the result will be as shown below:

Image of How to Use Optimum-Intel to Accelerate LLaMA 3.1 on LattePanda Sigma

Chatbot Interface and GPU Running

Conclusion:

The Optimum Intel toolkit based on OpenVINO™ is simple and easy to use. With just one command, you can quantize the LLaMA 3.1 model to INT4, and with some basic preprocessing, you can use LLaMA 3.1 on the LattePanda Sigma and achieve excellent results. If you need to run large language models locally, you might consider deploying them on the LattePanda Sigma. Next, we will introduce how to use the ipexllm tool to utilize LattePanda's integrated GPU for LLM inference.

Mfr Part # DFR1081
LATTEPANDA SIGMA 16GB RAM 500GB
DFRobot
฿24,055.85
View More Details
Add all DigiKey Parts to Cart
Have questions or comments? Continue the conversation on TechForum, DigiKey's online community and technical resource.