Running inference with a quantized SLM using llama-cpp-python on Windows

The Windows environment is as follows.

Windows 11 Pro (22H2)
Core(TM) i5-4590T CPU @ 2.00GHz
RAM 8.00 GB

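The approach is to run llama-cpp-python inside a WSL2 + Ubuntu environment.

Let's get started.

First, check the PowerShell script execution policy.

Right-click the Start menu, select "Terminal (Admin)", and check whether PowerShell scripts are allowed to run.

Run the following command.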

Get-ExecutionPolicy

If the result is "Restricted", grant script execution permission to the current user.

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Run Get-ExecutionPolicy again and confirm that it now reports "RemoteSigned".

Install WSL2 and Ubuntu.

wsl --install

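If you get an error like the following:
Error code: Wsl/Service/CreateInstance/MountVhd/HCS/ERROR_FILE_NOT_FOUND
unregister Ubuntu. Note that this deletes any existing data in that Ubuntu distribution.
＞wsl --unregister Ubuntu
Then run the installation again.
＞wsl --install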

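The installation starts.

When the Ubuntu installation finishes, you are asked to create a user account.

First, enter a username (uppercase letters are not allowed, and the name must not start with a digit or a hyphen).

Enter a username of your choice (e.g., foo3), then enter a password, and the user account is created.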
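The username rules above can be checked with a small sketch before you type the name into the installer. The pattern below is an assumption that encodes only the rules stated here (lowercase only, no leading digit or hyphen); the actual check performed by Ubuntu may be stricter or looser.

```python
import re

# Assumed pattern: a lowercase letter first, then lowercase letters,
# digits, hyphens, or underscores. This is a simplification for illustration.
USERNAME_RE = re.compile(r"[a-z][a-z0-9_-]*")


def is_valid_username(name: str) -> bool:
    """Return True if the name satisfies the assumed username rules."""
    return USERNAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["foo3", "Foo3", "3foo", "-foo"]:
        print(candidate, is_valid_username(candidate))
```

Of the four candidates, only foo3 passes under these assumed rules.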

Type exit to leave the Ubuntu session.

Open the terminal again and check the installation.

wsl --list --verbose

WSL2 is installed if the VERSION column at the end of the line shows 2.
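If you want to check the VERSION column from a script rather than by eye, a parser along these lines may help. The sample text is a hypothetical illustration of the table layout, not output captured from the machine in this article. When the command is run through a pipe, wsl.exe reportedly emits UTF-16 text, so real output may need decoding first.

```python
def parse_wsl_list(text: str) -> dict:
    """Map each distribution name to its WSL version number."""
    versions = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # The default distribution is marked with a leading "*".
        fields = line.strip().lstrip("*").split()
        if len(fields) >= 3 and fields[-1].isdigit():
            versions[fields[0]] = int(fields[-1])
    return versions


# Hypothetical sample in the layout of `wsl --list --verbose`.
SAMPLE = """  NAME      STATE           VERSION
* Ubuntu    Running         2
"""

if __name__ == "__main__":
    print(parse_wsl_list(SAMPLE))
```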

Close the terminal again and launch Ubuntu from the Start menu.

Update the apt package lists.

sudo apt update

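If you get an error like the following:

Example: Temporary failure resolving 'archive.ubuntu.com'

name resolution is failing, so specify a DNS server that suits your environment in /etc/resolv.conf.

First, turn off the automatic generation of /etc/resolv.conf by adding the following to /etc/wsl.conf. The setting takes effect after WSL is restarted.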
＞sudo nano /etc/wsl.conf
[network]
generateResolvConf = false

Next, find the DNS server for your environment with nslookup or a similar tool (e.g., 122.197.254.137) and write it into /etc/resolv.conf.

＞sudo nano /etc/resolv.conf
nameserver 122.197.254.137
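To double-check the two files after editing them, a sketch like the following reads their contents and reports what is configured. The function names are made up for this example, and the parsing is deliberately simple (it assumes the plain layouts shown above).

```python
import configparser


def autogen_disabled(wsl_conf_text: str) -> bool:
    """Return True if [network] generateResolvConf is set to false."""
    parser = configparser.ConfigParser()
    parser.read_string(wsl_conf_text)
    # configparser matches option names case-insensitively.
    # A missing section or option falls back to True (generation enabled).
    return not parser.getboolean("network", "generateResolvConf", fallback=True)


def nameservers(resolv_conf_text: str) -> list:
    """Return the nameserver addresses listed in resolv.conf text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


if __name__ == "__main__":
    with open("/etc/wsl.conf") as f:
        print("auto-generation disabled:", autogen_disabled(f.read()))
    with open("/etc/resolv.conf") as f:
        print("nameservers:", nameservers(f.read()))
```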

Run update again, and run upgrade while you are at it.
＞sudo apt update
＞sudo apt upgrade -y

Install pip and venv.

sudo apt install python3-pip -y

sudo apt install python3-venv -y

Create a virtual environment (venv1) and activate it.

python3 -m venv venv1

source venv1/bin/activate
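To confirm that the activation worked, one simple check is to compare the interpreter's prefix with its base prefix; they differ inside a venv. This is a minimal sketch, and the function name is only illustrative.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

After `source venv1/bin/activate`, the prefix should point into the venv1 directory.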

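Install llama-cpp-python. The package is typically compiled from source during installation, so this can take several minutes; if the build fails because no C/C++ compiler is found, installing one first (sudo apt install build-essential -y) should help.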

pip install llama-cpp-python

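Download a 4-bit quantized language model from Hugging Face and use it for inference.

The original model is Phi-3-mini, Microsoft's small language model (SLM).

Create a working folder and download the model into it.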

mkdir models
cd models

wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf

wget sometimes fails with the following error:
Temporary failure in name resolution.

In that case, set the DNS server IP address to Google Public DNS.

＞sudo nano /etc/resolv.conf

nameserver 8.8.8.8

Then run wget again.
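After a flaky download it can be worth checking that the file really is a GGUF model and not, say, an HTML error page. GGUF files begin with the four ASCII bytes "GGUF", so a minimal sketch can test for that. It does not detect a truncated download, only a wrong file type.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def looks_like_gguf(path) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == GGUF_MAGIC


if __name__ == "__main__":
    # Run from inside the models directory.
    model = "Phi-3-mini-4k-instruct-q4.gguf"
    print(model, "looks like GGUF:", looks_like_gguf(model))
```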

Return to the parent directory.

cd ..

Create the Python code that runs the prompt.

The prompt is "Tell me how to feel better? Please answer in Japanese."

Phi-3-mini appears to understand Japanese to some extent, so it also answers when the prompt itself is written in Japanese.

＞nano sample.py

from llama_cpp import Llama

# Write the prompt. The [INST] / <<SYS>> markers follow the Llama 2 chat format.
prompt = """[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
Tell me how to feel better? Please answer in Japanese.[/INST]"""

# Load the downloaded model. n_gpu_layers=0 keeps every layer on the CPU.
llm = Llama(model_path="./models/Phi-3-mini-4k-instruct-q4.gguf", n_gpu_layers=0)

# Run inference. echo=True includes the prompt in the returned text.
output = llm(
    prompt,
    max_tokens=500,
    stop=["System:", "User:", "Assistant:"],
    echo=True,
)

# Print the full response dictionary.
print(output)
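The [INST] and <<SYS>> markers in sample.py come from the Llama 2 chat format. As far as I know, the Phi-3 model card instead shows a layout built from <|user|>, <|end|>, and <|assistant|> markers, so a prompt in that layout may give cleaner answers. The helper below is a sketch under that assumption; the function name is made up, and whether this particular GGUF file honors a <|system|> section is also an assumption.

```python
from typing import Optional


def build_phi3_prompt(user_message: str, system_message: Optional[str] = None) -> str:
    """Assemble a prompt in the (assumed) Phi-3 chat layout."""
    parts = []
    if system_message:
        # Assumption: the model accepts an optional system section.
        parts.append(f"<|system|>\n{system_message}<|end|>\n")
    parts.append(f"<|user|>\n{user_message}<|end|>\n")
    parts.append("<|assistant|>")  # the model continues from here
    return "".join(parts)


if __name__ == "__main__":
    print(build_phi3_prompt("Tell me how to feel better? Please answer in Japanese."))
```

With a prompt in this layout, passing stop=["<|end|>"] to the llm call would be the matching stop sequence.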

＞python3 sample.py
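print(output) shows the whole response dictionary, which is laid out like an OpenAI-style completion with the generated text under choices. Because sample.py passes echo=True, that text also starts with the prompt. The sketch below pulls out only the answer; the response used here is a hypothetical stand-in, not real model output.

```python
def completion_text(output: dict, prompt: str = "") -> str:
    """Return the generated text, dropping the echoed prompt if present."""
    text = output["choices"][0]["text"]
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


if __name__ == "__main__":
    # Hypothetical response for illustration only.
    fake_prompt = "[INST] Hello [/INST]"
    fake_output = {
        "choices": [{"text": fake_prompt + " Hi there.", "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 8, "completion_tokens": 4, "total_tokens": 12},
    }
    print(completion_text(fake_output, fake_prompt))
```

In sample.py the equivalent call would be print(completion_text(output, prompt)).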

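On the Windows PC with the specs listed at the top, using the CPU only, it took nearly a minute and a half for the response to come back.
With an NVIDIA RTX 40-series GPU (12 GB), CUDA/cuDNN, and n_gpu_layers set to 40 or more, the response delay can apparently be brought close to zero (this requires a llama-cpp-python build with CUDA support).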
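To compare CPU and GPU runs with numbers rather than impressions, one option is to time the call and divide the generated token count by the elapsed time. This is a rough sketch: the helper names are made up, and the figures in the usage example are hypothetical, not measurements from this article.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def tokens_per_second(output: dict, elapsed_seconds: float) -> float:
    """Generated tokens divided by wall-clock time."""
    generated = output["usage"]["completion_tokens"]
    return generated / elapsed_seconds if elapsed_seconds > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical numbers: 450 generated tokens in 90 seconds.
    fake_output = {"usage": {"completion_tokens": 450}}
    print(tokens_per_second(fake_output, 90.0), "tokens/s")
```

In sample.py this could be used as output, elapsed = timed_call(llm, prompt, max_tokens=500), followed by tokens_per_second(output, elapsed).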

Leave the virtual environment.
＞deactivate

Exit Ubuntu.
＞exit
