Running the Llama-2 Large Language Model Locally
Using llama.cpp together with model quantization, this article shows how to run a large language model on an ordinary laptop.
With the explosive popularity of generative AI led by ChatGPT, more and more independent developers and researchers are using AI models to assist their work and study. This brings privacy concerns with it: after OpenAI updated its privacy policy, information submitted by free-tier ChatGPT users may be shared with third parties. To keep sensitive data private, we need an AI model that can run entirely on our own machine.
Llama-2 is one of the most popular large open models. The open-source community has built llama.cpp, a pure-CPU inference implementation, and combined with advances in model quantization this makes it possible to run large models on a laptop.
Installing and Building llama.cpp
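A minimal sketch of obtaining and building llama.cpp on a Unix-like system with git and a C/C++ toolchain available (older releases of the project build with plain make as shown here; recent releases have moved to CMake):

```shell
# Clone the repository and build the CPU inference binaries
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```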
Downloading a Model
- Baichuan
- Chinese-LLaMA-2
- Qwen
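Any of the above can be pulled from Hugging Face. A sketch using the huggingface-cli tool from the huggingface_hub package; the repository id below (HFL's Chinese-Alpaca-2 weights) is one example, and gated models such as meta-llama's require accepting the license on Hugging Face first:

```shell
# Download model weights into the local models/ directory
huggingface-cli download hfl/chinese-alpaca-2-13b --local-dir models/chinese-alpaca-2-13b-hf
```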
Converting the Model
llama.cpp requires models in GGUF format (the successor to the older GGML format). Use the conversion script bundled with llama.cpp to convert the original model to GGUF:
python convert.py models/7B/
Quantizing the Model

Quantize the converted F16 model down to 4-bit weights (the Q4_K_S k-quant scheme here) to shrink its memory footprint at a small cost in quality:
./quantize models/chinese-alpaca-2-13b-hf/ggml-model-f16.gguf models/chinese-alpaca-2-13b-hf/ggml-model-q4_K_S.gguf Q4_K_S
Command-line Usage

Run interactive inference against the quantized model (-ins enables instruction mode, -c sets the context window, -n caps the number of generated tokens, --repeat_penalty discourages repetition):
./main -m models/chinese-alpaca-2-13b-hf/ggml-model-q4_K_S.gguf --color -f prompts/alpaca.txt -ins -c 4096 --temp 0.5 -n 4096 --repeat_penalty 1.2
HTTP Web Server

llama.cpp also ships a server binary that exposes an HTTP API (--n-gpu-layers offloads part of the model to the GPU, -cb enables continuous batching, and --api-key protects the endpoint):
./server --model models/your-model.gguf --n-gpu-layers 25 -cb --host 0.0.0.0 --port 8019 --api-key sk-xxxxxxxxxxxxxxxxxxxxxx
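As a quick check, the server's OpenAI-compatible chat endpoint can be queried from Python. A minimal sketch using only the standard library; the host, port, and API key mirror the placeholder values in the server command above:

```python
import json
import urllib.request


def build_chat_request(prompt, host="127.0.0.1", port=8019,
                       api_key="sk-xxxxxxxxxxxxxxxxxxxxxx"):
    """Assemble the URL, headers, and JSON body for a chat completion call."""
    url = f"http://{host}:{port}/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.5,
    }
    return url, headers, json.dumps(body).encode("utf-8")


if __name__ == "__main__":
    # Send the request and print the model's reply
    url, headers, data = build_chat_request("Hello")
    req = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```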
Serving an OpenAI-compatible API with llama-cpp-python
- Create and activate a virtual environment

python -m venv .env && source .env/bin/activate

- Install dependencies

pip install 'llama-cpp-python[server]'

- Start the web server

python3 -m llama_cpp.server --model models/7B/llama-model.gguf

- Test the API from Python:
# Requires the pre-1.0 openai package (this is the 0.x client API)
import openai

openai.api_key = "..."
openai.api_base = "http://127.0.0.1:8000/v1"

default_prompt = """
I understand that you are an AI language model. When responding, you should remember and adhere to the following guidelines:
- No need for polite language or unnecessary introductions.
- Responses should be concise and precise.
- When asked to implement something using a programming language, please provide a detailed response and ensure it achieves the desired outcome using the programming language.
- Before responding, check if there are any issues with the output you are about to provide. If there are issues, please correct them before responding.
- Remain neutral on controversial topics.
- Responses should be within 1000 words.
- Respond in Chinese.
"""

print(openai.Model.list())

# create a chat completion
chat_completion = openai.ChatCompletion.create(
    model="alpaca-7b-16k",
    messages=[
        {"role": "system", "content": default_prompt},
        # "Act as a seasoned travel blogger and introduce China's capital"
        {"role": "user", "content": "请充当一个资深的旅游博主,介绍下中国的首都"},
    ],
    temperature=0.9,
    top_p=0.6,
    max_tokens=1024,
)

# print the completion
print(chat_completion['choices'][0]['message']['content'])
References
- https://huggingface.co/meta-llama/Llama-2-7b-hf
- https://github.com/ggerganov/llama.cpp
- https://github.com/abetlen/llama-cpp-python
- https://huggingface.co/xtuner
- https://huggingface.co/datasets/silk-road/alpaca-data-gpt4-chinese
- https://huggingface.co/unsloth
- https://www.youtube.com/watch?v=LPmI-Ok5fUc