self-host-deploy-llama

Deploying the Llama-2 Large Language Model Locally

With the explosive popularity of generative AI led by ChatGPT, more and more independent developers and researchers are deploying large AI models to meet their own needs. Llama-2 is one of the most popular of these models. This article walks through deploying Llama-2 locally so that its generation capabilities can be put to use in independent development and research projects. Whether you are a full-stack programmer or a researcher passionate about AI, this guide covers the deployment process in detail.

Installing and Building llama.cpp

Downloading Models

Baichuan

Chinese-LLaMA-2

Model Conversion

llama.cpp requires models in GGUF format. Use the conversion script that ships with the repository to convert the original Hugging Face model to GGUF (install the script's Python dependencies from the repository's requirements.txt first).

python convert.py models/7B/

Quantizing the Model

./quantize models/chinese-alpaca-2-13b-hf/ggml-model-f16.gguf models/chinese-alpaca-2-13b-hf/ggml-model-q4_K_S.gguf Q4_K_S
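The Q4_K_S scheme stores weights at roughly 4.5 bits each on average (an approximate figure), which is where most of the disk and memory savings come from. A rough back-of-the-envelope estimate for a 13B-parameter model:

```python
# Rough size estimate for a 13B-parameter model at different precisions.
# The ~4.5 bits/weight figure for Q4_K_S is an approximation.
params = 13_000_000_000
fp16_gib = params * 2 / 2**30        # f16: 2 bytes per weight
q4_gib = params * 4.5 / 8 / 2**30    # Q4_K_S: ~4.5 bits per weight
print(f"f16: {fp16_gib:.1f} GiB, Q4_K_S: {q4_gib:.1f} GiB")
```

The quantized file is roughly a quarter of the f16 size, which is what makes 13B models practical on consumer hardware.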

Command-line Usage

Run the quantized model interactively (-ins enables instruction mode, -c sets the context size, --temp the sampling temperature):

./main -m models/chinese-alpaca-2-13b-hf/ggml-model-q4_K_S.gguf --color -f prompts/alpaca.txt -ins -c 4096 --temp 0.5 -n 4096 --repeat_penalty 1.2

OpenAI-compatible API Access

  1. Create and activate a virtual environment

    python -m venv .env && source .env/bin/activate

  2. Install dependencies

    pip install 'llama-cpp-python[server]'

  3. Start the web server

    python3 -m llama_cpp.server --model models/7B/llama-model.gguf
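Since llama-cpp-python's server exposes OpenAI-style endpoints, it can be probed with nothing more than the standard library; a minimal sketch, assuming the server above is listening on its default port 8000:

```python
import json
import urllib.request

def list_models(base_url="http://127.0.0.1:8000/v1"):
    """Query the OpenAI-compatible /models endpoint of the local server."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return json.load(resp)

# With the server running:
# print(list_models())
```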

  4. Test API access with Python

    # Note: this example uses the legacy OpenAI Python client (openai<1.0),
    # e.g. pip install "openai<1". The API key can be any placeholder,
    # since the local server does not validate it.
    import openai
    
    openai.api_key = "..."
    openai.api_base = "http://127.0.0.1:8000/v1"
    
    default_prompt = """
    I understand that you are an AI language model. When responding, you should remember and adhere to the following guidelines:
    - No need for polite language or unnecessary introductions.
    - Responses should be concise and precise.
    - When asked to implement something using a programming language, please provide a detailed response and ensure it achieves the desired outcome using the programming language.
    - Before responding, check if there are any issues with the output you are about to provide. If there are issues, please correct them before responding.
    - Remain neutral on controversial topics.
    - Responses should be within 1000 words.
    - Respond in Chinese.
    """
    
    # List the models the server exposes
    print(openai.Model.list())
    
    # Create a chat completion
    chat_completion = openai.ChatCompletion.create(
        model="alpaca-7b-16k",
        messages=[
            {"role": "system", "content": default_prompt},
            {"role": "user", "content": "Acting as a seasoned travel blogger, introduce the capital of China"}
        ],
        temperature=0.9,
        top_p=0.6,
        max_tokens=1024
    )
    
    # Print the completion text
    print(chat_completion['choices'][0]['message']['content'])
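For reference, the response follows the OpenAI chat completion schema; the sketch below shows the fields typically worth extracting (the payload values are invented for illustration, not real server output):

```python
# Illustrative OpenAI-style chat completion payload (values are invented)
sample = {
    "id": "chatcmpl-demo",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Beijing is the capital of China..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 120, "completion_tokens": 300, "total_tokens": 420},
}

reply = sample["choices"][0]["message"]["content"]   # the generated text
finish = sample["choices"][0]["finish_reason"]       # "stop" or "length"
total = sample["usage"]["total_tokens"]              # prompt + completion tokens
print(f"{finish}: {total} tokens")
```

Checking finish_reason is useful: a value of "length" means the reply was cut off by max_tokens.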
