Simple Data Preparation and Preprocessing for Fine-Tuning - Louis

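Background

The most important part of building a private LLM is fine-tuning the model on your own private data. This article covers the first step: preparing and preprocessing that data.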

Data Preparation

Downloading the dataset used to train Stanford Alpaca gives you a JSON file in the following format:
alpaca_data.json

[
    {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
    },
.......
    {
        "instruction": "What are the three primary colors?",
        "input": "",
        "output": "The three primary colors are red, blue, and yellow."
    }
]
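
Before converting the file, it can help to confirm that every record really has the three fields shown above. The following is a minimal sketch and not part of the original script: the helper name `find_malformed_records` is hypothetical, and it assumes each record is a dict whose `instruction`, `input`, and `output` values are strings.

```python
REQUIRED_KEYS = ("instruction", "input", "output")

def find_malformed_records(records):
    """Return the indexes of records that lack a required key or hold a non-string value."""
    bad = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            bad.append(index)
            continue
        # record.get(key) is None for a missing key, which also fails the str check.
        if any(not isinstance(record.get(key), str) for key in REQUIRED_KEYS):
            bad.append(index)
    return bad

sample = [
    {
        "instruction": "What are the three primary colors?",
        "input": "",
        "output": "The three primary colors are red, blue, and yellow.",
    },
    {"instruction": "A record with no output field", "input": ""},
]
print(find_malformed_records(sample))  # [1]
```

Running the same check on the list loaded from alpaca_data.json should return an empty list if the file matches the format above.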

Data Preprocessing

Fine-tuning expects the data in the following format (the basic SFT data structure):

{
	"messages":[
		{
			"role": "user",
			"content": "xxxx_question_1"
		},
		{
			"role": "assistant",
			"content": "xxxx_answer_1"
		}
	]
}
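
A quick structural check can catch malformed lines before they reach a training job. The sketch below is an illustration under stated assumptions: the function name `is_valid_sft_line` is hypothetical, and it accepts only the single-turn shape shown above (exactly one `user` message followed by one `assistant` message, each with non-empty string content). A given fine-tuning framework may allow more, such as system messages or multi-turn conversations.

```python
import json

def is_valid_sft_line(line):
    """Check that one JSONL line holds a user message followed by an assistant message."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) != 2:
        return False
    for message, expected_role in zip(messages, ("user", "assistant")):
        if not isinstance(message, dict):
            return False
        if message.get("role") != expected_role:
            return False
        content = message.get("content")
        if not isinstance(content, str) or not content:
            return False
    return True

good = json.dumps({"messages": [
    {"role": "user", "content": "xxxx_question_1"},
    {"role": "assistant", "content": "xxxx_answer_1"},
]})
print(is_valid_sft_line(good))
```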

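A Python script converts the records in alpaca_data.json into the format required for fine-tuning.
The converted data is also split into two sets, a training set and a validation set, at a ratio of 7:3 (training:validation).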

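import json
import random

def load_data():
    # Read the Alpaca dataset: a JSON array of instruction/input/output records.
    with open('alpaca_data.json', 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data

def conversion_to_jsonl(data):
    # Shuffle in place so the split is random (this reorders the caller's list).
    random.shuffle(data)

    # Split 7:3 into training and validation sets.
    split = int(len(data) * 0.7)
    train = data[:split]
    valid = data[split:]

    def convert(item):
        # Append the optional input to the instruction only when it is non-empty,
        # so records without input do not end with a stray newline.
        content = item["instruction"]
        if item.get("input"):
            content += "\n" + item["input"]
        return {
            "messages": [
                {"role": "user", "content": content},
                {"role": "assistant", "content": item["output"]}
            ]
        }

    # Write one JSON object per line (JSONL).
    # ensure_ascii=False keeps non-ASCII text readable in the output files.
    with open("train.jsonl", "w", encoding="utf-8") as f:
        for item in train:
            f.write(json.dumps(convert(item), ensure_ascii=False) + "\n")

    with open("valid.jsonl", "w", encoding="utf-8") as f:
        for item in valid:
            f.write(json.dumps(convert(item), ensure_ascii=False) + "\n")

if __name__ == "__main__":
    original_data = load_data()
    conversion_to_jsonl(original_data)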
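
Because the script shuffles with the module-level `random.shuffle` and no seed, each run produces a different train/validation split. If a repeatable split is wanted, one option is a seeded generator. This is a sketch of an alternative, not the author's code: the function name `split_records` and the default seed of 42 are arbitrary choices made for illustration.

```python
import random

def split_records(records, train_ratio=0.7, seed=42):
    """Return (train, valid) lists from a seeded shuffle, leaving the input untouched."""
    rng = random.Random(seed)   # private generator, so global random state is not affected
    shuffled = list(records)    # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

records = list(range(10))
train, valid = split_records(records)
print(len(train), len(valid))
```

Calling `split_records` again with the same seed returns the same two lists, which makes it easier to compare fine-tuning runs against an identical validation set.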

The complete code is available at the following link:

https://github.com/LuochuanAD/Fine-tuning-Learn

References

Data source:

The dataset used to train Stanford Alpaca:

https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/refs/heads/main/alpaca_data.json

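About the Author

I'm Louis, an engineer with long experience in iOS and AI engineering, and an aspiring founder exploring product and business possibilities.

The articles here mostly cover things I have used in real projects, stumbled over, and verified repeatedly, rather than quick content written for traffic.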
