Data Quality Management in Fine-tuning - Louis

Background

As a rough rule of thumb: fine-tuning performance ≈ 70% data structure design + 20% data quality + 10% training parameters

This article focuses solely on data quality, specifically data cleaning.

Complete Data Cleaning Pipeline

Raw data
   ↓
Format standardization
   ↓
Simple deduplication
   ↓
Semantic deduplication
   ↓
Noise filtering
   ↓
Length control
   ↓
Language filtering
   ↓
Semantic consistency check
   ↓
Perplexity filtering
   ↓
Final SFT data
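
The stages above can be chained as a sequence of functions, each taking a list of samples and returning the ones that survive. The sketch below is only an illustration of that idea, not the author's implementation: `run_pipeline` and the two toy stages are hypothetical names, and the `text` field is an assumed sample layout.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]
Stage = Callable[[List[Sample]], List[Sample]]


def run_pipeline(data: List[Sample], stages: List[Stage]) -> List[Sample]:
    """Apply each cleaning stage in order and report how many samples survive."""
    for stage in stages:
        before = len(data)
        data = stage(data)
        print(f"{stage.__name__}: {before} -> {len(data)}")
    return data


# Two toy stages standing in for the real ones (assumptions for illustration).
def drop_empty(data: List[Sample]) -> List[Sample]:
    return [d for d in data if d["text"].strip()]


def drop_exact_duplicates(data: List[Sample]) -> List[Sample]:
    seen = set()
    kept = []
    for d in data:
        if d["text"] not in seen:
            seen.add(d["text"])
            kept.append(d)
    return kept


raw = [{"text": "What is Python?"}, {"text": ""}, {"text": "What is Python?"}]
cleaned = run_pipeline(raw, [drop_empty, drop_exact_duplicates])
```

Printing the count after every stage makes it easy to spot a filter that removes far more data than expected.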

Data Cleaning

1 Deduplication

Real-world data often contains a lot of duplicate entries.

Method 1: Simple text deduplication

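# `data` is the list of samples loaded earlier.
# Keep only the first item for each distinct first-message text.
seen = set()
clean_data = []

for item in data:
    # Assumes the first message holds the user prompt.
    text = item["messages"][0]["content"]
    if text not in seen:
        seen.add(text)
        clean_data.append(item)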
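
Exact matching misses entries that differ only in letter case or whitespace. A possible refinement, sketched here under the assumption that such differences are not meaningful for your data, is to normalize the text before comparing. The helper names `normalize` and `dedup_normalized` are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_normalized(data):
    """Keep the first item for each normalized first-message text."""
    seen = set()
    kept = []
    for item in data:
        text = item["messages"][0]["content"]
        # Hashing keeps the `seen` set small when prompts are long.
        key = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(item)
    return kept


data = [
    {"messages": [{"role": "user", "content": "What is Python?"}]},
    {"messages": [{"role": "user", "content": "  what is   python? "}]},
    {"messages": [{"role": "user", "content": "Explain Python."}]},
]
clean_data = dedup_normalized(data)
```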

Method 2: Semantic deduplication

Many questions are phrased differently but mean the same thing:

What is Python?
Explain Python.
Tell me about Python language.

Use embedding similarity to deduplicate.

For example, use:

OpenAI embeddings
Hugging Face sentence-transformers

Approach:

Compute embeddings
↓
Compute pairwise cosine similarity
↓
If similarity > 0.9, remove the duplicate
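
The approach above can be sketched in plain Python once the embeddings exist. This is a minimal sketch, not a definitive implementation: the vectors are assumed to come from an embedding model run elsewhere, the three-dimensional toy vectors are invented for illustration, and the greedy keep-first strategy in the hypothetical `semantic_dedup` is one possible choice among several.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_dedup(items: List[str], embeddings: List[Sequence[float]],
                   threshold: float = 0.9) -> List[str]:
    """Greedily keep an item only if it is not too similar to any kept item."""
    kept_items = []
    kept_vectors = []
    for item, vec in zip(items, embeddings):
        if all(cosine_similarity(vec, k) <= threshold for k in kept_vectors):
            kept_items.append(item)
            kept_vectors.append(vec)
    return kept_items


questions = ["What is Python?", "Explain Python.", "What is the capital of Japan?"]
# Toy vectors invented for illustration; real ones come from an embedding model.
vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0]]
unique_questions = semantic_dedup(questions, vectors)
```

The pairwise comparison is quadratic in the number of samples, so larger datasets would likely need a vector index instead of this loop.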

Refer to the article: “Incremental Vector Update Strategy for Embeddings”

https://strictfrog.com/en/2026-03-14-incremental-vector-update-strategy-for-embedding/

2 Noise Filtering

Method 1: Simple rules based on keywords

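# Phrases that usually signal an evasive or low-quality answer.
bad_words = ["I don't know", "Sorry", "Maybe"]

def is_bad(text):
    # Case-sensitive substring match: flag the text if any phrase appears.
    for w in bad_words:
        if w in text:
            return True
    return False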

Method 2: Remove empty responses
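
A minimal sketch of this check, assuming the same `messages` layout with `role` and `content` keys used in the deduplication example; `has_empty_response` is a hypothetical helper name.

```python
def has_empty_response(item: dict) -> bool:
    """True if the sample has no assistant message or a blank assistant message."""
    answers = [m.get("content") for m in item["messages"] if m["role"] == "assistant"]
    # `a or ""` also treats a missing or None content as empty.
    return not answers or any(not (a or "").strip() for a in answers)


data = [
    {"messages": [{"role": "user", "content": "What is Python?"},
                  {"role": "assistant", "content": "A programming language."}]},
    {"messages": [{"role": "user", "content": "Explain Python."},
                  {"role": "assistant", "content": "   "}]},
    {"messages": [{"role": "user", "content": "Tell me about Python."}]},
]
clean_data = [item for item in data if not has_empty_response(item)]
```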

3 Format Standardization

Normalize different data sources into a fixed JSON format.

For example, the article below converts the Alpaca data format into an SFT data structure.

“Simple Data Preparation and Preprocessing for Fine-tuning”:

https://strictfrog.com/en/2026-03-15-simple-data-preparation-and-preprocessing-for-fine-tuning/
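
As a rough illustration of such a conversion (the linked article may do it differently), the sketch below maps an Alpaca-style record with `instruction`, `input`, and `output` fields to the `messages` layout used elsewhere in this article. The function name `alpaca_to_messages` and the way the optional `input` is appended to the prompt are assumptions.

```python
def alpaca_to_messages(record: dict) -> dict:
    """Convert one Alpaca-style record into a chat-style `messages` sample."""
    prompt = record["instruction"].strip()
    extra = record.get("input", "").strip()
    if extra:
        # Assumed convention: append the optional input below the instruction.
        prompt = f"{prompt}\n\n{extra}"
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": record["output"].strip()},
        ]
    }


alpaca_record = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
sft_sample = alpaca_to_messages(alpaca_record)
```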

4 Length Control

Many training setups cap the sequence length, for example at 4096 tokens.

Use a tokenizer to count the tokens in each sample:

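from transformers import AutoTokenizer

# "gpt2" is a placeholder; ideally load the tokenizer of the model you fine-tune.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

MAX_TOKENS = 4096
clean_data = []

for item in data:
    # Count tokens over all messages in the sample.
    text = "\n".join(m["content"] for m in item["messages"])
    tokens = tokenizer.encode(text)
    if len(tokens) > MAX_TOKENS:
        continue  # skip samples that exceed the limit
    clean_data.append(item)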

5 Language Filtering

If the target model is Chinese, filter out samples written in English, Japanese, and other languages.
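
A minimal heuristic sketch of such a filter, using only Unicode ranges from the standard library. It is an assumption-laden stand-in for a proper language-identification model: the function name `is_mostly_chinese`, the 0.5 ratio, and the rule that any kana character means Japanese are all illustrative choices.

```python
def is_mostly_chinese(text: str, min_ratio: float = 0.5) -> bool:
    """Heuristic: most letters are Han characters and no Japanese kana appear."""
    # Hiragana and Katakana indicate Japanese even when kanji are present.
    if any("\u3040" <= ch <= "\u30ff" for ch in text):
        return False
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    han = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return han / len(letters) >= min_ratio


samples = ["今天天气很好", "What is Python?", "これは日本語です"]
chinese_only = [s for s in samples if is_mostly_chinese(s)]
```

Mixed-language text, such as Chinese prose containing English identifiers, is where a fixed ratio is most likely to misjudge, so the threshold deserves tuning on real samples.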

6 Semantic Quality Filtering

Example: Question and answer mismatch

Q: What is Python?
A: Tokyo is the capital of Japan.

Solution: Use similarity scoring based on a fine-tuned model to evaluate relevance.
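
The filtering step can be sketched with a pluggable scoring function. Note the substitution: the text proposes a model-based scorer, while the `jaccard_score` below is only a crude word-overlap stand-in so that the example runs without a model. Both function names and the 0.2 threshold are assumptions for illustration.

```python
import re
from typing import Callable, List, Tuple


def jaccard_score(question: str, answer: str) -> float:
    """Toy relevance score: word overlap between question and answer (0 to 1)."""
    q = set(re.findall(r"\w+", question.lower()))
    a = set(re.findall(r"\w+", answer.lower()))
    if not q or not a:
        return 0.0
    return len(q & a) / len(q | a)


def filter_mismatched(pairs: List[Tuple[str, str]],
                      score_fn: Callable[[str, str], float],
                      threshold: float) -> List[Tuple[str, str]]:
    """Keep only the pairs whose relevance score reaches the threshold."""
    return [(q, a) for q, a in pairs if score_fn(q, a) >= threshold]


pairs = [
    ("What is Python?", "Python is a programming language."),
    ("What is Python?", "Tokyo is the capital of Japan."),
]
relevant = filter_mismatched(pairs, jaccard_score, threshold=0.2)
```

In practice `score_fn` would wrap the embedding or fine-tuned relevance model; keeping it as a parameter lets the filter stay unchanged when the scorer is swapped.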

7 Perplexity Filtering

Idea:

Use a language model to calculate the perplexity of each text.

A very high perplexity indicates poor text quality.

For example, nonsensical strings like asdlfjlasdjfl should be removed.

Common models:

GPT-2
KenLM
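
To make the idea concrete without downloading a model, the sketch below uses a tiny character-bigram model with add-one smoothing as a stand-in for GPT-2 or KenLM. The class `CharBigramLM`, the three-sentence corpus, and the way the threshold is chosen are all assumptions for illustration; a real pipeline would score texts with one of the models listed above.

```python
import math
from collections import Counter
from typing import List


class CharBigramLM:
    """Tiny character-bigram language model with add-one smoothing."""

    def __init__(self, corpus: List[str]):
        self.bigrams = Counter()
        self.contexts = Counter()
        vocab = set()
        for line in corpus:
            padded = "^" + line.lower()  # "^" marks the start of a text
            vocab.update(padded)
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1
        self.vocab_size = len(vocab) + 1  # +1 slot for unseen characters

    def perplexity(self, text: str) -> float:
        """exp of the average negative log-probability per character."""
        padded = "^" + text.lower()
        pairs = list(zip(padded, padded[1:]))
        if not pairs:
            return float("inf")
        log_prob = 0.0
        for prev, cur in pairs:
            prob = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + self.vocab_size)
            log_prob += math.log(prob)
        return math.exp(-log_prob / len(pairs))


def filter_by_perplexity(texts: List[str], lm: CharBigramLM,
                         max_perplexity: float) -> List[str]:
    """Keep texts whose perplexity does not exceed the limit."""
    return [t for t in texts if lm.perplexity(t) <= max_perplexity]


corpus = [
    "python is a programming language",
    "the model learns from clean data",
    "what is the capital of japan",
]
lm = CharBigramLM(corpus)
normal_ppl = lm.perplexity("python is a programming language")
noise_ppl = lm.perplexity("asdlfjlasdjfl")
```

The absolute perplexity values depend entirely on the scoring model, so the cutoff is usually picked by inspecting the score distribution of your own data rather than fixed in advance.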

8 Conversation Completeness Check

For multi-turn conversation data:

Ensure the turn order:

user
assistant
user
assistant

Avoid sequences like:

user
user
assistant
assistant

Check method:

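def has_alternating_roles(messages):
    # Return False if two consecutive messages share the same role.
    roles = [m["role"] for m in messages]
    for i in range(len(roles) - 1):
        if roles[i] == roles[i + 1]:
            return False
    return True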

9 Data Quality Scoring

Dimension	Meaning
Semantic Matching	Q/A relevance
Language Quality	Grammar correctness
Length Reasonableness	Not overly long
Diversity	Different topics

Scores range from 0 to 1; remove samples that score below 0.6.

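from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

def score_data(question, answer):
    prompt = f"""
Rate the quality of this QA pair from 0-1.

Q: {question}
A: {answer}

Only output a number.
"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )

    # The score is a decimal such as 0.8, so parse it as float; int() would fail.
    return float(response.choices[0].message.content.strip())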
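
Model replies do not always contain a bare number, so it can help to parse the score defensively before applying the 0.6 cutoff. The sketch below is one possible way to do that; `parse_score`, `filter_by_score`, and the canned replies that replace the real API call are hypothetical and only serve the illustration.

```python
import re
from typing import Callable, List, Optional, Tuple


def parse_score(raw: str) -> Optional[float]:
    """Extract the first number from a model reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", raw)
    if match is None:
        return None
    return min(1.0, max(0.0, float(match.group())))


def filter_by_score(pairs: List[Tuple[str, str]],
                    score_fn: Callable[[str, str], Optional[float]],
                    threshold: float = 0.6) -> List[Tuple[str, str]]:
    """Keep pairs with a parsable score at or above the threshold."""
    kept = []
    for question, answer in pairs:
        score = score_fn(question, answer)
        if score is not None and score >= threshold:
            kept.append((question, answer))
    return kept


# Canned replies standing in for real model output (assumption for illustration).
fake_replies = {
    "Python is a programming language.": "0.9",
    "Tokyo is the capital of Japan.": "Score: 0.1",
    "It depends.": "n/a",
}


def fake_scorer(question: str, answer: str) -> Optional[float]:
    return parse_score(fake_replies[answer])


pairs = [("What is Python?", answer) for answer in fake_replies]
high_quality = filter_by_score(pairs, fake_scorer)
```

Samples whose reply cannot be parsed are dropped here; retrying the request would be an equally reasonable policy.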

Data Scale Reference

For example:

Raw data 1,000,000
↓
After cleaning 100,000
↓
Final training 50,000

Final Training Size	Effect
1,000	Slight improvement
5,000	Noticeable improvement
10,000	Quite good
50,000	Near professional level

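About the Author

I'm Louis, an engineer with long-term hands-on experience in iOS and AI engineering, and an aspiring founder exploring product and business possibilities.

The articles here are mostly things I have used in real projects, stumbled over, and validated repeatedly, rather than "quick content" written for traffic.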

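Contact and Collaboration

If you:

· Are working on iOS app, AI, or automation projects

· Have questions about technology selection, architecture design, or bringing a product to market

· Or would like to exchange technical ideas or discuss collaboration

Feel free to contact me at the following email address: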

luochuanad@gmail.com