Background
What are the sources of SFT data for fine-tuning LMs?
1 Open Source Datasets
Common ones:
Stanford Alpaca
https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/refs/heads/main/alpaca_data.json
Hugging Face datasets
https://huggingface.co/datasets
LAION OpenAssistant
https://github.com/LAION-AI/Open-Assistant/tree/main/oasst-data
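As a rough illustration of what such a dataset looks like, the sketch below turns Alpaca-style records into prompt/response pairs. Each record in alpaca_data.json is a JSON object with "instruction", "input", and "output" fields. The helper names to_pair and load_alpaca, and the way the instruction and input are joined, are assumptions made here for illustration; this is not the prompt template Alpaca itself uses for training.

```python
import json
import urllib.request

ALPACA_URL = (
    "https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/"
    "refs/heads/main/alpaca_data.json"
)


def to_pair(record):
    """Turn one Alpaca-style record into a (prompt, response) tuple."""
    instruction = record["instruction"].strip()
    context = record.get("input", "").strip()
    # Hypothetical joining rule: put the optional input in its own paragraph.
    prompt = f"{instruction}\n\n{context}" if context else instruction
    return prompt, record["output"].strip()


def load_alpaca(url=ALPACA_URL):
    """Download the dataset and parse it as a list of dicts (needs network)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

With network access, something like to_pair(load_alpaca()[0]) would show the first sample. Records whose "input" is empty simply use the instruction as the whole prompt.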
2 Auto-Generating Data with LLMs (Most Common)
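# Requires openai>=1.0; the API key is read from the OPENAI_API_KEY environment variable.
import openai
import json

prompt = "Generate 50 math reasoning QA pairs with step-by-step reasoning."

def generate_data(prompt):
    # Send the prompt as a single user message to the chat completions endpoint.
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    # Return the raw text of the first (and only) completion.
    return response.choices[0].message.content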
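The reply from generate_data is free text, so it has to be parsed before it can serve as training data. The sketch below assumes the prompt has been extended to ask for a JSON array of objects with "question" and "answer" keys; the original prompt does not specify an output format, so that schema and the helper names parse_qa_pairs and to_jsonl are hypothetical.

```python
import json


def parse_qa_pairs(text):
    """Extract a list of {"question", "answer"} dicts from a model reply."""
    # Models often wrap JSON in prose or code fences, so keep only the
    # outermost [...] span before parsing.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return []
    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        question = str(item.get("question", "")).strip()
        answer = str(item.get("answer", "")).strip()
        # Drop incomplete pairs instead of keeping empty fields.
        if question and answer:
            pairs.append({"question": question, "answer": answer})
    return pairs


def to_jsonl(pairs):
    """Serialize the pairs as one JSON object per line."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)
```

Returning an empty list on malformed output keeps a batch job running; in practice one might log and retry those replies instead.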
3 Scraping Real Conversations
Sources:
- StackOverflow
- GitHub issues
- Wikipedia
Questions from real users reflect what people actually ask, so they tend to make high-quality prompts.
Collecting them requires writing scrapers; scraper examples are not provided here.
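Once pages have been fetched, the raw threads still need to be turned into training samples. The sketch below covers only that post-processing step, not the scraping itself. The input shape (a dict with "title", "body", and a list of "answers" carrying "body", "score", and "accepted") is a hypothetical schema chosen for illustration, as are the names clean and thread_to_sample and the min_score threshold.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def clean(text):
    """Drop HTML tags, decode entities, and collapse whitespace."""
    text = html.unescape(TAG_RE.sub(" ", text))
    return re.sub(r"\s+", " ", text).strip()


def thread_to_sample(thread, min_score=1):
    """Convert one scraped Q&A thread into an instruction/output record."""
    answers = [
        a for a in thread.get("answers", []) if a.get("score", 0) >= min_score
    ]
    if not answers:
        return None  # no answer is trusted enough to train on
    # Prefer the accepted answer; otherwise fall back to the top-scored one.
    best = max(
        answers, key=lambda a: (bool(a.get("accepted")), a.get("score", 0))
    )
    question = clean(f"{thread['title']}\n{thread.get('body', '')}")
    return {"instruction": question, "output": clean(best["body"])}
```

Note that this naive tag stripping also flattens code blocks into plain text; a real pipeline would likely preserve them.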
4 Private Database Knowledge
Internal sources such as enterprise RAG knowledge bases, documents, and websites.
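Internal documents are usually too long to use directly, so one plausible first step is to split them into overlapping chunks and then generate question-answer pairs per chunk, for example by prompting an LLM as in section 2. The sketch below assumes simple character-based chunking; the names chunk_text and build_prompt, the default sizes, and the prompt wording are all hypothetical.

```python
def chunk_text(text, size=500, overlap=50):
    """Split text into chunks of at most `size` characters that overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece.strip():
            chunks.append(piece)
        # Stop once a chunk has reached the end of the text.
        if start + size >= len(text):
            break
    return chunks


def build_prompt(chunk, n=3):
    """Build a generation prompt that grounds the QA pairs in one chunk."""
    return (
        f"Write {n} question-answer pairs that can be answered "
        f"using only the passage below.\n\nPassage:\n{chunk}"
    )
```

The overlap keeps sentences that straddle a boundary visible in at least one chunk; splitting on paragraphs or tokens instead of characters is a common refinement.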
5 Manually Curated High-Quality Data
For formatting, refer to the following article:
“High-Quality SFT Data Structure Design for Fine-tuning”
https://strictfrog.com/en/2026-03-15-high-quality-sft-data-structure-design-for-fine-tuning/