Incremental Vector Update Strategy for Embedding - Louis

Background

The goal is to embed only new or changed content instead of reprocessing every document on each run.

1. Three Levels of Incremental Updates

Level	Granularity	Recommendation
File-level	PDF	⭐⭐⭐⭐
Page-level	page	⭐⭐⭐
Chunk-level	paragraph	⭐⭐⭐⭐⭐

Recommended workflow:

Private API
 ↓
Fetch file
 ↓
PDF hash
 ↓
New or changed file (hash not seen before)?
 ├── No → Skip
 └── Yes
      ↓
Text extraction
 ↓
Chunk
 ↓
Chunk hash
 ↓
Already exists?
 ├── Yes → Skip
 └── No
      ↓
Embedding
 ↓
Vector DB
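
The workflow above can be sketched end to end with in-memory stand-ins. This is a minimal sketch, not a definitive implementation: `split_into_chunks`, `fake_embed`, `seen_file_hashes`, and `vector_db` are hypothetical names, text extraction is replaced by a plain UTF-8 decode, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def split_into_chunks(text: str) -> list[str]:
    # Hypothetical chunker: one chunk per blank-line-separated paragraph.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def fake_embed(chunk: str) -> list[float]:
    # Placeholder for a real embedding model call.
    return [float(len(chunk))]

seen_file_hashes: set[str] = set()      # stand-in for the file hash table
vector_db: dict[str, list[float]] = {}  # chunk hash -> vector

def ingest(file_bytes: bytes) -> int:
    """Run the two-level check and return how many chunks were embedded."""
    fh = sha256_hex(file_bytes)
    if fh in seen_file_hashes:
        return 0  # file-level skip: identical bytes were already processed
    text = file_bytes.decode("utf-8")  # stand-in for PDF text extraction
    embedded = 0
    for chunk in split_into_chunks(text):
        ch = sha256_hex(chunk.encode("utf-8"))
        if ch in vector_db:
            continue  # chunk-level skip: this text is already embedded
        vector_db[ch] = fake_embed(chunk)
        embedded += 1
    seen_file_hashes.add(fh)
    return embedded

v1 = b"Intro paragraph.\n\nPricing details."
v2 = b"Intro paragraph.\n\nPricing details, revised."
first = ingest(v1)    # both chunks are new
repeat = ingest(v1)   # same file hash, nothing to do
changed = ingest(v2)  # only the revised paragraph is embedded
```

The file hash avoids text extraction for untouched files, while the chunk hash limits embedding calls when only part of a file changes.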

Approach 1: File Hash

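import hashlib

def file_hash(file_bytes: bytes) -> str:
    # MD5 is used here only for change detection, not for security.
    # Swap in hashlib.sha256 if collision resistance matters.
    return hashlib.md5(file_bytes).hexdigest()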

Database storage:

file_hash
file_name
processed_at
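
One possible way to persist these three fields is a small SQLite table. The sketch below assumes an in-memory database and a hypothetical table name `processed_files`; the helper names `needs_processing` and `mark_processed` are illustrative only.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # use a file path for real persistence
conn.execute(
    "CREATE TABLE processed_files ("
    " file_hash TEXT PRIMARY KEY,"
    " file_name TEXT NOT NULL,"
    " processed_at TEXT NOT NULL)"
)

def needs_processing(file_bytes: bytes) -> bool:
    """True when no row with this file's hash exists yet."""
    digest = hashlib.md5(file_bytes).hexdigest()
    row = conn.execute(
        "SELECT 1 FROM processed_files WHERE file_hash = ?", (digest,)
    ).fetchone()
    return row is None

def mark_processed(file_name: str, file_bytes: bytes) -> None:
    """Record the file hash after its chunks have been embedded."""
    digest = hashlib.md5(file_bytes).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO processed_files VALUES (?, ?, ?)",
        (digest, file_name, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
```

Calling `mark_processed` only after embedding succeeds means a failed run is retried on the next pass instead of being skipped.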

Approach 2: Page Hash

Split the PDF into pages and hash the content of each page.

Database storage:

file_id
page_number
page_hash
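
As a rough illustration of the page-level comparison, the sketch below assumes the page texts have already been extracted into a list (the extraction library is left out on purpose). `page_hashes` and `changed_pages` are hypothetical helper names.

```python
import hashlib

def page_hashes(pages: list[str]) -> dict[int, str]:
    """Map 1-based page numbers to a hash of each page's text."""
    return {
        number: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for number, text in enumerate(pages, start=1)
    }

def changed_pages(stored: dict[int, str], current: dict[int, str]) -> list[int]:
    """Page numbers that are new or whose hash differs from the stored one."""
    return sorted(n for n, h in current.items() if stored.get(n) != h)

old = page_hashes(["Cover", "Terms v1", "Appendix"])
new = page_hashes(["Cover", "Terms v2", "Appendix", "Changelog"])
to_reembed = changed_pages(old, new)  # page 2 was edited, page 4 is new
```

One caveat with keying on page_number: inserting a page in the middle shifts every later page, so all of them look changed even though their text is identical.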

Approach 3: Chunk Hash (Enterprise-level)

def chunk_hash(text: str) -> str:
    # Hash the UTF-8 chunk text (hashlib is imported in Approach 1).
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

Database storage:

chunk_id
chunk_hash
vector
metadata
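
With chunk hashes stored, re-syncing one file reduces to a set comparison. The following is a minimal sketch under the assumption that the stored hashes for the file are available as a Python set; `plan_sync` is a hypothetical helper, not part of any library.

```python
import hashlib

def chunk_hash(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def plan_sync(stored_hashes: set[str], chunks: list[str]) -> tuple[list[str], set[str]]:
    """Return (chunks to embed, stale hashes to delete) for one file."""
    current = {chunk_hash(c): c for c in chunks}
    # New chunks: present in the file now, absent from the store.
    to_embed = [c for h, c in current.items() if h not in stored_hashes]
    # Stale chunks: present in the store, no longer in the file.
    to_delete = stored_hashes - set(current)
    return to_embed, to_delete

stored = {chunk_hash("Intro"), chunk_hash("Old pricing")}
to_embed, to_delete = plan_sync(stored, ["Intro", "New pricing"])
```

Here only "New pricing" needs an embedding call, and the vector for "Old pricing" can be removed so outdated text stops appearing in search results.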

Vector Database Metadata Design (Approaches 1 and 3)

Recommended metadata:

{
  "file_id": "pdf123",
  "file_hash": "...",
  "chunk_hash": "...",
  "page": 3,
  "source": "Louis_pdf"
}

Advantages:

Delete every vector that belongs to a specific file (by file_id)
Update a specific file by replacing only its chunks
Filter search results by source
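
To show how this metadata enables those operations, here is a toy in-memory store. It is only a sketch: `TinyVectorStore`, `delete_where`, and `filter` are hypothetical names, the "web" source value is made up for the example, and a real vector database exposes its own filter syntax.

```python
from dataclasses import dataclass

@dataclass
class Record:
    vector: list[float]
    metadata: dict

class TinyVectorStore:
    """Toy store that supports metadata-based delete and filter."""

    def __init__(self) -> None:
        self.records: dict[str, Record] = {}

    def upsert(self, chunk_id: str, vector: list[float], metadata: dict) -> None:
        self.records[chunk_id] = Record(vector, metadata)

    def _matches(self, record: Record, conditions: dict) -> bool:
        return all(record.metadata.get(k) == v for k, v in conditions.items())

    def filter(self, **conditions) -> list[str]:
        return [cid for cid, r in self.records.items() if self._matches(r, conditions)]

    def delete_where(self, **conditions) -> int:
        doomed = self.filter(**conditions)
        for cid in doomed:
            del self.records[cid]
        return len(doomed)

store = TinyVectorStore()
store.upsert("c1", [0.1], {"file_id": "pdf123", "page": 1, "source": "Louis_pdf"})
store.upsert("c2", [0.2], {"file_id": "pdf123", "page": 3, "source": "Louis_pdf"})
store.upsert("c3", [0.3], {"file_id": "pdf456", "page": 1, "source": "web"})

deleted = store.delete_where(file_id="pdf123")  # drop every chunk of one file
remaining = store.filter(source="web")          # restrict to one source
```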

2. Deduplication Strategies (Avoid Duplicate Embeddings)

Approach 1: Deduplicate Paragraphs

Many PDFs contain the same boilerplate paragraphs, so each one only needs to be embedded once:

Disclaimer
Footer
Company introduction
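
Exact paragraph deduplication can be sketched by hashing a normalized form of each paragraph. The normalization rule below (collapse whitespace, lowercase) is an assumption chosen for illustration, and `unique_paragraphs` is a hypothetical helper.

```python
import hashlib
import re

def normalize(paragraph: str) -> str:
    # Collapse whitespace and lowercase so trivial variations hash the same.
    return re.sub(r"\s+", " ", paragraph).strip().lower()

def unique_paragraphs(paragraphs: list[str]) -> list[str]:
    """Keep the first occurrence of each paragraph, ignoring repeats."""
    seen: set[str] = set()
    kept: list[str] = []
    for p in paragraphs:
        key = hashlib.sha1(normalize(p).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept

paragraphs = [
    "Q1 revenue grew.",
    "This document is confidential.",
    "Q2 revenue fell.",
    "This  document is CONFIDENTIAL. ",  # same disclaimer, different spacing and case
]
kept = unique_paragraphs(paragraphs)
```

The repeated disclaimer is dropped, so it is embedded once no matter how many PDFs carry it.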

Approach 2: Semantic Deduplication

Treat two chunks as duplicates when the similarity of their embeddings (for example, cosine similarity) is greater than 0.95.
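
A minimal sketch of this threshold check, assuming cosine similarity as the metric and a greedy keep-first policy (both are assumptions, the source only gives the 0.95 threshold). `semantic_dedup` is a hypothetical helper and compares every pair, which only suits small batches.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_dedup(vectors: list[list[float]], threshold: float = 0.95) -> list[int]:
    """Return indices of vectors kept after greedy near-duplicate removal."""
    kept: list[int] = []
    for i, v in enumerate(vectors):
        # Keep a vector only if it is not too similar to any kept vector.
        if all(cosine(v, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return kept

vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept = semantic_dedup(vectors)  # the second vector nearly matches the first
```

For larger collections, a nearest-neighbor query against the vector database before insertion would likely replace the pairwise loop.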

Approach 3: Approximate Text Deduplication

Use datasketch, a Python library for MinHash and LSH, to detect near-duplicate text.

Suitable for: web pages, emails, FAQs
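
MinHash estimates the Jaccard similarity between sets of shingles. To make that quantity concrete without depending on datasketch, the sketch below computes the exact Jaccard similarity over word shingles using only the standard library. It is a stand-in for illustration, not datasketch's API; the shingle size `k = 3` is an arbitrary assumption.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of k-word windows; short texts become a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: str, b: str, k: int = 3) -> float:
    """Exact Jaccard similarity of the two texts' shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

faq_a = "how do I reset my account password today"
faq_b = "how do I reset my account password now"
faq_c = "shipping takes five business days"

near = jaccard(faq_a, faq_b)  # 5 shared shingles out of 7 distinct ones
far = jaccard(faq_a, faq_c)   # no shared shingles
```

Exact comparison is quadratic in the number of documents; MinHash with LSH exists to approximate the same similarity without comparing every pair.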

3. Embedding Cache (Highly Effective)

Build a text_hash → embedding mapping

chunk
 ↓
hash
 ↓
check cache

On a cache hit, reuse the stored embedding directly; on a miss, compute the embedding and add it to the cache.
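
The lookup can be sketched as a small wrapper around whatever embedding function is in use. `EmbeddingCache` and `fake_embed` are hypothetical names, the dictionary stands in for a persistent store, and SHA-256 is an arbitrary choice for the text hash.

```python
import hashlib
from typing import Callable

class EmbeddingCache:
    """text_hash -> embedding mapping with a compute-on-miss lookup."""

    def __init__(self, embed_fn: Callable[[str], list[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Cache miss: call the (expensive) embedding function once.
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]

def fake_embed(text: str) -> list[float]:
    # Placeholder for a real embedding model call.
    return [float(len(text)), float(text.count(" "))]

cache = EmbeddingCache(fake_embed)
a = cache.get("hello world")  # miss, computed
b = cache.get("hello world")  # hit, reused
c = cache.get("goodbye")      # miss, computed
```

Because the key depends only on the text, the cache also pays off across files that share identical chunks.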

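About the Author

I'm Louis, an engineer with long-standing hands-on experience in iOS and AI engineering, and an aspiring founder exploring product and business possibilities.

The articles here mostly cover things I have used in real projects, stumbled over, and verified repeatedly, rather than "quick content" written for traffic.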


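Contact and Collaboration

If you:

· are working on a project involving iOS apps, AI, or automation

· have questions about technology choices, architecture design, or shipping a product

· or would like to exchange ideas or discuss a collaboration

feel free to reach me at:

luochuanad@gmail.com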
