1. Introduction: Why Does RAG Offline Preprocessing Need Metadata Enhancement and Knowledge Graphs?

Hi, I’m a tech blogger. Today, let’s talk about a critical yet easily overlooked aspect of RAG systems that directly impacts final generation quality: metadata enhancement and knowledge graph integration in offline preprocessing.

Have you ever run into this problem? You’ve painstakingly built a RAG (Retrieval-Augmented Generation) system, but when a user asks a slightly complex question like “In what year did Apple Inc. launch its first smartphone?”, the system retrieves only fragments such as “Apple Inc. was founded by Steve Jobs” from a massive pile of documents, or even returns irrelevant content like “Apple is a fruit” because the vectors happen to be similar. Why? Because the retrieval capability of a traditional RAG system relies almost entirely on vector retrieval.

Vector retrieval is powerful, but it is essentially a “semantic similarity matching” tool with no support for structural relationships or multi-hop reasoning across documents. It can see that “Apple” and “Jobs” are semantically related, but it does not understand the “founding” relationship between “Apple Inc.” and “Steve Jobs,” nor can it go from “first smartphone” to “iPhone” and then extract the launch date. Pure vector retrieval is like an enormous library where you can search only by “book content summary” but cannot consult the “table of contents,” “author associations,” or “publication history,” so a wealth of high-value information is inevitably missed.

Offline metadata enhancement and knowledge graph integration are the keys to solving this pain point. Their core value is that, during the offline phase, a series of preprocessing steps adds a “structural skeleton” and a “relationship network” to the original documents. Metadata enhancement is like attaching more precise tags to each book (e.g., “author,” “publication year,” “abstract keywords”), while the knowledge graph builds a network of relationships that connects otherwise isolated documents and entities.

Through this article, you will systematically master the following capabilities:

  • A three-tier progressive strategy for metadata cleaning and structuring: from simple rule-based filling to high-cost LLM prediction, learn to balance cost and effectiveness in different scenarios.

  • The full pipeline of knowledge graph construction and integration: how to extract entities from raw documents, model relationships, and convert them into retrievable vector representations using graph algorithms (e.g., Node2Vec).

  • Practical approaches for multimodal, multi-source data alignment and incremental updates: how to handle heterogeneous data from web pages, PDFs, and databases, and design a sustainable update distribution mechanism.

  • Core techniques for GraphRAG knowledge graph construction: combining cutting-edge practices to understand why entity disambiguation, relationship confidence filtering, and embedding reasoning paths during construction are necessary.

2. Core Concepts: The Foundation of Metadata Enhancement and Knowledge Graph Integration

Before diving into code, we must establish two core concepts: metadata enhancement and knowledge graph integration. I’ve seen too many developers jump straight into coding without a clear understanding of these concepts, leading to repeated rework later. So please take some time to ensure you truly understand these fundamentals.

Why do we need metadata enhancement?

Imagine you have a collection of PDFs, each an independent document. Traditional RAG systems only process the “body” part, chunking it, vectorizing it, and storing it in a vector database. But besides the body, a document contains a wealth of metadata, such as:

  • Title: The core topic of the document.
  • Author: The authoritative source of knowledge.
  • Publication date: The timeliness of knowledge, crucial for news or technical updates.
  • Chapter/Table of contents: The internal structure of the document, key for fine-grained retrieval.
  • Document type: Is it a paper, a report, or a textbook? Different types dictate different answering styles.

Metadata enhancement refers to cleaning, completing, standardizing, and extracting entities from these raw external and internal metadata, transforming them from “noise” into high-quality structured information that can directly serve as retrieval conditions for RAG. For example:

  • Field completion: If the “author” field of a document is missing, we can predict and fill it using rules (e.g., filename patterns: paper_ZhangSan_2023.pdf) or an LLM.
  • Standardization: Unify “Apple Inc.” and “Apple公司” into “Apple Inc.”
  • Entity extraction: Automatically extract major people (e.g., “Ding Lei”), companies (e.g., “NetEase”), and locations (e.g., “Hangzhou”) from the document and use them as metadata tags.
  • Hierarchical indexing: Establish a “summary -> chapter -> paragraph” hierarchy for documents to support more precise context recall.
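To make the field-completion idea concrete, here is a minimal sketch of rule-based filling from a filename such as paper_ZhangSan_2023.pdf. The naming convention (`<doctype>_<Author>_<year>`), the helper names, and the camel-case splitting rule are assumptions made for illustration; real corpora will need their own patterns.

```python
import re

# Assumed naming convention: <doctype>_<AuthorName>_<year>.<ext>
FILENAME_PATTERN = re.compile(
    r"^(?P<doctype>[A-Za-z]+)_(?P<author>[A-Za-z]+)_(?P<year>\d{4})\.\w+$"
)

def split_camel_case(name: str) -> str:
    """Turn 'ZhangSan' into 'Zhang San' by inserting a space before inner capitals."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)

def complete_from_filename(metadata: dict, filename: str) -> dict:
    """Fill missing 'author' and 'year' from the filename; never overwrite existing values."""
    completed = dict(metadata)
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return completed  # filename does not follow the convention, leave the record alone
    if not completed.get("author"):
        completed["author"] = split_camel_case(match.group("author"))
    if not completed.get("year"):
        completed["year"] = int(match.group("year"))
    return completed

record = complete_from_filename({"title": "Some paper", "author": None}, "paper_ZhangSan_2023.pdf")
print(record)
```

Because existing values are never overwritten, the rule is safe to run over a whole corpus; only records with gaps are touched.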

Purpose of knowledge graph integration

Knowledge graphs go one step further. Instead of tagging individual documents, they build a relationship network across documents and entities. The classic example: Document A says “Apple Inc. was founded by Steve Jobs in Cupertino,” and Document B says “Jobs was co-founder of Pixar Animation Studios.” Traditional vector retrieval would consider both documents related to “Jobs,” but a knowledge graph reveals a deeper relationship: “Apple Inc.” and “Pixar” are connected through the person “Steve Jobs.”

When a user asks, “Which famous animation company did the founder of Apple also start?”, a RAG system with a knowledge graph can achieve multi-hop reasoning through the “Jobs” node: “Apple Inc. -> Steve Jobs -> Pixar Animation Studios.”

Knowledge graph integration specifically includes:

  • Entity Linking: Linking entities extracted from text (e.g., “Curry”) to specific nodes in the knowledge graph (e.g., “Stephen Curry (basketball player)”) rather than “Curry (place).” This is key for disambiguation.
  • Relation Modeling: Defining relationship types between entities, such as “founder,” “located in,” “founded in,” etc.
  • Graph Storage and Querying: Using dedicated graph databases (e.g., Neo4j) or in-memory graph libraries (e.g., NetworkX) to store triples (head entity, relation, tail entity) and provide efficient graph query interfaces.
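As a rough illustration of triple storage and multi-hop querying, the sketch below keeps the triples in plain Python structures and finds the Apple-to-Pixar path with a breadth-first search. The relation names (`founded_by`, `co_founded_by`, and so on) and the `find_path` helper are hypothetical; a graph database or NetworkX would replace this in practice.

```python
from collections import defaultdict, deque

# (head entity, relation, tail entity) triples taken from the two example documents
triples = [
    ("Apple Inc.", "founded_by", "Steve Jobs"),
    ("Apple Inc.", "located_in", "Cupertino"),
    ("Pixar Animation Studios", "co_founded_by", "Steve Jobs"),
]

# Index every triple in both directions so a path can pass through a shared entity
adjacency = defaultdict(list)
for head, relation, tail in triples:
    adjacency[head].append((relation, tail))
    adjacency[tail].append((f"inverse_{relation}", head))

def find_path(start, goal, max_hops=3):
    """Breadth-first search; returns the shortest list of (entity, relation, entity) hops or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue
        for relation, neighbor in adjacency[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None

path = find_path("Apple Inc.", "Pixar Animation Studios")
for hop in path:
    print(hop)
```

The two hops returned pass through the “Steve Jobs” node, which is exactly the connection that similarity search over isolated chunks cannot express.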

How do they collaborate?

Think of it this way: Metadata is the “identity tag” of a document (e.g., author, date, type), while knowledge graphs are the “social network” of knowledge. During the RAG offline preprocessing phase, we first enhance metadata to create a clean, information-rich document node for each document. Then, knowledge graph construction extracts entities and relationships across documents and connects document nodes into this global knowledge network.

Finally, when the system processes user queries online, we can achieve the integration of two retrieval paths:

  1. Metadata retrieval: First, filter by strong conditions such as time, author, and document type to significantly narrow the search scope, then perform vector retrieval.
  2. Knowledge graph retrieval: First, perform multi-hop reasoning through the knowledge graph to find deeply relevant document paths, then rank document nodes on those paths as candidate results.
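The first path (filter on metadata, then rank by vector similarity) can be sketched in a few lines. The chunk records, the hand-written 3-dimensional vectors, and the `filtered_search` helper are all made up for illustration; a real system would push the filter down into the vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy chunks with hand-written "embeddings" (hypothetical data)
chunks = [
    {"id": "c1", "doc_type": "news", "date": "2023-10-15", "vector": [0.9, 0.1, 0.0]},
    {"id": "c2", "doc_type": "news", "date": "2021-03-02", "vector": [0.8, 0.2, 0.1]},
    {"id": "c3", "doc_type": "textbook", "date": "2023-11-01", "vector": [0.95, 0.05, 0.0]},
    {"id": "c4", "doc_type": "news", "date": "2023-12-20", "vector": [0.1, 0.9, 0.2]},
]

def filtered_search(query_vector, doc_type, date_from, top_k=2):
    """Apply the hard metadata conditions first, then rank the survivors by similarity."""
    # ISO 8601 date strings compare correctly as plain strings
    candidates = [c for c in chunks if c["doc_type"] == doc_type and c["date"] >= date_from]
    ranked = sorted(candidates, key=lambda c: cosine(query_vector, c["vector"]), reverse=True)
    return [c["id"] for c in ranked[:top_k]]

hits = filtered_search([1.0, 0.0, 0.0], doc_type="news", date_from="2023-01-01")
print(hits)
```

Note that c3 is the closest vector overall but never competes, because it fails the document-type condition. That is the point of filtering first: hard constraints are enforced exactly rather than left to similarity scores.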

Combining both allows RAG to see not only the “most similar” but also the “most relevant.” This is the core improvement brought by GraphRAG knowledge graph construction.

3. Practical Metadata Cleaning and Enhancement: From Raw Documents to High-Quality Structured Fields

The theory is clear, so let’s go straight to code. In this section, we’ll walk through metadata cleaning and enhancement from scratch with a practical example, simulating a collection of documents that have a title, author, date, and body text.

import re
import json
import pandas as pd
from collections import Counter

3.1 Data Loading and Initial Problem Diagnosis

First, we load simulated data. In practice, this could come from a crawler, database export, or file listing.

# Simulate noisy metadata records with common problems: missing values, inconsistent formats, etc.
raw_data = [
    {"id": 1, "title": "Latest Developments at Apple Inc.", "author": "Zhang Wei", "date": "2023-10-15", "content": "Apple recently released iPhone 15 ..."},
    {"id": 2, "title": "About Steve Jobs Biography", "author": None, "date": "2023/10/16", "content": "The story of Steve Jobs ..."},
    {"id": 3, "title": "Big Data Beginner's Guide", "author": "Li Na", "date": "2022-05-20", "content": "Big data technologies include Hadoop, Spark ..."},
    {"id": 4, "title": "Apple Inc. and Steve Jobs", "author": "Lisa Johnson", "date": "2023-10-17", "content": "Jobs is the co-founder of Apple ..."},
    {"id": 5, "title": "Missing data ahead", "author": "Wang Si", "date": "2024-01-08", "content": "The article body is rich but the title is wrong ..."},
]

# Load into pandas DataFrame for easy processing
df = pd.DataFrame(raw_data)
print("Raw data overview:")
df.info()  # info() prints the overview itself and returns None, so it is not wrapped in print()

Initial problem diagnosis:

  • Missing values: record id=2 has missing author field.
  • Inconsistent format: date field, id=2 uses “2023/10/16”, while others use “2023-10-15”.
  • Noisy data: id=5 title is obviously wrong; it pollutes the title field.
  • Language inconsistency: id=4 author is English, others are Chinese.

3.2 Level 1: Rule-Based Quick Filling and Standardization

For the majority of common scenarios, rule-based processing is the most efficient and cost-effective option.

# 1. Handle missing values: fill with default values using rules
def fill_missing_author_by_rule(row):
    """Rule filling: if author is empty, attempt to extract pattern from filename or content; here simulate a simple rule"""
    if pd.isna(row['author']):
        # Simulate: for id=2, we infer from content keywords that the author is likely "Anonymous"
        # More general rule: if content contains specific keywords, e.g., "unknown", "AI-generated"
        return "Anonymous"  # fill with default
    return row['author']

df['author'] = df.apply(fill_missing_author_by_rule, axis=1)
print("After filling missing authors:\n", df[['id', 'author']])
# 2. Handle inconsistent formats: standardize date format
def standardize_date(date_str):
    """Unify different date formats to ISO 8601 (YYYY-MM-DD)"""
    if pd.isna(date_str):
        return None
    # Try matching 'YYYY/MM/DD' format
    match = re.match(r'(\d{4})/(\d{1,2})/(\d{1,2})', str(date_str))
    if match:
        y, m, d = match.groups()
        return f"{y}-{m.zfill(2)}-{d.zfill(2)}"  # zfill pads with zero
    # Try matching 'YYYY-MM-DD' format
    match = re.match(r'(\d{4})-(\d{1,2})-(\d{1,2})', str(date_str))
    if match:
        y, m, d = match.groups()
        return f"{y}-{m.zfill(2)}-{d.zfill(2)}"
    # More complex cases: e.g., '2023年10月16日', can add regex: r'(\d{4})年(\d{1,2})月(\d{1,2})日'
    # Return original string if format not recognized; log for further processing
    return str(date_str)  # or return None

df['date'] = df['date'].apply(standardize_date)
print("After date standardization:\n", df[['id', 'date']])
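One limitation of pure regex matching is that it accepts impossible dates such as 2023-02-30. A possible alternative, sketched below under the assumption that only the three formats mentioned above occur, is to try a list of `datetime.strptime` formats in order. The `DATE_FORMATS` list and the `parse_date` name are illustrative, not part of the pipeline above.

```python
from datetime import datetime
from typing import Optional

# Candidate formats, tried in order; extend this list as new sources appear
DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%Y年%m月%d日"]

def parse_date(value) -> Optional[str]:
    """Return an ISO 8601 date string, or None if no format fits or the date is impossible."""
    if value is None:
        return None
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            # strptime validates the calendar as well as the layout
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

for sample in ["2023/10/16", "2023年10月16日", "2023-1-5", "2023-02-30", "next Tuesday"]:
    print(sample, "->", parse_date(sample))
```

Returning None for anything unparseable (instead of echoing the input back) makes bad values easy to count and route to a later, more expensive tier.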
# 3. Remove noise from title/metadata (e.g., extra whitespace, meaningless content)
def clean_text_field(text):
    if pd.isna(text):
        return ""
    # Remove leading/trailing whitespace
    text = text.strip()
    # Heuristic: a title shorter than 5 characters that contains no Chinese or Latin letters
    # (e.g., only symbols or digits) is treated as invalid
    if len(text) < 5 and not re.search(r'[\u4e00-\u9fff]', text) and not re.search(r'[a-zA-Z]', text):
        return None  # mark as invalid
    return text

df['title'] = df['title'].apply(clean_text_field)
# Note: the id=5 title is well-formed text, so this rule does not catch it; a wrong-but-plausible title needs a semantic check
print("After title cleaning (the id=5 title passes this rule unchanged):\n", df[['id', 'title']])
# 4. Language unification: tag author names written in Latin script
# In real projects, use a dictionary mapping, e.g., "Lisa Johnson" -> "丽莎·约翰逊"
# Here we keep the original name and only append a language indicator
def normalize_author_name(author):
    if author is None or pd.isna(author):
        return author
    # Latin letters present and no Chinese characters: treat as an English-script name.
    # Caveat: romanized names such as "Zhang Wei" and the "Anonymous" placeholder also match this check.
    if re.search(r'[a-zA-Z]', author) and not re.search(r'[\u4e00-\u9fff]', author):
        return f"{author} (EN)"  # add language suffix for differentiation
    return author

df['author'] = df['author'].apply(normalize_author_name)
print("After author name normalization:\n", df[['id', 'author']])

3.3 Level 2: Filling with Mean/Mode

Second step: handle a few missing values that rules cannot cover. For example, a “category” field. We first infer a statistical pattern from well-structured documents.

# Suppose we have a 'category' field also missing; we fill it with the mode
df['category'] = [None, 'Technology', 'Business', 'Technology', 'Technology'] # only one document missing
# Find the most frequent category
most_common_category = df['category'].mode()[0] if not df['category'].mode().empty else "Other"
print(f"The most common category is: {most_common_category}")

# Fill missing category
df['category'] = df['category'].fillna(most_common_category)
print("After category filling:\n", df[['id', 'category']])

3.4 Level 3: Intelligent Filling with Pretrained Models (Advanced)

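For missing core fields (e.g., product description, summary), consider using pretrained language models (e.g., BERT or GPT) for prediction. The demo below uses a BERT fill-mask pipeline, which predicts only a single masked token, so it yields a keyword-style placeholder rather than a true summary; a generative model is the better choice for real summaries: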

# Note: requires the transformers library (pip install transformers) plus a backend such as PyTorch
# This is just demonstration logic; actual runtime needs model loading and token length handling

from transformers import pipeline

# Initialize a fill-mask pipeline
unmasker = pipeline('fill-mask', model='bert-base-uncased')  # for Chinese text, use a Chinese model, e.g., bert-base-chinese

def fill_summary_with_bert(row):
    summary = row.get('summary')
    if not isinstance(summary, str) or len(summary.strip()) == 0:
        # If summary is missing, build the input from the title and the first 50 characters of content.
        # The tokenizer adds [CLS]/[SEP] itself, so only the [MASK] token is written explicitly.
        input_text = f"{row['title']}. {row['content'][:50]}... [MASK]."
        try:
            predictions = unmasker(input_text)
            # Take the most probable predicted token as part of the summary
            predicted_word = predictions[0]['token_str']
            # This is just a demo: one token is not a summary.
            # A more advanced approach uses GPT/LLM to directly generate the summary
            return f"Summary based on title and beginning: {predicted_word}"
        except Exception as e:
            print(f"BERT prediction failed for ID {row['id']}: {e}")
            return "Summary pending"
    return summary

# Execute filling (use with caution; very time-consuming)
# df['summary'] = df.apply(fill_summary_with_bert, axis=1)

print("\nFinal cleaned metadata:")
print(df[['id', 'title', 'author', 'date', 'category']])

Best Practice Tips:

  • Priority: Rules → Statistics → Deletion → LLM prediction. Always start with the lowest-cost approach.
  • Logging: Record which records were filled by rules, which by model prediction. Keep original values for backtracking and evaluation.
  • Validation: Metadata quality is crucial. Before moving to the next phase, spend time using scripts to verify the reasonableness of filled results, e.g., whether author names are complete, dates are valid.
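The logging and validation tips can be sketched as two small helpers: one records the original value and the method used whenever a field is filled, and one returns a list of problems for a record. The `_fill_log` field name and both function names are assumptions for illustration.

```python
from datetime import date

def fill_with_provenance(record, field, value, method):
    """Return a copy of the record with the field filled and the original value logged."""
    updated = dict(record)
    log = list(record.get("_fill_log", []))
    log.append({"field": field, "original": record.get(field), "method": method})
    updated[field] = value
    updated["_fill_log"] = log
    return updated

def validate(record):
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    author = record.get("author")
    if not isinstance(author, str) or not author.strip():
        problems.append("author is empty")
    try:
        date.fromisoformat(str(record.get("date")))
    except ValueError:
        problems.append(f"date is not a valid ISO 8601 date: {record.get('date')!r}")
    return problems

raw = {"id": 2, "author": None, "date": "2023/10/16"}
step1 = fill_with_provenance(raw, "author", "Anonymous", method="rule")
print(validate(step1))   # the date still fails validation at this point
step2 = fill_with_provenance(step1, "date", "2023-10-16", method="rule")
print(validate(step2))
print(step2["_fill_log"])
```

Keeping the original value next to the method makes it possible to roll back a bad rule later, or to measure how often model predictions disagree with values that turn up afterwards.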

Now we have a clean, structured metadata set. Next, we will build a knowledge graph based on this foundation.

4. Knowledge Graph Construction: From Entity Extraction to Relationship Modeling

With high-quality metadata and document content, we proceed to knowledge graph construction. In this section, we’ll manually extract entities and relationships and build a simple graph using NetworkX.

4.1 Entity Extraction: Identifying and Deduplicating from Text

Entity extraction is the foundation. We can use spaCy, Stanza, or LLMs (e.g., GPT-4 API). To reduce cost and demonstrate local execution, we use a hybrid of rule-based and spaCy approaches.

# Install: pip install spacy networkx transformers
# Download a spaCy English model: python -m spacy download en_core_web_sm
# (the sample sentences below are English; for Chinese documents use zh_core_web_sm / zh_core_web_trf instead)

import spacy
import networkx as nx
from itertools import combinations

# Prefer the transformer model for better NER; fall back to the lightweight model if it is not installed
try:
    nlp = spacy.load('en_core_web_trf')
except OSError:
    print("Transformer model not found, using lightweight model instead")
    nlp = spacy.load('en_core_web_sm')

# Our document content (with enhanced metadata)
docs_text = [
    "Apple Inc. was founded by Steve Jobs in Cupertino",
    "Jobs was co-founder of Pixar Animation Studios",
    "Jobs delivered a speech at Stanford University",
    "Cupertino is the headquarters of Apple Inc.",
]

# Extract entities (person, organization, location)
def extract_entities_spacy(text):
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        if ent.label_ in ['PERSON', 'ORG', 'GPE', 'LOC']:  # filter entity types as needed
            # Normalize entity name: trim surrounding whitespace only
            # (inner spaces are part of English names such as "Steve Jobs")
            entity_name = ent.text.strip()
            entities.append((entity_name, ent.label_))
    return list(set(entities))  # deduplicate identical entities within the same sentence

print("Entity extraction example:\n", extract_entities_spacy(docs_text[0]))

Entity Disambiguation: The same entity may have different names, e.g., “Apple Inc.” and “Apple”, or homonyms (e.g., “Apple” fruit vs. brand). We handle this by building an alias mapping table:

# Alias mapping: standardize different expressions to a canonical name
alias_map = {
    "苹果公司": "Apple Inc.",
    "Apple": "Apple Inc.",
    "Apple公司": "Apple Inc.",
    "史蒂夫·乔布斯": "Steve Jobs",
    "乔布斯": "Steve Jobs",
    "Jobs": "Steve Jobs",  # short form used in the sample sentences
    "库比蒂诺": "Cupertino",
    "皮克斯动画工作室": "Pixar",
    "Pixar Animation Studios": "Pixar",
    "斯坦福大学": "Stanford University",
}

def disambiguate_entity(entity_text):
    """Map entity to canonical name"""
    # Exact match
    if entity_text in alias_map:
        return alias_map[entity_text]
    # Fuzzy match (substring in either direction); skip empty strings, which would match every key
    if entity_text:
        for key, value in alias_map.items():
            if key in entity_text or entity_text in key:
                return value
    return entity_text  # no mapping, keep original
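An alias table merges different names for one entity, but it cannot separate homonyms such as “Apple” the company and “apple” the fruit. One simple way to handle that case, sketched here as an assumption rather than the approach used above, is to score each candidate sense by how many of its cue words appear in the surrounding sentence. The `SENSES` inventory and its cue words are hand-written for illustration.

```python
import re

# Hypothetical sense inventory: each candidate sense lists context words that support it
SENSES = {
    "Apple": {
        "Apple Inc.": {"iphone", "company", "founded", "jobs", "cupertino", "released"},
        "apple (fruit)": {"fruit", "eat", "tree", "juice", "orchard", "sweet"},
    },
}

def disambiguate_by_context(mention, sentence, min_overlap=1):
    """Pick the sense whose cue words overlap most with the sentence; keep the mention if unsure."""
    senses = SENSES.get(mention)
    if senses is None:
        return mention
    tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    best_sense, best_score = None, 0
    for sense, cues in senses.items():
        score = len(tokens & cues)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense if best_score >= min_overlap else mention

print(disambiguate_by_context("Apple", "Apple released the iPhone 15 in Cupertino"))
print(disambiguate_by_context("Apple", "She picked a sweet apple from the tree"))
print(disambiguate_by_context("Apple", "Apple is mentioned here without any context"))
```

When no cue word matches, the mention is returned unchanged instead of being forced into a sense, so uncertain cases stay visible for a later, stronger method such as an LLM-based entity linker.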

4.2 Relationship Modeling and Graph Construction

With entities, we need to define relationships between them. Relationships can be predefined (e.g., “founder”, “located in”) or dynamically generated by an LLM. Here we use simple co-occurrence and explicit pattern matching.

# Define simple relationship patterns; each pattern names its subject (subj) and object (obj) groups
relation_patterns = [
    # "Apple Inc. was founded by Steve Jobs in Cupertino" -> (Apple Inc., founder, Steve Jobs)
    (r"(?P<subj>.+?) was founded by (?P<obj>.+?)(?: in |$)", "founder"),
    # "company located in London" -> (company, location, London)
    (r"(?P<subj>.+?) (?:is |was )?located in (?P<obj>.+)", "location"),
    # "Jobs delivered a speech at Stanford University" -> (Jobs, event, Stanford University)
    (r"(?P<subj>.+?) delivered .+? at (?P<obj>.+)", "event"),
]

def extract_relations(text):
    """Extract structured triples from a sentence"""
    relations = []
    for pattern, rel_type in relation_patterns:
        match = re.search(pattern, text)
        if match:
            # Disambiguate subject and object
            subject = disambiguate_entity(match.group("subj").strip())
            object_entity = disambiguate_entity(match.group("obj").strip())
            relations.append((subject, rel_type, object_entity))
    return relations

print("Relationship extraction example (from sentence 1):\n", extract_relations(docs_text[0]))

4.3 Build a Static Knowledge Graph with NetworkX

# Initialize directed graph
G = nx.DiGraph()

for text in docs_text:
    # Extract entities
    entities = extract_entities_spacy(text)
    for entity_text, entity_type in entities:
        canonical_name = disambiguate_entity(entity_text)
        G.add_node(canonical_name, label_type=entity_type)  # add node with attribute

    # Extract relationships
    relations = extract_relations(text)
    for subj, rel, obj in relations:
        G.add_edge(subj, obj, relation=rel)  # add edge

# Additionally, traverse the full text: if two entities co-occur in the same sentence, add a "related" edge (co-occurrence)
# To avoid noise, limit to strong co-occurrence within the same sentence
for text in docs_text:
    ents = extract_entities_spacy(text)
    # For each pair of entities in the same sentence, add co-occurrence relationship
    if len(ents) >= 2:
        # sorted() makes the edge direction reproducible (set order is not)
        for (ent1, type1), (ent2, type2) in combinations(sorted(ents), 2):
            canon1 = disambiguate_entity(ent1)
            canon2 = disambiguate_entity(ent2)
            # Skip self-loops and pairs already linked in either direction
            if canon1 != canon2 and not G.has_edge(canon1, canon2) and not G.has_edge(canon2, canon1):
                G.add_edge(canon1, canon2, relation="related")  # generic relationship

print("Graph nodes:", G.nodes())
print("Graph edges:\n", list(G.edges(data=True)))

4.4 Visualization and Verification

Simple visualization can be done with matplotlib or pyvis; here we only convert the graph to a JSON-serializable structure for inspection and later use.

# Optional: export graph as JSON for later use
graph_json = nx.node_link_data(G)  # convert to a JSON-serializable dict
# with open('knowledge_graph.json', 'w', encoding='utf-8') as f:
#     json.dump(graph_json, f, ensure_ascii=False, indent=2)

Key Optimization Points:

  1. Relationship Confidence Filtering: For co-occurrence relationships, set a threshold, e.g., only consider strong associations if they appear multiple times in the same document; otherwise delete. This reduces noise significantly.
  2. Entity Level Merging: If two nodes frequently co-occur, should they be merged? For example, “Apple Inc.” and “Apple”.
  3. Knowledge Graph Scale: When the graph contains millions of nodes, you cannot use NetworkX in memory; you need graph databases like Neo4j or ArangoDB.
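The first optimization point, confidence filtering of co-occurrence edges, can be sketched with a simple counter. The per-sentence entity lists below are hand-written stand-ins for extractor output, and `min_count=2` is an arbitrary threshold chosen for the demo.

```python
from collections import Counter
from itertools import combinations

# Entities already extracted and disambiguated per sentence (hypothetical data)
sentence_entities = [
    ["Apple Inc.", "Steve Jobs", "Cupertino"],
    ["Steve Jobs", "Pixar"],
    ["Steve Jobs", "Stanford University"],
    ["Cupertino", "Apple Inc."],
    ["Apple Inc.", "Steve Jobs"],
]

def count_cooccurrences(groups):
    """Count how many sentences each unordered entity pair shares."""
    counts = Counter()
    for entities in groups:
        # sorted() gives every pair one canonical order, so (a, b) and (b, a) are counted together
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts

def filter_edges(counts, min_count=2):
    """Keep only the pairs seen at least min_count times."""
    return {pair: n for pair, n in counts.items() if n >= min_count}

counts = count_cooccurrences(sentence_entities)
strong_edges = filter_edges(counts, min_count=2)
print(strong_edges)
```

In this toy data the threshold also drops the single Steve Jobs and Pixar co-occurrence, which is a true relationship. That is the trade-off to keep in mind: frequency filtering removes noise, but rare facts need a separate signal such as an explicit pattern match.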

5. Graph Embedding and Vector Fusion: Transforming Knowledge Graphs into Retrievable Representations

We have built a knowledge graph, but how do we make the RAG system use it? The key is graph embedding: converting each node into a low-dimensional vector such that nodes close in the graph are also close in vector space. Then, we can retrieve “most relevant graph nodes” and their associated documents by vector similarity, similar to querying document vectors.

5.1 Generate Node Vectors with Node2Vec

Node2Vec is a classic graph embedding method. First install the library: pip install node2vec.

from node2vec import Node2Vec
import numpy as np

# Assume we already have NetworkX graph G
# Ensure the graph is connected, or at least take the largest connected component.
# nx.is_connected() is not defined for directed graphs, so check an undirected view of G.
G_undirected = G.to_undirected()
if not nx.is_connected(G_undirected):
    # Take the largest connected component
    largest_cc = max(nx.connected_components(G_undirected), key=len)
    G_sub = G.subgraph(largest_cc).copy()
else:
    G_sub = G.copy()

# Parameters:
# dimensions: embedding dimension, 128 or 256 common
# walk_length: length of random walk, typically 20-80
# num_walks: number of random walks per node, typically 10-20 (200 is affordable here only because the demo graph is tiny)
# workers: number of parallel workers
node2vec = Node2Vec(G_sub, dimensions=128, walk_length=30, num_walks=200, workers=4)

# Train the underlying gensim Word2Vec model on the generated walks
model = node2vec.fit(window=10, min_count=1, batch_words=4)  # window: context window size

# Get embeddings for all nodes
node_embeddings = model.wv  # gensim KeyedVectors, indexed by node name
# For example, get the vector for node "Steve Jobs"
if "Steve Jobs" in node_embeddings:
    steve_job_vec = node_embeddings["Steve Jobs"]
    print(f"Vector dimension for node 'Steve Jobs': {steve_job_vec.shape}")
    print(f"First 10 values of vector: {steve_job_vec[:10]}")

Note: Node2Vec can preserve two kinds of structure: “structural equivalence” (e.g., two nodes that both play a “founder” role get similar embeddings even if they are not in the same community) and “homophily” (closely connected nodes get similar embeddings). The balance between the two is controlled by its return parameter p and in-out parameter q. This is crucial for knowledge graph retrieval.
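Node2Vec steers between these two behaviors with a return parameter p and an in-out parameter q that bias each step of the random walk. The sketch below implements a single biased walk on a small unweighted toy graph to show the mechanism; the adjacency dict and the `biased_walk` function are illustrative and are not the node2vec library's internals.

```python
import random

# Small undirected toy graph as an adjacency dict (hypothetical data)
graph = {
    "Apple Inc.": ["Steve Jobs", "Cupertino"],
    "Steve Jobs": ["Apple Inc.", "Pixar", "Stanford University"],
    "Cupertino": ["Apple Inc."],
    "Pixar": ["Steve Jobs"],
    "Stanford University": ["Steve Jobs"],
}

def biased_walk(graph, start, walk_length, p=1.0, q=1.0, rng=None):
    """One second-order random walk in the style of Node2Vec (unweighted graph)."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbors = graph[current]
        if not neighbors:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)   # return to the node we just came from
            elif candidate in graph[previous]:
                weights.append(1.0)       # stay at distance 1 from the previous node
            else:
                weights.append(1.0 / q)   # move further away from the previous node
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

walk = biased_walk(graph, "Apple Inc.", walk_length=10, p=1.0, q=0.5, rng=random.Random(0))
print(walk)
```

A large q keeps walks close to the starting neighborhood (breadth-first-like), while a small q pushes them outward (depth-first-like). Walks like this one are then treated as sentences and fed to Word2Vec, which is what the `fit()` call above does.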

5.2 Vector Fusion Strategy: RRF Hybrid Retrieval

Our system has two types of vectors: document vectors (semantic vectors based on document content, e.g., text-embedding-3-small) and graph node vectors. Ultimately, we need to fuse results from different retrievals. RRF (Reciprocal Rank Fusion) is a simple and effective method.

def reciprocal_rank_fusion(results_list, k=60):
    """
    RRF fusion of multiple ranking lists
    results_list: list of multiple ranking lists, each list is [ { 'id': document ID, 'score': ...}, ... ]
    k: smoothing parameter, typically 60
    """
    fused_scores = {}
    for results in results_list:
        for rank, item in enumerate(results, 1):
            doc_id = item['id']
            # RRF score = 1 / (k + rank)
            fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1.0 / (k + rank)

    # Sort by fused score
    sorted_docs = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_docs  # return list of (doc_id, fused_score)

# Simulate results
# Result 1: from vector retrieval (based on document semantics)
vec_results = [{'id': 1, 'score': 0.9}, {'id': 3, 'score': 0.7}, {'id': 2, 'score': 0.5}]
# Result 2: from graph embedding retrieval (find documents similar to a certain graph node)
graph_results = [{'id': 2, 'score': 0.8}, {'id': 1, 'score': 0.6}, {'id': 4, 'score': 0.3}]

final_ranking = reciprocal_rank_fusion([vec_results, graph_results])
print("Final ranking after RRF fusion:")
for doc_id, score in final_ranking:
    print(f"Document ID: {doc_id}, Fused Score: {score:.4f}")
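For reference, the fused score computed by the function above can be written as a formula and checked by hand against the two simulated lists (k = 60, ranks starting at 1). The notation R for the set of rankings is introduced here only for this worked example.

```latex
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}

% Worked values for the simulated lists, with k = 60:
\mathrm{RRF}(\text{doc 1}) = \tfrac{1}{61} + \tfrac{1}{62} \approx 0.0325
\mathrm{RRF}(\text{doc 2}) = \tfrac{1}{63} + \tfrac{1}{61} \approx 0.0323
\mathrm{RRF}(\text{doc 3}) = \tfrac{1}{62} \approx 0.0161
\mathrm{RRF}(\text{doc 4}) = \tfrac{1}{63} \approx 0.0159
```

Documents 1 and 2 rise to the top because both retrievers returned them. The raw similarity scores never enter the formula, only the ranks, which is why RRF can combine retrievers whose scores are on different scales.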

Key Points:

  • Granularity of Graph Embeddings: Graph nodes correspond to entities, while our final retrieval target is documents. Therefore, we need to associate document IDs as attributes with the corresponding nodes during knowledge graph construction (or establish edges between document nodes and entity nodes). Knowledge graph retrieval is triggered only when entities extracted from the user query match graph nodes.
  • Query Processing: In the online phase, the user’s natural language query must first undergo entity extraction, then query the knowledge graph (via graph embedding similarity), and finally obtain the set of relevant documents.
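The two points above can be sketched end to end: an offline index maps each entity node to the documents that mention it, and at query time the entities found in the query decide whether the graph path is triggered at all. Everything here is hypothetical demo data, including the `entity_neighbors` lists (which stand in for neighbors found by graph-embedding similarity) and the keyword-based `extract_query_entities`.

```python
import re

# Built offline: canonical entity -> IDs of documents that mention it (hypothetical data)
entity_to_docs = {
    "Apple Inc.": {1, 4},
    "Steve Jobs": {2, 4},
    "Pixar": {2},
}
# Stand-in for nearest neighbors obtained from graph-embedding similarity
entity_neighbors = {
    "Apple Inc.": ["Steve Jobs"],
    "Steve Jobs": ["Apple Inc.", "Pixar"],
    "Pixar": ["Steve Jobs"],
}
# Very small keyword lookup standing in for real query-side entity extraction
query_aliases = {"apple": "Apple Inc.", "jobs": "Steve Jobs", "pixar": "Pixar"}

def extract_query_entities(query):
    """Return canonical entities mentioned in the query, in order of appearance."""
    found = []
    for token in re.findall(r"[a-z]+", query.lower()):
        entity = query_aliases.get(token)
        if entity and entity not in found:
            found.append(entity)
    return found

def graph_candidates(query):
    """Rank document IDs: 2 points per directly matched entity, 1 point per neighboring entity."""
    scores = {}
    for entity in extract_query_entities(query):
        for doc_id in entity_to_docs.get(entity, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 2
        for neighbor in entity_neighbors.get(entity, ()):
            for doc_id in entity_to_docs.get(neighbor, ()):
                scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores, key=lambda d: (-scores[d], d))

print(graph_candidates("Which animation company did the founder of Apple start?"))
print(graph_candidates("What is big data?"))  # no entity matched, graph path not triggered
```

The ranked IDs from the first call can be wrapped as `{'id': ...}` dicts and passed to `reciprocal_rank_fusion` alongside the vector results; an empty list simply means only the vector path contributes.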

6. Advanced Techniques: Multi-source Heterogeneous Data Alignment and Incremental Updates

Real-world data is always heterogeneous: you may have web pages crawled from Zhihu, PDFs downloaded from arXiv, or Excel databases exported from a company. Entities and relationships between them may naturally conflict, e.g., the same name “Li Na” may refer to different people (one tennis player, one singer). This is one of the trickiest pitfalls in RAG offline preprocessing.

6.1 Multi-source Data Conflict Resolution: Metadata Disambiguation

Practical Principle: When you suspect that two entities from different sources might be the same, use metadata cross-validation.

For example, assume two sources mention a “Ding Lei”:

  • Source A (news): mentions Ding Lei is CEO of NetEase, born in Ningbo.
  • Source B (corporate report): mentions Ding Lei holds shares in NetEase, born in Zhejiang Ningbo.

We can determine if they are the same with the following method:

# Simulate entity info from two sources
entity_A = {"name": "Ding Lei", "company": "NetEase", "birth_place": "Ningbo", "source": "News"}
entity_B = {"name": "Ding Lei", "company": "NetEase", "birth_place": "Zhejiang Ningbo", "source": "Corporate Report"}

def merge_entities_if_match(ent1, ent2, threshold=0.8):
    """Return True if enough key metadata fields agree to treat the two records as one entity"""
    matching_fields = 0
    total_fields = 0
    for field in ["name", "company", "birth_place"]:
        # Only compare fields that both records actually have
        if field not in ent1 or field not in ent2:
            continue
        total_fields += 1
        # Compare fields: ignore spaces, use substring matching
        a = ent1[field].strip().replace(" ", "")
        b = ent2[field].strip().replace(" ", "")
        if a == b or a in b or b in a:
            matching_fields += 1
    return total_fields > 0 and matching_fields / total_fields >= threshold

if merge_entities_if_match(entity_A, entity_B):
    print("They are the same entity! Can merge.")
More Advanced Method: Use a pretrained language model to compute semantic similarity between entities. Concatenate the contexts of the two entities and judge whether they are similar.

6.2 Incremental Update Mechanism: Partial Reconstruction Based on Timestamps

Knowledge graphs are not static. When new documents arrive or old documents are modified, the graph must be updated reasonably. Full reconstruction is extremely costly and not suitable for real-time scenarios.

Recommended Strategy:

  1. Change Log: Whenever metadata is updated or a new document is added, record a “change event” containing document ID, timestamp, and change type.
  2. Incremental Extraction: Perform entity and relationship extraction only for changed documents, not the entire dataset. However, you need to handle impact propagation: if an old document is deleted, all “co-occurrence relationships” extracted from that document must also be cleaned.
  3. **Graph Database’s MPP
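Steps 1 and 2 above (the change log and incremental extraction with cleanup of stale edges) can be sketched with a small in-memory structure that remembers which documents support each edge. The `IncrementalGraph` class, its method names, and the timestamps are assumptions for illustration; a production system would keep the same bookkeeping in a graph database.

```python
class IncrementalGraph:
    """Tracks which documents support each edge so that single documents can be updated or removed."""

    def __init__(self):
        self.edge_sources = {}   # (head, relation, tail) -> set of supporting document IDs
        self.documents = set()   # IDs of documents currently in the graph
        self.change_log = []     # change events: document ID, timestamp, change type

    def _drop_edges_of(self, doc_id):
        for edge in list(self.edge_sources):
            self.edge_sources[edge].discard(doc_id)
            if not self.edge_sources[edge]:
                del self.edge_sources[edge]  # no remaining document supports this edge

    def upsert_document(self, doc_id, triples, timestamp):
        """Re-extract one document: drop its old edges, then add the new ones."""
        change_type = "update" if doc_id in self.documents else "add"
        self._drop_edges_of(doc_id)
        for triple in triples:
            self.edge_sources.setdefault(triple, set()).add(doc_id)
        self.documents.add(doc_id)
        self.change_log.append({"doc_id": doc_id, "timestamp": timestamp, "type": change_type})

    def delete_document(self, doc_id, timestamp):
        self._drop_edges_of(doc_id)
        self.documents.discard(doc_id)
        self.change_log.append({"doc_id": doc_id, "timestamp": timestamp, "type": "delete"})

kg = IncrementalGraph()
kg.upsert_document(1, [("Apple Inc.", "related", "Steve Jobs")], "2024-01-01T00:00:00")
kg.upsert_document(4, [("Apple Inc.", "related", "Steve Jobs"),
                       ("Apple Inc.", "related", "Cupertino")], "2024-01-02T00:00:00")
kg.delete_document(4, "2024-01-03T00:00:00")
print(kg.edge_sources)
print([event["type"] for event in kg.change_log])
```

After document 4 is deleted, the Cupertino edge disappears because nothing else supports it, while the Steve Jobs edge survives thanks to document 1. Only the changed document is reprocessed; the rest of the graph is untouched.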

Summary

Through this article, I hope you have gained a deeper understanding of offline metadata enhancement for RAG. I recommend practicing with real projects. If you have any questions, feel free to discuss!