
Building a Knowledge Graph from Documents with Docling

Published: 2025-07-28 08:51:20

Author: 知识图谱工坊


A first experience of building a knowledge graph from documents with Docling!

Introduction and Motivation

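A knowledge graph is a structured way of representing information, made up of nodes and edges. Nodes represent entities (such as people, places, or concepts), and edges represent the relationships between those entities. Organizing information this way makes data exploration more intuitive, makes complex queries easier to answer, and supports advanced analytics. Knowledge graphs are widely used in search engines, recommender systems, and data integration to provide deeper insight and support decision-making.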

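Using Docling for document extraction can considerably simplify building a knowledge graph. Docling parses many document formats, including complex PDFs, and provides a structured representation of the document's content, which makes key entities and relationships easier to identify. Compared with raw text that needs heavy preprocessing, Docling's more organized output makes it easier to extract the specific information needed to populate a knowledge graph, such as the entities in the sample document ("Paris", "Eiffel Tower") and their relationships ("located in", "designed by"). This structured approach reduces the effort involved in information extraction and can improve the accuracy of the resulting knowledge graph.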
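As a rough sketch of the parsing step described above, the snippet below wraps Docling's `DocumentConverter` in a small helper. The helper name `convert_pdf_to_text` and its path argument are hypothetical and are not taken from the article's script; the sketch assumes `docling` is installed and relies only on `DocumentConverter.convert()` and the document's `export_to_markdown()` method.

```python
from pathlib import Path


def convert_pdf_to_text(pdf_path):
    """Convert a PDF with Docling and return its content as Markdown text.

    Hypothetical helper for illustration; not part of the article's script.
    """
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        # Fail early, before the comparatively slow Docling import.
        raise FileNotFoundError(f"No such PDF: {pdf_path}")

    # Imported lazily so this module can be loaded without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()        # default pipeline options
    result = converter.convert(pdf_path)   # returns a conversion result
    return result.document.export_to_markdown()
```

Calling the helper with the path to any local PDF should return a single string that can then be passed to an NLP pipeline such as spaCy.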

Code Implementation

With the idea introduced, I decided to write sample code that builds a knowledge graph from a PDF.

Here is the code and its result.

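# preparation
python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip
pip install 'docling[all]'
pip install spacy
pip install networkx
pip install matplotlib
pip install nlp
# the script below calls spacy.load("en_core_web_sm"), so the model must be downloaded
python3 -m spacy download en_core_web_sm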

import json
import logging
import time
from pathlib import Path

import spacy
import networkx as nx
import matplotlib.pyplot as plt

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

# Load a spaCy language model
nlp = spacy.load("en_core_web_sm")


def extract_text_from_docling_document(docling_document):
    """Extracts text content from a Docling Document object."""
    text = docling_document.export_to_text()
    return text


def build_knowledge_graph(text):
    doc = nlp(text)
    graph = nx.Graph()

    # Extract entities
    for ent in doc.ents:
        graph.add_node(ent.text, label=ent.label_)

    # Simple relationship extraction (can be improved)
    for sent in doc.sents:
        for i, token in enumerate(sent):
            if token.dep_ in ["nsubj", "dobj"]:
                subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                object_ = [w for w in token.head.rights if w.dep_ == "dobj"]
                if subject and object_:
                    graph.add_edge(subject[0].text, object_[0].text, relation=token.head.lemma_)
                elif subject and token.head.lemma_ in ["be", "have"]:
                    right_children = [child for child in token.head.rights if child.dep_ in ["attr", "acomp"]]
                    if right_children:
                        graph.add_edge(subject[0].text, right_children[0].text, relation=token.head.lemma_)
    return graph


def visualize_knowledge_graph(graph):
    """Visualizes the knowledge graph."""
    pos = nx.spring_layout(graph)
    nx.draw(graph, pos, with_labels=True, node_size=3000, node_color="skyblue", font_size=10, font_weight="bold")
    edge_labels = nx.get_edge_attributes(graph, 'relation')
    nx.draw_networkx_edge_labels(graph, pos, edge_labels=edge_labels)
    plt.title("Knowledge Graph from Document")
    plt.show()


def main():
    logging.basicConfig(level=logging.INFO)
    _log = logging.getLogger(__name__)  # Initialize the logger here
    # nlp = spacy.load("en_core_web_sm")  # Load spacy  # Removed from here
    # input_doc_path = Path("./input/2503.11576v1.pdf")
    input_doc_path = Path("./inp

Now let's build a document that describes relationships between entities. I used a large language model (LLM, Granite) to generate the text below!

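The city of Paris, located in France, is renowned for its iconic Eiffel Tower. It is a popular tourist destination. The tower was designed by Gustave Eiffel. Marie Curie, a famous scientist, was born in Paris and made significant contributions to the field of radiology. She worked at the Radium Institute. The Seine River flows through Paris.

Explanation of why this is suitable: the text contains several entities and relationships that can easily be extracted and represented in a knowledge graph:
・Entities: Paris, France, Eiffel Tower, Gustave Eiffel, Marie Curie, Radium Institute, Seine River
・Relationships: Paris is located in France. Paris is renowned for the Eiffel Tower. The Eiffel Tower was designed by Gustave Eiffel. Marie Curie was born in Paris. Marie Curie was a scientist. Marie Curie contributed to the field of radiology. Marie Curie worked at the Radium Institute. The Seine River flows through Paris.

A knowledge graph built from this text would represent these entities as nodes and the relationships as edges, giving a structured view of the information.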
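To make the node-and-edge picture concrete, here is a minimal sketch that writes the relationships listed above by hand as (subject, relation, object) triples and answers a simple query over them. It uses plain Python rather than the networkx graph from the article's script so that it runs without dependencies; the relation labels and the helper names `build_graph` and `find_subjects` are illustrative assumptions, not output of the extraction code.

```python
from collections import defaultdict

# (subject, relation, object) triples, paraphrased from the relationship list above.
TRIPLES = [
    ("Paris", "located in", "France"),
    ("Paris", "renowned for", "Eiffel Tower"),
    ("Eiffel Tower", "designed by", "Gustave Eiffel"),
    ("Marie Curie", "born in", "Paris"),
    ("Marie Curie", "is a", "scientist"),
    ("Marie Curie", "contributed to", "radiology"),
    ("Marie Curie", "worked at", "Radium Institute"),
    ("Seine River", "flows through", "Paris"),
]


def build_graph(triples):
    """Return {subject: [(relation, object), ...]} for a list of triples."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return dict(graph)


def find_subjects(triples, relation, obj):
    """Return every subject that has the given relation to the given object."""
    return [s for s, r, o in triples if r == relation and o == obj]


graph = build_graph(TRIPLES)
# Every distinct subject or object becomes a node.
nodes = {s for s, _, _ in TRIPLES} | {o for _, _, o in TRIPLES}

print(len(nodes), "nodes,", len(TRIPLES), "edges")
print(graph["Marie Curie"])
print(find_subjects(TRIPLES, "born in", "Paris"))
```

This hand-written graph is the target an automatic extractor would ideally approximate, which makes it a useful reference when judging the extracted nodes and edges.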

I made a PDF from the text above and used it as "input.pdf".

After the code ran (successfully), I got the following output:

INFO:__main__:Document converted in 8.63 seconds.
WARNING:docling_core.types.doc.document:Parameter `strict_text` has been deprecated and will be ignored.
Number of nodes: 23
Number of edges: 7
2025-04-23 21:33:52.828 python3[73966:691115] The class 'NSSavePanel' overrides the method identifier. This method is implemented by class 'NSWindow'
Nodes: [('Paris', {'label': 'GPE'}), ('France', {'label': 'GPE'}), ('Eiffel Tower', {'label': 'FAC'}), ('Gustave Eiffel', {'label': 'PERSON'}), ('Marie Curie', {'label': 'PERSON'}), ('the Radium Institute', {'label': 'FAC'}), ('Seine River', {'label': 'LOC'}), ('## Explanation', {'label': 'MONEY'}), ('Radium Institute', {'label': 'ORG'}), ('the Eiffel Tower', {'label': 'LOC'}), ('The Eiffel Tower', {'label': 'LOC'}), ('city', {}), ('renowned', {}), ('It', {}), ('destination', {}), ('Explanation', {}), ('entities', {}), ('this', {}), ('suitable', {}), ('Curie', {}), ('scientist', {}), ('contributions', {}), ('graph', {})]
Edges: [('city', 'renowned', {'relation': 'be'}), ('It', 'destination', {'relation': 'be'}), ('Explanation', 'entities', {'relation': 'contain'}), ('entities', 'graph', {'relation': 'represent'}), ('this', 'suitable', {'relation': 'be'}), ('Curie', 'scientist', {'relation': 'be'}), ('Curie', 'contributions', {'relation': 'make'})]
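The node list above contains near-duplicates ('Eiffel Tower', 'the Eiffel Tower', 'The Eiffel Tower'), a Markdown heading tagged as MONEY, and unlabeled nodes that were created by edges rather than by entity recognition. One possible cleanup pass is sketched below. It is not part of the article's code: the helper names `canonical_name` and `clean_nodes` are hypothetical, and the input is a small sample copied from the output.

```python
import re

# A few (name, attributes) pairs copied from the node list above.
RAW_NODES = [
    ("Eiffel Tower", {"label": "FAC"}),
    ("the Eiffel Tower", {"label": "LOC"}),
    ("The Eiffel Tower", {"label": "LOC"}),
    ("the Radium Institute", {"label": "FAC"}),
    ("Radium Institute", {"label": "ORG"}),
    ("## Explanation", {"label": "MONEY"}),
    ("city", {}),
    ("Paris", {"label": "GPE"}),
]


def canonical_name(name):
    """Strip surrounding whitespace and one leading article (the/a/an)."""
    return re.sub(r"^(the|a|an)\s+", "", name.strip(), flags=re.IGNORECASE)


def clean_nodes(raw_nodes):
    """Merge near-duplicate entity names and drop nodes that are not entities."""
    merged = {}
    for name, attrs in raw_nodes:
        label = attrs.get("label")
        if label is None:
            continue  # node was created by an edge, not by the NER step
        if name.lstrip().startswith("#"):
            continue  # Markdown heading residue such as '## Explanation'
        key = canonical_name(name)
        # Keep every label seen for the merged name instead of picking one.
        merged.setdefault(key, set()).add(label)
    return merged


cleaned = clean_nodes(RAW_NODES)
for name, labels in cleaned.items():
    print(name, sorted(labels))
```

Note that every edge in the output above connects two unlabeled nodes, so dropping those nodes would remove the edges as well. A fuller fix would map each subject or object token back to the entity span that contains it before adding the edge.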