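**Exploring the future of PDF parsing and retrieval: how combining RAG with LlamaParse could change the way information is processed.**
Key points:
1. How RAG works and the key role it plays in data-driven generative AI
2. The challenges PDF files pose for information extraction, and the advantages LlamaParse offers
3. The outlook for applying LlamaParse to complex documents that contain tables, images, and similar elements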
!pip install llama-index
!pip install llama-index-core
!pip install llama-index-embeddings-openai
!pip install llama-parse
!pip install llama-index-vector-stores-kdbai
!pip install pandas
!pip install llama-index-postprocessor-cohere-rerank
!pip install kdbai_client
from llama_parse import LlamaParse
from llama_index.core import Settings
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.kdbai import KDBAIVectorStore
from llama_index.postprocessor.cohere_rerank import CohereRerank
from getpass import getpass
import os
import kdbai_client as kdbai
# llama-parse is async-first, running the async code in a notebook requires the use of nest_asyncio
import nest_asyncio
nest_asyncio.apply()
# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = (
    os.environ["LLAMA_CLOUD_API_KEY"]
    if "LLAMA_CLOUD_API_KEY" in os.environ
    else getpass("LLAMA CLOUD API key: ")
)

# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = (
    os.environ["OPENAI_API_KEY"]
    if "OPENAI_API_KEY" in os.environ
    else getpass("OpenAI API Key: ")
)

# Set up KDB.AI endpoint and API key
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)

# Connect to KDB.AI
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)
schema = [
    dict(name="document_id", type="str"),
    dict(name="text", type="str"),
    dict(name="embeddings", type="float32s"),
]

indexFlat = {
    "name": "flat",
    "type": "flat",
    "column": "embeddings",
    "params": {'dims': 1536, 'metric': 'L2'},
}

# Connect with kdbai database
db = session.database("default")

KDBAI_TABLE_NAME = "LlamaParse_Table"

# First ensure the table does not already exist
try:
    db.table(KDBAI_TABLE_NAME).drop()
except kdbai.KDBAIException:
    pass

# Create the table
table = db.create_table(KDBAI_TABLE_NAME, schema, indexes=[indexFlat])
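The table above declares a `flat` index over the `embeddings` column with the `L2` metric and 1536 dimensions, which matches the default output size of `text-embedding-3-small`. As a rough illustration of what an exhaustive (flat) L2 search computes, here is a minimal pure-Python sketch. It is not KDB.AI's implementation; the names `l2_distance`, `flat_search`, and `toy_rows` are hypothetical, and the toy vectors are 3-dimensional only to keep the example readable.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query, rows, n=5):
    """Score every row against the query and return the n closest (distance, row) pairs."""
    scored = [(l2_distance(query, row["embeddings"]), row) for row in rows]
    # Sort on the distance only, so rows (dicts) are never compared with each other.
    scored.sort(key=lambda pair: pair[0])
    return scored[:n]


# Toy rows that mirror the schema columns; values are made up for illustration.
toy_rows = [
    {"document_id": "a", "text": "tables", "embeddings": [1.0, 0.0, 0.0]},
    {"document_id": "b", "text": "figures", "embeddings": [0.0, 1.0, 0.0]},
    {"document_id": "c", "text": "prose", "embeddings": [0.9, 0.1, 0.0]},
]

top = flat_search([1.0, 0.0, 0.0], toy_rows, n=2)
print([row["document_id"] for _, row in top])
```

A flat index trades speed for exactness: every stored vector is compared with the query, which is reasonable for the small number of chunks a single paper produces.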
!wget 'https://arxiv.org/pdf/2404.08865' -O './LLM_recall.pdf'
EMBEDDING_MODEL = "text-embedding-3-small"
GENERATION_MODEL = "gpt-4o"

llm = OpenAI(model=GENERATION_MODEL)
embed_model = OpenAIEmbedding(model=EMBEDDING_MODEL)

Settings.llm = llm
Settings.embed_model = embed_model

pdf_file_name = './LLM_recall.pdf'

parsing_instructions = '''The document titled "LLM In-Context Recall is Prompt Dependent" is an academic preprint from April 2024, authored by Daniel Machlab and Rick Battle from the VMware NLP Lab. It explores the in-context recall capabilities of Large Language Models (LLMs) using a method called "needle-in-a-haystack," where a specific factoid is embedded in a block of unrelated text. The study investigates how the recall performance of various LLMs is influenced by the content of prompts and the biases in their training data. The research involves testing multiple LLMs with varying context window sizes to assess their ability to recall information accurately when prompted differently. The paper includes detailed methodologies, results from numerous tests, discussions on the impact of prompt variations and training data, and conclusions on improving LLM utility in practical applications. It contains many tables. Answer questions using the information in this article and be precise.'''
documents = LlamaParse(result_type="markdown", parsing_instructions=parsing_instructions).load_data(pdf_file_name)
print(documents[0].text[:1000])
# Parse the documents using MarkdownElementNodeParser
node_parser = MarkdownElementNodeParser(llm=llm, num_workers=8)

# Retrieve nodes (text) and objects (table)
nodes = node_parser.get_nodes_from_documents(documents)
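The parsing step above separates the Markdown that LlamaParse returns into text nodes and table objects. To make that idea concrete, here is a deliberately simplified sketch that groups consecutive Markdown lines into "text" and "table" elements. It is an illustration only, not how `MarkdownElementNodeParser` is implemented; the function `split_markdown_elements`, the rule "a table line starts with `|`", and the sample string are all assumptions made for this example.

```python
def split_markdown_elements(markdown):
    """Group consecutive non-blank lines into ("table", ...) or ("text", ...) elements."""
    elements = []
    current_kind, buffer = None, []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            kind = None          # a blank line closes the current element
        elif stripped.startswith("|"):
            kind = "table"       # simplifying assumption: table rows start with a pipe
        else:
            kind = "text"
        # When the kind changes, flush whatever has been collected so far.
        if kind != current_kind and buffer:
            elements.append((current_kind, "\n".join(buffer)))
            buffer = []
        current_kind = kind
        if kind is not None:
            buffer.append(stripped)
    if buffer:
        elements.append((current_kind, "\n".join(buffer)))
    return elements


# Made-up sample input, not taken from the parsed paper.
sample = "Intro paragraph.\n\n| model | recall |\n| --- | --- |\n| A | 0.9 |\n\nClosing remark."
for kind, body in split_markdown_elements(sample):
    print(kind, "->", repr(body))
```

Keeping tables as separate elements is what lets a pipeline treat them differently from prose, for example by summarizing a table before embedding it.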
# Note: this rebinds the name OpenAI, which previously referred to the llama_index LLM wrapper
from openai import OpenAI

client = OpenAI()

def embed_query(query):
    # Embed the query with the same model used for the stored chunks
    query_embedding = client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    )
    return query_embedding.data[0].embedding

def retrieve_data(query):
    # Search the flat index for the 5 nearest chunks, excluding one document_id
    query_embedding = embed_query(query)
    results = table.search(
        vectors={'flat': [query_embedding]},
        n=5,
        filter=[('<>', 'document_id', '4a9551df-5dec-4410-90bb-43d17d722918')]
    )
    retrieved_data_for_RAG = []
    for index, row in results[0].iterrows():
        retrieved_data_for_RAG.append(row['text'])
    return retrieved_data_for_RAG

def RAG(query):
    # Put the question in the system prompt and the retrieved chunks in the user message
    question = "You will answer this question based on the provided reference material: " + query
    messages = "Here is the provided context: " + "\n"
    results = retrieve_data(query)
    if results:
        for data in results:
            messages += data + "\n"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": question},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": messages},
                ],
            }
        ],
        max_tokens=300,
    )
    content = response.choices[0].message.content
    return content
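To see what the `RAG` function sends to the chat model without making any network calls, the prompt assembly can be pulled into a small pure function. This is a sketch under that assumption: `build_messages` is a hypothetical helper that mirrors the string concatenation in `RAG`, and the question and chunks below are placeholders, not output from the parsed paper.

```python
def build_messages(query, retrieved_chunks):
    """Assemble the chat payload: the question as system prompt, the chunks as user context."""
    system_prompt = (
        "You will answer this question based on the provided reference material: " + query
    )
    context = "Here is the provided context: " + "\n"
    for chunk in retrieved_chunks:
        context += chunk + "\n"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [{"type": "text", "text": context}]},
    ]


# Placeholder inputs for illustration only.
example_messages = build_messages(
    "Which prompts were tested?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(example_messages[1]["content"][0]["text"])
```

Printing the payload this way makes it easier to check how much retrieved text is being passed along, which matters because the response is capped at `max_tokens=300`.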