A Knowledge Steward for the AI Era: The Making of a Knowledge Base System Built on DeepSeek, Dify, and Elasticsearch, and What It Taught Us

Published: 2025-04-01 22:03:10

Author: DevOps运维实践

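Introduction: When Ops Meets AI

It is 3 a.m. and a company's server has just gone down. Xiao Wang, the on-call ops engineer, is sweating: there are more than a thousand documents to search. Yet he finds the fix within 10 seconds. This is not science fiction; it is an ordinary day for a real ops knowledge base system.

Today we look inside the tool that spares ops engineers from paging through documents all night, using technology to solve the "last mile" problem of knowledge.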

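1. Inside the System: The Chain from Documents to Knowledge

The system consists of four core modules:

Document processor: a general-purpose parser for PDF, Word, Excel, and other formats. Beyond extracting text, it chunks and vectorizes documents (for example with Nomic Embeddings), turning them into "knowledge fragments" a machine can work with.
Knowledge store: an Elasticsearch-based store that supports semantic search. Unlike traditional keyword matching, it can recognize that "the server is sluggish" and "system response is delayed" describe the same kind of problem.
Q&A engine: combines the DeepSeek large model with retrieved context, like an ops expert who never tires. For example, when a user asks "how do I troubleshoot a memory leak", the system first retrieves the relevant document chunks and then generates a structured solution.
Memory module: a SQLite database records every question and answer, closing the knowledge loop. It is more than a log; it is the "experience points" used to improve the system.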
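
The retrieve-then-generate flow described for the Q&A engine can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system's actual code: the function names (`cosine_similarity`, `retrieve`, `build_prompt`, `answer`) are hypothetical, the embeddings are toy two-dimensional vectors, and `embed` and `generate` stand in for the embedding model and the DeepSeek call.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, indexed_chunks, k=3):
    """Return the text of the k chunks whose vectors are closest to query_vec.

    indexed_chunks is a list of (text, vector) pairs (a hypothetical in-memory
    stand-in for the Elasticsearch vector index).
    """
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(chunks, question):
    """Join the retrieved chunks into a context block placed ahead of the question."""
    context = "\n".join(chunks)
    return f"Answer the question using the documents below.\n{context}\nQuestion: {question}"


def answer(question: str, embed: Callable, indexed_chunks, generate: Callable, k: int = 3):
    """Embed the question, retrieve context, then hand the prompt to the model."""
    chunks = retrieve(embed(question), indexed_chunks, k)
    return generate(build_prompt(chunks, question))
```

Passing `embed` and `generate` in as callables keeps the retrieval logic testable without a running Elasticsearch or model endpoint.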

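2. Technical Highlights: The Ideas Behind the Code

Document processing: skilled dissection

RecursiveCharacterTextSplitter splits long documents into chunks, with an overlap between neighbouring chunks that preserves context while keeping each chunk small enough to avoid information overload.
Regular-expression cleanup collapses redundant whitespace so the indexed text stays clean. (The clean_text function in the listing does only this; it does not strip garbled characters.)

Search that "reads minds"

Elasticsearch's multi_match query searches across several fields, so a user who remembers only a few scattered keywords from a document can still locate it.
Vector search (similarity_search in the code) lets the system pick up the semantic link between terms such as "backup" and "disaster recovery", going beyond literal string matching.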
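
To make the chunk-size and overlap idea concrete, here is a small sketch. It is a simplified fixed-window splitter, not RecursiveCharacterTextSplitter itself (which additionally tries to break on separators such as paragraphs and lines); the name `chunk_text` is hypothetical, and `clean_text` mirrors the whitespace cleanup in the article's listing.

```python
import re


def clean_text(text: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut at a
    boundary still appears whole in at least one neighbouring chunk when it is
    shorter than the overlap.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=1000` and `chunk_overlap=200`, the values used in the listing, each new chunk starts 800 characters after the previous one.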

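(Lao Zhang, a veteran ops engineer: "Finding a document used to be like looking for a needle in the ocean. Now it is like ordering takeout on Ele.me: delivered right to the door!")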

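3. Real Scenarios: Moments That Save an Ops Engineer

Scenario 1: The newcomer catches up

Xiao Li, an intern, was lost when faced with "database primary-replica synchronization". The system returned the operations manual plus a diagram explaining the principle, along with a friendly reminder: "Remember to take a backup before you sync!"

Scenario 2: Late-night rescue

During a major e-commerce sales event the servers crashed. The system answered within seconds: "Section 4.2 of the High-Concurrency Emergency Plan, plus the auto-scaling script", and the ops team restored service in 30 minutes.

Scenario 3: Passing on knowledge

Once a senior engineer's "heirloom notes" are uploaded to the system, new hires can pull them up at any time, so experience no longer walks out the door when someone leaves.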

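4. Runtime Environment

# Python version
python --version
Python 3.12.9

# pip dependencies
pip install flask flask-httpauth werkzeug elasticsearch langchain langchain-elasticsearch langchain-community langchain-text-splitters langchain-nomic python-docx docx2txt pypdf pdfplumber pandas openpyxl tika requests urllib3

# Start Elasticsearch locally with Docker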
docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  elasticsearch:7.17.10

# Create the knowledge_index index
curl -X PUT "http://localhost:9200/knowledge_index" -H 'Content-Type: application/json' -d'
{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text"
            },
            "content": {
                "type": "text"
            },
            "timestamp": {
                "type": "date"
            }
        }
    }
}
'

5. Full Code

Configuration file (config.py)

# Configuration
# Elasticsearch connection URL
ES_URL = "http://192.168.31.136:9200"
# Name of the Elasticsearch index that stores the knowledge
INDEX_NAME = "knowledge_index"
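
The listing hardcodes a LAN address in config.py while app.py connects to localhost. One possible way to keep the two in step is to read the settings from the environment. This is only a sketch: the `load_config` helper and the `ES_URL` and `INDEX_NAME` environment variable names are assumptions, and the default URL simply matches the local Docker command above.

```python
import os


def load_config(env=None):
    """Read settings from environment variables, falling back to local defaults.

    env defaults to os.environ; passing a plain dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    return {
        "ES_URL": env.get("ES_URL", "http://localhost:9200"),
        "INDEX_NAME": env.get("INDEX_NAME", "knowledge_index"),
    }
```

For example, `load_config({"ES_URL": "http://192.168.31.136:9200"})` would point the application at the address used in config.py without editing the file.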

Database operations (database.py)

This file handles database initialization, inserts, and queries, using SQLite as the database.

import sqlite3

# Initialize the database and create the history table
def init_db():
    # Connect to the SQLite database
    conn = sqlite3.connect('history.db')
    # Create a cursor for executing SQL statements
    c = conn.cursor()
    # Create the history table if it does not exist yet
    c.execute('''CREATE TABLE IF NOT EXISTS history
                 (id INTEGER PRIMARY KEY AUTOINCREMENT,
                 user TEXT,
                 question TEXT,
                 answer TEXT)''')
    # Commit the transaction
    conn.commit()
    # Close the connection
    conn.close()

# Add one question/answer record for a user to the history table
def add_history(user, question, answer):
    # Connect to the SQLite database
    conn = sqlite3.connect('history.db')
    # Create a cursor
    c = conn.cursor()
    # Insert the record (parameterized to avoid SQL injection)
    c.execute("INSERT INTO history (user, question, answer) VALUES (?, ?, ?)", (user, question, answer))
    # Commit the transaction
    conn.commit()
    # Close the connection
    conn.close()

# Fetch the question/answer history of the given user
def get_history(user):
    # Connect to the SQLite database
    conn = sqlite3.connect('history.db')
    # Create a cursor
    c = conn.cursor()
    # Query the history of the given user
    c.execute("SELECT question, answer FROM history WHERE user = ?", (user,))
    # Collect the result rows
    history = c.fetchall()
    # Close the connection
    conn.close()
    return history
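
The three functions above each open and close their own connection. As an alternative sketch, the same table can be wrapped in a small class that uses the connection as a context manager, which commits on success and rolls back on error. The `HistoryStore` name is hypothetical, and passing `":memory:"` as the path gives a throwaway database for trying it out.

```python
import sqlite3
from contextlib import closing

SCHEMA = """CREATE TABLE IF NOT EXISTS history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user TEXT,
    question TEXT,
    answer TEXT
)"""


class HistoryStore:
    """Thin wrapper around the history table from database.py."""

    def __init__(self, path="history.db"):
        self.conn = sqlite3.connect(path)
        with self.conn:  # commits if the block succeeds, rolls back otherwise
            self.conn.execute(SCHEMA)

    def add(self, user, question, answer):
        with self.conn:
            self.conn.execute(
                "INSERT INTO history (user, question, answer) VALUES (?, ?, ?)",
                (user, question, answer),
            )

    def get(self, user):
        # closing() makes sure the cursor is released after the query.
        with closing(self.conn.cursor()) as cur:
            cur.execute(
                "SELECT question, answer FROM history WHERE user = ? ORDER BY id",
                (user,),
            )
            return cur.fetchall()

    def close(self):
        self.conn.close()
```

A single shared connection like this is only safe within one thread; a multi-threaded Flask deployment would likely still want one connection per request, as the original functions do.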

Document processing (document_processor.py)

Loads and processes documents and adds their vectors to Elasticsearch.

# Import the required libraries
import re
from langchain_community.document_loaders import TextLoader, Docx2txtLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_nomic.embeddings import NomicEmbeddings
from langchain_elasticsearch import ElasticsearchStore
from config import ES_URL, INDEX_NAME

# Clean text by collapsing redundant whitespace
def clean_text(text):
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Load a document and split it into small chunks
def load_and_process_documents(file_path):
    if file_path.endswith('.txt'):
        # Load plain-text files with TextLoader
        loader = TextLoader(file_path, encoding='utf-8')
    elif file_path.endswith('.pdf'):
        # Load PDF files with PyPDFLoader
        loader = PyPDFLoader(file_path)
    elif file_path.endswith('.docx'):
        # Load DOCX files with Docx2txtLoader
        loader = Docx2txtLoader(file_path)
    else:
        # Reject unsupported file formats
        raise ValueError("Unsupported file format")

    # Load the document
    documents = loader.load()
    # Create the splitter with the chunk size and overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    # Split the document
    docs = text_splitter.split_documents(documents)
    # Clean the text of every chunk
    cleaned_docs = [clean_text(doc.page_content) for doc in docs]
    return cleaned_docs

# Add the processed document to the Elasticsearch vector store
def add_to_vectorstore(file_path):
    # Load and process the document
    docs = load_and_process_documents(file_path)
    # Create the NomicEmbeddings object that produces the embedding vectors
    # (same model as in question_answering.py, so queries and documents share one vector space)
    embeddings = NomicEmbeddings(model="nomic-embed-text-v1.5")
    # Add the chunks to the Elasticsearch vector store
    ElasticsearchStore.from_texts(
        texts=docs,
        embedding=embeddings,
        es_url=ES_URL,
        index_name=INDEX_NAME
    )

Intelligent Q&A (question_answering.py)

Implements the core logic for answering questions.

import os
import random
import requests
from langchain_nomic.embeddings import NomicEmbeddings
from langchain_elasticsearch.vectorstores import ElasticsearchStore
from config import ES_URL, INDEX_NAME

# DeepSeek node URLs. The original listing referenced DEEPSEEK_NODES without defining it;
# reading a comma-separated DEEPSEEK_NODES environment variable is an assumption made here.
DEEPSEEK_NODES = [u.strip() for u in os.getenv("DEEPSEEK_NODES", "").split(",") if u.strip()]

# Pick a DeepSeek node at random
def select_node():
    return random.choice(DEEPSEEK_NODES)

# Handle a user question and return the answer
def ask_question(question):
    try:
        # Read the Nomic API key from the environment
        nomic_api_key = os.getenv('NOMIC_API_KEY')
        # Initialize NomicEmbeddings
        embeddings = NomicEmbeddings(model="nomic-embed-text-v1.5", nomic_api_key=nomic_api_key)
        # Create the Elasticsearch vector store object
        vectorstore = ElasticsearchStore(embedding=embeddings, es_url=ES_URL, index_name=INDEX_NAME)

        # Search the vector store for the chunks most relevant to the question
        relevant_docs = vectorstore.similarity_search(question, k=3)
        # Join the chunk contents into the context
        context = "\n".join([doc.page_content for doc in relevant_docs])
        # Pick a DeepSeek node
        node_url = select_node()
        # Build the request body
        payload = {
            "model": "deepseek",
            "prompt": f"Answer the question based on the following documents: {context}\nQuestion: {question}"
        }
        # Send the POST request to the DeepSeek node
        response = requests.post(node_url, json=payload, timeout=10)
        # Raise an exception for 4xx/5xx responses
        response.raise_for_status()
        # Parse the JSON response
        result = response.json()
        # Return the answer, or a default message if none is present
        return result.get("answer", "No suitable answer found")
    except requests.RequestException as e:
        # Handle request errors
        print(f"Error communicating with the DeepSeek model: {e}")
        return "The request failed; please try again later"
    except Exception as e:
        # Handle any other error
        print(f"Unexpected error while handling the question: {e}")
        return "An unexpected error occurred; please contact the administrator"
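
Picking one node at random means a single unhealthy node fails the whole request. A possible refinement, sketched here under assumptions, is to try the nodes in random order and return the first successful response. The `ask_with_failover` name and the injected `post` callable are hypothetical; in the listing above, `post` would wrap `requests.post(...)` followed by `raise_for_status()` and `.json()`.

```python
import random
from typing import Callable, Sequence


def ask_with_failover(nodes: Sequence[str], payload: dict, post: Callable, rng=random):
    """Try every node once, in random order, and return the first successful result.

    post(node_url, payload) is expected to return the parsed response or raise.
    Raises RuntimeError if the node list is empty or every node fails.
    """
    last_error = None
    # sample() with k=len(nodes) yields a shuffled copy without mutating the input.
    for node_url in rng.sample(list(nodes), k=len(nodes)):
        try:
            return post(node_url, payload)
        except Exception as exc:  # keep going; remember why this node failed
            last_error = exc
    raise RuntimeError("all DeepSeek nodes failed") from last_error
```

Shuffling still spreads load across nodes, as `random.choice` does, while adding a retry path.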

Flask application (app.py)

The main file of the Flask application; it handles user requests.

import os
import logging
import requests
from elasticsearch import Elasticsearch
from flask import Flask, request, jsonify, render_template, redirect, url_for
from werkzeug.security import generate_password_hash, check_password_hash
from flask_httpauth import HTTPBasicAuth
import re
import urllib3
import pandas as pd
import docx
import pdfplumber
import tempfile
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import json
from werkzeug.utils import secure_filename
from database import init_db, add_history, get_history
from question_answering import ask_question
from document_processor import add_to_vectorstore

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Suppress SSL warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# User authentication manager
class UserAuth:
    def __init__(self):
        # Usernames mapped to their hashed passwords
        self.users = {
            "admin": generate_password_hash("adminpassword"),
            "user": generate_password_hash("userpassword")
        }
        # Create the HTTP Basic auth object
        self.auth = HTTPBasicAuth()

        @self.auth.verify_password
        def verify_password(username, password):
            # Verify the username and password; returning None rejects the request
            if username in self.users and check_password_hash(self.users.get(username), password):
                return username

# Elasticsearch manager
class ElasticsearchManager:
    def __init__(self):
        # Connect to Elasticsearch
        self.es = Elasticsearch([{'scheme': 'http', 'host': 'localhost', 'port': 9200}])
        # Name of the Elasticsearch index
        self.index_name = 'ops_knowledge'
        # Initialize the index
        self.init_index()

    def init_index(self):
        try:
            # Create the index if it does not exist yet
            if not self.es.indices.exists(index=self.index_name):
                self.es.indices.create(index=self.index_name)
                logger.info(f"Index '{self.index_name}' created")
        except Exception as e:
            # Log the index-creation error
            logger.error(f"Error creating index: {e}")
            raise

    def add_document(self, title, content, file_path):
        # Build the document object
        doc = {
            'title': title,
            'content': content,
            'file_path': file_path
        }
        try:
            # Add the document to the Elasticsearch index
            res = self.es.index(index=self.index_name, body=doc)
            logger.info(f"Document {title} added to the knowledge base, ID: {res['_id']}")
            return res['result']
        except Exception as e:
            # Log the indexing error
            logger.error(f"Error adding document {title} to the knowledge base: {e}")
            return None

    def search_knowledge(self, query):
        try:
            # Build the search request body
            body = {
                "query": {
                    "multi_match": {
                        "query": query,
                        "fields": ["title", "content"]
                    }
                }
            }
            # Run the search
            res = self.es.search(index=self.index_name, body=body)
            # Extract the hits
            hits = res['hits']['hits']
            results = []
            for hit in hits:
                # Collect the fields the templates need
                results.append({
                    'id': hit['_id'],
                    'title': hit['_source']['title'],
                    'content': hit['_source']['content'],
                    'file_path': hit['_source']['file_path']
                })
            return results
        except Exception as e:
            # Log the search error
            logger.error(f"Error searching the knowledge base: {e}")
            return []

# Document parser
class DocumentParser:
    @staticmethod
    def parse_document(file_path):
        try:
            if file_path.endswith('.docx'):
                # Parse DOCX files
                doc = docx.Document(file_path)
                content = ""
                for para in doc.paragraphs:
                    content += para.text + "\n"
                return content.strip()
            elif file_path.endswith('.xlsx'):
                # Parse XLSX files
                df = pd.read_excel(file_path)
                content = ""
                for _, row in df.iterrows():
                    content += str(row.values) + "\n"
                return content.strip()
            elif file_path.endswith('.pdf'):
                # Parse PDF files
                with pdfplumber.open(file_path) as pdf:
                    content = ""
                    for page in pdf.pages:
                        # Guard against pages with no extractable text
                        content += (page.extract_text() or "") + "\n"
                return content.strip()
            else:
                # Fall back to Tika for other file formats
                from tika import parser
                parsed = parser.from_file(file_path, requestOptions={"verify": False})
                content = parsed['content']
                if content:
                    content = content.strip()
                return content
        except Exception as e:
            # Log the parsing error
            logger.error(f"Error parsing file {file_path}: {e}")
            return None

# Dify manager
class DifyManager:
    def __init__(self, api_key):
        # Store the Dify API key
        self.api_key = api_key
        # Set the request headers
        self.headers = {"Authorization": f"Bearer {self.api_key}"}
        # Dify document-creation API URL
        self.api_url = "https://api.dify.ai/v1/datasets/810ff8fb-fd85-4b5b-a4cxxx_xcxxc_b473c/document/create-by-file"
        # Create the request session
        self.session = requests.Session()
        # Add retries
        retries = Retry(total=3, backoff_factor=1)
        self.session.mount('https://', HTTPAdapter(max_retries=retries))

    def add_to_dify(self, file_path):
        try:
            # Open the file
            with open(file_path, 'rb') as file_obj:
                # Build the file-upload part
                files = {
                    'file': (secure_filename(os.path.basename(file_path)), file_obj, self._get_mime_type(file_path))
                }
                # Build the form-data part
                data = {
                    'indexing_technique': 'high_quality',
                    'process_rule': json.dumps({"mode": "automatic"})  # serialize as JSON
                }

                # Send the POST request to the Dify API
                response = self.session.post(
                    self.api_url,
                    headers=self.headers,
                    files=files,
                    data=data,
                    timeout=30
                )
                # Raise an exception for 4xx/5xx responses
                response.raise_for_status()
                return response.json()

        except requests.RequestException as e:
            # Handle request errors. Compare against None explicitly: a requests
            # Response is falsy for 4xx/5xx, so "if response" would hide the details.
            response = getattr(e, 'response', None)
            error_msg = {
                'error': str(e),
                'status_code': response.status_code if response is not None else None,
                'response_text': response.text[:200] if response is not None else None
            }
            logger.error(f"Dify upload failed: {json.dumps(error_msg, indent=2)}")
            return None

    def _get_mime_type(self, filename):
        # Map the file extension to a MIME type
        ext = os.path.splitext(filename)[1].lower()
        return {
            '.pdf': 'application/pdf',
            '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
            '.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
        }.get(ext, 'application/octet-stream')

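# DeepSeek manager
class DeepSeekManager:
    def __init__(self, api_key):
        # Store the DeepSeek API key
        self.api_key = api_key
        # Set the request headers
        self.headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }

    def intelligent_qa(self, question):
        # Strip special characters from the question
        question = re.sub(r'[^\w\s]', '', question)
        # Build the request body
        data = {
            "model": "deepseek-chat",
            "messages": [
                {"role": "user", "content": question}
            ],
            "temperature": 0.7,
            "max_tokens": 1024,
            "top_p": 0.9  # optional extra parameter; adjust as needed
        }
        try:
            # Send the POST request to the DeepSeek API
            response = requests.post(
                "https://api.deepseek.com/chat/completions",
                headers=self.headers,
                json=data
            )
            # Raise an exception for 4xx/5xx responses
            response.raise_for_status()
            # Parse the JSON response
            result = response.json()
            # Return the answer, or a default message if none is present
            answer = result.get("choices", [{}])[0].get("message", {}).get("content", "No relevant answer found; please provide more details.")
        except requests.RequestException as e:
            # The original listing is truncated here; this handler is a reconstruction
            logger.error(f"DeepSeek API request failed: {e}")
            answer = "The request failed; please try again later."
        return answer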


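# Flask application
app = Flask(__name__)
user_auth = UserAuth()
# NOTE: the OpsKnowledgeBase class is not defined anywhere in this listing;
# it has to be supplied before this line can run.
knowledge_base = OpsKnowledgeBase("dataset-V0TC4VIGfpfvmyrxxxV3hxxx", "sk-fc5c4a54ebfxxxxxxxc2cb6xxx25xx")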

# Home page
@app.route('/')
@user_auth.auth.login_required
def index():
    return render_template('index.html')


# Scan a folder and add its documents
@app.route('/scan_folder', methods=['POST'])
@user_auth.auth.login_required
def scan_folder():
    if user_auth.auth.current_user() == 'admin':
        folder_path = request.form.get('folder_path')
        if folder_path:
            knowledge_base.scan_folder_and_add_documents(folder_path)
            return jsonify({'message': 'Folder scanned and documents added'})
        return jsonify({'error': 'Missing folder path parameter'}), 400
    return jsonify({'error': 'Only the administrator can perform this operation'}), 403


# Search the knowledge base
@app.route('/search', methods=['GET'])
@user_auth.auth.login_required
def search():
    query = request.args.get('query')
    if query:
        results = knowledge_base.search_knowledge(query)
        return render_template('search_results.html', results=results)
    return jsonify({'error': 'Missing query parameter'}), 400


# Intelligent Q&A
@app.route('/qa', methods=['GET'])
@user_auth.auth.login_required
def qa():
    question = request.args.get('question')
    if question:
        answer = knowledge_base.intelligent_qa(question)
        return render_template('qa_result.html', question=question, answer=answer)
    return jsonify({'error': 'Missing question parameter'}), 400


if __name__ == '__main__':
    app.run(debug=True)
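
app.py calls `scan_folder_and_add_documents`, `search_knowledge`, and `intelligent_qa` on an `OpsKnowledgeBase` object, but the listing never defines that class. The following is a minimal sketch of what such a facade might look like, inferred only from those three calls; it is not the author's implementation. In particular, the constructor here takes ready-made collaborator objects rather than the two key strings used in app.py, which is an assumption made to keep the sketch testable.

```python
import os


class OpsKnowledgeBase:
    """Hypothetical facade tying together the manager classes from app.py."""

    def __init__(self, es_manager, parser, deepseek, dify=None):
        self.es_manager = es_manager  # e.g. ElasticsearchManager()
        self.parser = parser          # e.g. DocumentParser
        self.deepseek = deepseek      # e.g. DeepSeekManager(api_key)
        self.dify = dify              # e.g. DifyManager(api_key), optional

    def scan_folder_and_add_documents(self, folder_path):
        """Walk folder_path, parse each file, and index the ones that yield text.

        Returns the number of documents added.
        """
        added = 0
        for root, _dirs, files in os.walk(folder_path):
            for name in sorted(files):
                path = os.path.join(root, name)
                content = self.parser.parse_document(path)
                if not content:
                    continue  # parser returned None or empty text
                self.es_manager.add_document(name, content, path)
                if self.dify is not None:
                    self.dify.add_to_dify(path)
                added += 1
        return added

    def search_knowledge(self, query):
        """Delegate keyword search to the Elasticsearch manager."""
        return self.es_manager.search_knowledge(query)

    def intelligent_qa(self, question):
        """Delegate question answering to the DeepSeek manager."""
        return self.deepseek.intelligent_qa(question)
```

Under these assumptions, the wiring in app.py would become something like `OpsKnowledgeBase(ElasticsearchManager(), DocumentParser, DeepSeekManager(key), DifyManager(key))`, placed before the routes are defined.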

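6. Template Files

Create a templates folder in the project root containing four files: index.html, login.html, qa_result.html, and search_results.html. Note that app.py as listed authenticates with HTTP Basic auth and defines no login route, so login.html is not rendered by any route shown here.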

index.html

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Ops Knowledge Base</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            margin: 20px;
        }

        h1 {
            color: #333;
        }

        form {
            margin-bottom: 20px;
        }

        input[type="text"] {
            padding: 8px;
            width: 300px;
            margin-right: 10px;
        }

        button {
            padding: 8px 15px;
            background-color: #007BFF;
            color: white;
            border: none;
            cursor: pointer;
        }

        button:hover {
            background-color: #0056b3;
        }
    </style>
</head>

<body>
    <h1>Ops Knowledge Base</h1>
    <h2>Scan a Folder to Add Documents</h2>
    <form action="/scan_folder" method="post">
        <input type="text" name="folder_path" placeholder="Enter a folder path">
        <button type="submit">Scan and Add</button>
    </form>
    <h2>Search Knowledge</h2>
    <form action="/search" method="get">
        <input type="text" name="query" placeholder="Enter search keywords">
        <button type="submit">Search</button>
    </form>
    <h2>Intelligent Q&amp;A</h2>
    <form action="/qa" method="get">
        <input type="text" name="question" placeholder="Enter your question">
        <button type="submit">Ask</button>
    </form>
    </form>
</body>

</html>

login.html

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Log In</title>
</head>

<body>
    <h1>Log In</h1>
    {% if error %}
    <p style="color: red;">{{ error }}</p>
    {% endif %}
    <form action="{{ url_for('login') }}" method="post">
        <label for="username">Username:</label>
        <input type="text" id="username" name="username" required><br>
        <label for="password">Password:</label>
        <input type="password" id="password" name="password" required><br>
        <input type="submit" value="Log In">
    </form>
</body>

</html>

qa_result.html

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Intelligent Q&amp;A Result</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            margin: 20px;
        }

        h1 {
            color: #333;
        }

        .question {
            font-weight: bold;
        }

        .answer {
            margin-top: 10px;
        }
    </style>
</head>

<body>
    <h1>Intelligent Q&amp;A Result</h1>
    <p class="question">Question: {{ question }}</p>
    <p class="answer">Answer: {{ answer }}</p>
    <a href="/">Back to Home</a>
</body>

</html>

search_results.html

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Search Results</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            margin: 20px;
        }

        h1 {
            color: #333;
        }

        .result {
            border: 1px solid #ccc;
            padding: 10px;
            margin-bottom: 10px;
        }

        .result h2 {
            margin-top: 0;
        }
    </style>
</head>

<body>
    <h1>Search Results</h1>
    {% if results %}
    {% for result in results %}
    <div class="result">
        <h2>{{ result.title }}</h2>
        <p>{{ result.content }}</p>
        <p>File path: {{ result.file_path }}</p>
    </div>
    {% endfor %}
    {% else %}
    <p>No matching results found.</p>
    {% endif %}
    <a href="/">Back to Home</a>
</body>
</html>

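7. The System in Action

Screenshot: scanning a local folder to add files to the knowledge base

Screenshot: searching existing knowledge base content

Screenshot: asking a question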

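8. Future Evolution: From Tool to Partner

Multimodal upgrade: support architecture-diagram recognition, so that for a query like "the problem with the red node in the topology diagram" the system can parse the image and correlate it with logs.
Self-learning: use user feedback to label high-quality answers automatically and iterate, much as a Go AI improves through self-play.
Predictive maintenance: combine monitoring data to push proactive hints such as "CPU usage keeps rising; consider checking Section 3 of document XXX".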

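9. Closing: Bringing Technology Back to People

At its core, this system is not meant to replace ops engineers but to free people from repetitive work: engineers can focus on innovation, knowledge can circulate, and every late-night emergency becomes a fight you are prepared for. As a saying attributed to Linus Torvalds, the creator of Linux, goes: "The beauty of technology lies in making complicated things simple." When every ops problem can be answered within 10 seconds, we are one step closer to an ops utopia free of outage anxiety.

As another user put it: "I used to think AI was far away, until it saved my year-end bonus..."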

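One last word

The code is for demonstration only and is not a finished application. The revolution has not yet succeeded; comrades must keep working.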
