Yuque Document Conversion Tool

This is the first practical project for Nanjing University's nova club written by the author. In this project, I will try my best to show my thinking process and restore my long-standing "starting from wheels" way of thinking; this way of thinking is conducive to quickly implementing more complex and complete projects. Based on my previous communication with teacher cac, relying on the "wheel-based" way of thinking will not be very beneficial for some future projects, so my project may be a negative example for most students. The reason why I still adopt this form is to show everyone (especially teacher cac) what my so-called "starting from wheels" way of thinking is like, for everyone to criticize and learn from.

Simplified Reproduction Tutorial

Since some comrades reported that they couldn't understand what I was doing through my documentation, I provide a simplified reproduction tutorial here.

Clone the repository: git clone https://github.com/ChouYuanjue/Yuque_LLM_Wiki (If ssh is configured, it is recommended to use git clone git@github.com:ChouYuanjue/Yuque_LLM_Wiki.git)
Enter the repository directory: cd Yuque_LLM_Wiki
Install dependencies: pip install -r requirements.txt

Create a .env file and fill in the following content (format as follows):

# SiliconFlow API key (for embedding and LLM calls)
SILICONFLOW_API_KEY=sk-xxxxxxxxxxxxxxxx

# Yuque API key (only for api.py, optional)
YUQUE_TOKEN=xxxxxxxxxxxxxxxx

Get Yuque documents:
- With Token: python api.py <Yuque knowledge base URL>
- Without Token: python crawl.py <public Yuque knowledge base URL>
Build the knowledge base:
```
python create_faiss_knowledge_base.py <path to folder containing md files> [collection name]
```
- <path to folder containing md files>: Required, pointing to a specific directory under api/ or crawl/
- [collection name]: Optional, the name of the knowledge base collection, default is 'default_knowledge_base'
- Example: python create_faiss_knowledge_base.py api/68727525 nova
Configure the knowledge base collection name called by WebUI in config.json (such as nova just created), that is, modify the default_collection parameter of vector_store.
Start the Web interface: streamlit run webui.py

Preliminary Implementation Ideas

The following is the plan I immediately generated based on my long-term experience in doing projects after seeing the project topic. Any link I envisioned below is something I am sure I can implement myself/by modifying wheels. The advantage of "experience" is that it avoids unrealistic implementation ideas and also avoids being stuck on certain issues. The disadvantage is that once this thinking is formed, it will subconsciously treat every project by modifying and stacking wheels, refusing to spend time trying unknown implementation methods, and it is also prone to redundant calling steps.

Rendering diagram…

Idea Analysis

If we pass all documents directly to the API of our commonly used large models, it is usually not a good solution, because this will lead to high cost of API calls, and the context length of the model is usually limited and cannot handle all documents. So it is better to call the faiss knowledge base; of course, this solution also has defects, which I will mention below. If we really want to pass all documents, I think DeepWiki is a good choice. This project can be deployed locally, but we do not adopt the local solution here. We can directly push the obtained documents to github, and let DeepWiki generate interactive documents based on the github repository. DeepWiki actually calls a model specialized for understanding code implementations, which has obvious advantages in context understanding. Although it is not directly applicable to this project, it can also achieve very good results. The reason I thought of DeepWiki is that I didn't want to write documentation when I wrote the bot, so I found this interactive assistant, connected it to the bot's repository, allowing users to directly consult the bot about the usage and specific implementation of various functions.

Regarding the problem of tables and images in the crawled documents, tables can be initially converted into markdown format, which does not affect the LLM's understanding of them. Images need special processing: one is to extract text through OCR and then pass it to LLM, and the other is to pass it to LLM after paraphrasing through VLM. Of course, I think the best solution is to pass all to the Embedding model + multimodal large model, but I have never tested it, and I don't know how to call the API of the multimodal model and whether it must go through the preprocessing of the Embedding model. The free Embedding models I know do not seem to support processing data other than text. Models that may support this function are not intended to be tried for the time being due to network environment and cost issues. The processing of images is temporarily added to the TO-DO list.

In addition, regarding sensitive information such as tokens and API keys, I will store them in the .env file to avoid hardcoding into the code. Before this, I pushed such information to public warehouses many times and had to overwrite them through forced pushes, so I must develop the habit of writing .env (sad).

Specific Implementation Ideas

I will complete the implementation of two data acquisition methods: API and web crawler, convert them into markdown, and establish a script to push to DeepWiki webpage. At the same time, I will complete the call of LLM through the faiss knowledge base for a slightly more complex but more localized and controllable implementation. In this process, I will demonstrate the two cases of Nanda Assistant's Yuque (without token) and nova documents (with token) respectively. For the final client implementation, I will try to use a locally deployed website (WebUI) for easy visual calling by everyone, and may eventually deploy an open demo on the server.

A. Yuque API + DeepWiki (+WebUI)

The simplest and fastest implementation method. Can realize question and answer interaction. The disadvantage is that DeepWiki will automatically update documents according to the warehouse only once a week.

B. Yuque API + faiss knowledge base + LLM (+WebUI)

The reason I thought of this way is that my own bot's knowledge base is implemented like this. The general principle can be understood as first cut the document into pieces, then store the embedding of each block in the faiss knowledge base. When answering questions, first compare the embedding of the question with the embedding in the faiss knowledge base, find the most similar blocks, and then pass these blocks to the LLM as context for processing. The advantage is that the prompts that LLM needs to process are significantly reduced. The disadvantage is that the context obtained by LLM may not be the most relevant, may be wrong, and is also likely to be incomplete. It is necessary to call the Embedding model to encode the context.

C. Crawler + DeepWiki (+WebUI)

Same as scheme A, but can be used for Yuque documents that do not provide tokens but are public.

D. Crawler + faiss knowledge base + LLM (+WebUI)

Same as scheme B, but can be used for Yuque documents that do not provide tokens but are public.

E. Directly process web pages to build faiss knowledge base + LLM (+WebUI)

Due to the availability of complete wheels, only relevant links are posted here. Note that this project provides an API port callable by AstrBot, which is convenient for calling when writing plug-ins. It cannot be used out of the box, and you need to understand its principles and modify it according to needs. The project can be found here: RC-CHN/astrbot_plugin_url_2_knowledge_base

F. Implementation based on qbot

The implementation idea is basically the same as above. The advantage is that there are more available wheels. Through AstrBot's built-in LLM calling, knowledge base visual construction and other functions, it can be easily implemented. However, it involves many project dependencies and is also not ready-to-use. There is no plug-in that can be directly installed, so it is not recommended for students without bot writing experience.

Final Implementation

This project has completed the implementation methods A, B, C, and D above; the F implementation method can be found in ChouYuanjue/doge-repo/v4. Chatting with the bot and sending /wiki <repository name> <question> can ask DeepWiki, and sending /kb use <knowledge base name> can call the knowledge base in natural language conversations with the bot (can be imported through AstrBot's WebUI, which automatically parses and generates the knowledge base).

1. System Architecture Overview

The current system mainly consists of the following core modules:

Data Acquisition Layer: Obtain Yuque documents through API or crawler
Data Processing Layer: Convert documents to markdown and process image resources
Knowledge Base Construction Layer: Create a vector knowledge base based on FAISS
Query Service Layer: Provide two query methods: knowledge base query and DeepWiki query
User Interface Layer: WebUI interface based on Streamlit

Rendering diagram…

2. Detailed Description of Core Modules

2.1 Data Acquisition Module

2.1.1 API Method (requires Token)

The api.py module provides the function of obtaining documents through the official Yuque API. The main process is:

Read YUQUE_TOKEN from the environment variable .env
Parse the input Yuque knowledge base URL and extract necessary parameters
Obtain knowledge base metadata and directory structure through API
Traverse each document in the directory and call the API to obtain markdown content
Download images in the document and replace links with local paths
Generate SUMMARY.md file as directory index

Usage:

python api.py <Yuque knowledge base URL>

2.1.2 Crawler Method (no Token required)

The crawl.py module provides the function of crawling public Yuque documents. The main process is:

Visit the Yuque knowledge base page and extract embedded JSON data
Parse JSON data to obtain document structure and content
Traverse documents and obtain complete markdown content through API
Traverse each document in the directory and call the API to obtain markdown content
Download images in the document and replace links with local paths
Generate SUMMARY.md file as directory index

Usage:

python crawl.py <public Yuque knowledge base URL>

2.1.3 Comparison between API and Crawler Methods

Although the api.py and crawl.py modules target different access scenarios (with Token and without Token), their core implementation logic and code structure are very similar, with only minor differences:

Similarities:

The ultimate goal is the same: both obtain Yuque documents and save them in markdown format
Image processing logic is completely the same: both download remote images and replace them with local links
The output format is consistent: both generate markdown documents and SUMMARY.md index files
Most auxiliary functions are common: such as _retry_get, _ensure_dir, _save_binary, etc.

Minor differences:

Different access permissions: The API method requires a Token and can access private documents; the crawler method does not require a Token and can only access public documents
Different data acquisition methods: The API method directly calls the official API interface to obtain data; the crawler method obtains data by parsing page content and calling unofficial APIs
Different URL parsing methods: The API method parses the URL to obtain group_login and book_slug; the crawler method extracts embedded JSON data from the page
Details of directory structure processing: The crawler method supports more complex hierarchical directory structures

2.2 Knowledge Base Construction Module

The create_faiss_knowledge_base.py module is responsible for building the obtained documents into a FAISS vector knowledge base. Its main functions are:

Document chunking: Split long documents into appropriately sized chunks
Vectorization processing: Use Embedding models (such as GTE-large) to convert text into vector representations
Batch processing and retry mechanism: Support batch processing of documents and automatic retry on failure
FAISS index construction: Create efficient vector indexes for similarity search
Persistent storage: Save indexes and document metadata to the local file system

This module provides a complete asynchronous API and supports high-concurrency processing of large numbers of documents.

2.3 Query Service Module

2.3.1 FAISS Knowledge Base Query

The query_knowledge_base.py module implements semantic search and LLM answer generation functions based on FAISS:

Similarity search: Convert user queries into vectors and find the most similar document fragments in the FAISS index
Context construction: Construct the retrieved document fragments into context
LLM call: Send the context and user query to the large language model (such as Qwen2-7B) to generate answers
Result formatting: Return answers to users in a user-friendly format

Usage:

python query_knowledge_base.py <knowledge base name> <query content>
Example: python query_knowledge_base.py nju history of Nanjing University

2.3.2 DeepWiki Query

The query_deepwiki.py module provides the function of querying documents in GitHub repositories through the DeepWiki API:

Request construction: Encapsulate user queries into DeepWiki API request format
Asynchronous communication: Use aiohttp to implement asynchronous requests and polling
Result parsing: Extract and format the results returned by DeepWiki

Usage:

python query_deepwiki.py [repository name] <query content>
Example 1: python query_deepwiki.py Nanjing University freshman admission guide
Example 2: python query_deepwiki.py ChouYuanjue/Guidance_for_New_NJUers Nanjing University freshman admission guide

2.4 Web Interface Module

The webui.py module implements a user-friendly Web interface based on Streamlit:

Configuration loading: Read system configuration from config.json
Query method selection: Support switching between DeepWiki and FAISS knowledge base query methods
Result display: Display query results in the form of text areas
Exception handling: Capture and display various error messages

Usage:

streamlit run webui.py

3. Configuration Instructions

The system is mainly set through two configuration files:

3.1 `.env` file

Store sensitive information in the following format:

SILICONFLOW_API_KEY=your_siliconflow_api_key
YUQUE_TOKEN=your_yuque_token

3.2 `config.json` file

Store system configuration parameters:

{
  "embedding": {
    "api_url": "embedding_api_url",
    "model_name": "embedding_model_name"
  },
  "llm": {
    "api_url": "llm_api_url",
    "model_name": "llm_model_name",
    "temperature": 0.7,
    "max_tokens": 2000
  },
  "vector_store": {
    "path": "vector_store",
    "default_collection": "nju"
  },
  "deepwiki": {
    "base_url": "https://api.devin.ai/ada/query",
    "referer": "https://deepwiki.com/",
    "default_repo_name": "ChouYuanjue/Guidance_for_New_NJUers"
  },
  "webui": {
    "port": 7860
  }
}

4. System Workflow

4.1 Knowledge Base Construction Process

Run api.py or crawl.py to obtain Yuque documents
Documents are saved in markdown format, and images are downloaded locally
Run create_faiss_knowledge_base.py to create a vector knowledge base
Knowledge base files (.index and .db) are saved to the vector_store directory

4.2 Query Process

Users submit queries through the Web interface or command line
The system executes different processes according to the selected query method:
- FAISS method: Convert query to vector → Search for similar documents in the knowledge base → Construct context → Call LLM to generate answers
- DeepWiki method: Send a request to the DeepWiki API → Poll to get results → Parse and return answers
Results are returned to the user interface for display

TO-DO LIST

Processing of images
Automatic document update
More scientific knowledge base construction (such as cluster summary)
Deploy the project on the server
Add PageRank

REPO: Yuque_LLM_Wiki

Yuque Document Conversion Tool

Simplified Reproduction Tutorial

Preliminary Implementation Ideas

Idea Analysis

Specific Implementation Ideas

A. Yuque API + DeepWiki (+WebUI)

B. Yuque API + faiss knowledge base + LLM (+WebUI)

C. Crawler + DeepWiki (+WebUI)

D. Crawler + faiss knowledge base + LLM (+WebUI)

E. Directly process web pages to build faiss knowledge base + LLM (+WebUI)

F. Implementation based on qbot

Final Implementation

1. System Architecture Overview

2. Detailed Description of Core Modules

2.1 Data Acquisition Module

2.1.3 Comparison between API and Crawler Methods

2.2 Knowledge Base Construction Module

2.3 Query Service Module

2.4 Web Interface Module

3. Configuration Instructions

3.1 .env file

3.2 config.json file

4. System Workflow

4.1 Knowledge Base Construction Process

4.2 Query Process

TO-DO LIST

3.1 `.env` file

3.2 `config.json` file