Runnel Zhang

A Survey on Unsupervised Discovery and Structuring of Implicit Reasoning in Mathematics

Overview

As Retrieval-Augmented Generation (RAG) transitions from naïve applications to deep reasoning, processing highly unstructured, logic-intensive mathematical texts has emerged as a frontier challenge. This paper systematically reviews the technical evolution from "data segmentation" to "logical distillation." We focus on building unsupervised systems that discover and structure the implicit techniques of frontier research, a setting characterized by fuzzy concepts, surface heterogeneity, and a lack of annotated data. Comparing three major paradigms (statistics-based, inference-based, and generative reconstruction), we propose an innovative solution based on "Semantic Bleaching and Structural Fingerprinting."


1. Introduction: Paradigm Shift in RAG

1.1 From Retrieval to Reasoning

Traditional RAG systems (Naïve RAG) primarily focus on retrieval metrics (Recall/Precision), assuming that retrieved text snippets inherently contain the complete answer. However, advanced applications like DeepWiki and Cursor demonstrate that optimizing retrieval alone has reached a plateau.

  • Beyond the "R": The core competitive barrier for modern RAG is shifting toward Pre-Retrieval Optimization (data processing) and Agentic Reasoning (collaboration).
  • Industrial Case Studies:
    • Vibe Coding / Cursor: Integrates intent understanding and planning via Chain of Thought (CoT) rather than simple code retrieval.
    • DeepWiki: Transforms unstructured text into atomic propositions or Knowledge Graphs (KG) instead of simple document splitting.

1.2 The Challenge of Mathematical Reasoning

In frontier mathematical papers, traditional text processing often fails due to:

  1. Non-Entity Nature: Techniques are often logical actions or cognitive patterns rather than concrete named entities.
  2. Surface Heterogeneity: The same logical technique (e.g., "constructing an auxiliary object") appears vastly different in algebra, geometry, or analysis, leading to extreme distances in embedding space.
  3. Zero-Shot Discovery: Frontier research is replete with unnamed new techniques; without labeled data, systems must possess unsupervised clustering capabilities.

2. Problem Definition: Academic Alignment

To identify solutions within the academic literature, we translate business requirements into formal academic propositions:

| Business Description | Academic Domain / Proposition | Core Challenge |
| --- | --- | --- |
| How to segment text containing multiple techniques? | Discourse Segmentation / Propositional Chunking | Identifying Elementary Discourse Units (EDUs) and decomposing them into independent atomic propositions. |
| Identical underlying logic across different fields. | Semantic Bleaching / Cross-Domain Isomorphism | Removing domain-specific noise (variables, terminology) to extract the underlying logical skeleton. |
| How to discover new techniques without a library? | Unsupervised Concept Discovery / Constructive Machine Learning | Discovering frequent substructures in unlabeled data (Pattern Mining). |
| From natural language to logical code. | Autoformalization / Neuro-Symbolic Reasoning | Converting fuzzy natural-language (NL) descriptions into precise formal logic or code. |

3. Taxonomy of Segmentation Approaches

We categorize current mainstream paradigms based on their fundamental logic for determining "where to cut."

3.1 Paradigm I: Statistical & Embedding-based (Form-based)

Philosophy: "These two segments look different (lexical/vector distance is high), so I will split them."

  • Max-Min Semantic Chunking

    • Mechanism: Calculates cosine similarity between adjacent sentences to find Local Minima as breakpoints.
    • Pros: High throughput; no LLM inference required; suitable for coarse-grained filtering of massive documents.
    • Cons: "Correlation does not equal logical coherence." In proofs, "derivation" and "computation" steps may use different vocabularies and be erroneously split, while a single sentence containing two techniques cannot be decomposed.
    • Literature: [1] Kamradt, G. (2023). 5 Levels Of Text Splitting.
  • Late Chunking

    • Mechanism: Embeds the full document in one pass (preserving long-range dependencies), then pools token embeddings into chunk vectors afterward, so each chunk's vector reflects its surrounding context.
    • Pros: Mitigates Co-reference Loss caused by premature splitting.
    • Literature: [2] Günther, M., et al. (2024). Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models.
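The Max-Min splitting rule above can be sketched in a few lines. This is a minimal illustration, not a library API: the embeddings here are hand-made mock vectors standing in for the output of any sentence encoder.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def split_at_local_minima(embeddings, threshold=0.5):
    """Return indices i where the similarity between sentences i and i+1
    is a local minimum below `threshold` -- i.e. split after sentence i."""
    sims = [cosine(embeddings[i], embeddings[i + 1])
            for i in range(len(embeddings) - 1)]
    breaks = []
    for i, s in enumerate(sims):
        left = sims[i - 1] if i > 0 else 1.0
        right = sims[i + 1] if i + 1 < len(sims) else 1.0
        if s < threshold and s <= left and s <= right:
            breaks.append(i)
    return breaks

# Mock sentence embeddings: three "algebra" sentences, then two "geometry" ones.
emb = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
                [0.1, 1.0], [0.2, 0.9]])
print(split_at_local_minima(emb))  # → [2]: split after the third sentence
```

Note how the split falls exactly where the vocabulary shifts, which is precisely the paradigm's weakness in proofs: a derivation step that changes notation but not logic would also trigger a break.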

3.2 Paradigm II: Inference-based & Logical Perception (Flow-based)

Philosophy: "If the reading is smooth, it is one segment; if it is abrupt (low prediction probability), it is a new segment."

  • Meta-Chunking

    • Mechanism: Leverages LLM Next Token Prediction. A spike in Perplexity (PPL) indicates a logical transition or a new topic.
    • Pros: "Logic-Aware." Highly suitable for the rigorous logical chains in mathematical proofs; captures long-range causal relationships.
    • Literature: [3] Zhang, D., et al. (2024). Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception.
  • Attention-based SeqSeg

    • Mechanism: Analyzes internal Attention Maps of Transformers to find attention fractures.
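The PPL-spike decision rule can be sketched as follows. In a real Meta-Chunking setup `ppl_scores[i]` would be the perplexity of sentence i under a causal LM conditioned on the preceding text; here the scores are supplied directly so the boundary logic is testable in isolation.

```python
def chunk_by_ppl_spikes(sentences, ppl_scores, spike_ratio=1.5):
    """Start a new chunk when a sentence's perplexity jumps above
    `spike_ratio` times the running average of the current chunk --
    a proxy for a logical transition. `ppl_scores[i]` would normally
    come from scoring sentences[i] with a causal LM given the text
    before it; here they are passed in precomputed."""
    chunks, current, current_ppls = [], [], []
    for sent, ppl in zip(sentences, ppl_scores):
        if current and ppl > spike_ratio * (sum(current_ppls) / len(current_ppls)):
            chunks.append(current)
            current, current_ppls = [], []
        current.append(sent)
        current_ppls.append(ppl)
    if current:
        chunks.append(current)
    return chunks

sentences = ["Let f be continuous on [0, 1].",
             "By compactness, f attains its maximum.",
             "We now turn to the uniqueness argument.",
             "Suppose two maximizers exist."]
ppls = [12.0, 10.0, 30.0, 11.0]  # the spike at index 2 marks a topic shift
print(chunk_by_ppl_spikes(sentences, ppls))  # → two chunks, split before index 2
```

The running-average heuristic and `spike_ratio` are illustrative assumptions; the cited paper derives its boundary signal from the model's own token probabilities.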

3.3 Paradigm III: Generative Reconstruction & Extraction (Intent-based): Core Recommended Direction

Philosophy: "The original text is noisy; I will not merely split it, but rewrite it into structured data."

  • Propositional Chunking

    • Mechanism: Rewrites text into minimal, self-contained atomic propositions, restoring co-references so each unit stands alone, rather than splitting at surface boundaries.
    • Literature: [4] Chen, T., et al. (2023). Dense X Retrieval: What Retrieval Granularity Should We Use?
  • Problem-Oriented Segmentation (POSR)

    • Mechanism: Jointly trains segmentation with specific tasks (e.g., "identify this technique"). Splitting is a byproduct of recognition.
    • Literature: [5] Relevant research in Educational Data Mining & QA Segmentation.

4. Proposed Solution: Semantic Bleaching & Fingerprinting

For "fuzzy, unnamed, few-shot" techniques where traditional classifiers fail, we propose an unsupervised discovery pipeline based on the philosophy of Autoformalization.

4.1 Core Concept: From "Keyword Matching" to "Logical Operator Matching"

We propose an Agentic ETL Pipeline that transforms natural language proofs into abstract logical fingerprints.

4.2 Implementation Pipeline

Step 1: Atomic Decomposition

  • Action: Use high-reasoning models (e.g., DeepSeek-Math-7B-Instruct) to identify "logical boundaries."
  • Prompt Strategy: "Decompose this proof into irreducible logical steps. Each step must include input premises and output conclusions."
  • Reference: [6] Jiang, A. Q., et al. (2023). Draft, Sketch, and Prove. ICLR. (Introduces "Informal-to-Sketch" mapping).
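The decomposition step can be sketched as a thin wrapper around any reasoning model. `call_llm` and the JSON schema below are hypothetical scaffolding, not part of the cited work; the stub stands in for a real model call.

```python
import json

DECOMPOSE_PROMPT = """Decompose this proof into irreducible logical steps.
Each step must include its input premises and its output conclusion.
Return a JSON array: [{{"premises": [...], "action": "...", "conclusion": "..."}}]

Proof:
{proof}"""

def atomic_decompose(proof, call_llm):
    """`call_llm` is a hypothetical text-in/text-out hook to any reasoning
    model (the pipeline suggests DeepSeek-Math-7B-Instruct); it must return
    the JSON array requested in the prompt."""
    steps = json.loads(call_llm(DECOMPOSE_PROMPT.format(proof=proof)))
    # Every step must be a self-contained premises -> conclusion unit.
    for step in steps:
        assert {"premises", "action", "conclusion"} <= step.keys()
    return steps

# Stubbed model call, for illustration only.
def fake_llm(prompt):
    return json.dumps([{"premises": ["f is continuous on [0, 1]"],
                        "action": "apply the extreme value theorem",
                        "conclusion": "f attains a maximum"}])

steps = atomic_decompose("Since f is continuous on [0, 1], ...", fake_llm)
print(steps[0]["action"])  # → apply the extreme value theorem
```

Forcing an explicit premises/conclusion schema is what makes the later bleaching and fingerprinting steps tractable: each unit already exposes its logical interface.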

Step 2: Semantic Bleaching

  • Action: Perform De-contextualized Rewriting for each step.
  • Prompt: "Remove all specific variable names (f(x), ϵ) and specific mathematical objects (Groups, Manifolds). Describe the logical action of this step using the most abstract meta-language."
    • Input: "Construct auxiliary function g(x) = f(x) - x..."
    • Output: "Construct auxiliary object using difference method."
  • Effect: Homogenizes techniques across algebra and geometry into similar textual representations.
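In practice the bleaching is an LLM rewrite, but a crude regex pre-pass illustrates the idea and is cheap to run before (or alongside) the model call. The prompt constant and the `<obj>` placeholder are assumptions of this sketch.

```python
import re

BLEACH_PROMPT = ("Remove all specific variable names and specific mathematical "
                 "objects. Describe the logical action of this step using the "
                 "most abstract meta-language.\n\nStep: {step}")

# Crude pre-pass: mask function applications like g(x) and stray
# single-letter symbols so surface notation cannot dominate the
# downstream embedding. An LLM rewrite would go much further.
SYMBOL = re.compile(r"\b[a-zA-Z]\s*\(\s*[a-zA-Z]\s*\)|\b[f-hx-z]\b")

def crude_bleach(step: str) -> str:
    return re.sub(r"\s+", " ", SYMBOL.sub("<obj>", step)).strip()

print(crude_bleach("Construct auxiliary function g(x) = f(x) - x to reduce to a fixed point."))
# → Construct auxiliary function <obj> = <obj> - <obj> to reduce to a fixed point.
```

Even this shallow masking moves an algebraic and a geometric instance of "construct an auxiliary object" closer together in embedding space, which is the whole point of the step.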

Step 3: Structural Embedding (Logical Fingerprinting)

  • Action: Vectorize the "bleached" abstract descriptions.
  • Model Selection: Use models fine-tuned on Math/Code (e.g., Llemma) rather than general-purpose models.
  • Reference: [7] Azerbayev, Z., et al. (2023). Llemma: An Open Language Model for Mathematics.

Step 4: Unsupervised Clustering & Concept Naming

  • Action: Utilize HDBSCAN to process vectors. It automatically determines the number of clusters and identifies noise.
  • Naming: Extract centroid samples for each cluster and use an LLM to summarize the common pattern and assign a name (e.g., "Mapping to Fixed Point Theorem").
  • Reference: [8] Welleck, S., et al. (2022). NaturalProver: Grounded Mathematical Proof Generation. NeurIPS.

5. Conclusion and Future Work

The key to discovering implicit mathematical techniques lies not in finding a better Splitter, but in building a better Translator.

Future Research Priorities:

  1. Applications of Autoformalization: Utilizing intermediate products (Sketches / Tactics) of autoformalization as feature vectors instead of raw text.
  2. Integration with GraphRAG: Constructing the extracted techniques into a Knowledge Graph and discovering higher-order cognitive patterns via subgraph isomorphism algorithms.
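The GraphRAG direction above can be made concrete with NetworkX's VF2 matcher. The technique graph and pattern below are hypothetical examples; nodes are bleached step descriptions and edges mean "this step's conclusion feeds that step's premise".

```python
import networkx as nx
from networkx.algorithms import isomorphism

# Hypothetical technique graph extracted from one proof.
proof_graph = nx.DiGraph([
    ("construct auxiliary object", "apply fixed-point theorem"),
    ("apply fixed-point theorem", "conclude existence"),
    ("conclude existence", "verify uniqueness"),
])

# A higher-order pattern we want to detect across many proofs.
pattern = nx.DiGraph([
    ("construct auxiliary object", "apply fixed-point theorem"),
    ("apply fixed-point theorem", "conclude existence"),
])

# VF2 subgraph isomorphism: does the pattern occur inside the proof graph?
matcher = isomorphism.DiGraphMatcher(proof_graph, pattern)
print(matcher.subgraph_is_isomorphic())  # → True
```

At scale, patterns that recur across many proof graphs are exactly the "higher-order cognitive patterns" the survey targets; a node_match predicate comparing fingerprint vectors would replace the structure-only matching shown here.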

This framework transforms vague "cognitive commonalities" into computable "vector similarities," enabling the automated mining and knowledge base construction of frontier mathematical techniques.