TeX-QB-Gen
Generate question banks directly from textbooks. Leverage OpenRouter's multimodal/text models and local OCR capabilities to extract math problems and solutions from images, web pages, or PDFs, and generate uniformly formatted TeX question banks.
English | 简体中文
TeX Question Bank Generator (TeX-QB-Gen)
Generate question banks directly from textbooks.
Leverage OpenRouter's multimodal/language models and local OCR capabilities to extract math problems and solutions from images, web pages, or PDFs, and generate uniformly formatted TeX question banks. Each question generates an independent TeX file, summarized by master.tex for batch compilation.
Background
This project was born out of the need to earn NJU labor education hours. When professors require you to organize question banks in TeX to earn these hours, why not adopt a faster solution?
Note: All test files in the
tests/directory are examples that can be successfully processed.
Key Features
- Multiple Input Channels: Images, PDFs (automatically distinguish text-based/scanned), URLs; batch fetching from Math Stack Exchange / MathOverflow using APIs.
- Intelligent Problem Splitting: Text PDFs locate problem boundaries via keywords; scanned versions only call OCR or multimodal models on suspected problem pages.
- Answer Strategy: Retain original answers/solutions; for minimal responses like "obvious" or "evident," automatically call LLM to generate
Solution (by LLM)with a "may be inaccurate" disclaimer. - Unified Output: Each question generates a
.texfile containingExercise,Answer,Solution, andSolution (by LLM)with metadata annotations;master.texis auto-generated with hierarchical structure based on input folder organization. Use--generate-masterto create a master document from existing.texfiles. - Caching and Retry: OpenRouter requests are cached to disk, automatically handling rate limits and retries to reduce redundant calls.
- Extensible Configuration: Override models, cache, StackExchange API Key, Tesseract path, etc., via
.env.
Quick Start
1. Prepare the Environment
python -m venv .venv
\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
2. Configure Keys
Copy .env.example to .env and fill in the following:
OPENROUTER_API_KEY: API Key from OpenRouter.STACKEXCHANGE_KEY(optional): Enhance StackExchange API quota.TESSERACT_CMD(optional): Path to Tesseract executable on Windows.TEXBANK_DEFAULT_LANGUAGE(optional): Default output language, supportsauto(retain original),zh(Chinese),en(English).TEXBANK_OCR_ENGINE(optional): OCR engine, supportstesseract(default) orpaddle(suitable for Chinese).
3. Tesseract OCR (Local Backup)
- Windows: Install via official package or Chocolatey.
- After installation, add the installation directory to
PATHor setTESSERACT_CMDin.env.
4. Run Command
$env:PYTHONPATH = "src"
python -m texbank.cli --input examples\sample.pdf --out out
Command Parameters:
| Parameter | Description |
|---|---|
--input/-i | Supports multiple inputs; use @file to read a list. |
--out/-o | Output directory. |
--keyword/-k | StackExchange search keywords. |
--max-items/-m | StackExchange question limit. |
--site/-s | math or mathoverflow. |
--generate-master | Generate master.tex from existing .tex files in directory. |
--no-llm-solution | Disable auto-generated solutions. |
--omit-answer-field | Ask multimodal extraction to return only exercise and solution, omitting answer. |
--language/-l | Output language (auto/zh/en), default auto. |
--paired-sequence | Configure paired question/answer PDFs with label template and ranges (e.g. `{chapter}.{section}.{n} |
--paired-start | Override the starting question index when iterating paired labels. |
--paired-max-gap | Stop after this many consecutive missing labels for a prefix context. |
--paired-max-questions | Limit total questions extracted per paired traversal. |
--paired-max-pages | Maximum distinct pages fetched for a matched label (before auto-appending the next page). |
--paired-prefix-limit | Limit generated values for prefix placeholders without explicit ranges. |
--paired-latest-only | When multiple matches exist, use the last occurrence and the following page for extraction. |
--verbose/-v | Print detailed progress logs. |
--debug | Print debug-level logs (including stack traces). |
After execution, the out/ directory will contain .tex files for each question and master.tex.
When the specified language differs from the original question language, the pipeline automatically calls LLM to translate the question, solution, and additional solutions. Metadata records whether translation occurred.
Detailed Processing Flow
- Smart Page Segmentation: Text-based PDFs extract candidate blocks using keywords (Exercise/Problem/Question/题目/例题, etc.), then generate cross-page combinations (up to 3 groups) to ensure questions and answers are on the same screen.
- Multimodal Priority Strategy: Candidate pages are rendered as images and submitted in batches to multimodal models, with prompts guiding the model to return only matching questions. If requests fail, logs are recorded, and it automatically falls back to local OCR or raw text parsing.
- Fallback for Scanned PDFs: For suspected scanned PDFs, pages are selected based on scan probability, converted to images, and reused in the image processing flow, avoiding page-by-page OCR for the entire document.
- Question Merging and Post-Processing: Cross-page extraction results are deduplicated and concatenated. Missing answers trigger LLM to generate
Solution (by LLM). Language detection and translation ensure the final output adheres to--languageorTEXBANK_DEFAULT_LANGUAGEsettings. - Observability: With
--verbose/--debug, real-time details of each block's page group attempts, fallback reasons, translation status, etc., are available, making it easier to locate bottlenecks in long processing times or failed responses.
Key Modules
texbank.config: Centralized management of settings for models, cache, OpenRouter, StackExchange, etc.texbank.llm_client: OpenRouter wrapper supporting text/multimodal requests, structured JSON parsing, retries, and disk caching.texbank.pdf_utils: PDF text/scan detection, keyword segmentation, scanned page image extraction.texbank.pipeline: Unified pipeline for processing image/PDF/URL inputs and outputtingProblemItemlists.texbank.texgen: RendersProblemIteminto TeX files and generatesmaster.tex.texbank.stackexchange: Calls StackExchange API to search and fetch complete questions and answers.
Model Recommendations
- Structured Extraction:
google/gemini-2.5-flash-lite(economical and sensitive to question structure). - Multimodal Understanding:
google/gemini-2.5-flash-image-preview(stable performance for images with formulas). - Solution Generation:
deepseek/deepseek-chat(cost-effective); switch tometa-llama/llama-3.1-70b-instructfor complex reasoning. - Supplementary Reasoning/Validation:
deepseek/deepseek-r1as a low-cost alternative.
For offline/local models, replace model names in .env and deploy HTTP proxies yourself.
Run Tests
$env:PYTHONPATH = "src"
pytest
Tests rely on a stubbed OpenRouter client to ensure basic logic correctness without real network calls.
FAQ
- 429 Too Many Requests: OpenRouter rate-limiting. Retry later or request a higher quota.
- Tesseract Not Found: Ensure
tesseract.exeis inPATHafter installation or setTESSERACT_CMD. - StackExchange Returns 502: API occasionally fluctuates; retry or reduce request volume.
- TeX Special Characters: Escape
%in comments before rendering; formulas remain unchanged.
TO-DO LIST
- Emit each problem to its own TeX file immediately after processing.
- Fix issues with Chinese PDFs being processed as garbled text, preventing correct problem segmentation.
- Generate a correct and aesthetically pleasing
master.texfile, including table of contents, chapters, and even references. - Handle obvious error responses, such as returning multiple questions at once, incorrect JSON, or Markdown syntax in LLM-generated answers.
- Test scanned PDFs.
- Test cases where answers appear at the end of the book.
- Construct appropriate prompts for cases involving communative diagrams and test them.
- Improve and test batch fetching from MathStackExchange and URL-based solutions.
- Enhance the compatibility of processing solutions to accommodate a wider range of possible textbook formats.
- Add asynchronous processing to improve speed.
- Include local solutions suitable for Chinese OCR.
- Build better preprocessing schemes to reduce hallucinations from large models.
- More precise PDF question segmentation (based on layout analysis or deep learning).
- Test higher-accuracy multimodal models and automatically compare costs.
- Add a Web UI for visualized generation and management of question banks.
- Add automatic validation or symbolic computation checks for
Solution (by LLM).
Feel free to open issues and submit PRs.