MinerU: High-Quality PDF to Markdown/JSON Converter
MinerU is an open-source, high-quality tool for converting PDF documents to Markdown and JSON formats. It's designed to provide precise document content extraction with advanced AI-powered parsing capabilities.
Overview
MinerU is a comprehensive solution for document parsing that combines multiple AI models to achieve high-quality content extraction from PDFs. It's particularly useful for:
- Academic paper processing
- Technical documentation conversion
- Research data extraction
- Content digitization projects
- RAG (Retrieval-Augmented Generation) data preparation
Key Features
🔥 Core Capabilities
- Multi-format Output: Convert PDFs to both Markdown and JSON formats
- AI-Powered Parsing: Uses advanced vision-language models for accurate content recognition
- Layout Analysis: Intelligent understanding of document structure and layout
- OCR Integration: Built-in optical character recognition for scanned documents
- Table Recognition: Advanced table extraction and formatting
- Code Block Detection: Automatic identification and formatting of code sections
🚀 Advanced Features
- Reading Order Optimization: AI-based content ordering for complex layouts
- Multi-language Support: Handles various languages and scripts
- Batch Processing: Efficient handling of multiple documents
- API Interface: RESTful API for integration with other applications
- Web UI: User-friendly web interface for document processing
- Docker Support: Easy deployment with containerization
Installation
Prerequisites
System Requirements:
- Operating System: Linux / Windows / macOS
- Memory: 16GB minimum, 32GB or more recommended
- Disk Space: 20GB+, SSD recommended
- Python Version: 3.10-3.13
Hardware Requirements:
- CPU: Any modern CPU for basic functionality
- GPU: Turing architecture or later with 6GB+ VRAM (optional, used only for acceleration)
- Apple Silicon: Native support for M1/M2 chips
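The version and disk-space requirements above can be checked before installing. The following is a minimal sketch using only the Python standard library; the constants mirror the numbers listed above, and the helper names (`python_supported`, `free_disk_gb`) are illustrative, not part of MinerU.

```python
import shutil
import sys

# Bounds taken from the requirements listed above.
MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 13)
MIN_FREE_DISK_GB = 20


def python_supported(version=None):
    """Return True if (major, minor) falls in the documented 3.10-3.13 range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_PYTHON <= (major, minor) <= MAX_PYTHON


def free_disk_gb(path="."):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


if __name__ == "__main__":
    print("Python version OK:", python_supported())
    print("Enough free disk:", free_disk_gb() >= MIN_FREE_DISK_GB)
```

Memory and GPU checks are left out because the standard library has no portable way to query them.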
Installation Methods
1. Using pip/uv (Recommended)
# Upgrade pip first
pip install --upgrade pip
# Install uv package manager
pip install uv
# Install MinerU with core features
uv pip install -U "mineru[core]"
2. From Source Code
# Clone the repository
git clone https://github.com/opendatalab/MinerU.git
cd MinerU
# Install in development mode
# Quote the extras so shells such as zsh do not treat the brackets as a glob
uv pip install -e ".[core]"
3. Docker Deployment
MinerU provides Docker images for easy deployment:
# Pull the official image
docker pull opendatalab/mineru:latest
# Run with basic configuration
docker run -p 8000:8000 opendatalab/mineru:latest
Usage
Command Line Interface
The simplest way to use MinerU:
# Basic usage
mineru -p <input_path> -o <output_path>
# Example
mineru -p document.pdf -o output/
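The `-p`/`-o` invocation above can also be driven from a script. The following is a minimal sketch, assuming the `mineru` executable is on `PATH`; it uses only the two flags shown above, and the helper names `build_command` and `convert_all` are illustrative rather than part of MinerU.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, output_dir):
    """Build the argument list for `mineru -p <input_path> -o <output_path>`."""
    return ["mineru", "-p", str(pdf_path), "-o", str(output_dir)]


def convert_all(input_dir, output_dir, dry_run=True):
    """Convert every PDF in `input_dir`; with dry_run=True, only return the commands."""
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    commands = [build_command(pdf, output_dir) for pdf in pdfs]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if mineru exits with a non-zero status.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in convert_all(".", "output/"):
        print(" ".join(command))
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems with file names that contain spaces.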
Advanced Command Line Options
# Specify output format
mineru -p input.pdf -o output/ --format markdown
# Batch processing
mineru -p input_folder/ -o output/ --batch
# Custom configuration
mineru -p input.pdf -o output/ --config custom_config.json
# Verbose output
mineru -p input.pdf -o output/ --verbose
Python API
from mineru import MinerU
# Initialize MinerU
mineru = MinerU()
# Process a single document
result = mineru.process("document.pdf", output_dir="output/")
# Process with custom options
result = mineru.process(
    "document.pdf",
    output_dir="output/",
    format="markdown",
    config={"ocr": True, "table_detection": True},
)
Web Interface
MinerU provides a web UI for easy document processing:
# Start the web server
mineru web --host 0.0.0.0 --port 8000
Access the interface at http://localhost:8000
Configuration
Basic Configuration File
Create a mineru.json configuration file:
{
  "output_format": "markdown",
  "ocr_enabled": true,
  "table_detection": true,
  "code_block_detection": true,
  "language_detection": true,
  "reading_order_optimization": true,
  "output_encoding": "utf-8"
}
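A configuration file like the one above can be generated from a script, which helps catch misspelled keys early. This is a minimal sketch: the keys and defaults are copied from the example above, while the helpers `make_config` and `write_config`, and the choice to reject unknown keys, are assumptions made for illustration.

```python
import json
from pathlib import Path

# Keys and values copied from the basic configuration shown above.
BASIC_CONFIG = {
    "output_format": "markdown",
    "ocr_enabled": True,
    "table_detection": True,
    "code_block_detection": True,
    "language_detection": True,
    "reading_order_optimization": True,
    "output_encoding": "utf-8",
}


def make_config(overrides=None):
    """Return the basic config with `overrides` applied, rejecting unknown keys."""
    overrides = overrides or {}
    unknown = set(overrides) - set(BASIC_CONFIG)
    if unknown:
        raise KeyError(f"unknown configuration keys: {sorted(unknown)}")
    return {**BASIC_CONFIG, **overrides}


def write_config(path, overrides=None):
    """Serialize the config to `path` as indented JSON and return it."""
    config = make_config(overrides)
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

For example, `write_config("mineru.json", {"ocr_enabled": False})` would write the same file with OCR turned off.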
Advanced Configuration Options
{
  "processing": {
    "ocr": {
      "enabled": true,
      "language": "auto",
      "confidence_threshold": 0.8
    },
    "layout": {
      "detection_method": "ai",
      "table_recognition": true,
      "image_extraction": true
    },
    "output": {
      "format": ["markdown", "json"],
      "include_images": true,
      "preserve_formatting": true
    }
  },
  "performance": {
    "batch_size": 1,
    "max_workers": 4,
    "gpu_acceleration": true
  }
}
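With nested settings like these, it is often convenient to keep one set of defaults and override only a few values per run. The sketch below shows a recursive merge over plain dictionaries; `DEFAULTS` reproduces part of the example above, and `deep_merge` is an illustrative helper, not a MinerU function.

```python
import copy

# A subset of the advanced configuration shown above.
DEFAULTS = {
    "processing": {
        "ocr": {"enabled": True, "language": "auto", "confidence_threshold": 0.8},
    },
    "performance": {"batch_size": 1, "max_workers": 4, "gpu_acceleration": True},
}


def deep_merge(base, overrides):
    """Return a new dict in which nested values from `overrides` replace those in `base`."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them key by key instead of replacing wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    cpu_only = deep_merge(DEFAULTS, {"performance": {"gpu_acceleration": False}})
    print(cpu_only["performance"])
```

The merge copies its inputs, so the shared defaults are never modified by a per-run override.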
Output Formats
Markdown Output
MinerU generates clean, well-structured Markdown:
# Document Title
## Introduction
This is the main content with proper formatting.
### Subsection
- List item 1
- List item 2
| Column 1 | Column 2 |
| -------- | -------- |
| Data 1 | Data 2 |
```python
# Code block example
def hello_world():
    print("Hello, World!")
```
### JSON Output
Structured JSON with metadata and content:
```json
{
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "pages": 10,
    "processing_time": "2.5s"
  },
  "content": [
    {
      "type": "heading",
      "level": 1,
      "text": "Document Title",
      "page": 1
    },
    {
      "type": "paragraph",
      "text": "This is the main content...",
      "page": 1
    },
    {
      "type": "table",
      "data": [["Column 1", "Column 2"], ["Data 1", "Data 2"]],
      "page": 2
    }
  ]
}
```
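The JSON structure above is straightforward to consume downstream. As a minimal sketch, assuming the output has exactly the `content` item types shown in the example (`heading`, `paragraph`, `table`), the following renders it back to Markdown; the function names are illustrative and not part of MinerU.

```python
def table_to_markdown(rows):
    """Render a list of rows as a Markdown table, treating the first row as the header."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def item_to_markdown(item):
    """Convert one entry of the `content` list to a Markdown fragment."""
    kind = item["type"]
    if kind == "heading":
        return "#" * item["level"] + " " + item["text"]
    if kind == "paragraph":
        return item["text"]
    if kind == "table":
        return table_to_markdown(item["data"])
    raise ValueError(f"unsupported item type: {kind}")


def document_to_markdown(document):
    """Join all content items, separated by blank lines."""
    return "\n\n".join(item_to_markdown(item) for item in document["content"])
```

Raising on unknown item types, rather than skipping them, makes it obvious when the real output contains types this sketch does not cover.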
Performance Optimization
GPU Acceleration
For better performance with large documents:
# Enable GPU acceleration
mineru -p document.pdf -o output/ --gpu
# Specify GPU device
mineru -p document.pdf -o output/ --gpu-device 0
Batch Processing
# Process multiple documents
mineru -p input_folder/ -o output/ --batch --workers 4
Memory Optimization
# Limit memory usage
mineru -p document.pdf -o output/ --max-memory 8GB
Troubleshooting
Common Issues
- Installation Problems
  # Clear pip cache
  pip cache purge
  # Reinstall with clean environment
  pip uninstall mineru
  pip install "mineru[core]"
- Memory Issues
  - Reduce batch size
  - Use smaller documents
  - Increase system memory
- GPU Issues
  - Check CUDA installation
  - Verify GPU compatibility
  - Use CPU-only mode if needed
Debug Mode
# Enable debug logging
mineru -p document.pdf -o output/ --debug
# Save debug logs
mineru -p document.pdf -o output/ --debug --log-file debug.log
Integration Examples
With Python Applications
import asyncio
from mineru import MinerU

async def process_documents():
    mineru = MinerU()
    # Process multiple documents asynchronously
    tasks = [
        mineru.process_async("doc1.pdf", "output/"),
        mineru.process_async("doc2.pdf", "output/"),
        mineru.process_async("doc3.pdf", "output/"),
    ]
    results = await asyncio.gather(*tasks)
    return results
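Launching every document at once can exhaust memory on large batches. One common way to cap the number of conversions in flight is an `asyncio.Semaphore`. The sketch below is independent of MinerU: `convert_stub` is a hypothetical stand-in for a real asynchronous conversion call, and `process_bounded` is an illustrative helper.

```python
import asyncio


async def convert_stub(path):
    """Hypothetical stand-in for a real asynchronous conversion call."""
    await asyncio.sleep(0)  # yield control, as real I/O-bound work would
    return f"{path}: done"


async def process_bounded(paths, convert, limit=2):
    """Run `convert` over `paths`, with at most `limit` conversions in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(path):
        async with semaphore:
            return await convert(path)

    # gather returns results in the same order as `paths`.
    return await asyncio.gather(*(guarded(path) for path in paths))


if __name__ == "__main__":
    print(asyncio.run(process_bounded(["doc1.pdf", "doc2.pdf", "doc3.pdf"], convert_stub)))
```

Because the conversion function is passed in as a parameter, the same wrapper works with any coroutine that accepts a path.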
With Web Frameworks
from flask import Flask, request, jsonify
from mineru import MinerU

app = Flask(__name__)
mineru = MinerU()

@app.route('/process', methods=['POST'])
def process_pdf():
    file = request.files['pdf']
    result = mineru.process(file, "temp_output/")
    return jsonify(result)
Best Practices
Document Preparation
- Quality: Use high-quality PDFs for better results
- Format: Prefer text-based PDFs over scanned documents
- Size: Break large documents into smaller chunks if needed
- Language: Ensure proper language detection for multilingual documents
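For the "Size" recommendation, the chunk boundaries can be computed up front. The following sketch only calculates 1-based, inclusive page ranges; actually splitting the PDF requires a PDF library and is not shown. The function name `page_ranges` is illustrative.

```python
def page_ranges(total_pages, chunk_size):
    """Split pages 1..total_pages into inclusive (start, end) ranges of at most chunk_size pages."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


if __name__ == "__main__":
    # A 10-page document in chunks of 4 pages.
    print(page_ranges(10, 4))
```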
Processing Workflow
- Pre-processing: Clean and validate input documents
- Configuration: Use appropriate settings for your use case
- Post-processing: Review and refine output as needed
- Validation: Verify output quality and completeness
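The pre-processing step can begin with a cheap sanity check that each input really is a PDF. A PDF file starts with the header `%PDF-`, so the sketch below tests for that signature. It is deliberately strict: files with stray bytes before the header are rejected even though some readers tolerate them. The helper names are illustrative.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data):
    """Return True if the byte string starts with the PDF header signature."""
    return data.startswith(PDF_MAGIC)


def validate_pdf(path):
    """Return True if `path` is an existing file whose first bytes are the PDF signature."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as handle:
        return looks_like_pdf(handle.read(len(PDF_MAGIC)))
```

Reading only the first few bytes keeps the check fast even for very large files.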
Performance Tips
- Batch Processing: Group similar documents together
- Resource Management: Monitor memory and CPU usage
- Caching: Cache processed results when possible
- Parallel Processing: Use multiple workers for large batches
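The caching tip can be implemented by keying results on a hash of the file contents, so a renamed or re-uploaded copy of the same PDF is not converted twice. This is a minimal sketch using the standard library; `convert` is a caller-supplied function that takes the PDF bytes and returns Markdown text, and both helper names are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def content_key(data):
    """Return the SHA-256 hex digest of the file contents, used as the cache key."""
    return hashlib.sha256(data).hexdigest()


def cached_convert(pdf_bytes, cache_dir, convert):
    """Return the cached Markdown for `pdf_bytes`, calling `convert` only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / (content_key(pdf_bytes) + ".md")
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    result = convert(pdf_bytes)
    cache_file.write_text(result, encoding="utf-8")
    return result
```

The cache key ignores configuration, so results produced under different settings would collide; including a hash of the settings in the key is one way to avoid that.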
Limitations and Known Issues
Current Limitations
- Reading Order: May be out of order in extremely complex layouts
- Vertical Text: Limited support for vertical text orientation
- Code Blocks: Not yet supported in layout model
- Complex Tables: May have row/column recognition errors
- Language Support: Some lesser-known languages may have OCR issues
Known Issues
- Comic books and art albums may not parse well
- Primary school textbooks and exercises have limited support
- Some mathematical formulas may not render correctly in Markdown
- Complex chemical formulas may not be recognized properly
Community and Support
Resources
- Official Documentation: opendatalab.github.io/MinerU/
- GitHub Repository: github.com/opendatalab/MinerU
- Online Demo: Available on ModelScope and HuggingFace
- Discord Community: Join for discussions and support
Contributing
MinerU is open source and welcomes contributions:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
Citation
If you use MinerU in your research, please cite:
@misc{wang2024mineruopensourcesolutionprecise,
title={MinerU: An Open-Source Solution for Precise Document Content Extraction},
author={Bin Wang and Chao Xu and Xiaomeng Zhao and Linke Ouyang and Fan Wu and Zhiyuan Zhao and Rui Xu and Kaiwen Liu and Yuan Qu and Fukai Shang and Bo Zhang and Liqun Wei and Zhihao Sui and Wei Li and Botian Shi and Yu Qiao and Dahua Lin and Conghui He},
year={2024},
eprint={2409.18839},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2409.18839},
}
Related Tools
MinerU is part of the OpenDataLab ecosystem, which includes:
- PDF-Extract-Kit: Comprehensive toolkit for PDF content extraction
- OmniDocBench: Benchmark for document parsing evaluation
- Magic-HTML: Mixed web page extraction tool
- Magic-Doc: Fast extraction for PPT/PPTX/DOC/DOCX/PDF files
- LabelU: Multi-modal data annotation tool
- LabelLLM: LLM dialogue annotation platform
License
MinerU is licensed under the AGPL-3.0 license. Note that some models in the project are based on YOLO, which follows the AGPL license and may impose restrictions on certain use cases.