MinerU: High-Quality PDF to Markdown/JSON Converter

MinerU is an open-source, high-quality tool for converting PDF documents to Markdown and JSON formats. It's designed to provide precise document content extraction with advanced AI-powered parsing capabilities.

Overview

MinerU is a comprehensive solution for document parsing that combines multiple AI models to achieve high-quality content extraction from PDFs. It's particularly useful for:

  • Academic paper processing
  • Technical documentation conversion
  • Research data extraction
  • Content digitization projects
  • RAG (Retrieval-Augmented Generation) data preparation

Key Features

🔥 Core Capabilities

  • Multi-format Output: Convert PDFs to both Markdown and JSON formats
  • AI-Powered Parsing: Uses advanced vision-language models for accurate content recognition
  • Layout Analysis: Intelligent understanding of document structure and layout
  • OCR Integration: Built-in optical character recognition for scanned documents
  • Table Recognition: Advanced table extraction and formatting
  • Code Block Detection: Automatic identification and formatting of code sections

🚀 Advanced Features

  • Reading Order Optimization: AI-based content ordering for complex layouts
  • Multi-language Support: Handles various languages and scripts
  • Batch Processing: Efficient handling of multiple documents
  • API Interface: RESTful API for integration with other applications
  • Web UI: User-friendly web interface for document processing
  • Docker Support: Easy deployment with containerization

Installation

Prerequisites

System Requirements:

  • Operating System: Linux / Windows / macOS
  • Memory: 16GB minimum, 32GB+ recommended
  • Disk Space: 20GB+, SSD recommended
  • Python Version: 3.10-3.13

Hardware Requirements:

  • CPU: Any modern CPU for basic functionality
  • GPU: Turing architecture and later, 6GB+ VRAM (optional for acceleration)
  • Apple Silicon: Native support for M1/M2 chips
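
The requirements above can be checked before installing. The following is a minimal sketch, assuming only the figures listed in this section (Python 3.10-3.13 and 20GB of free disk space); the helper names `python_supported` and `free_disk_gb` are hypothetical and not part of MinerU.

```python
import shutil
import sys

# Bounds copied from the requirements list above.
MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 13)
MIN_FREE_DISK_GB = 20


def python_supported(version=None):
    """Return True if (major, minor) lies within 3.10-3.13 inclusive."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_PYTHON <= (major, minor) <= MAX_PYTHON


def free_disk_gb(path="."):
    """Return the free disk space at `path` in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 10**9


if __name__ == "__main__":
    print("Python version supported:", python_supported())
    print("Enough free disk space:", free_disk_gb() >= MIN_FREE_DISK_GB)
```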

Installation Methods

1. Using pip or uv

# Upgrade pip first
pip install --upgrade pip

# Install uv package manager
pip install uv

# Install MinerU with core features
uv pip install -U "mineru[core]"

2. From Source Code

# Clone the repository
git clone https://github.com/opendatalab/MinerU.git
cd MinerU

# Install in development mode
# (quotes keep shells such as zsh from expanding the brackets)
uv pip install -e ".[core]"

3. Docker Deployment

MinerU provides Docker images for easy deployment:

# Pull the official image
docker pull opendatalab/mineru:latest

# Run with basic configuration
docker run -p 8000:8000 opendatalab/mineru:latest

Usage

Command Line Interface

The simplest way to use MinerU:

# Basic usage
mineru -p <input_path> -o <output_path>

# Example
mineru -p document.pdf -o output/
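
To drive the command from a script, one option is to shell out to it. This is a minimal sketch that assumes only the `mineru -p <input_path> -o <output_path>` form shown above; the helper names `build_mineru_command` and `run_mineru` are hypothetical and not part of MinerU.

```python
import subprocess
from pathlib import Path


def build_mineru_command(input_path, output_dir):
    """Build the argument list for `mineru -p <input_path> -o <output_path>`."""
    return ["mineru", "-p", str(Path(input_path)), "-o", str(Path(output_dir))]


def run_mineru(input_path, output_dir):
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    command = build_mineru_command(input_path, output_dir)
    return subprocess.run(command, check=True, capture_output=True, text=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.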

Advanced Command Line Options

# Specify output format
mineru -p input.pdf -o output/ --format markdown

# Batch processing
mineru -p input_folder/ -o output/ --batch

# Custom configuration
mineru -p input.pdf -o output/ --config custom_config.json

# Verbose output
mineru -p input.pdf -o output/ --verbose

Python API

from mineru import MinerU

# Initialize MinerU
mineru = MinerU()

# Process a single document
result = mineru.process("document.pdf", output_dir="output/")

# Process with custom options
result = mineru.process(
    "document.pdf",
    output_dir="output/",
    format="markdown",
    config={"ocr": True, "table_detection": True},
)

Web Interface

MinerU provides a web UI for easy document processing:

# Start the web server
mineru web --host 0.0.0.0 --port 8000

Access the interface at http://localhost:8000

Configuration

Basic Configuration File

Create a mineru.json configuration file:

{
  "output_format": "markdown",
  "ocr_enabled": true,
  "table_detection": true,
  "code_block_detection": true,
  "language_detection": true,
  "reading_order_optimization": true,
  "output_encoding": "utf-8"
}
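
A configuration file like this can be written and sanity-checked from Python. The sketch below assumes the key names in the example above and checks them only against that example, not against MinerU itself; `save_config` and `load_config` are hypothetical helpers.

```python
import json
from pathlib import Path

# Keys and defaults mirror the example mineru.json above (an assumption,
# not a schema published by MinerU).
DEFAULT_CONFIG = {
    "output_format": "markdown",
    "ocr_enabled": True,
    "table_detection": True,
    "code_block_detection": True,
    "language_detection": True,
    "reading_order_optimization": True,
    "output_encoding": "utf-8",
}


def save_config(config, path="mineru.json"):
    """Write the configuration as indented JSON."""
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")


def load_config(path="mineru.json"):
    """Load a config, reject unknown keys or wrong types, and fill in defaults."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    unknown = set(config) - set(DEFAULT_CONFIG)
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    for key, value in config.items():
        expected = type(DEFAULT_CONFIG[key])
        if not isinstance(value, expected):
            raise TypeError(f"{key} should be of type {expected.__name__}")
    return {**DEFAULT_CONFIG, **config}
```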

Advanced Configuration Options

{
  "processing": {
    "ocr": {
      "enabled": true,
      "language": "auto",
      "confidence_threshold": 0.8
    },
    "layout": {
      "detection_method": "ai",
      "table_recognition": true,
      "image_extraction": true
    },
    "output": {
      "format": ["markdown", "json"],
      "include_images": true,
      "preserve_formatting": true
    }
  },
  "performance": {
    "batch_size": 1,
    "max_workers": 4,
    "gpu_acceleration": true
  }
}
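
With a nested configuration, it is often convenient to keep one base file and apply small per-environment overrides. The following sketch shows a recursive merge over a subset of the structure above; `deep_merge` is a hypothetical helper and the key names are taken from the example, not verified against MinerU.

```python
import copy


def deep_merge(base, overrides):
    """Return a new dict with `overrides` merged recursively into `base`."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


base = {
    "processing": {
        "ocr": {"enabled": True, "language": "auto", "confidence_threshold": 0.8},
    },
    "performance": {"batch_size": 1, "max_workers": 4, "gpu_acceleration": True},
}

# A CPU-only profile changes two settings and leaves everything else intact.
cpu_profile = deep_merge(
    base, {"performance": {"gpu_acceleration": False, "max_workers": 2}}
)
```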

Output Formats

Markdown Output

MinerU generates clean, well-structured Markdown:

# Document Title

## Introduction

This is the main content with proper formatting.

### Subsection

- List item 1
- List item 2

| Column 1 | Column 2 |
| -------- | -------- |
| Data 1 | Data 2 |

```python
# Code block example
def hello_world():
    print("Hello, World!")
```

JSON Output

Structured JSON with metadata and content:

{
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "pages": 10,
    "processing_time": "2.5s"
  },
  "content": [
    {
      "type": "heading",
      "level": 1,
      "text": "Document Title",
      "page": 1
    },
    {
      "type": "paragraph",
      "text": "This is the main content...",
      "page": 1
    },
    {
      "type": "table",
      "data": [["Column 1", "Column 2"], ["Data 1", "Data 2"]],
      "page": 2
    }
  ]
}
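
Structured output of this shape is straightforward to post-process, for example when preparing text for a RAG pipeline. The sketch below renders the three block types from the example above back into Markdown. It assumes exactly the schema shown here (`type`, `level`, `text`, `data`); the function names are hypothetical, and real output files may use different field names.

```python
def render_block(block):
    """Render one content block (heading, paragraph, or table) as Markdown."""
    kind = block["type"]
    if kind == "heading":
        return "#" * block["level"] + " " + block["text"]
    if kind == "paragraph":
        return block["text"]
    if kind == "table":
        # The first row is treated as the header row.
        header, *rows = block["data"]
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in rows]
        return "\n".join(lines)
    raise ValueError(f"Unsupported block type: {kind}")


def render_markdown(document):
    """Join all rendered blocks with blank lines between them."""
    return "\n\n".join(render_block(block) for block in document["content"])
```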

Performance Optimization

GPU Acceleration

For better performance with large documents:

# Enable GPU acceleration
mineru -p document.pdf -o output/ --gpu

# Specify GPU device
mineru -p document.pdf -o output/ --gpu-device 0

Batch Processing

# Process multiple documents
mineru -p input_folder/ -o output/ --batch --workers 4
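
When per-file control is needed (for example, a separate output folder and exit code per document), a batch can also be orchestrated from Python. This sketch assumes only the basic `mineru -p <input_path> -o <output_path>` form; `plan_batch` and `run_batch` are hypothetical helpers, and the `runner` parameter exists so the logic can be exercised without MinerU installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def plan_batch(input_dir, output_dir):
    """Return one (pdf_path, command) pair per PDF, sorted by file name."""
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    return [
        (pdf, ["mineru", "-p", str(pdf), "-o", str(Path(output_dir) / pdf.stem)])
        for pdf in pdfs
    ]


def run_batch(jobs, workers=4, runner=None):
    """Run the planned commands concurrently; return {pdf_name: exit_code}."""
    runner = runner or (lambda command: subprocess.run(command).returncode)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(runner, [command for _, command in jobs]))
    return {pdf.name: code for (pdf, _), code in zip(jobs, codes)}
```

Running several conversions at once multiplies memory use, so a small `workers` value is a safer starting point on machines near the minimum requirements.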

Memory Optimization

# Limit memory usage
mineru -p document.pdf -o output/ --max-memory 8GB

Troubleshooting

Common Issues

  1. Installation Problems

    # Clear pip cache
    pip cache purge

    # Reinstall with clean environment
    pip uninstall mineru
    pip install "mineru[core]"
  2. Memory Issues

    • Reduce batch size
    • Use smaller documents
    • Increase system memory
  3. GPU Issues

    • Check CUDA installation
    • Verify GPU compatibility
    • Use CPU-only mode if needed
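
For the GPU checks above, a quick first step on NVIDIA systems is to ask `nvidia-smi` what it can see. The sketch below is an assumption-laden diagnostic, not something MinerU provides: it covers NVIDIA GPUs only, relies on `nvidia-smi` being on the PATH, and compares against the 6GB VRAM figure from the hardware requirements. The function names are hypothetical.

```python
import shutil
import subprocess

MIN_VRAM_MIB = 6 * 1024  # the 6GB+ VRAM figure from the hardware requirements


def parse_gpu_line(line):
    """Parse one 'name, memory_mib' CSV line into (name, memory_mib)."""
    name, memory = line.rsplit(",", 1)
    return name.strip(), int(memory.strip())


def list_gpus():
    """Return [(name, memory_mib), ...], or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.splitlines():
        try:
            gpus.append(parse_gpu_line(line))
        except ValueError:
            continue  # skip lines without a numeric memory value
    return gpus


def has_usable_gpu():
    """Return True if any detected GPU meets the VRAM threshold."""
    return any(memory >= MIN_VRAM_MIB for _, memory in list_gpus())
```

An empty list means either no NVIDIA GPU or a driver problem, in which case CPU-only mode is the fallback.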

Debug Mode

# Enable debug logging
mineru -p document.pdf -o output/ --debug

# Save debug logs
mineru -p document.pdf -o output/ --debug --log-file debug.log

Integration Examples

With Python Applications

import asyncio
from mineru import MinerU

async def process_documents():
    mineru = MinerU()

    # Process multiple documents asynchronously
    tasks = [
        mineru.process_async("doc1.pdf", "output/"),
        mineru.process_async("doc2.pdf", "output/"),
        mineru.process_async("doc3.pdf", "output/"),
    ]

    results = await asyncio.gather(*tasks)
    return results
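
Launching every document at once can exhaust memory, so it may help to cap how many run concurrently. The following sketch bounds concurrency with an `asyncio.Semaphore` and drives the command-line tool as a subprocess. It assumes only the `mineru -p <input_path> -o <output_path>` form; `run_cli` and `process_all` are hypothetical names, and the `worker` parameter allows substituting any other coroutine.

```python
import asyncio


async def run_cli(pdf_path, output_dir):
    """Run `mineru -p <pdf> -o <dir>` as a subprocess; return its exit code."""
    process = await asyncio.create_subprocess_exec(
        "mineru", "-p", pdf_path, "-o", output_dir
    )
    return await process.wait()


async def process_all(pdf_paths, output_dir, limit=2, worker=run_cli):
    """Process all paths with at most `limit` workers running at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(path):
        async with semaphore:
            return await worker(path, output_dir)

    # gather preserves input order, so results line up with pdf_paths.
    return await asyncio.gather(*(guarded(path) for path in pdf_paths))
```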

With Web Frameworks

from flask import Flask, request, jsonify
from mineru import MinerU

app = Flask(__name__)
mineru = MinerU()

@app.route('/process', methods=['POST'])
def process_pdf():
    file = request.files['pdf']
    result = mineru.process(file, "temp_output/")
    return jsonify(result)

Best Practices

Document Preparation

  1. Quality: Use high-quality PDFs for better results
  2. Format: Prefer text-based PDFs over scanned documents
  3. Size: Break large documents into smaller chunks if needed
  4. Language: Ensure proper language detection for multilingual documents
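
For the advice on breaking up large documents, the page arithmetic can be sketched as follows. The helper `page_ranges` is hypothetical; it only computes zero-based, inclusive page ranges and leaves the actual splitting to whatever PDF tool is in use.

```python
def page_ranges(total_pages, chunk_size):
    """Split pages 0..total_pages-1 into inclusive (start, end) ranges."""
    if total_pages < 0 or chunk_size <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, total_pages) - 1)
        for start in range(0, total_pages, chunk_size)
    ]


# A 10-page document in chunks of 4 pages:
print(page_ranges(10, 4))  # [(0, 3), (4, 7), (8, 9)]
```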

Processing Workflow

  1. Pre-processing: Clean and validate input documents
  2. Configuration: Use appropriate settings for your use case
  3. Post-processing: Review and refine output as needed
  4. Validation: Verify output quality and completeness
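
The validation step can start with simple automated checks before any manual review. This is a minimal sketch, assuming the output directory contains `.md` files; `validate_output` is a hypothetical helper, and the two checks (empty files and undecodable characters) are examples rather than a complete quality test.

```python
from pathlib import Path


def validate_output(output_dir, min_chars=1):
    """Return a list of problems found among Markdown files under output_dir."""
    problems = []
    markdown_files = sorted(Path(output_dir).rglob("*.md"))
    if not markdown_files:
        problems.append("no Markdown files found")
    for md in markdown_files:
        # errors="replace" turns invalid bytes into U+FFFD so they can be flagged.
        text = md.read_text(encoding="utf-8", errors="replace")
        if len(text.strip()) < min_chars:
            problems.append(f"{md.name}: empty output")
        elif "\ufffd" in text:
            problems.append(f"{md.name}: contains undecodable characters")
    return problems
```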

Performance Tips

  1. Batch Processing: Group similar documents together
  2. Resource Management: Monitor memory and CPU usage
  3. Caching: Cache processed results when possible
  4. Parallel Processing: Use multiple workers for large batches
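
One way to implement the caching tip is to key results by a hash of the PDF's contents, so a renamed or re-uploaded copy of the same file is not processed twice. The sketch below assumes a cache layout of one directory per SHA-256 digest; all function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_output_dir(pdf_path, cache_root):
    """Return the cache directory for this PDF, keyed by its content hash."""
    return Path(cache_root) / file_digest(pdf_path)


def needs_processing(pdf_path, cache_root):
    """Return True unless a non-empty cache directory already exists."""
    output_dir = cached_output_dir(pdf_path, cache_root)
    return not (output_dir.is_dir() and any(output_dir.iterdir()))
```

Writing results into `cached_output_dir(...)` after a successful run makes later calls to `needs_processing` return False for identical files.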

Limitations and Known Issues

Current Limitations

  • Reading Order: Content may be emitted out of order for extremely complex layouts
  • Vertical Text: Limited support for vertical text orientation
  • Code Blocks: Not yet supported in layout model
  • Complex Tables: May have row/column recognition errors
  • Language Support: Some lesser-known languages may have OCR issues

Known Issues

  • Comic books and art albums may not parse well
  • Primary school textbooks and exercises have limited support
  • Some mathematical formulas may not render correctly in Markdown
  • Complex chemical formulas may not be recognized properly

Community and Support

Resources

Contributing

MinerU is open source and welcomes contributions:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

Citation

If you use MinerU in your research, please cite:

@misc{wang2024mineruopensourcesolutionprecise,
title={MinerU: An Open-Source Solution for Precise Document Content Extraction},
author={Bin Wang and Chao Xu and Xiaomeng Zhao and Linke Ouyang and Fan Wu and Zhiyuan Zhao and Rui Xu and Kaiwen Liu and Yuan Qu and Fukai Shang and Bo Zhang and Liqun Wei and Zhihao Sui and Wei Li and Botian Shi and Yu Qiao and Dahua Lin and Conghui He},
year={2024},
eprint={2409.18839},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2409.18839},
}

MinerU is part of the OpenDataLab ecosystem, which includes:

  • PDF-Extract-Kit: Comprehensive toolkit for PDF content extraction
  • OmniDocBench: Benchmark for document parsing evaluation
  • Magic-HTML: Mixed web page extraction tool
  • Magic-Doc: Fast extraction for PPT/PPTX/DOC/DOCX/PDF files
  • LabelU: Multi-modal data annotation tool
  • LabelLLM: LLM dialogue annotation platform

License

MinerU is licensed under the AGPL-3.0 license. Some models in the project are based on YOLO, which is also AGPL-licensed and may restrict certain use cases.