MinerU: High-Quality PDF to Markdown/JSON Converter

MinerU is an open-source tool that converts PDF documents to Markdown and JSON. It is designed for precise content extraction, combining layout analysis, OCR, and table recognition.

Overview

MinerU is a comprehensive solution for document parsing that combines multiple AI models to achieve high-quality content extraction from PDFs. It's particularly useful for:

Academic paper processing
Technical documentation conversion
Research data extraction
Content digitization projects
RAG (Retrieval-Augmented Generation) data preparation

Key Features

🔥 Core Capabilities

Multi-format Output: Convert PDFs to both Markdown and JSON formats
AI-Powered Parsing: Uses advanced vision-language models for accurate content recognition
Layout Analysis: Intelligent understanding of document structure and layout
OCR Integration: Built-in optical character recognition for scanned documents
Table Recognition: Advanced table extraction and formatting
Code Block Detection: Automatic identification and formatting of code sections

🚀 Advanced Features

Reading Order Optimization: AI-based content ordering for complex layouts
Multi-language Support: Handles various languages and scripts
Batch Processing: Efficient handling of multiple documents
API Interface: RESTful API for integration with other applications
Web UI: User-friendly web interface for document processing
Docker Support: Easy deployment with containerization

Installation

Prerequisites

System Requirements:

Operating System: Linux / Windows / macOS
Memory: 16 GB minimum, 32 GB or more recommended
Disk Space: 20GB+, SSD recommended
Python Version: 3.10-3.13

Hardware Requirements:

CPU: Any modern CPU for basic functionality
GPU: NVIDIA Turing architecture or later with 6 GB+ VRAM (optional, for acceleration)
Apple Silicon: Native support for M1/M2 chips
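
Some of the requirements above can be checked before installing. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of MinerU. Memory is not checked, because the standard library has no portable way to query it.

```python
import shutil
import sys

# Thresholds taken from the requirements listed above.
MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 13)
MIN_FREE_DISK_GB = 20


def python_version_ok(version=None):
    """Return True if the (major, minor) version is within 3.10-3.13."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


def free_disk_gb(path="."):
    """Return the free disk space at `path` in gigabytes (1 GB = 1024**3 bytes)."""
    return shutil.disk_usage(path).free / 1024**3


def disk_space_ok(path=".", min_gb=MIN_FREE_DISK_GB):
    """Return True if at least `min_gb` gigabytes are free at `path`."""
    return free_disk_gb(path) >= min_gb


if __name__ == "__main__":
    print("Python version OK:", python_version_ok())
    print("Disk space OK:", disk_space_ok())
```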

Installation Methods

1. Using pip/uv (Recommended)

# Upgrade pip first
pip install --upgrade pip

# Install uv package manager
pip install uv

# Install MinerU with core features
uv pip install -U "mineru[core]"

2. From Source Code

# Clone the repository
git clone https://github.com/opendatalab/MinerU.git
cd MinerU

# Install in development mode
# Quote the extras so shells such as zsh do not expand the brackets
uv pip install -e ".[core]"

3. Docker Deployment

MinerU provides Docker images for easy deployment:

# Pull the official image
docker pull opendatalab/mineru:latest

# Run with basic configuration
docker run -p 8000:8000 opendatalab/mineru:latest

Usage

Command Line Interface

The simplest way to use MinerU:

# Basic usage
mineru -p <input_path> -o <output_path>

# Example
mineru -p document.pdf -o output/
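
To call the same command from a script, one option is a thin wrapper around `subprocess`. This sketch relies only on the `-p` and `-o` options shown above; the helper names `build_mineru_command` and `run_mineru` are hypothetical.

```python
import subprocess


def build_mineru_command(input_path, output_dir):
    """Build the argument list for `mineru -p <input_path> -o <output_path>`."""
    return ["mineru", "-p", str(input_path), "-o", str(output_dir)]


def run_mineru(input_path, output_dir):
    """Run the CLI and raise CalledProcessError if it exits with a non-zero status."""
    command = build_mineru_command(input_path, output_dir)
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch works without MinerU installed.
    print(build_mineru_command("document.pdf", "output/"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.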

Advanced Command Line Options

# Specify output format
mineru -p input.pdf -o output/ --format markdown

# Batch processing
mineru -p input_folder/ -o output/ --batch

# Custom configuration
mineru -p input.pdf -o output/ --config custom_config.json

# Verbose output
mineru -p input.pdf -o output/ --verbose

Python API

from mineru import MinerU

# Initialize MinerU
mineru = MinerU()

# Process a single document
result = mineru.process("document.pdf", output_dir="output/")

# Process with custom options
result = mineru.process(
    "document.pdf",
    output_dir="output/",
    format="markdown",
    config={"ocr": True, "table_detection": True}
)

Web Interface

MinerU provides a web UI for easy document processing:

# Start the web server
mineru web --host 0.0.0.0 --port 8000

Access the interface at http://localhost:8000

Configuration

Basic Configuration File

Create a mineru.json configuration file:

{
  "output_format": "markdown",
  "ocr_enabled": true,
  "table_detection": true,
  "code_block_detection": true,
  "language_detection": true,
  "reading_order_optimization": true,
  "output_encoding": "utf-8"
}
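
A file like this can also be generated from a script. The sketch below assumes only the keys shown above and rejects any other key, which catches typos early; `build_config` and `write_config` are illustrative helpers, not MinerU functions.

```python
import json

# Keys and values copied from the basic configuration shown above.
DEFAULT_CONFIG = {
    "output_format": "markdown",
    "ocr_enabled": True,
    "table_detection": True,
    "code_block_detection": True,
    "language_detection": True,
    "reading_order_optimization": True,
    "output_encoding": "utf-8",
}


def build_config(overrides=None):
    """Return DEFAULT_CONFIG updated with `overrides`, rejecting unknown keys."""
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(DEFAULT_CONFIG))
    if unknown:
        raise KeyError(f"unknown configuration keys: {unknown}")
    return {**DEFAULT_CONFIG, **overrides}


def write_config(path, overrides=None):
    """Write the merged configuration to `path` as indented JSON and return it."""
    config = build_config(overrides)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    return config


if __name__ == "__main__":
    print(json.dumps(build_config({"ocr_enabled": False}), indent=2))
```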

Advanced Configuration Options

{
  "processing": {
    "ocr": {
      "enabled": true,
      "language": "auto",
      "confidence_threshold": 0.8
    },
    "layout": {
      "detection_method": "ai",
      "table_recognition": true,
      "image_extraction": true
    },
    "output": {
      "format": ["markdown", "json"],
      "include_images": true,
      "preserve_formatting": true
    }
  },
  "performance": {
    "batch_size": 1,
    "max_workers": 4,
    "gpu_acceleration": true
  }
}
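
With nested settings like these, overriding a single value without retyping the whole structure calls for a recursive merge. This is a generic sketch, not MinerU's own loading logic; `deep_merge` is a hypothetical helper and the base dictionary reuses a subset of the keys above.

```python
import copy


def deep_merge(base, overrides):
    """Return a copy of `base` with `overrides` applied recursively to nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    base = {
        "processing": {"ocr": {"enabled": True, "language": "auto"}},
        "performance": {"batch_size": 1, "max_workers": 4, "gpu_acceleration": True},
    }
    # Only the GPU flag changes; sibling keys are preserved.
    print(deep_merge(base, {"performance": {"gpu_acceleration": False}}))
```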

Output Formats

Markdown Output

MinerU generates clean, well-structured Markdown:

# Document Title

## Introduction

This is the main content with proper formatting.

### Subsection

- List item 1
- List item 2

| Column 1 | Column 2 |
| -------- | -------- |
| Data 1   | Data 2   |

```python
# Code block example
def hello_world():
    print("Hello, World!")
```

JSON Output

Structured JSON with metadata and content:

```json
{
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "pages": 10,
    "processing_time": "2.5s"
  },
  "content": [
    {
      "type": "heading",
      "level": 1,
      "text": "Document Title",
      "page": 1
    },
    {
      "type": "paragraph",
      "text": "This is the main content...",
      "page": 1
    },
    {
      "type": "table",
      "data": [["Column 1", "Column 2"], ["Data 1", "Data 2"]],
      "page": 2
    }
  ]
}
```

Performance Optimization

GPU Acceleration

For better performance with large documents:

# Enable GPU acceleration
mineru -p document.pdf -o output/ --gpu

# Specify GPU device
mineru -p document.pdf -o output/ --gpu-device 0

Batch Processing

# Process multiple documents
mineru -p input_folder/ -o output/ --batch --workers 4
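
Before a folder run, it can help to list which files would be picked up. The sketch below simply collects PDF files with `pathlib`; it assumes nothing about how MinerU itself scans folders, and `find_pdfs` is an illustrative name.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return all PDF files under `folder`, recursively, in sorted order."""
    return sorted(
        path
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    for pdf in find_pdfs("."):
        print(pdf)
```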

Memory Optimization

# Limit memory usage
mineru -p document.pdf -o output/ --max-memory 8GB

Troubleshooting

Common Issues

Installation Problems

# Clear pip cache
pip cache purge

# Reinstall with clean environment
pip uninstall mineru
pip install "mineru[core]"

Memory Issues

- Reduce batch size
- Use smaller documents
- Increase system memory

GPU Issues

- Check CUDA installation
- Verify GPU compatibility
- Use CPU-only mode if needed

Debug Mode

# Enable debug logging
mineru -p document.pdf -o output/ --debug

# Save debug logs
mineru -p document.pdf -o output/ --debug --log-file debug.log

Integration Examples

With Python Applications

import asyncio
from mineru import MinerU

async def process_documents():
    mineru = MinerU()

    # Process multiple documents asynchronously
    tasks = [
        mineru.process_async("doc1.pdf", "output/"),
        mineru.process_async("doc2.pdf", "output/"),
        mineru.process_async("doc3.pdf", "output/")
    ]

    results = await asyncio.gather(*tasks)
    return results
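
Given the memory requirements listed earlier, it may be worth capping how many documents are processed at once. The following sketch shows one way to do that with `asyncio.Semaphore`. `gather_bounded` is a hypothetical helper, and `fake_process` is a stand-in that only simulates work; a real conversion coroutine would take its place.

```python
import asyncio


async def gather_bounded(items, worker, limit=2):
    """Run `worker(item)` for every item with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_process(path):
    """Hypothetical stand-in for a real conversion call; it only simulates work."""
    await asyncio.sleep(0.01)
    return f"processed {path}"


if __name__ == "__main__":
    documents = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
    print(asyncio.run(gather_bounded(documents, fake_process)))
```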

With Web Frameworks

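import os
import tempfile

from flask import Flask, request, jsonify
from mineru import MinerU

app = Flask(__name__)
mineru = MinerU()

@app.route('/process', methods=['POST'])
def process_pdf():
    # Reject requests that do not include a file under the 'pdf' form field
    file = request.files.get('pdf')
    if file is None:
        return jsonify({"error": "missing 'pdf' upload"}), 400
    # Save the upload to disk so process() receives a file path, as in the Python API example
    with tempfile.TemporaryDirectory() as tmp_dir:
        pdf_path = os.path.join(tmp_dir, "upload.pdf")
        file.save(pdf_path)
        result = mineru.process(pdf_path, output_dir="temp_output/")
    return jsonify(result)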

Best Practices

Document Preparation

Quality: Use high-quality PDFs for better results
Format: Prefer text-based PDFs over scanned documents
Size: Break large documents into smaller chunks if needed
Language: Ensure proper language detection for multilingual documents
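
For the size recommendation above, the page ranges of each chunk can be computed up front. This is a small sketch of the arithmetic only; `page_ranges` is an illustrative helper, and the actual splitting is left to whichever PDF library you prefer.

```python
def page_ranges(total_pages, chunk_size):
    """Split pages 1..total_pages into inclusive (start, end) ranges of at most chunk_size pages."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


if __name__ == "__main__":
    print(page_ranges(10, 4))  # [(1, 4), (5, 8), (9, 10)]
```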

Processing Workflow

Pre-processing: Clean and validate input documents
Configuration: Use appropriate settings for your use case
Post-processing: Review and refine output as needed
Validation: Verify output quality and completeness
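
The validation step can be partly automated. The sketch below assumes the JSON output has the shape shown in the JSON Output section (a `metadata` object and a `content` list of `heading`, `paragraph`, and `table` blocks); the required fields per block type are inferred from that sample and may not match every real output.

```python
import json

# Fields each block type carries in the sample JSON output.
REQUIRED_BY_TYPE = {
    "heading": {"level", "text", "page"},
    "paragraph": {"text", "page"},
    "table": {"data", "page"},
}


def validate_output(document):
    """Return a list of problems found in a parsed JSON output document."""
    problems = []
    if "metadata" not in document:
        problems.append("missing 'metadata'")
    content = document.get("content")
    if not isinstance(content, list) or not content:
        problems.append("'content' is missing or empty")
        return problems
    for index, block in enumerate(content):
        block_type = block.get("type")
        required = REQUIRED_BY_TYPE.get(block_type)
        if required is None:
            problems.append(f"block {index}: unknown type {block_type!r}")
            continue
        missing = sorted(required - set(block))
        if missing:
            problems.append(f"block {index}: missing {missing}")
    return problems


def validate_file(path):
    """Load a JSON output file and validate it."""
    with open(path, encoding="utf-8") as handle:
        return validate_output(json.load(handle))
```

An empty list means no structural problems were found; it does not say anything about the accuracy of the extracted text.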

Performance Tips

Batch Processing: Group similar documents together
Resource Management: Monitor memory and CPU usage
Caching: Cache processed results when possible
Parallel Processing: Use multiple workers for large batches
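
The caching tip can be implemented by keying results on file content rather than file name, so a renamed but unchanged PDF is not processed twice. This is a minimal sketch under that assumption; the cache layout and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_output_dir(pdf_path, cache_root):
    """Return the cache directory for this file's content and whether it already exists."""
    target = Path(cache_root) / file_digest(pdf_path)
    return target, target.is_dir()
```

A caller would process the document only when the second value is `False`, writing the results into the returned directory.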

Limitations and Known Issues

Current Limitations

Reading Order: May be out of order in extremely complex layouts
Vertical Text: Limited support for vertical text orientation
Code Blocks: Not yet supported in layout model
Complex Tables: May have row/column recognition errors
Language Support: Some lesser-known languages may have OCR issues

Known Issues

Comic books and art albums may not parse well
Primary school textbooks and exercises have limited support
Some mathematical formulas may not render correctly in Markdown
Complex chemical formulas may not be recognized properly

Community and Support

Resources

Official Documentation: opendatalab.github.io/MinerU/
GitHub Repository: github.com/opendatalab/MinerU
Online Demo: Available on ModelScope and HuggingFace
Discord Community: Join for discussions and support

Contributing

MinerU is open source and welcomes contributions:

Fork the repository
Create a feature branch
Make your changes
Submit a pull request

Citation

If you use MinerU in your research, please cite:

@misc{wang2024mineruopensourcesolutionprecise,
      title={MinerU: An Open-Source Solution for Precise Document Content Extraction},
      author={Bin Wang and Chao Xu and Xiaomeng Zhao and Linke Ouyang and Fan Wu and Zhiyuan Zhao and Rui Xu and Kaiwen Liu and Yuan Qu and Fukai Shang and Bo Zhang and Liqun Wei and Zhihao Sui and Wei Li and Botian Shi and Yu Qiao and Dahua Lin and Conghui He},
      year={2024},
      eprint={2409.18839},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.18839},
}

Related Tools

MinerU is part of the OpenDataLab ecosystem, which includes:

PDF-Extract-Kit: Comprehensive toolkit for PDF content extraction
OmniDocBench: Benchmark for document parsing evaluation
Magic-HTML: Mixed web page extraction tool
Magic-Doc: Fast extraction for PPT/PPTX/DOC/DOCX/PDF files
LabelU: Multi-modal data annotation tool
LabelLLM: LLM dialogue annotation platform

License

MinerU is licensed under the AGPL-3.0 license. Note that some models in the project are based on YOLO, which follows the AGPL license and may impose restrictions on certain use cases.
