[DRAFT] Crawl4AI
Crawl4AI is a versatile open-source web scraping tool designed for gathering data to train AI models. It offers a suite of features for extracting, cleaning, and formatting web data for use in machine learning. This documentation is a work in progress.
Key Features
- Configurable Crawling: Customize the crawling process by specifying target URLs, data extraction rules, and crawling depth.
- Data Extraction: Utilizes CSS selectors and regular expressions to efficiently extract relevant content from web pages.
- Data Cleaning: Offers automated cleaning to standardize data formats and remove irrelevant information.
- Scalability: Designed to handle large-scale crawling, making it suitable for extensive dataset creation.
- Parallel Crawling: Enhances speed by performing web requests in parallel.
- Output Formatting: Supports various output formats, including JSON, CSV, and TXT, for easy integration with AI training pipelines.
- Rate Limiting: Lets you set delays between requests to respect website terms of service and reduce the risk of IP blocking.
- User-Agent Rotation: Helps avoid detection by rotating user-agent strings during crawling sessions.
- Proxy Support: Allows crawling through proxy servers to maintain anonymity and bypass IP restrictions.
Installation
To install Crawl4AI, clone the GitHub repository and install the dependencies with pip:
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -r requirements.txt
Usage
Basic Crawling
To start a basic crawl, define a configuration file specifying the target URLs and extraction rules. Here's an example config.yaml:
urls:
  - "https://example.com"
extraction_rules:
  title: "h1::text"
  content: "div.article-content::text"
output_format: json
output_file: output.json
To run the scraper with the configuration file:
python crawl4ai.py --config config.yaml
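The exact output schema is determined by the tool, but each record pairs the extraction rule names from the config with the text they matched. A hypothetical output.json for a single page (the field values below are illustrative only) could look like this:

[
  {
    "title": "Example Domain",
    "content": "This domain is for use in illustrative examples in documents."
  }
]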
Advanced Configuration
Crawl4AI supports advanced configurations to handle more complex scenarios. Advanced features include:
- Depth Limiting: Specify the maximum depth to crawl within the website to prevent infinite loops.
- Rate Limiting: Set delay intervals between requests to avoid overloading the target website.
- User-Agent Rotation: Provide a list of user-agent strings to rotate during crawling.
- Proxy Support: Configure a list of proxies to route traffic through, enhancing anonymity and bypassing geo-restrictions.
Example of advanced features in config.yaml:
urls:
  - "https://example.com"
extraction_rules:
  title: "h1::text"
  content: "div.article-content::text"
depth_limit: 3
rate_limit: 1
user_agents:
  - "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
  - "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15"
proxies:
  - "http://proxy1.example.com:3128"
  - "http://proxy2.example.com:3128"
output_format: json
output_file: output.json
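The user_agents and proxies lists above are the pools that Crawl4AI rotates through while crawling. As a rough illustration of the underlying pattern, and not Crawl4AI's actual implementation, the following sketch uses the requests library to make a single fetch with a randomly chosen user agent routed through a randomly chosen proxy; the proxy hosts are the placeholder values from the config:

import random

import requests

# Pools mirroring the example config above; the proxy hosts are placeholders.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
]
PROXIES = ["http://proxy1.example.com:3128", "http://proxy2.example.com:3128"]

def fetch(url: str) -> str:
    # Pick a random identity and route for this request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxy = random.choice(PROXIES)
    response = requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
    response.raise_for_status()
    return response.text

Rotating both per request makes consecutive requests look less uniform, which is what the user_agents and proxies options automate for you.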
Data Cleaning
Crawl4AI can perform basic data cleaning operations such as removing HTML tags and standardizing text formatting. Cleaning rules are specified in the configuration file.
Example of data cleaning rules:
urls:
  - "https://example.com"
extraction_rules:
  title: "h1::text"
  content: "div.article-content::text"
data_cleaning:
  content:
    - "remove_html_tags"
    - "normalize_whitespace"
output_format: json
output_file: output.json
- remove_html_tags: Removes all HTML tags from the extracted content.
- normalize_whitespace: Converts multiple spaces into single spaces and trims leading/trailing whitespace.
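Conceptually, these two rules behave like the following standalone functions (a minimal sketch of the documented behavior, not Crawl4AI's own code):

import re

def remove_html_tags(text: str) -> str:
    # Strip anything that looks like an HTML tag from already-extracted text.
    return re.sub(r"<[^>]+>", "", text)

def normalize_whitespace(text: str) -> str:
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_whitespace(remove_html_tags("<p>Hello   <b>world</b></p> ")))  # Hello world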
Parallel Crawling
Crawl4AI supports parallel crawling to speed up data extraction. You can specify the number of concurrent requests to execute.
To enable parallel crawling, use the --threads option:
python crawl4ai.py --config config.yaml --threads 5
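Under the hood, parallel crawling boils down to issuing requests from a pool of workers. The sketch below shows that general pattern using Python's standard library and requests; it is not Crawl4AI's implementation, the URLs are placeholders, and the worker count mirrors --threads 5:

from concurrent.futures import ThreadPoolExecutor

import requests

URLS = [
    "https://example.com/page1",  # placeholder URLs
    "https://example.com/page2",
    "https://example.com/page3",
]
THREADS = 5  # mirrors --threads 5

def fetch(url):
    # Each worker fetches one URL at a time.
    response = requests.get(url, timeout=10)
    return url, response.status_code

with ThreadPoolExecutor(max_workers=THREADS) as pool:
    for url, status in pool.map(fetch, URLS):
        print(url, status)

When raising the thread count, keep the rate_limit setting in mind so the added concurrency does not overload the target site.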
Best Practices
- Respect robots.txt: Always check and respect the website's robots.txt file to understand crawling limitations (see the sketch after this list).
- Implement Error Handling: Add robust error handling to manage failed requests, timeouts, and other issues during crawling.
- Monitor IP Blocking: Regularly monitor for IP blocking and adjust crawling parameters accordingly.
- Regularly Update User-Agents: Keep the list of user-agent strings updated to reduce the chances of detection.
- Use Proxies: Employ proxy servers to distribute requests and avoid direct blocking of your IP.
- Save Progress: Save crawling progress regularly so you can resume after an interruption instead of starting over.
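For the robots.txt point, Python's standard library can check whether a URL is allowed before you crawl it. A minimal sketch, independent of Crawl4AI and using a made-up user-agent string:

from urllib.robotparser import RobotFileParser

USER_AGENT = "my-crawler"  # hypothetical; use whatever your crawler actually sends

parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

if parser.can_fetch(USER_AGENT, "https://example.com/some/page"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt")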
Contributing
Crawl4AI is an open-source project, and contributions are welcome.