Introducing AutoCrawler
AutoCrawler is a powerful image crawling tool designed to search and download high-quality images from both Google and Naver. It prioritizes efficiency, speed, and customization, making it an excellent choice for those looking to automate image collection tasks.
Getting Started with AutoCrawler
Here is a simple guide to start using AutoCrawler for your image downloading needs:
- Install Chrome: Ensure the Chrome browser is installed on your computer.
- Install Dependencies: Run `pip install -r requirements.txt` to install the necessary Python dependencies.
- Set Your Search Keywords: In the `keywords.txt` file, enter the keywords you want to search images for.
- Run the Script: Execute `python3 main.py` to start the crawling process.
- Access Your Images: Once the run completes, the images are saved in the 'download' directory.
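The keyword file from step 3 can also be generated programmatically. Here is a minimal sketch; the keywords are illustrative examples, and it assumes the crawler reads one keyword per line:

```python
# Write search keywords to keywords.txt, one per line.
# The keywords below are illustrative examples only.
keywords = ['cat', 'dog', 'bird']

with open('keywords.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(keywords) + '\n')
```

Running `python3 main.py` afterwards would then crawl each of these keywords in turn.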
Command Line Arguments
AutoCrawler allows user configuration through several command-line arguments, enhancing its flexibility:
- `--skip true`: Skips a keyword if its directory already exists; useful when re-downloading.
- `--threads 4`: Sets the number of download threads, enabling concurrent downloads.
- `--google true` / `--naver true`: Specifies whether to download images from Google and/or Naver.
- `--full false`: Determines whether to download full-resolution images instead of thumbnails.
- `--face false`: Activates face search mode for collecting images primarily of faces.
- `--no_gui auto`: Allows running without a GUI (headless mode), ideal for certain system configurations.
- `--limit 0`: Limits the number of images per site, with `0` indicating no limit.
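For illustration, the flags above could be modeled with an `argparse` parser like the sketch below. This is not the actual parser in `main.py`, whose types and defaults may differ; treat the details as assumptions:

```python
import argparse

def str2bool(value):
    # The flags are passed as 'true'/'false' strings on the command line.
    return str(value).lower() in ('true', '1', 'yes')

def build_parser():
    # Illustrative sketch of AutoCrawler's documented options, not its real parser.
    p = argparse.ArgumentParser(description='AutoCrawler options (sketch)')
    p.add_argument('--skip', type=str2bool, default=True,
                   help='skip keywords whose directory already exists')
    p.add_argument('--threads', type=int, default=4,
                   help='number of concurrent download threads')
    p.add_argument('--google', type=str2bool, default=True,
                   help='download images from Google')
    p.add_argument('--naver', type=str2bool, default=True,
                   help='download images from Naver')
    p.add_argument('--full', type=str2bool, default=False,
                   help='download full-resolution images instead of thumbnails')
    p.add_argument('--face', type=str2bool, default=False,
                   help='face search mode')
    p.add_argument('--no_gui', default='auto',
                   help="'auto', 'true', or 'false' (headless mode)")
    p.add_argument('--limit', type=int, default=0,
                   help='max images per site; 0 means no limit')
    return p

args = build_parser().parse_args(['--threads', '8', '--full', 'true'])
print(args.threads, args.full, args.limit)  # → 8 True 0
```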
Features and Capabilities
Full Resolution Mode: By setting `--full true`, users can download images at their highest available quality in formats such as JPG, GIF, and PNG.
Data Imbalance Detection: The tool identifies directories with a below-average number of files and advises whether to delete and re-download from these directories to ensure a balanced data set.
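A simplified stand-in for such a check (not AutoCrawler's actual implementation) could flag directories whose file count falls below the average:

```python
from pathlib import Path

def imbalanced_dirs(download_root):
    """Return keyword directories holding fewer files than the average.

    A simplified sketch of the imbalance check described above;
    AutoCrawler's actual heuristic may differ.
    """
    counts = {d.name: sum(1 for f in d.iterdir() if f.is_file())
              for d in Path(download_root).iterdir() if d.is_dir()}
    if not counts:
        return []
    average = sum(counts.values()) / len(counts)
    return sorted(name for name, count in counts.items() if count < average)
```

Directories returned by such a check are the candidates for deletion and re-download.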
Remote Crawling: AutoCrawler supports running remotely on a server using SSH. This includes setting up a virtual display and using tools like 'screen' to keep the process running even when SSH sessions close.
Customization and Troubleshooting
AutoCrawler is adaptable, allowing users to modify its capabilities by changing underlying scripts, such as `collect_links.py`.
For instance, Google’s image search interface may change, requiring adjustments to the script. Users can inspect the page with a browser’s developer tools (such as Chrome’s) and update the crawling logic accordingly, specifically the XPATH expressions used to identify elements.
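To make the idea concrete, here is a sketch using Python's standard-library ElementTree on a saved page fragment. Both the markup and the XPATH below are hypothetical stand-ins for whatever Chrome's developer tools actually reveal; the real `collect_links.py` evaluates its selectors against the live page in the browser.

```python
import xml.etree.ElementTree as ET

# A saved fragment of an image-results page. This markup is hypothetical,
# for illustration only; the real page structure changes over time.
PAGE = """
<div>
  <img class="thumb" src="http://example.com/a.jpg"/>
  <img class="thumb" src="http://example.com/b.jpg"/>
  <img class="icon" src="http://example.com/logo.png"/>
</div>
"""

# After inspecting the page in developer tools, an XPATH like this would be
# updated in collect_links.py to match the current element structure.
THUMBNAIL_XPATH = ".//img[@class='thumb']"

def extract_srcs(html):
    # Parse the fragment and collect the src of every matching thumbnail.
    root = ET.fromstring(html)
    return [img.get('src') for img in root.findall(THUMBNAIL_XPATH)]

print(extract_srcs(PAGE))  # → ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```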
Conclusion
AutoCrawler is a versatile tool that simplifies the process of gathering images from large repositories on the web, providing customization for various user needs. Its configurability and full-resolution capabilities make it suitable for both personal and professional use, backed by robust support for adaptation and troubleshooting.