Automatically scrape images matching your query from the popular search engines
- Bing
- Baidu
- Yahoo (currently only low resolution)
using an easy-to-use front end or scripts.
This code is part of a paper (citation). Also check the project page if you are interested in creating a dataset for instance segmentation.
Start the front end with a single command (adjust /PATH/TO/OUTPUT to your desired output path):
docker run -it --rm --name easy_image_scraping --mount type=bind,source=/PATH/TO/OUTPUT,target=/usr/src/app/output -p 5000:5000 ghcr.io/a-nau/easy-image-scraping:latest
Enter your query and wait for the results to appear in the output
folder. The web application also shows a preview of the
downloaded images.
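To check from a script what has arrived in the output folder so far, something like the following minimal sketch may help. The helper name `list_downloaded_images` and the set of file suffixes are assumptions for illustration; the folder layout the scraper creates inside the output path is not specified here, so the sketch simply searches recursively.

```python
from pathlib import Path

# Assumed set of image suffixes; extend it if the scraper saves other formats.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp"}


def list_downloaded_images(output_dir):
    """Return all image files below output_dir, sorted by path."""
    root = Path(output_dir)
    return sorted(
        path
        for path in root.rglob("*")  # recurse, since the sub-folder layout is unknown
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


if __name__ == "__main__":
    images = list_downloaded_images("/PATH/TO/OUTPUT")
    print(f"{len(images)} images downloaded so far")
```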
To use the command line instead, start the container with a bash shell:
docker run -it --rm --name easy_image_scraping --mount type=bind,source=/PATH/TO/OUTPUT,target=/usr/src/app/output -p 5000:5000 ghcr.io/a-nau/easy-image-scraping:latest bash
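If you launch the container from a script, the two docker run variants above can be assembled programmatically. The following is only a sketch: `build_docker_command` is a hypothetical helper that is not part of this project, and the `port` parameter for choosing a different host port is an assumption.

```python
import shlex
from pathlib import Path

# Image name and container target path are taken from the commands above.
IMAGE = "ghcr.io/a-nau/easy-image-scraping:latest"
TARGET = "/usr/src/app/output"


def build_docker_command(output_dir, shell=False, port=5000):
    """Return the `docker run` argument list for the front end or a bash shell."""
    # Bind mounts need an absolute host path. With --mount, Docker reports an
    # error if the host path does not exist, so create the folder beforehand.
    source = Path(output_dir).expanduser().resolve()
    cmd = [
        "docker", "run", "-it", "--rm",
        "--name", "easy_image_scraping",
        "--mount", f"type=bind,source={source},target={TARGET}",
        "-p", f"{port}:5000",  # host port : container port
        IMAGE,
    ]
    if shell:
        cmd.append("bash")  # open a shell instead of starting the front end
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_docker_command("./output")))
```

Pass the returned list to `subprocess.run` to actually start the container.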
If you just want to search for a single keyword, adjust and run search_by_keyword.py.
- Write the list of search terms in the file search_terms_eng.txt.
- You can then use Google Translate to translate the whole file to new languages. Change the ending of the translated file to the respective language.
- Adjust config.py to define search engines for each language
- Run search_by_keywords_from_files
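The file convention from the steps above can be illustrated with a short sketch. It assumes that the translated files are named like search_terms_eng.txt with a different language ending, and that each file holds one search term per line. `load_search_terms` is a hypothetical helper and not the project's own loading code.

```python
from pathlib import Path


def load_search_terms(directory, prefix="search_terms_"):
    """Map each language ending (e.g. "eng") to the search terms in its file."""
    terms_by_language = {}
    for path in sorted(Path(directory).glob(f"{prefix}*.txt")):
        # "search_terms_eng.txt" -> "eng"
        language = path.stem[len(prefix):]
        lines = path.read_text(encoding="utf-8").splitlines()
        # Assumption: one search term per line; blank lines are skipped.
        terms_by_language[language] = [line.strip() for line in lines if line.strip()]
    return terms_by_language


if __name__ == "__main__":
    for language, terms in load_search_terms(".").items():
        print(language, len(terms), "search terms")
```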
This is optional - you can also directly use our provided container.
You can also build the image yourself using
docker build -t easy_image_scraping .
Then run it using
docker run -it --rm --name easy_image_scraping -p 5000:5000 --mount type=bind,source=/PATH/TO/OUTPUT,target=/usr/src/app/output easy_image_scraping
For local setup, follow these steps:
- Set up an environment using
conda env create -f environment.yml
or
pip install -r requirements.txt
- To use Selenium, we need to download the Chrome Driver (also see this)
- Check your Chrome Version and download the corresponding webdriver version
- Unzip it and add it to the path (for details, see here). Alternatively, you can adjust scrape_and_download.py to pass the driver location explicitly:
with webdriver.Chrome(
    executable_path="path/to/chrome_driver.exe",  # add this line
    options=set_chrome_options()
) as wd:
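Before running the scripts locally, it can be useful to verify that the Chrome Driver from the step above is actually reachable. The sketch below uses only the Python standard library. `find_chromedriver` is a hypothetical helper, and the binary names it looks for are assumptions based on the usual driver file names.

```python
import shutil
from pathlib import Path


def find_chromedriver(explicit_path=None, names=("chromedriver", "chromedriver.exe")):
    """Return the path to a Chrome Driver binary, or None if none is found."""
    if explicit_path is not None:
        # An explicit path takes precedence over the PATH lookup.
        candidate = Path(explicit_path)
        return str(candidate) if candidate.is_file() else None
    for name in names:
        found = shutil.which(name)  # searches the directories listed in PATH
        if found:
            return found
    return None


if __name__ == "__main__":
    driver = find_chromedriver()
    if driver is None:
        print("Chrome Driver not found: add it to the path or set it explicitly.")
    else:
        print(f"Using Chrome Driver at {driver}")
```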
- Code is partially based on and borrowed from
  - sczhengyabin/Image-Downloader (mostly crawler.py), MIT License
  - Article with Gists by Fabian Bosler, see fetch_image_urls.py
- Dockerfile is based on joyzoursky/docker-python-chromedriver, MIT License
- Cookie notices are handled by the I still don't care about cookies extension, GNU General Public License v3.0
Unless stated otherwise, this project is licensed under the MIT license.
If you use this code for scientific research, please consider citing
@inproceedings{naumannScrapeCutPasteLearn2022,
title = {Scrape, Cut, Paste and Learn: Automated Dataset Generation Applied to Parcel Logistics},
author = {Naumann, Alexander and Hertlein, Felix and Zhou, Benchun and Dörr, Laura and Furmans, Kai},
booktitle = {{{IEEE Conference}} on {{Machine Learning}} and Applications ({{ICMLA}})},
date = 2022
}
Please be aware of copyright restrictions that might apply to images you download.