In 2014 I wrote a Python image crawler for one specific website, Cure WorldCosplay, where cosplayers from all over the world post their own pictures. The site has about 10k active members and up to 10 million pictures.
The advantage is that the program is packaged into a single executable file, so no programming environment is needed. The downside is that some antivirus software may flag the file as unsafe.
Here is the program!
|Click a name below to download|
Theoretically, if you have enough disk space, you can download every picture on the site (about 9,800 GB); the only limit is your bandwidth. I have run 36 crawlers at the same time on a Linux server, downloading pictures 24/7 at the maximum available bandwidth.
How to use it:
The program directs you to the ranking page of Cure WorldCosplay, where you can browse around and pick a cosplayer you like. The cosplayer's ID (the WorldCosplay No.) is displayed on his or her page. Copy that number and type it into the program, and it will download all of that cosplayer's HD photos in no time (on average, each cosplayer has about 100 pictures).
Moreover, the program generates a local index HTML file that displays all the downloaded images as thumbnails, so you can browse them and click through to the HD picture.
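To illustrate the thumbnail index idea, here is a minimal sketch, not the original program's code. The function names `make_thumbnail` and `build_index`, the 200x200 thumbnail size, and the page layout are all assumptions made for this example. The thumbnail step uses Pillow, as the original did; the HTML step needs only the standard library.

```python
import html
import os
from urllib.parse import quote


def make_thumbnail(src_path, thumb_path, size=(200, 200)):
    """Write a reduced JPEG copy of src_path. Requires Pillow."""
    from PIL import Image  # imported here so build_index works without Pillow

    with Image.open(src_path) as im:
        rgb = im.convert("RGB")   # JPEG cannot store alpha or palette modes
        rgb.thumbnail(size)       # resizes in place, keeps the aspect ratio
        rgb.save(thumb_path, "JPEG")


def _link(path, base):
    """Return an HTML-safe URL for path, relative to the index folder."""
    rel = os.path.relpath(path, base).replace(os.sep, "/")
    return html.escape(quote(rel), quote=True)


def build_index(pairs, index_path, title="Downloaded images"):
    """Write an HTML page where each thumbnail links to its full image.

    pairs is an iterable of (thumbnail_path, full_image_path) tuples.
    """
    base = os.path.dirname(os.path.abspath(index_path))
    cells = [
        '<a href="{}"><img src="{}" alt=""></a>'.format(
            _link(full, base), _link(thumb, base)
        )
        for thumb, full in pairs
    ]
    page = (
        "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\">"
        "<title>{}</title></head>\n<body>\n{}\n</body></html>\n"
    ).format(html.escape(title), "\n".join(cells))
    with open(index_path, "w", encoding="utf-8") as fh:
        fh.write(page)
    return index_path
```

The links are written relative to the index file, so the download folder can be moved or copied without breaking the page.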
To keep the packaged file as small as possible, I used almost no external libraries apart from the imaging library Pillow. I tried Scrapy to fetch pictures more efficiently and PyQt to make a prettier interface, but the packaged .exe would have been much larger and had potential dependency problems. So I stuck with Python's standard libraries: urllib, multiprocessing, and Tkinter.
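As a rough sketch of what a standard-library-only downloader can look like (again, not the original code), the example below fetches a list of image URLs in parallel with `urllib`. The names `fetch` and `download_all` and the default of 8 workers are made up for this example. It also uses the thread-based `ThreadPool` from the `multiprocessing` package rather than separate processes, which is an assumption chosen because downloading is I/O-bound and threads avoid pickling issues.

```python
import os
import urllib.request
from multiprocessing.pool import ThreadPool
from urllib.parse import urlparse


def fetch(job):
    """Download one URL into dest_dir; return (url, local path or None)."""
    url, dest_dir = job
    name = os.path.basename(urlparse(url).path) or "unnamed"
    path = os.path.join(dest_dir, name)
    if os.path.exists(path):      # already downloaded in an earlier run
        return url, path
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = resp.read()
    except OSError:               # URLError and timeouts are OSError subclasses
        return url, None
    with open(path, "wb") as out: # write only after a complete read
        out.write(data)
    return url, path


def download_all(urls, dest_dir, workers=8):
    """Fetch all urls concurrently; return a dict mapping url to path or None."""
    os.makedirs(dest_dir, exist_ok=True)
    jobs = [(url, dest_dir) for url in urls]
    with ThreadPool(workers) as pool:
        return dict(pool.map(fetch, jobs))
```

A failed download is reported as `None` instead of raising, so one broken link does not stop the rest of the batch.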
It was 2014 and I had just picked up Python; before long I was obsessed with how simple and powerful it is. I was doing a research project at the time, and I soon realized that Python can do anything:
- Python Scrapy, Selenium: scraping user info from social media and business websites
- Python pandas, Matplotlib: data cleansing and exploratory analysis
- Python Gensim, scikit-learn: text sentiment analysis, topic modeling
- Python scikit-learn, GraphLab Create, XGBoost: machine learning
- Python Flask: deploying interactive websites