Photo Scraping With Ruby
I love Watir. I enjoy doing web automations with it. Today I'll show you a small script I use to download photos from a local photography community. It downloads all photos into a directory called 'images'.
Here’s the code:
require 'watir'
# Open a browser:
browser = Watir::Browser.new :chrome
# Go to https://photo-forum.net/index.php?APP_ACTION=NEW_ALBUM
browser.goto 'https://photo-forum.net/index.php?APP_ACTION=NEW_ALBUM'
# Collect the photo-page links:
links = browser.html.scan(%r{https://photo-forum\.net/i/\d+}).uniq
# Getting the actual image links:
@images = []
links.each do |l|
  browser.goto(l)
  # String#[] with a regex returns the first match (or nil)
  image = browser.html[%r{https://photo-forum\.net/static/.*?\.jpg}]
  @images << image if image
end
# Download everything with wget:
@images.each do |i|
  `wget "#{i}" -N -P images/`
end
It's pretty straightforward code, so I'll just go over the interesting bits. I started with this sandbox:
require 'watir'
require 'pry'
# Open a browser:
browser = Watir::Browser.new :chrome
# Go to https://photo-forum.net/index.php?APP_ACTION=NEW_ALBUM
browser.goto 'https://photo-forum.net/index.php?APP_ACTION=NEW_ALBUM'
binding.pry
This opens a browser I can work with to examine what I need to do. I quickly realized that every page I have to collect can be matched with "https://photo-forum.net/i/<digits>", so I match those first and assign them to a variable called "links".
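To see what that first regex does in isolation, here is a small self-contained check against a made-up HTML fragment (the fragment is an assumption; the real page markup will differ, but the matching behaves the same way):

```ruby
# A made-up snippet standing in for browser.html:
html = '<a href="https://photo-forum.net/i/12345">one</a> ' \
       '<a href="https://photo-forum.net/i/67890">two</a>'

# scan returns every non-overlapping match; uniq drops duplicates
# in case the same photo page is linked more than once.
links = html.scan(%r{https://photo-forum\.net/i/\d+}).uniq
# links => ["https://photo-forum.net/i/12345", "https://photo-forum.net/i/67890"]
```

The `%r{...}` literal avoids having to backslash-escape every `/` in the URL.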
Once I have all the links, I open each one and match the .jpg file via my second regex. At the end I use wget to download the images. The interesting part here is the -N (timestamping) flag, which skips files that haven't changed since a previous download. This lets me schedule regular runs and only collect new images, which I can review locally later.
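If you'd rather stay in Ruby instead of shelling out to wget, here is a minimal sketch of a download helper using the standard open-uri library. The `download_image` name and the skip-if-exists behavior are my own assumptions; unlike wget's -N, this simpler version only checks whether the file exists, not whether the remote copy is newer:

```ruby
require 'fileutils'
require 'open-uri'

# Hypothetical helper: fetch a URL into dir, skipping files we
# already have (a rough stand-in for wget -N's behavior).
def download_image(url, dir: 'images')
  FileUtils.mkdir_p(dir)
  path = File.join(dir, File.basename(url))
  return :skipped if File.exist?(path)

  URI.open(url) { |remote| File.binwrite(path, remote.read) }
  :downloaded
end
```

Swapping this in for the backticks line removes the dependency on wget being installed, at the cost of the free timestamp comparison.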
That's great for building up enough inspiration to get me out with my camera on those cold, cold days…
Enjoy the code!