Photo Scraping With Ruby
I love Watir. I enjoy doing web automations with it. Today I'll show you a small script I use to download photos from a local photography community. It downloads all photos into a directory called 'images'.
Here’s the code:
require 'watir'
# Open a browser:
browser = Watir::Browser.new :chrome
# Go to https://photo-forum.net/index.php?APP_ACTION=NEW_ALBUM
browser.goto 'https://photo-forum.net/index.php?APP_ACTION=NEW_ALBUM'
# Collect the photo-page links:
links = browser.html.scan(%r{https://photo-forum\.net/i/\d+}).uniq
# Getting the actual image links:
@images = []
links.each do |l|
  browser.goto(l)
  # String#[] with a regex returns the first match (or nil)
  image = browser.html[%r{https://photo-forum\.net/static/.*?\.jpg}]
  @images << image if image
end
# Download everything with wget:
@images.each do |i|
  `wget "#{i}" -N -P images/`
end
It's pretty straightforward code, so I'll just go over the interesting bits. I started with this sandbox:
require 'watir'
require 'pry'
# Open a browser:
browser = Watir::Browser.new :chrome
# Go to https://photo-forum.net/index.php?APP_ACTION=NEW_ALBUM
browser.goto 'https://photo-forum.net/index.php?APP_ACTION=NEW_ALBUM'
binding.pry
This opens a browser I can work with to examine what I need to do. I quickly realized that every page I have to collect can be matched with "https://photo-forum.net/i/<digits>", so I match those first and assign them to a variable called "links".
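To see what that first regex does in isolation, here is a small self-contained check against a made-up HTML fragment (the fragment is an assumption; the real page markup will differ, but the matching behaves the same way):

```ruby
# A made-up snippet standing in for browser.html:
html = '<a href="https://photo-forum.net/i/12345">one</a> ' \
       '<a href="https://photo-forum.net/i/67890">two</a>'

# scan returns every non-overlapping match; uniq drops duplicates
# in case the same photo page is linked more than once.
links = html.scan(%r{https://photo-forum\.net/i/\d+}).uniq
# links => ["https://photo-forum.net/i/12345", "https://photo-forum.net/i/67890"]
```

The `%r{...}` literal avoids having to backslash-escape every `/` in the URL.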
Once I have all the links, I open each one and match the .jpg file via my second regex. At the end I use wget to download the images. The interesting part here is the -N (timestamping) flag, which skips files that haven't changed since a previous download. This lets me schedule regular runs and only collect new images, which I can review locally later.
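If you'd rather stay in Ruby instead of shelling out to wget, here is a minimal sketch of a download helper using the standard open-uri library. The `download_image` name and the skip-if-exists behavior are my own assumptions; unlike wget's -N, this simpler version only checks whether the file exists, not whether the remote copy is newer:

```ruby
require 'fileutils'
require 'open-uri'

# Hypothetical helper: fetch a URL into dir, skipping files we
# already have (a rough stand-in for wget -N's behavior).
def download_image(url, dir: 'images')
  FileUtils.mkdir_p(dir)
  path = File.join(dir, File.basename(url))
  return :skipped if File.exist?(path)

  URI.open(url) { |remote| File.binwrite(path, remote.read) }
  :downloaded
end
```

Swapping this in for the backticks line removes the dependency on wget being installed, at the cost of the free timestamp comparison.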
That's great for building up enough inspiration to get me out with my camera on those cold, cold days…
Enjoy the code!