Hexfox

Scrape your cinema's listings to get a daily email of films with a high IMDb rating (Part 1)

Never miss a great film with this Scrapy tutorial

A large issue with learning to program is first working out what is even possible to achieve; and this is made all the harder by the prevalence of math problems or little puzzles in tutorials which have no direct link to how they’d be used in the real world.

Programmers love these but I remember the deep frustration myself: “why can’t I just learn something useful?“.


Hi! I'm Darian, a software developer based in London, and I teach my readers here at Hexfox to automate the repetitive tasks that would otherwise suck hours out of their life, with the aim of preparing them for a future world where automation is king. Sound interesting? Stick around or sign up!


I, like most people, tend to learn best when I have a certain goal in mind, because reaching that goal feels like an accomplishment and fuels the mind to go deeper. In fact, it’s how I ended up here after a 10-year journey down the rabbit hole (pro tip: it never really ends, just keep digging).

“Every great developer you know got there by solving problems they were unqualified to solve until they actually did it”. - @patio11

A fantastic quote, but what if you don’t know what to solve? Yeah, that’s really common! This is where I come in.

This post is designed to showcase just how much the world of possibilities opens up once you learn a certain skill called web scraping - the industry term for “extracting data from the web”.

What we’re going to do today

We’re going to write a script (actually, make that two scripts) for a fun home project that will check your local cinema’s film listings every day and email you only if it is showing a film that receives a high rating on IMDb. Because who the hell wants to pay top dollar for yet another shoddy film? At £20 ($28) a ticket in London, it’s certainly not fucking me, I’ll tell you that!

So here is the linear flow we’ll be following, deliberately split up into bite-size manageable chunks:

  • get the film names from the cinema website (web scraping).
  • store the film names.
  • lookup each film’s IMDb rating (API lookup).
  • send an email detailing today’s highly rated films.
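Before we write any real code, here is a bird’s-eye sketch of that flow in plain Python. Every function, film name and rating below is a made-up stand-in for the real work each step will do later; none of it comes from the final scripts.

```python
# Stub pipeline: one function per step of the flow above, with dummy data.
def get_film_names():
    # Steps 1 & 2: scrape and store the film names (hard-coded here)
    return ["The Conjuring 2", "Me Before You"]

def lookup_rating(name):
    # Step 3: look up each film's IMDb rating (hard-coded here)
    dummy_ratings = {"The Conjuring 2": 7.4, "Me Before You": 7.2}
    return dummy_ratings.get(name, 0.0)

def build_email(films):
    # Step 4: compose the daily email body
    return "Highly rated today: " + ", ".join(films)

good_films = [n for n in get_film_names() if lookup_rating(n) >= 7.0]
print(build_email(good_films))
```

Each stub will eventually be replaced by real scraping, a real IMDb lookup and a real email send, but the overall shape of the program stays exactly this simple.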

Vue Cinema

I have released the code for this post under a free open source license. If you’d like to grab it and follow along, or edit the script to work better for you, I’d really encourage that; simply place your email in the box below and my army of robots will send you the zip instantly. You’ll also get weekly updates from me with future posts like this, plus web scraping tips and tricks written specifically to save you countless hours of furious clicking around the web looking for answers.

Want the code? Download it now!

Simply whack your first name & email in the boxes below and my robot minions will dispatch the code zip to you instantly. I'll also send you weekly updates containing tips and tricks specifically designed to save you time while scraping.

I seriously hate spam. Unsubscribe anytime.

As I said, we’re going to split this up into two scripts: one that actually visits the local cinema’s website to get the data, and another that takes the retrieved data and checks it against IMDb before sending the daily email.

Why separate out such a simple script? It’s a good practice to get into in the world of development; as Doug McIlroy said: “Make each program do one thing well”. By separating our concerns we automatically gain a number of advantages:

  • While programming, we can concentrate on one area/domain at a time.

  • As the web scraping extraction does not have to wait for the lookups to IMDb, it can accomplish its job without being impeded. In the programming world we call this “non-blocking” and it is a cornerstone of achieving decent performance in many areas of computing.

  • If one script breaks, it is unlikely to affect the other as it knows nothing about it and can therefore be fixed in isolation. It will also be a lot clearer to us which part broke when it inevitably does.

  • As our 1st step needs to export data so that the 2nd step can use it, we get to keep that intermediary data which may otherwise not be recorded as part of the normal flow of the program. This might not be helpful in this specific case but in other scenarios where we might collect millions of results then you can see where this may come in handy. It gives us portability; we can get the data and then move it elsewhere for processing.

  • As the scripts run independently, they can even run on separate machines. Again, this is not so helpful now as we just want to run a quick hacky script, but what happens when we start scraping huge websites or multiple domains? The separation here would allow us to scale up just the part of the program that requires the extra beef.

But how will the two scripts talk to each other? Well, just like we humans do: via the medium of text! The first script will save a file in a format that the second script knows how to read, simple as that. The format we’ll save the results in is called JSON, an incredibly common data interchange format on the web. It allows us to read and write data in a way that humans can read too, but that is all you need to know for now.
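To make that hand-off concrete, here it is in miniature using Python’s built-in json module. The dummy film names are for illustration; the filename matches the movies.json file we’ll actually generate later.

```python
import json

# Script one's side of the hand-off: dump the scraped items to disk as JSON.
movies = [{"name": "The Conjuring 2"}, {"name": "Me Before You"}]
with open("movies.json", "w") as f:
    json.dump(movies, f, indent=2)

# Script two's side: load them back, knowing nothing about how they were made.
with open("movies.json") as f:
    loaded = json.load(f)

print(loaded[0]["name"])  # The Conjuring 2
```

That file on disk is the entire interface between the two scripts, which is exactly why either one can break, move machines or be rewritten without the other noticing.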

Before we begin

Let’s describe a few of the tools we’ll be using before we get started. Unfortunately, installing these is out of scope for this article but if you get the code I have written instructions in the README which should help you get going.

Python & Versions

We’ll be using the Python programming language due to its wide ecosystem for web scraping; I’ve written about why I think Python is the best beginner language for web scraping, so pause reading this if you’d like a bit more background.

If you’re on Linux or Mac OS X, the code will work on Python 2.7 or Python 3.5. If you are on Windows, Scrapy does not support Python 3 just yet, but hopefully this will change in the near future; keep an eye on this ticket.

Libraries

In most programming languages, there are tools on the internet in the form of “packages” of code that other people have written. We can use these packages to greatly speed up our development. Today we’ll be using:

  • Scrapy - A fast & powerful scraping and web crawling framework written in Python. Why are we using Scrapy? Because it allows us to write a scraper with minimal fuss, once of course we understand how Scrapy works. If you’re interested, I’ve written previously about some of the quick wins Scrapy gives us.
  • IMDbPY - a library for getting data from IMDB, including our sought after ratings.
  • sendgrid-python - a library for sending our email via the Sendgrid service.

These packages are all installed from the Python Package Index; check out the README in the code zip for install instructions.

With that over - let us begin, padawan.

Part 1: The Scrapy build

We’re going to build the web scraping part first; this is the script that will visit the local cinema’s website, mimicking a web browser, and extract certain elements from the page before storing them in a JSON file.

You can read this entire file in the code zip as cinema_scraper.py.

import scrapy


class CinemaSpider(scrapy.Spider):
    name = "cinema"
    allowed_domains = ['myvue.com']
    start_urls = [
        'http://www.myvue.com/latest-movies/cinema/london-stratford',
    ]

    def parse(self, response):
        movie_names = response.css('.filmListFilmInfo h3 a::text').extract()
        for movie_name in movie_names:
            yield {
                'name': movie_name
            }

That is it in its entirety and I hope you can see now why I chose Scrapy for this very simple task; we’ll talk about what it abstracts away from us later but for now you only need to know that it does this so that we can concentrate on the core problem at hand: extracting that data.

Imports

Let’s go through it line-by-line, starting with the import. Python code is organised into modules, which are basically packages of code; they exist on the filesystem as simple directories full of *.py files. By importing the scrapy module as we do here, we get to use all the web scraping goodness that lies within.

Scrapy Spider Class

So the first job of the day is to define what Scrapy calls our “Spider class”. If you’re not familiar with classes, they can be thought of as factories that output objects. If that is complete moon language to you, fear not and press on; it doesn’t really matter what they are for now, provided you follow these instructions.

When the scraping job is kicked off, Scrapy will take this “Spider class” blueprint we’ve written and use it to generate an object in memory that knows how to scrape the cinema website. We’ve called it CinemaSpider; the bit in brackets (scrapy.Spider) denotes that we are subclassing the default spider that Scrapy provides for us. Subclassing is also a topic for another day, but you can take it to mean that our spider inherits all the features of the default spider that Scrapy provides. By doing that and overriding certain bits we can make it do whatever we want.
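If subclassing is new to you, here is the idea in miniature, stripped of everything Scrapy-specific (both classes here are toys of my own invention, not Scrapy’s):

```python
class Spider:
    def crawl(self):
        return "crawling nothing in particular"

class CinemaSpider(Spider):  # the brackets mean "inherit from Spider"
    def crawl(self):
        # Overriding replaces just this behaviour; anything we don't
        # override is inherited from Spider unchanged.
        return "crawling cinema listings"

print(CinemaSpider().crawl())  # crawling cinema listings
```

Our real spider does exactly this, except the parent class is scrapy.Spider and the method we effectively customise is parse.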

Scrapy Spider Class attributes

Immediately under the class definition, we define a bunch of attributes: variables that live at the top level of a class, storing values “attached” to the class itself. They are used internally by the scrapy.Spider that we subclassed earlier and by Scrapy’s machinery in general.
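In plain Python, a class attribute looks like this (a throwaway example of mine, nothing to do with Scrapy):

```python
class Greeter:
    greeting = "hello"  # a class attribute: defined once, on the class itself

    def greet(self, name):
        # Instances see the class attribute through self
        return self.greeting + ", " + name

print(Greeter.greeting)          # hello
print(Greeter().greet("world"))  # hello, world
```

Scrapy reads name, allowed_domains and start_urls off our spider class in exactly this way.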

Let’s go through them:

  • name - this attribute is designed to give the spider a simple name so that we can call it at the command line.
  • allowed_domains - is a list of domain names that we are allowing the spider to visit when it runs. Usually this is just the site you want to scrape but there may be instances where you need to go off-site to get more information from elsewhere, so it being a list allows you to provide multiple values here.
  • start_urls - when Scrapy starts your spider, it needs a list of URLs to visit first. After it has visited each of the URLs, it creates a “response” object containing all the information it found and passes it to the method called parse. Why parse? Because that is the default callback name Scrapy picks; we can override this if we want, but for now just know that parse is the “default” method which gets called with a response.

But which URL do we need to start on? Well, we need a list of movie names from our local cinema. So what I did was simply visit my local cinema’s website, find the page which displayed listings for my local branch and then took the URL of that page. This is the URL you see above and you can visit it here.

Parsing the response

Let’s see it again…

def parse(self, response):
    movie_names = response.css('.filmListFilmInfo h3 a::text').extract()
    for movie_name in movie_names:
        yield {
            'name': movie_name
        }

This method called parse is attached to the spider class but fear not, if you’re not clued up on classes and methods yet you can just think of it as a function, a grouped bit of code that takes inputs and returns outputs. In this case the inputs are self, which is the class instance itself and which we can ignore for now, and more importantly response, which is the response object we talked about earlier. This object has been constructed by Scrapy visiting the start_urls we gave it and contains all the data we need to extract our film names. The very first line of the method does exactly that, so let’s investigate.

Diving into the HTML

The following paragraphs assume you are using Google Chrome as your browser; if you’re not, there will be similar tools in yours.

When Scrapy visits the start_url, it sends an HTTP request; the server hosting the cinema site receives this request, generates the page of HTML and then returns an HTTP response. If you need to visualise this, you can think of both requests and responses as large bundles of text that travel across the internet’s cables. When Scrapy gets the HTTP response back, it parses it in the background and creates the Scrapy response object which then gets sent to our parse method.

With that little bit explained: if you visit the page in question in your browser, then right click and “view source”, you will see the magical code which makes up the web: HTML. For the moment you can think of it as a wall of bricks, and we’re going to use “selectors” to extract the bricks we want. Selectors are just text and come in a variety of formats that you will read about, XPath and CSS being the two most popular. Today we’re going to use CSS because I believe it is the better investment for the beginner to learn. You can pause and read why in that article if you like.

The steps to finding the right selector are:

  • Pick a piece of text you want to find. Let’s pick the movie name for instance.
  • Go to the source and search for that text.
  • Look at the HTML elements surrounding it to work out the correct selector to use.

This is the chunk of HTML we are interested in; you can see “The Conjuring 2” as the movie name inside the <a> element.

<div class="filmListFilmInfo">
  <img id="dnn_ctr1418_ViewCinemaListing_MKII_rptSynopsis_imgCert_0" src="/images/certificates/cert_15.png" alt="strong horror" style="height:19px;width:19px;">
  <h3>
    <a id="dnn_ctr1418_ViewCinemaListing_MKII_rptSynopsis_hlInfo_0" href="/latest-movies/info/cinema/stratford/film/the-conjuring-2">The Conjuring 2</a>
  </h3>
</div>

But how do we extract it? With a CSS selector! Let’s look at the one in our code:

movie_names = response.css('.filmListFilmInfo h3 a::text').extract()

The selector is the string passed to the response.css method. I won’t go in depth into CSS selectors here, as that’s a huge topic for another day, but I will explain what this one does:

  • .filmListFilmInfo - says “select all elements in the HTML that have the CSS class filmListFilmInfo”. If you look at the opening <div> you’ll see it.
  • h3 - the space before the h3 says “now look inside the elements the previous part matched”; the h3 itself says “select all h3 elements inside them”.
  • a::text - again, the space means the same; a::text says “find all anchor elements and take their link text”.

So in full, the selector says: “find all elements with class filmListFilmInfo, then all h3 elements inside those, then all anchor elements inside those, and take their text”.
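To appreciate what that one line saves us, here is a rough, stdlib-only version of the same extraction written by hand with Python’s html.parser. It is a deliberately naive sketch (it assumes no nested divs, for one), shown only to illustrate the work the CSS selector does for us.

```python
from html.parser import HTMLParser

class FilmTitleParser(HTMLParser):
    # Tracks whether we are inside div.filmListFilmInfo -> h3 -> a,
    # and collects the link text when we are.
    def __init__(self):
        super().__init__()
        self.in_info = self.in_h3 = self.in_a = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "filmListFilmInfo" in attrs.get("class", ""):
            self.in_info = True
        elif tag == "h3" and self.in_info:
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            self.in_a = True

    def handle_endtag(self, tag):
        # Naive: closing any div/h3/a resets the matching flag
        if tag == "div":
            self.in_info = False
        elif tag == "h3":
            self.in_h3 = False
        elif tag == "a":
            self.in_a = False

    def handle_data(self, data):
        if self.in_a:
            self.titles.append(data.strip())

html = """<div class="filmListFilmInfo">
  <h3><a href="/film/the-conjuring-2">The Conjuring 2</a></h3>
</div>"""

parser = FilmTitleParser()
parser.feed(html)
print(parser.titles)  # ['The Conjuring 2']
```

Thirty-odd lines of bookkeeping versus one selector string; that is the trade Scrapy’s response.css makes for us.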

movie_names is now a list of movie names we can look up later!

We then loop through this list of movie names, create a dictionary for each and then yield them. yield is a Python keyword which I won’t go into now, but in its most basic sense it can be thought of like return. We construct a Python dictionary rather than just returning the string itself so that we can later adapt it to return extra information about each film (this will make sense in the challenges part later).
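If you want a feel for what yield does, here is the same loop outside of Scrapy, with hard-coded names standing in for the scraped ones:

```python
def generate_items(movie_names):
    # Like return, but hands back one item at a time instead of the whole
    # list at once; calling the function gives you a generator object.
    for movie_name in movie_names:
        yield {"name": movie_name}

items = generate_items(["The Conjuring 2", "Me Before You"])
print(next(items))  # {'name': 'The Conjuring 2'}
print(list(items))  # the rest: [{'name': 'Me Before You'}]
```

Scrapy consumes our parse method in just this fashion, collecting each yielded dictionary as a scraped item.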

Run that thing..

Now we need to test that the script gets the data we want:

scrapy runspider cinema_scraper.py -o movies.json

Scrapy can run an individual spider if you pass its filename to the runspider command. If you run this, you will see a bunch of debug logging output and, hopefully, a bunch of scraped items being found. Because we supplied the -o movies.json parameter, Scrapy also saves all the results to a JSON file of that name. If it ran successfully, the contents will look similar to this:

[
  {"name": "The Conjuring 2"},
  {"name": "Me Before You"},
  {"name": "Gods Of Egypt"},
  {"name": "The Boss"},
  {"name": "X-Men: Apocalypse"},
  {"name": "The Angry Birds Movie"}
]

That’s it! We have the file we need to create the next part.
