Scrape your cinema’s listings to get a daily email of films with a high IMDb rating

Never miss a great film with this Scrapy tutorial

A large issue with learning to program is first working out what is even possible to achieve; this is made all the harder by the prevalence of math problems and little puzzles in tutorials which have no direct link to how they’d be used in the real world. I remember the deep frustration myself: “why can’t I just learn something useful?”.

I, like most people, tend to learn best when I have a certain goal in mind, because reaching that goal feels like an accomplishment and fuels the mind to go deeper. In fact, it’s how I ended up here after a 10 year journey down the rabbit hole (pro tip: it never really ends, just keep digging).

“Every great developer you know got there by solving problems they were unqualified to solve until they actually did it”. – @patio11

A fantastic quote but what if you don’t know what to solve? Yeah, that’s really common! This is where I come in..

This post is designed to showcase just how much the world of possibilities opens up once you learn a certain skill called web scraping – the industry term for “extracting data from the web”.

What we’re going to do today

We’re going to write a script, actually make that two scripts, for a fun home project that will check your local cinema’s film listings every day and email you only if it is showing a film that receives a high rating on IMDb. Because who the hell wants to pay top dollar for yet another shoddy film? At £20 ($28) a ticket in London, it’s certainly not fucking me, I’ll tell you that..!

So here is the linear flow we’ll be following, deliberately split up into bite-size manageable chunks:

  • get the film names from the cinema website (web scraping).
  • store the film names.
  • lookup each film’s IMDb rating (API lookup).
  • send an email detailing today’s highly rated films.

Vue Cinema

I have released the code for this post under a free open source license. If you’d like to grab it and follow along, or edit the script to work better for you, then I’d really encourage that: simply place your email in the box below and my army of robots will send you the zip instantly. You’ll also get weekly updates from me with future posts like this, plus further web scraping tips and tricks written specifically to save you countless hours of furious clicking around the web looking for answers.

As I said – we’re going to split this up into two scripts: one that actually visits the local cinema’s website to get the data, and the other that takes the retrieved data and checks it against IMDb before sending the daily email.

Why separate out such a simple script? It’s a good practice to get into in the world of development, as Doug McIlroy said: “Make each program do one thing well”. By separating out our concerns we automatically achieve a number of advantages:

  • While programming, we can concentrate on one area/domain at a time.

  • As the web scraping extraction does not have to wait for the lookups to IMDb, it can accomplish its job without being impeded. In the programming world we call this “non-blocking” and it is a cornerstone of achieving decent performance in many areas of computing.

  • If one script breaks, it is unlikely to affect the other as it knows nothing about it and can therefore be fixed in isolation. It will also be a lot clearer to us which part broke when it inevitably does.

  • As our 1st step needs to export data so that the 2nd step can use it, we get to keep that intermediate data which might otherwise not be recorded as part of the normal flow of the program. This might not be helpful in this specific case, but in other scenarios where we collect millions of results you can see how it could come in handy. It gives us portability; we can get the data and then move it elsewhere for processing.

  • As the scripts run independently, they can even run on separate machines. Again, this is not so helpful right now as we just want to run a quick hacky script – but what happens when we start scraping huge websites or multiple domains? The separation here would allow us to scale up just the part of the program that requires the extra beef.

But how will the 2 scripts talk to each other? Well, just like we humans do: via the medium of text! The first script will save a file in a certain format that the 2nd script knows how to read, simple as that. The format we’ll use to save the results is called JSON, an incredibly common data interchange format on the web. It allows us to read and write data in a way which can be read by the human eye too, but that is all you need to know for now.

Before we begin

Let’s describe a few of the tools we’ll be using before we get started. Unfortunately, installing these is out of scope for this article but if you get the code I have written instructions in the README which should help you get going.

Python & Versions

We’ll be using the Python programming language due to its wide ecosystem for web scraping; I’ve written about why I think Python is the best beginner language for web scraping, so pause reading this if you’d like a bit more background.

If you’re on Linux or Mac OS X, the code will work on Python 2.7 or Python 3.5. If you are on Windows, Scrapy does not support Python 3 just yet, but hopefully this will change in the near future; keep an eye on this ticket.

Libraries

In most programming languages, there are tools on the internet in the form of “packages” of code that other people have written. We can use these packages to greatly speed up our development. Today we’ll be using:

  • Scrapy – A fast & powerful scraping and web crawling framework written in Python. Why are we using Scrapy? Because it allows us to write a scraper with minimal fuss, once of course we understand how Scrapy works. If you’re interested, I’ve written previously about some of the quick wins Scrapy gives us.
  • IMDbPY – a library for getting data from IMDb, including our sought-after ratings.
  • sendgrid-python – a library for sending our email via the Sendgrid service.

These packages are all installed from the Python Package Index; you can check out the README in the code zip for install instructions.

With that over – let us begin, padawan.

Part 1: The Scrapy build

We’re going to build the web scraping part first; this is the script which will visit the local cinema’s website, mimicking a web browser, and then extract certain elements from the page before storing them in a JSON file.

You can read this entire file in the code zip as cinema_scraper.py.

import scrapy


class CinemaSpider(scrapy.Spider):
    # A short name so we can refer to the spider from the command line
    name = "cinema"
    # The only domain(s) the spider is allowed to visit while it runs
    allowed_domains = ['myvue.com']
    # The page(s) Scrapy fetches first; each response is handed to parse() below
    start_urls = [
        'http://www.myvue.com/latest-movies/cinema/london-stratford',
    ]

    def parse(self, response):
        movie_names = response.css('.filmListFilmInfo h3 a::text').extract()
        for movie_name in movie_names:
            yield {
                'name': movie_name
            }

That is it in its entirety and I hope you can see now why I chose Scrapy for this very simple task; we’ll talk about what it abstracts away from us later, but for now you only need to know that it does this so that we can concentrate on the core problem at hand: extracting that data.

Imports

Let’s go through it line-by-line starting with the import. Python consists of modules, which are basically packages containing code; they exist on the filesystem as simple directories full of *.py files. By importing the scrapy module as we do here, we get to use all the web scraping goodness that lies within.

Scrapy Spider Class

So the first job of the day is to define what Scrapy calls our “Spider class”. If you’re not familiar with classes, they can be thought of as a factory that outputs objects. If that is complete moon language to you, then fear not and press on, as it doesn’t really matter what they are right now provided you follow these instructions.

When the scraping job is kicked off, Scrapy will take this “Spider class” blueprint we’ve written and use it to generate an object in memory that knows how to scrape the cinema website. We’ve called it CinemaSpider; the bit in brackets (scrapy.Spider) denotes that we are sub-classing the default spider that Scrapy provides for us. Sub-classing is also a topic for another day but you can take it to mean that our spider inherits all the features of the default spider that Scrapy provides. By doing that and overriding certain bits we can make it do whatever we want.
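If sub-classing is new to you, here is a tiny toy illustration (nothing to do with Scrapy itself, just made up to show the idea): the child class gets everything the parent has, and can override only the bits it wants to change.

class Animal(object):
    def speak(self):
        return "..."

class Dog(Animal):      # Dog inherits everything Animal has...
    def speak(self):    # ...and overrides just the one method it cares about
        return "Woof!"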

Scrapy Spider Class attributes

Immediately under the class definition, we define a bunch of attributes; these are variables which exist at the top level of a class and store values “attached” to it. They are used internally by the scrapy.Spider class we subclassed earlier and by Scrapy’s machinery in general.

Let’s go through them:

  • name – this attribute is designed to give the spider a simple name so that we can call it at the command line.
  • allowed_domains – is a list of domain names that we are allowing the spider to visit when it runs. Usually this is just the site you want to scrape but there may be instances where you need to go off-site to get more information from elsewhere, so it being a list allows you to provide multiple values here.
  • start_urls – when Scrapy starts your spider, it needs a list of URLs to visit first. After it has visited each of the URLs it creates a “response” object containing all the information it found and passes it to the method called parse. Why parse? Because that is the default callback Scrapy picks for you; we can override this if we want (see the short sketch just after this list), but for now just know that parse is the “default” method which gets called with each response.
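As promised, here is a hedged sketch of what overriding the default callback could look like. This is not part of our spider – it just illustrates Scrapy’s standard Request object, and the parse_listings name is made up for the example:

    def start_requests(self):
        # Instead of relying on the default parse() callback, we could build
        # the requests ourselves and point them at any method we like.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_listings)

    def parse_listings(self, response):
        # ...the extraction code would then live here instead of in parse()
        pass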

But which URL do we need to start on? Well, we need a list of movie names from our local cinema. So I simply visited my local cinema’s website, found the page which displays listings for my local branch and took the URL of that page. This is the URL you see above and you can visit it here.

Parsing the response

Let’s see it again…

    def parse(self, response):
        movie_names = response.css('.filmListFilmInfo h3 a::text').extract()
        for movie_name in movie_names:
            yield {
                'name': movie_name
            }

This method called parse is attached to the spider class but fear not: if you’re not clued up on classes and methods yet you can just think of it as a function, a grouped bit of code that takes inputs and returns outputs. In this case, the inputs are self, which will be the class instance itself and which we can ignore for now, and more importantly response, which is the response object we talked about earlier. This object has been constructed by Scrapy visiting the start_urls we gave it and contains all the data we need to extract our film names. The very first line of the method does this in fact, so let’s investigate..

Diving into the HTML

The following paragraphs assume you are using Google Chrome as your browser; if you’re not, there will be similar tools in yours.

When Scrapy visits the start_url, it sends an HTTP request; the server hosting the cinema site receives this request, generates the page of HTML and then returns an HTTP response. If you need to visualise this, you can think of both requests and responses as large bundles of text that travel across the internet’s cables. When Scrapy gets the HTTP response back, it parses it in the background and creates the Scrapy response object which then gets sent to our parse method.

With that little bit explained – if you visit the page in question in your browser, then right click and “view source”, you will see the magical code which makes up the web: HTML. At the moment you can think of it as a wall of bricks, and we’re going to use “selectors” to extract the bricks we want. Selectors are just text and come in a variety of formats that you will read about: XPath and CSS being the two most popular. Today we’re going to use CSS because I believe it is the better investment to learn for the beginner. You can pause and read that article for the full reasoning if you like.

The steps to finding the code in this jargon are:

  • Pick a piece of text you want to find. Let’s pick the movie name for instance.
  • Go to the source and search for that text.
  • Look at the HTML elements surrounding it to work out the correct selector to use.

This is the chunk of HTML we are interested in; you can see “The Conjuring 2” as the movie name inside the <a> element.

<div class="filmListFilmInfo">
  <img id="dnn_ctr1418_ViewCinemaListing_MKII_rptSynopsis_imgCert_0" src="/images/certificates/cert_15.png" alt="strong horror" style="height:19px;width:19px;">
  <h3>
    <a id="dnn_ctr1418_ViewCinemaListing_MKII_rptSynopsis_hlInfo_0" href="/latest-movies/info/cinema/stratford/film/the-conjuring-2">The Conjuring 2</a>
  </h3>
</div>

But how do we extract it? With a CSS selector! Let’s look at the one in our code:

movie_names = response.css('.filmListFilmInfo h3 a::text').extract()

The selector is the string being passed to the response.css method. I won’t go in depth into CSS selectors here as that’s a huge topic for another day but I will explain what this one does:

  • .filmListFilmInfo – says “select all elements in the HTML that have the CSS class filmListFilmInfo”. If you look at the starting <div> you’ll see it.
  • h3 – the space before the h3 says “now look for elements inside the previous part”. The h3 itself says “now select all h3 elements inside all elements that had the class .filmListFilmInfo”.
  • a::text – again the space before this part says the same; and the a::text part says “find all anchor elements and get the link text”.

So to end we have a selector which says “find all elements with class .filmListFilmInfo, then get all h3 elements inside that, then get all anchor elements inside that and take their text”.

movie_names is now a list of movies we can later lookup!
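If you want to sanity-check a selector by hand before running the whole spider, Scrapy ships with an interactive shell that fetches a page and drops you at a Python prompt with the response object ready to poke at. Something along these lines (the exact output will obviously depend on what’s showing that week):

scrapy shell "http://www.myvue.com/latest-movies/cinema/london-stratford"
>>> response.css('.filmListFilmInfo h3 a::text').extract()
['The Conjuring 2', 'Me Before You', ...]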

We then loop through this list of movie names, create a dictionary for each and then yield them. yield is a keyword in Python which I won’t go into now but can be thought of in the same light as return in its most basic sense. We construct a Python dictionary rather than just returning the string itself so that we can adapt this later to return extra information about the film (this will make sense in the challenges part later..)
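To make that concrete, here is a hedged sketch of how the yielded dictionary could grow once we care about more than the name; the extra fields are hypothetical and nothing in our spider collects them yet:

            yield {
                'name': movie_name,
                'showtimes': ['18:00', '20:45'],   # made up – we aren't scraping these yet
                'certificate': '15',               # ditto
            }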

Run that thing..

Now we need to test the thing gets the data we want..

scrapy runspider cinema_scraper.py -o movies.json

Scrapy has the functionality to run individual spiders if you pass it the filename and the “runspider” command. If you run this, you will see a bunch of debug logging output and hopefully see it finding a bunch of scraped items. In the background, because we supplied the -o movies.json parameter, it also saves all the results to a JSON file of that name. If it ran successfully, the contents will look similar to this:

[
  {"name": "The Conjuring 2"},
  {"name": "Me Before You"},
  {"name": "Gods Of Egypt"},
  {"name": "The Boss"},
  {"name": "X-Men: Apocalypse"},
  {"name": "The Angry Birds Movie"}
]

That’s it! We have the file we need to create the next part.

Part 2: Check IMDb and email

So the scraper is working and it’s saving the data we need. What next? Panic? No! Let’s just start work on the next manageable chunk: checking the ratings on IMDb!

With this part of the flow we need to: read in the file we just saved and make sense of it; check each film name found with IMDb to get its rating; and lastly, if we have any good films, send an email to ourselves containing their names.

You can read this entire file in the code zip as check_imdb.py.

import json
import sys

import imdb
import sendgrid

NOTIFY_ABOVE_RATING = 7.5  # only email about films rated above this on IMDb

SENDGRID_API_KEY = "API KEY GOES HERE"


def run_checker(scraped_movies):
    imdb_conn = imdb.IMDb()
    good_movies = []
    for scraped_movie in scraped_movies:
        imdb_movie = get_imdb_movie(imdb_conn, scraped_movie['name'])
        if imdb_movie['rating'] > NOTIFY_ABOVE_RATING:
            good_movies.append(imdb_movie)
    if good_movies:
        send_email(good_movies)


def get_imdb_movie(imdb_conn, movie_name):
    results = imdb_conn.search_movie(movie_name)
    movie = results[0]
    imdb_conn.update(movie)
    print("{title} => {rating}".format(**movie))
    return movie


def send_email(movies):
    sendgrid_client = sendgrid.SendGridClient(SENDGRID_API_KEY)
    message = sendgrid.Mail()
    message.add_to("you@example.com")      # put your own email address here
    message.set_from("you@example.com")    # and the address you want it to come from
    message.set_subject("Highly rated movies of the day")
    body = "High rated today:<br><br>"
    for movie in movies:
        body += "{title} => {rating}<br>".format(**movie)
    message.set_html(body)
    sendgrid_client.send(message)
    print("Sent email with {} movie(s).".format(len(movies)))


if __name__ == '__main__':
    movies_json_file = sys.argv[1]
    with open(movies_json_file) as scraped_movies_file:
        movies = json.loads(scraped_movies_file.read())
    run_checker(movies)

Now that’s quite a chunk to take in at once, so let’s go through it section by section to see what the hell it’s doing..

Read in the saved file

Earlier, we made the scraper output a JSON file containing a list of movie names. Now we need our 2nd Python script to open the file and convert the JSON list of movie names back to a Python list so that we can use it. Let’s go:

if __name__ == '__main__':
    movies_json_file = sys.argv[1]
    with open(movies_json_file) as scraped_movies_file:
        movies = json.loads(scraped_movies_file.read())
    run_checker(movies)

You will find this section at the bottom of the check_imdb.py file, and that is because of the special if __name__ == '__main__' line. What this does is allow us to run some code when the script is being run from the command line; it’s quite an odd line, very hard to remember and not something I like about Python, but luckily whenever we write a script we can just copy-paste it in and get on our merry way.
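If you want to see the line in action, a throwaway experiment makes it click; the filename here is just for illustration:

# whoami.py (hypothetical)
print(__name__)

if __name__ == '__main__':
    print("I was run directly from the command line")

# Running "python whoami.py" prints __main__ followed by the message.
# Running "import whoami" from another script prints whoami and nothing else.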

The very next line uses the variable sys.argv which is a list of arguments the Python script was called with on the command line. If we take a look at what we’d write on the command line, then it’s a bit easier to visualise what sys.argv would contain:

python check_imdb.py movies.json
python check_imdb.py some_other_file.json

In both cases of running these from the command line, sys.argv will be a list. The first item in the list (sys.argv[0]) will be equal to “check_imdb.py”, as that is the 1st argument to the Python executable and the name of the file to run; the 2nd item, sys.argv[1], is the one we want, as it is the name of the file that was output by the scraper.
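In other words, for the first command above, sys.argv roughly looks like this:

# python check_imdb.py movies.json
sys.argv     # ['check_imdb.py', 'movies.json']
sys.argv[1]  # 'movies.json' – the file our scraper wrote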

We could have hardcoded the location of the file in the script itself but remember: the whole advantage of splitting up the script was to keep it portable and to keep the 2 parts separated – hardcoding in the filename would have created what we programmers call a hard dependency: “script 2 depends on script 1 outputting a file in this specific location”. By allowing the script to take in a variable from the outside world, we can save the file wherever the hell we want. Ace.

The with open() line opens the JSON file with Python’s standard file opening mechanics; it then puts the contents of the file through json.loads which converts the JSON data into Python data types; in this instance, we’re left with our list of movie objects (movies).

The final line in this section takes our list of movies and runs a function called run_checker. Let’s take a look at that..

Process the list of movies

def run_checker(scraped_movies):
    imdb_conn = imdb.IMDb()
    good_movies = []
    for scraped_movie in scraped_movies:
        imdb_movie = get_imdb_movie(imdb_conn, scraped_movie['name'])
        if imdb_movie['rating'] > NOTIFY_ABOVE_RATING:
            good_movies.append(imdb_movie)
    if good_movies:
        send_email(good_movies)

So here we create our run_checker function, which accepts a list of movie objects (scraped_movies) as its input. We then instantiate (create) our IMDb connection object, which is provided by the IMDbPY library; you can see the import for yourself at the top of the file. Creating this connection object once and passing it to all our other functions saves us the hassle of recreating it every time we need it.

The next line starts an empty list of good_movies; it’s empty because we don’t know which movies have a good rating yet but we need to create it now so that we have somewhere to store the good movies when we find them.

Now comes the mighty for loop – a for loop in Python simply takes an iterable (something that has many things) and loops through the entire set of things it contains so that the programmer can do something with each thing. In our case the things we are looping through are the movie objects we loaded from the JSON file earlier. This works by assigning the first item in scraped_movies to the variable scraped_movie, running the indented code, then assigning the next item in scraped_movies to scraped_movie, and so on until there are no more items.
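If the mechanics still feel abstract, this trivial standalone snippet does the same dance on a hard-coded list:

scraped_movies = [{'name': 'The Conjuring 2'}, {'name': 'The Boss'}]
for scraped_movie in scraped_movies:
    # On the first pass scraped_movie is the first dictionary, then the second, and so on
    print(scraped_movie['name'])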

Inside the nested bit, we see:

        imdb_movie = get_imdb_movie(imdb_conn, scraped_movie['name'])
        if imdb_movie['rating'] > NOTIFY_ABOVE_RATING:
            good_movies.append(imdb_movie)

You can see we call another function that we’ve called get_imdb_movie which takes two inputs: the imdb connection we created above and the ‘name’ key of the scraped_movie.

If you remember, we saved the scraped movies as a list of objects in JSON like:

[
  {"name": "The Conjuring 2"},
  {"name": "The Killing"},
]

And then we used json.loads() when reading in the file, which takes the JSON and turns it into Python datatypes we can work with. In our instance it becomes a list of dictionaries. For newbies, dictionaries can be thought of as a lookup table – every entry has a key and a value, and each dictionary can have multiple entries but the keys must be unique. In our case the key is the field name, “name”, and the value is the name of the movie. Note that each movie object only has one key at the moment because we only cared about the name; but what if we wanted to gather more data about the movie? Keep this in mind for the challenge later..
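Here is the round trip in miniature, so you can see what json.loads hands back and how the keys are used (paste it into a Python prompt if you like):

import json

movies = json.loads('[{"name": "The Conjuring 2"}, {"name": "The Killing"}]')
movies[0]           # {'name': 'The Conjuring 2'} – a dictionary
movies[0]['name']   # 'The Conjuring 2' – look up the value by its key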

Anyway, back to the nested for loop in the run_checker function – this is the reason we have to pluck the name field out of the variable scraped_movie – because each scraped_movie is actually a dictionary that could have many keys (even though it doesn’t yet..).

We’ll investigate the get_imdb_movie function in a minute but for now just know that it returns some data created by the IMDbPY library, and that this data contains the rating we need under a key called rating. So we do a simple check: if the rating is higher than the global value we set at the top of the file, then it’s a good movie! In programming, an if statement is known as a conditional. It’s the part of code where a decision is made which results in different things happening; in our case, if the rating is high enough we add the movie to the good_movies list, and if it’s not we ignore the crap movie and carry on with our lives.

The very final lines of this run_checker function check to see if any good movies were added to the good movie list, and if some were – it calls the function to send the email. We do this extra check because otherwise you would receive an email even if there weren’t any good movies found and let’s be honest, you get enough spam as it is.

Get the rating from IMDb

So, moving on to that get_imdb_movie function that we skipped:

def get_imdb_movie(imdb_conn, movie_name):
    results = imdb_conn.search_movie(movie_name)
    movie = results[0]
    imdb_conn.update(movie)
    print("{title} => {rating}".format(**movie))
    return movie

This one reads a bit strangely as frankly the IMDbPY library is a bit odd to use and is probably what we call “unpythonic” in the land of Python. I won’t go into why – but don’t take this as a good example of how to write a library.

The first line does a search on IMDb for our movie name; the second line takes the first result from the list of results returned and the 3rd line tells IMDbPY to go and fetch more information about that specific movie. It then prints a message to the terminal so that we know the script is doing something and returns the movie data containing the rating we want and lots of other information that you might find useful in the challenges later.
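Out of curiosity, you can poke at everything else IMDbPY fetched; the movie object behaves a lot like a dictionary. A hedged sketch (the fields available can vary from title to title, so check keys() yourself rather than trusting this list):

import imdb

ia = imdb.IMDb()
movie = ia.search_movie("The Conjuring 2")[0]
ia.update(movie)
print(movie.keys())          # e.g. 'title', 'rating', 'genres', 'runtimes', 'year', ...
print(movie.get('genres'))   # handy for the genre-filtering challenge later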

Sending the email

And finally for this file, we’ll take a look at the send_email function:

Sendgrid is a service which allows you to send transactional email (email that occurs because of an action a user takes) via an API, and it provides a Python library to make this as easy as possible for us.

It should be noted that Python can send email via a normal SMTP server, so if you have one of those or you want to use your GMail details, you might prefer doing that rather than relying on the Sendgrid cloud service.
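If you’d rather go the plain SMTP route, a minimal sketch using only the standard library might look like the following; the server address, port and credentials are placeholders you’d swap for your own (for Gmail you would likely need an app-specific password):

import smtplib
from email.mime.text import MIMEText

def send_email_smtp(body_html):
    # Build an HTML email, much like the Sendgrid version does
    message = MIMEText(body_html, "html")
    message["Subject"] = "Highly rated movies of the day"
    message["From"] = "you@example.com"
    message["To"] = "you@example.com"

    server = smtplib.SMTP_SSL("smtp.example.com", 465)  # your provider's SMTP server
    server.login("you@example.com", "your-password-here")
    server.sendmail(message["From"], [message["To"]], message.as_string())
    server.quit()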

To use this code you will need to a) sign up for a Sendgrid account on their free plan and b) update the SENDGRID_API_KEY variable at the top of the file. Be careful where you put your code once you’ve done that: if you put it anywhere public, someone could take the API key and use it to spam people. In the real world we put these keys in “environment variables”, but discussing that properly is out of scope for this article.
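For the curious, the environment-variable version is only a couple of lines; this is a sketch, not how the code in the zip is written:

import os

# Falls back to the placeholder if the environment variable isn't set
SENDGRID_API_KEY = os.environ.get("SENDGRID_API_KEY", "API KEY GOES HERE")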

def send_email(movies):
    sendgrid_client = sendgrid.SendGridClient(SENDGRID_API_KEY)
    message = sendgrid.Mail()
    message.add_to("you@example.com")      # put your own email address here
    message.set_from("you@example.com")    # and the address you want it to come from
    message.set_subject("Highly rated movies of the day")
    body = "High rated today:<br><br>"
    for movie in movies:
        body += "{title} => {rating}<br>".format(**movie)
    message.set_html(body)
    sendgrid_client.send(message)
    print("Sent email with {} movie(s).".format(len(movies))

The first 5 lines are fairly self-explanatory: we create a Sendgrid client connection, then create an email object and start filling it with data about the subject, who to send it to & who to send it from.

The next few lines create the “body” text of the email; this is written in HTML, so the <br> is translated by your email reader as “insert a new line here”. We start the body string with a title of “High rated today” and then loop through all the good movies we found, inserting their name and rating, one per line. We then send the email and print a success message to the terminal.

Gluing them together

So we have our 2 scripts! But they are still 2 separate scripts and it will quickly become a complete pain running each one individually every time. So let’s fix that with, you guessed it, another little script.

This script is slightly different as it is a shell (bash) script; in their simplest form, bash scripts can simply be line-by-line commands that you would type into the terminal anyway. We’re going to fluff ours out a bit so it provides some feedback as to what is going on.

#!/bin/bash
echo "Starting scraper"
scrapy runspider cinema_scraper.py -t json --nolog -o - > "movies.json"
echo "Scrape complete, checking movies with imdb"
python check_imdb.py movies.json

The first line tells the shell it’s a bash script; the echo lines simply print a message to the screen, giving a bit of feedback to the person running it. The other lines you will recognise from our previous testing: the Scrapy command saves its results to movies.json, and the check_imdb.py script reads its input from the same file.

You can run this script from the code folder with:

./run.sh
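If the shell complains that run.sh isn’t executable, mark it as such first. And since the whole point is a daily email, you could let cron run it for you; the schedule below (9am every day) and the path are just examples:

chmod +x run.sh

# crontab -e, then add a line like:
0 9 * * * cd /path/to/the/code && ./run.sh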

My challenge to you, right now

We’re nearly done, feels a lot easier now right? The challenge I set you now is to convert this script into one that works with your local cinema, as I very much doubt you care what my cinema is showing. If you remember my explanation about why we separated the scripts, this is yet another reason why we did it: you only have to change the web scraping part to look at your local cinema’s website, and as long as you return the data in the same way you don’t even have to touch the lookup script. If you follow these steps you can have a working script today:

  • visit your local cinema’s website
  • find the page which details the films on today
  • work out what CSS selector you need to use to extract the film names
  • knowing the results of the previous 2 points, update the web scraping script

You want more of a challenge? Good on ya. Can you update the scripts to:

  • include the show times of the films in the email?
  • remember which films have been seen so you don’t get an email every day?
  • limit the films you’re told about to a certain genre?

As ever, email me for some hints or just to show me how you got along – it makes all this worthwhile. Same goes for if you have anything to add, I’m all ears. Simply reply to any email from my list.

Reminder: want the code?

As I said earlier, the code for this is available for free and the license allows you to do what the hell you want with it, with the one caveat that you’re not allowed to take me to court if your computer melts through the floor (I jest). To get it, simply plonk your email below and my army of robot minions will make sure you get it swiftly.
