Hexfox

Scrape your cinema's listings to get a daily email of films with a high IMDb rating (Part 2)

Never miss a great film with this Scrapy tutorial

So the scraper is working and it’s saving the data we need. What next? Panic? No! Let’s just start work on the next manageable chunk: checking the ratings on IMDb!

With this part of the flow we need to: read in the file we just saved and make sense of it; check each film name found with IMDb to get its rating; and lastly, if we have any good films, send an email to ourselves containing their names.


Hi! I'm Darian, a software developer based in London, and I teach my readers here at Hexfox to automate the repetitive tasks that would otherwise suck hours out of their life, with the aim of preparing them for a future world where automation is king. Sound interesting? Stick around or sign up!


You can read this entire file in the code zip as check_imdb.py.

import json
import sys

import imdb
import sendgrid

NOTIFY_ABOVE_RATING = 7.5
SENDGRID_API_KEY = "API KEY GOES HERE"


def run_checker(scraped_movies):
    imdb_conn = imdb.IMDb()
    good_movies = []
    for scraped_movie in scraped_movies:
        imdb_movie = get_imdb_movie(imdb_conn, scraped_movie['name'])
        if imdb_movie['rating'] > NOTIFY_ABOVE_RATING:
            good_movies.append(imdb_movie)
    if good_movies:
        send_email(good_movies)


def get_imdb_movie(imdb_conn, movie_name):
    results = imdb_conn.search_movie(movie_name)
    movie = results[0]
    imdb_conn.update(movie)
    print("{title} => {rating}".format(**movie))
    return movie


def send_email(movies):
    sendgrid_client = sendgrid.SendGridClient(SENDGRID_API_KEY)
    message = sendgrid.Mail()
    message.add_to("trevor@example.com")
    message.set_from("no-reply@example.com")
    message.set_subject("Highly rated movies of the day")
    body = "High rated today:<br><br>"
    for movie in movies:
        body += "{title} => {rating}<br>".format(**movie)
    message.set_html(body)
    sendgrid_client.send(message)
    print("Sent email with {} movie(s).".format(len(movies)))


if __name__ == '__main__':
    movies_json_file = sys.argv[1]
    with open(movies_json_file) as scraped_movies_file:
        movies = json.loads(scraped_movies_file.read())
    run_checker(movies)

Now that’s quite a chunk to take in at once, so let’s go through it section by section to see what the hell it’s doing.

Read in the saved file

Earlier, we made the scraper output a JSON file containing a list of movie names. Now we need our 2nd Python script to open the file and convert the JSON list of movie names back to a Python list so that we can use it. Let’s go:

if __name__ == '__main__':
    movies_json_file = sys.argv[1]
    with open(movies_json_file) as scraped_movies_file:
        movies = json.loads(scraped_movies_file.read())
    run_checker(movies)

You will find this section at the bottom of the check_imdb.py file, and that is because of the special if __name__ == '__main__' line. It lets us run some code only when the script is being run from the command line (and not when the file is imported by another script). It’s quite an odd line, very hard to remember and not something I like about Python, but luckily whenever we write a script we can just copy and paste it in and get on our merry way.

The very next line uses the variable sys.argv which is a list of arguments the Python script was called with on the command line. If we take a look at what we’d write on the command line, then it’s a bit easier to visualise what sys.argv would contain:

python check_imdb.py movies.json
python check_imdb.py some_other_file.json

In both cases, sys.argv will be a list. The first item in the list (sys.argv[0]) will be equal to “check_imdb.py”, as that is the 1st argument to the Python executable and the name of the file to run. The 2nd item, sys.argv[1], is the one we want, as it is the name of the file that was output by the scraper.
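To make the sys.argv idea concrete, here is a minimal sketch that pulls the file name out of an argument list. The helper name input_path_from_args is made up for this illustration (it is not in check_imdb.py), and the list is passed in by hand so you can try it without touching the command line:

```python
def input_path_from_args(argv):
    """Return the JSON file name from a sys.argv-style list.

    argv[0] is the script name; argv[1] is the first thing the
    user typed after it. (Hypothetical helper, for illustration.)
    """
    if len(argv) < 2:
        # No file name was given, so fail with a readable message
        # instead of an IndexError.
        raise ValueError("usage: python check_imdb.py <movies.json>")
    return argv[1]


# Simulates running: python check_imdb.py movies.json
example_argv = ["check_imdb.py", "movies.json"]
print(input_path_from_args(example_argv))
```

In the real script you would pass sys.argv itself rather than a hand-built list.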

We could have hardcoded the location of the file in the script itself but remember: the whole advantage of splitting up the script was to keep it portable and to keep the 2 parts separated - hardcoding in the filename would have created what we programmers call a hard dependency: “script 2 depends on script 1 outputting a file in this specific location”. By allowing the script to take in a variable from the outside world, we can save the file wherever the hell we want. Ace.

The with open() line opens the JSON file with Python’s standard file opening mechanics; it then puts the contents of the file through json.loads which converts the JSON data into Python data types; in this instance, we’re left with our list of movie objects (movies).
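If you want to see that conversion in isolation, here is a small sketch that feeds json.loads a string instead of a real file. The sample string mirrors the shape of the scraper output, and io.StringIO is only there to stand in for an open file:

```python
import io
import json

# The same kind of text the scraper writes to disk.
raw_json = '[{"name": "The Conjuring 2"}, {"name": "The Killing"}]'

# json.loads takes a string and returns Python data types.
movies = json.loads(raw_json)
print(type(movies).__name__)     # list
print(type(movies[0]).__name__)  # dict
print(movies[0]["name"])         # The Conjuring 2

# json.load (no "s") reads straight from a file-like object,
# so it does the same job as json.loads(f.read()).
fake_file = io.StringIO(raw_json)
same_movies = json.load(fake_file)
print(movies == same_movies)     # True
```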

The final line in this section takes our list of movies and passes it to a function called run_checker. Let’s take a look at that.

Process the list of movies

def run_checker(scraped_movies):
    imdb_conn = imdb.IMDb()
    good_movies = []
    for scraped_movie in scraped_movies:
        imdb_movie = get_imdb_movie(imdb_conn, scraped_movie['name'])
        if imdb_movie['rating'] > NOTIFY_ABOVE_RATING:
            good_movies.append(imdb_movie)
    if good_movies:
        send_email(good_movies)

So here we create our run_checker function that accepts an input that is a list of movie objects (scraped_movies). We then instantiate (create) our IMDb connection object, which is provided by the IMDbPY library; you can see the import for yourself at the top of the file. Creating this connection object once lets us pass it to all our other functions, which saves us the hassle of creating a new one every time we need it.

The next line starts an empty list of good_movies; it’s empty because we don’t know which movies have a good rating yet but we need to create it now so that we have somewhere to store the good movies when we find them.

Now comes the mighty for loop - a for loop in Python simply takes an iterable (something that has many things) and loops through the entire set of things it contains so that the programmer can do something with each thing. In our case the things we are looping through are the movie objects we loaded from the JSON file earlier. This works by assigning the first item in scraped_movies to the variable scraped_movie, then running the indented code and then setting the next item in scraped_movies to scraped_movie and so on, until there are no more items.
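Here is the loop on its own as a tiny sketch, with a hand-written list standing in for the data loaded from the JSON file, so you can watch scraped_movie change on each pass:

```python
# Hand-written stand-in for the list loaded from the JSON file.
scraped_movies = [{"name": "The Conjuring 2"}, {"name": "The Killing"}]

seen_names = []
for scraped_movie in scraped_movies:
    # On each pass, scraped_movie is the next dictionary in the list.
    print("Checking:", scraped_movie["name"])
    seen_names.append(scraped_movie["name"])

# The loop stops by itself once the list runs out of items.
print(seen_names)
```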

Inside the nested bit, we see:

imdb_movie = get_imdb_movie(imdb_conn, scraped_movie['name'])
if imdb_movie['rating'] > NOTIFY_ABOVE_RATING:
    good_movies.append(imdb_movie)

You can see we call another function that we’ve called get_imdb_movie which takes two inputs: the imdb connection we created above and the ‘name’ key of the scraped_movie.

If you remember, we saved the scraped movies as a list of objects in JSON like:

[{ "name": "The Conjuring 2" }, { "name": "The Killing" }]

And then we used json.loads() when reading in the file, which takes the JSON and turns it into Python datatypes we can work with. In our instance it would have been turned into a list of dictionaries. For newbies, dictionaries can be thought of as a lookup table: every entry has a key and a value, and each dictionary can have multiple entries, but the keys must be unique. In our case the only key is the field name, “name”, and the values are the names of the movies. Note that each movie object only has one key at the moment because we only cared about the name; but what if we wanted to gather more data about the movie? Keep this in mind for the challenge later.
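A quick sketch of how a dictionary behaves, in case it is new to you. The "showtime" and "certificate" keys are invented for this example; the scraper in this tutorial only saves "name":

```python
# "showtime" is a hypothetical extra field, not something the scraper saves.
movie = {"name": "The Conjuring 2", "showtime": "20:30"}

# Look a value up by its key.
print(movie["name"])

# .get() returns a default instead of raising KeyError when a key is missing.
print(movie.get("certificate", "unknown"))

# Keys are unique: assigning to an existing key replaces its value
# rather than adding a second entry.
movie["showtime"] = "21:00"
print(len(movie))
```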

Anyway, back to the indented body of the for loop in the run_checker function. This is the reason we have to pluck the name field out of the variable scraped_movie: each scraped_movie is actually a dictionary that could have many keys (even though it doesn’t yet).

We’ll investigate the get_imdb_movie function in a minute, but for now just know that it returns some data created by the IMDbPY library, and that this data contains the rating we need under a key called rating. So we do a simple check: if the rating is higher than the global value we set at the top of the file (NOTIFY_ABOVE_RATING), then it’s a good movie! In programming, an if statement is known as a conditional. It’s the part of the code where a decision is made which results in different things happening. In our case, if the rating is high enough we add the movie to the good_movies list, and if it’s not, we ignore the crap movie and carry on with our life.

The very final lines of this run_checker function check to see if any good movies were added to the good movie list, and if some were - it calls the function to send the email. We do this extra check because otherwise you would receive an email even if there weren’t any good movies found and let’s be honest, you get enough spam as it is.
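One thing the check above takes for granted is that every movie has a rating. As a sketch of a slightly more cautious version, the helper below (pick_good_movies is a made-up name, not part of check_imdb.py) skips any movie with no rating instead of crashing. It is shown with plain dictionaries and invented ratings so it runs without IMDbPY; I am assuming the real movie objects support .get() the way dictionaries do:

```python
NOTIFY_ABOVE_RATING = 7.5


def pick_good_movies(imdb_movies, threshold=NOTIFY_ABOVE_RATING):
    """Return only the movies rated above the threshold.

    Hypothetical helper: movies with no 'rating' key are skipped.
    """
    good_movies = []
    for imdb_movie in imdb_movies:
        rating = imdb_movie.get("rating")  # None when there is no rating
        if rating is not None and rating > threshold:
            good_movies.append(imdb_movie)
    return good_movies


# Invented sample data, purely to exercise the function.
sample = [
    {"title": "Film A", "rating": 8.1},
    {"title": "Film B", "rating": 6.0},
    {"title": "Film C"},  # not rated yet
]
good = pick_good_movies(sample)
print([movie["title"] for movie in good])
if good:
    print("Would send an email now.")
```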

Get the rating from IMDb

So, moving on to that get_imdb_movie function that we skipped:

def get_imdb_movie(imdb_conn, movie_name):
    results = imdb_conn.search_movie(movie_name)
    movie = results[0]
    imdb_conn.update(movie)
    print("{title} => {rating}".format(**movie))
    return movie

This one reads a bit strangely as frankly the IMDbPY library is a bit odd to use and is probably what we call “unpythonic” in the land of Python. I won’t go into why - but don’t take this as a good example of how to write a library.

The first line does a search on IMDb for our movie name, the second line takes the first result from the list of results returned, and the 3rd line tells IMDbPY to go and fetch more information about that specific movie. The function then prints a message to the terminal so that we know the script is doing something, and returns the movie data containing the rating we want, plus lots of other information that you might find useful in the challenges later.
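Note that results[0] will blow up with an IndexError if the search comes back empty. Here is a sketch of a more forgiving variant, under a couple of assumptions: find_imdb_movie is a made-up name, and FakeIMDb is a stand-in I wrote so the example runs without the network; it only mimics the two calls used above (search_movie and update) and is not the real IMDbPY class.

```python
def find_imdb_movie(imdb_conn, movie_name):
    """Return the first search result, or None if nothing matched."""
    results = imdb_conn.search_movie(movie_name)
    if not results:
        print("No IMDb match for {!r}, skipping.".format(movie_name))
        return None
    movie = results[0]
    imdb_conn.update(movie)  # fetch the extra details, as in the original
    return movie


class FakeIMDb:
    """Stand-in for the IMDb connection, for offline experimenting only."""

    def __init__(self, catalogue):
        self.catalogue = catalogue

    def search_movie(self, name):
        # Return copies of every entry whose title matches exactly.
        return [dict(m) for m in self.catalogue if m["title"] == name]

    def update(self, movie):
        # The real library fills in more data here; the fake has it already.
        pass


conn = FakeIMDb([{"title": "Film A", "rating": 8.1}])  # invented data
print(find_imdb_movie(conn, "Film A"))
print(find_imdb_movie(conn, "Does Not Exist"))
```

If you use something like this, run_checker would also need to skip any movie that comes back as None before looking at its rating.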

Sending the email

And finally for this file, we’ll take a look at the send_email function:

Sendgrid is a service which allows you to send transactional email (email that occurs because of an action a user takes) via an API, and it provides a Python library to make this as easy as possible for us.

It should be noted that Python can send email via a normal SMTP server, so if you have one of those or you want to use your Gmail details, you might prefer doing that rather than relying on the Sendgrid cloud service.

To use this code you will need to a) sign up for a Sendgrid account on their free plan and b) update the SENDGRID_API_KEY variable at the top of the file. Be careful where you put your code if you do that though: if you put it anywhere public, someone would be able to take the API key and use it themselves to spam people. In the real world, we put these keys in “environment variables” but discussing that is out of scope for this article.
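If you do fancy the SMTP route, here is a rough sketch using only Python’s standard library (smtplib and email.message). It is not the code from the zip: the function names are mine, and the host, port, username and password are placeholders you would swap for your own provider’s details. Only the message-building part actually runs below; the sending function is defined but not called, since it needs a real server.

```python
import smtplib
from email.message import EmailMessage


def build_message(movies, sender, recipient):
    """Build a plain-text email listing each movie's title and rating."""
    message = EmailMessage()
    message["Subject"] = "Highly rated movies of the day"
    message["From"] = sender
    message["To"] = recipient
    lines = ["{title} => {rating}".format(**movie) for movie in movies]
    message.set_content("High rated today:\n\n" + "\n".join(lines))
    return message


def send_via_smtp(message, host, port, username, password):
    """Send the message over SMTP with STARTTLS.

    Not called in this sketch: it needs a real server and credentials.
    """
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(username, password)
        server.send_message(message)


# Invented sample data, just to show the message that would be sent.
email = build_message(
    [{"title": "Film A", "rating": 8.1}],
    sender="no-reply@example.com",
    recipient="trevor@example.com",
)
print(email)
```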

def send_email(movies):
    sendgrid_client = sendgrid.SendGridClient(SENDGRID_API_KEY)
    message = sendgrid.Mail()
    message.add_to("trevor@example.com")
    message.set_from("no-reply@example.com")
    message.set_subject("Highly rated movies of the day")
    body = "High rated today:<br><br>"
    for movie in movies:
        body += "{title} => {rating}<br>".format(**movie)
    message.set_html(body)
    sendgrid_client.send(message)
    print("Sent email with {} movie(s).".format(len(movies)))

The first 5 lines are fairly self-explanatory: we create a Sendgrid client connection, then create an email object and start filling it with data about the subject, who to send it to and who to send it from.

The next 3 lines create the “body” text of the email. This is written in HTML, so when you see <br>, your email reader translates it as “insert a new line here”. We start the body string with a title of “High rated today:” and then we loop through all the good movies we found and insert their title and rating, one per line. We then send the email and print a success message to the terminal.
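As a sketch of the body-building step on its own, the hypothetical build_body helper below does the same job with a join, so you can print the HTML and eyeball it before any email is sent. The html.escape call is an extra precaution I have added (it stops a title containing characters like & or < from being read as HTML); the original code does not do this.

```python
import html


def build_body(movies):
    """Return the HTML body: a heading line, then one movie per line."""
    lines = []
    for movie in movies:
        safe_title = html.escape(str(movie["title"]))
        lines.append("{} => {}".format(safe_title, movie["rating"]))
    # <br> is the HTML line break, so joining on it puts one movie per line.
    return "High rated today:<br><br>" + "<br>".join(lines)


# Invented sample data.
print(build_body([{"title": "Film A & B", "rating": 8.0}]))
# High rated today:<br><br>Film A &amp; B => 8.0
```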
