When should you use Scrapy over BeautifulSoup?
...and what's the difference anyway?
If you’ve found yourself here, you’re probably trying to retrieve some data off the web to solve a problem - whatever it may be. And if you’re asking this particular question, then this process is probably familiar to you:
- You started by furiously Googling how to do it.
- And then discovered it is called “web scraping”.
- You then read or got told that Python is a great language to scrape in.
So, armed with that knowledge, you looked for the next step: working out which method is best! And that’s when things fall apart, with plenty of people who claim to know best clamouring to tell you which software to use to reach your goal.
Hi! I'm Darian, a software developer based in London, and I teach my readers here at Hexfox to automate the repetitive tasks that would otherwise suck hours out of their life, with the aim of preparing them for a future world where automation is king. Sound interesting? Stick around or sign up!
You started searching for a solution and yet you’ve ended up with more freaking problems. With online help, you’ve narrowed it down to two apparently competing routes, Scrapy and BeautifulSoup, but now you’re not sure which one would be best to learn. Fear not - I will help you make this decision, right now.
First we need a little background so that we can understand the differences between them: Scrapy is a fully fledged solution which allows people to write small amounts of Python code to create a “spider” - an automated bot which can trawl web pages and scrape them. BeautifulSoup on the other hand is a helpful utility that allows a programmer to get specific elements out of a webpage (for example, a list of images). As such, BeautifulSoup alone is not enough because you have to actually get the webpage in the first place, and this leads people to use something like requests or urllib (urllib2 if you are still on Python 2) to do that part. These tools operate kind of like a web browser and retrieve pages off the internet so that BeautifulSoup can pluck out the bits a person is after.
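To make that division of labour concrete, here is a minimal sketch of the BeautifulSoup half of the job. It assumes the bs4 package is installed; the HTML snippet and the function name image_urls are made up for illustration. In a real script the HTML string would be downloaded first, for example with requests.get(url).text.

```python
from bs4 import BeautifulSoup

# A made-up page; in practice this string would be downloaded first,
# e.g. with requests.get(url).text.
HTML = """
<html><body>
  <h1>Holiday photos</h1>
  <img src="/img/beach.jpg" alt="Beach">
  <img src="/img/sunset.jpg" alt="Sunset">
  <img alt="Broken image with no src">
</body></html>
"""


def image_urls(html):
    """Return the src of every <img> tag that actually has one."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True keeps only the tags that carry a src attribute.
    return [img["src"] for img in soup.find_all("img", src=True)]


print(image_urls(HTML))
```

Notice that nothing here knows how to fetch a page, follow links or save results: that is all left to you.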
So the difference between the two is actually quite large: Scrapy is a tool specifically created for downloading, cleaning and saving data from the web and will help you end-to-end; whereas BeautifulSoup is a smaller package which will only help you get information out of webpages.
So when should I use what?
We’re here to get stuff done. If you are a beginner to the world of web scraping, I have no hesitation in saying that you should take the Scrapy route. Scrapy will solve numerous problems for you that you would otherwise have to handle yourself; in fact I’m willing to bet that it will solve problems for you that you don’t even know you have yet. “Like what!?”, you say? Let’s have a look:
Scrapy enables you to easily post-process any data you find. Data on the web is a mess! It is very unlikely that the data you find will be in the exact format that you would like it to be: it may have extra line breaks; funky styling; extra commas in random places; or simply be in all upper case. Scrapy will let you handle these cases in a straightforward fashion, using building blocks such as item pipelines and item loaders.
Data can often be incomplete in the wild - if you are writing your own script you will have to try doubly hard to ensure it is resilient to these cases. Scrapy will make the process of working around incomplete data much easier for you.
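As a rough illustration of the last two points, here is what a small Scrapy item pipeline might look like. It is only a sketch: it assumes items are plain dicts, the field names title and price are invented, and it uses the long-documented process_item(self, item, spider) hook, so check the docs for the Scrapy version you install.

```python
class CleanItemPipeline:
    """Tidy up scraped fields and fill in gaps before items are saved."""

    def process_item(self, item, spider):
        # Collapse stray line breaks and repeated spaces, then fix the
        # capitalisation of titles that arrive in all upper case.
        title = item.get("title") or ""
        item["title"] = " ".join(title.split()).title()

        # Prices often arrive as strings such as "£1,299.00". A missing
        # or unparseable price becomes None instead of crashing the run.
        raw_price = (item.get("price") or "").replace(",", "").lstrip("£$€ ")
        try:
            item["price"] = float(raw_price)
        except ValueError:
            item["price"] = None
        return item
```

A pipeline like this is switched on through the ITEM_PIPELINES setting, for example `ITEM_PIPELINES = {"myproject.pipelines.CleanItemPipeline": 300}`, where the module path is hypothetical. Every item your spider yields then passes through it, so the cleaning rules live in one place.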
You will often find when scraping that web pages just blow up in your face: pages won’t be found, servers will have errors or you could have internet connectivity issues halfway through a large scrape. Scrapy lets you handle errors gracefully and even has a built-in ability to resume an interrupted scrape from where it left off. You get all this for free.
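Most of this is configuration rather than code. A settings fragment along these lines is one way to set it up; the setting names are Scrapy's, while the values and the directory name are just example choices.

```python
# settings.py (fragment)

# Retry failed downloads a few times before giving up on a page.
RETRY_ENABLED = True
RETRY_TIMES = 3  # retries in addition to the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Persist crawl state to disk so an interrupted run can be resumed.
# The directory name is a made-up example.
JOBDIR = "crawls/my-spider-run-1"
```

With JOBDIR set, stopping the crawl cleanly and then running the same command again should pick up where the previous run stopped instead of starting over.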
Some websites will be behind a login wall. Scrapy has built-in form handling which you can set up to log in to a website before beginning your scrape.
As a tool built specifically for the task of web scraping, Scrapy provides the building blocks you need to write sensible spiders. What are sensible spiders? Spiders that require a minimum amount of maintenance. Individual websites change their design and layouts on a frequent basis, and because we rely on the layout of the page to extract the data we want, this causes us headaches. Scrapy separates out the logic so that a simple change in layout doesn’t result in us having to rewrite our spider from scratch.
Scraping can cause issues for the sites you are targeting; for example, fetching too many pages at once can put a strain on the target server and take it offline. This will inevitably result in your spider getting banned for abuse - so it’s best to be a good citizen on the web. Scrapy allows you to be one by enabling you to easily throttle the rate at which you are scraping.
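Being a good citizen is also mostly a matter of settings. The fragment below is one conservative starting point; the setting names are Scrapy's, but the numbers are assumptions to tune for the site you are scraping.

```python
# settings.py (fragment)

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Wait about a second between requests to the same site and keep the
# number of simultaneous requests per domain low.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adjust the delay based on how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

With AutoThrottle on, a server that starts responding slowly is given more breathing room without you having to watch the crawl.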
Scrapy can make multiple requests at the same time, which allows scraping runs to be much faster. If you are writing a Python script from scratch that tries to do that, you will likely find that things can go wrong in a million horrible ways. Scrapy has years of use inside large organisations behind it, so those problems have already been found and fixed for you.
You will see many people recommending other solutions, and they work! But what I am saying is: they will likely take more effort and thus it will take you longer to get what you want done. On top of that, you should also be wary of people suggesting things because:
- they likely do not know your full situation and the future plans of the project you are about to embark upon. What starts as a 20 line script rarely stays that way!
- they do not know your level of knowledge.
- people forget easily just how hard it was to learn the thing they are suggesting.
- developers in particular love to suggest solutions to problems that they themselves would find interesting to solve. This often means not using an “off the shelf” solution when that would be a much faster way to get the result you are after. Remember, we only care about the end result here.
When should I use BeautifulSoup & Requests then?
If you know that you won’t need any of the above, or any scraping guidance in general, then they are fantastic tools that offer a lot of freedom. For one-off scripts that you don’t plan to maintain in the long run, they are likely the better solution.
This is the first post of many on the topic of web scraping; if you enjoyed this - thanks for reading and consider signing up to the mailing list below to receive more articles like this before they even get posted here.