What is the best programming language for web scraping?
Which is the fastest or most reliable?
So, you need to scrape the web - you’re eyeing up that piece of data you’d like to extract and wondering what is the fastest and most efficient use of your time to get at it. Awesome, you are not the only one!
You might be coming from a background of zero programming knowledge but you’re determined and willing to learn any language to do it - you’d just rather not waste time going down the wrong alley or learning unnecessary bits of a language. Fear not… the answer is closer than you think.
Hi! I'm Darian, a software developer based in London, and I teach my readers here at Hexfox to automate the repetitive tasks that would otherwise suck hours out of their life, with the aim of preparing them for a future world where automation is king. Sound interesting? Stick around or sign up!
Your current knowledge
There is a common idiom in the world of programming that “the fastest language is the one you know”, and it really couldn’t be more true. Have you ever learned a language past the very first steps? If so, that language will likely have some support for scraping the web, and knowing how to work in it will get you to your goal much faster. Use it as a stepping stone.
For help with web scraping, many languages have 3rd-party libraries (code made by other people) that can help with your task. To find one, try a simple Google search for “[your language name] web scraping library”. Or failing that: put your email below, reply to the email you receive and ask me! I will personally reply and point you in the right direction for your language.
But for the rest of you - no programming language experience at all? Do not worry in the slightest, I’ll let you in on a little secret of the tech industry: most of the best programmers I know taught themselves from scratch in their own time.
And not only that, but usually when I ask someone why they got into programming, the response often sounds something like, “I wanted to get some data off a website to do [insert cool thing here]”. Along with web and game development, web scraping is very high on the list of subjects that bring many newbies into the industry simply because wanting to access data you can already see is such a prevalent desire thanks to the net.
There is nothing stopping you from becoming one of these people. Even if you don’t want to do it full time, a grounding in basic automation will help you with many tasks you’ll do in the future, and open doors to many others as well.
The route for beginners
So, where does that leave us? We want to be efficient with our learning time and we also want a language which is going to help us scrape the web rather than hinder us. I’m not going to beat around the bush, I have absolutely no hesitation in recommending Python as the language you should go forward with, and I’m going to explain why.
Ease of Learning
Python is used throughout the world as a teaching language because its syntax (how it looks) and rules lend themselves to simplicity. Don’t believe me?
```python
links = ["google.com/search", "amazon.com", "google.com/about"]
for link in links:
    if link.startswith("google.com"):
        print(link)
```
Take a few seconds to work out what it might be doing before reading ahead.
So what it does is this: first it creates a list of links, then it takes each link and checks whether it starts with “google.com”. If it does, it prints it to the screen. Did you get that? Well done if so: you just read a piece of code which handles variable assignment, iteration, conditionals, method calls and screen output. Damn, sounds much more intimidating in Computer Science speak, doesn’t it? Engineers do that a lot; many things are made to sound harder than they conceptually are.
Even if you couldn’t read it, you can probably take my description and work backwards to understand the code; and therein lies my point: Python is an easy language to work with because the syntax reads almost like English and the core concepts are easy to understand. With a goal in mind, you’ll pick it up in no time.
Speed? I/O is the problem
Plenty of people ask the question, “what is the fastest language for web scraping?”, but they do so before realising that processor performance is never the bottleneck when web scraping: I/O (input/output) is! I/O is any communication that has to occur between your processor and “the outside world”. In our case the outside world is the internet: for web scraping, the output is the requests for information we send out, and the input is the responses we get back.
The… internet… is… slow… No matter how fast your connection is, it will never compare with the bandwidth and speeds available to the processor and memory sitting inside your machine. A super-fast number-crunching language is therefore not a requirement for web scraping, because the processor is not the bottleneck; the internet is.
You will read people saying that Python is not the fastest language, and guess what: they are totally right. When they say this, they mean Python is not as fast as compiled languages such as C and Golang, but what you lose in raw performance with Python, you make up for in speed of development, readability of the code and the long-term maintainability of a project.
As an aside, you rarely want a web scrape to be fast anyway. A fast web scrape will put undue strain on the site you are scraping and this is actually where web scraping gets a bit of an unethical cowboy image. To scrape ethically is to scrape at a reasonable rate and not put excess pressure on the hosting of the site in question.
Unrivalled library support
Many languages have libraries and pieces of software already written that will help you with the task of web scraping - but Python has by far the most, and the largest community support for doing so. Let’s take a look at a few things:
requests - this library makes dealing with HTTP requests simple. Why do you need to deal with HTTP requests? Because they are what your web browser sends in the background when it wants to retrieve a page from the internet. By mimicking that we can get the same data the browser does.
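As a taste, here’s a minimal sketch of fetching a page with requests (assuming you’ve run `pip install requests`; the URL is just a stand-in for the page you want):

```python
import requests

# Ask for a page just like a browser would; a timeout stops us hanging forever.
response = requests.get("https://example.com", timeout=10)

print(response.status_code)  # 200 means the server answered happily
print(len(response.text))    # the HTML of the page, as one big string
```

That `response.text` string is the raw HTML, which is exactly what the next library is for.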
BeautifulSoup - is a library that handles extracting the relevant data from HTML. What is HTML? HTML is the markup language the web runs on. You can think of it as a giant lump of text that the browser knows how to render into a webpage, and BeautifulSoup knows how to read that same lump to get the data you need. In a simple script, you might use requests to make an HTTP request for the HTML, then have BeautifulSoup extract the data you need from it.
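Here’s a small sketch of the extraction half (assuming `pip install beautifulsoup4`); the HTML is hard-coded so you can see exactly what is being read:

```python
from bs4 import BeautifulSoup

# A small lump of HTML, as you might get back from requests.
html = """
<html><body>
  <h1>Latest posts</h1>
  <a class="post" href="/post/1">First post</a>
  <a class="post" href="/post/2">Second post</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Find every link tagged with the "post" class and pull out its href.
links = [a["href"] for a in soup.find_all("a", class_="post")]
print(links)  # ['/post/1', '/post/2']
```

Swap the hard-coded string for `response.text` from a real request and you have a working scraper.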
Scrapy - is more than a library. Scrapy is a framework designed explicitly for the job of scraping the web. What does this mean? It means it handles a lot of the work for you, covers edge cases you haven’t thought about yet, and provides a structure to your spiders so that multiple similar ones can share the same code. If you’re willing to put the time into learning how it works, you will be able to write web spiders in minutes rather than hours.
No compile step required
Python is an interpreted scripting language and therefore does not need to be compiled after every change to the code. What? Yeah, sorry. Compilation is the process that turns the text of a programming language into the 1s and 0s a processor can understand. Interpreted languages do this while the application is running, as opposed to beforehand. This is handy for web scraping:
- Python offers us a “shell”, which in plain English is a program that accepts your code line by line - something most compiled languages don’t offer. It allows us to enter our code line by line to see what it does live, and is a great way to fiddle with new concepts until you understand them.
- Because there is no compile time, changes take effect immediately when you next run the application you’ve written. This is great for web scraping as it’s really helpful to be able to fix mistakes & iterate quickly to get the job done.
- One-off scripts can be written for those small hack-together tasks you wish to carry out.
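The points above in action: here’s the kind of throwaway, one-off script Python makes easy, using only the standard library (the HTML is hard-coded to stand in for a downloaded page):

```python
from html.parser import HTMLParser

# A quick hack: pull every link target out of a lump of HTML.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page = '<p><a href="/one">One</a> and <a href="/two">Two</a></p>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/one', '/two']
```

No project setup, no compile step: paste it into a file (or the shell), run it, tweak it, done.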
Fantastically diverse ecosystem
Outside of the web scraping world, Python has thousands of other helpful libraries designed to save you time when doing common things. For example:
- Python is incredibly popular in the academic & data research worlds thanks to two libraries that are exceptionally good at dealing with large, complex mathematical computations: NumPy & SciPy. Many people use web scraping for data collection before running statistical analysis on it with those tools.
- Python has in-built support for saving to common formats like CSV (comma separated values) and JSON. There are also many libraries which will help you with saving scraped data to Microsoft Excel or Google Sheets for later analysis.
- Python has fantastic support for natural language processing thanks to NLTK & spaCy. You can build up a corpus by scraping the web and then analyse the text you find with those libraries.
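As a tiny example of that built-in support, saving scraped rows to both CSV and JSON needs nothing beyond the standard library (the file names and rows are illustrative):

```python
import csv
import json

# Some rows, as a scraper might collect them.
rows = [
    {"title": "First post", "url": "/post/1"},
    {"title": "Second post", "url": "/post/2"},
]

# Save as CSV...
with open("posts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)

# ...and as JSON.
with open("posts.json", "w") as f:
    json.dump(rows, f, indent=2)
```

Both files are then ready to open in a spreadsheet or feed into another script.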
So there we have it. Now it’s time to stop hesitating about the direction to take and just go for it. If you found this article helpful, I write to my mailing list weekly on the topic of automated web scraping. We cover general questions like this one, specific tips on how to get data extracted in the most efficient manner, what to do with the data once you have it scraped and many other topics designed to make you an expert data extractor. Keen? Simply plop your email in the boxes below and you’ll hear from me shortly!