Hexfox

Having trouble extracting the tbody element while web scraping?

The DOM can be a fiddly beast

Aha! The old <tbody> problem. You’re trying to use an XPath selector like:

//table[@class="some_class"]/tbody/tr

or even a CSS selector like:

table.some_class>tbody>tr

and yet you’re getting back an empty list or, even worse, a not-found error. Yippee.

Hi! I'm Darian, a software developer based in London, and I teach my readers here at Hexfox to automate the repetitive tasks that would otherwise suck hours out of their life, with the aim of preparing them for a future world where automation is king. Sound interesting? Stick around or sign up!

You can see the damn tag in Chrome’s web inspector or Firefox’s Firebug and yet when you actually scrape the page you get zero results. Well huh. To understand what the bloody hell is going on here we first have to understand the basics of how a browser works.

A seriously basic browser guide

We’ll keep this super basic as I know you just want the answer, but honestly, taking two minutes now to understand why browsers act this way will save you a lot of pain in the future. Onwards...

  • We start with a browser requesting data from a server.
  • The server responds with HTML, and then serves any further JavaScript and CSS files that the HTML links to. If you’ve ever done “view source” in a browser, this is the HTML you are seeing: that pane always shows the HTML in its raw form, before the browser has done anything extra to it.
  • The browser reads (“parses”) this HTML and builds a DOM (Document Object Model). For now, all you need to know is that the DOM is how the browser represents the web page in memory, and it is also the structure that JavaScript interacts with to change what is displayed on the page. Because of complex HTML parsing rules, and because JavaScript can change elements in the DOM after the page has loaded, the DOM may end up different to the original HTML sent by the server. The DOM is what you see in developer tools such as Firefox’s Firebug or Chrome’s Inspector, as these show you the current state of the web page so that you can debug things like JavaScript.

So where does tbody come into this?

So now you know that there are two different stages, the raw HTML from the server and the DOM the browser builds from it, we can explain what is happening: your browser is injecting the tbody element into the table while it parses the raw HTML. Why does it do this? Because the HTML specification’s parsing rules tell it to: a tr found directly inside a table gets wrapped in an implied tbody. The main thing to note is that, unless the page’s author wrote it out explicitly, the tbody element did not exist in the original HTML your browser got from the server, and therefore it also doesn’t exist in the HTML you fetch with your HTTP library.
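If you want to see this for yourself, here is a minimal sketch using only Python's standard library. It lists every start tag that is literally present in a piece of raw HTML, with none of the browser's parsing rules applied. The RAW_HTML snippet, the TagCollector class and the start_tags helper are made up for illustration; in a real scraper the string would be the response body from your HTTP library.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML, as a server might send it: rows sit directly in <table>.
RAW_HTML = """
<table class="some_class">
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
"""


class TagCollector(HTMLParser):
    """Records the name of every start tag found in the raw markup."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(html_text):
    """Return the start tags literally present in html_text, in order."""
    collector = TagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.tags


tags = start_tags(RAW_HTML)
print("tbody" in tags)   # False: nothing has injected a tbody here
print(tags.count("tr"))  # 2: the rows are there, directly under the table
```

Running the same check against the real page you are scraping is a quick way to confirm whether its tbody was written by the author or added by your browser.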

So how do we avoid this problem?

Ignore tbody when selecting! It is really that simple: pretend the darn thing does not exist.

If we take the XPath and CSS selectors from earlier, you would simply drop the tbody section from them like so:

XPath selector: //table[@class="some_class"]/tbody/tr => //table[@class="some_class"]/tr

CSS selector: table.some_class>tbody>tr => table.some_class>tr
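Here is a rough sketch of the difference, again using only Python's standard library. It assumes the markup is tidy enough to parse as XML, which real-world HTML often is not, so in practice you would run the same selectors through your scraping library of choice. The two sample pages and the count_rows helper are hypothetical. The sketch also tries a third selector using the descendant axis (//tr), which is not the fix described above but a variation that matches whether or not the page's author wrote tbody explicitly.

```python
import xml.etree.ElementTree as ET

# Hypothetical page as served: no <tbody> in the markup.
WITHOUT_TBODY = """<html><body>
<table class="some_class">
  <tr><td>Alice</td></tr>
  <tr><td>Bob</td></tr>
</table>
</body></html>"""

# Hypothetical page whose author did write <tbody> explicitly.
WITH_TBODY = """<html><body>
<table class="some_class">
  <tbody>
    <tr><td>Alice</td></tr>
    <tr><td>Bob</td></tr>
  </tbody>
</table>
</body></html>"""

# ElementTree paths are relative to the root element, hence the leading dot.
TABLE = ".//table[@class='some_class']"


def count_rows(markup, path):
    """Parse the markup and count the elements matched by the given path."""
    root = ET.fromstring(markup)
    return len(root.findall(path))


for label, markup in (("without tbody", WITHOUT_TBODY), ("with tbody", WITH_TBODY)):
    print(
        label,
        count_rows(markup, TABLE + "/tbody/tr"),  # the selector copied from the browser
        count_rows(markup, TABLE + "/tr"),        # the same selector with tbody dropped
        count_rows(markup, TABLE + "//tr"),       # any tr beneath the table
    )
# prints:
# without tbody 0 2 2
# with tbody 2 0 2
```

One caveat with the descendant version: it also picks up rows from any table nested inside the one you are targeting, so it is only a safe default when the table has no nested tables.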

You should now have the elements you were after! If you found this helpful, I email tips like this one to my list.
