Having trouble extracting the tbody element while web scraping?
The DOM can be a fiddly beast
Hi! I'm Darian, a software developer based in London, and I teach my readers here at Hexfox to automate the repetitive tasks that would otherwise suck hours out of their life, with the aim of preparing them for a future world where automation is king.
Aha! The old <tbody> problem. You’re trying to use an XPath selector like:
//table[@class="some_class"]/tbody/tr
or even a CSS selector like:
table.some_class>tbody>tr
and yet you’re getting back an empty list or, even worse, a not-found error. Yippee.
You can see the damn tag in Chrome’s web inspector or Firefox’s Firebug and yet when you actually scrape the page you get zero results. Well huh. To understand what the bloody hell is going on here we first have to understand the basics of how a browser works.
A seriously basic browser guide
We’ll keep this super basic as I know you just want the answer, but honestly, taking two minutes now to understand why browsers act this way will save you a lot of pain in the future. Onwards...
- We start with a browser requesting data from a server.
- The server says hello and then sends back HTML, along with any further JavaScript and CSS that the HTML links to. If you’ve ever done “view source” in a browser, this is the HTML you are seeing: that pane always shows the HTML in its raw form, before the browser has done anything extra to it.
- A browser reads (“parses”) this HTML and builds a DOM (document object model) from it. For now, all you need to know is that this DOM is how the browser represents the web page in memory and is also the structure that JavaScript can interact with to change what is being displayed on the page. Due to complex HTML parsing rules and the fact that JavaScript can change the elements in the DOM after the page has already loaded, the DOM may end up different from the original HTML served by the server. The DOM is what you see in browser inspector/developer tools such as Firefox’s Firebug or Chrome’s Inspector, as these show you the current view of the web page so that you can debug things like JavaScript.
So where does tbody come into this?
Now that you know there are two different stages, the raw HTML from the server
and the rendered DOM, we can explain what is going on: your browser is injecting
the tbody element inside the table element it parses from the raw HTML. Why does
it do this? Because the HTML specification's parsing rules require it: a tr
found directly inside a table gets wrapped in an implied tbody. But the main
thing to note is that the tbody element did not exist in the original HTML your
browser got from the server, and therefore it also doesn’t exist when you try
to retrieve it with your HTTP requesting library.
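If you want to see this for yourself, here is a minimal sketch using only Python's standard library. The RAW_HTML string is a made-up stand-in for what a server might send (it is not from any real site), and TagCollector and tags_in are hypothetical helper names. The sketch simply lists every start tag that actually appears in the raw markup, without doing any of the browser's clean-up work:

```python
from html.parser import HTMLParser

# Hypothetical raw HTML, as a server might send it: the rows sit directly
# inside <table>, with no <tbody> anywhere in the markup.
RAW_HTML = """
<table class="some_class">
  <tr><td>Alice</td></tr>
  <tr><td>Bob</td></tr>
</table>
"""


class TagCollector(HTMLParser):
    """Records every start tag exactly as it appears in the markup."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_in(html):
    """Return the start tags found in the given HTML string, in order."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


raw_tags = tags_in(RAW_HTML)
print(raw_tags)             # ['table', 'tr', 'td', 'tr', 'td']
print("tbody" in raw_tags)  # False
```

Load that same markup in a browser and the inspector will show a tbody wrapped around the rows, even though it never came down the wire.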
So how do we avoid this problem?
Ignore tbody when selecting! It is really that simple: pretend the darn thing
does not exist.
If we take the XPath and CSS selectors from earlier, you would simply drop the
tbody section from them like so:
XPath selector: //table[@class="some_class"]/tbody/tr => //table[@class="some_class"]/tr
CSS selector: table.some_class>tbody>tr => table.some_class>tr
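Here is a rough sketch of the before and after, again using only the standard library. The PAGE markup is invented for illustration, and xml.etree.ElementTree is just a stand-in for whatever scraping library you actually use: it only copes with well-formed markup and supports a limited subset of XPath, so treat this as a demonstration of the idea rather than a production scraper.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source, kept well-formed so ElementTree can parse it.
# Note there is no <tbody> in the markup itself.
PAGE = """
<html><body>
  <table class="some_class">
    <tr><td>Alice</td></tr>
    <tr><td>Bob</td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# ElementTree paths are relative to an element, hence the leading "."
# in front of the "//" used in the selectors above.
with_tbody = root.findall(".//table[@class='some_class']/tbody/tr")
without_tbody = root.findall(".//table[@class='some_class']/tr")

print(len(with_tbody))     # 0
print(len(without_tbody))  # 2
print([row.findtext("td") for row in without_tbody])  # ['Alice', 'Bob']
```

Some pages do include a real tbody in their source. If you are unsure which kind you are dealing with, a descendant selector such as .//table[@class='some_class']//tr matches the rows either way, at the cost of also picking up rows from any nested tables.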
You should now have the elements you were after!