Having trouble extracting the tbody element while web scraping?
The DOM can be a fiddly beast
Hi! I'm Darian, a software developer based in London, and I teach my readers here at Hexfox to automate the repetitive tasks that would otherwise suck hours out of their life, with the aim of preparing them for a future world where automation is king. Sound interesting? Stick around or sign up!
Aha! The old <tbody> problem. You’re trying to use an XPath or CSS selector that includes tbody, and yet you’re getting back an empty list or, even worse, a not found error. Yippee.
You can see the damn tag in Chrome’s web inspector or Firefox’s Firebug and yet when you actually scrape the page you get zero results. Well huh. To understand what the bloody hell is going on here we first have to understand the basics of how a browser works.
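To make the symptom concrete, here is a minimal sketch using only Python's standard library. The markup, the table id and the path are all made up for illustration, and `xml.etree.ElementTree` stands in for whatever scraping library you actually use (it only copes with well-formed markup and a limited subset of XPath, so treat it as a demo tool, not a recommendation).

```python
import xml.etree.ElementTree as ET

# Hypothetical raw HTML, as a server might send it:
# the rows sit directly inside <table>, with no <tbody> around them.
RAW_HTML = """
<html>
  <body>
    <table id="prices">
      <tr><td>Apples</td><td>1.20</td></tr>
      <tr><td>Pears</td><td>0.90</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(RAW_HTML)

# A path of the kind a browser inspector would show you, tbody included.
rows_with_tbody = root.findall("./body/table/tbody/tr")

print(rows_with_tbody)  # [] because there is no tbody in the raw markup
```

The selector is perfectly valid; it just describes a document that only exists inside your browser.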
A seriously basic browser guide
We’ll keep this super basic as I know you just want the answer - but honestly, taking the two minutes now to understand why browsers act this way will save you a lot of pain in the future. Onwards...
- We start with a browser requesting data from a server.
So where does tbody come into this?
Now that you know there are two different stages, the raw HTML from the server
and the rendered DOM, we can explain what is happening: your browser is
injecting the tbody element inside the table element it parses in the raw HTML.
Why does it do this? Because the parsing rules in the HTML specification tell
it to: rows placed directly inside a table get wrapped in an implied tbody.
But the main thing to note is that the tbody element did not exist in the
original HTML your browser got from the server, and therefore it also doesn’t
exist when you try to retrieve it with your HTTP requesting library.
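If you want to check this for yourself, one rough approach is to list the tags that appear in the raw response body. The sketch below uses the standard library's `html.parser`, which only tokenizes the markup and does not apply a browser's tree-building fixes. `RAW_HTML` is a hypothetical stand-in for the text your HTTP library returned, and `TagCollector` is a name made up for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag in the order it appears in the raw markup."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Hypothetical response body: no tbody was ever written by the server.
RAW_HTML = "<table><tr><td>Apples</td></tr><tr><td>Pears</td></tr></table>"

collector = TagCollector()
collector.feed(RAW_HTML)
collector.close()

print(collector.tags)             # ['table', 'tr', 'td', 'tr', 'td']
print("tbody" in collector.tags)  # False
```

If tbody does show up in that list, the server really sent it and your selector should keep it.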
So how do we avoid this problem?
Leave out tbody when selecting! It is really that simple: pretend the darn thing
does not exist.
Take the XPath or CSS selector you were using and simply drop the tbody
section from it.
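As a rough sketch of that fix, assuming the raw HTML has no tbody of its own: the hypothetical helper `strip_tbody` below removes tbody steps from an XPath copied out of a browser, and the made-up markup shows the cleaned path finding the rows. The same idea applies to CSS, where something like `table > tbody > tr` would become `table > tr`.

```python
import re
import xml.etree.ElementTree as ET


def strip_tbody(xpath: str) -> str:
    """Remove tbody steps (with an optional [predicate]) from an XPath string."""
    return re.sub(r"/tbody(\[[^\]]*\])?(?=/|$)", "", xpath)


# Hypothetical raw HTML with rows directly inside the table.
RAW_HTML = (
    "<html><body><table>"
    "<tr><td>Apples</td></tr>"
    "<tr><td>Pears</td></tr>"
    "</table></body></html>"
)

root = ET.fromstring(RAW_HTML)

copied = "./body/table/tbody/tr"  # path as seen in the browser inspector
fixed = strip_tbody(copied)       # same path with the tbody step dropped
rows = root.findall(fixed)

print(fixed)      # ./body/table/tr
print(len(rows))  # 2
```

Editing the selector by hand works just as well; the helper only earns its keep if you paste a lot of paths from the inspector.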
You should now have the elements you were after! If you found this helpful, I email tips like this one to my list; subscribe below.