A friendlier alternative to XPath selectors
They can be tricky business
Selecting the HTML element you want with an XPath can be a right pain. We’ve all been there: you spend an inordinate amount of time working out the correct selector, and what happens when you try it? Nil, nada, zilch. No results.
This isn’t necessarily the fault of XPath itself but rather of our understanding of how it works. But what if we didn’t need to know XPath to extract the results we need?
Hi! I'm Darian, a software developer based in London, and I teach my readers here at Hexfox to automate the repetitive tasks that would otherwise suck hours out of their life, with the aim of preparing them for a future world where automation is king. Sound interesting? Stick around or sign up!
Thanks to CSS we might not.
CSS Selectors
CSS, as you may know, is the styling language of the web. In a nutshell, HTML is made up of text-based elements, and developers place identifiers such as IDs and classes on these elements so that they can style the page with CSS. As people who want to extract data from the HTML, we can use these handy identifiers to make our lives much easier.
Why prefer CSS over XPath?
The easiest way to explain is with a couple of examples:
Selecting an element by its ID
CSS:
#ex-id
XPath:
//*[@id='ex-id']
Selecting elements by the classes applied to them
CSS:
.ex-class
XPath:
//*[contains(concat(" ", normalize-space(@class), " "), " ex-class ")]
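If you are wondering why the XPath above needs all that padding, here is a rough sketch of the same logic in plain Python. The helper name has_class and the sample class values are made up for illustration; the point is only that a naive substring check matches things it shouldn't, while the space-padded check matches whole class names.

```python
def has_class(class_attr, name):
    """Roughly mimic contains(concat(' ', normalize-space(@class), ' '), ' name ')."""
    # Like normalize-space(): trim the ends and collapse runs of whitespace.
    normalized = " ".join(class_attr.split())
    # Pad both sides with a space so only whole class names can match.
    return f" {name} " in f" {normalized} "


# Hypothetical class attribute values, purely for illustration.
for value in ["ex-class", "  big   ex-class  ", "ex-class-wide", "not-ex-class"]:
    naive = "ex-class" in value  # like contains(@class, 'ex-class')
    padded = has_class(value, "ex-class")
    print(f"{value!r:22} naive={naive!s:5} padded={padded}")
```

The naive check reports a match for "ex-class-wide" and "not-ex-class", which is exactly the kind of false positive the longer XPath guards against. The CSS selector .ex-class handles all of this for you.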
In summary…
Which requires less thought? Which requires less typing? For me there is only ever one winner. CSS lets you scrape the web more easily by using the very hooks that helped the original developer style the site in the first place.
Another bonus is that, depending on your future plans, learning CSS can be a much better investment than learning XPath. Why? CSS is ubiquitous on the web: practically every site relies on it for design and has done since the late 90s. XPath is used in far fewer, rather specific cases that you may not come across again in years of programming, especially in web development.
But my scraping library only works with XPath!
Yes, some libraries only let you extract data if you have a valid XPath. But what if we could get a valid XPath from a CSS selector we already know? We can, and the conversion tools are already out there. To find one for your language, search for “[language] css to xpath conversion”; here are a couple for common scraping languages:
- Python: cssselect on PyPI.
- JavaScript: css-to-xpath on npm.
For those using Scrapy or BeautifulSoup: both can already extract data with CSS selectors (in fact Scrapy uses the cssselect library under the hood!).
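As a quick illustration of the BeautifulSoup side, here is a minimal sketch. It assumes the bs4 package is installed, and the HTML snippet is invented for the example; select_one() returns the first match and select() returns a list of all matches.

```python
from bs4 import BeautifulSoup

# Made-up markup, just to have something to select from.
HTML = """
<ul>
  <li id="ex-id">first</li>
  <li class="ex-class">second</li>
  <li class="big ex-class">third</li>
  <li class="ex-class-wide">fourth</li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

by_id = soup.select_one("#ex-id")    # first element matching the ID selector
by_class = soup.select(".ex-class")  # every element carrying the class

print(by_id.get_text())
print([li.get_text() for li in by_class])
```

Note that the fourth item is not picked up: .ex-class matches whole class names only, so ex-class-wide is left alone.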
Example
To demonstrate quickly with Python & the cssselect library:
>>> from cssselect import GenericTranslator
>>> GenericTranslator().css_to_xpath('.ex-class')
"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' ex-class ')]"
So there we go: even if your language doesn’t have one of these converters available, you can just use Python to do the conversion and then copy and paste the resulting XPath.
Caveats
Yaa, boo, hiss. There is always at least one.
There are some use cases that XPath can cover and CSS selectors cannot, and there are also certain CSS selectors that simply do not translate to a valid XPath because the two differ slightly in functionality. However, for the vast majority of your HTML extraction needs this is not going to be a problem; it’s just worth knowing about in case you come across it.
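To give a flavour of the first kind of case, here is a small sketch of something XPath does naturally and standard CSS selectors have traditionally struggled with: finding an element by its text and then stepping up to its parent. It uses Python's built-in xml.etree.ElementTree, which supports only a limited subset of XPath (the [.='text'] form needs Python 3.7 or later), and the table markup is invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup: a table of label/value rows.
HTML = """
<table>
  <tr><td>Name</td><td>Widget</td></tr>
  <tr><td>Price</td><td>9.99</td></tr>
</table>
""".strip()

root = ET.fromstring(HTML)

# Find the cell whose text is exactly "Price", then step up ("..") to its row.
price_rows = root.findall(".//td[.='Price']/..")

# The value sits in the second cell of that row.
price_value = price_rows[0].findall("td")[1].text
print(price_value)
```

Real-world HTML is rarely well-formed XML, so in practice you would reach for a proper HTML parser, but the XPath idea is the same.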
Another one might be performance. In many cases CSS lookups perform faster than XPath, but depending on the language and library you are using there may be a slight performance penalty for using raw CSS selectors over XPath (or vice versa). This is something to investigate case by case when you actually need to, which means you should ignore it until it becomes a problem. It’s highly unlikely that selector speed will ever be your bottleneck, because web scraping is usually I/O bound (limited by network speed) rather than CPU or memory bound (limited by the processor or RAM respectively).