How to filter out duplicate URLs from Scrapy's start_urls
as Scrapy turns off de-duplication for them
Scrapy provides a duplicate URL filter for all spiders by default, which means
that any URL that looks the same to Scrapy during a crawl will not be visited
twice. But for
start_urls, the URLs you set as the first ones a spider should
crawl, this de-duplication is deliberately disabled.
Why is it disabled, you ask?
Hi! I'm Darian, a software developer based in London, and I teach my readers here at Hexfox to automate the repetitive tasks that would otherwise suck hours out of their life, with the aim of preparing them for a future world where automation is king. Sound interesting? Stick around or sign up!
Because Scrapy expects the list you give it to be the definitive URLs you want it to scrape first (rather than URLs it finds automatically later), and therefore, if it were to ignore that definitive list and start de-duplicating it, things could get confusing for the user.
But what if you’re not in direct control of that list? What if the list contains thousands of URLs and you’d like to ensure there are no duplicates? There are many cases where de-duplicating that starting list of URLs is really helpful, and luckily we can do that easily in Scrapy with a simple override.
Underneath Scrapy’s hood
To override this behaviour we first need to understand what Scrapy does beneath the surface.
For simple spiders, you usually define the
start_urls attribute on the Spider
class itself, like so:
import scrapy


class HexfoxSpider(scrapy.Spider):
    name = "hexfox"
    allowed_domains = ["hexfox.com"]
    start_urls = [
        "http://hexfox.com/a",
        "http://hexfox.com/b",
        "http://hexfox.com/c",
        "http://hexfox.com/a",
        "http://hexfox.com/b",
        "http://hexfox.com/c",
    ]
What Scrapy does in the background when kicking off a crawl, though, is call a
method on the Spider class named
.start_requests(), which by default
looks like this†.
It is defined on the parent class (scrapy.Spider):
def start_requests(self):
    for url in self.start_urls:
        yield Request(url, dont_filter=True)
† the example is slightly simplified but correct for our explanation!
As you can see, all it does is loop through all the
start_urls you specified
and create a request for each one, while explicitly making sure that each
request does not get dupe filtered by setting
dont_filter=True. This is
obviously the opposite of what we want!
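To make the effect of that flag concrete, here is a minimal toy model of a scheduler with a duplicate filter. It is only a sketch and does not use Scrapy itself: ToyRequest and schedule are hypothetical names invented for this illustration, and the "seen" check compares plain URL strings, which is an assumption made to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRequest:
    """Hypothetical stand-in for a request, reduced to two fields."""
    url: str
    dont_filter: bool = False  # filtering is allowed unless told otherwise


def schedule(requests):
    """Return the URLs this toy scheduler would fetch, in order."""
    seen = set()   # URLs that have already passed through the filter
    fetched = []
    for request in requests:
        if not request.dont_filter:
            if request.url in seen:
                continue  # duplicate, and filtering is allowed: skip it
            seen.add(request.url)
        fetched.append(request.url)
    return fetched


start_urls = [
    "http://hexfox.com/a",
    "http://hexfox.com/b",
    "http://hexfox.com/a",
]

# Every request bypasses the filter, so the repeated URL is fetched twice.
unfiltered = schedule(ToyRequest(url, dont_filter=True) for url in start_urls)

# With dont_filter left at its default, the repeated URL is dropped.
filtered = schedule(ToyRequest(url) for url in start_urls)

print(unfiltered)
print(filtered)
```

The only difference between the two runs is the dont_filter flag, which is exactly the knob the default start_requests turns on for every start URL.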
Applying the fix
Therefore, and you’ve probably got there already, all we need to do to make sure your start_urls go through the de-duplication filter is to override this with our own `start_requests` method. That simply involves pasting the following code onto your spider class:
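# Needs `from scrapy import Request` at the top of the spider module.
def start_requests(self):
    for url in self.start_urls:
        # No dont_filter=True here, so the duplicate filter applies.
        yield Request(url)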
Because we haven’t passed
dont_filter=True to the
Request instance, it will
simply fall back to its default behaviour, which is to check for duplicates.
Now, the next time your spider runs, it will use this method and, voilà, the duplicate
URLs in your
start_urls will be scraped only once.
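If you would rather not touch start_requests at all, a different approach is to clean the list yourself before handing it to the spider. The sketch below is plain Python rather than anything Scrapy provides, and unique_in_order is a hypothetical helper name. Note that it compares exact strings, so "http://hexfox.com/a" and "http://hexfox.com/a/" would both survive, whereas Scrapy's own filter compares request fingerprints rather than raw strings.

```python
def unique_in_order(urls):
    """Drop exact-duplicate strings while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))


raw_urls = [
    "http://hexfox.com/a",
    "http://hexfox.com/b",
    "http://hexfox.com/c",
    "http://hexfox.com/a",
    "http://hexfox.com/b",
    "http://hexfox.com/c",
]

# The cleaned list could then be assigned to a spider's start_urls.
start_urls = unique_in_order(raw_urls)
print(start_urls)
```

This keeps the default start_requests untouched, at the cost of only catching URLs that are character-for-character identical.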