How to filter out duplicate URLs from Scrapy's start_urls

Scrapy applies a duplicate URL filter to all spiders by default, which means that any URL that looks the same to Scrapy during a crawl will not be visited twice. But for start_urls, the URLs you set as the first ones a spider should crawl, this de-duplication is deliberately disabled.

Why is it disabled, you ask?

Hi! I'm Darian, a software developer based in London, and I teach my readers here at Hexfox to automate the repetitive tasks that would otherwise suck hours out of their life, with the aim of preparing them for a future world where automation is king.

Because Scrapy expects the list you give it to be the definitive set of URLs you want it to scrape first (rather than URLs it finds automatically later), and if it were to ignore that definitive list and start de-duplicating it, things could get confusing for the user.

But what if you’re not in direct control of that list? What if the list contains thousands of URLs and you’d like to ensure there are no duplicates? There are many cases where de-duplicating that starting list of URLs is really helpful, and luckily we can easily do that with Scrapy thanks to a simple override.

Underneath Scrapy’s hood

To override this behaviour we first need to understand what Scrapy does beneath. For simple spiders, you usually define the start_urls attribute on the Spider class itself, like so:

import scrapy

class HexfoxSpider(scrapy.Spider):
    name = "hexfox"
    allowed_domains = ["hexfox.com"]
    start_urls = [
        "http://hexfox.com/a",
        "http://hexfox.com/b",
        "http://hexfox.com/c",
        "http://hexfox.com/a",
        "http://hexfox.com/b",
        "http://hexfox.com/c",
    ]

When kicking off a crawl, though, Scrapy calls a method on the Spider class named .start_requests(). It is defined on the parent class (scrapy.Spider) and by default looks like this†:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, dont_filter=True)

† the example is slightly simplified but correct for our explanation!

As you can see, all it does is loop through the start_urls you specified and create a request for each one, while explicitly making sure that no request gets dupe filtered by setting dont_filter=True. This is obviously the opposite of what we want!
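To make the effect of that flag concrete, here is a toy model in plain Python. To be clear, this is not Scrapy's actual code: ToyRequest, ToyScheduler and crawl_count are hypothetical names invented for this sketch, and the real filter compares request fingerprints rather than raw URL strings. It only illustrates the idea that a request is dropped when filtering is enabled for it and it has been seen before.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request (not scrapy.Request)."""
    url: str
    dont_filter: bool = False


class ToyScheduler:
    """Hypothetical, heavily simplified scheduler with a duplicate filter."""

    def __init__(self):
        self.seen = set()   # URLs that have passed through the filter
        self.queue = []     # requests accepted for download

    def enqueue(self, request):
        # Requests flagged dont_filter=True skip the filter entirely.
        if not request.dont_filter:
            if request.url in self.seen:
                return False  # duplicate: drop it
            self.seen.add(request.url)
        self.queue.append(request)
        return True


def crawl_count(urls, dont_filter):
    """Return how many requests would be queued for the given URLs."""
    scheduler = ToyScheduler()
    for url in urls:
        scheduler.enqueue(ToyRequest(url, dont_filter=dont_filter))
    return len(scheduler.queue)


start_urls = ["http://hexfox.com/a", "http://hexfox.com/b", "http://hexfox.com/a"]
print(crawl_count(start_urls, dont_filter=True))   # 3
print(crawl_count(start_urls, dont_filter=False))  # 2
```

With the flag set, all three requests are queued, duplicate included. Without it, the repeated URL is dropped.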

Applying the fix

Therefore, and you’ve probably got there already, all we need to do to make sure your start_urls go through the de-duplication filter is to override this with our own start_requests() method. That simply involves pasting the following code onto your spider class:

def start_requests(self):
    for url in self.start_urls:
        # No dont_filter=True here, so the duplicate filter applies
        yield scrapy.Request(url)

Because we haven’t passed dont_filter=True to the Request instance, it falls back to its default behaviour, which is to check for duplicate URLs.

Now, the next time your spider runs it will use this method and, voilà, the duplicate URLs in your start_urls will only be scraped once.
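If you do control the code that builds the list, another option is to strip duplicates before Scrapy ever sees them. The sketch below is plain Python rather than anything Scrapy provides, and dedupe_preserving_order is a hypothetical helper name made up for this example. It assumes that exact string matches are all you need to catch, so two URLs that differ only in, say, query parameter order would both survive here.

```python
def dedupe_preserving_order(urls):
    """Return the URLs with exact duplicates removed, keeping first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))


start_urls = [
    "http://hexfox.com/a",
    "http://hexfox.com/b",
    "http://hexfox.com/c",
    "http://hexfox.com/a",
    "http://hexfox.com/b",
    "http://hexfox.com/c",
]

unique_urls = dedupe_preserving_order(start_urls)
print(unique_urls)
# ['http://hexfox.com/a', 'http://hexfox.com/b', 'http://hexfox.com/c']
```

The override is still the better fit when you can't touch the list itself, or when you want the start URLs judged by the same filter as every other request in the crawl.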
