Writing a custom spider

Feeds already supports a number of websites (see Supported Websites), but adding support for a new website doesn’t take much time.

A quick example

Writing a spider is easy! Consider the slightly simplified spider for indiehackers.com:

import scrapy

from feeds.loaders import FeedEntryItemLoader
from feeds.spiders import FeedsSpider


class IndieHackersComSpider(FeedsSpider):
    name = "indiehackers.com"
    start_urls = ["https://www.indiehackers.com/interviews/page/1"]
    feed_title = "Indie Hackers"

    def parse(self, response):
        interview_links = response.css(".interview__link::attr(href)").extract()
        interview_dates = response.css(".interview__date::text").extract()
        for link, date in zip(interview_links, interview_dates):
            yield scrapy.Request(
                response.urljoin(link),
                self._parse_interview,
                meta={"updated": date.strip()},
            )

    def _parse_interview(self, response):
        remove_elems = [
            ".shareable-quote",
            ".share-bar",
        ]
        il = FeedEntryItemLoader(
            response=response,
            base_url="https://{}".format(self.name),
            remove_elems=remove_elems,
        )
        il.add_value("link", response.url)
        il.add_css("title", "h1::text")
        il.add_css("author_name", "header .user-link__name::text")
        il.add_css("content_html", ".interview-body")
        il.add_value("updated", response.meta["updated"])
        return il.load_item()

First, the URL from the start_urls list is downloaded and the response is given to parse(). From there we extract the article links that should be scraped and yield scrapy.Request objects from the for loop. The callback method _parse_interview() is executed once the download has finished. It extracts the article from the response HTML document and returns an item that will be placed into the feed automatically.

Placing the spider in the spiders folder is enough; it doesn’t have to be registered anywhere else for Feeds to pick it up.

Reusing an existing feed

Often websites provide a feed, but it’s not full text. In such cases you usually just want to augment the original feed with the full article content.

Generic spider

For a lot of feeds (especially those from blogs) it is sufficient to use the Generic full-text extraction spider, which can extract content from any website using heuristics (see Generic full-text extraction for more on that).

Note that a lot of feeds (e.g. those generated by WordPress) actually contain the full text, but your feed reader chooses to show a summary instead. In such cases you can also use the Generic full-text extraction spider and add your feed URL to the fulltext_urls key in the config. This creates a full-text feed from an existing feed without having to rely on heuristics.
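Such a config entry might look roughly like the following sketch; the [generic] section name and the example feed URL are assumptions, so check the configuration documentation for the exact syntax:

[generic]
fulltext_urls =
    https://example.com/feed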

Custom extraction

These spiders take an existing RSS feed and inline the article content while cleaning up the content (removing share buttons, etc.):
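As an illustration only, a minimal sketch of such a spider might look like this; the site name, feed URL, and CSS selectors are made-up placeholders, and the feed is assumed to be plain RSS 2.0:

import scrapy

from feeds.loaders import FeedEntryItemLoader
from feeds.spiders import FeedsSpider


class ExampleComSpider(FeedsSpider):
    name = "example.com"
    # Start from the site's existing (truncated) RSS feed.
    start_urls = ["https://example.com/feed.xml"]
    feed_title = "Example"

    def parse(self, response):
        # Follow every feed item's link and keep its publication date.
        for item in response.xpath("//item"):
            yield scrapy.Request(
                item.xpath("link/text()").extract_first(),
                self._parse_article,
                meta={"updated": item.xpath("pubDate/text()").extract_first()},
            )

    def _parse_article(self, response):
        # Extract the full article and clean up clutter like share buttons.
        il = FeedEntryItemLoader(
            response=response,
            base_url="https://{}".format(self.name),
            remove_elems=[".share-buttons"],
        )
        il.add_value("link", response.url)
        il.add_value("updated", response.meta["updated"])
        il.add_css("title", "h1::text")
        il.add_css("content_html", "article")
        return il.load_item()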

Paywalled content

If your website has a feed but some or all articles are behind a paywall or require a login to read, take a look at the following spiders:

Creating a feed from scratch

Some websites don’t offer any feed at all. In such cases we have to find an efficient way to detect new content and extract it.

Utilizing an API

Some websites offer a REST API that we can use to fetch the content.
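For illustration, here is a sketch of a spider that reads from a hypothetical JSON endpoint; the URL and the field names are invented:

import json

from feeds.loaders import FeedEntryItemLoader
from feeds.spiders import FeedsSpider


class ExampleOrgSpider(FeedsSpider):
    name = "example.org"
    # Hypothetical endpoint that returns the latest articles as JSON.
    start_urls = ["https://example.org/api/articles?limit=20"]
    feed_title = "Example"

    def parse(self, response):
        # The API already delivers the full content, so no further
        # requests are needed; build one feed item per article.
        for article in json.loads(response.text):
            il = FeedEntryItemLoader(base_url="https://{}".format(self.name))
            il.add_value("title", article["title"])
            il.add_value("link", article["url"])
            il.add_value("updated", article["published"])
            il.add_value("content_html", article["body_html"])
            yield il.load_item()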

Utilizing the sitemap

Others provide a sitemap which we can parse:
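A sketch of this approach, assuming a standard sitemap.xml and made-up selectors for the article pages:

import scrapy

from feeds.loaders import FeedEntryItemLoader
from feeds.spiders import FeedsSpider


class ExampleNetSpider(FeedsSpider):
    name = "example.net"
    start_urls = ["https://example.net/sitemap.xml"]
    feed_title = "Example"

    def parse(self, response):
        # Sitemaps are namespaced XML; local-name() sidesteps the namespace.
        urls = response.xpath('//*[local-name()="loc"]/text()').extract()
        # Only scrape a handful of URLs; a real spider would filter by
        # <lastmod> to pick up just the newest articles.
        for url in urls[:20]:
            yield scrapy.Request(url, self._parse_article)

    def _parse_article(self, response):
        il = FeedEntryItemLoader(
            response=response,
            base_url="https://{}".format(self.name),
        )
        il.add_value("link", response.url)
        il.add_css("title", "h1::text")
        il.add_css("updated", "time::attr(datetime)")
        il.add_css("content_html", "article")
        return il.load_item()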

Custom extraction

The last resort is to find a page that lists the newest articles and start scraping from there, just like the quick example above does for indiehackers.com.

For paywalled content, take a look at: