Generic full-text extraction

The generic spider can transform existing Atom or RSS feeds, which often contain only a summary or the first few lines of each article, into full-content feeds. It is similar to Full-Text RSS but uses a port of an older version of Readability under the hood and currently does not support site_config files. It works best for blog articles.

Some feeds already provide the full content, but in a tag that your feed reader does not use. For example, feeds created by WordPress usually have the full content in the “encoded” tag. In such cases it’s best to add the URL to the fulltext_urls entry, which extracts the content directly from the feed without running Readability. A small helper script, scripts/check-for-fulltext-content, detects whether a feed contains full-text content.
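To illustrate what "full content in the encoded tag" means, here is a minimal Python sketch that checks an RSS document for a non-trivial content:encoded element. It is not the helper script mentioned above and may differ from how that script works; the function name has_encoded_content, the min_length threshold, and the sample feed are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module, which defines <content:encoded>.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"


def has_encoded_content(feed_xml: str, min_length: int = 500) -> bool:
    """Return True if any <content:encoded> element holds at least
    min_length characters of text (hypothetical heuristic)."""
    root = ET.fromstring(feed_xml)
    for element in root.iter(f"{{{CONTENT_NS}}}encoded"):
        if element.text and len(element.text.strip()) >= min_length:
            return True
    return False


# Hypothetical sample feed, used only to exercise the function.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example</title>
    <item>
      <title>Post</title>
      <description>Short summary.</description>
      <content:encoded><![CDATA[<p>Full article body.</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

print(has_encoded_content(SAMPLE_FEED, min_length=10))
```

A feed for which such a check succeeds is a likely candidate for fulltext_urls rather than urls.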

Configuration

Add generic to the list of spiders:

# List of spiders to run by default, one per line.
spiders =
  generic

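Add the feed URLs (Atom or RSS) to the config file. Feeds listed under urls are processed with Readability; feeds listed under fulltext_urls have their content taken directly from the feed.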

# List of URLs to RSS/Atom feeds to crawl, one per line.
[generic]
urls =
    https://www.example.com/feed.atom
    https://www.example.org/feed.xml
fulltext_urls =
    https://myblog.example.com/feed/
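
The multi-line values above follow the usual INI convention of indented continuation lines. As a rough sketch of how such a section can be read, assuming the file is parseable by Python's standard configparser (how the spider itself loads its config is not shown here), the helper read_url_list below is a hypothetical name introduced for illustration:

```python
import configparser

# The [generic] section from above, inlined so the example is self-contained.
CONFIG_TEXT = """
[generic]
urls =
    https://www.example.com/feed.atom
    https://www.example.org/feed.xml
fulltext_urls =
    https://myblog.example.com/feed/
"""


def read_url_list(parser, section, option):
    """Split a multi-line option into a list of non-empty, stripped lines."""
    raw = parser.get(section, option, fallback="")
    return [line.strip() for line in raw.splitlines() if line.strip()]


# Interpolation is disabled so that a literal "%" in a URL is left untouched.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_TEXT)

urls = read_url_list(parser, "generic", "urls")
fulltext_urls = read_url_list(parser, "generic", "fulltext_urls")
print(urls)
print(fulltext_urls)
```

Each URL must be indented on its own line; an unindented line would be read as a new option rather than as part of the list.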