Generic full-text extraction
The generic spider can transform existing Atom or RSS feeds, which usually contain only a summary or a few lines of the content, into full-content feeds. It is similar to Full-Text RSS but uses a port of an older version of Readability under the hood and currently doesn’t support site_config files. It works best for blog articles.
Some feeds already provide the full content, but in a tag that your feed
reader does not use. For example, feeds created by WordPress usually have the
full content in the “encoded” tag. In such cases it’s best to add the URL to
the fulltext_urls entry, which extracts the content directly from the feed
without Readability. There is a small helper script in
scripts/check-for-fulltext-content to detect whether a feed contains full-text
content.

Add generic to the list of spiders:
    # List of spiders to run by default, one per line.
    spiders =
      generic
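The "one per line" convention can be illustrated with Python's standard configparser module. This is only a minimal sketch, not necessarily how the spider itself reads its configuration: the [feeds] section name and the read_list helper are assumptions made for the example.

```python
import configparser

# NOTE: the [feeds] section name is an assumption made for this sketch;
# only the "spiders" option itself comes from the documentation above.
CONFIG_TEXT = """
[feeds]
# List of spiders to run by default, one per line.
spiders =
  generic
"""


def read_list(parser, section, option):
    """Return the non-empty lines of a multi-line option as a list.

    read_list is a hypothetical helper, not part of any library.
    """
    raw = parser.get(section, option, fallback="")
    return [line.strip() for line in raw.splitlines() if line.strip()]


parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Indented continuation lines are joined into one multi-line value,
# which read_list splits back into individual entries.
spiders = read_list(parser, "feeds", "spiders")
print(spiders)
```

Splitting on lines rather than on arbitrary whitespace keeps each entry intact, which also suits the URL lists used further down.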
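Add the feed URLs (Atom or RSS) to the config file: urls for feeds whose content should be extracted with Readability, fulltext_urls for feeds that already contain the full content.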
    # List of URLs to RSS/Atom feeds to crawl, one per line.
    [generic]
    urls =
      https://www.example.com/feed.atom
      https://www.example.org/feed.xml
    fulltext_urls =
      https://myblog.example.com/feed/
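To decide whether a feed belongs under urls or fulltext_urls, it helps to check whether its entries already carry the full content. The following is a rough sketch of such a check using only the standard library. It is not the project's scripts/check-for-fulltext-content helper, whose heuristics may differ; the has_fulltext function and the two sample feeds are made up for illustration. It assumes the "encoded" tag is the content:encoded element of the RSS content module and also treats a non-empty Atom content element as full text.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module, which defines <content:encoded>.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
# Namespace of Atom 1.0, whose <content> element can hold the full article.
ATOM_NS = "http://www.w3.org/2005/Atom"

FULLTEXT_TAGS = (f"{{{CONTENT_NS}}}encoded", f"{{{ATOM_NS}}}content")


def has_fulltext(feed_xml):
    """Return True if any entry has a non-empty full-content element.

    Hypothetical helper for illustration. For downloaded feeds, pass the
    raw bytes so an XML encoding declaration is handled by the parser.
    """
    root = ET.fromstring(feed_xml)
    for tag in FULLTEXT_TAGS:
        for element in root.iter(tag):
            # Either text (plain or CDATA) or child elements (XHTML) count.
            if (element.text or "").strip() or len(element) > 0:
                return True
    return False


# Made-up sample feeds: one with only a teaser, one with content:encoded.
SUMMARY_ONLY = """<rss version="2.0"><channel><item>
<title>Post</title><description>Short teaser</description>
</item></channel></rss>"""

WITH_ENCODED = """<rss version="2.0"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><item>
<title>Post</title><description>Short teaser</description>
<content:encoded><![CDATA[<p>The whole article.</p>]]></content:encoded>
</item></channel></rss>"""

print(has_fulltext(SUMMARY_ONLY))
print(has_fulltext(WITH_ENCODED))
```

A feed for which this check succeeds is a candidate for fulltext_urls; the others would go under urls so that the article pages are fetched and run through Readability.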