.. _API:

API for Spiders
===============
If you want to support a custom website, take a look at :ref:`Development`.

Spider class
------------
A spider is a class in a module (Python file) in ``feeds.spiders`` that is a
subclass of ``feeds.spiders.FeedsSpider``, ``feeds.spiders.FeedsCrawlSpider``
or ``feeds.spiders.FeedsXMLFeedSpider``.

* ``FeedsXMLFeedSpider`` is used if the spider is based on parsing an XML
  document. This is useful if the spider should start from an existing XML
  feed or a sitemap.
* ``FeedsCrawlSpider`` is used if the spider should crawl the site based on
  links that are found on the site. Patterns can be given to limit which links
  are followed.
* ``FeedsSpider`` is used in all other cases (this is the spider that is
  usually used).

Class variables
^^^^^^^^^^^^^^^
* ``name``: The name of the spider (**mandatory**).
* ``start_urls``: A list of URLs to start from (used if the
  ``start_requests(self)`` method is not overridden).
* ``feed_title``: Title of the feed.
* ``feed_subtitle``: Subtitle of the feed.
* ``feed_link``
* ``author_name``: Author of the feed.
* ``feed_icon``: URL of a site favicon.
* ``feed_logo``: URL of a site logo.

Methods
^^^^^^^
* ``start_requests(self)``: If the start request is more complicated than a
  simple ``GET`` to the URL(s) in the ``start_urls`` list, this method can be
  overridden. It is expected to yield or return ``scrapy.Request`` objects.
  Please note that this method can *only* emit ``Request`` objects.
* ``parse(self, response)``: After a URL from ``start_urls`` has been scraped,
  the ``parse()`` method is called and the response is given as an argument.
  It is also the default callback method for new ``scrapy.Request`` objects.
* ``parse_node(self, response, node)``: A ``FeedsXMLFeedSpider`` calls
  ``parse_node()`` instead of ``parse()`` for every node in the XML document
  returned by the URL in ``start_urls``.
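
The class variables and methods above can be sketched roughly as follows. This
is a minimal illustration and not the project's actual implementation: to keep
it runnable without Scrapy installed, ``Request`` and ``FeedsSpider`` are
simplified stand-ins for ``scrapy.Request`` and ``feeds.spiders.FeedsSpider``,
and ``ExampleSpider``, its URL and its feed metadata are hypothetical.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Callable, Optional


    @dataclass
    class Request:
        """Stand-in for scrapy.Request (assumption: only url and callback)."""

        url: str
        callback: Optional[Callable] = None


    class FeedsSpider:
        """Stand-in for feeds.spiders.FeedsSpider (assumption, simplified)."""

        start_urls = []

        def start_requests(self):
            # Default behaviour as described above: one plain GET request per
            # URL in start_urls, with parse() as the callback.
            for url in self.start_urls:
                yield Request(url, callback=self.parse)

        def parse(self, response):
            return []


    class ExampleSpider(FeedsSpider):
        # Hypothetical values, for illustration only.
        name = "example.com"  # mandatory
        start_urls = ["https://example.com/news"]
        feed_title = "Example News"
        feed_subtitle = "A hypothetical feed"
        author_name = "Example Author"

        def start_requests(self):
            # Overridden because the start request needs more than the bare
            # URL; this method may only emit Request objects.
            for url in self.start_urls:
                yield Request(url + "?page=1", callback=self.parse)

        def parse(self, response):
            # A real spider would extract items from the response here.
            return []


    spider = ExampleSpider()
    requests = list(spider.start_requests())

In a real spider the two stand-in classes would be replaced by imports of
``scrapy.Request`` and ``feeds.spiders.FeedsSpider``; the structure of
``ExampleSpider`` itself would stay the same.
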

FeedEntryItemLoader
-------------------
A spider uses a ``FeedEntryItemLoader`` object to extract content from a
response. The following fields are accepted and can be added to an item loader
object:

* ``link``
* ``title``
* ``author_name``
* ``author_email``
* ``content_html``
* ``updated``
* ``category``
* ``path``
* ``enclosure_iri``
* ``enclosure_type``

A value can be added to an item loader with the ``add_value()``,
``add_css()`` or ``add_xpath()`` methods, as in the following example:

.. code-block:: python

    # Inside a spider's parse() method:
    il = FeedEntryItemLoader(response=response)
    il.add_value("link", response.url)
    il.add_css("title", "h1::text")
    il.add_css("author_name", "header .user-link__name::text")
    il.add_css("content_html", ".interview-body")
    il.add_css("updated", ".date::text")
    return il.load_item()

Only the ``link`` field is required. All other fields can be empty, but it is
usually advised to add as many fields as possible (i.e. as many as the original
site provides).

If the ``updated`` field is not provided, the date and time of the extraction
are used. If caching is enabled, the date and time when the item was first seen
are cached and reused on subsequent runs.

Input processing
----------------
Automatic rules are applied to fields depending on their type.

Default input rules
^^^^^^^^^^^^^^^^^^^
These rules are usually applied to every field.

#. Empty strings and ``None`` are skipped.
#. The content is stripped.
#. The content is unescaped twice, i.e. ``&&xxx;`` is converted to its decoded
   (binary) equivalent.

``title``
^^^^^^^^^
#. The default input rules apply.
#. One title: "