Link Extractors
LinkExtractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response objects) that will eventually be followed.
There are two Link Extractors available in Scrapy by default, but you can create your own custom Link Extractors to suit your needs by implementing a simple interface.
The only public method that every LinkExtractor has is extract_links, which receives a Response object and returns a list of links. Link Extractors are meant to be instantiated once and their extract_links method called several times with different responses, to extract links to follow.
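The interface described above can be sketched with the standard library alone. This is only an illustration, not Scrapy's implementation: `SimpleLinkExtractor`, `_AnchorCollector` and `FakeResponse` are hypothetical names, `FakeResponse` stands in for scrapy.http.Response, and the sketch returns plain URL strings.

```python
from collections import namedtuple
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical stand-in for scrapy.http.Response: just a URL and an HTML body.
FakeResponse = namedtuple("FakeResponse", ["url", "body"])


class _AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


class SimpleLinkExtractor:
    """Illustrative extractor exposing a single extract_links method."""

    def extract_links(self, response):
        parser = _AnchorCollector()
        parser.feed(response.body)
        parser.close()
        # Resolve relative hrefs against the URL of the response.
        return [urljoin(response.url, href) for href in parser.hrefs]


# Instantiate once, then reuse the same object for several responses.
extractor = SimpleLinkExtractor()
first = FakeResponse("http://www.example.com/a/index.html",
                     '<a href="page1.html">1</a>')
second = FakeResponse("http://www.example.com/b/",
                      '<a href="/about">About</a> <a href="http://other.example/">x</a>')
links_first = extractor.extract_links(first)
links_second = extractor.extract_links(second)
```

The extractor keeps no per-response state, which is what makes it safe to call `extract_links` repeatedly on the same instance.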
Link extractors are used in the CrawlSpider class (available in Scrapy) through a set of rules, but you can also use them in your own spiders, even if you don’t subclass CrawlSpider, as their purpose is very simple: to extract links.
Built-in link extractors reference
All link extractor classes bundled with Scrapy are provided in the scrapy.contrib.linkextractors module.
SgmlLinkExtractor
- class scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)
The SgmlLinkExtractor extends the base BaseSgmlLinkExtractor by providing additional filters that you can specify to extract links, including regular expression patterns that the links must match to be extracted. All of these filters are configured through the following constructor parameters:
Parameters:
- allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it will match all links.
- deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty), it won’t exclude any links.
- allow_domains (str or list) – a single value or a list of strings containing the domains which will be considered for extracting the links.
- deny_domains (str or list) – a single value or a list of strings containing the domains which won’t be considered for extracting the links.
- deny_extensions (list) – a list of extensions that should be ignored when extracting links. If not given, it will default to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractor module.
- restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines the regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See examples below.
- tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
- attrs (list) – a list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',).
- canonicalize (boolean) – canonicalize each extracted URL (using scrapy.utils.url.canonicalize_url). Defaults to True.
- unique (boolean) – whether duplicate filtering should be applied to extracted links. Defaults to True.
- process_value (callable) – see process_value argument of BaseSgmlLinkExtractor class constructor
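How the URL filters combine can be illustrated with a small stand-alone sketch. It is not Scrapy's code: `url_is_allowed` and `_as_list` are hypothetical helpers, and the sketch assumes that patterns are matched anywhere in the URL (`re.search`), that a domain also covers its subdomains, and that extensions are compared case-insensitively. What it takes from the parameter descriptions is that an empty allow matches everything and that deny takes precedence over allow.

```python
import re
from urllib.parse import urlparse


def _as_list(value):
    """Accept a single string or an iterable of strings."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def url_is_allowed(url, allow=(), deny=(), allow_domains=(),
                   deny_domains=(), deny_extensions=()):
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path = parsed.path.lower()

    def in_domains(domains):
        # Assumption: "example.com" also covers "www.example.com".
        return any(host == d or host.endswith("." + d)
                   for d in _as_list(domains))

    allow, deny = _as_list(allow), _as_list(deny)
    if allow_domains and not in_domains(allow_domains):
        return False
    if deny_domains and in_domains(deny_domains):
        return False
    if any(path.endswith("." + ext) for ext in _as_list(deny_extensions)):
        return False
    # deny is checked before allow, so it wins when both match.
    if any(re.search(pattern, url) for pattern in deny):
        return False
    # An empty allow list accepts every remaining URL.
    return not allow or any(re.search(pattern, url) for pattern in allow)
```

For instance, `url_is_allowed("http://www.example.com/item/1", allow=r"/item/", deny=r"/item/1$")` is rejected in this sketch because the deny pattern matches, even though the allow pattern matches too.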
BaseSgmlLinkExtractor
- class scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor(tag="a", attr="href", unique=False, process_value=None)
The purpose of this Link Extractor is only to serve as a base class for the SgmlLinkExtractor. You should use that one instead.
The constructor arguments are:
Parameters:
- tag (str or callable) – either a string (with the name of a tag) or a function that receives a tag name and returns True if links should be extracted from that tag, or False if they shouldn’t. Defaults to 'a'.
- attr (str or callable) – either a string (with the name of a tag attribute), or a function that receives an attribute name and returns True if links should be extracted from it, or False if they shouldn’t. Defaults to 'href'.
- unique (boolean) – whether duplicate filtering should be applied to extracted links. Defaults to False.
- process_value (callable) – a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
For example, to extract links from this code:
<a href="javascript:goToPage('../other/page.html'); return false">Link text</a>
You can use the following function in process_value:
import re

def process_value(value):
    # Pull the real URL out of the javascript:goToPage('...') wrapper.
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)
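To see how the tag, attr and process_value arguments might interact, here is a rough sketch that works on already-parsed (tag, attribute, value) triples. `select_values` is a hypothetical helper written for this illustration; it does not reproduce how BaseSgmlLinkExtractor is implemented, only the behaviour described in the parameter list above.

```python
import re


def process_value(value):
    # Same idea as the function above: unwrap javascript:goToPage('...').
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)


def select_values(candidates, tag="a", attr="href",
                  process_value=lambda x: x):
    """Filter (tag, attr, value) triples the way the arguments describe."""
    # A string means "exactly this name"; a callable is used as a predicate.
    tag_ok = tag if callable(tag) else (lambda t: t == tag)
    attr_ok = attr if callable(attr) else (lambda a: a == attr)
    selected = []
    for tag_name, attr_name, value in candidates:
        if tag_ok(tag_name) and attr_ok(attr_name):
            value = process_value(value)
            if value is not None:  # None means "ignore this link"
                selected.append(value)
    return selected


candidates = [
    ("a", "href", "javascript:goToPage('../other/page.html'); return false"),
    ("a", "href", "/plain.html"),
    ("img", "src", "/logo.png"),
]
unwrapped = select_values(candidates, process_value=process_value)
untouched = select_values(candidates)
widened = select_values(candidates,
                        tag=lambda t: t in ("a", "img"),
                        attr=lambda a: a in ("href", "src"))
```

Note that this process_value returns None for any value that does not match the pattern, so ordinary links such as /plain.html are dropped as well; return the original value in that case if they should be kept.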