Scrapy
Scrapy at a glance
Pick a website
Define the data you want to scrape
Write a Spider to extract the data
Run the spider to extract the data
Review scraped data
What else?
What’s next?
Installation guide
Pre-requisites
Installing Scrapy
Platform specific installation notes
Scrapy Tutorial
Creating a project
Defining our Item
Our first Spider
Storing the scraped data
Next steps
Examples
Command line tool
Default structure of Scrapy projects
Using the scrapy tool
Available tool commands
Custom project commands
Items
Declaring Items
Item Fields
Working with Items
Extending Items
Item objects
Field objects
Spiders
Spider arguments
Built-in spiders reference
Selectors
Using selectors
Built-in Selectors reference
Item Loaders
Using Item Loaders to populate items
Input and Output processors
Declaring Item Loaders
Declaring Input and Output Processors
Item Loader Context
ItemLoader objects
Reusing and extending Item Loaders
Available built-in processors
Scrapy shell
Launch the shell
Using the shell
Example of shell session
Invoking the shell from spiders to inspect responses
Item Pipeline
Writing your own item pipeline
Item pipeline example
Activating an Item Pipeline component
Feed exports
Serialization formats
Storages
Storage URI parameters
Storage backends
Settings
Link Extractors
Built-in link extractors reference
Logging
Log levels
How to set the log level
How to log messages
Logging from Spiders
scrapy.log module
Logging settings
Stats Collection
Common Stats Collector uses
Available Stats Collectors
Sending e-mail
Quick example
MailSender class reference
Mail settings
Telnet Console
How to access the telnet console
Available variables in the telnet console
Telnet console usage examples
Telnet Console signals
Telnet settings
Web Service
Web service resources
Web service settings
Writing a web service resource
Examples of web service resources
Example of web service client
Frequently Asked Questions
How does Scrapy compare to BeautifulSoup or lxml?
What Python versions does Scrapy support?
Does Scrapy work with Python 3?
Did Scrapy “steal” X from Django?
Does Scrapy work with HTTP proxies?
How can I scrape an item with attributes in different pages?
Scrapy crashes with: ImportError: No module named win32api
How can I simulate a user login in my spider?
Does Scrapy crawl in breadth-first or depth-first order?
My Scrapy crawler has memory leaks. What can I do?
How can I make Scrapy consume less memory?
Can I use Basic HTTP Authentication in my spiders?
Why does Scrapy download pages in English instead of my native language?
Where can I find some example Scrapy projects?
Can I run a spider without creating a project?
I get “Filtered offsite request” messages. How can I fix them?
What is the recommended way to deploy a Scrapy crawler in production?
Can I use JSON for large exports?
Can I return (Twisted) deferreds from signal handlers?
What does the response status code 999 mean?
Can I call pdb.set_trace() from my spiders to debug them?
Simplest way to dump all my scraped items into a JSON/CSV/XML file?
What’s this huge cryptic __VIEWSTATE parameter used in some forms?
What’s the best way to parse big XML/CSV data feeds?
Does Scrapy manage cookies automatically?
How can I see the cookies being sent and received from Scrapy?
How can I instruct a spider to stop itself?
How can I prevent my Scrapy bot from getting banned?
Should I use spider arguments or settings to configure my spider?
I’m scraping an XML document and my XPath selector doesn’t return any items
I’m getting an error: “cannot import name crawler”
Debugging Spiders
Parse Command
Scrapy Shell
Open in browser
Logging
Spiders Contracts
Custom Contracts
Common Practices
Run Scrapy from a script
Running multiple spiders in the same process
Distributed crawls
Avoiding getting banned
Dynamic Creation of Item Classes
Broad Crawls
Increase concurrency
Reduce log level
Disable cookies
Disable retries
Reduce download timeout
Disable redirects
Enable crawling of “Ajax Crawlable Pages”
Using Firefox for scraping
Caveats with inspecting the live browser DOM
Useful Firefox add-ons for scraping
Using Firebug for scraping
Introduction
Getting links to follow
Extracting the data
Debugging memory leaks
Common causes of memory leaks
Debugging memory leaks with trackref
Debugging memory leaks with Guppy
Leaks without leaks
Downloading Item Images
Using the Images Pipeline
Usage example
Enabling your Images Pipeline
Images Storage
Additional features
Implementing your custom Images Pipeline
Custom Images pipeline example
Ubuntu packages
Scrapyd
AutoThrottle extension
Design goals
How it works
Throttling algorithm
Settings
Benchmarking
Jobs: pausing and resuming crawls
Job directory
How to use it
Keeping persistent state between batches
Persistence gotchas
DjangoItem
Using DjangoItem
DjangoItem caveats
Django settings set up
Architecture overview
Overview
Components
Data flow
Event-driven networking
Downloader Middleware
Activating a downloader middleware
Writing your own downloader middleware
Built-in downloader middleware reference
Spider Middleware
Activating a spider middleware
Writing your own spider middleware
Built-in spider middleware reference
Extensions
Extension settings
Loading & activating extensions
Available, enabled and disabled extensions
Disabling an extension
Writing your own extension
Built-in extensions reference
Core API
Crawler API
Settings API
Signals API
Stats Collector API
Requests and Responses
Request objects
Request.meta special keys
Request subclasses
Response objects
Response subclasses
Settings
Designating the settings
Populating the settings
How to access settings
Rationale for setting names
Built-in settings reference
Signals
Deferred signal handlers
Built-in signals reference
Exceptions
Built-in Exceptions reference
Item Exporters
Using Item Exporters
Serialization of item fields
Built-in Item Exporters reference
Release notes
0.22.0 (released 2014-01-17)
0.20.2 (released 2013-12-09)
0.20.1 (released 2013-11-28)
0.20.0 (released 2013-11-08)
0.18.4 (released 2013-10-10)
0.18.3 (released 2013-10-03)
0.18.2 (released 2013-09-03)
0.18.1 (released 2013-08-27)
0.18.0 (released 2013-08-09)
0.16.5 (released 2013-05-30)
0.16.4 (released 2013-01-23)
0.16.3 (released 2012-12-07)
0.16.2 (released 2012-11-09)
0.16.1 (released 2012-10-26)
0.16.0 (released 2012-10-18)
0.14.4
0.14.3
0.14.2
0.14.1
0.14
0.12
0.10
0.9
0.8
0.7
Contributing to Scrapy
Reporting bugs
Writing patches
Submitting patches
Coding style
Scrapy Contrib
Documentation policies
Tests
Versioning and API Stability
Versioning
API Stability
Experimental features
Add commands using external libraries
Python Module Index
scrapy
scrapy.contracts
scrapy.contracts.default
scrapy.contrib.closespider: Close spider extension
scrapy.contrib.corestats: Core stats collection
scrapy.contrib.debug: Extensions for debugging Scrapy
scrapy.contrib.downloadermiddleware
scrapy.contrib.downloadermiddleware.ajaxcrawl
scrapy.contrib.downloadermiddleware.chunked: Chunked Transfer Middleware
scrapy.contrib.downloadermiddleware.cookies: Cookies Downloader Middleware
scrapy.contrib.downloadermiddleware.defaultheaders: Default Headers Downloader Middleware
scrapy.contrib.downloadermiddleware.downloadtimeout: Download timeout middleware
scrapy.contrib.downloadermiddleware.httpauth: HTTP Auth downloader middleware
scrapy.contrib.downloadermiddleware.httpcache: HTTP Cache downloader middleware
scrapy.contrib.downloadermiddleware.httpcompression: Http Compression Middleware
scrapy.contrib.downloadermiddleware.httpproxy: Http Proxy Middleware
scrapy.contrib.downloadermiddleware.redirect: Redirection Middleware
scrapy.contrib.downloadermiddleware.retry: Retry Middleware
scrapy.contrib.downloadermiddleware.robotstxt: robots.txt middleware
scrapy.contrib.downloadermiddleware.stats: Downloader Stats Middleware
scrapy.contrib.downloadermiddleware.useragent: User Agent Middleware
scrapy.contrib.exporter: Item Exporters
scrapy.contrib.linkextractors: Link extractors classes
scrapy.contrib.linkextractors.sgml: SGMLParser-based link extractors
scrapy.contrib.loader: Item Loader class
scrapy.contrib.loader.processor: A collection of processors to use with Item Loaders
scrapy.contrib.logstats: Basic stats logging
scrapy.contrib.memdebug: Memory debugger extension
scrapy.contrib.memusage: Memory usage extension
scrapy.contrib.pipeline.images: Images Pipeline
scrapy.contrib.spidermiddleware
scrapy.contrib.spidermiddleware.depth: Depth Spider Middleware
scrapy.contrib.spidermiddleware.httperror: HTTP Error Spider Middleware
scrapy.contrib.spidermiddleware.offsite: Offsite Spider Middleware
scrapy.contrib.spidermiddleware.referer: Referer Spider Middleware
scrapy.contrib.spidermiddleware.urllength: URL Length Spider Middleware
scrapy.contrib.spiders: Collection of generic spiders
scrapy.contrib.statsmailer: StatsMailer extension
scrapy.contrib.webservice: Built-in web service resources
scrapy.contrib.webservice.crawler: Crawler JSON-RPC resource
scrapy.contrib.webservice.enginestatus: Engine Status JSON resource
scrapy.contrib.webservice.stats: Stats JSON-RPC resource
scrapy.crawler: The Scrapy crawler
scrapy.exceptions: Scrapy exceptions
scrapy.http: Request and Response classes
scrapy.item: Item and Field classes
scrapy.log: Logging facility
scrapy.mail: Email sending facility
scrapy.selector: Selector class
scrapy.settings: Settings manager
scrapy.signalmanager: The signal manager
scrapy.signals: Signals definitions
scrapy.spider: Spiders base class, spider manager and spider middleware
scrapy.statscol: Stats Collectors
scrapy.telnet: The Telnet Console
scrapy.utils.trackref: Track references of live objects
scrapy.webservice: Web service