Requests and Responses
Summary: an introduction to the two Scrapy objects, Request and Response.
Request
A Request object represents an HTTP request. A request is generated by a Spider and processed by the Downloader, which in turn produces a Response.
scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])
Parameters:
url (string)
The target URL of the request.
callback (callable)
The function that will be called with the response of this request. If not specified, parse() is used as the callback. If an error occurs while processing the request, errback is called instead.
meta (dict)
A dict of arbitrary data that is passed along to the callback via Request.meta.
priority (int)
Sets the priority of the request; requests with higher values are executed earlier by the scheduler. The default is 0, and negative values are allowed to indicate relatively lower priority.
dont_filter (boolean)
Defaults to False. When set to True, the request will not be filtered by the duplicate filter; careless use of this option may cause the spider to loop indefinitely.
errback (callable)
A function called when an error occurs while processing the request. It receives a Twisted Failure instance as its first argument and can be used to track connection establishment timeouts, DNS errors, and so on.
- copy(): returns a new Request that is a copy of this Request.
- replace(): returns a Request with the same members, except for those given new values via keyword arguments.
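As an illustration of the parameters above, here is a minimal sketch of building a request and deriving a variant with replace(). The spider name, URLs and callback names are placeholders, not taken from the original examples:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"   # hypothetical spider; URLs below are placeholders

    def start_requests(self):
        req = scrapy.Request(
            "http://www.example.com/index.html",
            callback=self.parse_index,
            headers={"User-Agent": "my-crawler"},
            priority=10,                # scheduled earlier than the default priority 0
            dont_filter=True,           # bypass the duplicate filter
            errback=self.handle_error,
        )
        yield req
        # replace() returns a new Request with the given attributes changed,
        # keeping everything else (headers, callback, meta, ...) from the original
        yield req.replace(url="http://www.example.com/other.html", priority=0)

    def parse_index(self, response):
        self.logger.info("Got %s", response.url)

    def handle_error(self, failure):
        self.logger.error(repr(failure))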
Examples:
Passing data between callbacks with meta
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
Error handling
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    # if errback returns anything, it must be an iterable, e.g. ["1", "2", "3", "4"]
    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:
        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
Special keys in Request.meta
The Request.meta attribute can hold arbitrary data, but some keys are recognized by Scrapy and its built-in extensions. They are listed below; a combined usage sketch follows the list.
- dont_redirect
If set to True, the request will be ignored by the redirect middleware.
# print the URLs the request was redirected through
request.meta["redirect_urls"]
# print the reasons for the redirects
request.meta["redirect_reasons"]  # For example: [301, 302, 307, 'meta refresh']
- dont_retry
If set to True, the retry middleware will ignore this request and it will not be retried.
- handle_httpstatus_list
Specifies which non-2xx response codes are allowed to be handled for this request.
- handle_httpstatus_all
Set to True to allow every response code for this request, whatever its status.
- dont_merge_cookies
Set to True when you do not want the cookies sent back in the response to be merged with the cookies already stored for the session (see the cookies middleware). Example:
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'},
                               meta={'dont_merge_cookies': True})
- cookiejar
Supports keeping multiple cookie sessions per spider. The cookiejar meta key is not carried over to subsequent Requests automatically, so it has to be passed along explicitly each time. Example:
for i, url in enumerate(urls):
    yield scrapy.Request(url, meta={'cookiejar': i},
                         callback=self.parse_page)

def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
- dont_cache
Set to True to avoid caching the response for this request.
- redirect_urls
The URLs the request went through while being redirected (populated by the redirect middleware).
- bindaddress
The IP address of the outgoing network interface to use when performing the request.
- dont_obey_robotstxt
If Request.meta has dont_obey_robotstxt set to True, the request ignores robots.txt even if ROBOTSTXT_OBEY is enabled in the settings.
- download_timeout
The amount of time (in seconds) the downloader will wait before timing out.
- download_maxsize
The maximum response body size (in bytes) allowed for this request.
- download_latency
The amount of time spent to fetch the response, since the request has been started, i.e. HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.
- download_fail_on_dataloss
Defaults to True. When the size of the response body does not match the Content-Length header, ResponseFailed([_DataLoss]) is raised. When set to False, the broken response is processed anyway and 'dataloss' is added to the response flags, i.e. 'dataloss' in response.flags is True.
- proxy
Sets an HTTP proxy for this request, taking precedence over the http_proxy / https_proxy environment variables. The value looks like http://some_proxy_server:port or http://username:password@some_proxy_server:port.
- ftp_user
- ftp_password
- referrer_policy
The Referrer Policy to apply when populating the "Referer" header of this request (see the referrer policy documentation for details).
- max_retry_times
Sets the maximum number of retries for this request; this value takes precedence over the RETRY_TIMES setting.
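To make the keys above concrete, here is a minimal sketch combining several of them on a single request. The spider name, URL, proxy address and callback are placeholders, not taken from the original examples:
import scrapy

class MetaKeysSpider(scrapy.Spider):
    name = "meta_keys_example"   # hypothetical spider

    def start_requests(self):
        yield scrapy.Request(
            "http://www.example.com/page",      # placeholder URL
            callback=self.parse_page,
            meta={
                "proxy": "http://username:password@some_proxy_server:8080",  # placeholder proxy
                "download_timeout": 30,             # seconds
                "max_retry_times": 5,               # overrides RETRY_TIMES for this request
                "handle_httpstatus_list": [301, 302, 404],
                "dont_redirect": True,              # inspect 3xx responses in the callback
            },
        )

    def parse_page(self, response):
        self.logger.info("Got %s (status %s)", response.url, response.status)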
FormRequest: a subclass of Request
- Simulating a form submission (POST)
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
- Using the class method from_response to simulate a user login. It returns a FormRequest object whose form fields are pre-populated with the values found in the <form> element of the given response's HTML.
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeeded before going on
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return
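If the login page contains more than one <form>, from_response can also be told which form to use via arguments such as formname, formid, formnumber, formxpath or formcss. A minimal sketch of how the parse() method above could select the form by name; the form name "login" is an assumption, not taken from the original page:
    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formname='login',    # assumed form name; formnumber, formxpath or formcss also work
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )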
Response
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.