Requests and Responses
Summary: an introduction to the two Scrapy objects, Request and Response.
Request
A Request object represents an HTTP request. A request is generated by a Spider and processed by the Downloader, which in turn produces a Response.
scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])
Parameters:
url (string)
The target URL of the request.
callback (callable)
The function that will be called with the response of this request. If not specified, parse() is used as the callback. If an error occurs while processing the request, errback is called instead.
meta (dict)
A dict of arbitrary data that is passed along to the callback via Request.meta.
priority (int)
Sets the priority of the request; requests with higher values are executed earlier by the scheduler. The default is 0, and negative values are allowed to indicate relatively lower priority.
dont_filter (boolean)
Defaults to False. When set to True, the request will not be filtered by the duplicate filter; careless use of this option may cause the spider to loop indefinitely.
errback (callable)
A function called when an error occurs while processing the request. It receives a Twisted Failure instance as its first argument and can be used to track connection establishment timeouts, DNS errors, and so on.
- copy(): returns a new Request that is a copy of this Request.
- replace(): returns a Request with the same members, except for those given new values via keyword arguments.
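As an illustration of the parameters above, here is a minimal sketch of building a request and deriving a variant with replace(). The spider name, URLs and callback names are placeholders, not taken from the original examples:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"   # hypothetical spider; URLs below are placeholders

    def start_requests(self):
        req = scrapy.Request(
            "http://www.example.com/index.html",
            callback=self.parse_index,
            headers={"User-Agent": "my-crawler"},
            priority=10,                # scheduled earlier than the default priority 0
            dont_filter=True,           # bypass the duplicate filter
            errback=self.handle_error,
        )
        yield req
        # replace() returns a new Request with the given attributes changed,
        # keeping everything else (headers, callback, meta, ...) from the original
        yield req.replace(url="http://www.example.com/other.html", priority=0)

    def parse_index(self, response):
        self.logger.info("Got %s", response.url)

    def handle_error(self, failure):
        self.logger.error(repr(failure))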
Examples:
Passing data between callbacks with meta
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
Error handling
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    # if errback returns anything, it must be an iterable, e.g. ["1", "2", "3", "4"]
    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:
        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
Special keys in Request.meta
The Request.meta attribute can hold arbitrary data, but some keys are recognized by Scrapy and its built-in extensions. They are listed below; a combined usage sketch follows the list.
- dont_redirect
If set to True, the request will be ignored by the redirect middleware.
# print the URLs the request was redirected through
request.meta["redirect_urls"]
# print the reasons for the redirects
request.meta["redirect_reasons"]  # For example: [301, 302, 307, 'meta refresh']
- dont_retry
If set to True, the retry middleware will ignore this request and it will not be retried.
- handle_httpstatus_list
Specifies which non-2xx response codes are allowed to be handled for this request.
- handle_httpstatus_all
Set to True to allow every response code for this request, whatever its status.
- dont_merge_cookies
Set to True when you do not want the cookies sent back in the response to be merged with the cookies already stored for the session (see the cookies middleware). Example:
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'},
                               meta={'dont_merge_cookies': True})
- cookiejar
Supports keeping multiple cookie sessions per spider. The cookiejar meta key is not carried over to subsequent Requests automatically, so it has to be passed along explicitly each time. Example:
for i, url in enumerate(urls):
    yield scrapy.Request(url, meta={'cookiejar': i},
                         callback=self.parse_page)

def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
- dont_cache
Set to True to avoid caching the response for this request.
- redirect_urls
The URLs the request went through while being redirected (populated by the redirect middleware).
- bindaddress
The IP address of the outgoing network interface to use when performing the request.
- dont_obey_robotstxt
If Request.meta has dont_obey_robotstxt set to True, the request ignores robots.txt even if ROBOTSTXT_OBEY is enabled in the settings.
- download_timeout
The amount of time (in seconds) the downloader will wait before timing out.
- download_maxsize
The maximum response body size (in bytes) allowed for this request.
- download_latency
The amount of time spent to fetch the response, since the request has been started, i.e. HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.
- download_fail_on_dataloss
Defaults to True. When the size of the response body does not match the Content-Length header, ResponseFailed([_DataLoss]) is raised. When set to False, the broken response is processed anyway and 'dataloss' is added to the response flags, i.e. 'dataloss' in response.flags is True.
- proxy
Sets an HTTP proxy for this request, taking precedence over the http_proxy / https_proxy environment variables. The value looks like http://some_proxy_server:port or http://username:password@some_proxy_server:port.
- ftp_user
- ftp_password
- referrer_policy
The Referrer Policy to apply when populating the "Referer" header of this request (see the referrer policy documentation for details).
- max_retry_times
Sets the maximum number of retries for this request; this value takes precedence over the RETRY_TIMES setting.
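To make the keys above concrete, here is a minimal sketch combining several of them on a single request. The spider name, URL, proxy address and callback are placeholders, not taken from the original examples:
import scrapy

class MetaKeysSpider(scrapy.Spider):
    name = "meta_keys_example"   # hypothetical spider

    def start_requests(self):
        yield scrapy.Request(
            "http://www.example.com/page",      # placeholder URL
            callback=self.parse_page,
            meta={
                "proxy": "http://username:password@some_proxy_server:8080",  # placeholder proxy
                "download_timeout": 30,             # seconds
                "max_retry_times": 5,               # overrides RETRY_TIMES for this request
                "handle_httpstatus_list": [301, 302, 404],
                "dont_redirect": True,              # inspect 3xx responses in the callback
            },
        )

    def parse_page(self, response):
        self.logger.info("Got %s (status %s)", response.url, response.status)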
FormRequest: a subclass of Request
- Simulating a form submission (POST)
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
- Using the class method from_response to simulate a user login. It returns a FormRequest object whose form fields are pre-populated with the values found in the <form> element of the given response's HTML.
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeeded before going on
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return
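If the login page contains more than one <form>, from_response can also be told which form to use via arguments such as formname, formid, formnumber, formxpath or formcss. A minimal sketch of how the parse() method above could select the form by name; the form name "login" is an assumption, not taken from the original page:
    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formname='login',    # assumed form name; formnumber, formxpath or formcss also work
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )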
Response
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.