[Spider Examples] - A Walkthrough of Three Spider Classes

Summary: three spiders built from different templates, with a focus on how a spider reads arguments from the command line and on how to use response.follow.

1 Introduction

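The previous section, CrawlSpider - Scrapy Spiders in Detail, covered CrawlSpider. That leaves:

Spider
XMLFeedSpider
CSVFeedSpider
So this section covers these three classes, focusing on the most commonly used attributes of Spider and illustrating them with examples. It corresponds to the official documentation page: Spiders (Scrapy 1.4.0).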

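2 Using the Spider Class

2.1 How a Spider Runs

Let's start with an example and look at the order in which a Spider executes, that is, how a crawl flows from start to finish.

Code: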

import scrapy


class MySpider(scrapy.Spider):
    name = 'toscrape.com'
    allowed_domains = ['toscrape.com']
    start_urls = [
        'http://quotes.toscrape.com/tag/love/',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

Flow chart:

Spider Examples-3个爬虫类例子解析1.png

Spider Examples-3个爬虫类例子解析2.png

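The flow chart above shows the general workflow of a Spider:

The initial URLs (start_urls) are passed to start_requests(), which you can override; it produces the initial Requests, and downloading them yields the initial responses;
each response is passed automatically to the default callback, parse();
parse() uses selectors (Selector) to extract data from the response;
to request further pages, yield a new Request from parse() and specify its callback yourself;
data obtained in parse() or in any other callback is passed automatically to the item pipelines for storage via yield item.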
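
The steps above can be sketched as a toy crawl loop in plain Python. This is only an illustration of the control flow, not Scrapy's engine: the Request class, the PAGES dictionary, ToySpider and crawl() are all hypothetical stand-ins, and no network access is involved.

```python
from collections import deque


class Request:
    """Hypothetical stand-in for scrapy.Request: a URL plus a callback."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


# A fake two-page "site": each page holds some values and an optional next page.
PAGES = {
    "page1": {"items": ["a", "b"], "next": "page2"},
    "page2": {"items": ["c"], "next": None},
}


class ToySpider:
    start_urls = ["page1"]

    def start_requests(self):
        # Step 1: turn the initial URLs into Requests bound to parse().
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # Step 3: extract data from the response.
        for value in response["items"]:
            yield {"value": value}  # step 5: items are handed to storage
        # Step 4: request another page and name its callback explicitly.
        if response["next"] is not None:
            yield Request(response["next"], callback=self.parse)


def crawl(spider):
    """Fetch each queued Request and feed the response to its callback."""
    queue = deque(spider.start_requests())
    stored = []  # stands in for the item pipeline
    while queue:
        request = queue.popleft()
        response = PAGES[request.url]  # stands in for the downloader
        for result in request.callback(response):  # step 2
            if isinstance(result, Request):
                queue.append(result)
            else:
                stored.append(result)
    return stored


items = crawl(ToySpider())
print(items)
```

The loop only distinguishes two kinds of output from a callback: a new Request goes back into the queue, and anything else is treated as an item.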

Now let's look at another example (taken from the official documentation):

# Quotes_Spider_start_requests.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        yield scrapy.Request('http://quotes.toscrape.com/tag/%s/' % self.category)

    def parse(self, response):
        quote_block = response.css('div.quote')
        for quote in quote_block:
            text = quote.css('span.text::text').extract_first()
            author = quote.xpath('span/small/text()').extract_first()
            item = dict(text=text, author=author)
            yield item

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Then we start the spider from the command line.

You can either create a run.py file or run the command directly. With run.py:

from scrapy import cmdline
cmdline.execute("scrapy runspider Quotes_Spider_start_requests.py -a category=life -o quotes.json".split())

Or open a command line in the spider's folder and run:

scrapy runspider Quotes_Spider_start_requests.py -a category=life -o quotes.json

Key point: arguments passed to the spider are available as attributes of the spider object. So when you see

-a category=life

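it means the spider can access an attribute named category whose value is whatever you passed in. That is what start_requests() in the code above relies on; the goal here is to scrape every item under a given tag.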
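
One consequence is that a spider reading self.category directly fails with an AttributeError when it is started without -a category=.... The sketch below mimics the mechanism with a hypothetical MiniSpider class (not Scrapy's own code) that copies keyword arguments onto the instance, and shows a getattr() fallback; the default tag "love" is an assumption chosen for illustration.

```python
class MiniSpider:
    """Hypothetical stand-in that mimics how a spider stores -a arguments."""

    def __init__(self, **kwargs):
        # Every keyword argument becomes an instance attribute.
        self.__dict__.update(kwargs)

    def start_url(self):
        # Fall back to a default tag when no category was passed.
        category = getattr(self, "category", "love")
        return "http://quotes.toscrape.com/tag/%s/" % category


with_arg = MiniSpider(category="life")  # like: -a category=life
without_arg = MiniSpider()              # no -a argument given

print(with_arg.start_url())
print(without_arg.start_url())
```
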
There is also another way to do it:

# Quotes_Spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['toscrape.com']

    def __init__(self, category=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://quotes.toscrape.com/tag/%s/' % category, ]

    def parse(self, response):
        quote_block = response.css('div.quote')
        for quote in quote_block:
            text = quote.css('span.text::text').extract_first()
            author = quote.xpath('span/small/text()').extract_first()
            item = dict(text=text, author=author)
            yield item

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

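2.2 About allowed_domains

allowed_domains filters the links requested later in the crawl: if a link's domain is in the allowed_domains list the request goes through, otherwise it is dropped by the following middleware (middlewares are covered later):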

'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',

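A few points to note:

If I add www.example.org to allowed_domains, a request to http://bob.www.example.org is allowed, while http://www2.example.com and http://example.com are filtered. Entries should be bare domain names without a scheme. The check compares the request's host name against each listed domain and its subdomains; it is not a plain substring test on the URL;
allowed_domains only filters links in follow-up requests; it does not filter the links in start_urls;
if a link is redirected, the redirect target is not filtered even when it falls outside the allowed domains, because redirects are handled by a downloader middleware, whereas the offsite filter is a spider middleware that only sees the link requested before the redirect;
if you need a page whose domain is not in the allowed list, set dont_filter=True on the request when you follow the link.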
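
The matching rule in the first point can be sketched with the standard library. This is a simplified model of the documented behaviour, not the middleware's actual source: host_regex() and should_follow() are hypothetical helpers, and the dont_filter flag is modelled as a plain function argument.

```python
import re
from urllib.parse import urlparse


def host_regex(allowed_domains):
    """Build a pattern that accepts each listed domain and any of its subdomains."""
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % domains)


def should_follow(url, allowed_domains, dont_filter=False):
    """Return True if a request to url passes this simplified offsite check."""
    if dont_filter:
        return True  # dont_filter=True skips the domain check entirely
    host = urlparse(url).hostname or ""
    return bool(host_regex(allowed_domains).search(host))


allowed = ["www.example.org"]
for url in (
    "http://www.example.org/",
    "http://bob.www.example.org/",
    "http://www2.example.com/",
    "http://example.com/",
    "http://notwww.example.org/",
):
    print(url, should_follow(url, allowed))
```

The last URL contains the text "www.example.org" yet is rejected, which is the difference between host-based matching and a substring test.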

We can run a few experiments to verify the points above:

Change the allowed domains list in Quotes_Spider.py:

allowed_domains = ['google.com']

Run:

scrapy runspider Quotes_Spider.py -a category=life -o quotes.json

The pages we actually want to visit are:

① http://quotes.toscrape.com/tag/life/
② http://quotes.toscrape.com/tag/life/page/2/

Because only follow-up links are filtered, we still get the response for the first link, but the request for the second one is blocked.

Spider Examples-3个爬虫类例子解析3.png

Change the follow call:

yield response.follow(next_page, self.parse, dont_filter=True)

Spider Examples-3个爬虫类例子解析4.png

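Now it works. On one side we restricted the allowed domains, and on the other we overrode that restriction when following the link, so the two cancel out.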

2.3 response.follow() and response.urljoin()

The examples above use follow() to follow links. follow() takes a link and a callback and returns the next Request. Note that the next-page link we extract is a relative path:

>>> response.css('li.next a::attr("href")').extract_first()
u'/tag/life/page/2/'

So it has to be completed into an absolute URL before use. The older approach was to call urljoin():

next_page = response.urljoin(next_page)
yield scrapy.Request(next_page, callback=self.parse)

With follow() we can pass the relative path directly; it returns a Request. We can also pass a selector to follow() instead of a string:

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

Here is the code once more; all three approaches work:

# Approach 1: build the absolute URL with urljoin(), then create the Request yourself
next_page = response.css('li.next a::attr("href")').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

# Approach 2: pass the relative path straight to follow()
next_page = response.css('li.next a::attr("href")').extract_first()
if next_page is not None:
    yield response.follow(next_page, self.parse)

# Approach 3: pass the selector itself to follow()
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)
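
To see what "completing" a relative link means, here is a minimal sketch using the standard library's urljoin, on the assumption that resolving a link against the page URL behaves the same way. The variable names are illustrative; the URLs are the ones from the example above.

```python
from urllib.parse import urljoin

page_url = "http://quotes.toscrape.com/tag/life/"

# A root-relative href, as extracted from the "next" link.
from_root = urljoin(page_url, "/tag/life/page/2/")

# A href relative to the current page resolves to the same place here.
from_page = urljoin(page_url, "page/2/")

# An already absolute URL is returned unchanged.
unchanged = urljoin(page_url, "http://quotes.toscrape.com/tag/life/page/2/")

print(from_root)
print(from_page)
print(unchanged)
```

All three calls produce the same absolute URL, which is why an already-joined link and a raw relative link lead to the same request.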

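3 XMLFeedSpider Example

As the name suggests, XMLFeedSpider is designed for parsing XML feeds; it iterates over nodes with a given name to extract data.

Let's use this site as an example. First, a look at its page: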

Spider Examples-3个爬虫类例子解析5.png

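The block layout looks like a WordPress site, so does it offer a feed? Let's check step by step:

Append wp-login.php to the site address and see whether it works: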

http://youquhome.com/wp-login.php

Spider Examples-3个爬虫类例子解析6.png

It seems so (some site owners change the admin URL). Now try appending feed:

http://youquhome.com/feed/

Spider Examples-3个爬虫类例子解析7.png

So XMLFeedSpider can be used here.

The code:

# -*- coding:utf-8 -*-
from scrapy.spiders import XMLFeedSpider


class MySpider(XMLFeedSpider):
    name = 'youquhome.com'
    allowed_domains = ['youquhome.com']
    start_urls = ['http://youquhome.com/feed/']
    iterator = 'iternodes'  # the default
    itertag = 'item'  # iterate over each <item> node

    def parse_node(self, response, node):
        title = node.xpath('title/text()').extract_first()  # see the item structure shown below
        print(title)

Here is what a single item looks like:

Spider Examples-3个爬虫类例子解析8.png

So this extraction line is all we need:

title = node.xpath('title/text()').extract_first()

Result:

Spider Examples-3个爬虫类例子解析9.png

I won't go into the other methods in detail; see the documentation if you are interested.
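
The node-by-node iteration can be tried offline with the standard library. This is a rough analogue of itertag = 'item' plus parse_node(), not XMLFeedSpider itself: the FEED string is made-up sample data and parse_node() here is a hypothetical plain function.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed with the same shape as a typical RSS document.
FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def parse_node(node):
    """Extract the fields of one <item> node, relative to that node."""
    return {"title": node.findtext("title"), "link": node.findtext("link")}


root = ET.fromstring(FEED)
# Visit only <item> elements; the channel's own <title> is not an item.
entries = [parse_node(node) for node in root.iter("item")]
for entry in entries:
    print(entry["title"])
```

Because the lookup is relative to each item node, the channel-level title is never picked up, which matches the relative XPath 'title/text()' used in the spider.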

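4 CSVFeedSpider Example

I uploaded a CSV file to a web host specifically for this test, though it seems to me this class would be more useful for processing offline CSV files.

Code: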

# -*- coding:utf-8 -*-
from scrapy.spiders import CSVFeedSpider


class MySpider(CSVFeedSpider):
    name = 'pangan.com'
    allowed_domains = ['pangan.com']
    start_urls = ['http://pangan.win/example.csv']
    delimiter = ','  # the default; separates the fields of each row
    # quotechar = "'"  # character that encloses fields containing the delimiter (default is '"')
    headers = ['question', 'answer', 'author', 'agree']  # column names, used as the keys of each row dict

    def parse_row(self, response, row):
        self.logger.info('Hi, this is a row!: %r', row)
        print(row['question'])  # row is a dict

Output:

Spider Examples-3个爬虫类例子解析10.png

This is what the CSV file looks like:
Spider Examples-3个爬虫类例子解析11.png
I won't go deeper into the individual methods. There is also a SitemapSpider class, which I will skip here.
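
For offline files, the roles of delimiter, quotechar and headers can be explored with the standard csv module. This is a sketch under assumptions: CSV_TEXT is made-up sample data with no header row, and csv.DictReader is used as a stand-in for the spider, not as a description of how CSVFeedSpider is implemented.

```python
import csv
import io

# Made-up sample data; the second field of row one contains a comma,
# so it is wrapped in the quote character.
CSV_TEXT = (
    "What is Scrapy?,'A crawling framework, written in Python',alice,12\n"
    "What is a spider?,A class that defines how to crawl,bob,7\n"
)
headers = ["question", "answer", "author", "agree"]

reader = csv.DictReader(
    io.StringIO(CSV_TEXT),
    fieldnames=headers,  # column names become the keys of each row dict
    delimiter=",",       # separates fields
    quotechar="'",       # encloses fields that contain the delimiter
)
rows = [dict(row) for row in reader]

for row in rows:
    print(row["question"], "->", row["answer"])
```

Without the quote character, the comma inside the first answer would split it into two fields; with it, each row still maps cleanly onto the four column names. All values come back as strings.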


keepnight