[Items] - Passing Data in Scrapy
Summary: passing data in Scrapy, with a focus on how the serializer argument in scrapy.Field(serializer=serialize_text) is used.
This section takes a brief look at Items in Scrapy. An Item works much like a dict: it defines a crawl's data fields, receives the scraped data, and passes it along.
Starting from the data, the steps are:
- work out which pieces of data you need
- define an Item as the container that holds one set of data
- extract structured data from the page
- hand the data to the Item
- output the Item
Let's unpack this with an example:
We'll crawl quotes and their authors again. All we need to do is define an Item the way the QuotesItem class below does, then instantiate QuotesItem in the spider, feed it the data, and output it.
```python
# -*- coding: utf-8 -*-
import scrapy


class QuotesItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['toscrape.com']
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'quotes.jsonlines',
    }

    def __init__(self, category=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://quotes.toscrape.com/tag/%s/' % category, ]

    def parse(self, response):
        quote_block = response.css('div.quote')
        for quote in quote_block:
            text = quote.css('span.text::text').extract_first()
            author = quote.xpath('span/small/text()').extract_first()
            item = QuotesItem()
            item['text'] = text
            item['author'] = author
            yield item

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
Look at these lines in the parse() method:
```python
item = QuotesItem()
item['text'] = text
item['author'] = author
```
Yes, it really is that simple, and the effect is the same as the dict we wrote before:
```python
item = dict(text=text, author=author)
```
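One difference from a plain dict is worth knowing: an Item only accepts the fields you declared. A minimal sketch (the 'tags' field is just an illustration of an undeclared field):

```python
import scrapy

class QuotesItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()

item = QuotesItem()
item['text'] = 'some quote'    # assignment, access, and dict(item) work as with a dict
item['author'] = 'someone'
print(dict(item))              # {'text': 'some quote', 'author': 'someone'}

try:
    item['tags'] = ['life']    # 'tags' was never declared as a Field
except KeyError as e:
    print(e)                   # QuotesItem does not support field: tags
```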
So what else can a Field in an Item do? There is one more optional argument, for serialization. An example:
Suppose I only want to output any 5 of the words in text. How would I do that?
Starting from the Item definition:
```python
text = scrapy.Field(serializer=serialize_text)
```
Here serializer is the optional argument, and serialize_text is a function applied to the value handed to QuotesItem['text']. (In fact, serializer only takes effect when Item Exporters are used; to attach processing logic to an item at parse time you instead need Item Loaders, which I cover in the later posts Item Pipeline - Spider Projects and Data Pipelines and Item Loaders - Another Way to Pass Data.) Now let's define the serialize_text() function:
```python
import random

def serialize_text(text):
    # strip the curly quote marks, then split on whitespace
    word_list = text.replace(u'“', '').replace(u'”', '').split()
    # note: random.sample raises ValueError if the quote has fewer than 5 words
    return random.sample(word_list, 5)
```
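To see the serializer actually fire, you can push an item through one of Scrapy's exporters by hand. A minimal sketch using JsonLinesItemExporter, assuming the QuotesItem and serialize_text defined above; the file name and sample quote are only illustrative:

```python
from scrapy.exporters import JsonLinesItemExporter

item = QuotesItem()
item['text'] = u'“The world as we have created it is a process of our thinking.”'
item['author'] = 'Albert Einstein'

# exporters write bytes, so the file is opened in binary mode
with open('demo.jsonlines', 'wb') as f:
    exporter = JsonLinesItemExporter(f)
    exporter.start_exporting()
    exporter.export_item(item)   # serialize_text runs here, on item['text']
    exporter.finish_exporting()
```

Assigning to item['text'] leaves the value untouched; only export_item() applies the Field's serializer.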
The full code:
```python
# -*- coding: utf-8 -*-
# filename: Quotes_Spider.py
import scrapy
import random


def serialize_text(text):
    # strip the curly quote marks, then split on whitespace
    word_list = text.replace(u'“', '').replace(u'”', '').split()
    return random.sample(word_list, 5)


class QuotesItem(scrapy.Item):
    text = scrapy.Field(serializer=serialize_text)
    author = scrapy.Field()


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['toscrape.com']
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'quotes.jsonlines',
    }

    def __init__(self, category=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://quotes.toscrape.com/tag/%s/' % category, ]

    def parse(self, response):
        quote_block = response.css('div.quote')
        for quote in quote_block:
            text = quote.css('span.text::text').extract_first()
            author = quote.xpath('span/small/text()').extract_first()
            item = QuotesItem()
            item['text'] = text
            item['author'] = author
            yield item
            # either of the following is equivalent to the three lines above
            # (though a plain dict bypasses the Field serializer):
            # yield {"text": text, "author": author}
            # yield QuotesItem({"text": text, "author": author})

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
Run it with this script:
```python
from scrapy import cmdline

cmdline.execute("scrapy runspider Quotes_Spider.py -a category=life".split())
```
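Equivalently, you can run `scrapy runspider Quotes_Spider.py -a category=life` directly from a shell. Either way, each line of quotes.jsonlines should hold one author plus 5 words sampled from the quote; since random.sample is used, the sampled words differ from run to run.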
Adapted from: Items - Scrapy中数据的传递 - 知乎 (Zhihu); thanks to the author.
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.