[Item Loaders] - Another Way to Pass Data
Summary: how to use Item Loaders.
In a previous article we looked at Item, a simple container for scraped data.
Item provides the container for scraped data, while Item Loaders provide the mechanism for populating it.
import scrapy
# import the processors; they are explained below
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags
from scrapy.loader import ItemLoader

def filter_price(value):  # keep only purely numeric values, drop everything else
    if value.isdigit():
        return value

class Product(scrapy.Item):  # define the Item
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),  # input processor
        output_processor=Join(),  # output processor
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )

il = ItemLoader(item=Product())  # instantiate an ItemLoader
il.add_value('name', [u'Welcome to my', u'<strong>website</strong>'])
il.add_value('price', [u'€', u'<span>1000</span>', u'2000'])
print il.load_item()  # this method returns an Item
Output:
{'name': u'Welcome to my website', 'price': u'1000'}
Let's walk through it, starting with instantiation.
1. The add_value method
add_value(field_name, value, *processors, **kwargs)
Assigns value to the field field_name; the remaining parameters are easiest to show with a few examples:
from scrapy import Item, Field
from scrapy.loader import ItemLoader

class Product(Item):
    name = Field()

il = ItemLoader(item=Product())
Now let's verify each case in turn (assume a fresh il for each example, or the values would accumulate):
Example 1:
>>> il.add_value('name', u'name: kai, name: xin')
>>> print il.load_item()
{'name': [u'name: kai, name: xin']}
Example 2:
>>> il.add_value('name', u'name: kai, name: xin', re='name: (\w+)')
>>> print il.load_item()
{'name': [u'kai', u'xin']}
Example 3:
>>> from scrapy.loader.processors import TakeFirst
>>> il.add_value('name', u'name: kai, name: xin', TakeFirst(), re='name: (\w+)')
>>> print il.load_item()
{'name': [u'kai']}
The method also accepts None as the field name, which lets you add values for several different fields at once:
loader.add_value(None, {'name': u'foo', 'sex': u'male'})
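For instance, a minimal runnable sketch (using a hypothetical Person item that declares both fields, since every key passed this way must exist as a field on the item):

from scrapy import Item, Field
from scrapy.loader import ItemLoader

class Person(Item):  # hypothetical item declaring both fields
    name = Field()
    sex = Field()

loader = ItemLoader(item=Person())
# with None as the field name, each dict key is treated as a field name
loader.add_value(None, {'name': u'foo', 'sex': u'male'})
print loader.load_item()  # {'name': [u'foo'], 'sex': [u'male']}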
2. The built-in processors
Identity
Returns the original values unchanged.
>>> from scrapy.loader.processors import Identity
>>> proc = Identity()
>>> proc(['one', 'two', 'three'])
['one', 'two', 'three']
TakeFirst
Returns the first non-null, non-empty element of the received values; typically used as an output processor.
>>> from scrapy.loader.processors import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'one', 'two', 'three'])
'one'
Join(separator=u' ')
Joins the received elements with the given separator, equivalent to u' '.join(list):
>>> from scrapy.loader.processors import Join
>>> proc = Join()
>>> proc(['one', 'two', 'three'])
u'one two three'
>>> proc = Join('<br>')
>>> proc(['one', 'two', 'three'])
u'one<br>two<br>three'
Compose(*functions, **default_loader_context)
Chains the given functions over the whole input; each function's output feeds the next:
>>> from scrapy.loader.processors import Compose
>>> proc = Compose(lambda v: v[0], str.upper)
>>> proc(['hello', 'world'])
'HELLO'
MapCompose(*functions, **default_loader_context)
Applies the chained functions to each element of the input list, dropping None results:
>>> def filter_world(x):
...     return None if x == 'world' else x
...
>>> from scrapy.loader.processors import MapCompose
>>> proc = MapCompose(filter_world, unicode.upper)
>>> proc([u'hello', u'world', u'this', u'is', u'scrapy'])
[u'HELLO', u'THIS', u'IS', u'SCRAPY']
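To make the difference between the two explicit: Compose runs its chain once over the whole input list, while MapCompose runs it once per element and flattens the results. A quick sketch:

>>> from scrapy.loader.processors import Compose, MapCompose
>>> Compose(len)(['one', 'two', 'three'])  # the whole list is passed in one call
3
>>> MapCompose(str.upper)(['one', 'two', 'three'])  # applied to each element
['ONE', 'TWO', 'THREE']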
3. The example explained
name = scrapy.Field(
    input_processor=MapCompose(remove_tags),  # input processor
    output_processor=Join(),  # output processor
)
In the Product class, the definition of the name Field differs from the previous article: it specifies an input processor and an output processor. The input processor transforms each value as it is received; the output processor defines how the collected values are turned into the final output. Here, the input processor for name strips HTML markup from each input value via remove_tags from the w3lib library, and the output processor joins the processed results before output.
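To see the data flow concretely, here is a quick interactive trace of the two stages (a sketch; the intermediate list is what MapCompose(remove_tags) is expected to produce):

>>> from scrapy.loader.processors import Join, MapCompose
>>> from w3lib.html import remove_tags
>>> MapCompose(remove_tags)([u'Welcome to my', u'<strong>website</strong>'])
[u'Welcome to my', u'website']
>>> Join()([u'Welcome to my', u'website'])
u'Welcome to my website'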
price = scrapy.Field(
    input_processor=MapCompose(remove_tags, filter_price),
    output_processor=TakeFirst(),
)
For the price field, the input processor chains two functions: one to strip HTML markup and one to filter out anything that is not purely digits; the output processor then keeps only the first of the processed results.
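Traced the same way, continuing the same session with filter_price, remove_tags, MapCompose and TakeFirst defined as in the first example (note how filter_price returns None for u'€', so MapCompose drops it):

>>> MapCompose(remove_tags, filter_price)([u'€', u'<span>1000</span>', u'2000'])
[u'1000', u'2000']
>>> TakeFirst()([u'1000', u'2000'])
u'1000'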
Having seen the example, you may still wonder what Item Loaders actually buy you. Compare the following equivalent approach:
import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags
from scrapy.loader import ItemLoader

def filter_price(value):
    if value.isdigit():
        return value

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()  # default output processor
    # default_input_processor  # default input processor
    price_in = MapCompose(remove_tags, filter_price)  # the _in suffix declares an input processor
    name_in = MapCompose(remove_tags)
    name_out = Join()  # the _out suffix declares an output processor

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

il = ProductLoader(item=Product())
il.add_value('name', [u'Welcome to my', u'<strong>website</strong>'])
il.add_value('price', [u'€', u'<span>1000</span>', u'2000'])
print il.load_item()
Output:
{'name': u'Welcome to my website', 'price': u'1000'}
Declared processors are resolved by precedence, as follows (see the sketch after this list):
- field_in and field_out attributes defined on ProductLoader have the highest precedence;
- input_processor and output_processor keys declared in the Field metadata come next;
- the ItemLoader defaults (default_input_processor / default_output_processor) have the lowest precedence.
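A minimal sketch to verify the precedence, using a hypothetical DemoItem/DemoLoader pair: the Field declares one input processor, the loader's name_in declares a conflicting one, and name_in wins:

from scrapy import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

class DemoItem(Item):  # hypothetical item for the precedence test
    name = Field(input_processor=MapCompose(unicode.lower))  # middle precedence

class DemoLoader(ItemLoader):
    default_output_processor = TakeFirst()  # lowest precedence
    name_in = MapCompose(unicode.upper)  # highest precedence, overrides the Field key

il = DemoLoader(item=DemoItem())
il.add_value('name', u'Scrapy')
print il.load_item()  # {'name': u'SCRAPY'} -- name_in took precedence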
The examples cover the mechanics in enough detail, so we won't belabor it. Next, how to use an ItemLoader inside a spider.
4. Using an ItemLoader in a spider
# -*- coding: utf-8 -*-
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Identity

def text_process(text):
    # strip the curly quotation marks around each quote
    word = text.replace(u'“', '').replace(u'”', '')
    return word

class QuotesLoader(ItemLoader):
    default_output_processor = Identity()
    text_in = MapCompose(text_process)

class QuotesItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['toscrape.com']
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'quotes.json',
        'FEED_EXPORT_INDENT': 4,
        'FEED_FORMAT': 'json',
    }

    def __init__(self, category=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://quotes.toscrape.com/tag/%s/' % category, ]

    def parse(self, response):
        l = QuotesLoader(item=QuotesItem(), response=response)
        l.add_xpath('author', '//small[@class="author"]/text()')
        l.add_xpath('text', '//span[@class="text"]/text()')
        yield l.load_item()
        # to keep the demo short we don't follow pagination
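Since the spider takes a category argument in __init__, pass the tag on the command line with Scrapy's -a option. Judging by the authors in the output below, it came from a run like this (the tag must exist on quotes.toscrape.com):

scrapy crawl quotes -a category=life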
Output file:
[
{
"text": [
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.",
"It is better to be hated for what you are than to be loved for what you are not.",
"This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.",
"I may not have gone where I intended to go, but I think I have ended up where I needed to be.",
"Good friends, good books, and a sleepy conscience: this is the ideal life.",
"Life is what happens to us while we are making other plans.",
"Today you are You, that is truer than true. There is no one alive who is Youer than You.",
"Life is like riding a bicycle. To keep your balance, you must keep moving.",
"Life isn't about finding yourself. Life is about creating yourself.",
"Finish each day and be done with it. You have done what you could. Some blunders and absurdities no doubt crept in; forget them as soon as you can. Tomorrow is a new day. You shall begin it serenely and with too high a spirit to be encumbered with your old nonsense."
],
"author": [
"Albert Einstein",
"André Gide",
"Marilyn Monroe",
"Douglas Adams",
"Mark Twain",
"Allen Saunders",
"Dr. Seuss",
"Albert Einstein",
"George Bernard Shaw",
"Ralph Waldo Emerson"
]
}
]
Because FEED_EXPORT_INDENT is set, the output is pretty-printed and easier to read.
References
Adapted from: Item Loaders - 数据传递的另一种方式 - 知乎 (Zhihu); thanks to the original author.
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.