[Run Scrapy] - Running Spiders from a Script, and Running Multiple Spiders
Summary: several ways to run Scrapy spiders from a script.
Method 1: CrawlerProcess()
We previously covered running a spider with cmdline.execute(); there is also another way to run spiders from a script: CrawlerProcess().
Let's start with an example:
import scrapy
import random
from scrapy.crawler import CrawlerProcess

def serialize_text(text):
    # Strip the curly quotes, then return two random words from the quote.
    word_list = text.replace(u'“', '').replace(u'”', '').split()
    return random.sample(word_list, 2)

class QuotesItem(scrapy.Item):
    text = scrapy.Field(serializer=serialize_text)
    author = scrapy.Field()

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['toscrape.com']
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'quotes.jsonlines',
    }

    def __init__(self, category=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        # Build the start URL from the category passed via process.crawl().
        self.start_urls = ['http://quotes.toscrape.com/tag/%s/' % category, ]

    def parse(self, response):
        quote_block = response.css('div.quote')
        for quote in quote_block:
            text = quote.css('span.text::text').extract_first()
            author = quote.xpath('span/small/text()').extract_first()
            item = QuotesItem()
            item['text'] = text
            item['author'] = author
            yield item
        # Follow pagination until there is no "next" link.
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

process = CrawlerProcess()
process.crawl(QuotesSpider, category='love')
process.start()  # blocks until the crawl finishes
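The script above is self-contained. Assuming you save it as run_quotes.py (a filename chosen here just for illustration), running it is simply:

python run_quotes.py

The scraped items are written to quotes.jsonlines, one JSON object per line, as configured by FEED_URI and FEED_EXPORT_ENCODING in custom_settings.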
Now let's take a closer look at the CrawlerProcess class:
class scrapy.crawler.CrawlerProcess(settings=None)
It is used to start multiple crawlers simultaneously within a single process, and it takes a settings object as its initialization argument. Let's look at one of its methods:
crawl(crawler_or_spidercls, *args, **kwargs)
Any extra positional and keyword arguments passed to crawl() are forwarded to the spider's constructor, which is exactly how category='love' reached QuotesSpider.__init__ in the example above. I won't cover the other methods here; the details are in Common Practices in the Scrapy documentation.
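As an aside, if you want process-wide settings rather than the per-spider custom_settings used above, you can pass them when constructing CrawlerProcess. A minimal sketch (the setting values below are illustrative, not part of the original example):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'LOG_LEVEL': 'INFO',      # quieter console output
    'DOWNLOAD_DELAY': 0.5,    # illustrative: pause between requests
})
process.crawl(QuotesSpider, category='love')
process.start()

Settings passed this way apply to every crawler the process runs, while custom_settings remains per-spider.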
Below is a partial snippet that starts multiple spiders:
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()
process.crawl(QuotesSpider, category='humor')
process.crawl(QuotesSpider, category='love')
process.start()
Simple, isn't it?
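One thing to keep in mind: CrawlerProcess runs the two crawls concurrently in the same Twisted reactor. If you need them to run strictly one after another, the Common Practices page describes a CrawlerRunner-based approach; a minimal sketch, reusing the QuotesSpider defined earlier:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish before starting the next.
    yield runner.crawl(QuotesSpider, category='humor')
    yield runner.crawl(QuotesSpider, category='love')
    reactor.stop()

crawl()
reactor.run()  # the script blocks here until crawl() stops the reactor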
Method 2: cmdline.execute()
# -*- coding: utf-8 -*-
# filename: Quotes_Spider.py
import scrapy
import random

def serialize_text(text):
    word_list = text.replace(u'“', '').replace(u'”', '').split()
    return random.sample(word_list, 5)

class QuotesItem(scrapy.Item):
    text = scrapy.Field(serializer=serialize_text)
    author = scrapy.Field()

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['toscrape.com']
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'quotes.jsonlines',
    }

    def __init__(self, category=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://quotes.toscrape.com/tag/%s/' % category, ]

    def parse(self, response):
        quote_block = response.css('div.quote')
        for quote in quote_block:
            text = quote.css('span.text::text').extract_first()
            author = quote.xpath('span/small/text()').extract_first()
            item = QuotesItem()
            item['text'] = text
            item['author'] = author
            yield item
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
And the runner script:
from scrapy import cmdline
cmdline.execute("scrapy runspider Quotes_Spider.py -a category=life".split())
The two .py files above live in the same directory.
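The list passed to cmdline.execute() is simply the command line split into argv, so the equivalent shell invocation is:

scrapy runspider Quotes_Spider.py -a category=life

The -a category=life flag is delivered to the spider as a keyword argument, landing in QuotesSpider.__init__ as category. Also note that cmdline.execute() calls sys.exit() once the command finishes, so any code placed after it in the runner script will never run.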
References
Adapted from: Run Scrapy - 从脚本运行爬虫及多爬虫运行 - 知乎 (Zhihu); thanks to the original author.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.