The structures of different channels in a website are often similar, so sometimes we want to reuse source code instead of creating a Scrapy project per channel. This is a tutorial on how to use multiple spiders in a single Scrapy project.
ENV
Python: 2.7.5
Scrapy: 0.24.2
Directory tree of the tutorial project
Source code on GitHub: scrapy_multiple_spiders
scrapy_multiple_spiders
├── commands
│   ├── __init__.py
│   └── crawl.py
└── tutorial
    ├── scrapy.cfg
    └── tutorial
        ├── __init__.py
        ├── common_spider.py
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        ├── spider_settings
        │   ├── __init__.py
        │   ├── spider1.py
        │   └── spider2.py
        └── spiders
            ├── __init__.py
            ├── spider1.py
            └── spider2.py
Custom project command
In Scrapy, we can add custom project commands via the COMMANDS_MODULE setting in settings.py; here we will customize the standard crawl command.
When scrapy crawl <spider name> is called, the run method of scrapy.commands.crawl.Command is the entry point. The following code inherits from scrapy.commands.crawl.Command and overrides the run method in our project's commands.crawl.CustomCrawlCommand class.
# commands/crawl.py
from scrapy.commands.crawl import Command
from scrapy.exceptions import UsageError


class CustomCrawlCommand(Command):

    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        spname = args[0]

        # added new code: look up this spider's settings module and merge it
        # with cmdline priority so it overrides the values in settings.py
        spider_settings_path = self.settings.getdict('SPIDER_SETTINGS', {}).get(spname, None)
        if spider_settings_path is not None:
            self.settings.setmodule(spider_settings_path, priority='cmdline')
        # end

        crawler = self.crawler_process.create_crawler()
        spider = crawler.spiders.create(spname, **opts.spargs)
        crawler.crawl(spider)
        self.crawler_process.start()
The part between the # added new code and # end comments is new; the rest is the same as the run method of the scrapy.commands.crawl.Command class.
Scrapy settings have four priority levels: default, command, project, and cmdline. cmdline is the highest, so we use it to override the default values defined in settings.py.
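As an illustration, here is a minimal sketch (not part of the tutorial project; the setting name and values are made up) showing that a value set at cmdline priority wins over one set at project priority, using only the Settings.set / Settings.get API that the custom command above relies on:

from scrapy.settings import Settings

settings = Settings()
# a value as it would come from settings.py (project priority)
settings.set('LOG_FILE', 'default.log', priority='project')
# the per-spider value merged by the custom crawl command (cmdline priority)
settings.set('LOG_FILE', 'spider1.log', priority='cmdline')
print(settings.get('LOG_FILE'))  # -> spider1.log, because cmdline outranks project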
SPIDER_SETTINGS is a custom setting defined in settings.py. It is a dictionary whose keys are spider names and whose values are the module paths of the spiders' custom settings files (see settings.py below).
Create common spiders and settings
tutorial.tutorial.common_spider.CommonSpider is a spider that implements the normal parsing process for a website along with some common helper functions. settings.py contains the settings shared by all spiders, such as LOG_LEVEL; you can override them in a spider's custom settings file, such as spider1.py and spider2.py in the tutorial.tutorial.spider_settings directory.
common_spider.py:
import urlparse

from scrapy.http import Request
from scrapy.log import INFO
from scrapy.spider import Spider


class CommonSpider(Spider):
    """
    This is a common spider, including common functions which child spiders can inherit or overwrite
    """
    name = ''
    allowed_domains = []
    start_urls = []

    # must accept "kwargs", otherwise the spider can't run in scrapyd
    def __init__(self, settings, **kwargs):
        super(CommonSpider, self).__init__(**kwargs)
        # read the start URLs from the (per-spider) settings instead of a class attribute
        self._start_urls = []
        self._start_urls.extend(settings.get('START_URLS', []))
        if not self._start_urls:
            raise Exception('no urls to crawl')

    @classmethod
    def from_settings(cls, settings, **kwargs):
        return cls(settings, **kwargs)

    @classmethod
    def from_crawler(cls, crawler, **kwargs):
        return cls.from_settings(crawler.settings, **kwargs)

    def start_requests(self):
        for url in self._start_urls:
            # must append these hosts, otherwise OffsiteMiddleware will filter the requests
            parsed_url = urlparse.urlparse(url)
            parsed_url.hostname and self.allowed_domains.append(parsed_url.hostname)
            # open('file name', 'a+') behaves differently on OS X and Linux:
            # an empty filter list may be read from <JOBDIR>/requests.seen when the spider
            # is launched on OS X, so be careful with "dont_filter"
            yield Request(url, callback=self.parse, method='GET', dont_filter=True)

    def parse(self, response):
        self.log('response url: %s, status: %d' % (response.url, response.status), INFO)
settings.py:
COMMANDS_MODULE = 'commands'

SPIDER_SETTINGS = {
    'spider1': 'tutorial.spider_settings.spider1',
    'spider2': 'tutorial.spider_settings.spider2',
}

LOG_LEVEL = 'INFO'
Create multiple spiders in a project
Spiders without a custom parsing process
For example, tutorial.tutorial.spiders.spider1.Spider1.
Spider1's settings file: spider1.py (in the spider_settings directory)
LOG_FILE = 'spider1.log'
JOBDIR = 'spider1_job'
START_URLS = ['http://www.bing.com/news']
Spider1's source file: spider1.py (in the spiders directory)
from ..common_spider import CommonSpider


class Spider1(CommonSpider):
    name = 'spider1'
Spiders with a custom parsing process
For example, tutorial.tutorial.spiders.spider2.Spider2.
Spider2's settings file: spider2.py (in the spider_settings directory)
LOG_FILE = 'spider2.log'
JOBDIR = 'spider2_job'
START_URLS = ['http://www.bing.com/knows']
TITLE_PATH = 'html head title::text'
Spider2's source file: spider2.py (in the spiders directory)
from scrapy.log import INFO

from ..common_spider import CommonSpider


class Spider2(CommonSpider):
    name = 'spider2'

    # must add "kwargs", otherwise can't run in scrapyd
    def __init__(self, settings, **kwargs):
        super(Spider2, self).__init__(settings, **kwargs)
        self._title_path = settings.get('TITLE_PATH', '')

    def parse_other_info(self, response):
        title = response.css(self._title_path).extract()[0]
        self.log('title: %s' % title, INFO)

    def parse(self, response):
        self.parse_other_info(response)
        super(Spider2, self).parse(response)
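To add another spider, the same pattern simply repeats: a settings module, a spider class inheriting CommonSpider, and an entry in SPIDER_SETTINGS. A minimal sketch for a hypothetical spider3 (not part of the tutorial project; file contents and URL are made up):

# tutorial/spider_settings/spider3.py
LOG_FILE = 'spider3.log'
JOBDIR = 'spider3_job'
START_URLS = ['http://www.example.com']

# tutorial/spiders/spider3.py
from ..common_spider import CommonSpider

class Spider3(CommonSpider):
    name = 'spider3'

# settings.py: register the new spider
SPIDER_SETTINGS = {
    'spider1': 'tutorial.spider_settings.spider1',
    'spider2': 'tutorial.spider_settings.spider2',
    'spider3': 'tutorial.spider_settings.spider3',
}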
Run spiders
- Set PYTHONPATH to /<path>/scrapy_multiple_spiders;
- In /<path>/scrapy_multiple_spiders/tutorial, run scrapy crawl spider1 or scrapy crawl spider2, then check the log file spider1.log or spider2.log. A quick way to inspect the merged per-spider settings is sketched below.
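To verify that the per-spider settings are merged as expected, a small sketch can be run from /<path>/scrapy_multiple_spiders/tutorial (with PYTHONPATH set as above); it reuses the same SPIDER_SETTINGS lookup as the custom command and assumes nothing beyond it:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()  # loads tutorial/settings.py via scrapy.cfg
module_path = settings.getdict('SPIDER_SETTINGS', {}).get('spider1')
settings.setmodule(module_path, priority='cmdline')
print(settings.get('LOG_LEVEL'))   # INFO, shared value from settings.py
print(settings.get('LOG_FILE'))    # spider1.log, from spider_settings/spider1.py
print(settings.get('START_URLS'))  # ['http://www.bing.com/news']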