Different channels of a website often have similar structures, so sometimes we want to reuse source code instead of creating one Scrapy project per channel. This tutorial shows how to use multiple spiders in a single Scrapy project.
ENV
Python: 2.7.5
Scrapy: 0.24.2
Directory tree of the tutorial project
Source code on GitHub: scrapy_multiple_spiders
scrapy_multiple_spiders
├── commands
│   ├── __init__.py
│   └── crawl.py
└── tutorial
    ├── scrapy.cfg
    └── tutorial
        ├── __init__.py
        ├── common_spider.py
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        ├── spider_settings
        │   ├── __init__.py
        │   ├── spider1.py
        │   └── spider2.py
        └── spiders
            ├── __init__.py
            ├── spider1.py
            └── spider2.py
Custom project command
In Scrapy, we can add custom project commands through the COMMANDS_MODULE setting in settings.py; we will use this to customize the standard crawl command.
When scrapy crawl <spider name> is called, the run function of scrapy.commands.crawl.Command is the entry point.
The following code inherits from scrapy.commands.crawl.Command and overrides the run function in our project’s commands.crawl.CustomCrawlCommand class.
from scrapy.commands.crawl import Command
from scrapy.exceptions import UsageError


class CustomCrawlCommand(Command):

    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        spname = args[0]

        # added new code: layer the spider-specific settings module on top of the project settings
        spider_settings_path = self.settings.getdict('SPIDER_SETTINGS', {}).get(spname, None)
        if spider_settings_path is not None:
            self.settings.setmodule(spider_settings_path, priority='cmdline')
        # end

        crawler = self.crawler_process.create_crawler()
        spider = crawler.spiders.create(spname, **opts.spargs)
        crawler.crawl(spider)
        self.crawler_process.start()
The part between the comments is new code; the rest is the same as the run function in the scrapy.commands.crawl.Command class.
Scrapy settings have four priority levels: default, command, project, and cmdline. cmdline has the highest priority, so we use it to override the setting items defined in settings.py.
SPIDER_SETTINGS is a setting item in settings.py. It is a dictionary whose keys are spider names and whose values are the module paths of the spiders’ custom setting files.
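As a minimal illustration of how the priorities interact (assuming the inner tutorial package is importable, e.g. when run from the project directory), loading the spider-specific module at cmdline priority makes its values win over those from settings.py, while items it does not define fall back to the project level:

from scrapy.settings import Settings

settings = Settings()
# project-level settings: defines LOG_LEVEL = 'INFO', SPIDER_SETTINGS, ...
settings.setmodule('tutorial.settings', priority='project')
# spider-specific settings: defines LOG_FILE = 'spider1.log', JOBDIR, START_URLS
settings.setmodule('tutorial.spider_settings.spider1', priority='cmdline')

print(settings.get('LOG_FILE'))   # 'spider1.log' -- from the cmdline-priority module
print(settings.get('LOG_LEVEL'))  # 'INFO' -- only defined at project priority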
Create common spiders and settings
tutorial.tutorial.common_spider.CommonSpider is a spider that implements a typical parsing process for a website and provides some common functions.
settings.py contains common setting items for all spiders, such as LOG_LEVEL; you can override them in a spider’s custom setting file, such as spider1.py and spider2.py in the tutorial.tutorial.spider_settings directory.
common_spider.py:
import urlparse

from scrapy.http import Request
from scrapy.log import INFO
from scrapy.spider import Spider


class CommonSpider(Spider):
    """
    This is a common spider, including common functions which child spiders can inherit or override
    """
    name = ''
    allowed_domains = []
    start_urls = []

    # must accept "kwargs", otherwise the spider can't run in scrapyd
    def __init__(self, settings, **kwargs):
        super(CommonSpider, self).__init__(**kwargs)
        self._start_urls = []
        self._start_urls.extend(settings.get('START_URLS', []))
        if not self._start_urls:
            raise Exception('no urls to crawl')

    @classmethod
    def from_settings(cls, settings, **kwargs):
        return cls(settings, **kwargs)

    @classmethod
    def from_crawler(cls, crawler, **kwargs):
        return cls.from_settings(crawler.settings, **kwargs)

    def start_requests(self):
        for url in self._start_urls:
            # must append these hosts, otherwise OffsiteMiddleware will filter the requests
            parsed_url = urlparse.urlparse(url)
            parsed_url.hostname and self.allowed_domains.append(parsed_url.hostname)
            # open('file name', 'a+') behaves differently on OS X and Linux:
            # on OS X an empty filter list may be read from <JOBDIR>/requests.seen
            # when the spider is launched, so be careful with "dont_filter"
            yield Request(url, callback=self.parse, method='GET', dont_filter=True)

    def parse(self, response):
        self.log('response url: %s, status: %d' % (response.url, response.status), INFO)
settings.py:
COMMANDS_MODULE = 'commands'
SPIDER_SETTINGS = {
    'spider1': 'tutorial.spider_settings.spider1',
    'spider2': 'tutorial.spider_settings.spider2',
}
LOG_LEVEL = 'INFO'
Create multiple spiders in a project
Spiders without a custom parsing process
For example, tutorial.tutorial.spiders.spider1.Spider1.
Spider1’s setting file: spider1.py (in the spider_settings directory)
LOG_FILE = 'spider1.log'
JOBDIR = 'spider1_job'
START_URLS = ['http://www.bing.com/news']
Spider1’s source file: spider1.py (in the spiders directory)
from ..common_spider import CommonSpider
class Spider1(CommonSpider):
    name = 'spider1'
Spiders with a custom parsing process
For example, tutorial.tutorial.spiders.spider2.Spider2.
Spider2’s setting file: spider2.py (in the spider_settings directory)
LOG_FILE = 'spider2.log'
JOBDIR = 'spider2_job'
START_URLS = ['http://www.bing.com/knows']
TITLE_PATH = 'html head title::text'
Spider2’s source file: spider2.py (in the spiders directory)
from scrapy.log import INFO

from ..common_spider import CommonSpider


class Spider2(CommonSpider):
    name = 'spider2'

    # must accept "kwargs", otherwise the spider can't run in scrapyd
    def __init__(self, settings, **kwargs):
        super(Spider2, self).__init__(settings, **kwargs)
        self._title_path = settings.get('TITLE_PATH', '')

    def parse_other_info(self, response):
        title = response.css(self._title_path).extract()[0]
        self.log('title: %s' % title, INFO)

    def parse(self, response):
        self.parse_other_info(response)
        super(Spider2, self).parse(response)
Run spiders
- Set PYTHONPATH to /<path>/scrapy_multiple_spiders;
- In /<path>/scrapy_multiple_spiders/tutorial, run scrapy crawl spider1 or scrapy crawl spider2, then check the log file spider1.log or spider2.log.
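As an alternative to the custom command, the same steps that CustomCrawlCommand.run performs can be reproduced in a small standalone script. The sketch below is illustrative only and targets Scrapy 0.24; run_spider.py is a hypothetical file name, and the script assumes it is run from /<path>/scrapy_multiple_spiders/tutorial so that get_project_settings() can locate scrapy.cfg (with /<path>/scrapy_multiple_spiders on PYTHONPATH as above).

# run_spider.py -- hypothetical standalone launcher, mirroring CustomCrawlCommand.run
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

spider_name = 'spider1'

# load the project settings (settings.py), then layer the spider's own
# settings module on top at cmdline priority, just like the custom command does
settings = get_project_settings()
spider_settings_path = settings.getdict('SPIDER_SETTINGS', {}).get(spider_name)
if spider_settings_path is not None:
    settings.setmodule(spider_settings_path, priority='cmdline')

process = CrawlerProcess(settings)
crawler = process.create_crawler()
spider = crawler.spiders.create(spider_name)
crawler.crawl(spider)
process.start()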