Different channels of a website often have similar structures, so we may want to reuse source code instead of creating one Scrapy project per channel. This tutorial shows how to run multiple spiders in a single Scrapy project.
Directory tree of the tutorial project
Source code on GitHub: scrapy_multiple_spiders
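For orientation, the layout implied by the sections below is roughly the following (file names are inferred from the text; the real repository may differ):

```
scrapy_multiple_spiders/
└── tutorial/
    ├── scrapy.cfg
    ├── commands/
    │   ├── __init__.py
    │   └── crawl.py          # CustomCrawlCommand
    └── tutorial/
        ├── __init__.py
        ├── settings.py       # common settings, COMMANDS_MODULE, SPIDER_SETTINGS
        ├── common_spider.py  # CommonSpider
        ├── spider_settings/
        │   ├── __init__.py
        │   ├── spider1.py    # LOG_FILE = 'spider1.log'
        │   └── spider2.py    # LOG_FILE = 'spider2.log'
        └── spiders/
            ├── __init__.py
            ├── Spider1.py
            └── Spider2.py
```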
Custom project command
In Scrapy we can add custom project commands with the COMMANDS_MODULE setting in settings.py; here we customize the standard "crawl" command. When you call "scrapy crawl <spider name>", the run() method of scrapy.commands.crawl.Command is the entry point, so we inherit from scrapy.commands.crawl.Command and override run() in our project's commands.crawl.CustomCrawlCommand class.
In that override, the commented part is new code; the rest is the same as the run() method of the scrapy.commands.crawl.Command class. Scrapy settings have four priority levels: default, command, project, and cmdline. Since cmdline is the highest, we use it to override the default setting values defined in settings.py. SPIDER_SETTINGS is a setting we add to settings.py: a dictionary whose keys are spider names and whose values are the names of each spider's custom settings file.
Create common spiders and settings
tutorial.tutorial.common_spider.CommonSpider is a spider that implements the normal parsing process for the website plus some shared helper functions. settings.py holds setting values common to all spiders, such as LOG_LEVEL; a spider's custom settings file, such as spider1.py or spider2.py in the tutorial.tutorial.spider_settings directory, can override them.
COMMANDS_MODULE = 'commands'
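Putting the pieces together, settings.py could contain something like this (the SPIDER_SETTINGS values are the settings-module names used by the two spiders in this tutorial):

```python
# tutorial/settings.py -- a sketch of the project-level settings
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']

# Tell Scrapy where our custom command lives.
COMMANDS_MODULE = 'commands'

# Shared defaults; per-spider settings files may override these.
LOG_LEVEL = 'INFO'

# Maps spider name -> settings module name in tutorial/spider_settings/.
SPIDER_SETTINGS = {
    'spider1': 'spider1',
    'spider2': 'spider2',
}
```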
Create multiple spiders in a project
Spider without a custom parsing process
Spider1's setting file: spider1.py (in "spider_settings" directory)
LOG_FILE = 'spider1.log'
Spider1's source file: Spider1.py (in "spiders" directory)
from ..common_spider import CommonSpider
Spider with a custom parsing process
Spider2's setting file: spider2.py (in "spider_settings" directory)
LOG_FILE = 'spider2.log'
Spider2's source file: Spider2.py (in "spiders" directory)
from logging import INFO  # the deprecated scrapy.log module is replaced by stdlib logging
- set PYTHONPATH to "/<path>/scrapy_multiple_spiders"
- in "/<path>/scrapy_multiple_spiders/tutorial", run scrapy crawl spider1 or scrapy crawl spider2, then check the log file spider1.log or spider2.log
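The two steps above as shell commands (the path placeholder is kept from the original):

```
export PYTHONPATH=/<path>/scrapy_multiple_spiders
cd /<path>/scrapy_multiple_spiders/tutorial
scrapy crawl spider1   # writes spider1.log
scrapy crawl spider2   # writes spider2.log
```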