Different channels of a website often have similar structures, so sometimes we want to reuse source code instead of creating one Scrapy project per channel. This tutorial shows how to use multiple spiders in a single Scrapy project.
ENV
Python: 2.7.5
Scrapy: 0.24.2
Directory tree of the tutorial project
Source code on GitHub: scrapy_multiple_spiders
scrapy_multiple_spiders
├── commands
│   ├── __init__.py
│   └── crawl.py
└── tutorial
    ├── scrapy.cfg
    └── tutorial
        ├── __init__.py
        ├── common_spider.py
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        ├── spider_settings
        │   ├── __init__.py
        │   ├── spider1.py
        │   └── spider2.py
        └── spiders
            ├── __init__.py
            ├── spider1.py
            └── spider2.py
Custom project command
In Scrapy, we can add custom project commands through the COMMANDS_MODULE setting in settings.py; we will use this to customize the standard crawl command.
When scrapy crawl <spider name> is called, the run function of scrapy.commands.crawl.Command is the entry point.
The following code inherits from scrapy.commands.crawl.Command and overrides the run function in our project’s commands.crawl.CustomCrawlCommand class.
from scrapy.commands.crawl import Command
from scrapy.exceptions import UsageError


class CustomCrawlCommand(Command):

    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        spname = args[0]

        # added new code: layer the spider-specific settings module on top of the project settings
        spider_settings_path = self.settings.getdict('SPIDER_SETTINGS', {}).get(spname, None)
        if spider_settings_path is not None:
            self.settings.setmodule(spider_settings_path, priority='cmdline')
        # end

        crawler = self.crawler_process.create_crawler()
        spider = crawler.spiders.create(spname, **opts.spargs)
        crawler.crawl(spider)
        self.crawler_process.start()
The part between the comments is new code; the rest is the same as the run function in the scrapy.commands.crawl.Command class.
Scrapy settings have four priority levels: default, command, project, and cmdline. cmdline has the highest priority, so we use it to override the setting items defined in settings.py.
SPIDER_SETTINGS is a setting item in settings.py. It is a dictionary whose keys are spider names and whose values are the module paths of the spiders’ custom setting files.
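As a minimal illustration of how the priorities interact (assuming the inner tutorial package is importable, e.g. when run from the project directory), loading the spider-specific module at cmdline priority makes its values win over those from settings.py, while items it does not define fall back to the project level:

from scrapy.settings import Settings

settings = Settings()
# project-level settings: defines LOG_LEVEL = 'INFO', SPIDER_SETTINGS, ...
settings.setmodule('tutorial.settings', priority='project')
# spider-specific settings: defines LOG_FILE = 'spider1.log', JOBDIR, START_URLS
settings.setmodule('tutorial.spider_settings.spider1', priority='cmdline')

print(settings.get('LOG_FILE'))   # 'spider1.log' -- from the cmdline-priority module
print(settings.get('LOG_LEVEL'))  # 'INFO' -- only defined at project priority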
Create common spiders and settings
tutorial.tutorial.common_spider.CommonSpider is a spider that implements a typical parsing process for a website and provides some common functions.
settings.py contains common setting items for all spiders, such as LOG_LEVEL; you can override them in a spider’s custom setting file, such as spider1.py and spider2.py in the tutorial.tutorial.spider_settings directory.
common_spider.py:
import urlparse

from scrapy.http import Request
from scrapy.log import INFO
from scrapy.spider import Spider


class CommonSpider(Spider):
    """
    This is a common spider, including common functions which child spiders can inherit or override
    """
    name = ''
    allowed_domains = []
    start_urls = []

    # must accept "kwargs", otherwise the spider can't run in scrapyd
    def __init__(self, settings, **kwargs):
        super(CommonSpider, self).__init__(**kwargs)
        self._start_urls = []
        self._start_urls.extend(settings.get('START_URLS', []))
        if not self._start_urls:
            raise Exception('no urls to crawl')

    @classmethod
    def from_settings(cls, settings, **kwargs):
        return cls(settings, **kwargs)

    @classmethod
    def from_crawler(cls, crawler, **kwargs):
        return cls.from_settings(crawler.settings, **kwargs)

    def start_requests(self):
        for url in self._start_urls:
            # must append these hosts, otherwise OffsiteMiddleware will filter the requests
            parsed_url = urlparse.urlparse(url)
            parsed_url.hostname and self.allowed_domains.append(parsed_url.hostname)
            # open('file name', 'a+') behaves differently on OS X and Linux:
            # on OS X an empty filter list may be read from <JOBDIR>/requests.seen
            # when the spider is launched, so be careful with "dont_filter"
            yield Request(url, callback=self.parse, method='GET', dont_filter=True)

    def parse(self, response):
        self.log('response url: %s, status: %d' % (response.url, response.status), INFO)
settings.py:
COMMANDS_MODULE = 'commands'
SPIDER_SETTINGS = {
    'spider1': 'tutorial.spider_settings.spider1',
    'spider2': 'tutorial.spider_settings.spider2',
}
LOG_LEVEL = 'INFO'
Create multiple spiders in a project
Spiders without a custom parsing process
For example, tutorial.tutorial.spiders.spider1.Spider1.
Spider1’s setting file: spider1.py (in the spider_settings directory)
LOG_FILE = 'spider1.log'
JOBDIR = 'spider1_job'
START_URLS = ['http://www.bing.com/news']
Spider1’s source file: spider1.py (in the spiders directory)
from ..common_spider import CommonSpider
class Spider1(CommonSpider):
    name = 'spider1'
Spiders with a custom parsing process
For example, tutorial.tutorial.spiders.spider2.Spider2.
Spider2’s setting file: spider2.py (in the spider_settings directory)
LOG_FILE = 'spider2.log'
JOBDIR = 'spider2_job'
START_URLS = ['http://www.bing.com/knows']
TITLE_PATH = 'html head title::text'
Spider2’s source file: spider2.py (in the spiders directory)
from scrapy.log import INFO

from ..common_spider import CommonSpider


class Spider2(CommonSpider):
    name = 'spider2'

    # must accept "kwargs", otherwise the spider can't run in scrapyd
    def __init__(self, settings, **kwargs):
        super(Spider2, self).__init__(settings, **kwargs)
        self._title_path = settings.get('TITLE_PATH', '')

    def parse_other_info(self, response):
        title = response.css(self._title_path).extract()[0]
        self.log('title: %s' % title, INFO)

    def parse(self, response):
        self.parse_other_info(response)
        super(Spider2, self).parse(response)
Run spiders
- Set PYTHONPATH to /<path>/scrapy_multiple_spiders;
- In /<path>/scrapy_multiple_spiders/tutorial, run scrapy crawl spider1 or scrapy crawl spider2, then check the log file spider1.log or spider2.log.
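As an alternative to the custom command, the same steps that CustomCrawlCommand.run performs can be reproduced in a small standalone script. The sketch below is illustrative only and targets Scrapy 0.24; run_spider.py is a hypothetical file name, and the script assumes it is run from /<path>/scrapy_multiple_spiders/tutorial so that get_project_settings() can locate scrapy.cfg (with /<path>/scrapy_multiple_spiders on PYTHONPATH as above).

# run_spider.py -- hypothetical standalone launcher, mirroring CustomCrawlCommand.run
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

spider_name = 'spider1'

# load the project settings (settings.py), then layer the spider's own
# settings module on top at cmdline priority, just like the custom command does
settings = get_project_settings()
spider_settings_path = settings.getdict('SPIDER_SETTINGS', {}).get(spider_name)
if spider_settings_path is not None:
    settings.setmodule(spider_settings_path, priority='cmdline')

process = CrawlerProcess(settings)
crawler = process.create_crawler()
spider = crawler.spiders.create(spider_name)
crawler.crawl(spider)
process.start()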