Using multiple spiders in a Scrapy project

Overview

Different channels of a website often share a similar structure, so sometimes we want to reuse source code instead of creating a separate Scrapy project per channel. This tutorial shows how to use multiple spiders in a single Scrapy project.

ENV

Python: 2.7.5
Scrapy: 0.24.2

Directory tree of this tutorial project

Source code on GitHub: scrapy_multiple_spiders

scrapy_multiple_spiders
├── commands
│   ├── __init__.py
│   └── crawl.py
└── tutorial
    ├── scrapy.cfg
    └── tutorial
        ├── __init__.py
        ├── common_spider.py
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        ├── spider_settings
        │   ├── __init__.py
        │   ├── spider1.py
        │   └── spider2.py
        └── spiders
            ├── __init__.py
            ├── spider1.py
            └── spider2.py

Custom project command

In Scrapy we can add custom project commands through the COMMANDS_MODULE setting in settings.py; here we customize the standard "crawl" command. When you call "scrapy crawl <spider name>", the run() function of scrapy.commands.crawl.Command is the entry point, so we inherit from scrapy.commands.crawl.Command and override run() in our project's commands.crawl.CustomCrawlCommand class.

from scrapy.commands.crawl import Command
from scrapy.exceptions import UsageError


class CustomCrawlCommand(Command):

    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        spname = args[0]

        # added new code: load this spider's custom setting module, if one is registered
        spider_settings_path = self.settings.getdict('SPIDER_SETTINGS', {}).get(spname, None)
        if spider_settings_path is not None:
            self.settings.setmodule(spider_settings_path, priority='cmdline')
        # end

        crawler = self.crawler_process.create_crawler()
        spider = crawler.spiders.create(spname, **opts.spargs)
        crawler.crawl(spider)
        self.crawler_process.start()

The part between the "added new code" comments is new; the rest is identical to the run() function of the scrapy.commands.crawl.Command class. Scrapy settings have four priority levels: default, command, project and cmdline. cmdline is the highest, so we use it to override the project-level items defined in settings.py. SPIDER_SETTINGS is a setting defined in settings.py: a dictionary whose keys are spider names and whose values are the import paths of the corresponding spiders' custom setting modules.
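
For example, "scrapy crawl spider1" makes the command load tutorial.spider_settings.spider1 on top of the project settings. Below is a minimal sketch of that override, assuming the project's modules are importable (PYTHONPATH set as in "Run spiders" below) and that scrapy.settings.Settings can be instantiated directly; the setmodule()/get() calls are the ones the code in this tutorial already uses:

from scrapy.settings import Settings

# same dictionary as defined in settings.py below
SPIDER_SETTINGS = {
    'spider1': 'tutorial.spider_settings.spider1',
    'spider2': 'tutorial.spider_settings.spider2',
}

settings = Settings()
settings.setmodule('tutorial.settings', priority='project')       # common settings shared by every spider

# what CustomCrawlCommand.run() does for "scrapy crawl spider1"
spname = 'spider1'
spider_settings_path = SPIDER_SETTINGS.get(spname, None)          # -> 'tutorial.spider_settings.spider1'
if spider_settings_path is not None:
    settings.setmodule(spider_settings_path, priority='cmdline')  # cmdline beats project

print settings.get('LOG_FILE')     # -> 'spider1.log', the spider-specific value
print settings.get('LOG_LEVEL')    # -> 'INFO', still the shared value from settings.py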

Create common spiders and settings

tutorial.tutorial.common_spider.CommonSpider is a spider that implements the normal parsing process for a website plus some common functions. settings.py holds setting items shared by all spiders, such as LOG_LEVEL; you can override them in a spider's custom setting file, such as spider1.py and spider2.py in the tutorial.tutorial.spider_settings directory.

common_spider.py

import urlparse

from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.log import INFO


class CommonSpider(Spider):
    """
    This is a common spider, including common functions which child spiders can inherit or override
    """

    name = ''
    allowed_domains = []
    start_urls = []

    # must accept **kwargs, otherwise the spider can't run in scrapyd
    def __init__(self, settings, **kwargs):
        super(CommonSpider, self).__init__(**kwargs)

        self._start_urls = []
        self._start_urls.extend(settings.get('START_URLS', []))
        if not self._start_urls:
            raise Exception('no urls to crawl')

    @classmethod
    def from_settings(cls, settings, **kwargs):
        return cls(settings, **kwargs)

    @classmethod
    def from_crawler(cls, crawler, **kwargs):
        return cls.from_settings(crawler.settings, **kwargs)

    def start_requests(self):
        for url in self._start_urls:
            # must append these hosts, otherwise OffsiteMiddleware will filter the requests
            parsed_url = urlparse.urlparse(url)
            parsed_url.hostname and self.allowed_domains.append(parsed_url.hostname)

            # open('file name', 'a+') behaves differently on OS X and Linux:
            # on OS X the filter list read from <JOBDIR>/requests.seen may be empty when the spider is launched,
            # so be careful with "dont_filter"
            yield Request(url, callback=self.parse, method='GET', dont_filter=True)

    def parse(self, response):
        self.log('response url: %s, status: %d' % (response.url, response.status), INFO)

settings.py

COMMANDS_MODULE = 'commands'

SPIDER_SETTINGS = {
    'spider1': 'tutorial.spider_settings.spider1',
    'spider2': 'tutorial.spider_settings.spider2',
}

LOG_LEVEL = 'INFO'

Create multiple spiders in a project

Spider without custom parsing process

For example, tutorial.tutorial.spiders.spider1.Spider1, which reuses CommonSpider's parse() unchanged.
Spider1's setting file: spider1.py (in the "spider_settings" directory)

LOG_FILE = 'spider1.log'

JOBDIR = 'spider1_job'

START_URLS = ['http://www.bing.com/news']

Spider1's source file: spider1.py (in the "spiders" directory)

from ..common_spider import CommonSpider


class Spider1(CommonSpider):
    name = 'spider1'

Spider with custom parsing process

For example, tutorial.tutorial.spiders.spider2.Spider2, which overrides parse() to extract extra information.
Spider2's setting file: spider2.py (in the "spider_settings" directory)

LOG_FILE = 'spider2.log'

JOBDIR = 'spider2_job'

START_URLS = ['http://www.bing.com/knows']

TITLE_PATH = 'html head title::text'

Spider2's source file: spider2.py (in the "spiders" directory)

from scrapy.log import INFO

from ..common_spider import CommonSpider


class Spider2(CommonSpider):
    name = 'spider2'

    # must accept **kwargs, otherwise the spider can't run in scrapyd
    def __init__(self, settings, **kwargs):
        super(Spider2, self).__init__(settings, **kwargs)

        self._title_path = settings.get('TITLE_PATH', '')

    def parse_other_info(self, response):
        title = response.css(self._title_path).extract()[0]
        self.log('title: %s' % title, INFO)

    def parse(self, response):
        self.parse_other_info(response)

        super(Spider2, self).parse(response)
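
Adding one more spider follows exactly the same pattern: register it in SPIDER_SETTINGS, add a setting module under spider_settings and a spider class under spiders. A hypothetical spider3 (the name and URL below are illustrative only) would look like this:

# settings.py: add an entry to SPIDER_SETTINGS
#   'spider3': 'tutorial.spider_settings.spider3',

# tutorial/spider_settings/spider3.py
LOG_FILE = 'spider3.log'

JOBDIR = 'spider3_job'

START_URLS = ['http://www.bing.com/videos']

# tutorial/spiders/spider3.py
from ..common_spider import CommonSpider


class Spider3(CommonSpider):
    name = 'spider3'

After that, scrapy crawl spider3 writes to spider3.log and spider3_job without any change to the custom command or to CommonSpider.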

Run spiders

  1. Set PYTHONPATH to "/<path>/scrapy_multiple_spiders".
  2. In "/<path>/scrapy_multiple_spiders/tutorial", run scrapy crawl spider1 or scrapy crawl spider2, then check the log file spider1.log or spider2.log.