Ⅸ.Re: Scrapy Middleware

MiddlewareManager

In Scrapy, SpiderMiddlewareManager, DownloaderMiddlewareManager, ExtensionManager, and ItemPipelineManager all inherit from the same base class, MiddlewareManager.

SpiderMiddlewareManager: can be understood as Scrapy's hook framework around Spider processing; it lets you customize the handling of responses sent to the Spider, and of the Requests and Items the Spider produces

DownloaderMiddlewareManager: can be understood as Scrapy's hook framework around requests/responses; it lets you add processing before and after each request/response

ExtensionManager: extensions (just ordinary classes) that provide auxiliary features and state/statistics tracking

ItemPipelineManager: persists data (you implement the pipelines yourself)

scrapy/middleware.py#MiddlewareManager

import logging
import pprint
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Iterable, Optional, cast

from twisted.internet.defer import Deferred

from scrapy import Spider
from scrapy.exceptions import NotConfigured
from scrapy.settings import Settings
from scrapy.utils.defer import process_chain, process_parallel
from scrapy.utils.misc import create_instance, load_object

logger = logging.getLogger(__name__)


class MiddlewareManager:
    """Base class for implementing middleware managers"""

    component_name = 'foo middleware'

    def __init__(self, *middlewares):
        self.middlewares = middlewares
        # Optional because process_spider_output and process_spider_exception can be None
        # Deques that store each middleware method, keyed by method name
        self.methods: Dict[str, Deque[Optional[Callable]]] = defaultdict(deque)
        for mw in middlewares:
            self._add_middleware(mw)

    @classmethod
    def _get_mwlist_from_settings(cls, settings: Settings) -> list:
        # Each subclass returns its ordered list of middleware class paths
        raise NotImplementedError

    @classmethod
    def from_settings(cls, settings: Settings, crawler=None):
        # Read the middleware configuration from the settings
        mwlist = cls._get_mwlist_from_settings(settings)
        middlewares = []
        enabled = []
        for clspath in mwlist:
            try:
                # Load the class from its dotted path
                mwcls = load_object(clspath)
                # Create the instance (via from_crawler/from_settings when defined)
                mw = create_instance(mwcls, settings, crawler)
                middlewares.append(mw)
                enabled.append(clspath)
            except NotConfigured as e:
                # A middleware opts out by raising NotConfigured
                if e.args:
                    clsname = clspath.split('.')[-1]
                    logger.warning("Disabled %(clsname)s: %(eargs)s",
                                   {'clsname': clsname, 'eargs': e.args[0]},
                                   extra={'crawler': crawler})

        logger.info("Enabled %(componentname)ss:\n%(enabledlist)s",
                    {'componentname': cls.component_name,
                     'enabledlist': pprint.pformat(enabled)},
                    extra={'crawler': crawler})
        return cls(*middlewares)

    @classmethod
    def from_crawler(cls, crawler):
        return cls.from_settings(crawler.settings, crawler)

    def _add_middleware(self, mw) -> None:
        # open_spider handlers run in middleware order,
        # close_spider handlers in reverse order
        if hasattr(mw, 'open_spider'):
            self.methods['open_spider'].append(mw.open_spider)
        if hasattr(mw, 'close_spider'):
            self.methods['close_spider'].appendleft(mw.close_spider)

    def _process_parallel(self, methodname: str, obj, *args) -> Deferred:
        methods = cast(Iterable[Callable], self.methods[methodname])
        return process_parallel(methods, obj, *args)

    def _process_chain(self, methodname: str, obj, *args) -> Deferred:
        methods = cast(Iterable[Callable], self.methods[methodname])
        return process_chain(methods, obj, *args)

    def open_spider(self, spider: Spider) -> Deferred:
        return self._process_parallel('open_spider', spider)

    def close_spider(self, spider: Spider) -> Deferred:
        return self._process_parallel('close_spider', spider)
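
Each subclass only has to say where its middleware list comes from by implementing _get_mwlist_from_settings. As a reference, this is roughly how SpiderMiddlewareManager does it (trimmed from scrapy/core/spidermw.py; the full class also registers the process_spider_* hooks):

from scrapy.middleware import MiddlewareManager
from scrapy.utils.conf import build_component_list


class SpiderMiddlewareManager(MiddlewareManager):
    component_name = 'spider middleware'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Merge SPIDER_MIDDLEWARES_BASE with the user's SPIDER_MIDDLEWARES
        # and return the class paths ordered by priority
        return build_component_list(settings.getwithbase('SPIDER_MIDDLEWARES'))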

SpiderMiddlewareManager

Default middleware

SPIDER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    # Spider side
}
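
Your own spider middlewares are merged into this base dict through the SPIDER_MIDDLEWARES setting, ordered by priority (lower numbers sit closer to the Engine). A sketch of what that looks like in settings.py; the myproject path is illustrative and refers to the ItemCountLimitMiddleware sketched under "Main methods" below:

SPIDER_MIDDLEWARES = {
    # Hypothetical custom middleware (module path is illustrative)
    'myproject.middlewares.ItemCountLimitMiddleware': 543,
    # Map a built-in entry to None to disable it
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}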

Main methods

  • process_spider_input(response, spider)
    • Returning None continues execution through the other middlewares
    • Raising an exception stops further processing; handling goes to the method named by the request's errback, or else to process_spider_exception
  • process_spider_output(response, result, spider) (see the sketch after this list)
    • Returns Requests
    • Returns Items
  • process_spider_exception(response, exception, spider)
    • Returning None hands the exception to the following process_spider_exception methods, until the Engine logs and discards it
    • Returns Requests
    • Returns Items
  • process_start_requests(start_requests, spider)
    • Returns an iterable of Request objects
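
A minimal sketch of a custom spider middleware wiring these hooks together. The class name and the MAX_ITEMS setting are assumptions for illustration, not Scrapy built-ins:

from scrapy import Request
from scrapy.exceptions import NotConfigured


class ItemCountLimitMiddleware:
    """Stop yielding items once the spider has produced MAX_ITEMS of them."""

    def __init__(self, max_items):
        self.max_items = max_items
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        max_items = crawler.settings.getint('MAX_ITEMS')  # hypothetical setting
        if not max_items:
            raise NotConfigured  # disables this middleware
        return cls(max_items)

    def process_spider_input(self, response, spider):
        # Returning None lets the response continue to the next middleware
        return None

    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, Request):
                yield obj  # always pass requests through
            elif self.count < self.max_items:
                self.count += 1
                yield obj  # pass items through until the limit is reached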

DownloaderMiddlewareManager

Default middleware

DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}

Main methods

  • process_request(request, spider)
    • Returning None continues through the other middlewares, until the Downloader executes the request
    • Returning a Request halts the chain and puts the returned request back into the download queue
    • Returning a Response skips the remaining middlewares and the download itself; that response is returned directly
    • Raising an exception (IgnoreRequest) goes to the method named by the request's errback, or else to process_exception; if no code handles the exception, it is ignored and not even logged
  • process_response(request, response, spider) (see the sketch after this list)
    • Returning a Response continues through the other middlewares' process_response handling
    • Raising an exception (IgnoreRequest) goes to the method named by the request's errback, or else to process_exception; if no code handles the exception, it is ignored and not even logged
  • process_exception(request, exception, spider)
    • Returning None lets the other middlewares continue handling the exception
    • Returning a Request puts it back into the download queue immediately
    • Returning a Response stops the other middlewares' exception handling
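
A minimal sketch of a custom downloader middleware using these hooks. The class name, the user-agent list, and the retried_503 meta key are assumptions for illustration:

import random


class RandomUserAgentMiddleware:

    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    ]

    def process_request(self, request, spider):
        # Rotate the User-Agent, then return None so the remaining
        # middlewares (and finally the Downloader) handle the request
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None

    def process_response(self, request, response, spider):
        # Returning a Request halts the chain and reschedules the download;
        # returning the response passes it to the next process_response
        if response.status == 503 and not request.meta.get('retried_503'):
            return request.replace(meta={**request.meta, 'retried_503': True},
                                   dont_filter=True)
        return response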

ExtensionManager

Default middleware

# Load order doesn't matter here, since extensions generally don't depend on one another
EXTENSIONS_BASE = {
    'scrapy.extensions.corestats.CoreStats': 0,
    'scrapy.extensions.telnet.TelnetConsole': 0,
    'scrapy.extensions.memusage.MemoryUsage': 0,
    'scrapy.extensions.memdebug.MemoryDebugger': 0,
    'scrapy.extensions.closespider.CloseSpider': 0,
    'scrapy.extensions.feedexport.FeedExporter': 0,
    'scrapy.extensions.logstats.LogStats': 0,
    'scrapy.extensions.spiderstate.SpiderState': 0,
    'scrapy.extensions.throttle.AutoThrottle': 0,
}

Main methods

Extensions are ordinary classes. If the from_crawler method raises a NotConfigured exception, the extension is disabled; otherwise, the extension is enabled.

Typically, extensions connect to Signals and perform tasks triggered by them, as in the sketch below.

See 👉 Signals https://docs.scrapy.org/en/latest/topics/signals.html#topics-signals
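
A minimal sketch of an extension, adapted from the example in the Scrapy docs; the SPIDEROPENCLOSE_ENABLED setting name is an assumption for illustration:

import logging

from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)


class SpiderOpenCloseLogging:

    @classmethod
    def from_crawler(cls, crawler):
        # Raising NotConfigured here is exactly what disables an extension
        if not crawler.settings.getbool('SPIDEROPENCLOSE_ENABLED'):  # hypothetical setting
            raise NotConfigured
        ext = cls()
        # Connect to signals; Scrapy calls these methods when the events fire
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        logger.info("opened spider %s", spider.name)

    def spider_closed(self, spider):
        logger.info("closed spider %s", spider.name)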

ItemPipelineManager

Default middleware

None by default; you implement pipelines yourself

Main methods

  • process_item(self, item, spider) (see the sketch after this list)
    • Returns an Item
    • Returns a Deferred
    • Raises an exception (DropItem)
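
A minimal sketch of an item pipeline; the price field and module path are illustrative. It would be enabled with ITEM_PIPELINES = {'myproject.pipelines.PricePipeline': 300} in settings.py:

from scrapy.exceptions import DropItem


class PricePipeline:

    def process_item(self, item, spider):
        if item.get('price'):
            return item  # hand the item to the next pipeline
        raise DropItem(f"Missing price in {item!r}")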
