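Scrapy - Shell

Description

The Scrapy shell lets you try out and debug scraping code interactively, without having to run a spider. Its main purpose is to test data extraction code, in particular XPath or CSS expressions, against the web pages you want to scrape.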

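Configuring the Shell

You can configure the shell by installing IPython, a powerful interactive console that offers auto-completion, colorized output, and other conveniences.

If you are working on a Unix platform, installing IPython is recommended. If IPython is not available, you can use bpython instead.

You can choose the shell by setting the SCRAPY_PYTHON_SHELL environment variable or by defining it in the scrapy.cfg file, as follows -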

[settings]
shell = bpython
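
The two configuration sources above can be sketched as a small lookup. This is only an illustration, not Scrapy's own code: the function name configured_shell is hypothetical, and the sketch assumes that the environment variable takes priority over scrapy.cfg when both are set.

```python
import configparser

def configured_shell(environ, cfg_text):
    """Return the shell name requested by the configuration, or None if unset.

    Assumption (not confirmed by this tutorial): SCRAPY_PYTHON_SHELL
    overrides the value found in scrapy.cfg.
    """
    env_value = environ.get("SCRAPY_PYTHON_SHELL", "").strip().lower()
    if env_value:
        return env_value
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if parser.has_option("settings", "shell"):
        return parser.get("settings", "shell").strip().lower()
    return None

# Same fragment as the scrapy.cfg example above.
CFG = """
[settings]
shell = bpython
"""

print(configured_shell({}, CFG))                                  # bpython
print(configured_shell({"SCRAPY_PYTHON_SHELL": "ipython"}, CFG))  # ipython
```

Passing the environment as a plain dictionary (instead of reading os.environ directly) keeps the sketch easy to try with different values.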

Launching the Shell

The Scrapy shell can be launched using the following command -

scrapy shell <url>

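Here, url is the URL of the page you want to scrape.

Using the Shell

The shell provides some additional shortcuts and Scrapy objects, described in the following tables -

Available Shortcuts

The shell provides the following shortcuts -

Sr.No.	Shortcut & Description
1	shelp(): Prints help listing the available objects and shortcuts.
2	fetch(request_or_url): Fetches a new response from the given request or URL and updates the related objects accordingly.
3	view(response): Opens the given response in your local browser for inspection. So that external links display correctly, it adds a base tag to the response body.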
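
To see why view(response) adds a base tag, consider a saved copy of a page: its relative links would resolve against the local file path instead of the original site. The following is a minimal sketch of the idea using only the standard library, not Scrapy's implementation; the helper add_base_tag and the example.com URL are hypothetical.

```python
import re

def add_base_tag(html, url):
    """Insert <base href=url> right after the opening <head> tag, if absent."""
    if re.search(r"<base\b", html, flags=re.IGNORECASE):
        return html  # the page already declares its own base URL
    return re.sub(
        r"(<head\b[^>]*>)",
        # A function replacement avoids backslash-escape surprises in the URL.
        lambda m: m.group(1) + '<base href="%s">' % url,
        html,
        count=1,
        flags=re.IGNORECASE,
    )

page = (
    "<html><head><title>Demo</title></head>"
    '<body><a href="/about">About</a></body></html>'
)
print(add_base_tag(page, "http://example.com/"))
```

With the base tag in place, a browser that opens the local copy resolves /about against http://example.com/ rather than against the file system.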

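Available Scrapy Objects

The shell provides the following Scrapy objects -

Sr.No.	Object & Description
1	crawler: The current Crawler object.
2	spider: The spider that handles the URL, or a new Spider object if no spider is found for the current URL.
3	request: The Request object of the last fetched page.
4	response: The Response object of the last fetched page.
5	settings: The current Scrapy settings.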

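Shell Session Example

Let us try scraping the scrapy.org site and then start scraping data from reddit.com, as described below.

Before moving ahead, first launch the shell as shown in the following command -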

scrapy shell 'http://scrapy.org' --nolog

When started with the above URL, Scrapy displays the available objects -

[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
[s]   item       {}
[s]   request    <GET http://scrapy.org >
[s]   response   <200 http://scrapy.org >
[s]   settings   <scrapy.settings.Settings object at 0x2bfd650>
[s]   spider     <Spider 'default' at 0x20c6f50>
[s] Useful shortcuts:
[s]   shelp()           Provides available objects and shortcuts with help option
[s]   fetch(req_or_url) Collects the response from the request or URL and associated objects will get update
[s]   view(response)    View the response for the given request

Next, start working with the objects, as shown below -

>> response.xpath('//title/text()').extract_first() 
u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'  
>> fetch("http://reddit.com") 
[s] Available Scrapy objects: 
[s]   crawler     
[s]   item       {} 
[s]   request     
[s]   response   <200 https://www.reddit.com/> 
[s]   settings    
[s]   spider      
[s] Useful shortcuts: 
[s]   shelp()           Shell help (print this help) 
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects 
[s]   view(response)    View response in a browser  
>> response.xpath('//title/text()').extract() 
[u'reddit: the front page of the internet']  
>> request = request.replace(method="POST")  
>> fetch(request) 
[s] Available Scrapy objects: 
[s]   crawler     
...
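
The session above uses both extract_first() and extract(). The difference is that the first returns a single value (or nothing when there is no match), while the second always returns a list. The sketch below mimics that behaviour with the standard library's ElementTree so it can run without a network connection. It is an analogy, not Scrapy's API: the helper functions and the sample page are hypothetical, and ElementTree supports only a subset of XPath (for example, it has no text() step).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not a real site).
PAGE = (
    "<html><head><title>Example page</title></head>"
    "<body><div class='val'>a</div><div class='val'>b</div></body></html>"
)
root = ET.fromstring(PAGE)

def extract(path):
    """Return the text of every matching element; always a list."""
    return [el.text for el in root.findall(path)]

def extract_first(path, default=None):
    """Return the text of the first match, or `default` if nothing matches."""
    matches = extract(path)
    return matches[0] if matches else default

print(extract_first(".//title"))                  # Example page
print(extract(".//div[@class='val']"))            # ['a', 'b']
print(extract(".//div[@class='missing']"))        # []
print(extract_first(".//div[@class='missing']"))  # None
```

An empty list from extract() is the usual sign that an expression matched nothing, which is why testing expressions in the shell before putting them into a spider is useful.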

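Invoking the Shell from Spiders to Inspect Responses

You can inspect a response from inside a spider at the exact point where it is processed. This helps when you want to check that the response reaching a callback is the one you expect.

For instance -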

import scrapy 

class SpiderDemo(scrapy.Spider): 
   name = "spiderdemo" 
   start_urls = [ 
      "http://mysite.com", 
      "http://mysite1.org", 
      "http://mysite2.net", 
   ]  
   
   def parse(self, response): 
      # You can inspect one specific response 
      if ".net" in response.url: 
         from scrapy.shell import inspect_response 
         inspect_response(response, self)

As shown in the above code, you can invoke the shell from a spider to inspect a response using the following function -

scrapy.shell.inspect_response
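
The condition ".net" in response.url in the spider above is a substring test, so it would also match a URL whose path merely contains ".net". If that matters, one possible variant checks the host name only. This is a sketch with a hypothetical helper named should_inspect; it relies only on urllib.parse from the standard library.

```python
from urllib.parse import urlparse

def should_inspect(url, suffix=".net"):
    """Return True when the URL's host name ends with `suffix`."""
    host = urlparse(url).hostname or ""
    return host.endswith(suffix)

urls = [
    "http://mysite.com",
    "http://mysite1.org",
    "http://mysite2.net",
    "http://mysite.com/asp.net-guide",  # hypothetical path containing ".net"
]
for url in urls:
    # Only http://mysite2.net prints True.
    print(url, should_inspect(url))
```

Inside the spider, the check would become if should_inspect(response.url): before the call to inspect_response.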

Now run the spider, and you will see the following output -

2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200)  (referer: None) 
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200)  (referer: None) 
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200)  (referer: None) 
[s] Available Scrapy objects: 
[s]   crawler     
...  
>> response.url 
'http://mysite2.net'

You can check whether your extraction code works using the following expression -

>> response.xpath('//div[@class = "val"]')

It displays the output as

[]

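The line above returned an empty list, which means the expression matched nothing in the response. You can now open the response in your browser to inspect it, as follows -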

>> view(response)

The response opens in your browser, and the shell prints

True
