Scrapy: A Fast High-Level Screen Scraping and Web Crawling Framework

One more tool for a future PoC (Proof of Concept) and blog post.

People may well confuse Scapy, the powerful interactive packet manipulator, with Scrapy, the web crawler and scraper, because the two names differ by a single letter. This post is about Scrapy.

Scrapy is an application framework for crawling web sites and extracting structured data. It is useful for data mining, information processing, and historical archiving. It provides many more features than most web crawlers out there, which is why it earns a place on PenTestIT. These are its current features:

  • Built-in support for selecting and extracting data from HTML and XML sources
  • Built-in support for cleaning and sanitizing the scraped data using a collection of reusable filters (called Item Loaders) shared between all the spiders.
  • Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)
  • A media pipeline for automatically downloading images (or any other media) associated with the scraped items
  • Support for extending Scrapy by plugging your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).
  • Wide range of built-in middlewares and extensions for:
    • cookies and session handling
    • HTTP compression
    • HTTP authentication
    • HTTP cache
    • user-agent spoofing
    • robots.txt
    • crawl depth restriction
    • and more
  • Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
  • Extensible stats collection for multiple spider metrics, useful for monitoring the performance of your spiders and detecting when they break
  • An interactive shell console for trying XPaths, very useful for writing and debugging your spiders
  • A system service designed to ease deploying and running your spiders in production.
  • A built-in Web service for monitoring and controlling your bot
  • A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler
  • Logging facility that you can hook on to for catching errors during the scraping process.
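To make the first few features concrete, here is a minimal sketch of the select, clean, and export flow. It deliberately does not use Scrapy's own API: it uses only the Python standard library (ElementTree's limited XPath subset, plus json and csv) as a stand-in, and the sample page, the field names, and the helper functions extract_items, to_json and to_csv are all hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page. Real-world HTML is rarely this
# clean, which is where a tolerant parser such as lxml comes in.
PAGE = """
<html>
  <body>
    <div class="product"><h2> Widget </h2><span class="price">9.99</span></div>
    <div class="product"><h2>Gadget</h2><span class="price"> 19.50 </span></div>
  </body>
</html>
"""


def extract_items(markup):
    """Select each product node with an XPath expression; build one dict per item."""
    root = ET.fromstring(markup.strip())
    items = []
    for node in root.findall(".//div[@class='product']"):
        items.append({
            # strip() and float() act as tiny stand-ins for cleaning filters.
            "name": node.findtext("h2").strip(),
            "price": float(node.findtext("span[@class='price']").strip()),
        })
    return items


def to_json(items):
    """Serialize the scraped items as a JSON feed."""
    return json.dumps(items, sort_keys=True)


def to_csv(items):
    """Serialize the scraped items as a CSV feed with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


items = extract_items(PAGE)
print(to_json(items))
print(to_csv(items))
```

In a real Scrapy project the selection would live in a spider, the cleaning in Item Loaders, and the export in the feed exporters; the sketch only shows how the three steps fit together.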

In addition to the above features, Scrapy runs on Linux, Windows, Mac and BSD operating systems. It is written in Python with ease of use and simplicity in mind. In fact, according to the author, Scrapy is used in production crawlers that completely scrape more than 500 retailer sites daily, all on one server! To use it on your system, you need to satisfy the following dependencies:

  • Python 2.5, 2.6, 2.7 (3.x is not yet supported)
  • Twisted 2.5.0, 8.0 or above (Windows users: you’ll need to install Zope.Interface and maybe pywin32 because of a Twisted bug)
  • lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
  • simplejson (not required if using Python 2.6 or above)
  • pyopenssl (for HTTPS support. Optional, but highly recommended)

You can easily configure Scrapy via the scrapy.cfg configuration file.
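As a rough illustration, the sketch below reads such a file with Python's standard configparser module. It assumes that scrapy.cfg uses an ini-style layout with a [settings] section whose default key names the project's settings module; that layout, the project name myproject, and the helper settings_module are assumptions made for the example, so check the file generated for your own project.

```python
import configparser

# Hypothetical contents of a scrapy.cfg file (assumed ini-style layout).
SAMPLE_CFG = """\
[settings]
default = myproject.settings
"""


def settings_module(cfg_text):
    """Return the dotted path of the settings module named in the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.get("settings", "default")


print(settings_module(SAMPLE_CFG))
```

To read a file from disk instead of a string, parser.read("scrapy.cfg") could replace the read_string call.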

Download Scrapy 0.12 (tip.zip/tip.tar.gz).
http://scrapy.org/download/
