Frontera (web crawling)
Original author(s) | Alexander Sibiryakov, Javier Casas
---|---
Developer(s) | Scrapinghub Ltd., GitHub community
Initial release | November 1, 2014
Stable release | v0.7.0 / February 9, 2017
Development status | Active
Written in | Python
Operating system | OS X, Linux
Type | Web crawling
License | BSD 3-clause license
Website | github
Frontera is an open-source web crawling framework that implements the crawl frontier component and provides scalability primitives for web crawler applications.
Overview
The content and structure of the World Wide Web change rapidly, and Frontera was designed to adapt to these changes quickly. Most large-scale web crawlers operate in batch mode, running sequential phases (injection, fetching, parsing, deduplication, scheduling, repeat). Because the phases are executed synchronously, one after another, such a crawler needs significant time to react to changes on the web; this design is mostly motivated by the poor performance of hard disks under random access. Frontera instead relies on modern key-value storage systems, efficient data structures and powerful hardware, and by design operates in an online manner, performing crawling, parsing and scheduling of new links simultaneously. It has been developed as an open source project from the beginning, aims to fit various use cases, and is highly flexible and configurable.
Large-scale web crawls are not Frontera's only purpose. Its flexibility allows crawls of moderate size to be run on a single machine with a few cores by using the single-process or distributed-spiders run modes.
Features
Frontera is written mainly in Python. Data transport and formats are well abstracted, and out-of-the-box implementations include support for MessagePack, JSON, Kafka and ZeroMQ.
- Online operation: small request batches, with parsing done right after fetching.
- Pluggable backend architecture: low-level storage logic is separated from crawling policy.
- Three run modes: single process, distributed spiders, distributed backend and spiders.
- Transparent data flow, allowing custom components to be integrated easily.
- Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box); see the configuration sketch after this list.
- SQLAlchemy and HBase storage backends.
- Revisiting logic (only with the RDBMS backend).
- Optional use of Scrapy for fetching and parsing.
- BSD 3-clause license, allowing use in any commercial product.
- Python 3 support.
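The backend and message bus choices above are made through an ordinary Python settings module. The sketch below illustrates this; the setting names and class paths (BACKEND, MESSAGE_BUS, the SQLAlchemy and ZeroMQ paths, MAX_NEXT_REQUESTS) follow the 0.x documentation and should be verified against the installed version.

```python
# frontera_settings.py -- a minimal configuration sketch. Setting names and
# class paths are assumptions based on the 0.x documentation.

# Storage backend holding the queue, metadata and link states
# (here: the SQLAlchemy backend over a local SQLite file).
BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontier.db'

# Message bus used in the distributed run modes (ZeroMQ ships out of the box;
# a Kafka message bus is also available).
MESSAGE_BUS = 'frontera.contrib.messagebus.zeromq.MessageBus'

# Upper bound on the size of the request batches handed to the fetcher.
MAX_NEXT_REQUESTS = 256
```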
Comparison to other web crawlers
Although Frontera is not a web crawler itself, it imposes significant requirements on the architecture of the web crawler it is used with. In contrast to Nutch's batch architecture, Frontera's architecture is online.[1]
StormCrawler is built on top of Apache Storm and uses some components from the Apache Nutch ecosystem. Scrapy Cluster was designed by IST Research with precise monitoring and management of the queue in mind. These systems use an online architecture and are similar in that they provide fetching and/or queueing mechanisms, but no link database or content processing.
Architecture
Single process [2]
Fetcher
The Fetcher is responsible for fetching web pages from websites and feeding them to the frontier, which manages what pages should be crawled next. The Fetcher can be implemented using Scrapy or any other crawling framework or system, since Frontera only provides the generic frontier functionality. In the distributed run modes the direct connection to the Fetcher is replaced with a message bus, with a producer on the Frontera manager side and a consumer on the Fetcher side.
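When Scrapy plays the Fetcher role, the wiring is typically done in Scrapy's own settings. The fragment below is a hedged sketch, assuming the contrib class paths and the FRONTERA_SETTINGS option described in the 0.x documentation.

```python
# Fragment of a Scrapy settings.py wiring Scrapy in as the Frontera fetcher.
# The contrib class paths and the FRONTERA_SETTINGS option are assumptions
# based on the 0.x documentation.

SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}

# Replace Scrapy's scheduler with the Frontera one.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

# Point the scheduler at the Frontera settings module (hypothetical name).
FRONTERA_SETTINGS = 'myproject.frontera_settings'
```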
Frontera API / Manager
The main entry point to the Frontera API is the FrontierManager object. Frontier users, in this case the Fetcher, communicate with the frontier through it.
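A minimal sketch of obtaining and starting a manager is shown below, assuming the FrontierManager.from_settings constructor and Settings class documented for the 0.x releases; the settings module name is hypothetical.

```python
# A sketch of creating the frontier entry point. The class and method names
# (FrontierManager.from_settings, start, stop) follow the 0.x documentation
# and should be treated as assumptions.
from frontera.core.manager import FrontierManager
from frontera.settings import Settings

settings = Settings(module='myproject.frontera_settings')  # hypothetical module
manager = FrontierManager.from_settings(settings)

manager.start()   # initializes the middlewares and the backend
# ... the fetch loop described under "Data Flow" runs here ...
manager.stop()
```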
Middlewares
Frontier middlewares are specific hooks that sit between the Manager and the Backend. These middlewares process Request and Response objects as they pass between the Frontier and the Backend, and provide a convenient mechanism for extending functionality by plugging in custom code. The canonical URL solver is a specific kind of middleware, responsible for substituting non-canonical document URLs with canonical ones.
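As an illustration, a custom middleware might look like the sketch below; the base class path and hook names follow the 0.x documentation, vary slightly between releases, and should be treated as assumptions.

```python
# A sketch of a custom middleware that tags requests with the host they belong
# to. The base class path and hook names are assumptions based on the 0.x docs.
from urllib.parse import urlparse

from frontera.core.components import Middleware


class HostTaggingMiddleware(Middleware):
    component_name = 'Host Tagging Middleware'

    @classmethod
    def from_manager(cls, manager):
        return cls()

    def frontier_start(self):
        pass

    def frontier_stop(self):
        pass

    def add_seeds(self, seeds):
        for seed in seeds:
            seed.meta[b'host'] = urlparse(seed.url).netloc
        return seeds

    def page_crawled(self, response):
        return response

    def links_extracted(self, request, links):
        for link in links:
            link.meta[b'host'] = urlparse(link.url).netloc
        return links

    def request_error(self, page, error):
        return page
```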
Backend
The frontier Backend is where the crawling logic and policies lie. It is responsible for receiving all the crawl information and selecting the next pages to be crawled. The Backend is meant to operate at a higher level, while the Queue, Metadata and States objects hold the low-level storage communication code.
Depending on the logic implemented, the Backend may require persistent storage to manage Request and Response information.
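A deliberately simplified, in-memory FIFO backend is sketched below to illustrate this division of responsibilities. The base class path and hook names are assumptions based on the 0.x documentation; a production backend would delegate storage to Queue, Metadata and States objects rather than to plain Python containers.

```python
# A simplified in-memory FIFO backend, illustrating the role of the component.
# Names follow the 0.x docs and are assumptions.
from collections import deque

from frontera.core.components import Backend


class SimpleFIFOBackend(Backend):
    component_name = 'Simple FIFO Backend'

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    @classmethod
    def from_manager(cls, manager):
        return cls()

    # Placeholders for the low-level storage objects mentioned above.
    @property
    def queue(self):
        return None

    @property
    def metadata(self):
        return None

    @property
    def states(self):
        return None

    def frontier_start(self):
        pass

    def frontier_stop(self):
        pass

    def add_seeds(self, seeds):
        for seed in seeds:
            self._schedule(seed)

    def page_crawled(self, response):
        pass  # a real backend would update per-page state here

    def links_extracted(self, request, links):
        for link in links:
            self._schedule(link)

    def request_error(self, page, error):
        pass

    def finished(self):
        return len(self._pending) == 0

    def get_next_requests(self, max_n_requests, **kwargs):
        batch = []
        while self._pending and len(batch) < max_n_requests:
            batch.append(self._pending.popleft())
        return batch

    def _schedule(self, request):
        # Crude deduplication by URL: schedule each URL at most once.
        if request.url not in self._seen:
            self._seen.add(request.url)
            self._pending.append(request)
```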
Data Flow
The data flow in Frontera is controlled by the Frontier Manager; all data passes through the manager-middlewares-backend scheme as follows:
1. The frontier is initialized with a list of seed requests (seed URLs) as the entry point for the crawl.
2. The fetcher asks for a list of requests to crawl.
3. Each URL is fetched and the frontier is notified of the crawl result as well as of the data extracted from the page. If anything goes wrong during the crawl, the frontier is also informed of it.
4. Once all URLs have been crawled, steps 2-3 are repeated until the frontier's end condition is reached. Each repetition of steps 2-3 is called a frontier iteration.
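Expressed as code, the iteration above might look roughly as follows; the manager method names follow the 0.x documentation, while download_page and extract_links are hypothetical stand-ins for the fetcher's own HTTP and parsing code.

```python
# A sketch of the single-process iteration loop described above. The manager
# method names are assumptions based on the 0.x docs; download_page is assumed
# to return a frontera Response object.
from frontera.core.models import Request


def crawl(manager, seed_urls, download_page, extract_links):
    # Step 1: initialize the frontier with the seed requests.
    manager.add_seeds([Request(url) for url in seed_urls])

    while True:
        # Step 2: ask the frontier which requests to crawl next.
        requests = manager.get_next_requests()
        if not requests:
            break  # frontier end condition reached

        # Step 3: fetch each request and report the outcome back.
        for request in requests:
            try:
                response = download_page(request)  # hypothetical HTTP call
            except Exception as error:
                manager.request_error(request, error)
                continue
            manager.page_crawled(response)
            manager.links_extracted(request, extract_links(response))
```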
Distributed [3]
The same Frontera Manager pipeline is used in all Frontera processes when running in distributed mode.
The overall system forms a closed circle, with all components working as daemons in infinite cycles. A message bus is responsible for transmitting messages between the components, the persistent storage and the fetchers (when combined with extraction, these processes are called spiders). Both the transport and the storage layer are abstracted, so one can plug in a custom transport. The distributed backend run mode has instances of three types:
- Spiders or fetchers, implemented using Scrapy. Responsible for resolving DNS queries, getting content from the Internet and extracting links (or other data) from the content.
- Strategy workers. Run the crawling strategy code: scoring links, deciding whether a link needs to be scheduled, and deciding when to stop crawling.
- DB workers. Store all the metadata, including scores and content, and generate new batches for the spiders to download.
Such a design makes online operation possible: the crawling strategy can be changed without stopping the crawl. The crawling strategy can also be implemented as a separate module, containing the logic for checking the crawl stopping condition, URL ordering, and the scoring model.
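Such a strategy module might be sketched as follows; the base class path, hook names and the schedule() helper mirror the 0.x documentation, changed between releases, and are assumptions to verify.

```python
# A sketch of a breadth-first crawling strategy capped at a fixed depth. The
# base class path, hook names and schedule() helper are assumptions based on
# the 0.x documentation.
from frontera.worker.strategies import BaseCrawlingStrategy


class ShallowBFSStrategy(BaseCrawlingStrategy):
    MAX_DEPTH = 3  # crawl stopping condition for any given branch

    def add_seeds(self, seeds):
        for seed in seeds:
            seed.meta[b'depth'] = 0
            self.schedule(seed, score=1.0)

    def page_crawled(self, response):
        pass  # per-page state updates would go here

    def links_extracted(self, request, links):
        depth = request.meta.get(b'depth', 0) + 1
        if depth > self.MAX_DEPTH:
            return  # too deep: do not schedule these links
        for link in links:
            link.meta[b'depth'] = depth
            # URL ordering via the score: shallower pages are fetched first.
            self.schedule(link, score=1.0 / depth)

    def page_error(self, request, error):
        pass

    def finished(self):
        return False  # run until stopped externally
```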
Frontera is polite to web hosts by design: each host is downloaded by no more than one spider process. This is achieved through stream partitioning.
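The partitioning idea itself is simple: every URL of a given host is deterministically mapped to one partition, and therefore to one spider. Below is a minimal, hypothetical sketch of that idea, not Frontera's actual partitioner, which lives in its message bus layer.

```python
# A minimal, hypothetical sketch of host-based stream partitioning: every URL
# of a given host maps to the same partition and therefore to the same spider.
from hashlib import sha1
from urllib.parse import urlparse


def partition_for(url: str, num_partitions: int) -> int:
    host = urlparse(url).netloc.lower()
    digest = sha1(host.encode('utf-8')).digest()
    return int.from_bytes(digest[:4], 'big') % num_partitions


# All URLs of example.com end up in the same spider feed partition.
assert partition_for('http://example.com/a', 4) == partition_for('http://example.com/b', 4)
```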
Data flow
The seed URLs defined by the user in the spiders are propagated to the strategy workers and DB workers by means of the spider log stream. The strategy workers decide which pages to crawl using the state cache, assign a score to each page and send the results to the scoring log stream.
The DB workers store all kinds of metadata, including content and scores. They also check the spiders' consumer offsets, generate new batches when needed and send them to the spider feed stream. The spiders consume these batches, downloading each page and extracting links from it. The links are then sent to the spider log stream, where they are stored and scored. In this way the flow repeats indefinitely.
Battle testing
At Scrapinghub Ltd. there is a crawler processing 1600 requests per second at peak, built primarily with Frontera, using Kafka as a message bus and HBase as storage for link states and the link database. The crawler operates in cycles; each cycle takes 1.5 months and results in 1.7 billion downloaded pages.[4]
A crawl of the Spanish internet resulted in 46.5 million pages in 1.5 months on an AWS cluster with 2 spider machines.[5]
Used by
Frontera is used by several companies.
History
The first version of Frontera operated in a single process, as part of a custom scheduler for Scrapy, using an on-disk SQLite database to store link states and the queue. It was able to crawl for days, but after reaching a noticeable volume of links it started to spend more and more time on SELECT queries, making the crawl inefficient. At that time Frontera was developed under DARPA's Memex program and was included in its catalog of open source projects.[6]
In 2015, subsequent versions of Frontera used HBase for storing the link database and the queue. The application was split into two parts: backend and fetcher. The backend was responsible for communicating with HBase by means of Kafka, while the fetcher only read a Kafka topic with URLs to crawl and produced crawl results to another topic consumed by the backend, thus creating a closed cycle. The first priority queue prototype suitable for web-scale crawling was implemented during that time; the queue produced batches with limits on the number of hosts and requests per host.
The next significant milestone in Frontera's development was the introduction of the crawling strategy and the strategy worker, along with the message bus abstraction. It became possible to write a custom crawling strategy without dealing with the low-level backend code operating on the queue. An easy way to express which links should be scheduled, when and with what priority made Frontera a true crawl frontier framework. Kafka was quite a heavy requirement for small crawlers, and the message bus abstraction allowed almost any messaging system to be integrated with Frontera.
See also
- Frontera documentation at ReadTheDocs.
References
1. Sibiryakov, Alexander (22 June 2015). "What is better - Scrapy or Apache Nutch?". Quora.
2. Dowinton, Richard (15 April 2015). "Frontera: the brain behind the crawls". Scrapinghub blog.
3. Sibiryakov, Alexander (8 August 2015). "Distributed Frontera: web crawling at large scale". Scrapinghub blog.
4. Sibiryakov, Alexander (29 March 2017). "Frontera: архитектура фреймворка для обхода веба и текущие проблемы" [Frontera: the architecture of a web crawling framework and current problems]. Habrahabr.
5. Sibiryakov, Alexander (15 October 2015). "Frontera: open source, large scale web crawling framework". Speakerdeck.
6. "Open Catalog, Memex (Domain-Specific Search)".