Frontera (web crawling)
Original author(s) | Alexander Sibiryakov, Javier Casas
---|---
Developer(s) | Scrapinghub Ltd., GitHub community
Initial release | November 1, 2014
Stable release | v0.7.0 / February 9, 2017
Development status | Active
Written in | Python
Operating system | OS X, Linux
Type | Web crawling
License | BSD 3-clause license
Website | github
Frontera is an open-source web crawling framework that implements the crawl frontier component and provides scalability primitives for web crawler applications.
Overview
The content and structure of the World Wide Web change rapidly, and Frontera was designed to adapt to these changes quickly. Most large-scale web crawlers operate in batch mode, running sequential phases (injection, fetching, parsing, deduplication, scheduling, repeat). Because the phases are executed synchronously, one after another, such a crawler needs significant time to react to changes on the web; this design is mostly motivated by the poor performance of hard disks under random access. Frontera instead relies on modern key-value storage systems, efficient data structures and powerful hardware, and by design operates in an online manner, performing crawling, parsing and scheduling of new links simultaneously. It has been developed as an open source project from the beginning, aims to fit various use cases, and is highly flexible and configurable.
Large-scale web crawls are not Frontera's only purpose. Its flexibility allows crawls of moderate size to be run on a single machine with a few cores by using the single-process or distributed-spiders run modes.
Features
Frontera is written mainly in Python. Data transport and formats are well abstracted, and out-of-the-box implementations include support for MessagePack, JSON, Kafka and ZeroMQ.
- Online operation: small request batches, with parsing done right after fetching.
- Pluggable backend architecture: low-level storage logic is separated from crawling policy.
- Three run modes: single process, distributed spiders, distributed backend and spiders.
- Transparent data flow, allowing custom components to be integrated easily.
- Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box); see the configuration sketch after this list.
- SQLAlchemy and HBase storage backends.
- Revisiting logic (only with the RDBMS backend).
- Optional use of Scrapy for fetching and parsing.
- BSD 3-clause license, allowing use in any commercial product.
- Python 3 support.
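The backend and message bus choices above are made through an ordinary Python settings module. The sketch below illustrates this; the setting names and class paths (BACKEND, MESSAGE_BUS, the SQLAlchemy and ZeroMQ paths, MAX_NEXT_REQUESTS) follow the 0.x documentation and should be verified against the installed version.

```python
# frontera_settings.py -- a minimal configuration sketch. Setting names and
# class paths are assumptions based on the 0.x documentation.

# Storage backend holding the queue, metadata and link states
# (here: the SQLAlchemy backend over a local SQLite file).
BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontier.db'

# Message bus used in the distributed run modes (ZeroMQ ships out of the box;
# a Kafka message bus is also available).
MESSAGE_BUS = 'frontera.contrib.messagebus.zeromq.MessageBus'

# Upper bound on the size of the request batches handed to the fetcher.
MAX_NEXT_REQUESTS = 256
```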
Comparison to other web crawlers
Although Frontera is not a web crawler itself, it imposes significant requirements on the architecture of the web crawler it is used with. In contrast to Nutch's batch architecture, Frontera's architecture is online.[1]
StormCrawler is built on top of Apache Storm and uses some components from the Apache Nutch ecosystem. Scrapy Cluster was designed by IST Research with precise monitoring and management of the queue in mind. These systems use an online architecture and are similar in that they provide fetching and/or queueing mechanisms, but no link database or content processing.
Architecture
Single process [2]
Fetcher
The Fetcher is responsible for fetching web pages from websites and feeding them to the frontier, which manages what pages should be crawled next. The Fetcher can be implemented using Scrapy or any other crawling framework or system, since Frontera only provides the generic frontier functionality. In the distributed run modes the direct connection to the Fetcher is replaced with a message bus, with a producer on the Frontera manager side and a consumer on the Fetcher side.
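When Scrapy plays the Fetcher role, the wiring is typically done in Scrapy's own settings. The fragment below is a hedged sketch, assuming the contrib class paths and the FRONTERA_SETTINGS option described in the 0.x documentation.

```python
# Fragment of a Scrapy settings.py wiring Scrapy in as the Frontera fetcher.
# The contrib class paths and the FRONTERA_SETTINGS option are assumptions
# based on the 0.x documentation.

SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}

# Replace Scrapy's scheduler with the Frontera one.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

# Point the scheduler at the Frontera settings module (hypothetical name).
FRONTERA_SETTINGS = 'myproject.frontera_settings'
```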
Frontera API / Manager
The main entry point to the Frontera API is the FrontierManager object. Frontier users, in this case the Fetcher, communicate with the frontier through it.
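A minimal sketch of obtaining and starting a manager is shown below, assuming the FrontierManager.from_settings constructor and Settings class documented for the 0.x releases; the settings module name is hypothetical.

```python
# A sketch of creating the frontier entry point. The class and method names
# (FrontierManager.from_settings, start, stop) follow the 0.x documentation
# and should be treated as assumptions.
from frontera.core.manager import FrontierManager
from frontera.settings import Settings

settings = Settings(module='myproject.frontera_settings')  # hypothetical module
manager = FrontierManager.from_settings(settings)

manager.start()   # initializes the middlewares and the backend
# ... the fetch loop described under "Data Flow" runs here ...
manager.stop()
```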
Middlewares
Frontier middlewares are specific hooks that sit between the Manager and the Backend. These middlewares process Request and Response objects as they pass between the Frontier and the Backend, and provide a convenient mechanism for extending functionality by plugging in custom code. The canonical URL solver is a specific kind of middleware, responsible for substituting non-canonical document URLs with canonical ones.
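As an illustration, a custom middleware might look like the sketch below; the base class path and hook names follow the 0.x documentation, vary slightly between releases, and should be treated as assumptions.

```python
# A sketch of a custom middleware that tags requests with the host they belong
# to. The base class path and hook names are assumptions based on the 0.x docs.
from urllib.parse import urlparse

from frontera.core.components import Middleware


class HostTaggingMiddleware(Middleware):
    component_name = 'Host Tagging Middleware'

    @classmethod
    def from_manager(cls, manager):
        return cls()

    def frontier_start(self):
        pass

    def frontier_stop(self):
        pass

    def add_seeds(self, seeds):
        for seed in seeds:
            seed.meta[b'host'] = urlparse(seed.url).netloc
        return seeds

    def page_crawled(self, response):
        return response

    def links_extracted(self, request, links):
        for link in links:
            link.meta[b'host'] = urlparse(link.url).netloc
        return links

    def request_error(self, page, error):
        return page
```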
Backend
The frontier Backend is where the crawling logic and policies lie. It is responsible for receiving all the crawl information and selecting the next pages to be crawled. The Backend is meant to operate at a higher level, while the Queue, Metadata and States objects hold the low-level storage communication code.
Depending on the logic implemented, the Backend may require persistent storage to manage Request and Response information.
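A deliberately simplified, in-memory FIFO backend is sketched below to illustrate this division of responsibilities. The base class path and hook names are assumptions based on the 0.x documentation; a production backend would delegate storage to Queue, Metadata and States objects rather than to plain Python containers.

```python
# A simplified in-memory FIFO backend, illustrating the role of the component.
# Names follow the 0.x docs and are assumptions.
from collections import deque

from frontera.core.components import Backend


class SimpleFIFOBackend(Backend):
    component_name = 'Simple FIFO Backend'

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    @classmethod
    def from_manager(cls, manager):
        return cls()

    # Placeholders for the low-level storage objects mentioned above.
    @property
    def queue(self):
        return None

    @property
    def metadata(self):
        return None

    @property
    def states(self):
        return None

    def frontier_start(self):
        pass

    def frontier_stop(self):
        pass

    def add_seeds(self, seeds):
        for seed in seeds:
            self._schedule(seed)

    def page_crawled(self, response):
        pass  # a real backend would update per-page state here

    def links_extracted(self, request, links):
        for link in links:
            self._schedule(link)

    def request_error(self, page, error):
        pass

    def finished(self):
        return len(self._pending) == 0

    def get_next_requests(self, max_n_requests, **kwargs):
        batch = []
        while self._pending and len(batch) < max_n_requests:
            batch.append(self._pending.popleft())
        return batch

    def _schedule(self, request):
        # Crude deduplication by URL: schedule each URL at most once.
        if request.url not in self._seen:
            self._seen.add(request.url)
            self._pending.append(request)
```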
Data Flow
The data flow in Frontera is controlled by the Frontier Manager; all data passes through the manager-middlewares-backend scheme as follows:
1. The frontier is initialized with a list of seed requests (seed URLs) as the entry point for the crawl.
2. The fetcher asks for a list of requests to crawl.
3. Each URL is fetched and the frontier is notified of the crawl result as well as of the data extracted from the page. If anything goes wrong during the crawl, the frontier is also informed of it.
4. Once all URLs have been crawled, steps 2-3 are repeated until the frontier's end condition is reached. Each repetition of steps 2-3 is called a frontier iteration.
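Expressed as code, the iteration above might look roughly as follows; the manager method names follow the 0.x documentation, while download_page and extract_links are hypothetical stand-ins for the fetcher's own HTTP and parsing code.

```python
# A sketch of the single-process iteration loop described above. The manager
# method names are assumptions based on the 0.x docs; download_page is assumed
# to return a frontera Response object.
from frontera.core.models import Request


def crawl(manager, seed_urls, download_page, extract_links):
    # Step 1: initialize the frontier with the seed requests.
    manager.add_seeds([Request(url) for url in seed_urls])

    while True:
        # Step 2: ask the frontier which requests to crawl next.
        requests = manager.get_next_requests()
        if not requests:
            break  # frontier end condition reached

        # Step 3: fetch each request and report the outcome back.
        for request in requests:
            try:
                response = download_page(request)  # hypothetical HTTP call
            except Exception as error:
                manager.request_error(request, error)
                continue
            manager.page_crawled(response)
            manager.links_extracted(request, extract_links(response))
```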
Distributed [3]
The same Frontera Manager pipeline is used in all Frontera processes when running in distributed mode.
The overall system forms a closed circle, with all components working as daemons in infinite cycles. A message bus is responsible for transmitting messages between the components, the persistent storage and the fetchers (when combined with extraction, these processes are called spiders). Both the transport and the storage layer are abstracted, so one can plug in a custom transport. The distributed backend run mode has instances of three types:
- Spiders or fetchers, implemented using Scrapy. Responsible for resolving DNS queries, getting content from the Internet and extracting links (or other data) from the content.
- Strategy workers. Run the crawling strategy code: scoring links, deciding whether a link needs to be scheduled, and deciding when to stop crawling.
- DB workers. Store all the metadata, including scores and content, and generate new batches for the spiders to download.
Such a design makes online operation possible: the crawling strategy can be changed without stopping the crawl. The crawling strategy can also be implemented as a separate module, containing the logic for checking the crawl stopping condition, URL ordering, and the scoring model.
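Such a strategy module might be sketched as follows; the base class path, hook names and the schedule() helper mirror the 0.x documentation, changed between releases, and are assumptions to verify.

```python
# A sketch of a breadth-first crawling strategy capped at a fixed depth. The
# base class path, hook names and schedule() helper are assumptions based on
# the 0.x documentation.
from frontera.worker.strategies import BaseCrawlingStrategy


class ShallowBFSStrategy(BaseCrawlingStrategy):
    MAX_DEPTH = 3  # crawl stopping condition for any given branch

    def add_seeds(self, seeds):
        for seed in seeds:
            seed.meta[b'depth'] = 0
            self.schedule(seed, score=1.0)

    def page_crawled(self, response):
        pass  # per-page state updates would go here

    def links_extracted(self, request, links):
        depth = request.meta.get(b'depth', 0) + 1
        if depth > self.MAX_DEPTH:
            return  # too deep: do not schedule these links
        for link in links:
            link.meta[b'depth'] = depth
            # URL ordering via the score: shallower pages are fetched first.
            self.schedule(link, score=1.0 / depth)

    def page_error(self, request, error):
        pass

    def finished(self):
        return False  # run until stopped externally
```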
Frontera is polite to web hosts by design: each host is downloaded by no more than one spider process. This is achieved through stream partitioning.
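The partitioning idea itself is simple: every URL of a given host is deterministically mapped to one partition, and therefore to one spider. Below is a minimal, hypothetical sketch of that idea, not Frontera's actual partitioner, which lives in its message bus layer.

```python
# A minimal, hypothetical sketch of host-based stream partitioning: every URL
# of a given host maps to the same partition and therefore to the same spider.
from hashlib import sha1
from urllib.parse import urlparse


def partition_for(url: str, num_partitions: int) -> int:
    host = urlparse(url).netloc.lower()
    digest = sha1(host.encode('utf-8')).digest()
    return int.from_bytes(digest[:4], 'big') % num_partitions


# All URLs of example.com end up in the same spider feed partition.
assert partition_for('http://example.com/a', 4) == partition_for('http://example.com/b', 4)
```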
Data flow
The seed URLs defined by the user in the spiders are propagated to the strategy workers and DB workers by means of the spider log stream. The strategy workers decide which pages to crawl using the state cache, assign a score to each page and send the results to the scoring log stream.
The DB workers store all kinds of metadata, including content and scores. They also check the spiders' consumer offsets, generate new batches when needed and send them to the spider feed stream. The spiders consume these batches, downloading each page and extracting links from it. The links are then sent to the spider log stream, where they are stored and scored. In this way the flow repeats indefinitely.
Battle testing
At Scrapinghub Ltd. there is a crawler processing 1600 requests per second at peak, built primarily with Frontera, using Kafka as a message bus and HBase as storage for link states and the link database. The crawler operates in cycles; each cycle takes 1.5 months and results in 1.7 billion downloaded pages.[4]
A crawl of the Spanish internet resulted in 46.5 million pages in 1.5 months on an AWS cluster with 2 spider machines.[5]
Used by
Frontera is used by several companies.
History
The first version of Frontera operated in a single process, as part of a custom scheduler for Scrapy, using an on-disk SQLite database to store link states and the queue. It was able to crawl for days, but after reaching a noticeable volume of links it started to spend more and more time on SELECT queries, making the crawl inefficient. At that time Frontera was developed under DARPA's Memex program and was included in its catalog of open source projects.[6]
In 2015, subsequent versions of Frontera used HBase for storing the link database and the queue. The application was split into two parts: backend and fetcher. The backend was responsible for communicating with HBase by means of Kafka, while the fetcher only read a Kafka topic with URLs to crawl and produced crawl results to another topic consumed by the backend, thus creating a closed cycle. The first priority queue prototype suitable for web-scale crawling was implemented during that time; the queue produced batches with limits on the number of hosts and requests per host.
The next significant milestone in Frontera's development was the introduction of the crawling strategy and the strategy worker, along with the message bus abstraction. It became possible to write a custom crawling strategy without dealing with the low-level backend code operating on the queue. An easy way to express which links should be scheduled, when and with what priority made Frontera a true crawl frontier framework. Kafka was quite a heavy requirement for small crawlers, and the message bus abstraction allowed almost any messaging system to be integrated with Frontera.
See also
- Frontera documentation at ReadTheDocs.
References
1. Sibiryakov, Alexander (22 June 2015). "What is better - Scrapy or Apache Nutch?". Quora.
2. Dowinton, Richard (15 April 2015). "Frontera: the brain behind the crawls". Scrapinghub blog.
3. Sibiryakov, Alexander (8 August 2015). "Distributed Frontera: web crawling at large scale". Scrapinghub blog.
4. Sibiryakov, Alexander (29 March 2017). "Frontera: архитектура фреймворка для обхода веба и текущие проблемы" [Frontera: the architecture of a web crawling framework and current problems]. Habrahabr.
5. Sibiryakov, Alexander (15 October 2015). "Frontera: open source, large scale web crawling framework". Speakerdeck.
6. "Open Catalog, Memex (Domain-Specific Search)".