News server operation

From Wikipedia, the free encyclopedia

Among the operators and users of commercial Usenet news servers, common concerns are the continually increasing storage and network capacity requirements and their effects. Completion (the ability of a server to successfully receive all traffic), retention (the amount of time articles are made available to readers) and overall system performance are frequent topics of discussion. As demands increase, it is common for the transit and reader server roles to be subdivided further into numbering, storage and front end systems. These server farms are continually monitored by both insiders and outsiders, and consumers often use measurements of these characteristics when choosing a commercial news service.

Articles and posts

End users often use the term "posting" to refer to a single message or file posted to Usenet. For plain-text content, a posting is synonymous with an article. Binary content such as pictures and files often has to be split across multiple articles. The newsreader then automatically reassembles the multiple-article posting into a single unit, typically by using numbered Subject: headers. Most servers do not distinguish between single-part and multipart postings, and deal only with the individual component articles.
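
The reassembly step can be illustrated with a minimal sketch. It assumes a trailing part counter such as "(2/3)" in the Subject: header, which is a common but by no means universal convention; the names `PART_RE`, `group_parts` and `missing_parts` are hypothetical and not taken from any particular newsreader.

```python
import re
from collections import defaultdict

# Matches a trailing part counter such as "(3/12)" or "[03/12]".
# This subject convention is an assumption; real postings vary widely.
PART_RE = re.compile(
    r"^(?P<base>.*?)[\(\[](?P<part>\d+)/(?P<total>\d+)[\)\]]\s*$"
)


def group_parts(subjects):
    """Group subjects into postings keyed by (base subject, total parts)."""
    postings = defaultdict(dict)
    for subject in subjects:
        match = PART_RE.match(subject)
        if match is None:
            # No part counter: treat as a single-article posting.
            postings[(subject.strip(), 1)][1] = subject
            continue
        key = (match.group("base").strip(), int(match.group("total")))
        postings[key][int(match.group("part"))] = subject
    return postings


def missing_parts(postings):
    """Return {key: sorted missing part numbers} for incomplete postings."""
    result = {}
    for (base, total), parts in postings.items():
        missing = [n for n in range(1, total + 1) if n not in parts]
        if missing:
            result[(base, total)] = missing
    return result
```

A client built along these lines would only attempt to decode a posting once `missing_parts` reports nothing for it.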

Headers and overviews

Each news article contains a complete set of header lines, but in common use the term "headers" also refers to the News Overview database. The overview is a list of the most frequently used headers plus additional information such as article sizes, and client software typically retrieves it with the NNTP XOVER command. Overviews make reading a newsgroup faster for both the client and the server, because the articles can be presented in list form without opening each one individually.
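
As an illustration, an overview record is conventionally a single tab-separated line holding the article number, Subject, From, Date, Message-ID, References, byte count and line count, optionally followed by further headers. The sketch below parses one such line under that assumption; `Overview` and `parse_overview_line` are hypothetical names, and a real client would also need to handle server-specific extra fields.

```python
from typing import NamedTuple


class Overview(NamedTuple):
    number: int
    subject: str
    author: str
    date: str
    message_id: str
    references: str
    size_bytes: int
    line_count: int


def parse_overview_line(line: str) -> Overview:
    """Parse one tab-separated overview record (as returned by XOVER)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 8:
        raise ValueError("overview record has fewer than 8 fields")
    # Any fields beyond the first eight (e.g. Xref:) are ignored here.
    number, subject, author, date, message_id, references, size, lines = fields[:8]
    return Overview(
        int(number), subject, author, date, message_id, references,
        int(size or 0),   # some servers leave the counts empty
        int(lines or 0),
    )
```

Because every field needed for a list view is on one line, the client never has to fetch the article itself to display it.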

If non-overview headers are required, for example when using a kill file, it may still be necessary to use the slower method of reading the complete headers of every article. Many clients are unable to do this and limit filtering to what is available in the overview.

Spools

When the server stores the body of an article, it places it in a disk storage area generically called a "spool". There are several common ways in which the spool may be organized:

  • One file per article is the oldest storage scheme, still in common use on smaller servers and replicated in many clients. Its performance capability is a direct function of the underlying operating system's ability to create, remove and locate files within a directory, and often this scheme is insufficient to keep up with modern Usenet traffic. It does, however, allow for the greatest flexibility in managing the amount and location of storage used by the server. Nearly all current software using this scheme stores articles using the B News 2.10 layout.
  • Cyclical storage has been in increasingly common use since the 1990s. In this storage method, articles are appended serially to large indexed container files. When the end of the file is reached, new articles are written at the beginning of the file, overwriting the oldest entries. On some servers, this overwriting is not performed, but instead new container files are created as older ones are deleted. The major advantages of this system include predictable storage requirements if an overwriting scheme is employed, and some freedom from dependency on the underlying performance of the operating system. There is, however, less flexibility to retain articles by age rather than space used, and traditional text manipulation tools such as grep are less well suited to analyzing these files. Some degree of article longevity control can be exercised by directing subsets of the newsgroups to specific sets of container files.
  • In some cases, a relational database or similar is used to contain the spool. This is most commonly seen with Internet forum software that also offers an NNTP interface.
  • Some servers, such as INN, allow multiple storage schemes to be used at once. Various hybrid storage schemes have also been used in news servers, including different organizations of the file-per-article method, or smaller containers carrying perhaps 100 articles apiece.
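
The overwriting variant of cyclical storage can be sketched as a fixed-size buffer with a Message-ID index. This is a toy in-memory model under simplifying assumptions (one container, no on-disk format, no persistence), and `CyclicSpool` is a hypothetical name rather than the layout used by any real server.

```python
class CyclicSpool:
    """Toy in-memory cyclic spool: a fixed-size buffer plus a Message-ID index."""

    def __init__(self, capacity: int):
        self.buffer = bytearray(capacity)
        self.capacity = capacity
        self.write_pos = 0
        self.index = {}  # message_id -> (offset, length)

    def store(self, message_id: str, article: bytes) -> None:
        size = len(article)
        if size > self.capacity:
            raise ValueError("article larger than the container")
        if self.write_pos + size > self.capacity:
            # Not enough room at the end: wrap around to the beginning.
            self.write_pos = 0
        start, end = self.write_pos, self.write_pos + size
        # Drop index entries whose bytes are about to be overwritten.
        for mid, (off, length) in list(self.index.items()):
            if off < end and start < off + length:
                del self.index[mid]
        self.buffer[start:end] = article
        self.index[message_id] = (start, size)
        self.write_pos = end

    def fetch(self, message_id: str):
        entry = self.index.get(message_id)
        if entry is None:
            return None  # expired (overwritten) or never stored
        off, length = entry
        return bytes(self.buffer[off:off + length])
```

Storage use never exceeds `capacity`, which is the predictability noted above; the cost is that an article's lifetime depends on how fast new traffic arrives rather than on its age.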

Speed

Speed, for the purpose of this article, is how quickly a server can deliver an article to the user. The server that the user connects to is typically part of a server farm that has many servers dedicated to multiple tasks. How fast the data can move in this farm is the first thing that affects the speed of delivery.

Once the farm has delivered the data to the network, the provider has limited control over the speed to the user. Because the network path to each user is different, some users will have good routes and the data will flow quickly, while others will have overloaded routers between them and the provider, which cause delays. About all a provider can do in that case is try to move the traffic over a different route. If the ISP has limited connectivity to the network, routing changes may have little effect.

Frequently a user can reduce the impact of network problems by using multiple connections. Some servers allow as many as eight simultaneous connections, but this varies widely. Likewise, newsreaders are commonly limited to as few as two or four connections.

Article sizes

Article sizes are limited to what the servers will accept. For text users this is generally not a problem. For binary users it can be, since the maximum article size varies from site to site.

For a given volume of content, the larger the article size, the fewer articles there are on each server. Fewer articles reduce the per-article processing overhead, which makes for a more efficient server. However, the larger an article is, the fewer servers it will arrive on.
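
The relationship between article size and article count is simple arithmetic, sketched below. The figures in the usage note are made-up examples, the function name `article_count` is hypothetical, and encoding overhead is ignored.

```python
def article_count(posting_bytes: int, max_article_bytes: int) -> int:
    """Number of articles needed to carry a posting of the given size."""
    if max_article_bytes <= 0:
        raise ValueError("max_article_bytes must be positive")
    # Integer ceiling division: a partial final article still counts as one.
    return -(-posting_bytes // max_article_bytes)
```

Under these assumptions, a 100,000,000-byte posting needs 400 articles at 250,000 bytes each but only 200 at 500,000 bytes each, halving the number of articles every server along the path has to process.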

Servers

Users frequently refer to their service as a "server". In many cases this is far from accurate, since a service is usually a farm of many machines. While each service is different, the following are the types of server roles that a provider will typically have in each server farm it runs. Roles can be combined at a given site; for example, numbering and transit may be provided by the same system.

Transit server 
These are the servers that handle basic article exchange. They exchange traffic with remote servers, supply articles to the numbering servers, and transmit articles posted from the local front end servers.
Numbering server (stamper) 
This server inserts the RFC 1036 Xref: header into each article, so that the back and front end servers all present article lists in a uniform manner.
Back end server
These are the data storage systems for the front end servers. They usually have multiple RAID disk arrays to hold the data. The provider can increase reliability by using multiple back end servers with redundant data, redundant arrays attached to the same server, or both.
Front end server
These are the servers that a user actually connects to. It is not unheard of for a large commercial news service provider to have more than 50 front end servers. These systems usually store only overviews locally and retrieve article bodies from the back end servers. They typically carry the heaviest CPU load in the farm.

Large server farms typically also place load balancers between the front end servers and the network.
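
The numbering server role described above can be sketched as a counter per newsgroup that produces the Xref: header. This is a minimal illustration assuming the usual "Xref: host group:number ..." layout; `Numberer` and `stamp` are hypothetical names, and a real stamper would also persist its counters and check that each group is actually carried.

```python
from collections import defaultdict


class Numberer:
    """Assigns per-newsgroup article numbers and builds an Xref: header."""

    def __init__(self, hostname: str):
        self.hostname = hostname
        self.next_number = defaultdict(lambda: 1)  # group -> next number

    def stamp(self, newsgroups_header: str) -> str:
        """Return the Xref: header for an article's Newsgroups: value."""
        groups = [g.strip() for g in newsgroups_header.split(",") if g.strip()]
        locations = []
        for group in groups:
            number = self.next_number[group]
            self.next_number[group] = number + 1
            locations.append(f"{group}:{number}")
        return f"Xref: {self.hostname} " + " ".join(locations)
```

Because a single system hands out the numbers, every front end and back end server that receives the stamped article lists it under the same number in each group.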

Retention

Retention is simply defined as how long the server keeps articles. Most users want retention to be long enough so that they don't need to access the server every day. Conversely, overly long retention can overwhelm users with slow computers or network connections by making the article lists inordinately large.

Retention is generally quoted separately for text and binary articles, though it may also vary between different groups within these categories. The times vary greatly according to the amount of storage available on the servers and the continually increasing traffic, but as of 2005 it was common for specialist news providers to have text retention of over 100 days and binary retention of over a week.

It can be difficult for end users to accurately measure the retention of a server. One common method is to check the Date: headers of the oldest articles in a group, but this is not always accurate. Some articles in a group may be retained for longer than others, articles from remote servers do not always arrive promptly, and at times the Date: headers are simply incorrect. A sampling of many or all articles, preferably in more than one newsgroup, is required to detect such anomalies.
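
One possible way to make such a sample robust is to take a high percentile of the article ages instead of the single oldest Date: header, so that a few bogus dates do not dominate. The sketch below is a heuristic under that assumption, not an established measurement method; the function names and the 0.9 default are hypothetical choices.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def article_ages_days(date_headers, now):
    """Return the sorted ages in days of the articles with usable dates."""
    ages = []
    for value in date_headers:
        try:
            posted = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            continue  # unparseable Date: header; skip it
        if posted.tzinfo is None:
            # No usable zone information: assume UTC.
            posted = posted.replace(tzinfo=timezone.utc)
        age = (now - posted).total_seconds() / 86400
        if age >= 0:  # ignore dates that lie in the future
            ages.append(age)
    return sorted(ages)


def estimate_retention_days(date_headers, now, percentile=0.9):
    """Estimate retention as a high percentile of the sampled article ages."""
    ages = article_ages_days(date_headers, now)
    if not ages:
        return None
    index = min(len(ages) - 1, int(percentile * len(ages)))
    return ages[index]
```

Running the estimate separately for several newsgroups, and comparing the results, helps reveal groups that are retained for longer or shorter than the rest.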

Completion

Given the large number of articles transferred between servers and the large size of individual articles, their complete propagation to any one server farm is not guaranteed. The term "completion" is used to describe how well a service is keeping up with the traffic.

The primary obstacle to calculating a completion percentage is not knowing how many articles were posted. Looking at only one server, one cannot know how many articles were actually inserted throughout the network. Articles may never make their way outside the originating server, or may fail to find their way out to the transit cloud. Very large articles are frequently dropped, and tend to propagate less well than smaller ones.

One way to measure completion is to access multiple servers and retrieve lists of articles. Because Message-ID: headers are nominally unique throughout the network, comparison of the lists is mostly a straightforward task. Practical limitations to this type of measurement include the impossibility of obtaining lists from all servers worldwide, the fact that many servers filter out spam or employ Usenet Death Penalties, and that some servers mask incompletion by hiding multipart binary sets with missing articles. It is also necessary to take into account propagation times and retention; an article may simply have not yet arrived at a given server, or it may have been present but already expired.
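
The list comparison can be sketched with set operations on Message-IDs. The sketch assumes the lists have already been retrieved and restricted to the same newsgroup and time window; `completion_report` is a hypothetical name. Because the union of the sampled servers is only a lower bound on what was actually posted, the percentages are relative to the sample and not absolute.

```python
def completion_report(server_ids):
    """Compare Message-ID sets from several servers.

    server_ids maps a server name to the set of Message-IDs seen there.
    Returns {server name: (percent of the union present, missing IDs)}.
    """
    union = set().union(*server_ids.values())
    report = {}
    for name, ids in server_ids.items():
        percent = 100.0 * len(ids & union) / len(union) if union else 100.0
        report[name] = (percent, union - ids)
    return report
```

The missing IDs for each server are the starting point for the further checks mentioned above, such as whether an article was filtered, has not yet arrived, or has already expired.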