Leandro Navarro, Victor J. Sosa, Oscar Ardaiz
Summary
The success of the Web has produced an unmanageable volume of traffic that consumes any available network capacity. Hierarchies of proxy-cache servers can reduce the number of times a document is downloaded by saving a copy for reuse in future requests, at the cost of many cache validation requests. We present replication, a strategy complementary to caching, based on Web content distribution. When a document is published, it is distributed towards library servers (storage servers subscribed to certain topics) instead of waiting for requests. The resulting document distribution network minimises the number of times each document circulates through the network and moves that distribution to hours of low traffic, reducing the peak load on the network.
A Web proxy server inside an organisation serves requests by returning objects from the cache (someone has just downloaded a non-expired copy), from the library (we are subscribed to that topic), or finally from the original source. For example, when a department library subscribes to a publication in electronic format, the document distribution service provides it with a copy of each new issue shortly after (or before) publication, with no penalty for the first reader and no expensive traffic at the time of read requests.
This approach is embodied in the Object Distribution System (ODS), a distribution system inspired by the ultra-large-scale distribution models used in everyday life (e.g. food, books, newspapers). Beyond traditional mechanisms for bringing information closer to readers, such as caching and mirroring, this system enables the publication and classification of, and subscription to, collections of documents. There is also provision for classification authorities to offer classification schemes for labelling documents.
We present the scenario and the additional advantages over proxy caching in some situations and for some documents. The implementation status of the model is discussed at the end.
1 Introduction: The status of the Network
The Internet transports Web documents increasingly slowly and unpredictably. This is the result of a complex combination of phenomena, among them the explosive growth of the user population, the frequent and repetitive traffic of documents, and the proliferation of content of widely varying value to readers.
The quality of service on the Internet varies widely: the aggregate load generated by the growing population is, most of the time, highly variable and chaotic (self-similar, fractal), independently of the capacity of the network [Crovella 95].
This is worsened by load and failures in servers caused by large numbers of simultaneous requests, and by network partitions, especially for remote documents. When a massive number of users share a common interest at a certain moment, the result is a storm of requests to the server and an overload of its network vicinity. This can be observed with resources that thousands of users want to access simultaneously and that are available on the Internet before any other medium: results of competitions and elections, images from an important event, etc.
As a result, many resources become inaccessible to a large number of users, who are left frustrated in their expectations of the Internet.
In addition, Web traffic grows faster than that of other services. Web requests point anywhere in the network, unlike email, news or DNS, where there are two regions: clients talk to a local proxy server, and proxies talk among themselves. This separation facilitates sustainable growth: imagine if we had to visit the author every time we wanted to read a book.
Another problem is the large volume and uneven quality of information on the Internet. Generally there are no guarantees of the quality and reliability of such information, and Web pages are not usually classified according to any criteria. A classification scheme is a collection of labels or topics produced by a classification authority and used to associate meta-information with objects. Such schemes are used by authors, readers, librarians and the distribution network to describe documents.
The goal is to provide local access to relevant, pre-selected information; to obtain the best service from an orderly use of global interconnections, where bandwidth is scarce, quality is unstable and network partitions occur too often; and to provide a global, cooperative mechanism for content classification and qualification (metadata).
We focus on a model centered on communities or organisations that are producers and consumers of information: they may produce, classify, label, offer and publish information, and also look for and consume information produced by other, distant communities. These interactions take place with a local (region, organisation) service agent or library, while object distribution is done asynchronously, reliably and cooperatively among agents located anywhere on the Internet. This model is appropriate because intra-community networking is usually adequate, whereas external networking is usually poor and more expensive.
Subscription to contents is central to this model: we understand subscription as a contract between a reader (the subscriber) and "the network" to receive certain contents in the future. This is opposed to the usual visitor-content relation on the Web where a visit is an isolated act in time. Subscription contracts can be used to predict and organise an efficient distribution of contents. People usually subscribe to resources (the URL of a web page, a set of documents, a software package, a multimedia content), or to a group of contents with the same characteristics (newsgroups, keyword-based query results).
2 Document Distribution
The model is inspired by the many distribution models used in everyday life (e.g. food distribution chains, publications). Consumers do not go to the places where goods are produced (e.g. factories, an author's home). Goods are purchased in the closest retail shop (a proxy), where most products are in stock waiting for customers (even though goods are sometimes back-ordered). Factories (servers) produce at a near-optimal pace, supplying distributors and retailers. This system works because consumers trust their retail shops: shops provide fresh products at a reasonable price, probably a better deal than one could get from the factory.
This model is suited to very large scales. It does not yet exist in the current Internet community, but it can be introduced over the existing networking infrastructure without modifying protocols and standards. Its progressive introduction provides immediate advantages to its users.
Distribution differs from caching and mirroring in several ways:

Fig. 1: Cache versus Distribution
3 Object Distribution System
The Object Distribution System (ODS) is formed by two independent virtual networks: an Object Distribution Network (ODN) and an Object Routing Network (ORN). ODN brings objects close to readers according to their interests, and ORN builds the distribution chains that ODN needs to do its work in a near-optimal way.

Fig. 2: Object Distribution System
ODN handles objects that are persistent and replicated in every interested service agent. ODN can handle different collections of objects, determined by their authors or some classification authority.
An ODN is composed of a number of cooperating service agents that join several groups, or collections of objects, according to the interests of their users. In every ODN group, service agents cooperate to obtain efficient replication inside the group, providing a selective replication of objects restricted to interested agents only. In this way we also want to bring some order to the chaos caused by unclassified information.
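The selective replication described above can be sketched as a toy model. This is only an illustration under stated assumptions: the class name, the method names and the agent and group names are hypothetical, and a real ODN would replicate asynchronously over the network rather than through in-memory dictionaries.

```python
from collections import defaultdict


class ObjectDistributionNetwork:
    """Toy model of an ODN: agents join groups; objects reach only members."""

    def __init__(self):
        self.members = defaultdict(set)   # group name -> names of member agents
        self.stores = defaultdict(dict)   # agent name -> {object id: content}

    def join(self, agent, group):
        """Subscribe a service agent to a group (a collection of objects)."""
        self.members[group].add(agent)

    def publish(self, group, object_id, content):
        """Replicate an object to every agent that joined the group."""
        recipients = sorted(self.members[group])
        for agent in recipients:
            self.stores[agent][object_id] = content
        return recipients


# Hypothetical agents and groups, for illustration only.
odn = ObjectDistributionNetwork()
odn.join("library-a", "networking")
odn.join("library-b", "networking")
odn.join("library-b", "physics")

reached_networking = odn.publish("networking", "doc-1", b"<html>...</html>")
reached_physics = odn.publish("physics", "doc-2", b"<html>...</html>")
```

Here "doc-2" is stored only at library-b, the single agent that joined the physics group, while "doc-1" reaches both agents.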
ORN builds distribution chains dynamically for each group. To build the chains, the routing agents (members of ORN) take into account each service agent's type of membership in a group and the state of the underlying network. Although systems such as News or GNS distribute objects in a hierarchical manner, they do not build distribution paths dynamically.
The routing mechanisms used in ORN for building distribution chains are completely independent of the class of objects handled by ODN. Both networks were designed to work independently, with a clear interface between them so that ORN can provide services to ODN in a transparent way.
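One possible way to derive a distribution chain from group membership and measured network state is a minimum-cost spanning tree rooted at the publishing agent. The text does not specify the ORN algorithm, so the use of Prim's algorithm here, the function name and the link costs are all assumptions made for illustration.

```python
import heapq


def build_distribution_tree(source, members, link_cost):
    """Return {agent: parent} for a minimum-cost tree spanning the members.

    link_cost maps frozenset({a, b}) to a measured cost (e.g. delay).
    Uses Prim's algorithm, an assumed choice rather than the ORN mechanism.
    """
    parent = {source: None}
    frontier = []
    for m in members:
        edge = frozenset((source, m))
        if m != source and edge in link_cost:
            heapq.heappush(frontier, (link_cost[edge], source, m))
    while frontier:
        _cost, via, agent = heapq.heappop(frontier)
        if agent in parent:
            continue  # already reached through a cheaper link
        parent[agent] = via
        for m in members:
            edge = frozenset((agent, m))
            if m not in parent and edge in link_cost:
                heapq.heappush(frontier, (link_cost[edge], agent, m))
    return parent


# Hypothetical agents A, B, C with measured link costs.
members = ["A", "B", "C"]
costs = {frozenset("AB"): 1, frozenset("BC"): 2, frozenset("AC"): 10}
tree_before = build_distribution_tree("A", members, costs)

# The B-C link degrades; rebuilding the tree reroutes C directly via A.
costs[frozenset("BC")] = 20
tree_after = build_distribution_tree("A", members, costs)
```

Rebuilding the tree whenever measurements change is what makes the chain dynamic in this sketch; a statically configured chain would keep sending to C through B.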
4 MWeb
MWeb is a realization of the previous model that uses news, with the NNTP protocol as transport, and the MHTML format [Palme 98] to produce document collections. There is currently no ORN service, so the distribution chains are statically configured; we are evaluating services that take into account measurements and variations of bandwidth and delay in the distribution of, and access to, contents.
The most important functions in MWeb are publication of documents (injecting documents into the replication infrastructure when they are created or modified), replication (currently using flooding and the NNTP protocol on newsgroups) and local access (transparent: caching module; non-transparent: local catalogue server or an MHTML-enabled NNTP browser).

This MWeb prototype allows local access to a document store or library of Web documents distributed efficiently to subscribers (frequent readers) of a specific topic or category. When a document is to be published, a utility produces a MIME MHTML message (multipart, with HTML and graphics together) [Palme 98]. That message is posted to the local news store, from where it is distributed (using the NNTP protocol). Afterwards, clients find in their local document store copies of the latest version, and probably some older versions, of that document.
The prototype is composed of three tools that we have implemented to interact with the news-based distribution network:
* Publication (mhtmlnews/mhtml programs)
* Proxy-Library (Apache module modified)
* Library-Catalogue (CGI program)
This separation among requests, publication and distribution helps to optimise the use of saturated Internet links, minimising the number of copies traversing those links (documents cross them only when they change). In addition, the author of a document has to send just one copy to the network to reach its audience.
Publication
The main function of a publisher is to transmit documents at the moment they are generated. The publisher process takes the URL(s) of documents and creates MIME MHTML objects (grouping text and images). Each MHTML object is published to a news store (NNTP) or an e-mail address (SMTP) to be distributed. Additional mechanisms are being considered to give publishers control and accountability over their content (mechanisms to ensure various aspects of copyright protection: signature, certificate, electronic payment, etc.).
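As an illustration of the publication step, the sketch below assembles a multipart/related message in the spirit of MHTML [Palme 98] using Python's standard email package. It is not the mhtmlnews/mhtml tool itself: the function name, the newsgroup name, the sender address and the choice of carrying the document URL in the Subject header are assumptions, and posting the article to a news server is deliberately left out.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml_article(url, html, images, newsgroup, sender):
    """Group an HTML page and its images into one multipart/related article.

    images maps each image URL to a (subtype, raw bytes) pair.
    """
    article = MIMEMultipart("related", type="text/html")
    article["From"] = sender
    article["Newsgroups"] = newsgroup
    article["Subject"] = url  # assumption: the URL identifies the article

    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = url  # lets a reader map the part back to its URL
    article.attach(root)

    for image_url, (subtype, data) in images.items():
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = image_url
        article.attach(part)
    return article


# Hypothetical document, image and newsgroup, for illustration only.
article = build_mhtml_article(
    url="http://www.sample.es/index.html",
    html='<html><body><img src="http://www.sample.es/logo.gif"></body></html>',
    images={"http://www.sample.es/logo.gif": ("gif", b"GIF89a")},
    newsgroup="mweb.sample",
    sender="publisher@sample.es",
)
wire_format = article.as_string()  # text that would be posted to the news store
```

Keeping the page and its images in a single message means that one posting carries a complete, self-contained version of the document.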
Library-Proxy
Transparent local access to the document store: every reader site may use its local Web store to access documents published and replicated in a given library (newsgroup) in the form of a collection of related MHTML documents. This is done through the proxy interface (an HTTP proxy server connected to the local news store). This mechanism adds new functionality to an existing proxy-cache server: it will serve documents from the library, from the cache or directly from the network.
The implementation of a transparent proxy to a library has been studied for two popular and publicly available servers: Squid and Apache. Both are available in source code and their internals are documented.
Because of Apache's modular design and its mechanisms for extending the code, we chose to extend its proxy-cache module. The extended module retrieves MHTML documents from a local NNTP server, decodes them from the MHTML format and delivers them to the user and the cache.
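The lookup performed by such a proxy can be sketched as follows. This is a simplified model rather than the Apache module: the cache and the library are plain dictionaries, cache expiry is ignored, the cache-then-library-then-origin order follows the Summary, and all names are hypothetical.

```python
def serve(url, cache, library, fetch_origin):
    """Return (source, body), trying local stores before the network."""
    if url in cache:
        return "cache", cache[url]
    if url in library:
        body = library[url]
        cache[url] = body      # library hits are also delivered to the cache
        return "library", body
    body = fetch_origin(url)   # last resort: contact the original server
    cache[url] = body
    return "origin", body


# Hypothetical stores and origin fetcher, for illustration only.
cache = {}
library = {"http://www.sample.es/index.html": b"copy distributed by news"}
origin_calls = []


def fetch_origin(url):
    origin_calls.append(url)
    return b"copy fetched from the origin"


first = serve("http://www.sample.es/index.html", cache, library, fetch_origin)
second = serve("http://www.sample.es/index.html", cache, library, fetch_origin)
third = serve("http://www.sample.es/other.html", cache, library, fetch_origin)
```

Only the third request, for a document that is neither cached nor held by the library, reaches the origin server.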
Library-Catalog
Non-transparent local access to the document store: the library catalogue is a CGI tool that provides the most recent version of a document, or shows all available versions of that document, with the content classified in terms of meta-information.
The CGI looks up the URL in the local news store. If we request the document http://www.sample.es/index.html using store.org as the document store, we have the following choices:
Fig. 4: An example extracted from the prototype system showing a table of different versions of http://www.rediris.es/index.es.html
Furthermore, these URLs become persistent names: it does not matter if the original document disappears or moves, because we will still be able to access one or more versions of the document from our local library, or a particular version at a specific date.
We are working to extend this CGI application to show the content of the news store in terms of other meta-information, in a Yahoo-like catalogue.
The prototype implementation is available at http://www.canet.upc.es/mweb
5 Conclusions
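Retrieval of the latest version, of all versions, or of the version valid at a given date can be sketched with a small in-memory store. The class and method names are hypothetical, and the prototype keeps its versions in a news store rather than in Python lists.

```python
import bisect
from datetime import datetime


class VersionedStore:
    """Keeps every published version of each URL, ordered by timestamp."""

    def __init__(self):
        self._times = {}    # url -> sorted list of publication timestamps
        self._bodies = {}   # url -> bodies, in the same order as _times

    def add(self, url, timestamp, body):
        times = self._times.setdefault(url, [])
        bodies = self._bodies.setdefault(url, [])
        index = bisect.bisect_right(times, timestamp)
        times.insert(index, timestamp)
        bodies.insert(index, body)

    def latest(self, url):
        return self._bodies[url][-1]

    def versions(self, url):
        return list(self._times[url])

    def at(self, url, when):
        """Newest version published at or before `when`, or None."""
        index = bisect.bisect_right(self._times.get(url, []), when)
        return self._bodies[url][index - 1] if index else None


# Illustrative dates and contents only.
store = VersionedStore()
url = "http://www.sample.es/index.html"
store.add(url, datetime(1998, 5, 1), "second version")
store.add(url, datetime(1998, 3, 1), "first version")

newest = store.latest(url)
in_april = store.at(url, datetime(1998, 4, 1))
before_publication = store.at(url, datetime(1998, 1, 1))
```

Because old versions are never overwritten, a URL combined with a date keeps resolving even after the original document has moved or disappeared.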
The Web has become the "killer app" of the Internet because of its explosive development and growing use. Many strategies and techniques to improve the transmission of documents on the Web have been proposed so far, most of them based on taking advantage of read requests (caching). In real-world experiments, local proxy caching can reduce document access time by around 30%, and prefetching has the potential for additional improvements. However, prefetching is only effective when the right documents are identified and future requests are correctly predicted. Otherwise, prefetching may be yet another way to waste bandwidth. MWeb may help, in some cases and for some communities, to complement caching.
Some communities share a large volume of information (articles, technical reports, notes, etc.) that is frequently consulted by members, who may be both readers and authors. This information is valuable and worth keeping available locally, classified in a catalogue. In that context we can use our distribution model to benefit from the separation of distribution (publication and document transport) from query/publication to a local digital library (MWeb + news store). Local access can be visible to the user (catalogue interface) or invisible (transparent) through an HTTP proxy to the local document store.
The main advantages of MWeb are: nearly instantaneous access to documents (no penalty for the first reader of a document, unlike a proxy cache); the author has to send just one copy of a document; the ability to equalise traffic (taking advantage of hours of low traffic); the elimination of document validity checks (conditional HTTP GET requests); and URL persistence and version control, or URLs with timestamps.
We also expect to contribute to the new IETF working group on caching and replication [WREC 98].
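To make the comparison with a proxy cache concrete, the following sketch counts what crosses the external link for a single document under both schemes. The event list, the time-to-live value and the function names are illustrative assumptions, not measurements from the prototype.

```python
def link_traffic_with_cache(events, ttl):
    """Count (full transfers, validation requests) for one proxy cache.

    events is a time-ordered list of (time, kind), kind being "change"
    (the author updates the document) or "read" (a local user requests it).
    """
    transfers = validations = 0
    cached_version = None   # version currently held by the cache
    validated_at = None     # time of the last fetch or validation
    current_version = 0
    for time, kind in events:
        if kind == "change":
            current_version += 1
            continue
        if cached_version is None:
            transfers += 1                 # first reader pays the download
        elif time - validated_at >= ttl:
            validations += 1               # conditional GET to the origin
            if cached_version != current_version:
                transfers += 1
        else:
            continue                       # copy still fresh, served locally
        cached_version = current_version
        validated_at = time
    return transfers, validations


def link_traffic_with_distribution(events):
    """Each published version crosses the link once; reads are all local."""
    changes = sum(1 for _, kind in events if kind == "change")
    return changes, 0


events = [(0, "change"), (1, "read"), (2, "read"), (5, "read"),
          (6, "change"), (7, "read"), (12, "read")]
with_cache = link_traffic_with_cache(events, ttl=3)
with_distribution = link_traffic_with_distribution(events)
```

In this small trace both schemes transfer the document twice, but the cache also issues two validation requests, makes the first reader wait for the download, and serves a stale copy at time 7; under distribution the transfers can additionally be scheduled for hours of low traffic.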
References
[Apache 98] The Apache Group, "Apache HTTP Server Project".
[Crovella 95] M. Crovella, A. Bestavros, "Explaining WWW Traffic Self-Similarity", August 1995.
[Lijding 97] M.E. Lijding, L. Navarro, C. E. Righetti, "Object Distribution Networks for world-wide document circulation", CRIWG97, El Escorial, España, September 1997.
[Palme 98] J. Palme, A. Hopmann, N. Shelness, "MIME Encapsulation of Aggregate Documents, such as HTML (MHTML)", RFC 2110.
[Simonsen 98] J. Simonson, et al., "Version augmented URIs for reference permanence via an Apache module design", Proceedings of the WWW7 Conference, Brisbane, Australia, April 1998.
[WREC 98] IETF BoF, Web REplication and Caching (WREC).