[Next] [Up] [Previous] [Contents]
Next: 3.4 Meta-Rendezvous Service (Another Up: 3 Odds and Ends Previous: 3.2 Configuration with Yoid

3.3 Web Caching (and Hierarchically Nested Groups)

One cannot talk about host-based multicast without talking about web caching infrastructures which are, after all, a form of host-based multicast. This leads naturally to the question of yoid's applicability to web caching.

There are several aspects of web caching I would like to consider:

  1. Finding the nearest web cache (cache discovery).
  2. Finding a web cache (among many) that is most likely to already hold the desired content (cache routing).
  3. Pre-loading web caches with content that is most likely to be desired (cache pre-loading).

The first is rather straightforward with yoid--simply create a well-known group of web caches, and have web clients contact the group and over time converge on the nearest web caches. (Note that the client would not join per se. Rather, it would contact and "explore" the group, same as a true joining member does, but not actually join.) But there are also other ways to do nearby web cache discovery, so I'm not sure that this by itself is very interesting.

The second and third aspects are important areas where yoid can perhaps make a valuable contribution. To do so, however, some additional mechanisms are required for yoid. Namely, it must be possible to 1) organize a yoid ``group'' into a nested hierarchy of subgroups, and 2) route unicast frames through the hierarchy to a specific subgroup (or subsubgroup, subsubsubgroup, and so on). I call a group with these two characteristics a nested group.

Nested Groups

In what follows, I talk about nested groups outside the context of web caching. After that I'll bring web caching back into it and show how nested groups can be used for cache routing and cache pre-loading.

Figure 10 illustrates the major components of a yoid nested group. The name of the group in the figure is yoid://rn1.com/gn1. The figure illustrates three levels of nesting. Each oval represents a nested subgroup. Each subgroup is internally connected, by which I mean that a frame multicast to the subgroup only needs to traverse members of the subgroup.

Figure 10: Yoid Nested Group

I have chosen a naming convention whereby the group id of each subgroup is a superset of that of the subgroup it is nested in. I have arbitrarily selected ':' as a delimiter. It is not necessarily the case that a delimiter is needed, as long as within the protocol there is a way to indicate which subgroup a given named subgroup is nested in. For that matter, it is not necessarily the case that a subgroup id be a superset of that of its parent subgroup, again as long as the protocol can make the identification. It is probably a good idea, though, for the sake of the humans that have to read the strings, to have these kinds of conventions. (To avoid too much clutter in the picture, I have truncated the beginning of the group id in the lower subgroups.)
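
The naming convention above can be captured in a couple of small helper functions. This is a sketch, not part of the yoid protocol; the function names and the choice of ':' as delimiter follow the convention in the text.

```python
def parent_of(group_id):
    """Return the id of the subgroup this subgroup is nested in,
    or None for a top-level group id such as "rn1.com/gn1"."""
    if ":" not in group_id:
        return None
    return group_id.rsplit(":", 1)[0]

def is_nested_in(group_id, ancestor_id):
    """True if group_id lies anywhere inside ancestor_id
    (a member of a subgroup is also a member of all subgroups above it)."""
    return group_id == ancestor_id or group_id.startswith(ancestor_id + ":")

# parent_of("rn1.com/gn1:sg1:ssg1:sssg1") gives "rn1.com/gn1:sg1:ssg1"
# is_nested_in("rn1.com/gn1:sg1:ssg1:sssg2", "rn1.com/gn1:sg1") is True
```

Because the ids are supersets of their parents' ids, nesting tests reduce to simple string-prefix checks, which is part of the appeal of the convention.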

I have chosen to draw this picture as a series of balloons inside balloons, rather than as a branching tree. This is intentional, and is meant to emphasize the point that a member of a subgroup is also a member of all the subgroups above it. What this means practically is that if a frame is multicast to a certain subgroup S, it will reach all members of all subgroups nested within S, however deep.

(This will all look, by the way, tediously familiar to anybody versed in standard hierarchical routing algorithms. Such a person will also know that there are a number of ways to structure, name, and route through hierarchies. I do not explore alternatives here, have not yet thought deeply about them, and as it is do a lot of hand waving with what I do say here. This is all an area for further study.)

To build the nested structure shown in Figure 10, each member simply joins its lowest subgroup as it would any (non-nested) group, with the following exceptions:

  1. The root of a subgroup joins the next higher level subgroup.
  2. A few members of a subgroup must establish mesh neighbors in other subgroups at the same level.
  3. There is no rendezvous for subgroups. Rather, members of a given subgroup are discovered using one of two alternative methods described below.

For example, member B shown in the figure would join subgroup rn1.com/gn1:sg1:ssg1:sssg1 in the normal way, by which I mean it would only select members in the subgroup as a parent. One member of rn1.com/gn1:sg1:ssg1:sssg1 finds itself to be the root (A in the figure), and so joins rn1.com/gn1:sg1:ssg1. This means that it chooses its parent from the set of all members of all subgroups of rn1.com/gn1:sg1:ssg1. The root of rn1.com/gn1:sg1:ssg1 (member C) selects a parent from rn1.com/gn1:sg1, and so on.
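
The parent-selection rule in this example can be sketched as follows. The flat member table and the member names are illustrative assumptions mirroring the figure; only the rule itself (a non-root picks a parent inside its own subgroup, a root picks from the next higher level subgroup) is from the text.

```python
members = {
    "B": "rn1.com/gn1:sg1:ssg1:sssg1",
    "A": "rn1.com/gn1:sg1:ssg1:sssg1",   # root of sssg1
    "C": "rn1.com/gn1:sg1:ssg1:sssg2",
    "D": "rn1.com/gn1:sg1:ssg2:sssg1",
}

def candidates(subgroup, members):
    """All members whose lowest subgroup is nested inside `subgroup`;
    these are the eligible parents when joining `subgroup`."""
    return {m for m, g in members.items()
            if g == subgroup or g.startswith(subgroup + ":")}

# B joins normally: parents come only from its own bottom subgroup.
normal = candidates("rn1.com/gn1:sg1:ssg1:sssg1", members)
# A, as root of sssg1, joins the next level up: any member anywhere
# inside rn1.com/gn1:sg1:ssg1 qualifies.
root = candidates("rn1.com/gn1:sg1:ssg1", members)
```

(A real implementation would also exclude the joining member itself from the candidate set.)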

Required, both by applications and by YTMP itself, is a way to efficiently discover a member of any subgroup. The normal method of discovering a member of a regular (non-nested) group is via the rendezvous. In the case of a nested group, the rendezvous knows of a few (or more) members of the group, but it doesn't necessarily know of members from each subgroup. This means that there must be a way to route from any member of the group to any subgroup.

Though endless variations are possible, the basic method for doing this would be to require each member to know of one or two members of every sibling subgroup at each level. In other words, A would know of one or two members each from rn1.com/gn1:sg1:ssg1:sssg2 and rn1.com/gn1:sg1:ssg1:sssg3, one or two each from rn1.com/gn1:sg1:ssg2 and rn1.com/gn1:sg1:ssg3, and one or two each from rn1.com/gn1:sg2 and rn1.com/gn1:sg3.
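
The set of sibling subgroups a member must track, at every level of its ancestry, can be computed mechanically from the naming convention. A minimal sketch, with shortened illustrative subgroup ids:

```python
def tracked_siblings(my_subgroup, all_subgroups):
    """Sibling subgroups, at every ancestry level, that this member
    must know one or two members of."""
    sibs, gid = [], my_subgroup
    while ":" in gid:
        parent = gid.rpartition(":")[0]
        sibs += [g for g in all_subgroups
                 if g != gid and ":" in g and g.rpartition(":")[0] == parent]
        gid = parent
    return sorted(sibs)

# Using shortened ids ("g" standing in for "rn1.com/gn1"):
subgroups = ["g:sg1", "g:sg2", "g:sg1:ssg1", "g:sg1:ssg2",
             "g:sg1:ssg1:sssg1", "g:sg1:ssg1:sssg2"]
```

For a member of g:sg1:ssg1:sssg1 this yields one sibling at each level: g:sg1:ssg1:sssg2, g:sg1:ssg2, and g:sg2.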

A simple way to propagate this information would be for every member to broadcast a control message about itself to all of its mesh neighbors. They would store the contained subgroup information either if they didn't already have it, or if it came from a nearer member, where nearness could be measured in terms of hops over the tree or hops over the mesh (versus, say, latency, which might be better but more costly to measure). If they stored the information, they would also then forward it to their mesh neighbors.
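
The store-and-reflood rule above amounts to a small filter at each member. A sketch, assuming a per-member table keyed by subgroup id and nearness measured in hops (the data shapes are illustrative assumptions):

```python
def receive_advert(table, subgroup, member, hops):
    """Store an advert if it is new or from a nearer member.
    Returns True if the advert should be re-flooded to our own
    mesh neighbors (with hops + 1), False if it is suppressed."""
    known = table.get(subgroup)
    if known is None or hops < known[1]:
        table[subgroup] = (member, hops)
        return True
    return False

table = {}
receive_advert(table, "g:sg2", "M1", 3)   # new: store and re-flood
receive_advert(table, "g:sg2", "M2", 5)   # farther: suppress
receive_advert(table, "g:sg2", "M2", 1)   # nearer: replace and re-flood
```

Suppressing adverts that are no nearer than what is already stored is what keeps the flood from circulating forever around mesh loops.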

An example of how this would allow discovery is as follows. Say a member X (not shown in the figure) in rn1.com/gn1:sg3:ssg2 wanted to discover a member in rn1.com/gn1:sg1:ssg1:sssg3. (Note that it isn't necessary for all members to be at the same depth in the hierarchy.) By virtue of its place in the hierarchy, X would know of a couple of arbitrarily selected members in rn1.com/gn1:sg1 (a sibling subgroup at the highest level). Say one of those members were in rn1.com/gn1:sg1:ssg2:sssg1. This member would in turn know of a member of sibling subgroup rn1.com/gn1:sg1:ssg1, say one in rn1.com/gn1:sg1:ssg1:sssg1. This one, finally, would know of one in rn1.com/gn1:sg1:ssg1:sssg3. Thus, with three queries (more if one of the members was unavailable, possibly fewer if we were lucky), an appropriate member can be discovered.
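
This walk can be simulated end to end. The toy tables below are fabricated to mirror the example (with "g" standing in for "rn1.com/gn1"); the point is the longest-prefix-match rule used to pick the next member to query at each hop.

```python
def shared_labels(a, b):
    """Number of leading ':'-separated labels two subgroup ids share."""
    n = 0
    for x, y in zip(a.split(":"), b.split(":")):
        if x != y:
            break
        n += 1
    return n

def resolve(group_of, knowledge_of, start, target):
    """Hop member to member until one inside `target` is reached.
    knowledge_of maps member -> {known subgroup id: a member in it}.
    Returns (final member, number of queries). Assumes the target
    subgroup is populated and every hop's knowledge is non-empty."""
    member, queries = start, 0
    while True:
        g = group_of[member]
        if g == target or g.startswith(target + ":"):
            return member, queries
        knowledge = knowledge_of[member]
        best = max(knowledge, key=lambda s: shared_labels(s, target))
        member = knowledge[best]
        queries += 1

group_of = {"X": "g:sg3:ssg2", "M1": "g:sg1:ssg2:sssg1",
            "M2": "g:sg1:ssg1:sssg1", "M3": "g:sg1:ssg1:sssg3"}
knowledge_of = {"X": {"g:sg1": "M1"},
                "M1": {"g:sg1:ssg1": "M2"},
                "M2": {"g:sg1:ssg1:sssg3": "M3"},
                "M3": {}}
```

Running resolve from X toward g:sg1:ssg1:sssg3 reaches M3 in exactly the three queries described in the text.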

Note that this method would give each member the required information about other members, but not a connection to each said member. As a result, discovery could be relatively slow--a connection would first have to be established at each hop of discovery along the way. An alternative would be to actually pass routing information through the tree-mesh, so that a frame could be unicast forwarded hop-by-hop through the tree-mesh to any addressed point in the hierarchy. In other words, employ a pretty much standard hierarchical routing algorithm to support fast discovery.

To be clear, I am not, I repeat not, suggesting an alternative to unicast IP (or IPv6). (Though I suppose there might be applicability to VPNs.) I am simply exploring possible fast methods of discovery in a nested group, the importance of which will be clear when I talk about web caching.

To complete this sketch, I need to talk about one more thing--how a member decides what subgroup it should join (or create if it is the first to join). The short explanation is that the application tells it. For the application to do this, however, it very likely needs information about what subgroups already exist (the web caching application certainly does).

Thus, the API between the application and yoid requires a mechanism whereby the application can query yoid for the set of child subgroups of any named subgroup at any level. Yoid in turn needs a protocol whereby any member can make this query of any other member. So, for instance, the application in the member at rn1.com/gn1:sg3:ssg2 could ask over the API about all subgroups of rn1.com/gn1:sg1:ssg1. The member at rn1.com/gn1:sg3:ssg2 would discover a member in rn1.com/gn1:sg1:ssg1, query it for all child subgroups (which it would already know by virtue of the information needed for discovery), get the reply, and return that to its local application.
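
Answering such a query is a simple filter over the subgroup ids a member already knows. A sketch, again with shortened illustrative ids:

```python
def child_subgroups(known_subgroups, parent):
    """Immediate children of `parent` among the subgroups already known
    locally (the same knowledge that discovery requires)."""
    return sorted(g for g in known_subgroups
                  if ":" in g and g.rpartition(":")[0] == parent)

known = {"g:sg1:ssg1:sssg1", "g:sg1:ssg1:sssg3", "g:sg1:ssg2", "g:sg2"}
```

Here child_subgroups(known, "g:sg1:ssg1") returns the two sssg-level children, while the deeper or unrelated entries are filtered out.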

Web Cache Routing and Pre-loading

Now I can describe how nested groups can be used in support of web caching.

First, assume a coordinated web cache infrastructure--a set of machines deployed around the global internet solely for the purpose of web caching. These caches would all join the same nested group. The sub-grouping scheme would be known in advance by the caches. Specifically, each cache would know the naming convention, including how many subgroups each subgroup has (i.e., the subgroup fan-out), and the maximum depth of the hierarchy.

At initial join time, each cache would randomly select a bottom level subgroup and join it. It may be the only member of that subgroup--this is not a problem. There will be many ``holes'' in the hierarchy--potential deep subgroups with no members at all. This is also not a problem. Top-level subgroups will be well-populated, getting thinner as you go down the hierarchy (how thin how quickly depends on the fan-out and number of caches).

Web clients can also ``join'' the group, but only for the purpose of discovering nearby web caches (if this is necessary at all--as I said, other mechanisms also exist for this).

When a web client wants to get a page, it sends the URL to its cache (proxy) as usual. The get is then routed through the hierarchy as follows: At each level, starting at the top, the cache hashes the URL into the space of populated subgroups at that level, pseudo-randomly but deterministically selecting one. The get is forwarded to a member in that subgroup, which in turn hashes and selects from the subgroups at the next lower level.

This continues until either 1) there is a cache hit, 2) the lowest level subgroup is reached, or 3) some other criterion, such as the number of caches traversed, is met. If there is no hit, the get is forwarded to the web server, the page is retrieved, cached, forwarded back to the original proxy, and finally back to the client.
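
The level-by-level hashing can be sketched as below. The hash function, the shape of the `populated` map, and the group name are illustrative assumptions; the property that matters is that every cache routing the same URL picks the same path deterministically.

```python
import hashlib

def route_url(url, populated, root="rn1.com/gn1"):
    """Route a URL down the hierarchy: at each level, hash the URL into
    the set of populated child subgroups and descend into the choice.
    `populated` maps a subgroup id to its populated children."""
    path, node = [], root
    while populated.get(node):
        children = sorted(populated[node])
        digest = hashlib.sha1((url + "|" + node).encode()).digest()
        node = children[int.from_bytes(digest[:4], "big") % len(children)]
        path.append(node)
    return path

populated = {
    "rn1.com/gn1": ["rn1.com/gn1:sg1", "rn1.com/gn1:sg2"],
    "rn1.com/gn1:sg1": ["rn1.com/gn1:sg1:ssg1"],
    "rn1.com/gn1:sg2": [],
}
```

Mixing the current subgroup id into the hash at each level keeps the per-level choices independent, so URLs spread across the hierarchy rather than all following one branch.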

Where the page gets cached at this point is a good question. It could be cached nowhere, at the target cache only (the cache where it was ultimately routed and fetched), at the original client proxy cache only, at the target and proxy caches, or at all the caches traversed in routing the get. The reason we wouldn't necessarily want to cache it everywhere, and even might not want to cache it anywhere, is that by and large it is preferable to use cache pre-loading rather than have to deal with potentially stale caches.

Cache pre-loading would work like this. Caches keep track of how many gets they get for each URL (even if they don't actually cache the page). When a cache notices that it is getting a lot of gets for a given URL, it invokes cache pre-loading for that URL. This means that it 1) maintains an up-to-date copy of the page, and 2) multicasts the page to some low-level subgroup.
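
The per-URL counting and the trigger can be sketched as a small tracker. The threshold value is an arbitrary assumption; the text only says "a lot of gets".

```python
from collections import Counter

class PreloadTracker:
    """Count gets per URL and flag each URL once, the first time it
    crosses a popularity threshold (at which point the cache would
    multicast the page to some low-level subgroup)."""

    def __init__(self, threshold=100):
        self.threshold = threshold
        self.counts = Counter()
        self.preloaded = set()

    def record_get(self, url):
        """Returns True exactly when pre-loading should be invoked."""
        self.counts[url] += 1
        if url not in self.preloaded and self.counts[url] >= self.threshold:
            self.preloaded.add(url)
            return True
        return False
```

Counting gets even for pages that aren't cached is what lets a cache detect popularity before it holds a copy.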

Ideally the web server for the page would get involved and update the cache whenever the page changed. Barring this, the cache could just be very aggressive about keeping it fresh. Whenever the page changes, an update of the page would be multicast to the subgroup. This would have the effect of spreading the document over more caches, but without the stale cache problem (or, at least, much less of one).

If after invoking cache pre-loading at a low level individual caches are still getting a lot of queries for that URL, pre-loading can be expanded by multicasting the page and any subsequent changes to the next higher level subgroup, and so on.

Note that the addition or deletion of individual cache servers can modify the set of populated subgroups. This in turn changes where some URLs get hashed--URLs that formerly hashed into a given low-level subgroup may now instead hash into the new subgroup created by the new cache server. This is not a particular problem in this case, for two reasons.

First, caching is from the word go a hit-and-miss thing. A URL being routed to a new cache server looks, from the client's perspective, no different from a cache miss, and from the replaced cache server's perspective, no different from a lack of new queries.

Second, assuming a reasonable number of cache servers to begin with (some tens if not hundreds), the addition or deletion of a single cache server only changes the set of populated subgroups at the lowest levels (with high probability). Higher level subgroups, where popular pages get pre-loaded, will rarely change.



Paul Francis
Fri Oct 1 11:06:22 JST 1999