Random Walkers for P2P Searches

Slashdot ran a story yesterday on “Random Walkers”, a method for searching across Gnutella-like (decentralized unstructured) P2P networks. The New Scientist article pointed to by Slashdot is, as usual, a bit out of kilter with reality.

The New Scientist article says of the researchers, “If they are right, millions of computers could be connected, meaning P2P networks could rival the performance and stability of stand-alone supercomputers.” Of course, the reader who submitted the story to Slashdot took this to mean “They [the researchers] claim this could result in a network very capable of facilitating a massive distributed supercomputer.” Cute, but not very accurate.

The actual research paper, “Search and Replication in Unstructured Peer-to-Peer Networks”, is an interesting analysis of various search algorithms on decentralized unstructured P2P networks, both with and without proactive replication strategies. Here, decentralized means there’s no central index or directory server/network. (See Distributed Systems Topologies at OpenP2P.com for more info.) Unstructured means, basically, that the network’s topology is not predetermined or controlled: nodes connect to whichever other nodes in the network they want to, and have to rely on the good faith of those nodes for their queries.

Replication strategies determine how files (or, more generically, objects) are replicated across a network. The simplest way, and the one most people are familiar with, is called “owner replication”: when a file search is successful and a node downloads the file, the file ends up stored at both the original provider node and the node that just downloaded it. This method ensures that only nodes that wish to store the file have to store it locally. Another method is “path replication”, in which a file is cached at each node along the route a successful search request took between the requesting node and the provider node.
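The difference between the two strategies can be sketched in a few lines of Python (the node names and file name here are purely illustrative):

```python
def owner_replicate(path, stores):
    """Owner replication: after a successful search, the file ends up at
    the provider (which already had it) and the requesting node only."""
    requester, provider = path[0], path[-1]
    stores.setdefault(requester, set()).add("song.mp3")
    stores.setdefault(provider, set()).add("song.mp3")

def path_replicate(path, stores):
    """Path replication: the file is cached at every node along the
    route the successful query took, requester through provider."""
    for node in path:
        stores.setdefault(node, set()).add("song.mp3")

# A query from node "A" that found the file at node "D" via "B" and "C":
route = ["A", "B", "C", "D"]
owner_stores, path_stores = {}, {}
owner_replicate(route, owner_stores)
path_replicate(route, path_stores)
print(sorted(owner_stores))  # only the endpoints hold a copy
print(sorted(path_stores))   # every node on the route holds a copy
```

Path replication seeds more copies per successful search, at the cost of storage on intermediate nodes that never asked for the file.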

The results of the research are interesting, but nothing as earth-shattering as New Scientist tried to make them out to be. Ever since Nullsoft released Gnutella, it’s been painfully obvious that searching the decentralized Gnutella network was inefficient compared to centralized networks. The main advantage of Gnutella, at a point when Napster was just beginning to face serious legal threats, was that it had no single point of failure. Qin Lv et al. have, in their research, found an interesting way of searching across a purely decentralized unstructured P2P network. Their algorithm seems to give decent results at much lower bandwidth costs than other common search methods. However, while they were doing their research, many of the Gnutella-based networks evolved into more scalable systems.

One method of increasing the size to which a network may grow is TTL (Time-To-Live) horizons. A TTL states how many hops a query may propagate through the network. In large networks, TTLs must be set relatively high for a user to be assured of getting a result back for a query, but they still allow for massive growth of a network: a TTL effectively enforces a radius around each node, so a search by one node has no effect on the nodes outside its search radius. Because of this, such networks can grow to be rather large without significantly more or different problems than when they were smaller. However, this also means nodes may not be able to find more obscure files as the network grows. The researchers mention the importance of TTL horizons, and indeed look into the effect of different TTLs on different decentralized topologies.
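The horizon effect is easy to see in a toy simulation. The sketch below floods a query outward from a source node, decrementing the TTL at each hop (the line-shaped network is just an illustrative example, not how Gnutella topologies actually look):

```python
from collections import deque

def flood(graph, source, ttl):
    """Flood a query outward from `source`, decrementing the TTL at
    each hop; returns the set of nodes the query reaches. Nodes beyond
    the TTL horizon never see the query at all."""
    seen = {source}
    frontier = deque([(source, ttl)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == 0:
            continue  # the query dies here; don't forward it
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops - 1))
    return seen

# A simple line-shaped network: 0 - 1 - 2 - 3 - 4 - 5
line = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
print(len(flood(line, 0, 2)))  # a TTL of 2 reaches only nodes 0, 1, 2
```

Nodes 3 through 5 could hold exactly the file being searched for and the query would still never reach them, which is why obscure files get harder to find as the network grows past the horizon.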

A second method of boosting the scalability of such networks is to introduce some structure to them. The FastTrack network did this by using “superpeers”: essentially search servers that cached the file indices of their children in the network. The superpeers could query each other, forming a fast search network inside of the larger decentralized network. The Gnutella Development Forum described a similar technology called Ultrapeers in its first draft of the Gnutella RFC, and in a document submitted by Lime Wire LLC.
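The basic idea behind a superpeer can be sketched as a cached index, as below. This is a minimal illustration of the caching-and-lookup idea, not the actual FastTrack or Ultrapeer protocol (the class and node names are made up):

```python
class Superpeer:
    """A superpeer caches the file lists of its child (leaf) nodes, so
    a query only touches the superpeer layer rather than every leaf."""
    def __init__(self):
        self.index = {}  # filename -> set of child node ids

    def register(self, child, files):
        """Called when a leaf connects and reports the files it shares."""
        for name in files:
            self.index.setdefault(name, set()).add(child)

    def lookup(self, filename):
        return self.index.get(filename, set())

# Two superpeers indexing four leaves between them:
sp1, sp2 = Superpeer(), Superpeer()
sp1.register("leaf-1", ["a.mp3", "b.mp3"])
sp1.register("leaf-2", ["b.mp3"])
sp2.register("leaf-3", ["c.mp3"])
sp2.register("leaf-4", ["a.mp3"])

# A search queries the small superpeer layer, not the leaves themselves.
hits = sp1.lookup("a.mp3") | sp2.lookup("a.mp3")
print(sorted(hits))  # leaf-1 and leaf-4 hold a.mp3
```

Because only the superpeers exchange queries, the flood (or walk) happens over a much smaller graph while the leaves stay reachable through their cached indices.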

This second method drastically increases the performance of these networks, and seems to be gaining popularity in recent Gnutella implementations. Thus, real Gnutella networks are a bit different these days from the ones the researchers simulated. The part of the research relevant to distributed computing concerns how to optimally replicate data over a network so that random walkers can find it. Not surprisingly, replication should be random. This is intuitively obvious: random placement avoids concentrating copies of the data along a path (as path replication does) or relegating them to leaves (which in actual Gnutella networks are rarely true leaves, but client nodes connected to multiple super-nodes) as owner replication does. If random replication is coupled with random walkers and a random network topology, then data retrieval is fairly reliable even without the hinting used by guided walkers in Freenet. So, in theory all of this is quite promising, but whether anyone can actually implement random walkers, random topology, and random replication in practice is another question entirely.
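A single random walker is simple to sketch: instead of flooding, the query hops to one randomly chosen neighbor per step until it lands on a node holding the object or its step budget runs out. (The paper actually sends multiple walkers in parallel and periodically checks back with the requester; the one-walker sketch below, with an arbitrary ring-plus-chords topology, is just meant to show the mechanism.)

```python
import random

def random_walk_search(graph, start, holders, max_steps):
    """One random walker: at each step it hops to a uniformly chosen
    neighbor, succeeding when it reaches a node holding the object.
    Returns the number of steps taken, or None on failure."""
    node = start
    for step in range(max_steps + 1):
        if node in holders:
            return step
        node = random.choice(graph[node])
    return None

# A ring of 100 nodes with some random extra links (a crude stand-in
# for a random topology), and the object randomly replicated at 5 nodes.
random.seed(1)
n = 100
graph = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
for _ in range(50):
    a, b = random.sample(range(n), 2)
    graph[a].append(b)
    graph[b].append(a)
holders = set(random.sample(range(n), 5))
print(random_walk_search(graph, 0, holders, 10_000))
```

Unlike flooding, each walker generates only one message per step, which is where the bandwidth savings come from; the trade-off is latency, since the walker may wander for a while before stumbling onto a replica.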

The main problem is that real P2P networks are rarely homogeneous in their makeup. Some nodes have higher-bandwidth connections, more storage, or different priorities than others. Some users may want to facilitate the exchange of information by providing fast servers for other nodes to connect to and search from; others may just want to be free riders. Without a homogeneous network, random walkers might not perform nearly as well as predicted, running into the problems of network latency and ephemeral nodes. Of course, relying on a homogeneous network would severely limit the usefulness of these research results, because relatively few computers could join such a network in the first place.

Without a comparison against other architectures (such as the mixed centralized + decentralized topology of modern Gnutella networks), any claims that random walkers are going to have immediate applications in P2P networks need to be taken with a grain of salt. So do claims that such “research could be used to distribute massive amounts of computing power for scientific applications in the form of ‘distributed supercomputers’”. The actual research doesn’t concern distributing computing power; it deals with distributing data storage for fast, accurate retrieval. This could conceivably be used to better facilitate distributed/grid computing, but other decentralized structured topologies would probably serve that purpose better. Furthermore, even if the results of this research could be shown to work well in real-world environments, it’s far more likely they’d be used in unstructured networks for simple file sharing and to help curb some of the ill effects of DoS attacks against file-sharing networks (though it’d take a lot more than random walkers to stop the RIAA, which has enough cash on hand to chew through a lot of bandwidth to accomplish its goals).

So, what are we left with? Some interesting research results that may or may not be useful in real-world P2P networks. It’s interesting to note that, in their paper at least, the researchers didn’t try to tout their results nearly as much as did New Scientist or even the Slashdot poster. People should keep in mind that the main advantage of Gnutella is its lack of a single point of failure. Most legitimate applications don’t have this requirement (as Craig Silverstein recently said on Slashdot, “It’s true the Internet is distributed, but Internet services have never been.”), so they can take on some structure that allows for more efficient data sharing. In fact, many commercial applications such as Kontiki seem to rely heavily on having managed distributed structured networks. Personally, having spent a semester designing a scalable, secure, and manageable P2P network for a legitimate real-world application, I’m of the opinion that structured topologies are the way to go and decentralized unstructured topologies just aren’t up to par quite yet. This research brings things a little closer, but not by much.
