GroupLens: Applying Collaborative Filtering to Usenet News

Konstan, Miller, Maltz, Herlocker, Gordon, and Riedl
Reviewed by Alex Feinman

First, some definitions:

Collaborative filtering
A system in which each user rates the items in question (here, news articles), and the combined ratings are used to predict a rating for each new reader.
Signal-to-noise ratio
From the Jargon File:

signal-to-noise ratio [from analog electronics] /n./
Used by hackers in a generalization of its technical meaning. `Signal' refers to useful information conveyed by some communications medium, and `noise' to anything else on that medium. Hence a low ratio implies that it is not worth paying attention to the medium in question. Figures for such metaphorical ratios are never given. The term is most often applied to Usenet newsgroups during flame wars. Compare bandwidth. See also coefficient of X, lost in the noise.

GroupLens

After a pilot study, the authors implemented a collaborative filtering system on a subset of Usenet groups:

(Alex's aside:
These groups vary from really low signal-to-noise (rec.humor; that's why rec.humor.funny was invented) to moderate S/N (comp.groupware; a small readership means fewer clueless people). They also vary in pertinence from "just about every (non-junk) message is important" (comp.os.linux.announce) to "really, only about 1 message in 10 is interesting to me" (rec.food.recipes). I'm not sure whether the authors were aware they were picking such a wide spread when they set up the experiment; I get the feeling they were.)

The pilot study was followed by a public trial, in which users were invited to download the filtering software free of charge. 250 users signed up and submitted a total of 47,569 ratings; from that input, the system generated 600,000 predicted ratings for 22,862 articles.

How it works

The user downloads and installs a replacement news reader. While reading each article, s/he can rate it on a scale of 1 to 5 ("really bad" to "really good"). This rating is sent back to the central filtering server. The accumulated ratings for the different newsgroups are shown in Figure 2, on page 79. The server then correlates these ratings with other users' ratings and magically comes up with a predictive rating. I guess the technical details were only revealed in their other paper.
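
Since the paper leaves the correlation step out, here is a minimal Python sketch of one plausible way to do it. It is an assumption on my part, not the authors' confirmed method: each reader's prediction is their own average rating, nudged by what other readers thought of the article, with each of those readers weighted by how well their past ratings correlate (Pearson) with yours. The names `pearson`, `predict`, and the toy `ratings` table are all made up for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over the articles that both users rated."""
    shared = [k for k in a if k in b]
    if len(shared) < 2:
        return 0.0  # not enough overlap to say anything
    mean_a = sum(a[k] for k in shared) / len(shared)
    mean_b = sum(b[k] for k in shared) / len(shared)
    cov = sum((a[k] - mean_a) * (b[k] - mean_b) for k in shared)
    var_a = sum((a[k] - mean_a) ** 2 for k in shared)
    var_b = sum((b[k] - mean_b) ** 2 for k in shared)
    if var_a == 0 or var_b == 0:
        return 0.0  # one user gave identical ratings; correlation undefined
    return cov / sqrt(var_a * var_b)


def predict(ratings, user, article):
    """Predict `user`'s rating of `article` on the 1-5 scale.

    Assumed scheme: the user's mean rating plus the correlation-weighted
    average of how far other raters deviated from their own means.
    """
    own = ratings[user]
    own_mean = sum(own.values()) / len(own)
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or article not in theirs:
            continue
        weight = pearson(own, theirs)
        other_mean = sum(theirs.values()) / len(theirs)
        num += weight * (theirs[article] - other_mean)
        den += abs(weight)
    if den == 0:
        return own_mean  # nobody comparable has rated it yet
    return min(5.0, max(1.0, own_mean + num / den))


# Toy data, invented for illustration: ann agrees with bob, disagrees with cat.
ratings = {
    "ann": {"a1": 5, "a2": 1, "a3": 4},
    "bob": {"a1": 4, "a2": 2, "a3": 5, "a4": 5},
    "cat": {"a1": 1, "a2": 5, "a3": 2, "a4": 1},
}
prediction_for_ann = predict(ratings, "ann", "a4")
```

Note that a reader who consistently disagrees with you is still useful here: cat's low rating of a4 pushes ann's prediction up, because the two are negatively correlated.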

How well it works

The experimenters then applied standard statistical methods to these ratings to determine whether they were of some use. In Figure 3, they show the user pair correlations for some newsgroups; a high correlation indicates that most people thought the same thing about an article. The figure shows that, for example, most people liked the same jokes in rec.humor (or at least thought the same messages were junk), whereas few people liked the same recipes in rec.food.recipes. They also collected statistics showing that a user's personalized rating system predicted ratings more accurately than an averaged rating system. Surprise, surprise -- people have different tastes.
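
To make "predicted ratings more accurately" concrete, here is a small Python sketch using mean absolute error, one common accuracy measure. Whether it matches the statistic the authors actually report is an assumption, and the numbers below are toy values invented for illustration, not data from the paper.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two equal-length, non-empty sequences")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Toy numbers (hypothetical): what one reader really gave four articles,
# versus a personalized prediction and a one-size-fits-all average.
actual = [5, 1, 4, 2]
personalized = [4.5, 1.5, 4.0, 2.5]
averaged = [3.0, 3.0, 3.5, 3.0]

mae_personalized = mean_absolute_error(personalized, actual)
mae_averaged = mean_absolute_error(averaged, actual)
```

A lower error means the predictions sit closer to what the reader actually thought; in this made-up case the personalized predictions win, which is the shape of the result the paper describes.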

More important than statistical success, perhaps, is the user's perception of whether a filtering system is working. This appears not to be a major issue in this paper; perhaps it is discussed in one of the other papers on GroupLens. I, for one, would be interested in knowing what people who use the system consistently think of it.


The Alexa Effect


Or, how the Internet can make anyone rich!

Set up a search engine, and they will come. Alexa crawls the Web every night, and keeps all the archived pages. It recently donated a copy of "The Web" to the Library of Congress, whatever that means.

After you install the Alexa software on your PC (or Mac, in alpha), it keeps track of where you go. Isn't this the sort of privacy invasion that had everyone in such an uproar a few years ago with cookies? How quickly we forget. Anyway, it tracks correlations in sequentially visited web sites, and puts together a list of these correlations. For example, many people check the weather and news in the morning when they log in; hence, there is a lot of traffic that jumps from weather.com to cnn.com, or vice versa. So Alexa constructs a correlation between these web sites.
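
The weather.com/cnn.com example can be sketched in a few lines of Python. This is only a guess at the general idea (counting how often two sites are visited back to back, in either order), not Alexa's actual algorithm; `covisit_counts`, `related_links`, and the sample sessions are hypothetical.

```python
from collections import Counter


def covisit_counts(sessions):
    """Count how often two sites appear back to back in a browsing session.

    Pairs are unordered, so weather.com -> cnn.com and cnn.com -> weather.com
    add to the same tally.
    """
    counts = Counter()
    for visits in sessions:
        for here, there in zip(visits, visits[1:]):
            if here != there:  # ignore reloads of the same site
                counts[frozenset((here, there))] += 1
    return counts


def related_links(counts, site, top_n=3):
    """Sites most often visited right before or after `site`."""
    scored = []
    for pair, n in counts.items():
        if site in pair:
            (other,) = pair - {site}
            scored.append((n, other))
    scored.sort(key=lambda t: (-t[0], t[1]))  # most frequent first, then by name
    return [other for n, other in scored[:top_n]]


# Hypothetical morning sessions from three users.
sessions = [
    ["weather.com", "cnn.com", "example.org"],
    ["cnn.com", "weather.com"],
    ["weather.com", "cnn.com"],
]
counts = covisit_counts(sessions)
links_for_cnn = related_links(counts, "cnn.com")
```

A real system would presumably normalize by each site's overall traffic, otherwise the most popular sites end up "related" to everything.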

While you browse, Alexa's toolbar shows these correlations as "Related Links". I wasn't able to find out whether Netscape stole this idea outright for Navigator 4.5 or licensed it, but the same thing exists in Navigator now. Alexa also shows you site stats, like how many people visited the site; an ad, whether you want to see it or not; a somewhat useless archive of the web, for recovering out-of-date and obsolete web pages; and a link to a few on-line reference tools.

It is useful to note that the ad is larger than any other link on the toolbar, and wedged in between useful tools. Great product placement, dude.