Words by c.z.robertson

Power laws, collaborative filtering, and blogrolls

2003-02-13 22:21:22 UTC

Chiming in on the power laws discussion... It's not just blogs that follow a power law distribution in terms of popularity. It happens in music too, and I've spent some time thinking about the problem. For several years I've had a fantasy of building a collaborative filtering system for music. According to my thought-experiments, a CF system isn't susceptible to power law problems.

CF systems find correlations between the tastes of the system's users and use them to make recommendations. Each user needs to supply a list of the things they like and the things they dislike; once you've got a reasonable number of those lists, you can start finding correlations and making useful recommendations.
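To make that concrete, here is a minimal sketch of the idea, not a real system. The user names, item names, the +1/-1 rating scale, and the choice of cosine similarity are all assumptions of mine for illustration; there are plenty of other ways to measure how alike two users' tastes are.

```python
from math import sqrt

# Hypothetical ratings: +1 means "like", -1 means "dislike".
ratings = {
    "alice": {"band_a": 1, "band_b": 1, "band_c": -1},
    "bob": {"band_a": 1, "band_b": 1, "band_d": 1},
    "carol": {"band_a": -1, "band_c": 1},
}


def similarity(a, b):
    """Cosine similarity between two users' rating dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=3):
    """Score items the user has not rated by similarity-weighted votes."""
    mine = ratings[user]
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        weight = similarity(mine, theirs)
        for item, value in theirs.items():
            if item not in mine:
                scores[item] = scores.get(item, 0.0) + weight * value
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Keep only items with a positive overall score.
    return [item for item, score in ranked[:top_n] if score > 0]


print(recommend("alice", ratings))
```

Here alice and bob agree on two bands, so bob's other like gets passed on to alice, while carol (who disagrees with alice on everything they share) contributes nothing.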

There are two learning hurdles to overcome. First, the system has to gain enough users and ratings to give it good coverage of the domain of items it's making recommendations on. Second, each new user must provide enough information about their preferences for the system to know them well enough to make relevant recommendations. I've never implemented a CF system, so I don't know how bad these hurdles are in practice. They may not be significant, and it probably varies from domain to domain anyway.

But I've just had a flash of inspiration: In the blogging world we have a lot of information about preferences already. All those blogrolls and OPML files that we have constitute a good set of preference information that could be used to bootstrap a CF system.

One obvious drawback comes to mind: Blogrolls and OPML files only contain information about positive preferences. They only say "I like this". It's a start, but a proper system would also contain statements that say "I dislike this".
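As a rough sketch of the bootstrapping step, this is how "I like this" statements might be pulled out of an OPML subscription list using Python's standard XML parser. The sample document and the function name are made up; I'm assuming the usual aggregator convention of `outline` elements carrying an `xmlUrl` (or `htmlUrl`) attribute, and real-world files may well be messier.

```python
import xml.etree.ElementTree as ET


def likes_from_opml(opml_text):
    """Return the set of URLs listed in an OPML document."""
    root = ET.fromstring(opml_text)
    likes = set()
    # iter() walks nested outlines too, so folders are handled.
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl") or outline.get("htmlUrl")
        if url:
            likes.add(url)
    return likes


# Hypothetical subscription list.
sample = """<opml version="1.1">
  <head><title>My subscriptions</title></head>
  <body>
    <outline text="Example A" xmlUrl="http://a.example/rss.xml"/>
    <outline text="Folder">
      <outline text="Example B" htmlUrl="http://b.example/"/>
    </outline>
  </body>
</opml>"""

# Every entry becomes a positive preference; there is nothing
# in the file that could be read as a dislike.
preferences = {url: 1 for url in likes_from_opml(sample)}
print(preferences)
```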

So then I got on to thinking about the architecture of a CF system, and I had the idea of storing all the preference information in a file on the user's website. It would be handy to do that, because then I wouldn't have to worry about storing a load of user accounts and all the bother that would go along with that. I could just have a user give the URL of their preferences file and I could poll that every so often to keep my preferences database up-to-date.
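A sketch of what that polling might look like follows. The file format (a JSON object with `likes` and `dislikes` lists) is purely hypothetical, as are the function names and URLs; the fetching is passed in as a function so the example runs without touching the network.

```python
import json


def parse_preferences(text):
    """Parse a hypothetical file: {"likes": [...], "dislikes": [...]}."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("preferences file must be a JSON object")
    prefs = {item: 1 for item in data.get("likes", [])}
    prefs.update({item: -1 for item in data.get("dislikes", [])})
    return prefs


def poll(database, urls, fetch):
    """Refresh the database from each user's preferences file.

    `fetch` is any callable mapping a URL to the file's text. If a
    fetch or parse fails, whatever was stored for that URL is kept.
    """
    for url in urls:
        try:
            database[url] = parse_preferences(fetch(url))
        except (OSError, ValueError):
            continue
    return database


# Stand-in for an HTTP fetch so the sketch runs offline.
files = {
    "http://example.org/prefs.json":
        '{"likes": ["blog_a"], "dislikes": ["blog_b"]}',
}


def fake_fetch(url):
    if url not in files:
        raise OSError("unreachable: " + url)
    return files[url]


db = poll(
    {},
    ["http://example.org/prefs.json", "http://example.net/missing.json"],
    fake_fetch,
)
print(db)
```

The URL of the file doubles as the user's identity, which is what lets the system get away without user accounts. In a real deployment `fetch` could be something like a thin wrapper around `urllib.request.urlopen`.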

A query could then contain either the URL of the user's preferences file or an inline list of preferences, along with the instruction to find the things that someone with those preferences would be most likely to like.

(The CF weenies among you may like to know that I'm thinking here of an item-based rather than a user-based model. It would be more difficult to use this sort of architecture with a user-based system.)
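For the curious, here is a minimal sketch of an item-based model answering that kind of query from positive-only data. Scoring by raw co-occurrence counts is my own simplifying assumption, and the URLs and blog names are invented. The point it illustrates is that the query only needs a list of likes: the person asking doesn't have to be in the database at all.

```python
from collections import defaultdict
from itertools import combinations


def build_item_model(likes_by_user):
    """Count, for each pair of items, how many users like both."""
    cooccur = defaultdict(lambda: defaultdict(int))
    for likes in likes_by_user.values():
        for a, b in combinations(sorted(set(likes)), 2):
            cooccur[a][b] += 1
            cooccur[b][a] += 1
    return cooccur


def query(model, liked, top_n=3):
    """Rank items that co-occur most often with the supplied likes."""
    liked = set(liked)
    scores = defaultdict(int)
    for item in liked:
        for other, count in model.get(item, {}).items():
            if other not in liked:
                scores[other] += count
    # Highest count first; ties broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical database, keyed by preferences-file URL.
likes_by_user = {
    "http://u1.example/prefs": ["blog_a", "blog_b", "blog_c"],
    "http://u2.example/prefs": ["blog_a", "blog_b"],
    "http://u3.example/prefs": ["blog_b", "blog_d"],
}

model = build_item_model(likes_by_user)
print(query(model, ["blog_a"]))
```

Raw counts like these will tend to favour whatever is already popular, so dividing each count by how often the items appear overall would be a natural refinement.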