I spent a few evenings last week working on a contest that GitHub
is running to create a recommender engine for their site. Think Netflix Prize
but much smaller scale. Their description:
The 2009 GitHub contest is a simple recommendation engine contest. We have released a dataset of our watched repositories data and want to provide a list of recommended repositories for each of our users. Removed from the sample dataset are 4,788 watches - write an open source program to guess the highest percentage of those removed watches and you win our prize.
My objective wasn’t to place first, but to understand the problem more deeply. I distilled down a few observations and presented them at BarCampRDU
on Saturday. The discussion was lively and the feedback was positive.
A few people asked for my notes, so here they are…
Lesson 1: Simple is Best
I started off my preparation for the contest by reading a few papers on recommender systems that seemed relevant. Armed with this info, I formulated an approach based on co-occurrence of repositories: create a matrix of repositories and count the number of times each pair of repositories is followed by the same user. Then, to generate a recommendation, walk through a user's follows, accumulating the most frequently co-occurring repositories that he was not already following. Merge the results for each repository the user follows, ranked by highest co-occurrence, to produce the top-10 list. This took a while to get working right, and execution was very, very slow.
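The idea can be sketched roughly like this. The watch data and names below are made up for illustration; my actual entry (linked further down) is the real implementation:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical watch data: user -> set of watched repository ids.
watches = {
    "u1": {"rails", "git", "prototype"},
    "u2": {"rails", "git"},
    "u3": {"rails", "merb"},
}

# Co-occurrence counts: how often each pair of repositories
# is watched by the same user.
cooccur = defaultdict(lambda: defaultdict(int))
for repos in watches.values():
    for a, b in combinations(sorted(repos), 2):
        cooccur[a][b] += 1
        cooccur[b][a] += 1

def recommend(user, k=10):
    """Accumulate co-occurrence counts over the user's watches and
    return the top-k repositories they don't already watch."""
    seen = watches[user]
    scores = defaultdict(int)
    for repo in seen:
        for other, count in cooccur[repo].items():
            if other not in seen:
                scores[other] += count
    return [r for r, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:k]
```

The slowness in practice comes from the pair enumeration: a repository with thousands of watchers contributes a quadratic number of pairs.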
It turns out this method wasn't very effective. I was let down when my first score was just over 900 (the top score at the time was over 2000). I then decided to take a closer look at both the data and at what everyone else was doing. Not all teams have their source checked in; this is particularly true of the top-ranking teams. After some simple analysis of the data, I could see a clear problem: the data is sparse. As the plot below illustrates, most people follow only a couple of repositories, and most repositories have just one or two followers.
[Chart omitted. R code for the above chart was linked in the original post.]
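The counts behind a plot like this take only a few lines to compute. This sketch assumes the contest's data.txt layout of one `user_id:repo_id` watch per line; the sample lines are made up:

```python
from collections import Counter

# A few sample lines in the data.txt layout (user_id:repo_id).
lines = ["1:1234", "2:1234", "2:5678", "3:9012"]

user_counts = Counter()   # watches per user
repo_counts = Counter()   # watchers per repository
for line in lines:
    user, repo = line.strip().split(":")
    user_counts[user] += 1
    repo_counts[repo] += 1

# Distribution: how many repositories have exactly n watchers?
watcher_dist = Counter(repo_counts.values())
```

Plotting `watcher_dist` (or the analogous per-user distribution) shows the long tail immediately.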
I also noticed that some submissions using very simple techniques scored more than twice as high as mine. One of the most effective methods was simply to walk up the repository's fork history and add everything encountered. Another made use of the fact that the repository owner can be derived from the repository name: if a user follows a single repository, all other repositories sharing the same owner are added. I altered my approach to use these simpler techniques.
[Chart omitted. R code for the above chart was linked in the original post.]
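The two heuristics above can be sketched as follows. The parent table and repository list here are made-up stand-ins for what the contest metadata actually provides:

```python
# Hypothetical metadata: fork -> the repository it was forked from,
# plus a flat list of "owner/name" repositories.
parent = {"fred/rails": "dhh/rails"}
repos = ["dhh/rails", "dhh/prototype", "fred/rails", "sam/merb"]

def owner_of(full_name):
    """The owner is just the part of the name before the slash."""
    return full_name.split("/")[0]

def fork_ancestry(repo):
    """Walk up the fork chain, collecting every ancestor."""
    found = []
    while repo in parent:
        repo = parent[repo]
        found.append(repo)
    return found

def same_owner(repo):
    """All other repositories that share this repository's owner."""
    owner = owner_of(repo)
    return [r for r in repos if owner_of(r) == owner and r != repo]
```

Both are trivial lookups, which is exactly why they run so much faster than the co-occurrence matrix.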
My entry, including code, can be found here.
Lesson 2: Collaborate
Participants are collaborating both implicitly, through checked-in code, and explicitly. The clearest example of this can be seen in @chebuctonian and the participants around him, who are working closely together.
An article in the NYTimes suggests that the lesson of collaboration was perhaps the most important one learned in the Netflix Prize:
The biggest lesson learned, according to members of the two top teams, was the power of collaboration. It was not a single insight, algorithm or concept that allowed both teams to surpass the goal Netflix, the movie rental company, set nearly three years ago: to improve the movie recommendations made by its internal software by at least 10 percent, as measured by predicted versus actual one-through-five-star ratings by customers.
Lesson 3: Have a method
GitHub has done a nice job creating an automated scoring mechanism. As soon as results are committed via git, they are automatically scored and posted on the leaderboard. Top placing teams have created mechanisms to do similar tests offline without the commit phase, thus speeding up the testing of hypotheses. This allows for many ideas to be formulated and then systematically checked and ranked against previous findings.
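An offline scorer along these lines can be sketched as follows. The holdout scheme and recommender signature are my own assumptions for illustration, not the exact mechanism any team used:

```python
import random

def offline_score(watches, recommender, holdout_per_user=1, seed=0):
    """Hide some watches per user, rebuild recommendations from the
    rest, and report the fraction of hidden watches recovered -- a
    local stand-in for the commit-and-score loop."""
    rng = random.Random(seed)
    hidden, training = {}, {}
    for user, repos in watches.items():
        repos = list(repos)
        rng.shuffle(repos)
        hidden[user] = set(repos[:holdout_per_user])
        training[user] = set(repos[holdout_per_user:])
    hits = total = 0
    for user, held in hidden.items():
        guesses = set(recommender(training, user))
        hits += len(held & guesses)
        total += len(held)
    return hits / total
```

With this in place, each new hypothesis becomes a one-line call rather than a commit-and-wait cycle.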
A related point that we discussed at BarCampRDU was the efficacy of off-the-shelf recommender systems. Directed Edge, for example, recently received funding.
Directed Edge provides an API that can be called over HTTP to provide recommendations. However, a general-purpose system like this allows no domain-specific parameters to be included in the analysis. In the case of the GitHub contest, there is no way to factor in fork history or repository owners, which were far and away the most effective parameters to include in the recommendation. I applied for a developer key to their web service. If I get a key, I will run the GitHub dataset through it and see how it ranks.
If I find time before the end of the month to revise my entry, I would like to refactor my code into a more formalized KNN
approach where each parameter has a boost coefficient. Then I would let the computer walk through various boost coefficients using simulated annealing. I realize this runs counter to lesson #1, but it seems like the simplest competitive approach given what I know now. It also has the benefit of letting the computer do the work for you.
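As a sketch of what I have in mind (the scoring function, perturbation scheme, and cooling schedule below are my own assumptions, not a worked implementation):

```python
import math
import random

def combined_score(signals, boosts):
    """Weight each per-signal score (co-occurrence, fork ancestry,
    shared owner, ...) by its boost coefficient."""
    return sum(b * s for b, s in zip(boosts, signals))

def anneal(objective, start, steps=200, temp=1.0, cooling=0.98, seed=0):
    """Simulated annealing over a boost vector: perturb one
    coefficient at a time, keep improving moves, and occasionally
    accept a worse one with probability exp(delta / temperature)."""
    rng = random.Random(seed)
    current = list(start)
    cur_val = objective(current)
    best, best_val = list(current), cur_val
    for _ in range(steps):
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.gauss(0, 0.1)   # small random perturbation
        val = objective(candidate)
        delta = val - cur_val
        if delta > 0 or rng.random() < math.exp(delta / temp):
            current, cur_val = candidate, val
            if cur_val > best_val:
                best, best_val = list(current), cur_val
        temp *= cooling
    return best, best_val
```

Here `objective` would be the offline score from lesson 3 evaluated at a given boost vector, so the computer really does do the work.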
Once the contest wraps up at the end of the month, it will be interesting to see both the methods used by the different teams and how the code is put into use by GitHub.
Minifig photo credits: kennymatic and Craig Rodway