Contact

Ryan Cox

Popular Posts

java.util.uuid FAQ
XPath Engine Comparison
Author Tips

 RSS Feed

Aug
14th
2009
Fri
permalink

Lessons Learned from the GitHub Recommender Contest

I spent a few evenings last week working on a contest that GitHub is running to create a recommender engine for their site. Think Netflix Prize but much smaller scale. Their description:
The 2009 GitHub contest is a simple recommendation engine contest. We have released a dataset of our watched repositories data and want to provide a list of recommended repositories for each of our users. Removed from the sample dataset are 4,788 watches - write an open source program to guess the highest percentage of those removed watches and you win our prize.
My objective wasn’t to place first, but to understand the problem more deeply. I distilled down a few observations and presented them at BarCampRDU on Saturday. The discussion was lively and the feedback was positive:


A few people asked for my notes, so here they are…

Lesson 1: Simple is Best


I started off my preparation for the contest by reading a few papers on recommender systems that seemed relevant. Armed with this info, I formulated an approach based on co-occurrence of repositories: Create a matrix of repositories and calculate the number of times a pair of repositories is followed by the same user. Then to generate a recommendation, walk through a users follows accumulating the most frequently co-occuring repositories that he was not following already. Merge the results together for each repository a user follows, based on highest co-occurrence to provide the top 10 list. This took a while to get working right and execution was very, very slow.

 

It turns out this method wasn’t very effective. I was let down when my first score was just over 900 ( the top score at the time was over 2000). I then decided to take a closer look at both the data and at what everyone else was doing. Not all teams have their source checked in; this is particularly true of the top ranking teams. After some simple analysis on the data, I could see a clear problem: the data is sparse. As the plot below illustrates, most people follow a couple repositories and most repositories have just one or two followers.
R Code for above chart

I also noticed that some submissions using very simple techniques ranked more than twice as high as my submission. One of the most effective methods was simply to walk up the repository fork history and add everything encountered. Another technique made use of the fact that repository owner can be derived from the repository name. If a user follows a single repository, all other repositories sharing the same owner are added. I altered my approach; using these simpler techniques.


R Code for above chart

My entry, including code, can be found here.

Lesson 2: Collaborate



Participants are collaborating both implicitly, through checked in code, and explicitly. The most clear example of this can be seen in @chebuctonian and @xlvector who are working closely together.

This article in the NYTimes suggests that the lesson of collaboration was perhaps the most important that was learned in the Netflix Prize.
The biggest lesson learned, according to members of the two top teams, was the power of collaboration. It was not a single insight, algorithm or concept that allowed both teams to surpass the goal Netflix, the movie rental company, set nearly three years ago: to improve the movie recommendations made by its internal software by at least 10 percent, as measured by predicted versus actual one-through-five-star ratings by customers.

Lesson 3: Have a method

GitHub has done a nice job creating an automated scoring mechanism. As soon as results are committed via git, they are automatically scored and posted on the leaderboard. Top placing teams have created mechanisms to do similar tests offline without the commit phase, thus speeding up the testing of hypotheses. This allows for many ideas to be formulated and then systematically checked and ranked against previous findings.

A related point that we discussed at BarCampRDU was the efficacy of off-the shelf recommender systems. Directed Edge, for example, recently received funding.
Directed Edge provides an API that can be called over HTTP to provide recommendations. However, a general-pupose system like this allows for no domain-specific parameters to be included in the analysis. In the case of the GitHub contest, there is no way to factor in fork-history or repository owners. These were far and away the most effective parametrs to include in the recommendation. I applied for a develpoper key to their web-service. If I get a key, I will run the GitHub dataset and see how it ranks.

If I find time before the end of the month to revise my entry, I would like to refactor my code a more formalized KNN approach where each parameter has a boost co-efficient. Then I would let the computer walk through various boost co-efficients using simulated annealing. I realize this goes counter to lesson #1, but it seems like the simplest competitive approach given what I know now. It also has the benefit of letting the computer do the work for you.

Once completed at the end of the month, it will be interesting to see both the methods used by different teams as well as how the code is put into use by GitHub.

Minifig photo credits: kennymatic and Craig Rodway
Comments (View)    
Aug
23rd
2008
Sat
permalink

Excellent new book in the making: Computational Modeling and Complexity Science

  
Recently, in a post about his terrific, Programming Collective Intelligence, Toby Segaran calls out the “excluded middle” of technical books. His goal with with PCI was to cover that middle ground - “showing concepts, implementation and applications all at once”. Its a truly great book that will stand the test of time.

I’ve recently stumbled across another title that belongs in this middle ground: Computational Modeling and Complexity Science by Allen B. Downey. He’s also the author The Little Book of Semaphores, Think Python: An Introduction to Software Design, Learning Perl the Hard Way and a few other titles; all self published and released under the GPL.

According to Downey, CMCS, is about “algorithms, intermediate Python programming and complexity science”.

The book has a lot in common with PCI. Both:
  • Use Python as a basis for all examples
  • Span academia and pragmatism with working examples
  • Maintain an easy to read narrative
  • Are just the right length ( so far! )
CMCS gives just enough theory to provide context. The readability of the text is helped out with anecdotes from the lives of the referenced mathematicians. CMCS takes on analysis of social networks, queuing theory, stochastic modeling, cellular automata and a bunch of other great subjects.

In Chapter 7, Self-organized Criticality, there is some coverage of modeling forest fires:

In 1992 Drossel and Schwabl proposed a cellular automaton that is an abstract model of a forest fire. Each cell is in one of three states: empty, occupied by a tree, or on fire.

The rules of the CA are:
  1. An empty cell becomes occupied with probability p.
  2. A cell with a tree will burn if any of its neighbors is on fire.
  3. A cell with a tree will spontaneously burn, with probability f, even if none of its neighbors is on fire.
  4. A cell with a burning tree becomes an empty cell in the next time step.

    I couldn’t help pulling out PIL and gluing some Python together to try this out. Here’s what I came up with (converted to flash to keep filesize down):

    source

    Much of the book us still under construction and none of the referenced source URLs I tried were functional. That said, one suggestion I do have is to make more use of existing Python libraries. As an a example, the chapter on graphs includes the implementation of basic graph datastructures. This is understandable since this will no doubt be used in undergraduate classrooms, but there should at least be a pointer to NetworkX; a battle-tested warhorse that the reader should be aware of.

    It’s a great read so far and I’m looking forward to seeing how the book evolves.
    Comments (View)    
    Aug
    2nd
    2008
    Sat
    permalink

    Beneath the Bare Metal: Using dynamic profiling tools like DTrace and VProbes



    Just back from BarCampRDU. Lots of great sessions. I especially enjoyed Grant Ingersoll’s presentations on Mahout and Solr.

    I did a talk on use of dynamic tracing tools such as VProbes and DTrace. FWIW: Here are some links I referenced in the talk.

    DTrace



    VProbes

    Comments (View)    
    Jun
    11th
    2008
    Wed
    permalink

    Greg Kroah Hartman did a techtalk at Google recently. Some interesting stats related to the Linux 2.6.25 kernel development effort:

    • 2399 contributers over last 1.5 years
    • 1/2 of these have done one patch, 1/4 did 2
    • 9.2M Lines of Code
    • 10% year over year lines of code growth
    • ~55% of code in drivers - majority by a small bit, but not all drivers.
    • Change rate of kernel code to driver code is proportional. Translation: the core of OS is changing at a crazy rate.
    • New releases every 2.75 months or so
    • 9000 symbols and 12000 data structures inside the kernel (below the syscall layer)
    • Rate of change is insane
      • 4300 lines added
      • 1800 lines removed
      • 1500 lines modified
      • PER DAY!!!

    Comments (View)    
    Jun
    8th
    2008
    Sun
    permalink

    Pythonic API for Tumblr’s web API checked in

    When my Wordpress blog got hacked, I decided it was time to port over to Tumblr and let them deal with hosting, patching, etc… I put together a python script to import my exiting content via RSS. My original script grew into python-tumblr. Several samples are checked into SVN including one that uses Google Charts to visualize post type frequency.




    Comments, feedback, patches welcome…
    Comments (View)    
    Dec
    16th
    2007
    Sun
    permalink

    Google Charts Gets Invader’d

    Playing with Google Chart API and PyGoogleChart this afternoon after reading the announcement last week.

    Invader Chart

    A couple things to think about when using the API:

    • You will want to use a wrapper - Take one look at the link that generated the above graphic, and you’ll quickly see that this is not meant for humans. Maybe write your own wrapper around the bits that you are using. Maybe use one like PyGoogleChart.
    • It’s not just for web apps - Charts can be cached for later use using curl or equivilent. NB: IANAL so review the Terms of Service and usage policy and make your own decision.
    • You will have to normalize your data - The chart api docs talk quite a bit about reducing number of elements in your datasets. Another important thing to think about is the range. For instance if I wanted to chart the last few days closing values for the CAC 40 Index which varies from 5,547.21 to 5,750.92. I would have to first normalize the values to something under 4096 for Extended Encoding, under 100 for Text encoding and less than 62 for Simple Encoding. Then the y-axis labels would have to be adjusted to reflect the normalization. Example

    It’s pretty clear we’ve only seen the beginning. People outside will find new and unanticipated ways of using the API and Google will continue to crank out new features.

    Inspiration for above:

    Comments (View)    
    Mar
    3rd
    2007
    Sat
    permalink

    Luke 1.7 is out

    Luke 1.7

    After a year or so of shut-eye, Luke, the Lucene query tool, is back. New features listed below. Time to dust off some patches I’ve got for an earlier version and send them along for 1.8.

    Luke 1.7 features:

    • Added pagination of results, especially useful for very large result sets.
    • Added support for new Field flags. They are now displayed in the Document details, and also can be set on edited documents.
    • Added a function to add new documents to the index.
    • Low-level index operations (such as detecting unused files, index directory cleanup) use the newly exposed Lucene classes instead of duplicating their internals in Luke.
    • A side-effect of the above is the ability to properly cleanup all supported index formats, including the new lockless and single-norm indexes.
    • Added a function to copy the list of top terms to clipboard.
    • Added a function to copy the term vector to clipboard.
    • Added a function to close and/or re-open the current index.
    • In the Documents tab, pressing “First Term” now positions the term enumeration at the first term for the selected field.
    • Added a field vocabulary analysis plugin by Mark Harwood, with some modifications.
    • Overall UI cleanup - improved layout in some places, added graphics instead of ASCII art, etc.

    Nicely done Andrzej.

    In other Lucene related news, it looks like some interesting things are brewing in the Hadoop project.

    Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.

    Comments (View)    
    Feb
    24th
    2007
    Sat
    permalink

    The Definitive ANTLR Reference


    As reported elsewhere, the Pragmatic guys have shepherded another great book through their process: The Definitive ANTLR Reference. Though not intended as a how-to, I’m finding it very readable.

    Last time I looked at ANTLR, the ANTLRWorks tool ( pictured above ) didn’t exist. It’s got some quirky aspects and rough edges, but it makes exploring and learning by example a reality. Recommended!

    Comments (View)    
    Jan
    1st
    2007
    Mon
    permalink

    VIM Plugin for JavascriptLint


    I’ve put together a vim compiler plugin to provide quick access to JavaScriptLint output. It will identify file, line and column of warnings and errors generated by jsl.

    Requirements:

    • vim 7+
    • jsl in path

    Installation:

    • Download vimball - jsl.vba
    • Open vimball - vim jsl.vba
    • Verify contents - :VimballList
      Should see:
      would extract <plugin/jsl.vim>: 1 lines
      would extract <compiler/jsl.vim>: 22 lines
      
    • Install - :so %

    Usage:

    • Open file - vim jsl-test.js
    • Run JavaScriptLit - :make
    • View output - :cope

    Feedback welcome…

    Comments (View)    
    Jun
    21st
    2006
    Wed
    permalink

    Raleigh NFJS 2006


    Recently Jay brought his current lineup of speakers to town for the 2006 NFJS Symposium Tour . Lots of great speakers. Packed with nuts and bolts details. It was buzzword compliant, with plenty of talks on AJAX and Ruby. Justin Gehtland’s talks on hibernate were excellent and can be found on codecite . These are polished speakers who have given these sessions many times prior. Dave Thomas could make event the most prosaic topic engaging and entertaining. The rare bit of wit, humor and humanity go a long way in this field.Points that stood out:

    Keep Track of Your Ideas

    Andy Hunt talked at length about keeping track of the ideas that occur to you throughout the day. Method doesn’t really matter. Some techniques he proposed: spacepen and pad, voice memos on your cell phone, wiki or wiki modes in your editor. I’ve installed potwiki to give it a try. Elsewhere on the web Michael Michalko seems to have a few things to say on this topic.

    I like Michalko’s suggestions for developing ideas once captured:

    • What can be SUBSTITUTED? (Who else? What else? Other ingredient? Other process? Other power? Other place? Other approach? Can you change the rules?)
    • What can be COMBINED? (How about a blend, an alloy, an assortment, an ensemble? Combine units? Combine purposes with something else? Combine appeals? Combine ideas?)
    • What can I ADAPT from something else to the idea? (What else is like this? What other idea does this suggest? Does the past offer a parallel? What could I copy? Whom could I emulate?)
    • What can I MAGNIFY? (What can be added? More time? Stronger? Higher? Longer? Extra value? Extra features? Duplicate? Multiply? Exaggerate?)
    • What can I MODIFY or change? (What can be altered? New twist? Change meaning, color, motion, sound, odor, form, shape? What other changes can be made?)
    • Can I PUT the idea TO OTHER USES? (New ways to use as? Other uses if modified? Can you make it do more things? Other extensions? Other spin-off? Other markets?)
    • What can be ELIMINATED? (What to subtract? Smaller? Condensed? Miniature? Lower? Shorter? Lighter? Omit? Streamline? Split up? Understate?)
    • What can be REARRANGED the parts? (What other arrangement might be better? Interchange components? Other pattern? Other layout? Other sequence? Transpose cause and effect? Change pace? Change schedule?)
    • Can it be REVERSED? (Transpose positive and negative? How about opposites? Turn it upside down? Reverse roles? Consider it backwards? What if you did the unexpected?)

    Class Oriented Development

    During his keynote Dave Thomas took more than a few well-deserved pot-shots at the received wisdom in the software community. He ticked through some of the usual suspects: type safety, methodologies, backward compatibility. Simply stated, his thesis was: if you don’t a have direct and clear understanding of why you are doing something: stop doing it. One point that resonated with me was his characterization of Java and many other object-oriented languages as really class-oriented. We spend our days developing sophisticated class relationships and hierarchies. Is-a. Has-a. While these languages provide anemic support for actually orienting your development around objects; the heart and soul of your application at runtime. Few true hierararchical taxonomies exist in the real world. Classes were invented by stamp-collectors. Why are we trying to put the world into neat little boxes?

    This was of course a shameless plug for features like mix-ins that exist in many modern programming languages . Either way this is an idea worth kicking around the block.

    Learn By Doing

    In his Refactoring your Wetware talk Andy Hunt drew on the example of surgerical residents who follow the pattern of observe, assist and perform. For surgeons there is no substitute for doing. The same is probably true for developers.

    Put down that book. Close Firefox. mkdir testapp

    Comments (View)