4 August 2010
motionless: taking the pain out of Google Static Maps

I have found myself needing to generate several map visualizations lately. Creating the complex URLs for Google Static Maps by hand is time-consuming and error-prone. To make the process more civilized, I’ve put together a simple Python library. It supports markers and paths as well as simple centered maps. All the details can be found on GitHub. In brief, install with pip:

pip install motionless

Simple map centered on the Eiffel Tower:


from motionless import CenterMap 
cmap = CenterMap(lat=48.858278,lon=2.294489,maptype='satellite') 
print cmap.generate_url()

A slightly more complex example, generated from a GPX file from my Garmin device, shows how to plot paths and add markers. The full code is in the examples directory; a sketch follows.
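
Something along these lines, assuming the DecoratedMap, LatLonMarker and add_path_latlon pieces described in the README (the coordinates below are arbitrary stand-ins, not my actual track):

from motionless import DecoratedMap, LatLonMarker

# A rough stand-in for the GPX example: one labeled marker plus a
# short two-point path. Coordinates are arbitrary, not a real track.
dmap = DecoratedMap()
dmap.add_marker(LatLonMarker(48.858278, 2.294489, label='S'))
dmap.add_path_latlon(48.858278, 2.294489)
dmap.add_path_latlon(48.853000, 2.349900)
print dmap.generate_url()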

14 August 2009
Lessons Learned from the GitHub Recommender Contest
I spent a few evenings last week working on a contest that GitHub is running to create a recommender engine for their site. Think Netflix Prize, but at a much smaller scale. Their description:
The 2009 GitHub contest is a simple recommendation engine contest. We have released a dataset of our watched repositories data and want to provide a list of recommended repositories for each of our users. Removed from the sample dataset are 4,788 watches - write an open source program to guess the highest percentage of those removed watches and you win our prize.
My objective wasn’t to place first, but to understand the problem more deeply. I distilled down a few observations and presented them at BarCampRDU on Saturday. The discussion was lively and the feedback was positive:


A few people asked for my notes, so here they are…

Lesson 1: Simple is Best


I started off my preparation for the contest by reading a few papers on recommender systems that seemed relevant. Armed with this info, I formulated an approach based on co-occurrence of repositories: create a matrix of repositories and count the number of times each pair of repositories is followed by the same user. To generate a recommendation, walk through a user’s follows, accumulating the most frequently co-occurring repositories they were not already following. Then merge the results for each followed repository, ranked by highest co-occurrence, into a top-10 list. This took a while to get working right, and execution was very, very slow.
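
In rough Python 2, the idea looked something like this (a sketch, not my actual entry; data loading is simplified, assuming the contest’s data.txt pairs ids as "user:repo" per line):

from collections import defaultdict
from itertools import combinations

# follows[user] is the set of repositories that user watches.
follows = defaultdict(set)
for line in open('data.txt'):
    user, repo = line.strip().split(':')
    follows[user].add(repo)

# Count how often each pair of repositories is watched by the same
# user. This is the slow part: it grows with the square of the size
# of each user's watch list.
cooccur = defaultdict(lambda: defaultdict(int))
for repos in follows.itervalues():
    for a, b in combinations(repos, 2):
        cooccur[a][b] += 1
        cooccur[b][a] += 1

def recommend(user, n=10):
    # Accumulate co-occurrence counts for repositories the user does
    # not already follow, then take the top n.
    scores = defaultdict(int)
    for repo in follows[user]:
        for other, count in cooccur[repo].iteritems():
            if other not in follows[user]:
                scores[other] += count
    ranked = sorted(scores.iteritems(), key=lambda kv: -kv[1])
    return [repo for repo, _ in ranked[:n]]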

It turns out this method wasn’t very effective. I was let down when my first score was just over 900 (the top score at the time was over 2000). I then decided to take a closer look both at the data and at what everyone else was doing. Not all teams have their source checked in; this is particularly true of the top-ranking teams. After some simple analysis of the data, I could see a clear problem: the data is sparse. As the plot below illustrates, most people follow only a couple of repositories, and most repositories have just one or two followers.
R Code for above chart

I also noticed that some submissions using very simple techniques ranked more than twice as high as mine. One of the most effective methods was simply to walk up a repository’s fork history and add everything encountered. Another made use of the fact that the repository owner can be derived from the repository name: if a user follows a single repository, all other repositories sharing the same owner are added. I altered my approach to use these simpler techniques; a sketch of both heuristics follows below.


R Code for above chart
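
Both heuristics are only a few lines each. A sketch, reusing the follows mapping from above (repo_owner and forked_from are stand-ins for data parsed from the contest’s repos.txt; the parsing itself is elided):

from collections import defaultdict

# repo_owner: repo id -> owner name; forked_from: repo id -> parent
# repo id. Both stand in for fields parsed out of repos.txt.
owner_repos = defaultdict(set)
for repo, owner in repo_owner.iteritems():
    owner_repos[owner].add(repo)

def fork_ancestors(repo):
    # Walk up the fork history, collecting every ancestor encountered.
    found = []
    while repo in forked_from:
        repo = forked_from[repo]
        found.append(repo)
    return found

def same_owner(user):
    # Everything else published by the owners of repos the user follows.
    candidates = set()
    for repo in follows[user]:
        candidates.update(owner_repos[repo_owner[repo]])
    return candidates - follows[user]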

My entry, including code, can be found here.

Lesson 2: Collaborate



Participants are collaborating both implicitly, through checked-in code, and explicitly. The clearest example of this can be seen in @chebuctonian and @xlvector, who are working closely together.

This article in the NYTimes suggests that the lesson of collaboration was perhaps the most important that was learned in the Netflix Prize.
The biggest lesson learned, according to members of the two top teams, was the power of collaboration. It was not a single insight, algorithm or concept that allowed both teams to surpass the goal Netflix, the movie rental company, set nearly three years ago: to improve the movie recommendations made by its internal software by at least 10 percent, as measured by predicted versus actual one-through-five-star ratings by customers.

Lesson 3: Have a method

GitHub has done a nice job creating an automated scoring mechanism. As soon as results are committed via git, they are automatically scored and posted to the leaderboard. Top-placing teams have created mechanisms to run similar tests offline, without the commit step, which speeds up the testing of hypotheses. This allows many ideas to be formulated and then systematically checked and ranked against previous findings.
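
An offline harness amounts to holding some watches out of the training data and counting how many a recommender recovers. A sketch (the 4,788 mirrors the size of the contest’s hidden set; the scoring here is an approximation of the leaderboard’s):

import random

def holdout(follows, n=4788, seed=1):
    # Remove n random (user, repo) watches to stand in for the
    # contest's hidden test set; return the hidden pairs for scoring.
    rng = random.Random(seed)
    pairs = [(u, r) for u, repos in follows.iteritems() for r in repos]
    hidden = rng.sample(pairs, n)
    for user, repo in hidden:
        follows[user].discard(repo)
    return hidden

def score(recommend, hidden):
    # One point per held-out watch that shows up in the user's guesses.
    return sum(1 for user, repo in hidden if repo in recommend(user))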

A related point that we discussed at BarCampRDU was the efficacy of off-the-shelf recommender systems. Directed Edge, for example, recently received funding.
Directed Edge provides an API that can be called over HTTP to provide recommendations. However, a general-purpose system like this allows no domain-specific parameters to be included in the analysis. In the case of the GitHub contest, there is no way to factor in fork history or repository owners, which were far and away the most effective parameters to include in the recommendation. I have applied for a developer key to their web service; if I get one, I will run the GitHub dataset through it and see how it ranks.

If I find time before the end of the month to revise my entry, I would like to refactor my code into a more formalized KNN approach where each parameter has a boost coefficient, and then let the computer walk through candidate boost coefficients using simulated annealing. I realize this runs counter to lesson #1, but it seems like the simplest competitive approach given what I know now. It also has the benefit of letting the computer do the work for you.
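
Roughly what I have in mind, with hypothetical per-heuristic scoring functions (cooccur_score, owner_score and fork_score are placeholders, and evaluate would be the offline harness above):

import math
import random

def blended(user, repo, boosts):
    # Each heuristic contributes a raw score; boosts weight them.
    return (boosts['cooccur'] * cooccur_score(user, repo) +
            boosts['owner'] * owner_score(user, repo) +
            boosts['fork'] * fork_score(user, repo))

def anneal(evaluate, boosts, steps=200, temp=1.0, cooling=0.97):
    # Classic simulated annealing: improvements are always kept, and
    # regressions are kept with a probability that shrinks as the
    # temperature cools.
    current = best = evaluate(boosts)
    best_boosts = dict(boosts)
    for _ in xrange(steps):
        candidate = dict(boosts)
        key = random.choice(candidate.keys())
        candidate[key] = max(0.0, candidate[key] + random.uniform(-0.5, 0.5))
        result = evaluate(candidate)
        if result >= current or random.random() < math.exp((result - current) / temp):
            boosts, current = candidate, result
            if current > best:
                best, best_boosts = current, dict(boosts)
        temp *= cooling
    return best_boosts, best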

Once the contest wraps up at the end of the month, it will be interesting to see both the methods used by the different teams and how GitHub puts the winning code to use.

Minifig photo credits: kennymatic and Craig Rodway
23 August 2008
Excellent new book in the making: Computational Modeling and Complexity Science
Recently, in a post about his terrific Programming Collective Intelligence, Toby Segaran calls out the “excluded middle” of technical books. His goal with PCI was to cover that middle ground: “showing concepts, implementation and applications all at once”. It’s a truly great book that will stand the test of time.

I’ve recently stumbled across another title that belongs in this middle ground: Computational Modeling and Complexity Science by Allen B. Downey. He’s also the author of The Little Book of Semaphores, Think Python: An Introduction to Software Design, Learning Perl the Hard Way and a few other titles, all self-published and released under the GPL.

According to Downey, CMCS is about “algorithms, intermediate Python programming and complexity science”.

The book has a lot in common with PCI. Both:
  • Use Python as a basis for all examples
  • Span academia and pragmatism with working examples
  • Maintain an easy to read narrative
  • Are just the right length (so far!)
CMCS gives just enough theory to provide context, and the readability of the text is helped along by anecdotes from the lives of the referenced mathematicians. It takes on analysis of social networks, queuing theory, stochastic modeling, cellular automata and a bunch of other great subjects.

In Chapter 7, Self-organized Criticality, there is some coverage of modeling forest fires:

In 1992 Drossel and Schwabl proposed a cellular automaton that is an abstract model of a forest fire. Each cell is in one of three states: empty, occupied by a tree, or on fire.

The rules of the CA are:
  1. An empty cell becomes occupied with probability p.
  2. A cell with a tree will burn if any of its neighbors is on fire.
  3. A cell with a tree will spontaneously burn, with probability f, even if none of its neighbors is on fire.
  4. A cell with a burning tree becomes an empty cell in the next time step.

I couldn’t help pulling out PIL and gluing some Python together to try this out. Here’s what I came up with (converted to Flash to keep the file size down):

source
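
The update rule itself boils down to a few lines. A minimal sketch, not the code behind the animation (grid size and the p and f probabilities are arbitrary; the real thing adds PIL rendering on top):

import random

EMPTY, TREE, FIRE = 0, 1, 2
p, f = 0.05, 0.001   # growth and lightning probabilities (arbitrary)
N = 100

grid = [[EMPTY] * N for _ in range(N)]

def neighbors(i, j):
    # Yield the four orthogonal neighbors, respecting the grid edges.
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= i + di < N and 0 <= j + dj < N:
            yield grid[i + di][j + dj]

def step():
    new = [[EMPTY] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            cell = grid[i][j]
            if cell == FIRE:
                new[i][j] = EMPTY                    # rule 4
            elif cell == TREE:
                if any(n == FIRE for n in neighbors(i, j)):
                    new[i][j] = FIRE                 # rule 2
                elif random.random() < f:
                    new[i][j] = FIRE                 # rule 3
                else:
                    new[i][j] = TREE
            elif random.random() < p:
                new[i][j] = TREE                     # rule 1
    return new

for _ in range(100):
    grid = step()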

Much of the book is still under construction, and none of the referenced source URLs I tried were functional. That said, one suggestion I do have is to make more use of existing Python libraries. As an example, the chapter on graphs includes the implementation of basic graph data structures. This is understandable, since the book will no doubt be used in undergraduate classrooms, but there should at least be a pointer to NetworkX, a battle-tested warhorse the reader should be aware of.

It’s a great read so far and I’m looking forward to seeing how the book evolves.
2 August 2008
Beneath the Bare Metal: Using dynamic profiling tools like DTrace and VProbes


Just back from BarCampRDU. Lots of great sessions. I especially enjoyed Grant Ingersoll’s presentations on Mahout and Solr.

I did a talk on the use of dynamic tracing tools such as VProbes and DTrace. FWIW, here are some links I referenced in the talk.

DTrace



VProbes

11 June 2008

Greg Kroah-Hartman gave a tech talk at Google recently. Some interesting stats related to the Linux 2.6.25 kernel development effort:

  • 2,399 contributors over the last 1.5 years
  • Half of these have contributed a single patch; a quarter contributed two
  • 9.2M lines of code
  • 10% year-over-year growth in lines of code
  • ~55% of the code is in drivers - a majority, if only by a small margin
  • The rate of change of core kernel code is proportional to that of driver code. Translation: the core of the OS is changing at a crazy rate.
  • New releases every 2.75 months or so
  • 9000 symbols and 12000 data structures inside the kernel (below the syscall layer)
  • Rate of change is insane
    • 4300 lines added
    • 1800 lines removed
    • 1500 lines modified
    • PER DAY!!!

8 June 2008
Pythonic API for Tumblr’s web API checked in

When my WordPress blog got hacked, I decided it was time to port over to Tumblr and let them deal with hosting, patching, etc… I put together a Python script to import my existing content via RSS. My original script grew into python-tumblr. Several samples are checked into SVN, including one that uses Google Charts to visualize post-type frequency.
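
The import amounted to little more than walking the old feed and reposting each entry. A sketch, using the third-party feedparser module (the feed URL and credentials are placeholders; Tumblr’s early write API accepts a plain POST like this):

import urllib
import urllib2

import feedparser  # third-party RSS/Atom parser

# Walk the old blog's feed and repost each entry; type 'regular'
# is an ordinary text post.
feed = feedparser.parse('http://example.com/old-blog/rss')
for entry in feed.entries:
    params = urllib.urlencode({
        'email': 'me@example.com',
        'password': 'secret',
        'type': 'regular',
        'title': entry.title.encode('utf-8'),
        'body': entry.description.encode('utf-8'),
    })
    urllib2.urlopen('http://www.tumblr.com/api/write', params)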

Comments, feedback, patches welcome…
16 December 2007
Google Charts Gets Invader’d

Playing with the Google Chart API and PyGoogleChart this afternoon, after reading the announcement last week.

Invader Chart

A couple things to think about when using the API:

  • You will want to use a wrapper - Take one look at the link that generated the above graphic, and you’ll quickly see that it is not meant for humans. Either write your own wrapper around the bits you are using, or use an existing one like PyGoogleChart.
  • It’s not just for web apps - Charts can be fetched and cached for later use with curl or equivalent. NB: IANAL, so review the Terms of Service and usage policy and make your own decision.
  • You will have to normalize your data - The Chart API docs talk quite a bit about reducing the number of elements in your datasets. Another important consideration is the range. For instance, to chart the last few days’ closing values for the CAC 40 Index, which varied from 5,547.21 to 5,750.92, I would first have to normalize the values to something under 4096 for Extended Encoding, under 100 for Text Encoding, and less than 62 for Simple Encoding. Then the y-axis labels would have to be adjusted to reflect the normalization. Example
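
A sketch of that normalization for Simple Encoding (the helper is mine; only the endpoint values come from the CAC 40 example, the middle two are made up):

# Simple Encoding maps 0..61 onto this 62-character alphabet,
# passed to the API as chd=s:<encoded>.
ALPHABET = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
            'abcdefghijklmnopqrstuvwxyz'
            '0123456789')

def simple_encode(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1
    # Scale each value into 0..61, then look up its character.
    return ''.join(ALPHABET[int(round((v - lo) * 61.0 / span))] for v in values)

closes = [5547.21, 5620.00, 5700.00, 5750.92]
print 'chd=s:' + simple_encode(closes)
# The y-axis labels (set separately) must then reflect the real
# 5,547-5,751 range rather than the encoded 0-61 values.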

It’s pretty clear we’ve only seen the beginning. People outside Google will find new and unanticipated ways of using the API, and Google will continue to crank out new features.

Inspiration for above:

3 March 2007
Luke 1.7 is out

After a year or so of shut-eye, Luke, the Lucene query tool, is back. New features listed below. Time to dust off some patches I’ve got for an earlier version and send them along for 1.8.

Luke 1.7 features:

  • Added pagination of results, especially useful for very large result sets.
  • Added support for new Field flags. They are now displayed in the Document details, and also can be set on edited documents.
  • Added a function to add new documents to the index.
  • Low-level index operations (such as detecting unused files, index directory cleanup) use the newly exposed Lucene classes instead of duplicating their internals in Luke.
  • A side-effect of the above is the ability to properly cleanup all supported index formats, including the new lockless and single-norm indexes.
  • Added a function to copy the list of top terms to clipboard.
  • Added a function to copy the term vector to clipboard.
  • Added a function to close and/or re-open the current index.
  • In the Documents tab, pressing “First Term” now positions the term enumeration at the first term for the selected field.
  • Added a field vocabulary analysis plugin by Mark Harwood, with some modifications.
  • Overall UI cleanup - improved layout in some places, added graphics instead of ASCII art, etc.

Nicely done Andrzej.

In other Lucene related news, it looks like some interesting things are brewing in the Hadoop project.

Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.

24 February 2007
The Definitive ANTLR Reference

As reported elsewhere, the Pragmatic guys have shepherded another great book through their process: The Definitive ANTLR Reference. Though it’s not intended as a how-to, I’m finding it very readable.

Last time I looked at ANTLR, the ANTLRWorks tool (pictured above) didn’t exist. It’s got some quirky aspects and rough edges, but it makes exploring and learning by example a reality. Recommended!

1 January 2007
VIM Plugin for JavaScriptLint

I’ve put together a vim compiler plugin to provide quick access to JavaScriptLint output. It identifies the file, line and column of warnings and errors generated by jsl.

Requirements:

  • vim 7+
  • jsl in path

Installation:

  • Download vimball - jsl.vba
  • Open vimball - vim jsl.vba
  • Verify contents - :VimballList
    Should see:
    would extract <plugin/jsl.vim>: 1 lines
    would extract <compiler/jsl.vim>: 22 lines
    
  • Install - :so %

Usage:

  • Open file - vim jsl-test.js
  • Run JavaScriptLint - :make
  • View output - :cope

Feedback welcome…