(Tap… tap… tap… is this thing on? Good.)
July was vacation month: I went on a glorious bike tour of the Oregon/California coast and learned definitively that I don’t like biking on the side of a highway all day. Don’t worry, I escaped in Coos Bay and took trains and buses between Eugene, Santa Cruz, Berkeley, and SF for a vacation more my speed.
But now that I’m back, August is turning out to be project month. I have 3 great TCS applications to global health in the pipeline, and I have big plans to tell you about them soon. But one mixed blessing about these applications is that people actually want to see the results, like, yesterday! So first I have to deal with the results, and then I can write papers and blogs about the techniques.
Since Project Month is a little over-booked with projects, I’m going to have to triage one today. You’ve heard of the Netflix Challenge, right? Well, github.com is running a smaller-scale recommendation contest, and I was messing around with personalized PageRank, which seems like a fine approach for recommending code repositories to hackers. I haven’t got it working very well (best result: 15% of the holdout set recovered), but I was having fun with it. Maybe someone else will take it up. Let me know if you get it to work; networkx + data = good times.
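import networkx as nx

G = nx.Graph()  # assumption: an undirected user/repo graph; the graph type is not shown in the post
# user() and repo() are helpers not shown here; presumably they turn raw ids into distinct node labels
with open('download/data.txt') as f:
    for l in f:
        u_id, r_id = l.strip().split(':')  # each line has the form user_id:repo_id
        G.add_edge(user(u_id), repo(r_id))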
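For anyone who wants to pick this up, here is a minimal sketch of the personalized PageRank idea in plain Python, so every step is visible. It is not the contest code. The function names (personalized_pagerank, recommend), the toy edge list, the damping factor of 0.85, and the iteration count are all assumptions made for illustration. The walk restarts at one user, and the repositories that user does not already watch are ranked by how much probability mass they collect.

```python
from collections import defaultdict


def personalized_pagerank(adj, source, alpha=0.85, iters=100):
    """Power iteration for PageRank where every restart returns to `source`.

    `adj` maps each node to the set of its neighbours (undirected graph).
    """
    rank = {n: 0.0 for n in adj}
    rank[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, nbrs in adj.items():
            if not nbrs:
                continue
            # Spread a fraction alpha of this node's mass evenly over its neighbours.
            share = alpha * rank[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        # Whatever mass was not spread (the restart share, plus any mass
        # sitting on nodes without neighbours) goes back to the source.
        nxt[source] += 1.0 - sum(nxt.values())
        rank = nxt
    return rank


def recommend(edges, user_id, k=3):
    """Return up to k repo ids the user does not watch yet, best first."""
    adj = defaultdict(set)
    for u, r in edges:
        # Tag nodes so a user id and a repo id can never collide.
        adj[("user", u)].add(("repo", r))
        adj[("repo", r)].add(("user", u))
    source = ("user", user_id)
    watched = adj[source]  # also registers the user if they have no edges
    rank = personalized_pagerank(adj, source)
    candidates = [n for n in adj if n[0] == "repo" and n not in watched]
    candidates.sort(key=lambda n: rank[n], reverse=True)
    return [n[1] for n in candidates[:k]]


# Toy (user_id, repo_id) pairs, invented for this example.
edges = [("1", "10"), ("1", "11"),
         ("2", "10"), ("2", "11"), ("2", "12"),
         ("3", "10"), ("3", "13")]
print(recommend(edges, "1", k=2))
```

On the toy data, user 2 shares both of user 1's repositories while user 3 shares only one, so user 2's extra repository ends up ranked above user 3's. With networkx, the same computation should be available through the personalization argument of nx.pagerank.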
2 responses to “August is Too-Many-Projects Month”
My first attempts at the contest were along these lines (I used NetworkX and its eigenvector centrality measure for the first rev) and yielded similar results. The matrix is just too sparse; there are not enough edges in the graph.
Here are some notes on my observations:
Interesting plots and interesting observation, Ryan. It definitely makes me want to mess around with the PPR some more. If the matrix is too sparse, then how about a little “coarsening”? Just when I thought I was out… I’ll at least see how this thing scores with the current code.