Introducing nodezoo, a search engine for Node.js modules

I made a bet on a new programming platform 3 years ago, and it paid off. Every line of code that has earned me money since then has been run by Node.js. In case you missed it, Node.js is the evil step-child of Netscape and Gmail that is going to take over software development for the next decade. Starting now.

Brendan Eich invented the JavaScript programming language in 10 days during a death-march in the mid-90s while working at Netscape. It’s both utterly broken and brilliantly productive. Google invented Gmail, and then realized it was too slow. So they wrote a fast web browser, Chrome. Along the way, they wrote an entirely new JavaScript engine, called V8, that is damn fast. Now, V8 can be taken out of Chrome, and used by itself. Throw in some scalable event-driven architecture, and you get the most badass way to build web stuff going. That’s Node.js. Oh, and it does flying robots too…

The best thing about Node.js is the module system, npm. It is separate from Node itself, and emerged slightly later. It comprehensively solves the “versioning problem”. When you break software into modules, you end up with many versions of these modules. As time goes by, different modules depend on different versions of other modules, and it gets messy.

Think of it like this: Rachel invites Joey to a party in her apartment, and Joey in turn invites Phoebe. Rachel also invites Chandler, who in turn invites Ursula, Phoebe’s evil twin sister, who is pretending to be Phoebe. The party ends up with incompatible guests, and fails. Here’s what npm does. The party is split between two apartments. One apartment gets Joey and Phoebe, the other Chandler and Ursula. Rachel hangs out in the hallway and is none the wiser. Happy days.

When you publish a Node.js module to npm, you specify the other modules you depend on, and the versions that you can work with. npm ensures that all modules you install only see compatible dependencies. No more breakage.
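To make the two-apartments picture concrete, here is a minimal sketch of the lookup order Node.js follows when a module calls require: it checks a node_modules folder next to the requiring module first, then walks up through each parent directory. The function name `lookupPaths` and the `/party` paths are made up for illustration, and the sketch ignores core modules, file extensions, and package.json entry points, which real resolution also handles.

```javascript
// Sketch: list the node_modules folders Node.js would search, in order,
// for a require() made from a module living in `fromDir` (POSIX-style paths).
function lookupPaths(fromDir) {
  const parts = fromDir.split('/').filter(Boolean);
  const paths = [];
  for (let i = parts.length; i >= 0; i--) {
    // Node never looks inside node_modules/node_modules, so skip that level.
    if (parts[i - 1] === 'node_modules') continue;
    paths.push('/' + parts.slice(0, i).concat('node_modules').join('/'));
  }
  return paths;
}

// Hypothetical layout: Rachel's party depends on joey and chandler.
const joeyPaths = lookupPaths('/party/node_modules/joey');
const chandlerPaths = lookupPaths('/party/node_modules/chandler');
console.log(joeyPaths);
console.log(chandlerPaths);
```

Each guest's own node_modules folder is searched first, so joey can bring one version of phoebe and chandler can bring another without either seeing the wrong one.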

There’s another great thing about the Node.js module system. It emerged organically from the nature of the JavaScript language and the founding personalities (npm was started by @izs). Node.js modules are small. Really small. Which is great, because that makes Node.js anti-fragile. There are no dependencies on standards, no need for curation, no need for foundations that bless certain modules, no blockage when module authors dig their heels in. The system tolerates bad coders, module abandonment, personality implosions, and breaking changes. With 23,000 plus modules at the time of writing, you’ll always find what you need. You might have to fix a few things, tweak a few things, but that’s better than being completely stuck.

This wonderful anarchy does introduce a new problem. How do you find the right module? There’s a chicken and egg period when you’re new to Node – it seems like there are ten options for almost anything you want to do.

The question is how to solve this problem in a scalable way – not everybody can go to all the conferences or hang out on IRC – although you really should if you can. “Ask the community” doesn’t scale, and the latency is pretty bad too. Also, if your goal is to pick one module, the über-helpfulness of the Node community sort of works against that, as you’ll get more than one recommendation! The npm site delegates the search question to Google. The results are less useful than you’d think. Google’s algorithms don’t give us what we need, and the search results, in terms of scannability, are pretty lame. The npm command line has a free text search function. It’s nice, but the results are pre-Google internet quality, and for the same reason – free text search isn’t great at finding relevant results. Then there’s the Node Toolbox, which is like a 90’s Yahoo for Node. There’s a human limit to curation and the number of modules that can be categorized. Ontology building is, frankly, Sisyphusarbeit.

This situation is itchy. Just annoying enough to make you write some code to solve it. Towards the end of last year I randomly ended up reading that wonderful article “The Anatomy of a Large-Scale Hypertextual Web Search Engine” – written back when Larry and Sergey still had Stanford email addresses. The thing that hits you is how simple the idea of PageRank is: if popular web pages point to your web page, your web page must be really good! And the math is just lovely and so … simple! It should have been obvious to everyone.

In a gross misapplication of the underlying mathematical model (random web surfing), the same idea can be applied to Node.js modules to generate a NodeRank – a measure of how awesome your module is. How does it work? Modules depend on other modules. If lots of modules depend on a particular module, then that module must be pretty popular. A good example is express, a web framework. But that’s not enough! The algorithm asks you to look further. The modules that express itself depends on are more popular still. Case in point, connect, an HTTP server framework. The connect module needs to get some NodeRank juice from the express module. That’s what the algorithm does: your module is awesome if it’s used by other awesome modules!
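Here is a minimal sketch of that idea as a plain PageRank power iteration over a dependency graph. It is not the fast approximation used for the real NodeRank, just the textbook version under some assumptions: the function name `nodeRank`, the damping factor of 0.85, the fixed 50 iterations, and the toy graph (including the made-up modules `myapp` and `blog`) are all illustrative choices.

```javascript
// Sketch: textbook PageRank power iteration over a module dependency graph.
// `deps` maps each module name to the list of modules it depends on.
function nodeRank(deps, damping = 0.85, iterations = 50) {
  const names = Object.keys(deps);
  const n = names.length;
  const known = new Set(names);
  // Start with rank spread evenly across all modules.
  let rank = new Map(names.map((name) => [name, 1 / n]));

  for (let iter = 0; iter < iterations; iter++) {
    // Every module gets a small baseline share (the "random surfer" jump).
    const next = new Map(names.map((name) => [name, (1 - damping) / n]));
    let dangling = 0;
    for (const name of names) {
      const targets = deps[name].filter((d) => known.has(d));
      if (targets.length === 0) {
        // A module with no dependencies has nobody to pass rank to.
        dangling += rank.get(name);
        continue;
      }
      // A module passes its rank on to the modules it depends on.
      const share = (damping * rank.get(name)) / targets.length;
      for (const t of targets) next.set(t, next.get(t) + share);
    }
    // Rank held by dependency-free modules is spread evenly over everyone.
    for (const name of names) {
      next.set(name, next.get(name) + (damping * dangling) / n);
    }
    rank = next;
  }
  return rank;
}

const ranks = nodeRank({
  myapp: ['express'],
  blog: ['express'],
  express: ['connect'],
  connect: [],
});
console.log(ranks);
```

On this toy graph connect ends up ranked above express, which in turn ranks above the two applications that depend on it, matching the intuition that rank flows down the dependency chain.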

Implementing the algorithm is tricky. But Google to the rescue! (ironic, capital I). I found a great blog post, with Python code, that explains how to calculate a fast approximation. Thanks, Michael Nielsen! Of course, a little part of me was betting a Node.js port would run even faster (it did, much faster!). So I hacked up an implementation.

Now, you can pull down the entire npm registry; it’s just a CouchDB database. A bit of manipulation with Dominic Tarr’s excellent JSONStream, and out pops a NodeRank for every module.
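As a rough sketch of that manipulation step, the function below turns registry documents into the dependency map that a ranking pass needs. It assumes each package document carries a `name`, a `dist-tags.latest` pointer, and a `versions` object whose entries may hold a `dependencies` map. The function name `dependencyGraph`, the sample documents, and their version numbers are invented for illustration, and a real run would read the documents from a stream rather than an in-memory array.

```javascript
// Sketch: reduce registry package documents to { moduleName: [dependency names] },
// using only the dependencies of each module's latest published version.
function dependencyGraph(docs) {
  const graph = {};
  for (const doc of docs) {
    // Skip CouchDB design documents and anything without a package name.
    if (!doc || !doc.name) continue;
    const latest = (doc['dist-tags'] || {}).latest;
    const version = (doc.versions || {})[latest] || {};
    graph[doc.name] = Object.keys(version.dependencies || {});
  }
  return graph;
}

// Hypothetical sample documents, trimmed to the fields the sketch reads.
const sampleDocs = [
  { name: 'express', 'dist-tags': { latest: '3.0.0' },
    versions: { '3.0.0': { dependencies: { connect: '2.x' } } } },
  { name: 'connect', 'dist-tags': { latest: '2.0.0' },
    versions: { '2.0.0': {} } },
  { _id: '_design/app' },
];
const sampleGraph = dependencyGraph(sampleDocs);
console.log(sampleGraph);
```

The resulting map has one entry per module, listing the names it depends on, which is all a rank calculation over the dependency graph needs as input.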

A ranking by itself does not a search engine make. At the risk of being branded a heretic, I’m using ElasticSearch for the main search engine. Yes, it’s Java. No, it’s not Node. Hey Whadda You Gonna Do! ElasticSearch lets you add a custom score to each document – that’s where the NodeRank goes. I hacked all this together into a little website that lets you search for Node modules: nodezoo.
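For a feel of how a stored rank can be blended into text relevance, here is a sketch of a search request body using ElasticSearch's function_score query with field_value_factor. This is an assumption about one way to do it, not necessarily the mechanism nodezoo uses: the helper name `buildSearchBody`, the `nodeRank` field, and the searched fields `name`, `description`, and `keywords` are all hypothetical, and the balance between text score and rank would need tuning.

```javascript
// Sketch: build a search request body that adds a stored per-module rank
// to the text relevance score. All field names here are hypothetical.
function buildSearchBody(terms) {
  return {
    query: {
      function_score: {
        // Ordinary text relevance over a few document fields.
        query: {
          multi_match: { query: terms, fields: ['name', 'description', 'keywords'] },
        },
        // Read the stored rank; modules without one contribute 0.
        field_value_factor: { field: 'nodeRank', missing: 0 },
        // Final score = text score + rank value.
        boost_mode: 'sum',
      },
    },
  };
}

const body = buildSearchBody('http server');
console.log(JSON.stringify(body, null, 2));
```

The body would be sent as JSON to the index's search endpoint; swapping `boost_mode` to `multiply` makes the rank scale the text score instead of adding to it.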

You use nodezoo to search for modules in the same way as you use Google: just type in some relevant terms and something reasonable should come back. It’s not necessary for the module description or keywords to contain your search terms. The results still need refinement (big time!), but I need complaints to know where it’s going wrong – tweet me: @rjrodger.

The search results also attempt to provide some additional context for deciding which module to use. They include a link to the GitHub project (if it exists), the stars and forks count, and a few other pieces of metadata.

The nodezoo system itself is still pretty hacky. One key piece that’s missing is updating the search index in real time as people publish modules. At the moment it’s a batch job. And it downloads the entire database each time. That’s probably not a good thing.

I’m going to do a series of blog posts on this little search engine, explaining how it works, and walking through the refactoring. The code is all on GitHub if you want to follow along. This is part 1. More soon!
