Feb 15, 2014

Introduction

More than a year has passed since the previous (and only) post… and a lot has happened. The most recent thing (indirectly related to this post): I left my job at brandcrumb in December, after two and a half years of intense, challenging and professionally fulfilling collaboration. I’m now working from home, drinking way too much coffee (FRESH POTS!) while working on new challenging projects.

It’s a bit sad that I never had (or took) the time to write more on this blog while working at brandcrumb; a lot of very interesting things happened there. I might/should try to find some time in the future to write about some of those past things.

Let’s get to the core of the topic, shall we? Why am I writing today, and about what?

Unix pipelining

I suppose most of us are familiar with Unix pipelining (invented by Douglas McIlroy, hence this post’s title).

You know, that simple but amazing thing that allows us to do:

tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q

(You can read the story behind that elegant piece of code right here)

I’m a developer, I’m a *nix lover, I love the Unix philosophy and I love (like any *nix geek) the command line! Yet, I feel like I tend to use it less and less every day. Sure, I’m still spending my days typing commands like npm install, apt-get moo, node ./index.js, grunt, bower, etc. (yes, a lot of web-related work these days).

More than the “command-line”, it’s the pipelining concept that is disappearing from my developer’s life. I’m not talking about using pipes in my shell, I’m still using that every day for silly things like:

cat application.log | grep ERROR

The command above is basically me, using what our fathers gave us years ago. But what about today? What about pipelining in our day to day modern tools?

I’m not totally fair when I say that, since this week’s events are probably related to the fact that I spent last weekend re-reading about the differences between the several versions of Node.js’ Streams API. A very nice example of a modern application of, if not Unix pipelining itself, at least its philosophy.

An article I found while writing these lines: The Unix Philosophy, Streams and Node.js
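
To make the parallel concrete, here’s a minimal sketch (my own illustration, nothing official) of the stream equivalent of cat application.log | grep ERROR, written against the streams2 API:

var fs = require("fs");
var Transform = require("stream").Transform;

// Transform stream that only passes through lines containing "ERROR"
// (simplified: it assumes a chunk boundary never splits a line)
var grepError = new Transform();
grepError._transform = function(chunk, encoding, done) {
  var matching = chunk.toString().split("\n").filter(function(line) {
    return line.indexOf("ERROR") !== -1;
  });
  if (matching.length) this.push(matching.join("\n") + "\n");
  done();
};

// The .pipe() calls read just like the shell's |
fs.createReadStream("application.log")
  .pipe(grepError)
  .pipe(process.stdout);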

“You know, something simple…”

One of the projects I’m working on at the moment has, amongst others, two systems communicating via socket.io: a Single-Page AngularJS application (Dashboard) and a Node.js app, the second feeding real-time data into the first (for real-time reporting, charting, …)

(The real-time data in turn comes from other systems and is received by the Node.js app via queueing systems …)

This week, two people I work with asked me for something:

Louis, could you provide us with an easy way to test the AngularJS dashboard without using the complete product architecture? We would like to prototype a new tool and see how the dashboard would behave with its output. Could we have a mechanism like continuously checking whether a local file has changed and automatically loading CSV/JSON rows from it? You know, something simple…

I understood their need perfectly but, to be honest, I wasn’t sure I was gonna be able to provide them with an easy solution.

The AngularJS application does very few things and has few dependencies (at the moment); it could almost load and run perfectly if you were to open the index.html file directly in your browser, without loading it from a web server. It was quite clear to me that the main challenge was detecting that the file had changed.

I didn’t want them to have to install a web server on their laptop, but due to the Same-Origin Policy I couldn’t even use Ajax to fetch the file every second and get a “near real-time” update of the dashboard.
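
For the record, the ruled-out approach would have looked something like this, where updateDashboard is a hypothetical hook into the Angular app; it’s exactly this request that the browser blocks when index.html is opened via file://

// Naive polling of a local file; blocked by the browser for file:// pages.
// `updateDashboard` is a hypothetical function of the Angular app.
setInterval(function() {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "data.json", true);
  xhr.onload = function() {
    updateDashboard(JSON.parse(xhr.responseText));
  };
  xhr.send();
}, 1000);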

(Please, don’t hesitate to comment and tell me how you’d have implemented that)

A quick look at Google’s results for “HTML5 & local files” gave me the impression that I couldn’t easily implement a continuous reload of a local file… I didn’t have much time, and I wanted something simple enough to be deployed in a minute, but I also really wanted a solution giving us a “real-time” update of the dashboard.

Since they code in R, providing them with a library was not easy (I’ve never coded in R); plus, I didn’t want them to change what they had, the goal being to keep it simple for them. The way I understood the contract (or challenge) between us was: “we write to a file, you do the magic”.

I suddenly had a flashback of the Operating System courses I had at college1 and was reminded of pseudo-devices, named pipes, IPC, etc. I started to think: “It would actually be cool if they could write normally to a file, but I’d stream every line through websockets… they’re already producing CSV files with their tools anyway, so apart from the output format they wouldn’t have to change anything”.

A few checks later, I had my solution: I could write a small Node.js tool creating a socket.io server and a named pipe. It would read from that pipe line by line, parse each line (JSON) and stream the data through socket.io. The dashboard could then connect to that tool and deal with the incoming data the same way it does with the real production architecture.
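
To give you an idea, here’s a minimal sketch of the concept (not the actual unix2ws code; the pipe path, port and event name are made up for the example):

var fs = require("fs");
var spawn = require("child_process").spawn;
var io = require("socket.io").listen(8000);

var PIPE = "/tmp/dashboard.pipe";

// Create the named pipe, then read it line by line
spawn("mkfifo", [PIPE]).on("exit", function() {
  var buffer = "";
  // Opening a FIFO for reading blocks until a writer shows up
  fs.createReadStream(PIPE).on("data", function(chunk) {
    buffer += chunk.toString();
    var lines = buffer.split("\n");
    buffer = lines.pop(); // keep the trailing partial line
    lines.forEach(function(line) {
      if (!line) return;
      // Parse the JSON record and broadcast it to every connected dashboard
      io.sockets.emit("record", JSON.parse(line));
    });
  });
});

The producer side, whatever the language, then boils down to appending one JSON record per line to /tmp/dashboard.pipe.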

An hour later I had it implemented, and not only was I able to hand the solution to my colleagues but, for the first time in a long time, I was also able to make a small (yet useful to me/us) contribution to the open source world.

I’m pleased to introduce you to unix2ws (and jeez, a 4-line application I created on the same occasion for my tests)

Have a look at unix2ws’ example folder; you’ll see how easy it is to use.

Personal conclusion

It’s a bit weird for me to realise that, in a very short time, I forgot what the Unix world and our fathers had given us. I used to work on the maintenance of a huge C/C++ application, entirely made of (relatively) small programs trying to respect as many of Eric Raymond’s 17 Unix Rules as possible, glued together with intense usage of Unix pipelining.

Years later, doing my web and cloudy stuff, playing with “cool cutting-edge things” like AWS, Angular, Node, etc., I had almost forgotten how elegant those basic concepts were.

This week, using just a few lines of code, I was able to isolate one of the systems (the dashboard) from the rest of the architecture and give my colleagues the opportunity to prototype on their side without having to know ANYTHING about how the dashboard works. On their side, they do what they do best: they do statistics in R, they write JSON records (respecting the schema we agreed on) into a file, and they don’t even have to care about the rest…

I’m happy, it’s been a great week, I’ve fallen in love with Unix again and I managed to finally do something that I hope I’ll do a lot more from now on: try and give back a little to the open source community.

Don’t hesitate to comment on this topic. How would you have done that? If you dislike/hate the solution, why?

I hope I’ll write another post soon, but who knows… I could try to keep my “two-posts-distance” average at 444 days.


1: well… said like this, you could think that I actually attended those classes… don’t get me wrong: I was playing truant all the time, so I just had a flashback of that crazy week preceding the final exam, studying Mr Jaumain’s slides in a rush while frantically reading Andrew Tanenbaum’s book to try to understand something… and I passed, with good marks :-D

May 28, 2012

EDIT
After writing this article, I realized that cradle uses a write cache by default. After disabling that cache, the difference between heap snapshots is much smaller; I’ll continue my tests to see if it was the only source of my problems.
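
For reference, the cache can be disabled when creating the connection (as documented in cradle’s README):

var cradle = require("cradle");
// Disable cradle's write cache (it defaults to true)
var connection = new(cradle.Connection)("localhost", 5984, { cache: false });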

felix-couchdb, on the other hand, doesn’t mention any cache in its documentation, so I’ll continue my investigations to try to find out why this library “leaks”.

So, I’m probably gonna write a new post very soon with the results of those investigations.

This post talks about a problem that I’m facing right now with NodeJS & CouchDB. Actually, I should say “with NodeJS & cradle” or “with NodeJS & felix-couchdb” because CouchDB is not really the problem here.

Let me describe the problem:

Last week, my colleagues and I moved one part of our project to AWS. This part of the project is amazingly stupid and simple: it’s a kind of logging tool that receives information about 10 times per second and writes it into new documents. The tool never updates or reads documents.

After some time (a few hours), I had a look at the CPU usage of my instances and saw this kind of graph:

You can clearly see that on the 27th, around 12:00 AM, I restarted the NodeJS application, and that it immediately started to leak again. (Of course, no need to say: the load of my NodeJS application is quite constant in terms of connections/users, so this graph is not explained by any traffic peak.)

So, I continued my investigations to try to find this leak… and after hours of debugging, I realized that the leaking part was in a third-party library: cradle.

I wrote a very simple chunk of code, outside of my application, to try to reproduce the leak and confirm that cradle was responsible:

// Debugging (v8-profiler doesn't work for me)
// To activate it: kill -SIGUSR2 $PID
var agent   = require("webkit-devtools-agent");
//
var express = require("express");
var http    = require("http");
//
var cradle  = require("cradle");

// CouchDB connection settings
var config = {
  host: "localhost",
  port: 5984,
  db: "leak"
};

// Connect to CouchDB
var connection = new(cradle.Connection)(config.host, config.port);
var db = connection.database(config.db);

// The example document that we're gonna insert
// in this example the document won't change
var doc = {
  firstname:        "Louis",
  lastname:         "Lambeau",
  preferredBeers:    ["duvel", "chimay", "westmalle"],
  preferredLanguage: "ruby",
  nationality:      "Belgian",
  address: {
    country:  "Spain",
    province: "Catalunya",
    city:     "Barcelona"
  },
  bio: [
    // This was generated with the fantastic slipsum generator (http://slipsum.com/)
    "Normally, both your asses would be dead as fucking fried chicken,",
    "but you happen to pull this shit while I'm in a transitional period",
    "so I don't wanna kill you, I wanna help you. But I can't give you this case",
    "it don't belong to me. Besides, I've already been through too much shit this",
    "morning over this case to hand it over to your dumb ass.",
    "You think water moves fast? You should see ice. It moves like it has a mind.",
    "Like it knows it killed the world once and got a taste for murder.",
    "After the avalanche, it took us a week to climb out. ",
    "Now, I don't know exactly when we turned on each other, ",
    "but I know that seven of us survived the slide... and only five made it out.",
    "Now we took an oath, that I'm breaking now. We said we'd say it was the snow",
    "that killed the other two, but it wasn't. Nature is lethal but it doesn't hold",
    "a candle to man."
  ].join("\n")
};

// To launch the inserts, kill -SIGHUP $PID
process.addListener("SIGHUP", function() {
  for (var i=0; i < 10000; i++){
    db.save(doc);
  }
});

// Create the express server
var app = express.createServer();
app.listen(8888);

(Off-topic: I tried to use v8-profiler, but it seems it’s not working anymore with node >= 0.6.x. Thanks to c4milo and his tool node-webkit-agent I’ve been able to do some heap profiling)

I ran this code to take a heap snapshot before and after the 10,000 insertions, and the result was quite clear:

A comparison between the second heap snapshot and the first one gives us concerning numbers:

  • The size of the heap almost doubled
  • We have 20,650 new strings

If we look at the 20k strings that reside in memory, we can see 3 different strings, repeated again and again:

  • A JSON representation of our document (the body of the POST request)
  • The headers of the request that cradle sent
  • The JSON representation of CouchDB’s reply ({"ok":true,"id":"hash","rev":"hash"})

A second set of insertions, followed by a third heap snapshot, confirmed it was constantly growing (12.29 MB).

I decided to try another library, felix-couchdb, to check whether this behavior was observable there as well.

// Debugging (v8-profiler doesn't work for me)
// To activate it: kill -SIGUSR2 $PID
var agent   = require("webkit-devtools-agent");
//
var express = require("express");
var http    = require("http");
//
var felix  = require("felix-couchdb");

// CouchDB connection settings
var config = {
  host: "localhost",
  port: 5984,
  db: "leak"
};

// Connect to CouchDB
var connection = felix.createClient(config.port, config.host);
var db = connection.db(config.db);

// The example document that we're gonna insert
// in this example the document won't change
var doc = {
  firstname:        "Louis",
  lastname:         "Lambeau",
  preferredBeers:    ["duvel", "chimay", "westmalle"],
  preferredLanguage: "ruby",
  nationality:      "Belgian",
  address: {
    country:  "Spain",
    province: "Catalunya",
    city:     "Barcelona"
  },
  bio: [
    // This was generated with the fantastic slipsum generator (http://slipsum.com/)
    "Normally, both your asses would be dead as fucking fried chicken,",
    "but you happen to pull this shit while I'm in a transitional period",
    "so I don't wanna kill you, I wanna help you. But I can't give you this case",
    "it don't belong to me. Besides, I've already been through too much shit this",
    "morning over this case to hand it over to your dumb ass.",
    "You think water moves fast? You should see ice. It moves like it has a mind.",
    "Like it knows it killed the world once and got a taste for murder.",
    "After the avalanche, it took us a week to climb out. ",
    "Now, I don't know exactly when we turned on each other, ",
    "but I know that seven of us survived the slide... and only five made it out.",
    "Now we took an oath, that I'm breaking now. We said we'd say it was the snow",
    "that killed the other two, but it wasn't. Nature is lethal but it doesn't hold",
    "a candle to man."
  ].join("\n")
};

// To launch the inserts, kill -SIGHUP $PID
process.addListener("SIGHUP", function() {
  for (var i=0; i < 10000; i++){
    db.saveDoc(doc);
  }
});

// Create the express server
var app = express.createServer();
app.listen(8888);

And the result? Even worse!! An increase of 33 MB after 10,000 insertions.

And finally, of course, I tried a very quick, naive implementation of my own, directly using the HTTP library to send a POST request containing the JSON-encoded document.

The result is, I think, much more acceptable, but again, it’s a very naive implementation:

// Debugging (v8-profiler doesn't work for me)
// To activate it: kill -SIGUSR2 $PID
var agent   = require("webkit-devtools-agent");
//
var express = require("express");
var http    = require("http");

// CouchDB connection settings
var config = {
  host: "localhost",
  port: 5984,
  db: "leak"
};

// Saves a document to CouchDB
function saveDoc(doc){
  // JSON representation of the doc
  var data = JSON.stringify(doc);
  //
  var post_options = {
    host: config.host,
    port: config.port,
    path: "/" + config.db,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Content-Length": data.length
    }
  };
  var request = http.request(post_options, function(res){
    // Consume the response so the underlying socket can be released
    res.on("data", function(){});
  });
  request.write(data);
  request.end();
}

// The example document that we're gonna insert
// in this example the document won't change
var doc = {
  firstname:        "Louis",
  lastname:         "Lambeau",
  preferredBeers:    ["duvel", "chimay", "westmalle"],
  preferredLanguage: "ruby",
  nationality:      "Belgian",
  address: {
    country:  "Spain",
    province: "Catalunya",
    city:     "Barcelona"
  },
  bio: [
    // This was generated with the fantastic slipsum generator (http://slipsum.com/)
    "Normally, both your asses would be dead as fucking fried chicken,",
    "but you happen to pull this shit while I'm in a transitional period",
    "so I don't wanna kill you, I wanna help you. But I can't give you this case",
    "it don't belong to me. Besides, I've already been through too much shit this",
    "morning over this case to hand it over to your dumb ass.",
    "You think water moves fast? You should see ice. It moves like it has a mind.",
    "Like it knows it killed the world once and got a taste for murder.",
    "After the avalanche, it took us a week to climb out. ",
    "Now, I don't know exactly when we turned on each other, ",
    "but I know that seven of us survived the slide... and only five made it out.",
    "Now we took an oath, that I'm breaking now. We said we'd say it was the snow",
    "that killed the other two, but it wasn't. Nature is lethal but it doesn't hold",
    "a candle to man."
  ].join("\n")
};

// To launch the inserts, kill -SIGHUP $PID
process.addListener("SIGHUP", function() {
  for (var i=0; i < 10000; i++){
    saveDoc(doc);
  }
});

// Create the express server
var app = express.createServer();
app.listen(8888);

So, the questions are:

  • Did I completely misunderstand something about NodeJS or the usage of those libs?
  • Am I the only one to insert a lot of documents per minute/hour into CouchDB using those libs?
  • Does everybody reinvent the wheel and create their own CouchDB client implementation when they use NodeJS?

I’m probably going to open issues on GitHub for both projects with a link to this post to explain my problem, but if in the meantime someone has an idea and can explain what I’m doing wrong / what those libraries are doing wrong, I would be grateful ;-)

P.S. By the way, welcome to my blog: this was the very first post! I’ll probably write a new post very soon about CouchDB/BigCouch & Map/Reduce view performance, just to share some experiments I’ve performed lately.