Poor Man's Data Store in the Cloud!

1-19-2014
See the sample code on GitHub »

A little while ago I got involved with a new project that, like most projects, needed to persist data somewhere. The difference was that (at least for the first 2-3 phases) the data didn't change often; at most it would change once a day. That data is also what drives the application, where it is mainly used for calculations.

We needed something fast that would allow multiple instances of the app to access it, but the budget was super tight, so it had to be cheap too! The solution: store the data as a flat file (in our case JSON) in a cloud container. There are several advantages to this approach. The main one, of course, is the cost: currently Rackspace Cloud Files charges $0.10/GB of storage and $0.12/GB of bandwidth. We only have a handful of 2kb JSON files, so it would take quite a few requests to even get to 1GB. That puts our monthly data store costs at less than a quarter ($0.22)! I'm sure these costs are in line with most other cloud providers, so it's probably easiest to use the storage wherever you host your servers now. Plus, no worry of SQL injection attacks, hah!
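To put rough numbers on that, here's a back-of-the-napkin sketch. It assumes the $0.12/GB rate is for outgoing bandwidth and that 1GB is 1024 * 1024 KB; the function names are made up for this example.

```javascript
// Prices from the paragraph above, kept in whole cents to avoid floating point drift.
var STORAGE_CENTS_PER_GB = 10;   // $0.10/GB stored
var BANDWIDTH_CENTS_PER_GB = 12; // $0.12/GB (assumed to be outgoing bandwidth)

// How many downloads of a file this size (in KB) it takes to move 1GB.
function requestsPerGb(fileSizeKb) {
  var KB_PER_GB = 1024 * 1024;
  return Math.floor(KB_PER_GB / fileSizeKb);
}

// Monthly cost in cents for the given storage and transfer, both in GB.
function monthlyCostInCents(storedGb, transferredGb) {
  return storedGb * STORAGE_CENTS_PER_GB + transferredGb * BANDWIDTH_CENTS_PER_GB;
}

console.log(requestsPerGb(2));         // 524288 downloads of a 2kb file
console.log(monthlyCostInCents(1, 1)); // 22 cents
```

So even the $0.22 figure is a generous ceiling: it covers a full GB stored and a full GB downloaded in a month.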

Obviously it's not the perfect solution and probably isn't right for most applications. The main drawback is that it's pretty slow! Depending on the service you use and probably a million other variables, a request to get the data may take anywhere from 1 to 5+ seconds. Pretty painful if you need to read or write that data on every request! We cache the results and prefetch them where we can, so the speed isn't much of a concern for us, other than maybe when we are saving a file, and even then it's not too bad.
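The caching idea can be sketched roughly like this. It is not the code from our project, just a minimal in-memory cache with a time-to-live; `createCachedLoader` and the fake loader are hypothetical names for illustration.

```javascript
// Wraps a callback-style loader in a tiny in-memory cache with a time-to-live.
// `loader(key, cb)` stands in for whatever actually hits the cloud container.
function createCachedLoader(loader, ttlMs) {
  var cache = Object.create(null); // key -> { value: ..., expires: timestamp }

  return function (key, next) {
    var hit = cache[key];
    if (hit && hit.expires > Date.now()) {
      return next(null, hit.value); // served from memory, no network trip
    }
    loader(key, function (err, value) {
      if (err) {
        return next(err);
      }
      cache[key] = { value: value, expires: Date.now() + ttlMs };
      next(null, value);
    });
  };
}

// Usage with a fake loader that counts how often it is really called.
var slowCalls = 0;
var fakeSlowLoader = function (customerId, cb) {
  slowCalls++;
  cb(null, { id: customerId });
};

var getConfig = createCachedLoader(fakeSlowLoader, 60 * 1000);
getConfig('100', function () {});
getConfig('100', function () {});
console.log(slowCalls); // 1, the second call came from the cache
```

Since our data changes at most once a day, a fairly long time-to-live is fine; you would want to clear the entry whenever you save a new file.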

Okay, okay. So you don't care about the disadvantages and just want to get this thing up and running ASAP? How do we do it? Well, luckily there's an app for that! Technically, I suppose it's a package. pkgcloud is a Node.js package that provides a single API for interacting with all the major cloud providers. This means you could start out on Rackspace, later decide that AWS is the provider for you, and still be able to move with very minimal changes to your code!
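To give a feel for how small that change could be, here's a sketch that only builds the options object you'd hand to pkgcloud.storage.createClient(). The Rackspace keys are the ones used later in this post; the Amazon keys (keyId and key) are what I recall from the pkgcloud README, so treat them as an assumption and check the docs for your version. `buildStorageOptions` is a made-up helper.

```javascript
// Builds the options object for pkgcloud.storage.createClient().
// Only the credentials differ per provider; the rest of your code stays the same.
function buildStorageOptions(provider, creds) {
  if (provider === 'rackspace') {
    return {
      provider: 'rackspace',
      username: creds.username,
      apiKey: creds.apiKey,
      region: creds.region
    };
  }
  if (provider === 'amazon') {
    // ASSUMPTION: option names taken from the pkgcloud README, verify before use
    return {
      provider: 'amazon',
      keyId: creds.keyId,
      key: creds.key,
      region: creds.region
    };
  }
  throw new Error('Unsupported provider: ' + provider);
}

var options = buildStorageOptions('rackspace', {
  username: 'YOUR_RACKSPACE_USERNAME',
  apiKey: 'YOUR_RACKSPACE_API_KEY',
  region: 'YOUR_REGION'
});
console.log(options.provider); // rackspace

// With pkgcloud installed you would then do:
// var client = require('pkgcloud').storage.createClient(options);
```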

The pkgcloud API actually offers control over Compute, Storage, Database, DNS, Block Storage, Load Balancers, etc. We are only covering a small portion (Storage uploading/downloading) in this post. Be sure to check out the docs to see the full power of pkgcloud.

Rackspace Setup

This tutorial uses Rackspace; however, the steps should be similar with any service/API that pkgcloud supports.

First log in to the Cloud Control Panel and then click on the Files tab.

Rackspace Cloud Setup Step 1

Then click on Create Container. Enter demo-customer-config as the container name, select a region (typically it's best to select the same region as your server), and select a type (for this kind of storage I use Private, since it's not really data I want to expose directly to customers).

Rackspace Cloud Setup Step 2

Upload the two data files to your container.

Keep in mind that you can't have sub directories/sub containers in a container. If you need to have some sort of classifications within a container you should do it with a naming convention.

Node Setup

I should probably start this section by explaining the strategy we used to manage the files. One example of a container that we used is for customer config settings. We would name the file something like [USER_ID]:[TIMESTAMP].json => "1234:1393657134003.json". This allows us to get a list of all the files in the container and loop through until we find the most recent one for that user. A strategy like this is fairly simple to implement, and it would even allow some sort of rollback feature if you needed one. We were only concerned with having the latest file, with no rollback. There are probably a bunch of other strategies that would be useful - I'd love to hear about 'em, so ping me or leave a comment.
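We never built the rollback part, but here's a rough sketch of how the naming convention could support it. `parseConfigFileName` and `listVersions` are hypothetical helpers that are not part of the sample code, and they work on plain file names rather than the file objects pkgcloud returns.

```javascript
// Parses "1234:1393657134003.json" into { id: '1234', ts: 1393657134003, name: ... }.
// Returns null for names that don't follow the convention.
function parseConfigFileName(name) {
  var match = /^([^:]+):(\d+)\.json$/.exec(name);
  return match ? { id: match[1], ts: parseInt(match[2], 10), name: name } : null;
}

// All versions for one customer, newest first. Index 0 is the latest,
// index 1 is the one you'd roll back to, and so on.
function listVersions(customerId, fileNames) {
  return fileNames
    .map(parseConfigFileName)
    .filter(function (f) { return f && f.id === String(customerId); })
    .sort(function (a, b) { return b.ts - a.ts; })
    .map(function (f) { return f.name; });
}

var names = [
  '1234:1393657134003.json',
  '5678:1393657999999.json',
  '1234:1393650000000.json',
  'notes.txt'
];
console.log(listVersions('1234', names));
// [ '1234:1393657134003.json', '1234:1393650000000.json' ]
```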

Install pkgcloud in your project

$ npm install pkgcloud@0.8.15 
# better yet, add it to package.json and `npm install`

Let's add our main module that will do most of the heavy lifting.

/lib/rackspaceFiles.js

/**
 * This API(all the exports.fn's) could be expanded to add listing of available files and what not, so right now the scope is just limited to the use case of [USER_ID]:[TIMESTAMP].json
 */
 
// DONT FORGET TO PUT YOUR CREDENTIALS IN HERE BEFORE STARTING SERVER
var PROVIDER = 'rackspace'; // see docs for full list of providers
var USERNAME = 'YOUR_RACKSPACE_USERNAME';
var API_KEY  = 'YOUR_RACKSPACE_API_KEY'; 
var REGION   = 'YOUR_REGION';


var pkgcloud = require('pkgcloud');
// utilities to help with the naming convention and downloading/uploading streams
var utils = require('./rackspaceUtils.js');

// maintain a list of all your containers - the value must match the name of your container on rackspace
var CONTAINERS = {
  customerConfig: 'demo-customer-config'
  // this is a convenient way to keep production and test data separate. You need to implement IS_PRODUCTION and create two containers if you want to implement this.
  // customerConfig: IS_PRODUCTION ? 'demo-customer-config' : 'TEST-demo-customer-config'
};

var rackspaceFiles = pkgcloud.storage.createClient({
  provider: PROVIDER,
  username: USERNAME,
  apiKey: API_KEY,
  region: REGION
});

// make sure the caller is using a valid container - controlled via CONTAINERS
var isValidContainer = function(container){
  for(var c in CONTAINERS){
    if(CONTAINERS[c] === container){
      return true;
    }
  }
  return false;
};

// export containers so callers of this module have a reference to available containers - plus it keeps the values all in one spot - saves you some time if you decide to rename a container later.
exports.CONTAINERS = CONTAINERS;

// Remember the results come back as a String, so if you're expecting JSON you should JSON.parse(results); see exports.getFileByCustomerId()
var getFileFromRackspaceByCustId = function(container, custId, next){
  if(!custId){
    throw new Error('Customer id invalid:' + custId);
  }
  if(!container || !isValidContainer(container)){
    throw new Error('Container is required or container provided is invalid: ' + container);
  }
  //get a list of all the files
  rackspaceFiles.getFiles(container, function(err, files){
    if(err){
      return next(err, null);
    }
    //find the latest fileName for THIS customer
    utils.getLatestConfigFileName(custId, files, function(err, fileName){
      if(err){
        return next(err, null);
      }
      
      console.log('[INFO] Attempting to retrieve file from rackspace:', fileName, container, custId);
      
      //finally, get the actual config FILE
      rackspaceFiles.download({
        container: container, 
        remote: fileName
      })
      // pkgcloud's api does uploading/downloading with streams
      .pipe(utils.createMemoryStream(function(err, resultsAsString){
        next(err, resultsAsString);
      }));
    });
  });
};

var saveFileToRackspaceById = function(container, custId, dataToSave, next){
  if(!custId){
    throw new Error('Customer id invalid:' + custId);
  }
  if(!dataToSave){
    throw new Error('Config data required');
  }
  if(!container || !isValidContainer(container)){
    throw new Error('Container is required or container provided is invalid: ' + container);
  }
  
  // get the proper name
  var fileName = utils.createConfigFileName( custId );
  
  console.log('[INFO] Attempting to save file to rackspace:', fileName, container, custId);
  
  // pkgcloud's api does uploading/downloading with streams
  utils.createReadStream(dataToSave)
    .pipe(
      rackspaceFiles.upload({
        container: container,
        remote: fileName
      }, function(err, result){
        next(err, result);
      })
    );
};

exports.getFileByCustomerId = function(container, customerId, next) {
  getFileFromRackspaceByCustId(container, customerId, function(err, fileAsString){
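    if(err){
      return next(err, null);
    }
    // Since the content of the file is treated as a string we have to JSONify it - if you are not storing JSON you will want to remove this JSON.parse();
    var parsed;
    try {
      parsed = JSON.parse(fileAsString);
    } catch(parseErr){
      // a corrupt file should reach the caller as an error instead of crashing the process
      return next(parseErr, null);
    }
    next(null, parsed);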
  });
};

// note we always create a new file - with our requirements it isn't really necessary to update files
exports.saveFile = function(container, customerId, data, next) {
  saveFileToRackspaceById(container, customerId, data, next);
};
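Hardcoding credentials at the top of the file is fine for a demo, but it's easy to commit them by accident. One alternative is to read them from environment variables. This is only a sketch: the variable names (RACKSPACE_USERNAME and friends) and the `loadRackspaceCredentials` helper are made up for this example.

```javascript
// Reads the Rackspace settings from an env-like object (process.env by default)
// and fails fast when one is missing.
function loadRackspaceCredentials(env) {
  env = env || process.env;
  var required = ['RACKSPACE_USERNAME', 'RACKSPACE_API_KEY', 'RACKSPACE_REGION'];
  var missing = required.filter(function (name) { return !env[name]; });
  if (missing.length) {
    throw new Error('Missing environment variables: ' + missing.join(', '));
  }
  return {
    provider: 'rackspace',
    username: env.RACKSPACE_USERNAME,
    apiKey: env.RACKSPACE_API_KEY,
    region: env.RACKSPACE_REGION
  };
}

// Example with a fake environment; in the real module you'd call it with no arguments
// and pass the result to pkgcloud.storage.createClient().
var creds = loadRackspaceCredentials({
  RACKSPACE_USERNAME: 'demo-user',
  RACKSPACE_API_KEY: 'demo-key',
  RACKSPACE_REGION: 'demo-region'
});
console.log(creds.username); // demo-user
```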

Now we need to add the helpers for file naming and streams.

/lib/rackspaceUtils.js

var Stream = require('stream');
var util = require('util');
var Bffr = require('buffer');

var utils = {
  // [CUSTOMER_ID]:[TIMESTAMP].json
  createConfigFileName: function(customerId){
    return customerId + ':' + new Date().getTime() + '.json';
  },
  createReadStream: function(data){
    var Readable = Stream.Readable;
    
    var StringReader = function(str, opt){
      Readable.call(this, opt);
      this._data = str;
    };
    
    util.inherits(StringReader, Readable);
    
    StringReader.prototype._read = function(){
    
      if(!this._data){
        this.push(null);
      }
      else {
        var buf = new Buffer(this._data, 'utf-8');
        this.push(buf);
        this._data = null;
      }
    };
    
    return new StringReader(data);
  },
  createMemoryStream: function(onEnd){
    var stream = new Stream.Writable();
    var result = new Buffer('');
    stream._write = function(chunk, enc, next){
      var buffer = (Buffer.isBuffer(chunk)) ? chunk : new Buffer(chunk, enc);
      result = Buffer.concat([result, buffer]);
      next();
    };
    stream.on('finish', function(){
      onEnd(null, result.toString());
    });
    stream.on('error', function(){
      console.error('Problem with memoryStream:', arguments);
      onEnd(new Error('There was a problem reading the stream from rackspace.'), null);
    });
    return stream;
  },
  // remove file extension and return the id and timestamp split up in array
  splitIdFromTimestamp: function(name){
    return ( name && name.split ) ? name.replace('.json','').split(':') : [];
  },
  getLatestConfigFileName: function(id, files, next){
    // store the latest valid fileName and TS
    var latestFileName; //filename === customer id + timestamp of when the file was created
    var latestTs = 0;
    // array result after splitting the name
    var split;
    // current itr id
    var currId;
    // current itr timestamp
    var currTs;
    
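    // without a callback there is nothing to report errors to, so fail loudly
    if(typeof next !== 'function'){
      throw new Error('A callback is required.');
    }
    if(!id || !files){
      return next(new Error('Id: ' + id + ' or files: ' + files + ' missing!'));
    }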
    
    //iterate over each file - saving the one that matches the id with the latest date
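    for(var i = 0, l=files.length; i < l; i++){
      split = utils.splitIdFromTimestamp(files[i].name);
      currId = split[0];
      currTs = parseInt(split[1], 10);
      // keep this file only if it belongs to the customer and is newer than the best so far
      if(currId === String(id) && currTs > latestTs){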
      
        latestFileName = files[i].name;
        latestTs = currTs;
      }
    }
    
    if(latestFileName && latestTs){
      // Huzzah! We got a match, send back the whole fileName
      next(null, latestFileName);
    }
    else {
      // no match
      next(new Error('No config file found for id: ' + id), null);
    }
  }
};

module.exports = utils;

Finally we'll modify index.js to make the request to load the customer data and also add a method to save customer data.

/routes/index.js

var rackspaceFiles = require('../lib/rackspaceFiles.js');
/*
 * GET home page.
 */
 
// typically you'd get this from the request
var CUSTOMER_ID = "100";

// Always get the container from rackspaceFiles so the list is maintained in one spot.
var CONTAINER = rackspaceFiles.CONTAINERS.customerConfig;

var renderIndex = function(res, data){
  res.render('index', { title: 'Express', customer: data });
};

exports.index = function(req, res){
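  // get customer data from rackspace
  rackspaceFiles.getFileByCustomerId(CONTAINER, CUSTOMER_ID, function(err, data){
    if(err){
      // throwing inside an async callback would crash the process, so respond with an error instead
      console.error('Unable to get file for customer id: ' + CUSTOMER_ID, err);
      return res.status(500).send('Unable to load customer data');
    }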
    renderIndex(res, data);
  });
};

exports.save = function(req, res){
  if(!req.body.firstName || !req.body.lastName || !req.body.email){
    throw new Error('Missing required fields');
  }
  // get data from POSTed form
  var customerData = {
    id: CUSTOMER_ID,
    firstName: req.body.firstName,
    lastName: req.body.lastName,
    email: req.body.email
  };
  
  // To save a file you have to pass in a string - so convert our json object to a string before passing it to rackspace
  var dataAsString = JSON.stringify(customerData);
  
  // save data to rackspace
  rackspaceFiles.saveFile(CONTAINER, CUSTOMER_ID, dataAsString, function(err, result){
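    if(err){
      // same as above: report the failure to the client rather than throwing
      console.error('Problem saving file to rackspace!', err);
      return res.status(500).send('Unable to save customer data');
    }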
    res.redirect('/');
  });
};

Conclusion

So after all that we have a pretty simple way to manage our data. Adding a new "table" is as easy as creating a new Container in the Rackspace Control Panel and then adding that container to the CONTAINERS list in rackspaceFiles.js. While it's definitely not as robust as standing up an actual database, there are scenarios where something like this could make sense - especially if budgets are extremely tight.
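As a rough sketch of what a second "table" plus the production/test split hinted at in rackspaceFiles.js might look like: `customerOrders` is a made-up container, `buildContainers` is a hypothetical helper, and using NODE_ENV to detect production is just a common convention, not something the sample code does.

```javascript
// Logical "table" names mapped to container names on Rackspace.
var BASE_CONTAINERS = {
  customerConfig: 'demo-customer-config',
  customerOrders: 'demo-customer-orders' // hypothetical second "table"
};

// Prefixes every container with TEST- outside of production, mirroring the
// commented-out example in rackspaceFiles.js. Both sets of containers must
// already exist in the control panel.
function buildContainers(isProduction) {
  var containers = {};
  Object.keys(BASE_CONTAINERS).forEach(function (key) {
    containers[key] = (isProduction ? '' : 'TEST-') + BASE_CONTAINERS[key];
  });
  return containers;
}

var CONTAINERS = buildContainers(process.env.NODE_ENV === 'production');
console.log(CONTAINERS.customerConfig);
```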

With that said, it's probably worth noting that the project I used this on has outgrown this approach. It served us well for about 3-4 months, but the project is now at the point where the data is changing more often and we need to migrate to a real database. This method was great during the proof of concept phase and allowed us a lot of flexibility while keeping the cost low.

Feel free to contact me or post down below if you have any comments or questions!