Distributed File Systems

I’ve been thinking about the best way to configure a bunch of computers for doing large-scale machine learning experiments. One problem that always pops up is how to get some piece of the data to the node that needs to process it (a mapping in the Map Reduce framework).

You can cook up various schemes to distribute the data, but in the end I don’t think anything is going to beat the simplicity of a shared file system. However, when your cluster starts getting big and your data starts getting large, you start running into problems with traditional shared file systems like NFS (contention mostly). This leads one to consider a truly distributed file system.

It should come as no surprise that Google has the Google File System. I think many of the amazing things the people at Google are able to do can be attributed to the fact that they have their map-reduce and distributed file system infrastructure properly sorted out.

For the rest of us, there’s Hadoop, which is nice, but still not quite as easy to use as I’d like it. Ideally, I want to install the latest version of my Linux distribution or run a setup program on Windows and it should just work. No mess, no fuss. On Windows I want to see my distributed file system as a drive letter (or as a directory on Linux): this makes it easy to make legacy applications (C++ programs, MATLAB scripts, etc.) operate on your data. Along these lines, Hadoop has something called Pipes which could be used in some cases, but ideally I want the fact that I’m operating on distributed data to be completely transparent to my applications.

Here OpenAFS is showing some promise. It seem some guys are working on an IFS driver for OpenAFS (see OpenAFS for Windows Requested Features and Road Map). IFS looks like the right way to integrate a new file system with the Windows platform. Last I checked, Hadoop didn’t support all the functions of a general purpose file system, but maybe it could still be integrated with IFS to give a it a really nice interface for Windows users. I don’t know what OpenAFS does on Linux, but I’m assuming it works nicely there already. I should investigate…

I mention Hadoop and OpenAFS, since they seem to be the only candidates in the list of distributed file systems on Wikipedia that appear to be free, properly maintained and generally useful.

Once you have your data sorted out, you still need to distribute your computation across the nodes in your cluster. I’ll discuss that in another post.

By the way, the Hadoop folks recently created a subproject called Mahout, that is focusing on building distributed implementations of various machine learning algorithms, following the ideas published in Map-Reduce for Machine Learning on Multicore.

Leave a Reply