Archive for the ‘Software Engineering’ Category

Hyperic SIGAR

Monday, February 11th, 2008

Hyperic SIGAR (System Information Gatherer and Reporter) is a cross-platform, cross-language library and command-line tool for accessing operating system and hardware information in C, Java, Perl and C#. SIGAR is licensed under the GPL version 2. Not quite sure what this implies for Java projects, but anyway.

Just spotted this library as part of the upcoming GridGain 2.0 release.

Distributed File Systems

Saturday, February 9th, 2008

I’ve been thinking about the best way to configure a bunch of computers for doing large-scale machine learning experiments. One problem that always pops up is how to get some piece of the data to the node that needs to process it (a map task in the MapReduce framework).

You can cook up various schemes to distribute the data, but in the end I don’t think anything is going to beat the simplicity of a shared file system. However, when your cluster starts getting big and your data starts getting large, you start running into problems with traditional shared file systems like NFS (contention mostly). This leads one to consider a truly distributed file system.
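To make the appeal of a shared file system concrete, here is a minimal sketch in Java. It assumes every node sees the same list of input files and knows its own index and the total node count. The class and method names (InputSplitter, filesForNode) are made up for illustration and are not part of Hadoop or any other library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: picks the input files that one node should process. */
final class InputSplitter {
    private InputSplitter() {}

    /**
     * Returns the files assigned to node nodeIndex out of nodeCount nodes.
     * The list is sorted first so that every node computes the same
     * assignment on its own, without asking a coordinator.
     */
    static List<String> filesForNode(List<String> allFiles, int nodeIndex, int nodeCount) {
        if (nodeCount <= 0 || nodeIndex < 0 || nodeIndex >= nodeCount) {
            throw new IllegalArgumentException("invalid node index or node count");
        }
        List<String> sorted = new ArrayList<>(allFiles);
        Collections.sort(sorted);

        // Round-robin: node i takes the files at positions i, i + n, i + 2n, ...
        List<String> mine = new ArrayList<>();
        for (int i = nodeIndex; i < sorted.size(); i += nodeCount) {
            mine.add(sorted.get(i));
        }
        return mine;
    }
}
```

Each node would call filesForNode with the shared directory listing and then simply open its files by path. Nothing here deals with contention, which is exactly the part that gets hard as the cluster grows.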

It should come as no surprise that Google has the Google File System. I think many of the amazing things the people at Google are able to do can be attributed to the fact that they have their map-reduce and distributed file system infrastructure properly sorted out.

For the rest of us, there’s Hadoop, which is nice, but still not quite as easy to use as I’d like. Ideally, I want to install the latest version of my Linux distribution or run a setup program on Windows and it should just work. No mess, no fuss. On Windows I want to see my distributed file system as a drive letter (or as a directory on Linux): this makes it easy to make legacy applications (C++ programs, MATLAB scripts, etc.) operate on your data. Along these lines, Hadoop has something called Pipes which could be used in some cases, but ideally I want the fact that I’m operating on distributed data to be completely transparent to my applications.

Here OpenAFS is showing some promise. It seems some people are working on an IFS (Installable File System) driver for OpenAFS (see OpenAFS for Windows Requested Features and Road Map). IFS looks like the right way to integrate a new file system with the Windows platform. Last I checked, Hadoop didn’t support all the functions of a general-purpose file system, but maybe it could still be integrated with IFS to give it a really nice interface for Windows users. I don’t know what OpenAFS does on Linux, but I’m assuming it works nicely there already. I should investigate…

I mention Hadoop and OpenAFS, since they seem to be the only candidates in the list of distributed file systems on Wikipedia that appear to be free, properly maintained and generally useful.

Once you have your data sorted out, you still need to distribute your computation across the nodes in your cluster. I’ll discuss that in another post.

By the way, the Hadoop folks recently created a subproject called Mahout, which focuses on building distributed implementations of various machine learning algorithms, following the ideas published in Map-Reduce for Machine Learning on Multicore.

Intel Threading Building Blocks open sourced

Wednesday, July 25th, 2007

Intel has released Threading Building Blocks (TBB), their library for doing multithreaded programming with C++, under the GPL. If you’re stuck in C++ land but yearn for the power of all those shiny new cores (with 4 Core 2 Duo cores now within reach of anyone with a modest budget), this might be for you. As an added bonus, the documentation looks pretty impressive.

They’re also running a contest to promote TBB.

Random Java Stuff

Thursday, June 28th, 2007

Java Genetic Algorithms Package (JGAP) is (surprise) a genetic algorithm package for Java. JGAP 3.2 was recently released. Coolest feature: you can use JGAP to evolve a Robocode robot (more about Robocode).

After reading this post by Bill Pugh to OpenJDK’s quality-discuss mailing list, I immediately gave FindBugs a try. The Eclipse plugin for FindBugs is highly recommended. Getting FindBugs going in Maven is a bit more… interesting. I think you want the findbugs-maven-plugin, although there’s also a maven-findbugs-plugin. It seems findbugs-maven-plugin includes support for FindBugs 1.2.0, but I wasn’t able to consistently convince it to use that instead of 1.0.0. Good luck.

I promptly unleashed FindBugs on Apache ActiveMQ and found some minor issues and some more serious issues. Maximum respect to Rob Davies for fixing the latter collection.

Update: I unleashed FindBugs on JRuby. See JRUBY-1173.

Memory allocation in Windows programs

Friday, May 25th, 2007

After searching far and wide (actually, after Googling for the right phrase and clicking on the 13th result), I found a blog entry that deals with the issues you can run into when allocating and deallocating memory across module boundaries on Windows. Read it at The Old New Thing: Allocating and freeing memory across module boundaries.

Learn a new language

Sunday, May 20th, 2007

If I remember correctly, The Pragmatic Programmer suggests that you learn a new language every now and then (every year?). Judging from a bunch of blogs I’ve been reading recently, you might want to look at one of the following:

Along these lines I recently discovered two very interesting blogs you might want to keep an eye on:

MultithreadedTC: A framework for testing concurrent Java applications

Friday, May 11th, 2007

MultithreadedTC is a framework that makes it easier to test concurrent abstractions (as opposed to JUnit, which doesn’t play well with threads). It was mentioned in the talk by Pugh, Goetz (of Java Concurrency in Practice fame) and Click (of lock-free hash table fame) on Testing Concurrent Software at JavaOne 2007.
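One reason plain JUnit struggles with threads is that an assertion failing on a worker thread never reaches the thread running the test, so the test can pass anyway. The sketch below uses only java.util.concurrent and the standard Thread API, not MultithreadedTC itself. ThreadFailureDemo and runAndCapture are hypothetical names chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical demo: surfaces a failure that happened on another thread. */
final class ThreadFailureDemo {
    private ThreadFailureDemo() {}

    /** Runs the task on a new thread and returns whatever it threw, or null. */
    static Throwable runAndCapture(Runnable task) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread worker = new Thread(task);
        // Without this handler the error would only be reported on stderr
        // and the calling (test) thread would never learn about it.
        worker.setUncaughtExceptionHandler((thread, error) -> failure.set(error));
        worker.start();
        try {
            worker.join(); // wait for the worker to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        return failure.get();
    }
}
```

A test could call runAndCapture and rethrow any non-null result so that the failure shows up in the test report. This only covers propagating failures. Controlling how threads interleave, which is what the talk was about, needs something more.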

Google TechTalk: Guice

Sunday, May 6th, 2007

I watched the Google TechTalk Java on Guice: Dependency Injection, the Java Way today. Very interesting stuff. Definitely worth watching before you dive into the rest of the Guice documentation.
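For anyone who hasn’t met dependency injection before, here is a rough sketch of the idea in plain Java, wired by hand and without Guice. All the names (Greeter, EnglishGreeter, WelcomeService) are invented for illustration. As I understand it, Guice’s job is to do the wiring step for you based on bindings and @Inject annotations.

```java
/** Hypothetical dependency: something that knows how to greet. */
interface Greeter {
    String greet(String name);
}

/** One possible implementation of the dependency. */
final class EnglishGreeter implements Greeter {
    @Override
    public String greet(String name) {
        return "Hello, " + name;
    }
}

/** The service receives its dependency instead of constructing it itself. */
final class WelcomeService {
    private final Greeter greeter;

    // Constructor injection: the caller (or a container) supplies the Greeter.
    WelcomeService(Greeter greeter) {
        this.greeter = greeter;
    }

    String welcome(String name) {
        return greeter.greet(name) + "!";
    }
}
```

Production code would pass in new EnglishGreeter(), while a test can pass in a stub, which is most of the point.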

Yahoo! Pig

Saturday, April 28th, 2007

Just spotted a mention of Yahoo! Research’s Pig project on the Hadoop mailing list. More at Geeking with Greg: Yahoo Pig and Google Sawzall.