Machine learning tools of the trade
Like any maturing discipline, machine learning is splitting into specialties. And just as a surgeon uses a scalpel, and a general practitioner prefers a stethoscope, different tools are appropriate for different use cases within these subfields.
In the last few months, I’ve run projects that have used tools in at least five categories. Here’s a roundup:
Tools for machine learning researchers: Examples here are Theano, Caffe, and Torch. Designed at universities, these tools retain their roots there, with documentation assuming you’ll be willing to learn the math and algorithms of machine learning. Getting them up and working takes, in my experience, several days at least, for a fairly advanced programmer / sysadmin type. This is a big investment of time, which is well justified if you’re already up the prerequisite learning curve and will be using this tool extensively, especially if you’re looking to build cutting-edge algorithms.
Tools for non-researcher data scientists: Amazon Machine Learning (on AWS) and Microsoft's Azure Machine Learning are targeted at the newly emerging data science specialist. This expert is not an algorithm designer or a PhD student, but rather a practitioner who wants to build learners, fast. Just as this role is brand-new, so are these tools, with both announced in the last few months. Here, the prerequisite learning curve is much smaller, and the tool learning time shorter too. The visual environments make them particularly easy to learn. But you'll hit a point where you'll want to advance to the professional tools.
Tools for advanced professional ML “drivers”: Statistics overlaps substantially with machine learning, but the tool set is much more mature. Tools like SAS, SPSS, and (more recently) R are standard issue in the professional statistician’s kit.
These tools were designed for the trained professional, and the investment required to learn both the underlying statistics and the tools themselves was historically quite substantial.
But a funny thing’s happening with the new kid on the block: R. First, it’s gaining a strong foothold against SAS in even the most advanced modeling institutions: places like banks, mortgage companies, and more. Second, if you’re willing to work in a text-based world (which isn’t hard once you get into it, and let’s face it, a REPL can be fun), new ML libraries mean that you can build sophisticated learners really fast, and without a big learning curve.
An example: on a recent project, a colleague and I spent five days together trying to get Caffe to work for a deep learning problem. After a lot of frustration with an unnecessarily complex network-specification language, we pivoted to H2O and had it running in about 20 minutes. This experience was typical of a number of my recent projects, where H2O was an order of magnitude easier to use. Bottom line for me:
“I love #H2O, and use it whenever I can. It’s faster and more nimble by far than anything else.”
Tools for scale and speed: I’ve dipped my toe into Mahout lately, which is in an entirely different category than the others. Once you’ve built your learner (using one of the above approaches), in many scenarios you’ll want to deploy it to process massive amounts of data. Running Mahout on an AWS EMR cluster, for instance, you can stand up a substantial virtual data center in minutes at low cost, and crunch data at scale. This is where cluster-compute ML environments like Mahout and Apache Spark MLlib come into their own. And the compute you can get on AWS is virtually limitless, including (unlike H2O) GPU-based machine images, which will scream your ML-based systems into hyperdrive.
As with the other professional tools, you’re going to need a strong technical background. Here, however, it’s more of the *nix sysadmin variety, combined with some data-wrangling expertise, rather than ML algorithm design or software development.
Roll your own: Finally, don’t forget to consider rolling your own tool when needed. Most machine learning algorithms are reasonably simple to implement from pseudocode, which is openly described in technical articles, and reference implementations are often published in open-source repos, which (modulo licensing: be sure to check) an average coder can modify and extend for special needs. This is the approach I took recently when I needed a restricted Boltzmann machine without the kerfuffle of an academic framework around it. After all, the core algorithm is less than 50 lines of code. I found a basic GitHub repo to fork, and added a bunch of functionality to meet my client’s needs.
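To give a sense of just how small that core is, here’s a minimal sketch of a binary restricted Boltzmann machine trained with one-step contrastive divergence (CD-1), written in Python with NumPy. This is my own illustration, not the repo I forked; the class name, hyperparameters, and training loop are all assumptions chosen for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal binary RBM trained with one-step contrastive divergence."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.1 * self.rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)  # visible-unit biases
        self.b_h = np.zeros(n_hidden)   # hidden-unit biases

    def train(self, data, lr=0.1, epochs=100):
        n = len(data)
        for _ in range(epochs):
            # Positive phase: hidden probabilities given the data
            h_prob = sigmoid(data @ self.W + self.b_h)
            h_sample = (self.rng.random(h_prob.shape) < h_prob).astype(float)
            # Negative phase: one Gibbs step back to a visible reconstruction
            v_recon = sigmoid(h_sample @ self.W.T + self.b_v)
            h_recon = sigmoid(v_recon @ self.W + self.b_h)
            # CD-1 updates: data statistics minus reconstruction statistics
            self.W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n
            self.b_v += lr * (data - v_recon).mean(axis=0)
            self.b_h += lr * (h_prob - h_recon).mean(axis=0)

    def reconstruct(self, v):
        h = sigmoid(v @ self.W + self.b_h)
        return sigmoid(h @ self.W.T + self.b_v)
```

Train it on a few binary patterns and the reconstruction error drops within a couple hundred epochs; that whole training procedure really is under 50 lines, which is exactly why forking and extending a small repo can beat wrestling with a heavyweight framework.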
Have I missed any categories and/or tools? Please add your thoughts to the comments and I’ll include the best in a later draft of this article.