A week in ML

A group of us at CJ Engineering recently got the opportunity to set aside all other project and squad work for one week to focus exclusively on a machine learning exercise. The objective was not so much to come up with a great prediction algorithm that would yield a fountain of money for CJ and our clients; rather, it was to get stuck into a machine learning challenge and to experience all the pitfalls and promises, the trials and triumphs, that are part and parcel of practical machine learning. And if that fountain-of-money-yielding algorithm did emerge from the exercise, so much the better…

The nuts and bolts of the problem statement, the steps followed, and the tools and technology used are left for a separate post. In this series of three posts I will share the key lessons I learned that I plan to take with me to my next ML endeavor:

1. know your data
2. know your algorithm(s)
3. know your architecture

This post will delve into the first of these:

know your data

Data – your new best friends

The ML success story of my dreams goes something like this: take a bunch of attributes, toss them all into some algorithm, click play, and – voilà – a robust prediction model materializes, our clients’ revenue goes through the roof, and I am awarded the Nobel prize in economics. While we await this moment of glory (it hasn’t happened yet, but there’s still time), what we can do in the meantime is get cosily familiar with our features: our input data. They are our new best friends. Get to know their size and shape; take them to the pub for a beer; let them open up to you and share their insights.

There’s a plot in here somewhere…

If going to the pub turns out not to be an option, another way to get insight from your features is to plot them against the target variable in the training data, or even against each other (although that sounds a bit subversive). Zeppelin notebooks make this easy, by the way. In general, being able to visualize your features in various ways, in the context of your ML question, is very effective in informing your approach to the problem.
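Before (or alongside) eyeballing scatter plots in a notebook, a quick non-graphical first pass is to correlate each feature with the target. A minimal sketch with pandas, assuming a small toy dataset – the column names (`clicks`, `season`, `revenue`) are invented for illustration, not taken from our actual features:

```python
import pandas as pd

# Toy training data; column names are invented for illustration.
df = pd.DataFrame({
    "clicks": [120, 340, 90, 410, 150, 280],
    "season": [1, 2, 1, 3, 4, 2],
    "revenue": [10.5, 30.1, 8.2, 44.0, 14.7, 25.3],
})

# Correlation of each feature with the target: a cheap screen for which
# features deserve a closer look in a scatter plot.
corr = df.corr()["revenue"].drop("revenue")
print(corr.sort_values(ascending=False))
```

A strong correlation is a hint, not a conclusion – plotting can still reveal non-linear relationships that a single correlation number hides.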

More is more.

Make sure you have enough data: enough samples to split into meaningful training and test datasets, and enough to cover all the variability in your data. When working with time series data, make sure it goes back sufficiently far in time: for example, if you know your model will be influenced by seasonality, make sure your data covers all the seasons. But, you ask, how much is enough – what’s the magic number? I don’t know; it depends. In one experiment we tried to infer seasonality with less than two years of data. It seems to me that we ought to have had at least three full years of data to train with.
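For time series data, the split itself matters too: holding out the most recent slice of the timeline, rather than sampling rows at random, means the test set simulates genuinely unseen future data. A minimal sketch, assuming monthly samples over the three full years suggested above – the column names are illustrative:

```python
import pandas as pd

# Three full years of monthly data (36 samples); names are illustrative.
months = pd.date_range("2014-01-01", periods=36, freq="MS")
df = pd.DataFrame({"month": months, "revenue": range(36)})

# Chronological split: train on the first 80% of the timeline, test on the
# most recent 20%, so the model is evaluated on "future" data it never saw.
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]
print(len(train), len(test))
```

With 36 monthly samples this leaves roughly seven months held out – enough to check the model against at least part of a seasonal cycle.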

Ask pertinent questions to avoid impertinent blowouts

In a linear regression model predicting future revenue for your clients, is it really appropriate to train a single model on a dataset that includes all your clients’ data? Or should you be training a model per client? Beware of categorical feature blowout: a relatively moderate input dataset can translate into massive amounts of computation if you use a high-cardinality input as a categorical feature. Our program had the impertinence to run out of memory before our data had even made it into the algorithm. It’s also easy to overfit with this kind of data.
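The blowout is easy to see with one-hot encoding, the usual way categorical features are fed to a regression: each distinct category value becomes its own column. A minimal sketch with pandas – the `client_id` values here are made up, and in our case the cardinality was far higher than five:

```python
import pandas as pd

# A tiny dataset with a categorical feature; values are made up.
df = pd.DataFrame({"client_id": ["c1", "c2", "c3", "c1", "c4", "c5"]})

# One-hot encoding turns each distinct category into its own column:
# 5 distinct clients -> 5 columns. With 100,000 clients, the same 6 rows
# would balloon into 600,000 cells before the algorithm even starts.
onehot = pd.get_dummies(df["client_id"])
print(onehot.shape)
```

This is why a single column with high cardinality can exhaust memory long before training begins – the encoded width scales with the number of distinct values, not the number of rows.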


Coming soon: know your algorithm(s).

Posted in machine learning