Code Like A Champion

Deedle is a fantastic data frame library available in both C# and F#, but I don’t know if it’s earned the exposure it deserves. One explanation is that machine learning in .NET has never been terribly well supported, but as I’ve investigated further, I’ve found some key libraries that make it not just possible, but a great experience.

So what is a data frame? Picture this situation. You have a CSV file where each row is a different observation. Each column within the row corresponds to an attribute. For explanation purposes, we’ll use the Wine classification dataset available from the UCI machine learning repository.

In typical .NET usage, you might simply make a custom type (WineCultivar or some such) and then use a StreamReader to read lines from the file, split each one, and then generate instances of your class. Net result: an IEnumerable<WineCultivar> ready for your use. However, machine learning tasks require a fair amount of data manipulation and pre-processing to prep the data for learning. For example: you may need to center or standardize continuous data; you may need to explode categorical attributes into binary ones; you may need to impute missing values. That becomes a rather difficult process, as WineCultivar doesn't support any of it. We likely end up using a lot of anonymous types or dictionaries to handle the various transforms and while that can work, it's ugly. Also, suppose we want to operate on data at the column level (such as standardizing all values in a column by subtracting the mean and dividing by the standard deviation)? There's no particularly clean way to do that. Enter data frames!
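To make that pain point concrete, here is a minimal sketch of the hand-rolled approach using only plain .NET and LINQ. The three sample rows are made-up values shaped like wine.data (label first, then two features), not real records, and a named tuple stands in for the WineCultivar class to keep things short.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Made-up sample rows shaped like wine.data: label, alcohol, malic acid.
var lines = new[] { "1,14.0,1.7", "2,12.0,1.1", "3,13.0,2.6" };

// Parse each line by hand; a named tuple stands in for a WineCultivar class.
var wines = lines
    .Select(line => line.Split(','))
    .Select(p => (Label: int.Parse(p[0]),
                  Alcohol: double.Parse(p[1], CultureInfo.InvariantCulture),
                  MalicAcid: double.Parse(p[2], CultureInfo.InvariantCulture)))
    .ToList();

// Column-level work means pulling the column out of the rows first...
var alcohol = wines.Select(w => w.Alcohol).ToArray();
var mean = alcohol.Average();
var std = Math.Sqrt(alcohol.Sum(x => (x - mean) * (x - mean)) / (alcohol.Length - 1));

// ...and then stitching the transformed values back into new row objects.
var standardized = wines
    .Select(w => (w.Label, Alcohol: (w.Alcohol - mean) / std, w.MalicAcid))
    .ToList();

foreach (var w in standardized)
    Console.WriteLine($"{w.Label}: {w.Alcohol:F2}");
```

Every additional column you want to transform repeats the same extract, compute, and rebuild dance, which is exactly the bookkeeping a data frame takes off your hands.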

Data Frames

Data frames exist in a lot of different languages. Python has the awesome Pandas library and R gets maximum leverage out of them. Imagine we have a way to view our data in a tabular fashion (rows and columns), but that allows the columns to differ in type from one another. In other words, we may want some columns to be text and some to be integers and some to be floating-point. In a multi-dimensional array, we could only achieve that by declaring it of type object[,] and that’s so general as to be useless. Deedle, at a high level, gives us the power and flexibility to slice, index, group, and select data in a row-wise or column-wise fashion, and it keeps track of type conversions in a much cleaner way.

Loading a Data Frame

Terminology is important in Deedle. The documentation is excellent, but my goal is to make it even more approachable. First, both columns and rows have keys. In the default usage, the row keys will be integers (auto-incrementing ones technically) and the column keys are strings. Here’s how we can construct a Frame<int, string> by reading in the wine.data CSV file.

var frame = Deedle.Frame.ReadCsv(filename, hasHeaders: false);

In this case, Deedle assumed that we wanted an int row key and a string column key. Since our input file does not contain headers, we tell Deedle that as well through the method argument. Deedle thus uses Column_0, Column_1, etc. as the column names. However, we actually know the column names; they’re just in a separate file. So we can easily set the column names after the fact by doing:

frame.RenameColumns(new string[]
{
    "Label", "Alcohol", "Malic Acid", "Ash", "Alcalinity of Ash", "Magnesium", "Phenols", "Flavanoids",
    "Nonflavanoid Phenols", "Proanthocyanins", "Color Intensity", "Hue", "OD280/OD315", "Proline"
});

Changing Values Within A Column

For machine learning purposes, Label is the class label for the observation and the other columns are the features. Under the hood, Deedle is storing the data in a column-centric fashion. What I mean by that is, you can picture the Frame as being a collection of collections. Each sub-collection corresponds to one of the features (including the Label one). While row access is supported in Deedle, the preferred way to slice and access data is through the columns, as that is typically a more natural use-case for a data frame. If you find yourself accessing through rows a lot, you can swap the rows and columns by using Transpose.
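As a rough illustration of column access versus row access, here is a small sketch that assumes the Deedle NuGet package is referenced. The tiny frame is built from made-up records with Frame.FromRecords purely so the example stands alone; the real frame would come from Frame.ReadCsv as above.

```csharp
using System;
using System.Linq;
using Deedle;

// Made-up rows; row keys default to 0, 1, 2 and column keys are the property names.
var frame = Frame.FromRecords(new[]
{
    new { Label = 1, Alcohol = 14.0 },
    new { Label = 2, Alcohol = 12.0 },
    new { Label = 3, Alcohol = 13.0 },
});

// Column access: one typed series covering every row.
Series<int, double> alcohol = frame.GetColumn<double>("Alcohol");
double meanAlcohol = alcohol.Values.Average();

// Row access: one observation, with values looked up by column key.
double firstAlcohol = frame.Rows[0].GetAs<double>("Alcohol");

Console.WriteLine($"mean = {meanAlcohol}, first = {firstAlcohol}");
```

The column route hands back a typed series in one call, while the row route gives an object series that has to be queried value by value, which is part of why column access tends to feel more natural.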

In this data set, there are 3 possible class labels (1, 2, or 3). This is a good fit for multi-class classification and we’re going to use the Accord.NET library for this. I will not go into much detail on that library and will save that for a follow-up post. One route to go for this problem, particularly knowing that 2 of the classes are not linearly separable, is to try a support vector machine with a nonlinear kernel. Accord provides an implementation of those, as well as various learning algorithms to train the SVMs, but here’s the challenge: Accord requires the labels in a multi-class scenario to start at 0 and go up, and our labels as given in the data set don’t conform to that. Deedle can handle that with minimal effort:

var relabeled = frame.Columns["Label"].Select(kvp => (int)kvp.Value - 1);
frame.ReplaceColumn("Label", relabeled);

The key thing to know is that Series (the columns) are immutable. All we can do is generate new series from the old series and then replace them. However, Frames are to a limited extent mutable. In this case, we want to subtract 1 from every label to move them from the range [1,3] to the range [0,2].

To do so, we simply choose the Label column using the Columns property, then use a normal LINQ projection. The trick in this particular usage is that we are going to enumerate through KeyValuePairs where the key is the row key and the value is the value in the Series. In this case, we don’t need the key at all; all we need to do is subtract 1 from the value. We have to cast it to an int because under the hood, Deedle is storing the data as objects so we have to keep track of this fact (though there are some ways to be safe about it we’ll cover another time). Now relabeled is of type Series<int, int> where the first type parameter is the type of the row key (unchanged) and the second parameter is the type of data stored within the series (changed to an int by us). Now what we need to do is replace the previous Label column in the frame with our new version, which can be done easily enough using ReplaceColumn. This mutates the data frame accordingly. The one caveat here is that Deedle will now have the Label column at the “end” of the frame instead of the beginning so if you inspect or print it, you’ll see it shows up in a different place.
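One way to avoid the cast, sketched here as an assumption rather than as the safer approach alluded to above, is to ask for the column with GetColumn<int>, which returns a typed Series<int, int>. The small frame is again built from made-up records so the snippet stands alone, and it assumes the Deedle NuGet package is referenced.

```csharp
using System;
using System.Linq;
using Deedle;

// Made-up rows standing in for the wine data.
var frame = Frame.FromRecords(new[]
{
    new { Label = 1, Alcohol = 14.0 },
    new { Label = 2, Alcohol = 12.0 },
    new { Label = 3, Alcohol = 13.0 },
});

// GetColumn<int> yields typed values, so the lambda needs no cast from object.
Series<int, int> relabeled = frame.GetColumn<int>("Label").Select(kvp => kvp.Value - 1);

// The series itself is immutable; the frame is what gets mutated here.
frame.ReplaceColumn("Label", relabeled);

Console.WriteLine(string.Join(", ", frame.GetColumn<int>("Label").Values));
```

If the column cannot be converted to the requested type, the failure surfaces at the GetColumn call rather than somewhere inside the projection, which makes it a little easier to track down.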

Selecting Subsets of Rows

We need to split the frame into two to correspond with a training set and a test set. Deedle supports subsetting in a variety of ways.

I was fortunate recently to be able to attend O’Reilly’s Software Architecture Conference in New York City, held April 10-13, 2016 at the Hilton Midtown Hotel in Manhattan. This was the first conference I’ve ever been to and while it went fine, I realized there are a lot of small things that go into a conference that simply never occurred to me. If you’re about to go to a conference for the first time (specifically a technical conference), I hope this will help you too.

Why A Conference

This happens before you attend the conference, before you even plunk down the money to register. I was asked this question by a fellow attendee: why this conference? Why a conference at all? The reason I chose O’Reilly was two-fold. One: my boss initially asked me about going to Microsoft’s Build 2016 conference in San Francisco. I wasn’t interested in that because Build has kind of become a circus and I didn’t want to travel to San Francisco. I found the O’Reilly conference because I really wanted to attend a non-Microsoft conference. I love C# and .NET and Microsoft has really turned over a new leaf in many regards, but I wanted a more “pure” interpretation of a concept and O’Reilly seemed perfect as it’s agnostic to language (although yes, to some extent they assume folks are on Java or at least the JVM). And two: I knew I could take a train from my home (Charlottesville, VA) to New York and take an easy subway ride to the hotel. I’m okay with flying, but it’s the circus around flying that I don’t like and when you factor in everything, not a huge time savings.

Good Point, Tell Me More About Travel

For people that travel a lot, flying and taking public transportation isn’t a big deal. I do neither that often so the ease of getting from the transit hub to the hotel was important to me. New York has such a dominant public transit system compared to basically every other US city that getting around it is pretty easy once you figure out a few basic things. So definitely if conferences are new to you and you’re anxious, try to find one in New York. We took Amtrak up from Charlottesville; it took about 6 hours and was a very easy ride. If you’re taking Amtrak, definitely spring for the business class. It’s like $50 per trip, but you get extra leg room plus a complimentary beverage from the travel car. Amtrak boasts about providing Wifi, but for most of the trip it was absolute garbage so I wouldn’t count on it; plan to work offline.

When you arrive at Penn Station on Amtrak, it’s a bit overwhelming. We missed the correct, easy route to the subway and ended up leaving the station, walking a few blocks, then reentering the subway. It was raining and that was a lot of steps while carrying a giant duffel bag so make sure you read the signage or ask someone for help. Once we got into the subway, we (wife and I) bought 7-day unlimited metro passes for $31 each. That’s definitely the way to go. A lot of the subway stations have an Information booth so definitely ask that person for guidance on which train to get on for your destination. We did that the first time, which was a help because we wanted to take the E train, but it wasn’t running that day (Saturday) so the Information person helped us out. The more useful thing to know: Google Maps will give you Subway routes. And for most of the trip, that’s how we did it. We would type where we wanted to go into Google on our phones (which includes restaurants or other landmarks), then choose the Subway option and it would show us which trains to take, how many stops, connections, how long it would take, etc. That was our go-to and it worked great. Other guides say this and it’s true: you get the hang of swiping your metro card and going through the turnstile. Watch other people do it if you need to, but if you’re nervous, swipe the card, make sure it reads “Go”, then push through the turnstile. Don’t be in too big of a hurry but after a few times, you’ll get the rhythm.

Here’s what my wife and I had packed: backpack (I wore), two carryable bags (my wife), one large duffel bag with wheels (I carried). We debated small rolling suitcases or one large rolling suitcase, but chose the duffel bag we had. For us, that worked great, mainly because in the subway you have to lift it over the turnstile. That’s easier with a duffel bag (designed to be carried) vs. a suitcase. With 2 people, it’s pretty easy to send someone through the turnstile, hand them the bags, then have the second person swipe through. That would be possible with larger suitcases too but lifting them can be more difficult. That said, I carried the duffel bag through myself and it was okay. If you’re concerned, cabs are always an option (or Uber), but we didn’t avail ourselves of either.

Okay, Talk About The Conference A Bit

I received an email from O’Reilly with information regarding check-in to the conference. The most important factor is the check-in hours, which O’Reilly helpfully put into the email. Also O’Reilly had a conference mobile app I downloaded and it was very useful (more on that later). At least at this conference, in-person check-in was necessary only to get your badge. Here, the badges had RFID so it could tell what sessions you attended. I’m sure that’s standard practice now. I was surprised O’Reilly didn’t give out more swag on check-in.

O’Reilly split up the conference into a few parts. The conference proper was 2 days, Tuesday and Wednesday. Before that, if you paid more, you could do a 2-day hands-on training Sunday and Monday or a day of tutorials (one in the morning, one in the afternoon) on Monday. I opted for the day of tutorials.

Here’s my number one finding about O’Reilly: 45 minutes is the sweet spot for sessions. I’ll break down the sessions in a bit, but I sat through talks that were 20 minutes (morning keynotes at the conference), 45 minutes (afternoon sessions during the conference), 90 minutes (the morning session at the conference), and 3 hours (the Monday tutorials). With one exception, the 20 minute keynotes were pretty bland. That’s not nearly enough time to get into anything interesting or tangible. The 3 hour sessions were simply too long. Neither speaker I attended seemed to have 3 hours of quality content so it felt like a lot of filler or fluff around otherwise useful information. That also applies to the 90 minute conference sessions; both had 45 minutes of material padded unnecessarily. I’m generalizing wildly here, as of course I didn’t attend every session of the entire event and maybe some were better than all the ones I attended. Just be aware that while some of the conference speakers do it a lot, others aren’t as experienced and you’ll probably be able to tell.

Here are some niceties that were available that I didn’t realize or use, but that would definitely come in handy situationally.

Some hotels provide a baggage check that allows you to fairly securely store your luggage with the hotel when you don’t have a room. For attendees, this was useful if they were leaving at the end of a day but had to check out at noon. The baggage check allowed them some place to leave their stuff while they attended without carting it around.
O’Reilly provided computers to print boarding passes. This was helpful for people flying out who didn’t print it earlier or forgot or whatever.
The food supplied for lunches and snacks was pretty good (relative to other hotel food I’ve had), but beyond being kind of picky, I don’t have any weird allergies. All 3 days, lunches were fancy sandwiches (turkey, roast beef, chicken, it varied each day). They also had a soup. There were some vegetarian options, but if you’re gluten free (like my mother-in-law, my dad’s girlfriend, and a developer that reports to me), I have no idea how (or if) that was accommodated. Doesn’t mean don’t ask, but once again, people with allergies might get it in the shorts a bit. The snacks were fruit and some sweets plus coffee. I don’t personally drink coffee so unfortunate they didn’t have iced tea or something (albeit this was NY, iced tea might be a foreign concept :-)

Break Down The Sessions!

The sessions covered a broad spectrum from incredibly engrossing to not that interesting. When I selected talks, I did so using this process: a) Did I recognize the speaker? If so, I attended. b) Did I think I would make use of the subject matter at work? If so, I’d probably attend. c) Was it better than all the other options though none of the options were great for me in that time slot? This was the default if (a) and (b) weren’t met.

Hands down the most enjoyable talk I heard was given by Mark Bates. He discussed how he and a friend tried to launch a platform to make it easier for conference speakers to submit talks. They had 3 months to build a version that his friend had already sold to a Ruby conference. He walked through how he laid out the latest and greatest in microservice architectures, selected a language he’d never used before (Go), and then picked a number of other slick technologies. Long story short: after 3 months he had basically nothing. He then slowly walked that back to a standard Ruby monolith backed by Postgres and with Angular. He pointed out that this was a pretty heretical finding at this conference with so many microservice talks going on, but his main point was: if your company doesn’t have scale problems like Netflix or Amazon, why would their solution be necessary? In his case, his company/product was so new that the scale just wasn’t there. Better to start with a monolith, get something online, and then iterate over it later. The things I enjoyed about this talk: Mark told the story as a narrative and he was incredibly funny and personable about it; the lesson really resonated with me that other people’s solutions are only useful if you have their problems.

I listened to two talks by Michael Nygard, one on maneuverable architecture and the other on architecture without an end state. I had read Michael’s book (Release It!, which I highly recommend) and so I wanted to hear him speak for that reason. Michael is a great presenter; it’s clear he does it a lot. Both talks were packed with information and I took pages of notes. I won’t attempt to recreate all of them, but if you ever have a chance to hear Michael, take advantage of it.

The keynotes in the morning were generally kind of blah as 20 minutes isn’t enough time to get into much that’s actionable, but to me the best one was given by Janelle Klein on “Make the pain visible”. She discussed software projects that had failed on her in the past and her recent efforts to find a way for software developers to expose the difficulty and friction of writing code and the myriad activities that get in the way. This was truly a powerful perspective and a thing I hear from my developers quite a bit. I liked this talk so much I contacted Janelle about her e-book and am hoping to leverage some of her ideas in the near future.

Wrap Up

Overall, I enjoyed the O’Reilly conference experience. I learned quite a bit. I wish I had gone to some different sessions than those I initially selected so that’s a lesson for next time to think more deeply about the topic or try to find better intel on what will be covered. Would I go to O’Reilly again? That’s a tough question. I would go, yes, though there are other conferences I’d like to try first. One of those is QCon, which is loaded with speakers across a lot of technologies and I think that has some compelling offerings. I’ve heard great things about some others and I wish I could go to NDC, but my company isn’t much on sponsoring conference attendance at overseas conferences so unless NDC comes to the US, unlikely I’ll make it to that one anytime soon.

Getting Started with Deedle for Machine Learning