
Parsing CSV data in Scala with opencsv

Posted: July 28th, 2011 | Filed under: Java, Open Source, Scala

One of the great things about Scala (or any JVM language for that matter) is that you can take advantage of lots of libraries in the Java ecosystem. Today I wanted to parse a CSV file with Scala, and of course the first thing I did was search for scala csv. That yielded some interesting results, including a couple of roll-your-own regex-based implementations. I prefer to lean on established libraries instead of copying and pasting code from teh internet, so my next step was to search for java csv.

The third hit down was opencsv, which looked solid, had been updated recently, and was Apache-licensed. All good signs in my book. It’s also in the main Maven repository, so adding it to my sbt 0.10.x build configuration was easy:


libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.1"

The syntax for sbt 0.7.x is similar, but you should really upgrade:


val opencsv = "net.sf.opencsv" % "opencsv" % "2.1"

Once that configuration change is in place, running sbt update will let you use opencsv in either your project or the shell via sbt console.

There are a couple of simple usage examples on the opencsv site along with a link to javadocs. The javadocs are currently for the development version (2.4) and include an improved iterator interface that would be useful for larger files.

Let’s parse some CSV data in Scala. We’ll use a CSV version of violations of 14 CFR 91.11, 121.580 & 135.120, affectionately known as the unruly passenger dataset (as seen in the Django book):


Year,Total
1995,146
1996,184
1997,235
1998,200
1999,226
2000,251
2001,299
2002,273
2003,281
2004,304
2005,203
2006,136
2007,150
2008,123
2009,135
2010,121

You can download the raw data as unruly_passengers.txt.

Here’s a full example of parsing the unruly passengers data:


import au.com.bytecode.opencsv.CSVReader
import java.io.FileReader
import scala.collection.JavaConversions._

val reader = new CSVReader(new FileReader("unruly_passengers.txt"))
for (row <- reader.readAll) {
    println("In " + row(0) + " there were " + row(1) + " unruly passengers.")
}

This will print out the following:


In Year there were Total unruly passengers.
In 1995 there were 146 unruly passengers.
In 1996 there were 184 unruly passengers.
In 1997 there were 235 unruly passengers.
In 1998 there were 200 unruly passengers.
In 1999 there were 226 unruly passengers.
In 2000 there were 251 unruly passengers.
In 2001 there were 299 unruly passengers.
In 2002 there were 273 unruly passengers.
In 2003 there were 281 unruly passengers.
In 2004 there were 304 unruly passengers.
In 2005 there were 203 unruly passengers.
In 2006 there were 136 unruly passengers.
In 2007 there were 150 unruly passengers.
In 2008 there were 123 unruly passengers.
In 2009 there were 135 unruly passengers.
In 2010 there were 121 unruly passengers.

There are a couple of ways to make sure that the header line isn't included. If you specify the separator and quote character, you can also tell it to skip any number of lines (one in this case):


val reader = new CSVReader(new FileReader("unruly_passengers.txt"), ',', '"', 1)

Alternatively you could create a variable that starts true and is set to false after skipping the first line.
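The flag approach might look something like the sketch below. To keep it runnable without the data file, it works on a small in-memory list of rows standing in for what reader.readAll returns. HeaderSkip, withFlag, and withDrop are illustrative names of my own, not part of opencsv. Dropping the first row before iterating is a shorter alternative.

```scala
import scala.collection.mutable.ListBuffer

object HeaderSkip {
  // Stand-in for the rows that reader.readAll would return (an assumption for this sketch).
  val rows: List[Array[String]] = List(
    Array("Year", "Total"),
    Array("1995", "146"),
    Array("1996", "184")
  )

  def describe(row: Array[String]): String =
    "In " + row(0) + " there were " + row(1) + " unruly passengers."

  // Flag approach: starts true and is flipped to false once the header has been skipped.
  def withFlag(rows: Seq[Array[String]]): List[String] = {
    var isHeader = true
    val out = ListBuffer[String]()
    for (row <- rows) {
      if (isHeader) isHeader = false
      else out += describe(row)
    }
    out.toList
  }

  // Alternative: drop the first row before iterating.
  def withDrop(rows: Seq[Array[String]]): List[String] =
    rows.drop(1).map(describe).toList
}
```

Both functions return the same lines; the second avoids mutable state entirely.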

Also worth mentioning is the JavaConversions import in the example. This enables implicit conversions between Java datatypes and Scala datatypes and makes working with Java libraries a lot easier. Without this import we couldn't use Scala's for loop syntactic sugar. In this case it's implicitly converting a java.util.List to a scala.collection.mutable.Buffer.
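If you would rather have the conversion visible in the code, scala.collection.JavaConverters offers an asScala method that does the same wrapping on request. This is a different import from the one used in the example above, and the sketch below builds its own java.util.List rather than reading the file. ExplicitConversion, sampleRows, and describe are names made up for illustration.

```scala
import java.util.{ArrayList => JArrayList, List => JList}
import scala.collection.JavaConverters._

object ExplicitConversion {
  // Build a java.util.List shaped like the one reader.readAll returns (sample data only).
  def sampleRows(): JList[Array[String]] = {
    val rows = new JArrayList[Array[String]]()
    rows.add(Array("1995", "146"))
    rows.add(Array("1996", "184"))
    rows
  }

  def describeRow(row: Array[String]): String =
    "In " + row(0) + " there were " + row(1) + " unruly passengers."

  // asScala wraps the Java list as a Scala Buffer, so the for syntax works on it.
  def describe(rows: JList[Array[String]]): List[String] =
    (for (row <- rows.asScala) yield describeRow(row)).toList
}
```

Newer Scala releases deprecate JavaConverters in favor of scala.jdk.CollectionConverters, so the import line may need adjusting depending on your version.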

Another thing to be aware of is any cleaning of the raw field output that might need to be done. For example, CSV files often have leading or trailing whitespace around fields. A quick and easy way to take care of this is to call trim on each field: row(0).trim.
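Trimming matters most when you convert fields to numbers, since toInt throws a NumberFormatException on a value like " 1995". Here is a minimal sketch that cleans a whole row and then parses it, assuming every row has the two columns (year, total) of the dataset above. CleanFields, clean, and parse are illustrative names.

```scala
object CleanFields {
  // Trim leading and trailing whitespace from every field in a row.
  def clean(row: Array[String]): Array[String] = row.map(_.trim)

  // Parse a (year, total) row into integers after cleaning it.
  // Assumes exactly the two numeric columns of the unruly passenger data.
  def parse(row: Array[String]): (Int, Int) = {
    val fields = clean(row)
    (fields(0).toInt, fields(1).toInt)
  }
}
```

For example, CleanFields.parse(Array(" 1995", "146 ")) yields the pair (1995, 146).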

This isn't the first time I've been pleasantly surprised working with a Java library in Scala, and I'm sure it won't be the last. Many thanks to the developers and maintainers of opencsv and to the creators of all of the open source libraries, frameworks, and tools in the Java ecosystem.


Getting to know Scala

Posted: February 28th, 2011 | Filed under: Java, Scala, Web Services

Over the past couple of weeks I’ve been spending some quality time with Scala. I haven’t really been outside of my Python shell (pun only slightly intended) since getting to know node.js several months back. I’m kicking myself for not picking it up sooner, because it has a ton of useful properties:

  • The power and speed of the JVM and access to the Java ecosystem without the verbosity
  • An interesting mix of Object-Oriented and Functional programming (which sounds weird but works)
  • Static typing without type pain through inferencing in common scenarios
  • A REPL for when you just want to check how something works
  • An implementation of the Actor model for message passing and Erlang-style concurrency

Getting started

The first thing I did was try to get a feel for Scala’s syntax. I started by skimming documentation and tutorials at scala-lang.org. I quickly learned that Programming Scala was available on the web so I started skimming that on a plane ride. It’s an excellent book and I need to snag a copy for my bookshelf.

After getting to know the relatively concise and definitely expressive syntax of the language, I wanted to do something interesting with it. I had heard of a lot of folks using Netty for highly concurrent network services, so I thought I would try to do something with that. I started off tinkering with (and submitting a dependency patch to) naggati2, a toolkit for building protocols using Netty.

After an hour or so I decided to shelve Naggati and get a better handle on the language and Netty itself. I browsed through several Scala projects using Netty and ended up doing a mechanistic (and probably not very idiomatic) port of a Java echo server. I put this up on github as scala-echo-server.

Automation is key

Because my little app has an external dependency, I really wanted to automate downloading that dependency and adding it to my libraries. At a quick glance, it looked like it was possible to use Maven with Scala, and there was even a Scala plugin and archetype for it. I found the right archetype by typing mvn archetype:generate | less, found the number for scala-archetype-simple, and re-ran mvn archetype:generate, entering the correct code and answering a couple of questions. Once that was done, I could put code in src/main/scala/com/postneo and run mvn compile to compile my code.

It was about this time that I realized that most of the Scala projects I saw were using simple-build-tool instead of Maven to handle dependencies and build automation. I quickly installed it and easily configured my echo server to use it. From there my project was a quick sbt clean update compile run from being completely automated. While I’m sure that Maven is good, this feels like a great way to configure Scala projects.

Something a little more complex

After wrapping my head around the basics (though I did find myself back at the Scala syntax primer quite often), I decided to tackle something real but still relatively small in scope. I had implemented several archaic protocols while getting to know node.js, and I thought I’d pick one to learn Scala and Netty with. I settled on the Finger protocol as it existed in 1977 in RFC 742.

The result of my work is an open source project called phalanges. I decided to use it as an opportunity to make use of several libraries including Configgy for configuration and logging and Ostrich for statistics collection. I also wrote tests using Specs and found that mocking behavior with mockito was a lot easier than I expected. Basic behavior coverage was particularly useful when I refactored the storage backend, laying the groundwork for pluggable backends and changing the underlying storage mechanism from a List to a HashMap.
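To give a rough idea of what a pluggable backend can look like, here is a hypothetical sketch. None of these names (PlanStore, ListStore, MapStore) come from phalanges; they only illustrate putting a List-backed and a HashMap-backed store behind one trait so that tests written against the trait keep passing when the storage changes.

```scala
import scala.collection.mutable

// Hypothetical interface; the names are illustrative, not taken from phalanges.
trait PlanStore {
  def put(user: String, plan: String): Unit
  def get(user: String): Option[String]
}

// List-backed store: every lookup scans the entries.
class ListStore extends PlanStore {
  private var entries: List[(String, String)] = Nil
  def put(user: String, plan: String): Unit =
    entries = (user, plan) :: entries.filterNot(_._1 == user)
  def get(user: String): Option[String] =
    entries.find(_._1 == user).map(_._2)
}

// HashMap-backed store: lookups go straight to the key.
class MapStore extends PlanStore {
  private val entries = mutable.HashMap.empty[String, String]
  def put(user: String, plan: String): Unit = entries(user) = plan
  def get(user: String): Option[String] = entries.get(user)
}
```

Because both classes satisfy the same trait, a behavior test that takes a PlanStore can run unchanged against either one.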

Wrapping up

Scala’s type checking saved me from doing stupid things several times and I really appreciate the effort put into the compiler. The error messages and context that I get back from the compiler when I’ve done something wrong are better than those of any other statically typed language that I can remember.

I’m glad that I took a closer look at Scala. I still have a lot to learn but it’s been a fun journey so far and it’s been great to get out of my comfort zone. I’m always looking to expand my toolbox and Scala looks like a solid contender for highly concurrent systems.