Month: July 2011

  • Parsing CSV data in Scala with opencsv

    One of the great things about Scala (or any JVM language for that matter) is that you can take advantage of lots of libraries in the Java ecosystem. Today I wanted to parse a CSV file with Scala, and of course the first thing I did was search for scala csv. That yielded some interesting results, including a couple of roll-your-own regex-based implementations. I prefer to lean on established libraries instead of copying and pasting code from teh internet, so my next step was to search for java csv.

    The third hit down was opencsv, which looked solid, had been updated recently, and was Apache-licensed. All good signs in my book. It’s also in the main Maven repository, so adding it to my sbt 0.10.x build configuration was easy:

    
    libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.1"
    

    The syntax for sbt 0.7.x is similar, but you should really upgrade:

    
    val opencsv = "net.sf.opencsv" % "opencsv" % "2.1"
    

    Once that configuration change is in place, running sbt update will let you use opencsv in either your project or the shell via sbt console.

    There are a couple of simple usage examples on the opencsv site along with a link to javadocs. The javadocs are currently for the development version (2.4) and include an improved iterator interface that would be useful for larger files.
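
    With the 2.1 release used here, rows can still be read one at a time instead of all at once. The following is a minimal sketch, assuming only that readNext returns null at the end of the input; rowIterator is a hypothetical helper name, not part of opencsv, and the inline sample data is made up for illustration:

    ```scala
    import au.com.bytecode.opencsv.CSVReader
    import java.io.StringReader

    // Hypothetical helper: wraps readNext in a lazy Scala Iterator.
    // readNext returns null once the input is exhausted, which ends the iteration.
    def rowIterator(reader: CSVReader): Iterator[Array[String]] =
        Iterator.continually(reader.readNext()).takeWhile(_ != null)

    // Inline sample data so the sketch does not depend on a file.
    val sample = new CSVReader(new StringReader("a,b\n1,2\n"))
    val firstFields = rowIterator(sample).map(row => row(0)).toList
    sample.close()
    ```

    Because the iterator is lazy, only one row needs to be held in memory at a time, as long as you don't call toList on the whole thing the way this small demo does.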

    Let’s parse some CSV data in Scala. We’ll use a CSV version of violations of 14 CFR 91.11, 121.580 & 135.120, affectionately known as the unruly passenger dataset (as seen in the Django book):

    
    Year,Total
    1995,146
    1996,184
    1997,235
    1998,200
    1999,226
    2000,251
    2001,299
    2002,273
    2003,281
    2004,304
    2005,203
    2006,136
    2007,150
    2008,123
    2009,135
    2010,121
    

    You can download the raw data as unruly_passengers.txt.

    Here’s a full example of parsing the unruly passengers data:

    
    import au.com.bytecode.opencsv.CSVReader
    import java.io.FileReader
    import scala.collection.JavaConversions._
    
    val reader = new CSVReader(new FileReader("unruly_passengers.txt"))
    for (row <- reader.readAll) {
        println("In " + row(0) + " there were " + row(1) + " unruly passengers.")
    }
    

    This will print out the following:

    
    In Year there were Total unruly passengers.
    In 1995 there were 146 unruly passengers.
    In 1996 there were 184 unruly passengers.
    In 1997 there were 235 unruly passengers.
    In 1998 there were 200 unruly passengers.
    In 1999 there were 226 unruly passengers.
    In 2000 there were 251 unruly passengers.
    In 2001 there were 299 unruly passengers.
    In 2002 there were 273 unruly passengers.
    In 2003 there were 281 unruly passengers.
    In 2004 there were 304 unruly passengers.
    In 2005 there were 203 unruly passengers.
    In 2006 there were 136 unruly passengers.
    In 2007 there were 150 unruly passengers.
    In 2008 there were 123 unruly passengers.
    In 2009 there were 135 unruly passengers.
    In 2010 there were 121 unruly passengers.
    

    There are a couple of ways to make sure that the header line isn't included. If you specify the separator and quote character, you can also tell the reader to skip any number of lines (one in this case):

    
    val reader = new CSVReader(new FileReader("unruly_passengers.txt"), ',', '"', 1)
    

    Alternatively you could create a variable that starts true and is set to false after skipping the first line.
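
    Here is a rough sketch of that flag-based approach, next to a shorter alternative that simply drops the first row. The rows value is a hand-written stand-in for the result of reader.readAll, and the names isHeader, lines and withoutHeader are made up for this example:

    ```scala
    import scala.collection.mutable.ListBuffer

    // Stand-in for reader.readAll: a few rows from the dataset, header first.
    val rows = Seq(Array("Year", "Total"), Array("1995", "146"), Array("1996", "184"))

    // Option 1: a flag that starts true and is set to false after the first row.
    var isHeader = true
    val lines = ListBuffer[String]()
    for (row <- rows) {
        if (isHeader) {
            isHeader = false // skip the header row
        } else {
            lines += "In " + row(0) + " there were " + row(1) + " unruly passengers."
        }
    }

    // Option 2: drop the first element instead of tracking a flag.
    val withoutHeader = rows.drop(1)
    ```

    The second option is usually the more idiomatic Scala, but the flag works just as well when rows are being read one at a time.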

    Also worth mentioning is the JavaConversions import in the example. It enables implicit conversions between Java and Scala collection types, which makes working with Java libraries a lot easier. Without this import we couldn't use Scala's for loop syntactic sugar on the result of readAll. In this case it's implicitly converting a java.util.List to a scala.collection.mutable.Buffer.
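
    If you'd rather see the conversion at the call site, the standard library also has scala.collection.JavaConverters, which adds an asScala method instead of converting silently. A small sketch, using a hand-built java.util.ArrayList as a stand-in for what readAll returns (on Scala 2.13 and later the equivalent import is scala.jdk.CollectionConverters._):

    ```scala
    import scala.collection.JavaConverters._

    // A java.util.List standing in for what reader.readAll returns.
    val javaRows = new java.util.ArrayList[Array[String]]()
    javaRows.add(Array("1995", "146"))
    javaRows.add(Array("1996", "184"))

    // asScala makes the Java-to-Scala conversion visible where it happens.
    val totals = for (row <- javaRows.asScala) yield row(1).toInt
    ```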

    Another thing to be aware of is any cleaning of the raw field output that might need to be done. For example, fields in some CSV files have leading or trailing whitespace. A quick and easy way to take care of this is to call trim on the field: row(0).trim.
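
    As a quick illustration, here is a sketch that trims every field of a row in one go before converting the values to numbers. The raw array is a made-up row with stray whitespace, not actual parser output:

    ```scala
    // A raw row as it might come back from the parser, with stray whitespace.
    val raw = Array(" 1995", "146 ")

    // Trim every field once, up front, rather than at each use.
    val cleaned = raw.map(_.trim)

    // With the padding gone, the numeric fields convert cleanly.
    val year = cleaned(0).toInt
    val total = cleaned(1).toInt
    ```

    Trimming matters here because calling toInt on a padded string such as " 1995" throws a NumberFormatException.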

    This isn't the first time I've been pleasantly surprised working with a Java library in Scala, and I'm sure it won't be the last. Many thanks to the developers and maintainers of opencsv and to the creators of all of the open source libraries, frameworks, and tools in the Java ecosystem.