Year: 2011

  • Parsing CSV data in Scala with opencsv

    One of the great things about Scala (or any JVM language for that matter) is that you can take advantage of lots of libraries in the Java ecosystem. Today I wanted to parse a CSV file with Scala, and of course the first thing I did was search for scala csv. That yielded some interesting results, including a couple of roll-your-own regex-based implementations. I prefer to lean on established libraries instead of copying and pasting code from the internet, so my next step was to search for java csv.

    The third hit down was opencsv, which looked solid, had been updated recently, and was Apache-licensed. All good signs in my book. It’s also in the main Maven repository, so adding it to my sbt 0.10.x build configuration was easy:

    
    libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.1"
    

    The syntax for sbt 0.7.x is similar, but you should really upgrade:

    
    val opencsv = "net.sf.opencsv" % "opencsv" % "2.1"
    

    Once that configuration change is in place, running sbt update will let you use opencsv in either your project or the shell via sbt console.

    There are a couple of simple usage examples on the opencsv site along with a link to javadocs. The javadocs are currently for the development version (2.4) and include an improved iterator interface that would be useful for larger files.
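
    As an aside, even with 2.1 you don’t have to read an entire file into memory with readAll: CSVReader also has a readNext method that hands back one row at a time and returns null at the end of the file. Here is a minimal sketch of that approach (the file name is just a placeholder):

    import au.com.bytecode.opencsv.CSVReader
    import java.io.FileReader

    // Stream rows one at a time instead of loading everything with readAll.
    val reader = new CSVReader(new FileReader("big_file.csv"))
    var row = reader.readNext()  // readNext returns null once the end of the file is reached
    while (row != null) {
        println(row.mkString(", "))
        row = reader.readNext()
    }
    reader.close()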

    Let’s parse some CSV data in Scala. We’ll use a CSV version of violations of 14 CFR 91.11, 121.580 & 135.120, affectionately known as the unruly passenger dataset (as seen in the Django book):

    
    Year,Total
    1995,146
    1996,184
    1997,235
    1998,200
    1999,226
    2000,251
    2001,299
    2002,273
    2003,281
    2004,304
    2005,203
    2006,136
    2007,150
    2008,123
    2009,135
    2010,121
    

    You can download the raw data as unruly_passengers.txt.

    Here’s a full example of parsing the unruly passengers data:

    
    import au.com.bytecode.opencsv.CSVReader
    import java.io.FileReader
    import scala.collection.JavaConversions._
    
    val reader = new CSVReader(new FileReader("unruly_passengers.txt"))
    for (row <- reader.readAll) {
        println("In " + row(0) + " there were " + row(1) + " unruly passengers.")
    }
    

    This will print out the following:

    
    In Year there were Total unruly passengers.
    In 1995 there were 146 unruly passengers.
    In 1996 there were 184 unruly passengers.
    In 1997 there were 235 unruly passengers.
    In 1998 there were 200 unruly passengers.
    In 1999 there were 226 unruly passengers.
    In 2000 there were 251 unruly passengers.
    In 2001 there were 299 unruly passengers.
    In 2002 there were 273 unruly passengers.
    In 2003 there were 281 unruly passengers.
    In 2004 there were 304 unruly passengers.
    In 2005 there were 203 unruly passengers.
    In 2006 there were 136 unruly passengers.
    In 2007 there were 150 unruly passengers.
    In 2008 there were 123 unruly passengers.
    In 2009 there were 135 unruly passengers.
    In 2010 there were 121 unruly passengers.
    

    There are a couple of ways to make sure that the header line isn't included. If you specify the separator and quote character, you can also tell it to skip any number of lines (one in this case):

    
    val reader = new CSVReader(new FileReader("unruly_passengers.txt"), ',', '"', 1)
    

    Alternatively, you could keep a variable that starts out true and is set to false once the first line has been skipped.
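
    That might look something like this (a quick sketch that reuses the setup from the full example above):

    import au.com.bytecode.opencsv.CSVReader
    import java.io.FileReader
    import scala.collection.JavaConversions._

    val reader = new CSVReader(new FileReader("unruly_passengers.txt"))
    var isHeader = true  // flips to false once the header row has been seen
    for (row <- reader.readAll) {
        if (isHeader) {
            isHeader = false
        } else {
            println("In " + row(0) + " there were " + row(1) + " unruly passengers.")
        }
    }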

    Also worth mentioning is the JavaConversions import in the example. This enables implicit conversions between Java datatypes and Scala datatypes and makes working with Java libraries a lot easier. Without this import we couldn't use Scala's for loop syntactic sugar. In this case it's implicitly converting a java.util.List to a scala.collection.mutable.Buffer.

    Another thing to be aware of is any cleaning of the raw field output that might need to be done. For example, CSV files often have leading or trailing whitespace around fields. A quick and easy way to take care of this is to trim each field: row(0).trim.

    This isn't the first time I've been pleasantly surprised working with a Java library in Scala, and I'm sure it won't be the last. Many thanks to the developers and maintainers of opencsv and to the creators of all of the open source libraries, frameworks, and tools in the Java ecosystem.

  • Social Graph Analysis using Elastic MapReduce and PyPy

    A couple of weeks back I read two papers (Who Says What to Whom on Twitter and What is Twitter, a Social Network or a News Media?) that cited data collected by researchers for the latter paper.

    This 5 gigabyte compressed (26 gigabyte uncompressed) dataset makes for a good excuse to use MapReduce and MrJob for processing. MrJob makes it easy to test MapReduce jobs locally as well as run them on a local Hadoop cluster or on Amazon’s Elastic MapReduce.

    Designing MapReduce Jobs

    I usually find myself going through the same basic steps when writing MapReduce jobs:

    1. Examine the data input format and the data that you have to play with. This is sometimes explained in a metadata document or you may have to use a utility such as head if you’re trying to look at the very beginning of a text file.
    2. Create a small amount of synthetic data for use while designing your job. It should be easy to tell whether the output of your job is correct based on this data. This data is also useful when writing unit tests.
    3. Write your job, using synthetic data as test input.
    4. Create sample data based on your real dataset and continue testing your job with that data. This can be done via reservoir sampling to create a more representative sample (a quick sketch follows this list) or it could be as simple as head -1000000 on a very large file.
    5. Run your job against the sample data and make sure the results look sane.
    6. Configure MrJob to run using Elastic MapReduce. An example configuration can be found in conf/mrjob-emr.conf, but you will need to update it with your credentials and S3 bucket information before it will work.
    7. Run your sample data using Elastic MapReduce and a small number of low-cost instances. It’s a lot cheaper to fix configuration problems when you’re just running two cheap instances.
    8. Once you’re comfortable with everything, run your job against the full dataset on Elastic MapReduce.
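
    Reservoir sampling (mentioned in step 4) is only a handful of lines: it keeps a uniform random sample of k lines from input of unknown length in a single pass. Here is a minimal sketch, shown in Scala with a placeholder file name; the same logic translates directly to Python or any other language:

    import scala.io.Source
    import scala.util.Random
    import scala.collection.mutable.ArrayBuffer

    // Single-pass uniform sampling of k lines ("Algorithm R").
    def reservoirSample(lines: Iterator[String], k: Int): Seq[String] = {
        val sample = new ArrayBuffer[String]()
        val random = new Random()
        for ((line, i) <- lines.zipWithIndex) {
            if (i < k) {
                sample += line                // fill the reservoir first
            } else {
                val j = random.nextInt(i + 1) // pick a slot in [0, i]
                if (j < k) sample(j) = line   // keep the line if it lands in the reservoir
            }
        }
        sample
    }

    // Example: write a 1,000,000-line sample of a large (placeholder) file to standard output.
    reservoirSample(Source.fromFile("full_dataset.txt").getLines(), 1000000).foreach(println)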

    Analyzing the data

    This project contains two MapReduce jobs:

    jobs/follower_count.py
    A simple single-stage MapReduce job that reads the data in and sums the number of followers each user has.
    jobs/follower_histogram.py
    This is a two-phase MapReduce job that first sums the number of followers each user has, then for each follower count sums the number of users that have that many followers. This is one of many interesting ways of looking at this raw data.

    Running the jobs

    The following assumes you have a modern Python and have already installed MrJob (pip install MrJob or easy_install MrJob or install it from source).

    To run the sample data locally:

    $ python jobs/follower_count.py data/twitter_synthetic.txt
    

    This should print out a summary of how many followers each user (represented by id) has:

    5       2
    6       1
    7       3
    8       2
    9       1
    

    You can also run a larger sample (the first 10 million rows of the full dataset mentioned above) locally though it will likely take several minutes to process:

    $ python jobs/follower_count.py data/twitter_sample.txt.gz
    

    After editing conf/mrjob-emr.conf you can also run the sample on Elastic MapReduce:

    $ python jobs/follower_count.py -c conf/mrjob-emr.conf -r emr \
     -o s3://your-bucket/your-output-location --no-output data/twitter_sample.txt.gz
    

    You can also upload data to an S3 bucket and reference it that way:

    $ python jobs/follower_count.py -c conf/mrjob-emr.conf -r emr \
     -o s3://your-bucket/your-output-location --no-output s3://your-bucket/twitter_sample.txt.gz
    

    You may also download the full dataset and run either the follower count or the histogram job. The following general steps are required:

    1. Download the whole data file from Kwak, Haewoon and Lee, Changhyun and Park, Hosung and Moon, Sue via bittorrent. I did this on a small EC2 instance in order to make uploading to S3 easier.
    2. To make processing faster, decompress it and split it into lots of smaller files (split -l 10000000 for example).
    3. Upload to an S3 bucket.
    4. Run the full job (it took a little over 15 minutes to read through all 1.47 billion relationships and just over an hour for the whole job to complete):

    $ python jobs/follower_histogram.py -c conf/mrjob-emr.conf -r emr \
     -o s3://your-bucket/your-output-location --no-output s3://your-split-input-bucket/
    

    Speeding things up with PyPy

    While there are lots of other things to explore in the data, I also wanted to be able to run PyPy on Elastic MapReduce. Through the use of bootstrap actions, we can prepare our environment to use PyPy and tell MrJob to execute jobs with PyPy instead of system Python. The following need to be added to your configuration file (and vary between 32 and 64 bit):

    # Use PyPy instead of system Python
    bootstrap_scripts:
    - bootstrap-pypy-64bit.sh
    python_bin: /home/hadoop/bin/pypy
    

    This configuration change (available in conf/mrjob-emr-pypy-32bit.conf and conf/mrjob-emr-pypy-64bit.conf) also makes use of a custom bootstrap script (found in conf/bootstrap-pypy-32bit.sh and conf/bootstrap-pypy-64bit.sh).

    A single run of follower_histogram.py with 8 c1.xlarge instances took approximately 66 minutes using Elastic MapReduce’s system Python. A single run with PyPy in the same configuration took approximately 44 minutes. While not a scientific comparison, that’s a pretty impressive speedup for such a simple task. PyPy should speed things up even more for more complex tasks.

    Thoughts on Elastic MapReduce

    It’s been great to be able to rent my own Hadoop cluster for short periods of time, but Elastic MapReduce definitely has some downsides. For starters, the standard way to read and persist data during jobs is via S3 instead of the HDFS you would most likely be using if you were running your own Hadoop cluster. This means that you spend a lot of time (and money) transferring data between S3 and your nodes. You don’t get the data locality that a dedicated Hadoop cluster running HDFS would give you, where the computation is brought to the data.

    All in all though it’s a great tool for the toolbox, particularly if you don’t have the need for a full-time Hadoop cluster.

    Play along at home

    All of the source code and configuration mentioned in this post can be found at social-graph-analysis and is released under the BSD license.

  • Literate Diffing

    The other day I found myself wanting to add commentary to a diff. There are code review tools such as reviewboard and gerrit that make commenting on diffs pretty easy. Github allows you to comment on pull requests and individual commits.

    These are all fantastic tools for commenting on diffs, but I kind of wanted something different, something a little more self-contained. I wanted to write about the individual changes, what motivated them, and what the non-code implications of each change might be. At that point my mind wandered to the world of lightweight literate programming using tools like docco, rocco, and pycco.

    A literate diff might look something like this (using Python/Bash style single-line comments):

    # Extend Pygments' DiffLexer using a non-standard comment (#) for literate diffing using pycco.
    diff -r cfa0f44daad1 pygments/lexers/text.py
    --- a/pygments/lexers/text.py	Fri Apr 29 14:03:50 2011 +0200
    +++ b/pygments/lexers/text.py	Sat Apr 30 20:28:56 2011 -0500
    @@ -231,6 +231,7 @@
                 (r'@.*\n', Generic.Subheading),
                 (r'([Ii]ndex|diff).*\n', Generic.Heading),
                 (r'=.*\n', Generic.Heading),
    # Add non-standard diff comments.  This has to go above the Text capture below
    # in order to be active.
    +            (r'#.*\n', Comment),
                 (r'.*\n', Text),
             ]
         }

    It turns out that this is pretty easy to process with patch, but it comes with a catch: the patch command will blow up quite spectacularly if it encounters one of these comment lines, so the comments have to be removed from a literate diff before it is passed to patch. This is easily done using awk:

    cat literate.diff | awk '!/\#/' | patch -p0

    If you’re using a DVCS, you’ll need -p1 instead.

    Since I’m using a non-standard extension to diffs, tools such as pygments won’t know to syntax highlight comments appropriately. If comments aren’t marked up correctly, pycco won’t be able to put them in the correct spot. This requires a patch to pygments and a patch to pycco. I’m kind of abusing diff syntax here and haven’t submitted these patches upstream, but you can download and apply them if you’d like to play along at home.

    I still think tools like github, reviewboard, and gerrit are much more powerful for commenting on diffs, but I was able to make pycco output literate diffs quickly enough that I thought I’d share the process. None of these tools are an excuse to skip clearly commenting changes and their implications within the code itself, but I do like having a place to put underlying motivations. Here’s an example of a literate diff for one of my commits to phalanges, a finger daemon written in Scala. It’s still a pretty contrived example but is exactly what I was envisioning when my mind drifted from diffs to literate programming.

  • PyPy is Fast (And So Can You)

    I’ve known for some time that PyPy (Python implemented in a subset of the language called RPython) is fast. The PyPy speed charts show just how fast for a lot of benchmarks (and it’s a little slower in a few areas too).

    After seeing a lot of PyPy chatter while PyCon was going on, I thought I’d check it out. On OS X it’s as simple as brew install pypy. After that, just use pypy instead of python.

    The first thing I did was throw PyPy at a couple of Project Euler problems. They’re great because they’re computationally expensive and usually have lots of tight loops. For the ones I looked at, PyPy had a 50-75% speed improvement over CPython. David Ripton posted a more complete set of Euler solution runtimes using PyPy, Unladen Swallow, Jython, Psyco, and CPython. Almost all of the time, PyPy is faster, often significantly so. At this point it looks like the PyPy team is treating “slower than CPython” as a bug, or at the very least, something to improve.

    The latest stable release currently targets Python 2.5, but if you build the latest version from source it looks like they’re on their way to supporting Python 2.7:

    $ ./pypy-c 
    Python 2.7.0 (61fefec7abc6, Mar 18 2011, 06:59:57)
    [PyPy 1.5.0-alpha0] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    And now for something completely different: ``1.1 final released:
    http://codespeak.net/pypy/dist/pypy/doc/release-1.1.0.html''
    >>>> 

    There are a few things to look out for when using PyPy. The entire standard library isn’t built out, though the most commonly used modules are. PyPy supports ctypes and has experimental but incomplete support for the Python C API. PyPy is built out enough to support several large non-trivial projects such as Twisted (without SSL) and Django (with sqlite).

    PyPy is definitely one of many bright futures for Python, and it’s fast now. If you’ve been thinking about checking it out, perhaps now is the time to take it for a spin.

  • Getting to know Scala

    Over the past couple of weeks I’ve been spending some quality time with Scala. I haven’t really been outside of my Python shell (pun only slightly intended) since getting to know node.js several months back. I’m kicking myself for not picking it up sooner; it has a ton of useful properties:

    • The power and speed of the JVM and access to the Java ecosystem without the verbosity
    • An interesting mix of Object-Oriented and Functional programming (which sounds weird but works)
    • Static typing without type pain through inferencing in common scenarios
    • A REPL for when you just want to check how something works
    • An implementation of the Actor model for message passing and Erlang-style concurrency (a quick sketch follows this list)
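
    As a taste of that last point, here is a tiny sketch using the actor library that ships with Scala (scala.actors): an actor that reacts to messages sent with the ! operator. The names are arbitrary:

    import scala.actors.Actor._

    // A small actor that processes messages from its mailbox as they arrive.
    val greeter = actor {
        loop {
            react {
                case "stop"       => exit()
                case name: String => println("Hello, " + name)
            }
        }
    }

    greeter ! "Scala"   // messages are sent asynchronously
    greeter ! "Netty"
    greeter ! "stop"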

    Getting started

    The first thing I did was try to get a feel for Scala’s syntax. I started by skimming documentation and tutorials at scala-lang.org. I quickly learned that Programming Scala was available on the web, so I started skimming it on a plane ride. It’s an excellent book and I need to snag a copy for my bookshelf.

    After getting to know the relatively concise and definitely expressive syntax of the language, I wanted to do something interesting with it. I had heard of a lot of folks using Netty for highly concurrent network services, so I thought I would try to do something with that. I started off tinkering with (and submitting a dependency patch to) naggati2, a toolkit for building protocols using Netty.

    After an hour or so I decided to shelve Naggati and get a better handle on the language and Netty itself. I browsed through several Scala projects using Netty and ended up doing a mechanistic (and probably not very idiomatic) port of a Java echo server. I put this up on github as scala-echo-server.
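
    For a sense of what that port boils down to, here is a rough sketch of a minimal echo server against the Netty 3.x API (class names and the port number are arbitrary; the actual code lives in the scala-echo-server repository):

    import java.net.InetSocketAddress
    import java.util.concurrent.Executors
    import org.jboss.netty.bootstrap.ServerBootstrap
    import org.jboss.netty.channel._
    import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory

    // Writes every message it receives straight back to the client.
    class EchoHandler extends SimpleChannelUpstreamHandler {
        override def messageReceived(ctx: ChannelHandlerContext, e: MessageEvent) {
            e.getChannel.write(e.getMessage)
        }
    }

    object EchoServer {
        def main(args: Array[String]) {
            val bootstrap = new ServerBootstrap(
                new NioServerSocketChannelFactory(
                    Executors.newCachedThreadPool, Executors.newCachedThreadPool))
            bootstrap.setPipelineFactory(new ChannelPipelineFactory {
                def getPipeline = Channels.pipeline(new EchoHandler)
            })
            bootstrap.bind(new InetSocketAddress(8080))
        }
    }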

    Automation is key

    Because my little app has an external dependency, I really wanted to automate downloading that dependency and adding it to my libraries. At quick glance, it looked like it was possible to use Maven with Scala, and there was even a Scala plugin and archetype for it. I found the right archetype by typing mvn archetype:generate | less, found the number for scala-archetype-simple, and re-ran mvn archetype:generate, entering the correct code and answering a couple of questions. Once that was done, I could put code in src/main/scala/com/postneo and run mvn compile to compile my code.

    It was about this time that I realized that most of the Scala projects I saw were using simple-build-tool instead of Maven to handle dependencies and build automation. I quickly installed it and easily configured my echo server to use it. From there my project was just a quick sbt clean update compile run away from being completely automated. While I’m sure that Maven is good, this feels like a great way to configure Scala projects.

    Something a little more complex

    After wrapping my head around the basics (though I did find myself back at the Scala syntax primer quite often), I decided to tackle something real but still relatively small in scope. I had implemented several archaic protocols while getting to know node.js, and I thought I’d pick one to learn Scala and Netty with. I settled on the Finger protocol as it existed in 1977 in RFC 742.

    The result of my work is an open source project called phalanges. I decided to use it as an opportunity to make use of several libraries including Configgy for configuration and logging and Ostrich for statistics collection. I also wrote tests using Specs and found that mocking behavior with mockito was a lot easier than I expected. Basic behavior coverage was particularly useful when I refactored the storage backend, laying the groundwork for pluggable backends and changing the underlying storage mechanism from a List to a HashMap.

    Wrapping up

    Scala’s type checking saved me from doing stupid things several times, and I really appreciate the effort put into the compiler. The error messages and context that I get back from the compiler when I’ve done something wrong are better than those of any other statically typed language I can remember using.

    I’m glad that I took a closer look at Scala. I still have a lot to learn but it’s been a fun journey so far and it’s been great to get out of my comfort zone. I’m always looking to expand my toolbox and Scala looks like a solid contender for highly concurrent systems.