Author: Matt Croydon

  • Parsing CSV data in Scala with opencsv

    One of the great things about Scala (or any JVM language for that matter) is that you can take advantage of lots of libraries in the Java ecosystem. Today I wanted to parse a CSV file with Scala, and of course the first thing I did was search for scala csv. That yielded some interesting results, including a couple of roll-your-own regex-based implementations. I prefer to lean on established libraries instead of copying and pasting code from teh internet, so my next step was to search for java csv.

    The third hit down was opencsv and looked solid, had been updated recently, and was Apache-licensed. All good signs in my book. It’s also in the main maven repository, so adding it to my sbt 0.10.x build configuration was easy:

    
    libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.1"
    

    The syntax for sbt 0.7.x is similar, but you should really upgrade:

    
    val opencsv = "net.sf.opencsv" % "opencsv" % "2.1"
    

    Once that configuration change is in place, running sbt update will let you use opencsv in either your project or the shell via sbt console.

    There are a couple of simple usage examples on the opencsv site along with a link to javadocs. The javadocs are currently for the development version (2.4) and include an improved iterator interface that would be useful for larger files.

    Let’s parse some CSV data in Scala. We’ll use a CSV version of violations of 14 CFR 91.11, 121.580 & 135.120, affectionately known as the unruly passenger dataset (as seen in the Django book):

    
    Year,Total
    1995,146
    1996,184
    1997,235
    1998,200
    1999,226
    2000,251
    2001,299
    2002,273
    2003,281
    2004,304
    2005,203
    2006,136
    2007,150
    2008,123
    2009,135
    2010,121
    

    You can download the raw data as unruly_passengers.txt.

    Here’s a full example of parsing the unruly passengers data:

    
    import au.com.bytecode.opencsv.CSVReader
    import java.io.FileReader
    import scala.collection.JavaConversions._
    
    val reader = new CSVReader(new FileReader("unruly_passengers.txt"))
    for (row <- reader.readAll) {
        println("In " + row(0) + " there were " + row(1) + " unruly passengers.")
    }
    

    This will print out the following:

    
    In Year there were Total unruly passengers.
    In 1995 there were 146 unruly passengers.
    In 1996 there were 184 unruly passengers.
    In 1997 there were 235 unruly passengers.
    In 1998 there were 200 unruly passengers.
    In 1999 there were 226 unruly passengers.
    In 2000 there were 251 unruly passengers.
    In 2001 there were 299 unruly passengers.
    In 2002 there were 273 unruly passengers.
    In 2003 there were 281 unruly passengers.
    In 2004 there were 304 unruly passengers.
    In 2005 there were 203 unruly passengers.
    In 2006 there were 136 unruly passengers.
    In 2007 there were 150 unruly passengers.
    In 2008 there were 123 unruly passengers.
    In 2009 there were 135 unruly passengers.
    In 2010 there were 121 unruly passengers.
    

    There are a couple of ways to make sure that the header line isn't included. If you specify the seperator and quote character, you can also tell it to skip any number of lines (one in this case):

    
    val reader = new CSVReader(new FileReader("unruly_passengers.txt"), ",", "\"", 1)
    

    Alternatively you could create a variable that starts true and is set to false after skipping the first line.

    Also worth mentioning is the JavaConversions import in the example. This enables explicit conversions between Java datatypes and Scala datatypes and makes working with Java libraries a lot easier. WIthout this import we couldn't use Scala's for loop syntactic sugar. In this case it's implicitly converting a Java.util.List to a scala.collection.mutable.Buffer.

    Another thing to be aware of is any cleaning of the raw field output that might need to be done. For example, some CSV files often have leading or training whitespace. A quick and easy way to take care of this is to trim leading and trailing whitespace: row(0).trim.

    This isn't the first time I've been pleasantly surprised working with a Java library in Scala, and I'm sure it won't be the last. Many thanks to the developers and maintainers of opencsv and to the creators of all of the open source libraries, frameworks, and tools in the Java ecosystem.

  • Social Graph Analysis using Elastic MapReduce and PyPy

    A couple of weeks back I read a couple of papers (Who Says What to Whom on Twitter and What is Twitter, a Social Network or a News Media?) that cited data collected by researchers for the latter paper.

    This 5 gigabyte compressed (26 gigabyte uncompressed) dataset makes for a good excuse to use MapReduce and MrJob for processing. MrJob makes it easy to test MapReduce jobs locally as well as run them on a local Hadoop cluster or on Amazon’s Elastic MapReduce.

    Designing MapReduce Jobs

    I usually find myself going through the same basic tasks when writing MapReduce tasks:

    1. Examine the data input format and the data that you have to play with. This is sometimes explained in a metadata document or you may have to use a utility such as head if you’re trying to look at the very beginning of a text file.
    2. Create a small amount of synthetic data for use while designing your job. It should be obvious to determine if the output of your job is correct or not based on this data. This data is also useful when writing unit tests.
    3. Write your job, using synthetic data as test input.
    4. Create sample data based on your real dataset and continue testing your job with that data. This can be done via reservoir sampling to create a more representative sample or it could be as simple as head -1000000 on a very large file.
    5. Run your job against the sample data and make sure the results look sane.
    6. Configure MrJob to run using Elastic MapReduce. An example configuration can be found in conf/mrjob-emr.conf but will require you to update it with your credentials and S3 bucket information before it will work.
    7. Run your sample data using Elastic MapReduce and a small number of low-cost instances. It’s a lot cheaper to fix configuration problem when you’re just
      running two cheap instances.
    8. Once you’re comfortable with everything, run your job against the full dataset on Elastic MapReduce.

    Analyzing the data

    This project contains two MapReduce jobs:

    jobs/follower_count.py
    A simple single-stage MapReduce job that reads the data in and sums the number of followers each user has.
    jobs/follower_histogram.py
    This is a two-phase MapReduce job that first sums the number of followers a each user has then for each follower count sums the number of users that have that number of followers. This is one of many interesting ways at looking at this raw data.

    Running the jobs

    The following assumes you have a modern Python and have already installed MrJob (pip install MrJob or easy_install MrJob or install it from source).

    To run the sample data locally:

    $ python jobs/follower_count.py data/twitter_synthetic.txt
    

    This should print out a summary of how many followers each user (represented by id) has:

    5       2
    6       1
    7       3
    8       2
    9       1
    

    You can also run a larger sample (the first 10 million rows of the full dataset mentioned above) locally though it will likely take several minutes to process:

    $ python jobs/follower_count.py data/twitter_sample.txt.gz
    

    After editing conf/mrjob-emr.conf you can also run the sample on Elastic MapReduce:

    $ python jobs/follower_count.py -c conf/mrjob-emr.conf -r emr \
     -o s3://your-bucket/your-output-location --no-output data/twitter_sample.txt.gz
    

    You can also upload data to an S3 bucket and reference it that way:

    $ python jobs/follower_count.py -c conf/mrjob-emr.conf -r emr \
     -o s3://your-bucket/your-output-location --no-output s3://your-bucket/twitter_sample.txt.gz
    

    You may also download the full dataset and run either the follower count or the histogram job. The following general steps are required:

    1. Download the whole data file from Kwak, Haewoon and Lee, Changhyun and Park, Hosung and Moon, Sue via bittorrent. I did this on a small EC2 instance in order to make uploading to S3 easier.
    2. To make processing faster, decompress it, split it in to lots of smaller files (split -l 10000000
      for example).
    3. Upload to an S3 bucket.
    4. Run the full job (it took a little over 15 minutes to read through 1.47 billion relationships and took just over an hour to complete).
    python jobs/follower_histogram.py -c conf/mrjob-emr.conf -r emr \
    -o s3://your-bucket/your-output-location --no-output s3://your-split-input-bucket/
    

    Speeding things up with PyPy

    While there are lots of other things to explore in the data, I also wanted to be able to run PyPy on Elastic MapReduce. Through the use of bootstrap actions, we can prepare our environment to use PyPy and tell MrJob to execute jobs with PyPy instead of system Python. The following need to be added to your configuration file (and vary between 32 and 64 bit):

    # Use PyPy instead of system Python
    bootstrap_scripts:
    - bootstrap-pypy-64bit.sh
    python_bin: /home/hadoop/bin/pypy
    

    This configuration change (available in conf/mrjob-emr-pypy-32bit.conf and conf/mrjob-emr-pypy-64bit.conf) also makes use of a custom bootstrap script found in conf/bootstrap-pypy-32bit.sh and conf/bootstrap-pypy-64bit.sh).

    A single run of “follower_histogram.py“ with 8 “c1.xlarge“ instances took approximately 66 minutes using Elastic MapReduce’s system Python. A single run with PyPy in the same configuration took approximately 44 minutes. While not a scientific comparison, that’s a pretty impressive speedup for such a simple task. PyPy should speed things up even more for more complex tasks.

    Thoughts on Elastic MapReduce

    It’s been great to be able to temporarily rent my own Hadoop cluster for short periods of time, but Elastic MapReduce definitely has some downsides. For starters, the standard way to read and persist data during jobs is via S3 instead of HDFS which you would most likely be using if you were running your own Hadoop cluster. This means that you spend a lot of time (and money) transferring data between S3 and nodes. You’re not bringing the data to computing resources like a dedicated Hadoop cluster running HDFS might.

    All in all though it’s a great tool for the toolbox, particularly if you don’t have the need for a full-time Hadoop cluster.

    Play along at home

    All of the source code and configuration mentioned in this post can be found at social-graph-analysis and is released under the BSD license.

  • Literate Diffing

    The other day I found myself wanting to add commentary to a diff. There are code review tools such as reviewboard and gerrit that make commenting on diffs pretty easy. Github allows you to comment on pull requests and individual commits.

    These are all fantastic tools for commenting on diffs, but I kind of wanted something different, something a little more self-contained. I wanted to write about the individual changes, what motivated them, and what the non-code implications of each change might be. At that point my mind wandered to the world of lightweight literate programming using tools like docco, rocco, and pycco.

    A literate diff might look something like this (using Python/Bash style single-line comments):

    # Extend Pygments' DiffLexer using a non-standard comment (#) for literate diffing using pycco.
    diff -r cfa0f44daad1 pygments/lexers/text.py
    --- a/pygments/lexers/text.py	Fri Apr 29 14:03:50 2011 +0200
    +++ b/pygments/lexers/text.py	Sat Apr 30 20:28:56 2011 -0500
    @@ -231,6 +231,7 @@
                 (r'@.*\n', Generic.Subheading),
                 (r'([Ii]ndex|diff).*\n', Generic.Heading),
                 (r'=.*\n', Generic.Heading),
    # Add non-standard diff comments.  This has to go above the Text capture below
    # in order to be active.
    +            (r'#.*\n', Comment),
                 (r'.*\n', Text),
             ]
         }

    It turns out that it’s pretty easy to process with patch, but comes with a catch. The patch command would blow up quite spectacularly if it encountered one of these lines, so the comments will have to be removed from a literate diff before being passed to patch. This is easily done using awk:

    cat literate.diff | awk '!/\#/' | patch -p0

    If you’re using a DVCS, you’ll need -p1 instead.

    Since I’m using a non-standard extension to diffs, tools such as pygments won’t know to syntax highlight comments appropriately. If comments aren’t marked up correctly, pycco won’t be able to put them in the correct spot. This requires a patch to pygments and a patch to pycco. I’m kind of abusing diff syntax here and haven’t submitted these patches upstream, but you can download and apply them if you’d like to play along at home.

    I still think tools like github, reviewboard, and gerrit are much more powerful for commenting on diffs but was able to make pycco output literate diffs quick enough that I thought I’d share the process. These tools are no excuse for clearly commenting changes and implications within the code itself, but I do like having a place to put underlying motivations. Here’s an example of a literate diff for one of my commits to phalanges, a finger daemon written in Scala. It’s still a pretty contrived example but is exactly what I was envisioning when my mind drifted from diffs to literate programming.

  • PyPy is Fast (And So Can You)

    I’ve known for some time that PyPy (Python implemented in a subset of the language called RPython) is fast. The PyPy speed charts show just how fast for a lot of benchmarks (and it’s a little slower in a few areas too).

    After seeing a lot of PyPy chatter while PyCon was going on, I thought I’d check it out. On OS X it’s as simple as brew install pypy. After that, just use pypy instead of python.

    The first thing I did was throw PyPy at a couple of Project Euler problems. They’re great because they’re computationally expensive and usually have lots of tight loops. For the ones I looked at, PyPy had a 50-75% speed improvement over CPython. David Ripton posted a more complete set of Euler solution runtimes using PyPy, Unladen Swallow, Jython, Psyco, and CPython. Almost all of the time, PyPy is faster, often significantly so. At this point it looks like the PyPy team is treating “slower than CPython” as a bug, or at the very least, something to improve.

    The latest stable release currently targets Python 2.5, but if you build the latest version from source it looks like they’re on their way to supporting Python 2.7:

    $ ./pypy-c 
    Python 2.7.0 (61fefec7abc6, Mar 18 2011, 06:59:57)
    [PyPy 1.5.0-alpha0] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    And now for something completely different: ``1.1 final released:
    http://codespeak.net/pypy/dist/pypy/doc/release-1.1.0.html''
    >>>> 

    There are a few things to look out for when using PyPy. The entire standard library isn’t built out, though the most commonly used modules are. PyPy supports ctypes and has experimental but incomplete support for the Python C API. PyPy is built out enough to support several large non-trivial projects such as Twisted (without SSL) and Django (with sqlite).

    PyPy is definitely one of many bright futures for Python, and it’s fast now. If you’ve been thinking about checking it out, perhaps now is the time to take it for a spin.

  • Getting to know Scala

    Over the past couple of weeks I’ve been spending some quality time with Scala. I haven’t really been outside of my Python shell (pun only slightly intended) since getting to know node.js several months back. I’m kicking myself for not picking it up sooner, it has a ton of useful properties:

    • The power and speed of the JVM and access to the Java ecosystem without the verbosity
    • An interesting mix of Object-Oriented and Functional programming (which sounds weird but works)
    • Static typing without type pain through inferencing in common scenarios
    • A REPL for when you just want to check how something works
    • An implementation of the Actor model for message passing and Erlang-style concurrency.

    Getting started

    The first thing I did was try to get a feel for Scala’s syntax. I started by skimming documentation and tutorials at scala-lang.org. I quickly learned that Programming Scala was available on the web so I started skimming that on a plane ride. It’s an excellent book and I need to snag a copy of my bookshelf.

    After getting to know the relatively concise and definitely expressive syntax of the language, I wanted to do something interesting with it. I had heard of a lot of folks using Netty for highly concurrent network services, so I thought I would try to do something with that. I started off tinkering with (and submitting a dependency patch to) naggati2, a toolkit for building protocols using Netty.

    After an hour or so I decided to shelve Naggati and get a better handle on the language and Netty itself. I browsed through several Scala projects using Netty and ended up doing a mechanistic (and probably not very idiomatic) port of a Java echo server. I put this up on github as scala-echo-server.

    Automation is key

    Because my little app has an external dependency, I really wanted to automate downloading that dependency and adding it to my libraries. At quick glance, it looked like it was possible to use Maven with Scala, and there was even a Scala plugin and archetype for it. I found the right archetype by typing mvn archetype:generate | less, found the number for scala-archetype-simple, and re-ran mvn archetype:generate, entering the correct code and answering a couple of questions. Once that was done, I could put code in src/main/scala/com/postneo and run mvn compile to compile my code.

    It was about this time that I realized that most of the Scala projects I saw were using simple-build-tool instead of Maven to handle dependencies and build automation. I quickly installed it and easily configured my echo server to use it. From there my project was a quick sbt clean update compile run from being completely automated. While I’m sure that Maven is good this feels like a great way to configure Scala projects.

    Something a little more complex

    After wrapping my head around the basics (though I did find myself back at the Scala syntax primer quite often), I decided to tackle something real but still relatively small in scope. I had implemented several archaic protocols while getting to know node.js, and I thought I’d pick one to learn Scala and Netty with. I settled on the Finger protocol as it existed in 1977 in RFC 742.

    The result of my work is an open source project called phalanges. I decided to use it as an opportunity to make use of several libraries including Configgy for configuration and logging and Ostrich for statistics collection. I also wrote tests using Specs and found that mocking behavior with mockito was a lot easier than I expected. Basic behavior coverage was particularly useful when I refactored the storage backend, laying the groundwork for pluggable backends and changing the underlying storage mechanism from a List to a HashMap.

    Wrapping up

    Scala’s type checking saved me from doing stupid things several times and I really appreciate the effort put in to the compiler. The error messages and context that I get back from the compiler when I’ve done something wrong are better than any other static language that I can remember.

    I’m glad that I took a closer look at Scala. I still have a lot to learn but it’s been a fun journey so far and it’s been great to get out of my comfort zone. I’m always looking to expand my toolbox and Scala looks like a solid contender for highly concurrent systems.

  • Installing PyLucene on OSX 10.5

    I was pleasantly surprised at my experience installing PyLucene this morning on my OSX 10.5 laptop. The installation instructions worked perfectly without a hiccup. This may not be impressive if you’ve never installed (or attempted to install) PyLucene before.

    I tried once a year or so back and was unsuccessful. The build process just never worked for me and I couldn’t find a binary build that fit my OS + Python version + Java version combination.

    Check out PyLucene:

    $ svn co http://svn.apache.org/repos/asf/lucene/pylucene/trunk pylucene
    

    Build JCC. I install Python packages in my home directory and if you do so too you can omit sudo before the last command, otherwise leave it in:

    $ cd pylucene/jcc
    $ python setup.py build
    $ sudo python setup.py install
    

    Now we need to edit PyLucene’s Makefile to be configured for OSX and Python 2.5. If you use a different setup than the one that ships with OSX 10.5, you’ll have to adjust these parameters to match your setup.

    Edit the Makefile:

    $ cd ..
    $ nano Makefile
    

    Uncomment the 5 lines Below the comment # Mac OS X (Python 2.5, Java 1.5). If you have installed a different version of Python such as 2.6, there should be a combination that works for you. Here’s what I uncommented:

    # Mac OS X  (Python 2.5, Java 1.5)
    PREFIX_PYTHON=/usr
    ANT=ant
    PYTHON=$(PREFIX_PYTHON)/bin/python
    JCC=$(PYTHON) -m jcc --shared
    NUM_FILES=2
    

    Save the file, exit your editor, and build PyLucene:

    $ make
    

    If it doesn’t build properly check the settings in your Makefile.

    After a successful build, install it (again you can omit sudo if you install Python packages locally and not system-wide):

    $ sudo make install
    

    Now verify that it’s been installed:

    $ python
    Python 2.5.1 (r251:54863, Nov 11 2008, 17:46:48)
    [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import lucene
    >>>
    

    If it imports without a problem you should have a working PyLucene library. Rejoice.

  • Sphinx Search with PostgreSQL

    While I don’t plan on moving away from Apache Solr for my searching needs any time soon, Jeremy Zawodny’s post on Sphinx at craigslist made me want to take a closer look. Sphinx works with MySQL, PostgreSQL, and XML input as data sources, but MySQL seems to be the best documented. I’m a PostgreSQL guy so I ran in to a few hiccups along the way. These instructions, based on instructions on the Sphinx wiki, got me up and running on Ubuntu Server 8.10.

    Install build toolchain:

    $ sudo aptitude install build-essential checkinstall
    

    Install Postgres:

    $ sudo aptitude install postgresql postgresql-client \\
    postgresql-client-common postgresql-contrib \\
    postgresql-server-dev-8.3
    

    Get Sphinx source:

    $ wget http://www.sphinxsearch.com/downloads/sphinx-0.9.8.1.tar.gz
    $ tar xzvf sphinx-0.9.8.1.tar.gz
    $ cd sphinx-0.9.8.1
    

    Configure and make:

    $ ./configure --without-mysql --with-pgsql \\
    --with-pgsql-includes=/usr/include/postgresql/ \\
    --with-pgsql-lib=/usr/lib/postgresql/8.3/lib/
    $ make
    

    Run checkinstall:

    $ mkdir /usr/local/var
    $ sudo checkinstall
    

    Sphinx is now installed in /usr/local. Check out /usr/local/etc/ for configuration info.

    Create something to index:

    $ createdb -U postgres test
    $ psql -U postgres test
    test=# create table test (id integer primary key not null, text text);
    test=# insert into test (text) values ('Hello, World!');
    test=# insert into test (text) values ('This is a test.');
    test=# insert into test (text) values ('I have another thing to test.');
    test=# -- A user with a password is required.
    test=# create user foo with password 'bar';
    test=# alter table test owner to foo;
    test=# \\q
    

    Configure sphinx (replace nano with your editor of choice):

    $ cd /usr/local/etc
    $ sudo cp sphinx-min.conf.dist sphinx.conf
    $ sudo nano sphinx.conf
    

    These values worked for me. I left configuration for indexer and searchd unchanged:

    source src1
    {
      type = pgsql
      sql_host = localhost
      sql_user = foo
      sql_pass = bar
      sql_db = test
      sql_port = 5432
      sql_query = select id, text from test
      sql_query_info = SELECT * from test WHERE id=$id
    }
    
    index test1
    {
      source = src1
      path = /var/data/test1
      docinfo = extern
      charset_type = utf-8
    }
    

    Reindex:

    $ sudo mkdir /var/data
    $ sudo indexer --all
    

    Run searchd:

    $ sudo searchd
    

    Play:

    $ search world
    
    Sphinx 0.9.8.1-release (r1533)
    Copyright (c) 2001-2008, Andrew Aksyonoff
    
    using config file '/usr/local/etc/sphinx.conf'...
    index 'test1': query 'world ': returned 1 matches of 1 total in 0.000 sec
    
    displaying matches:
    1. document=1, weight=1
    
    words:
    1. 'world': 1 documents, 1 hits
    

    Use Python:

    cd sphinx-0.9.8.1/api
    python
    >>> import sphinxapi, pprint
    >>> c = sphinxapi.SphinxClient()
    >>> q = c.Query('world')
    >>> pprint.pprint(q)
    {'attrs': [],
     'error': '',
     'fields': ['text'],
     'matches': [{'attrs': {}, 'id': 1, 'weight': 1}],
     'status': 0,
     'time': '0.000',
     'total': 1,
     'total_found': 1,
     'warning': '',
     'words': [{'docs': 1, 'hits': 1, 'word': 'world'}]}
    

    If you add new data and want to reindex, make sure you use the --rotate flag:

    sudo indexer --rotate --all
    

    This is an extremely quick and dirty installation designed to give me a sandbox
    to play with. For production use you would want to run as a non-privileged user
    and would probably want to have an /etc/init.d script for searchd or run it
    behind supervised. If you’re looking to experiment with Sphinx and MySQL,
    there should be plenty of documentation out there to get you started.

  • Kansas Primary 2008 recap

    I’m winding down after a couple of very long days preparing for our coverage of the 2008 Kansas (and local) primaries. As always it’s been an exhausting but rewarding time. We’ve come a long way since the first election I wrote software for and was involved with back in 2006 (where election night involved someone accessing an AS/400 terminal and shouting numbers at me for entry). Our election app has become a lot more sophisticated, our data import process more refined, and election night is a whole lot more fun and loads less stressful than it used to be. I thought I’d go over some of the highlights while they’re still fresh in my mind.

    Douglas County Comission 2nd District Democratic primary section

    Our election app is definitely a success story for both the benefits of structured data and incremental development. Each time the app gets a little more sophisticated and a little smarter. What once wasn’t used until the night of the election has become a key part of our election coverage both before and after the event. For example, this year we had an overarching election section and also sections for indivudual races, like this section for the Douglas County Commission 2nd district Democratic primary. These sections tie together our coverage of the individual races: Stories, photos and videos about the race, our candidate profiles, any chats we’ve had with the candidates, campaign finance documents, and candidate selectors, an awesome app that has been around longer than I have that lets users see which candidates they most agree with. On election night they’re smart enough to display results as they come in.

    Election results start coming in Results rolling in County commission races almost done

    This time around, the newsroom also used our tools to swap out which races were displayed on the homepage throughout the night. We lead the night with results from Leavenworth County, since they were the first to report. The newsroom spent the rest of the nice swapping in one or more race on the homepage as they saw fit. This was a huge improvement over past elections where we chose ahead of time which races would be featured on the homepage. It was great to see the newsroom exercise editorial control throughout the night without having to involve editing templates.

    More results

    On the television side, 6 News Lawrence took advantage of some new hardware and software to display election results prominently throughout the night. I kept catching screenshots during commercial breaks, but the name of the race appeared on the left hand side of the screen with results paging through on the bottom of the screen. The new hardware and software allowed them to use more screen real estate to provide better information to our viewers. In years past we’ve had to jump through some hoops to get election results on the air, but this time was much easier. We created a custom XML feed of election data that their new hardware/software ingested continuously and pulled results from. As soon as results were in our database they were on the air.

    The way that election results make their way in to our database has also changed for the better over the past few years. We have developed a great relationship with the Douglas County Clerk, Jamie Shew and his awesome staff. For several elections now they have provided us with timely access to detailed election results that allow us to provide precinct-by-precinct results. It’s also great to be able to compare local results with statewide results in state races. We get the data in a structured and well-documented fixed-width format and import it using a custom parser we wrote several elections ago.

    State results flow in via a short script that uses BeautifulSoup to parse and import data from the Kansas Secretary of State site. That script ran every few minutes throughout the night and was updating results well after I went to bed. In fact it’s running right now while we wait for the last few precincts in Hodgeman County to come in. This time around we did enter results from a few races in Leavenworth and Jefferson counties by hand, but we’ll look to automate that in November.

    As always, election night coverage was a team effort. I’m honored to have played my part as programmer and import guru. As always, it was great to watch Christian Metts take the data and make it both beautiful and meaningful in such a short amount of time. Many thanks go out to the fine folks at Douglas County and all of the reporters, editors, and technical folk that made our coverage last night possible.

  • DjangoCon!

    I’m a little late to the announcement party, but I’ll be attending DjangoCon and sitting on a panel about Django in Journalism with Maura Chace and Matt Waite. The panel will be moderated by our own Adrian Holovaty.

    I think the panel will be pretty fantastic but I can’t help be just as terrified as my fellow panelists. I love that we’ll have both Journalist-programmers and Programmer-journalists on the panel, and I love that Django is so often the glue that brings the two together.

    DjangoCon is going to be awesome.

  • Natalie Anne Croydon

    Passed out

    Last weekend, our first child, Natalie Anne Croydon was born. I’ve been trying to keep up with Flickr photos and updated my twitter feed a lot during the labor and delivery process (what a geek!). Thanks to everyone for their kind words and congratulations.

    For more pictures, check my Flickr archive starting on May 24 or my photos tagged “Natalie”.

  • Arduino: Transforming the DIY UAV Community

    It’s been pretty awesome watching the homebrew UAV community discover and embrace Arduino. Back in January community leader Chris Anderson discovered and fell in love with Arduino. Today he posted information and the board design for an Arduino-powered UAV platform. Because everything is open, it’s very easy to combine functionality from other boards in order to reduce the cost:

    The decision to port the Basic Stamp autopilot to Arduino turned out to be an unexpected opportunity to make something really cool. I’ve taken Jordi’s open source RC multiplexer/failsafe board, and mashed it up with an Arduino clone to create “ArduPilot”, perhaps the cheapest autopilot in the world. ($110! That’s one-third the price of Paparazzi)

    As with their other projects, the UAV schematics, board design, and Arduino control software will be released before they’re done. It’s quite awesome to realize just how cheap the Arduino-based autopilot is:

    That’s a $110 autopilot, thanks to the open source hardware. By comparison, the Basic Stamp version of this, with processor, development board and failsafe board, would run you $300, and it’s not as powerful

    I’ve been quite impressed by how quickly the Arduino autopilot has gotten off the ground (pun only slightly intended). The decision to port the existing Basic Stamp code to Arduino was made just over a week ago. While I haven’t seen the control code, it looks like the team are well on their way.

    I love it when geek topics collide, and this is about as good as it gets. I’ll be keeping a close eye on the ArduPilot, and I can’t wait to see it in the skies.

  • This whole number reuse thing has gone too far

    This madness needs to stop!

    Espoo, Finland – Nokia today unveiled a trio of mobile devices that balance stunning and sophisticated looks with the latest in mobile functionality. All three devices, the Nokia 6600 fold, the Nokia 6600 slide and the Nokia 3600 slide present a smooth, minimalist design and an appealing array of easy-to-use features. The devices range in price from 175 EUR to 275 EUR before taxes and subsidies and are expected to start shipping during the third quarter of 2008.

    I know that Nokia have a finite set of product names when we’re talking about 4 digit numbers. Aside from the Nseries and Eseries and a handful of other products, Nokia are pretty keen on assigning 4 digit numbers as product names. While often confusing, at least it avoids product names like RAZR or ENv. I don’t quite get the naming of the 6600 fold and the 6600 slide though. Either someone in Espoo has the attention span of a goldfish or they expect that S60 consumers do.

    Us S60 owners are a pretty loyal and knowledgeable bunch. We do our research and know our history. I may be wrong, but I’d venture that a good number of S60 users could name a dozen or more S60 models from the 7650 to the N-Gage to the N95. Surely a good chunk of us would rattle off the 6600 in the process. We might also remember the 3600 as the awkward American cousin of the 3650.

    You know, that business phone from 2003 that brought significant hardware and software upgrades to the table compared to the 7650 and the 3650. I sure remember it as if it were yesterday.

    Every once in awhile someone raises a stink about Nokia reusing a product number. Usually it’s a product number from the 80’s or 90’s and the word “Classic” is attached to the new phone. I’m OK with that. I just think that it’s a little early to be reusing a product code from 2003 in a market segment of geeks and power users.

  • Python for S60: back in the saddle

    I had the opportunity to meet Jürgen Scheible and Ville Tuulos, authors of the Mobile Python book at PyCon a few weeks ago. They graciously gave me a copy of their book, which is an absolutely fantastic guide to writing S60 apps in Python. It seems like every time I look away from Python for S60 it gets better, and this time was no exception. Everything is just a little more polished, a few more APIs are supported (yay sensor API!), and the community and learning materials available have grown tremendously.

    While I didn’t get a chance to hang out too long during the sprints, I did pull together some code for a concept I’ve wanted to do for a long time: a limpet webcam that I can stick on something and watch it ride around the city. Specifically I thought it would be cool to attach one to a city bus and upload pictures while tracing its movements.

    So here’s my quick 19 line prototype that simply takes a picture using the camera API and uploads the saved photo using ftplib copied over from the Python 2.2.2 standard library. It’s called webcam.py. I haven’t run it since PyCon, so the most recent photo is from the PyS60 intro session.

    Working with PyS60 again was absolutely refreshing. I write Python code (using Django) at work but writing code for a mobile device again got the creative juices flowing. I’m trying to do more with less in my spare time, but I definitely need to make more time for PyS60 in my life.

  • PyCon 2008

    I’m headed out the door to PyCon 2008. Yay!

  • Covering Kansas Democratic Caucus Results

    I think we’re about ready for caucus results to start coming in.

    We’re covering the Caucus results at LJWorld.com and on Twitter.

    Turnout is extremely heavy. So much so that they had to split one of the caucus sites in two because the venue was full.

    Later…

    How did we do it?

    We gained access to the media results page from the Kansas Democratic Party on Friday afternoon. On Sunday night I started writing a scraper/importer using BeautifulSoup and rouging out the Django models to represent the caucus data. I spent Monday refining the models, helper functions, and front-end hooks that our designers would need to visualize the data. Monday night and in to Tuesday morning was spent finishing off the importer script, exploring Google Charts, and making sure that Ben and Christian had everything they needed.

    After a few hours of sleep, most of the morning was spent testing everything out on our staging server, fixing bugs, and improving performance. By early afternon Ben was wrapping up KTKA and Christian was still tweaking his design in Photoshop. Somewhere between 1 and 2 p.m. he started coding it up and pretty soon we had our results page running on test data on the staging server.

    While the designers were finishing up I turned my focus to the planned Twitter feed. Thanks to some handy wrappers from James, I wrote a quick script that generated a short message based on the caucus results we had, compared it to the last version of the message, and sent a post to Twitter if the message had changed.

    Once results started coming in, we activated our coverage. After fixing one quick bug, I’ve been spending most of the evening watching importers feed data in to our databases and watching the twitter script send out updates. Because we’ve been scraping the Kansas Democratic Party media results all night and showing them immediately, we’ve been picking up caucuses seconds after they’ve been reported and have been ahead of everything else I’ve looked at.

    Because we just recently finished moving our various Kansas Weekly papers to Ellington and a unified set of templates, it was quite trivial to include detailed election results on the websites for The Lansing Current, Baldwin City Signal, Basehor Sentinel, The Chieftain, The De Soto Explorer, The Eudora News, Shawnee Dispatch, and The Tonganoxie Mirror

    While there are definitely things we could have done better as a news organization (there always are), I’m quite pleased at what we’ve done tonight. Our servers hummed along quite nicely all night, we got information to our audience as quickly as possible, and generally things went quite smoothly. Many thanks to everyone involved.

  • We’re hiring!

    Wow, the Django job market is heating up. I posted a job opening for both junior and senior-level Django developers on djangogigs just a few days ago, and it has already fallen off the front page.

    So I’ll mention it again: We’re hiring! We’re growing and we have several positions open at both the junior and senior level. We’d love to talk to you if you’ve been working with Django since back in the day when everything was a tuple. We’d love to talk to you if you’re smart and talented but don’t have a lot of (or any) Django experience.

    Definitely check out the listing at djangogigs for more, or feel free to drop me a line if you’d like to know more.

  • Google apps for your newsroom

    Google spreadsheetsI like to think that I’m pretty good at recognizing trends. One thing that I’ve been seeing a lot recently in my interactions with the newsroom is that we’re no longer exchanging Excel spreadsheets, Word files, and other binary blobs via email. Instead we’re sending invites to spreadsheets and documents on Google docs, links to data visualization sites like Swivel and ManyEyes, and links to maps created with Google MyMaps.

    Using these lightweight webapps has definitely increased productivity on several fronts. While as much as we would love every FOIA request and data source to come in a digital format, we constantly see data projects start with a big old stack of paper. Google spreadsheets has allowed us to parallelize and coordinate data entry in a way that just wasn’t possible before. We can create multiple spreadsheets and have multiple web producers enter data in their copious spare time. I did some initial late night data entry for the KU flight project (Jacob and Christian rocked the data visualization house on that one), but we were able to take advantage of web producers to enter the vast majority of the data.

    Sometimes the data entry is manageable enough (or the timeline is tight enough) that the reporter or programer can handle it on their own. In this case, it allows us to quickly turn quick spreadsheet-style data entry in to CSV, our data lingua franca for data exchange. Once we have the data in CSV form we can visualize it with Swivel or play with it in ManyEyes. If all we’re looking for is a tabular listing of the data, we’ve written some tools that make that easy and look good too. On larger projects, CSV is often the first step to importing the data and mapping it to Django objects for further visualization.

    Awesome webapps that increase productivity aren’t limited to things that resemble spreadsheets from a distance. A few weeks back we had a reporter use Google’s awesome MyMaps interface to create a map of places to enjoy and avoid while traveling from Lawrence, KS to Miami, FL for the orange bowl. We pasted the KML link in to our Ellington map admin and instantly had an interactive map on our site. A little custom template work completed the project quite quickly.

    It all boils down to apps that facilitate collaboration, increase productivity, and foster data flow. Sometimes the best app for the job sits on the desktop (or laptop). Increasingly, I’ve found that those apps live online—accessable anywhere, anytime.

  • 2008 Digital Edge Award Finalists

    The 2008 DIgital Edge Award finalists were just announced, and I’m excited to see several World Company sites and projects on there as well as a couple of sites running Ellington and even the absolutely awesome Django-powered PolitiFact.com.

    At work we don’t do what we do for awards. We do it to serve our readers, tell a story, get information out there, and do it as best we can. At the same time even being nominated as finalists is quite an honor, and evokes warm fuzzy feelings in this programmer.

    Here are the various World Company projects and sites that were nominated (in the less than 75,000 circulation category):

    Not too shabby for a little media company in Kansas. I’m particularly excited about the LJWorld.com nominations since it hasn’t been too long since we re-designed and re-launched the site with a lot of new functionality. Scanning the finalists I also see a couple of other sites running Ellington as well as several special projects by those sites.

    As someone who writes software for news organizations for a living I’m definitely going to take some time this morning to take a look at the other finalists. I’m particularly excited to check out projects from names that I’m not familiar with.

  • Web Browser for S60: If only I remembered the shortcut key

    I was reading a post on the (unofficial) Nokia Blog detailing the shortcut keys for the S60 Web Browser. I was instantly reminded of my own frustration and how often I had looked at that page of the user manual.

    Web Browser for S60 is easily the most powerful app that ships with new S60 devices. It renders real websites well and quickly using my favorite rendering engine. It packs more features, options, and actions on one screen than any other app that I can think of.

    It comes at a price though. Aside from the two softkeys there aren’t any keys on the device marked specifically for browser actions once you’re in the app. Even better, I usually use the browser in full screen mode to maximize screen real estate, so those two softkeys are now gone.

    It’s still an amazing app, I just distinctly recall the constant remembering and forgetting and looking up of shortcut keys as a big pain point in my enjoyment of the S60 browser. It’s something that I’d love to see addressed, but I’m not sure of an easy way to do so without resorting to touch interfaces or keys (physical or virtual) that change labels given different contexts.

    Mobile browser shortcuts are a lot easier on a device like the N800 or N810 that can dedicate keys to it full time. Or you can go the no-keys approach that seems to be working well for the iPhone.

    I know that those smart Finns will figure it out.

  • Kansas covered by OpenStreetMap!

    I was checking up on the TIGER/Line import to OpenStreetMaps earlier today and I was pleased to see that Kansas is already partially processed! I had emailed Dave the other day and he had queued Kansas up, but I was pleasantly suprised to see it partially processed already. Douglas County is already completely imported and Tiles@Home is currently rendering it up. Parts of Lawrence have already been rendered and cam be seen using the osmarender layer. Here’s 23rd and Iowa:

    Lawrence in OpenStreetMap!

    Andrew Turner had turned me on to OpenStreetMap over beers at Free State and at GIS Day @ KU even though I’ve been reading about it for some time now. So far it seems like an amazing community and I’ve been enjoying digging in to the API and XML format and various open source tools like osmarender, JSOM, and OpenLayers.

    After getting psyched up at GIS Day I’ve been playing with some other geo tools, but more on that later.