Category: Open Source

  • Parsing CSV data in Scala with opencsv

    One of the great things about Scala (or any JVM language for that matter) is that you can take advantage of lots of libraries in the Java ecosystem. Today I wanted to parse a CSV file with Scala, and of course the first thing I did was search for scala csv. That yielded some interesting results, including a couple of roll-your-own regex-based implementations. I prefer to lean on established libraries instead of copying and pasting code from the internet, so my next step was to search for java csv.

    The third hit down was opencsv, which looked solid, had been updated recently, and was Apache-licensed. All good signs in my book. It’s also in the main maven repository, so adding it to my sbt 0.10.x build configuration was easy:

    
    libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.1"
    

    The syntax for sbt 0.7.x is similar, but you should really upgrade:

    
    val opencsv = "net.sf.opencsv" % "opencsv" % "2.1"
    

    Once that configuration change is in place, running sbt update will let you use opencsv in either your project or the shell via sbt console.

    There are a couple of simple usage examples on the opencsv site along with a link to javadocs. The javadocs are currently for the development version (2.4) and include an improved iterator interface that would be useful for larger files.

    Let’s parse some CSV data in Scala. We’ll use a CSV version of violations of 14 CFR 91.11, 121.580 & 135.120, affectionately known as the unruly passenger dataset (as seen in the Django book):

    
    Year,Total
    1995,146
    1996,184
    1997,235
    1998,200
    1999,226
    2000,251
    2001,299
    2002,273
    2003,281
    2004,304
    2005,203
    2006,136
    2007,150
    2008,123
    2009,135
    2010,121
    

    You can download the raw data as unruly_passengers.txt.

    Here’s a full example of parsing the unruly passengers data:

    
    import au.com.bytecode.opencsv.CSVReader
    import java.io.FileReader
    import scala.collection.JavaConversions._
    
    val reader = new CSVReader(new FileReader("unruly_passengers.txt"))
    for (row <- reader.readAll) {
        println("In " + row(0) + " there were " + row(1) + " unruly passengers.")
    }
    

    This will print out the following:

    
    In Year there were Total unruly passengers.
    In 1995 there were 146 unruly passengers.
    In 1996 there were 184 unruly passengers.
    In 1997 there were 235 unruly passengers.
    In 1998 there were 200 unruly passengers.
    In 1999 there were 226 unruly passengers.
    In 2000 there were 251 unruly passengers.
    In 2001 there were 299 unruly passengers.
    In 2002 there were 273 unruly passengers.
    In 2003 there were 281 unruly passengers.
    In 2004 there were 304 unruly passengers.
    In 2005 there were 203 unruly passengers.
    In 2006 there were 136 unruly passengers.
    In 2007 there were 150 unruly passengers.
    In 2008 there were 123 unruly passengers.
    In 2009 there were 135 unruly passengers.
    In 2010 there were 121 unruly passengers.
    

    There are a couple of ways to make sure that the header line isn't included. If you specify the separator and quote character, you can also tell it to skip any number of lines (one in this case):

    
    val reader = new CSVReader(new FileReader("unruly_passengers.txt"), ',', '"', 1)
    

    Alternatively, you could keep a boolean flag that starts out true and is set to false once the first line has been skipped.

    Also worth mentioning is the JavaConversions import in the example. This enables implicit conversions between Java datatypes and Scala datatypes and makes working with Java libraries a lot easier. Without this import we couldn't use Scala's for loop syntactic sugar. In this case it's implicitly converting a java.util.List to a scala.collection.mutable.Buffer.

    Another thing to be aware of is any cleaning of the raw field output that might need to be done. For example, CSV fields often have leading or trailing whitespace. A quick and easy way to take care of this is to trim leading and trailing whitespace: row(0).trim.

    This isn't the first time I've been pleasantly surprised working with a Java library in Scala, and I'm sure it won't be the last. Many thanks to the developers and maintainers of opencsv and to the creators of all of the open source libraries, frameworks, and tools in the Java ecosystem.

  • Social Graph Analysis using Elastic MapReduce and PyPy

    A couple of weeks back I read a couple of papers (Who Says What to Whom on Twitter and What is Twitter, a Social Network or a News Media?) that cited data collected by researchers for the latter paper.

    This 5 gigabyte compressed (26 gigabyte uncompressed) dataset makes for a good excuse to use MapReduce and MrJob for processing. MrJob makes it easy to test MapReduce jobs locally as well as run them on a local Hadoop cluster or on Amazon’s Elastic MapReduce.

    Designing MapReduce Jobs

    I usually find myself going through the same basic steps when writing MapReduce jobs:

    1. Examine the data input format and the data that you have to play with. This is sometimes explained in a metadata document or you may have to use a utility such as head if you’re trying to look at the very beginning of a text file.
    2. Create a small amount of synthetic data for use while designing your job. It should be obvious to determine if the output of your job is correct or not based on this data. This data is also useful when writing unit tests.
    3. Write your job, using synthetic data as test input.
    4. Create sample data based on your real dataset and continue testing your job with that data. This can be done via reservoir sampling (see the sketch after this list) to create a more representative sample, or it could be as simple as head -1000000 on a very large file.
    5. Run your job against the sample data and make sure the results look sane.
    6. Configure MrJob to run using Elastic MapReduce. An example configuration can be found in conf/mrjob-emr.conf but will require you to update it with your credentials and S3 bucket information before it will work.
    7. Run your sample data using Elastic MapReduce and a small number of low-cost instances. It's a lot cheaper to fix configuration problems when you're only running two cheap instances.
    8. Once you’re comfortable with everything, run your job against the full dataset on Elastic MapReduce.
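
    Step 4 mentions reservoir sampling. Here's a minimal sketch of what that might look like in Python; it isn't part of this project's code, and the one-million-line sample size and file names are just placeholders:

    
    import random
    import sys
    
    def reservoir_sample(lines, k):
        """Keep a uniform random sample of k lines from a stream of unknown length."""
        sample = []
        for i, line in enumerate(lines):
            if i < k:
                sample.append(line)
            else:
                # Replace a kept line with probability k/(i+1) so every line
                # ends up in the sample with equal probability.
                j = random.randint(0, i)
                if j < k:
                    sample[j] = line
        return sample
    
    if __name__ == "__main__":
        # Usage sketch: python reservoir_sample.py < full_dataset.txt > sample.txt
        for sampled_line in reservoir_sample(sys.stdin, 1000000):
            sys.stdout.write(sampled_line)
    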

    Analyzing the data

    This project contains two MapReduce jobs:

    jobs/follower_count.py
    A simple single-stage MapReduce job that reads the data in and sums the number of followers each user has.
    jobs/follower_histogram.py
    This is a two-phase MapReduce job that first sums the number of followers each user has, then for each follower count sums the number of users with that many followers. This is one of many interesting ways of looking at this raw data.
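
    To give a sense of the shape of these jobs, here's a minimal MrJob sketch along the lines of the follower count job. It is not the code from jobs/follower_count.py, and it assumes each input line is a whitespace-separated user/follower pair; check the actual jobs and data format before relying on it:

    
    from mrjob.job import MRJob
    
    class FollowerCount(MRJob):
        """Sum the number of followers for each user id."""
    
        def mapper(self, _, line):
            # Assumes "user follower" pairs, one relationship per line.
            user, _follower = line.split()
            yield user, 1
    
        def reducer(self, user, counts):
            yield user, sum(counts)
    
    if __name__ == '__main__':
        FollowerCount.run()
    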

    Running the jobs

    The following assumes you have a modern Python and have already installed MrJob (pip install MrJob or easy_install MrJob or install it from source).

    To run the sample data locally:

    $ python jobs/follower_count.py data/twitter_synthetic.txt
    

    This should print out a summary of how many followers each user (represented by id) has:

    5       2
    6       1
    7       3
    8       2
    9       1
    

    You can also run a larger sample (the first 10 million rows of the full dataset mentioned above) locally though it will likely take several minutes to process:

    $ python jobs/follower_count.py data/twitter_sample.txt.gz
    

    After editing conf/mrjob-emr.conf you can also run the sample on Elastic MapReduce:

    $ python jobs/follower_count.py -c conf/mrjob-emr.conf -r emr \
     -o s3://your-bucket/your-output-location --no-output data/twitter_sample.txt.gz
    

    You can also upload data to an S3 bucket and reference it that way:

    $ python jobs/follower_count.py -c conf/mrjob-emr.conf -r emr \
     -o s3://your-bucket/your-output-location --no-output s3://your-bucket/twitter_sample.txt.gz
    

    You may also download the full dataset and run either the follower count or the histogram job. The following general steps are required:

    1. Download the whole data file from Kwak, Haewoon and Lee, Changhyun and Park, Hosung and Moon, Sue via bittorrent. I did this on a small EC2 instance in order to make uploading to S3 easier.
    2. To make processing faster, decompress it and split it into lots of smaller files (split -l 10000000, for example).
    3. Upload to an S3 bucket.
    4. Run the full job (it took a little over 15 minutes to read through 1.47 billion relationships and just over an hour to complete):
    python jobs/follower_histogram.py -c conf/mrjob-emr.conf -r emr \
    -o s3://your-bucket/your-output-location --no-output s3://your-split-input-bucket/
    

    Speeding things up with PyPy

    While there are lots of other things to explore in the data, I also wanted to be able to run PyPy on Elastic MapReduce. Through the use of bootstrap actions, we can prepare our environment to use PyPy and tell MrJob to execute jobs with PyPy instead of system Python. The following lines need to be added to your configuration file (and vary between the 32-bit and 64-bit versions):

    # Use PyPy instead of system Python
    bootstrap_scripts:
    - bootstrap-pypy-64bit.sh
    python_bin: /home/hadoop/bin/pypy
    

    This configuration change (available in conf/mrjob-emr-pypy-32bit.conf and conf/mrjob-emr-pypy-64bit.conf) also makes use of custom bootstrap scripts found in conf/bootstrap-pypy-32bit.sh and conf/bootstrap-pypy-64bit.sh.

    A single run of “follower_histogram.py” with 8 “c1.xlarge” instances took approximately 66 minutes using Elastic MapReduce’s system Python. A single run with PyPy in the same configuration took approximately 44 minutes. While not a scientific comparison, that’s a pretty impressive speedup for such a simple task. PyPy should speed things up even more for more complex tasks.

    Thoughts on Elastic MapReduce

    It’s been great to be able to temporarily rent my own Hadoop cluster for short periods of time, but Elastic MapReduce definitely has some downsides. For starters, the standard way to read and persist data during jobs is via S3 instead of HDFS, which you would most likely be using if you were running your own Hadoop cluster. This means that you spend a lot of time (and money) transferring data between S3 and your nodes; you're not bringing the computation to the data the way a dedicated Hadoop cluster running HDFS would.

    All in all though it’s a great tool for the toolbox, particularly if you don’t have the need for a full-time Hadoop cluster.

    Play along at home

    All of the source code and configuration mentioned in this post can be found at social-graph-analysis and is released under the BSD license.

  • Literate Diffing

    The other day I found myself wanting to add commentary to a diff. There are code review tools such as reviewboard and gerrit that make commenting on diffs pretty easy. Github allows you to comment on pull requests and individual commits.

    These are all fantastic tools for commenting on diffs, but I kind of wanted something different, something a little more self-contained. I wanted to write about the individual changes, what motivated them, and what the non-code implications of each change might be. At that point my mind wandered to the world of lightweight literate programming using tools like docco, rocco, and pycco.

    A literate diff might look something like this (using Python/Bash style single-line comments):

    # Extend Pygments' DiffLexer using a non-standard comment (#) for literate diffing using pycco.
    diff -r cfa0f44daad1 pygments/lexers/text.py
    --- a/pygments/lexers/text.py	Fri Apr 29 14:03:50 2011 +0200
    +++ b/pygments/lexers/text.py	Sat Apr 30 20:28:56 2011 -0500
    @@ -231,6 +231,7 @@
                 (r'@.*\n', Generic.Subheading),
                 (r'([Ii]ndex|diff).*\n', Generic.Heading),
                 (r'=.*\n', Generic.Heading),
    # Add non-standard diff comments.  This has to go above the Text capture below
    # in order to be active.
    +            (r'#.*\n', Comment),
                 (r'.*\n', Text),
             ]
         }

    It turns out that a literate diff is pretty easy to process with patch, but there's a catch: patch would blow up quite spectacularly if it encountered one of these comment lines, so the comments have to be removed from a literate diff before it's passed to patch. This is easily done using awk:

    cat literate.diff | awk '!/^#/' | patch -p0

    If you’re using a DVCS, you’ll need -p1 instead.

    Since I’m using a non-standard extension to diffs, tools such as pygments won’t know to syntax highlight comments appropriately. If comments aren’t marked up correctly, pycco won’t be able to put them in the correct spot. This requires a patch to pygments and a patch to pycco. I’m kind of abusing diff syntax here and haven’t submitted these patches upstream, but you can download and apply them if you’d like to play along at home.

    I still think tools like github, reviewboard, and gerrit are much more powerful for commenting on diffs, but I was able to make pycco output literate diffs quickly enough that I thought I'd share the process. Literate diffs are no excuse for clearly commenting changes and implications within the code itself, but I do like having a place to put underlying motivations. Here's an example of a literate diff for one of my commits to phalanges, a finger daemon written in Scala. It's still a pretty contrived example but is exactly what I was envisioning when my mind drifted from diffs to literate programming.

  • Sphinx Search with PostgreSQL

    While I don't plan on moving away from Apache Solr for my searching needs any time soon, Jeremy Zawodny's post on Sphinx at craigslist made me want to take a closer look. Sphinx works with MySQL, PostgreSQL, and XML input as data sources, but MySQL seems to be the best documented. I'm a PostgreSQL guy, so I ran into a few hiccups along the way. These instructions, based on the instructions on the Sphinx wiki, got me up and running on Ubuntu Server 8.10.

    Install build toolchain:

    $ sudo aptitude install build-essential checkinstall
    

    Install Postgres:

    $ sudo aptitude install postgresql postgresql-client \
    postgresql-client-common postgresql-contrib \
    postgresql-server-dev-8.3
    

    Get Sphinx source:

    $ wget http://www.sphinxsearch.com/downloads/sphinx-0.9.8.1.tar.gz
    $ tar xzvf sphinx-0.9.8.1.tar.gz
    $ cd sphinx-0.9.8.1
    

    Configure and make:

    $ ./configure --without-mysql --with-pgsql \
    --with-pgsql-includes=/usr/include/postgresql/ \
    --with-pgsql-lib=/usr/lib/postgresql/8.3/lib/
    $ make
    

    Run checkinstall:

    $ mkdir /usr/local/var
    $ sudo checkinstall
    

    Sphinx is now installed in /usr/local. Check out /usr/local/etc/ for configuration info.

    Create something to index:

    $ createdb -U postgres test
    $ psql -U postgres test
    test=# create table test (id serial primary key, text text);
    test=# insert into test (text) values ('Hello, World!');
    test=# insert into test (text) values ('This is a test.');
    test=# insert into test (text) values ('I have another thing to test.');
    test=# -- A user with a password is required.
    test=# create user foo with password 'bar';
    test=# alter table test owner to foo;
    test=# \q
    

    Configure sphinx (replace nano with your editor of choice):

    $ cd /usr/local/etc
    $ sudo cp sphinx-min.conf.dist sphinx.conf
    $ sudo nano sphinx.conf
    

    These values worked for me. I left configuration for indexer and searchd unchanged:

    source src1
    {
      type = pgsql
      sql_host = localhost
      sql_user = foo
      sql_pass = bar
      sql_db = test
      sql_port = 5432
      sql_query = select id, text from test
      sql_query_info = SELECT * from test WHERE id=$id
    }
    
    index test1
    {
      source = src1
      path = /var/data/test1
      docinfo = extern
      charset_type = utf-8
    }
    

    Reindex:

    $ sudo mkdir /var/data
    $ sudo indexer --all
    

    Run searchd:

    $ sudo searchd
    

    Play:

    $ search world
    
    Sphinx 0.9.8.1-release (r1533)
    Copyright (c) 2001-2008, Andrew Aksyonoff
    
    using config file '/usr/local/etc/sphinx.conf'...
    index 'test1': query 'world ': returned 1 matches of 1 total in 0.000 sec
    
    displaying matches:
    1. document=1, weight=1
    
    words:
    1. 'world': 1 documents, 1 hits
    

    Use Python:

    cd sphinx-0.9.8.1/api
    python
    >>> import sphinxapi, pprint
    >>> c = sphinxapi.SphinxClient()
    >>> q = c.Query('world')
    >>> pprint.pprint(q)
    {'attrs': [],
     'error': '',
     'fields': ['text'],
     'matches': [{'attrs': {}, 'id': 1, 'weight': 1}],
     'status': 0,
     'time': '0.000',
     'total': 1,
     'total_found': 1,
     'warning': '',
     'words': [{'docs': 1, 'hits': 1, 'word': 'world'}]}
    

    If you add new data and want to reindex, make sure you use the --rotate flag:

    sudo indexer --rotate --all
    

    This is an extremely quick and dirty installation designed to give me a sandbox to play with. For production use you would want to run as a non-privileged user and would probably want an /etc/init.d script for searchd or run it under a process supervisor. If you’re looking to experiment with Sphinx and MySQL, there should be plenty of documentation out there to get you started.

  • Arduino: Transforming the DIY UAV Community

    It’s been pretty awesome watching the homebrew UAV community discover and embrace Arduino. Back in January community leader Chris Anderson discovered and fell in love with Arduino. Today he posted information and the board design for an Arduino-powered UAV platform. Because everything is open, it’s very easy to combine functionality from other boards in order to reduce the cost:

    The decision to port the Basic Stamp autopilot to Arduino turned out to be an unexpected opportunity to make something really cool. I’ve taken Jordi’s open source RC multiplexer/failsafe board, and mashed it up with an Arduino clone to create “ArduPilot”, perhaps the cheapest autopilot in the world. ($110! That’s one-third the price of Paparazzi)

    As with their other projects, the UAV schematics, board design, and Arduino control software will be released before they’re done. It’s quite awesome to realize just how cheap the Arduino-based autopilot is:

    That’s a $110 autopilot, thanks to the open source hardware. By comparison, the Basic Stamp version of this, with processor, development board and failsafe board, would run you $300, and it’s not as powerful

    I’ve been quite impressed by how quickly the Arduino autopilot has gotten off the ground (pun only slightly intended). The decision to port the existing Basic Stamp code to Arduino was made just over a week ago. While I haven’t seen the control code, it looks like the team are well on their way.

    I love it when geek topics collide, and this is about as good as it gets. I’ll be keeping a close eye on the ArduPilot, and I can’t wait to see it in the skies.

  • Kansas covered by OpenStreetMap!

    I was checking up on the TIGER/Line import to OpenStreetMap earlier today and I was pleased to see that Kansas is already partially processed! I had emailed Dave the other day and he had queued Kansas up, but I was pleasantly surprised to see it partially processed already. Douglas County is already completely imported and Tiles@Home is currently rendering it. Parts of Lawrence have already been rendered and can be seen using the osmarender layer. Here’s 23rd and Iowa:

    Lawrence in OpenStreetMap!

    Andrew Turner had turned me on to OpenStreetMap over beers at Free State and at GIS Day @ KU, even though I’ve been reading about it for some time now. So far it seems like an amazing community and I’ve been enjoying digging into the API, the XML format, and various open source tools like osmarender, JOSM, and OpenLayers.

    After getting psyched up at GIS Day I’ve been playing with some other geo tools, but more on that later.

  • Maemo blows me away again

    Nokia N810 Internet Tablet

    My wife and I just bought a house and I’ve realized that there isn’t any room in the budget for major gadget purchases, so I’ve been trying not to get too excited about things coming down the road. It’s not surprising that I’ve been following the recently announced Nokia N810 Internet Tablet in a much more detached manner than usual.

    That is until I saw Ari Jaaksi holding a prototype in his hand. Holy crap that thing is significantly smaller than the N800 and packs quite a punch. The slide-out keyboard is killer, GPS is a no-brainer these days and is included, the browser is Mozilla-based, the UI got a refresh… I could go on for days.

    The other thing I really like about the new tablet is that the Maemo platform is moving to be even more open than it was before (which was about as open as the lawyers at Nokia would allow). The quite good but closed source Opera web browser has been replaced by one that is Mozilla-based. This is yet another major component that is now open instead of closed. The major closed-source components (if I’m remembering correctly) are now limited to the DSP, various binary drivers that Nokia licenses directly, and the handwriting recognition software. That’s definitely a smaller list than it was before, and I applaud Nokia’s efforts in opening up as much as possible. It’s also worth noting that the Ubuntu Mobile project is basing a lot of its work on the work that Nokia has done with Maemo (most notably Matchbox and the Hildon UI).

    So yeah, I’m now paying much closer attention to this new device that I was doing my best to ignore. Job well done.

  • There is an Erlang community, it’s just smaller than you’re used to

    I thought I’d pick a nit with something that Ted Leung mentioned in his response to Sam Ruby’s response to Russ’ original post about Java needing an overhaul.

    Particularly this comment about the Erlang community:

    This is actually 2 problems. There’s the issue with the libraries, and there’s the issue with the community that did/didn’t produce the libraries. We don’t just need a technology, we need a community. Hmm, Erlang lab, anyone?

    I’d like to assert that there is a vibrant Erlang community, it’s just smaller than you’re used to, and might be a little harder to find than most.

    I’ve been lurking, participating, and sharing what I’ve learned in #erlang on irc.freenode.net since the Erlang movie blew my mind. I started tinkering with Erlang and the PDF for Programming Erlang came out a few days later. I still consider myself an Erlang noob. I still ask questions with obvious answers, spend hours trying to do something that should have taken 5 minutes, and sometimes don’t do things in the most Erlangish manner.

    But the Erlang community has been great to me. Whenever I ask a question in #erlang, I almost always get the answer I was looking for, whether it’s a pointer to a specific part of the Erlang docs, a snippet of code, an opinion on the best way to do something, or a link to a blog post that has the answer. I’ve been around long enough that I can offer the same to people who are just picking up Programming Erlang and are running in to the same things I was a month or two back.

    If you don’t think there’s an Erlang community, please come by #erlang and spend some time. They’ve been one of the most helpful communities I’ve ever been a part of. Activity is currently skewed more towards the European timezones, but as the community grows and more people across the world pick up Erlang that’s changing.

    There’s also a pretty huge community outside of IRC. I’ve been subscribed to erlang-questions, the most active Erlang mailing list, for a month or two as well. The solution to one of my problems with bit syntax was asked and answered before I even knew that’s what I wanted. I’ve also learned a lot of things about Erlang that I wouldn’t have otherwise from the mailing list.

    The community doesn’t stop there. Head over to trapexit.org and check out the Wiki and the forums. When you’re done there, check out the packages available at CEAN (the smaller Erlang counterpart of CPAN). There are lots of libraries included here if that’s what you’re looking for. Anything in CEAN can be installed from within the Erlang shell.

    If you’re looking for libraries, don’t forget to check out Erlang’s module documentation. It’s far from Python’s batteries included, but there’s more there than it gets credit for. Aside from a distributed database there are TCP and UDP socket libraries, an HTTP client and server, an XML library, and support for SNMP and SSH. You’ll find many more protocol implementations at CEAN while the building blocks reside in Erlang’s standard library. Other places to look for sample code or libraries are Jungerl, a loosely knit collection of useful (but sometimes aging) libraries and applications, and Google Code.

    While many third party Erlang libraries feel like they’re at the 0.1 stage (and many are simply because their authors are new to Erlang), don’t forget about the polished apps and libraries. I’m specifically thinking of ejabberd, RabbitMQ, YAWS, and Wings3D to name a few. Also worth a specific mention is ErlyWeb and the several libraries that it is built on top of.

    So yes: the Erlang community is quite small. Think Python 10-12 years ago or Ruby before Rails. But don’t pretend that it doesn’t exist, because while tiny, it’s vibrant and extremely helpful.

  • Erlang bit syntax and network programming

    I’ve been playing with Erlang over nights and weekends off and on for a few months now. I’m loving it for several reasons. First off, it’s completely different than any other programming language I’ve worked with. It makes me think rather than take things for granted. I’m intrigued by its concurrency abilities and its immutable, no-shared-state mentality. I do go through periods of amazing productivity followed by hours if not days of figuring out a simple task. Luckily the folks in #erlang on irc.freenode.net have been extremely patient with me and the hundreds of other newcomers and have been extremely helpful. I’m also finding that those long pain periods are happening less frequently the longer I stick with it.

    One of the things that truly blew me away about Erlang (after the original Erlang Now! moment) is its bit syntax. The bit syntax as documented at erlang.org and covered in Programming Erlang is extremely powerful. Some of the examples in Joe’s book such as parsing an MP3 file or an IPv4 datagram hint at the power and conciseness of binary matching and Erlang’s bit syntax. I wanted to highlight a few more that have impressed me while I was working on some network socket programming in Erlang.

    There are several mind-boggling examples in Applications, Implementation and Performance Evaluation of Bit Stream Programming in Erlang (PDF) by Per Gustafsson and Konstantinos Sagonas. Here are two functions for uuencoding and uudecoding a binary:

    uuencode(BitStr) ->
        << (X+32):8 || <<X:6>> <= BitStr >>.
    uudecode(Text) ->
        << (X-32):6 || <<X:8>> <= Text >>.
    

    UUencoding and UUdecoding isn’t particularly hard, but I’ve never seen an implementation so concise. I’ve also found that Erlang’s bit syntax makes socket programming extremely easy. The gen_tcp library makes connection to TCP sockets easy, and Erlang’s bit syntax makes creating requests and processing responses dead simple too.

    Here’s an example from qrbgerl, a quick project of mine that receives random numbers from the Quantum Random Bit Generator service. The only documentation I needed to use the protocol was the Python client and the C++ client. Having access to an existing Python client helped me bridge the “how might I do this in Python?” and “how might I do this in Erlang?” gaps, but I ended up referring to the canonical C++ implementation quite a bit too.

    I start out by opening a socket and sending a request. Here’s the binary representation of the request I’m sending:

    list_to_binary([<<0:8,ContentLength:16,UsernameLength:8>>, Username, 
    <<PasswordLength:8>>, Password, <<?REQUEST_SIZE:32>>]),

    This creates a binary packet that complies exactly with what the QRBG service expects. The first thing that it expects is a 0 represented in 8 bits. Then it wants the length of the username plus the length of the password plus 6 (ContentLength above) represented in 16 bits. Then we have the username represented as 8 bit characters/integers, followed by the length of the password represented in 8 bits, followed by the password (again as 8 bit characters/integers). Finally we represent the size of the request in 32 bits. In this case the macro ?REQUEST_SIZE is 4096.

    While that’s a little tricky, once we’ve sent the request, we can use Erlang’s pattern matching and bit syntax to process the response:

    <<Response:8, Reason:8, Length:32, Data:Length/binary, 
    _Rest/binary>> = Bin,

    We’re matching several things here. The response code is the first 8 bits of the response. In a successful response we’ll get a 0. The next 8 bits represent the reason code, again 0 in this case. The next 32 bits will represent the length of the data we’re being sent back. It should be 4096 bytes of data, but we can’t be sure. Next we’re using the length of the data that we just determined to match that length of data as a binary. Finally we match anything else after the data and discard it. This is crucial because binaries are often padded at the beginning or end of the stream. In this case there’s some padding at the end that we need to match but can safely discard.

    Now that we have 4096 bytes of random bits, let’s do something with them! I’ve mirrored the C++ and Python APIs as well as I could, but because of Erlang’s no shared state it’s going to look a little different. Let’s match a 32 bit integer from the random data that we’ve obtained:

    <<Int:32/integer-signed, Rest/binary>> = Bin,

    We’re matching the first 32 bits of our binary stream to a signed integer. We’re also matching the rest of the data in binary form so that we can reuse it later. Here’s that data extraction in action:

    5> {Int, RestData} = qrbg:extract_int(Data).
    {-427507221,
     <<0,254,163,8,239,180,51,164,169,160,170,248,94,132,220,79,234,4,117,
       248,174,59,167,49,165,170,154,...>>}
    6> Int.
    -427507221

    I’ve been quite happy with my experimentation with Erlang, but I’m definitely still learning some basic syntax and have only begun to play with concurrency. If the above examples confuse you, it might help to view them in context or take a look at the project on google code. I have also released an ISBN-10 and ISBN-13 validation and conversion library for Erlang which was a project I used to teach myself some Erlang basics. I definitely have some polishing to do with the QRBG client, but isbn.erl has full documentation and some 44 tests.

  • Reason 3,287 why I hate setuptools

    root@monkey:~/inst/simplejson# python setup.py install
    The required version of setuptools (>=0.6c6) is not available, and
    can't be installed while this script is running. Please install
     a more recent version first.
    
    (Currently using setuptools 0.6c3 
    (/usr/lib/python2.4/site-packages/setuptools-0.6c3-py2.4.egg))
  • isbn.erl: My first Erlang module

    For a few weeks now I’ve been tinkering with, learning about, and falling in love with Erlang. I’ve been noticing the buzz about Erlang over the past few months, but two things won me over: the Erlang video and how amazingly simple and elegant concurrency and message passing is.

    For the past few weeks I’ve been reading the Erlang documentation, Joe Armstrong’s new book, reading the trapexit wiki, and lurking in #erlang on irc.freenode.net. If you’re wading in the erlang waters, I highly suggest just about every link in this wonderful roundup of beginners erlang links.

    After a few weeks of reading and tinkering in the shell I decided that it was time to come up with a quick project to hone my Erlang skills. I had tinkered with ISBN validation and conversion while writing a small django application to catalog the books on the bookshelf, so I thought that was a good place to start. The Wikipedia page and my collection of ISBN links provided me with more than enough guidance on validation, check digit generation, and conversion.

    I found it very easy to build this module from the ground up: start with ISBN-10 check digit generation, then use that to build an ISBN-10 validator. I did a similar thing with ISBN-13, writing the check digit generator and then the ISBN-13 validator. From there I was able to build on all four public functions to write an ISBN-10 to ISBN-13 converter as well as an ISBN-13 to ISBN-10 converter (when that is a possibility). In the process of this very simple module I ended up learning about and applying accumulators, guards, the use of case, and lots of general Erlang knowledge.

    Here’s a peek at how the module works. After downloading it via google code or checking out the latest version via Subversion, the module needs to be compiled. This is easily accomplished from the Erlang shell (once you have Erlang installed of course):

    mcroydon$ erl
    Erlang (BEAM) emulator version 5.5.4 [source] [async-threads:0] [kernel-poll:false]
    
    Eshell V5.5.4  (abort with ^G)
    1> c('isbn.erl').
    {ok,isbn}

    After that we can use any of the exported functions:

    2> isbn:validate_13([9,7,8,1,9,3,4,3,5,6,0,0,5]).
    true

    Let’s trace the execution of a simple function, check_digit_10/1. The function expects a list of 9 numbers (the first 9 numbers of an ISBN-10) and returns the check digit as either an integer or the character 'X'. The first thing that the function does is check to see if we’ve actually passed it a 9-item list:

    check_digit_10(Isbn) when length(Isbn) /= 9 ->
        throw(wrongLength);

    This is accomplished with a simple guard (when length(Isbn) /= 9). If that guard isn’t triggered we move on to the next function:

    check_digit_10(Isbn) -> 
        check_digit_10(Isbn, 0).

    This forwards our list of 9 numbers on to check_digit_10/2 (a function with the same name that takes two arguments). We’ll see in a minute that the 0 I’m passing in will be used as an accumulator. The next function does most of the heavy lifting for us:

    check_digit_10([H|T], Total) ->
        check_digit_10(T, Total + (H * (length(T) + 2)));

    This function takes the list and splits it into the first item (H) and the rest of the list (T). We add to the total as specified in ISBN-10 and then call check_digit_10/2 again with the rest of the list. This tail-recursive approach seems odd at first if you’re coming from most any object oriented language, but Erlang’s functional nature, strong list handling, and tail-recursive ways feel natural very quickly. After we’ve recursed through the entire list, it’s time to return the check digit (11 minus the total modulo 11). There’s a special case if we get a result of 10 to use the character ‘X’ instead:

    check_digit_10([], Total) when 11 - (Total rem 11) =:= 10 ->
        'X';

    Finally we return the result for the common case given an empty list:

    check_digit_10([], Total) ->
        11 - (Total rem 11).

    Feel free to browse around the rest of the module and use it if you feel it might be useful to you. I’ve posted the full source to my isbn module at my isbnerl Google Code project, including the module itself and the module’s EDoc documentation. It is released under the new BSD license in hopes that it might be useful to you.

    I learned quite a bit creating this simple module, and if you’re learning Erlang I suggest you pick something not too big, not too small, and something that you are interested in to get your feet wet. Now that I have the basics of sequential Erlang programming down, I think the next step is to make the same move from tinkering to doing something useful with concurrent Erlang.

  • Darwin Calendar Server Status Update

    Wilfredo Sánchez Vega posted a status update on Darwin Calendar Server to the calendarserver-dev mailing list this afternoon. Check the full post for details, but here’s the takeaway:

    We think the server’s in pretty solid shape right now. Certainly there are still some bugs that have to be fixed before we can roll up a “1.0” release, but the core functionality is all in place now, and we think it’s fairly useable in its current state.

    There is also a preview release in the works, so keep a close eye on the mailing list and the Darwin Calendar Server site.

  • Properly serving a 404 with lighttpd’s server.error-handler-404

    The other day I was looking in to why Django‘s verify_exists validator wasn’t working on a URLField. It turns out that the 404 handler that we were using to generate thumbnails with lighttpd on the media server was serving up 404 error pages with HTTP/1.1 200 OK as the status code. Django’s validator was seeing the 200 OK and (rightfully) not raising a validation error, even though there was nothing at that location.

    It took more than the usual amount of digging to find the solution to this, so I’m making sure to write about it so that I can google for it again when it happens to me next year. The server.error-handler-404 documentation mentions that to send a 404 response you will need to use server.errorfile-prefix. That doesn’t help me a lot since I needed to retain the dynamic 404 handler.

    Amusingly enough, the solution is quite clear once you dig into the source code. I dug through connections.c and found that if I sent back a Status: line, it would be forwarded on by lighttpd to the client (exactly what I wanted!).

    Here’s the solution (in Python) but it should translate to any language that you’re using as a 404 handler:

    print "Status: 404"
    print "Content-type: text/html"
    print
    print # (X)HTML error response goes here
    

    I hope this helps, as I’m sure it will help me again some time in the future.

  • Packaging Python Imaging Library for maemo 3.0 (bora) and the N800

    I found myself wanting to do some image manipulation in Python with the Python Imaging Library on my N800. Unfortunately PIL isn’t available in the standard repositories. Not to worry, after reading the Debian Maintainers’ Guide I packaged up python2.5-imaging_1.1.6-1_armel.deb, built against python2.5 and maemo 3.0 (bora), and installs perfectly on my N800:

    Nokia-N800-51:/media/mmc1# dpkg -i python2.5-imaging_1.1.6-1_armel.deb 
    Selecting previously deselected package python2.5-imaging.
    (Reading database ... 13815 files and directories currently installed.)
    Unpacking python2.5-imaging (from python2.5-imaging_1.1.6-1_armel.deb) ...
    Setting up python2.5-imaging (1.1.6-1) ...
    Nokia-N800-51:/media/mmc1# python2.5 
    Python 2.5 (r25:9277, Jan 23 2007, 15:56:37) 
    [GCC 3.4.4 (release) (CodeSourcery ARM 2005q3-2)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from PIL import Image
    >>>

    Feel free to take a look at the directory listing for full source, diffs, etc. You might also want to check out debian/rules to see where to use setup.py in the makefile to build and install a Python package. If anyone wants a build of this for maemo-2.0/770 please let me know. It would just take a little time to set up a maemo-2.0 scratchbox.

  • Nokia N800 and camera.py

    Nokia N800 and camera.py

    I received my Nokia N800 yesterday and have been enjoying how zippy it is compared to the 770 (which has been getting faster with every firmware upgrade). I got a chance to play with the browser while waiting for my wife at the airport and have been poking around to see how Bora is different than Mistral and Scirocco.

    One of the bigger physical differences between the 770 and the N800 is the onboard camera. I haven’t set up Nokia Internet Call Invitation yet, but I looked around for some camera code samples and was pleasantly surprised. Luckily Nokia is one step ahead of me with camera.py, an example program to show what the camera sees on the screen. It looks like some bindings are missing, so saving an image from the camera is off limits at the moment, but this is a great start.

    To run the above example on your N800, grab camera.py, install Python and osso-xterm, and run camera.py from the console.

    It’s time to dust off the Python Maemo tutorial and get my feet wet.

    Update: I’ve also been playing with the C version of the camera example code and have managed to get it running and taking pictures after building it in scratchbox and running it with run-standalone.sh ./camera.

  • From GPX to PostGIS

    Now that I have a RELAX NG Compact schema of GPX, it’s time to figure out how to get my data into PostGIS so I can do geospatial queries. I installed PostGIS on my Ubuntu box with these instructions. If I recall correctly, the latest version didn’t work but the previous release did.

    GPX makes a great GPS data interchange format, particularly because you can convert just about anything to GPX with GPSBabel. I’m not aware of anything that can convert straight from GPX to PostGIS, but it is a relatively straightforward two-part process.

    The first order of business is converting our GPX file to an ESRI Shapefile. The easiest way to do this is with gpx2shp, a utility written in C that is also available in Ubuntu. Once gpx2shp is installed, run it on your GPX file: gpx2shp filename.gpx.

    Once we have a shapefile, we can use a utility included with PostGIS to import it into Postgres. The program is called shp2pgsql, and can be used as such: shp2pgsql -a shapefile.shp tablename. I tend to prefer the -a option, which appends data to the table (as long as it’s the same basic type as the existing data). shp2pgsql generates SQL that you can then pipe into psql as such: shp2pgsql -a shapefile.shp tablename | psql databasename.

    Once the data is in Postgres (don’t forget to spatially enable your database), you can query it in ways that I’m still figuring out. A basic query like this will return a list of lat/lon pairs that corresponds to my GPS track: select asText(the_geom) from tablename;. I’m looking forward to wrapping my head around enough jargon to do bounding box selects, finding things nearby a specific point, etc.

    The examples in the PostGIS documentation seem pretty good; I just haven’t figured out how to apply them to the data I have. I feel like mastering GIS jargon is one of the biggest hurdles to understanding PostGIS better. Well, that and a master’s degree in GIS wouldn’t hurt.

  • All I want to do is convert my schema!

    I’m working on a Django app in which I want to store GPS track information in GPX format. The best way to store that in Django is with an XMLField. An XMLField is basically just a TextField with validation via a RELAX NG Compact schema.

    There is a schema for GPX. Great! The schema is an XSD, but that’s okay: it’s a schema for XML, so it should be pretty easy to just convert it to RELAX NG Compact, right?

    Wrong.

    I pulled out my handy dandy schema swiss army knife, Trang, but was shocked to find out that while it can handle RELAX NG (both verbose and compact), DTD, and an XML file as input and even XSD as an output, there was just no way that I was going to be able to coax it to read an XSD. Trang is one of those things (much like Jing) that I rely on pretty heavily but that hasn’t been updated in years. That scares me a bit, but I keep on using ’em.

    With Trang out of the picture, I struck out with various google searches (which doesn’t happen very often) before landing on the conversion section of the RELAX NG website. The first thing that struck my eye was the Sun RELAX NG Converter. Hey, Sun’s got it all figured out. I clicked the link and was somewhat confused when I ended up at their main XML page. I scanned around and even searched the site but was unable to find any useful mention of their converter. A quick google search for sun “relax ng converter” yielded nothing but people talking about how cool it was and a bunch of confused people (just like me) wondering where they could get it.

    At this point I was grasping at straws so I pulled up The Internet Archive version of the extinct Sun RELAX NG Converter page. That tipped me off to the fact that I really needed to start tracking down rngconv.jar. A google search turned up several Xdoclet and Maven cvs repositories. I grabbed a copy of the jar but it wouldn’t work without something called the Sun Multi-Schema XML Validator.

    That’s the phrase that pays, folks.

    A search for Sun “Multi-Schema XML Validator” brought me to the java.net project page and included a prominent link to nightly builds of the multi-schema validator as well as nightly builds of rngconv. These nightly builds are a few months old, but I’m not going to pick nits at this point.

    After downloading msv.zip and rngconv.zip and making sure all the jars were in the same directory I had the tools I needed to convert the XSD in hand to RELAX NG Compact. First I converted the XSD to RELAX NG Verbose with the following command: java -jar rngconv.jar gpx.xsd > gpxverbose.rng. That yielded the following RELAX NG (very) Verbose schema. Once I had that I could fall back to trusty Trang to do the rest: trang -I rng -O rnc gpxverbose.rng gpx.rng. It errored out on any(lax:##other) so I removed that bit and tried again. After a lot more work than should have been required, I had my RELAX NG Compact schema for GPX.

    My experience in finding the right tools to convert XSD to RELAX NG was so absurd that I had to write it up, if only to remind myself where to look when I need to do this again in two years.

  • Oh the CalDAV Possibilities

    While checking up on the Darwin Calendar Server wiki the other day I noticed something I had missed last week: CalDAVTester. It is an exhaustive suite of tests written in Python with XML config files to verify that a CalDAV server implementation is properly implementing the spec.  This suite of tests is going to prove very useful as more servers and clients implement the CalDAV spec.

    Right now the biggest problem with CalDAV is a lack of clients and servers.  That will change over the next 6-8 months as clients and servers are refined, released and rolled out.  Hopefully the CalConnect group and an exhaustive suite of tests will help keep interop a high priority.

  • Darwin Calendar Server

    As soon as Gruber pointed out Darwin Calendar Server I felt like I had to check it out. I’ve played with Darwin Streaming Server in the past and love me some WebKit. I was pleasantly surprised to find that Darwin Calendar Server runs on top of Python and Twisted.

    So away I went. I checked out the source and began to poke around. I managed to check out the source before the README was added so I did a fair amount of head scratching and wheel spinning, but it turns out that getting up and running is pretty easy: ./run -s

    That sets up the server, downloading and building some prereqs as it goes. I already had some prereqs installed system wide so I can’t guarantee that this works, but I’m pretty sure that it has worked for others. I should take a second to qualify that I’m running OS X 10.4 with Python 2.4 installed. From there I copied over the sample config file (cp ./conf/repository-static.xml ./conf/repository-dev.xml) and immediately started troubleshooting SSL errors. First I installed PyOpenSSL and created a self-signed certificate. That yielded a brand new error: OpenSSL.SSL.Error: [('PEM routines', 'PEM_read_bio', 'no start line'), ('SSL routines', 'SSL_CTX_use_PrivateKey_file', 'PEM lib')]

    After doing that and getting some guidance from the folks in #collaboration on freenode I decided to hack away at the plist and disable SSL for now (change SSLEnable to false instead of true). From there I could run the server (./run) and bring up a directory listing by pointing my browser at 127.0.0.1:8008.

    Darwin Calendar Server Chandler Setup

    From there I subscribed to the example payday calendar and the holiday calendar. It appears that iCal won’t do two-way CalDAV until Leopard, but in the meantime I was able to successfully set up and test Chandler.

    This is some absolutely amazing tech in its infancy. I can’t wait to see where this goes and I’m excited that it’s built with tools that I’m familiar with (Python, Twisted, SQLite, iCal). It seems to me like this open source app is but the tip of the iceberg of collaboration features that will be baked in to OS X 10.5 desktop and server.  I would also kill for a mobile device that spoke CalDAV natively so that I can replace my duct taped google calendar to iCal to iSync to 6682 workflow.

  • PyS60 1.3.8 Released

    Python for S60 version 1.3.8, released specifically for S60 3rd Edition is now available for download. See the release notes for more information. Special thanks to Jukka and everyone else for pushing this release out the door just before Finland shuts down for the summer.