Category: Python

  • Social Graph Analysis using Elastic MapReduce and PyPy

    A couple of weeks back I read two papers (Who Says What to Whom on Twitter and What is Twitter, a Social Network or a News Media?) that cite the dataset collected by the researchers behind the latter paper.

    This 5 gigabyte compressed (26 gigabyte uncompressed) dataset makes for a good excuse to use MapReduce and MrJob for processing. MrJob makes it easy to test MapReduce jobs locally as well as run them on a local Hadoop cluster or on Amazon’s Elastic MapReduce.

    Designing MapReduce Jobs

    I usually find myself going through the same basic steps when writing MapReduce jobs:

    1. Examine the input format and the data that you have to play with. This is sometimes explained in a metadata document, or you may have to use a utility such as head to look at the very beginning of a text file.
    2. Create a small amount of synthetic data for use while designing your job. It should be obvious whether the output of your job is correct based on this data. This data is also useful when writing unit tests.
    3. Write your job, using the synthetic data as test input.
    4. Create sample data based on your real dataset and continue testing your job with it. This can be done via reservoir sampling to create a more representative sample (see the sketch after this list), or it could be as simple as head -1000000 on a very large file.
    5. Run your job against the sample data and make sure the results look sane.
    6. Configure MrJob to run using Elastic MapReduce. An example configuration can be found in conf/mrjob-emr.conf, but you will need to update it with your credentials and S3 bucket information before it will work.
    7. Run your sample data using Elastic MapReduce and a small number of low-cost instances. It’s a lot cheaper to fix configuration problems when you’re only running two cheap instances.
    8. Once you’re comfortable with everything, run your job against the full dataset on Elastic MapReduce.
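
    For step 4, here is a minimal sketch of reservoir sampling (plain Algorithm R, not code from this project) that pulls a fixed-size uniform sample out of input too large to hold in memory:

    import random
    import sys

    def reservoir_sample(lines, k):
        """Return k lines chosen uniformly at random from an iterable of lines."""
        sample = []
        for i, line in enumerate(lines):
            if i < k:
                sample.append(line)
            else:
                # replace an existing element with probability k / (i + 1)
                j = random.randint(0, i)
                if j < k:
                    sample[j] = line
        return sample

    if __name__ == '__main__':
        # e.g. zcat the-full-dataset.gz | python sample.py > twitter_sample.txt
        for line in reservoir_sample(sys.stdin, 1000000):
            sys.stdout.write(line)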

    Analyzing the data

    This project contains two MapReduce jobs:

    jobs/follower_count.py
    A simple single-stage MapReduce job that reads the data in and sums the number of followers each user has.
    jobs/follower_histogram.py
    This is a two-phase MapReduce job that first sums the number of followers each user has, then for each follower count sums the number of users with that many followers. This is one of many interesting ways of looking at the raw data.
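
    For reference, a minimal MrJob job along the lines of the first of these (the follower count) might look like the sketch below. This is not the repository’s jobs/follower_count.py, just an illustration that assumes each input line is a tab-separated user/follower pair:

    from mrjob.job import MRJob

    class MRFollowerCount(MRJob):

        def mapper(self, _, line):
            # assumes each line looks like "user_id<TAB>follower_id"
            user_id, follower_id = line.split()
            yield user_id, 1

        def reducer(self, user_id, counts):
            yield user_id, sum(counts)

    if __name__ == '__main__':
        MRFollowerCount.run()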

    Running the jobs

    The following assumes you have a modern Python and have already installed MrJob (pip install MrJob or easy_install MrJob or install it from source).

    To run the sample data locally:

    $ python jobs/follower_count.py data/twitter_synthetic.txt
    

    This should print out a summary of how many followers each user (represented by id) has:

    5       2
    6       1
    7       3
    8       2
    9       1
    

    You can also run a larger sample (the first 10 million rows of the full dataset mentioned above) locally though it will likely take several minutes to process:

    $ python jobs/follower_count.py data/twitter_sample.txt.gz
    

    After editing conf/mrjob-emr.conf you can also run the sample on Elastic MapReduce:

    $ python jobs/follower_count.py -c conf/mrjob-emr.conf -r emr \
     -o s3://your-bucket/your-output-location --no-output data/twitter_sample.txt.gz
    

    You can also upload data to an S3 bucket and reference it that way:

    $ python jobs/follower_count.py -c conf/mrjob-emr.conf -r emr \
     -o s3://your-bucket/your-output-location --no-output s3://your-bucket/twitter_sample.txt.gz
    

    You may also download the full dataset and run either the follower count or the histogram job. The following general steps are required:

    1. Download the whole data file from Kwak, Haewoon and Lee, Changhyun and Park, Hosung and Moon, Sue via BitTorrent. I did this on a small EC2 instance in order to make uploading to S3 easier.
    2. To make processing faster, decompress it and split it into lots of smaller files (split -l 10000000, for example).
    3. Upload to an S3 bucket.
    4. Run the full job. Reading through all 1.47 billion relationships took a little over 15 minutes, and the whole job completed in just over an hour:
    $ python jobs/follower_histogram.py -c conf/mrjob-emr.conf -r emr \
     -o s3://your-bucket/your-output-location --no-output s3://your-split-input-bucket/
    

    Speeding things up with PyPy

    While there are lots of other things to explore in the data, I also wanted to be able to run PyPy on Elastic MapReduce. Through the use of bootstrap actions, we can prepare our environment to use PyPy and tell MrJob to execute jobs with PyPy instead of the system Python. The following needs to be added to your configuration file (the exact values vary between 32- and 64-bit):

    # Use PyPy instead of system Python
    bootstrap_scripts:
    - bootstrap-pypy-64bit.sh
    python_bin: /home/hadoop/bin/pypy
    

    This configuration change (available in conf/mrjob-emr-pypy-32bit.conf and conf/mrjob-emr-pypy-64bit.conf) also makes use of a custom bootstrap script found in conf/bootstrap-pypy-32bit.sh and conf/bootstrap-pypy-64bit.sh.

    A single run of “follower_histogram.py” with 8 “c1.xlarge” instances took approximately 66 minutes using Elastic MapReduce’s system Python. A single run with PyPy in the same configuration took approximately 44 minutes. While not a scientific comparison, that’s a pretty impressive speedup for such a simple task. PyPy should speed things up even more for more complex tasks.

    Thoughts on Elastic MapReduce

    It’s been great to be able to temporarily rent my own Hadoop cluster for short periods of time, but Elastic MapReduce definitely has some downsides. For starters, the standard way to read and persist data during jobs is via S3 instead of HDFS, which you would most likely be using if you were running your own Hadoop cluster. This means that you spend a lot of time (and money) transferring data between S3 and your nodes; you don’t get the data locality that a dedicated Hadoop cluster running HDFS would give you.

    All in all, though, it’s a great tool for the toolbox, particularly if you don’t need a full-time Hadoop cluster.

    Play along at home

    All of the source code and configuration mentioned in this post can be found at social-graph-analysis and is released under the BSD license.

  • Literate Diffing

    The other day I found myself wanting to add commentary to a diff. There are code review tools such as reviewboard and gerrit that make commenting on diffs pretty easy. Github allows you to comment on pull requests and individual commits.

    These are all fantastic tools for commenting on diffs, but I kind of wanted something different, something a little more self-contained. I wanted to write about the individual changes, what motivated them, and what the non-code implications of each change might be. At that point my mind wandered to the world of lightweight literate programming using tools like docco, rocco, and pycco.

    A literate diff might look something like this (using Python/Bash style single-line comments):

    # Extend Pygments' DiffLexer using a non-standard comment (#) for literate diffing using pycco.
    diff -r cfa0f44daad1 pygments/lexers/text.py
    --- a/pygments/lexers/text.py	Fri Apr 29 14:03:50 2011 +0200
    +++ b/pygments/lexers/text.py	Sat Apr 30 20:28:56 2011 -0500
    @@ -231,6 +231,7 @@
                 (r'@.*\n', Generic.Subheading),
                 (r'([Ii]ndex|diff).*\n', Generic.Heading),
                 (r'=.*\n', Generic.Heading),
    # Add non-standard diff comments.  This has to go above the Text capture below
    # in order to be active.
    +            (r'#.*\n', Comment),
                 (r'.*\n', Text),
             ]
         }

    It turns out that a literate diff is pretty easy to process with patch, but it comes with a catch: patch would blow up quite spectacularly if it encountered one of these comment lines, so they have to be stripped from a literate diff before it is passed to patch. This is easily done using awk:

    cat literate.diff | awk '!/^#/' | patch -p0

    If you’re using a DVCS, you’ll need -p1 instead.

    Since I’m using a non-standard extension to diffs, tools such as pygments won’t know to syntax highlight comments appropriately. If comments aren’t marked up correctly, pycco won’t be able to put them in the correct spot. This requires a patch to pygments and a patch to pycco. I’m kind of abusing diff syntax here and haven’t submitted these patches upstream, but you can download and apply them if you’d like to play along at home.

    I still think tools like github, reviewboard, and gerrit are much more powerful for commenting on diffs, but I was able to make pycco output literate diffs quickly enough that I thought I’d share the process. These tools are no excuse for clearly commenting changes and implications within the code itself, but I do like having a place to put underlying motivations. Here’s an example of a literate diff for one of my commits to phalanges, a finger daemon written in Scala. It’s still a pretty contrived example but is exactly what I was envisioning when my mind drifted from diffs to literate programming.

  • PyPy is Fast (And So Can You)

    I’ve known for some time that PyPy (Python implemented in a subset of the language called RPython) is fast. The PyPy speed charts show just how fast for a lot of benchmarks (and it’s a little slower in a few areas too).

    After seeing a lot of PyPy chatter while PyCon was going on, I thought I’d check it out. On OS X it’s as simple as brew install pypy. After that, just use pypy instead of python.

    The first thing I did was throw PyPy at a couple of Project Euler problems. They’re great because they’re computationally expensive and usually have lots of tight loops. For the ones I looked at, PyPy had a 50-75% speed improvement over CPython. David Ripton posted a more complete set of Euler solution runtimes using PyPy, Unladen Swallow, Jython, Psyco, and CPython. Almost all of the time, PyPy is faster, often significantly so. At this point it looks like the PyPy team is treating “slower than CPython” as a bug, or at the very least, something to improve.

    The latest stable release currently targets Python 2.5, but if you build the latest version from source it looks like they’re on their way to supporting Python 2.7:

    $ ./pypy-c 
    Python 2.7.0 (61fefec7abc6, Mar 18 2011, 06:59:57)
    [PyPy 1.5.0-alpha0] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    And now for something completely different: ``1.1 final released:
    http://codespeak.net/pypy/dist/pypy/doc/release-1.1.0.html''
    >>>> 

    There are a few things to look out for when using PyPy. The entire standard library isn’t built out, though the most commonly used modules are. PyPy supports ctypes and has experimental but incomplete support for the Python C API. PyPy is built out enough to support several large non-trivial projects such as Twisted (without SSL) and Django (with sqlite).

    PyPy is definitely one of many bright futures for Python, and it’s fast now. If you’ve been thinking about checking it out, perhaps now is the time to take it for a spin.

  • Installing PyLucene on OSX 10.5

    I was pleasantly surprised at my experience installing PyLucene this morning on my OSX 10.5 laptop. The installation instructions worked perfectly without a hiccup. This may not be impressive if you’ve never installed (or attempted to install) PyLucene before.

    I tried once a year or so back and was unsuccessful. The build process just never worked for me and I couldn’t find a binary build that fit my OS + Python version + Java version combination.

    Check out PyLucene:

    $ svn co http://svn.apache.org/repos/asf/lucene/pylucene/trunk pylucene
    

    Build JCC. I install Python packages in my home directory; if you do so too, you can omit sudo before the last command, otherwise leave it in:

    $ cd pylucene/jcc
    $ python setup.py build
    $ sudo python setup.py install
    

    Now we need to edit PyLucene’s Makefile to be configured for OSX and Python 2.5. If you use a different setup than the one that ships with OSX 10.5, you’ll have to adjust these parameters to match your setup.

    Edit the Makefile:

    $ cd ..
    $ nano Makefile
    

    Uncomment the 5 lines below the comment # Mac OS X (Python 2.5, Java 1.5). If you have installed a different version of Python, such as 2.6, there should be a combination that works for you. Here’s what I uncommented:

    # Mac OS X  (Python 2.5, Java 1.5)
    PREFIX_PYTHON=/usr
    ANT=ant
    PYTHON=$(PREFIX_PYTHON)/bin/python
    JCC=$(PYTHON) -m jcc --shared
    NUM_FILES=2
    

    Save the file, exit your editor, and build PyLucene:

    $ make
    

    If it doesn’t build properly, check the settings in your Makefile.

    After a successful build, install it (again you can omit sudo if you install Python packages locally and not system-wide):

    $ sudo make install
    

    Now verify that it’s been installed:

    $ python
    Python 2.5.1 (r251:54863, Nov 11 2008, 17:46:48)
    [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import lucene
    >>>
    

    If it imports without a problem you should have a working PyLucene library. Rejoice.
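
    One last hedged note: the JVM has to be started before any Lucene classes can actually be used, and depending on your PyLucene version the call is lucene.initVM(lucene.CLASSPATH) or simply lucene.initVM():

    >>> import lucene
    >>> vm = lucene.initVM(lucene.CLASSPATH)  # plain lucene.initVM() on newer releases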

  • Sphinx Search with PostgreSQL

    While I don’t plan on moving away from Apache Solr for my searching needs any time soon, Jeremy Zawodny’s post on Sphinx at craigslist made me want to take a closer look. Sphinx works with MySQL, PostgreSQL, and XML input as data sources, but MySQL seems to be the best documented. I’m a PostgreSQL guy, so I ran into a few hiccups along the way. These instructions, based on those from the Sphinx wiki, got me up and running on Ubuntu Server 8.10.

    Install build toolchain:

    $ sudo aptitude install build-essential checkinstall
    

    Install Postgres:

    $ sudo aptitude install postgresql postgresql-client \
    postgresql-client-common postgresql-contrib \
    postgresql-server-dev-8.3
    

    Get Sphinx source:

    $ wget http://www.sphinxsearch.com/downloads/sphinx-0.9.8.1.tar.gz
    $ tar xzvf sphinx-0.9.8.1.tar.gz
    $ cd sphinx-0.9.8.1
    

    Configure and make:

    $ ./configure --without-mysql --with-pgsql \
    --with-pgsql-includes=/usr/include/postgresql/ \
    --with-pgsql-lib=/usr/lib/postgresql/8.3/lib/
    $ make
    

    Run checkinstall:

    $ mkdir /usr/local/var
    $ sudo checkinstall
    

    Sphinx is now installed in /usr/local. Check out /usr/local/etc/ for configuration info.

    Create something to index:

    $ createdb -U postgres test
    $ psql -U postgres test
    test=# create table test (id serial primary key, text text);
    test=# insert into test (text) values ('Hello, World!');
    test=# insert into test (text) values ('This is a test.');
    test=# insert into test (text) values ('I have another thing to test.');
    test=# -- A user with a password is required.
    test=# create user foo with password 'bar';
    test=# alter table test owner to foo;
    test=# \q
    

    Configure sphinx (replace nano with your editor of choice):

    $ cd /usr/local/etc
    $ sudo cp sphinx-min.conf.dist sphinx.conf
    $ sudo nano sphinx.conf
    

    These values worked for me. I left configuration for indexer and searchd unchanged:

    source src1
    {
      type = pgsql
      sql_host = localhost
      sql_user = foo
      sql_pass = bar
      sql_db = test
      sql_port = 5432
      sql_query = select id, text from test
      sql_query_info = SELECT * from test WHERE id=$id
    }
    
    index test1
    {
      source = src1
      path = /var/data/test1
      docinfo = extern
      charset_type = utf-8
    }
    

    Reindex:

    $ sudo mkdir /var/data
    $ sudo indexer --all
    

    Run searchd:

    $ sudo searchd
    

    Play:

    $ search world
    
    Sphinx 0.9.8.1-release (r1533)
    Copyright (c) 2001-2008, Andrew Aksyonoff
    
    using config file '/usr/local/etc/sphinx.conf'...
    index 'test1': query 'world ': returned 1 matches of 1 total in 0.000 sec
    
    displaying matches:
    1. document=1, weight=1
    
    words:
    1. 'world': 1 documents, 1 hits
    

    Use Python:

    $ cd sphinx-0.9.8.1/api
    $ python
    >>> import sphinxapi, pprint
    >>> c = sphinxapi.SphinxClient()
    >>> q = c.Query('world')
    >>> pprint.pprint(q)
    {'attrs': [],
     'error': '',
     'fields': ['text'],
     'matches': [{'attrs': {}, 'id': 1, 'weight': 1}],
     'status': 0,
     'time': '0.000',
     'total': 1,
     'total_found': 1,
     'warning': '',
     'words': [{'docs': 1, 'hits': 1, 'word': 'world'}]}
    

    If you add new data and want to reindex, make sure you use the --rotate flag:

    $ sudo indexer --rotate --all
    

    This is an extremely quick and dirty installation designed to give me a sandbox to play with. For production use you would want to run as a non-privileged user and would probably want an /etc/init.d script for searchd or run it behind a process supervisor. If you’re looking to experiment with Sphinx and MySQL, there should be plenty of documentation out there to get you started.

  • Kansas Primary 2008 recap

    I’m winding down after a couple of very long days preparing for our coverage of the 2008 Kansas (and local) primaries. As always it’s been an exhausting but rewarding time. We’ve come a long way since the first election I wrote software for and was involved with back in 2006 (when election night involved someone accessing an AS/400 terminal and shouting numbers at me for entry). Our election app has become a lot more sophisticated, our data import process more refined, and election night is a whole lot more fun and loads less stressful than it used to be. I thought I’d go over some of the highlights while they’re still fresh in my mind.

    Douglas County Commission 2nd District Democratic primary section

    Our election app is definitely a success story for both the benefits of structured data and incremental development. Each time the app gets a little more sophisticated and a little smarter. What once wasn’t used until the night of the election has become a key part of our election coverage both before and after the event. For example, this year we had an overarching election section and also sections for individual races, like this section for the Douglas County Commission 2nd district Democratic primary. These sections tie together our coverage of the individual races: Stories, photos and videos about the race, our candidate profiles, any chats we’ve had with the candidates, campaign finance documents, and candidate selectors, an awesome app that has been around longer than I have that lets users see which candidates they most agree with. On election night they’re smart enough to display results as they come in.

    Photos: Election results start coming in; Results rolling in; County commission races almost done

    This time around, the newsroom also used our tools to swap out which races were displayed on the homepage throughout the night. We led the night with results from Leavenworth County, since they were the first to report. The newsroom spent the rest of the night swapping in one or more races on the homepage as they saw fit. This was a huge improvement over past elections, where we chose ahead of time which races would be featured on the homepage. It was great to see the newsroom exercise editorial control throughout the night without having to edit templates.

    More results

    On the television side, 6 News Lawrence took advantage of some new hardware and software to display election results prominently throughout the night. I kept catching screenshots during commercial breaks, but the name of the race appeared on the left hand side of the screen with results paging through on the bottom of the screen. The new hardware and software allowed them to use more screen real estate to provide better information to our viewers. In years past we’ve had to jump through some hoops to get election results on the air, but this time was much easier. We created a custom XML feed of election data that their new hardware/software ingested continuously and pulled results from. As soon as results were in our database they were on the air.

    The way that election results make their way in to our database has also changed for the better over the past few years. We have developed a great relationship with the Douglas County Clerk, Jamie Shew and his awesome staff. For several elections now they have provided us with timely access to detailed election results that allow us to provide precinct-by-precinct results. It’s also great to be able to compare local results with statewide results in state races. We get the data in a structured and well-documented fixed-width format and import it using a custom parser we wrote several elections ago.

    State results flow in via a short script that uses BeautifulSoup to parse and import data from the Kansas Secretary of State site. That script ran every few minutes throughout the night and was updating results well after I went to bed. In fact it’s running right now while we wait for the last few precincts in Hodgeman County to come in. This time around we did enter results from a few races in Leavenworth and Jefferson counties by hand, but we’ll look to automate that in November.
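
    A scraper along those lines might look something like the sketch below. This is a hedged illustration rather than our actual import script; the URL, table layout, and column order are all assumptions:

    import urllib2
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, current at the time

    def scrape_results(url):
        """Pull (race, candidate, votes) rows out of an HTML results table."""
        soup = BeautifulSoup(urllib2.urlopen(url).read())
        results = []
        for row in soup.find('table').findAll('tr')[1:]:  # skip the header row
            cells = [''.join(cell.findAll(text=True)).strip()
                     for cell in row.findAll('td')]
            if len(cells) >= 3:
                race, candidate, votes = cells[:3]
                results.append((race, candidate, int(votes.replace(',', ''))))
        return results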

    As always, election night coverage was a team effort. I’m honored to have played my part as programmer and import guru. As always, it was great to watch Christian Metts take the data and make it both beautiful and meaningful in such a short amount of time. Many thanks go out to the fine folks at Douglas County and all of the reporters, editors, and technical folk that made our coverage last night possible.

  • DjangoCon!

    I’m a little late to the announcement party, but I’ll be attending DjangoCon and sitting on a panel about Django in Journalism with Maura Chace and Matt Waite. The panel will be moderated by our own Adrian Holovaty.

    I think the panel will be pretty fantastic but I can’t help but be just as terrified as my fellow panelists. I love that we’ll have both Journalist-programmers and Programmer-journalists on the panel, and I love that Django is so often the glue that brings the two together.

    DjangoCon is going to be awesome.

  • Python for S60: back in the saddle

    I had the opportunity to meet Jürgen Scheible and Ville Tuulos, authors of the Mobile Python book, at PyCon a few weeks ago. They graciously gave me a copy of their book, which is an absolutely fantastic guide to writing S60 apps in Python. It seems like every time I look away from Python for S60 it gets better, and this time was no exception. Everything is just a little more polished, a few more APIs are supported (yay sensor API!), and the community and learning materials available have grown tremendously.

    While I didn’t get a chance to hang out too long during the sprints, I did pull together some code for a concept I’ve wanted to do for a long time: a limpet webcam that I can stick on something and watch it ride around the city. Specifically I thought it would be cool to attach one to a city bus and upload pictures while tracing its movements.

    So here’s my quick 19 line prototype that simply takes a picture using the camera API and uploads the saved photo using ftplib copied over from the Python 2.2.2 standard library. It’s called webcam.py. I haven’t run it since PyCon, so the most recent photo is from the PyS60 intro session.
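
    The gist of it, in hedged form (this is not webcam.py itself, and the save path, FTP host, and credentials below are placeholders):

    import camera
    import ftplib

    PHOTO_PATH = u'e:\\Images\\limpet.jpg'  # placeholder save location

    # snap a photo with the PyS60 camera API and save it to disk
    shot = camera.take_photo()
    shot.save(PHOTO_PATH)

    # upload the saved photo over FTP
    ftp = ftplib.FTP('ftp.example.com')  # placeholder host
    ftp.login('user', 'secret')          # placeholder credentials
    photo = open(PHOTO_PATH, 'rb')
    ftp.storbinary('STOR limpet.jpg', photo)
    photo.close()
    ftp.quit()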

    Working with PyS60 again was absolutely refreshing. I write Python code (using Django) at work but writing code for a mobile device again got the creative juices flowing. I’m trying to do more with less in my spare time, but I definitely need to make more time for PyS60 in my life.

  • PyCon 2008

    I’m headed out the door to PyCon 2008. Yay!

  • Covering Kansas Democratic Caucus Results

    I think we’re about ready for caucus results to start coming in.

    We’re covering the Caucus results at LJWorld.com and on Twitter.

    Turnout is extremely heavy. So much so that they had to split one of the caucus sites in two because the venue was full.

    Later…

    How did we do it?

    We gained access to the media results page from the Kansas Democratic Party on Friday afternoon. On Sunday night I started writing a scraper/importer using BeautifulSoup and roughing out the Django models to represent the caucus data. I spent Monday refining the models, helper functions, and front-end hooks that our designers would need to visualize the data. Monday night and into Tuesday morning was spent finishing off the importer script, exploring Google Charts, and making sure that Ben and Christian had everything they needed.

    After a few hours of sleep, most of the morning was spent testing everything out on our staging server, fixing bugs, and improving performance. By early afternoon Ben was wrapping up KTKA and Christian was still tweaking his design in Photoshop. Somewhere between 1 and 2 p.m. he started coding it up and pretty soon we had our results page running on test data on the staging server.

    While the designers were finishing up I turned my focus to the planned Twitter feed. Thanks to some handy wrappers from James, I wrote a quick script that generated a short message based on the caucus results we had, compared it to the last version of the message, and sent a post to Twitter if the message had changed.
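
    The logic was essentially what’s sketched below; the message format, cache file, and post_update() helper here are made up for illustration, with the real script leaning on James’ wrappers for the actual Twitter call:

    import os

    STATE_FILE = 'last_tweet.txt'  # hypothetical cache of the last posted message

    def build_message(results):
        # results is assumed to look like {'Obama': 17000, 'Clinton': 9000}
        standings = ', '.join('%s %s' % (name, count)
                              for name, count in sorted(results.items()))
        return 'Kansas caucus results so far: %s' % standings

    def post_if_changed(results, post_update):
        message = build_message(results)
        last = open(STATE_FILE).read() if os.path.exists(STATE_FILE) else ''
        if message != last:
            post_update(message)  # a thin wrapper around the Twitter API
            open(STATE_FILE, 'w').write(message)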

    Once results started coming in, we activated our coverage. After fixing one quick bug, I’ve been spending most of the evening watching importers feed data in to our databases and watching the twitter script send out updates. Because we’ve been scraping the Kansas Democratic Party media results all night and showing them immediately, we’ve been picking up caucuses seconds after they’ve been reported and have been ahead of everything else I’ve looked at.

    Because we just recently finished moving our various Kansas weekly papers to Ellington and a unified set of templates, it was quite trivial to include detailed election results on the websites for The Lansing Current, Baldwin City Signal, Basehor Sentinel, The Chieftain, The De Soto Explorer, The Eudora News, Shawnee Dispatch, and The Tonganoxie Mirror.

    While there are definitely things we could have done better as a news organization (there always are), I’m quite pleased at what we’ve done tonight. Our servers hummed along quite nicely all night, we got information to our audience as quickly as possible, and generally things went quite smoothly. Many thanks to everyone involved.

  • We’re hiring!

    Wow, the Django job market is heating up. I posted a job opening for both junior and senior-level Django developers on djangogigs just a few days ago, and it has already fallen off the front page.

    So I’ll mention it again: We’re hiring! We’re growing and we have several positions open at both the junior and senior level. We’d love to talk to you if you’ve been working with Django since back in the day when everything was a tuple. We’d love to talk to you if you’re smart and talented but don’t have a lot of (or any) Django experience.

    Definitely check out the listing at djangogigs for more, or feel free to drop me a line if you’d like to know more.

  • 2008 Digital Edge Award Finalists

    The 2008 Digital Edge Award finalists were just announced, and I’m excited to see several World Company sites and projects on there as well as a couple of sites running Ellington and even the absolutely awesome Django-powered PolitiFact.com.

    At work we don’t do what we do for awards. We do it to serve our readers, tell a story, get information out there, and do it as best we can. At the same time even being nominated as finalists is quite an honor, and evokes warm fuzzy feelings in this programmer.

    Here are the various World Company projects and sites that were nominated (in the less than 75,000 circulation category):

    Not too shabby for a little media company in Kansas. I’m particularly excited about the LJWorld.com nominations since it hasn’t been too long since we re-designed and re-launched the site with a lot of new functionality. Scanning the finalists I also see a couple of other sites running Ellington as well as several special projects by those sites.

    As someone who writes software for news organizations for a living I’m definitely going to take some time this morning to take a look at the other finalists. I’m particularly excited to check out projects from names that I’m not familiar with.

  • Reason 3,287 why I hate setuptools

    root@monkey:~/inst/simplejson# python setup.py install
    The required version of setuptools (>=0.6c6) is not available, and
    can't be installed while this script is running. Please install
     a more recent version first.
    
    (Currently using setuptools 0.6c3 
    (/usr/lib/python2.4/site-packages/setuptools-0.6c3-py2.4.egg))

  • Darwin Calendar Server Status Update

    Wilfredo Sánchez Vega posted a status update on Darwin Calendar Server to the calendarserver-dev mailing list this afternoon. Check the full post for details, but here’s the takeaway:

    We think the server’s in pretty solid shape right now. Certainly there are still some bugs that have to be fixed before we can roll up a “1.0” release, but the core functionality is all in place now, and we think it’s fairly useable in its current state.

    There is also a preview release in the works, so keep a close eye on the mailing list and the Darwin Calendar Server site.

  • Properly serving a 404 with lighttpd’s server.error-handler-404

    The other day I was looking in to why Django‘s verify_exists validator wasn’t working on a URLField. It turns out that the 404 handler that we were using to generate thumbnails with lighttpd on the media server was serving up 404 error pages with HTTP/1.1 200 OK as the status code. Django’s validator was seeing the 200 OK and (rightfully) not raising a validation error, even though there was nothing at that location.

    It took more than the usual amount of digging to find the solution to this, so I’m making sure to write about it so that I can google for it again when it happens to me next year. The server.error-handler-404 documentation mentions that to send a 404 response you will need to use server.errorfile-prefix. That doesn’t help me a lot since I needed to retain the dynamic 404 handler.

    Amusingly enough, the solution is quite clear once you dig into the source code. I dug through connections.c and found that if I sent back a Status: line, it would be forwarded on by lighttpd to the client (exactly what I wanted!).

    Here’s the solution (in Python) but it should translate to any language that you’re using as a 404 handler:

    print "Status: 404"
    print "Content-type: text/html"
    print  # a blank line ends the headers
    print  # (X)HTML error response goes here
    

    I hope this helps, as I’m sure it will help me again some time in the future.

  • Nokia N800 and camera.py

    Nokia N800 and camera.py

    I received my Nokia N800 yesterday and have been enjoying how zippy it is compared to the 770 (which has been getting faster with every firmware upgrade). I got a chance to play with the browser while waiting for my wife at the airport and have been poking around to see how Bora is different than Mistral and Scirocco.

    One of the bigger physical differences between the 770 and the N800 is the onboard camera. I haven’t set up Nokia Internet Call Invitation yet but I looked around for some camera code samples and was pleasantly surprised. Luckily Nokia is one step ahead of me with camera.py, an example program to show what the camera sees on the screen. It looks like some bindings are missing so saving an image from the camera is off limits at the moment but this is a great start.

    To run the above example on your N800, grab camera.py, install Python and osso-xterm, and run camera.py from the console.

    It’s time to dust off the Python Maemo tutorial and get my feet wet.

    Update: I’ve also been playing with the C version of the camera example code and have managed to get it running and taking pictures after building it in scratchbox and running it with run-standalone.sh ./camera.

  • Mapping Every airport and helipad in America

    All the airports

    After stumbling upon Transtats again today I took my semi-annual visit to the FAA data and statistics page to see if there was anything new to play with. The unruly passenger count still looks like it’s down for 2006 but I was really interested in playing with the airport data that I’ve seen before.

    After a little help from Python’s CSV module and some helper functions from geopy, I whipped up a 4 meg KML file for use with Google Earth or anything else that can import KML. Be warned though that the file contains some 20,000 airports, helipads, and patches of dirt, which can lead to some rendering bugs. If you’re interested, here’s the code that generated the KML.
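
    For the curious, a minimal sketch of the CSV-to-KML step (not the linked code) might look like this; the column names and filenames are assumptions, and coordinates are assumed to already be decimal degrees, which is where geopy’s helpers came in:

    import csv

    outfile = open('airports.kml', 'w')
    outfile.write('<?xml version="1.0" encoding="UTF-8"?>\n'
                  '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>\n')
    for row in csv.DictReader(open('airports.csv')):
        # KML wants longitude before latitude
        outfile.write('<Placemark><name>%s</name>'
                      '<Point><coordinates>%s,%s</coordinates></Point></Placemark>\n'
                      % (row['FacilityName'], row['Longitude'], row['Latitude']))
    outfile.write('</Document></kml>\n')
    outfile.close()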

  • Arduino serial communication with Python

    The Arduino is here!

    I got my shiny Arduino yesterday. The first order of business (after the obligatory “Hello World” led_blink sketch) was interfacing Arduino with my language of choice, Python.

    Googling around for python serial led me to pySerial, a cross-platform serial library. I was actually quite surprised that such a wrapper didn’t exist in the Python Standard Library. Nevertheless, I plodded on.

    The first task was symlinking the default device for the Arduino serial drivers on my Mac (for sanity): sudo ln -s /dev/tty.usbserial-LOTSOFCHARSANDNUMBERS /dev/tty.usbserial. From there I fired up the Python shell and ran the serial hello world sketch on my Arduino:

    >>> import serial
    >>> ser = serial.Serial('/dev/tty.usbserial', 9600)
    >>> while 1:
    ...     ser.readline()
    '1 Hello world!\r\n'
    '2 Hello world!\r\n'
    '3 Hello world!\r\n'

    Writing from Python to Arduino is simple too. Load serial_read_blink and do the following from Python:

    >>> import serial
    >>> ser = serial.Serial('/dev/tty.usbserial', 9600)  
    >>> ser.write('5')

    Hooray, it worked! Communicating with the Arduino over serial with Python (just like every other language) is a pretty trivial process.

  • All I want to do is convert my schema!

    I’m working on a Django app in which I want to store GPS track information in GPX format. The best way to store that in Django is with an XMLField. An XMLField is basically just a TextField with validation via a RELAX NG Compact schema.
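
    For context, the model I have in mind looks roughly like the sketch below (written against the old Django XMLField; the model and field names are made up):

    from django.db import models

    class Track(models.Model):
        name = models.CharField(max_length=100)
        # validates the stored GPX document against a RELAX NG Compact schema
        gpx = models.XMLField(schema_path='schemas/gpx.rnc')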

    There is a schema for GPX. Great! The schema is an XSD, but that’s okay: it’s a schema for XML, so it should be pretty easy to just convert it to RELAX NG Compact, right?

    Wrong.

    I pulled out my handy dandy schema swiss army knife, Trang, but was shocked to find that while it can handle RELAX NG (both verbose and compact), DTD, and XML files as input, and even XSD as output, there was just no way I was going to be able to coax it into reading an XSD. Trang is one of those tools (much like Jing) that I rely on pretty heavily but that hasn’t been updated in years. That scares me a bit, but I keep on using ’em.

    With Trang out of the picture, I struck out with various google searches (which doesn’t happen very often) and eventually landed on the conversion section of the RELAX NG website. The first thing that caught my eye was the Sun RELAX NG Converter. Hey, Sun’s got it all figured out. I clicked the link and was somewhat confused when I ended up at their main XML page. I scanned around and even searched the site but was unable to find any useful mention of their converter. A quick google search for sun “relax ng converter” yielded nothing but people talking about how cool it was and a bunch of confused people (just like me) wondering where they could get it.

    At this point I was grasping at straws so I pulled up The Internet Archive version of the extinct Sun RELAX NG Converter page. That tipped me off to the fact that I really needed to start tracking down rngconv.jar. A google search turned up several Xdoclet and Maven cvs repositories. I grabbed a copy of the jar but it wouldn’t work without something called the Sun Multi-Schema XML Validator.

    That’s the phrase that pays, folks.

    A search for Sun “Multi-Schema XML Validator” brought me to the java.net project page and included a prominent link to nightly builds of the multi-schema validator as well as nightly builds of rngconv. These nightly builds are a few months old, but I’m not going to pick nits at this point.

    After downloading msv.zip and rngconv.zip and making sure all the jars were in the same directory I had the tools I needed to convert the XSD in hand to RELAX NG Compact. First I converted the XSD to RELAX NG Verbose with the following command: java -jar rngconv.jar gpx.xsd > gpxverbose.rng. That yielded the following RELAX NG (very) Verbose schema. Once I had that I could fall back to trusty Trang to do the rest: trang -I rng -O rnc gpxverbose.rng gpx.rng. It errored out on any(lax:##other) so I removed that bit and tried again. After a lot more work than should have been required, I had my RELAX NG Compact schema for GPX.

    My experience in finding the right tools to convert XSD to RELAX NG was so absurd that I had to write it up, if only to remind myself where to look when I need to do this again in two years.

  • Pardon the Dust

    Sorry about the short outage there. I finally consolidated the various co-location, shared hosting, and virtual private hosting services that I was consuming every month into one VPS account. I still have some legacy URLs to do some rewrite magic for, but the archives back to 2002 are here.

    Because my new box is very Django-oriented, I am now running WordPress via PHP5 (FastCGI) and MySQL5 on lighttpd behind perlbal.

    One of the things I really enjoyed about the move from WordPress on Apache with a really gnarly .htaccess file for URL rewriting to lighttpd was the simplicity of it all. Getting WordPress to “just work” for me on lighttpd was as simple as adding a 404 handler for the site:

    server.error-handler-404 = "/index.php?error=404"

    Everything should smooth out shortly, and of course the eventual goal is to move this blog over to Django trunk. I did just that a few months ago, but I need to revisit the code, find the importer, and give it a lot of layout love.