
Social Graph Analysis using Elastic MapReduce and PyPy

Posted: May 4th, 2011 | Author: | Filed under: Open Source, Projects, Python | 6 Comments »

A couple of weeks back I read a couple of papers (Who Says What to Whom on Twitter and What is Twitter, a Social Network or a News Media?) that cited data collected by researchers for the latter paper.

This 5 gigabyte compressed (26 gigabyte uncompressed) dataset makes for a good excuse to use MapReduce and MrJob for processing. MrJob makes it easy to test MapReduce jobs locally as well as run them on a local Hadoop cluster or on Amazon’s Elastic MapReduce.

Designing MapReduce Jobs

I usually find myself going through the same basic steps when writing MapReduce jobs:

  1. Examine the data input format and the data that you have to play with. This is sometimes explained in a metadata document or you may have to use a utility such as head if you’re trying to look at the very beginning of a text file.
  2. Create a small amount of synthetic data for use while designing your job. It should be obvious to determine if the output of your job is correct or not based on this data. This data is also useful when writing unit tests.
  3. Write your job, using synthetic data as test input.
  4. Create sample data based on your real dataset and continue testing your job with that data. This can be done via reservoir sampling to create a more representative sample or it could be as simple as head -1000000 on a very large file.
  5. Run your job against the sample data and make sure the results look sane.
  6. Configure MrJob to run using Elastic MapReduce. An example configuration can be found in conf/mrjob-emr.conf but will require you to update it with your credentials and S3 bucket information before it will work.
  7. Run your sample data using Elastic MapReduce and a small number of low-cost instances. It’s a lot cheaper to fix configuration problems when you’re just running two cheap instances.
  8. Once you’re comfortable with everything, run your job against the full dataset on Elastic MapReduce.
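Step 4 mentions reservoir sampling, so here is a minimal sketch of it in Python. This is only an illustration of the general technique (often called Algorithm R), not code from the project; the function name and the optional rng argument are my own assumptions.

```python
import random


def reservoir_sample(lines, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable.

    The iterable is read once and its length does not need to be known,
    which is what makes this handy for very large input files.
    """
    if rng is None:
        rng = random.Random()
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = line
    return sample
```

Passing an open file object as lines would sample whole lines from it; passing a seeded random.Random makes the sample repeatable, which is useful for tests.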

Analyzing the data

This project contains two MapReduce jobs:

jobs/follower_count.py
A simple single-stage MapReduce job that reads the data in and sums the number of followers each user has.
jobs/follower_histogram.py
This is a two-phase MapReduce job that first sums the number of followers each user has, then, for each follower count, sums the number of users that have that number of followers. This is one of many interesting ways of looking at this raw data.
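To make the two phases concrete, here is a plain-Python sketch of the same logic with no MapReduce framework involved. It is not the code in jobs/; it assumes each input line holds a user id and a follower id separated by whitespace, and the function names are hypothetical.

```python
from collections import Counter


def count_followers(lines):
    """Phase one: count how many followers each user id has."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed line format: "<user_id> <follower_id>"
        user_id, _follower_id = line.split()
        counts[user_id] += 1
    return counts


def follower_histogram(counts):
    """Phase two: for each follower count, count the users that have it."""
    return Counter(counts.values())
```

In a real MapReduce job the first function corresponds to a mapper emitting (user_id, 1) and a reducer summing the values, and the second to a further step keyed on the follower count.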

Running the jobs

The following assumes you have a modern Python and have already installed MrJob (pip install MrJob or easy_install MrJob or install it from source).

To run the sample data locally:

$ python jobs/follower_count.py data/twitter_synthetic.txt

This should print out a summary of how many followers each user (represented by id) has:

5       2
6       1
7       3
8       2
9       1

You can also run a larger sample (the first 10 million rows of the full dataset mentioned above) locally though it will likely take several minutes to process:

$ python jobs/follower_count.py data/twitter_sample.txt.gz

After editing conf/mrjob-emr.conf you can also run the sample on Elastic MapReduce:

$ python jobs/follower_count.py -c conf/mrjob-emr.conf -r emr \
 -o s3://your-bucket/your-output-location --no-output data/twitter_sample.txt.gz

You can also upload data to an S3 bucket and reference it that way:

$ python jobs/follower_count.py -c conf/mrjob-emr.conf -r emr \
 -o s3://your-bucket/your-output-location --no-output s3://your-bucket/twitter_sample.txt.gz

You may also download the full dataset and run either the follower count or the histogram job. The following general steps are required:

  1. Download the whole data file from Kwak, Haewoon and Lee, Changhyun and Park, Hosung and Moon, Sue via bittorrent. I did this on a small EC2 instance in order to make uploading to S3 easier.
  2. To make processing faster, decompress it and split it into lots of smaller files (split -l 10000000, for example).
  3. Upload to an S3 bucket.
  4. Run the full job (it took a little over 15 minutes to read through 1.47 billion relationships and took just over an hour to complete):

$ python jobs/follower_histogram.py -c conf/mrjob-emr.conf -r emr \
 -o s3://your-bucket/your-output-location --no-output s3://your-split-input-bucket/

Speeding things up with PyPy

While there are lots of other things to explore in the data, I also wanted to be able to run PyPy on Elastic MapReduce. Through the use of bootstrap actions, we can prepare our environment to use PyPy and tell MrJob to execute jobs with PyPy instead of system Python. The following need to be added to your configuration file (and vary between 32 and 64 bit):

# Use PyPy instead of system Python
bootstrap_scripts:
- bootstrap-pypy-64bit.sh
python_bin: /home/hadoop/bin/pypy

This configuration change (available in conf/mrjob-emr-pypy-32bit.conf and conf/mrjob-emr-pypy-64bit.conf) also makes use of a custom bootstrap script (found in conf/bootstrap-pypy-32bit.sh and conf/bootstrap-pypy-64bit.sh).

A single run of follower_histogram.py with 8 c1.xlarge instances took approximately 66 minutes using Elastic MapReduce’s system Python. A single run with PyPy in the same configuration took approximately 44 minutes, about a third less wall-clock time. While not a scientific comparison, that’s a pretty impressive speedup for such a simple task. PyPy should speed things up even more for more complex tasks.

Thoughts on Elastic MapReduce

It’s been great to be able to temporarily rent my own Hadoop cluster for short periods of time, but Elastic MapReduce definitely has some downsides. For starters, the standard way to read and persist data during jobs is via S3 instead of HDFS, which you would most likely be using if you were running your own Hadoop cluster. This means that you spend a lot of time (and money) transferring data between S3 and nodes. You’re not bringing the computation to the data the way a dedicated Hadoop cluster running HDFS would.

All in all though it’s a great tool for the toolbox, particularly if you don’t have the need for a full-time Hadoop cluster.

Play along at home

All of the source code and configuration mentioned in this post can be found at social-graph-analysis and is released under the BSD license.


Covering Kansas Democratic Caucus Results

Posted: February 5th, 2008 | Author: | Filed under: Django, Journalism, Projects | 1 Comment »

I think we’re about ready for caucus results to start coming in.

We’re covering the Caucus results at LJWorld.com and on Twitter.

Turnout is extremely heavy. So much so that they had to split one of the caucus sites in two because the venue was full.

Later…

How did we do it?

We gained access to the media results page from the Kansas Democratic Party on Friday afternoon. On Sunday night I started writing a scraper/importer using BeautifulSoup and roughing out the Django models to represent the caucus data. I spent Monday refining the models, helper functions, and front-end hooks that our designers would need to visualize the data. Monday night and into Tuesday morning was spent finishing off the importer script, exploring Google Charts, and making sure that Ben and Christian had everything they needed.

After a few hours of sleep, most of the morning was spent testing everything out on our staging server, fixing bugs, and improving performance. By early afternoon Ben was wrapping up KTKA and Christian was still tweaking his design in Photoshop. Somewhere between 1 and 2 p.m. he started coding it up and pretty soon we had our results page running on test data on the staging server.

While the designers were finishing up I turned my focus to the planned Twitter feed. Thanks to some handy wrappers from James, I wrote a quick script that generated a short message based on the caucus results we had, compared it to the last version of the message, and sent a post to Twitter if the message had changed.
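The "only post when the message changes" idea is simple enough to sketch. This is not the script we ran; the function names, the message format, and the shape of the results dictionary are all hypothetical, and post stands in for whatever callable actually sends the update.

```python
def summarize(results):
    """Build a short status line from a {name: count} dict (assumed shape)."""
    # Highest count first; ties broken alphabetically so output is stable.
    ranked = sorted(results.items(), key=lambda item: (-item[1], item[0]))
    return "Caucus results: " + ", ".join(
        "{} {}".format(name, count) for name, count in ranked
    )


def maybe_post(results, last_message, post):
    """Post a new summary only if it differs from the last one sent.

    Returns the current message so the caller can keep it for next time.
    """
    message = summarize(results)
    if message != last_message:
        post(message)
    return message
```

Called in a loop after each import, with the return value fed back in as last_message, this stays quiet until the numbers actually move.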

Once results started coming in, we activated our coverage. After fixing one quick bug, I’ve been spending most of the evening watching importers feed data in to our databases and watching the twitter script send out updates. Because we’ve been scraping the Kansas Democratic Party media results all night and showing them immediately, we’ve been picking up caucuses seconds after they’ve been reported and have been ahead of everything else I’ve looked at.

Because we just recently finished moving our various Kansas Weekly papers to Ellington and a unified set of templates, it was quite trivial to include detailed election results on the websites for The Lansing Current, Baldwin City Signal, Basehor Sentinel, The Chieftain, The De Soto Explorer, The Eudora News, Shawnee Dispatch, and The Tonganoxie Mirror.

While there are definitely things we could have done better as a news organization (there always are), I’m quite pleased at what we’ve done tonight. Our servers hummed along quite nicely all night, we got information to our audience as quickly as possible, and generally things went quite smoothly. Many thanks to everyone involved.


Google apps for your newsroom

Posted: January 7th, 2008 | Author: | Filed under: Journalism, Projects | 14 Comments »

I like to think that I’m pretty good at recognizing trends. One thing that I’ve been seeing a lot recently in my interactions with the newsroom is that we’re no longer exchanging Excel spreadsheets, Word files, and other binary blobs via email. Instead we’re sending invites to spreadsheets and documents on Google Docs, links to data visualization sites like Swivel and ManyEyes, and links to maps created with Google MyMaps.

Using these lightweight webapps has definitely increased productivity on several fronts. As much as we would love every FOIA request and data source to come in a digital format, we constantly see data projects start with a big old stack of paper. Google spreadsheets has allowed us to parallelize and coordinate data entry in a way that just wasn’t possible before. We can create multiple spreadsheets and have multiple web producers enter data in their copious spare time. I did some initial late night data entry for the KU flight project (Jacob and Christian rocked the data visualization house on that one), but we were able to take advantage of web producers to enter the vast majority of the data.

Sometimes the data entry is manageable enough (or the timeline is tight enough) that the reporter or programmer can handle it on their own. In this case, it allows us to quickly turn spreadsheet-style data entry into CSV, our lingua franca for data exchange. Once we have the data in CSV form we can visualize it with Swivel or play with it in ManyEyes. If all we’re looking for is a tabular listing of the data, we’ve written some tools that make that easy and look good too. On larger projects, CSV is often the first step to importing the data and mapping it to Django objects for further visualization.

Awesome webapps that increase productivity aren’t limited to things that resemble spreadsheets from a distance. A few weeks back we had a reporter use Google’s awesome MyMaps interface to create a map of places to enjoy and avoid while traveling from Lawrence, KS to Miami, FL for the Orange Bowl. We pasted the KML link into our Ellington map admin and instantly had an interactive map on our site. A little custom template work completed the project quite quickly.

It all boils down to apps that facilitate collaboration, increase productivity, and foster data flow. Sometimes the best app for the job sits on the desktop (or laptop). Increasingly, I’ve found that those apps live online, accessible anywhere, anytime.


Google Maps adds clickability to GPolyline and GPolygon

Posted: September 8th, 2007 | Author: | Filed under: Journalism, Projects, Web Services | 6 Comments »

Google Maps: clickable poly!

I’ve been waiting for this announcement ever since Google introduced GGeoXml to its mapping API:

In our latest release (2.88) of the API, we’ve added “click” events to GPolyline and GPolygon, much to the enthusiasm of developers in the forum.

I knew it was just a matter of time since their internal apps have been supporting clickable GPolylines and GPolygons for some time now. Read the whole post for some fascinating information on how click detection works.

What this boils down to (for me anyway) is that you can display information generated with Google’s MyMaps interface on your own site with the same fidelity as the original via KML and GGeoXml. Up until now you could load KML from MyMaps via GGeoXml but the GPolylines and GPolygons only displayed and were not clickable. This removes a huge roadblock and should allow for even more interesting mapping applications.


Erlang bit syntax and network programming

Posted: August 10th, 2007 | Author: | Filed under: Open Source, Projects | 74 Comments »

I’ve been playing with Erlang over nights and weekends off and on for a few months now. I’m loving it for several reasons. First off, it’s completely different than any other programming language I’ve worked with. It makes me think rather than take things for granted. I’m intrigued by its concurrency abilities and its immutable no-shared-state mentality. I do go through periods of amazing productivity followed by hours if not days of figuring out a simple task. Luckily the folks in #erlang on irc.freenode.net have been extremely patient with me and the hundreds of other newcomers and have been extremely helpful. I’m also finding that those long pain periods are happening less frequently the longer I stick with it.

One of the things that truly blew me away about Erlang (after the original Erlang Now! moment) is its bit syntax. The bit syntax as documented at erlang.org and covered in Programming Erlang is extremely powerful. Some of the examples in Joe’s book such as parsing an MP3 file or an IPv4 datagram hint at the power and conciseness of binary matching and Erlang’s bit syntax. I wanted to highlight a few more that have impressed me while I was working on some network socket programming in Erlang.

There are several mind-boggling examples in Applications, Implementation and Performance Evaluation of Bit Stream Programming in Erlang (PDF) by Per Gustafsson and Konstantinos Sagonas. Here are two functions for uuencoding and uudecoding a binary:

uuencode(BitStr) ->
<< (X+32):8 || <<X:6>> <= BitStr >>.
uudecode(Text) ->
<< (X-32):6 || <<X:8>> <= Text >>.

UUencoding and UUdecoding aren’t particularly hard, but I’ve never seen an implementation so concise. I’ve also found that Erlang’s bit syntax makes socket programming extremely easy. The gen_tcp library makes connection to TCP sockets easy, and Erlang’s bit syntax makes creating requests and processing responses dead simple too.
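For anyone who doesn’t read Erlang yet, here is roughly what those two comprehensions do, spelled out in Python. It is a sketch of the same bit-shuffling only (no line-length bytes or other uuencode framing), the function names are my own, and like the Erlang version it assumes the input divides evenly into 6-bit groups.

```python
def uuencode_bits(data: bytes) -> bytes:
    """Take the input 6 bits at a time and emit each group + 32 as one byte."""
    if len(data) % 3:
        raise ValueError("input must be a multiple of 3 bytes (24 bits)")
    number = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    out = bytearray()
    # Walk from the most significant 6-bit group down to the least.
    for shift in range(total_bits - 6, -1, -6):
        out.append(((number >> shift) & 0x3F) + 32)
    return bytes(out)


def uudecode_bits(text: bytes) -> bytes:
    """Reverse the above: subtract 32 from each byte and pack 6 bits each."""
    if len(text) % 4:
        raise ValueError("input must be a multiple of 4 bytes")
    number = 0
    for char in text:
        number = (number << 6) | ((char - 32) & 0x3F)
    return number.to_bytes(len(text) * 6 // 8, "big")
```

Two dozen lines of Python against two lines of Erlang is a fair measure of how much work the bit syntax is doing.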

Here’s an example from qrbgerl, a quick project of mine that receives random numbers from the Quantum Random Bit Generator service. The only documentation I needed to use the protocol was the Python client and the C++ client. Having access to an existing Python client helped me bridge the “how might I do this in Python?” and “how might I do this in Erlang?” gaps, but I ended up referring to the canonical C++ implementation quite a bit too.

I start out by opening a socket and sending a request. Here’s the binary representation of the request I’m sending:

list_to_binary([<<0:8,ContentLength:16,UsernameLength:8>>, Username, 
<<PasswordLength:8>>, Password, <<?REQUEST_SIZE:32>>]),

This creates a binary packet that complies exactly with what the QRBG service expects. The first thing that it expects is a 0 represented in 8 bits. Then it wants the length of the username plus the length of the password plus 6 (ContentLength above) represented in 16 bits. Then we have the username represented as 8 bit characters/integers, followed by the length of the password represented in 8 bits, followed by the password (again as 8 bit characters/integers). Finally we represent the size of the request in 32 bits. In this case the macro ?REQUEST_SIZE is 4096.
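Since the Python client was part of my reference material, it may help to see the same layout built with Python’s struct module. This is my own sketch based on the field description above, not code from either official client; the function name is hypothetical, and it assumes big-endian unsigned fields, which is what Erlang’s bit syntax produces by default.

```python
import struct


def build_request(username: bytes, password: bytes,
                  request_size: int = 4096) -> bytes:
    """Pack a request using the field layout described above."""
    # Username length + password length + 6, as described in the text.
    content_length = len(username) + len(password) + 6
    return (
        # 0 in 8 bits, content length in 16 bits, username length in 8 bits
        struct.pack(">BHB", 0, content_length, len(username))
        + username
        # password length in 8 bits
        + struct.pack(">B", len(password))
        + password
        # requested number of bytes in 32 bits
        + struct.pack(">I", request_size)
    )
```

Note that the content length works out to the number of bytes that follow the first three header bytes.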

While that’s a little tricky, once we’ve sent the request, we can use Erlang’s pattern matching and bit syntax to process the response:

<<Response:8, Reason:8, Length:32, Data:Length/binary, 
_Rest/binary>> = Bin,

We’re matching several things here. The response code is the first 8 bits of the response. In a successful response we’ll get a 0. The next 8 bits represent the reason code, again 0 in this case. The next 32 bits will represent the length of the data we’re being sent back. It should be 4096 bytes of data, but we can’t be sure. Next we’re using the length of the data that we just determined to match that length of data as a binary. Finally we match anything else after the data and discard it. This is crucial because binaries are often padded at the beginning or end of the stream. In this case there’s some padding at the end that we need to match but can safely discard.
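The same match can be approximated in Python with struct and slicing. Again this is a sketch under the assumptions in the paragraph above (8-bit response, 8-bit reason, 32-bit big-endian length), with a hypothetical function name. Unlike the Erlang pattern, Python slicing won’t fail on short input by itself, so the length check is explicit.

```python
import struct

HEADER = ">BBI"  # response code, reason code, data length


def parse_response(packet: bytes):
    """Split a response into (response_code, reason_code, data)."""
    header_size = struct.calcsize(HEADER)  # 6 bytes
    if len(packet) < header_size:
        raise ValueError("packet too short for header")
    response, reason, length = struct.unpack_from(HEADER, packet)
    data = packet[header_size:header_size + length]
    if len(data) != length:
        raise ValueError("packet shorter than its declared data length")
    # Anything after the data (the padding) is simply ignored.
    return response, reason, data
```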

Now that we have 4096 bytes of random bits, let’s do something with them! I’ve mirrored the C++ and Python APIs as well as I could, but because of Erlang’s no shared state it’s going to look a little different. Let’s match a 32 bit integer from the random data that we’ve obtained:

<<Int:32/integer-signed, Rest/binary>> = Bin,

We’re matching the first 32 bits of our binary stream to a signed integer. We’re also matching the rest of the data in binary form so that we can reuse it later. Here’s that data extraction in action:

5> {Int, RestData} = qrbg:extract_int(Data).
{-427507221,
 <<0,254,163,8,239,180,51,164,169,160,170,248,94,132,220,79,234,4,117,
   248,174,59,167,49,165,170,154,...>>}
6> Int.
-427507221
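For comparison, a Python sketch of the same extraction. The function name mirrors the Erlang one but this is not the Python client’s API, just an illustration of reading a signed, big-endian 32-bit integer and handing back the remaining bytes.

```python
import struct


def extract_int(data: bytes):
    """Return (signed 32-bit integer, remaining bytes)."""
    (value,) = struct.unpack_from(">i", data)  # ">i" = big-endian, signed
    return value, data[4:]
```

Returning the leftover bytes instead of mutating a buffer keeps the Python version close to the no-shared-state style of the Erlang one.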

I’ve been quite happy with my experimentation with Erlang, but I’m definitely still learning some basic syntax and have only begun to play with concurrency. If the above examples confuse you, it might help to view them in context or take a look at the project on google code. I have also released an ISBN-10 and ISBN-13 validation and conversion library for Erlang which was a project I used to teach myself some Erlang basics. I definitely have some polishing to do with the QRBG client, but isbn.erl has full documentation and some 44 tests.


isbn.erl: My first Erlang module

Posted: May 3rd, 2007 | Author: | Filed under: Open Source, Projects | 63 Comments »

For a few weeks now I’ve been tinkering with, learning about, and falling in love with Erlang. I’ve been noticing the buzz about Erlang over the past few months, but two things won me over: the Erlang video and how amazingly simple and elegant concurrency and message passing are.

For the past few weeks I’ve been reading the Erlang documentation, Joe Armstrong’s new book, and the trapexit wiki, and lurking in #erlang on irc.freenode.net. If you’re wading in the Erlang waters, I highly suggest just about every link in this wonderful roundup of beginners’ Erlang links.

After a few weeks of reading and tinkering in the shell I decided that it was time to come up with a quick project to hone my Erlang skills. I had tinkered with ISBN validation and conversion while writing a small django application to catalog the books on the bookshelf, so I thought that was a good place to start. The Wikipedia page and my collection of ISBN links provided me with more than enough guidance on validation, check digit generation, and conversion.

I found it very easy to build this module from the ground up: start with ISBN-10 check digit generation, then use that to build an ISBN-10 validator. I did a similar thing with ISBN-13, writing the check digit generator and then the ISBN-13 validator. From there I was able to build on all four public functions to write an ISBN-10 to ISBN-13 converter as well as an ISBN-13 to ISBN-10 converter (when that is a possibility). In the process of this very simple module I ended up learning about and applying accumulators, guards, the use of case, and lots of general Erlang knowledge.

Here’s a peek at how the module works. After downloading it via google code or checking out the latest version via Subversion, the module needs to be compiled. This is easily accomplished from the Erlang shell (once you have Erlang installed of course):

mcroydon$ erl
Erlang (BEAM) emulator version 5.5.4 [source] [async-threads:0] [kernel-poll:false]

Eshell V5.5.4  (abort with ^G)
1> c('isbn.erl').
{ok,isbn}

After that we can use any of the exported functions:

2> isbn:validate_13([9,7,8,1,9,3,4,3,5,6,0,0,5]).
true
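If you’d like to check that result outside of Erlang, here is a Python sketch of the standard ISBN-13 check (alternating weights of 1 and 3, check digit chosen so the total is a multiple of 10). It follows the published ISBN-13 rule rather than the module’s source, and the function names are only meant to echo the Erlang ones.

```python
def check_digit_13(first_twelve):
    """Compute the ISBN-13 check digit from the first 12 digits."""
    if len(first_twelve) != 12:
        raise ValueError("expected 12 digits")
    # Weights alternate 1, 3, 1, 3, ... starting with the first digit.
    total = sum(
        digit * (3 if index % 2 else 1)
        for index, digit in enumerate(first_twelve)
    )
    return (10 - total % 10) % 10


def validate_13(isbn):
    """True if a 13-digit list ends in the correct check digit."""
    return len(isbn) == 13 and check_digit_13(isbn[:12]) == isbn[12]
```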

Let’s trace the execution of a simple function, check_digit_10/1. The function expects a list of 9 numbers (the first 9 numbers of an ISBN-10) and returns the check digit as either an integer or the character 'X'. The first thing that the function does is check to see if we’ve actually passed it a 9-item list:

check_digit_10(Isbn) when length(Isbn) /= 9 ->
    throw(wrongLength);

This is accomplished with a simple guard (when length(Isbn) /= 9). If that guard isn’t triggered we move on to the next function:

check_digit_10(Isbn) -> 
    check_digit_10(Isbn, 0).

This forwards our list of 9 numbers on to check_digit_10/2 (a function with the same name that takes two arguments). We’ll see in a minute that the 0 I’m passing in will be used as an accumulator. The next function does most of the heavy lifting for us:

check_digit_10([H|T], Total) ->
    check_digit_10(T, Total + (H * (length(T) + 2)));

This function takes the list and splits it into the first item (H) and the rest of the list (T). We add to the total as specified in ISBN-10 and then call check_digit_10/2 again with the rest of the list. This tail-recursive approach seems odd at first if you’re coming from most any object oriented language, but Erlang’s functional nature, strong list handling, and tail-recursive ways feel natural very quickly. After we’ve recursed through the entire list, it’s time to return the check digit (11 minus the total modulo 11). There’s a special case if we get a result of 10 to use the character 'X' instead:

check_digit_10([], Total) when 11 - (Total rem 11) =:= 10 ->
    'X';

Finally we return the result for the common case given an empty list:

check_digit_10([], Total) ->
    %% The outer rem maps a result of 11 (Total divisible by 11) to 0.
    (11 - (Total rem 11)) rem 11.
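Here is the whole check digit calculation collapsed into one Python function, for comparison with the clause-by-clause Erlang version. It is a sketch of the ISBN-10 rule rather than a port of the module: the weights run from 10 down to 2, and the final modulo maps a result of 11 (when the weighted sum is a multiple of 11) to 0.

```python
def check_digit_10(first_nine):
    """Compute the ISBN-10 check digit from the first 9 digits.

    Returns an int from 0 to 9, or the string "X" when the digit is 10.
    """
    if len(first_nine) != 9:
        raise ValueError("expected 9 digits")
    # Weights 10, 9, ..., 2 line up with length(T) + 2 in the Erlang code.
    total = sum(
        digit * weight
        for digit, weight in zip(first_nine, range(10, 1, -1))
    )
    check = (11 - total % 11) % 11
    return "X" if check == 10 else check
```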

Feel free to browse around the rest of the module and use it if you feel it might be useful to you. I’ve posted the full source to my isbn module at my isbnerl Google Code project, including the module itself and the module’s EDoc documentation. It is released under the new BSD license in hopes that it might be useful to you.

I learned quite a bit creating this simple module, and if you’re learning Erlang I suggest you pick something not too big, not too small, and something that you are interested in to get your feet wet. Now that I have the basics of sequential Erlang programming down, I think the next step is to make the same move from tinkering to doing something useful with concurrent Erlang.


PostGIS: From Multilinestring to Linestring

Posted: March 4th, 2007 | Author: | Filed under: Projects | 8 Comments »

Continuing on with my PostGIS tinkering, I’ve been further exploring getting data in and out of PostGIS. I recompiled GEOS to get the C API working so that I could start working with the new features in PostGIS 1.2.1.

One of the problems I ran in to is that gpx2shp really likes to import data as a Multilinestring, while in the simplest case of a single-track GPX file, it should really be a Linestring. After searching a bit, I came across a mention of linemerge which can turn a Multilinestring such as MULTILINESTRING((-95.235071 38.971896,-95.235076 38.971906,-95.235015 38.971848)) in to a LINESTRING(-95.235071 38.971896,-95.235076 38.971906,-95.235015 38.971848), which seems like a saner way to work with single tracks.


Properly serving a 404 with lighttpd’s server.error-handler-404

Posted: February 28th, 2007 | Author: | Filed under: Linux, Open Source, Projects, Python | 1 Comment »

The other day I was looking in to why Django’s verify_exists validator wasn’t working on a URLField. It turns out that the 404 handler that we were using to generate thumbnails with lighttpd on the media server was serving up 404 error pages with HTTP/1.1 200 OK as the status code. Django’s validator was seeing the 200 OK and (rightfully) not raising a validation error, even though there was nothing at that location.

It took more than the usual amount of digging to find the solution to this, so I’m making sure to write about it so that I can google for it again when it happens to me next year. The server.error-handler-404 documentation mentions that to send a 404 response you will need to use server.errorfile-prefix. That doesn’t help me a lot since I needed to retain the dynamic 404 handler.

Amusingly enough, the solution is quite clear once you dig in to the source code. I dug through connections.c and found that if I sent back a Status: line, that would be forwarded on by lighttpd to the client (exactly what I wanted!).

Here’s the solution (in Python) but it should translate to any language that you’re using as a 404 handler:

print("Status: 404")  # lighttpd forwards this status line to the client
print("Content-type: text/html")
print("")  # a blank line ends the headers
print("")  # (X)HTML error response goes here

I hope this helps, as I’m sure it will help me again some time in the future.


Packaging Python Imaging Library for maemo 3.0 (bora) and the N800

Posted: February 11th, 2007 | Author: | Filed under: Linux, Open Source, Projects | 6 Comments »

I found myself wanting to do some image manipulation in Python with the Python Imaging Library on my N800. Unfortunately PIL isn’t available in the standard repositories. Not to worry: after reading the Debian Maintainers’ Guide I packaged up python2.5-imaging_1.1.6-1_armel.deb, which is built against python2.5 and maemo 3.0 (bora) and installs perfectly on my N800:

Nokia-N800-51:/media/mmc1# dpkg -i python2.5-imaging_1.1.6-1_armel.deb 
Selecting previously deselected package python2.5-imaging.
(Reading database ... 13815 files and directories currently installed.)
Unpacking python2.5-imaging (from python2.5-imaging_1.1.6-1_armel.deb) ...
Setting up python2.5-imaging (1.1.6-1) ...
Nokia-N800-51:/media/mmc1# python2.5 
Python 2.5 (r25:9277, Jan 23 2007, 15:56:37) 
[GCC 3.4.4 (release) (CodeSourcery ARM 2005q3-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from PIL import Image
>>>

Feel free to take a look at the directory listing for full source, diffs, etc. You might also want to check out debian/rules to see where to use setup.py in the makefile to build and install a Python package. If anyone wants a build of this for maemo-2.0/770 please let me know. It would just take a little time to set up a maemo-2.0 scratchbox.


Mapping Every airport and helipad in America

Posted: January 24th, 2007 | Author: | Filed under: Projects, Python, Web Services | 3 Comments »

All the airports

After stumbling upon Transtats again today I took my semi-annual visit to the FAA data and statistics page to see if there was anything new to play with. The unruly passenger count still looks like it’s down for 2006 but I was really interested in playing with the airport data that I’ve seen before.

After a little help from Python’s CSV module and some helper functions from geopy, I whipped up a 4 meg KML file for use with Google Earth or anything else that can import KML. Be warned though that the file contains some 20,000 airports, helipads, and patches of dirt that can lead to some rendering bugs. If you’re interested, here’s the code that generated the KML.
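The general shape of a CSV-to-KML conversion looks something like the sketch below. It is not the code linked above: it uses only the standard library, it assumes the coordinates are already decimal degrees, and the column names (name, latitude, longitude) are hypothetical stand-ins for whatever the real file uses.

```python
import csv
from xml.sax.saxutils import escape

PLACEMARK = (
    "<Placemark><name>{name}</name>"
    "<Point><coordinates>{lon},{lat},0</coordinates></Point></Placemark>"
)


def airports_to_kml(csv_file):
    """Turn rows with name/latitude/longitude columns into a KML document."""
    placemarks = []
    for row in csv.DictReader(csv_file):
        placemarks.append(PLACEMARK.format(
            name=escape(row["name"]),  # escape &, < and > for XML
            # KML wants longitude first, then latitude, then altitude.
            lon=float(row["longitude"]),
            lat=float(row["latitude"]),
        ))
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n<Document>\n'
        + "\n".join(placemarks)
        + "\n</Document>\n</kml>\n"
    )
```

The one detail that is easy to get wrong is the coordinate order: KML expects longitude before latitude.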


From GPX to PostGIS

Posted: January 20th, 2007 | Author: | Filed under: Open Source, Projects | 51 Comments »

Now that I have a RELAX NG Compact schema of GPX, it’s time to figure out how to get my data into PostGIS so I can do geospatial queries. I installed PostGIS on my Ubuntu box with these instructions. If I recall correctly, the latest version didn’t work but the previous release did.

GPX makes a great GPS data interchange format, particularly because you can convert just about anything to GPX with GPSBabel. I’m not aware of anything that can convert straight from GPX to PostGIS, but it is a relatively straightforward two-part process.

The first order of business is converting our GPX file to an ESRI Shapefile. This is easiest done by using gpx2shp, a utility written in C that is also available in Ubuntu. Once gpx2shp is installed, run it on your GPX file: gpx2shp filename.gpx.

Once we have a shape file, we can use a utility included with PostGIS to import the shapefile in to postgres. The program is called shp2pgsql, and can be used as such: shp2pgsql -a shapefile.shp tablename. I tend to prefer the -a option which appends data to the table (as long as it’s the same basic type of the existing data). shp2pgsql generates SQL that you can then pipe in to psql as such: shp2pgsql -a shapefile.shp tablename | psql databasename.

Once the data is in Postgres (don’t forget to spatially-enable your database), you can query it in ways that I’m still figuring out. A basic query like this will return a list of lat/lon pairs that corresponds to my GPS track: select asText(the_geom) from tablename;. I’m looking forward to wrapping my head around enough jargon to do bounding box selects, finding things nearby a specific point, etc.

The examples in the PostGIS documentation seem pretty good, I just haven’t figured out how to apply it to the data I have. I feel like mastering GIS jargon is one of the biggest hurdles to understanding PostGIS better. Well, that and a masters degree in GIS wouldn’t hurt.


Arduino serial communication with Python

Posted: January 18th, 2007 | Author: | Filed under: Projects, Python | 8 Comments »

The Arduino is here!

I got my shiny Arduino yesterday. The first order of business (after the obligatory “Hello World” led_blink sketch) was interfacing Arduino with my language of choice, Python.

Googling around for python serial led me to pySerial, a cross-platform serial library. I was actually quite surprised that such a wrapper didn’t exist in the Python Standard Library. Nevertheless, I plodded on.

The first order of business was symlinking the default device for the Arduino serial drivers on my mac (for sanity):
sudo ln -s /dev/tty.usbserial-LOTSOFCHARSANDNUMBERS /dev/tty.usbserial. From there I fired up the Python shell and ran the serial hello world sketch on my Arduino:

>>> import serial
>>> ser = serial.Serial('/dev/tty.usbserial', 9600)
>>> while 1:
...     ser.readline()
'1 Hello world!\r\n'
'2 Hello world!\r\n'
'3 Hello world!\r\n'

Writing from Python to Arduino is simple too. Load serial_read_blink and do the following from Python:

>>> import serial
>>> ser = serial.Serial('/dev/tty.usbserial', 9600)  
>>> ser.write(b'5')

Hooray, it worked! Communicating with the Arduino over serial with Python (just like every other language) is a pretty trivial process.


All I want to do is convert my schema!

Posted: January 16th, 2007 | Author: | Filed under: Django, Java, Open Source, Projects, Python, Web Services | 53 Comments »

I’m working on a Django project in which I want to store GPS track information in GPX format. The best way to store that in Django is with an XMLField. An XMLField is basically just a TextField with validation via a RELAX NG Compact schema.

There is a schema for GPX. Great! The schema is an XSD though, but that’s okay, it’s a schema for XML so it should be pretty easy to just convert that to RELAX NG compact, right?

Wrong.

I pulled out my handy dandy schema swiss army knife, Trang, but was shocked to find out that while it can handle Relax NG (both verbose and compact), DTD, and an XML file as input and even XSD as an output, there was just no way that I was going to be able to coax it to read an XSD. Trang is one of those things (much like Jing) that I rely on pretty heavily but that hasn’t been updated in years. That scares me a bit, but I keep on using ‘em.

With Trang out of the picture, I struck out with various Google searches (which doesn’t happen very often). Eventually I landed on the conversion section of the RELAX NG website. The first thing that caught my eye was the Sun RELAX NG Converter. Hey, Sun’s got it all figured out. I clicked the link and was somewhat confused when I ended up at their main XML page. I scanned around and even searched the site but was unable to find any useful mention of their converter. A quick Google search for sun “relax ng converter” yielded nothing but people talking about how cool it was and a bunch of confused people (just like me) wondering where they could get it.

At this point I was grasping at straws so I pulled up The Internet Archive version of the extinct Sun RELAX NG Converter page. That tipped me off to the fact that I really needed to start tracking down rngconv.jar. A Google search turned up several Xdoclet and Maven CVS repositories. I grabbed a copy of the jar but it wouldn’t work without something called Sun Multi-Schema XML Validator.

That’s the phrase that pays, folks.

A search for Sun “Multi-Schema XML Validator” brought me to the java.net project page and included a prominent link to nightly builds of the multi-schema validator as well as nightly builds of rngconv. These nightly builds are a few months old, but I’m not going to pick nits at this point.

After downloading msv.zip and rngconv.zip and making sure all the jars were in the same directory I had the tools I needed to convert the XSD in hand to RELAX NG Compact. First I converted the XSD to RELAX NG Verbose with the following command: java -jar rngconv.jar gpx.xsd > gpxverbose.rng. That yielded the following RELAX NG (very) Verbose schema. Once I had that I could fall back to trusty Trang to do the rest: trang -I rng -O rnc gpxverbose.rng gpx.rnc. It errored out on any(lax:##other) so I removed that bit and tried again. After a lot more work than should have been required, I had my RELAX NG Compact schema for GPX.
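
For anyone who has to repeat the two-step conversion, it can be wrapped in a few lines of Python. This is only a sketch under some assumptions: rngconv.jar (with the MSV jars next to it) sits in the current directory, a trang command is on the PATH, and the function names are made up for illustration. The command-line flags are the same ones used above.

```python
import subprocess


def xsd_to_rnc_commands(xsd_path, rng_path, rnc_path):
    """Build the two commands (as argument lists) for XSD -> RNG -> RNC."""
    # Step 1: rngconv writes RELAX NG Verbose to stdout.
    to_rng = ["java", "-jar", "rngconv.jar", xsd_path]
    # Step 2: Trang converts RELAX NG Verbose to RELAX NG Compact.
    to_rnc = ["trang", "-I", "rng", "-O", "rnc", rng_path, rnc_path]
    return to_rng, to_rnc


def convert(xsd_path, rng_path, rnc_path):
    """Run both steps, stopping at the first one that fails."""
    to_rng, to_rnc = xsd_to_rnc_commands(xsd_path, rng_path, rnc_path)
    with open(rng_path, "w") as out:
        subprocess.run(to_rng, stdout=out, check=True)
    subprocess.run(to_rnc, check=True)


# Only build the commands here; calling convert() needs Java and Trang installed.
rng_cmd, rnc_cmd = xsd_to_rnc_commands("gpx.xsd", "gpxverbose.rng", "gpx.rnc")
```

The manual edit to remove any(lax:##other) still has to happen between the two steps, so in practice convert() would fail at the Trang step for the GPX schema until the intermediate file is patched.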

My experience in finding the right tools to convert XSD to RELAX NG was so absurd that I had to write it up, if only to remind myself where to look when I need to do this again in two years.


Pardon the Dust

Posted: December 20th, 2006 | Author: | Filed under: Django, MySQL, PHP, Projects, Python, Weblogs | 7 Comments »

Sorry about the short outage there. I finally consolidated the various co-location, shared hosting, and virtual private hosting services that I was consuming every month into one VPS account. I still have some legacy URLs to do some rewrite magic for, but the archives back to 2002 are here.

Because my new box is very Django-oriented, I am now running WordPress via PHP5 (FastCGI) and MySQL5 on lighttpd behind perlbal.

One of the things I really enjoyed about the move from WordPress on Apache with a really gnarly .htaccess file for URL rewriting to lighttpd was the simplicity of it all. Getting WordPress to “just work” for me on lighttpd was as simple as adding a 404 handler for the site:

server.error-handler-404 = "/index.php?error=404"

Everything should be smoothing out shortly and of course the eventual goal is to move this blog over to Django trunk. I did just that a few months ago but I need to revisit the code, find the importer, and give it a lot of layout love.


My OSX Development Environment

Posted: April 26th, 2006 | Author: | Filed under: *BSD, Apple, Linux, Projects | 15 Comments »

My work powerbook was out at Apple for a week or so getting a tan, a new motherboard, memory, and processor. While it was out of town I settled in to a Linux development environment focused around Ubuntu Dapper, Emacs 22 + XFT (pretty anti-aliased fonts), and whatever else I needed. Ubuntu (and other apt-based systems) are great for hitting the ground running because you just install whatever you need on the fly only when you need it. I also got pretty in to emacs and all of the stuff that’s there by default with a source build of the development snapshot. My co-worker James helped me get through some of the newbie bumps of my emacs immersion program.

When the powerbook came back I decided it was time to reboot my development environment, so I started from scratch. Here’s what I installed, in the order that I installed it:

  • Updates. Oh. My. Goodness. I rebooted that thing so many times I started looking for a green start button.
  • Quicksilver (freeware): I use it all the time to get at stuff I need.
  • Transmit (commercial, $30): Worth every penny.
  • Firefox (open source): My browser of choice, though I really dig Safari’s rendering engine.
  • Textmate (commercial, 39 euro): I spend all day in this text editor and it rocks, though I do miss emacs.
  • Then I disabled capslock. I never hit it on purpose, it’s always getting in the way. I should really map a modifier key to it, but I’m not sure which one and I don’t know if I can convince my pinky to hit it on purpose.
  • Xcode: A man has to have a compiler.
  • Subversion (open source): I used the Metissian installer since it has treated me well in the past, and I often have flashbacks of building subversion pre-1.0 from source.
  • Django (open source): I checked out trunk, .91, and magic-removal from svn.
  • Ellington (commercial, starting at $10-15k): I checked out ellington and other work stuff from our private repository.
  • Firebug: Essential for web development.
  • Python 2.4 (open source): I’m not a big fan of the Python 2.3 that ships with OSX.
  • Python Imaging Library (open source): It’d be really nice if this made its way in to the standard Python distro.
  • ElementTree (open source): I usually use either ElementTree or Sax for parsing XML documents.
  • GNU Wget (open source): It’s what I use to download stuff from the commandline.
  • PostgreSQL (open source): It probably hogs resources to always have this running in the background, but I use it often enough.
  • PostgreSQL startup item from entropy.ch
  • mxDateTime (open source): I’ve never really used it directly, but psycopg does.
  • Psycopg 1.x (open source): Django uses this to talk to Postgres.
  • Colloquy (open source): A really nice IRC client for OSX. I’m also rather fond of Irssi and screen over SSH.
  • Growl (open source): It’s not work critical but I like it.
  • Pearl Crescent Page Saver (freeware): I find it indispensable for taking screenshots of entire web pages.
  • Session Saver for Firefox: I hate looking at 15 different forum threads to find the latest version of this, but I love what it does for me.
  • Adium (open source): Best darned IM client for OSX that talks just about any protocol.

While I may have missed an app or two, I think that just about covers my OSX development and living environment. I find the Ubuntu desktop useful enough that it’s still humming under my desk at work. The work LCD has both analog and DVI inputs so I’m able to switch between my two-screened powerbook and a one-screened Linux desktop in a pseudo-KVM kind of way.

I can’t say enough how impressed I was with Dapper, and how productive it kept me. Aside from my emacs learning curve, I felt at home and had the command line and any app that I wanted to install at my disposal.

I hope that this laundry list is helpful. If nothing else, it’ll be a place for me to start the next time I’m looking for a clean slate.


Wishlist 2.0

Posted: April 5th, 2006 | Author: | Filed under: Projects, Web Services | 56 Comments »

I’ve been contemplating writing a wishlist app off and on for a few months now but have never gotten around to doing so. While I have an Amazon wishlist, there’s a lot of stuff that I’d love to have that Amazon doesn’t sell. After finding myself keeping a separate list and periodically e-mailing it to my wife, I thought it would be cool to be able to put together a wishlist using any item that has a URL.

I waited too long and it looks like gifttagging has done at least 80% of what I was hoping to do. It has the web 2.0 look and feel and a tag cloud on the front page and everything. I have a feeling that I won’t actually use the service but it definitely does almost all of what I was planning to do, so if I tried to pull it off it’d be something of an also-ran.

A couple of weeks ago I brainstormed the concept (in a rather conversational tone) with the hope of motivating myself to get started on it. That obviously didn’t happen so I thought perhaps I’d share the brainstorming session in case it’s useful to someone.

So you have an amazon wishlist, and a wishlist with this other site, and you want some stuff that you can’t put on a wishlist. Wouldn’t it be nice if you could put all of this wishlist stuff in one spot? Cue wishlist 2.0 (or whatever it’s called). It gives you one URL you can send to your friends who want to know what you want. Of course it does stuff like pull in (and keep in synch) your Amazon wishlist, but it also works for so much more, like that doodad you want from whatsit.com. It lets you set priorities, keep notes about particular items, and it’s really easy to share with your friends. They can subscribe to an RSS/Atom feed of the stuff you want, you can send them an email linking to your wishlist, they can leave comments and OMG you can tag stuff too.

So let’s get down to some details. You sign up. Confirm your email address, cause you have to have a valid email address (even if it’s mailinator.com). After you confirm you’re sent to your “dashboard” screen. You know, the one you get every time you log in. It lists your wishlist items in whatever order you prefer (but you can reorder them). Since it’s your first time there’s a little bit at the top asking if you’d like a tour of the place, or if you’d rather, just import your shit from Amazon.

The import process is pretty painless. We’re up front about needing some information about you in order to get your wishlist from Amazon. So we get that info from you, say “hang on a sec” and go grab your info using Amazon’s APIs. We come back with “Hey, so you’re John Whatshisname from Austin, TX, right? You want this, that, and the other thing. That’s you, right?”

After we confirm that we’re not pulling in some other dude’s wishlist, we prepopulate your wishlist with the stuff from Amazon. Your quantities and ranking come over, plus everything gets tagged with “amazon”.

If you don’t have anything to import from amazon, we take you in the other way and show you how easy it is to add items to your wishlist. All stuff needs is a URL in order for you to add it. We’ll do our best to guess what it is, but you can always override that. It gets your default “want it” value unless you override that, plus you can tag it with whatever you want “del.icio.us-style”.

From there we can point out that “hey, your wishlist has an RSS feed. Or an Atom feed, if that’s how you roll.” You can also do other stuff like tell your friends, browse stuff from other peoples’ wishlists, or access your wishlist from a mobile phone.

I guess the browsing and social aspect could be fleshed out a bit. Each wishlist item could tell you what else the other people who want it are wishing for. You know, if you want a pink RAZR you might also want a fashionable bluetooth headset. Stuff like that. You can also look at the latest stuff that everyone is wishing for. If you’re on somebody else’s wishlist page and you see something that they want that you also want, you can just click “I want this too” and add it to your wishlist.


Backing Up Flickr Photos with Amazon S3

Posted: March 22nd, 2006 | Author: | Filed under: Open Source, Projects, Python, Web Services | 711 Comments »

I love that I now have an Amazon S3 billing page that reads like a really cheap phone or water bill. I think that they’re silently changing the game (again) without telling anyone else. I really like the implications of this magpiebrain post and decided to start using S3 “for real” myself last night.

The first ingredient was James Clarke’s flickr.py. Getting a list of my photos is pretty simple:

import flickr
me = flickr.people_findByUsername("postneo")
photos = flickr.people_getPublicPhotos(me.id)

The second ingredient for getting the job done was a pythonic wrapper around the Amazon example python libraries by Mitch Garnaat called BitBucket. Because it builds on the example libraries, there’s very little error checking, so be careful. Check out Mitch’s site for some example BitBucket usage, it’s pretty slick.

Once I was familiar with both libraries, I put together a little script that finds all of my photos and uploads the original quality image to S3, using the flickr photo ID as the key. Here’s the complete code for flickrbackup.py, all 25 lines of it.

After uploading 160 or so photos to Amazon, I owe them about a penny.
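
The linked script isn't reproduced here, but the shape of the loop is easy to sketch. This is not the original flickrbackup.py: backup_photos, fetch_original, and the dict standing in for the S3 bucket are all hypothetical names chosen for illustration, and no flickr.py or BitBucket calls are made. The only idea taken from the post is "store the original image under the flickr photo ID as the key". Skipping keys that already exist is an extra choice, added so the sketch can be re-run without uploading everything again.

```python
def backup_photos(photo_ids, fetch_original, store):
    """Copy each photo's original bytes into a dict-like store, keyed by photo ID.

    photo_ids      -- iterable of photo IDs
    fetch_original -- callable returning the original image bytes for an ID
    store          -- any mapping that supports `in` and item assignment
    """
    uploaded = []
    for photo_id in photo_ids:
        key = str(photo_id)
        if key in store:
            # Already backed up on a previous run; don't upload it again.
            continue
        store[key] = fetch_original(photo_id)
        uploaded.append(key)
    return uploaded


# Toy run with an in-memory "bucket" and fake image data.
bucket = {"116201243": b"already-there"}
fake_fetch = lambda photo_id: ("jpeg-bytes-for-%s" % photo_id).encode("ascii")
new_keys = backup_photos([116201243, 116201244], fake_fetch, bucket)
```

In a real run the store would be whatever object wraps the S3 bucket and fetch_original would download the original-size image from Flickr.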

Getting photos back out is really easy too:

>>> import BitBucket
>>> bucket = BitBucket.BitBucket("postneo-flickr")
>>> bucket.keys()
>>> bits = bucket[u'116201243']
>>> bits.to_file("photo.jpg")


We’re Moving to Kansas!

Posted: October 31st, 2005 | Author: | Filed under: Django, Open Source, Projects, Python, Weblogs | 37 Comments »

No really, we’re moving to Kansas. I’ve accepted a position at World Online, the online division of the Lawrence Journal-World. I’ll be working on some award-winning sites including LJWorld.com, lawrence.com, and KUsports.com using my favorite web framework: Django.

I’m really excited about working with an awesome team of people doing some really cool stuff. And of course I’m completely stoked about working with Django on a daily basis. I’ll talk about what I’m up to when I can but there will be times when I have to keep my lips zipped. I guess now might be a good time to mention that this is my personal weblog and that views/opinions/etc expressed here are mine and do not necessarily reflect those of my employer.

Needless to say I’ve been a bit busy with getting up to speed at work and planning the move. I’ve been meaning to write this post for some time now and had to delete a completely out of date post that I had half written while in Lawrence. Blogging will probably be light until things settle out, but in the meantime keep an eye on my del.icio.us links.

Strap in, Toto!


Skipping Startup School

Posted: October 11th, 2005 | Author: | Filed under: Projects, Weblogs | 7 Comments »

I managed to get past the hall monitors and was accepted to Startup School but had to skip it due to scheduling. I’ll be keeping an eye on the blogosphere and the notes that come out of the daylong event. I was particularly interested in attending as a guy who is constantly on the edge but doesn’t always realize it until similar products or services come out a year or two down the road. I did moblogging back in 2002 on my POS WAP-only Sprintphone and have worked on countless other small projects. Every once in a while I reminisce with Russ about that big new thing that he or I had tinkered with a year or two ago but never got off the ground. I was definitely looking forward to peeking behind the curtain a bit, but hopefully some attendees will be kind enough to write up their experiences.

In other news, you might get an idea of what Aaron and Infogami are up to if you visit the Startup School Wiki.


Django: Big Integer Fields

Posted: August 26th, 2005 | Author: | Filed under: Django, MySQL, Projects, Python | 6 Comments »

I submitted a patch to Django Ticket #399 (request for a bigint field type). It still needs testing but works at a quick glance on MySQL. Here’s a shot of them in action from the admin interface (the input is just too small and just too big respectively):

BigInt Admin

Update: BigIntegerField works perfectly on PostgreSQL, but because PostgreSQL doesn’t have an unsigned integer type (that I can find), PositiveBigIntegerField isn’t going to make it all the way up to 18,446,744,073,709,551,615 without using an arbitrary precision NUMERIC or mapping zero to -9,223,372,036,854,775,808. Both solutions are messy and it would be a shame to have the MySQL and Postgres backends behave so differently. As an aside, it looks like this is already the case with MySQL’s IntegerFields being UNSIGNED while Postgres just checks to make sure that the integers are positive before inserting.

The best solution would probably be to employ backend-specific range checking for these monstrous numbers. That way you won’t end up out of range in PostgreSQL but you’re also not penalizing MySQL for being able to count to 18 bajillion. At this point it would be safe to drop in BigIntegerField as is (as soon as I check it out on SQLite), but PositiveBigIntegerField still needs some pondering.
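
To make the range-checking idea concrete, here is a rough sketch in plain Python. It is not the patch attached to the ticket and doesn't use any Django API: the BACKEND_RANGES table and check_bigint are hypothetical names, and the limits assume a signed 64-bit BIGINT on both backends plus an unsigned 64-bit BIGINT on MySQL only.

```python
SIGNED_64 = (-2**63, 2**63 - 1)   # -9,223,372,036,854,775,808 .. 9,223,372,036,854,775,807
UNSIGNED_64 = (0, 2**64 - 1)      # 0 .. 18,446,744,073,709,551,615

# (backend name, positive-only field?) -> inclusive (low, high) limits
BACKEND_RANGES = {
    ("mysql", False): SIGNED_64,
    ("mysql", True): UNSIGNED_64,
    ("postgresql", False): SIGNED_64,
    # No unsigned type assumed here, so a positive field tops out at the signed maximum.
    ("postgresql", True): (0, SIGNED_64[1]),
}


def check_bigint(value, backend, positive=False):
    """Return value unchanged if it fits the backend's range, else raise ValueError."""
    low, high = BACKEND_RANGES[(backend, positive)]
    if not low <= value <= high:
        raise ValueError(
            "%d is out of range for %s (allowed: %d to %d)" % (value, backend, low, high)
        )
    return value


biggest = check_bigint(2**64 - 1, "mysql", positive=True)
```

The same value passed with backend "postgresql" raises ValueError, which is the behavior difference the paragraph above describes: each backend gets the widest range it can actually store.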