Category: Projects

  • Django: Big Integer Fields

    I submitted a patch to Django Ticket #399 (a request for a bigint field type). It still needs testing, but at a quick glance it works on MySQL. Here’s a shot of the new fields in action from the admin interface (the input is just too small and just too big, respectively):

    BigInt Admin

    Update: BigIntegerField works perfectly on PostgreSQL, but because PostgreSQL doesn’t have an unsigned integer type (that I can find), PositiveBigIntegerField isn’t going to make it all the way up to 18,446,744,073,709,551,615 without using an arbitrary-precision NUMERIC or mapping zero to -9,223,372,036,854,775,808. Both solutions are messy, and it would be a shame to have the MySQL and PostgreSQL backends behave so differently. As an aside, it looks like this is already the case with MySQL’s IntegerFields being UNSIGNED while PostgreSQL just checks to make sure that the integers are positive before inserting.

    The best solution would probably be to employ backend-specific range checking for these monstrous numbers. That way you won’t end up out of range in PostgreSQL, but you’re also not penalizing MySQL for being able to count to 18 bajillion. At this point it would be safe to drop in BigIntegerField as-is (as soon as I check it out on SQLite), but PositiveBigIntegerField still needs some pondering.
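
    To make the range-checking idea concrete, here’s a rough sketch in plain Python (the backend names and bounds are illustrative assumptions, not code from the actual patch):

    # Illustrative backend-specific range check; the backend names and bounds
    # here are assumptions for clarity, not the actual patch.
    POSITIVE_BIGINT_RANGES = {
        'mysql': (0, 18446744073709551615),      # unsigned 64-bit BIGINT
        'postgresql': (0, 9223372036854775807),  # signed 64-bit BIGINT, positive half only
    }

    def check_positive_bigint(value, backend):
        """Raise ValueError if value won't fit in the backend's column type."""
        low, high = POSITIVE_BIGINT_RANGES[backend]
        if not low <= value <= high:
            raise ValueError("%d is out of range for %s (%d to %d)"
                             % (value, backend, low, high))
        return value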

  • Newcomers to the Bookshelf

    Newcomers to my bookshelf

    Several new books have landed on my bookshelf recently, and I thought I’d take a minute to highlight them:

  • Django Generic Views: CRUD

    Note: I’ve not yet updated this to reflect the new model syntax. For the time being you can take a look at the new model syntax for tasks here.

    There are lots of gems buried in Django that are slowly coming to light. Generic views and specifically the CRUD (create, update, delete) generic views are extremely powerful but underdocumented. This brief tutorial will show you how to make use of CRUD generic views in your Django application.

    One of my first encounters with Rails was the simple todo list tutorial which managed to relate lots of useful information by creating a simple yet useful application. While I will do my best to point out interesting and useful things along the way, it is probably best that you be familiar with the official Django tutorials. Now would also probably be a good time to mention that this tutorial works for me using MySQL and revision 524. Django is under constant development, so things may change. I’ll do my best to keep up with changes.

    Getting Started

    As with all Django projects, the best place to start is django-admin.py startproject todo. Make sure that the directory you created your project in is on your PYTHONPATH, then edit todo/settings/main.py to point to the database of your choice. Now would be a good time to set your DJANGO_SETTINGS_MODULE to "todo.settings.main". Next, move to your apps/ dir and create a new application with django-admin.py startapp tasks followed by django-admin.py init. The initial setup process is covered in much more detail in tutorial 1.

    The Model

    Now that we have the project set up, let’s take a look at our rather simple model (todo/apps/tasks/models/tasks.py):

    from django.core import meta
    
    # Create your models here.
    
    class Task(meta.Model):
      fields = (
        meta.CharField('title', maxlength=200),
        meta.TextField('description'),
        meta.DateTimeField('create_date', 'date created'),
        meta.DateTimeField('due_date', 'date due'),
        meta.BooleanField('done'),
      )
    
      admin = meta.Admin(
        list_display = ( 'title', 'description', 'create_date', 'due_date', 'done' ),
        search_fields = ['title', 'description'],
        date_hierarchy = 'due_date',
      )
    
      def __repr__(self):
        return self.title

    The model is short and sweet, storing a title, a description, two dates, and whether or not the task is done. To play with your model in the admin, add 'todo.apps.tasks' to INSTALLED_APPS in todo/settings/main.py.

    Feel free to play around with your model using the admin site. For details, see tutorial 2.
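
    If you’d rather poke at the model from the Python interpreter (with DJANGO_SETTINGS_MODULE set as above), the same old-style API used later in this tutorial works there too. A rough sketch:

    >>> from django.models.tasks import tasks
    >>> tasks.get_list(order_by=['-due_date'])   # all tasks, latest due date first
    >>> tasks.get_list(done__exact=False)        # only tasks that aren't done yet
    >>> t = tasks.get_object(pk=1)               # look up a single task
    >>> t.done = True
    >>> t.save()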

    URL Configuration

    Now let’s configure our URLs. We’ll fill in the code behind these URLs as we go. I edited todo/settings/urls/main.py directly, but you’re probably best off decoupling your URLs to your specific app as mentioned in tutorial 3.

    from django.conf.urls.defaults import *
    
    info_dict = {
      'app_label': 'tasks',
      'module_name': 'tasks',
    }
    
    urlpatterns = patterns('',
      (r'^tasks/?$', 'todo.apps.tasks.views.tasks.index'),
      (r'^tasks/create/?$', 'django.views.generic.create_update.create_object',
      dict(info_dict, post_save_redirect="/tasks/") ),
      (r'^tasks/update/(?P<object_id>\d+)/?$',
      'django.views.generic.create_update.update_object', info_dict),
      (r'^tasks/delete/(?P<object_id>\d+)/?$',
      'django.views.generic.create_update.delete_object',
      dict(info_dict, post_delete_redirect="/tasks/new/") ),
      (r'^tasks/complete/(?P<object_id>\d+)/?$',
      'todo.apps.tasks.views.tasks.complete'),
    )

    Note: I had to alter the formatting of the urlpatterns in order to make them fit. It looks a lot better in its original formatting.

    We use info_dict to pass information about our application and module to the generic view handlers. The CRUD generic views need only these two pieces of information, but some generic views need more. See the generic views documentation for an explanation.

    Let’s look at each of these URLs one at a time, along with the code behind them.

    Index

    (r'^tasks/?$', 'todo.apps.tasks.views.tasks.index'),

    This points to our index view, which is an index function in todo/apps/tasks/views/tasks.py:

    from django.core import template_loader
    from django.core.extensions import DjangoContext as Context
    from django.utils.httpwrappers import HttpResponse, HttpResponseRedirect
    from django.models.tasks import tasks
    from django.core.exceptions import Http404
    
    def index(request):
      notdone_task_list = tasks.get_list(order_by=['-due_date'], done__exact=False)
      done_task_list = tasks.get_list(order_by=['-due_date'], done__exact=True)
      t = template_loader.get_template('tasks/index')
      c = Context(request, {
        'notdone_tasks_list': notdone_task_list,
        'done_tasks_list': done_task_list,
      })
      return HttpResponse(t.render(c))

    This view creates two lists for us to work with in our template. notdone_tasks_list is (not surprisingly) a list of tasks that are not done yet. Similarly, done_tasks_list contains a list of tasks that have been completed. We will use the template tasks/index.html to render this view.

    Make sure that you have a template directory defined in todo.settings.main (this refers to todo/settings/main.py). Here’s mine:

    TEMPLATE_DIRS = (
     "/home/mcroydon/django/todo/templates",
    )

    Now let’s take a look at the template that I’m using for the index:

    {% if notdone_tasks_list %}
        <p>Pending Tasks:</p>
        <ul>
        {% for task in notdone_tasks_list %}
            <li>{{ task.title }}: {{ task.description }} <br/> 
              Due {{ task.due_date }} <br/> 
              <a href="/tasks/update/{{ task.id }}/">Update</a> 
              <a href="/tasks/complete/{{ task.id }}/">Complete</a>
            </li>
        {% endfor %}
        </ul>
    {% else %}
        <p>No tasks pending.</p>
    {% endif %}
        <p>Completed Tasks:</p>
    {% if done_tasks_list %}
        <ul>
        {% for task in done_tasks_list %}
            <li>{{ task.title }}: {{ task.description }} <br/> 
              <a href="/tasks/delete/{{ task.id }}/">Delete</a>
            </li>
        {% endfor %}
        </ul>
    {% else %}
        <p>No completed tasks.</p>
    {% endif %}
    <p><a href="/tasks/create/">Add a task</a></p>

    Don’t let this index scare you; it’s just a little bit of logic, a little looping, and some links to other parts of the application. See the template authoring guide if you have questions. Here’s a picture to give you a better idea of how the above barebones template renders in Firefox:

    Tasks thumbnail

    Create Generic View

    Now let’s take a look at the following URL pattern:

    (r'^tasks/create/?$', 'django.views.generic.create_update.create_object', dict(info_dict, post_save_redirect="/tasks/") ),

    There’s a lot of magic going on here that’s going to make your life really easy. First off, we’re going to call the create_object generic view every time we visit /tasks/create/. If we arrive there with a GET request, the generic view displays a form. Specifically, it looks for module_name_form.html; in our case that means tasks_form.html. It knows what model to look for because of the information we gave it in info_dict. If, however, we reach this URL via a POST, the create_object generic view will create a new object for us and then redirect us to the URL of our choice (as long as we give it a post_save_redirect).

    Here’s the template that I am using for tasks_form.html:

    {% block content %}
    
    {% if object %}
    <h1>Update task:</h1>
    {% else %}
    <h1>Create a Task</h1>
    {% endif %}
    
    {% if form.has_errors %}
    <h2>Please correct the following error{{ form.errors|pluralize }}:</h2>
    {% endif %}
    
    <form method="post" action=".">
    <p><label for="id_title">Title:</label> {{ form.title }}
    {% if form.title.errors %}*** {{ form.title.errors|join:", " }}{% endif %}</p>
    <p><label for="id_description">Description:</label> {{ form.description }}
    {% if form.description.errors %}*** {{ form.description.errors|join:", " }}{% endif %}</p>
    <p><label for="id_create_date_date">Create Date:</label> {{ form.create_date_date }}
    {% if form.create_date_date.errors %}*** {{ form.create_date_date.errors|join:", " }}{% endif %}</p>
    <p><label for="id_create_date_time">Create Time:</label> {{ form.create_date_time }}
    {% if form.create_date_time.errors %}*** {{ form.create_date_time.errors|join:", " }}{% endif %}</p>
    <p><label for="id_due_date_date">Due Date:</label> {{ form.due_date_date }}
    {% if form.due_date_date.errors %}*** {{ form.due_date_date.errors|join:", " }}{% endif %}</p>
    <p><label for="id_due_date_time">Due Time:</label> {{ form.due_date_time }}
    {% if form.due_date_time.errors %}*** {{ form.due_date_time.errors|join:", " }}{% endif %}</p>
    <p><label for="id_done">Done:</label> {{ form.done }}
    {% if form.done.errors %}*** {{ form.done.errors|join:", " }}{% endif %}</p>
    <input type="submit" />
    </form>
    <!--
    This is a lifesaver when debugging!
    <p> {{ form.error_dict }} </p>
    -->
    
    {% endblock %}

    Here’s what the create template looks like rendered:

    Create Tasks

    If we fill out the form without the proper (or correctly formatted) information, we’ll get an error:

    Create Tasks

    Update

    (r'^tasks/update/(?P<object_id>\d+)/?$', 'django.views.generic.create_update.update_object', info_dict),

    This URL pattern handles updates. The beautiful thing is that it also renders tasks_form.html, so with a little logic, 90% of the form can be exactly the same as the create form. If we go to /tasks/create/, we get the blank form. If we visit /tasks/update/1/, we go to the same form, but it will be prepopulated with the data from the task with an ID of 1. Here’s the logic that I used to change the header:

    {% if object %}
    <h1>Update task:</h1>
    {% else %}
    <h1>Create a Task</h1>
    {% endif %}

    So if there’s no object present, we’re creating. If there’s an object present, we’re updating. Same form. Pretty cool.

    Warning: It looks like form.create_date_date and form.create_date_time have broken between the time I wrote this code and the time I wrote it up. The update view will not prepopulate the form with the stored information. There’s a ticket for this, and I’ll update as necessary when it has been fixed.

    Delete

    Here’s the URL pattern to delete a task:

    (r'^tasks/delete/(?P<object_id>\d+)/?$', 'django.views.generic.create_update.delete_object', dict(info_dict, post_delete_redirect="/tasks/new/") ),

    There’s another little Django gem in the delete function. If we end up at /tasks/delete/1 using a GET request, Django will automatically send us to the tasks_form_delete.html template. This allows us to make sure that the user really wanted to delete the task. Here’s my very simple tasks_form_delete.html template:

    <form method="post" action=".">
    <p>Are you sure?</p>
    <input type="submit" />
    </form>

    Once this form is submitted, the actual delete takes place and we are redirected to the URL we specified (because we set post_delete_redirect).

    CRUD Generic Views And the Rest of Your Application

    That pretty much covers the basics of the CRUD generic views. The great thing about generic views is that you can use them alongside your custom views. There’s no need to do a ton of custom programming for list/detail, date-based, or CRUD views, since those generic views are available to you. And because we set up our own URL patterns, we can make sure that we craft URLs that look pretty and make sense.

    For my sample tasks application I decided that I wanted to create links that would immediately set a task as complete and then redirect to the index. This is pretty much trivial and can be accomplished by adding the following URL pattern and backing it up with the appropriate view. Here’s the pattern:

    (r'^tasks/complete/(?P<object_id>\d+)/?$', 'todo.apps.tasks.views.tasks.complete'),

    And here’s the view that goes along with it (from todo/apps/tasks/views/tasks.py):

    def complete(request, object_id):
      try:
        t = tasks.get_object(pk=object_id)
      except:
        # do something better than this
        raise Http404
      try:
        t.done=True
        t.save()
        return HttpResponseRedirect('/tasks/')
      except:
        # do something better than this
        raise Http404

    I do plan to actually handle errors and respond accordingly, but it was late last night and I just wanted to see it work (and it does).
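
    For what it’s worth, a slightly cleaner version might look something like the sketch below (it assumes the old-style tasks.TaskDoesNotExist exception, following the lookup pattern used in the official tutorial):

    def complete(request, object_id):
      # Only catch the "no such task" case instead of swallowing everything.
      try:
        t = tasks.get_object(pk=object_id)
      except tasks.TaskDoesNotExist:
        raise Http404
      t.done = True
      t.save()
      return HttpResponseRedirect('/tasks/')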

    Conclusion

    Django rocks. Generic views rock. The framework and specifically the generic views make your life easy. My little tasks app took a few hours to put together, but a significant portion of that was reading up on the documentation, trying to figure out generic views using the existing docs and reading the source, and of course pestering the DjangoMasters about generic views and other stuff on #django (thanks all).

    I hope this overview of CRUD generic views helps, but if anything confuses you, don’t hesitate to comment or get in touch with me (matt at ooiio dot com). Also expect to see updates to this tutorial as APIs change and I get a little more time to clean up my code.

    Feel free to download and play with my little todo app: todo-tutorial.tar.gz or todo-tutorial.zip. Consider them released under a BSD-style license. Above all, don’t sue me.

  • Agile Web Development with Rails

    Rails Book

    Choo Choo! My copy of Agile Web Development with Rails showed up at my doorstep like a lost kitten this evening. I’ve skimmed the beta book but it’s great to have a reference available in dead tree form. Dave, David, and the rest of the crew: you rock. Thanks so much for getting the beta book in front of our eyeballs.

  • Geolocation Information Gaps

    I’ve been seeking out interesting data sources to plot in Google Earth after learning the basics of KML. I’ve been wanting to do something cool with NOAA’s XML weather feeds since I heard about them, so I thought I would download the 700kb list of stations serving up XML and spit out some KML from that data as a “neat” first step.

    I’ll probably still do that, but after parsing the data, I’m a bit disappointed. As always, there are huge gaps in geolocation information. In order to get my hands on the data I turned to xmltramp, which is an awesome library for accessing simple XML documents in a pythonic way. I then whipped up a few lines of Python to walk through the data:

    import xmltramp # http://www.aaronsw.com/2002/xmltramp/
    
    f=open('stations.xml', 'r')
    doc=xmltramp.parse(f.read())
    count = 0
    total = 0
    for station in doc['station':]:
      total = total + 1
      sid = str(station['station_id'])
      lat = str(station['latitude'])
      lon = str(station['longitude'])
      if (lat != 'NA') and (lon != 'NA'):
        print "Station ID: " + sid + \
        " (" + lat + "," + lon + ")"
        count = count + 1
    print str(count) + " out of " + str(total) + \
    " stations are geolocated."

    Here’s the output of the above code:

    mcroydon@mobilematt:~/py/kmlist$ python kmlist.py
    Station ID: PAGM (63.46N,171.44W)
    [... snip ...]
    Station ID: KSHR (44.46.10N,106.58.08W)
    422 out of 1775 stations are geolocated.

    Well, that’s a bummer. 422 out of 1775, or less than 25% of all stations, are geolocated. While that’s still 422 more stations than I knew about previously, it’s a far cry from a majority of weather stations across the United States.

    Another thing you will notice is that some stations appear to express their coordinates in decimal degrees (63.46N) while others appear to use degrees/minutes/seconds (44.46.10N).
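
    If I do end up generating KML from this, I’ll need to normalize those two formats first. Here’s a rough sketch of the kind of conversion involved (the format handling is my own guess, not an official NOAA spec):

    def to_decimal_degrees(coord):
        """Convert '44.46.10N' (degrees.minutes.seconds plus hemisphere) or
        '63.46N' (decimal degrees plus hemisphere) into a signed float.
        This is a guess at the two formats seen in stations.xml."""
        hemisphere = coord[-1]
        parts = coord[:-1].split('.')
        if len(parts) >= 3:
            # degrees/minutes/seconds
            degrees = float(parts[0]) + float(parts[1]) / 60.0 + float(parts[2]) / 3600.0
        else:
            # already decimal degrees
            degrees = float('.'.join(parts))
        if hemisphere in ('S', 'W'):
            degrees = -degrees
        return degrees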

    It’s gaps like these that can make working with “found” geolocation data frustrating.

  • The Revolution Will Be Geotagged

    Over the weekend I’ve been working on a Python for Series 60 project that I thought up a few days ago while exchanging information with Gustaf between Google Earth instances. It really should have hit me when Google Sightseeing packed its sights into a KML file, but what can I say, I’m a little slow.

    After sending a .kml file via email to Gustaf, I decided to take a look at what exactly made up a .kml file. I started to drool a little bit when I read the KML documentation. The first example is extremely simple yet there’s a lot of power behind it. A few lines of XML can tell Google Earth exactly where to look and what to look at.

    Proof of Concept

    With this simple example in mind, I started to prototype out a proof of concept style Python app for my phone. Right now everything is handled in a popup dialog, and for the time being I’m just going to save a .kml file and let you do with it as you please, but over the next few days I plan to re-implement the app with an appuifw.Form, get latitude and longitude information from Bluetooth GPS (if you’re so lucky), and work on smtplib integration so that the app can go from location -> write KML -> send via smtplib.

    Rapid Mobile Development

    When I say that I’ve been working on this app over the weekend, that’s not strictly accurate. I prototyped the proof of concept in about 20-30 minutes on Friday night using the Python for Series 60 compatibility library from the wonderful folks at PDIS. I then spent some free time over the rest of the weekend abstracting out the KML bits and reverting my lofty smtplib goals to saving to a local file on the phone. I’m not sure if the problem is due to my limited T-Mobile access or if I need to patch smtplib in order to use it on my phone.

    There’s also one big downside to trying to use smtplib on the phone: smtplib (and gobs of dependent modules) aren’t distributed with the official Nokia PyS60 distribution, so if I’m going to distribute this app with smtplib functionality, I’ll have to package up a dozen or two library modules to go with it. I’m going to mull it over for a few days and see if I can get past my smtplib bug or investigate alternatives.
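
    For reference, the desktop-Python end of the eventual “write KML, send it via smtplib” step might look roughly like the sketch below (the server name and addresses are placeholders, and none of this has been tried on the phone yet):

    import smtplib
    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText
    from email.mime.application import MIMEApplication

    def mail_kml(kml_string, sender, recipient, smtp_host='smtp.example.com'):
        """Send a KML document as an email attachment (hypothetical helper)."""
        msg = MIMEMultipart()
        msg['Subject'] = 'A placemark for you'
        msg['From'] = sender
        msg['To'] = recipient
        msg.attach(MIMEText('Open the attachment in Google Earth.'))
        attachment = MIMEApplication(kml_string.encode('utf-8'),
                                     'vnd.google-earth.kml+xml')
        attachment.add_header('Content-Disposition', 'attachment',
                              filename='placemark.kml')
        msg.attach(attachment)
        server = smtplib.SMTP(smtp_host)
        server.sendmail(sender, [recipient], msg.as_string())
        server.quit()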

    from kml import Placemark

    I’ve started a rudimentary Python kml library designed with the Series 60 target in mind. It’s rather simplistic, and so far I’ve only implemented the simplest of Placemarks, but I plan to add to it as the need arises. It should be quite usable for generating your own KML Placemarks. Here’s a quick usage example:

    >>> from kml import Placemark
    >>> p=Placemark(39.28419, -76.62169, \
    "The O's Play Here!", "Oriole Park at Camden Yards")
    >>> print p.to_string()
    <kml xmlns="http://earth.google.com/kml/2.0">
    <Placemark>
      <description>The O's Play Here!</description>
      <LookAt>
        <longitude>-76.62169</longitude>
        <latitude>39.28419</latitude>
        <range>600</range>
        <tilt>0</tilt>
        <heading>0</heading>
      </LookAt>
      <Point>
        <coordinates>-76.62169,39.28419</coordinates>
      </Point>
    </Placemark>
    </kml>

    Once I have my Placemark object, saving to disk is cake:

    >>> f=open("camdenyards.kml", "w")
    >>> f.write(p.to_string())
    >>> f.close()

    If you have Google Earth installed, a simple double click should bring you to Camden Yards in Baltimore. The simplicity of it and the “just works” factor intrigue me: not so much the fact that this can be accomplished in a few dozen lines of Python, but the fact that KML seems so well suited for geographic data interchange.

    Camden Yards in Google Earth
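
    For the curious, here’s a stripped-down sketch of roughly what a Placemark class like this boils down to (simplified from what’s in my kml module, so treat it as illustrative rather than the real thing):

    class Placemark:
        """Bare-bones KML Placemark: enough to drop a labeled point on the map."""
        def __init__(self, lat, lon, description, name, view_range=600):
            self.lat = lat
            self.lon = lon
            self.description = description
            self.name = name
            self.view_range = view_range

        def to_string(self):
            # Note that KML wants coordinates as longitude,latitude.
            return (
                '<kml xmlns="http://earth.google.com/kml/2.0">\n'
                '<Placemark>\n'
                '  <name>%s</name>\n'
                '  <description>%s</description>\n'
                '  <LookAt>\n'
                '    <longitude>%s</longitude>\n'
                '    <latitude>%s</latitude>\n'
                '    <range>%s</range>\n'
                '    <tilt>0</tilt>\n'
                '    <heading>0</heading>\n'
                '  </LookAt>\n'
                '  <Point>\n'
                '    <coordinates>%s,%s</coordinates>\n'
                '  </Point>\n'
                '</Placemark>\n'
                '</kml>'
            ) % (self.name, self.description, self.lon, self.lat,
                 self.view_range, self.lon, self.lat)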

    It’s About Interchange

    If you are really into geographic data, and I mean at an academic or scientific level, KML probably isn’t the format for you. You might be more interested in the Open Geospatial Consortium’s GML (Geography Markup Language). It looks like it does a great job at what it does, but I’m thinking that the killer format is aimed more at the casual user. KML is just that. From a simple Placemark describing a dot on a map to complicated imagery overlays, KML has you covered. I find the documentation satisfying and straightforward, though I’m no expert on standards.

    In the very near future conveying where you are or what you are talking about in a standard way is going to be extremely important. Right now there’s only one major consumer of .kml files and that’s Google Earth. Expect that to change rapidly as people realize how easy it is to produce and consume geodata using KML and .kmz files (which are compressed .kml files that may also include custom imagery).

    I would love to see “proper” KML generators and consumers, written with XML toolkits instead of throwing numbers and strings around in Python. I would love to have a GPS-enabled phone spitting out KML using JSR-179, the Location API for J2ME. I hope to use Python for Series 60 to further prototype an application that uses a Bluetooth GPS receiver for location information and allow easy sharing of geodata using KML.

    The Code

    If you’d like, take a look at the current state of my kml Python library, which is extremely simple and naive, but it allows me to generate markup on either my laptop or N-Gage that Google Earth is happy to parse. A proof of concept wrapper around this library can be found here. I hope to expand both in the coming days, and I hope to soon have the smtplib-based code working properly on my phone with my carrier.

    Update: Oops, forgot to add the <name/> tag. Fixed. The name should now replace the (ugly) filename for your Placemark.

  • Thoughts on Ajaxian Advertising

    When I read Jason Calacanis’ post about Ajax and ad revenues, I couldn’t help but think about the flip side of the coin: how can advertisers (and webloggers/content publishers) take advantage of Ajax-fu to increase revenues?

    Sure, advertisers have been taking advantage of JavaScript to serve up popups and pop-unders, hijack your screen and hold you hostage, and a myriad of other nasties. That’s not what I’m talking about (although there’s always something new to make me look at crappy ads). I’m thinking more about smart stuff that can be done to increase value for advertisers as well as readers. What the heck am I talking about? Here are some thoughts:

    • Auto-Refreshing Text ads. This would have to be done carefully, as I tend to stay as far away as possible from ads that make my eyes bleed. I really love text ads because (at least with Google) they tend to be right on the money, relate to the rest of the content on the page, and often enough are interesting enough to click on in a short attention span kind of way. You could do all kinds of sexy stuff, like scroll the top ad off and bring one up from the bottom in a skyscraper configuration, or just do a fade swap for a new ad. Because they’re still text ads, and the new ad is probably just as targeted, it just might work. Most text ads are pay-per-click, not pay-per-view, so costs per ad wouldn’t go up any; you wouldn’t have to pay the content publisher any more unless there’s an actual click involved.
    • Context-Sensitive Ads, Part I: Take the success of text ads one step further. Who’s to say that an Ajaxian click can’t involve another impression? If someone chooses to drill down to more pictures of Lindsay Lohan, why not update that ugly sidebar ad? Or of course you could use the opportunity to scan the new content and update text ads if necessary. That’d be kinda cool.
    • Context-Sensitive Ads, Part II: Hey, take that one step further. How about ads that update themselves onHover()? Again, don’t be stupid. Get too fancy or make my eyes bleed and I’ll probably not come back. But couple this with unobtrusive, value-adding text ads (hover over a title containing the word “giraffe” and you get some text ads involving giraffes). This may not work quite as well as an Ajaxian fold/expand call (a click would make the user more likely to expect an “event” to happen). It’s also mighty tempting to flog something like this to death.
    • Interactivity in a meaningful way: No, I don’t want to punch the monkey. But how can Ajax help me interact with an ad in a way that I might find useful? What about an ad that provides me with some information (or something else) that I’m looking for. For example, if I were hawking Wikipedia, I might infer something based on the content from the page and serve up an excerpt from a related article. If I didn’t get it right the first time maybe I’d offer some other suggestions. A click on that suggestion might provide another excerpt rather than just send me on over to the page. Something like this might have the same effect as the Google multi-click banner ads that Jason describes here.
    • Something completely different: This whole Ajax thing is still a baby. There’s a lot that hasn’t been done with Ajax yet, and even more stuff that hasn’t been thought up yet. There’s a ton of potential here and I expect a lot of smart people to push the envelope. Who knows, some of them might even apply it to online ads.
  • Latest WordPress

      Please bear with the default WP theme for a little bit. I hope to get my previous template back up soon, but I finally took the plunge and updated to the latest WordPress release (something I should have done some time ago). Let me know if you experience anything super weird.

  • Django: Trivial Patch

      Last night I ran across a deprecation warning when running django-admin.py startproject <projectname>, so I went to file a ticket but found that someone else had experienced the same problem. I looked at the solution, and on the surface it looked like the fix involved replacing from whrandom import choice with from random import choice. Indeed that was all it took, so I submitted a (trivial) patch and continued to bang on Django a bit more.

      That trivial patch made me really wish that there were unit tests for Django. I would have felt a lot better knowing that after applying my patch, n tests still passed with flying colors. Without a test framework in place, I really had no idea if my trivial search-and-replace broke something. It’s possible that something somewhere in the code was expecting behavior specific to whrandom that was just slightly different from the behavior of random.

      I’m going to hunt around for other little trivial fixes that don’t require carnal knowledge of the codebase and submit patches if I can come up with a fix. At the same time I hope that Nelson’s test suite ticket gets noticed. I wouldn’t mind doing some of the dirty work once a framework is in place, but as always I defer to Adrian and the core Django team when it comes to policy and implementation.

  • Keeping Up With Django

      It’s been quite amazing watching this framework called Django, pulled from a production environment, evolve in realtime right before my eyes. Adrian has been committing changes left and right, fixing bugs, adding features, and most importantly lowering the barrier for new users. Docs and tutorials are being clarified, things made simpler, and a few of those nagging problems are disappearing in front of my eyes.

      Yesterday Adrian modified the cookie system so that we didn’t have to add a custom setting in order to make it work. He’s also moved DJANGO_SETTINGS_MODULE to a Python variable, so you should no longer have to set an environment variable in order to tell the system which module you want to run. Strike that: you still need the environment variable, the name is just user configurable. (Thanks Stefano!) And of course the addition of django-admin.py runserver lets you bypass mod_python or another WSGI-compliant server while you are just checking out the framework or during initial development.

      Kudos to the Django team and everyone in the quickly expanding community. If you’re having trouble with something, hop on #django at Freenode; there’s probably someone else in there who has experienced the exact same thing. And don’t forget to svn up often!

      Now that I’ve gone through the tutorials and have reasonably wrapped my head around the framework I plan to work on a small project to flex my newly found Django muscles.

      Update:

      As always, Adrian has made our lives simpler, this time with Changeset 247:

      Added ‘–settings’ option to django-admin. This specifies which settings module to use, if you don’t want to deal with setting the DJANGO_SETTINGS_MODULE environment variable. Refactored django-admin to use optparse. Updated the tutorials to use ‘–settings’ instead of environment variables, which can be confusing.

  • Pimp My Django

      Pimp My Django

      Yeah, baby! Thanks to tons of SVN updates, I’ve managed to plow through the second Django tutorial unscathed. I’m really impressed with the admin interface and how a dab of Python can do stuff like create a sidebar or create a rocking search interface.

      Huzzah!

  • Django: Making it Easier

      Yes! Adrian has just committed a patch that bypasses mod_python completely!

      mcroydon@mobilematt:~/django/proj$ django-admin.py runserver
      Starting server on port 8000. Go to http://127.0.0.1:8000/ for Django.

      svn up in your django_src dir and enjoy.

  • Django Gotchas

      I know that a lot of these things will get ironed out or explained more clearly once the platform matures and more people start using it, but here are some tips for early Django adopters trying to get through the tutorials:

      1. Don’t name your project test. I know it sounds like a good idea now. I did the same. Trust me, though: at the beginning of Tutorial 2 you’ll start kicking yourself when you run into namespace collisions with the test module. To make this warning a bit more general, you’re best off not naming your project anything that is a module in the Python Standard Library.
      2. Your project directory should be somewhere on your Python path. Under Linux your best bet is to set the PYTHONPATH environment variable. Here’s the gotcha: if your project is in /path/to/project, you actually want to set your PYTHONPATH variable to /path/to. On a per-session basis you can do this with something like export PYTHONPATH=/path/to.
      3. In order to really use django-admin.py as often as you’re going to, you really want to symlink it to /bin or /usr/bin. For me (again, under Ubuntu having checked out Django via Subversion), ln -s /usr/lib/python2.4/site-packages/django/bin/django-admin.py /usr/bin/django-admin.py did the trick.
      4. Really pay attention to the install instructions and make sure you symlink your django install to /usr/lib/python/site-packages correctly.
      5. I’m still dealing with the newbie cookie authentication bug that tends to creep up partway through the second tutorial, but I was able to get the admin login to pop up by using the following settings for Apache2 and mod_python (with special thanks to #django and the Django on OSX installation guide):
        <location /admin/>
                SetHandler mod_python
                PythonHandler django.core.handler
                PythonPath sys.path+['/home/mcroydon/django']
                SetEnv DJANGO_SETTINGS_MODULE proj.settings.admin
                PythonDebug On
        </location>

        The PythonPath to the directory containing your project is again quite critical. This is where the test collision will show up if you’ve been foolish enough to name your project test. Make sure that DJANGO_SETTINGS_MODULE points to your project. This worked perfectly for me under Ubuntu Hoary, but YMMV. PythonDebug On is your friend here and turns a plain jane 500 error into a traceback.

      Gotchas like these are going to have to be minimized in order to not scare away newbies and to enable that “running start” feeling that you get with Rails. I’m sure that a few of these are mitigated by the setup.py in the tarball, and the addition of WSGI support (along with the built-in server that I’m glad to hear is on the list somewhere) should take care of the rest.

      I do have to say that so far (gotchas aside) I’m really impressed with the platform. It’s great to be able to work with the Python interpreter to flesh out some test data (you can do similar things with the Ruby interpreter and Rails, but Python is my native language). I also got warm fuzzies when someone asked if there was a mail module for Django. Heck, Python has an awesome standard library, why not use it?

      Thanks again to everyone in #django on Freenode for all the help and guidance. Now it’s time to get back to that login gotcha.

      Note: Django is a moving target right now. Since I wrote this early this morning, support for a standalone server has been added (just after WSGI support was added) and lots of little bugs and niggles are being taken care of as I type.

  • Django: Python on Rails?

      Let the buzz over Django begin. I first saw it fly by very early this morning as Clint Ecker pointed to some documentation. Simon Willison has given it a proper introduction this morning.

      I definitely need to take a close look at Django if it can approach the productivity of Rails while speaking my native Python. I could be missing something, but I think one very important thing that Django needs in order to have that running start in development productivity is to ship with a small HTTP server available by default. Rails uses WEBrick for this and allows development without the need to mess with Apache or lighttpd in order to start coding. It should be trivial to add similar functionality to Django (with CGIHTTPServer and all).

      I don’t mean to rag on the new framework on the block. I think Django has a ton of potential. It’s off to the right start, having been extracted from a working environment and being worked on by some really smart people.

      We’ll see how this turns out, but I’m extremely excited.

  • Clusters: Open Source Meets Commodity Hardware

      During the last semester I wrote two papers for my Computer Architectures class. I spent quite a bit of time on them and have been thinking about posting them on my weblog for quite some time. I’m a bit worried about plagiarism, though, and I’m not sure what to do about it. I’m pretty sure that I can submit it to the auto-plagiarism-detection service that my university subscribes to, and I’m probably going to do that now that this paper is posted.

      Secondly, I’m releasing this paper under the by-nc-sa (Attribution NonCommercial ShareAlike 2.0) license, so unless you can turn in your paper to your teacher with a by-nc-sa license displayed on it, you can’t include it in your paper without proper citation.

      PLEASE NOTE: If you are considering plagiarizing this, please don’t. If your teacher allows you to cite non-academic internet sources, then by all means borrow my ideas and cite me. What I would really suggest doing is taking a look at my primary sources and then heading to your university library or computer system to consult them yourself. All of the ACM journal sources that I cited are available online if your university subscribes to the ACM Portal. This paper was thoroughly researched, but there were some late nights involved in its production, so it is provided WITHOUT WARRANTY against correctness or anything like that. My RAID paper is probably a little better than this one.

      Creative Commons License
      This work is licensed under a Creative Commons License.

      Clusters:

      Open Source Meets

      Commodity Hardware

      Matt Croydon

      CMSC 311

      May 6, 2005

      Before the mid-1980's, most supercomputers were large, monolithic machines. Over time the Top 500 Supercomputers list has seen clusters go from non-existent to being the dominant architecture[1], currently representing 296 out of the top 500 slots (almost 60%)[2]. Compared to monolithic supercomputers such as those from Cray Research, clusters are extremely cheap for the amount of performance realized. When lower cost is combined with cheap off-the-shelf hardware and open source software platforms, clusters can't help but improve and gain popularity.

      Different Tools for Different Jobs

      The definition of the word “cluster” varies greatly depending on the context in which it is used. A cluster is commonly used in high availability situations when, for example, equipment must gracefully fail over or requests must be divided among the available hardware. For clustered application servers, this can be accomplished by simple round robin DNS entries or more complex load balancing hardware or software.

      Clusters are also used when data needs to be stored on multiple machines for redundancy or performance sake. MySQL, an open source database program, can use a cluster of computers for both data replication as well as load balancing[3] in several configurations.

      This paper will focus on the most common and most popular form of clustering: clusters used for parallel computing, scientific computing, and simulation in educational, professional, and government organizations. More specifically, it will focus on open source software that is available to make the construction and administration of clusters easier and more powerful.

      A Brief History

      Cluster computing has its roots in the mid-1980s, when developers wanted to tie together multiple computers in order to harness their collective power. In 1985, Amnon Barak developed the first predecessor to Mosix, called MOS, which ran on a cluster of four Digital Equipment Corporation (DEC) PDP/11 computers[4]. In 1986, DEC decided to try clustering for themselves with VAXCluster[5]. At the time, VAXCluster was able to take advantage of a much higher data rate of 70 Mbit/sec[5], but because of the proprietary interconnect used, VAXCluster remained much more tightly coupled, while MOS and Mosix used token ring LAN technology[4]. As Mosix was ported to other platforms and improved, it was also able to take advantage of advances in networking technology without its looser coupling being affected. Mosix relied on patches to the Unix kernel in order to allow processes to migrate among nodes in the cluster. Mosix was later ported to the Linux kernel by Moshe Bar[6], where it thrives as an open source project.

      Beowulf came onto the scene in the early to mid 1990s with a huge splash. Beowulf allowed users to tie together large numbers of low-cost desktop machines (486 DXes at the time) rather than the specialized hardware used by Mosix and VAXCluster[7]. The project originated at NASA's Goddard Space Flight Center and quickly led to a successful open source project[8] as well as a successful business, Scyld Software, for some of the original developers.

      Beowulf Internals

      PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) are standards for performing parallel operations. Both frameworks have language bindings (for example FORTRAN, C, Perl, and other languages) that abstract the underlying standard into something easier to work with. The direction of MPI is steered by a group called the MPI Forum. Since the release of the initial specification, the MPI Forum has updated the specification to MPI 2.0, adding features and clarifying issues that were deemed important[9]. Each cluster software, tool, or operating system vendor implements their own version of MPI, so cross-platform compatibility is not guaranteed, but porting between MPI implementations is quite possible.

      In contrast to MPI, PVM software is usually provided at the PVM website[10] in either source or binary form. From there users can call the PVM library directly or through third party bindings. PVM provides binaries for Windows, which can allow users to program parallel applications on a platform that they may be more familiar with. However, most Beowulf clusters run standard Linux or some variant thereof. PVM also supports monolithic parallel computers such as Crays and other specialized machines. Further differences and similarities between MPI and PVM can be found in the paper Goals Guiding Design: PVM and MPI by William Gropp and Ewing Lusk[11].
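
      To give a concrete sense of the message-passing style both standards support, the short sketch below uses mpi4py, one of several Python bindings for MPI (mpi4py is not covered in this paper; it simply illustrates the rank/size/send/receive primitives that any binding exposes):

      # Minimal MPI example using the mpi4py binding (illustrative only).
      # Run with something like: mpiexec -n 4 python hello_mpi.py
      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank = comm.Get_rank()   # this process's id within the communicator
      size = comm.Get_size()   # total number of processes

      if rank == 0:
          # The master process gathers a greeting from every worker.
          for source in range(1, size):
              message = comm.recv(source=source, tag=0)
              print("master received: %s" % message)
      else:
          comm.send("hello from rank %d of %d" % (rank, size), dest=0, tag=0)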

      In recent years, another tool called BProc (the Beowulf Distributed Process Space) has expanded the abilities of parallel processing and management of data between nodes. BProc allows a parallel job to be started on the main controlling node and parallel-capable processes are automatically migrated to child nodes[12]. This paradigm is also used by Mosix and OpenMosix, which will be discussed later. BProc is an open source project available at bproc.sourceforge.net.

      Parallel processes also need to take into consideration the amount of time that will be needed for preparation, cleanup, and merging of parallel data. Amdahl's law[26] stipulates that the total execution time for a parallel program is equal to the parallel part of the problem divided by the number of nodes, plus the serial part of the program. Even if a cluster contains thousands of nodes, the amount of time it takes to execute the serial code is going to remain constant.
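
      As a quick illustration of that relationship, the toy calculation below applies Amdahl's law to a hypothetical job (the 1000-second runtime and 90% parallel fraction are made-up numbers):

      def amdahl_time(total_time, parallel_fraction, nodes):
          """Total runtime under Amdahl's law: serial part + parallel part / nodes."""
          serial = total_time * (1.0 - parallel_fraction)
          parallel = total_time * parallel_fraction
          return serial + parallel / nodes

      # A hypothetical 1000-second job that is 90% parallelizable:
      for nodes in (1, 10, 100, 1000):
          print("%4d nodes -> %.1f seconds" % (nodes, amdahl_time(1000.0, 0.9, nodes)))
      # The runtime approaches the 100-second serial floor no matter how many
      # nodes are added.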

      How to Build a Beowulf[7]

      Large Beowulf clusters run complex simulations and perform trillions of floating-point operations per second. At the same time, small 4-16 node clusters are often used in educational settings to teach parallel processing design paradigms to Computer Science students as well as cluster design and implementation to Computer Engineering students.

      College Beowulf clusters are often (but not always) comprised of outdated computers and hand-me-down hardware. While extremely fast speeds cannot be obtained with these antiquated clusters, they are valuable in teaching and observing the differences between a program or algorithm written for a single processor machine and the same program/algorithm written for and run on a cluster.

      There are several tools available for deploying a Beowulf cluster, but almost all require a basic installation of a compatible Linux distribution on either the master node or on the master and all child nodes. Scyld Software makes what is widely considered the easiest Beowulf software to install. All that is needed is a $3 CD[14] containing an unsupported Scyld distribution for the master and each child node. Official copies with commercial support are also available directly from Scyld. Once the CD is booted on the master node, a simple installation menu is presented. After installing and configuring Scyld on the master node, insert a Scyld CD in each child node; the child nodes automatically get their configuration information from the master node and can run directly from CD.

      Another popular package that runs on top of many modern RPM-based Linux distributions is OSCAR, the Open Source Cluster Application Resources project[15]. OSCAR offers a very simple user interface to install and configure cluster software on the master node. Once that is accomplished, client nodes can be network booted and the client software automatically installed. OSCAR also supports other installation and boot methods.

      While many colleges take the small cluster approach, Virginia Tech has taken advantage of the modern Macintosh platform and created a top 10 supercomputer for a fraction of traditional costs. Virginia Tech started out with desktop machines, but now maintains a cluster of 1100 Apple XServe 1U servers running Mac OS X server (based on an open source BSD-derived core called Darwin).

      Another Approach to Clustering: OpenMosix

      While most Beowulf clusters are dedicated to cluster-related tasks all the time, clustering does not have to be that way. OpenMosix is a set of patches to the Linux kernel and a few userland monitoring tools for keeping track of where processes are running and how efficiently. OpenMosix is extremely flexible. Nodes can join or leave a cluster whenever they wish. Many programs and algorithms can take advantage of clustering with the automatic node migration built in to OpenMosix. Whenever a new process is spawned or forked (as is common in traditional Unix-like software design) OpenMosix may choose to execute that process locally or on another node.

      Many OpenMosix clusters are implemented in a head/client node configuration much like Beowulf clusters, but they are not limited to such configurations. Because OpenMosix is just a patch to the standard kernel, machines in a cluster can have multiple uses. They can run standard graphical window managers and be used as desktop machines while processes are migrated to them if they have computing cycles to spare. OpenMosix does an excellent job at making sure that client nodes still have enough resources to do whatever else they are doing in addition to cluster process execution.

      In addition to the multi-use scenario, OpenMosix cluster nodes can run as true peers. For example, if there are 20 computers currently connected to a dynamic cluster and all but a few of them are idle, processes from the machines being actively used can be automatically migrated for execution throughout the cluster. Similarly, if all computers are heavily used, virtually no process migration will occur, since execution will be quicker on the local machine. Also, if a 400MHz desktop machine needs to do some complex calculations, as long as the program is written in a way that can take advantage of process migration, those calculations could be run extremely quickly on an idle 3GHz machine. Many of the scenarios above are described in a Linux Journal article entitled Clusters for Nothing and Nodes for Free[16], but they also come from my experiences building and experimenting with a 2-3 node OpenMosix cluster a few years ago[17].

      Recently the OpenMosix community has embraced “instant clusters,” or the idea that any hardware with local network connections can become a cluster without interfering with its other uses. The OpenMosix website lists a page[18] with several open source “instant cluster” software projects. The most popular project is called ClusterKnoppix[19], a Linux distribution with OpenMosix installed on it that runs directly from CD-ROM. With a minimum of one CD burned on a master node, a 30 seat computer lab can instantly become a 30 node cluster without disturbing the operating system installed on the hard drives.

      To share data among nodes, OpenMosix uses the Cluster File System, a concept originally developed for the Mosix project called the Mosix File System[4]. The file system was renamed after the Mosix project closed its source code and Moshe Bar and others began working on the GPL-licensed[20] code which would become OpenMosix between 2001 and 2002. This cluster file system along with the ability to run a cluster as peer-nodes gives OpenMosix quite an advantage over traditional monolithic and cluster systems.

      How Open Source Helps Clusters

      While some computational clusters run on Windows, the vast majority run on top of an open-source Linux distribution. The Linux Kernel itself is open source and depending on the Linux distribution, all, most, or at least some of the operating system is open source. Sometimes Linux distributions can be open source without being free (as in no cost) such as Red Hat Enterprise Linux. There are many excellent free (open source and no cost) Linux distributions to run Beowulf, OpenMosix, or any other type of clustering software on.

      There are many open source applications that help users install, configure, and maintain clusters; many have been mentioned before. These include OSCAR, Beowulf and OpenMosix themselves, various PVM and MPI implementations, BProc, and more. In addition to the tools already mentioned, there is a suite of open source utilities for OpenMosix called openMosixView[21]. The various programs included in the suite allow for visualization as well as graphical management of the cluster, provide visual feedback on processes, process migration, and load per node, and also allow for logging and analysis of cluster performance.

      There are many other interesting open source clustering projects that don't require a Beowulf or OpenMosix frame to run on. One of the most popular examples is distcc[22], a program that allows for distributed compilation of C or C++ code. Distcc is quite lightweight and does not require a shared filesystem; it just requires child nodes to be running distcc in daemon mode.

      The Future of Clusters

      While Robert Lucke considers openMosix the next generation of clustering software because of its flexibility[23], some of the most stunning advances are happening in the world of grid and distributed computing[24]. Grid computing can mean different things to different people, but generally extends computing platforms beyond location and geography.

      The SETI@home project[25] has managed to create a very powerful supercomputer by utilizing the spare CPU cycles of thousands of desktop machines spread throughout the world. The program usually runs as a screen saver so that it does not consume computing resources while the machine is being actively utilized. SETI@home and other projects are pushing the envelope of using spare processor cycles to tackle a task that would otherwise require large dedicated clusters or supercomputers.

      While grid and distributed computing may take away part of the supercomputing market share held by clusters (particularly those built on open source software and commodity hardware), I believe that clusters are here to stay. Individual component prices continue to drop, network throughput is improving, and cluster software continues to evolve. Expect to hear even more about clusters over the next several years.

      References

      [1] Top500 Supercomputer Sites, “Charts for November 2004 – Clusters (NOW),” April 2005, http://top500.org/lists/2004/11/overtime.php?c=5.

      [2] Top500 Supercomputer Sites, “Highlights from Top500 List for November 2004,” April 2005, http://top500.org/lists/2004/11/trends.php.

      [3] J. Zawodny and D. Balling, High Performance MySQL, Sebastopol: O'Reilly and Associates, 2004, chaps. 7 and 8.

      [4] A. Barak et al, The Mosix Distributed Operating System: Load Balancing for Unix, Berlin: Springer-Verlag, 1993, pp. 1-18.

      [5] N. Kronenberg et al, “VAXcluster: a closely-coupled distributed system,” in ACM Transactions on Computer Systems (TOCS), 1986, pp. 130-146.

      [6] The openMosix Project, “openMosix, an Open Source Linux Cluster Project,” April 2005, http://openmosix.sourceforge.net/.

      [7] T. Sterling et al, How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters, Cambridge, Mass: The MIT Press, 1999.

      [8] The Beowulf Project, “Beowulf.org: The Beowulf Cluster Site,” April 2005, http://www.beowulf.org/.

      [9] The MPI Forum, “Message Passing Interface,” April 2005, http://www-unix.mcs.anl.gov/mpi/.

      [10] Computer Science and Mathematics Divison, Oak Ridge National Laboratory, “PVM: Parallel Virtual Machine,” April 2005, http://www.csm.ornl.gov/pvm/pvm_home.html.

      [11] W. Gropp and E. Lusk, “Goals Guiding Design: PVM and MPI,” in IEEE International Conference on Cluster Computing (CLUSTER'02), 2002, pp. 257-268.

      [12] E. Hendricks, “BProc: The Beowulf Distributed Process Space,” in Proceedings of the 16th International Conference on Supercomputing, 2002, pp. 129-136.

      [13] P. Prins, “Teaching Parallel Computing Using Beowulf Clusters: A Laboratory Approach,” in Journal of Computing Sciences in College, 2004, pp. 55-61.

      [14] Linux Central, “CDROM with Scyld Beowulf,” April 2005, http://linuxcentral.com/catalog/index.php3?prod_code=L000-089.

      [15] Open Source Cluster Application Resources, “OSCAR: Open Source Cluster Application Resources”, April 2005, http://oscar.openclustergroup.org/.

      [16] A. Perry et al, “Clusters for Nothing and Nodes for Free,” Linux Journal, Vol 2004, Issue 123, July, 2004.

      [17] Matt Croydon, “OpenMosix Success,” April 2005, http://www.postneo.com/2002/11/20/openmosix-success.

      [18] openMosix, “Instant openMosix, The Fast Path to an openMosix Cluster,” April 2005, http://openmosix.sourceforge.net/instant_openmosix_clusters.html.

      [19] ClusterKnoppix, “ClusterKnoppix: Main Page,” April 2005, http://bofh.be/clusterknoppix/.

      [20] Open Source Initiative, “Open Source Initiative – The GPL: Licensing,” April 2005, http://www.opensource.org/licenses/gpl-license.php.

      [21] openMosixView, “openMosixView: a cluster-management GUI,” April 2005, http://www.openmosixview.com/index.html.

      [22] Martin Pool, “distcc: a fast, free distributed C/C++ compiler,” April 2005, http://distcc.samba.org/.

      [23] R. Lucke, Building Clustered Linux Systems (Hewlett-Packard Professional Books), Upper Saddle River, New Jersey: Prentice Hall, 2004.

      [24] M. Holliday et al, “A Geographically-distributed, Assignment-structured, Undergraduate Grid Computing Course” in Proceedings of the 36th SIGCSE technical symposium on Computer science education, 2005, pp 206-210.

      [25] The SETI@home Project, “SETI@home: Search for Extraterrestrial Intelligence at Home,” April 2005, http://setiathome.ssl.berkeley.edu/.

      [26] G. Pfister, In Search of Clusters: The Coming Battle in Lowly Parallel Computing. Upper Saddle River, New Jersey: Prentice Hall, 1998. pp. 184-185.

  • RAID: Redundant Array of [Independent|Inexpensive] Disks

      During the last semester I wrote two papers for my Computer Architectures class. I spent quite a bit of time on them and have been thinking about posting them on my weblog for quite some time. I’m a bit worried about plagiarism, though, and I’m not sure what to do about it. I’m pretty sure that I can submit it to the auto-plagiarism-detection service that my university subscribes to, and I’m probably going to do that now that this paper is posted.

      Secondly, I’m releasing this paper under the by-nc-sa (Attribution NonCommercial ShareAlike 2.0) license, so unless you can turn in your paper to your teacher with a by-nc-sa license displayed on it, you can’t include it in your paper without proper citation.

      PLEASE NOTE: If you are considering plagiarizing this, please don’t. If your teacher allows you to cite non-academic internet sources, then by all means borrow my ideas and cite me. What I would really suggest doing is taking a look at my primary sources and then heading to your university library or computer system to consult them yourself. All of the ACM journal sources that I cited are available online if your university subscribes to the ACM Portal. This paper was thoroughly researched, but there were some late nights involved in its production, so it is provided WITHOUT WARRANTY against correctness or anything like that.

      Creative Commons License
      This work is licensed under a Creative Commons License.

      Matt Croydon

      CMSC 311

      March 9, 2005

      The term RAID originally stood for “Redundant Arrays of Inexpensive Disks” [1], although an effort has been made to replace Inexpensive with Independent [2] in order to de-emphasize the importance of cost. In modern practice the words are used interchangeably, and in most computer-oriented contexts the meaning is commonly understood. RAID technology was developed to improve upon monolithic SLED (Single Large Expensive Disk) [1] devices. In addition to being large and expensive, these drives offered fixed levels of input and output performance and, in the late ’80s and early ’90s, were not keeping pace with the rest of semiconductor technology [2].

      There are several discrete configurations, or levels, of RAID, each with its advantages and disadvantages. The various levels are conceptual and not necessarily tied to a specific implementation. RAID can be accomplished at either the hardware or the software level. Hardware-based RAID tends to provide higher overall performance, while software-based RAID offers lower cost and greater flexibility.

      Redundancy is required because as more disks are added to an array, the MTTF (Mean Time to Failure) [1] decreases sharply. For example, if each individual drive is rated for 30,000 hours and there are 100 disks in the array, the MTTF for the array is the MTTF of each individual drive divided by the number of drives. The MTTF of the 100-drive array is therefore 300 hours, a far cry from the 30,000 hours that each unit is rated for [2].
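
      To make that arithmetic concrete, here is a quick back-of-the-envelope calculation in plain Python (my own sketch, using the numbers from the example above and assuming independent, identically rated drives with no redundancy):

      # Simplified MTTF estimate for a non-redundant array:
      # MTTF(array) = MTTF(single drive) / number of drives.
      drive_mttf_hours = 30000
      num_drives = 100

      array_mttf_hours = drive_mttf_hours // num_drives
      print(array_mttf_hours)  # 300 hours for the 100-drive array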

      RAID Level 0 and JBOD

      RAID 0 is not part of the original specification [3] and provides absolutely no redundancy; however, it does employ data striping. Data striping is an important concept in some RAID configurations. RAID 0 is often implemented in hardware controllers that also support other levels of RAID. RAID 0 allows for extremely fast write performance but does not significantly improve on read access time [2].
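
      To illustrate what striping means in practice, here is a toy mapping from a logical block number to a (disk, offset) pair. The round-robin scheme and the raid0_locate helper are made up for this example, not the layout of any particular controller:

      # Hypothetical RAID 0 mapping: logical blocks are spread
      # round-robin across the disks in the stripe set.
      def raid0_locate(logical_block, num_disks):
          disk = logical_block % num_disks     # which disk holds the block
          offset = logical_block // num_disks  # block offset on that disk
          return disk, offset

      # Logical blocks 0..7 on a 4-disk stripe set land on disks 0,1,2,3,0,1,2,3.
      print([raid0_locate(b, 4) for b in range(8)])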

      The other non-redundant RAID technology is JBOD, which stands for “Just a Bunch of Disks” [4]. JBOD uses either RAID hardware or software to combine multiple disks so that they appear as one logical device to the operating system. JBOD allows for easy storage capacity expansion and is in common usage on both Windows and Linux platforms among others.

      RAID Level 1

      RAID 1 uses mirroring in order to achieve redundancy [3]. For every disk of data, there is a mirrored disk that contains an exact copy of the original disk [5]. While every write to the array has to be performed twice (first to the original drive, then to the mirrored drive), read speeds can be improved. Because there are two copies of the data, the drive that can retrieve the data quickest can be used. Both drives may also serve read requests simultaneously, thereby increasing read speed. If one drive in a two-drive array fails, the remaining drive can be used for reading and writing until the defective disk can be replaced. Once a new drive is placed in the array, data can be copied over and eventually mirroring once again takes place in real time.
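
      As a toy sketch of the bookkeeping described above (the Mirror class is invented for illustration, nothing like a real driver), every write lands on both copies while a read can be served by either one:

      # Toy RAID 1 mirror: two in-memory "disks" holding identical data.
      class Mirror(object):
          def __init__(self):
              self.disks = [{}, {}]

          def write(self, block, data):
              for disk in self.disks:          # the write must hit both copies
                  disk[block] = data

          def read(self, block):
              return self.disks[0].get(block)  # either copy could answer

      m = Mirror()
      m.write(0, 'task list')
      print(m.read(0))  # 'task list'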

      RAID Level 2

      RAID 2 uses the same ECC (Error Correcting Code) as ECC memory [2]. In addition to the data disks, a number of check disks are used to store the ECC data. If Hamming ECC is used, an array of 10 data disks would need 4 check disks and an array of 25 data disks would require 5 check disks [1]. The extra disks are required to be able to detect and repair an unrecoverable error. In RAID 2, data is striped bit by bit across the data disks while the ECC data is written to the check disks [1].
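
      Those check-disk counts line up with the usual single-error-correcting Hamming relation, 2^c >= d + c + 1. Here is a quick script to verify the numbers above (my own arithmetic check with a made-up helper name, not code from the cited paper):

      # Smallest number of check disks c for d data disks under a
      # single-error-correcting Hamming code: 2**c >= d + c + 1.
      def hamming_check_disks(d):
          c = 1
          while 2 ** c < d + c + 1:
              c += 1
          return c

      print(hamming_check_disks(10))  # 4 check disks for 10 data disks
      print(hamming_check_disks(25))  # 5 check disks for 25 data disks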

      RAID Level 3

      The next level of RAID assumes that most hardware or software RAID controllers will be able to detect an error on their own. Since a single check disk is enough to recover from an error, the job of error detection can be left to the controller and all but one of the check disks used in RAID 2 can be eliminated [1]. This strategy cuts down on cost without sacrificing redundancy as long as every bit on all of the other data disks and the check disk can be successfully read. The contents of the bad disk can be obtained by finding the parity of the disks that have not failed and comparing each bit to the parity of all of the disks as stored on the check disk. If the values are identical, the bad disk originally held a 0 in that position. If the values differ, it held a 1 [1].
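
      The recovery procedure is easier to see in code. Here is a tiny sketch of XOR parity, applied to small made-up integers rather than bit by bit across real disks, that rebuilds a “failed” disk from the survivors and the check value:

      # XOR parity: the check disk stores the XOR of all data values.
      data = [0b1010, 0b0110, 0b1111]        # three data "disks"
      parity = 0
      for value in data:
          parity ^= value                     # what the check disk holds

      lost = 1                                # pretend disk 1 failed
      survivors = data[:lost] + data[lost + 1:]
      recovered = parity
      for value in survivors:
          recovered ^= value                  # XOR the survivors back out

      print(recovered == data[lost])          # True: 0b0110 is rebuilt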

      RAID Level 4

      RAID 4 also uses only one check disk but stripes data across the data drives in chunks rather than bit by bit. The check disk stores the parity information for each chunk of data. RAID 4 is very efficient for systems, such as transaction processing, that require many very small reads from the disk array. If the data is smaller than the storage chunk size, the array can furnish multiple requests simultaneously [6].
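
      A hypothetical chunk-to-disk mapping for RAID 4, with data chunks striped across all but the last disk and parity always kept on that last disk (the raid4_locate helper is my own illustration of the layout described above):

      # RAID 4 sketch: with N disks, N-1 hold data chunks and the last
      # disk always holds the parity block for each stripe.
      def raid4_locate(chunk, num_disks):
          data_disks = num_disks - 1
          disk = chunk % data_disks       # data chunks skip the parity disk
          stripe = chunk // data_disks    # stripe (row) number
          parity_disk = num_disks - 1     # fixed, dedicated parity disk
          return disk, stripe, parity_disk

      print(raid4_locate(5, 5))  # chunk 5 -> disk 1, stripe 1, parity on disk 4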

      RAID Level 5

      RAID 5 is the most commonly deployed configuration [7] in commercial settings and distributes the parity blocks evenly across all disks [2]. Because the data and parity are spread across all disks, RAID 5 excels at both small and large reads, and large writes. RAID 5 requires a “read-modify-write” [2] cycle to calculate and write parity information, so RAID 5 is less than optimal when it comes to many small writes [2].
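
      The read-modify-write cycle boils down to one XOR identity: the new parity is the old parity with the old data removed and the new data folded in. A minimal sketch with made-up values:

      # Small-write parity update in RAID 5.
      old_data = 0b1010
      old_parity = 0b0011
      new_data = 0b0001

      # Two reads (old data, old parity) and two writes (new data, new parity)
      # for a single logical write -- the penalty that hurts small-write loads.
      new_parity = old_parity ^ old_data ^ new_data
      print(bin(new_parity))  # 0b1000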

      Advanced RAID Configurations

      There are several hybrid RAID configurations that, while not part of the original RAID specification, can improve reliability and redundancy in certain situations. RAID 6 employs two distinct parity calculations for each chunk of data stored [4]. RAID 6 appears to be more theoretical than practical, as there are no guidelines for implementing it. RAID 6 differs from most RAID configurations in that it can recover from two unrecoverable errors, as long as the rest of the data and parity information can be read successfully.

      While many combinations of RAID components are possible, only a few are common. These include RAID 10, RAID 50, and RAID 0+1. RAID 10 significantly improves reliability by providing “a stripe set across mirrored pairs” [7]. This means that RAID 10 can recover from two total failures as long as the two failed disks sit in different mirrored pairs. Similarly, RAID 50 combines two RAID 5 arrays. RAID 50 is extremely redundant and not practical for most purposes. RAID 0+1 simply constructs a RAID 1 array out of several RAID 0 arrays. In RAID 0+1, a single disk failure brings down that entire half of the mirror until the bad disk is replaced [7].

      Increasing RAID Throughput

      Many modern hardware RAID controllers contain onboard memory caches to speed up input and output. Caching of data and parity blocks was found to increase throughput in the early to mid 90’s [9]. The physical location of parity blocks in RAID 5 has been proven to influence throughput [10]. In their study, Lee and Katz determined that left-symmetric, extended-left-symmetric, and flat-left-symmetric parity configurations were the best for overall use [10]. The absolute best parity configuration for RAID 5 drives depends on the size and number of both reads and writes.
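
      As a rough illustration of rotating parity placement, here is my approximation of a left-symmetric style layout: parity rotates backwards across the disks and sequential data blocks land on successive disks. The rotating_layout helper is a sketch for intuition, not Lee and Katz’s exact definition:

      # Rotating-parity layout sketch (left-symmetric flavor): 'P' marks the
      # parity block for each stripe, 'D<n>' the n-th logical data block.
      def rotating_layout(num_disks, num_stripes):
          layout = []
          block = 0
          for stripe in range(num_stripes):
              parity_disk = (num_disks - 1) - (stripe % num_disks)
              row = [None] * num_disks
              row[parity_disk] = 'P'
              for i in range(1, num_disks):
                  row[(parity_disk + i) % num_disks] = 'D%d' % block
                  block += 1
              layout.append(row)
          return layout

      for row in rotating_layout(5, 5):
          print(row)
      # ['D0', 'D1', 'D2', 'D3', 'P']
      # ['D5', 'D6', 'D7', 'P', 'D4']
      # ...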

      Strategies for Increased Reliability

      There are several ways to increase RAID reliability, even in simple arrays. Because the different RAID levels are merely suggestions for how to accomplish redundancy, specific implementations may vary. For a simple 2 disk RAID 1 array, you have the option of placing both disks on one hardware controller or (if supported) you may place each disk on its own controller and have the two controllers coordinate mirroring [7]. In this configuration, the failure of any one RAID controller does not bring down the entire array.

      Hybrid arrays (as discussed in the Advanced RAID Configurations section above) can also increase reliability by creating multi-tiered or multi-leveled arrays. Advanced configurations need to be used with caution, since the MTTF decreases sharply as the total number of disks increases.

      As per-disk capacity increases, it is possible to implement RAIDs with identical storage capacity while using fewer overall disks. If fewer disks are used, the MTTF increases. Unfortunately with increased storage capacity comes an increased need for storage, so decreasing the total number of disks in a RAID may not be possible.

      RAID Today

      In the early days of RAID research, SCSI was the only technology that easily allowed for RAID configurations. Today that is changing rapidly with the introduction of extremely large capacity IDE and Serial ATA drives as well as lower cost hardware controller cards for them. These lower costs of entry have allowed RAID to spread from university research labs and large corporations all the way down to home users seeking data protection. Many mid-range to high-end motherboards now have an IDE or Serial ATA RAID controller built in.

      RAID technology is also being used extensively in large server farms and storage facilities. Elaborate collections of RAID arrays are often combined with network technology such as SAN (storage area networks) and NAS (network attached storage) to meet the always-on accessible-anywhere needs of today’s customers.

      RAID has also become a built-in part of Microsoft’s Windows operating system and has been incorporated into the Linux kernel [11]. Software-based RAID further reduces entry costs, though generic IDE RAID controllers can be found in stores for well below $50. A better-known hardware RAID controller from Adaptec or others can range from $100 for IDE to several hundred dollars for advanced SCSI Ultra 160 controllers.

      Conclusion

      Using a RAID may lull users into a false sense of security. Most RAID configurations protect against only one unrecoverable error and usually require that every other bit be read successfully in order to recover the data. Just because a RAID is in use does not mean that its users are invincible. Rigorous, recoverable backups should be implemented in addition to RAID technology.

      With that caution in mind, RAID can provide redundancy that would not otherwise be available. If a specific RAID configuration is tailored to a specific workload profile (many small writes, continuous large reads, etc.), a significant increase in throughput can be realized.

      RAID, a technology that started out as graduate and Doctoral research projects, now powers a wide array of technology from home computers to large datacenters. RAID allows advanced research facilities and corporate databanks alike to achieve redundancy on collections of data that commonly reach terabytes and petabytes [12].

      References

      [1] D. Patterson, G. Gibson, and R. Katz, “A Case for Redundant Arrays of Inexpensive Disks (RAID),” in Proceedings of the 1988 ACM SIGMOD international conference on Management of data, 1988, pp. 109-116.

      [2] P. Chen et al, “RAID: High-Performance, Reliable Secondary Storage,” ACM Computing Surveys, Vol 26, pp. 145-185, June 1994.

      [3] M. Shnier, Ed., Dictionary of PC Hardware and Data Communications Terms, Sebastopol: O’Reilly and Associates, 1996, pp. 362-363.

      [4] M. Shooman, Reliability of Computer Systems and Networks, New York: John Wiley and Sons, 2002, pp. 119-126.

      [5] G. Gibson, Redundant Disk Arrays: Reliable, Parallel Secondary Storage, Cambridge: MIT Press, 1992.

      [6] R. Jain et al. Eds., Input/Output in Parallel and Distributed Computing Systems, Boston: Kluwer Academic Publishers, 1996, pp. 106-108.

      [7] C. Zacker and J. Rourke, PC Hardware: The Complete Reference, Berkeley: Osborne/McGraw Hill, 2001, pp. 606-613.

      [8] PC Guide, “Multiple (Nested) RAID Levels,” March 2005, http://www.pcguide.com/ref/hdd/perf/raid/levels/mult.htm.

      [9] J. Menon and J. Cortney, “The Architecture of a fault-tolerant cached RAID controller,” in Proceedings of the 20th annual international symposium on Computer architecture, 1993, pp. 76-87.

      [10] E. Lee and R. Katz, “Performance consequences of parity placement in disk arrays,” in Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, 1991, pp. 190-199.

      [11] I. Molnar, G. Oxman, and M. de Icaza, “Kernel Korner: The New Linux RAID Code,” Linux Journal, Vol 1997, Article No. 25, December 1997.

      [12] Los Alamos National Laboratories Networked Systems Research Team, “Announcements,” March 2005, http://public.lanl.gov/netsys/.

    • SDLQuake on Maemo

      SDLQuake on Maemo x86

      Yep, it had to be done. Above you can see SDLQuake running on Maemo x86. I haven’t tried it on the ARM target but I heard that it or a port of it should run just fine on ARM. Between various emulators and game engines, it shouldn’t be hard at all to amuse yourself with a Nokia 770.

      No changes were required for this x86 build: just ./configure, make, and run-standalone.sh ./sdlquake.

    • Python For Series 60: 1.1.0 Pre-Alpha

      There’s a new version out, 1.1.0 Pre-Alpha. Grab the .SIS installer for first edition devices (3650, N-Gage, etc.) or for second edition devices. Don’t forget to pick up the first edition or second edition SDK.

      I’ll read over the new API docs tonight and hope to find all kinds of juicy morsels.

      Update: Erik Smartt fills in some details on his weblog. Thanks again to the whole Python for Series 60 team for all the hard work.

    • More Maemo Madness

      Calcoo
      Calcoo, an RPN and algebraic calculator
      Gnuboy 3x zoom
      Gnuboy, zoomed 3x using xgnuboy.
      VTE
      VTE terminal emulator.

      More successful builds on Maemo x86 today. I’m still in the information-gathering stage, trying to find projects that are worth spending more time on for “proper” hildonization. All of the above screenshots were derived from downloading a source tarball and running ./configure and make, nothing more. VTE was exciting because it didn’t fail out on dependencies that I can’t easily provide, and with run-standalone.sh the soft keyboard just popped up. Having a decent, usable terminal emulator is going to be a key item for a physical 770 device. The error-free build is encouraging.

      When I have some more time I would like to package up some of the apps that I have been tinkering with as .debs for distribution, but I’d like to stress that everything I’ve posted in the last few days builds on x86 with little or no modification. They’re far from well-integrated Maemo apps and I’ve only tested a handful on an ARM target (I’m waiting for the next scratchbox/qemu release to do any real testing), but it’s definitely a start.

    • More Maemo Success

      I managed to get a few more things compiled and running on Maemo (mostly on x86) over the weekend. Proper Maemo ports are also starting to come in from new sources. This CPU/Memory usage meter is hildonized and designed to fit in the top statusbar. Great little hack! There are a bunch of gnome-applet style things that would do great in that status bar.

      Two of the best places for fully-featured Maemo apps are INDT’s Maemo Apps page and the Kernel Concepts Maemo page.

      With that said, on to some more low-hanging fruit: apps that compiled with little or no modification (on most of these, ./configure and make “just worked”):

      MPlayer playing Rocketboom in a window
      MPlayer windowed playing the intro to Rocketboom
      Fullscreen MPlayer
      MPlayer fullscreen, zoomed 2x.
      GView
      GView: A very lightweight image viewer.
      Glock
      Glock: an analog clock.

      The smaller apps, GView and Glock, could really rock if they were ported with a basic hildonized interface. Over the weekend I also got XChat built and running on x86 (here’s a screenshot) with everything but in-channel text input working perfectly. I still think that a properly ported Gaim would be the best graphical IRC interface for Maemo right now.

      The weekend is over but I hope to do some proper maemo hacking during my downtime this week.

      Update: MPlayer chewed up a lot of CPU and as such is probably not going to run very well on the device itself, especially since so much time and effort has been put into tweaking GStreamer for the platform. I’ve grown accustomed to the “throw anything at MPlayer” approach to Linux multimedia, so I had to try…