Saturday, March 28, 2009

Decomposing genWABlockGroups() – Part 1

In my last post I mentioned that a good place to start is the genWABlockGroups() function. Today I am going to walk through the first couple of lines to explain what they do and begin to unwind the mystery.

This function was written to create maps based on Census Block Groups. Specifically, I built it to map the median home values in Washington state from the 2000 Census. Here are the first few lines, which name the input and output files:

baseDir = "/Users/aidan/Desktop/"
projectDir = "Census 2000 WA BGs/"
polygonShapeFile = baseDir + projectDir + "bg53_d00.dat"
polygonMetaDataFile = baseDir + projectDir + "bg53_d00a.dat"
dataFile = baseDir + projectDir + "Median HH Value.txt"
outputFile = baseDir + projectDir + "output"

The first set of lines are all about setting up the input files and output files. In general, there are three input files required.

  1. You need a “shape” file. I talk quite a bit about the types of data sources I use in this post: http://censuskml.blogspot.com/2007/03/get-your-real-live-examples.html. The files I use come straight from the Census Cartographic Boundary files website: http://www.census.gov/geo/www/cob/bdy_files.html. Here is a sample of what a shape file looks like:

    1 -0.122233088267882E+03 0.489843885617284E+02
    -0.122285651792493E+03 0.490024370686774E+02
    -0.122251693540713E+03 0.490024929621633E+02
    -0.122251621999248E+03 0.490024930799168E+02
    ...
    -0.122285651792493E+03 0.490024370686774E+02
    END


    The structure is pretty simple. The shape's unique identifier within the file comes first ("1" in the above example). The coordinates that immediately follow it on the same line are the center point of the shape. The boundary points then follow, one per line, ending with the word "END". You’ll note that the first boundary point matches the last one exactly, so every shape is always a complete, closed outline. A single Census cartographic area can be made up of multiple shapes (for example, a chain of islands in one county will be made up of distinct shapes). The “metadata” file is crucial to piecing all of this together.
  2. You need a shape “metadata” file. Again, this comes straight from the same Census website. Here is what the first few lines of the file look like:

    0
    " "
    " "
    " "
    " "
    " "
    " "
    " "

    1
    "53"
    "073"
    "0102"
    "2"
    "2"
    "BG"
    " "
    ...

    These files can be hard to decipher. At a basic level, each record maps a shape to a specific Census area, in this case a Block Group. The record starts with the unique shape ID. Then there is a series of identifiers in quotation marks: (a) “53” = the State ID of Washington, (b) “073” = the county identifier within WA, (c) “0102” = the tract within the county, (d) “2” = the block group within the tract, (e) “2” = the LSAD identifier, which is a numeric code that marks this as a Block Group, (f) “BG” = the translation of the previous LSAD code, and finally (g) the last blank is an area for miscellaneous data. As I mentioned in the description of the previous file, one Census area can map to many shapes: multiple shapes can have the exact same metadata. Check out the BlockGroup.rb file to see how to programmatically read this data.
  3. You need a data file. The data file needs a way to map back to the “shapes” that are loaded. The “key” must match the shape's unique identifier. Again, I try to use files that are as straight from the Census website as possible. Typically I pull the data down from Tiger or FactFinder. Here is another sample:

    150,53,001,950100,1,530019501001,68800,88800,115900
    150,53,001,950100,2,530019501002,63300,78800,97800
    150,53,001,950100,3,530019501003,41300,58600,81600

    This is a CSV where the 6th column (if you are counting 1-based) contains the unique identifier for the Census geographic area and the three columns after it contain data about that specific area. You'll note how the identifier is a combination of the data from the metadata file. It took me a while to figure out how to construct the unique identifier, but I cracked the code. In the code, each shape object file has a function to build up its own unique ID that will map to the way Census spits out data with respect to that area.
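
To make the three formats concrete, here is a minimal Ruby sketch of how they could be read and tied together. It is not the code from BlockGroup.rb: the helper names (parse_shapes, parse_metadata, census_id, parse_data) are made up for illustration, and the rule of padding a four-digit tract with trailing zeros to six digits is an assumption based on the sample rows above.

```ruby
# Illustrative helpers only; the names and the tract-padding rule are
# assumptions, not the code from BlockGroup.rb.

# Shape file text -> { shape_id => { center: [lon, lat], points: [[lon, lat], ...] } }
def parse_shapes(text)
  shapes = {}
  current = nil
  text.each_line do |line|
    fields = line.split
    next if fields.empty?
    if fields.first == "END"
      current = nil # end of the current shape
    elsif current.nil?
      # Header line: shape ID, then the center point.
      current = { center: fields[1, 2].map(&:to_f), points: [] }
      shapes[fields[0].to_i] = current
    else
      current[:points] << fields.first(2).map(&:to_f)
    end
  end
  shapes
end

# Metadata text -> { shape_id => [state, county, tract, block_group, ...] }
def parse_metadata(text)
  records = {}
  text.split(/\n\s*\n/).each do |chunk|
    lines = chunk.lines.map(&:strip).reject(&:empty?)
    next if lines.empty?
    records[lines.first.to_i] = lines.drop(1).map { |v| v.delete('"').strip }
  end
  records
end

# Build the key used in the data file from one metadata record.
# Assumption: a four-digit tract is padded with trailing zeros to six digits.
def census_id(fields)
  state, county, tract, block_group = fields
  state + county + tract.ljust(6, "0") + block_group
end

# Data file text -> { census_id => [value, value, value] }
def parse_data(text)
  text.each_line.each_with_object({}) do |line, rows|
    cols = line.strip.split(",")
    next if cols.length < 9
    rows[cols[5]] = cols[6, 3].map(&:to_i)
  end
end

puts census_id(["53", "073", "0102", "2"]) # prints 530730102002
```

Because several shapes can share one metadata record, keying the data rows by the same string that census_id produces means every shape belonging to a Block Group can look up its values with a single hash access.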

Next time I will go through the actual function calls that load in these files.
