I apologize for not updating the blog for the last week - I've been silent because I have just started a new job and haven't had the time to come back to the work here yet. I do plan to come back to it, though at a slower pace. In the meantime, do not hesitate to reach out for help with similar efforts, either through the comment system or by dropping me an e-mail at censuskml at gmail.
Friday, April 13, 2007
An alert reader pointed out that the KMZs I posted yesterday were not downloading properly as KMZ files - they ended up as ZIP files on the desktop. This was because I was using Amazon's S3 service to host the files, which apparently doesn't have the KMZ MIME type set. I've moved the KMZs to BingoDisk, which fixes the problem in my testing. Please let me know if you have difficulty getting these new files.
As a side note, the Google Maps links don't work with the BingoDisk files, but do with S3, so I've left a copy there as well.
Wednesday, April 11, 2007
To follow up on my last post I wanted to share some of the actual KMZ files I generated to create the screenshots I showed. Please read that post for a full explanation of what these KMZs contain. The KMZ document descriptions also contain a brief explanation.
Please note that I do not warrant the accuracy of these maps in any way. Use at your own risk. That being said, I have spot-checked 3 random districts in each of the attached KMZs using the FEC query tool, which can be found here. The numbers I calculated are pretty close to, and sometimes exactly match, what the FEC tool returns. Where they do not match, I suspect it is because of the data file I chose to use, which the FEC explicitly states might contain some inaccuracies given its attempt to keep the files as up-to-date as possible.
I've also used the TimeSpan element, so you can load both the 109th and 110th Congress files into Google Earth at one time. Using the time slider you can move between the two Congresses - the effect is pretty cool. I've also included my StyleMap trick, so that when you mouseover the point near the center of each district a label with the district's name and the total receipts in dollars appears.
I'm posting California, New York, and Texas - all states that have many districts. If you are interested in others, please feel free to e-mail me at censuskml [at gmail].
Total Receipts for House of Representatives Candidates by Congressional District KMZs:
- California 109th Congress
- California 110th Congress
- New York 109th Congress
- New York 110th Congress
- Texas 109th Congress
- Texas 110th Congress
California and Texas are too large to load into Google Maps, but New York works:
There are many interesting insights that can be gleaned from the data, in particular when contrasting the 109th and 110th election cycles. I'll save interpretation for another post. For now, enjoy.
Tuesday, April 10, 2007
I am stretching my wings beyond Census data - and what could be more interesting than campaign contribution data? I have an interest in politics and thought it would be fascinating to see if there were ways to make the data about the amount spent on elections available. Using the same basic framework I created to read in Census data, I decided that the Federal Election Commission (FEC) would be a good source of interesting data about elections.
The Census provides boundary files for the last eight Congresses. The FEC provides a wealth of data and I have just started to explore the full extent of what is possible. I started by using the Candidate Financial Summary Without PAC Breakdown data. This data, of course, is in a rather complex format, so writing a Ruby module to read it out was the first order of business.
The Census boundary files for the 110th and 109th are in a relatively similar format (the 108th and before begin to deviate substantially) so I used these as my polygon files. I used the Candidate Financial Summary Without PAC Breakdown (CFS) files from the FEC. These files are the most current but do have some potential accounting issues. As the FEC states:
The cost of this timeliness, though, is that some of the information available here is less precise than for "cansum". For example, in "cansum" you can see how much a campaign received from Corporate PACs or Labor PACs, while here there is only one value for the total received from "other political committees." This includes all PAC contributions, but it may also contain contributions from other candidates, and some other types of committees we don't typically think of as PACs. We can't do the full breakdowns until all the information about specific contributions has been entered into the database.
When using these summary files you need to be aware of some possible double counting of activity. Some candidates have more than one committee authorized to raise and spend funds on their behalf. The activity reflected in this file represents the sum of those committees. If they transfer funds back and forth among each other, this activity would be counted twice. Information about "transfers from authorized committees" and "transfers to authorized committees" is included in the file and if there are values in both of these fields it is necessary to subtract these from total receipts and total disbursements to obtain a more accurate value for actual activity.
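Here is a rough Ruby sketch of the adjustment the FEC describes. The field names are placeholders of mine, not the actual CFS column names, and I am assuming that transfers received come off receipts and transfers made come off disbursements, which is my reading of the note rather than something the FEC spells out.

```ruby
# Hypothetical record layout; the real CFS columns are named differently.
CandidateSummary = Struct.new(:total_receipts, :total_disbursements,
                              :trans_from_auth, :trans_to_auth)

# Returns [receipts, disbursements] with inter-committee transfers removed.
# Following the FEC note, the adjustment applies only when both transfer
# fields hold a value.
def adjusted_activity(rec)
  if rec.trans_from_auth > 0 && rec.trans_to_auth > 0
    [rec.total_receipts - rec.trans_from_auth,
     rec.total_disbursements - rec.trans_to_auth]
  else
    [rec.total_receipts, rec.total_disbursements]
  end
end

# Made-up amounts, for illustration only.
rec = CandidateSummary.new(500_000, 450_000, 20_000, 20_000)
receipts, disbursements = adjusted_activity(rec)
puts "receipts=#{receipts} disbursements=#{disbursements}"
```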
In creating a data structure to store the CFS data, I created a somewhat flexible way to aggregate the data using some of the interesting features of Ruby. In particular, the eval statement makes it easy to pass in free text, so the caller of the aggregation function can specify which field to aggregate.
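To give a flavor of this, here is a minimal sketch of eval-driven aggregation. It is not my actual module: the Candidate struct and its fields are invented for the example, and it uses instance_eval (one member of Ruby's eval family) so that the free text can refer to fields by name.

```ruby
# Hypothetical, pared-down record; a real module would store many more fields.
Candidate = Struct.new(:state, :district, :total_receipts, :total_disbursements)

# Sums an arbitrary expression per [state, district]. The expression is free
# text evaluated in the context of each record, so it can name any field.
# eval-style calls run arbitrary code: only pass in text you wrote yourself.
def aggregate(candidates, expression)
  totals = Hash.new(0)
  candidates.each do |c|
    totals[[c.state, c.district]] += c.instance_eval(expression)
  end
  totals
end

# Made-up amounts, for illustration only.
candidates = [
  Candidate.new("WY", "00", 1_000, 800),
  Candidate.new("WY", "00", 2_500, 2_000),
  Candidate.new("NY", "01", 4_000, 3_500)
]

by_receipts = aggregate(candidates, "total_receipts")
by_net      = aggregate(candidates, "total_receipts - total_disbursements")
puts by_receipts.inspect
puts by_net.inspect
```

The appeal is that the caller can pass a whole expression, not just a field name. The price is that the text is executed as code, so it should never come from an untrusted source.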
I encountered two main challenges mapping the CFS data back to the Census Congressional District polygons. First, the FEC uses the two-letter acronym to identify the state and the Census uses FIPS codes to identify states. Using regular expressions in TextMate, I converted the list from the Census website to a couple of different Ruby hash tables so that I could convert back and forth from two-letter acronym to two-digit code.
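A trimmed-down sketch of what such a pair of lookup tables can look like is below. Only a handful of states are listed, the constant names are mine, and the reverse table is derived with Hash#invert rather than typed out by hand.

```ruby
# A few entries only; the full table comes from the Census FIPS state code list.
STATE_TO_FIPS = {
  "CA" => "06",
  "HI" => "15",
  "NY" => "36",
  "TX" => "48",
  "WA" => "53",
  "WY" => "56"
}.freeze

# Reverse lookup built from the same data so the two tables cannot drift apart.
FIPS_TO_STATE = STATE_TO_FIPS.invert.freeze

puts STATE_TO_FIPS["WY"]
puts FIPS_TO_STATE["53"]
```

Keeping the codes as two-character strings (not integers) preserves the leading zero in codes such as "06".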
Second, the FEC files are somewhat inconsistent (at least based on how I am reading them) about how they handle states with only one Congressional District. The Census Bureau is pretty clear that it uses "00" to identify districts that are the sole district for a given state. The FEC seems to follow this convention for the most part, except in a couple of states. For example, in Wyoming there are candidates for the House of Representatives listed in Congressional Districts "00" and "01". Wyoming has only one seat in the 110th Congress. Here is a snippet from the webl06.zip file which contains the CFS data for the 2005-2006 election cycle (the ellipses represent where I have cut from the line for the sake of readability):
H4WY00055CUBIN, BARBARA L I2REP...WY01 W W48...
H6WY01025TRAUNER, GARY S 1DEM...WY01 W L47...
H6WY00118WINNEY, JUSTIN WILLIAM JR 2REP...WY00 0...
According to the documentation for the file, the two digits following the state acronym represent the district. In this case, it would appear to suggest that there are candidates for House in districts "00" and "01", when in fact there is only one district in Wyoming. To handle this, I simply rolled up all of the House candidates in one-district states, regardless of the district listed in the FEC file. This may be the wrong thing to do, so I've got an e-mail in to the FEC to find out the actual answer.
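The roll-up amounts to something like the sketch below. The helper name and the dollar amounts are invented for the example; the list of at-large states is the set of states with a single House seat in these two Congresses.

```ruby
# States with a single at-large House seat in the 109th and 110th Congresses.
AT_LARGE_STATES = %w[AK DE MT ND SD VT WY].freeze

# Maps an FEC (state, district) pair onto the Census district code,
# collapsing every district in an at-large state to "00".
def census_district(state, fec_district)
  AT_LARGE_STATES.include?(state) ? "00" : fec_district
end

# [state, FEC district, total receipts]; the dollar amounts are made up.
rows = [["WY", "01", 1_200_000], ["WY", "01", 900_000], ["WY", "00", 15_000],
        ["NY", "01", 700_000]]

totals = Hash.new(0)
rows.each do |state, district, receipts|
  totals[[state, census_district(state, district)]] += receipts
end
puts totals.inspect
```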
Alright, enough with that background, let's see some pictures. In the following screenshots, I am mapping the total amount received by House of Representatives candidates in each Congressional District for a given election cycle. For each $1,000 received, the district gets one meter in height. The districts with higher amounts are more green, those with lower amounts are more red. Missouri is missing from the 109th Congress because the polygon metadata file is missing from the Census website.
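For the curious, the height and color rules can be sketched in a few lines of Ruby. This is a simplified linear ramp with method names of my own choosing, not the exact code behind the screenshots. The one detail worth knowing is that KML writes colors as hex in aabbggrr order, not the rrggbb order used on the web.

```ruby
# One meter of extrusion height per $1,000 received.
def height_in_meters(dollars)
  dollars / 1000.0
end

# Linear red-to-green ramp. KML colors are hex strings in aabbggrr order.
def kml_color(value, min, max, alpha = 255)
  t = max == min ? 0.0 : (value - min).to_f / (max - min)
  t = [[t, 0.0].max, 1.0].min           # clamp to 0..1
  red   = (255 * (1.0 - t)).round       # low amounts are more red
  green = (255 * t).round               # high amounts are more green
  format("%02x%02x%02x%02x", alpha, 0, green, red)
end

# Made-up range of receipts, for illustration only.
puts height_in_meters(250_000)
puts kml_color(0, 0, 3_000_000)
puts kml_color(3_000_000, 0, 3_000_000)
```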
109th Congress from above:
110th Congress from above:
109th looking North:
110th looking North:
109th looking West:
110th looking West:
109th looking East:
110th looking East:
109th looking South:
110th looking South:
That is it for now. I'm close to sharing the KMZs for these, so be on the lookout for a post soon.
Monday, April 9, 2007
I've been unhappy with how Google Earth handles labeling of shapes - the Placemark KML object is pretty good at containing a single shape, but breaks down when handling multiple shapes. Placemark elements contain the description elements, but only Point elements pick up on these. As I have talked about before, you can't label a Polygon without using a Point. I started to use the MultiGeometry to group together Polygon elements that belong to one geography (i.e. when a County has a couple of noncontiguous shapes). The annoying part is that I needed to add a Point for labels to show up.
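A stripped-down sketch of generating such a Placemark from Ruby is below. The method names and coordinates are invented for the example, styling is left out, and the name is not XML-escaped, which real code would need to do.

```ruby
# Formats [lon, lat] pairs as a KML coordinate string (altitude fixed at 0).
def kml_coords(ring)
  ring.map { |lon, lat| "#{lon},#{lat},0" }.join(" ")
end

# Builds one Placemark holding a label Point plus every Polygon of a geography.
def multi_geometry_placemark(name, center, rings)
  polygons = rings.map do |ring|
    "<Polygon><outerBoundaryIs><LinearRing><coordinates>" \
      "#{kml_coords(ring)}</coordinates></LinearRing></outerBoundaryIs></Polygon>"
  end
  <<~KML
    <Placemark>
      <name>#{name}</name>
      <MultiGeometry>
        <Point><coordinates>#{center[0]},#{center[1]},0</coordinates></Point>
        #{polygons.join("\n    ")}
      </MultiGeometry>
    </Placemark>
  KML
end

# Two made-up, noncontiguous squares standing in for one County.
mainland = [[-120.0, 47.0], [-119.0, 47.0], [-119.0, 48.0], [-120.0, 48.0], [-120.0, 47.0]]
island   = [[-121.0, 47.2], [-120.8, 47.2], [-120.8, 47.4], [-121.0, 47.4], [-121.0, 47.2]]
puts multi_geometry_placemark("Example County", [-119.5, 47.5], [mainland, island])
```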
When dealing with complex geographies, like Block Groups, having the labels show up the entire time doesn't work very well. For example, in the last map I shared there are ~4,300 Block Groups in WA, and the labels overlap so much that the map becomes quite confusing. You can play with breaking out the labels into a different Placemark, and perhaps a folder structure at the County/County Subdivision level might help, but it still isn't perfect because you would have to hunt and peck for what you were looking for. In looking through the KML spec, I was excited to find the StyleMap element. The StyleMap element provides a mechanism to have a Placemark respond to mouseover/click/highlight events. You can define a Style element for both normal and highlight classes. I thought this would be a great way to provide a label: when a user moves their mouse over a geography (e.g. Block Group, County Subdivision, County) the label could show up - the normal style would have it transparent and the highlight style would have it opaque.
Well, turns out it doesn't quite work that way. Unfortunately, the only thing that sparks the transformation is the user moving their mouse over the icon of a Point: the style doesn't change if they mouseover the label or any of the Polygon elements in the Placemark. What is odd, however, is that all of the elements of the Placemark do respond to the new style when you mouseover the icon.
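Here is a hedged sketch of the kind of StyleMap this involves, generated from Ruby. The ids and colors are illustrative choices of mine and not the exact styles in my files. Colors are in KML's aabbggrr order, so a leading "00" makes the label fully transparent.

```ruby
# Returns KML for a StyleMap whose label is invisible until the Point's icon
# is moused over. Colors are aabbggrr, so a leading "00" is fully transparent.
def label_style_map(id)
  <<~KML
    <Style id="#{id}_normal">
      <LabelStyle><color>00ffffff</color></LabelStyle>
      <LineStyle><color>ff000000</color></LineStyle>
    </Style>
    <Style id="#{id}_highlight">
      <LabelStyle><color>ffffffff</color></LabelStyle>
      <LineStyle><color>ffffffff</color><width>2</width></LineStyle>
    </Style>
    <StyleMap id="#{id}">
      <Pair><key>normal</key><styleUrl>##{id}_normal</styleUrl></Pair>
      <Pair><key>highlight</key><styleUrl>##{id}_highlight</styleUrl></Pair>
    </StyleMap>
  KML
end

puts label_style_map("subdivision")
```

A Placemark then opts in with a styleUrl of "#subdivision", and Google Earth swaps between the two styles when the icon is moused over.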
What you'll see below are maps of Median Household Value (variable H85 MEDIAN VALUE (DOLLARS) FOR ALL OWNER-OCCUPIED HOUSING UNITS  from Summary File 3) by County Subdivision. I'm not entirely happy with my new labeling, but it works such that when you mouseover the icon at the center of the polygon, the name of the County Subdivision shows up and the border is highlighted in white. I like the effect, but I am not happy with how I have to have an icon show up. In the maps below, the more red an area/shape, the higher the median household value - the greener, the lower the median household value. In the 3D maps, each $1,000 of value adds one meter of height.
Screenshot (notice how Leavenworth-Lake Wenatchee is highlighted):
Saturday, April 7, 2007
Good morning! For those of you that enjoy getting updates via the feed, I kindly request that you change your readers to use my new feed:
Blogger is good at many things, but getting a sense of how your feed is used is not one of them. Thank you!
Friday, April 6, 2007
I just recently discovered the TimeSpan element in KML and thought it might be an interesting way to animate the maps. To try this out, I added it to the maps I discussed in my previous post. The dates are arbitrary, but I gave each block group a date, starting from the one with the lowest median household value and ending with the block group with the highest. You can see the effect in the following video which animates the drawing of each of the block groups: building all of them up in order and then removing them in reverse order.
In Google Earth you can control many aspects of the animation, including the speed at which it moves through dates. I think this may be an interesting way to make the maps come alive a bit more.
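The date assignment can be sketched roughly as follows. The start date, the one-day spacing, and the ids are arbitrary choices for the example, and each TimeSpan has only a begin date so that a shape stays visible once it appears.

```ruby
require "date"

# Assigns each geography an arbitrary begin date, one day apart, ordered from
# the lowest value to the highest. Returns { id => "<TimeSpan>...</TimeSpan>" }.
def time_spans(values_by_id, start = Date.new(2000, 1, 1))
  sorted = values_by_id.sort_by { |_id, value| value }
  sorted.each_with_index.map do |(id, _value), i|
    [id, "<TimeSpan><begin>#{(start + i).iso8601}</begin></TimeSpan>"]
  end.to_h
end

# Made-up block group ids and median values, for illustration only.
spans = time_spans("bg_a" => 300_000, "bg_b" => 100_000, "bg_c" => 200_000)
spans.each { |id, xml| puts "#{id}: #{xml}" }
```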
Using the new functionality I've discussed in the last two posts, I am pushing forward in creating new maps. Today I'm going to share a few maps of median household value (variable H76 - MEDIAN VALUE (DOLLARS) FOR SPECIFIED OWNER-OCCUPIED HOUSING UNITS  from Summary File 3). There is another variable, H85, which might be better for what I want, but I'm going to go ahead and share these maps before going back. These new maps show the median household value by block group for Washington state. There are ~4,300 block groups in Washington and a wide array of values in the data, so this data provides a good test of the new functionality (labels, excluded polygons, logarithmic color scales). Given the large number of block groups, I've found that it takes several different map formats to fully explore the data. I've created maps that have the 3D views I've shared before and maps that are flat with some transparency so that you can see the underlying geography.
One of the challenges I found in creating these maps was gathering the Census data at the block group level for the entire state. NHGIS doesn't provide many variables beyond the basic population and economic ones, and the Census FactFinder website doesn't make it easy to download all of the block groups for a given state at once - you have to download each county separately. Given these challenges, I looked into downloading the raw data from Summary File 3. The raw files are available by FTP but, of course, are in a very complex format. The Census Bureau provides an Access database template that contains empty versions of each of the ~80 tables needed to work with the data (including import specs, which is quite helpful). Feeling intrepid, I downloaded all of the data for Washington state (FTP site) and loaded the tables I needed into the Access template. This worked pretty well, but is somewhat confusing, particularly because joining the geographic identifiers for each record is not quite as straightforward as the Census documentation would lead you to believe. I finally got it to work and this provided the data for the maps provided below - perhaps you can now understand why I haven't gone back and re-run the maps using the H85 variable yet.
In the maps below, the more red an area/shape, the higher the median household value - the greener, the lower the median household value. In the 3D maps, each $1,000 of value adds one meter of height. You'll note that some areas show up as white. This is because the data provided by the Census for these block groups is 0. This makes sense for places like Mt. Rainier, but not for a region of downtown Seattle that shows up as white. I'm going to look into this next.
Now, some maps (don't forget you can click on each picture to get a larger version)!
Entire state (flat from overhead):
Entire state (flat from overhead, borders around each block group, some transparency):
Entire state (3D from overhead):
Entire state (3D looking North):
Entire state (3D looking East):
Seattle (3D looking Southeast):
Seattle (flat from overhead, borders around each block group, some transparency):
Thursday, April 5, 2007
Today's 500 word post was just eaten by the Blogger spell-checker. I'm going to attempt to retrieve it, but it isn't looking good. The topic of the post was some major improvements I've made to the code base that make it much more generic and support two new features:
- Exclusions are now handled - if a polygon has an exclusion listed in the Census shape file, this is now handled. This turns out to be quite important for some areas. I'm working on a map of Washington state at the Block Group level and it turns out that in the rural parts of the state there are a couple of small towns that have a single Block Group surrounding the town and then separate, smaller Block Groups for the town itself. Before, these smaller Block Groups might have been covered up.
- A geography made up of multiple polygons now acts as one object in the KMZ output file.
In addition, I've made a number of enhancements under the hood. While I had been using some object-orientation before, I finally made the classes much more complete and generic so that it is much easier to add new geographies. For example, I created the Polygon object which encapsulates all of the information for a given polygon in one container:
# Holds everything known about one Census polygon.
class Polygon
  attr_accessor :id, :centerLon, :centerLat   # identifier and center point
  attr_reader :mainCoords, :exCoords          # main and excluded coordinates

  def initialize(id, centerLon, centerLat)
    @id = id
    @centerLon = centerLon
    @centerLat = centerLat
    @mainCoords = []
    @exCoords = []
  end
end
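To show how the exclusions end up in the output, here is a sketch that turns a Polygon-like container into KML. I am assuming here that mainCoords holds a single outer ring of [lon, lat] pairs and exCoords holds a list of excluded rings. The Shape struct is a stand-in so the example runs on its own.

```ruby
# Stand-in for the Polygon container: one outer ring plus excluded rings.
Shape = Struct.new(:id, :mainCoords, :exCoords)

# Renders a ring of [lon, lat] pairs as a KML LinearRing (altitude fixed at 0).
def ring_xml(ring)
  coords = ring.map { |lon, lat| "#{lon},#{lat},0" }.join(" ")
  "<LinearRing><coordinates>#{coords}</coordinates></LinearRing>"
end

# Each exclusion becomes an innerBoundaryIs, which KML renders as a hole.
def polygon_kml(shape)
  holes = shape.exCoords.map { |ring| "<innerBoundaryIs>#{ring_xml(ring)}</innerBoundaryIs>" }
  "<Polygon><outerBoundaryIs>#{ring_xml(shape.mainCoords)}</outerBoundaryIs>" \
    "#{holes.join}</Polygon>"
end

# A made-up rural Block Group with a small town cut out of its middle.
outer = [[0, 0], [10, 0], [10, 10], [0, 10], [0, 0]]
town  = [[4, 4], [6, 4], [6, 6], [4, 6], [4, 4]]
puts polygon_kml(Shape.new("rural_bg", outer, [town]))
```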
With these improvements, I hope to have some maps to share soon. When I am once again inspired, I will share a lot more detail about the process and recent improvements.
Monday, April 2, 2007
As promised, my break gave me a new burst of energy. I spent my week off in Hawaii and figured it would be a good region to experiment with a few new map features. I tackled two items that were on my to-do list: labels & logarithmic color scales. I have generated a new map of population by County Subdivision for Hawaii to demonstrate these.
First, on labels. I've found that Google Earth is quite powerful except when it comes to labeling polygons. You can provide "names" for polygons, but these only appear in the "Places" panel on the left of the map view. I suppose this may have to do with the difficulty in figuring out where to put these names on the map display (given how oddly shaped a polygon can be), but I would have thought there was some good default behavior for this (if I am missing a feature of KML, please let me know!). To place a label on the map display, you have to create a "Point" placemark. Out of a desire to keep moving, I pushed forward without labels. The Census Bureau's shape files do actually include a center point for each polygon, so I have now gone back and updated my code to generate "Point" placemarks for each of these center points. With these points, I can now have labels appear on the map. This is very helpful, particularly when dealing with geographies below County (County Subdivision, Block Group, etc.).
Second, on logarithmic color scales. One of the challenges in mapping any sort of data that has a very wide range and is not very evenly distributed across the range is that it can be hard to find a color scheme that provides clarity at either extreme. I have talked about this in a couple of previous posts, but have finally gotten around to implementing a logarithmic scheme that more evenly distributes the data across the range. I'm not entirely happy with what I've implemented, so I plan to work on it further.
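The basic idea of a logarithmic scheme can be sketched as below. This is a simplified version with method names of my own, not necessarily what my code does: it takes the position of each value on a log scale and maps that fraction onto a green-to-red ramp in KML's aabbggrr color order.

```ruby
# Position of value within [min, max] on a log scale, as a fraction in 0..1.
# Assumes min > 0; zero or negative values need separate handling.
def log_fraction(value, min, max)
  return 0.0 if max == min
  t = (Math.log(value) - Math.log(min)) / (Math.log(max) - Math.log(min))
  [[t, 0.0].max, 1.0].min
end

# Green (low) to red (high), in KML's aabbggrr byte order.
def log_kml_color(value, min, max)
  t = log_fraction(value, min, max)
  format("ff00%02x%02x", (255 * (1.0 - t)).round, (255 * t).round)
end

# Made-up populations: on a linear scale 100 would sit about 10% of the way
# from 10 to 1,000, but on a log scale it lands at the midpoint.
puts log_fraction(100, 10, 1000)
puts log_kml_color(10, 10, 1000)
puts log_kml_color(1000, 10, 1000)
```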
On to some screenshots. Below you'll find 3 perspectives of population, by County Subdivision, from the 2000 Census for Hawaii. Every meter in height represents 5 people; Greener represents lower population, Red higher population. The labels come in handy because I'm not that familiar with the islands. The new logarithmic color scheme comes in handy because Honolulu has a much higher population than all of the other County Subdivisions. I've used the same Green to Red color scheme I've used before, but with the logarithmic scaling it now does a much better job of helping one to distinguish between the Subdivisions on the lower end of the population range. Without this new scheme, Honolulu would be red and everything else green.
From directly overhead:
From an angle, looking North:
From an angle, looking South (so you can see the northern side of Oahu):
One other note of interest: I was perplexed for a few minutes because the Midway Islands and the other islands west of Kauai all showed up with tall, red polygons - meaning they have a high population (I excluded them from the screenshots for this reason). It turns out that the Honolulu County Subdivision includes all of these islands, hence they get the data for the entire Subdivision. I suppose this demonstrates one of the perils of geographic aggregation when working with an island chain.
I apologize for the lack of posts last week. I spent the week on vacation and didn't dream of touching a computer. My trip spurred a bunch of ideas for interesting maps that I am already working on. In the meantime, I will share a quick panorama I put together from pictures overlooking Hana, on the eastern side of Maui.