Wednesday, July 25, 2007

Real Progress - 5 digit ZCTAs

It has been 4 months since real progress was made - but tonight represented a real breakthrough. I got the code up and running on my new laptop and transitioned over to using Eclipse, a great IDE that has integrated debugging of Ruby. But talking about how I am developing is not the purpose of this post. The purpose is to breathe life into the project once more.

I've gotten many e-mails over the past few months asking for help in creating maps. One person in particular had a very specific ask that I thought would both expand the functionality of the code & test a new data source. For this set of maps, I needed to start mapping 5-digit zip codes (Zip Code Tabulation Areas to the Census), my first foray into zip codes in this work.

As proof of life, enjoy the screenshot below which shows all of the zip codes for the state of Maine, colored with random colors.



Now that the code is running again and I've got some specific projects to get going again, I look forward to getting back to a much more regular schedule. Please keep up the contact.

Wednesday, May 23, 2007

Congressional District maps for every state!

It has been almost six weeks since I've had something substantive to add. I wanted to share the complete outputs from the campaign contribution work I spoke about earlier. Please read that post for a full explanation of what the data is and how I put the files together.

The following two ZIP files contain a KMZ for every single state. Because I used the TimeSpan element, all of the states can be loaded at once.

Without further ado:


If you have any problems with these files, please let me know.

I hope to be back up and running soon, so please stay tuned.

Thursday, May 10, 2007

New blog on the block

There is a new blog at Google published by the Google Earth and Maps team: Lat Long Blog. I've begun spinning up my efforts to produce more maps and code and hope to share some new outputs soon. Thanks for staying connected during this intermission.

Sunday, April 22, 2007

A bit of a break...

I apologize for not updating the blog for the last week - I've been silent because I have just started a new job and haven't had the time to come back to the work here yet. I do plan to come back to it, though at a slower pace. In the meantime, do not hesitate to reach out for help with similar efforts by either using the comment system or dropping me an e-mail at censuskml at gmail.

Friday, April 13, 2007

Fixed KMZ Downloads

An alert reader pointed out that the KMZs I posted yesterday were not properly downloading as KMZ files - they ended up as ZIP files on the desktop. This was because I was using Amazon's S3 service to host the files, which apparently doesn't have the KMZ MIME type set. I've moved the KMZs to BingoDisk, which fixes the problem in my testing. Please let me know if you have difficulty getting these new files.

As a side note, the Google Maps links don't work with the BingoDisk files, but do with S3, so I've left a copy there as well.

Wednesday, April 11, 2007

FEC Data KMZs

To follow up on my last post I wanted to share some of the actual KMZ files I generated to create the screenshots I showed. Please read that post for a full explanation of what these KMZs contain. The KMZ document descriptions also contain a brief explanation.

Please note that I do not warrant, in any way, the accuracy of these maps. Use at your own risk. That being said, I have done spot checking for 3 random districts in each of the attached KMZs using the FEC query tool, which can be found here. The numbers I calculated are pretty close, and sometimes match exactly, to what the FEC tool returns. In the cases where they do not, I suspect it is because of the data file I chose to use, which the FEC explicitly states might contain some inaccuracies given their attempt to make them as up-to-date as possible.

I've also used the TimeSpan element, so you can load both the 109th and 110th Congress files into Google Earth at one time. Using the time slider you can move between the two Congresses - the effect is pretty cool. I've also included my StyleMap trick, so that when you mouseover the point near the center of each district a label with the district's name and the total receipts in dollars appears.

I'm posting California, New York, and Texas - all states that have many districts. If you are interested in others, please feel free to e-mail me at censuskml [at gmail].

Total Receipts for House of Representative Candidates by Congressional District KMZs:



California and Texas are too large to load into Google Maps, but New York works:



There are many interesting insights that can be gleaned from the data, in particular when contrasting the 109th and 110th election cycles. I'll save interpretation for another post. For now, enjoy.

Tuesday, April 10, 2007

Federal Election Commission Campaign Contribution Data

I am stretching my wings beyond Census data - and what could be more interesting than campaign contribution data? I have an interest in politics and thought it would be fascinating to see if there were ways to make the data about the amount spent on elections available. Using the same basic framework I created to read in Census data, I decided that the Federal Election Commission (FEC) would be a good source of interesting data about elections.

The Census provides boundary files for the last eight Congresses. The FEC provides a wealth of data and I have just started to explore the full extent of what is possible. I started by using the Candidate Financial Summary Without PAC Breakdown data. This data, of course, is in a rather complex format, so writing a Ruby module to read it out was the first order of business.

The Census boundary files for the 110th and 109th are in a relatively similar format (the 108th and before begin to deviate substantially) so I used these as my polygon files. I used the Candidate Financial Summary Without PAC Breakdown (CFS) files from the FEC. These files are the most current but do have some potential accounting issues. As the FEC states:

The cost of this timeliness, though, is that some of the information available here is less precise than for "cansum". For example, in "cansum" you can see how much a campaign received from Corporate PACs or Labor PACs, while here there is only one value for the total received from "other political committees." This includes all PAC contributions, but it may also contain contributions from other candidates, and some other types of committees we don't typically think of as PACs. We can't do the full breakdowns until all the information about specific contributions has been entered into the database.

When using these summary files you need to be aware of some possible double counting of activity. Some candidates have more than one committee authorized to raise and spend funds on their behalf. The activity reflected in this file represents the sum of those committees. If they transfer funds back and forth among each other, this activity would be counted twice. Information about "transfers from authorized committees" and "transfers to authorized committees" is included in the file and if there are values in both of these fields it is necessary to subtract these from total receipts and total disbursements to obtain a more accurate value for actual activity.
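The adjustment the FEC describes can be sketched in a few lines of Ruby. This is only one reading of the note: it assumes "transfers from" come off total receipts and "transfers to" come off total disbursements, and the struct and field names are hypothetical rather than the FEC's own column names.

```ruby
# Hypothetical record; the field names are illustrative, not the FEC layout.
CandidateSummary = Struct.new(:total_receipts, :total_disbursements,
                              :transfers_from_authorized, :transfers_to_authorized)

# Returns receipts and disbursements with inter-committee transfers removed.
# Assumption: the adjustment applies only when BOTH transfer fields have values.
def net_activity(summary)
  both = summary.transfers_from_authorized > 0 && summary.transfers_to_authorized > 0
  unless both
    return { receipts: summary.total_receipts,
             disbursements: summary.total_disbursements }
  end

  { receipts: summary.total_receipts - summary.transfers_from_authorized,
    disbursements: summary.total_disbursements - summary.transfers_to_authorized }
end
```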


In creating a data structure to store the CFS data, I created a somewhat flexible way to aggregate the data using some of the interesting features of Ruby. In particular, the eval statement makes it quite easy for the caller of the aggregation function to pass in free text that specifies which field to aggregate.
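To give a feel for the idea, here is a minimal sketch of eval-style aggregation. It is not the actual module behind these maps: the Candidate struct, its fields, and the aggregate function are all hypothetical, and it uses instance_eval (a close relative of eval) so the free text is evaluated against each record.

```ruby
# Hypothetical record type standing in for one line of the CFS file.
Candidate = Struct.new(:state, :district, :total_receipts, :total_disbursements)

# Sums an arbitrary expression over the records, grouped by [state, district].
# The expression is free text evaluated against each record, for example
# "total_receipts" or "total_receipts - total_disbursements".
def aggregate(records, expression)
  totals = Hash.new(0)
  records.each do |rec|
    totals[[rec.state, rec.district]] += rec.instance_eval(expression)
  end
  totals
end
```

Evaluating free text is convenient for a one-person script, but it runs whatever it is given, so it should only ever see expressions you typed yourself.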

I encountered two main challenges mapping the CFS data back to the Census Congressional District polygons. First, the FEC uses the two-letter acronym to identify the state and the Census uses FIPS codes to identify states. Using regular expressions in TextMate, I converted the list from the Census website to a couple of different Ruby hash tables so that I could convert back and forth from two-letter acronym to two-digit code.
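The pair of lookup tables can be as simple as one hash and its inverse. The sketch below lists only a handful of states to keep it short, and the constant names are made up for illustration.

```ruby
# Two-letter postal abbreviation => two-digit FIPS state code (partial list).
STATE_TO_FIPS = {
  "CA" => "06",
  "MA" => "25",
  "NY" => "36",
  "TX" => "48",
  "WA" => "53",
  "WY" => "56"
}.freeze

# The reverse table comes for free, since the mapping is one-to-one.
FIPS_TO_STATE = STATE_TO_FIPS.invert.freeze

STATE_TO_FIPS["WY"]  # => "56"
FIPS_TO_STATE["06"]  # => "CA"
```

Keeping the codes as strings preserves the leading zero in codes like "06".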

Second, the FEC files are somewhat inconsistent (at least based on how I am reading them) about how they handle states with only one Congressional District. The Census Bureau is pretty clear that it uses "00" to identify districts that are the sole district for a given state. The FEC seems to follow this convention for the most part, except in a couple of states. For example, in Wyoming there are candidates for the House of Representatives listed in Congressional Districts "00" and "01". Wyoming has only one seat in the 110th Congress. Here is a snippet from the webl06.zip file which contains the CFS data for the 2005-2006 election cycle (the ellipses represent where I have cut from the line for the sake of readability):

  H4WY00055CUBIN, BARBARA L                      I2REP...WY01 W W48...
  H6WY01025TRAUNER, GARY S                        1DEM...WY01 W L47...
  H6WY00118WINNEY, JUSTIN WILLIAM JR              2REP...WY00     0...


According to the documentation for the file, the two digits following the state acronym represent the district. In this case, it would appear to suggest that there are candidates for House in district "00" and "01", when in fact there is only one district in Wyoming. To handle this, I simply rolled up all of the House candidates in one-district states, regardless of the district listed in the FEC file. This may be the wrong thing to do, so I've got an e-mail in to the FEC to find out the actual answer.
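A roll-up like this can be sketched as a small key-normalizing step before summing. The code below assumes the records have already been parsed into hashes with hypothetical :state, :district and :receipts keys; the list of single-district states is the set of at-large states in the 110th Congress, and the function names are illustrative.

```ruby
# States with a single at-large House seat in the 110th Congress.
SINGLE_DISTRICT_STATES = %w[AK DE MT ND SD VT WY].freeze

# Maps every district in a one-district state onto the Census "00" code.
def district_key(state, district)
  SINGLE_DISTRICT_STATES.include?(state) ? [state, "00"] : [state, district]
end

# Sums receipts per [state, district] after normalizing the district code.
def total_receipts_by_district(records)
  records.each_with_object(Hash.new(0)) do |rec, totals|
    totals[district_key(rec[:state], rec[:district])] += rec[:receipts]
  end
end
```

With this, a Wyoming candidate filed under "01" and another filed under "00" both land in the same ["WY", "00"] bucket, which matches the Census polygon.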

Alright, enough with that background, let's see some pictures. In the following screenshots, I am mapping the total amount received by House of Representative candidates in each Congressional District for a given election cycle. For each $1,000 received, the district gets one meter in height. The districts with higher amounts are more green, those with lower amounts are more red. Missouri is missing from the 109th Congress because the polygon metadata file is missing from the Census website.
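The two mappings described above (one meter per $1,000, and a red-to-green ramp) might look something like the following. It is a sketch with made-up function names, and it assumes a plain linear ramp between the smallest and largest totals. KML colors are written as aabbggrr hex, which is why red ends up in the last two digits.

```ruby
# One meter of extruded height for every $1,000 received.
def height_meters(receipts)
  receipts / 1000.0
end

# Linear ramp from red (lowest total) to green (highest total).
# Returns a KML color string in aabbggrr order.
def receipts_color(value, min, max, alpha = 0xff)
  t = max == min ? 0.0 : (value - min).to_f / (max - min)
  t = t.clamp(0.0, 1.0)
  red = ((1.0 - t) * 255).round
  green = (t * 255).round
  format("%02x%02x%02x%02x", alpha, 0, green, red)
end
```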

109th Congress from above:


110th Congress from above:


109th looking North:


110th looking North:


109th looking West:


110th looking West:


109th looking East:


110th looking East:


109th looking South:


110th looking South:


That is it for now. I'm close to sharing the KMZs for these, so be on the lookout for a post soon.

Monday, April 9, 2007

"Dynamic" Labels & StyleMaps

I've been unhappy with how Google Earth handles labeling of shapes - the Placemark KML object is pretty good at containing a single shape, but breaks down when handling multiple shapes. Placemark elements contain the name and description elements, but only Point elements pick up on these. As I have talked about before, you can't label a Polygon without using a Point. I started to use the MultiGeometry to group together Polygon elements that belong to one geography (i.e. when a County has a couple of noncontiguous shapes). The annoying part is that I needed to add a Point for labels to show up.

When dealing with complex geographies, like Block Groups, having the labels show up the entire time doesn't work very well. For example in the last map I shared, there are ~4,300 Block Groups in WA and having the labels all show up doesn't work very well because they overlap and make it quite confusing. You can play with breaking out the labels into a different Placemark and perhaps a folder structure at the County/County Subdivision level might help, but it still isn't perfect because you would have to hunt and peck for what you were looking for. In looking through the KML spec, I was excited to find the StyleMap element. The StyleMap element provides a mechanism to have a Placemark respond to mouseover/click/highlight events. You can define a Style element for both normal and highlight classes. I thought this would be a great way to provide a label: when a user moves their mouse over a geography (e.g. Block Group, County Subdivision, County) the label could show up - the normal style would have it transparent and the highlight style would have it opaque.

Well, turns out it doesn't quite work that way. Unfortunately, the only thing that sparks the transformation is the user moving their mouse over the icon of a Point: the style doesn't change if they mouseover the label or any of the Polygon elements in the Placemark. What is odd, however, is that all of the elements of the Placemark do respond to the new style when you mouseover the icon.

What you'll see below are maps of Median Household Value (variable H85 MEDIAN VALUE (DOLLARS) FOR ALL OWNER-OCCUPIED HOUSING UNITS [1] from Summary File 3) by County Subdivision. I'm not entirely happy with my new labeling, but it works such that when you mouseover the icon at the center of the polygon, the name of the County Subdivision shows up and the border is highlighted in white. I like the effect, but I am not happy with how I have to have an icon show up. In the maps below, the more red an area/shape, the higher the median household value - the greener, the lower the median household value. In the 3D maps, each $1,000 of value adds one meter of height.
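For anyone who wants to try the same trick, here is a rough sketch of generating the styles from Ruby. The function name and the style ids are made up, and the exact colors and line width are assumptions; the StyleMap, Pair, key and styleUrl elements are the standard KML ones. The normal style gives the label zero alpha (transparent) and the highlight style makes the label opaque and draws the border in white.

```ruby
# Builds a normal/highlight Style pair plus the StyleMap that ties them together.
# A Placemark would then reference it with <styleUrl>#ID</styleUrl>.
def style_map_kml(id)
  <<~KML
    <Style id="#{id}_normal">
      <LabelStyle><color>00ffffff</color></LabelStyle>
    </Style>
    <Style id="#{id}_highlight">
      <LabelStyle><color>ffffffff</color></LabelStyle>
      <LineStyle><color>ffffffff</color><width>2</width></LineStyle>
    </Style>
    <StyleMap id="#{id}">
      <Pair><key>normal</key><styleUrl>##{id}_normal</styleUrl></Pair>
      <Pair><key>highlight</key><styleUrl>##{id}_highlight</styleUrl></Pair>
    </StyleMap>
  KML
end
```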

Movie:


Screenshot (notice how Leavenworth-Lake Wenatchee is highlighted):

Saturday, April 7, 2007

Change your feed readers!

Good morning! For those of you that enjoy getting updates via the feed, I kindly request that you change your readers to use my new feed:



Blogger is good at many things, but getting a sense of how your feed is used is not one of them. Thank you!

Friday, April 6, 2007

Follow up to Median Household Value for Washington

I just recently discovered the TimeSpan element in KML and thought it might be an interesting way to animate the maps. To try this out, I added it to the maps I discussed in my previous post. The dates are arbitrary, but I gave each block group a date starting from the one with the lowest median household value and ending with the block group with the highest. You can see the effect in the following video which animates the drawing of each of the block groups: building all of them up in order and then removing them in reverse order.
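Assigning the dates is mostly a sorting exercise. The sketch below is one way it could be done, with a hypothetical function name and an arbitrary start date: each block group gets a TimeSpan that begins one day later than the block group ranked just below it. It only sets a begin date, on the assumption that a shape should stay visible once it has been drawn.

```ruby
require "date"

# values_by_id: { block_group_id => median_value }
# Returns { block_group_id => "<TimeSpan>...</TimeSpan>" }, with the lowest
# value starting on start_date and each later rank starting one day after.
def time_spans(values_by_id, start_date = Date.new(2000, 1, 1))
  ordered = values_by_id.sort_by { |_id, value| value }
  ordered.each_with_index.map do |(id, _value), rank|
    day = (start_date + rank).iso8601
    [id, "<TimeSpan><begin>#{day}</begin></TimeSpan>"]
  end.to_h
end
```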



In Google Earth you can control many aspects of the animation, including the speed at which it moves through dates. I think this may be an interesting way to make the maps come alive a bit more.

Median Household Value for Washington state by Block Group

Using the new functionality I've discussed in the last two posts, I am pushing forward in creating new maps. Today I'm going to share a few maps of median household value (variable H76 - MEDIAN VALUE (DOLLARS) FOR SPECIFIED OWNER-OCCUPIED HOUSING UNITS [1] from Summary File 3). There is another variable, H85, which might be better for what I want, but I'm going to go ahead and share these maps before going back. These new maps show the median household value by block group for Washington state. There are ~4,300 block groups in Washington and a wide array of values in the data, so this data provides a good test of the new functionality (labels, excluded polygons, logarithmic color scales). Given the large number of block groups, I've found that it takes several different map formats to fully explore the data. I've created maps that have the 3D views I've shared before and maps that are flat with some transparency so that you can see the underlying geography.

One of the challenges I found in creating these maps was gathering the Census data at the block group level for the entire state. NHGIS doesn't provide many variables beyond the basic population and economic ones and the Census FactFinder website doesn't make it easy to download all of the block groups for a given state at once - you have to download each county separately. Given these challenges, I looked into downloading the raw data from Summary File 3. The raw files are available by FTP but, of course, are in a very complex format. The Census Bureau provides an Access database template that contains empty versions of each of the ~80 tables needed to work with the data (including import specs, which is quite helpful). Feeling intrepid, I downloaded all of the data for Washington state (FTP site) and loaded the tables I needed into the Access template. This worked pretty well, but is somewhat confusing, particularly because joining the geographic identifiers for each record is not quite as straightforward as the Census documentation would lead you to believe. I finally got it to work and this provided the data for the maps provided below - perhaps you can now understand why I haven't gone back and re-run the maps using the H85 variable yet.

In the maps below, the more red an area/shape, the higher the median household value - the greener, the lower the median household value. In the 3D maps, each $1,000 of value adds one meter of height. You'll note that some areas show up as white; this is because the data provided by the Census for these block groups is 0. This makes sense for places like Mt. Rainier, but not for a region of downtown Seattle that shows up as white. I'm going to look into this next.

Now, some maps (don't forget you can click on each picture to get a larger version)!

Entire state (flat from overhead):

Entire state (flat from overhead, borders around each block group, some transparency):

Entire state (3D from overhead):

Entire State (3D looking North):

Entire State (3D looking East):

Seattle (3D looking Southeast):

Seattle (flat from overhead, borders around each block group, some transparency):

Thursday, April 5, 2007

Blogger ate my post on new features/code improvements

Today's 500 word post was just eaten by the Blogger spell-checker. I'm going to attempt to retrieve it, but it isn't looking good. The topic of the post was some major improvements I've made to the code base that make it much more generic and support two new features:


  • Exclusions are now handled - if a polygon has an exclusion listed in the Census shape file, this is now handled. This turns out to be quite important for some areas. I'm working on a map of Washington state at the Block Group level and it turns out that in the rural parts of the state there are a couple of small towns that have a single Block Group surrounding the town and then separate, smaller Block Groups for the town itself. Before, these smaller Block Groups might have been covered up.

  • A geography made up of multiple polygons now acts as one object in the KMZ output file.
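In KML terms, both features map onto standard elements: an exclusion becomes an innerBoundaryIs ring inside the Polygon, and several polygons for one geography can be wrapped in a MultiGeometry. The sketch below shows that shape of output. The Shape struct and function names are hypothetical (this is not the lost code), and coordinates are assumed to be [lon, lat] pairs at altitude 0.

```ruby
# Hypothetical container: one outer ring plus zero or more excluded rings.
Shape = Struct.new(:main_coords, :ex_coords)

# KML coordinates are "lon,lat,alt" tuples separated by spaces.
def ring(coords)
  text = coords.map { |lon, lat| "#{lon},#{lat},0" }.join(" ")
  "<LinearRing><coordinates>#{text}</coordinates></LinearRing>"
end

# Each excluded ring becomes an innerBoundaryIs, which cuts a hole in the polygon.
def polygon_kml(shape)
  holes = shape.ex_coords.map { |hole| "<innerBoundaryIs>#{ring(hole)}</innerBoundaryIs>" }.join
  "<Polygon><outerBoundaryIs>#{ring(shape.main_coords)}</outerBoundaryIs>#{holes}</Polygon>"
end

# Wrapping the polygons in a MultiGeometry lets one Placemark own all of them.
def multi_geometry_kml(shapes)
  "<MultiGeometry>#{shapes.map { |s| polygon_kml(s) }.join}</MultiGeometry>"
end
```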


In addition, I've made a number of enhancements under the hood. While I had been using some object-orientation before, I finally made the classes much more complete and generic so that it is much easier to add new geographies. For example, I created the Polygon object which encapsulates all of the information for a given polygon in one container:

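# Holds all of the information for one Census polygon in a single container.
class Polygon
  # The id and center point can be read and reassigned; the coordinate
  # lists are read-only references that callers fill by appending.
  attr_accessor :id, :centerLon, :centerLat
  attr_reader :mainCoords, :exCoords

  def initialize(id, centerLon, centerLat)
    @id = id
    @centerLon = centerLon
    @centerLat = centerLat
    @mainCoords = []  # coordinates of the main boundary
    @exCoords = []    # coordinates of any exclusions
  end
end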


With these improvements, I hope to have some maps to share soon. When I am once again inspired, I will share a lot more detail about the process and recent improvements.

Monday, April 2, 2007

Hawaii Population Data & New Features

As promised, my break gave me a new burst of energy. I spent my week off in Hawaii and figured it would be a good region to experiment with a few new map features. I tackled two items that were on my to-do list: labels & logarithmic color scales. I have generated a new map of population by County Subdivision for Hawaii to demonstrate these.

First, on labels. I've found that Google Earth is quite powerful except when it comes to labeling polygons. You can provide "names" for polygons, but these only appear in the "Places" panel on the left of the map view. I suppose this may have to do with the difficulty in figuring out where to put these names on the map display (given how oddly shaped a polygon can be), but I would have thought there was some good default behavior for this (if I am missing a feature of KML, please let me know!). To place a label on the map display, you have to create a "Point" placemark. Out of a desire to keep moving, I pushed forward without labels. The Census Bureau's shape files do actually include a center point for each polygon, so I have now gone back and updated my code to generate "Point" placemarks for each of these center points. With these points, I can now have labels appear on the map. This is very helpful, particularly when dealing with geographies below County (County Subdivision, Block Group, etc.).
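A label placemark of this kind is just a name plus a Point. Here is a minimal sketch, with a made-up function name, that assumes the center point is given as longitude and latitude and that the name needs XML escaping (subdivision names can contain characters like "&").

```ruby
# Builds a Point placemark whose name shows up as a label on the map.
def label_placemark(name, center_lon, center_lat)
  safe_name = name.encode(xml: :text)  # escapes &, < and >
  "<Placemark><name>#{safe_name}</name>" \
    "<Point><coordinates>#{center_lon},#{center_lat},0</coordinates></Point>" \
    "</Placemark>"
end
```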

Second, on logarithmic color scales. One of the challenges in mapping any sort of data that has a very wide range and is not very evenly distributed across the range is that it can be hard to find a color scheme that provides clarity at either extreme. I have talked about this in a couple of previous posts, but have finally gotten around to implementing a logarithmic scheme that more evenly distributes the data across the range. I'm not entirely happy with what I've implemented, so I plan to work on it further.

On to some screenshots. Below you'll find 3 perspectives of population, by County Subdivision, from the 2000 Census for Hawaii. Every meter in height represents 5 people; greener represents lower population, redder higher population. The labels come in handy because I'm not that familiar with the islands. The new logarithmic color scheme comes in handy because Honolulu has a much higher population than all of the other County Subdivisions. I've used the same Green to Red color scheme I've used before, but with the logarithmic scaling it now does a much better job of helping one to distinguish between the Subdivisions on the lower end of the population range. Without this new scheme, Honolulu would be red and everything else green.
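One common way to build a logarithmic scale, sketched below, is to take the log of each value before normalizing it to the 0..1 range and only then pick a color. This is an illustration rather than the scheme used for these maps: the function names are invented, and adding 1 before taking the log is just a convenience so that a population of 0 is still defined.

```ruby
# Position of value within [min, max] on a log scale, from 0.0 to 1.0.
def log_fraction(value, min, max)
  lo = Math.log(min + 1.0)
  hi = Math.log(max + 1.0)
  return 0.0 if hi == lo
  ((Math.log(value + 1.0) - lo) / (hi - lo)).clamp(0.0, 1.0)
end

# Green (t = 0) to red (t = 1), as an opaque KML color in aabbggrr order.
def green_to_red(t)
  red = (t * 255).round
  green = ((1.0 - t) * 255).round
  format("ff%02x%02x%02x", 0, green, red)
end
```

On a range of 0 to 999, a value of 9 sits about a third of the way along the log scale, where a linear scale would put it at under 1%. That spreading-out of the low end is what keeps the smaller subdivisions from all collapsing into the same shade of green.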

From directly overhead:


From an angle, looking North:


From an angle, looking South (so you can see the northern side of Oahu):


One other note of interest: I was perplexed for a few minutes because the Midway Islands and the other islands west of Kauai all showed up with tall, red polygons - meaning they have a high population (I excluded them from the screenshots for this reason). It turns out that the Honolulu County Subdivision includes all of these islands, hence they get the data for the entire Subdivision. I suppose this demonstrates one of the perils of geographic aggregation when working with an island chain.

I haven't disappeared...

I apologize for the lack of posts last week. I spent the week on vacation and didn't dream of touching a computer. My trip spurred a bunch of ideas for interesting maps that I am already working on. In the meantime, I will share a quick panorama I put together from pictures overlooking Hana, on the eastern side of Maui.

Thursday, March 22, 2007

More Migration Analysis

I'm finding the migration variables fascinating. These questions are a part of the long form and can be found in Summary File 3 on the Census website. For the maps in the screenshots below, I used the Population 5 years and over: Different house in 1995; In United States in 1995; Different county; Different state; ... variables. The variables allow the identification of where people are moving from, which is quite interesting. The variables are broken up into 4 regions: Northeast, Midwest, South, and West. What I have mapped is the number of people (over the age of 5) who have moved from a state in a particular region to another state (which may also be in that region). For example, in the Northeast map, if a person lived in Maine and moved to Arizona (the map below will show this appears to be quite a popular destination for New Englanders), they would be counted in the county they moved to in Arizona. If a person lived in Maine and moved to Vermont, they would be counted in the county they moved to in Vermont.

This data is broken up by county and the more red and taller a county, the more people that moved there. The heights are quite exaggerated: each person adds 10 meters of height to the county. These maps show how the linear color scale I've been employing to date only really works on datasets that have quite small ranges. I am working on a logarithmic scaling technique that should help on these sorts of datasets, where a small number of values may distort the distribution.

Also, I realize it might not have been intuitive: you can click on the pictures for a much larger version of the image. This is true for all of the pictures on the blog.

Northeast: New Englanders appear to be moving to Florida, Arizona, and California in droves. Chicago & Seattle get a fair number as well.


Midwest: Midwesterners are more focused on Arizona than California and quite drawn to Chicago.


South: Southerners appear to be moving all over including California, Arizona, Georgia, Texas, North Carolina, and DC.


West: Westerners shun moving to the Midwest, South, or East, favoring consolidation in Las Vegas (yes, the more northern red spike is Vegas) and Phoenix. You can see some movements to Hawaii & Alaska in the distance.

New way to reach CensusKML

To further facilitate the conversation about mapping Census data in Google Earth, I've created an e-mail address: censuskml [at] [gmail] (I hope you can decipher it). Feel free to reach out to me with specific questions or inquiries using either the e-mail address or the comment system.

KML (& KMZ) Support Added to the Google Maps API

Google Maps API Official Blog: KML and GeoRSS Support Added to the Google Maps API - this is pretty interesting since the functionality has been available in Google Maps for some time [see my previous post]. The post doesn't explicitly state it, but KMZs seem to work just as well. From simple experimentation, there appears to be a size limit somewhere between 100 and 200 KB. Just to share again, here are two KMZs of population by County Subdivision from the 2000 Census that you can view on Google Maps:

New KMZ!

Today I'm going to share a KMZ of a new variable that I've been working on. I wanted to experiment with more complicated variables, beyond Median Household Income, to really push the flexibility of the code I'm writing. The migration variables of summary file 3 from the Census are fascinating and fit the bill. They allow one to understand how people are moving about the country and include quite a bit of granularity. I'm going to share 7 states worth of data: Connecticut, Maine, Massachusetts, New Hampshire, New York, Rhode Island, and Vermont. Before I share links to the actual file, I want to be sure to share some important definitions, data sources, and notes.

What is mapped?

  • There are many migration variables, but for this example I've chosen to use two. I divided the Population 5 years and over: Different house in 1995 - P024003 variable by the Population 5 years and over: Total - P024001 variable, which represents the number of people in 2000 over the age of 5. The resulting percent should represent the share of people, over the age of 5, who in 1995 didn't live in the house they lived in during the 2000 Census. Put simply, the percent of people who moved in the last 5 years. The migration data provides a much more detailed breakout of where the people that moved came from, which I hope to work further with.
  • The data is presented broken out by county: the taller and more blue a county is in the file, the higher the percent that moved - the shorter and more green, the smaller the percent that moved. The actual percent can be found in the description for each polygon; however, it is multiplied by 1,000 there, so the actual value is what is found in the description divided by 1,000.
  • In the region covered by the provided KMZ (New England plus New York), Tompkins County, New York has the max value at ~58%. This means that ~58% of people in Tompkins County moved since 1995. There are many counties at the low end: Hamilton County, New York is quite low at ~30%, but so are Aroostook County, Maine and Orange County, Vermont, both at ~33%.
  • You'll also note how the counties are organized into folders within the KMZ file. This is a recent improvement to the KMZ generation process. I've also modified the code to follow the best practice of referencing repeated styles by ID (each polygon references a base style and only overrides what it needs to) - I thought this would save a lot on file size, but it didn't because ZIP was quite efficient at compressing the bits that were repeated again and again.
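The arithmetic behind the mapped value is small enough to show in full. The sketch below uses hypothetical function names and assumes the two inputs are the P024001 (total) and P024003 (different house in 1995) counts for a single county.

```ruby
# Percent of people aged 5 and over who lived in a different house in 1995.
def percent_moved(total_5_and_over, different_house_1995)
  return nil if total_5_and_over.zero?
  100.0 * different_house_1995 / total_5_and_over
end

# The KMZ descriptions store the percent multiplied by 1,000.
def description_value(percent)
  (percent * 1000).round
end

# Divide by 1,000 to get back to the percent.
def percent_from_description(stored)
  stored / 1000.0
end
```

So a county where 580 of every 1,000 people aged 5 and over had moved would come out at 58.0%, stored in the description as 58000.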
How about a quick tour of the map?
[YouTube seems to like to cut down the length of videos, so this may feel a bit choppy - not sure why it is doing this]


Where is the data from?
  • I downloaded the data from the FactFinder Download Center.
  • The County boundary files came from a Census website: 2000 County Cartographic Boundary Files.
    • The Census boundary files include data that denotes cut outs when a polygon should not cover an area. These are denoted with a -99999 ID in the Census boundary files. While I read these in, I have not decided the best way to handle them so that data is not represented here: in other words, some of the polygons may inappropriately cover an area.
Notes/Disclaimers:
  • This is a preliminary release.
  • Turn off Terrain for best viewing.
  • Rotate, fly around, change the viewing angle to get a real sense of the visualization!
  • Commercial use of this file is prohibited. If you are interested in using this file commercially, please drop an e-mail to censuskml [at] [Gmail].
  • I do not warrant, in any way, the accuracy of these maps. Use at your own risk.
Files:

Please pass along any feedback/thoughts/inquiries via comments!

Tuesday, March 20, 2007

Getting better color schemes...

I'm making good progress in finding color schemes that actually make sense. As a sneak peek, I've posted a video of a tour over the median household income data from the 2000 Census by County. Green is low median income and red is high (with the other way around, which may be more logical, there is just way too much red on the screen). The main hurdle I am working on now is how to reduce the file size. The file shown in the movie is ~7 MB.



Here are some of the color schemes that I have put together by hand (enough of the auto-hex code generation that led me to a random walk across hues & saturations). The movie above uses the scheme third from left.

Monday, March 19, 2007

Median Household Income

I have made progress and am now generating maps at the County level with data. I'm not happy with the color schemes yet, so I will be pushing on this before I share the KMZ file. The orientation of the screenshot below is looking Northeast over New England at an altitude of 900 km. The data is Median Household Income by County from the 2000 Census (Summary File 3). The brighter & lighter a color the higher the income. The heights of each County represent one meter for every dollar of median household income. If anyone knows of any good resources for color palettes, please drop me a line in the comments!

Counties

It is amazing how life can get in the way of making progress! I've been out of town for the past few days and did not have a chance to push the project forward. I finally got some time today and, in an effort to generate output I can share broadly, I'm working on some interesting KMZs. I wanted to put together KMZs with median income by county subdivision, but could not get NHGIS to generate this data. Given the complication with downloading the data from the Download Center on the Census website at the County Subdivision level, I've decided to roll up to the County level. The Census makes it easy to download variables for the entire country at this level and I am working to chart a couple of variables. If there are any you are particularly interested in seeing, just leave a comment.

I've gotten the code ready for Counties and I wanted to share a screenshot of what the 48 contiguous states look like broken out at the County level. Real data soon!

Wednesday, March 14, 2007

Get your real, live examples!

After only posting screenshots for the past few days, it is time to share some real, live KMZ files. I'm going to share 3 states worth of data: California, Massachusetts, and Wyoming. Before I share links to the actual files, I want to be sure to share some important definitions, data sources, and notes.

What is mapped?

  • I wanted to start simple: the following KMZs map the "Total Population" variable from the 2000 Census, broken out at the County Subdivision level. The height of each subdivision is equal to 1 meter for every 10 people. The colors also represent population, but use the state's max county subdivision population as the denominator, meaning they are only relevant within the state, not comparable across states. The largest subdivision will be bright red and the small ones will be white. I'm experimenting with this to try and get data on two levels - national and state.
Where is the data from?
  • While I could have gotten the data from the FactFinder, I got the actual Total Population variable data from the National Historical Geographic Information System (NHGIS). This system is powerful because, among other things, it lets you download the data for the entire US at once. The citation for this data is as follows:
    • John S. Adams, William C. Block, Mark Lindberg, Robert McMaster, Steven Ruggles, and Wendy Thomas, National Historical Geographic Information System: Pre-release Version 0.1 Minneapolis: Minnesota Population Center University of Minnesota, 2004.
  • The County Subdivision boundary files came from a Census website: 2000 County Subdivisions Cartographic Boundary Files.
    • The Census boundary files include data that denotes cut-outs where a polygon should not cover an area. These are marked with a -99999 ID in the Census boundary files. While I read these in, I have not decided the best way to handle them, so that data is not represented here: in other words, some of the polygons may inappropriately cover an area.
Notes/Disclaimers:
  • The actual population (in people) is listed for each County Subdivision in the name for each polygon. The number listed here is the actual population divided by 10, which is the height of the polygon in meters.
  • This is a preliminary release.
  • I license the use of these files under Creative Commons. Commercial use of the data is prohibited by NHGIS, but the data can be obtained in other ways, so Commercial use is not out of the question in the future.
  • Turn off Terrain for best viewing.
  • Rotate, fly around, change the viewing angle to get a real sense of the visualization!
  • I do not warrant in any way, the accuracy of these maps. Use at your own risk.
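
The height and color rules described under "What is mapped?" can be sketched in a few lines of Ruby. This is only an illustration, not the actual CensusKML code: the method names extrusion_height and kml_color are made up for this example. The one outside fact it relies on is that KML writes colors as aabbggrr hex (alpha, blue, green, red).

```ruby
# Hypothetical helpers, not taken from the CensusKML source.

# Height in meters: 1 meter for every 10 people.
def extrusion_height(population)
  population / 10.0
end

# KML colors are aabbggrr hex strings. Fade from white (share = 0)
# to bright red (share = 1), where share is the subdivision's
# population divided by the largest subdivision in the same state.
def kml_color(population, state_max)
  share = state_max.zero? ? 0.0 : population.to_f / state_max
  fade = (255 * (1.0 - share)).round   # green and blue channels
  format("ff%02x%02xff", fade, fade)   # opaque, red channel always full
end

puts extrusion_height(589_141)   # height for a made-up population
puts kml_color(0, 1000)          # smallest possible share
puts kml_color(1000, 1000)       # the state's largest subdivision
```

Because the denominator is the state's own maximum, the same color can mean very different populations in two different states, which is why the colors are only comparable within a state.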

Maps of Total Population (people) from the 2000 Census by County Subdivision:
Please pass along any feedback/thoughts/inquiries via comments!

How hard can it be to generate KMZ files?

One of the items I have had on my to do list for the past few days is to change the output of my program to KMZs from KMLs. At high levels of geographic complexity (such as Block Groups or County Subdivisions) the KML outputs can be very large (5 to 10 megabytes per state). KMZs are simply KMLs, but zipped. Everything I've tried to do in Ruby has turned out to be so easy, I figured this would be as well. I was wrong!

I started by trying to use the built-in zlib module that appears to be a part of the standard Ruby install. I first tried the following bit of code:

outputFile = File.new(outputFilename,"w+")
zipWriter = GzipWriter.new(outputFile)
xml.write( zipWriter )
zipWriter.close

I was thinking to myself that this was just too easy, and it was. The output file came out as a KMZ file which Google Earth promptly refused to open. I kept getting some sort of bizarre error like "unexpected token at line 1, column 0". What got interesting is that if I just changed the file's extension to .zip and unzipped it using the built-in Archive tool (on Mac OS X), the resulting KML file worked just fine - no problems!

Next, I tried writing the XML to a string which I then wrote to a zipped file, in case the streams weren't playing well together.

xmlString = StringIO.open("", "w+")
xml.write( xmlString, 0 )
zipWriter = GzipWriter.new(outputFile)
xmlString.rewind
zipWriter.write( xmlString.read )
xmlString.close
zipWriter.close

This code produced the exact same output and the exact same error in Google Earth.

Getting frustrated, I did some online searching. I found out that the KMZ spec says that the main KML in the KMZ should be named doc.kml. So, I tried writing out the KML file and then zipping that into another file:

xml.write( File.new("doc.kml","w+"), 0 )

GzipWriter.open("doc.kmz") do |gz|
  gz.orig_name = "doc.kml"
  gz.write(File.read("doc.kml"))
  gz.close
end

Same error! At each iteration I could change KMZ to ZIP, unarchive the file and open the resulting KML. I could even take a KML file, zip it, change the extension to .KMZ and open it up in Google Earth, regardless of the original filename (no "doc.kml" requirement seemed needed)! What was going on?

Next, I tried using the system gzip command, thinking maybe there was a problem with the Zlib module:

xml.write( File.new("doc.kml","w+"), 0 )
system("gzip", "-S", ".kmz", "doc.kml")

Again, same error!

At this point, I figured that all ZIP algorithms must not have been created equal, so I went searching for another ZIP Ruby module. I found one called rubyzip. I was encouraged by this new module because it had a lot more functionality than zlib and a lot better documentation. After giving up on using RubyGems to install the darned thing, I simply downloaded the code and installed it myself. A few sweet moments later, I held my breath and ran the following code:

ZipFile.open( outputFilename + ".kmz", ZipFile::CREATE) {|zipfile|
  zipfile.get_output_stream("doc.kml") {|file| xml.write(file, 0)}
  }

It worked! I wish I had a better conclusion as to exactly what was going wrong with zlib, but the moral of the story is if you are trying to create KMZ files in Ruby, use rubyzip!
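
One likely explanation, offered as a hedged aside rather than a confirmed diagnosis: gzip and ZIP are different container formats. GzipWriter (and the gzip command) produce a gzip stream, while a KMZ is a ZIP archive, which is what rubyzip writes. The two can be told apart by their first bytes. The sketch below uses only the standard library; container_kind is a hypothetical helper written for this example.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper: guess the container format from the leading bytes.
def container_kind(data)
  head = data.bytes.first(4)
  return :gzip if head.first(2) == [0x1f, 0x8b]          # gzip magic number
  return :zip  if head == [0x50, 0x4b, 0x03, 0x04]       # "PK\x03\x04" local file header
  :unknown
end

# Write a tiny document through GzipWriter, as in the attempts above.
buffer = StringIO.new
gz = Zlib::GzipWriter.new(buffer)
gz.write("<kml/>")
gz.close

puts container_kind(buffer.string)   # what the zlib attempts produced
```

Running the same check on a file that Google Earth refuses to open is a quick way to see which of the two formats it really is, whatever its extension says.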

Tuesday, March 13, 2007

Finally got 3D colors working!

Thanks to the very first (and still only) comment left so far on CensusKML I have finally gotten good (well, maybe just bright) colors working when extruding the Census data out on Google Earth. I admit that I am terrible at picking color schemes, so please forgive the white-to-red one I'm using here. The trick was relatively simple: the coordinates must be listed in counter-clockwise order in order for Google Earth to properly draw the colors. I don't understand why this is the case, but it is. The Census boundary files list the coordinates for each polygon in clockwise fashion. Ruby made it very simple to reverse the coordinates. The coordinates are initially stored in an array in the order they are read from the Census files - when it comes time to read them out and put them into the KML file, all you have to do is call reverse! on the coordinate array and the array is reversed in place.
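
For anyone who wants to check the direction of a ring before reversing it, here is a minimal sketch. It is not the code I run; signed_area and counter_clockwise are names invented for this example, and it assumes each coordinate is a [longitude, latitude] pair and treats them as flat x/y values, which is fine for deciding direction on small polygons.

```ruby
# Shoelace formula: the signed area is positive when the points run
# counter-clockwise and negative when they run clockwise.
def signed_area(ring)
  sum = 0.0
  ring.each_with_index do |(x1, y1), i|
    x2, y2 = ring[(i + 1) % ring.length]   # wrap around to the first point
    sum += x1 * y2 - x2 * y1
  end
  sum / 2.0
end

# Return the ring in counter-clockwise order, reversing only if needed.
def counter_clockwise(ring)
  signed_area(ring) < 0 ? ring.reverse : ring
end

clockwise_square = [[0, 0], [0, 1], [1, 1], [1, 0]]
puts signed_area(clockwise_square)
puts signed_area(counter_clockwise(clockwise_square))
```

Checking the sign first means a file that already lists its points counter-clockwise is left alone instead of being flipped the wrong way.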

The following is a screen shot of the population data from the 2000 Census, broken out at the County Subdivision level. The height of each subdivision is equal to 1 meter for every 10 people. The colors also represent population, but use the state's max county subdivision population as the denominator, meaning they are only relevant within the state, not comparable across states. The largest subdivision will be bright red and the small ones will be white. I'm experimenting with this to try and get data on two levels - national and state. Among all of the data, note Boston on the upper-left, Florida on the upper-far-right in the distance, and Chicago on the bottom-right. Here you go:


You can compare the previous picture to this one to see how big a difference it makes to list the coordinates in a counter-clockwise fashion:

Monday, March 12, 2007

Some data resources & Google Maps

I've been travelling this past weekend so my development efforts were on hiatus. I plan to share some more nation-wide outputs tomorrow, but I wanted to share a fantastic website that someone passed along to me for Census data. The National Historical Geographic Information System (NHGIS) website is an extremely powerful data source for Census data. The website describes itself as follows:

The National Historical Geographic Information System (NHGIS) is a project
to create and freely disseminate a database incorporating all available
aggregate census information for the United States between 1790 and 2000.

The website is made by the Minnesota Population Center at the University of Minnesota. One of the most helpful aspects of the site is that it is slightly more intuitive and more powerful than the Census' FactFinder. For example, when looking at data at the County Subdivision level, you can download data for all of the states at once - on FactFinder it seems that the only way to get the data for all 50 states is to download each one individually which is a lot of wasted work. If I am missing this facility on FactFinder - please let me know!

The other aspect of the NHGIS website is that it seems built to provide the data in time series - something that the FactFinder website doesn't really provide. The Census website is admirably powerful, but lacks a couple of features that would make it much more so.

Just to keep up with the screenshots - I've got one more to share. One of the nifty things you can do with Google Maps (the website as opposed to the desktop app Google Earth) is display KML files. In the screenshot below I simply put the KML file I had been displaying in Google Earth on a public HTTP site and loaded up the URL in the search box on Google Maps. My KML file had a bit too much data, so Google Maps can't draw all of the polygons, but with the right paring down of data, this would be a very effective way to share KML files with a broad audience. In fact, this is a bit of a cleaner interface than the Google Earth desktop app. I think you lose the 3D, but you do get transparency and good colors:

Friday, March 9, 2007

Ruby has a big standard library

One of the aspects of Ruby that I've been most impressed by is the size and sophistication of its standard libraries. There are complete, easy to use classes for almost all of the functionality I've needed to date - and importantly they have the functions one really needs to be productive built right in. A good example of this is code to loop through a directory and operate on certain types of files. After some fishing around the web, I found some bits and pieces of very simple code that does exactly what I need it to do. Now, I am no expert, so I would appreciate critiques and better solutions, as I know that I'm missing ways to do this better, but I've already found a pretty simple way to do this. You'll also probably note that I like long variable names.

The following code looks through a directory, all the way down to the leaves (files in sub-directories), and picks out files that end in "a.dat". This is the extension of the polygon metadata files from the Census website. It then creates the file names for the actual polygon files (they have the same name, without the "a" before the extension) and the metadata files. From here, it passes those into the functions which read and parse these files to generate the KML outputs, which I will share when they are more robust and mature.


require 'find'

Find.find("/Users/xxx/Desktop/Census 2000 CSDs") do |path|
  # Only handle the polygon metadata files, whose names end in "a.dat".
  # (A plain match("a.dat") would treat "." as a wildcard and match anywhere in the path.)
  next unless path.end_with?("a.dat")

  # basename drops the directory and the trailing "a.dat" suffix
  filename = File.basename(path, "a.dat")
  directory = File.dirname(path)
  polyShapeFilename = File.join(directory, filename + ".dat")
  polyDataFilename = File.join(directory, filename + "a.dat")
  outputFilename = File.join(Dir.getwd, "batch_outputs", filename + ".kml")

  puts "Starting #{filename}"

  # I'll publish this method when it is more complete
  genOutput(...)

  puts "Done #{filename}"
end


An example of the productivity I am talking about is the Find module. This built-in module recursively searches through directories - how helpful!

One of the most powerful concepts in Ruby that I am just getting comfortable with is the block, like Find.find("/Users/xxx/Desktop/Census 2000 CSDs") do |path| that lets you run loops over collections of things. In this case, the code is going to loop every time the find method returns a path. I'm sure I'm butchering the Ruby language, but if a novice can figure these parts out, then the language, and its authors, have done a number of things right.
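
To make the idea of a block a little more concrete, here is a small sketch of how a method can hand values to a block with yield, which is roughly what Find.find does with each path. The method each_metadata_file and the file names are invented for this example.

```ruby
# Hypothetical method: calls the caller's block once per matching name.
def each_metadata_file(names)
  names.each do |name|
    # yield runs the block that was passed in, with name as its argument
    yield name if name.end_with?("a.dat")
  end
end

found = []
each_metadata_file(["example.dat", "examplea.dat", "notes.txt"]) do |name|
  found << name   # this body is the block; it runs once per yielded name
end

puts found.inspect
```

The method decides when the block runs and what it receives; the block decides what to do with each value. That split is what lets one Find.find call drive whatever processing I put between do and end.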

Thursday, March 8, 2007

County Subdivisions

I'm slowly improving the process of generating the KML files from Census boundary files. One of the challenges with wanting to show data for the entire country is that the boundary files are, at every level, broken out by state. There is no roll up for the entire country, even for the larger geographic areas. With the help of FlashGot, I've written a Ruby script that can take a directory full of ASCII boundary files, reading both the polygons & the metadata components, and create KML outputs. I am still generating each state as a separate KML file for viewing reasons, but I also generate a master KML file that links to each of the individual ones.
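
The master file idea can be sketched as follows. This is an illustration under stated assumptions, not my actual script: master_kml and the state file names are hypothetical, the XML is built as a plain string to keep the example dependency-free, and it assumes the NetworkLink element with a nested Link/href and the KML 2.2 namespace.

```ruby
# Hypothetical helper: build a master KML document that points at one
# KML file per state through NetworkLink elements.
def master_kml(state_files)
  links = state_files.map do |name, href|
    # encode(xml: :text) escapes &, < and > so names are safe inside XML
    "    <NetworkLink>\n" \
    "      <name>#{name.encode(xml: :text)}</name>\n" \
    "      <Link><href>#{href.encode(xml: :text)}</href></Link>\n" \
    "    </NetworkLink>\n"
  end

  "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" \
  "<kml xmlns=\"http://www.opengis.net/kml/2.2\">\n" \
  "  <Document>\n" \
  "#{links.join}" \
  "  </Document>\n" \
  "</kml>\n"
end

# Made-up file names, one per state.
puts master_kml("Maine" => "me.kml", "Vermont" => "vt.kml")
```

Keeping each state in its own file means Google Earth only has to load the states that are switched on, while the master file still gives a single place to open everything.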

The following picture shows a map of all County Subdivisions from the 2000 Census. The colors are random.


This is a zoom of New England, with borders around each subdivision:

Welcome to Census KML!

A couple of days ago my wife asked me to help her map some Census data. I have always been interested in maps, and after a couple of web searches, I found a couple of interesting examples of people using Google Earth to map data. In particular, I found this post on the Juice Analytics blog. This got me going - I had some spare time, a desire to learn Ruby, a lot of data, a love of data visualization, and a purpose.

I plan to use this blog to share what I've created and let others share their knowledge, feedback, and experience. Please join in the discussion and check back frequently as I hope to share what I've found.

This wouldn't be much of a start if I didn't share some of the progress I've been making. The following screen shot shows the 2000 Census population by County Subdivision data where each meter of height represents 10 people. [There is also coloring, but I've found what appears to be a bug in Google Earth such that it doesn't properly draw colors for 3D polygons.]