Thursday, March 22, 2007

More Migration Analysis

I'm finding the migration variables fascinating. These questions are a part of the long form and can be found in Summary File 3 on the Census website. For the maps in the screenshots below, I used the Population 5 years and over: Different house in 1995; In United States in 1995; Different county; Different state; ... variables. The variables allow the identification of where people are moving from, which is quite interesting. The variables are broken up into 4 regions: Northeast, Midwest, South, and West. What I have mapped is the number of people (over the age of 5) who have moved from a state in a particular region to another state (which may also be in that region). For example, in the Northeast map, if a person lived in Maine and moved to Arizona (the map below will show this appears to be quite a popular destination for New Englanders), they would be counted in the county they moved to in Arizona. If a person lived in Maine and moved to Vermont, they would be counted in the county they moved to in Vermont.

This data is broken up by county: the redder and taller a county, the more people moved there. The heights are quite exaggerated: each person adds 10 meters of height to the county. These maps show how the linear color scale I've been employing to date only really works on datasets that have quite small ranges. I am working on a logarithmic scaling technique that should help on these sorts of datasets, where a small number of very large values can distort the distribution of values.
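To make the linear versus logarithmic point concrete, here is a minimal sketch of the two scalings. The method names and the sample counts are made up for illustration; this is not the CensusKML code, just one way the idea could look.

```ruby
# Hypothetical helpers: map a raw count onto the 0.0..1.0 range used to pick a color.

# Linear scaling: a few huge values push everything else toward 0.
def linear_scale(value, min, max)
  return 0.0 if max == min
  (value - min).to_f / (max - min)
end

# Logarithmic scaling: compresses large values so small ones stay distinguishable.
# Adding 1 before taking the log keeps a count of 0 well-defined.
def log_scale(value, min, max)
  return 0.0 if max == min
  low = Math.log(min + 1)
  high = Math.log(max + 1)
  (Math.log(value + 1) - low) / (high - low)
end

counts = [0, 50, 500, 5_000, 50_000] # made-up movers per county
counts.each do |count|
  linear = linear_scale(count, 0, 50_000)
  logged = log_scale(count, 0, 50_000)
  printf("%6d  linear=%.3f  log=%.3f\n", count, linear, logged)
end
```

With a range like this, the linear scale leaves most counties almost indistinguishable near zero, while the log scale spreads them across the color ramp.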

Also, I realize it might not have been intuitive: you can click on the pictures for a much larger version of the image. This is true for all of the pictures on the blog.

Northeast: New Englanders appear to be moving to Florida, Arizona, and California in droves. Chicago & Seattle get a fair number as well.


Midwest: Midwesterners are more focused on Arizona than California and quite drawn to Chicago.


South: Southerners appear to be moving all over, including California, Arizona, Georgia, Texas, North Carolina, and DC.


West: Westerners shun moving to the Midwest, South, or East, favoring consolidation in Las Vegas (yes, the more northern red spike is Vegas) and Phoenix. You can see some movements to Hawaii & Alaska in the distance.

New way to reach CensusKML

To further facilitate the conversation about mapping Census data in Google Earth, I've created an e-mail address: censuskml [at] [gmail] (I hope you can decipher it). Feel free to reach out to me with specific questions or inquiries using either the e-mail address or the comment system.

KML (& KMZ) Support Added to the Google Maps API

Google Maps API Official Blog: KML and GeoRSS Support Added to the Google Maps API - this is pretty interesting since the functionality has been available in Google Maps for some time [see my previous post]. The post doesn't explicitly state it, but KMZs seem to work just as well. From simple experimentation, there appears to be a size limit somewhere between 100 and 200 KB. Just to share again, here are two KMZs of population by County Subdivision from the 2000 Census that you can view on Google Maps:

New KMZ!

Today I'm going to share a KMZ of a new variable that I've been working on. I wanted to experiment with more complicated variables, beyond Median Household Income, to really push the flexibility of the code I'm writing. The migration variables of Summary File 3 from the Census are fascinating and fit the bill. They allow one to understand how people are moving about the country and include quite a bit of granularity. I'm going to share 7 states' worth of data: Connecticut, Maine, Massachusetts, New Hampshire, New York, Rhode Island, and Vermont. Before I share links to the actual file, I want to be sure to share some important definitions, data sources, and notes.

What is mapped?

  • There are many migration variables, but for this example I've chosen to use two. I took the Population 5 years and over: Different house in 1995 - P024003 variable and divided it by the Population 5 years and over: Total - P024001 variable, which represents the number of people in 2000 over the age of 5. The resulting percent should represent the share of people, over the age of 5, who in 1995 didn't live in the house they lived in during the 2000 Census. Put simply, the percent of people who moved in the last 5 years. The migration data provides a much more detailed breakout of where the people that moved came from, which I hope to work further with.
  • The data is presented broken out by county: the taller and more blue a county is in the file, the higher the percent that moved; the shorter and more green, the smaller the percent that moved. The actual percent can be found in the description for each polygon; however, it is multiplied by 1,000 there, so the actual value is what is found in the description divided by 1,000.
  • In the region covered by the provided KMZ (New England plus New York), Tompkins County, New York has the max value at ~58%. This means that ~58% of people in Tompkins County moved since 1995. There are many counties at the low end: Hamilton County, New York is quite low at ~30%, but so are Aroostook County, Maine and Orange County, Vermont, both at ~33%.
  • You'll also note how the counties are organized into folders within the KMZ file. This is a recent improvement to the KMZ generation process. I've also modified the code to follow the best practice of referencing repeated styles by ID (each polygon references a base style and only overrides what it needs to) - I thought this would save a lot on file size, but it didn't because ZIP was quite efficient at compressing the bits that were repeated again and again.
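
The percent-moved calculation described in the first bullet can be sketched roughly as follows. The method name and the sample counts are hypothetical (they are not real Census figures); only the variable codes P024001 and P024003 come from the post.

```ruby
# Hypothetical helper: share of the population aged 5+ living in a
# different house than in 1995.
#   total_5_plus    -> P024001 (Population 5 years and over: Total)
#   different_house -> P024003 (Different house in 1995)
def percent_moved(total_5_plus, different_house)
  return nil if total_5_plus.zero? # avoid dividing by zero for empty geographies
  100.0 * different_house / total_5_plus
end

# Made-up counts for illustration only, not real Census figures.
county = { name: "Example County", p024001: 90_000, p024003: 52_200 }
pct = percent_moved(county[:p024001], county[:p024003])
puts format("%s: %.1f%% moved since 1995", county[:name], pct)
```

Returning nil for an empty geography is a design choice in this sketch; it keeps a missing value from being mistaken for a county where nobody moved.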
How about a quick tour of the map?
[YouTube seems to like to cut down the length of videos, so this may feel a bit choppy - not sure why it is doing this]


Where is the data from?
  • I downloaded the data from the FactFinder Download Center.
  • The County boundary files came from a Census website: 2000 County Cartographic Boundary Files.
    • The Census boundary files include data that denotes cutouts where a polygon should not cover an area. These are denoted with a -99999 ID in the Census boundary files. While I read these in, I have not decided the best way to handle them, so that data is not represented here: in other words, some of the polygons may inappropriately cover an area.
Notes/Disclaimers:
  • This is a preliminary release.
  • Turn off Terrain for best viewing.
  • Rotate, fly around, change the viewing angle to get a real sense of the visualization!
  • Commercial use of this file is prohibited. If you are interested in using this file commercially, please drop an e-mail to censuskml [at] [Gmail].
  • I do not warrant in any way, the accuracy of these maps. Use at your own risk.
Files:

Please pass along any feedback/thoughts/inquiries via comments!

Tuesday, March 20, 2007

Getting better color schemes...

I'm making good progress in finding color schemes that actually make sense. As a sneak peek, I've posted a video of a tour over the median household income data from the 2000 Census by County. Green is low median income and red is high (with the scale the other way around, which may be more logical, there is just way too much red on the screen). The main hurdle I am working on now is how to reduce the file size. The file shown in the movie is ~7 MB.



Here are some of the color schemes that I have put together by hand (enough of the auto-hex code generation that led me to a random walk across hues & saturations). The movie above uses the scheme third from left.

Monday, March 19, 2007

Median Household Income

I have made progress and am now generating maps at the County level with data. I'm not happy with the color schemes yet, so I will be pushing on this before I share the KMZ file. The orientation of the screenshot below is looking Northeast over New England at an altitude of 900 km. The data is Median Household Income by County from the 2000 Census (Summary File 3). The brighter & lighter a color, the higher the income. The heights of each County represent one meter for every dollar of median household income. If anyone knows of any good resources for color palettes, please drop me a line in the comments!

Counties

It is amazing how life can get in the way of making progress! I've been out of town for the past few days and did not have a chance to push the project forward. I finally got some time today and, in an effort to generate output I can share broadly, I'm working to generate some interesting KMZs to share. I wanted to put together KMZs with median income by county subdivision, but could not get NHGIS to generate this data. Given the complication with downloading the data from the Download Center on the Census website at the County Subdivision level, I've decided to roll up to the County level. The Census makes it easy to download variables for the entire country at this level and I am working to chart a couple of variables. If there are any you are particularly interested in seeing, just leave a comment.

I've gotten the code ready for Counties and I wanted to share a screenshot of what the 48 contiguous states look like broken out at the County level. Real data soon!

Wednesday, March 14, 2007

Get your real, live examples!

After only posting screenshots for the past few days, it is time to share some real, live KMZ files. I'm going to share 3 states worth of data: California, Massachusetts, and Wyoming. Before I share links to the actual files, I want to be sure to share some important definitions, data sources, and notes.

What is mapped?

  • I wanted to start simple: the following KMZs map the "Total Population" variable from the 2000 Census, broken out at the County Subdivision level. The height of each subdivision is equal to 1 meter for every 10 people. The colors also represent population, but use the state's max county subdivision population as the denominator, meaning they are only relevant within the state, not comparable across states. The largest subdivision will be bright red and the small ones will be white. I'm experimenting with this to try and get data on two levels - national and state.
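
The height and color rules above could be sketched like this. The helper names are hypothetical and this is not the original CensusKML code; the one detail taken from the KML format itself is that colors are written as aabbggrr (alpha, blue, green, red) in hex.

```ruby
# Hypothetical helpers, not the original CensusKML code.

# Extrusion height in meters: 1 meter for every 10 people.
def height_in_meters(population)
  population / 10.0
end

# White-to-red ramp relative to the state's largest subdivision.
# KML colors are written aabbggrr (alpha, blue, green, red),
# so white is ffffffff and pure red is ff0000ff.
def kml_color(population, state_max)
  share = state_max.zero? ? 0.0 : population.to_f / state_max
  share = share.clamp(0.0, 1.0)
  fade = (255 * (1.0 - share)).round # blue and green fade out as the share grows
  format("ff%02x%02xff", fade, fade)
end

puts height_in_meters(1_000)
puts kml_color(0, 1_000)
puts kml_color(1_000, 1_000)
```

Because the denominator is the state's own maximum, the same color can mean very different populations in different states, which is the within-state caveat mentioned above.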
Where is the data from?
  • While I could have gotten the data from the FactFinder, I got the actual Total Population variable data from National Historical Geographic Information System (NHGIS). This system is powerful because, among other things, it lets you download the data for the entire US at once. The citation for this data is as follows:
    • John S. Adams, William C. Block, Mark Lindberg, Robert McMaster, Steven Ruggles, and Wendy Thomas, National Historical Geographic Information System: Pre-release Version 0.1 Minneapolis: Minnesota Population Center University of Minnesota, 2004.
  • The County Subdivision boundary files came from a Census website: 2000 County Subdivisions Cartographic Boundary Files.
    • The Census boundary files include data that denotes cutouts where a polygon should not cover an area. These are denoted with a -99999 ID in the Census boundary files. While I read these in, I have not decided the best way to handle them, so that data is not represented here: in other words, some of the polygons may inappropriately cover an area.
Notes/Disclaimers:
  • The actual population (in people) is listed for each County Subdivision in the name for each polygon. The number listed here is the actual population divided by 10, which is the height of the polygon in meters.
  • This is a preliminary release.
  • I license the use of these files under Creative Commons, and commercial use of the data is prohibited by NHGIS. The data can be obtained from other sources, so commercial use is not out of the question in the future.
  • Turn off Terrain for best viewing.
  • Rotate, fly around, change the viewing angle to get a real sense of the visualization!
  • I do not warrant in any way, the accuracy of these maps. Use at your own risk.

Maps of Total Population (people) from the 2000 Census by County Subdivision:
Please pass along any feedback/thoughts/inquiries via comments!

How hard can it be to generate KMZ files?

One of the items I have had on my to-do list for the past few days is to change the output of my program to KMZs from KMLs. At high levels of geographic complexity (such as Block Groups or County Subdivisions) the KML outputs can be very large (5 to 10 megabytes per state). KMZs are simply KMLs, but zipped. Everything I've tried to do in Ruby has turned out to be so easy, I figured this would be as well. I was wrong!

I started by trying to use the built in zlib module that appears to be a part of the standard Ruby install. I first tried the following bit of code:

require "zlib"

outputFile = File.new(outputFilename, "w+")
zipWriter = Zlib::GzipWriter.new(outputFile)
xml.write( zipWriter )
zipWriter.close

I was thinking to myself that this was just too easy, and it was. The output file came out as a KMZ file which Google Earth promptly refused to open. I kept getting some sort of bizarre error like "unexpected token at line 1, column 0". What got interesting is that if I just changed the file's extension to .zip and unzipped it using the built in Archive tool (on Mac OS X), the resulting KML file worked just fine - no problems!

Next, I tried writing the XML to a string which I then wrote to a zipped file, in case the streams weren't playing well together.

require "stringio"

xmlString = StringIO.open("", "w+")
xml.write( xmlString, 0 )
zipWriter = Zlib::GzipWriter.new(outputFile)
xmlString.rewind
zipWriter.write( xmlString.read )
xmlString.close
zipWriter.close

This code produced the exact same output and the exact same error in Google Earth.

Getting frustrated, I did some online searching. I found out that the KMZ spec says that the main KML in the KMZ should be named doc.kml. So, I tried writing out the KML file and then zipping that into another file:

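# The block form closes (and flushes) the file before it is read back
File.open("doc.kml", "w") { |file| xml.write(file, 0) }

# The block form of open closes the writer automatically
Zlib::GzipWriter.open("doc.kmz") do |gz|
  gz.orig_name = "doc.kml"
  gz.write(File.read("doc.kml"))
end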

Same error! At each iteration I could change KMZ to ZIP, unarchive the file and open the resulting KML. I could even take a KML file, zip it, change the extension to .KMZ and open it up in Google Earth, regardless of the original filename (no "doc.kml" requirement seemed needed)! What was going on?

Next, I tried using the system gzip command, thinking maybe there was a problem with the Zlib module:

File.open("doc.kml", "w") { |file| xml.write(file, 0) }
# -S sets the suffix gzip appends to the compressed file
system("gzip", "-S", ".kmz", "doc.kml")

Again, same error!

At this point, I figured that all ZIP algorithms must not have been created equal, so I went searching for another ZIP Ruby module. I found one called rubyzip. I was encouraged by this new module because it had a lot more functionality than zlib and a lot better documentation. After giving up on using RubyGems to install the darned thing, I simply downloaded the code and installed it myself. A few sweet moments later, I held my breath and ran the following code:

ZipFile.open( outputFilename + ".kmz", ZipFile::CREATE) {|zipfile|
  zipfile.get_output_stream("doc.kml") {|file| xml.write(file, 0)}
  }

It worked! I wish I had a better conclusion as to exactly what was going wrong with zlib, but the moral of the story is if you are trying to create KMZ files in Ruby, use rubyzip!
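
One likely explanation, offered as an assumption rather than something verified against the original files: gzip and ZIP are different container formats. Zlib's GzipWriter and the gzip command write a gzip stream, while a KMZ is a ZIP archive, so Google Earth would not find the archive it expects even though a forgiving unarchiver can still recover the KML. The sketch below uses a hypothetical helper, container_format, to check the signature bytes at the start of the data.

```ruby
require "zlib"
require "stringio"

# Report which container format a byte string starts with, based on its
# signature bytes: gzip streams begin with 1f 8b, ZIP archives with "PK\x03\x04".
def container_format(bytes)
  bytes = bytes.b
  if bytes.start_with?("\x1F\x8B".b)
    :gzip
  elsif bytes.start_with?("PK\x03\x04".b)
    :zip
  else
    :unknown
  end
end

# Compress a tiny stand-in KML document the way the zlib attempts above did.
buffer = StringIO.new
buffer.set_encoding(Encoding::BINARY)
gz = Zlib::GzipWriter.new(buffer)
gz.write("<kml></kml>")
gz.close

puts container_format(buffer.string)
```

Running the same check on a file produced by rubyzip should report a ZIP archive instead, which would account for why that version opened.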

Tuesday, March 13, 2007

Finally got 3D colors working!

Thanks to the very first (and still only) comment left so far on CensusKML, I have finally gotten good (well, maybe just bright) colors working when extruding the Census data out on Google Earth. I admit that I am terrible at picking color schemes, so please forgive the white-to-red one I'm using here. The trick was relatively simple: the coordinates must be listed in counter-clockwise order for Google Earth to properly draw the colors. I don't understand why this is the case, but it is. The Census boundary files list the coordinates for each polygon in clockwise fashion. Ruby made it very simple to reverse the coordinates. The coordinates are initially stored in an array in the order they are read from the Census files - when it comes time to read them out and put them into the KML file, all you have to do is call reverse! on the coordinate array and the array is reversed in place.
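
As a rough sketch of the reversal, the snippet below goes one step further than the post and checks the winding order first (using the shoelace formula) so that reverse! is only called when a ring is actually clockwise. The helper names are hypothetical, and each point is assumed to be a [longitude, latitude] pair.

```ruby
# Hypothetical helpers. Each point is [longitude, latitude].

# Twice the signed area via the shoelace formula: positive means
# counter-clockwise, negative means clockwise (x = longitude, y = latitude).
def signed_area_doubled(ring)
  ring.each_index.sum do |i|
    x1, y1 = ring[i]
    x2, y2 = ring[(i + 1) % ring.length]
    x1 * y2 - x2 * y1
  end
end

def counter_clockwise?(ring)
  signed_area_doubled(ring) > 0
end

# Reverse in place only when needed, then return the ring.
def ensure_counter_clockwise!(ring)
  ring.reverse! unless counter_clockwise?(ring)
  ring
end

clockwise_square = [[0, 0], [0, 1], [1, 1], [1, 0]]
ensure_counter_clockwise!(clockwise_square)
puts counter_clockwise?(clockwise_square)
```

If every input ring is known to be clockwise, as the post says of the Census files, a plain reverse! is enough; the check only guards against mixed inputs.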

The following is a screen shot of the population data from the 2000 Census, broken out at the County Subdivision level. The height of each subdivision is equal to 1 meter for every 10 people. The colors also represent population, but use the state's max county subdivision population as the denominator, meaning they are only relevant within the state, not comparable across states. The largest subdivision will be bright red and the small ones will be white. I'm experimenting with this to try and get data on two levels - national and state. Among all of the data, note Boston on the upper-left, Florida on the upper-far-right in the distance, and Chicago on the bottom-right. Here you go:


You can compare the previous picture to this one to see how big a difference it makes to list the coordinates in a counter-clockwise fashion:

Monday, March 12, 2007

Some data resources & Google Maps

I've been travelling this past weekend so my development efforts were on hiatus. I plan to share some more nation-wide outputs tomorrow, but I wanted to share a fantastic website that someone passed along to me for Census data. The National Historical Geographic Information System (NHGIS) website is an extremely powerful data source for Census data. The website describes itself as follows:

The National Historical Geographic Information System (NHGIS) is a project
to create and freely disseminate a database incorporating all available
aggregate census information for the United States between 1790 and 2000.

The website is made by the Minnesota Population Center at the University of Minnesota. One of the most helpful aspects of the site is that it is slightly more intuitive and more powerful than the Census' FactFinder. For example, when looking at data at the County Subdivision level, you can download data for all of the states at once - on FactFinder it seems that the only way to get the data for all 50 states is to download each one individually which is a lot of wasted work. If I am missing this facility on FactFinder - please let me know!

The other aspect of the NHGIS website is that it seems built to provide the data in time series - something that the FactFinder website doesn't really provide. The Census website is admirably powerful, but lacks a couple of features that would make it much more so.

Just to keep up with the screenshots - I've got one more to share. One of the nifty things you can do with Google Maps (the website as opposed to the desktop app Google Earth) is display KML files. In the screenshot below I simply put the KML file I had been displaying in Google Earth on a public HTTP site and loaded up the URL in the search box on Google Maps. My KML file had a bit too much data, so Google Maps can't draw all of the polygons, but with the right paring down of data, this would be a very effective way to share KML files with a broad audience. In fact, this is a bit of a cleaner interface than the Google Earth desktop app. I think you lose the 3D, but you do get transparency and good colors:

Friday, March 9, 2007

Ruby has a big standard library

One of the aspects of Ruby that I've been most impressed by is the size and sophistication of its standard libraries. There are complete, easy to use classes for almost all of the functionality I've needed to date - and importantly they have the functions one really needs to be productive built right in. A good example of this is code to loop through a directory and operate on certain types of files. After some fishing around the web, I found some bits and pieces of very simple code that does exactly what I need it to do. Now, I am no expert, so I would appreciate critiques and better solutions, as I know that I'm missing ways to do this better, but I've already found a pretty simple way to do this. You'll also probably note that I like long variable names.

The following code looks through a directory, all the way down to the leaves (files in sub-directories), and picks out files that end in "a.dat". This is the extension of the polygon metadata files from the Census website. It then creates the file names for the actual polygon files (they have the same name, without an "a" before the extension) and the metadata files. From here, it passes those into the functions that read and parse these files to generate the KML outputs, which I will share when they are more robust and mature.


require "find"

Find.find("/Users/xxx/Desktop/Census 2000 CSDs") do |path|
  # Only the polygon metadata files, whose names end in "a.dat"
  if path.end_with?("a.dat")

    # This returns the file name, minus the second parameter
    filename = File.basename(path, "a.dat")
    directory = File.dirname(path)
    polyShapeFilename = directory + "/" + filename + ".dat"
    polyDataFilename = directory + "/" + filename + "a.dat"
    outputFilename = Dir.getwd + "/batch_outputs/" + filename + ".kml"

    puts "Starting #{filename}"

    # I'll publish this method when it is more complete
    genOutput(...)

    puts "Done #{filename}"

  end
end


An example of the productivity I am talking about is the Find module. This built-in module recursively searches through directories - how helpful!

One of the most powerful concepts in Ruby that I am just getting comfortable with is the block, like Find.find("/Users/xxx/Desktop/Census 2000 CSDs") do |path| that lets you run loops over collections of things. In this case, the code is going to loop every time the find method returns a path. I'm sure I'm butchering the Ruby language, but if a novice can figure these parts out, then the language, and its authors, have done a number of things right.

Thursday, March 8, 2007

County Subdivisions

I'm slowly improving the process of generating the KML files from Census boundary files. One of the challenges with wanting to show data for the entire country is that the boundary files are, at every level, broken out by state. There is no roll up for the entire country, even for the larger geographic areas. With the help of FlashGot, I've written a Ruby script that can take a directory full of ASCII boundary files, reading both the polygons & the metadata components, and create KML outputs. I am still generating each state as a separate KML file for viewing reasons, but I also generate a master KML file that links to each of the individual ones.
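
A master file that links to the per-state files can be built from KML NetworkLink elements. The sketch below is one way that might look; the method names, the file names, and the choice of the KML 2.2 namespace are assumptions for illustration, not the script described above.

```ruby
# Minimal XML escaping for text placed inside elements.
def xml_escape(text)
  text.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
end

# Build a master KML document with one NetworkLink per state file.
# `state_files` maps a display name to the (hypothetical) path or URL
# of that state's KML file.
def master_kml(state_files)
  links = state_files.map do |name, href|
    <<~LINK
      <NetworkLink>
        <name>#{xml_escape(name)}</name>
        <Link><href>#{xml_escape(href)}</href></Link>
      </NetworkLink>
    LINK
  end

  <<~KML
    <?xml version="1.0" encoding="UTF-8"?>
    <kml xmlns="http://www.opengis.net/kml/2.2">
    <Document>
    <name>County Subdivisions</name>
    #{links.join}</Document>
    </kml>
  KML
end

puts master_kml("Maine" => "ME.kml", "Vermont" => "VT.kml")
```

Keeping each state in its own file means Google Earth only has to load the states a viewer actually turns on.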

The following picture shows a map of all County Subdivisions from the 2000 Census. The colors are random.


This is a zoom of New England, with borders around each subdivision:

Welcome to Census KML!

A couple of days ago my wife asked me to help her map some Census data. I have always been interested in maps, and after a couple of web searches, I found a couple of interesting examples of people using Google Earth to map data. In particular, I found this post on the Juice Analytics blog. This got me going - I had some spare time, a desire to learn Ruby, a lot of data, a love of data visualization, and a purpose.

I plan to use this blog to share what I've created and let others share their knowledge, feedback, and experience. Please join in the discussion and check back frequently as I hope to share what I've found.

This wouldn't be much of a start if I didn't share some of the progress I've been making. The following screen shot shows the 2000 Census population by County Subdivision data where each meter of height represents 10 people. [There is also coloring, but I've found what appears to be a bug in Google Earth such that it doesn't properly draw colors for 3D polygons.]