The Artima Developer Community
Agile Buzz Forum
The Hard Part: Bad Data

James Robertson

David Buck, Smalltalker at large
The Hard Part: Bad Data Posted: Jul 23, 2007 6:39 PM

This post originated from an RSS feed registered with Agile Buzz by James Robertson.
Original Post: The Hard Part: Bad Data
Feed Title: Cincom Smalltalk Blog - Smalltalk with Rants
Feed URL: http://www.cincomsmalltalk.com/rssBlog/rssBlogView.xml
Feed Description: James Robertson comments on Cincom Smalltalk, the Smalltalk development community, and IT trends and issues in general.

Inspired by Jon Udell's post, I grabbed the mean temperature data he mentioned. That's when the fun began :)

The problem with the data is the same problem you get with any dataset - it's never completely clean, and you always have to do something to sponge it off. So it was here. There were a number of small issues I ran across as I created some Smalltalk scripts to make sense of the data:

  • The use of -9999 to indicate bad or missing data for a given month. That wasn't really a data problem so much as an undocumented convention.
  • It took me a bit to find a weather station near me that had a large dataset - I eventually settled on the DCA station (near Washington National Airport).
  • Within a record, values were normally separated by whitespace, with a CR between records. However, some occurrences of -9999 had no surrounding whitespace.
  • Some data seemed to be duplicated (duplicated years for the same station), but with slightly different values for some months. What did that mean? Not a clue :)
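
A rough Python sketch of the kind of scrubbing the list above calls for (the original scripts were in Smalltalk and aren't shown here, so this is not a reconstruction of them). The function names and the sample record are made up for illustration, and treating every field as an integer is an assumption. The idea is to match numbers directly instead of splitting on whitespace, so a -9999 jammed against its neighbor still comes out as its own token:

```python
import re
from collections import defaultdict

MISSING = -9999  # sentinel the post describes for bad/no monthly data

# Try the sentinel first, then any optionally signed run of digits.
# Matching tokens (rather than splitting on whitespace) separates
# "34-9999" into "34" and "-9999".
NUMBER = re.compile(r"-9999|-?\d+")

def parse_record(line):
    """Return the integer fields of one record, with the sentinel mapped to None."""
    values = [int(tok) for tok in NUMBER.findall(line)]
    return [None if v == MISSING else v for v in values]

def mean_ignoring_missing(values):
    """Average the values that are present; None if every value is missing."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

def conflicting_years(records):
    """Given (station, year, months) tuples, list the (station, year)
    keys that appear more than once with differing monthly values."""
    seen = defaultdict(set)
    for station, year, months in records:
        seen[(station, year)].add(tuple(months))
    return sorted(key for key, variants in seen.items() if len(variants) > 1)

# Hypothetical record: a year followed by a few monthly values.
sample = "1985  12  34-9999  56-9999-9999  78"
fields = parse_record(sample)
```

This only flags duplicated years that disagree; deciding which copy to trust is a separate question the sketch doesn't try to answer.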

None of those were insurmountable, but they did make a "quick" look at the data harder. First came the scrubbing, then the "quick" look.

The thing is, that's not really a problem for someone with some software skills, but it will throw anyone without them. Even an Excel import would have foundered on the data that didn't have whitespace, for instance. So the sad thing is, it's even harder to deal with this stuff than Jon let on :/

Oh, and what did I discover? My wife's memory was right: summers were slightly hotter back in the '80s (at least around here; YMMV).

Update: One man's broken data is another man's misunderstood format. Turns out that the records are fixed-width, not whitespace-delimited. Shows what I know :)
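
With fixed-width records, the fields get sliced out by column position, and a -9999 that fills its whole field naturally sits flush against its neighbor. Here's a minimal Python sketch of that approach. The column widths (a 12-character station id, a 4-character year, twelve 5-character monthly values) are placeholder assumptions, not something stated above; the real layout should come from the dataset's own format description. The function name and sample line are likewise made up:

```python
MISSING = -9999  # sentinel for bad/no data, per the post

def parse_fixed_width(line, id_width=12, year_width=4, value_width=5, months=12):
    """Slice one record by column position.

    The default widths are illustrative assumptions; adjust them to the
    dataset's documented layout.
    """
    station = line[:id_width]
    year = int(line[id_width:id_width + year_width])
    start = id_width + year_width
    values = []
    for i in range(months):
        field = line[start + i * value_width:start + (i + 1) * value_width].strip()
        if not field or int(field) == MISSING:
            values.append(None)  # blank or sentinel: treat as missing
        else:
            values.append(int(field))
    return station, year, values

# Build a hypothetical record with each value right-aligned in 5 columns.
monthly = [12, 34, -9999, 56, -9999, -9999, 78, 90, 85, 60, 40, 20]
sample = "425724050001" + "1985" + "".join(f"{v:5d}" for v in monthly)
station, year, values = parse_fixed_width(sample)
```

Note that the sample line contains "   34-9999" with no whitespace before the sentinel, which is exactly what a whitespace split trips over and a column slice does not.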
