Special Collections Capstone: Data

For the last couple of weeks, I have been cleaning up the data for the more than 20,000 records associated with Special Collections, which were exported from the catalog. The data is pretty dirty and at times overwhelming, but I persevere. I have undertaken the cleanup in stages, which means that in each stage I try to stick more or less to one type of action on the records. First, I corrected about 2,000 records that were misaligned on the spreadsheet. Now, I am working on provenance. As I make decisions about what the data represents, I have tried to document those decisions consistently for the next person who might decide to look at the records.

Last week, I also had a tutorial on how to use OpenRefine. Jessica Trelogan, UTL’s Data Management Coordinator, led the tutorial. It was a great session, and I highly recommend contacting Jessica if your research is data-related. One of the things that I learned in the session is that my data is particularly thorny. I will need to spend some time thinking about its structure and checking in with Jessica from time to time. I also learned how to use OpenRefine to help with my data cleanup. While the publication information is particularly challenging in my data set, other pieces of data can be quickly normalized and checked with this tool.

The other issue that I recently worked through concerns the representation of provenance. Due to the nature of the export, each record included every provenance note for a work if at least one of its items was located in Special Collections. If a work (a bibliographic record) had three items (three copies of the same book) attached to it, then my record carried the provenance notes for all three items. For example, the three hypothetical items may have belonged to Charles Moore, Colin Rowe, and Blake Alexander respectively. Only the Moore and Rowe copies, however, are housed in Special Collections. The provenance note for Alexander was included even though it does not relate to any item in Special Collections.

In the end, I decided that the provenance notes for items not held in Special Collections have to be removed from the final version of the cleaned data export. I made this decision because the assessment is specifically about Special Collections, and the export contained no item to which the non-Special Collections provenance notes could be attached. The relationship between items and provenance notes needs to be one-to-one to accurately assess the collection. Earlier versions of the export will be retained with all the provenance notes.
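To make the rule concrete, here is a minimal sketch in Python of how the notes might be separated. It is not the process I actually used. It assumes that each provenance note can be matched to a specific copy and that the copy's location is known. The field names, the "General Stacks" location, and the helper function are all hypothetical. The donors are the three from the hypothetical example above.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One physical copy attached to a bibliographic record (hypothetical layout)."""
    item_id: str
    location: str    # assumed field: where the copy is housed
    provenance: str  # assumed field: the provenance note for this copy


def split_provenance(items, keep_location="Special Collections"):
    """Separate provenance notes into those kept and those set aside.

    Both lists are returned so the removed notes can still be
    recorded alongside the earlier versions of the export.
    """
    kept = [i.provenance for i in items if i.location == keep_location]
    removed = [i.provenance for i in items if i.location != keep_location]
    return kept, removed


# The three hypothetical copies of one work.
items = [
    Item("copy-1", "Special Collections", "Charles Moore"),
    Item("copy-2", "Special Collections", "Colin Rowe"),
    Item("copy-3", "General Stacks", "Blake Alexander"),  # hypothetical location
]

kept, removed = split_provenance(items)
print("Kept:", kept)
print("Removed:", removed)
```

Returning the removed notes instead of discarding them matches the decision to retain the earlier exports: nothing is lost, but the cleaned version has exactly one note per Special Collections item.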

It was difficult to make the decision to remove the extraneous provenance notes. From an informal, non-quantitative look at the provenance, there appeared to be a lot of overlap among the collections of Charles Moore, Colin Rowe, and Blake Alexander. More often than not, the Alexander copy was not part of APL’s Special Collections, so its note had to be removed. Removing the Alexander notes was incredibly hard because I recognize the missed opportunity for analysis of those three collections. In the future, I hope to be able to undertake an analysis of the connections between the donors' libraries across all of APL’s collections.