Engaging the Public with Web Archives: Providing Access to 10 Years of Political History with WebArchives.ca
Date
2016-05-31
Authors
Ruest, Nick
Milligan, Ian
Abstract
Introduction
The growth of digital sources since the advent of the World Wide Web in 1990-91 presents profound opportunities for historians. Large web archives contain billions of webpages, and now make it possible for us to develop large-scale reconstructions of the recent web. Yet the sheer number of these sources presents significant challenges. The Internet Archive's "Wayback Machine" (http://archive.org/web) is a standard entryway to these collections, but requires that the user know the URL of the resource they want to visit; it is not feasible to do large-scale research in this manner.
By unlocking the Wayback Machine's underlying WebARCHive (ARC/WARC) files, we can develop methods to track, visualize, and analyze change occurring over time. In this paper, we discuss how we implemented the United Kingdom Web Archive (UKWA) "Shine" interface on a Canadian corpus, and how the provision of a user layer significantly changed levels of user engagement.
Project Rationale and Case Study
In 2005, the University of Toronto Library (UTL) began a quarterly crawl of the websites of Canadian political parties and political interest groups. The collection includes fifty websites: major and minor political parties, as well as political interest groups such as the Assembly of First Nations and equal marriage advocacy groups. Collecting continues.
Although 2005-2015 was a pivotal period for Canadian politics, analytics reveal that few users took advantage of the collection. The current portal requires a visit to https://archive-it.org/collections/227 for full-text queries. There are no faceting or significant advanced search features. The interface is largely unusable for broad research questions.
Shine
To provide access, we implemented the Shine interface (https://github.com/ukwa/shine). Shine provides a web-based interface for interacting with Apache Solr. Using the open-source code, we indexed all of the sites, provided explanatory layers, generated additional analytics describing what each crawl contained (some crawls contain more webpages from, say, the Liberals, which skews the relative frequency of keywords), and worked to improve the user documentation.
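The skew mentioned above can be illustrated with a small sketch. It is not the code behind WebArchives.ca: the function name, the crawl and party labels, and all counts are hypothetical, chosen only to show why raw keyword hits are better expressed as a rate per captured page when crawls differ in size.

```python
def relative_frequency(hits, totals, per=1000):
    """Return keyword hits per `per` captured pages for each (crawl, party) key.

    `hits` maps a key to the number of pages matching a keyword;
    `totals` maps the same key to the number of pages captured.
    Keys with no captured pages are skipped to avoid dividing by zero.
    """
    rates = {}
    for key, total in totals.items():
        if total == 0:
            continue
        rates[key] = per * hits.get(key, 0) / total
    return rates


# Hypothetical counts, not drawn from the actual collection.
totals = {("2005-Q4", "Party A"): 2000, ("2005-Q4", "Party B"): 500}
hits = {("2005-Q4", "Party A"): 40, ("2005-Q4", "Party B"): 25}

rates = relative_frequency(hits, totals)
for key, rate in sorted(rates.items()):
    print(key, rate)
```

In this made-up case Party A has more raw hits (40 against 25), yet Party B mentions the keyword more often per page captured, which is the kind of distortion the per-crawl analytics are meant to expose.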
We launched http://webarchives.ca as the 2015 Canadian federal election campaign began.
Results
WebArchives.ca received significant attention. The Canadian Broadcasting Corporation (CBC) carried stories in Canada Votes, the Kitchener-Waterloo affiliate, Spark, as well as talk radio and campus news. We received 17,861 pageviews over 4,000 user sessions, largely between 27 August and 19 October.
It also led to research findings, including:
* unlike publishers of other forms of web content, political parties and interest groups do not archive material on their websites. This eases analysis because there are fewer duplicates, but it also shows why collecting is time-critical;
* political parties flip-flop: in 2005 the Conservatives accused the Liberals of paying insufficient attention to murdered and missing indigenous women; a complete reversal occurred on the 2015 websites;
* party sites shifted significantly away from user-generated content: they experimented with, and then abandoned, widespread commenting and hosting of blogs.
These findings were discoverable because of the Shine/WebArchives.ca interface.
Conclusions
More work needs to be done. The next step is to work with more Archive-It collections of national and international significance and to publicize them in a similar way. At the end of the presentation, we will note an ongoing project with Canadian partners to consolidate and provide access to multiple collections.
Description
The Canadian Society of Digital Humanities/Société canadienne des humanités numériques Conference 2016
Keywords
web archives, digital humanities, text analysis, data mining, shine