Hands on with warcbase

Milligan, Ian
Ruest, Nick

Warcbase is an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing via Spark. Our team at the University of Waterloo is developing new ways to systematically track, visualize, and analyze change occurring over time within web archives. Through shared Spark Notebooks, the platform gives humanists and social scientists a robust way to access and explore their collections. In this half-day workshop, aimed at IIPC members with large web collections, we will:

  • lead participants through the installation process for warcbase and Spark Notebook, either on an Amazon EC2 instance or a local laptop;
  • work with participants to ingest data into the system, either their own collections (ARC or WARC files) or our sample test collection of Canadian Political Party and Political Interest Group websites (used for http://webarchives.ca/);
  • lead participants through the following processes:
    • basic analytics on their collections;
    • extracting and analyzing plain text;
    • extracting and visualizing named entities;
    • extracting and visualizing link structures with both in-browser D3.js and Gephi;
  • and explain the process for building their own scripts to provide custom solutions.
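
As a rough illustration of the link-structure step, the sketch below aggregates page-level links into weighted domain-to-domain edges and writes them as a CSV edge list. It is plain Python rather than warcbase or Spark code: the function names (`domain`, `domain_links`, `to_edge_csv`) and the sample URL pairs are hypothetical, and the `Source,Target,Weight` header is an assumption about the column names a tool such as Gephi expects when importing an edge list.

```python
from collections import Counter
from urllib.parse import urlparse


def domain(url):
    """Return the host part of a URL, lowercased, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def domain_links(url_pairs):
    """Count links between distinct domains from (source_url, target_url) pairs."""
    counts = Counter()
    for src, dst in url_pairs:
        s, d = domain(src), domain(dst)
        # Skip unparseable URLs and links that stay within one domain.
        if s and d and s != d:
            counts[(s, d)] += 1
    return counts


def to_edge_csv(counts):
    """Render the counts as a CSV edge list (header names are an assumption)."""
    lines = ["Source,Target,Weight"]
    for (s, d), weight in sorted(counts.items()):
        lines.append(f"{s},{d},{weight}")
    return "\n".join(lines)


# Hypothetical sample data standing in for links extracted from an archive.
sample = [
    ("http://www.example.org/a", "http://example.com/x"),
    ("http://example.org/b", "http://www.example.com/y"),
    ("http://example.org/c", "http://example.org/d"),
    ("http://example.com/x", "http://example.net/"),
]

if __name__ == "__main__":
    print(to_edge_csv(domain_links(sample)))
```

Collapsing pages to domains keeps the resulting graph small enough to lay out interactively; a per-page graph would follow the same pattern with the `domain` call removed.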



warcbase, web archives, hadoop, text analysis, data mining, digital humanities
