ABCDEF - The 6 key features behind scalable, multi-tenant web archive processing with ARCH: Archive, Big Data, Concurrent, Distributed, Efficient, Flexible

Holzmann, Helge; Ruest, Nick; Bailey, Jefferson; Dempsey, Alex; Fritz, Samantha; Lee, Peggy; Milligan, Ian

dc.contributor.author	Holzmann, Helge
dc.contributor.author	Ruest, Nick
dc.contributor.author	Bailey, Jefferson
dc.contributor.author	Dempsey, Alex
dc.contributor.author	Fritz, Samantha
dc.contributor.author	Lee, Peggy
dc.contributor.author	Milligan, Ian
dc.date.accessioned	2022-04-28T19:29:58Z
dc.date.available	2022-04-28T19:29:58Z
dc.date.issued	2022-06-20
dc.description.abstract	Over the past quarter-century, web archive collection has emerged as a user-friendly process thanks to cloud-hosted solutions such as the Internet Archive’s Archive-It subscription service. Despite advancements in collecting web archive content, no equivalent has been found by way of a user-friendly cloud-hosted analysis system. Web archive processing and research require significant hardware resources and cumbersome tools that interdisciplinary researchers find difficult to work with. In this paper, we identify six principles - the ABCDEFs (Archive, Big data, Concurrent, Distributed, Efficient, and Flexible) - used to guide the development and design of a system. These make the transformation of, and working with, web archive data as enjoyable as the collection process. We make these objectives – largely common sense – explicit and transparent in this paper. They can be employed by every computing platform in the area of digital libraries and archives and adapted by teams seeking to implement similar infrastructures. Furthermore, we present ARCH (Archives Research Compute Hub), the first cloud-based system designed from scratch to meet all of these six key principles. ARCH is an interactive interface, closely connected with Archive-It, engineered to provide analytical actions, specifically generating datasets and in-browser visualizations. It efficiently streamlines research workflows while eliminating the burden of computing requirements. Building off past work by both the Internet Archive (Archive-It Research Services) and the Archives Unleashed Project (the Archives Unleashed Cloud), this merged platform achieves a scalable processing pipeline for web archive research. It will be made open-source shortly and can be considered a reference implementation of the ABCDEF, which we have evaluated and discussed in terms of feasibility and compliance as a benchmark for similar platforms.	en_US
dc.description.sponsorship	This research was supported by the Andrew W. Mellon Foundation's Public Knowledge program, the Social Sciences and Humanities Research Council of Canada, as well as Start Smart Labs, the University of Waterloo, and York University.	en_US
dc.identifier.isbn	978-1-4503-9345-4/22/06
dc.identifier.uri	https://doi.org/10.1145/3529372.3530916	en_US
dc.identifier.uri	http://hdl.handle.net/10315/39458
dc.language.iso	en	en_US
dc.publisher	ACM	en_US
dc.rights	Attribution 4.0 International	*
dc.rights.article	https://doi.org/10.1145/3529372.3530916	en_US
dc.rights.journal	https://2022.jcdl.org/	en_US
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	*
dc.subject	big data	en_US
dc.subject	web archives	en_US
dc.subject	internet archive	en_US
dc.subject	data processing	en_US
dc.subject	apache spark	en_US
dc.subject	scala	en_US
dc.subject	scalatra	en_US
dc.subject	computing infrastructure	en_US
dc.subject	hadoop	en_US
dc.subject	hdfs	en_US
dc.title	ABCDEF - The 6 key features behind scalable, multi-tenant web archive processing with ARCH: Archive, Big Data, Concurrent, Distributed, Efficient, Flexible	en_US
dc.type	Article	en_US

Files

Original bundle

Name: JCDL_2022___ABCDEF_Paper.pdf
Size: 3.07 MB
Format: Adobe Portable Document Format

Collections

YUL research and professional contributions