Scalable Content-Based Analysis of Images in Web Archives with TensorFlow and the Archives Unleashed Toolkit


Title: Scalable Content-Based Analysis of Images in Web Archives with TensorFlow and the Archives Unleashed Toolkit
Author: Yang, Hsiu-Wei
Liu, Linqing
Milligan, Ian
Ruest, Nick
Lin, Jimmy
Abstract: We demonstrate the integration of the Archives Unleashed Toolkit, a scalable platform for exploring web archives, with Google's TensorFlow deep learning toolkit to provide scholars with content-based image analysis capabilities. By applying pretrained deep neural networks for object detection, we are able to extract images of common objects from a 4TB web archive of GeoCities, which we then compile into browsable collages. This case study illustrates the types of interesting analyses enabled by combining big data and deep learning capabilities.
Sponsor: This work was primarily supported by the Natural Sciences and Engineering Research Council of Canada. Additional funding for this project has come from the Andrew W. Mellon Foundation. Our sincerest thanks to the Internet Archive for providing us with the GeoCities web archive.
Subject: TensorFlow
machine learning
image analysis
web archives
Apache Spark
PySpark
Type: Article
URI: http://hdl.handle.net/10315/36161
Citation: Hsiu-Wei Yang, Linqing Liu, Ian Milligan, Nick Ruest, and Jimmy Lin. “Scalable Content-Based Analysis of Images in Web Archives with TensorFlow and the Archives Unleashed Toolkit.” Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, Vol. 19 (2019).
Date: 2019
