YorkSpace has migrated to a new version of its software. Access our Help Resources to learn how to use the refreshed site. Contact diginit@yorku.ca if you have any questions about the migration.
 

Scalable Content-Based Analysis of Images in Web Archives with TensorFlow and the Archives Unleashed Toolkit

Loading...
Thumbnail Image

Date

2019

Authors

Yang, Hsiu-Wei
Liu, Linqing
Milligan, Ian
Ruest, Nick
Lin, Jimmy

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

We demonstrate the integration of the Archives Unleashed Toolkit, a scalable platform for exploring web archives, with Google's TensorFlow deep learning toolkit to provide scholars with content-based image analysis capabilities. By applying pretrained deep neural networks for object detection, we are able to extract images of common objects from a 4TB web archive of GeoCities, which we then compile into browsable collages. This case study illustrates the types of interesting analyses enabled by combining big data and deep learning capabilities.

Description

Keywords

TensorFlow, machine learning, image analysis, web archives, Apache Spark, PySpark

Citation

Hsiu-Wei Yang, Linqing Liu, Ian Milligan, Nick Ruest, and Jimmy Lin. “Scalable Content-Based Analysis of Images in Web Archives with TensorFlow and the Archives Unleashed Toolkit.” Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, Vol. 19 (2019).
Hsiu-Wei Yang, Linqing Liu, Ian Milligan, Nick Ruest, and Jimmy Lin. “Scalable Content-Based Analysis of Images in Web Archives with TensorFlow and the Archives Unleashed Toolkit.” Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, Vol. 19 (2019).