Scalable Content-Based Analysis of Images in Web Archives with TensorFlow and the Archives Unleashed Toolkit

dc.contributor.author: Yang, Hsiu-Wei
dc.contributor.author: Liu, Linqing
dc.contributor.author: Milligan, Ian
dc.contributor.author: Ruest, Nick
dc.contributor.author: Lin, Jimmy
dc.date.accessioned: 2019-04-23T02:30:41Z
dc.date.available: 2019-04-23T02:30:41Z
dc.date.issued: 2019
dc.description.abstract (en_US): We demonstrate the integration of the Archives Unleashed Toolkit, a scalable platform for exploring web archives, with Google's TensorFlow deep learning toolkit to provide scholars with content-based image analysis capabilities. By applying pretrained deep neural networks for object detection, we are able to extract images of common objects from a 4TB web archive of GeoCities, which we then compile into browsable collages. This case study illustrates the types of interesting analyses enabled by combining big data and deep learning capabilities.
dc.description.sponsorship: This work was primarily supported by the Natural Sciences and Engineering Research Council of Canada. Additional funding for this project has come from the Andrew W. Mellon Foundation. Our sincerest thanks to the Internet Archive for providing us with the GeoCities web archive.
dc.identifier.citation: Hsiu-Wei Yang, Linqing Liu, Ian Milligan, Nick Ruest, and Jimmy Lin. “Scalable Content-Based Analysis of Images in Web Archives with TensorFlow and the Archives Unleashed Toolkit.” Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, Vol. 19 (2019).
dc.identifier.issn: 978-1-7281-1547-4/19
dc.identifier.uri: http://hdl.handle.net/10315/36161
dc.identifier.uri: https://doi.org/10.1109/JCDL.2019.00107
dc.language.iso: en
dc.rights.uri: https://doi.org/10.1109/JCDL.2019.00107
dc.subject (en): TensorFlow
dc.subject (en): machine learning
dc.subject (en): image analysis
dc.subject (en): web archives
dc.subject (en): Apache Spark
dc.subject (en): PySpark
dc.title (en): Scalable Content-Based Analysis of Images in Web Archives with TensorFlow and the Archives Unleashed Toolkit
dc.type: Article

Files

Original bundle
Name: image-analysis.pdf
Size: 10.47 MB
Format: Adobe Portable Document Format
Description: Main article
Name: JCDL-Image-Analysis-Poster.pdf
Size: 25.57 MB
Format: Adobe Portable Document Format
Description: JCDL Poster
License bundle
Name: license.txt
Size: 1.83 KB
Format: Item-specific license agreed upon to submission