Optimizing Data Compression via Data Reordering Strategies

Loading...
Thumbnail Image

Date

2024-07-18

Authors

Du, Qinxin

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

To improve the efficiency and cost-effectiveness of handling large tabular datasets stored in databases, a range of data compression techniques are employed. Among these, dictionary-based compression methods such as Lz4, Gzip, and Zstandard are commonly utilized to decrease data size. However, while these traditional dictionary-based compression techniques can reduce data size to some degree, they are not able to identify the internal patterns within given datasets. Thus, there remains substantial potential for further data size reduction by identifying repetitive data patterns.

This thesis proposes two novel approaches to improve tabular data compres- sion performance. Both methods involve data preprocessing using an advanced data encoding technique called locality-sensitive hashing (LSH). One approach utilizes clustering for data reordering, while the other employs a heuristic-based solver for the Travelling Salesman Problem (TSP). The data encoding process enables the identification of internal repetitive patterns within the original datasets. Records with similar features are grouped together and compressed into a much smaller size after reordering.

Furthermore, a novel table partitioning strategy based on the number of distinct values in each column is designed to further improve the compression ratio of the entire table. Extensive experiments are then conducted on one synthetic dataset and three real datasets to evaluate the performance of the proposed algorithms by varying parameters of interest. The data encoding and reordering methods show significant efficiency improvements, resulting in reduced data size and substantially increased data compression ratios.

Description

Keywords

Information technology

Citation