Optimizing Data Compression via Data Reordering Strategies

dc.contributor.advisor: Yu, Xiaohui
dc.contributor.author: Du, Qinxin
dc.date.accessioned: 2024-07-18T21:25:41Z
dc.date.available: 2024-07-18T21:25:41Z
dc.date.copyright: 2024-04-15
dc.date.issued: 2024-07-18
dc.date.updated: 2024-07-18T21:25:41Z
dc.degree.discipline: Information Systems and Technology
dc.degree.level: Master's
dc.degree.name: MA - Master of Arts
dc.description.abstract: To improve the efficiency and cost-effectiveness of handling large tabular datasets stored in databases, a range of data compression techniques are employed. Among these, dictionary-based compression methods such as Lz4, Gzip, and Zstandard are commonly utilized to decrease data size. However, while these traditional dictionary-based compression techniques can reduce data size to some degree, they are not able to identify the internal patterns within given datasets. Thus, there remains substantial potential for further data size reduction by identifying repetitive data patterns. This thesis proposes two novel approaches to improve tabular data compression performance. Both methods involve data preprocessing using an advanced data encoding technique called locality-sensitive hashing (LSH). One approach utilizes clustering for data reordering, while the other employs a heuristic-based solver for the Travelling Salesman Problem (TSP). The data encoding process enables the identification of internal repetitive patterns within the original datasets. Records with similar features are grouped together and compressed into a much smaller size after reordering. Furthermore, a novel table partitioning strategy based on the number of distinct values in each column is designed to further improve the compression ratio of the entire table. Extensive experiments are then conducted on one synthetic dataset and three real datasets to evaluate the performance of the proposed algorithms by varying parameters of interest. The data encoding and reordering methods show significant efficiency improvements, resulting in reduced data size and substantially increased data compression ratios.
dc.identifier.uri: https://hdl.handle.net/10315/42181
dc.language: en
dc.rights: Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
dc.subject: Information technology
dc.subject.keywords: Information technology
dc.subject.keywords: Data compression
dc.subject.keywords: Data reordering
dc.subject.keywords: Machine learning
dc.subject.keywords: Clustering
dc.subject.keywords: TSP
dc.title: Optimizing Data Compression via Data Reordering Strategies
dc.type: Electronic Thesis or Dissertation
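
The abstract's central idea, that placing similar records next to each other helps a dictionary-based compressor, can be illustrated with a small sketch. This is not the thesis's algorithm: it assumes a toy synthetic table, uses a simple MinHash signature as the locality-sensitive encoding, and replaces the clustering and TSP-based reordering with a plain sort on that signature. All names (`token_hash`, `minhash_signature`, `make_rows`, `NUM_HASHES`) and all parameter values are hypothetical choices made for illustration.

```python
import hashlib
import random
import zlib

NUM_HASHES = 4  # signature length; an arbitrary illustrative choice


def token_hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(row):
    """MinHash signature over the row's (column index, value) tokens.

    Rows that share many column values are likely to share signature
    components, which is the locality-sensitive property used here.
    """
    tokens = [f"{col}={val}" for col, val in enumerate(row)]
    return tuple(
        min(token_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES)
    )


def compressed_size(rows) -> int:
    """Size in bytes of the table serialized as CSV and compressed with zlib."""
    payload = "\n".join(",".join(r) for r in rows).encode()
    return len(zlib.compress(payload, 9))


def make_rows(num_templates=2000, copies=4, num_cols=6, seed=0):
    """Synthetic table: each template row is copied with one column changed."""
    rng = random.Random(seed)

    def rand_value() -> str:
        return f"{rng.getrandbits(32):08x}"

    rows = []
    for _ in range(num_templates):
        template = [rand_value() for _ in range(num_cols)]
        for _ in range(copies):
            row = list(template)
            row[rng.randrange(num_cols)] = rand_value()
            rows.append(tuple(row))
    rng.shuffle(rows)  # similar rows end up far apart
    return rows


rows = make_rows()
# Stand-in for the reordering step: sort rows by their LSH signature so
# that rows with matching signature prefixes become neighbours.
reordered = sorted(rows, key=minhash_signature)

shuffled_size = compressed_size(rows)
reordered_size = compressed_size(reordered)
print(f"shuffled:  {shuffled_size} bytes")
print(f"reordered: {reordered_size} bytes")
```

Sorting on the signature is only a crude proxy for grouping: it mainly brings together rows that agree on the first signature component. A clustering step or a TSP-style tour over the encoded rows, as the abstract describes, would be expected to place similar rows adjacent more reliably.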

Files

Original bundle
Name: Du_Qinxin_2024_Master.pdf
Size: 2.31 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.87 KB
Format: Plain Text

Name: YorkU_ETDlicense.txt
Size: 3.39 KB
Format: Plain Text