Compression Techniques Comparison: Binary, Run-Length, Dictionary, Frame of Reference, and Differential Encoding

Description

This project compares five lossless data compression techniques—Binary, Run-Length, Dictionary, Frame of Reference, and Differential Encoding—applied to integer and string CSV data. The performance is evaluated in terms of compression ratio, file size, and encoding/decoding time.

Task / Problem Statement

The task is to implement and compare the performance of five data compression techniques on various integer and string datasets. The goal is to evaluate the compression efficiency and speed of each technique and analyze their suitability for different types of data, such as integers and strings.

Sub-Tasks

Implement Compression Techniques: Implement five compression techniques: Binary, Run-Length, Dictionary, Frame of Reference, and Differential Encoding.
Test on Various Data Types: Apply each technique to integer types (int8, int16, int32, int64) and strings.
Evaluate Compression Performance: Measure and compare compression ratios, encoded file sizes, and encoding/decoding times for each technique.
Analyze Results: Discuss the results and identify the most efficient techniques for different data types.

Implementation Details

1. Binary Encoding (bin)

Method: Converts integers into machine-readable binary format, which reduces storage size compared to text-based formats.
Performance: Offers fast encoding and decoding with moderate compression ratios. Best for scenarios requiring quick storage without heavy compression.

2. Run-Length Encoding (rle)

Method: Replaces consecutive repeated values with a single value and its count. Ideal for data with long sequences of identical values.
Performance: Effective on data with many consecutive repetitions, but not suitable for highly variable data.

3. Dictionary Encoding (dic)

Method: Replaces repeated values with tokens from a dictionary. Effective for string data with many repetitive words or phrases.
Performance: Performs well with string data but less effective for integer data with high variability.

4. Frame of Reference Encoding (for)

Method: Encodes integer data by storing the differences from a reference value. Suitable for ordered data with small variations.
Performance: Efficient for data with small deltas but less effective when the data is highly varied.

5. Differential Encoding (dif)

Method: Encodes data by storing the difference between consecutive values. Ideal for sequential data with incremental changes.
Performance: Achieves high compression ratios for sequential data but struggles with irregular or non-monotonic data.

Program Breakdown

The implementation consists of 26 programs in total, divided as follows:

1. Binary, Frame of Reference, and Differential Encoding

8 Programs for Each Encoding Type: 4 for encoding (int8, int16, int32, int64) and 4 for decoding (int8, int16, int32, int64).

To Run the Programs:

Encode

python <program-en-for|bin|dif-int8|16|32|64.py> <Filename.csv>

Decode

python <program-de-for|bin|dif-int8|16|32|64.py> <Filename.csv.for|dif|bin>

2. Run-Length Encoding (rle.py) and Dictionary Encoding (program-dic.py)

2 Programs for Encoding/Decoding:

Run-Length Encoding

Encode

python rle.py en|de rle int8|16|32|64 <file_path>

Decode

python rle.py en|de rle int8|16|32|64 <file_path>

Dictionary Encoding

Encode

python program-dic.py en|de dic int8|16|32|64 <file_path>

Decode

python program-dic.py en|de dic int8|16|32|64 <file_path>

Metrics and Results

Compression Ratios

Technique	File Type	Compression Ratio
Binary Encoding	int8, int16, int32, int64	0.25 to 2.00
Run-Length Encoding	string, int8, int16	0.19 to 1.77
Dictionary Encoding	string	0.70 to 2.42
Frame of Reference	int8, int16, int32, int64	0.25 to 2.82
Differential Encoding	int8, int16, int32, int64	0.26 to 2.82

Encoding and Decoding Times

Technique	File Type	Encoding Time (s)	Decoding Time (s)
Binary Encoding	int8, int16, int32, int64	1.5 to 2.5	1.5 to 2.5
Run-Length Encoding	string	0.1 to 1.0	0.1 to 1.0
Dictionary Encoding	string	2.0 to 5.0	5.0 to 8.0
Frame of Reference	int8, int16, int32, int64	2.5 to 3.7	2.5 to 3.7
Differential Encoding	int8, int16, int32, int64	2.0 to 3.0	1.5 to 3.0

Conclusion

Binary Encoding provides a fast, efficient method for compacting integer data but offers limited compression.
Run-Length Encoding is highly effective for datasets with repeated values but performs poorly on more variable data.
Dictionary Encoding is best for string data with many repeated values, though it incurs computational and storage overhead for high-variance datasets.
Frame of Reference Encoding works well for integer data with small variations but suffers from file bloat with highly variable datasets.
Differential Encoding excels with sequential data and provides high compression ratios for data with small deltas.

Future Work

Hybrid Compression: Investigate hybrid approaches that combine multiple techniques based on data characteristics.
Performance Optimization: Explore optimizations to improve encoding and decoding speeds, especially for large datasets.

References

TPCH Dataset: TPCH Benchmark

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Compression Techniques Comparison: Binary, Run-Length, Dictionary, Frame of Reference, and Differential Encoding

Description

Task / Problem Statement

Sub-Tasks

Implementation Details

1. Binary Encoding (bin)

2. Run-Length Encoding (rle)

3. Dictionary Encoding (dic)

4. Frame of Reference Encoding (for)

5. Differential Encoding (dif)

Program Breakdown

1. Binary, Frame of Reference, and Differential Encoding

Encode

Decode

2. Run-Length Encoding (rle.py) and Dictionary Encoding (program-dic.py)

Run-Length Encoding

Dictionary Encoding

Metrics and Results

Compression Ratios

Encoding and Decoding Times

Conclusion

Future Work

References

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
1. Binary		1. Binary
2. Run Length		2. Run Length
3. Dictionary		3. Dictionary
4. Frame of Reference		4. Frame of Reference
5. Differential		5. Differential
Dataset		Dataset
README.md		README.md
main.py		main.py
program-dic.py		program-dic.py
rle.py		rle.py

Mehulupase01/Compression-Benchmarks-Binary-Run-Length-Dictionary-Frame-of-Reference-and-Differential-Encoding

Folders and files

Latest commit

History

Repository files navigation

Compression Techniques Comparison: Binary, Run-Length, Dictionary, Frame of Reference, and Differential Encoding

Description

Task / Problem Statement

Sub-Tasks

Implementation Details

1. Binary Encoding (bin)

2. Run-Length Encoding (rle)

3. Dictionary Encoding (dic)

4. Frame of Reference Encoding (for)

5. Differential Encoding (dif)

Program Breakdown

1. Binary, Frame of Reference, and Differential Encoding

Encode

Decode

2. Run-Length Encoding (rle.py) and Dictionary Encoding (program-dic.py)

Run-Length Encoding

Dictionary Encoding

Metrics and Results

Compression Ratios

Encoding and Decoding Times

Conclusion

Future Work

References

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages