The implementation takes as input the path to the dataset, the length of the stream, and the total number of hash functions, which are divided into hash groups (numHashGroups * numHashFunctionsInGroup). Multiple hash functions are used to improve the estimate of the number of distinct items in the stream, following the approach described in Section 4.4.3 of the Mining of Massive Datasets book.
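As a rough illustration of how grouped hash functions can be combined, the sketch below takes one 2^R estimate per hash function (R being the longest run of trailing zero bits seen), averages the estimates inside each group, and then takes the median of the group averages. The class name `DistinctCountSketch`, its methods, the salted hash mixing, and the synthetic stream are all assumptions made for this example and are not taken from the project's source.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only: names and hashing scheme are hypothetical,
// not the project's actual implementation.
class DistinctCountSketch {

    // 64-bit bit-mixing step (SplitMix64 finalizer) used to scramble hash inputs.
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Averages the estimates inside each group, then returns the median of the averages.
    static double medianOfAverages(double[] estimates, int numHashGroups,
                                   int numHashFunctionsInGroup) {
        double[] averages = new double[numHashGroups];
        for (int g = 0; g < numHashGroups; g++) {
            double sum = 0.0;
            for (int j = 0; j < numHashFunctionsInGroup; j++) {
                sum += estimates[g * numHashFunctionsInGroup + j];
            }
            averages[g] = sum / numHashFunctionsInGroup;
        }
        Arrays.sort(averages);
        int mid = numHashGroups / 2;
        return (numHashGroups % 2 == 1)
                ? averages[mid]
                : (averages[mid - 1] + averages[mid]) / 2.0;
    }

    // Estimates the number of distinct items in the stream.
    static double estimate(String[] stream, int numHashGroups,
                           int numHashFunctionsInGroup, long seed) {
        int total = numHashGroups * numHashFunctionsInGroup;
        Random rng = new Random(seed);
        long[] salts = new long[total]; // one salt per simulated hash function
        for (int i = 0; i < total; i++) {
            salts[i] = rng.nextLong();
        }
        int[] maxTail = new int[total]; // longest trailing-zero run per hash function
        for (String item : stream) {
            long x = mix(item.hashCode());
            for (int i = 0; i < total; i++) {
                int tail = Long.numberOfTrailingZeros(mix(x ^ salts[i]));
                if (tail > maxTail[i]) {
                    maxTail[i] = tail;
                }
            }
        }
        double[] estimates = new double[total];
        for (int i = 0; i < total; i++) {
            estimates[i] = Math.pow(2.0, maxTail[i]);
        }
        return medianOfAverages(estimates, numHashGroups, numHashFunctionsInGroup);
    }

    public static void main(String[] args) {
        // Synthetic stream of length 2000 containing 300 distinct items.
        String[] stream = new String[2000];
        for (int i = 0; i < stream.length; i++) {
            stream[i] = "item" + (i % 300);
        }
        System.out.println(estimate(stream, 5, 2, 42L));
    }
}
```

Because each per-function estimate is a power of two, averaging within a group allows values in between, while the median across groups limits the influence of a single very large estimate.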
This implementation estimates the kth moment of the stream using the AMS streaming algorithm described in Section 4.5 of the Mining of Massive Datasets book. It takes as input the path to the dataset, the length of the stream, the number of random variables to use, and the order k of the moment to estimate.
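The idea can be sketched as follows, assuming the stream length n is known in advance: each random variable picks a position in the stream, remembers the element found there, and counts how often that element appears from that position onward (value v). Each variable then contributes n * (v^k - (v-1)^k), and the contributions are averaged. The class name `AmsMomentSketch`, its methods, and the small in-memory stream are hypothetical and only serve this example; k is assumed to be at least 1.

```java
import java.util.Random;

// Illustrative sketch only: names are hypothetical, not the project's actual code.
class AmsMomentSketch {

    // Estimates the kth moment using one variable per given start position.
    static double estimate(String[] stream, int[] positions, int k) {
        int n = stream.length;
        int m = positions.length;
        String[] element = new String[m]; // element tracked by each variable
        long[] value = new long[m];       // occurrences seen from the start position on
        // Single pass over the stream, updating every variable.
        for (int t = 0; t < n; t++) {
            for (int i = 0; i < m; i++) {
                if (t == positions[i]) {
                    element[i] = stream[t];
                    value[i] = 1;
                } else if (element[i] != null && stream[t].equals(element[i])) {
                    value[i]++;
                }
            }
        }
        double sum = 0.0;
        for (int i = 0; i < m; i++) {
            sum += n * (Math.pow(value[i], k) - Math.pow(value[i] - 1, k));
        }
        return sum / m;
    }

    // Same estimate, with start positions drawn uniformly at random.
    static double estimate(String[] stream, int numVariables, int k, long seed) {
        Random rng = new Random(seed);
        int[] positions = new int[numVariables];
        for (int i = 0; i < numVariables; i++) {
            positions[i] = rng.nextInt(stream.length);
        }
        return estimate(stream, positions, k);
    }

    public static void main(String[] args) {
        String[] stream = {"a", "b", "c", "b", "d", "a", "c", "d",
                           "a", "b", "d", "c", "a", "a", "b"};
        // Estimate of the 2nd moment from 10 random variables.
        System.out.println(estimate(stream, 10, 2, 42L));
    }
}
```

Using more random variables reduces the variance of the averaged estimate, at the cost of one stored element and one counter per variable.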
The project is compiled with:
java version 7
The parameters for the script are as follows:
- "1" to select estimation of the number of distinct items
- Path to the input dataset
- Length of stream (n)
- Number of hash groups
- Number of hash functions in each hash group
For example:
./run2.sh 1 ebola.json.gz 2000 5 2
You can also run the jar file directly:
java -jar ASM.jar ebola.json.gz 2000 5 2
- "2" to select estimation of the kth moment
- Path to the input dataset
- Length of stream (n)
- Number of random variables
- kth moment to estimate
For example:
./run2.sh 2 ebola.json.gz 10000 10 2
- N. Alon, Y. Matias, M. Szegedy. The space complexity of approximating the frequency moments (AMS96).
- J. Leskovec, A. Rajaraman, J. D. Ullman. Mining of Massive Datasets, Chapter 4: Mining Data Streams.