Description
I need help with my Java project. It needs JUnit 5.2 added to the classpath, along with the latest version of Java, which is 19. All the details in the handout I'll provide need to be followed thoroughly. I need the report sent to me as a Word document so I can type my personal details into it. On page one of the report, I need you to list the resources you used (this can include any video or website).
If you have any questions you are confused about, tell me word for word what to write to the instructor, and they will be able to clarify things.
The handout is under the name: Project 3.Part1 1-pdf. The randstrs.txt file is too large to be uploaded here; once you accept the task, I will send it separately by some other means.
CS 251, Spring 2023: Project 3, Part 1
Hashing: Applications and Experimental Analysis
Due:
Part 1, Friday, March 10, 10:59 pm on Vocareum and Gradescope (released 3/1)
Part 2, Wednesday, March 29, 10:59 pm on Vocareum and Gradescope
Slip days can be used for part 1. However, we encourage you to try to submit on time.
Introduction
In Part 1, the project explores the performance of different hash functions with two types of open
addressing for collision management. You will explore the impact of the hash functions and the
load factor on the number of collisions.
Part 2 introduces you to a new data structure, Bloom Filters. You are asked to implement the
basic operations of a Bloom filter and to experimentally analyze the impact of the size of the
Bloom filter and the number of hash functions used on the false positive rate of lookup
operations. Both parts of the project ask you to prepare a report on your experimental results.
Learning Objectives
● Explain practical differences between the hash functions and the collision management strategies.
● Identify the advantages and disadvantages of Bloom filters.
● Demonstrate how the different parameters used by Bloom filters impact performance and false positive rate.
● Give examples of applications that should use/should not use Bloom filters.
● Visualize and explain results generated by experiments in a clear and meaningful way.
Part 1: HashTable
In Part 1, you will implement a HashTable data structure that holds a collection of String objects.
The table may contain up to n elements, none of which may be duplicates, and it must support
insertion and lookup operations. The HashTable will use different hash functions, and your
implementation and experimental work will observe and report differences between them.
Two open addressing approaches are used to resolve collisions in an insertion: linear probing
and double hashing. In both approaches, the index into the table is incremented by step size
delta (see class slides for more details).
1. In linear probing, the step size delta is set to 1, and an insertion probes consecutive
locations in the array to find a free location.
2. In double hashing, the step size delta=h2(X) is determined by a secondary hash function
h2. When inserting element X, the locations probed are (h1(X) + j h2(X)) mod m,
j=0,1,2,…, where h1 and h2 are the primary and secondary hash functions, and m is the
size of the table.
Note that no deletions have to be handled. Both collision management strategies need to
ensure that probing does not turn into an infinite loop as the table fills up. No resizing of the
hash table is done.
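The probe sequences above can be sketched as follows. This is a minimal, hypothetical demo: ProbeDemo and its inputs are made up for illustration and are not part of the project template.

```java
// Illustrative probe-sequence demo; class name and demo values are made up.
public class ProbeDemo {
    // j-th probe index: (h1 + j * delta) mod m, where delta = 1 for
    // linear probing and delta = h2(x) for double hashing.
    // Assumes small non-negative inputs; the real implementation must map
    // signed hash values with Integer.remainderUnsigned (see Section 1.2).
    static int probe(int h1, int delta, int j, int m) {
        return (h1 + j * delta) % m;
    }

    public static void main(String[] args) {
        int m = 7;
        // Linear probing from home slot 3 probes 3 4 5 6 0 ...
        for (int j = 0; j < 5; j++) System.out.print(probe(3, 1, j, m) + " ");
        System.out.println();
        // Double hashing with delta = 5 probes 3 1 6 4 2 ...
        for (int j = 0; j < 5; j++) System.out.print(probe(3, 5, j, m) + " ");
        System.out.println();
    }
}
```

Note how double hashing spreads probes across the table instead of scanning consecutive slots; this is what distinguishes the two strategies experimentally.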
1.1 Hash Functions
Provided with the template/starter code is the Hasher class, which implements the hash
functions, listed below. Do not modify this class.
Each hash function is a static method that takes a String as input and returns a signed 32-bit
integer hash value. Note that the string is converted to an integer inside the hash function.
public static int crc32(String str)
Returns the cyclic redundancy check (CRC) of the input string, using the Guava library
implementation. The CRC is a checksum typically used to verify data integrity, but here
we use it as a hash function.
public static int adler32(String str)
Returns the Adler-32 checksum of the input string, using the Guava library
implementation.
public static int murmur3_32(String str, int seedIdx)
Returns the 32-bit MurmurHash3 of the input string, using the Guava library
implementation. MurmurHash3 requires a random seed as input. The Hasher class
provides a static array of seeds to use, and the seedIdx argument indexes into this
array.
public static int polynomial(String str, int primeIdx)
Computes a polynomial rolling hash of the input string. It uses polynomial accumulation
to generate an integer and then uses the division method to generate the hash value.
Specifically, the function computes:
h(str) = (c1·b^(k-1) + c2·b^(k-2) + … + ck) mod p,
where b is a fixed base, p is a prime number, and c1,…,ck are the characters in the string. Similar to
MurmurHash3, the Hasher class provides a static array of prime numbers to use for p,
and the primeIdx argument indexes into this array.
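The scheme can be sketched as below. This is a hedged illustration: the base b = 31 and prime p = 101 are arbitrary demo values, and the class name is made up; the actual constants come from the Hasher class.

```java
// Hedged sketch of a polynomial rolling hash: polynomial accumulation
// with 32-bit wraparound, then the division method (mod p).
// b = 31 and p = 101 are arbitrary demo choices, not the project's values.
public class PolyHashDemo {
    static int polyHash(String s, int b, int p) {
        int acc = 0;
        for (int i = 0; i < s.length(); i++) {
            acc = acc * b + s.charAt(i); // polynomial accumulation, wraps mod 2^32
        }
        return Integer.remainderUnsigned(acc, p); // division method, non-negative result
    }

    public static void main(String[] args) {
        System.out.println(polyHash("ab", 31, 101)); // prints 75 (3105 mod 101)
    }
}
```

Integer.remainderUnsigned is used here because the accumulated value may wrap around to a negative int, as discussed in Section 1.2.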
1.2 HashTable Implementation
The template code includes the HashTable class. Your task is to implement the class methods
listed below. Internally, the HashTable maintains an array of String objects of a fixed size, which
is used for storing and locating elements of the table.
public HashTable(int capacity, String hashType)
HashTable constructor. Allocate the String array with the given capacity (i.e., size of the
hashtable). The hashType argument determines which hash functions from the Hasher
class to use, as well as the collision resolution strategy.
Possible values for this argument are “crc32-lp”, “crc32-dh”, “adler32-lp”,
“adler32-dh”, “murmur3-lp”, “murmur3-dh”, “poly-lp”, and “poly-dh”.
The “-lp” suffixes indicate linear probing, whereas the “-dh” suffixes indicate double
hashing.
Note: Only valid arguments will be passed to test your code.
public int size()
Return the number of elements currently inserted into the table.
public int capacity()
Return the maximum number of elements that can be stored in the table.
public float loadFactor()
Return the ratio of inserted elements to table capacity. Expected values between 0 and
1.
private int hash1(String str) & private int hash2(String str)
Compute the primary and secondary hash indices of the input string using the hash
functions specified by the constructor’s hashType argument.
Refer to the table below for the functions to use. The returned index must be a valid
index into the String array, that is, the possible values should be in the range [0, m-1],
where m is the array length. The hash2() function is only needed for double hashing.
hashType        hash1(str)                   hash2(str)
“crc32-lp”      Hasher.crc32(str)            —
“crc32-dh”      Hasher.crc32(str)            Hasher.adler32(str)
“adler32-lp”    Hasher.adler32(str)          —
“adler32-dh”    Hasher.adler32(str)          Hasher.crc32(str)
“murmur3-lp”    Hasher.murmur3_32(str, 0)    —
“murmur3-dh”    Hasher.murmur3_32(str, 0)    Hasher.murmur3_32(str, 1)
“poly-lp”       Hasher.polynomial(str, 0)    —
“poly-dh”       Hasher.polynomial(str, 0)    Hasher.polynomial(str, 1)
Java does not have an unsigned integer type. Thus all of the hash functions in Hasher
return a signed 32-bit integer. In Java, the modulus operator % returns a negative value
when the dividend is negative. Instead, use the Integer.remainderUnsigned()
method provided by Java, which interprets a signed integer as though it were unsigned.
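A minimal demonstration of the difference (the values here are arbitrary):

```java
public class IndexDemo {
    public static void main(String[] args) {
        int hash = -7; // e.g. a negative signed hash value
        int m = 5;     // table capacity
        System.out.println(hash % m);                           // prints -2: not a valid index
        System.out.println(Integer.remainderUnsigned(hash, m)); // prints 4: -7 read as 4294967289
    }
}
```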
public boolean insert(String str)
Attempt to insert the given string into the table, and return whether the insertion was
successful. Unsuccessful insertions occur if the element is already inserted, if the table is
full, or if the collision resolution runs into an infinite loop. It is your responsibility to detect
infinite loops if and when they occur. You must also keep track of the number of
collisions encountered during the insertion, to be returned by numCollisions().
True is returned for a successful insertion and false is returned for an unsuccessful
insert.
public int numCollisions()
Return the number of collisions encountered during the last call to insert(), even if
the insertion failed.
public boolean contains(String str)
Return whether the given string is already contained in the table.
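As one hedged sketch of how these methods fit together, consider the linear-probing case below. The class and field names are made up (they are not the template's), and String.hashCode() stands in for the Hasher functions; your implementation must follow the exact schemes on Page 2.

```java
// Hedged sketch of insert() with linear probing only; names are illustrative.
public class MiniTable {
    private final String[] table;
    private int numElements = 0;
    private int lastCollisions = 0;

    public MiniTable(int capacity) { table = new String[capacity]; }

    public boolean insert(String str) {
        lastCollisions = 0;
        int m = table.length;
        int start = Integer.remainderUnsigned(str.hashCode(), m); // stand-in for hash1
        for (int j = 0; j < m; j++) {        // at most m probes: no infinite loop
            int idx = (start + j) % m;       // delta = 1 (linear probing)
            if (table[idx] == null) {        // free slot: insert here
                table[idx] = str;
                numElements++;
                return true;
            }
            if (table[idx].equals(str)) return false; // duplicate: fail
            lastCollisions++;                // occupied by a different element
        }
        return false;                        // probing exhausted (table full)
    }

    public int numCollisions() { return lastCollisions; }
    public int size() { return numElements; }

    public static void main(String[] args) {
        MiniTable t = new MiniTable(4);
        System.out.println(t.insert("abc")); // true on first insert
        System.out.println(t.insert("abc")); // false: duplicate
    }
}
```

Bounding the loop at m probes guarantees termination for linear probing; double hashing needs its own loop detection, and the precise collision-counting rules are pinned down by the white-box unit tests described below.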
Testing your code
We provide a set of white box unit tests for the HashTable class. We strongly recommend that
you proceed to Section 1.3 only after passing all of these tests. The unit tests verify both the
internal state of the hash table and the return values of its methods as insertions are performed.
Due to the white-box nature of the tests, your HashTable implementation must conform exactly
to the specification described in this handout. In particular, your collision resolution logic must
correspond to the exact open addressing schemes described on Page 2.
The project root should contain the following provided files: An input directory, an expected
directory, a HashTableTest.java class containing JUnit methods and nested classes, and a
Utils.java utility class. Obtain the input and expected directories by downloading and
decompressing the local_data.zip file from BrightSpace. Before running the tests, set
Utils.ROOT to the absolute path of the project root on your computer. Below, we describe the
contents of these files:
● The input directory contains input text files. Each file contains a list of six-letter lowercase strings, one string per line.
● The expected directory contains pairs of output and state text files. Each line of an output file contains three or four space-delimited values. Each line of a state file contains a sequence of space-delimited six-character strings.
● The HashTableTest class contains the test cases as JUnit methods. The test suite can be run using IntelliJ’s JUnit runner. We set reasonable timeouts for each test case; if your code takes longer to execute, you likely have an infinite loop. Below, we will further describe the nested classes InsertTest and ContainsTest, which respectively implement insertion tests and containment tests.
● The Utils class contains methods which compare two given values and throw an AssertionError if the values differ.
Each test case contains a collection of tests that check one aspect of the HashTable
implementation, described in the test case’s DisplayName. Associated with a test are four
properties: inName, capacity, hashType, and outName. When each test starts running, it
prints a message listing its properties. The test loads the input file ${inName}.txt,
instantiates a hash table with the associated capacity and hashType, loads the output file
${outName}_output.txt and the state file ${outName}_state.txt, inserts the input
strings into the hash table one by one, and checks the hash table against the output and state
files after each insertion.
An insertion test will, after the i-th insertion, check that
● The value returned by the insert() call matches the first value in the i-th output line;
● size() returns the second value in the i-th output line;
● loadFactor() returns a value within 1e-4 of the third value in the i-th output line;
● numCollisions() returns the fourth value in the i-th output line [1]; and
● The table array matches the i-th line of the state file, where null values in the array correspond to horizontal lines in the state file.
A containment test will, after the i-th insertion, check that
● Calling contains() with each string in the hash table returns true; and
● Calling contains() with some strings (listed in inputs/small_excluded.txt) not present in the hash table returns false.
Note: We use JUnit test cases for this project. For this, you have to add JUnit 5.2 to your classpath as well. You can do this from the Project Structure panel (File > Project Structure > Modules; shortcut: cmd + ;). We trust that you will be able to resolve this on your own.
1.3 Comparing Experimental Performance
In this section we will be evaluating/measuring the performance of different hash table
implementations (hash function – collision management pairs).
You are provided two input files – randstrs.txt, words_ordered.txt. These are on
Brightspace as they were too large for Vocareum to handle. These files have strings which you
will use to evaluate your hash table implementations and then analyze your findings to answer
the questions asked in the report section. For the report, remember to draw graphs and analyze
the results.
Modify your main() method in HashTable.java to generate the data for the experimental
analysis of every pair of hash function and collision management method.
For each data file record the number of collisions for every pair of hash function-collision
management method.
● You will insert strings (one by one) into a hash table of size m until the load factor is 0.9.
● To record the number of collisions, partition the load factors from 0 to 0.9 into 90 bins (intervals) and sum up the collisions per bin (more in the example below).
● Note that double hashing needs to handle not finding a free location even when there is
one.
An example with CRC32 hash function with linear probing (hashType=”crc32-lp”) has
been provided below.
[1] This check is done only if the insertion was successful, to account for implementation differences in infinite loop detection.
● Given an input file and capacity m, first, create a HashTable of size m with hashType=”crc32-lp”.
● Next, read and insert strings from the input file into the HashTable while the load factor is < 0.9. To analyze the relationship between load factor and collisions, proceed as follows:
○ Divide the range of load factors into 90 bins and accumulate the total number of collisions in each bin.
■ For example, bin 0 covers the load factor range [0, 0.01).
■ All collisions for load factors in the range [0, 0.01) are accumulated in bin 0.
■ Be sure to include collisions for insertions that fail as well.
○ For each insertion, record the load factor prior to the insertion to find the bin. Add the number of collisions for the insertion to the corresponding bin.
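The binning loop above can be sketched as follows. FakeTable is a made-up stand-in so the snippet runs on its own; your real loop would use the HashTable class and its loadFactor(), insert(), and numCollisions() methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the load-factor binning loop for one hashType.
public class BinningDemo {
    // Made-up stand-in for HashTable: every insert "succeeds" and
    // reports one collision, purely so the sketch is self-contained.
    static class FakeTable {
        private final int capacity;
        private int size = 0;
        FakeTable(int capacity) { this.capacity = capacity; }
        float loadFactor() { return (float) size / capacity; }
        boolean insert(String s) { size++; return true; }
        int numCollisions() { return 1; }
    }

    // Accumulate collisions into 90 bins of width 0.01 over [0, 0.9).
    static long[] binCollisions(List<String> strings, FakeTable table) {
        long[] bins = new long[90];
        for (String s : strings) {
            float lf = table.loadFactor();   // load factor BEFORE the insert
            if (lf >= 0.9f) break;           // stop once the table is 90% full
            int bin = (int) (lf * 100);      // bin k covers [k/100, (k+1)/100)
            table.insert(s);                 // collisions of failed inserts count too
            bins[bin] += table.numCollisions();
        }
        return bins;
    }

    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        for (int i = 0; i < 200; i++) strings.add("s" + i);
        long[] bins = binCollisions(strings, new FakeTable(100));
        long total = 0;
        for (long b : bins) total += b;
        System.out.println(total); // 90 inserts happen before the load factor hits 0.9
    }
}
```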
Repeat this process for each hash function type and collision management method. This will result in 4 (hash functions) x 2 (collision management methods) = 8 sets of 90 (bins) values.
Print these values as a table in Comma-Separated Value (CSV) format with 9 columns and 91 rows.
● The first line should print the hashType parameter for each column.
● The first column should print the load factor bin’s upper-bound, printed to 2 decimals, as
shown in the example below (the X’s in the last two rows are placeholder values):
bin,crc32-lp,crc32-dh,adler32-lp,adler32-dh,murmur3-lp,murmur3-dh,poly-lp,poly-dh,
0.01,1,1,1,1,1,1,2,2,
0.02,3,3,5,5,6,6,7,6,
0.03,7,6,8,10,11,11,11,9,
.
.
.
0.89,X,X,X,X,X,X,X,X,
0.90,X,X,X,X,X,X,X,X,
Note: There should be no newline character at the end of the last line. Do not change the
order of the hashTypes from above. The order is crc32-lp, crc32-dh, adler32-lp,
adler32-dh, murmur3-lp, murmur3-dh, poly-lp, poly-dh.
The first line should begin with "bin" to indicate the first column is the bin, followed by 8 strings
indicating the 8 pairs of hash function and collision management method. The second line
should start with 0.01 for the first bin [0, 0.01), followed by a list of 8 numbers representing the
numbers of collisions corresponding to 8 pairs of hash function and collision management
method for the bin [0, 0.01). The next 89 lines represent the other bins, and collisions with
respect to the bin for the corresponding hash function - collision management methods pair.
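One hedged sketch of producing this exact layout is below; CsvDemo and its names are illustrative, and your output would come from main() in HashTable.java.

```java
import java.util.Locale;

// Sketch of the CSV layout described above; names are illustrative.
public class CsvDemo {
    static final String[] HASH_TYPES = {
        "crc32-lp", "crc32-dh", "adler32-lp", "adler32-dh",
        "murmur3-lp", "murmur3-dh", "poly-lp", "poly-dh"
    };

    // bins[t][k] = collisions for hashType t in load-factor bin k.
    static String toCsv(long[][] bins) {
        StringBuilder sb = new StringBuilder("bin");
        for (String h : HASH_TYPES) sb.append(',').append(h);
        sb.append(',');                               // trailing comma, as in the sample
        for (int k = 0; k < 90; k++) {
            sb.append('\n');
            // Bin upper bound to 2 decimals: 0.01, 0.02, ..., 0.90.
            sb.append(String.format(Locale.US, "%.2f", (k + 1) / 100.0));
            for (int t = 0; t < HASH_TYPES.length; t++) sb.append(',').append(bins[t][k]);
            sb.append(',');                           // trailing comma on data lines too
        }
        return sb.toString();                         // no newline after the last line
    }

    public static void main(String[] args) {
        System.out.print(toCsv(new long[8][90]));     // print, then redirect with >
    }
}
```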
You will run your program with two command line arguments: the input filename and capacity m,
in that order.
Perform the above procedure and print the results to the standard output stream System.out.
Save the output to a .csv file by using the output redirection operator >:
$ java HashTable input_file.txt 30000 > output_file.csv
The above command line means after you compile HashTable.java, you need to specify two
arguments to run the HashTable. The first argument is the input_file.txt, which contains the
String objects. The second argument is a positive integer, indicating the table capacity.
Do not change the order of the hash functions in the csv file. Provided with the template code
are two input files:
● randstrs.txt, containing 1,000,000 unique strings of 32 random alphanumeric
characters.
● words_ordered.txt, containing 370,107 unique English words in sorted order.
Run your program on each of these two input files twice, with m chosen as 30,000 and 300,000,
and save each of the resulting outputs in a separate .csv file. You should end up with 4 .csv files
in total. These four files do not need to be submitted with your code.
Report
Using spreadsheet software or another external library (e.g., matplotlib), create 4 different line
charts (one from each of the .csv files you generated). These charts should be included into the
report along with a discussion answering the question below. Each chart should
● plot all 8 hashTypes as separate lines.
● color the lines according to the hash function, and make linear probing solid lines and
double hashing dashed lines. For example, “crc32-lp” and “crc32-dh” should be
the same color, but the second one should be dashed.
● have the x-axis be the load factor and the y-axis be the number of collisions in
logarithmic scale.
● include titles, axis labels, and at least one legend.
● be readable and clear.
Your report should include all relevant plots. Include your responses to the following questions
in a clear and precise way.
1. What do the experimental results tell you about the different hash functions?
2. Do the experimental results provide insight into the performance of the two collision
handling strategies? Explain and justify any conclusions made.
3. What are your recommendations on the maximum load factor? Justify your recommendations.
4. Clearly explain how your code avoids an infinite loop when using double hashing.
5. Do infinite loops appear on your graphs? Explain.
6. Are your recommendations on the load factor the same for all hash functions and both
collision strategies? Explain.
Explain your reasoning clearly and in a well-articulated way. Upload your report_p3_part1.pdf
file to Gradescope before the deadline for Part 1.
Code Requirements for Experimental Work
The Hasher class relies on the Guava library, included with the template code. To compile it, you
will need to specify the path to the .jar file as part of the classpath:
$ javac -classpath .:guava-31.1-jre.jar Hasher.java
Alternatively, we recommend you set the CLASSPATH environment variable to simplify the
command:
$ export CLASSPATH=.:guava-31.1-jre.jar
$ javac Hasher.java
To do this in IntelliJ, go to File > Project Structure > Modules (shortcut: cmd + ;).
Under the export section, click the + sign and add guava-31.1-jre.jar, then click Apply.
Compile the remaining classes similarly:
$ javac HashTable.java
To run the experiments in section 1.3, run the HashTable program with two arguments: the input
filename and table capacity m. Redirect the output to a csv file as shown below:
$ java HashTable input_file.txt 30000 > output_file.csv
The above examples assume the environment variable CLASSPATH has been set.
Structure of the Report
Overall report guidelines and expectations
● The questions the report for part 1 should address are stated in section 1.3.
● The report needs to be typed (use LaTeX or Word).
● Your name (Last, First), your Purdue email, and your section number (LE1@4:30 or
LE2@1:30) need to be on top of page 1.
● The RC statement needs to be on page 1 under your name.
● All illustrations and figures need to be clearly readable. They cannot be handwritten or
drawn, for example, on an iPad. Handwritten tables are not acceptable.
● Reports are uploaded to Gradescope and code is uploaded to Vocareum.
● The suggested length of the report is up to 6 pages, with font size 12 and 1.5 spacing. If you have additional supporting tables exceeding the page limit, you can include them as an Appendix (which has no page limit). However, only your report will be graded. The appendix may be consulted by the grader for additional information.
Grading
The expected grading rubric is given below.
● Part 1 (40 points)
○ Coding part – 25 points
○ Overall quality of the report (15 points)
■ Quality and conciseness of writing
■ Quality of tables and plots included
■ Logical flow of arguments
■ Clear focus on relevant ideas
Submission
● To Vocareum (everything should be inside the src folder):
○ HashTable.java
● To Gradescope:
○ report_p3_part1.pdf
Note:
● Additional imports are not allowed. (other than the ones in the starter code)
● If for some reason you do not see the src folder under the work folder on Vocareum,
please make a new src folder.
● This has been explained on Piazza already so you can refer to it there.
● Please go to office hours for additional help.