I am attempting to implement the following Basic Sliding Window algorithm in Java. I get the basic idea of it, but I am a bit confused by some the wording, specifically the sentence in bold:
A sliding window of fixed width w is moved across the file, and at every position k in the file, the fingerprint of its content is computed. Let k be a chunk boundary (i.e., Fk mod n = 0). Instead of taking the hash of the entire chunk, we choose the numerically smallest fingerprint of a sliding window within this chunk. Then we compute a hash of this randomly chosen window within the chunk. Intuitively, this approach would permit small edits within the chunks to have less impact on the similarity computation. This method produces a variable length document signature, where the number of fingerprints in the signature is proportional to the document length.
Please see my code/results below. Am I understanding the basic idea of the algorithm? As per the text in bold, what does it mean to "choose the numerically smallest fingerprint of a sliding window within its chunk"? I am currently just hashing the entire chunk.
code:
public class BSW {
/**
* @param args
*/
public static void main(String[] args) {
int w = 15; // fixed width of sliding window
char[] chars = "Once upon a time there lived in a certain village a little
country girl, the prettiest creature who was ever seen. Her mother was
excessively fond of her; and her grandmother doted on her still more. This
good woman had a little red riding hood made for her. It suited the girl so
extremely well that everybody called her Little Red Riding Hood."
.toCharArray();
List<String> fingerprints = new ArrayList<String>();
for (int i = 0; i < chars.length; i = i + w) {
StringBuffer sb = new StringBuffer();
if (i + w < chars.length) {
sb.append(chars, i, w);
System.out.println(i + ". " + sb.toString());
} else {
sb.append(chars, i, chars.length - i);
System.out.println(i + ". " + sb.toString());
}
fingerprints.add(hash(sb));
}
}
private static String hash(StringBuffer sb) {
// Implement hash (MD5)
return sb.toString();
}
}
results:
0. Once upon a tim
15. e there lived i
30. n a certain vil
45. lage a little c
60. ountry girl, th
75. e prettiest cre
90. ature who was e
105. ver seen. Her m
120. other was exces
135. sively fond of
150. her; and her gr
165. andmother doted
180. on her still m
195. ore. This good
210. woman had a lit
225. tle red riding
240. hood made for h
255. er. It suited t
270. he girl so extr
285. emely well that
300. everybody call
315. ed her Little R
330. ed Riding Hood.