
I am attempting to implement the following Basic Sliding Window algorithm in Java. I get the basic idea of it, but I am a bit confused by some of the wording, specifically the sentence in bold:

A sliding window of fixed width w is moved across the file, and at every position k in the file, the fingerprint of its content is computed. Let k be a chunk boundary (i.e., Fk mod n = 0). Instead of taking the hash of the entire chunk, we choose the numerically smallest fingerprint of a sliding window within this chunk. Then we compute a hash of this randomly chosen window within the chunk. Intuitively, this approach would permit small edits within the chunks to have less impact on the similarity computation. This method produces a variable length document signature, where the number of fingerprints in the signature is proportional to the document length.

Please see my code/results below. Am I understanding the basic idea of the algorithm? As per the text in bold, what does it mean to "choose the numerically smallest fingerprint of a sliding window within its chunk"? I am currently just hashing the entire chunk.

code:

    import java.util.ArrayList;
    import java.util.List;

    public class BSW {

        /**
         * @param args
         */
        public static void main(String[] args) {
            int w = 15; // fixed width of sliding window
            char[] chars = ("Once upon a time there lived in a certain village a little "
                    + "country girl, the prettiest creature who was ever seen. Her mother was "
                    + "excessively fond of her; and her grandmother doted on her still more. This "
                    + "good woman had a little red riding hood made for her. It suited the girl so "
                    + "extremely well that everybody called her Little Red Riding Hood.")
                    .toCharArray();

            List<String> fingerprints = new ArrayList<String>();

            for (int i = 0; i < chars.length; i = i + w) {

                StringBuffer sb = new StringBuffer();

                if (i + w < chars.length) {
                    sb.append(chars, i, w);
                    System.out.println(i + ". " + sb.toString());
                } else {
                    sb.append(chars, i, chars.length - i);
                    System.out.println(i + ". " + sb.toString());
                }

                fingerprints.add(hash(sb));
            }
        }

        private static String hash(StringBuffer sb) {
            // Implement hash (MD5)
            return sb.toString();
        }
    }

results:

0. Once upon a tim
15. e there lived i
30. n a certain vil
45. lage a little c
60. ountry girl, th
75. e prettiest cre
90. ature who was e
105. ver seen. Her m
120. other was exces
135. sively fond of 
150. her; and her gr
165. andmother doted
180.  on her still m
195. ore. This good 
210. woman had a lit
225. tle red riding 
240. hood made for h
255. er. It suited t
270. he girl so extr
285. emely well that
300.  everybody call
315. ed her Little R
330. ed Riding Hood.
  • The way I understood it, they compute the hash for every window within a given chunk and then use the minimum of these values.

3 Answers


That is not a sliding window. All you have done is break up the input into disjoint chunks. An example of a sliding window would be

Once upon a time
upon a time there
a time there lived
etc. 
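
A minimal sketch of that in Java, sticking to your character-based window of width w (the text and width here are just placeholders): the start index advances by 1 each iteration, not by w.

    public class SlidingWindowDemo {
        public static void main(String[] args) {
            String text = "Once upon a time there lived in a certain village a little country girl";
            int w = 15; // fixed width of sliding window
            for (int i = 0; i + w <= text.length(); i++) {
                // Each window overlaps the previous one by w - 1 characters.
                String window = text.substring(i, i + w);
                System.out.println(i + ". " + window);
            }
        }
    }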

4 Comments

Maybe the window shifts right one character each time as it says "at every position k in the file, the fingerprint of its content is computed".
And they seem to slide the window through each chunk.
Correct, the OP has a fundamental misunderstanding of the sliding window concept.
I did indeed misunderstand the concept; thank you for your clarification. Is there any method for determining where to overlap?

The simple answer is NO, as far as I understand (I studied the sliding window algorithm years ago, so I remember the principles but not all of the details; correct me if you have a more insightful understanding).

As the name of the algorithm, 'Sliding Window', suggests, your window should slide rather than jump, as it says

at every position k in the file, the fingerprint of its content is computed

in your quote. That is to say, the window slides one character at a time.

To my knowledge, the concepts of chunk and window should be distinguished, as should fingerprint and hash, although the two can coincide. Given that it is too expensive to compute a cryptographic hash as the fingerprint, I think a Rabin fingerprint is the more appropriate choice. A chunk is a large block of text in the document, and a window highlights a small portion within a chunk. IIRC, the sliding window algorithm works like this:

  1. The text file is chunked first;
  2. For each chunk, you slide the window (a 15-char block in your running case) one position at a time and compute the fingerprint of each window of text;
  3. You now have the fingerprints of the chunk, and their number is proportional to the length of the chunk (see the sketch below this list).

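Here is a rough sketch of steps 2 and 3 in Java, under two simplifying assumptions of mine: fixed-size chunks instead of content-defined boundaries, and String.hashCode() standing in for a real Rabin fingerprint (the class and method names are also mine). Keeping only the numerically smallest window fingerprint per chunk is what the bolded sentence in your quote describes:

    import java.util.ArrayList;
    import java.util.List;

    public class MinWindowFingerprint {

        // Steps 2 and 3: for each chunk, slide a window of width w one character
        // at a time and keep only the numerically smallest window fingerprint.
        static List<Long> signature(String text, int chunkSize, int w) {
            List<Long> fingerprints = new ArrayList<Long>();
            for (int start = 0; start < text.length(); start += chunkSize) {
                String chunk = text.substring(start, Math.min(start + chunkSize, text.length()));
                long min = Long.MAX_VALUE;
                for (int i = 0; i + w <= chunk.length(); i++) {
                    // String.hashCode() is only a stand-in for a rolling Rabin fingerprint.
                    long fp = chunk.substring(i, i + w).hashCode() & 0xffffffffL;
                    min = Math.min(min, fp);
                }
                if (min != Long.MAX_VALUE) {
                    fingerprints.add(min); // one representative fingerprint per chunk
                }
            }
            return fingerprints;
        }

        public static void main(String[] args) {
            String text = "Once upon a time there lived in a certain village a little country girl";
            System.out.println(signature(text, 30, 15));
        }
    }

In the paper's version, the chunk boundaries would come from the fingerprint condition (Fk mod n = 0) rather than a fixed chunkSize, and a rolling fingerprint lets each window be computed in constant time from the previous one.
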
What comes next is how you use the fingerprints to compute the similarity between different documents, which is beyond my knowledge. Could you please give us a pointer to the article you referred to in the OP? In exchange, I recommend this paper, which introduces a variant of the sliding window algorithm for computing document similarity.

Winnowing: local algorithms for document fingerprinting

Another application you can refer to is rsync, which is a data synchronisation tool with block-level (corresponding to your chunk) deduplication. See this short article for how it works.

7 Comments

Thanks for the information. In your opinion, what is the reason for separating the document into chunks before sliding the window? Why not just slide it over the entire document, without breaking it into chunks first? I need to do more reading as to the optimal size of the chunks and the fixed window.
I am ultimately computing similarity by intersecting lists of hashed fingerprints for each document, and dividing the number of shared (intersected) chunks by the total number of chunks.
@littleK re chunking. A good question. Per my understanding, this is to optimise the code and reduce unnecessary window fingerprint comparisons. You can associate a hash with a chunk, and when comparing document similarity, you first compare the hashes of the chunks. If the chunk hashes are the same, there is no need to compare the window fingerprints one by one, which is costly in time. Correct me if you have other thoughts. :)
@littleK Your highlighted quotes may validate my previous reply. The smallest window fingerprint is a representative of the chunk.
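
For what it's worth, here is a minimal sketch of the similarity computation @littleK describes above. The class and method names are mine, and the denominator is only one possible reading of "the total number of chunks":

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class SignatureSimilarity {

        // Shared fingerprints divided by the number of distinct fingerprints in
        // one document's signature.
        static double similarity(List<Long> signatureA, List<Long> signatureB) {
            Set<Long> shared = new HashSet<Long>(signatureA);
            shared.retainAll(new HashSet<Long>(signatureB)); // intersection of the two signatures
            Set<Long> total = new HashSet<Long>(signatureA);
            return total.isEmpty() ? 0.0 : (double) shared.size() / total.size();
        }
    }

Note that using sets counts duplicate fingerprints within a document only once; whether that is what you want depends on how you define the signature.
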
    package com.perturbation;

    import java.util.ArrayList;
    import java.util.List;

    public class BSW {

        /**
         * @param args
         */
        public static void main(String[] args) {
            int w = 2; // fixed width of sliding window
            char[] chars = "umang shukla".toCharArray();

            List<String> fingerprints = new ArrayList<String>();

            // Slide one character at a time instead of jumping by w.
            for (int i = 0; i < chars.length; i++) {

                StringBuffer sb = new StringBuffer();

                if (i + w < chars.length) {
                    sb.append(chars, i, w);
                    System.out.println(i + ". " + sb.toString());
                } else {
                    sb.append(chars, i, chars.length - i);
                    System.out.println(i + ". " + sb.toString());
                }

                fingerprints.add(hash(sb));
            }
        }

        private static String hash(StringBuffer sb) {
            // Implement hash (MD5)
            return sb.toString();
        }
    }

This program may help you. Please try to make it more efficient.

