
I am attempting to implement the following Basic Sliding Window algorithm in Java. I get the basic idea of it, but I am a bit confused by some of the wording, specifically the sentence in bold:

A sliding window of fixed width w is moved across the file, and at every position k in the file, the fingerprint of its content is computed. Let k be a chunk boundary (i.e., Fk mod n = 0). Instead of taking the hash of the entire chunk, we choose the numerically smallest fingerprint of a sliding window within this chunk. Then we compute a hash of this randomly chosen window within the chunk. Intuitively, this approach would permit small edits within the chunks to have less impact on the similarity computation. This method produces a variable length document signature, where the number of fingerprints in the signature is proportional to the document length.

Please see my code/results below. Am I understanding the basic idea of the algorithm? As per the text in bold, what does it mean to "choose the numerically smallest fingerprint of a sliding window within its chunk"? I am currently just hashing the entire chunk.

code:

    import java.util.ArrayList;
    import java.util.List;

    public class BSW {

        /**
         * @param args
         */
        public static void main(String[] args) {
            int w = 15; // fixed width of sliding window
            char[] chars = ("Once upon a time there lived in a certain village a little "
                    + "country girl, the prettiest creature who was ever seen. Her mother was "
                    + "excessively fond of her; and her grandmother doted on her still more. This "
                    + "good woman had a little red riding hood made for her. It suited the girl so "
                    + "extremely well that everybody called her Little Red Riding Hood.")
                    .toCharArray();

            List<String> fingerprints = new ArrayList<String>();

            for (int i = 0; i < chars.length; i = i + w) {

                StringBuffer sb = new StringBuffer();

                if (i + w < chars.length) {
                    sb.append(chars, i, w);
                    System.out.println(i + ". " + sb.toString());
                } else {
                    sb.append(chars, i, chars.length - i);
                    System.out.println(i + ". " + sb.toString());
                }

                fingerprints.add(hash(sb));
            }
        }

        private static String hash(StringBuffer sb) {
            // Implement hash (MD5)
            return sb.toString();
        }
    }

results:

0. Once upon a tim
15. e there lived i
30. n a certain vil
45. lage a little c
60. ountry girl, th
75. e prettiest cre
90. ature who was e
105. ver seen. Her m
120. other was exces
135. sively fond of 
150. her; and her gr
165. andmother doted
180.  on her still m
195. ore. This good 
210. woman had a lit
225. tle red riding 
240. hood made for h
255. er. It suited t
270. he girl so extr
285. emely well that
300.  everybody call
315. ed her Little R
330. ed Riding Hood.
  • The way I understood it, they compute the hash for every window within a given chunk and then use the minimum of these values.

3 Answers


That is not a sliding window. All you have done is break up the input into disjoint chunks. An example of a sliding window would be

Once upon a time
upon a time there
a time there lived
etc. 
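
A minimal sketch of that in Java, sticking to your character-based window of width w (the text and width here are just placeholders): the start index advances by 1 each iteration, not by w.

    public class SlidingWindowDemo {
        public static void main(String[] args) {
            String text = "Once upon a time there lived in a certain village a little country girl";
            int w = 15; // fixed width of sliding window
            for (int i = 0; i + w <= text.length(); i++) {
                // Each window overlaps the previous one by w - 1 characters.
                String window = text.substring(i, i + w);
                System.out.println(i + ". " + window);
            }
        }
    }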

4 Comments

Maybe the window shifts right one character each time as it says "at every position k in the file, the fingerprint of its content is computed".
And they seem to slide the window through each chunk.
Correct, the OP has a fundamental misunderstanding of the sliding window concept.
I did indeed misunderstand the concept; thank you for your clarification. Is there any method for determining where to overlap?

The simple answer is NO, as far as I understand (I studied the sliding window algorithm years ago, so I remember the principles but not all of the details; correct me if you have a more insightful understanding).

As the name of the algorithm, 'Sliding Window', suggests, your window should slide rather than jump, as it says

at every position k in the file, the fingerprint of its content is computed

in your quote. That is to say, the window slides one character at a time.

To my knowledge, the concepts of chunk and window should be distinguished, as should fingerprint and hash, although the two can coincide. Given that it is too expensive to compute a cryptographic hash as the fingerprint, I think a Rabin fingerprint is the more appropriate choice. A chunk is a large block of text in the document, and a window highlights a small portion within a chunk. IIRC, the sliding window algorithm works like this:

  1. The text file is chunked first;
  2. For each chunk, you slide the window (a 15-char block in your running case) one position at a time and compute the fingerprint of each window of text;
  3. You now have the fingerprints of the chunk, and their number is proportional to the length of the chunk (see the sketch below this list).

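Here is a rough sketch of steps 2 and 3 in Java, under two simplifying assumptions of mine: fixed-size chunks instead of content-defined boundaries, and String.hashCode() standing in for a real Rabin fingerprint (the class and method names are also mine). Keeping only the numerically smallest window fingerprint per chunk is what the bolded sentence in your quote describes:

    import java.util.ArrayList;
    import java.util.List;

    public class MinWindowFingerprint {

        // Steps 2 and 3: for each chunk, slide a window of width w one character
        // at a time and keep only the numerically smallest window fingerprint.
        static List<Long> signature(String text, int chunkSize, int w) {
            List<Long> fingerprints = new ArrayList<Long>();
            for (int start = 0; start < text.length(); start += chunkSize) {
                String chunk = text.substring(start, Math.min(start + chunkSize, text.length()));
                long min = Long.MAX_VALUE;
                for (int i = 0; i + w <= chunk.length(); i++) {
                    // String.hashCode() is only a stand-in for a rolling Rabin fingerprint.
                    long fp = chunk.substring(i, i + w).hashCode() & 0xffffffffL;
                    min = Math.min(min, fp);
                }
                if (min != Long.MAX_VALUE) {
                    fingerprints.add(min); // one representative fingerprint per chunk
                }
            }
            return fingerprints;
        }

        public static void main(String[] args) {
            String text = "Once upon a time there lived in a certain village a little country girl";
            System.out.println(signature(text, 30, 15));
        }
    }

In the paper's version, the chunk boundaries would come from the fingerprint condition (Fk mod n = 0) rather than a fixed chunkSize, and a rolling fingerprint lets each window be computed in constant time from the previous one.
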
What comes next is how you use the fingerprints to compute the similarity between different documents, which is beyond my knowledge. Could you please give us a pointer to the article you referred to in the OP? In exchange, I recommend this paper, which introduces a variant of the sliding window algorithm for computing document similarity.

Winnowing: local algorithms for document fingerprinting

Another application you can refer to is rsync, which is a data synchronisation tool with block-level (corresponding to your chunk) deduplication. See this short article for how it works.

7 Comments

Thanks for the information. In your opinion, what is the reason for separating the document into chunks before sliding the window? Why not just slide it over the entire document, without breaking it into chunks first? I need to do more reading as to the optimal size of the chunks and the fixed window.
I am ultimately computing similarity by intersecting lists of hashed fingerprints for each document, and dividing the number of shared (intersected) chunks by the total number of chunks.
@littleK re chunking. A good question. Per my understanding, this is to optimise the code and reduce unnecessary window fingerprint comparisons. You can associate a hash with a chunk, and when comparing document similarity, you first compare the hashes of the chunks. If the chunk hashes are the same, there is no need to compare the window fingerprints one by one, which is costly in time. Correct me if you have other thoughts. :)
@littleK Your highlighted quotes may validate my previous reply. The smallest window fingerprint is a representative of the chunk.
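
For what it's worth, here is a minimal sketch of the similarity computation @littleK describes above. The class and method names are mine, and the denominator is only one possible reading of "the total number of chunks":

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class SignatureSimilarity {

        // Shared fingerprints divided by the number of distinct fingerprints in
        // one document's signature.
        static double similarity(List<Long> signatureA, List<Long> signatureB) {
            Set<Long> shared = new HashSet<Long>(signatureA);
            shared.retainAll(new HashSet<Long>(signatureB)); // intersection of the two signatures
            Set<Long> total = new HashSet<Long>(signatureA);
            return total.isEmpty() ? 0.0 : (double) shared.size() / total.size();
        }
    }

Note that using sets counts duplicate fingerprints within a document only once; whether that is what you want depends on how you define the signature.
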
    package com.perturbation;

    import java.util.ArrayList;
    import java.util.List;

    public class BSW {

        /**
         * @param args
         */
        public static void main(String[] args) {
            int w = 2; // fixed width of sliding window
            char[] chars = "umang shukla".toCharArray();

            List<String> fingerprints = new ArrayList<String>();

            // Slide one character at a time instead of jumping by w.
            for (int i = 0; i < chars.length; i++) {

                StringBuffer sb = new StringBuffer();

                if (i + w < chars.length) {
                    sb.append(chars, i, w);
                    System.out.println(i + ". " + sb.toString());
                } else {
                    sb.append(chars, i, chars.length - i);
                    System.out.println(i + ". " + sb.toString());
                }

                fingerprints.add(hash(sb));
            }
        }

        private static String hash(StringBuffer sb) {
            // Implement hash (MD5)
            return sb.toString();
        }
    }

This program may help you. Please try to make it more efficient.

