String Substrings Generation in Java

Question

I am trying to find all substrings within a given string. For a random string like rymis the subsequences would be [i, is, m, mi, mis, r, ry, rym, rymi, rymis, s, y, ym, ymi, ymis]. From Wikipedia, a string of a length of n will have n * (n + 1) / 2 total substrings.

Which can be found by doing the following snippet of code:

    final Set<String> substring_set = new TreeSet<String>();
    final String text = "rymis";

    for(int iter = 0; iter < text.length(); iter++)
    {
        for(int ator = 1; ator <= text.length() - iter; ator++)
        {
            substring_set.add(text.substring(iter, iter + ator));
        }
    }

Which works for small String lengths but obviously slows down for larger lengths as the algorithm is near O(n^2).

Also reading up on Suffix Trees which can do insertions in O(n) and noticed the same subsequences could be obtained by repeatedly inserting substrings by removing 1 character from the right until the string is empty. Which should be about O(1 + … + (n-1) + n) which is a summation of n -> n(n+1)/2 -> (n^2 + n)/ 2, which again is near O(n^2). Although there seems to be some Suffix Trees that can do insertions in log2(n) time which would be a factor better being O(n log2(n)).

Before I delve into Suffix Trees is this the correct route to be taking, is there some another algorithm that would be more efficient for this, or is O(n^2) as good as this will get?

Since the set contains n * (n + 1) / 2 values, you'll have to do n * (n + 1) / 2 insertions to the set, so I don't see how the algorithm could be less than O(n^2). — JB Nizet
– JB Nizet, Commented Feb 22, 2012 at 19:15
@JBNizet - I agree, there is no way to avoid touching every substring element. Since the size of the original set is n, and there are roughly n^2 elements to visit, this most likely cannot get more efficient. — Travis J
– Travis J, Commented Feb 22, 2012 at 19:28
This is not homework. Playing with the two other algorithms posted they are both slower than my original but I noticed the data structure may not be optimal. If the algorithm produces unique substrings already then TreeSet is not needed (the data structure can be sorted later) and a dynamic array also would be inefficient because of the large size of inserts (needing to expand its internal array and copy). — ntin
– ntin, Commented Feb 22, 2012 at 20:33
From some testing all three algorithms will generate the correct answer. My algorithm in the original post is empirically the quickest because it has the less constant time costs but it is not a significant difference. The problem is compounded when a String with repeat characters is added like aba where the substrings then start to become duplicates then it can no longer be assured that a structure contains only unique elements. Where if it could be assured that it did then a data structure like LinkedList would be a good speed boost over a *Set or ArrayList. — ntin
– ntin, Commented Feb 22, 2012 at 21:05
@ntin - Duplicates should not be an issue because you can always remove them faster than the current bottleneck. HeapSort the array, then iterate through it removing current item if previous item is the same. These two operations should be O(n log n) and O(n) respectively. — Travis J
– Travis J, Commented Feb 22, 2012 at 21:31

score 1 · Accepted Answer · 2012-02-22 19:45:09Z

1

I am fairly sure you can't beat O(n^2) for this as has been mentioned in comments to the question.

I was interested in different ways of coding that so I made one quickly, and I decided to post it here.

The solution I put here isn't asymptotically faster I don't think, but when counting the inner and outer loops there are less. There are also less duplicate insertions here - no duplicate insertions.

String str = "rymis";
ArrayList<String> subs = new ArrayList<String>();
while (str.length() > 0) {
    subs.add(str);
    for (int i=1;i<str.length();i++) {
        subs.add(str.substring(i));
        subs.add(str.substring(0,i));
    }
    str = str.substring(1, Math.max(str.length()-1, 1));
}

edited Feb 22, 2012 at 19:45

answered Feb 22, 2012 at 19:39

user798182

Sign up to request clarification or add additional context in comments.

Comments

Travis J · Accepted Answer · 2012-02-22 21:49:30Z

This is an inverted way of your example, but still o(n^2).

string s = "rymis";
ArrayList<string> al = new ArrayList<string>();
for(int i = 1; i < s.length(); i++){//collect substrings of length i
 for(int k = 0; k < s.length(); k++){//start index for sbstr len i
  if(i + k > s.length())break;//if the sbstr len i runs over end of s move on
  al.add(s.substring(k, k + i));//add sbstr len i at index k to al
 }
}

Let me see if I can post a recursive example. I started doing a couple recursive tries and came up with this iterative approach using dual sliding windows as a sort of improvement to the above method. I had a recursive example in mind but was having issues reducing the tree size.

string s = "rymis";
ArrayList<string> al = new ArrayList<string>();
for(int i = 1; i < s.length() + 1; i ++)
{
 for(int k = 0; k < s.length(); k++)
 {
  int a = k;//left bound window 1
  int b = k + i;//right bound window 1
  int c = s.length() - 1 - k - i;//left bound window 2
  int d = s.length() - 1 - k;//right bound window 2
  al.add(s.substring(a,b));//add window 1
  if(a < c)al.add(s.substring(c,d));//add window 2
 }
}

There was an issue mentioned with using arraylist affecting performance so this next one will be with more basic structures.

string s = "rymis";
StringBuilder sb = new StringBuilder();
for(int i = 1; i < s.length() + 1; i ++)
{
 for(int k = 0; k < s.length(); k++)
 {
  int a = k;//left bound window 1
  int b = k + i;//right bound window 1
  int c = s.length() - 1 - k - i;//left bound window 2
  int d = s.length() - 1 - k;//right bound window 2
  if(i > 1 && k > 0)sb.append(",");
  sb.append(s.substring(a,b));//add window 1
  if(a < c){
   sb.append(",");
   sb.append(s.substring(c,d));//add window 2
  }
 }
}
string s = sb.toString();
String[] sArray = s.split("\\,");

GrayR · Accepted Answer · 2012-02-23 09:34:08Z

1

I am not sure about the exact algorithm but you may look into Ropes:

http://en.wikipedia.org/wiki/Rope_(computer_science)

In summary, ropes are better suited when the data is large and frequently modified.

I believe Rope outperforms String for your problem.

edited Feb 23, 2012 at 9:34

answered Feb 22, 2012 at 21:16

GrayR

1,3852 gold badges20 silver badges32 bronze badges

Collectives™ on Stack Overflow

String Substrings Generation in Java

3 Answers 3

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related