Longest repeated substring with at least k occurrences correctness

Question

The algorithms for finding the longest repeated substring is formulated as follows1)build the suffix tree 2)find the deepest internal node with at least k leaf childrenBut I cannot understand why is this works,so basically what makes this algorithm correct?Also,the source where I found this algorithm says that is find the repeated substring in O(n),where n is the length of the substring,this is also not clear to me!Let's consider the following tree,here the longest repeated substring is "ru" and if we apply DFS it will find it in 5 step but not in 2 Can you explain this stuff to me? Thanks

image

It cannot be linear in the length of the substring since the suffix tree construction is linear in the length of the string itself — Rerito, Nov 21 '16 at 09:00

score 0 · Answer 1 · answered Nov 15 '16 at 15:29

I suppose you perfectly know O(n) (Big O notation) refers to the order of growth of some quantity as a function of n, and not the equivalence of the quantity with n.
I write this becase reading the question I was in doubt...
I'm writing this as an aswer and not a comment since it's a bit too long for a comment (I suppose...)

score 0 · Answer 2 · edited May 23 '17 at 10:30

Given a string S of N characters, building the corresponding suffix tree is O(N) (using an algorithm such as Ukkonen's).

Now, such a suffix tree can have at most 2N - 1 nodes (root and leaves included).

If you traverse your tree and compute the number of leaves reachable from a given node along with its depth, you'll find the desired result. To do so, you start from the root and explore each of its children.

Some pseudo-code:

traverse(node, depth):
    nb_leaves <-- 0
    if empty(children(node)):
        nb_leaves <-- 1
    else:
        for child in children(node):
            nb_leaves <-- nb_leaves + traverse(child, depth+1)
    node.setdepth(depth)
    node.setoccurrences(nb_leaves)
    return nb_leaves

The initial call is traverse(root, 0). Since the structure is a tree, there is only one call to traverse for each node. This means the maximum number of call to traverse is 2N - 1, therefore the overall traversal is only O(N). Now you just have to keep track of the node with the maximum depth that also verifies: depth > 0 && nb_leaves >= k by adding the relevant bookkeeping mechanism. This does not hinder the overall complexity.

In the end, the complexity of the algorithm to find such a substring is O(N) where N is the length of the input string (and not the length of the matching substring!).

Note: The traversal described above is basically a DFS on the suffix tree.

Longest repeated substring with at least k occurrences correctness

2 Answers2