Class LongestCommonSubsequence
- All Implemented Interfaces:
SimilarityScore<Integer>
The Longest common subsequence algorithm returns the length of the longest subsequence that two strings have in
common. Two strings that are entirely different, return a value of 0, and two strings that return a value
of the commonly shared length implies that the strings are completely the same in value and position.
Note. Generally this algorithm is fairly inefficient, as for length m, n of the input
CharSequence
's left
and right
respectively, the runtime of the
algorithm is O(m*n).
As of version 1.10, a more space-efficient of the algorithm is implemented. The new algorithm has linear space complexity instead of quadratic. However, time complexity is still quadratic in the size of input strings.
The implementation is based on Hirschberg's Longest Commons Substring algorithm (cited below).
For further reading see:
- Lothaire, M. Applied combinatorics on words. New York: Cambridge U Press, 2005. 12-13
- D. S. Hirschberg, "A linear space algorithm for computing maximal common subsequences," CACM, 1975, pp. 341--343.
- Since:
- 1.0
-
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionprivate static int[]
algorithmB
(CharSequence left, CharSequence right) An implementation of "ALG B" from Hirschberg's CACM '71 paper.private static String
algorithmC
(CharSequence left, CharSequence right) An implementation of "ALG C" from Hirschberg's CACM '71 paper.apply
(CharSequence left, CharSequence right) Calculates the longest common subsequence similarity score of twoCharSequence
's passed as input.logestCommonSubsequence
(CharSequence left, CharSequence right) Deprecated.Deprecated as of 1.2 due to a typo in the method name.longestCommonSubsequence
(CharSequence left, CharSequence right) Computes the longest common subsequence between the twoCharSequence
's passed as input.int[][]
longestCommonSubstringLengthArray
(CharSequence left, CharSequence right) Deprecated.Deprecated as of 1.10.private static String
-
Constructor Details
-
LongestCommonSubsequence
public LongestCommonSubsequence()
-
-
Method Details
-
algorithmB
An implementation of "ALG B" from Hirschberg's CACM '71 paper. Assuming the first input sequence is of sizem
and the second input sequence is of sizen
, this method returns the last row of the dynamic programming (DP) table when calculating the LCS of the two sequences in O(m*n) time and O(n) space. The last element of the returned array, is the size of the LCS of the two input sequences.- Parameters:
left
- first input sequence.right
- second input sequence.- Returns:
- last row of the dynamic-programming (DP) table for calculating the LCS of
left
andright
- Since:
- 1.10.0
-
algorithmC
An implementation of "ALG C" from Hirschberg's CACM '71 paper. Assuming the first input sequence is of sizem
and the second input sequence is of sizen
, this method returns the Longest Common Subsequence (LCS) of the two sequences in O(m*n) time and O(m+n) space.- Parameters:
left
- first input sequence.right
- second input sequence.- Returns:
- the LCS of
left
andright
- Since:
- 1.10.0
-
reverse
-
apply
Calculates the longest common subsequence similarity score of twoCharSequence
's passed as input.This method implements a more efficient version of LCS algorithm which has quadratic time and linear space complexity.
This method is based on newly implemented
algorithmB(CharSequence, CharSequence)
. An evaluation using JMH revealed that this method is almost two times faster than its previous version.- Specified by:
apply
in interfaceSimilarityScore<Integer>
- Parameters:
left
- first character sequenceright
- second character sequence- Returns:
- length of the longest common subsequence of
left
andright
- Throws:
IllegalArgumentException
- if either String inputnull
-
logestCommonSubsequence
Deprecated.Deprecated as of 1.2 due to a typo in the method name. UselongestCommonSubsequence(CharSequence, CharSequence)
instead. This method will be removed in 2.0.Computes the longest common subsequence between the twoCharSequence
's passed as input.Note, a substring and subsequence are not necessarily the same thing. Indeed,
abcxyzqrs
andxyzghfm
have both the same common substring and subsequence, namelyxyz
. However,axbyczqrs
andabcxyzqtv
have the longest common subsequencexyzq
because a subsequence need not have adjacent characters.For reference, we give the definition of a subsequence for the reader: a subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements.
- Parameters:
left
- first character sequenceright
- second character sequence- Returns:
- the longest common subsequence found
- Throws:
IllegalArgumentException
- if either String inputnull
-
longestCommonSubsequence
Computes the longest common subsequence between the twoCharSequence
's passed as input.This method implements a more efficient version of LCS algorithm which although has quadratic time, it has linear space complexity.
Note, a substring and subsequence are not necessarily the same thing. Indeed,
abcxyzqrs
andxyzghfm
have both the same common substring and subsequence, namelyxyz
. However,axbyczqrs
andabcxyzqtv
have the longest common subsequencexyzq
because a subsequence need not have adjacent characters.For reference, we give the definition of a subsequence for the reader: a subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements.
- Parameters:
left
- first character sequenceright
- second character sequence- Returns:
- the longest common subsequence found
- Throws:
IllegalArgumentException
- if either String inputnull
- Since:
- 1.2
-
longestCommonSubstringLengthArray
Deprecated.Deprecated as of 1.10. A more efficient implementation for calculating LCS is now available. UselongestCommonSubsequence(CharSequence, CharSequence)
instead to directly calculate the LCS. This method will be removed in 2.0.Computes the lcsLengthArray for the sake of doing the actual lcs calculation. This is the dynamic programming portion of the algorithm, and is the reason for the runtime complexity being O(m*n), where m=left.length() and n=right.length().- Parameters:
left
- first character sequenceright
- second character sequence- Returns:
- lcsLengthArray
-