Given two integers 'm' and 'n'. Find the number of combinations of bit-size 'n' in which there are no 'm' consecutive '1'.
EXPLANATION
We are given with m and n. There are 3 cases for any given value of n and m
Case 1:
when n is less than m
for this the ans will be equal to ans=pow(2,n).
eg: 3 2
all the 2 bit words i.e 00 01 10 11 will not have 3 consecutive 1 ,thus all of them are selected.
Case 2: when n is equal to m
for this the ans will be equal to ans=pow(2,n)-1.
eg: 3 3
all the 3 bit words i.e 000 001 010 011 100 101 110 111 out of which only one word will have 3 consecutive '1'.
Case 3: when n is more than m
This test case needs some computation which we will see through eg.
eg:
3 4
Here we first calculate for
m n
3 3 = 7
3 2 = 4
3 1 = 2
then adding this ,we get for 3 4 as 7+4+2=13.
here m=3 thus we calculate for 3 values and add them .
eg:4 6
here we first calculate for
m n
4 5 = 29
4 4 = 15
4 3 = 8
4 2 = 4
then adding this ,we get for 4 6 as 29+15+8+4=56.
here m=4 thus we calculated for 4 values and add them .
Note: for faster computation we can precompute all the values and can store them in 2-D matrix.
Complexity:
(O(N^2.logN))
C++ Code
#include <stdio.h>
#include<math.h>
int main(void) {
// your code goes here
long long int t,n,m,i,j,k,a[15][105],l;
scanf("%lld",&t);
for(i=2;i<=10;i++)
for(j=1;j<=50;j++)
{
if(j<i)
a[i][j]=pow(2,j);
else if(j==i)
a[i][j]=pow(2,j)-1;
else
{
k=i;
l=j-1;
while(k>0)
{ k--;
a[i][j]+=a[i][l--];
}
}
}
/*
for(i=2;i<=10;i++)
{
for(j=1;j<=50;j++)
{
printf("%lld\t",a[i][j]);
}
printf("\n");
}
*/
// you can uncomment the above section to see all the precomputed values
while(t--)
{
scanf("%lld%lld",&m,&n);
printf("%lld\n",a[m][n]);
}
return 0;
}
Given an array of $N$ integers. You have to find count of all subarrays of size $m$, such that if a Nim Game is played on the subarray, second player will win. Do it for all possible values of $m$, $1 \le m \le N$.
EXPLANATION:
It can be claimed using Nim Game theory that second player will win the game if $XOR$ of the selected subarray is $0$. We can find XOR of any subarray in $O(1)$ time by creating a prefix XOR array.
Now, let me present an $O(N^2)$ solution for the problem.
answer[n] = 0;
for i in range(0,n):
for j in range(i+1,n):
if preXOR[i] == preXOR[j]: //subarray having size (j-i) has XOR 0.
answer[j-i]++;
Consider a number $k$. If it appears $l$ times in the $preXOR$ array, then there will be $^lC_2$ subarrays in array $A$ whose XOR is zero. We can create an array corresponding to indices of $k$ in $preXOR$ array, and call it $S$. It means $preXOR[S[i]] = k$ for all $i < l$.
for i = 0 to l-1:
for j= i+1 to l-1:
ans[j-i]++; // there is a subarray of size (j-i) having XOR 0.
It can be explained further,Consider that preXOR array is $[1,2,3,1,1,4,1]$, then the array $S$ for $k=1$ will be $[0,3,4,6] \text{(0 based indexing)}$.
The above solution can take up to $O(N^2)$ time in the worst case, as size of $S$ can be $O(N)$. To improve it, we can do the following modifications:
1) If size of the array $S$ is lesser than $\sqrt{N}$, then we can use brute force as given above.
2) If size of the array $S$ is greater than $\sqrt{N}$, then the above solution will take more time. So, we can use the FFT to optimize the solution to time $Nlog(N)$.
Number of sub-arrays of size $K$ having XOR $0$, will be coefficient of $x^K$ in following polynomial:
The powers of $x$ in the above polynomial are negative as well, therefore we do a small modification in above polynomial to make all the powers positive. It will be equivalent to the coefficient of $x^{N+K}$ in the following polynomial:
Which can be done in $O(Nlog(N))$ time using Fast Fourier transform. The FFT will give answer corresponding to all possible value of $m$ altogether.
We can repeat the above procedure for each distinct element $k$ of $preXOR$ array. The total count of subarrays corresponding to each value of $m$ will be the count of subarrays of size $m$ for each distinct value of $k$.
Worst case complexity of the above procedure will be $O(N \sqrt{N} \log{(N)})$.
You are given an array A of integers of size N. You will be given Q queries where each query is represented by two integers L, R. You have to find the gcd(Greatest Common Divisor) of the array after excluding the part from range L to R inclusive (1 Based indexing). You are guaranteed that after excluding the part of the array remaining array is non empty.
1 ≤ N, Q≤ 100000
EXPLANATION:
If you do it naively(ie. calculating GCD of remaining array for each query), the worstcase complexity will be O(N * Q).
Let's denote by G(L, R), the GCD of AL, AL+1 ... AR. We can observe that for query [L, R], we need GCD of G(1, L-1) and G(R+1, N).
So, we precalculate prefix and suffix gcd arrays.
If we have: Prefixi = GCD of A1, A2 ... Ai Suffixi = GCD of AN, AN-1 ... Ai
answer to query [L, R], would be GCD of PrefixL-1 and SuffixR+1.
We can calculate prefix and suffix arrays in O(N) if we notice that: Prefixi = GCD(Prefixi-1, Ai) Suffixi = GCD(Suffixi+1, Ai)
Pseudo Code for building prefix and suffix arrays:
n,a=input
pre[n],suf[n]
//base case
pre[1]=a[1]
suf[n]=a[n]
for i=2 to n:
pre[i] = gcd(pre[i-1], a[i])
for i=n-1 to 1:
suf[i] = gcd(suf[i+1], a[i])
So, overall complexity would be O((N + Q) * K), where K is a constant factor for gcd calculation.
ALTERNATIVE SOLUTION:
Use segment trees for range gcd queries. But note that a factor of log N will be increased in complexity.
You are given N matchsticks arranged in a straight line, with their rearends touching each other. You are also given the rate of burn for every matchstick (possibly same) in number of seconds it takes to burn it out. If a matchstick is lit from both ends, it burns out twice as fast - taking half as much time.
Answer several queries of the following type efficiently
All the matchsticks in the range L to R, inclusive are lit from their front ends simultaneously. Find how much time it takes for all the matchsticks to burn out.
QUICK EXPLANATION
For each query, the operation performed plays out in the following way
All the matchsticks in the range L to R are lit from their front ends.
The matchstick that burns quickest in the range L to R burns out and ignites all the other matchsticks on their rear ends.
The matchticks in the range L to R now burn out twice as fast.
All the other matchsticks burn out at their original rate.
We can find the time taken in all the steps above using only the following pieces of information for the segment L to R
Quickest rate of burning for a match in the range L to R.
Slowest rate of burning for all matches in the range L to R.
Slowest rate of burning for all matches outside the range L to R.
EXPLANATION
For a given range L to R
Let m denote the minimum time taken by some matchstick in the range L to R to burn out
Let M denote the largest time taken by some matchstick in the range L to R to burn out
Let M' denote the largest time taken by some matchstick outside the range L to R to burn out
The time taken by each of the steps in the scenario described above is as follows
The matchstick that burns quickest, burns out
Takes time m
The following things happen in parallel
The matchsticks in the range L to R now burn out twice as fast
Takes time (M - m) / 2
The matchsticks outside the range
Takes time M'
Thus, the time taken for all the matches to burn out completely is
m + max( (M-m)/2 , M' )
It remains to find efficiently the minimum time some matchstick will take in a range, and the maximum time some matchstick will take in a range.
Such queries can be answered in O(N log N) time by using segment trees. Refer to this topcoder tutorial for a wonderful writeup with code samples on how to get about writing a segment tree. The topcoder tutorial also describes a O(N sqrt(N)) approach as well which will also work within the time limits for this problem.
Two segment trees must be constructed. One to answer queries of the type "minimum in a range", that returns the time it takes for the fastest burning matchstick to burn out. Another to answer queries of the type "maximum in a range" to find M and M' as defined above. Note that M' will itself be the maxmimum of two ranges, 1 to L-1 and R+1 to N respectively.
A lot of solutions were stuck in the caveat that it is required to always print the answer in a single decimal place. Note how the answer will either be integer, or contain a single decimal digit (5). In case the answer is integer, it is required to print a trailing decimal followed by 0.
Given a string containing '(' and ')' only. Find the longest such subsequence of the given string that is non-regular. Amongst all such distinct answers, output the lexicographically $K^{th}$ amongst them.
QUICK EXPLANATION:
If the given string in not balanced, then the string itself is the largest non-regular(unbalanced) pattern. On the contrary, if the given string is regular(balanced), then all subsequnces having length $(N-1)$ will be non-regular and there is a pattern in their lexicographic ordering.
EXPLANATION:
The problem can be broadly classified into two cases:
Case I: Given parenthesis string is not balanced, which can be checked using stack data structure described here. In this case, the largest unbalanced sub-sequence will be then the given string itself. So if $K=1$, then the answer will be the given string, $-1$ otherwise.
Case II: Given parenthesis string is the balanced one. Consider that $L$ is the length of the given string, then there will $L$ sub-sequences of length $(L-1)$, and each of them will be unbalanced. This is because, in any sub-sequence, number of opening and closing parenthesis will not be the same.
How to find the number of distinct unbalanced sub sequences?
Consider a string $S$ = "(())", what will be number of distinct unbalanced sub-sequences? it can be easily claimed that if we delete $S[0]$ or $S[1]$, we will get same string i.e. "())". So, it can be seen easily that if we delete any character from a contiguous chunk of the characters, we will get same unbalanced sub-sequence. It can be formally written as follows:
Consider the any contiguous chunk of same characters in the given string, Suppose $S[i] = S[i+1] = S[i+2] =\ ....\ = S[j] =\ '('$ or $')'$. If we delete any character between $i^{th}$ and $j^{th}$ positions (both inclusive), it will produce the same sub-sequence, because deleting any of them will replace $(j-i)+1$ number of same characters by $(j-i)$ number of same characters. So number of distinct sub-sequences of length $(L-1)$ will be number of different contiguous chunks of same characters. For example, string $(())((()))$ will have $4$ distinct sub-sequence of length $9$, and each of them will be unbalanced.
How to do lexicographic ordering of those sub-sequences?
Suppose that the string which we generate by deleting a character from the $i^{th}$ chunk (0-indexed) is $L_i$. As our given string is balanced, it will consist of one or more opening parentheses and one or more closing parentheses alternatively (starting with opening). So, to produce $L_{2n}$ we need to delete an opening parenthesis, and to produce $L_{2n+1}$, we need to delete a closing parenthesis. Now, let me claim that the lexicographic ordering:
So for string $S = ''(())()((()))(())"$, the lexicographic order will be $L_1 < L_3 < L_5 < L_7 < L_6 < L_4 < L_2 < L_0$. To produce lexicographically $3^{rd}$ string we need to produce $L_5$ by deleting any of the bold characters : "(())()((()))(())".
Proof:
Consider the proof $L_{2i+1} < L_{2j+1},\hspace{2 mm} \forall \hspace{2 mm} (i < j)$. Others can be proved in the same manner.
Consider that the $i^{th}$ chunk has $A_i$ number of characters. Now consider the $(A_1 + A_2 + .... A_{2i+1})^{th}$ character in $L_{2i+1}$ and $L_{2j+1}$, it will be ')' in $L_{2j+1}$, as $i < j$, and it will remain the same as in the original string, but it will be '(' in $L_{2i+1}$ and all previous characters will be the same in both of the strings. As opening parenthesis is lexicographically smaller than closing parenthesis, hence $L_{2i+1} < L_{2j+1}$.
Solution:
Setter's solution can be found here
Tester's solution can be found here
I am searching for an online community programmer to build the next generation of facebook.
My idea is revolutionary.
It would require the skills of a programmer that has SEO skills.
A partnership would be available to the right person.
If you know anyone please respond.
By repeated squaring of matrices. The results need to be found modulo 1000000007. Since all calculations are sum or product only, the drill should be straightforward.
METHOD 2: Sum of Geometric Progression
S(T) = 26*(26T - 1) / 25
Calculating the numerator modulo 1000000007 is pretty straight forward. We use repeated squaring for the exponent part.
But, to the uninitiated it might be confusing as to why can we get away with calculating the residue of the numerator considering we are yet to divide it by 25.
Here, we apply the concept of modular arithmetic inverses. These are well defined for residues modulo primes. 1000000007 happens to be a prime (and is often the reason it is chosen in problems by several authors). The modular arithmetic inverse of 25, modulo 1000000007 is
Given a binary string and an integer $k$, flip the smallest number of characters of the string, such that the resulting string has no more than $k$ consecutive equal characters.
QUICK EXPLANATION:
Handle the case of $k = 1$, and $k > 1$ separately. In case of $k = 1$, the resulting string must have alternating characters, and there are only two such strings (one starting with $0$, and the other starting with $1$), pick the one which can be obtained by fewer flips. In case of $k > 1$, partition the original string into substrings made of the same character. If a substring is of length $n > k$, then flip $\lfloor n/(k + 1) \rfloor$ characters in this substring, so that it contains no more than k consecutive equal characters. Special care must be taken when $n$ is divisible by $k + 1$
.
EXPLANATION:
We are given a binary string $S$ and an integer $k$. The task is to flip the minimum number of characters of the given string such that resulting string has no more than $k$ consecutive equal characters.
Let us split the string S into substrings such that all characters of a substring are the same, e.g., if the original string was "1100000011100101", then the partition would have $7$ substrings: "11", "000000", "111", "00", "1", "0", and, "1". Now, any substring which has a length greater than $k$, will require some flips, so that after flip we can further split this substring into smaller substrings of equal characters, and of length not exceeding $k$.
For example, if a substring is "000000000", and the value of $k$ is $3$, then we need to make two flips into this substring to convert it into "000100010", which can then be split into "000", "1", "000", "1", "0", none of which have a length higher than $k = 3$.
In fact, if a substring is of length $n$, then we need to make exactly $\lfloor n/(k + 1) \rfloor$ flips to break it into substrings of length not exceeding $k$ (we flip each $(k + 1)$-th character of the substring). However, there is one tricky case, when $n$ is divisible by $(k + 1)$. In this case, we would end up flipping the last character of the substring, which would increase the length of the substring following this substring. For example:
original string: "00011", $k = 2$,
substrings: "000", "11"
After flip: "001", "11"
After merging back: "00111"
The final string still has $3$ equal consecutive characters. Hence, we must make sure that we never flip the last character of a substring. This can always be achieved when $k > 1$, as instead of the last character, we would flip the second last character of the string, and it will not interfere with the next substring. For example,
original string: "1100000011100101", $k = 2$,
substrings: "11", "000000", "111", "00", "1", "0", "1",
after flip: "11", "001010", "101", "00", "1", "0", "1",
after merger: "1100101010100101"
Note that, in the flip step, we flipped the second last character of second and third substring, instead of the last one. However, this is not possible in case $k = 1$, because if we flip the second last character, then it will interfere with the previous substring.
Special Handling of $k = 1$:
In case of $k = 1$, there can no two consecutive equal characters in the resulting string, i.e., resulting string must have alternating characters. There are only two such strings of length $N$: one which starts with "0" and then alternates between "0" and "1", and the other which starts with "1", and then alternates. We consider both of these as potential target string, and pick the one which could be obtained by using fewer flips.
Time Complexity:
$O (N)$
AUTHOR'S AND TESTER'S SOLUTIONS:
Author's solution will be put up soon.
Tester's solution will be put up soon.
A few weeks ago,I had written about a Chrome and Firefox extension that I had created called Coder's Calendar which displayed a list of live and upcoming coding contests.(read about it here)
I've created and published an Android app of the same.
As in MARCH13 contest we needed to use primary school arithmetics once again, and as it is a topic that comes up quite frequentely here in the forums thanks to this problem, and also, as I don't see a complete and detailed tutorial on this topic here, I decided I'd write one before my Lab class at university :P (spending free time with Codechef is always a joy!!)
Some preliminary experiences with C
If we want to implement a factorial calculator efficientely, we need to know what we are dealing with in the first place...
Some experiences in computing factorials iteratively with the following code:
#include <stdio.h>
long long int factorial(int N)
{
long long ans = 1;
int i;
for(i=1; i <= N; i++)
ans *= i;
return ans;
}
int main()
{
int t;
for(t=1; t <= 21; t++)
{
printf("%lld\n", factorial(t));
}
return 0;
}
So, we can now see that even when using the long long data type, the maximum factorial we can expect to compute correctly, is only 20!
When seen by this point of view, suddenly, 100! seems as an impossible limit for someone using C/C++, and this is actually only partially true, such that we can say:
It is impossible to compute factorials larger than 20 when using built-in data types.
However, the beauty of algorithms arises on such situations... After all, if long long data type is the largest built-in type available, how can people get AC solutions in C/C++? (And, as a side note, how the hell are those "magic" BigNum and variable precision arithmetic libraries implemented?).
The answer to these questions is surprisingly and annoyingly "basic" and "elementar", in fact, we shall travel back to our primary school years and apply what was taught to most of us when we were 6/7/8 years old.
I am talking about doing all operations by hand!!
The underlying idea behind long operations and how to map it into a programming language
12*11 = 132
Any programming language will tell you that. But, so will any 8 year old kid that's good with numbers. But, the kid's way of telling you such result is what we are interested in:
Here's how he would do it:
12
x 11
---------
12
+12
----------
132
But, why is this idea way more interesting than simply doing it straighforwardly? It even looks harder and more error-prone... But, it has a fundamental property that we will exploit to its fullest:
The intermediate numbers involved on the intermediate calculations never exceed 81
This is because it is the largest product possible of two 1-digit numbers (9*9 = 81), and these numbers,well, we can deal with them easily!!
The main idea now is to find a suitable data structure to store all the intermediate results and for that we can use an array:
Say int a[200] is array where we can store 200 1-digit numbers. (In fact, at each position we can store an integer, but we will only store 1 digit for each position.)
So, we are making good progress!! We managed to understand two important things:
Primary school arithmetic can prove very useful on big number problems;
We can use all built-in structures and data-types to perform calculations;
Now, comes up a new question:
How can we know the length of such a huge number? We can store an array and 200 positions, but, our big number may have only 100 digits for example.
The trick is to use a variable that will save us, at each moment, the number of digits that is contained in the array. Let's call it m.
Also, since we want only one digit to be stored in every position of array, we need to find a way to "propagate" the carry of larger products to higher digits and sum it afterwards. Let's call the variable to hold the carry, temp.
m -> Variable that contains the number of digits in the array in any given moment;
temp -> Variable to hold the "carry" value resultant of multiplying digits whose product will be larger than 9. (8*9 = 72, we would store 2 on one array position, and 7 would be the "carry" and it would be stored on a different position.)
So, now that we have an idea on how to deal with the multiplications, let's work on mapping it into a programming language.
Coding our idea and one final detail
Now, we are ready to code our solution for the FCTRL2 problem.
However, one last remark needs to be done:
How do we store a number in the array, and why do we store it the way we do?
If after reading this tutorial you look at some of the accepted solutions in C/C++ for this problem, you will see that contestants actually stored the numbers "backwards", for example:
123 would be saved in an array, say a, as:
a = [3,2,1];
This is done such that when the digit by digit calculations are being performed, the "carry" can be placed on the positions of the array with higher index. This way, we are sure that carry is computed and placed correctly on the array.
Also, computing the products this way and maintaining the variable, m, allows us to print the result directly, by looping from a[m-1] until a[0].
As an example, I can leave here an implementation made by @upendra1234, that I took the liberty to comment for a better understanding:
#include<stdio.h>
int main()
{
int t;
int a[200]; //array will have the capacity to store 200 digits.
int n,i,j,temp,m,x;
scanf("%d",&t);
while(t--)
{
scanf("%d",&n);
a[0]=1; //initializes array with only 1 digit, the digit 1.
m=1; // initializes digit counter
temp = 0; //Initializes carry variable to 0.
for(i=1;i<=n;i++)
{
for(j=0;j<m;j++)
{
x = a[j]*i+temp; //x contains the digit by digit product
a[j]=x%10; //Contains the digit to store in position j
temp = x/10; //Contains the carry value that will be stored on later indexes
}
while(temp>0) //while loop that will store the carry value on array.
{
a[m]=temp%10;
temp = temp/10;
m++; // increments digit counter
}
}
for(i=m-1;i>=0;i--) //printing answer
printf("%d",a[i]);
printf("\n");
}
return 0;
}
I hope this tutorial can help someone to gain a better understanding of this subject and that can help some people as it is why we are here for :D
Best Regards,
Bruno Oliveira
EDIT: As per @betlista comment, it's also worth pointing out that, since we keep only a single digit at each position on the array, we could have used the data-type char instead of int. This is because internally, a char is actually an integer that only goes in the range 0 - 255 (values used to represent the ASCII codes for all the characters we are used to see). The gains would be only memory-wise.
Given Frog's location on the X axis and they can communicate their message to another frog only if the distance are less than or equal to K , now you need to answer if given two frogs can communicate or not. Assumption : Frog's are cooperative in nature and will transfer the message without editing it :D .
Quick Explanation
Find for each frog the maximum distance which his message can reach. Two Frogs can only communicate if their maximum distance of communication are same.
Reason
You can easily proof by contradiction. Let us assume that there are two frogs 1 and 2 which can communicate but their maximum distances are different. You can easily contradict this , proof is left as an exercise for the reader.
Explanation
Only Challenge left is to calculate the maximum distance which each frog can message.
The First point to note is that one of the optimal strategy of each frog(say f) will be to send message to it's nearest frog(say f1) and then it will be the responsibility of the nearest frog to carry it further.
One Line Proof : Frog's reachable from the f in the direction of f1 is also reachable from f1.
Another point to note is that the frog on the extreme positive side of X axis(i.e Maximum A[i] , say A[j]) can communicate till A[j] + K.
Using these observation , one can use a simple dp to calculate the maximum distance . But how ?
Sort A[i]'s but do not loose the index of frog's while sorting. Let the sorted array be Frog[] . Now if Frog[i] can communicate to Frog[i+1] , then Frog[i] can communicate as mcuh distance as Frog[i+1] can communicate.
Pseudo Code
Pre-Compute ( A , K ):
sort(A,A+N); //Sorted in Decreasing Order of X .
Max_Distance[A[0].ind]=A[0].v+K;
for(int i=1;i<N;i++)
if((A[i-1].x-A[i].x)<=K)
Max_Distance[A[i].ind] = Max_Distance[A[i-1].ind];
else
Max_Distance[A[i].ind] = A[i].x + K;
Answer ( x , y ):
if ( Max_Distance[x] == Max_Distance[y] ):
return "Yes"
else
return "No"
hello all
This question is all about implementation and design.
Given an array of n elements print array's elements in ascending order or descending order without changing the array's elements and without creating any new array.
Given T big integers with at most D digits, determine whether it can be a fibonacci number. The total digits are L.
EXPLANATION:
There are a number of ways to solve this problem.
If you look up fibonacci properties online, you may find the followings:
n is a fibonacci number if and only if 5 * n * n + 4 or 5 * n * n - 4 is a perfect square number.
Using this property, together with big integer multiplication and sqrt, you can get the answer.
However, this method is too complex to pass this problem. It is worth noting that fibonacci number increases exponentially. And thus, there are only O(D) fibonacci numbers have at most D digits. This observation gives us the intuition to solve this problem much simpler.
The fist method is an offline version. First, we load all queries and sorted them. This step will take O(TDlogT) if we use quick sort. Second, we compute fibonacci numbers one after another using big integer addition. For each fibonacci number, check its relationship with the current smallest query number:
If the they are same, then the answer of that query is YES and let’s look at the next query number;
If the fibonacci number is smaller, then let’s look at the next fibonacci number;
if the fibonacci number is larger, then the answer of that query is NO and let’s look at the next query number.
This procedure needs O((D + T)D) time. Therefore, this offline version’s time complexity is O(TDlogT + (D + T)D).
The second one involves hash. We can simply generate O(D) fibonacci number and only restore the remainder of some relatively big number, for example, 2^64 or 10^9+7. And then, check the query number’s remainder to see whether it occurred using hash table or tree set. Suppose we use hash table, the time complexity is O(D + L).
AUTHOR'S AND TESTER'S SOLUTIONS:
Author's solution can be found here.
Tester's solution can be found here.
This text will focus on the construction of Suffix Arrays, it will aim to explain what they are and what they are used for and hopefully some examples will be provided (it will be mainly simple applications so that the concepts don't get too attached to the theoretical explanation).
As usual, this follows my somewhat recent series of tutorials in order to make the reference post with links as complete as possible!
What is a Suffix Array?
In simple terms, a suffix array is just a sorted array of all the suffixes of a given string.
As a data structure, it is widely used in areas such as data compression, bioinformatics and, in general, in any area that deals with strings and string matching problems, so, as you can see, it is of great importance to know efficient algorithms to construct a suffix array for a given string.
Please note that on this context, the name suffix is the exact same thing as substring, as you can see from the wikipedia link provided.
A suffix array will contain integers that represent the starting indexes of the all the suffixes of a given string, after the aforementioned suffixes are sorted.
On some applications of suffix arrays, it is common to paddle the string with a special character (like #, @ or $) that is not present on the alphabet that is being used to represent the string and, as such, it's considered to be smaller than all the other characters. (The reason why these special characters are used will hopefully be clearer ahead in this text)
And, as a picture it's worth more than a thousand words, below is a small scheme which represents the several suffixes of a string (on the left) along with the suffix array for the same string (on the right). The original string is attcatg$.
Image may be NSFW. Clik here to view.
The above picture describes what we want to do, and our goal with this text will be to explore different ways of doing this in the hope of obtaining a good solution.
We will enumerate some popular algorithms for this task and will actually implement some of them in C++ (as you will see, some of them are trivial to implement but can be too slow, while others have faster execution times at the cost of both implementation and memory complexity).
The naive algorithm
We shall begin our exploration of this very interesting topic by first studying the most naive algorithm available to solve our problem, which is also the most simple one to implement.
The key idea of the naive algorithm is using a good comparison-based sorting algorithm to sort all the suffixes of a given string in the fastest possible way. Quick-sort does this task very well.
However we should remind ourselves that we are sorting strings, so, either we use the overloaded < sign to serve as a "comparator" for strings (this is done internally in C++ for the string data type) or we write our own string comparison function, which is basically the same thing regarding time complexity, with the former alternative consuming us more time on the writing of code. As such, on my own implementation I chose to keep things simple and used the built-in sort() function applied to a vector of strings. As to compare two strings, we are forced to iterate over all its characters, the time complexity to compare strings is O(N), which means that:
On the naive approach, we are sorting N strings with an O(N log N) comparison based sorting algorithm. As comparing strings takes O(N) time, we can conclude that the time complexity of our naive approach is O(N2 log N)
After sorting all the strings, we need to be able to "retrieve" the original index that each string had initially so we can actually build the suffix array itself.
[Sidenote: As written on the image, the indexes just "come along for the ride".
To do this, I simply used a map as an auxiliary data structure, such that the keys are the strings that will map to the values which are the original indexes the strings had on the original array. Now, retrieving these values is trivial.]
Below, you can find the code for the naive algorithm for constructing the Suffix Array of a given string entered by the user as input:
//Naive algorithm for the construction of the suffix array of a given string
#include <iostream>
#include <string>
#include <map>
#include <algorithm>
#include <vector>
using namespace std;
int main()
{
string s;
map<string,int> m;
cin >> s;
vector<string> v;
for(int i = 0; i < s.size();i++)
{
m[s.substr(i,s.size()-i)] = i;
v.push_back(s.substr(i,s.size()-i));
}
sort(v.begin(),v.end());
for(int i = 0; i < v.size();i++)
{
cout << m[v[i]] << endl;
}
return 0;
}
As you can see by the above code snippet, the implementation of the naive approach is pretty straightforward and very robust as little to virtually no space for errors is allowed if one uses built-in sorting functions.
However, such simplicity comes with an associated cost, and on this case, such cost is paid with a relatively high time complexity which is actually impractical for most problems. So, we need to tune up this approach a bit and attempt to devise a better algorithm.
This is what will be done on the next section.
A clever approach of building the Suffix Array of a given string
As noted above, Suffix Array construction is simple, but an efficient Suffix Array construction is hard.
However, after some thinking we can actually have a very defined idea of why we are performing so badly on such construction.
The reason why we are doing badly on the construction of the SA is because we are NOT EXPLOITING the fact that the strings we are sorting, are actually all part of the SAME original string, and not random, unrelated strings.
However, how can this observation help us?
This observation can help us greatly because now we can actually use tuples that contain only some characters of the string (which we will group in powers of two) such that we can sort the strings in a more ordered fashion by their first two characters, then we can improve on and sort them by their first four characters and so on, until we have reached a length such that we can be sure all the strings are themselves sorted.
With this observation at hand, we can actually cut down the execution time of our SA construction algorithm from O(N2 log N) to O(N log2 N).
Using the amazing work done by @gamabunta, I can provide his explanation of this approach, along with his pseudo-code and later improve a little bit upon it by actually providing an actual C++ implementation of this idea:
Let us consider the original array or suffixes, sorted only according to the first 2 character. If the first 2 character is the same, we consider that the strings have the same sort index.
Now, we wish to use the above array, and sort the suffixes according to their first 4 characters. To achieve this, we can assign 2-tuples to each string. The first value in the 2-tuple is the sort-index of the respective suffix, from above. The second value in the 2-tuple is the sort-index of the suffix that starts 2 positions later, again from above.
If the length of the suffix is less than 2 characters, then we can keep the second value in the 2-tuple as -1.
Now, we can call quick-sort and sort the suffixes according to their first 4 characters by using the 2-tuples we constructed above! The result would be
Similarly constructing the 2-tuples and performing quick-sort again will give us suffixes sorted by their first 8 characters.
Thus, we can sort the suffixes by the following pseudo-code
SortIndex[][] = { 0 }
for i = 0 to N-1
SortIndex[0][i] = order index of the character at A[i]
doneTill = 1
step = 1
while doneTill < N
L[] = { (0,0,0) } // Array of 3 tuples
for i = 0 to N-1
L[i] = ( SortIndex[step - 1][i],
SortIndex[step - 1][i + doneTill],
i
)
// We need to store the value of i to be able to retrieve the index
sort L
for i = 0 to N-1
SortIndex[step][L[i].thirdValue] =
SortIndex[step][L[i-1].thirdValue], if L[i] and L[i-1] have the same first and second values
i, otherwise
++step
doneTill *= 2
The above algorithm will find the Suffix Array in O(N log2 N).
Below you can find a C++ implementation of the above pseudo-code:
#include <cstdio>
#include <algorithm>
#include <cstring>
using namespace std;
#define MAXN 65536
#define MAXLG 17
char A[MAXN];
struct entry
{
int nr[2];
int p;
} L[MAXN];
int P[MAXLG][MAXN];
int N,i;
int stp, cnt;
int cmp(struct entry a, struct entry b)
{
return a.nr[0]==b.nr[0] ?(a.nr[1]<b.nr[1] ?1: 0): (a.nr[0]<b.nr[0] ?1: 0);
}
int main()
{
gets(A);
for(N=strlen(A), i = 0; i < N; i++)
P[0][i] = A[i] - 'a';
for(stp=1, cnt = 1; cnt < N; stp++, cnt *= 2)
{
for(i=0; i < N; i++)
{
L[i].nr[0]=P[stp- 1][i];
L[i].nr[1]=i +cnt <N? P[stp -1][i+ cnt]:-1;
L[i].p= i;
}
sort(L, L+N, cmp);
for(i=0; i < N; i++)
P[stp][L[i].p] =i> 0 && L[i].nr[0]==L[i-1].nr[0] && L[i].nr[1] == L[i- 1].nr[1] ? P[stp][L[i-1].p] : i;
}
return 0;
}
This concludes the explanation of a more efficient approach on building the suffix array for a given string. The runtime is, as said above, O(N log2 N).
Constructing (and explaining) the LCP array
The LCP array (Longest Common Prefix) is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes between pairs of consecutive suffixes in the suffix array.
So, if one has built the Suffix Array, it's relatively simple to actually build the LCP array.
In fact, using once again @gamabunta's amazing work, below there is the pseudo-code which allows one to efficiently find the LCP array:
We can use the SortIndex array we constructed above to find the Longest Common Prefix, between any two prefixes.
FindLCP (x, y)
answer = 0
for k = ceil(log N) to 0
if SortIndex[k][x] = SortIndex[k][y]
// sort-index is same if the first k characters are same
answer += 2k
// now we wish to find the characters that are same in the remaining strings
x += 2k
y += 2k
The LCP Array is the array of Longest Common Prefixes between the ith suffix and the (i-1)th suffix in the Suffix Array. The above algorithm needs to be called N times to build the LCP Array in a total of O(N log N) time.
Moving on from here
This post was actually the first long post I wrote about a subject which I'm not familiar with, AT ALL. This is always a risk I am also taking, but I tried to adhere only to the sub-topics I considered I mastered relatively well myself (at least, in theory, as I still don't think I could implement this correctly on a live contest or even a pratice problem... But, as I said many times, I'm here to work as hard as I can to learn as much as I can!)
I hope that what I wrote is, at least, decent and I did it basically as a good way of gathering information which is very spread over many papers and websites online, so that when people read this post they will be able to grasp the ideas for the naive solution as well as for the improvement presented as a better solution.
There are many interesting linear algorithms to "attack" this problem, with one of the most famous being the Skew Algorithm, which is lovely described on the link I provide here.
Besides this, there are several other algorithms which are also linear that exploit the relationship between Suffix Trees and the Suffix Array and that use linear sorting algorithm like radix sort, but, which I sadly don't yet understand which makes me unable to discuss them here.
However, I hope this little text does its job by at least gathering some useful information on a single post :)
I am also learning as I write these texts and this has helped me a lot on my evolution as a coder and I hope I can keep contributing to give all my best to this incredible cmmunity :D