Accepting O(log(n)) as a “Fairly Constant” Solution

Is O(log(n)) considerably worse than O(1) as an algorithm complexity? Should we strive to find the constant-time solution? There's no definitive answer to that; it's usually "depends on what you're trying to achieve". This is why I'm going to present the comparison in a more abstract sense, by looking at how the functions themselves grow.

Here's what a constant function like f(n) = 5 looks like:

[figure: plot of f(n) = 5]

It's just a boring straight line parallel to the x-axis, but it's the best complexity we can hope for. No matter how large the input is, it still requires the same number of steps.

Now let's look at a logarithmic function such as f(n) = log_{10}(n):

[figure: plot of f(n) = log_{10}(n)]

Doesn't look very bad, does it? Its growth is what you might consider "fairly constant": it grows so slowly that it takes n = 1,000,000 to hit 6. But this is because the base was 10; how about a smaller base, say 2? That will be way worse, right? From a complexity point of view, the base of a logarithmic function is just a constant, and it simply doesn't matter when it comes to asymptotic growth (you can have some fun finding out why it doesn't).
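(If you'd rather not work it out yourself, here's the one-line answer: by the change-of-base rule, two logarithms of different bases differ only by a constant multiplier, and Big-O notation ignores constant factors.)

\log_2(n) = \frac{\log_{10}(n)}{\log_{10}(2)} \approx 3.32 \cdot \log_{10}(n)

So O(log_2(n)) and O(log_{10}(n)) are the same class: O(log n).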

Let's put f(n) = log_2(n) to the test anyway. It's worse, but not "way" worse at all: at n = 1,000,000 the value is just about 20.

[figure: plot of f(n) = log_2(n)]

How about looking at it from a different point of view? There's no such thing as 1.732 steps in algorithms or machines; it's always a positive integer number of steps. So let's round the logarithmic functions' output up, use f(n) = ceil(log_2(n)) and f(n) = ceil(log_{10}(n)) instead, and see for yourself how close to constant they look.

[figure: plot of f(n) = ceil(log_2(n))]

[figure: plot of f(n) = ceil(log_{10}(n))]
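If you'd rather see the numbers than the plots, a tiny C program (a quick sketch of mine) prints the rounded step counts as n grows by a factor of 10 each time:

#include <stdio.h>
#include <math.h>

int main (void)
{
    /* Print ceil(log2(n)) and ceil(log10(n)) for n = 10, 100, ..., 10^9. */
    for (double n = 10.0; n <= 1e9; n *= 10.0)
        printf("n = %10.0f   ceil(log2) = %2.0f   ceil(log10) = %2.0f\n",
               n, ceil(log2(n)), ceil(log10(n)));
    return 0;
}

Nine orders of magnitude of input, and the faster-growing of the two never exceeds 30.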

How to C

Oh C, the good old language which is still out there, and on which many newer languages are based. It's common for people to think that C isn't for the faint of heart, but the language is actually simple. So simple that it's almost never the best choice for a large project which isn't an operating system. I won't introduce C here, or teach how to program in it. I'll assume that you have some knowledge of the language, and will discuss how I like to use C by addressing common beginner mistakes and general stylistic preferences.

 

Good Code

To this day, I still believe that 'Good Code = Good Types + Good Functions'. Let's think about it: the types you define (structs), and the types you define a more representative or simpler alias for (typedefs), constitute the data the program will hold. The functions are the operations to be applied to those types. And that's it: define your types, write proper operations, and you're good to go.

 

Good Functions

Before talking about types, let's first talk about functions. Here's the biggest mistake: long functions. While dealing with C, you're most likely on your own, with nothing to help you but a debugger, a profiler, and function documentation at your disposal. Once, connecting to a server and receiving a chunk of data from it was, in my mind, a simple task that should be one function. I started going back and forth between my code and the documentation, stacking lines over lines in the same scope. In the end it worked, but it looked terrible. I moved on to the next task and repeated the same thing.

Some guidelines to write good functions:

  1. Small: Consider each sub-task a task on its own, and write it as such. In fact, every ((sub-sub)^n)-task should be written independently.
  2. Meaningful return: Return something, be it an error, a state, or otherwise. Use 'void' only for a procedure routine; that is, something that just defines a certain flow of operations and nothing more.
  3. Stateless: Try as much as possible to avoid reading variables which aren’t part of your function arguments.
  4. Pointer-free: Do you really need a pointer for this? Probably not. If you do, think again before passing it. You should avoid manipulating passed arguments, which means that you probably shouldn't use a pointer for this.
  5. Callbacks: C supports higher-order functions, USE THAT! Don’t go too far or you’ll end up in callback hell and hate yourself (learn from your JavaScript experience).

Let’s discuss the guidelines a bit.

'Small': a classic, and everyone will tell you that all the time, but hold on a second. By 'small' I mean really, really small. I usually start from the function which represents the final goal, then I go down as if it were a tree: everything I think of under it becomes a new function, applied recursively until I've hit rock bottom.

'Meaningful return'. Man oh man. Just return something useful, damn it! Tell me how things went; give me a modified version of an argument instead of changing its value, and let me decide what to do with it. Remove 'void' from your dictionary; you'll know when you really need it once you stop using it at all.

‘Stateless’. Everyone in the programming scene will tell you so, do as they say.

'Pointer-free'. But that's C, we have to use pointers. Do we really? Having the power to do something doesn't mean you should always use it. Problems with pointers arise from the abuse of pointers. Use them wisely, and be considerate when you ask for a pointer instead of an object.

'Callbacks'. Callbacks are fine. Yes, they're calls through function pointers and the compiler will not know how to inline them (I guess! Haven't actually checked), but they're a good way to chain operations together. Use them when you can. And since function pointers can get ugly, we'll see how to deal with that in Good Types.
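To make the guidelines concrete, here is a minimal sketch of the "connect and receive" task from earlier, broken up the way I described. All the names here (Status, ChunkHandler, conn_open, conn_receive, fetch_chunk) are hypothetical and the network code is stubbed out; the point is small functions, meaningful returns, and a callback:

#include <stddef.h>

/* A status every function returns instead of 'void'. */
typedef enum { OK, ERR_CONNECT, ERR_RECEIVE } Status;

/* The callback type; we'll give it a nicer home in Good Types. */
typedef void (*ChunkHandler) (const char *data, size_t len);

/* Each sub-task is its own tiny function, and each reports how it went. */
static Status conn_open (const char *host)         { (void)host; return OK; }
static Status conn_receive (char *buf, size_t cap) { (void)buf; (void)cap; return OK; }

/* The final goal: a short function composed of the sub-tasks,
   handing the received chunk to a callback. */
Status fetch_chunk (const char *host, ChunkHandler on_chunk)
{
    char buf[4096];

    if (conn_open(host) != OK)
        return ERR_CONNECT;
    if (conn_receive(buf, sizeof buf) != OK)
        return ERR_RECEIVE;

    on_chunk(buf, sizeof buf);
    return OK;
}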

 

Good Types

There's not much to say here: write good, descriptive types which fit the purpose of your program or library. Define enums instead of passing integers if your values are categorical. Define typedefs for function pointers you want to use as callbacks, and give them good names. Use typedefs to make a more meaningful alias of a generic type. And for the love of anything you hold sacred, define structs with typedefs.
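A compact sketch of all four points in one place (the names are mine, purely for illustration):

#include <stdint.h>

/* Categorical values: an enum instead of a bare integer. */
typedef enum { STATE_IDLE, STATE_RUNNING, STATE_DONE } TaskState;

/* A more meaningful alias of a generic type. */
typedef uint64_t TaskId;

/* A named function-pointer type, so callbacks stay readable. */
typedef void (*TaskCallback) (TaskId id, TaskState state);

/* A struct defined with a typedef; no 'struct Task' anywhere. */
typedef struct {
    TaskId       id;
    TaskState    state;
    TaskCallback on_change;
} Task;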

 

Coding Convention

100% of experienced programmers will tell you that having a coding convention, for yourself and for a team, is fundamental. Pick the one you're most comfortable with and go for it. I personally prefer CamelCase for types (enums and structs) even though most C code bases use snake_case for them. I hate having 'struct <type>'; use typedef struct instead. I also like snake_case for functions, and leaving a space between a keyword or a function name and the parentheses (except for function calls). And if you follow the concept of making your functions small, you can easily encapsulate other coding conventions within the functions themselves, and hide them.
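Put together, that convention looks something like this (an illustrative fragment, not from any real code base):

typedef struct {
    int width;
    int height;
} Rectangle;                       /* CamelCase type, typedef struct */

int rect_area (const Rectangle r)  /* snake_case, space before the parens */
{
    return r.width * r.height;
}

int main (void)
{
    Rectangle r = { 3, 4 };

    if (rect_area(r) > 10)         /* but no space on the actual call */
        return 1;
    return 0;
}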

Taking this a bit further, I consider coding principles part of the convention. There are no restrictions here; whatever design principles you like to follow in the way you code, just follow them.

 

Heap Allocation

Two comments here: reduce allocations, and profile your memory usage. Unless the heap is needed, use stack memory instead; and the moment you start allocating on the heap is the moment you should use something like Valgrind to validate that you're not messing anything up.
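As a tiny illustration of the "stack first" habit: the first function below cannot leak because its buffer is released automatically, while the second is exactly the kind of code worth running under Valgrind (e.g. valgrind --leak-check=full ./your_program). Both functions are hypothetical sketches:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stack: fixed-size scratch space, released automatically on return. */
void greet (char *out, size_t cap)
{
    char scratch[64];                /* lives on the stack */
    snprintf(scratch, sizeof scratch, "hello");
    snprintf(out, cap, "%s", scratch);
}

/* Heap: only when the size isn't known at compile time;
   now you own a free() and a potential leak. */
char *make_buffer (size_t n)
{
    char *buf = malloc(n);           /* check it, and remember to free it */
    if (buf != NULL)
        memset(buf, 0, n);
    return buf;
}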

 

Ugly but Performs Well

In most cases you should favor good design and stylistic decisions over better performance. The reason is simple: don't use relative performance comparisons; set an absolute target and check whether you hit it or not. Instead of saying this piece of code is 25% faster, decide how much throughput you need, and as long as you're satisfying that limit, relative performance within that range is meaningless. Oh, and did I mention profiling? STOP MAKING GUESSES: use a profiler, micro-benchmark (if needed), use Godbolt to see the compiler output of some pieces of code. When you're done with all that, then decide whether you want better performance or not.

 

Be Ready for Optimization

Yeah, you don't usually need to optimize, but what if you had to? Learn how to optimize the performance of your code, and learn how to deal with bottlenecks. Just be ready, so that when the moment comes you don't spend days learning first.

 

That’s it, that’s my philosophy while using C. I follow the same philosophy with other languages as well, but I’m not as pedantic with them.

Functional Interfaces and Callbacks in Java 8

It's undeniable that Java 8 was a great step forward for the language. For me, it made Java a bit less rigid than it was, and brought some great features to the table. One of those additions was the introduction of Lambda Expressions accompanied by Functional Interfaces. Recently, after being away from Java for a while, I came back to it for a certain task, and it happened that I found the aforementioned features to be the most convenient approach. That was when I needed to make a chain of callbacks. I will leave reading about Lambda Expressions to you, and will just introduce Functional Interfaces.

What are Functional Interfaces?

Simply put, they are interfaces with a single abstract function. If the interface conforms to that single rule, it can be annotated as @FunctionalInterface. The word “abstract” here is important since not all functions in an interface need to be abstract anymore. Java allows programmers to add concrete default functions to an interface. Here are some examples of what does and what doesn’t classify as a Functional Interface.

  • Valid:

interface Example {
    void doSomething();
}
interface Example {
    void doSomething();
    default void doDefault() { ... }
}
interface Example {
    void doSomething();
    default void doDefault() { ... }
    static void doStatic() { ... }
}
  • Invalid:

interface Example {
    void doSomething();
    void doOtherThing();
}

 

What makes Functional Interfaces special?

This question has the same answer as "why are they required to have only one abstract function?". Since the interface has exactly one abstract function, one can assign a lambda expression in place of the interface, and that lambda will be treated as the implementation of the abstract function. This will come in handy when we talk about callbacks next.

 

Java callbacks

Remember the bad old days, when you had to deal with Event Listeners and pretty much anything else which represented a callback? You had to make a class which implements an abstract function of a superclass, and you had to write the full definition of that class, inline or separately. Well, you still have to do that, but now you can just write the definition of your function directly, which is way more convenient, and I did it as follows:

Callback interface:

@FunctionalInterface
interface ControllerCallback {
    void processResult(ControllerResult result, Exception ex);

    static ControllerCallback empty() {
        return new ControllerCallback() {
            @Override
            public void processResult(ControllerResult result, Exception ex) {}
        };
    }
}

Here the interface has its lone abstract function, processResult(), which takes the result of the previous step and an instance of an exception (if any was thrown) as arguments. Additionally, one could add a ControllerCallback to the parameters, but that wasn't needed in my case. The empty() function returns a callback which does nothing; it was added for testing cases which didn't need the full pipeline of operations.

Function which uses the callback:

void operation(..., ControllerCallback callback) {
    if (...) {
        callback.processResult(ControllerResult.FAILED_NOT_FOUND, null);
        return;
    }
    
    try {
        ...
    } catch (Exception ex) {
        callback.processResult(ControllerResult.FAILED_OTHER, ex);
        return;
    }
    
    callback.processResult(ControllerResult.SUCCESS, null);
}

The function shows three ways the callback is used: to signal a failure which wasn't caused by an exception, to signal a failure with an exception, and finally to signal that the operation completed successfully. It's worth noting that an exception could be instantiated (but not thrown) and passed to the callback to provide extra information about the error.

Usage:

ControllerCallback operationCallback = (res, ex)-> {
    switch (res) {
        case SUCCESS:
            ...
            break;
        default:
            System.err.println("Failed");
            if (ex == null)
                System.err.println("No exception was thrown.");
            else
                System.err.println("An exception was thrown:\n"
                                   + ex.getMessage());
    }
};
controller.operation(..., operationCallback);

Notice how the callback was instantiated: it was assigned a single lambda expression representing the implementation of the abstract function. This is much better than the old, tedious way that Java forced us to use for a long time.

Solving Frequent k-Mers Problem using MapReduce (Bioinformatics)

The frequent k-mers problem is one of the very first problems introduced to those who take the Bioinformatics I course on Coursera. The problem is not difficult, but it makes a good introduction to the field since its naive solution is impractical, and it requires a little bit of optimization. After solving the problem as efficiently as possible, I started thinking of implementing it as a MapReduce task for practice.

First, does the problem really need to be distributed? I do not know! My lack of experience does not allow me to say how frequently people need to find the most frequent k-mers. However, for the sake of practice, I decided to implement it; there is no harm. In this post, we will not discuss how to solve the problem itself, but rather how to properly split the input so that the MapReduce task can be written correctly.

 

Problem definition

In case you are not familiar with the problem, it is simply defined as: "given a string, find the most common sub-string of length k".

Sound familiar? The problem is very similar to the word count problem, but all words are of a fixed length, and the output includes only the words which occurred most frequently.

 

Solution

The solution is divided into four parts: partitioning, mapping, reducing, and finding the final answer. In order to implement it, I will use LazyMapReduce, but you can achieve the same thing with any other MapReduce implementation.

Since we are already familiar with Map and Reduce for the word count problem, we can just write them directly. Notice that the two functions are independent of how partitioning will be done, as Map will only work on an individual partition. This time, we will make a class which extends MapReduceTask and assigns Map and Reduce internally, instead of creating a MapReduceTask instance and assigning its Map and Reduce functions.

Just like the word count problem, the MapReduceTask types will be '<object, string, string, int, string, int>' (refer to the previous tutorial to learn more about that). Therefore, the structure of the class will be like this:

public class FrequentKmersTask: MapReduceTask<object, string, string, int, string, int>
{
    string sequence;
    int k;
    int partitionSize;
 
    public FrequentKmersTask (string _sequence, int _k, int _partitionSize)
    {
       sequence = _sequence;
       k = _k;
       partitionSize = _partitionSize;
 
       this.Map = FrequentKmersMap;
       this.Reduce = FrequentKmersReduce;
    } 
}

Now we will work on the very first part of the implementation, which is how to split the genome sequence into chunks, each to be pushed as input to the Map function.

  • Naive partitioning

Let’s just divide the genome into equal-sized partitions, where the last one might be shorter than the rest, and make every partition start right after the previous one.

void PrepareTask ()
{
      var index = 0;
      var length = 0;
      var partition = "";
 
      while (true) {
          // Each partition starts right where the previous one ended.
          length = Math.Min (partitionSize, sequence.Length - index);
          partition = sequence.Substring (index, length);
          PushInput (null, partition);
          index += length;
 
          if (index >= sequence.Length)
               break;
      }
}

That was easy, and we are done. NO, NOT SO FAST. There is a big flaw in the way we split the genome: a k-mer which is divided between two chunks will be ignored by our Map function. To make this clearer, let's check mathematically whether the algorithm we use to partition a genome sequence will, in the end, give us the same number of k-mers or not.

First, we need to know how many k-mers a string of length N has, and that is N - k + 1 (you can verify this if you would like to). Then we split the string into m chunks of size n', and according to our partitioning algorithm, m = N/n'.

Then we need to calculate how many k-mers will result from splitting the string the way we did. The total number of k-mers is \sum_{i=1}^{m} kmers_i = \sum_{i=1}^{m} (n' - k + 1) = mn' - mk + m. After replacing m with N/n', the formula can be rewritten as N - \frac{N}{n'} k + \frac{N}{n'}.
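Setting the split total equal to the true count and solving (a short derivation spelled out for completeness):

N - \frac{N}{n'} k + \frac{N}{n'} = N - k + 1 \;\Longleftrightarrow\; \frac{N}{n'} (1 - k) = 1 - k \;\Longleftrightarrow\; \frac{N}{n'} = 1 \quad (k > 1)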

As the derivation above shows, the only situation where N - \frac{N}{n'} k + \frac{N}{n'} = N - k + 1 is when N = n', which means that we do not split the genome at all. For example, with N = 10, k = 3, and n' = 5, the two chunks yield only 2(5 - 3 + 1) = 6 k-mers instead of the true 10 - 3 + 1 = 8. On the other hand, the solution is rather simple, and it only involves modifying one statement.

  • Correct partitioning

But before we go into the code, let’s fix the way we thought about splitting the string first, and change it to “divide the genome into equal-sized partitions, where the last one might be shorter than the rest, and make every partition start right after the LAST INDEX WHICH IS A VALID START OF A K-MER IN THE previous one”.

In a string of size N, the indices which are starts of k-mers are \{0, 1, ..., N-k\}. Therefore, for chunks of size n', each chunk must start at i_0 + n' - k + 1, where i_0 is the starting index of the previous chunk.

To achieve that, we only need to change the way we update the starting index of the next chunk in the previous implementation, turning it into:

void PrepareTask ()
{
      var index = 0;
      var length = 0;
      var partition = "";
 
      while (true) {
          length = Math.Min (partitionSize, sequence.Length - index);
          partition = sequence.Substring (index, length);
          PushInput (null, partition);
          // Overlap consecutive partitions by k - 1 symbols so no k-mer is lost.
          index += partitionSize - k + 1;
 
          if (index >= sequence.Length)
               break;
      }
}

 

And then, FrequentKmersMap and FrequentKmersReduce will be defined respectively as

List<KeySingleValuePair<string, int>> FrequentKmersMap (KeySingleValuePair<object, string> input)
{
    string kmer;
    var partition = input.value;
    var result = new List<KeySingleValuePair<string, int>> ();
 
    for (int i = 0; i <= partition.Length - k; i++) {
        kmer = partition.Substring (i, k);
        result.Add (new KeySingleValuePair<string, int> {
            key = kmer,
            value = 1
        });
    }
 
    return result;
}
 
KeySingleValuePair<string, int> FrequentKmersReduce (KeyMultipleValuesPair<string, int> input)
{
    return new KeySingleValuePair<string, int> {
         key = input.key,
         value = input.values.Count
    };
}

Since everything is now ready, the only remaining task is to find the most frequent k-mers. The results of all Reduce calls can be retrieved by calling GetResults(); then, by iterating through them, we can find the most frequent k-mers as follows:

public List<string> GetFrequentKmers ()
{
     var results = GetResults ();
     var kmers = new List<string> ();
     var maxCount = 0;
 
     foreach (var kmer in results) {
         if (kmer.value > maxCount) {
                maxCount = kmer.value;
                kmers.Clear ();
                kmers.Add (kmer.key);
         } else if (kmer.value == maxCount) {
                kmers.Add (kmer.key);
         }
     }
 
     return kmers;
}

 

Lazy MapReduce for C#

Hello! A few days ago I developed LazyMapReduce, a single file that anyone can add to their project and use to implement some tasks as MapReduce functions. In this post, I will discuss the motivation behind it, what LazyMapReduce is for, and show a simple example of how to implement a MapReduce task using it.

The project is hosted on GitHub and is written in C#, but I will add another implementation later for Java, and maybe Scala.

 

Motivation

I was bored, and I wanted to implement something as a MapReduce task, but did not want to install Hadoop. Therefore, I decided to fill my time by creating my own MapReduce.

 

Why LazyMapReduce?

LazyMapReduce is an implementation of MapReduce which does not require any configuration, and does not run in a distributed environment. It is by no means a good implementation for practical use. It only serves as a simple and easy way for those who want to learn or practice MapReduce, or those who want to try an implementation of something first to make sure it works. LazyMapReduce is meant solely for educational, practice, and try-out purposes.

 

Example

The hello-world example of MapReduce is the famous word count problem. You can hardly find any tutorial about MapReduce which does not include that example, so we will go with the flow and implement it. The example here assumes that you have some knowledge of MapReduce already, and know the concepts of generic classes and lambda expressions. If you do not, then I would advise having a look at them before continuing with this example. But they are simple to grasp, and you might get an idea of how they work just by following the example.

The MapReduceTask class makes use of delegates in C# (and VisualBasic.NET). It has a MapDelegate and a ReduceDelegate, for the Map and Reduce functions, respectively. Hence, Map and Reduce can be assigned any function which has the same signature as the corresponding delegate.

The use of delegates gives us two ways to implement a LazyMapReduce task. The first is by extending the MapReduceTask and then assigning the Map and Reduce functions internally. The second is by declaring a MapReduceTask object and then assigning its Map and Reduce functions. In this example, we will follow the latter.

  • Instantiating a MapReduceTask

MapReduceTask<object, string, string, int, string, int> wordCount =
          new MapReduceTask<object, string, string, int, string, int> ();

The MapReduceTask class is generic, which means that its type parameters need to be supplied when it is instantiated. The types, in order, represent: Map input key type, Map input value type, Map output key type, Map output value type, Reduce output key type, and Reduce output value type.

Both the Map and Reduce functions rely on delegates which use generic data structures whose types are assigned based on the types given to the task. Note that it is preferable to declare such variables using 'var' to make the line more concise, but the full declaration is shown here for illustration purposes.

  • Map Function

In our word count example, the Map function should receive a key-value pair of types object and string. We do not really care about the key in this case, so anything would work. It should then return a list of key-value pairs of types string and int, as specified by the Map output types, where the key is a single word and the value is the number of occurrences of that word. Alternatively, people usually go for the simpler route: with every occurrence of a word they add (word, 1), and then let the Reduce function handle counting them. As usual, we will go for the second option, and our Map function will look like this:

char[] delims = new char[] { ' ', ',', '!', ';', '?' };  
wordCount.Map = (KeySingleValuePair<object, string> input) => {
       string[] words = input.value.ToLower ().Split (delims);
       var result = new List<KeySingleValuePair<string, int>> ();
 
       foreach (string word in words) {
             result.Add(new KeySingleValuePair<string, int> {
                  key = word,
                  value = 1
              });
       }
                     
       return result;
};
  • Reduce Function

The Reduce function in the word count example is much simpler than the Map function; it basically just adds up all the values associated with a certain key to get the total sum (and since every value pushed by Map is 1, counting the values gives the same result). Hence, the Reduce function will be:

wordCount.Reduce = (KeyMultipleValuesPair<string, int> input) => {
       return new KeySingleValuePair<string, int> {
              key = input.key,
              value = input.values.Count
       };
};
  • Passing Input to the Task

We can pass the input to the task simply by pushing some key-value pairs of the Map input types specified in the declaration. Since for word count there is no input key, we will just pass null. In this example, we will declare an array of strings, where each string represents an independent text, and the goal is to find the frequency of each word across all of them.

string[] texts = ...;

foreach (string text in texts) {
       wordCount.PushInput (null, text);
}
  • Running the Task

The final step, now that everything is set up, is to run the task. In Hadoop, the Map function emits its outputs one by one (by writing to a context), so the task does not need to wait for all Map outputs in order to proceed. LazyMapReduce is not optimized for such behavior, however; it will wait until the outputs of all Map calls are ready, then proceed.

In order to run the task, we just need to call the function Run (). Run has an overloaded version which takes two boolean values. The booleans indicate whether the Map and Reduce tasks, respectively, should run in parallel. Calling Run without any parameters will result in both Map and Reduce running sequentially.

 wordCount.Run (true, true);
  • Getting the Results

After the task is run, we can get the results as an enumerable of key-value pairs of specified Reduce output types.

var results = wordCount.GetResults ();
var resultsEnumerator = results.GetEnumerator ();

And that is it, now we have a simple, easy-to-use way to work with MapReduce for the lazy.

 

Note: You can use LINQ to get a similar MapReduce approach. However, LazyMapReduce aims to mimic the way things like Hadoop work, so that one can play around and get familiar with something close to what a real MapReduce implementation looks like.

Lexicographical Encoding and Decoding of Genome Patterns (C#)

Note: this post is based on materials covered in Bioinformatics I (Finding Hidden Messages in Genome) on Coursera, and the idea is not my original work.

 

Problem Definition:

In the Bioinformatics course I took on Coursera, we covered something a bit trivial, yet helpful. The problem was: if we indexed all possible patterns of length k, then given a certain pattern of length k, what is its index? And given a certain index i, what is the pattern whose index is i?

The problem is about how we can encode and decode patterns based on their lexicographical order. For example, for k = 3, one should get a list which starts with ‘AAA‘ and ends with ‘TTT‘.

 

How to Solve it:

The first step to encoding a pattern is to encode the symbols first. If we assume that we only deal with DNA, we can encode the symbols of the nucleotides as follows: A = 0, C = 1, G = 2, T = 3.

There are two ways to think about how we can encode a full pattern, and both lead to the same conclusion. Let’s look at the first way. We can replace each symbol, independently, with its corresponding encoding to get a base-4 number, then we convert that number to base-10, and we will be done. In order to decode the number and get the original pattern, we just reverse the operation (convert from base-10 to base-4, then replace each digit with its symbol).

For example, the pattern ‘ATGCAA‘ can be written as (032100) in base-4, which is equivalent to (912) in base-10; therefore, ‘ATGCAA‘ can be encoded as 912. However, if we reverse the operation we get (32100) which will be decoded into ‘TGCAA‘. That problem arises because we did not take into consideration the original length of the encoded pattern, and since 0’s to the left are meaningless arithmetically, all leading A’s will be discarded. That is why it is important to know the original value of k (the length of the pattern), in order to decode it correctly.
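For the curious, the base-4 to base-10 arithmetic of that example written out:

(032100)_4 = 0 \cdot 4^5 + 3 \cdot 4^4 + 2 \cdot 4^3 + 1 \cdot 4^2 + 0 \cdot 4^1 + 0 \cdot 4^0 = 768 + 128 + 16 = 912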

The second way follows a lexicographical view. It relies on two simple facts to derive the encoding function: 1) if the last symbol is removed from all patterns, the remaining list is still sorted; 2) after removing the last symbol, each pattern in the remaining list will appear exactly four times. The first fact is trivial, but the second one might not seem straightforward at first. To clarify, let's look at a certain pattern, 'CGC'. If we append a new symbol to it, we have four possibilities: 'CGC(A)', 'CGC(C)', 'CGC(G)', and 'CGC(T)'. So, taking it backwards, if we remove the fourth symbol from all patterns, 'CGC' and any other pattern of three symbols will appear exactly four times.

Based on the aforementioned points, we can derive a recursive encoding function as follows: Encode(pattern) = 4 * Encode(Prefix(pattern)) + EncodeSymbol(last symbol in pattern), where Prefix(pattern) returns the pattern without its last symbol. In order to decode, we will use the same method shown earlier: converting from base-10 to base-4. This implies that the previous issue (that the length of the pattern must be known in order to decode the number properly) still persists.
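To see the recursion in action, here is 'ATGCAA' from the earlier example unrolled step by step (a worked check that it reproduces 912):

Encode(A) = 0
Encode(AT) = 4 \cdot 0 + 3 = 3
Encode(ATG) = 4 \cdot 3 + 2 = 14
Encode(ATGC) = 4 \cdot 14 + 1 = 57
Encode(ATGCA) = 4 \cdot 57 + 0 = 228
Encode(ATGCAA) = 4 \cdot 228 + 0 = 912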

 

Implementation (C#):

Here we will implement how to encode using the recursive lexicographical approach, instead of converting from base-4 to base-10 directly.

  • Symbols & Numbers:
Dictionary<char, byte> encodings = new Dictionary<char, byte> () {
     { 'A', 0 },
     { 'C', 1 },
     { 'G', 2 },
     { 'T', 3 }
};
 
Dictionary<byte, char> decodings = new Dictionary<byte, char> () {
     { 0, 'A' },
     { 1, 'C' },
     { 2, 'G' },
     { 3, 'T' }
};
 
// Thin lookup helpers; the encoding/decoding functions below rely on these.
byte EncodeSymbol (char symbol) { return encodings [symbol]; }
char DecodeSymbol (byte encoding) { return decodings [encoding]; }
  • Encoding:
ulong LexicographicalEncoding (string pattern, int len)
{
     byte  encoding = EncodeSymbol (pattern [len - 1]);
     if (len == 1) {
          return (ulong)encoding;
     }
     return 4 * LexicographicalEncoding (pattern, len - 1) + encoding;
}
  • Decoding:
string LexicographicalDecoding (ulong encoding, int len)
{
      char[] decoded = new char[len];
      int index = len;
      ulong quotient = encoding, remainder;
 
      while (quotient > 0 || index > 0) {
           remainder = quotient % 4;
           quotient /= 4;
           decoded [--index] = DecodeSymbol ((byte)remainder);
      }
 
      return new string (decoded);
}