Searching and Sorting


Introduction to Sorting and Searching Arrays

In this blog post, we will go over simple implementations for sorting and searching lists.  Learning to write fast search and sort algorithms will have an immense impact on the performance of all your programs, as many real-world applications deal with huge sets of data.  I will not cover Big-Oh notation in this post; that will come in a later one.  For now, just think about speed in a relative sense.

Sorting using Selection Sort


Let us take a look at sorting an array first.

You are given a random list of numbers in an array, and are told to sort them in ascending order.

9 5 7 3 2 1 8

There are 7 numbers in total in the array.  The most basic way of approaching this problem is to find the smallest number and swap it into the front position, but in an unordered list you have no idea where that number will be.  Therefore you must traverse the entire list to find the smallest element, excluding the elements you have already placed.

So, for the list above, using this logic: we save 9, save 5, pass 7, save 3, save 2, save 1, pass 8... then swap 1 to the front.

1 5 7 3 2 9 8

Continue sorting...
save 5, pass 7, save 3, save 2, pass 9, pass 8...swap 2
1 2 7 3 5 9 8

This algorithm works reasonably well in that it swaps larger elements towards the end and smaller elements towards the beginning, so fewer swaps may be needed near the end.  But this aspect of the sort is overshadowed by the number of times the array must be examined to find the smallest value.  Imagine if you had over 100,000 elements.  This algorithm would have to compare elements of the array for each index of the array.  The last element of this array will have been checked at least 99,999 times!
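To make the walkthrough above concrete, here is a minimal Selection Sort sketch in Java.  This is just an illustration of the idea described above (class and variable names are my own), not the implementation linked below.

```java
// Minimal Selection Sort sketch: for each position, scan the rest of
// the array for the smallest remaining value and swap it into place.
public class SelectionSortDemo {
    static void selectionSort(int[] a) {
        for (int i = 0; i < a.length - 1; i++) {
            int min = i;                          // index of smallest value seen so far
            for (int j = i + 1; j < a.length; j++) {
                if (a[j] < a[min]) {
                    min = j;                      // "save" the new smallest candidate
                }
            }
            int tmp = a[i];                       // swap the smallest into position i
            a[i] = a[min];
            a[min] = tmp;
        }
    }

    public static void main(String[] args) {
        int[] a = {9, 5, 7, 3, 2, 1, 8};
        selectionSort(a);
        System.out.println(java.util.Arrays.toString(a)); // [1, 2, 3, 5, 7, 8, 9]
    }
}
```

Notice the nested loops: the inner scan over the unsorted remainder is exactly the repeated "save/pass" traversal from the walkthrough, and it is what makes the sort slow on large arrays.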

More on Selection Sort

You can find my implementation of Selection Sort here.

Sorting using Merge Sort


Selection Sort is a very easy sorting method, but when sorting large data sets it proves not to be a viable solution.  One of the first algorithms to do well on large data sets is Merge Sort, which uses a divide and conquer approach to sort a large list.

The way Merge Sort works is that you divide the current list into two halves and recursively call the same sort function on each half.  The division continues until each resulting list contains a single element.  Then the separated lists are merged back together, recombining the pairs that were split, until you are left with the original whole list, now sorted.  When two lists are merged, their elements are compared one by one to produce a single sorted list.
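The split-then-merge process described above can be sketched in a few lines of Java.  This is a simple top-down version for illustration, not the implementation linked below.

```java
import java.util.Arrays;

// Minimal top-down Merge Sort sketch: split the array in half, sort each
// half recursively, then merge the two sorted halves back together.
public class MergeSortDemo {
    static int[] mergeSort(int[] a) {
        if (a.length <= 1) return a;              // a single element is already sorted
        int mid = a.length / 2;
        int[] left  = mergeSort(Arrays.copyOfRange(a, 0, mid));
        int[] right = mergeSort(Arrays.copyOfRange(a, mid, a.length));
        return merge(left, right);
    }

    static int[] merge(int[] left, int[] right) {
        int[] out = new int[left.length + right.length];
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length)   // take the smaller head each time
            out[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
        while (i < left.length)  out[k++] = left[i++];  // drain whichever side remains
        while (j < right.length) out[k++] = right[j++];
        return out;
    }

    public static void main(String[] args) {
        int[] sorted = mergeSort(new int[]{9, 5, 7, 3, 2, 1, 8});
        System.out.println(Arrays.toString(sorted)); // [1, 2, 3, 5, 7, 8, 9]
    }
}
```

The `merge` step is where the actual comparisons happen: because both inputs are already sorted, each element is examined only once, which is why Merge Sort scales so much better than Selection Sort.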

Find an implementation for Merge Sort here.

When I first began writing this blog post, I originally was going to write about how today's computing trend is towards splitting the load of programs into smaller parts, in both programming and hardware.  But I didn't think I could create a quality, informative post on that topic at the moment, so I'll leave a little thought here instead.  When you look at Merge Sort, you immediately see that it recursively separates the list into two smaller lists.  So then I thought: these two lists are completely independent of one another, and neither relies on the data of the other to be sorted properly.  Could we use modern computer architecture to our advantage to speed up this process?

I found the answer in this blog post by blogger user  with his implementation included.

He proposes a two-threaded Merge Sort, and we can look at his tests...

Performance

Concurrent Merge Sort does not improve performance in a single processor platform. In a multi-processor platform, it has a good speed up. The following is the result of sorting 20 million integers in my single processor computer (Pentium M 1.73Ghz).
  • (Part 1)JavaSort n=20000000 elapse=7.341s
  • (Part 3)ShellSort n=20000000 elapse=12.438s
  • (Part 4)SimpleMergeSort n=20000000 elapse=31.966s
  • (Part 5)HelperMergeSort n=20000000 elapse=13.149s
  • (Part 6)ConcurrentMergeSort n=20000000 elapse=13.299s
Since the Insertion Sort introduced in Part 2 may takes hours or even days to sort 20 million integers, it is omitted in the test.
From the above test, concurrent merge sort is not faster than the sequential merge sort introduced in Part 5.

Let's start another test on a multiprocessor environment. The following is the result of running the test in my macbook air 2010, which is equipped with 1.86GHz Intel Core 2 Duo processor.
  • (Part 1)JavaSort n=20000000 elapse=4.593s
  • (Part 3)ShellSort n=20000000 elapse=7.925s
  • (Part 4)SimpleMergeSort n=20000000 elapse=14.329s
  • (Part 5)HelperMergeSort n=20000000 elapse=6.515s
  • (Part 6)ConcurrentMergeSort n=20000000 elapse=4.323s

This time concurrent merge sort is running significantly faster than helper merge sort(4.323s vs 6.515s). Of course we cannot expect a double speed up because the final merge() step is still sequential. For such a simple modification, it is worth the effort for this speed up.
This example clearly shows the power of utilizing a divide and conquer approach to increase performance.
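To illustrate the idea behind those numbers (this is my own sketch, not the quoted author's code): sort each half on its own thread, then merge sequentially.  For simplicity I use the built-in `Arrays.sort` for each half; the point is the independence of the two halves, not the per-half algorithm.

```java
import java.util.Arrays;

// Sketch of a two-threaded Merge Sort: the two halves are sorted
// concurrently, and only the final merge remains sequential.
public class TwoThreadMergeSortDemo {
    static int[] sortWithTwoThreads(int[] a) throws InterruptedException {
        int mid = a.length / 2;
        int[] left  = Arrays.copyOfRange(a, 0, mid);
        int[] right = Arrays.copyOfRange(a, mid, a.length);

        Thread t1 = new Thread(() -> Arrays.sort(left));   // halves are independent,
        Thread t2 = new Thread(() -> Arrays.sort(right));  // so they can run in parallel
        t1.start(); t2.start();
        t1.join(); t2.join();                              // wait for both to finish

        // Final merge is still sequential, which caps the possible speed-up.
        int[] out = new int[a.length];
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length)
            out[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
        while (i < left.length)  out[k++] = left[i++];
        while (j < right.length) out[k++] = right[j++];
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] sorted = sortWithTwoThreads(new int[]{9, 5, 7, 3, 2, 1, 8});
        System.out.println(Arrays.toString(sorted)); // [1, 2, 3, 5, 7, 8, 9]
    }
}
```

As the quoted benchmarks show, this only pays off on a machine with more than one core; on a single processor the two threads just take turns.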

Searching using Binary Search


The most common everyday example of a binary search is looking up a name in a phone book.  You open it somewhere in the middle, check whether your target comes before or after that page, flip halfway through the remaining half, and check again, repeating until you have found what you wanted.  In computer science, this method of searching is called Binary Search.  Keep in mind that the list we are searching is assumed to be sorted; given that, this method drastically cuts down the time spent checking each index of a list.  The first "checkpoint" immediately eliminates half of the indices.  Each later check saves less in absolute terms, but you can see how much time is saved compared to a traditional sequential search, i.e. starting from the first index and working up.
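The phone-book procedure translates directly into code.  Here is a minimal iterative sketch (again, an illustration rather than the implementation linked below):

```java
// Minimal iterative Binary Search sketch over a sorted array.
// Returns the index of target, or -1 if it is not present.
public class BinarySearchDemo {
    static int binarySearch(int[] sorted, int target) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;           // "open the book in the middle"
            if (sorted[mid] == target) return mid;
            if (sorted[mid] < target) lo = mid + 1; // target is in the upper half
            else hi = mid - 1;                      // target is in the lower half
        }
        return -1;                                  // not found
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 5, 7, 8, 9};
        System.out.println(binarySearch(a, 7)); // 4
        System.out.println(binarySearch(a, 4)); // -1
    }
}
```

Each pass through the loop discards half of the remaining indices, which is exactly why this beats checking every element in turn.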

Find an implementation for Binary Search here.

Summary

Here are some things to take away from this post:

  1. Searching and sorting algorithms play a major role in program performance in real world applications
  2. Divide and conquer algorithms can increase performance immensely when dealing with huge data sets
  3. Coding with the machine in mind can lead to drastic performance increases, if you are able to utilize the hardware correctly.
As my final remark, this post is meant to encourage programmers to keep performance in mind when coding for real-world applications.  Finding ways to optimize your code is an essential skill that will undoubtedly come in handy when you land your first programming job.

Lastly, I'm not too familiar with how blogs are to be shared, so if Ivan does not like my direct quotation of his work, I will gladly take it down from this post.  


