Getting Started with OpenMP

Using OpenMP is one of the easiest ways to parallelize serial code with the least effort. OpenMP stands for Open Multi-Processing. It is not a new language: it is an API of compiler directives and runtime library routines for C/C++ (and Fortran) that ships with recent versions of gcc (enable it with the -fopenmp flag).

Hello world!

As soon as #pragma omp parallel is encountered, the master thread creates a team of threads, and every thread executes the code inside the braces that follow the directive. Here, omp_get_thread_num() returns the ID of the calling thread and omp_get_num_threads() returns the number of threads in the team.

#include <omp.h>
#include <stdio.h>
int main(int argc, char const *argv[])
{
	#pragma omp parallel
	{
		printf("Hello from thread: %d\n", omp_get_thread_num());
		if (omp_get_thread_num()==0)
		{
			printf("Number of threads: %d\n", omp_get_num_threads());
		}
	}
	return 0;
}

Output:

Hello from thread: 2
Hello from thread: 1
Hello from thread: 0
Number of threads: 4
Hello from thread: 3

For Loop

#pragma omp parallel for
	for (int i = 0; i < 8; i++)
	{
		printf("i: %d is handled by thread: %d\n", i, omp_get_thread_num());
	}

Output:

i: 6 is handled by thread: 3
i: 7 is handled by thread: 3
i: 0 is handled by thread: 0
i: 1 is handled by thread: 0
i: 2 is handled by thread: 1
i: 3 is handled by thread: 1
i: 4 is handled by thread: 2
i: 5 is handled by thread: 2

Shared vs Private Variables:

Shared variables are common to all the threads. For private variables, each thread gets its own uninitialized copy: the copies are created when the threads are created and destroyed as soon as the threads are destroyed (merged back).

int shared_x=100, shared_y=300, private_x=200, private_y=400;
	#pragma omp parallel private(private_x, private_y) shared(shared_x, shared_y)
	{
		private_x=omp_get_thread_num();
		printf("Thread:%d	shared_x: %d	private_x: %d\n", omp_get_thread_num(), shared_x, private_x);
		shared_x=omp_get_thread_num();
	}
	printf("\nOut of parallel world <> shared_x: %d	private_x:%d\n", shared_x, private_x);

Output:

Thread:2	shared_x: 100	private_x: 2
Thread:3	shared_x: 100	private_x: 3
Thread:1	shared_x: 100	private_x: 1
Thread:0	shared_x: 100	private_x: 0
Out of parallel world <> shared_x: 0	private_x:200

The Confusing Part

The number of threads stays fixed: it does not matter that a work-sharing for loop is called inside the parallel region. As soon as #pragma omp for is encountered in the parallel region, the iterations are divided among the four existing threads, as can be seen from the code below.

#pragma omp parallel private(i,tid)
  {
  tid = omp_get_thread_num();
  if (tid == 0)
    {
    nthreads = omp_get_num_threads();
    printf("Number of threads = %d\n", nthreads);
    }

  printf("Thread %d starting...\n",tid);

  #pragma omp for
  for (i=0; i<10; i++)
    {
    printf("Thread id:%d Thread: %d\n",tid,omp_get_thread_num());
    }
  }

Output:

Thread 1 starting...
Thread id:1 Thread: 1
Thread id:1 Thread: 1
Thread id:1 Thread: 1
Thread 2 starting...
Thread id:2 Thread: 2
Thread id:2 Thread: 2
Thread id:2 Thread: 2
Thread 3 starting...
Thread id:3 Thread: 3
Number of threads = 4
Thread 0 starting...
Thread id:0 Thread: 0
Thread id:0 Thread: 0
Thread id:0 Thread: 0

or

The above code is equivalent to calling the combined parallel for directive directly from the serial code. Note that tid is shared in this version, so threads can overwrite each other's value; that is why tid does not always match omp_get_thread_num() in the output.

#pragma omp parallel for
  for (i=0; i<10; i++)
    {
    tid = omp_get_thread_num();
    if (tid == 0)
      {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
      }
    printf("Thread id:%d Thread: %d\n",tid,omp_get_thread_num());
    }

Output:

Number of threads = 4
Thread id:2 Thread: 0
Number of threads = 4
Thread id:0 Thread: 0
Number of threads = 4
Thread id:0 Thread: 0
Thread id:2 Thread: 2
Thread id:2 Thread: 2
Thread id:1 Thread: 1
Thread id:1 Thread: 1
Thread id:1 Thread: 1
Thread id:3 Thread: 3
Thread id:2 Thread: 2

Ordered for loop:

The iterations are still divided among the threads and start executing as resources allow, but the block marked with the ordered directive is executed in the loop's sequential order, one iteration at a time.

#pragma omp parallel for private(private_x) ordered
	for(int i=1; i<=5; i++){
		private_x = i*i;
		#pragma omp ordered
		{
			printf("%d %d\n", i, private_x);
		}
	}

Output:

1 1
2 4
3 9
4 16
5 25

Reduction Operations

All the threads cooperate on the same variable sum for the task of summation. Each thread actually works on its own copy of sum, initialized to the operator's identity value (0 for +); when the loop ends, the partial results are combined into the shared sum. Hence, sum behaves neither as a plain shared nor as a plain private variable, and the reduction variable should be used only for the reduction itself, not for any other task in the parallel region.

int sum = 0;
#pragma omp parallel for reduction(+:sum)
   for(int i=1; i<5; i++){
      sum = sum + i;
   }
printf("%d\n", sum);

Output:

10

C/C++ Reduction Operators

Operator   Initial value
+          0
*          1
-          0
&          ~0
|          0
^          0
&&         1
||         0