
efficient summation


Pfor (Technical User)
Aug 11, 2010
Hallo!
I am programming in Fortran 90 (gfortran as the compiler).
I have to compute a lot of sums like this:

Code:
do ix = 1, 100
  do iy = 1, 100
    do n = 1, 10000
      do k = 1, 10000
        d(ix,iy) = d(ix,iy) + coeff(n,k) * f(n,ix) * g(k,iy)
      enddo
    enddo
  enddo
enddo

I am trying to make this procedure more efficient in terms of run time.

I was thinking of building a new kind of array, like:

fg(n,k,ix,iy) = f(n,ix) * g(k,iy)

so as to avoid the explicit loops over the ix and iy coordinates. However, the "fg" array would become far too big (with the dimensions above it would hold 10000 x 10000 x 100 x 100 = 10^12 elements, i.e. about 4 TB in single precision), and I am not even sure this
would speed up the calculation.

Do you have any suggestions?
thank you a lot,
cheers
Paolo
 
Well, the way you describe it, fg(n,k,ix,iy) has to be filled in as well, so I think you spend the same total time overall; you only change the moment at which the products are computed.

It becomes an interesting question if you read the input from a file: in that case you could fill fg(n,k,ix,iy) while reading it. I have always wondered whether
Code:
s = SUM(a(:) + b(:))
is faster than the F77 way
Code:
s = 0.
DO i = 1, n
   s = s + a(i) + b(i)
END DO
I don't know if your question comes down to that, but I imagine it is compiler dependent.
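
If you want to check that for a given compiler, a small test program is enough. Here is a minimal sketch (the program and variable names are mine, not from the thread) that times both forms with the CPU_TIME intrinsic:
Code:
program bench_sum
  implicit none
  integer, parameter :: n = 20000000
  real, allocatable :: a(:), b(:)
  real :: s1, s2, t0, t1, t2
  integer :: i
  allocate(a(n), b(n))
  call random_number(a)
  call random_number(b)
  call cpu_time(t0)
  s1 = sum(a + b)        ! array expression (the compiler may build a temporary for a+b)
  call cpu_time(t1)
  s2 = 0.
  do i = 1, n            ! explicit loop, no temporary
    s2 = s2 + a(i) + b(i)
  end do
  call cpu_time(t2)
  write(*,*) 'array syntax :', s1, '  cpu time:', t1 - t0
  write(*,*) 'explicit loop:', s2, '  cpu time:', t2 - t1
end program

The two sums can differ in the last digits because the accumulation order differs; the timings are what matter here.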
 
If you want a quick improvement, simply transpose the array coeff. Because Fortran stores arrays in column-major order, the innermost loop over k then runs along the first index of the transposed copy with stride 1 instead of jumping nmax elements at each step, so it makes much better use of the cache.

I have reduced the maximum dimensions, but the result carries over to larger arrays.

Initial code:

Code:
program test
  implicit none
  integer,parameter :: nmax=1000,kmax=1000,imax=100,jmax=100
  real :: coeff(nmax,kmax),f(nmax,imax),g(kmax,jmax),d(imax,jmax)
  integer :: i,j,k,n
  call random_number(coeff)
  call random_number(f)
  call random_number(g)
  d=0
  do i=1,imax
    do j =1,jmax
      do n=1,nmax
        do k=1,kmax
          d(i,j) = d(i,j) + coeff(n,k) * f(n,i) *g(k,j)
        enddo
      enddo
    enddo
  enddo
  write(*,*) sum(d)
end program

Result:

Code:
[lcoul@localhost test]$ ifort -O3 p.f90
[lcoul@localhost test]$ time a.out
  1.2531997E+09
65.03user 12.45system 1:17.84elapsed 99%CPU (0avgtext+0avgdata 21856maxresident)k

Modified version:

Code:
program test
  implicit none
  integer,parameter :: nmax=1000,kmax=1000,imax=100,jmax=100
  real :: coeff(nmax,kmax),f(nmax,imax),g(kmax,jmax),d(imax,jmax)
  real :: c(kmax,nmax),z
  integer :: i,j,k,n
  call random_number(coeff)
  call random_number(f)
  call random_number(g)
  do n=1,nmax
    do k=1,nmax   ! should be k=1,kmax; harmless here only because nmax=kmax (see the note below)
      c(k,n)=coeff(n,k)
    enddo
  enddo
  d=0
  do i=1,imax
    do j =1,jmax
      do n=1,nmax
        do k=1,kmax
          d(i,j) = d(i,j) + c(k,n)*f(n,i) *g(k,j)
        enddo
      enddo
    enddo
  enddo
  write(*,*) sum(d)
end program

Result:

Code:
[lcoul@localhost test]$ ifort -O3 pnew.f90
[lcoul@localhost test]$ time a.out
  1.2531997E+09
3.54user 0.01system 0:03.57elapsed 99%CPU (0avgtext+0avgdata 37504maxresident)k



François Jacq
 
With gfortran, I get a smaller improvement:

Code:
[lcoul@localhost test]$ cp p.f90 pnew.f90
[lcoul@localhost test]$ gfortran -O3 p.f90
[lcoul@localhost test]$ time a.out
  1.24814490E+09
64.05user 16.01system 1:21.19elapsed 98%CPU

[lcoul@localhost test]$ gfortran -O3 pnew.f90
[lcoul@localhost test]$ time a.out
  1.24814490E+09
26.86user 0.02system 0:27.10elapsed 99%CPU

Only a factor of 2 to 3, instead of almost 20 with ifort.

A factor of 3 is what I expected before running the test. I don't know how the Intel compiler manages to optimize so efficiently!

I have found a mistake in pnew.f90, but it has no consequence because nmax=kmax (a wrong loop upper bound when transposing coeff, flagged in the listing above). Perhaps another mistake is responsible for this surprisingly good result...




François Jacq
 
Now it's time to improve the algorithm. It is quite easy to eliminate one loop level, and it is also possible to improve the memory management further.

The idea is to compute the product coeff*g first:

Code:
program test
  implicit none
  integer,parameter :: nmax=1000,kmax=1000,imax=100,jmax=100
  real :: coeff(nmax,kmax),f(nmax,imax),g(kmax,jmax),d(imax,jmax)
  real :: cnew(kmax,nmax),fnew(imax,nmax),z
  integer :: i,j,k,n
  call random_number(coeff)
  call random_number(f)
  call random_number(g)
  DO i=1,imax ! transposing f
    fnew(i,:)=f(:,i)
  ENDDO
  DO k=1,kmax ! transposing coeff
    cnew(k,:)=coeff(:,k)
  ENDDO
  d=0
  do n=1,nmax
    do j=1,jmax
      z=0
      do k=1,kmax
        z=z + cnew(k,n)*g(k,j)
      enddo
      do i=1,imax ! the loop over i is moved here
        d(i,j)=d(i,j)+z*fnew(i,n)
      enddo
    enddo
  enddo
  write(*,*) sum(d)
end program

Result with ifort:

Code:
[lcoul@localhost test]$ ifort -O3 pnew.f90
[lcoul@localhost test]$ time a.out
  1.2532465E+09
0.10user 0.01system 0:00.11elapsed 96%CPU

Result with gfortran:

Code:
[lcoul@localhost test]$ gfortran -O3 pnew.f90
[lcoul@localhost test]$ time a.out
  1.24814490E+09
0.25user 0.01system 0:00.26elapsed 98%CPU

In the end, the CPU time has been reduced by a factor of about 300 with gfortran:
- a factor of about 100 comes from the better algorithm, which removes the loop over i from the innermost summation;
- a factor of about 3 comes from better memory management (stride-1 array accesses).
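
As an aside, the double sum is mathematically just two matrix products, d = transpose(f) x (coeff x g), so with the same declarations as in the programs above the whole computation could also be written with the MATMUL intrinsic (a one-line sketch; whether it beats the hand-written loops depends on the compiler and on whether MATMUL maps to an optimized BLAS):
Code:
! sketch, assuming the declarations of the programs above
! d(i,j) = sum over n and k of f(n,i)*coeff(n,k)*g(k,j)
d = matmul(transpose(f), matmul(coeff, g))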

François Jacq
 
Hallo François,
Actually, I didn't check the forum over the weekend (partly because I was not very optimistic about real improvements).

I have to admit you opened my mind with your suggestions.

I had read that transposing the arrays could increase the speed of the calculation; however, I thought (based on some tests of mine, probably not very accurate ones) that this applied only when using the "sum" intrinsic function. I had no idea it could also work with summations of this kind.

However, the trick of "avoiding one loop" is much more impressive.

It is very smart and simple, and I would never have guessed it by myself.
thank you very very much!
Paolo
P.S.
I am now implementing it in my code, which is a little more complicated than the example I posted!





 
Shouldn't the compiler take care of this?
I'd rather program in a human way, not thinking about swapping indices.
Anyway, worth a star :)
 
It is not really difficult to think about index ordering: the innermost loop index should, if possible, run over the first (leftmost) index of an array.

Code:
DO j=...
  DO k=...
    DO i=1,n
      a1=a1+b(i,j)*c(i,k) ! good
      a2=a2+b(i,j)*c(k,i) ! less good because of the matrix c. Transposing
                          ! c (or redefining it) may be a good idea
      a3=a3+b(j,i)*c(k,i) ! bad: it's time to think about reordering the
                          ! loops or transposing both matrices
    ENDDO
  ENDDO
ENDDO

But the best idea, when there are many nested loops, is always to look for a better algorithm that eliminates one (or more) loop levels.

Improving the algorithm is always the best way to reduce CPU time consumption, well before trying to take advantage of a particular computer architecture (like, here, the presence of a cache memory).

François Jacq
 
How about intrinsic constructions like SUM(a(:)+b(:))?

It depends on whether the expression needs a temporary array or not. For instance, if the whole statement is:
Code:
s=SUM(a(:)+b(:))
then I bet that a compiler could simply implement it with a loop:

Code:
s=0
DO i=1,SIZE(a)
  s=s+a(i)+b(i)
ENDDO

With more complex statements, vector or matrix operations may be less efficient than explicit loops... except when the compiler is able to call an optimized BLAS library. I even suspect that the Intel compiler is able to replace some loops with calls to BLAS routines.
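
To make the BLAS remark concrete, here is a hedged sketch (not from the original post; it assumes the declarations of the earlier programs and a BLAS library linked in, e.g. with -lblas) that computes the intermediate product coeff*g with the standard SGEMM routine:
Code:
real :: t(nmax,jmax)   ! t = coeff * g  (coeff is nmax x kmax, g is kmax x jmax)
call sgemm('N', 'N', nmax, jmax, kmax, 1.0, coeff, nmax, g, kmax, 0.0, t, nmax)

The final result would then follow as d = matmul(transpose(f), t), or as a second SGEMM call with its first argument set to 'T'.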

But clear, concise programming is generally much more important than CPU time.

The only case where CPU time becomes important is when... the code takes too long to run! Obvious? Not so much, because this must be proven by a measurement (the Unix "time" command, a profiler such as gprof, the CPU_TIME intrinsic, ...).
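
For example, a minimal measurement with the CPU_TIME intrinsic looks like this (a sketch, not from the original post):
Code:
real :: t0, t1
call cpu_time(t0)
! ... the section of code to be measured ...
call cpu_time(t1)
write(*,*) 'CPU time spent (s):', t1 - t0

With gfortran, compiling with -pg and then running gprof on the gmon.out file produced by the run gives a routine-by-routine breakdown.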

In that case, one has to locate the bottlenecks and find out how to remove them. The rules are simple:
- do not try to optimize a part of the code without having measured the real cost of that part;
- start the optimization work with the most expensive part;
- look for better algorithms before resorting to programming tricks.



François Jacq
 