
Anything faster than memcpy 3


hunghsun

Programmer
Aug 25, 2002
I was wondering if there is a faster method than memcpy to copy data from one place to the other. Anyone?
 
You can be reasonably sure that memcpy is pretty optimal for your system already, so it will be hard to beat in most circumstances.

Have you considered not copying the data at all, and just using pointers to point at the data?
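
For instance, a rough sketch of the idea (the names here are invented for illustration, not taken from your code): instead of copying into a work buffer first, just hand the consumer a pointer to the original data.
Code:
#include <stdio.h>
#include <string.h>

/* hypothetical consumer that only needs to read the data */
static void process ( const char *data, size_t len ) {
  printf( "processing %lu bytes starting with '%c'\n",
          (unsigned long)len, data[0] );
}

int main ( void ) {
  char data[64] = "some source data";
  char work_buf[64];

  /* with a copy: an extra pass over the data */
  memcpy( work_buf, data, sizeof data );
  process( work_buf, sizeof data );

  /* without a copy: just pass a pointer to the original */
  process( data, sizeof data );
  return 0;
}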

The real question is why you're copying so much data, and why you think improving just the memcpy will fix your performance problems.

Have you profiled your completed code with some representative real-world data yet?
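
If not, here's a minimal sketch of one way to time just the copy in isolation, using the standard clock() function (the buffer size and repeat count are arbitrary):
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main ( void ) {
  size_t  len = 1024 * 1024;          /* arbitrary test size */
  char    *src = malloc( len );
  char    *dst = malloc( len );
  clock_t start, end;
  int     i;

  if ( !src || !dst ) return 1;
  memset( src, 'x', len );

  start = clock();
  for ( i = 0 ; i < 1000 ; i++ )      /* repeat so the time is measurable */
    memcpy( dst, src, len );
  end = clock();

  printf( "%f seconds for 1000 copies of %lu bytes\n",
          (double)(end - start) / CLOCKS_PER_SEC, (unsigned long)len );

  free( src );
  free( dst );
  return 0;
}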

--
 
I wish I could do away with copying. There is a requirement in the current setup that forces me to do the data copying, so I pretty much just need to find a way to make it more efficient.
 
grouping data might be a good idea...
instead of calling 10 memmoves, just call one big one.
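
To make that concrete, a rough sketch (the struct layout here is made up purely for illustration): if the pieces you send happen to sit next to each other in memory, one call can cover them all.
Code:
#include <string.h>

struct packet {
    long header[4];
    long payload[1024];
    long trailer[4];
};

/* one call covering the whole contiguous struct ... */
void copy_packet ( char *buffer, const struct packet *p ) {
    memmove( buffer, p, sizeof *p );

    /* ... instead of three separate calls like:
     *   memmove( buffer, p->header, sizeof p->header );
     *   memmove( buffer + sizeof p->header, p->payload, sizeof p->payload );
     *   memmove( buffer + sizeof p->header + sizeof p->payload,
     *            p->trailer, sizeof p->trailer );
     */
}

int main ( void ) {
    static struct packet p;
    static char buffer[sizeof p];
    copy_packet( buffer, &p );
    return 0;
}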
 
I meant: one that is going to move a larger chunk of contiguous data.

of course...
:D
 
So you know that memory copying is the biggest cause of work in the system, or is it just some "gut feel" you have?


--
 
Yeah, Salem is right.
Are you sure that it is the memory copy that is actually slowing down your application?
I've been dealing with huge amounts of memory handling (a couple of hundred megabytes) in my recent roles and it was always quite fast...
 
OK, they are old; I found these by googling:
Code:
#ifndef lint
static	char sccsid[] = "@(#)memcpy.c 1.1 90/03/23"; /* from S5R2 1.1 */
#endif

/*LINTLIBRARY*/
/*
 * Copy s2 to s1, always copy n bytes.
 * Return s1
 */
char *
memcpy(s1, s2, n)
register char *s1, *s2;
register int n;
{
	register char *os1 = s1;
	while (--n >= 0)
		*s1++ = *s2++;
	return (os1);
}
Code:
#if !defined(lint) && defined(SCCSIDS)
static	char sccsid[] = "@(#)strcpy.c 1.1 90/03/23"; /* from UCB 4.1 82/10/05 */
#endif

/*
 * Copy string s2 to s1.  s1 must be large enough.
 * return s1
 */

char *
strcpy(s1, s2)
	register char *s1, *s2;
{
	register char *os1;
	os1 = s1;
	while (*s1++ = *s2++)
		;
	return (os1);
}

Is memcpy better than strcpy??

Perhaps they have changed since then? I don't believe so.
 
First of all, thanks for all the replies.

I guess more information is in order. I am dealing with a high-speed interconnect that requires the data to be copied from a source location to a buffer location before it can be sent over the network (think of it as MPI, where you copy the source data into the MPI buffer and then transfer it over the network to another computer). I did a time profile based on the size of the data, and for anything beyond 64K, memcpy(buffer, source, size) simply dominates this process. I think beyond 64K it accounts for more than 50% of the execution time. Because I have to do this copying (there is no way around it due to a hardware limitation), I am trying to find out if I can use something else instead.
 
For example, the memcpy() implementation in MS Visual C++ 6.0 is not optimal (obviously): it's a simple byte-by-byte loop incrementing two pointers.

The problem being discussed is a platform-dependent one. It's not really correct to talk about memcpy as such without naming specific target platforms and compilers.

Nonetheless, sometimes (on very fast pipelined processors, for example) it's useful to implement the copy loop in terms of the target processor's word (or even double-word) elements, with the loop counter calculated up front and the pointers cast appropriately.

On old Intel processors the string-move instructions (via asm, of course) were the most effective implementation (and the old RTL used them)...
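
To illustrate the word-sized loop described above, a minimal sketch (it assumes both pointers are already suitably aligned and that the length is an exact multiple of the word size):
Code:
#include <stdio.h>
#include <stddef.h>

void *word_copy ( void *dst, const void *src, size_t len ) {
  long        *to   = dst;                 /* cast to the processor's word type */
  const long  *from = src;
  size_t       n    = len / sizeof(long);  /* loop counter in words, calculated up front */

  while ( n-- > 0 )
    *to++ = *from++;

  return dst;
}

int main ( void ) {
  long src[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
  long dst[8];
  word_copy( dst, src, sizeof src );
  printf( "%ld %ld\n", dst[0], dst[7] );
  return 0;
}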
 
Is the amount of data you copy always a multiple of say 4 bytes in length?

Which machine(s) are you running this on - Intel Pentium or compatibles for example?

How do you allocate buffers for this memory - do you use static arrays, or do you malloc() them?

--
 
Salem,
quick question: you're mentioning static arrays and malloc. Surely it does not make any difference in terms of performance to do
Code:
char foo[0xFF];
/* rather than */
char *foo;
foo = malloc(0xFF * sizeof(char));

or does it?
 
hunghsun said:
I guess more information is in order. I am dealing with a high-speed interconnect that requires the data to be copied from a source location to a buffer location before it can be sent over the network (think of it as MPI, where you copy the source data into the MPI buffer and then transfer it over the network to another computer). I did a time profile based on the size of the data, and for anything beyond 64K, memcpy(buffer, source, size) simply dominates this process. I think beyond 64K it accounts for more than 50% of the execution time. Because I have to do this copying (there is no way around it due to a hardware limitation), I am trying to find out if I can use something else instead.
This article may be of interest: Optimizing Memcpy improves speed.
 
> Surely it does not make any difference in terms of performance to do
Well it can - because of alignment.

You don't know what the alignment of a char array will be; at worst it could be aligned on a char boundary.
Memory you get from malloc is always aligned for the most restrictive data type, usually 'long' or 'double'.

For some processors, this doesn't really make any difference, but for others the differences can be huge.

If your memcpy has optimisations for aligned memory copying, it means it can copy 4 bytes at a time instead of 1 (which sounds like a good idea).
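
If you want to check the alignment for yourself, here's a quick throwaway test (just a sketch; on most platforms converting the pointer to unsigned long is good enough for this purpose):
Code:
#include <stdio.h>
#include <stdlib.h>

int is_long_aligned ( const void *p ) {
  /* a pointer is suitably aligned if its address is a multiple of sizeof(long) */
  return ( (unsigned long)p % sizeof(long) ) == 0;
}

int main ( void ) {
  char  stack_buf[16];
  char  *heap_buf = malloc( 16 );

  if ( !heap_buf ) return 1;
  printf( "stack buffer aligned: %d\n", is_long_aligned( stack_buf ) );
  printf( "malloc buffer aligned: %d\n", is_long_aligned( heap_buf ) );

  free( heap_buf );
  return 0;
}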

LOL - good job I checked :)
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifdef WIN32
#include <windows.h>
typedef __int64 precise_clock_t;
precise_clock_t ReadClock ( void ) {
  LARGE_INTEGER result;
  QueryPerformanceCounter( &result );
  return result.QuadPart;
}
#endif

#ifdef __GNUC__
#define RDTSC(llptr) {          \
    __asm__ __volatile__ (      \
    "rdtsc"                     \
    : "=A" (llptr)              \
    ); }
typedef unsigned long long precise_clock_t;
precise_clock_t ReadClock ( void ) {
  precise_clock_t result;
  RDTSC( result );
  return result;
}
#endif

/*
 * from http://www.lysator.liu.se/c/duffs-device.html
 */
void *fast_memcpy1 ( void *dst, const void *src, size_t len ) {
  void        *result = dst;
  long        *to = dst;
  const long  *from = src;
  long        n;

  /* copying longs, not bytes */
  /* use len >>= 2 if your compiler can't spot the optimisation */
  len /= sizeof(long);

  n = ( len + 7 ) / 8;
  switch ( len % 8 )
  {
    case 0: do { *to++ = *from++;
    case 7:*to++ = *from++;
    case 6:*to++ = *from++;
    case 5:*to++ = *from++;
    case 4:*to++ = *from++;
    case 3:*to++ = *from++;
    case 2:*to++ = *from++;
    case 1:*to++ = *from++;
    } while ( --n > 0 );
  }
  return dst;
}

/* This works well if the processor has register+offset addressing modes. */
/* This avoids the overhead of incrementing pointers for each long copied. */
/* However, such addressing modes tend to be slower, so overall there may be */
/* little difference */
void *fast_memcpy2 ( void *dst, const void *src, size_t len ) {
  void        *result = dst;
  long        *to = dst;
  const long  *from = src;
  long        n;

  /* copying longs, not bytes */
  /* use len >>= 2 if your compiler can't spot the optimisation */
  len /= sizeof(long);

  n = len / 8;
  while ( n-- > 0 ) {
    to[0] = from[0];
    to[1] = from[1];
    to[2] = from[2];
    to[3] = from[3];
    to[4] = from[4];
    to[5] = from[5];
    to[6] = from[6];
    to[7] = from[7];
    to += 8;
    from += 8;
  }
  n = len % 8;
  while ( n-- > 0 ) {
    *to++ = *from++;
  }
  return dst;
}

/* copy one byte at a time */
void *simple_memcpy1 ( void *dst, const void *src, size_t len ) {
  void        *result = dst;
  char        *to = dst;
  const char  *from = src;
  while ( len-- ) *to++ = *from++;
  return dst;
}
/* copy one long at a time */
void *simple_memcpy2 ( void *dst, const void *src, size_t len ) {
  void        *result = dst;
  long        *to = dst;
  const long  *from = src;
  len /= sizeof(long);   /* copying longs, not bytes */
  while ( len-- ) *to++ = *from++;
  return dst;
}

/* TEST CODE */
/* --------- */

#define TEST_LEN  102400
#define LOW_COPY  10000
#define HIGH_COPY 80000

long src[TEST_LEN];
long dst[TEST_LEN];
/* fill up some memory with some test data */
void prepare ( void ) {
  int i;
  for ( i = 0 ; i < TEST_LEN ; i++ ) src[i] = ~i;
  for ( i = 0 ; i < TEST_LEN ; i++ ) dst[i] = 0;
}
/* make sure it was copied to just the bits we're interested in */
int check ( void ) {
  int valid = 1;
  int i;
  for ( i = 0         ; i < LOW_COPY  && valid ; i++ ) valid = dst[i] == 0;
  for ( i = LOW_COPY  ; i < HIGH_COPY && valid ; i++ ) valid = dst[i] == ~i;
  for ( i = HIGH_COPY ; i < TEST_LEN  && valid ; i++ ) valid = dst[i] == 0;
  return valid;
}

typedef void *(*cpyfunc)(void*,const void*,size_t);
struct {
    cpyfunc fn;
    char    *name;
} funcs[] = {
#define MAKEFN(x)   { x, #x }
    MAKEFN(memcpy),
    MAKEFN(fast_memcpy1),
    MAKEFN(fast_memcpy2),
    MAKEFN(simple_memcpy1),
    MAKEFN(simple_memcpy2),
};

/* call a function with the test data, and time it */
void do_tests ( cpyfunc fn, char *name, void *dst, const void *src, size_t len ) {
  precise_clock_t   t1, t2;
  unsigned long     timediff;
  int               i;

  prepare();
  t1 = ReadClock();
  fn( dst, src, len );
  t2 = ReadClock();
  timediff = t2 - t1;
  printf( "Check %-15s=%d, time=%lu\n", name, check(), timediff );
}

int main ( int argc, char *argv[] ) {
  int       i, testnum = 0;
  char      *from = (char*)&src[LOW_COPY];
  char      *to   = (char*)&dst[LOW_COPY];
  size_t    len   = (HIGH_COPY-LOW_COPY)*sizeof(long);

  if ( argc >= 2 ) testnum = atoi(argv[1]);
  printf( "Addresses are %p and %p\n", from, to );
  for ( i = 0 ; i < 5 ; i++ ) {
    do_tests( funcs[testnum].fn, funcs[testnum].name, to, from, len );
  }
  printf("\n");
  return 0;
}

My results
Unoptimised code
Addresses are 0x811b720 and 0x8053720
Check memcpy =1, time=1908172
Check memcpy =1, time=2032056
Check memcpy =1, time=2022724
Check memcpy =1, time=2035848
Check memcpy =1, time=2066232

Addresses are 0x811b720 and 0x8053720
Check fast_memcpy1 =1, time=1786356
Check fast_memcpy1 =1, time=1978712
Check fast_memcpy1 =1, time=1972628
Check fast_memcpy1 =1, time=1980148
Check fast_memcpy1 =1, time=1981700

Addresses are 0x811b720 and 0x8053720
Check fast_memcpy2 =1, time=1798976
Check fast_memcpy2 =1, time=1932904
Check fast_memcpy2 =1, time=1965488
Check fast_memcpy2 =1, time=1966316
Check fast_memcpy2 =1, time=1961060

Addresses are 0x811b720 and 0x8053720
Check simple_memcpy1 =1, time=3008256
Check simple_memcpy1 =1, time=3011732
Check simple_memcpy1 =1, time=2973612
Check simple_memcpy1 =1, time=3018960
Check simple_memcpy1 =1, time=3012460

Addresses are 0x811b720 and 0x8053720
Check simple_memcpy2 =1, time=1830516
Check simple_memcpy2 =1, time=1997360
Check simple_memcpy2 =1, time=2003104
Check simple_memcpy2 =1, time=1942524
Check simple_memcpy2 =1, time=1952060

Optimised code
Addresses are 0x811b580 and 0x8053580
Check memcpy =1, time=1945176
Check memcpy =1, time=2076424
Check memcpy =1, time=2041604
Check memcpy =1, time=2060140
Check memcpy =1, time=2080792

Addresses are 0x811b580 and 0x8053580
Check fast_memcpy1 =1, time=1853304
Check fast_memcpy1 =1, time=1917172
Check fast_memcpy1 =1, time=2004804
Check fast_memcpy1 =1, time=2009464
Check fast_memcpy1 =1, time=2008936

Addresses are 0x811b580 and 0x8053580
Check fast_memcpy2 =1, time=1900900
Check fast_memcpy2 =1, time=1997572
Check fast_memcpy2 =1, time=1997756
Check fast_memcpy2 =1, time=2057424
Check fast_memcpy2 =1, time=2058308

Addresses are 0x811b580 and 0x8053580
Check simple_memcpy1 =1, time=1869808
Check simple_memcpy1 =1, time=1983196
Check simple_memcpy1 =1, time=1985300
Check simple_memcpy1 =1, time=2006828
Check simple_memcpy1 =1, time=1988504

Addresses are 0x811b580 and 0x8053580
Check simple_memcpy2 =1, time=1907460
Check simple_memcpy2 =1, time=2002392
Check simple_memcpy2 =1, time=2001332
Check simple_memcpy2 =1, time=2060276
Check simple_memcpy2 =1, time=2048992

The two 'fast' routines are pretty evenly matched, with a slight edge over the built-in function. The simple 'byte-at-a-time' version is nowhere to be seen in the unoptimised code, but it is surprisingly effective when the optimiser is enabled.

When it comes to optimisations, the first thing to try is the really simple approach, because that is what the optimiser finds easiest to deal with. The more you complicate things, the more conservative the optimiser becomes.

There are lots of things to try, but there are no guaranteed answers!

Running Linux on a 1.8GHz P4 - YMMV

--
 
I am using gcc, and in my tests everything is a power of 2, so yes, they are all word-aligned for the large sizes. I presume that the Intel compiler gives you better performance, but how much difference does it make, and why? Can we use the same technique and write our own version?
 