Converting from ksh to C 2

cryptoadm · Mar 20, 2009

I've written this in ksh, but because of the large number of files (millions) it should be faster in C. I'm unsure how to do things like year=${file:$yearStart:4} in C. I know I need strlen, but do I also need strcmp for the ${file:$yearStart:4} syntax? And how do you strip off a path, as I'm doing with fpath=${data[c]%/*}, and keep only the file name without the path, as with file=${data[c]##*/}?

Thanks!

Code:

#!/bin/ksh

clear
set -A data $(cat /tmp/data.txt)
numdata=${#data[@]}
MAINDIR=/export/data
DIR=newdata
CP=/bin/cp
i=0; c=0

while (( i < numdata ))
do
        file=${data[c]##*/}
        fpath=${data[c]%/*}
        filelen=${#file}

        s=0
        while (( s < filelen ))
        do
                x=${file:$s:1}
                ((t=s+1))
                tt=${file:$t:4}
                if [[ $x = "#" && $tt = 2008 ]]
                then
                        ((yearStart=s+1))
                        ((monthStart=s+5))
                        year=${file:$yearStart:4}
                        month=${file:$monthStart:2}
                        typeset -R2 yearend=$year
                        if [[ $year = 2008 ]]
                        then
                                case $month in
                                        01) MONTH=Jan ;;
                                        02) MONTH=Feb ;;
                                        03) MONTH=Mar ;;
                                        04) MONTH=Apr ;;
                                        05) MONTH=May ;;
                                        06) MONTH=Jun ;;
                                        07) MONTH=Jul ;;
                                        08) MONTH=Aug ;;
                                        09) MONTH=Sep ;;
                                        10) MONTH=Oct ;;
                                        11) MONTH=Nov ;;
                                        12) MONTH=Dec ;;
                                esac
                        fi
                fi
        ((s+=1))
        done
        tput cup 50 0; echo -n "$i"
         #echo "Copying $MAINDIR${fpath}/${file} to /${year}/${MONTH}${yearend}/${DIR}"
         ${CP} $MAINDIR${fpath}/${file} /${year}/${MONTH}${yearend}/${DIR} 2>/dev/null
((c+=1))
((i+=1))
done
echo
echo "Copied $numdata files"
exit 0

cryptoadm · Mar 20, 2009

An example /tmp/data.txt file would contain:

/sample/mktg/1334543eeeefdddddd#8933djij#20080173300292
/example/acct/324234543asf#####84345a#2008078437345

feherke · Mar 21, 2009

Hi

Code:

master # cat cryptoadm.sh
#!/bin/mksh

data="/sample/mktg/1334543eeeefdddddd#8933djij#20080173300292"

yearStart=28

fpath="${data%/*}"

file="${data##*/}"

year="${file:$yearStart:4}"

echo "path : $fpath"
echo "file : $file"
echo "year : $year"

master # cat cryptoadm.c
#include <stdio.h>
#include <string.h>

int main(void)
{

  char data[]="/sample/mktg/1334543eeeefdddddd#8933djij#20080173300292";

  int yearStart=28;

  char fpath[256];
  strcpy(fpath,data);
  fpath[strrchr(data,'/')-data]='\0';  /* pointer difference; casting pointers to int breaks on 64-bit */

  char *file=strrchr(data,'/')+1;

  char year[256];
  strncpy(year,file+yearStart,4);
  year[4]='\0';

  printf("path : %s\n",fpath);
  printf("file : %s\n",file);
  printf("year : %s\n",year);

}

master # ./cryptoadm.sh
path : /sample/mktg
file : 1334543eeeefdddddd#8933djij#20080173300292
year : 2008

master # gcc -o cryptoadm cryptoadm.c

master # ./cryptoadm
path : /sample/mktg
file : 1334543eeeefdddddd#8933djij#20080173300292
year : 2008

Note 1 : I used mksh, the MirBSD implementation of the Korn shell.
Note 2 : I last programmed actively in C about 8 years ago, so my knowledge has faded somewhat.

Feherke.

http://rootshell.be/~feherke/

cryptoadm · Mar 23, 2009

Thank you.

SamBones · Mar 24, 2009

I don't think you are going to speed it up very much by changing it to C. The thing that will take the longest is the file copy itself. Whether C or Ksh is driving it, the copy will take however long it takes.

Part of the problem with the script is that you are doing some very costly Korn shell processing. You are looping through each file name character by character. Have you tried tightening up your Korn shell script? That might speed it up quite a bit. Something like...

Code:

#!/bin/ksh

set -A MONTHS Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

INPUT_FILE=/tmp/data.txt
TARGET_DIR=newdata
MAIN_DIR=/export/data
CP=/bin/cp
QTY=0

while read FILENAME
do
        FILE=$(basename ${FILENAME})
        LOCATION=$(dirname ${FILENAME})

        YYYYMM=${FILE##*\#}
        TAIL=${YYYYMM#??????}
        YYYYMM=${YYYYMM%${TAIL}}

        YYYY=${YYYYMM%??}
        MM=${YYYYMM#????}

        MONTH=${MONTHS[MM]}

        #echo "Copying ${MAIN_DIR}${LOCATION}/${FILE} to /${YYYY}/${MONTH}${YYYY}/${TARGET_DIR}"
        echo ${CP} ${MAIN_DIR}${LOCATION}/${FILE} /${YYYY}/${MONTH}${YYYY}/${TARGET_DIR}

        (( QTY += 1 ))

done < ${INPUT_FILE}

echo
echo "Copied ${QTY} files"
exit 0

That will speed up the Ksh part about as much as it can go.
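
If the C route from the start of the thread is still of interest, the month lookup and the destination path from the script above might look roughly like this. It is only a sketch: month_abbrev and build_dest are hypothetical names, and the layout follows the script directly above (month followed by the 4-digit year), whereas the original script used a 2-digit year via typeset -R2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Return the three-letter abbreviation for month 1..12, or NULL otherwise. */
const char *month_abbrev(int month)
{
    static const char *const names[] = {
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
    };
    if (month < 1 || month > 12)
        return NULL;
    return names[month - 1];
}

/* Build "/<yyyy>/<Mon><yyyy>/<target_dir>" in out.
 * Returns 0 on success, -1 if the month is out of range, yyyy is not
 * exactly four characters, or out is too small for the result. */
int build_dest(char *out, size_t outsize, const char *yyyy, int month,
               const char *target_dir)
{
    assert(out != NULL && yyyy != NULL && target_dir != NULL);
    const char *mon = month_abbrev(month);
    if (mon == NULL || strlen(yyyy) != 4)
        return -1;
    int n = snprintf(out, outsize, "/%s/%s%s/%s", yyyy, mon, yyyy, target_dir);
    return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
}
```

Calling build_dest(buf, sizeof buf, "2008", 1, "newdata") fills buf with /2008/Jan2008/newdata.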

If the file's source and destination locations are on the same file system, you can do a "mv" instead of a copy and it will just move the directory entry to the new location. This will be almost instantaneous. Millions of files could take only seconds.
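
In C, the same idea is a call to rename(). A small sketch, assuming a POSIX system where rename() sets errno to EXDEV when the two paths are on different file systems; move_if_same_fs is a made-up name:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Try to move src to dst by renaming the directory entry.
 * Returns 0 if the rename worked, 1 if src and dst are on different
 * file systems (the caller would have to copy instead), and -1 on any
 * other error. */
int move_if_same_fs(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);
    if (rename(src, dst) == 0)
        return 0;
    return (errno == EXDEV) ? 1 : -1;
}
```

No file data is read or written when the rename succeeds, which is why it is so much faster than a copy.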

If they do need to be copied and not moved, you could change the script above to do the copy as a background process and have more than one running at the same time. Run maybe three to five at a time and it could speed things up a lot. Don't run too many or they could eat up all your I/O bandwidth. Something like this...

Code:

#!/bin/ksh

set -A MONTHS Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

INPUT_FILE=/tmp/data.txt
TARGET_DIR=newdata
MAIN_DIR=/export/data
CP=/bin/cp
QTY=0
MAXRUNNING=5
SNOOZE=10

while read FILENAME
do
        FILE=$(basename ${FILENAME})
        LOCATION=$(dirname ${FILENAME})

        YYYYMM=${FILE##*\#}
        TAIL=${YYYYMM#??????}
        YYYYMM=${YYYYMM%${TAIL}}

        YYYY=${YYYYMM%??}
        MM=${YYYYMM#????}

        MONTH=${MONTHS[MM]}

        while (( $(jobs -p|wc -l) >= MAXRUNNING ))
        do
            sleep ${SNOOZE}
        done

        #echo "Copying ${MAIN_DIR}${LOCATION}/${FILE} to /${YYYY}/${MONTH}${YYYY}/${TARGET_DIR}"
        ${CP} ${MAIN_DIR}${LOCATION}/${FILE} /${YYYY}/${MONTH}${YYYY}/${TARGET_DIR} &

        (( QTY += 1 ))

done < ${INPUT_FILE}

print "waiting for all copies to finish!"
wait

echo
echo "Copied ${QTY} files"
exit 0

Hope this helps.
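
For anyone who does go the C route, the same "at most N copies at once" idea can be sketched with fork() and wait(). This assumes a POSIX system with cp at /bin/cp, as in the script; copy_parallel and reap_one are made-up names. Instead of polling with sleep, it blocks in wait() until one of the running copies finishes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait for one child to finish; return 1 if it exited with status 0. */
static int reap_one(void)
{
    int status;
    if (wait(&status) < 0)
        return 0;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

/* Copy each srcs[i] to dsts[i] with /bin/cp, keeping at most max_running
 * copies in flight. Returns how many copies exited successfully. */
size_t copy_parallel(const char *const srcs[], const char *const dsts[],
                     size_t count, size_t max_running)
{
    assert(max_running > 0);
    size_t running = 0, ok = 0;

    for (size_t i = 0; i < count; i++) {
        if (running == max_running) {   /* pool is full: wait for a slot */
            ok += (size_t)reap_one();
            running--;
        }
        pid_t pid = fork();
        if (pid == 0) {                 /* child: become cp */
            execl("/bin/cp", "cp", srcs[i], dsts[i], (char *)NULL);
            _exit(127);                 /* only reached if exec failed */
        }
        if (pid > 0)
            running++;
        else
            perror("fork");             /* this copy is skipped */
    }
    while (running > 0) {               /* drain the remaining children */
        ok += (size_t)reap_one();
        running--;
    }
    return ok;
}
```

As with the script, the copy itself still dominates the run time, so a small max_running (three to five) is the sensible starting point.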

cryptoadm · Mar 25, 2009

Unfortunately I have to do a copy instead of a move. The reason I am going character by character through each file name is that the 4-digit year always comes after the last '#', and a '#' may also appear elsewhere in the name. Scanning each character was the only way I could think of to find the year and month after the last '#'.

I like how you did it much, much better and will do that.

Thanks!!
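
In C, finding the last '#' needs no character loop either: strrchr returns a pointer to the last occurrence. A rough sketch, assuming the six characters after the last '#' are always YYYYMM as in the sample data; parse_year_month is a hypothetical name:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Find the last '#' in file and read the six digits after it as YYYYMM.
 * On success stores the values in *year and *month and returns 0.
 * Returns -1 if there is no '#', fewer than six digits follow it,
 * or the month is not in 1..12. */
int parse_year_month(const char *file, int *year, int *month)
{
    assert(file != NULL && year != NULL && month != NULL);
    const char *hash = strrchr(file, '#');
    if (hash == NULL)
        return -1;

    const char *p = hash + 1;
    int value = 0;
    for (int i = 0; i < 6; i++) {
        /* A '\0' is not a digit, so this also stops at the end of the string. */
        if (!isdigit((unsigned char)p[i]))
            return -1;
        value = value * 10 + (p[i] - '0');
    }

    int m = value % 100;
    if (m < 1 || m > 12)
        return -1;
    *year = value / 100;
    *month = m;
    return 0;
}
```

For the two sample names posted earlier this gives year 2008 with month 1 and year 2008 with month 7, even though both names contain earlier '#' characters.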

SamBones · Mar 26, 2009

Since you're talking millions of files, changing it so it doesn't call "basename" and "dirname" will speed it up. Change those lines to...

Code:

# Old
#       FILE=$(basename ${FILENAME})
#       LOCATION=$(dirname ${FILENAME})

        FILE=${FILENAME##*/}
        LOCATION=${FILENAME%/*}

That keeps it all within Ksh for the string manipulation.

Hope it helps.

cryptoadm · Mar 26, 2009

Thanks again. Removing the character by character scan has led to a 2x increase in the speed of copies. I'll make the new change too. Thanks!!
