
Splitting File into smaller batches

Status
Not open for further replies.

venkatpavan

Programmer
Feb 18, 2006
42
SG
Hi,

I'm a beginner on UNIX, so please forgive my ignorance on this.

I'm trying to split a file into batches. The big file I want to split looks like this:
00
XX
XX
99
00
XX
XX
99
00
XX
XX
99
00
XX
XX
99

Each batch starts with a 00 record and ends with a 99 record. I want to split this big file into batches, so that each batch runs from 00 to 99. Can someone help me with this?

Thanks.....
 
It's very basic but, as a beginner, probably easier to understand than the csplit or awk variants.
Code:
#!/bin/ksh
COUNT=0

while read line
do
  echo "$line" >> FILE$COUNT
  if [ "$line" = 99 ]
  then
      (( COUNT += 1 ))
  fi
done < /input/file
This will split the file into FILE0, FILE1, FILE2 etc.
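For comparison, the csplit variant mentioned above could look something like this (a sketch assuming GNU csplit, whose '{*}' repeat count means "keep splitting at every match"; the sample file name is made up for the demo):

```shell
# Build a small sample in the same shape as the original file.
printf '00\nXX\nXX\n99\n00\nXX\nXX\n99\n' > bigfile

# Split before every line that is exactly "00".
# -z suppresses the zero-length piece before the first match.
csplit -z bigfile '/^00$/' '{*}'
```

Each 00..99 batch ends up in its own xx00, xx01, ... file.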

Ceci n'est pas une signature
Columb Healy
 
I think I messed up the question here; I'm sorry about that. I want to split the file into batches from 00 to 99, but 00 and 99 should be the first two characters of each record. In this file each record is around 1018 bytes, and records are separated by either a carriage return plus line feed or a line feed alone. Below is an example of the file; due to space constraints here I used dots (...).

0000000000009000000000090000400023726..............
0100000000009000000000090000400023726..............
0200000000009000000000090000400023726..............
0300000000009000000000090000400023726..............
0400000000009000000000090000400023726..............
999999999999999900009000000000090000400023726
0000000000009000000000090000400023726..............
0100000000009000000000090000400023726..............
0200000000009000000000090000400023726..............
0300000000009000000000090000400023726..............
0400000000009000000000090000400023726..............
999999999999999900009000000000090000400023726
0000000000009000000000090000400023726..............
0100000000009000000000090000400023726..............
0200000000009000000000090000400023726..............
0300000000009000000000090000400023726..............
0400000000009000000000090000400023726..............
999999999999999900009000000000090000400023726

Once again, I'm sorry about earlier.

Thanks.....
 
awk '/^00/{if(NR>1)close(f);f="file"++n}{print>f}' /path/to/bigfile
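Spelled out with comments and run against a made-up sample file, the one-liner does this (same logic, illustrative file names only):

```shell
# Sample input: two batches, each opening with a "00..." record
# and closing with a "99..." record.
printf '00a\n01b\n99z\n00c\n99y\n' > bigfile

awk '
  /^00/ {                 # a record starting with "00" opens a new batch
    if (NR > 1) close(f)  # close the finished batch file first, so we
                          # never run out of open file descriptors
    f = "file" ++n        # name the next piece file1, file2, ...
  }
  { print > f }           # every record goes to the current piece
' bigfile
```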

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
For my script
Code:
#!/bin/ksh
COUNT=0

while read line
do
  echo "$line" >> FILE$COUNT
  if egrep -q ^99 $line
  then
      (( COUNT += 1 ))
  fi
done < /input/file

Ceci n'est pas une signature
Columb Healy
 
Hi

Columb, are you sure about this ?
Code:
egrep -q ^99 $line
The line-by-line file appending you already used was slow, but now you have added an egrep call for every line, which is slower still. As written, egrep also treats $line as a file name rather than as text to search. (Note that for the time test I corrected that egrep misunderstanding.)
Code:
master # time my.sh
real    0m0.225s
user    0m0.030s
sys     0m0.190s

master # time columb.sh
real    0m11.450s
user    0m5.630s
sys     0m5.510s
I would rewrite it like this:
Code:
#!/usr/bin/ksh

COUNT=0
end=no

while test $end; do
  end=''
  while read line; do
    echo "$line"
    if [[ "$line" = 99* ]]; then
      (( COUNT ++ ))
      end=no
      break
    fi
  done > FILE$COUNT
done < /input/file
Tested with (pd)ksh.

Note: if there is an empty line at the end of the file, my script will output that too, in a file with the next sequence number.
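For a quick sanity check, the same loop can be exercised against a throwaway sample (the file names here are made up for the demo, with /input/file swapped for a local sample):

```shell
#!/bin/ksh
# Two batches: each ends with a record whose first two characters are "99".
printf '00a\n01b\n99z\n00c\n99y\n' > sample

COUNT=0
end=no
while test $end; do
  end=''
  while read line; do
    echo "$line"
    if [[ "$line" = 99* ]]; then
      (( COUNT ++ ))
      end=no
      break
    fi
  done > FILE$COUNT
done < sample
# FILE0 and FILE1 hold the two batches; a trailing empty FILE2 is
# created when the final read hits end-of-file.
```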

Feherke.
 
feherke

I was going for simple rather than fast but yes, your code is better. Thanks for the amends.

My reasoning for using a basic script is that venkatpavan is a beginner and will have to maintain the code. I'm always reluctant to provide awk scripts because I remember just how long it took me to get my head round the awk basics, and even today the lemur book still lives on my desk.

Ceci n'est pas une signature
Columb Healy
 
Thanks a lot, guys, it's working. You made it look a lot easier. One thing is sure: if we know UNIX, life is a lot easier in the programming world.

Once again I appreciate all your Help.
 
Hi

Columb said:
I remember just how long it took me to get my head round the awk basics
Well, I remember that for me it was a nice autumn day's afternoon...

Thank you Columb for sharing your experience. I will keep it in mind when posting in the future.

Feherke.
 