split text file 1

w5000 · Sep 9, 2016

I have a text file with many lines containing many messages, each beginning with XX and ending with a closing bracket ")".

I'd like to split it so that each message is saved to a separate file (e.g. file names with an increasing number)

so file1 will have:
XX sfjsklf
sfsfsfsf
(sdfsdfsf
sfsfsf
gfdfgdgdghdh)

file2:
XX 902wriwirj
sdfs
sdfsf
(sdfsdfs
sfsf)

etc.

feherke · Sep 9, 2016

Hi

Is there unwanted content between the sections? If not, this will produce file01, file02 and so on (plus a potentially empty file00 holding whatever comes before the first section):

Code:

csplit -f file /path/to/input '/^XX /' '{*}'

Feherke.
feherke.ga

w5000 · Sep 9, 2016

yes, there are some unwanted lines in between (to be ignored)

and there might also be some trailing unwanted stuff after the closing ")", on the same line

sample (wanted: from the XX line through the closing ")"; everything else is unwanted)

sfkjskfsdf
sfsfsf
sfsf
XX sdfsf
sfsf
sdfs
sdfsfsf
(sfsfsfsf
sdfsf
sfsf
sfsfsfs)fgkjsfsdflk
wrwrewr
werwerweer
wrwrwerwre

feherke · Sep 9, 2016

Hi

In that case I would go with Awk :

Code:

awk -v RS='(^|\n)XX [^)]+\\)' 'RT{sub(/^\n/,"",RT);print RT>"file"++n}' /path/to/input

Feherke.
feherke.ga
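The command above relies on a regular-expression RS and on RT, which are GNU Awk extensions, so it needs `awk` to be gawk. Where only a plain POSIX awk is available, a line-by-line state machine is one possible substitute. This is a different technique from the one in the post, sketched under these assumptions: a message starts on a line beginning with "XX " and ends at the first ")" that follows, and the sample file `input` and its contents are made up for illustration.

```shell
# Hypothetical sample input in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'junk before' 'XX first' 'body' '(more' 'end)trailing junk' \
    'between' 'XX second' '(x)' 'tail junk' > input

awk '
/^XX / { n++; out = "file" n; inmsg = 1 }   # a new message starts on this line
inmsg {
    i = index($0, ")")                      # position of the first ")", 0 if none
    if (i) {
        print substr($0, 1, i) > out        # keep text up to ")", drop the rest
        close(out)
        inmsg = 0                           # ignore lines until the next "XX "
    } else
        print > out
}' input

cat file1
```

With this sample the sketch writes file1 and file2, and the junk lines before, between and after the messages are dropped.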

w5000 · Sep 12, 2016

brilliant! thank you very much.

by the way, if I would like to pipe each message to a command (instead of writing it to a file), would this be ok, or is something superfluous there? It looks like it is working but I'd like to be sure this approach is ok...

gawk -v RS='(^|\n)XX [^)]+\\)' 'RT{sub(/^\n/,"",RT);cmd="wc -l";print RT | cmd; close(cmd)}'

also, how could I modify it so that in awk I can use another command with a redirect like:

cmd < message

(cmd could be given with some options)

feherke · Sep 12, 2016

Hi

Yes, that is the way to run an external command and pass it input.

I am not sure what you are asking there, but I assume you would like bidirectional communication with the external command. That is a GNU Awk only feature:

Code:

gawk -v RS='(^|\n)XX [^)]+\\)' 'RT{sub(/^\n/,"",RT);cmd="wc -l"; print RT |& cmd; close(cmd,"to"); cmd |& getline c; close(cmd,"from"); print c}' /path/to/input

Though if you really just want to get the line count, then it is better to solve it in Awk:

Code:

gawk -v RS='(^|\n)XX [^)]+\\)' 'RT{gsub(/^\n/,"",RT);print split(RT,a,"\n")}' /path/to/input

Feherke.
feherke.ga

w5000 · Sep 12, 2016

sorry for explaining my goal badly

with your first command I can use the files created for further processing with a for loop:

for i in file*;do somecommand -o sss < "$i";done

I was thinking of implementing it directly in the awk command so that I don't need the for loop at all.

feherke · Sep 13, 2016

Hi

In such cases a \0 delimiter is usually used, in the hope that the text to process does not contain it:

Code:

gawk -v RS='(^|\n)XX [^)]+\\)' -v ORS='\0' 'RT{gsub(/^\n/,"",RT);print RT}' /path/to/input |
while IFS='' read -d $'\0' s; do
    echo "--=[$s]=--"
done

Feherke.
feherke.ga

w5000 · Sep 13, 2016

thank you.

I have tried also to add leading zeroes to the counter in filenames - could you tell me why _79 is twice? and how to start from file_01 (and not file_00)?
in my example there should be 80 files created from file_01

$ gawk -v RS='(^|\n)XX [^)]+\\)' 'RT{sub(/^\n/,"",RT);file=sprintf("%s_%02d","file",n++);print RT>file};{print file}' ddddd|head -5
file_00
file_01
file_02
file_03
file_04
$ gawk -v RS='(^|\n)XX [^)]+\\)' 'RT{sub(/^\n/,"",RT);file=sprintf("%s_%02d","file",n++);print RT>file};{print file}' ddddd|tail -5
file_76
file_77
file_78
file_79
file_79
$

feherke · Sep 13, 2016

Hi

Then either initialize it with n=1 or preincrement it with ++n.

Such RT based solutions tend to produce an empty record too at one end of the processing. To stop that, I asked Awk to write only when RT is not empty. You asked it to {print file} unconditionally, even if the block that writes to the file was skipped, which is why file_79 is printed twice.

Feherke.
feherke.ga
