split text file 1

Status
Not open for further replies.

w5000

Technical User
Nov 24, 2010
223
PL
I have a text file with many lines containing many messages, each beginning with XX and ending with a closing bracket ")".

I'd like to split it so that each message is saved to a separate file (e.g. file names with an increasing number).

so file1 will have:
XX sfjsklf
sfsfsfsf
(sdfsdfsf
sfsfsf
gfdfgdgdghdh)

file2:
XX 902wriwirj
sdfs
sdfsf
(sdfsdfs
sfsf)


etc.
 
Hi

Is there unwanted content between the sections? If not, this will produce file01, file02 and so on (plus a potentially empty file00 holding whatever precedes the first section):
Code:
csplit -f file /path/to/input '/^XX /' '{*}'
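
To see what that produces, here is a minimal sketch on a made-up five-line input. The sample text and the scratch directory are assumptions for illustration, `{*}` needs GNU csplit, and `-s` only silences the byte counts csplit normally prints:

```shell
# Scratch directory and sample input are illustrative only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'leading junk' 'XX first' 'body one)' 'XX second' 'body two)' > input.txt

# Split before every line that starts with "XX ", as often as such a line is found.
csplit -s -f file input.txt '/^XX /' '{*}'

# file00 holds whatever preceded the first "XX " line.
ls file*
```

Anything that follows a message ends up in the same file as that message, which is why this only fits input without unwanted content in between.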

Feherke.
feherke.ga
 
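Yes, there are some unwanted lines in between (to be ignored),

and there might also be some trailing unwanted stuff after the closing ")", on the same line.

Sample of wanted and unwanted text (the wanted part runs from the line starting with XX through the closing ")"; everything around it is unwanted):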

sfkjskfsdf
sfsfsf
sfsf

XX sdfsf
sfsf
sdfs
sdfsfsf
(sfsfsfsf
sdfsf
sfsf
sfsfsfs)
fgkjsfsdflk
wrwrewr
werwerweer
wrwrwerwre




 
Hi

In that case I would go with Awk. Note that RT is a GNU Awk extension, so awk must be gawk here:
Code:
awk -v RS='(^|\n)XX [^)]+\\)' 'RT{sub(/^\n/,"",RT);print RT>"file"++n}' /path/to/input


Feherke.
feherke.ga
 
brilliant! thank you very much.

By the way, if I wanted to pipe each message to a command (instead of writing it to a file), would this be OK, or is something superfluous there? It seems to work, but I'd like to be sure this approach is sound.

gawk -v RS='(^|\n)XX [^)]+\\)' 'RT{sub(/^\n/,"",RT);cmd="wc -l";print RT | cmd; close(cmd)}'

Also, how could I modify it so that awk runs another command with input redirection, like:

cmd < message

(cmd could be given with some options)
 
Hi

Yes, that is the way to run an external command and pass it input.

I am not sure what you are asking there, but I assume you would like bidirectional communication with the external command. That is a GNU Awk only feature:
Code:
gawk -v RS='(^|\n)XX [^)]+\\)' 'RT{sub(/^\n/,"",RT);cmd="wc -l"; print RT |& cmd; close(cmd,"to"); cmd |& getline c; close(cmd,"from"); print c}' /path/to/input

Though if you really just want the line count, it is better to compute it in Awk:
Code:
gawk -v RS='(^|\n)XX [^)]+\\)' 'RT{gsub(/^\n/,"",RT);print split(RT,a,"\n")}' /path/to/input


Feherke.
feherke.ga
 
Sorry for explaining my goal badly.

With your first command I can use the created files for further processing with a for loop:

for i in file*;do somecommand -o sss < $i;done

I was thinking of doing it directly in the awk command, without needing the for loop at all.
 
Hi

In such cases a \0 (NUL) delimiter is usually used, in the hope that the text to process will not contain it:
Code:
gawk -v RS='(^|\n)XX [^)]+\\)' -v ORS='\0' 'RT{gsub(/^\n/,"",RT);print RT}' /path/to/input |
while IFS='' read -d $'\0' s; do
    echo "--=[$s]=--"
done


Feherke.
feherke.ga
 
thank you.

I have also tried to add leading zeroes to the counter in the file names. Could you tell me why _79 appears twice, and how to start from file_01 (and not file_00)?
In my example 80 files should be created, starting from file_01.

$ gawk -v RS='(^|\n)XX [^)]+\\)' 'RT{sub(/^\n/,"",RT);file=sprintf("%s_%02d","file",n++);print RT>file};{print file}' ddddd|head -5
file_00
file_01
file_02
file_03
file_04
$ gawk -v RS='(^|\n)XX [^)]+\\)' 'RT{sub(/^\n/,"",RT);file=sprintf("%s_%02d","file",n++);print RT>file};{print file}' ddddd|tail -5
file_76
file_77
file_78
file_79
file_79
$
 
Hi

Then either initialize it with n=1 or preincrement it with ++n.

Such RT-based solutions tend to produce an empty record at one end of the processing. To avoid that, I asked Awk to write only when RT is not empty. You asked it to {print file} unconditionally, even when the block that writes to the file was skipped.


Feherke.
feherke.ga
 