Re-formatting a file

meinida · Jul 15, 2015

I have a fairly large file (40 meg) of text that looks like this

IMSI = 123456000000049
APNTPLID = 3
QOSTPLID = 108
APNTPLID = 1
QOSTPLID = 108
APNTPLID = 2
QOSTPLID = 108
APNTPLID = 5
QOSTPLID = 108
APNTPLID = 6
QOSTPLID = 108
IMSI = 123456000000011
APNTPLID = 3
QOSTPLID = 108
APNTPLID = 1
QOSTPLID = 108
APNTPLID = 2
QOSTPLID = 108
APNTPLID = 5
QOSTPLID = 108
APNTPLID = 6
QOSTPLID = 108
IMSI = 123456000000050
APNTPLID = 3
QOSTPLID = 108
APNTPLID = 1
QOSTPLID = 108
APNTPLID = 2
QOSTPLID = 108
APNTPLID = 5
QOSTPLID = 108
IMSI = 123456000000075
APNTPLID = 3
QOSTPLID = 108
APNTPLID = 1
QOSTPLID = 108
APNTPLID = 2
QOSTPLID = 108
APNTPLID = 5
QOSTPLID = 108
APNTPLID = 6
QOSTPLID = 108

I would like to make it look like this

IMSI = 123456000000049,APNTPLID = 3,QOSTPLID = 108,APNTPLID = 1,QOSTPLID = 108,APNTPLID = 2,QOSTPLID = 108,APNTPLID = 5,QOSTPLID = 108,APNTPLID = 6,QOSTPLID = 108
IMSI = 123456000000011,APNTPLID = 3,QOSTPLID = 108,APNTPLID = 1,QOSTPLID = 108,APNTPLID = 2,QOSTPLID = 108,APNTPLID = 5,QOSTPLID = 108,APNTPLID = 6,QOSTPLID = 108
IMSI = 123456000000050,APNTPLID = 3,QOSTPLID = 108,APNTPLID = 1,QOSTPLID = 108,APNTPLID = 2,QOSTPLID = 108,APNTPLID = 5,QOSTPLID = 108
IMSI = 123456000000075,APNTPLID = 3,QOSTPLID = 108,APNTPLID = 1,QOSTPLID = 108,APNTPLID = 2,QOSTPLID = 108,APNTPLID = 5,QOSTPLID = 108,APNTPLID = 6,QOSTPLID = 108

Is AWK the right tool to use? I have heard it is very powerful but I have not been able to figure out how to get the result I want.

I have tried different commands, but they either don't work or have no effect on the format.

awk -F'=' '$1=="IMSI" $2=="APNTPLID" $3=="QOSTPLID" {print $1, $2, $3}' output.txt
awk -F'=', '"$1=="IMSI", $2=="APNTPLID", $3=="QOSTPLID"" {print $1, $2, $3}' output.txt
awk '"$1 =/IMSI/, $2 =/APNTPLID/, $3 =/QOSTPLID/" {print $1, $2, $3}' output.txt

michaelvv · Jul 16, 2015

awk '
NR==1 {LINE=$0
next}
/^IMSI/ {print LINE
LINE=$0
next}
{LINE=LINE "," $0}
END {print LINE}' your-input-file>your-output-file

This assumes that the input file starts with an 'IMSI' record.
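
If the input might not begin with an 'IMSI' line, a possible variant is sketched below. It is only a sketch: the function name join_from_first_imsi and the flag variable have are made up for illustration, and silently skipping any lines that come before the first 'IMSI' is an assumption about the desired behaviour.

```shell
# join_from_first_imsi: hypothetical helper name, not part of the original post.
# Reads the named files (or stdin) and prints one comma-joined line per IMSI record.
join_from_first_imsi() {
    awk '
    /^IMSI/ { if (have) print LINE   # flush the previous record, if there is one
              LINE = $0
              have = 1
              next }
    have    { LINE = LINE "," $0 }   # lines before the first IMSI are skipped (assumption)
    END     { if (have) print LINE }' "$@"
}
```

It would be called as join_from_first_imsi your-input-file > your-output-file. With an empty input it prints nothing at all.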

meinida · Jul 16, 2015

Thank you for your reply

This is the error I get when I try to run the command

awk 'NR==1 {LINE=$0 next} /^IMSI/ {print LINE LINE=$0 next}{LINE=LINE "," $0}END{print LINE}' output.txt
awk: syntax error at source line 1
context is
NR==1 {LINE=$0 >>> next <<< } /^IMSI/ {print LINE LINE=$0 next}{LINE=LINE "," $0}END{print LINE}
awk: illegal statement at source line 1

Is it possible the formatting of the input file is bad?

michaelvv · Jul 16, 2015

Looks like the awk program was reformatted. My original post has 8 distinct lines. Also, what platform are you running on? I tested my script under AIX.
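
If a single line is preferred, the statements inside each pair of braces need a semicolon where the line breaks used to be, because awk separates statements by newline or semicolon. A minimal sketch of the same program in that form (the wrapper name join_records_oneline is hypothetical, used only so the command can be reused):

```shell
# join_records_oneline: hypothetical wrapper name, for illustration only.
# Same logic as the 8-line program, with ";" replacing the line breaks inside the braces.
join_records_oneline() {
    awk 'NR==1 {LINE=$0; next} /^IMSI/ {print LINE; LINE=$0; next} {LINE=LINE "," $0} END {print LINE}' "$@"
}
```

Usage would be join_records_oneline output.txt > newoutput.txt. Like the original, it assumes the first input line is an 'IMSI' record.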

meinida · Jul 16, 2015

I am running it on Mac OS X 10.10.3

meinida · Jul 16, 2015

Turns out I am an idiot. I did not realize there was significance in the 8 distinct lines. I ran the command again as you posted it and it works.

Is there something other than AWK that might be better? The file I am running this on is around 40 meg in size, and I think it is pulling it all into RAM before it writes the output.

Thank you again for your help

michaelvv · Jul 17, 2015

Not sure if awk pulls the entire file into RAM for this type of processing. How long did it take to process? How many input lines?

I doubt that there is something better in terms of risk/reward. A high level language might process it a few seconds faster, but the only high level language I know is COBOL. It would have taken me maybe 30 minutes to write, compile and test in COBOL; it took me under two minutes in awk.

mikrom · Jul 17, 2015

COBOL is compiled, so it will be faster than interpreted awk. On the other hand, awk (compared to COBOL) is free, simpler, and supports regular expressions.

mikrom · Jul 18, 2015

You can compare the awk solution with my C solution.
I tried it only for educational purposes, to see whether I could do it in C.

It took me a long time, because I'm not an experienced C programmer and didn't know the library functions for working with strings.
IMO, doing it in awk (or another scripting language) is simpler than in C.

meinida.c

Code:

#define MAXLINE 10000  // maximum line length
#define substr  "IMSI" // substring
/***                                    ***/

#include <stdio.h>
#include <string.h>
#include <stdlib.h>


int main(int argc, char *argv[]) {
  const char *filename = argv[1];
  char line[MAXLINE], line_out[MAXLINE];
  FILE* file = fopen(filename,"r");
  long nr_line;
  int len_line;

  // if file doesn't open then exit with error
  if (file == NULL)
  {
    perror (filename);
    exit(EXIT_FAILURE);
  }

  nr_line = 0;
  while(fgets(line, sizeof(line), file) != NULL) {
    nr_line++;
    // chomp line
    line[strcspn(line, "\n")] = '\0';
    if (strstr(line, substr)) {
      if (nr_line > 1) {
         // print line_out for output
         printf ("%s\n", line_out);
      }
      // create new line_out
      strcpy(line_out, line);
    }
    else {
      // add  line to line_out
      strcat(line_out, ";");
      strcat(line_out, line);
    }
  }
  // at end: print last line
  printf ("%s\n", line_out);

  // close file
  fclose(file);

  // at end return
  return(0);
}

Compilation and running:

Code:

$ gcc meinida.c -o meinida
$ meinida meinida.txt > meinida_out.csv

Output: meinida_out.csv

Code:

IMSI = 123456000000049;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108
IMSI = 123456000000011;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108
IMSI = 123456000000050;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108
IMSI = 123456000000075;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108

michaelvv · Jul 18, 2015

You have made my point!! It only took me a couple of minutes to write the awk program, in only 8 lines. How did the run time compare?

feherke · Jul 18, 2015

Hi

meinida said:
The file I am running this on is around 40 meg in size and I think it is pulling it all into ram before it writes the output.

No, regular Awk implementations do not slurp in the entire input at once. The RS may change anytime, affecting the next record to be read.

michaelvv's code reads one line of input, builds up one output line in the memory, then outputs it and discards it.

There is a way to do it that uses even less memory, but it runs slower than michaelvv's code:

Code:

{
    printf("%s%s", NR == 1 ? "" : $1 == "IMSI" ? "\n" : ",", $0)
}
END {
    print ""
}

This reads one line of input and outputs it immediately. The trick is that it outputs no separator before the first line; for every later line it first outputs a separator (a newline before an IMSI line, a comma otherwise) and then the current line.

Feherke.
feherke.ga

mikrom · Jul 18, 2015

michaelvv said:
You have made my point!! It only took me a couple of minutes to write the awk program, in only 8 lines. How did the run time compare?

I didn't compare the run time of the C and awk programs - maybe the OP could do it.

But IMO the awk solution is more flexible, because awk is a language specialised for text processing.
In awk we don't need to care about opening files, reading them line by line, the maximum length of a string, etc. String operations are very simple compared to C. If the example were more complicated, for example if we had to use regexes, the C code would need more lines.

You mentioned COBOL - I know it too, but I didn't have a free compiler available on my desktop.
In COBOL it would be similar to C: we have to declare the file, open it, and read it line by line. Maybe the string operations would be a little simpler, but the resulting code would not be comparable with the simplicity of awk.
The resulting program in COBOL would be more verbose than in C.
Awk is easier to learn than any other programming language and more productive.

When you say that it took you only a few minutes, I am ashamed and have to confess that it took me some hours.

meinida · Jul 20, 2015

Thank you all for your responses. I will try to answer all of the questions asked.
The number of lines in the input file is 965971
The original AWK script took around 3.5 hours to run.
The C program took less than 10 seconds to run, but the format when opened with Excel isn't quite right. After I loaded it in Excel I split the text to columns using ; as the separator, but the output didn't end up as one line per "IMSI":

IMSI = 123456000000049
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000011
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000050
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000075
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108

After looking at the AWK output, it came out the same way when loaded into Excel. Maybe I am doing something wrong.

meinida · Jul 20, 2015

I used feherke's code in a script and it runs pretty quickly too, in less than 10 seconds. I must be doing something wrong with michaelvv's script. Anyway, they are all giving the same result now, some with ; and some with , separators:
IMSI = 123456000000049
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000011
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108
IMSI = 123456000000050
; APNTPLID = 3
; QOSTPLID = 108
; APNTPLID = 1
; QOSTPLID = 108
; APNTPLID = 2
; QOSTPLID = 108
; APNTPLID = 5
; QOSTPLID = 108
; APNTPLID = 6
; QOSTPLID = 108

Any ideas why the output is not like the one below?

IMSI = 123456000000049;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108
IMSI = 123456000000011;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108
IMSI = 123456000000050;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108
IMSI = 123456000000075;APNTPLID = 3;QOSTPLID = 108;APNTPLID = 1;QOSTPLID = 108;APNTPLID = 2;QOSTPLID = 108;APNTPLID = 5;QOSTPLID = 108;APNTPLID = 6;QOSTPLID = 108

mikrom · Jul 20, 2015

IMO, your problems are caused by the format of your file.
You probably have a problem with end-of-line characters, i.e. "\r\n" vs. "\n", or with blank characters at the beginning or end of lines.
Either set the variable RS properly and/or remove these characters from each line.
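
Removing the characters inside the awk script itself could look roughly like the sketch below. It assumes the stray characters are a trailing carriage return plus spaces or tabs around each line; the function name join_records_crlf is hypothetical, and the joining logic is michaelvv's program unchanged.

```shell
# join_records_crlf: hypothetical helper name, for illustration only.
# Cleans each input line first, then joins the lines of every IMSI record with commas.
join_records_crlf() {
    awk '
            { sub(/\r$/, "")                  # drop a trailing carriage return, if any
              gsub(/^[ \t]+|[ \t]+$/, "") }   # trim leading and trailing blanks
    NR==1   { LINE = $0; next }
    /^IMSI/ { print LINE; LINE = $0; next }
            { LINE = LINE "," $0 }
    END     { print LINE }' "$@"
}
```

The cleanup rule has no pattern, so it runs on every line before the other rules test it.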

meinida · Jul 21, 2015

tr -d '\015' <output.txt >newoutput.txt

Did the trick.
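
For anyone finding this later: '\015' is the octal code for the carriage return character, so the tr command deletes every carriage return in the file. A quick way to check whether a file has lines ending in one is to count them. This is just a sketch, and the name count_cr_lines is made up:

```shell
# count_cr_lines: hypothetical helper name, for illustration only.
# Prints how many input lines end with a carriage return (0 if none).
count_cr_lines() {
    awk '/\r$/ { n++ } END { print n + 0 }' "$@"
}
```

Running count_cr_lines output.txt before the tr step and count_cr_lines newoutput.txt after it should show the count dropping to 0.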

meinida · Jul 21, 2015

To ALL
I have not used TekTips much in the past. My experience with all of the people who posted on this was outstanding. Is there a way to rate or acknowledge the programmers that helped me? You are all topnotch!
Thank you all!

mikrom · Jul 21, 2015

meinida said:
Is there a way to rate or acknowledge the programmers that helped me?

To rate the answers, you can click the star on the right side of every reply:

Great post?
Star it!
