Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

adding field separator 3

Status
Not open for further replies.

omoo

Programmer
May 30, 2005
87
US
Hi,

I have the following data in the format as shown(note: there are more than 1 blank spaces between each field and the spaces are not uniform, meaning there can be one blank field between field1 and field2 and 3 spaces between field3 and field4):
________________________________________
num1a num2a num3a num4a num5a
num1b num2b num4b
num2c num3c num4c num5c
num2d num3d num4d num5d
num1e num2e num3e num4e num5e
num1f num5f
________________________________________

i need to include ; as the field delimited and the required output is:
________________________________________
num1a;num2a;num3a;num4a;num5a
num1b;num2b;;num4b;
;num2c;num3c;num4c;num5c
;num2d;num3d;num4d;num5d
num1e;num2e;num3e;num4e;num5e
num1f;;;;num5f
________________________________________

I tried using
sed -e 's/ */;/g'
but it will take consecutive blank fields as one blank field. Can anyone help?


 
omoo

that is a nasty data file

i reckon this data structure is so ambiguous it will be almost impossible to solve this with no problems

BUT - i would imagine it would be best to iterate through the file once to find a line that contains the most values - and then assume that this line (or lines) with the most matches are a full complement of values - and then work from that basis

it is important this is understood - i.e. we can easily look at your data and visually see that there 5 values max. But what if you had the following:-

Code:
num1a           num3a   num4a   num5a
num1b                   num4b
                num3c   num4c   num5c
                num3d   num4d   num5d
num1e           num3e           num5e
num1f                           num5f

... does column 2 exist - or not!?

having a structure that has a indefinite number of spaces between the columns makes this very prone to error


Kind Regards
Duncan
 
Hi

Get it to a "regular" format filling the missing fields with a spaceholder, let say [tt]\t[/tt], then is easyer:

Code:
sed -e 's/^ \{5\}/\t/;s/ \{8\}/ \t/g;s/ \+/;/g;s/\t//g' file

Feherke.
 
Hi Feherke

Please explain, exactly, what a 'field' is in the above example


Kind Regards
Duncan
 
some of the field lengths are fixed while some are not. this is the problematic part.

yes it is possible that there are no data in the entire field. and it is the difficult part that i am trying to solve.

i cannot get it to a "regular" format because i output file always cannot detect the absence of some of the fields.

the fields in the example numbers, alphabets, dates or blank. for exmaple, TM5362F0000D-NBZ5, 4500727101, CR4900.00, 24, 06/29/2005.
 
the number of fields in each line are the same, but some of them can be blanks
 
omoo,
youforgot to mention [from the other forum] that the number of fields is always the same on every line.

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
i think this would be alot easier if you posted a chunk of your data - a few lines preferably


Kind Regards
Duncan
 
Hi

Duncan wrote:
Please explain, exactly, what a 'field' is in the above example

Just as in [tt]awk[/tt]. My first idea was to use [tt]awk[/tt] and set the [tt]FIELDWIDTHS[/tt] variable. Maybe this is why for me was obvious to use the field word.

Feherke.
 
omoo said:
there are more than 1 blank spaces between each field and the spaces are not uniform, meaning there can be one blank field between field1 and field2 and 3 spaces between field3 and field4)

some of the field lengths are fixed while some are not. this is the problematic part.

the number of fields in each line are the same, but some of them can be blanks

You don't have a valid data file. The field-delimiting information is gone.
It's impossible to restore it.

omoo, I have a word. Some of its letters (I don't know how many) have
been removed. I am left with "at". What was the original word?
Hat? Catch? Watch? Attach? What? Attitude?

Consider this possible line from the file (for clarity, spaces have been replaced with "."):
[tt]......foo......[/tt]

Let's change the spaces that make up fields to x's:
It could be like this:
[tt]xxxx..foo.x.x.x[/tt]

which should be converted to
[tt];foo;;;[/tt]

Or it could be like this:
[tt]xx.x..foo..x..x[/tt]

which should be converted to
[tt];;foo;;[/tt]

Or it could be like this:
[tt]x.x.x.foo...xxx[/tt]

which should be converted to
[tt];;;foo;[/tt]

I wish that Tom Noddy could have thought before he created this so-called "data file".
 
The following [tt]sed[/tt] command will work if the fields are all the same width...
Code:
sed 's/......../&;/g;s/ *//g' file.dat
Using the sample in the original post this gives...
Code:
num1a;num2a;num3a;num4a;num5a
num1b;num2b;;num4b;
;num2c;num3c;num4c;num5c
;num2d;num3d;num4d;num5d
num1e;num2e;num3e;num4e;num5e
num1f;;;;num5f
If the columns for each field are a fixed width, but different widths, you can use [tt]cut[/tt] and [tt]paste[/tt]. The following should work...
Code:
cut -b 1-8 file.dat > field.1
cut -b 9-16 file.dat > field.2
cut -b 17-24 file.dat > field.3
cut -b 25-32 file.dat > field.4
cut -b 33-40 file.dat > field.5
paste -d\; field.[1-5] | sed 's/ *//g'
This gives the same output. Just adjust the field widths for each [tt]cut[/tt].

Both of these even handle duncdude's completely missing column 2 example above, as long as there's a span of spaces the size of the missing field.

BUT, this all assumes the fields all have a fixed length. If they vary from line to line with no clear delimiter, there's no way to parse it.

Hope this helps.
 
Hi,

The method by SamBones might be working.

may I know how can I delete the trailing empty spaces after each field?

 
ok i realise the sample code given by SamBones already delete the trailing spaces! oops. thanks!
 
The commands I gave don't leave any spaces at all. The [tt]field.*[/tt] files have spaces in them, but they're just temporary files for the [tt]paste[/tt] to use. The final output has no spaces.

Hope this helps.
 
Hi

As I mentioned above, my first idea was using [tt]awk[/tt] with fixed widths. Seeing SamBones solution using similar steps, I think that wasn't so bad, so I post it, as another alternative :

Code:
awk '
BEGIN { FIELDWIDTHS="8 8 8 8 8 8" }
{ s=$1; for (i=2;i<NF;i++) s=s ";" $i; print gensub(" ","","g",s) }
' file.dat

Feherke.
 
yeah it works! thanks SamBones.

I have another problem but I think it is not as bad.

I have data in the following format:

Code:
data1
data2 data3    data4
data5 data6
data7 data8    data9
      data10

I require the final output to be:

Code:
data1
data1
data1
data1

i only require the 1st line but I need to replicate it for all rows with data. In this example, there are 4 rows of data(excluding data1), so I need to replicate it in 4 rows.

 
Hi

And how should be that data1 recognised ? Is it a fix keyword, or something variable ?

If data1 is a keyword :

Code:
sed -n '/data1$/{h;n};g;p'

If data1 is whatever is in the first line :

Code:
sed -n '1{h;n};g;p'

If data1 is in a line alone ( no separator and other data ) :

Code:
sed -n '/ /!{h;n};g;p'

Feherke.
 
Hi,

what is h and n supposed to be?

data1 is whatever in the 1st line. There is only one data in the 1st line. I need to duplicate this 1st line according to the number of rows.
 
Hi

omoo wrote :
what is h and n supposed to be?

[tt]sed[/tt] commands, according to [tt]man[/tt] :
h - Copy pattern space to hold space
n - Read the next line of input into the pattern space
g - Copy hold space to pattern space
p - Print the current pattern space

I think the second code from my previous post is exactly what you need.

Feherke.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top