Match row sequence and average part of column elements

TTT100 · Nov 11, 2007

I have a large data that have the following format

1 0.1 0.2 0.4
2 0.1 0.2 0.4
3 1.0 0.2 0.4
10 4.1 0.2 0.4
20 3.1 0.2 0.4
50 2.1 0.2 0.4
31 5.1 0.2 0.4

.
.
.
366 11.1 0.2 0.4
The first column is a day stamp and i have many columns as a field. The data is not spaced 1 to 366 evenly as you see.

My first objective is to generate a date column 1 to 366 with 1 day increment. Then I want to match and space my data according to this sequence leaving a an empty space when the sequence field is not present in my data.

My second objective is to average the corrected sequence of data every 10 colum elements and print the averaged result for the whole data.

I appreciate if anyone have a short awk routine in mind to do this job. I am new to awk

Thanks

feherke · Nov 12, 2007

Hi

How exactly the result should look ?

What exactly you tried so far ?

Feherke.

http://rootshell.be/~feherke/

TTT100 · Nov 12, 2007

Thanks Feherke

My main problem is just spacing the large data that I have collected every 3 minutes 366 days. I made the data increment step every day here to make it simpler.

This is example of the input data with the first field being a time stamp

1 0.1 0.2 0.4
2 0.1 0.2 0.4
3 1.0 0.2 0.4
10 4.1 0.2 0.4
15 3.1 0.2 0.4
20 2.1 0.2 0.4
31 5.1 0.2 0.4

.
.
.
366 11.1 0.2 0.4

Here is the desired out put
------------------------------

1 0.1 0.2 0.4
2 0.1 0.2 0.4
3 1.0 0.2 0.4
4
5
6
7
8
9
10 4.1 0.2 0.4
11
12
13
14
15 3.1 0.2 0.4
16 2.1 0.2 0.4
17 5.1 0.2 0.4
18
19
20
.
.
.
366 11.1 0.2 0.4
-------------------------------------------------

The next step is to produce one average line every ten columns for each field and append to the data or a different file --doesnt matter. I can do the averaging in R or matlab but I just wanted if some one knowns a simple one-liner to do the first spacing problem.

Thanks

feherke · Nov 12, 2007

Hi

Well, I still not understand some details, but in theory you need something like this :

Code:

awk 'p+1!=$1{for(i=p+1;i<$1;i++)print i}{p=$1}1' /input/file > /output/file

The open questions would be :
[ul]
[li]What is with that 31 after 50 in your first post ? Was it just a typo ?[/li]
[li]How you get the content for lines 16 and 17 which are not present in your input data ? Is that a typo ?[/li]
[/ul]
Well, I do not know R and Matlab, so I prefer the [tt]awk[/tt] way :

Code:

awk '{for(i=1;i<=NF;i++){s[i]+=$i;if($i!="")c[i]++}}1;NR%10==0{for(i=1;i<=NF;i++)printf"%f%s",s[i]/c[i],(i<NF)?OFS:ORS;s[i]=c[i]=0}' /input/file

Tested with [tt]gawk[/tt] and [tt]mawk[/tt].

Feherke.

http://rootshell.be/~feherke/

TTT100 · Nov 14, 2007

Thanks feherke, the second part of the code works great but I have still a problem with my first problem. Sorry I should have put the problem more clearly. Here is a clarification again. My data is a time series and runs day 1 to day 366 say with one data point per day. But for some days, data was missed. In the example below there is data for day1 to day3 and then day 4 and 5 are missed so does days 9, 11,12,13,14 etc but my collected data didnot put time stamp and indicate data as empty when missed.

1 2006 1 5 1 425.25
2 2006 1 8 1 425.48
3 2006 1 11 1 426.27
6 2006 1 14 1 427.05
7 2006 1 17 1 429.56
8 2006 1 20 1 423.75
10 2006 1 23 1 423.72
15 2006 1 26 1 424.22
16 2006 1 29 1 424.72
.
.
.
365 2006 365 2359 1 385.73

The desired output is first artificially generate fixed time sequence of 1 to 366 equally spaced with 1day increments as first column and then sort my data against this sequence so that I get the following output, which I can then match with other similar time series data.

1 1 2006 1 5 1 425.25
2 2 2006 1 8 1 425.48
3 3 2006 1 11 1 426.27
4
5
6 6 2006 1 14 1 427.05
7 7 2006 1 17 1 429.56
8 8 2006 1 20 1 423.75
9
10 9 2006 1 23 1 423.72
11
12
13
14
15 15 2006 1 26 1 424.22
16 16 2006 1 29 1 424.72
.
.
.
365 365 2006 365 2359 1 385.73

hope this makes sense and is clear to understand.I greatly appreciate your response

Thanks

PHV · Nov 14, 2007

Typed, untested:

Code:

awk '{a[$1]=$0}END{for(i=1;i<=366;++i)print ((i in a)?a[i]:i)}' /path/to/input

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886

TTT100 · Nov 15, 2007

Thanks PHV that code seem to be working. I have one more question. In my real data, I have time stamps that increment by 0.002 (ie instead of day 1 to 366 incrementing by 1, I have time stamp that goes from 1.002 to 365.998 by step of 0.002. Any tips! Thanks a mill.

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Match row sequence and average part of column elements

TTT100

MIS

feherke

Programmer

TTT100

MIS

feherke

Programmer

TTT100

MIS

PHV

MIS

TTT100

MIS

Similar threads

Part and Inventory Search

Sponsor