Friends,
New to the forum, posting for the first time. I'm looking for some help from the community
and would really appreciate it if you folks could provide some input and share some knowledge.
I am trying to achieve three things together. The source file is around 800 million records
(17 columns), and each step takes a long time. Hardware on AWS: 61 GB RAM, 8 CPUs, 3 TB disk.
Sample pipe-delimited file with header:
ID |DATE
1234|2015-01-01 00:01:15
1) Want to split the DATE column into two (DATE and TIME), so the output looks
like the following three columns.
ID |DATE |TIME
1234|2015-01-01|00:01:15
Used this code -> awk -F'|' -vOFS='|' '{ gsub(" ", "|", $2); print }' file1.txt > file1.out
It works, but takes around 2 hours 30 minutes.
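A thought on this step: if the space character only ever appears inside the timestamp
field, you don't need field splitting at all; a plain substitution on the whole line
(or even tr) should be noticeably faster. A sketch, assuming that holds for your data:

# Replace the first space on each line; awk never splits or rebuilds fields.
awk '{ sub(" ", "|"); print }' file1.txt > file1.out

# If a space can ONLY occur in the timestamp column, tr is faster still:
tr ' ' '|' < file1.txt > file1.out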
2) Want to add a 4th column (TIMEMASK) based on the TIME value:
if TIME is 00:00:00 there should be a 1 in the fourth column, and so on.
ID |DATE |TIME |TIMEMASK
1233|2015-01-01|00:00:00|1
2345|2015-01-01|00:00:15|2
...
1234|2015-01-01|00:01:15|6
Used this code; it works, but took close to 5 hours.
awk -F'|' -vOFS='|' '{
gsub("00:00:00", "00:00:00|1|", $3) ;
....
gsub("23:45:00", "23:45:00|96|", $3) ;
print }' file2.fle > file2.out
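Running up to 96 gsub() calls on every one of 800 million lines is where that time goes.
If TIMEMASK is simply the 1-based 15-minute slot of the day (1-96, which is what the
gsub list from 00:00:00|1 through 23:45:00|96 suggests), you can compute it with one
arithmetic expression per line instead. A sketch under that assumption:

awk -F'|' -vOFS='|' '{
    # 1-based 15-minute slot of the day: hour*4 + minute/15 + 1  (range 1..96)
    mask = substr($3,1,2)*4 + int(substr($3,4,2)/15) + 1
    $3 = $3 OFS mask    # insert TIMEMASK right after TIME
    print
}' file2.fle > file2.out

(If the mask is actually per 15 seconds, as the sample rows 00:00:15|2 and 00:01:15|6
would imply, use seconds-of-day instead:
mask = (substr($3,1,2)*3600 + substr($3,4,2)*60 + substr($3,7,2))/15 + 1.)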
3) Have to add a 5th column (DAYOFWEEK): based on the date, we have to generate
the day of the week. Output should look like this.
ID |DATE |TIME |TIMEMASK|DAYOFWEEK
1233|2015-01-01|00:00:00|1 |4
2345|2015-01-02|00:00:15|2 |5
...
1234|2015-01-03|00:01:15|6 |6
Used this code, which works but is far too slow: for just one file it had been running
for over 24 hours and I had to kill the process. Since we have multiple files, this was
wrapped in a loop, which is probably not efficient either.
#!/bin/bash
while IFS='|' read col1 col2 DATE col4 col5 col6 col7 col8 col9 col10 \
      col11 col12 col13 col14 col15 col16 col17
do
    col6=$(date --date="$DATE" '+%u')
    echo "$col1|$col2|$DATE|$col4|$col5|$col6|$col7|$col8|$col9|$col10|$col11|$col12|$col13|$col14|$col15|$col16|$col17"
done < file3.fle > file4.out
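The real killer in step 3 is forking an external date process once per line; at 800
million lines that is 800 million process spawns, on top of the shell's read/echo loop.
GNU awk can do the date math in-process with mktime()/strftime(), and the day of week
can be cached per unique date so it is computed only once per calendar day. Here is a
sketch that folds all three steps into a single pass over the raw file, written against
the simplified ID|timestamp layout from the sample - for the real 17-column file you
would adjust the field numbers and print the remaining columns:

gawk -F'|' -vOFS='|' '
NR == 1 { print "ID", "DATE", "TIME", "TIMEMASK", "DAYOFWEEK"; next }
{
    date = substr($2, 1, 10)    # "2015-01-01"
    time = substr($2, 12)       # "00:01:15"
    # 1-based 15-minute slot of the day (1..96); adjust if your mask differs
    mask = substr(time,1,2)*4 + int(substr(time,4,2)/15) + 1
    # ISO day of week (1=Mon..7=Sun), computed once per unique date then reused;
    # noon is used in mktime() so DST boundaries cannot shift the date
    if (!(date in dow)) {
        split(date, d, "-")
        dow[date] = strftime("%u", mktime(d[1] " " d[2] " " d[3] " 12 0 0"))
    }
    print $1, date, time, mask, dow[date]
}' file1.txt > final.out

One pass means the 800M-row file is read and written once instead of three times, and
nothing forks per line; I'd expect the runtime to land much closer to your step-1
2.5 hours than to the 24+ hours of the shell loop.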