
Installing AWK on a Windows 7 64-bit computer - Novice 2

Status
Not open for further replies.

BxWill

MIS
Mar 30, 2009
367
US
Context: I have a rather large text file - over 800,000 records and over 180 fields that appear to be separated by commas - and
quite a few records also have commas embedded within the fields themselves!

Consequently, it appears that I cannot use BULK INSERT within T-SQL to load the "clean" records into the SQL Server database table.

Therefore, it appears that AWK may be used to resolve this problem. Note, I have never used AWK and am not at all familiar with its functionality, but resolving this issue via AWK appears preferable to opening the text file in MS Excel, using a volatile function to count the commas in each record, extracting the "good" records, and then converting to a tab-delimited file for bulk insert into the SQL Server database.

Over the last hour or so, I downloaded files from the Cygwin site but then restored the laptop to a system restore point because it appeared to be "bogged down" with unnecessary files! Although I was able to search for "GAWK", it appeared that unrelated files were also downloaded.

[b]What is the simplest method to download and install AWK on a Windows 7 Professional 64-bit laptop? Is it preferable to install 32-bit AWK instead of 64-bit?[/b]

What is an honest assessment of the learning curve for a novice to master a quick review of a text file via AWK - determining the proportion of "bad" records relative to all of the records within the file? Then I could quickly send the file back to the source department instead of spending an inordinate amount of time extracting the "good" records.

(In other words, is it possible for me to develop an "AWK template" that I can use to quickly review every text file from the source department and validate various fields, such as:

1. Zipcode - is it a valid length, and is it valid for the particular city?
2. Dates - is it a valid date?
3. Other fields - is it numeric or alpha when it should be the other...?
4. For comma-delimited files - are there commas within one or more fields? etc.)

For example, if the ratio of "bad" records to "good" records is beyond the "bad record" threshold (say, 50%), I can readily send the file back to the source department instead of spending three to four weeks validating and loading only the good records!
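As a rough illustration of that threshold idea, here is a minimal awk sketch; the file name, the expected field count of 9, and the 50% threshold are all placeholders for the real 180-field files:

```shell
# Hypothetical sample: header plus one good record and one bad record
printf 'No,A,B,C,D,E,F,G,H\n1,a,b,c,d,e,f,g,h\n2,a,b\n' > sample.txt

# Skip the header, count records whose field count is not the expected 9,
# then report the bad-record share against a 50% threshold
awk -F ',' 'NR > 1 {
    total++
    if (NF != 9) bad++
}
END {
    pct = 100 * bad / total
    printf "bad: %d of %d (%.1f%%)\n", bad, total, pct
    if (pct >= 50) print "=> send the file back to the source department"
}' sample.txt
```

For the 50/50 sample above this prints `bad: 1 of 2 (50.0%)` followed by the send-back message.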

Any additional insight is appreciated that relates to the use of AWK to resolve the issue of "bad" records within a relatively large text file.

Hi BxWill,

When I get a new PC with Windows, the first thing I do is install some utilities, one of which is MSYS (Minimal SYStem). MSYS is bash plus other basic utilities like awk, sed, ....
I'm not sure because I'm on Linux now and not on my Windows PC, but I think I used the installer mingw-get-setup.exe from the MSYS Getting Started page.

The other option would be to use, for example, this port of Gawk for Windows.

However, the most complete Linux-like environment for Windows is Cygwin. You mentioned that you tried it, but Cygwin is huge, so be careful and don't try to install everything on your laptop.

To learn awk, for the first steps you can use some tutorials from the web, or buy a nice book about it.
Later on, this manual helps as a reference:
Regarding your other questions on how to validate a text file, please post a specific minimal example to this forum and we will look at it.
 
Displayed below is a sample text file from the source department.

Is it possible for AWK to quickly validate the following fields?
1. Equipment_Flag - should be one character, only the code: N for New or O for Old
2. Tracking_Number - in record 1, there are two tracking numbers instead of just one. Hence, this record has 9 commas, whereas the other records have just eight commas delimiting the 9 fields
3. Sale_Date - records 2 and 4 are incorrect. Record 2 contains asterisks, and record 4 is nonsensical - not a date

So, is it possible for me to run several AWK commands and then quickly send this text file back to the source
department, stating that all four records contain errors? ("Fix the errors and return the file!")

[Attached image: SampleTextFile_-_Copy_-_Copy_id7kkb.gif]
 
Hi BxWill,
You say that it's a text file, but you posted only a picture of it.
Can you please post the example here as text?

i.e. like this:
Code:
No.; [Account_Number]; ...
  1;             1345; ...
  2;             1456; ...
I think it's a CSV file, but then I need to know how the columns are separated: with commas, semicolons or spaces?
 
The fields within the text file are indeed delimited by commas, and there are no double quotes around the fields. Hence, it is not a true CSV file. Usually the data is sent as a pipe-delimited text file.

So, what is displayed in the image is exactly what I see upon opening the file in a text editor.

Based on research, it appears that I can extract the "bad" records by using something similar to:

Code:
awk -F , 'NF != 11' <file

Now, the challenge remains - how to quickly load AWK...
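For the pipe-delimited deliveries mentioned above, the same field-count filter presumably applies with the separator swapped; a sketch using dummy 3-field data (the real files would use their actual column count):

```shell
# A single-character FS such as | is taken literally by awk, so it only
# needs shell quoting; print every record that does not have 3 fields
printf 'a|b|c\na|b\n' | awk -F '|' 'NF != 3'
```

Here the second record has only 2 fields, so the filter prints `a|b`.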
 
Why NF != 11? If I look at the picture you posted, I see 9 fields.
 
I took this example file
bxwill
Code:
No, [Account_Number],[Sequence_Number],[Equipment_Flag],[State_Code],[Tracking_Number],[Sale_Date],[Sale_Amount],[Purchase_Zipcode]
  1,            1345,                2,N               ,Ca          ,233432, 289908   ,   20170215,        10000,             90001
  2,            1456,                1,N               ,Ca          ,            23678,***********,       120000,             90003
  3,           15443,                4,New             ,Ca          ,            58973,   20170402,       498211,             85007
  4,           58722,                1,O               ,Az          ,            87963,ABC        ,       449390,             85011

and according to the criteria you gave above, I wrote this validation script
bxwill.awk
Code:
# Run:
#   awk -f bxwill.awk bxwill.txt
BEGIN {
  FS = ","
  # number of errors
  nr_err = 0
}

# Line rules
NR == 1 {
  # skip header line
  next
}

{
  # chomp current line
  chomp_line()
  # remove spaces
  gsub(/[ ]+/, "", $0)
}

$0 ~ /^[ ]*$/ {
  # skip empty line
  next
}

{
  #printf("Processing #%2d :  %s\n", NR, $0)
  if (NF != 9) {
    errors[++nr_err] = "Line " NR " : Unexpected number of columns =" NF " !"
    next
  }

  equipment_flag = $4
  if ((equipment_flag == "N") || (equipment_flag == "O")) {
    # pass
  }
  else {
    errors[++nr_err] = "Line " NR " : Unexpected value of equipment_flag = \"" equipment_flag "\"!"
  }

  tracking_number = $6
  if (tracking_number ~ /^[0-9]+$/) {
    # pass because it's a number
  }
  else {
    errors[++nr_err] = "Line " NR " : Unexpected value of tracking_number = \"" tracking_number "\"!"
  }

  sale_date = $7
  if (sale_date ~ /^[0-9]{8}$/) {
    # pass because it's a date YYYYMMDD
  }
  else {
    errors[++nr_err] = "Line " NR " : Unexpected value of sale_date = \"" sale_date "\"!"
  }
}

END {
  # *** print result ***
  if (nr_err == 0) {
    print "File \"" ARGV[1] "\" is valid."
  }
  else {
    print "Following errors were found:"
    for (j = 1; j <= nr_err; j++) {
      printf("#%02d : %s\n", j, errors[j])
    }
  }
}

# -- functions --
function chomp_line() {
  # strip the carriage return or line feed at the end of the current line
  # this function modifies the global variable $0 (current line)
  sub(/\r$/, "", $0)
  sub(/\n$/, "", $0)
}

For the given file the script delivers this result:
Code:
$ awk -f bxwill.awk bxwill.txt
Following errors were found:
#01 : Line 2 : Unexpected number of columns =10 !
#02 : Line 3 : Unexpected value of sale_date = "***********"!
#03 : Line 4 : Unexpected value of equipment_flag = "New"!
#04 : Line 5 : Unexpected value of sale_date = "ABC"!

When I use the corrected file
bxwill_correct.txt
Code:
No, [Account_Number],[Sequence_Number],[Equipment_Flag],[State_Code],[Tracking_Number],[Sale_Date],[Sale_Amount],[Purchase_Zipcode]
  1,            1345,                2,N               ,Ca          ,           233432,   20170215,        10000,             90001
  2,            1456,                1,N               ,Ca          ,            23678,   20180214,       120000,             90003
  3,           15443,                4,N               ,Ca          ,            58973,   20170402,       498211,             85007
  4,           58722,                1,O               ,Az          ,            87963,   20180101,       449390,             85011

then the script delivers this:
Code:
$ awk -f bxwill.awk bxwill_correct.txt
File "bxwill_correct.txt" is valid.
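The same rule pattern extends to the other fields from the original list, e.g. the zipcode. A minimal sketch, assuming the zipcode is field 9 and "valid" simply means five digits (validating it against the city would need a lookup table, which is not shown):

```shell
# Two dummy records: a valid zipcode (90001) and an invalid one (9000A)
printf '1,1345,2,N,Ca,233432,20170215,10000,90001\n2,1456,1,N,Ca,23678,20180214,120000,9000A\n' |
awk -F ',' '{
    zip = $9
    # portable five-digit test (avoids interval expressions, which
    # old gawk 3.x only supports with --re-interval)
    if (length(zip) != 5 || zip !~ /^[0-9]+$/)
        print "Line " NR " : bad zipcode \"" zip "\""
}'
```

For this input the script reports only the second record: `Line 2 : bad zipcode "9000A"`.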
 
Yes, I misstated - the line should have been "NF != 9."

Appreciate the insight!

Had no idea whatsoever that AWK could do this - or that any application/program could.

Definitely appears that this is worth learning! Although, the learning curve appears steep...

Stepping through each line of the code and will probably have to obtain a book for reference before I fully understand what is going on.

For large text files, such as near 1,000,000 records and over 180 columns, will AWK still be able to process them with ease?

By the way, how were you able to present the tabular data as a table?
 
BxWill said:
Definitely appears that this is worth learning! Although, the learning curve appears steep...
But learning any programming language requires a bit of effort, and AWK is one of the simplest.

BxWill said:
For large text files such as near 1,000,000 records and over 180 columns, will AWK still be able to process with ease?
I have never had such huge files, so I have zero experience with them. Try it and you will see...
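One thing can be said in general: awk processes a file line by line and keeps only the current record in memory, so file size mostly affects run time rather than feasibility. A quick way to get a feel for it is to generate a dummy file and run a check over it (the sizes here are illustrative):

```shell
# generate 1,000,000 dummy comma-delimited records (roughly 10 MB)
awk 'BEGIN { for (i = 1; i <= 1000000; i++) print i ",x,y" }' > big.txt

# a full field-count pass over the file; on current hardware a scan
# like this typically finishes in well under a second
awk -F ',' 'NF != 3 { bad++ } END { print bad + 0, "bad records out of", NR }' big.txt
```

For the generated file this prints `0 bad records out of 1000000`.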

BxWill said:
By the way, how were you able to present the tabular data as a table?
What do you mean by tabular data? Data separated by tabs (character \t)?
The best would be for you to post an example of what your input data looks like and what the desired output would be.


 
I was referring to how you displayed the data in aligned columns.

When I pasted the data from MS Excel, I was not able to display it in aligned columns...
 
Still not able to successfully download and use Gawk for Windows.

Clicked "Gawk for Windows" in mikrom's post and then clicked "Setup" next to "Complete package except sources."

At this point, I assumed that the setup was completed.

Is it necessary to download "Sources?"

Currently at the command line prompt...

Will re-read the Instructions - GnuWin Installation and Usage...

Any additional insight as to what I may be overlooking is appreciated.

 
You don't need the sources. If the setup doesn't work, then download the ZIP with the binaries.
 
Now I tried the setup program.
I clicked the Setup link and the setup program gawk-3.1.6-1-setup.exe was downloaded. I ran the exe and installed awk. It seems to work:

Code:
c:\Program Files (x86)\GnuWin32\bin>awk --version
GNU Awk 3.1.6
Copyright (C) 1989, 1991-2007 Free Software Foundation.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
 