Best way to approach this

younixgeek · Dec 6, 2005

Hi Guys I need some help from the experts here. I need to convert an unformatted text report into any kind of CSV or field delimited file. I have been playing with this for hours and I keep hitting stumbling blocks. Can you all tell me how you would tackle this? All I have in my skill set to work with is a moderate level of SED, AWK, or KSH. I am not sure if I should do this all in SED, or an AWK array, or both. All of your suggestions and ideas much appreciated
here is a sample file with on record:

-----------------------------------------------------------------------
11/21/2005 DISTRIBUTOR Page:1
Client Listing
========================================================================

300W-3 300 W. CLIENT NAME Count: 1
Location #: 000 Phone #: (202)555-7414

Shipping Address: Billing Address:
300 W. ELM AVE. #110 300 W. ELM AVE. #110
PHOENIX AZ 85330-0000 PHOENIX AZ 85330-0000

Contact Name: JOHN SMITH Title:
Assigned To: 007
Customer Type: BR BOOTH RENTER
Secondary Customer Types
RRITORY 1 :
il Salon Services :
ova 10% Disc Cust :
ruples :
cleic A :
RRITORY :
TAX EXCEMPT CUST. :
COMAPNY INTEGRITY :
:
:

Year To Date: Number of Orders: 1 Purchase: 69.95
Start To Date: Number of Orders: 1 Purchase: 69.95
Current Balance: 0.00
Customer Since: 09/03/03
Last Order: 09/04/03
Last Payment: 09/15/03

Parent Customer ID: Alias:

Default Information
Delivery Code: VN4 Warehouse: 1
Salesman: 63

Discount Information
Price Level: 6 Matrix Code:
Term Code: 2 Discount Code: 0

Tax Information
Sales Tax Code: SCT Tax Exempt Number:

Credit Information
A/R Code: 1
Finance Charge Code: FC
Credit Limit: $ 0 Statement Type: 1
COD Credit Limit: $ 0 Credit Status: 1
COD Delinquent Over: 20 Days
Number of Returned Checks: 0 Last Returned Check:
-----------------------------------------------------------------------------

Thanks
-YG

SamBones · Dec 6, 2005

What do you have so far? What problems are you having.

I would start by analyzing the input. Do things start at fixed locations (i.e. account number always on 6th line), are there consistent titles in front of the fields you want to capture (i.e. "Contact Name:"), do you need all fields or just some? And on and on.

It looks like you're stripping reports to get data. Is it possible to get a file in the format you need from the system that's making these reports? Is it possible to get a data feed directly to where it needs to go (database to database)? Either of those options will be much less error prone that trying to parse reports to make your file. If you are parsing reports, every change in the report format will break your "feed".

Annihilannic · Dec 6, 2005

It would help if you supplied the desired output for the example record as well.

Annihilannic.

younixgeek · Dec 7, 2005

Sam what I have are a few snippets of code and ideas to tackle this...Nothing tangible. Right now my thinking is using the dashed lines as an indicator of the beginning of a new record (the dashes are consistent record breaks through the whole file) using SED. Pass that to AWK to read in each line as a string, and depending on what is in that line using a substrings to dice up the line into fields and print the fields needed in the format of my choice for the whole record, and then start over again at the next customer record. I have never done this previously, and before I get to deep I wanted some other opinions. Something is telling me that using an array in ksh or awk may be a better solution.

Another thought was to match any string ending with a : and print all the characters following it until the next string with a : That would get me almost everything I need except for the first 6 lines.

Regarding the consistency, I would have to do some more spot checks on the reports to verify that the specific line numbers stay consistent, but I agree that’s kind of a shaky way to do this. If one extra line char is in there the whole thing will be thrown off. That’s why I was kind of thinking along the lines of reading lines in as strings, and then having "some action" take place on that string, so if it happened to read rouge line it wouldn’t throw everything off.

Unfortunately for me the source of the data is an old DOS based CRM app that does not have any export tools. Trust me I searched high and low for something internal. This is a very niche Salon industry application that is not very wide spread. Contacting the developer is not an option. Unless you are upgrading to their own newer version, you can forget about any assistance unless your paying T & M.

As far at the output goes, anything with fields. It could be like: CUST_ID|ADDRESS|PHONE|CONTACT|NUM_OF_ORDERS,etc The new software they are moving to is looking to read in formatted data from a spreadsheet or CSV.

Well thanks for taking the time guys. I know its a high altitude order, but I just need a kick start.

SamBones · Dec 7, 2005

Well, if you have a budget for tools, there is a class of tools called "report scraping" tools. These are often used as part of ETL for Data Warehousing. With these tools you can usually take one of your sample reports, and just highlight the fields you want and define how you want the output. It does the hard part of parsing and collecting the data from the report.

This is an example: Monarch Report Mining

Also try Googling "Report Scraping".

Hope this helps.

PHV · Dec 7, 2005

Similar issues already solved in the awk forum: Forum271

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886

younixgeek · Dec 9, 2005

Thanks guys,

I will scan the awk forum and check out some commerical tools.

-YG

Annihilannic · Dec 9, 2005

This should be fairly easy to do in awk or perl provided the report format is fairly consistent; I don't see the need to go commercial.

Annihilannic.

mrn · Dec 9, 2005

1. Delete blank lines

sed /^$/d

2. Delete lines with multiple --

sed /^--*/d

3. Delete lines with multiple ==

sed /^==*/d

4. Change multiple tab or spaces to one space

sed s/[ ][ ]/ /g

The first [ ] contains a space then a tab the second [ ]is reversed a tab then a space.

5. Change spaces to comma

sed s/ /,/g

Syntax above is not spot on but gives you an idea.

It may be worth while finding the line with the most words and padding the other lines with a , as needed.

Mike

"A foolproof method for sculpting an elephant: first, get a huge block of marble, then you chip away everything that doesn't look like an elephant."

http://www.airport-parking-site.co.uk/

younixgeek · Dec 27, 2005

Just wanted to update the thread. Thanks for all of your input for future readers.

A co-worker came up with this Perl solution:

#!/usr/bin/perl

use strict;

my $file = $ARGV[0];
my %xml;

unless ($ARGV[0]) {
print "Usage:\n $0 <file.PRN>\n";
exit;
}

open(FILE, "<$file") or die "Can't open $file!\n";

my @lines;

print "<?xml version=\"1.0\"?>\n<client_listing>\n";

while (<FILE>) {
if (/(.*?)Count:/) {
$xml{'name'} = $1;
}
if (/Location \#

.*?)Phone \#

.*?)\n/) {
$xml{'location'} = $1;
$xml{'phone'} = $2;
$xml{'phone'} =~ s/[\-]//g;
}
if (/Shipping Address:/) {
while (<FILE>) {
chomp;
$xml{'shipping_address'} .= substr($_,0,41) . "\n" if (/./);
$xml{'billing_address'} .= substr($_,41) . "\n" if (/./);
last unless (/^\s/); # Break out of this inner while loop
}
}
if (/Contact Name

.*?)Title

.*?)\n/) {
$xml{'contact_name'} = $1;
$xml{'title'} = $2;
}

if (/Assigned To

.*?)\n/) {
$xml{'assigned_to'} = $1;
}
if (/Customer Type

.*?)\n/) {
$xml{'customer_type'} = $1;
}
if (/Secondary Customer Types/) {
while (<FILE>) {
chomp;
s/://g;
$xml{'secondary_customer_types'} .= $_ . "\n" if (/./);
last unless (/^\s/); # Break out of this inner while loop
}
}
if (/Start To Date

.*?)Number of Orders

.*?)Purchase

.*?)\n/) {
$xml{'std_orders'} = $2;
$xml{'std_purchase'} = $3;
}
if (/Year To Date

.*?)Number of Orders

.*?)Purchase

.*?)\n/) {
$xml{'ytd_orders'} = $2;
$xml{'ytd_purchase'} = $3;
}
if (/Current Balance

.*?)\n/) {
$xml{'current_balance'} = $1;
}
if (/Customer Since

.*?)\n/) {
$xml{'customer_since'} = $1;
}
if (/Last Order

.*?)\n/) {
$xml{'last_order'} = $1;
}
if (/Last Payment

.*?)\n/) {
$xml{'last_payment'} = $1;
}
if (/Parent Customer ID

.*?)Alias

.*?)\n/) {
$xml{'parent_customer_id'} = $1;
$xml{'alias'} = $2;
}
if (/Delivery Code

.*?)Warehouse

.*?)\n/) {
$xml{'delivery_code'} = $1;
$xml{'warehouse'} = $2;
}
if (/Salesman

.*?)\n/) {
$xml{'salesman'} = $1;
}
if (/Price Level

.*?)Matrix Code

.*?)\n/) {
$xml{'price_level'} = $1;
$xml{'matrix_code'} = $2;
}
if (/Term Code

.*?)Discount Code

.*?)\n/) {
$xml{'term_code'} = $1;
$xml{'discount_code'} = $2;
}
if (/Sales Tax Code

.*?)Tax Exempt Number

.*?)\n/) {
$xml{'sales_tax_code'} = $1;
$xml{'tax_exempt_number'} = $2;
}
if (/A\/R Code

.*?)\n/) {
$xml{'ar_code'} = $1;
}
if (/Finance Code Charge

.*?)\n/) {
$xml{'finance_code_charge'} = $1;
}
if (/Credit Limit

.*?)Statement Type

.*?)\n/) {
$xml{'credit_limit'} = $1;
$xml{'statement_type'} = $2;
}
if (/COD Credit Limit

.*?)Credit Status

.*?)\n/) {
$xml{'cod_credit_limit'} = $1;
$xml{'credit_status'} = $2;
}
if (/COD Delinquent Over

.*?)\n/) {
$xml{'cod_delinquent_over'} = $1;
}
if (/Number of Returned Checks

.*?)Last Returned Check

.*?)\n/) {
$xml{'number_of_returned_checks'} = $1;
$xml{'last_returned_check'} = $2;
}
if (/-----------------------------------------------------------------------------/) {
print "\t<client>\n";
for (keys %xml) {
$xml{$_} =~ s/\s+$//; # Remove trailing whitespace
$xml{$_} =~ s/^\s+//; # Remove leading whitespace
$xml{$_} =~ s/\&/\&/g; # Convert ampersands to their proper xml equiv
print "\t\t<$_>$xml{$_}</$_>\n";
}
print "\t</client>\n";
%xml = (); # Initilize hash for next client record
#exit;
}
}

print "</client_listing>\n";
close FILE;

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Best way to approach this

younixgeek

IS-IT--Management

SamBones

Programmer

Annihilannic

MIS

younixgeek

IS-IT--Management

SamBones

Programmer

PHV

MIS

younixgeek

IS-IT--Management

Annihilannic

MIS

mrn

MIS

younixgeek

IS-IT--Management

Similar threads

Part and Inventory Search

Sponsor