Field Delimited Flat File Processing

SmileeTiger · Nov 29, 2005

I have a flat file of the following format:
field1: data1.1 field2: data2.1 field4: data4.1
field1: data1.2 field3: data3.2 field5: data5.2

That I want to convert to a delimited file so that I can import it into a spreadsheet with the following format:
field1 field2 field3 field4 field5
data1.1 data2.1 data4.1
data1.2 data3.2 data5.2

(Where there are tabs between columns)
I know the column headers that are present in the flat file ahead of time.

Does anyone have a good way to do this in perl?

KevinADC · Nov 29, 2005

post some of the real data in the file you wish to convert. It can be done, but your sample data looks too contrived to give more than a generic answer.

SmileeTiger · Nov 30, 2005

Here is a sample from my real data file:
title: Harry Potter and the Sorcerer's Stone (Book 1) author: J.K. Rowling pub-date: 1998 format: hardcover
title: Harry Potter and the Sorcerer's Stone (Book 1) author: J.K. Rowling pub-date: 1998 format: paperback publisher: Arthur A. Levine Books

The order of the header fields that I want would be:
title author pub-date format publisher

If a field is missing from the data file then it should be skipped.

rharsh · Nov 30, 2005

Are all the field names a single word (a string containing no white space) immediately followed by a colon?

SmileeTiger · Nov 30, 2005

varakal · Nov 30, 2005

This should give you a starting point:

Code:

#!/usr/bin/perl
use strict;

my @fields = ("title","author","pub-date","format","publisher");
my %fields_hash = ("title",1,"author",1,"pub-date",1,"format",1,"publisher");

while (<>) {
	my $string = $_;
	my (@pairs) = split(/:/, $string);
	my @sequence = ();
	for (my $i=0; $i <= $#pairs; $i++) {
		if (($i==0) || ($i==$#pairs)) {
			push(@sequence,$pairs[$i]);
			next;
		}
		my (@twos) = $pairs[$i] =~ m/(.*) (.*)/;
		if ($fields_hash{"$twos[1]"} == 1) {
			push(@sequence,(@twos));
		} else {
			$#sequence += 1;
			$sequence[$#sequence] = $pairs[$i];
		}
		
	}

	for (@sequence) {
		if ($fields_hash{"$_"} == 1) {
			print "\n$_: ";
		} else {
			print "$_";
		}
	}
}

data I tested for is:

Code:

title: Harry Potter and the Sorcerer's Stone (Book 1) author1: J.K. Rowling pub-date: 1998 format: hardcover
title: Harry Potter and the Sorcerer's Stone (Book 1) author: J.K. Rowling pub-date: 1998 format: paperback publisher: Arthur A. Levine Books

Output:

Code:

title:  Harry Potter and the Sorcerer's Stone (Book 1) author1 J.K. Rowling
pub-date:  1998
format:  hardcover
 
title:  Harry Potter and the Sorcerer's Stone (Book 1)
author:  J.K. Rowling
pub-date:  1998
format:  paperback publisher Arthur A. Levine Books

You can do printing as your wish.

VaRaKal

varakal · Nov 30, 2005

Just noticed I missed publisher.
Make the following change:

Code:

my %fields_hash = ("title",1,"author",1,"pub-date",1,"format",1,"publisher",1);

rharsh · Nov 30, 2005

Here, see how this works for you. What is the output of this going to be used for? This code delimites the fields with pipe symbols, it's easy to change it to a tab or whatever characters you want.

If you want everything to be in nicely spaced columns, it will take a bit more work.

Code:

my @headers = qw/title author pub-date format publisher/;
my $delim = '|';
print join($delim, @headers), "\n";

while (<DATA>) {
    # Assumes book records always have a title
    if (index($_, 'title:') > $[-1) {
        my %hash;
        while (/(\S+):\s*(.+?)(?=(\s*\S+:)|(\s*$))/g) {
            $hash{$1} = $2;
        }
        print join($delim, map {$hash{$_} || ""} @headers), "\n";
    }
}

SmileeTiger · Dec 2, 2005

Thanks so much for the help that does exactly what I need.

I am using the script to parse a dump from an old book data storage system that I have and I am importing it into a spreadsheet so I can work with the data a little more easily. The old application is pretty clunky

.

My plan is to eventually replace the spreadsheet with a database and writing a small application around the data.

Thanks again!

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Field Delimited Flat File Processing

SmileeTiger

Programmer

KevinADC

Technical User

SmileeTiger

Programmer

rharsh

Technical User

SmileeTiger

Programmer

varakal

IS-IT--Management

varakal

IS-IT--Management

rharsh

Technical User

SmileeTiger

Programmer

Similar threads

Part and Inventory Search

Sponsor