
formatting character length of a line in a txt file

Status
Not open for further replies.

invalid6363

Programmer
Mar 21, 2008
24
US
i need help writing some code that will take each line of a file and check that it has the right number of characters (150 max). Each line in the file contains 5 fields, and each field has an assigned range of character positions:

name (0-49)
address (50-59)
phone number (60-73)
status (74-76)
account (77-149)

if any of the fields contains fewer than the max number of characters, then the field should be right-padded with spaces.
i know that i can use the built-in length function to get the total, but i'm not sure how to check the individual fields.
i was thinking i would first split $line into an array and assign the elements to variables, then check the length of each one and add spaces if needed using the sprintf function.

if(length($line) < 150)
{
@line = split(/,+/,$line);
($name,$address,$phone,$status,$account) = @line;

if(length($name) < 50)
{
$pad_len = 50 - length($name);
$padded = sprintf("%-*s", $pad_len, $name);
}
...repeat process for other fields

}

but looking at this, it seems that it will print each of the fields one at a time on a different line in the output file, instead of all on one line.

does anybody have a better way to do this or any suggestions on how to improve this code?
any suggestions would be greatly appreciated.
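
One way to keep all five fields on a single output line is to build the whole record with one sprintf and print it once. The following is only a minimal sketch, assuming the widths from the layout above (50, 10, 14, 3, 73); the sub name pad_record is made up for illustration. Note that the width passed to %-*s is the total width of the column, not the number of spaces to add, and that sprintf will not truncate a value that is too long.

```perl
use strict;
use warnings;

# Field widths taken from the layout in the question:
# name, address, phone number, status, account.
my @widths = (50, 10, 14, 3, 73);

# pad_record is a hypothetical helper: it left-justifies each field
# in its column and returns the whole record as one string.
sub pad_record {
    my @fields = @_;
    my $format = '%-*s' x @widths;      # one %-*s per column
    my @args;
    # Each %-*s consumes a width followed by a value.
    push @args, $widths[$_], $fields[$_] for 0 .. $#widths;
    return sprintf $format, @args;
}

my $record = pad_record('Joe Smith', '1 Uni Dr.', '(805) 555-1234', '001', 'AW87124356');
print length($record), "\n";   # 150
```

Printing $record followed by a single "\n" then gives one fixed-width line per input record.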
 
Here is the majority of the code and it does work but not as it should:

#Format 2
#Each line is fixed at 150 characters in length.
#Each field has an assigned number of characters. If the
#data for that field contains less than the maximum number of characters, that field will be right-padded
#with spaces.

elsif(length($myString) < 150){

@line = split(/[,|\s]+/,$line);
($name,$account,$phone,$status,$address) = @line;

if(length($name) < 50)
{
$pad_len = 50 - length($name);
$padded = sprintf ("%-*s", $pad_len, $name);
print outfile $padded . "\n";
}
elsif(length($address) < 10)
{
$pad_len = 10 - length($address);
$padded = sprintf ("%-*s", $pad_len, $address);
}
elsif(length($phone) < 14)
{
$pad_len = 14 - length($phone);
$padded = sprintf ("%-*s", $pad_len, $phone);
print outfile $padded . "\n";
}
elsif(length($status) < 3)
{
$pad_len = 3 - length($status);
$padded = sprintf ("%-*s", $pad_len, $status);
print outfile $padded . "\n";
}
elsif(length($account) < 73)
{
$pad_len = 73 - length($account);
$padded = sprintf ("%-*s", $pad_len, $account);
print outfile $padded . "\n";
}

else{

print outfile "$name,$account,$phone,$status,$address";

}
}
 
Is this to make it look 'pretty' on a printed page?
If so, why not put the results into a table whose columns are the correct width for the screen, or a grid of divs? That way, you could just print one field per cell and have them all lined up vertically.

Keith
 
You describe your input file, and what the formatted output needs to look like. But in the first code you are splitting on commas, and in the second on commas and white space. Because you are using '+' it will collapse multiple field separators into one, so your fields could end up in the wrong columns if you have an empty field in one of your records.
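
To illustrate the collapsing problem, here is a small sketch with a made-up record whose address field is empty. The sample data and variable names are assumptions for the demonstration only.

```perl
use strict;
use warnings;

# A record whose second field (address) is empty.
my $record = 'Joe Smith,,(805) 555-1234,001,AW87124356';

# /,+/ treats the two adjacent commas as one separator,
# so the empty field disappears and the rest shift left.
my @collapsed = split /,+/, $record;

# /,/ keeps the empty field; the -1 limit also preserves
# trailing empty fields, which split would otherwise drop.
my @kept = split /,/, $record, -1;

print scalar(@collapsed), "\n";   # 4
print scalar(@kept), "\n";        # 5
```

With the first split the phone number lands in the address slot; with the second every field stays in its own position.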

The output side is fairly trivial, but parsing the input could be more difficult. Can you post a small sample of input records for us to use (anonymised if necessary), and we can have a go at it.



Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
First remark: the regex under split will discard a field with zero length (you didn't specify whether this can occur) messing up the rest of the line
Second remark: your figures are wrong, there is no place for the field separators in your line length
Third remark: line length should preferably not include the line separator, but you must chomp your line before processing
Fourth remark: as you will space pad your data, you shouldn't use a space as the field separator in the input file. You should also decide the format for the resulting data: as they are now fixed length, you could use no field separator, but a comma will also be fine (assuming of course there are no commas in the fields). You should also include a line separator in printing to the outfile.
Making reasonable assumptions on the above points, this is more or less what you need (untested):
Code:
my@field_lengths=(50,10,14,3,73);
my$line_length=$#field_lengths;
map{$line_length+=$_}@field_lengths;
while(<INFILE>){
  chomp;
  die"Line $_ too long\n"if(length)>$line_length;
  if((length)<$line_length){
    my@line=split/,/;
    die"Line $_ corrupted\n"unless@line==@field_lengths;
    for(my$i=0;$i<@line;$i++){
      die"Field $line[$i] in line $_ too long\n"if length($line[$i])>$field_lengths[$i];
      $line[$i]=sprintf("%-$field_lengths[$i]s",$line[$i]);
    }
    print OUTFILE join('',@line),"\n";
  }else{
    print OUTFILE $_,"\n";
  }
}
Some performance improvements are possible in the above code, if you need to process very large amounts of data.

Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
 
here is an example of how the input data could look in the text file:

Joe Smith, 1 Uni Dr. Camarillo, CA. 93012., (805) 555-1234, AW87124356, 001
Joe Smith AW87124356(805) 555-12340011 Uni Dr. Camarillo, CA. 93012.
Joe Smith (805) 555-1234 1 Uni Dr. Camarillo, CA. 93012. 001 AW87124356

and this is how it should look:

AW87124356 Joe Smith 001 (805) 555-1234 1 Uni Dr. Camarillo, CA. 93012.
AW87124356 Joe Smith 001 (805) 555-1234 1 Uni Dr. Camarillo, CA. 93012.
AW87124356 Joe Smith 001 (805) 555-1234 1 Uni Dr. Camarillo, CA. 93012.

the text file contains the following info: name, address, phone number, status, and account number. the text file needs to be formatted so the data is cleaned up and displayed in the following order: account number, name, status, phone number, and address.

i thought i could use the commas or spaces as delimiters to separate these individual pieces of data and then rearrange them into the correct format.

thank you for your help.
 
With so many different formats it's not necessarily doable. You don't have fixed length and you have no recognizable separator, as the space is also used within the fields, and in places there is no separator at all.
To decide whether you have any chance at all, you need to clearly define everything that's constant across the different formats (even the allowed relative positions of fields may count). Examples:
-is the account number always prefixed with AW and followed by exactly 8 figures?
-is the status composed of exactly 3 figures?
-is the phone number always prefixed by a parenthesized group of figures followed by a space followed by 3 figures followed by a dash followed by 4 figures?
These are only examples of course, and even if all the answers were yes, you can't be certain of success, as the other two fields (name and address) almost inevitably have no specific format.

Franco
 
The text file looks like it has been combined from a number of sources. Can you get your hands on the original data, before it was combined? It might be easier to process the files individually into a standard form.

Another alternative is to see if there is a reliable way to determine which of the three (that we can see so far) record types you have on each line, and then process each one in a dedicated subroutine.

Steve

 
These are the fields that make up a data record:
Name: Customer names may contain any alphabet letter, spaces, hyphens, and apostrophes.
Address: May contain any letter or number, spaces, and the following symbols: # , ' :. - /
Phone Number: Will always be in format (###) ###-####
Status: 3 digit integer
Account Number: 8 digits, optionally prefixed by two capital letters
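
Written as code, these rules might look like the sketch below. It is only one reading of the list above: the hash %valid and the helper field_ok are hypothetical names, and the patterns assume each field has already been isolated and trimmed.

```perl
use strict;
use warnings;

# One anchored pattern per field, written from the rules above.
my %valid = (
    name    => qr/^[A-Za-z '\-]+$/,
    address => qr/^[A-Za-z0-9 #,':.\-\/]+$/,
    phone   => qr/^\(\d{3}\) \d{3}-\d{4}$/,
    status  => qr/^\d{3}$/,
    account => qr/^(?:[A-Z]{2})?\d{8}$/,
);

# field_ok is a hypothetical helper: true when $value obeys the rule for $field.
sub field_ok {
    my ($field, $value) = @_;
    return $value =~ $valid{$field} ? 1 : 0;
}

print field_ok(phone   => '(805) 555-1234'), "\n";   # 1
print field_ok(account => 'AW8712435'),      "\n";   # 0 (only seven digits)
```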

the length of the line in the input file is one of the formats that i am checking for.

in the first format, the fields are delimited by commas and appear in the following order: name,address,phone number,account number,status.

the second format is the fixed line length described in my previous post, with the fields in the following order: name,account number,phone number,status,address.

in the third format there is no delimiter except for a space between each field, and the fields appear in the following order: name,phone number,address,status,account number.

if one of the three formats is found, the line of input is cleaned up and written to OUTFILE in the following order: account number,name,status,phone number,address.

if there is a format that is not recognized, then it is written to ERRORFILE instead of OUTFILE.
 
Here is what the code looks like so far:
Code:
#!/usr/bin/perl

open(INFILE, "input.txt");
open(OUTFILE, ">output.txt");
open(ERRORFILE, ">errorfile.txt");

#performs data cleanup and formatting functions on a semi-structured input
#file.

#name = [A-Za-z\s-']+
#phone number = .[\d]{3}.\s?[\d]{3}-[\d]{4}
#account number = ([A-Z]{2})?[\d]{8}
#address = [\w\W]+
#status = [\d]{3}


sub trim($);

my @field_lengths = (50,10,14,3,73);
my $line_length = $#field_lengths;
map{$line_length+=$_}@field_lengths;

while($line = <INFILE>){ 
    
    chomp($line);   
    $myString = trim($line);
   
#Format 1
#Fields are delimited by comma and appear in the following order:
#Name,Address,Phone Number,Account Number,Status

   if($myString =~ /^[A-Za-z\s-']+,+[\w\W]+,+.[\d]{3}.\s?[\d]{3}-[\d]{4},+[\d]{3},+([A-Z]{2})?[\d]{8}$/){
      
       @line = split(/,+/,$myString);
       ($name,$address,$phone,$account,$status) = @line;   
           
       print OUTFILE "$account,$name,$status,$phone,$address\n";
                
   } #end of first format
    
#Format 2
#Each line is fixed at 150 characters in length. 
#Each field has an assigned number of characters. If the
#data for that field contains less than the maximum number of characters, that field will be right-padded
#with spaces.

    elsif((length($myString) < $line_length) && ($myString =~ /^[A-Za-z\s-']+,+([A-Z]{2})?[\d]{8},+.[\d]{3}.\s?[\d]{3}-[\d]{4},+[\d]{3},+[\w\W]+$/)){
           
       my @line = split(/,/,$myString);
        
       for(my $i = 0;$i < @line;$i++){
      
           $line[$i] = sprintf("%-$field_lengths[$i]s",$line[$i]);
       }
    
       ($name,$account,$phone,$status,$address) = @line;
        
       print OUTFILE "$account,$name,$status,$phone,$address\n";
      
       
    }  #end of second format
    
#Format 3
#There is no delimiter or fixed field length but the fields always go in this order:
#Name, Phone Number, Address, Status, Account Number
#The space is not a delimiter but it is guaranteed that there will be at least one space between each field.

   elsif($myString =~ /^[A-Za-z\s-']+\s+.[\d]{3}.\s?[\d]{3}-[\d]{4}\s+[\w\W]+\s+[\d]{3}\s+([A-Z]{2})?[\d]{8}$/){
  
     @line = split(/\s+/,$myString);
     ($name,$phone,$address,$status,$account) = @line;      
      
     print OUTFILE "$account,$name,$status,$phone,$address\n";
  
   } #end of third format
    
   else{
          
      print ERRORFILE "$myString\n";
  
   }
  
   #the while condition above already reads and assigns the next line

} #end of while statement

sub trim($)
{
  my $string = shift;
  $string =~ s/^\s+//;
  $string =~ s/\s+$//;
  return $string;
}

close(INFILE);
close(OUTFILE);
close(ERRORFILE);
 
There are various problems with your code. An example is the lines
[tt]@line = split(/,+/,$myString);
($name,$address,$phone,$account,$status)=@line;[/tt]
The + after the comma will discard any null fields; also, as a comma may be present in address, you simply can't split on commas without further checking on the result.
I would attack this problem in steps, as we always should when writing code.
First step is how to select the correct format (recognizing an incorrect one), second step is how to extract fields for each format (also recognizing bad formats in the fields).
First step: first thing, it is essential to know whether blank fields may exist in your data and which ones. I'll assume in the following that all fields will have at least one non blank character.
First format may be recognized as having the status field in the last place, as this field cannot be confused with others. So [tt]if($myString=~/\W\d{3}$/){[/tt] this will be considered as the first format.
Third format may be recognized as having 8 digits at the end, and, if this is not sufficient (the address might contain 8 consecutive digits?), as having before them two optional uppercase letters and, before those, a space, three digits and a space. So [tt]}elsif($myString=~/\s\d{3}\s([A-Z]{2})?\d{8}$/){[/tt] this will be considered as the third format.
Second format may be recognized by length: so [tt]}elsif(length($myString)==150){[/tt].
Of course if none is satisfied this will be an unrecognized format
Code:
}else{
  print ERRORFILE$myString,"\n";
  next;
}
Now comes the second step (for each format, but I'll treat only the first one, leaving to you the remaining).
The first format may be split on commas, but a check is required on the number of fields (I'll assume also that you don't need to check the validity of each single field). Here it comes:
Code:
if($myString=~/\W\d{3}$/){
  my@line=split(/,/,$myString);
  if(@line>@field_lengths){
    $status=pop@line;
    $account=pop@line;
    $phone=pop@line;
    $name=shift@line;
    $address=join(',',@line);
  }else{
    ($name,$address,$phone,$account,$status)=@line;
  }
You should then format the output line at the end of the if's (that's why the [tt]next;[/tt] above), trimming every single field, space padding and writing to OUTFILE (without commas, as per your specification!).
A minor closing comment: if you are looking for efficiency, your trim function is not very efficient, as you pass an entire string as the argument; you should trim in place or pass a reference to the [tt]sub[/tt]
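
A rough sketch of the trim-in-place idea, assuming a sub called trim_in_place (a hypothetical name, not part of the code above). It relies on the fact that the elements of @_ are aliases of the caller's variables.

```perl
use strict;
use warnings;

# trim_in_place is a hypothetical name. The elements of @_ are aliases
# of the caller's variables, so editing them changes the originals
# and nothing is copied or returned.
sub trim_in_place {
    for (@_) {
        s/^\s+//;
        s/\s+$//;
    }
    return;
}

my ($name, $phone) = ('  Joe Smith ', ' (805) 555-1234  ');
trim_in_place($name, $phone);
print "[$name] [$phone]\n";   # [Joe Smith] [(805) 555-1234]
```

The same sub can take all five fields in one call, which also removes the repeated trim(...) calls on the output line.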

Franco
 
Sorry, I was partly wrong above: the status field may be confused with the end of the address; so first format is better recognized as [tt]if($myString=~/\d{8}\,\s*\d{3}$/){[/tt] , assuming of course that that sequence may not occur in the address.
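
Putting the three tests together (with the corrected test for the first format), the selection step could be sketched as below. The sub name detect_format and its numeric return values are assumptions for illustration; the length test only makes sense on the line before trailing spaces are trimmed.

```perl
use strict;
use warnings;

# detect_format is a hypothetical helper. It returns 1, 2 or 3 for the
# formats discussed in this thread, or 0 when nothing matches.
sub detect_format {
    my ($line) = @_;
    return 1 if $line =~ /\d{8},\s*\d{3}$/;                # ...,account,status
    return 3 if $line =~ /\s\d{3}\s(?:[A-Z]{2})?\d{8}$/;   # ... status account
    return 2 if length($line) == 150;                      # fixed width
    return 0;
}

print detect_format('Joe Smith, 1 Uni Dr. Camarillo, CA. 93012., (805) 555-1234, AW87124356, 001'), "\n";   # 1
print detect_format('Joe Smith (805) 555-1234 1 Uni Dr. Camarillo, CA. 93012. 001 AW87124356'), "\n";       # 3
```

A return value of 0 corresponds to the ERRORFILE branch.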

Franco
 
Have a star, Franco. Also, I was surprised to see that
Perl:
$account=pop@line;
actually parses correctly. I must admit I like to put the spaces in just to make it easier for me to read, like
Perl:
$account = pop @line;

Steve

 
One other thought - Considering the example input data, I'm guessing that the programmer has to assume a hostile data environment. The basic data rules I'm seeing are:
- Comma separated data
- Field ordering: name, addr, phone, status, account

If so, the following may be of use:
*** In the opening comments of your program, state what you will accept as valid data. This will save you from trying to develop obscure logic paths that never get used.
Example:
NAME - Must not contain commas and be in common written format (first [middle] last)
This will save you from trying to worry about "last name, first name" type formats as well as "Mr. Billy Bob Cliffhutch III Phd., Esq."

ADDRESS - Commas will be allowed in this field only

PHONE - 0-9 * # () - blanks and P's are allowed.
If the utility supplying the data allows commas to represent a pause in the dialing sequence (many dialing programs do), require it to be replaced with the letter P prior to sending to this program. From there your program will know what to do with the letter P in your var if needed.

STATUS - Any character except commas

ACCOUNT - Any character except commas

The next hurdle will be to deal with the field you've allowed commas in, the address. From here I've had to do things like stripping off chunks from the front and back to parse through a string. Consider:

# Please excuse the Conan style programming here - its
# purpose is to get an idea across. Note that I'm not
# using the length != 150 construct here seen elsewhere
# as we could have a string like: "<146 chars>,,,," which
# is an error but was being accepted.
#########################################################
sub line_parser1 {
# assume global vars below
if ($line !~ /,/) { &gen_error() ; return 0 ;};
($name, $sub_line) = ($line =~ /^([^,]*),(.*)$/);
if (! &format_str(\$name, 50)) {
&gen_error() ; return 0;
};
# Note pass by ref above.

# Here I'm parsing off the end of the string. This is
# useful because of the way we've allowed commas in the
# address.
if ($sub_line !~ /,/) { &gen_error() ; return 0 ;};
($sub_line, $account) = ($sub_line =~ /^(.*),([^,]*)$/);
if (! &format_str(\$account, 73)) {
&gen_error() ; return 0;
};

if ($sub_line !~ /,/) { &gen_error() ; return 0 ;};
($sub_line, $status) = ($sub_line =~ /^(.*),([^,]*)$/);
if (! &format_str(\$status, 3)) {
&gen_error() ; return 0 ;
};

if ($sub_line !~ /,/) { &gen_error() ; return 0 ;};
($sub_line, $phone_num) = ($sub_line =~ /^(.*),([^,]*)$/);
if (! &format_str(\$phone_num, 14)) {
&gen_error() ; return 0 ;
};

# Whatever is left after peeling off the other fields is the address.
$address = $sub_line ;
if (! &format_str(\$address, 10)) {
&gen_error() ; return 0 ;
};
# While 10 characters may seem really short for
# an address, that is what we're given.
return 1 ;
};
sub format_str {
# Left up to the reader
};
sub gen_error {
# Left up to the reader
};
#########################################################
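
For what it's worth, one possible shape for format_str is sketched below. It is only a guess that fits the way the sub is called above (a reference to the field plus a width, true on success); the trimming and the "too long means error" rule are assumptions.

```perl
use strict;
use warnings;

# One possible format_str, matching the way it is called above:
# it receives a reference to the field and the column width, trims
# and pads the field in place, and returns 0 if the value cannot fit.
sub format_str {
    my ($ref, $width) = @_;
    return 0 unless defined $$ref;
    $$ref =~ s/^\s+//;
    $$ref =~ s/\s+$//;
    return 0 if length($$ref) > $width;
    $$ref = sprintf '%-*s', $width, $$ref;   # right-pad with spaces
    return 1;
}

my $status = ' 001 ';
if (format_str(\$status, 3)) { print "ok [$status]\n" } else { print "too long\n" }
# prints: ok [001]
```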

At this point, if you're so inclined, you can add into your program comments that the error_log will be reviewed for repeating errors and adjustments to the data provided or this program will be made as appropriate.

While this style of coding may seem a bit harsh, the more hostile the data environment the more trying to do things like: /^[A-Za-z\s-']+\s+.[\d]{3}.\s?[\d]{3}-[\d]{4}\s+[\w\W]+\s+[\d]{3}\s+([A-Z]{2})?[\d]{8}$/){
will lead you to weeks of debugging. As your incoming data is refined, so can the program be refined.
 
thanks franco for your help.
i got the second format to work.
Code:
elsif(length($line) == 150){      
      $name = substr($line,0,50);
      $account = substr($line,50,10);
      $phone = substr($line,60,14);
      $status = substr($line,74,3);
      $address = substr($line,77,73);      
      print OUTFILE trim($account),"\t",trim($name),"\t",trim($status),"\t",trim($phone),"\t",trim($address),"\n";  
     }
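
As an alternative to the five substr calls, unpack can cut a fixed-width line in one step. This is just a sketch using the same widths and field order as the code above; the sample record is built with pack purely for the demonstration.

```perl
use strict;
use warnings;

# Widths in the order used for the second format in this thread:
# name, account, phone, status, address.
my $template = 'A50 A10 A14 A3 A73';

# Build a 150-character sample record with pack, then split it again.
my $line = pack $template,
    'Joe Smith', 'AW87124356', '(805) 555-1234', '001',
    '1 Uni Dr. Camarillo, CA. 93012.';

# "A" strips trailing spaces on unpack, so no separate trim is needed
# for the padding (leading spaces, if any, would remain).
my ($name, $account, $phone, $status, $address) = unpack $template, $line;

print join("\t", $account, $name, $status, $phone, $address), "\n";
```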
still working on the third format but i'm not sure on how to split $line into the five different fields. as you can see below i tried using split(/PATTERN/,EXPR,LIMIT) but it did not split $line correctly.
Code:
if($line =~ /\s[\d]{3}\s([A-Z]{2})?\d{8}$/){
      my@line = split(/\s/,$line,5);
      ($name,$phone,$address,$status,$account) = @line;
      
      print OUTFILE "$account\t$name\t$status\t$phone\t$address\n";    
    }
i'll keep working on it.
 
($name,$phone,$address,$status,$account) = split(/\s/,$line,5);

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
Code:
my@line=split' ',$line;
  #with the above you suppress any multiple spaces inside and between the fields
$account=pop@line;
$status=pop@line;
for(my$i=1;$i<@line;$i++){
  if($line[$i]=~/^\(\d{3}\)$/){
    $phone=join' ',@line[$i..$i+1];
    $name=join' ',@line[0..$i-1];
    $address=join' ',@line[$i+2..$#line];
    last;
  }
}
Of course we are assuming in all the above that each and every field is always present.

Franco
 
franco, could you explain your code for format 3, i don't fully understand how it works:
Code:
for(my$i=1;$i<@line;$i++){
  if($line[$i]=~/^\(\d{3}\)$/){
    $phone=join' ',@line[$i..$i+1];
    $name=join' ',@line[0..$i-1];
    $address=join' ',@line[$i+2..$#line];
    last;
  }
}
how did you determine the range for each of the fields?
why did you use the pattern /^\(\d{3}\)$/?

 
When you split the record on blanks, you'll get more than 5 pieces, as there are blanks inside some of the fields. The only way to reconstruct the structure correctly is through the phone field, which luckily separates two unformatted fields that would otherwise be impossible to tell apart, and which has a structure that regexp can recognize: /^\(\d{3}\)$/ matches a piece consisting only of the parenthesized area code, such as (805).
So everything that's before the first part of the phone field is the name, the phone field uses two parts (because there's a space in it), and the remainder is the address.
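
The reasoning above can be tried on one of the sample lines from earlier in the thread. The sketch below wraps the same steps in a sub; parse_format3 is a hypothetical name, and it assumes every field is present.

```perl
use strict;
use warnings;

# parse_format3 is a hypothetical wrapper around the steps described
# above. It returns the five fields, or an empty list when no
# "(###)" token is found.
sub parse_format3 {
    my ($line) = @_;
    my @words = split ' ', $line;       # split on runs of whitespace
    return if @words < 6;               # name, 2 phone parts, address, status, account
    my $account = pop @words;
    my $status  = pop @words;
    for my $i (1 .. $#words - 2) {
        next unless $words[$i] =~ /^\(\d{3}\)$/;   # area code marks the phone
        my $name    = join ' ', @words[0 .. $i - 1];
        my $phone   = join ' ', @words[$i, $i + 1];
        my $address = join ' ', @words[$i + 2 .. $#words];
        return ($name, $phone, $address, $status, $account);
    }
    return;
}

my @fields = parse_format3(
    'Joe Smith (805) 555-1234 1 Uni Dr. Camarillo, CA. 93012. 001 AW87124356');
print join('|', @fields), "\n";
# Joe Smith|(805) 555-1234|1 Uni Dr. Camarillo, CA. 93012.|001|AW87124356
```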

Franco
 