CSV - Sorted vs unsorted source data - creates error

Jun 22, 2007
I don't really know how to title this message, so I hope that worked ;)

I'm working on a program that parses and modifies a CSV file. The source file is an export of another company's address book. I make the necessary alterations to make it importable into our Active Directory as mail-enabled contacts. There are roughly 4500 lines in the source CSV file. My script works perfectly on it.

However, if I use Excel and sort the file by last name prior to running my script, then the script barfs repeatedly with this error:

Use of uninitialized value in pattern match (m//) at blib\lib\Text\CSV.pm (autosplit
into blib\lib\auto\Text\CSV\combine.al) line 139, <CSV_IN> line 4433.
Use of uninitialized value in substitution (s///) at blib\lib\Text\CSV.pm (autosplit
into blib\lib\auto\Text\CSV\combine.al) line 154, <CSV_IN> line 4433.
Use of uninitialized value in concatenation (.) or string at blib\lib\Text\CSV.pm
(autosplit into blib\lib\auto\Text\CSV\combine.al) line 156, <CSV_IN> line 4433.


The data is exactly the same; it's just the order in the CSV file that is different. Any ideas why, or how to fix it?

Here's my code:
Code:
#!/usr/bin/perl -w
use strict;
use warnings;
use Text::CSV;

##############################################
# CSV prep utility    v1.1                   #
# by David Burklin 6/26/07                   #
#                                            #
# This utility will take a CSV file export   #
# and prepare it for input into our Exchange #
# 2003 Server as a mail enabled contact.     #
##############################################

# Source CSV file will look like this:
# HEADER: Obj-Dist-Name (DN),Display Name (cn),E-Mail Addresses (rfc822Mailbox),Company (Company),
# Department (department), Office (physicalDeliveryOfficeName),Phone Number (telephoneNumber),
# First Name (givenName),Initials (initials),Last Name (sn),Title (title)
# RECORD: /o=DOMAIN/ou=LOCATION/cn=Recipients/cn=BobMarley,"Marley, Bob",bobm@rasta.org,Rasta Inc,
# Voice Services,"Kingston, Jamaica", 777-555-1234,Bob,B,Marley,rastafarian

# Define the files that will be used. Future revision may prompt for these with the values here as a default.
my $infile = 'c:\work\input.csv';
my $outfile = 'c:\work\output.csv';

# Other data used.
my @skipcn = ('EAdmin','IAdmin','NAdmin','Administrator','NetappAutosupport','BESADMIN','NWTRfictest1',
              'NWTRPGP','bosc-cmsupport','EBTC','ABCT2', 'DSecurity','DSTUpdate','EOffice','thou',
              'ABoston','EBoston','TBoston','SMail','MMailbox','qmailbox','SMailbox','WAnthony',
              'TSMailbox','masterp','MPlace', 'TNetwork-Unity','EOKC','PFRepl','PGPValidate','SecAlert',
              'SConsole','Sreports','SQLServerAccount','BSQL','SQLMAILTEST','SReport','ISynergy','TCop',
              'TOperator','tmain','TSQLMail','ExchangeT','ExchangeUser','W2','IWebmaster','JWright-Test');
my $skip=0;
my $RecCnt=0; #count records processed.
my $RecWrt=1; #start at 1 for the header.
my $lastcn=""; #will be used to handle dupe CN entries.
my $lastcnfx=1; #used in conjunction with $lastcn
        
# Create CSV instances for input and output using the TEXT::CSV module
my $csv = Text::CSV->new();
my $csvout = Text::CSV->new();

# Open the infile and create the outfile.
open (CSV_IN, "<", $infile) or die $!;
open (CSV_OUT, ">", $outfile) or die $!;

# Create an array with the fieldnames. Write this as the header for the output CSV file. 
my @fields = ('DN','objectClass','cn','distinguishedName','instanceType','displayName','targetAddress',
              'mailNickname','name','objectCategory','mail','sn','givenName','showInAddressBook',
              'Company','Department','physicalDeliveryOfficeName','telephoneNumber','title');
if ($csv->combine (@fields)) {
      my $string = $csv->string;
      print CSV_OUT "$string\n";
      }
  else {
      my $err = $csv->error_input;
      print "combine () failed on argument: ", $err, "\n";
      }

# Begin parsing through the source file. Store the fields in variables denoted by the inFieldName.
my $dump_header=0;
while (<CSV_IN>) {
    $RecCnt = $RecCnt + 1;
    if ($dump_header<1) {$dump_header=1;} else
    {
    if ($csv->parse($_)) {
        my @columns = $csv->fields();
        my $inDN = $columns[0];
        my $inCN = $columns[1];
        my $inrfc822Mailbox = $columns[2];
        my $inCompany = $columns[3];
        my $inDepartment = $columns[4];
        my $inphysicalDeliveryOfficeName = $columns[5];
        my $intelephoneNumber = $columns[6];
        my $ingivenName = $columns[7];
        my $inInitials = $columns[8];
        my $insn = $columns[9];
        my $intitle = $columns[10];        
        
        #Now that we have the input fields, prepare the data for our output file.
        # break the incoming DN down to get just the CN from it.
        my @ary=split('/',$inDN);
        my $cntmp=$ary[-1];
        @ary=split('=',$cntmp);
        
        my $cn = $ary[-1];
        
        #This routine will ensure that every CN is unique by appending a numeric
        #value to the end of any that are dupes.
        #if (lc($cn) eq lc($lastcn))
        #    {
        #        $cn=$cn . sprintf("%u",$lastcnfx);
        #    }
        #$lastcn=$cn;
        
        #now create the fields for output.
        my $DN="CN=$cn,OU=NWTR,DC=domain,DC=local";
        if ($RecCnt/10==int($RecCnt/10)){print ".";} # gives a visual progress indicator.
        my $objectClass = "contact";
        my $distinguishedName = "OU=NWTR,DC=domain,DC=local";
        my $instanceType = "4";
        my $displayName = "$ingivenName $inInitials $insn";
        my $targetAddress = "SMTP:$inrfc822Mailbox";
        my $mailNickname = "$ingivenName$insn$inInitials";
        my $name = $displayName;
        my $objectCategory = "CN=Person,CN=Schema,CN=Configuration,DC=domain,DC=local";
        my $mail = $inrfc822Mailbox;
        my $sn = $insn;
        my $givenName = $ingivenName;
        my $showInAddressBook = "CN=Default Global Address List,CN=All Global Address Lists,
        CN=Address Lists Container,CN=Exchange Organization,CN=Microsoft Exchange,CN=Services,
        CN=Configuration,DC=domain,DC=local;CN=All Contacts,CN=All Address Lists,
        CN=Address Lists Container,CN=Exchange Organization,CN=Microsoft Exchange,CN=Services,
        CN=Configuration,DC=domain,DC=local";
        my $Company = $inCompany;
        my $Department = $inDepartment;
        my $physicalDeliveryOfficeName = $inphysicalDeliveryOfficeName;
        my $telephoneNumber = $intelephoneNumber;
        my $title = $intitle;
        
        # Now create an array containing the output data.
        my @fields = ($DN,$objectClass,$cn,$distinguishedName,$instanceType,$displayName,
                      $targetAddress,$mailNickname,$name,$objectCategory,$mail,$sn,$givenName,
                      $showInAddressBook,$Company,$Department,$physicalDeliveryOfficeName,
                      $telephoneNumber,$title);

        # Now we are going to compare against the array skipcn. If the cn is on the skip list then
        #bypass it. Otherwise, write it to the output file.
        foreach my $comparecn (@skipcn)
        {
            if (lc($comparecn) eq lc($cn))
            {
                print "\nSkipping:$comparecn";
                $skip=1;
            }
        }
        if ($skip==0)
        {            
            #write it to the output file
            if ($csv->combine (@fields))
                {
                    my $string = $csv->string;
                    print CSV_OUT "$string\n";
                    $RecWrt=$RecWrt+1;
                }else
                {
                    my $err = $csv->error_input;
                    print "combine () failed on argument: ", $err, "\n";
                }
        }else{$skip=0;}
     

    } else {
        my $err = $csv->error_input;
        print "\nFailed to parse line: $err\n";
    }
    }#end dump header
}
# close open handles and you are done.
close CSV_OUT;
close CSV_IN;

print "\n\n$RecCnt records processed from $infile.";
print "\n$RecWrt records written to $outfile.\n";

### The resultant file can now be imported to Active Directory using
### csvde -i -f c:\work\output.csv -v -j c:\log
### from the exchange system console.
 
daveinOK2004 said:
The data is exactly the same, it's just the order in the CSV file that is different. Any ideas why or how to fix?

To put it simply, obviously the data is not exactly the same. My guess is that Excel is changing the formatting (or maybe just the line delimiter) in some way. Windows versus Unix line endings, perhaps.
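For example, a quick way to check the line endings (just a sketch; the path is assumed from your script, and the file is opened in raw mode so Perl's :crlf layer doesn't hide the \r):
Code:
open my $fh, '<:raw', 'c:\work\input.csv' or die $!;   # raw mode keeps any \r intact
my $first = <$fh>;
print $first =~ /\r\n$/ ? "CRLF (Windows) line endings\n" : "LF (Unix) line endings\n";
close $fh;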

My suggestion would be to ditch reordering the data in Excel and just sort it in Perl.

Also, note that skipping the header would be easier by just calling "<CSV_IN>;" before the while loop, or adding "next if $. == 1;" immediately inside the while.
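Something along these lines would handle both the sort and the header - just a sketch, and it assumes the last name is the 10th input field (index 9) as in your header:
Code:
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new();

open (CSV_IN, "<", 'c:\work\input.csv') or die $!;
my $header = <CSV_IN>;                       # read the header first so it never gets sorted

my @rows;
while (<CSV_IN>) {
    chomp;
    die "Failed to parse line $.\n" unless $csv->parse($_);
    my @f  = $csv->fields();
    my $sn = defined $f[9] ? lc $f[9] : '';  # sort key: last name (sn)
    push @rows, [ $sn, $_ ];
}
close CSV_IN;

# sort on last name, then process the raw lines in that order
for my $row (sort { $a->[0] cmp $b->[0] } @rows) {
    my $line = $row->[1];
    # ... your existing per-record processing goes here ...
}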

- Miller
 
then the script barfs repeatedly with this error:

actually that is a warning, not an error. It's Perl's way of alerting you to a potential problem, but it can also be a false alarm. One of the scalars in your @fields array is apparently undefined when you try to combine() them:

Code:
       my @fields = ($DN,$objectClass,$cn,$distinguishedName,$instanceType,$displayName,
                      $targetAddress,$mailNickname,$name,$objectCategory,$mail,$sn,$givenName,
                      $showInAddressBook,$Company,$Department,$physicalDeliveryOfficeName,
                      $telephoneNumber,$title);

If everything is really working OK, you can suppress the warning by removing the "warnings" pragma, or by using "no warnings" within selected blocks of the program.
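For example (just a sketch using the variables already in your script), you could wrap only the combine/print in its own block:
Code:
{
    no warnings 'uninitialized';       # silence only these warnings, only inside this block
    if ($csv->combine(@fields)) {
        my $string = $csv->string;
        print CSV_OUT "$string\n";
        $RecWrt = $RecWrt + 1;
    }
}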



------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
I've had problems like this when the CSV was prepared/edited with Excel. Check for rows that look 'empty' but that Excel thinks have something in them, or look at the file with a text editor to see whether it has a line of
Code:
,,,,,,,,,
or a bunch of stray line feeds at the end.
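A quick way to spot those (just a sketch; adjust the path to suit) is a one-liner like:
Code:
perl -ne "print qq($.: $_) if /^[,\s]*$/" c:\work\input.csv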

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Just a thought...

It's not something really dumb, like forgetting to check the 'header row' when you sort the data in Excel? So the header row gets sorted to somewhere other than row #1? This would certainly explain the symptoms.

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
It's all being caused by line 4433 in your CSV file. If you have a look at what that line contains, it might shed some light on the situation.
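Something like this one-liner (just a sketch) will pull that line out for you:
Code:
perl -ne "print if $. == 4433" c:\work\input.csv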
 
Good eyes ishnid. I didn't even notice that, and for a second there I thought you were telling a joke.

- Miller
 
Also, note that skipping the header would be easier by just calling "<CSV_IN>;" before the while loop, or adding "next if $. == 1;" immediately inside the while.

- Miller
Thanks for the <CSV_IN>; tip. It worked like a charm and is much cleaner than the method I was using.

Just a thought...

It's not something really dumb, like forgetting to check the 'header row' when you sort the data in Excel? So the header row gets sorted to somewhere other than row #1? This would certainly explain the symptoms.

Steve

Good point Steve, but I already checked that one. I've been bitten by that bug in times past when doing other stuff. Good thing to note though.

Ishnid and everyone else, I'm trying to figure out the difference in the sorted file now. Thanks again to everyone who's helped me with all the posts in the last few days. With a community like this, I hope that someday I'll be able to make a contribution instead of being the one always getting the help.

 
OK. I've found the problem.

For some reason, when I sort the file in Excel, it adds an additional comma to the end of some lines. The extra comma seems to be the culprit.

before sort:
/o=NWTR/ou=Alliance/cn=Recipients/cn=RDCampbell,"Campbell, Ronald",RDCampbell@NWTR.com,,,,,,,

after sort:
/o=NWTR/ou=Alliance/cn=Recipients/cn=RDCampbell,"Campbell, Ronald",RDCampbell@NWTR.com,,,,,,,,

Now I'm just gonna have to go kick Micro$oft's butt for messing up my file ;)
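In the meantime, a belt-and-braces option (just a sketch, dropped in right after the parse in my script) is to force the parsed row to exactly 11 defined columns, so a stray trailing comma can't leave anything undefined:
Code:
my @columns = $csv->fields();
$#columns = 10;                                    # pad or truncate to exactly 11 entries
@columns = map { defined $_ ? $_ : '' } @columns;  # turn any undef into an empty string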

Thanks all.

 
I couldn't figure out the bizarre Excel sort problem, but I guessed you wanted the sorting to help deal with the duplicate-incrementing part that was commented out. I've refactored your script a bit so it can work with unsorted files while still detecting duplicates. I've also switched the array search for exclusions over to a hash to speed it up, and put the exclusion list in the __DATA__ section at the bottom (although this could just as easily go in a separate file). It now expects the input and output files to be specified on the command line.
Code:
#!/usr/bin/perl -w
use strict;
use warnings;
use Text::CSV;

##############################################
# CSV prep utility    v2.0                   #
# by David Burklin 6/26/07                   #
# additional material : stevexff             #
#                                            #
# This utility will take a CSV file export   #
# and prepare it for input into our Exchange #
# 2003 Server as a mail enabled contact.     #
##############################################

# Source CSV file will look like this:
# HEADER: Obj-Dist-Name (DN),Display Name (cn),E-Mail Addresses (rfc822Mailbox),Company (Company),
# Department (department), Office (physicalDeliveryOfficeName),Phone Number (telephoneNumber),
# First Name (givenName),Initials (initials),Last Name (sn),Title (title)
# RECORD: /o=DOMAIN/ou=LOCATION/cn=Recipients/cn=BobMarley,"Marley, Bob",bobm@rasta.org,Rasta Inc,
# Voice Services,"Kingston, Jamaica", 777-555-1234,Bob,B,Marley,rastafarian

# Define the files that will be used. Future revision may prompt for these with the values here as a default.

die "Usage: $0 input.csv output.csv" unless (@ARGV == 2); # check arguments

my ($infile, $outfile)  = @ARGV; # pick up file names from command line
my %skipcn;                      # changed to a hash for speed
my %dupecn;                      # duplicate checker hash (no need to sort)
my $outrec = 0;                  # output record count

while (<DATA>) {                 # load the skip hash
   chomp;
   $skipcn{$_}++;
}

my $csv = Text::CSV->new();      # only need one of these

# Open the files

open (CSV_IN, "<", $infile) or die $!;
open (CSV_OUT, ">", $outfile) or die $!;

# Create an array with the header fieldnames.

my @header = ('DN','objectClass','cn','distinguishedName','instanceType','displayName','targetAddress',
              'mailNickname','name','objectCategory','mail','sn','givenName','showInAddressBook',
              'Company','Department','physicalDeliveryOfficeName','telephoneNumber','title');

# process the file

while (<CSV_IN>) {
    my @output;
    unless ($. == 1) {
        chomp;
        die "Parse failure at input record $. in $infile:\n$_" unless ($csv->parse($_));
        my @columns = $csv->fields();
        $columns[0] =~ s/.*=//;   # keep just the CN (text after the final '='), hyphenated names included

        next if (exists $skipcn{$columns[0]});    # ignore 'skip listed' rows

        my $cn = dupeFix($columns[0]);   # make the CN unique once, then reuse it below
        @output = (
            "CN=$cn,OU=NWTR,DC=domain,DC=local",                             # DN
            "contact",                                                       # objectClass
            $cn,                                                             # cn (matches the header)
            "OU=NWTR,DC=domain,DC=local",                                    # distinguishedName
            "4",                                                             # instanceType
            "$columns[7] $columns[8] $columns[9]",                           # displayName
            "SMTP:$columns[2]",                                              # targetAddress
            "$columns[7]$columns[9]$columns[8]",                             # mailNickname
            "$columns[7] $columns[8] $columns[9]",                           # name
            "CN=Person,CN=Schema,CN=Configuration,DC=domain,DC=local",       # objectCategory
             $columns[2],                                                    # mail
             $columns[9],                                                    # sn
             $columns[7],                                                    # givenName
            "CN=Default Global Address List,CN=All Global Address Lists," .
            "CN=Address Lists Container,CN=Exchange Organization,CN=Microsoft Exchange,CN=Services," .
            "CN=Configuration,DC=domain,DC=local;CN=All Contacts,CN=All Address Lists," .
            "CN=Address Lists Container,CN=Exchange Organization,CN=Microsoft Exchange,CN=Services," .
            "CN=Configuration,DC=domain,DC=local",                           # showInAddressBook
            $columns[3],                                                     # Company
            $columns[4],                                                     # Department
            $columns[5],                                                     # physicalDeliveryOfficeName
            $columns[6],                                                     # telephoneNumber
            $columns[10]);                                                   # title

    } else {@output = @header};
    
    $csv->combine(@output);
    print CSV_OUT $csv->string(), "\n";
    $outrec++;
    print "." if ($. % 100 == 0);       # modulus operator is easier
}

print "$. records processed from $infile, $outrec records written to $outfile\n";

# close open handles and you are done.

close CSV_OUT;
close CSV_IN;

sub dupeFix {
    my $cn = shift;
    my $suff = "";
    $suff++ while (exists $dupecn{$cn . $suff});
    $dupecn{$cn . $suff}++;
    return $cn . $suff;
}

### The resultant file can now be imported to Active Directory using
### csvde -i -f c:\work\output.csv -v -j c:\log
### from the exchange system console.

# you could just as easily put the exclude list on another file...

__DATA__
EAdmin
IAdmin
NAdmin
Administrator
NetappAutosupport
BESADMIN
NWTRfictest1
NWTRPGP
bosc-cmsupport
EBTC
ABCT2
DSecurity
DSTUpdate
EOffice
thou
ABoston
EBoston
TBoston
SMail
MMailbox
qmailbox
SMailbox
WAnthony
TSMailbox
masterp
MPlace
TNetwork-Unity
EOKC
PFRepl
PGPValidate
SecAlert
SConsole
Sreports
SQLServerAccount
BSQL
SQLMAILTEST
SReport
ISynergy
TCop
TOperator
tmain
TSQLMail
ExchangeT
ExchangeUser
W2
IWebmaster
JWright-Test
HTH

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 