Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations derfloh on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

CSV Question

Status
Not open for further replies.

GiffordS

Programmer
May 31, 2001
194
CA
I have what I think is a CSV question. I'm having some issues with a massive update that, because it involves several inter-related tables, I can't just dump into MySql but instead must read through a huge csv file and insert the data one line at a time. Anyway, the problem I'm having is that the data is periodically inserted improperly and I'm at a loss to find a reason. I will get 50 or 100 good rows and then it spits the bit and inserts everything wrong, then, after a few bad inserts, picks back up correctly again. Here is the code for stepping through the csv once the file is open etc.

Code:
 for ($c=0; $c < $number_of_fields; $c++)
       {
           $data_array[$header_array[$c]] = $data[$c];
       }
       $name = addslashes($data_array["Name"]);
       $address = $data_array["Address"];
       $phone = $data_array["Phone"];
       $fax = $data_array["Fax"];
       $url = $data_array["URL"];
       $city = $data_array["City"];
       $state = $data_array["State"];
       $zip = $data_array["ZIP"];
       $category = addslashes($data_array["Category"]);
       $usn = "100".$current_row;
       $psw = "900".$current_row;
       $country = "United States";
       $acl_user = 0;
       $firstname = "Name";
       $lastname = "Name";   
       $listing_start = date("Y-m-d");
       $rank = 0;
       $selected_rank = 0;
       $status = "approved";
       $expires = "2007-11-01";
       
       
       $qa = "insert into firsttable 
       (organization, address, phone, fax, website, city, stateprov, postalcode, username, password, country, acl_user, firstname, lastname, customersince)
       values
       ('$name', '$address', '$phone', '$fax', '$url', '$city', '$state', '$zip', '$usn', '$psw', '$country', '$acl_user', '$firstname', '$lastname', '$listing_start')";

The insert id is then used to update other tables w/ the category and other info. Not my database design, so don't even get me started there... Anyway... as I said everything cruises along and then for some odd reason I will get an insert where everything but the zip code is put into a row, and then another row is created with no other info but the zipcode, or I'll get four consecutive rows where everything is fine, but the category from row one is carried down to rows two through four and then it corrects itself and moves on. I've checked the csv for any anomolies around the effected rows and there is nothing, no stray commas, apostrophes or anything like that, not even a common denominator such as all effected rows having no value for address or some such. I'm just wondering if either there is something that I'm missing here of if anyone else has encountered this type of issue before.
 
I've encountered things like this before. The most common cause is a data-driven fault.

If your script fails at the same exact place each run, check the record the script is processing when it fails. You'll probably find some mismatched apostrophe or somesuch that trips up your script until it finds something that counter-trips it.



I don't understand your posted code, particularly the positions of the curly braces.



Want the best answers? Ask the best questions! TANSTAAFL!
 
sleep,

The curly braces do look odd because I only posted the relevant portions of a much larger block. The block within the first set of braces simply reads in the first line of the CSV, which is the header info, and stores it into an array. The whole chunk is part of a larger for loop that updates three separate tables etc.

As for the data itself, as I said I simply can't find the anomoly. Here is the header line...

Code:
"Name","Address","Phone","Fax","URL","City","State","ZIP","Category"

and a few examples of what I'm experiencing....

Code:
"Ugarte, Thomas R MD","900 W 49th St #300","305-823-6525",,,"Hialeah","FL","33012","Physicians & Surgeons"

"Highland Village Recreational","13565 NE 21st CT","305-948-2928",,,"North Miami","FL","33181","Government Offices-City, Village & Twp"

"Mc Augiffe, Marianne MD","124 E Miracle Strip Pkwy #3","850-862-3141",,,"Mary Esther","FL","32569","Physicians & Surgeons"

"Easy-Stor Self Storage","13038 US Highway 301","352-521-3374",,,"Dade City","FL","33525","Storage-Household & Commercial"

"Southeast Atlantic Corp","3700 Avenue F","772-461-3383","772-461-2779",,"Fort Pierce","FL","34947","Beverages (Wholesale)"

The full lines between entries were added by me here to make the post easier to read as the display area here is quite small. In the above example the first two lines went in fine, the third line, Mc Augiffe, was entered twice, with the second entry containing only the zip code, which was left as null in the first entry. To make matters worse, both entries retained the category designation of the line before, municipal offices.

Code:
"Children's Discovery Ctr","627 S 5th St","904-259-2212",,,"Macclenny","FL","32063","Child Care Service"

"Era Neubauer Real Estate","10740 Hutchison Blvd","850-230-3665","850-236-7680",,"Panama City Bch","FL","32407","Real Estate"

"Medi Corp Florida","3900 Woodlake Blvd #207","561-649-3457",,,"Greenacres","FL","33463","Billing Service

In this one the second line, Era Neubauer etc, was double entered with only the phone number being entered in the second entry. Unlike the Augiffe double entry, in this case the phone number was not omitted from the first entry. Everything else was entered correctly and the other tables were properly updated.

Unfortunately, we're talking about over 100,000 records here with as many as 5,000 problems so had correcting isn't an option. I don't mind running the update again, but I have to find the problem first.
 
I don't know. It could be unprintable characters. Do these records look all right when viewed in hexadecimal format?

Can you post your code that reads a record into $data?



Want the best answers? Ask the best questions! TANSTAAFL!
 
i notice you are not manually escaping the data before entering it into your database. this will cause problems where there are, for example, apostrophes (unless you are using some automated escaping).

you could also consider building a text file with sql insert commands instead of a direct entry to myql. then take the file and use a query parser like phpmyadmin to process the inserts for you. this will point out errors in each query too.
 
jpadie has a point. But I've found the quickest way to get CSV-type data into a MySQL table is by getting the data into the right format and using MySQL's LOAD DATA INFILE query.



Want the best answers? Ask the best questions! TANSTAAFL!
 
thanks for the responses, guys, but this just makes less and less sense as I go. I would love to do the MySql LOAD, but there are several senselessly inter-related tables that for require an insert id from one to be fed to the other and then down the line. They database design itself is terrible, but that's not an issue client wants to pay to have fixed. As for the data itself, using the augiffe example from earlier, I have pinpointed exactly where it's jumping the rails. I just can't for the life of me understand why. I've tested this particular line of csv several times and keep getting the same result, but I am completely at a loss on it...

Code:
"Mc Augiffe, Marianne MD","124 E Miracle Strip Pkwy #3","850-862-3141",,,"Mary Esther","FL","32569","Physicians & Surgeons"

Every time it parses the name, address, phone, fax, url, city, state, and zip properly but does not recognize the last item, which is category. Instead it carries the category from the previous line down and does not update that variable. Then it completes the line and begins a new line, for some reason leaving Name as null and putting the category value (Physicians $ Surgeons) into the address variable of the second line. From there it leaves everything else blank until it hits category which it once again uses the category form the line before. After this it starts up properly on the next line and everything is fine again.
 
Again, check non-printable characters. Suppose that instead of "\r\n", the Win32 normal end of line string, a line ended with "\n"? How would your script react?



Want the best answers? Ask the best questions! TANSTAAFL!
 
sleep,

I'm a little new to csv and am really learning this as I go. I can't imagine why there would be any characters between the zip and category fields of that line, but perhaps there are. How would I view this in such a way as to be able to see those characters?
 
Thanks for the tip! I checked that line in hexedit and it comes back clean. There are no extraneous characters. This just isn't making any sense to me at all as to why it would spit the bit there.
 
where does the data come from? and how do you originally get the data into csv?

and could you post the code you use to get the csv into php.

and also you mention interrelated tables and yet your code, as posted, does not handled any uniqueness or id-getting to make such interrelation work. please elaborate and perhaps provide all the code you are working with together with a database schema.
 
Long story short... the data is from a third party reseller who was not able to do the update themselves to client's database. That's really all I know about the genesis of the csv file. It is truly massive, over 100,000 separate listings. To run the update in php takes about a week because it must be done in chunks since the hosting company allows only 50k transactions per day on the MySql server per client. The db schema is.... unique. There are tables upon tables upon redundant tables, many of which I've yet to see called by any scripts, but that doesn't mean anything. The PHP is even worse. The intent was a modular design but the result was one of the worst spaghetti bowls I've ever encountered. There was on page, a simple form that required filling 3 fields, that had over 10 module calls simply to generate the form. It's crazy. I've warned client repeatedly that once this is put under the stress of even moderate traffic it will start to bog down, but no one wants to pay to do it right. Ok, I'm ranting. Sorry. Here is all of the relevant code to read the csv and update 3 of the tables. It's pointless to post more than that.

Code:
ini_set("auto_detect_line_endings", 1);
$current_row = 1;
$handle = fopen("datafile.txt", "r");
while ( ($data = fgetcsv($handle, 100, ",") ) !== FALSE )
{
   $number_of_fields = count($data);
   if ($current_row == 1)
   {
   //Header line
       for ($c=0; $c < $number_of_fields; $c++)
       {
           $header_array[$c] = $data[$c];
       }
   }
   else
   {
   //Data line
       *********** database connection omitted ***********

       for ($c=0; $c < $number_of_fields; $c++)
       {
           $data_array[$header_array[$c]] = $data[$c];
       }
       $name = addslashes($data_array["Name"]);
       $address = $data_array["Address"];
       $phone = $data_array["Phone"];
       $fax = $data_array["Fax"];
       $url = $data_array["URL"];
       $city = $data_array["City"];
       $state = $data_array["State"];
       $zip = $data_array["ZIP"];
       $category = addslashes($data_array["Category"]);
       $usn = "100".$current_row;
       $psw = "900".$current_row;
       $country = "United States";
       $acl_user = 0;
       $firstname = "Name";
       $lastname = "Name";   
       $listing_start = date("Y-m-d");
       $rank = 0;
       $selected_rank = 0;
       $status = "approved";
       $expires = "2007-11-01";
       
       
       $qa = "insert into table1 
       (organization, address, phone, fax, website, city, stateprov, postalcode, username, password, country, acl_user, firstname, lastname, customersince)
       values
       ('$name', '$address', '$phone', '$fax', '$url', '$city', '$state', '$zip', '$usn', '$psw', '$country', '$acl_user', '$firstname', '$lastname', '$listing_start')";

       $ra=mysql_query($qa)or die(mysql_error());
       $cid = mysql_insert_id();
       
       $qb = "insert into table2
       (fk_customerid)
       values
       ('$cid')";
       
       $rb = mysql_query($qb)or die(mysql_error());
       
       $qc = "insert into table3                 
       (fkcustomerid, category, description, rank, selected_rank, status, expires)
       values
       ('$cid', '$category', '$category', '$rank', '$selected_rank', '$status', '$expires')";
       
       $rc =  mysql_query($qc) or die(mysql_error());
       
         
       
       echo $name." ".$address." ".$phone." ".$fax." ".$url." ".$city." ".$state." ".$zip." ".$category."<br>";
       
       
   }
   $current_row++;
}
fclose($handle);
echo "Update Complete";

The queries are very simply written as client is trying to learn php as well so I write everything longhand and try to show him what is happening as I go. I echo everything back so I can try to spot anomolies and so that I know exactly where in the csv the script is forced to stop by the 50k rule.
 
100k records is not so many. it should only take a php script a few seconds.

i suspect what is going wrong is bad data in the csv file.

try running this code against the file and post back the results. it just parses the csv to make sure that the number of columns are even across the dataset.

Code:
<?
$filename = "";
$counter = array();
$fh = fopen($filename, "rbt") or die("unable to open file");
while ( ($row = fgetcsv($fh, 100000, ",",'"')) !== false){
 if (isset($counter[count($row)]){
   $counter[count($row)]++;
 }else{
   $counter[count($row)] = 1;
 }
}
fclose($fh);
echo "<pre>";
print_r($counter);
echo "</pre>";
$total = 0;
foreach ($counter as $c){
 $total = $total + $c;
}
echo "Total rows = $c";
?>

if i'm right in my suspicions, what we can do is write an exception handling routine to write the badly formatted rows to a new csv file which you can manually edit, and then bung it back in.
 
jpadie,

Thanks for the help. Sorry for not responding sooner but it's been crazy today. Just got back and have to run again. That script throws and error in the if loop. I'll see if I can sort the error out in about an hour. Thanks again though for responding.

 
Ok, good idea with looking at the array as it's read. I modified your code just a bit and ran it on a sample of about 12 lines of the csv. You can in the result below where it once again breaks down on the Mc Auliffe listing and carries the previous value for "Category" forward, ignores the name of the next listing and deposits the Mc Auliffe category value into the following Address. I'm not sure which makes less sense, the way that it breaks down here when there are no marks at all in the csv that would cause it, or the way that it manages to right itself after a few bad repetitions. I'll post the results below...

Code:
Array
(
    [Name] => Sheffield Knifemaker Supply CO
    [Address] => 1027 Shadick Dr
    [Phone] => 386-775-6453
    [Fax] => 386-774-5754
    [URL] => 
    [City] => Orange City
    [State] => FL
    [ZIP] => 32763
    [Category] => Mail Order & Catalog Shopping
)

Array
(
    [Name] => Racs
    [Address] => 805 S Orlando Ave
    [Phone] => 407-667-4714
    [Fax] => 
    [URL] => 
    [City] => Winter Park
    [State] => FL
    [ZIP] => 32789
    [Category] => Janitor Service
)

Array
(
    [Name] => Ugarte, Thomas R MD
    [Address] => 900 W 49th St #300
    [Phone] => 305-823-6525
    [Fax] => 
    [URL] => 
    [City] => Hialeah
    [State] => FL
    [ZIP] => 33012
    [Category] => Physicians & Surgeons
)

Array
(
    [Name] => Highland Village Recreational
    [Address] => 13565 NE 21st CT
    [Phone] => 305-948-2928
    [Fax] => 
    [URL] => 
    [City] => North Miami
    [State] => FL
    [ZIP] => 33181
    [Category] => Government Offices-City, Village & Twp
)

Array
(
    [Name] => Mc Augiffe, Marianne MD
    [Address] => 124 E Miracle Strip Pkwy #3
    [Phone] => 850-862-3141
    [Fax] => 
    [URL] => 
    [City] => Mary Esther
    [State] => FL
    [ZIP] => 32569
    [Category] => Government Offices-City, Village & Twp
)

Array
(
    [Name] => 
    [Address] => Physicians & Surgeons
    [Phone] => 850-862-3141
    [Fax] => 
    [URL] => 
    [City] => Mary Esther
    [State] => FL
    [ZIP] => 32569
    [Category] => Government Offices-City, Village & Twp
)

Array
(
    [Name] => Easy-Stor Self Storage
    [Address] => 13038 US Highway 301
    [Phone] => 352-521-3374
    [Fax] => 
    [URL] => 
    [City] => Dade City
    [State] => FL
    [ZIP] => 33525
    [Category] => Storage-Household & Commercial
)

Array
(
    [Name] => Southeast Atlantic Corp
    [Address] => 3700 Avenue F
    [Phone] => 772-461-3383
    [Fax] => 772-461-2779
    [URL] => 
    [City] => Fort Pierce
    [State] => FL
    [ZIP] => 34947
    [Category] => Storage-Household & Commercial
)

Array
(
    [Name] => 
    [Address] => Beverages (Wholesale)
    [Phone] => 772-461-3383
    [Fax] => 772-461-2779
    [URL] => 
    [City] => Fort Pierce
    [State] => FL
    [ZIP] => 34947
    [Category] => Storage-Household & Commercial
)

Array
(
    [Name] => Florida Conference
    [Address] => 4298 Livingston Rd
    [Phone] => 904-880-4276
    [Fax] => 
    [URL] => 
    [City] => Jacksonville
    [State] => FL
    [ZIP] => 32257
    [Category] => Churches
)

Array
(
    [Name] => Joffe, Jeffrey DDS
    [Address] => 2705 Tamiami Trl #112
    [Phone] => 941-637-6003
    [Fax] => 
    [URL] => 
    [City] => Punta Gorda
    [State] => FL
    [ZIP] => 33950
    [Category] => Dentists
)

Array
(
    [Name] => 
    [Address] => 2705 Tamiami Trl #112
    [Phone] => 941-637-6003
    [Fax] => 
    [URL] => 
    [City] => Punta Gorda
    [State] => FL
    [ZIP] => 33950
    [Category] => Dentists
)

Array
(
    [Name] => Llv Home Color Consulting
    [Address] => 994 N Barfield Dr #42
    [Phone] => 239-642-6994
    [Fax] => 
    [URL] => 
    [City] => Marco Island
    [State] => FL
    [ZIP] => 34145
    [Category] => Interior Decorators Design & Consultants
)

Array
(
    [Name] => Wise, Katarina DDS
    [Address] => 1530 Pinehurst Dr
    [Phone] => 352-683-7668
    [Fax] => 
    [URL] => 
    [City] => Spring Hill
    [State] => FL
    [ZIP] => 34606
    [Category] => Dentists
)

Any thoughts?
 
As for the output from your original code...

Code:
Array
(
    [9] => 12
)

Total rows = 12

 
can you post the underlying data set for this sample too (in csv)? in preference if you can upload the raw data to feel free to upload the whole file - there are no restrictions on this server for size and i guarantee not to keep the data!

given that your columns look clean i'm now with sleipnir214 on the cleanliness of the characters or, equally possible, the underlying data is actually just fine in format but rows are repeated etc (or you are not resetting your array at each iteration).
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top