Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Parsing Odd Data

Status
Not open for further replies.

PCHomepage

Programmer
Feb 24, 2009
609
US
This may be a MySQL query question but I suspect that PHP would be needed instead. In a table with a number of columns, there are two as shown below that I'm having trouble with. The data is from 1930s, 1940s and 1950s automotive parts books so I cannot change it but I need to be able to search on it more efficiently and its formatting is preventing that so I am thinking that either a query that can work through the issues or something else (perhaps the one-time creation of a lookup table) is the way to go.

The problem now is that it is impossible to search on many models because only part of the model number is there. For example, if someone wants to find 2213 nothing here will show up and searching on only 13 or even -13 will bring up thousands of unrelated entries from the table.

The data was formatted only to make it printable (this was OCRed from hard-copy) and it looks straightforward but actually the second example is what it really means:
(These are simplified without extraneous columns but there is a primary key field, among others)

[pre]PartNo Models
_________________________________________________________________
391719 2206-13-26-33; 2302-06-13-32-33

G141111 2631 (BODY 2678); 5431 (BODY 5478)

L422445 2401 (BODIES 2462-65-92-15-98): 2501; 2601-11-33[/pre]

Meaning:

[pre]PartNo Models
_________________________________________________________________
391719 2206 2613 2626 2633; 2302 2306 2313 2332 2333

G141111 2631 (BODY 2678); 5431 (BODY 5478)

L422445 2401 (BODIES 2462 2465 2492 2415 2498): 2501; 2601 2611 2633[/pre]

I am open to suggestions but what I think I need is to extract the data into something like this but I have no idea how to do so:

[pre]PartNo ModelNo BodyNo
391719 2206
391719 2213
391719 2226
391719 2233
391719 2302
391719 2306
391719 2313
391719 2332
391719 2333
G141111 2631 2678
G141111 5431 5478
L422445 2401 2462
L422445 2401 2465
L422445 2401 2492
L422445 2401 2415
L422445 2401 2498
L422445 2501 2601
L422445 2501 2611
L422445 2501 2633[/pre]

Otherwise some way to make the search engine see 2462-65 (for example) as 2462 2465 would be okay and, in that event, it probably is not necessary to differentiate between model number and body number. There will ultimately be hundreds of thousands of rows of data and there may be other similar oddities that I've not yet come across.

I just spotted another and without going through adding it to the above, the Model column looks like this so clearly I'll need to add a generic model table with the needed numbers:

ALL 22ND; 23RD; 24TH; 25TH; 26TH; 5402-06-13-26-31
 
Oops! L422445 2401 (BODIES 2462 2465 2492 2415 2498): 2501; 2601 2611 2633 should be:

Code:
[pre]PartNo   ModelNo   BodyNo
391719   2206
391719   2213
391719   2226
391719   2233
391719   2302
391719   2306
391719   2313
391719   2332
391719   2333
G141111  2631      2678
G141111  5431      5478
L422445  2401      2462
L422445  2401      2465
L422445  2401      2492
L422445  2401      2415
L422445  2401      2498
L422445  2601
L422445  2611
L422445  2266[/pre]
 
Why does this
Code:
2206-13-26-33
Expand to
Code:
2206 2613 2626 2633

Ie where is the logic that says transform 22 to 26?

 
It was a simply typo that I had missed and there may be a couple others but the concept should be clear enough even if my typing is not.
 
Oh. I had not made that assumption.

So that series will actually be 2206 2213 2226 2233?

If so then a regex query will allow you to search accurately. And a like query will get you close.

But it looks overall better to restructure the table and do a one time expansion of the textual data into a properly structured format.
 
Yes, the series will actually be 2206 2213 2226 2233 and my apologies for the confusion. Unfortunately I cannot restructure or modify the existing table or its data so whatever is needed will either have to be done in the programming or by creating a secondary table as described and I have no idea how to begin such a task. If it can be done by a query or even a series of queries, then that's fine but it is beyond my skills to do so!

So far the data is limited to a few thousand test rows so whatever is done would need to be repeatable as it is being populated from the hard copy which itself is extremely time-consuming due to the quality of the originals, headings on each section and subsection, the fact that the pages' layout change slightly from year to year and a few other factors.
 
i spent five minutes writing a script for you. one anomaly i cannot solve from your explanation is the significance of 2501 in this

Code:
L422445 2401 (BODIES 2462 2465 2492 2415 2498): 2501

I suspect that the 2501 will be another modelNum. although I can't see a reason why it is delimited by a comma rather than semi-colon. And in your second explanation you remove all references to 2501.

 
The 2501 is another model and the clue us that it is separated from the previous model by a semicolon. So for model 2401, it applies only to certain bodies but for 2501 and the others, the part fits all bodies of the listed models.
 
Yes, my second reference has inadvertently omitted 2501 which should precede 2601 and the last one should be 2633. Sorry, I'm terribly dyslexic but I try!

I don't see anything delimited by a comma but I do see a colon rather than a semicolon which must be an OCR error that I missed.
 
i see.

it is going to be difficult to cater for OCR errors as well. but try this and see what output you get

Code:
<?php

$query = <<<SQL
select  partNo, models
from    oldTable
SQL;

$createQuery = <<<SQL
CREATE TABLE IF NOT EXISTS p_m_b (
partNo varchar(20) NOT NULL,
modelNo varchar(8) NOT NULL,
bodyNo varchar (8) NULL
)
SQL;

$indexQuery = <<<SQL
CREATE UNIQUE INDEX IF NOT EXISTS pmb ON p_m_b (partNo, modelNo, bodyNo)
SQL;


mysql_query($createQuery) or die(mysql_error());
mysql_query($indexQuery) or die(mysql_error());

$result = mysql_query($query);
while($row = mysql_fetch_array($result)):
    $partNum = $row[0];
    $chunks = explode(';',$row[1]);  //grab the model data
    $chunks = array_map('trim', $chunks); //trim it
    foreach($chunks as $key=>$val):
        if(strpos($val, '(') === FALSE):
            //no body data
            $prefix = substr($val, 0, 2);
            $bits = explode('-', $val);
            foreach($bits as $_key=>$bit):
                if($_key == 0):
                    addItem($partNum, $bit, null);
                else:
                    addItem($partNum, $prefix . $bit, null);
                endif;
             endforeach;
        else:
            //check for colon
            $bits = explode(':', $val);
            foreach($bits as $bit):
                if(strpos($bit,'(') === false):
                    //no body data
                    $_bits = explode('-', $bi);
                    foreach($_bits as $_key=>$_bit):
                        if($_key == 0):
                            addItem($partNum, $bit, null);
                        else:
                            addItem($partNum, $prefix . $bit, null);
                        endif;
                    endforeach;
                else:
                    //has body data
                    $pattern = "|(\d.*?) \(BOD.*? (.*?)\)|";
                    preg_match_all($pattern, $bit, $matches);
                    foreach($matches[1] as $i=>$j):
                        $modelNum = trim($j);
                        $prefix = substr($matches[2][$i],0,2);
                        foreach(explode('-',$matches[2][$i]) as $p=>$bodyNum):
                            if($p === 0):
                                addItem($partNum, $modelNum, $bodyNum);
                            else:
                                addItem($partNum, $prefix . $modelNum, $bodyNum);
                            endif;
                        endforeach;
                    endforeach;
                endif;
            endforeach;
        endif;
    endforeach;
endwhile;


function addItem($partNum, $modelNum, $bodyNum){
    $insertQuery = <<<SQL
INSERT INTO p_m_b
(partNo, modelNo, bodyNo) 
VALUES
('%s,'%s','%s')
SQL;
    $query = sprintf($insertQuery, $partNum, $modelNum, $bodyNum);
    return mysql_query($query);
}
 
Thank you. It's after 3:00 AM now so I'll try it tomorrow but please don't try to cater to OCR errors. Just presume they are not there as I'll run a query and search manually to try to locate things like that. Also, there was a section of un-proofed data and my examples my have come from that area.
 
I added this to get the connection, which seems to be compatible with your code. The table is being created bo far it is giving an error on the line trying to create the index although I've not yet looked into it to see why. It should be easy to sort out, though. I suspect it will also need a facility to drop the table and recreate it each time it's run but I'll look into that at some later date.

[bold]Connected successfully

You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'IF NOT EXISTS pmb ON p_m_b (partNo, modelNo, bodyNo)' at line 1[/bold]


PHP:
include("config/config.php");

$conn = mysql_connect($dbhost, $dbusername, $dbpass);

if (!$conn) :
    die("Connection failed: " . mysql_connect_error());
endif;
echo "Connected successfully\n\n<p>";
	
$db_selected = mysql_select_db($dbname, $conn);
if (!$db_selected) :
    die ('Can\'t use $dbname : ' . mysql_error());
endif;
 
No more colons instead of semicolons but I did come across another odd bit of data for RHD (Right Hand Drive) export cars that I'm not sure how to handle.

RHD, 2401; 2501; 2601-11; 5400-01-11

Perhaps you have an idea although probably simply prepending it to the model listing might be okay:

RHD 2401
RHD 2501
RHD 2601
RHD 2611
RHD 5400
RHD 5401
RHD 5411
 
I changed the table name and modified the code a bit so that it creates an auto-incrementing key rather than trying to add an index later. It is creating the table properly but it is not populating it so I added some echos to see how it parsed the data. Apparently the addItem() function is failing but I see nothing wrong unless it needs the connection passed into it.

The lines on-screen also help me to know when it has completed the operation. It's quite fast now but won't be once the data is all there.

With a couple exceptions (see below), the parsing looks good. Here is the code as it is now.

PHP:
include("config/config.php");

$conn = mysql_connect($dbhost, $dbusername, $dbpass);

if (!$conn) :
    die("Connection failed: " . mysql_connect_error());
endif;
	
$db_selected = mysql_select_db($dbname, $conn);
if (!$db_selected) :
    die ('Can\'t use $dbname : ' . mysql_error());
endif;
	
$query = <<<SQL
SELECT  PartNo, Models
FROM    parts_listing
SQL;

[COLOR=#3465A4][b]$dropQuery = <<<SQL
DROP TABLE IF EXISTS parsedmodels
SQL;[/b][/color]

$createQuery = <<<SQL
CREATE TABLE IF NOT EXISTS [COLOR=#3465A4][b]parsedmodels[/b][/color] (
ID INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
partNo varchar(20) NOT NULL,
modelNo varchar(8) NOT NULL,
bodyNo varchar (8) NULL
)
SQL;

[COLOR=#3465A4][b]mysql_query($dropQuery) or die(mysql_error());[/b][/color]
mysql_query($createQuery) or die(mysql_error());

$result = mysql_query($query);
while($row = mysql_fetch_array($result)):
    $partNum = $row[0];
    $chunks = explode(';',$row[1]);  //grab the model data
    $chunks = array_map('trim', $chunks); //trim it
    foreach($chunks as $key=>$val):
        if(strpos($val, '(') === FALSE):
            //no body data
            $prefix = substr($val, 0, 2);
            $bits = explode('-', $val);
            foreach($bits as $_key=>$bit):
                if($_key == 0):
                    addItem($partNum, $bit, NULL);
		    [COLOR=#3465A4][b]echo "inserting $partNum, $bit, NULL<br>\n";[/b][/color]
                else:
                    addItem($partNum, $prefix . $bit, NULL);
		    [COLOR=#3465A4][b]echo "inserting $partNum, $prefix$bit, NULL<br>\n";[/b][/color]
                endif;
             endforeach;
        else:
            //check for colon
            $bits = explode(':', $val);
            foreach($bits as $bit):
                if(strpos($bit,'(') === FALSE):
                    //no body data
                    $_bits = explode('-', $bi);
                    foreach($_bits as $_key=>$_bit):
                        if($_key == 0):
                            addItem($partNum, $bit, NULL);
			    [COLOR=#3465A4][b]echo "inserting $partNum, $bit, NULL<br>\n";[/b][/color]
                        else:
                            addItem($partNum, $prefix . $bit, NULL);
			    [COLOR=#3465A4][b]echo "inserting $partNum, $prefix$bit, NULL<br>\n";[/b][/color]
                        endif;
                    endforeach;
                else:
                    //has body data
                    $pattern = "|(\d.*?) \(BOD.*? (.*?)\)|";
                    preg_match_all($pattern, $bit, $matches);
                    foreach($matches[1] as $i=>$j):
                        $modelNum = trim($j);
                        $prefix = substr($matches[2][$i],0,2);
                        foreach(explode('-',$matches[2][$i]) as $p=>$bodyNum):
                            if($p === 0):
                                addItem($partNum, $modelNum, $bodyNum);
				[COLOR=#3465A4][b]echo "inserting $partNum, $modelNum, $bodyNum<br>\n";[/b][/color]
                            else:
                                addItem($partNum, $prefix . $modelNum, $bodyNum);
				[COLOR=#3465A4][b]echo "inserting $partNum, $prefix$modelNum, $bodyNum<br>\n";[/b][/color]
                            endif;
                        endforeach;
                    endforeach;
                endif;
            endforeach;
        endif;
    endforeach;
endwhile;
[COLOR=#3465A4][b]echo "Done";[/b][/color]


function addItem($partNum, $modelNum, $bodyNum){
    $insertQuery = <<<SQL
INSERT INTO [COLOR=#3465A4][b]parsedmodels[/b][/color]
(partNo, modelNo, bodyNo) 
VALUES
('%s,'%s','%s')
SQL;
    $query = sprintf($insertQuery, $partNum, $modelNum, $bodyNum);
    return mysql_query($query);
}

For the exceptions, here are some examples. I already mentioned the RHD issue and thought I had mentioned the ALL issue too. For RHD, you can see that it is inappropriately shoving over the model value into the body column when instead it should prepend RHD to the model. The ALL issue is more difficult and will require that I create another table manually to deal with it, which will be easier once this code is working. That way I can get the values from this data, or possibly the code here can be modified to go it on the fly somehow but I'm not sure how.

So for ALL 24TH, the models table would have something like (this is a partial only):

[pre]ID Model
1 2401
2 2402
3 2406
4 2413[/pre]

. . . and it would need to give them each their own line

inserting 426493, ALL 24TH, NULL
inserting G107762, ALL 24TH, NULL
inserting 426743, RHD, 2401, NULL
inserting 426738, RHD, 2401, NULL
inserting G120854, RHD, 2401, NULL
inserting G120380, RHD, 2401, 2501, NULL

There may also be instances where the bodyNo column will be populated but not the modelNo.
 
You need the index for searching. The error was that indices cannot take an if not exists declaration. So of you delete that you should be fine.

Likewise you should not need to drop the table as the queries will just error for those rows that are already in the table. But there is little harm in doing so provides you are not using the ID of the row as a foreign key elsewhere.

The RHD issue is easy to solve but what are the rules? Does RHD 2401 mean that it is the 2401 model with a RHD drive train?

I can't derive the ALL issue without seeing some raw data understanding its meaning. If it means all model nums then that is again easy to solve.

There are references to 24TH in your posts. Is than an OCR artefact and the TH is being substitute for a number?

 
I don't see how not dropping the table would not allow duplicates since we're using insert without any key. However, the ID key will not be needed for anything so dropping and recreating the table is probably safer, particularly as the data is being developed.

Yes, the RHD would include all items in the field after the RHD so preceding the modelNo with RHD should be fine unless you can think of a better way.

For the All issue, it would be for all models beginning with the series number. That is, All 24th refers to all models that begin with 24 and I can create a lookup table as described in my last posting that lists them but I can't create it until this other table has been populated. Once it has been, I can simply grab out the distinct models into a separate table but it will have the structure as I indicated above. If it helps, I can add a column with 24TH, 25TH etc.

Just some history to try to make more sense of the designations, 24TH is in the data and refers to 24TH series. Not that it matters, but it means 1951 models, and was a code used by the manufacturer. In other words, ALL 24TH means all 1951 models. 22ND was 1948 to mid 1949; 23RD was mid 1949 to 1950; then it was by model year after that although the designation was changed in 1954 to 54TH series, which continued to 1957. I'm not sure how it was done for 1957 and 1958 so it's possible that some tweeking will be needed once I get that far but that's still some time off.
 
I see that there is a section to check for the odd colon but they no longer exist in the data. Can this be removed or repurposed to check for the comma to prepend RHD to the partNo?
 
In the script output, I just noticed that some with RHD are parsing oddly. This is the raw model data as it is in the database:

RHD, 2201-02-11-22-32-40; 2301

and RHD applies to all models listed, not just those before the semicolon. Here is the script output:

378882, RHD, 2201, NULL
378882, RH02, NULL
378882, RH11, NULL
378882, RH22, NULL
378882, RH32, NULL
378882, RH40, NULL
378882, 2301, NULL

Until the RHD issue is corrected, this script output should be:

378882, RHD, 2201, NULL
378882, RHD, 2202, NULL
378882, RHD, 2211, NULL
378882, RHD, 2222, NULL
378882, RHD, 2232, NULL
378882, RHD, 2240, NULL
378882, RHD, 2301, NULL

Note the RH as the first part of some model numbers and the lack of RHD on the last entry.

To clarify some entries, this from the script's output means that All applies to all 24th, 25th, 26th and 54th series models. First the raw data, then the script output:

ALL 24TH; 25TH; 26TH; 54TH

418316, ALL 24TH, NULL
418316, 25TH, NULL
418316, 26TH, NULL
418316, 54TH, NULL
 
There is now a table called parts_modelno that contains a listing of the models for use with the All entries. It's simply a single column gleaned from a query of the distinct parsed data that the script created. Here is a sample from 22ND series and if it would be helpful I can add a column with 22ND etc.

modelNo
2201
2202
2206
2211
2213
2220
2222
2226
2232
2233
2240
 
Someone else may be able to assist this evening - regrettably I'm out of action until tomorrow. Apologies

 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top