Converting Tab delimited to MySQL database

Karl Blessing · May 9, 2008

I'm currently working on my church's website incorporating a couple versions of the bible into the site for an on-site searching and browsing capability. I currently already have King James, and some others from older text that I had (manually yikes!) parsed into the mysql database and actually got it working rather well on the site.

However I've come across a site called unbound bible which provides for free of use other versions of the bibles (mostly older translations no longer subject to copyright laws or were released for public use), some unicode downloads of those versions.

My current MySQL database is setup into three tables

Versions
Books
Verses

Versions are setup as

Code:

+-----------+---------------------+------+-----+---------+----------------+
| Field     | Type                | Null | Key | Default | Extra          |
+-----------+---------------------+------+-----+---------+----------------+
| id        | int(10) unsigned    | NO   | PRI | NULL    | auto_increment | 
| shortname | varchar(32)         | NO   |     |         |                | 
| longname  | varchar(64)         | NO   |     |         |                 
+-----------+---------------------+------+-----+---------+----------------+

Books are setup as

Code:

+-----------+-------------------+------+-----+---------+----------------+
| Field     | Type              | Null | Key | Default | Extra          |
+-----------+-------------------+------+-----+---------+----------------+
| id        | int(10) unsigned  | NO   | PRI | NULL    | auto_increment | 
| title     | varchar(64)       | NO   |     |         |                | 
| bookorder | int(2) unsigned   | NO   |     | 0       |                | 
| testament | enum('Old','New') | NO   |     | Old     |                | 
| version   | int(10) unsigned  | NO   |     | 0       |                | 
+-----------+-------------------+------+-----+---------+----------------+

and Verses are setup as (all are int except content)

Code:

+---------+------------------+------+-----+---------+----------------+
| Field   | Type             | Null | Key | Default | Extra          |
+---------+------------------+------+-----+---------+----------------+
| id      | int(10) unsigned | NO   | PRI | NULL    | auto_increment | 
| book    | int(10) unsigned | NO   |     | 0       |                | 
| chapter | int(10) unsigned | NO   |     | 0       |                | 
| verse   | int(10) unsigned | NO   |     | 0       |                | 
| content | text             | NO   | MUL |         |                | 
+---------+------------------+------+-----+---------+----------------+

Now a download from the unbound bible comes in mainly two usable files (without the mapping that is), the list of books titles, and then all the verses.

The books txt file is setup like...

Code:

01O	Genesis
02O	Exodus
03O	Leviticus
04O	Numbers

where there are two digits followed by a letter (O = old, N = new, A = suplimental, but I'd just throw A in with the new testament, or add it to the enumeration)

And the verse file (for example English Standard UTF-8)

Code:

#name	English: Basic English Bible
#filetype	Unmapped-BCVS
#abbreviation	ESV
#language	eng
#note	
#columns	orig_book_index	orig_chapter	orig_verse	orig_subverse	order_by	text
01O	1	1		10	At the first God made the heaven and the earth.
01O	1	2		20	And the earth was waste and without form; and it was dark on the face of the deep: and the Spirit of God was moving on the face of the waters.
01O	1	3		30	And God said, Let there be light: and there was light.

Basically going by BookID, Chapter, Verse and something called SubVerse and it's sort order.

The problem is I'm trying to figure out how to write a script in PHP to parse the two files, and insert them into my own database, also I can't use the same ID that are given to the books from the txt file, so I would probably have to manually insert the version first, then in the script somehow parse the books file first and assign to an array (ie: an array that would hold the ID of the book as it is in mysql, and an ID that it would show up in the verse text).

Then go and add in that manner.

But mainly having trouble getting an actual starting point.

Karl Blessing

http://www.karlblessing.com

jpadie · May 10, 2008

why can you not use the original book id?

i would do the following

edit the text file to get rid of the comments at the beginning.
then run a script a bit like this

Code:

//connect to database

function cleanse($var){
 return mysql_real_escape_string(trim($var));
}

$file = "path/to/verses.txt";
$contents = file($file);
foreach ($contents as $line){
  $_line = explode ("\t", $line);
  //prep the file for use in a query
  array_walk($_line, 'cleanse');
  $query = "insert into sometable (bookID, chapterID, verseID, subVerseID, `orderBy`, `text`) values ('%s', '%s', '%s','%s','%s')";
  mysql_query(vsprintf($query, $_line) or die (mysql_error());
}

you should probably throw some error checking and trapping in too. test for badly formed lines etc.

Karl Blessing · May 10, 2008

I can't use the original book id, cuz there are already 3 other versions of the bible in the database ( as I said ), which have unique ids of their own, and using the one from the file would cause conflict.

Karl Blessing

http://www.karlblessing.com

Karl Blessing · May 10, 2008

Figured it out, I wrote this script. Basically...

1) Manually add a version, note the ID created
2) Change version # on top of script to that version
3) Upload the book_names file into the same folder for that version
4) remove comments and upload verses for that version
5) Run script
6) Done

Code:

<?
include("database.php");
$_Link = mysql_connect(HOST, USER, PASS);
mysql_select_db(DATABASE, $_Link);	

$bookIDs = "";
$version = 16;
$row = 0;
$handle = fopen("book_names.txt", "r");
while (($data = fgetcsv($handle, 1000, "\t")) !== FALSE) {
	$row++;
	switch(substr($data[0], 2, 1))
	{
		case "O": $testament = "Old"; break;
		case "N": $testament = "New"; break;
		case "A": $testament = "Apocryphal"; break;
	}
	
		$query = "INSERT INTO bible_books (title, bookorder, testament, version) VALUES('".$data[1]."', ".$row.", '".$testament."', ".$version.");";
		$result = mysql_query($query, $_Link);
		if(!$result)
		{
			echo "There was an error on row number: ".$row;
			echo "\n<BR>".mysql_error()." - ".mysql_errno()."\n<BR>";
			exit();
		}
		else
		{
			$theID = substr($data[0], 0, 2);
			$bookID[$theID] = mysql_insert_id();
 		} 
}
fclose($handle);
echo $row." Books inserted.<BR>\n";
$row = 0;
$hitA = 0;
$handle = fopen("verse.csv", "r");
while (($data = fgetcsv($handle, 1000, "\t")) !== FALSE) {
	$row++;
	$book_ID = substr($data[0], 0, 2);
	$book_Type = substr($data[0], 2, 1);
	if(strcmp($book_Type, "A") == 0)
		$hitA = 1;
	
	$query = "INSERT INTO bible_verses (book, chapter, verse, content) VALUES(
	".$bookID[$book_ID].", ".$data[1].", ".$data[2].", '".mysql_real_escape_string(addslashes($data[5]))."');";
	$result = mysql_query($query, $_Link);
	if(!$result)
	{
		echo "There was an error on Verse row number: ".$row;
		echo "\n<BR>".mysql_error()." - ".mysql_errno()."\n<BR>";
		echo $query."<BR>\n";
		exit();
	}
	
}
fclose($handle);
if (!$hitA)
{
	// No Apocryphal verses were inserted, so we must remove those books if they were added to avoid blank results. 
	$query = "DELETE FROM bible_books WHERE testament = 'Apocryphal'";
	$result = mysql_query($query, $_Link);
}
mysql_close($_Link);
echo $row." rows inserted.";

?>

I had to add the last bit, because some of the versions apparently while they list Apocryphal books in the book_names, they don't always have verses for them.

Karl Blessing

http://www.karlblessing.com

Karl Blessing · May 10, 2008

Actually, remove stripslashes and it's good to go.

Karl Blessing

http://www.karlblessing.com

jpadie · May 11, 2008

you could still have used the original bookID and just appended something in the iteration. that would have been the path of least resistance.

you should not addslashes and use mysql_real_escape_string. get rid of addslashes and make sure that you never use magic_quotes_runtime or gpc

i note that you are not escaping the data that gets inserted into bible_books. unless you KNOW your data is totally clean, this is bad practice.

similarly you are taking on faith certain data points in the verses array. this is again bad practice, particularly with csv or other structured text files as often things can go slightly wrong in the formatting and cause havoc.

for optimisation, and as an aside, consider using prepared queries and PDO. it will be quicker and your code neater.

Karl Blessing · May 14, 2008

Opening the book names since theres only like 20 of them, I can verify that it's clean before doing anything.

The code works, and was only intended for a one time use, with a verification afterwards. Once the 6 or so versions were uploaded and verified, there was very very little need to ever use the code again.

The only problem I ran into (which before you replied that I incorporated some of what you said) was trying to get other languages into the database, such as greek and hebrew. Despite the database being setup for UTF-8 just fine, it turns out that not only is the connection from PHP usually defaulting to latin-1, but that also some functions such as fgetcsv are not UTF-8 compatible, and will break any strings that are multi-byte into them. So had to create some new functions and clean up the code a bit to get the multibyte versions in.

Karl Blessing

http://www.karlblessing.com

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Converting Tab delimited to MySQL database

Karl Blessing

Programmer

jpadie

Technical User

Karl Blessing

Programmer

Karl Blessing

Programmer

Karl Blessing

Programmer

jpadie

Technical User

Karl Blessing

Programmer

Similar threads

Part and Inventory Search

Sponsor