Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Remove quotes inside a text field

Status
Not open for further replies.

koresnordic

IS-IT--Management
Nov 28, 2002
422
GB
HI,
I have been given a csv file to import that contains almost every type of problem you can think of.
I have dealt with tabs, commas and pipes, which are all normally used as separators. But I am stuck with the quote char.

An example. the following is part of a text field: "UK games firms "stuck in the slow lane", HTC M8 landing March, cheap Xbox One denied"
I need to keep the surrounding quotes but the ones in the middle need to go. If I use the replace, then the outside pair also go.
How would I remove the middle ones? Note only very few records in the database have this issue.

I also have no connection to the original data and have to work from the csv they supply - they will not make any changes to the output in any way.

thanks for reading

regards
Graham B.

[pc]

Graham
 
Note only very few records in the database have this issue.

Is this only a one time thing, or will this happen very often? That is can you do it manually or does it need to be automated to be done regularly?

Otherwise, you would likely need a regular expression to get the interior quotes only. And doing it in a query is not the best approach.

Perhaps a script (PHP, or something else) that can process the data before insertion to the DB may be a better alternative.





----------------------------------
Phil AKA Vacunita
----------------------------------
Ignorance is not necessarily Bliss, case in point:
Unknown has caused an Unknown Error on Unknown and must be shutdown to prevent damage to Unknown.

Web & Tech
 
HI Phil,

Unfortunately, this can happen daily. The data is e-mail extracts, where the subject line can contain any number of quotes etc. I receive the file into a windows server directory and have to work on it from there.

[pc]

Graham
 
Hi Koresnordic

if you're familiar with sed and regular expressions,
i recommend that you use a sed (4.2.1) script from dos.
i use it extensively with csv and dos files before i import
them into our database (foxpro)
 
Hi coldbohemia,

Unfortunately, this will have to run exclusively inside MySQL on a box not under my control, else I would create a bit of vb.net to do what I need. Thanks for the option though.



[pc]

Graham
 
OP said:
I need to keep the surrounding quotes but the ones in the middle need to go

may I ask why? perhaps you could give a larger segment of the csv so we can see the pain you're having.

but i suspect if it is as messy as you make out, the best bet is to pre-parse and clean it rather than do this in the database engine. if you really need to do it in mysql then a regex is an option. but for that to be successful you need to have data that is consistently dirty.

for example i was recently asked to process about 800 songs and parse the data into the song lyrics, title and author etc. there was no consistency in how each datablock was presented so no way of using a regex. I ended up writing a line-level lexer in a finite state-machine. which worked very well. I felt that writing a regex to deal with every situation would have taken longer than a more brute force approach. and potentially longer than a manual approach.

anyway - show us the damage and maybe we'll come up with something good.
 
Two example lines that are causing me an issue:

[user]@[company],04/02/2014 02:11:59,customer-reviews-messages@amazon.co.uk,[user]@[company],"[name], did 'Kindle Fire HD 7", HD Display, Wi-Fi, 8 GB - Includes Special Offers' meet your expectations? Review it on Amazon.co.uk",14349,<00000143faaa7dac-bfa654d3-aca4-4115-898e-3d2058000540-000000@eu-west-1.amazonses.com>
&
[name]@[company],04/02/2014 14:05:01,[Sender e-mail],[user]@[company],"New Helpdesk #40328 Firecrest order upload- If no P.O value on EDI file then default \"NA\"",4168,<20140204140212.1D2813AC001@[company]>

I want to use the following but it fails because of the quote after HD 7:

load data infile "\\\\[path to file]\\DetailedMessageStats-04-02-14.csv"
ignore into table import_temp
fields terminated by ','
optionally enclosed by '"'
lines terminated by '\r\n';

I have removed the users e-mail address to stop scrappers picking it up.

[pc]

Graham
 
that is unfortunate....

are you certain that you cannot pre-process the file? that use case is solvable through (for example) using php as the import conduit. I see that you say it would take a lot of work in vb. it would not in php; unless there are other use cases that you haven't included here?
 
OK,

I left it to start with and hand balled the data in correctly, so I could work on the what I needed to do with it after this stage. Then I came back to it.

The way I resolved this is:
Load the whole record into one data field with:
Code:
-- Empty the temp table
TRUNCATE temp_holding;
-- Load in the data into the temp table as one log string
LOAD DATA INFILE "C:\\DetailedMessageStats-04-02-14.csv"
IGNORE INTO TABLE temp_holding
-- select charcters that are very unlikely to exist in the record
FIELDS TERMINATED BY '!!**@'
LINES TERMINATED BY '\r\n';

CALL import_routine;

With the import_routine stored procedure as:
Code:
-- Set up variables
DECLARE temp_text VARCHAR(4096);
DECLARE temp_int INT DEFAULT 0;
DECLARE counter INT default 0;
DECLARE i INT default 0;
DECLARE first_quote INT default 0;
DECLARE next_quote INT default 0;
DECLARE split_one INT;
DECLARE split_two INT;
DECLARE split_three INT;
DECLARE split_four INT;
DECLARE split_five INT;
DECLARE split_six INT;
DECLARE comma_loc INT;

DECLARE testmode BOOLEAN DEFAULT false;

-- Empty the table the records will go into
TRUNCATE import_complete;

-- Remove blank lines & the header line
DELETE FROM temp_holding WHERE LENGTH(data) = 0 OR LEFT(DATA,9)='User,Time';

-- Remove the "total" lines, found by checking for no commas
DELETE FROM temp_holding WHERE INSTR(DATA,',') = 0;

-- Get the number of records
SELECT COUNT(*) FROM temp_holding INTO counter;

IF testmode THEN
	-- SET counter = 6;
	SELECT counter;
END IF;

SET i = 0;

-- loop through the records
WHILE i < counter do

	-- Get the data line
	SET temp_text = (SELECT data FROM temp_holding LIMIT i,1);
	
	-- Find how many quotes are in the data
	SET temp_int = LENGTH(temp_text) - LENGTH(REPLACE(temp_text,'"',''));

	-- only need to worry if there are more than 2 quotes in the data
	IF temp_int > 2 THEN
		
		IF testmode then
			SELECT i,temp_int,temp_text;
		END IF;

		-- Find the first quote
		SET first_quote = LOCATE('"',temp_text);

		-- Loop until there are only 2 quotes
		WHILE temp_int > 2 DO

			-- Get the next quote
			SET next_quote = LOCATE('"',temp_text,first_quote + 1);

			SET temp_text = CONCAT(LEFT(temp_text,next_quote - 1),MID(temp_text,next_quote + 1, 4096));

			SET temp_int = length(temp_text) - length(replace(temp_text,'"',''));

		END WHILE;

	END IF;

	-- Check between the quotes for commas
	SET first_quote = LOCATE('"',temp_text);
	SET next_quote = LOCATE('"',temp_text,first_quote + 1);
	SET comma_loc = LOCATE(',',temp_text,first_quote + 1);
		
	-- If the comma is beyond the second quote, then this is ok
	WHILE comma_loc < next_quote DO
		IF testmode THEN
			SELECT i,first_quote,next_quote,comma_loc,temp_text;
		END IF;

		SET temp_text = CONCAT(LEFT(temp_text,comma_loc - 1),' ',mid(temp_text,comma_loc + 1,4096));
		SET comma_loc = LOCATE(',',temp_text,first_quote + 1);
	END WHILE;

	-- Find where the fields break
	SET split_one = LOCATE(',',temp_text);
	SET split_two = LOCATE(',',temp_text,split_one + 1);
	SET split_three = LOCATE(',',temp_text,split_two + 1);
	SET split_four = LOCATE(',',temp_text,split_three + 1);
	SET split_five = LOCATE(',',temp_text,split_four + 1);
	SET split_six = LOCATE(',',temp_text,split_five + 1);

	-- Put into the table
	INSERT INTO import_complete
		VALUES(	
		MID(temp_text,1,split_one - 1), 								-- User
		MID(temp_text,split_one + 1, split_two - split_one - 1),		-- Date / time
		MID(temp_text,split_two + 1, split_three - split_two - 1),		-- Sender
		MID(temp_text,split_three + 1, split_four - split_three - 1),	-- Recipient
		MID(temp_text,split_four + 1, split_five - split_four - 1),		-- Subject
		MID(temp_text,split_five + 1, split_six - split_five - 1),		-- Size
		MID(temp_text,split_six + 1,4096)								-- Message ID
		);

	SET i = i + 1;

END WHILE;
I am not saying this is the best or quickest way, but it works, so for the moment that is good enough for me.
Now I just need to find a way of getting an equivalent of the 'load data infile' that works in a stored procedure.

[pc]

Graham
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top