REGEXP that returns index of match? 1

OsakaWebbie · May 25, 2006

Is there any way to use a REGEXP but have it tell me where in the string it occurred? Or is there a way to do what I want without REGEXP?

I have a field (lyrics of songs) that, in the case of songs sung in Japanese, contains Japanese lyrics followed by the English equivalent. I want to be able to find the spot where the English starts and either return just the Japanese or return just the English. There may be sets of square brackets with English characters (guitar chords) inline in either the Japanese or English, so I also have to watch out for that. To help you understand what I mean, let's hypothetically say that there is a regexp "function" that returns the index to the first occurrence of a match (which MySQL's REGEXP operator does not appear to do). If that were the case, I would do something like this:

Code:

SELECT SUBSTRING(Lyrics,0,REGEXP(Lyrics,'^(\[[^\]]\])?[a-zA-Z]')) AS Japanese FROM song...
SELECT SUBSTRING(Lyrics,REGEXP(Lyrics,'^(\[[^\]]\])?[a-zA-Z]')) AS English FROM song...

I have to use the carat (force the match to be only at the beginning of a line) because there may be chords inline in the Japanese, which of course contain English letters.

The messy-looking stuff in parentheses followed by the question mark is just to check for a possible chord at the very beginning of the English lyrics. If there is a way without regexp to do just the check for the alphabet at the beginning of lines (the equivalent of just '^[a-zA-Z]'), I can make a rule for my DB that the very first character of the English will never a chord, to simplify the challenge. But I have used regexp so long that it is hard to imagine how to do even the simplified search without it. Any thoughts?

TonyGroves · May 26, 2006

First, it seems like unfortunate table design to have both sets of lyrics in the one field. Splitting it into two fields sounds like a much better idea. However, you're probably stuck with your design for the moment.

The nearest I can come is something like the following:
[tt]
SELECT
SUBSTRING(
lyrics,
LEAST(
IF(LOCATE('\na')=0,99999,LOCATE('\na')),
IF(LOCATE('\nb')=0,99999,LOCATE('\nb')),
...
IF(LOCATE('\nz')=0,99999,LOCATE('\nz')),
)
+1
)
FROM ...
[/tt]
It would probably be a lot easier using program code, something like Perl.

OsakaWebbie · May 26, 2006

The reason the two things are in the same field is that the database structure and interface code is intended to be situation independent - it is already being used by two churches, and maybe more in the future. Normally a song's lyrics only have to be written once. Songs that are bilingual are in separate records for the two languages. But in this case, our international church has people who cannot read Japanese, so what is underneath is the same lyrics "romanized". It's not really two sets of lyrics, but rather two ways of reading the same one. The purpose of the query I'm trying to write is to print out the lyrics for people on the worship team, some of whom only need the Japanese and some of whom need only the romanized. Before printing, I have an opportunity to "edit" the output using Javascript, but I'm getting tired of deleting sections of it over and over.

I do have control over the program code, as I wrote it all. I could add a separate field for the romanized, or process the separation in PHP. But I don't want to. I did design the application such that the SQL queries for formatted output for printing are customizable for each installation (the SQL commands, and even the CSS to lay out the data, are stored in the database), so I'm trying to utilize that for this issue, rather than adding extra stuff in the DB structure or the PHP code that would only be applicable to our special situation and just distract everyone else. It's just cleaner.

I'm unfamilier with the LOCATE function, but tomorrow I will try it out (it's past midnight now). I'm glad I chose to make the field that holds the SQL command a "text" field rather than a varchar255 - your idea will need lots of space! But if it works, that will be great. I'll report back.

OsakaWebbie · May 27, 2006

After some minor tweaking and learning the functions used, I got Tony's idea to work for songs that really are Japanese songs with romanized stuff at the bottom. I added the "haystack" argument that Tony left out (hey, I've been there, done that), and then played around with how to get the code repeated for each letter to be simpler, especially to not have to call LOCATE() twice. I finally discovered that null values are not considered when LEAST is doing its thing, so I was able to use NULLIF() to simplify things. My final version of the line that gets repeated 26 times is:

Code:

NULLIF(LOCATE('\na',Lyrics),0),

But I discovered that behavior is unpredictable for songs in English (i.e. ones that don't really need this processing), especially when there are chords at the beginning of lines (which is often). So now I need a way to only use this on Japanese songs. The most reliable way to check for that would be to see if there are multibyte characters in it, but does anyone know how to do that? If there is a way, the code would be:

[tt] [green]SELECT IF([red]some check for multibyte[/red],SUBSTRING(Lyrics,LEAST(...)+1),Lyrics) FROM...[/green][/tt]

I know that on an English forum not many people deal with multibyte characters, but does anyone happen to know? I could use Tony's method again to check for a few commonly occurring Japanese characters (ones used as grammar elements), but there are no guarantees, and it would be cleaner to have a single check that does it. I am using UTF-8 as my encoding type, on MySQL 4.0.18.

hoinz · May 29, 2006

some check for multibyte

One way of achieving this should be by comparing results of lenght() and char_lenght().

hope this helps

OsakaWebbie · May 29, 2006

Great idea, and the MySQL documentation says it should work, but in my case not only length(Lyrics), but char_length(Lyrics) also returns the number of bytes, not the number of characters as the documentation says it should. For every song in my database, the two functions return the exact same number. Any ideas as to why this might be? I searched the web for people complaining of it not working, but found nothing. Supposedly at version 3.23.6 char_length was made multibyte safe. I'm clueless!

OsakaWebbie · Jun 1, 2006

Anybody have any clues on this? My hosting service is supposedly running MySQL 4.0.18, and it's a Japanese hoster so they set things up intentionally to work with multibyte (as far as I can tell), but a function that was supposedly multibyte safe since 3.23.6 fails to do what the documentation says it should. Or am I missing something?

hoinz · Jun 2, 2006

strange

I wonder if there is a bug that will become apparent for long strings only.
How many bytes are there typically in your lyrics?
And did you try if char_lenght() will give you correct values for short strings, for instance just one Japanese word?

OsakaWebbie · Jun 2, 2006

For a test, I checked char_length vs. length on the title of the songs instead of the lyrics (much shorter, of course). Still the same problem. For example, the shortest one I looked at has a title consisting of six Japanese characters - both char_length and length returned 18.

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

REGEXP that returns index of match? 1

OsakaWebbie

Programmer

TonyGroves

Programmer

OsakaWebbie

Programmer

OsakaWebbie

Programmer

hoinz

MIS

OsakaWebbie

Programmer

OsakaWebbie

Programmer

hoinz

MIS

OsakaWebbie

Programmer

Similar threads

Part and Inventory Search

Sponsor