
mb_split misbehaving due to previous character


OsakaWebbie (Programmer)
I use mb_split (in a UTF-8 environment that can contain Japanese text) to break apart stanzas of song lyrics I have saved in a database, but it doesn't always find the pattern when it should, and the difference seems to be the previous character. This should be really simple:
PHP:
$stanzas = mb_split("\n\s*\n",rtrim($song->Lyrics));
Normally it works, recognizing a blank line. But if the previous character is the Japanese character "く" (U+304F), the blank line is not recognized and the two stanzas are not split. There are probably other characters that cause the same problem, but U+304F is the one I have identified so far.
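In case it helps, here is a minimal sketch of what I am seeing (the sample string is invented, but it fails the same way for me):
PHP:
<?php
// minimal reproduction sketch - assumes mbstring with UTF-8 throughout
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

$text = "ひさしく\r\n\r\n次の行";               // blank line right after く
$stanzas = mb_split("\n\s*\n", rtrim($text));
echo count($stanzas);                           // should be 2; I get 1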

I thought perhaps something about that character caused mb_split to think the next byte is part of the same character, but I checked my data, and it actually contains a CR before each LF - so even if the CR got swallowed, the regex is only looking for LF anyway, and it should still be fine. I can't figure out what the problem could be. Does anyone have an idea?

If you would like a little more context, here is the result of the query I used to examine the line ending bytes:
SQL:
SELECT REPLACE(REPLACE(Lyrics,CHAR(10),'{LF}\n'),CHAR(13),'{CR}') FROM pw_song WHERE SongID=468
Code:
[D]荒野の[D/A]果[A]て[D]に 夕日は[D/A]落[A]ち[D]て{CR}{LF}
[D]妙(たえ)なる[D/A A]調[D]べ 天(あめ)より[D/A A]響[D]く{CR}{LF}
{CR}{LF}
[D Bm G A D G]グローー[A]リア、[D]イン [A/C#]エク[D]セル[G]シス [D/A]デ[A]オ{CR}{LF}
[D Bm G A D G]グローー[A]リア、[D]イン [A/C#]エク[D]セル[G]シス [D/A A]デ[D]オ{CR}{LF}
{CR}{LF}
[D]羊を[D/A A]守[D]る 野辺の[ D/A]牧[A D]人{CR}{LF}
[D]天なる[D/A A]歌[D]を 喜び[D/A]聞[A]き[D]ぬ{CR}{LF}
The second blank line is recognized, but not the first one.
 
forgive my ignorance on this topic. i rarely have to deal with multi-byte issues.

two thoughts spring to mind:

1. does this help explain the issue:

2. as pcre is not compiled with utf support in php, try running the pcre comparison from the command line using exec() (pcretest) - a rough sketch follows.
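something like this is what i had in mind - a sketch only, with an illustrative temp file path (pcretest reads a pattern line followed by data lines, and expands \n escapes in the data itself):
PHP:
<?php
// rough sketch: shell out to a UTF-8-enabled PCRE via pcretest.
// the /8 modifier after the pattern turns on pcretest's UTF-8 mode.
file_put_contents('/tmp/pcre_in.txt',
    "/\\n\\s*\\n/8\n" .            // pattern line with UTF-8 modifier
    "ひさしく\\n\\nNext line\n"     // data line; pcretest expands \n itself
);
exec('pcretest /tmp/pcre_in.txt', $output);
echo implode("\n", $output);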
 
jpadie said:
forgive my ignorance on this topic. i rarely have to deal with multi-byte issues.
And forgive me for asking a multibyte question on an English forum - there are probably Japanese forums I could ask on, but (a) they wouldn't be Tek-Tips [thumbsup2], and (b) I'm lazy (Japanese is much slower for me to read and write).

1. Nope, that is irrelevant to this problem for several reasons: (a) the problem happens server-side, not in the browser, (b) the regex is looking for blank lines, not Japanese characters, and (c) it does not fail if the preceding character is ぐ (the character that Webkit doesn't distinguish from く).

2. AFAIK, the mb_* functions don't use the Perl-compatible regex engine (PCRE), but a POSIX-style one.
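The calling conventions differ too - a quick sketch of the same split against both engines ($lyrics is just a stand-in for my field):
PHP:
<?php
// the same split expressed both ways ($lyrics assumed to be UTF-8 text)
$posix = mb_split("\n\s*\n", $lyrics);       // mbstring: bare pattern string
$pcre  = preg_split('/\n\s*\n/', $lyrics);   // PCRE: delimiters plus optional modifiers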
 
correct. that's why i was suggesting that you use the command line with a pcre variant that supports utf. my suspicion is that this is a bug in the posix engine that is recognising the character as something other than it is (hence the link in my first suggestion, which similarly describes that character, amongst others, causing posix to hiccough).

php does not yet support utf within its pcre engine. I used to remember why it didn't; but christmas spirit has intervened.
 
Christmas is a good thing to intervene![smile]

I re-read that page about Webkit several times looking for the relationship to my issue, but I still didn't understand. The page title sounds promising, but the content is about something else: there is no mention of line breaking, POSIX, PCRE, or even regular expressions at all. And the problem he is pointing out is not actually a bug, but a design decision by the Webkit developers that he happens to disagree with.

How Webkit does its sorting/searching is not necessarily wrong - it matches the sorting order of all Japanese dictionaries I have ever seen, both printed and electronic. True, く and ぐ are not the same character, but the way the Japanese language is organized, there is a strong relationship between them, and they are sometimes interchanged. So I can see both sides of the argument - I personally wish my electronic dictionary would not treat those characters the same, but it does, and such behavior is quite common. (An amusing irony is that Google's search engine distinguishes those characters in searches, but Google Chrome does not.) Anyway, in my case there are no Japanese characters in my regular expression - I think the POSIX bug has something to do with the bytes that make up the character, not its use in the language.

I tried pcretest as you suggested - this is the transcript:
Bash:
[...]# pcretest -C
PCRE version 6.6 06-Feb-2006
Compiled with
  UTF-8 support
  Unicode properties support
  Newline character is LF
  Internal link size = 2
  POSIX malloc threshold = 10
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack
[root@vps-1011517-5697 ~]# pcretest
  re> /\n\s*\n/8
data> ひさしく\n\nNext line
 0: \x0a\x0a
data> ひさしく\r\n \r\nNext line
 0: \x0a \x0d\x0a
data>
So that appears to work, and the PCRE it is using has UTF-8 enabled.

Based on your last sentence ("php does not yet support utf within its pcre engine"), I would have considered the above test to be a mere academic exercise. But results of a Google search regarding PHP/PCRE/UTF8 seem to indicate that newer versions of PHP's PCRE engine do support UTF-8 and may even have it enabled by default - see and (both of which are two years old). I don't know how to tell what version of PCRE my server is using with PHP - according to phpinfo() mine is compiled with "--with-pcre-regex=/usr", which apparently means it's not using the one that came with PHP. But it also has "--enable-mbregex", which is encouraging.
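A sketch of how one might check from within PHP - I am assuming the PCRE_VERSION constant defined by the pcre extension, and the documented behavior that preg_match() returns false when a pattern fails to compile:
PHP:
<?php
// which PCRE library is this PHP actually linked against?
echo PCRE_VERSION, "\n";              // e.g. "6.6 06-Feb-2006"

// probe UTF-8 support directly: a /u pattern fails to compile
// (preg_match() returns false) if the library lacks UTF-8 support.
var_dump(preg_match('/./u', 'あ'));   // int(1) when UTF-8 mode works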

I changed the command in my code to use preg_split with the /u option. I don't get any error messages - does that mean it's doing it? Yes, it properly finds the blank lines that were failing before. The risk with this is false positives - if it's only doing the regex in a byte-by-byte way, then if a line in the middle of a stanza ends with a character whose last byte happens to be 0x0a, it would erroneously split at that point. But I can just cross that bridge if I ever come to it (it will be fine as long as either the /u switch is working or the data never has any lines ending in a 0x**0a character).

So I'll consider this solved, even though I'm not 100% certain. And you have taught me that I can't completely trust POSIX - I have been in the habit of using the mb_* commands, but perhaps I should use the preg_* ones with the /u option instead. Thanks for your help!
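For completeness, here is the changed line as a sketch, plus a guard that is my own extra caution (per the manual, preg_split() returns false on failure, such as invalid UTF-8 in /u mode, and preg_last_error() reports why):
PHP:
<?php
// PCRE version of the split, in UTF-8 (u) mode, with a failure check
$stanzas = preg_split('/\n\s*\n/u', rtrim($song->Lyrics));
if ($stanzas === false) {
    // invalid UTF-8 (or missing UTF-8 support) makes preg_split() fail
    die('preg_split failed: error code ' . preg_last_error());
}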
 