PHP Comparing Base 64 strings

Status
Not open for further replies.

cmayo

MIS
Apr 23, 2001
159
US
Bear with me... I think my question really is about comparing base 64 encoded strings...

In a mail parsing app, I've come across incoming mail where the In-Reply-To and References headers have been stripped out and replaced by a Microsoft/Outlook Thread-Index header, and need to start generating my own Thread-Index headers on outgoing mail so that when I receive a reply without standard threading headers, I can still match the reply to a thread using Thread-Index.

I found a function which creates a valid Thread-Index header and am storing that header in a MySQL table. According to the function's author:

* These headers are base64 encoded 22-byte binary strings in the format:
* 6 bytes: The first 6 significant bytes from a FILETIME timestamp.
* 16 bytes: A unique GUID in hex.

So... a Thread-Index header value apparently looks like this:

Code:
AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig==

Outlook appends 5-byte suffixes to subsequent thread members, so a thread reply would be coded like so:

Code:
AdH1tsVUVHkXt/ZLS4eksRmXC4Q5IgAiTOHA

Note that the first 30 characters are the same (AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig), but the reply has dropped the two original trailing equals signs and added the characters AiTOHA.

What I really need to be able to do is match up emails containing the original thread index using a MySQL query, but I don't understand what's going on with the base conversions and encoding in the PHP function.

Do you think it's safe to just match on the first 30 characters of the thread index in order to identify messages from the same thread? I'd be grateful for any advice or suggestions!

More examples of Thread-Index values:

Code:
AAAAHomOX/VlopU2wo+fFE1Bko39Cw==    
AAABMV+Taic/ZZdmYJJfphDCqDHr3A==    
AAABMV8OU5mSg/7oV6a8xlfEQ7kf5w==    
AAABMV8PU5mSg/7oV6a8xlfEQ7kf5w==    
AAABMV8U4KsZxQheHAiU3/alJNqcXQ==    
AAABMV8UdFBCUFnlrAhDixq8PgSEqg==    
AAABMV8vBEZjaDr8P0KrKK8KuJ3JSA==    
AAABMV8vfQVtOjhVMiyPRf32ThjaOA==    
AAABMV8vGK8E4NQKPshHoM6cj6W/iA==    
AAABMV8vWPmyCure5b9P0thcJxfQ0g==    
AAABMV8vwWsSxiCWEb7Ma5oSZfBnXw==    
AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig==    
AdH1tsVUVHkXt/ZLS4eksRmXC4Q5IgAiTOHA
AdH1tsVUVHkXt/ZLS4eksRmXC4Q5IgAiyCOQ
 
In base 64 encoding, the trailing == characters just pad the base64 representation; they have no meaning. That should be the main answer.
So yes, you can just skip the == part. Any = can be stripped off: in about the same sense as leading zeros don't change a number, trailing = don't change the encoded data.

Bye, Olaf.
 
Hi

Code:
Interactive mode enabled

php > echo base64_encode('M'), PHP_EOL;
TQ==

php > echo base64_encode('Ma'), PHP_EOL;
TWE=

php > echo base64_encode('Man'), PHP_EOL;
TWFu
As you can see, the character before the equal signs can change when another character is added to the input string.

Feherke.
feherke.ga
 
That's a very good point. Each base64 character encodes exactly 6 bits of the original data, and four characters encode 3 bytes (24 bits). If the data length is not a multiple of 3 bytes, the final 6-bit packet is only partly filled with data, so adding a character to the unencoded string changes not only the = pad characters but also the last character before them, as the Q changes to W in Feherke's example.

You would both remove the padding = chars and avoid the change of characters if your original string is padded to be a multiple of 3 bytes. You say you have 22-byte binary strings; if you pad that to 24 bytes, you get a base64 result without =, and whatever Outlook then adds to the binary data and encodes as base64 does not influence the last chars of the original base64 string. That may be the best option to solve this.

Well, the other obvious option is to decode the base64 data and check whether the first 22 binary bytes match.
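
A minimal sketch of that decode-and-compare check, assuming the 22-byte layout quoted in the question (6 timestamp bytes plus a 16-byte GUID). The function name same_thread is made up for illustration:

```php
<?php
// Hypothetical helper: treat two Thread-Index values as the same thread
// when their first 22 decoded bytes (timestamp + GUID) are identical.
function same_thread(string $a, string $b): bool
{
    $rawA = base64_decode($a, true); // strict mode: false on invalid input
    $rawB = base64_decode($b, true);
    if ($rawA === false || $rawB === false) {
        return false;
    }
    if (strlen($rawA) < 22 || strlen($rawB) < 22) {
        return false; // too short to contain a full thread id
    }
    return substr($rawA, 0, 22) === substr($rawB, 0, 22);
}

// Original and reply from the question, then an unrelated header.
var_dump(same_thread('AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig==', 'AdH1tsVUVHkXt/ZLS4eksRmXC4Q5IgAiTOHA')); // bool(true)
var_dump(same_thread('AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig==', 'AAABMV8vBEZjaDr8P0KrKK8KuJ3JSA==')); // bool(false)
```

Comparing decoded bytes sidesteps the question of which base64 characters are stable, at the cost of decoding both values.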

Bye, Olaf.

 
Just FYI, you can see how and why it works this way if you encode Feherke's sample strings, all padded to length 3 with chr(0):
Code:
echo base64_encode("M\000\000"),PHP_EOL; 
echo base64_encode("Ma\000"),PHP_EOL; 
echo base64_encode("Man"),PHP_EOL;

That'll show TQAA TWEA TWFu, which differs in having A instead of =. So you see the "=" chars in effect stand in for chr(0) bytes, but they also indicate that these chr(0) bytes don't belong to the original data, so the decoded data has to be cut off.

Seeing that, it's clear you not only get new chars, you also modify TQ to TW and TWE to TWF: in the first step a chr(0) is replaced by 'a', and in the second step the final chr(0) is replaced by 'n'. That influences not only the = positions but, in general, also the last character of the encoding of the previously shorter string.

Bye, Olaf.
 
Thanks, all.

As I understand it, the original value, i.e. AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig==, becomes a pseudo-unique identifier and is never re-coded during the process. As the email thread grows, additional 5-byte values are appended to the original value to indicate subsequent mails' position in the thread, i.e. AdH1tsVUVHkXt/ZLS4eksRmXC4Q5IgAiTOHA, which seems to indicate thread id AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig followed by message id AiTOHA.

If the two equals signs trailing the original value are just right-padding, it doesn't seem to matter that they're stripped from the original value when the first 5-byte value is appended, and since the original ID (sans padding characters) doesn't change throughout the thread, I think I can safely consider the 30-character base 64 string as a unique id.
 
In your example I think you are just lucky: the last character before the ==, in this case the g, could also change, because the first 4 bits of the appended data influence it. Therefore you'd better pad your identifier to 24 bytes. It doesn't matter by how many bytes the value grows. You could also take the leftmost 29 chars as an identifier, but to be 100% sure, pad your 22 bytes with 2 zero bytes to 24 bytes. That results in a 32-character base64 string, which stays the same no matter what data is appended.
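
To see why character 30 is the risky one, here is a small sketch using made-up bytes (not real Outlook data): a 22-byte id is encoded alone, then with two different 5-byte suffixes whose first 4 bits differ.

```php
<?php
// 22 arbitrary bytes standing in for timestamp + GUID (made-up test data).
$id = str_repeat("\xAB", 22);

$plain = base64_encode($id);                          // 32 chars, ends in "=="
$low   = base64_encode($id . "\x00\x22\x4C\xE1\xC0"); // suffix starts with bits 0000
$high  = base64_encode($id . "\xF0\x22\x4C\xE1\xC0"); // suffix starts with bits 1111

// The first 29 characters never depend on the appended bytes.
var_dump(substr($plain, 0, 29) === substr($low, 0, 29));  // bool(true)
var_dump(substr($plain, 0, 29) === substr($high, 0, 29)); // bool(true)

// Character 30 holds the last 2 bits of the id plus the first 4 bits
// of whatever follows, so it only survives when those 4 bits are zero.
var_dump($plain[29] === $low[29]);  // bool(true)
var_dump($plain[29] === $high[29]); // bool(false)
```

The 29-character prefix covers 174 of the id's 176 bits, which is why it is stable but leaves the last 2 bits of the GUID out of the comparison.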

Bye, Olaf.
 
Olaf said:
In your example I think you are just lucky, the last character before the ==, in this case the g, could also change.

It seems like after the initial encoding, the base 64 value is just a string, to which other base 64 strings are appended as the thread grows. If the base 64 strings are never re-encoded but simply appended to other base 64 strings, how would the original string change?

I do intend to do much more testing with different values, though, and see if my assumptions hold up.
 
cmayo said:
It seems like after the initial encoding, the base 64 value is just a string, to which other base64 strings are appended as the thread grows
Well, no. If that were the case, the == wouldn't go away or move to the end. Rather, the string is decoded, new bytes are appended, and the result is encoded again. Base64 is just a transfer encoding.

And re-encoding can change the last character of the previous encoding, as Feherke's examples show: "M" is encoded one way, adding "a" to get "Ma" changes the second character, and adding the final "n" to get "Man" changes the last character again, even disregarding the = signs, which go from two to one to none.

Bye, Olaf.
 
If that's the case, I'm going to have a problem searching for like values in the database. MySQL provides a FROM_BASE64() function for use in queries, but that would only partially decode the values.

I guess I'll try reversing the encoding process before I insert into MySQL and insert the unencoded value, then decode the search string before matching with MySQL.
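
A rough sketch of that idea, assuming the thread id is the first 22 decoded bytes and that a hex string is an acceptable storage format. The helper name thread_key and the table and column names in the comment are assumptions, not taken from any existing schema:

```php
<?php
// Hypothetical helper: reduce any Thread-Index value to a fixed
// 44-character hex key built from its first 22 decoded bytes.
function thread_key(string $threadIndex): ?string
{
    $raw = base64_decode($threadIndex, true);
    if ($raw === false || strlen($raw) < 22) {
        return null; // invalid base64 or too short to hold a thread id
    }
    return bin2hex(substr($raw, 0, 22));
}

// The same key would be computed before INSERT and before lookup, e.g.
// with a prepared statement (table/column names are assumptions):
//   SELECT * FROM messages WHERE thread_key = ?
$original = thread_key('AdH1tsVUVHkXt/ZLS4eksRmXC4Q5Ig==');
$reply    = thread_key('AdH1tsVUVHkXt/ZLS4eksRmXC4Q5IgAiyCOQ');
var_dump($original === $reply); // bool(true)
```

Storing a fixed-length key keeps the query an exact match, so an ordinary index on that column can be used.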

Cheers!
 
Well, c'mon, the problem is no problem if your initial data is a multiple of 3 bytes. You just have to add 2 bytes, and then you have 24/3*4 = 32 base64 characters that never change, so you can compare the leftmost 32 chars; they remain the same thread id.

Simply add two 0 bytes in this line:

Code:
$thread_ascii = substr($ft_hex, 0, 12) . $guid . "\000\000";

Now you're set.

Bye, Olaf.

Edit: Actually it's 12 chars from $ft_hex, isn't it?
And md5 should be 32 hex chars. All of these are hex digits rather than ASCII. In that case I think you simply need to append . "0000" for 2 zero bytes.
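
A hedged sketch of the hex-digit variant. The values of $ft_hex and $guid are made up for illustration, and hex2bin() stands in for whatever conversion the original function uses, which isn't shown in this thread:

```php
<?php
// Stand-ins for the generator's variables (values are made up).
$ft_hex = '01d1f5b6c5540000';        // FILETIME as hex digits
$guid   = md5('example-message-id'); // 32 hex digits = 16 bytes

// 12 + 32 + 4 hex digits = 48 hex digits = 24 bytes, a multiple of 3.
$thread_hex   = substr($ft_hex, 0, 12) . $guid . '0000';
$thread_index = base64_encode(hex2bin($thread_hex));

var_dump(strlen($thread_index));      // int(32)
var_dump(strpos($thread_index, '=')); // bool(false)

// Appending a 5-byte suffix (made-up bytes) and re-encoding leaves
// the first 32 characters untouched.
$reply = base64_encode(hex2bin($thread_hex) . "\xF0\x22\x4C\xE1\xC0");
var_dump(substr($reply, 0, 32) === $thread_index); // bool(true)
```

Whether mail clients accept a 24-byte value in place of the usual 22 bytes is not something this sketch checks, so that would still need testing against real replies.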
 