
explode and preg_split eat all Memory 2


sen5241b (IS-IT--Management), Sep 27, 2007
Why do both the explode and preg_split functions eat up all of the available 32MB of memory on my server, resulting in a fatal error? Kind of crazy that they do. (I also un-commented the explode line and got the same error.) The data contains a lot of non-English Unicode characters.

Code:
$len = mb_strlen($FileContents);
echo '<br> len of filecontents=' . $len;

// $SignificantPlaceNames = explode("\n", $FileContents);   // stupid explode kills memory
$SignificantPlaceNames = preg_split("/[\s,]+/", $FileContents);

Output:

len of filecontents=2767569
Fatal error: Allowed memory size of 33554432 bytes exhausted (tried to allocate 16 bytes) in /var/ on line 57

 
tsuji,

Yes, "fopen" and then "fread" and nothing in between. There is a lot of Unicode in the file and I think that is the real issue here.
 
A lot of multi-byte UTF-8 characters shouldn't matter; that is the normal state of affairs. Even if the splits happened in the wrong places (which won't happen for \s and ","), the result should still roughly follow the byte-count arithmetic. It should not blow up the memory: the pattern is very simple, and counting bytes, there is ample room to swap around the ~3MB of data within the 33MB available.

Have you tried splitting byte by byte with the pattern "//"?
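Something along these lines (just a sketch, reusing your $FileContents variable; without the u modifier the empty pattern splits at every byte, and PREG_SPLIT_NO_EMPTY drops the empty pieces):

Code:
$bytes = preg_split('//', $FileContents, -1, PREG_SPLIT_NO_EMPTY);	// one array element per byte
echo '<br> pieces=' . count($bytes);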
 
I am still stymied by the report that this happens both with fopen-type commands and with file_get_contents. They use different memory-mapping techniques, and I would have expected some difference.

My next step would be to examine samples of the source data to see whether I can reproduce the issue. So please post the data somewhere and point us to the source.
 
Q: Have you tried splitting byte by byte with the pattern "//"?

A: I have not tried that pattern. Is that a regex (used with preg_split) that would split on blanks?



Q: I am still stymied by the report that this happens both with fopen-type commands and with file_get_contents. They use different memory-mapping techniques, and I would have expected some difference.

A: fopen and file_get_contents do NOT cause the issue. The file reads just fine into a big string using file_get_contents, BUT when I try to convert the string to an array I hit the memory issue.

Give me a little time and I will reproduce the code for the numerous methods I tried to get the data into an array without it eating a ton of memory.



 
Sorry to come late to this party!
So what you're saying is: if you have a string in your code made up of 2.7MB of characters and you try to convert it to an array, you get the issue, and it has nothing to do with reading files etc.?
Could you supply a representative sample of data that we could scale up to 2.7MB to see if we get the error?
 
Personally I am more interested in the data than the code. As ingresman says, and as I have asked: if you can supply a sample file that barfs for you, then we can take a proper look.
 
Yes ingresman, the opening of the file and reading in of the data is not a problem. Only when I try to put it into an array does it eat up an unusually huge amount of memory.

I'll supply the code and data. I deleted some of the things I tried, so I am still writing it up.
 
Here are most of the things I tried that caused the memory issue. I did not try all of these methods at the same time; I commented the code so as to try only one at a time, but every one of them caused the script to gobble an insane amount of memory. I am leaning toward the theory that PHP is barfing on the 2.7MB of Unicode.

Here is the Unicode data file; right-click and save:


Code:
<?php
		echo '<br> Tried all on PHP Version 5.2.4-2 ubuntu  and also  PHP Version 5.2.14 --same result';
	ini_set('auto_detect_line_endings', true);
	mb_internal_encoding("UTF-8");
	$FTfile = 'SignificantPlaceNames.dat';  
// ============================================================  
	echo '<br> #1 fgets';
	echo '<br> before  DEBUG: BEGIN MEMORY=' . memory_get_usage();
	echo '<br> before  DEBUG: MEMORY PEAK=' . memory_get_peak_usage();
    $thefile = fopen($FTfile, 'rb'); 			// also tried without the "b"
	$SignificantPlaceNames = array();
	$i = 0;															// fix: $i must be initialised before the loop
	while (false !== ($data = fgets($thefile, 4096)))				// memory error occurs on the fgets   !!!!!!
			{ $SignificantPlaceNames[$i] = $data;  $i++; }
	fclose($thefile);
	echo '<br> after  DEBUG: BEGIN MEMORY=' . memory_get_usage();
	echo '<br> after  DEBUG: MEMORY PEAK=' . memory_get_peak_usage();
	echo '<br> lines read=' . count($SignificantPlaceNames);		// note: $FileContents is never set in this test, so mb_strlen() on it was a bug
 // print_r($SignificantPlaceNames);      
// =============================================================
	echo '<br> #2 explode';
	echo '<br> before  DEBUG: BEGIN MEMORY=' . memory_get_usage();
	echo '<br> before  DEBUG: MEMORY PEAK=' . memory_get_peak_usage();
	$thefile = fopen($FTfile, 'rb'); 		
	if ($thefile)   {  $FileContents = fread($thefile, filesize($FTfile));   	} 
	fclose($thefile);
	$SignificantPlaceNames = explode("\n", $FileContents);   		// stupid explode  kills memory  !!!!!
	echo '<br> after  DEBUG: BEGIN MEMORY=' . memory_get_usage();
	echo '<br> after  DEBUG: MEMORY PEAK=' . memory_get_peak_usage();
 // print_r($SignificantPlaceNames); 
// =============================================================
	echo '<br> #3  the file function';
	echo '<br> before  DEBUG: BEGIN MEMORY=' . memory_get_usage();
	echo '<br> before  DEBUG: MEMORY PEAK=' . memory_get_peak_usage();
	$SignificantPlaceNames = file($FTfile, FILE_SKIP_EMPTY_LINES | FILE_IGNORE_NEW_LINES);	// error eats memory
	echo '<br> after  DEBUG: BEGIN MEMORY=' . memory_get_usage();
	echo '<br> after  DEBUG: MEMORY PEAK=' . memory_get_peak_usage();
	// print_r($SignificantPlaceNames);
	// =============================================================  
	echo '<br> #4    preg_split';
	echo '<br> before  DEBUG: BEGIN MEMORY=' . memory_get_usage();
	echo '<br> before  DEBUG: MEMORY PEAK=' . memory_get_peak_usage();
	$thefile = fopen($FTfile, 'rb'); 		
	if ($thefile)   {  $FileContents = fread($thefile, filesize($FTfile));   	} 
	fclose($thefile); 
	$SignificantPlaceNames = preg_split("/\s/", $FileContents);			// tried other patterns but all eat memory
	echo '<br> after  DEBUG: BEGIN MEMORY=' . memory_get_usage();
	echo '<br> after  DEBUG: MEMORY PEAK=' . memory_get_peak_usage();
 // print_r($SignificantPlaceNames);	
 	// =============================================================  
	echo '<br> #5    mb_split';
	echo '<br> before  DEBUG: BEGIN MEMORY=' . memory_get_usage();
	echo '<br> before  DEBUG: MEMORY PEAK=' . memory_get_peak_usage();
	$thefile = fopen($FTfile, 'rb'); 		
	if ($thefile)   {  $FileContents = fread($thefile, filesize($FTfile));   	} 
	fclose($thefile); 
	$SignificantPlaceNames = mb_split("/\s/", $FileContents);		// tried other patterns but all eat memory
																	// note: mb_split() takes a bare pattern, e.g. mb_split("\s", ...);
																	// with the slashes included it barely matches, so a single-element array comes back (see discussion below)
	echo '<br> after  DEBUG: BEGIN MEMORY=' . memory_get_usage();
	echo '<br> after  DEBUG: MEMORY PEAK=' . memory_get_peak_usage();
 // print_r($SignificantPlaceNames);	
	?>
 
SORRY, to get the data file, go to the link, THEN right-click and save!
 
I excused myself from tests #1 and #5; here are the results for the rest on my local system:
[tt]
<br> #2 explode
<br> before DEBUG: BEGIN MEMORY=6471104
<br> before DEBUG: MEMORY PEAK=6605456
<br> after DEBUG: BEGIN MEMORY=9268464
<br> after DEBUG: MEMORY PEAK=15524560

analysis:
start (begin,after) : (6471104, 9268464); ratio (1, 1.43)
peak (begin,after) : (6605456, 15524560); ratio (1, 2.35)

<br> #3 the file function
<br> before DEBUG: BEGIN MEMORY=9268464
<br> before DEBUG: MEMORY PEAK=15524560
<br> after DEBUG: BEGIN MEMORY=9235320
<br> after DEBUG: MEMORY PEAK=18297768

analysis:
start (begin,after) : (9268464, 9235320); ratio (1, 0.996)
peak (begin,after) : (15524560, 18297768); ratio (1, 1.18)

<br> #4 preg_split
<br> before DEBUG: BEGIN MEMORY=9235320
<br> before DEBUG: MEMORY PEAK=18297768
<br> after DEBUG: BEGIN MEMORY=41416424
<br> after DEBUG: MEMORY PEAK=47555480

analysis:
start (begin,after) : (9235320, 41416424); ratio (1, 4.48)
peak (begin,after) : (18297768, 47555480); ratio (1, 2.60)
[/tt]
UTF-8 encoded characters mostly take 1 to 4 bytes (6 bytes max under the original scheme). I don't see, on the face of it, any particularly alarming anomalies.
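To illustrate the byte counting (a trivial example, not taken from the test data):

Code:
$s = "café";                    // "é" encodes as two bytes in UTF-8
echo strlen($s);                // 5 -- counts bytes
echo mb_strlen($s, 'UTF-8');    // 4 -- counts characters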
 
Thanks for looking at this, tsuji. Let me repeat that I saw this issue on two completely different LAMP servers.
 
That's what this is about: looking at it. Reinstallation of the PHP package is in order.
 
Two different LAMP servers.

I assume both were repo installs of PHP on the Ubuntu 5.12 distro? If you can confirm that, I can set up a mirrored test bed.
 
"i assume both were repo installs of php on the ubuntu 5.12 distros? if you can confirm then i can set up a mirrored test bed."

No they were not the same distros. The two LAMPs I tested on were completley different. One was an Ubuntu 5.12 distro but the other was Go Daddy's own specially flavored LAMP. They have their own custom tailored version of Linux --they do not use any distro.
 
I believe reinstallation of PHP is the only sensible action to take; even the team taking care of PHP, who have an interest as stakeholders, would probably look into it only after that action has been taken and the same behavior persists...
 
"I believe reinstallation of php ..."

I disagree. If the issue exists on two completely different LAMPs, I do no believe a PHP reinstall would resolve the issue.
 
A LAMP stack is not something one can rely on 100%; the components can get out of sync. OK, file your bug report with that kind of debugging on your side.
 
OK, I just built a third LAMP from scratch with a different version of Linux, and got the same result: the script eats up well over 30MB of memory.
 
These are the results I obtained:

Code:
#1 fgets
before DEBUG: BEGIN MEMORY=82124
before DEBUG: MEMORY PEAK=135516
after DEBUG: BEGIN MEMORY=28215304
after DEBUG: MEMORY PEAK=28227756

#2 explode
before DEBUG: BEGIN MEMORY=77176
before DEBUG: MEMORY PEAK=80044
after DEBUG: BEGIN MEMORY=30714224
after DEBUG: MEMORY PEAK=30714276

#3 the file function
before DEBUG: BEGIN MEMORY=71940
before DEBUG: MEMORY PEAK=80240
after DEBUG: BEGIN MEMORY=27942284
after DEBUG: MEMORY PEAK=33477884

#4 preg_split
before DEBUG: BEGIN MEMORY=68536
before DEBUG: MEMORY PEAK=80056
after DEBUG: BEGIN MEMORY=30705700
after DEBUG: MEMORY PEAK=30705752

#5 mb_split
before DEBUG: BEGIN MEMORY=63136
before DEBUG: MEMORY PEAK=80052
after DEBUG: BEGIN MEMORY=5597216
after DEBUG: MEMORY PEAK=5599432

#6 split
before DEBUG: BEGIN MEMORY=68524
before DEBUG: MEMORY PEAK=80044
after DEBUG: BEGIN MEMORY=5602604
after DEBUG: MEMORY PEAK=5604820

The last iteration used split in lieu of mb_split.

At first glance this shows that neither split nor mb_split is being memory-greedy, but the others are.

None, however, breached the memory limit of 32MB that I placed on the script.

The split and mb_split results are aberrations: on examination, the array is not being properly formed. The pattern is not matching the whitespace characters (\s, \n etc.), and thus only a single-element array is returned.

I also examined the memory usage within the loop at each iteration. Whilst the increase in usage is not strictly linear, there is certainly no big jump.

Taking one of the functions: if we unset the array after it has been built, the memory usage drops back down to anticipated levels. We can therefore be reasonably confident that there is no memory leakage.
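For example, something like this around the explode test (a sketch; the echo labels are mine):

Code:
$SignificantPlaceNames = explode("\n", $FileContents);
echo '<br> after build MEMORY=' . memory_get_usage();
unset($SignificantPlaceNames);                           // release the array
echo '<br> after unset MEMORY=' . memory_get_usage();    // drops back near the pre-build figure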

The above testing was done on a MAMP installation (as the OP reports that Ubuntu is not a common denominator), PHP version 5.2.9.

Wanting to experiment further, I tried running the explode variant via the command line on a 5.3.3 installation. These were the results:
Code:
ver 5.3.3
#YY file method
before  DEBUG: BEGIN MEMORY=631176
before  DEBUG: MEMORY PEAK=637176
after split  DEBUG: BEGIN MEMORY=54764784
after split  DEBUG: MEMORY PEAK=57541328

This signifies that PHP 5.3.3 is significantly more memory-greedy than 5.2.9.

Wanting to test whether this was something specific to multibyte strings (which struck me as unlikely), I tested a loop that created an array of 300,000 elements (roughly the number of lines in the test data), each holding an eleven-character word. The memory usage was around 30MB; a sketch of the test follows.
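Roughly what that synthetic test looked like (a sketch; the exact word and loop shape are assumed):

Code:
<?php
$arr = array();
for ($n = 0; $n < 300000; $n++) {
	$arr[] = 'elevenchars';     // an eleven-character, single-byte word
}
echo memory_get_usage();        // came out on the order of 30MB
?>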

This indicates that there is simply a per-element overhead in PHP's array structures: each element carries a zval plus a hashtable bucket, roughly 100-150 bytes of bookkeeping on a 64-bit PHP 5 build, dwarfing an eleven-byte string (300,000 elements × ~100 bytes is already ~30MB). Which in turn makes it very likely that this is NOT a bug but expected behaviour.

Solutions to the issue would depend on what you actually want to do with the data.
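For instance, if each line only needs to be handled once, streaming the file keeps memory roughly flat instead of materialising a 300,000-element array (a hypothetical sketch using the OP's filename):

Code:
<?php
$fh = fopen('SignificantPlaceNames.dat', 'rb');
if ($fh) {
	while (false !== ($line = fgets($fh))) {
		$name = rtrim($line, "\r\n");
		// ... process one place name here; nothing accumulates in memory ...
	}
	fclose($fh);
}
?>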
 
jpadie, thanks so much for your help on this. It does seem that the functions are memory-greedy rather than leaking. I am still curious as to why over 30MB of overhead is needed to put 2.7MB of data into an array; that might be a question for a PHP developer.
 