
How do I avoid problems when a file I'm reading changes ?

Status: Not open for further replies.

NewtownGuy
Technical User
Jul 27, 2007
I'm not sure if this is the right place for this question. I'm running Ubuntu Server 10.04 LTS. This machine is continuously acquiring data in real time and appending snippets of data to some of tens of thousands of files. From time to time, a bash script of mine reads all the files in turn and processes the data in each one.

The problem is that some file reads are getting mixed up. (I checked the files themselves and they're OK.) Sometimes my log file, which records the data read and the processing done, shows parts of several different files when I'm trying to read only one. The files whose reads get mixed up were updated while my analysis program was running.

When I read a file, I read the whole file into an array, from which I do the processing for that file. Then I go on to the next one.

How do I avoid file reads getting mixed up ? I can't stop the data acquisition process, which can update any of the many files. Sometimes, something is going wrong while I'm reading a given file into an array.

I considered copying each file to a temp file before analyzing it, but I have hundreds of MB of files in total and don't want to wear out the disk, and I'm concerned that the copy would be corrupted anyway if the file happens to get updated while I'm trying to copy it.

Thank you in advance for your help.

 
Without actually knowing the specifics of how this stuff is done in the Linux world, it smells like you are not locking the file while reading it: the background process may have copied the file elsewhere while updating it, then written a different file in its place, while you are still reading from where your file used to be.

 
TO: Mike

Thank you for your reply.

How do I lock and unlock a file in Ubuntu 10.04 Server bash while I read it ? What do I do in the other program that may need to write to a locked file so no data is lost ? Does the other program have to check for a lock, buffer the data if locked, then append the buffer somehow after the lock is removed ?



 
Hi

See if you have [tt]flock[/tt] ( part of the util-linux package ) :
man flock said:
[pre]NAME
flock - manage locks from shell scripts[/pre]

But better show us the part of your script where you "read the whole file into an array". I suppose you use either [tt]mapfile[/tt] or [tt]cat[/tt], in which case I would not expect it to fail under any circumstances.
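For reference, a minimal runnable sketch of both ways to slurp a file into an array; the temp file and array names are invented for the demo :

```shell
# Create a throwaway file for the demo.
f=$(mktemp)
printf 'one\ntwo\n' > "$f"

# Bash >= 4.0: mapfile reads the whole file in one builtin call.
mapfile -t viaMapfile < "$f"

# Portable fallback: the classic while-read loop.
viaRead=()
while IFS= read -r line ; do
  viaRead+=("$line")
done < "$f"

echo "${viaMapfile[1]} ${viaRead[1]}"   # -> two two
rm -f "$f"
```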

Feherke.
feherke.github.io
 
TO: feherke

Thank you for your help.

I need to process every file, of which there can be tens of thousands, in a folder. I get the file names with:

ls $readPath -1 -S -r -all --time-style=long-iso --group-directories-first > $listFileName

-- snip --

I then read that file into an array:

index="0"
IFS='
'

while read -r line ; do
  Marray[$index]="$line"
  index=$(($index+1))
done < $listFileName

-- snip --

I then read the file names out of Marray and process each of those files in turn, starting by reading each one into pArray:

index="0"
IFS='
'
while read -r line ; do
  pArray[$index]="$line"
  index=$(($index+1))
done < $readPath$MfileName

-- process each file --

Any of the files in the folder can be updated at any time by another program. I can't stop the data acquisition process to do the processing. I'm having problems processing some of the files, apparently ones that are changing while I'm trying to process them. Since I do my processing from pArray, changes must be occurring while I'm reading a file into pArray. I can see this in log files that show parts of several files being processed when I'm presumably processing only one file. Processing all of the files takes several hours, giving lots of time for things to change, although the critical time seems to be only while I'm reading a file into pArray before processing it.

The data acquisition process appends data to the end of an existing file, so I didn't think files would be moving around the disk and get messed up when another program reads them. But apparently this is a level of the operation of the file system that I don't understand.


 
Hi

Which version of Bash are you using ? Since version 4.0 there is [tt]mapfile[/tt] :
help mapfile said:
[pre]mapfile: mapfile [-n count] [-O origin] [-s count] [-t] [-u fd] [-C callback] [-c quantum] [array]
Read lines from the standard input into an indexed array variable.[/pre]
Bash:
mapfile -t pArray < "$readPath$MfileName"
( As you say "The data acquisition process appends data to the end of an existing file"; if you store the previously seen line count for each file, on the next processing pass you can skip the old lines using the [tt]-s count[/tt] option. )
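A hedged sketch of that incremental idea, with made-up file and variable names ( a real script would persist the seen count per file between runs ) :

```shell
# Simulate an append-only data file.
logfile=$(mktemp)
printf 'line1\nline2\n' >> "$logfile"

# First pass: read everything, remember how many lines were seen.
mapfile -t pArray < "$logfile"
seen=${#pArray[@]}

# The acquisition process appends more data later.
printf 'line3\n' >> "$logfile"

# Next pass: -s skips the lines already processed.
mapfile -t -s "$seen" newLines < "$logfile"
echo "${newLines[0]}"   # -> line3
rm -f "$logfile"
```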

If your Bash is old, better do the reading in one move like this :
Bash:
content="$(< $readPath$MfileName )"

while IFS=$'\n' read -r line ; do
  pArray+=("$line")
done <<< "$content"
Note that the [tt]+=[/tt] ( append ) operator was not supported in very old versions. ( Exists in Bash 3.2, not sure about earlier ones. )

With this latter code it is easier to implement the locking; just change the line where you read the file to :
Bash:
content="$( flock "$readPath$MfileName" -c "cat '$readPath$MfileName'" )"
Note that this will have the desired effect only if you also use [tt]flock[/tt] when writing to those files. Also note that locking may slow down your scripts.


Feherke.
feherke.github.io
 
TO: Feherke

Thank you for the many suggestions. I'm running bash 4.1.5. Since I'm running a version that supports mapfile, how should I use locks with mapfile ?

How do I use locks to write files ? Do I use a loop that atomically checks and sets the lock, waits until it's not locked, then writes the file and clears the lock ? Is there any performance data on how long this takes ?

I'm curious... Since I'm appending data to the end of files, why is data in the middle of unlocked files getting mixed up ? It acts as if files are being moved around the disk instead of data just being added to the ends of them, which implies a lot more disk traffic than I expected. This is of concern since I'm using SSD's, which have limited write life expectancy.

 
Hi

NewtownGuy said:
how should I use locks with mapfile ?
For that you will need to use a file descriptor :
Bash:
exec 3< "$readPath$MfileName"
flock 3
mapfile -u 3 pArray
flock -u 3
exec 3>&-
( If your script is already using file descriptor 3 at that moment, you will need to choose another one. )
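A self-contained version of the pattern above, assuming [tt]flock[/tt] from util-linux is installed; the temp file stands in for one of the data files :

```shell
datafile=$(mktemp)
printf 'a\nb\nc\n' > "$datafile"

exec 3< "$datafile"     # open the file on descriptor 3
flock 3                 # exclusive lock; waits if a writer holds it
mapfile -t -u 3 pArray  # read the lines from descriptor 3
flock -u 3              # release the lock
exec 3>&-               # close the descriptor

echo "${#pArray[@]}"    # -> 3
rm -f "$datafile"
```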

NewtownGuy said:
How do I use locks to write files ?
Similarly to the reading :
Bash:
# with wrapped command
flock "$readPath$MfileName" -c "echo 'New content at `date`' >> '$readPath$MfileName'"

# with file descriptor
exec 3>> "$readPath$MfileName"
flock 3
echo "New content at `date`" >&3
flock -u 3
exec 3>&-

NewtownGuy said:
Do I use a loop that atomically checks and sets the lock, waits until it's not locked, then writes the file and clears the lock ?
man flock said:
[pre]DESCRIPTION
(...) It locks a
specified file or directory, which is created (assuming appropriate
permissions), if it does not already exist. By default, if the lock
cannot be immediately acquired, flock waits until the lock is
available.[/pre]

[small]I cannot answer your question regarding the garbled reading in the middle of the file. But it would be useful, for others who may be able to answer it, to see the code which performs the writing.[/small]

Feherke.
feherke.github.io
 
I knew this would introduce me to some new things... :)

I'm confused by wrapped commands and file descriptors. What lock file is used with a wrapped command ? If the writing program uses one locking method, do the reading programs have to use the same method ? If I use file descriptors, do the reading and writing programs have to agree on the use of a particular file descriptor ? How do I know which file descriptors are in use, and how do I reserve another one for my application ? Which code is the wrapped code for reading that corresponds to the wrapped code for writing ?

 
Hi

NewtownGuy said:
If I use file descriptors, do the reading and writing programs have to agree on the use of a particular file descriptor ?
They do not even need to be implemented in the same programming language. All use the [tt]flock()[/tt] system call ( see man 2 flock ) :
Code:
 bash                                            perl -de 42
master # cat NewtownGuy
                                                 DB<1> open F,'>','NewtownGuy';
                                                 DB<2> flock F,2; # LOCK_EX
master # flock NewtownGuy -c 'cat NewtownGuy'
                                                 DB<3> print F "Some content\n";
                                                 DB<4> flock F,8; # LOCK_UN
Some content
master #
                                                 DB<5> close F;
( On systems that do not support [tt]flock()[/tt], some programming languages will emulate it, while others will raise an error. )

In the above sample the important part is that [tt]flock[/tt] in Bash will wait until the file is unlocked by Perl, only after that will execute the [tt]cat[/tt].

NewtownGuy said:
How do I know which file descriptors are in use, and how do I reserve another one for my application ?
In shell scripts you know because you open them by number. In other programming languages the next unused file descriptor is assigned on [tt]open()[/tt]. For your shell script, on Linux, just run [tt]ls -l /proc/$$/fd/[/tt] to get a list of the file descriptors open by the current process.

No need ( and as far as I know, no way ) to reserve the file descriptors. Each process has its own file descriptors, numbered from 0 up to the maximum. ( [tt]ulimit -n[/tt] will display the configured maximum. ) When a process starts, file descriptors 0 ( standard input ), 1 ( standard output ) and 2 ( standard error ) are assigned implicitly. ( You will find that Bash also assigns 255, no idea why. ) In my examples I just used the next one, 3.
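You can verify this yourself on Linux with a quick sketch ( descriptor 7 chosen arbitrarily ) :

```shell
exec 7< /dev/null       # open an arbitrary extra descriptor
fds=$(ls /proc/$$/fd/)  # list the current shell's open descriptors
exec 7<&-               # close it again
echo "$fds"             # typically 0 1 2 255 plus our 7
```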

NewtownGuy said:
Which code is the wrapped code for reading that corresponds to the wrapped code for writing ?
Do you mean how to combine [tt]flock[/tt] with [tt]mapfile[/tt] ? There is no direct way. [tt]mapfile[/tt] is a shell builtin, so [tt]flock[/tt] cannot execute it directly. And if you start a new shell instance to call [tt]mapfile[/tt], the read data will be discarded as soon as that shell instance terminates. So you can only choose between the already posted [tt]flock[/tt] + [tt]cat[/tt] code and the file descriptor based one.
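One possible workaround the thread does not cover: a process substitution lets [tt]mapfile[/tt] run in the current shell while [tt]flock[/tt] + [tt]cat[/tt] run in the child, so the array survives. A sketch with invented names :

```shell
f=$(mktemp)
printf 'x\ny\n' > "$f"

# cat runs under the lock; mapfile runs in the current shell.
mapfile -t pArray < <(flock "$f" -c "cat '$f'")

echo "${#pArray[@]}"   # -> 2
rm -f "$f"
```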

Feherke.
feherke.github.io
 
Hi,

I read the man page in Ubuntu Server 10.04 for flock, but I still have questions. The man page gives an example:

(
flock -s 200
-- code to execute under lock ---
) 200>/var/lock/mylockfile

The man page says -s is a shared lock, or read lock, but the default is a write lock. Does the program that writes the data set a write lock, and the program that reads the data set a read lock ?

If I use -w N, for a timeout, how small can N be and still be useful ? .001 ? Since processing is suspended until the lock is available, I don't want to hang too long if there's a problem.

What is mylockfile ? Is it a file, whose name I can choose, for all my locks so they don't get mixed up with other locks ?

What is the range of values for the lock, e.g., 200 ? Must it be numeric ?

How much does using locks slow down program execution ?

Why did you use exec's with flock instead of the format with ( and ) from the man page ?

--

 
Hi

NewtownGuy said:
Does the program that writes the data set a write lock, and the program that reads the data set a read lock ?
If you ask me, I would just use the default exclusive lock for all operations. The shared lock is useful to allow multiple reads at the same time, as long as no write happens. ( As I understood, you will not need multiple simultaneous reads. )

NewtownGuy said:
If I use -w N, for a timeout, how small can N be and still be useful ? .001 ? Since processing is suspended until the lock is available, I don't want to hang too long if there's a problem.
If you mean that your script can postpone a locked file's processing, handle other files, then retry the ones that were locked earlier, then I would just use [tt]-n[/tt] ( [tt]--nonblock[/tt] ) to not wait at all after a failed locking attempt.
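A minimal sketch of the non-blocking attempt ( a temp file stands in for a data file; nothing else holds the lock here, so the first branch runs ) :

```shell
f=$(mktemp)

if flock -n "$f" -c "true" ; then
  status=processed     # got the lock immediately
else
  status=busy          # another process holds it; retry this file later
fi

echo "$status"         # -> processed
rm -f "$f"
```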

NewtownGuy said:
What is mylockfile ? Is it a file, whose name I can chose, for all my locks so they don't get mixed up with other locks ?
It is just a dummy file. That approach is useful in case you want to use [tt]flock[/tt] to synchronize some other kind of operations, not involving files. [tt]flock[/tt] needs a file for locking anyway, so there you provide one. All [tt]flock[/tt] calls on the same file will wait for each other.

So the following code is a bad idea, as all file accesses will be synchronized, regardless of whether they use the same file or not :
Code:
for MfileName in "$readPath/"* ; do
  (
  flock 200
  echo "New content at `date`" >> "$MfileName"
  ) 200>/var/lock/mylockfile
done
Code:
for MfileName in "$readPath/"* ; do
  (
  flock -s 200
  cat "$MfileName"
  ) 200>/var/lock/mylockfile
done
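A better shape, sketched with made-up paths: lock each data file itself, so accesses to different files never wait on each other :

```shell
readPath=$(mktemp -d)
printf 'hello\n' > "$readPath/one"
printf 'world\n' > "$readPath/two"

# Each iteration locks only the file it touches.
combined=$(
  for MfileName in "$readPath"/* ; do
    flock "$MfileName" -c "cat '$MfileName'"
  done
)

echo "$combined"   # prints hello, then world
rm -rf "$readPath"
```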

NewtownGuy said:
What is the range of values for the lock, e.g., 200 ? Must it be numeric ?
As mentioned, that is the file descriptor. The range is between 0 ( inclusive ) and the maximum configured on your system ( exclusive ). You can find out the maximum using [tt]ulimit -n[/tt]. ( For example on my system it displays 1024, so I am allowed to use file descriptors from range 0..1023. ) And yes, file descriptors are integer numbers.

NewtownGuy said:
How much does using locks slow down program execution ?
No idea.

NewtownGuy said:
Why did you use exec's with flock instead of the format with ( and ) from the man page ?
Mostly matter of personal style. That way I can place the unlocking code in a different function, for example one dedicated for cleaning up.

Regarding the example from the man page, note that parentheses ( () ) will start a subshell, so variables set inside them will not be visible outside :
Code:
(
flock -s 200
mapfile -t pArray < "$readPath$MfileName"
) 200>/var/lock/mylockfile

echo "${pArray[@]}" # content read by mapfile not available anymore
Of course, you can move the processing inside the parentheses, but then the lock will be kept on the file pointlessly even after it is no longer being accessed. That has a good chance of slowing down your script.

To correct this, just replace the enclosing parentheses with braces ( {} ) to avoid running that code in a subshell. With that change the above example will work.
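The corrected example, runnable as-is except that a temp lock file replaces /var/lock/mylockfile ( which may need root to create ) :

```shell
f=$(mktemp)
lockfile=$(mktemp)
printf 'a\nb\n' > "$f"

{
  flock -s 200
  mapfile -t pArray < "$f"
} 200> "$lockfile"

echo "${#pArray[@]}"   # -> 2 : the array is still visible here
rm -f "$f" "$lockfile"
```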

Feherke.
feherke.github.io
 