Hi,
After a recent upgrade to much faster hardware, LockFile::Simple started to generate a large number of errors, e.g.:
cannot unlock /net/home/username/scratch/XXXXX : lock not owned at /usr/lib/perl5/vendor_perl/5.8.8/LockFile/Simple.pm line 206.
Here /net/home/username/ is an NFS-shared filesystem, and the lock manager object was configured with:
-nfs=>1, -stale=>0, -hold=>0.
After some debugging, it appears that in about 15% of cases the contents of the lock files created were wrong. In some cases they looked corrupted (e.g. spurious extra lines with bits of garbage), while in most cases they held PID:hostname pairs from the wrong process instances (run on another machine). Looking inside the _acs_lock() code, I noticed that no attempt at all is made to prevent a race condition when creating a new lock file. Obviously, since file creation is not an atomic operation on NFS, this can't work reliably when, e.g., thousands of processes attempt to obtain a new lock for the same file (almost) simultaneously.
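To illustrate what I mean by race-free creation: a commonly described workaround on NFS is to write a uniquely named temporary file, hard-link it to the lock name, and then check the temporary file's link count rather than trusting the return value of link(). The sketch below is only that, a sketch. try_lock_via_link() is a hypothetical helper of my own, not part of LockFile::Simple and not what _acs_lock() does, and the "$lockfile.host.pid" naming is an assumption.

```perl
use strict;
use warnings;
use Sys::Hostname qw(hostname);

# Hypothetical helper (not part of LockFile::Simple): try to take
# $lockfile by hard-linking a private temporary file to it.
# Returns true if this process now owns the lock, false otherwise.
sub try_lock_via_link {
    my ($lockfile) = @_;
    my $host = hostname();
    my $tmp  = "$lockfile.$host.$$";    # unique per host and process

    # Write the owner stamp to the private file first, so the lock
    # file never appears with partial contents.
    open(my $fh, '>', $tmp) or die "cannot create $tmp: $!";
    print {$fh} "$$:$host\n";
    close($fh) or die "cannot close $tmp: $!";

    # The return value of link() is deliberately ignored: over NFS it
    # can be misleading, so the link count of the temporary file is
    # checked instead (2 means the lock name now points at our file).
    link($tmp, $lockfile);
    my $nlink = (stat($tmp))[3];
    unlink($tmp);

    return defined $nlink && $nlink == 2;
}
```

A caller would loop on try_lock_via_link() with a delay until it returns true, and unlink the lock file to release it.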
Surprisingly enough, I have been using this module for locking files across NFS for several years without a single such error, albeit on older and slower hardware and at a somewhat smaller scale (a third to a quarter as many concurrent processes).
Is anybody else experiencing such errors?
I was thinking about patching the _acs_lock() code to add file locking via flock(), but unfortunately the situation with Perl's flock() implementation is rather unclear: half of the documentation I found (including the latest perlopentut for 5.10.1) says it does not work across NFS, while the other half states that it does (might?) work on Linux kernels after 2.6.12. I will have to experiment a bit myself, but it looks like a flock()-based solution would be non-portable anyway.
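The experiment I have in mind is roughly the following probe, run at the same time from two different hosts against the same NFS path. try_flock() is a hypothetical helper name. Whether the second host is actually refused depends on how Perl's flock() was built (perlfunc says it may use flock(2), fcntl(2) locking, or lockf(3)) and on the kernel and NFS setup, so this is a test harness under those assumptions, not a fix.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical probe: try to take an exclusive, non-blocking lock.
# Returns the open handle (which holds the lock) or undef if the
# lock is already held elsewhere.
sub try_flock {
    my ($path) = @_;
    open(my $fh, '>>', $path) or die "cannot open $path: $!";
    # LOCK_NB makes flock() return immediately instead of blocking.
    return flock($fh, LOCK_EX | LOCK_NB) ? $fh : undef;
}
```

The lock is released by calling flock($fh, LOCK_UN) or by closing the handle.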
--bamyasi