An Unpleasant Surprise

Nick Name

Administrator
USA team member
I came home from work and found that #2 was dead in the water. Something had filled the disk until there was no room left for BOINC. My suspicion is temporary BOINC files, or maybe the slots weren't getting deleted; I recall that problem from a few years ago. Anyway, I was pretty much stuck and had to reboot. That cleared the problem, but since I don't know what caused it, I expect it to return at some point. This system had run 24/7 for weeks, if not months, without issues until now. I'll have to keep an eye on it and maybe reboot once a week or so.
 

doneske

Well-Known Member
USA team member
BOINC really doesn't have what one would think of as "temporary" files, as far as I remember. Most of the files it uses are in the slots directory and will still be there after a reboot, so I would think it must be something other than BOINC. If it is a Windows system, the event log might have an indication. Also, since the system was stable for a long period, I would again rule out BOINC, as most of BOINC's WUs tend to be homogeneous: they do the same thing in the same amount of space over and over, unless a completely new type of work unit was introduced or a different project was added.
 

Nick Name

Administrator
USA team member
It was the Linux box. After looking at disk use this morning I was starting to worry there was something wrong with the SSD; it's just a cheap PNY 120 GB model I got on sale at Best Buy. The system was saying it only had 12 GB available. :oops: Turns out Tracker, a search and indexing program that's part of GNOME, had a few thousand files eating up disk space. I deleted them and that fixed it. I've tried the easy way to disable it for now, as this is something I don't need at all on this machine.


It's been several years and I don't recall specific details, but there was a BOINC version that had a bug where the slots files weren't deleted after a task completed. Eventually BOINC would complain about not having slots available and quit. I think it was specific to Linux, but could be mistaken. DA fixed it pretty quickly as I recall. Anyway that's not what's happening here thankfully, and hopefully nothing else besides this search bug.
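For anyone chasing a similar full-disk mystery, here is a rough sketch of how one might total up disk use per directory to spot the culprit. The function names `dir_sizes` and `largest` are made up for illustration; nothing here comes from BOINC or Tracker, and it only uses the Python standard library.

```python
import os

def dir_sizes(root):
    """Return {name: total bytes} for each immediate child of root.

    Symlinks are skipped so the same data is not counted twice.
    """
    totals = {}
    for entry in os.scandir(root):
        if entry.is_symlink():
            continue
        if entry.is_file(follow_symlinks=False):
            totals[entry.name] = entry.stat(follow_symlinks=False).st_size
        elif entry.is_dir(follow_symlinks=False):
            size = 0
            # Walk the whole subtree; unreadable directories are ignored.
            for dirpath, _dirnames, filenames in os.walk(entry.path, onerror=lambda e: None):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    try:
                        if not os.path.islink(path):
                            size += os.path.getsize(path)
                    except OSError:
                        pass  # file vanished or is unreadable
            totals[entry.name] = size
    return totals

def largest(root, n=10):
    """Return the n biggest children of root as (name, bytes), largest first."""
    return sorted(dir_sizes(root).items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Pointing `largest()` at a home directory, or at a cache directory under it, should put the biggest offender at the top of the list. Which directory Tracker actually writes to on a given system is an assumption to check, not something established in this thread.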
 

doneske

Well-Known Member
USA team member
Checked a slot on a server first thing this morning and found a WU attempting to execute but failing immediately with a domain decomposition error. Looking at the log, it had been doing this for about 12 hours, and the log had grown to over 7 MB. Was able to get it started by dropping the number of CPUs in the slot to 8. Bad news is, things like this can happen and one may not know about it for hours and hours, since F@H really doesn't have a good monitoring tool on the Linux platform. HFM works but it is Windows-based. Tried installing HFM under Wine, but the Wine version in CentOS is too old and HFM complains about missing functions. If it's not one thing it's another. I'll have to see if I have an Ubuntu box that has the desktop installed and try it there. Anyway, all slots are back up and running again.
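Short of a proper monitoring tool, a small script run from cron could at least flag a slot that is stuck in a failure loop. The following is only a sketch under assumptions: the search string "domain decomposition" is a guess at what the log line contains, the threshold is arbitrary, and `tail_lines` and `failure_loop` are hypothetical helper names, not part of F@H or HFM.

```python
import os
import re

def tail_lines(path, max_bytes=65536):
    """Return the lines found in the last max_bytes of a log file.

    If the file is larger than max_bytes, the first returned line may be partial.
    """
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        fh.seek(max(0, size - max_bytes))
        return fh.read().decode("utf-8", errors="replace").splitlines()

def failure_loop(lines, pattern=r"domain decomposition", threshold=5):
    """Return True if at least `threshold` lines match `pattern` (case-insensitive).

    The default pattern is an assumed error string, not a confirmed log format.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    hits = sum(1 for line in lines if rx.search(line))
    return hits >= threshold
```

Something like `failure_loop(tail_lines(path_to_log))` returning True would be the cue to send a mail or restart the slot with fewer CPUs. The path to the log is left to whoever runs it, since it varies by install.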
 

Nick Name

Administrator
USA team member
I've never heard of this domain decomposition before and a search doesn't seem to bring up anything relevant. Is this something specific to FAH?
 

doneske

Well-Known Member
USA team member
It seems to be related to GROMACS. If you aren't running the CPU cores you may not see it, since the GPUs seem to use the OpenMM cores. From what I can tell, no one really understands it all that well. You normally get sent to the GROMACS site.
 

Nick Name

Administrator
USA team member
Ah, OK, thanks. If I have enough time this weekend I hope to get a dedicated system running; I'll try the CPU work then.
 