GPU troubles with FAH in Linux

BeauZaux

Well-Known Member
USA team member
Completed a new build with 990x and GTX 690 dual GPU. FAH is loaded and configured and the GPUs appear to be recognized, but it fails to run. "Running" flashes on FAHControl, then jumps back to "Ready". WU_STALLED is the last log entry. This is a beta version of FAH, but it's running fine on another machine with a GTX 760. Can't get the latest stable version to see any GPUs in Linux Mint. Restarted and reloaded several times. In configuration, I have opencl-index set to "0" to eliminate another error. Searched all over, but can't seem to find a solution. Here is some of the log. I'd appreciate a clue.
Oh, started GPUGrid to verify GPU capability, running well on both GPUs.
Thanks

17:03:40:WARNING:WU02:FS02:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
17:03:40:WU02:FS02:Connecting to 18.218.241.186:80
17:03:40:WU02:FS02:Assigned to work server 3.133.76.19
17:03:40:WU02:FS02:Requesting new work unit for slot 02: READY gpu:1:GK104 [GeForce GTX 690] from 3.133.76.19
17:03:40:WU02:FS02:Connecting to 3.133.76.19:8080
17:03:40:WARNING:WU01:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
17:03:40:WU01:FS01:Connecting to 18.218.241.186:80
17:03:40:WU01:FS01:Assigned to work server 3.133.76.19
17:03:40:WU01:FS01:Requesting new work unit for slot 01: READY gpu:0:GK104 [GeForce GTX 690] from 3.133.76.19
17:03:40:WU01:FS01:Connecting to 3.133.76.19:8080
17:03:40:ERROR:WU01:FS01:Exception: Server did not assign work unit
17:03:44:WU02:FS02:Downloading 17.39MiB
17:03:46:WU02:FS02:Download complete
17:03:47:WU02:FS02:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:14436 run:1950 clone:4 gen:11 core:0x22 unit:0x0000000c03854c135e9a790e422270de
17:03:47:WU02:FS02:Starting
17:03:47:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 02 -suffix 01 -version 706 -lifeline 4144 -checkpoint 15 -gpu-vendor nvidia -opencl-device 0 -cuda-device 1 -gpu 0
17:03:47:WU02:FS02:Started FahCore on PID 5064
17:03:47:WU02:FS02:Core PID:5068
17:03:47:WU02:FS02:FahCore 0x22 started
17:03:47:WARNING:WU02:FS02:FahCore returned: WU_STALLED (127 = 0x7f)
17:03:47:WU02:FS02:Starting
17:03:47:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 02 -suffix 01 -version 706 -lifeline 4144 -checkpoint 15 -gpu-vendor nvidia -opencl-device 0 -cuda-device 1 -gpu 0
17:03:47:WU02:FS02:Started FahCore on PID 5069
17:03:47:WU02:FS02:Core PID:5073
17:03:47:WU02:FS02:FahCore 0x22 started
17:03:48:WARNING:WU02:FS02:FahCore returned: WU_STALLED (127 = 0x7f)
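
When a slot restarts over and over like this, it can help to tally the core's return codes instead of scrolling the log. The following is a minimal sketch, assuming the log lines keep the format shown above; `count_core_returns` and `SAMPLE` are hypothetical names, not part of FAH. Note that 127 is also the conventional shell exit status for a command that could not be launched, so a missing library is one possibility worth checking, not a confirmed diagnosis.

```python
import re
from collections import Counter

# Matches lines such as:
# 17:03:47:WARNING:WU02:FS02:FahCore returned: WU_STALLED (127 = 0x7f)
RETURN_RE = re.compile(
    r":WU(?P<wu>\d+):FS(?P<slot>\d+):FahCore returned: "
    r"(?P<status>\w+) \((?P<code>\d+) = 0x[0-9a-fA-F]+\)"
)


def count_core_returns(log_text):
    """Count (slot, status, exit code) for every 'FahCore returned' line."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RETURN_RE.search(line)
        if match:
            key = (match.group("slot"), match.group("status"), int(match.group("code")))
            counts[key] += 1
    return counts


# Four lines copied from the log above.
SAMPLE = """\
17:03:47:WU02:FS02:FahCore 0x22 started
17:03:47:WARNING:WU02:FS02:FahCore returned: WU_STALLED (127 = 0x7f)
17:03:47:WU02:FS02:FahCore 0x22 started
17:03:48:WARNING:WU02:FS02:FahCore returned: WU_STALLED (127 = 0x7f)
"""

if __name__ == "__main__":
    for (slot, status, code), n in count_core_returns(SAMPLE).items():
        print(f"slot {slot}: {status} (exit {code}) seen {n} time(s)")
```

A core that returns the same code within a second of starting, every time, points at something failing at launch and not partway through the work unit.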
 

Nick Name

Administrator
USA team member
I haven't had time to make a serious effort to get it working on Linux yet. I haven't seen this error on my Windows rigs running 7.5.1 or 7.6.9. I don't see a lot on this specific error but it seems to indicate the job may be crashing. I did notice this:

-opencl-device 0 -cuda-device 1 -gpu 0

I'm not sure why the device numbers don't match. It shouldn't matter since both GPUs are identical, but you can try a config like this.

<!-- Folding Slots -->
<slot id='0' type='GPU'>
<cuda-index v='1'/>
<gpu-index v='1'/>
<opencl-index v='1'/>
</slot>

Or set all the device indexes to 0; the idea is to make the device ordering the same across all three. Is anything running on the other GPU?
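
To spot this kind of mismatch without reading every "Running FahCore" line by eye, the device flags can be pulled out of the command line and compared. This is only a sketch under the assumption that the flags look like the ones in the log above (`-opencl-device`, `-cuda-device`, `-gpu`); the function names are made up for illustration.

```python
import re

# Device-selection flags seen on the FahCore command line in the log above.
INDEX_FLAGS = ("opencl-device", "cuda-device", "gpu")


def device_indexes(command_line):
    """Return {flag: index} for each device flag present on the command line."""
    found = {}
    for flag in INDEX_FLAGS:
        # Require whitespace (or start of string) before the dash and a number
        # after the flag, so "-gpu 0" matches but "-gpu-vendor nvidia" does not.
        match = re.search(rf"(?<!\S)-{re.escape(flag)} (\d+)", command_line)
        if match:
            found[flag] = int(match.group(1))
    return found


def indexes_agree(command_line):
    """True when every device flag that is present carries the same index."""
    return len(set(device_indexes(command_line).values())) <= 1


if __name__ == "__main__":
    line = ("-dir 02 -suffix 01 -version 706 -lifeline 4144 -checkpoint 15 "
            "-gpu-vendor nvidia -opencl-device 0 -cuda-device 1 -gpu 0")
    print(device_indexes(line))
    print("indexes agree:", indexes_agree(line))
```

Whether mismatched indexes are actually harmful depends on the client and driver; the check only flags the difference so it can be ruled in or out.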
 

BeauZaux

Well-Known Member
USA team member
Thanks, Nick, I'll give that a try. Presently taking my systems down to redo thermal compounds on all GPU and CPU's. Lots of old stuff running higher than normal temps.
 

Nick Name

Administrator
USA team member
That's always a good idea. I briefly had a Vega Frontier card that I got on eBay before the Radeon VII was released, and it didn't work until I cleaned the water block and redid the pads and paste.
 

BeauZaux

Well-Known Member
USA team member
Argh, delayed again. Had to order thermal pads. Not available in town. Cleaned and refurbished 760 and 6870 to no avail. Still fubar. 275 is only slightly cooler, 85C running PrimeGrid with the cover open. I think that's a driver problem. No auto fan control. Researching...
Just saw the happy face emojis in my log above...wt?
 

Nick Name

Administrator
USA team member
I haven't had any luck with AMD on Linux yet, but I know that Nvidia fan speed for crunching is always too low.

This worked for me so I can manually set the fan speed high enough to keep things cool, but I never tried it on any hardware that old. Hopefully it works for you too. I've never been able to get the persistence mode to hold, so I have to reset the fan speed on every restart but thankfully that's not too often for me.
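
The procedure linked above is not reproduced here, so the following is only a sketch of one common approach, assuming the `nvidia-settings` attributes `GPUFanControlState` and `GPUTargetFanSpeed` are available on the installed driver and that fan control has been enabled through Coolbits in xorg.conf. The helper name `fan_command` is hypothetical. It only builds the command so it can be inspected before running.

```python
import shlex


def fan_command(gpu_index, fan_index, percent):
    """Build an nvidia-settings command that sets a manual fan speed.

    Assumes the driver exposes GPUFanControlState and GPUTargetFanSpeed;
    older drivers may use different attribute names.
    """
    if not 0 <= percent <= 100:
        raise ValueError("fan speed must be between 0 and 100 percent")
    return [
        "nvidia-settings",
        "-a", f"[gpu:{gpu_index}]/GPUFanControlState=1",
        "-a", f"[fan:{fan_index}]/GPUTargetFanSpeed={percent}",
    ]


if __name__ == "__main__":
    cmd = fan_command(gpu_index=0, fan_index=0, percent=80)
    # Print a copy-and-paste form; run it from a session with X running.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Since the setting does not survive a restart, a command like this is typically placed in a startup script once it is confirmed to work by hand.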
 

BeauZaux

Well-Known Member
USA team member
Wow, that was a huge help. Set fan to 80%, dropped temp 10C at 95% load. Closing side panel only raised temp 1C. Got it working in startup, too.
Will have to update other systems.
Got thermal padding last night, so should have 690 running in 990x later today.
Thanks
 

BeauZaux

Well-Known Member
USA team member
Got the 690 back together and into the 990x system. Took some juggling, but finally got the index numbers happy (0,0,0 & 1,1,1), so 10 cores and 2 GPUs are folding. Unfortunately, spent hours trying to get the NVIDIA fan control to work with no luck. First got "input not supported", a good chance to learn Linux commands. After my patience was lost, I figured it was a good time to install my SSD and reinstall Linux. New problems of course, blank screen. Deleted some, then installed the older 390 NVIDIA driver and bang, all well. Made a couple of alternate attempts at the fan control to no avail. Gonna have to be satisfied for today.

I'll have to add to my profile, spent 22 years in the Air Force maintaining Avionics and 21 years at the Post Office maintaining mail processing systems. All sorts of computer and control system, but very little software was on my plate. My lack of software savvy is made up for with persistence and luck...ok, mostly luck.;)
 

BeauZaux

Well-Known Member
USA team member
Is this warning my problem? This creates a full xorg.conf, but I also have an empty xorg.conf.nvidia-xconfig-original?

XXXXXX:~$ sudo nvidia-xconfig -a --cool-bits=28 --allow-empty-initial-configuration

WARNING: Unable to locate/open X configuration file.

Package xorg-server was not found in the pkg-config search path.
Perhaps you should add the directory containing `xorg-server.pc'
to the PKG_CONFIG_PATH environment variable
No package 'xorg-server' found
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen0". (Tried with and without these 2 lines.)
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen1". (Tried with and without these 2 lines.)
New X configuration file written to '/etc/X11/xorg.conf'

Bugs me it worked so well on the older 275.

Sorry to be a bother..., but this is why I joined...the camaraderie.
:USA:
 

Nick Name

Administrator
USA team member
No bother, hopefully we can all learn a bit.

I take it the fan sliders in nvidia-settings aren't working? Is it working for just one, or neither? Things can get a little wonky if there's no monitor attached, maybe that's the problem.

I'll have to look at my system later, but I *think* that warning message is normal. What distro are you using? I have been using and recommend Pop!OS for Nvidia users on Linux. It's an Ubuntu derivative so the usual Debian / Ubuntu processes etc. work. I recommend it because the GPU driver is already installed which saves a lot of aggravation.

You may also find something useful here on the ArchWiki although I never had any success.

I also like to install the NVTop tool, sort of like HTop for Nvidia. It gives a nice readout of GPU load, temps etc. although it doesn't offer control of functions.
 

BeauZaux

Well-Known Member
USA team member
Thanks, Nick. The xorg file that procedure creates kills the display on reboot. After a couple of more reinstalls, I just added Coolbits 12 to the original basic xorg file. Gives me a slider for one GPU, but there's only one fan anyway. After messing with all that this morning, had trouble again with folding on GPUs. Forgot about installing "ocl-icd-opencl-dev", from a previous fix. So one GPU now running again and one waiting for available work, without errors.
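
Coolbits is a bitmask, which is why 12 and 28 behave differently. A small decoder makes the difference visible. This is a sketch based on the bit meanings as described in the NVIDIA driver README; the descriptions are paraphrased and should be checked against the README for the installed driver version. `decode_coolbits` is a hypothetical helper.

```python
# Bit meanings paraphrased from the NVIDIA driver README (assumption: they
# apply unchanged to the driver version in use).
COOLBITS_FLAGS = {
    1: "legacy clock controls (pre-Fermi GPUs)",
    2: "mixed-memory SLI attempt",
    4: "manual fan control",
    8: "clock offsets on the PowerMizer page",
    16: "overvoltage via the nvidia-settings CLI",
}


def decode_coolbits(value):
    """List the features a Coolbits value switches on, lowest bit first."""
    return [desc for bit, desc in COOLBITS_FLAGS.items() if value & bit]


if __name__ == "__main__":
    for value in (12, 28):
        print(value, "->", decode_coolbits(value))
```

By this reading, 12 enables fan control plus clock offsets, and 28 adds overvoltage on top of that, so dropping from 28 to 12 should not take away the fan slider.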

I have used that ArchWiki by accident; it helped me get the display back from "input not supported". I'll study Pop!OS and the tool.

Thanks for the tips.
 

Nick Name

Administrator
USA team member
I've run into the blank / frozen screen problem. The "solution" is supposed to be to add nomodeset to GRUB, but I never had any success with that. It was probably a PEBCAK error but I never did get it working. That issue actually led me to Pop!OS. Honestly, I have even less patience messing with stuff like this in Linux than I did when I first started using it 5-6 years ago. :LOL:

I forgot that card only has one fan. I'd try connecting the display to the 2nd GPU to make sure it's working, if you have a way to tell which output goes to which one.
 

BeauZaux

Well-Known Member
USA team member
The 690 is back folding on both GPUs, but I still don't know why the xorg file blanked the screen. I've created them before on other machines to get better resolution. I had also tried nomodeset with no luck. Afraid to mess with it now. I will probably pull the board to check the thermal job I did. Maybe I should have used thicker pads, and I'd like to see how my paste application is. GPUs are running 85C and 79C, both at 98-99% load, with the GPU fan at 95% and a small floor fan. I'd at least like them a few degrees closer together. Did I say there are 11 Phillips and 10 T6 Torx screws holding that heatsink on?:sick:

I'll try the Pop!OS on another machine.

Today I picked up an HD7970 off Craigslist here in town to replace the GTX 275 in the ASRock system. Presently not getting a signal out, but thinking it's me (bad karma lately.) Reset bios and got the bios display up once, but only once and can't get a signal long enough to get boot selection up. Argh. May try in a different system. Just put 275 back in ASRock and still no signal. Spent a couple of hours working in the garden earlier. Maybe I should go back to that. :cautious:

If these machines are my road to insanity, I'll enjoy every step.:woot:
:USA:
 

BeauZaux

Well-Known Member
USA team member
After a little garden work and dinner, got the HD7970 working on my son's previous ASUS system and I moved its GTX760 to the ASRock and all are happy. ASRock took some hard bios resets and reloads to get anything except the AMD working. Of course now there's the non-NVIDIA learning curve to overcome, like drivers, sensors, and controls. The HD7970 is waiting for WUs to fold. No errors, just waiting.

I didn't want to say earlier, that I may have spent $70 on a door stop.:cautious: My good karma returning.:D

The ASRock still needs complete loading and folding/BOINC installed.

Wondering if there are fewer folding WUs for AMDs.
 

Nick Name

Administrator
USA team member
Ok, glad to hear you got things working. I guess I misunderstood, I thought you only had one GPU on the 690 working. You can try setting a power limit and that will help control temps as well as power usage.

sudo nvidia-smi -pl watts (watts = the power limit in watts). You'll have to run it for both and specify the GPU. I'm not sure what the impact will be on that card; on at least 10xx and 20xx cards you can reduce it 10-15% without too much reduction in output.

sudo nvidia-smi -L will give you a list of the GPUs in the system.

sudo nvidia-smi -i 0 -pl watts and then again with -i 1, should limit both.
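
The arithmetic behind the 10-15% suggestion can be sketched as below. The 200 W default is a made-up number for illustration, not the actual limit of any card in this thread, and `power_limit_commands` is a hypothetical helper that only builds the commands. Not every card accepts a power limit, so the command may simply be rejected on older hardware.

```python
def power_limit_commands(gpu_indexes, default_watts, reduction=0.10):
    """Build one 'nvidia-smi -i N -pl W' command per GPU.

    default_watts is the card's stock power limit (an input you must look up,
    for example from 'nvidia-smi -q -d POWER'); reduction is the fraction to cut.
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be a fraction between 0 and 1")
    limit = round(default_watts * (1 - reduction))
    return [
        ["sudo", "nvidia-smi", "-i", str(index), "-pl", str(limit)]
        for index in gpu_indexes
    ]


if __name__ == "__main__":
    # Hypothetical 200 W default, cut by 10 percent, applied to GPUs 0 and 1.
    for cmd in power_limit_commands([0, 1], default_watts=200, reduction=0.10):
        print(" ".join(cmd))
```

Starting with a small cut and watching both temperature and points per day is a reasonable way to find where output starts to drop.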

That 7970 should really sing on MilkyWay, you can run that on BOINC and set a GPU exclusive app to pause when it gets F@H work.

My impression has been that it's easier to get AMD work on F@H than Nvidia, but work in general can be sporadic and the lack of history like BOINC has makes it hard to get a proper feel for what's really going on. I haven't had much work for either in the last day.
 

BeauZaux

Well-Known Member
USA team member
I couldn't get those lines to work (older unit), but I was able to adjust the graphics clock offset on one GPU enough to get the temp down on both. So that system is singing without cooking me.

No luck with new AMD. No work from Folding, Milkyway, PrimeGrid, or Collatz. Maybe it's the Linux. I have an unused Win 7 key, maybe I'll try that.

Also, no luck with Pop!OS, Intel version. Froze a couple of times, but I may also have a KVM problem. And couldn't load FAHcontrol for lack of a python package that was there.

I'll try the Windows this evening. They finally opened a few theaters here, so heading to the movies shortly. Woohoo! Freedom! :D
:USA:
 

Nick Name

Administrator
USA team member
If you haven't gone the Windows route yet install the clinfo utility. If it reports zero OpenCL devices, there's your problem.
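
If clinfo is installed, the device count can also be checked from a script. This is a rough sketch that assumes clinfo's default report contains lines of the form "Number of devices" followed by a count for each platform; that layout is an assumption about the tool's output and may differ between versions. The sample text below is an illustrative layout, not captured output, and `total_opencl_devices` is a hypothetical helper.

```python
import re

# Assumed layout of a per-platform device count line in clinfo's report.
DEVICE_COUNT_RE = re.compile(r"^\s*Number of devices\s+(\d+)\s*$", re.MULTILINE)


def total_opencl_devices(clinfo_text):
    """Sum the per-platform device counts found in clinfo's text report."""
    return sum(int(count) for count in DEVICE_COUNT_RE.findall(clinfo_text))


# Illustrative layout only; real output contains many more lines.
ILLUSTRATIVE_REPORT = """\
Number of platforms                               1
  Platform Name                                   Example Platform
Number of devices                                 2
"""

if __name__ == "__main__":
    # To check a real system, pass in the text from running 'clinfo', e.g.
    # subprocess.run(["clinfo"], capture_output=True, text=True).stdout
    count = total_opencl_devices(ILLUSTRATIVE_REPORT)
    print("OpenCL devices found:", count)
    if count == 0:
        print("No OpenCL devices: the client will not have a GPU to use.")
```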

The Python dependency problem is known. From what I've read, there are a lot of developers looking at the code and trying to improve it. Someone created a PPA which fixes it, but like I said in the other thread, I couldn't get the F@H client to see the GPU.
 

BeauZaux

Well-Known Member
USA team member
Yep, did clinfo previously, rcvd zero, installed the recommended items, got the recommended results and got nothing in return. Spent 2+ full days getting Win 7 installed. I should have known it was going to be bad when Windows wouldn't accept my pre-Linux hard drive no matter how I partitioned or formatted. Who'd have known deleting and leaving it partitionless would be the cure. Loaded MB/graphics drivers and running folding now and haven't gotten a blue screen since the last auto reboot this morning. GPU load fluctuates from 0-90% and a 14 hour WU gets 28600 credits. Don't seem right. Seeing my missteps here and seeing some others while watching the NVIDIAs folding, I may have learned something. Reading your other post, Nick, I may give Pop!OS another try after a little more experimenting here. BTW... Windows :woot:
 

Nick Name

Administrator
USA team member
I definitely recommend Pop! if you're running Nvidia on Linux, mainly because the GPU driver is already installed and it's eliminated driver problems for me. I don't necessarily recommend it for other uses, I just wanted to stick with it because I've gotten used to it. I probably mentioned it elsewhere but I don't remember where, the AMD card I'm using in the Pop! system is an RX 590. I'd think the same process that worked for me would work for you, but that 7970 is obviously older and the driver may not work that well with older cards.

I don't know if 14 hours is excessively long for that card or not. It sounds like it, but I don't know how to compare it. I do think that credit sounds low. I'd check two things. First, make sure you have entered your PassKey. It's similar to the BOINC CPID and if you aren't using it your credit will be much lower than it should be. Second, make sure the CPU isn't overloaded and not keeping the GPU fed. I have observed periods where the client is running but the GPU load is basically zero, that seems to be normal, but otherwise you should be seeing the card pushed pretty hard.
 

BeauZaux

Well-Known Member
USA team member
As you recommended, I reloaded my passkey on that system. Latest WUs are shorter duration and even shorter on credits. Maybe FAH is learning this system and the numbers will go up later. Loads look better and temps are great. I can't remember which GPU project I ran on another system that showed almost no GPU load throughout processing, but ran fine. I need to run that one on my GTX 690 during the summer.:cool:

Doing some more research on my 7970, the numbers look great and figure it should run better than my GTX 760. As a gen 1 AMD, Linux Mint does not have drivers for it. So if I revert back, I'll have to try mfg drivers as I've done with Windows.

I need to do more reading on Pop!OS. Just learned I can load Synaptic Package Manager, so that'll make it more familiar.

Thanks again for all the assistance.
 