Second GPU crashing

BeauZaux

Well-Known Member
USA team member
I'm running two NVIDIA GTX 760s with an AMD Phenom II X6 1055T on an ASUS M4A87TD EVO motherboard with 8 GB of memory. Most projects crash the second GPU: all of its sensors read zero and the task's remaining time keeps growing. I've managed to get GPUGRID to run fine as long as I run no tasks on the CPU cores. I haven't tried fully eliminating CPU tasks with other GPU projects yet. Swapping the GPUs and installing a larger PSU didn't help. Heat is not a problem. Just thought I'd throw this out there before reinventing the wheel. Thanks
 

Nick Name

Administrator
USA team member
I can't say I've seen this particular problem before. What other projects are you running besides GPUGrid? Is the project app crashing or are you seeing driver crashes?

I looked at your machine on GPUGrid and it seems to be working fine. From the logs it looks like the app is running on one card, gets suspended and then resumes and successfully finishes on the 2nd. That would seem to rule out a driver problem and also a GPU hardware problem. I'd like to see some jobs that failed before suggesting other options, although you could try a driver update. There were problems at SETI with drivers past 431.xx up until 442.19, so use something outside that range if you decide to update.

[Edit] I found some failed tasks on Asteroids and the log doesn't make any sense. Jobs appear to be failing on the same card that other jobs have completed and validated on. That, to me, points to a buggy app. Just for fun, post the first 20 or so lines from your startup log so we can see what BOINC is actually detecting. [/Edit]
 

doneske

Well-Known Member
USA team member
At just a brief glance, I would be mildly concerned about the 8 GB of memory. The OS is going to take almost a gig of that to start with. In the BOINC world, it's kinda, sorta assumed you need about 2 GB per CPU thread as a minimum. Yes, you can get by with less, but then you have to watch which projects you run and in what mix. The GPU tasks' memory requirement will most likely be satisfied from graphics memory, but a small amount of system memory is still used.
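To put rough numbers on that rule of thumb, here is a minimal sketch. The 2 GB per thread and 1 GB for the OS are just the estimates above, not BOINC settings, and the function names are made up for illustration.

```python
def ram_headroom_gb(total_gb, cpu_threads, gb_per_thread=2.0, os_gb=1.0):
    """RAM left over under the rule of thumb; a negative value means a shortfall."""
    return total_gb - os_gb - cpu_threads * gb_per_thread


def max_threads(total_gb, gb_per_thread=2.0, os_gb=1.0):
    """Largest number of CPU threads the rule of thumb allows."""
    return max(0, int((total_gb - os_gb) // gb_per_thread))


if __name__ == "__main__":
    # 8 GB box running all 6 threads of the Phenom II X6
    print(ram_headroom_gb(8, 6))  # -5.0, i.e. about 5 GB short
    print(max_threads(8))         # 3
```

By this estimate an 8 GB machine is comfortable with about three CPU tasks, not six.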
 

BeauZaux

Well-Known Member
USA team member
Reverted from the 435 driver to 430.50. Presently running Asteroids on the GPUs and nothing on the CPU. I'll give it half the day to crash. If it doesn't, I'll add CPU tasks one core at a time and see what happens. What app are you speaking of when you say "buggy app"? Hardware-wise, the only thing left is the motherboard.
UPDATE: Started this msg this morning, but kept getting dragged away. In the meantime, one GPU running Asteroids stopped and the other is just pretending to run (time remaining increasing and only 3-8% GPU use). Can't figure out how to post a screenshot/image on the forum. I'll abort and try another project.

Here's the first 20+ lines. Thanks.

Mon 24 Feb 2020 07:45:35 AM CST | | Starting BOINC client version 7.9.3 for x86_64-pc-linux-gnu
Mon 24 Feb 2020 07:45:35 AM CST | | log flags: file_xfer, sched_ops, task
Mon 24 Feb 2020 07:45:35 AM CST | | Libraries: libcurl/7.58.0 OpenSSL/1.1.1 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
Mon 24 Feb 2020 07:45:35 AM CST | | Data directory: /var/lib/boinc-client
Mon 24 Feb 2020 07:45:39 AM CST | | CUDA: NVIDIA GPU 0: GeForce GTX 760 (driver version 430.50, CUDA version 10.1, compute capability 3.0, 2000MB, 1959MB available, 2590 GFLOPS peak)
Mon 24 Feb 2020 07:45:39 AM CST | | CUDA: NVIDIA GPU 1: GeForce GTX 760 (driver version 430.50, CUDA version 10.1, compute capability 3.0, 1999MB, 1945MB available, 2379 GFLOPS peak)
Mon 24 Feb 2020 07:45:39 AM CST | | OpenCL: NVIDIA GPU 0: GeForce GTX 760 (driver version 430.50, device version OpenCL 1.2 CUDA, 2000MB, 1959MB available, 2590 GFLOPS peak)
Mon 24 Feb 2020 07:45:39 AM CST | | OpenCL: NVIDIA GPU 1: GeForce GTX 760 (driver version 430.50, device version OpenCL 1.2 CUDA, 1999MB, 1945MB available, 2379 GFLOPS peak)
Mon 24 Feb 2020 07:45:40 AM CST | | [libc detection] gathered: 2.27, Ubuntu GLIBC 2.27-3ubuntu1
Mon 24 Feb 2020 07:45:40 AM CST | | Host name: Tucson
Mon 24 Feb 2020 07:45:40 AM CST | | Processor: 6 AuthenticAMD AMD Phenom(tm) II X6 1055T Processor [Family 16 Model 10 Stepping 0]
Mon 24 Feb 2020 07:45:40 AM CST | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt cpb hw_pstate vmmcall npt lbrv svm_lock nrip_save pausefilter
Mon 24 Feb 2020 07:45:40 AM CST | | OS: Linux LinuxMint: Linux Mint 19.3 Tricia [5.3.0-40-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]
Mon 24 Feb 2020 07:45:40 AM CST | | Memory: 7.77 GB physical, 2.00 GB virtual
Mon 24 Feb 2020 07:45:40 AM CST | | Disk: 1.79 TB total, 1.68 TB free
Mon 24 Feb 2020 07:45:40 AM CST | | Local time is UTC -6 hours
Mon 24 Feb 2020 07:45:40 AM CST | | VirtualBox version: 5.2.34_Ubuntur133883
Mon 24 Feb 2020 07:45:40 AM CST | | Config: GUI RPCs allowed from:
Mon 24 Feb 2020 07:45:40 AM CST | Asteroids@home | URL http://asteroidsathome.net/boinc/; Computer ID 654192; resource share 300
Mon 24 Feb 2020 07:45:40 AM CST | GPUGRID | URL http://www.gpugrid.net/; Computer ID 522684; resource share 300
Mon 24 Feb 2020 07:45:40 AM CST | LHC@home | URL https://lhcathome.cern.ch/lhcathome/; Computer ID 10633548; resource share 100
Mon 24 Feb 2020 07:45:40 AM CST | Rosetta@home | URL http://boinc.bakerlab.org/rosetta/; Computer ID 3776010; resource share 100
Mon 24 Feb 2020 07:45:40 AM CST | World Community Grid | URL http://www.worldcommunitygrid.org/; Computer ID 6236563; resource share 100
 

BeauZaux

Well-Known Member
USA team member
I ran Rosetta on all 6 CPU cores without a problem. It used over half of the memory and didn't like anything else going on. I recently upgraded from 4 to 6 cores, but the problem showed itself before that. Thought the extra cores would help. :sneaky: But I'll look into more memory.
 

BeauZaux

Well-Known Member
USA team member
Running SETI on 2 GPUs and 3 CPU cores and all is fine. Three completed GPU tasks are ready to report. I'll experiment with different projects and see what happens.
 

Nick Name

Administrator
USA team member
Looks like BOINC is detecting both cards and both CUDA and OpenCL, so that's good. I think we can rule out a hardware problem, at least with the GPUs themselves.

I'm not able to test Asteroids GPU because they haven't updated the app to run on Turing. It's certainly possible there is a problem with not enough RAM, although you should see some messages in the manager like "Waiting for memory" or something along the lines of can't start because there's not enough memory available. Rosetta has some work units that are taking enormous amounts of RAM, some have reported up to 3.5 gigs. If you were unlucky enough to get some of these, for sure your machine ran out of RAM.

I have not carefully tracked the GPUGrid app RAM load. I have two jobs in Windows taking 423 and 393 megs respectively. Linux is somewhat less at 342 and 365. I don't believe SETI's GPU app uses much RAM. I'd probably assume a max of 500 megs per GPU job for a margin of safety. If you happen to be running Amicable Numbers, that GPU app takes 8 gigs, which would definitely be a problem.
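For a rough feel of the worst case, here is a small sketch. It assumes about 0.5 GB of system RAM per GPU job (a safety margin over the measured figures above), 3.5 GB for each unlucky Rosetta task, and the 7.77 GB of physical memory from the startup log. It ignores what the OS itself uses, and the function name is made up for illustration.

```python
PHYSICAL_GB = 7.77  # "Memory: 7.77 GB physical" in the startup log


def worst_case_ram_gb(cpu_jobs, gpu_jobs, cpu_job_gb=3.5, gpu_job_gb=0.5):
    """Estimated RAM demand if every CPU job is one of the big Rosetta tasks."""
    return cpu_jobs * cpu_job_gb + gpu_jobs * gpu_job_gb


if __name__ == "__main__":
    for cpu_jobs in range(0, 4):
        need = worst_case_ram_gb(cpu_jobs, gpu_jobs=2)
        verdict = "fits" if need <= PHYSICAL_GB else "over"
        print(f"{cpu_jobs} CPU + 2 GPU jobs: {need:.1f} GB ({verdict})")
```

Under these assumptions, only one of the big Rosetta tasks fits alongside two GPU jobs; two of them already exceed physical memory.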

Since things seem to run ok sometimes, you might just need to tweak your BOINC CPU and RAM limits. Normally, BOINC will over-commit the CPU to run GPU work, meaning that if you currently have BOINC set to use 100% of your threads it will try to run eight total jobs: six CPU and two GPU. I like to keep at least one thread available for the OS and general system tasks to keep everything running smoothly. If the system is overloaded your crunching will actually slow down, sometimes enormously. It takes a little trial and error to see what's best for you. I'd suggest setting it to use 50% of your threads, and see how that works over time with your project mix. That should give you three CPU and two GPU jobs. If you haven't already, I'd also implement an app_config for Rosetta so only one job runs at a time. Currently there's no way to keep from getting those ginormous RAM-eating tasks. It's actually irresponsible of the project to send those to everyone, but that's out of our control. Next, if you can, I'd increase the amount of RAM that BOINC can use.
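Here is a minimal sketch of both ideas: the job-count arithmetic and the contents of a Rosetta app_config.xml. It assumes BOINC rounds the thread percentage down (with a floor of one CPU job) and runs one job per GPU on top of that; the function names are made up for illustration. `project_max_concurrent` is the app_config option that caps how many jobs a project runs at once.

```python
import xml.etree.ElementTree as ET


def job_counts(threads, cpu_pct, gpus):
    """(CPU jobs, GPU jobs) BOINC would try to run; rounding down is an assumption."""
    cpu_jobs = max(1, threads * cpu_pct // 100)
    return cpu_jobs, gpus


def rosetta_app_config(max_concurrent=1):
    """Build app_config.xml text limiting the project to max_concurrent jobs."""
    root = ET.Element("app_config")
    ET.SubElement(root, "project_max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(job_counts(6, 100, 2))  # (6, 2): eight jobs in total
    print(job_counts(6, 50, 2))   # (3, 2): five jobs in total
    print(rosetta_app_config(1))
```

The XML goes in a file named app_config.xml in the Rosetta project folder under the BOINC data directory (/var/lib/boinc-client per your log; the exact project folder name is something to check on your machine). Have BOINC re-read its config files or restart the client for it to take effect.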
 