WCG restart delayed...until...13 May 2023

Vester

Well-Known Member
USA team member
Just when I filled my queue, my fiber optic internet was taken out by a falling tree. On the bright side, the ARP tasks will still be running on Monday when the experts splice the cable. I was afraid the last 3/4 mile above ground would be a problem.

Edit: Internet restored. Fiber optic cables do not stretch.
 

Ghost Rider 51

Member
USA team member
I have yet to get a full queue. Work only really started flowing again a week or so ago, and I never get enough tasks to keep running constantly.

Still running Einstein to keep my systems busy.
 

Vester

Well-Known Member
USA team member
Downloads improved about an hour ago. Download speeds are slow, but downloads get about 70% complete before I have to hit "Retry Now."
 

Vester

Well-Known Member
USA team member
Downloads appear to be normal again. All files have downloaded without delays this morning. I am hopeful that the nightmare is nearly over.
 

Ghost Rider 51

Member
USA team member
Still coming in drips and dribbles, though. Never enough to even have any waiting to process. I only get 1-4 at a time (for 3 systems), and those run out before I get any more.
 

Vester

Well-Known Member
USA team member
Yes. I haven't had any new tasks in about two hours. They seem to be out of OPNG tasks and all others are reported to be unavailable, too. I expect this to be a temporary situation. I have plenty of work in queue for CPU, but the GPU is idle.
 

Vester

Well-Known Member
USA team member
There is a shortage of OPNG tasks, but others are downloading normally (quickly).
 

Vester

Well-Known Member
USA team member
Yes, I am only getting MCM now. They seem to be cycling through the different projects. None have needed my intervention to download.
 

Ghost Rider 51

Member
USA team member
Yeah, they are making progress, even though it is much slower than projected.
I have 3 machines and am barely getting enough jobs to keep them (mostly) busy. (y)
 

Ghost Rider 51

Member
USA team member
This has been my average with WCG since Sept 01:

Steady decline from ~6000 units to <4000 units. And that is after a sharp uptick from <1000 units to ~6000 units over 4 days.
 

Jason Jung

Well-Known Member
USA team member
Seems to be smoothing out. Still loads of OPNG work coming in, and many downloads are going through on the first try. Two or three retries for the rest. Yesterday morning I had a few units that needed over 40 retries. I'm hoping the reason for the improvement is that they've increased their connection-handling capacity.
 

Vester

Well-Known Member
USA team member
Cubes posted something coherent at WCG (Sep 23, 2022 10:53:27 PM).

We have made some improvements to the WCG system today that should improve the download situation (repeated download attempts and "transient" HTTP errors in the BOINC client logs). In short, we have doubled the number of World Community Grid download servers and have begun tuning a related part of the system.

A somewhat longer explanation:

The WCG back-end system operates as a network of virtual servers on a private cloud. File upload and download requests are received first by our load balancer, which directs each request to an available upload/download server. As designed, our system should run with two u/d servers, but one of them was affected by a mysterious network problem that has kept several of our virtual servers offline for weeks. We suspected ghosts, cursed VM images, and OpenStack glitches, but recently our hosting provider ruled those out for us, determining the problem to lie between a physical server and a router. The problem is not 100% fixed, but with the cause identified, we managed to squeeze the second u/d VM onto another physical server and successfully brought it online about 9.5 hours ago.

Prior to that happy event, we looked into the source of the "transient" errors reported in client logs. As it happens, the BOINC client will log almost any kind of HTTP/HTTPS error status as a "transient HTTP error". We first investigated our upload/download server, but its logs showed a >99.9% rate of successful responses, and the server load was generally low. Whatever exact errors the clients were receiving, it seemed they did not come from there.

So we moved on to the load balancer. Our load balancer runs HAProxy. Examining its operating stats showed that it was the source of the BOINC "transient" errors: it was apparently configured to be a little over-protective of our u/d server and was turning down lots of requests. Our HAProxy configuration was originally copied from IBM's, then adapted to work in the new environment, though we left many of the parameters unchanged -- maximum number of simultaneous connections, etc. As it turns out, some of those settings do not work well in the Krembil WCG cluster, at least when we're at 50% download capacity. We made a cautious change or two, but with the new server online, we will wait until the system settles into a new equilibrium before resuming parameter tuning.

The changes probably won't eliminate the "transient" errors -- initial stats from HAProxy say both download servers are saturated now, but hopefully the second download server reduces the pain, and tuning our load balancer should improve things further.

Christian
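For anyone who hasn't worked with HAProxy, here is a minimal sketch of the kind of per-server connection limits Christian is describing. The server names, addresses, and numbers below are illustrative assumptions, not WCG's actual configuration; the point is only that when requests exceed a backend's connection caps, the balancer queues or rejects them, and the BOINC client reports those rejections as "transient" HTTP errors.

    global
        maxconn 8000                      # assumed overall cap on concurrent connections

    defaults
        mode http
        timeout connect 5s
        timeout client  60s
        timeout server  60s
        timeout queue   30s               # requests queued longer than this come back as errors

    frontend wcg_downloads
        bind *:80
        default_backend download_servers

    backend download_servers
        balance leastconn                 # route each new request to the less busy server
        # Per-server connection caps: set too low, the balancer turns requests away
        # even though the servers themselves are nowhere near overloaded.
        server dl1 10.0.0.11:80 check maxconn 2000
        server dl2 10.0.0.12:80 check maxconn 2000    # the second u/d server brought back online

Adding a second backend server and loosening limits like these both raise the number of requests that get through on the first try, which lines up with the improvement Jason noted above.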
 