USA team member
We made some changes to the WCG system today that should ease the download situation (repeated download attempts and "transient" HTTP errors in the BOINC client logs). In short, we have doubled the number of World Community Grid download servers and have begun tuning a related part of the system.
A somewhat longer explanation:
The WCG back-end system operates as a network of virtual servers on a private cloud. File-upload and download requests are received first by our load balancer, which directs each request to an available upload/download server. As designed, our system should run with two u/d servers, but one of them was affected by a mysterious network problem that has kept several of our virtual servers offline for weeks. We suspected ghosts, cursed VM images, and OpenStack glitches, but recently, our hosting provider ruled those out for us, determining the problem to lie between a physical server and a router. The problem is not 100% fixed, but with the cause identified, we managed to squeeze the second u/d VM onto another physical server, and successfully brought it online about 9.5 hours ago.
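For anyone who wants a picture of why running on one u/d server hurts, here is a toy Python sketch of a load balancer handing requests to whichever server is online and has room. It is only an illustration, not our actual setup: the names (`Server`, `dispatch`, `count_rejected`, `ud1`, `ud2`), the cap of 2 connections, and the least-connections choice are all made up for the example, and a real HAProxy can also queue requests rather than turning them away immediately.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    online: bool = True
    max_conn: int = 2   # hypothetical per-server connection cap
    active: int = 0     # connections currently assigned


def dispatch(servers):
    """Assign one request to the least-busy online server with spare capacity.

    Returns the chosen server, or None if every server is offline or full.
    """
    candidates = [s for s in servers if s.online and s.active < s.max_conn]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda s: s.active)
    chosen.active += 1
    return chosen


def count_rejected(servers, n_requests):
    """Send n_requests simultaneous requests; return how many are turned away."""
    return sum(1 for _ in range(n_requests) if dispatch(servers) is None)


# One u/d server offline: half of four simultaneous requests are refused.
one_up = [Server("ud1"), Server("ud2", online=False)]
rejected_one_up = count_rejected(one_up, 4)

# Both u/d servers online: the same four requests all fit.
two_up = [Server("ud1"), Server("ud2")]
rejected_two_up = count_rejected(two_up, 4)
```

In this toy model the refused requests are what a client would see as errors, which is roughly the situation we were in at 50% capacity.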
Prior to that happy event, we looked into the source of the "transient" errors reported in client logs. As it happens, the BOINC client will log almost any kind of HTTP/HTTPS error status as a "transient HTTP error". We first investigated our upload/download server, but its logs showed a >99.9% rate of successful responses, and the server load was generally low. Whatever the exact errors the clients were receiving, it seemed they did not come directly from there. So we moved on to the load balancer. Our load balancer runs HAProxy. Examining its operating stats showed it was the source of the BOINC "transient" errors, apparently configured to be a little over-protective of our u/d server, turning down lots of requests. Our HAProxy configuration was originally copied from IBM's, then adapted to work in the new environment, though we left many of the parameters unchanged -- maximum number of simultaneous connections, etc. As it turns out, some of those settings do not work well in the Krembil WCG cluster, at least when we're at 50% download capacity. We made a cautious change or two, but with the new server online now, we will wait until the system settles into a new equilibrium to resume parameter tuning.
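As a rough illustration of the kind of check described above, the Python sketch below reads HAProxy's CSV statistics export and reports, per row, how full the session limit is and what share of responses were 5xx. The column names `pxname`, `svname`, `scur`, `slim`, `hrsp_2xx` and `hrsp_5xx` are standard HAProxy stats fields; everything else is an assumption for the example: the `summarize` helper, the `wcg_dl` proxy name, the trimmed-down column set, and every number in `SAMPLE` are invented and are not our real stats.

```python
import csv
import io

# Invented sample in the shape of HAProxy's CSV stats export, trimmed to a
# few columns. Real exports have many more columns; the figures are made up.
SAMPLE = "\n".join([
    "# pxname,svname,scur,slim,hrsp_2xx,hrsp_5xx",
    "wcg_dl,ud1,100,100,4500,0",
    "wcg_dl,ud2,90,100,4500,0",
    "wcg_dl,BACKEND,190,200,9000,1000",
])


def summarize(stats_csv):
    """Return {"proxy/server": {"session_use": ..., "share_5xx": ...}}."""
    # The export's header line starts with "# "; drop it so DictReader
    # sees plain column names.
    reader = csv.DictReader(io.StringIO(stats_csv.strip().lstrip("# ")))
    summary = {}
    for row in reader:
        ok = int(row["hrsp_2xx"] or 0)
        err = int(row["hrsp_5xx"] or 0)
        limit = int(row["slim"] or 0)
        summary[f'{row["pxname"]}/{row["svname"]}'] = {
            # Fraction of the configured session limit currently in use.
            "session_use": int(row["scur"] or 0) / limit if limit else None,
            # Share of 2xx+5xx responses that were 5xx.
            "share_5xx": err / (ok + err) if ok + err else 0.0,
        }
    return summary


summary = summarize(SAMPLE)
for name, stats in summary.items():
    print(name, stats)
```

In the invented sample, the individual servers answer everything successfully while the backend row carries the errors, which mirrors the pattern we saw: healthy server logs, with the refusals happening at the load balancer.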
The changes probably won't eliminate the "transient" errors (initial stats from HAProxy say both download servers are saturated now), but hopefully the second download server reduces the pain, and tuning our load balancer should improve things further.