WCG restart delayed...until...13 May 2023

Vester

Well-Known Member
USA team member
Nothing at WCG is accessible at this time. It was great while it lasted. Maybe there will be projects other than MCM and OPN1 when it is restored.

Edit: Tasks are being uploaded again. I am not sure about downloads because I had a four-day cache and changed my setting to one day.
 

supdood

Well-Known Member
USA team member
I've been trying to remain positive, but this is getting a bit out of hand. Seems like we are going back and forth between DB crashes and download issues every week. At this point, I would almost prefer that they simply take another couple of months to reboot the project on a standard BOINC install instead of trying to patch together what IBM had with the tools available in Krembil's datacenter. I'm also not sure why Krembil didn't just leave WCG in the IBM Cloud and take over management and finances.
 

Nick Name

Administrator
USA team member
Either of these would have been preferable to the fragility we've seen. Sometimes it's hard to tell if you should stay the course or tear it up and redo it. So far it looks like they chose the wrong option. Hopefully things stabilize over time. We're at least better than we were in March. :)
 

Vester

Well-Known Member
USA team member
The blame for this non-standard BOINC project goes to United Devices (UD), not IBM. It was a mess when IBM took over. I went to Find-a-Drug (Keith Davies) because UD was as unresponsive as Krembil. "Backing off unable to connect!" was the biggest problem when UD was in charge. Not much has changed.
 

Vester

Well-Known Member
USA team member
The WCG system is down. Cyclops said hardware changes are expected to improve task distribution, but he said nothing about the system being down for over six hours. I noted a two-day increase in my ARP statistics yesterday, but there are many more days still missing. Here's hoping that today is the big day!

Edit: This was on Twitter 22 hours ago. "We are experiencing a system error that prevents access to the WCG website. We are working to fix the issue and expect to have it resolved in the next 30 minutes. Sorry for any confusion this may have caused." No update since.
 

Nick Name

Administrator
USA team member
So the hardware isn't adequate. That explains a lot. It's not clear to me if they underestimated the load or just didn't have the resources to get what was really needed.
 

Jason Jung

Well-Known Member
USA team member
IBM's WCG team didn't exactly have the hardware resources to handle the full workload with OPNG but they had the knowledge to properly manage their resources. I imagine Krembil knew they weren't going to have adequate hardware resources and had a whole lot of wishful thinking that they could still manage things somewhat smoothly.

I'm still hopeful that they'll get the existing projects running but I don't see World Community Grid getting off life support. I can't see there being a new research project. Just the existing ones completing.
 

Nick Name

Administrator
USA team member
I hope this isn't the case, but I can see this being the most likely outcome.
 

Jason Jung

Well-Known Member
USA team member
Hi everyone, an update on network connection and storage.

We are working together with SHARCNET (an HPC site where WCG servers and storage reside) to resolve the network congestion events we have been experiencing. For volunteers, these events manifest as the arbitrary website/forums database downtime and constant interruptions to volunteers attempting to download workunits. At this time, we believe the root cause to be a limitation or bug in the OpenStack software through which our virtual environment is provisioned at SHARCNET.

To help ameliorate the worst effects of this issue, SHARCNET have re-routed all WCG traffic through a new network node. Effectively, this separates WCG traffic from that of other users and deployments unrelated to the WCG that are colocated at the SHARCNET HPC facility. We have already seen a benefit from this change, and it could help us to further diagnose and optimize additional performance issues.

We have also reduced the maximum concurrent connections permitted on the download servers at SHARCNET’s request, and reduced the maximum number of packages available at any one time for download. Although these adjustments suggest a lower throughput, they have been active since November 11 and are in fact helping the overall throughput of WCG by stabilizing the network to a degree. However, we are still seeing events inside our environment where the load balancer and servers behind it are sometimes unable to communicate with each other.

Importantly, the bandwidth that the WCG environment is provided with at SHARCNET is nowhere near saturated during these events. It is not an issue of capacity. We are working to resolve this and will provide an update on our progress as soon as we have new information. Once resolved, we will be in a position to fully restart, and bring new projects to the Grid.

The new and faster storage server is physically installed at SHARCNET now and will be connected to the rest of the WCG servers next week. The primary benefit of the new storage array is the SSD storage that comes with it, which will increase performance of many key components that currently rely on NFS shares of logical volumes that are composed of HDD storage only.

If you have any comments or questions, please leave them in this thread for us to answer. Thank you for your support, patience and understanding.

WCG team at Krembil Research Institute

 

Vester

Well-Known Member
USA team member