WCG restart delayed... until... 13 May 2023

Vester

Well-Known Member
USA team member
African Rainfall Project (ARP) has resumed, and Open Pandemics (OPN1) downloads are slower than a snail again. With the lack of progress, I am beginning to think Krembil may pull the plug on WCG.
 

Vester

Well-Known Member
USA team member
Update by Cyclops.
2023-01-25 Update (ARP & OPN1 workunits)

ARP & OPN1 workunits

On Monday afternoon, many volunteers reported receiving new ARP1 and OPN1 workunits. These workunits are not from a new batch; they are older WUs that were never sent out because an overloaded server caused problems in our workunit-distribution process. The ARP1 and OPN1/OPNG teams remain on a temporary pause while they prepare new workunits.

In addition, this infusion of about 2 million WUs helped us confirm that the networking/download issues in the data center persist under a normal load. Improvements made by the SHARCNET team did reduce network congestion; however, based on these results, they are now implementing further modifications to the network, which should resolve these issues going forward. We will keep you updated with further details about the upcoming maintenance once we receive more information from the SHARCNET team.

Thank you to the volunteers who sent reports of the HTTP errors they experienced while processing the recent ARP1/OPN1 workunits; these reports helped us diagnose the problem. The effect is especially strong after an outage because of the pent-up demand from all the connected BOINC clients. The backlog of workunits released for distribution over the last few days produced the same effect. We continue working with the SHARCNET team on improving our network. In parallel, we are finalizing the SSD storage upgrade we mentioned in December, which will also help improve WCG backend performance.

If you have any comments or questions, please leave them in this thread for us to answer. Thank you for your support, patience and understanding.

WCG team
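The "pent-up demand" effect they describe is the classic thundering-herd problem: every BOINC client that was blocked during an outage retries as soon as the servers come back, so requests arrive in one big burst. The usual way clients spread that load out is exponential backoff with random jitter. Below is a rough sketch of that idea in Python, with a hypothetical fetch callable; it is a generic illustration, not WCG's or BOINC's actual code.

```python
import random
import time

def download_with_backoff(fetch, max_retries=8, base_delay=5.0, max_delay=3600.0):
    """Retry a download, waiting longer (with random jitter) after each failure.

    `fetch` is a hypothetical callable that performs one download attempt
    and raises OSError on an HTTP/network error. The randomized delay keeps
    thousands of clients from retrying at exactly the same moment after an
    outage ends.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except OSError:
            delay = min(base_delay * (2 ** attempt), max_delay)  # exponential growth, capped
            time.sleep(random.uniform(0, delay))                 # jitter spreads clients out
    raise RuntimeError("download failed after all retries")
```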
 

Vester

Well-Known Member
USA team member
Dr. Igor Jurisica posted yesterday.

Dear all – thank you for the comments and suggestions; they are well received. We do try to increase communications, but I would rather not post at all than post just for the sake of posting. There is too much of that on social media and, unfortunately, on news channels too.

We have regular monthly meetings with all science teams, and even in between them we try to encourage the teams to send us updates we can share with volunteers. Several were posted on the web and forums over the last few months – but I agree, more good-quality updates would be better.

As we have posted, SCC/OPN/HSTB were all off for science reasons – finishing analysis of the past targets and preparing new work units. OPN and SCC are entering a new phase – OPN WUs are already available, and new WUs from SCC should start increasing shortly, in addition to the current MCM.

On the technical side of the Grid, we are making some progress – but it still feels like trying to fit an F1 engine into a Fiat 500 chassis (no disrespect to the classic Italian design). We had to rebuild the back end to fit the hardware constraints we live with – and we still discover some strange behaviour (e.g., the recent drying up of WUs due to automation, even though we had them ready). We do not have hardware that would match the cloud setup used before – but we continue to look for funding or partners that could help us there (note: we do not have the budget to run the cloud going forward either).

As the projects move on to the new phase, and as we solve our back-end aches, there will be more science to report on. Also, once we resolve our technical challenges, we have plans for new projects and new initiatives. The balance between the number of projects and the number of volunteers/devices is a tricky one – and there are also (multiple good) reasons why only about 1-2 new projects were onboarded annually over the history of WCG.

So, while I do not have the answers yet and cannot commit to an ETA for new projects, we are in this together with you – we do not have split personalities, but we do wear multiple hats – we are volunteers, science partners, and we also try to run the Grid now. While it is not perfect, we are making progress, and a Grid with 62,751 devices actively contributing over the last month and producing 3,340,603 run-time days for science in that period is much better than a Grid closed by the end of 2021.

For now we have only limited resources, but in time improved infrastructure and funding would enable us to expand. One such future expansion will also be a full-time communications professional.

thank you
Igor
 

Vester

Well-Known Member
USA team member
The failure was a RAID controller.
Update: Our RAID controller card failed, so SHARCNET provided us with a spare. We can hopefully use it to restore the configuration and regain access to the storage soon.
 

Vester

Well-Known Member
USA team member
From Twitter:
World Community Grid
@WCGrid
Update: We have confirmed all the data is intact and have replaced the RAID controller, but we are still having some issues with getting the new hardware production ready. Unfortunately, data center staff will not be able to help us over the weekend.
 

Jason Jung

Well-Known Member
USA team member

New hardware
While we prepare the new and improved hardware to host our databases and parallel filesystems, we have been using a temporary system provided to us by the data center. All data is confirmed intact, and there has been no data loss as we continue to recover. The recovery system is a stand-in for the storage server that failed, selected for hardware compatibility so we could recover the data. We will not continue with the recovery system indefinitely; it will be retired only once the new storage system has been fully installed and synced with the recovery system for a smooth handoff.

BOINC database is UP
The BOINC database is now up and running, joining the website/forums database, which has been up since last week. However, upload/download of workunits is paused until we restore the parallel filesystem that supports the workunit-management stack to the state it was in at the time of the hardware failure. Deadlines have been extended, and valid results computed during this pause will be credited when we resume.

Website crashes
During the hardware recovery process, the website has been intermittently crashing. Looking into the cause, we identified bugs that only present themselves when the BOINC database is offline or other resources are unavailable while we recover the system. The website will now remain available to users in these cases, or restart automatically after crashing.

In the meantime, we have posted research updates from the ARP and MCM teams. We are planning on sharing more updates soon.

If you have any comments or questions, please leave them in this thread for us to answer. Thank you for your support, patience and understanding.

WCG team
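The update mentions that the website will restart automatically after crashing, without saying how that is implemented. As a rough sketch only (the start command and other details below are made up for illustration, not WCG's actual setup), a minimal supervisor loop that relaunches a web process whenever it exits abnormally could look like this in Python:

```python
import subprocess
import time

# Hypothetical command that launches the web front end; placeholder only.
WEB_COMMAND = ["./start-website"]

def supervise(restart_delay=5.0):
    """Relaunch the web process whenever it crashes (exits with a non-zero code)."""
    while True:
        result = subprocess.run(WEB_COMMAND)
        if result.returncode == 0:
            break  # clean shutdown requested, stop supervising
        # Crash: pause briefly so a crash loop does not spin, then restart.
        time.sleep(restart_delay)

if __name__ == "__main__":
    supervise()
```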
 

Jason Jung

Well-Known Member
USA team member
Haven't seen any new update but all my work units have uploaded today. Feeder appears to be offline still.
 

supdood

Well-Known Member
USA team member
It's been a while since I chimed in; hope everyone is doing well. Below is a post from Dr. Jurisica. If I'm interpreting it correctly, it sounds like they are going to wait to restart until they've moved over to the new SSD storage array. Hopefully we'll be back up and running next week.

On a side note, anyone else find the number of WCG users complaining about having their computers on with no work a bit absurd? Either turn them off or load up another project!



Thank you for the suggestion and the offer.
Indeed, we use VMs, and we do have multiple blades - but we do not (yet) have capacity for redundancy or sufficient capacity for growth.

This is all older equipment - but despite multiple attempts we do not yet have a generous IT vendor or other partner that would give us the much-needed refresh and redundancy. (Suffice it to note that there was a planned refresh across academic HPC in Canada last year - but it has not happened yet.)

However, there are two possible leads - *if* they work, we would be moved several years ahead.


On a quick update: finally, the /science filesystem is being moved from the recovery storage unit to the new storage. As of last night, after 3 hours, the new storage /science filesystem showed 1.4 TB used. Assuming that average rate of file transfer, it will take about 74 hours. Hopefully, we will then be able to restart BOINC from the new storage and finally put the failure behind us. We will keep you posted.

sincerely
igor
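For context on that estimate (my arithmetic, not from the post): 1.4 TB in 3 hours works out to roughly 0.47 TB per hour, so a 74-hour transfer at that rate implies roughly 34-35 TB of data on the /science filesystem. A quick back-of-the-envelope check in Python, using only the figures quoted above:

```python
# Back-of-the-envelope check of the 74-hour estimate from the post above.
copied_tb = 1.4                      # TB copied in the first 3 hours
elapsed_h = 3.0
rate = copied_tb / elapsed_h         # ~0.47 TB/hour
estimated_total_h = 74.0
implied_size_tb = rate * estimated_total_h   # ~34.5 TB to move in total
print(f"rate ~{rate:.2f} TB/h, implied /science size ~{implied_size_tb:.1f} TB")
```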
 