2023-01-25 Update (ARP & OPN1 workunits)
ARP & OPN1 workunits
On Monday afternoon, many volunteers reported receiving new ARP1 and OPN1 workunits. These workunits are not from a new batch; these are older WUs that were never sent out due to an overloaded server causing problems in our workunit-distribution process. ARP1 and OPN1/OPNG teams remain on temporary pause, preparing new workunits.
In addition, this infusion of about 2 million WUs helped us to confirm that the networking/download issues we have in the data center persist under a normal load. Improvements made by the SHARCNET team did reduce network congestion. However, based on these results, they are now implementing further modifications to the network, which should resolve these issues for the future. We will keep you updated with further details about the upcoming maintenance, once we receive more information from the SHARCNET team.
Thank you for sending reports of the HTTP errors that volunteers experienced while processing the recent ARP1/OPN1 workunits; these reports helped us diagnose the errors. The network congestion behind them is especially severe after an outage, because of the pent-up demand from all the connected BOINC clients. The backlog of workunits released for distribution over the last few days produced the same effect. We continue to work with the SHARCNET team on improving our network. In parallel, we are finalizing the SSD storage upgrade we mentioned in December, which will also help improve WCG backend performance.
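As a rough illustration of why pent-up demand hits so hard, the sketch below compares the busiest second when every client reconnects the moment service returns with the busiest second when reconnections are spread randomly over a window. The function name, client count, and window length are hypothetical and chosen only for illustration; they are not WCG measurements, and this is not how the BOINC client or our servers are implemented.

```python
import random
from collections import Counter


def peak_requests_per_second(num_clients, spread_seconds, seed=0):
    """Return the request count in the busiest one-second bucket.

    Each client reconnects once, at a random whole second within
    `spread_seconds` after the outage ends. A spread of 0 means every
    client reconnects in the same second. All values are hypothetical.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch repeatable
    buckets = Counter(
        rng.randrange(spread_seconds) if spread_seconds > 0 else 0
        for _ in range(num_clients)
    )
    return max(buckets.values())


# Everyone at once versus spread over ten minutes (made-up numbers).
all_at_once = peak_requests_per_second(10_000, 0)
spread_out = peak_requests_per_second(10_000, 600)
print(all_at_once, spread_out)
```

With no spreading, the peak equals the total number of clients; with spreading, the same total demand arrives as a much lower peak, which is easier on a congested network.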
If you have any comments or questions, please leave them in this thread for us to answer. Thank you for your support, patience and understanding.
WCG team
From Twitter:
World Community Grid
@WCGrid
Our servers will be undergoing maintenance starting at 3:00 PM EST, expected to last for 45 minutes. We will notify you when it has been completed.
8:00 AM · Jan 27, 2023
Dear all – thank you for the comments and suggestions. Well received. We do try to increase the communications, but I would rather not post at all than post just for the sake of posting. There is too much of that on social media and, unfortunately, on news channels too.
We have regular monthly meetings with all science teams but even in between them we try to encourage the teams to send us updates we can share with volunteers. There were several posted on the web and forums over the last few months – but I agree, more good quality updates would be better.
As we have posted – SCC/OPN/HSTB were all off due to science – finishing analysis on the past targets and preparing new work units. OPN and SCC are entering a new phase – OPN WUs are already available, and new WUs from SCC should be increasing shortly, in addition to the current MCM.
On the technical side of the Grid – we are making some progress – but still, it feels like trying to fit an F1 engine into a Fiat 500 chassis (no disrespect to the classic Italian design). We had to rebuild the back end to fit the hardware constraints we live with – and we still discover some strange behaviour (e.g., the recent drying out of WUs, while we had them ready due to automation). We do not have hardware that would match the cloud setup used before – but we continue to look for funding or partners that could help us there (note: we do not have the budget to run in the cloud going forward either).
As the projects move on to the new phase, and as we solve our back-end aches, there will be more science to report on. Also, once we resolve our technical challenges, we have plans for new projects and new initiatives. The balance between number of projects and number of volunteers/devices is a tricky one – and there are also (multiple good) reasons why only about 1-2 new projects were onboarded annually over the history of WCG.
So, while I do not have the answers yet, and cannot commit to an ETA for new projects, we are in this together with you – we do not have split personalities but we wear multiple hats – we are volunteers, science partners, and we also try to run the Grid now. While it is not perfect – we are making progress, and a Grid with 62,751 devices actively contributing over the last month and producing 3,340,603 run time days contributing to science in that period is much better than a Grid closed by the end of 2021.
For now – we have only limited resources – but in time improved infrastructure and funding will enable us to expand. One such expansion in the future will also be a full-time communications professional.
Thank you
Igor
Update: Our RAID controller card failed, so SHARCNET provided us with a spare. We hope to use it to restore the configuration and access the storage room soon.
Update: We have confirmed all the data is intact and have replaced the RAID controller, but we are still having some issues with getting the new hardware production ready. Unfortunately, data center staff will not be able to help us over the weekend.
New hardware
While we prepare the new and improved hardware to host our databases and parallel filesystems, we have been using a temporary system provided to us by the data center. All data is confirmed intact and there has been no data loss as we continue to recover. The recovery system is a stand-in for the storage server that failed, selected for hardware compatibility to recover the data. We will not keep using the recovery system indefinitely, but it will be discontinued only once the new storage system has been fully installed and synced with it for a smooth handoff.
BOINC database is UP
The BOINC database is now up and running, joining the website/forums database, which has been up since last week. However, upload/download of workunits is paused until we restore the parallel filesystem that supports the workunit management stack to the state it was in at the time of the hardware failure. Deadlines have been extended, and valid results computed during this pause will be credited when we resume.
Website crashes
During the hardware recovery process, the website has been intermittently crashing. Looking into the cause, we identified bugs that only appear in situations such as the BOINC database being offline or other resources being unavailable while we recover the system. The website will now remain available to users in these cases, or restart automatically after crashing.
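For readers curious about what "remaining available" can look like, here is a minimal sketch of the general idea: a page catches the database-offline error and shows a notice instead of crashing. The names `DatabaseUnavailable`, `render_stats_page`, and `fetch_stats` are hypothetical and invented for this example; this is not the actual WCG website code, only an illustration of the pattern.

```python
class DatabaseUnavailable(Exception):
    """Hypothetical error raised by a data layer when the database is offline."""


def render_stats_page(fetch_stats):
    """Render a small statistics page, degrading gracefully if the DB is down.

    `fetch_stats` is any callable that returns a dict with a "results" key,
    or raises DatabaseUnavailable. Both are assumptions made for this sketch.
    """
    try:
        stats = fetch_stats()
    except DatabaseUnavailable:
        # Show a notice instead of letting the error take the page down.
        return "Statistics are temporarily unavailable while we recover the system."
    return f"Results returned: {stats['results']}"


def offline_fetch():
    """Simulate the database being offline."""
    raise DatabaseUnavailable("BOINC database is offline")


print(render_stats_page(offline_fetch))
print(render_stats_page(lambda: {"results": 42}))
```

The design choice here is to handle the failure at the point where the page needs the data, so the rest of the site keeps working while one resource is unavailable.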
In the meantime, we have posted research updates from the ARP and MCM teams. We are planning on sharing more updates soon.
If you have any comments or questions, please leave them in this thread for us to answer. Thank you for your support, patience and understanding.
WCG team