WCG restart delayed...until...13May 2023

Vester

Well-Known Member
USA team member
From Facebook:
With a heavy heart we must announce to volunteers our intent to delay restarting the WCG until May 9th, 2022. Several issues discovered in our production environment remain unresolved, making it impossible to meet the April 22, 2022 deadline at this time.
Unexpected issues continue to delay the full validation of the QA environment, meaning there is no path yet to a responsible restart of the Production system even if all outstanding issues were resolved today. Several obstacles that proved difficult to resolve due to inexperience with specific components in the WCG software stack contribute to our need for yet more time to bring the Grid back online.
Notably, the website build broke due to a dependency that brought React version 18.0.2 into the build, whereas the site was developed against React version 17. The lack of experience with React and modern web development practices within our team resulted in what now seems a simple fix, pinning React to the previously working version 17 for all dependencies that permitted a change in major versions when resolving required packages at build time. Volunteers may have noticed this issue, as it coincided with a long silence at www.worldcommunitygrid.org, which could not be updated. In addition, we overlooked a misconfiguration in the messaging/queueing middleware (IBM MQ), and a missing root certificate took far too long to discover as the reason Apache could not talk to IBM Websphere. The last few public IPs to be assigned were not routable due to a misconfigured VLAN. While these issues were all resolved, we now need more time to ensure there are no more surprises.
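As an illustration of the React problem described above, here is a minimal sketch of how one might flag dependencies whose declared React range permits a jump to another major version. It is not the team's tooling: the package names are hypothetical, the range parsing is deliberately simplified and does not reproduce npm's real semver resolution, and the post does not say which mechanism was used for the pin (npm `overrides` and Yarn `resolutions` are two common ones).

```python
import re


def allows_other_major(range_spec: str, wanted_major: int) -> bool:
    """Return True if a (simplified) npm-style range could resolve outside wanted_major."""
    spec = range_spec.strip()
    # Wildcards and dist-tags can resolve to any major version.
    if spec in ("", "*", "x", "latest"):
        return True
    # Open-ended lower bounds such as ">=16.8.0" have no upper limit.
    if spec.startswith(">") and "<" not in spec:
        return True
    # Alternatives such as "^17.0.0 || ^18.0.0" are checked one by one.
    if "||" in spec:
        return any(allows_other_major(part, wanted_major) for part in spec.split("||"))
    # Otherwise treat the first number in the range as its major version.
    match = re.search(r"\d+", spec)
    return match is None or int(match.group()) != wanted_major


def find_loose_ranges(react_ranges: dict, wanted_major: int = 17) -> list:
    """List the packages whose React range is not confined to wanted_major."""
    return sorted(
        name for name, spec in react_ranges.items()
        if allows_other_major(spec, wanted_major)
    )


# Hypothetical packages and the React range each one declares.
declared = {
    "site": "^17.0.2",
    "widget-lib": ">=16.8.0",
    "chart-lib": "^17.0.0 || ^18.0.0",
}
print(find_loose_ranges(declared))
```

Any package reported here would be a candidate for an explicit pin to version 17.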
As a rule, this level of detail should be omitted from updates and has been omitted from previous updates. From our perspective, the specifics of the technical obstacles that hinder us are immaterial, as what we owe to volunteers is a working backend for the World Community Grid. The Grid is far too valuable to let go - and despite challenges, we are committed to supporting open science on the Grid. Given the already overlong timeframe of the migration, and to assuage concerns as to whether we are progressing towards the goal at all, we thought to make an exception given that we are asking for your patience just a while longer before we are fully ready to restart the WCG. We must and we will succeed.
Thank you to all who have contributed feedback and words of encouragement during the downtime. We do see your posts even if we cannot always reply at this stage. Your understanding and patience are truly appreciated. We will prepare a proper team introduction and answer the questions and address the comments once the Grid is back on.
WCG Tech team
 

doneske

Well-Known Member
USA team member
That's interesting in light of:

from February 28 until April 22, 2022. A simplified version of the website will act as a sign post during the transition, to share updates and communicate our progress and roadmap for the World Community Grid in 2022. "

Nothing on the "simplified version of the website". Plus, if they had these problems in production, how did they do the stress test on February 28? What and how did they test? Additionally, did IBM help with these issues since the contract was up in March? I was under the mistaken impression that from the announced changeover until the March contract end, IBM was helping build the hardware and software platform and then transferring the data to Krembil so that Krembil people could learn the system. It sounds like a lot of this is being done after IBM left, so what did they do before IBM left? :unsure:
 

doneske

Well-Known Member
USA team member
Eerily silent at WCG considering it is the Friday before the go-live on Monday. Is that a harbinger of another Facebook post on Monday morning? :unsure:
 

Vester

Well-Known Member
USA team member
Someone is learning that it takes more than a degree. Some smart people need to work 12-18 hours a day for a while.
 

doneske

Well-Known Member
USA team member
Hate to sound pessimistic but this doesn't bode well for the future support of WCG considering the complexity of on-boarding new apps and weird problems that can arise in production. I still have to ask what happened from September to March when IBM was still there. I know from experience that it doesn't take six months to order and install hardware. This looks like a bunch of researchers that have turned into IT people. They should have had at least a test system up and running before IBM left. Then, all they had to do was replicate to the production environment. It's scary to think about a possible filesystem issue in that terabyte storage system (remember the HTTP errors trying to upload/download). At least IBM had access to Red Hat software engineers or their own developers for help.
 

Vester

Well-Known Member
USA team member
From Facebook:

World Community Grid

While we continue working hard to restart WCG as soon as possible, we want to provide an update on the current status. We are currently facing unexpected issues with the load balancer - a small but critical component that ensures science servers will cope with increased workloads. If we do not resolve this in time, it would prevent us from restarting on May 9th. We will provide an update on the result of our efforts to resolve the issue Sunday evening Eastern time.
A brief update on science
WCG downtime enabled teams to catch up with analyses and validations. The ARP project continues to analyze data and has started to prepare the online portal to help disseminate the results to the wider scientific community. OPN1 is busy validating results from the first round of computation (validation has been delayed due to problems in Europe). MCM is partially on pause until we fully restart (as the MCM team is now the WCG team). SCC is also finishing preclinical validation of previous drugs, and is preparing new targets for computation on the WCG. We do have work-units prepared for the restart for both the OPN1 and MCM projects.
We really appreciate all the support messages we have received on the last update, your patience and continued support of WCG and its projects.
Thank you
WCG Tech team
 

doneske

Well-Known Member
USA team member
One has to ask, how did they do a stress test without a load balancer? If they had a working load balancer at the time of the stress test and then reconfigured it, how worthwhile was the stress test? If they did the stress test without a load balancer (with only one server), once again, how worthwhile was the stress test? It wouldn't have accurately reflected the environment. My understanding was that they were doing the stress test in the production environment. If not, it seems like it would have been a total waste of time.
 

Vester

Well-Known Member
USA team member
No one ever got anything done by Friday and WCG doesn't work on weekends, holidays, before 9 am or after 4 pm? (Speculation)

This is going to be interesting!

Added: Twenty-three hours and the project is still "temporarily unavailable" in BOINC Manager. No further update from WCG.
 

Vester

Well-Known Member
USA team member
There comes a time in every project when it becomes necessary to shoot the engineers and begin production.
 

doneske

Well-Known Member
USA team member
We continue to work on this tonight and we aim to update you tomorrow [May 9th, 2022] with the revised start date for the WCG, unless we run into some unexpected challenges.

Must have run into some "unexpected challenges". This group is a poster child for IT outsourcing.
 

Vester

Well-Known Member
USA team member
Restart update


Issues were fixed, we can continue with testing.

Published on: 10 May 2022

We performed further investigations into issues we have been experiencing with our message broker and IBM WebSphere. We were able to address the stale JAAS configuration in IBM WebSphere causing the issue whereby JMS connections that required updated credentials were not re-initialized so long as any credential of the correct alias already existed in the WebSphere config. Hence, the outdated credential simply survived scripted reconfiguration of WebSphere to pollute all queue and topic connections anew on restart. While our team used the typical diagnostic tools for IBM MQ we gleaned no additional insight from them and were significantly delayed in discovering the bug in our deployment scripting. Stepping through all configuration scripting and checking each referenced object manually in the WAS command line console revealed the issue only this morning.
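The failure mode described above, where a scripted reconfiguration skips an alias that already exists and so leaves an outdated credential in place, can be sketched in a few lines. This is only an illustration under stated assumptions: a plain dictionary stands in for the WebSphere configuration, the function and alias names are hypothetical, and none of this is the wsadmin API or the team's actual deployment script.

```python
def ensure_alias_create_only(config: dict, alias: str, password: str) -> None:
    """Create the alias only if it is missing; an existing (stale) entry is left untouched."""
    if alias not in config:
        config[alias] = password


def ensure_alias_upsert(config: dict, alias: str, password: str) -> bool:
    """Create or update the alias; return True if the stored value changed."""
    if config.get(alias) == password:
        return False
    config[alias] = password
    return True


# Hypothetical alias with an outdated password left over from an earlier deployment.
config = {"mq_user": "old-secret"}

ensure_alias_create_only(config, "mq_user", "new-secret")
stale_after_rerun = config["mq_user"]  # the outdated value survives the re-run

changed = ensure_alias_upsert(config, "mq_user", "new-secret")
print(stale_after_rerun, config["mq_user"], changed)
```

The create-only version is idempotent but silently preserves whatever was there before, which matches the symptom in the update: every re-run looked successful while connections kept picking up the old credential.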

As we are now able to continue with testing the system we plan to reassess the earliest WCG restart date we can commit to by Thursday evening (May 12th, 2022). We will post an updated schedule to social media and to the website on the 12th.

Thank you
WCG Tech team
 

supdood

Well-Known Member
USA team member
With each delay and each story of troubleshooting woe, I become more concerned about how WCG will stand up to the load and how they will onboard new projects. Not sure if it is a case of gremlins, just bad luck, underestimating the complexity of WCG, or ill-preparedness, but to me it doesn't bode well for the project's future.

We seem to be running low on reliable, large-scale, non-VM BOINC projects. For non-math projects are we down to WCG (maybe) and Einstein for CPU? I've been at Universe while WCG is down and it has been more reliable than I expected, but it has a single point of failure in that it is just one admin doing everything.
 

Vester

Well-Known Member
USA team member

World Community Grid

The revised date for launch will be May 24th, 2022, after Victoria Day.
We were able to solve a redirection loop that caused much of the website to be unusable due to incorrect rewrite rules in Apache and their interaction with self-hosted DNS. Additional issues were then resolved that had resulted from the previously discussed necessary changes to the configuration of HAProxy, internal server certificates and thus domains, and IBM WebSphere.
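To show what a redirection loop looks like in practice, here is a minimal sketch that follows a table of redirects and reports the cycle if one exists. It is an assumption-laden illustration: the URLs are placeholders, the table is hand-written rather than derived from any real Apache rewrite rules or DNS setup, and no network requests are made.

```python
from typing import Dict, List, Optional


def find_redirect_loop(redirects: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow redirects from start; return the cycle as a list of URLs, or None if the chain ends."""
    seen: List[str] = []
    url = start
    while url in redirects:
        if url in seen:
            # We are back at a URL already visited: report the cycle from that point on.
            return seen[seen.index(url):] + [url]
        seen.append(url)
        url = redirects[url]
    return None


# Hypothetical rules: one forces HTTPS, one adds "www", one strips it again.
rules = {
    "http://example.org/": "https://example.org/",
    "https://example.org/": "https://www.example.org/",
    "https://www.example.org/": "https://example.org/",
}
print(find_redirect_loop(rules, "http://example.org/"))
```

Two rules that are each reasonable on their own (add "www", strip "www") are enough to produce the loop, which is why this kind of problem tends to appear only when configurations from different layers are combined.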
We are now updating the content on the production website to include the updates that were published during downtime, porting the React dependency pin to version 17 from the current website hosted at www.worldcommunitygrid.org to the full-featured production website which was also affected. We then need to ensure functionality on all major browsers manually. Once finished, the website and forums will be good to go.
Finally, we were able to test BOINC client connections to our servers from newly created/registered accounts. While we were able to contact the BOINC scheduler and check for available workunits, we are now diagnosing a failure to validate the project key that occurred in some cases.
On the server side, we were able to verify the flow of data from our research partners into the workunit management layer in our stack. Thus, this part is fully validated, and will proceed smoothly upon restart. We continue to assess readiness of the workunit management stack for launch together with the website.
 

doneske

Well-Known Member
USA team member
This is really pathetic (yes, I know that is a pretty strong comment). If they had resolved most of the issues, why wait until the 24th to restart? Their last post doesn't indicate that they are ready for a restart; it suggests they are probably still trying to resolve problems. The fact that they are delaying another 2 weeks just suggests that we will probably get another post about another delay.

My son works for IBM and I had him look up the old crew in the employee directory. The ones that are still with IBM, like Kevin Reed and Keith Uplinger, are working in other departments now so they aren't supporting WCG.

I'm not surprised they are having problems with WebSphere. I am surprised they are still using it. It is a very complicated piece of software and the more application servers and HTTP servers you have the worse it gets. When you put a load balancer on the front of WebSphere it becomes a nightmare because WebSphere will attempt to load balance itself across its app servers. What you have is a load balancer load balancing another load balancer. How do I know? Been there, done that.

Most production WebSphere sites have an IBM support contract so that you can call into their support center for help. This is not open source code so you have to have IBM provide patches and upgrades. This also isn't a piece of software that can be made to work and then just left alone (If it ain't broke, don't fix it). It will break because you have too many things interacting with it that change. Browsers change, HTTP protocol changes, security changes, etc. If you change your environment, that could mean a change to the WebSphere environment. These things will require patching to WebSphere. Keith and Kevin didn't have to worry about that because it was IBM code and they had access to the developers. Krembil doesn't.

I know Keith was using Ansible (Red Hat product) for a lot of their automation type stuff. That is a rather complicated piece of software too (not quite like WebSphere). Haven't heard much about it yet which makes me wonder, 1. Are they using it? or, 2. They are using it and so far it is working or, 3. They don't even know they have it yet because of all the other problems.

Meanwhile, I've been bouncing around the BOINC ecosystem like a vagabond. Did Einstein for a while, moved to Universe until I reached number 1 in team rankings, then moved to Number Fields until I reached # 1. I guess I will try another project I haven't crunched in a while.
 