03/04/2020 - Drupal 8 infrastructure intervention

eduardoa · April 2, 2020, 12:43pm

Short announcement:

SSB entry: https://cern.service-now.com/service-portal/view-outage.do?n=OTG0055749

This Friday, starting at 06:00 AM, official Drupal 8 sites will be taken offline. Site access will be restored once the underlying database instance is back from a version update (the Database Group foresees a 30-minute intervention).
We are sorry for the short notice, but recent infrastructure stability issues forced us to bring the version upgrade forward in order to restore stability at the database level.

Read-only mode will not be used this time: the last intervention showed that it does not work as expected on Drupal 8 and can cause multiple sites to appear to be down. Instead, the following screen will appear for all official sites:

Long explanation:

Many of you have probably noticed some instability on the service lately. The chain of events that led to the websites being unavailable was:

Monday 30/03: scheduled intervention at the database level to move the production/official database to a new server and storage server (SSB entry). This intervention was expected only to degrade the service, but read-only mode showed that Drupal 8 requires read/write permissions at the database level to work, so we had to put sites offline for the duration of the intervention.
Wednesday 01/04: database incident on the production/official instance. Because of the time it happened (03:30 AM), it took longer for our database colleagues to detect it and bring the instance back (around 06:55). During this period official sites were returning errors. Post-incident analysis pointed to a possible MySQL bug affecting the version our instance was running. As a first measure, the Database team moved the slave/secondary instance to physical hardware and updated its version to one that fixes the bug affecting production. This change should allow instances to be moved seamlessly during future interventions or incidents, reducing downtime considerably; it still needs work on our side, and we hope to have it in place soon. We decided to wait a little before doing a similar intervention on the Master/primary instance, to avoid too many interventions and incidents in the same week.
Thursday 02/04: unfortunately, a similar incident happened again one day later. The Database team's investigation points to the same bug, so upgrading the Master/primary instance is now critical to make sure we are no longer affected by it, despite the downtime this entails.

We have therefore planned this intervention, which we hope will resolve this instability once and for all.

On behalf of the Drupal infrastructure and Database teams, we apologise for the inconvenience.
