Sunday 21st July 2019

Cedar Updated: Cedar Planned Outage - Arrêt planifié

Update, July 21: Another controller arrived today and was installed. This one appears to be working. Cedar is available again.

Update, July 21: We replaced the faulty storage controller yesterday, but the new one is also faulty. We are still in discussion with the vendor to evaluate how to proceed, and we hope to be in production later today.

Update, July 20: The verify process that was triggered by defective hardware has finally finished. There are still a few rebuild processes running that need to finish in order to bring the storage system into a stable state. We are forced to let those verify and rebuild processes finish in order to protect the integrity of the /home and /scratch filesystems and to avoid loss of data. Our current estimate is that Cedar will become available late today. We sincerely apologize for the situation and wish that it could have been avoided. Because of the extended downtime, the purge of old files in /scratch will not be done this month.

Update, July 17: The replacement parts for the Cedar storage system that serves the /home and /scratch filesystems arrived on time today (July 17). However, when the parts were installed, they were found to be defective as well. Furthermore, the defects triggered a verify process of all disks in the system. This verify process is very slow and is expected to run all of July 18. At this point we hope that the system will become available sometime on July 19. We wish we had better news and apologize for the situation.

Outage extended to end of day July 17. Unfortunately, some extra work is required to complete the outage, and it had to be extended by one day. Apologies for the unexpected increase.

An outage is planned for filesystem maintenance and node upgrades; it is expected to last until the end of the day on July 17. The Cedar facility will be unavailable on July 15, 16 and 17 because of necessary upgrades to the /home and /scratch filesystems. These upgrades require a filesystem check, which is estimated to run for about two days. All jobs that are still running on the morning of the 15th will need to be terminated. We apologize for the inconvenience.