Pando Archive Failure

29 January 2018

Update: 9 March 2018

Thanks to The Climate Corporation, we backfilled the HRRR archive to 15 July 2016.

The CHPC Pando archive object store system failed. All our archived HRRR and GOES-16 files were lost. The backup of the archive on Google Drive, managed by CHPC, was subsequently found to have failed as well.

The failure has hindered our research, and we realize many of you relied on the data for your own work. We apologize for the inconvenience this may cause you and your organization. The archive was an experimental system, and we learned yet another valuable lesson: even redundant archive systems can have undocumented single points of failure.

We plan to rebuild the archive, beginning with files from 1 January 2018, once the Pando system is repaired. These repairs and data transfers may not be completed until mid-February. Data prior to 1 January 2018 is lost.

We want to fill our data gaps. If you have any raw HRRR grib2 files prior to January 2018 and would like to share those files with us, please let us know.

In the meantime, limited archives exist in other places.

Email Notification: CHPC Object Storage (Pando) Failure

You are receiving this message as you are the PI of a group that owns space on the CHPC archive storage solution, Pando.

As many are aware, we started to have issues with accessing the data on Pando about two weeks ago. There was an initial failure of a large number of encryption partitions on the drives, after which the system was never able to complete the rebuilding process. Analysis of the problem indicated the need for more memory in each of the servers. However, doubling the memory, which was done last week, did not help, as the rebuild process was stuck in a feedback loop from which it could not recover. At this point the state of the object store was already beyond repair. We will mention that there is no issue with the hardware, other than the need for more memory to support the number of objects and the amount of data, which we have already addressed.

At this time, we have exhausted efforts to make the data accessible and have determined it is time to start a fresh installation, meaning that all data is lost. We will be starting this process on Monday, January 29, and will notify you when the storage is again available for use. We estimate this will take about a week.

The Unsinkable Sank

The archive redundancy was thought to be fail-proof

Pando is Dying

Let's hope the archive doesn't experience the fate of its namesake