
Re: Fixing a botched StoreOnce Software store?


Hi Justin,

 

thanks for chiming in. As an update, the store in question is now dead in the water - it has completely emptied on the user data side (all remaining media has since expired), but it still claims some 500 GB on disk. Given there are no tools to repair this, the only option left is to remove the store, which I will do soon. This is clearly not a satisfactory solution.

 

"What I really would like is a way to verify the datastore directly, rather than via the individual media. Verifying this way would mean checking between 8 and 14 TB rather than the full data set."

 

Exactly. There is a need for an fsck-style operation that can be triggered externally and verifies the entire structure of a store, repairing anything that can be repaired, removing anything that can't, and returning the store to a consistent state as seen from the outside. I could call it on my remaining stump of a store and it would turn back into a proper store with zero user data. And I could have called it directly after the issue, to clean up from it. There is also a need for scrubbing of entire stores in normal operation - both automatic and externally triggerable - to find errors that have silently crept in. The latter is in the making for the appliances, so let's hope the feature, which is in the codebase anyway, will also be exposed to SO Software users.
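
Just to make the scrubbing idea concrete, here is a trivial sketch of the shape such a pass could take. To be clear: this is not real StoreOnce tooling, and the on-disk layout it assumes (*.chunk files with *.sha256 sidecar digests) is invented purely for illustration:

    #!/usr/bin/env python3
    # Hypothetical scrub pass -- NOT a real StoreOnce tool. The layout
    # (*.chunk files with *.sha256 sidecar digests) is invented.
    import hashlib
    import pathlib

    def scrub_store(store_root):
        """Recompute every chunk's checksum; return the corrupt ones."""
        bad = []
        for chunk in sorted(pathlib.Path(store_root).rglob("*.chunk")):
            expected = chunk.with_suffix(".sha256").read_text().strip()
            actual = hashlib.sha256(chunk.read_bytes()).hexdigest()
            if actual != expected:
                bad.append(chunk)  # candidate for repair or removal
        return bad

    if __name__ == "__main__":
        for chunk in scrub_store("/data/store01"):
            print("corrupt chunk:", chunk)

A real scrub would obviously have to understand the live chunk format and throttle itself against production load, but the shape of the operation is the same: walk everything, verify integrity, report what is broken.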

 

"The other thing I'd like to see is that if there *is* a corrupt block, it's flagged, and the next time a backup is taken, the block gets rewritten. There's no reason to completely discard ~20 copies of a backup if you still have a valid source copy you can use..."

 

That is a great idea. It would mean the fsck-style operation has to come in two flavors - one that leaves missing data chunks pending, and one that finally gives up on them. You could then do this after a botch (sketched in code after the list):

  1. Run the soft fsck to get the store back into a state where it can be written to, with all media accessible again, though some will fail on read due to missing blocks.
  2. Now wait for a cycle of full backups, or start some manually. This would rewrite the blocks that went missing, magically reviving the media that failed to read after step 1. Some will not make it, though.
  3. Now run the hard fsck to bring the store back to a consistent state, finally getting rid of anything that is unrecoverable.

Ideally, you would never need step 3, because expiring media would slowly rid us of any referenced-but-missing blocks. Both the soft and the hard fsck would also need a quick way to communicate broken media back to DP, so the media can go poor there without needing a verify (which, as you state, is essentially not an option for any properly filled store; I was just lucky to get hit on a less-than-1-TB store where the scripted verify took two days).
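
Spelled out as a toy model - and to be perfectly clear, soft_fsck, ingest_backup and hard_fsck are fantasy names, nothing like this exists in SO Software; the store is reduced to a plain chunk index just to show the flow of the three steps:

    # Store modeled as a chunk index: chunk id -> data (None = lost).
    def soft_fsck(index):
        """Flavor 1: delete nothing; just report chunk ids that are
        referenced but missing, leaving them pending for repair."""
        return {cid for cid, data in index.items() if data is None}

    def ingest_backup(index, pending, incoming):
        """Next backup cycle: an incoming chunk whose id is still
        pending fills the hole in place (your rewrite-on-backup idea)."""
        for cid, data in incoming.items():
            if cid in pending:
                index[cid] = data
                pending.discard(cid)

    def hard_fsck(index, pending):
        """Flavor 2: give up on whatever is still missing, dropping the
        dangling references so the store is consistent again."""
        lost = set(pending)
        for cid in lost:
            del index[cid]
        pending.clear()
        return lost

    if __name__ == "__main__":
        store = {"a1": b"ok", "b2": None, "c3": None}  # b2, c3 missing
        pending = soft_fsck(store)                     # step 1
        ingest_backup(store, pending, {"b2": b"ok"})   # step 2: b2 back
        print("lost for good:", hard_fsck(store, pending))  # step 3

Note the ordering: as long as a chunk is merely pending, no medium referencing it gets invalidated, so a single healthy source copy on the client side is enough to heal all ~20 deduplicated references at once.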

 

The basic problem is probably that the developers never hear us. The layer between them and us, which treats every idea or bug report as a support cost factor, is quite thick these days...

 

Thanks,

Andre.

