Clearing the RAID superblock from a RAID partition

Sometimes, after a power failure, a mechanical fault, or plain bad luck, our RAID falls out of sync, and usually a rebuild can be forced with -f. However, there are times when mdadm rejects the partition with a generic message such as “mdadm: add new device failed for /dev/sd* as *: Invalid argument”, and that is when we want to tear our hair out.
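Before touching any metadata it helps to confirm how the array and the rejected member look to the system. A minimal check, assuming /dev/md0 is your array and /dev/sdb4 the rejected partition (both are just placeholders, adjust them to your setup):

cat /proc/mdstat          # quick overview of every array and which members are active
mdadm --detail /dev/md0   # per-array view: degraded state, failed or missing slots
dmesg | tail              # kernel messages often explain why the add was refused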

Well, if it is only one disk, or your RAID is still running, then “KEEP CALM, REMOVE RAID METADATA AND READD THE PARTITION”, and this is where the magic happens. You have probably already tried rebooting the machine, reassembling the array, and so on… Oddly enough, partitions 1, 2 and 3 on that same disk do assemble into the array without any problem; the superblock of the failing partition may be corrupt.
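Before zeroing anything you can ask mdadm what it thinks of the superblock on the rejected partition, and make sure the stale member is no longer attached to the array. A sketch, again using /dev/md0 and /dev/sdb4 as placeholders:

mdadm --examine /dev/sdb4           # dumps the RAID superblock; errors or garbage here back up the corruption theory
mdadm /dev/md0 --remove /dev/sdb4   # detach the stale member first if it is still listed (ignore the error if it is not)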

dd if=/dev/zero of=/dev/[your drive/partition] bs=512 seek=$(( $(blockdev --getsz /dev/[your drive/partition]) - 1024 )) count=1024
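That dd line zeroes the last 1024 sectors (512 KiB) of the partition, which is where 0.90 and 1.0 metadata superblocks are stored. If you prefer to let mdadm locate and wipe the superblock itself (which also covers 1.1/1.2 metadata, stored near the start of the partition), its built-in option does the same job; /dev/sdb4 is again just a placeholder:

mdadm --zero-superblock /dev/sdb4   # erases the md superblock wherever mdadm finds it, leaving the rest of the partition untouched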

mdadm -a /dev/md[your_raid] /dev/[your drive/partition]

Now all that is left is to wait for the RAID to finish rebuilding… remember to watch /proc/mdstat to check its progress.
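A couple of ways to keep an eye on the rebuild, with /dev/md0 once more standing in for your array:

watch -n 5 cat /proc/mdstat   # refreshes the resync/recovery progress every 5 seconds
mdadm --detail /dev/md0       # shows the rebuild status and the state of each member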

2 Comments

  1. Just got bit by this myself. My opinion is that it’s not suitable for production systems to automatically do this. At least put a question to the user if they want this turned on. My main database runs on an Areca hardware card that autoscrubs a 16 disk RAID-10 but doesn’t get in the way. I have several replica slaves that have just a pair of 1TB 7200RPM SATA drives because they are read only and don’t need massive IO performance. They work just fine. Until the first Sunday of the month. I don’t care about bit rot, the data is all reproducible from the master anytime I need it. I can monitor the drives with SMART and replace them as needed easily and cheaply. However, a batch job that takes 6 hours ran for 12 hours last night and every customer I had was bitching about how slow the system was. In general there’s a lot of smart stuff done with mdadm. This is NOT one of them. Scott Marlowe

  2. Yeah, as I was saying in the post, from time to time mdadm will reject our partitions because they got corrupted. This happens mainly on software RAIDs that lack a real RAID card with a proper battery and ECC memory on it.

    You can fix your RAID using other techniques; you can even try to rebuild the whole array assuming clean partitions in order to recover the data… but all of that is out of the question here, this is a last-resort, remote-SSH fix ;)

    Also, it is the partition that failed, not the disk itself… in this case you don’t need to replace a disk, just complete your RAID again ;)
