A disk I was using for my Microsoft Data Protection Manager 2010 (DPM) protection groups recently failed. Unfortunately, the error wasn't recoverable, so I had to remove the disk from the pool and reallocate the affected data sources. DPM is supposed to flag a data source with an error when it can't perform the backup, but I've found this is not always the case. Only about 25% of the data sources that were on that disk ever errored out. The rest had the happy green "OK." Looking through the protection groups, I noticed that any data source that was no longer protected still showed the last recovery point from when it last succeeded (which was days ago). When I tried to run a manual express full backup at that point, I got an error stating that the disk was missing and the backup could not be performed. However, DPM still showed the green "OK" symbol next to the data source.
I have several hundred protected data sources and I couldn't go through them one by one, so I whipped up a PowerShell script to show me stale DPM data. Basically, it enumerates all the data sources and compares each one's latest recovery point date with the time 24 hours ago. If the recovery point is older than that, the script outputs the protected resource so I can remove it and add it back to the protection group with a fresh volume.
The biggest gotcha I ran into is that a lot of properties returned by Get-DataSource are asynchronous, which is annoying when you are scripting. Luckily, a TechNet blogger had a solution to that problem. His script contains an error (a missing parenthesis), though. I notified him so hopefully it will get fixed. I have confirmed my script below works.
This script is not just useful for finding stale data due to a failed disk. It could also be adapted to notify you when a failing backup is close to exceeding your defined Recovery Point Objective (RPO). DPM's internal notification system is very noisy, since it's not uncommon for a backup to fail and then recover on its own very quickly. If you manage a large DPM deployment, you are probably used to hundreds or even thousands of emails awaiting you after the weekend. I'm not using it that way yet, but I think I just might.
The output of my script looks like this:
$ds[133] System State and BMR : server1.avianwaves.com : Computer\System Protection
The $ds variable is the array that stores all the data sources used in the script. The 133 is the index, so you can quickly query more information about the data source if you need to. Just type "$ds[133]" at the PowerShell prompt to do so. Immediately after the variable name is the Protection Group. Following that is the server holding the protected resource. Then the last part is the protected resource itself.
I hope this helps somebody out there!
# Refresh the datasource metadata. Code taken from: http://blogs.technet.com/b/dpm/archive/2010/09/11/why-good-scripts-may-start-to-fail-on-you-for-instance-with-timestamps-like-01-01-0001-00-00-00.aspx
Disconnect-DPMserver #clear object caches
$ds = @(Get-ProtectionGroup (&hostname) | foreach {Get-Datasource $_})
$ds = $ds | ?{$_} #remove blanks
$global:RXcount=0
for ($i=0; $i -lt $ds.count;$i++) { [void](Register-ObjectEvent $ds[$i] -EventName DataSourceChangedEvent -SourceIdentifier "TEV$i" -Action { $global:RXcount++}) }
# touch properties to trigger events and wait for arrival
$ds | select latestrecoverypoint > $null #do not use [void] coz does not trigger
$begin = get-date
# Wait up to 30 seconds for every data source to report its refreshed properties
$m = Measure-Command { while (((Get-Date).subtract($begin).TotalSeconds -lt 30) -and ($global:RXcount -lt $ds.count)) {sleep -Milliseconds 100} }
if ($global:RXcount -lt $ds.count) { write-host "WARNING: Fewer events arrived [$global:RXcount] than expected [$($ds.count)]" }
Unregister-Event *
# Look for stale data
$staleDate = (get-date).AddDays(-1) # 24 hours old is our limit
$count = 0
foreach ($dsi in $ds)
{
    if ($dsi.LatestRecoveryPoint -lt $staleDate)
    {
        write-host "`$ds[$count] $($dsi.ProtectionGroup.FriendlyName) : $($dsi.ProductionServerName) : $($dsi.LogicalPath)"
    }
    $count++
}
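
For the RPO idea I mentioned earlier, here is a rough sketch of how the stale-data check might be turned into an early warning. I have not run this against a live DPM server. Get-RpoWarning and its parameters (Rpo, WarnFraction, Now) are names I made up for illustration, and it assumes each object passed in has the LatestRecoveryPoint and LogicalPath properties the script above already relies on. The sample objects at the bottom are stand-ins so the function can be tried without DPM.

```powershell
# Hypothetical helper (not part of DPM): returns the data sources whose newest
# recovery point is older than a chosen fraction of the RPO.
function Get-RpoWarning {
    param(
        [Parameter(Mandatory = $true)] [object[]] $DataSource,
        [TimeSpan] $Rpo = [TimeSpan]::FromHours(24),   # assumed RPO
        [double]   $WarnFraction = 0.75,               # warn at 75% of the RPO
        [DateTime] $Now = (Get-Date)
    )
    # Age at which to start warning, e.g. 18 hours for a 24-hour RPO
    $warnAge = [TimeSpan]::FromTicks([long]($Rpo.Ticks * $WarnFraction))
    foreach ($item in $DataSource) {
        $age = $Now - $item.LatestRecoveryPoint
        if ($age -ge $warnAge) {
            New-Object PSObject -Property @{
                LogicalPath = $item.LogicalPath
                AgeHours    = [math]::Round($age.TotalHours, 1)
                RpoExceeded = ($age -ge $Rpo)
            }
        }
    }
}

# Stand-in objects so the function can be tried without a DPM server
$now = Get-Date '2020-01-02 12:00'
$sample = @(
    New-Object PSObject -Property @{ LogicalPath = 'C:\'; LatestRecoveryPoint = $now.AddHours(-2) }
    New-Object PSObject -Property @{ LogicalPath = 'D:\'; LatestRecoveryPoint = $now.AddHours(-20) }
    New-Object PSObject -Property @{ LogicalPath = 'E:\'; LatestRecoveryPoint = $now.AddHours(-30) }
)
$warnings = @(Get-RpoWarning -DataSource $sample -Now $now)
$warnings | Format-Table LogicalPath, AgeHours, RpoExceeded -AutoSize
```

Against real data you would call it as Get-RpoWarning -DataSource $ds after the metadata refresh at the top of my script has finished. Without that refresh, LatestRecoveryPoint may not be populated yet and everything would look stale.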