Comparing drive contents

Silent.Sin
tl;dr:
How would I go about running a diff from a Windows box on two mapped drives that are supposed to be identical?
  • Each contains over 15 million files and 5 million subdirectories
  • I need a list of the files that are on one but not the other
Director's cut:

I couldn't decide whether to post this in the OS forum or here, so I flipped a coin and you guys win. I've got a very specific problem that I can't seem to work out, and I'm wasting a ton of time on it because of the logistics involved. Here's the deal:

1) Two NAS devices are supposed to be exact replicas of one another, containing about 250GB worth of TIF images (no, not that kind, you pervs)
2) Something went terribly wrong in the transfer process from the 'master' NAS to the backup, and the devices wound up with different contents
3) The total number of images on the master device is 19,320,955, inside 7,414,880 directories, which is where my problem seems to lie. The backup device lists 15,149,756 files and 5,731,884 directories, respectively.

The boxes are Cisco NSS300 SMB NAS devices, which are basically just rebranded QNAP units so management people feel all warm and fuzzy about the name on the plastic. They seem to run some trimmed-down *nix variant, or at least BusyBox, as they allow SSH but offer only a very limited set of commands. Unfortunately there's no built-in diff binary, so I'm usually forced to do any troubleshooting from an attached machine instead of natively on the devices.
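If the BusyBox build on them is recent enough, something like this should dump exactly which applets are compiled in (assuming admin is the right login and nas-ip stands in for the real address):

ssh admin@nas-ip busybox --list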

What I want is to come up with a proper diff of the master and backup NAS. I've tried running the GNU port of diff, but it eventually fails because of the incredible number of files and subdirectories. This all needs to be documented and repeatable due to the nature of the client we're dealing with; that's why I haven't just started the copy over, I need a documented reason to go with it. Plus, the raw copy takes over two weeks to complete, again because of the number of files and the structure. It's one thing to copy 250GB of contiguous data; copying 17 million chunks and recording each one individually in the file table makes downloading porn on AOL over a 56k modem look speedy.

Whatever method I use needs to be Windows compatible, because that's all the client has to connect to the devices. I would think any GUI tool would either wreck the memory of the connected server because of the giant list of differing files, or just crap out like diff did. A command line util also offers output redirection, which would satisfy the documentation part at the same time, so I would think that's preferred. Something like the sketch below is the sort of thing I'm picturing. Any advice?
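For illustration (Z: mapped to the master share, Y: to the backup, both made up), dump a files-only listing of each tree to text:

dir Z:\ /s /b /a-d > master.txt
dir Y:\ /s /b /a-d > backup.txt

Then strip the drive letters, sort both lists (the built-in sort.exe spills to temp files, so the size shouldn't kill it), and compare the sorted lists to get the files that exist on one side only. dir streams output as it walks instead of building the whole tree in memory, so it might survive where diff didn't.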
 
Have you looked at Rsync? There are versions that'll run in Windows.
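A dry run would get you the list without touching anything. With cwRsync it'd be roughly this, assuming the mapped shares show up under /cygdrive (paths are guesses):

rsync -rni --delete /cygdrive/z/ /cygdrive/y/ > nas-diff.log

-n is dry-run, -i itemizes each difference, and --delete makes it also report files that only exist on the backup (they show up as "deleting" lines in the log).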

I did look at that, as well as robocopy since it can spawn multiple threads to increase overall transfer speed, but the commands would eventually wind up in a hung state with no activity being logged after a couple of days. A few hundred megs of files would get synced, but then they died. I even tried running it a couple of subsequent times, and each time it seemed to sync less and less past what it had done the first run, so I figured it might be hitting a resource wall. This is running on a Win 2008 R2 box with 32GB of RAM and dual Xeon 5650s, not exactly lacking in horsepower.
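That said, I may still try robocopy in list-only mode for the diff itself, since that never copies anything, it just logs what it would have done (flags from memory, so treat this as a sketch):

robocopy Z:\ Y:\ /L /E /NP /NJH /LOG:master-vs-backup.txt

Anything tagged "New File" in the log would only exist on the master, and "*EXTRA File" entries would only exist on the backup.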
 
Is copying the data again a possibility?

I've had this problem before, trying to copy millions of files between two systems. I tried running cp to copy folders to an NFS share, and that failed. Not sure if cp ever even did anything; it just spent a few hours using lots of CPU and I eventually killed it, not wanting to wait. I forget how rsync went, but I think it was averaging something like 200KB/sec, which was way too slow to be reasonable. I forget what else I tried, but I'm fairly sure in the end the best method I found was:
tar cf - bigdirectory | gzip | ssh newsystem 'tar xzf -'

It copied data at basically 100MB/sec; I remember the network becoming the bottleneck.

--edit--
Actually I forget if I used ssh; I might have used netcat to avoid dealing with encryption.

I think with netcat the general idea is that on the receiving system you run something like:
nc -l -p 50000 | tar xzf -

And on the sending system you do:
tar cf - bigdirectory | gzip | nc newsystemip 50000

Something like that; too lazy to look it up. The key part of the solution is letting tar handle the millions of files, which it does very well. You can either tar everything up into one big file on one end, copy the big file over, and untar it on the other, or use ssh or netcat to skip the intermediate file: tar on one end, stream the contents over the network, and untar the stream on the other. Also, tar and nc are part of busybox, so there's a good chance they're available.
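In other words, the intermediate-file route would be roughly (staging path made up):

tar czf /staging/big.tar.gz bigdirectory

then move big.tar.gz to the other box however you like, and on the other end:

tar xzf big.tar.gz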
 