Mirroring storage over WAN

Boomslang
I'm looking for a way to mirror storage devices over the internet in an efficient way. Here are features I'd like to see:

1.) Efficient use of bandwidth. A simple once-daily rsync would not be ideal, because if I had ten nodes in the "distributed cluster," each node would have to download all the changes from the "master node." I'd rather see the node where the changes were made send 1/9th of the changes to each of the other nodes, and have those nine nodes share the data among themselves to make the most efficient use of available bandwidth. The paths between nodes could be analyzed before the transfer and the data split into chunks according to who can handle the most data (see the sketch after this list).

2.) No "master/slave" or "primary/secondary" relationship. I'd like for the operator of each node to be able to make changes to each storage device, and have the changes immediately pushed out to the other nodes in the cluster.

3.) No real need for process sharing, so this isn't exactly a High Availability thing in the usual sense where if one machine goes down, the running processes kick in on another machine.
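
To make item 1 concrete, here is a minimal sketch in Python of the chunked fan-out bookkeeping. Everything in it is hypothetical (the peer names, the even round-robin split, the plain file list); a real version would weight the split by each peer's measured bandwidth and actually move data.

Code:
# Hypothetical sketch of requirement 1: the originating node splits its
# changeset into one chunk per peer, so the peers redistribute the chunks
# among themselves instead of the origin uploading everything N-1 times.
from itertools import cycle

def assign_chunks(changed_files, peers):
    """Round-robin the changed files across the peers."""
    assignment = {peer: [] for peer in peers}
    for path, peer in zip(changed_files, cycle(peers)):
        assignment[peer].append(path)
    return assignment

if __name__ == "__main__":
    peers = ["node2", "node3", "node4"]               # placeholder hosts
    changes = [f"/data/file{i}" for i in range(10)]   # placeholder changes
    for peer, files in assign_chunks(changes, peers).items():
        print(peer, "redistributes", files)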

The best I can think of so far is DRBD. I can mirror devices over a network with DRBD, but only one device can be the "primary" that is directly read from / written to at any given time. (I learned this from the Gentoo how-to and haven't confirmed it from other sources.) I also don't expect it would make efficient use of the network. rsync isn't ideal either: while it handles data changes efficiently, it is not efficient at mirroring data to multiple nodes and is not automatic enough.

Do you guys know of any Linux gems I may be overlooking that can perform some of these functions? Alternatively, is anyone using DRBD to do something similar?

Thanks!

EDIT: Just read this on the DRBD website: "Since DRBD-8.0.0 you can run both nodes in the primary role, enabling you to mount a cluster file system (a physical parallel file system) on both nodes concurrently."

That's more like what I want! This is the bare minimum that would satisfy me, so I might be able to get by with DRBD. Not sure about the rest, or whether it can be achieved with off-the-shelf tools.
 
I've been looking for a similar solution for a while and haven't really come up with anything - there is some commercial stuff out there, and I think Network Appliance has a feature that does pretty much exactly this, but it's most likely very expensive.

My suggestion is to break out the shared storage and the file system separately. Use DRBD to handle the shared storage, splitting each site into two halves: one half active, the other half standby for another site. Once you have your mesh of shared storage pools, you can put a file system like Lustre, Ceph, or Red Hat GFS on top. You could then have a node at each site re-export that over NFS, SMB, etc., depending on your needs.

A lot of the file system solutions rely on shared storage to achieve HA, so if you break that functionality out separately, choosing a FS becomes a lot easier. The one problem with this setup is that it will be horrendously inefficient and will probably require a decent-sized pipe to the other site. Have you profiled the bandwidth your application drives between two DRBD nodes on the same local network?
 
How are these nodes connected? Is each one on a separate WAN? Or when you mirror from node 1 to node 2 over, say, a point-to-point T1, are the other nodes on the local LAN? My point is that the push "method" from node 1 to 2 might not be the same as from node 2 to 3, and so on.

How much data are we talking about?
 
Have you checked out AFS and Coda? They are distributed filesystems that support replication.
 
I wonder if iSCSI might work for you? It probably wouldn't be too fast, but depending on your budget you could use some kind of WAN optimizer.

You could map, say, one drive of a box to an iSCSI target and create an md device out of the local disk plus the iSCSI-connected disk. Probably not too efficient.

AFAIK multiple boxes can connect to the same iSCSI block device, but I'm not sure how conflicts are resolved. I assume that when you say mirror, you mean "real-time" mirroring. If it is not real-time, you'd have to figure out some sort of conflict resolution when changes are synced.

If you don't need real time, rsync + shell scripts (yeah I know you said it is not automatic enough) would probably work.

If you would like to try mirroring to iSCSI and need help, post here and I can give you a guide. It is fairly simple to set up both the target and the client.
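
For what it's worth, here is a rough sketch of that setup as a Python wrapper around open-iscsi and mdadm. The portal address, target IQN, and device names are all placeholders I made up; in particular, you'd have to confirm which /dev/sdX the remote LUN actually shows up as (e.g. via /dev/disk/by-path) before handing it to mdadm.

Code:
# Hedged sketch: attach a remote iSCSI disk and mirror a local disk onto
# it with Linux software RAID (md). All names below are placeholders.
import subprocess

PORTAL = "192.0.2.10:3260"                        # assumed iSCSI portal
TARGET = "iqn.2007-04.example.com:storage.disk1"  # assumed target IQN

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Discover and log in to the remote target (open-iscsi).
run("iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", PORTAL)
run("iscsiadm", "-m", "node", "-T", TARGET, "-p", PORTAL, "--login")

# Assuming the remote LUN appeared as /dev/sdc, mirror the local /dev/sdb
# onto it as RAID1. --write-mostly keeps reads on the local disk, which
# matters when the other half of the mirror is on the far end of a WAN.
run("mdadm", "--create", "/dev/md0", "--level=1", "--raid-devices=2",
    "/dev/sdb", "--write-mostly", "/dev/sdc")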
 
Good stuff, guys...

Had not looked into Lustre or Ceph before - Ceph in particular looks quite interesting, and I will dig into it a little more when I get a free moment. I had ruled out GFS because its purpose is more for clusters than for individual consumer-grade nodes connected by the internet. I am not really looking for a full-blown HA configuration; just a way to maintain a "master copy" of data that can be contributed to by any node administrator such that the data mirrors itself to every node.

The nodes, as I mentioned above, would be consumer-grade devices connected by standard (read: unreliable) internet connections. Data would be contributed (most likely) in chunks of ~10-100 MB and would accumulate gradually up to, and possibly past, ~5 TB. I'd like to be limited only by the capacity of the consumer-grade hardware, which will be running (at most) 16-20 hard drives in software RAID 5.

Just skimmed the descriptions, and Coda put a sparkle in my eye. I'll check into it further to see if this is what I'm looking for.

Yours is an interesting idea, but I think iSCSI might be too low-level for what I'm trying to accomplish. Right now, rsync with a custom wrapper might actually be ideal (depending on how Coda looks), and I don't really want to work with block-level software like iSCSI or DRBD.


The "theoretical" situation that I'm working with is: A handful of offices with low budgets need to collaborate on a project and the best way to do this would be to archive all the data that each office creates on a fileserver, one at each office. To maximize the effectiveness of collaboration, the data that each office generates should be available to the other offices ASAP. Thus, while each office adds their own data to their local fileserver, the data that the other offices are generating are also being sent over the wire and written to the fileservers. If this is unclear, I can try to elaborate.

One thing I was playing with was the idea of "circular rsync": somehow automate rsync so that as changes are committed to a local fileserver, an rsync operation is triggered that begins sending the data to a remote fileserver. That remote fileserver is also outfitted with "change detection," so when it sees the data coming in from the original fileserver, it too starts sending it on down the line. The fileservers could be linked this way in a loop, so that changes get sent to every server in the loop until no more changes need to be made. However, this is not robust or efficient... it was just an idea.
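
A minimal sketch of one node in that loop, assuming a naive mtime poll for the "change detection" and shelling out to rsync; DATA_DIR and NEXT_HOP are placeholders, and a real version would need smarter loop termination than just relying on rsync sending no deltas once the ring has converged.

Code:
# Hypothetical "circular rsync" node: poll a directory for changes and,
# when something changed, rsync it to the next fileserver in the ring.
import os
import subprocess
import time

DATA_DIR = "/srv/mirror"        # local copy of the shared data (placeholder)
NEXT_HOP = "node2:/srv/mirror"  # next fileserver in the loop (placeholder)

def snapshot(root):
    """Map every file under root to its mtime, as a cheap change detector."""
    state = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                state[path] = os.stat(path).st_mtime
            except OSError:
                pass  # file vanished mid-scan
    return state

previous = snapshot(DATA_DIR)
while True:
    time.sleep(30)
    current = snapshot(DATA_DIR)
    if current != previous:
        # -a preserves perms/owner/times; --delete propagates removals.
        # Since rsync only moves deltas, a node that passes a change along
        # ends up sending a no-op once the loop converges.
        subprocess.run(["rsync", "-a", "--delete",
                        DATA_DIR + "/", NEXT_HOP])
        previous = current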

I will look into Coda for now.
 
This may not be "highly" efficient, but it's very easy to understand and even simpler to implement. Never underestimate simple!! rsync is one of my personal favorites...

I once set up an 802.11g wireless bridge with 400 mW radios and 24 dBi gain antennas between two sites for syncing (at that power level / antenna gain, no special regulations or permits are required). Could this be an option for pushing the data faster?

If you have the bandwidth to sync 5TB+ to several sites now, is it not possible to allow the sites just to "use" the data from a central location? We are now curious :D
 
The wireless thing, while definitely cool, is not an option, as the scalability requirements say we need to accommodate nodes from coast to coast. The 5 TB, while a lot, would not be all at once (that's just the accumulation; it shouldn't be more than a few hundred MB of bandwidth per day), and I'd like each node to function reasonably well in the event of a disconnection, which is another thing Coda sounds good at handling.
 
Whatever solution you decide here, post back on how it goes. I, for one, am interested.

I would like to do something similar between my brothers/parents/my house.
 
I have been looking at a similar thing with Unison and just now noticed this:
The Unix owner and group ids are not propagated. (What would this mean, in general?) All files are created with the owner and group of the server process.
http://www.cis.upenn.edu/~bcpierce/unison/download/releases/stable/unison-manual.html#perms

which really puts a damper on my idea of creating a single script that would run as backupadmin and sync all my home/ profiles, etc. folders. :(

edit:

Apparently there is a command-line option for Unison to transfer permissions and owners:
http://www.cis.upenn.edu/~bcpierce/unison/download/releases/stable/unison-manual.html
owner
When this flag is set to true, the owner attributes of the files are synchronized. Whether the owner names or the owner identifiers are synchronized depends on the preference numerids.

perms n
The integer value of this preference is a mask indicating which permission bits should be synchronized. It is set by default to 0o1777: all bits but the set-uid and set-gid bits are synchronized (synchronizing these latter bits can be a security hazard). If you want to synchronize all bits, you can set the value of this preference to -1.
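
Assuming those preferences work as documented, a sync that carries owners and all permission bits might be driven like this; the roots and the account are placeholders, and I haven't verified these exact flags myself:

Code:
# Hypothetical Unison invocation with owner + full permission syncing,
# based on the manual excerpt above. Roots and account are placeholders.
import subprocess

subprocess.run([
    "unison",
    "-batch",             # no interactive prompts
    "-owner",             # sync owner attributes (per the manual)
    "-perms", "-1",       # sync all permission bits, incl. setuid/setgid
    "/home",                           # local root (placeholder)
    "ssh://backupadmin@mirror//home",  # remote root (placeholder)
])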
 
rsync itself can preserve permissions and ownership, and it looks like a number of Unison's issues stem from its accommodation of non-*nix systems.
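
For reference, a minimal permission-preserving rsync call looks like this; -a (archive mode) implies -rlptgoD, i.e. recursion plus permissions, owner, group, and times, and preserving ownership generally requires root on the receiving side. Paths and hostnames are placeholders.

Code:
# Minimal sketch of a metadata-preserving rsync; names are placeholders.
import subprocess

subprocess.run([
    "rsync",
    "-a",             # archive mode: -rlptgoD (perms, owner, group, times)
    "--numeric-ids",  # keep uid/gid numbers instead of mapping by name
    "/srv/mirror/",                    # trailing slash: copy contents
    "backupadmin@node2:/srv/mirror/",
], check=True)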

As far as the original topic is concerned, here are some small updates:

-The easiest way to replicate data would be a "linear" path between nodes. For example, you could write a small file-transfer benchmark that runs on each node and measures the time to send a set amount of data to every other node. Using those results, you can calculate the fastest route from one end of the network to the other. Then rsync in "batch mode" should be able to sync from node to node to node, following the fastest route through the network. Splitting the data up would not add much efficiency (I did the math) and would greatly increase the complexity of the wrapper script. A sketch of the route calculation is below.
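
Here is a rough sketch of the route calculation, with made-up benchmark numbers standing in for real test transfers; it greedily chains nearest neighbors from the node where the changes originate, then prints the rsync batch-mode hops (rsync's --write-batch records the changes while applying them, and --read-batch replays the recorded batch on a later node):

Code:
# Hypothetical "best linear route" sketch. The cost numbers are made up;
# in practice each node would time a fixed-size transfer to every peer.
import itertools

NODES = ["node1", "node2", "node3", "node4"]      # placeholder hosts

# Fake pairwise benchmark results in seconds, keyed by (src, dst).
cost = {(a, b): 1.0 for a, b in itertools.permutations(NODES, 2)}
cost[("node1", "node3")] = cost[("node3", "node1")] = 5.0  # a slow link

def best_chain(nodes, cost, start):
    """Greedy nearest-neighbor chain starting at the node with the changes."""
    chain, remaining = [start], set(nodes) - {start}
    while remaining:
        nxt = min(remaining, key=lambda n: cost[(chain[-1], n)])
        chain.append(nxt)
        remaining.remove(nxt)
    return chain

chain = best_chain(NODES, cost, start="node1")
print("route:", " -> ".join(chain))

# First hop records the changes while applying them; each later node
# applies the forwarded batch file locally, then passes it on.
print(f"{chain[0]}$ rsync -a --write-batch=changes /data/ {chain[1]}:/data/")
for prev, node in zip(chain[1:], chain[2:]):
    print(f"{prev}$ scp changes {node}: && "
          f"ssh {node} rsync -a --read-batch=changes /data/")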

The exception is when a new node is being added to the network. In that situation, every existing node has a full set of data, so "swarming" could be used quite effectively. During normal operation, not every node has a full copy of the changes, so there would be an unfair strain on the bandwidth of the node that pushes the changes out to the rest. I'm now leaning into P2P-land with this project, and I'd like each node to be treated fairly; by that I mean uploaders are more motivated to add content when they know their upstream pipes won't get hammered.

Using the "best linear route" method described above, no node has to move more than two times the amount of data to be transferred. It's slower for the changes to make their way down the whole network, but it's more fair. This decision is not firm and it's still in the "thinking stage."
 