Linux FAH Backup Script - V2

musky

[H]ard|DCer of the Year 2012
Joined
Dec 14, 2009
Messages
3,154
UPDATE: 6/22/2013
THIS UTILITY IS OBSOLETE. PLEASE USE THE FAHINSTALL SCRIPT FOR YOUR LINUX FAH INSTALLS, WHICH ALSO HANDLES BACKUPS

===========================================================================================================

This is based on foxtrotniner's original backup script. I made it more generic for different Linux installs. It should be distro-independent (note: it did not work on my Arch install due to permissions for editing crontab. If you run into this on another distro, let me know. If you know how to fix it, also let me know.) It is only for Linux installs, either native or in a VM. The script sets up a cron job that runs every three hours. This job:
1. Saves off the previous backups and removes any older backups
2. Backs up the work directory, queue.dat, and unitinfo.txt
3. Waits 60 seconds, then backs everything up again in case checkpoints were being written during the first backup
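In outline, the three steps boil down to something like this sketch (a paraphrase in plain sh, not the actual backup.sh the installer writes - the function wrapper, file names, and the pause argument are my own guesses, the pause being there so the 60-second wait can be shortened when testing by hand):

```shell
#!/bin/sh
# Sketch of the three backup steps: pass your FAH directory as $1 and
# an optional pause in seconds as $2 (defaults to 60).
fah_backup() {
    dir=$1
    pause=${2:-60}
    stamp=$(date +%Y-%m-%d-%H%M)

    # 1. Save off the previous backups, dropping anything older
    rm -rf "$dir/backup/previous"
    mkdir -p "$dir/backup/previous"
    mv -f "$dir"/backup/*.tgz "$dir/backup/previous/" 2>/dev/null || true

    # 2. Back up the work directory, queue.dat, and unitinfo.txt
    tar czPf "$dir/backup/backup-$stamp.tgz" \
        "$dir/work" "$dir/queue.dat" "$dir/unitinfo.txt"

    # 3. Wait, then back everything up again in case a checkpoint was
    #    being written during the first pass
    sleep "$pause"
    tar czPf "$dir/backup/mirror-$stamp.tgz" \
        "$dir/work" "$dir/queue.dat" "$dir/unitinfo.txt"
}
```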

To install this, log in as the user you use to run FAH and change to your FAH directory. If you used the Ubuntu install guide, this would be
Code:
cd ~/fah
Run the following to download the install script (fahbkup.sh), make it executable, and run it (note: do not run this as root or use sudo in front of anything - you have to be logged in as the fah user):
Code:
wget "http://musky.dyndns.info:8088/fah/fahbkup-0.2.sh"
chmod 755 fahbkup-0.2.sh
./fahbkup-0.2.sh

Once this finishes, nothing else is required. You may see a "no crontab for user" message when installing. This is fine. You can verify the install by checking a couple of things.
1. Typing crontab -l should return something like this:
Code:
0 */3 * * * /home/dave/fah/backup.sh
2. Typing ls should show these 5 new files:
backup.sh
restorebk.sh
restore.sh
start
stop

You should also have two new directories:
backup
cron

If all of this is in place, the install was successful and you are backing up your WU every three hours.
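If you would rather script that check, here is my own quick snippet (not part of the package) that verifies the five files and two directories listed above:

```shell
#!/bin/sh
# My own quick sanity check: confirm the five scripts and two
# directories from the list above are all present in the FAH dir.
check_install() {
    dir=${1:-.}
    ok=0
    for f in backup.sh restorebk.sh restore.sh start stop; do
        [ -f "$dir/$f" ] || { echo "missing file: $f"; ok=1; }
    done
    for d in backup cron; do
        [ -d "$dir/$d" ] || { echo "missing directory: $d"; ok=1; }
    done
    if [ "$ok" -eq 0 ]; then
        echo "install looks good"
    fi
    return $ok
}
```

Run it from the fah directory (or pass the path as an argument) right after installing.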

I also added start and stop scripts. The start script will start FAH in a detached screen, so you must have screen installed for it to work. On Ubuntu you probably already have it; this command will install it (or confirm it is already installed):
Code:
sudo apt-get install screen
You also need to have your flags (-smp, -bigadv, -bigbeta, etc.) set up in the config file. To run the script, type this from the fah directory:
Code:
./start
Type screen -r to attach to this new screen. Hold down the Control key and press A, then D, to detach from it.

The stop script is similar - run it by typing
Code:
./stop
This script will back up the WU twice, then stop fah. It assumes your folding executable is named fah6. If yours is not, let me know and I will show you what to change.

If you have a problem with your checkpoints when restarting a WU, you can restore the previous backup with the restore scripts. restore.sh restores the first saved copy; restorebk.sh restores the second copy. You only need to run one - if one doesn't work, try the other. You run them like any other script:
Code:
./restore.sh

Performance implications: Every three hours, one core of your machine will be used for a short period of time to create the backup. How long depends on hardware. My 1.8 GHz Magny Cours machine takes around 30 seconds for each bigbeta backup. That is probably worst case since my clock speed is low, my hdd is just an old SATA drive, and bigbetas are the largest units. I doubt you will notice, but there certainly will be some performance implications. The cron job interval can be increased from every three hours if this is a problem for you. Just let me know.
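For reference, the five crontab fields are minute, hour, day-of-month, month, and day-of-week, so stretching the interval is a one-character edit via crontab -e (the every-six-hours line is just an example):

```shell
# installed default: on the hour, every three hours
0 */3 * * * /home/dave/fah/backup.sh
# every six hours instead, if the backup cost bothers you
0 */6 * * * /home/dave/fah/backup.sh
```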

Please post or pm me with any comments, feedback, or enhancement requests.
 
musky: screen is one of those things in 10.11 that was not in 11.04. You need sudo apt-get install screen
I also do sudo apt-get install htop which is a good idea.
 
That is why I mentioned installing screen in my original post. You also can use the stop script without using the start script, if you really don't like screen.

I still have not seen the benefit of htop, but to each his or her own, I suppose...
 
This looks unbearably awesome, can't wait to try it. Thanks heaps for the hard work - this will make Linux a lot more reliable for me - I have to stop & start a great deal.

Who needs Stanford to fix buggy code when you have the [H]orde to fix things? :D
 
Trying now... Notes:

cron folder was not created at install, but at first run. No biggie.

Stop works and backs it up. Heaven.:D

Love the use of tar - you can see at a glance from the timestamps which backup is recent - folders in Ubuntu don't always give meaningful timestamps, so you have to go into a folder to check which one is the latest backup.

I might see if I can edit backup.sh to add the machine ID and client.cfg to the backup set for easier sneaker-netting... but then I have to work out the restore options

Also - can I make my own versions of "start" and append different flags at the end of this line? eg: "startoneunit"
screen -d -m /home/dave/fah/fah6 -smp -bigbeta -oneunit
would that work?
 
Just look at the backup script: it allows you to have two backup sets -- current and previous -- which is nice enough for recovery. Awesome.
 
It really works!

I modified it slightly to be more verbose (always happy to do baby edits on working scripts where someone else has done the hard yards ;)) - ran it manually and managed to catch FAH writing to disk on the second copy.



All that happens is the tar archive is missing the files being written to. But of course the prior copy a minute earlier caught a good version. I love brute force redundancy.

Cute to see it in action.

And the every 3 hours seems to work on the hour every 3 hrs - all machines did a backup at 15:00 local time. (Local time in a virtual machine being all over the place, as the guest clock drifts up to hours a day inside virtualbox.)
 
Ok, had a bit of fun doing a sneakernet swap between 2 machines for fun. Having a great deal of fun hacking these.... Of course had to add machinedependent.dat to the backup. I will know in a day or so if I messed it up. :p

But one thing occurred to me - if you run the backup from a terminal, you can see failure errors from files being locked/written. But if you accidentally restore an incomplete one, you could get into trouble - no queue.dat and you will grab a new unit etc.

How hard is it to amend the script to abort if the tar backup returns an error?

Also restoring on a different VM gave me warnings about timestamps of files being in the future - bloody clock drift.

Here is my sneakernet beta modification - adding 2 files and naming the backups with the machine name appended. (and AA and BB so my foggy brain can remember which to try first) Example:
backup-BB-sr2b-2011-08-01-2100.tgz

Thanks to the wonderfully robust and clear way these have been written, I can muck with things easily. Maybe too easily! Fingers crossed.


#!/bin/sh
# FAH backup script
date=$(date +%Y-%m-%d-%H%M)
rig=$(hostname)
rm -rf /home/dave/fah/backup/previous/*
mv -f /home/dave/fah/backup/*.* /home/dave/fah/backup/previous/ 2>/dev/null
tar czPf /home/dave/fah/backup/backup-BB-$rig-$date.tgz /home/dave/fah/work/* /home/dave/fah/unitinfo.txt /home/dave/fah/queue.dat /home/dave/fah/client.cfg /home/dave/fah/machinedependent.dat
echo "First copy made! Waiting 60 sec..."
sleep 15
echo "...in 45..."
sleep 15
echo "...in 30..."
sleep 15
echo "...in 15..."
sleep 15

tar czPf /home/dave/fah/backup/mirror-AA-$rig-$date.tgz /home/dave/fah/work/* /home/dave/fah/unitinfo.txt /home/dave/fah/queue.dat /home/dave/fah/client.cfg /home/dave/fah/machinedependent.dat
echo "Second copy made!"

Shunt the files over the network manually, then for the restore:


#!/bin/sh
# FAH restore script
rm -rf /home/dave/fah/work
rm -f /home/dave/fah/queue.dat
rm -f /home/dave/fah/unitinfo.txt
rm -f /home/dave/fah/client.cfg
rm -f /home/dave/fah/machinedependent.dat

# assumes exactly one mirror archive in shunt/ (extra glob matches
# would be treated by tar as member names, not archives)
tar xzPf /home/dave/fah/shunt/mirror*.*
 

The cron folder is fairly useless anyway. It would be needed if I had an "uninstall" script, but I don't. It should have been made with the install, though. Is there a file in it called cron.old?

Machine ID - you mean the one from FAH itself? I guess we could read the client.cfg file and append some of it to the file name. Would anything else from client.cfg be useful?

Adding flags to start - I actually realized that this would be a problem for -oneunit yesterday evening. Another great enhancement idea, but not possible yet.

I'll PM you some changes to put into your start and backup.sh scripts to test out before I add them to the main script, if that is OK with you MIBW.
 

Your restore is restoring from the second backup instead of the first, but that is fine (and probably the better "first choice" since it is newer, now that I think about it). Your additions to the tar czPf command are fine also - it never occurred to me that you would need those two files for sneaker-netting, but you are correct that they would be.

Since I had never seen what actually happens when there is a conflict making the tarball, I hadn't really thought about verifying it. Now that I see it, that is a great idea, and may actually eliminate the need for a second backup - create the tarball, and if it errors, delete it and make another. I'll have to see what tar can do, but it sounds possible. Unix guys, please chime in here. I know next to nothing about the tar command.
 
The cron folder is fairly useless anyway. It would be needed if I had an "uninstall" script, but I don't. It should have been made with the install, though. Is there a file in it called cron.old?

There was after I started running backups manually a few minutes later. I am not 100% sure anymore when it appeared, because this afternoon I noticed that files created externally do not always show up in Nautilus without a refresh. I had no idea if cron was needed or not, but maybe add a line to the guide saying not to worry if it does not show up right away.

Machine ID - you mean the one from FAH itself? I guess we could read the cient.cfg file and append some of it to the file name. Would anything else from client.cfg be useful?

Machinedependent.dat - the unique hardware key that on Windows is stored in the registry but thankfully in Linux is in this file. Sneakernet without it and you don't get the bonus.

Adding flags to start - I actually realized that this would be a problem for -oneunit yesterday evening. Another great enhancement idea, but not possible yet.

No biggie really, now that I have some examples to play with, I can make my own separate scripts to start if I really want them. I have too many options to hardwire - I drop from bigbeta to bigadv on the last 18 hrs of a rental because I cannot afford to get a 2+ day 6904. So with -oneunit or not and bigadv vs bigbeta that is 4 scripts already. Only worth it for remote operations on my iPhone, where CLI is a pita.

I'll PM you some changes to put into your start and backup.sh scripts to test out before I add them to the main script, if that is OK with you MIBW.

Sure, send them - who knows if I will get to run them; my wife is 6 days overdue. Tick tock buddy... :p

But most of my stuff might be overkill to add to your script - I wonder how many teammates will actually be sneaker-netting? In fact most of mine is going to be on the same rig, just between native Linux and VMs - in which case the current version is good. I just like the idea of portability.

Your restore is restoring from the second backup instead of the first, but that is fine (and probably the better "first choice" since it is newer, now that I think about it). Your additions to the tar czPf command are fine also - it never occurred to me that you would need those two files for sneaker-netting, but you are correct that they would be.

Yeah I went with the latest-is-greatest philosophy. I might add FAHlog.txt too, but the more I add the greater the odds of mucking up due to hitting a file-write.

Since I had never seen what actually happens when there is a conflict making the tarball, I hadn't really thought about verifying it. Now that I see it, that is a great idea, and may actually eliminate the need for a second backup - create the tarball, and if it errors, delete it and make another. I'll have to see what tar can do, but it sounds possible. Unix guys, please chime in here. I know next to nothing about the tar command.

I only have very basic scripting knowledge from using WinBatch, and it has error handling that would be workable to abort or go around. Surely Unix has that; it's just a question of how much bother. One hacky idea is to write out a temp logfile of whatever error messages hit the terminal, then grep it for "Resource temporarily unavailable" etc.
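That hacky idea is workable. Here is my own sketch of it (untested against a live client; the error strings to grep for are guesses, and checking tar's exit code directly is simpler):

```shell
#!/bin/sh
# Try a tar pass; report failure if tar exits nonzero or its stderr
# mentions a changed/locked file (the grep strings are assumptions).
backup_ok() {
    archive=$1; shift
    log=$(mktemp)
    tar czPf "$archive" "$@" 2>"$log"
    rc=$?
    if [ $rc -ne 0 ] || grep -Eiq "changed|unavailable" "$log"; then
        rm -f "$log"
        return 1
    fi
    rm -f "$log"
    return 0
}
```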

Running the backup script lots today to test my alterations, I think maybe 1 in 4 had a failure. It was an eye opener.

You know we will have this absolutely nailed just in time for the 64 bit Windows core to come out! :mad:
 
You could put an if statement after the tar command, inside a loop, to check the return code. The $? variable is set to the return code of the last executed command; for tar, 0 = success, 2 = error. Someone could clean up the code (I'm not a fan of infinite loops either), but here's an example:

for (( ; ; ))
do
    tar czPf /home/dave/fah/backup/backup-temp.tgz /home/dave/fah/work/* \
        /home/dave/fah/unitinfo.txt /home/dave/fah/queue.dat /home/dave/fah/client.cfg \
        /home/dave/fah/machinedependent.dat
    if [ $? -eq 0 ]
    then
        echo "Tar backup successful!"
        mv /home/dave/fah/backup/backup-temp.tgz /home/dave/fah/backup/backup-$rig-$date.tgz
        break
    fi
    echo "Tar backup failed! Trying again..."
    rm /home/dave/fah/backup/backup-temp.tgz
done
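Since infinite loops bother some of us, here is my own bounded-retry variant of the same idea, with the paths parameterized rather than hard-coded (a sketch, not part of the package):

```shell
#!/bin/sh
# Bounded-retry take on the loop above: give up after a set number of
# attempts instead of spinning forever.
backup_with_retry() {
    src=$1; dest=$2; tries=${3:-5}
    n=0
    while [ "$n" -lt "$tries" ]; do
        # write to a temp name first, as in the loop above, so a bad
        # pass never clobbers a good archive
        if tar czPf "$dest.tmp" "$src" 2>/dev/null; then
            mv "$dest.tmp" "$dest"
            return 0
        fi
        rm -f "$dest.tmp"
        n=$((n + 1))
        sleep 1
    done
    echo "backup failed after $tries attempts" >&2
    return 1
}
```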
 
I probably missed this somewhere in the discussion, but after using the ./stop script, should you always do a restore operation before starting again (./start)?
And what "indication" can you use to determine that a WU needs to use which backup file?
I notice that when I have tried ./stop and then run ./start, it does not necessarily restart at the same % point.


 
I probably missed this somewhere in the discussion, but after using the ./stop script, should you always do a restore operation before starting again (./start)?
No need to do this. The backup is a "just in case" measure. You should be able to just use the start script 99.9% of the time.

And what "indication" can you use to determine that a WU needs to use which backup file?
Basically, when you restart, if the client can't resume from the checkpoint, you need to restore. You have two backups and two scripts to restore them. Try one, then the other.

I notice that when I have tried ./stop and then run ./start, it does not necessarily restart at the same % point.
It restarts at the last saved checkpoint, which is anywhere from 0 to 15 minutes ago unless you changed your checkpoint interval in client.cfg.

Make sense?
 
Basically, when you restart, if the client can't resume from checkpoint, you need to restore......
It restarts at the last saved checkpoint, which is anywhere from 0 to 15 minutes ago unless you changed your checkpoint interval in client.cfg.

Got it. I just have never seen a fail-to-resume condition after a stop, but I have seen WUs that I stopped and resumed fail after running to 100% complete. Just wondered if it was an operator error. ;)

I suppose setting the interval smaller "could" make the restarts more "time-accurate", but would probably waste more time and impact performance. Plus, the more stable the rigs get the less reason I have to stop and restart.

 