File synchronization across systems (Your personal "Dropbox")

Submitted by Marius Kjeldahl on Wed, 2010-07-21 17:46

After upgrading my development system and servers to newer linux releases (Ubuntu 10.04 in my case), it's that time again where I need to look at how I synchronize work across my machines. My work pattern is pretty standard; in my office I have a desktop computer with a multiscreen setup. In addition I have a laptop that I carry with me everywhere else. I also have a server which hosts my personal web server, email and other apps I'm working on that are server hosted.

I used to have a Windows laptop as well, but less and less work (from clients, and involving my own projects) touches Windows, so the requirement to synchronize with Windows is less of an issue than it used to be. In fact, Windows works so well virtualized under linux now that it can be treated much like other "embedded" type development systems, minimizing the need to actually work from within the Windows environment. When needed, I develop outside of Windows, where the Windows machine is set up to access files directly from the host.

Mac OSX is similar enough to other unix/linux systems that most solutions should work pretty well on OSX, and I will not treat it separately in my notes below.

The solutions I will comment on below are:

  • Dropbox
  • rsync
  • Unison
  • Persy
  • git-sync.sh
  • Jake
  • git (updated 23:03)

At the end of this article I have also written a short summary of what I ended up choosing.

Dropbox

Unless you have lots of content or need to manage your own privacy/security issues, http://dropbox.com/ should solve most, if not all, of your problems. They have client support for Windows, OSX and Linux, and possibly more. Everything is sync'ed to Dropbox's own servers. The only thing you need to do is sign up and install the software, and you can sync a few tens of gigabytes easily.

If you need more storage and want to depend on their storage backend, their pricing seems pretty reasonable. Unfortunately, it still may not be what you need for synchronizing your family albums (photos, videos), music collection and movies. And considering the price of hard disks these days, any solution you manage to get working on your own will be a lot cheaper, and probably faster (assuming many of your machines are running on your local networks).

rsync

rsync is the solution I have been using up to now. rsync is a pretty standard utility in most unix environments, and I believe it is pretty well supported on Windows as well. It's a really good one-way synchronization utility. Up to now I've been running the following script to keep my machines synchronized:

#!/bin/bash
rsync -avzPu /home/marius/sync/ myserver.com:sync/
rsync -avzPu myserver.com:sync/ /home/marius/sync/

This script first synchronizes all my local files to my server, pushing everything that I've modified locally to the server. Then it pulls changes from the server back to my local machine (the stuff that has been modified and pushed from other machines). The -u flag makes rsync skip any file that is newer on the receiving side, so the newest copy wins in both directions.

This setup works if you understand what you are doing, but it has a few weaknesses. The most visible one is that files are never deleted. If you delete a file locally, the push does not remove it from the server, so the next pull brings it right back. Unless you take care to remove the file from all machines participating in this synchronization, you will not be able to remove files. Secondly, this setup only keeps the latest versions of your files. It has no history / revisions. If you accidentally overwrite a file, and only notice later, after you've synched all your machines, that file is gone and you need to pull it from your backup. If you have a backup, that is. If you sync data across multiple machines at multiple locations, you kind of have a backup anyway and chances are you did not bother with doing backups in addition to synchronization.
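The "no history" weakness can be softened a little with rsync's --backup and --backup-dir options, which move a file that is about to be overwritten into a separate directory on the receiving side. The following is only a sketch under assumptions, not the script used above: the function name sync_keep_old and the backup location are made up for illustration, and the backup directory is assumed to be an absolute path that is writable on both machines.

```shell
#!/bin/bash
# Sketch: the same two-way sync, but every file version that rsync is about
# to overwrite is first moved into a dated backup directory on the side
# that receives the transfer.
#   $1  local directory             (e.g. /home/marius/sync)
#   $2  remote directory            (e.g. myserver.com:sync)
#   $3  backup root, absolute path  (hypothetical, e.g. /home/marius/sync-backup)
sync_keep_old() {
    local local_dir="$1" remote_dir="$2" backup_root="$3"
    local stamp
    stamp="$(date +%Y-%m-%d)"

    # Push: old server versions end up in $backup_root/$stamp on the server.
    rsync -avzPu --backup --backup-dir="$backup_root/$stamp" \
        "$local_dir/" "$remote_dir/" &&
    # Pull: old local versions end up in $backup_root/$stamp locally.
    rsync -avzPu --backup --backup-dir="$backup_root/$stamp" \
        "$remote_dir/" "$local_dir/"
}
```

It would be called as `sync_keep_old /home/marius/sync myserver.com:sync /home/marius/sync-backup`. This keeps one old copy per file per day and nothing more, and it does nothing about the deletion problem: a file removed on one machine still comes back on the next pull.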

Unison

If you search/google the net for doing file synchronization across machines, with Windows support, chances are you will learn about Unison, at least once you read past all the Dropbox references. I have successfully used Unison myself, but moved away from it after I wasted a lot of time on it. Unison, like many similar solutions, seems kind of "half-baked". It has some serious timeout issues that may hit you if you try to synchronize anything big. I used to have an mp3 collection that I wanted to keep synchronized across machines, and Unison kept stopping with weird error messages. The homepage had some information about it, saying it was related to some timeout issues and the solution was to keep initiating synchronization and it should work eventually. Well, in my case it took a few days because Unison kept stopping because of this issue. The sad thing is that if this issue hits you, it will be on your first synchronization, making it very hard to trust the software from there on. If they can't fix that "feature", what else isn't working?

The good part about Unison is that it actually does support Windows, so if you can live with the timeout issue, or you are lucky enough not to get hit by it, it may fit the bill for you. Having been hit by the timeout issues, I decided to go with the rsync option mentioned above instead.

Persy

I just recently discovered Persy (http://kinkerl.github.com/persy/), and decided to take a look at it. It uses git at the lower levels, which has its own pros and cons like all similar systems (I'll talk about git further down). Setup is pretty easy, and it runs via ssh (similar to rsync and git). Despite early promise, I could not get it to work. I set it up both using the command-line and via the gui, but never got past the following error:

$ persy --initremote
initialising and adding remote repository...
error creating dir, maybe it exists already?
remoteAdd: 128

It seems the comment about "half-baked" is appropriate again. The author highlights the following paragraph in his docs:

Warning: the synced directories should be empty before the sync. i had some problems with already existing files. you can start a sync and then add new files to the synced directory.

I'm pretty certain that is similar to the bug that hit me. Unfortunately no amount of removing/emptying/recreating directories at both the client and server end got me past that stage, so I never managed to actually test it.

git-sync.sh

This is a script that automates synchronization across directories, using git. Unfortunately, it seems to support synchronization only across directories on the same system. The use-case described by the author is about synchronizing data across USB pluggable storage devices. I have not looked into why it has this limitation (considering it is git based, it really should not care), but suffice to say I did test to see if it would be willing to sync across machines and it did not. It may be easy to modify it to work across machines, but I haven't tried.

Jake

I haven't tested Jake myself. From reading about it, it uses XMPP and doesn't support file history. I'm not sure how it uses XMPP, but it got me worried that it might require certain network ports open to work (outside of the "standard" ones). It also seems to be a client-only application written in Java, which unfortunately for me usually means lots of trouble.

git

I'll write about git specifically here, but I believe most of the items discussed are relevant for other SCM systems as well.

I use git a lot. I have been using it for two years now for all my own projects involving computer source code or general text content that I produce (which isn't being stuffed into a CMS or similar). It works great. So why can't git be used for general file synchronization as well? It can. As I've already written, it is being used by at least a few of the solutions I have mentioned. There is some stuff that is needed on top of git to simplify matters.

One issue is commits. When committing source code changes, one usually supplies a commit message, describing in keywords what has been done. For simple file synchronization, such messages would typically be auto-generated.

Another issue is when to commit and push. It's certainly possible to listen for changed files and commit (and push) on each change. Another simpler solution may be to synchronize at regular intervals (every 5 minutes), and also offer a "syncnow" script that forces synchronization when needed.
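The two issues above (auto-generated commit messages and a "syncnow" command) could be sketched roughly as follows. This is a minimal sketch under assumptions, not a finished tool: the function name sync_now, the "autosync" message format and the paths in the comments are made up for illustration, and the current branch is assumed to track a branch on an already configured remote.

```shell
#!/bin/bash
# Sketch: commit whatever changed in a git working directory with a
# generated message, then exchange commits with the remote (if any).
#   $1  path to the git working directory (e.g. /home/marius/sync)
sync_now() {
    local dir="$1"

    # Only commit when something was added, modified or deleted.
    if [ -n "$(git -C "$dir" status --porcelain)" ]; then
        git -C "$dir" add -A || return 1
        git -C "$dir" commit -q \
            -m "autosync $(uname -n) $(date '+%Y-%m-%d %H:%M:%S')" || return 1
    fi

    # Skip the network step for repositories without a remote.
    # Assumes the current branch has an upstream configured.
    if [ -n "$(git -C "$dir" remote)" ]; then
        git -C "$dir" pull --rebase -q && git -C "$dir" push -q
    fi
}

# For the "every 5 minutes" variant, a crontab entry could call a small
# wrapper script around sync_now (hypothetical path):
#   */5 * * * * /home/marius/bin/syncnow /home/marius/sync
```

Running `sync_now /home/marius/sync` by hand is then the "syncnow" case. If two machines change the same file between runs, the pull can stop with a conflict that has to be resolved manually, just as on a normal git project.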

A final issue is revision history and keeping all files (even deleted ones) in the repository. If you use git for managing your music collection, and you want to remove music, those files will still stay in your repository. I believe git has functionality to "filter" out such files, and/or to use a certain revision (from one month ago, for instance) as its base, meaning anything deleted from the repository before that will actually no longer exist in the repository.

While I believe a nice git-based solution should exist, I haven't found it yet. I do believe however that such a solution should be fairly simple to build, at least something that is fairly simple for people already comfortable using git.

Summary

Based on this, I will probably move from my rsync based solution and create a new customized solution based on git. It should support my needs for a good while, and the necessary features shouldn't be too hard to implement when needed.

Update 23:03:

I've implemented the git solution with a normal remote repository, and use it just like you would on a programming project. Assuming you keep the "big blobs" somewhere else, a standard git solution will probably work just great. At least if you are used to working with git. When/if I get the time (unless somebody else does it before me), I would like to implement some easy way of autosyncing (on file change, or based on time intervals) without having to do manual commits. Another feature which would make it more usable as a "big blob" store as well would be filtering out old changes; for instance getting rid of all files which were deleted more than X months ago. AFAIK, this should be fairly easy to do with the existing git feature set (+some bash/perl text handling).
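As a first step towards that kind of filtering, git can at least report which files were deleted before a given date. The sketch below is an assumption about how this could look, not something that has been run against a real "big blob" repository; the function name deleted_before is made up, and file names containing unusual characters (which git may print quoted) are not handled.

```shell
#!/bin/bash
# Sketch: list files that were deleted in commits older than a cutoff date
# and that do not exist in the current HEAD (i.e. were not re-added later).
#   $1  path to the git working directory
#   $2  cutoff, in any form git understands (e.g. "3 months ago")
deleted_before() {
    local dir="$1" cutoff="$2"

    # --diff-filter=D limits the output to deletions, --name-only prints
    # just the paths, and the empty --format= suppresses the commit headers.
    git -C "$dir" log --diff-filter=D --name-only --format= \
            --before="$cutoff" |
        sed '/^$/d' | sort -u |
        while IFS= read -r path; do
            # Print the path only if it is absent from the current HEAD.
            git -C "$dir" cat-file -e "HEAD:$path" 2>/dev/null ||
                printf '%s\n' "$path"
        done
}
```

`deleted_before /home/marius/sync "3 months ago"` only prints candidates. Actually dropping them from the repository means rewriting history (git filter-branch is the git command for that), which changes commit ids and forces every other machine to re-clone or reset, so it is not something to automate lightly.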

Unless you're willing to get your hands dirty, your best bet is probably Dropbox. If you have to implement something yourself and you do not want to use a typical SCM solution (like git), Unison is probably a good choice (at least until you try to sync a lot of stuff, which may lead to the timeout errors I wrote about above).
