After having a group for 10 years, we're starting to formulate a cogent information handling and disposition (data archival) policy. The general philosophy is always to use as little (or as much) space as needed, and to hold on to just as much (or as little) information. There are different modes of doing this, from communal information, to information in individual user accounts, to information in group communications. Each may have different approaches to information gathering, manipulation, storage, retrieval, and ultimate disposition. Here, the terms "information" and "data" are used somewhat, but not completely, interchangeably, with the former preferred for its more general and broader theoretical and quantitative connotations.
In the past, users responsible for their information were also responsible for backing it up. This precept is changing somewhat, in that some users in certain communal storage locations, or working on communal data, can expect their data to be backed up for them. But this is not a guarantee by any means, and users are encouraged to check in with the sysadmin or, better safe than sorry, to go ahead and make arrangements to have their data backed up.
The longer term view is to have a robust backup and archival system that not only copies the information with minimal or zero error, but also does so in a way that's safe from stupidity and/or malice.
Group communications that are archived occur via the agroup@compbio mailing list, which is a mail alias to samudrala-compbio at Google Groups. Communications exchanged via this email address are stored at Google and also in the mail spool file on the host representing compbio.washington.edu for the user/alias agroup-local (which is where the mail sent to Google Groups gets sent back to). The mail spool file, which has of course grown quite large over time, is our local copy of the information that Google has, where it is searchable and analysable in every way Google permits and enables.
Active users will be given a choice. They may have their information on a reasonably fast communal disk storage system accessible via cosmos.compbio.washington.edu (das1 and das2.compbio.washington.edu will be the storage units), or on individual disks on the various machines in the playground cluster, which is part of CompBio.
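The local spool copy can also be searched directly. The following is only a minimal sketch, assuming the spool is a standard mbox file (the usual format for a Unix mail spool, but not confirmed here); the function name find_messages is hypothetical, and the spool path is left for the caller to supply.

```python
import mailbox


def find_messages(spool_path, needle):
    """Return (date, subject) pairs for messages whose subject contains needle.

    Assumes spool_path is an mbox-format file; the match is case-insensitive.
    """
    matches = []
    # create=False: fail loudly rather than create an empty spool by accident.
    box = mailbox.mbox(spool_path, create=False)
    try:
        for message in box:
            subject = str(message.get("Subject", ""))
            if needle.lower() in subject.lower():
                matches.append((str(message.get("Date", "")), subject))
    finally:
        box.close()
    return matches
```

The function only reads the spool; it never writes to it, so it should be safe to run against the live file or, more cautiously, against a copy.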
Inactive and nonactive users
Once you join our group, you never leave, even if you "check out" (this is a joke referring to a famous Eagles song, but there's no Flavour Aid passed around here). So what to do with your data? Well, if there's any reason to hold on to it, then we archive it in the group archives (archives.compbio.washington.edu, which currently maps to nas3; nas3 and nas4 are mirrors of each other). At the present moment, the goal is to start with the most relevant usernames whose information we wish archived and work on their data first, and then follow up systematically with all the users in the system.
A general strategy in everything we do is duplication or mirroring. While sophisticated schemes exist for data redundancy and handling data failure, we prefer simple mirrors on different devices. The logic is that it's fairly common for even the most robust storage devices to fail, and even more common for them to become obsolete. Thus we keep two devices, where only the "first" of the two is considered "definitive" at any given point (this is simply so that pieces of software like the automounter aren't confused about which machine to use).
Any additional backup of information that users are responsible for needs to be done by the users themselves; users are responsible not only for the information but also for these off-site backups.
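One way to check that a simple mirror really matches the definitive copy is to compare checksums of every file on both devices. The sketch below is an illustration only, not a description of the tooling actually in use; the function names are hypothetical, and it assumes both devices are visible as ordinary directory trees on the machine running the check.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each regular file's path (relative to root) to its digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full):  # skips broken symlinks
                digests[os.path.relpath(full, root)] = file_digest(full)
    return digests


def compare_mirrors(definitive_root, mirror_root):
    """Return (missing, extra, different) lists of relative paths.

    missing:   on the definitive copy but not on the mirror
    extra:     on the mirror but not on the definitive copy
    different: present on both, but with different contents
    """
    first = tree_digests(definitive_root)
    second = tree_digests(mirror_root)
    missing = sorted(set(first) - set(second))
    extra = sorted(set(second) - set(first))
    different = sorted(
        path for path in set(first) & set(second) if first[path] != second[path]
    )
    return missing, extra, different
```

Three empty lists mean the two trees hold the same files with the same contents. The check reads every byte on both sides, so on large archives it would be slow and is better suited to an occasional audit than to routine use.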
Longer term robustness and safety
In the long term, we want a system that securely and robustly copies all the information to a very distant location, so that even something like an earthquake that destroys the two local mirrors will not result in irreparable loss. Such a system, however, also needs to be inaccessible from the CompBio cluster for any destructive actions. That is, information manipulation, and particularly removal, needs to be "pulled" from a deliberately limited mirroring unit and not pushed from the unit with all the data, which is capable of causing damage. This needs to be the case for both the local and distant mirroring devices.
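The "pull, never push" idea can be sketched as a small routine that runs on the mirroring unit itself. This is a hedged illustration under stated assumptions, not the group's actual mechanism: it assumes the mirror can read the source tree (for example over a read-only mount), and the names pull_additive and _keep_previous are hypothetical. The routine only ever adds to the mirror; it has no code path that deletes anything.

```python
import filecmp
import os
import shutil


def pull_additive(source_root, mirror_root):
    """Copy new and changed files from source_root into mirror_root.

    Nothing is ever deleted from the mirror: files that vanished from the
    source stay put, and a changed file's previous copy is renamed to a
    numbered ".prev" name before the new version is copied in.
    Returns the sorted relative paths that were copied.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        rel_dir = os.path.relpath(dirpath, source_root)
        target_dir = os.path.normpath(os.path.join(mirror_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            source = os.path.join(dirpath, name)
            target = os.path.join(target_dir, name)
            if os.path.exists(target):
                if filecmp.cmp(source, target, shallow=False):
                    continue  # identical content, nothing to pull
                _keep_previous(target)
            shutil.copy2(source, target)
            copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return sorted(copied)


def _keep_previous(target):
    """Rename target to the first unused 'target.prev.N' name."""
    counter = 1
    while os.path.exists(f"{target}.prev.{counter}"):
        counter += 1
    os.rename(target, f"{target}.prev.{counter}")
```

Because the routine runs on the mirror and the source side holds no credentials for it, a mistake or a malicious command on the cluster can at worst cause extra versions to be pulled; it cannot remove what the mirror already holds. Any pruning of old versions would then be a separate, deliberate step carried out on the mirroring unit.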