Information black hole

From CompBio

After having a group for 10 years, we're starting to formulate a cogent information handling and disposition (data archival) policy. The general philosophy is always to use as little (or as much) space as needed, and to hold on to just as much (or as little) information as needed. There are different modes of doing this, from communal information, to information in individual user accounts, to information in group communications. Each may call for a different approach to information gathering, manipulation, storage, retrieval, and ultimate disposition. Here, the terms "information" and "data" are used somewhat, but not completely, interchangeably, with the former preferred for its more general and broader theoretical and quantitative connotations.

In the past, users responsible for their information were also responsible for backing it up. This precept is changing somewhat, in that some users in certain communal storage locations, or working on communal data, can expect their data to be backed up for them. But this is not a guarantee by any means, and users are encouraged to check in with the sysadm or, better safe than sorry, go ahead and make arrangements to have their data backed up.

The longer term view is to have a robust backup and archival system that not only copies the information with minimal or zero error, but also does so in a way that's safe from stupidity and/or malice.
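One way to check the "minimal or zero error" part of that goal is to compare checksums of the original and the copy. The following is only a minimal sketch under stated assumptions: both copies are reachable as ordinary directory trees, SHA-256 is an acceptable digest, and the function names (`file_digest`, `tree_digests`, `compare_copies`) are made up for illustration and are not part of any existing CompBio tooling.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each regular file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(original: Path, copy: Path) -> dict[str, list[str]]:
    """Report files that are missing, unexpected, or different in the copy."""
    a, b = tree_digests(original), tree_digests(copy)
    return {
        "missing": sorted(a.keys() - b.keys()),
        "extra": sorted(b.keys() - a.keys()),
        "different": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

An empty report (all three lists empty) means the two trees hold the same files with the same contents. It says nothing about permissions or timestamps, which this sketch does not compare.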


Group communications

Group communications that are archived occur via the agroup@compbio mailing list, which is a mail alias to samudrala-compbio at Google Groups. Communications exchanged via this email address are stored at Google and also in the mail spool file on the host for the user/alias agroup-local (which is where the mail sent to Google Groups gets sent back to). The mail spool file, which has of course grown quite large over time, is our local copy of the information held by Google, where it is searchable and analysable in every way Google permits and enables.
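The local spool copy can also be searched without going through Google. The sketch below assumes the spool file is in the traditional mbox format, which is common for Unix mail spools but is not confirmed here. The function name `search_spool` and the path passed to it are hypothetical. It uses Python's standard `mailbox` module.

```python
import mailbox


def search_spool(spool_path: str, term: str) -> list[tuple[str, str, str]]:
    """Return (date, sender, subject) for messages whose subject contains term."""
    term = term.lower()
    hits = []
    # create=False: fail loudly instead of creating an empty file by mistake.
    box = mailbox.mbox(spool_path, create=False)
    try:
        for msg in box:
            subject = str(msg.get("Subject", ""))
            if term in subject.lower():
                hits.append(
                    (str(msg.get("Date", "")), str(msg.get("From", "")), subject)
                )
    finally:
        box.close()
    return hits
```

This only matches on the subject line, case-insensitively. Searching message bodies would need extra handling for multipart messages and character encodings, which is left out here.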

Inactive and nonactive users

Once you join our group, you never leave, even if you "check out" (this is a joke referring to a famous Eagles song, but there's no Flavour Aid passed around here). So what to do with your data? Well, if there's any reason to hold on to it, then we archive it in group archives (, which currently maps to, and nas3 and nas4 are mirrors of each other).


A general strategy in everything we do is duplication or mirroring. While sophisticated schemes exist for data redundancy and for handling data failure, we prefer simple mirrors on different devices. The logic is that it's fairly common for even the most robust storage devices to fail, and even more common for them to become redundant. Thus having two

An additional backup of all information that users are responsible for needs to be done by the users themselves; users are responsible not only for the information but also for these off-site backups.

Longer term robustness and safety

In the long term, we need a system that securely and robustly copies all the information to a very distant location, so that even something like an earthquake that destroys the two local mirrors will not result in irreparable damage. Such a system, however, also needs to be inaccessible from the CompBio cluster for any destructive actions. That is, information manipulation, and particularly removal, needs to be "pulled" by an emasculated mirroring unit and not pushed from the unit that holds all the data and is capable of causing damage. This needs to be the case for both the local and distant mirroring devices.
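The pull idea can be sketched as a small script that would run on the mirroring unit itself. This is an illustration under assumptions, not the group's actual setup: it assumes the source tree is visible to the mirror (ideally mounted read-only), detects changes by file size and modification time only, and uses a hypothetical function name `pull_mirror`. The key property is that it never deletes anything on the mirror. Files that have vanished from the source are only reported, so a destructive action on the cluster cannot propagate by itself.

```python
import shutil
from pathlib import Path


def pull_mirror(source: Path, mirror: Path) -> dict[str, list[str]]:
    """Copy new or changed files from source into mirror; never delete."""
    copied, orphaned = [], []
    source_files = {
        p.relative_to(source).as_posix()
        for p in source.rglob("*")
        if p.is_file()
    }
    for rel in sorted(source_files):
        src, dst = source / rel, mirror / rel
        s = src.stat()
        if dst.exists():
            d = dst.stat()
            # Skip files that look unchanged (same size, mirror not older).
            if d.st_size == s.st_size and d.st_mtime_ns >= s.st_mtime_ns:
                continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
        copied.append(rel)
    # Files present on the mirror but gone from the source are reported only.
    for p in sorted(mirror.rglob("*")):
        if p.is_file():
            rel = p.relative_to(mirror).as_posix()
            if rel not in source_files:
                orphaned.append(rel)
    return {"copied": copied, "orphaned": orphaned}
```

What to do with the reported orphans (keep them forever, or remove them after a review period) is a policy decision made on the mirror side, which is the point of pulling rather than pushing.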
