Fink:Distfiles
Fink's strategy for bringing open source projects to OS X has been to construct a system which automatically compiles and installs open source software, using source files which fink fetches from their original locations.
One problem that emerged early on was that files sometimes disappear from their original location. To work around this, the Distfiles portion of the project was started. Distfiles was originally shared between Fink and MacPorts (formerly DarwinPorts) but is now solely a Fink operation; it is currently hosted at opendarwin.org, with mirror sites located elsewhere.
A few times a day, the distfiles server downloads updated copies of the package databases of fink and darwinports, examines them for new source files, and attempts to download those source files. (The small handful of fink packages whose source files cannot be redistributed freely are excluded from these downloads.) Integrity checking is done with MD5 checksums.
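The MD5 integrity check mentioned above can be sketched roughly as follows. This is only an illustration and not the code the distfiles server actually runs; the function names md5_of_file and checksum_matches are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`.

    The file is read in chunks so that large source tarballs do not
    have to fit in memory. (Hypothetical helper, for illustration.)
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare a downloaded file against the checksum expected for it.

    `expected_md5` is assumed to be a hex string; surrounding
    whitespace and letter case are ignored.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

A download whose checksum does not match would presumably be discarded rather than mirrored, although the text does not spell out what the server does in that case.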
The fink software, in turn, gives users the option of searching this "distfiles" location either before or after searching the original location. Most users elect to search it first, which makes the "distfiles" location the default download location for most fink users.
The distfiles repository is laid out as /<hash-algorithm>/<hash>/<filename>. For example: /md5/98800eaa4803fa5531eb74ad04f6c429/Font-TTF-0.35.tar.gz. Files should exist, through symlinks, in the directories of all supported hashing functions, although md5 is currently the only supported hashing function. For compatibility, the latest version of a filename (the one encountered most recently by the mirroring script) also appears in /, so Font-TTF-0.35.tar.gz would exist in its hashed location as well as at /Font-TTF-0.35.tar.gz, assuming it is the latest version of that filename.
The hash and hash algorithm were put in the file path to deal with multiple files that have the same filename but different hashes. The layout also keeps track of multiple files with the same hash but different filenames. Note that files with the same hash are stored multiple times. This avoids problems in cases where hashes or names collide but the content differs.
Distfile information is also stored in a MySQL database on the file server. This is both a performance optimization and a management convenience: all previous files and their hashes can be retrieved easily, without recomputing each hash from the filesystem on every run of the script.
Fink's input to the distfiles mirroring script comes through fink's fetchall -i --dry-run interface, and the script depends on that output format not changing.
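As a rough illustration of the /<hash-algorithm>/<hash>/<filename> layout described above, the two locations of a file could be computed like this. The function names hashed_path and compat_path and the SUPPORTED_ALGORITHMS tuple are hypothetical and are not taken from the mirroring script.

```python
import posixpath

# Per the text, md5 is currently the only supported hashing function.
SUPPORTED_ALGORITHMS = ("md5",)


def hashed_path(algorithm, digest, filename):
    """Build the /<hash-algorithm>/<hash>/<filename> location of a file."""
    if algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError("unsupported hash algorithm: %s" % algorithm)
    return posixpath.join("/", algorithm, digest.lower(), filename)


def compat_path(filename):
    """Build the top-level compatibility location of a file.

    Only the most recently mirrored version of a filename is expected
    to appear here.
    """
    return posixpath.join("/", filename)


if __name__ == "__main__":
    name = "Font-TTF-0.35.tar.gz"
    md5 = "98800eaa4803fa5531eb74ad04f6c429"
    print(hashed_path("md5", md5, name))
    print(compat_path(name))
```

Two files that share a name but differ in content get different hashed paths, which is the situation the layout was designed for.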
The distfiles mirroring script is currently in CVS under scripts/distfiles/newmirror. The script is run periodically on the master distfiles machine. The script iterates through various config files, one for each supported Tree, and one for each branch (stable/unstable). Information about downloaded distfiles is stored in a MySQL database. This avoids having to re-checksum files on disk in order to tell whether the checksum in the .info file has changed from what was last downloaded.
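The idea of consulting a database instead of re-checksumming files on disk can be sketched as below. This is an assumption-laden illustration: it uses Python's built-in sqlite3 module in place of MySQL so that it runs standalone, and the table name, column names and function names are all hypothetical; the real schema is not described here.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open a record of already-mirrored files (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS distfiles ("
        " filename TEXT NOT NULL,"
        " md5 TEXT NOT NULL,"
        " PRIMARY KEY (filename, md5))"
    )
    return conn


def needs_download(conn, filename, info_md5):
    """Return True if this filename/checksum pair has not been mirrored.

    `info_md5` stands for the checksum listed in the package's .info
    file. No file on disk is read; only the database is consulted.
    """
    row = conn.execute(
        "SELECT 1 FROM distfiles WHERE filename = ? AND md5 = ?",
        (filename, info_md5.lower()),
    ).fetchone()
    return row is None


def record_download(conn, filename, md5):
    """Remember that a file with this checksum has been mirrored."""
    conn.execute(
        "INSERT OR IGNORE INTO distfiles (filename, md5) VALUES (?, ?)",
        (filename, md5.lower()),
    )
    conn.commit()
```

Keying the record on both filename and checksum matches the repository layout, where the same filename may legitimately appear with several different hashes.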
To Do: There should be a periodic examination of the entire fink package database, to see if the download locations are still accurate. In cases in which the file is no longer available at the original location, the package maintainer should be notified. In situations where the upstream source is no longer available at all, the distfiles site can be used as the primary site.