The time span for which we have to keep these files around is forever, not just the 30-60 years for which tax-related data needs to be available. The most common immediate response when I talk about this is:
I just save it on [Tape, Paper, etc.]. This is so much safer than disks.
While this may be true, people forget or are not aware that the data must not only be stored but must also remain accessible at the same time. People want to go into an archive (either physical or online) and dig through the data.
Suddenly [Tape, Paper, etc.] is not an option anymore because of access time.
Another problem is that you have to be absolutely sure that your data is still intact and not corrupted by silent bit flips. This means not only storing it but visiting [Tape, Paper, etc.] from time to time to make sure that the medium still works.
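A periodic integrity check like this can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the helper names `sha256_of` and `scrub`, and the idea of a manifest that maps file names to SHA-256 digests recorded at ingest time, are hypothetical and not taken from the archive software itself.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scrub(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the names of files that are missing or no longer match
    the digest recorded in the (hypothetical) manifest."""
    damaged = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or sha256_of(path) != expected:
            damaged.append(name)
    return damaged
```

Running `scrub` on a schedule turns a silent bit flip into a reported file name, which is the point at which a repair from another copy can start.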
Traditionally this has been solved by huge RAID systems, which are unfortunately very expensive and in most cases push the responsibility onto a single machine. If the machine is down, the data is either not accessible or might become corrupt.
My plan to tackle these tasks is to provide an architecture that uses many copies of a file to cross-check itself, runs on commodity hardware, and stores the data on consumer-grade disks. Actually, lots of hardware. The more the better.
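To make the cross-checking idea concrete, here is a rough sketch that compares the copies of one file by digest and lets the majority decide which version is correct. It is an assumption about how such a check could look, not a description of the real implementation; the function name `cross_check` and the node labels are invented for the example.

```python
import hashlib
from collections import Counter
from typing import Optional


def cross_check(copies: dict[str, bytes]) -> tuple[Optional[str], list[str]]:
    """Compare all copies of one file.

    `copies` maps a node name to the bytes that node holds.
    Returns (majority_digest, disagreeing_nodes). If no digest is held
    by a strict majority, the digest is None and every node is suspect.
    """
    digests = {node: hashlib.sha256(data).hexdigest()
               for node, data in copies.items()}
    if not digests:
        return None, []
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(digests):
        return None, sorted(digests)
    return winner, sorted(n for n, d in digests.items() if d != winner)
```

With three copies, one corrupted copy is outvoted by the other two. With only two copies a mismatch can be detected but not resolved, which is one reason why more copies are better.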
By moving the responsibility for data availability from the RAID controllers and a single machine onto a network of machines, we get a system that is very robust against individual hardware errors. If one machine goes down, the rest of the system recognizes this and heals itself by making sure that a predefined number of copies of the data is always in the system.
I have spent a lot of time coding over the last few months, and this weekend marks another major milestone for me.
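The healing step can be pictured as a small planning function: given which nodes hold which file and which nodes are currently alive, work out where new copies have to go so that the predefined copy count is met again. The sketch below is purely illustrative; `plan_repairs`, the `placement` map and the `target_copies` parameter are hypothetical names, and a real system would also have to consider free disk space and load.

```python
def plan_repairs(placement: dict[str, set[str]],
                 alive: set[str],
                 target_copies: int) -> dict[str, list[str]]:
    """For each under-replicated file, choose nodes that should get a new copy.

    `placement` maps a file name to the set of nodes holding a copy.
    Files with no surviving copy are skipped because there is nothing
    left to replicate from.
    """
    repairs = {}
    for name, holders in placement.items():
        live_holders = holders & alive
        missing = target_copies - len(live_holders)
        if missing <= 0 or not live_holders:
            continue
        # Sorted only to make the choice deterministic in this sketch.
        candidates = sorted(alive - live_holders)
        if candidates:
            repairs[name] = candidates[:missing]
    return repairs
```

For example, with three copies required and node `n3` down, a file stored on `n1`, `n2` and `n3` would be scheduled for one new copy on another live node.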
I am eating my own dog food and running an archive at home.
Hardware-wise it consists of:
4x 2TB external hard disks with a USB 2.0 Port
4x SheevaPlug computers (1.2GHz ARM CPU, 512MB RAM, 512MB internal flash disk, 1 GBit Ethernet, 1x USB 2.0 port)
I had to optimize the software specifically to run in such a "low"-powered environment, but I wanted to keep the hardware entry barrier as low as possible. The SheevaPlugs and the disks cost around 70€ per item.
Currently the archive is running with 400,000 small files (4k). For now it is accessed through a REST API.
Here is a picture of my current setup.










