The software is being used in the Arquivo da Web Portuguesa.
Scalable– in a first stage, rARC must scale to thousands of storage nodes;
Secure– the web data kept by the storage nodes must not be accessible to the users, and the system must be robust against malicious users. rARC must guarantee that a minimum number of replicas has been created for a given ARC file and that they are not corrupted;
Usable– Internet users must be able to join a replication initiative and provide storage space as easily as possible;
Configurable– to enable its usage in independent web archiving initiatives.
rARC has a client-server architecture. A rARC server is installed in a web archive and configured to replicate its ARC files. Internet users install client applications on their computers to keep replicas. The client applications communicate with the rARC server to receive authentication credentials and then download ARC files from the server.
Each ARC file is encrypted and signed to ensure its confidentiality and integrity. Periodically, the client applications report the state of their ARC files to the server, so that it can manage replication. If there is an irrecoverable failure of the central repository, a new instance of a rARC server can be installed on a new machine and start the recovery process to rebuild the repository from the replicas held by the clients. Only a small amount of static data, needed to ensure the security of the system, must survive a failure of the central repository. There are two types of recovery processes, which provide different security levels against malicious users: sequential and consensus by majority.
In the sequential recovery process, the server receives an ARC file from a client, decrypts it and verifies whether the checksum of the ARC file is consistent with its web data. If the ARC file passes this verification, the server accepts it and closes all the other transfers for that file. Otherwise, the file is discarded and the server tries to recover a replica from another client. This approach is vulnerable to malicious users who may change an ARC file and forge its checksum by breaking the cipher key.
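The sequential process can be illustrated with a minimal Python sketch. It is not rARC's actual code: the SHA-256 checksum, the layout that places the checksum before the web data, and the names pack, recover_sequential and decrypt are all assumptions made for the example. Decryption is passed in as a function because the cipher used by rARC is not specified here.

```python
import hashlib
from typing import Callable, Iterable, Optional

DIGEST_LEN = 32  # size of a SHA-256 digest; the checksum algorithm is an assumption


def pack(data: bytes) -> bytes:
    """Prefix the web data with its checksum (hypothetical plaintext layout)."""
    return hashlib.sha256(data).digest() + data


def recover_sequential(replicas: Iterable[bytes],
                       decrypt: Callable[[bytes], bytes]) -> Optional[bytes]:
    """Return the web data of the first replica that passes verification.

    Replicas are tried one after another, as clients upload them.
    A replica is discarded when it cannot be decrypted or when its
    checksum does not match its web data.
    """
    for blob in replicas:
        try:
            plain = decrypt(blob)
        except ValueError:
            continue  # undecryptable replica: try another client
        checksum, data = plain[:DIGEST_LEN], plain[DIGEST_LEN:]
        if len(checksum) == DIGEST_LEN and hashlib.sha256(data).digest() == checksum:
            # Accepted: a real server would now close the other transfers.
            return data
    return None  # no valid replica was found among the clients
```

Because a single valid replica is enough, the process finishes as soon as one client delivers an intact file, but it trusts the cipher key to make the checksum unforgeable.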
In the consensus by majority recovery process, an uploaded ARC file is accepted by the server only after a majority of clients present the same checksum for that ARC file. The consensus recovery process is more secure because a majority of replicas must be compromised to allow the recovery of a corrupted file. However, this approach slows the recovery process, since the server has to wait until a majority of clients upload a given ARC file, which may never occur.
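The majority rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than rARC's implementation: the function name majority_checksum, the mapping from client identifiers to reported checksums, and the replica_holders parameter (the number of clients known to hold the file) are hypothetical.

```python
from collections import Counter
from typing import Mapping, Optional


def majority_checksum(reported: Mapping[str, str],
                      replica_holders: int) -> Optional[str]:
    """Return the checksum agreed on by a strict majority, or None.

    reported maps a client identifier to the checksum of the replica
    that client uploaded; replica_holders is the number of clients
    known to hold a replica of the file (both are assumptions).
    """
    if not reported:
        return None
    checksum, votes = Counter(reported.values()).most_common(1)[0]
    # More than half of all holders must agree, not just of those
    # that have uploaded so far; otherwise the server keeps waiting.
    return checksum if votes > replica_holders // 2 else None
```

For example, with three replica holders the file is accepted once two of them report the same checksum; a single upload, or two conflicting uploads, leaves the server waiting.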