Replication of ARC

What is rARC?

rARC is a distributed system that replicates the archive files kept in a repository (ARC files) across several storage nodes on the Internet.
The main idea is to let Internet users contribute storage space on their own computers, each replicating a relatively small part of the archived data. rARC is written in 100% pure Java.

The software is being used in the Arquivo da Web Portuguesa.

What are the main features of rARC?

The main features of rARC are:

Scalable: in a first stage, rARC must scale to thousands of storage nodes;
Secure: the web data kept by the storage nodes cannot be accessed by the users, and the system must be robust against malicious users. rARC must guarantee that a minimum number of replicas has been created for a given ARC file and that they are not corrupted;
Usable: Internet users must be able to join a replication initiative and provide storage space as easily as possible;
Configurable: rARC can be configured for use in independent web archiving initiatives.

How does rARC work?

rARC has a client-server architecture. A rARC server is installed in a web archive and configured to replicate its ARC files. Internet users install client applications on their computers to keep replicas. The client applications communicate with the rARC server to receive authentication credentials and then download ARC files from the server.

Each ARC file is encrypted and signed to ensure its confidentiality and integrity. Periodically, the client applications report the state of their ARC files to the server so that it can manage replication. If the central repository suffers an unrecoverable failure, a new rARC server instance can be installed on a new machine and start the recovery process, rebuilding the repository from the replicas held by the clients. Only a small amount of static data, needed to ensure the security of the system, must survive a failure of the central repository. There are two types of recovery process, which provide different levels of security against malicious users: sequential and consensus by majority.
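
As an illustration of the "encrypted and signed" step, here is a minimal sketch using only the standard Java cryptography APIs. The page does not say which algorithms rARC uses, so AES-GCM and SHA256withRSA are assumptions, and the class name ArcProtection and its helper methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.SecureRandom;
import java.security.Signature;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical helper; algorithm choices are assumptions, not rARC's documented ones.
class ArcProtection {
    static SecretKey newAesKey() {
        try {
            KeyGenerator gen = KeyGenerator.getInstance("AES");
            gen.init(128);
            return gen.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static KeyPair newRsaKeyPair() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // A fresh 12-byte IV must be used for every file encrypted under the same key.
    static byte[] newIv() {
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        return iv;
    }

    // mode is Cipher.ENCRYPT_MODE or Cipher.DECRYPT_MODE.
    static byte[] crypt(int mode, SecretKey key, byte[] iv, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(mode, key, new GCMParameterSpec(128, iv));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] sign(PrivateKey key, byte[] data) {
        try {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(key);
            signer.update(data);
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean verify(PublicKey key, byte[] data, byte[] signature) {
        try {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(key);
            verifier.update(data);
            return verifier.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] arc = "example ARC payload".getBytes(StandardCharsets.UTF_8);
        SecretKey aesKey = newAesKey();
        KeyPair serverKeys = newRsaKeyPair();
        byte[] iv = newIv();

        // Encrypt first, then sign the ciphertext that a client would store.
        byte[] encrypted = crypt(Cipher.ENCRYPT_MODE, aesKey, iv, arc);
        byte[] signature = sign(serverKeys.getPrivate(), encrypted);
        System.out.println("signature valid: " + verify(serverKeys.getPublic(), encrypted, signature));

        encrypted[0] ^= 1; // simulate a client tampering with its replica
        System.out.println("after tampering: " + verify(serverKeys.getPublic(), encrypted, signature));
    }
}
```

In this sketch the keys play the role of the "small amount of static data" that has to survive a repository failure: without them, recovered replicas could be neither decrypted nor verified.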

In the sequential recovery process, the server receives an ARC file from a client, decrypts it and verifies whether the checksum of the ARC file is consistent with its web data. If the ARC file passes this verification, the server accepts it and closes all other transfers for that file. Otherwise, the file is discarded and the server tries to recover a replica from another client. This approach is vulnerable to malicious users who may change an ARC file and forge its checksum by breaking the cipher key.
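
The sequential process described above can be sketched roughly as follows. This is not rARC's actual code: the Upload type, the method names and the use of SHA-256 as the checksum are all assumptions made for illustration, and the uploads are taken to be already decrypted.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class SequentialRecovery {
    /** Hypothetical holder for one uploaded replica, already decrypted. */
    static final class Upload {
        final String clientId;
        final byte[] data;
        final String storedChecksum;

        Upload(String clientId, byte[] data, String storedChecksum) {
            this.clientId = clientId;
            this.data = data;
            this.storedChecksum = storedChecksum;
        }
    }

    /** Hex-encoded SHA-256 digest (assumed checksum algorithm). */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the id of the first client whose replica passes the checksum test. */
    static Optional<String> recover(List<Upload> uploadsInArrivalOrder) {
        for (Upload upload : uploadsInArrivalOrder) {
            if (sha256Hex(upload.data).equals(upload.storedChecksum)) {
                // Accept this replica; a real server would now close the other transfers.
                return Optional.of(upload.clientId);
            }
            // Checksum mismatch: discard and try the next client.
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] good = "web data".getBytes(StandardCharsets.UTF_8);
        byte[] bad = "tampered".getBytes(StandardCharsets.UTF_8);
        String checksum = sha256Hex(good);
        List<Upload> uploads = Arrays.asList(
                new Upload("client-1", bad, checksum),
                new Upload("client-2", good, checksum));
        System.out.println(recover(uploads)); // prints Optional[client-2]
    }
}
```

Note that the check only compares the file against the checksum that travels with it, which is why a client able to forge both would go undetected.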

In the consensus by majority recovery process, an uploaded ARC file is accepted by the server only after a majority of clients present the same checksum for that ARC file. The consensus recovery process is more secure because a majority of the replicas must be compromised before a corrupted file can be recovered. However, this approach slows recovery, since the server has to wait until a majority of clients have uploaded a given ARC file, which may never happen.
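
A minimal sketch of the majority rule, under the assumption that "majority" means more than half of the clients known to hold a replica of the file and that each client reports at most once. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

class MajorityConsensus {
    /**
     * Returns the checksum reported by a strict majority of the replica holders,
     * or empty if no checksum has reached a majority yet.
     *
     * @param checksumByClient checksums reported so far, keyed by client id
     * @param replicaHolders   number of clients holding a replica of the file (assumed known)
     */
    static Optional<String> majorityChecksum(Map<String, String> checksumByClient, int replicaHolders) {
        Map<String, Integer> votes = new HashMap<>();
        for (String checksum : checksumByClient.values()) {
            int count = votes.merge(checksum, 1, Integer::sum);
            if (count > replicaHolders / 2) {
                return Optional.of(checksum);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> reports = new LinkedHashMap<>();
        reports.put("client-1", "aa");
        reports.put("client-2", "bb"); // a corrupted or malicious replica
        reports.put("client-3", "aa");
        System.out.println(majorityChecksum(reports, 5)); // prints Optional.empty

        reports.put("client-4", "aa");
        System.out.println(majorityChecksum(reports, 5)); // prints Optional[aa]
    }
}
```

The first call illustrates the drawback mentioned above: with only two of five holders agreeing, the server keeps waiting, and it would wait indefinitely if the remaining clients never upload.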