# bb: basic backup

Incremental encrypted backup system

## Archive v0

1. cksum original (sha256)
2. compress (gzip)
3. encrypt (aes256)
4. split into checksummed chunks. Chunks are named from the hmac of the
   encrypted+compressed data
5. build index of chunks
6. compress (gzip) and encrypt (aes) index
7. return index cksum

Good:
- chunks are named from the hmac of their compressed/encrypted content.

Problems:
- the salt (or IV in aes) must be static to make the encryption
  deterministic, otherwise no dedup. This weakens the encryption.
- dedup occurs only for append-only files. The same chunk content leads to
  a different hmac if located at a different offset.
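
The offset problem can be reproduced with a minimal sketch. It assumes fixed-size chunks and a hypothetical HMAC key, and it leaves out the compression and encryption steps, since the offset sensitivity already shows up with plain fixed-size splitting. The names `chunk_names` and `shared` are illustrative only.

```python
import hashlib
import hmac

CHUNK_SIZE = 256      # assumed chunk size, for illustration
KEY = b"demo-key"     # hypothetical HMAC key


def chunk_names(stream: bytes) -> list[str]:
    """Split into fixed-size chunks and name each one by its HMAC-SHA256."""
    return [
        hmac.new(KEY, stream[i:i + CHUNK_SIZE], hashlib.sha256).hexdigest()
        for i in range(0, len(stream), CHUNK_SIZE)
    ]


def shared(a: list[str], b: list[str]) -> int:
    """Number of chunk names present in both lists (i.e. deduplicated)."""
    return len(set(a) & set(b))
```

Appending to a stream keeps the names of all earlier full chunks, while inserting a single byte at the start moves every chunk boundary and changes every name.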

## Archive v1

- chunk before compression
- name chunks from the checksum of uncompressed/unencrypted data (invariant => allows dedup).
- then compress and encrypt (in this order).

Chunk encryption can use a randomized cipher, but an integrity tag (hmac) must
be stored with each chunk so that it can be verified without having to
decrypt/decompress. aes-gcm provides this: its authentication tag is computed
over the ciphertext.
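
A minimal sketch of the v1 naming scheme, assuming fixed-size chunks and an in-memory dict standing in for the `chunks` directory. Encryption is left out because the Python standard library has no AES; it would be applied to the compressed bytes. `split_fixed`, `store_chunks` and `restore` are hypothetical names.

```python
import hashlib
import zlib

CHUNK_SIZE = 4  # tiny on purpose, for illustration only


def split_fixed(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut the data into fixed-size chunks before any compression."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def store_chunks(data: bytes, store: dict[str, bytes]) -> list[str]:
    """Store each chunk under its plaintext checksum and return the index."""
    index = []
    for chunk in split_fixed(data):
        # The name depends only on the plaintext, so it is invariant.
        name = hashlib.sha256(chunk).hexdigest()
        if name not in store:
            # Compress first; encryption (e.g. aes-gcm) would follow here.
            store[name] = zlib.compress(chunk)
        index.append(name)
    return index


def restore(index: list[str], store: dict[str, bytes]) -> bytes:
    """Rebuild the original data from the index."""
    return b"".join(zlib.decompress(store[name]) for name in index)
```

Because the name is computed before compression and encryption, two identical chunks are stored once even if the cipher is randomized.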

Problems:
- possible collisions of chunks with the same name (same content) but encrypted
  with a foreign key (different user), which would lead a user to download a
  block they cannot decrypt.

## Archive v2

Each user has a fixed unique id: 96 random bits (12 bytes). This id is added
to the content of each block / file before computing the invariant checksum,
but it is not transmitted (no storage overhead).

This avoids collisions between blocks with the same original content belonging
to different users. Dedup should only happen within a single user's space, as
one cannot decrypt a block from another user.
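
A minimal sketch of the per-user naming, assuming the id is prepended to the chunk before hashing (the text only says it is "added", so the position is an assumption). `new_user_id` and `chunk_name` are hypothetical names.

```python
import hashlib
import os


def new_user_id() -> bytes:
    """Generate a random 96-bit (12-byte) user id."""
    return os.urandom(12)


def chunk_name(user_id: bytes, chunk: bytes) -> str:
    """Checksum of id + content; only the chunk itself is ever stored."""
    return hashlib.sha256(user_id + chunk).hexdigest()
```

The same content yields the same name for one user, so dedup still works, and a different name for any other user.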

Problems:
- in this design, and all previous ones, there is no way to discard data from an
  archive. For example, tarsnap does not allow removing data.

## Repository

A repo is associated with a single id/key tuple, which defines one deduplication space,
i.e. a unique `chunks` directory.

Each backup is denoted by its archive index `$host:$dir:$date`.

`$dir` is the root directory of the backed-up files.

The entry point of a repo is its current index, which contains the list of `$host:$dir:$date` entries.
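
A minimal sketch of building and parsing such a name. It assumes that the host and the date contain no `:` (the directory may), and the `YYYY-MM-DD` date used in the usage example is an assumption as well. `ArchiveName` is a hypothetical type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveName:
    host: str
    root: str   # $dir: root directory of the backed-up files
    date: str

    def __str__(self) -> str:
        return f"{self.host}:{self.root}:{self.date}"

    @classmethod
    def parse(cls, text: str) -> "ArchiveName":
        # Host is everything before the first ':', date everything after
        # the last ':'; whatever remains in the middle is the directory.
        host, rest = text.split(":", 1)
        root, date = rest.rsplit(":", 1)
        return cls(host, root, date)
```

For example, `str(ArchiveName("laptop", "/home/me", "2024-01-31"))` gives `laptop:/home/me:2024-01-31`, and `ArchiveName.parse` turns it back into the same value.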

## Roadmap

- p2p storage
- chunker based on rolling hash (instead of fixed size)
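
A minimal sketch of a rolling-hash chunker, to show why it helps dedup. It uses a simple polynomial hash over a sliding window and cuts wherever the low bits of the hash are zero. The window size, base and mask are assumed values, there are no minimum/maximum chunk sizes, and this is not the algorithm of any of the tools listed in the references.

```python
WINDOW = 16                        # bytes covered by the rolling hash
BASE = 257                         # odd polynomial base
MOD = 1 << 32
MASK = (1 << 6) - 1                # cut when the low 6 bits are zero
POW = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window


def split_rolling(data: bytes) -> list[bytes]:
    """Cut data at positions chosen by the content, not by the offset."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the contribution of the byte sliding out of the window.
            h = (h - data[i - WINDOW] * POW) % MOD
        h = (h * BASE + byte) % MOD
        # Only cut once the window is full, so the decision depends
        # solely on the last WINDOW bytes.
        if i >= WINDOW - 1 and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Since each cut depends only on the last `WINDOW` bytes, inserting a byte at the start of a file changes the first chunk or two and leaves the following chunks identical, unlike fixed-size splitting.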

Discarded:
- encode checksums in base64 instead of hex. Wrong idea: base64 is case-sensitive,
  so it is incompatible with case-insensitive filesystems (macOS).
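
A small sketch of the failure mode: two different values whose base64 names differ only by letter case, and would therefore map to the same file on a case-insensitive filesystem. `clash_when_case_folded` is a hypothetical helper.

```python
import base64


def clash_when_case_folded(a: bytes, b: bytes) -> bool:
    """True if the base64 names differ but are equal once case is ignored."""
    name_a = base64.b64encode(a).decode()
    name_b = base64.b64encode(b).decode()
    return name_a != name_b and name_a.lower() == name_b.lower()
```

For instance, `b"\x00\x00\x00"` encodes to `AAAA` and `b"\x68\x00\x00"` to `aAAA`, so `clash_when_case_folded` returns `True` for that pair, while their hex names (`000000` and `680000`) stay distinct.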

## What tarsnap is doing

1. cksum original (sha256)
2. build chunks of variable size
3. cksum uncompressed unencrypted chunks
4. compress chunk (deflate)
5. encrypt chunk (aes256, with keys protected by rsa2048) + HMAC

## References

- tarsnap: https://www.tarsnap.com https://github.com/tarsnap/tarsnap
- tarsnap chunker in Go: https://github.com/karinushka/chunker
- borg: https://borgbackup.org
- rclone: https://rclone.org
- restic: https://restic.readthedocs.io/en/v0.2.0/Design/