Search - "#zfs"
-
My Sunday morning until afternoon. FML. So I had been experiencing nightly reboots of my home server for three days. Always at 3:12am, strange thing. Sunday morning (ca. 10am) I thought I'd investigate, because the reboots affected my backups as well. All the logs and security mails said was that some processes had received signal 11. Strange. Checked the periodic tasks and executed every task manually. Nothing special. Strange. Checked the SMART status of all disks. Two disks were having CRC errors. Not many, but a couple. Oh well. Changing SATA cables again 🙄. But those CRC errors cannot be the reason for the reboots at precisely the same time each night.

I noticed that all my zpools got scrubbed except my root pool, which hadn't been scrubbed since the error first occurred. Well, let's do it by hand: zpool scrub zroot... Freeze. Dafuq. Walked over to the server and reset it. Waited 10 minutes. System not up yet. Fuuu... that was when I first guessed that Sunday wouldn't be that sunny after all. Connected a monitor. Reset. Black screen?!?! Disconnected all disks and so on. Reset. Black screen. Oh c'moooon! CMOS reset. Black screen. Sigh. CMOS reset with a 5 minute battery removal. And a new SATA cable, just in case. Yes, boots again. Mood lightened...

Now the system segfaults when importing zroot. God damnit. Pulled out the FreeBSD boot stick. zpool import -R /tmp zroot... segfault. Reboot. Read-only zroot import. Manually triggered a checksum test with the zdb command. "Invalid blckptr type". Deep breath now. Destroyed the pool, recreated it. zfs send/recv from backup. Some more config. Reboot. Boots, yeah... Doesn't find files??? Reboot. Other error? Undefined symbols???? Now I need another coffee. Maybe I did something wrong during recovery? Not very likely, but let's do it again... recover-recover. Different but same horrible errors. What in the name...? Pulled out a really old disk. Put it in, boots fine. So it must be the disks. Walked around the house and searched for some new disks for a new 2-disk ZFS root mirror to replace the obviously broken ones. Even found some new ones. Recovery boot, minimal FreeBSD install for the bootloader and so on. Deleted and recreated zroot, zfs send/recv from backup. Set the bootfs attribute, reboot........
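(For anyone curious, that last restore boils down to roughly this - a sketch only; disk, dataset and snapshot names are placeholders and the flags will differ per setup:)

# from the FreeBSD boot stick: recreate the root pool on the new mirror
zpool create -f -o altroot=/mnt -O mountpoint=none zroot mirror ada0p3 ada1p3

# pull everything back from the backup pool
zfs send -R backup/zroot@latest | zfs recv -F zroot

# tell the bootloader which dataset to boot from
zpool set bootfs=zroot/ROOT/default zroot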
It works again. Fuck it, now it is 6pm and I still haven't showered. Put both disks through extensive tests and checked every single block. These disks aren't faulty. But for some reason they froze my system in a way that made me reset my BIOS, and they had really low-level data errors...? I wonder if those disks have a firmware problem? So that was most of my Sunday. Nice, isn't it? But hey: a calm sea won't make a good sailor, right? -
Today started off great!
New 5TiB HDD... Check!
Formatted with ZFS under LUKS, with a high level of compression and dedup... Check!
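(Roughly this, for the curious - a sketch, device and pool names are made up:)

# LUKS underneath, ZFS with heavy compression and dedup on top
cryptsetup luksFormat /dev/sdX
cryptsetup open /dev/sdX backup_crypt
zpool create -O compression=gzip-9 -O dedup=on backup /dev/mapper/backup_crypt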
Copying over roughly 4TiB of data, about 2 of which was scattered in small files... Coworker unplugged it from AC thinking it was his (they are sort of similar), when the process was almost complete.
Goddammit. zpool scrub... 6 hours left. It's 9 pm over here, and I'm not a fan of leaving my stuff at work. Goddammit.
...I guess tomorrow is another day. -
Just mirrored sudo to my own Gitea instance yesterday (https://git.ghnou.su/mir/sudo). Turns out that this chonkster is 200MB compressed (LZ4 on ZFS). I am baffled by it... All it needs to do is read a configuration file describing which users can be elevated, to which user, and which commands they can run. Perhaps doas wasn't a bad idea after all?
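(For comparison, a typical doas.conf is a couple of lines - a sketch, the user name is made up, and the persist option isn't available in every port:)

# /etc/doas.conf
permit persist :wheel
permit nopass alice as root cmd zfs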
Oh and it got a privilege escalation vulnerability just yesterday (https://security-tracker.debian.org/...), which is why I got interested in it. Update your sudo packages if you haven't already. -
A few days ago I decided to install Windows 7 on a VM (bad idea as it turned out). All fine and dandy and I ran Windows Update a few times to get it at least as up-to-date as it'll get.
I noticed that out of the 4GB RAM I had allocated, an svchost process responsible for the updates was gobbling up all the available memory, leaving just 82MB for everything else. The process was, as you might imagine, consuming over 3GB of RAM just for itself. That's how an OS should work right after installation, I'm sure you'll agree.
So I complained about it. Haven't used Windows anywhere for a while, so I wasn't used to this level of efficiency anymore. Disk activity went through the roof, though to be fair the underlying disk wasn't an SSD (qcow2 on ZFS on a spinning drive). RAM consumption is something I already covered. CPU temperature shot up to 95C.
So as any idiot would do, I disabled the service related to that process (the svchost process for wuauserv) and the problem went away. But I complained of course, saying that such amazing system utilization metrics weren't something I expected. I mean, for 4GB allocated, having as much as 82MB usable to get stuff done with! 95C on the CPU, on a lot of chips that's the junction temperature! Absolutely beautiful.
When I complained I heard that I had to replace the thermal grease. I do that twice a year. I wrote a custom fan driver for my system that works absolutely great. It was obviously shit. I must be a horrible sysadmin for solving a problem by eliminating the cause, and companies hiring me must be ashamed of themselves. My hardware must be shit (that's a common one with Windows users) despite being a business laptop and the guest system being a VM. Oh and I'm an idiot of course for complaining about such amazing system metrics in Windows.
I love Windows and its community... -
I was copying data from a failing zfs drive with rsync and I noticed that it spent a long time on the file ~/.local/share/Baloo/index
du -h index showed a 500ish MB file which didn't seem large enough to take this long.
I recalled that du shows disk usage, not file size and since I was using zfs compression they could be quite different.
so I added -A for apparent size:
du -hA index and it comes back with 1.7E
The file was 1.7 exabytes...
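(Easy to reproduce with any sparse file, by the way - BSD du shown here; GNU du wants --apparent-size instead of -A:)

truncate -s 1T sparse.img   # one terabyte of nothing but a hole
du -h  sparse.img           # disk usage: next to nothing
du -Ah sparse.img           # apparent size: 1.0T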
There are a few email addresses on my domain that I keep on receiving spam on, because I shared them on forums or whatever and crawlers picked it up.
I run Postfix for a mail server in a catch-all configuration. For whatever reason, blacklisting email addresses doesn't work in this setup, and given Postfix's complexity I gave up after a few days. Instead I wrote a little bash script called "unspam" to log into the mail server, grep all the emails in the mail directory for those particular email addresses, and move whatever comes up to the .Junk directory.
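(The whole thing is essentially just this - a rough sketch, the addresses, host and Maildir paths are placeholders:)

#!/bin/sh
# unspam: move mail sent to burned addresses into the Junk folder
ssh mail.example.com '
  for addr in oldforum@example.com leaked@example.com; do
    grep -rl "$addr" /var/vmail/example.com/me/cur |
      while read -r f; do
        mv "$f" /var/vmail/example.com/me/.Junk/cur/
      done
  done
'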
On SSD it seems reasonably fast, and ZFS caching sure helps a lot too (although limited to 1GB memory max). It could've been a lot slower than it currently is. But I'm not exactly proud of myself for doing that. But hey it works! -
Spent 2 hours migrating my old NAS's Ubuntu ZFS pool to the new FreeBSD NAS, which has fancy new stuff like a crossflashed RAID card, a new hyper-efficient PSU and so on. Sadly, the pool just won't import, many drives are missing. I debug. For hours. Trying to test cables. Interesting. No matter which SATA cables I switch, this one drive always starts... Hm... Must be the controller then. Maybe the controller doesn't spin up the other disks, because I removed the boot ROM! That must be it! Wait... Why is this cable lying in here... Wait, this is the power cable attached to all missing driv ARE YOU FUCKING KIDDING ME?! I WASTED SO MUCH FUCKING TIME ON THIS SHIT HOW COULD YOU DO THIS TO ME!
Unfortunately, one power cable had come loose (I don't know how, these cables have plastic thingies to prevent this...), but it works now. And it's better than before. -
LXC, no doubt.
I mean to be fair, LXC is an amazing container runtime once you manage to set it up. But setting it up is the hard bit. Starting off with LXC 2.x, it was a nightmare to find out how to get things like the storage backends working. But with ZFS it ended up being alright. Find some arcane values to stick in the /etc/lxc/default.conf to use ZFS as the backend and then the default storage location on those ZFS pools (I'll get back to that later), and it worked alright. Again, once it works it's great, but setting it up and finding the right configuration keys is absolute hell.
So, LXC 2.x for a while and a few months ago I finally ended up upgrading to 3.x. Every single configuration key changed. Every single one of them, and that's why I had to 1) learn LXC all over again, and 2) redeploy each and every one of my containers. That process is still not entirely completed. ZFS backend was once again a dive into arcane configuration keys found on forums and whatnot. Yeah.. official documentation has none of it. Oh and in 3.x you now also have to dodge the torrent of "just use LXD m8" messages. Yeah, very helpful when LXD is also the ONLY way to reasonably configure it. Absolutely beautiful. Oh and as far as the ZFS default storage location goes (such as ssd/lxc/ct)? Yeah forget about it. There's no configuration option for it anymore, and the default is "lxc". In ZFS lingo that means that LXC has the audacity to demand a whole pool for itself. No. No you don't deserve a whole pool for yourself. But hey at least you can define the storage location to use in the lxc-create command! Every single time you have to define it in lxc-create. I abstracted it away into my own LXC interface, so no big deal really. But yeah... That could absolutely be better. And in 2.x it was actually better.
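(For reference, every container creation now looks roughly like this - a sketch, the template arguments and ZFS dataset are whatever your setup uses:)

# 3.x: the ZFS dataset has to be passed on every single lxc-create call
lxc-create -n mycontainer -t download -B zfs --zfsroot=ssd/lxc/ct -- \
    -d debian -r buster -a amd64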
Oh and btrfs, the filesystem I'd like to use on low memory systems because ZFS' ARC is too much on such systems? Yeah forget about it. I still have no idea how to do it. Thank you LXC and its amazing documentation!
And if you want the icing on the cake for LXC's terrible documentation, see their repo's index page at https://github.com/lxc/lxc/.... Yeah, it's totally still at 2.x... That's how well they maintain that. Even Debian has 3.x now. And if you look at the branches, you'll find that even 4.x is already available and considered stable. -
Ffs, I just spent the whole weekend setting up our new storage server. Moved it into the rack. Entered the UEFI to enable iDRAC. And BAM! The UEFI decided to load its own RAID config over the RAID controller.
The RAID controller BIOS doesn't let me load its own config after that. So I have to reset the controller and set up RAID, the OS and the whole shot again.
To make it even better: Debian doesn't load the firmware for the Broadcom chip, since it's non-free, making me do lots of manual config after the install just to get it on the internet.
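(The fix was roughly this - assuming it's one of the bnx2-family chips that needs firmware from non-free; the package name depends on the actual NIC:)

# enable non-free (or edit /etc/apt/sources.list by hand), then:
sed -i 's/ main$/ main contrib non-free/' /etc/apt/sources.list
apt update
apt install firmware-bnx2x   # or firmware-bnx2, depending on the chip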
I wish I could’ve just bought a new server instead of working with this shit.
I would’ve used FreeBSD with ZFS, but our server only has 8GB ram, and I need about 120GB extra to work smoothly with all the storage.
It’s just a pita working with this. One step forward, ten steps back. -
Learning about Sun Solaris, DTrace, ZFS and SMF after having called the old Sun boxes my colleagues set up "old garbage".
These guys were ahead of us for ages.
If you ever wondered where all your RAM goes when your application starts, or why it crashes without further notice, try DTrace.
If you ever wondered what a sophisticated, reliable init system would look like, look at the SMF init system.
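(If you want a taste, two classics - standard one-liners; the PID and service FMRI are placeholders:)

# where the RAM goes: malloc() request sizes by user stack, for one process
dtrace -n 'pid$target::malloc:entry { @[ustack()] = sum(arg0); }' -p 1234

# why a service is down, and restart it once fixed
svcs -xv
svcadm restart svc:/network/ssh:default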
If Oracle would just open source all the old Sun stuff already, and other companies would start using the illumos distros, the world would be a better place.
That's where the Sun people went after Oracle bought Sun and started pissing off its devs. -
I hate the elasticsearch backup api.
From beginning to end it's a painful experience.
I try to explain it, but I don't think I will be able to cover it all.
The core concept is:
- repository (storage for snapshots)
- snapshots (actual backup)
The first design flaw is that every backup in a repository is incremental. ES creates an incremental filesystem tree.
Some reasons why this is a bad idea:
- deletion of (older) backups is slow, as newer backups need to be checked for integrity
- you simply have to trust ES that it does the right thing (given the bugs it has... It seems like a very bad idea TM)
- you have no possibility of verification of snapshots
Workaround... Create many repositories, as each new repository forces a full backup.........
The second thing: ES scales. Many nodes / es instances form a cluster.
Usually backup APIs incorporate these in their design. ES does not.
If an index spans 12 nodes and you use network storage, yes: a maximum of 12 nodes will open an NFS connection (for example) and start backing up.
It might not sound so bad with 12 nodes and one index...
But it gets pretty bad with 100s of indexes and several dozen nodes...
And there is no real limiting in ES. You can plug a few holes, but all in all, when you don't plan your backups carefully, you'll get pretty f*cked up network congestion.
So traffic shaping must be manually added. Yay...
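(The one knob you do get is per-repository throttling when you register it - a sketch, repo name and path made up:)

curl -XPUT 'http://localhost:9200/_snapshot/storage' \
  -H 'Content-Type: application/json' -d '{
    "type": "fs",
    "settings": {
      "location": "/mnt/nfs/es_backup",
      "max_snapshot_bytes_per_sec": "40mb",
      "max_restore_bytes_per_sec": "40mb"
    }
  }'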
The last thing is the API itself.
It's a... very fragile thing.
Especially in older ES releases, the documentation is like handing you a flex instead of toilet paper for a wipe.
Documentation != API != Reality.
Especially the fault handling left me more than once speechless...
Eg:
/_snapshot/storage/backup
gives you a state PARTIAL
/_snapshot/storage/backup/_status
gives you a state SUCCESS
Why? The first one is blocking and refers to the backup status itself. The second one shouldn't be blocking and refers to the backup operation.
And yes. The backup operation state is SUCCESS, while the backup state might be PARTIAL (i.e. no full backup was made, there were errors).
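(Concretely, these are the two calls against the same snapshot:)

# snapshot state - this is the one that blocks and can report PARTIAL
curl 'http://localhost:9200/_snapshot/storage/backup'

# operation status - happily reports SUCCESS for the very same snapshot
curl 'http://localhost:9200/_snapshot/storage/backup/_status'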
So we now have an additional API that we query, which then wraps the API of Elasticsearch. With all these shiny scary workarounds like polling, since some APIs are blocking, which might lead to a gateway timeout...
Gateway timeout? Yes. Since some operations can run a LONG (multiple hours) time and you don't want to have a ton of open connections hogging resources... You let the loadbalancer kill it. Most operations simply run in ES in the background, while the connection was killed.
So much joy and fun, isn't it?
Now add the latest SMR scandal and a few faulty (as in SMR instead of CMR) HDDs in a hundred-terabyte ZFS pool and you'll get my frustration level.
PS: The cluster has several dozen terabytes and a lot of nodes. If you have good advice, you're welcome - but please think carefully about this fact.
I might have accidentally vaporized people sending me links with solutions that don't work at large scale TM. -
Restarted my ZFS NAS today, which last time corrupted one drive because of the changing /dev/sd? naming. Now I've replaced them with IDs and rebooted. Enough adrenaline for today.
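(For anyone hitting the same thing: export once, re-import by ID, and the /dev/sdX shuffle stops mattering - the pool name is a placeholder:)

zpool export tank
zpool import -d /dev/disk/by-id tank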
-
Recovering a RAID 1 EXT4 disk to a ZFS disk. Thank god for Adobe programs and their multi-million-file caches.
recovery_time.add(4.hours) -
So a couple of months ago I had some stability issues which seem to have caused Baloo to go crazy and create a 1.7 exabyte index file. It was apparently mostly empty, as ZFS compressed it down to 535MB.
Today I spent some time trying to reproduce the "issue", and it turns out that wasn't that hard.
So this little program, running on FreeBSD with a compressed (LZ4) ZFS dataset, creates a 1.9 exabyte file, nicely compressed down to 45KB :)
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/limits.h> /* INT_MAX (FreeBSD; <limits.h> on other systems) */

int main(int argc, char** argv) {
    int fd = open("bigfile.lge", O_RDWR | O_CREAT, 0644);

    /* seek forward INT_MAX bytes, a billion times: ~1.9 exabytes, no data written */
    for (int i = 0; i < 1000000000; i++) {
        lseek(fd, INT_MAX, SEEK_CUR);
    }

    /* a single real byte at the end gives the file its huge apparent size */
    write(fd, " ", 1);
    close(fd);
}
I'm easily a Digital Ocean fan, though I have heard horror stories, so I might set up a system to do regular backups.
I'm considering migrating my current server to something FreeBSD-based, so I can easily do ZFS snapshots, and even code on my machine at home and just send the jail as a snapshot. Like Docker, but different.
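(Which would boil down to something like this - a sketch, dataset names and the target host are placeholders:)

# snapshot the jail's dataset at home and ship it to the droplet
zfs snapshot zroot/jails/dev@2019-06-01
zfs send -R zroot/jails/dev@2019-06-01 | ssh droplet zfs recv -F tank/jails/dev

# later, only the delta since the last snapshot
zfs send -R -i @2019-06-01 zroot/jails/dev@2019-06-08 | ssh droplet zfs recv tank/jails/dev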
For me that would be Proxmox. I know, people like it - but for no apparent reason it decided to nuke half my ZFS datasets in a pool, with no logic behind it whatsoever. All disks were tested, all came out good. Within the same pool there were datasets that were lost and some that remained.
I really don't get it. Looking at Proxmox' source code, it's more or less the command line tools and then there's the web interface (e.g. https://github.com/proxmox/...). Oh and they have the audacity to use their own file extension. Why not I guess?
Anyway, half my data was gone. I couldn't tell how or why or what the fuck even happened there. But Proxmox runs Debian underneath and I've been rather pissed about Proxmox' idea of "don't touch the host system aaa" for a while at that point. So I figured, fuck it I'll just take pure Debian then and write my own slightly better garbage on top of that. And as such the distribution project was born. I've been working on it for a little over a year now. And I've never had such issues again.
I somewhat get the idea of "don't touch the host" now, but still not quite. Yes, the more you do in the containers, the better. And the less you do on the host in terms of reconfiguration, the longer it will stay alive. That goes for any system - more reconfiguration usually means less stability and makes it harder to replace. But sometimes you just have to work from the host. Like, say, migrating a container between hosts, which my code can do. You can't do that from a container, at all. There are good reasons to work with the host. Proxmox isn't telling you that. Do they expect their users to be idiots? Only enterprise sysadmins, amirite?
So yeah, that project - while I do take inspiration from it in mine - I don't like it. It's enterprise, it has the ZFS and the Ceph and the LXC and the VMs - woohoo! Not like anyone could implement that on a base Debian system. But they have the configuration database (pmxcfs), a distributed configuration database that is a couple MB large and capped there, woah!
Ok sure it isn't Microsoft or IBM or Oracle or whatever, and those are definitely worse. But those are usually vendor lock-ins.. I avoid those on that premise alone :) -
I spent the Easter weekend migrating a bare-metal Windows installation to a Proxmox VE server with a Windows VM (and set up an RDP client).
-
Unraid, you piece of lovely SHIT...
I love that it has this really easy expandable storage pool, and the ease of installing plugins...
Plex runs perfect on it... so does sonarr (mostly)...
but why the loving FUCK did it have to crash every. 4. fucking. days.
oh... wait... im fucking retarded...
the USB stick I use isn't 32GB... it's 64...
fuck...
FUCK THIS!
IM FUCKING OUT OF HERE!
Oh, and don't get me started on ZFS...
Please, use RAID instead of ZFS if you have a NAS... don't use ZFS... it wasn't made for this... it was made to run in enterprise environments... hell, even THE Enterprise... -
Had a NAS with a single 3TB Seagate HDD in it.
It ran well for half a year and it was my main backup and a time machine for my dad.
The time came when my budget allowed a second drive for redundancy, so I powered it off, added the second drive and powered it back on.
😐😓😧😭
The drive did indeed die and yes, it was one of those drives with an extremely high failure rate.
My dad was pretty mad that his backups were gone even though he didn't need them.
So my biggest lesson from this was to always encrypt such drives, because dad's backup wasn't encrypted and my files and such weren't either, so someone could restore our whole lives from the drive.
So I can't RMA that fucker.
ZFS at-rest encryption FTW!
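(With OpenZFS native encryption that's one property per dataset - a sketch, assuming a reasonably recent ZFS and a pool called tank:)

zfs create -o encryption=aes-256-gcm -o keyformat=passphrase tank/backup
# after a reboot/import the data stays sealed until the key is loaded
zfs load-key tank/backup
zfs mount tank/backup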
By the way, writing this I noticed that I didn't need to power the NAS down to add the second drive....
Ffffffffuuuuuuuuucccckkkkkk.
Another more recent thing was a refurbished 4TB WD Red that I bought used for a bargain.
It reported 2 unwritable sectors but I didn't care for the money.
After about a month, it died.
The interesting part is how it died.
It spins up, gets detected, you can access the data.
You can copy the data.
But after a few moments of continuous load, all operations start timing out and the drive either disconnects completely or the zpool degrades and shuts down.
In the first case, replugging brings the drive back until it does it again.
On zpool degradation only a reboot brings it back.
Put a fan on it in case it was overheating, but that didn't fix it. -
I just went through a super long debugging process trying to figure out what was going on with my ZFS volumes. It turned out I had bad memory:
https://battlepenguin.com/tech/... -
Did you ever use ZFS?
I want to build a Linux setup with some cool new stuff I haven't used so far, and ZFS is among them.
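For what it's worth, getting a first pool up on Linux is roughly this (a sketch - the package name is Ubuntu's, Debian wants zfs-dkms from contrib, and the disk IDs are placeholders):

apt install zfsutils-linux
zpool create -o ashift=12 tank mirror \
    /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2
zfs create -o compression=lz4 -o mountpoint=/srv/data tank/data
zpool status tank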