This repository has been archived by the owner on Jan 19, 2024. It is now read-only.

Backup/Restore to/from AWS S3 bucket #100

Closed
rjbaucells opened this issue Sep 9, 2020 · 13 comments
Labels
enhancement New feature or request

Comments

@rjbaucells

Provide backup/restore helm chart for AWS S3 bucket

@srfaytkn
Contributor

+1

@moxious
Contributor

moxious commented Sep 21, 2020

Would you intend to configure credentials to write to S3 via some AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY pair? Or via some other method?
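
For context, the access-key route would look something like this (a purely illustrative sketch; names and values are placeholders, not something the chart defines today):

```yaml
# Hypothetical Secret holding static AWS credentials; the key names match the
# standard environment variables that the AWS CLI and SDKs look for.
apiVersion: v1
kind: Secret
metadata:
  name: neo4j-backup-aws-credentials   # placeholder name
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "AKIA..."          # placeholder value
  AWS_SECRET_ACCESS_KEY: "..."          # placeholder value
# The backup container would then pick these up via, e.g.:
#   envFrom:
#     - secretRef:
#         name: neo4j-backup-aws-credentials
```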

@rjbaucells
Author

Yes, that could be the initial setup, but I would also consider the possibility of having the ServiceAccount associated with the Pod linked to an AWS IAM Role when deployed inside an AWS EKS cluster. (See IAM Roles for Service Accounts.)
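
For illustration, the kind of ServiceAccount the chart could create for that (a rough sketch; the account name and role ARN are placeholders):

```yaml
# Hypothetical ServiceAccount annotated for IAM Roles for Service Accounts (IRSA).
# On EKS, the pod identity webhook injects a web identity token and role ARN into
# pods that use this account, which the AWS SDKs exchange for temporary
# credentials, so no static keys need to be stored.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: neo4j-backup                     # placeholder name
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/ROLE_NAME"
# Pods opt in with:
#   serviceAccountName: neo4j-backup
```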

@bbenzikry

We would also be interested in this, specifically with IRSA.

@moxious moxious added the enhancement New feature or request label Oct 12, 2020
@moxious
Contributor

moxious commented Oct 12, 2020

@bbenzikry has submitted a PR on this topic here: #118

👀

@bbenzikry

@moxious Thanks for letting me know.

I see a few issues with the PR:

  • As far as I can see, it uses a dedicated secret and doesn't have an option for IRSA (I would assume the helm chart would create a service account that will receive the proper arguments for the IRSA role, which mounts a secret automatically), e.g.:
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/ROLE_NAME"
  • It would be really great if we could be less dependent on volume sizing issues in init containers ( it currently uses the same volume for replicas / core as far as I can see ) and use a different volume for restoring.
    I'm not sure how complicated this can be - but it would be great if we could check the k8s api for allocated size and dynamically provision to prevent any issues.

  • As for backups (I may have missed something in the configuration), it seems the backup is stored in the Docker writable layer, which has performance issues.
    The sizing issue could happen here as well if the node the cron job is scheduled on doesn't have enough space - we should at least be able to define a volume (see the sketch after this list).
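
To make the last point concrete, here is a rough sketch of giving the backup job its own staging volume instead of writing into the container's writable layer (names, sizes, and the image are placeholders, not options the PR currently exposes):

```yaml
# Hypothetical fragment of a backup CronJob pod spec: stage the backup on a
# dedicated PersistentVolumeClaim rather than the container filesystem.
spec:
  containers:
    - name: backup
      image: backup-image:placeholder       # placeholder image
      volumeMounts:
        - name: backup-staging
          mountPath: /backups               # the backup tool writes here
  volumes:
    - name: backup-staging
      persistentVolumeClaim:
        claimName: neo4j-backup-staging     # provisioned and sized independently
```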

Unrelated to that, really nice work on https://github.com/neo4j-contrib/neo4j-spark-connector; I'm really looking forward to a 3.0.0 release.

@moxious
Contributor

moxious commented Oct 12, 2020

On your point #1, I think there's a clear tension between the design approach of "make a backup container and expand its support for multi-cloud" (which is what the PR is evolving) and "make a bespoke backup approach for each cloud". Flexibility & power to use cloud-specific things like IRSA is eventually going to pull in the direction of option #2. As noted in the PR comments, eventually someone will ask for Azure, and then Alibaba, and so on.

TBH, my main goal for the backup approach in this repo is to act as an example. In working with production customers, they end up having so many custom requirements that a pre-cooked single backup container usually ends up infeasible. For example, they'll want a split schedule: incremental every 24 hours, full backup weekly. Or they have pre-preparation requirements, etc. No single approach can capture all the requirements, frankly. Add to this that the cloud k8s distributions evolve and various security features keep being added, and you get the picture.

This is a long way of saying it doesn't bother me if the PR doesn't support IRSA. But if it really bothers you, we need to consider whether modifying this is the right approach, or whether the best available guidance to people will be to roll their own on the basis of an example.

On your second point -- volume sizing. I get this pain, but it's a complex issue. Sizing backups & database volumes the same is definitely not the right approach. Neo4j has a lot of options dealing with things like transaction & log retention, and we'd usually expect backups to be smaller than a real database. Doing them separately is a point of complexity, but also ... flexibility. This could be broken out into a separate issue for discussion, though, in terms of (a) what can be done and (b) what's good to do here.

On your third point -- yes, I'm aware. A hard limitation here is that the neo4j backup tool is a CLI that doesn't know how to write to any storage layer other than local disk, so fundamentally all backup approaches will "double the data": you'll take a backup to local disk, and you'll push that up to cloud storage, so 2 full copies of the same data. As I've thought through this, the best we can do here is choose the best-performing storage layer. I've looked into some options where you can mount S3 storage as a local FS, but this has so many extra issues & complexity that it doesn't seem like a good option for the kind of file access we're doing here.
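
As a sketch of that two-step flow inside a backup container (host, bucket, and paths are placeholders; the flags are the standard neo4j-admin / aws CLI ones rather than anything this chart defines):

```yaml
# Hypothetical container command for a backup CronJob: back up to local disk
# first, then copy the result to S3 - hence two full copies of the data.
command: ["/bin/sh", "-c"]
args:
  - |
    neo4j-admin backup \
      --from=my-neo4j.default.svc.cluster.local:6362 \
      --database=neo4j \
      --backup-dir=/backups &&
    aws s3 cp --recursive /backups s3://my-bucket/neo4j-backups/
```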

@bbenzikry

bbenzikry commented Oct 12, 2020

This is a long way of saying it doesn't bother me if the PR doesn't support IRSA.

What bothers me most are the security implications.
When using the chart, I would expect (as a user) some guidance on proper security practices.
I don't think comparing it to cron intervals is the right analogy.

As noted in the PR comments, eventually someone will ask for Azure, and then Alibaba, and so on. TBH, my main goal for the backup approach in this repo is to act as an example.

I don't really see it that way, as the PR already has cloud-vendor-specific code (e.g. using a baked image with specific AWS / GCP CLIs, expecting credentials to be in a certain format, etc.).

I'm just afraid that what will happen is that users will just use this with a non-rotated secret they'll forget about.
I understand the hardship in supporting every vendor, but supporting the major 3 doesn't seem like that much of a sacrifice to me to promote good security practices (I would also say it probably makes sense, at the very least, for clouds that Neo4j / APOC integrates with, like AWS and GCP).

Obviously this could also go in a direction of showcasing a more agnostic solution ( e.g. vault ), but I would think that would be even more of a bother.

As IRSA support is off the table, would you be open to the following?

  1. Add reference documentation so users are aware of cloud-specific routes.
  2. Add an optional service account and secret mount path as parameters to the chart so users can easily configure this to their liking (a rough sketch follows below).
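
Purely as an illustration of the shape of that configuration (these value names are made up and may not match what the PR ends up exposing):

```yaml
# Hypothetical values.yaml fragment - not actual chart parameters.
backup:
  serviceAccountName: "neo4j-backup"    # e.g. an IRSA / Workload Identity account
  credentials:
    secretName: ""                      # optional Secret with static credentials
    mountPath: "/credentials"           # where that Secret would be mounted
```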

Sizing backups & database volumes the same is definitely not the right approach. Neo4j has a lot of options dealing with things like transaction & log retention, and we'd usually expect backups to be smaller than a real database.

Thanks for clarifying this - the easiest way I can think of is using some kind of elastic store or pre-provisioned options ( gluster etc. ), but that's really outside of the chart purview IMO.
Can you refer to any specs that can help us estimate the size difference?
At the moment we use volumeExpansion, which has issues of its own - but as long as ratios are defined well it should alleviate some of my concerns.
I'll be happy to participate in a more thorough discussion on a separate issue.

@moxious
Contributor

moxious commented Oct 12, 2020

When using the chart, I would expect (as a user) some guidance on proper security practices

It must be said, because it's not explicit & clear everywhere: there are some general security guidelines for running Neo4j, but I know you're asking about k8s specifically, and that information hasn't been written up. The status of this repo at the moment is that it's an open source project by yours truly, and has no official standing and isn't officially supported by Neo4j.

I'm just afraid what will happen is that users will just use this with a non rotated secret they'll forget about.

Yep. But it must be said: people will frequently do this even if you write detailed guidance telling them not to. Doing it in the more complicated way is simply more work, and people who use the helm chart are not frequently looking to become experts in these issues; they're usually trying to get something rolling. Endemic to the industry, really.

this could also go in a direction of showcasing a more agnostic solution ( e.g. vault ), but I would think that would be even more of a bother

More agnostic solutions are separately problematic, as a lot of people are in locked-down environments where they can't necessarily adopt extra recommended technologies. For example, you can do nice things if you require this or that k8s add-on, but requiring such tends to progressively narrow the audience that can apply it, which is why most of the repo uses very vanilla k8s techniques, and why I haven't jumped on #42.

As IRSA support is off the table, would you be open to the following?

Add reference documentation so users are aware of cloud specific routes.
Add an optional service account and secret mount path as parameters to the chart so users can easily configure this to their liking

This is fine, and I'm not hostile to it, but this is expanding the ask; what's really wanted is a generalized approach that works across 3 major clouds, with security best practices on securing the backups & specifying the access credentials path. That's a legitimate request, but a big chunk of work for me alone.

In the context of this issue & the linked PR, what we need to do is decide whether some iteration of the PR makes meaningful progress towards that goal.

@bbenzikry

The status of this repo at the moment is that it's an open source project by yours truly, and has no official standing and isn't officially supported by Neo4j.

Thanks for mentioning that; I was under the assumption that it was officially supported.

This is fine, and I'm not hostile to it, but this is expanding the ask; what's really wanted is a generalized approach that works across 3 major clouds, with security best practices on securing the backups & specifying the access credentials path. That's a legitimate request, but a big chunk of work for me alone.

I hope I'll have some time in the coming weeks; maybe I'll try and have a go at this to help out.
I believe a simple service account / secret path configuration should be good enough for now, as both IRSA and Workload Identity are similar (I'm less familiar with the AKS offering).
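
For reference, the two annotations are nearly identical in shape, which is why one service-account knob should cover both (ARNs and account names are placeholders):

```yaml
# EKS - IAM Roles for Service Accounts (IRSA)
metadata:
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/ROLE_NAME"
---
# GKE - Workload Identity
metadata:
  annotations:
    iam.gke.io/gcp-service-account: "GSA_NAME@PROJECT_ID.iam.gserviceaccount.com"
```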

In general, thinking more about cloud-specific auth methods, perhaps this should be a larger issue that adds a generalized approach for dealing with cloud auth for both the cluster itself and backup / restore procedures - but that should probably wait until there's progress on related issues in other projects that can help such security features be more mainstream in Neo4j use (neo4j-contrib/neo4j-apoc-procedures#1596 as an example).

Can you refer to any specs that can help us estimate the size difference?

Would love it if you could weigh in on that; I didn't see anything specific about full / incremental backup sizes in the operations manual or from a quick search.

@moxious
Contributor

moxious commented Oct 12, 2020

I don't think there's much guidance on differential storage planning. K8s is, relatively speaking, a new environment for Neo4j. In a lot of deployments, it's done on VMs in environments where disk is cheap & plentiful, and storage is planned for in combination, since folks would run the backup tool on one of their server instances. As a result, only k8s folks tend to think about capacity planning in separate terms like this. In actuality, when working with a particular team, you tend to do it based on their needs.

The variables at play are things like transaction log retention settings, log file retention & rotation settings, and overall ingest rate, which give you very different results for hyper-active transactional setups vs. big analytical graphs.
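
For example, the transaction log retention setting alone can move the on-disk footprint quite a bit; with the Neo4j Docker image it is set through an environment variable like the following (the one-day value is just an illustration):

```yaml
# Hypothetical env entry for the Neo4j container: retain only one day of
# transaction logs, one of the variables that drives how large the database
# volume (and therefore any backup staging space) needs to be.
env:
  - name: NEO4J_dbms_tx__log_rotation_retention__policy
    value: "1 days"
```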

moxious pushed a commit that referenced this issue Oct 16, 2020
* #backup aws support #100

* #backup secretName option

* #restore aws support for core pods

* #restore aws support for read pods

* #default values updated

* #single secret path for backup

* #restore deployment-scenarios
@moxious
Contributor

moxious commented Oct 16, 2020

Closing this issue in light of the merge of the linked PR above, plus some odds and ends in #122.

I'm working on the release process for 4.1.3-1 right now; it'll probably be a few hours before the docs site is updated, but that should follow soon via a separate build chain.

@moxious moxious closed this as completed Oct 16, 2020
@moxious
Contributor

moxious commented Oct 16, 2020

Many thanks to @srfaytkn who did the bulk of the work to bring in the implementation of this!
