Feat: state import/export #4038
base: main
Conversation
@cli.group(no_args_is_help=True)
def state() -> None:
This is a slight departure from our existing structure, but I decided to group state operations to keep things open for more operations in the future (such as perhaps being able to query state). So the CLI syntax is `sqlmesh state dump` or `sqlmesh state load` vs something like `sqlmesh state_dump` or `sqlmesh state_load`.
@treysp I'd be keen to know if this is a direction you were planning to head in, given the recent CLI refactoring work.
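For context, here is a minimal, standalone sketch of the grouped layout being described, using click. It is illustrative only: in the PR the group hangs off SQLMesh's root `cli` group and the subcommands carry real options rather than empty bodies.

```python
import click


@click.group(no_args_is_help=True)
def state() -> None:
    """Commands for working with the SQLMesh state database."""


@state.command()
def dump() -> None:
    """Invoked as `sqlmesh state dump`."""


@state.command()
def load() -> None:
    """Invoked as `sqlmesh state load`."""


if __name__ == "__main__":
    state()
```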
sqlmesh/core/context.py
f"This [b]destructive[/b] operation will delete all existing state against the '{self.selected_gateway}' gateway \n" | ||
f"and replace it with what's in the '{input_file.as_posix()}' file." | ||
) | ||
if isinstance(self.console, TerminalConsole): |
I had some trouble figuring out how to handle the confirmations since they need to go before the load or dump progress even starts.
I'm not super happy with this implementation; open to ideas.
This logic needs to go into the console implementation itself.
I've refactored how the console interactions work and have moved it.
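As a rough illustration of what "moving it into the console" can look like (the method and class names below are hypothetical, not the PR's actual API), the prompt lives behind a console method so callers never have to check the console type themselves:

```python
import typing as t

import click


class Console(t.Protocol):
    """Sketch of a console interface that owns the confirmation step."""

    def confirm_destructive_state_operation(self, message: str) -> bool: ...


class TerminalConsole:
    def confirm_destructive_state_operation(self, message: str) -> bool:
        # Interactive terminals prompt the user before anything destructive runs.
        return click.confirm(message, default=False)


class NonInteractiveConsole:
    def confirm_destructive_state_operation(self, message: str) -> bool:
        # Non-interactive contexts (tests, CI) are expected to have opted in
        # explicitly upstream, so no prompt is shown here.
        return True
```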
sqlmesh/core/state_sync/base.py
@@ -459,6 +464,16 @@ def add_interval(
)
self.add_snapshots_intervals([snapshot_intervals])

@abc.abstractmethod
def load(self, stream: StateStream, clear: bool = True) -> None:
I decided to make dump/load first-class citizens on `StateSync` rather than trying to coordinate everything over the public interface. The reason is to allow different `StateSync` implementations to perform local optimizations or call internal methods without exposing them publicly, and it also helps with being able to wrap things in transactions.
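A condensed sketch of the shape this gives the interface (heavily trimmed; the export-side method name is an assumption, only `load` appears in the diff above):

```python
import abc


class StateStream:
    """Stand-in for the PR's streaming abstraction over exported state."""


class StateSync(abc.ABC):
    # Trimmed to the two members relevant here.

    @abc.abstractmethod
    def export(self) -> StateStream:
        """Stream state out without materializing all of it in memory."""

    @abc.abstractmethod
    def load(self, stream: StateStream, clear: bool = True) -> None:
        """Import state from a stream, optionally clearing existing state first.

        Because this lives on StateSync itself, an implementation can use its
        internal helpers and wrap the whole operation in a transaction.
        """
```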
sqlmesh/core/context.py
@@ -2074,6 +2077,56 @@ def clear_caches(self) -> None:
for path in self.configs:
    rmtree(path / c.CACHE)

def dump_state(self, output_file: Path, confirm: bool = True) -> None:
What about the following options:
- Dump local state only
- Only dump specific environment(s)
I was planning to add those in follow-up PRs to prevent this one from becoming too big.
Also, what is local state? It sounds like something that isn't even present in the StateSync. Is it a ContextDiff against prod?
Local state is `context.snapshots`, reflecting local changes that have not yet been persisted to the StateSync. To support it, we need some extra metadata in the state file to "taint" the file as non-importable if it contains local state.
Local state export is now available via:
$ sqlmesh state export --local -o local_state.json
And specific environments can now be exported like:
$ sqlmesh state export --environment foo --environment bar -o specific_environment_state.json
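A hedged sketch of how those two flags could be declared with click (purely illustrative and standalone; the real command hangs off the `state` group and the parameter names are assumptions). The `--environment` option is repeatable via `multiple=True`, and `--local` is a boolean flag:

```python
from pathlib import Path

import click


@click.command("export")
@click.option("-o", "--output-file", required=True, type=click.Path(path_type=Path))
@click.option(
    "--environment",
    "environments",
    multiple=True,  # repeatable: --environment foo --environment bar
    help="Export only the named environment(s).",
)
@click.option(
    "--local",
    is_flag=True,
    default=False,
    help="Export local, not-yet-persisted state instead of reading the state database.",
)
def export(output_file: Path, environments: tuple, local: bool) -> None:
    ...
```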
sqlmesh/core/context.py
# trigger a connection to the StateSync so we can fail early if there is a problem
self.state_sync.get_versions(validate=True)

if confirm and isinstance(self.console, TerminalConsole):
This logic should be a part of the console implementation. I don't like breaking the abstraction here.
I've refactored how the console interactions work and have moved it.
sqlmesh/core/context.py
dump_state(self.state_sync, output_file, self.console)

def load_state(self, input_file: Path, confirm: bool = True) -> None:
I'd prefer `export`/`import` over `dump`/`load`.
I've switched the terminology.
sqlmesh/core/state_sync/dump_load.py
yield "environments", _dump_environments(state_stream.environments) | ||
console.update_state_dump_environments(complete=True) | ||
|
||
yield "auto_restatements", _dump_auto_restatements(state_stream.auto_restatements) |
I don't know if we should literally dump our tables 1-to-1. This is way too low level. For example, a complete snapshot instance is assembled using data from the `_snapshots`, `_intervals` and `_auto_restatements` tables. I don't think we want to expose users to all these internals. Instead, I believe it should just be `environments`, `snapshots`, `versions`.
Additionally, the format should be compatible with export of the local state.
Regarding AutoRestatements, I could see that the `_auto_restatements` table is joined in when calling `get_snapshots()`, but I couldn't see how it was being populated if the snapshots were written back via `push_snapshots()`. But I guess part of the load could be to extract the auto restatement information from the Snapshot records themselves and call `update_auto_restatements()` to create the auto restatement records.
I've improved the import implementation to keep track of the auto restatements as the snapshots are being inserted and then insert them at the end.
This means the auto restatements table no longer needs to be written to the state file.
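A simplified sketch of that approach, under the assumption that the relevant metadata hangs off the snapshot records themselves (attribute, key, and batch-size choices below are illustrative, not the PR's actual model):

```python
import typing as t


def import_snapshots(state_sync: t.Any, snapshots: t.Iterable[t.Any]) -> None:
    """Push snapshots in batches while collecting auto-restatement metadata,
    then apply it once at the end instead of reading a separate file section."""
    auto_restatements: t.Dict[t.Any, t.Any] = {}
    batch: t.List[t.Any] = []
    for snapshot in snapshots:
        batch.append(snapshot)
        # If the snapshot carries a pending auto-restatement timestamp,
        # remember it as we stream past.
        next_ts = getattr(snapshot, "next_auto_restatement_ts", None)
        if next_ts is not None:
            auto_restatements[snapshot.name_version] = next_ts
        if len(batch) >= 1000:
            state_sync.push_snapshots(batch)
            batch.clear()
    if batch:
        state_sync.push_snapshots(batch)
    if auto_restatements:
        state_sync.update_auto_restatements(auto_restatements)
```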
Did a quick first pass, agree with Iaroslav's comments. Nice 👍
docs/concepts/state.md
@@ -0,0 +1,235 @@
# State

SQLMesh stores information about your project in a state database separate from your main warehouse.
[Nit] should we rephrase as "... possibly separate ..." here?
Well, I agree that technically correct is the best kind of correct :)
I guess it's about what we want to encourage. The point is that state is a different workload type from a warehouse workload, so for the best experience it needs to be stored in a suitable database type.
Of course this falls down if your warehouse is an OLTP database (PostgreSQL, MySQL, MSSQL), in which case storing state in your warehouse is perfectly fine.
I've tweaked this wording.
required=True,
type=click.Path(exists=True, dir_okay=False, readable=True, path_type=Path),
)
@click.option(
This effectively makes the default import strategy "merge" instead of "replace", so if the user wants to wipe all state they need to pass `--replace` specifically.
@izeigerman do you have a preference for what the default should be?
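For readers following along, a sketch of how that default could map onto the `clear` parameter of `StateSync.load` discussed earlier (the helper and wiring here are hypothetical, not the PR's actual code):

```python
from pathlib import Path
import typing as t


def open_state_stream(input_file: Path) -> t.Any:
    """Hypothetical helper standing in for however the PR turns the json file
    into a StateStream; stubbed here so the sketch is self-contained."""
    raise NotImplementedError


def import_state(context: t.Any, input_file: Path, replace: bool = False) -> None:
    stream = open_state_stream(input_file)
    # Default behaviour merges into existing state (clear=False); passing
    # --replace wipes existing state first by flipping clear=True.
    context.state_sync.load(stream, clear=replace)
```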
This PR implements the ability to export the state database to a file and import it back.
The state export file format is a `json` file. I tried to implement a streaming interface via the `StateStream` abstraction and the use of the json_stream library. The goal is to be able to dump large projects without loading everything into memory and crashing with an OOM.

In terms of version compatibility, there is a hard requirement to use the same version of SQLMesh to load the state as was used to dump it. This greatly simplifies the implementation and ensures our Pydantic model definitions will always be compatible. Guidance is included in the documentation on how to upgrade an older state file to be compatible with a new version of SQLMesh.
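As a sanity check on the streaming idea, here is a minimal sketch of reading such a file with json_stream. The file layout, key names, and where the version check happens are assumptions, not the PR's actual format:

```python
from pathlib import Path

import json_stream  # third-party library named in the description above


def read_state_file(path: Path, expected_version: str) -> None:
    with open(path, "r") as f:
        data = json_stream.load(f)  # lazy, forward-only view of the document
        # Enforce the hard requirement that the file was produced by the same
        # SQLMesh version that is importing it.
        if data["metadata"]["sqlmesh_version"] != expected_version:
            raise ValueError("State file was written by a different SQLMesh version")
        # Consume snapshots one at a time so a large project never has to fit
        # in memory all at once.
        for name, snapshot in data["snapshots"].items():
            ...
```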
State export: (screenshot)
State import: (screenshot)