Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: state import/export #4038

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Feat: state import/export #4038

wants to merge 1 commit into from

Conversation

erindru
Copy link
Collaborator

@erindru erindru commented Mar 26, 2025

This PR implements the ability to export the state database to a file and import it back.

The state export file format is a json file. I tried to implement a streaming interface via the StateStream abstraction and the use of the json_stream library. The goal is to be able to dump large projects without loading everything into memory and crashing with an OOM.

In terms of version compatibility, there is a hard requirement to use the same version of SQLMesh to load the state as was used to dump it. This greatly simplifies the implementation and ensures our Pydantic model definitions will always be compatible. Guidance is included in the documentation on how to upgrade an older state file to be compatible with a new version of SQLMesh.

State export:
Screenshot From 2025-03-31 15-43-59

State import:
Screenshot From 2025-03-31 15-43-33



@cli.group(no_args_is_help=True)
def state() -> None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a slight departure from our existing structure but I decided to group state operations to keep things open for more operations in future (such as perhaps being able to query state).

So the CLI syntax is sqlmesh state dump or sqlmesh state load vs something like sqlmesh state_dump or sqlmesh state_load.

@treysp i'd be keen to know if this is a direction you were planning to head in given the recent CLI refactoring work

f"This [b]destructive[/b] operation will delete all existing state against the '{self.selected_gateway}' gateway \n"
f"and replace it with what's in the '{input_file.as_posix()}' file."
)
if isinstance(self.console, TerminalConsole):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some trouble figuring out how to handle the confirmations since they need to go before the load or dump progress even starts.

I'm not super happy with this implementation, open to ideas

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic needs to go into the console implementation itself.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored how the console interactions work and have moved it

@@ -459,6 +464,16 @@ def add_interval(
)
self.add_snapshots_intervals([snapshot_intervals])

@abc.abstractmethod
def load(self, stream: StateStream, clear: bool = True) -> None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to make dump/load first class citizens on StateSync rather than trying to coordinate everything over the public interface.

The reason is to allow different StateSync implementations to perform local optimizations or call internal methods without exposing them publicly and also helps with being able to wrap things in transactions.

@erindru erindru marked this pull request as ready for review March 26, 2025 22:35
@@ -2074,6 +2077,56 @@ def clear_caches(self) -> None:
for path in self.configs:
rmtree(path / c.CACHE)

def dump_state(self, output_file: Path, confirm: bool = True) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the following options:

  • Dump local state only
  • Only dump a specific environment(s)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was planning to add those in follow-up PR's to prevent this one from becoming too big.

Also, what is local state? Sounds like something that isn't even present in the StateSync, is it a ContextDiff against prod?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local state is context.snapshots reflecting local changes that have not yet been persisted to the StateSync.

To support it, we need some extra metadata in the state file to "taint" the file as non importable if it contains local state

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local state export is now available via:

$ sqlmesh state export --local -o local_state.json

And specific environments can now be exported like:

$ sqlmesh state export --environment foo --environment bar -o specific_environment_state.json

# trigger a connection to the StateSync so we can fail early if there is a problem
self.state_sync.get_versions(validate=True)

if confirm and isinstance(self.console, TerminalConsole):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic should be a part of the console implementation. I don't like breaking the abstraction here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored how the console interactions work and have moved it


dump_state(self.state_sync, output_file, self.console)

def load_state(self, input_file: Path, confirm: bool = True) -> None:
Copy link
Member

@izeigerman izeigerman Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer export / import over dump / load.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've switched the terminology

yield "environments", _dump_environments(state_stream.environments)
console.update_state_dump_environments(complete=True)

yield "auto_restatements", _dump_auto_restatements(state_stream.auto_restatements)
Copy link
Member

@izeigerman izeigerman Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we should literally dump our tables 1-to-1. This is way too low level. For example a complete snapshot instance is assembled using data from _snapshots, _intervals and _auto_restatements tables. I don't think we want to expose users to all these internals. Instead, I believe it should just be environments, snapshots, versions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, the format should be compatible with export of the local state.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding AutoRestatements, I could see that the _auto_restatements table is joined in when calling get_snapshots() but I couldn't see how it was being populated if the snapshots were written back via push_snapshots().

But I guess part of the load could be to extract the auto restatement information from the Snapshot records themselves and call update_auto_restatements() to create the auto restatement records

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've improved the import implementation to keep track of the auto restatements as the snapshots are being inserted and then insert them at the end.

This means the auto restatements table no longer needs to be written to the state file

Copy link
Contributor

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a quick first pass, agree with Iaroslav's comments. Nice 👍

@@ -0,0 +1,235 @@
# State

SQLMesh stores information about your project in a state database separate from your main warehouse.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] should we rephrase as "... possibly separate ..." here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I agree that technically correct is the best kind of correct :)

I guess it's about what we want to encourage. The point is that state is a different workload type to a warehouse workload so for the best experience it needs to be stored in a suitable database type.

Of course this falls down if your warehouse is an OLTP database (PostgreSQL, MySQL, MSSQL) in which case storing state in your warehouse is perfectly fine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tweaked this wording

@erindru erindru changed the title Feat: state dump/load Feat: state import/export Mar 31, 2025
required=True,
type=click.Path(exists=True, dir_okay=False, readable=True, path_type=Path),
)
@click.option(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This effectively makes the default import strategy "merge" instead of "replace", so if the user wants to wipe all state they need to pass --replace specifically.

@izeigerman do you have a preference of what the default should be?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants