cluster_test.py::test_migration_rebalance_node #4551

Open
BagritsevichStepan opened this issue Feb 3, 2025 · 1 comment · May be fixed by #4576
Labels: bug (Something isn't working), failing-test, iouring (iouring backend)

@BagritsevichStepan
Contributor

https://github.com/dragonflydb/dragonfly/actions/runs/13063852208/job/36452588898

...
stderr=b''.join(stderr_seq) if stderr_seq else None)
E           subprocess.TimeoutExpired: Command '['/__w/dragonfly/dragonfly/build/dragonfly', '--proactor_threads=4', '--cluster_mode=yes', '--port=30123', '--admin_port=30124', '--vmodule=outgoing_slot_migration=2,cluster_family=2,incoming_slot_migration=2,streamer=2', '--dbfilename=', '--noversion_check', '--maxmemory=8G', '--jsonpathv2', '--list_experimental_v2', '--log_dir=/tmp/dragonfly_logs/test_migration_rebalance_node_df_seeder_factory0-df_factory0_', '--serialization_max_chunk_size=300000', '--fiber_safety_margin=4096', '--num_shards=3']' timed out after 120 seconds

/usr/lib/python3.8/subprocess.py:1072: TimeoutExpired

During handling of the above exception, another exception occurred:

self = Factory({'proactor_threads': 4, 'cluster_mode': 'yes'})

    async def stop_all(self):
        """Stop all launched instances."""
        exceptions = []  # To collect exceptions
        for instance in self.instances:
            await instance.close_clients()
            try:
                instance.stop()

dragonfly/instance.py:464: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = :30123, kill = False

    def stop(self, kill=False):
        proc, self.proc = self.proc, None
        if proc is None:
            return
    
        logging.debug(f"Stopping instance on {self._port}")
        try:
            if kill:
                proc.kill()
            else:
                proc.terminate()
                proc.communicate(timeout=120)
                # if the return code is 0 it means normal termination
                # if the return code is negative it means termination by signal
                # if the return code is positive it means abnormal exit
                if proc.returncode != 0:
                    raise Exception(
                        f"Dragonfly did not terminate gracefully, exit code {proc.returncode}, "
                        f"pid: {proc.pid}"
                    )
    
        except subprocess.TimeoutExpired:
            # We need to send SIGUSR1 to DF such that it prints the stacktrace
            proc.send_signal(signal.SIGUSR1)
            # Then we sleep for 5 seconds such that DF has enough time to print the stacktraces
            # We can't really synchronize here because SIGTERM and SIGKILL do not block even if
            # sigaction explicitly blocks other incoming signals until it handles SIGUSR1.
            # Even worse, on SIGTERM and SIGKILL none of the handlers registered via sigaction
            # are guaranteed to run
            time.sleep(5)
            logging.debug(f"Unable to kill the process on port {self._port}")
            logging.debug(f"INFO LOGS of DF are:")
            self.print_info_logs_to_debug_log()
            proc.kill()
            proc.communicate()
            raise Exception("Unable to terminate DragonflyDB gracefully, it was killed")
E           Exception: Unable to terminate DragonflyDB gracefully, it was killed

dragonfly/instance.py:258: Exception

The above exception was the direct cause of the following exception:

    def finalizer() -> None:
        """Yield again, to finalize."""
    
        async def async_finalizer() -> None:
            try:
                await gen_obj.__anext__()
            except StopAsyncIteration:
                pass
            else:
                msg = "Async generator fixture didn't stop."
                msg += "Yield only once."
                raise ValueError(msg)
    
        event_loop.run_until_complete(async_finalizer())

/usr/local/lib/python3.8/dist-packages/pytest_asyncio/plugin.py:276: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/lib/python3.8/asyncio/base_events.py:616: in run_until_complete
    return future.result()
/usr/local/lib/python3.8/dist-packages/pytest_asyncio/plugin.py:268: in async_finalizer
    await gen_obj.__anext__()
dragonfly/conftest.py:146: in df_factory
    await factory.stop_all()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Factory({'proactor_threads': 4, 'cluster_mode': 'yes'})

    async def stop_all(self):
        """Stop all launched instances."""
        exceptions = []  # To collect exceptions
        for instance in self.instances:
            await instance.close_clients()
            try:
                instance.stop()
            except Exception as e:
                exceptions.append(e)  # Collect the exception
        if exceptions:
            first_exception = exceptions[0]
            raise Exception(
                f"One or more errors occurred while stopping instances. "
                f"First exception: {first_exception}"
            ) from first_exception
E           Exception: One or more errors occurred while stopping instances. First exception: Unable to terminate DragonflyDB gracefully, it was killed

dragonfly/instance.py:469: Exception
----------------------------- Captured stdout call -----------------------------
...
cpu time 2.44960880279541 batches 1717 commands 171700
...
------------------------------ Captured log call -------------------------------
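For context, the shutdown path visible in the traceback above follows a terminate-then-escalate pattern: send SIGTERM, wait up to 120 seconds, send SIGUSR1 so Dragonfly prints its fiber stacktraces, then SIGKILL. A minimal standalone sketch of that pattern (the binary path, flag, and timeout values below are placeholders, not the harness defaults):

```python
import signal
import subprocess

def stop_with_escalation(proc: subprocess.Popen, timeout: int = 120) -> None:
    """Terminate a process, escalating SIGTERM -> SIGUSR1 (stack dump) -> SIGKILL."""
    try:
        proc.terminate()                   # SIGTERM: request graceful shutdown
        proc.communicate(timeout=timeout)  # wait for the process to exit
        if proc.returncode != 0:
            raise Exception(f"abnormal exit, code {proc.returncode}, pid {proc.pid}")
    except subprocess.TimeoutExpired:
        # The process is stuck (as in this failure): ask it to dump stacktraces first.
        proc.send_signal(signal.SIGUSR1)
        try:
            proc.communicate(timeout=5)    # brief grace period for the stack dump
        except subprocess.TimeoutExpired:
            pass
        proc.kill()                        # SIGKILL as the last resort
        proc.communicate()
        raise Exception("process did not terminate gracefully and was killed")

# Hypothetical usage; "/path/to/dragonfly" and the flag are placeholders.
proc = subprocess.Popen(["/path/to/dragonfly", "--port=30123"])
stop_with_escalation(proc)
```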
@BagritsevichStepan added the bug, failing-test and iouring labels on Feb 3, 2025
@adiholden
Collaborator

In this test we expose a bug in migration finalization, which also prevents the server from being terminated.
Here are some interesting logs taken from the failure:

W20250131 00:46:05.284759 21149 outgoing_slot_migration.cc:355] Incorrect response type for a1bb33bff8fe453b7c6ed5f28db82360914ef3f4 : 7c66a6e1a7d77c5d84bf8da3509d7ca45c2518f2 attempt 1 msg: ERR Join timeout happened

I20250131 00:46:19.853883 21149 accept_server.cc:25] Exiting on signal Terminated
I20250131 00:46:19.854838 21151 listener_interface.cc:224] Listener stopped for port 30124
I20250131 00:46:19.855050 21152 listener_interface.cc:224] Listener stopped for port 30123
I20250131 00:46:19.855865 21149 cluster_family.cc:825] Outgoing migration cancelled: slots [8365, 16383] to 127.0.0.1:30130
I20250131 00:46:19.855890 21149 outgoing_slot_migration.cc:150] Finish outgoing migration for a1bb33bff8fe453b7c6ed5f28db82360914ef3f4: 7c66a6e1a7d77c5d84bf8da3509d7ca45c2518f2
I20250131 00:46:19.855962 21150 streamer.cc:83] JournalStreamer::Cancel
I20250131 00:46:19.855968 21151 streamer.cc:83] JournalStreamer::Cancel
I20250131 00:46:19.856185 21149 streamer.cc:83] JournalStreamer::Cancel

I20250131 00:48:19.974931 21149 scheduler.cc:460] ------------ Fiber shard_handler_periodic0 (sleeping until 1387387359966 now is 1387320559180) ------------
0x1336eb4 boost::context::detail::fiber_ontop<>()
0x133688d boost::context::fiber::resume_with<>()
0x1335b1f util::fb2::detail::FiberInterface::SwitchTo()
0x132aa3b util::fb2::detail::Scheduler::Preempt()
0x132bc73 util::fb2::detail::Scheduler::WaitUntil()
0x55d0b5 util::fb2::detail::FiberInterface::WaitUntil()
0x131f8d8 util::fb2::EventCount::wait_until()
0x460880 util::fb2::EventCount::await_until<>()
0x45e9ce util::fb2::Done::Impl::WaitFor()
0x45e8ce util::fb2::Done::WaitFor()
0xaaae46 dfly::RunFPeriodically()
0xaab562 dfly::EngineShard::StartPeriodicShardHandlerFiber()::{lambda()#1}::operator()()
0xab28b5 std::__invoke_impl<>()
0xab226d std::__invoke<>()
0xab1d3f std::__apply_impl<>()

I20250131 00:48:19.974965 21149 scheduler.cc:460] ------------ Fiber outgoing_migration (suspended) ------------
0x1336eb4 boost::context::detail::fiber_ontop<>()
0x133688d boost::context::fiber::resume_with<>()
0x1335b1f util::fb2::detail::FiberInterface::SwitchTo()
0x132aa3b util::fb2::detail::Scheduler::Preempt()
0x440de6 util::fb2::detail::FiberInterface::Suspend()
0x133524c util::fb2::detail::FiberInterface::Join()
0x131115e util::fb2::Fiber::Join()
0x1311326 util::fb2::Fiber::JoinIfNeeded()
0x9c6566 dfly::cluster::OutgoingMigration::FinalizeMigration()::{lambda()#2}::operator()()
0x9c9630 absl::lts_20240722::cleanup_internal::Storage<>::InvokeCallback()
0x9c8650 absl::lts_20240722::Cleanup<>::~Cleanup()
0x9c76ce dfly::cluster::OutgoingMigration::FinalizeMigration()
0x9c5ef4 dfly::cluster::OutgoingMigration::SyncFb()
0x9d0e58 std::__invoke_impl<>()
0x9d0a47 std::__invoke<>()
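The outgoing_migration fiber above appears blocked in Fiber::Join() from the cleanup callback inside OutgoingMigration::FinalizeMigration(), consistent with the server failing to terminate after SIGTERM. As a rough analogy only (Python threads standing in for fb2 fibers, all names hypothetical), the hang pattern is a cleanup step joining a worker that waits for a signal the cancellation path never delivers:

```python
import threading

ack_received = threading.Event()

def wait_for_ack() -> None:
    # Stand-in for the fiber that FinalizeMigration() joins: it waits for an
    # acknowledgement that, once the migration has been cancelled, never arrives.
    ack_received.wait()

worker = threading.Thread(target=wait_for_ack)
worker.start()

def finalize_migration() -> None:
    # Stand-in for the absl::Cleanup lambda in FinalizeMigration(): joining the
    # worker blocks because nothing sets `ack_received` on the cancellation path.
    worker.join(timeout=5)  # the real Join() has no timeout, so it hangs forever
    if worker.is_alive():
        print("finalization is stuck: the joined fiber never completed")

finalize_migration()
# A fix would have the cancellation path complete/signal the joined fiber first:
ack_received.set()
worker.join()
```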

@BorysTheDev linked a pull request on Feb 7, 2025 that will close this issue