Skip to content
Snippets Groups Projects
  1. Mar 14, 2024
  2. Mar 13, 2024
  3. Mar 12, 2024
    • Erik Johnston's avatar
    • Erik Johnston's avatar
    • Gerrit Gogel's avatar
      Prevent locking up while processing batched_auth_events (#16968) · 1f887907
      Gerrit Gogel authored
      This PR aims to fix #16895, caused by a regression in #7 and not fixed
      by #16903. The PR #16903 only fixes a starvation issue, where the CPU
      isn't released. There is a second issue, where the execution is blocked.
      This theory is supported by the flame graphs provided in #16895 and the
      fact that I see the CPU usage reducing and far below the limit.
      
      Since the changes in #7, the method `check_state_independent_auth_rules`
      is called with the additional parameter `batched_auth_events`:
      
      
      https://github.com/element-hq/synapse/blob/6fa13b4f927c10b5f4e9495be746ec28849f5cb6/synapse/handlers/federation_event.py#L1741-L1743
      
      
      It makes the execution enter this if clause, introduced with #15195
      
      
      https://github.com/element-hq/synapse/blob/6fa13b4f927c10b5f4e9495be746ec28849f5cb6/synapse/event_auth.py#L178-L189
      
      There are two issues in the above code snippet.
      
      First, there is the blocking issue. I'm not entirely sure if this is a
      deadlock, starvation, or something different. In the beginning, I
      thought the copy operation was responsible. It wasn't. Then I
      investigated the nested `store.get_events` inside the function `update`.
      This was also not causing the blocking issue. Only when I replaced the
      set difference operation (`-` ) with a list comprehension, the blocking
      was resolved. Creating and comparing sets with a very large amount of
      events seems to be problematic.
      
      This is how the flamegraph looks now while persisting outliers. As you
      can see, the execution no longer locks up in the above function.
      
      ![output_2024-02-28_13-59-40](https://github.com/element-hq/synapse/assets/13143850/6db9c9ac-484f-47d0-bdde-70abfbd773ec)
      
      Second, the copying here doesn't serve any purpose, because only a
      shallow copy is created. This means the same objects from the original
      dict are referenced. This fails the intention of protecting these
      objects from mutation. The review of the original PR
      https://github.com/matrix-org/synapse/pull/15195 had an extensive
      discussion about this matter.
      
      Various approaches to copying the auth_events were attempted:
      1) Implementing a deepcopy caused issues due to
      builtins.EventInternalMetadata not being pickleable.
      2) Creating a dict with new objects akin to a deepcopy.
      3) Creating a dict with new objects containing only necessary
      attributes.
      
      Concluding, there is no easy way to create an actual copy of the
      objects. Opting for a deepcopy can significantly strain memory and CPU
      resources, making it an inefficient choice. I don't see why the copy is
      necessary in the first place. Therefore I'm proposing to remove it
      altogether.
      
      After these changes, I was able to successfully join these rooms,
      without the main worker locking up:
      - #synapse:matrix.org
      - #element-android:matrix.org
      - #element-web:matrix.org
      - #ecips:matrix.org
      - #ipfs-chatter:ipfs.io
      - #python:matrix.org
      - #matrix:matrix.org
      Unverified
      1f887907
    • Erik Johnston's avatar
      1.103.0rc1 · 9d7880c0
      Erik Johnston authored
      9d7880c0
  4. Mar 11, 2024
  5. Mar 08, 2024
  6. Mar 06, 2024
    • Quentin Gliech's avatar
      Fix joining remote rooms when a `on_new_event` callback is registered (#16973) · 4af33015
      Quentin Gliech authored
      Since Synapse 1.76.0, any module which registers a `on_new_event`
      callback would brick the ability to join remote rooms.
      This is because this callback tried to get the full state of the room,
      which would end up in a deadlock.
      
      Related:
      https://github.com/matrix-org/synapse-auto-accept-invite/issues/18
      
      The following module would brick the ability to join remote rooms:
      
      ```python
      from typing import Any, Dict, Literal, Union
      import logging
      
      from synapse.module_api import ModuleApi, EventBase
      
      logger = logging.getLogger(__name__)
      
      class MyModule:
          def __init__(self, config: None, api: ModuleApi):
              self._api = api
              self._config = config
      
              self._api.register_third_party_rules_callbacks(
                  on_new_event=self.on_new_event,
              )
      
          async def on_new_event(self, event: EventBase, _state_map: Any) -> None:
              logger.info(f"Received new event: {event}")
      
          @staticmethod
          def parse_config(_config: Dict[str, Any]) -> None:
              return None
      ```
      
      This is technically a breaking change, as we are now passing partial
      state on the `on_new_event` callback.
      However, this callback was broken for federated rooms since 1.76.0, and
      local rooms have full state anyway, so it's unlikely that it would
      change anything.
      Unverified
      4af33015
  7. Mar 05, 2024
  8. Feb 23, 2024
  9. Feb 22, 2024
  10. Feb 21, 2024
  11. Feb 20, 2024
Loading