  1. Aug 30, 2022
    • Fix bug where we wedge media plugins if clients disconnect early (#13660) · 1c26acd8
      Erik Johnston authored
      We incorrectly failed to use the returned `Responder` when the client had
      already disconnected, which meant that the resource held by the `Responder`
      was never released.
      
      In particular, this exhausted the thread pools so that *all* requests
      timed out.
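
      As a rough illustration of the failure mode described above (a minimal sketch, not Synapse's actual media code; `fetch`, `write_to`, `close`, and `request_has_disconnected` are assumed names):

      ```python
      # Sketch: the provider hands back a Responder that holds a pooled resource
      # (e.g. a file handle served from a thread pool). It must be released even
      # if the client has already gone away.
      async def respond_with_media(request, provider, media_id):
          responder = await provider.fetch(media_id)  # may hold a pooled resource
          if responder is None:
              return
          try:
              if request_has_disconnected(request):
                  # The bug: the old code bailed out at this point without ever
                  # using or closing the responder, so the pooled resource leaked
                  # until the thread pool was exhausted and all requests timed out.
                  return
              await responder.write_to(request)
          finally:
              # Release the resource whether or not we streamed anything.
              await responder.close()
      ```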
    • Do not wait for background updates to complete to expire URL cache. (#13657) · 303b40b9
      Patrick Cloke authored
      Media downloaded as part of a URL preview is normally deleted after two days.
      However, while a background database migration is running, this expiry is
      skipped. A long-running database migration can therefore cause the media
      store to fill up with old preview files.
      
      This logic was added in #2697 to make sure that we didn't try to run the expiry
      without an index on `local_media_repository.created_ts`; the original logic that
      needs that index was added in #2478 (in `get_url_cache_media_before`, as
      amended by 93247a42), and is still present.
      
      Given that the background update was added before Synapse v1.0.0, just drop
      this check and assume the index exists.
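
      For context, a rough sketch of the expiry pass this relies on (illustrative only: the table and `created_ts` column come from the commit message; the `url_cache` filter, the psycopg2-style cursor, and everything else here is assumed):

      ```python
      import time

      TWO_DAYS_MS = 2 * 24 * 60 * 60 * 1000

      def get_expired_url_cache_media(cur):
          """Find URL-preview media older than two days. The `created_ts < %s`
          predicate is why the index on local_media_repository.created_ts
          matters for this query."""
          cutoff_ts = int(time.time() * 1000) - TWO_DAYS_MS
          cur.execute(
              "SELECT media_id FROM local_media_repository"
              " WHERE url_cache IS NOT NULL AND created_ts < %s",
              (cutoff_ts,),
          )
          return [media_id for (media_id,) in cur.fetchall()]
      ```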
    • Speed up inserting into `event_push_actions_staging`. (#13634) · 20df96a7
      Patrick Cloke authored
      By using `execute_values` instead of `execute_batch`.
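
      For reference, a hedged sketch of the difference with psycopg2 (the column list and function names are illustrative, not Synapse's actual schema or code):

      ```python
      from psycopg2.extras import execute_batch, execute_values

      # rows: (event_id, user_id, actions) tuples such as
      # ("$event1", "@alice:example.org", '["notify"]'); cur: an open psycopg2 cursor.

      def insert_with_execute_batch(cur, rows):
          # Runs the same single-row INSERT once per tuple, batching statements
          # to reduce round trips.
          execute_batch(
              cur,
              "INSERT INTO event_push_actions_staging (event_id, user_id, actions)"
              " VALUES (%s, %s, %s)",
              rows,
          )

      def insert_with_execute_values(cur, rows):
          # Builds a single INSERT with one multi-row VALUES list, which is
          # typically much faster for bulk inserts.
          execute_values(
              cur,
              "INSERT INTO event_push_actions_staging (event_id, user_id, actions)"
              " VALUES %s",
              rows,
          )
      ```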
    • Optimize how we calculate `likely_domains` during backfill (#13575) · 51d732db
      Eric Eastwood authored
      Optimize how we calculate `likely_domains` during backfill, because I've seen this take 17s in production just to `get_current_state`, which is then fed into `get_domains_from_state` (see case [*2. Loading tons of events* in the `/messages` investigation issue](https://github.com/matrix-org/synapse/issues/13356)).
      
      There are 3 ways we currently calculate hosts that are in the room:
      
       1. `get_current_state` -> `get_domains_from_state`
          - Used in `backfill` to calculate `likely_domains` and `/timestamp_to_event` because it was cargo-culted from `backfill`
          - This one is being eliminated in favor of `get_current_hosts_in_room` in this PR 🕳️
       1. `get_current_hosts_in_room`
          - Used for other federation things like sending read receipts and typing indicators
       1. `get_hosts_in_room_at_events`
          - Used when pushing out events over federation to other servers in the `_process_event_queue_loop`
      
      Fix https://github.com/matrix-org/synapse/issues/13626
      
      Part of https://github.com/matrix-org/synapse/issues/13356
      
      Mentioned in [internal doc](https://docs.google.com/document/d/1lvUoVfYUiy6UaHB6Rb4HicjaJAU40-APue9Q4vzuW3c/edit#bookmark=id.2tvwz3yhcafh)
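
      As a rough Python sketch (not Synapse's actual code) of what the option 1 path computes in Python, and what the optimized query below pushes down into Postgres:

      ```python
      from collections import defaultdict

      def domains_from_member_events(joined_members):
          """joined_members: (state_key, depth) pairs for m.room.member join
          events, e.g. ("@alice:example.org", 12)."""
          min_depth = defaultdict(lambda: float("inf"))
          for state_key, depth in joined_members:
              # A Matrix user ID is "@localpart:server_name"; everything after
              # the first colon is the host, the same split the SQL regex
              # '@[^:]*:(.*)$' performs below.
              host = state_key.split(":", 1)[1]
              min_depth[host] = min(min_depth[host], depth)
          # Hosts whose members joined earliest come first, mirroring
          # ORDER BY min(e.depth) ASC in the optimized query.
          return sorted(min_depth, key=min_depth.get)
      ```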
      
      
      ### Query performance
      
      #### Before
      
      The query from `get_current_state` sucks just because we have to get all 80k events. And we see almost the exact same performance locally trying to get all of these events (16s vs 17s):
      ```
      synapse=# SELECT type, state_key, event_id FROM current_state_events WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
      Time: 16035.612 ms (00:16.036)
      
      synapse=# SELECT type, state_key, event_id FROM current_state_events WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
      Time: 4243.237 ms (00:04.243)
      ```
      
      But what about `get_current_hosts_in_room`: when there are 8M rows in the `current_state_events` table, the previous query in `get_current_hosts_in_room` took 13s from complete freshness (when the events were first added), but takes 930ms after a Postgres restart or 390ms when run back to back.
      
      ```sh
      $ psql synapse
      synapse=# \timing on
      synapse=# SELECT COUNT(DISTINCT substring(state_key FROM '@[^:]*:(.*)$'))
      FROM current_state_events
      WHERE
          type = 'm.room.member'
          AND membership = 'join'
          AND room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
       count
      -------
        4130
      (1 row)
      
      Time: 13181.598 ms (00:13.182)
      
      synapse=# SELECT COUNT(*) from current_state_events where room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
       count
      -------
       80814
      
      synapse=# SELECT COUNT(*) from current_state_events;
        count
      ---------
       8162847
      
      synapse=# SELECT pg_size_pretty( pg_total_relation_size('current_state_events') );
       pg_size_pretty
      ----------------
       4702 MB
      ```
      
      #### After
      
      I'm not sure how long it takes from complete freshness, as I only really get that opportunity once (short of restarting the computer, which is cumbersome), and it's not really relevant to normal operating times. You probably get closer to the fresh times the more access variability there is, so that Postgres's caches aren't as exact. Update: the longest I've seen this run is 6.4s, and 4.5s after a computer restart.
      
      After a Postgres restart, it takes 330ms and running back to back takes 260ms.
      
      ```sh
      $ psql synapse
      synapse=# \timing on
      Timing is on.
      synapse=# SELECT
          substring(c.state_key FROM '@[^:]*:(.*)$') as host
      FROM current_state_events c
      /* Get the depth of the event from the events table */
      INNER JOIN events AS e USING (event_id)
      WHERE
          c.type = 'm.room.member'
          AND c.membership = 'join'
          AND c.room_id = '!OGEhHVWSdvArJzumhm:matrix.org'
      GROUP BY host
      ORDER BY min(e.depth) ASC;
      Time: 333.800 ms
      ```
      
      #### Going further
      
      To improve things further we could add a `limit` parameter to `get_current_hosts_in_room`. Realistically, we don't need 4k domains to choose from because there is no way we're going to query that many before we either a) get an answer or b) give up.
      
      Another thing we can do is optimize the query to use an index skip scan:
      
       - https://wiki.postgresql.org/wiki/Loose_indexscan
       - Index Skip Scan, https://commitfest.postgresql.org/37/1741/
       - https://www.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
      