Skip to content
Snippets Groups Projects
Forked from Maunium / synapse
Source project has a limited visibility.
  • Sean Quah's avatar
    68db233f
    Handle race between persisting an event and un-partial stating a room (#13100) · 68db233f
    Sean Quah authored
    Whenever we want to persist an event, we first compute an event context,
    which includes the state at the event and a flag indicating whether the
    state is partial. After a lot of processing, we finally try to store the
    event in the database, which can fail for partial state events when the
    containing room has been un-partial stated in the meantime.
    
    We detect the race as a foreign key constraint failure in the data store
    layer and turn it into a special `PartialStateConflictError` exception,
    which makes its way up to the method in which we computed the event
    context.
    
    To make things difficult, the exception needs to cross a replication
    request: `/fed_send_events` for events coming over federation and
    `/send_event` for events from clients. We transport the
    `PartialStateConflictError` as a `409 Conflict` over replication and
    turn `409`s back into `PartialStateConflictError`s on the worker making
    the request.
    
    All client events go through
    `EventCreationHandler.handle_new_client_event`, which is called in
    *a lot* of places. Instead of trying to update all the code which
    creates client events, we turn the `PartialStateConflictError` into a
    `429 Too Many Requests` in
    `EventCreationHandler.handle_new_client_event` and hope that clients
    take it as a hint to retry their request.
    
    On the federation event side, there are 7 places which compute event
    contexts. 4 of them use outlier event contexts:
    `FederationEventHandler._auth_and_persist_outliers_inner`,
    `FederationHandler.do_knock`, `FederationHandler.on_invite_request` and
    `FederationHandler.do_remotely_reject_invite`. These events won't have
    the partial state flag, so we do not need to do anything for then.
    
    The remaining 3 paths which create events are
    `FederationEventHandler.process_remote_join`,
    `FederationEventHandler.on_send_membership_event` and
    `FederationEventHandler._process_received_pdu`.
    
    We can't experience the race in `process_remote_join`, unless we're
    handling an additional join into a partial state room, which currently
    blocks, so we make no attempt to handle it correctly.
    
    `on_send_membership_event` is only called by
    `FederationServer._on_send_membership_event`, so we catch the
    `PartialStateConflictError` there and retry just once.
    
    `_process_received_pdu` is called by `on_receive_pdu` for incoming
    events and `_process_pulled_event` for backfill. The latter should never
    try to persist partial state events, so we ignore it. We catch the
    `PartialStateConflictError` in `on_receive_pdu` and retry just once.
    
    Refering to the graph of code paths in
    https://github.com/matrix-org/synapse/issues/12988#issuecomment-1156857648
    
    
    may make the above make more sense.
    
    Signed-off-by: default avatarSean Quah <seanq@matrix.org>
    68db233f
    History
    Handle race between persisting an event and un-partial stating a room (#13100)
    Sean Quah authored
    Whenever we want to persist an event, we first compute an event context,
    which includes the state at the event and a flag indicating whether the
    state is partial. After a lot of processing, we finally try to store the
    event in the database, which can fail for partial state events when the
    containing room has been un-partial stated in the meantime.
    
    We detect the race as a foreign key constraint failure in the data store
    layer and turn it into a special `PartialStateConflictError` exception,
    which makes its way up to the method in which we computed the event
    context.
    
    To make things difficult, the exception needs to cross a replication
    request: `/fed_send_events` for events coming over federation and
    `/send_event` for events from clients. We transport the
    `PartialStateConflictError` as a `409 Conflict` over replication and
    turn `409`s back into `PartialStateConflictError`s on the worker making
    the request.
    
    All client events go through
    `EventCreationHandler.handle_new_client_event`, which is called in
    *a lot* of places. Instead of trying to update all the code which
    creates client events, we turn the `PartialStateConflictError` into a
    `429 Too Many Requests` in
    `EventCreationHandler.handle_new_client_event` and hope that clients
    take it as a hint to retry their request.
    
    On the federation event side, there are 7 places which compute event
    contexts. 4 of them use outlier event contexts:
    `FederationEventHandler._auth_and_persist_outliers_inner`,
    `FederationHandler.do_knock`, `FederationHandler.on_invite_request` and
    `FederationHandler.do_remotely_reject_invite`. These events won't have
    the partial state flag, so we do not need to do anything for then.
    
    The remaining 3 paths which create events are
    `FederationEventHandler.process_remote_join`,
    `FederationEventHandler.on_send_membership_event` and
    `FederationEventHandler._process_received_pdu`.
    
    We can't experience the race in `process_remote_join`, unless we're
    handling an additional join into a partial state room, which currently
    blocks, so we make no attempt to handle it correctly.
    
    `on_send_membership_event` is only called by
    `FederationServer._on_send_membership_event`, so we catch the
    `PartialStateConflictError` there and retry just once.
    
    `_process_received_pdu` is called by `on_receive_pdu` for incoming
    events and `_process_pulled_event` for backfill. The latter should never
    try to persist partial state events, so we ignore it. We catch the
    `PartialStateConflictError` in `on_receive_pdu` and retry just once.
    
    Refering to the graph of code paths in
    https://github.com/matrix-org/synapse/issues/12988#issuecomment-1156857648
    
    
    may make the above make more sense.
    
    Signed-off-by: default avatarSean Quah <seanq@matrix.org>