Fix ControlHost data corruption issues

Tamas Gal requested to merge optimise-controlhost-stream into master

The socket.recv(size) function in Python treats the "size" argument as a "maximum amount of data" (cf. https://docs.python.org/3/library/socket.html#socket.socket.recv). There are two places in the ControlHost implementation where socket.recv() is called. The first call receives the prefix (16 bytes), which is then used to determine the remaining number of bytes and to choose the correct routine to parse the message.

The second call to socket.recv() obtains the remaining bytes. It uses a buffer and loops until the exact number of bytes has been received, which explains why there was never an issue with the actual message "body".

Since the size argument is only the "maximum amount of data", socket.recv() can (prematurely) return fewer bytes than requested. This happens very rarely for small amounts of bytes, but it was still observed from time to time in the monitoring system when reading the prefix. It resulted in an error (Failed to construct Prefix, reconnecting.), and the corresponding ControlHost client then automatically reconnected to the Ligier.
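The buffered loop already used for the message body can be applied to the prefix as well. The following is a minimal sketch of that idea, not the code from this merge request: the names recv_exactly and PREFIX_SIZE are hypothetical and only the 16-byte prefix size is taken from the description above.

```python
import socket

# Assumption: 16 bytes, as stated for the ControlHost prefix above.
PREFIX_SIZE = 16


def recv_exactly(sock, size):
    """Read exactly `size` bytes from `sock` (hypothetical helper).

    socket.recv() may return fewer bytes than requested, so keep
    calling it until the buffer holds the full amount.
    """
    buffer = bytearray()
    while len(buffer) < size:
        # Only ask for what is still missing.
        chunk = sock.recv(size - len(buffer))
        if not chunk:
            # An empty result means the peer closed the connection.
            raise ConnectionError(
                "connection closed after {} of {} bytes".format(len(buffer), size)
            )
        buffer.extend(chunk)
    return bytes(buffer)
```

With such a helper, the prefix would be read via recv_exactly(sock, PREFIX_SIZE) instead of a single sock.recv(PREFIX_SIZE), so a short read no longer produces a truncated prefix.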

The highest rate of ControlHost messages is handled by the msg_dumper.py scripts, which unsurprisingly have more occurrences of this error (see below).

This also explains why the Python implementation of the ligiermirror occasionally showed signs of data corruption. That application has been replaced by a new implementation: JLigierMirror.

Here is a grep of the log messages from the past months:

$ zgrep "Failed to construct Prefix" *.err.log*
dom_activity.err.log.1:2022-06-08 08:00:32 ERROR ++ km3pipe.controlhost: Failed to construct Prefix, reconnecting.
dom_rates.err.log:2022-06-08 08:00:34 ERROR ++ km3pipe.controlhost: Failed to construct Prefix, reconnecting.
msg_dumper.err.log:2021-12-02 09:24:15 ERROR ++ km3pipe.controlhost: Failed to construct Prefix, reconnecting.
msg_dumper.err.log:2022-03-18 06:00:07 ERROR ++ km3pipe.controlhost: Failed to construct Prefix, reconnecting.
msg_dumper.err.log:2022-10-31 21:14:34 ERROR ++ km3pipe.controlhost: Failed to construct Prefix, reconnecting.
msg_dumper.err.log:2022-11-03 14:48:50 ERROR ++ km3pipe.controlhost: Failed to construct Prefix, reconnecting.
msg_dumper.err.log:2022-12-07 18:00:01 ERROR ++ km3pipe.controlhost: Failed to construct Prefix, reconnecting.
msg_dumper.err.log:2022-12-08 19:26:44 ERROR ++ km3pipe.controlhost: Failed to construct Prefix, reconnecting.

The mystery is now solved.

Let me ping @mdejong
