We previously would block waiting for a CSemaphoreGrant in
ThreadOpenAddedConnections, when we did not need to. This would
block as the posts in CConnman shutdown were both to the wrong
semaphore and in the wrong location.
Define MSG_DONTWAIT and MSG_NO_SIGNAL in the implementation files that
use them (`net.cpp` and `netbase.cpp`), instead of compat.h which is
included all over the place.
This avoids putting them in the global namespace, as defining them as 0
is a hack that works for our specific usage, but it is not a general
solution.
Also makes sure they are defined only once so the `!defined(MSG_x)` guard can go.
These are (afaik) all long-standing races or concurrent accesses. Going
forward, we can clean these up so that they're not all individual atomic
accesses.
- Reintroduce cs_vRecv to guard receive-specific vars
- Lock vRecv/vSend for CNodeStats
- Make some vars atomic.
- Only set the connection time in CNode's constructor so that it doesn't change
Since ForEach* are can be used to send messages to all nodes, the caller may
end up sending a message before the version handshake is complete. To limit
this, filter out these nodes. While we're at it, may as well filter out
disconnected nodes as well.
Delete unused methods rather than updating them.
Once the CNode has been added to vNodes, it is possible that it is
disconnected+deleted in the socket handler thread. However, after
that we now call InitializeNode, which accesses the pnode.
helgrind managed to tickle this case (somehow), but I suspect it
requires in immensely braindead scheduler.
The use of mocktime in test logic means that comparisons between
GetTime() and GetTimeMicros()/1000000 are unreliable since the former
can use mocktime values while the latter always gets the system clock;
this changes the networking code's inactivity checks to consistently
use the system clock for inactivity comparisons.
Also remove some hacks from setmocktime() that are no longer needed,
now that we're using the system clock for nLastSend and nLastRecv.
Technically cs_sendProcessing is entirely useless now because it
is only ever taken on the one MessageHandler thread, but because
there may be multiple of those in the future, it is left in place
cs_vSend is used for two purposes - to lock the datastructures used
to queue messages to place on the wire and to only call
SendMessages once at a time per-node. I believe SendMessages used
to access some of the vSendMsg stuff, but it doesn't anymore, so
these locks do not need to be on the same mutex, and also make
deadlocking much more likely.
vRecvMsg is now only touched by the socket handler thread.
The accounting vars (nRecvBytes/nLastRecv/mapRecvBytesPerMsgCmd) are also
only used by the socket handler thread, with the exception of queries from
rpc/gui. These accesses are not threadsafe, but they never were. This needs to
be addressed separately.
Also, update comment describing data flow
Similar to the recv flag, but this one indicates whether or not the net's send
buffer is full.
The socket handler checks the send queue when a new message is added and pauses
if necessary, and possibly unpauses after each message is drained from its buffer.
Messages are dumped very quickly from the socket handler to the processor, so
it's the depth of the processing queue that's interesting.
The socket handler checks the process queue's size during the brief message
hand-off and pauses if necessary, and the processor possibly unpauses each time
a message is popped off of its queue.
In order to sleep accurately, the message handler needs to know if _any_ node
has more processing that it should do before the entire thread sleeps.
Rather than returning a value that represents whether ProcessMessages
encountered a message that should trigger a disconnnect, interpret the return
value as whether or not that node has more work to do.
Also, use a global fProcessWake value that can be set by other threads,
which takes precedence (for one cycle) over the messagehandler's decision.
Note that the previous behavior was to only process one message per loop
(except in the case of a bad checksum or invalid header). That was changed in
PR #3180.
The only change here in that regard is that the current node now falls to the
back of the processing queue for the bad checksum/invalid header cases.
Previously addnodes were in competition with outbound connections
for access to the eight outbound slots.
One result of this is that frequently a node with several addnode
configured peers would end up connected to none of them, because
while the addnode loop was in its two minute sleep the automatic
connection logic would fill any free slots with random peers.
This is particularly unwelcome to users trying to maintain links
to specific nodes for fast block relay or purposes.
Another result is that a group of nine or more nodes which are
have addnode configured towards each other can become partitioned
from the public network.
This commit introduces a new limit of eight connections just for
addnode peers which is not subject to any of the other connection
limitations (including maxconnections).
The choice of eight is sufficient so that under no condition would
a user find themselves connected to fewer addnoded peers than
previously. It is also low enough that users who are confused
about the significance of more connections and have gotten too
copy-and-paste happy will not consume more than twice the slot
usage of a typical user.
Any additional load on the network resulting from this will likely
be offset by a reduction in users applying even more wasteful
workaround for the prior behavior.
The retry delays are reduced to avoid nodes sitting around without
their added peers up, but are still sufficient to prevent overly
aggressive repeated connections. The reduced delays also make
the system much more responsive to the addnode RPC.
Ban-disconnects are also exempted for peers added via addnode since
the outbound addnode logic ignores bans. Previously it would ban
an addnode then immediately reconnect to it.
A minor change was also made to CSemaphoreGrant so that it is
possible to re-acquire via an object whos grant was moved.
Surprisingly this hasn't been causing me any issues while testing, probably
because it requires lots of large blocks to be flying around.
Send/Recv corks need tests!
- Drop the interruption point directly after the pnode allocation. This would
be leaky if hit.
- Rearrange thread creation so that the socket handler comes first