%MBZC4ix+ju7Yjtrmh0AQ2NH6xlyZuHs1Z327cs43Xr0=.sha256
master
from arj / scuttlebot / fix-gossip
Don't consider a connecting peer as inactive. On my quite slow machine I was getting lots of connects/disconnects of the same peers over and over again. It turns out that gossip checks every 2 seconds, and if you have a slow machine or a slow connection, peers that are still connecting can be seen as inactive and fall into the retry queue, which quickly meets its quota.
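To illustrate the idea behind the fix, here is a minimal sketch, not the actual scuttlebot gossip scheduler; the peer shape and field names (`state`, `lastActivity`) are assumptions for illustration only. The point is simply that a peer whose connection is still being established is excluded from the inactivity test, so a slow handshake no longer pushes it into the retry queue.

```js
// Minimal sketch, not the real scheduler: field names are invented.
const CHECK_INTERVAL = 2 * 1000 // the 2-second check mentioned above

function isInactive (peer, now) {
  // a peer that is still connecting is never treated as inactive,
  // however long the handshake takes on a slow machine or link
  if (peer.state === 'connecting') return false
  return now - peer.lastActivity > CHECK_INTERVAL
}

function retryCandidates (peers, now) {
  // only genuinely inactive peers end up in the retry queue,
  // so its quota is no longer exhausted by slow connects
  return peers.filter(p => isInactive(p, now))
}

// example: the connecting peer is left alone, and so is the fresh connection
const now = Date.now()
console.log(retryCandidates([
  { id: 'slow-pub', state: 'connecting', lastActivity: now - 10e3 },
  { id: 'fast-pub', state: 'connected', lastActivity: now - 500 }
], now)) // => []
```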
I gave this patch a spin, but I'm seeing peers as 'disconnecting' for long periods of time now instead of connected. I need to take a closer look at the replication code to figure out why this is...
In fact, I think the entire replication schedule deserves a closer look.
A bit weird, why are they labelled as legacy?
But yeah the fix should be rather innocent as long as the state is correctly handled.
@ev is this plain sbot or do you have any local patches? Because with this patch, patchbay and sbot have been much more stable for me. I added a ton of debug output to the gossiping code to find out where the problem is, and there is definitely a problem in that part of the code. But it might be harder for other clients to trigger. Then again, I see some weird "end of parent stream" and handshake problems that might be related to this if the connection is closed during connect.
@arj I just tried your pull-request on a fresh clone of sbot and I'm not having the same issue I mentioned above. It might have been that I merged it into flume or my local replication patch and that was the issue.
Okay, that makes more sense :)
@dominic what do you think, can this be merged in? I'm eager to see if it improves the random connection problems people are seeing.
Thanks for testing @ev!
@dominic what do you think about this patch? I guess you are the only one qualified to review it :)
sorry, yes, this is a problem. The trouble with it is the question of how we evaluate whether this is really working well. Given how pubs are advertised currently, we have a lot of old pubs which are now dysfunctional.
Ah, it seems I am not online enough to get this currently...
Do we have an idea of how many old pubs are around? I'm all for backwards compatibility, but maybe we can get most of them updated. The operators are hopefully checking their sbot once in a while.
I'm seeing quite a lot of errors from the network as it is now. It might just be my low-spec machine, but I have a feeling that with some patches and getting people to upgrade we'll have a better idea of what the problems are.
yeah, agreed - it feels like a system-patch type message could be good. Or at least a channel where we announce needed updates, and an easy way to see that channel.
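Purely to illustrate the suggestion above, and assuming nothing about how it would actually be done: such an announcement could be an ordinary message published to an agreed channel. The `system-update` type and its fields below are invented for this sketch and do not exist in sbot today.

```js
// Hypothetical sketch only: 'system-update' is not a real sbot message type,
// and the field names are invented for illustration.
const ssbClient = require('ssb-client')

ssbClient((err, sbot) => {
  if (err) throw err
  sbot.publish({
    type: 'system-update',   // invented type
    channel: 'sbot-updates', // the proposed announcement channel
    package: 'scuttlebot',
    reason: 'gossip fix: connecting peers are no longer treated as inactive'
  }, (err, msg) => {
    if (err) throw err
    console.log('announced as', msg.key)
    sbot.close()
  })
})
```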
How many single-point-of-failure situations exist in the scuttleverse?
okay, this is pretty weird - when I look at your PR in git-ssb web the commit looks like every plugin is deleted. I'm sure that isn't what this is meant to do.
> when I look at your PR in git-ssb web the commit looks like every plugin is deleted
@dominic that's probably a bug (race condition) related to %nF/4tIc0R2CQ6tKspWF1tNYbTqIC3BnsQX+aJu5udfc=.sha256
the diff is currently shown correctly here: https://git.scuttlebot.io/%25AHqgeMrdCYeKxB34YTG9S5Hd8b5pg4yGeLR09E0Rv4k%3D.sha256/commit/8ebaee4a7d51efc20cd4cb60c60a4b507531a6da