I have a ton of repos in my `~/.mrconfig`, so a full `mr update` takes a long time, even when parallelized. A lot of those remotes are on the same server, however, so I can't help but wonder if it would save time to enable `ControlMaster` from mr directly. I know I could set that up in my `~/.ssh/config` myself, but I'm not sure I'm comfortable enabling it for all use cases. If it were done in mr, it could set up the multiplexer and tear it down when done...
Here a full pull (`mr -j10 update`) takes:

```
30.39user 8.05system 1:51.04elapsed 34%CPU (0avgtext+0avgdata 73320maxresident)k
0inputs+15664outputs (0major+529937minor)pagefaults 0swaps
```
One catch is that multiplexing might fail with `-j` because of a race condition that ssh doesn't seem to handle very well:

```
ControlSocket /home/anarcat/.ssh/control-master-747c83a00480de0860fde4d7df86be06fe25e555.sock already exists, disabling multiplexing
```
That's with the following configuration:

```
ControlMaster auto
ControlPath ~/.ssh/control-master-%C.sock
```
So with that configuration there's very little improvement, unfortunately. I would bet, however, that mr could be smarter than this and dispatch things much better...
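In the meantime, one workaround might be to open the master connection up front, so the parallel jobs only ever reuse an existing socket instead of racing to create it. A sketch, assuming the configuration above (the hostname is illustrative):

```
# start a persistent, backgrounded master connection first
# (-N: run no command, -f: go to background), then run the
# parallel update; every later ssh to that host reuses the socket
ssh -fN -o ControlMaster=yes -o ControlPath=~/.ssh/control-master-%C.sock git.example.com
mr -j10 update
```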
Thanks! -- ?anarcat
I've worked on moving most of my remote repos over to https instead of SSH. This is mainly because I have a prompt for each use of my SSH keys, but it also helps with this problem.
I'm using `ControlMaster` for all SSH connections, with some situations where I turn it off, and I find it works fairly well apart from the annoying race condition you mention during `mr fetch`. I use time-limited connections so I don't have SSH running forever.

Instead of configuring `ControlMaster` for all SSH connections, you could also just configure it for the hosts that you can only use git with (I'm thinking github.com/etc).
I suggest putting SSH sockets in `$XDG_RUNTIME_DIR` instead of `$HOME`, so you never get stale sockets after a reboot and never get backup programs looking at the sockets (some do).
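A minimal sketch combining both suggestions, assuming OpenSSH 8.4 or later (that's when `${ENV}` expansion in `ControlPath` appeared); the hosts and the ten-minute persistence are just examples:

```
# ~/.ssh/config
Host github.com gitlab.com
    ControlMaster auto
    # sockets in the runtime dir vanish on reboot
    ControlPath ${XDG_RUNTIME_DIR}/ssh-control-%C.sock
    # keep the master alive 10 minutes after the last connection
    ControlPersist 10m
```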
To be honest I'm not sure how myrepos could selectively enable `ControlMaster` for commands it runs. The only thing I can think of would be to modify `PATH` to add a script that runs the real `ssh` with additional options (a rough sketch below), but that seems pretty hacky.

Either way, if `ControlMaster` provides very little improvement for you, adding it via myrepos instead of via the SSH config isn't going to change that.
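For what it's worth, such a shim could be as small as this; a sketch only, with illustrative option values (mr would have to prepend the shim's directory to `PATH` and clean up the sockets afterwards):

```
#!/bin/sh
# hypothetical "ssh" shim that mr could put first in PATH:
# delegate to the real ssh, adding multiplexing options
exec /usr/bin/ssh \
    -o ControlMaster=auto \
    -o ControlPath="${XDG_RUNTIME_DIR:-$HOME/.ssh}/mr-ssh-%C.sock" \
    -o ControlPersist=10m \
    "$@"
```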
PS: I only ever do `mr -j10 fetch` and I leave updating branches to interactive shells where I can deal with any conflicts manually.

PS: I forgot to mention that I use https for the repo remote URLs and then I set the push URLs globally in `~/.gitconfig` like this:
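Something along these lines, using git's `url.<base>.pushInsteadOf` mechanism (the host here is just an example):

```
# ~/.gitconfig: fetch over https, push over ssh
[url "git@github.com:"]
    pushInsteadOf = https://github.com/
```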
My wrapper script takes care of that: it makes sure SSH works first...

But it's true that I might get better performance over HTTPS, I guess. It's just that I like the idea of pulling everything over SSH: I have a solid TOFU there that I don't have over HTTPS... And since this is an automated process, I really prefer to have that solid trust path in place. Plus, setting up each push/pull URL is annoying, although there are nice ways around that, as you said...
Excellent suggestions, thanks!
Well, what I was thinking of is that mr is the one that ultimately calls the git processes, so it could do some magic with `GIT_SSH_COMMAND`, for example... It's VCS-specific, of course... I guess I would first need to figure out that part.
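A minimal sketch of that idea, assuming mr (or, in the meantime, the user) exports the variable before git runs; the option values are illustrative:

```
# every git-over-ssh connection then shares one multiplexed socket
export GIT_SSH_COMMAND="ssh -oControlMaster=auto -oControlPath=$XDG_RUNTIME_DIR/mr-%C.sock -oControlPersist=10m"
mr -j10 fetch
```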
That's also a good point, I always forget about `fetch`, thanks!

One of the problems I have with `fetch`, though, is that it pulls from all the remotes that are configured. In some cases those refer to USB keys or non-existent, transient repos, and fetching fails when the remote isn't reachable (the USB key isn't mounted, say, or the repo is gone).
This also makes `fetch` slower, as it fetches more remotes.

Well, what I would like is for the whole process to be much faster. I'm not sure that's possible at all! I figured that `ControlMaster` might help, but I stumbled upon the race condition, so maybe that's the first thing to solve. Maybe, even, that's a bug in the ssh client itself, which should never run into such a race condition (it could easily wait a bit for another process to finish setting up the socket it just created, for example).
Here's a new benchmark, with your `ControlPersist` setting. I still get those errors, unfortunately, which feel more and more like a bug in SSH: if the socket exists, use it! If you can't use it, just wait! If you can't wait, just shut the hell up, no?
Anyways, without the multiplexer:
And with the multiplexer:
So it still shaves off a good 20 seconds, which is not negligible. But since I now use `fetch` instead of `update`, I'm actually back to square one, and get tons of warnings to go through. :p

Can't help but think this is just a bug in ssh, though...
About the TOFU issue, agreed that it is better for security. I switched away because my `ControlMaster autoask` at the time was very annoying during `mr fetch`. On alioth I had an SSH key for read-only git access with no prompt and a key for read-write git access with a prompt, but that doesn't appear to be supported by gitlab/etc. I should probably rethink how to do this more optimally at some point.

For the git fetch issue, you could do something like what I do here to make `mr -m fetch` not print anything for remotes that don't have any changes, and to skip remotes that I disabled fetching from because they are gone now. Then just add any checks you want; if the remote is a USB stick and the path doesn't exist, skip the remote. You could also just make a `git_fetchorigin` command to only fetch one remote (see the sketch below).
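A sketch of both ideas in `~/.mrconfig`; the command names follow mr's `<vcs>_<action>` convention, but the details are illustrative, not tested:

```
[DEFAULT]
# "mr fetchorigin": only fetch the origin remote
git_fetchorigin = git fetch origin
# override "mr fetch": skip remotes whose local path doesn't exist
# (e.g. a USB key that isn't mounted right now)
git_fetch =
	for r in $(git remote); do
		url=$(git config --get "remote.$r.url")
		case "$url" in
			/*) [ -e "$url" ] || continue ;;
		esac
		git fetch "$r"
	done
```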
git-annex contains a robust implementation of this in Annex/Ssh.hs. It's nontrivial; my implementation is over 400 LoC.

Some other complications that may or may not have been discussed:
thanks for all the advice, pabs and joeyh!
joeyh: I suspected there was a lot more to it, and considering what I know of Haskell, 400 LoC of Haskell would translate into something much more complex in myrepos. :p
pabs: neat trick! Couldn't this be merged straight into mr, though? It seems you have reimplemented most of the mr functions in your `.mrconfig` now.
Also, now that I switched to `fetch`, I keep stumbling upon the issue that I have to rerun `update` everywhere to get the changes... Couldn't there be a way to run `update` (with `--ff-only` maybe) and `fetch` at once? In other words, why doesn't `update` fetch changes from all remotes?

My `git_fetch` is just a workaround for something that IMO belongs in git itself, which I haven't gotten around to implementing in a way that would be acceptable to git upstream. I have a branch that implements what I wanted, but it got rejected upstream and I never got around to figuring out a way forward:
<https://github.com/pabs3/git/commits/minimal-output>
<https://public-inbox.org/git/1445741384-30828-1-git-send-email-pabs3@bonedaddy.net/>
Personally, I think in many instances `git pull` should be deprecated in favour of `git fetch` plus reviewing the incoming changes from upstream, and then, if appropriate, fast-forwarding/rebasing/merging branches and fixing any issues. The obvious exceptions are upstreams that move so fast that review is not possible (like Linux) and people who have way too many repos checked out, like me.

The default `mr update` just runs `git pull`, which allegedly "is shorthand for git fetch followed by git merge FETCH_HEAD" according to the docs. Unfortunately `git pull` doesn't seem to respect the `alias.fetch` config, and I don't see any other config option to make git update/fetch pull from all remotes. We could change `mr update` to run `git pull --all` instead, I guess; I'm not sure if there are any downsides to that apart from the behaviour change...
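In the meantime, a per-user override in `~/.mrconfig` can approximate the fetch-everything-then-fast-forward behaviour asked about above; a sketch, untested:

```
[DEFAULT]
# fetch all remotes, then only fast-forward the current branch
# (never create a merge commit)
git_update = git pull --all --ff-only
```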