13:59:28 #startmeeting revised revised prop#259: more talking about guards.
13:59:28 Meeting started Thu Jul 7 13:59:28 2016 UTC. The chair is nickm. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:59:28 Useful Commands: #action #agreed #help #info #idea #link #topic.
13:59:33 hi!
13:59:45 I see asn and dgoulet. Who else do we have here for this fun?
14:00:08 relevant links: https://trac.torproject.org/projects/tor/ticket/19468
14:00:21 https://gitlab.com/asn/torspec/merge_requests/3/diffs
14:00:24 I'm here
14:00:28 nickm: I'm here
14:00:39 previous version: https://raw.githubusercontent.com/twstrike/torspec/review/proposals/259-guard-selection.txt
14:00:52 current latest version: https://trac.torproject.org/projects/tor/attachment/ticket/19468/prop259-redux-v3.txt
14:00:53 hi all! I'll probably be more of an observer here :)
14:01:06 isis's version: https://gitweb.torproject.org/torspec.git/tree/proposals/259-guard-selection.txt
14:01:07 hi all
14:01:14 hi rjunior , olabini , athena !
14:01:41 hi all. I'll be more of an observer too
14:02:05 hi also ron_ !
14:02:37 So, we've been kicking this design back and forth a lot. I'm hoping we can reach a meaningful conclusion on some version of it sometime soon.
14:02:52 hi
14:03:09 The way we usually start these off is to ask somebody who has read and understood the proposal, other than the author, to summarize it, in case anybody hasn't had a chance to read it thoroughly yet.
14:03:12 fanjiang: hi!
14:03:28 (also, my apologies for all regressions in my writeup since the previous writeup.)
14:04:04 also also: MeetBot is logging us at http://meetbot.debian.net/tor-dev/2016/tor-dev.2016-07-07-13.59.log.txt . It will generate a helpful summary of URLs and action items when we're done
14:04:18 So, would anybody like to summarize?
14:04:39 i can try to. but if anyone else wants, feel free.
14:04:42 or make any comments / ask any questions / recite any poetry before we start? :)
14:05:36 hola
14:05:41 hey iapazmino
14:05:59 hihi iapazmino
14:05:59 ok should I try to briefly summarize the new proposal?
14:06:04 yes please go ahead!
14:06:08 ok
14:06:33 so this new proposal starts by defining the various guard lists that tor will use and how they are populated.
14:06:41 this is very similar to the old prop259
14:07:10 in the sense that tor samples guards into a restricted guard list (so that we don't try an unbounded number of guards every time the network is down)
14:07:25 and from that restricted guard list, we filter further based on whether we think the guards are reachable or not,
14:07:36 to end up with the guard list that tor will actually use for circuits
14:07:48 the concept of primary guards still exists, and is defined quite nicely in the new proposal
14:08:11 (i.e. primary guards switch from "not reachable" to "maybe reachable" faster than other guards, and hence we are willing to retry them much more frequently)
14:08:24 (namely, every 15 mins instead of every 1 hour. but numbers are subject to change)
14:08:45 so this part of the proposal is very very similar to the old prop259
14:09:34 now, the guard algorithm needs to be callable concurrently, and this is where the new proposal starts diverging from the old prop259
14:10:11 [sidenote: I just added the twstrike version of prop259 as prop#268. I believe that we discussed this by email and decided it was okay, but please let me know if I got that wrong.]
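[A minimal Python sketch of the guard lists asn just summarized, loosely in the style of the twstrike tor_guardsim code: a bounded random sample, a filtered view of that sample, and a shorter retry interval for primary guards. All names and numeric values here are illustrative assumptions, not the proposal's normative spec.]

    import random

    MAX_SAMPLE_SIZE = 50            # assumed cap on SAMPLED_GUARDS
    PRIMARY_RETRY_SECS = 15 * 60    # "every 15 mins" for primary guards
    OTHER_RETRY_SECS = 60 * 60      # "every 1 hour" for the rest

    class Guard(object):
        def __init__(self, node):
            self.node = node
            self.unreachable_since = None   # None means "maybe reachable"

        def mark_unreachable(self, now):
            if self.unreachable_since is None:
                self.unreachable_since = now

        def maybe_reachable(self, now, is_primary):
            # Primary guards flip back from "not reachable" to
            # "maybe reachable" faster, so we retry them more often.
            if self.unreachable_since is None:
                return True
            retry = PRIMARY_RETRY_SECS if is_primary else OTHER_RETRY_SECS
            return now - self.unreachable_since >= retry

    def sample_guards(all_guards, sampled):
        # Keep the sample bounded, so we never try an unbounded number
        # of guards every time the network is down.
        candidates = [g for g in all_guards if g not in sampled]
        while candidates and len(sampled) < MAX_SAMPLE_SIZE:
            sampled.append(candidates.pop(random.randrange(len(candidates))))
        return sampled

    def filtered_guards(sampled, config_permits):
        # FILTERED_GUARDS is a function of SAMPLED_GUARDS and the config.
        return [g for g in sampled if config_permits(g)]

    def usable_filtered_guards(filtered, now, primaries):
        # The guards tor will actually consider for circuits right now.
        return [g for g in filtered
                if g.maybe_reachable(now, g in primaries)]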
14:10:34 to do so, the new proposal specifies a state machine which works on circuits
14:10:39 asn: to clarify, can you explain what you mean by "concurrently"?
14:10:45 indeed
14:11:08 by concurrently I mean that Tor might decide that we immediately need to build 10 circuits
14:11:22 we don't want to wait for each circuit to complete before we build the next one
14:11:41 and in this case, should we not use the same guard for all 10 circuits?
14:11:43 so we want our guard algorithm to be able to support 10 circuit constructions (i.e. 10 guard picks) at the same time
14:12:07 #item nickm should answer rjunior's question above.
14:12:35 (after asn is done summarizing)
14:12:40 k
14:12:55 to do so, the new proposal specifies a state machine for circuits where every time we build a circuit we assign it a state based on how "secure" we consider its guard to be
14:13:12 so like, if the circuit was created with the very top primary guard, it's a pretty good guard pick, and we should use this circuit.
14:13:35 but if the circuit was created with a guard 50 slots down our guard list, we should try to get a better guard before using this circuit
14:13:48 so instead of immediately using this circuit, we put the circuit in a state where it waits till we retry the top guards again
14:14:00 and only after we retry all the top guards and they fail do we actually use this circuit.
14:14:14 anyway, this sort of state machine is the main novelty of the new proposal.
14:14:30 it starts getting explained here: https://trac.torproject.org/projects/tor/attachment/ticket/19468/prop259-redux-v3.txt#L231
14:14:41 (There is also a secret novelty in how it handles bridges and entrynodes)
14:14:58 i'll stop summarizing now
14:15:27 nickm: anything else we should mention?
14:15:36 rjunior: good question! Sadly, "it depends". The advantage of using the same guard for all 10 is that if it's actually working, and it's the best guard that is working, we should use it!
14:16:04 but the disadvantage is that if we are searching for a guard that works, it is inefficient to try the same guard 10 times
14:16:26 the design tries to follow a couple of rules, I think:
14:16:57 #item don't actually _use_ a circuit unless we are pretty sure that its guard is the best we can reach...
14:17:22 #item don't try too many circuits through the same guard if we aren't pretty sure whether it's up yet
14:17:30 asn: not that I can think of
14:17:52 so can I say "picking guards into lists" is not necessarily triggered by "building a new circuit"?
14:18:13 good question. let me re-read what I wrote to make sure I give you a good answer....
14:18:55 SAMPLED_GUARDS is maintained almost asynchronously from everything else; it grows as needed.
14:19:19 FILTERED_GUARDS is defined as a function of SAMPLED_GUARDS and the current configuration...
14:19:57 and I turned USED_GUARDS into CONFIRMED_GUARDS. We add a new member to CONFIRMED_GUARDS when we decide that it can actually be used for user traffic
14:20:02 great, because in the previous version we were struggling with that entry point
14:20:43 although the proposal also says:
14:20:45 137 Whenever we are going to sample from USABLE_FILTERED_GUARDS, and it
14:20:46 138 contains fewer than MIN_FILTERED_SAMPLE elements, we add new
14:20:46 139 elements to SAMPLED_GUARDS until one of the following is true:
14:20:46 140 * USABLE_FILTERED_GUARDS is large enough,
14:20:47 141 OR
14:20:50 142 * SAMPLED_GUARDS is at its maximum size.
14:21:03 which gets triggered at "building a new circuit"
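[The rule quoted just above might look like this as a Python sketch. The MIN_FILTERED_SAMPLE and MAX_SAMPLE_SIZE values are placeholders, and usable_filtered is assumed to be a function that recomputes USABLE_FILTERED_GUARDS from the sample.]

    import random

    MIN_FILTERED_SAMPLE = 20   # placeholder value
    MAX_SAMPLE_SIZE = 50       # placeholder value

    def ensure_usable_filtered(all_guards, sampled, usable_filtered):
        # Whenever we are going to sample from USABLE_FILTERED_GUARDS and
        # it contains fewer than MIN_FILTERED_SAMPLE elements, add new
        # elements to SAMPLED_GUARDS until the filtered set is large
        # enough OR the sample is at its maximum size.
        while (len(usable_filtered(sampled)) < MIN_FILTERED_SAMPLE
               and len(sampled) < MAX_SAMPLE_SIZE):
            candidates = [g for g in all_guards if g not in sampled]
            if not candidates:
                break
            sampled.append(random.choice(candidates))
        return usable_filtered(sampled)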
14:21:15 So even if we try 100 guards and connect successfully to all 100, we don't actually add them to USED_GUARDS until we decide to use one or more for 'real' traffic
14:22:11 #action try to think of ways that the "don't add to USED_GUARDS till we use it" rule can be manipulated by an attacker
14:22:16 fanjiang: what do you mean you were struggling with that entry point?
14:22:18 (I am a little concerned about that)
14:22:30 ok, I can explain a bit
14:23:18 in the previous version we made all "picking guards into lists" triggered by "build new circuit"
14:24:04 but the time at which we can know whether a guard is good/bad is asynchronous.
14:25:11 it's more about the implementation details, but this problem forced us to have that pending_guard
14:25:31 asn: did that answer your question?
14:25:56 i'm not sure i understood the problem entirely. or if this problem applies to the current proposal.
14:26:11 what do you mean "know a guard is good/bad"? reachable / unreachable?
14:26:54 successfully connected?
14:27:58 fanjiang: so you think we will stumble into the same problem with this proposal?
14:29:03 not exactly. if we decide to pick more than one guard with a "pending" tag, we will probably have a different result. I need to read that part again
14:29:14 ack
14:29:34 nickm: tbh, i think the pending flag logic is an open design problem with the current proposal
14:29:55 asn: elaborate?
14:30:11 (say more?)
14:30:37 i'm worried about these sorts of cases:
14:30:47 https://gitlab.com/asn/torspec/merge_requests/3/diffs#note_12742458
14:30:50 the ones i tried to explain here &^
14:31:07 where all the primary guards are down, so we have to respect the pending flag, and not connect to any guards that are already pending
14:31:35 which means that if we have N circuits waiting for guards, we will connect to the first N guards, since every time we pick a guard we mark it as pending and don't use it again.
14:31:55 (till it becomes non-pending, which can take a while)
14:32:13 i feel that this case can happen _very_ frequently for HSes. and also for clients.
14:33:20 Do you accept my argument in the non-HS case there, though?
14:33:57 you mean:
14:33:58 "
14:33:59 I don't see what you mean. If alice's circuit succeeds in step b, then it is marked "complete", and you can attach circuits to it. The other circuits just wait around in case we turn out to need them after all."
14:34:03 ?
14:34:05 yes
14:34:08 on the above, do you mean "you can attach streams to it"?
14:34:13 ah yes, sorry
14:34:16 ack
14:34:21 in this case it does make sense
14:34:38 although I don't know all the cases that can make a non-HS client build tons of circuits concurrently.
14:35:07 it's harder with hidden services, though, since "you can attach streams to it" isn't quite the right interpretation for "complete" there. I think "you can actually carry on with the rendezvous protocol" is better.
14:36:00 i think the problem with HSes is that you actually need to build N circuits, not just attach N streams.
14:36:17 that's true.
14:36:22 note that this applies to HS clients as well as services
14:36:38 we might need trickle-down consequences on how many circuits we try to build when we don't have a working guard.
14:36:57 special: yeah, like ricochet will try to build many circs at the same time.
14:36:59 (OTOH, this only happens when we don't have a working guard. If we have a good one, we should just be using it.)
14:37:16 nickm: you mean a "working primary guard". indeed if we have a working primary guard, the above problem does not exist.
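[A sketch of the pending-flag behavior being debated above, assuming, per the exchange, that primary guards are exempt from it. The class and function names are invented for illustration; the walk down the list is exactly the pattern asn worries about when all primaries are down.]

    class Guard(object):
        def __init__(self, name, reachable=True):
            self.name = name
            self.reachable = reachable
            self.is_pending = False

    def select_guard(primaries, other_usable):
        # Primary guards ignore the pending flag: several concurrent
        # circuit attempts may all land on the same primary guard.
        for g in primaries:
            if g.reachable:
                return g
        # Below the primaries, each concurrent attempt marks its guard
        # as pending and the next attempt moves on -- so N waiting
        # circuits end up touching the first N non-primary guards.
        for g in other_usable:
            if not g.is_pending and g.reachable:
                g.is_pending = True
                return g
        return None   # nothing we are willing to try right now

    # With all primaries down, three concurrent picks use three guards:
    primaries = [Guard("p1", reachable=False)]
    others = [Guard("g%d" % i) for i in range(5)]
    print([select_guard(primaries, others).name for _ in range(3)])
    # -> ['g0', 'g1', 'g2']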
14:37:21 asn: or webpages that refer to lots of other onions. there might be similar things for non-hs because of the url circuit isolation
14:37:59 asn: do you think we need to handle the "we're connecting through a non-primary guard" case differently?
14:38:07 (differently than the proposal says now)
14:38:36 nickm: i'm not sure. but I don't think the current logic will work very well for the use cases we discussed above.
14:39:19 #action either convince myself that the retry logic does good things for HS clients/services when all primary, or change it so it does.
14:39:26 (does that summarize what we need to do there?)
14:39:39 "when all primary" ?
14:39:42 "when all primary are down" ?
14:39:44 yes
14:39:51 (sorry, 4 hours of sleep)
14:39:56 yes i think it's a good summary
14:40:01 ok
14:40:18 although, if we are confident that the retry/pending logic for primaries works well, why not do the same for non-primaries?
14:40:39 well, that would effectively make every guard a primary guard.
14:40:41 this might translate to, "don't respect the pending flag for non-primaries"
14:41:02 (note that the pending flag does not apply to primary guards)
14:41:07 yep
14:41:12 (so your suggestion is "remove the pending flag")
14:41:14 yes
14:41:35 so all incoming circuits get stuck on one guard.
14:41:49 if one circuit fails to connect to that guard, fail all circuits and move them to the next.
14:42:09 otherwise, just try to fit all those circuits through that one guard.
14:42:18 the logic here needs to be ironed out a bit
14:42:23 I don't understand how that would be an improvement.
14:42:41 oh man, a whiteboard.
14:42:48 hmm
14:42:58 My rationale for having primary guards behave differently was:
14:43:19 if all your primary guards are down, it's likely you'll be searching for a guard that works.
14:43:36 so trying more in parallel is smart.
14:43:45 with your suggestion, we never ever try guards in parallel
14:44:07 I wonder if the answer here is not to just say "never try in parallel" or "make all guards primary" (they amount to the same thing, I think)...
14:44:18 but rather to treat 'primaryness' as a continuum somehow.
14:44:25 #action nickm can primaryness be a continuum?
14:44:32 what does this mean approx?
14:44:38 the lower you are in the list, the less primary you are?
14:44:52 right now we have 'bool primary;' ; we could have 'double how_primary;'.
14:45:07 So if we think that the first couple of guards might work, we try 1 guard in parallel.
14:45:28 once we're pretty sure that the first N guards are down, we try f(N) guards in parallel
14:45:40 I am not sure that's actually workable; I just came up with it.
14:45:41 i see
14:45:44 it might be foolish
14:45:48 plausible
14:46:11 So, what other issues do we need to iron out before we can progress with this proposal?
14:46:14 and what other steps do we face?
14:46:20 and what other questions are there?
14:46:23 and what else is broken about this?
14:46:27 and how can we find out? :)
14:46:32 :)
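[One speculative reading of the "primaryness as a continuum" idea above as Python. The shape of f(N) here is a made-up placeholder, and nickm explicitly flags the whole idea as possibly unworkable.]

    def parallelism_budget(n_top_guards_believed_down):
        # 'double how_primary;' in spirit: while the first couple of
        # guards might still work, probe one guard at a time; once we
        # believe the first N are down, allow f(N) probes in parallel.
        n = n_top_guards_believed_down
        if n < 2:
            return 1
        return min(1 + n // 2, 10)   # f(N), capped at an arbitrary 10

    for n in (0, 1, 2, 4, 8, 30):
        print(n, "believed down ->", parallelism_budget(n), "parallel probes")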
14:46:33 asn: let's not take up all the talking time ourselves
14:46:44 i think the main reachability/speed issue with trying parallel guards
14:46:49 is this section: https://trac.torproject.org/projects/tor/attachment/ticket/19468/prop259-redux-v3.txt#L363
14:47:37 which basically says "If you tried parallel guards, you need to wait 15 seconds before using the lower guards, to make sure that the top guards are actually down"
14:48:06 but yeah, let's not take up all the talking time, you are right!!!
14:48:23 we could probably keep on hitting this problem for hours.
14:48:36 everyone else: what do you think? :)
14:48:54 (it seems to be one of those problems where there is no real best solution, and you need to balance tradeoffs carefully to arrive at a terrible-but-might-work-for-most-cases solution)
14:49:32 yes, I guess that was the biggest issue we came up with
14:50:53 as for other stuff, to my understanding the new proposal fits more cleanly into the real API
14:52:55 anything else that seems broken here?
14:53:04 nickm: also there are a few more points from my latest review that have not been addressed yet
14:53:19 but i guess these will happen on the next redux
14:53:24 overall it looks good to me - my worries have been brought up already.
14:53:25 asn: ack
14:53:38 olabini: is it mainly asn's thing about the delay-to-use?
14:53:50 yeah
14:53:59 tbh, i think this pending/delay-to-use problem is probably isomorphic to problems that the twstrike team was facing.
14:54:08 although I like the new model of keeping state for circuits to get around the problems we had.
14:54:21 asn: potentially. maybe there are no good solutions.
14:54:25 from my side, this was the hardest part of working on this proposal: understanding the networking API and its asynchronous nature, and how to fit in the new strategy of choosing "candidate entry guards" to build circuits
14:55:04 but X.6-9 seems to cover it pretty well
14:55:20 ok
14:56:03 rjunior: olabini: fanjiang: do you guys have any estimates on how much of your code we can salvage here?
14:56:15 I'm not familiar enough with the networking internals to challenge this part, but at least it makes sense to me
14:56:17 i think the "initialize/sample guard lists" code can be reused
14:57:00 although the prop259 branch had lots of code/tests maintaining the guard picking state machine, which does not really exist in the latest proposal
14:57:08 however, the logic described here: https://trac.torproject.org/projects/tor/attachment/ticket/19468/prop259-redux-v3.txt#L261
14:57:19 is basically the state machine of prop259 (STATE_PRIMARY_GUARDS, STATE_TRY_REMAINING)
14:57:24 (One way to think about the above issue: either we sometimes use a guard when a 'better' guard will soon be working ... or we sometimes find a guard that would work, but delay using it until we know whether a 'better' guard will work instead. I don't see a third possibility)
15:00:38 at least the code has the state machine already I guess
15:01:08 simulation-related: i remember we noticed all scenarios were being measured using the same unit (successful connections, was it?). should this proposal include what a success is for every scenario where it's supposed to work?
15:01:46 iapazmino: say more? I don't understand the question exactly
15:02:32 fanjiang/others: If you were going to implement this, would you start based on your existing branch, or would you start based on tor git master and look for things to copy from your branch; or something else?
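[The delay-to-use tradeoff nickm describes at 14:57:24, together with the ~15-second rule asn quoted at 14:47:37, might be sketched like this. The constant and all names are placeholders, not the proposal's wording.]

    LOWER_GUARD_DELAY_SECS = 15   # from the section asn cited; may change

    class DoneCircuit(object):
        def __init__(self, guard_rank, completed_at):
            self.guard_rank = guard_rank       # 0 == top of the guard list
            self.completed_at = completed_at   # when it finished building

    def may_use_circuit(circ, better_attempt_outstanding, now):
        # If no better guard is still being tried, use the circuit at once.
        if not better_attempt_outstanding:
            return True
        # Otherwise give the better guards ~15 seconds to prove themselves
        # up or down before committing traffic to this lower-ranked guard.
        return now - circ.completed_at >= LOWER_GUARD_DELAY_SECS

    circ = DoneCircuit(guard_rank=7, completed_at=100.0)
    print(may_use_circuit(circ, better_attempt_outstanding=True, now=110.0))  # False
    print(may_use_circuit(circ, better_attempt_outstanding=True, now=116.0))  # True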
15:03:00 i think i would start off tor git master and look for things to copy from the twstrike branch
15:03:07 nickm: there were a few scenarios for simulation, like good network, bad network, proxy network and more. In every one of these scenarios all we measured was successful connections
15:03:08 or maybe i would first look for things to copy, and then I would start coding.
15:03:25 iapazmino: which version of the simulation code is most up-to-date?
15:03:41 iapazmino: I agree we should simulate this too.
15:03:58 iapazmino: I think "number of guards that see user traffic" might be a good measurement to add.
15:04:10 nickm: https://github.com/twstrike/tor_for_patching/tree/prop259
15:04:39 nickm: opening a new branch and copying would be better for me
15:04:40 iapazmino: oh! by "simulation" you mean like a real tor with chutney; I thought you meant a guardsim fork.
15:05:05 nickm: yeah, what i mean is one measure does not fit all scenarios
15:06:01 iapazmino: right. There is more than one axis of goodness here. :)
15:07:02 the simulation is here
15:07:03 https://github.com/twstrike/tor_guardsim/tree/develop
15:07:08 in python
15:07:22 rjunior: do you think it's a reasonably good simulation?
15:07:22 yep, the code iapazmino posted was the tor fork
15:07:44 (I wrote v0 of the code, so I have no idea whether it actually works)
15:08:36 btw nickm, after we pin down the new proposal, i would be glad to dig into the twstrike code and see what can be reused.
15:08:48 cool
15:08:48 it helped us get a feeling for the impact of small changes in the proposal
15:09:04 in terms of how much a change increased our exposure to the network
15:09:15 so, I don't want to keep everybody around indefinitely. What are the next steps between here and deployment? :)
15:09:46 and how much faster we would succeed in building the first circuit (for the case where we are in a dystopic network, but believe we are in a utopic network)
15:10:25 did the twstrike version of proposal 259 (now called prop#268) have a utopic/dystopic notion?
15:10:25 and how frequently the strategy "works" (successfully connects to the chosen guard)
15:10:32 I didn't see one in the version I looked at
15:11:48 i think the dystopic/utopic notion was abandoned before the prop259 implementation started.
15:12:18 yep
15:12:21 it was kind of a flaky heuristic
15:12:51 the initial idea was having 2 different sampled sets
15:12:55 I've started a temporary pad to try to draft a set of tasks to do going forward from here
15:12:58 https://pad.riseup.net/p/draft-guard-steps
15:12:59 nickm: dystopic guards were explicitly mentioned at the beginning but removed later on because samples of guards would always include dystopic nodes
15:13:12 (we should turn this into tickets once we've edited it a bit)
15:13:45 but then it was abandoned mostly due to how it would (not) work together with ReachableAddresses
15:14:12 See B.2, by the way
15:14:13 in general, I think as long as we are confident that the basis of the proposal will stay as it is, we should move to implementation sooner rather than later
15:15:27 and also, because most of the utopic guards sampled with high bandwidth are also dystopic
15:15:40 (those with higher bandwidth are more likely to be in both sample sets)
15:16:43 I wonder if the coding tasks in that pad are the right ones.
15:16:59 like, if they can be done in parallel; if they can be tested; etc
15:19:52 athena / other programmers -- can we come up with a better set of steps than those in https://pad.riseup.net/p/draft-guard-steps ?
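[A sketch of the extra measurement nickm proposes above, "number of guards that see user traffic", as a standalone metric class in the spirit of the Python tor_guardsim harness; the simulator's actual API may well differ.]

    from collections import Counter

    class GuardMetrics(object):
        # Two axes of goodness: how often connections succeed, and how
        # many distinct guards end up seeing real user traffic (exposure).
        def __init__(self):
            self.attempts = 0
            self.successes = 0
            self.traffic_guards = Counter()

        def record(self, guard_id, succeeded, carried_traffic):
            self.attempts += 1
            if succeeded:
                self.successes += 1
            if carried_traffic:
                self.traffic_guards[guard_id] += 1

        def report(self):
            return {"success_rate": self.successes / float(max(self.attempts, 1)),
                    "guards_exposed": len(self.traffic_guards)}

    m = GuardMetrics()
    m.record("guardA", True, True)
    m.record("guardB", False, False)
    m.record("guardA", True, True)
    print(m.report())   # {'success_rate': 0.66..., 'guards_exposed': 1}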
15:20:38 nickm: do you think we should turn https://trac.torproject.org/projects/tor/attachment/ticket/19468/prop259-redux-v3.txt#L261
15:20:41 into a proper state machine?
15:20:55 it's already a state machine, but not a documented/properly defined one
15:21:22 it's basically: "try primary guards" -> "try confirmed/used guards" -> "try sampled guards"
15:21:49 hm. we can document that that is the intention, but I don't think such states would actually appear in the code
15:22:03 they are supposed to be emergent from the other data structures
15:22:04 ack
15:22:58 do you think we should have both the old and new guard logic coexist for a while?
15:23:04 or should the new code kill the old?
15:23:09 I don't know.
15:23:12 what does "Maintaining SAMPLED_GUARDS" mean in this context? the code to add guards to that list?
15:23:28 dgoulet: the code to add as needed, remove as needed, persist, etc
15:23:33 ack
15:24:21 fanjiang / others: do you think it makes sense to try to have multiple guard selection algorithms in parallel? I worry that our existing code is too spaghetti-like to extract it properly
15:24:40 but I'll defer to programmers who have looked at it more deeply more recently :)
15:25:06 i'm pretty sure that twstrike managed to keep both algorithms inside
15:25:21 was that hard?
15:25:22 the functions receive all the sets as parameters
15:25:52 so you shouldn't have any problem running multiple instances of the guard selection
15:27:02 running both code paths at the same time was tricky, but we managed to do it by having an --enable-prop259 flag in ./configure and copying things that should work similarly in both cases
15:27:46 but then it started to get in our way, mostly because it kept us from making big changes in the API
15:27:59 that might break the existing behaviour completely
15:28:21 and again, it was harder to do this in the part that started to touch the networking API
15:28:28 nickm: those seem reasonable to me
15:28:41 rjunior: ack
15:29:09 what is "pid match support"? do we have this?
15:29:25 I don't know. :)
15:29:26 DataDirectory/pid ?
15:29:41 or maybe some iptables feature
15:30:44 (90 mins in, boom)
15:30:57 (crazy how time passed!!)
15:31:29 asn: pid match support is an iptables-related kernel module
15:31:56 you can allow/block traffic based on PID? i can see how it can be useful :
15:31:58 :)
15:32:03 yeah
15:32:21 well, on the OUTPUT end anyway
15:32:28 obviously no pid is available for INPUT/FORWARDING
15:32:30 * nickm is now in another meeting concurrently. Will timeslice!
15:32:52 nickm: i think we should only do a python simulation of this proposal if it's significantly easier than doing it in tor
15:33:09 because if doing it in python takes half the time of doing it in tor, we might as well do it in tor and give it some real testing
15:33:39 (because to simulate the latest prop in python we would need to implement the logic for circuit states, circuit retries, etc.)
15:33:40 I will give you 90% odds that the python version is a lot faster.
15:33:45 I bet I could do it in a few hours tops
15:34:12 ack
15:34:16 ok, let's wrap this up?
15:34:25 any other business?
15:34:34 and maybe nickm you can post your next revision to tor-dev?
15:34:36 I'll be sticking around for isis to show up and to talk with her about the backlog. :)
15:34:42 asn: I hope to do so!
15:34:45 ack
15:34:52 ok need to relocate.
15:34:54 please keep bugging me about it; I recognize that a lot of this stuff is blocking on me
15:34:58 peace, friends!
15:34:59 #endmeeting