16:00:26 <GeKo> #startmeeting network-health 08/30/2021
16:00:26 <MeetBot> Meeting started Mon Aug 30 16:00:26 2021 UTC. The chair is GeKo. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:26 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
16:00:31 <GeKo> hello!
16:00:36 <hiro> hello
16:00:44 <GeKo> last network-health meeting in august 2021 :)
16:00:48 <GeKo> let's see
16:00:54 <GeKo> http://kfahv6wfkbezjyg4r6mlhpmieydbebr5vkok5r34ya464gqz6c44bnyd.onion/p/tor-nethealthteam-2021.1-keep
16:00:57 <GeKo> is our pad
16:01:10 <GeKo> please add your items on it
16:01:15 <GeKo> if you have not already
16:01:57 <GeKo> as always for things you want to bring up or talk about, mark them as bold
16:02:05 <GeKo> or put them into the discussion section
16:06:24 <GeKo> okay, let's get started
16:07:07 <GeKo> i don't see anything marked as bold yet
16:07:10 <GeKo> good
16:07:13 <meskio> I'm around if we want to talk about the grafana dashboard
16:07:52 <GeKo> ggus: i've done a first round of drafting some process for dealing with EOL relays/bridges
16:08:11 <GeKo> https://pad.riseup.net/p/NzO5KK6H2_tp_bSJ7xdI
16:08:33 <GeKo> i gonna stare a bit more at it this week and move it as a draft maybe onto the wiki
16:08:37 <GeKo> *into
16:09:01 <GeKo> there are still a bunch of XXX we should think about
16:09:20 <GeKo> but either way, if you have comments/ideas those would be much appreciated
16:09:33 <GeKo> meskio: sounds good
16:09:44 <GeKo> hiro: do we want to start talking about that point first?
16:09:48 * ggus loading the pad
16:09:57 <hiro> GeKo: ok sounds good
16:10:12 <cohosh> o/
16:10:25 <hiro> so the dashboard was a first step to see what data we can get out of onionoo that we need and how grafana can help us
16:10:38 <hiro> IMO it's a bit basic and slow to load but it's a start
16:10:56 <hiro> depending on what ggus and cohosh and meskio need we can do different things
16:11:20 <meskio> it looks prety nice and I think it can be very useful
16:11:27 <hiro> one thing I was talking with GeKo about was that if we start loading data about relays/bridges in a postgres db we can have timeseries "for free" with grafana queries
16:11:59 <hiro> and on grafana you can do alerts on timeseries like from the ui which could be interesting for spotting patterns
16:12:34 <hiro> but we are not there yet and maybe knowing what you all need can help me understand how to prioritize things
16:12:52 <ggus> hiro: that would be super useful! one thing that i would like to work for a future sponsor is relay operator *retention*. in order to do that, we need to know which relays (and its contactinfo) are leaving the network.
16:13:36 <cohosh> hiro: how to do you tell if a bridge is "overloaded"? i think this will be really useful to know wehther we need more default bridges for example
16:13:55 <hiro> it's in the descriptors
16:13:55 <ggus> GeKo: wrt EOL relays/bridges, i can take a look next week. this week is very packed.
16:14:20 <GeKo> no worries
16:14:38 <hiro> cohosh: https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n637
16:14:42 <cohosh> ah ty!
16:14:46 <meskio> Are 'bridges per distributor' the number of bridges available on each mechanism? (the graph now says 'No data', but it was working last time I looked at it)
16:15:01 <hiro> also: https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n1353
16:15:13 <GeKo> cohosh: prop#328
16:16:36 <hiro> meskio: yes sometimes the dashboard needs to be reloaded
16:16:53 <GeKo> ggus: how far in the future is "future" for the future sponsor?
16:16:58 <hiro> and yes that's the idea, I just grouped per mechanism
16:17:04 <GeKo> that's nothing we committed yet, right?
16:17:29 <ggus> GeKo: that's right. we don't have yet any work committed
16:17:40 <GeKo> okay.
16:17:54 <GeKo> so we could think about when writing the proposal to get money for that part, too
16:17:55 <ggus> but to evaluate how much work it will be, we need to start collecting this.
16:18:26 <GeKo> that part = getting the grafana dashboard to do what you need
16:18:46 <meskio> hiro: nice, that looks pretty cool, do you also have data on the number of distributed bridges per distributor? that might be nice to see how much each distributor is used
16:19:37 <hiro> meskio: that data is from bridgedb?
16:20:14 <hiro> we have bridge users by rtansport
16:20:15 <meskio> yes
16:20:27 <cohosh> you would have to use data from the pool assignments
16:20:45 <cohosh> https://metrics.torproject.org/collector.html#bridge-pool-assignments
16:21:08 <meskio> exactly, AFAIK metrics.tpo doesn't let you see how many bridges are requested by email or how many by moat
16:21:21 <ggus> https://metrics.torproject.org/bridgedb-distributor.html
16:21:26 <cohosh> we've done this manually, the same as how we look at default bridge usage metrics
16:21:46 <ggus> i didnt understand. what's missing from that graph? ^
16:21:49 <cohosh> but it would be cool to have a more accessible and faster way to see recent changes
16:21:52 <meskio> ggus: thanks for proof me wrong :)
16:22:14 <ggus> hahah
16:22:34 <cohosh> ggus: oh yeah, i was thinking by country XD
16:22:34 * meskio needs to explore more metrics.tpo ...
16:22:51 <cohosh> but maybe by country is too fine-grained for grafana?
16:23:26 <meskio> in grafana you can set up selectors, so you could have a country selector, but AFAIK will be for the whole dashboard
16:23:43 <cohosh> yeah sorry, i guess this is getting too off topic from network health
16:23:56 <GeKo> :)
16:23:58 <hiro> cohosh meskio, in grafana you can do different kind of grouping... the problem in this case is where we get the data
16:24:29 <hiro> because I started just filtering from onionoo basically I get a json with all the data from there and then I do some manipulation on grafana directly
16:24:33 <hiro> if this data was in a db we could do more
16:24:52 <hiro> one first step maybe if connect grafana to the db we are suing on the metrics website
16:25:05 <hiro> that's used to generate all the graphs that we have on metrics.tpo
16:25:21 <hiro> s/suing/usign
16:26:15 <GeKo> we could think about that
16:26:39 <GeKo> just to play around and see what is possible already
16:27:08 <GeKo> which probably not many people know :)
16:27:12 <hiro> regarding measuring operators churn ... we should record when we stop seeing a relay for a while
16:27:21 <hiro> I think we have that data on exonerator
16:27:44 <hiro> but only the ip
16:27:56 <ggus> hiro: yeah, and we want contact_info
16:27:57 <hiro> I haven't looked at it just yet to be honest
16:29:19 <hiro> maybe development wise is not a lot of work but we should start thinking on the amount of data we store and for how much time
16:29:31 <hiro> if we had snapshots of all the relays at a given time we could find out
16:29:41 <GeKo> ggus: you could file a ticket in the network-health/team project about some way to get at that info
16:29:44 <hiro> question is how many snapshots we need and how many we can keep
16:29:56 <ggus> GeKo: yes
16:29:58 <GeKo> hiro: well, we do. we have hourly onionoo json files :)
16:30:27 <hiro> we also do in collector
16:31:37 <cohosh> we can also get the contact info manually from polyanthum
16:31:44 <hiro> but to have that into grafana we should have it in a db somewhere
16:31:48 <ggus> hiro: GeKo: we could take a look at TorBSD stats and think about a network diversity dashboard - https://torbsd.github.io/oostats.html
16:31:49 <hiro> maybe we don't need grafana
16:32:03 <hiro> just a report of what ops are leaving
16:32:35 <GeKo> ggus: yeah
16:32:45 <GeKo> that's definitely something i want :)
16:32:54 <hiro> ggus yeah that we can do
16:33:17 <hiro> maybe even already
16:33:34 <GeKo> i'll file a ticket to track that effort after the meeting is done
16:33:41 <ggus> ack!
16:34:11 <GeKo> so, yes, i guess we should keep thinking about potential use-cases for that dashboard
16:34:24 <GeKo> with an eye towards where the data should come from/is coming from
16:35:02 <GeKo> but, yes, this is for all teams and we need that input to know where we should prioritize our time/resources
16:35:36 <GeKo> i wonder whether we should collect somewhere all the ideas that popped up now
16:35:49 <gaba> is there a ticket for this?
16:35:53 <GeKo> and might pop up once we start thinking more about possible use-cases
16:35:57 <ggus> one ticket to rule them all!
16:35:58 <gaba> that should be a good place to collect use cases
16:36:01 <hiro> there was the ticket about tpi infrastructure
16:36:03 <gaba> :)
16:36:13 <hiro> but it was only about the overload case
16:36:28 <GeKo> i think there is no place for that yet
16:36:36 <GeKo> and no ticket :)
16:37:27 <GeKo> but i guess i can file one here, too
16:37:35 <hiro> https://gitlab.torproject.org/tpo/network-health/team/-/issues/34
16:37:54 <GeKo> yeah, that's specific for the s61 use case
16:38:23 * gaba 's browser is so f* slow today...
16:38:58 <gaba> geko: you are creating a tikcet for it then
16:39:07 <GeKo> yes
16:39:10 <gaba> ty
16:39:17 <GeKo> anything else for the grafana discussion item?
16:39:18 <ggus> gaba: my tb is very slow today too.
16:39:48 <gaba> the grafana board has been loading for the last 5 min... with no graphs
16:39:55 <GeKo> if not then let's go to the other item i put on the list
16:40:27 <ggus> GeKo: hiro: https://gitlab.torproject.org/tpo/network-health/team/-/issues/113
16:40:46 <GeKo> there was some discussion last week going on about how to present the overloaded info to relay operators and how to offer them help
16:40:59 <GeKo> is everyone good with that now?
16:41:05 <GeKo> and we have a plan moving forward?
16:41:26 <GeKo> or do we need more discussion by all stakeholders?
16:41:30 <hiro> I think - for me - what is left is the support article
16:41:42 <GeKo> right now as i see it we go with the support article
16:41:54 <GeKo> and point relay operators on relay search to that one
16:42:02 <GeKo> in case their relay is overloaded
16:42:13 <ggus> after adding to the support portal and to metrics, we could send an email to tor-relays@
16:42:30 <GeKo> that, too, good idea
16:42:48 <hiro> as I commented on https://gitlab.torproject.org/tpo/web/support/-/merge_requests/43 maybe we want to add some more pointers to help people figure out what is wrong before dumping their data to some email address
16:43:24 <hiro> otherwise as GeKo said people might just send the data
16:43:35 <GeKo> yeah, i am fine with that
16:44:00 <GeKo> really, reaching out to dgoulet or me should be the last resort
16:44:13 <GeKo> and one could see that case as a failure on our side :)
16:44:48 <GeKo> in that we did not good enough help to relays operators figuring out what's up with their system before they reached that step
16:44:54 <GeKo> *give
16:45:50 <hiro> I think some examples of what needs tuning on sysctl or how to understand what's overloaded
16:46:22 <hiro> if you or dgoulet already have typical scenarios we could add these
16:46:30 <GeKo> who is writing the support article?
16:46:34 <GeKo> is that you, ggus?
16:46:38 <ggus> i'm not familiar with this new overloaded info. it will check every consensus, how this will works?
16:46:50 <hiro> I made a draft during the weekend
16:46:50 <ggus> GeKo: hiro is writing
16:47:10 <GeKo> hrm, there is no dgoulet here
16:47:45 <GeKo> hiro: that's your !43?
16:48:14 <hiro> yes
16:48:50 <GeKo> okay. i'll try helping with that although i don't have much experience with overloaded relays
16:49:08 <GeKo> i guess we should flag dgoulet there, too, for input
16:49:09 <hiro> yes me neither
16:49:22 <GeKo> here we go
16:49:53 <GeKo> dgoulet: we try to get that support article for the overload indicator done
16:50:06 <GeKo> https://gitlab.torproject.org/tpo/web/support/-/merge_requests/43 has the draft
16:50:07 <dgoulet> ah nice yes!
16:50:11 <dgoulet> anything I can help with?
16:50:20 <GeKo> i guess getting input from you would be good
16:50:29 <dgoulet> np
16:50:50 <dgoulet> I'll get your feedback today for sure
16:51:01 <GeKo> i think we can use that mr for ideas we think we should add
16:51:08 <GeKo> and how to phrase things
16:51:28 <ggus> ok!
16:51:35 <GeKo> in particular examples of what to check for and what to do in case X would be good
16:52:05 <GeKo> dgoulet: i think if folks reach the step where they feel they need to contact us then we have not done the best job with the support article
16:52:24 <GeKo> at least that's the kind of polar star i have here for guidance :)
16:52:42 <GeKo> i mean if that happens from time to time, well, that's fine
16:52:45 <dgoulet> agree
16:52:59 <GeKo> but we should really try to give relay ops the means to solve the problems on their own
16:53:09 <hiro> yep I have found some ops publish some of their tunings around but we don't have much pointers for people about where to start looking if there is an issue
16:53:22 <GeKo> just so the potential exposure of that metrics port data is minimized
16:53:42 <GeKo> okay
16:53:55 <GeKo> it's good to see that we are all on the same page here, though
16:54:13 <GeKo> do we have anything else for today's meeting?
16:54:53 <GeKo> hearing nothing
16:55:03 <GeKo> thanks everyone and have a nice week o/
16:55:05 <GeKo> #endmeeting
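
A rough sketch of the "report of what ops are leaving" idea raised in the meeting, comparing two saved Onionoo snapshots. It assumes each snapshot is a details-style JSON document with a top-level `relays` list whose entries carry `fingerprint`, `nickname` and (optionally) `contact`. The helper names `departed_relays` and `load_snapshot` and the sample data are hypothetical illustrations, not anything the team has agreed on or implemented.

```python
import json


def load_snapshot(path):
    """Read one saved snapshot (assumed: Onionoo details-style JSON) from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def departed_relays(older, newer):
    """Return relays listed in `older` whose fingerprint is absent from `newer`."""
    still_there = {relay["fingerprint"] for relay in newer.get("relays", [])}
    gone = []
    for relay in older.get("relays", []):
        if relay["fingerprint"] not in still_there:
            gone.append({
                "fingerprint": relay["fingerprint"],
                "nickname": relay.get("nickname", ""),
                # contact is optional, so it may be None
                "contact": relay.get("contact"),
            })
    # Sort so the report is stable between runs.
    return sorted(gone, key=lambda relay: relay["fingerprint"])


# Made-up sample data standing in for two snapshots.
older = {"relays": [
    {"fingerprint": "AAAA", "nickname": "alpha", "contact": "ops@example.org"},
    {"fingerprint": "BBBB", "nickname": "beta"},
]}
newer = {"relays": [
    {"fingerprint": "BBBB", "nickname": "beta"},
]}

for relay in departed_relays(older, newer):
    print(relay["fingerprint"], relay["nickname"], relay["contact"])
```

With real files, the two dictionaries would come from `load_snapshot()` on two saved snapshots. A relay missing from a single hourly snapshot may just be temporarily down, so comparing snapshots that are days or weeks apart is probably closer to "stop seeing a relay for a while".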