17:01:01 <phw> #startmeeting anti-censorship weekly checkin 2019-09-12
17:01:01 <MeetBot> Meeting started Thu Sep 12 17:01:01 2019 UTC. The chair is phw. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:01:01 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
17:01:05 <phw> hi everyone!
17:01:15 <phw> for the record, here's our meeting pad: https://pad.riseup.net/p/tor-censorship-2019-keep
17:01:28 <cohosh> hi o/
17:02:14 * catalyst is kind of here
17:02:20 <phw> let me start with the first discussion item: are there services left that aren't monitored or (re)started automatically on crash/reboot?
17:03:13 <phw> i'm asking because we ran into an issue with gettor - it looks like it wasn't started automatically on boot? is this correct, hiro?
17:04:10 <phw> fwiw, anarcat once helped me set up a relatively simple systemd script to monitor/restart services: https://help.torproject.org/tsa/doc/services/
17:04:22 * anarcat o/
17:04:32 <anarcat> gettor should have a systemd --user process now, iirc
17:04:57 <phw> yes, thanks anarcat. i set it up for gettor and systemd now restarts the service when it crashes
17:05:02 <hiro> anarcat before sysadmin were against doing that... that's probably why there wasn't a script like that
17:05:21 <phw> (we have yet to see if it also does its job when the host reboots)
17:05:24 <anarcat> phw: awesome
17:05:30 <anarcat> hiro: ah
17:05:39 <anarcat> good thing i messed things up then :)
17:05:50 <hiro> the idea was that if a service would be restarted when it crashed nobody would fix the reason for the crash
17:05:59 <phw> hiro: there's no possibility to monitor the gettor process from a separate machine, right? because both the http and smtp server are separate processes?
17:06:18 <hiro> the http server is apache serving a static html
17:06:23 <anarcat> hiro: oh right, that's a good point
17:06:32 <anarcat> hiro: i thought we were just starting the service in systemd
17:06:33 <phw> hiro: that's a good reason but that works only if we have sound monitoring that tells us when a service disappeared. we don't have that for gettor aiui
17:06:38 <anarcat> but yeah, systemd can also restart the service on crashes
17:06:47 <anarcat> i think that as long as people get notified on crashes, it's okay
17:06:48 <hiro> the smtp server is twistd
17:06:53 <arma2> phw: to answer your original "are there any other services" question, i am hoping we get to the point where you can show me a web page with a bunch of green lights on it. until we're there, i don't know how to know what got skipped. :)
17:06:54 <hiro> I am happy to have something that restarts the service
17:06:55 <anarcat> but i shouldn't hijack your meeting :)
17:07:13 <anarcat> arma2: working on that dashboard in grafana, actually
17:07:30 <phw> hiro: so if gettor crashes, nothing is listening on port 25 anymore?
17:07:52 <hiro> the system will still get the email but it will not be processed
17:08:08 <hiro> the point is that if gettor only sends email it doesn't need to be a twisted service
17:08:21 <hiro> it can be a python script that is called when the email hits the system
17:08:35 <hiro> it's just a postfix rule
17:08:36 <phw> i understand that but our problem right now is that gettor was dead for ~3 days and nobody noticed. how can we notice?
17:08:39 <anarcat> arma2: this is the current grafana homepage https://paste.anarc.at/publish/2019-09-12-7hOjxgi6jUg/screenshot.png
17:09:02 <hiro> maybe we can have system checks on the twisted process
17:09:05 <anarcat> phw: we should be monitoring the actual service
17:09:12 <anarcat> phw: send an email, check if we get tor back
17:09:13 <hiro> via nagios or something else
17:09:38 <anarcat> checking that all the bits underneath are all in place will not be as useful as a "does the thing actually work" test
17:09:47 <hiro> yeah we can send an email
17:09:58 <anarcat> you need to have an inbox somewhere to check if you get the reply i guess
17:10:01 <anarcat> but that can be arranged
17:10:06 <anarcat> nagios has checks to do this
17:10:21 <anarcat> i don't know about the politics of putting this in our nagios
17:10:33 <anarcat> but that seems like the best solution, technically
17:10:53 <phw> i don't have a strong preference but i think that some sort of service monitoring should be a priority. what do you think, hiro?
17:11:12 <hiro> looks good to me
17:12:37 <phw> ok, let's try to get this done asap. can you take the lead on that hiro?
17:12:42 <phw> (i'm happy to help however i can)
17:12:59 <anarcat> (same here)
17:13:16 <hiro> I can check how to do this in nagios
17:13:23 <phw> thanks!
17:13:38 <phw> fwiw, our main monitoring system is sysmon, run by gman999. the config is here: https://dip.torproject.org/torproject/anti-censorship/sysmon-configuration
17:14:15 <phw> it's limited though because it cannot follow http redirects.
still, it does a good job at basic tcp reachability tests and has notified us of default bridges going offline
17:14:38 <phw> i realise that a mix of sysmon, nagios, and others is not optimal but scattered monitoring is better than no monitoring
17:15:56 <phw> (i also experimented with monit on my laptop, which i use to monitor a set of private obfs4 bridges that we will hand out to an NGO)
17:17:00 <phw> ok, that's it from my side wrt monitoring and reliability. any more thoughts?
17:17:47 <phw> next is a link to google's reviewing guidelines, which i found interesting and worth a skim: https://google.github.io/eng-practices/review/reviewer/
17:18:36 <phw> i think we can learn from some of their experience
17:18:40 <cohosh> thanks for the thoughts, i can be better about doing reviews earlier in the week for sure
17:19:11 <arma2> the network team also periodically tries to prioritize reviews,
17:19:14 <phw> cohosh: right, i was thinking about gettor. if hiro has only thursday to work on it, we should be done with our reviews by wednesday the week after, so hiro doesn't block on us
17:19:26 <arma2> especially because a new volunteer will get hooked if they get feedback and attention, and wander away if they don't
17:19:40 <arma2> they've found it hard to be consistent with that priority though
17:19:47 <cohosh> yep maes sense
17:19:58 <cohosh> *makes
17:20:16 <hiro> phw I tend to do more things on thu but I have always other stuff so I do gettor when I can ...
as I do all the other things
17:20:28 <hiro> so reviews do not really block me
17:20:56 <hiro> if I am blocked I ping you guys and let you know
17:21:02 <phw> hiro: gotcha
17:21:34 <cohosh> hiro: thanks :)
17:21:51 <phw> i think our weekly cycle works reasonably well but we should speed up things a bit if it's helpful
17:22:24 <cohosh> yeah i think i will try to prioritize reviews a bit more, was putting them in a giant bucket of "things to do by the next meeting"
17:24:04 <phw> another thing i liked in google's reviewing guidelines is to prefix suggestions with "nitpick" if they're worth pointing out but not necessary to incorporate
17:24:43 <phw> i can sometimes think of nicer ways to accomplish something but i don't want to drown somebody in minor feedback. that may be discouraging
17:25:00 <phw> the idea of "nitpick" is to say "hey, this is worth noting but feel free to ignore"
17:25:46 <cohosh> cool
17:25:56 <phw> (and as reviewee i would like to learn about all the ways to improve my code, even if i don't end up incorporating everything)
17:26:11 <phw> anyway, it's a useful document!
17:26:39 <phw> shall we move on to our 'needs help with' sections?
17:27:15 <cohosh> sounds good
17:27:47 <phw> hiro has 'probably more reviews' :)
17:27:50 <phw> keep em coming!
17:28:38 <hiro> yes I might have some more reviews as I incorporate your feedback and fix a few more pending things
17:28:58 <phw> cool, i'm happy to take a look
17:30:12 <phw> another thing related to gettor: i didn't mean to step on your toes with the systemd script, hiro. whenever i touch things on getulum, i try to document it and let you know.
but please let me know if you can think of a better process
17:30:32 <phw> (generally, i prefer not to touch anything without talking to you first)
17:30:43 <hiro> you need to feel free to touch it actually
17:30:51 <hiro> that's why I am setting up the ansible recipe
17:31:19 <phw> ok, gotcha
17:31:23 <hiro> my idea is that the ansible playbooks can run via cron and restart and/or update the system
17:32:05 <hiro> if everything runs via ansible there are no hidden scripts
17:32:25 <hiro> and everyone working on the service can see and improve the code
17:32:38 <phw> that's great, thanks for working on this
17:33:34 <phw> coming back to reviews: i only have one for now, #31692. it's a minor change to our docker image. can you take a look, cohosh?
17:33:55 <cohosh> phw: yup
17:34:20 <phw> next up is #29206
17:34:30 <cohosh> i think dcf1 has started reviewing it
17:34:41 <dcf1> yes I am
17:34:41 <phw> right, that's good
17:34:54 <cohosh> thanks
17:34:54 <phw> oh, you aren't absent after all, dcf1!
17:35:26 <dcf1> no I amn't
17:36:02 <phw> the last review seems to be #31455. i sent sina an email last week but haven't heard back yet.
17:36:27 <phw> that's a bummer.. he usually responds swiftly to bridge-related emails
17:37:27 <phw> is there anyone else at cymru that we could ask? rabbi rob maybe?
17:38:55 <arma2> who runs it there? is it sina or is it a generic cymru service?
17:39:14 <phw> i believe it's sina, right dcf1?
17:39:19 <dcf1> it's sina
17:40:08 <arma2> ok. then other cymru folks won't be so helpful. maybe follow up and ask how it's going and cc me?
17:40:37 <arma2> last night i started answering a batch of urgent mails from july. and my watch tells me it's no longer july. so, this happens. :/
17:40:58 * phw sent a reminder
17:41:28 <phw> looks like that's it for today. anything else?
17:42:06 <cohosh> not from me
17:42:12 <phw> alrighty, let's wrap it up
17:42:14 <phw> #endmeeting
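
Editor's sketch, not part of the meeting: the "basic tcp reachability tests" phw credits sysmon with (17:14) could look roughly like the Python below. This is a minimal illustration under stated assumptions, not sysmon's or Nagios's actual code. The function name check_tcp and the constants are hypothetical; the return codes 0 (OK) and 2 (CRITICAL) follow the usual Nagios plugin convention. As hiro notes at 17:07:52, the mail port can keep accepting connections while gettor itself is dead, so a check like this complements, and does not replace, the "send an email, check if we get tor back" test anarcat suggests.

```python
import socket

# Exit codes as used by the Nagios plugin convention (0 = OK, 2 = CRITICAL).
OK, CRITICAL = 0, 2


def check_tcp(host, port, timeout=5.0):
    """Try one TCP connection and return (status, message).

    check_tcp is a hypothetical helper for illustration only.
    """
    try:
        # create_connection resolves the host and connects, or raises OSError
        # (refused, unreachable, DNS failure, timeout).
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} accepts connections"
    except OSError as exc:
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"
```

A wrapper script could call check_tcp("bridge.example", 443) (a placeholder host and port), print the message, and exit with the returned status so that a scheduler such as Nagios or cron can act on it.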