14:01:18 <karsten> #startmeeting Measurement Team meeting #4
14:01:18 <MeetBot> Meeting started Wed Jul 29 14:01:18 2015 UTC.  The chair is karsten. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:01:18 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
14:01:30 <karsten> I saw phw. who else is here for the meeting?
14:02:32 <karsten> shall we give it another 5 minutes, phw?
14:03:14 <phw> sure, sounds good.
14:03:23 <karsten> phw: want to take a look at the roadmap draft in the meantime?
14:03:27 <karsten> https://people.torproject.org/~karsten/volatile/measurement-roadmap.pdf
14:03:30 * phw looks
14:03:39 * karsten makes coffee and is back in 4
14:06:39 * karsten has coffee
14:06:56 <karsten> did anybody else arrive for the meeting?
14:08:23 <karsten> phw: should we talk about measurement things anyway?
14:08:38 <phw> let's do it.
14:08:42 <karsten> ok :)
14:08:53 <karsten> so, that roadmap is the result of our last two meetings.
14:09:04 <karsten> with some content I added yesterday.
14:09:30 <phw> it looks good so far!  is it on git?  if so, i can go over it and make minor patches.
14:09:31 <karsten> it's mostly for ourselves, though we may be able to use it in the future.
14:09:42 <karsten> oh, that would be awesome!
14:09:52 <karsten> it's not on git, because I didn't know where to put it. any suggestions?
14:10:03 <karsten> tech-reports.git maybe?
14:10:16 <karsten> though it's technically not decided that it will be a tech report.
14:10:21 <karsten> which might not matter.
14:10:36 <karsten> it's just prettier in latex.
14:10:42 <phw> tech-reports.git is where i would have looked.
14:11:19 <karsten> great. let me put it there now...
14:11:52 <phw> thanks
14:13:07 <karsten> https://gitweb.torproject.org/user/karsten/tech-reports.git/log/?h=measurement-roadmap
14:13:10 <karsten> thank you!
14:13:31 <karsten> so, this is already part of the 1-1-1 task exchange I envisioned for today.
14:13:49 <karsten> one thing I wanted to ask for was that somebody reviews/revises that document.
14:13:59 <karsten> is there something that I can review for you?
14:14:10 <karsten> or, let me explain the "rules" first:
14:14:13 <karsten> - 1-1-1 task exchange: you get 1 minute to describe a task that would take somebody else roughly 1 hour and that they will do for you within 1 week (review a document, write some analysis code, fix a small bug, etc.; better come prepared to get the most out of this; give 1, take 1)
14:14:56 <karsten> the idea is that having a fresh set of eyes on something might help you make more progress than spending that hour on the thing yourself.
14:15:06 <phw> oh, that sounds useful.
14:16:10 <karsten> if you need to think a bit about this, feel free to send me something later today or tomorrow.
14:16:24 <karsten> (it took me a bit to go through trac, email, todo lists, etc. to find good tasks.)
14:16:37 <phw> i don't have anything right now, but i have a question regarding collector's data.
14:16:39 <karsten> (which I'll save for next week.)
14:16:42 <karsten> sure
14:17:15 <phw> have you ever experimented with putting (parts of) collector's data in a database for easy querying?
14:17:46 <karsten> we're using a database for parts of the metrics website, yes.
14:17:52 <phw> i'm currently experimenting with ways to make the data easier to analyse.  see also my recent mail to damian on tor-dev@.
14:18:08 <karsten> and we're using a database for exonerator.
14:18:42 <karsten> the idea is to keep this database general purpose, not make a new database for a new problem?
14:19:42 <phw> is the current database flexible enough to answer questions such as "which guards changed their ip address more than X times?"
14:19:52 <karsten> not at all.
14:20:10 <karsten> so,
14:20:26 <karsten> I think a major problem is that the data is spread across more than one type of descriptor.
14:20:27 <phw> ah.  because that's the kind of question i find myself asking a lot when analysing bad relays.  and answering them involves a bit of manual work.
14:20:39 <karsten> which doesn't matter in this specific case,
14:20:54 <karsten> but for many problems you want to combine consensuses with server descriptors and even extra-info descriptors.
14:21:28 <karsten> and table joins are expensive.
14:21:59 <phw> by saying "not at all", do you mean it's impossible or just very slow?
14:22:02 <karsten> though in this case you'd be happy to wait a bit, right?
14:22:31 <karsten> the current database is written specifically for the purpose of producing the exact aggregate statistics shown on the metrics website.
14:22:36 <phw> personally, i'm fine with waiting several seconds.  several minutes would make it a little bit annoying.
14:22:47 <karsten> it's not flexible at all. you could use it as inspiration, but not to solve your problem.
14:23:09 <karsten> who would use that database? just you?
14:23:14 <karsten> like, not the internet?
14:23:45 <karsten> unfortunately, it's quite possible that you'll have to wait for minutes or even longer. until you figure out which index you're missing.
14:23:58 <karsten> it's a huge amount of data, and it's easy to screw up performance-wise.
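
To make the kind of query phw asks about concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `statusentry` table, its columns, and the index are hypothetical, not the metrics or exonerator schema, and SQLite stands in for PostgreSQL only so the example runs without a server. It also simplifies the question to "guards seen with more than X distinct addresses", which undercounts relays that flip back and forth between the same addresses.

```python
import sqlite3

# Hypothetical, heavily simplified schema: one row per relay per consensus.
SCHEMA = """
CREATE TABLE statusentry (
    valid_after TEXT NOT NULL,    -- consensus valid-after time
    fingerprint TEXT NOT NULL,    -- relay identity fingerprint
    address     TEXT NOT NULL,    -- IPv4 address listed in the consensus
    is_guard    INTEGER NOT NULL  -- 1 if the relay had the Guard flag
);
-- Assumed index; without something like it the grouping below may have to
-- scan and sort the whole table, which is slow on real-sized data.
CREATE INDEX statusentry_fp_addr ON statusentry (fingerprint, address);
"""

QUERY = """
SELECT fingerprint, COUNT(DISTINCT address) AS addresses
FROM statusentry
WHERE is_guard = 1
GROUP BY fingerprint
HAVING COUNT(DISTINCT address) > ?
ORDER BY addresses DESC, fingerprint
"""


def guards_with_many_addresses(conn, x):
    """Return (fingerprint, distinct address count) for guards above x."""
    return conn.execute(QUERY, (x,)).fetchall()


def demo():
    """Load a few made-up rows and run the query with X = 2."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    rows = [
        ("2015-07-29 12:00:00", "AAAA", "192.0.2.1", 1),
        ("2015-07-29 13:00:00", "AAAA", "192.0.2.2", 1),
        ("2015-07-29 14:00:00", "AAAA", "192.0.2.3", 1),
        ("2015-07-29 12:00:00", "BBBB", "198.51.100.7", 1),
        ("2015-07-29 13:00:00", "BBBB", "198.51.100.7", 1),
        ("2015-07-29 12:00:00", "CCCC", "203.0.113.1", 0),
        ("2015-07-29 13:00:00", "CCCC", "203.0.113.2", 0),
        ("2015-07-29 14:00:00", "CCCC", "203.0.113.3", 0),
    ]
    conn.executemany("INSERT INTO statusentry VALUES (?, ?, ?, ?)", rows)
    return guards_with_many_addresses(conn, 2)
```

Calling `demo()` from a Python shell returns only the guard that appeared with three addresses; the non-guard relay with three addresses and the guard with a single address are filtered out. Questions that also need server descriptors or extra-info descriptors would require joining further tables, which is where the cost karsten mentions comes in.
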
14:24:12 <phw> i would like it to be used by anyone who wants.  either by setting up a dedicated service (which might be difficult) or by asking people to set up their own service, locally.
14:24:29 <karsten> the latter sounds good as a start.
14:24:55 <karsten> do you already have a database schema for this?
14:25:02 <phw> i should probably find a database person at the university and have a chat.
14:25:12 <karsten> oh, if you can find such a person, yes.
14:25:13 <phw> no, nothing.
14:25:25 <karsten> I can also take a look. but I'm not a database person.
14:25:32 <karsten> but I could comment on the tor specifics.
14:25:46 <karsten> like, which data could be missing, or what's potentially expensive to join, etc.
14:25:58 <phw> i was thinking it would be cool to have the data in a python shell eventually, which is more flexible and would facilitate exploratory analysis.
14:26:13 <karsten> I suggest you also look at the exonerator database schema, which is better designed, though still not perfect. let me find a link.
14:26:28 <karsten> well, there's also psql. :)
14:26:55 <karsten> you just need to write the importer, possibly using python.
14:27:23 <phw> psql is postgresql?
14:27:24 <karsten> for extra performance, write the importer in a way that produces .sql files that you can then import with psql.
14:27:32 <karsten> ah, yes, its command-line tool.
14:27:53 <karsten> https://gitweb.torproject.org/exonerator.git/tree/db/exonerator.sql
14:28:35 <karsten> here's another example for a metrics thing using psql: https://gitweb.torproject.org/metrics-tasks.git/tree/task-8462
14:28:52 <karsten> with the importer being https://gitweb.torproject.org/metrics-tasks.git/tree/task-8462/src/Parse.java
14:29:02 <karsten> look at the end. it writes files that psql can import.
14:29:11 <phw> very useful, thanks!
14:29:12 <karsten> that's faster than using any binding.
14:29:17 <karsten> to python/java/etc.
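
A minimal sketch of the "write files that psql can import" idea, in Python. It assumes the importer emits `COPY ... FROM stdin` blocks in PostgreSQL's text format; whether Parse.java writes exactly this format is not confirmed here, and the table and column names in the usage note are hypothetical.

```python
def copy_escape(value):
    """Escape one field for PostgreSQL's COPY text format."""
    if value is None:
        return r"\N"  # COPY's default marker for NULL
    text = str(value)
    # Backslash first, so the escapes added below are not escaped again.
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def render_copy(table, columns, rows):
    """Return a COPY block: header, tab-separated rows, end-of-data marker."""
    lines = ["COPY %s (%s) FROM stdin;" % (table, ", ".join(columns))]
    for row in rows:
        lines.append("\t".join(copy_escape(v) for v in row))
    lines.append("\\.")  # a line containing only backslash-period ends the data
    return "\n".join(lines) + "\n"


def write_sql_file(path, table, columns, rows):
    """Write the COPY block to a file that psql can read with -f."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(render_copy(table, columns, rows))
```

For example, `write_sql_file("statusentry.sql", "statusentry", ["fingerprint", "address"], rows)` produces a file that could then be loaded with `psql -f statusentry.sql <database>`, assuming the table already exists. The importer itself never opens a database connection, which is what makes this route fast for bulk loads.
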
14:29:44 <karsten> sure. fun stuff! :)
14:30:13 <karsten> anything else we should talk about while we're here?
14:30:41 <phw> that's basically what kept me busy.  maybe something you would like to talk about?
14:31:44 <karsten> no, I think we talked about two important things. nothing else comes to mind now.
14:32:00 <phw> ok.
14:32:11 <karsten> did the meeting reminder reach you on time?
14:32:20 <karsten> like, is 24 hours in advance good? or too late?
14:32:37 <phw> i use tor's google calendar, so i actually don't need a reminder.
14:32:44 <karsten> oh!
14:32:55 <karsten> okay, that works, too.
14:33:06 <karsten> great, I'll send out the next reminder 24 hours in advance for the folks who don't use it.
14:33:38 <karsten> okay, let's end this meeting early then. be sure to send me something to review for the 1-1-1 thing if you want.
14:33:47 <phw> will do!
14:33:52 <karsten> :)
14:33:54 <karsten> #endmeeting