17:02:41 <hellais> #startmeeting OONI gathering 2016-08-01
17:02:41 <MeetBot> Meeting started Mon Aug  1 17:02:41 2016 UTC.  The chair is hellais. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:02:41 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
17:02:50 <hellais> greetings!
17:02:57 <hellais> another month, another gathering
17:02:58 <agrabeli> hello :)
17:02:59 <hellais> who is here?
17:04:14 <MightyOctopus> [ooni-probe] hellais pushed 1 new commit to feature/webui2: https://git.io/v6f0P
17:04:14 <MightyOctopus> ooni-probe/feature/webui2 faf422e Arturo Filastò: Convert the director status into a property so the IP and ASN get's updated...
17:05:17 * hellais crickets
17:05:27 <sbs> helo!
17:05:47 <hellais> hey there!
17:05:50 * hodgepodge waves
17:06:02 <hellais> ok well I guess we can get started with this
17:06:24 <hellais> #topic discuss scripting language for measurement-kit
17:06:54 <hellais> so there is this thing called measurement-kit, that one day will replace the testing engine of ooniprobe.
17:08:01 <hellais> One of the things though is that C++ is not a language that makes experimentation and rapid prototyping as easy to do as in python. Not to mention the fact that if you want to deploy some new measurements in the field you need to ship people a rebuilt version of the client
17:08:19 <hellais> for this reason we have been throwing around the idea of having a scripting layer inside of measurement-kit
17:09:19 <hellais> sbs: I will let you explain the rest if you want
17:09:26 <sbs> hellais: yes, thanks
17:11:25 <sbs> so, other network tools use scripting, for example nmap uses lua to perform certain kinds of measurements
17:11:30 <sbs> see https://github.com/nmap/nmap/tree/master/scripts
17:11:45 <sbs> in their case, they are using lua (a popular embedded scripting language)
17:12:16 <sbs> similar cases could be, if you want, node.js and apache with mod_$language
17:12:44 <sbs> in all these cases the point is that you write low level code in C/C++ and you are able to express glue code in a higher layer language
17:13:22 <sbs> so, for example, most of NDT could be written easily in a scripting language because it has no performance constraints, except the part where we measure speed, which must be as efficient as possible
17:13:59 <sbs> I did some research and identified some possibilities:
17:14:37 <sbs> 1) lua, already mentioned above, for which there is already a quite advanced prototype inside of MK sources https://github.com/measurement-kit/measurement-kit/pull/700
17:15:24 <sbs> 2) chaiscript, another scripting language that is specifically designed to integrate well with C++, for which there is also a prototype, but a simpler one https://github.com/measurement-kit/measurement-kit/pull/701
17:15:53 <sbs> 3) an embedded javascript engine, such as https://github.com/svaarala/duktape or https://github.com/cesanta/v7
17:16:09 <sbs> other things to keep in mind:
17:16:34 <sbs> a) the popularity of the language (javascript here being the most popular)
17:17:05 <sbs> b) the complexity of converting data types to/from the scripting language (here the best seems to be chaiscript which does that natively)
17:17:15 <elation> Want to chime in saying that I am quite happy that you are looking at maintaining the scripting ability of OONI. I was quite worried when I heard about the move to C++.
17:17:34 <sbs> c) whether we want to enable a sync semantic rather than an async one, which should help people to more easily write code
17:17:46 <elation> Also IMO +1 to Lua since it is quickly becoming quite common in the network security community for adding scripting to netsec tools.
17:18:22 <sbs> d) the complexity of sandboxing (lua, for example, is quite sandboxed; others possibly are, but I have not checked)
17:18:35 <sbs> elation: the thing of lua that annoys me the most is that arrays start from 1
17:18:52 <sbs> elation: OTOH I really like that it has coroutines
17:19:07 <sbs> elation: thanks for your feedback :-)
17:19:20 <sbs> anyway... think I said it all... hellais do you want to add something more?
17:20:56 <hellais> well I think another aspect to consider is that generally, where performance is not a concern, it's probably better to have protocol parsing and serialization be done inside of a scripting language rather than in C++. To this end, it's also worth looking into which languages have parsing libraries for the protocols we may be interested in supporting.
17:21:49 <sbs> hellais: yes, thanks, this is super important as well!
17:23:22 <hodgepodge> sbs: what is the main purpose of using a scripting engine within MK? Would you say that it's intended to reduce the technical complexity of integrating MK into other libraries, test instrumentation, or?
17:24:24 <hodgepodge> With regards to the [1] indexed array issue, you could usually just use iterators, right?
17:25:22 <sbs> hodgepodge: I think scripting would simplify writing code for MK, experimenting, or also updating algorithms by updating the scripts
17:27:23 <sbs> hodgepodge: regarding arrays, it worries me that they start from 1 because it's not the usual thing and you need to pay attention to that, which goes against being relaxed because you're scripting
17:29:57 <willscott> i would highly recommend thinking about a specific test that you think would be a candidate for scripting, and starting with just enough capabilities to be able to run that test
17:30:02 <hodgepodge> That's a good point, but there's a chance that you won't encounter that issue. By the way, this helped me understand what you're looking into building: https://github.com/measurement-kit/measurement-kit/issues/702
17:31:43 <sbs> willscott: that's a very good suggestion, thanks! I think a good starting candidate could be some Neubot stuff that has not been implemented yet
17:33:30 <sbs> hodgepodge: yes, thanks for linking to the issue... now that I re-read it I think it's still accurate in terms of design ideas, but I am not sure that enforcing a go-like model is something that must be done
17:34:19 <hellais> yes I also think having a concrete example is useful to avoid too many over-engineering rabbit holes. As another concrete example I think it would be useful to have implemented the http-header-field-manipulation test of OONI in the scripting engine (that would give us insight into how a common class of tests can be written in it)
17:34:24 <elation> sbs: Yeah, it took me a while to learn to appreciate lua. But, the coroutines and the metatables become absolutely wonderful once you wrap your head around them.
17:35:17 <sbs> hellais: ah, yes, perhaps that is even better because we have more other OONI tests to compare to
17:38:05 <hodgepodge> I definitely second hellais' suggestion. So, are you planning on porting the tests from C++ to Lua, or do you think that the tests should be written mainly in C++ still?
17:38:11 <hodgepodge> What I said is a little ambiguous, one second.
17:38:55 <hodgepodge> For example, in the case of dns_consistency, and http_requests, what pieces would be written in Lua?
17:40:53 <sbs> hodgepodge:
17:40:56 <sbs> ehm
17:41:03 <hellais> I think future tests should be written using the scripting engine, so if we were to, for example, port the OONI dns_consistency and http_requests tests to measurement-kit and the scripting engine were available, they would be written in lua.
17:41:26 <sbs> hodgepodge: I think the logic for driving the test should be in the scripting language and the primitives implementing the test in C++
17:42:02 <hellais> how I imagine the scripting engine working is that it would expose some set of primitives (that are implemented under the hood in C++) that take care of actually doing the measurements on the network and submitting the results
17:42:12 <hodgepodge> That's what I was thinking too, sbs (i.e. the primitives would be in C++, and the logical flow of the test would be written in Lua). Thanks!
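The split agreed on above (measurement primitives in C++, test flow in the scripting layer) can be illustrated with a minimal sketch. It is written in Python rather than Lua purely for illustration. `dns_query` is a hypothetical stand-in for a primitive the engine would implement natively, the canned answers are made up, and the comparison rule is a deliberately simplified take on dns_consistency, not the real OONI logic.

```python
# Hypothetical stand-in for a primitive that the engine would implement in C++.
# It returns canned, made-up answers instead of touching the network.
def dns_query(resolver, hostname):
    canned = {
        ("8.8.8.8", "example.org"): ["93.184.216.34"],
        ("192.0.2.53", "example.org"): ["203.0.113.7"],
    }
    return canned.get((resolver, hostname), [])


# "Script" side: only the logical flow of the test lives here.
def dns_consistency(hostname, control_resolver, test_resolvers, query=dns_query):
    """Flag resolvers whose answers share nothing with the control answers."""
    control = set(query(control_resolver, hostname))
    report = {"hostname": hostname, "inconsistent": []}
    for resolver in test_resolvers:
        answers = set(query(resolver, hostname))
        if not answers & control:
            report["inconsistent"].append(resolver)
    return report


result = dns_consistency("example.org", "8.8.8.8", ["8.8.8.8", "192.0.2.53"])
print(result)
```

Passing the primitive in as `query` keeps the test flow independent of how the measurement is actually performed, which is the property the scripting layer is meant to provide.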
17:43:00 <hellais> these are the basic primitives that would be exposed for doing ooni tests for example: https://github.com/measurement-kit/measurement-kit/blob/master/include/measurement_kit/ooni/templates.hpp#L15
17:45:15 <sbs> yeah
17:45:43 <sbs> one important thing is the complexity of converting tables from C++ to lua
17:46:10 <sbs> there are libraries that make lua easier to use in C++11, such as sol2 -- https://github.com/ThePhD/sol2
17:46:19 <sbs> I have not investigated it in depth, though
17:50:36 <hodgepodge> (as an aside, I have a discussion topic for later regarding the ooni-pipeline workflow engine)
17:51:12 <hellais> sbs: do you have anything more to add on this topic?
17:51:19 <sbs> hodgepodge: I think we have discussed in depth the scripting thing, perhaps if no one has anything to add we could move forward?
17:51:35 <sbs> hellais: nope
17:52:37 <hodgepodge> Sweet, so let's go for it then?
17:53:20 <hellais> hodgepodge: hit it!
17:53:32 <hellais> #topic ooni-pipeline workflow engine
17:55:45 <hodgepodge> Okay, great! So, I'm going to be giving a presentation on writing data pipelines at my university, and will be using a subset of ooni-probe test results within the scope of the miniature pipeline that I'll be building. So far, I've provided an example of a sample workflow in Luigi, as well as a workflow in Airbnb's Airflow framework.
17:55:50 <hodgepodge> Since the presentation is tentatively scheduled for either August 8th, or August 15th, it aligns well with hellais/willscott's work with regards to designing the pipeline.
17:55:56 <hodgepodge> Right now, I've sketched out the following:
17:56:27 <hodgepodge> 1) The structure of the OONI test network;
17:56:32 <hodgepodge> 2) How measurements are performed
17:56:48 <hodgepodge> 3) How to write a pipeline in Airflow; and
17:57:19 <hodgepodge> 4) A DAG which can be used for the purpose of normalising the http-requests/http-invalid-request-line/web-connectivity tests
17:58:31 <sbs> hodgepodge: what is a DAG?
17:58:47 <hodgepodge> The question that I'd like to ask is, is anyone looking into any frameworks, or technologies that need to be mocked out (e.g. using Hive + PostgreSQL, Spark, etc.)
17:58:54 <hellais> Directed Acyclic Graph?
17:59:21 <hodgepodge> sbs: within the scope of workflow management a DAG is used to define the dependencies associated with tasks within a particular workflow.
17:59:27 <willscott> what do you mean by mocked out?
17:59:46 <willscott> i’m not convinced we have so much data at this point that we need spark or such
17:59:51 <hodgepodge> sbs: so, a task exists as a node in the DAG, and its edges define the dependencies.
17:59:51 <sbs> hellais hodgepodge: thanks, I see
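As a small illustration of the DAG idea described above, the sketch below models tasks as nodes and dependencies as edges using Python's standard `graphlib` module (Python 3.9+), and derives a valid execution order. The task names are hypothetical and do not come from ooni-pipeline, Luigi, or Airflow.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
# Task names are made up for illustration.
dag = {
    "fetch_reports": set(),
    "normalise": {"fetch_reports"},
    "detect_anomalies": {"normalise"},
    "load_postgres": {"normalise"},
    "publish": {"detect_anomalies", "load_postgres"},
}

# static_order() yields the tasks so that every dependency comes first.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Workflow managers add scheduling, retries, and state tracking on top, but the dependency model is essentially this graph.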
18:00:07 <willscott> we may likely still want postgres or some sort of database for interactive exploration of the final products
18:00:26 <hodgepodge> willscott: neither am I; I'd actually like to stay away from Hadoop and the related ecosystem for the same reasons that I proposed last year.
18:00:39 <willscott> full agreement here :)
18:01:07 <hodgepodge> willscott: By "mocked out" I mean, does anyone have any technologies that they'd like to see incorporated into an example workflow to get a feel for how things would look if we tied them into ooni-pipeline?
18:01:16 <hellais> hodgepodge: yes I don't think at our current scale there is that much need to resort to using hadoop related tooling. I think with smarter minimization of the data products we can easily get to where we want to be with just postgres.
18:01:27 <hodgepodge> ^
18:01:39 <hellais> hodgepodge: I think postgres would be a good thing to have at the end of the workflow
18:02:01 <willscott> one thing we can start with that lives off on its own a bit is keeping records of network structure
18:02:10 <willscott> periodic downloads of IP geolocation / routing tables
18:02:16 <hellais> also I would be interested in seeing how the workflow can be fed from a message queue of some sort
18:02:40 <hodgepodge> I'm currently thinking of designing a couple of workflows where the first performs ETL, and the second performs anomaly detection (it'd be triggered by updates to the PostgresDB, or scheduled)
18:04:04 <hodgepodge> That's a good point, hellais. By the way, I'm also going to be touching on how we can reduce the data storage requirements. I'm playing with LZ4 + dill right now for message serialization, and so far, I haven't had any issues.
18:04:07 <hellais> I think the anomaly detection logic is probably best placed before the Load stage though
18:05:15 <hellais> what I have seen with the current design is that loading everything in the database ends up being too wasteful and it's probably best to set the test_keys values aside for doing anomaly detection and not load them
18:05:49 <willscott> +1
18:06:08 <hellais> another aspect that is worth keeping in consideration is that the anomaly detection heuristics are things that change over time
18:06:40 <hellais> so the system needs to also take into account the fact that these can be updated and probably have a way of partitioning the data so it can re-run the anomaly detection logic over a subset of it
18:06:47 <hodgepodge> I haven't been able to look into ooni-pipeline lately due to my workload at the university, but what is the pipeline like right now in terms of its workflow?
18:07:14 <hodgepodge> i.e. is it still s3 -> disk -> postgres -> materialised views for anomaly detection?
18:07:22 <hellais> for example, when we find a new block page for Italy we want to re-run the anomaly detection that finds blockpages on all measurements of one class from a certain country
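The re-run scenario described above can be sketched as follows: measurements are partitioned by country and test name, so that an updated blockpage fingerprint only needs to be applied to one partition. This is a minimal sketch under stated assumptions: the `id` and `body` fields, the sample records, and the fingerprint string are all made up for illustration, and real measurements are considerably richer.

```python
from collections import defaultdict


def partition(measurements):
    """Group measurements by (country code, test name)."""
    parts = defaultdict(list)
    for m in measurements:
        parts[(m["probe_cc"], m["test_name"])].append(m)
    return parts


def find_blockpages(measurements, fingerprints):
    """Return ids of measurements whose body contains any known fingerprint."""
    return [
        m["id"]
        for m in measurements
        if any(fp in m["body"] for fp in fingerprints)
    ]


# Made-up sample data, not real OONI measurements.
measurements = [
    {"id": 1, "probe_cc": "IT", "test_name": "http_requests", "body": "<h1>Sito bloccato</h1>"},
    {"id": 2, "probe_cc": "IT", "test_name": "http_requests", "body": "<h1>Welcome</h1>"},
    {"id": 3, "probe_cc": "GR", "test_name": "http_requests", "body": "<h1>Sito bloccato</h1>"},
]

parts = partition(measurements)
# A new Italian fingerprint only triggers a re-run over the IT partition.
flagged = find_blockpages(parts[("IT", "http_requests")], ["Sito bloccato"])
print(flagged)
```

Keeping the heuristics as data (the fingerprint list) separate from the partitioned measurements is what makes re-running over a subset cheap.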
18:07:36 <hellais> hodgepodge: yes, it hasn't changed much
18:07:47 <hellais> except that the s3 step is side-chained
18:07:58 <hellais> so it's not blocking on the workflow anymore
18:08:18 <hodgepodge> Oh, nice.
18:08:33 <hellais> I have some uncommitted code where I came up with some meta language for defining anomalies
18:08:42 <hellais> I guess I can push it somewhere if you think it would be useful to take a look at it
18:08:59 <hodgepodge> So, I was thinking of something like this (I have an incomplete DAG mocked out here: https://github.com/TylerJFisher/ooni-lyzer/issues/9)
18:09:28 <hodgepodge> hellais: I think that just knowing the high-level stages of the pipeline is more valuable, IMO.
18:09:42 <hodgepodge> So, as I said, my DAG is incomplete.
18:10:39 <hodgepodge> I was thinking of branching after the load stage into the anomaly detection logic either using Postgres, or something like MapReduce.
18:11:18 <hellais> that looks good, except I think there needs to be an extra partitioning step before the load
18:11:25 <hodgepodge> How so?
18:12:29 <hellais> the general idea is that the only values that for sure need to go inside of the database are the top level keys (those not contained in the test_keys key)
18:13:21 <hodgepodge> Oh, yeah. That partitioning step is going to be incorporated into the "Generic normalisation" part of the workflow.
18:13:25 <willscott> i see
18:13:28 <hodgepodge> So it'll be something like this:
18:13:29 <hellais> the stuff inside of test_keys instead is what should be used by the anomaly detection logic and I think it's better to not overload the database with that data and instead keep it on disk in some smarter way
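The partitioning step discussed above could look roughly like this: the top-level keys become a compact database row, while the `test_keys` payload is serialized separately for on-disk storage. This is a sketch only. The function name and the sample measurement are hypothetical, and JSON is used here just as one plausible on-disk format.

```python
import json


def split_measurement(measurement):
    """Split a measurement into (database row, serialized test_keys)."""
    # Everything except test_keys is small and worth indexing in the database.
    row = {k: v for k, v in measurement.items() if k != "test_keys"}
    # The bulky test_keys payload is kept outside the database.
    blob = json.dumps(measurement.get("test_keys", {}), sort_keys=True)
    return row, blob


# Made-up sample measurement for illustration.
measurement = {
    "probe_cc": "IT",
    "probe_asn": "AS64496",
    "test_name": "http_requests",
    "test_keys": {"requests": [{"url": "http://example.org/"}]},
}

row, blob = split_measurement(measurement)
print(row)
print(blob)
```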
18:14:03 <willscott> i’m not sure if we need individual reports in postgres, but maybe aggregate views per ASN or per country, or per time period
18:14:22 <willscott> if there’s a row for each individual report, we’re going to still have a hard-to-work-with database
18:15:14 <hellais> willscott: that's true. Then maybe it's worthwhile to defer the loading to an even later stage
18:15:45 <hellais> and just tally the things we care about and put those inside of some fact table
18:16:13 <hodgepodge> Yeah, I think that it'd be really nice if we could move towards an OLAP database.
18:16:20 <hodgepodge> Right now it's OLTP right?
18:17:07 <hodgepodge> After switching from a 3NF to a star schema I noticed that it was almost trivial to write analytic queries.
18:17:57 <hodgepodge> The only issue is that we'd need to determine whether or not we can construct a star schema from the data we have.
18:19:55 <hodgepodge> We could always have one fact table per test, and then have a series of dimension tables that are shared between all of the tests.
18:21:31 <hodgepodge> So, fact_bridge_reachability might depend on dim_hosts, dim_bridges, dim_transports, dim_logs, dim_time, dim_date
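A star schema along these lines can be sketched with an in-memory SQLite database: one fact table that references shared dimension tables, queried with a simple join and aggregate. Only `fact_bridge_reachability`, `dim_transports`, and `dim_date` are taken from the discussion. The column names and the rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables, shared between tests.
CREATE TABLE dim_transports (transport_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT);

-- Fact table: one row per measurement, pointing at the dimensions.
CREATE TABLE fact_bridge_reachability (
    transport_id INTEGER REFERENCES dim_transports(transport_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    success INTEGER
);

-- Made-up sample rows.
INSERT INTO dim_transports VALUES (1, 'obfs4'), (2, 'vanilla');
INSERT INTO dim_date VALUES (1, '2016-08-01');
INSERT INTO fact_bridge_reachability VALUES (1, 1, 1), (1, 1, 0), (2, 1, 1);
""")

# Analytic query: measurements and successes per transport.
rows = conn.execute("""
    SELECT t.name, COUNT(*) AS total, SUM(f.success) AS reachable
    FROM fact_bridge_reachability AS f
    JOIN dim_transports AS t USING (transport_id)
    GROUP BY t.name
    ORDER BY t.name
""").fetchall()
print(rows)
```

Most analytic questions then reduce to joining the fact table with one or two dimensions and grouping, which is the convenience mentioned above.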
18:25:21 <willscott> i guess it depends a bit on what we’re trying to do with this data
18:26:03 <willscott> if it’s just explorer type visualizations, full data is overkill. if we want deeper explorations, or to open up the database itself to other analysts, then we probably do want to keep all of the individual measurements there for people to play with
18:26:52 <willscott> i guess, easiest might be to start with the full database, and then if we have performance issues that can be the cue to fall back to a smaller / more processed view of the data
18:29:11 <hodgepodge> Personally, I think that we should have all of the data accessible in a relational database so we can further develop the heuristics driving the OONI Explorer.
18:29:38 <hodgepodge> It's pretty expensive to fetch data from S3, so we should try to minimize the amount of data extraction we're performing, IMO.
18:29:45 <hodgepodge> It's expensive in terms of time, and capital costs.
18:29:49 <willscott> for ooni data, i’m on board
18:30:13 <willscott> if the explorer pulls in other data sources as well, like tor metrics, or the stuff princeton/iclab/satellite are doing
18:30:19 <hellais> well that is the model we currently have (full database) and we do have performance issues at the moment. I think there is some value in also allowing some analysis and exploration capabilities, not necessarily for the general public, but for us.
18:30:25 <willscott> then there’s a lot more data, and at some point we outgrow what a single machine database can do
18:30:30 <hodgepodge> willscott: Oh, so you're just thinking of separating the internal OONI data, and the Explorer data?
18:30:42 <hellais> I find myself quite often connecting to the database and running some queries to answer questions that are not answered by the explorer.
18:30:48 <hodgepodge> ^
18:30:48 <hodgepodge> Same.
18:30:51 <willscott> yeah, i think that’s right
18:31:06 <willscott> interactive exploration of ooni data in a relational DB seems to make a lot of sense
18:32:31 <hellais> that said I still think it needs to be partitioned better and we absolutely need to remove the 500 line materialized views that take 1 hour to update.
18:32:56 <hodgepodge> omg
18:33:08 <hodgepodge> Yeah, it sounds like you're using a 3NF schema.
18:33:19 <hodgepodge> I'll see if I still have Postgres creds.
18:34:06 <hellais> it's actually not 3NF
18:34:17 <hellais> but I don't think that would make things significantly better or worse
18:36:10 <hodgepodge> 3NF wouldn't really be suitable for the types of processing that you're performing anyway.
18:36:44 <hellais> yeah we wouldn't gain that much, since we aren't doing any updates on the actual rows
18:36:52 <hellais> or deletions for that matter
18:36:59 <hellais> so there are no anomalies of that sort to worry about
18:37:02 <hodgepodge> It's usually used for data ingestion, after which point you'd usually transform (denormalise) the data from the 3NF schema into a star, or snowflake schema.
18:39:07 <hodgepodge> In any event, I could take a look at the performance issues in a couple of weeks after I've graduated.
18:39:26 <hodgepodge> I think that we've touched on everything that I'd like to talk about. Does anyone have anything else that they'd like to discuss?
18:41:07 <hellais> I think we are good
18:41:14 <hellais> and we have already gone quite a bit overtime :P
18:41:41 <hellais> but it was a very useful discussion, thanks hodgepodge !
18:42:14 <hodgepodge> I think so too, thanks hellais, willscott for your insights. :D
18:43:59 <hellais> thank you all for attending
18:44:02 <hellais> godspeed!
18:44:05 <hellais> #endmeeting