15:58:11 #startmeeting anti-censorship team meeting
15:58:11 Meeting started Thu Dec 10 15:58:11 2020 UTC. The chair is phw. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:58:11 Useful Commands: #action #agreed #help #info #idea #link #topic.
15:58:15 hi
15:58:23 hi all. here's our meeting pad: https://pad.riseup.net/p/tor-anti-censorship-keep
15:58:45 hi
15:59:05 i added a few discussion items but didn't have enough time to prepare, so i'll probably cover them next week
15:59:47 agix: let's talk about your topic then
15:59:56 sure
16:00:04 I'll provide some more context
16:00:29 The research question on that issue is how to generate a synthetic index page for HTTPT proxies.
16:01:08 The way I see it, we would first need to find content keywords that most likely won't be blocked, for something like a web blog. We could feed those keywords into OpenAI's text generator, for example, to create the content for the blog.
16:01:21 What seems trickier is how to generate different DOM structures and CSS layouts for different proxy index pages.
16:01:44 So I wanted to gather your thoughts on this and, additionally, on what a realistic threat model would look like. Are we trying to defend against web scrapers or human censors?
16:01:57 If the latter, will the index page be sufficient or do we need to generate more content and pages in order to not attract attention?
16:02:52 i would say that scanners should be in the threat model but targeted analysis by a person should not
16:03:55 yeah, agreed. at least for a first step
16:04:05 hmm, is it really such a big deal if we have blocked keywords on the page? only if we assume that the censor would then block the entire host/domain, no?
16:04:46 we should probably avoid it as much as we can but it may not be an automatic death sentence
16:04:59 good point
16:05:55 so in case we consider scanners a potential threat, would the index page be enough to disguise the proxy?
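[editor's note: a minimal sketch of the masquerade idea discussed above — an HTTPT-style server answering unauthenticated probes with a plausible default index page. This is not HTTPT's actual code; the handler name and page markup are illustrative placeholders.]

```python
import http.server
import threading
import urllib.request

# An Apache-style default page; the exact markup is a placeholder, not from HTTPT.
DEFAULT_INDEX = b"<html><body><h1>It works!</h1></body></html>"

class CoverPageHandler(http.server.BaseHTTPRequestHandler):
    """Answer any probe that lacks proxy credentials with an innocuous index page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(DEFAULT_INDEX)))
        self.end_headers()
        self.wfile.write(DEFAULT_INDEX)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet

def serve_cover_page(port=0):
    """Start the cover server on an ephemeral port; return (server, bound port)."""
    server = http.server.HTTPServer(("127.0.0.1", port), CoverPageHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]
```

A scanner fetching `/` from this server sees only the generic page; whether that is "good enough" is exactly the question raised above.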
16:06:10 i wonder if the GAN work our race colleagues have been doing would come in handy for evaluating and coming up with pages that look realistic enough to a bot?
16:06:22 * cohosh says this not really knowing much about GANs
16:07:14 I don't have much experience with GANs either ^^ but phw was kind enough to ask the people from race and they provided us with the paper
16:07:45 ah nice!
16:07:45 I still have to look into it though
16:08:39 I also found another paper, where researchers were able to create synthetic research papers; perhaps that's also helpful for us
16:08:48 there may be papers out there that looked at the landing pages of web servers and can tell you what you could expect
16:09:20 a simple default nginx/apache index page may be good enough in a lot of cases (even though other censors may not hesitate to block those)
16:09:58 oh, this one? https://pdos.csail.mit.edu/archive/scigen/
16:10:20 oh yeah i think it was that one
16:10:56 sergey also proposed similar approaches, where we could display common error pages or a login page
16:11:45 oh a login page is a good one
16:11:55 yes, i like that too
16:12:52 is there any way to evaluate how "censorable" a web page is?
16:13:29 just based on the appearance
16:14:20 i don't know of any previous work on this
16:14:34 that's difficult, in part because it's country-specific
16:14:51 i also think we shouldn't get carried away worrying about increasingly exotic attacks
16:15:01 similar to how people worry about obfs4 flow classifiers
16:15:11 ...when the real issue is that a simple decision tree already works great
16:15:13 for a research paper you'll probably need a way to evaluate it though
16:15:29 but I agree with the simple decision tree
16:16:12 cohosh the evaluation might be a tricky one :-/
16:16:51 i guess you could attempt to show that a decision tree would have low accuracy when distinguishing between your page and non-HTTPT pages?
16:17:05 i always find evaluation for censorship resistance research a bit tricky
16:17:44 yeah I like that one, that might be a good way to do it
16:18:10 (the comparison to obfs4 isn't great because in httpt there's a clear difference between the protocol itself and the content that's served by the web server)
16:19:43 in other words: it may be wise to broaden your scope a bit and look at synthetic content specifically in the context of httpt
16:20:11 phw good point
16:20:14 if this was a corporate meeting room, i'd say you have to look at it holistically
16:20:47 Do you think that in the end a simple login page might still be a better choice than a complex AI-generated web page?
16:21:29 lol phw
16:22:14 i'd say diversity is important here. if all we have is login pages, then we may find ourselves in trouble soon. if we have login pages *and* default apache pages *and* synthetic content, it gets much harder for censors
16:24:02 sure, that makes sense
16:25:02 I don't want to take too much of your time, so thanks so much for the input and I'll keep the open issue updated on my progress :)
16:25:59 no worries, i enjoy these brainstorming sessions
16:26:24 cool, I will bug you with new ones in the future
16:27:56 ok, let's move to reviews
16:28:39 hmm, nothing?
16:28:47 i guess not
16:29:05 anything else for today?
16:29:18 not from me :)
16:29:25 same here
16:29:33 same. let's wrap it up then
16:29:35 #endmeeting
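[editor's note: the diversity point made in the meeting — login pages *and* default server pages *and* synthetic content — could look like this in practice: each proxy deployment picks one cover "persona" at install time, so the proxy population does not share a single fingerprint. The persona names and markup snippets are placeholders, not real HTTPT templates.]

```python
import random

# Hypothetical cover-page personas; the markup is illustrative only.
TEMPLATES = {
    "login": "<html><body><form method=post>"
             "<input name=user><input name=pass type=password></form></body></html>",
    "apache_default": "<html><body><h1>It works!</h1></body></html>",
    "nginx_default": "<html><body><h1>Welcome to nginx!</h1></body></html>",
    "error_404": "<html><body><h1>404 Not Found</h1></body></html>",
}

def pick_persona(rng=random):
    """Choose one persona per deployment so a censor cannot block all
    proxies by fingerprinting a single shared cover page."""
    name = rng.choice(sorted(TEMPLATES))
    return name, TEMPLATES[name]
```

The choice would be made once per install and kept stable, since a page that changes persona between probes is itself a fingerprint.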