Keyword Selection For Synthetic Content

Which keywords are susceptible to internet censorship? (05.02.2023)

Funding year 2021 / Scholarship Call #16 / Project ID: 5900 / Project: Synthetic Content for Probe-Resistant Proxies

Finding keywords prone to censorship

In this blog post, I look at internet censorship and at the keywords most likely to trigger it. During my research I ran an experiment that identified infrastructure web servers and performed remote measurements against them to find out which keywords are most susceptible to censorship.

The goal was to determine which content is currently being blocked. The findings give insight into the current state of internet censorship and will later help in selecting the theme on which the synthetic content is based. I used a remote measurement technique similar to Hyperquack, which tests what is being censored without setting up test machines in every country.

Experiment

The experiment is based on predictable error responses: a deviation from such a response can indicate censorship interference. Because it uses a remote scan technique, it does not require control of probe machines in each individual country.

The depiction above showcases a scenario where a client sends a request for google.com to a server that hosts the domain, and the server responds with the expected Google landing page.

Figure: Expected error message

If, however, a client asks a web server that doesn't host google.com for the search engine page, a predictable error response is generated and sent back to the client. This behavior can be used to create an expected response template for the tested web server; if a later response deviates from it, the deviation can be a strong indicator of interference by a censorship device. By sending multiple control requests for non-existent domains, a solid, predictable response template was created. After that, different keywords or domain names were sent to the tested web server, and deviations from the template revealed which keywords are being censored.
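The template idea can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiment: the `Response` record, the compared fields (status code, header names, body length) and the `tolerance` value are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """Simplified view of one HTTP response (hypothetical record layout)."""
    status: int
    headers: frozenset  # header names only, lower-cased
    body_length: int


def build_template(controls):
    """Derive the expected response from several control probes.

    Returns None when the controls disagree on the status code, because a
    server without a stable error response cannot be measured this way.
    """
    statuses = {r.status for r in controls}
    if len(statuses) != 1:
        return None
    # Keep only the header names that every control response contained.
    common_headers = frozenset.intersection(*(r.headers for r in controls))
    lengths = [r.body_length for r in controls]
    return {
        "status": statuses.pop(),
        "headers": common_headers,
        "min_length": min(lengths),
        "max_length": max(lengths),
    }


def deviations(template, response, tolerance=64):
    """List the ways a test response differs from the template.

    tolerance (an assumed value) allows for error pages that echo the
    requested host name and therefore vary slightly in length.
    """
    found = []
    if response.status != template["status"]:
        found.append("status")
    if not template["headers"] <= response.headers:
        found.append("headers")
    low = template["min_length"] - tolerance
    high = template["max_length"] + tolerance
    if not low <= response.body_length <= high:
        found.append("body_length")
    return found
```

An empty list from `deviations` means the test response matched the template. A non-empty list only marks an anomaly that deserves a closer look, since no single field is proof of censorship on its own.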

Figure: Keyword-based censoring on the path, with a censor injecting into the communication

The graphic above outlines the entire process, in which the measurement machine tests a web server in a specific country for signs of censorship. It first builds a solid response template by probing the server with rather harmless keywords such as example1.com, domains that are usually not served by that server. Once the template has been established, it sends probes for domains that could trigger a censor on the path. For this experiment I used domains from various categories, such as social media, news media, human rights issues and many more, to test which keywords are prone to censorship. For example, if access to Facebook is blocked in a specific country, censorship devices may interrupt the traffic flow or redirect the user to a blockpage. Either reaction produces a response that deviates from the template and reveals the censored keyword.
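A single probe of this kind could look roughly like the following Python sketch. It covers plain HTTP only, where the keyword travels in the `Host` header. For HTTPS the name would typically be carried in the TLS SNI field instead, which is not shown here. The function names, the header set and the three outcome labels are assumptions for illustration and do not describe the Golang tool used in the experiment.

```python
import socket


def build_request(domain, path="/"):
    """Return raw HTTP/1.1 request bytes whose Host header carries the keyword."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {domain}",
        "Accept: */*",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe(server_ip, domain, port=80, timeout=5.0):
    """Send one probe and return (outcome, raw_bytes).

    outcome is "response", "reset" or "timeout". Other connection errors
    (for example a refused connection) are left to the caller.
    """
    try:
        with socket.create_connection((server_ip, port), timeout=timeout) as sock:
            sock.sendall(build_request(domain))
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
            return "response", b"".join(chunks)
    except ConnectionResetError:
        # An injected RST surfaces here as a reset connection.
        return "reset", b""
    except socket.timeout:
        return "timeout", b""


def plan_probes(control_domains, test_domains):
    """Controls go first so the template exists before test keywords are sent."""
    return [("control", d) for d in control_domains] + [("test", d) for d in test_domains]
```

Sending the control domains first matters: if the server already behaves inconsistently for harmless names, later deviations for sensitive names cannot be attributed to a censor.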

Results

The experiment was conducted on December 14, 2022 and took 36 hours to complete, resulting in 17.6 million data points for HTTPS and 26 million for HTTP. The list of domains was sourced from Citizen Lab's test list and the Tranco list, resulting in 868,672 domains being tested. A multi-threaded Golang application sent the probes, received the responses and parsed them into a JSON file. Python scripts were used for analysis, and the data was vetted to exclude faulty responses.

Figure: HTTPS censorship data, comparing the number of anomalies detected per country

The focus was on heavily censored countries according to the latest Freedom House report. China and Iran showed strong indicators of censorship, while Russia showed surprisingly few. Measuring internet censorship comes with challenges, such as the lack of a single strong indicator and limited coverage. In the HTTPS scan, the anomalies were mostly deviations in header fields and injected RST packets. The most affected categories globally can be viewed in the table below.

Table: Domain categories censored, showing how many domains of each category are prone to censorship

On a global scale, topics surrounding pornography, social networking, media sharing, news media and culture seem to be among the most affected categories. Depending on the actual location of the web server, the trend can shift dramatically: websites about political criticism and human rights issues stand out in China, and websites about censorship circumvention, alcohol, gambling and online dating in Iran. If the location of the web server is not taken into account, keywords surrounding themes like economics and public health seem to be among the safest choices.
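A per-category ranking like the one discussed here could be derived from the parsed probe results along the following lines. This is a sketch under assumptions: the record fields `country`, `domain`, `category` and `anomaly` are hypothetical names and not the actual JSON schema of the experiment.

```python
from collections import Counter, defaultdict


def rank_categories(records):
    """Count anomalous (country, domain) pairs per category.

    Returns (overall, per_country): a Counter over categories, and a
    mapping from country to its own Counter over categories.
    """
    seen = set()
    overall = Counter()
    per_country = defaultdict(Counter)
    for rec in records:
        if not rec["anomaly"]:
            continue
        key = (rec["country"], rec["domain"])
        if key in seen:
            continue  # count each domain only once per country
        seen.add(key)
        overall[rec["category"]] += 1
        per_country[rec["country"]][rec["category"]] += 1
    return overall, per_country
```

Calling `overall.most_common()` gives the global ordering, and `per_country["CN"].most_common()` (assuming countries are stored as codes) shows how the picture shifts for a single location. Categories that rarely appear in either view are candidates for a safe theme.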

Armin Huremagic
