{
"cells": [
{
"cell_type": "markdown",
"id": "41485f29-7287-40d2-a87a-c5daeb84f731",
"metadata": {},
"source": [
"## Analyzing raw OONI data, a case study\n",
"\n",
"The goal of this notebook is to explain some of the common workflows that can be adopted when performing analysis of OONI data. This will be done within the context of a specific case study and will focus on the analysis of [Web Connectivity](https://github.com/ooni/spec/blob/master/nettests/ts-017-web-connectivity.md) data.\n",
"\n",
"We will be focusing on answering the following 2 research questions:\n",
"- What domains present signs of blocking in Russia between the 23rd of February and the 17th of March 2022?\n",
"- How does the blocking vary from ISP to ISP?\n",
"\n",
"It can be useful, before you dive into more extensive analysis, to get a sense for what you are likely to find in the data by using the [Measurement Aggregation Toolkit](https://explorer.ooni.org/experimental/mat). For example you can pick a certain country and plot the [anomalies with a per-domain breakdown](https://explorer.ooni.org/experimental/mat?probe_cc=RU&test_name=web_connectivity&category_code=GRP&since=2022-03-09&until=2022-04-09&axis_x=measurement_start_day&axis_y=domain) (it's often helpful to limit the domains to categories that are most relevant, so as to focus on interesting insight).\n",
"\n",
"In doing so, you will understand if there is something interesting to investigate in the country in question at all and will also help in identifying some examples of interesting sites that you might want to further investigate.\n",
"\n",
"It's also posisble to use the same API the MAT relies on, for downloading the anomaly,confirmed,failure,ok breakdowns to be used in your own analysis or plotting tooling. Depending on the type of analysis you need to do, this might be sufficient, however keep in mind that the anomaly flag is [suscpetible to false positives](https://ooni.org/support/faq/#why-do-false-positives-occur).\n",
"\n",
"It's also useful, while you are performing the analysis, to refer to OONI Explorer to inspect the measurements that present anomalies, so as to be able to identify patterns that you can use to further improve your detection heuristics.\n",
"\n",
"At a high level the workflow we are going to look at is the following:\n",
"\n",
"![High level overview](https://kroki.io/blockdiag/svg/eNqVj7EKwkAMhnefIpM3CUVxEgVF3FxcHMQh9mINXpNyplQQ392ednAR6RjyfX_yn4LmV89YwGMA4NbaSFD0sFvuwaOhg9EC3IbQ6khAd4uYG6vAEEoWLvmGafxgqxTGUgCKqH0ttig1hlavgkbs_HNLUqwii4Eno3eum6U3evA_D_cNOjTs7TKfZNkxmf8rd8J42grPF1OgcX0=)\n",
"\n",
"### Downloading the data\n",
"\n",
"Once you have gotten a feel for the data, it's time to download the raw dataset.\n",
"\n",
"We offer a tool called oonidata (that's currently in BETA and be sure you have at least v0.2.3), which can be installed by running:\n",
"```\n",
"pip install oonidata\n",
"```\n",
"\n",
"To download all OONI data for this example notebook, run the following command (you should have at least 38GB on disk):\n",
"```\n",
"oonidata sync --start-day 2022-02-23 --end-day 2022-03-17 --probe-cc RU --test-name web_connectivity --output-dir ooni-russia-data\n",
"```"
]
},
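{
"cell_type": "markdown",
"id": "sketch-aggregation-api",
"metadata": {},
"source": [
"A minimal sketch of querying the aggregation API directly, assuming the endpoint is `https://api.ooni.io/api/v1/aggregation` and that each row of its `result` list carries `anomaly_count` and `measurement_count` fields. Both are assumptions based on what the MAT appears to request, and the helper names are made up for this example, so check the API documentation before relying on them.\n",
"\n",
"```python\n",
"from urllib.parse import urlencode\n",
"\n",
"# Assumed endpoint: the one the MAT appears to query.\n",
"AGGREGATION_URL = 'https://api.ooni.io/api/v1/aggregation'\n",
"\n",
"def build_aggregation_url(probe_cc, since, until, test_name='web_connectivity'):\n",
"    # Same axes as the MAT link: one row per (day, domain) pair.\n",
"    params = {\n",
"        'probe_cc': probe_cc,\n",
"        'test_name': test_name,\n",
"        'since': since,\n",
"        'until': until,\n",
"        'axis_x': 'measurement_start_day',\n",
"        'axis_y': 'domain',\n",
"    }\n",
"    return AGGREGATION_URL + '?' + urlencode(params)\n",
"\n",
"def anomaly_ratio(row):\n",
"    # row is assumed to be a dict with anomaly_count and measurement_count.\n",
"    total = row.get('measurement_count', 0)\n",
"    return row.get('anomaly_count', 0) / total if total else 0.0\n",
"\n",
"url = build_aggregation_url('RU', '2022-02-23', '2022-03-17')\n",
"print(url)\n",
"# Fetching is commented out because it needs network access:\n",
"# import requests\n",
"# rows = requests.get(url).json()['result']\n",
"# print(sorted(rows, key=anomaly_ratio, reverse=True)[:10])\n",
"```\n",
"\n",
"The anomaly ratio is only a rough first signal, since the anomaly flag can produce false positives."
]
},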
{
"cell_type": "code",
"execution_count": 1,
"id": "a96ac23b-e9cd-482c-b598-ba70184eee58",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from datetime import datetime, timedelta\n",
"from dateutil.parser import parse as parse_date\n",
"from urllib.parse import urlencode, quote, urlparse\n",
"\n",
"from tqdm import tqdm\n",
"tqdm.pandas()"
]
},
{
"cell_type": "markdown",
"id": "aff8b48e-786d-462f-91ed-881f995a9a5f",
"metadata": {},
"source": [
"### OONI Explorer utility functions\n",
"\n",
"Below are a couple of utility functions that are useful when dealing with measurements. They take a dataframe row and return (or print) the OONI Explorer URL, which makes it easier to inspect the raw measurement and understand what is going on."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "db0a313f-7410-409a-9059-a6e19bd157a9",
"metadata": {},
"outputs": [],
"source": [
"def get_explorer_url(e):\n",
"    query = ''\n",
"    if 'input' in e.keys() and e['input']:\n",
"        query = '?input={}'.format(quote(e['input'], safe=''))\n",
"    return 'https://explorer.ooni.org/measurement/{}{}'.format(e['report_id'], query)\n",
"\n",
"def print_explorer_url(e):\n",
"    print(get_explorer_url(e))"
]
},
{
"cell_type": "markdown",
"id": "0a2c55b2-f304-4288-92ff-965991ffdea6",
"metadata": {},
"source": [
"### Extracting metadata from raw measurements\n",
"\n",
"The OONI raw data is very rich, but for most analysis use-cases you just need a subset of the fields or some value that is derived from them.\n",
"\n",
"Below are functions that will extract all the metadata we care about from the web_connectivity test."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "b0ed1fd5-5338-43ec-ae9a-d3dd468a5e2f",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from base64 import b64decode\n",
"import hashlib\n",
"import json\n",
"import re\n",
"\n",
"def get_raw_measurement(row):\n",
"    # Fetch the full raw measurement for a dataframe row from the OONI API\n",
"    r = requests.get(\"https://api.ooni.io/api/v1/measurement_meta\", params={\n",
"        'report_id': row['report_id'],\n",
"        'input': row['input'],\n",
"        'full': True\n",
"    })\n",
"    j = r.json()\n",
"    return json.loads(j['raw_measurement'])\n",
"\n",
"def get_resolved_ips(msmt):\n",
"    queries = msmt['test_keys'].get('queries', [])\n",
"    if not queries:\n",
"        # No DNS query was recorded at all (kept distinct from an empty answer list)\n",
"        return ''\n",
"    # Only the answers to the first query are considered\n",
"    answers = queries[0].get('answers', [])\n",
"    if not answers:\n",
"        return []\n",
"\n",
"    ip_list = []\n",
"    for a in answers:\n",
"        ip = a.get('ipv4', '')\n",
"        if ip:\n",
"            ip_list.append(ip)\n",
"    return ip_list\n",
"\n",
"def get_control_failure(msmt):\n",
"    if 'test_keys' not in msmt:\n",
"        return 'missing_test_keys'\n",
"    return msmt['test_keys']['control_failure']\n",
"\n",
"def get_test_keys_blocking(msmt):\n",
"    return str(msmt['test_keys']['blocking'])\n",
"\n",
"def get_http_experiment_failure(msmt):\n",
"    return str(msmt['test_keys']['http_experiment_failure'])\n",
"\n",
"def get_resolver_info(msmt):\n",
"    return {\n",
"        'resolver_ip': msmt.get('resolver_ip', ''),\n",
"        'resolver_asn': msmt.get('resolver_asn', ''),\n",
"        'resolver_network_name': msmt.get('resolver_network_name', '')\n",
"    }\n",
"\n",
"def get_network_events(msmt):\n",
"    return msmt['test_keys'].get('network_events', [])\n",
"\n",
"def get_tcp_connect(msmt):\n",
"    return msmt['test_keys'].get('tcp_connect', [])\n",
"\n",
"def decode_body(body):\n",
"    if body is None:\n",
"        return ''\n",
"    if isinstance(body, dict):\n",
"        # Binary bodies are stored as a dict with base64 encoded data\n",
"        raw_body = b64decode(body['data'])\n",
"        try:\n",
"            return raw_body.decode('utf-8')\n",
"        except UnicodeDecodeError:\n",
"            # Not valid UTF-8: keep the raw bytes\n",
"            return raw_body\n",
"    return body\n",
"\n",
"def get_last_response_body(msmt):\n",
"    try:\n",
"        # The requests/response list sorts them from the newest to the oldest,\n",
"        # hence the first item in the list is the last response we received.\n",
"        body = msmt['test_keys']['requests'][0]['response']['body']\n",
"        return decode_body(body)\n",
"    except (KeyError, TypeError, IndexError):\n",
"        return ''\n",
"\n",
"TITLE_REGEXP = re.compile(\"
(.*?)\", re.IGNORECASE | re.DOTALL)\n",
"# Doesn't take into account ordering\n",
"META_TITLE_REGEXP = re.compile(\"= ts:\n",
" continue\n",
" if query.get('until') and parse_date(query['until']) <= ts:\n",
" continue\n",
" yield p\n",
" \n",
"def iter_raw_measurements(query):\n",
" path_list = list(iter_jsonl_paths(query))\n",
" print(f\"processing {len(path_list)}\")\n",
" for fp in tqdm(path_list):\n",
" for msmt in iter_msmts(fp):\n",
" if query.get('probe_asn') and msmt['probe_asn'] != query['probe_asn']:\n",
" continue\n",
" if query.get('domain'):\n",
" domain = urlparse(msmt['input']).netloc\n",
" if domain != query['domain']:\n",
" continue\n",
" yield msmt"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "6439fc37-2715-4cc1-9b50-a4754aebf955",
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"\n",
"def msmt_to_csv(query, output_file=\"output.csv\"):\n",
"    # newline='' lets the csv module control line endings, as its documentation recommends\n",
"    with open(output_file, 'w', newline='') as out_file:\n",
"        csv_writer = None\n",
"        for msmt in iter_raw_measurements(query):\n",
"            msmt_meta = get_measurement_meta(msmt)\n",
"            if csv_writer is None:\n",
"                # The keys of the first row define the CSV header\n",
"                fieldnames = msmt_meta.keys()\n",
"                csv_writer = csv.DictWriter(out_file, fieldnames=fieldnames)\n",
"                csv_writer.writeheader()\n",
"            csv_writer.writerow(msmt_meta)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "898e8ebf-bce6-4a7e-905a-0563460b539d",
"metadata": {},
"outputs": [],
"source": [
"def get_msmt_df(query):\n",
"    # Collect plain dicts and build the dataframe once, rather than\n",
"    # concatenating one single-row dataframe per measurement.\n",
"    msmt_list = [get_measurement_meta(msmt) for msmt in iter_raw_measurements(query)]\n",
"    return pd.DataFrame(msmt_list)"
]
},
{
"cell_type": "markdown",
"id": "5dbaf33a-5686-4c16-8c9c-380fa55f0302",
"metadata": {},
"source": [
"Here we do the actual conversion to CSV."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5a5546b9-d164-4f14-a115-ff14cfe675b4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"processing 14234\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" 94%|█████████████████████████████████████████████████████████ | 13411/14234 [1:27:58<05:48, 2.36it/s]"
]
}
],
"source": [
"msmt_to_csv({\n",
" 'since': '2022-02-23',\n",
" 'until': '2022-03-17',\n",
" 'probe_cc': 'RU',\n",
" 'test_name': 'web_connectivity'\n",
"}, output_file=\"ooni-data-russia.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "3f784f91-8de0-4595-9559-998ff3d40e23",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3423845 ooni-data-russia.csv\n"
]
}
],
"source": [
"!wc -l ooni-data-russia.csv"
]
},
{
"cell_type": "markdown",
"id": "0cb1e94e-746b-42cd-898f-6f46122e65d1",
"metadata": {},
"source": [
"We then load the CSV file into memory as a pandas dataframe for further analysis."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "3dbca11f-b14a-4ff8-8a78-dcd0c15db207",
"metadata": {},
"outputs": [],
"source": [
"df_ru = pd.read_csv('ooni-data-russia.csv')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "d7c05cfc-c6bc-48d3-b4d0-7ec3f85eecef",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3152336"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df_ru)"
]
},
{
"cell_type": "markdown",
"id": "9616a5be-8915-45af-93dd-4ad6e0d01c82",
"metadata": {},
"source": [
"When dealing with websites, we generally want to look at the data from a domain-centric perspective. This allows us to group together URLs that belong to the same domain but have different paths.\n",
"\n",
"Since the raw dataset doesn't include the `domain` we add this column here."
]
},
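{
"cell_type": "markdown",
"id": "sketch-urlparse-netloc",
"metadata": {},
"source": [
"As a small illustration of what the `domain` column will contain, the sketch below runs `urlparse` on a few made-up `example.com` URLs. Note that `netloc` keeps any port and any `www.` prefix, so `www.example.com` and `example.com` end up as separate domains.\n",
"\n",
"```python\n",
"from urllib.parse import urlparse\n",
"\n",
"# Made-up URLs, used only to show what netloc and hostname return.\n",
"urls = [\n",
"    'https://www.example.com/',\n",
"    'http://www.example.com/some/path?x=1',\n",
"    'https://example.com:8443/',\n",
"]\n",
"for url in urls:\n",
"    parsed = urlparse(url)\n",
"    # netloc keeps the port; hostname drops it and lowercases the name.\n",
"    print(parsed.netloc, parsed.hostname)\n",
"```\n",
"\n",
"The first two URLs share the same `netloc` despite having different schemes and paths, which is the grouping we are after."
]
},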
{
"cell_type": "code",
"execution_count": 13,
"id": "0adbc67f-adf9-4d7b-aa47-93b91b696b66",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████████████████████████████████████████████████| 3152336/3152336 [00:08<00:00, 365955.32it/s]\n"
]
}
],
"source": [
"df_ru['domain'] = df_ru['input'].progress_apply(lambda r: urlparse(r).netloc)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "c15eadb9-8149-4d80-b011-f9c72a59dbb2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"13.035878223367035"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ru.memory_usage(deep=True).sum()/1024**3"
]
},
{
"cell_type": "markdown",
"id": "9cf49c12-b6c0-4c0c-af15-9872f28fe712",
"metadata": {},
"source": [
"### Hunting for blocking fingerprints\n",
"\n",
"We can have very high confidence that blocking is intentional (and not caused by transient network failures) when it falls into one of the following classes:\n",
"- DNS level interference\n",
"- HTTP level interference\n",
"- TLS MITM\n",
"\n",
"\n",
"The first two classes, though, are susceptible to false positives, because the IP returned in a DNS query can differ based on geographical location (think CDNs) and the content of a web page can vary from request to request (think of the homepage of a news site).\n",
"\n",
"On the other hand, once we find a blocking fingerprint, we can claim with great confidence that access to that particular site is being restricted. For example, we might notice that when a site is blocked on a particular network, the DNS query always returns a given IP address, or we might know that the HTML title of a blockpage is always \"Access to this website is denied\".\n",
"\n",
"Our goal now is to come up with some heuristics that will allow us to hunt for these blockpage fingerprints in the large dataset that we have available."
]
},
{
"cell_type": "markdown",
"id": "7ce9e1ba-6888-44a4-bc74-3daaf697f89e",
"metadata": {},
"source": [
"### Same title, but different page\n",
"\n",
"One heuristic we can apply to spot blockpages is that a blockpage looks exactly the same across many different sites. Based on this fairly simple intuition, we can look for blockpage fingerprints by counting the number of domains that share the same HTML title."
]
},
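{
"cell_type": "markdown",
"id": "sketch-title-domain-count",
"metadata": {},
"source": [
"To make the heuristic concrete, here is a toy sketch on a handful of invented rows (the domains and titles are not taken from the dataset). A title shared by several unrelated domains stands out, while a title seen on a single domain does not.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy data, invented for illustration only.\n",
"toy = pd.DataFrame({\n",
"    'domain': ['a.example', 'b.example', 'c.example', 'a.example', 'news.example'],\n",
"    'http_title': ['Access denied', 'Access denied', 'Access denied', 'Access denied', 'Daily news'],\n",
"})\n",
"\n",
"# Count distinct domains per title; repeated measurements of one domain count once.\n",
"title_counts = toy.groupby('http_title')['domain'].nunique().sort_values(ascending=False)\n",
"print(title_counts)\n",
"```\n",
"\n",
"Using `nunique` rather than a plain row count matters here: a site that is measured very often would otherwise inflate the count for its own, legitimate, title."
]
},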
{
"cell_type": "code",
"execution_count": 15,
"id": "1b97c56c-2da8-4987-aabd-191f6fb5003a",
"metadata": {},
"outputs": [],
"source": [
"title_domain_count = df_ru[\n",
" df_ru['blocking'] == 'http-diff'\n",
"].groupby('http_title')['domain'].nunique().sort_values().reset_index()"
]
},
{
"cell_type": "markdown",
"id": "4881770f-2b65-4574-88c4-ac6611ee02cf",
"metadata": {},
"source": [
"As we can see in the breakdown below, all these blockpage fingerprints look fairly suspicious and are quite likely to be an indication of blocking. Some of them, however, might be signs of server-side blocking (e.g. geoblocking or DDoS prevention). This is why, to obtain a high degree of accuracy, it's best to investigate these manually and add them to a fingerprint database.\n",
"\n",
"This is a shared effort amongst censorship research projects. For example, you can find a repository of known blocking fingerprints maintained by the Citizen Lab here: https://github.com/citizenlab/filtering-annotations"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "8bf95466-f879-4187-a670-8f38ab86c95f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" dns_resolved_ips domain\n",
"1819 ['81.200.2.238'] 12\n",
"1820 ['188.114.97.132'] 13\n",
"1821 ['83.69.208.124'] 14\n",
"1822 ['127.0.0.2'] 16\n",
"1823 ['188.114.97.136'] 16\n",
"1824 ['188.114.96.136'] 16\n",
"1825 ['31.28.24.3'] 16\n",
"1826 ['35.168.95.233'] 16\n",
"1827 ['195.128.72.1'] 17\n",
"1828 ['195.128.72.3'] 17\n",
"1829 ['188.65.128.218'] 20\n",
"1830 ['212.109.26.243'] 21\n",
"1831 ['188.114.98.136'] 21\n",
"1832 ['188.114.99.136'] 22\n",
"1833 ['193.58.251.1'] 22\n",
"1834 ['188.114.98.132'] 24\n",
"1835 ['188.114.96.128'] 24\n",
"1836 ['81.88.208.208'] 28\n",
"1837 ['10.1.1.3'] 30\n",
"1838 ['89.21.139.21'] 31\n",
"1839 ['188.114.97.128'] 32\n",
"1840 ['188.114.97.7', '188.114.96.7'] 33\n",
"1841 ['188.114.96.7', '188.114.97.7'] 34\n",
"1842 ['37.252.254.39'] 36\n",
"1843 ['188.114.97.2', '188.114.96.2'] 37\n",
"1844 ['188.114.98.128'] 39\n",
"1845 ['78.29.1.40'] 42\n",
"1846 ['188.43.20.67'] 44\n",
"1847 ['188.114.99.132'] 44\n",
"1848 ['188.114.96.2', '188.114.97.2'] 44\n",
"1849 ['185.77.150.2'] 46\n",
"1850 ['46.175.31.250'] 49\n",
"1851 ['188.114.99.128'] 60\n",
"1852 ['176.103.130.135'] 62\n",
"1853 ['0.0.0.0'] 73\n",
"1854 ['78.24.40.190'] 78\n",
"1855 ['62.140.245.46'] 81\n",
"1856 ['46.175.31.251'] 81\n",
"1857 ['127.0.0.1'] 89\n",
"1858 ['77.238.226.53'] 163\n",
"1859 ['62.33.207.197', '62.33.207.196'] 202\n",
"1860 ['85.142.29.248'] 203\n",
"1861 ['62.33.207.196', '62.33.207.197'] 208\n",
"1862 ['80.76.104.20'] 222\n",
"1863 ['100.64.64.66'] 223\n",
"1864 ['95.213.158.61'] 225\n",
"1865 ['188.186.157.49'] 238\n",
"1866 [] 1590"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dns_resp_sorted[\n",
" dns_resp_sorted['domain'] > 10\n",
"]"
]
},
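{
"cell_type": "markdown",
"id": "sketch-normalise-resolved-ips",
"metadata": {},
"source": [
"The listing above contains the same pair of addresses twice (`['188.114.97.7', '188.114.96.7']` and `['188.114.96.7', '188.114.97.7']`), because the column holds the string form of a list and the order of the answers varies. One possible way around this, sketched here with a hypothetical `normalise_ips` helper that is not part of the original analysis, is to parse the string and sort the addresses before grouping.\n",
"\n",
"```python\n",
"import ast\n",
"\n",
"def normalise_ips(value):\n",
"    # value is the string stored in the CSV, e.g. \"['1.2.3.4', '5.6.7.8']\".\n",
"    # Missing values (NaN after read_csv) and empty strings map to an empty tuple.\n",
"    if not isinstance(value, str) or not value:\n",
"        return ()\n",
"    return tuple(sorted(ast.literal_eval(value)))\n",
"\n",
"a = normalise_ips(\"['188.114.97.7', '188.114.96.7']\")\n",
"b = normalise_ips(\"['188.114.96.7', '188.114.97.7']\")\n",
"print(a == b)\n",
"```\n",
"\n",
"Applied to the dataframe, this would look like `df_ru['dns_resolved_ips'].map(normalise_ips)`; tuples are used because they are hashable and can therefore be passed to `groupby`."
]
},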
{
"cell_type": "code",
"execution_count": 23,
"id": "8806744f-7408-4994-b4e8-bb941ba7d0ee",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"360 sci-hub.se\n",
"549 www.ej.ru\n",
"1896 www.shram.kiev.ua\n",
"1902 zhurnal.lib.ru\n",
"1948 rutracker.org\n",
" ... \n",
"3149505 bluesystem.info\n",
"3149517 www.rollitup.org\n",
"3150702 rutracker.org\n",
"3151623 www.bbm.com\n",
"3151928 nnmclub.to\n",
"Name: domain, Length: 15447, dtype: object"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ru[\n",
" (df_ru['blocking'] == 'dns')\n",
" & (df_ru['dns_resolved_ips'] == \"['188.186.157.49']\")\n",
"]['domain']"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "c5fb9b86-0596-467e-9827-ac0266aca99e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://explorer.ooni.org/measurement/20220302T160209Z_webconnectivity_RU_41733_n1_3bHDEUlWMQ3J7M9e?input=https%3A%2F%2Fsci-hub.se%2F\n"
]
}
],
"source": [
"print_explorer_url(df_ru.iloc[360])"
]
},
{
"cell_type": "markdown",
"id": "79b97d53-57bc-4862-aeff-3ce8c7f20c30",
"metadata": {},
"source": [
"### DNS inconsistency false positive removal\n",
"\n",
"To understand if what we are looking at is a real blocking IP or not, we can use the following heuristics:\n",
"\n",
"1. Does the IP in question have a PTR record pointing to something that looks like a blockpage (ex. a hostname that is related to the ISP)\n",
"2. What information can we get about the IP by doing a whois lookup\n",
"3. Is the ASN of the IP the same as the network where the measurement was collected\n",
"4. Do we get a valid TLS certificate for one of the domains in question when doing a TLS handshake and specifying the SNI\n",
"\n",
"Using these 4 conditions, we are generally able to understand if it's in fact a blocking IP or not"
]
},
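{
"cell_type": "markdown",
"id": "sketch-blocking-ip-checks",
"metadata": {},
"source": [
"The checks above can be partly automated. The following is a rough sketch under stated assumptions, not the method used in the rest of this notebook: `reverse_ptr`, `normalise_asn` and `looks_like_blocking_ip` are hypothetical helpers, and the decision rule (same network and no valid certificate) is a simplification of checks 3 and 4. The values plugged in at the end are the ones from the two worked examples in this notebook.\n",
"\n",
"```python\n",
"import socket\n",
"\n",
"def reverse_ptr(ip):\n",
"    # Check 1: PTR lookup; returns '' when there is no record. Needs network access.\n",
"    try:\n",
"        return socket.gethostbyaddr(ip)[0]\n",
"    except OSError:\n",
"        return ''\n",
"\n",
"def normalise_asn(asn):\n",
"    # Accept both 'AS31483' (the probe_asn format) and 31483.\n",
"    return int(str(asn).upper().replace('AS', ''))\n",
"\n",
"def looks_like_blocking_ip(ip_asn, ip_org, probe_asn, probe_org, tls_valid):\n",
"    # Checks 3 and 4 combined: the IP sits in the probe's own network\n",
"    # (same ASN or same organisation name) and does not serve a valid\n",
"    # certificate for the tested domain.\n",
"    same_network = (\n",
"        normalise_asn(ip_asn) == normalise_asn(probe_asn)\n",
"        or ip_org.strip().lower() == probe_org.strip().lower()\n",
"    )\n",
"    return same_network and not tls_valid\n",
"\n",
"er_telecom = looks_like_blocking_ip(\n",
"    31483, 'JSC \"ER-Telecom Holding\"', 'AS41733', 'JSC \"ER-Telecom Holding\"', tls_valid=False)\n",
"cloudflare = looks_like_blocking_ip(\n",
"    13335, 'Cloudflare, Inc.', 'AS31257', 'Orion Telecom LLC', tls_valid=True)\n",
"print(er_telecom, cloudflare)\n",
"```\n",
"\n",
"Comparing organisation names as well as AS numbers matters because a large ISP can operate several ASNs. A rule like this is best treated as a way to rank candidates for manual review rather than as a final verdict."
]
},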
{
"cell_type": "markdown",
"id": "4d64bb3d-cc1f-4044-bc3a-501061ec842b",
"metadata": {},
"source": [
"### True positive example\n",
"\n",
"In the following example we can see that the IP `188.186.157.49`:\n",
"\n",
"1. Has a PTR record pointing to `k8s-lb-onlyhttp-cluster-ingress.static.cc.ertelecom.ru`\n",
"2. Is owned by the ISP, according to the whois record\n",
"3. Belongs to an AS whose network name is the same as that of the measured network\n",
"4. Serves a certificate with the common name \"*.dom.ru\" (i.e. it's not valid for sci-hub.se)\n",
"\n",
"This gives us a strong indication that it is in fact a blockpage IP."
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "99b3d33d-5c27-4eca-bbd9-f7fdf9810547",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"49.157.186.188.in-addr.arpa domain name pointer k8s-lb-onlyhttp-cluster-ingress.static.cc.ertelecom.ru.\n"
]
}
],
"source": [
"!host 188.186.157.49"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "fdb48924-f43b-4fec-9121-29aaf454b555",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"% This is the RIPE Database query service.\n",
"% The objects are in RPSL format.\n",
"%\n",
"% The RIPE Database is subject to Terms and Conditions.\n",
"% See http://www.ripe.net/db/support/db-terms-conditions.pdf\n",
"\n",
"% Note: this output has been filtered.\n",
"% To receive output for a database update, use the \"-B\" flag.\n",
"\n",
"% Information related to '188.186.0.0 - 188.187.255.255'\n",
"\n",
"% Abuse contact for '188.186.0.0 - 188.187.255.255' is '[email protected]'\n",
"\n",
"inetnum: 188.186.0.0 - 188.187.255.255\n",
"netname: RU-RAID-20090619\n",
"country: RU\n",
"org: ORG-RA21-RIPE\n",
"admin-c: RAID1-RIPE\n",
"tech-c: RAID1-RIPE\n",
"status: ALLOCATED PA\n",
"mnt-by: RIPE-NCC-HM-MNT\n",
"mnt-by: RAID-MNT\n",
"mnt-lower: RAID-MNT\n",
"mnt-routes: RAID-MNT\n",
"created: 2009-06-19T14:03:12Z\n",
"last-modified: 2016-05-30T12:40:21Z\n",
"source: RIPE # Filtered\n",
"\n",
"organisation: ORG-RA21-RIPE\n",
"org-name: JSC \"ER-Telecom Holding\"\n",
"country: RU\n",
"org-type: LIR\n",
"address: str. Shosse Kosmonavtov, 111, bldg. 43, office 509\n",
"address: 614990\n",
"address: Perm\n",
"address: RUSSIAN FEDERATION\n",
"phone: +7 342 2462233\n",
"fax-no: +7 342 2195024\n",
"admin-c: ERTH3-RIPE\n",
"tech-c: RAID1-RIPE\n",
"abuse-c: RAID1-RIPE\n",
"mnt-ref: RIPE-NCC-HM-MNT\n",
"mnt-ref: RAID-MNT\n",
"mnt-ref: ENFORTA-MNT\n",
"mnt-ref: AS8345-MNT\n",
"mnt-ref: RU-NTK-MNT\n",
"mnt-by: RIPE-NCC-HM-MNT\n",
"mnt-by: RAID-MNT\n",
"created: 2004-04-17T11:56:55Z\n",
"last-modified: 2021-05-17T06:43:35Z\n",
"source: RIPE # Filtered\n",
"\n",
"role: ER-Telecom ISP Contact Role\n",
"address: JSC \"ER-Telecom\"\n",
"address: 111, str. Shosse Kosmonavtov\n",
"address: 614000 Perm\n",
"address: Russian Federation\n",
"phone: +7 342 2462233\n",
"fax-no: +7 342 2463344\n",
"abuse-mailbox: [email protected]\n",
"remarks: 24/7 phone number: +7-342-2362233\n",
"admin-c: AAP113-RIPE\n",
"tech-c: AAP113-RIPE\n",
"tech-c: GRIF59-RIPE\n",
"nic-hdl: RAID1-RIPE\n",
"mnt-by: RAID-MNT\n",
"created: 2005-02-11T12:50:50Z\n",
"last-modified: 2022-01-11T06:25:37Z\n",
"source: RIPE # Filtered\n",
"\n",
"% Information related to '188.186.157.0/24AS31483'\n",
"\n",
"route: 188.186.157.0/24\n",
"origin: AS31483\n",
"org: ORG-RA21-RIPE\n",
"descr: JSC \"ER-Telecom Holding\"\n",
"descr: Russia\n",
"mnt-by: RAID-MNT\n",
"created: 2016-05-12T07:15:31Z\n",
"last-modified: 2016-05-12T07:15:31Z\n",
"source: RIPE # Filtered\n",
"\n",
"organisation: ORG-RA21-RIPE\n",
"org-name: JSC \"ER-Telecom Holding\"\n",
"country: RU\n",
"org-type: LIR\n",
"address: str. Shosse Kosmonavtov, 111, bldg. 43, office 509\n",
"address: 614990\n",
"address: Perm\n",
"address: RUSSIAN FEDERATION\n",
"phone: +7 342 2462233\n",
"fax-no: +7 342 2195024\n",
"admin-c: ERTH3-RIPE\n",
"tech-c: RAID1-RIPE\n",
"abuse-c: RAID1-RIPE\n",
"mnt-ref: RIPE-NCC-HM-MNT\n",
"mnt-ref: RAID-MNT\n",
"mnt-ref: ENFORTA-MNT\n",
"mnt-ref: AS8345-MNT\n",
"mnt-ref: RU-NTK-MNT\n",
"mnt-by: RIPE-NCC-HM-MNT\n",
"mnt-by: RAID-MNT\n",
"created: 2004-04-17T11:56:55Z\n",
"last-modified: 2021-05-17T06:43:35Z\n",
"source: RIPE # Filtered\n",
"\n",
"% This query was served by the RIPE Database Query Service version 1.102.3 (HEREFORD)\n",
"\n",
"\n"
]
}
],
"source": [
"!whois 188.186.157.49"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "5251312c-f5af-4092-b3e0-b4a2bae66357",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'autonomous_system_number': 31483,\n",
" 'autonomous_system_organization': 'JSC \"ER-Telecom Holding\"'}"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lookup_asn(\"188.186.157.49\")"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "b69ac508-a875-4e73-8657-eb19cc978ec8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'JSC \"ER-Telecom Holding\"'"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ru.iloc[360]['probe_network_name']"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "1edcd0af-7573-4071-b4f3-8c9912d37410",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'AS41733'"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ru.iloc[360]['probe_asn']"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "c3e8b467-cf43-4e1f-8406-a5cb8bd2f067",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"depth=2 C = US, ST = New Jersey, L = Jersey City, O = The USERTRUST Network, CN = USERTrust RSA Certification Authority\n",
"verify return:1\n",
"depth=1 C = RU, ST = Moscow, L = Moscow, O = RU-Center (\\D0\\97\\D0\\90\\D0\\9E \\D0\\A0\\D0\\B5\\D0\\B3\\D0\\B8\\D0\\BE\\D0\\BD\\D0\\B0\\D0\\BB\\D1\\8C\\D0\\BD\\D1\\8B\\D0\\B9 \\D0\\A1\\D0\\B5\\D1\\82\\D0\\B5\\D0\\B2\\D0\\BE\\D0\\B9 \\D0\\98\\D0\\BD\\D1\\84\\D0\\BE\\D1\\80\\D0\\BC\\D0\\B0\\D1\\86\\D0\\B8\\D0\\BE\\D0\\BD\\D0\\BD\\D1\\8B\\D0\\B9 \\D0\\A6\\D0\\B5\\D0\\BD\\D1\\82\\D1\\80), CN = RU-CENTER High Assurance Services CA 2\n",
"verify return:1\n",
"depth=0 C = RU, ST = Permskiy kray, L = Perm, O = JSC ER-Telecom Holding, OU = job, CN = *.dom.ru\n",
"verify return:1\n",
"DONE\n"
]
}
],
"source": [
"!echo Q | openssl s_client -connect 188.186.157.49:443 -servername sci-hub.se | openssl x509 -noout -text | grep sci-hub.se"
]
},
{
"cell_type": "markdown",
"id": "7e2e3d4f-a33d-48fb-8566-302d0bc4e66c",
"metadata": {},
"source": [
"### False positive example\n",
"\n",
"In the following example we can see that the IP `188.114.97.7`:\n",
"\n",
"1. Doesn't have a PTR record\n",
"2. Is owned by Cloudflare, according to the whois record\n",
"3. Has an ASN that is **not** the same as that of the measured network\n",
"4. Serves a valid certificate for `mastodon.cloud` when doing a TLS handshake\n",
"\n",
"We can conclude that this is most likely a false positive."
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "b50a205b-fc38-45f5-8329-0fb4928b8d8e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"944 www.blogdir.ru\n",
"1333 www.freewebspace.com\n",
"1805 www.babyplan.ru\n",
"2676 mastodon.cloud\n",
"2924 sputnikipogrom.com\n",
" ... \n",
"3145972 hitwe.com\n",
"3146741 www.resistance88.com\n",
"3146962 www.wftucentral.org\n",
"3149916 www.nostraightnews.com\n",
"3150651 www.metal-archives.com\n",
"Name: domain, Length: 4255, dtype: object"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ru[\n",
" df_ru['dns_resolved_ips'] == \"['188.114.97.7', '188.114.96.7']\"\n",
"]['domain']"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "88251b53-51e3-46c5-8945-39dc7ce399cf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://explorer.ooni.org/measurement/20220302T224757Z_webconnectivity_RU_31257_n1_ElZKi2MAW05O7NYj?input=https%3A%2F%2Fmastodon.cloud%2F\n"
]
}
],
"source": [
"print_explorer_url(df_ru.iloc[2676])"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "c9ab47c9-a824-4e41-83ba-441588823a42",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Host 7.97.114.188.in-addr.arpa. not found: 3(NXDOMAIN)\n"
]
}
],
"source": [
"!host 188.114.97.7"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "d7003aa0-ee69-4492-9180-598ac554c36c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"% This is the RIPE Database query service.\n",
"% The objects are in RPSL format.\n",
"%\n",
"% The RIPE Database is subject to Terms and Conditions.\n",
"% See http://www.ripe.net/db/support/db-terms-conditions.pdf\n",
"\n",
"% Note: this output has been filtered.\n",
"% To receive output for a database update, use the \"-B\" flag.\n",
"\n",
"% Information related to '188.114.96.0 - 188.114.99.255'\n",
"\n",
"% Abuse contact for '188.114.96.0 - 188.114.99.255' is '[email protected]'\n",
"\n",
"inetnum: 188.114.96.0 - 188.114.99.255\n",
"netname: CLOUDFLARENET-EU\n",
"descr: CloudFlare, Inc.\n",
"descr: 101 Townsend Street, San Francisco, CA 94107, US\n",
"descr: +1 (650) 319-8930\n",
"descr: https://cloudflare.com/\n",
"country: US\n",
"admin-c: CAC80-RIPE\n",
"tech-c: CTC6-RIPE\n",
"status: ASSIGNED PA\n",
"mnt-by: MNT-CLOUDFLARE\n",
"mnt-lower: MNT-CLOUDFLARE\n",
"mnt-routes: MNT-CLOUDFLARE\n",
"remarks: https://cloudflare.com/abuse\n",
"created: 2015-10-16T16:26:10Z\n",
"last-modified: 2015-10-16T16:26:10Z\n",
"source: RIPE\n",
"\n",
"person: Cloudflare Abuse Contact\n",
"address: 101 Townsend Street, San Francisco, CA 94107, US\n",
"phone: +1 (650) 319-8930\n",
"remarks: All Cloudflare abuse reporting can be done via https://www.cloudflare.com/abuse\n",
"nic-hdl: CAC80-RIPE\n",
"mnt-by: MNT-CLOUDFLARE\n",
"created: 2012-06-01T23:27:49Z\n",
"last-modified: 2018-06-10T10:14:26Z\n",
"source: RIPE # Filtered\n",
"\n",
"person: Cloudflare Technical Contact\n",
"address: 101 Townsend Street, San Francisco, CA 94107, US\n",
"phone: +1 (650) 319-8930\n",
"nic-hdl: CTC6-RIPE\n",
"mnt-by: MNT-CLOUDFLARE\n",
"created: 2012-06-01T23:35:57Z\n",
"last-modified: 2018-06-10T10:16:13Z\n",
"source: RIPE # Filtered\n",
"\n",
"% Information related to '188.114.97.0/24AS13335'\n",
"\n",
"route: 188.114.97.0/24\n",
"origin: AS13335\n",
"mnt-by: MNT-CLOUDFLARE\n",
"created: 2020-06-15T18:05:37Z\n",
"last-modified: 2020-06-15T18:05:37Z\n",
"source: RIPE # Filtered\n",
"\n",
"% This query was served by the RIPE Database Query Service version 1.102.3 (HEREFORD)\n",
"\n",
"\n"
]
}
],
"source": [
"!whois 188.114.97.7"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "a44b47ba-3204-4867-b63e-a29f0453d764",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'autonomous_system_number': 13335,\n",
" 'autonomous_system_organization': 'Cloudflare, Inc.'}"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lookup_asn(\"188.114.97.7\")"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "a65aeca9-1588-4fce-a82b-2a37b7e72a53",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AS31257\n",
"Orion Telecom LLC\n"
]
}
],
"source": [
"print(df_ru.iloc[2676]['probe_asn'])\n",
"print(df_ru.iloc[2676]['probe_network_name'])"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "1d44e940-0b8b-4a2b-953a-1886f3a3e8d5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"depth=2 C = IE, O = Baltimore, OU = CyberTrust, CN = Baltimore CyberTrust Root\n",
"verify return:1\n",
"depth=1 C = US, O = \"Cloudflare, Inc.\", CN = Cloudflare Inc ECC CA-3\n",
"verify return:1\n",
"depth=0 C = US, ST = California, L = San Francisco, O = \"Cloudflare, Inc.\", CN = sni.cloudflaressl.com\n",
"verify return:1\n",
"DONE\n",
" DNS:mastodon.cloud, DNS:sni.cloudflaressl.com, DNS:*.mastodon.cloud\n"
]
}
],
"source": [
"!echo Q | openssl s_client -connect 188.114.97.7:443 -servername mastodon.cloud | openssl x509 -noout -text | grep mastodon.cloud"
]
},
{
"cell_type": "markdown",
"id": "a87f65c8-b17f-49ec-9ab2-a8d497cc3e1d",
"metadata": {},
"source": [
"We can then rinse and repeat this process until we have divided all these anomalous IPs into those confirmed to be associated with blocking and those that are false positives.\n",
"\n",
"Similarly we can do this for the HTML titles."
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "77d8a03e-1a05-4b2d-94bd-71c9a8c61dbd",
"metadata": {},
"outputs": [],
"source": [
"confirmed_ips = [\n",
" # PTR record is k8s-lb-onlyhttp-cluster-ingress.static.cc.ertelecom.ru\n",
" # Serves blockpage for: http://lawfilter.ertelecom.ru/\n",
" '188.186.157.49',\n",
" # PTR records are block.tdsplus.ru & balance.tdsplus.ru\n",
" # We get connection refused when attempting to access it \n",
" '80.76.104.20',\n",
" # PTR record is block.runnet.ru\n",
" # We get a blockpage when attempting to access it\n",
" '85.142.29.248',\n",
" # AS is mapped to 49505 - SELECTEL\n",
" '95.213.158.61',\n",
" # Known Russian blockpages\n",
" '62.33.207.197',\n",
" '62.33.207.196',\n",
" # Blockpage for AS60139\n",
" '185.77.150.2',\n",
" # Blockpage for AS42429\n",
" '77.238.226.53',\n",
" # Blockpage for AS8369\n",
" '78.29.1.40',\n",
" # Blockpage for AS8427\n",
" '188.43.20.67',\n",
" # Blockpage for AS52207\n",
" '195.128.72.3',\n",
" # Blockpage for AS12389\n",
" '31.28.24.3',\n",
" # Likely blockpage for AS197460\n",
" # reverse pointer to host-46-175-31-251.rev.zencom.ru.\n",
" # as of 2022-03-05 connection times out when accessing it\n",
" '46.175.31.251',\n",
" # Likely blockpage for AS3335\n",
" # PTR record host190.49.237.84.nsu.ru\n",
" # as of 2022-03-05 503 error when accessing page\n",
" '84.237.49.190'\n",
"]\n",
"\n",
"false_positive_ips = [\n",
" '188.114.97.7',\n",
" '188.114.96.7'\n",
"]\n",
"\n",
"confirmed_titles = [\n",
" 'ÐоÑÑÑп к ÑеÑÑÑÑÑ Ð¾Ð³ÑаниÑен'\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "907d1dbc-b9f4-42ae-936a-1e7067506baf",
"metadata": {},
"outputs": [],
"source": [
"valid_ip_map = {}"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "5882b3a1-29c3-4519-a517-d232f7e39a1d",
"metadata": {},
"outputs": [],
"source": [
"import certifi\n",
"import ssl\n",
"import socket\n",
"\n",
"def is_tls_valid(ip, hostname):\n",
" if len(df_ru[\n",
" (df_ru['dns_resolved_ips'].str.contains(ip, na=False))\n",
" & (df_ru['domain'] == hostname)\n",
" & (df_ru['input'].str.startswith('https'))\n",
" & (df_ru['http_experiment_failure'] == 'None')\n",
" ]) > 0:\n",
" return True\n",
"\n",
" context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)\n",
" context.load_verify_locations(certifi.where())\n",
"\n",
" with socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0) as sock:\n",
" sock.settimeout(1)\n",
" with context.wrap_socket(sock, server_hostname=hostname) as conn:\n",
" try:\n",
" conn.connect((ip, 443))\n",
" # TODO: do we care to distinguish these values?\n",
" except ssl.SSLCertVerificationError:\n",
" return False\n",
" except ssl.SSLError:\n",
" return False\n",
" except socket.timeout:\n",
" return False\n",
" except socket.error:\n",
" return False\n",
" except:\n",
" return False\n",
" return True\n",
"\n",
"def is_tls_valid_with_cache(ip, hostname):\n",
" key = f\"{ip}{hostname}\"\n",
" if key in valid_ip_map:\n",
" return valid_ip_map[key]\n",
" valid_ip_map[key] = is_tls_valid(ip, hostname)\n",
" return valid_ip_map[key]"
]
},
{
"cell_type": "markdown",
"id": "6ba0d045-31c2-4766-8b19-544860051169",
"metadata": {},
"source": [
"### Putting it all together\n",
"\n",
"We can then proceed to automating the detection on the full dataset. Our goal is that of recomputing the `blocking` feature for each individual measurement based on our improved heuristics.\n",
"\n",
"In addition to the previously discussed DNS and HTTP based blocking, we are going to additionally classify blocking that happens at different layers of the network stack.\n",
"\n",
"Specifically, we are going to be using the following identifiers for the various ways in which blocking might occur:\n",
"\n",
"#### DNS\n",
"* dns.confirmed - one of the returned IPs matches an IP known to be used to implement blocking\n",
"* dns.no_ipv4 - no IPv4 address was returned\n",
"* dns.bogon - a bogon IP address was returned\n",
"* dns.nxdomain - we got an NXDOMAIN response from the probe, but we got a valid response from the control vantage point\n",
"* dns.inconsistent - our DNS consistency heuristics determined the returned IP to be inconsistent\n",
"\n",
"#### HTTP\n",
"\n",
"These are all blocking types related to plaintext HTTP requests:\n",
"\n",
"* http.confirmed - the returned page is a known blockpages\n",
"* http.http_diff - the page doesn't match based on our page consistency heuristics\n",
"* http.connection_reset - we got a connection reset to a plaintext HTTP request\n",
"* http.connection_closed - the connection was closed before all data was transmitted\n",
"* http.connection_timeout - the connection timed out before we could retrieve all the data \n",
"* http.generic_failure - this is an generic error from legacy OONI probes\n",
"\n",
"#### TLS\n",
"\n",
"These are all blocking types related to TLS:\n",
"\n",
"* tls.connection_reset - a reset packet was seen after the client sent the ClientHello packet\n",
"* tls.connection_closed - the connection was closed after the ClientHello\n",
"* tls.connection_timeout - the connection timed out after the ClientHello\n",
" * All of the above can also have the `_after_hello` suffix, indicating that the event happened after the client sent the ClienHello packet\n",
"* tls.mitm - The DNS is consistent, but the TLS certificate validation failed. This suggest a TLS man-in-the-middle\n",
"* tls.generic_failure - generic error from legacy OONI probes\n",
"\n",
"#### TCP/IP\n",
"\n",
"This is when blocking is implemented by targeting the IP address of the host:\n",
"\n",
"* tcp.connection_reset - the TCP connect test failed due to a reset packet\n",
"* tcp.connection_timeout - the TCP connect test failed with a timeout"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "477cde66-7af8-446d-afaa-5e859fc74e1b",
"metadata": {},
"outputs": [],
"source": [
"from ast import literal_eval\n",
"import ipaddress\n",
"\n",
"def normalize_failure(failure_str):\n",
" if \"An existing connection was forcibly closed by the remote host\" in failure_str:\n",
" return \"connection_reset\"\n",
" if \"No address associated with hostname\" in failure_str:\n",
" return \"dns_nxdomain_error\"\n",
" return failure_str\n",
"\n",
"def is_dns_asns_consistent(dns_resolved_ips, control_measurement, row):\n",
" try:\n",
" control_addrs = control_measurement['dns']['addrs']\n",
" if not control_addrs:\n",
" return False\n",
" control_asns = set(list(map(lambda e: e['autonomous_system_number'], \n",
" filter(lambda e: e != None, map(lookup_asn, control_addrs)))))\n",
" exp_asns = set(list(map(lambda e: e['autonomous_system_number'], \n",
" filter(lambda e: e != None, map(lookup_asn, dns_resolved_ips)))))\n",
" if exp_asns.intersection(control_asns):\n",
" return True\n",
" except KeyError:\n",
" # Missing control measurement\n",
" return False\n",
" return False\n",
"\n",
"bogon_ipv4_ranges = [\n",
" ipaddress.ip_network(\"0.0.0.0/8\"), # \"This\" network\n",
" ipaddress.ip_network(\"10.0.0.0/8\"), # Private-use networks\n",
" ipaddress.ip_network(\"100.64.0.0/10\"), # Carrier-grade NAT\n",
" ipaddress.ip_network(\"127.0.0.0/8\"), # Loopback\n",
" ipaddress.ip_network(\"127.0.53.53\"), # Name collision occurrence\n",
" ipaddress.ip_network(\"169.254.0.0/16\"), # Link local\n",
" ipaddress.ip_network(\"172.16.0.0/12\"), # Private-use networks\n",
" ipaddress.ip_network(\"192.0.0.0/24\"), # IETF protocol assignments\n",
" ipaddress.ip_network(\"192.0.2.0/24\"), # TEST-NET-1\n",
" ipaddress.ip_network(\"192.168.0.0/16\"), # Private-use networks\n",
" ipaddress.ip_network(\"198.18.0.0/15\"), # Network interconnect device benchmark testing\n",
" ipaddress.ip_network(\"198.51.100.0/24\"), # TEST-NET-2\n",
" ipaddress.ip_network(\"203.0.113.0/24\"), # TEST-NET-3\n",
" ipaddress.ip_network(\"224.0.0.0/4\"), # Multicast\n",
" ipaddress.ip_network(\"240.0.0.0/4\"), # Reserved for future use\n",
" ipaddress.ip_network(\"255.255.255.255/32\"), # Limited broadcast\n",
"]\n",
"def is_dns_bogon(dns_resolved_ips):\n",
" for ip in dns_resolved_ips:\n",
" ipv4addr = ipaddress.IPv4Address(ip)\n",
" if any([ipv4addr in ip_range for ip_range in bogon_ipv4_ranges]):\n",
" return True\n",
" return False\n",
"\n",
"def is_dns_tls_consistent(dns_resolved_ips, row):\n",
" # If it's a HTTPs site and we didn't get a TLS error, we can assume the IPs are valid\n",
" if row['input'].startswith('https://') and row['http_experiment_failure'] == 'None':\n",
" return False\n",
" \n",
" for ip in dns_resolved_ips:\n",
" domain = urlparse(row['input']).netloc\n",
" if is_tls_valid_with_cache(ip, domain):\n",
" # We consider the first hit to be enough to consider it consistent\n",
" return True\n",
" return False\n",
"\n",
"def is_dns_false_positive(dns_resolved_ips):\n",
" for ip in dns_resolved_ips:\n",
" if ip in false_positive_ips:\n",
" return True\n",
" return False\n",
"\n",
"def recompute_blocking(row):\n",
" try:\n",
" dns_resolved_ips = literal_eval(row['dns_resolved_ips'])\n",
" except:\n",
" dns_resolved_ips = []\n",
"\n",
" blocking = row['blocking']\n",
" for ip in dns_resolved_ips:\n",
" if ip in confirmed_ips:\n",
" return 'dns.confirmed'\n",
" \n",
" # This is a special case for when we got no ipv4 addresses and the network doesn't support ipv6\n",
" if len(dns_resolved_ips) == 0 and row['http_experiment_failure'] == 'network_unreachable':\n",
" return 'dns.no_ipv4'\n",
" \n",
" if is_dns_bogon(dns_resolved_ips):\n",
" return 'dns.bogon'\n",
"\n",
" try:\n",
" control_measurement = literal_eval(row['control_measurement'])\n",
" except:\n",
" return 'invalid'\n",
" if not control_measurement:\n",
" return 'invalid'\n",
" \n",
" if control_measurement['http_request']['failure'] != None:\n",
" return 'invalid'\n",
"\n",
" if (normalize_failure(row['dns_experiment_failure']) == 'dns_nxdomain_error' and \n",
" control_measurement.get('http_request', {}).get('failure', '') != 'dns_lookup_error'):\n",
" return 'dns.nxdomain'\n",
"\n",
" if (\n",
" not (row['input'].startswith('https://') and row['http_experiment_failure'] == 'None') \n",
" and not is_dns_false_positive(dns_resolved_ips) \n",
" and not is_dns_asns_consistent(dns_resolved_ips, control_measurement, row)\n",
" #and not is_dns_tls_consistent(dns_resolved_ips, row)\n",
" ):\n",
" return 'dns.inconsistent'\n",
"\n",
" # If we got down to here, it means that DNS is consistent \n",
" if row['http_title'] in confirmed_titles:\n",
" return 'http.confirmed'\n",
" \n",
" if blocking == 'http-diff' and row['input'].startswith('http://'):\n",
" return 'http.http_diff'\n",
" \n",
" if row['http_experiment_failure'] != 'None':\n",
" tcp_connect_list = literal_eval(row['tcp_connect'])\n",
" for conn in tcp_connect_list:\n",
" if conn['status']['failure'] == 'connection_reset':\n",
" return 'tcp.connection_reset'\n",
" elif conn['status']['failure'] == 'generic_timeout_error':\n",
" return 'tcp.connection_timeout'\n",
" \n",
" # We compute TLS level anomalies this using the network_events\n",
" tls_handshake_started = False\n",
" try:\n",
" network_events = literal_eval(row['network_events'])\n",
" except:\n",
" network_events = []\n",
" if network_events:\n",
" for idx, network_event in enumerate(network_events):\n",
" if network_event['operation'] == 'write':\n",
" write_operations += 1\n",
" if network_event['operation'] == 'read':\n",
" read_operations += 1\n",
"\n",
" if tls_handshake_started and network_event['failure']:\n",
" # We are guaranteed to not be out of bounds due to the tls_handshake_started flag\n",
" prev_operation = network_events[idx-1]\n",
" \n",
" suffix = ''\n",
" if normalize_failure(network_event['failure']) == 'connection_reset':\n",
" return f'tls.connection_reset{suffix}'\n",
" elif normalize_failure(network_event['failure']) == 'eof_error':\n",
" return f'tls.connection_closed{suffix}'\n",
" elif normalize_failure(network_event['failure']) == 'generic_timeout_error':\n",
" return f'tls.connection_timeout{suffix}'\n",
" if write_operations > 1:\n",
" suffix = f'_after_hello'\n",
"\n",
" if network_event['operation'] == 'tls_handshake_start':\n",
" tls_handshake_started = True\n",
" write_operations = 0\n",
" read_operations = 0\n",
" if network_event['operation'] == 'tls_handshake_done':\n",
" tls_handshake_started = False\n",
"\n",
" # If we got down to here, it means the DNS consistency checks have passed\n",
" # For the http related failures, if we are spotting them here, it means the test most likely doesn't support the \n",
" # new network_events keys, and therefore the results are a bit less accurate.\n",
" # This should ideally be indicated via a lower confidence value.\n",
" if normalize_failure(row['http_experiment_failure']) == 'connection_reset':\n",
" if row['input'].startswith('https://'):\n",
" return 'tls.connection_reset'\n",
" else:\n",
" return 'http.connection_reset'\n",
" elif normalize_failure(row['http_experiment_failure']) == 'eof_error':\n",
" if row['input'].startswith('https://'):\n",
" return 'tls.connection_closed'\n",
" else:\n",
" return 'http.connection_closed'\n",
" elif normalize_failure(row['http_experiment_failure']) == 'generic_timeout_error':\n",
" if row['input'].startswith('https://'):\n",
" return 'tls.connection_timeout'\n",
" else:\n",
" return 'http.connection_timeout'\n",
" # It's not just using DNS to point us to an IP that serves a blockpage and it's a TLS MITM\n",
" elif row['input'].startswith('https://') and row['http_experiment_failure'].startswith('ssl_'):\n",
" return 'tls.mitm'\n",
" \n",
" # We map unknown_failures to invalid measurements\n",
" elif row['http_experiment_failure'].startswith('unknown_failure'):\n",
" return 'invalid'\n",
" \n",
" # All unmapped errors go into a generic failure pool\n",
" elif row['http_experiment_failure'] != 'None':\n",
" if row['input'].startswith('https://'):\n",
" return 'tls.generic_failure'\n",
" else:\n",
" return 'http.generic_failure'\n",
" \n",
" return 'ok'"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "8c336cb6-0177-4eeb-a89e-894e898946ee",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|ââââââââââââââââââââââââââââââââââââââââââââââââââââââââ| 3152336/3152336 [20:14<00:00, 2595.73it/s]\n"
]
}
],
"source": [
"df_ru['blocking_recalc'] = df_ru.progress_apply(recompute_blocking, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "ed0b8ebc-063f-49bf-98ed-e2baadddf55b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['ok', 'invalid', 'tls.generic_failure', 'tls.mitm',\n",
" 'http.http_diff', 'dns.inconsistent', 'tls.connection_timeout',\n",
" 'tls.connection_reset', 'tls.connection_closed',\n",
" 'http.connection_reset', 'dns.confirmed', 'dns.nxdomain',\n",
" 'tcp.connection_timeout', 'http.generic_failure',\n",
" 'http.connection_timeout', 'http.connection_closed', 'dns.bogon',\n",
" 'http.confirmed', 'dns.no_ipv4', 'tcp.connection_reset'],\n",
" dtype=object)"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ru['blocking_recalc'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "285f43ee-1abf-4f96-92c9-b9026fb0f55f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([\"['172.98.192.37']\", \"['13.107.42.14']\", \"['185.3.143.71']\", ...,\n",
" \"['62.115.252.49', '80.239.137.162', '62.115.252.57', '62.115.252.56']\",\n",
" \"['62.115.252.57', '80.239.137.162', '62.115.252.49']\",\n",
" \"['62.115.252.64', '80.239.137.162', '62.115.252.41', '62.115.252.18', '62.115.252.57']\"],\n",
" dtype=object)"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ru[\n",
" df_ru['blocking_recalc'] == 'dns.inconsistent'\n",
"]['dns_resolved_ips'].unique()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "52ac54e9-3193-477d-9e06-4f47b9824f4d",
"metadata": {},
"outputs": [],
"source": [
"mask = (df_ru['blocking_recalc'] == 'dns.inconsistent')\n",
"df_ru.loc[mask, 'blocking_recalc'] = df_ru[mask].progress_apply(recompute_blocking, axis=1)"
]
},
{
"cell_type": "markdown",
"id": "fc0a42c9-992a-4482-9724-d7f2edf1f2fb",
"metadata": {},
"source": [
"Let's see on how many networks we were able to confirm the blocking of sites"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "ebbd857c-d768-4d9d-906b-d20ba42c4455",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['AS34533', 'AS41733', 'AS8790', 'AS50544', 'AS15774', 'AS41668',\n",
" 'AS8427', 'AS51604', 'AS41843', 'AS8369', 'AS51547', 'AS212614',\n",
" 'AS44507', 'AS56420', 'AS41786', 'AS42429', 'AS51813', 'AS12958',\n",
" 'AS51570', 'AS41330', 'AS52207', 'AS15378', 'AS60139', 'AS2848',\n",
" 'AS25408', 'AS42289', 'AS42437', 'AS206873', 'AS41661', 'AS49404',\n",
" 'AS13335', 'AS202173', 'AS42682', 'AS41754', 'AS58158', 'AS197460',\n",
" 'AS50542', 'AS34703', 'AS48092', 'AS3267', 'AS34590', 'AS43478',\n",
" 'AS12389', 'AS3335', 'AS198715', 'AS29076', 'AS20485', 'AS50498',\n",
" 'AS48190', 'AS35807', 'AS25159', 'AS25513', 'AS42610', 'AS49048',\n",
" 'AS12768', 'AS57843', 'AS56981', 'AS39435'], dtype=object)"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ru[\n",
" df_ru['blocking_recalc'] == 'dns.confirmed'\n",
"]['probe_asn'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "e8f2ccf0-9644-4c6e-b1d1-3368b9237b5a",
"metadata": {},
"outputs": [],
"source": [
"msmt_counts = df_ru[\n",
" df_ru['blocking_recalc'] == 'dns.confirmed'\n",
"][['domain', 'report_id']].groupby('domain').count().reset_index()"
]
},
{
"cell_type": "markdown",
"id": "d5157460-b535-4108-b6b2-ca48a864a869",
"metadata": {},
"source": [
"And let's check out how many sites were confirmed to be blocked based on our fingerprints"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "ef55d134-02de-4569-bbb5-e6132bfa223a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
domain
\n",
"
report_id
\n",
"
\n",
" \n",
" \n",
"
\n",
"
119
\n",
"
shajtanshop.com
\n",
"
1
\n",
"
\n",
"
\n",
"
59
\n",
"
instagram.com
\n",
"
1
\n",
"
\n",
"
\n",
"
34
\n",
"
facebook.com
\n",
"
1
\n",
"
\n",
"
\n",
"
36
\n",
"
fapreactor.com
\n",
"
1
\n",
"
\n",
"
\n",
"
89
\n",
"
nani24.cc
\n",
"
1
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
115
\n",
"
rutracker.org
\n",
"
477
\n",
"
\n",
"
\n",
"
156
\n",
"
www.bbc.com
\n",
"
513
\n",
"
\n",
"
\n",
"
58
\n",
"
imrussia.org
\n",
"
515
\n",
"
\n",
"
\n",
"
135
\n",
"
twitter.com
\n",
"
1873
\n",
"
\n",
"
\n",
"
182
\n",
"
www.facebook.com
\n",
"
1972
\n",
"
\n",
" \n",
"
\n",
"
269 rows à 2 columns
\n",
"
"
],
"text/plain": [
" domain report_id\n",
"119 shajtanshop.com 1\n",
"59 instagram.com 1\n",
"34 facebook.com 1\n",
"36 fapreactor.com 1\n",
"89 nani24.cc 1\n",
".. ... ...\n",
"115 rutracker.org 477\n",
"156 www.bbc.com 513\n",
"58 imrussia.org 515\n",
"135 twitter.com 1873\n",
"182 www.facebook.com 1972\n",
"\n",
"[269 rows x 2 columns]"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"msmt_counts.sort_values('report_id')"
]
},
{
"cell_type": "markdown",
"id": "9a59e4c8-0c37-4ee0-adca-c1c96955b44d",
"metadata": {},
"source": [
"From the perspective of presenting the data and digging deeper into the blocking of specific sites, since the data has so many dimensions, it's often useful to restrict your analysis to a subset of some of the axis.\n",
"\n",
"Common choices for this, is to use a subset of all the domains or a subset of all the networks.\n",
"\n",
"In this example we are going to pick some domains that have very good testing coverage and are highly relevant."
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "0eb1f0f9-e4d7-4625-a636-664c5773722f",
"metadata": {},
"outputs": [],
"source": [
"relevant_domains = [\n",
" 'www.bbc.com',\n",
" 'twitter.com',\n",
" 'www.facebook.com'\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "d7a3c33d-2f79-47b8-944d-0406f2252e43",
"metadata": {},
"outputs": [],
"source": [
"domain_asn_counts = df_ru[\n",
" df_ru['domain'].isin(relevant_domains)\n",
"][['probe_asn', 'domain', 'report_id']].groupby(['probe_asn', 'domain']).count().reset_index()"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "991ac54c-2eb2-47b0-8ef8-79c1388e9afe",
"metadata": {},
"outputs": [],
"source": [
"# We are looking at 23 days, so having ~4 metrics per day per network seems like a reasonable cutoff\n",
"relevant_asn_domains = domain_asn_counts[\n",
" domain_asn_counts['report_id'] > 100\n",
"][['probe_asn', 'domain']]"
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "179f4662-cde2-437a-9e32-08658a33a1f1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['AS12389', 'AS12668', 'AS12714', 'AS12737', 'AS12958', 'AS15493',\n",
" 'AS15640', 'AS15774', 'AS16345', 'AS205638', 'AS20632', 'AS21479',\n",
" 'AS25086', 'AS25159', 'AS25490', 'AS25513', 'AS28840', 'AS29194',\n",
" 'AS31163', 'AS31200', 'AS31213', 'AS31257', 'AS31286', 'AS31376',\n",
" 'AS3216', 'AS34533', 'AS34757', 'AS35533', 'AS35807', 'AS41330',\n",
" 'AS41668', 'AS41733', 'AS42387', 'AS42511', 'AS42610', 'AS42668',\n",
" 'AS43966', 'AS44724', 'AS44927', 'AS47165', 'AS47438', 'AS47655',\n",
" 'AS48642', 'AS50716', 'AS51547', 'AS51604', 'AS51813', 'AS52207',\n",
" 'AS56724', 'AS59734', 'AS8331', 'AS8334', 'AS8359', 'AS8402',\n",
" 'AS8427', 'AS8492', 'AS8580', 'AS8790'], dtype=object)"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"relevant_asn_domains['probe_asn'].unique()"
]
},
{
"cell_type": "markdown",
"id": "838a2e30-9065-445c-bc4e-804ea657cb98",
"metadata": {},
"source": [
"Let's start off by looking at the ways through which sites are blocked accross the networks we have selected to have enough measurements. To make the data easier to look at, we are going to fix the domain."
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "80a8607f-cfa2-4f6b-a4a4-87bf569d48b6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"to_plot = df_ru[\n",
" (df_ru['domain'] == 'www.bbc.com')\n",
" & (df_ru['probe_asn'].isin(relevant_asn_domains['probe_asn'].unique()))\n",
" & (df_ru['blocking_recalc'] != 'invalid')\n",
" & (df_ru['measurement_start_time'] > '2022-03-05')\n",
" #& (df_ru['blocking_recalc'] != 'ok')\n",
"][['blocking_recalc', 'probe_asn']]\n",
"to_plot['count'] = 1\n",
"(\n",
" to_plot.pivot_table(\n",
" columns='blocking_recalc',\n",
" index='probe_asn',\n",
" values='count',\n",
" aggfunc='sum'\n",
" ).reset_index()\n",
" .groupby('probe_asn')\n",
" .sum().reset_index()\n",
" .set_index('probe_asn')\n",
" .plot(kind='bar', stacked=True, figsize=(20,10), colormap='Paired', title='Blocking of www.bbc.com by probe_asn')\n",
")"
]
},
{
"cell_type": "markdown",
"id": "ffb3637e-81a3-467c-88c3-302b6589e6c6",
"metadata": {},
"source": [
"As we can see above, the means through which blocking is implemented across different ISPs varies significantly. In some of them, we can also see that the block is not being implemented at all.\n",
"\n",
"We can use the above chart to navigate our exploration of individual measurements on a per-ISP basis."
]
},
{
"cell_type": "code",
"execution_count": 64,
"id": "6a4dee21-1961-4fe6-b92b-ee7497e0c3c5",
"metadata": {},
"outputs": [],
"source": [
"def plot_blocking(probe_asn, domain):\n",
" to_plot = df_ru[\n",
" (df_ru['probe_asn'] == probe_asn)\n",
" & (df_ru['domain'] == domain)\n",
" & (df_ru['blocking_recalc'] != 'invalid')\n",
" ][['blocking_recalc', 'measurement_start_time']]\n",
" to_plot['measurement_start_time'] = pd.to_datetime(to_plot['measurement_start_time'])\n",
" to_plot['count'] = 1\n",
" (\n",
" to_plot.pivot_table(\n",
" columns='blocking_recalc',\n",
" index='measurement_start_time',\n",
" values='count',\n",
" aggfunc='sum'\n",
" ).reset_index()\n",
" .groupby(pd.Grouper(key='measurement_start_time', freq='D'))\n",
" .sum().reset_index()\n",
" .set_index('measurement_start_time')\n",
" .plot(kind='bar', stacked=True, title=f\"{probe_asn} {domain}\", colormap='Paired', figsize=(20,8))\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "41fdf5af-e7d9-4c40-97ee-699e6e234dfc",
"metadata": {},
"source": [
"Through the above function, we now have the power to plot a chart that shows us the blocking of a certain domain and ISP over time. In doing so we can determine if the methods through which the blocking is happening are consistent or if there is some variation.\n",
"\n",
"Having a stable signal that doesn't show different ways through which the block is implemented (in cases where the root-cause may be a transient network failure) gives you higher confidence in the data."
]
},
{
"cell_type": "code",
"execution_count": 65,
"id": "a8c81b4e-fe79-4444-ae21-32d8993481f5",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plot_blocking('AS43966', 'www.bbc.com')"
]
},
{
"cell_type": "markdown",
"id": "76408298-99f8-429d-8c4a-37221739fb8e",
"metadata": {},
"source": [
"Here we can see that the block is happening through a connection reset most of the time. The only outliers are cause by what very likely are old versions of the probe (in many cases you may want to exclude older versions of probes from your analysis, if you have enough data).\n",
"\n",
"The only case that probably deserves further investigation, is the OK measurement on the 16th. Let's find it and open it in OONI Explorer."
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "b6b0df1e-9102-466e-b357-15d915a5d847",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"