{ "cells": [ { "cell_type": "markdown", "id": "41485f29-7287-40d2-a87a-c5daeb84f731", "metadata": {}, "source": [ "## Analyzing raw OONI data, a case study\n", "\n", "The goal of this notebook is to explain some of the common workflows that can be adopted when analyzing OONI data. We do this within the context of a specific case study, focusing on the analysis of [Web Connectivity](https://github.com/ooni/spec/blob/master/nettests/ts-017-web-connectivity.md) data.\n", "\n", "We will focus on answering the following 2 research questions:\n", "- Which domains present signs of blocking in Russia between the 23rd of February and the 17th of March 2022?\n", "- How does the blocking vary from ISP to ISP?\n", "\n", "Before you dive into more extensive analysis, it can be useful to get a sense of what you are likely to find in the data by using the [Measurement Aggregation Toolkit](https://explorer.ooni.org/experimental/mat) (MAT). For example, you can pick a certain country and plot the [anomalies with a per-domain breakdown](https://explorer.ooni.org/experimental/mat?probe_cc=RU&test_name=web_connectivity&category_code=GRP&since=2022-03-09&until=2022-04-09&axis_x=measurement_start_day&axis_y=domain) (it's often helpful to limit the domains to the categories that are most relevant, so as to focus on interesting insights).\n", "\n", "Doing so tells you whether there is anything interesting to investigate in the country in question at all, and it helps you identify examples of interesting sites that you might want to investigate further.\n", "\n", "It's also possible to use the same API the MAT relies on to download the `anomaly`, `confirmed`, `failure` and `ok` breakdowns for use in your own analysis or plotting tooling.
Depending on the type of analysis you need to do, this might be sufficient. However, keep in mind that the anomaly flag is [susceptible to false positives](https://ooni.org/support/faq/#why-do-false-positives-occur).\n", "\n", "While you are performing the analysis, it's also useful to refer to OONI Explorer to inspect the measurements that present anomalies, so that you can identify patterns that help you further improve your detection heuristics.\n", "\n", "At a high level, the workflow we are going to look at is the following:\n", "\n", "![High level overview](https://kroki.io/blockdiag/svg/eNqVj7EKwkAMhnefIpM3CUVxEgVF3FxcHMQh9mINXpNyplQQ392ednAR6RjyfX_yn4LmV89YwGMA4NbaSFD0sFvuwaOhg9EC3IbQ6khAd4uYG6vAEEoWLvmGafxgqxTGUgCKqH0ttig1hlavgkbs_HNLUqwii4Eno3eum6U3evA_D_cNOjTs7TKfZNkxmf8rd8J42grPF1OgcX0=)\n", "\n", "### Downloading the data\n", "\n", "Once you have gotten a feel for the data, it's time to download the raw dataset.\n", "\n", "We offer a tool called oonidata (currently in beta; make sure you have at least v0.2.3), which can be installed by running:\n", "```\n", "pip install oonidata\n", "```\n", "\n", "To download all the OONI data for this example notebook, run the following command (you should have at least 38GB of free disk space):\n", "```\n", "oonidata sync --start-day 2022-02-23 --end-day 2022-03-17 --probe-cc RU --test-name web_connectivity --output-dir ooni-russia-data\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "id": "a96ac23b-e9cd-482c-b598-ba70184eee58", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from datetime import datetime, timedelta\n", "from dateutil.parser import parse as parse_date\n", "from urllib.parse import urlencode, quote, urlparse\n", "\n", "from tqdm import tqdm\n", "tqdm.pandas()" ] }, { "cell_type": "markdown", "id": "aff8b48e-786d-462f-91ed-881f995a9a5f", "metadata": {}, "source": [ "### OONI Explorer utility 
functions\n", "\n", "Below are a couple of utility functions that are useful when dealing with measurements. They take a dataframe row and return (or print) the OONI Explorer URL, which makes it easier to inspect the raw measurement and better understand what is going on." ] }, { "cell_type": "code", "execution_count": 2, "id": "db0a313f-7410-409a-9059-a6e19bd157a9", "metadata": {}, "outputs": [], "source": [ "def get_explorer_url(e):\n", "    # Web Connectivity measurements are identified by report_id plus input (the tested URL)\n", "    query = ''\n", "    if 'input' in e.keys() and e['input']:\n", "        query = '?input={}'.format(quote(e['input'], safe=''))\n", "    return 'https://explorer.ooni.org/measurement/{}{}'.format(e['report_id'], query)\n", "\n", "def print_explorer_url(e):\n", "    print(get_explorer_url(e))" ] }, { "cell_type": "markdown", "id": "0a2c55b2-f304-4288-92ff-965991ffdea6", "metadata": {}, "source": [ "### Extracting metadata from raw measurements\n", "\n", "The OONI raw data is very rich, but for most analysis use-cases you just need a subset of the fields or some value that is derived from them.\n", "\n", "Below are functions that will extract all the metadata we care about from the web_connectivity test." 
] }, { "cell_type": "code", "execution_count": 3, "id": "b0ed1fd5-5338-43ec-ae9a-d3dd468a5e2f", "metadata": {}, "outputs": [], "source": [ "import requests\n", "from base64 import b64decode\n", "import hashlib\n", "import json\n", "import re\n", "\n", "def get_raw_measurement(row):\n", "    # Fetch the full raw measurement for a dataframe row from the OONI API\n", "    r = requests.get(\"https://api.ooni.io/api/v1/measurement_meta\", params={\n", "        'report_id': row['report_id'],\n", "        'input': row['input'],\n", "        'full': True\n", "    })\n", "    j = r.json()\n", "    return json.loads(j['raw_measurement'])\n", "\n", "def get_resolved_ips(msmt):\n", "    queries = msmt['test_keys'].get('queries', [])\n", "    if not queries:\n", "        return ''\n", "    answers = queries[0].get('answers', [])\n", "    if not answers:\n", "        return []\n", "\n", "    # Only the IPv4 answers of the first query are considered\n", "    ip_list = []\n", "    for a in answers:\n", "        ip = a.get('ipv4', '')\n", "        if ip:\n", "            ip_list.append(ip)\n", "    return ip_list\n", "\n", "def get_control_failure(msmt):\n", "    if 'test_keys' not in msmt:\n", "        return 'missing_test_keys'\n", "    return msmt['test_keys']['control_failure']\n", "\n", "def get_test_keys_blocking(msmt):\n", "    return str(msmt['test_keys']['blocking'])\n", "\n", "def get_http_experiment_failure(msmt):\n", "    return str(msmt['test_keys']['http_experiment_failure'])\n", "\n", "def get_resolver_info(msmt):\n", "    return {\n", "        'resolver_ip': msmt.get('resolver_ip', ''),\n", "        'resolver_asn': msmt.get('resolver_asn', ''),\n", "        'resolver_network_name': msmt.get('resolver_network_name', '')\n", "    }\n", "\n", "def get_network_events(msmt):\n", "    return msmt['test_keys'].get('network_events', [])\n", "\n", "def get_tcp_connect(msmt):\n", "    return msmt['test_keys'].get('tcp_connect', [])\n", "\n", "def decode_body(body):\n", "    if body is None:\n", "        return ''\n", "    # Binary bodies are stored as a dict holding base64 encoded data\n", "    if isinstance(body, dict):\n", "        raw_body = b64decode(body['data'])\n", "        try:\n", "            return raw_body.decode('utf-8')\n", "        except UnicodeDecodeError:\n", "            # Not valid UTF-8, keep the raw bytes\n", "            return raw_body\n", "    return body\n", "\n", "def get_last_response_body(msmt):\n", "    try:\n", "        # The requests/response 
list sorts them from the newest to the oldest,\n", "        # hence the first item in the list is the last response we received.\n", "        body = msmt['test_keys']['requests'][0]['response']['body']\n", "        return decode_body(body)\n", "    except (KeyError, TypeError, IndexError):\n", "        return ''\n", "\n", "TITLE_REGEXP = re.compile(\"<title>(.*?)</title>\", re.IGNORECASE | re.DOTALL)\n", "# Doesn't take into account ordering\n", "META_TITLE_REGEXP = re.compile(\"= ts:\n", "            continue\n", "        if query.get('until') and parse_date(query['until']) <= ts:\n", "            continue\n", "        yield p\n", "\n", "def iter_raw_measurements(query):\n", "    path_list = list(iter_jsonl_paths(query))\n", "    print(f\"processing {len(path_list)}\")\n", "    for fp in tqdm(path_list):\n", "        for msmt in iter_msmts(fp):\n", "            # Optional filters on the probe network and on the tested domain\n", "            if query.get('probe_asn') and msmt['probe_asn'] != query['probe_asn']:\n", "                continue\n", "            if query.get('domain'):\n", "                domain = urlparse(msmt['input']).netloc\n", "                if domain != query['domain']:\n", "                    continue\n", "            yield msmt" ] }, { "cell_type": "code", "execution_count": 7, "id": "6439fc37-2715-4cc1-9b50-a4754aebf955", "metadata": {}, "outputs": [], "source": [ "import csv\n", "\n", "def msmt_to_csv(query, output_file=\"output.csv\"):\n", "    # newline='' is what the csv module recommends for files handed to a writer\n", "    with open(output_file, 'w', newline='') as out_file:\n", "        csv_writer = None\n", "        for msmt in iter_raw_measurements(query):\n", "            msmt_meta = get_measurement_meta(msmt)\n", "            # The header is derived from the keys of the first measurement\n", "            if csv_writer is None:\n", "                fieldnames = msmt_meta.keys()\n", "                csv_writer = csv.DictWriter(out_file, fieldnames=fieldnames)\n", "                csv_writer.writeheader()\n", "            csv_writer.writerow(msmt_meta)" ] }, { "cell_type": "code", "execution_count": 8, "id": "898e8ebf-bce6-4a7e-905a-0563460b539d", "metadata": {}, "outputs": [], "source": [ "def get_msmt_df(query):\n", "    # Collect plain dicts and build the dataframe once, instead of\n", "    # concatenating one single-row dataframe per measurement\n", "    msmt_list = [get_measurement_meta(msmt) for msmt in iter_raw_measurements(query)]\n", "    return pd.DataFrame(msmt_list)" ] }, { "cell_type": "markdown", "id": "5dbaf33a-5686-4c16-8c9c-380fa55f0302", "metadata": {}, "source": [ "Here we do the actual conversion to CSV." 
] }, { "cell_type": "code", "execution_count": null, "id": "5a5546b9-d164-4f14-a115-ff14cfe675b4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "processing 14234\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " 94%|████████████████████████████████████████████████████████▌ | 13411/14234 [1:27:58<05:48, 2.36it/s]" ] } ], "source": [ "msmt_to_csv({\n", "    'since': '2022-02-23',\n", "    'until': '2022-03-17',\n", "    'probe_cc': 'RU',\n", "    'test_name': 'web_connectivity'\n", "}, output_file=\"ooni-data-russia.csv\")" ] }, { "cell_type": "code", "execution_count": 10, "id": "3f784f91-8de0-4595-9559-998ff3d40e23", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3423845 ooni-data-russia.csv\n" ] } ], "source": [ "!wc -l ooni-data-russia.csv" ] }, { "cell_type": "markdown", "id": "0cb1e94e-746b-42cd-898f-6f46122e65d1", "metadata": {}, "source": [ "We then load the CSV file into memory as a pandas dataframe for further analysis." ] }, { "cell_type": "code", "execution_count": 11, "id": "3dbca11f-b14a-4ff8-8a78-dcd0c15db207", "metadata": {}, "outputs": [], "source": [ "df_ru = pd.read_csv('ooni-data-russia.csv')" ] }, { "cell_type": "code", "execution_count": 12, "id": "d7c05cfc-c6bc-48d3-b4d0-7ec3f85eecef", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3152336" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df_ru)" ] }, { "cell_type": "markdown", "id": "9616a5be-8915-45af-93dd-4ad6e0d01c82", "metadata": {}, "source": [ "When dealing with websites, we generally want to look at the data from a domain-centric perspective. This allows us to group together URLs that belong to the same domain but have different paths.\n", "\n", "Since the raw dataset doesn't include the `domain`, we add this column here." 
] }, { "cell_type": "code", "execution_count": 13, "id": "0adbc67f-adf9-4d7b-aa47-93b91b696b66", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████████████████████████████████████████████████| 3152336/3152336 [00:08<00:00, 365955.32it/s]\n" ] } ], "source": [ "df_ru['domain'] = df_ru['input'].progress_apply(lambda r: urlparse(r).netloc)" ] }, { "cell_type": "code", "execution_count": 14, "id": "c15eadb9-8149-4d80-b011-f9c72a59dbb2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "13.035878223367035" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru.memory_usage(deep=True).sum()/1024**3" ] }, { "cell_type": "markdown", "id": "9cf49c12-b6c0-4c0c-af15-9872f28fe712", "metadata": {}, "source": [ "### Hunting for blocking fingerprints\n", "\n", "We can be highly confident that blocking is intentional (and not caused by transient network failures) when it falls into one of the following classes:\n", "- DNS level interference\n", "- HTTP level interference\n", "- TLS MITM\n", "\n", "\n", "The first two classes, though, are susceptible to false positives, because sometimes the IP returned in a DNS query can differ based on the geographical location (think CDNs) and sometimes the content of a webpage can also vary from request to request (think the homepage of a news site).\n", "\n", "On the other hand, once we find a blocking fingerprint, we can claim with great confidence that access to that particular site is being restricted. For example, we might notice that when a site is blocked on a particular network, the DNS query always returns a given IP address, or we might know that the HTTP title of a blockpage is always \"Access to this website is denied\".\n", "\n", "Our goal now is to come up with some heuristics that will allow us to, in a way, hunt for these blockpage fingerprints in the big dataset that we have available." 
] }, { "cell_type": "markdown", "id": "7ce9e1ba-6888-44a4-bc74-3daaf697f89e", "metadata": {}, "source": [ "### Same title, but different page\n", "\n", "One heuristic we can apply to spot blockpages is that a blockpage looks exactly the same across many different sites. Based on this fairly simple intuition, we can look for blockpage fingerprints by just counting the number of domains that share the same HTTP title tag." ] }, { "cell_type": "code", "execution_count": 15, "id": "1b97c56c-2da8-4987-aabd-191f6fb5003a", "metadata": {}, "outputs": [], "source": [ "title_domain_count = df_ru[\n", "    df_ru['blocking'] == 'http-diff'\n", "].groupby('http_title')['domain'].nunique().sort_values().reset_index()" ] }, { "cell_type": "markdown", "id": "4881770f-2b65-4574-88c4-ac6611ee02cf", "metadata": {}, "source": [ "As we can see in the breakdown below, most of these titles look fairly suspicious and are quite likely to be an indication of blocking. Some of them, however, might be signs of server-side blocking (e.g. geoblocking or DDoS prevention). This is why, to obtain a high degree of accuracy, it's best to investigate these manually and add them to a fingerprint database.\n", "\n", "This is a shared effort amongst censorship research projects. For example, you can find a repo of known blocking fingerprints maintained by the Citizen Lab here: https://github.com/citizenlab/filtering-annotations" ] }, { "cell_type": "code", "execution_count": 16, "id": "8bf95466-f879-4187-a670-8f38ab86c95f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " http_title domain\n", "181 Доступ к информационному ресурсу ограничен 10\n", "182 Доступ к ресурсу ограничен! 10\n", "183 Just a moment... 10\n", "184 ERROR: The requested URL could not be retrieved 11\n", "185 403 Forbidden 13\n", "186 Dr.Web не рекомендует посещать этот сайт 14\n", "187 Инфолинк 17\n", "188 Доступ ограничен | Наука-Связь 18\n", "189 Антивирус ESET NOD32 21\n", "190 РКН МегаФон 25\n", "191 Доступ ограничен! 26\n", "192 Подряд - Сайт заблокирован по решению суда 26\n", "193 Доступ заблокирован / Access blocked 26\n", "194 Марк 37\n", "195 RialCom 39\n", "196 Инсис 45\n", "197 МегаФон Kaspersky 50\n", "198 МегаФон Gameloft 51\n", "199 Akado 51\n", "200 МегаФон Пресса 52\n", "201 Яндекс с МегаФоном 53\n", "202 МегаФон Знаю, кто звонит 53\n", "203 Подписка START с МегаФоном 54\n", "204 МегаФон SMS-фильтр 54\n", "205 Доступ к запрошенному ресурсу заблокирован 55\n", "206 IVI с МегаФоном 57\n", "207 МТС 57\n", "208 МегаФон МегаФон ТВ 59\n", "209 Запрещено 59\n", "210 МегаФон Стоп-реклама 63\n", "211 Планета 65\n", "212 MTC 69\n", "213 Доступ запрещён 72\n", "214 Домен заблокирован 73\n", "215 РКН 74\n", "216 Ресурс заблокирован 75\n", "217 Орион телеком :: БЛОКИРОВКА 75\n", "218 Доступ заблокирован 76\n", "219 Единый реестр доменных имен, указателей страни... 
79\n", "220 Доступ к ресурсу заблокирован 80\n", "221 Страница заблокирована 84\n", "222 Доступ закрыт 87\n", "223 Данный ресурс заблокирован 87\n", "224 TTK :: Доступ к ресурсу ограничен 90\n", "225 Доступ к ресурсу ограничен 90\n", "226 Доступ к запрашиваемому ресурсу ограничен 92\n", "227 Ресурс заблокирован - Resource is blocked 95\n", "228 Доступ ограничен 109" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "title_domain_count[\n", "    title_domain_count['domain'] > 8\n", "]" ] }, { "cell_type": "markdown", "id": "bd33a5d0-bed6-4d8f-a6e1-9cf1337785b3", "metadata": {}, "source": [ "Once we have confirmed that a fingerprint is known to implement blocking, we can use it to find out which domains are being restricted." ] }, { "cell_type": "code", "execution_count": 17, "id": "c9817b77-2e1a-4527-b740-980ff5720554", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['www.linkedin.com', 'www.seedjah.com', 'tushkan.net',\n", " 'bluesystem.ru', 'www.resistance88.com', 'khilafah.net',\n", " 'www.casinoking.com', 'www.eurogrand.com', 'www.usacasino.com',\n", " 'www.anonymizer.ru', 'thepiratebay.org', 'www.lesbi.ru',\n", " 'www.sex.com', 'www.slotland.com', 'www.sportsinteraction.com',\n", " 'www.youporn.com', 'www.spinpalace.com', 'www.sportingbet.com',\n", " 'www.weedy.be', 'libgen.lc', 'new-rutor.org', 'lib.rus.ec',\n", " 'megatfile.cc', 'anonymouse.org', 'ikhwanonline.com',\n", " 'imrussia.org', 'limonka.nbp-info.com', 'mirknig.su',\n", " 'www.hizb-ut-tahrir.org', 'www.blackseango.org',\n", " 'www.grandonline.com', 'nomer-org.website', 'www.aceshigh.com',\n", " 'www.islamdin.com', 'www.partypoker.com', 'betway.com',\n", " 'drugs-forum.com', 'vozrojdenie.crimea.ua', 'www.deti-404.com',\n", " 'www.uniongang.net', 'rutracker.org', 'libgen.rs',\n", " 'hotgaylist.com', 'beeg.com', 'weedfarmer.com',\n", " 'www.888casino.com', 'www.carnivalcasino.com',\n", " 'www.pokerstars.com', 'www.artnet.com', 'bluesystem.info',\n", " 
'khodorkovsky.ru', 'censor.net.ua', 'ipvnews.org', 'baskino.me',\n", " 'kinozal.tv', 'namba.kz', 'nnmclub.to', 'www.kasparov.ru',\n", " 'www.ned.org', 'kavkaz.tv', 'www.europacasino.com', 'kinobolt.ru',\n", " 'www.daymohk.org', 'www.kavkazcenter.com', 'www.annacasino.ru',\n", " 'www.ej.ru', 'www.khilafah.com', 'proxy.org', 'www.medinkur.ru',\n", " 'seedoff.zannn.top', 'pornolab.net', 'www.betfair.com',\n", " 'www.casinotropez.com', 'www.goldenrivieracasino.com',\n", " 'www.lostfilm.tv', 'rapidgator.net', 'howtogrowmarijuana.com',\n", " 'www.gotgayporn.com', 'www.agentura.ru', 'www.bbc.com',\n", " 'www.narkop.com', 'www.cannaweed.com', 'zhurnal.lib.ru',\n", " 'haamash.wordpress.com', 'www.marijuana.com', 'guardster.com',\n", " 'www.rollitup.org', 'xs.gay.ru', 'www.minjust.net',\n", " 'video.mivzakon.co.il', 'kazak-chita.ru', 'kasparov.ru'],\n", " dtype=object)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru[\n", " (df_ru['http_title'] == 'Доступ к ресурсу ограничен')\n", "]['domain'].unique()" ] }, { "cell_type": "code", "execution_count": 18, "id": "8a63bd45-b1cd-41bc-a581-3bc6b8937064", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3779 be62357d2a95c5d8a4df95a75284b9b2\n", "5684 be62357d2a95c5d8a4df95a75284b9b2\n", "11936 be62357d2a95c5d8a4df95a75284b9b2\n", "12001 be62357d2a95c5d8a4df95a75284b9b2\n", "12524 be62357d2a95c5d8a4df95a75284b9b2\n", " ... 
\n", "3123964 be62357d2a95c5d8a4df95a75284b9b2\n", "3124077 be62357d2a95c5d8a4df95a75284b9b2\n", "3138277 be62357d2a95c5d8a4df95a75284b9b2\n", "3138369 be62357d2a95c5d8a4df95a75284b9b2\n", "3140570 be62357d2a95c5d8a4df95a75284b9b2\n", "Name: http_body_md5, Length: 1831, dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru[\n", "    (df_ru['http_title'] == 'Доступ к ресурсу ограничен')\n", "]['http_body_md5']" ] }, { "cell_type": "markdown", "id": "946144f0-43c1-44c8-987e-9495b96f578f", "metadata": {}, "source": [ "### DNS level interference\n", "\n", "We can use a similar heuristic for DNS level interference. The assumption is the same: when we see one IP being mapped to multiple hostnames, it's an indication that the IP is potentially being used to implement blocking.\n", "\n", "In this case, we need to be careful about false positives caused by the use of CDNs, as these host multiple sites. In the sections below we look at techniques we can adopt to further reduce these false positives." ] }, { "cell_type": "markdown", "id": "8fe85118-9a33-4acb-b945-53403165cbe5", "metadata": {}, "source": [ "We are going to make use of an IP to ASN database for some of our heuristics. In particular, we are going to download the one from db-ip, which has a fairly permissive license and is compatible with the MaxMind database format." 
] }, { "cell_type": "code", "execution_count": null, "id": "dffd5107-536f-4603-8f82-1203e6ccf2d8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 4015k 100 4015k 0 0 45.5M 0 --:--:-- --:--:-- --:--:-- 45.5M\n" ] } ], "source": [ "!curl -O https://download.db-ip.com/free/dbip-asn-lite-2022-04.mmdb.gz" ] }, { "cell_type": "code", "execution_count": null, "id": "16eb1a3a-2836-4164-b6d0-9c96c175a376", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gzip: dbip-asn-lite-2022-04.mmdb already exists; do you wish to overwrite (y or n)? " ] } ], "source": [ "!gunzip dbip-asn-lite-2022-04.mmdb.gz" ] }, { "cell_type": "code", "execution_count": 19, "id": "38f87a59-a8c7-4c0c-b0b4-2a228a8af3c1", "metadata": {}, "outputs": [], "source": [ "import maxminddb\n", "\n", "asn_db_path = 'dbip-asn-lite-2022-04.mmdb'\n", "\n", "def lookup_asn(ip):\n", "    with maxminddb.open_database(asn_db_path) as reader:\n", "        try:\n", "            return reader.get(ip)\n", "        except ValueError:\n", "            # Probably not an IP\n", "            return None" ] }, { "cell_type": "code", "execution_count": 20, "id": "11b35948-b9e6-4412-ad56-4859b8ee8f46", "metadata": {}, "outputs": [], "source": [ "dns_resp_sorted = df_ru[\n", "    df_ru['blocking'] == 'dns'\n", "].groupby('dns_resolved_ips')['domain'].nunique().sort_values().reset_index()" ] }, { "cell_type": "code", "execution_count": 21, "id": "b35cb825-4342-476b-aa2d-1721f9ccd7e4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " dns_resolved_ips domain\n", "1737 ['184.85.124.165'] 3\n", "1738 ['178.248.233.32'] 3\n", "1739 ['185.107.56.192'] 3\n", "1740 ['185.107.56.195'] 3\n", "1741 ['185.107.56.52'] 3\n", "... ... ...\n", "1862 ['80.76.104.20'] 222\n", "1863 ['100.64.64.66'] 223\n", "1864 ['95.213.158.61'] 225\n", "1865 ['188.186.157.49'] 238\n", "1866 [] 1590\n", "\n", "[130 rows x 2 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dns_resp_sorted[\n", " dns_resp_sorted['domain'] > 2\n", "]" ] }, { "cell_type": "code", "execution_count": 22, "id": "e494cca9-9041-4d07-8e7d-fe1da135ed1c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " dns_resolved_ips domain\n", "1819 ['81.200.2.238'] 12\n", "1820 ['188.114.97.132'] 13\n", "1821 ['83.69.208.124'] 14\n", "1822 ['127.0.0.2'] 16\n", "1823 ['188.114.97.136'] 16\n", "1824 ['188.114.96.136'] 16\n", "1825 ['31.28.24.3'] 16\n", "1826 ['35.168.95.233'] 16\n", "1827 ['195.128.72.1'] 17\n", "1828 ['195.128.72.3'] 17\n", "1829 ['188.65.128.218'] 20\n", "1830 ['212.109.26.243'] 21\n", "1831 ['188.114.98.136'] 21\n", "1832 ['188.114.99.136'] 22\n", "1833 ['193.58.251.1'] 22\n", "1834 ['188.114.98.132'] 24\n", "1835 ['188.114.96.128'] 24\n", "1836 ['81.88.208.208'] 28\n", "1837 ['10.1.1.3'] 30\n", "1838 ['89.21.139.21'] 31\n", "1839 ['188.114.97.128'] 32\n", "1840 ['188.114.97.7', '188.114.96.7'] 33\n", "1841 ['188.114.96.7', '188.114.97.7'] 34\n", "1842 ['37.252.254.39'] 36\n", "1843 ['188.114.97.2', '188.114.96.2'] 37\n", "1844 ['188.114.98.128'] 39\n", "1845 ['78.29.1.40'] 42\n", "1846 ['188.43.20.67'] 44\n", "1847 ['188.114.99.132'] 44\n", "1848 ['188.114.96.2', '188.114.97.2'] 44\n", "1849 ['185.77.150.2'] 46\n", "1850 ['46.175.31.250'] 49\n", "1851 ['188.114.99.128'] 60\n", "1852 ['176.103.130.135'] 62\n", "1853 ['0.0.0.0'] 73\n", "1854 ['78.24.40.190'] 78\n", "1855 ['62.140.245.46'] 81\n", "1856 ['46.175.31.251'] 81\n", "1857 ['127.0.0.1'] 89\n", "1858 ['77.238.226.53'] 163\n", "1859 ['62.33.207.197', '62.33.207.196'] 202\n", "1860 ['85.142.29.248'] 203\n", "1861 ['62.33.207.196', '62.33.207.197'] 208\n", "1862 ['80.76.104.20'] 222\n", "1863 ['100.64.64.66'] 223\n", "1864 ['95.213.158.61'] 225\n", "1865 ['188.186.157.49'] 238\n", "1866 [] 1590" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dns_resp_sorted[\n", " dns_resp_sorted['domain'] > 10\n", "]" ] }, { "cell_type": "code", "execution_count": 23, "id": "8806744f-7408-4994-b4e8-bb941ba7d0ee", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "360 sci-hub.se\n", "549 www.ej.ru\n", "1896 www.shram.kiev.ua\n", "1902 
zhurnal.lib.ru\n", "1948 rutracker.org\n", " ... \n", "3149505 bluesystem.info\n", "3149517 www.rollitup.org\n", "3150702 rutracker.org\n", "3151623 www.bbm.com\n", "3151928 nnmclub.to\n", "Name: domain, Length: 15447, dtype: object" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru[\n", " (df_ru['blocking'] == 'dns')\n", " & (df_ru['dns_resolved_ips'] == \"['188.186.157.49']\")\n", "]['domain']" ] }, { "cell_type": "code", "execution_count": 24, "id": "c5fb9b86-0596-467e-9827-ac0266aca99e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://explorer.ooni.org/measurement/20220302T160209Z_webconnectivity_RU_41733_n1_3bHDEUlWMQ3J7M9e?input=https%3A%2F%2Fsci-hub.se%2F\n" ] } ], "source": [ "print_explorer_url(df_ru.iloc[360])" ] }, { "cell_type": "markdown", "id": "79b97d53-57bc-4862-aeff-3ce8c7f20c30", "metadata": {}, "source": [ "### DNS inconsistency false positive removal\n", "\n", "To understand if what we are looking at is a real blocking IP or not, we can use the following heuristics:\n", "\n", "1. Does the IP in question have a PTR record pointing to something that looks like a blockpage (ex. a hostname that is related to the ISP)\n", "2. What information can we get about the IP by doing a whois lookup\n", "3. Is the ASN of the IP the same as the network where the measurement was collected\n", "4. Do we get a valid TLS certificate for one of the domains in question when doing a TLS handshake and specifying the SNI\n", "\n", "Using these 4 conditions, we are generally able to understand if it's in fact a blocking IP or not" ] }, { "cell_type": "markdown", "id": "4d64bb3d-cc1f-4044-bc3a-501061ec842b", "metadata": {}, "source": [ "### True positive example\n", "\n", "In the following example we can see that the IP `188.186.157.49`:\n", "\n", "1. Has a PTR record pointing to `k8s-lb-onlyhttp-cluster-ingress.static.cc.ertelecom.ru`\n", "2. 
The whois record shows it's owned by the ISP\n", "3. The AS network name is the same as that of the measured network\n", "4. We get a certificate with a common name \"*.dom.ru\" (i.e. it's not valid for sci-hub.se)\n", "\n", "This gives us a strong indication that it is in fact a blockpage IP." ] }, { "cell_type": "code", "execution_count": null, "id": "9645f03e-f0d7-4a86-a043-5bc421e1a911", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 25, "id": "99b3d33d-5c27-4eca-bbd9-f7fdf9810547", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "49.157.186.188.in-addr.arpa domain name pointer k8s-lb-onlyhttp-cluster-ingress.static.cc.ertelecom.ru.\n" ] } ], "source": [ "!host 188.186.157.49" ] }, { "cell_type": "code", "execution_count": 26, "id": "fdb48924-f43b-4fec-9121-29aaf454b555", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "% This is the RIPE Database query service.\n", "% The objects are in RPSL format.\n", "%\n", "% The RIPE Database is subject to Terms and Conditions.\n", "% See http://www.ripe.net/db/support/db-terms-conditions.pdf\n", "\n", "% Note: this output has been filtered.\n", "% To receive output for a database update, use the \"-B\" flag.\n", "\n", "% Information related to '188.186.0.0 - 188.187.255.255'\n", "\n", "% Abuse contact for '188.186.0.0 - 188.187.255.255' is '[email protected]'\n", "\n", "inetnum: 188.186.0.0 - 188.187.255.255\n", "netname: RU-RAID-20090619\n", "country: RU\n", "org: ORG-RA21-RIPE\n", "admin-c: RAID1-RIPE\n", "tech-c: RAID1-RIPE\n", "status: ALLOCATED PA\n", "mnt-by: RIPE-NCC-HM-MNT\n", "mnt-by: RAID-MNT\n", "mnt-lower: RAID-MNT\n", "mnt-routes: RAID-MNT\n", "created: 2009-06-19T14:03:12Z\n", "last-modified: 2016-05-30T12:40:21Z\n", "source: RIPE # Filtered\n", "\n", "organisation: ORG-RA21-RIPE\n", "org-name: JSC \"ER-Telecom Holding\"\n", "country: RU\n", "org-type: LIR\n", "address: str. 
Shosse Kosmonavtov, 111, bldg. 43, office 509\n", "address: 614990\n", "address: Perm\n", "address: RUSSIAN FEDERATION\n", "phone: +7 342 2462233\n", "fax-no: +7 342 2195024\n", "admin-c: ERTH3-RIPE\n", "tech-c: RAID1-RIPE\n", "abuse-c: RAID1-RIPE\n", "mnt-ref: RIPE-NCC-HM-MNT\n", "mnt-ref: RAID-MNT\n", "mnt-ref: ENFORTA-MNT\n", "mnt-ref: AS8345-MNT\n", "mnt-ref: RU-NTK-MNT\n", "mnt-by: RIPE-NCC-HM-MNT\n", "mnt-by: RAID-MNT\n", "created: 2004-04-17T11:56:55Z\n", "last-modified: 2021-05-17T06:43:35Z\n", "source: RIPE # Filtered\n", "\n", "role: ER-Telecom ISP Contact Role\n", "address: JSC \"ER-Telecom\"\n", "address: 111, str. Shosse Kosmonavtov\n", "address: 614000 Perm\n", "address: Russian Federation\n", "phone: +7 342 2462233\n", "fax-no: +7 342 2463344\n", "abuse-mailbox: [email protected]\n", "remarks: 24/7 phone number: +7-342-2362233\n", "admin-c: AAP113-RIPE\n", "tech-c: AAP113-RIPE\n", "tech-c: GRIF59-RIPE\n", "nic-hdl: RAID1-RIPE\n", "mnt-by: RAID-MNT\n", "created: 2005-02-11T12:50:50Z\n", "last-modified: 2022-01-11T06:25:37Z\n", "source: RIPE # Filtered\n", "\n", "% Information related to '188.186.157.0/24AS31483'\n", "\n", "route: 188.186.157.0/24\n", "origin: AS31483\n", "org: ORG-RA21-RIPE\n", "descr: JSC \"ER-Telecom Holding\"\n", "descr: Russia\n", "mnt-by: RAID-MNT\n", "created: 2016-05-12T07:15:31Z\n", "last-modified: 2016-05-12T07:15:31Z\n", "source: RIPE # Filtered\n", "\n", "organisation: ORG-RA21-RIPE\n", "org-name: JSC \"ER-Telecom Holding\"\n", "country: RU\n", "org-type: LIR\n", "address: str. Shosse Kosmonavtov, 111, bldg. 
43, office 509\n", "address: 614990\n", "address: Perm\n", "address: RUSSIAN FEDERATION\n", "phone: +7 342 2462233\n", "fax-no: +7 342 2195024\n", "admin-c: ERTH3-RIPE\n", "tech-c: RAID1-RIPE\n", "abuse-c: RAID1-RIPE\n", "mnt-ref: RIPE-NCC-HM-MNT\n", "mnt-ref: RAID-MNT\n", "mnt-ref: ENFORTA-MNT\n", "mnt-ref: AS8345-MNT\n", "mnt-ref: RU-NTK-MNT\n", "mnt-by: RIPE-NCC-HM-MNT\n", "mnt-by: RAID-MNT\n", "created: 2004-04-17T11:56:55Z\n", "last-modified: 2021-05-17T06:43:35Z\n", "source: RIPE # Filtered\n", "\n", "% This query was served by the RIPE Database Query Service version 1.102.3 (HEREFORD)\n", "\n", "\n" ] } ], "source": [ "!whois 188.186.157.49" ] }, { "cell_type": "code", "execution_count": 27, "id": "5251312c-f5af-4092-b3e0-b4a2bae66357", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'autonomous_system_number': 31483,\n", " 'autonomous_system_organization': 'JSC \"ER-Telecom Holding\"'}" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lookup_asn(\"188.186.157.49\")" ] }, { "cell_type": "code", "execution_count": 28, "id": "b69ac508-a875-4e73-8657-eb19cc978ec8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'JSC \"ER-Telecom Holding\"'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru.iloc[360]['probe_network_name']" ] }, { "cell_type": "code", "execution_count": 29, "id": "1edcd0af-7573-4071-b4f3-8c9912d37410", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'AS41733'" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru.iloc[360]['probe_asn']" ] }, { "cell_type": "code", "execution_count": 30, "id": "c3e8b467-cf43-4e1f-8406-a5cb8bd2f067", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "depth=2 C = US, ST = New Jersey, L = Jersey City, O = The USERTRUST Network, CN = USERTrust RSA Certification Authority\n", "verify return:1\n", "depth=1 C = RU, ST 
= Moscow, L = Moscow, O = RU-Center (\\D0\\97\\D0\\90\\D0\\9E \\D0\\A0\\D0\\B5\\D0\\B3\\D0\\B8\\D0\\BE\\D0\\BD\\D0\\B0\\D0\\BB\\D1\\8C\\D0\\BD\\D1\\8B\\D0\\B9 \\D0\\A1\\D0\\B5\\D1\\82\\D0\\B5\\D0\\B2\\D0\\BE\\D0\\B9 \\D0\\98\\D0\\BD\\D1\\84\\D0\\BE\\D1\\80\\D0\\BC\\D0\\B0\\D1\\86\\D0\\B8\\D0\\BE\\D0\\BD\\D0\\BD\\D1\\8B\\D0\\B9 \\D0\\A6\\D0\\B5\\D0\\BD\\D1\\82\\D1\\80), CN = RU-CENTER High Assurance Services CA 2\n", "verify return:1\n", "depth=0 C = RU, ST = Permskiy kray, L = Perm, O = JSC ER-Telecom Holding, OU = job, CN = *.dom.ru\n", "verify return:1\n", "DONE\n" ] } ], "source": [ "!echo Q | openssl s_client -connect 188.186.157.49:443 -servername sci-hub.se | openssl x509 -noout -text | grep sci-hub.se" ] }, { "cell_type": "markdown", "id": "7e2e3d4f-a33d-48fb-8566-302d0bc4e66c", "metadata": {}, "source": [ "### False positive example\n", "\n", "In the following example we can see that the IP `188.114.97.7`:\n", "\n", "1. Doesn't have a PTR record\n", "2. The whois record shows it's owned by Cloudflare\n", "3. The ASN is **not** the same as that of the measured network\n", "4. We get a valid certificate for `mastodon.cloud` when doing a TLS handshake\n", "\n", "We can conclude that this is most likely a false positive." ] }, { "cell_type": "code", "execution_count": 31, "id": "b50a205b-fc38-45f5-8329-0fb4928b8d8e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "944 www.blogdir.ru\n", "1333 www.freewebspace.com\n", "1805 www.babyplan.ru\n", "2676 mastodon.cloud\n", "2924 sputnikipogrom.com\n", " ... 
\n", "3145972 hitwe.com\n", "3146741 www.resistance88.com\n", "3146962 www.wftucentral.org\n", "3149916 www.nostraightnews.com\n", "3150651 www.metal-archives.com\n", "Name: domain, Length: 4255, dtype: object" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru[\n", " df_ru['dns_resolved_ips'] == \"['188.114.97.7', '188.114.96.7']\"\n", "]['domain']" ] }, { "cell_type": "code", "execution_count": 32, "id": "88251b53-51e3-46c5-8945-39dc7ce399cf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://explorer.ooni.org/measurement/20220302T224757Z_webconnectivity_RU_31257_n1_ElZKi2MAW05O7NYj?input=https%3A%2F%2Fmastodon.cloud%2F\n" ] } ], "source": [ "print_explorer_url(df_ru.iloc[2676])" ] }, { "cell_type": "code", "execution_count": 33, "id": "c9ab47c9-a824-4e41-83ba-441588823a42", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Host 7.97.114.188.in-addr.arpa. not found: 3(NXDOMAIN)\n" ] } ], "source": [ "!host 188.114.97.7" ] }, { "cell_type": "code", "execution_count": 34, "id": "d7003aa0-ee69-4492-9180-598ac554c36c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "% This is the RIPE Database query service.\n", "% The objects are in RPSL format.\n", "%\n", "% The RIPE Database is subject to Terms and Conditions.\n", "% See http://www.ripe.net/db/support/db-terms-conditions.pdf\n", "\n", "% Note: this output has been filtered.\n", "% To receive output for a database update, use the \"-B\" flag.\n", "\n", "% Information related to '188.114.96.0 - 188.114.99.255'\n", "\n", "% Abuse contact for '188.114.96.0 - 188.114.99.255' is '[email protected]'\n", "\n", "inetnum: 188.114.96.0 - 188.114.99.255\n", "netname: CLOUDFLARENET-EU\n", "descr: CloudFlare, Inc.\n", "descr: 101 Townsend Street, San Francisco, CA 94107, US\n", "descr: +1 (650) 319-8930\n", "descr: https://cloudflare.com/\n", "country: US\n", 
"admin-c: CAC80-RIPE\n", "tech-c: CTC6-RIPE\n", "status: ASSIGNED PA\n", "mnt-by: MNT-CLOUDFLARE\n", "mnt-lower: MNT-CLOUDFLARE\n", "mnt-routes: MNT-CLOUDFLARE\n", "remarks: https://cloudflare.com/abuse\n", "created: 2015-10-16T16:26:10Z\n", "last-modified: 2015-10-16T16:26:10Z\n", "source: RIPE\n", "\n", "person: Cloudflare Abuse Contact\n", "address: 101 Townsend Street, San Francisco, CA 94107, US\n", "phone: +1 (650) 319-8930\n", "remarks: All Cloudflare abuse reporting can be done via https://www.cloudflare.com/abuse\n", "nic-hdl: CAC80-RIPE\n", "mnt-by: MNT-CLOUDFLARE\n", "created: 2012-06-01T23:27:49Z\n", "last-modified: 2018-06-10T10:14:26Z\n", "source: RIPE # Filtered\n", "\n", "person: Cloudflare Technical Contact\n", "address: 101 Townsend Street, San Francisco, CA 94107, US\n", "phone: +1 (650) 319-8930\n", "nic-hdl: CTC6-RIPE\n", "mnt-by: MNT-CLOUDFLARE\n", "created: 2012-06-01T23:35:57Z\n", "last-modified: 2018-06-10T10:16:13Z\n", "source: RIPE # Filtered\n", "\n", "% Information related to '188.114.97.0/24AS13335'\n", "\n", "route: 188.114.97.0/24\n", "origin: AS13335\n", "mnt-by: MNT-CLOUDFLARE\n", "created: 2020-06-15T18:05:37Z\n", "last-modified: 2020-06-15T18:05:37Z\n", "source: RIPE # Filtered\n", "\n", "% This query was served by the RIPE Database Query Service version 1.102.3 (HEREFORD)\n", "\n", "\n" ] } ], "source": [ "!whois 188.114.97.7" ] }, { "cell_type": "code", "execution_count": 35, "id": "a44b47ba-3204-4867-b63e-a29f0453d764", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'autonomous_system_number': 13335,\n", " 'autonomous_system_organization': 'Cloudflare, Inc.'}" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lookup_asn(\"188.114.97.7\")" ] }, { "cell_type": "code", "execution_count": 36, "id": "a65aeca9-1588-4fce-a82b-2a37b7e72a53", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AS31257\n", "Orion Telecom LLC\n" ] } ], "source": [ 
"print(df_ru.iloc[2676]['probe_asn'])\n", "print(df_ru.iloc[2676]['probe_network_name'])" ] }, { "cell_type": "code", "execution_count": 37, "id": "1d44e940-0b8b-4a2b-953a-1886f3a3e8d5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "depth=2 C = IE, O = Baltimore, OU = CyberTrust, CN = Baltimore CyberTrust Root\n", "verify return:1\n", "depth=1 C = US, O = \"Cloudflare, Inc.\", CN = Cloudflare Inc ECC CA-3\n", "verify return:1\n", "depth=0 C = US, ST = California, L = San Francisco, O = \"Cloudflare, Inc.\", CN = sni.cloudflaressl.com\n", "verify return:1\n", "DONE\n", " DNS:mastodon.cloud, DNS:sni.cloudflaressl.com, DNS:*.mastodon.cloud\n" ] } ], "source": [ "!echo Q | openssl s_client -connect 188.114.97.7:443 -servername mastodon.cloud | openssl x509 -noout -text | grep mastodon.cloud" ] }, { "cell_type": "markdown", "id": "a87f65c8-b17f-49ec-9ab2-a8d497cc3e1d", "metadata": {}, "source": [ "We can then rinse and repeat this process multiple times, until we have divided all these anomalous IPs into those confirmed to be associated to blocking or those that are false positive.\n", "\n", "Similarly we can do this for the HTML titles." 
] }, { "cell_type": "code", "execution_count": 38, "id": "77d8a03e-1a05-4b2d-94bd-71c9a8c61dbd", "metadata": {}, "outputs": [], "source": [ "confirmed_ips = [\n", "    # PTR record is k8s-lb-onlyhttp-cluster-ingress.static.cc.ertelecom.ru\n", "    # Serves blockpage for: http://lawfilter.ertelecom.ru/\n", "    '188.186.157.49',\n", "    # PTR records are block.tdsplus.ru & balance.tdsplus.ru\n", "    # We get connection refused when attempting to access it\n", "    '80.76.104.20',\n", "    # PTR record is block.runnet.ru\n", "    # We get a blockpage when attempting to access it\n", "    '85.142.29.248',\n", "    # AS is mapped to 49505 - SELECTEL\n", "    '95.213.158.61',\n", "    # Known Russian blockpages\n", "    '62.33.207.197',\n", "    '62.33.207.196',\n", "    # Blockpage for AS60139\n", "    '185.77.150.2',\n", "    # Blockpage for AS42429\n", "    '77.238.226.53',\n", "    # Blockpage for AS8369\n", "    '78.29.1.40',\n", "    # Blockpage for AS8427\n", "    '188.43.20.67',\n", "    # Blockpage for AS52207\n", "    '195.128.72.3',\n", "    # Blockpage for AS12389\n", "    '31.28.24.3',\n", "    # Likely blockpage for AS197460\n", "    # reverse pointer to host-46-175-31-251.rev.zencom.ru.\n", "    # as of 2022-03-05 connection times out when accessing it\n", "    '46.175.31.251',\n", "    # Likely blockpage for AS3335\n", "    # PTR record host190.49.237.84.nsu.ru\n", "    # as of 2022-03-05 503 error when accessing page\n", "    '84.237.49.190'\n", "]\n", "\n", "false_positive_ips = [\n", "    '188.114.97.7',\n", "    '188.114.96.7'\n", "]\n", "\n", "confirmed_titles = [\n", "    # Russian for \"Access to the resource is restricted\"\n", "    'Доступ к ресурсу ограничен'\n", "]" ] }, { "cell_type": "code", "execution_count": 39, "id": "907d1dbc-b9f4-42ae-936a-1e7067506baf", "metadata": {}, "outputs": [], "source": [ "valid_ip_map = {}" ] }, { "cell_type": "code", "execution_count": 40, "id": "5882b3a1-29c3-4519-a517-d232f7e39a1d", "metadata": {}, "outputs": [], "source": [ "import certifi\n", "import ssl\n", "import socket\n", "\n", "def is_tls_valid(ip, hostname):\n", "    # If a probe already completed an HTTPS request for this hostname via this IP, treat the IP as valid\n", "    if len(df_ru[\n", "        
(df_ru['dns_resolved_ips'].str.contains(ip, regex=False, na=False))\n", "        & (df_ru['domain'] == hostname)\n", "        & (df_ru['input'].str.startswith('https'))\n", "        & (df_ru['http_experiment_failure'] == 'None')\n", "    ]) > 0:\n", "        return True\n", "\n", "    # Otherwise attempt a live TLS handshake with the IP, validating the certificate against the hostname\n", "    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)\n", "    context.load_verify_locations(certifi.where())\n", "\n", "    with socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0) as sock:\n", "        sock.settimeout(1)\n", "        with context.wrap_socket(sock, server_hostname=hostname) as conn:\n", "            try:\n", "                conn.connect((ip, 443))\n", "            # TODO: do we care to distinguish these values?\n", "            except ssl.SSLCertVerificationError:\n", "                return False\n", "            except ssl.SSLError:\n", "                return False\n", "            except socket.timeout:\n", "                return False\n", "            except socket.error:\n", "                return False\n", "            except Exception:\n", "                return False\n", "    return True\n", "\n", "def is_tls_valid_with_cache(ip, hostname):\n", "    # Memoize results so that each (ip, hostname) pair is only checked once\n", "    key = f\"{ip}{hostname}\"\n", "    if key in valid_ip_map:\n", "        return valid_ip_map[key]\n", "    valid_ip_map[key] = is_tls_valid(ip, hostname)\n", "    return valid_ip_map[key]" ] }, { "cell_type": "markdown", "id": "6ba0d045-31c2-4766-8b19-544860051169", "metadata": {}, "source": [ "### Putting it all together\n", "\n", "We can then proceed to automate the detection on the full dataset. 
Our goal is to recompute the `blocking` feature for each individual measurement based on our improved heuristics.\n", "\n", "In addition to the previously discussed DNS and HTTP based blocking, we are also going to classify blocking that happens at other layers of the network stack.\n", "\n", "Specifically, we are going to use the following identifiers for the various ways in which blocking might occur:\n", "\n", "#### DNS\n", "* dns.confirmed - one of the returned IPs matches an IP known to be used to implement blocking\n", "* dns.no_ipv4 - no IPv4 address was returned\n", "* dns.bogon - a bogon IP address was returned\n", "* dns.nxdomain - the probe got an NXDOMAIN response, but the control vantage point got a valid response\n", "* dns.inconsistent - our DNS consistency heuristics determined the returned IP to be inconsistent\n", "\n", "#### HTTP\n", "\n", "These are all blocking types related to plaintext HTTP requests:\n", "\n", "* http.confirmed - the returned page is a known blockpage\n", "* http.http_diff - the page doesn't match based on our page consistency heuristics\n", "* http.connection_reset - we got a connection reset in response to a plaintext HTTP request\n", "* http.connection_closed - the connection was closed before all data was transmitted\n", "* http.connection_timeout - the connection timed out before we could retrieve all the data\n", "* http.generic_failure - this is a generic error from legacy OONI probes\n", "\n", "#### TLS\n", "\n", "These are all blocking types related to TLS:\n", "\n", "* tls.connection_reset - a reset packet was seen after the client sent the ClientHello packet\n", "* tls.connection_closed - the connection was closed after the ClientHello\n", "* tls.connection_timeout - the connection timed out after the ClientHello\n", "    * All of the above can also have the `_after_hello` suffix, indicating that the event happened after the client sent the ClientHello packet\n", "* tls.mitm - The DNS is 
consistent, but the TLS certificate validation failed. This suggests a TLS man-in-the-middle\n", "* tls.generic_failure - generic error from legacy OONI probes\n", "\n", "#### TCP/IP\n", "\n", "This is when blocking is implemented by targeting the IP address of the host:\n", "\n", "* tcp.connection_reset - the TCP connect test failed due to a reset packet\n", "* tcp.connection_timeout - the TCP connect test failed with a timeout" ] }, { "cell_type": "code", "execution_count": 41, "id": "477cde66-7af8-446d-afaa-5e859fc74e1b", "metadata": {}, "outputs": [], "source": [ "from ast import literal_eval\n", "import ipaddress\n", "\n", "def normalize_failure(failure_str):\n", "    # Map platform-specific error strings onto the standard OONI failure names\n", "    if \"An existing connection was forcibly closed by the remote host\" in failure_str:\n", "        return \"connection_reset\"\n", "    if \"No address associated with hostname\" in failure_str:\n", "        return \"dns_nxdomain_error\"\n", "    return failure_str\n", "\n", "def is_dns_asns_consistent(dns_resolved_ips, control_measurement, row):\n", "    # DNS is considered consistent when the probe and control answers share at least one ASN\n", "    try:\n", "        control_addrs = control_measurement['dns']['addrs']\n", "        if not control_addrs:\n", "            return False\n", "        control_asns = set(list(map(lambda e: e['autonomous_system_number'],\n", "            filter(lambda e: e is not None, map(lookup_asn, control_addrs)))))\n", "        exp_asns = set(list(map(lambda e: e['autonomous_system_number'],\n", "            filter(lambda e: e is not None, map(lookup_asn, dns_resolved_ips)))))\n", "        if exp_asns.intersection(control_asns):\n", "            return True\n", "    except KeyError:\n", "        # Missing control measurement\n", "        return False\n", "    return False\n", "\n", "bogon_ipv4_ranges = [\n", "    ipaddress.ip_network(\"0.0.0.0/8\"), # \"This\" network\n", "    ipaddress.ip_network(\"10.0.0.0/8\"), # Private-use networks\n", "    ipaddress.ip_network(\"100.64.0.0/10\"), # Carrier-grade NAT\n", "    ipaddress.ip_network(\"127.0.0.0/8\"), # Loopback\n", "    ipaddress.ip_network(\"127.0.53.53\"), # Name collision occurrence\n", "    ipaddress.ip_network(\"169.254.0.0/16\"), # Link local\n", "    
ipaddress.ip_network(\"172.16.0.0/12\"), # Private-use networks\n", "    ipaddress.ip_network(\"192.0.0.0/24\"), # IETF protocol assignments\n", "    ipaddress.ip_network(\"192.0.2.0/24\"), # TEST-NET-1\n", "    ipaddress.ip_network(\"192.168.0.0/16\"), # Private-use networks\n", "    ipaddress.ip_network(\"198.18.0.0/15\"), # Network interconnect device benchmark testing\n", "    ipaddress.ip_network(\"198.51.100.0/24\"), # TEST-NET-2\n", "    ipaddress.ip_network(\"203.0.113.0/24\"), # TEST-NET-3\n", "    ipaddress.ip_network(\"224.0.0.0/4\"), # Multicast\n", "    ipaddress.ip_network(\"240.0.0.0/4\"), # Reserved for future use\n", "    ipaddress.ip_network(\"255.255.255.255/32\"), # Limited broadcast\n", "]\n", "def is_dns_bogon(dns_resolved_ips):\n", "    # True if any of the resolved IPs falls inside one of the bogon ranges above\n", "    for ip in dns_resolved_ips:\n", "        ipv4addr = ipaddress.IPv4Address(ip)\n", "        if any(ipv4addr in ip_range for ip_range in bogon_ipv4_ranges):\n", "            return True\n", "    return False\n", "\n", "def is_dns_tls_consistent(dns_resolved_ips, row):\n", "    # If it's an HTTPS site and we didn't get a TLS error, we can assume the IPs are valid\n", "    if row['input'].startswith('https://') and row['http_experiment_failure'] == 'None':\n", "        return True\n", "\n", "    for ip in dns_resolved_ips:\n", "        domain = urlparse(row['input']).netloc\n", "        if is_tls_valid_with_cache(ip, domain):\n", "            # We consider the first hit to be enough to consider it consistent\n", "            return True\n", "    return False\n", "\n", "def is_dns_false_positive(dns_resolved_ips):\n", "    for ip in dns_resolved_ips:\n", "        if ip in false_positive_ips:\n", "            return True\n", "    return False\n", "\n", "def recompute_blocking(row):\n", "    try:\n", "        dns_resolved_ips = literal_eval(row['dns_resolved_ips'])\n", "    except Exception:\n", "        dns_resolved_ips = []\n", "\n", "    blocking = row['blocking']\n", "    for ip in dns_resolved_ips:\n", "        if ip in confirmed_ips:\n", "            return 'dns.confirmed'\n", "\n", "    # This is a special case for when we got no IPv4 addresses and the network doesn't support IPv6\n", "    if 
len(dns_resolved_ips) == 0 and row['http_experiment_failure'] == 'network_unreachable':\n", "        return 'dns.no_ipv4'\n", "\n", "    if is_dns_bogon(dns_resolved_ips):\n", "        return 'dns.bogon'\n", "\n", "    try:\n", "        control_measurement = literal_eval(row['control_measurement'])\n", "    except Exception:\n", "        return 'invalid'\n", "    if not control_measurement:\n", "        return 'invalid'\n", "\n", "    # Without a successful control request we have no baseline to compare against\n", "    if control_measurement['http_request']['failure'] is not None:\n", "        return 'invalid'\n", "\n", "    if (normalize_failure(row['dns_experiment_failure']) == 'dns_nxdomain_error' and\n", "            control_measurement.get('http_request', {}).get('failure', '') != 'dns_lookup_error'):\n", "        return 'dns.nxdomain'\n", "\n", "    if (\n", "        not (row['input'].startswith('https://') and row['http_experiment_failure'] == 'None')\n", "        and not is_dns_false_positive(dns_resolved_ips)\n", "        and not is_dns_asns_consistent(dns_resolved_ips, control_measurement, row)\n", "        #and not is_dns_tls_consistent(dns_resolved_ips, row)\n", "    ):\n", "        return 'dns.inconsistent'\n", "\n", "    # If we got down to here, it means that DNS is consistent\n", "    if row['http_title'] in confirmed_titles:\n", "        return 'http.confirmed'\n", "\n", "    if blocking == 'http-diff' and row['input'].startswith('http://'):\n", "        return 'http.http_diff'\n", "\n", "    if row['http_experiment_failure'] != 'None':\n", "        tcp_connect_list = literal_eval(row['tcp_connect'])\n", "        for conn in tcp_connect_list:\n", "            if conn['status']['failure'] == 'connection_reset':\n", "                return 'tcp.connection_reset'\n", "            elif conn['status']['failure'] == 'generic_timeout_error':\n", "                return 'tcp.connection_timeout'\n", "\n", "        # We compute TLS level anomalies using the network_events\n", "        tls_handshake_started = False\n", "        # Counters are reset on every tls_handshake_start; initialize them so they are always defined\n", "        write_operations = 0\n", "        read_operations = 0\n", "        try:\n", "            network_events = literal_eval(row['network_events'])\n", "        except Exception:\n", "            network_events = []\n", "        if network_events:\n", "            for idx, network_event in enumerate(network_events):\n", "                if network_event['operation'] == 'write':\n", "                    
write_operations += 1\n", "                if network_event['operation'] == 'read':\n", "                    read_operations += 1\n", "\n", "                if tls_handshake_started and network_event['failure']:\n", "                    # We are guaranteed to not be out of bounds due to the tls_handshake_started flag\n", "                    prev_operation = network_events[idx-1]\n", "\n", "                    # Use the _after_hello suffix when more than one write happened since the handshake started\n", "                    suffix = ''\n", "                    if write_operations > 1:\n", "                        suffix = '_after_hello'\n", "                    if normalize_failure(network_event['failure']) == 'connection_reset':\n", "                        return f'tls.connection_reset{suffix}'\n", "                    elif normalize_failure(network_event['failure']) == 'eof_error':\n", "                        return f'tls.connection_closed{suffix}'\n", "                    elif normalize_failure(network_event['failure']) == 'generic_timeout_error':\n", "                        return f'tls.connection_timeout{suffix}'\n", "\n", "                if network_event['operation'] == 'tls_handshake_start':\n", "                    tls_handshake_started = True\n", "                    write_operations = 0\n", "                    read_operations = 0\n", "                if network_event['operation'] == 'tls_handshake_done':\n", "                    tls_handshake_started = False\n", "\n", "    # If we got down to here, it means the DNS consistency checks have passed\n", "    # For the http related failures, if we are spotting them here, it means the test most likely doesn't support the\n", "    # new network_events keys, and therefore the results are a bit less accurate.\n", "    # This should ideally be indicated via a lower confidence value.\n", "    if normalize_failure(row['http_experiment_failure']) == 'connection_reset':\n", "        if row['input'].startswith('https://'):\n", "            return 'tls.connection_reset'\n", "        else:\n", "            return 'http.connection_reset'\n", "    elif normalize_failure(row['http_experiment_failure']) == 'eof_error':\n", "        if row['input'].startswith('https://'):\n", "            return 'tls.connection_closed'\n", "        else:\n", "            return 'http.connection_closed'\n", "    elif normalize_failure(row['http_experiment_failure']) == 'generic_timeout_error':\n", "        if row['input'].startswith('https://'):\n", "            return 'tls.connection_timeout'\n", "        else:\n", "            return 
'http.connection_timeout'\n", "    # DNS is consistent at this point, so a certificate failure suggests a TLS MITM rather than a DNS redirect to a blockpage\n", "    elif row['input'].startswith('https://') and row['http_experiment_failure'].startswith('ssl_'):\n", "        return 'tls.mitm'\n", "\n", "    # We map unknown_failures to invalid measurements\n", "    elif row['http_experiment_failure'].startswith('unknown_failure'):\n", "        return 'invalid'\n", "\n", "    # All unmapped errors go into a generic failure pool\n", "    elif row['http_experiment_failure'] != 'None':\n", "        if row['input'].startswith('https://'):\n", "            return 'tls.generic_failure'\n", "        else:\n", "            return 'http.generic_failure'\n", "\n", "    return 'ok'" ] }, { "cell_type": "code", "execution_count": 42, "id": "8c336cb6-0177-4eeb-a89e-894e898946ee", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|████████████████████████████████████████████████████████| 3152336/3152336 [20:14<00:00, 2595.73it/s]\n" ] } ], "source": [ "df_ru['blocking_recalc'] = df_ru.progress_apply(recompute_blocking, axis=1)" ] }, { "cell_type": "code", "execution_count": 43, "id": "ed0b8ebc-063f-49bf-98ed-e2baadddf55b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['ok', 'invalid', 'tls.generic_failure', 'tls.mitm',\n", " 'http.http_diff', 'dns.inconsistent', 'tls.connection_timeout',\n", " 'tls.connection_reset', 'tls.connection_closed',\n", " 'http.connection_reset', 'dns.confirmed', 'dns.nxdomain',\n", " 'tcp.connection_timeout', 'http.generic_failure',\n", " 'http.connection_timeout', 'http.connection_closed', 'dns.bogon',\n", " 'http.confirmed', 'dns.no_ipv4', 'tcp.connection_reset'],\n", " dtype=object)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru['blocking_recalc'].unique()" ] }, { "cell_type": "code", "execution_count": 46, "id": "285f43ee-1abf-4f96-92c9-b9026fb0f55f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([\"['172.98.192.37']\", 
\"['13.107.42.14']\", \"['185.3.143.71']\", ...,\n", " \"['62.115.252.49', '80.239.137.162', '62.115.252.57', '62.115.252.56']\",\n", " \"['62.115.252.57', '80.239.137.162', '62.115.252.49']\",\n", " \"['62.115.252.64', '80.239.137.162', '62.115.252.41', '62.115.252.18', '62.115.252.57']\"],\n", " dtype=object)" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru[\n", " df_ru['blocking_recalc'] == 'dns.inconsistent'\n", "]['dns_resolved_ips'].unique()" ] }, { "cell_type": "code", "execution_count": null, "id": "52ac54e9-3193-477d-9e06-4f47b9824f4d", "metadata": {}, "outputs": [], "source": [ "mask = (df_ru['blocking_recalc'] == 'dns.inconsistent')\n", "df_ru.loc[mask, 'blocking_recalc'] = df_ru[mask].progress_apply(recompute_blocking, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "de2b2276-f519-4f2b-8bc6-4e37189d4787", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "3d133e88-e8f9-450d-bc6f-e720bd2b76d2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "fc0a42c9-992a-4482-9724-d7f2edf1f2fb", "metadata": {}, "source": [ "Let's see on how many networks we were able to confirm the blocking of sites" ] }, { "cell_type": "code", "execution_count": 48, "id": "ebbd857c-d768-4d9d-906b-d20ba42c4455", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['AS34533', 'AS41733', 'AS8790', 'AS50544', 'AS15774', 'AS41668',\n", " 'AS8427', 'AS51604', 'AS41843', 'AS8369', 'AS51547', 'AS212614',\n", " 'AS44507', 'AS56420', 'AS41786', 'AS42429', 'AS51813', 'AS12958',\n", " 'AS51570', 'AS41330', 'AS52207', 'AS15378', 'AS60139', 'AS2848',\n", " 'AS25408', 'AS42289', 'AS42437', 'AS206873', 'AS41661', 'AS49404',\n", " 'AS13335', 'AS202173', 'AS42682', 'AS41754', 'AS58158', 'AS197460',\n", " 'AS50542', 'AS34703', 'AS48092', 'AS3267', 'AS34590', 'AS43478',\n", " 'AS12389', 'AS3335', 'AS198715', 'AS29076', 'AS20485', 
'AS50498',\n", " 'AS48190', 'AS35807', 'AS25159', 'AS25513', 'AS42610', 'AS49048',\n", " 'AS12768', 'AS57843', 'AS56981', 'AS39435'], dtype=object)" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru[\n", " df_ru['blocking_recalc'] == 'dns.confirmed'\n", "]['probe_asn'].unique()" ] }, { "cell_type": "code", "execution_count": 49, "id": "e8f2ccf0-9644-4c6e-b1d1-3368b9237b5a", "metadata": {}, "outputs": [], "source": [ "msmt_counts = df_ru[\n", " df_ru['blocking_recalc'] == 'dns.confirmed'\n", "][['domain', 'report_id']].groupby('domain').count().reset_index()" ] }, { "cell_type": "markdown", "id": "d5157460-b535-4108-b6b2-ca48a864a869", "metadata": {}, "source": [ "And let's check out how many sites were confirmed to be blocked based on our fingerprints" ] }, { "cell_type": "code", "execution_count": 50, "id": "ef55d134-02de-4569-bbb5-e6132bfa223a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " domain report_id\n", "119 shajtanshop.com 1\n", "59 instagram.com 1\n", "34 facebook.com 1\n", "36 fapreactor.com 1\n", "89 nani24.cc 1\n", ".. ... ...\n", "115 rutracker.org 477\n", "156 www.bbc.com 513\n", "58 imrussia.org 515\n", "135 twitter.com 1873\n", "182 www.facebook.com 1972\n", "\n", "[269 rows x 2 columns]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "msmt_counts.sort_values('report_id')" ] }, { "cell_type": "markdown", "id": "9a59e4c8-0c37-4ee0-adca-c1c96955b44d", "metadata": {}, "source": [ "From the perspective of presenting the data and digging deeper into the blocking of specific sites, since the data has so many dimensions, it's often useful to restrict your analysis to a subset of some of the axis.\n", "\n", "Common choices for this, is to use a subset of all the domains or a subset of all the networks.\n", "\n", "In this example we are going to pick some domains that have very good testing coverage and are highly relevant." 
] }, { "cell_type": "code", "execution_count": 55, "id": "0eb1f0f9-e4d7-4625-a636-664c5773722f", "metadata": {}, "outputs": [], "source": [ "relevant_domains = [\n", " 'www.bbc.com',\n", " 'twitter.com',\n", " 'www.facebook.com'\n", "]" ] }, { "cell_type": "code", "execution_count": 56, "id": "d7a3c33d-2f79-47b8-944d-0406f2252e43", "metadata": {}, "outputs": [], "source": [ "domain_asn_counts = df_ru[\n", " df_ru['domain'].isin(relevant_domains)\n", "][['probe_asn', 'domain', 'report_id']].groupby(['probe_asn', 'domain']).count().reset_index()" ] }, { "cell_type": "code", "execution_count": 57, "id": "991ac54c-2eb2-47b0-8ef8-79c1388e9afe", "metadata": {}, "outputs": [], "source": [ "# We are looking at 23 days, so having ~4 metrics per day per network seems like a reasonable cutoff\n", "relevant_asn_domains = domain_asn_counts[\n", " domain_asn_counts['report_id'] > 100\n", "][['probe_asn', 'domain']]" ] }, { "cell_type": "code", "execution_count": 58, "id": "179f4662-cde2-437a-9e32-08658a33a1f1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['AS12389', 'AS12668', 'AS12714', 'AS12737', 'AS12958', 'AS15493',\n", " 'AS15640', 'AS15774', 'AS16345', 'AS205638', 'AS20632', 'AS21479',\n", " 'AS25086', 'AS25159', 'AS25490', 'AS25513', 'AS28840', 'AS29194',\n", " 'AS31163', 'AS31200', 'AS31213', 'AS31257', 'AS31286', 'AS31376',\n", " 'AS3216', 'AS34533', 'AS34757', 'AS35533', 'AS35807', 'AS41330',\n", " 'AS41668', 'AS41733', 'AS42387', 'AS42511', 'AS42610', 'AS42668',\n", " 'AS43966', 'AS44724', 'AS44927', 'AS47165', 'AS47438', 'AS47655',\n", " 'AS48642', 'AS50716', 'AS51547', 'AS51604', 'AS51813', 'AS52207',\n", " 'AS56724', 'AS59734', 'AS8331', 'AS8334', 'AS8359', 'AS8402',\n", " 'AS8427', 'AS8492', 'AS8580', 'AS8790'], dtype=object)" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "relevant_asn_domains['probe_asn'].unique()" ] }, { "cell_type": "markdown", "id": "0511c839-46d8-4284-ab41-2408246ccc99", 
"metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "838a2e30-9065-445c-bc4e-804ea657cb98", "metadata": {}, "source": [ "Let's start off by looking at the ways through which sites are blocked accross the networks we have selected to have enough measurements. To make the data easier to look at, we are going to fix the domain." ] }, { "cell_type": "code", "execution_count": 59, "id": "80a8607f-cfa2-4f6b-a4a4-87bf569d48b6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "<figure size with axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "to_plot = df_ru[\n", " (df_ru['domain'] == 'www.bbc.com')\n", " & (df_ru['probe_asn'].isin(relevant_asn_domains['probe_asn'].unique()))\n", " & (df_ru['blocking_recalc'] != 'invalid')\n", " & (df_ru['measurement_start_time'] > '2022-03-05')\n", " #& (df_ru['blocking_recalc'] != 'ok')\n", "][['blocking_recalc', 'probe_asn']]\n", "to_plot['count'] = 1\n", "(\n", " to_plot.pivot_table(\n", " columns='blocking_recalc',\n", " index='probe_asn',\n", " values='count',\n", " aggfunc='sum'\n", " ).reset_index()\n", " .groupby('probe_asn')\n", " .sum().reset_index()\n", " .set_index('probe_asn')\n", " .plot(kind='bar', stacked=True, figsize=(20,10), colormap='Paired', title='Blocking of www.bbc.com by probe_asn')\n", ")" ] }, { "cell_type": "markdown", "id": "ffb3637e-81a3-467c-88c3-302b6589e6c6", "metadata": {}, "source": [ "As we can see above, the means through which blocking is implemented across different ISPs varies significantly. In some of them, we can also see that the block is not being implemented at all.\n", "\n", "We can use the above chart to navigate our exploration of individual measurements on a per-ISP basis." 
] }, { "cell_type": "code", "execution_count": 64, "id": "6a4dee21-1961-4fe6-b92b-ee7497e0c3c5", "metadata": {}, "outputs": [], "source": [ "def plot_blocking(probe_asn, domain):\n", " to_plot = df_ru[\n", " (df_ru['probe_asn'] == probe_asn)\n", " & (df_ru['domain'] == domain)\n", " & (df_ru['blocking_recalc'] != 'invalid')\n", " ][['blocking_recalc', 'measurement_start_time']]\n", " to_plot['measurement_start_time'] = pd.to_datetime(to_plot['measurement_start_time'])\n", " to_plot['count'] = 1\n", " (\n", " to_plot.pivot_table(\n", " columns='blocking_recalc',\n", " index='measurement_start_time',\n", " values='count',\n", " aggfunc='sum'\n", " ).reset_index()\n", " .groupby(pd.Grouper(key='measurement_start_time', freq='D'))\n", " .sum().reset_index()\n", " .set_index('measurement_start_time')\n", " .plot(kind='bar', stacked=True, title=f\"{probe_asn} {domain}\", colormap='Paired', figsize=(20,8))\n", " )" ] }, { "cell_type": "markdown", "id": "41fdf5af-e7d9-4c40-97ee-699e6e234dfc", "metadata": {}, "source": [ "Through the above function, we now have the power to plot a chart that shows us the blocking of a certain domain and ISP over time. In doing so we can determine if the methods through which the blocking is happening are consistent or if there is some variation.\n", "\n", "Having a stable signal that doesn't show different ways through which the block is implemented (in cases where the root-cause may be a transient network failure) gives you higher confidence in the data." 
] }, { "cell_type": "code", "execution_count": 65, "id": "a8c81b4e-fe79-4444-ae21-32d8993481f5", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<figure size with axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_blocking('AS43966', 'www.bbc.com')" ] }, { "cell_type": "markdown", "id": "76408298-99f8-429d-8c4a-37221739fb8e", "metadata": {}, "source": [ "Here we can see that the block is implemented through a connection reset most of the time. The only outliers are caused by what are very likely old versions of the probe (in many cases you may want to exclude older probe versions from your analysis, if you have enough data).\n", "\n", "The only case that probably deserves further investigation is the OK measurement on the 16th. Let's find it and open it in OONI Explorer." ] }, { "cell_type": "code", "execution_count": 62, "id": "b6b0df1e-9102-466e-b357-15d915a5d847", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>input</th>\n", " <th>measurement_start_time</th>\n", " <th>probe_asn</th>\n", " <th>probe_cc</th>\n", " <th>probe_network_name</th>\n", " <th>report_id</th>\n", " <th>resolver_asn</th>\n", " <th>resolver_ip</th>\n", " <th>resolver_network_name</th>\n", " <th>software_name</th>\n", " <th>...</th>\n", " <th>control_measurement</th>\n", " <th>blocking</th>\n", " <th>http_experiment_failure</th>\n", " <th>dns_experiment_failure</th>\n", " <th>http_title</th>\n", " <th>http_meta_title</th>\n", " <th>http_body_md5</th>\n", " <th>tcp_connect</th>\n", " <th>domain</th>\n", " 
<th>blocking_recalc</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>517649</th>\n", " <td>https://www.bbc.com/news/world-51235105</td>\n", " <td>2022-03-16 12:41:20</td>\n", " <td>AS43966</td>\n", " <td>RU</td>\n", " <td>IT REGION LTD</td>\n", " <td>20220316T124100Z_webconnectivity_RU_43966_n1_q...</td>\n", " <td>AS43966</td>\n", " <td>79.173.80.17</td>\n", " <td>IT REGION LTD</td>\n", " <td>ooniprobe-android</td>\n", " <td>...</td>\n", " <td>{'tcp_connect': {'151.101.112.81:443': {'statu...</td>\n", " <td>False</td>\n", " <td>None</td>\n", " <td>None</td>\n", " <td>Covid map: Coronavirus cases, deaths, vaccinat...</td>\n", " <td>Covid map: Coronavirus cases, deaths, vaccinat...</td>\n", " <td>add8e023428a9d1b4816fb3bb2a238c7</td>\n", " <td>[{'ip': '151.101.112.81', 'port': 443, 'status...</td>\n", " <td>www.bbc.com</td>\n", " <td>ok</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>1 rows × 30 columns</p>\n", "</div>" ], "text/plain": [ " input measurement_start_time \\\n", "517649 https://www.bbc.com/news/world-51235105 2022-03-16 12:41:20 \n", "\n", " probe_asn probe_cc probe_network_name \\\n", "517649 AS43966 RU IT REGION LTD \n", "\n", " report_id resolver_asn \\\n", "517649 20220316T124100Z_webconnectivity_RU_43966_n1_q... AS43966 \n", "\n", " resolver_ip resolver_network_name software_name ... \\\n", "517649 79.173.80.17 IT REGION LTD ooniprobe-android ... \n", "\n", " control_measurement blocking \\\n", "517649 {'tcp_connect': {'151.101.112.81:443': {'statu... False \n", "\n", " http_experiment_failure dns_experiment_failure \\\n", "517649 None None \n", "\n", " http_title \\\n", "517649 Covid map: Coronavirus cases, deaths, vaccinat... \n", "\n", " http_meta_title \\\n", "517649 Covid map: Coronavirus cases, deaths, vaccinat... \n", "\n", " http_body_md5 \\\n", "517649 add8e023428a9d1b4816fb3bb2a238c7 \n", "\n", " tcp_connect domain \\\n", "517649 [{'ip': '151.101.112.81', 'port': 443, 'status... 
www.bbc.com \n", "\n", " blocking_recalc \n", "517649 ok \n", "\n", "[1 rows x 30 columns]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru[\n", " (df_ru['probe_asn'] == 'AS43966')\n", " & (df_ru['blocking_recalc'] == 'ok')\n", " & (df_ru['measurement_start_time'].str.startswith('2022-03-16'))\n", " & (df_ru['domain'] == 'www.bbc.com')\n", "]" ] }, { "cell_type": "code", "execution_count": 63, "id": "04dc4a18-57db-4b09-a675-72f6c8439054", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://explorer.ooni.org/measurement/20220316T124100Z_webconnectivity_RU_43966_n1_qXKPLjBo4r7rzNdl?input=https%3A%2F%2Fwww.bbc.com%2Fnews%2Fworld-51235105\n" ] } ], "source": [ "print_explorer_url(df_ru.iloc[517649])" ] }, { "cell_type": "code", "execution_count": 66, "id": "e595e135-e016-4011-a161-1864596f1185", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<figure size with axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_blocking('AS25513', 'www.bbc.com')" ] }, { "cell_type": "code", "execution_count": 67, "id": "e7458162-6ae7-4c9c-b8a4-6529e1ff6be2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border='\"1\"' class='\"dataframe\"'>\n", " <thead>\n", " <tr style='\"text-align:' right>\n", " <th></th>\n", " <th>input</th>\n", " <th>measurement_start_time</th>\n", " <th>probe_asn</th>\n", " <th>probe_cc</th>\n", " <th>probe_network_name</th>\n", " <th>report_id</th>\n", " <th>resolver_asn</th>\n", " <th>resolver_ip</th>\n", " <th>resolver_network_name</th>\n", " <th>software_name</th>\n", " <th>...</th>\n", " 
<th>control_measurement</th>\n", " <th>blocking</th>\n", " <th>http_experiment_failure</th>\n", " <th>dns_experiment_failure</th>\n", " <th>http_title</th>\n", " <th>http_meta_title</th>\n", " <th>http_body_md5</th>\n", " <th>tcp_connect</th>\n", " <th>domain</th>\n", " <th>blocking_recalc</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1455010</th>\n", " <td>http://www.bbc.com/news</td>\n", " <td>2022-03-11 22:54:56</td>\n", " <td>AS25513</td>\n", " <td>RU</td>\n", " <td>PJSC Moscow city telephone network</td>\n", " <td>20220311T225300Z_webconnectivity_RU_25513_n1_C...</td>\n", " <td>AS25513</td>\n", " <td>94.29.125.114</td>\n", " <td>PJSC Moscow city telephone network</td>\n", " <td>ooniprobe-android</td>\n", " <td>...</td>\n", " <td>{'tcp_connect': {'151.101.12.81:80': {'status'...</td>\n", " <td>False</td>\n", " <td>None</td>\n", " <td>None</td>\n", " <td>MTC</td>\n", " <td>NaN</td>\n", " <td>a7bad4fa931d233f7e7145dcb6412434</td>\n", " <td>[{'ip': '151.101.12.81', 'port': 80, 'status':...</td>\n", " <td>www.bbc.com</td>\n", " <td>ok</td>\n", " </tr>\n", " <tr>\n", " <th>1465178</th>\n", " <td>http://www.bbc.com/news</td>\n", " <td>2022-03-11 02:32:21</td>\n", " <td>AS25513</td>\n", " <td>RU</td>\n", " <td>PJSC Moscow city telephone network</td>\n", " <td>20220311T021951Z_webconnectivity_RU_25513_n1_F...</td>\n", " <td>AS25513</td>\n", " <td>94.29.125.106</td>\n", " <td>PJSC Moscow city telephone network</td>\n", " <td>ooniprobe-desktop-unattended</td>\n", " <td>...</td>\n", " <td>{'tcp_connect': {'151.101.12.81:80': {'status'...</td>\n", " <td>False</td>\n", " <td>None</td>\n", " <td>None</td>\n", " <td>MTC</td>\n", " <td>NaN</td>\n", " <td>a7bad4fa931d233f7e7145dcb6412434</td>\n", " <td>[{'ip': '151.101.12.81', 'port': 80, 'status':...</td>\n", " <td>www.bbc.com</td>\n", " <td>ok</td>\n", " </tr>\n", " <tr>\n", " <th>1472180</th>\n", " <td>http://www.bbc.com/news</td>\n", " <td>2022-03-11 14:42:04</td>\n", " <td>AS25513</td>\n", " 
<td>RU</td>\n", " <td>PJSC Moscow city telephone network</td>\n", " <td>20220311T142131Z_webconnectivity_RU_25513_n1_n...</td>\n", " <td>AS59447</td>\n", " <td>45.136.153.146</td>\n", " <td>Istanbuldc Veri Merkezi Ltd Sti</td>\n", " <td>ooniprobe-desktop-unattended</td>\n", " <td>...</td>\n", " <td>{'tcp_connect': {'151.101.12.81:80': {'status'...</td>\n", " <td>False</td>\n", " <td>None</td>\n", " <td>None</td>\n", " <td>MTC</td>\n", " <td>NaN</td>\n", " <td>a7bad4fa931d233f7e7145dcb6412434</td>\n", " <td>[{'ip': '151.101.12.81', 'port': 80, 'status':...</td>\n", " <td>www.bbc.com</td>\n", " <td>ok</td>\n", " </tr>\n", " <tr>\n", " <th>1504898</th>\n", " <td>https://www.bbc.com/news/world-51235105</td>\n", " <td>2022-03-11 07:22:24</td>\n", " <td>AS25513</td>\n", " <td>RU</td>\n", " <td>PJSC Moscow city telephone network</td>\n", " <td>20220311T071246Z_webconnectivity_RU_25513_n1_A...</td>\n", " <td>AS15169</td>\n", " <td>172.217.37.140</td>\n", " <td>Google LLC</td>\n", " <td>ooniprobe-android</td>\n", " <td>...</td>\n", " <td>{'tcp_connect': {'151.101.84.81:443': {'status...</td>\n", " <td>False</td>\n", " <td>None</td>\n", " <td>None</td>\n", " <td>Covid map: Coronavirus cases, deaths, vaccinat...</td>\n", " <td>Covid map: Coronavirus cases, deaths, vaccinat...</td>\n", " <td>add8e023428a9d1b4816fb3bb2a238c7</td>\n", " <td>[{'ip': '151.101.84.81', 'port': 443, 'status'...</td>\n", " <td>www.bbc.com</td>\n", " <td>ok</td>\n", " </tr>\n", " <tr>\n", " <th>1528928</th>\n", " <td>http://www.bbc.com/news</td>\n", " <td>2022-03-11 08:52:32</td>\n", " <td>AS25513</td>\n", " <td>RU</td>\n", " <td>PJSC Moscow city telephone network</td>\n", " <td>20220311T082127Z_webconnectivity_RU_25513_n1_J...</td>\n", " <td>AS25513</td>\n", " <td>94.29.125.114</td>\n", " <td>PJSC Moscow city telephone network</td>\n", " <td>ooniprobe-desktop-unattended</td>\n", " <td>...</td>\n", " <td>{'tcp_connect': {'151.101.12.81:80': {'status'...</td>\n", " <td>False</td>\n", " <td>None</td>\n", 
" <td>None</td>\n", " <td>MTC</td>\n", " <td>NaN</td>\n", " <td>a7bad4fa931d233f7e7145dcb6412434</td>\n", " <td>[{'ip': '151.101.12.81', 'port': 80, 'status':...</td>\n", " <td>www.bbc.com</td>\n", " <td>ok</td>\n", " </tr>\n", " <tr>\n", " <th>1554661</th>\n", " <td>https://www.bbc.com/news/world-51235105</td>\n", " <td>2022-03-11 07:18:41</td>\n", " <td>AS25513</td>\n", " <td>RU</td>\n", " <td>PJSC Moscow city telephone network</td>\n", " <td>20220311T071719Z_webconnectivity_RU_25513_n1_3...</td>\n", " <td>AS198806</td>\n", " <td>91.239.98.96</td>\n", " <td>LLC SIBUR</td>\n", " <td>ooniprobe-desktop-unattended</td>\n", " <td>...</td>\n", " <td>{'tcp_connect': {'151.101.112.81:443': {'statu...</td>\n", " <td>False</td>\n", " <td>None</td>\n", " <td>None</td>\n", " <td>Covid map: Coronavirus cases, deaths, vaccinat...</td>\n", " <td>Covid map: Coronavirus cases, deaths, vaccinat...</td>\n", " <td>add8e023428a9d1b4816fb3bb2a238c7</td>\n", " <td>[{'ip': '151.101.112.81', 'port': 443, 'status...</td>\n", " <td>www.bbc.com</td>\n", " <td>ok</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>6 rows × 30 columns</p>\n", "</div>" ], "text/plain": [ " input measurement_start_time \\\n", "1455010 http://www.bbc.com/news 2022-03-11 22:54:56 \n", "1465178 http://www.bbc.com/news 2022-03-11 02:32:21 \n", "1472180 http://www.bbc.com/news 2022-03-11 14:42:04 \n", "1504898 https://www.bbc.com/news/world-51235105 2022-03-11 07:22:24 \n", "1528928 http://www.bbc.com/news 2022-03-11 08:52:32 \n", "1554661 https://www.bbc.com/news/world-51235105 2022-03-11 07:18:41 \n", "\n", " probe_asn probe_cc probe_network_name \\\n", "1455010 AS25513 RU PJSC Moscow city telephone network \n", "1465178 AS25513 RU PJSC Moscow city telephone network \n", "1472180 AS25513 RU PJSC Moscow city telephone network \n", "1504898 AS25513 RU PJSC Moscow city telephone network \n", "1528928 AS25513 RU PJSC Moscow city telephone network \n", "1554661 AS25513 RU PJSC Moscow city telephone network \n", "\n", 
" report_id resolver_asn \\\n", "1455010 20220311T225300Z_webconnectivity_RU_25513_n1_C... AS25513 \n", "1465178 20220311T021951Z_webconnectivity_RU_25513_n1_F... AS25513 \n", "1472180 20220311T142131Z_webconnectivity_RU_25513_n1_n... AS59447 \n", "1504898 20220311T071246Z_webconnectivity_RU_25513_n1_A... AS15169 \n", "1528928 20220311T082127Z_webconnectivity_RU_25513_n1_J... AS25513 \n", "1554661 20220311T071719Z_webconnectivity_RU_25513_n1_3... AS198806 \n", "\n", " resolver_ip resolver_network_name \\\n", "1455010 94.29.125.114 PJSC Moscow city telephone network \n", "1465178 94.29.125.106 PJSC Moscow city telephone network \n", "1472180 45.136.153.146 Istanbuldc Veri Merkezi Ltd Sti \n", "1504898 172.217.37.140 Google LLC \n", "1528928 94.29.125.114 PJSC Moscow city telephone network \n", "1554661 91.239.98.96 LLC SIBUR \n", "\n", " software_name ... \\\n", "1455010 ooniprobe-android ... \n", "1465178 ooniprobe-desktop-unattended ... \n", "1472180 ooniprobe-desktop-unattended ... \n", "1504898 ooniprobe-android ... \n", "1528928 ooniprobe-desktop-unattended ... \n", "1554661 ooniprobe-desktop-unattended ... \n", "\n", " control_measurement blocking \\\n", "1455010 {'tcp_connect': {'151.101.12.81:80': {'status'... False \n", "1465178 {'tcp_connect': {'151.101.12.81:80': {'status'... False \n", "1472180 {'tcp_connect': {'151.101.12.81:80': {'status'... False \n", "1504898 {'tcp_connect': {'151.101.84.81:443': {'status... False \n", "1528928 {'tcp_connect': {'151.101.12.81:80': {'status'... False \n", "1554661 {'tcp_connect': {'151.101.112.81:443': {'statu... False \n", "\n", " http_experiment_failure dns_experiment_failure \\\n", "1455010 None None \n", "1465178 None None \n", "1472180 None None \n", "1504898 None None \n", "1528928 None None \n", "1554661 None None \n", "\n", " http_title \\\n", "1455010 MTC \n", "1465178 MTC \n", "1472180 MTC \n", "1504898 Covid map: Coronavirus cases, deaths, vaccinat... 
\n", "1528928 MTC \n", "1554661 Covid map: Coronavirus cases, deaths, vaccinat... \n", "\n", " http_meta_title \\\n", "1455010 NaN \n", "1465178 NaN \n", "1472180 NaN \n", "1504898 Covid map: Coronavirus cases, deaths, vaccinat... \n", "1528928 NaN \n", "1554661 Covid map: Coronavirus cases, deaths, vaccinat... \n", "\n", " http_body_md5 \\\n", "1455010 a7bad4fa931d233f7e7145dcb6412434 \n", "1465178 a7bad4fa931d233f7e7145dcb6412434 \n", "1472180 a7bad4fa931d233f7e7145dcb6412434 \n", "1504898 add8e023428a9d1b4816fb3bb2a238c7 \n", "1528928 a7bad4fa931d233f7e7145dcb6412434 \n", "1554661 add8e023428a9d1b4816fb3bb2a238c7 \n", "\n", " tcp_connect domain \\\n", "1455010 [{'ip': '151.101.12.81', 'port': 80, 'status':... www.bbc.com \n", "1465178 [{'ip': '151.101.12.81', 'port': 80, 'status':... www.bbc.com \n", "1472180 [{'ip': '151.101.12.81', 'port': 80, 'status':... www.bbc.com \n", "1504898 [{'ip': '151.101.84.81', 'port': 443, 'status'... www.bbc.com \n", "1528928 [{'ip': '151.101.12.81', 'port': 80, 'status':... www.bbc.com \n", "1554661 [{'ip': '151.101.112.81', 'port': 443, 'status... 
www.bbc.com \n", "\n", " blocking_recalc \n", "1455010 ok \n", "1465178 ok \n", "1472180 ok \n", "1504898 ok \n", "1528928 ok \n", "1554661 ok \n", "\n", "[6 rows x 30 columns]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru[\n", " (df_ru['probe_asn'] == 'AS25513')\n", " & (df_ru['blocking_recalc'] == 'ok')\n", " & (df_ru['measurement_start_time'].str.startswith('2022-03-11'))\n", " & (df_ru['domain'] == 'www.bbc.com')\n", "]" ] }, { "cell_type": "code", "execution_count": 71, "id": "afbe8d73-c0d2-41f2-9d86-a4fec0ad4e10", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1455010 ['151.101.12.81']\n", "1465178 ['151.101.12.81']\n", "1472180 ['151.101.12.81']\n", "1504898 ['151.101.84.81']\n", "1528928 ['151.101.12.81']\n", "1554661 ['151.101.112.81']\n", "Name: dns_resolved_ips, dtype: object" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru[\n", " (df_ru['probe_asn'] == 'AS25513')\n", " & (df_ru['blocking_recalc'] == 'ok')\n", " & (df_ru['measurement_start_time'].str.startswith('2022-03-11'))\n", " & (df_ru['domain'] == 'www.bbc.com')\n", "]['dns_resolved_ips']" ] }, { "cell_type": "markdown", "id": "61910088-73b5-44cb-aef3-49c8a50296d3", "metadata": {}, "source": [ "In two of the OK measurements, it looks like there are different IP addresses returned in the DNS query. Let's inspect the measurement to see if it's a false negative." 
] }, { "cell_type": "code", "execution_count": 70, "id": "b7ac9b57-c898-4158-bd39-89ab609b6534", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://explorer.ooni.org/measurement/20220311T071246Z_webconnectivity_RU_25513_n1_A6Ux9DxJ1CkxiLNB?input=https%3A%2F%2Fwww.bbc.com%2Fnews%2Fworld-51235105\n" ] } ], "source": [ "print_explorer_url(df_ru.iloc[1504898])" ] }, { "cell_type": "markdown", "id": "bc2ef1a6-2fbe-458f-9fd1-caa8b657976a", "metadata": {}, "source": [ "Nope, it doesn't look like it. When looking at the blocked measurements, we can see that the resolved IP is almost always \"151.101.12.81\". This makes it quite likely that the blocking, which closes the connection through a RST packet, is also matching on the endpoint." ] }, { "cell_type": "code", "execution_count": 74, "id": "59d390c0-a5d4-41b4-aeff-e6aed709340f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1435961 ['151.101.12.81']\n", "1438061 ['151.101.0.81', '151.101.64.81', '151.101.128...\n", "1471678 ['151.101.12.81']\n", "1499181 ['151.101.12.81']\n", "1501909 ['151.101.12.81']\n", "1531316 ['151.101.12.81']\n", "Name: dns_resolved_ips, dtype: object" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru[\n", " (df_ru['probe_asn'] == 'AS25513')\n", " & (df_ru['blocking_recalc'] != 'ok')\n", " & (df_ru['measurement_start_time'].str.startswith('2022-03-11'))\n", " & (df_ru['domain'] == 'www.bbc.com')\n", "]['dns_resolved_ips']" ] }, { "cell_type": "markdown", "id": "4f3f393b-78a1-4686-9464-e0ff1caf0f91", "metadata": {}, "source": [ "Let's look at the other potential false negatives." ] }, { "cell_type": "code", "execution_count": 75, "id": "ee90cf2c-5c35-4391-a494-b368368ed4b4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"https://explorer.ooni.org/measurement/20220311T021951Z_webconnectivity_RU_25513_n1_FXsCAbJpZ0dAibqR?input=http%3A%2F%2Fwww.bbc.com%2Fnews\n" ] } ], "source": [ "print_explorer_url(df_ru.iloc[1465178])" ] }, { "cell_type": "markdown", "id": "9164d6d5-f955-4368-bbd4-086a7b3450cb", "metadata": {}, "source": [ "This looks like an actual false negative, which is caused by our blockpage detection heuristics not being good enough.\n", "\n", "Let's add this fingerprint to our fingerprint DB and re-annotate the measurements." ] }, { "cell_type": "code", "execution_count": 80, "id": "7b9b9880-5987-4a1f-b4fe-873900ccef91", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border='\"1\"' class='\"dataframe\"'>\n", " <thead>\n", " <tr style='\"text-align:' right>\n", " <th></th>\n", " <th>http_title</th>\n", " <th>http_body_md5</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1455010</th>\n", " <td>MTC</td>\n", " <td>a7bad4fa931d233f7e7145dcb6412434</td>\n", " </tr>\n", " <tr>\n", " <th>1465178</th>\n", " <td>MTC</td>\n", " <td>a7bad4fa931d233f7e7145dcb6412434</td>\n", " </tr>\n", " <tr>\n", " <th>1472180</th>\n", " <td>MTC</td>\n", " <td>a7bad4fa931d233f7e7145dcb6412434</td>\n", " </tr>\n", " <tr>\n", " <th>1504898</th>\n", " <td>Covid map: Coronavirus cases, deaths, vaccinat...</td>\n", " <td>add8e023428a9d1b4816fb3bb2a238c7</td>\n", " </tr>\n", " <tr>\n", " <th>1528928</th>\n", " <td>MTC</td>\n", " <td>a7bad4fa931d233f7e7145dcb6412434</td>\n", " </tr>\n", " <tr>\n", " <th>1554661</th>\n", " <td>Covid map: Coronavirus cases, deaths, vaccinat...</td>\n", " <td>add8e023428a9d1b4816fb3bb2a238c7</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " http_title 
\\\n", "1455010 MTC \n", "1465178 MTC \n", "1472180 MTC \n", "1504898 Covid map: Coronavirus cases, deaths, vaccinat... \n", "1528928 MTC \n", "1554661 Covid map: Coronavirus cases, deaths, vaccinat... \n", "\n", " http_body_md5 \n", "1455010 a7bad4fa931d233f7e7145dcb6412434 \n", "1465178 a7bad4fa931d233f7e7145dcb6412434 \n", "1472180 a7bad4fa931d233f7e7145dcb6412434 \n", "1504898 add8e023428a9d1b4816fb3bb2a238c7 \n", "1528928 a7bad4fa931d233f7e7145dcb6412434 \n", "1554661 add8e023428a9d1b4816fb3bb2a238c7 " ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ru[\n", " (df_ru['probe_asn'] == 'AS25513')\n", " & (df_ru['blocking_recalc'] == 'ok')\n", " & (df_ru['measurement_start_time'].str.startswith('2022-03-11'))\n", " & (df_ru['domain'] == 'www.bbc.com')\n", "][['http_title', 'http_body_md5']]" ] }, { "cell_type": "code", "execution_count": 81, "id": "172ee8d5-b718-44b7-aa66-5ebe5d45ffc8", "metadata": {}, "outputs": [], "source": [ "df_ru.loc[(df_ru['http_body_md5'] == 'a7bad4fa931d233f7e7145dcb6412434'), 'blocking_recalc'] = 'http.confirmed'" ] }, { "cell_type": "code", "execution_count": 82, "id": "c85cb3bc-eead-441b-98c5-27e1427c3dec", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<figure size with axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_blocking('AS25513', 'www.bbc.com')" ] }, { "cell_type": "code", "execution_count": 96, "id": "b3c2abf1-c2ea-4529-8468-64eea1e83b63", "metadata": {}, "outputs": [], "source": [ "data_export = df_ru[\n", " df_ru['domain'].isin(relevant_domains)\n", "][['domain', 'measurement_start_time', 'probe_asn', 'report_id', 'blocking_recalc']]" ] }, { "cell_type": "code", "execution_count": 97, "id": "126aada6-c358-433c-92b9-a63c23c11a0c", "metadata": {}, "outputs": [], "source": [ "data_export['count'] = 1" ] }, { "cell_type": "markdown", "id": "ef704969-9aba-41e9-ad70-4b59c52a51dc", "metadata": 
{}, "source": [ "At this point we would iterate the process of filtering out any additional false positives and false negatives, until we feel quite confident that we have eliminated most of the outliers (or come up with an explanation as to why we are seeing them).\n", "\n", "Once this process is done, it might be desirable to create a CSV export of the cleaned data in preparation for publication-ready charts (e.g. through tools like Tableau).\n", "\n", "Since charting tools generally work best with data where the values you need to plot are in the cells and the columns indicate the category of the value, we will reshape the data using the `pivot_table` function. This takes the values of `blocking_recalc` and turns them into columns; the value of each populated cell is, in this case, always going to be one. It's generally quite easy to do further aggregation and grouping inside of the charting tool itself." ] }, { "cell_type": "code", "execution_count": 102, "id": "701e1bf0-27a5-41d8-90c1-6c563f57e59a", "metadata": {}, "outputs": [], "source": [ "data_export.pivot_table(\n", " index=['probe_asn', 'domain', 'measurement_start_time', 'report_id'], \n", " columns=['blocking_recalc'], \n", " values='count'\n", ").reset_index().to_csv('20220226-20220317-russia-relevant-sites-pivot.csv')" ] }, { "cell_type": "code", "execution_count": 103, "id": "eebeb35f-4ec9-4d30-b06a-1d2a49f4142a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "66335 20220226-20220317-russia-relevant-sites-pivot.csv\n" ] } ], "source": [ "!wc -l 20220226-20220317-russia-relevant-sites-pivot.csv" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, 
"nbformat_minor": 5 } </figure></figure></figure></figure>