chatcontrol, eupol
#fediverse, we really need to talk about #chatcontrol, the EU's next mass surveillance system. Long thread below, with hints to skip over parts if it's too long for you. Sources are in [brackets] & linked at the end. I tried to simplify a bit to keep the thread accessible for people without deep technical knowledge. I've packed a LOT of content into this thread - if you have trouble understanding, it's probably my fault. Ask and I'll clarify! Please boost for awareness.
chatcontrol, eupol
While I doubt anyone will disagree with the aim, the means (mass surveillance) not only violate privacy rights but are also ineffective, for various reasons. In this thread I'll explain some of these reasons and what YOU can do to help stop the regulation. Let's start with the most important thing: Chatcontrol completely misses the point, as criminals rarely use messengers to share material - they're too slow to share large collections of CSAM.
chatcontrol, eupol
Instead, they encrypt the files and upload them to a completely normal filesharing service. Since the files are encrypted, the service is unable to scan the contents, even if it wants to detect CSAM. Criminals can then simply share links to the content wherever they want [C1,4-6] - the scanning can't/won't hinder them! This alone should be enough to scrap the regulation! If that's enough for you, scroll far down to the sign (you'll see it) to learn what you can do.
chatcontrol, eupol
But maybe we can catch criminals who DO share CSAM via messengers - there are some out there, and the proposal is directed at them. The regulation differentiates between 3 categories to detect [L2a]:
- Known CSAM
- New CSAM
- Grooming
While there has been lots of research, only detection of known CSAM is really feasible, and even that comes with issues. In the next posts, I'll explain some of the technical reasons why - skip 13 posts ahead if that's not your thing.
chatcontrol, details
Known CSAM images are the easiest to detect - still, it's not an easy task. A computer cannot simply compare an image like a human can and say whether they're the same or not. It could compare all pixels, but that would be slow AND you'd need to have CSAM stored wherever the code doing the comparison runs. Such a comparison would also only work as long as the image is not modified at all. If the image is resized, brightened or blurred, this detection won't work.
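To illustrate the naive approach (a minimal sketch, assuming the Pillow imaging library; not how any real system is implemented):

```python
from PIL import Image

def identical(path_a, path_b):
    a = Image.open(path_a)
    b = Image.open(path_b)
    # Compare every single pixel. Any resize, crop, re-encode or
    # brightness change makes this return False.
    return a.size == b.size and list(a.getdata()) == list(b.getdata())
```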
chatcontrol, details
To fix this, so-called perceptual hashing is used. The idea is simple: Split an image into different areas and calculate a value for each area. This list of values is called a perceptual hash. To compare two images, calculate the difference between the values of their hashes. If the difference is below a certain value (the detection threshold), the images are likely the same. Now the problem is: How do you choose the threshold at which you declare two images the same?
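Roughly like this (a toy "average hash" sketch, assuming Pillow; real algorithms like PhotoDNA are far more sophisticated):

```python
from PIL import Image

def average_hash(path, hash_size=8):
    # Shrink to a tiny grayscale image: small modifications (resize,
    # blur, slight brightness changes) barely affect this version.
    img = Image.open(path).convert("L").resize((hash_size, hash_size))
    pixels = list(img.getdata())
    avg = sum(pixels) / len(pixels)
    # One bit per area: is this pixel brighter than the average?
    return [1 if p > avg else 0 for p in pixels]

def distance(hash_a, hash_b):
    # Hamming distance: how many bits differ between the two hashes.
    return sum(a != b for a, b in zip(hash_a, hash_b))
```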
chatcontrol, details
Two approaches:
Set the threshold high. More modified CSAM will be detected (high recall). But set it too high, and two unrelated images will be considered the same - in practice meaning a harmless image is detected as CSAM.
Or: Set the threshold low → fewer false positives (high precision). Too low, and you run into the same issue as when comparing pixels: slightly modified images are not considered the same, so known CSAM goes undetected. (The sketch below shows this threshold check.)
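Continuing the hash sketch above (the threshold value is purely illustrative):

```python
THRESHOLD = 10  # max bits (out of 64) that may differ - illustrative only

def matches_known_image(candidate_hash, known_hashes):
    # Flag the image if it is "close enough" to any hash in the database.
    # Raising THRESHOLD catches more modified copies (higher recall), but
    # also flags more unrelated images (more false positives); lowering
    # it does the reverse.
    return any(distance(candidate_hash, h) <= THRESHOLD
               for h in known_hashes)
```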
chatcontrol, details
You can learn more about precision/recall here: https://mlu-explain.github.io/precision-recall/ For the detection of CSAM, there is no choice but to optimize for high precision: Low precision would mean many false positives, and as there are far more normal messages than messages containing CSAM, the false positives would quickly drown out the true positives. Luckily, perceptual hashing is *just* precise enough. How precise exactly?
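For reference, the standard definitions in code form:

```python
def precision(true_pos, false_pos):
    # Of everything that was flagged, how much really was CSAM?
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    # Of all the CSAM that exists, how much was actually flagged?
    return true_pos / (true_pos + false_neg)
```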
chatcontrol, details
A well-known perceptual hashing algorithm is Microsoft's PhotoDNA [T1a]. Its creator claims the chance of misidentification in a large-scale analysis is about 1 in 50 billion [T1b]. While this is very impressive, such low error probabilities are an absolute must for reliably detecting CSAM. Take (as an example) WhatsApp: Assuming 100 billion messages per day [T2], even at these odds, there'll still be about 2 false positives PER DAY, ONLY for WhatsApp.
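The back-of-the-envelope math:

```python
false_positive_rate = 1 / 50e9  # claimed misidentification chance [T1b]
messages_per_day = 100e9        # rough WhatsApp volume [T2]
print(messages_per_day * false_positive_rate)  # -> 2.0 false positives/day
```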
chatcontrol, details
So yes, image comparison is practical, AS LONG AS YOU INCLUDE HUMAN REVIEW. There are other issues, but to detect known CSAM, the accuracy is not really in question. An important question is: Where does the image get analyzed? On a central server? You'd have to throw out end-to-end encryption (E2EE). Analyzing on the users' devices is the only solution to *technically* keep E2EE intact, but why would you trust criminals to use unmodified devices which scan for CSAM?
chatcontrol, details
There are also concerns that PhotoDNA may be reversible, so distributing hashes for detection would be equivalent to distributing CSAM [T3]. If reversible, it'd also be possible to intentionally create images which are falsely detected as CSAM, then flood the detection system with them, making detection of real CSAM a lot more difficult. If the above issues are resolved, we MAY be able to reliably detect known CSAM. However, finding new CSAM is far less precise (3 posts):
chatcontrol, details
When detecting unknown CSAM, the previous issues apply AND there are two connected additional issues: AI training and accuracy. To detect new material, an AI is trained by "looking at" lots of CSAM so it can learn what CSAM looks like - note that hashes are not enough, it needs the actual material. This means that if only the new EU centre is allowed access to CSAM, only the EU centre will be able to train an AI which detects new CSAM! Will they get it right?
chatcontrol, details
As a rule of thumb: The better the AI works, the worse we can currently explain its behavior [T4]. This means we cannot be sure the AI learned "the right way" to detect CSAM. And there are many ways to get it wrong: As an example, an AI cannot simply rely on detecting images of naked children, as that would likely lead to false positives when parents send photos of their children at the beach or in the bathtub to the grandparents.
chatcontrol, details
Even if the AI can detect *actual porn* with people who *look* underage, this can still lead to false positives - there have been HUMAN mistakes about this [C2]. So how could an AI correctly classify CSAM vs. not CSAM if not even humans are able to? It's a difficult problem, and there WILL be many false positives. Even if the error rate were just 1% (in practice it's higher), this would mean many hundreds of millions of messages are falsely reported EVERY DAY.
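Again the back-of-the-envelope math, reusing the WhatsApp volume from above (1% is an assumed, optimistic rate):

```python
error_rate = 0.01         # assumed false positive rate for new-CSAM detection
messages_per_day = 100e9  # WhatsApp alone [T2]
print(messages_per_day * error_rate)  # -> 1e9: a billion false reports/day
```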
chatcontrol, details
For grooming detection: I haven't found many details, but it seems even less advanced. The European Commission says that Microsoft claims to have a technology with an accuracy of 88% [L2b], meaning a 12% error rate (it's unclear if the commission confused accuracy with precision, but that would indicate low recall, which makes sense). In 2020, Microsoft presented their tool. Watch the presentation [T1c]: It's clear that it's not ready for automated large-scale detection.
chatcontrol, details
The tool consisted (consists?) of simple text matching rules (for programmers: regexps).
It
- is intended to help human moderators
- is not intended to run in real-time
- is not meant for law enforcement
- only supports English! [T1c]
Adapting the detection techniques to other languages will, of course, take time. It looks for specific phrases, so it'll probably have high precision but low recall - the sketch below shows how such rules work. Still, sexting adults are probably going to trigger false positives.
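A minimal sketch of rule-based text matching - the phrases here are invented placeholders, NOT the real rules, which aren't public:

```python
import re

# Invented example patterns; a real rule set would be much larger.
RULES = [
    re.compile(r"\bhow old are you\b", re.IGNORECASE),
    re.compile(r"\bdon'?t tell your (mom|dad|parents)\b", re.IGNORECASE),
]

def flag_message(text):
    # Flag the message if any rule matches anywhere in the text.
    return any(rule.search(text) for rule in RULES)

print(flag_message("How old are you?"))  # True
print(flag_message("hOw 0ld r u"))       # False - trivially bypassed
```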
chatcontrol, details
The tool is also not intended for end-to-end encrypted messages [T1c]. Once the rules for matching text are known (they HAVE to be on-device if you want to keep end-to-end encryption), criminals could just switch to different phrases (or even write bots to flood the report system)! Even if detection works well, as it only works on text, it'll be trivial to bypass: Ask the kid to join a Discord call, send voice messages, send images containing text, etc.
chatcontrol, eupol, summary:
To summarize the technical aspects: It's possible to detect known CSAM, but that's about all you can reliably detect. When scanning for unknown CSAM, even a low error rate of a few percent will lead to billions of messages being falsely reported EVERY DAY. Grooming detection is at best a prototype and relies on human moderation. Not only can detection be bypassed, it'll also be possible for criminals to deliberately cause false positives to disrupt investigations.
chatcontrol, police
The Swiss police receives automated reports and states that in about 87% of cases, they are useless [C3] - what will the police do if this number increases to 99%?
Also, again, scanning messages won't help if criminals encrypt their files, upload them to a hosting service and share links, as they are already doing [C1,4-6]. But even assuming the police had a magical technology with 100% accuracy for all reports, would they care about reducing the distribution of CSAM?
chatcontrol, eupol
@mxm Excellent thread, I wish it wasn't needed. I didn't know the details about the "grooming detection" tool.
And I fervently wish facts, and the standards of legality, necessity and proportionality would decide this.
A lot of garbage legislation that was passed in the last few years would have to be thrown out immediately.
chatcontrol, eupol
@mxm
Alas, work has already begun to build the surveillance center, as Erich Möchel reports, so I doubt very much any of the decision makers will be swayed, and we have to mount a campaign in every member state and try to talk sense into parliament.
chatcontrol, eupol
@quincy The center could still be useful: It's there to coordinate between the police in different countries & develop/provide the detection software. As long as the detection software doesn't become mandatory, I don't see a reason why it shouldn't operate
chatcontrol, eupol
@mxm Hm. Could this be a potential off-ramp for the commission?
Whether or not this center would be a problem depends very much on the details.
If this is just a police coordination center, fine. But judging from the fm4 article, it will be tasked with dragnet surveillance. And that's very much a problem, no matter what it's for.
chatcontrol, eupol
@mxm Is it possible to put your toots into a format that is a little easier to read? Like a blog post or just some pastebin :)
I'm interested, but not everyone may want to click on each individual CW button of all the replies
@totoroot I plan to post that on a blog eventually (I don't have a blog *yet*), but right now mastodon is all I have.
What you could do, is to go into your mastodon settings and check "Always expand posts marked with content warnings" (& disable again after reading if you prefer)
Otherwise... I know for Twitter there are bots which gather all posts of a thread & upload them; maybe there's something similar for Mastodon? Although so far I haven't seen that anywhere.
@totoroot I finally finished setting up the blog, you can read the thread here: https://maxim.tips/chatcontrol/
@mxm Awesome, thanks!! 👏
chatcontrol, eupol
What is chatcontrol?
Chatcontrol is what critics call the regulation proposed by the European Commission [L1]. The regulation aims to reduce the online distribution of Child Sexual Abuse Material (CSAM). It would force messaging services to scan ALL content, including personal messages and photos, to detect CSAM and report detections to a newly established EU centre. The centre will coordinate with police in EU countries and provide access to detection technologies.