AI Technology · · 8 min read

AI Can Unmask Pseudonymous Users for $4 Per Target, New Research Shows

Large language models achieve 68% accuracy in linking anonymous accounts across platforms, upending decades of assumptions about online privacy protection.

Large language models can identify pseudonymous internet users with up to 68% accuracy at 90% precision for between $1 and $4 per target, according to research published last week by ETH Zurich and Anthropic. The findings collapse the practical obscurity that has long protected anonymous participation online, demonstrating that what once required hours of manual investigation now executes in minutes.

Attack Performance Metrics
  Hacker News → LinkedIn:  67% recall at 90% precision
  Reddit temporal split:   68% recall at 90% precision
  Reddit community split:  45% recall at 99% precision
  Cost per profile:        $1.41–$5.64

The system achieves up to 68% recall at 90% precision across three benchmark datasets, substantially outperforming classical baselines: the best non-LLM method achieves recall near 0%. In a test linking 338 Hacker News profiles to LinkedIn accounts, an AI agent correctly identified 226 targets, a 67% success rate at 90% precision, with 25 erroneous identifications and 86 abstentions. When matching Reddit users' posts separated by time, the system identified more than one-third of all users at 99% precision.

How the Attack Works

The researchers developed a four-stage pipeline called ESRC: Extract, Search, Reason, and Calibrate. An LLM first extracts identity-relevant signals from unstructured posts: demographics, writing style, incidental disclosures, interests, and linguistic patterns. These signals are encoded as semantic embeddings and used to search a candidate pool for likely matches. A second, more capable model then reasons over the top candidates to verify the best match, and a final calibration stage controls the false positive rate.
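The four stages can be sketched structurally. This is a toy illustration, not the paper's implementation: bag-of-words counts stand in for semantic embeddings, a similarity re-score stands in for the second LLM, and all function names and the confidence threshold are invented for the example.

```python
from collections import Counter
import math

def extract_signals(posts):
    """Stage 1 (Extract): the real pipeline uses an LLM to pull demographics,
    style, and incidental disclosures; here, a toy bag-of-words profile."""
    return Counter(" ".join(posts).lower().split())

def cosine(a, b):
    """Similarity between two word-count profiles (stand-in for embeddings)."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def search(target, candidates, k=3):
    """Stage 2 (Search): rank the candidate pool by profile similarity."""
    ranked = sorted(candidates.items(),
                    key=lambda kv: cosine(target, kv[1]), reverse=True)
    return ranked[:k]

def reason(target, top_candidates):
    """Stage 3 (Reason): a second, stronger model verifies the best match.
    Stubbed here as a re-score of the shortlist."""
    best_id, best_profile = max(top_candidates,
                                key=lambda kv: cosine(target, kv[1]))
    return best_id, cosine(target, best_profile)

def calibrate(match, confidence, threshold=0.5):
    """Stage 4 (Calibrate): abstain below a confidence threshold so the
    committed identifications stay high-precision."""
    return match if confidence >= threshold else None

# Demo: one pseudonymous profile against a two-candidate pool.
anon = extract_signals(["love rust and hiking near zurich",
                        "rust borrow checker again"])
pool = {
    "alice": extract_signals(["zurich hiking photos", "rust compiler internals"]),
    "bob":   extract_signals(["fantasy football picks", "grilling tips"]),
}
match, conf = reason(anon, search(anon, pool))
print(calibrate(match, conf, threshold=0.4))  # alice
```

The calibration stage is what produces the abstentions reported above: when the shortlist score falls below the threshold, the pipeline returns no match rather than a low-confidence guess.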

Unlike previous deanonymization work that required structured datasets like the Netflix Prize, this approach works directly on raw user content across arbitrary platforms. The models used in the pipeline were Grok 4.1 Fast from xAI, GPT-5.2 from OpenAI, and Gemini 3 Flash and Gemini 3 Pro from Google.

Technical Context

The research tested three datasets: linking Hacker News to LinkedIn profiles using cross-platform references; matching users across Reddit movie discussion communities; and splitting single Reddit users’ histories by time to create two pseudonymous profiles. According to the arXiv preprint, the agent also identified 9 out of 33 scientists from the Anthropic Interviewer dataset, matching prior manual efforts.

Who Is Vulnerable

Active contributors face the highest risk. A user who posts once or twice under a pseudonym leaves limited signal, but community members who have been posting for years build comprehensive datasets about themselves. The people who contribute most to online communities are therefore also the most vulnerable to deanonymization.

Each piece of specific information shared, such as a city, job, conferences attended, or niche hobbies, narrows the pool of potential identities. According to The Register, researcher Simon Lermen observed that the combination often forms a unique fingerprint.
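The narrowing effect is easy to see on the back of an envelope: each attribute filters the candidate pool, and if the attributes are roughly independent the fractions multiply. The platform size and selectivities below are illustrative assumptions, not figures from the research.

```python
# Illustrative: how a few attributes multiply into a near-unique fingerprint.
# Selectivity = assumed fraction of a 10M-user platform matching each attribute.
population = 10_000_000
attributes = {
    "lives in a mid-size city":    0.01,    # ~1% of users
    "works in a niche specialty":  0.001,   # ~0.1%
    "attended a small conference": 0.0005,  # ~0.05%
    "has an uncommon hobby":       0.005,   # ~0.5%
}

remaining = float(population)
for name, selectivity in attributes.items():
    remaining *= selectivity  # assumes attributes are independent
    print(f"after '{name}': ~{remaining:,.4f} candidates")
# The expected candidate count falls below 1 long before the list is exhausted:
# the combination is effectively a unique fingerprint.
```

Real attributes are correlated, so the numbers shift, but the qualitative point stands: a handful of innocuous details is usually enough to single out one person.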

High-Risk User Profiles
  • Journalists and activists using pseudonyms for source protection
  • Whistleblowers posting about employer misconduct
  • Reddit power users with years of posting history
  • Tech professionals discussing work on Hacker News
  • Anyone reusing usernames or discussing identifying details across platforms

Platforms tested include Hacker News, Reddit, LinkedIn, and partially redacted research interviews.

Comparison to Human Performance

LLMs do not necessarily exceed human capability; the signals they exploit are the same ones a skilled investigator would recognize, but they drastically reduce the cost. In a 2023 study published in Nature Scientific Reports, GPT-3.5 correctly deanonymized 784 texts (72.6% of the total), notably outperforming the humans in the original experiments.

Large language models can now re-identify pseudonymous internet users at scale, achieving precision levels that rival skilled human investigators while operating in minutes rather than hours. The speed differential fundamentally changes the threat landscape: what was theoretically possible but practically infeasible is now automated and scalable.

Regulatory Implications Under GDPR

The research arrives as European courts refine pseudonymization standards. In September 2025, the Court of Justice of the European Union issued a landmark decision narrowing when pseudonymized data counts as personal data under GDPR: it is not automatically personal data for every party that holds it, but only where the recipient can reasonably reidentify individuals, taking technical, organizational, and legal factors into account.

However, GDPR treats pseudonymized data as personal data requiring compliance, while truly anonymized data falls outside the GDPR scope. The new LLM capabilities may force regulators to reassess what constitutes “reasonable” reidentification risk. What might appear to be anonymous data can often be re-identified when combined with auxiliary datasets or through advanced analytics techniques.

GDPR Treatment: Pseudonymized vs. Anonymized Data
  Pseudonymized Data                            | Anonymized Data
  Still considered personal data                | Not personal data
  Full GDPR compliance required                 | Falls outside GDPR scope
  Data subject rights apply                     | No data subject rights
  Requires lawful basis for processing          | No lawful basis needed
  Can be reversed with additional information   | Irreversible by design

Under current interpretation, pseudonymized data is now considered personal data for GDPR purposes only when it is reasonably likely the data recipient can re-identify the data subject based on the recipient’s access to additional data and their contractual and technological capabilities. The $1–$4 cost per profile dramatically expands what qualifies as “reasonably likely.”

Threat Actors and Use Cases

According to The Register, the authors suggest that governments could use this technique to target journalists or activists, that corporations could mine forums to build highly targeted advertising profiles, and that online attackers could develop detailed personal profiles to make social engineering scams more credible.

The blockchain sector faces particular exposure. According to Protos, crypto is already accustomed to machine learning and clustering algorithms linking wallet addresses to real-world identities, but this research shows that off-blockchain datasets such as forum posts and social media activity are far larger, and that exploiting them is now trivially automatable.

Countermeasures and Their Limits

Defensive options exist but face practical constraints. Manual obfuscation techniques—altering writing style, using round-trip translation, or imitating other authors—can reduce attribution accuracy. According to research on adversarial stylometry, the simple trick of round-trip translation only changes a handful of words but makes a significant difference.

Such manual techniques, in which individuals intentionally alter their writing style, are particularly effective at evading detection, often reducing the accuracy of stylometric tools to the level of random guessing. But the burden of maintaining consistent obfuscation across years of posting is prohibitive for most users.
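Why small wording changes matter can be illustrated with a toy stylometric fingerprint. Real stylometry uses hundreds of features (character n-grams, syntax, punctuation habits); this sketch uses only function-word frequencies, which authors tend to produce unconsciously and consistently, and all texts and feature choices here are invented for illustration.

```python
from collections import Counter
import math

# Toy feature set: relative frequencies of common English function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]

def fingerprint(text):
    """Frequency profile of function words; a crude stylometric signature."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(fp_a, fp_b):
    """Euclidean distance between profiles; lower means more similar style."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp_a, fp_b)))

original   = "the point of the argument is that it is simple but it works"
paraphrase = "this argument makes a simple point and in practice works well"

unchanged = distance(fingerprint(original), fingerprint(original))
rewritten = distance(fingerprint(original), fingerprint(paraphrase))
print(f"identical text: {unchanged:.3f}, after paraphrase: {rewritten:.3f}")
```

Rewriting shifts even this crude profile, which is why techniques like round-trip translation, which quietly reshuffle function words, can degrade attribution with only a handful of changed words. The catch, as above, is doing it consistently for every post over years.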

Practical Countermeasures
  • Minimize identifying micro-details: avoid discussing employer names, specific projects, conferences attended, or unique life events across platforms
  • Use separate personas: maintain strict separation between pseudonymous and real-identity accounts with distinct vocabulary and topics
  • Limit posting volume: active contributors create larger attack surfaces
  • Deploy automated obfuscation: tools like synonym substitution can disrupt stylometric patterns, though effectiveness varies
  • Delete old content: regularly purge historical posts to reduce available training data

Platform-level defenses face detection challenges. LLM providers could try to detect and block attempts to misuse their models for deanonymization, but the framework splits an attack into seemingly benign summarization, search, and ranking tasks. The researchers are pessimistic: the pipeline is a sequence of individually harmless steps that are nearly impossible to distinguish from legitimate use.

According to researcher Simon Lermen’s analysis, the most effective short-term mitigation is restricting data access: enforcing rate limits on API access to user data, detecting automated scraping, and restricting bulk data exports, all of which raise the cost of large-scale attacks. Platforms, he argues, should assume that pseudonymous users can be linked across accounts and to real identities.
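Rate limiting of the kind described above is commonly implemented as a token bucket. A minimal sketch (the class, parameters, and limits are hypothetical, chosen only to show the mechanism):

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter: each client may fetch at most `rate`
    profiles per second, with short bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# A scraper firing 10 back-to-back profile reads against a 2/sec, burst-5 limit:
bucket = TokenBucket(rate=2, capacity=5)
results = [bucket.allow() for _ in range(10)]
print(results.count(True), "allowed,", results.count(False), "throttled")
```

A limit like this does not stop a patient attacker, but it converts a minutes-long bulk scrape into a slow, detectable crawl, which is exactly the cost-raising effect the mitigation aims for.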

What to Watch

The researchers report that their entire experiment cost about $2,000, roughly $1 to $4 per profile, a figure that signals commoditization. As model capabilities improve and API pricing falls, the economics shift further toward attackers.

Regulatory response will determine whether pseudonymous data retains GDPR protection. The September 2025 ECJ ruling created flexibility for recipients who cannot reasonably reidentify subjects, but $4 automation may force stricter interpretations. Watch for updated guidance from the European Data Protection Board on what constitutes “reasonably likely” reidentification in the LLM era.

Platform policy changes are likely. Expect rate limiting on bulk data access, restrictions on historical post exports, and potential liability shields for implementing anti-scraping measures. Reddit, Hacker News, and similar platforms face pressure to balance community value against exploitation risk.

Model providers face deployment decisions. In a test using data from a Steam profile, GPT-5 Pro refused to search, citing impermissible de-anonymization, and Anthropic’s Claude also rejected the request. Whether these safeguards scale to indirect attacks remains uncertain. The arms race between attribution and obfuscation continues, but the cost asymmetry now favors attackers.