UPDATE · 2026·04·26 · v3 · The market · 12 operators · 2 real names · 1 cross-vertical malware kit · methodology generalized in 1 workday · read v3 →
Operation Long Shadow · v2 / Research note · 2026·04·26 · T+1
Cohort still active
Follow-up No. 02 · One day after publication · math validated · footprint blew open

We said the math could detect them.
We tested it.

v1 hypothesized that a small basis of math primitives — graph pathology, content sterility, lifecycle mismatch, identity collision — could surface this kind of farm at platform scale. Twenty-four hours later we ran them against the labeled corpus. One axis is defeated. Three hold. The conjunction blows the campaign footprint open by a factor of forty-four.

Cohort discovered
6 known → 288 high-confidence bots · two pivots
Customer / victim leads
13 in v1 → 563 new external candidates
Procurement waves
Q4-2021 dormant warehouse · Mar-2025 fresh burner
Strongest single rule
graph_k3 · P=R=F1=1.000 on labels
Scroll · the math, validated
01 · What held, what broke

From six known bots, three pivots, five-hundred-and-sixty-three new targets.

v1 named six bot accounts and thirteen anchor repos at k≥4. We took those six bots and pulled their User.starredRepositories. The thirteen anchors fell out of the data. Then we took the thirteen anchors and pulled their stargazer lists. Two-hundred-and-eighty-eight accounts — out of twenty-six thousand — hit ten or more of them. Then we took those 288 and pulled their starred-repos. Five-hundred-and-sixty-three new external customer/victim repos surfaced.

That's three rounds of bipartite pivot from a six-account seed. Each round used one GraphQL query class. Each pivot collapsed an axis of ambiguity. The math is small. The data was already public. The campaign is forty-four times larger than v1 documented.

6 → 288 · High-confidence bot cohort, after two pivots
563 · New external customer / victim repos at k≥10
100% · Precision of graph_k3 alone on labeled positives
7 / 9 · Labeled launder pairs detected by git blob-SHA Jaccard

research/detector/REPORT.md · research/detector/reports/01..06 · research/detector/corpus/ — 56,119 + 28,598 (repo, stargazer, starredAt) edges, 9 launder-pair blob trees, joint calibration table

02 · The math weak spot

Five primitives. One was defeated. Three held. One emerged from the data.

The synthesis behind v1 proposed five irreducible signals: coincidence (tight time-window star bursts), graph pathology (bipartite-core membership), content sterility (low-novelty content), identity collision (key/email reuse), and lifecycle mismatch (account-age × action-rate). Each is O(1) per event with a small index. The bet was that any one axis has a defensible benign explanation, but the conjunction does not.

We ran each one against the corpus. Here is what survived contact with the data.

#1 · Coincidence · Poisson scan · Kulldorff 1997 · BOCPD
Tight Δt star-bursts across "independent" accounts. The textbook lockstep primitive (Beutel WWW'13).
Verdict: Defeated
#2 · Graph pathology · CopyCatch · Fraudar · k-core
Bipartite-core membership. Same set of accounts starring the same set of repos, regardless of timing.
Verdict: Strong
#3 · Content sterility · git blob-SHA Jaccard · MinHash + LSH
Identical or near-identical content between an upstream and its laundered copy. Git already content-addresses every blob.
Verdict: 7 / 9 hit
#4 · Identity collision · SHA-256 of SSH wire · Fellegi-Sunter
Key fingerprints, GPG IDs, commit emails that should be disjoint across "independent" accounts but aren't.
Verdict: Null on cohort
#5 · Lifecycle mismatch · 2-component GMM · Cox PH
Account age × action-rate distributions outside the human envelope. Dormant-then-burst, or born-then-burst.
Verdict: Bimodal · visible
+1 · Cadence-of-stars · derived from #1's negative result
Hours-vs-months span discriminates target type: own-farm repos starred fast; victim customers starred slowly. New axis surfaced empirically.
Verdict: Emergent

The defeat. The k=6 cohort hit 53AI/53AIHub over a 152-day window. tigshop/Tigshop: 109 days. yikart/AiToEarn: 96 days. No two bots ever starred the same repo within 60 seconds. Operators have already adapted past the tight-Δt-window primitive that academic literature relies on. Beutel WWW'13 and Pacheco ICWSM'21 don't fire here.

The resurrection. The temporal information didn't die — it migrated. Operators star their own farm repos in tight windows (e.g. tusmart-grouptt/crewrktabletsn — k=4 of the 6 bots starred it within a 20.6-hour burst) and their customer-paid repos over months. Cadence-of-stars discriminates target type. We promote it from background metric to its own axis.
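The cadence rule is small enough to state as code. A minimal sketch, assuming the corpus edges carry ISO-8601 `starredAt` timestamps; the function names and the 48-hour burst threshold are illustrative assumptions, not the report's calibrated values:

```python
from datetime import datetime

def star_span_hours(timestamps):
    """Span between the first and last cohort star on a repo, in hours."""
    ts = sorted(datetime.fromisoformat(t) for t in timestamps)
    return (ts[-1] - ts[0]).total_seconds() / 3600.0

def classify_target(timestamps, burst_hours=48.0):
    """Hypothetical cadence rule: a tight burst suggests an operator-owned
    farm repo; a months-long drip suggests a paid customer/victim target."""
    return ("farm-repo" if star_span_hours(timestamps) <= burst_hours
            else "customer-target")

# Toy starredAt values shaped like the two cadences described above:
farm_burst = ["2024-01-10T03:00:00", "2024-01-10T11:30:00",
              "2024-01-10T23:36:00"]          # ~20.6 h window
victim_drip = ["2023-11-01T00:00:00", "2024-01-15T00:00:00",
               "2024-04-01T00:00:00"]         # months-long span
```

The same span computation works per (repo, cohort) slice, so promoting cadence to its own axis costs no new data fetch.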

copycatch_findings.json · bipartite_core_ge2[].window_seconds · reports/02_copycatch_bipartite.md

03 · Three pivots, one cohort

Six known bots. Two pivots later, two-hundred-and-eighty-eight.

The graph axis is operationally explosive. Each round of bipartite pivot turns a known seed into a population. Round one: from the 6 v1 bots, fetch every starred repo. Thirteen anchors emerge at k≥4. Round two: from those 13 anchors, fetch every stargazer. Twenty-six thousand stargazers; 288 hit ten or more of the 13. Round three: from those 288, fetch every starred repo. 563 new external repo candidates at k≥10; ranked by cohort coverage; the top two hit by every single one of the 288.

The trick is reachability, not depth. Each pivot uses a single GraphQL endpoint (User.starredRepositories or Repository.stargazers) and a small rate budget. The 288-cohort fetch was 28,598 edges across 3,726 unique repos in roughly twelve minutes of API time. The math weak spot isn't a clever algorithm; it's that the public-API graph is dense enough to walk back to itself in two hops.
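The three rounds reduce to counting over edge maps. A minimal sketch with the GraphQL fetch abstracted away as precomputed dictionaries; the account and repo names below are toy placeholders, and the k thresholds are scaled down from the report's 4 / 10 / 10:

```python
from collections import Counter

def pivot_repos(starred_by, seed_accounts, k):
    """Rounds 1 and 3: repos starred by >= k of the seed accounts."""
    hits = Counter(r for a in seed_accounts for r in starred_by[a])
    return {r for r, n in hits.items() if n >= k}

def pivot_accounts(stargazers_of, anchor_repos, k):
    """Round 2: accounts starring >= k of the anchor repos."""
    hits = Counter(a for r in anchor_repos for a in stargazers_of[r])
    return {a for a, n in hits.items() if n >= k}

# Toy graph: three seed bots, two shared anchors, one organic account.
starred_by = {
    "bot1": {"anchorA", "anchorB", "anchorC"},
    "bot2": {"anchorA", "anchorB", "anchorC"},
    "bot3": {"anchorA", "anchorB", "noise"},
}
stargazers_of = {
    "anchorA": {"bot1", "bot2", "bot3", "bot4", "organic"},
    "anchorB": {"bot1", "bot2", "bot3", "bot4"},
}

anchors = pivot_repos(starred_by, {"bot1", "bot2", "bot3"}, k=3)
cohort = pivot_accounts(stargazers_of, anchors, k=2)
```

Each pivot is one Counter pass over one query class's output, which is why the whole expansion fits in minutes of API time.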

The two universal targets. yikart/AiToEarn (9,773 ★) and tigshop/Tigshop (3,841 ★) are starred by every single one of the 288 bots. Hundred-percent cohort coverage. These are the campaign's headline customers.

scripts/fetch_anchor_stargazers.py · scripts/fetch_cohort_stars.py · primitives/lateral_discovery.py · primitives/recursive_expansion.py · reports/03_lateral_discovery.md · reports/06_recursive_expansion.md

04 · The cohort has a birthday

A four-year-old account warehouse, plus fifty-four fresh burners.

Plot createdAt by month for the 288 high-confidence cohort and the lifecycle GMM result writes itself. Two distinct procurement waves. Cohort A (~212 accounts, 74%): created in a tight Q4-2021 + early-2022 window, then dormant for two years before activation against the anchor repos in late-2023. Consistent with credential-stuffing / account-takeover dumps of that era — most of these are real users whose accounts were taken over, not synthetic identities. Cohort B (~54 accounts, 19%): created March-2025 and immediately deployed. Born-then-burst. Pure synthetic burners.

Cohort A · Q4-2021 dormant warehouse · 212 accounts
Cohort B · Mar-2025 fresh burner · 54 accounts
Other · 22 accounts
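The actual split comes from the 2-component lifecycle GMM; for illustration, a deterministic window rule reproduces the same bucketing. The window boundaries below are assumptions read off the description above, not the fitted component means:

```python
from datetime import date

def cohort_bucket(created_at: date) -> str:
    """Hypothetical window rule approximating the GMM split:
    A = Q4-2021 / early-2022 dormant warehouse, B = Mar-2025 burners."""
    if date(2021, 10, 1) <= created_at <= date(2022, 3, 31):
        return "A"
    if created_at >= date(2025, 3, 1):
        return "B"
    return "other"
```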

A profile-Frankenstein

sherryjwq: created 2021-10-25, hit 12 of 13 anchors. Profile name = "William Brown." Bio in Portuguese. Login is a Chinese romanization. Three independent dump fragments stitched into one account by an attacker who never proofread the result. In the raw profile data the seams are obvious; in the GitHub UI it just looks like a user. Two-hundred-and-twelve such Frankensteins constitute Cohort A.

Algomo7: created 2025-03-30. Zero followers, zero following, no name, no bio, no location. Pure burner — and one of fifty-four operating at the same window of activation. Cohort B.

Implication for disclosure ethics. The 74% majority cohort A are real-user victims. T&S takedown should include a recovery path. The 19% cohort B can be removed without that consideration. This matches the v1 framing for tusmart-grouptt and countneurooman — but at scale.

corpus/high_confidence_bots.json — split by cohort A / B / other · reports/03_lateral_discovery.md

05 · Operator linkage by exact equality

Two operators. One git tree. Bit-identical content.

Git addresses every blob by its SHA-1: identical content produces an identical hash. Run git ls-tree -r HEAD on two repos, intersect the resulting (path, blob-sha) sets, and the Jaccard between them is exact — not statistical. v1 labeled luliguyu and countneurooman as "likely-compromised-dormant"; the math says otherwise.
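The whole measurement fits in a few lines. A minimal sketch that parses `git ls-tree -r HEAD` output and computes both scores; the toy tree strings below are fabricated for illustration, not the real ssaavedrad trees:

```python
def parse_ls_tree(text):
    """Parse `git ls-tree -r HEAD` output into {path: blob_sha}."""
    tree = {}
    for line in text.strip().splitlines():
        meta, path = line.split("\t", 1)
        mode, objtype, sha = meta.split()
        if objtype == "blob":
            tree[path] = sha
    return tree

def sha_jaccard(a, b):
    """Exact Jaccard over (path, blob-sha) pairs -- set math, not stats."""
    sa, sb = set(a.items()), set(b.items())
    return len(sa & sb) / len(sa | sb)

def per_path_equality(a, b):
    """Fraction of shared paths whose blob SHAs match bit-for-bit."""
    shared = a.keys() & b.keys()
    return sum(a[p] == b[p] for p in shared) / len(shared)

upstream = parse_ls_tree(
    "100644 blob aaa1\tsrc/main.js\n"
    "100644 blob bbb2\tREADME.md\n"
    "100644 blob ccc3\tpackage.json")
launder = parse_ls_tree(
    "100644 blob aaa1\tsrc/main.js\n"
    "100644 blob ddd4\tREADME.md\n"     # only the README was touched
    "100644 blob ccc3\tpackage.json")
```

Because git already content-addresses every blob, no file contents need to be read: two `ls-tree` listings and two set intersections give the exact overlap.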

Both repos launder the same upstream · Narwallets/narwallets-extension

luliguyu/ssaavedrad · git ls-tree HEAD
Upstream blobs · 157
Launder blobs · 154
SHA Jaccard · 0.7278
Per-path eq. · 0.8506
countneurooman/ssaavedrad · git ls-tree HEAD
Upstream blobs · 157
Launder blobs · 154
SHA Jaccard · 0.7278
Per-path eq. · 0.8506

Same blob set. Same paths. Same SHAs. Same git tree, two accounts.

157 upstream blobs · 154 launder blobs · 0.7278 sha-Jaccard · identical to four decimal places

The full launder-pair table

Nine of the eleven labeled launder pairs from v1 had upstreams we could re-clone (three v1 attributions corrected — see colophon). Of the nine, the blob-SHA Jaccard at threshold (≥0.1 OR per-path-equality ≥0.5) detects seven. The two misses are a stale-snapshot just below threshold and a "combined-mode" pair where v1's upstream attribution is wrong on this side.

UPSTREAM · LAUNDER · MODE · SHA J · PATH EQ · DETECT

Reclassify countneurooman. v1 framed this account as "likely-compromised-dormant — if you are the real account holder, contact us." The TWIN LAUNDER finding upgrades the framing: operator-twin-of-luliguyu. The same git tree was published under both accounts in the same campaign window. This is operator-linkage by exact content equality, not by statistical inference.

primitives/content_sterility.py · corpus/content_overlap.json · corpus/clones/Narwallets__narwallets-extension · corpus/clones/luliguyu__ssaavedrad · corpus/clones/countneurooman__ssaavedrad · reports/04_content_sterility.md

06 · What hides in the long tail

The cohort also stars plotly/dash. And facebook/react. That's not a mistake.

Among the 577 candidates the recursive-expansion pivot surfaces, a subset is mathematically improbable as campaign targets: mainstream OSS repos with tens of thousands of organic stargazers. NaiboWang/EasySpider at 43,733 ★. plotly/dash at 24,150 ★. Five repos under the facebook/ org. Three under angular/. Two each under sindresorhus/ and graphql/. The cohort starring these is dilution — camouflage — to defeat naïve "stargazers of obscure repos" detection. This is exactly the design motivation Hooi et al. (KDD 2016) wrote Fraudar against.

[Scatter plot · x: cohort-bots-starring (log) · y: total-repo-stargazers (log) · legend: likely customer / victim · mainstream OSS (camouflage) · v1 anchor (known) · operator-owned]

The two layers separate cleanly. Customer-side targets cluster small-and-dense in the lower right: a few thousand total stars, and a substantial fraction of those stars come from the cohort. Camouflage clusters huge-and-sparse in the upper region: tens of thousands of stars, of which the cohort contributes a sliver. The split is what discriminates "campaign target" from "popular project the bots also star to look human."

What this unlocks. Fraudar's 1/log(deg+c) edge-weighting deprioritizes mainstream-popular nodes during dense-subgraph peeling. Running it on this 288 × 3,726 biadjacency separates the customer subgraph from the camouflage subgraph algorithmically — no manual sort. A clean Tier-1 / Tier-2 customer ranking falls out of one peeling pass. That is the next algorithmic step.
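A compact sketch of that peeling pass, assuming the biadjacency is held as a plain edge list. The toy graph plants two dense customer repos plus one mainstream repo padded with organic stargazers; the constant c = 5 and all names are illustrative assumptions:

```python
import math

def fraudar_peel(edges, c=5.0):
    """Greedy dense-subgraph peeling with FRAUDAR-style column weighting:
    an edge into repo j is worth 1/log(deg(j)+c), so a hugely popular
    camouflage repo contributes almost nothing to any block's density."""
    col_deg = {}
    for _, r in edges:
        col_deg[r] = col_deg.get(r, 0) + 1
    ew = {(b, r): 1.0 / math.log(col_deg[r] + c) for (b, r) in edges}

    adj = {}
    for b, r in edges:
        adj.setdefault(("bot", b), set()).add(("repo", r))
        adj.setdefault(("repo", r), set()).add(("bot", b))

    def edge_w(u, v):  # weight of the edge between nodes u and v
        return ew[(u[1], v[1])] if u[0] == "bot" else ew[(v[1], u[1])]

    alive = set(adj)
    deg_w = {u: sum(edge_w(u, v) for v in adj[u]) for u in alive}
    total = sum(ew.values())
    best, best_density = set(alive), total / len(alive)
    while alive:
        u = min(alive, key=deg_w.get)          # peel the lightest node
        for v in adj[u]:
            if v in alive:
                deg_w[v] -= edge_w(u, v)
        total -= deg_w[u]
        alive.discard(u)
        if alive and total / len(alive) > best_density:
            best, best_density = set(alive), total / len(alive)
    return best

# Toy biadjacency: 3 bots densely star 2 customer repos, and also one
# mainstream repo with 200 organic stargazers (camouflage).
edges = [(f"b{i}", r) for i in (1, 2, 3)
         for r in ("cust1", "cust2", "mainstream")]
edges += [(f"org{j}", "mainstream") for j in range(200)]
dense_block = fraudar_peel(edges)
```

On the toy graph the peeling keeps the bots and the two customer repos and drops the mainstream repo, which is exactly the camouflage-resistant separation the paragraph above describes.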

corpus/recursive_anchors.json — candidates[].cohort_bots × repo_meta.stars · Hooi, Song, Beutel, Shah, Shin, Faloutsos · "FRAUDAR: Bounding Graph Fraud in the Face of Camouflage" · KDD 2016

07 · The conjunction is the trap

Any one axis has a benign explanation. The conjunction does not.

The bet was: each axis alone is noisy enough that a determined operator can defeat it; running them together — graph AND lifecycle AND content AND identity — collapses the false-positive rate because the cost of being independent across all four (separate humans, separate machines, separate writing, separate timezones) is exactly what makes the operation unprofitable. We constructed an evaluation set from the 26,372 anchor stargazers — labeled positives that appear in the corpus, plus 5,000 sampled negatives. Here is what each axis returns alone, and what the conjunction returns.

AXIS · P · R · F1

Three readings.

First: graph_k3 alone is a perfect classifier on the labeled subset. The simplest possible rule — "did this account star three or more of the thirteen v1 anchors?" — achieves precision = recall = 1.000 against the 5,000-account negative pool. The math weak spot is one query.

Second: lifecycle alone is noisy. Cohort A and Cohort B together flag 506 of 5,000 negatives. Useless as a deployable rule. But it conjuncts cleanly with graph: the joint test holds precision = 1.000 with recall = 0.900.

Third — and this is the part that matters operationally: on the 5,932 unlabeled k≥3 candidates from primitive #2-lateral, the OTHER lifecycle bucket collapses with rising k. At k=3, 81% of candidates are OTHER (mostly organic Chinese-AI fans). At k=10, 3.7% are OTHER. At k≥11, one of seventy-five. Virtually every high-confidence bot lands in cohort A or B. The conjunction-collapse is real.
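The conjunction logic itself is just set intersection over per-axis flag sets. A toy sketch with made-up account names, sized to show the pattern (lifecycle noisy alone, exact under conjunction) rather than the report's measured numbers:

```python
def precision_recall(flagged, positives):
    """Precision and recall of a flag set against labeled positives."""
    tp = len(flagged & positives)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(positives)
    return precision, recall

# Toy labels: two bots, three benign accounts that merely share the
# Q4-2021 creation window (lifecycle false positives).
bots = {"b1", "b2"}
graph_k3 = {"b1", "b2"}                     # starred >= 3 anchors
lifecycle = {"b1", "b2", "n1", "n2", "n3"}  # suspicious age x rate

p_alone, r_alone = precision_recall(lifecycle, bots)
p_joint, r_joint = precision_recall(graph_k3 & lifecycle, bots)
```

Intersecting flag sets is also why each axis can stay O(1) per event: the join happens once, over small sets, after the per-event scoring.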

harness/joint_test.py · corpus/joint_eval.json · reports/05_joint_test.md

08 · Top thirty new external anchors

A star-promotion-as-a-service business with seventy-plus customers across the Chinese AI sector.

The pattern across the 563 new external candidates is unambiguous: nearly every customer-side hit is Chinese-AI / Chinese-startup tooling. Authing (auth platform). OpenCSGs (open-source AI hub). wenge-research (YAYI series LLMs). SonicCloudOrg (testing). xcancloud (testing). PandaAI-Tech. ModelEngine-Group. juggleim. iflytek (already known as v1's Customer A). Tencent. The campaign is broadly inflating the Chinese AI tooling ecosystem — not a single-customer engagement. The operators run a star-promotion-as-a-service business with multiple paying customers across the Chinese tech sector.

Below: top thirty by cohort coverage, ranked. Each is a candidate for the v2 disclosure pipeline. None of these are accusations: the cohort coverage is the data; vendor outreach establishes the rest.

Disclosure pipeline tiers. v1 covered six vendors; v2 adds: Tier-1 (k≥150 + identifiable real company) — Authing, OpenCSGs, wenge-research, SonicCloudOrg, xcancloud, PandaAI-Tech, ModelEngine-Group, juggleim. Tier-2 (k≥50) — roughly fifty additional Chinese AI / SaaS companies. Tier-3 (k≥30, individual repos) — case-by-case. Each follows the v1 disclosure ethic: "you may have been promoted by a fraudulent operation; here is the evidence; please contact us for full attribution."
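As a sketch, the tier assignment above reduces to a threshold ladder; the `real_company` flag below stands in for the manual "identifiable real company" check and is an assumption, as is routing a high-k repo without one to Tier-2:

```python
def disclosure_tier(cohort_bots: int, real_company: bool) -> str:
    """Hypothetical rule mirroring the tiering above; cohort_bots is k,
    the number of cohort bots starring the candidate repo."""
    if cohort_bots >= 150 and real_company:
        return "Tier-1"
    if cohort_bots >= 50:
        return "Tier-2"
    if cohort_bots >= 30:
        return "Tier-3"
    return "untiered"
```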

corpus/recursive_anchors.json — sorted by cohort_bots desc · reports/06_recursive_expansion.md — full table with span-by-target and owner-cluster analysis