We said the math could detect them.
We tested it.
v1 hypothesized that a small basis of math primitives (coincidence, graph pathology, content sterility, lifecycle mismatch, identity collision) could surface this kind of farm at platform scale. Twenty-four hours later we ran them against the labeled corpus. One axis is defeated. Three hold. The conjunction blows the campaign footprint open by a factor of forty-four.
From six known bots, three pivots, five hundred and sixty-three new targets.
v1 named six bot accounts and thirteen anchor repos at k≥4. We took those six bots and pulled their User.starredRepositories. The thirteen anchors fell out of the data. Then we took the thirteen anchors and pulled their stargazer lists. Two hundred and eighty-eight accounts, out of twenty-six thousand, hit ten or more of them. Then we took those 288 and pulled their starred repos. Five hundred and sixty-three new external customer/victim repos surfaced.
That's three rounds of bipartite pivot from a six-account seed. Each round used one GraphQL query class. Each pivot collapsed an axis of ambiguity. The math is small. The data was already public. The campaign is forty-four times larger than v1 documented.
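Each round is nothing more than co-occurrence counting over an adjacency map. A minimal sketch of the counting step (toy data; the names are hypothetical, not the repo's actual scripts):

```python
from collections import Counter

def pivot(edges: dict, k: int) -> list:
    """One round of bipartite pivot: given a map from seed node to
    its set of neighbors on the other side of the graph, return every
    neighbor reached by at least k seeds, most-covered first."""
    counts = Counter()
    for neighbors in edges.values():
        counts.update(neighbors)
    return [node for node, c in counts.most_common() if c >= k]

# Round one in miniature: bots -> starred repos, keep repos hit by k >= 4.
bot_stars = {
    "bot1": {"a/anchor", "b/other"},
    "bot2": {"a/anchor", "c/noise"},
    "bot3": {"a/anchor"},
    "bot4": {"a/anchor", "b/other"},
}
print(pivot(bot_stars, k=4))  # ['a/anchor']
```

Swapping which side of the bipartite graph is the seed set gives the other two rounds; the counting never changes.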
research/detector/REPORT.md · research/detector/reports/01..06 · research/detector/corpus/ — 56,119 + 28,598 (repo, stargazer, starredAt) edges, 9 launder-pair blob trees, joint calibration table
Five primitives. One was defeated. Three held. One emerged from the data.
The synthesis behind v1 proposed five irreducible signals: coincidence (tight time-window star bursts), graph pathology (bipartite-core membership), content sterility (low-novelty content), identity collision (key/email reuse), and lifecycle mismatch (account-age × action-rate). Each is O(1) per event with a small index. The bet was that any one axis has a defensible benign explanation, but the conjunction does not.
We ran each one against the corpus. Here is what survived contact with the data.
The defeat. The k=6 cohort hit 53AI/53AIHub over a 152-day window. tigshop/Tigshop: 109 days. yikart/AiToEarn: 96 days. No two bots ever starred the same repo within 60 seconds. Operators have already adapted past the tight-Δt-window primitive that academic literature relies on. Beutel WWW'13 and Pacheco ICWSM'21 don't fire here.
The resurrection. The temporal information didn't die — it migrated. Operators star their own farm repos in tight windows (e.g. tusmart-grouptt/crewrktabletsn — k=4 of the 6 bots starred it within a 20.6-hour burst) and their customer-paid repos over months. Cadence-of-stars discriminates target type. We promote it from background metric to its own axis.
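The promoted axis reduces to one statistic: the smallest time window that covers k of a repo's starredAt timestamps. A minimal sketch (toy timestamps, not corpus data):

```python
from datetime import datetime

def min_window_seconds(starred_at, k):
    """Smallest window (in seconds) covering k of the given starredAt
    timestamps: the burst statistic that separates farm-owned repos
    (hours) from customer-paid repos (months)."""
    if k < 1 or len(starred_at) < k:
        return None
    ts = sorted(datetime.fromisoformat(t.replace("Z", "+00:00"))
                for t in starred_at)
    # Sliding window over the sorted timestamps: the tightest k-run.
    return min((ts[i + k - 1] - ts[i]).total_seconds()
               for i in range(len(ts) - k + 1))

# A k=4 burst inside a single day reads as farm-owned cadence.
burst = ["2023-11-02T01:00:00Z", "2023-11-02T09:30:00Z",
         "2023-11-02T15:00:00Z", "2023-11-02T21:40:00Z"]
print(min_window_seconds(burst, k=4) / 3600)  # ~20.7 hours
```

Run it over a customer repo's stargazer edges and the same statistic comes back in the millions of seconds; the discrimination is the gap between those two regimes.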
copycatch_findings.json · bipartite_core_ge2[].window_seconds · reports/02_copycatch_bipartite.md
Six known bots. Two pivots later, two hundred and eighty-eight.
The graph axis is operationally explosive. Each round of bipartite pivot turns a known seed into a population. Round one: from the 6 v1 bots, fetch every starred repo. Thirteen anchors emerge at k≥4. Round two: from those 13 anchors, fetch every stargazer. Twenty-six thousand stargazers; 288 hit ten or more of the 13. Round three: from those 288, fetch every starred repo. 563 new external repo candidates at k≥10; ranked by cohort coverage; the top two hit by every single one of the 288.
The trick is reachability, not depth. Each pivot uses a single GraphQL endpoint (User.starredRepositories or Repository.stargazers) and a small rate budget. The 288-cohort fetch was 28,598 edges across 3,726 unique repos in roughly twelve minutes of API time. The math weak spot isn't a clever algorithm; it's that the public-API graph is dense enough to walk back to itself in two hops.
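The fetch loop is the same for both endpoints: page a connection until hasNextPage goes false. A sketch of the stargazer side, following GitHub's public GraphQL schema (Repository.stargazers exposes starredAt on the edge); the `post` callable is an assumption standing in for an authenticated HTTP client, so the loop itself is testable offline:

```python
STARGAZERS_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    stargazers(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      edges { starredAt node { login } }
    }
  }
}
"""

def fetch_stargazers(owner, name, post):
    """Walk Repository.stargazers page by page, collecting
    (login, starredAt) per edge. `post` is any callable
    (query, variables) -> parsed-JSON dict, so a real HTTP client
    or a test fake can be plugged in."""
    edges, cursor = [], None
    while True:
        data = post(STARGAZERS_QUERY,
                    {"owner": owner, "name": name, "cursor": cursor})
        conn = data["data"]["repository"]["stargazers"]
        edges += [(e["node"]["login"], e["starredAt"])
                  for e in conn["edges"]]
        if not conn["pageInfo"]["hasNextPage"]:
            return edges
        cursor = conn["pageInfo"]["endCursor"]
```

In real use `post` wraps a POST to api.github.com/graphql with a token; the repo's actual fetch scripts may structure this differently.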
The two universal targets. yikart/AiToEarn (9,773 ★) and tigshop/Tigshop (3,841 ★) are starred by every single one of the 288 bots. Hundred-percent cohort coverage. These are the campaign's headline customers.
scripts/fetch_anchor_stargazers.py · scripts/fetch_cohort_stars.py · primitives/lateral_discovery.py · primitives/recursive_expansion.py · reports/03_lateral_discovery.md · reports/06_recursive_expansion.md
A four-year-old account warehouse, plus fifty-four fresh burners.
Plot createdAt by month for the 288 high-confidence cohort and the lifecycle GMM result writes itself. Two distinct procurement waves. Cohort A (~212 accounts, 74%): created in a tight Q4-2021 + early-2022 window, then dormant for two years before activation against the anchor repos in late-2023. Consistent with credential-stuffing / account-takeover dumps of that era — most of these are real users whose accounts were taken over, not synthetic identities. Cohort B (~54 accounts, 19%): created March-2025 and immediately deployed. Born-then-burst. Pure synthetic burners.
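Once the two waves are located, membership reduces to a date-window rule. A minimal sketch, with the window boundaries read off the description above rather than taken from any fitted model:

```python
from datetime import date

# Approximate boundaries, an assumption from the cohort plot,
# not the GMM's fitted parameters.
COHORT_A = (date(2021, 10, 1), date(2022, 3, 31))  # Q4-2021 + early 2022
COHORT_B = (date(2025, 3, 1), date(2025, 4, 30))   # March-2025 burners

def cohort(created_at: date) -> str:
    if COHORT_A[0] <= created_at <= COHORT_A[1]:
        return "A"      # aged warehouse account, likely taken over
    if COHORT_B[0] <= created_at <= COHORT_B[1]:
        return "B"      # born-then-burst synthetic burner
    return "other"

print(cohort(date(2021, 10, 25)))  # A
print(cohort(date(2025, 3, 30)))   # B
```

The GMM's contribution is finding those boundaries without being told; applying them afterward is this three-line rule.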
A profile-Frankenstein
sherryjwq: created 2021-10-25, hit 12 of 13 anchors. Profile name = "William Brown." Bio in Portuguese. Login is a Chinese romanization. Three independent dump fragments stitched into one account by an attacker who never proofread the result. Under WHOIS the seams are obvious; under the GitHub UI it just looks like a user. Two hundred and twelve such Frankensteins constitute Cohort A.
Algomo7: created 2025-03-30. Zero followers, zero following, no name, no bio, no location. Pure burner, and one of fifty-four activated in the same window. Cohort B.
Implication for disclosure ethics. Cohort A, the 74% majority, consists of real-user victims; T&S takedown should include an account-recovery path. Cohort B, the 19%, can be removed without that consideration. This matches the v1 framing for tusmart-grouptt and countneurooman, but at scale.
corpus/high_confidence_bots.json — split by cohort A / B / other · reports/03_lateral_discovery.md
Two operators. One git tree. Bit-identical content.
Git addresses every blob by its SHA-1: identical content produces an identical hash. Run git ls-tree -r HEAD on two repos, intersect the resulting (path, blob-sha) sets, and the Jaccard between them is exact — not statistical. v1 labeled luliguyu and countneurooman as "likely-compromised-dormant"; the math says otherwise.
Both repos launder the same upstream · Narwallets/narwallets-extension
157 upstream blobs · 154 launder blobs · 0.7278 sha-Jaccard · identical to four decimal places
The full launder-pair table
Nine of the eleven labeled launder pairs from v1 had upstreams we could re-clone (three v1 attributions corrected; see colophon). Of the nine, the blob-SHA Jaccard at threshold (≥0.1 OR per-path-equality ≥0.5) detects seven. The two misses are a stale snapshot just below threshold and a "combined-mode" pair where v1's upstream attribution is wrong on this side.
Reclassify countneurooman. v1 framed this account as "likely-compromised-dormant — if you are the real account holder, contact us." The TWIN LAUNDER finding upgrades the framing: operator-twin-of-luliguyu. The same git tree was published under both accounts in the same campaign window. This is operator-linkage by exact content equality, not by statistical inference.
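The whole measurement reduces to parsing ls-tree output and intersecting (path, blob-sha) pairs. A sketch with the parse kept pure so it runs without cloning anything (function names are illustrative, not the repo's primitives):

```python
def tree_entries(ls_tree_output: str) -> set:
    """Parse `git ls-tree -r HEAD` output into (path, blob_sha) pairs.
    Each line has the form: '<mode> blob <sha>\\t<path>'."""
    entries = set()
    for line in ls_tree_output.splitlines():
        meta, _, path = line.partition("\t")
        mode, objtype, sha = meta.split()
        if objtype == "blob":          # skip submodule/commit entries
            entries.add((path, sha))
    return entries

def sha_jaccard(a: str, b: str) -> float:
    """Exact content overlap: |A ∩ B| / |A ∪ B| over (path, sha)."""
    ea, eb = tree_entries(a), tree_entries(b)
    union = ea | eb
    return len(ea & eb) / len(union) if union else 0.0
```

In practice the two strings come from something like `git -C <clone> ls-tree -r HEAD` via subprocess; because git addresses blobs by content hash, the resulting Jaccard is exact equality counting, not a similarity estimate.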
primitives/content_sterility.py · corpus/content_overlap.json · corpus/clones/Narwallets__narwallets-extension · corpus/clones/luliguyu__ssaavedrad · corpus/clones/countneurooman__ssaavedrad · reports/04_content_sterility.md
The cohort also stars plotly/dash. And facebook/react. That's not a mistake.
Among the 577 candidates the recursive-expansion pivot surfaces, a subset is mathematically improbable as campaign targets: mainstream OSS repos with tens of thousands of organic stargazers. NaiboWang/EasySpider at 43,733 ★. plotly/dash at 24,150 ★. Five repos under the facebook/ org. Three under angular/. Two each under sindresorhus/ and graphql/. The cohort starring these is dilution — camouflage — to defeat naïve "stargazers of obscure repos" detection. This is exactly the design motivation Hooi et al. (KDD 2016) wrote Fraudar against.
The two layers separate cleanly. Customer-side targets cluster small-and-dense in the lower right: a few thousand total stars, and a substantial fraction of those stars come from the cohort. Camouflage clusters huge-and-sparse in the upper region: tens of thousands of stars, of which the cohort contributes a sliver. The split is what discriminates "campaign target" from "popular project the bots also star to look human."
What this unlocks. Fraudar's 1/log(deg+c) edge-weighting deprioritizes mainstream-popular nodes during dense-subgraph peeling. Running it on this 288 × 3,726 biadjacency separates the customer subgraph from the camouflage subgraph algorithmically — no manual sort. A clean Tier-1 / Tier-2 customer ranking falls out of one peeling pass. That is the next algorithmic step.
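A compact sketch of that peeling loop, under stated simplifications (this is a toy restatement of the Fraudar idea, not the paper's implementation): repo-side edges are down-weighted by 1/log(deg + c), then nodes are peeled off in order of minimum weighted degree while the densest snapshot seen is kept.

```python
import math

def peel(edges, c=5.0):
    """Fraudar-style greedy peeling on a bipartite user-repo graph.
    edges: set of (user, repo) pairs; user and repo identifiers are
    assumed disjoint. Edge weight 1/log(repo_degree + c) discounts
    mainstream-popular repos (camouflage). Returns the node set with
    the highest total-edge-weight / size seen while peeling."""
    repo_deg = {}
    for _, r in edges:
        repo_deg[r] = repo_deg.get(r, 0) + 1
    w = {e: 1.0 / math.log(repo_deg[e[1]] + c) for e in edges}

    nodes = {u for u, _ in edges} | {r for _, r in edges}
    live = set(edges)
    best, best_score = set(nodes), sum(w.values()) / len(nodes)
    while len(nodes) > 1:
        deg = dict.fromkeys(nodes, 0.0)     # weighted degrees
        for u, r in live:
            deg[u] += w[(u, r)]
            deg[r] += w[(u, r)]
        victim = min(nodes, key=deg.__getitem__)
        nodes.discard(victim)
        live = {e for e in live if victim not in e}
        score = sum(w[e] for e in live) / len(nodes)
        if score > best_score:
            best, best_score = set(nodes), score
    return best

# Demo: 4 bots densely star 3 customer repos and also one mainstream
# repo (camouflage) that 500 organic users star too.
bots = [f"bot{i}" for i in range(4)]
edges = {(b, f"target/{j}") for b in bots for j in range(3)}
edges |= {(b, "popular/repo") for b in bots}
edges |= {(f"user{i}", "popular/repo") for i in range(500)}
dense = peel(edges)
print(sorted(n for n in dense if n.startswith("target")))
# ['target/0', 'target/1', 'target/2'] -- camouflage peeled away
```

The log-discount is what makes the popular repo cheap to discard: its many edges carry little weight, so the density objective prefers the small bot-customer block over the big camouflage star.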
corpus/recursive_anchors.json — candidates[].cohort_bots × repo_meta.stars · Hooi, Song, Beutel, Shah, Shin, Faloutsos · "FRAUDAR: Bounding Graph Fraud in the Face of Camouflage" · KDD 2016
Any one axis has a benign explanation. The conjunction does not.
The bet was: each axis alone is noisy enough that a determined operator can defeat it; running them together — graph AND lifecycle AND content AND identity — collapses the false-positive rate because the cost of being independent across all four (separate humans, separate machines, separate writing, separate timezones) is exactly what makes the operation unprofitable. We constructed an evaluation set from the 26,372 anchor stargazers — labeled positives that appear in the corpus, plus 5,000 sampled negatives. Here is what each axis returns alone, and what the conjunction returns.
Three readings.
First: graph_k3 alone is a perfect classifier on the labeled subset. The simplest possible rule — "did this account star three or more of the thirteen v1 anchors?" — achieves precision = recall = 1.000 against the 5,000-account negative pool. The math weak spot is one query.
Second: lifecycle alone is noisy. Cohort A and Cohort B together flag 506 of 5,000 negatives. Useless as a deployable rule. But it conjuncts cleanly with graph: the joint test holds precision = 1.000 with recall = 0.900.
Third, and this is the part that matters operationally: on the 5,932 unlabeled k≥3 candidates from the lateral-discovery pivot (primitive #2), the OTHER lifecycle bucket collapses with rising k. At k=3, 81% of candidates are OTHER (mostly organic Chinese-AI fans). At k=10, 3.7% are OTHER. At k≥11, one of seventy-five. Virtually every high-confidence bot lands in Cohort A or B. The conjunction-collapse is real.
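The joint test itself is just composing per-axis flag sets with AND and scoring against the labels. A sketch with hypothetical toy numbers (not the corpus figures): the graph axis is clean but misses one bot, the lifecycle axis catches it but also flags organic accounts, and the conjunction keeps graph's precision.

```python
def precision_recall(flagged, positives):
    """Precision / recall of a flagged set against labeled positives."""
    tp = len(flagged & positives)
    prec = tp / len(flagged) if flagged else 0.0
    rec = tp / len(positives) if positives else 0.0
    return prec, rec

def conjunction(*axes):
    """AND the axes together: an account must fire on every one."""
    return set.intersection(*axes)

positives = {"acct0", "acct1", "acct2"}
graph_axis = {"acct0", "acct1"}                              # precise, incomplete
lifecycle_axis = {"acct0", "acct1", "acct2", "acct50", "acct60"}  # noisy
joint = conjunction(graph_axis, lifecycle_axis)
print(precision_recall(joint, positives))  # (1.0, 0.666...)
```

The asymmetry is the point: a noisy axis ANDed onto a precise one can only remove false positives, never add them, at some cost in recall.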
harness/joint_test.py · corpus/joint_eval.json · reports/05_joint_test.md
A star-promotion-as-a-service business with seventy-plus customers across the Chinese AI sector.
The pattern across the 563 new external candidates is unambiguous: nearly every customer-side hit is Chinese-AI / Chinese-startup tooling. Authing (auth platform). OpenCSGs (open-source AI hub). wenge-research (YAYI series LLMs). SonicCloudOrg (testing). xcancloud (testing). PandaAI-Tech. ModelEngine-Group. juggleim. iflytek (already known as v1's Customer A). Tencent. The campaign is broadly inflating the Chinese AI tooling ecosystem — not a single-customer engagement. The operators run a star-promotion-as-a-service business with multiple paying customers across the Chinese tech sector.
Below: top thirty by cohort coverage, ranked. Each is a candidate for the v2 disclosure pipeline. None of these are accusations; the cohort coverage is the data, vendor outreach establishes the rest.
Disclosure pipeline tiers. v1 covered six vendors; v2 adds: Tier-1 (k≥150 + identifiable real company) — Authing, OpenCSGs, wenge-research, SonicCloudOrg, xcancloud, PandaAI-Tech, ModelEngine-Group, juggleim. Tier-2 (k≥50) — roughly fifty additional Chinese AI / SaaS companies. Tier-3 (k≥30, individual repos) — case-by-case. Each follows the v1 disclosure ethic: "you may have been promoted by a fraudulent operation; here is the evidence; please contact us for full attribution."
corpus/recursive_anchors.json — sorted by cohort_bots desc · reports/06_recursive_expansion.md — full table with span-by-target and owner-cluster analysis