Methodology & data quality
Every data point in The AI Toll database is collected from verified primary sources, classified according to a published taxonomy, and documented with full provenance. This page describes how.
Data collection
The database is populated by automated collection pipelines that harvest data from institutional APIs, government databases, academic repositories, and verified news sources. No data is AI-generated. No data is fabricated. Every entry traces back to a verifiable primary source.
Primary source categories
| Source type | Examples | Collection method |
|---|---|---|
| Academic databases | OpenAlex, PubMed, arXiv, CrossRef, DBLP | REST API queries with domain-specific filters |
| Government databases | NVD (NIST), NHTSA, SEC/EDGAR, FDA, BLS | Official public APIs |
| International organisations | World Bank, WHO, ILO, OECD, V-Dem, SIPRI | Statistical APIs, bulk CSV |
| News aggregators | GDELT, Google News RSS | Filtered by domain keywords, deduplicated |
| Technical and practitioner forums | Reddit, Hacker News, Stack Exchange | Public JSON/search APIs — used for early signal detection, not scored directly in the index |
| Incident databases | AI Incident Database (AIID) | Direct API integration |
| Ecosystem trackers | HuggingFace, GitHub, PyPI | Platform APIs for metadata |
| Consumer protection | CFPB, UK Police Data | Bulk data downloads, keyword filtering |
Quality assurance
Data quality is enforced at every stage of the pipeline:
- 99.8% verifiable source URLs — every entry carries a URL that can be independently verified. Google News redirect URLs have been replaced with direct search links to the source outlet.
- 100% methodology coverage — every entry includes a `methodology` field describing how the data was collected and from which source.
- Source provenance — fields include `source_url`, `source_name`, `date`, `language`, `country`, and `domain`.
- Quarantine system — entries that fail quality checks (missing source URL, invalid date) are moved to quarantine rather than deleted, preserving auditability.
- Deduplication — all scrapers implement URL-based deduplication to prevent duplicate entries.
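The quarantine rule above can be sketched as a small validation step. This is a minimal illustration, not the production pipeline: the field names (`source_url`, `date`) come from the provenance list, but the function names and the quarantine record shape are assumptions.

```python
from datetime import datetime


def validate(entry: dict) -> list[str]:
    """Return a list of failure reasons; an empty list means the entry passes."""
    failures = []
    if not entry.get("source_url"):
        failures.append("missing source URL")
    try:
        # ISO 8601 dates only; anything else counts as an invalid date.
        datetime.fromisoformat(entry.get("date", ""))
    except ValueError:
        failures.append("invalid date")
    return failures


def route(entry: dict, accepted: list, quarantine: list) -> None:
    """Failed entries go to quarantine, not deletion, preserving auditability."""
    failures = validate(entry)
    if failures:
        quarantine.append({**entry, "quarantine_reasons": failures})
    else:
        accepted.append(entry)
```

Keeping the failure reasons on the quarantined record is what makes every exclusion decision auditable later.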
Verification and exclusion criteria
The company applies strict inclusion criteria. Entries are accepted only if they meet all of the following conditions:
- Identifiable provenance — a verifiable source URL from an institutional, academic, governmental, or established media outlet. Entries without traceable origin are quarantined.
- Primary or corroborated source — court filings, regulatory decisions, peer-reviewed papers, and government databases are accepted directly. Media reports are accepted when they reference identifiable events, organisations, or individuals.
- AI relevance — title, summary, or tags must contain at least one AI-related term. Entries from broad categories (environment, housing, sport) that mention no AI connection are excluded automatically.
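The AI-relevance check can be sketched as a keyword filter over title, summary, and tags. The term list below is a hypothetical sample, far smaller than any real taxonomy, and the word-boundary tokenisation is an assumption made to avoid false matches on substrings such as "ai" inside "maintain".

```python
import re

# Hypothetical, illustrative term lists — the real filter is larger.
SINGLE_TERMS = {"ai", "chatbot", "deepfake", "algorithm", "llm"}
PHRASES = ("artificial intelligence", "machine learning", "neural network")


def is_ai_relevant(entry: dict) -> bool:
    """Title, summary, or tags must contain at least one AI-related term."""
    text = " ".join(
        [entry.get("title", ""), entry.get("summary", ""), " ".join(entry.get("tags", []))]
    ).lower()
    words = set(re.findall(r"[a-z]+", text))  # whole words only
    if words & SINGLE_TERMS:
        return True
    return any(phrase in text for phrase in PHRASES)
```

Entries from broad categories that fail this check are excluded automatically, as described above.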
The following are excluded by design:
- Unverified social media posts and anonymous forum threads (Reddit and community discussions are used for signal detection only, never scored in the index)
- Opinion pieces, editorials, and think-pieces without reference to a documented event
- AI-generated or synthetic content
- Entries without identifiable provenance or source URL
- Duplicate coverage of the same event from the same source
- Programming repositories misclassified as incident data
Excluded entries are moved to a quarantine archive, not deleted. Every exclusion decision is logged and auditable.
Incident methodology
A core distinction in the database is between articles and incidents. An article is a single document from a single source. An incident is a real-world event — it may be covered by one article or by dozens.
When a teenager's suicide is linked to an AI chatbot, that single event generates coverage across multiple outlets and languages. Each article is collected and classified individually, but for scoring purposes, they are clustered into a single incident record using title similarity, date proximity, country, and domain matching.
The Toll AI Index uses incident density, not article density. A country is not penalised for having more media coverage of the same event. This prevents media attention from inflating risk scores and ensures that countries with a free, active press are not artificially rated as higher-risk than countries where incidents go unreported.
Each incident record carries:
- A unique `incident_id` shared by all articles covering the same event
- A `cluster_size` indicating how many articles are linked to it
- The highest severity level among its constituent articles
Severity framework
Every incident and regulatory entry is assigned a severity level from 1 to 5 based on documented consequences, not potential harm or media coverage. The scale is anchored on observable, verifiable criteria:
| Level | Label | Observable criterion |
|---|---|---|
| 1 | Informational | No incident. Analysis, research, policy discussion, or industry announcement with no documented harm. |
| 2 | Low | Incident documented but contained or corrected. No formal legal or regulatory consequence. No physical harm. |
| 3 | Medium | Official investigation or inquiry opened, formal complaint filed, or documented harm to specific individuals with measurable impact. |
| 4 | High | Formal legal action filed, regulatory enforcement with fine or ban, or systematic discrimination documented and affecting an identifiable group. |
| 5 | Critical | Documented death, serious physical injury, or mass fundamental rights violation directly caused by or linked to an AI system. |
The threshold between levels is defined by formal action: absence of authority action caps an entry at Level 2; an opened investigation raises it to Level 3; a filed lawsuit or issued fine to Level 4; a documented death or serious injury to Level 5. This escalation ladder is binary, not interpretive.
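The escalation ladder can be expressed as a short decision function. The flag names below are illustrative assumptions; the full decision rules, including regulation-specific scoring, are shared only with institutional reviewers.

```python
def severity_level(facts: dict) -> int:
    """Binary escalation ladder: each rung is a documented fact, not a judgement.

    Flag names are hypothetical stand-ins for the recorded evidence fields.
    """
    if facts.get("death_or_serious_injury"):
        return 5  # Critical: documented death or serious physical injury
    if facts.get("lawsuit_filed") or facts.get("fine_issued"):
        return 4  # High: formal legal action or regulatory enforcement
    if facts.get("investigation_opened") or facts.get("formal_complaint"):
        return 3  # Medium: official inquiry or formal complaint
    if facts.get("documented_incident"):
        return 2  # Low: incident documented but contained, no formal consequence
    return 1      # Informational: no incident
```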
Entries where severity cannot be assessed from available information are left unscored rather than estimated. The full severity framework, including decision rules and regulation-specific scoring, is available to institutional reviewers on request.
Toll AI Index
The Toll AI Index is the company's flagship product — a composite score that measures where AI harm occurs. Each incident is attributed to the country or countries where the impact is documented. Technology origin is tracked separately and does not affect country scores.
The index is built from five dimensions, each of which requires the full dataset to compute:
1. Verified Harm Score (30%)
Counts verified L4 (severe) and L5 (critical/death) incidents per country. L5 incidents score 10 points; L4 incidents score 5. Only entries with high-confidence verification or LLM-based severity assessment are counted. This dimension measures actual documented harm, not media noise.
2. Legal Action Density (25%)
Counts entries involving court cases, legislation, enforcement actions, or documents containing legal keywords (lawsuit, fine, penalty, enforcement, ruling, sanctions). Only entries with severity ≥ 3 qualify. Legal actions are independently verifiable and indicate real consequences.
3. Regulatory Gap (20%)
Measures whether a country's regulatory framework matches its AI risk exposure. Formula: verified incidents divided by regulatory coverage score. A well-regulated country (e.g. EU member states) with incidents sees its score reduced by its regulatory response. Higher score means more problems relative to regulatory capacity.
4. Domain Breadth (15%)
Number of distinct impact domains with at least one moderately verified incident of severity 3 or above. A country with AI problems across many sectors (health, transport, privacy, finance) is more broadly exposed than one with a single-sector issue.
5. Trend Severity (10%)
Compares the count of verified severe incidents in the last 12 months versus the previous 12 months. Captures acceleration of AI harm, not just static totals. A rising trend scores higher.
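The raw scores of the simpler dimensions can be sketched directly from the rules above. The point values and the regulatory-gap ratio are stated in the text; the coverage-score scale, the trend comparison method, and all function names are assumptions.

```python
def verified_harm_raw(incidents: list[dict]) -> int:
    """Dimension 1: L5 incidents score 10 points, L4 incidents score 5."""
    return sum(10 if i["severity"] == 5 else 5
               for i in incidents if i["severity"] >= 4)


def regulatory_gap_raw(verified_incidents: int, regulatory_coverage: float) -> float:
    """Dimension 3: verified incidents divided by regulatory coverage score.

    Higher means more problems relative to regulatory capacity; the scale of
    the coverage score is an assumption here.
    """
    return verified_incidents / regulatory_coverage


def trend_raw(last_12m: int, previous_12m: int) -> float:
    """Dimension 5: severe incidents in the last 12 months vs the previous 12.

    A simple guarded ratio; the published methodology may differ.
    """
    return last_12m / max(previous_12m, 1)
```

Note how the same incident count yields a smaller regulatory gap when coverage is stronger, matching the EU example above.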
Country attribution
The index uses only `countries_impacted` — the countries where harm physically occurs or where people are directly affected. `countries_responsible` (the country where the technology originates) is stored but not used in the index calculation; it powers a separate "AI Origin Risk" view.
If an incident impacts multiple countries, each impacted country receives the full weight of the incident independently.
Each dimension is normalised to a 0–100 scale. The composite score is a weighted average of the five dimensions. Weighting details and full methodology are available to institutional partners and academic reviewers on request.
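The normalisation and weighting step can be sketched as follows. The weights are the published percentages for the five dimensions; the normalisation method (min-max scaling across countries) is an assumption, since the full methodology is available only to institutional partners and academic reviewers.

```python
# Published dimension weights; keys are shorthand labels chosen here.
WEIGHTS = {
    "verified_harm": 0.30,
    "legal_density": 0.25,
    "regulatory_gap": 0.20,
    "domain_breadth": 0.15,
    "trend": 0.10,
}


def normalise(values: dict[str, float]) -> dict[str, float]:
    """Min-max scale one dimension's raw values to 0-100 across countries."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0  # guard against a constant dimension
    return {c: 100.0 * (v - lo) / span for c, v in values.items()}


def composite(raw: dict[str, dict[str, float]]) -> dict[str, float]:
    """raw[dimension][country] -> weighted average of normalised dimensions."""
    scaled = {dim: normalise(vals) for dim, vals in raw.items()}
    countries = next(iter(raw.values())).keys()
    return {c: sum(WEIGHTS[d] * scaled[d][c] for d in WEIGHTS) for c in countries}
```

A country leading every dimension scores 100 because the weights sum to 1.0.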
Score interpretation:
- Below 35 — Low exposure
- 35–60 — Moderate exposure
- Above 60 — High exposure
27-domain taxonomy
Every entry is classified into one of 27 impact domains. The taxonomy covers both direct harms (incidents, accidents) and systemic impacts (workforce displacement, democratic erosion, environmental cost).
| # | Domain | Scope |
|---|---|---|
| 01 | Health | Mental health harms, AI diagnostics errors, surgical robots, algorithmic prescriptions, triage failures |
| 02 | Children and youth | Exposure of minors to harmful content, AI-generated CSAM, age-inappropriate chatbots |
| 03 | Education | Academic fraud, AI-generated plagiarism, assessment integrity, teacher displacement |
| 04 | Employment | Job displacement, algorithmic hiring bias, workplace surveillance, gig economy automation |
| 05 | Creativity | Copyright infringement, generative AI vs human artists, music and image theft |
| 06 | Democracy | Deepfakes in elections, AI-generated propaganda, voter manipulation, political bots |
| 07 | Privacy | Facial recognition, mass surveillance, data harvesting, biometric tracking |
| 08 | Justice and bias | Algorithmic discrimination in sentencing, policing, credit scoring, insurance |
| 09 | Fraud | AI-powered scams, voice cloning, deepfake identity theft, phishing at scale |
| 10 | Cybersecurity | AI-powered cyberattacks, automated vulnerability exploitation, deepfake phishing, adversarial ML |
| 11 | Environment | Energy consumption of data centres, water usage, carbon footprint of training runs |
| 12 | Military | Autonomous weapons, lethal autonomous systems, AI in targeting and surveillance |
| 13 | Sovereignty | National dependence on foreign AI, data colonialism, strategic AI autonomy |
| 14 | Finance | Algorithmic trading failures, AI-driven market manipulation, robo-advisory risks |
| 15 | Enterprise | AI system failures in business operations, hallucination in enterprise tools, vendor lock-in |
| 16 | Science | AI-fabricated research, paper mills, peer review manipulation, reproducibility crisis |
| 17 | Info pollution | AI-generated misinformation, synthetic media flooding, dead internet theory |
| 18 | Vulnerable people | Exploitation of elderly, disabled, and marginalised groups by AI systems |
| 19 | Language | Linguistic homogenisation, low-resource language erasure, translation bias |
| 20 | Sexuality | Non-consensual deepfake pornography, AI companions, exploitation of intimacy |
| 21 | Marketing | Hyper-targeted manipulation, dark patterns, synthetic influencers, deceptive ads |
| 22 | Food and agriculture | AI in precision farming failures, food supply chain disruption, land-use algorithms |
| 23 | Housing | Algorithmic rent pricing, AI-driven gentrification, discriminatory mortgage models |
| 24 | Transport | Autonomous vehicle accidents, AI traffic management failures, aviation automation |
| 25 | Sport | AI in doping detection, algorithmic refereeing, performance prediction ethics |
| 26 | Religion | AI-generated sermons, chatbot spiritual advisors, theological disruption |
| 27 | Human identity | Digital resurrection, grief bots, consciousness debate, human-AI boundaries, post-mortem data rights |
Incident clustering
A single real-world event may generate dozens of articles across multiple outlets and languages. Without deduplication, one incident counted 20 times would inflate a country's score. The index applies incident clustering before scoring:
- High-severity entries (L4+L5): all articles of the same severity level within a ±30 day window are merged into one incident, regardless of language, outlet, or domain. This ensures that a school bombing covered by 20 outlets in 15 languages counts as one incident, not twenty.
- Lower-severity entries (L1–L3): articles are clustered by shared key terms (3+ common terms within ±14 days). This rule is less aggressive because lower-severity entries cover a wider range of distinct events.
Clustering reduces the total entry count by approximately 85%, primarily affecting countries with extensive multilingual media coverage of the same events.
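The two-tier merge rule can be sketched as a pairwise decision, assuming articles already carry a severity level, a date, and extracted key terms. This is an illustration of the stated thresholds only: field names are hypothetical, and real clustering builds transitive clusters rather than comparing one pair at a time.

```python
from datetime import date


def same_incident(a: dict, b: dict) -> bool:
    """Decide whether two same-severity articles describe the same incident."""
    if a["severity"] != b["severity"]:
        return False
    gap_days = abs((a["date"] - b["date"]).days)
    if a["severity"] >= 4:
        # High severity (L4+L5): same level within ±30 days merges,
        # regardless of language, outlet, or domain.
        return gap_days <= 30
    # Lower severity (L1-L3): require 3+ shared key terms within ±14 days.
    shared = set(a["key_terms"]) & set(b["key_terms"])
    return gap_days <= 14 and len(shared) >= 3
```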
Known limitations
Incident documentation density varies by country and is influenced by media landscape, language coverage, and public reporting norms. The Toll AI Index reflects documented exposure based on available sources.
English-language sources remain overrepresented. The database now covers 17 languages, but coverage of incidents in Chinese, Russian, Japanese, and Korean remains uneven compared to Western European languages.
US dominance caveat: the United States consistently scores significantly higher than all other countries. This gap partly reflects a genuine concentration of AI deployment and litigation in the US, but is amplified by the dominance of English-language American sources in the scraping pipeline. The absolute gap should not be interpreted as proportional real-world risk difference — it reflects documentation density.
Regulatory data for jurisdictions in Sub-Saharan Africa and Central Asia is sparse, reflecting limited public availability rather than absence of activity. Some data sources carry an inherent reporting lag of days to weeks, particularly court filings and regulatory decisions. Community signals (forums, practitioner discussions) are used for early detection only and are not scored directly in the Toll AI Index.
Academic exclusion: entries classified as academic papers or preprints are excluded from all index scoring. They are retained in the database for research purposes but do not contribute to any dimension of the Toll AI Index.
Update frequency
The raw database is updated daily through automated pipelines. The Toll AI Index is recalculated quarterly, with the next publication scheduled for Q2 2026. Incident alerts and regulatory updates are processed within 24 hours of source publication.
About the company
The AI Toll is an independent company created to document, structure, and make accessible the full scope of artificial intelligence's impact on society. The project emerged from the recognition that while AI's benefits receive extensive coverage, its costs, risks, and harms are fragmented across thousands of sources in dozens of languages.
The AI Toll is not funded by any technology company. The data is collected, structured, and maintained independently. Academic partnerships are available on request. The methodology is documented and available for review.
For enquiries: contact@theaitoll.com