Methodology & data quality
Every data point in The AI Toll database is collected from verified primary sources, classified according to a published taxonomy, and documented with full provenance. This page describes how.
Data collection
The database is populated by automated collection pipelines that harvest data from institutional APIs, government databases, academic repositories, and verified news sources. No data is AI-generated. No data is fabricated. Every entry traces back to a verifiable primary source.
Primary source categories
| Source type | Examples | Collection method |
|---|---|---|
| Academic databases | OpenAlex, PubMed, arXiv, CrossRef, DBLP | REST API queries with domain-specific filters |
| Government databases | NVD (NIST), NHTSA, SEC/EDGAR, FDA, BLS | Official public APIs |
| International organisations | World Bank, WHO, ILO, OECD, V-Dem, SIPRI | Statistical APIs, bulk CSV |
| News aggregators | GDELT, Google News RSS | Filtered by domain keywords, deduplicated |
| Technical and practitioner forums | Reddit, Hacker News, Stack Exchange | Public JSON/search APIs — used for early signal detection, not scored directly in the index |
| Incident databases | AI Incident Database (AIID) | Direct API integration |
| Ecosystem trackers | HuggingFace, GitHub, PyPI | Platform APIs for metadata |
| Consumer protection | CFPB, UK Police Data | Bulk data downloads, keyword filtering |
Quality assurance
Data quality is enforced at every stage of the pipeline:
- 99.8% verifiable source URLs — every entry carries a URL that can be independently verified. Google News redirect URLs have been replaced with direct search links to the source outlet.
- 100% methodology coverage — every entry includes a `methodology` field describing how the data was collected and from which source.
- Source provenance — fields include `source_url`, `source_name`, `date`, `language`, `country`, and `domain`.
- Quarantine system — entries that fail quality checks (missing source URL, invalid date) are moved to quarantine rather than deleted, preserving auditability.
- Deduplication — all scrapers implement URL-based deduplication to prevent duplicate entries.
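The quarantine rule above can be sketched as a small validation step. This is a minimal illustration, not the production pipeline: the field names (`source_url`, `date`) come from the provenance list, but the function names and the quarantine record shape are assumptions.

```python
from datetime import datetime


def validate(entry: dict) -> list[str]:
    """Return a list of failure reasons; an empty list means the entry passes."""
    failures = []
    if not entry.get("source_url"):
        failures.append("missing source URL")
    try:
        # ISO 8601 dates only; anything else counts as an invalid date.
        datetime.fromisoformat(entry.get("date", ""))
    except ValueError:
        failures.append("invalid date")
    return failures


def route(entry: dict, accepted: list, quarantine: list) -> None:
    """Failed entries go to quarantine, not deletion, preserving auditability."""
    failures = validate(entry)
    if failures:
        quarantine.append({**entry, "quarantine_reasons": failures})
    else:
        accepted.append(entry)
```

Keeping the failure reasons on the quarantined record is what makes every exclusion decision auditable later.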
Verification and exclusion criteria
The company applies strict inclusion criteria. Entries are accepted only if they meet all of the following conditions:
- Identifiable provenance — a verifiable source URL from an institutional, academic, governmental, or established media outlet. Entries without traceable origin are quarantined.
- Primary or corroborated source — court filings, regulatory decisions, peer-reviewed papers, and government databases are accepted directly. Media reports are accepted when they reference identifiable events, organisations, or individuals.
- AI relevance — title, summary, or tags must contain at least one AI-related term. Entries from broad categories (environment, housing, sport) that mention no AI connection are excluded automatically.
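The AI-relevance check can be sketched as a keyword filter over title, summary, and tags. The term list below is a hypothetical sample, far smaller than any real taxonomy, and the word-boundary tokenisation is an assumption made to avoid false matches on substrings such as "ai" inside "maintain".

```python
import re

# Hypothetical, illustrative term lists — the real filter is larger.
SINGLE_TERMS = {"ai", "chatbot", "deepfake", "algorithm", "llm"}
PHRASES = ("artificial intelligence", "machine learning", "neural network")


def is_ai_relevant(entry: dict) -> bool:
    """Title, summary, or tags must contain at least one AI-related term."""
    text = " ".join(
        [entry.get("title", ""), entry.get("summary", ""), " ".join(entry.get("tags", []))]
    ).lower()
    words = set(re.findall(r"[a-z]+", text))  # whole words only
    if words & SINGLE_TERMS:
        return True
    return any(phrase in text for phrase in PHRASES)
```

Entries from broad categories that fail this check are excluded automatically, as described above.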
The following are excluded by design:
- Unverified social media posts and anonymous forum threads (Reddit and community discussions are used for signal detection only, never scored in the index)
- Opinion pieces, editorials, and think-pieces without reference to a documented event
- AI-generated or synthetic content
- Entries without identifiable provenance or source URL
- Duplicate coverage of the same event from the same source
- Programming repositories misclassified as incident data
Excluded entries are moved to a quarantine archive, not deleted. Every exclusion decision is logged and auditable.
Incident methodology
A core distinction in the database is between articles and incidents. An article is a single document from a single source. An incident is a real-world event — it may be covered by one article or by dozens.
When a teenager's suicide is linked to an AI chatbot, that single event generates coverage across multiple outlets and languages. Each article is collected and classified individually, but for scoring purposes, they are clustered into a single incident record using title similarity, date proximity, country, and domain matching.
The Toll AI Index uses incident density, not article density. A country is not penalised for having more media coverage of the same event. This prevents media attention from inflating risk scores and ensures that countries with a free, active press are not artificially rated as higher-risk than countries where incidents go unreported.
Each incident record carries:
- A unique `incident_id` shared by all articles covering the same event
- A `cluster_size` indicating how many articles are linked to it
- The highest severity level among its constituent articles
Severity framework
Every incident and regulatory entry is assigned a severity level from 1 to 5 based on documented consequences, not potential harm or media coverage. The scale is anchored on observable, verifiable criteria:
| Level | Label | Observable criterion |
|---|---|---|
| 1 | Informational | No incident. Analysis, research, policy discussion, or industry announcement with no documented harm. |
| 2 | Low | Incident documented but contained or corrected. No formal legal or regulatory consequence. No physical harm. |
| 3 | Medium | Official investigation or inquiry opened, formal complaint filed, or documented harm to specific individuals with measurable impact. |
| 4 | High | Formal legal action filed, regulatory enforcement with fine or ban, or systematic discrimination documented and affecting an identifiable group. |
| 5 | Critical | Documented death, serious physical injury, or mass fundamental rights violation directly caused by or linked to an AI system. |
The threshold between levels is defined by formal action: absence of authority action caps an entry at Level 2; an opened investigation raises it to Level 3; a filed lawsuit or issued fine to Level 4; a documented death or serious injury to Level 5. This escalation ladder is binary, not interpretive.
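The escalation ladder can be expressed as a short decision function. The flag names below are illustrative assumptions; the full decision rules, including regulation-specific scoring, are shared only with institutional reviewers.

```python
def severity_level(facts: dict) -> int:
    """Binary escalation ladder: each rung is a documented fact, not a judgement.

    Flag names are hypothetical stand-ins for the recorded evidence fields.
    """
    if facts.get("death_or_serious_injury"):
        return 5  # Critical: documented death or serious physical injury
    if facts.get("lawsuit_filed") or facts.get("fine_issued"):
        return 4  # High: formal legal action or regulatory enforcement
    if facts.get("investigation_opened") or facts.get("formal_complaint"):
        return 3  # Medium: official inquiry or formal complaint
    if facts.get("documented_incident"):
        return 2  # Low: incident documented but contained, no formal consequence
    return 1      # Informational: no incident
```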
Entries where severity cannot be assessed from available information are left unscored rather than estimated. The full severity framework, including decision rules and regulation-specific scoring, is available to institutional reviewers on request.
Toll AI Index
The Toll AI Index is the company's flagship product — a composite score that measures where AI harm occurs. Each incident is attributed to the country or countries where the impact is documented. Technology origin is tracked separately and does not affect country scores.
The index is built from five dimensions, each of which requires the full dataset to compute:
1. Verified Harm Score (30%)
Counts verified L4 (severe) and L5 (critical/death) incidents per country. L5 incidents score 10 points; L4 incidents score 5. Only entries with high-confidence verification or LLM-based severity assessment are counted. This dimension measures actual documented harm, not media noise.
2. Legal Action Density (25%)
Counts entries involving court cases, legislation, enforcement actions, or documents containing legal keywords (lawsuit, fine, penalty, enforcement, ruling, sanctions). Only entries with severity ≥ 3 qualify. Legal actions are independently verifiable and indicate real consequences.
3. Regulatory Gap (20%)
Measures whether a country's regulatory framework matches its AI risk exposure. Formula: verified incidents divided by regulatory coverage score. A well-regulated country (e.g. EU member states) with incidents sees its score reduced by its regulatory response. Higher score means more problems relative to regulatory capacity.
4. Domain Breadth (15%)
Number of distinct impact domains with at least one moderately verified incident of severity 3 or above. A country with AI problems across many sectors (health, transport, privacy, finance) is more broadly exposed than one with a single-sector issue.
5. Trend Severity (10%)
Compares the count of verified severe incidents in the last 12 months versus the previous 12 months. Captures acceleration of AI harm, not just static totals. A rising trend scores higher.
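The raw scores of the simpler dimensions can be sketched directly from the rules above. The point values and the regulatory-gap ratio are stated in the text; the coverage-score scale, the trend comparison method, and all function names are assumptions.

```python
def verified_harm_raw(incidents: list[dict]) -> int:
    """Dimension 1: L5 incidents score 10 points, L4 incidents score 5."""
    return sum(10 if i["severity"] == 5 else 5
               for i in incidents if i["severity"] >= 4)


def regulatory_gap_raw(verified_incidents: int, regulatory_coverage: float) -> float:
    """Dimension 3: verified incidents divided by regulatory coverage score.

    Higher means more problems relative to regulatory capacity; the scale of
    the coverage score is an assumption here.
    """
    return verified_incidents / regulatory_coverage


def trend_raw(last_12m: int, previous_12m: int) -> float:
    """Dimension 5: severe incidents in the last 12 months vs the previous 12.

    A simple guarded ratio; the published methodology may differ.
    """
    return last_12m / max(previous_12m, 1)
```

Note how the same incident count yields a smaller regulatory gap when coverage is stronger, matching the EU example above.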
Country attribution
The index uses only `countries_impacted` — the countries where harm physically occurs or where people are directly affected. `countries_responsible` (the country where the technology originates) is stored but not used in the index calculation; it powers a separate "AI Origin Risk" view.
If an incident impacts multiple countries, each impacted country receives the full weight of the incident independently.
Each dimension is normalised to a 0–100 scale. The composite score is a weighted average of the five dimensions. Weighting details and full methodology are available to institutional partners and academic reviewers on request.
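The normalisation and weighting step can be sketched as follows. The weights are the published percentages for the five dimensions; the normalisation method (min-max scaling across countries) is an assumption, since the full methodology is available only to institutional partners and academic reviewers.

```python
# Published dimension weights; keys are shorthand labels chosen here.
WEIGHTS = {
    "verified_harm": 0.30,
    "legal_density": 0.25,
    "regulatory_gap": 0.20,
    "domain_breadth": 0.15,
    "trend": 0.10,
}


def normalise(values: dict[str, float]) -> dict[str, float]:
    """Min-max scale one dimension's raw values to 0-100 across countries."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0  # guard against a constant dimension
    return {c: 100.0 * (v - lo) / span for c, v in values.items()}


def composite(raw: dict[str, dict[str, float]]) -> dict[str, float]:
    """raw[dimension][country] -> weighted average of normalised dimensions."""
    scaled = {dim: normalise(vals) for dim, vals in raw.items()}
    countries = next(iter(raw.values())).keys()
    return {c: sum(WEIGHTS[d] * scaled[d][c] for d in WEIGHTS) for c in countries}
```

A country leading every dimension scores 100 because the weights sum to 1.0.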
Score interpretation:
- Below 35 — Low exposure
- 35–60 — Moderate exposure
- Above 60 — High exposure
27-domain taxonomy
Every entry is classified into one of 27 impact domains. The taxonomy covers both direct harms (incidents, accidents) and systemic impacts (workforce displacement, democratic erosion, environmental cost).
| # | Domain | Scope |
|---|---|---|
| 01 | Health | Mental health harms, AI diagnostics errors, surgical robots, algorithmic prescriptions, triage failures |
| 02 | Children and youth | Exposure of minors to harmful content, AI-generated CSAM, age-inappropriate chatbots |
| 03 | Education | Academic fraud, AI-generated plagiarism, assessment integrity, teacher displacement |
| 04 | Employment | Job displacement, algorithmic hiring bias, workplace surveillance, gig economy automation |
| 05 | Creativity | Copyright infringement, generative AI vs human artists, music and image theft |
| 06 | Democracy | Deepfakes in elections, AI-generated propaganda, voter manipulation, political bots |
| 07 | Privacy | Facial recognition, mass surveillance, data harvesting, biometric tracking |
| 08 | Justice and bias | Algorithmic discrimination in sentencing, policing, credit scoring, insurance |
| 09 | Fraud | AI-powered scams, voice cloning, deepfake identity theft, phishing at scale |
| 10 | Cybersecurity | AI-powered cyberattacks, automated vulnerability exploitation, deepfake phishing, adversarial ML |
| 11 | Environment | Energy consumption of data centres, water usage, carbon footprint of training runs |
| 12 | Military | Autonomous weapons, lethal autonomous systems, AI in targeting and surveillance |
| 13 | Sovereignty | National dependence on foreign AI, data colonialism, strategic AI autonomy |
| 14 | Finance | Algorithmic trading failures, AI-driven market manipulation, robo-advisory risks |
| 15 | Enterprise | AI system failures in business operations, hallucination in enterprise tools, vendor lock-in |
| 16 | Science | AI-fabricated research, paper mills, peer review manipulation, reproducibility crisis |
| 17 | Info pollution | AI-generated misinformation, synthetic media flooding, dead internet theory |
| 18 | Vulnerable people | Exploitation of elderly, disabled, and marginalised groups by AI systems |
| 19 | Language | Linguistic homogenisation, low-resource language erasure, translation bias |
| 20 | Sexuality | Non-consensual deepfake pornography, AI companions, exploitation of intimacy |
| 21 | Marketing | Hyper-targeted manipulation, dark patterns, synthetic influencers, deceptive ads |
| 22 | Food and agriculture | AI in precision farming failures, food supply chain disruption, land-use algorithms |
| 23 | Housing | Algorithmic rent pricing, AI-driven gentrification, discriminatory mortgage models |
| 24 | Transport | Autonomous vehicle accidents, AI traffic management failures, aviation automation |
| 25 | Sport | AI in doping detection, algorithmic refereeing, performance prediction ethics |
| 26 | Religion | AI-generated sermons, chatbot spiritual advisors, theological disruption |
| 27 | Human identity | Digital resurrection, grief bots, consciousness debate, human-AI boundaries, post-mortem data rights |
Incident clustering
A single real-world event may generate dozens of articles across multiple outlets and languages. Without deduplication, one incident counted 20 times would inflate a country's score. The index applies incident clustering before scoring:
- High-severity entries (L4+L5): all articles of the same severity level within a ±30 day window are merged into one incident, regardless of language, outlet, or domain. This ensures that a school bombing covered by 20 outlets in 15 languages counts as one incident, not twenty.
- Lower-severity entries (L1–L3): articles are clustered by shared key terms (3+ common terms within ±14 days). This rule is less aggressive because lower-severity entries cover a wider range of distinct events.
Clustering reduces the total entry count by approximately 85%, primarily affecting countries with extensive multilingual media coverage of the same events.
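The two-tier merge rule can be sketched as a pairwise decision, assuming articles already carry a severity level, a date, and extracted key terms. This is an illustration of the stated thresholds only: field names are hypothetical, and real clustering builds transitive clusters rather than comparing one pair at a time.

```python
from datetime import date


def same_incident(a: dict, b: dict) -> bool:
    """Decide whether two same-severity articles describe the same incident."""
    if a["severity"] != b["severity"]:
        return False
    gap_days = abs((a["date"] - b["date"]).days)
    if a["severity"] >= 4:
        # High severity (L4+L5): same level within ±30 days merges,
        # regardless of language, outlet, or domain.
        return gap_days <= 30
    # Lower severity (L1-L3): require 3+ shared key terms within ±14 days.
    shared = set(a["key_terms"]) & set(b["key_terms"])
    return gap_days <= 14 and len(shared) >= 3
```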
Known limitations
Incident documentation density varies by country and is influenced by media landscape, language coverage, and public reporting norms. The Toll AI Index reflects documented exposure based on available sources.
English-language sources remain overrepresented. The database now covers 17 languages, but coverage of incidents in Chinese, Russian, Japanese, and Korean remains uneven compared to Western European languages.
US dominance caveat: the United States consistently scores significantly higher than all other countries. This gap partly reflects a genuine concentration of AI deployment and litigation in the US, but is amplified by the dominance of English-language American sources in the scraping pipeline. The absolute gap should not be interpreted as proportional real-world risk difference — it reflects documentation density.
Regulatory data for jurisdictions in Sub-Saharan Africa and Central Asia is sparse, reflecting limited public availability rather than absence of activity. Some data sources carry an inherent reporting lag of days to weeks, particularly court filings and regulatory decisions. Community signals (forums, practitioner discussions) are used for early detection only and are not scored directly in the Toll AI Index.
Academic exclusion: entries classified as academic papers or preprints are excluded from all index scoring. They are retained in the database for research purposes but do not contribute to any dimension of the Toll AI Index.
Update frequency
The raw database is updated daily through automated pipelines. The Toll AI Index is recalculated quarterly, with the next publication scheduled for Q2 2026. Incident alerts and regulatory updates are processed within 24 hours of source publication.
About the company
The AI Toll is an independent company created to document, structure, and make accessible the full scope of artificial intelligence's impact on society. The project emerged from the recognition that while AI's benefits receive extensive coverage, its costs, risks, and harms are fragmented across thousands of sources in dozens of languages.
The AI Toll is not funded by any technology company. The data is collected, structured, and maintained independently. Academic partnerships are available on request. The methodology is documented and available for review.
For enquiries: contact@theaitoll.com