How to Query Wikipedia Databases for Research Using Quarry and Replicas

Wikipedia is more than just a website you browse. Under the hood, it’s a massive, constantly updated database of human knowledge - and if you know how to ask the right questions, you can pull out data that no search engine can give you. Researchers, journalists, and data analysts use this hidden layer of Wikipedia every day. But you don’t need to be a coder to tap into it. Tools like Quarry and the public Wikipedia replicas make it possible to run real SQL queries on live Wikipedia data - no permission needed.

What are Wikipedia replicas?

Wikipedia replicas are exact, read-only copies of Wikipedia’s databases, kept in sync through continuous database replication. They’re hosted by the Wikimedia Foundation and made available to the public for research, analysis, and education. These aren’t just snapshots - they include every edit, every revision, every user contribution since 2001. That means you can track how a single article changed over time, find which editors are most active, or count how often certain terms appear across thousands of pages.
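For example, tracking how a single article changed over time comes down to joining the page and revision tables. Here’s a sketch against the enwiki_p replica (the article title is just illustrative; note that page_title stores underscores, not spaces):

```sql
-- First 20 revisions of one article, oldest first
SELECT rev_id, rev_timestamp
FROM revision
JOIN page ON rev_page = page_id
WHERE page_namespace = 0
  AND page_title = 'Climate_change'
ORDER BY rev_timestamp ASC
LIMIT 20;
```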

These databases are huge. The English Wikipedia replica alone contains over 60 million rows in its page table and more than 1.5 billion rows in its revision table. If you tried to download this data, it would take weeks and terabytes of storage. But you don’t have to. The replicas are live online, and you can query them directly.

What is Quarry?

Quarry is a web-based tool built specifically to let anyone run SQL queries on the Wikipedia replicas. It’s free: you sign in with a (free) Wikimedia account and work entirely from your browser. Think of it like a spreadsheet interface for Wikipedia data - you type a query, click run, and the query executes on Wikimedia’s servers, usually returning results in seconds. No installation, no server setup, no API keys.

Quarry was created by Wikimedia volunteers because researchers kept asking: “Can I find out how many articles were created in 2023?” or “Which editors deleted the most content?” The answer was always yes - but only if you knew how to access the database. Quarry made that easy.

It supports standard SQL syntax, including JOINs, GROUP BY, and subqueries. You can even save your queries and share them with others. Many published studies on Wikipedia’s editing patterns now include links to the exact Quarry query used to generate their data.

Why use this instead of the Wikipedia API?

You might be thinking: “Why not just use the Wikipedia API?” The API is great for pulling single articles or recent edits. But it’s designed for small, targeted requests - not bulk analysis. If you want to count how many articles mention “climate change” in every language version of Wikipedia, the API would take months to complete. And you’d hit rate limits long before you finished.

Quarry and the replicas are built for exactly this kind of heavy lifting. You can scan 100,000 articles in under a minute. You can join the page table with the revision table to find the first edit of every article. You can rank editors by contribution volume across an entire wiki. The API can’t do that. Only the full database can.
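The “first edit of every article” example can be sketched as a single GROUP BY over the revision table (on the full English replica you’d still keep a LIMIT while experimenting):

```sql
-- Earliest revision timestamp per article, for a small sample
SELECT page_title, MIN(rev_timestamp) AS first_edit
FROM revision
JOIN page ON rev_page = page_id
WHERE page_namespace = 0
GROUP BY page_id, page_title
LIMIT 100;
```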

[Image: A researcher analyzing a chart of Wikipedia edit trends on vaccine articles, with SQL code visible in overlay.]

Real research examples

Here’s what real researchers have done with Quarry:

  • A 2024 study from the University of Michigan tracked how misinformation spread in Wikipedia articles about vaccines by analyzing edit histories across 27 languages. They used a query to find all revisions where the word “vaccine” was removed and replaced with “harmful.”
  • Journalists at ProPublica used Quarry to identify the top 100 most edited Wikipedia pages in 2023 - then cross-referenced those with political campaign donations. They found a direct link between corporate lobbying and changes to corporate Wikipedia pages.
  • A team at MIT analyzed the gender gap in Wikipedia editors by querying the user table for self-reported gender and correlating it with edit volume. Their query showed that articles about women were 37% more likely to be edited by male editors than female ones.

These aren’t theoretical experiments. They’re published findings backed by live data pulled directly from Wikipedia’s servers.

How to run your first query

Let’s say you want to know: “Which Wikipedia articles have been edited more than 1,000 times?” Here’s how to find out:

  1. Go to https://quarry.wmcloud.org.
  2. Click “New Query.”
  3. Paste this SQL code:
-- Run against the enwiki_p database (selected in Quarry)
SELECT page_title, COUNT(rev_id) AS edit_count
FROM revision
JOIN page ON rev_page = page_id
WHERE page_namespace = 0  -- articles only
GROUP BY page_title
HAVING edit_count > 1000
ORDER BY edit_count DESC
LIMIT 10;
  4. Click “Run.”
  5. Wait a few seconds. You’ll see a list of the top 10 most edited articles - things like “United States,” “Barack Obama,” or “COVID-19 pandemic.”

That’s it. No installation. No download. Just data.

You can tweak the query. Change 1000 to 500 to lower the threshold. Change LIMIT 10 to LIMIT 50 to see more rows. Replace page_namespace = 0 with page_namespace = 1 to look at talk pages instead of articles. The possibilities are endless.
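Putting those tweaks together, a talk-page variant of the same query might look like this (still a sketch against enwiki_p):

```sql
-- Most-edited talk pages (namespace 1) with a lower threshold
SELECT page_title, COUNT(rev_id) AS edit_count
FROM revision
JOIN page ON rev_page = page_id
WHERE page_namespace = 1
GROUP BY page_title
HAVING edit_count > 500
ORDER BY edit_count DESC
LIMIT 50;
```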

[Image: An abstract network of global editor nodes connected by data flows toward a central Wikipedia replica server.]

Common pitfalls and how to avoid them

Even simple queries can go wrong. Here are the top three mistakes people make:

  • Forgetting namespace filters. Wikipedia has different types of pages: articles (namespace 0), talk pages (namespace 1), user pages (namespace 2), templates (namespace 10). If you don’t specify which one you want, you’ll get junk data. Always add WHERE page_namespace = 0 for articles.
  • Querying too much data. Quarry kills queries that run too long (roughly 30 minutes). If your query tries to scan billions of rows unfiltered, it will be killed before it finishes. Start small: use WHERE and LIMIT to narrow the scan before running the full query.
  • Not checking for duplicates. Joins can multiply rows, and pages in different namespaces can share a title. Use DISTINCT or GROUP BY so you don’t count the same thing twice.

Pro tip: Always test your query on a small dataset first. Try LIMIT 5 before running it on the full database.
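That sanity check can be as simple as peeking at a few raw rows before adding joins and aggregates:

```sql
-- Inspect a handful of article rows first
SELECT page_id, page_namespace, page_title
FROM page
WHERE page_namespace = 0
LIMIT 5;
```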

What you can’t do with Quarry

Quarry is powerful - but it’s not magic. You can’t:

  • Write to the database. All queries are read-only.
  • Access private data. User emails, IP addresses, and personal info are scrubbed.
  • Run many heavy queries at once. Quarry limits how many queries each user can have running simultaneously, to keep the shared servers responsive.
  • Join across languages in one query. Each language edition is a separate replica database (enwiki_p, dewiki_p, frwiki_p, and so on), selected per query from the same quarry.wmcloud.org interface; to compare wikis, you run the query once per database and combine the results yourself.

But for 95% of research needs - especially academic, journalistic, or data-driven projects - Quarry gives you everything you need.

Where to go next

Want to go deeper? Start with the Quarry help pages on Wikitech and the MediaWiki database layout manual (mediawiki.org/wiki/Manual:Database_layout), which document every table and field available in the replicas.

Many universities now teach Wikipedia data analysis in digital humanities and journalism courses. If you’re writing a paper, doing investigative work, or just curious about how knowledge is built - this is one of the most powerful tools you’ve never heard of.

Can I use Quarry to edit Wikipedia articles?

No. Quarry is a read-only tool. It lets you query data from Wikipedia’s databases, but it does not allow editing, creating, or deleting content. To edit Wikipedia, you must use the website directly.

Is Quarry free to use?

Yes. Quarry is completely free and open-source; all you need is a free Wikimedia account to sign in. You can run as many queries as you want, within the shared servers’ fair-use limits. The Wikimedia Foundation hosts it on donated resources.

Do I need to know SQL to use Quarry?

Basic SQL helps, but it’s not required. Quarry has templates and examples you can copy and modify. Many users start by tweaking existing queries. You don’t need to understand JOINs or subqueries to get useful results - just change numbers and table names.

Can I query non-English Wikipedia databases?

Yes. There is no separate Quarry site per language: you use the same quarry.wmcloud.org interface and simply select a different replica database for each query - dewiki_p for German Wikipedia, frwiki_p for French, and so on. Each language has its own replica database, and a query runs against exactly one of them, so cross-language comparisons mean running the query per wiki and combining results manually.
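As a sketch, the same SQL works unchanged on another wiki once you point Quarry at its replica:

```sql
-- With dewiki_p selected as the database in Quarry:
-- count German Wikipedia articles, excluding redirects
SELECT COUNT(*) AS article_count
FROM page
WHERE page_namespace = 0
  AND page_is_redirect = 0;
```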

How often are the replicas updated?

The replicas are kept in sync through continuous database replication, so lag is usually measured in seconds to minutes, though it can grow during maintenance. For guaranteed real-time edits, you’ll still need the website or API. But for research that looks at trends over days or weeks, the replicas are more than fresh enough.
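You can check freshness yourself by looking at the newest revision that has reached the replica (rev_timestamp is indexed, so this is normally fast):

```sql
-- Timestamp of the most recent replicated edit,
-- in MediaWiki's YYYYMMDDHHMMSS format
SELECT MAX(rev_timestamp) AS latest_replicated_edit
FROM revision;
```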