Best Practices
This page gathers best practices on how to use the Wikimedia Enterprise APIs. Following these best practices will ensure you:
- Make the fewest API calls possible to get the data you need, which is helpful if you’re on a free plan with a monthly request quota.
- Reduce the latency of Enterprise API responses.
Create an account and go through our Getting Started documentation first. Once you’ve Authenticated, store your access_token and refresh_token in a safe place, like an .env file or other local dotfile.
Access tokens are valid for 24 hours. Before your access token expires, use the refresh_token to invoke Refresh Token and get a new access_token for the next 24 hours. Your refresh_token expires after 90 days. Don’t use the Login method every 24 hours to get a new access_token; ideally, you would use the Login method only once every 90 days.
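As a sketch, assuming the authentication endpoints described in the Getting Started documentation live at auth.enterprise.wikimedia.com (the exact host, path, and body fields below should be verified there), the daily refresh call looks roughly like this:
# Sketch only: host, path, and body fields are assumptions; check the Getting Started docs.
curl --location 'https://auth.enterprise.wikimedia.com/v1/token-refresh' \
--header 'Content-Type: application/json' \
--data '{
"username": "YOUR_USERNAME",
"refresh_token": "REFRESH_TOKEN"
}'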
Use fields and filters to get only the wiki projects, languages, and data you need in your API response. This keeps responses smaller, makes requests faster, and can save bandwidth and lower costs for high-volume use.
Use the following utility endpoints to find the correct project and language codes for your use case:
- Find the projects you want data for by requesting GET Available Project Codes, which will give you e.g. the identifier wiki for Wikipedia projects.
- Request GET Available Languages to get the correct codes for the language(s) you need, e.g. en for English or de for German. Please note that the language codes for wiki projects don’t all follow a specific ISO standard or other language code convention. Language codes can have two or three characters (e.g. chy for Cheyenne).
- To find all project codes for a specific language, request POST Available Projects and use a filter to only receive project codes for that language (a full cURL example follows after this list).
Example request body for all project codes with the language code de for German:
{
"filters": [
{
"field": "in_language.identifier", "value": "de"
}
]
}
Example request body for all project codes in any language for the wikibooks project:
{
"filters": [
{
"field": "code", "value": "wikibooks"
}
]
}
- Lastly, request GET Available Namespaces to get a list and description of the Namespaces you can request data for.
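For reference, the first request body above (filtering on in_language.identifier) could be sent to the Available Projects endpoint like this. The /v2/projects path is an assumption based on the v2 URL pattern used elsewhere on this page, so verify it against the API reference:
# Assumed endpoint path (/v2/projects); verify against the API reference.
curl --location 'https://api.enterprise.wikimedia.com/v2/projects' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ACCESS_TOKEN' \
--data '{
"filters": [
{
"field": "in_language.identifier", "value": "de"
}
]
}'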
Once you have all the information you need from the utility endpoints, you can craft identifiers to use in the Snapshot and Realtime APIs as follows:
<language><project_code>_namespace_<namespace_number>
E.g. the Snapshot identifier for German Wikibooks Namespace 0 is dewikibooks_namespace_0.
Snapshot API
For most use cases, using the following endpoints in this order will return the data you’re looking for:
- Request POST Available Snapshots to find the Snapshot bundles you’re looking for
- Pass the bundle identifiers you need to the POST Snapshot Bundle Info endpoint to get more insight into the size of the project file, the available chunks, and more.
- Then either GET Download Snapshot Bundle for the full Snapshot, or GET Snapshot Chunk to download the Snapshot chunk by chunk
Available Snapshots request example:
curl --location 'https://api.enterprise.wikimedia.com/v2/snapshots' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ACCESS_TOKEN' \
--data '{
"filters": [
{
"field": "namespace.identifier", "value": 0
},
{
"field": "is_part_of.identifier", "value": "enwiki"
}
],
"fields": ["is_part_of.identifier","date_modified","size.value"]
}'
Example response:
[
{
"is_part_of__identifier": "enwiki",
"date_modified": "2025-08-25T01:02:01.315827526Z",
"size__value": 141980.842e0
}
]
Extract the identifier(s) of the Snapshot bundle(s) you want, then request Snapshot Bundle Info for each of those Snapshots.
Example request for Snapshot Bundle Info:
curl --location --request POST 'https://api.enterprise.wikimedia.com/v2/snapshots/dewiki_namespace_0' \
--header 'Authorization: Bearer ACCESS_TOKEN'
Example response:
{
"identifier": "dewiki_namespace_0",
"version": "77859570ab3e28f59e34b2a5e372d1ff",
"date_modified": "2025-08-25T00:39:15.143882329Z",
"is_part_of": {
"identifier": "dewiki"
},
"in_language": {
"identifier": "de"
},
"namespace": {
"identifier": 0
},
"size": {
"value": 34626.35e0,
"unit_text": "MB"
},
"chunks": [
"dewiki_namespace_0_chunk_0",
"dewiki_namespace_0_chunk_1",
"dewiki_namespace_0_chunk_2",
"dewiki_namespace_0_chunk_3",
[...]
]
}
Decide whether to download the bundle(s) you want in full or to download them in chunks, using the chunk identifiers given in the Snapshot Bundle Info response. Chunks are ideal for downloading just a small sample of a wiki project for testing purposes. Downloading a large project in chunks is also useful for avoiding the need to keep a connection open for a long time while downloading a single file. Some projects (like English Wikipedia) are larger than a terabyte, so using chunks is advisable. Be aware that requesting one Snapshot chunk counts towards your allotted quota of Snapshot API calls. If you are on a free plan, download full project bundles to minimize the number of calls you make.
Example GET request to the Download Snapshot Bundle endpoint to download the full snapshot of German Wikipedia:
curl --location 'https://api.enterprise.wikimedia.com/v2/snapshots/dewiki_namespace_0/download' \
--header 'Authorization: Bearer ACCESS_TOKEN'
Example GET Request using cURL to the Download Snapshot Chunk endpoint to download the first chunk of German Wikipedia:
curl --location 'https://api.enterprise.wikimedia.com/v2/snapshots/dewiki_namespace_0/chunks/dewiki_namespace_0_chunk_0/download' \
--header 'Authorization: Bearer ACCESS_TOKEN'
Use the Snapshot Bundle Info endpoint every time before downloading a bundle or chunk of a bundle to make sure the bundle is available, and check its date_modified field to see when it was last updated.
Snapshot bundles are updated daily around 4:00 AM UTC. It is recommended to consume new Snapshot bundles between 8:00 AM and 12:00 (noon) UTC to make sure the new bundles are available and complete. If you are on a free plan, you only have access to renewed Snapshot bundles on the 2nd and 21st day of every month.
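Putting those two recommendations together, a minimal sketch of a daily job could check date_modified via Snapshot Bundle Info and only download when the bundle has actually been renewed. The marker file, output filename, and tar.gz assumption below are illustrative:
# Sketch: download dewiki only when its Snapshot bundle reports a new date_modified.
LAST_FILE=dewiki_namespace_0.last_modified
CURRENT=$(curl --silent --location --request POST \
  'https://api.enterprise.wikimedia.com/v2/snapshots/dewiki_namespace_0' \
  --header 'Authorization: Bearer ACCESS_TOKEN' \
  | jq -r '.date_modified')
if [ "$CURRENT" != "$(cat "$LAST_FILE" 2>/dev/null)" ]; then
  curl --location 'https://api.enterprise.wikimedia.com/v2/snapshots/dewiki_namespace_0/download' \
    --header 'Authorization: Bearer ACCESS_TOKEN' \
    --output dewiki_namespace_0.tar.gz
  echo "$CURRENT" > "$LAST_FILE"
fi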
On-demand API
If you only want a specific set of articles and you already have the names of those pages (e.g. https://en.wikipedia.org/wiki/Josephine_Baker), use the On-demand API Article Lookup method (e.g. GET https://api.enterprise.wikimedia.com/v2/articles/Josephine_Baker). Please be aware that article names are case-sensitive. Spaces in page names should be passed to the API as underscores, e.g. Josephine_Baker, or as URL-encoded spaces, e.g. Josephine%20Baker.
curl --location 'https://api.enterprise.wikimedia.com/v2/articles/Josephine_Baker' \
--header 'Authorization: Bearer ACCESS_TOKEN'
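Because the same page name can exist on many wiki projects, it can help to combine Article Lookup with the fields-and-filters advice above. The following is a sketch only; it assumes the Article Lookup endpoint accepts the same filters/fields request body as the other v2 endpoints on this page, and the field names are illustrative:
# Sketch: assumes Article Lookup accepts a filters/fields body; field names are illustrative.
curl --location --request POST 'https://api.enterprise.wikimedia.com/v2/articles/Josephine_Baker' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ACCESS_TOKEN' \
--data '{
"filters": [
{
"field": "is_part_of.identifier", "value": "enwiki"
}
],
"fields": ["name", "abstract", "version.identifier"]
}'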
Realtime API
Before downloading a Batch, check if that Batch is available with the Available Hourly Batches method.
Example POST request using cURL for the available hourly batches on the 26th of August 2025 at 00:00 UTC:
curl --location --request POST 'https://api.enterprise.wikimedia.com/v2/batches/2025-08-26/00' \
--header 'Authorization: Bearer ACCESS_TOKEN'
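Once the batch you need is listed as available, you can download it. The call below is a sketch that assumes the download route follows the pattern /v2/batches/{date}/{hour}/{identifier}/download; verify the exact path and archive format in the Realtime API reference:
# Assumed route pattern: /v2/batches/{date}/{hour}/{identifier}/download
curl --location 'https://api.enterprise.wikimedia.com/v2/batches/2025-08-26/00/enwiki_namespace_0/download' \
--header 'Authorization: Bearer ACCESS_TOKEN' \
--output enwiki_namespace_0_batch.tar.gz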
Once you’ve connected to the Article Updates Stream, you will continue getting data streamed to you until you close the connection yourself. The stream is long-lived; it will not close automatically after a given amount of time. Even if you keep the connection open for longer than 24 hours, meaning your access_token has technically expired, the connection won’t close as long as the access_token you used was valid at the moment you opened the connection. Find an example of a POST request to the Article Updates Stream below.
Parallel consumption
Using the parts request parameter, you can open more than one parallel connection to the Realtime API, with each of those connections targeting a subset of data partitions, also called a part. The maximum allowed number of parallel connections to the Realtime API is 10, i.e. the allowed range for parts is 0 through 9.
If you only want to open one connection that listens to all parts at the same time, you do not need to use parts. When you don’t pass the parts parameter, the default behavior of the API is to listen to all parts at the same time.
Example POST Request using cURL to connect to the Realtime Article Updates Stream, using filter to receive only updates from English Wikipedia, listening to all parts:
curl --location 'https://realtime.enterprise.wikimedia.com/v2/articles' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ACCESS_TOKEN' \
--data '{
"filters": [
{
"field": "is_part_of.identifier", "value": "enwiki"
}
],
"parts": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
}'
Reconnecting to Realtime Streaming API
When using the Article Updates (streaming) endpoint, follow these instructions to handle multiple connections and restart support.
Realtime updates are stored for a rolling 48 hours, meaning you can use a timestamp from within the last 48 hours to reopen a connection and catch up on any updates you might have missed. If you know the timestamp at which your connection to the Realtime stream was closed, you can easily reconnect in one of these two ways:
Use since
This is the simplest and recommended method to reconnect to the Realtime stream. Save the latest timestamp you received for each event.partition while you were connected to the streaming endpoint; we recommend using event.date_published for this. When reconnecting, pass the oldest of those saved timestamps in the since parameter. Because since applies to all partitions, using the oldest timestamp ensures you don’t miss any events. If you use since to reconnect, you may receive events from some partitions that you already obtained in your previous connection. Therefore we recommend recording the IDs of incoming events and discarding duplicates, ensuring idempotence (see the sketch after the example below).
Example POST request using cURL asking for all Realtime streaming updates for English Wikipedia since the timestamp 2025-04-15T15:33:50Z:
curl --location 'https://realtime.enterprise.wikimedia.com/v2/articles' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ACCESS_TOKEN' \
--data '{
"filters": [
{
"field": "is_part_of.identifier", "value": "enwiki"
}
],
"since":"2025-04-15T15:33:50Z"
}'
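As a rough illustration of the deduplication idea above, the reconnect request can be piped through jq and awk so that only the first occurrence of each event ID is kept. This sketch assumes the stream emits newline-delimited JSON and that each event carries a unique event.identifier; check the data dictionary for the exact field names:
# Sketch: assumes newline-delimited JSON and a unique event.identifier per event.
curl --silent --no-buffer --location 'https://realtime.enterprise.wikimedia.com/v2/articles' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ACCESS_TOKEN' \
--data '{
"filters": [
{
"field": "is_part_of.identifier", "value": "enwiki"
}
],
"since": "2025-04-15T15:33:50Z"
}' \
| jq -rc --unbuffered 'select(.event.identifier != null) | [.event.identifier, .name] | @tsv' \
| awk -F'\t' '!seen[$1]++'   # print event ID and article name only for unseen event IDs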
Please be aware that if you use the since parameter you cannot specify parts. The since parameter will automatically apply to all parts at the same time. If you try to pass since while also using the parts parameter, you will receive the following error message:
"status": 422,
"message": "for parallel consumption, specify either offsets or time-offsets (since_per_partition) parameter"
Use since_per_partition
Where the since parameter only takes one timestamp as input, since_per_partition allows you to define different timestamps for different partitions, so you can start consuming data from a different timestamp in each partition when reconnecting. This is useful if you are using parts to consume subsets of partitions, whether you’re doing this with one or multiple parallel connections.
The most straightforward, and recommended, way to reconnect using since_per_partition is to open just one connection. For every partition, find the last event.date_published value you received before you disconnected. When reconnecting, send the request parameter parts with the value [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] and the since_per_partition object with key-value pairs mapping every part to the corresponding event.date_published value, or any other RFC 3339 timestamp.
The advantage of using since_per_partition over since is that the chances of receiving duplicate events are much lower, as you are supplying a precise timestamp per partition. This makes it much less likely that you will receive events that you had already received before reconnecting.
Example request with since_per_partition:
curl --location 'https://realtime.enterprise.wikimedia.com/v2/articles' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ACCESS_TOKEN' \
--data '{
"filters": [
{
"field": "is_part_of.identifier", "value": "enwiki"
}
],
"parts": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
"since_per_partition": {
"0": "2023-06-05T12:00:00Z",
"1": "2023-06-05T12:00:00Z",
"2": "2023-06-05T12:00:00Z",
[...],
}
}'