Bluesky Datasets
We collect a number of datasets for our research, which we outline below. Please see our privacy policy for contact information if you have any questions.
We make some anonymized datasets available below. Furthermore, we may be able to share parts of the un-blinded datasets for research purposes upon request.
If you use any of our datasets, please cite our IMC paper as follows:
@inproceedings{balduf2024bluesky,
author = {Balduf, Leonhard and Sokoto, Saidu and Ascigil, Onur and Tyson, Gareth and Scheuermann, Bj\"{o}rn and Korczy\'{n}ski, Maciej and Castro, Ignacio and Kr\'{o}l, Micha{\l}},
title = {Looking AT the Blue Skies of Bluesky},
year = {2024},
url = {https://doi.org/10.1145/3646547.3688407},
doi = {10.1145/3646547.3688407},
booktitle = {Proceedings of the 2024 ACM on Internet Measurement Conference},
}
If you want to request any of the non-public data, please:
- Read this entire website, and make sure to:
  - Check whether the anonymized datasets available for download are sufficient for your research.
  - Try to set up the data collection yourself. This is fairly straightforward for some parts, and we have a very detailed README that describes our setup.
- Read our privacy policy, which you will have to abide by.
- Specify in as much detail as possible the datasets, record types, and time frames you need. The more specific your request, the more likely it is that we can help. Please understand that it takes time to extract and prepare these datasets.
- Include a signed proposal outlining what you will use the data for and confirming that you will not reshare it or make it public.
- Please cite our IMC paper introducing the datasets (see above).
Firehose Logs
We are subscribed to the official Bluesky Firehose, i.e., the big relay that unifies and re-broadcasts the repository commits of all federated PDSes. We extract and parse the commits, which take the following form:
did, timestamp, commit
where each commit contains a list of operations:
operation_type, collection, rkey, record
These repo commits are produced by the PDS of a user based on their actions in Bluesky. They are part of the normal operation of the network and contain users’ public content.
We have been operating a Firehose logger since April 2024. Starting from June 2024, we also log block data, which allows us to subsequently extract repo commits and their data. The data collection code is publicly available on GitHub.
We collect all types of Firehose events, which include identity updates, service notifications, and repository commits. The latter contain update operations to a user’s repository as a diff from a previous revision.
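For illustration, here is a minimal consumer sketch in Python using the third-party atproto SDK (not our actual collector, which is linked above). It subscribes to the Firehose, decodes commit events, and prints one line per operation in the shape described above.

# Minimal Firehose consumer sketch (third-party `atproto` SDK, not our collector).
from atproto import CAR, FirehoseSubscribeReposClient, models, parse_subscribe_repos_message

client = FirehoseSubscribeReposClient()

def on_message(message) -> None:
    evt = parse_subscribe_repos_message(message)
    # Only repository commits are handled here; identity and account events are skipped.
    if not isinstance(evt, models.ComAtprotoSyncSubscribeRepos.Commit):
        return
    car = CAR.from_bytes(evt.blocks)  # commits carry their records as CAR-encoded blocks
    for op in evt.ops:
        collection, rkey = op.path.split("/", 1)
        record = car.blocks.get(op.cid) if op.cid else None  # deletes carry no record
        print(evt.repo, evt.time, op.action, collection, rkey, record)

client.start(on_message)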
Due to the sensitive nature of this data, we cannot make it available publicly.
Labeler Logs
We are subscribed to every known Labeler endpoint and log all labels produced by them. The code for this is also publicly available on GitHub. This data is necessarily public for Bluesky's moderation to work. We log and archive the operations taken by these Labelers for research purposes.
We can probably share this data for research purposes — please get in touch.
PLC Directory Mirror
We replicate the official PLC Directory using the export functionality, which exports an ordered list of operations. We replay these operations to arrive at the latest state for each DID. This data is part of the decentralized identity infrastructure. It has to be public in order for content on Bluesky to be verifiable, as each DID document contains the public key(s) of a user.
We operate a public mirror of plc.directory as a service to the
community. This allows us to export a snapshot of all registered DIDs
and their DID documents. The (centralized) plc.directory
instance imposes necessary rate limits. We currently do not, in order to
make this vital data more easily available to researchers and the
community. Please see our page on services
for more information.
You can use the mirror free of charge. We can probably also share a snapshot for research purposes — please get in touch.
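As a sketch of how the export functionality can be consumed, here is a Python loop against plc.directory (substitute our mirror's hostname to avoid the upstream rate limits; endpoint and parameter names are the ones plc.directory documents):

# Page through the PLC directory export: JSON Lines, ordered by createdAt.
import json
import requests

BASE = "https://plc.directory"  # or the hostname of our mirror

after = None
while True:
    params = {"count": 1000}
    if after is not None:
        params["after"] = after  # resume strictly after this timestamp
    resp = requests.get(f"{BASE}/export", params=params, timeout=60)
    resp.raise_for_status()
    lines = resp.text.splitlines()
    if not lines:
        break
    for line in lines:
        op = json.loads(line)  # one PLC operation per line
        print(op["did"], op["createdAt"])
    after = json.loads(lines[-1])["createdAt"]
    if len(lines) < 1000:  # short page: caught up with the directory
        break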
Feed Generator Output
We compile a complete list of all Feed Generators in the network by
analyzing repository data and real-time updates from Firehose logs. Each
Feed Generator is identified by its DID and associated endpoint. We
retrieve metadata for each Feed Generator using the getFeedGenerator
API of the Bluesky AT Protocol.
For each Feed Generator, we retrieve new posts daily. This involves
querying feeds using the getFeed
API and saving any new, unrecorded posts to our database. The code
for doing this is publicly
available on GitHub.
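A sketch of the two API calls involved, issued in Python against the public Bluesky AppView (the feed URI is only an example, and unauthenticated access may not work for every feed):

# Fetch a Feed Generator's metadata and one page of its current output.
import requests

APPVIEW = "https://public.api.bsky.app/xrpc"
FEED = "at://did:plc:z72i7hdynmk6r22z27h6tvur/app.bsky.feed.generator/whats-hot"  # example

# Metadata: display name, description, service endpoint DID, like count, ...
meta = requests.get(f"{APPVIEW}/app.bsky.feed.getFeedGenerator",
                    params={"feed": FEED}, timeout=30).json()
print(meta["view"]["displayName"])

# One page of the posts currently served by the feed; page onwards with the returned cursor.
page = requests.get(f"{APPVIEW}/app.bsky.feed.getFeed",
                    params={"feed": FEED, "limit": 100}, timeout=30).json()
for item in page["feed"]:
    print(item["post"]["uri"])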
The metadata for Feed Generators, the list of Feed Generators, and the posts (although massive) are available upon request for research purposes — please get in touch.
Full-Network Mirror
We operate a mirror of the entire network using uabluerail/ipfs-indexer. For this, we subscribe to all PDSes and apply their event streams to a database. Each record essentially has the format
DID, rkey, <record JSON>
Due to the way the AT Protocol works, this data is public. From the database, we export snapshots for research purposes. We correctly implement removals, i.e., we do not process deleted data for our research. This allows us to export snapshots of, e.g., the social graph, all posts, etc.
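As a sketch of such an export, assuming a snapshot laid out as one row per record with columns (did, collection, rkey, record); the indexer's actual schema differs in detail:

# Derive the follow graph (edge list) from a hypothetical snapshot layout.
import duckdb

duckdb.sql("""
    COPY (
        SELECT did,
               json_extract_string(record, '$.subject') AS subject
        FROM 'records.parquet'
        WHERE collection = 'app.bsky.graph.follow'
    ) TO 'follow_edges.parquet' (FORMAT parquet, COMPRESSION zstd)
""")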
Similar to the Firehose data, we cannot make this data available publicly. We derive anonymized datasets, which we make available below. We may be able to share parts of the un-blinded dataset — please get in touch.
Anonymized Datasets
From the data collection outlined above, we derive a few anonymized datasets, which we make available below.
Methodology
We export a snapshot from the full-network mirror and anonymize the results. The resulting data is split into individual files by record type. You can find a list of record types on browser.blue or in the official sources.
The anonymization works roughly like this:
- Collect all distinct DIDs in the dataset.
- Derive sequential random numeric IDs for the DIDs, which we refer to as DID IDs. The resulting mapping is the anonymization key (see the sketch after this list).
- Replace all DIDs in the dataset with DID IDs.
- Strip columns with identifying information:
  - We always remove description_facets, because they contain mentions, which in turn contain DIDs.
  - For posts, we remove the text and extract useful features from embeds, which we then discard.
  - For profiles, we remove the username and pinned post.
  - We always remove CIDs.
- Reorder the datasets by their random numeric DID IDs, compress, and save.
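A hypothetical DuckDB sketch of the first two steps, shown for a single input file (in practice, the DIDs are collected across all record types):

# Build the anonymization key: each distinct DID gets a random sequential numeric ID.
import duckdb

duckdb.sql("""
    COPY (
        SELECT did,
               row_number() OVER (ORDER BY random()) AS did_id  -- random permutation
        FROM (SELECT DISTINCT did FROM 'follows.parquet')
    ) TO 'anon/key.parquet' (FORMAT parquet, COMPRESSION zstd)
""")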
We list the derived datasets and their download links below. Further down, we also list the exact SQL queries we use to derive these datasets from our snapshots and anonymize them. You can use the queries to see the structure of the datasets (or just download them, that also works).
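The published files are plain Parquet and can be queried directly, for example with DuckDB. Here, as a usage sketch, the most-blocked accounts in the blocks dataset (whose columns follow from the query listed below):

# Top-10 most-blocked (anonymized) accounts from the downloadable blocks dataset.
import duckdb

print(duckdb.sql("""
    SELECT subject_id, count(*) AS times_blocked
    FROM 'blocks.parquet'
    GROUP BY subject_id
    ORDER BY times_blocked DESC
    LIMIT 10
"""))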
If you notice potential improvements to the anonymization procedures, please contact us so we can implement them.
Blocks
Anonymization:
- We replace the DID and subject fields.
Download:
- 2025-04-14: parquet
Feed Generators
Anonymization:
- We replace the DID and remove the description facets.
Download:
- 2025-04-14: parquet
Follows
Anonymization:
- We replace the DID and subject.
Download:
- 2025-04-14: parquet
Likes
Anonymization:
- We replace the DID and the DID of the liked record.
- We remove the CID of the subject.
Download:
- 2025-04-14: parquet
List Blocks
Anonymization:
- We replace the DID and the DID of the subject.
Download:
- 2025-04-14: parquet
List Items
Anonymization:
- We replace the DID of the creator, the DID in the list URI, and the subject.
Download:
- 2025-04-14: parquet
Lists
Anonymization:
- We replace the DID and remove the description facets.
Download:
- 2025-04-14: parquet
Posts
Anonymization:
- We remove the text and extract features from the embedded record (if any),
- replace DIDs in replies and flatten them (removing CIDs), and
- remove all other DIDs contained in the record.
Download:
- 2025-04-14: parquet
Profiles
Anonymization:
- We replace the DID, remove the pinned post, and remove the display name of the user.
Download:
- 2025-04-14: parquet
Reposts
Anonymization:
- We replace the DID and the subject DID.
- We flatten the subject URI to remove the CID.
Download:
- 2025-04-14: parquet
Starter Packs
Anonymization:
- We replace the DID of the creator.
- We remove description facets.
- We replace the DIDs of the embedded feed generators and referenced list.
Download:
- 2025-04-14: parquet
SQL Queries
These are the exact queries we use to derive the above datasets from a snapshot of the network. They anonymize the data via replacements and drop columns that are potentially identifying.
Blocks:
-- Blocks: replace did and subject
COPY (
SELECT k1.did_id, b.rkey, b.created_at, k2.did_id AS subject_id
FROM 'blocks.parquet' b
INNER JOIN 'anon/key.parquet' k1
ON (b.did == k1.did)
INNER JOIN 'anon/key.parquet' k2
ON (b.subject == k2.did)
ORDER BY k1.did_id, rkey)
TO 'anon/blocks.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);
Feed Generators:
-- Feed generators: replace did, remove description_facets
COPY (
SELECT k.did_id, fg.* EXCLUDE (did, description_facets)
FROM 'feed_generators.parquet' fg
INNER JOIN 'anon/key.parquet' k
USING (did)
ORDER BY did_id, rkey)
TO 'anon/feed_generators.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);
Follows:
-- Follows: replace did and subject
COPY (
SELECT
    k1.did_id,
    f.rkey,
    f.created_at,
    k2.did_id AS subject_id
FROM 'follows.parquet' f
INNER JOIN 'anon/key.parquet' k1
ON (f.did == k1.did)
INNER JOIN 'anon/key.parquet' k2
ON (f.subject == k2.did)
ORDER BY k1.did_id, rkey)
TO 'anon/follows.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);
Likes:
-- Likes: replace did and subject.uri.did, remove subject.cid, flatten subject.uri to just subject
COPY (
SELECT
    k1.did_id,
    l.* EXCLUDE (did, subject),
    {'did_id': k2.did_id, 'collection': l.subject.uri.collection, 'rkey': l.subject.uri.rkey} AS subject
FROM 'likes.parquet' l
INNER JOIN 'anon/key.parquet' k1
USING (did)
INNER JOIN 'anon/key.parquet' k2
ON (l.subject.uri.did == k2.did)
ORDER BY (k1.did_id, rkey))
TO 'anon/likes.parquet';
List Blocks:
-- List Blocks: replace did and subject.did
COPY (
SELECT
    k1.did_id,
    lb.* EXCLUDE (did, subject),
    {'did_id': k2.did_id, 'collection': lb.subject.collection, 'rkey': lb.subject.rkey} AS subject
FROM 'list_blocks.parquet' lb
INNER JOIN 'anon/key.parquet' k1
ON (lb.did == k1.did)
INNER JOIN 'anon/key.parquet' k2
ON (lb.subject.did == k2.did)
ORDER BY k1.did_id, rkey)
TO 'anon/list_blocks.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);
List Items:
-- List Items: replace did, list.did, and subject
COPY (
SELECT
    k1.did_id,
    li.* EXCLUDE (did, list, subject),
    {'did_id': k2.did_id, 'collection': li.list.collection, 'rkey': li.list.rkey} AS "list",
    k3.did_id AS subject_id
FROM 'list_items.parquet' li
INNER JOIN 'anon/key.parquet' k1
ON (li.did == k1.did)
INNER JOIN 'anon/key.parquet' k2
ON (li.list.did == k2.did)
INNER JOIN 'anon/key.parquet' k3
ON (li.subject == k3.did)
ORDER BY (k1.did_id, rkey))
TO 'anon/list_items.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);
Lists:
-- Lists: replace did, remove description_facets
COPY (
SELECT k.did_id, l.* EXCLUDE (did, description_facets)
FROM 'lists.parquet' l
INNER JOIN 'anon/key.parquet' k
ON (l.did == k.did)
ORDER BY (did_id, rkey))
TO 'anon/lists.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);
Posts:
-- Posts: replace did, remove facets, remove text, extract parts of embed, replace reply.root.uri.did and reply.parent.uri.did:
COPY (
WITH tmp1 AS (SELECT did,
                     rkey,
                     json_extract_string(embed, '$."$type"') AS embed_type,
                     json_extract(embed, '$.images') AS embed_images,
                     json_extract(embed, '$.media') AS embed_media,
                     json_extract(embed, '$.video') AS embed_video,
                     COALESCE(json_extract_string(embed, '$.record.record.uri'),
                              json_extract_string(embed, '$.record.uri')) AS embed_record_uri,
                     json_extract_string(embed, '$.external.uri') AS embed_external_uri
              FROM 'posts.parquet'
              WHERE embed IS NOT NULL),
     tmp2 AS (SELECT did,
                     rkey,
                     embed_type,
                     IF(embed_record_uri IS NULL, NULL, {'did': split_part(embed_record_uri, '/', 3), 'collection': split_part(embed_record_uri, '/', 4), 'rkey': split_part(embed_record_uri, '/', 5)}) AS embed_record,
                     embed_external_uri,
                     embed_images,
                     embed_media,
                     embed_video
              FROM tmp1),
     tmp3 AS (SELECT k1.did_id,
                     rkey,
                     embed_type,
                     IF(embed_record IS NULL, NULL, {'did_id': k2.did_id, 'collection': embed_record.collection, 'rkey': embed_record.rkey}) AS embed_record,
                     embed_external_uri,
                     embed_images,
                     embed_media,
                     embed_video,
                     embed_record.did IS NOT NULL AND k2.did_id IS NULL AS invalid_did
              FROM tmp2
              INNER JOIN 'anon/key.parquet' k1 USING (did)
              LEFT JOIN 'anon/key.parquet' k2 ON (embed_record.did == k2.did))
SELECT *
FROM tmp3) TO 'anon/tmp_post_embeds.parquet' (FORMAT parquet, COMPRESSION zstd);

COPY (WITH tmp4 AS (SELECT k0.did_id,
                           p.rkey,
                           IF(p.reply IS NULL, NULL, {'root': IF(p.reply.root IS NULL, NULL, {'did_id': k1.did_id, 'collection': p.reply.root.uri.collection, 'rkey': p.reply.root.uri.rkey}), 'parent': IF(p.reply.parent IS NULL, NULL, {'did_id': k2.did_id, 'collection': p.reply.parent.uri.collection, 'rkey': p.reply.parent.uri.rkey})}) AS reply,
                           (p.reply.root.uri.did IS NOT NULL AND k1.did_id IS NULL) OR
                           (p.reply.parent.uri.did IS NOT NULL AND k2.did_id IS NULL) AS invalid_did
                    FROM 'posts.parquet' p
                    INNER JOIN 'anon/key.parquet' k0 USING (did)
                    LEFT JOIN 'anon/key.parquet' k1 ON (p.reply.root.uri.did == k1.did)
                    LEFT JOIN 'anon/key.parquet' k2 ON (p.reply.parent.uri.did == k2.did)
                    WHERE p.reply IS NOT NULL)
SELECT *
FROM tmp4) TO 'anon/tmp_posts_replies.parquet' (FORMAT parquet, COMPRESSION zstd);

COPY (SELECT k1.did_id,
             p.* EXCLUDE (did, "text", embed, facets, reply),
             tmp3.embed_type,
             tmp3.embed_record,
             tmp3.embed_external_uri,
             tmp3.embed_images,
             tmp3.embed_media,
             tmp3.embed_video,
             tmp4.reply
      FROM 'posts.parquet' p
      INNER JOIN 'anon/key.parquet' k1 USING (did)
      LEFT JOIN 'anon/tmp_post_embeds.parquet' tmp3 USING (did_id, rkey)
      LEFT JOIN 'anon/tmp_posts_replies.parquet' tmp4 USING (did_id, rkey)
      WHERE NOT (tmp3.invalid_did IS NOT NULL AND tmp3.invalid_did)
        AND NOT (tmp4.invalid_did IS NOT NULL AND tmp4.invalid_did)) TO 'anon/tmp_posts_unsorted.parquet' (FORMAT parquet, COMPRESSION zstd);

COPY (SELECT *
      FROM 'anon/tmp_posts_unsorted.parquet') TO 'anon/posts.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);
Profiles:
-- Profiles: replace did, remove pinned_post, remove display_name
COPY (
SELECT k.did_id, p.* EXCLUDE (did, pinned_post, display_name)
FROM 'profiles.parquet' p
INNER JOIN 'anon/key.parquet' k
ON (p.did == k.did)
ORDER BY (did_id, rkey))
TO 'anon/profiles.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);
Reposts:
-- Reposts: replace did and subject.uri.did, remove subject.cid, flatten subject to just URI
COPY (
SELECT
    k1.did_id,
    rp.* EXCLUDE (did, subject),
    {'did_id': k2.did_id, 'collection': rp.subject.uri.collection, 'rkey': rp.subject.uri.rkey} AS subject
FROM 'reposts.parquet' rp
INNER JOIN 'anon/key.parquet' k1
ON (rp.did == k1.did)
INNER JOIN 'anon/key.parquet' k2
ON (rp.subject.uri.did == k2.did)
ORDER BY k1.did_id, rp.rkey)
TO 'anon/reposts.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);
Starter Packs:
-- Starter Packs: replace did, remove description_facets, replace list.did, replace feeds.did
COPY (WITH tmp1 AS (SELECT k1.did_id,
                           sp.* EXCLUDE (did, description_facets, list),
                           {'did_id': k2.did_id, 'collection': sp.list.collection, 'rkey': sp.list.rkey} AS "list"
                    FROM 'starter_packs.parquet' sp
                    INNER JOIN 'anon/key.parquet' k1 ON (sp.did == k1.did)
                    INNER JOIN 'anon/key.parquet' k2 ON (sp.list.did == k2.did)),
           tmp2 AS (SELECT did_id, rkey, unnest(feeds) AS feed FROM tmp1),
           tmp3 AS (SELECT tmp2.did_id,
                           tmp2.rkey,
                           {'did_id': k.did_id, 'collection': tmp2.feed.collection, 'rkey': tmp2.feed.rkey} AS feed
                    FROM tmp2
                    INNER JOIN 'anon/key.parquet' k ON (tmp2.feed.did == k.did)),
           tmp4 AS (SELECT did_id, rkey, list(feed) AS feeds FROM tmp3 GROUP BY did_id, rkey)
      SELECT tmp1.* EXCLUDE (feeds), tmp4.feeds
      FROM tmp1
      LEFT JOIN tmp4 USING (did_id, rkey)
      ORDER BY (did_id, rkey)) TO 'anon/starter_packs.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);