Bluesky Datasets

We collect a number of datasets for our research, which we outline below. Please see our privacy policy for contact information if you have any questions.

We make some anonymized datasets available below. Furthermore, we may be able to share parts of the un-blinded datasets for research purposes upon request.

If you use any of our datasets, please cite our IMC paper as follows:

@inproceedings{balduf2024bluesky,
    author = {Balduf, Leonhard and Sokoto, Saidu and Ascigil, Onur and Tyson, Gareth and Scheuermann, Bj\"{o}rn and Korczy\'{n}ski, Maciej and Castro, Ignacio and Kr\'{o}l, Micha{\l}},
    title = {Looking AT the Blue Skies of Bluesky},
    year = {2024},
    url = {https://doi.org/10.1145/3646547.3688407},
    doi = {10.1145/3646547.3688407},
    booktitle = {Proceedings of the 2024 ACM on Internet Measurement Conference},
}

If you want to request any of the non-public data, please:

  1. Read this entire website, in particular to:
    1. Check whether the anonymized datasets available for download are sufficient for your research.
    2. Try to set up the data collection yourself. This is fairly straightforward for some parts; we have a very detailed README that describes our setup.
  2. Read our privacy policy, which you will have to abide by.
  3. Specify in as much detail as possible the datasets, record types, and time frames you need. The more specific this is, the more likely it is that we can actually help you. Please understand that it takes time to extract and prepare these datasets.
  4. Include a signed proposal outlining what you would use the data for and confirming that you will not reshare it or make it public.
  5. Cite our IMC paper introducing the dataset (see above).

Firehose Logs

We are subscribed to the official Bluesky Firehose, i.e., the big relay that unifies and re-broadcasts repository commits of all federated PDSes. We extract and parse the commits, which take the following form:

did, timestamp, commit

where each commit contains a list of operations:

operation_type, collection, rkey, record

These repo commits are produced by the PDS of a user based on their actions in Bluesky. They are part of the normal operation of the network and contain users’ public content.
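
As an illustration, if the parsed commits were stored as Parquet with the columns above (a hypothetical layout; our logger's actual storage format may differ), the operations could be analyzed directly in DuckDB:

-- Hypothetical layout: 'firehose.parquet' with columns (did, timestamp, ops),
-- where ops is a list of structs (operation_type, collection, rkey, record).
SELECT date_trunc('day', timestamp) AS day,
       op.collection,
       count(*) AS n_ops
FROM (SELECT timestamp, unnest(ops) AS op FROM 'firehose.parquet')
GROUP BY day, op.collection
ORDER BY day, n_ops DESC;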

We have been operating a Firehose logger since April 2024. Starting from June 2024, we also log block data, which allows us to subsequently extract repo commits and their data. The data collection code is publicly available on GitHub.

We collect all types of Firehose events, which include identity updates, service notifications, and repository commits. The latter contain update operations to a user's repository as a diff from a previous revision.

Due to the sensitive nature of this data, we cannot make it available publicly.

Labeler Logs

We are subscribed to every known Labeler endpoint and log all labels they produce. The code for this is also publicly available on GitHub. This data is necessarily public for Bluesky's moderation to work; we log and archive the operations taken by these Labelers for research purposes.

We can probably share this data for research purposes — please get in touch.

PLC Directory Mirror

We replicate the official PLC Directory using the export functionality, which exports an ordered list of operations. We replay these operations to arrive at the latest state for each DID. This data is part of the decentralized identity infrastructure; it has to be public for content on Bluesky to be verifiable, as each DID document contains a user's public key(s).

We operate a public mirror of plc.directory as a service to the community. This allows us to export a snapshot of all registered DIDs and their DID documents. The (centralized) plc.directory instance imposes necessary rate limits. We currently do not, in order to make this vital data more easily available to researchers and the community. Please see our page on services for more information.
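
As an illustration, the export endpoint returns newline-delimited JSON operations, which can be read over HTTP straight from DuckDB (shown here against the official plc.directory; our mirror is assumed to expose the same endpoint):

-- Fetch one page of PLC operations; the 'count' parameter follows the
-- plc.directory export API. Point the host at the mirror to avoid rate limits.
INSTALL httpfs;
LOAD httpfs;
SELECT did, createdAt
FROM read_json_auto('https://plc.directory/export?count=10',
                    format = 'newline_delimited');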

You can use the mirror free of charge. We can probably also share a snapshot for research purposes — please get in touch.

Feed Generator Output

We compile a complete list of all Feed Generators in the network by analyzing repository data and real-time updates from Firehose logs. Each Feed Generator is identified by its DID and associated endpoint. We retrieve metadata for each Feed Generator using the getFeedGenerator API of the Bluesky AT Protocol.
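
As a sketch, this metadata can also be fetched for a single Feed Generator through the public Bluesky AppView (the feed URI below is a well-known feed, used purely as an example; our collector has its own implementation):

-- Fetch metadata for one feed generator and inspect a few fields.
INSTALL httpfs;
LOAD httpfs;
SELECT "view".displayName AS display_name,
       "view".likeCount   AS like_count,
       isOnline,
       isValid
FROM read_json_auto('https://public.api.bsky.app/xrpc/app.bsky.feed.getFeedGenerator?feed=at://did:plc:z72i7hdynmk6r22z27h6tvur/app.bsky.feed.generator/whats-hot');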

For each Feed Generator, we retrieve new posts daily. This involves querying feeds using the getFeed API and saving any new, unrecorded posts to our database. The code for doing this is publicly available on GitHub.

The metadata for Feed Generators, the list of Feed Generators, and the posts (although massive) are available upon request for research purposes — please get in touch.

Full-Network Mirror

We operate a mirror of the entire network using uabluerail/ipfs-indexer. For that, we are subscribed to all PDSes and apply their event streams onto the database. Each record in the database essentially has the format:

DID, rkey, <record JSON>

Due to the way the AT Protocol works, this data is public. From the database, we export snapshots for research purposes. We correctly implement removals, i.e., deleted data does not enter our research. This allows us to export snapshots of, e.g., the social graph, all posts, etc.
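
For instance, assuming a snapshot were exported to Parquet with exactly these columns (the file name is illustrative), individual record types can be pulled out by filtering on the record's $type:

-- Extract all posts' creation times from a hypothetical snapshot file
-- with columns (did, rkey, record), where record is the raw JSON.
SELECT did, rkey,
       json_extract_string(record, '$.createdAt') AS created_at
FROM 'records.parquet'
WHERE json_extract_string(record, '$."$type"') = 'app.bsky.feed.post';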

Similar to the Firehose data, we cannot make this data available publicly. We derive anonymized datasets, which we make available below. We may be able to share parts of the un-blinded dataset — please get in touch.

Anonymized Datasets

From the data collection outlined above, we derive a few anonymized datasets, which we make available below.

Methodology

We export a snapshot from the full-network mirror and anonymize the results. The resulting data is split into individual files by record type. You can find a list of record types on browser.blue, or in the official sources.

The anonymization works roughly like this:

  1. Collect all distinct DIDs in the dataset.
  2. Assign each DID a random numeric ID (a random permutation of sequential IDs), which we refer to as its DID ID. The resulting mapping is the anonymization key; see the sketch after this list.
  3. Replace all DIDs in the dataset with DID IDs.
  4. Strip columns with identifying information:
    • We always remove description_facets, because they contain mentions, which in turn contain DIDs.
    • For posts, we remove the text and replace embeds with features extracted from them.
    • For profiles, we remove the display name and pinned post.
    • We always remove CIDs.
  5. Reorder the datasets by their random numeric DID IDs, compress, and save.
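
A minimal sketch of steps 1-3 in the same DuckDB SQL dialect as the queries below (the file name is illustrative; the actual key is derived over all record types):

-- Build the anonymization key: one random numeric ID per distinct DID.
-- 'follows.parquet' stands in for the union of all snapshot files.
COPY (
SELECT did, row_number() OVER (ORDER BY random()) AS did_id
FROM (SELECT did FROM 'follows.parquet'
      UNION SELECT subject FROM 'follows.parquet'))
TO 'anon/key.parquet' (FORMAT parquet, COMPRESSION zstd);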

We list the derived datasets and their download links below. Further down, we also list the exact SQL queries we use to derive these datasets from our snapshots and anonymize them. You can use the queries to see the structure of the datasets (or just download them, that also works).
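
For example, once downloaded, the anonymized follows dataset can be queried directly with DuckDB (a usage sketch; the column names match the follows query below):

-- Top ten accounts by follower count in the anonymized follows dataset.
SELECT subject_id, count(*) AS followers
FROM 'anon/follows.parquet'
GROUP BY subject_id
ORDER BY followers DESC
LIMIT 10;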

If you notice potential improvements to the anonymization procedures, please contact us so we can implement them.

Blocks

Anonymization:

  • We replace the DID and subject fields.

Download:

Feed Generators

Anonymization:

  • We replace the DID and remove the description facets.

Download:

Follows

Anonymization:

  • We replace the DID and subject.

Download:

Likes

Anonymization:

  • We replace the DID and the DID of the liked record.
  • We remove the CID of the subject.

Download:

List Blocks

Anonymization:

  • We replace the DID and the DID of the subject.

Download:

List Items

Anonymization:

  • We replace the DID of the creator, the DID in the list URI, and the subject.

Download:

Lists

Anonymization:

  • We replace the DID and remove the description facets.

Download:

Posts

Anonymization:

  • We remove the text and extract features from the embedded record (if any),
  • replace DIDs in replies and flatten them (removing CIDs), and
  • remove all other DIDs.

Download:

Profiles

Anonymization:

  • We replace the DID, remove the pinned post, and remove the display name of the user.

Download:

Reposts

Anonymization:

  • We replace the DID and the subject DID.
  • We flatten the subject URI to remove the CID.

Download:

Starter Packs

Anonymization:

  • We replace the DID of the creator.
  • We remove description facets.
  • We replace the DIDs of the embedded feed generators and referenced list.

Download:

SQL Queries

These are the exact queries we use to derive the above datasets from a snapshot of the network. They anonymize the data via replacements and drop columns that are potentially identifying.

Blocks:

-- Blocks: replace did and subject
COPY (
SELECT k1.did_id, b.rkey, b.created_at, k2.did_id AS subject_id
FROM 'blocks.parquet' b
  INNER JOIN 'anon/key.parquet' k1
  ON (b.did == k1.did)
  INNER JOIN 'anon/key.parquet' k2
  ON (b.subject == k2.did)
ORDER BY k1.did_id, rkey)
TO 'anon/blocks.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);

Feed Generators:

-- Feed generators: replace did, remove description_facets
COPY (
SELECT 
    k.did_id,
    fg.* EXCLUDE (did,description_facets)
FROM 'feed_generators.parquet' fg
  INNER JOIN 'anon/key.parquet' k
  USING (did)
ORDER BY did_id, rkey)
TO 'anon/feed_generators.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);

Follows:

-- Follows: replace did and subject
COPY (
SELECT 
    k1.did_id,
    f.rkey,
    f.created_at,
    k2.did_id AS subject_id
FROM 'follows.parquet' f
  INNER JOIN 'anon/key.parquet' k1
  ON (f.did == k1.did)
  INNER JOIN 'anon/key.parquet' k2
  ON (f.subject == k2.did)
ORDER BY k1.did_id, rkey)
TO 'anon/follows.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);

Likes:

-- Likes: replace did and subject.uri.did, remove subject.cid, flatten subject.uri to just subject
COPY (
SELECT
    k1.did_id,
    l.* EXCLUDE (did, subject),
    {'did_id': k2.did_id, 'collection': l.subject.uri.collection, 'rkey': l.subject.uri.rkey} AS subject
FROM 'likes.parquet' l
  INNER JOIN 'anon/key.parquet' k1
  USING (did)
  INNER JOIN 'anon/key.parquet' k2
  ON (l.subject.uri.did == k2.did)
ORDER BY (k1.did_id, rkey))
TO 'anon/likes.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);

List Blocks:

-- List Blocks: replace did and subject.did
COPY (
SELECT 
    k1.did_id,
    lb.* EXCLUDE (did, subject),
    {'did_id':k2.did_id, 'collection': lb.subject.collection, 'rkey': lb.subject.rkey} AS subject
FROM 'list_blocks.parquet' lb
  INNER JOIN 'anon/key.parquet' k1
  ON (lb.did == k1.did)
  INNER JOIN 'anon/key.parquet' k2
  ON (lb.subject.did == k2.did)
ORDER BY k1.did_id, rkey)
TO 'anon/list_blocks.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);

List Items:

-- List Items: replace did, list.did, and subject
COPY (
SELECT
    k1.did_id,
    li.* EXCLUDE (did, list, subject),
    {'did_id': k2.did_id, 'collection': li.list.collection, 'rkey': li.list.rkey} AS "list",
    k3.did_id AS subject_id
FROM 'list_items.parquet' li
  INNER JOIN 'anon/key.parquet' k1
  ON (li.did == k1.did)
  INNER JOIN 'anon/key.parquet' k2
  ON (li.list.did == k2.did)
  INNER JOIN 'anon/key.parquet' k3
  ON (li.subject == k3.did)
ORDER BY (k1.did_id, rkey))
TO 'anon/list_items.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);

Lists:

-- Lists: replace did, remove description_facets
COPY (
SELECT k.did_id, l.* EXCLUDE (did, description_facets)
FROM 'lists.parquet' l
  INNER JOIN 'anon/key.parquet' k
  ON (l.did == k.did)
ORDER BY (did_id, rkey))
TO 'anon/lists.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);

Posts:

-- Posts: replace did, remove facets, remove text, extract parts of embed, replace reply.root.uri.did and reply.parent.uri.did
COPY (
    WITH tmp1 AS (SELECT did,
                         rkey,
                         json_extract_string(embed, '$."$type"')              AS embed_type,
                         json_extract(embed, '$.images')                      AS embed_images,
                         json_extract(embed, '$.media')                       AS embed_media,
                         json_extract(embed, '$.video')                       AS embed_video,
                         COALESCE(json_extract_string(embed, '$.record.record.uri'),
                                  json_extract_string(embed, '$.record.uri')) AS embed_record_uri,
                         json_extract_string(embed, '$.external.uri')         AS embed_external_uri
                  FROM 'posts.parquet'
                  WHERE embed IS NOT NULL),
         tmp2 AS (SELECT did,
                         rkey,
                         embed_type,
                         IF(embed_record_uri IS NULL, NULL, {'did': split_part(embed_record_uri, '/', 3), 'collection': split_part(embed_record_uri, '/', 4), 'rkey': split_part(embed_record_uri, '/', 5)}) AS embed_record,
                         embed_external_uri,
                         embed_images,
                         embed_media,
                         embed_video
                  FROM tmp1),
         tmp3 AS (SELECT k1.did_id,
                         rkey,
                         embed_type,
                         IF(embed_record IS NULL, NULL, {'did_id': k2.did_id, 'collection': embed_record.collection, 'rkey': embed_record.rkey}) AS embed_record,
                         embed_external_uri,
                         embed_images,
                         embed_media,
                         embed_video,
                         embed_record.did IS NOT NULL AND k2.did_id IS NULL                                                                      AS invalid_did
                  FROM tmp2
                           INNER JOIN 'anon/key.parquet' k1 USING (did) LEFT JOIN 'anon/key.parquet' k2
                  ON (embed_record.did == k2.did))
    SELECT *
    FROM tmp3) TO 'anon/tmp_post_embeds.parquet' (FORMAT parquet, COMPRESSION zstd);
COPY (WITH tmp4 AS (SELECT k0.did_id,
                           p.rkey,
                           IF(p.reply IS NULL, NULL, {'root': IF(p.reply.root IS NULL, NULL, {'did_id': k1.did_id, 'collection': p.reply.root.uri.collection, 'rkey': p.reply.root.uri.rkey}), 'parent': IF(p.reply.parent IS NULL, NULL, {'did_id': k2.did_id, 'collection': p.reply.parent.uri.collection, 'rkey': p.reply.parent.uri.rkey})}) AS reply,
                           (p.reply.root.uri.did IS NOT NULL AND k1.did_id IS NULL) OR
                           (p.reply.parent.uri.did IS NOT NULL AND k2.did_id IS NULL)                                                                                                                                                                                                                                                            AS invalid_did
                    FROM 'posts.parquet' p INNER JOIN 'anon/key.parquet' k0 USING (did) LEFT JOIN 'anon/key.parquet' k1
                    ON (p.reply.root.uri.did == k1.did) LEFT JOIN 'anon/key.parquet' k2 ON (p.reply.parent.uri.did == k2.did)
                    WHERE p.reply IS NOT NULL)
      SELECT *
      FROM tmp4) TO 'anon/tmp_posts_replies.parquet' (FORMAT parquet, COMPRESSION zstd);
COPY (SELECT k1.did_id,
             p.* EXCLUDE (did, "text", embed, facets, reply), tmp3.embed_type,
             tmp3.embed_record,
             tmp3.embed_external_uri,
             tmp3.embed_images,
             tmp3.embed_media,
             tmp3.embed_video,
             tmp4.reply
      FROM 'posts.parquet' p
               INNER JOIN 'anon/key.parquet' k1 USING (did)
               LEFT JOIN 'anon/tmp_post_embeds.parquet' tmp3 USING (did_id, rkey)
               LEFT JOIN 'anon/tmp_posts_replies.parquet' tmp4 USING (did_id, rkey)
      WHERE NOT (tmp3.invalid_did IS NOT NULL AND tmp3.invalid_did)
        AND NOT (tmp4.invalid_did IS NOT NULL AND tmp4.invalid_did)) TO 'anon/tmp_posts_unsorted.parquet' (FORMAT parquet, COMPRESSION zstd);
COPY (SELECT *
      FROM 'anon/tmp_posts_unsorted.parquet'
      ORDER BY did_id, rkey) TO 'anon/posts.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);

Profiles:

-- Profiles: replace did, remove pinned_post, remove display_name
COPY (
SELECT k.did_id, p.* EXCLUDE (did, pinned_post, display_name)
FROM 'profiles.parquet' p
  INNER JOIN 'anon/key.parquet' k
  ON (p.did == k.did)
ORDER BY (did_id, rkey))
TO 'anon/profiles.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);

Reposts:

-- Reposts: replace did and subject.uri.did, remove subject.cid, flatten subject to just URI
COPY (
SELECT 
    k1.did_id,
    rp.* EXCLUDE (did,subject),
    {'did_id': k2.did_id, 'collection': rp.subject.uri.collection, 'rkey': rp.subject.uri.rkey} AS subject
FROM 'reposts.parquet' rp
  INNER JOIN 'anon/key.parquet' k1
  ON (rp.did == k1.did)
  INNER JOIN 'anon/key.parquet' k2
  ON (rp.subject.uri.did == k2.did)
ORDER BY k1.did_id, rp.rkey)
TO 'anon/reposts.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);

Starter Packs:

-- Starter Packs: replace did, remove description_facets, replace list.did, replace feeds.did
COPY (WITH tmp1
               AS (SELECT k1.did_id, sp.* EXCLUDE (did, description_facets, list), {'did_id': k2.did_id, 'collection': sp.list.collection, 'rkey': sp.list.rkey} AS "list"
                   FROM 'starter_packs.parquet' sp INNER JOIN 'anon/key.parquet' k1
                   ON (sp.did == k1.did) INNER JOIN 'anon/key.parquet' k2 ON (sp.list.did == k2.did)),
           tmp2 AS (SELECT did_id, rkey, unnest(feeds) AS feed FROM tmp1),
           tmp3 AS (SELECT tmp2.did_id,
                           tmp2.rkey, {'did_id': k.did_id, 'collection': tmp2.feed.collection, 'rkey': tmp2.feed.rkey} AS feed
                    FROM tmp2 INNER JOIN 'anon/key.parquet' k
                    ON (tmp2.feed.did == k.did)),
           tmp4 AS (SELECT did_id, rkey, list(feed) AS feeds FROM tmp3 GROUP BY did_id, rkey)
      SELECT tmp1.* EXCLUDE (feeds), tmp4.feeds
      FROM tmp1
               LEFT JOIN tmp4 USING (did_id, rkey)
      ORDER BY (did_id, rkey)) TO 'anon/starter_packs.parquet' (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 22);