Bluesky Research

We are a team of researchers performing measurements on Bluesky.

What is this?

The pitfalls of centralized social networks, such as Facebook and Twitter/X, have led to concerns about control, transparency, and accountability. Decentralized social networks have emerged in response, with the goal of empowering users. In contrast to alternative approaches (e.g., Mastodon), Bluesky decomposes the key functions of the platform into subcomponents and opens them up so that third-party stakeholders can provide them.

We investigate this novel architecture of Bluesky, measure the network, describe its components, and examine how all of this affects users.

Who are you?

If you want, contact us on Bluesky:

We are associated with the Communication Networks Lab at TU Darmstadt, Germany; the School of Science and Technology at City, University of London, UK; the School of Computing and Communications at Lancaster University, UK; the School of Electronic Engineering and Computer Science at Queen Mary, University of London, UK; the Hong Kong University of Science and Technology (GZ), China; the University of Grenoble, Alpes, Ensimag, France; and the former Trust in Distributed Systems research group at the Weizenbaum Institute, Germany.

Papers

IMC ’24: Looking AT the Blue Skies of Bluesky

In our IMC ’24 paper we conduct the first large-scale analysis of this novel microblogging platform, looking at the overall structure, datasets, and growth of Bluesky. We collect a comprehensive dataset of all key elements of Bluesky up to April 2024, covering about 5.5M users, 225M posts, 40k Feed Generators, and 62 Labelers. We study the uptake of the functionalities that Bluesky opens to third parties. Our findings show substantial uptake of content-curation functionalities.

Please cite as such:

@inproceedings{balduf2024bluesky,
    author = {Balduf, Leonhard and Sokoto, Saidu and Ascigil, Onur and Tyson, Gareth and Scheuermann, Bj\"{o}rn and Korczy\'{n}ski, Maciej and Castro, Ignacio and Kr\'{o}l,  Micha{\l}},
    title = {Looking AT the Blue Skies of Bluesky},
    year = {2024},
    url = {https://doi.org/10.1145/3646547.3688407},
    doi = {10.1145/3646547.3688407},
    booktitle = {Proceedings of the 2024 ACM on Internet Measurement Conference},
}

Replication

To replicate this work, you need to obtain Firehose updates, Labelers, Feed Generators, and the DID documents of every user. We make most of the tooling for this public; please see below.

(under review): Bootstrapping Social Networks: Lessons from Bluesky Starter Packs

In our 2025 paper we look at Starter Packs and their impact. We curate a complete dataset up to the end of 2024, with 25M users and 335k Starter Packs with 1.7M members. We identify follows resulting from starter packs and confirm that starter packs help users bootstrap their social network.

Please cite as such:

@misc{balduf2025bootstrappingsocialnetworks,
    title={Bootstrapping Social Networks: Lessons from Bluesky Starter Packs},
    author={Leonhard Balduf and Saidu Sokoto and Onur Ascigil and Gareth Tyson and Ignacio Castro and Andrea Baronchelli and George Pavlou and Björn Scheuermann and Michał Król},
    year={2025},
    eprint={2501.11605},
    archivePrefix={arXiv},
    primaryClass={cs.SI},
    url={https://arxiv.org/abs/2501.11605},
}

Replication

To replicate this work, you need (in addition to everything from the previous work) snapshots of the entire network. The tools for this are open source; see below.

To match multi-follow operations extracted from the Firehose to starter packs, you need:

  • The state of every starter pack at every point in time, which can be realized through Firehose updates.
  • Multi-follow operations extracted from the Firehose.
  • A tool to intersect them with high performance, which we open source here.
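
The matching step in the list above can be illustrated with a minimal Python sketch. It is not the tool mentioned in the list: the function name `best_matching_pack`, the overlap measure, and the `min_overlap` threshold are all hypothetical and chosen only for illustration. It assumes `packs` holds the member set of every starter pack as it was at the time of the multi-follow operation.

```python
from typing import AbstractSet, Dict, Optional, Tuple


def best_matching_pack(
    followed: AbstractSet[str],
    packs: Dict[str, AbstractSet[str]],
    min_overlap: float = 0.8,  # hypothetical threshold, not taken from the paper
) -> Optional[Tuple[str, float]]:
    """Return (pack URI, overlap) for the pack that covers the follows best.

    `followed` is the set of DIDs followed in one multi-follow operation.
    Overlap is the fraction of followed DIDs that are members of the pack.
    Returns None if no pack reaches `min_overlap`.
    """
    if not followed:
        return None
    best: Optional[Tuple[str, float]] = None
    for uri, members in packs.items():
        overlap = len(followed & members) / len(followed)
        if overlap >= min_overlap and (best is None or overlap > best[1]):
            best = (uri, overlap)
    return best
```

A linear scan over all packs is enough to show the idea. At the scale described above (335k Starter Packs), an inverted index from member DID to pack would probably be needed to keep this fast.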

Services

Mirror of plc.directory

We run a mirror of plc.directory. This is implemented using bsky-watch/plc-mirror. The mirror follows the export log of the PLC directory and replays the operations onto a local database. The exposed API implements DID lookups only; it does not provide the audit log, export, or other endpoints. If the database lags too far behind the latest operations, the server refuses to handle requests.

There are currently no rate limits and no uptime guarantees. Please be considerate, though: We just run this for fun and might stop if it ever gets too expensive.

You can look up a DID on the mirror at plc-mirror.bsky.leobalduf.com like this:

curl -s "https://plc-mirror.bsky.leobalduf.com/<did>"

For example:

curl -s "https://plc-mirror.bsky.leobalduf.com/did:plc:nggqjgdkqhytcag6x7fhiyuv" | jq .

which should return something like:

{
  "@context": [
    "https://www.w3.org/ns/did/v1",
    "https://w3id.org/security/multikey/v1",
    "https://w3id.org/security/suites/secp256k1-2019/v1"
  ],
  "alsoKnownAs": [
    "at://leobalduf.bsky.social"
  ],
  "id": "did:plc:nggqjgdkqhytcag6x7fhiyuv",
  "service": [
    {
      "id": "#atproto_pds",
      "serviceEndpoint": "https://verpa.us-west.host.bsky.network",
      "type": "AtprotoPersonalDataServer"
    }
  ],
  "verificationMethod": [
    {
      "controller": "did:plc:nggqjgdkqhytcag6x7fhiyuv",
      "id": "did:plc:nggqjgdkqhytcag6x7fhiyuv#atproto",
      "publicKeyMultibase": "zQ3shoMrXu5Yx1xjeHdpSD94TxYFZPKK64Fs52BE2jqKWxvnh",
      "type": "Multikey"
    }
  ]
}
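
The same lookup can be sketched in Python. This is a minimal example that assumes the mirror returns a DID document shaped like the one above. The helper names `fetch_did_document`, `pds_endpoint`, and `handles` are hypothetical and not part of any tooling mentioned here.

```python
import json
import urllib.request
from typing import List, Optional

MIRROR = "https://plc-mirror.bsky.leobalduf.com"


def fetch_did_document(did: str, base_url: str = MIRROR) -> dict:
    """Fetch the DID document for `did` from the mirror (needs network access)."""
    with urllib.request.urlopen(f"{base_url}/{did}", timeout=10) as resp:
        return json.load(resp)


def pds_endpoint(doc: dict) -> Optional[str]:
    """Return the Personal Data Server endpoint listed in the document, if any."""
    for service in doc.get("service", []):
        if service.get("type") == "AtprotoPersonalDataServer":
            return service.get("serviceEndpoint")
    return None


def handles(doc: dict) -> List[str]:
    """Return the handles in alsoKnownAs, with the at:// prefix stripped."""
    prefix = "at://"
    return [
        aka[len(prefix):]
        for aka in doc.get("alsoKnownAs", [])
        if aka.startswith(prefix)
    ]


if __name__ == "__main__":
    doc = fetch_did_document("did:plc:nggqjgdkqhytcag6x7fhiyuv")
    print(handles(doc), pds_endpoint(doc))
```

Since the mirror has no uptime guarantees, callers may want to catch `urllib.error.URLError` and fall back to another source.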

Datasets

We collect a number of datasets for our research, which we outline below. We are able to share part of these datasets for research purposes upon request. Please see our privacy policy for contact information and details about the data collected.

Firehose Logs

We have been operating a Firehose logger since April 2024. Starting from June 2024, we also log block data, which allows us to subsequently extract repo commits and their data. The data collection code is publicly available on GitHub.

We collect all types of Firehose events, which include identity updates, service notifications, and repository commits. The latter contains update operations to a user’s repository as a diff from a previous revision.

Due to the sensitive nature of this data, we cannot make it available publicly. However, we may be able to share parts of it for research purposes – please get in touch.

Labeler Logs

We are subscribed to every known Labeler endpoint and log all labels produced by them. The code for this is also publicly available on GitHub.

We can probably share this data for research purposes – please get in touch.

PLC Directory Mirror

As outlined above, we operate a mirror of plc.directory. This allows us to export a snapshot of all registered DIDs and their DID documents.

You can use the mirror free of charge. We can probably also share a snapshot for research purposes – please get in touch.

Full-Network Mirror

We operate a mirror of the entire network using uabluerail/ipfs-indexer. This allows us to export snapshots of, e.g., the social graph and all posts.

Similar to the Firehose data, we cannot make this data available publicly. However, we may be able to share parts of it for research purposes – please get in touch.

Feed Generator Output

We compile a complete list of all Feed Generators in the network by analyzing repository data and real-time updates from Firehose logs. Each Feed Generator is identified by its DID and associated endpoint. We retrieve metadata for each Feed Generator using the getFeedGenerator API of the Bluesky AT Protocol.

For each Feed Generator, we retrieve new posts daily. This involves querying feeds using the FeedGetFeed API and saving any new, unrecorded posts to our database. The code for doing this is publicly available on GitHub.
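
The daily retrieval described above can be sketched in Python as follows. This is an illustration, not the published collection code. The names `collect_new_posts`, `fetch_page`, and `max_pages` are hypothetical, and the sketch assumes each response page has the shape `{"feed": [{"post": {"uri": ...}}, ...], "cursor": ...}`. The network call is passed in as a function so that the deduplication logic can be run offline.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical fetcher: takes (feed URI, cursor or None) and returns one page.
FetchPage = Callable[[str, Optional[str]], Dict]


def collect_new_posts(
    feed_uri: str,
    fetch_page: FetchPage,
    seen: Set[str],
    max_pages: int = 10,  # hypothetical safety limit on pagination
) -> List[str]:
    """Page through a feed and return post URIs not recorded before.

    `seen` is the set of already recorded post URIs; it is updated in place.
    """
    new_uris: List[str] = []
    cursor: Optional[str] = None
    for _ in range(max_pages):
        page = fetch_page(feed_uri, cursor)
        for item in page.get("feed", []):
            uri = item["post"]["uri"]
            if uri not in seen:
                seen.add(uri)
                new_uris.append(uri)
        cursor = page.get("cursor")
        if not cursor:  # no further pages
            break
    return new_uris
```

In a real deployment, `seen` would be backed by the database rather than an in-memory set.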

  • The metadata for Feed Generators, the list of Feed Generators, and the posts (although massive) are available upon request for research purposes – please get in touch.

Anonymized Datasets

From the above, we derive a few anonymized datasets, which we make available below.

Social Graph Snapshot(s)

Methodology: We take the entire follower graph, remove self-loops and duplicate edges, and assign each DID a numerical ID.

Format: src_id, dst_id
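
The methodology above can be sketched in Python as follows. This is an illustration rather than the code used to produce the snapshots. The function name `anonymize_follow_graph` is hypothetical, and assigning IDs in first-seen order is an assumption, since the methodology does not say how numerical IDs are chosen.

```python
from typing import Dict, Iterable, List, Set, Tuple


def anonymize_follow_graph(
    edges: Iterable[Tuple[str, str]],
) -> Tuple[List[Tuple[int, int]], Dict[str, int]]:
    """Drop self-loops and duplicate edges, then map each DID to a numerical ID.

    Returns the (src_id, dst_id) edge list and the DID-to-ID mapping.
    IDs are assigned in first-seen order (an assumption of this sketch).
    """
    did_to_id: Dict[str, int] = {}
    seen: Set[Tuple[str, str]] = set()
    out: List[Tuple[int, int]] = []
    for src, dst in edges:
        if src == dst:  # self-loop
            continue
        if (src, dst) in seen:  # duplicate edge
            continue
        seen.add((src, dst))
        src_id = did_to_id.setdefault(src, len(did_to_id))
        dst_id = did_to_id.setdefault(dst, len(did_to_id))
        out.append((src_id, dst_id))
    return out, did_to_id
```

The returned mapping would have to stay private, since publishing it would undo the anonymization. Only the edge list matches the `src_id, dst_id` format.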

Block Graph Snapshot(s)

Format: src_id, dst_id

  • TODO

Privacy Policy

Since our work includes collecting data of potentially real humans, we wrote ourselves a privacy policy.