Bluesky Research
We are a team of researchers performing measurements on Bluesky.
What is this?
The pitfalls of centralized social networks, such as Facebook and Twitter/X, have led to concerns about control, transparency, and accountability. Decentralized social networks have emerged as a result, with the goal of empowering users. In contrast to alternative approaches (e.g., Mastodon), Bluesky decomposes the key functions of the platform into subcomponents and opens them up so that they can be provided by third-party stakeholders.
We investigate this novel architecture of Bluesky, measure the network, describe the components, and look into the effects this all has on the users.
Who are you?
If you want, contact us on Bluesky:
- Leonhard Balduf: leobalduf.bsky.social
- Saidu Sokoto: bibo7086.bsky.social
- Dr Michał Król: harnen.bsky.social
- Onur Ascigil: asonur.bsky.social
- Gareth Tyson: garethtyson.bsky.social
- Ignacio Castro: ignactro.bsky.social
- Andrea Baronchelli: baronca.bsky.social
We are associated with the Communication Networks Lab at TU Darmstadt, Germany; the School of Science and Technology at City, University of London, UK; the School of Computing and Communications at Lancaster University, UK; the School of Electronic Engineering and Computer Science at Queen Mary, University of London, UK; the Hong Kong University of Science and Technology (GZ), China; the University of Grenoble, Alpes, Ensimag, France; and the former Trust in Distributed Systems research group at the Weizenbaum Institute, Germany.
Papers
IMC ’24: Looking AT the Blue Skies of Bluesky
In our IMC ’24 paper we look at the overall structure, datasets, and growth of Bluesky. We conduct the first large-scale analysis of this novel microblogging platform. We collect a comprehensive dataset covering all the key elements of Bluesky, up to April 2024, covering about 5.5M users, 225M posts, 40k Feed Generators, and 62 Labelers. We study the uptake of the functionalities that Bluesky opens to third parties. Our findings show substantial uptake of content curation related functionalities.
Please cite as such:
@inproceedings{balduf2024bluesky,
author = {Balduf, Leonhard and Sokoto, Saidu and Ascigil, Onur and Tyson, Gareth and Scheuermann, Bj\"{o}rn and Korczy\'{n}ski, Maciej and Castro, Ignacio and Kr\'{o}l, Micha{\l}},
title = {Looking AT the Blue Skies of Bluesky},
year = {2024},
url = {https://doi.org/10.1145/3646547.3688407},
doi = {10.1145/3646547.3688407},
booktitle = {Proceedings of the 2024 ACM on Internet Measurement Conference},
}
Replication
In order to replicate this work, you need to obtain Firehose updates, Labelers, Feed Generators, and the DID documents of every user. We make most of the tooling for this public; please see below.
(under review): Bootstrapping Social Networks: Lessons from Bluesky Starter Packs
In our 2025 paper we look at Starter Packs and their impact. We curate a complete dataset up to the end of 2024, with 25M users and 335k Starter Packs with 1.7M members. We identify follows resulting from starter packs and confirm that starter packs help users bootstrap their social network.
Please cite as such:
@misc{balduf2025bootstrappingsocialnetworks,
title={Bootstrapping Social Networks: Lessons from Bluesky Starter Packs},
author={Leonhard Balduf and Saidu Sokoto and Onur Ascigil and Gareth Tyson and Ignacio Castro and Andrea Baronchelli and George Pavlou and Björn Scheuermann and Michał Król},
year={2025},
eprint={2501.11605},
archivePrefix={arXiv},
primaryClass={cs.SI},
url={https://arxiv.org/abs/2501.11605},
}
Replication
In order to replicate this work, you need (in addition to everything from the previous work) snapshots of the entire network. The tools for this are open-source, see below.
In order to match multi-follow operations extracted from the Firehose to starter packs, you need:
- The state of every starter pack at every point in time, which can be realized through Firehose updates.
- Multi-follow operations extracted from the Firehose.
- A tool to intersect them with high performance, which we open source here.
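As a rough illustration of the matching step (not the tool linked above), the sketch below compares the set of accounts followed in one multi-follow operation with the member set of each starter pack. The `MultiFollow` structure, the overlap measure, and the `min_overlap` threshold are all hypothetical simplifications. In particular, it ignores that pack membership changes over time and that a user may already follow some members.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiFollow:
    """A burst of follows created by one account (hypothetical structure)."""
    actor: str
    targets: frozenset


def best_matching_pack(op, packs, min_overlap=0.8):
    """Return (pack_id, overlap) for the pack whose members are best covered
    by the followed accounts, or None if no pack reaches min_overlap.

    min_overlap is an illustrative threshold, not a value from the paper.
    """
    best = None
    for pack_id, members in packs.items():
        if not members:
            continue
        # Fraction of the pack's members that this operation followed.
        overlap = len(op.targets & members) / len(members)
        if overlap >= min_overlap and (best is None or overlap > best[1]):
            best = (pack_id, overlap)
    return best


# Toy data with made-up identifiers.
packs = {
    "pack-a": frozenset({"did:plc:a", "did:plc:b", "did:plc:c"}),
    "pack-b": frozenset({"did:plc:x", "did:plc:y"}),
}
op = MultiFollow(
    actor="did:plc:user",
    targets=frozenset({"did:plc:a", "did:plc:b", "did:plc:c", "did:plc:z"}),
)
match = best_matching_pack(op, packs)
print(match)
```

At the scale of the full network, a naive loop over all packs per operation would likely be too slow, which is presumably why a dedicated high-performance tool is needed.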
Services
Mirror of plc.directory
We run a mirror of plc.directory. This is implemented using bsky-watch/plc-mirror. The mirror follows the export log of the PLC directory and replays the operations onto a local database. The exposed API implements DID lookups only, no audit log, export, etc. If the database lags behind the latest operations by too much, the server refuses to handle requests.
There are currently no rate limits and no uptime guarantees. Please be considerate, though: We just run this for fun and might stop if it ever gets too expensive.
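The replay-and-refuse behaviour described above can be sketched as follows. This is not the code of bsky-watch/plc-mirror. The entry fields (`did`, `operation`, `nullified`, `createdAt`) reflect our understanding of the PLC export format and should be treated as assumptions, and `MAX_LAG` is a hypothetical threshold. The sketch also approximates lag by comparing the newest replayed timestamp with the current time.

```python
import json
from datetime import datetime, timedelta, timezone

MAX_LAG = timedelta(minutes=5)  # hypothetical threshold


def parse_ts(value):
    # Older Python versions reject a trailing "Z" in fromisoformat().
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


class Mirror:
    """Keeps the latest non-nullified operation per DID (simplified)."""

    def __init__(self):
        self.latest_op = {}
        self.last_seen = None

    def replay(self, export_lines):
        # Each line is assumed to be one JSON-encoded export entry.
        for line in export_lines:
            entry = json.loads(line)
            if entry.get("nullified"):
                continue
            self.latest_op[entry["did"]] = entry["operation"]
            self.last_seen = parse_ts(entry["createdAt"])

    def lookup(self, did, now):
        # Refuse to answer if the local state is too far behind.
        if self.last_seen is None or now - self.last_seen > MAX_LAG:
            raise RuntimeError("mirror is too far behind")
        return self.latest_op.get(did)


# Made-up entries for a made-up DID.
lines = [
    json.dumps({"did": "did:plc:example", "nullified": False,
                "operation": {"alsoKnownAs": ["at://old.example"]},
                "createdAt": "2024-04-01T12:00:00.000Z"}),
    json.dumps({"did": "did:plc:example", "nullified": False,
                "operation": {"alsoKnownAs": ["at://new.example"]},
                "createdAt": "2024-04-01T12:01:00.000Z"}),
]
mirror = Mirror()
mirror.replay(lines)
fresh = datetime(2024, 4, 1, 12, 3, tzinfo=timezone.utc)
result = mirror.lookup("did:plc:example", fresh)
print(result)
```

A real mirror additionally has to turn the stored operations into a DID document before serving it; that step is omitted here.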
You can query the mirror at plc-mirror.bsky.leobalduf.com for a DID with:
curl -s "https://plc-mirror.bsky.leobalduf.com/<did>"
For example:
curl -s "https://plc-mirror.bsky.leobalduf.com/did:plc:nggqjgdkqhytcag6x7fhiyuv" | jq .
which should return something like:
{
  "@context": [
    "https://www.w3.org/ns/did/v1",
    "https://w3id.org/security/multikey/v1",
    "https://w3id.org/security/suites/secp256k1-2019/v1"
  ],
  "alsoKnownAs": [
    "at://leobalduf.bsky.social"
  ],
  "id": "did:plc:nggqjgdkqhytcag6x7fhiyuv",
  "service": [
    {
      "id": "#atproto_pds",
      "serviceEndpoint": "https://verpa.us-west.host.bsky.network",
      "type": "AtprotoPersonalDataServer"
    }
  ],
  "verificationMethod": [
    {
      "controller": "did:plc:nggqjgdkqhytcag6x7fhiyuv",
      "id": "did:plc:nggqjgdkqhytcag6x7fhiyuv#atproto",
      "publicKeyMultibase": "zQ3shoMrXu5Yx1xjeHdpSD94TxYFZPKK64Fs52BE2jqKWxvnh",
      "type": "Multikey"
    }
  ],
}
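A response like the one above can be processed with a few lines of Python. The following is a minimal sketch that extracts the handle and the PDS endpoint from an abbreviated copy of the example document. The helper names `handle_of` and `pds_endpoint_of` are our own and not part of any library.

```python
import json

# Abbreviated copy of the example DID document shown above.
did_document = json.loads("""
{
  "alsoKnownAs": ["at://leobalduf.bsky.social"],
  "id": "did:plc:nggqjgdkqhytcag6x7fhiyuv",
  "service": [
    {
      "id": "#atproto_pds",
      "serviceEndpoint": "https://verpa.us-west.host.bsky.network",
      "type": "AtprotoPersonalDataServer"
    }
  ]
}
""")


def handle_of(doc):
    """Return the first at:// alias without its scheme, or None."""
    for alias in doc.get("alsoKnownAs", []):
        if alias.startswith("at://"):
            return alias[len("at://"):]
    return None


def pds_endpoint_of(doc):
    """Return the endpoint of the #atproto_pds service entry, or None."""
    for service in doc.get("service", []):
        if service.get("id") == "#atproto_pds":
            return service.get("serviceEndpoint")
    return None


print(handle_of(did_document))        # leobalduf.bsky.social
print(pds_endpoint_of(did_document))  # https://verpa.us-west.host.bsky.network
```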
Datasets
We collect a number of datasets for our research, which we outline below. We are able to share part of these datasets for research purposes upon request. Please see our privacy policy for contact information and details about the data collected.
Firehose Logs
We have been operating a Firehose logger since April 2024. Starting from June 2024, we also log block data, which allows us to subsequently extract repo commits and their data. The data collection code is publicly available on GitHub.
We collect all types of Firehose events, which include identity updates, service notifications, and repository commits. The latter contains update operations to a user’s repository as a diff from a previous revision.
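To illustrate what "a diff from a previous revision" means in practice, the sketch below applies the operations of a commit to a simplified view of a repository, modelled as a mapping from record path to record CID. It assumes each operation carries an `action` (`create`, `update`, or `delete`), a `path`, and a `cid`. The record keys and CIDs are made up, and this is not our collection code.

```python
def apply_ops(records, ops):
    """Apply one commit's operations to a {path: cid} view of a repository.

    Returns a new mapping and leaves the input untouched.
    """
    updated = dict(records)
    for op in ops:
        action = op["action"]
        if action in ("create", "update"):
            updated[op["path"]] = op["cid"]
        elif action == "delete":
            updated.pop(op["path"], None)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return updated


# Two toy commits: a follow is created, then deleted again.
commit_1 = [{"action": "create", "path": "app.bsky.graph.follow/rkey1", "cid": "cid-1"},
            {"action": "create", "path": "app.bsky.feed.post/rkey2", "cid": "cid-2"}]
commit_2 = [{"action": "delete", "path": "app.bsky.graph.follow/rkey1", "cid": None}]

state = apply_ops({}, commit_1)
state = apply_ops(state, commit_2)
print(state)  # {'app.bsky.feed.post/rkey2': 'cid-2'}
```

Replaying commits in order like this is one way to reconstruct the state of a record collection at a given point in time.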
Due to the sensitive nature of this data, we cannot make it available publicly. However, we may be able to share parts of it for research purposes – please get in touch.
Labeler Logs
We are subscribed to every known Labeler endpoint and log all labels produced by them. The code for this is also publicly available on GitHub.
We can probably share this data for research purposes – please get in touch.
PLC Directory Mirror
As outlined above, we operate a mirror of plc.directory. This allows us to export a snapshot of all registered DIDs and their DID documents.
You can use the mirror free of charge. We can probably also share a snapshot for research purposes – please get in touch.
Full-Network Mirror
We operate a mirror of the entire network using uabluerail/ipfs-indexer. This allows us to export snapshots of, e.g. the social graph, all posts, etc.
Similar to the Firehose data, we cannot make this data available publicly. However, we may be able to share parts of it for research purposes – please get in touch.
Feed Generator Output
We compile a complete list of all Feed Generators in the network by analyzing repository data and real-time updates from Firehose logs. Each Feed Generator is identified by its DID and associated endpoint. We retrieve metadata for each Feed Generator using the getFeedGenerator API of the Bluesky AT Protocol.
For each Feed Generator, we retrieve new posts daily. This involves querying feeds using the FeedGetFeed API and saving any new, unrecorded posts to our database. The code for doing this is publicly available on GitHub.
- The metadata for Feed Generators, the list of Feed Generators, and the posts (although massive) are available upon request for research purposes – please get in touch.
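The "save only new, unrecorded posts" step of the daily crawl can be sketched with an in-memory SQLite database. The table layout, the function name `store_new_posts`, and the choice to deduplicate per feed are assumptions made for illustration; fetching the feed itself is left out, and the URIs are shortened placeholders.

```python
import sqlite3


def store_new_posts(conn, feed_uri, post_uris):
    """Insert unseen (feed, post) pairs and return how many were new."""
    before = conn.total_changes
    # The primary key makes SQLite skip pairs that are already recorded.
    conn.executemany(
        "INSERT OR IGNORE INTO feed_posts (feed_uri, post_uri) VALUES (?, ?)",
        [(feed_uri, uri) for uri in post_uris],
    )
    conn.commit()
    return conn.total_changes - before


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feed_posts ("
    "feed_uri TEXT, post_uri TEXT, PRIMARY KEY (feed_uri, post_uri))"
)

feed = "at://feed-generator/example"  # placeholder feed URI
day_1 = store_new_posts(conn, feed, ["at://post/1", "at://post/2"])
day_2 = store_new_posts(conn, feed, ["at://post/2", "at://post/3"])
print(day_1, day_2)  # 2 1
```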
Anonymized Datasets
From the above, we derive a few anonymized datasets, which we make available below.
Social Graph Snapshot(s)
Methodology: We take the entire follower graph, remove self-loops and duplicate edges, and assign each DID a numerical ID.
Format: src_id, dst_id
2025-01-01: parquet
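The methodology above can be illustrated with a small sketch that drops self-loops and duplicate edges and maps each DID to a numerical ID. Assigning IDs in order of first appearance is an assumption made here for simplicity; the published snapshot may assign them differently. The DIDs are made up.

```python
def anonymize_edges(follows):
    """Turn (src_did, dst_did) pairs into deduplicated (src_id, dst_id) pairs.

    Self-loops are dropped. IDs are assigned in order of first appearance,
    which is an illustrative choice only.
    """
    ids = {}
    edges = set()
    for src, dst in follows:
        if src == dst:
            continue  # remove self-loops
        src_id = ids.setdefault(src, len(ids))
        dst_id = ids.setdefault(dst, len(ids))
        edges.add((src_id, dst_id))  # the set removes duplicate edges
    return sorted(edges), ids


follows = [
    ("did:plc:a", "did:plc:b"),
    ("did:plc:a", "did:plc:b"),  # duplicate edge
    ("did:plc:b", "did:plc:b"),  # self-loop
    ("did:plc:b", "did:plc:a"),
]
edges, ids = anonymize_edges(follows)
print(edges)  # [(0, 1), (1, 0)]
```

The resulting pairs correspond to the src_id, dst_id columns of the snapshot format. The DID-to-ID mapping must be kept private for the dataset to stay anonymized.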
Block Graph Snapshot(s)
Format: src_id, dst_id
- TODO
Privacy Policy
Since our work includes collecting data of potentially real humans, we wrote ourselves a privacy policy.