How it all Started
Pushshift originally started in the summer/autumn of 2014 while I was chatting with a friend over Google Hangouts about Reddit data. At the time, I was trying to find the most efficient method to ingest Reddit data while working within the limitations and rate limits of the Reddit API. I was collecting data from Reddit’s “new” page and scanning for new submissions. Once a new submission was detected, the script would poll that submission for new comments and ingest comments as they came in. The script worked well, but there was one huge problem — there was no possible way to get ALL public Reddit content using this method. The Reddit API rate limits would prevent making enough calls to get all new submissions and comments as they came in.
While chatting with my friend, he briefly mentioned a Reddit API endpoint called “/api/info” that allowed querying the API by an object’s specific id. While he was discussing his project, something “clicked” in my head and I suddenly had a huge epiphany. Since Reddit ids were simply base-36 representations of an integer, and since they were sequential, it dawned on me that it was possible to ingest all public content posted to Reddit by walking through the ids in order and requesting them from this endpoint.
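To make the idea concrete, here is a minimal sketch of sequential id enumeration. It is an illustration under stated assumptions, not the original script: the helper names (`to_base36`, `fullname_batches`, `info_query`) are hypothetical, the `t1_` prefix is Reddit's type prefix for comment fullnames (`t3_` is used for submissions), and the batch size of 100 matches the per-call limit of /api/info. The sketch only builds the query strings and makes no network calls.

```python
import string

# Base-36 alphabet: digits 0-9 followed by lowercase a-z (36 symbols).
ALPHABET = string.digits + string.ascii_lowercase


def to_base36(n: int) -> str:
    """Convert a non-negative integer to its base-36 string form."""
    if n < 0:
        raise ValueError("ids are non-negative integers")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def from_base36(s: str) -> int:
    """Convert a base-36 id string back to an integer."""
    return int(s, 36)


def fullname_batches(start: int, stop: int, prefix: str = "t1_", size: int = 100):
    """Yield lists of up to `size` fullnames for the ids in [start, stop)."""
    for lo in range(start, stop, size):
        hi = min(lo + size, stop)
        yield [prefix + to_base36(i) for i in range(lo, hi)]


def info_query(batch) -> str:
    """Build the path and query string for one /api/info call (hypothetical helper)."""
    return "/api/info?id=" + ",".join(batch)


if __name__ == "__main__":
    for batch in fullname_batches(0, 250):
        print(len(batch), info_query(batch)[:40] + "...")
```

Because the ids are sequential, remembering the last integer processed is enough to resume an ingest after an interruption.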
At that moment, I quickly started writing code to query this endpoint and place the data in MySQL. The logic was sound and it was working brilliantly. I had discovered the best method for ingesting Reddit data. I ran some numbers in my head and calculated how many API calls it would take to ingest the roughly 1.7 billion comments already made to Reddit. Each call to /api/info allows for up to 100 ids to be queried. Reddit’s rate limit at the time was 30 calls per minute. At 100 ids per call, 1.7 billion comments would take 17 million calls to Reddit’s API. Since I could make 43,200 calls per day (30 calls per minute for 1,440 minutes), that worked out to approximately 394 days to get the data already on Reddit. Of course, during those 394 days, a lot more comments would also be made to Reddit. There were also submissions to ingest, which would add approximately another 60 days of API calls.
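The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch of the arithmetic: the function name `ingest_estimate` and its parameters are hypothetical, and the inputs are the figures quoted in the text (1.7 billion comments, 100 ids per call, 30 calls per minute).

```python
def ingest_estimate(total_items: int, ids_per_call: int = 100, calls_per_minute: int = 30):
    """Return (calls needed, calls per day, days required) for a full ingest."""
    # Ceiling division: a partially filled final batch still costs one call.
    calls_needed = -(-total_items // ids_per_call)
    calls_per_day = calls_per_minute * 60 * 24
    days = calls_needed / calls_per_day
    return calls_needed, calls_per_day, days


if __name__ == "__main__":
    calls, per_day, days = ingest_estimate(1_700_000_000)
    print(calls, per_day, round(days, 1))  # prints: 17000000 43200 393.5
```

The result of about 393.5 days rounds up to the 394 days quoted above. Passing a different `calls_per_minute` shows how directly the estimate scales with the rate limit.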
Around this time, Reddit was rolling out OAuth and encouraging developers to use it. As a way to entice people, they doubled the number of calls that could be made to the API. At this point, 394 days became 197 days, or roughly six and a half months. Once I wrote the script, it went to work and started ingesting data from Reddit. At the time, I was using MySQL (currently, Pushshift.io uses PostgreSQL for its database) and it was quickly filling up with data. I had to purchase some additional drives to handle the amount of data, but the script was working well and I was on track to complete the data ingest in approximately 10 months (this would also include getting submissions and all of the new data that was being posted to Reddit while I was running the initial ingest).
Over Half a Year Later …
After months and months of constantly ingesting data, I had finally caught up with the new data being posted to Reddit. Day and night, the script continued to run, diligently collecting comments and submissions and storing them in the database. It was early July of 2015 when the initial data ingest project was completed. Here I was, probably one of the only people in the world at the time who had almost every publicly available Reddit comment and post. At the time, it was over 250 gigabytes of data (compressed). I wanted to share this with the academic community and other data scientists. What better way to share all this Reddit data than by making a post to Reddit letting the world know that I had this huge corpus available?
I wasn’t sure what would happen after I made the post, but there was only one way to find out — I made this post to Reddit on July 3, 2015.
BOOM! goes the Dynamite
I remember distinctly what happened after I made the post. I was expecting perhaps a couple of comments from data enthusiasts and perhaps a couple of PMs asking for clarification on the data format. After making the post, I went about my day and basically forgot I had even made the submission.
About an hour or so later, my phone started buzzing over and over again. I had a Reddit mobile client installed that would buzz when new PMs or username mentions came in. I logged back into my desktop computer and could not believe it: I had countless PMs from people, journalists asking to interview me about the data, and people telling me that it was a stupid joke and there was no way I had “all of reddit” on my hard drive. Soon, there was discussion on Y Combinator about the legality of what I had done. People were impressed, amused and curious all at once. I started to worry about the repercussions of what I had done, as crossposts were being made to dozens of other subreddits referencing my original submission. The genie was out of the bottle at this point.
I honestly had no idea that there would be such a powerful reaction to my original post. Over the next couple of days, there were articles written about what I had done. Some mentioned that this would be a huge treasure trove of data for scientists and researchers to analyze. Others wrote about the more Orwellian aspects of having this data openly available to everyone. No matter what position the author took, most of the articles seemed to agree that this was a huge deal.
Here is a sampling of the articles that came out in the days following my original submission:
I had to spend a few hours that evening writing and answering questions about the dataset. I received a lot of support from many people who pointed to the huge research potential of this dataset. Eventually Pushshift.io evolved from the initial project, and from that came a full-featured API resource, an SSE stream, search utilities and more. Since that time, dozens of research papers have been written using the dataset. New websites like ceddit and removeddit were suddenly made possible by the Pushshift API.
From the initial project, I realized that there were a lot of applications for big data sources. I became more interested in the potential applications myself, and since then I have been determined to grow Pushshift.io so that more data sources are made available to the academic community and to people who just love to analyze and visualize data.
In the end, I’m glad I took the time to do the initial data ingest. It helped to promote a lot of interesting discussions, and along the way I’ve met a lot of wonderful people who have been extremely helpful while providing a lot of guidance for making Pushshift more useful.