What is pushshift.io

This site is maintained by Jason Baumgartner and contains various articles relating to big data, social media ingest and analysis, and general technology trends.

4 thoughts on “What is pushshift.io”

  1. Hi Jason,

    Your website is very popular with Insight data science fellows here in Boston – in no small part due to former fellow John Walk singing the praises of the tremendous effort you have invested in making these data sets publicly available.

    I am working on a project due Friday involving topic modeling of the r/dementia and r/Alzheimers reddit posts to better understand the needs of patients and caregivers. I find that my downloads from files.pushshift.io are rate limited to ~150KB/s, which seems very reasonable given the enormous amount of traffic you have to handle.

    I made a donation of $10 already, but I would be happy to pay for a higher download rate. Is that an arrangement that is at all within the realm of possibility?

    Thank you for your consideration,
    Andrew
    (607) 351-6341

    P.S. I did spend quite a bit of time trying to download the past year of posts to r/dementia and r/Alzheimers using BigQuery but ran into trouble. For example, running

    SELECT * FROM [pushshift:[email protected]]
    WHERE subreddit = 'Alzheimers';

    returned 876 submissions/comments. I expected the total number to be much higher based on the volume of posts to r/Alzheimers.
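    An alternative to BigQuery for a small number of subreddits is to filter the monthly Pushshift dump files locally. The sketch below is a minimal, hedged example assuming the dumps are bz2-compressed newline-delimited JSON in which each record carries a "subreddit" field; the function name and file path are illustrative, not part of any official tooling.

    ```python
    import bz2
    import json

    def filter_comments(path, subreddits):
        """Yield records from a bz2 ndjson dump whose subreddit matches.

        Assumes each line is a JSON object with a "subreddit" field,
        as in the Pushshift RC_YYYY-MM.bz2 monthly comment dumps.
        """
        wanted = {s.lower() for s in subreddits}
        with bz2.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                record = json.loads(line)
                if record.get("subreddit", "").lower() in wanted:
                    yield record
    ```

    For example, `filter_comments("RC_2017-01.bz2", ["dementia", "Alzheimers"])` would stream only the matching records, so the full dump never needs to fit in memory.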

  2. Hi Jason Baumgartner,

    I am a PhD student at Dalhousie University, Canada. I published a paper at COLING 2016 (coling2016.anlp.jp) that presented a newly created n-gram temporal corpus and its applications. I provide all the corpus data for HTTP download here: https://web.cs.dal.ca/~anh/?page_id=1699. I would also like to provide an API so that users can query and visualize those n-grams in reasonable time (like the Google Books Ngram Viewer). Given the size of the corpus (more than 3 TB), this will require a lot of computing resources. I have used BigQuery before and believe it is a good candidate for this. I see that you have already put all Reddit comments into Google BigQuery. I am looking for a Google BigQuery grant so that I can load my n-gram corpus data into BigQuery. Please let me know if you could point me in the right direction.

    Thanks.

