This site is maintained by Jason Baumgartner and contains articles on big data, social media ingestion and analysis, and general technology trends.

2 thoughts on “What is pushshift.io”

  1. Hi Jason,

    Your website is very popular with Insight Data Science fellows here in Boston – in no small part due to former fellow John Walk singing the praises of the tremendous effort you have invested in making these data sets publicly available.

    I am working on a project due Friday involving topic modeling of posts to r/dementia and r/Alzheimers to better understand the needs of patients and caregivers. I find that my downloads from files.pushshift.io are rate-limited to ~150 KB/s, which seems very reasonable given the enormous amount of traffic you have to handle.

    I made a donation of $10 already, but I would be happy to pay for a higher download rate. Is that an arrangement that is at all within the realm of possibility?

    Thank you for your consideration,
    Andrew
    (607) 351-6341

    P.S. I did spend quite a bit of time trying to download the past year of posts to r/dementia and r/Alzheimers using BigQuery but ran into trouble. For example, running

    SELECT * FROM [pushshift:[email protected]]
    WHERE subreddit = 'Alzheimers';

    returned 876 submissions/comments. I expected the total number to be much higher based on the volume of posts to r/Alzheimers.
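
One way around the download rate limit discussed in the comment above is to fetch each monthly dump once and filter it locally. Below is a minimal sketch in Python, assuming a monthly submissions dump in bz2-compressed newline-delimited JSON (the filename RS_2017-08.bz2 is illustrative; later dumps use xz or zstd compression instead, so the decompressor would need to match):

    import bz2
    import json

    # Illustrative filename: one month of submissions, one JSON object per line.
    DUMP = "RS_2017-08.bz2"
    SUBREDDITS = {"dementia", "Alzheimers"}

    kept = 0
    with bz2.open(DUMP, mode="rt", encoding="utf-8") as f, \
            open("filtered.ndjson", "w", encoding="utf-8") as out:
        for line in f:
            post = json.loads(line)
            # Every submission object carries a "subreddit" field.
            if post.get("subreddit") in SUBREDDITS:
                out.write(line)
                kept += 1

    print(f"kept {kept} matching submissions")

Streaming line by line keeps memory use flat regardless of dump size; note that the comparison is case-sensitive, and subreddit names are stored with their original casing.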

  2. Hi Jason Baumgartner,

    I am a PhD student at Dalhousie University, Canada. I published a paper at COLING 2016 (coling2016.anlp.jp) presenting a newly created n-gram temporal corpus and its applications, and I provide all of the corpus data for HTTP download at https://web.cs.dal.ca/~anh/?page_id=1699.

    I also want to provide an API so that users can query and visualize those n-grams in reasonable time (like the Google Books Ngram Viewer). Given the size of the corpus (more than 3 TB), this will require substantial computing resources. I have used BigQuery before and believe it is a good fit. Since you have already put all Reddit comments into Google BigQuery, I am looking for a Google BigQuery grant so that I can load my n-gram corpus there as well. Please let me know if you can point me in the right direction.

    Thanks.
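
On the question of getting a multi-terabyte corpus into BigQuery: the standard route is to stage the files in Google Cloud Storage and run batch load jobs, which are not billed separately. Here is a minimal sketch with the google-cloud-bigquery Python client, assuming the n-grams have been exported as newline-delimited JSON (the project, bucket, dataset, and table names are all placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder staging location and destination table.
    source_uri = "gs://my-ngram-bucket/ngrams-*.json"
    table_id = "my-project.ngrams.temporal_ngrams"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # infer the schema from the JSON keys
    )

    # A single load job can fan out over many files via the wildcard URI.
    load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
    load_job.result()  # wait for completion

    print(f"loaded {client.get_table(table_id).num_rows} rows into {table_id}")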
