
I'm currently working on scraping a user's past comments on Reddit with PRAW/Python, and I would love to go beyond the 1,000-item upper bound when I make my query.

I've read something about Cloudsearch syntax, where you can filter by timestamp and run the query multiple times, but I couldn't fully digest what's going on there. Can someone shed some light on this? Thanks!

What I'm currently going for:

dh = reddit.redditor(USERNAME)
count = 0
for c in dh.comments.new(limit=None):
    print(c.subreddit)
    count += 1

This would always give me count = 1000...

1 Answer


Reddit's listing pages, the same ones you see when you browse a subreddit or a user's page, are all capped at 1000 items. When an item that belongs to a given listing is added or updated (e.g., voted on), it is inserted at the correct position in that listing, and any items beyond the 1000-item limit are dropped.

Reddit's search is different. While each individual search has a similar 1000-item limit, timestamps can be used to narrow the results. By sorting results newest first and keeping track of the oldest result's timestamp, you can loop through consecutive searches, as in the sketch below.
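As a rough sketch of that loop, the following uses Cloudsearch syntax through PRAW's search. The credentials, subreddit name, and the all_submissions helper are placeholders I've made up for illustration, and the timestamp:0..N query is just one way to express the range:

import praw

# Placeholder credentials; substitute your own script app's values.
reddit = praw.Reddit(client_id='CLIENT_ID',
                     client_secret='CLIENT_SECRET',
                     user_agent='timestamp paging example')

def all_submissions(subreddit_name):
    """Yield submissions from a subreddit by paging backwards in time.

    Each individual search caps out around 1000 results, so after each
    pass the oldest created_utc seen becomes the upper bound of the next
    cloudsearch timestamp range.
    """
    subreddit = reddit.subreddit(subreddit_name)
    upper = 9999999999  # far in the future; the first pass starts from the newest items
    while True:
        batch = list(subreddit.search('timestamp:0..{}'.format(upper),
                                      sort='new', syntax='cloudsearch',
                                      limit=1000))
        if not batch:
            break
        for submission in batch:
            yield submission
        # Step the window below the oldest item just seen. Subtracting 1
        # can skip items posted in the same second; drop the -1 and
        # de-duplicate by id if that matters for your use case.
        upper = int(batch[-1].created_utc) - 1

for submission in all_submissions('learnpython'):
    print(submission.subreddit, submission.title)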

PRAW's submissions method does exactly this: http://praw.readthedocs.io/en/latest/code_overview/models/subreddit.html#praw.models.Subreddit.submissions

Note: search only works on submissions, not comments.
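For reference, a minimal use of that helper might look like this. The credentials, subreddit name, and 30-day window are placeholders, and it assumes a PRAW version that still ships the submissions method from the linked docs:

import time
import praw

# Placeholder credentials; substitute your own script app's values.
reddit = praw.Reddit(client_id='CLIENT_ID',
                     client_secret='CLIENT_SECRET',
                     user_agent='submissions helper example')

# Last 30 days of /r/learnpython; start and end are Unix timestamps.
end = int(time.time())
start = end - 30 * 24 * 60 * 60
for submission in reddit.subreddit('learnpython').submissions(start, end):
    print(submission.title)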

bboe