Add download_carousel #125

stefco · 2021-06-18T19:33:19Z

Description

Commits fb0c968 and b51f725:

Adds a download_carousel method to Post that downloads all media on a carousel post, i.e. a post with multiple images/videos, as raised in #105. Since this is a batch operation, you specify an output directory and a function that computes the filename for each output file, rather than a single output filename. See the method documentation for details.

Also added a couple of supporting methods, of which only parse_carousel_urls is public. It returns the video and image URLs for each item in the carousel, or None if the post is not a carousel. Again, see the docstring for details.
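To illustrate the "output directory plus filename function" idea described above, here is a minimal sketch. It is not the code from this PR: `download_batch`, `name_fn`, `fetch`, and `skip_existing` are hypothetical names chosen for illustration, and the real signature is in the `download_carousel` docstring. The `fetch` callable is passed in so the sketch does not assume any particular HTTP client.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def download_batch(
    urls: Iterable[str],
    out_dir: str,
    name_fn: Callable[[int, str], str],
    fetch: Callable[[str], bytes],
    skip_existing: bool = True,
) -> List[Path]:
    """Write each URL's content to out_dir, naming files via name_fn.

    All names here are hypothetical; this is not the library's API.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, url in enumerate(urls):
        # The caller decides the filename from the item's position and URL.
        target = out / name_fn(index, url)
        if skip_existing and target.exists():
            # Leave files from a previous run untouched.
            continue
        target.write_bytes(fetch(url))
        written.append(target)
    return written
```

A caller would supply something like `lambda i, url: f"{i}{Path(url).suffix}"` as `name_fn`. Taking a function instead of a single filename lets one call name every item in the batch.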

Also added the beginnings of a demo Jupyter notebook.

Fixes #105

Commit e88032e:

Post.get_recent_comments raised a KeyError when a Post was scraped using Selenium or a requests.Session object, because the resulting json_dict is structured slightly differently in those cases. I added an except block that handles this by trying the alternative json_dict schema.
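The fallback pattern described above can be sketched roughly as follows. Apart from `entry_data`, which appears in the KeyError reported in #124, the key names and nesting below are assumptions made for illustration, and `recent_comments` is a hypothetical helper rather than the method changed in this commit.

```python
def recent_comments(json_dict):
    """Return the comments list from either of two JSON layouts.

    Hypothetical helper; the key paths are illustrative assumptions.
    """
    try:
        # Layout assumed for a plain scrape: data nested under "entry_data".
        media = json_dict["entry_data"]["PostPage"][0]["graphql"]["shortcode_media"]
    except KeyError:
        # Layout assumed for Selenium / requests.Session: no "entry_data" wrapper.
        media = json_dict["graphql"]["shortcode_media"]
    return media["comments"]
```

Catching only KeyError keeps unrelated failures visible. If neither layout matches, the second lookup still raises.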

Fixes #124

Commit f443435:

Add Profile.iter_posts to get a lazy iterator over posts, and reimplement Profile.get_posts (with the same API) using iter_posts.
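As a rough sketch of how a list-returning method can be rebuilt on top of a lazy iterator without changing its API: `ProfileSketch`, its `amount` parameter, and the dict stand-ins for posts are all hypothetical and only mirror the shape of the change, not the actual Profile implementation.

```python
from itertools import islice
from typing import Iterator, List, Optional


class ProfileSketch:
    """Hypothetical stand-in for Profile; posts are plain dicts here."""

    def __init__(self, shortcodes: List[str]) -> None:
        self._shortcodes = shortcodes
        self.built = 0  # counts how many post objects were actually created

    def iter_posts(self, amount: Optional[int] = None) -> Iterator[dict]:
        # Generator: each post is built only when the caller asks for it.
        for code in islice(self._shortcodes, amount):
            self.built += 1
            yield {"shortcode": code}

    def get_posts(self, amount: Optional[int] = None) -> List[dict]:
        # Same return type as before, now a thin wrapper over iter_posts.
        return list(self.iter_posts(amount))
```

With this shape, a caller that only needs the first few posts can stop iterating early, while existing callers of `get_posts` still receive a full list.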

Fixes #127

Checklist

I followed the guidelines in our Contributing document
I added an explanation of my changes
I have written new tests for my changes, as applicable
I successfully ran tests with my changes locally

Additional notes (optional)

I have not written automated tests yet, but will do so soon.

kyrlon · 2021-12-30T22:39:44Z

I tried to run the newly added download_carousel method on the following Google Instagram post, but ran into a TypeError.

from instascrape import *
from pathlib import Path

def insta_scrape(ig_links):
    for link in ig_links:
        post = link.split("?")[0] if "copy_link" in link else link
        post_folder = Path(post.split("/")[-2])
        post_folder.mkdir(parents=True, exist_ok=True)

        google_post = Post(post)
        google_post.download_carousel(str(post_folder), allow_non_carousel=True)


if __name__ == "__main__":
    link_list = list()   
    ig_link = "https://www.instagram.com/p/CXuAeZ1ltCa/?utm_source=ig_web_copy_link"
    link_list.append(ig_link)
    insta_scrape(link_list)

Running the code above, I get the following traceback:

$ py insta_scrape.py 
Traceback (most recent call last):
  File "insta_scrape.py", line 18, in <module>
    insta_scrape(link_list)
  File "insta_scrape.py", line 11, in insta_scrape
    google_post.download_carousel(str(post_folder), allow_non_carousel=True)
  File "...\insta_download\py_venv\lib\site-packages\instascrape\scrapers\post.py", line 213, in download_carousel  
    urls = self.parse_carousel_urls()
  File "..\insta_download\py_venv\lib\site-packages\instascrape\scrapers\post.py", line 157, in parse_carousel_urls
    is_videos = self._filter_get(self.flat_json_dict, self._IS_VIDEO_KEYS)
  File "..\insta_download\py_venv\lib\site-packages\instascrape\scrapers\post.py", line 133, in _filter_get        
    return [(k, dic[k]) for k in keys if k in dic]
  File "..\insta_download\py_venv\lib\site-packages\instascrape\scrapers\post.py", line 133, in <listcomp>
    return [(k, dic[k]) for k in keys if k in dic]
TypeError: argument of type 'NoneType' is not iterable
(py_venv)

Stepping through with the debugger, I noticed that self.flat_json_dict is None. I am not sure whether anyone else has come across this error.

kyrlon · 2021-12-30T22:59:39Z

I had forgotten to call the scrape method first. With the added google_post.scrape() line, download_carousel now works as intended.

from instascrape import *
from pathlib import Path

def insta_scrape(ig_links):
    for link in ig_links:
        post = link.split("?")[0] if "copy_link" in link else link
        post_folder = Path(post.split("/")[-2])
        post_folder.mkdir(parents=True, exist_ok=True)

        google_post = Post(post)
        google_post.scrape()
        google_post.download_carousel(str(post_folder), allow_non_carousel=True)


if __name__ == "__main__":
    link_list = list()   
    ig_link = "https://www.instagram.com/p/CXuAeZ1ltCa/?utm_source=ig_web_copy_link"
    link_list.append(ig_link)
    insta_scrape(link_list)
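For reference, the TypeError in the traceback comes from testing `k in dic` while `dic` is None. The sketch below reproduces that comprehension in isolation and shows one possible guard that fails with a clearer message. `filter_get` copies the line shown in the traceback; `safe_filter_get` and its error message are hypothetical and not part of instascrape.

```python
def filter_get(dic, keys):
    # Same comprehension as in the traceback; raises TypeError if dic is None.
    return [(k, dic[k]) for k in keys if k in dic]


def safe_filter_get(dic, keys):
    """Hypothetical guard: explain the likely cause instead of a bare TypeError."""
    if dic is None:
        raise RuntimeError(
            "no scraped data available; call scrape() before download_carousel()"
        )
    return filter_get(dic, keys)
```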

stefco added 6 commits June 18, 2021 15:18

Add download_carousel

fb0c968

download WIP

6e80d92

comment fix

e88032e

existing media downloads skippable in download_carousel

b51f725

add iter_posts for lazy post finding

f443435

fix failed merge cruft in .gitignore

0a4cc92

stefco mentioned this pull request Jun 19, 2021

fix get_recent_comments on selenium/requests.Session (Fixes #124) #126

Closed

4 tasks

stefco mentioned this pull request Aug 7, 2021

get_recent_comments() results in KeyError: 'entry_data' #124

Open

kyrlon mentioned this pull request Dec 30, 2021

Scraping multiple photos in a single post #139

Open
