Commit

Merge pull request #640 from aliparlakci/development

Serene-Arc authored Sep 27, 2022
2 parents e4fcacf + 0ce2585 commit e7629d7
Showing 46 changed files with 672 additions and 141 deletions.
2 changes: 2 additions & 0 deletions .gitattributes
@@ -0,0 +1,2 @@
# Declare files that will always have CRLF line endings on checkout.
*.ps1 text eol=crlf
13 changes: 13 additions & 0 deletions .github/workflows/protect_master.yml
@@ -0,0 +1,13 @@
name: Protect master branch

on:
pull_request:
branches:
- master
jobs:
merge_check:
runs-on: ubuntu-latest
steps:
- name: Check if the pull request is mergeable to master
run: |
if [[ "$GITHUB_HEAD_REF" == 'development' && "$GITHUB_REPOSITORY" == 'aliparlakci/bulk-downloader-for-reddit' ]]; then exit 0; else exit 1; fi;
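Because the guard is plain bash, it can be exercised locally by faking the two GitHub Actions environment variables (the `check` helper below is hypothetical, not part of the workflow):

```shell
# Hypothetical local re-creation of the merge_check step
check() {
  GITHUB_HEAD_REF="$1" GITHUB_REPOSITORY="$2" bash -c \
    'if [[ "$GITHUB_HEAD_REF" == "development" && "$GITHUB_REPOSITORY" == "aliparlakci/bulk-downloader-for-reddit" ]]; then exit 0; else exit 1; fi'
}

check development aliparlakci/bulk-downloader-for-reddit && echo "would merge"
check feature/foo aliparlakci/bulk-downloader-for-reddit || echo "would be blocked"
```

Any pull request to `master` that does not come from the `development` branch of the upstream repository fails the step and is blocked.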
53 changes: 50 additions & 3 deletions README.md
@@ -53,6 +53,12 @@ However, these commands are not enough. You should chain parameters in [Options]
python3 -m bdfr download ./path/to/output --subreddit Python -L 10
```
```bash
python3 -m bdfr download ./path/to/output --user reddituser --submitted -L 100
```
```bash
python3 -m bdfr download ./path/to/output --user reddituser --submitted --all-comments --comment-context
```
```bash
python3 -m bdfr download ./path/to/output --user me --saved --authenticate -L 25 --file-scheme '{POSTID}'
```
```bash
@@ -62,6 +68,31 @@ python3 -m bdfr download ./path/to/output --subreddit 'Python, all, mindustry' -
python3 -m bdfr archive ./path/to/output --subreddit all --format yaml -L 500 --folder-scheme ''
```

Alternatively, you can pass options through a YAML file.

```bash
python3 -m bdfr download ./path/to/output --opts my_opts.yaml
```

For example, running it with the following file

```yaml
skip: [mp4, avi]
file_scheme: "{UPVOTES}_{REDDITOR}_{POSTID}_{DATE}"
limit: 10
sort: top
subreddit:
- EarthPorn
- CityPorn
```
would be equivalent to (note that the YAML file uses `file_scheme` rather than `file-scheme`):
```bash
python3 -m bdfr download ./path/to/output --skip mp4 --skip avi --file-scheme "{UPVOTES}_{REDDITOR}_{POSTID}_{DATE}" -L 10 -S top --subreddit EarthPorn --subreddit CityPorn
```

If the same option is specified both in the YAML file and as a command-line argument, the command-line argument takes precedence.
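That precedence can be sketched as a simple dict merge, later sources winning (the `merge_options` helper below is hypothetical; BDFR's actual logic lives in `Configuration.process_click_arguments`):

```python
def merge_options(config_defaults, yaml_opts, cli_args):
    """Later sources win: global config < YAML opts file < CLI arguments."""
    merged = dict(config_defaults)
    for source in (yaml_opts, cli_args):
        for key, value in source.items():
            if value is None or value == ():
                continue  # unset CLI flags arrive as None/() and must not clobber
            merged[key] = value
    return merged

print(merge_options({'limit': None, 'sort': 'hot'},
                    {'limit': 10, 'sort': 'top'},
                    {'limit': 5, 'sort': None}))
# prints: {'limit': 5, 'sort': 'top'}
```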

## Options

The following options are common between both the `archive` and `download` commands of the BDFR.
@@ -74,6 +105,10 @@ The following options are common between both the `archive` and `download` comma
- `--config`
- If the path to a configuration file is supplied with this option, the BDFR will use the specified config
- See [Configuration Files](#configuration) for more details
- `--opts`
- Load options from a YAML file.
- Has higher priority than the global config file but lower than command-line arguments
- See [opts_example.yaml](./opts_example.yaml) for an example file.
- `--disable-module`
- Can be specified multiple times
- Disables certain modules from being used
@@ -92,8 +127,8 @@ The following options are common between both the `archive` and `download` comma
- This option will make the BDFR use the supplied user's saved posts list as a download source
- This requires an authenticated Reddit instance, using the `--authenticate` flag, as well as `--user` set to `me`
- `--search`
- This will apply the specified search term to specific lists when scraping submissions
- A search term can only be applied to subreddits and multireddits, supplied with the `-s` and `-m` flags respectively
- This will apply the input search term to specific lists when scraping submissions
- A search term can only be applied when using the `--subreddit` and `--multireddit` flags
- `--submitted`
- This will use a user's submissions as a source
- A user must be specified with `--user`
@@ -192,6 +227,15 @@ The following options apply only to the `download` command. This command downloa
- This skips all submissions from the specified subreddit
- Can be specified multiple times
- Also accepts CSV subreddit names
- `--min-score`
- This skips all submissions with a score lower than the specified minimum
- `--max-score`
- This skips all submissions with a score higher than the specified maximum
- `--min-score-ratio`
- This skips all submissions with an upvote ratio lower than the specified minimum
- `--max-score-ratio`
- This skips all submissions with an upvote ratio higher than the specified maximum
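Taken together, the four thresholds act as a simple predicate on each submission. A minimal sketch (`passes_score_filters` is a hypothetical name, mirroring the checks this commit adds to `bdfr/downloader.py`):

```python
def passes_score_filters(score, upvote_ratio, min_score=None, max_score=None,
                         min_score_ratio=None, max_score_ratio=None):
    """Return True if a submission survives all configured score filters."""
    if min_score is not None and score < min_score:
        return False
    if max_score is not None and score > max_score:
        return False
    if min_score_ratio is not None and upvote_ratio < min_score_ratio:
        return False
    if max_score_ratio is not None and upvote_ratio > max_score_ratio:
        return False
    return True
```

Unset filters are skipped entirely, so by default nothing is filtered.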


### Archiver Options

@@ -215,7 +259,10 @@ The `clone` command can take all the options listed above for both the `archive`

## Common Command Tricks

A common use case is for subreddits/users to be loaded from a file. The BDFR doesn't support this directly but it is simple enough to do through the command-line. Consider a list of usernames to download; they can be passed through to the BDFR with the following command, assuming that the usernames are in a text file:
A common use case is for subreddits/users to be loaded from a file. The BDFR supports this via YAML file options (`--opts my_opts.yaml`).

Alternatively, you can use the command-line [xargs](https://en.wikipedia.org/wiki/Xargs) function.
For a list of users `users.txt` (one user per line), type:

```bash
cat users.txt | xargs -L 1 echo --user | xargs -L 50 python3 -m bdfr download <ARGS>
```
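The two-stage `xargs` pipeline can be previewed safely by substituting `echo` for the real invocation (the usernames below are made up):

```shell
# Stage 1 turns each username into "--user NAME"; stage 2 batches up to 50 per command
printf 'alice\nbob\n' | xargs -L 1 echo --user | xargs -L 50 echo python3 -m bdfr download ./out
# prints: python3 -m bdfr download ./out --user alice --user bob
```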
14 changes: 10 additions & 4 deletions bdfr/__main__.py
@@ -16,13 +16,19 @@
click.argument('directory', type=str),
click.option('--authenticate', is_flag=True, default=None),
click.option('--config', type=str, default=None),
click.option('--opts', type=str, default=None),
click.option('--disable-module', multiple=True, default=None, type=str),
click.option('--exclude-id', default=None, multiple=True),
click.option('--exclude-id-file', default=None, multiple=True),
click.option('--file-scheme', default=None, type=str),
click.option('--folder-scheme', default=None, type=str),
click.option('--ignore-user', type=str, multiple=True, default=None),
click.option('--include-id-file', multiple=True, default=None),
click.option('--log', type=str, default=None),
click.option('--saved', is_flag=True, default=None),
click.option('--search', default=None, type=str),
click.option('--submitted', is_flag=True, default=None),
click.option('--subscribed', is_flag=True, default=None),
click.option('--time-format', type=str, default=None),
click.option('--upvoted', is_flag=True, default=None),
click.option('-L', '--limit', default=None, type=int),
@@ -37,17 +43,17 @@
]

_downloader_options = [
click.option('--file-scheme', default=None, type=str),
click.option('--folder-scheme', default=None, type=str),
click.option('--make-hard-links', is_flag=True, default=None),
click.option('--max-wait-time', type=int, default=None),
click.option('--no-dupes', is_flag=True, default=None),
click.option('--search-existing', is_flag=True, default=None),
click.option('--exclude-id', default=None, multiple=True),
click.option('--exclude-id-file', default=None, multiple=True),
click.option('--skip', default=None, multiple=True),
click.option('--skip-domain', default=None, multiple=True),
click.option('--skip-subreddit', default=None, multiple=True),
click.option('--min-score', type=int, default=None),
click.option('--max-score', type=int, default=None),
click.option('--min-score-ratio', type=float, default=None),
click.option('--max-score-ratio', type=float, default=None),
]

_archiver_options = [
3 changes: 3 additions & 0 deletions bdfr/archiver.py
@@ -34,6 +34,9 @@ def download(self):
f'Submission {submission.id} in {submission.subreddit.display_name} skipped'
f' due to {submission.author.name if submission.author else "DELETED"} being an ignored user')
continue
if submission.id in self.excluded_submission_ids:
logger.debug(f'Object {submission.id} in exclusion list, skipping')
continue
logger.debug(f'Attempting to archive submission {submission.id}')
self.write_entry(submission)

39 changes: 37 additions & 2 deletions bdfr/configuration.py
@@ -2,16 +2,21 @@
# coding=utf-8

from argparse import Namespace
from pathlib import Path
from typing import Optional
import logging

import click
import yaml

logger = logging.getLogger(__name__)

class Configuration(Namespace):
def __init__(self):
super(Configuration, self).__init__()
self.authenticate = False
self.config = None
self.opts: Optional[str] = None
self.directory: str = '.'
self.disable_module: list[str] = []
self.exclude_id = []
@@ -33,8 +38,13 @@ def __init__(self):
self.skip: list[str] = []
self.skip_domain: list[str] = []
self.skip_subreddit: list[str] = []
self.min_score = None
self.max_score = None
self.min_score_ratio = None
self.max_score_ratio = None
self.sort: str = 'hot'
self.submitted: bool = False
self.subscribed: bool = False
self.subreddit: list[str] = []
self.time: str = 'all'
self.time_format = None
@@ -48,6 +58,31 @@ def __init__(self):
self.comment_context: bool = False

def process_click_arguments(self, context: click.Context):
if context.params.get('opts') is not None:
self.parse_yaml_options(context.params['opts'])
for arg_key in context.params.keys():
if arg_key in vars(self) and context.params[arg_key] is not None:
vars(self)[arg_key] = context.params[arg_key]
if not hasattr(self, arg_key):
logger.warning(f'Ignoring an unknown CLI argument: {arg_key}')
continue
val = context.params[arg_key]
if val is None or val == ():
# don't overwrite with an empty value
continue
setattr(self, arg_key, val)

def parse_yaml_options(self, file_path: str):
yaml_file_loc = Path(file_path)
if not yaml_file_loc.exists():
logger.error(f'No YAML file found at {yaml_file_loc}')
return
with open(yaml_file_loc) as file:
try:
opts = yaml.load(file, Loader=yaml.FullLoader)
except yaml.YAMLError as e:
logger.error(f'Could not parse YAML options file: {e}')
return
for arg_key, val in opts.items():
if not hasattr(self, arg_key):
logger.warning(f'Ignoring an unknown YAML argument: {arg_key}')
continue
setattr(self, arg_key, val)
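The same `hasattr` guard applies to options from either source: only keys that already exist on the configuration object are applied, so a typo in a YAML file produces a warning instead of silently attaching an unused attribute. A standalone sketch of the idea (the `Config` class here is a hypothetical stand-in, not BDFR's actual `Configuration`):

```python
import logging

logger = logging.getLogger(__name__)

class Config:
    """Minimal stand-in for a configuration namespace with known attributes."""
    def __init__(self):
        self.limit = None
        self.sort = 'hot'

    def apply_options(self, opts: dict):
        for key, value in opts.items():
            if not hasattr(self, key):
                # unknown keys are reported, never silently attached
                logger.warning('Ignoring an unknown argument: %s', key)
                continue
            setattr(self, key, value)

config = Config()
config.apply_options({'limit': 10, 'bogus_flag': True})
print(config.limit, hasattr(config, 'bogus_flag'))
# prints: 10 False
```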
24 changes: 17 additions & 7 deletions bdfr/connector.py
@@ -243,9 +243,19 @@ def split_args_input(entries: list[str]) -> set[str]:
return set(all_entries)

def get_subreddits(self) -> list[praw.models.ListingGenerator]:
if self.args.subreddit:
out = []
for reddit in self.split_args_input(self.args.subreddit):
out = []
subscribed_subreddits = set()
if self.args.subscribed:
if self.args.authenticate:
try:
subscribed_subreddits = list(self.reddit_instance.user.subreddits(limit=None))
subscribed_subreddits = set([s.display_name for s in subscribed_subreddits])
except prawcore.InsufficientScope:
logger.error('BDFR has insufficient scope to access subreddit lists')
else:
logger.error('Cannot find subscribed subreddits without an authenticated instance')
if self.args.subreddit or subscribed_subreddits:
for reddit in self.split_args_input(self.args.subreddit) | subscribed_subreddits:
if reddit == 'friends' and self.authenticated is False:
logger.error('Cannot read friends subreddit without an authenticated instance')
continue
@@ -270,9 +280,7 @@ def get_subreddits(self) -> list[praw.models.ListingGenerator]:
logger.debug(f'Added submissions from subreddit {reddit}')
except (errors.BulkDownloaderException, praw.exceptions.PRAWException) as e:
logger.error(f'Failed to get submissions for subreddit {reddit}: {e}')
return out
else:
return []
return out

def resolve_user_name(self, in_name: str) -> str:
if in_name == 'me':
@@ -406,7 +414,9 @@ def check_subreddit_status(subreddit: praw.models.Subreddit):
try:
assert subreddit.id
except prawcore.NotFound:
raise errors.BulkDownloaderException(f'Source {subreddit.display_name} does not exist or cannot be found')
raise errors.BulkDownloaderException(f"Source {subreddit.display_name} cannot be found")
except prawcore.Redirect:
raise errors.BulkDownloaderException(f"Source {subreddit.display_name} does not exist")
except prawcore.Forbidden:
raise errors.BulkDownloaderException(f'Source {subreddit.display_name} is private and cannot be scraped')
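The new `--subscribed` handling above boils down to a set union between CLI-supplied names and the account's subscriptions, with authentication required for the latter. A sketch with the Reddit lookup injected as a callable (all names here are hypothetical, not BDFR's API):

```python
def gather_subreddits(cli_names, subscribed, authenticated, fetch_subscribed):
    """Union CLI-supplied subreddit names with the account's subscriptions."""
    names = set(cli_names)
    if subscribed:
        if not authenticated:
            raise RuntimeError('Cannot find subscribed subreddits without an authenticated instance')
        names |= set(fetch_subscribed())
    return names

print(sorted(gather_subreddits(['Python'], True, True, lambda: ['EarthPorn', 'Python'])))
# prints: ['EarthPorn', 'Python']
```

Using a set means a subreddit both subscribed to and named on the command line is only scraped once.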

2 changes: 1 addition & 1 deletion bdfr/default_config.cfg
@@ -1,7 +1,7 @@
[DEFAULT]
client_id = U-6gk4ZCh3IeNQ
client_secret = 7CZHY6AmKweZME5s50SfDGylaPg
scopes = identity, history, read, save
scopes = identity, history, read, save, mysubreddits
backup_log_count = 3
max_wait_time = 120
time_format = ISO
13 changes: 13 additions & 0 deletions bdfr/downloader.py
@@ -57,6 +57,19 @@ def _download_submission(self, submission: praw.models.Submission):
f'Submission {submission.id} in {submission.subreddit.display_name} skipped'
f' due to {submission.author.name if submission.author else "DELETED"} being an ignored user')
return
elif self.args.min_score and submission.score < self.args.min_score:
logger.debug(
f"Submission {submission.id} filtered due to score {submission.score} < [{self.args.min_score}]")
return
elif self.args.max_score and self.args.max_score < submission.score:
logger.debug(
f"Submission {submission.id} filtered due to score {submission.score} > [{self.args.max_score}]")
return
elif (self.args.min_score_ratio and submission.upvote_ratio < self.args.min_score_ratio) or (
self.args.max_score_ratio and self.args.max_score_ratio < submission.upvote_ratio
):
logger.debug(f"Submission {submission.id} filtered due to score ratio ({submission.upvote_ratio})")
return
elif not isinstance(submission, praw.models.Submission):
logger.warning(f'{submission.id} is not a submission')
return
3 changes: 3 additions & 0 deletions bdfr/file_name_formatter.py
@@ -111,6 +111,9 @@ def format_path(
if not resource.extension:
raise BulkDownloaderException(f'Resource from {resource.url} has no extension')
file_name = str(self._format_name(resource.source_submission, self.file_format_string))

file_name = re.sub(r'\n', ' ', file_name)

if not re.match(r'.*\.$', file_name) and not re.match(r'^\..*', resource.extension):
ending = index + '.' + resource.extension
else:
11 changes: 6 additions & 5 deletions bdfr/site_downloaders/download_factory.py
@@ -17,15 +17,18 @@
from bdfr.site_downloaders.redgifs import Redgifs
from bdfr.site_downloaders.self_post import SelfPost
from bdfr.site_downloaders.vidble import Vidble
from bdfr.site_downloaders.vreddit import VReddit
from bdfr.site_downloaders.youtube import Youtube


class DownloadFactory:
@staticmethod
def pull_lever(url: str) -> Type[BaseDownloader]:
sanitised_url = DownloadFactory.sanitise_url(url)
if re.match(r'(i\.)?imgur.*\.gif.+$', sanitised_url):
if re.match(r'(i\.|m\.)?imgur', sanitised_url):
return Imgur
elif re.match(r'(i\.)?(redgifs|gifdeliverynetwork)', sanitised_url):
return Redgifs
elif re.match(r'.*/.*\.\w{3,4}(\?[\w;&=]*)?$', sanitised_url) and \
not DownloadFactory.is_web_resource(sanitised_url):
return Direct
@@ -37,16 +40,14 @@ def pull_lever(url: str) -> Type[BaseDownloader]:
return Gallery
elif re.match(r'gfycat\.', sanitised_url):
return Gfycat
elif re.match(r'(m\.)?imgur.*', sanitised_url):
return Imgur
elif re.match(r'(redgifs|gifdeliverynetwork)', sanitised_url):
return Redgifs
elif re.match(r'reddit\.com/r/', sanitised_url):
return SelfPost
elif re.match(r'(m\.)?youtu\.?be', sanitised_url):
return Youtube
elif re.match(r'i\.redd\.it.*', sanitised_url):
return Direct
elif re.match(r'v\.redd\.it.*', sanitised_url):
return VReddit
elif re.match(r'pornhub\.com.*', sanitised_url):
return PornHub
elif re.match(r'vidble\.com', sanitised_url):
4 changes: 2 additions & 2 deletions bdfr/site_downloaders/gfycat.py
@@ -21,7 +21,7 @@ def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> l
return super().find_resources(authenticator)

@staticmethod
def _get_link(url: str) -> str:
def _get_link(url: str) -> set[str]:
gfycat_id = re.match(r'.*/(.*?)/?$', url).group(1)
url = 'https://gfycat.com/' + gfycat_id

@@ -39,4 +39,4 @@ def _get_link(url: str) -> str:
raise SiteDownloaderError(f'Failed to download Gfycat link {url}: {e}')
except json.JSONDecodeError as e:
raise SiteDownloaderError(f'Did not receive valid JSON data: {e}')
return out
return {out,}
10 changes: 6 additions & 4 deletions bdfr/site_downloaders/imgur.py
@@ -41,10 +41,12 @@ def _compute_image_url(self, image: dict) -> Resource:

@staticmethod
def _get_data(link: str) -> dict:
link = link.rstrip('?')
if re.match(r'(?i).*\.gif.+$', link):
link = link.replace('i.imgur', 'imgur')
link = re.sub('(?i)\\.gif.+$', '', link)
try:
imgur_id = re.match(r'.*/(.*?)(\..{0,})?$', link).group(1)
gallery = 'a/' if re.search(r'.*/(.*?)(gallery/|a/)', link) else ''
link = f'https://imgur.com/{gallery}{imgur_id}'
except AttributeError:
raise SiteDownloaderError(f'Could not extract Imgur ID from {link}')

res = Imgur.retrieve_url(link, cookies={'over18': '1', 'postpagebeta': '0'})
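The gist of the `.gif`/`.gifv` handling added above can be isolated as a small pure function (a sketch only; the real `Imgur._get_data` goes on to extract the image ID and fetch the page):

```python
import re

def normalise_imgur_link(link: str) -> str:
    """Strip a trailing '?' and rewrite direct .gif/.gifv links to their page URL."""
    link = link.rstrip('?')
    if re.match(r'(?i).*\.gif.+$', link):
        link = link.replace('i.imgur', 'imgur')
        link = re.sub('(?i)\\.gif.+$', '', link)
    return link

print(normalise_imgur_link('https://i.imgur.com/abcdef.gifv'))
# prints: https://imgur.com/abcdef
```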
