Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving DIA workflow with new parameters, and logs originally from DIANN (original issue #481 comments) #495

Merged
merged 18 commits into from
Mar 26, 2025

Conversation

an-altosian
Copy link

@an-altosian an-altosian commented Mar 13, 2025

User description

481

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/quantms branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

PR Type

Enhancement, Bug fix


Description

  • Added new parameters and logic for DIA-NN performance optimization.

  • Improved reproducibility by sorting input files for consistent results.

  • Refactored handling of optional outputs and logging across multiple modules.

  • Removed redundant --quick-mass-acc flag in several processes.


Changes walkthrough 📝

Relevant files
Enhancement
7 files
modules.config
Added new parameters and logic for DIA-NN processes           
+13/-0   
main.nf
Removed redundant `--quick-mass-acc` and improved logging
+5/-4     
main.nf
Adjusted mass accuracy logic and logging                                 
+6/-7     
main.nf
Refactored optional outputs and removed redundant flags   
+9/-18   
main.nf
Updated logging mechanism for final analysis                         
+4/-2     
main.nf
Added new arguments and updated logging                                   
+3/-1     
dia.nf
Ensured reproducibility by sorting input files                     
+2/-2     
Configuration changes
1 files
nextflow.config
Introduced new parameters for performance mode and quick mass accuracy
+2/-3     

Need help?
  • Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
  • Check out the documentation for more information.
  • Sorry, something went wrong.

    Copy link

    coderabbitai bot commented Mar 13, 2025

    Important

    Review skipped

    Auto reviews are disabled on base/target branches other than the default branch.

    Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

    You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


    Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

    ❤️ Share
    🪧 Tips

    Chat

    There are 3 ways to chat with CodeRabbit:

    • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
      • I pushed a fix in commit <commit_id>, please review it.
      • Generate unit testing code for this file.
      • Open a follow-up GitHub issue for this discussion.
    • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
      • @coderabbitai generate unit testing code for this file.
      • @coderabbitai modularize this function.
    • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
      • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
      • @coderabbitai read src/utils.ts and generate unit testing code.
      • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
      • @coderabbitai help me debug CodeRabbit configuration file.

    Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

    CodeRabbit Commands (Invoked using PR comments)

    • @coderabbitai pause to pause the reviews on a PR.
    • @coderabbitai resume to resume the paused reviews.
    • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
    • @coderabbitai full review to do a full review from scratch and review all the files again.
    • @coderabbitai summary to regenerate the summary of the PR.
    • @coderabbitai generate docstrings to generate docstrings for this PR.
    • @coderabbitai resolve resolve all the CodeRabbit review comments.
    • @coderabbitai plan to trigger planning for file edits and PR creation.
    • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
    • @coderabbitai help to get help.

    Other keywords and placeholders

    • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
    • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
    • Add @coderabbitai anywhere in the PR title to generate the title automatically.

    CodeRabbit Configuration File (.coderabbit.yaml)

    • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
    • Please see the configuration documentation for more information.
    • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

    Documentation and Community

    • Visit our Documentation for detailed information on how to use CodeRabbit.
    • Join our Discord Community to get help, request features, and share feedback.
    • Follow us on X/Twitter for updates and announcements.

    Copy link

    github-actions bot commented Mar 13, 2025

    nf-core pipelines lint overall result: Passed ✅ ⚠️

    Posted for pipeline commit d42eb40

    +| ✅ 105 tests passed       |+
    #| ❔  13 tests were ignored |#
    !| ❗  26 tests had warnings |!

    ❗ Test warnings:

    • pipeline_todos - TODO string in nextflow.config: Optionally, you can add a pipeline-specific nf-core config at https://github.com/nf-core/configs
    • pipeline_todos - TODO string in nextflow.config: Update the field with the details of the contributors to your pipeline. New with Nextflow version 24.10.0
    • pipeline_todos - TODO string in README.md: update the following command to include all required parameters for a minimal example
    • pipeline_todos - TODO string in README.md: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file.
    • pipeline_todos - TODO string in README.md: Add bibliography of tools and data used in your pipeline
    • pipeline_todos - TODO string in ro-crate-metadata.json: "description": "

      \n \n <source media="(prefers-color-scheme: dark)" srcset="docs/images/bigbio-quantms_logo_dark.png">\n <img alt="bigbio/quantms" src="docs/images/bigbio-quantms_logo_light.png">\n \n

      \n\nGitHub Actions CI Status\nGitHub Actions Linting StatusAWS CICite with Zenodo\nnf-test\n\nNextflow\nrun with conda\nrun with docker\nrun with singularity\nLaunch on Seqera Platform\n\nGet help on SlackFollow on TwitterFollow on MastodonWatch on YouTube\n\n## Introduction\n\nbigbio/quantms is a bioinformatics pipeline that ...\n\n TODO nf-core:\n Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the\n major pipeline sections and the types of output it produces. You're giving an overview to someone new\n to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction\n\n\n Include a figure that guides the user through the major workflow steps. Many nf-core\n workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. \n Fill in short bullet-pointed list of the default steps in the pipeline 1. Read QC (FastQC)2. Present QC for raw reads (MultiQC)\n\n## Usage\n\n> [!NOTE]\n> If you are new to Nextflow and nf-core, please refer to this page on how to set-up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.\n\n Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.\n Explain what rows and columns represent. For instance (please edit as appropriate):\n\nFirst, prepare a samplesheet with your input data that looks as follows:\n\nsamplesheet.csv:\n\ncsv\nsample,fastq_1,fastq_2\nCONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz\n\n\nEach row represents a fastq file (single-end) or a pair of fastq files (paired end).\n\n\n\nNow, you can run the pipeline using:\n\n update the following command to include all required parameters for a minimal example \n\nbash\nnextflow run bigbio/quantms \\\n -profile <docker/singularity/.../institute> \\\n --input samplesheet.csv \\\n --outdir <OUTDIR>\n\n\n> [!WARNING]\n> Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.\n\nFor more details and further functionality, please refer to the usage documentation and the parameter documentation.\n\n## Pipeline output\n\nTo see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page.\nFor more details about the output files and reports, please refer to the\noutput documentation.\n\n## Credits\n\nbigbio/quantms was originally written by Yasset Perez-Riverol.\n\nWe thank the following people for their extensive assistance in the development of this pipeline:\n\n If applicable, make list of people who have also contributed \n\n## Contributions and Support\n\nIf you would like to contribute to this pipeline, please see the contributing guidelines.\n\nFor further information or help, don't hesitate to get in touch on the Slack #quantms channel (you can join with this invite).\n\n## Citations\n\n Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file. \n If you use bigbio/quantms for your analysis, please cite it using the following doi: 10.5281/zenodo.XXXXXX \n\n Add bibliography of tools and data used in your pipeline \n\nAn extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.\n\nYou can cite the nf-core publication as follows:\n\n> The nf-core framework for community-curated bioinformatics pipelines.\n>\n> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.\n>\n> Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.\n",
    • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
    • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
    • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
    • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
    • pipeline_todos - TODO string in base.config: Check the defaults for all processes
    • pipeline_todos - TODO string in base.config: Customise requirements for specific processes.
    • local_component_structure - preprocess_expdesign.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
    • local_component_structure - samplesheet_check.nf in modules/local should be moved to a TOOL/SUBTOOL/main.nf structure
    • local_component_structure - featuremapper.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
    • local_component_structure - create_input_channel.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
    • local_component_structure - proteininference.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
    • local_component_structure - databasesearchengines.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
    • local_component_structure - proteinquant.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
    • local_component_structure - dda_id.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
    • local_component_structure - phosphoscoring.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
    • local_component_structure - file_preparation.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
    • local_component_structure - psmrescoring.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
    • local_component_structure - id.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
    • local_component_structure - input_check.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure
    • local_component_structure - psmfdrcontrol.nf in subworkflows/local should be moved to a SUBWORKFLOW_NAME/main.nf structure

    ❔ Tests ignored:

    ✅ Tests passed:

    Run details

    • nf-core/tools version 3.2.0
    • Run at 2025-03-26 08:54:28

    @ypriverol
    Copy link
    Member

    ypriverol commented Mar 13, 2025

    @an-altosian I will take a look to it tomorrow. For some reason the EBI PRIDE files are slow to download, main reason why most of the test are failing. But no panic.

    This error: https://github.com/bigbio/quantms/actions/runs/13842972888/job/38742737869?pr=495 is because all the parameters in the nextflow.config needs to be in the schema json file.

    @an-altosian
    Copy link
    Author

    an-altosian commented Mar 13, 2025

    Cool! I will work on the parameter schema json files (which should be the cause of the failed linting), if these changes look good. Please feel free to reject any changes according to Vadim's comments.

    Best,
    Dongze

    @an-altosian
    Copy link
    Author

    met-excision - Vadim's comment in #164

    @an-altosian an-altosian marked this pull request as ready for review March 19, 2025 17:47
    Copy link

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Missing Parameters

    The PR removes parameters min_corr, corr_diff, and time_corr_only but these values are still referenced in diann_preliminary_analysis/main.nf. The code now uses params.min_corr and params.corr_diff but these parameters are no longer defined in nextflow.config.

    performance_mode        = true // add '--min-corr 2 --corr-diff 1 --time-corr-only'
    quick_mass_acc          = true
    mass_acc_automatic      = true
    Potential Bug

    The PR changes .collect() to .toSortedList() for ms_file_names, but this might change the behavior of the pipeline. Verify that .toSortedList() produces the same type of output as .collect() for downstream processes.

    .toSortedList()

    Copy link

    qodo-merge-pro bot commented Mar 19, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    General
    Add error handling
    Suggestion Impact:The commit addressed the error handling concern but used a different approach. Instead of checking if the file exists before copying, it redirected the output of the DIA-NN command to the log file using tee, which ensures the log file is always created regardless of whether report.log.txt is generated.

    code diff:

    -    cp report.log.txt assemble_empirical_library.log
    +            $args \\
    +            2>&1 | tee assemble_empirical_library.log

    The command assumes that report.log.txt will always be generated, but if the
    DIA-NN process fails before creating this file, the cp command will fail and
    potentially mask the original error. Consider adding error handling or checking
    if the file exists before copying.

    modules/local/assemble_empirical_library/main.nf [58]

    -+    cp report.log.txt assemble_empirical_library.log
    ++    [ -f report.log.txt ] && cp report.log.txt assemble_empirical_library.log || echo "Warning: report.log.txt not found" > assemble_empirical_library.log

    [To ensure code accuracy, apply this suggestion manually]

    Suggestion importance[1-10]: 7

    __

    Why: The suggestion adds important error handling to prevent the pipeline from failing silently if the report.log.txt file isn't generated. This improves robustness and makes debugging easier by ensuring the log file exists or creating a warning message.

    Medium
    Improve compatibility
    Suggestion Impact:The commit addressed the compatibility issue but in a different way. Instead of replacing the spaceship operator with compareTo(), it removed the explicit sorting with the spaceship operator and used the built-in sort parameter of the collect() method earlier in the code.

    code diff:

    -                .collect()
    +                .collect(sort: true)
                 DIANN_PRELIMINARY_ANALYSIS(preanalysis_subset.combine(speclib))
             } else {
                 empirical_lib_files = ch_file_preparation_results
                     .map { result -> result[1] }
    -                .collect()
    +                .collect(sort: true)
                 DIANN_PRELIMINARY_ANALYSIS(ch_file_preparation_results.combine(speclib))
             }
             ch_software_versions = ch_software_versions
    @@ -96,7 +96,7 @@
             //
             // Order matters in DIANN, This should be sorted for reproducible results.
             ASSEMBLE_EMPIRICAL_LIBRARY(
    -            empirical_lib_files.sort{ a, b -> a.getName() <=> b.getName() },
    +            empirical_lib_files,

    The sorting operation uses the Groovy spaceship operator (<=>), which might not be
    compatible with all Nextflow versions. For better compatibility, consider using
    the more explicit compareTo method.

    workflows/dia.nf [99]

    -+            empirical_lib_files.sort{ a, b -> a.getName() <=> b.getName() },
    ++            empirical_lib_files.sort{ a, b -> a.getName().compareTo(b.getName()) },

    [To ensure code accuracy, apply this suggestion manually]

    Suggestion importance[1-10]: 5

    __

    Why: Using the more explicit compareTo() method instead of the spaceship operator (<=>) improves compatibility across different Nextflow versions. This is a moderate improvement for maintainability and portability.

    Low
    • Update

    Sorry, something went wrong.

    @an-altosian
    Copy link
    Author

    Hi @daichengxin Thank you for the commit. Just wondering, I thought from our discussions, we would like to use the DIANN reported log file, report.log.txt, instead of capturing stdout because diann writes the executed command at the beginning of the file?

    @daichengxin
    Copy link
    Collaborator

    Hi @an-altosian, Thanks for pointing this out. I recovered them.

    @ypriverol ypriverol changed the title Addressing #481 comments Improving DIA workflow with new parameters, and logs originally from DIANN (original issue #481 comments) Mar 24, 2025
    @ypriverol
    Copy link
    Member

    @an-altosian we may need to go back to @daichengxin sorting strategy or use another one, all tests are failing.

    @an-altosian
    Copy link
    Author

    Will work on it.

    @an-altosian
    Copy link
    Author

    I have tested that the Path class could not be sorted by file names. This also indicates that this line for sorting the files also doesn't work as expected. I updated it with a closure to sort by filenames.

    .toSortedList()

    @ypriverol ypriverol merged commit 747efe5 into bigbio:dev Mar 26, 2025
    111 of 162 checks passed
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    4 participants