Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in parsing bwa-mem2 when specific implementation is specified #77

Open
killidude opened this issue Feb 17, 2025 · 0 comments
Open

Comments

@killidude
Copy link

Folks, @shawnpg @bnelsj ,

I mapped my Hi-C library using bwa-mem2, and then ran your hic_qc.py script. I got this error:

tomas@edna:/media/tomas/Data2/hi_c/hi_c1/mapped$ hic_qc.py -b A524_mapped.bam -o A524_hic_qc
[hic_qc - 2025-02-16 09:38:59,572] parsing the first 1000000 read pairs in bam file A524_mapped.bam to QC Hi-C library quality
Traceback (most recent call last):
File "/home/tomas/bin/hic_qc/hic_qc.py", line 1147, in
QC.parse_bam(args.bam_file, max_read_pairs=args.num_reads)
File "/home/tomas/bin/hic_qc/hic_qc.py", line 234, in parse_bam
self.extract_header_info(bam_fh.header)
File "/home/tomas/bin/hic_qc/hic_qc.py", line 306, in extract_header_info
self.bwa_command = re.search(r'(bwa-mem2 )[^//]*', self.bwa_command_line).group()
AttributeError: 'NoneType' object has no attribute 'group'

It turns out that the hic_qc.py script already expects that the bam file could have been generated with bwa-mem2 (see Commit c0a178c), however, it looks only for "bwa-mem2" and does not recognize a specific implementation (avx2, sse41, etc) which is actually executed. The lines in the hic-qc.py script that check if the bam file was generated by bwa or bwa-mem2 are below.

    # TODO: add more robust logic to different BAM headers, and/or comment the assumptions made in this code
    if 'PG' in header and 'bwa' in header['PG'][0]['CL']:
        self.bwa_command_line = header['PG'][0]['CL']
        if 'bwa-mem2' in self.bwa_command_line:
            self.bwa_command = re.search(r'(bwa-mem2 )[^//]*', self.bwa_command_line).group()
        else:
            self.bwa_command = re.search(r'(bwa )[^//]*', self.bwa_command_line).group()

When one executes bwa-mem2 one can either run "bwa-mem2" which then finds the appropriate implementation of bwa-mem2 for the computer on which bwa-mem2 is being executed (e.g. bwa-mem2.avx2), or one can directly execute the desired version (e.g. bwa-mem2.avx2). What is registered in the bam header is then "bwa-mem2 -command options" or "bwa-mem2.avx2 -command options", respectively, and the hic_qc.py script only recognizes "bwa-mem2 -command options", thus the error in hic_qc.py.

The solution to make hic_qc.py script to recognize "bwa-mem2 -command options" or "bwa-mem2.avx2 -command options" in the bam header is to change line 308 of hic_qc.py
from: self.bwa_command = re.search(r'(bwa-mem2 )[^//]', self.bwa_command_line).group()
to: self.bwa_command = re.search(r'(bwa-mem2)[^//]
', self.bwa_command_line).group()
(remove the space behind 'bwa-mem2')

This will make hic_qc.py robust, and recognize all implementations of bwa-mem2.

By the way, I mapped my library with both bwa and bwa-mem2, and the QC results are identical. The only difference is that mapping with bwa-mem2 is about 2.5x faster than with bwa.

All the best,

Tomas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant