Loading the Parquet file
The main dataset (dsRNA_human_v1.parquet, ~325 MB, zstd-compressed) carries all
5,134,754 dsRNAs and every analytical column except the RNA sequence and predicted structure
(those live in the extended companion, joined on dsRNA_id). Parquet is columnar and
zstd-compressed, so column-projected queries and filters run in seconds even at full size.
pandas
import pandas as pd
df = pd.read_parquet('dsRNA_human_v1.parquet')
# High-confidence by all 3 models
hc = df[df['n_models_high_conf'] == 3]
# Conserved across vertebrates (both arms phastCons100 > 0.5)
conserved = hc[(hc['i_phast100'] > 0.5) & (hc['j_phast100'] > 0.5)]
# Stranded editing on both arms
edited_both = hc[(hc['stranded_editing_i_sites'] > 0) &
(hc['stranded_editing_j_sites'] > 0)]
print(f"{len(hc):,} HC, {len(conserved):,} conserved, {len(edited_both):,} edited on both arms")
polars (lazy — fastest for big slices)
Lazy scans push filters and column selection down to the Parquet reader, so they only touch the columns and row groups they need. Significantly faster than pandas on cold reads of the full file.
import polars as pl
# Lazy scan: nothing is loaded until .collect()
hits = (pl.scan_parquet('dsRNA_human_v1.parquet')
.filter(pl.col('n_models_high_conf') == 3)
.filter((pl.col('i_gene_name') == 'MYC') |
(pl.col('j_gene_name') == 'MYC'))
.select(['dsRNA_id', 'chr', 'i_start', 'j_end',
'energy_kcal_mol', 'gtex_model_score'])
.collect())
print(hits)
DuckDB (SQL directly on the file)
DuckDB lets you query the parquet without converting to a database, and you can join two parquets in a single SQL statement — handy for the main + extended pair.
import duckdb
con = duckdb.connect()
# Top 20 most stable HC-by-all-3 dsRNAs in a region
hits = con.execute("""
SELECT dsRNA_id, chr, i_start, j_end,
energy_kcal_mol, longest_helix,
gtex_model_score, stability_model_score, structure_probing_score
FROM 'dsRNA_human_v1.parquet'
WHERE n_models_high_conf = 3
AND chr = 'chr8'
AND i_start BETWEEN 128000000 AND 129000000
ORDER BY energy_kcal_mol ASC
LIMIT 20
""").df()
Joining the sequence / structure companion
The extended file (dsRNA_human_v1_extended.parquet, ~522 MB) is keyed by
dsRNA_id and carries the RNA sequence and dot-bracket predicted structure. Pull it in
only when you need it — keeping the join lazy preserves Parquet's filter-pushdown advantages.
# pandas
main = pd.read_parquet('dsRNA_human_v1.parquet')
ext = pd.read_parquet('dsRNA_human_v1_extended.parquet')
with_seq = (main[main['n_models_high_conf'] == 3]
.merge(ext, on='dsRNA_id', how='left'))
# DuckDB (SQL join in one query)
df = con.execute("""
SELECT m.dsRNA_id, m.chr, m.energy_kcal_mol, m.gtex_model_score,
e.sequence, e.predicted_structure
FROM 'dsRNA_human_v1.parquet' m
JOIN 'dsRNA_human_v1_extended.parquet' e USING (dsRNA_id)
WHERE m.n_models_high_conf = 3
AND m.i_phast100 > 0.5 AND m.j_phast100 > 0.5
""").df()
Column reference
Full descriptions, types, units, and source-column traces are in the data_dictionary.tsv. You can also introspect from any of the loaders above:
# polars
pl.scan_parquet('dsRNA_human_v1.parquet').collect_schema()
# DuckDB
con.execute("DESCRIBE SELECT * FROM 'dsRNA_human_v1.parquet'").df()
# pandas
df.dtypes
GFF3 for genome browsers
Two bgzipped + tabix-indexed GFF3 files load directly into IGV, UCSC, JBrowse, or any
GFF3-aware viewer: dsRNA_all.gff3.gz (all 5.1M) and
dsRNA_high_confidence.gff3.gz (the any-of-3 subset, ~2.5M).
Each dsRNA is encoded as one mRNA feature with two exon children — one per
arm. The loop region between the arms is implicit (it's the gap between the two exons, effectively an
intron). This is fully GFF3-compliant and uses parent-child relationships that every browser already
knows how to render.
chr1 RNAduplex mRNA 10195 19583 . - . ID=dsRNA_chr1_1;energy=-270.1;match_perc=71.03;...
chr1 dsRNAscan exon 10195 10512 . - . ID=exon-chr1_1-1;Parent=dsRNA_chr1_1
chr1 dsRNAscan exon 19260 19583 . - . ID=exon-chr1_1-2;Parent=dsRNA_chr1_1
- The
mRNAline carries the analytical attributes:energy,match_perc,i_phast100,GTEx,Stability,Probing,sequence,predicted_structure,forna_link, etc. - The two
exonlines carry onlyIDandParent. - The
mRNAstart/endcover the full dsRNA span; the two exons cover the i-arm and j-arm coordinates separately. - All coordinates are 1-based, inclusive (GFF3 convention).
Quick IGV recipe
File -> Load from URL...
URL: https://dsrna.chpc.utah.edu/tools/downloads/dsRNA_high_confidence.gff3.gz
Index URL: (leave blank; the .tbi auto-resolves)
FORNA URL reconstruction
The GFF3 carries a forna_link attribute on each mRNA line for one-click structure
viewing. The Parquet/SQLite files omit it because it's deterministic from sequence and
predicted_structure:
import urllib.parse
def forna_url(seq: str, struct: str) -> str:
base = 'http://rna.tbi.univie.ac.at/forna/forna.html?id=url/name'
return f'{base}&sequence={urllib.parse.quote(seq)}&structure={urllib.parse.quote(struct)}'
About the BEDPE files
The slim BEDPE files at downloads
encode each dsRNA as one row with two genomic intervals — columns 1-3 are the i-arm,
columns 4-6 are the j-arm. This shape is the native fit for two-armed biological objects and pairs with
bedtools pairtobed for overlap questions ("either arm overlaps a gene", "both arms in the
same gene", "one arm exonic, other intronic"). Columns 11-38 carry every analytical attribute (arm
lengths, helix length, energy, percent paired, gene names, repeats, ML scores, editing counts).
Six files are provided so you can start with the slice you actually want:
| File | Subset | n dsRNAs |
|---|---|---|
dsRNA_all_slim.bedpe.gz | everything | 5,134,754 |
dsRNA_high_conf_any_slim.bedpe.gz | high-confidence by ≥ 1 model | 2,458,476 |
dsRNA_high_conf_all3_slim.bedpe.gz | high-confidence by all 3 models | 1,509,138 |
dsRNA_gtex_high_conf_slim.bedpe.gz | GTEx model high-conf | 1,713,035 |
dsRNA_stability_high_conf_slim.bedpe.gz | Stability model high-conf | 2,125,678 |
dsRNA_probing_high_conf_slim.bedpe.gz | Probing model high-conf | 2,226,288 |
Each .bedpe.gz has a sibling .tbi tabix index at the same URL with
.tbi appended — download it separately if you want random-access region queries.
Warning: most bedtools subcommands silently ignore the second arm.
bedtools intersect, sort, merge, coverage,
closest, etc. all read cols 1-3 as the interval and treat cols 4-6 as opaque extra
columns. The output looks plausible but only reflects the i-arm. The BEDPE-aware subcommands are
pairtobed (BEDPE vs BED/GFF/VCF), pairtopair (BEDPE vs BEDPE),
bedpetobam, and bamtobed -bedpe. For anything else, split the file into
per-arm BEDs first — see the last section on this page.
Column layout
Headerless TSV, 38 columns. The first 10 are the standard BEDPE spec; the rest are dsRNA-specific
extras you'll use for filtering with awk.
| # | Column | Notes |
|---|---|---|
| 1-6 | chr1 start1 end1 chr2 start2 end2 | i-arm + j-arm coords, 0-based half-open |
| 7 | name | dsRNA_id (e.g. dsRNA_chr1_42) |
| 8 | score | round(avg(i_length, j_length)) — drives arc thickness in IGV |
| 9-10 | strand1 strand2 | both strands (the dsRNA's strand) |
| 11-13 | i_length j_length loop_length | nt |
| 14 | energy_kcal_mol | RNAduplex free energy, 2dp |
| 15 | percent_paired | integer % |
| 16 | longest_helix | bp |
| 17-18 | i_gene_name j_gene_name | stale — will be redone vs GENCODE v49 in v2 |
| 19 | repetitive | Repetitive / Non-Repetitive |
| 20-23 | i_phast100 j_phast100 i_phast17 j_phast17 | conservation, 2dp |
| 24 | gtex_model_score | 0-1, 3dp. HC ≥ 0.2513 |
| 25 | stability_model_score | 0-1, 3dp. HC ≥ 0.2471 |
| 26 | structure_probing_score | 0-1 (min-max normalized), 3dp. HC ≥ 0.46 |
| 27-29 | gtex_high_conf stability_high_conf structure_probing_high_conf | 0/1 booleans |
| 30 | n_models_high_conf | 0-3 (sum of the three booleans) |
| 31-32 | stranded_editing_i_sites stranded_editing_j_sites | REDIportal, strand-aware |
| 33-34 | unstranded_editing_i_sites unstranded_editing_j_sites | strand-agnostic |
| 35-36 | stranded_editing_sites unstranded_editing_sites | i + j totals |
| 37-38 | i_repetitive_element j_repetitive_element | RepeatMasker family/name(s) |
Quick filtering with awk
The simplest pipeline pattern is zcat | awk | bgzip. Decompress on the fly, filter
on any column, recompress (and re-tabix if you want random access).
zcat dsRNA_all_slim.bedpe.gz \
| awk -F'\t' '$30 == 3' \
| bgzip > dsRNA_unanimous.bedpe.gz
tabix -p bed dsRNA_unanimous.bedpe.gz # ok despite warning -- pairtobed will still work
Same logic as dsRNA_high_conf_all3_slim.bedpe.gz — useful as a template for tighter cutoffs.
zcat dsRNA_all_slim.bedpe.gz \
| awk -F'\t' '$27 == 1 && $15 >= 85 && $16 >= 40' \
| bgzip > dsRNA_gtex_tight.bedpe.gz
Col 27 = gtex_high_conf, 15 = percent_paired, 16 = longest_helix.
zcat dsRNA_high_conf_any_slim.bedpe.gz \
| awk -F'\t' '$24 >= 0.5 && $31 > 0 && $32 > 0' \
| bgzip > dsRNA_gtex_edited.bedpe.gz
zcat dsRNA_gtex_edited.bedpe.gz | wc -l
zcat dsRNA_all_slim.bedpe.gz \
| awk -F'\t' '$20 > 0.5 && $21 > 0.5' \
| bgzip > dsRNA_conserved.bedpe.gz
zcat dsRNA_high_conf_any_slim.bedpe.gz \
| awk -F'\t' '$19 == "Non-Repetitive"' \
| bgzip > dsRNA_nonrep.bedpe.gz
Only ~6% of high-conf dsRNAs are non-repetitive; this is a useful pre-filter for cross-organism comparisons.
zcat dsRNA_all_slim.bedpe.gz \
| awk -F'\t' '$14 <= -100 && $16 > 100' \
| bgzip > dsRNA_strong.bedpe.gz
pairtobed: overlap recipes
bedtools pairtobed understands the two-arm relationship natively. Its
-type flag controls how the two arms must intersect feature B:
| -type | Returns dsRNAs where... |
|---|---|
either | at least one arm overlaps any feature in B (default) |
both | both arms overlap the same feature in B |
xor | exactly one arm overlaps B (one in, one out) |
neither | neither arm overlaps B |
ispan | the loop region (between the arms) overlaps B |
notispan | the loop region does not overlap B |
ospan | any part of the i-arm-start to j-arm-end span overlaps B |
Other useful flags: -s (same strand only), -S (opposite strand only),
-f 0.5 (require at least 50% of A's interval to overlap B).
Worked example 1: dsRNAs near m6A sites
You need a hg38 BED of m6A site coordinates. Two well-curated databases publish them:
- m6A-Atlas 2.0 — integrated high-confidence m6A sites
across tissues. Download from the
Downloadtab; select GRCh38 / BED format. - REPIC — RNA Epitranscriptome Collection.
Has per-study downloads + a merged catalog. GRCh38 BEDs are under
Download → Browse.
The recipes below assume the file is named m6A_sites.bed (any BED4+ format works —
only cols 1-3 are used for overlap; the name column lets you trace hits back if you want).
bedtools pairtobed \
-a dsRNA_high_conf_any_slim.bedpe.gz \
-b m6A_sites.bed \
-type either \
| cut -f1-38 \
| awk -F'\t' '!seen[$7]++' \
| bgzip > dsRNAs_with_m6A.bedpe.gz
zcat dsRNAs_with_m6A.bedpe.gz | wc -l
Output is a valid BEDPE with all 38 analytical columns intact — load it in IGV, hand it to another pairtobed call, or open in pandas. cut -f1-38 drops the appended m6A feature columns; awk '!seen[$7]++' dedupes by dsRNA_id (col 7) so dsRNAs that overlap multiple m6A sites only appear once.
bedtools pairtobed -a dsRNA_high_conf_any_slim.bedpe.gz \
-b m6A_sites.bed -type both \
| cut -f1-38 \
| awk -F'\t' '!seen[$7]++' \
| bgzip > dsRNAs_m6A_paired.bedpe.gz
-type both requires the SAME m6A site to be present in both arms — rare. If you want
"any m6A on each arm", use two intersects (one per arm BED) and intersect the resulting dsRNA-id sets (see the per-arm-split section).
bedtools pairtobed -a dsRNA_high_conf_any_slim.bedpe.gz \
-b m6A_sites.bed -type ispan \
| cut -f1-38 \
| awk -F'\t' '!seen[$7]++' \
| bgzip > dsRNAs_m6A_loop.bedpe.gz
zcat dsRNAs_m6A_loop.bedpe.gz | wc -l
-type ispan tests overlap with the implicit "loop" between the two arms —
useful for catching dsRNAs that flank a methylation cluster.
Worked example 2: splice sites & exons
Use GENCODE v49 (or your favourite annotation). Extract exons into a BED with transcript_id in the name column — this lets you ask "same transcript" questions later.
# One-time prep: download GENCODE v49 + build an exons BED carrying transcript_id.
# (On macOS use `gzcat` in place of `zcat`.)
curl -sLO https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_49/gencode.v49.annotation.gff3.gz
zcat gencode.v49.annotation.gff3.gz \
| awk -F'\t' '$3=="exon"' \
| awk -F'\t' '{ split($9, a, ";");
for (i in a) if (a[i] ~ /^transcript_id=/) tid=substr(a[i], 15);
printf "%s\t%d\t%d\t%s\n", $1, $4-1, $5, tid; }' \
| sort -k1,1 -k2,2n > gencode_v49_exons.bed
bedtools pairtobed -a dsRNA_high_conf_any_slim.bedpe.gz \
-b gencode_v49_exons.bed -type both \
| cut -f1-38 \
| awk -F'\t' '!seen[$7]++' \
| bgzip > dsRNAs_exonic_pairs.bedpe.gz
For the strict "BOTH arms 100% inside same transcript" definition you need a per-arm intersect
pipeline (pairtobed -type both only checks at-least-one-bp overlap). See the
per-arm-split section below.
bedtools pairtobed -a dsRNA_high_conf_any_slim.bedpe.gz \
-b gencode_v49_exons.bed -type xor \
| cut -f1-38 \
| awk -F'\t' '!seen[$7]++' \
| bgzip > dsRNAs_exon_intron.bedpe.gz
bedtools pairtobed -a dsRNA_high_conf_any_slim.bedpe.gz \
-b gencode_v49_exons.bed -type ispan \
| cut -f1-38 \
| awk -F'\t' '!seen[$7]++' \
| bgzip > dsRNAs_flanking_exon.bedpe.gz
Worked example 3: ENCODE eCLIP (ILF3)
ENCODE eCLIP peaks for any RBP are downloadable as BED narrowPeak files from encodeproject.org/eclip/. ADAR / ADARB1 are not in the ENCODE eCLIP catalog, but several well-characterised dsRNA binders are — ILF3 (NF90/NF110), DHX9, STAU2, IGF2BP1/2/3. ILF3 makes a good starting point because it has the most direct dsRNA-binding biology of the set.
Two ILF3 eCLIP experiments are released on ENCODE (rep1+rep2 IDR-merged narrowPeak files, GRCh38, blacklist-removed):
# ILF3 in K562 (~3K peaks)
curl -sLO https://www.encodeproject.org/files/ENCFF385MJU/@@download/ENCFF385MJU.bed.gz
gunzip -c ENCFF385MJU.bed.gz > ILF3_K562_eCLIP.bed
# ILF3 in HepG2 (different binding landscape, ~10K peaks)
curl -sLO https://www.encodeproject.org/files/ENCFF071QDP/@@download/ENCFF071QDP.bed.gz
gunzip -c ENCFF071QDP.bed.gz > ILF3_HepG2_eCLIP.bed
Experiment pages for reference:
ENCSR438KWZ (K562),
ENCSR786TSC (HepG2).
The files are 10-column BED narrowPeak format; pairtobed reads cols 1-3 so the extra
columns are harmless.
bedtools pairtobed -a dsRNA_high_conf_any_slim.bedpe.gz \
-b ILF3_K562_eCLIP.bed -type either \
| cut -f1-38 \
| awk -F'\t' '!seen[$7]++' \
| bgzip > ILF3_bound_dsRNAs.bedpe.gz
bedtools pairtobed -a dsRNA_high_conf_any_slim.bedpe.gz \
-b ILF3_K562_eCLIP.bed -type both \
| cut -f1-38 \
| awk -F'\t' '!seen[$7]++' \
| bgzip > ILF3_double_bound.bedpe.gz
-type both requires the SAME peak to cover both arms — very strict. For "any peak on each arm" use two intersects (one per arm BED) and intersect the resulting dsRNA-id sets.
bedtools pairtobed -a dsRNA_high_conf_all3_slim.bedpe.gz \
-b ILF3_K562_eCLIP.bed -type neither \
| cut -f1-38 \
| awk -F'\t' '!seen[$7]++' \
| bgzip > high_conf_no_ILF3.bedpe.gz
-type neither reports each dsRNA at most once, so the dedup is belt-and-suspenders.
Worked example 4: intermolecular dsRNAs (sense-antisense pairs)
The main BEDPE files cover intramolecular hairpin dsRNAs. A separate, smaller file covers intermolecular dsRNAs — sense / antisense gene pairs whose transcripts overlap and can pair in trans:
curl -sLO https://dsrna.chpc.utah.edu/intermolecular_dsRNA.gff3
769 records, GFF3 format, single feature type antisense_dsRNA. Key attributes:
gene_plus, gene_minus, biotype_plus, biotype_minus,
total_overlap_bp, total_editing_sites, n_eers (editing-enriched
regions), dsrna_tx_plus / dsrna_tx_minus (Ensembl transcript IDs per strand).
File -> Load from URL...
URL: https://dsrna.chpc.utah.edu/intermolecular_dsRNA.gff3
Index URL: (leave blank)
awk -F'\t' '$3=="antisense_dsRNA"' intermolecular_dsRNA.gff3 \
| grep -E 'gene_plus=TP53(;|$)|gene_minus=TP53(;|$)'
# Returns the TP53 / WRAP53 antisense pair on chr17.
The (;|$) anchor avoids false hits like TP53AIP1 when you really wanted TP53. ADAR / ADARB1 don't have antisense partners in this dataset.
awk -F'\t' '$3=="antisense_dsRNA" {
match($9, /total_editing_sites=([0-9]+)/, m);
if (m[1]+0 >= 20) print
}' intermolecular_dsRNA.gff3 \
| wc -l
Requires gawk (for the match() third-arg array). On macOS install via brew install gawk and call as gawk.
awk -F'\t' '$3=="antisense_dsRNA" {
split($9, a, ";");
gp=""; gm="";
for (i in a) {
if (a[i] ~ /^gene_plus=/) gp = substr(a[i], 11);
if (a[i] ~ /^gene_minus=/) gm = substr(a[i], 12);
}
print gp"\t"gm
}' intermolecular_dsRNA.gff3 > intermolecular_pairs.tsv
wc -l intermolecular_pairs.tsv
awk -F'\t' '$3=="antisense_dsRNA" \
&& $9 ~ /biotype_plus=protein_coding/ \
&& $9 ~ /biotype_minus=protein_coding/' intermolecular_dsRNA.gff3 \
> intermolecular_pc_only.gff3
wc -l intermolecular_pc_only.gff3
bedpetobam: load BEDPE in samtools / IGV.org workflows
bedtools bedpetobam converts BEDPE into a paired-end-style BAM so you can sort, index,
and load alongside read alignments in IGV, samtools, or any BAM-aware tool.
# Get a chromosome-sizes file (hg38). Pick whichever you have available:
# (a) UCSC tools: fetchChromSizes hg38 > hg38.chromsizes
# (b) Indexed FASTA: cut -f1,2 hg38.fa.fai > hg38.chromsizes
# (c) Direct download from UCSC:
curl -sL https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes \
-o hg38.chromsizes
# Convert + sort + index.
bedtools bedpetobam -i dsRNA_high_conf_all3_slim.bedpe.gz \
-g hg38.chromsizes \
| samtools sort -o dsRNA_high_conf_all3.bam
samtools index dsRNA_high_conf_all3.bam
Now dsRNA_high_conf_all3.bam + .bai loads in IGV web/desktop as a normal
BAM track, with each dsRNA appearing as a "read pair" connecting its two arms. You can also pipe it
through samtools for region-based queries:
samtools view dsRNA_high_conf_all3.bam chr8:128,748,000-128,755,000
Caveat: bedpetobam in bedtools < 2.30 truncates output. Upgrade to 2.30+ if you
hit that.
Per-arm splitting (when you need plain bedtools intersect)
If your downstream tool needs single-interval BED (e.g. bedtools coverage,
genomecov, custom awk pipelines that expect cols 1-3), split the BEDPE into two BEDs
— one per arm — before doing anything else:
zcat dsRNA_high_conf_any_slim.bedpe.gz \
| awk -F'\t' 'BEGIN{OFS="\t"} {print $1, $2, $3, $7, $8, $9}' \
| sort -k1,1 -k2,2n > i_arms.bed
zcat dsRNA_high_conf_any_slim.bedpe.gz \
| awk -F'\t' 'BEGIN{OFS="\t"} {print $4, $5, $6, $7, $8, $10}' \
| sort -k1,1 -k2,2n > j_arms.bed
Both files use the dsRNA_id as the BED name field (col 4), so you can join the two arms back together by ID after intersecting. Example — per-arm bp overlap with 3'UTRs, then rejoin:
# One-time prep: extract 3'UTRs from GENCODE v49.
zcat gencode.v49.annotation.gff3.gz \
| awk -F'\t' '$3=="three_prime_UTR" {printf "%s\t%d\t%d\t.\n", $1, $4-1, $5}' \
| sort -k1,1 -k2,2n > gencode_v49_3UTR.bed
bedtools intersect -a i_arms.bed -b gencode_v49_3UTR.bed -wo -sorted \
| awk -F'\t' '{sum[$4] += $NF} END {for (id in sum) print id"\t"sum[id]}' \
> i_3UTR_bp.tsv
bedtools intersect -a j_arms.bed -b gencode_v49_3UTR.bed -wo -sorted \
| awk -F'\t' '{sum[$4] += $NF} END {for (id in sum) print id"\t"sum[id]}' \
> j_3UTR_bp.tsv
# Join: dsRNAs with 3'UTR overlap on BOTH arms (id, bp_i, bp_j)
join -t$'\t' -j1 \
<(sort i_3UTR_bp.tsv) \
<(sort j_3UTR_bp.tsv) > both_arms_3UTR.tsv
# Subset the source BEDPE back down to just these dsRNAs - now you have a
# valid BEDPE for IGV / pairtobed / pandas, not just an ID stats table.
zcat dsRNA_high_conf_any_slim.bedpe.gz \
| awk -F'\t' 'NR==FNR {ids[$1]=1; next} ids[$7]' both_arms_3UTR.tsv - \
| bgzip > both_arms_3UTR.bedpe.gz
More
- bedtools reference: pairtobed, bedpetobam.
- Want to run the same questions interactively (without bedtools)? The Filter dsRNAs Shiny app exposes the same confidence columns through sliders and exports filtered subsets directly.
- Bigger picture: the User Guide explains how the ML scores were trained and what the cutoffs mean biologically.