# Human dsRNA Predictions v1

Predictions of double-stranded RNA structures across the human genome (GRCh38 / hg38), produced by dsRNAscan + RNAduplex, annotated with conservation, gene context, RNA-editing site counts, and three independent ML model scores.

**Rows:** 5,134,754 dsRNAs  **Source pipeline date:** 2025-06-19  **Generated:** 2026-05-23

## Files

| File | Format | Contents |
| --- | --- | --- |
| `dsRNA_human_v1.parquet` | Parquet (zstd) | Main analyst-facing dataset: all friendly-named columns except sequence/structure. |
| `dsRNA_human_v1_extended.parquet` | Parquet (zstd) | Companion keyed by `dsRNA_id`: `sequence` and `predicted_structure` (dot-bracket). Join on `dsRNA_id`. |
| `dsrna_database.no_structure.sqlite` | SQLite | Same columns as the main parquet, indexed for the Shiny browser. |
| `dsrna_database.structures.sqlite` | SQLite | Same columns as `dsrna_database.no_structure.sqlite`, plus `sequence` and `predicted_structure`. |
| `data_dictionary.tsv` | TSV | Machine-parseable per-column descriptions, types, units, source-column traces. |
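
The table names inside the two SQLite files are not documented here, so a reasonable first step is to list them. The sketch below uses only Python's standard `sqlite3` and `pathlib` modules; `list_tables` is a hypothetical helper name, not something shipped with the release.

```python
import sqlite3
from pathlib import Path

def list_tables(db_path):
    """Return {table_name: [column names]} for every table in a SQLite file."""
    # Open read-only through a file: URI so a mistyped path raises an error
    # instead of silently creating an empty database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        names = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        return {
            name: [col[1] for col in con.execute(f'PRAGMA table_info("{name}")')]
            for name in names
        }
    finally:
        con.close()
```

Calling `list_tables('dsrna_database.no_structure.sqlite')` should then show which table to query and confirm that its columns match the column dictionary.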

## Loading the data

```python
import pandas as pd
df = pd.read_parquet('dsRNA_human_v1.parquet')
# Filter to dsRNAs called high-confidence by all 3 models:
hc = df[df['n_models_high_conf'] == 3]
# Pull sequences/structures only when needed:
ext = pd.read_parquet('dsRNA_human_v1_extended.parquet')
with_seq = hc.merge(ext, on='dsRNA_id', how='left')
```

```python
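# Polars equivalent:
import polars as pl
df = pl.read_parquet('dsRNA_human_v1.parquet')
hc = df.filter(pl.col('n_models_high_conf') == 3)
ext = pl.read_parquet('dsRNA_human_v1_extended.parquet')
with_seq = hc.join(ext, on='dsRNA_id', how='left')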
```

## High-confidence definition

Three independent ML models score each dsRNA. A dsRNA is considered high-confidence by a given model when its raw score meets or exceeds (>=) the documented threshold:

| Model | Column | Threshold | High-conf count |
| --- | --- | --- | --- |
| GTEx editing | `gtex_model_score` | >= 0.2513 | 1,713,035 |
| Stability (no-GTEx) | `stability_model_score` | >= 0.2471 | 2,125,678 |
| Structure-probing 3'UTR | `structure_probing_score` | >= 0.0315 | 2,226,288 |

- **Any of 3:** 2,458,476 dsRNAs
- **All 3:** 1,509,138 dsRNAs
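
The counts above also determine how many dsRNAs are called by exactly one or exactly two models. The sketch below works this out by inclusion-exclusion; `tier_counts` is a hypothetical helper, and the result is only as exact as the published counts.

```python
def tier_counts(total, per_model, any_count, all_count):
    """Split `total` items into those called by exactly 0, 1, 2, or 3 models.

    Relies on two identities (n_k = items called by exactly k models):
      any_count      = n1 + n2 + n3
      sum(per_model) = n1 + 2*n2 + 3*n3
    """
    n3 = all_count
    n0 = total - any_count
    n1_plus_n2 = any_count - n3
    n1_plus_2n2 = sum(per_model) - 3 * n3
    n2 = n1_plus_2n2 - n1_plus_n2
    n1 = n1_plus_n2 - n2
    return {0: n0, 1: n1, 2: n2, 3: n3}

# Figures taken from the header line and the table above.
counts = tier_counts(
    total=5_134_754,
    per_model=[1_713_035, 2_125_678, 2_226_288],
    any_count=2_458_476,
    all_count=1_509_138,
)
print(counts)
```

With these inputs the breakdown is 2,676,278 dsRNAs called by no model, 361,089 by exactly one, 588,249 by exactly two, and 1,509,138 by all three, which sums back to the 5,134,754 rows.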

The `n_models_high_conf` column (integer 0-3) is the sum of the three boolean `*_high_conf` columns. Filter on it for tiered confidence levels.
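
As a minimal sketch of that definition, the tier can be recomputed from the three raw scores and the thresholds in the table. `THRESHOLDS` and `count_high_conf` are hypothetical names; the function accepts any mapping keyed by column name, such as a plain dict or a pandas row.

```python
# Thresholds copied from the table above; a score >= cutoff counts as high-confidence.
THRESHOLDS = {
    'gtex_model_score': 0.2513,
    'stability_model_score': 0.2471,
    'structure_probing_score': 0.0315,
}

def count_high_conf(record) -> int:
    """Number of models (0-3) whose score meets or exceeds its threshold."""
    return int(sum(record[col] >= cutoff for col, cutoff in THRESHOLDS.items()))

# Made-up scores, for illustration only.
example = {
    'gtex_model_score': 0.30,
    'stability_model_score': 0.10,
    'structure_probing_score': 0.05,
}
print(count_high_conf(example))
```

Scores are stored as float32, so a recomputed value could differ from the shipped `n_models_high_conf` for scores sitting exactly on a threshold. For the full table a vectorized form such as `sum(df[c] >= t for c, t in THRESHOLDS.items())` is likely to be much faster than a row-wise apply.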

Note: `gtex_confidence_label` and `stability_confidence_label` are pre-computed string labels from the upstream pipeline and almost always agree with the booleans. Use the booleans for filtering and the labels for cross-checks.
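
One way to run that cross-check is to tally every (boolean, label) combination. This is a sketch rather than part of the pipeline: `label_agreement` is a hypothetical helper, and it assumes the stability label uses the same 'High Confidence' string as the GTEx label.

```python
from collections import Counter

def label_agreement(flags, labels, high_label='High Confidence'):
    """Tally (boolean flag, label == high_label) pairs.

    Keys (True, True) and (False, False) are agreements;
    (True, False) and (False, True) are disagreements.
    """
    return Counter((bool(flag), label == high_label)
                   for flag, label in zip(flags, labels))

# Made-up values, for illustration only.
tally = label_agreement(
    [True, False, True],
    ['High Confidence', 'Low Confidence', 'Low Confidence'],
)
disagreements = tally[(True, False)] + tally[(False, True)]
print(tally, disagreements)
```

On the real data the call would look like `label_agreement(df['gtex_high_conf'], df['gtex_confidence_label'])`.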

## FORNA URL reconstruction

The earlier 1.5 GB GFF3 stored a redundant `forna_link` URL for each dsRNA. These files omit it because it can be rebuilt deterministically from `sequence` and `predicted_structure` (both in the extended companion file):

```python
import urllib.parse
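def forna_url(seq: str, struct: str) -> str:
    """Build the FORNA viewer URL from a sequence and its dot-bracket structure."""
    base = 'http://rna.tbi.univie.ac.at/forna/forna.html?id=url/name'
    # quote() percent-encodes '&', '(' and ')' so they stay inside the parameter values.
    return f'{base}&sequence={urllib.parse.quote(seq)}&structure={urllib.parse.quote(struct)}'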
```

## Interactive browser

For exploratory filtering (gene name, coordinates, length, pairing), use the dsRNAscan Shiny browser: https://dsrna.chpc.utah.edu/

## Column dictionary

| Column | Type | Unit | Description |
| --- | --- | --- | --- |
| `dsRNA_id` | string |  | Per-chromosome dsRNA identifier of the form 'dsRNA_chrN_M'. M is the 1-based dsRNA index within the chromosome, sorted by (i_start, j_start). |
| `chr` | string |  | Chromosome (e.g., 'chr1', 'chrX'). |
| `strand` | string |  | Strand of the dsRNA: '+' or '-'. |
| `start` | int32 | bp | Overall dsRNA span start = min(i_start, j_start). 1-based, inclusive (GFF3 convention). |
| `end` | int32 | bp | Overall dsRNA span end = max(i_end, j_end). 1-based, inclusive (GFF3 convention). |
| `i_start` | int32 | bp | Start coordinate of the i-arm (5' arm relative to dsRNA). 1-based, inclusive. |
| `i_end` | int32 | bp | End coordinate of the i-arm. 1-based, inclusive. |
| `j_start` | int32 | bp | Start coordinate of the j-arm (3' arm relative to dsRNA). 1-based, inclusive. |
| `j_end` | int32 | bp | End coordinate of the j-arm. 1-based, inclusive. |
| `i_length` | int32 | nt | Length of the i-arm in nucleotides. |
| `j_length` | int32 | nt | Length of the j-arm in nucleotides. |
| `loop_length` | int32 | nt | Distance between the two arms (j_start - i_end - 1). |
| `energy_kcal_mol` | float32 | kcal/mol | Free energy of formation of the predicted duplex (RNAduplex). |
| `percent_paired` | float32 | % | Percentage of nucleotides in the duplex that are base-paired. |
| `longest_helix` | int32 | bp | Length of the longest contiguous base-paired helix in the dsRNA. |
| `length_category` | string |  | Length-based bin: '30-40 nt', '40-300 nt', or '> 300 nt'. |
| `i_gene_name` | string |  | Gene symbol(s) overlapping the i-arm (comma-separated). 'NA' if intergenic. |
| `j_gene_name` | string |  | Gene symbol(s) overlapping the j-arm (comma-separated). 'NA' if intergenic. |
| `i_gene_id` | string |  | Ensembl gene/transcript ID(s) overlapping the i-arm. 'NA' if intergenic. |
| `j_gene_id` | string |  | Ensembl gene/transcript ID(s) overlapping the j-arm. 'NA' if intergenic. |
| `genic_intergenic` | string |  | Whether the dsRNA overlaps any gene: 'Genic' or 'Intergenic'. |
| `repetitive` | string |  | Whether either arm overlaps a repetitive element: 'Repetitive' or 'Non-Repetitive'. |
| `alu` | string |  | Whether either arm overlaps an Alu element: 'Alu' or 'Non-Alu'. |
| `i_repetitive_element` | string |  | Repeat family/name(s) overlapping the i-arm (comma-separated). 'FALSE' if none. |
| `j_repetitive_element` | string |  | Repeat family/name(s) overlapping the j-arm (comma-separated). 'FALSE' if none. |
| `Editing` | string |  | Editing status of the dsRNA: 'Edited' (at least one A-to-I site detected on either arm) or 'Unedited'. |
| `stranded_editing_i_sites` | int32 | sites | Number of stranded A-to-I editing sites on the i-arm (REDIportal, strand-aware). |
| `stranded_editing_j_sites` | int32 | sites | Number of stranded A-to-I editing sites on the j-arm. |
| `unstranded_editing_i_sites` | int32 | sites | Number of unstranded A-to-I editing sites on the i-arm (REDIportal, strand-agnostic). |
| `unstranded_editing_j_sites` | int32 | sites | Number of unstranded A-to-I editing sites on the j-arm. |
| `stranded_editing_sites` | int32 | sites | Total stranded editing sites = stranded_editing_i_sites + stranded_editing_j_sites. |
| `unstranded_editing_sites` | int32 | sites | Total unstranded editing sites = unstranded_editing_i_sites + unstranded_editing_j_sites. |
| `i_phast100` | float32 |  | PhastCons 100-vertebrate conservation score for the i-arm (range 0-1). |
| `j_phast100` | float32 |  | PhastCons 100-vertebrate conservation score for the j-arm. |
| `i_phast17` | float32 |  | PhastCons 17-primate conservation score for the i-arm. |
| `j_phast17` | float32 |  | PhastCons 17-primate conservation score for the j-arm. |
| `i_phyp100` | float32 |  | PhyloP 100-vertebrate conservation score for the i-arm. |
| `j_phyp100` | float32 |  | PhyloP 100-vertebrate conservation score for the j-arm. |
| `i_phyp17` | float32 |  | PhyloP 17-primate conservation score for the i-arm. |
| `j_phyp17` | float32 |  | PhyloP 17-primate conservation score for the j-arm. |
| `gtex_model_score` | float32 |  | GTEx editing-prediction model score (YDF, trained on Alu+nonAlu, 100% Alu fraction). High-confidence threshold: 0.2513. |
| `stability_model_score` | float32 |  | Structure-only stability-prediction model score (YDF, no GTEx expression features). High-confidence threshold: 0.2471. |
| `structure_probing_score` | float32 |  | Power-weighted advantage score from the structure-probing 3'UTR model. High-confidence threshold: 0.0315 (raw scale), equivalent to 0.4574 after min-max normalization. |
| `gtex_high_conf` | bool |  | True if gtex_model_score >= 0.2513. |
| `stability_high_conf` | bool |  | True if stability_model_score >= 0.2471. |
| `structure_probing_high_conf` | bool |  | True if structure_probing_score >= 0.0315 (raw scale). |
| `n_models_high_conf` | int8 |  | Number of the three ML models calling this dsRNA high-confidence (0-3). |
| `gtex_confidence_label` | string |  | Pre-computed string label from the upstream pipeline: 'High Confidence' or 'Low Confidence'. Mostly agrees with gtex_high_conf; use that boolean for filtering and this label for cross-check. |
| `stability_confidence_label` | string |  | Pre-computed string label from the upstream pipeline for the stability (no-GTEx) model. Mostly agrees with stability_high_conf. |
| `sequence` (extended only) | string |  | RNA sequence of the dsRNA. i-arm and j-arm joined by '&'. U used in place of T. |
| `predicted_structure` (extended only) | string |  | Dot-bracket secondary structure annotation matching the sequence string. i-arm and j-arm joined by '&'. |
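
Several columns in the dictionary are defined as formulas over other columns (`start`, `end`, `loop_length`, and the two editing-site totals). The sketch below checks a single record against those documented formulas and reports any column that does not match. `check_derived_fields` is a hypothetical helper and the example record is made up.

```python
def check_derived_fields(rec) -> list:
    """Return names of derived columns whose stored value differs from its documented formula."""
    expected = {
        'start': min(rec['i_start'], rec['j_start']),
        'end': max(rec['i_end'], rec['j_end']),
        'loop_length': rec['j_start'] - rec['i_end'] - 1,
        'stranded_editing_sites': (rec['stranded_editing_i_sites']
                                   + rec['stranded_editing_j_sites']),
        'unstranded_editing_sites': (rec['unstranded_editing_i_sites']
                                     + rec['unstranded_editing_j_sites']),
    }
    return [name for name, value in expected.items() if rec[name] != value]

# Made-up record, for illustration only (1-based inclusive coordinates).
record = {
    'i_start': 100, 'i_end': 139, 'j_start': 200, 'j_end': 239,
    'start': 100, 'end': 239, 'loop_length': 60,
    'stranded_editing_i_sites': 2, 'stranded_editing_j_sites': 1,
    'stranded_editing_sites': 3,
    'unstranded_editing_i_sites': 4, 'unstranded_editing_j_sites': 0,
    'unstranded_editing_sites': 4,
}
print(check_derived_fields(record))
```

An empty list means the record is consistent with the dictionary. A non-empty list is worth investigating before relying on the affected column.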

## Version & changelog

- **v1** (2026): initial public release. Replaces the 1.5 GB GFF3 and the old 11 GB Shiny SQLite with friendly column names and a split structures companion.
