Data description
Genomic sequencing data for this project was taken primarily from 94 TNBC samples included in the MyBrCa cohort tumour sequencing project. Briefly, this included whole-exome sequencing (WES) and RNA-sequencing (RNA-seq) data collected from biobanked breast tumours of female patients from two hospitals: Subang Jaya Medical Centre in Subang Jaya, Malaysia, and Universiti Malaya Medical Centre in Kuala Lumpur, Malaysia. These data were analysed together with the available clinical data. The cohort data and sequencing methods are described in full in Pan et al.24 and related papers24,25,40. We also included an additional 35 TNBC samples that were not part of the original cohort description, for a total sample size of 129 MyBrCa TNBC samples. These samples were obtained and processed in largely the same way as the earlier MyBrCa samples, the only difference being the use of the Illumina NovaSeq 6000 as the sequencing platform instead of the Illumina HiSeq 4000. The sequencing coverage and quality statistics of the WES and RNA-seq data for each new sample are summarized in Supp. Tables 1A and 1B, respectively. Additional validation data from TCGA and METABRIC TNBC samples were downloaded from the NIH Genomic Data Commons (GDC) Data Portal and the European Genome-phenome Archive, respectively.
Patient recruitment and sample collection were reviewed and approved by the Independent Ethics Committee, Ramsay Sime Darby Health Care (reference no: 201109.4 and 201208.1), as well as the Medical Ethics Committee of the University Malaya Medical Centre (reference no: 842.9). Written informed consent to participate in research was given by each individual patient.
Transcriptomic data processing
Raw RNA-seq reads were mapped to the hs37d5 human reference genome, and gene-level read counts were quantified using featureCounts (v. 1.2.31) with the Homo sapiens GRCh37.87 genome annotation model.
Mutational analyses
To call SNVs, we used positions called by Mutect2 with the following filters: a minimum of 10 reads in tumour and 5 reads in normal samples, OxoG metric less than 0.8, variant allele frequency (VAF) of 0.075 or more, p-value for Fisher's exact test on the strandedness of the reads of 0.05 or more, and SAF more than 0.75. For positions that are present in 5 samples or more, we removed two positions that were not in COSMIC and were in single tandem repeats. We also removed variants that have VAF of at least 0.01 in gnomAD, and considered only variants that are supported by at least 4 alternate reads, with at least 2 reads per strand. For indels, we also required the positions to be called by Strelka2. Variants were annotated using Oncotator version 1.9.9.0.
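The per-variant thresholds above can be expressed as a single predicate. The following is a minimal sketch, not the pipeline used in the study: the `Call` record and its field names are hypothetical stand-ins for values that would be parsed from the variant caller output, and the recurrent-position and Strelka2 checks are left out.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One candidate SNV. Field names are hypothetical, not caller output columns."""
    tumour_depth: int
    normal_depth: int
    oxog: float
    vaf: float
    strand_fisher_p: float
    saf: float
    gnomad_af: float
    alt_fwd: int  # alternate reads on the forward strand
    alt_rev: int  # alternate reads on the reverse strand


def passes_filters(c: Call) -> bool:
    """Apply the thresholds listed in the text to a single call."""
    return (
        c.tumour_depth >= 10
        and c.normal_depth >= 5
        and c.oxog < 0.8
        and c.vaf >= 0.075
        and c.strand_fisher_p >= 0.05
        and c.saf > 0.75
        and c.gnomad_af < 0.01              # drop variants at 0.01 or more in gnomAD
        and c.alt_fwd + c.alt_rev >= 4      # at least 4 alternate reads in total
        and min(c.alt_fwd, c.alt_rev) >= 2  # at least 2 alternate reads per strand
    )


# Made-up calls for illustration only.
good = Call(40, 20, 0.1, 0.30, 0.60, 0.90, 0.0, 6, 5)
low_vaf = Call(40, 20, 0.1, 0.05, 0.60, 0.90, 0.0, 6, 5)
one_strand = Call(40, 20, 0.1, 0.30, 0.60, 0.90, 0.0, 9, 1)
kept = [passes_filters(c) for c in (good, low_vaf, one_strand)]
```

Keeping every threshold in one function makes it easy to audit the filter against the written description.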
Determination of HRD status
Genomic features from WES and shallow whole-genome sequencing (sWGS) data were used in a clustering step to group the TNBC samples into 2 groups: HRD high and HRD low. The genomic features used include telomeric allelic imbalance (TAI), loss of heterozygosity (LOH), large-scale transitions (LST), copy number amplification, copy number gain, copy number loss, copy number deletion, indel counts, and COSMIC mutational signature SBS3 scores. TAI, LOH and LST scores were determined using the scarHRD R package (v. 0.1.1)41 on allele-specific copy number profiles derived by Sequenza (v. 2.2) from paired tumour and matched normal WES bam files. The prevalence of the HRD-associated single base-pair substitution (SBS) mutational signature 3 from COSMIC (SBS3) was determined using deconstructSigs (v. 1.8.0), restricted to samples with at least 15 SNVs. Scores for copy number amplification, gain, loss, and deletion were obtained using the QDNASeq R package (v. 1.22) on sWGS bam files. Scores for each feature were normalized using z-scores before clustering, except for indel counts, which were log-transformed; all the scores were then rescaled. K-means clustering and hierarchical clustering were performed using the Python packages "scikit-learn" (v. 1.2.1) and "scipy" (v. 1.12.0), respectively. Only samples that reached consensus between the two clustering algorithms were selected for further analysis, and the consensus clustering results were assigned as the HRD status of each sample.
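The consensus step can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: Ward linkage, the rule for orienting cluster labels, and the synthetic feature matrix are all assumptions, and for brevity every column is z-scored (the indel log-transform and final rescaling are omitted).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans


def consensus_labels(features: np.ndarray, seed: int = 0) -> np.ndarray:
    """Return 1 (high), 0 (low) or -1 (no consensus) for each sample (row)."""
    z = (features - features.mean(axis=0)) / features.std(axis=0)

    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(z)
    # Ward linkage is an assumption; the text does not name the linkage method.
    hc = fcluster(linkage(z, method="ward"), t=2, criterion="maxclust") - 1

    def orient(labels: np.ndarray) -> np.ndarray:
        # Cluster ids are arbitrary, so call the cluster with the higher mean score 1.
        means = [z[labels == k].mean() for k in (0, 1)]
        return labels if means[1] >= means[0] else 1 - labels

    km, hc = orient(km), orient(hc)
    return np.where(km == hc, km, -1)


# Synthetic, well-separated groups, used only to exercise the function.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(6, 1, (20, 3))])
labels = consensus_labels(X)
```

Samples labelled -1 are those on which the two algorithms disagree, which corresponds to the samples excluded from further analysis in the text.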
Differential expression analyses
Gene-level count matrices were normalized using the "Trimmed Mean of M-values" (TMM) method implemented in the edgeR (v. 3.20.9) R package. The normalized count matrices were then transformed into log2 counts-per-million (CPM) values using the "cpm" function from the edgeR package in R. The count matrix was first filtered to remove very lowly expressed and non-expressed genes. Differentially expressed genes were determined by empirical Bayes moderation of the standard errors towards a common value from a linear model fit of the transformed count matrices, as implemented in the limma package, with the threshold for differential expression set at false discovery rate (FDR) < 0.001 and absolute log fold change > 0.2.
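For readers unfamiliar with the transformation, the sketch below shows a simplified log2 CPM calculation and the stated significance thresholds. It is not a reimplementation of edgeR or limma: `norm_factor` stands in for a TMM factor computed elsewhere, and the `prior` offset is an illustrative assumption (edgeR handles the offset differently).

```python
import math


def log2_cpm(counts, norm_factor=1.0, prior=0.5):
    """Simplified log2 counts-per-million for one sample.

    norm_factor is a placeholder for the TMM normalization factor; prior is a
    small offset that avoids log2(0). Both defaults are assumptions.
    """
    lib_size = sum(counts) * norm_factor
    return [math.log2((c + prior) / (lib_size + 2 * prior) * 1e6) for c in counts]


def is_differentially_expressed(log_fc, fdr, fdr_max=0.001, lfc_min=0.2):
    """Thresholds from the text: FDR < 0.001 and |log fold change| > 0.2."""
    return fdr < fdr_max and abs(log_fc) > lfc_min


# Toy counts for three genes in one sample.
values = log2_cpm([0, 10, 990])
```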
Pathway analysis
Over-representation analysis using KEGG and Reactome pathway-based sets as well as gene ontology (GO) based sets was conducted using ConsensusPathDB (http://cpdb.molgen.mpg.de, accessed 21 April 2022) with the human database and ENSEMBL identifiers. For GO-based sets, the search was restricted to gene ontology level 2 and level 3 categories only.
Pathway analysis was also conducted using gene set enrichment analysis (GSEA), as implemented in the Broad Institute GSEA Java executable (v. 4.2.3), using the MSigDB Hallmark gene sets as well as the KEGG gene sets, with the default options of the GSEA program.
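Over-representation is commonly assessed with an upper-tail hypergeometric test, which asks how likely it is to see at least the observed overlap between a gene list and a pathway by chance. The sketch below illustrates that general test; whether ConsensusPathDB applies exactly this form is not stated in the text, and the numbers are made up.

```python
from math import comb


def ora_p_value(universe: int, in_set: int, n_hits: int, overlap: int) -> float:
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe: number of background genes
    in_set:   background genes annotated to the pathway
    n_hits:   genes in the submitted list
    overlap:  submitted genes that are annotated to the pathway
    """
    total = comb(universe, n_hits)
    upper = min(in_set, n_hits)
    favourable = sum(
        comb(in_set, k) * comb(universe - in_set, n_hits - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Toy case: all 5 submitted genes fall in a 5-gene pathway out of 20 genes.
p_all = ora_p_value(universe=20, in_set=5, n_hits=5, overlap=5)
p_none = ora_p_value(universe=20, in_set=5, n_hits=5, overlap=0)
```

In practice the resulting p-values would also be corrected for the number of gene sets tested.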
Determination of germline BRCA mutation status
Carriers of deleterious pathogenic germline variants in BRCA1 and BRCA2 in the MyBrCa cohort were identified from targeted sequencing conducted as part of the BRIDGES study42. LOH and biallelic status of the germline variants were taken from Ng et al.25. Each carrier was independently confirmed with Sanger sequencing.
Classifier architecture
The machine learning framework was implemented in Python (v. 3.9.6) using the libraries "scikit-learn", "scipy", "numpy" (v. 1.26.4), and "pandas" (v. 1.5.3). The input dataset for the classifier consisted of RNA-seq gene expression data quantified as TMM-normalized log2 counts per million (CPM), together with the HRD classification of each sample.
Our classifier architecture consisted of a double loop system (Supp. Fig. 1). In the outer loop, the input data was split into training and testing sets at a 70/30 ratio using a one-fold stratified shuffle split repeated 5 times with different seeds, resulting in 5 sets of training and testing data that were passed into the inner loop. The inner loop combined two classifier pipelines, for the Support Vector Machine (SVM) and Random Forest (RF) algorithms respectively, with the probability that a sample is HRD High being the average score of both pipelines. The inner loop pipeline architecture was adapted from Sammut et al. (2021)33 and has a feature selection step built into the classifier pipeline prior to the classification model, consisting of z-score scaling, k-best selection and collinearity removal. Within the inner loop, the hyperparameters were optimized using a five-fold randomized cross-validation (CV) search that maximizes the area under the receiver operating characteristic curve (AUROC). This randomized CV search tested 1000 random combinations sampled from the specified hyperparameter distributions. The optimization was repeated five times as part of the cross-validation step, and the final scores for the inner loop were the average scores of the five-fold CV. After training, the models were validated against their testing datasets to determine the AUROC for each set of data in the outer loop. Finally, the AUROC scores from each repetition were averaged to obtain the final reported AUROC for the entire ensemble classifier. The final ensemble classifier is essentially composed of 5 sets of 5 SVM and RF models (25 models in total for each algorithm), and the scores generated by the ensemble classifier are the average scores across all 5 sets. The optimized hyperparameters and selected features for each model are reported in the supplementary material.
This final ensemble classifier was used for further validation and is referred to below as the "MyBrCa model".
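The double loop can be sketched with scikit-learn as shown below. This is a simplified illustration, not the published implementation: the hyperparameter ranges and the synthetic data are assumptions, the collinearity removal step is omitted, and the search budget is cut from 1000 to a few iterations so the example runs quickly.

```python
import numpy as np
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_search(model, params, seed, n_iter):
    """Inner loop: scaling and k-best selection live inside the pipeline,
    so they are refit on each CV training fold."""
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("kbest", SelectKBest(f_classif)),
        ("clf", model),
    ])
    params = {"kbest__k": randint(5, 20), **params}  # hypothetical search range
    return RandomizedSearchCV(pipe, params, n_iter=n_iter, cv=5,
                              scoring="roc_auc", random_state=seed)


def run_outer_loop(X, y, n_repeats=5, n_iter=1000):
    """Outer loop: repeated 70/30 stratified splits, one seed per repeat."""
    aurocs = []
    for seed in range(n_repeats):
        split = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=seed)
        train, test = next(split.split(X, y))
        searches = [
            make_search(SVC(probability=True, random_state=seed),
                        {"clf__C": loguniform(1e-2, 1e2)}, seed, n_iter),
            make_search(RandomForestClassifier(random_state=seed),
                        {"clf__n_estimators": randint(20, 60)}, seed, n_iter),
        ]
        # The ensemble score is the mean of the SVM and RF probabilities.
        probs = [s.fit(X[train], y[train]).predict_proba(X[test])[:, 1]
                 for s in searches]
        aurocs.append(roc_auc_score(y[test], np.mean(probs, axis=0)))
    return float(np.mean(aurocs))


# Synthetic data stands in for the expression matrix and HRD labels.
X, y = make_classification(n_samples=120, n_features=30, n_informative=6,
                           random_state=0)
mean_auroc = run_outer_loop(X, y, n_repeats=2, n_iter=3)
```

Placing the feature selection inside the pipeline keeps the held-out folds from influencing which genes are selected.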
Validation on other cohorts
The classifier was validated using gene expression data from TNBC samples from other cohorts, including TCGA, the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) cohort27,28, and the Nik-Zainal (2016)29 (NZ-560) cohort from the International Cancer Genome Consortium (ICGC). Because the individual cohort datasets did not always contain all the genes used in the model training, the models used in each validation were retrained on the MyBrCa data using the genes available for that cohort. The TCGA cohort RNA-seq data was downloaded from the GDC Data Portal and included all 217 genes used in the MyBrCa model. The METABRIC cohort, unlike our other cohorts, consists of microarray data rather than RNA-seq data, and includes only 146 of the genes used in the MyBrCa model. Gene expression data for the METABRIC cohort was downloaded from the European Genome-phenome Archive. For the NZ-560 cohort, we used the log2 FPKM gene expression values from RNA-seq data reported in the original publication, but data was available for only 164 of the genes used in the MyBrCa model. Gene expression values from each cohort were normalized using z-score scaling and quantile normalization separately for each cohort before classification. F1 score, precision, and recall values were calculated using the HRD200 probability threshold that maximized the F1 score.
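Selecting the probability threshold that maximizes F1 amounts to scanning candidate cut-offs and keeping the best one. The following is a minimal sketch with made-up labels and scores; the choice of candidate cut-offs (every observed score) is an assumption.

```python
def best_f1_threshold(y_true, scores):
    """Try each observed score as a cut-off and keep the one with the highest F1.

    Returns (f1, precision, recall, threshold).
    """
    best = (0.0, 0.0, 0.0, None)
    for t in sorted(set(scores)):
        pred = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(pred, y_true) if p and y)
        fp = sum(1 for p, y in zip(pred, y_true) if p and not y)
        fn = sum(1 for p, y in zip(pred, y_true) if not p and y)
        if tp == 0:
            continue  # F1 is undefined or zero without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:
            best = (f1, precision, recall, t)
    return best


# Toy example: 1 = HRD high, 0 = HRD low.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
f1, precision, recall, threshold = best_f1_threshold(y_true, scores)
```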
RNA extraction
RNA from tumour samples was extracted using the QIAGEN miRNeasy Mini Kit with a QIAcube, according to the standard protocol. Total RNA was quantitated using a Nanodrop 2000 Spectrophotometer and RNA integrity was measured using an Agilent 2100 Bioanalyzer.
NanoString validation
For the NanoString validation, we used data from a custom CodeSet developed for the NanoString nCounter platform. This custom CodeSet included 35 genes from our gene set and three housekeeping genes used for data normalization. We obtained NanoString nCounter read counts for these genes from 61 fresh frozen samples and 23 FFPE samples from the MyBrCa TNBC cohort. Expression for this gene set was measured on an nCounter MAX Analysis System, and the raw data was processed and normalized using NanoString's proprietary nSolver (v. 4.0) software before being exported as a normalized gene expression matrix text file for processing by the machine learning classifier, which was retrained using only the 35 genes included in the NanoString data. The NanoString gene expression values were normalized using z-score scaling and quantile normalization before classification. The fresh frozen and FFPE samples were normalized separately.
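Normalizing each batch separately can be sketched as below. Several details are assumptions, since the text does not specify them: z-scoring is applied per gene before quantile normalization, quantile normalization is the classic rank-based form across samples, ties are broken arbitrarily, and the matrices are random placeholders with samples as rows and genes as columns.

```python
import numpy as np


def zscore_genes(expr: np.ndarray) -> np.ndarray:
    """Scale each gene (column) to mean 0 and unit variance across samples (rows)."""
    return (expr - expr.mean(axis=0)) / expr.std(axis=0)


def quantile_normalize(expr: np.ndarray) -> np.ndarray:
    """Give every sample (row) the same distribution: the mean of the sorted rows."""
    ranks = expr.argsort(axis=1).argsort(axis=1)    # rank of each gene within its sample
    reference = np.sort(expr, axis=1).mean(axis=0)  # shared target distribution
    return reference[ranks]


def normalize_batch(expr: np.ndarray) -> np.ndarray:
    """Assumed order: z-score scaling first, then quantile normalization."""
    return quantile_normalize(zscore_genes(expr))


# Placeholder matrices: each batch is normalized on its own.
rng = np.random.default_rng(1)
fresh_frozen = normalize_batch(rng.normal(5, 2, (6, 4)))
ffpe = normalize_batch(rng.normal(3, 1, (5, 4)))
```

Because each batch is processed independently, systematic differences between fresh frozen and FFPE material do not leak into the shared reference distribution.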
Statistical analyses
All box and whiskers plots in the figures are constructed with boxes indicating the 25th percentile, median and 75th percentile, and whiskers showing the maximum and minimum values within 1.5 times the inter-quartile range from the edge of the box, with outliers shown.
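The box and whisker convention described above can be computed as in the sketch below. The quartile interpolation method is an assumption (plotting libraries differ slightly), and the input values are made up.

```python
from statistics import quantiles


def box_stats(values):
    """Quartiles, whisker ends and outliers using the 1.5 x IQR rule."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    reach = 1.5 * (q3 - q1)
    inside = [v for v in values if q1 - reach <= v <= q3 + reach]
    outliers = [v for v in values if v < q1 - reach or v > q3 + reach]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # smallest value within 1.5 x IQR of the box
        "whisker_high": max(inside),   # largest value within 1.5 x IQR of the box
        "outliers": outliers,
    }


stats = box_stats([1, 2, 3, 4, 100])
```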

