Participants
The Institutional Review Board (IRB No. SMC 2022-05-027 for patient samples; GCL-2017-1008-03, GCL-2020-1002-06, GCL-2021-1049-07 for healthy samples) approved the collection of plasma and tissues from patients with lung cancer and healthy individuals. Informed consent was obtained from all participants. All research was performed in accordance with relevant guidelines and regulations. Patients with lung cancer were diagnosed histologically and were treatment naive. The clinical data for the methylated DNA immunoprecipitation sequencing (MeDIP-seq), Whole-genome Enzymatic Methyl-seq (WGEM-seq), Twist Human Methylome Panel, and targeted EM-seq panel cohorts are summarized in Supplementary Tables 1, 2, 3 and Table 1. Details of the dataset composition are provided in Supplementary Table 4.
DNA extraction
Peripheral blood collected in Streck tubes (Streck, USA) underwent a two-step centrifugation protocol to separate plasma and buffy coat. After centrifugation at 3000 rpm for 10 min at 25 °C, a second centrifugation was performed at 16,000×g for 10 min at 25 °C.
cfDNA was extracted from the separated plasma using an assay-specific kit. For MeDIP-seq, cfDNA was extracted automatically using the chemagic DNA Blood200 kit (PerkinElmer, USA) on the chemagic MSM I instrument (PerkinElmer). For the Twist Human Methylome Panel and targeted EM-seq panel, cfDNA was extracted manually using the Mag-Bind cfDNA kit (Omega Bio-Tek, USA). Each cfDNA extraction used 2 ml of plasma, with an elution volume of 50 µl.
Genomic DNA (gDNA) was extracted from both buffy coat and fresh frozen tissue samples. For fresh frozen tissue samples, 10–30 ng of tissue was homogenized using the Homogenizer FastPrep-24 system (MP Biomedicals, USA). Subsequently, for WGEM-seq, gDNA was extracted from the separated buffy coat and fresh frozen tissue samples using the QIAamp DNA Mini kit (Qiagen, Germany).
Serial sample preparation to determine the limit of detection (LOD)
We performed limit of detection experiments using plasma samples from a cancer patient and a healthy individual, diluted to specific tumor fraction ratios. For the lung cancer sample, a tumor fraction of 15% was predicted using ichorCNA (v0.2.0)20. The tumor fraction range was set across five levels: undiluted (tumor fraction 15%), 1%, 0.5%, 0.1%, and 0% (representing a healthy individual).
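The dilution arithmetic can be sketched as follows. This is only an illustration, assuming that the cancer and healthy cfDNA are mixed by mass and that the tumor fraction scales linearly with the mixing proportion; the function name `cancer_share` is hypothetical and does not come from the study.

```python
def cancer_share(target_tf, source_tf=0.15):
    """Fraction of the mixture (by cfDNA mass) taken from the cancer sample.

    Assumes linear dilution: target_tf = share * source_tf.
    """
    if not 0 <= target_tf <= source_tf:
        raise ValueError("target must lie between 0 and the source tumor fraction")
    return target_tf / source_tf


# The five levels named in the text: undiluted, 1%, 0.5%, 0.1%, and 0%.
for target in (0.15, 0.01, 0.005, 0.001, 0.0):
    share = cancer_share(target)
    print(f"target TF {target:.1%}: {share:.2%} cancer cfDNA, {1 - share:.2%} healthy cfDNA")
```

Under these assumptions, a 1% tumor fraction corresponds to roughly one part cancer cfDNA in fifteen parts of the mixture.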
MeDIP-seq
The extracted cfDNA (10 ng) was prepared into libraries using the TruSeq Nano DNA HT Library Prep Kit (Illumina, USA). Following the adapter ligation step, 5mC immunoprecipitation was performed using the iPure Kit V2 (Diagenode, USA) at 10 rpm and 4 °C for 17 h, followed by PCR amplification for 13 cycles. The concentration and size distribution of the resulting libraries were measured using the Qubit dsDNA HS Assay Kit (Invitrogen, USA) and TapeStation 4200 (Agilent Technologies, USA). The prepared libraries were sequenced on the NovaSeq 6000 sequencer (Illumina) in 150-bp paired-end mode, producing approximately 100 million reads per sample.
Adapter trimming and quality trimming of FASTQ files were performed using Trim Galore (version 0.6.6)21. Nucleotide fragments were aligned to the human reference genome (hg19) using the BWA alignment tool (v0.7.17-r1188). Duplicate PCR fragments were removed, and fragments with a mapping quality below 10 were excluded using SAMtools (v1.11). Chromosomes 1–22 were retained, while the others were discarded. We divided the entire genome into 300-bp bins and calculated the read counts for each bin, excluding the regions in the blacklist22. Bins with a total read count of 10 or fewer across all samples were also excluded. Normalization of the 300-bp bins was performed using the trimmed mean of M-values (TMM)23 with the edgeR (v3.28.1) R package24.
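The binning and low-count filtering steps can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each fragment is assigned to a bin by its start coordinate and that the blacklist is already expressed as a set of bin keys. The function names are hypothetical, and TMM normalization (done with edgeR in R) is not reproduced here.

```python
from collections import Counter

BIN_SIZE = 300  # bp, as stated in the text


def bin_counts(fragment_starts):
    """Count fragments per 300-bp bin.

    fragment_starts: iterable of (chromosome, start_position) tuples.
    Returns a Counter keyed by (chromosome, bin_index).
    """
    return Counter((chrom, start // BIN_SIZE) for chrom, start in fragment_starts)


def filter_bins(per_sample_counts, blacklist=frozenset(), min_total=10):
    """Keep bins outside the blacklist whose total count across all samples
    is greater than min_total (bins with 10 or fewer reads are dropped)."""
    totals = Counter()
    for counts in per_sample_counts:
        totals.update(counts)  # Counter.update adds counts together
    return sorted(b for b, n in totals.items() if n > min_total and b not in blacklist)


sample_a = bin_counts([("chr1", 10)] * 6 + [("chr1", 310)] * 5)
sample_b = bin_counts([("chr1", 20)] * 5 + [("chr1", 310)] * 5 + [("chr1", 600)] * 5)
print(filter_bins([sample_a, sample_b]))
```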
WGEM-seq
The gDNA (200 ng) was fragmented to sizes ranging from 240 to 290 bp using a Covaris instrument (Covaris, USA). The library was prepared using the NEBNext enzymatic methyl-seq kit (New England Biolabs, USA) with 200 ng of DNA. Library preparation included a methylation conversion step in which ten-eleven translocation dioxygenase 2 (TET2) and APOBEC enzymes were used to convert unmethylated cytosines to uracil. The size and concentration of the final DNA library were determined using the Qubit dsDNA HS Assay Kit (Invitrogen) and TapeStation 4200 (Agilent Technologies). In the last step, the prepared DNA libraries were sequenced on the NovaSeq 6000 sequencer (Illumina) in 150-bp paired-end mode, producing approximately 600 million reads per sample.
We performed adapter and quality trimming of FASTQ files using Trim Galore. The nucleotide fragments were then aligned to the hg19 reference genome using Bismark (v0.23.0)25, and duplicate PCR fragments were removed using deduplicate_bismark. We used SAMtools view to exclude nucleotide fragments with a mapping quality of less than 10 and to restrict the data to chromosomes 1 to 22. Methylation calling was performed using bismark_methylation_extractor. Beta values were calculated using the methylKit R package (v1.12.0)26 to quantify methylation levels. Beta values were obtained only for CpG sites with a depth of at least 5.
Twist Human Methylome Panel and targeted EM-seq panel
The Twist Human Methylome Panel (Twist Bioscience, USA) targets biologically relevant methylation markers across 123 Mb of genomic content, encompassing 3.98M CpG sites. Our custom-designed targeted EM-seq panel comprises 366 lung cancer-specific methylation markers that differentiate normal samples from cancer samples. Manufactured by Twist Bioscience, this panel spans 0.1 Mb and includes 5K CpG sites.
DNA libraries were prepared using the NEBNext enzymatic methyl-seq kit (New England Biolabs) with 2–100 ng of extracted cfDNA. Methylation conversion involved converting unmethylated cytosines to uracil with TET2 and APOBEC enzymes. Eight sample groups were created by combining 200 ng from each library for hybridization. The target regions were then captured from the hybridized sample. The concentration and size distribution of the resulting libraries and captured DNA were measured using the Qubit dsDNA HS Assay Kit (Invitrogen) and TapeStation 4200 (Agilent Technologies). Sequencing was performed on the NovaSeq 6000 and MiSeq Dx sequencers (Illumina) in 150-bp paired-end mode. The Twist Human Methylome Panel and targeted EM-seq panel achieved average sequencing depths of 220× and 700× per sample, respectively. Data preprocessing was performed in the same way as described for WGEM-seq. Beta values were obtained for CpG sites with a minimum coverage of 10 for the Twist Human Methylome Panel and 20 for the targeted EM-seq panel.
Methylation markers on the Infinium HumanMethylation450 (450K) BeadChip array and MeDIP-seq
The Infinium HumanMethylation450 (450K) BeadChip array datasets whose names begin with "GDC TCGA" were obtained from the University of California Santa Cruz (UCSC) Xena database (https://xenabrowser.net/datapages/). The data consist of 458 primary solid tumor samples and 32 adjacent normal tissue samples for lung adenocarcinoma (ADC), as well as 370 primary solid tumor samples and 42 adjacent normal tissue samples for lung squamous cell carcinoma (SCC). Additionally, the 450K array data of 656 normal blood samples were obtained from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/; GSE40279)27. We obtained beta values for each CpG site from the 450K array data and excluded CpG sites with missing values. The dataset was divided into a discovery set and a validation set; markers were selected using the discovery set and verified using the validation set (Supplementary Table 4a,b). Differentially methylated regions (DMRs) were defined as regions that differed between lung cancer tissues and adjacent normal tissues, as well as between lung cancer tissues and normal blood samples. Selection was performed using the Limma (v3.46.0) R package28, retaining regions where the false discovery rate (FDR, Benjamini–Hochberg method) was < 0.01 and the absolute delta beta was > 0.25.
MeDIP-seq data were generated in-house from 25 patients with lung cancer and 190 healthy individuals. As with the 450K array data, the MeDIP-seq dataset was divided into a discovery set and a validation set (Supplementary Table 4c). DMRs were identified between lung cancer and healthy samples. Using the edgeR R package, we selected regions with an FDR (Benjamini–Hochberg method) value of less than 0.05 and extracted the CpGs within those regions.
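The 450K selection rule (FDR < 0.01 and absolute delta beta > 0.25) can be sketched as follows. This is a minimal illustration in which the per-region p-values are taken as given; in the study they come from Limma, which is not reproduced here. The function names are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values are monotone non-decreasing in the raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_dmrs(pvalues, delta_betas, fdr_cutoff=0.01, delta_cutoff=0.25):
    """Indices of regions passing both the FDR and the delta-beta cutoffs."""
    fdr = benjamini_hochberg(pvalues)
    return [i for i, (q, d) in enumerate(zip(fdr, delta_betas))
            if q < fdr_cutoff and abs(d) > delta_cutoff]


# Toy values, not study data.
print(select_dmrs([0.001, 0.02, 0.0005, 0.5], [0.3, 0.4, -0.1, 0.5]))
```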
Methylation markers on the 450K array and WGEM-seq
The 450K array data were processed following the same protocol outlined in the section “Methylation markers on the Infinium HumanMethylation450 (450K) BeadChip array and MeDIP-seq”. In the WGEM-seq data, we identified DMRs by comparing methylation patterns between seven lung cancer tissues and seven adjacent normal tissues, and between seven lung cancer tissues and ten normal white blood cell (WBC) samples (Supplementary Table 4d). The filtering criteria were an absolute methylation difference > 25 and a q-value < 0.01, calculated using the methylKit R package. P-values were calculated using logistic regression and adjusted to q-values using the SLIM method29.
Significant methylation markers in the cfDNA
We used Twist Human Methylome Panel data from five patients with lung cancer and seven healthy individuals (Supplementary Table 4e). The selected CpGs were further filtered in three steps.
First, the beta value was used to calculate the area under the receiver operating characteristic curve (AUC) with the Scikit-learn Python library (v1.0.2)30 to distinguish lung cancer samples from healthy samples. Regions with AUC values above 0.65 were considered significant. Second, using the methylKit R package, we retained regions where the absolute difference between lung cancer and healthy samples was greater than 3 and the q-value was less than 0.05. Finally, we selected regions where the standard deviation of healthy samples was less than 0.05 using the R software (v3.6.3).
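The first and third filters can be sketched as follows. This is a simplified illustration that swaps the libraries named above for standard-library Python: the AUC is computed through its rank interpretation rather than with Scikit-learn, and the sample standard deviation is used because it matches R's `sd`. The methylKit difference and q-value step is omitted. Treating lung cancer as the positive class with higher beta values is an assumption; the text does not say how hypomethylated markers were handled.

```python
from statistics import stdev


def auc(cancer_values, healthy_values):
    """AUC as the probability that a cancer value exceeds a healthy value,
    counting ties as one half (equivalent to the ROC area)."""
    wins = sum((c > h) + 0.5 * (c == h)
               for c in cancer_values for h in healthy_values)
    return wins / (len(cancer_values) * len(healthy_values))


def passes_filters(cancer_betas, healthy_betas, min_auc=0.65, max_healthy_sd=0.05):
    """Apply the AUC filter and the healthy-sample variability filter."""
    return (auc(cancer_betas, healthy_betas) > min_auc
            and stdev(healthy_betas) < max_healthy_sd)


# Toy beta values, not study data.
print(passes_filters([0.8, 0.7, 0.9], [0.10, 0.12, 0.11, 0.13]))
```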
MFS in cfDNA
To generate MFS inputs, we used targeted EM-seq panel data, comprising cfDNA data from 142 lung cancer patients and 56 healthy individuals. We performed 100-bp binning based on genomic coordinates and selected bins with three or more methylation markers. Next, methylation levels were measured for fragment sizes ranging from 120 to 220 bp in 10-bp intervals.
Methylation level = number of methylated cytosines / total cytosines, where total cytosines ≥ 20.
Total cytosines is the sum of the number of methylated cytosines and the number of unmethylated cytosines. The resulting MFS table is a 2D table in which the x-axis is the genomic position, the y-axis is the fragment size, and the value is the methylation level (Supplementary Fig. 1). For regions with total cytosines < 20, missing values were imputed using the median methylation level for that fragment size.
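The construction of one sample's MFS table can be sketched as follows. This is a minimal illustration under stated assumptions: size classes are half-open 10-bp intervals from [120, 130) to [210, 220), fragments are already assigned to a bin index, and the imputation median is taken across bins within the same sample and size class. The text does not specify these details, and `build_mfs` is a hypothetical name.

```python
from statistics import median

SIZE_STARTS = list(range(120, 220, 10))  # 10 size classes: [120,130) ... [210,220)
MIN_CYTOSINES = 20                       # minimum total cytosines per cell


def build_mfs(fragments, n_bins):
    """Build a table indexed as table[size_class][bin_index].

    fragments: iterable of (bin_index, fragment_size, n_methylated, n_unmethylated).
    """
    n_rows = len(SIZE_STARTS)
    meth = [[0] * n_bins for _ in range(n_rows)]
    total = [[0] * n_bins for _ in range(n_rows)]
    for bin_idx, size, n_meth, n_unmeth in fragments:
        if not 120 <= size < 220:
            continue  # outside the size range used for the table
        row = (size - 120) // 10
        meth[row][bin_idx] += n_meth
        total[row][bin_idx] += n_meth + n_unmeth

    table = []
    for row in range(n_rows):
        levels = [meth[row][b] / total[row][b] if total[row][b] >= MIN_CYTOSINES else None
                  for b in range(n_bins)]
        observed = [v for v in levels if v is not None]
        # Impute cells below the cytosine threshold with the row median;
        # a row with no observed cells stays empty (None).
        fill = median(observed) if observed else None
        table.append([fill if v is None else v for v in levels])
    return table


toy = [(0, 125, 15, 5), (1, 128, 5, 15), (2, 125, 1, 1), (0, 215, 30, 0)]
print(build_mfs(toy, n_bins=3)[0])
```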
Deep learning model generation
Model development used MFS generated from the targeted EM-seq panel data of 142 lung cancer patients and 56 healthy individuals as input. We developed a convolutional neural network (CNN) model to discriminate between healthy individuals and patients with lung cancer using a 2D vector MFS table as input data. The dataset was preprocessed by applying standardization scaling based on healthy samples from the training and validation sets. We divided the entire dataset into training, validation, and test sets (Table 1). The training set was used for model training, the validation set for hyperparameter tuning, and the test set for evaluating the final model performance. Hyperparameter tuning is the process of optimizing the values of the parameters that make up a CNN model (number of convolution layers, number of dense layers, number of convolution filters, etc.). Bayesian optimization was used for hyperparameter tuning. When the validation loss began to increase relative to the training loss, the model was considered to be overfitting and training was stopped. The performance of the models obtained by hyperparameter tuning was compared using the validation set. The model with the best performance on the validation set was selected as the optimal model, and its final performance was evaluated using the test set. Given the 2D vector MFS table of a sample, the trained CNN model calculated the probability that it came from a healthy individual or a patient with lung cancer, using the sigmoid function in the final layer. Samples were classified as lung cancer or healthy using a predicted probability threshold of 0.5.
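The scaling and decision steps around the CNN can be sketched as follows. The network itself is not reproduced. This illustration assumes per-cell standardization on flattened MFS tables, population standard deviation, a fallback of 1.0 for cells with zero variance, and that the sigmoid output is the lung cancer probability with values at or above 0.5 labeled as cancer. None of these details are specified in the text, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def fit_scaler(healthy_tables):
    """Per-cell mean and SD estimated from healthy samples only.

    healthy_tables: list of flattened MFS tables, all of equal length.
    """
    cells = list(zip(*healthy_tables))
    means = [mean(c) for c in cells]
    sds = [pstdev(c) or 1.0 for c in cells]  # avoid division by zero
    return means, sds


def standardize(table, means, sds):
    """Apply the healthy-derived scaling to any sample."""
    return [(x - m) / s for x, m, s in zip(table, means, sds)]


def classify(logit, threshold=0.5):
    """Map a final-layer logit to a label through the sigmoid function."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    return ("lung cancer" if prob >= threshold else "healthy"), prob


means, sds = fit_scaler([[0.2, 0.4], [0.4, 0.4]])
print(standardize([0.5, 0.4], means, sds))
print(classify(2.0))
```

Fitting the scaler on healthy samples from the training and validation sets only, as the text describes, keeps test-set information out of the preprocessing.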
Model construction using fragment size and methylation level features
To compare performance with the MFS feature, we used cfDNA data from the targeted EM-seq panel of 142 lung cancer patients and 56 healthy individuals. For fragment size features, we used the DELFI5 method to calculate the ratio of short fragments. The ratio of short fragments was calculated by dividing the number of short fragments (100–150 bp) by the number of long fragments (151–220 bp) for each 100-bp bin. For the methylation level feature, we quantified methylation levels in 100-bp bins following the same protocol described in the “MFS in cfDNA” section, but the data were aggregated without being divided by fragment size. Standardization scaling was applied to both computed features, using healthy samples from the training and validation sets. Subsequently, a CNN model was trained using these features.
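The short-to-long fragment ratio for a single bin can be sketched as follows. The size ranges come from the text; returning `None` for bins with no long fragments is an assumption, as is the hypothetical function name.

```python
def short_fragment_ratio(fragment_sizes):
    """Ratio of short (100-150 bp) to long (151-220 bp) fragments in one bin."""
    short = sum(100 <= s <= 150 for s in fragment_sizes)
    long_ = sum(151 <= s <= 220 for s in fragment_sizes)
    return short / long_ if long_ else None


# Toy fragment sizes for one 100-bp bin, not study data.
print(short_fragment_ratio([120, 140, 160, 180, 200, 300]))
```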
Statistical analyses
Methylation markers were selected using the Limma R package for the 450K array, the edgeR R package for MeDIP-seq, and the methylKit R package for WGEM-seq. Marker filtering used the Scikit-learn Python library, the methylKit R package, and the R software. To evaluate model performance, we used the area under the receiver operating characteristic curve (AUROC), accuracy, and sensitivity at fixed specificities of 80%, 95%, and 98%. All evaluation metrics were computed using a custom Python script (v3.8.1), with 95% confidence intervals (CI) obtained from 2000 bootstrap iterations.
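Sensitivity at a fixed specificity and its bootstrap confidence interval can be sketched as follows. This is not the authors' script. It assumes the threshold is the lowest healthy score that achieves at least the requested specificity, that each class is resampled separately with replacement, and that the interval is the percentile interval. All function names are hypothetical.

```python
import math
import random


def sensitivity_at_specificity(cancer_scores, healthy_scores, specificity):
    """Sensitivity when at least `specificity` of healthy scores fall at or
    below the decision threshold."""
    ranked = sorted(healthy_scores)
    # Small tolerance guards against floating-point error in the product.
    k = max(1, math.ceil(specificity * len(ranked) - 1e-9))
    threshold = ranked[k - 1]
    return sum(s > threshold for s in cancer_scores) / len(cancer_scores)


def bootstrap_ci(cancer_scores, healthy_scores, metric, n_iter=2000, seed=0):
    """Percentile 95% CI from resampling each class with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_iter):
        c = rng.choices(cancer_scores, k=len(cancer_scores))
        h = rng.choices(healthy_scores, k=len(healthy_scores))
        stats.append(metric(c, h))
    stats.sort()
    return stats[int(0.025 * n_iter)], stats[int(0.975 * n_iter) - 1]


# Toy scores, not study data.
cancer = [0.9, 0.8, 0.7, 0.4]
healthy = [0.1, 0.2, 0.3, 0.35, 0.6]
metric = lambda c, h: sensitivity_at_specificity(c, h, 0.8)
print(metric(cancer, healthy), bootstrap_ci(cancer, healthy, metric))
```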

