AGBT 16 Opening Session
Wednesday, February 10, 2016
James Hadfield-AGBT Guest Blogger (The CRUK Cambridge Institute Genomics Core)
Sean Eddy kicks off the first plenary session, “Genomics I” (chaired by Mike Zody from NYGC). His talk title was “Genome evolution: the future of deciphering the past” but has been changed to “A statistical test of RNA base pair variation applied to lncRNA structure”. Sean’s lab in the HHMI at Harvard University focuses on developing computational methods for genome sequence analysis, and is particularly interested in methods to identify how genomes evolve. Sean was the 2007 Benjamin Franklin Awardee for his open access distribution of HMMER, and co-creation of the Pfam database.
Sean’s talk focused on a simple hypothesis: “RNA sequences with highly conserved secondary structure are more likely to be functional”. The need to maintain a particular secondary structure (to achieve a specific function) imposes constraints on the sequence of an RNA (that performs the specific function). This constraint usually takes the form of pairwise correlations between the bases in the stems of stem-loop structures (A:U & G:C). Understand this constraint well enough and we can predict, or at least get a tip-off, about function. His group have recently been working on improving models that determine secondary structure by introducing statistical tests to support, or not, the predictions. He illustrated the problem with the misuse of R2R in a couple of high-profile lncRNA examples, where groups had used the software to generate beautiful RNA-structural figures but did not heed the authors’ warnings not to use it to predict the significance of these structures. The rest of his talk focused on the statistical tests they’ve developed and implemented, and how they used these to show that recent and high-profile data from the 2015 Cell paper on HOTAIR, a 2,148-nt-long lncRNA, may be incorrect.
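To make the idea of pairwise correlation between stem positions concrete, here is a toy sketch in Python. It is not the statistic Sean presented; it assumes plain mutual information between two alignment columns and a naive column-shuffling null, and the function names and example columns are made up for illustration. A naive shuffle ignores the phylogenetic relationships between the aligned sequences, which a serious test would have to account for.

```python
import math
import random
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


def shuffle_p_value(col_i, col_j, n_shuffles=1000, seed=0):
    """Empirical p-value for the observed MI under a naive shuffle of col_j.

    Assumption: sequences are treated as independent, which real
    (phylogenetically related) alignments are not.
    """
    rng = random.Random(seed)
    observed = mutual_information(col_i, col_j)
    shuffled = list(col_j)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if mutual_information(col_i, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_shuffles + 1)


# Hypothetical columns from eight aligned sequences.
stem_5p = "GCAUGCAU"   # one side of a stem
stem_3p = "CGUACGUA"   # its pairing partner: always the Watson-Crick complement
loop = "AAAAAAAA"      # an invariant unpaired position

mi_stem, p_stem = shuffle_p_value(stem_5p, stem_3p)
mi_loop = mutual_information(stem_5p, loop)
```

The paired columns carry the maximum possible signal for four equally frequent base pairs, while the invariant column carries none. Conservation on its own, with no compensatory changes, gives no covariation evidence for structure.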
lncRNAs are a controversial class (there probably is lots of interesting molecular biology, but also transcriptional noise). They do not appear to be coding RNAs but may make small peptides. Sean’s talk focused on data from the 2015 Cell paper on HOTAIR, a 2,148-nt-long lncRNA. His group tested multiple methods for prediction of consensus structures (150 lncRNAs across multiple algorithms), and have developed statistics that improve the reliability of these structural predictions.
The new, and improved, R2R figures show statistically supported RNA structure. They saw no evidence for evolutionarily conserved structure in HOTAIR (or XIST or SRA). Their current data DO NOT support conserved structures. Other experimental evidence supports the existence of structure in these RNAs, but “any RNA folds”. The take-home message is that evolutionary conservation supports the functional significance of the structure. Sean was very careful to state that this should be a cautionary tale and that stats are needed to improve lncRNA analysis.
In the questions Sean was asked if he’d looked at mRNA secondary structure. He replied that their analysis shows that cis-regulatory structures do appear to be conserved, but coding region structures do not.
David Haussler: “Global sharing of better and more genomes”. David’s lab at UCSC developed the browser so many of us are familiar with (insert poll: UCSC or Ensembl). He is also one of the co-PIs of a modern day Noah’s Ark: the Genome 10K Project, which aims to collect the genomes of 10,000 vertebrate species (co-PIs include Ed Green, Beth Shapiro (who’ll be talking about Passenger pigeon paleogenomes tomorrow), and Terrie Williams).
David’s talk described the work of the Global Alliance for Genomics and Health (GA4GH), which is creating a technology platform and standards to allow sharing of genome data, with clear guidance on privacy, and engagement of key stakeholders. It now has 380 members in 30 countries, including 120 companies. GA4GH was conceived because no one institute has the wherewithal to cope with all the genome data being generated across the world, so it goes into silos. But each silo will contain some of the rare variants, or rare combinations of variants, we really want to identify and understand.
The main part of David’s talk was on genome graphs. Fortunately he started with an intro: a genome graph = a set of sequences (nodes) with joins (edges) connecting the sides of the sequences. Joins can connect either side of a sequence (bidirected edges). Paths encode DNA strings, with the side of entry into each sequence determining the string that is read (I really need the slides here to show this but “no photos at AGBT”). Why should we care about genome graphs? Because the current methods of representing reference genomes mean we need to carefully move from one reference to another e.g. GRCh38, 1000 Genomes, TCGA, UK10K. Also various databases contain the same information in different forms e.g. dbSNP and dbVar, or ClinVar, et al. Ideally we need to get all this information in one place, and in a form that means we don’t need to start over when it’s time for the next genome build. To illustrate this David showed a nice graphic of a Rosetta stone for the genome and described how you could zoom into the graph to see finer structural information.
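Since I can’t show the slides, here is a minimal Python sketch of the definition as I understood it: sequences are nodes, joins connect sides of nodes, and a path spells a string that depends on which side each node is entered from. The class, the "L"/"R" side labels and the toy sequences are my own illustration, not the GA4GH data model or API.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


class GenomeGraph:
    """Toy bidirected sequence graph: nodes hold sequences, joins link node sides."""

    def __init__(self):
        self.nodes = {}      # node name -> DNA sequence
        self.joins = set()   # each join is a frozenset of two (node, side) pairs

    def add_node(self, name, seq):
        self.nodes[name] = seq

    def add_join(self, side_a, side_b):
        # A side is (node_name, "L") or (node_name, "R").
        self.joins.add(frozenset([side_a, side_b]))

    def spell(self, path):
        """Spell the DNA string for a path given as (node, entry_side) steps.

        Entering a node on its left side reads it forward; entering on its
        right side reads the reverse complement.
        """
        out = []
        prev_exit = None
        for name, entry in path:
            step = (name, entry)
            if prev_exit is not None and frozenset([prev_exit, step]) not in self.joins:
                raise ValueError(f"no join between {prev_exit} and {step}")
            seq = self.nodes[name]
            out.append(seq if entry == "L" else reverse_complement(seq))
            prev_exit = (name, "R" if entry == "L" else "L")
        return "".join(out)


# Hypothetical example: a SNP (T or G) between two flanks, plus an inversion join.
g = GenomeGraph()
g.add_node("left", "ACG")
g.add_node("ref", "T")
g.add_node("alt", "G")
g.add_node("right", "TTA")
g.add_join(("left", "R"), ("ref", "L"))
g.add_join(("left", "R"), ("alt", "L"))
g.add_join(("ref", "R"), ("right", "L"))
g.add_join(("alt", "R"), ("right", "L"))
g.add_join(("left", "R"), ("right", "R"))  # enters "right" from its far side

ref_hap = g.spell([("left", "L"), ("ref", "L"), ("right", "L")])
alt_hap = g.spell([("left", "L"), ("alt", "L"), ("right", "L")])
inverted = g.spell([("left", "L"), ("right", "R")])
```

Both alleles of the SNP live in the same structure as alternative paths, and the inversion is just another join, which is the appeal over a single linear reference.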
He used the BRCA Challenge as an example of where GA4GH could have a positive impact. Led by Sir John Burn and Steve Chanock, the BRCA Challenge aims to pool BRCA variant data from around the world. There are over 12,000 coding variants in multiple, and sometimes private, databases e.g. Myriad. In the BRCA Challenge almost everyone agrees to share this information, which potentially allows a BRCA patient with a variant of unknown significance to benefit from data from other patients with that variant, and ultimately creates a set of curated BRCA variants. The BRCA Exchange website allows this data to be easily accessed, ultimately via a mobile app.
David’s summary was that data sharing is easy to do, if we agree to share it!
Pardis Sabeti: “Genomic surveillance of microbial threats”. Pardis is a computational biologist, medical geneticist and an evolutionary geneticist. Her group at Harvard works on computational biology, infectious disease and functional analysis, and is aiming to improve our understanding of the mechanisms underlying evolutionary adaptation in humans and their pathogens. In 2014 her group used NGS to find a single animal-to-human transmission event responsible for the recent Ebola outbreak. They were able to show that the outbreak strain was closely related to a Central African strain seen in 2004, indicating the movement of the virus over a decade from Central to West Africa. Watch her great TED talk, and read about “the Rollerblading Rock Star Scientist of Harvard” at Smithsonian.com.
It appears we were very lucky to have Pardis at AGBT as she’d been involved in a very serious accident (smashed her pelvis and both knees 6 months ago) that could have kept her away from AGBT. But she made a special exception to come to this conference (coming home), as she’s wanted to come to AGBT for a long time, and was really glad to be invited! Her group has been working during her medical leave of absence.
She started her talk by discussing adaptive variation in humans. It is possible to leverage evolution to find biology (links nicely back to Sean’s talk) and her group has discovered variation that plays a role in immune response to flagella, in thermoregulation, and in resistance to cholera. 90% of the changes do not lie in coding regions; most are in regulatory regions.
The main part of her talk focused on the group’s work on Lassa, Ebola and emerging pathogens. Earlier work had shown there was a large biological signal in a West African variant in LARGE that is critical for entry of Lassa virus, first discovered in Nigeria relatively recently. She posed the conundrum “if Lassa is correctly described as an emerging disease, how come there is such a large evolutionary signal”? To understand this better they needed to work on a biosafety level 4 organism in a developing country, with samples that can be low input, degraded, and also contaminated with other genomes, leaving very few viral reads. To do this they needed to develop approaches that work in the field, develop experimental methods to remove human RNA, and develop new computational tools. Ultimately they chose to build labs, not just physically, but also by building the local infrastructure, including staff to run them.
During their initial work they saw patients being diagnosed with obvious haemorrhagic fever, but most positive cases walked in with conjunctivitis while carrying Lassa virus, the cause of a disease with 50% mortality. Many cases must be going undetected. Sero-prevalence data suggested extreme exposure of up to 20%-40% in different countries/populations. Sequencing showed a very diverse virus, probably 1000 years old: this is not an emerging pathogen! She began to believe that Lassa/Ebola are prevalent but unreported (Gire et al., Science 2012). Emerging disease or diagnoses?
During the outbreak of Ebola in 2014 the lab was quickly able to get ready for, and start testing for, Ebola with a PCR test. They identified the first case in Sierra Leone, but even though everything was done correctly this was only the first case they caught; there were 100s of other cases in the surrounding area. Their work published in Science and Cell was underpinned by rapid genome sequencing, and the sharing of that data. One of the reasons they were able to work so quickly/well was long-term involvement with communities, hospitals, labs (built and trained by them). Genomic surveillance technology still needs development: genome sequencing (best done in a national facility, but nanopores might work out), molecular diagnostics (in diagnostics labs), and also rapid diagnostics (ideally at the point of care). Her group are developing a pan-viral diagnostic capture-seq panel (20 viruses) with a minimal probe set at just $4 per sample, and clinical analysis tools available on GitHub, DNAnexus, Fathom, etc.
Pardis finished off by talking about the amazing potential to detect infectious disease and understand where it came from, and what it really is. She also noted that genome data have been recognised as incredibly important by the non-AGBT community, as has data sharing (links nicely back to David’s talk), particularly in an outbreak situation.
There was an obvious personal impact in this work from the death of colleagues in West Africa. I have never before seen (or been moved by) an acknowledgement slide that included an in memoriam section, but five people died in the course of this work.
PS: there were 26 people in her lab and it has continued to work during her extended medical leave. Many labs of this size might destroy themselves with infighting (“while the cat’s away”).
Matthew Sullivan: “Unveiling viral ocean: towards a global map of ocean viruses”. His lab at Ohio State University uses a genomic and metagenomic toolkit to understand the co-evolution of microbe and virus in environmental populations. They are developing single-cell methods for certain species. Matthew is on four out of five papers in a special “Tara Oceans” issue of Science. For the Tara Oceans project researchers sailed the world sampling microscopic plankton at 210 sites and depths of up to 2000m in all the major oceanic regions. Nice work if you can get it, and no wonder he’s got such a big smile on his OSU webpage – although the backdrop looks more like a river in Norfolk than the mid-Pacific!
Matt’s talk started with a video, presumably shot from the Tara Oceans expedition in rough seas: virologists have tough sampling conditions! Microbes dominate biochemical and energy transformations that fuel our planet; 50% of the oxygen is from marine microbes. We forget too easily that microscopic biology is the stuff of life. Before Tara Oceans we knew practically nothing about viruses in the sea. It turns out there are about 10 viruses per microbial cell. These turn over every few days and have a dynamic population structure. The viruses lyse about a third of cells per day. There is such a lack of data on viruses, and Matt’s group are pushing to get past this.
To do this they had to develop a pipeline from sample to sequence, including concentration and purification of samples, and amplification for dsDNA viruses down to 100 femtograms (Matt also gave a nod to Swift Biosciences, who have two posters for ssDNA viruses using the Swift 1S+ kit). This 3rd gen viral genomics generates 100Mb/sample, has some ability to assemble genomes, and they are finding that 90% of genes discovered were unknown.
The Tara Oceans project was a huge consortium effort with days spent collecting, concentrating, prepping samples…on a boat. Not only did the Tara Oceans group uncover huge amounts of information on viral genomics, they also discovered a new process that sequesters carbon in the deep sea. Lysis of cells leads to aggregation of cell debris, which sinks to the ocean floor taking carbon with it. Matt has a paper out today in Nature: “Plankton networks driving carbon export in the oligotrophic ocean”. In the discussion they state how they’ve been able to link genes to 89% of this particular ecosystem’s variability: quite a lot from such small genomes!
That’s the end of the AGBT 16 opening session…unless you stayed up partying all night while I was writing this!