United States of America
Food and Drug Administration
2014 Next Generation Standards Conference
Bethesda, Maryland
Wednesday, September 24, 2014
A g e n d a

Introductory Remarks:

RICHARD G. H. COTTON, PhD, Founding Patron and Scientific Director, Human Variome Project

CAROLYN A. WILSON, PhD, Associate Director for Science, CBER, Food and Drug Administration

VAHAN SIMONYAN, PhD, Lead Scientist, HIVE, CBER, Food and Drug Administration

Next Generation Sequencing Standards:

VAHAN SIMONYAN, PhD, Session Chair

WEIDA TONG, PhD, Director, Division of Bioinformatics and Biostatistics, Food and Drug Administration

AMNON SHABO, PhD, Chair, Translational Health Informatics, European Federation of Medical Informatics

EUGENE YASCHENKO, Chief, Molecular Software Section, NCBI, National Institutes of Health

Big Data Administration and Computational Infrastructure:

EUGENE YASCHENKO, Session Chair

VAHAN SIMONYAN, PhD, Lead Scientist, HIVE, CBER, Food and Drug Administration
PARTICIPANTS (cont'd):

WARREN KIBBE, PhD, Director, CBIIT, National Cancer Institute

TOBY BLOOM, PhD, Deputy Scientific Director, Informatics, New York Genome Center

Database Development:

RAJA MAZUMDER, PhD, Session Chair

KIM PRUITT, PhD, RefSeq Project Lead, National Library of Medicine, NCBI, National Institutes of Health

MIKE CHERRY, PhD, Department of Genetics, Stanford University

RODNEY BRISTER, PhD, Staff Scientist, Virus Genome Group, NCBI, National Institutes of Health

* * * * *
P r o c e e d i n g s
DR. COTTON: I'd like to thank the organizers for letting me join this project and this conference, particularly Vahan and Alin. Unfortunately, due to the time zone, I'm not going to be able to stay for the full conference, but this is one of the reasons why we've recently changed our administration so that we now have two scientific directors in the U.S.: Mike Watson, who is the CEO of the American College of Medical Genetics, and Garry Cutting, who is the well-known curator of the Cystic Fibrosis Database; Sir John Burn joining in Europe; and, in Australia, one of the leaders of the InSiGHT database. They spoke to the FDA in Silver Spring in 2010 on this topic, and some in the audience may have been at that meeting.
The Human Variome Project is a coherent group which has been working together for 20 years to try to obtain the best data for clinical decision making. That's pretty critical when you're talking about life-or-death information. Sir John Burn has said that not sharing data kills. In other words, if some data is not available on the other side of the world, someone could die or be aborted, et cetera. That's quite dramatic, and it's one of the reasons why clinicians clearly need the proper data.
The Human Variome Project first started by looking at research data and sharing that. Now, of course, everything has been modified because we are talking about next generation sequencing. Nevertheless, the curation of single mutations and single genes is much the same; it's just the way of getting the data, and the way it's shared, that is different. So now, of course, we're focusing on sharing clinical-grade results, and I'd like to thank John-Paul Plazzer, the InSiGHT curator, who is here and who is, in fact, helping me with this.
So, in summary, the Human Variome Project is an international nongovernment organization with over 1,000 individual members from 80 countries. It is working to integrate the free and open collection, curation, interpretation, and sharing of all clinically validated genetic and genomic information in inherited disease. It's focused on supporting diagnostic work and diagnostic labs and on building the world's genetics and genomics service delivery capacity. We've already done quite a lot of work in the developing world, and with increased funding in the future and with the help of WHO and UNESCO, we hope to accelerate that. In other words, we're harmonizing national and regional efforts around regulatory frameworks of governance.
We are an NGO of UNESCO, we're working with WHO on a global genomics policy, and we're likely to become an NGO of WHO in the future. We're particularly interested in those two bodies because we hope that will help particularly in the developing world.
So, clinical genetic testing -- if one knows about that -- has pre-analytical, analytical, and interpretation phases. It may be the $1,000 genome, but it can be the $10,000 interpretation, and that's really tough for clinicians, especially when not all the data is available from around the world to make the best decision for their patients.
I think the need for expert curation is highlighted in this slide. Unfortunately, it's going to be a while until it's overtaken by informatics, but one day hopefully it will be. When I first saw this data, I was quite shocked. There was a summary of published results to decide whether a particular mutation in one gene was, in fact, pathogenic or not. You can see that five labs used eight tests, and they all differed in their actual results. In the end, the InSiGHT group -- the model for the Human Variome Project of how to deal with information in a particular gene -- had a teleconference, which John-Paul Plazzer pulled together, to discuss their homework on this particular gene, and they interpreted the mutation as pathogenic. And InSiGHT has, in fact, incorporated, because they put this interpretation on their website; if it is ever proven wrong, they are protected by the incorporation. So this is quite serious, and if you're talking about individual health care, this is quite a challenge, in fact.
Another thing being worked on is the actual terminology. There are a number of different ways of saying "large bone," and Ada Hamosh from OMIM and others have been working on trying to get the ontology down to about 1,000 words that can be taken from a checklist, so that the actual data can be more uniform.
So how does the HVP work? It facilitates the collection, curation, interpretation, and sharing of clinically useful genetic variation from all countries into internationally existing gene and disease databases. Now, it doesn't do all of this itself; it's a community trying to tell the community, or work out how the community does something, in much the same way the FDA is doing in this project and this meeting.
So, obviously, we involve all relevant experts and bodies, build on existing work to avoid duplication, and standardize approaches and methods. It's very important to indicate that this really is a very inclusive project; we try to talk with anybody who comes along who is, in fact, doing something in the area, to assist them and have them assist the world. All of this is being integrated into routine clinical care so that it is sustainable.
So this is the architecture, and all of this can be seen on the website. In the end, we've agreed that all data has to go into NCBI/EBI for safekeeping and for comparison with other data. At the top you can see Channel 1. These are gene- and disease-specific databases, which are curated by experts daily. It's tough trying to get money to do this properly. In the case of our model genes in inherited colon cancer, we've been fortunate to obtain charitable funding to support John-Paul Plazzer. And the Cystic Fibrosis group has clearly been very lucky to get a large amount of money put together for their database.
On the lower channel we have country nodes, where the countries collect data for their own purposes; this ultimately will go into the gene- or disease-specific databases for curation and on to the multiple central databases. Some of you might have heard of ClinGen. ClinGen is also following this model, whereby the gene- and disease-specific databases are curated and the data then goes on to ClinVar, which you might also have heard of.
So this shows you the countries. The orange blobs are the countries that have agreed to start working towards collecting data as a node and then sharing it worldwide. The blue ones are the countries we are still talking with. At the moment there are about 20 countries that have agreed.
So what do the nodes collect? The rationale for collecting the data is, in fact, set out in a paper from several years ago in Genetics in Medicine, if anyone wants to see that. Each country node collects data from the laboratories for all genes tested within the country, together with the scientific data, and some nodes also collect research data. Australia is developing an LOVD 3 database for the collection of next generation sequencing data, and this will be integrated into other nodes.
I will dwell a little bit on the LOVD 3 software, which I will come back to, I think. This software was developed by the community, specifically in Leiden, to curate single genes and the mutations in them. Now the software has developed so well that you can put a whole genome into the database and it writes it into the 22,000 gene-specific databases. It's important to say that this is open source software and, therefore, it can be used in whatever way various users might wish. It could even be used for viruses, et cetera, and food products -- whatever is being talked about in this conference. And then, of course, the relevant data is shared, as I said, in the diagram.
So we've got a couple of lead nodes. What's happened in Malaysia is quite dramatic because we have a very active leader there. He was able to get to government and be quite well funded, and data collection is beginning. Even further, they are actually leading the Southeast Asian region. Obviously, these things are very slow -- I think probably everyone in the room knows how slow it is to get regulatory things through, collect data, and get everyone to actually agree to put the data in. In Australia we got two successive informatics grants funded and we are collecting data, but we have not yet got funding for a proper curator for the whole country. We do, however, have most of the big labs in Australia agreeing to submit data to the Australian database.
Now, I've mentioned InSiGHT before. This is a lead database for the HVP as a model, and it deals with inherited colon cancer. The organization was formed in 2003 with the merger of two groups, and its aim is to improve the care of individuals and families with inherited colon cancer. Their meetings are multi-disciplinary, ranging from basic scientists right through to clinicians, geneticists, et cetera. Their first step, once they decided to make a difference with their databases, was to reconcile three independent databases: one from the literature, one from labs, and one from in vitro testing. And it's hosted on the LOVD database in Leiden. If any of you are not familiar with the LOVD databases in Leiden: as I said, there are 22,000 locus- or gene-specific databases. Some of them do not have any data in them. Some of them do not have curators. But, of course, as soon as whole genome sequences are put in there, those will be populated. And, as I said earlier, there's expert review of variants of unknown significance in the InSiGHT database, which has, in effect, been an assist. And CFTR, whilst not initially a part of the Human Variome Project, is in effect another lead database, though it doesn't use the software that was used by the LOVD system.
What about PharmGKB, for the pharmacogenomics aspects of the FDA's activities? It has a focus on pharmacogenomics and was actually formed in parallel with the Human Variome Project activities. You might well know about it, and obviously Russell is a key person in that area.
Now, regarding standards and recommendations, which are a key part of this conference: consortium members have been developing these since the 1990s. When this group started in the early-to-mid 1990s, people were collecting data on their genes of interest for their clinics on accounting software, et cetera. That is where we started developing the LOVD software, and from then on members developed various recommendations. People can see the Human Mutation virtual issue, where all of those recommendations and standards have been pulled together; they are also listed on the Human Variome Project website. And Mauno Vihinen, one of our most active consortium members, is writing a paper on that at the moment.
Probably the standard that is most widely used is the variant nomenclature, which is, in fact, led by Johan den Dunnen, who also leads the LOVD software work. Database content standards are also widely used, as in the LOVD software, so over that period of time people have agreed what sort of data should be in there, for inherited diseases at least. And now, of course, a wider community is further developing those into actual standards.
So what are the standards under development? Disclaimer statements that we should put on gene- or disease-specific databases. And just as an aside, I think it's quite likely that there will have to be gene- or disease-specific databases for the actual genes of interest that this conference is talking about. There's also one on pathogenicity -- Marc Greenblatt is a prominent U.S. person there. Minimal content for the databases. The sequence description committee -- I've mentioned that. Variant database quality assessment. Disease and phenotype descriptions in gene-specific databases. And minimum content and an ethics checklist.
So, the need for accreditation -- this is very important. If any database is critical to supporting clinical decision making, it needs to be accredited, and, obviously, there are ways of doing this. It can't be done without a national framework like the College of Pathologists or the American College of Medical Genetics. And there's really a requirement for national accreditation if we are to believe data coming in from around the world.
In Australia, for example, the Royal College of Pathologists and the Genetic Society both have draft documents, which can be read, on applied translational genetics -- a proposal that will go to the National Pathology Accreditation Advisory Council and then become a requirement for pathology to follow. This will also be submitted to the Human Variome Project for standards development; HVP member databases will be expected to conform, and one day maybe there will be one form of international accreditation.
So what about the FDA -- at least in pharmacogenomics, and also in viruses as well, I suppose? First of all, what about data collection? We've found it extremely difficult to get data from labs, and maybe grants should be made conditional on data submission. Labs, as part of their accreditation or quality control, should submit data, and in the U.S., for example, it might be state-by-state collection. For quality, you really need gene-specific experts for curation, as I've said, and for bioinformatics, the LOVD open source database might be of interest.
So, just coming to the acknowledgements: we've got a large international scientific advisory committee that meets every month. Some very prominent people are involved with that, and we're well represented from around the world. These are the people who work in the coordinating office.
The final thing I'd like to say is that one of the most difficult things, which I didn't put on the slide, is to connect the actual genetic data to the phenotype. There is, in fact, a system called BioGrid in Australia whereby you can access data about particular diseases and particular patients, and this is one way of following a patient, for example, in a particular hospital. It was initially set up for research, and now, obviously, we're looking at it for following inherited disease. So thank you for listening, and I hope you could hear.
DR. WILSON: Dr. Cotton, thank you so much. This is Carolyn Wilson. That was a really wonderful overview of the project that you're working on. It sounds very exciting and challenging. I just wanted to see if there are any questions. There are microphones now, so if you have questions, please come to a microphone. Questions for Dr. Cotton? Well, thank you again, Dr. Cotton, for that talk.

Our last speaker in this introductory session is Dr. Vahan Simonyan, who has been leading a lot of the efforts within the Center for Biologics at the FDA. He'll be talking about development and implementation of novel next generation sequencing standards, really setting the tone for the conversation we're hoping to have over the next two days.
DR. SIMONYAN: Thank you, and thank you for coming today. And thank you, Carolyn Wilson, because she introduced the strategic perspective from the FDA's viewpoint: why do we need this workshop, and what are the challenges coming our way? And thanks to Dr. Cotton, who gave us the global perspective on this: how does the data move on a global scale, and what kinds of standardization efforts are being done? He also highlighted curation. What I'm going to discuss are the more technical aspects: what we are expecting, being part of the Food and Drug Administration, and how we are seeking solutions.
So this is just an introductory slide of mine. This is our mission. We want to develop a versatile platform for harmonization of next generation sequencing technology. We want to come up with standardization of data formats, promote the way different platforms interact with each other, and also make sure that when you run bioinformatics and analytics, there is a way to verify and validate them from both software and science perspectives.
This is my disclaimer. The perspectives given here are my own, and they do not obligate or bind the FDA.
So what are we going to discuss today? Number one, I'll briefly mention the NGS lifecycle, which most of you are probably very familiar with. Then I'll talk about the challenges we are facing every day. We'll then move to the goals -- the main goal statements, what we are trying to achieve by getting all of these bright people around the same conference room and trying to see what the vision is. Then I'll briefly mention format standardization attempts, which we can do all together, and bioinformatics harmonization attempts. And after that, I'll very briefly talk about the future plans from our perspective: how can we continue our communications?
So, very generally -- and you have to understand, I'm not a biologist; I'm a quantum physicist, so if you see that a bunch of things are missing, it's because I don't know much of the biology -- by my understanding, we always start from some biological specimen. After some chemical and physical treatments, that ends up in the sequencing machine. The results are then transferred through these wires -- one gigabit, ten gigabit, we all hear these names -- into the archiving system. Then computational programs have to get the data back and do the computations, and the circle here represents that it's not just one computation; there are many computations here. And later, the analysis lands on somebody's desk, someone who looks at the data and makes a decision, and it gets approved or not approved.
So there are two models of operation. One of them I call the trust-based model, and when I say trust, I don't mean trust in somebody or some organization. I mean trust in the pipeline, trust in the workflows, trust in the bioinformatics, trust in the data. What we are trying to achieve is to move to a provenance-based mode of operation, where we can validate every step of the way wherever possible. So we are trying to move from the top diagram to the bottom diagram here.
What are the steps of concern from our perspective, and the reason for this workshop and the future work that has to be done? We want to make sure that when archival pipelines are working, we know what is happening in there. We want to know what computations are happening: there are all these wonderful algorithms, hundreds of tools out there, but not many people understand how they work. They use them with default arguments, launch them, get the results. We want to understand and look at that. We also want to talk about the platforms. There are different platforms, and a lot of the time computational results do depend on which platform you are running on. From a data perspective, we want to talk about standardization of data and metadata. By data, we mean what comes out of these devices; metadata means the descriptive information that is assigned to the data. We also want to talk about archival standards -- what do we keep, and how much do we keep? -- and interoperability standards, the standards by which different computational protocols can talk to each other. And, of course, how do we generate the results?
So let's move to the next stage: the NGS challenges. Before we even came to this conversation, there were many different file formats -- wonderful formats used for many different purposes. But the challenge now is that we are dealing with a huge amount of data. Some of the data file formats that were wonderful in serving that role can still do so, but sometimes there's too much engineering information in them. One of the questions -- and believe me, I don't know the solutions; I'm just asking questions -- is how important it is to retain all of the ID lines --
DR. WILSON: Sorry to interrupt. The Adobe connection is a little fuzzy, so if you could put that on you, I think it would be better, because I think it's picking up whatever static is --
DR. SIMONYAN: Sorry, sorry. Give me a minute. We have plenty of time. In fact, we assigned a very big 1.5-hour break. We were half an hour late, but we are going to catch up. Don't worry about it; lunch is just next door.
So let's look at the FASTQ format, for example. We have ID lines, we have base qualities, and we have the sequences themselves. Biologically, the most important part is the sequence information. Qualities are important for understanding and trusting a particular base. And the ID lines in FASTQ files mostly contain geometric information about your plate: what are the coordinates at which that particular sequence was read? These are wonderful if you're running quality controls. These are wonderful if you're trying to analyze how well your experiment went. I can tell wonderful stories of how we can predict earthquakes -- actually post factum -- by looking at the ID lines and the qualities. But that's a different conversation.
So the question is this. If we are compressing these file formats -- and, of course, every sensible technology person tries to compress these file formats, not keep them raw -- there are different ways you can compress, and different coefficients of compression. ID lines compress somewhat, as I've tried to show here. Sequences can be compressed heavily, and EBI has come up with reference-based compression machinery: instead of keeping the whole sequence, we keep only the position and the genome from which it is coming, and your huge sequence compresses to a few bytes only. So this is wonderful. Sequences are the most useful part, and they are the most compressible part.
And then we have to deal with qualities. Qualities, again, are important for quality control; that's why they are called qualities. We use them sometimes in the bioinformatics pipeline, but most of the time their use there is, again, just the flip of a bit: is it important to consider these bases or not? So there is a question: how much of that quality data are we going to keep? Because this part is not well compressible. We have done some analysis: on some viral and bacterial NGS data, 80 percent of the data sometimes ends up being the qualities, and for human cases it averages about 15 percent. People ask me, why do you worry about this if it's only 15 percent of the data? But 15 percent of the data, when we're talking about terabytes and petabytes, is a huge investment. We have to keep it, and we have to keep it for many, many years. So that is important.
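To make those differing compression ratios concrete, here is a minimal sketch with synthetic data, using the standard-library gzip as a stand-in for real NGS compressors: it splits FASTQ content into its three streams and compresses each separately. Structured IDs and redundant sequences compress far better than high-entropy qualities.

```python
import gzip
import random

random.seed(0)
bases = "ACGT"
qual_alphabet = [chr(c) for c in range(33, 74)]  # Phred+33 quality characters

# Hypothetical FASTQ fields: structured IDs, low-entropy sequences
# (reads drawn from one reference region), high-entropy per-base qualities.
ref = "".join(random.choice(bases) for _ in range(2000))
ids, seqs, quals = [], [], []
for i in range(5000):
    pos = random.randrange(len(ref) - 60)
    ids.append(f"@INSTR:1:FC1:2:2104:{15000 + i}:{197000 + i}")
    seqs.append(ref[pos:pos + 60])                 # redundant across reads
    quals.append("".join(random.choice(qual_alphabet) for _ in range(60)))

for name, lines in [("ID lines", ids), ("sequences", seqs), ("qualities", quals)]:
    raw = "\n".join(lines).encode()
    print(f"{name:10s} compression ratio ~{len(raw) / len(gzip.compress(raw)):.1f}x")
```

Running this shows the qualities stream compressing markedly worse than the other two, which is the storage-cost point being made above.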
Another big question is about alignment formats. Well, I have a computer and a program. It takes my reads, it takes my reference sequences, runs, and produces an alignment. And then I'm going to do something with that alignment. Believe me, if you send a SAM file to FDA reviewers saying "this is my result," there's a huge amount of work to be done before a biologically meaningful conclusion can be made. So a SAM file by itself is not a real result from a reviewer's perspective. And the question is this. Most of the time, the SAM files we are using are going to be used on the same platform, on the same set of machines, where we already have the sequences and the qualities. But if you look at the SAM file, that information is completely duplicated. Well, people say maybe it's not meant as an internal file format -- it's not ideal there because it duplicates a lot of stuff -- maybe it's an export file format. But look at its references to the genome: it is wonderful if you are working with human genome version 37, which is available from NCBI, but a lot of the time we deal with sequences whose references are either internal or not published. So by itself -- well, I'm being asked to slow down a little bit -- by itself, that format is not a complete format either. It's repetitious, because it repeats information, and it is not complete. But it's a wonderful format; I don't want to sound like we're criticizing it. We are just asking a question: as we move to the next generation era, do we need to sometimes be critical and skeptical of the file formats that have served us well?
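The duplication and completeness points can be illustrated with a deliberately simplified record; the field names follow SAM column conventions, but this is a sketch, not a SAM parser.

```python
# An aligned record that re-states sequence and qualities already present
# in the FASTQ, versus a slimmed record that keeps only the alignment.
full_sam_record = {
    "qname": "read_001", "rname": "chr1", "pos": 1_000_000,
    "cigar": "60M",
    "seq":  "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "qual": "IIIIHHGGFFFFEEDDCCBBIIIIHHGGFFFFEEDDCCBBIIIIHHGGFFFFEEDDCCBB",
}
slim_record = {k: full_sam_record[k] for k in ("qname", "rname", "pos", "cigar")}

# The slim record is only meaningful if 'rname' resolves to a published,
# versioned reference -- the completeness problem raised above.
print(len(str(full_sam_record)), "vs", len(str(slim_record)), "characters")
```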
Let's move to metadata. Well, NCBI has done a wonderful job, communicating with a huge number of people all over the world, of standardizing some of the fundamental metadata types: BioProject, BioSample, next generation sequencing runs, and experimental protocols. But look at the definitions of these formats: they run to tens of kilobytes of very well-considered, curated text, and yet if you look at samples of the information actually being provided, a lot of the time you'll see it's very sparse. So it is complicated, and there is a need here: when you give people something that is too complicated, some will try to circumvent it. Perhaps we have to think, from FDA's perspective, how we can make their life easier. So there are challenges on the metadata side as well.
Then archival is an issue; I briefly mentioned it when I was talking about the pipeline. The questions coming from FDA's perspective are: how long do we keep the data? We knew there were regulations controlling the length of data storage at FDA, but when it comes to next gen, you understand, we are talking about petabytes. And these petabytes have to be transferred and copied again. If you take a hard drive holding a petabyte and put it in a cool, dark place, it's not going to be readable when you come back in 30 years -- maybe not in 5 or 10 years. And another thing: recently I had a hard drive with an IDE interface and was trying to read my old pictures. There was no way; I couldn't find a computer that could read it. Which means these petabytes of data cannot just be put somewhere as a one-time expenditure. You have to maintain them. You have to hire systems programmers and administrators. So a question arises: how long do you keep the data, understanding what your gradually increasing costs are?
Another question is what we transfer into the system. Do you know that some sequencing machines generate TIFFs that are 17 times larger than the sequences? At some point somebody made a smart decision: maybe we should just not transfer those; that's not a final format. But then the question is this: now we have this big, huge FASTQ file. Is it a final format, or can we still do something about it? Because FASTQ files sometimes are not the real data, despite that we are calling them real data. So what can and cannot we lose during compression? That is another question. If something is not going to be used for computations later, do we lose it or do we maintain it? And I talked about the cost.
Now, let's move to the challenges in bioinformatics. This is just one generalized pipeline, not really anything particular; there is no computational knowledge coming out of it. It's just that a lot of the time we have many components, and I've highlighted some of the data file formats that are in there. Again, the question is: is there too much trust in bioinformatics? I have seen many wonderful scientists using all of the alignment tools and coming up with beautiful solutions. But take the bioinformatics pipeline: Bowtie has about 50 parameters; BLAST, about 50 parameters; our own tool, about 50 parameters. Most of the time, people read the instructions, specify the couple of parameters that are most customizable and easily understandable, run the pipeline, and get the results. A lot of the time people say, hey, I have this wonderful pipeline and it works wonderfully for my viral datasets; I'm going to use it for human. They launch it; it either doesn't finish, or it finishes and you get results. How much do you know? Can you rely on it or not?

So what you plug into your tool is important. I can take a hammer and drive in a big nail, but if I use the same hammer to put a pin in the wall, I'm going to make a hole in the wall. I have seen cases where really smart people do this -- not because they don't understand, but because it's too complicated; it's a different profession. So we have to come up with a way of actually validating a bioinformatics pipeline. It's important to think: what am I applying my tool to? It's important to understand what set of parameters I can use with my tool and what I can expect from my computations. What I can apply my tool to is the usability domain. For what kinds of parameters does the tool generate scientifically correct results? We call this the parameter space. It's just like a real lab experiment: I can run an experiment under one temperature condition and get one set of results -- I am a chemist by my first education -- and under different temperature or pressure conditions I'll get different results. In a bioinformatics pipeline it's no different: with some parameters I get one set of results; with others, different results. So the same tool, the same pipeline, may actually be good for many different purposes; we have different validity domains with different parameter spaces.
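One hedged sketch of how a pipeline could make these ideas machine-checkable follows; every organism name, seed size, and threshold below is invented for illustration, not taken from any real tool's validation.

```python
# Hypothetical declaration of where one aligner configuration is valid.
TOOL_VALIDITY = {
    "usability_domain": {"viruses", "bacteria"},   # NOT validated for human
    "parameter_space": {
        "seed_size": range(8, 33),                 # validated seed lengths
        "min_identity": (0.80, 1.00),              # validated identity range
    },
}

def check_run(organism, seed_size, min_identity):
    """Raise if a requested run falls outside the declared validity domain."""
    if organism not in TOOL_VALIDITY["usability_domain"]:
        raise ValueError(f"tool not validated for organism {organism!r}")
    if seed_size not in TOOL_VALIDITY["parameter_space"]["seed_size"]:
        raise ValueError(f"seed_size {seed_size} outside validated range")
    lo, hi = TOOL_VALIDITY["parameter_space"]["min_identity"]
    if not lo <= min_identity <= hi:
        raise ValueError(f"min_identity {min_identity} outside validated range")

check_run("bacteria", seed_size=16, min_identity=0.9)   # inside the domain: fine
check_run("human", seed_size=16, min_identity=0.9)      # outside: raises ValueError
```

The design point is that the validity declaration travels with the tool, so a reviewer can see at a glance whether a given run was inside or outside the validated domain.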
So what are our goals? Well, before we define the goals, let me say what we do not want to do. We do not want to put any limitations on technology -- that is critically important. We don't want to express preferences for any platform, and we don't want to put any limitations on the particulars of the implementation at a particular local institution, organization, or research institute. We want to make it as open as possible. Whatever we come up with together in the end, we must maintain these freedoms -- just like in the Constitution.
Now, what we do want to do is come up with a good data typing engine, which I'll describe briefly. When we are talking about standardization, I think we should talk not just about what the file format is or what the names of the little fields are; it is silly to get a bunch of smart people together to talk about the names of fields. What we actually need to create is the way you define standards, and that's what I call a data typing engine.
Another thing is validation: we want to work out, in the end, how we validate bioinformatics protocols, what we archive, and how data is submitted -- in this particular case, to FDA. And this is important: how do we all together generate something that is going to be valid tomorrow, in one year, in ten years? There is no way if our vision is short. Technology has changed so much in the last five years that if I thought I was doing something right five years ago, I'm pretty sure half of it was wrong.
But look at our United States system. We have a Constitution and we have a set of laws. If I ask you to define the Constitution: the Constitution is a fundamental framework for creating laws. The laws are the ones that are local, time-limited, and can be changed, which is not true of the Constitution. In the same way, if we generate a bunch of file format standards today, we are bound to fail in the future. But if we create engines of standardization, and if we do it well -- just like our Constitution -- we have the potential of creating a beautiful result in the end, just like our country.
So how do we create these engines, and how do we send our data? We're in luck, because we are not the first ones in this field. Software development paradigms and software development organizations -- ISO and ANSI -- all of these guys have done the legwork for us. And I work closely with NCBI. NCBI, NIST, and other organizations have applied those techniques to actual implementations for bioinformatics and biomedical data. So, for example, this is just a proposed set of metadata objects of biomedical importance that we use regularly in the NGS world. We can start from this.
And there is a hierarchy. One of the things we want to introduce, just like in any programming language, is inheritance. Let's say I have a type I call the bioproject type. This is my project; it's a generic thing. All projects have certain base fields: who is submitting, where it is submitted, where the project came from. Then we can have projects with a specific target that carry extra information. In the same way, a sample is something with certain descriptive fields: where was the sample collected, how much of it was collected, how was it treated? But then, say, a human sample has additional characteristics. An animal sample has a taxonomic identification, which in general you don't need to specify for a sample, because samples in general can be metagenomic, where specifying a taxonomic identification might or might not be useful. So if you create a hierarchy -- if you borrow the idea of inheritance from the computational IT world and create an inherited set of objects -- each object becomes very simple and contains only the things that are necessary. Otherwise, you generate flat and complex data types -- say, a bioproject data type that has to carry every field itself, without this inheritance. So there is a way to deal with this, and one of our initial propositions is to use this technology.
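A minimal sketch of that inheritance idea in code; the field names are invented stand-ins for whatever a community-agreed engine would actually define.

```python
from dataclasses import dataclass

@dataclass
class BioSample:
    # Base fields every sample shares, per the description above.
    collection_site: str
    amount: str
    treatment: str

@dataclass
class AnimalSample(BioSample):
    # Taxonomic ID belongs only where it makes sense -- not on the base
    # type, since metagenomic samples may have no single taxon.
    taxonomy_id: int

@dataclass
class HumanSample(BioSample):
    # Human-specific extras; the field name is illustrative only.
    consent_protocol: str

s = AnimalSample("farm A", "5 ml", "frozen", taxonomy_id=9913)  # 9913: cattle
```

Each subtype stays small because the shared fields live once, on the base type.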
Another one is the introduction of inclusion. A lot of the time, data can be atomic or complex. Atomic is like a string -- my name, for example; if you split it, it's wrong and incomplete. But there are also complex data types, and programming languages and paradigms handle them using this concept of inclusion: we define a nice, reusable data type once and use it constantly. By using inclusion and inheritance, we can generate data types of any complexity while keeping them minimalistic.
The next thing: programming languages also came up with the very nice concept of data typing -- correct field typing, with variables of defined types behind them. Fields should not be just a bunch of values or long strings; there should be a strict interpretation of every field. Fields can be missing, and then the question is: what is the default value? Is there a default value at all? If you omit that field, does it invalidate the whole record or only itself? Is it optional or mandatory? Is it a single-value or multi-value field? My name in a data/metadata format is a single value -- I have just one first name and one last name; I don't even have a middle one. But if you are talking about, say, how many times I visited the doctor, that's multi-value. You see, just by talking about this, we can standardize using these paradigms, and it already becomes easy.
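Continuing the same sketch, inclusion and strict field typing might look like this, with optional, defaulted, and multi-value fields declared in the type itself; the names are again invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Name:
    # A reusable complex type, included wherever a name is needed.
    first: str
    last: str

@dataclass
class PatientRecord:
    name: Name                                               # inclusion of a complex type
    doctor_visits: List[str] = field(default_factory=list)   # multi-value field
    middle_name: Optional[str] = None                        # optional, explicit default

rec = PatientRecord(Name("Jane", "Doe"), doctor_visits=["2014-01-10", "2014-06-02"])
```

Here a missing middle name invalidates nothing, while a missing name would fail construction -- exactly the optional/mandatory distinction described above.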
Now, about flexibility and extensibility. If I design my data hierarchy today -- you see the bigger types including the previous ones by inheriting from them -- and tomorrow I need to introduce a new one because my vision a year ago was not complete, then the engine we create should be able to absorb that modification without invalidating the existing data types.
When we generate this data typing engine, we must also not forget that we are not the first ones here; we are dealing with an existing industry, existing data, and existing data formats. There are ways to design our data typing engines so that they are adaptable -- sometimes, without touching the database content, we will be able to adapt to the new standards. So we are not generating a brand-new data typing engine and forcing everyone to come and play with us. In fact, we are trying to create a data typing engine that adapts to existing data, and that is very important, because there's a huge amount of information out there and there is no way we can convert it all.
Then we must also remember big data. Big data means our formats are sometimes too heavy. One of these days I actually received a few million biomedical records -- test data for testing our platforms -- in XML format. It turned out that a significant amount of the data was engineering bytes and bits, and the actual content was only a fraction. XML is a very nice format to adopt and work with, and there are many tools for it. But when you move to petabytes of data, it becomes too expensive. In fact, there are other, much simpler formats that can be used for these purposes, except that we have to adapt them for big data by pinning down the interpretation of missing values and the like; if they are unambiguously interpreted, there is no issue with them. And we also have to remember: it's not just big data, not just big files -- it's a big number of big files. For example, the Center for Food Safety and Applied Nutrition is involved, together with NCBI, ORA, and many other organizations, in the 100,000 genomes project. Now multiply: each of those 100,000 samples is going to be sequenced many times, with many other associated files, and you end up with millions of files, each one big data in itself. So when we design our data file formats, we have to think not just about big data, but about a big number of big data.
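As a toy measurement of that markup overhead, here is the same hypothetical record serialized as XML and as a tab-separated line, with the "engineering bytes" fraction computed for each.

```python
# One hypothetical sample record, serialized two ways.
fields = {"sample_id": "S0001", "organism": "Homo sapiens", "amount_ml": "5"}

xml = "<sample>" + "".join(f"<{k}>{v}</{k}>" for k, v in fields.items()) + "</sample>"
tsv = "\t".join(fields.values())

payload = sum(len(v) for v in fields.values())
print(f"XML: {len(xml)} bytes, {1 - payload / len(xml):.0%} markup overhead")
print(f"TSV: {len(tsv)} bytes, {1 - payload / len(tsv):.0%} markup overhead")
```

On this tiny record, roughly four-fifths of the XML bytes are markup rather than content; at petabyte scale that ratio is the expense being described.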
Let's move now to bioinformatics harmonization. There's a lot of confusion here. When we say we want to validate something, it's very important to determine what we are trying to validate. Are we trying to validate an experimental method? An experimental protocol? An experimental instance? Let me first try to define the terminology. I looked at some of the scientific ontology publications, and the definition there is that the method is the underlying scientific knowledge that forms the basis of your instructions during the experiment. And to be a scientific experimental method, it has to obey the criteria of science. As a French philosopher and mathematician defined it, any experimental method should be objective, reproducible, deducible, and predictable. I don't want to go into details; these are the fundamentals from a first-year university class on what science is and how it differs. So when somebody uses an experimental method, it should comply with these particular rules.
Now let's go to the experimental protocol, the subject of our discussion now. I mentioned briefly what a usability domain is. Can I apply the protocol to metagenomics? How about bacteria? Eukaryota? Human? Viruses? My protocol is a set of instructions, and, as I mentioned before, the same protocol can produce valid or invalid results. So I have to know what the usability domain is when I'm testing a program.
Then the parametric domain: what are the sizes of the seeds, if I'm talking about alignment? What identity match should I require? Which sequences are considered aligned? If I am calling variants, which variants am I going to use for genotyping purposes? All of these belong to the parametric space. The knowledge domain is different: a lot of programs, as I said, produce many kinds of information -- mutations, identifications, genotypes, expression data. I have seen cases where people use a very nice pipeline, the TopHat pipeline, to generate expression data, but with conditions and parameters under which I wouldn't rely on the variant calls. Again: the same program, the same parameter space, the same usability domain, multiple outputs -- one of them is good and usable, and the other is not, because the parameters are not good for variant calling in this particular case.
We also have to consider, when I am looking at a program to see how good it is, that there are deterministic and heuristic programs. With deterministic programs, you always give them the same input and get the same output. Heuristic programs are different: they are based on random numbers, on some kind of sampling analysis, and they have a chance of generating false negatives and false positives. That's okay; we live in the real world. If you tried to find every possible alignment of a 100-million-read file against a human genome, considering all alternatives without a heuristic program, it would take you 56 years. We don't want to do that. So our programs are much, much faster, but they have a chance of generating false negatives and false positives. So when we talk about validity, we don't require 100 percent conformance or 100 percent anything. There are always errors; the whole technology works within the assumptions of the science, and there are always inaccuracies. So we have to understand what the range of errors is. Okay: usability domain, parametric space, knowledge domain, and range of errors -- all discussed.
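The deterministic/heuristic distinction in miniature, using a toy substring search rather than a real aligner: the exhaustive version always returns the complete answer, while the sampled version is fast, reproducible only for a fixed seed, and can produce false negatives.

```python
import random

def exhaustive_find(text, pattern):
    """Deterministic: the same input always yields every match position."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text[i:i + len(pattern)] == pattern]

def heuristic_find(text, pattern, tries, seed):
    """Heuristic: samples candidate positions; fast, but can miss matches."""
    rng = random.Random(seed)
    hits = set()
    for _ in range(tries):
        i = rng.randrange(len(text) - len(pattern) + 1)
        if text[i:i + len(pattern)] == pattern:
            hits.add(i)
    return sorted(hits)

text = "ACGT" * 500
print(len(exhaustive_find(text, "ACGTAC")))           # all 499 matches, every run
print(len(heuristic_find(text, "ACGTAC", 300, 42)))   # a subset; fixed seed => reproducible
```

The gap between the two counts is exactly the "range of errors" a validity statement would have to quantify.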
Now, another really big thing we have to know is what data I need to run my program. Are my databases good? Am I using the right BLAST matrix if I'm running BLAST? The right viral representative database if I'm trying to detect viruses? There are data prerequisites that will validate or invalidate particular usage patterns of a program.

And there's one big thing that is always missing in my life -- I always feel the need for it -- and that's a biocompute object. All of us have worked with NCBI, GenBank, or other formats, and there are very nice definitions for that data. But when it comes to computations, when I run a complex pipeline, where do I register it? Can I register it and share it with someone? Can I put it into some public database and say, please go use it -- this is my parametric space, my usability domain, all of the things we discussed? There is a need to create metadata for biocompute objects. I know efforts are underway in different places, but I think this is also a very big question in our discussion. We have to define these things.
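As one hedged sketch, such a registerable biocompute object might serialize to something like the following; every field name and value is hypothetical, meant only to gather this talk's concepts into a single shareable record.

```python
import json

# Hypothetical biocompute object: one record describing a computation,
# its validity domains, and its data prerequisites.
biocompute_object = {
    "id": "bco:example-0001",
    "pipeline": {"tool": "example-aligner", "version": "1.2.0"},
    "usability_domain": ["viral genomes", "bacterial genomes"],
    "parametric_space": {"seed_size": 16, "min_identity": 0.9},
    "knowledge_domain": ["variant calls"],
    "data_prerequisites": {"reference_db": "viral-representatives-2014-09"},
    "error_range": {"false_negative_rate": "<= 0.01 (assumed, per validation run)"},
}
print(json.dumps(biocompute_object, indent=2))
```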
Now, this is one of my last slides. Where do we go from here? Today, September 24, is the conference where we invited all of you to start a dialogue. We are not making any regulations, any guidance, anything. We are inviting you to work with us and try to see, because we are facing this challenge; the challenge is coming our way. We want your input -- all the input: academia, industry, somebody mentioned diagnostics companies, other government agencies, international organizations, consortiums. There will be two series of concurrent workshops, happening by teleconference or as face-to-face meetings. You are welcome; tell us. There is a registration sheet, and we will pass this paper around; you can register so you are on an email list we are going to organize after this conference. We are going to edit our documents to make sure that whatever input you give us today or tomorrow is also reflected. Documents will be available for download. And based on your input and demand from registration, we will start scheduling regular monthly meetings. Please come and talk to us.
And where do we end up in the end? I specifically made this slide a little fuzzy on the right side: we don't really know where we'll end up. We just know we have a big challenge and we want to solve it. It might eventually end up generating guidance documents, where your input will also be considered, but also standards documents. There are different organizations -- ISO standards, ANSI standards, ASTM International -- and other organizations that also generate standards and recommend them for usage. And we at the FDA also use standards developed by others, if such standards are available. So if we do a good job together -- if we succeed in this big idea -- parts of it can be used, and we promise we'll work with the standardization organizations to make them publicly available, not just for our purposes but for anything else.
These are my acknowledgment slides. There are many people I should acknowledge. Alin has done a great job; without her, it wouldn't be possible to organize this. Hayley, Alissa, and John were editing all of the documents with me and making sure that none of the communications were missed. Carolyn Wilson -- without her vision it wouldn't be possible to even be here, because the whole support and encouragement is very important for us. And we have very good friends with whom we talk every day; we communicate, we come up with ideas. They are very skeptical, very critical, and that's very helpful. We are hoping that you will be on that second list after this conference, by working with us. And the most important people, after all of these, are the researchers. Without them, it's not possible.

Okay, so now we are ready for questions.
QUESTIONER: Paul Walsham, In Silico Life Signs. One of the challenges that you pointed out was reproducibility, and this is more of a comment: the idea of traceability, which has been widely used and adopted, is very well placed in the FDA. It really matches this -- basically, being able to track the parametric space, as you outlined, as to what was done in pipelines and who did it. So I think there's a good foundation there for addressing that issue.
DR. SIMONYAN: Any other questions? Any other questions from online, maybe? You can type it and we can read it for you.
QUESTIONER: You made a couple of points that struck me. One was that the alignment pipeline is heuristic-based, so there are going to be errors. And given that, we know those pipelines are going to change with time and get better and better. It also seems to me there's a real need to be able to compare results from two different labs using two different pipelines, and to use a nomenclature or a format that says these two people have the same novel polymorphism. So it seems to me that we have to send the primary data and be able to reinterpret it on the fly using common methodologies. There needs to be a standard way of doing that, and part of the goal for this workshop is to recommend standards for doing that.
DR. SIMONYAN: Yes, the question was about, let's say, the alignment pipeline being heuristic, so it can give different results, and there are many pipelines that work with this data -- I'm sorry if I don't represent the question exactly. How do we validate this? Our idea is this: we can generate test cases for validation purposes, and by validation I mean bioinformatics and software validation. To our knowledge there are two ways you can do that -- but, of course, input is welcome; whatever I say, input is welcome. One way is to use simulated datasets. We have done a huge amount of work to generate simulated datasets, and engines to generate simulated datasets, with which we can validate the mathematical accuracy of an algorithm or a pipeline. Let's say somebody claims this is a nice pipeline to generate mutation calls on bacterial genomes. We can mimic bacterial genomes -- we have many tools for that, and we can generate more -- where we will impregnate the samples with known variants and generate reads as if they were coming from one of the devices on the market. We will then run that through the bioinformatics protocol, come out with results, and compare the reality and the outcome.
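A toy version of that spike-in validation loop follows, with invented sequences and a trivial stand-in for a variant caller; a real test would use a read simulator and the actual pipeline under test.

```python
import random

random.seed(7)
ref = "".join(random.choice("ACGT") for _ in range(300))

# Spike in one known variant at a known position.
spike_pos = 150
spike_base = "T" if ref[spike_pos] != "T" else "A"
truth = ref[:spike_pos] + spike_base + ref[spike_pos + 1:]

# Simulate error-free 50-bp reads from the mutated genome.
reads = [(p, truth[p:p + 50]) for p in range(0, 251, 10)]

# Trivial stand-in for a variant caller: pile reads up on the reference
# and report positions where the observed base disagrees.
calls = {p + i: b for p, r in reads for i, b in enumerate(r) if ref[p + i] != b}

assert calls == {spike_pos: spike_base}, "pipeline missed or miscalled the spike-in"
print("recovered variant:", calls)
```

Because the truth is planted by construction, any disagreement between `calls` and the spike-in is attributable to the pipeline, which is the "mathematical accuracy" check described above.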
The other approach: computationally generated data are good because you know what you are expecting. But, for example, our collaborators at NIST -- Justin, I think he's here -- are working on actual biological data, validating what a good biological sample actually is. Biological data have much bigger variability than any simulated data can, and that is how you test the robustness of your bioinformatics pipeline. Let's say we take Genome in a Bottle data, where they are coming up with a validated set of mutations, and they have a very beautiful dataset with familial relations. I'm pretty sure Justin is going to talk about some aspects of this project. We can take that data, create test cases with the parametric space and all of these things varied, and let the pipelines generate results. Then we compare against the validated variant genotypes that Justin and his team are coming up with, and we can test it. So: simulated and biologically validated data.
QUESTIONER: Lester Shulman from the Ministry of Health, Israel. If you have two different variants, or two different variations of one, in your pipeline, how do you decide which would be -- what parameters would you use to decide which one might be more valid?

DR. SIMONYAN: One position? You are saying one position?

QUESTIONER: Let's say that you have two datasets. You've gone through two different pipelines and now you get results that are not harmonized.
DR. SIMONYAN: So, with test data: I put in one variant, and I know what I am putting into the data. Let's say position 1 million in a human genome -- I don't know what that position actually means -- I can impregnate my data with it, generate NGS data, and run the program. Then let's say your pipeline produces an A-to-T call where I know I put a G. That's a way to validate, and genotyping information is available for human data through the collaborators I mentioned. For viruses and bacteria, we do this every time we run a program. We have our own pipelines -- you'll hear about them later -- and we have test sets with mimicked data. Every time we change a single line of code, we take the whole pipeline, run it again, and compare the results with the ones we know we must get. That's mathematical accuracy.
QUESTIONER: Toby Bloom from the New York Genome Center. I'm worried that, at this point in time, comparing existing methods just isn't good enough -- for alignment, maybe, but once you get to variant calling, and especially structural variant calling, you find -- and anybody who's worked on big projects like TCGA or 1000 Genomes knows this -- that you run three different methods, you get three different answers, and you don't find that one method is better than another. What you eventually find is that this somatic mutation caller is better at minor allele frequencies over 10 percent, but that one is better at allele frequencies lower than 10 percent, and this one really doesn't do very well if there's too much stromal contamination. We, at this point, are running three of everything and comparing the results on every sample. So, yes, we do lots of validation -- we actually validate every few months against all the gold standard samples, and we change which three are best and use them. But it's a really complicated problem, and we aren't ready for standards. I mean, we don't know what to make the standards.
DR. SIMONYAN: I think one of the reasons it is hard is that the technology is moving and our understanding of the data is moving, and we have tried to address some of that by coming up with the concept of the knowledge domain. You say some pipelines are good for one particular goal and other pipelines are good for other goals -- so, yes, it's very complicated. I have no choice but to agree with you. But that doesn't mean we don't need to take the steps. We still need to get together and try to work on it. There is no solution; if there were a good solution, we wouldn't be here. There is no solution -- that's why your input, and the input from everybody else, needs to be there.
QUESTIONER: So the previous question is actually perfect, and following on the challenge projects, for example, we are just trying to get to a consensus. Over the years I think we will slowly start understanding that under certain conditions we can make certain claims -- and you mentioned three conditions in that one sentence. So if those are the conditions: which of them can we speak to now, which can we speak to in a year, and which may take us a decade before we can say, okay, this is for sure going to work? The biocompute object, I think, is going to help us tease out the different parts.
DR. SIMONYAN: Yep, yep. And also don't forget we are in a biological universe. Evolution is inherent to us, so all of our visions should also evolve.
QUESTIONER: As we try to develop these standards, I'm wondering whether we have thought about the limits of our technology. Meaning, when it comes to our storage and transfer capability and the data volumes, have we involved engineers and informatics scientists who can advise us as to what is realistic and what is not?
DR. SIMONYAN: That's a very important question. When I moved into this science, I thought I was only going to do the fun stuff, like coming up with beautiful algorithms and running them. But you hit big data and there is an infrastructure challenge behind it -- a huge infrastructure challenge. I can see people in this auditorium who come from the hardware perspective, who are the providers of the computational platforms, providers of the network, providers of storage, and we work with some of them; they were advising us on which platforms exist today and where they are moving. Yes, this is a really big challenge, and that's why we specifically must invite the hardware manufacturers to work with us as well; I do intend to have them. One of the simple things -- and I always see this misconception -- is: hey, I have this nice computer cluster, a thousand cores, and a bunch of storage, and now I'm going to buy two times more of everything and it's going to work two times faster. It doesn't work like that, because some things grow linearly -- computation grows linearly, and data storage grows linearly -- but the networking between them grows like the square, so now you have to buy four times more networking. Unless we have the expertise of the gentlemen and ladies working in these hardware manufacturing companies, who know where the technology is going, we cannot do it. And in a way it's very nice that they are coming to us and asking what we need, because a lot of the time we understand how the algorithmics work -- every day we struggle with the algorithms -- and it's important for them, because they have to write tomorrow's software to work with our algorithms.
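A toy Python sketch of the scaling argument just made, with illustrative numbers only: compute and storage scale with the number of nodes, but all-to-all interconnect scales roughly with its square, so doubling the cluster quadruples the networking.

    def cluster_resources(n_nodes):
        compute = n_nodes                       # grows linearly with nodes
        storage = n_nodes                       # grows linearly with nodes
        links = n_nodes * (n_nodes - 1) // 2    # pairwise links grow ~n^2
        return compute, storage, links

    for n in (1000, 2000):
        compute, storage, links = cluster_resources(n)
        print(n, "nodes:", compute, "compute units,",
              storage, "storage units,", links, "links")
    # 2000 nodes need 1,999,000 links vs 499,500 for 1000 nodes: about 4x.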
DR. WILSON: At this time I think we're going to close the questions. Dr. Simonyan will be around during lunchtime and at other times, so feel free to stop him and ask him more questions. Our next session is chaired by --
DR. SIMONYAN: We are going to pass around the registration forms.
DR. WILSON: So Dr. Simonyan would like to pass around these registration forms. If you'd like to receive information about future conferences, or be involved in these NGS standardization efforts or any bioinformatics aspect of them, there are two categories, so please check the appropriate one and write your email. If you do not feel comfortable passing your name around, that's fine; just email me -- I'm sure most of you know my email -- or Dr. Khaled Bouri, whose email is circulating through the FDA channels; he's the contact person listed in all the different announcements. So feel free to email either one of us and we'll add your name to the list.
The next session, on NGS standards, is chaired by Dr. Simonyan.
DR. SIMONYAN: Dr. Weida Tong is going to give us the next presentation. He works for FDA at the National Center for Toxicological Research, where he is the Director of the Division of Bioinformatics and Biostatistics. He and his group, together with a big collaboration from around the world, have done wonderful bioinformatics and big data projects, and some of the notable ones that are very relevant to us right now are the MAQC and SEQC projects, which he and his team spearheaded. He's going to present his perspective on this effort. Thank you.
DR. TONG: So I'm glad I get a chance to use the microphone. I'm from Little Rock, Arkansas, and without a microphone I would give away my southern accent. (Laughter)
What I'm going to do today is talk about some of the experience I have gained in the past dealing with microarray data, and my recent experience dealing with next generation sequencing. Then, in the last part of my presentation, I'm going to offer a few points I think are important to consider as we try to move this field forward.
Before I start, I do want to take this opportunity to thank Vahan and Dr. Carolyn Wilson for allowing me to talk today. Even though I titled this the FDA perspective, I have to say it is my personal perspective. In the past ten years I have had a tremendous opportunity to work with brilliant scientists in FDA and to deal with genomic technology. But Carolyn in her presentation outlined a variety of projects in the FDA, and I was not involved in all of them, so what I'm going to say here is not presented as the FDA's perspective. It's just my personal view.
Before I really get to the main topics of my presentation, I would like to make two points -- I really want to get these two points out of my way. When we talk about next generation sequencing, it's a tool, and when we talk about the challenges and issues, it really depends on how these particular tools are going to be used. I think we need to keep this in mind even though it's very obvious. For example, if we are going to deal with the human genome -- sometimes we call it DNA sequencing -- we are dealing with 3.2 billion base pairs, that large a size. A microarray probe, by contrast, is only about 25 bases. If we're talking about alignment, that's essentially trivial for the microarray; it's very easy, any aligner can do it. But if you start looking at the various alignment algorithms for dealing with the human genome, that's going to be an entirely different story.
Before I came, I did a very quick search: what does the landscape look like in terms of where next generation sequencing tools have been used across the various areas? Very clearly, almost half of them were used to study genetic variants -- how these variants are related to human disease and how they affect response to drug treatment. This is a very, very important field, and it is the field that has given rise to personalized medicine. If you go to any meeting related to next generation sequencing, you will see how the cost of sequencing has decreased over the years -- I'll spare you that graph. This has had a large and significant impact on DNA sequencing. For RNA sequencing the cost factor is still there, but if you apply the methodology to toxicogenomics, for example, the costs related to the animals become the bottleneck, not the sequencing itself.
So I just wanted to get this out of the way: keep in mind that when we're dealing with these issues, we really need to understand the purpose and the application. This point is related to my second point, and to what I'm going to focus on in today's presentation.
Now, my group is doing a variety of projects using next generation sequencing. We do a little bit of DNA sequencing and a little bit of microarray work. We also have one project using next generation sequencing for food safety, particularly foodborne pathogen identification and detection. But most of our work is focused on RNA-seq.
So most of my presentation is going to focus on some of the observations and results we obtained from a recently completed project called RNA sequencing quality control, or SEQC. That is the way I say it, but the word on the street pronounces it "SEXY." So at least FDA had a little bit of humor in this area.
Three papers that came out of this project were published in the current issue of Nature Biotechnology. Along with these three papers, there are another two papers using our SEQC samples and data, and another two providing commentary or discussion about the results and their implications. So if you have not seen that particular issue -- the current, September issue -- I really encourage you to take a look at it.
Most of my talk will focus on results from the SEQC project, and this is really related to my second point. When we talk about RNA-seq, the issues are slightly different from DNA sequencing, even though they share a lot in common in terms of how to manage the data, how to communicate the data, and so on. In the RNA sequencing area, particularly in gene expression studies, we have two objectives. One is how we can accurately determine which genes are up- and down-regulated -- differentially expressed -- and sometimes we also want to understand the isoforms, the gene fusions, and so forth. That sort of information can enhance our understanding of the underlying mechanisms of disease and health.
The other major goal is how we can use these tools to develop gene expression-based predictive models to predict a variety of endpoints, particularly for clinical use as well as for safety assessment. And come to think of it, this space has been occupied by the microarray for a long, long time: the microarray has been around for over 15 years, and many companies have invested tremendous effort in developing microarray-based biomarkers and predictive models. So the question is how we are going to deal with that, and how the microarray-based investment is going to play out in the RNA-seq era. Those are the sorts of issues we certainly need to talk about.
It turns out the emotions are very different when we talk about microarray and RNA-seq. This is an email I sent to the consortium two weeks before the paper was submitted to Nature Biotechnology, and it makes very clear that there are two camps: one is microarray and the other is next generation sequencing. People have invested pretty much half of their professional lives in one particular platform, and suddenly you say, hey, you are obsolete, go away. So there are a lot of emotions going on. Even when we develop data standards, those emotions are still there, so we need to pay attention to that as well.
And this emotion does not just exist within our consortium. When we submitted the paper to Nature Biotechnology, we had five reviewers, and between the editorial process and the revisions it took a total of eight months to get the paper published. You can see that a lot of the comments are, again, about microarray versus RNA-seq. So I will talk a little bit about this aspect in my presentation as well.
Where are we in terms of the microarray compared to RNA-seq? I did a very quick search in GEO, the public data repository where you can find all kinds of information. In a snapshot I took a couple of months ago, covering just 2014, about six times more microarray data than RNA-seq data had been deposited. GEO started to receive RNA-seq data back in 2006, and it has had microarray data all the way back to the early 2000s. If you look at how the data accumulated in the GEO database over each platform's first eight years, again there is about a 6:1 ratio in favor of the microarray.
So we took those data and projected how long it would take RNA-seq to overtake the microarray, and it came out pretty interesting. Using the data through 2012, the projection is that RNA-seq data in GEO will reach the one million mark -- the current amount of microarray data in the GEO database -- around 2028, and eventually RNA-seq could entirely replace the microarray. This is really just to give you a sense: we will have a fairly long coexistence between these two platforms.
So some of the lessons learned from the microarray still have value, and that is what I'm going to talk about -- it is actually my main topic today.
About ten years ago the microarray was the hot potato in the research community. Everyone wanted a piece of it, and even our institution established a core facility for microarray and gene expression studies. At the time, industry was, as always, at the forefront of applying the emerging technology, using it in both preclinical and clinical settings. So they came to the FDA and asked whether these kinds of data could be shared with the FDA to support some of their submissions. Certainly this generated a lot of anxiety in the FDA community -- just like what we have right now with next generation sequencing, though of course on a much, much larger scale.
So in 2003 FDA released a draft document to industry on pharmacogenomics data submission. In this guidance we articulated a specific mechanism called the voluntary genomics data submission program: essentially, we encouraged industry to send their genomics data to the FDA on a voluntary basis. Through this process we would have an effective dialogue between industry and FDA, so that we could work together to establish guidance on how to deal with genomics data.
At the very beginning we set out three objectives for this program. First, of course, we needed a place to store the data. This is really about how we were going to capture the data submitted by the sponsors, because we believed this type of data would be important to support our future regulatory policy. Second, we wanted to reproduce the sponsors' results, because at times we did not know how they did their analysis, or whether their results could be reproduced. And the third objective: could we do our own analysis and provide the agency's view on how these types of data should be analyzed and appropriately interpreted? So those were our three objectives.
Now, clearly, if you wanted to do this sort of thing, you would need a robust bioinformatics infrastructure in place to do all of this work. So that is what we did in our lab: very early on, we developed a tool called ArrayTrack, whose intended use is to handle microarray data. We then used this tool to support the voluntary genomics data submission program, focusing on microarray data analysis. The tool is still widely used in various parts of the FDA as well as in the research community, and right now we are refining it to deal with RNA-seq data as well.
During the process of using our tool to reproduce the sponsors' results, the first thing we found was that we would never be able to reproduce them. Whatever the document they sent us says -- "we did this, we used the t-test" -- that is not enough. There are six different types of t-test out there; which one? At the end of the day we had to say: send us a script so we can reproduce the results. So this area -- how to reproduce results -- is highly challenging. And when we did our own analysis, the overlap between our findings and the sponsors' results was not even close, and the two led to entirely different biological interpretations. How are we going to deal with these issues?
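A minimal sketch of why "we used the t-test" is not reproducible by itself: the common variants of the test give different statistics and p-values on the same numbers. The data here are made up, and scipy is assumed to be available.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    treated = rng.normal(1.0, 1.5, size=8)    # e.g., one gene's log-ratios
    control = rng.normal(0.0, 0.5, size=8)

    print(stats.ttest_ind(treated, control))                   # Student's, equal variances
    print(stats.ttest_ind(treated, control, equal_var=False))  # Welch's, unequal variances
    print(stats.ttest_rel(treated, control))                   # paired
    print(stats.ttest_1samp(treated - control, 0.0))           # one-sample on differences

Each variant embodies different assumptions, so a submission has to name the exact test and parameters -- or, better, ship the script.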
That is why we started to initiate a very large consortium effort. We summarized the two biggest challenges. The first, of course, is quality control. Every time we received data from the sponsors, they had done very reasonable quality control; the question is, how good is good enough? Can we have quantitative metrics to define the quality control that should be done? That really is the first question.
The second question is analysis: can we standardize analysis? Actually, no. If you standardize the analysis protocol, you essentially hinder innovation, and that is not something we want to do. We should let the research community explore a variety of novel approaches, but we do emphasize that we need a baseline practice. So when you submit data to us -- and I'm still talking about microarray, nothing to do with next generation sequencing -- you need to include one particular agreed methodology, so we can make sure the data can indeed be reproduced.
And lastly, there is the cross-platform question. For the microarray, back in 2005 this was a huge issue, with at least around ten different platforms out there. You give the same samples to ten different platforms, and the question naturally asked is: do you get the same results or not? For next generation sequencing it is actually a little bit simpler, because the platform space has narrowed. But still, these are the questions we tried to address.
So we established a consortium effort called the MicroArray Quality Control, the MAQC project, and we were very fortunate with it. When we started the project, we got tremendous support from all the FDA centers and from the broad research community; certainly there was a recognized need for such an effort. Our objective was first to look at the technical performance of the microarray technology, to see whether the technology itself is reliable or not -- we needed to get that out of the way. Next, we wanted to understand whether this technology is reliable enough to be used for clinical purposes as well as for safety evaluations. So these were our two objectives.
In order to do that, we realized we needed to approach the research community and reach a consensus with the stakeholders. First, we decided to make all the data -- the conclusions, the results -- available to the public. That is why we approached Nature Biotechnology, to see whether they were willing to entertain the results from this consortium. They said yes, and that is why Nature Biotechnology has always published our results. We also wanted to make sure the data were available for people to reproduce our results.
But we took it a step further. We said the samples we use also needed to be made publicly available, so that if anyone were crazy enough to go back and spend $1 million to reproduce our results, they could do it. So we approached a company and asked: can you keep the reference samples available for the entire community for ten years? They said yes. Those samples are still available, and they are actually the samples we just used for next generation sequencing as well.
We proceeded sequentially, in three phases. The first phase ended in 2005 and dealt with the microarray technology: we looked at whether the microarray technology is reliable; we looked at the cross-platform issues; we compared the microarray results with quantitative gene expression assays; and we investigated how the choice of bioinformatics method impacts the downstream biology, and so on. We had six papers published in Nature Biotechnology, and of course it is a very happy thing to see your paper published in a good journal. But I think the most exciting thing, and a thrill for us, is that some of the key results were incorporated into FDA's companion guidance to industry on how to send data to the FDA.
With that, we found that the microarray technology is reliable, particularly once the right bioinformatics approach is put in place to make the platform reproducible across different laboratories. Then we started to look into whether this tool can really be used for clinical use and safety evaluations. It turns out that is a huge, huge undertaking, a long quest. It took us about four years, with around 200 people from 86 organizations. The project had tremendous input from Dr. Greg Campbell from CDRH -- he is an office director -- who provided a lot of recommendations and suggestions about which direction the project should go. And again, at the end of the day we had a bunch of papers published, some in Nature Biotechnology and some in the Pharmacogenomics Journal.
So why were we still working on this project? It was a painful thing. Why were we still at it when RNA-seq suddenly started to gain momentum? Actually, we lost quite a few consortium members, because people said hey, that is much harder stuff to do, and they were gone.
This is a paper published in 2008. I looked it up just before I came, and it has been cited over 3,000 times; clearly it is the point of reference on the future of RNA-seq. What the paper says is that RNA-seq will replace the microarray. Period. It does not say how long that will take, and six years later, as I mentioned at the very beginning, we still see quite a bit of coexistence between the microarray and RNA-seq. There was tremendous euphoria associated with RNA-seq back in 2008, and then the nature of our technology-assessment work steps in and says: wait a minute, are we standing in the exact same spot we stood in before with the microarray? We should have some MAQC-type effort as we gain understanding of the quality and the reproducibility.
So that is why we started the third phase of the project, called SEQC. We had a little over 180 participants from roughly seventy organizations, and it took about six years. We started in 2008 and stopped in the middle for two years, because our Science Advisory Board came in and said: this technology is moving too fast; if you take a snapshot now, you are going to be changing it tomorrow, so why don't you just wait for two years. So we stopped for two years, and then we realized this technology will never stop -- you just cannot wait; you have got to take a snapshot. So that is what we did. We generated over 10 terabytes of data, and a huge portion of it is available in GEO. We produced about ten manuscripts: three in Nature Biotechnology, two in Nature Communications, three in Scientific Data, and two in Genome Biology. Clearly I cannot cover all the findings from this project, but I do want to point out that the three papers in Scientific Data are probably the most interesting, because they provide very detailed descriptions of the data we used in this project. We hope those data can play a role in the research community.
What I'm going to do in the next few slides is explain a little more about those datasets. The first dataset we generated is one for understanding cross-laboratory and cross-platform reproducibility. We have six samples, called A, B, C, D, E, and F. As I said before, samples A and B are the reference samples you can purchase; they are commercially available and were generated for MAQC-1 ten years ago -- the exact same samples, same batch, generated back then. C and D are mixtures of A and B: C is a 3:1 ratio of A to B, and D is the reverse. E and F are the ERCC reference samples. We distributed these six samples to seven Illumina laboratories, so they all had exactly the same samples for sequencing; we distributed them to four SOLiD laboratories, which also had the same samples; and we also distributed them to Roche 454.
Once we got all the data together, we were really in a position to look at cross-laboratory and cross-platform reproducibility. One particular benefit of this design is that, because C and D are mixtures of A and B, we know exactly what the ratios should look like, so we can use this built-in truth to assess the accuracy of the RNA-seq technology.
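A small sketch of how that built-in truth works, with made-up numbers: because C and D are defined mixtures of A and B, the expected signal for every gene is a fixed linear combination of the A and B measurements, and a platform's reported values can be scored against it. (Real analyses also have to handle normalization and mRNA-fraction effects, which are ignored here.)

    import numpy as np

    a = np.array([100.0, 20.0, 5.0])    # measured expression of 3 genes in A
    b = np.array([50.0, 80.0, 5.0])     # the same genes in B

    expected_c = 0.75 * a + 0.25 * b    # C is a 3:1 mixture of A and B
    expected_d = 0.25 * a + 0.75 * b    # D is the reverse, 1:3
    print(expected_d)                   # the same check applies to D

    measured_c = np.array([88.0, 34.0, 5.1])   # what a lab reports (made up)
    relative_error = np.abs(measured_c - expected_c) / expected_c
    print(relative_error)               # per-gene accuracy vs. titration truth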
The second dataset is also very exciting. We took pediatric samples from patients -- 500 of them -- and ran RNA-seq on all of those samples. We also ran microarrays, so we have both kinds of data: microarray and RNA-seq. Then we started to ask a question: which platform gives you an edge in terms of clinical utility? So we did that part.
The third dataset, also pretty interesting, is called the rat body map. We essentially took ten organs from rats -- and not just from one rat: we had males and females at four different developmental stages. In total we had around 320 samples, and we ran RNA-seq on them. For the liver and the kidney we also ran microarrays. That really gives you some idea of how gene expression looks at the high resolution provided by next generation sequencing. These are the papers published in Nature Communications.
The last dataset, also pretty interesting, is the toxicogenomics dataset. We actually have two sets, a training set and a test set. In the middle of the design are the chemicals -- 15 of them, as you can see. For each chemical we have three treated rats versus controls, so we can see how the differentially expressed genes look across all 15 treatments. And for every three chemicals, we know exactly which mechanism was involved in causing the toxicity. We can also develop classifiers on the training set to predict the test set, which has a very similar design. This is the paper in Nature Biotechnology, if you want to take a look at it as well.
Again, as I mentioned, I really do not have time to go over all these results -- it is just impossible. But we do have a poster outside; I will stand by the poster, so please come to me if you have any questions. I would love to talk with you about the various results we obtained from this project.
But I am going to highlight a few findings that I personally find interesting. First of all, we find the technology is absolutely reproducible, regardless of who does it and which platform you use. That is good news. The bad news, on the other side, is that which bioinformatics approach you choose actually matters -- in particular, how you filter out genes expressed at low levels. The largest variation we see is in how to deal with the low-expressed genes. That is the first point I am trying to make.
The second finding is actually very interesting. When we talk about the microarray, a lot of people say it is boring -- it is just up and down, which is pretty boring. The reason people are so excited about the RNA-seq technology is that it provides novel discovery, and many believe that is the only point we need to focus on, that other things are really not important.
So we spent tremendous effort trying to understand how reproducible the novel discoveries are. We looked at the novel junctions identified by RNA-seq across the different laboratories and across different platforms, and we found that 80 percent of them could be verified by real-time PCR. That is great news, but as for their biological significance, we have no idea whatsoever. That is why a new concept was coined in one of the commentaries in Nature Biotechnology: we need to introduce the so-called transcript of unknown significance, the equivalent of the variant of unknown significance in the DNA sequencing area.
Another thing is differentially expressed genes. As I mentioned, RNA-seq is a gene expression platform, and one objective is to accurately profile which genes are differentially expressed. This is a pretty busy slide, so I am not going to go through all of it; I will point out two things. First, when we compare the microarray with RNA-seq, the agreement depends on which samples you are working with. If you are comparing a normal sample versus cancer, those two types of samples are so different that both platforms do great. But if you compare liver samples where one was treated and the other was not, and only a few genes are differentially expressed, both platforms are a disaster and the overlap is very small. So keep in mind which particular samples you are working with; it is important.
Second, we find that RNA-seq has a tremendous advantage at the lower end of transcript abundance -- for genes expressed at low levels.
Lastly, here is another result I think is interesting. One time I went to a meeting where a lot of industry people were talking about the microarray. One company representative commented: my company has invested over the past ten years in generating microarrays, 20,000 of them every year. Now you are telling me to give up all these samples, give up all these microarrays, and redo everything with RNA-seq? That is not realistic. And then, of course, a lot of hands went up, and people said yes, my company is the same. Probably this is one of the reasons we see slow adoption of RNA-seq: the companies have invested so much. So the question our consortium set out to ask is how the legacy microarray data are going to play out in the RNA-seq environment, and whether the biomarkers developed on microarray platforms can be directly applied to RNA-seq data. And as we move forward -- certainly RNA-seq is going to replace the microarray, there is no question about it -- and more and more data are generated with RNA-seq, can the RNA-seq-based biomarkers be applied back to the microarray data to leverage the past investment? These are the important questions. The result, on the right side of the slide, is pretty complicated; it summarizes this part of the investigation.
I think I took too much time, so finally I will just make a few points from my personal view. From working with a lot of the FDA's reviewers and scientists in the area of genomics, particularly on the microarray, and from working with the MAQC consortium members, I find a couple of things that need to be mentioned -- at least that need to be said.
First, NGS is a tool; the challenges and issues largely rest on how you use it, and I think fit-for-purpose is a very important issue that needs attention. We should not pursue one standard, one shoe that fits all: people have different foot sizes, and that is why we have so many different shoe stores. One shoe that fits all sometimes just does not work out very well. So I think this should be kept in mind.
Second, of course, we do need to recognize the evolving nature of this technology. If how this technology evolved over the past ten years was scary, I guess one could be even more scared about the next ten years, because the technology is only going faster, not slower. So we really need to keep this in mind, and we should not let our emotions run too high when we talk about this issue.
And bioinformatics is certainly a significant part of this field; applying bioinformatics approaches accurately is very important. One of the lessons we learned is that we are now in a very, very dangerous zone. We get tens of thousands of data points in an Excel spreadsheet; we have no idea what is in it and no time to look at it; the Excel spreadsheet silently changes some of the data to dates and times, and you do not even know. And then you apply an algorithm, get very excited about some result, and move forward.
We saw that again and again in our MAQC project. I will give you one quick example. We generated two endpoints. One was male versus female -- we did not tell the consortium it was male versus female; we just said, can you predict this endpoint? The other was a random number, and we said both were critical endpoints. Guess what? Some people came up with fantastic predictions for the random data, and sometimes miserable ones for the male-versus-female data. This issue is not unique to genomics data; whenever we are dealing with data like this, I think we need a screening approach that builds positive and negative controls into the process to minimize the error. I think that is also important, and certainly we cannot dwell on it.
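A sketch of the most common way "fantastic" predictions arise on a random endpoint: selecting features on the full dataset before cross-validation leaks the labels into the model. This is illustrative, not the consortium's actual analysis; it assumes scikit-learn is available.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20000))     # 100 samples, 20,000 "genes"
    y = rng.integers(0, 2, size=100)      # a purely random endpoint

    # Wrong: pick the 20 most label-correlated genes using all samples first.
    X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
    print(cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean())  # optimistically high

    # Right: keep feature selection inside the folds; accuracy drops to ~0.5.
    pipeline = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
    print(cross_val_score(pipeline, X, y, cv=5).mean())

Building in such negative controls -- random labels should predict at chance -- is one concrete form of the positive/negative-control screening just mentioned.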
This is why we are all here, and Dr. Wilson has done a fantastic job putting this together and starting the dialogue. Hopefully we walk away with some solutions. Thank you very much. I do need to thank all the consortium members who supported this project. Thank you.
QUESTIONER: I have a question. Can you put in perspective the quantitative versus qualitative reliability of RNA-seq and microarray data -- how quantitatively reliable they are and how qualitatively reliable they are?
DR. TONG: Actually, if we are talking about biomarkers or predictive models, they are pretty much exactly the same as with the microarray. If you think about it, when we develop predictive models, we select only a few genes to differentiate health and disease. The microarray is not bad: it measures 40,000 or 50,000 genes and you pick two or three -- you certainly can find them. RNA-seq measures a little more, and you can also find them. So from the statistical point of view, we cannot see any edge in terms of the classifier. But if you are really engaged in mechanistic understanding, particularly trying to identify novel genes and how they contribute to disease and health, RNA-seq holds a lot of promise. We actually increased the depth to 10 billion -- that was a luxury -- and even at the 10 billion level we continually found new genes, and those new genes could be verified using real-time PCR.
QUESTIONER: Hi. Great talk, thank you very much. I have a question about the number of biological replicates you recommend. We have observed with RNA-seq that the reproducibility between biological replicates is not very high, and we have also seen batch effects. So my question is specifically: how many biological replicates are you recommending for RNA-seq, based on your experience?
DR. TONG: If I say more is better, then certainly people -- this is a very, very good question, and back in the microarray era the same question was raised. We found it still depends on the samples. For example, in the first dataset I mentioned, A and B were entirely different samples, and very few replicates can generate reliable data there. But if the two biological systems are very close, and you want to determine which genes are differentially expressed between the two conditions, you definitely need more samples. So the statistical power depends on the endpoints you study. That probably did not really answer your question.
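The dependence on the endpoint can be made concrete with the standard two-sample power calculation (a textbook normal-approximation formula; the numbers are illustrative): the closer the two biological conditions, the more replicates are needed.

    from scipy.stats import norm

    def n_per_group(delta, sigma, alpha=0.05, power=0.8):
        # Samples per group to detect a mean difference delta at the given power.
        z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
        return 2 * (z * sigma / delta) ** 2

    print(n_per_group(delta=2.0, sigma=1.0))   # ~4 per group: very different samples
    print(n_per_group(delta=0.5, sigma=1.0))   # ~63 per group: close biological systems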
QUESTIONER: Well, the thing is, at least in our experience, we have observed a lot more variability between samples and replicates with RNA-seq than with microarrays. So, for example, right now we are doing four biological replicates, and we try to randomize the machines on which we do the different runs, and that seems to help. But I was wondering if you have had the same experience.
DR. TONG: Let me address your question from another angle. I come from the toxicology field, and in toxicology we normally use three animals, sometimes four. That is it -- we cannot use more, because the animals are too expensive. So most of the time we frame the question the other way: if we are using three animals, how much variation are we going to introduce into this analysis? -- rather than asking how many animals we need to narrow the error down to a given level. That way we ask a question where we are not wrong.
QUESTIONER: Okay, thanks. Laura van 't Veer, University of California, San Francisco. I see on your last slide the very impressive number of people and organizations with whom you have collaborated. I was wondering, in that whole activity of reaching out to other organizations, how much have you been working, or are you interested in working, with the College of American Pathologists and with institutes like NIST? Because I think the work you have undertaken is tremendous and will yield very useful information for everybody.
DR. TONG: Well, thank you very much for the question. For the MAQC project, we followed the normal mechanism we would use in FDA: we put a notice in the Federal Register saying the project was coming, with a deadline -- please send your interest to us. That is how we constituted the consortium membership. We certainly also made some personal calls and brought in some key people we knew from reading the literature, and they contributed to this project. Right now we are in the MAQC-4 phase, starting to decide which particular area we should go into. Hopefully we can make a much broader announcement so that other people have the opportunity to join our efforts.
ALIN: There was a question online. Hannah asks, "What types of biological reference sources are best for reliability assessment? How relevant are animal models?"
DR. TONG: Biological samples -- A and B. For probably any samples you look at, the most data have been generated on A and B, because A and B have become the standards; everybody uses them for proficiency assessment. Using A and B, you go through the entire process and compare your data with the data generated in our project as well as by the research community, and you immediately see where you stand. So many laboratories test using the A and B samples.
ALIN: One more question, and then we'll end it there.
QUESTIONER: Hi. This is Warren Kibbe from NCI. A great talk. A lot of what you and Vahan were both talking about was really around reproducibility. How do we do this in both a quality-controlled way and a reproducible way? It seems like the part you didn't talk about is: how do we not just package up all of our scripts, but really do this in a more permanent way, so that other folks can use exactly the same versions of the software and the same versions of the analysis? How do you see that working into the framework you're talking about?
DR. TONG: Well, I hope Vahan can eventually -- you know, we are working together, and we can come up with a solution for that. No, we have not really addressed this issue; we actually do it the opposite way. We have tried to give much more freedom to the scientists to use whatever they want, because we come in with no hypothesis and we do not know what the best way is; we do not want to lock things down. But, as I said at the very beginning, we do emphasize some baseline practice. We try to identify methods whose results are always reproducible, so that if someone sends data to the FDA, we can say: you can do whatever else you want, but make sure you include that particular pipeline. So this is always on our minds when we are working on the project.
DR. SIMONYAN: To answer that question also: the concept of biocompute objects is designed exactly to address that issue, and not just within a consortium. SEQC can use the same kind of reproducible protocols, but, hopefully, the whole world will be able to use them. And I know NCI and NCBI have harmonization efforts as well, and we will definitely need your input. We already work with your team members, and we are pleased to have the discussion continue.
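As a rough illustration of the idea -- not the actual specification, whose fields were still being defined at the time -- a biocompute object could record everything a third party needs to rerun the same analysis. All names below are hypothetical:

    # Hypothetical sketch of a biocompute object as a Python literal;
    # tool names, versions, and field names are invented for illustration.
    biocompute_object = {
        "name": "RNA-seq differential expression, sample A vs B",
        "pipeline": [
            {"step": "align", "tool": "example-aligner", "version": "1.2.3",
             "parameters": {"seed_length": 25, "max_mismatches": 2}},
            {"step": "count", "tool": "example-counter", "version": "0.9.0",
             "parameters": {"annotation": "example-annotation-build-37"}},
            {"step": "test", "tool": "example-de-test", "version": "2.0.1",
             "parameters": {"method": "welch_t_test", "fdr": 0.05}},
        ],
        "inputs": [{"id": "example-input-reads", "checksum": "md5:<hash>"}],
        "outputs": [{"id": "de_genes.tsv", "checksum": "md5:<hash>"}],
        "environment": {"os": "linux", "runtime": "python-3.4"},
    }
    print(len(biocompute_object["pipeline"]), "recorded pipeline steps")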
So the next speaker is Dr. Amnon Shabo. He is the Chair of Translational Health Informatics at the European Federation of Medical Informatics, and he is going to give us the perspective of the HL7 protocol as well. As you well know, this is one of the most frequently used protocols for the exchange of biomedical information. So hopefully we can see that perspective on how the results of our computations eventually get communicated to different health organizations.
DR. SHABO: Thank you, and good morning. Thanks for the invitation. This material will be closer to the phenotypic side of the world, rather than the genotype and NGS and all this stuff that you all are doing. I call this talk "Translational and Interoperable Health Infostructure" -- information infrastructure -- "the Servant of Three Masters."
What I have been trying to do over the past 14 years, working with IBM Research, is to develop an information infrastructure -- or, in short, infostructure -- that could really serve both research and health care. As we know, different stakeholders in the health arena are developing their own information infrastructures, and they are quite distinct, quite different. If you look, for example, at the EHR, the electronic health record system of a health care provider, there might even be some genetic data in it, but it holds mostly clinical data. On the other side of the continuum, if you look at the information infrastructures of laboratories running NGS, these are two very different systems. But eventually, as the previous speaker said, NGS and all the other platforms are just tools, and eventually we would like to bring the data, and the interpretation of the data, into the clinical space. I think translational is really the game, the playground, here, and NGS is probably a good tool for translational efforts, because it really gives you a lot of data, and translational work is mostly data-driven, quite often bypassing the traditional, classic scientific methodologies.
Some of these ideas are published in a Pharmacogenomics paper of mine, "The Patient-centric Translational Health Record." So that is another focus I am bringing to the table. Health records today are mainly maintained by health care providers, but we also see the personal health record emerging. The idea is an environment where you get the full picture of the patient's health conditions, and that is the ideal environment for bringing in genetic and genomic data, so they can be taken as another input in the interpretation process and in the clinical decision support process carried out by humans and machines.
Just to give you, briefly, my motivation -- a passion of mine for many years already, even before I joined IBM -- it is really about a methodology that contrasts with, or complements, the main methodology used today in clinical decision support and in expert systems in general, which is rule-based reasoning.
There is an alternative methodology called case-based reasoning. It is not machine learning; it is different. You run ontological comparisons between the case you are now treating and some kind of case base, looking for the cases most similar to the one you are treating now, and you use the data in the most similar case, or cases, to refine the decision and get insights into how the current case could be better treated. By the way, the biggest success story of case-based reasoning is actually in help desks for products that have just come to market, totally unrelated to health: the manufacturer does not yet know much about the defects and problems that are occurring, the customers are very angry, and those answering the phones are not really so knowledgeable. Case-based reasoning has been tried there very successfully.
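A toy sketch of case-based retrieval -- nothing like a real clinical system: encode each case as a set of findings, score similarity, and reuse the outcome of the nearest case. The similarity measure here is a deliberately crude stand-in for the ontological comparison just described, and all case contents are invented.

    def similarity(case_a, case_b):
        # Jaccard overlap of findings: a crude stand-in for ontological matching.
        shared = case_a["findings"] & case_b["findings"]
        union = case_a["findings"] | case_b["findings"]
        return len(shared) / len(union) if union else 0.0

    case_base = [
        {"findings": {"nsclc", "egfr_mutation_a"},
         "treatment": "example-drug", "outcome": "remission"},
        {"findings": {"nsclc", "egfr_mutation_b"},
         "treatment": "example-drug", "outcome": "resistant"},
    ]

    current = {"findings": {"nsclc", "egfr_mutation_b", "cough"}}
    nearest = max(case_base, key=lambda case: similarity(current, case))
    print(nearest["treatment"], "->", nearest["outcome"])  # insight from the nearest case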
Now, this is very human, actually, if you think about humans and how they go about reasoning. Most often they try to recall a very similar case that they have seen, or that their colleagues have seen, or that they have read about in the literature. Obviously they also reason based on rules -- they follow clinical guidelines, for example, in the clinical environment -- but what we also call intuition, in a sense, is something that brings up similar cases. So I am saying this is very human, and I think we have to acknowledge the fact that in the health arena, what we do not know is much, much more than what we know. That is why I think case-based reasoning could be a very appealing complementary methodology for refining the decisions you are making based on rules.
Now, that actually leads to the question: what is a case? That is a big question in the case-based reasoning field. In many domains it is obvious, but in the health space it is a big question. Is it just an episode -- just this hospitalization, or this visit I made to the physician? No. I came to the conclusion that the case is the entire lifetime electronic health record. And then the next question that should be asked, if we agree on that, is one of sustainability: how can we sustain and preserve lifetime electronic health records? Those lifetime records should contain all the data collected about the patient, including data the patient generated himself or herself -- and that could also include NGS data and any other type of data: sensor data, imaging data, what have you. I will get to this point of health record sustainability at the end of my talk.
So, again, translational is really the relevant playground here, I think. I guess you are familiar with the field and with the barriers: from bench to bedside, from bedside to community, and from community to policy. Many interventions that succeed at the bedside do not scale out to the community, and definitely not to policy. That is because of many factors, some of them not related to the biology at all -- socioeconomic, bioethical, and so forth -- and all of them have their own kind of informatics, so to speak. I am an informaticist; I look at it from the information point of view. For translational work to succeed, we need to somehow come up with some harmonization between the languages that all those disciplines are speaking. Obviously, within biology and health there are already plenty of languages and different formats that we are coping with. But do not forget that there is also the language of the economist and the ethicist and all those experts who are feeding in more and more factors, constraints, and considerations; without them, we will eventually fail the successful intervention that we managed to bring from the bench to the bedside. So this is a new kind of world: today the biomedical informaticist bridges between the medical informaticist and the bioinformaticist.
In the U.S. you have the PCAST report, if you are familiar with it -- the President's Council of Advisors on Science and Technology. In 2010 they came out with a recommendation to create and disseminate a universal exchange language. I think that is very key to what we are talking about here today. We are talking here about NGS standards, but we all have to face the fact that NGS is just one component, one input out of the many, many types of inputs that go into this kind of health language, so to speak.
In the next few slides I will show you a few key principles that I think should be kept no matter what format you are actually using, no matter what standard you are developing. These are informatics principles that I have been promoting and developing over the years -- over the past 15 years. First and foremost, the most important one, which I present as a key principle, is that we cannot really convey and represent clinical genomics data -- genotype/phenotype associations in this case, or health data in general -- in a flat manner. I like to say that flat representations are flat tires. Why? Because it is really all about the associations between the data. Say I had an inflammation, and that was the indication for an operation -- an inflammation of the gall bladder, for example, and I went through an operation to remove the gall bladder. There is an indication here that indicates the operation, and that association has to be captured in order for humans, and especially for machines, to understand later on why the operation took place. Then maybe there were complications -- again, another association, to the complications. Then maybe I was prescribed medication -- another association, to the medication, and so forth. So it is all about context. It is all about the associations between data items that today are quite disassociated.
They are quite discrete, and that makes them hard to interpret later on, downstream, as we like to say in the genomics area as well -- and downstream here runs all the way down to the patient himself, to the body of the patient. So that really is the most important principle: represent the data in a kind of statement model. That is basically the basis of what we call the clinical statement model. The codes on the left side are drawn from medical terminologies; they are inserted into the basic building blocks, like the patient's procedures and medications and so forth -- and that is being done already today, and quite nicely. But then they should be put, within some syntactic language, into statements, and those statements are already the basis of a common language that we can then use in different domains -- clinical domains, I mean.
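A toy sketch of the statement idea in code: terminology-coded building blocks tied together by explicit, named associations. The class and field names are illustrative only, not the HL7 model itself.

    from dataclasses import dataclass, field

    @dataclass
    class ClinicalStatement:
        kind: str                 # "observation", "procedure", "medication", ...
        code: str                 # a code drawn from a medical terminology
        associations: list = field(default_factory=list)   # (role, statement)

        def associate(self, role, other):
            self.associations.append((role, other))

    inflammation = ClinicalStatement("observation", "cholecystitis")
    operation = ClinicalStatement("procedure", "cholecystectomy")
    medication = ClinicalStatement("medication", "example-antibiotic")

    operation.associate("indication", inflammation)   # why the operation happened
    operation.associate("followed_by", medication)    # downstream treatment

    for role, statement in operation.associations:
        print(role, "->", statement.kind, statement.code)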
Here is a simple example, taken from a project where I led the IBM contribution. It was a GWAS on essential hypertension -- some 25 cohorts from all over Europe, funded by the European Commission -- done with one-million-SNP genotyping, back in 2008. But the clinical data were very rich: about 500 fields just around hypertension, everything you can think of that relates to hypertension. That may be what made this GWAS among the most successful in finding, on the one side, the variants, and on the other side, what they mean -- what biological processes underlie them, and what they mean clinically.
We get data -- and you all know it; most of you are also involved in research -- typically in spreadsheets and relational databases. At the bottom you see blood pressure and heart rate being measured along some kind of timeline. In a different spreadsheet you see an indication of taking an anti-hypertension drug, again along some timeline. It turns out, if you talk with the researchers, that those timelines are the same for the same patient, and this context is essential. Without knowing that one patient is on an anti-hypertension drug, compared with another patient who is not, the blood pressure and heart rate measurements are -- well, I do not want to say useless, but they are out of context, and analysis of them might lead to wrong conclusions.
So what I am saying is: just represent it explicitly. That is what I am advocating for, and it is not rocket science; it is actually very simple. As a previous speaker mentioned -- oh, I do not remember who -- we know this works nicely in programming, in the computer science world. It is the same here: program design and modeling. You have the objects, you have the observations, and you just tie them together based on some common language. Of course, we need to agree on a common language, and that is not so simple -- we humans do not tend to agree so much. But anyway, you can then relate to the data semantically, and you are now much better off in terms of conveying this information to another party -- to a clinical decision support application, or to a colleague of yours.
Now, going to the genomic side: obviously we would like to have clinical genomic statements. Here is a study, of the kind curated in OMIM, from about 10 years ago, I think. There have been studies of EGFR somatic mutations related to non-small-cell lung cancer, and this is a description of a single patient, published in the literature. The patient was in complete remission for the first two years because he had a specific somatic mutation in EGFR that made the tumor responsive to the drug. After two years he acquired a second somatic mutation in the EGFR gene, and this second mutation was resistant to the same drug -- and at that time, 10 years ago, this knowledge was not yet known. The patient relapsed.
So if you just convey the variants by themselves, while the clinical descriptions sit in another system, in some clinic's medical record system, they are not tied together, and it is really hard to bring about decision support and reasoning
110
1 genes, and all those nice things that computer
2 scientists know to do and develop. So I'm saying
3 there are variations. There is the phenotypic
4 side. We need to tie them together. We need to
5 distinguish between observed phenotypes so I know
6 that this patient is responsive to this drug.
7 This has been observed already. I can make the
8 distinction that this phenotype is observed as
9 opposed to interpretation. I guess most of you
10 here are more interested in the interpretation, of
11 course, because you are just basically testing and
12 having the data kind of available to downstream
13 analysis, and you are interested in the
14 interpretive or the potential phenotype of that
15 genomic observation.
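As a rough illustration of the distinction just drawn, here is a small sketch -- my own, not the HL7 model itself -- of tying a variant to a phenotype statement whose status is explicitly "observed" or "interpreted". The identifiers are illustrative only.

```python
# A hedged sketch of distinguishing an observed phenotype from an
# interpreted/potential one, with both tied to the variant they refer to.
from dataclasses import dataclass

@dataclass
class Variant:
    gene: str
    change: str          # e.g., an HGVS-style description

@dataclass
class PhenotypeStatement:
    variant: Variant
    phenotype: str
    status: str          # "observed" vs. "interpreted"

statements = [
    PhenotypeStatement(Variant("EGFR", "sensitizing mutation"),
                       "responsive to the drug", "observed"),
    PhenotypeStatement(Variant("EGFR", "second-site mutation"),
                       "resistant to the drug", "interpreted"),
]
for s in statements:
    print(f"{s.variant.gene} {s.variant.change}: {s.phenotype} [{s.status}]")
```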
So that's the expansion of the clinical genomic statement model. It has already been used in HL7 standards. Now, how many of you -- show of hands -- have heard about HL7? Oh, wow, that's nice. I was expecting a lower number. HL7 is the most dominant organization in the world for developing health care standards. About 10 years ago, together with a few other members of HL7, I founded the Clinical Genomics Working Group, and we have been developing several standards in Version 2 and in Version 3. Recently I also developed a genetic testing report, which has already been published as a document structure. Now, this document structure is based on structures that have already been adopted and recommended by the Meaningful Use criteria in the U.S. How many of you have heard about the Meaningful Use criteria? Wow, not so many. Meaningful Use is about the meaningful use of health care IT in the U.S. It is a mechanism for incentivizing health care providers to use appropriate health care IT, and there are several criteria, some of them in the area of standards. A document standard has been adopted -- documents for summaries, summaries in general, all those referral letters and so forth. So now, in addition to this collection that is being pushed by the government, by the HHS -- which is also called the Consolidated Clinical Document Architecture -- we would like to bring in a genetic testing report document on top of the same technology platform, or framework is a better word, that could be consumed by health care providers. When this is consumed by health care providers, we're actually opening the door. Now I'm wearing the genomics hat: we are opening the door for us to push and bring data into the EHR environment, into the information system environment of the health care providers.
The next informatics principle is that we must keep in mind that narrative and stories -- stories told by the physicians about what they have been thinking of our health situation, and stories that we are telling, in the context of personal health records and so forth -- are not going to go away. These are the narratives, they quite often contain much richer information, and they have to be in the equation. What we actually need here is a balance between narrative and structured data. Of course, without structured data it will be hard for analysis and decision support to act upon the data, but the narrative is there. We can analyze it through natural language processing, but this is not so safe for patients. It could be nice for research, nice for the IBM Watson program, if you have heard about that, but I think not that safe for what we call patient care.
So anyway, there has to be some coexistence between narrative and structured data. We should formulate interlinks between both of them, even though this introduces a kind of inherent redundancy of the data within the formats we are developing. And I guess on the genomics side there is a lot of narrative as well -- descriptions of what has been done and all kinds of things.
So the Clinical Document Architecture, the standard I've been alluding to, is really the major platform where we can convey both structured and narrative data. That's the basic structure of it; I'll go through it quickly. That's the structure of the genetic testing report. A couple of years ago, I think, we got participation in the HL7 Clinical Genomics Group from Illumina and Life Technologies, and we have been actively trying to use this genetic testing report not only for testing of known mutations and the more classic genetic testing, but also for NGS. We got some samples of NGS reports from, I think it was, Illumina, and we started working on bringing at least the summary level of them into this document format that could be consumed, again, by health care providers.
That's the layout of it; I'll go through it quickly. That's a rendering of it. As you remember, it has a rendering mechanism, so it supports human-to-human communication, and underlying that, with inherent redundancy, the data is also structured -- not all of it, of course.
There is a summary section there, saying, okay, we have been doing those tests. In this specific sample we are talking about GJB2 full gene tests, some deletion tests, and some hearing loss mutation tests in the mitochondrial DNA. Each of these tests has its own interpretation -- whether it's pathogenic or benign or whatever. And then there has to be some kind of overall interpretation, and that's actually the gist of this report. This report could be consumed maybe even by a primary care physician, or by someone who is not that expert in all the genetic testing, who wants to see just the overall interpretation. So there has to be, again, some kind of rule-based engine that will take all of the interpretations together, digest them, and see how they can be conveyed in one overall interpretation. This is being done by GeneInsight, for example, which is now spinning out as a company.
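A toy sketch of the rule-based roll-up described here, assuming a simple severity ranking; this is not GeneInsight's actual logic, just one way such an engine could digest per-test interpretations into one overall call.

```python
# Per-test interpretations are ranked, and the overall interpretation is
# the most severe one found. The severity scale is an assumption.
SEVERITY = {"benign": 0, "uncertain": 1, "likely pathogenic": 2, "pathogenic": 3}

def overall_interpretation(per_test: dict) -> str:
    """Digest individual test interpretations into one overall call."""
    return max(per_test.values(), key=lambda label: SEVERITY[label])

tests = {
    "GJB2 full gene sequence": "pathogenic",
    "GJB2/GJB6 deletion test": "benign",
    "mitochondrial hearing-loss panel": "uncertain",
}
print(overall_interpretation(tests))   # -> "pathogenic"
```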
Another key principle is that we've got to develop those standards using tooling, using modeling tools, because quite often, even within HL7, which is, as I said, really the major standardization body, standards are developed just by writing text. As an implementer, what you get is a PDF document. Now go figure out how to take all those fields and all those constraints and instructions and guidance; you have to eyeball them and somehow retype them into your specific solution. Most often that is error prone, and you end up with implementations that actually differ from each other.
But if we start with a tool, and the tool already has the base language within it -- in this case I'm talking about the Clinical Document Architecture, but it could be any other base language we all agree on -- then all I need to do is take the tool and constrain and refine it to the needs of whatever I'm developing, whether it's an NGS report or a genetic testing report that will go into the clinical environment. Then I can be confident that I'm designing and developing a standard that is fully aligned with the base language, because the base language is within the tool, and the tool does not allow you to diverge. Divergence is really the most dangerous thing in standards development, and there is obviously the joke that the nice thing about standards is that there are so many to choose from. And we standards developers are really good at that -- continuing to develop more and more formats, thinking that each format could be really good for another use case. We always have the right excuses, but that's the situation.
Now coming, I think, even closer to the topic of this workshop: raw data. That's another key principle in the clinical environment, in clinical standards. Eventually we need to be able to encapsulate raw data into the clinical structure. We cannot encapsulate everything -- it depends on the amount. What I'm actually advocating for is the encapsulation of raw data, but only key chunks of it, the chunks we think are relevant for the treatment of the patient, keeping references back to the full-blown raw data; and, most importantly, not trying to remodel the raw data. Now, the raw data could differ. You could discuss today different formats for NGS -- fine; maybe there will eventually be just one, if you're all able to agree, and that's fine. Then we use that. If not, if there are two alternatives, that's also fine. We encapsulate whatever format we are getting into the clinical environment, and in the field where the encapsulation is done, we just specify the exact schema and the version of the schema of that NGS standard -- let's assume for a second that there is one like that. So I'm not trying to remodel it, and I'm not trying to block things by saying there has to be only one standard that is encapsulated. There could be multiple, as long as their schemas are acknowledged -- although we have to admit that ideally we would have wanted just a single standard.
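A minimal sketch of that encapsulation idea: a key chunk of raw data carried as-is, with fields naming the exact schema and version of the payload and a reference back to the full raw dataset. The field names and the reference URI scheme are hypothetical, not from any HL7 specification.

```python
# A key chunk of raw data rides inside the clinical structure, not
# remodeled, with its schema identified and a pointer to the full data.
import json

encapsulated = {
    "payload_schema": "NGS-variant-format",    # whatever format was agreed on
    "payload_schema_version": "1.2",
    "payload": "chr7\t55249071\tC\tT",          # key chunk only, carried as-is
    "full_raw_data_ref": "lab-archive://study-123/sample-456",
}
print(json.dumps(encapsulated, indent=2))
```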
So that's the basic idea: having genomic data sources -- laboratories, for example -- send to clinical environments not just interpretations, but also key chunks of raw data. And if it is not really the whole genome, you can actually send the entire raw dataset. Then the EHR, the health care provider's system, is the ideal environment in which to apply the most up-to-date knowledge, get the full-blown picture of the patient's health condition, fuse it all together, and come up with what is actually recommended for the clinicians.
And the most important point is this other matter: reanalysis. Reanalysis is now enabled if you go with this concept, because the raw data is there, and you can reanalyze it with different algorithms, or later on, when there are new discoveries and you want to parse the same raw data again. That's very important, and not really practiced today.
This is just an example of an implementation we have been doing: XML tagging. The outer XML tags are actually coming from the HL7 world, and the inner ones -- this was done about 10 years ago, but it could be just as valid today -- are BSML. I don't know how many of you have heard about BSML, the Bioinformatic Sequence Markup Language? It was done under a grant from the NLM, I think, but it is unfortunately dead now. It is just a format for describing sequences. Today there are other languages; they could be encapsulated in the same manner. Instead of BSML, it would be, I don't know, an NGS markup language or what have you. And it doesn't even have to be XML -- it could be binary data or whatever. XML is nice because the embedding is very clear to the implementers, though not to the consumers. The consumers are not really interested in the underlying representation, but most often the implementers are trying to understand what data they are dealing with.
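A rough sketch of the nesting being described: outer tags in a clinical-document vocabulary with the sequence payload embedded untouched. The element names are illustrative stand-ins, not the real CDA or BSML schemas.

```python
# Outer, clinical-side tags wrap an inner sequence-markup payload
# (BSML then; any agreed markup today), carried as-is.
import xml.etree.ElementTree as ET

doc = ET.Element("observation")                  # outer, clinical-side tag
val = ET.SubElement(doc, "value")
seq = ET.SubElement(val, "Sequence", id="seq1")  # inner, sequence-markup tag
seq.text = "ATGGAGGAGCCGCAGTCA"                  # payload embedded untouched

print(ET.tostring(doc, encoding="unicode"))
```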
I'll skip that. That's our recent effort to develop a standards-independent clinical genomic statement domain information model, which has just been validated, with all those principles already inside.
Now I'm switching gears into something broader, going back again to the translational playground -- and in the translational playground there is eventually much more to it. What I'm showing in this chart is a kind of roadmap that I'm pursuing within the European Federation of Medical Informatics, in the Translational Health Informatics Working Group. The idea is that at the top you see knowledge: knowledge standards, starting from the top left with scientific literature; in the middle, of course, all the terminologies; and on the left side -- on the right side, I'm sorry -- decision support formats. At the bottom, the blue boxes are formats actually used at the point of care.
Now, the middle layer is really the challenge here. In the middle layer, if you start from the left, those are things that are closer to research. If you go to the right, those are things that I call the bridge standards -- they bridge between research and the point of care. And those are the three bridge standards that already exist today: the GTR that I already mentioned, the genetic testing report; the DIR, which stands for diagnostic imaging report -- in the same manner as the encapsulation of genomics, you can encapsulate imaging as well, so it points to key images, not the entire CT set, only to the key images summarizing the diagnostic report; and then the PHMR, the personal health monitoring report, which is really in the space of sensors and personal health devices and so forth. So these could lead nicely into the clinical environment.
Let's move on. One of the things in the middle area was also this paper, where they came up with what I would call a packaging format. They called it iPOP, the integrative Personal Omics Profile. The paper was published in Cell and reviewed in Nature, where they called it "The Rise of the Narciss-ome." One of the authors was a healthy guy who went through all the omics testing you can think of -- whole genome sequencing and so forth, all the clinical event recording and everything -- for about two and a half years. They were trying to see whether there is any value in doing all those omics tests over time, and not just one-off. And indeed, what is reported in this paper is that at the beginning of the study he was found to be predisposed to diabetes. It may sound to you like a joke, but at the end of the study he was actually diagnosed with diabetes. In an interview he says that, based on the predisposition he was found to have, he changed his lifestyle, and the actual diabetes, when diagnosed, was less severe than it could have been if he had not known about it.
So that's the project I've been telling you about briefly. It's called HYPERGENES, and it was a genome-wide study of essential hypertension. That's the translational health informatics architecture we came up with; we implemented it, and it was used by all those clinical centers that contributed the 25 cohorts. They all had specimens, and those specimens went through one-million-SNP genotyping.
So the principles you see here: at the heart of it is warehousing, but not in the regular sense that people usually think of warehousing. This warehousing was standards-based -- based on those clinical standards I've been describing so far, including the genomics component, with cross-links to the mass data. The red repository is really the clinical data; that's the entry point. Those green repositories are the mass and raw data coming from omics, sensors, and imaging. And those are standardized, and the principle to keep in mind is that this is the only place where we keep the original data in the richest form possible. That's where it can be maintained. It can also be a good place to do exploratory analysis, but once you want to get into a specific analysis, you actually go -- if you look at the top of the chart -- and produce some kind of what are usually called data marts, but I call them information marts. Why information? Because out of the data in the warehouse, which is again the original data -- standardized, but original data in its richest form -- I am producing information with a different perspective, a different view, and it's that information mart that actually gets analyzed. One of the best known ones is tranSMART, though the terminology is confusing here because they also call tranSMART a warehouse. How many of you have heard about tranSMART? It's kind of an extension of i2b2, which is the warehousing machinery for getting EHR data into research.
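A small sketch of the warehouse-to-mart step as I read it: the warehouse keeps the original, standardized records, and a mart is a derived projection shaped for one analysis. The records and field names are invented.

```python
# The warehouse holds original, standardized records; a "mart" is a
# projection with a different perspective for one specific analysis.
warehouse = [
    {"patient": "P1", "sbp": 150, "on_drug": True,  "variant": "rs123:AA"},
    {"patient": "P2", "sbp": 128, "on_drug": False, "variant": "rs123:AG"},
]

def build_mart(records, fields):
    """Project only the fields one analysis needs; the warehouse stays intact."""
    return [{f: r[f] for f in fields} for r in records]

gwas_mart = build_mart(warehouse, ["sbp", "variant"])
print(gwas_mart)
```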
Let's move on; I guess I have just a few minutes left. That's the principle I've been talking about, and that's how we did it for this hypertension study. These are the specific standards we picked, constrained, and linked together to come up with the standards-based information model governing the warehousing. An ontology must be developed for specific solutions -- you cannot avoid it. You cannot always find the right terms in the common terminologies.
I'll skip that -- that's another issue, but I don't have time for it. Expressiveness versus interoperability: the more expressive you are, the less interoperable you are. That's a big tension between health care and research. In research we would like to be very expressive, to keep all the details of the data. In health care they are now more and more interested in interoperability -- exchanging information between providers for care coordination, that's the latest hype. How can we coordinate the care we are giving to the patients?
So I'm coming now to the last part, and I'll do it very quickly; I already alluded to it in the first slide of my talk. Eventually, as I said, the case to me -- if I'm coming back to case-based reasoning -- is really the lifetime electronic health record, including any genomic data, sensor data, imaging data. And in those datasets, we all know that most of the data is of unknown significance. But in case-based reasoning, we could actually compare those unknown-significance data ontologically and still come up with insights that would help us refine our clinical decision support.
So, just so we're on the same page, what is the difference between an EHR and medical records? Medical records are being expanded along three dimensions -- institutional; time, that is, longitudinal; and content, from medicine to health. And what is the basic architecture of an EHR? The main benefit here is not just the underlying data -- the medical records, charts, documents, and so forth -- it has to be the summary of them. If we don't create the summary, it's almost useless for the clinician at the point of care, who has just 5 or 10 minutes to give us the best care. The summary also includes personal genetic evaluation and genetics-based elements -- it's all there, in the spirit of having summaries as they are normally done, but also topical summaries around diseases, around events and problems.
So what are the sustainability constellations? If you go all the way up to the highest level possible -- which I don't even know how to name -- what are the constellations for sustaining health records? At the top left is the government-centric model, and we see some of it in places like the UK, perhaps, or even Australia. The risk is obvious, I think: Big Brother. At the top right is the provider-centric model, where providers are the record keepers, and the risk here -- we all know it from our experience as patients -- is that you always get partial data. You never get the entire data, because there are always silos of information within the health care provider system.
On the bottom left is the regional-centric model: okay, we are all part of communities, we don't really move between communities, so let's do it on a community basis. The risk here is obvious -- it's limited precisely because it's regional, and we know that we do move; maybe not many of us, but we move. And then the consumer-centric model: all those PHRs, personal health records, would in my mind eventually be unreliable in the eyes of the clinicians, because they really cannot trust them. They might say to themselves, who knows who messed with this data.
So I'm actually proposing an independent, non-centric constellation, which I call independent health record banks. The argument is that none of the existing players in the health care arena can or should sustain lifetime electronic health records. These are the main principles. The first one is the most radical: I'm actually proposing to change the legislation so that health care providers are no longer the record keepers. They could keep the records they created, at their own expense, but they are no longer the keepers of the legal, medical copy of the record. Rather, it is going to be held and sustained by independent health record banks, regulated by the legislation.
We escape the issue of the unique ID, which is very problematic, especially in the U.S. We also escape the problem of ownership. There is no owner here -- even the patient is not the owner. There is a custodian model, a very simple model.
And that's the major conceptual transition: we are moving the medical records -- the archives that today sit within the health care provider -- into independent repositories. We are emphasizing standards-based communication, which would help standardization in general, of course, but it can be enabled only by new legislation.
That's the basic operational cycle. I introduce myself to the health care provider; the health care provider gets the entire full-blown EHR into its operational system, takes care of me, and then sends the records back to the bank, where they are incorporated into the summary -- which is done on an ongoing basis, not as the result of some query or data federation, which never actually works. And the same goes for pharma: a pharma company goes through a clinical trial, and then they should send the genomic data back to the independent health record bank. They won't like it so much, but they would like to get the record, of course.
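A schematic sketch -- my reading of the proposal, not a specification -- of that cycle: the provider checks out the full record, deposits the new encounter back, and the bank folds it into the continuously maintained summary.

```python
# The bank is the custodian: providers check records out and deposit new
# ones back, and the summary is updated on an ongoing basis, not by ad hoc
# federation. Class and field names are hypothetical.
class HealthRecordBank:
    def __init__(self):
        self.records = {}   # patient id -> list of encounter records
        self.summary = {}   # patient id -> continuously maintained summary

    def checkout(self, pid):
        """Give the provider the full record for the episode of care."""
        return list(self.records.get(pid, []))

    def deposit(self, pid, encounter):
        """Receive the new encounter and fold it into the ongoing summary."""
        self.records.setdefault(pid, []).append(encounter)
        self.summary[pid] = f"{len(self.records[pid])} encounters on file"

bank = HealthRecordBank()
bank.deposit("patient-1", {"visit": "2014-09-24", "note": "BP check"})
print(bank.checkout("patient-1"), bank.summary["patient-1"])
```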
At the macroeconomic level, this is the main transformation. I'm arguing that there is no additional cost here, because the archiving costs today are embedded within health care costs; they would just move to the health record banking. These are the benefits I already alluded to. A bill was introduced in the U.S. Congress exactly seven years ago, influenced in part by my publications; unfortunately it didn't pass, but at least it was a nice step toward this idea of a health record bank. And for more information, you can read the special issue on health record banking that we recently published.
21 So thank you for your attention.
ALIN: Due to time constraints, we're only going to take two or three questions. Sorry.
QUESTIONER: How do you imagine, or how do you work in, protection for individual privacy -- who has access to the data? And how can you strip the data for, let's say, epidemiological research that does not involve personal information?
DR. SHABO: Yes, privacy is a tough issue, as we all know. In a sense there is a tension: privacy versus availability. As a healthy person, I am sometimes cautious about my privacy and not really thinking about the availability of my data for when I'm sick. I don't have a magic solution to it. I'm only arguing that in the current situation, where the data of individuals is scattered all over and totally fragmented, claiming that this fragmentation is the only mechanism that actually protects privacy is really a bit ridiculous. We need to protect privacy with the means that we do have today. I know that there could be breaches, but today the data is fragmented and scattered, and privacy can be breached at each provider, or at each health information service provider.
So I think the bank model is a chance to actually protect privacy better. And think also about the unique ID: you don't need any unique ID here. In the current interoperability effort in the U.S., for example, you've got to have some kind of what is called patient matching -- matching based on an ID -- and that's a huge issue.
QUESTIONER: When we are talking about genomics data, which will eventually start flowing through EHR systems, the size of the data is a big issue. Also, the number of things you can report is huge -- I mean, 400,000 transcripts and any variation on them. Imagine the amount of information that will flow through HL7 protocols; that might be very challenging. Trying to see the vision -- what would that look like?
DR. SHABO: I really want to emphasize that this model of health record banking is totally distributed; this is not some kind of centralized repository for all health records. It's distributed first at the business level: at the business level there are competing independent health record banks. And within each bank there could be, in terms of IT architecture, again some distribution, depending on the volume and so forth. So I think there shouldn't be any problem in handling any volume of data, including whole genome sequencing. Whereas in the current situation, if you try to push whole genome sequencing to health care providers -- they cannot even handle simple genetic testing of known mutations, not to mention NGS.
And those health record banks should be specialized in health information, health standards, and so forth. That will be their specialty, unlike providers, whose specialty is providing care, or insurers, who should provide insurance, and so forth. Each should focus on its specialty. I'm only saying that health record banking would be kind of an engine to get all those things solved. I'm not saying they are solved today, but they could be solved better when you have that kind of expertise within those organizations.
DR. SIMONYAN: Just one more comment about privacy: sometimes people mix up privacy and security. One of the information management officers once explained the big difference to me: privacy is who has the right to access the data; security is who actually does. It's important to distinguish them, because security is sometimes actually a much bigger question than privacy -- ownership of the data is sometimes clear legislatively, but the security is not.
So we have the next speaker, Eugene Yaschenko. He is from the National Center for Biotechnology Information, where he is the head of the molecular software efforts, and we know that NCBI is one of the biggest, if not the biggest, NGS data repositories and tool providers. So we are happy to hear what he has to say about NGS standardization.
2 MR. YASCHENKO: It's still morning, so
3 good morning. I'll try to be quick and let you
4 have lunch. My name is Eugene Yaschenko and I'm
5 chief of the molecular software section at NCBI.
6 And what we are doing is we are building large
7 genomic, genetic, and biomedical databases for the
8 world.
So, the outline: I will introduce NCBI for people who still don't know what NCBI is and what is produced at NCBI, then describe the subsection of NCBI that we call the primary archive. Then I'll talk about the Sequence Read Archive.
So, about NCBI: NCBI was created by Congress in 1988 to develop (audio skip). We are funded by act of Congress, and we are in a unique position: we have the funds and the obligation to develop databases, to accept the data, and to serve those databases to the public, but we have no leverage over what kind of data we get, in what formats, and to what standard. The only leverage we have is that many publishers require people to submit data to qualified databases. We could try not to accept data in the wrong format, but we have to get it resolved and stay compliant with our mandate from Congress. We have to store the data, but we don't control what formats are being sent to us.
NCBI in general is mostly known for its biomedical efforts -- you all know PubMed and PubMed Central, the public literature sites, which we support. We also have a lot of databases that support bioinformatics resources at NCBI: the nucleotide databases that are part of GenBank; the protein database, also part of GenBank; a very extensive and nicely curated database on genes; the whole genomes that have been assembled and submitted to NCBI; and, very critical for bioinformatics research, BLAST running on NCBI hardware.
Part of our databases we call primary data archives, and when we build primary data archives, we follow certain principles. First of all, fidelity to the submitter: we try to reflect as faithfully as we can what is submitted or provided to us. Second, the term of archival for this data is assumed to be indefinite. Obviously, nothing is permanent in the world, but when we design these databases, we assume they will be used indefinitely.
A data archive is different from a file archive; that is another principle. We are not just a file archive -- we're not archiving files as such. We try to figure out what data those files contain, possibly build specialized databases for them, and archive the data, not the files. Also, as a result of that, the archive is not necessarily lossless. Some controlled loss of information may be accepted for the sake of space, or simply because, for a given data type, you don't want to preserve every bit of information. We just control the loss.
Also, for internal storage, we try to make it as uniform as possible, and we use the transformation into internal storage for validation, for additional indexing, sorting, and so on and so forth. And as I already mentioned, the extra pain we get is from the multitude of formats generated outside, and we always welcome any movement toward a common standard.
So, primary sequencing data archives at NCBI: GenBank was created in the eighties. Then, with the advent of the first automated sequencing technology, we created the trace archive to store single traces. And when next generation sequencing came, we developed SRA to handle several orders of magnitude more data. We call those two archives -- the trace archive and SRA -- raw data archives, because we put a minimal amount of curation on this data, and the data represents, as closely as possible, what was produced by the submitter. So this is not a highly curated set; this is the data as the submitter sees it.
On this picture you see the growth of three databases: GenBank, which on the screen should be blue; the trace archive, which is red; and in yellow, which at this resolution is almost a vertical line, SRA. When we change the zoom and shift a bit, you see all those lines scaled to the trace archive. And when you rescale to SRA, you see this long yellow line and everything else stays at zero -- that is the number of bases in SRA. As you see, by now we are approaching three petabases of storage, and one petabase is a lot. If you take a DNA molecule in its unstretched form and just line it up, one petabase would be a single DNA molecule reaching from Washington, D.C., to New York. This is how the micro scale, multiplied by these huge counts, becomes something we can measure in real distances.
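A back-of-the-envelope check of that distance claim, assuming roughly 0.34 nanometers of length per base pair of stretched DNA:

```python
# One petabase of DNA, laid end to end, at ~0.34 nm rise per base pair.
BASES = 1e15                 # one petabase
NM_PER_BASE = 0.34           # approximate rise per base pair
length_km = BASES * NM_PER_BASE * 1e-9 / 1e3
print(f"{length_km:,.0f} km")   # ~340 km, roughly Washington, D.C., to New York
```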
This is our daily input -- the number of bases we inject into SRA every day. These lines are the inputs, and the somewhat smoother line is the average. As you see, starting a couple of years ago, we effectively insert into SRA every day the full size of the trace archive, and much more. That is how the scales are changing with next generation technologies.
Because of this, this is how we model the data that we store against our internal standards. As we realized the volume was big, we started to compress the data and reduce it to as few bytes as we can. As you see, in the early stages, when we went through the NGS explosion, we spent up to 8 bytes per base, because we were recording not only the quality -- Illumina was producing four-channel qualities, and some of the technologies were producing the signals that resulted in particular base calls.
In the second phase, when we realized that was not sustainable, both we and the vendors started to converge on the need to store only bases and qualities. This is what I call the FASTQ conversion. As we went further and tried to apply better and better technology, we realized that one of the steps in preserving the data is to align it to the reference. And alignment to the reference, as mentioned before, provides a very good opportunity for compression by reference, where you don't need to record the sequence itself -- a lot of the sequences match the reference pretty well. This allowed us to converge to a more or less steady pace where we spend approximately two-thirds of a byte per base, as was previously mentioned. Almost 80 percent of that is quality scores, which we are trying to attack next: do we really need to store all of that?
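A toy version of compression by reference, in the spirit described: for an aligned read, store the position plus only the bases that differ from the reference, so a perfectly matching read costs almost nothing beyond its coordinates. This is a sketch of the idea, not SRA's actual encoding.

```python
# Store only the mismatching bases of an aligned read; reconstruct the
# read later from the reference plus the stored differences.
def compress(read: str, ref: str, pos: int):
    """Return (pos, length, [(offset, base), ...]) for mismatching bases only."""
    diffs = [(i, b) for i, b in enumerate(read) if ref[pos + i] != b]
    return (pos, len(read), diffs)

def decompress(record, ref: str) -> str:
    pos, length, diffs = record
    bases = list(ref[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)

reference = "ACGTACGTACGT"
read = "ACGAACG"                      # one mismatch vs. the reference
rec = compress(read, reference, 0)
assert decompress(rec, reference) == read
print(rec)                            # (0, 7, [(3, 'A')])
```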
So, the need for the standards being discussed here: this is NGS data. If you look at SRA now, you find 8 sequencing technologies. You find 34 sequencing instrument models. You find 20 types of experiments -- whole genome sequencing, RNA-seq, all of them. And you find all kinds of cell types and taxonomy IDs. Obviously, to develop separate standards for each of those combinations is impossible; we need a common standard for all of them. And the standard is relatively simple: we need to record the metadata, we need to record the bases and qualities, and possibly their alignment to the reference.
There is also the need for a standard from the long-term archival perspective. The goal of NCBI is to build an archive for indefinite use, and this raises the first question: long-term usability means software support. Imagine that a format is generated today and is understood by a particular set of software. What happens ten years from now? Will this software still work on future computers? What's convenient now may not necessarily be relevant in the long term.
Then there is the uniformity of the data produced by different technologies. Even within the scope of a single manufacturer, they change the formats they produce as they iterate, and their latest software does not understand -- I think does not understand -- the format they created six years ago when they started.
There is also the uniformity of data produced by different instrument types, and the long-term usability of the various data elements. As I said before, initially we tried to model more elements than just the bases and qualities, but practice later showed that not all of it needs to be modeled. The benefits of transforming to an archival format are that, in addition to being archived, the original data is curated while it is being transformed. The transformation is possible because, at the time of the transform, the software that understands the submitted format is up to date -- it's still current. And the transformation to an internal standard gives us the ability to apply data compression. There are problems, too: we need to keep up with all the formats being submitted to us; there is potential loss of data due to bugs in the archiving software, if any are introduced; there is potential loss of data due to archive decisions about which data to keep or discard; data decompression slows down short-term computational needs; and data reduction may discard data fields that were used to carry additional information. Several times we found fields that are used only to carry information from one bioinformatics pipeline to the next -- from, let's say, Bowtie to -- from TopHat to Cufflinks -- and no other meaning can be attached to those fields.
So this is the decision circle we go through -- we go in circles. We first decide which data series to store. Then we look at the data series for redundancy removal. Then we decide whether we do lossy compression. Then we go through practical application, and when something happens -- a new technology shows up, or the pattern of using the data changes -- we go around the circle again.
We initially preserve as many of the available data series as is reasonable and possible. Then we apply compression routines; existing compression routines speed up the development and acceptance of new data. Then we improve the compression methods as the data becomes better understood. And then we discard the data series that did not prove to be useful, and we recover disk space.
We also need to manage the risk of losing archived data. We do the normal things: redundant disks, redundant-location backups. For the risk of introducing errors through new software, obviously we handle that with unit tests, regression tests, and quality assurance, and by being able to recover the original data in reprocessing. And for the risk of making bad decisions -- for example, we decided to discard some data series -- since the original is on tape storage, which has a limited lifetime, our assumption is that a mistake will be discovered within the lifetime of the tape.
As I mentioned, SRA is big data, and what are the conditions for that? This is my understanding of how big data gets born. First of all, you have the advanced technology to generate it, because you have to be able to generate data to call it big -- it's not easy to create so many bytes. You also need a sufficient amount of resources to store the data, because you cannot call something big data if you can never store it. At the initial stages of the design there was a requirement to try to store all the images produced by Illumina; the technology for that doesn't really exist, so it was never even attempted.
Another condition for big data is that we don't have enough methods to digest it. That's why we keep it big: we cannot extract the useful information at this time, and we keep it for the future, figuring that maybe in the future we will be able to.
So these, for example, are properties of NGS data that we have analyzed. We do have random noise in the data. We also have systematic noise from well-known biases of, let's say, PCR or particular platforms. And we have expected signal. First of all, we have what I call the known carrier: when you align an individual genome to the human genome, a lot of the data will just tell you that, yes, you are human. It will not tell you anything individual about you; it will just tell you that you are human, because it matches the genome perfectly.
Then you have the expected signals. For example, your eyes are blue -- definitely you will have some kind of variations that make your eyes blue. And then you have the novel signals, the signals you are actually looking for, hidden within this big data. It's like looking at an astronomical picture: you have the black background, you have all the white speckles and noise, then you have the well-known, well-positioned stars, and then you find the supernova. That is exactly what you are looking for in big data. Our analysis of data in this area shows that approximately 64 percent of the total reads we find identical to the human reference; 25 percent of the reads are suspect -- suspect in the sense that, for some reason, the black box of the sequencer decided that two or three bases have low quality scores, or even half the read has low quality scores. There is some reason the hardware misfired, so I would say, why not call them suspect and not try to chase sequencing errors.
Then we also have the now well-known variants, which have recently been recognized and catalogued by the 1000 Genomes Project -- variants that are common to a large percentage of people. And after all this, what remains is a small fraction, on a scale of 1 percent, which is the actually useful data -- data you can feed into clinical decisions or academic research. And the problems are finite. The additional alignment phase applied to SRA data allows us to go from silos of data for each individual sample to a matrix of data, where those silos are aligned by their position on the genome. This is where the existence of standards is important, because once you go from a set of individual data files to a matrix, you get a lot more flexibility in how you look at the data. For example, you can slice the data: you can look at a region of the genome and see how different individuals -- data 1, data 2, data 3, data 4 -- look in that region. Just as an example, here are four individuals from the high-depth sequencing of the 1000 Genomes Project, in the HLA region. The four graphs at the bottom represent the coverage, and the red lines represent mismatches -- what percentage of reads mismatch. From those pictures you can find people with a homozygous mutation or a heterozygous mutation; you can find people who have mutations and people with no mutations; and you are looking at one single gene. You don't need to go to the separate genomes and cut a piece out of each of them -- you just take a slice in a different dimension, and the data is there.
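A simplified sketch of that slicing idea: once per-sample data is aligned to one reference, it forms a samples-by-positions matrix that can be cut by genomic window. The mismatch fractions below are invented, not real 1000 Genomes data.

```python
# One matrix (samples x positions) instead of per-sample silos: cutting
# the same genomic window out of every sample at once.
mismatch = {
    "sample1": [0.0, 0.0, 0.5, 0.5, 0.0],   # heterozygous-looking site
    "sample2": [0.0, 0.0, 1.0, 1.0, 0.0],   # homozygous-looking site
    "sample3": [0.0, 0.0, 0.0, 0.0, 0.0],   # no mutation in this region
}

def region_slice(matrix, start, end):
    """Slice the same genomic window out of every sample."""
    return {sample: row[start:end] for sample, row in matrix.items()}

for sample, window in region_slice(mismatch, 2, 4).items():
    print(sample, window)
```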
Another application we have tried, for the human part of this archive under dbGaP protection, is to build an index of features found in the people. For example, when you have a very strong mismatch from the reference, you can index it properly. We tried to build an index that reports allele existence. The idea is that you have a human sample, and you have narrowed the problematic area down to a particular mutation -- a rare mutation you have never seen before. All you want to do is go to a big system containing a lot of people and ask who else has this mutation. In this case you don't even use the NGS directly, because you have the phenotype for the person who has the mutation; you want to find all the other people who have this mutation and try to analyze the linkage to their phenotypes. So you use NGS only as an indexing medium, not for direct research.
So, to end, we are playing with a thing called a beacon. For example, this is one of the articles about the APOC3 gene, which is responsible for the lifecycle of triglycerides in your cells. This particular mutation makes you, according to the theory, resistant to the effects of eating fatty food -- I'm not a doctor, so this is not medical advice, but if you have this mutation, you can supposedly eat as much fat as you want and it will not affect your health. This is a recent study, and I wanted to look at which people have this mutation. So this is the beacon at NCBI, over dbGaP: you just enter the coordinates you are looking for, and you find there is one sample in each of two studies, and you happen to have access to them. These are general research use datasets, and the person doing this study, Stephen Sherry, has access to the sets. So he happened to find one person in each of those studies, and they also appear in another dbGaP study, shown in the following distribution, to which he has no access -- but he has the ability to request access. By looking at those few fields, he finds that, yes, for those two people, according to the SRA records existing under dbGaP protection, there is coverage, and they have the heterozygous mutation at that position.
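A stripped-down sketch of such a beacon-style query: given a position and an allele, it answers only whether any sample in a study carries it, without exposing genotypes. The study names and coordinates are illustrative, not the actual dbGaP data.

```python
# A beacon answers presence/absence per study, nothing more: the index
# maps (chromosome, position, allele) to a sample count.
studies = {
    "study-A": {("11", 116701353, "A"): 1},
    "study-B": {("11", 116701353, "A"): 1},
}

def beacon_query(chrom: str, pos: int, allele: str):
    """Return, per study, whether at least one sample carries the allele."""
    return {name: (chrom, pos, allele) in idx for name, idx in studies.items()}

print(beacon_query("11", 116701353, "A"))
```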
So I have given two examples of how the existence of standards allows you to go from a collection of data to a big data matrix and index it, because the data that fills the fields of this matrix is uniform and conforms to the same standards. And this is the end of my talk.
21 QUESTIONER: (off mic)
MR. YASCHENKO: How we handle storage of --
2 QUESTIONER: (off mic)
3 MR. YASCHENKO: We maintain the software
4 on NCBI's Website, but we also -- I think we're in
5 the process of moving this into GitHub because it
6 becomes --
7 QUESTIONER: (off mic)
8 MR. YASCHENKO: Yeah, we're going to
9 move to GitHub in the next couple of weeks
10 actually.
11 QUESTIONER: (off mic)
12 MR. YASCHENKO: We are also using -- you
13 mean virtual machines?
14 QUESTIONER: (off mic)
15 MR. YASCHENKO: We are also working, not
16 program, modeling how to create the virtual
17 machine environment for people. We tried to
18 create examples how to use SRA from the cloud and
19 the cloud can be anything. We're modeling with
20 Amazon, but it can be --
21 QUESTIONER: (off mic)
22 MR. YASCHENKO: Exactly. Right. We're
154
1 at the point where we're just providing the
2 recipe, how to install software in the virtual
3 machine. But we also can generate the Amazon
4 images for people to use direct result
5 installation.
6 ALIN: Are there any other questions?
7 QUESTIONER: What kind of redundancy do
8 you have of the basic data? If you're going to
9 condense your files and base them on a sequence
10 alignment of a standard sequence, how can you
11 preserve that standard sequence because if you
12 lose that, everything else goes?
MR. YASCHENKO: The standard sequence -- first of all, we prefer the standard sequence to be one of the GenBank sequences, and the GenBank sequences are highly reliable because they are not only stored at NCBI; they are also kept in multiple places. They are stored across the INSDC, which is an international collaboration.
For some cases we do allow what we call local references, where the reference is stored with the object. Obviously, the compression is less there, because you store the reference, but then it's part of the data. It's not remote, it's local -- if you lose it, you lose it with the data.
5 ALIN: There's actually a question
6 online. Kay asks "Does anyone see any linkage
7 between" (audio skip).
8 MR. YASCHENKO: I had trouble
9 understanding.
10 ALIN: I'm not sure if I follow. So
11 does anyone have a question here?
QUESTIONER: When we are talking about human data, reference-based compression is a wonderful compression algorithm. But let's say it's viral data -- and we deal with a lot of viral data -- or bacterial data with high coverage: we get sequences that are absolutely identical, even with the error rates of current platforms. So there are different approaches. Even before you do reference-based compression, you can do purely redundancy-based compression. It sometimes gives us a factor of eight to ten right there, and you don't even need to keep a reference for that.
MR. YASCHENKO: Reference-based compression is redundancy compression, because in fact, even if your reference is the genome, you store your sequence once when you have coverage. Even if you have to assemble first -- you don't need a perfect reference; you just assemble your reference and compress against it. That is already reference-based compression, because you store the sequence but once, and all the rest that is generated is highly redundant.
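A tiny illustration of the pure redundancy compression mentioned in the question: identical reads are stored once with a multiplicity count, which on deep viral or bacterial samples can alone shrink the data severalfold.

```python
# Deduplicate identical reads: store each distinct sequence once with
# its multiplicity. The reads here are invented.
from collections import Counter

reads = ["ACGTACGT"] * 9 + ["ACGTTCGT"]       # 10 reads, only 2 distinct
dedup = Counter(reads)

print(dedup)                                   # sequence -> multiplicity
print(f"ratio: {len(reads) / len(dedup):.1f}x fewer records")
```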
QUESTIONER: But sometimes assembling the genome itself is a challenge. So maybe --
MR. YASCHENKO: I think we could have another whole meeting today on assembling the genome.
QUESTIONER: For some of the data from RNA viruses, you're dealing with a quasi-species rather than a single sequence. So how do you deal with that when you're talking about compression of the data?
MR. YASCHENKO: We are not doing any compression analysis of our own -- there is a part of NCBI that deals with pathogens, and you will hear later where we do analyze the data. But from the SRA perspective, we take the data as given by the submitter. So far we are not introducing anything of our own. If the submitter's pipeline involves alignment to a common reference, we take it. If the submitter's pipeline involves producing a genome reference, aligning to it, and giving it to us, we also take that. We are not imposing our analysis.
ALIN: Before we close the session -- we are actually going to take a break now and go to lunch. But before we do that, I have a couple of general announcements. One, there are posters outside; feel free to look at those at your leisure. They're there for you to look at, and the presenters are also somewhere in the room, so at some point, presenters, if you don't mind standing near your posters, that would be great, so that people who have questions can address them to you.
Also, it is lunch. You are welcome to leave the campus, but note that if you do leave the campus, you have to go through all the security procedures that you went through to get in, and those take some time. There is a cafeteria across from us; we recommend that you go there, just to ease the time-limiting factor.
Also, there is a tweet going out with the hashtag FDANGS. Because we're part of the government, we are required to say that we don't associate with that; and honestly, if you could not have FDA in that Twitter tag, it would probably be better for us, so as not to get us in trouble. Just a public announcement that we don't endorse that Twitter hashtag.
13 Thank you. We'll reconvene at 1:30 p.m.
14 (Recess)
MR. YASCHENKO: Welcome back from lunch. I hope you feel better now. So, we're at the second session, Big Data Administration and Computational Infrastructure. There have already been several mentions of Big Data, but now we have a whole session dedicated to it. The first speaker is Dr. Vahan Simonyan, High-performance Integrated Virtual Environment lead (Inaudible).
DR. SIMONYAN: Good afternoon. So, in my first talk I was representing the standardization perspective, and this talk is about the work I actually do on a daily basis. This is about the High-performance Integrated Virtual Environment, a platform developed specifically for NGS data inside the FDA and outside in public (Inaudible). So again, this is my disclaimer -- I'm supposed to show this -- and everything I say is my own perspective. It's not a regulation or anything, and it does not bind the FDA.
Okay. So, in the NGS world, again -- I mentioned this before -- when we start working in science, we all think we're going to do the fun stuff: big, different, beautiful algorithms, new science, new research. As soon as you move into the NGS sphere, you find out that you have to build infrastructure, and huge infrastructure. And the data is big. Data standards -- some are there, some are not there.
It's sometimes difficult to interpret these big datasets, and the data complexity is big. Also, the computations to be done are really very large. Every time somebody comes to me wanting to analyze everything against everything -- well, it's possible, but it takes years and years and years to finish. So computations have to be done carefully -- it's not just that they are big, they are also complex -- to avoid all of these huge problems.
And computational standards -- we already raised the issue, and that's why we are here today -- computational standards are also a very big question. So, what do we need? Being part of the FDA, or doing the same thing for the public, we need good storage capacity. We need good security capability -- being a regulatory agency accepting review data, it is important to make sure that the data is very secure; I have just one slide on that. And then availability of the data is important. We have all heard about these beautiful storage systems that are very fast internally but face outward with one (Inaudible) cable, while the compute cloud faces inward with one cable. And so the data is not available -- the data is there, it's just not available for you to compute on; it's very slow.
And interfaces are an issue, because the results of our computations are sometimes hundred-million- or billion-row tables -- how useful is that? Unless you have a nice interface that allows you to look at them, it's not a result. And support, of course. Big Data administration -- as soon as you get into infrastructure, you have to think about administration and support, because you have to support the hardware, and you have to support the staff.
18 stuff.
19 And then, people move and then you have
20 to think about it; how it has to be upgraded. You
21 have to think about it. So, originally, we
22 started this to become scientists, and then, we
162
1 ended up doing all of these other things which are
2 infrastructure related. So, what HIVE does, it
3 tries to address some of these issues -- the few
4 things it does -- it does robust data loading. It
5 has a robust data loading information.
6 Pretty much, we can go anywhere and get
7 anything for you. You just give us a URL or an
8 ID. If it's NCBI or Uniprod or anything, you come
9 to the system. You command us to go and get the
10 data for you. You go home. You drink your
11 coffee. You come back. You have the data.
Distributed storage: when we take the data, we spread it across the cluster and store it in pieces. There are two good reasons for that. Number one, computations are faster when they are also distributed. Number two, if a particular node is compromised, the whole dataset is not compromised -- everything is distributed. And security: HIVE provides security, and we had to come up with a new model of security that we call hierarchical security.
And then computations -- distributed computations. When you are a developer in HIVE, you don't have to think about where your computation is going to run. You don't have to think about whether your data is on that particular compute node. You don't have to think about how to parallelize it. HIVE does a lot of that for you.
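A schematic sketch -- not HIVE's actual code -- of the two ideas just described: data is split into chunks spread across nodes, so computations can run where the chunks are and no single node holds the whole dataset.

```python
# Round-robin chunk placement across a cluster: each node sees only a
# part of the data, and work can be scheduled where the chunks live.
def distribute(data: bytes, nodes: list, chunk_size: int = 4):
    """Spread fixed-size chunks across the nodes; return a placement map."""
    placement = {node: [] for node in nodes}
    for i in range(0, len(data), chunk_size):
        node = nodes[(i // chunk_size) % len(nodes)]
        placement[node].append(data[i:i + chunk_size])
    return placement

placement = distribute(b"ACGTACGTACGTACGT", ["node1", "node2", "node3"])
for node, chunks in placement.items():
    print(node, chunks)    # no single node holds the whole dataset
```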
And interfaces: HIVE provides web interfaces -- HTML5-based, JavaScript-based -- so if you are working in an IT infrastructure where there are limitations on what software you can install, you don't need to think about it. Browsers are everywhere, and most modern browsers support HTML5.
And of course, the most important part: HIVE is also expertise. It doesn't matter what the tools are -- sometimes we don't know how to use the tools, and sometimes the tool does 90 percent of what you want, but there is still that 10 percent. Unless you have expertise along with the infrastructure, you are doomed, because now your biologist is trying to deal with that 10 percent.
Okay. So from a hardware perspective, a topological perspective, HIVE is an encapsulated, behind-the-firewall, cloud-like infrastructure. It stands for High-performance Integrated Virtual Environment, and the letter I is for integrated, because everything is together. In the picture I draw the compute resources you need and the storage you need separately, but sometimes they are the same nodes, sometimes different nodes, and there is extremely well optimized internal network connectivity between them.
Everything is controlled through the cloud servers, and there is a single point of access, which is the web servers -- no (Inaudible), no additional users. We are trying to minimize potential violations of the system's integrity. And then we can link to your devices: if you have an Illumina device or a Roche device or any other device that is doing the sequencing, we can integrate it, and if those mounts are available -- the technology people know where the mounts are, where your (Inaudible) end up being -- we can get them for you without you even thinking about it.
3 We can go and upload your data from your
4 local storage, or you can tell us to get the data
5 from somewhere else. Everything is not from the
6 web browser. And this is how HIVE looked before
7 and after. One of these before and after pictures
8 (Laughter). To be honest with you, the second
9 perspective -- after perspective is taken -- the
10 guy is still there behind the cave walls.
11 (Laughter) He's just not visible (Laughter). The
12 other side. Yeah.
Okay. So, the first few slides show the storage data flow. I come here, I define my metadata, I define my sequences, I click a button, and the data gets uploaded into HIVE. That's the only time when HIVE depends on your computer being live. When you upload the data from your local computer, if your computer goes to sleep or you go to another page, you break the connection. There is nothing we can do about it. That's the only point in HIVE where, when you issue a command, you have to wait for it to finish.

So what do we do? We initiate the processing pipeline. We recognize almost all of the formats which are available in the industry, unless something came out yesterday. And we do validate the data, because we know how the validation protocols in the industry work. There are about 40 different validations possible; 10 or 20 of them are done automatically. We parse the data. We compress the data. We encrypt some of the data -- because sometimes it's not feasible to encrypt all of it -- and we archive it for distribution.

And the next case: you come, and you don't have the data on your machine. You just point us to NCBI or UniProt or anywhere, and then we go and start getting it. It seems like NCBI provides a lot of nice FTP downloads and things, but the information is not just a packaged FTP file lying there. You have to do this electronic handshaking; it's called the E-utilities. You have to submit an inquiry, get the results, and then start downloading by chunks. We take all of this complexity into HIVE, and we do it concurrently -- hopefully not too concurrently, because NCBI will blacklist us if we go too hard.
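The "electronic handshaking" being described is NCBI's public E-utilities protocol, and the sketch below shows the submit-an-inquiry, then download-by-chunks pattern, with a pause between requests so the client is not throttled. The query, chunk size, and pause are arbitrary example values; a production loader would add retries, error handling, and an API key.

    # Sketch of chunked retrieval through NCBI E-utilities: esearch to
    # get IDs, efetch to pull records in batches. Example values only.
    import json
    import time
    import urllib.parse
    import urllib.request

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

    def fetch_nucleotide_records(term, chunk=100, pause=0.4):
        # Step 1: submit the inquiry and collect matching IDs.
        q = urllib.parse.urlencode(
            {"db": "nuccore", "term": term, "retmode": "json",
             "retmax": 500})
        with urllib.request.urlopen(f"{EUTILS}/esearch.fcgi?{q}") as resp:
            ids = json.load(resp)["esearchresult"]["idlist"]

        # Step 2: download the results in chunks, pausing between
        # requests so NCBI does not throttle or blacklist the client.
        records = []
        for start in range(0, len(ids), chunk):
            batch = ",".join(ids[start:start + chunk])
            q = urllib.parse.urlencode(
                {"db": "nuccore", "id": batch, "rettype": "fasta",
                 "retmode": "text"})
            with urllib.request.urlopen(
                    f"{EUTILS}/efetch.fcgi?{q}") as resp:
                records.append(resp.read().decode())
            time.sleep(pause)
        return "".join(records)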
And then, once we get the data, what do we do? We monitor until the downloads are finished, and we parse, compress and do the other stuff automatically.

And so, how do you do computations? This is a very general view. I come, I select my data -- it sits mostly inside the system already; HIVE does not work on data which is outside. We can get it, but we can compute only when the data is inside. You remember, that's the integrated (Inaudible). So, you click on a button, and based on what kind of computation it is and an estimate of how much time it may take, our software decides how many chunks have to be made out of it -- how many pieces, how many compute instances should be involved in it.

Then we parallelize; it's called parallelization. Then we launch -- we wake up computers which start working with the data units and generate the information while your browser is updating and showing progress: 1 percent, 2 percent, 3 percent, et cetera. Once it is done, the parallel (Inaudible) is aggregated, and the visualization is prepared and sent to you.
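A minimal sketch of that split, launch, and aggregate flow, using a local process pool to stand in for HIVE's cluster; the fixed chunk count and the toy G/C-counting "computation" are assumptions for illustration.

    # Sketch of the split / launch / aggregate flow described above.
    from concurrent.futures import ProcessPoolExecutor

    def count_gc(sequence_chunk):
        """Toy per-chunk computation: count G/C bases in one chunk."""
        return sum(base in "GC" for base in sequence_chunk)

    def run_distributed(sequence, n_chunks=8):
        # Decide how many pieces to make (fixed here for simplicity;
        # the talk says the platform estimates this from the job).
        size = max(1, len(sequence) // n_chunks)
        chunks = [sequence[i:i + size]
                  for i in range(0, len(sequence), size)]

        # Launch the pieces in parallel, then aggregate the partials.
        with ProcessPoolExecutor() as pool:
            partials = list(pool.map(count_gc, chunks))
        return sum(partials)

    if __name__ == "__main__":
        print(run_distributed("ACGT" * 1_000_000))  # -> 2000000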
And this is how visualizations usually look. Well, they're not crooked in real life; I just tried to create a perspective. So, there are all kinds of visualizations. Some visualizations are built internally by HIVE. For some visualizations we adopt external tools -- many tools; you'll see some of the tools I mentioned. If a tool produces a nice visualization or a table, or it produces the data in a text format that we can visualize, we try to bring it to you in this framework.

And sometimes it's actually difficult. Do you know how many times people say, hey, I've done my computation, I waited my three hours, and I'm clicking on human chromosome number one -- I want to see all of the mutations? The real estate of the screen is maybe 300 pixels, the first human chromosome has 250 million positions, and we know about 20 different things about every single position. People say, I clicked it, it's computed, it should take a second to show. But in reality, to go through 200 gigabytes -- 250 million positions -- and produce output which is visually appealing, where you can understand what's happening, even that requires launching something like 50 processes, each one doing a chunk of the work, and then aggregating it back and bringing it to you. So, when we are dealing with big data, even the things which seem simple are really a big deal.
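To see why even "just show it" is a computation, here is a sketch of collapsing a per-position signal into one value per screen pixel; the synthetic coverage track and the mean-per-bin summary are illustrative choices, and at real scale each of the ~50 workers would first summarize its own slice.

    # Sketch: 250 million positions must become one value per pixel.
    import numpy as np

    N_PIXELS = 300  # horizontal screen real estate

    def downsample_to_pixels(values, n_pixels=N_PIXELS):
        """Collapse a per-position signal into one bin per pixel."""
        edges = np.linspace(0, len(values), n_pixels + 1, dtype=np.int64)
        # Each pixel shows a summary (here, the mean) of its bin.
        return np.array([values[edges[i]:edges[i + 1]].mean()
                         for i in range(n_pixels)])

    # Demo on a smaller synthetic track so it runs quickly anywhere.
    coverage = np.random.poisson(30, size=2_500_000).astype(np.float32)
    pixels = downsample_to_pixels(coverage)
    print(pixels.shape)  # (300,)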
Okay. So what is the ultimate goal, and where are we? We produced this extremely modular system, and we spent a significant amount of time developing these small modules: the alignment module, the (Inaudible) modules, the phylogenetic modules, the parsing modules, the security modules.

And we make these black boxes. I don't think they're black here -- I'm color blind, but I still can see some of them. These are black boxes which take particular inputs and outputs, and on the right side, you can see just some examples of what they actually are. And the letter V in HIVE stands for virtual. But unlike most virtual infrastructures, we don't virtualize the machines; we virtualize the services. If it's an alignment, it's an alignment: it takes some inputs and it produces some outputs.

Why would you ever care where it is done? Is it a Mac computer? Is it a Windows computer? How much memory does it have? We, as the configurators of HIVE, configured the system in such a way that if you launch a service, the service is executed for you regardless of where it is done. It's not your worry. So eventually -- well, we are also working on a pipeline design component where we'll take all of these modules and link them together. And the hope is that we can produce actual working pipelines which are preconfigured and validated, and we'll provide them to our users.
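A small sketch of "virtualize the service, not the machine": the caller names a service and its inputs, and a dispatcher picks an implementation, so the caller never learns which machine or binary did the work. The registry, service name, and toy aligner are hypothetical, not HIVE's real API.

    # Sketch of service virtualization: callers describe what they
    # want; a dispatcher decides where and how it runs.
    from dataclasses import dataclass
    from typing import Callable, Dict

    @dataclass
    class ServiceRequest:
        service: str   # e.g. "alignment"
        inputs: dict   # e.g. {"reads": "sample.fastq"}

    class ServiceRegistry:
        """Maps a service name to an implementation; the caller never
        learns which machine, OS, or binary actually did the work."""
        def __init__(self):
            self._impls: Dict[str, Callable[[dict], dict]] = {}

        def register(self, name, impl):
            self._impls[name] = impl

        def run(self, request: ServiceRequest) -> dict:
            return self._impls[request.service](request.inputs)

    registry = ServiceRegistry()
    registry.register("alignment",
                      lambda inp: {"aligned": f"aligned {inp['reads']}"})
    print(registry.run(ServiceRequest("alignment",
                                      {"reads": "sample.fastq"})))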
Okay. So, what actual kinds of computations, just to mention a few, are there? We have about 40 tools in production, we have more in development, and we have some in beta and alpha testing. So of course, there are the retrieval, storage, and security parts of it, and visualization tools. We have a number of aligners which are available to us and adapted to our system. We have assemblers. We have a variant calling (Inaudible) arsenal, and you can notice that there are tools which are ours and there are tools which are not ours; they're adapted.

Depending on how complex your tool is, it takes some time to adapt it. Generally, if you have a command-line tool that just takes like five inputs and produces like two outputs, that's easy to do. But if you want us to make it optimal, we can spend extra time and develop an explicit parallelization routine. Even with implicit parallelization, when we just adapt your tool and launch it on one of the computers, you already benefit from HIVE.

Why? Because now you can launch 10 copies of it, they are going to go to 10 different computers, and you are going to do 10 jobs in the time of one. That's benefiting implicitly, although we didn't spend time parallelizing it ourselves. But because aligners and variant callers are significant time consumers, we spent the time and developed not only our own aligners and variant callers, but we also spent time explicitly parallelizing existing ones.
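Here is a sketch of that implicit parallelization: wrap an unmodified command-line tool and fan ten copies out concurrently, one per input. The tool name "some_aligner" and its flags are placeholders, not a real CLI.

    # Sketch of implicit parallelization: N unmodified copies of a
    # command-line tool run concurrently, one per input file.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def run_one(sample_path):
        """Run one unmodified copy of the tool on one input."""
        return subprocess.run(
            ["some_aligner", "--in", sample_path,
             "--out", sample_path + ".bam"],
            capture_output=True, text=True).returncode

    def run_many(sample_paths, workers=10):
        # Ten samples finish in roughly the time of one, with no
        # changes to the tool itself -- the copies just fan out.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(run_one, sample_paths))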
So, let's say you run a multi-sample pipeline computation on some big, huge-coverage data; it would normally take you two days. In HIVE, it will take you a few hours, like two or three. And if the system is busy -- we tend to get busy sometimes on a Wednesday or Thursday -- it will take six hours. But that's still much better than a two or three day computation time. And the reason it is important to build these very optimized routines is because we are in science.

I, being a scientist, was never able to ask the right question the first time. I don't know what question to ask. I ask a question, I get an answer, and I recognize, hey, that was the wrong question. I have to compute again, and again. Now, imagine how efficient I can be if every time I ask a question, I have to wait days. But if I ask a question and get an answer immediately, or within a reasonable time limit, then I have a better chance of doing nice science. That's why we do spend the time optimizing all of these things.
So, let's just move forward. There's a big arsenal of tools in there. The next thing about working with scientists is the difficulty of their inquiries. When we designed HIVE, we took all of the NCBI approaches and designed all of their data models. We said, hey, we have these beautiful databases you can use; you can populate your information into them. The next day, somebody comes: I have a data model which doesn't fit there. Please design it for me.

We have DB administrators -- very smart guys. They went and started designing a new data type. In two days, we had five more requests. We recognized it's unattainable, because when you have so many different data types, it's not possible to maintain them. So we stopped. We spent a month or two and designed a new data model which looks more like predicate databases or triplet databases, but heavily adapted.

We borrowed all of the nice ideas, joined them in a nice hybrid way, and created our own HIVE honeycomb database model. So now, once you, the researcher, know what you want, it takes us 15 minutes to design a new database for you, because there is one database in there which maintains all of the joined databases. It turns out we are saving a lot of money on it -- we are saving money by needing far fewer database administrators. A much simpler life.
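A minimal sketch of the triple-style idea behind such a "one database that maintains all the databases": every record, whatever its type, becomes generic (object, field, value) rows, so defining a new "database" needs no new schema. The SQLite table below is an illustrative assumption, not HIVE's actual honeycomb engine.

    # Sketch of a triple-style object store: one generic table of
    # (object, field, value) rows instead of a schema per data type.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE honeycomb (
        obj_id INTEGER, obj_type TEXT, field TEXT, value TEXT)""")

    def store(obj_id, obj_type, record):
        """A new 'database' is just a new obj_type; no new tables."""
        db.executemany(
            "INSERT INTO honeycomb VALUES (?, ?, ?, ?)",
            [(obj_id, obj_type, k, str(v)) for k, v in record.items()])

    store(1, "biosample", {"organism": "H. sapiens", "tissue": "liver"})
    store(2, "run", {"instrument": "sequencer-A", "reads": 12_000_000})

    # Querying one "virtual database" is a filter on obj_type.
    for row in db.execute(
            "SELECT obj_id, field, value FROM honeycomb "
            "WHERE obj_type = ?", ("biosample",)):
        print(row)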
Okay. Now, about the security aspects of it. HIVE implements this hierarchical security model. In any big system which has to be extremely secure, you have a lot of objects which can be shared and a lot of users which can accept the shares. The problem is that if you have many rules defining the permission universe, the system slows down, because at every point of access to any object, you have to check the permission.

We recognized this is a challenge, so we did some studies, did some reading, and came up with a hierarchical security model where objects can be shared within hierarchies. I can say I'm giving this object to this particular entity and down the hierarchy, or to this particular entity and up the hierarchy. By object, I mean files, processes, computations, algorithms.

Let's say you have an algorithm which is only yours in HIVE, and you want to share it with me but with nobody else. You share the algorithm. So the next time somebody wants to use it, he says, hey, show me your aligners. If he doesn't have access to your aligner, he cannot get it; if he does have access, he can get it. So even algorithms are objects, which means they are shareable. Files are shareable. Results are shareable.

The fact that a computation is running is also shareable: I can give a computation to somebody saying, hey, I launched it; now you take care of it. And then, we share not just within one hierarchy, which might be the organizational hierarchy of your institution; you can also have project hierarchies. Yes? Each hierarchy is a tree, and HIVE works with forests, not just trees. So we can have multiple hierarchies: organizational, projects, or any other kinds of entities.
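A sketch of a hierarchical share check over a forest of trees: a grant applies to a group and everything below ("down") or above ("up") it, and a check walks ancestor chains instead of enumerating per-user rules. The group names, grant format, and traversal are illustrative assumptions.

    # Sketch of hierarchical sharing over a forest of hierarchies.
    PARENT = {  # two independent trees: an org chart, a project tree
        "cber.lab1": "cber", "cber": "fda",
        "pilot.analysis": "pilot",
    }

    def ancestors(group):
        chain = [group]
        while group in PARENT:
            group = PARENT[group]
            chain.append(group)
        return chain

    def allowed(grants, obj, user_groups):
        """True if any grant reaches any group the user belongs to."""
        for g_obj, g_group, direction in grants:
            if g_obj != obj:
                continue
            for ug in user_groups:
                # "down": grant target is an ancestor of user's group.
                if direction == "down" and g_group in ancestors(ug):
                    return True
                # "up": grant target is a descendant of user's group.
                if direction == "up" and ug in ancestors(g_group):
                    return True
        return False

    grants = [("aligner-x", "cber", "down")]
    print(allowed(grants, "aligner-x", ["cber.lab1"]))  # True
    print(allowed(grants, "aligner-x", ["pilot"]))      # False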
Okay. So now, let's talk about a misconception. A lot of times, people ask, why do you even need this? When we refer to HIVE as a cloud-like infrastructure -- why do you need it? There are all these kinds of clouds already. It's important to note that clouds are essentially just rented computers, except you don't pay for the computer which is going to consume your power; it's consuming somebody else's power. You are still paying for it, but you get the box. You get the box with a command line on it, and you are responsible for putting the software on it, optimizing the software, doing the parallelization, and doing all kinds of stuff.

But to do all of that, you need to know programming languages. And there are these programming languages -- (Inaudible), shell, and many others. I know like 13 of them. It's a nightmare. I barely speak English or Russian or Armenian, but I know 13 programming languages. So for a biologist to actually learn these new programming paradigms will take a significant amount of time.

We thought it's actually a waste of the ten-plus years these people spent becoming scientists to have them crawling under the tables with a cable in their mouth. That's not a good way to use somebody's time. And these are actual comments: when I was describing to one of the biologists that at that point we had 500 CPU cores, he said, oh, that's magnificent -- what is a core? (Laughter) You know? And with programming languages it's the same. About Python, I said, I thought you were a theoretical scientist; I didn't know you worked with snakes. So, there's a big difference between programmers and IT developers and actual (Inaudible) biology scientists, and we have to maintain that separation clearly; HIVE tries to do that. And of course, there are people like me at the bottom -- when we see a computer monitor, we are really, really excited.
Now, about deployments of HIVE. HIVE has multiple deployments; it's a packaged product. We have maxi-HIVE, which is specifically designed for regulatory usage and to accept big data submissions -- we got that only a week ago, for which we are very glad. And we have a mini-HIVE platform, which is designed for cutting-edge research algorithms. This is where we adapt the tools and let our scientists play with the data and do all kinds of studies.

And we have a public HIVE. Our collaborator, Dr. Mazumder from GW, is maintaining that side, and they are actively opening up and promoting the technology which we are developing. They are inviting people to run pilots. It's an open public resource. Everybody can use it, except that we ask people to communicate with us personally, because when we initially opened it for everybody, people immediately started launching so many computations that the poor computers actually went down; there were too many things to do. But it's open source. We try to transfer the technology to everyone who wants to use it.

And we have public elastic HIVE. That's for when somebody wants to do many computations in a cloud environment -- that's the Colonial One cluster at Ashburn. Dr. Mazumder is maintaining that plastic elastic HIVE. Not plastic, sorry -- public elastic HIVE. I have to know my terms.

Okay, we also adapted HIVE to Amazon and Rackspace. We didn't do big studies on them; we did a feasibility study, and they are running, but we see some performance degradation and we see significant costs associated with moving the data. You know, there's a huge amount of computer power in these clouds, and you pay very little for it. To store the data, you pay very little.

Unfortunately, the reality is that when you work in a big data universe, moving data costs you. You move these huge amounts -- terabytes of data; just today, we are loading something like six terabytes, and that's a routine day. Now imagine moving that every day, and the costs add up. So I, myself, am a supporter of a hybrid platform where we'll have a local, private, cloud-like environment like HIVE, and then, to support efforts with other usage patterns, we can extend to Amazon or Rackspace or any other provider.

We are also working with IBM software now. We are trying to use their bare-metal systems, bare-metal clouds, because the performance is better, in our opinion. So you see, we are trying to adapt to the environment, not to stick to just the public cloud or just the private one, because economics is, after all, deciding what we can and what we cannot do.
And then, let me bring up these slides. This is important. We think that we are solving something which is completely new, a big challenge which was not here before, and I want to say that that is not true. There are so many big data machines right in this room. Every human is a big data machine. Do you know that every second in your body, a gazillion zettabytes of information is actually generated? And that information is not all transferred to the central processing unit; it is actually processed right where it is generated.

That is why only a minute amount of information is transferred to the central processing unit. And even then, our brains sometimes consume 40 percent of our energy. They overheat, and they make us distribute things. (Laughter) So the point is, the body is a distributed computing entity. If you think about it, computation and information processing is not done in one big, big data center or on a big data cloud; it's done everywhere the data is generated. Of course, the most vital information, which is critical for the survival of the system as a whole, is transferred to our brains.

So let's model this. Evolution came up with this concept, and it took, actually, three and a half billion years. Why do we need to reinvent it? Let's borrow the concept. So, we've tried to produce HIVE in a box. We take one machine -- it should be powerful, but of course it's not a cloud, it's not a whole infrastructure. We can put a certain number of cores and a certain amount of storage into it; a large-memory machine. We stick it next to the Illumina or (Inaudible) life sciences instrument or any of these data providers, and then, as soon as the data is produced, HIVE in a box will be able to run particular pipelines which are predefined for that particular setting. It's like an appliance.
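A sketch of that appliance behavior: a small box beside the instrument watches the run folder and triggers a predefined pipeline for each finished run. The mount path, completion marker, and pipeline command are hypothetical examples.

    # Sketch of "HIVE in a box": watch the instrument's output folder
    # and launch a predefined pipeline on each completed run.
    import subprocess
    import time
    from pathlib import Path

    RUNS = Path("/mnt/sequencer_runs")  # assumed instrument mount
    DONE_MARKER = "RTAComplete.txt"     # assumed "run finished" flag

    def watch_and_process(poll_seconds=60):
        processed = set()
        while True:
            for run in RUNS.iterdir():
                if (run.is_dir() and run not in processed
                        and (run / DONE_MARKER).exists()):
                    # Launch the site's predefined pipeline on the run.
                    subprocess.Popen(["predefined_pipeline", str(run)])
                    processed.add(run)
            time.sleep(poll_seconds)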
So, that's what we want to develop and make available for everyone. And the reason we went this way is because we have an array of regulatory affairs offices -- the FDA has hundreds of offices across America, and some of them are in international countries -- other countries, sorry. International countries, yeah. In other countries. So, the reason we looked at this is to make sure that they can do their analytics without actually moving the data all the way to us, because for networks it's not just location -- it is location, location, location. Yeah. So that's another platform which we are developing.

So, what is the future? Moving forward, we have actually put some of the HIVE developments into the curriculum at George Washington University. We are trying to take students and get them developing on HIVE. It takes some effort on developing the APIs and things, but the hope is that, just like with your iPhone, just like with your smart phone, unless you have young people working and developing new stuff, exciting stuff, you are destined to fail. That's why we want to actually have this platform available for everyone, so they can develop for it.

And we are also at the core of developing the NGS standards because of the technology expertise which we have. Also, because of HIVE's success in an NGS setting, HIVE is now being studied and investigated for its possibility to treat post-market and clinical data. That's the honeycomb databasing model. And we are collaborating with different (Inaudible) centers. With some of them, we are running active collaborations; for some of them, we are actually developing new tools and are at the stage of adopting their pipelines. And public HIVE collaborates outside with all different organizations.
So, this is just a brief history of HIVE. We started on this concept deployment with four Macintosh computers, three scientists, two students and one goal, and we hoped to have zero challenges. Unfortunately, that didn't materialize. It's four, three, two, one, zero -- it didn't happen. Okay, but that was a very small group, developers and scientists together, working and trying to do something, and then we actually got funding in 2012 from the Medical Countermeasures Initiative.

And then we went to research production in May of 2013. This is actually a milestone for us, also, because we designed a very nice short-read aligner -- but just like with any aligner, there are tens of them, and everybody likes their own flavor. It is adapted in HIVE, and it performs very well. And just this week, we got our FISMA categorization and we (Inaudible) to operate in a regulatory environment as a regulatory production system for NGS data. Currently, we have 80 big or small projects running in HIVE, with all kinds of data, starting from huge terabytes down to like one file where you're trying to find out where it aligns.

And in the last year -- I think I should have said 1.5 years -- we have 15 publications, and some are pending or in the submission process, which I believe is the biggest achievement, because in science, your worth is measured by how much you can actually publish and what you can actually change and impact in the world.
So, this is the HIVE development team. I hope you understand that a big project like this needs a lot of people, and I tried to mention all of them, but in case I forgot anyone, I promise to bring chocolates. And we had project leaders and HIVE friends in different countries; those are mentioned here. Those are people whose advice or ideas are incorporated, or people who helped with purchasing hardware or moving stuff or connecting stuff. And the high performance computing center has done a wonderful job. We have a 3,000-core high performance computing center, and they have done a wonderful job collaborating with us. And again, as usual, our researchers, without whom we couldn't do it. Please ask me questions.

(Applause)
MS. VOSKANIAN-KORDI: Actually, there was a question online before they get the mikes out -- or I'll give them some time to get the mikes out. So, there is a question asking: any plans to work with FDA to develop a validation methodology for this pipeline?

DR. SIMONYAN: Yeah. If I understand correctly -- if it's about HIVE, yes, we are at the core of the development of these standards and validation protocols. I should be very clear: HIVE is not the only NGS platform at FDA. There are many beautifully built bioinformatics pipelines and platforms, but we are the one that got the regulatory approval for regulatory analysis.

But we are inclusive. We are collaborating, and yes, we are involved in this development of the validation pipelines. As I was saying, we wrote a big document on NGS bioinformatics validation and its standardization. All of it is already implemented in HIVE, because we didn't want to just sit down and write stuff; we wanted to actually see if it works. So yes, we are involved in it.
SPEAKER: Is the intent for HIVE to become an open platform for others to develop applications --

DR. SIMONYAN: Yes.

SPEAKER: -- to use that framework?

DR. SIMONYAN: Yes, yes.

SPEAKER: And when will it be available as an open platform?

DR. SIMONYAN: Yeah, we are pushing hard for it. The reality is that some of the code base we intend to issue in January -- the API. The API will come very soon. As for the source code for the algorithms, there are multiple layers of source code, you understand? The source code of the algorithms, we also hope to release at the same time, although as soon as you release source code, you have to support compilation and you are committing developer resources, and that's the difficult thing to commit to. So yes, we are trying to issue the source code of the algorithmic layer also, given those attendant issues. But there is also code base for which the IP is being resolved now; we are trying to make sure that the FDA has ownership of it. But those things still have to come.
SPEAKER: Thanks, Vahan. Yeah, great talk. Just like any performance system, there are a lot of design decisions to make. With your honeycomb system, you are able to deal with extensible data, let's say. But what kinds of trade-offs are there in that? Is it easy to query? Is it easy to report on? Were there other trade-offs that were made?

DR. SIMONYAN: Yeah. So, honeycomb is actually, underneath, a low-level engine. We are going to provide whatever standards we decide -- we are adopting NCBI standards, in this case. Yes? When you query honeycomb to give you the data for a particular metadata object, those standards will be honored by honeycomb. At this particular moment, if you are asking honeycomb to give you the data for, let's say, a BioProject or BioSample, it will produce the same thing NCBI produces when you're running (Inaudible) to retrieve the same information. Underneath, we keep it slightly differently, because we expect to get, within half a year, 300 million metadata records.

The reality is that we are trying to step away from XML internally, because it's too heavy. So underneath, we keep it a different way; we optimize it differently. But export and import should be conformant to whatever this community decides. We'll make sure of that.
SPEAKER: Yeah, there are hybrid solutions --

DR. SIMONYAN: Yeah.

SPEAKER: -- like (Inaudible) relational with (Inaudible) space.

DR. SIMONYAN: He likes XML (Inaudible).

SPEAKER: Thanks.
MS. VOSKANIAN-KORDI: There is another online question. What element of the pipeline will be validated, and when?

DR. SIMONYAN: Which particular pipeline, and when --

MS. VOSKANIAN-KORDI: The HIVE pipeline.

DR. SIMONYAN: Well, HIVE is not a pipeline. It has multiple pipelines; it's a platform. Actually, a very interesting question came from this gentleman. He said, can we download just one pipeline out of HIVE? The unfortunate reality is that we are working within this platform, and all of our tools are platform dependent, because of the way the data is maintained inside; there are certain limitations to it, and the pipelines are adapted to the platform. You wouldn't be surprised if you tried to run your Microsoft Windows Word on a Linux platform and it didn't launch. Yes? Because it is a different platform.

So in a way, you can think of HIVE as an operating system designed to work on multiple computers, except that it also is extensively attuned for big data, and a lot of NGS tools exist in it already. So, HIVE has many pipelines, and they will be scrutinized exactly as much as any other FDA tool or any other tool from industry, whoever tries to validate them.

But again, this is my perspective; it's not FDA saying so. There are many beautifully built tools, and this is just one platform. You have to understand that. We are proposing this, and we are exactly on the same footing as everybody sitting in this room. There are no prejudices here whatsoever.

SPEAKER: Good question. Here I am. So, good talk, actually.

DR. SIMONYAN: Thank you.
SPEAKER: My question is, is HIVE similar to some of these open source frameworks, like MapReduce or Hadoop?

DR. SIMONYAN: Yeah. We are a little bit lazier than MapReduce: we map, but we do not reduce. Yes, it's very similar. We are not using it underneath, but the platforms and ideas are very similar, except that in MapReduce there is a stage where you may or may not reduce. Because our data is so big, reduction doesn't really matter; the next tool that is going to work on the data also knows how to work on the mapped data. So, reduction is usually done only for visualization, conclusion making, or download purposes. There is a big similarity between them, but the engine is different.

SPEAKER: So how fast is your engine?

DR. SIMONYAN: For map reducing? The examples I can bring -- I mean, I can't speak to the speed of the engine in general, because it depends on what pipeline we are talking about. Yes?

SPEAKER: Yes, yes.

DR. SIMONYAN: Hadoop and similar platforms are Java based, whereas the core of HIVE and all of its components is C/C++ based, very close to the machine. We compared them initially, when we were starting the process and trying to choose between different platforms, and it came out that the C/C++ performance was significantly better, so we stuck with this paradigm. But they're very similar ideas. Over there?
SPEAKER: I really enjoyed your presentation.

DR. SIMONYAN: Thank you.

SPEAKER: So, early on, when we developed our (Inaudible) track, we ran into these issues. And actually, we had some discussions about whether these are tools we need to develop for the technology-savvy person or just for the reviewers. I think these are the seminal questions that are going to come up for HIVE. And of course, at the end of the day, we have to conduct a lot of training courses to really get users on board to use these tools. So, do you have these kinds of (Inaudible)?

DR. SIMONYAN: Is the question about both reviewers and technology-savvy people? Yes?

SPEAKER: Yes.

DR. SIMONYAN: So, regarding the interfaces, let's be completely honest about it -- it's a very good question. The computations are sometimes really complex, and even the interpretation is very complex. So, the current setup is such that there is a learning curve. You have to understand how you get the data, how you launch things, and what the interpretation is of the many arguments which are available.

But you're absolutely right that we have to move towards a more technician-friendly situation: a few buttons, here is the data, get me the output. And we have done that for some of our tools already. We wanted to create this advanced engine which advanced users can customize. At the same time, we are now moving to these overview interfaces, which are pretty much a web page: you come, you select two things, click, and you get your third thing.

And I think the NCBI model is wonderful, because they've done a lot of work in making things available and understandable not just from a technology expert's perspective, but also from a scientist's or reviewer's perspective. We are getting there, but I wouldn't say that right now it's so easy, because there's a really huge amount of information there. We are trying. You know? It's a work in progress.

MS. WILSON: I was just going to add that we are also in the process of developing training for review staff, to get them introduced to the technology, and then have more in-depth training for people who are going to actually have to analyze the data as part of regulatory submissions.

DR. SIMONYAN: Yep. Yep.
MS. VOSKANIAN-KORDI: Well, we're going to end the questions there and allow the next speaker to take over the podium. I'm going to pull up those slides, if you want to introduce him.

MR. YASCHENKO: The next speaker is Dr. Warren Kibbe. He represents the National Cancer Institute, which, as you can imagine, deals with a huge volume of cancer data, with solutions for storing this data, and with the computations they build on it, which he will present.

(Break in recording)
DR. KIBBE: Great. Well, I want to thank the organizers for this, and this microphone is a little bit low for me, so I guess I'll be leaning over here so you can hear me. So, unlike the previous speaker -- and I really enjoyed hearing what HIVE can do -- I'm going to focus more on the problems, frankly, that the NCI is grappling with in generating a lot of this data. How do we get it out into folks' hands? How do we really provide the community with access to these data sets and the computational horsepower behind them?

So, I'm going to take a little detour before I get into some of that. I want to talk just about what the national challenges in cancer data really look like. I'll briefly talk about disruptive technologies, because I think that's what's leading us into these problems of big data, and I think that's a good thing. I'll talk a little bit about two specific initiatives at the NCI, the Genomics Data Commons and the cloud pilots, and then close with what I think is a really important issue: how do we start to build a national learning system where we really learn from every single cancer patient that's getting clinical genomics done? I think that's something that everybody who's in the cancer space agrees with, and I think it has a lot of impact, both for the FDA and, frankly, for all kinds of diseases, not just cancer.

So, two of the big problems from an informatics perspective that I see we're all grappling with are how we really lower barriers to data access -- from a security and a privacy standpoint, we have an awful lot of barriers that stop us from gaining access, particularly to genomic data and associated clinical data.

And then, what we really want to do is get to the point where we have access to those data and we can learn from them, and we can build predictive models that let us really help cancer patients. So again, my perspective is very much on the cancer side -- how do we do this for cancer patients? -- but again, I think it has relevance for every single disease area.

And from a principles standpoint, I think we really need open science. We need that semantic interoperability, and Vahan really spent the whole first session talking about all of the pieces of that and how important they are. And then, the last piece is that we really need sustainable models for this infrastructure. That's something, particularly when we're thinking about big data, that's becoming apparent: we can't replicate this data everywhere in the world, so how do we do this in a sustainable way?
So, I'm going to turn first to disruptive technologies. How did we get to this place? I think we all know that high throughput biology is both the source of much of the really important biology that we're doing now and also the source of all of this big data that we're grappling with -- or one of the sources.

And something that I like to think about is that as we start generating Next Generation sequencing data, and as we start doing this in a really detailed way, it's really forcing us to think as a community about computational biology and systems biology. Have we really reached the end of thinking about this from a purely reductionist standpoint? I think the answer is yes. Or at least I hope the answer is yes.

So the other thing -- and this came out of a workshop at the IOM about a year ago that I was involved in -- is realizing how ubiquitous computing and access to data have really become throughout the whole world. It was shocking for me to realize that as of December of 2013, so now almost a year ago, there were 6.6 billion active mobile contracts in the world, with the world population at the time being 7.1 billion. So more than 90 percent of the world has access to at least a cell phone contract, and there are 1.9 billion smart phone contracts.

That means that access to data has now really become pervasive in a way that wasn't true five years ago. So, how do we really capitalize on that, knowing that almost everyone now is a data provider and that data immersion is a real thing? Are you going to --

(Discussion off the record)
DR. KIBBE: So, the reason I think that's really important is that there are lots of folks who talk about social media now. There seems to be a pretty big age gap in thinking about social media, but it's clear that everyone under the age of 30 lives by social media. And how do we reach them? How do we really change their behavior? Again, from a cancer perspective, that's really important, because there are three modifiable risks that everyone has for cancer: infectious disease, smoking, and poor nutrition together with lack of exercise. Those three things contribute to about 50 percent of our cancer burden across the world. So, if we can just start to address the things we know, we'll really relieve a huge burden on the world with respect to cancer. I think that's really important, and I see, again, these disruptive technologies playing an enormous role in that.

So, I'll get into big data now, since that's what I'm supposed to talk about. We have three very large cancer projects that have generated, comparatively speaking, huge amounts of data: TCGA, TARGET and ICGC -- ICGC is not an NCI initiative, but it's closely aligned. And then, coming out of those, we have the Cancer Genomics Data Commons, which I'll talk about, and the NCI cloud pilots. And along the way, we also have a number of clinical trials that are starting to use these data to actually assign patients to specific arms.
So, I won't belabor this, because you've been hearing about it all morning and in the previous talk, but we're now inundated by data. What's really good is, of course, that the computational capacity and the storage capacity that we have are both rapidly increasing -- exponentially increasing.

I'm going to switch gears for just a second, and we're going to go back to how we got to this place, because I think it's useful to think about very briefly. This takes me back almost to when I was first becoming a scientist -- not quite that far back -- and it really starts to map out the Human Genome Project, when it started and where we were. When the Human Genome Project started, we weren't really doing mass sequencing; we were doing mapping.

And again, looking at that from an NCBI standpoint, the amount of data that was around then seems laughably trivial today. But it wasn't all that laughable at the time. And here's a little timeline of various sequencing projects. You can see the Saccharomyces cerevisiae sequencing completed in '96 -- we've got SGD represented here in the room. Bacterial genome sequencing continues onward, because there are so many different species being sequenced. And you can see we moved from sequencing incredibly small things to getting closer and closer to doing the full human genome.

And likewise, you see this enormous ramp-up in the amount of data, and that was certainly reflected in Eugene's talk looking at the current version of what's in NCBI. But this is now looking backward more than 14 years. The reason I show this slide is that a technology, as it starts to reach maturity, generates most of its data at the end of its current life cycle. And that's something we see with TCGA.
And of course, in February of 2001, these two very seminal works coming out of the Human Genome Project were published. From an outcome standpoint, we know that the Human Genome Project cost a bit more than $5 billion, but it has generated more than $800 billion in the U.S. economy alone. So from an economics standpoint, it's been enormously successful; from a scientific standpoint, equally so.

These are some papers that have come out more recently from TCGA, The Cancer Genome Atlas. I'm not going to dwell on these, but we're really starting to understand much more consistently, much more precisely, the genetic underpinnings of cancer. And with that, we hope we'll start to be able to understand much more precisely how we can intervene in cancer.

And along with that, TCGA is a long-running project at this point. It's going into its tenth year shortly, and it's been a great collaboration. It started out looking at just a few tissue types in cancer, and now it has been expanded to more than 20 tissues. And echoing what a number of folks said this morning, a lot of it is a test bed, and some incredibly important QC components have come out of TCGA.

So, I'm going to skip to a few slides coming directly from the TCGA Consortium. One of the really interesting parts of what's come out of TCGA is being able to look across all of these tissues and realize that there are different mutational patterns in different types of cancers. What in retrospect is fairly obvious is that pediatric cancers have a relatively low mutation rate, and things like melanoma have a very high mutation rate. That's all diagrammed out here very nicely. There's been some incredibly transformative understanding of human cancer coming from TCGA, and the papers are numerous; every one of them has been very insightful in helping us understand important parts of cancer.
So, I want to delve into this one just a little bit. This is a look at endometrial cancers. What was interesting is that looking at the histology of these cancers versus the way they were characterized from a genomic standpoint pointed out that a number of cancers were being misdiagnosed. Those are the ones on the right in those panels: when you look at them genomically, they were clearly misdiagnosed, and a number of endometrioid cancers had been put in there.

And what that let everyone do is start to understand that, in fact, the outcomes for a given treatment looked very different -- you can see the survival curves look quite different for each of these classes, even though previously they were all treated as one disease. So, now we're starting to understand that the pathways that underlie these diseases are specific, they're important, and they highlight the different kinds of therapy that need to be given.

With all of that said, one of the problems for TCGA is that it's been a 10-year project, and it has a heterogeneity of technologies behind it. There's imaging; there are multiple kinds of sequencing. So, there's a push now, as we want to create the Cancer Genomics Data Commons, to create some harmonization between the way all these data are being handled and the way they can be analyzed.

And we also now want to create an infrastructure that allows individuals to directly contribute their own data to this repository, and that would then create this national, or hopefully even international, cancer knowledge base.

As I mentioned, there are a lot of different data types in TCGA. This highlights a few of them, and the fact that they're actually held in different places. They're not just BAM files; it's all kinds of things, and imaging, again, is incredibly important. And we don't really have a consistent way to gain access to it all. So again, one of the drivers behind the Genomic Data Commons is to put some cohesion around all of this.
However, the way the Genomics Data Commons is being thought of and built, it's still a classic data-centric repository. So, how do we actually allow people to access it and download from it? This is becoming a critical point, because we now have, in TCGA from a sequencing standpoint alone, about two and a half petabytes of data, and you can see the rapid increase in the amount of data in TCGA. So again, as I mentioned earlier, that means that individual groups really can't download the whole data set and compute on it locally. It just becomes a non-starter.

On top of that, just downloading it, if you assume everybody had a 10 gigabit connection -- sorry, I think we've flipped back and forth between geek speak and normal language, or normal scientific language at least, which already isn't normal language -- it takes about 23 days just to download the current TCGA dataset. So, it's clear that just moving that data around is no longer an option.
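As a sanity check on that 23-day figure, the back-of-the-envelope arithmetic works out, assuming the roughly 2.5 petabytes quoted above and a fully saturated 10 gigabit-per-second link:

    # 2.5 PB over a perfectly saturated 10 Gb/s link:
    data_bits = 2.5e15 * 8          # 2.5 petabytes expressed in bits
    link_bps = 10e9                 # 10 gigabits per second
    seconds = data_bits / link_bps  # 2.0e6 seconds
    print(seconds / 86_400)         # ~23.1 days, matching the talk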
And then, of course, when you want to actually compute on it, the amount of resources and tooling necessary really requires something like HIVE. And again, we shouldn't be asking everybody to set up their own HIVE instance, although that might make a few people very happy. So, that highlights the relationship between the Genomic Data Commons and the NCI cloud pilots.

The idea is that the cloud pilots will create an infrastructure that's tightly coupled to the Genomics Data Commons -- a cloud infrastructure much like what was described for HIVE, except we don't actually have any working software yet -- and allow people into that infrastructure to do the analysis of TCGA data. That's the essence of the cancer genomics cloud pilots.

We'll be announcing very shortly that there will be a number of folks funded to do this, and the idea is to really explore the models for cancer genomics and what the APIs that would be consumed by the community might look like, and of course, to explore cloud models for how to make data plus analysis really happen. And again, I think the HIVE model is very elegant.

But how do we do this in a community-focused way? How do we allow anyone to have access to it? It's clear that NCI or the federal government can't pay for it all, so how do we implement this in a way that's scalable and becomes cost effective for everyone? And I think I've said almost all of this.

Another part of this is how we do this in a reproducible way. Again, I think this morning Vahan was talking about how we can make standards a part of Next Gen sequencing, and reproducibility is a big reason to have those standards.

So, I flew through my slides, and that will hopefully keep us on target here. But I want to leave you with what the future might look like.
I really think that elastic computing clouds are here to stay, and the NCI wants to understand how we can make use of that kind of infrastructure. I also think there's some real benefit to social networks: how can we actually change the behavior of folks throughout the whole world in a way that reduces the risk for cancer? That's a whole different kind of big data talk.

And of course, there is a precision medicine piece to this. It's not just sequencing; it's imaging, it's histology. There are all kinds of data being made available. How do we really combine all of those in a way we can learn from, and do true prediction? And again, the take-home is: how do we take all of this and build a learning healthcare network where we learn from every single cancer patient? So with that, I'll take questions. (Applause)
SPEAKER: That's a great talk, Warren. So, one of the things about dbGaP -- do you see a streamlining of the dbGaP process, which you could argue is actually hampering some of the access to the data? And do you see some other mechanism, or even a different set of rules, a new set of tools in the new era? Who accesses --

(Simultaneous discussion)

DR. KIBBE: That's a great question. I don't think dbGaP itself is necessarily the problem. I think what it is, is that there's a lack of consistency in the way consents are done; there's a lack of consistency across projects. Right now, for everyone who has gained access to dbGaP, the current status is that it's project by project: you submit and you gain access to a given project. That's clearly not scalable. That's not where we want to go.

The good part for TCGA is that it's one project, so you gain access to everything inside TCGA -- but you've got to do the same thing for TARGET. You know? Again, that's another set of permissions. So, where I see this going in the future is that, hopefully, we'll have more harmonized consent forms where we can actually lump things into a group consent. In fact, I think that was a point in someone's slides this morning, talking about these access groups in dbGaP.

So, I see that as one potential way around it. The other side is getting participants directly involved and getting them to actually give their data freely in a very different model -- one where the government isn't directly involved. That's a very different model, and I think it would be very interesting to see it pursued.

SPEAKER: This is just a comment. Since I work with dbGaP: the concept of using one application for multiple datasets simultaneously is being considered by NIH. This is not NCBI; it's NIH's decision. And one of them already exists -- the general research use set. You can apply once for general research use and get approved for multiple things at the same time. So, it's being changed.
DR. KIBBE: And there's the new genomics data sharing policy. Again, it starts driving us toward that consistency of access.

SPEAKER: Another question: you mentioned TCGA and TARGET and other big cancer-related projects. And recently, in (Inaudible), Lincoln Stein was talking about ICGC in detail and all of its challenges. What other similar big projects are running? And what is the mode of interaction between them -- sharing the data, sharing the resources and --

DR. KIBBE: Well, I think back to Raja's point about the difficulty of dbGaP. When you start going into the international consortiums -- ICGC is the international consortium -- it turns out some of the countries that have submitted data to it won't allow their data to be held on U.S. soil. The U.S., in some of the agreements, won't allow data to go outside of the United States. So, that makes it very hard to combine some of this. There has been some really interesting and, frankly, I think very novel thinking about how we can start to combine ICGC and TCGA and TARGET in a virtual environment that respects these geographical restrictions but still allows people to do computation across them. But frankly, part of the problem there is that we have the ability to say, no, we don't want to share with them. That probably shouldn't happen in the first place. But that's beyond the discussion for this room.
MS. VOSKANIAN-KORDI: There are actually a couple of questions online.

SPEAKER: What is the NCI cancer genomic data RFP? Expected (Inaudible)?

DR. KIBBE: The Genomics Data Commons was awarded back in July or August, I'm not sure which. And I believe there will be, if there isn't already, a public announcement of it.

SPEAKER: Okay. And are the NCI informatics initiatives planned to (Inaudible)?

DR. KIBBE: Oh, ASCO LinQ. Sorry.

SPEAKER: Right.

DR. KIBBE: Yes. So, we're certainly talking with ASCO -- it's actually ASCO CancerLinQ. That was partly why I was confused.

SPEAKER: Okay, sorry.

DR. KIBBE: No, no problem. For those of you who don't know about CancerLinQ, it's a way that different healthcare providers involved in oncology can start to share data about outcomes and about therapies for their cancer patients. Looking at TCGA, one of its downsides is that we don't have long-term outcomes for many of the TCGA patients. We have this very detailed snapshot of their cancer, their histology, their point in time when we took the data, but we don't always have their long-term outcomes, and not even necessarily what therapies they were given. So, I think it's very natural to think about coupling an archive like TCGA with something like CancerLinQ. The problem is, there will be very few patients who actually overlap between what's in CancerLinQ and what's in TCGA. So, long-term, yes, I think that's exactly what we want to be able to do. Short-term, it probably won't be of much value initially. Laura?
MS. VENTRIA: Laura Ventria, UCSF. Maybe to add a little positive comment to your comments --
DR. KIBBE: Oh, absolutely, please.
MS. VENTRIA: So, I think the activities ongoing in cancer, but also in other diseases, to share data and to actually get to learning systems are very much taken up by the Global Alliance for Genomics and Health. And I think tomorrow, one of the talks will be about this Global Alliance, where the structured data components, as well as a way to access those data by API, will be presented.
DR. KIBBE: Absolutely. So, for those of you who don't know about the Global Alliance for Genomics and Health, it is an international consortium. It's made up of now -- I think it's over 205 different organizations across the world. Go and read their web site. They've laid out, very beautifully, what I think is eight principles for data sharing and how we go about thinking about data sharing across all kinds of diseases, and explicitly, how we start thinking about crossing interesting boundaries.
So, there's some great work going on. Actually, the beacon that Eugene mentioned is a response to one of the GA4GH initiatives -- everyone should create a beacon. And I think right now NCBI is the only one that's created a beacon, but we have to start somewhere.
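[Editor's note: a beacon in this sense is a deliberately minimal service -- it answers only "is this allele present in your dataset?" and returns no individual-level data. The Python sketch below illustrates the idea; the in-memory variant set and the tuple fields are invented for the example and are not the NCBI or GA4GH implementation.]

def beacon_query(variants, chromosome, position, alternate):
    """Answer only presence/absence -- no counts, no sample-level data."""
    return (chromosome, position, alternate) in variants

# Hypothetical in-memory dataset: (chromosome, position, alternate allele) tuples.
dataset = {("1", 10177, "AC"), ("17", 41276045, "T")}
print(beacon_query(dataset, "17", 41276045, "T"))  # True
print(beacon_query(dataset, "17", 41276045, "G"))  # False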
So, I think there is a lot of hope here, and I appreciate, Laura, your calling me out for not being quite hopeful enough. There are a lot of really good things going on for data sharing.
MR. YASCHENKO: The next speaker is Dr. Toby Bloom. She's from the New York Genome Center. So, welcome. She will present issues more related to, I believe, real-life hospital clinical decisions and sequencing.
(Discussion off the record)
DR. BLOOM: Can everybody hear me?
SPEAKER: Yes.
DR. BLOOM: Okay, good. So, I'm going to talk about big data in clinical genomics and clinical research studies. Okay? Why don't I understand what this is doing?
(Discussion off the record)
(Break in recording)
DR. BLOOM: That did it. I just want to tell you a little bit about the New York Genome Center; just a couple of slides, because the New York Genome Center is fairly new.
(Discussion off the record)
DR. BLOOM: The New York Genome Center is fairly new. It was formed a couple of years ago by a collaboration of 12 large health systems in New York, on the theory that it was better -- and cheaper -- to have one central genome center than to try to build 12.
We have more members now than we did then, but you can see that almost all of the New York hospitals are in here: Weill Cornell, Columbia, Presbyterian. Cold Spring Harbor is a member. Sloan Kettering is a member. Most recently, IBM became our first corporate member, but we're open --
(Discussion off the record)
DR. BLOOM: And we have little strange ones, like the American Museum of Natural History, which isn't exactly doing clinical genomics, but they do have a lot of things they want to run genomic analysis on. Our current capacity, I think, qualifies us as big data. We are one of the first four of the organizations that got HiSeq X Tens. In fact, the last two of our X Tens are only coming in this week. Today? They were supposed to arrive today, and then we will have ten. We have eight of them running right now. We have 16 HiSeq 2500s. We have capacity for somewhere north of 10 terabases a day. We do have a CLIA lab, although New York CLIA lab standards mean that not everything we want to run in the CLIA lab can we run quite yet.
Do you want to know about compute infrastructure? We've got nine petabytes of storage. Not all of it is used, but we expect it all to be used by December. We have only 3,000 cores, but we expect to double that. We have all the standard pipelines using mostly standard methods. As I said, for cancer, we run three somatic variant callers, three structural variant callers, two copy number callers, one or two purity/ploidy callers and God knows what else, and then compare the results semi-automatically, but a lot manually right now. And I will talk more about databases later.
So, big data. It's a term that is overused, and I don't think anybody really has a definition for it. The standard definition a lot of people use, which is attributed to the Gartner Group -- but I'm not sure if that's really true -- is that it's a combination of how much data you have, the volume; how fast it's coming at you, the velocity; and how complex it is, the variety. So, basically, big data is anything that's too big for you to handle easily, and that usually involves some combination of volume, speed and complexity of the data. What I'm going to talk about today -- everybody talks about the volume. Lots of people talk about the velocity. I really want to talk about the variety. I really want to talk about how complex data is going to cause us problems, and it's really showing up in clinical genomics first. And that doesn't mean it's going to stay there, but I think we've really hit it.
So, you know, everybody understands that we're talking about dealing with real patients, not just samples we get that are anonymous; that we need the results faster; and that it's going to be used for treatment, or maybe not. And everybody is worried about the accuracy of interpretation when it's going to be used for treatment. But I'm really going to focus, for now, on how complex clinical research projects can be. And I'm going to talk about clinical research for now, rather than CLIA clinical one-at-a-time samples.
Let me start with one example. The New York Genome Center is currently running an autoimmune disease study. We're currently doing only rheumatoid arthritis, but we're expecting to extend it to lupus and multiple sclerosis and Crohn's disease. And because we're a collaborative center and were formed as a collaborative center, we've got two other hospitals working on this now, and we expect other hospitals to come in for the other diseases.
What are we doing in this project? We are taking weekly blood samples from patients. We basically taught the patients to prick their fingers the way diabetics do. They put four drops of blood into a pre-barcoded tube; take a picture of the barcode with their cell phone and email it in, so that we know when they took the sample; and stick it in their freezer until they go to the doctor.
We're doing this because the genome isn't going to change. The RNA changes all the time. And what we're hoping is that if we follow people over a long period of time, we will be able to figure out what changes -- what genes are involved when a flare happens. What happens in the prodrome period before the flare? Can we find early predictors?
We're interested in a lot of things, but we're interested in finding the mechanism of action. We're also interested in finding ways to help patients manage their chronic diseases better, and we're hoping that if we can get early indicators of when a flare is going to happen, not only can we get the patients to doctors earlier, but we may be able to connect it to environmental triggers that you can't find now if they're really long before the actual flare happens.
So, we are collecting not only weekly RNA data, we're collecting microbiome data -- right now, just at the beginning of the trial and at the first flare. Fecal microbiome is known to change, not only in Crohn's, but also in RA, and I've heard recently in multiple sclerosis as well.
It's not surprising; this is an autoimmune disease. It's probably an effect rather than a cause, but it might give us signs of things. And we're doing both, because oral is a lot easier to collect than fecal, and if we can see changes in oral, we'll be able to do weekly oral microbiome collections also and just drop the fecal. We may add methylome data. We aren't right now. Patients are filling out weekly surveys of how they feel. They're going to the doctor monthly to get clinical assessments of inflammation that we can then tie to the RNA samples.
We need their medication data, because they're all on medications all the time, and during flares, they're on more. But we need to know when they started and what they're taking, because it changes the RNA data. And the most interesting of these, maybe, is that we actually have put an app on their smart phones, which right now is mostly collecting mobility data.
And again, it's there mostly so that if we can correlate changes in how fast people are moving or how far they're going daily, we may be able to connect that to changes in RNA expression levels. If we can use that as a very simple predictor that they're going into flare, we really can easily change that app to say, go to your doctor now. That's the hope. We aren't there.
We'd really like to know about food intake. There's no reliable way to do it. There's actually somebody at Cornell Tech -- which is the alliance of Cornell and the Technion that started in New York about a year ago; they're the ones who are doing the smart phone apps; they're doing mobile health -- there's somebody there who's trying to analyze pictures of plates of food to figure out what's on them, because that's about the only way we think we can get accurate information about food intake. When it's ready, we'll add that, but it's not there yet.
I talked about the study goals already, but yeah, it's a combination. And by the way, these are contradictory, and they compete with each other. We want to understand the mechanism of action of autoimmune disease, and we want to know whether it's the same in all of them or not. And we also want to help patients better manage their disease and maybe find the environmental triggers. The better the patients can manage the disease, the less information we have to figure out the mechanism of action. But we're doing both together, nonetheless. Okay?
But look at how many kinds of data I have. Right? And how I have to analyze them. Okay? So I've got time series data for about six different things from different time periods. Okay? And I can't even smooth the peaks, because what I'm looking for is the peaks. Right? So, I've got to get the daily and the weekly and the monthly stuff all aligned and analyze it. And then, after I do that, I have to figure out what correlates with what. Okay?
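[Editor's note: to make the alignment problem concrete, here is a minimal pandas sketch of putting daily, weekly, and monthly streams onto one weekly grid without smoothing away the peaks, then checking a lagged correlation. The column names, the synthetic data, and the choice of a weekly grid are illustrative assumptions, not the Genome Center's actual pipeline.]

import numpy as np
import pandas as pd

days = pd.date_range("2014-01-01", periods=180, freq="D")
mobility = pd.Series(np.random.rand(180), index=days)  # daily app data
rna = pd.Series(np.random.rand(26),
                index=pd.date_range("2014-01-05", periods=26, freq="W"))   # weekly RNA score
clinic = pd.Series(np.random.rand(6),
                   index=pd.date_range("2014-01-31", periods=6, freq="M")) # monthly assessment

weekly = pd.DataFrame({
    # take the weekly *max*, not the mean, so peaks survive the regridding
    "mobility_peak": mobility.resample("W").max(),
    "rna": rna,
    # carry each monthly assessment forward until the next clinic visit
    "clinic": clinic.resample("W").ffill(),
})
# does mobility one week earlier track this week's RNA signal?
print(weekly["rna"].corr(weekly["mobility_peak"].shift(1)))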
I don't think this is going to be an unusual kind of clinical study in the future. Everybody talks about wanting clinical data, and better and deeper clinical data, to go along with the genomic data. This is, I think, an early example of that. We've seen smaller projects like this, but this is a bigger one, and it's not easy to do. So, those of you who know me know I always talk about the problems and I rarely talk about the solutions.
We are working on it. Right now, we're doing what we have to do to deal with the patients as they come in. We're trying to figure out enough as we do it to build a system that will deal with this. But for starters, from an infrastructure perspective, to ask the questions we want, we can't have all this stuff in a gazillion little files. Right? Some files are big -- you know, if I have a cancer genome, a whole genome, it could be 500 gig to a terabyte. RNA files are small, and I'm going to have them weekly, and I might have them daily. Right? They're not that small when you get that many of them per patient, and you have a lot of patients. It's still going to be big data. But I've got clinical data.
I've got patient surveys with 10 questions on them. I have to get all of that -- and I've got to take the EHR data. I've got to get all of that into a database, or at least indexed in a database that lets me ask questions. Okay? And a standard relational database isn't going to do it. And by the way, as we get bigger and want more data, yes, I am really, really hoping that the Global Alliance will come up with the mechanism by which we can connect to everybody else, and yes, I'm interested in doing (Inaudible). But we haven't even gotten there yet, because we're new and I've got way too much to do.
But we're in the middle of designing that database. It's not easy, and finding the infrastructure isn't easy, and getting it to be scalable in all of those dimensions is not easy.
Some things in HIVE may help. HIVE as it is probably won't. But I'll steal what I can. From a compute perspective, you know, we've got alignment methods out there. Take your pick. We've got variant-calling methods out there. Take your pick. There are people working on longitudinal data, but there's nobody who's really done this yet in any standard way that we can steal. We're building the methods ourselves. And we're going to build them up as we get more and more patients in and figure out what we're doing. We're doing it manually right now.
And by the way, even with just the first patients, we can see changes -- we can see real differences in RNA expression as patients go into flare. We don't know what it means yet, and we don't have enough data to know what it means yet. But we're hopeful that that means we're really going to find something. We don't know the compute capacity we're going to need. We have no idea what this compute is going to take, but doing longitudinal data itself is time intensive.
So, I'm trying to point out here that as we get closer and closer to clinical studies, we're not in the research area anymore -- where, with TCGA, you have three centers that all analyze the same data through their pipelines and find three different answers, and then take three months to figure out which answers were right. We're not there. We're in clinical right now. Right? And we're trying to figure out -- this isn't real time yet, but we're trying to figure out fast what's going on here. Among other things, we want to be able to change the protocols as we find out what works and what doesn't work.
Data interpretation and clinical accuracy. I know everybody at the FDA is really worried about diagnostics and tying diagnostics to drugs. I'm going to go over this really quickly, because it's not the part I'm worried about. Okay? Yes, I believe if you understand what the mutation is and you have a drug that you believe works, you can come up with a diagnostic for it.
The harder question is, if you come up with a new mutation nobody has seen before, but when you analyze it biologically, it affects the same pathway in the same way as some drug out there for a different mutation does, what do you do? Especially if it's a rare disease and you're not going to get a drug company to test it. Right? Are doctors going to use it off label? Are you going to do more testing? What do you do?
And we have one example of exactly this, where I know that at least one doctor used a drug off label for a dying kid, and then I never heard back again, so I assume the kid died anyway. But I don't know. But we know that sometimes it works differently -- especially if it's a different disease. We know that sometimes, with the same mutation and the same drug, the drug will work in one disease and won't work in another disease, when it's the exact same mutation.
And so these are problems that I'm not trying to solve, but they seem to tie in to what people are interested in here. I'm worried about the next level of cases, which is my next case of big data. Okay?
So, I started out worrying about the multi-modal longitudinal data. This is a different issue. When we're talking about variants that have a risk of disease, that carry a risk of disease with them, we don't have a good handle, in almost any case, on what the penetrance is of those variants; what modifiers, what other genes, what other variants may or may not keep the risk variant in check. Okay? And so, we don't know when some preventive treatment might be needed and when it isn't. Okay?
David Altshuler just -- sorry, two different mikes here get me confused. David Altshuler not too long ago published a study on diabetes -- a rare variant that protects you against diabetes, no matter what your other risk factors are. It took 250,000 patients to get the statistical power to publish that paper. Okay? So, this is a different case of big data, and it's a really important one. And as we get more and more into this, I think it's going to be the next place that genomics is going to be spending a lot of time.
And yes, we've already found some centenarians with two copies of ApoE4 who have no signs of Alzheimer's, and at over a hundred, they really are controls. (Laughter) They're not getting Alzheimer's. Right? But we don't know why. Everybody is really aware of the changes in thinking about BRCA1 now, and how much of a risk BRCA1 is to different women. We have to figure this out, especially as we get to whether you use drugs early to prevent something like Alzheimer's. We don't understand it. But we're going to need tens of thousands of genomes. We don't know how to get them. We're not going to have them all. Hopefully, the Global Alliance is going to help. There are regulations that make this difficult.
Here's a situation that, at the moment, is driving me crazy. The New York Genome Center is a collaboration of a lot of hospitals. We often ask researchers if we can keep their data at the New York Genome Center, so that it can be easily available to combine with other data if somebody needs more data for their studies, or at least, to let us index it so they can go ask for it and get permission. And those people often say, you can keep it, but our data access committee has to approve its use, or our IRB has to approve its use. I recently got permission to take some very large datasets on one particular disease, some of which were not sequenced with us. And they were all consented for use in medical research around a single group of diseases, but not all medical research.
And I said to the caretakers of this data: I'm really interested in this penetrance and modifier problem. So if somebody comes to me working on that, the first question they're going to ask is, can you tell me how many people with the disease of interest I'm working on have this variant? And how many other people in the general population have this variant? So, all I'm going to give them back is a count of some number of people with this variant out of the tens of thousands of genomes I have. Can I include your samples in just counting this variant? If it's not a rare variant, then the number is large. There is no chance of identification here. Okay? The answer is no, because patients consented only for use for this one disease. And it doesn't matter that this is only a count that is not identifiable. They didn't allow for somebody to look and see if they had this variant.
So first of all, that causes major problems for making progress in any study that needs large numbers of controls, or just large numbers of patients, period, and especially when you're looking at penetrance and modifiers, and you're looking for modifiers. This is horrendous. I wish somebody would tell me if there were a regulatory way around that, but I don't think there is. Okay?
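[Editor's note: as an illustration of the query shape being described -- an aggregate carrier count that honors per-sample consent scopes -- here is a hedged Python sketch. The consent vocabulary, the sample records, and the small-cell threshold are invented for the example; they are not NYGC's schema.]

# Each sample carries the consent scopes its donor checked off.
samples = [
    {"id": "s1", "consent": {"disease-X-only"},      "variants": {("8", 19813529, "A")}},
    {"id": "s2", "consent": {"all-medical-research"}, "variants": {("8", 19813529, "A")}},
    {"id": "s3", "consent": {"all-medical-research"}, "variants": set()},
]

def carrier_count(samples, variant, required_scope, min_cell=5):
    """Count carriers among samples whose consent covers this query;
    suppress small counts rather than return a re-identifiable number."""
    eligible = [s for s in samples if required_scope in s["consent"]]
    n = sum(1 for s in eligible if variant in s["variants"])
    return n if n >= min_cell else None  # None = "too few to report"

print(carrier_count(samples, ("8", 19813529, "A"), "all-medical-research"))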
But now, we come back to the database. I need to connect all these kinds of data and these huge numbers of genomes. And I mean, I have a variant database. The variant database doesn't do me any good. It doesn't matter if I actually keep the data in a bunch of different databases, as long as I can query across them easily. The structure itself doesn't matter, but look at what it does to my access control and security. I now need, on every query, to look at every cell in the database -- not even every row in the database -- and decide if I can use it in this query. You're saying, no way. But look at what my security is, in terms of informed consents, in terms of data access permissions. Informed consents have check boxes. Okay? This disease only; all cancers; all medical research; non-commercial research only; commercial -- okay. And there can be multiple boxes checked. Okay?
That changes what samples, even within one project, can and can't be used for what. Okay? And it's not even completely hierarchical, because the commercial/non-commercial stuff is orthogonal to the other stuff. And then, I have a different problem. I have owners of samples, the biosample repositories, that allow their samples to be used in multiple projects that are unrelated to each other, but they only want the PIs in their project, in that particular project, to see the results of that particular project.
So, two PIs can have access to the same sample. It can be aliquots of the very same sample taken from the same patient at the same time. And there are two different files or six different files for that patient. And this PI can see two, and this PI can see four, and that PI can only see one. And maybe they can ask for permission for more, and maybe they can't. But the more projects we try to use the same samples in, the harder and harder the security and privacy constraints get, and the more and more impossible it becomes to put them into a database.
There are databases out there that do cell-level access control. Accumulo does. It's a name-value pair database. I never get it to run fast enough. Okay? I'm just putting this out there. I can't even remember -- how far behind am I? I'm five minutes behind.
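[Editor's note: below is a toy Python illustration of the cell-level visibility idea that Accumulo implements -- each cell carries a label expression, and a scan returns only cells whose labels are satisfied by the reader's authorizations. This mimics Accumulo-style visibility with a simplified AND-only rule in plain Python; it is not the Accumulo API, and the labels and cells are hypothetical.]

# Each cell: (row key, column, value, visibility), where visibility is the
# set of labels a reader must ALL hold (a simplified version of Accumulo's
# boolean visibility expressions).
cells = [
    ("patient1", "file2", "variants_projA.vcf", {"projectA"}),
    ("patient1", "file3", "variants_projB.vcf", {"projectB"}),
    ("patient1", "rna",   "expr.tsv",           {"projectA", "all-medical"}),
]

def scan(cells, authorizations):
    """Return only the cells this reader is authorized to see."""
    return [c for c in cells if c[3] <= set(authorizations)]

print(scan(cells, {"projectA", "all-medical", "projectB"}))  # sees all three
print(scan(cells, {"projectB"}))                             # sees one cell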
There are lots of questions we want to ask. Is there anything else on this I haven't said yet? (Laughter) I don't think so. I'll just keep going. I already said we don't know how to analyze this data. We're working on it. I hope other people are working on it. But forget the multi-modal longitudinal stuff. If I need 250,000 genomes and I have to analyze it once, that's a problem in itself. Okay?
This is just a summary of everything I just said, and since I'm behind, I'm not even going to say it again. But it's important. All right? Especially the penetrance and modifier stuff. But for anything that needs to aggregate data across locations, across diseases, across projects -- the way a system like dbGaP grants data access project by project causes problems.
There's actually only one little thing on this page that I wanted to say, which is -- and I'll tell you (Laughter) -- my lawyer actually didn't want me to say this, but I'm going to say it anyway. Okay. We all know that eventually, genomic data is going to fall totally under HIPAA, and right now, it's in this gray area. But everybody agrees that it's personally identifiable information. And as far as I'm concerned, when things are personally identifiable information, they should be kept encrypted. And I am perfectly happy to keep it completely encrypted once the pipelines are over. But it can take weeks to get through the pipelines, and I'm going to read and write that data dozens of times in that time.
And I have tried hardware-enhanced encryption. It takes three hours to encrypt a big BAM file. I can't do it. So, I'm actually looking for other algorithms that I think will help me maintain the data in storage -- not encrypted, but not identifiable. And I'm working on some algorithms, and I'm going to go the statistician route and try to get a statistician to tell me it's not identifiable. And I'm hoping. But it is a problem for all of us. We need to understand what it means if we're not going to be at risk of a breach that exposes a whole lot of data that could wind up identifying people.
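[Editor's note: for readers who want the mechanics, at-rest encryption of a large sequence file is typically done in streaming chunks so memory stays flat. The sketch below uses the Python cryptography package with AES in CTR mode -- an illustrative, unauthenticated example of the general approach, not the Genome Center's solution, and the file names and key handling are placeholders.]

import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_file(src, dst, key, chunk=1 << 20):
    """Stream-encrypt src to dst with AES-256-CTR; the random nonce is
    written as the first 16 bytes of dst so decryption can recover it."""
    nonce = os.urandom(16)
    enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(nonce)
        while block := fin.read(chunk):
            fout.write(enc.update(block))
        fout.write(enc.finalize())

key = os.urandom(32)                               # in practice, from a key manager
encrypt_file("sample.bam", "sample.bam.enc", key)  # placeholder paths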
I've probably said all of this already, but yeah, we're going to have larger numbers of genomes together. We're going to have more kinds of data. They're all going to be longitudinal. We have to figure out how to store the databases so we can query them, not just run pipelines over them from front to back on every file and combine lots of files together in each algorithm. And that's it. I have a lot of people to thank for this, and a lot of people who are working on these projects. And thank you. (Applause) Not too far over.
MS. VOSKANIAN-KORDI: Keep that, because you're answering questions.
DR. BLOOM: Oh, I'm still answering questions. Okay.
MS. VOSKANIAN-KORDI: Please.
DR. BLOOM: I'll put it back on.
MS. VOSKANIAN-KORDI: Please answer questions.
DR. BLOOM: Yes.
SPEAKER: You were talking about some of the -- in the first part of your lecture, when you were talking about the immune -- sorry --
DR. BLOOM: The autoimmune disease project.
SPEAKER: Autoimmune diseases.
DR. BLOOM: The rheumatoid arthritis.
SPEAKER: Yeah. In a lot of those, the primary signal comes much before you see the disease state.
DR. BLOOM: Yes.
SPEAKER: And some of it actually can be related to date of birth, which means some kind of environmental factor. How do you -- how far can you go backwards when you set up the databases?
DR. BLOOM: So, there's two answers to that. I mean, these patients flare only once or twice a year. So, for things that are, you know, months ahead, we will have even the RNA data. And we can get access to their clinical records, but I think if there were things in the clinical records, we would have found them already. So, in terms of being able to relate the genomic data to the clinical data, I think it's only going to be from the start of the study.
The other side of that is that I'm doing something a little bit backwards, which some people think I'm crazy for. The New York Genome Center is the host for the PCORI clinical data research network for all of New York. I am going to have full longitudinal, de-identified clinical records for just about every patient in New York at the New York Genome Center. Which means, moving forward -- I am hoping, and I do not have full permission for this yet -- that all the researchers at those hospitals who send us genomic data will be able to get access to the anonymized ID to link to full clinical records. So in that sense, we can do it. But, you know, I don't have their genome from their date of birth, and when we start sequencing all babies, maybe I will. But I don't know how else to answer that.
MR. YASCHENKO: I have one comment and one question. The comment would be about encryption. I think we hit the same issue when we were in post (Inaudible) -- how do we encrypt your data. And in my opinion, there is a need to develop hardware platforms which maintain the data encrypted. And I see here -- and I made sure that some hardware manufacturers are also here -- because this question of encryption and encoding is --
DR. BLOOM: It's a -- yes. I'm sorry to interrupt. Go ahead.
MR. YASCHENKO: That is what the --
DR. BLOOM: I was going to say, you can buy encrypted disks.
MR. YASCHENKO: Yep.
DR. BLOOM: Okay? They're more expensive. We could do it. There's a problem with them.
MR. YASCHENKO: Uh-huh.
DR. BLOOM: If somebody breaches your system and gets into your server, those encrypted disks are designed so that when your server starts up, the decryption key is loaded, and anybody who is using the disk sees the data unencrypted, automatically. So, if your server is breached, encrypted disks don't help.
MR. YASCHENKO: But perhaps the file system developers -- and those would be the same people, yes? Because I see people from EMC, from IBM, from others I perhaps didn't recognize -- if they develop file systems, it's a design question of the file system: when does the encrypted data become open for usage, for the programs? Perhaps, if they collaborated with you and with us --
DR. BLOOM: That would be --
MR. YASCHENKO: -- that would be a wonderful addition to it.
DR. BLOOM: That's an excellent -- yes. I think it's going to take collaboration with hardware manufacturers to do something.
MR. YASCHENKO: Yep, yep. Okay.
DR. BLOOM: Or finding a different software solution that's not encrypted, which is what I'm currently trying to do.
MR. YASCHENKO: And the question would be -- not towards you, but towards maybe some folks from FDA. We're living in this country in a century when younger people don't have any concerns about identities. They put all their pictures, videos, everything out. So if, let's say, somebody wants to publish a genome for usage for any purpose, are there any regulations which would be controlling that?
DR. BLOOM: Yes.
MR. YASCHENKO: There are. You know the answer. That's good.
DR. BLOOM: Well, it depends in part on what state you're in.
MR. YASCHENKO: Uh-huh.
DR. BLOOM: Okay, and it's country; it's not just state. New York has particularly strict regulations about it. We are trying to crowdsource. Warren?
DR. KIBBE: Toby, that was great. You brought up some really wonderful points. I just wanted to bring up a conversation I heard that came out of the Global Alliance, and that was, there was a discussion of some treaties that were signed back after World War II that explicitly called out the right of patients to participate in research and to benefit from research. And I think they were looking at some of those treaties, which all countries in the world have signed, as a way to say there is a way to force data sharing, looking at it from a right of --
DR. BLOOM: Awesome.
DR. KIBBE: -- the individual to participate. So I think that, again, that becomes -- that makes these beacon services possible.
DR. BLOOM: That's really -- that's a really interesting thing. I love it.
MS. VOSKANIAN-KORDI: Anybody else?
(No response heard)
DR. BLOOM: Okay. Thank you so much. (Applause) Now I can take this off.
(Recess)
MS. VOSKANIAN-KORDI: All right. We're going to go ahead and get started, but before we start -- the next session is Database Development. There was a wallet that was left on a table outside by Daniel Guittierez. So, if that's you and you don't have your wallet, please come see us outside. All right. I'm going to turn the podium over to Dr. Mazumder with the Database Development session.
DR. MAZUMDER: So, thank you very much for inviting me -- oh, just one second. Thank you for inviting me to this session. It's a really great workshop, with lots of different comments and opinions. And this session is a little bit different. I mean, we'll not really directly talk about NGS, initially at least, as much.
And I just want to set the tone for this session. So, before NGS was there, there were many, many resources: model organism databases, RefSeq, UniProt, SGD, the Gene Ontology. Many of these reference datasets have been built over years, over decades. And a sequence by itself is completely meaningless. It doesn't have any meaning. You have to add annotation to it. And most of the time, annotation initially is added manually through biocuration. And then, once you have some gold standard annotation like RefSeq, Swiss-Prot or the model organism database type of annotation, you can create an automated process to take this annotation and add it to other data sources.
And then, when you have NGS data, you have mapping algorithms or other algorithms which will take this NGS data and map it to some reference -- and this reference has to be maintained and annotated by some database curators or database providers -- and then this can be used for biological knowledge generation.
So, if you look at that -- you know, this is just the National Human Genome Research Institute. They have many model organism databases that are supported. This is their URL. You know, I have some of the names here. Not all of the names are here. Then, there are funding mechanisms which support HIV databases, the Influenza Virus Resource, virus sequence databases and so on. So, these databases are quite important.
So, in terms of -- I lead the public HIVE, and many times, the question comes, you know, okay, so you have a mapping algorithm, it's really great, so I want to use it, and so on. But HIVE is not just developing new tools. We develop pipelines, workflows, and when we see a need for development of new tools or workflows, we do it. And there could be many reasons why you want to do it. Dr. Simonyan mentioned a few of them.
We work with exome, RNA-seq and DNA-seq (phonetic) data, textual and image data, ontology standards, (Inaudible) and DO cancer slims -- so, the Disease Ontology. There was a mention of ontology earlier on. Having an ontology -- you know, Dr. Warren Kibbe of CBIIT wrote this paper on the Disease Ontology. One of the things we were working on is this publication where we are doing (Inaudible) cancer analysis from ICGC and TCGA -- and I can tell you that even within ICGC, the same cancer type has a different name. Now, if a human looks at it for a few seconds, they will know it's exactly the same cancer type from two different countries. Whereas from a computational viewpoint, if you try to figure out what is what, it becomes a nightmare. So, we started a small project with a small group of people -- Lynn from the University of Maryland, COSMIC, and the Early Detection Research Network group funded by NCI, like 250 scientists.
So, we are trying to take care of this ontology, so that when I compare ICGC, IntOGen, COSMIC, I can at least map it to a particular cancer type, which is a hierarchy, and then I can propagate some information across it. We also try to collect data to map data. There are projects from UniProt, for example, ID mapping, which allows us to do things like that.
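[Editor's note: a minimal Python sketch of the mapping idea being described -- free-text cancer labels from different resources are normalized to one Disease Ontology term, and the term's hierarchy lets information propagate to broader terms. The synonym table and the tiny hierarchy slice are illustrative assumptions, not the project's actual curation.]

# Hand-curated synonym table: source label -> Disease Ontology ID (illustrative).
SYNONYMS = {
    "breast carcinoma": "DOID:1612",
    "breast cancer": "DOID:1612",
    "malignant neoplasm of breast": "DOID:1612",
}
# Tiny slice of the hierarchy: child term -> parent term.
PARENT = {"DOID:1612": "DOID:162"}  # breast cancer -> cancer

def normalize(label):
    return SYNONYMS.get(label.strip().lower())

def ancestors(doid):
    """Walk up the hierarchy so annotations can propagate to broader terms."""
    while doid in PARENT:
        doid = PARENT[doid]
        yield doid

term = normalize("Breast Carcinoma")
print(term, list(ancestors(term)))  # DOID:1612 ['DOID:162']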
But the key thing is that references and standards have to be done in collaboration. It cannot be done alone. So, this is a perfect place, I think, where we can also start talking about some of the references and standards that many of you are involved in, and bring them to the forefront, so that we can use them within the HIVE group. So, I tried to put together the session emphasis, but you know, I'll just go quickly through this, and there are other things that our speakers will talk about: focus on the need for development of curated databases; focus on validation and integration protocols; steps needed to produce viable and reliable resources that can facilitate collaboration and research.
So, for this session, it's my pleasure and honor to have three great speakers: Dr. Kim Pruitt, Dr. Mike Cherry and Dr. Rodney Brister. Dr. Pruitt is our first speaker. She is a senior staff scientist in the Eukaryotic Group at NCBI. She's the RefSeq unit chief, and she has a PhD from Cornell in genetics and development. She leads a great project called CCDS -- she's the NCBI lead for the CCDS project, which tries to standardize the protein-coding regions of human and mouse, an extremely useful project when you're trying to map and understand your results with a particular reference dataset.
She's also a founding member of the International Society for Biocuration. How many of you have heard of that name, the International Society for Biocuration? I talked a lot about biocuration, and I think you should take a look at it. Our next meeting is in Beijing, China, April 23rd, I think. And also, Database is the official journal of the ISB, and you can submit your paper there. It deals with database development and biocuration. Please remember that. And if any of you can make it to Beijing, hopefully, we will see you there. Without further ado, Kim?
(Break in recording - long pause)
DR. PRUITT: I can't manage the mike (Laughter). Thank you for that really nice introduction. So, switching gears, I'm going to talk about RefSeq, which is an NCBI product. Let's see. Next slide.
This is a project at NCBI to provide reference sequence standards at the level of the central dogma. So, we're providing reference sequence standards for genomes, for transcripts, for proteins, at a huge scale ranging from archaea to eukaryotes to viruses. There are numerous advantages in the RefSeq set. I'm going to tell you, in a couple of slides, a very 30,000-foot view of how we build the RefSeq set. But there are numerous advantages in using the RefSeq data as compared to the primary data that's submitted to GenBank.
(Discussion off the record)
DR. PRUITT: So, the advantages include consistency, because we control this data product. We're offering greater consistency in the formatting of these sequence records. There's greater transparency in the source of the data that goes into building the RefSeq dataset. And there's more annotation. We have several annotation pipelines -- annotation pipelines for prokaryotes and eukaryotes. We also generate annotation for viruses, in collaboration with experts in these organisms, but also through our robust annotation pipelines. And so many genomes that are submitted to GenBank in an unannotated form, you'll be able to find annotated in the RefSeq dataset.
Our data sources are primarily GenBank, and so we rely on continued submission of primary data to the archival databases. We do engage in collaborations with model organism databases or individual researchers who are experts in a particular protein family. As I said, we have annotation pipelines. And just a quick note on my terminology in this talk: RefSeqs that are a direct product of our annotation pipeline, I call model RefSeqs, and RefSeqs that are in the pool that is subject to curation, I call the known RefSeq dataset. So, remember: model and known.
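[Editor's note: one concrete handle on the model/known split is the accession prefix -- curated "known" records carry NM_/NR_/NP_ prefixes, while pipeline-predicted "model" records carry XM_/XR_/XP_. That prefix convention is standard RefSeq practice; the small sorting function below is just an illustration, and the second accession shown is made up.]

KNOWN = {"NM_": "curated mRNA", "NR_": "curated non-coding RNA", "NP_": "curated protein"}
MODEL = {"XM_": "predicted mRNA", "XR_": "predicted non-coding RNA", "XP_": "predicted protein"}

def refseq_status(accession):
    """Classify a RefSeq accession as 'known' (curated) or 'model' (predicted)."""
    prefix = accession[:3]
    if prefix in KNOWN:
        return "known", KNOWN[prefix]
    if prefix in MODEL:
        return "model", MODEL[prefix]
    return "other", "unrecognized prefix"

print(refseq_status("NM_000546.5"))     # ('known', 'curated mRNA')
print(refseq_status("XM_012345678.1"))  # ('model', 'predicted mRNA') -- made-up accession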
In terms of products and access, this is a series of sequence records. They're available in our web interfaces, in the Nucleotide and Protein databases. They're available in BLAST databases. They're available for FTP, and they're available through our programming utilities.
Okay, so I'd like to just take a step back. My talk is going to really focus on our support for the human genome. I could talk for two hours, probably, or more, on the breadth of RefSeq. But for the human genome, NCBI is providing curation support for several very important databases and resources. So, we're involved with the community that is maintaining the assembled human genome. This is the Genome Reference Consortium. And the curators in my group actually contribute tickets to the GRC that highlight areas of the genome sequence that we would like them to investigate. These are areas where we're questioning if they're representing a mutation versus the non-mutated version of a gene, or if they have faithfully represented the completeness of a gene. Some of these are questions of redundancy in the genome assembly.
So, we're an active contributor of tickets to the GRC, and then as those regions of the genome get fixed, we are often involved in reviewing the corrections to the genomic sequence, to see if the genome annotation will indeed be improved in that region of the genome.
NCBI is also involved in the Locus Reference Genomic project, and the prelude to that is our RefSeqGene product set. These are genomic records that we're putting out so that clinical testing communities have a stable reference for reporting their mutations in -- a reference that has been curated and vetted. This is in coordination with locus-specific databases and the clinical community, and these are sort of gene-region snapshots of the genomic sequence, annotated with the transcripts that are most used for reporting relevance in the clinical lab setting.
NCBI also has the ClinVar database, which was alluded to earlier. This is an archival database of clinical variants that have an asserted relevance -- that are asserted to have a pathogenic relevance. And so, we are archiving this information. Again, we're working with the clinical community to gather this information. We have curators who support the submission process so that data standards are met and the information can be clearly distributed back out to the community.
And as Raja mentioned, we're involved with the CCDS collaboration. This is an international collaboration with our partners at the Wellcome Trust Sanger Institute, the Ensembl resource, the HUGO Gene Nomenclature Committee, the Mouse Genome Informatics group and UCSC, to stabilize the protein-coding annotation on the human and mouse genomes for those proteins that are the most supported. So, there continue to be differences in Ensembl and UCSC and RefSeq for the more predicted, or model, kinds of coding region representations. But the most supported layer is subject to this collaboration, where we are working closely together and trying to effect any updates in a synchronized manner.
And then of course, there's RefSeq curation, where we are curating genes, transcripts and proteins. So, here's a quick example of a region where we questioned the original genome sequence in version GRCh37. This is a region of human chromosome 8; the SCXA and SCXB genes are highlighted in red boxes on the top level, and those genes were found to be really redundant. RefSeq questioned whether this was a valid gene duplication or a redundancy in the genomic sequence that was represented on chromosome 8. So, the GRC undertook some experimental validation, retiled through this region, and determined that there was a single gene in this location and not two. And so in the GRCh38 genome, which is the current public version, you'll find a single gene in this location.
And ClinVar -- I'm really not going to talk about ClinVar, but I do want to highlight it as an important resource that NCBI is engaged with, where we're working closely with a large variety of groups that are involved with monitoring and reporting clinically relevant variations, and aggregating that, so that that information can be archived and then distributed back out.
And the little panel on the right is just the top of a ClinVar page, where things are given stars; where you can see how many groups have reported this, if it's been curated by an expert panel, and so on. So, back to RefSeq. Basically, RefSeq can be thought of as a genome annotation database, and the foundation of our genomic sequence is GenBank. We are not reassembling genomes. We are not changing the genomic sequence in RefSeq versus what is in GenBank. We are simply making a copy of the submitted assembly and then subjecting that to annotation and curation. So, we have several annotation pipelines. We have a very robust eukaryotic annotation pipeline. The products of that are, of course, the annotated reference genome, transcripts and proteins, FTP and BLAST databases, and annotation reports.
In our curation mode, we're looking at data at the level of genes, transcripts and proteins. So, we are representing novel splice variants. We're doing extensive sequence alignment analysis. We're diving into the literature in order to represent the most common protein isoform and transcript variant -- the one that is represented in the literature and thought to be the functional unit -- and we want to make sure that we are representing that in the RefSeq database. And we're also adding publications. We're adding names. We're adding content to NCBI's Gene database, and so we're adding functional information in the process of curating our sequence records.
Some of the functional information that we're adding -- here is an example. This impacts both the structural annotation of the genome and the functional annotation. So in this case, this is the mouse Bag1 gene, and actually, the same situation occurs for the human ortholog. This is in inverse orientation -- this is a gene that's on the opposite strand, so the five prime end of the gene is to your right.
The canonical protein was annotated at the AUG site that I have highlighted in red on the bottom part of your panel, which is a zoomed-in view of the top panel where I have boxed the first exon. So, we have two RefSeqs represented -- two RefSeq transcripts and proteins -- for the mouse and human Bag1 gene. One of them starts at the canonical AUG start codon, and the other one starts at a CUG start codon, which has been experimentally described in both mouse and human. And so, this is an example of a type of annotation that automatic annotation pipelines would fail to provide, because they're looking for the canonical AUG start codons. They're not trying to annotate proteins from every CUG start codon that might occur in the genome -- that would introduce a lot of false noise. And so, here is a value-added layer that curation has introduced.
This is a highly simplified view of our annotation pipeline on the right part of my slide. It is significantly more complex than this. I want to make the point that we are aligning a huge amount of evidence data in generating our annotation. This is really an evidence-based product, our genome annotation. So, we are using the known RefSeq component as an input. What we curate, like the example I've just shown you, then becomes a reagent in developing the genome annotation product. We also are aligning cDNAs, ESTs, TSAs and RNA-seq data. This is something that we introduced in 2013, and the addition of RNA-seq data has allowed us to greatly expand the number of exons and introns that we're representing in our final genome annotation product set.
These are in the model RefSeq component, because these are predicted outputs of our interpretation of all of these alignments. And just to give you a snapshot of some of the numbers: on the left part of the slide, we have some 48,000 known RefSeq transcripts that we are curating. The vast majority of those have undergone some level of curation. And towards the bottom, we have some 52,000 model transcripts that have been added primarily by the integration of RNA-seq data. For human, we're using the Human BodyMap 2.0 dataset as our alignment pool.
7 alignment pool.
8 So, this is a significant number of
9 annotated transcripts and proteins available for
10 comparison in Next Gen alignment analysis. So,
11 some of the quick highlights of our annotation
12 pipeline speed -- we have a very fast pipeline.
13 It can turn around a whole eukaryotic genome in 2
14 to 10 days, and that is from when we first do our
15 data snapshot of RNA seq data, transcripts,
16 proteins, everything. We do a data snapshot at
17 the beginning. It's a turnkey pipeline. We can
18 have that product loaded, average 2 to 10 days
19 after we have started that, it is available
20 publicly in NCBI's nucleotide protein database,
21 and shortly thereafter on our FTP site.
22 So, it's a very robust pipeline. It has
268
1 very good speed. It is a quality pipeline. We
2 have put a huge amount of engineering work behind
3 it. We do regression testing. You know, any kind
4 of software change, we check. Does it have any
5 deleterious effects? Do we get the product that
6 we're expecting to get? So, we do all of the
7 things that we should be doing for annotating the
8 human genome and other eukaryotic genomes.
9 We can annotate multiple assemblies
10 simultaneously. For human, we annotate the CRCh38
11 assembly. There's a CHRM1 assembly and the HuRef
12 assembly. We can annotate multiple organisms
13 simultaneously. So, this is a very powerful
14 pipeline that we have put together. And of
15 course, one of the main advantages, from my sort
16 of slightly biased point of view is that it
17 integrates the curated RefSeq content. And so we
18 have a means, then, to directly affect in a
19 positive way the output of the annotation pipeline
20 over time, because we do re-annotate organisms
21 over time. Now, as we're re-annotating those
22 organisms, we can be affecting positive change,
269
1 correcting errors that have been identified by us
2 or reported by the community in the genome
3 annotation.
So far, we've annotated 153 organisms using our eukaryotic genome annotation pipeline. Eighty-eight of those organisms we've integrated RNA-seq data with; that's quite a lot. A hundred and sixteen of those organisms have some curated RefSeq records available. For some of these organisms, it might be really small -- it might be 10 RefSeqs, but these are 10 RefSeqs that we have corrected in the genome annotation product for the organism. Human and mouse we update at least yearly. We're trying to achieve a yearly update goal for many of our eukaryotes, and certainly a new assembly triggers a re-annotation. And if we become aware of a new dataset that would probably have a positive outcome on our annotation, then we'll rerun. Or, if we make a significant change to our methods that would have a positive outcome on our annotation, then we'll rerun.
One of the things that we've really put a lot of effort into is transparency and support evidence, both at the level of the gene and at the level of individual transcripts. So, on the left side, you're looking at a little display from an Entrez Gene record for the crystallin A gene. This is the human crystallin A gene, and on the top of that left panel, you can see the gene structure that we annotated on the genome. The top transcript-protein pair -- the blue and red lines there -- is a curated RefSeq. And then, the second set of transcripts and proteins is a model RefSeq that was predicted based on RNA-seq data. As you look further down, you can see a line that shows you that one of these proteins is in the CCDS project, and so it is tracked as a stable protein-coding annotation on the genome.
Now below that, you can see the clinical variants that are tracked in the ClinVar database. And then below that, you see three tracks that are an aggregate of our RNA-seq data. The top track is showing you an exon coverage graph. The track below that is sort of a mirror image of that: it's showing you the coverage for the intron-spanning reads, and so it's kind of a mirror image of where the exon coverage is. And then below that, the track with the black lines -- that is the intron features that we called from the RNA-seq data. So, the predicted model, the XM that's annotated up there, is supported. We see an intron that corresponds to the novel intron that's introduced in that model, and it came from RNA-seq data.
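[Editor's note: to show what "intron features called from RNA-seq data" means operationally -- spliced alignments carry an N operation in their CIGAR strings, and tallying those skipped intervals yields intron coordinates with read-support counts. Below is a minimal pysam sketch of that tally; it illustrates the general technique, not NCBI's pipeline code, and the file name and region are placeholders.]

import pysam
from collections import Counter

def intron_support(bam_path, contig, start, end):
    """Tally (start, end) intervals skipped by spliced reads: candidate introns
    and the number of reads supporting each."""
    introns = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig, start, end):
            pos = read.reference_start
            for op, length in read.cigartuples or []:
                if op == 3:                  # N = reference skip, i.e. an intron
                    introns[(pos, pos + length)] += 1
                if op in (0, 2, 3, 7, 8):    # M, D, N, =, X consume the reference
                    pos += length
    return introns

# Placeholder file and region; requires an indexed BAM.
for interval, n_reads in intron_support("rnaseq.bam", "chr8", 0, 1_000_000).items():
    print(interval, n_reads)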
So, here's a window. Here's a quick snapshot. Here's the evidence behind this annotation. Yes, I can see it's supported by transcript data. Of course, the problem with RNA-seq data is that you don't know if you've got long-range exon coverage from that data -- what we know is that the intron-exon pairs are supported. On the right side, I'm showing you, on a record-by-record basis, that we are also providing information per record about the evidence supporting that record. So on the known RefSeqs, we clearly indicate the data sources behind the construction of the record. We will tell you that we made this RefSeq record from accession numbers A, B and C -- and maybe we've used three accession numbers in order to avoid what we think are rare or erroneous mismatches in any of those accessions; in order to provide a record that is the best match to the genome, that we don't think is representing anything that is deleterious or a mutation. We also provide what's highlighted here --
(Break in recording)
DR. PRUITT: Aha. So, up here at the top, there's a section on the record that is reporting two levels of evidence. One: we have long-range support for the transcript exon combination, and we're arbitrarily giving you two accessions that support that long-range exon combination. So, you know, here's a three-exon RefSeq model, and exons one, two and three in combination are found in these two transcript records. We're also telling you where we have RNA-seq support for the exon pairs. So, we have RNA-seq support for exon one to exon two, and exon two to exon three. I mean, we're giving you, again, an arbitrary number of samples that support that. We're just saying two, because it can be a very long list, and it would make the record view kind of unwieldy. So, this is what you'll see on a known RefSeq record.
9 RefSeq record.
10 On a model RefSeq record, the
11 information is provided a little bit differently.
12 We're telling you that this is a product of the
13 NCBI annotation pipeline. It's sockular version
14 5.2. There's a link here to an annotation report
15 page where you can get detailed information --
16 I'll show you a quick glimpse of that on my next
17 slide, about the inputs and outputs of the
18 annotation pipeline.
19 And down below in the record, there are
20 comments that tell you supporting evidence for
21 this model includes similarity to MRNAs, 199 ESTs
22 and one hundred percent coverage of the annotated
274
1 genomic future by RNA seq alignments, including
2 five samples that support all of the etrons. So,
3 there's a wealth of information here about
4 evidence.
This extends to evidence at the level of the whole annotation run, so we're providing annotation report pages for all of our eukaryotic genomes. You can find the links to these reports on the sequence records, and they bring you to a report page that tells you what software version was used; exactly what version of the assembly was annotated; what the results of the annotation are; how many features of what type were produced from this annotation run; and huge amounts of information about the alignments. How many of the RefSeqs did align? How many transcripts did we start with, and how many of those did align, at what level of threshold quality? And for RNA-seq data, we'll tell you exactly what RNA-seq data samples we used -- again, huge detail about the alignment statistics. And this type of information is provided per assembly that was annotated. So for human, you can interrogate this kind of information for GRCh38, or if you're interested in comparing to HuRef, you can switch to that.
6 So, a quick warning. A lot of people
7 think that they can get RefSeq by going to USCS.
8 And what you get from USCS is their alignment of
9 the known RefSeq dataset. It is not the same
10 thing as NCBI's genome annotation product of
11 RefSeq. They do not align model records, and so
12 often, you will get an under-representation of
13 NCBI's view of the number of possible
14 transcripts and exons, and they -- actually,
15 because they don't align any of the model RefSeqs,
16 sometimes we may have a gene call that's not
17 reflected in the UCSC display.
18 So, user beware. If you're downloading
19 RefSeq genes from UCSC, you are downloading their
20 alignment of known RefSeq records. And so you
21 might have ambiguous placement of paralogs. You
22 might have slightly different exon calls for some
276
1 exons because of different alignment methods that
2 were used. You'll be missing splice variants.
3 You may be missing genes. And I'm really
4 almost done.
5 So, yes, you can get RefSeq data at
6 NCBI. We have a genomes FTP site that we just
7 recently regenerated that is now a comprehensive
8 report of all RefSeq genomes that are available in
9 NCBI's assembly database. And so, the scope of
10 our new genomes FTP site is archaea, bacteria and
11 eukaryotes. You can get all of the GenBank
12 genomes there, and you can get all of the RefSeq
13 genomes there by simply toggling your path here to
14 RefSeq or GenBank.
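A minimal sketch of that path-toggling idea follows, assuming the same directory layout under a refseq/ and a genbank/ root; the exact URL scheme here is inferred from the talk rather than quoted from NCBI documentation.

    # Sketch of "toggle your path": the same layout is assumed under
    # both a refseq/ and a genbank/ root, so switching data sources is
    # just a path substitution. The layout is an assumption, not a
    # guaranteed NCBI contract.

    from urllib.parse import urljoin

    BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/"

    def genome_dir(source, group):
        """Build a directory URL for a source ('refseq' or 'genbank')
        and an organism group (e.g. 'bacteria', 'archaea')."""
        if source not in ("refseq", "genbank"):
            raise ValueError("source must be 'refseq' or 'genbank'")
        return urljoin(BASE, f"{source}/{group}/")

    print(genome_dir("refseq", "bacteria"))
    print(genome_dir("genbank", "bacteria"))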
15 We have metadata files that help you
16 traverse the FTP directory to find exactly what
17 you're looking for. I encourage you to look at
18 the read me files that are in this directory, the
19 recent NCBI news announcement, and we also have
20 put out an FAQ about this directory. One of the
21 things that we have done with this new genomes FTP
22 site is to modify our FASTA titles.
277
1 We did this so that the RefSeq that you
2 download from NCBI can be used more readily in
3 some of the big RNA seq aligner programs. We
4 provide GFF format on our FTP site, and we provide
5 a FASTA, and of course, we provide the NCBI
6 standard sort of text view document style, as
7 well.
8 So, NCBI traditionally has used a
9 complex FASTA format that had an awful lot of
10 information in there that people then had to parse
11 to get whichever bit they wanted to track. In
12 our new style, we're providing simply the
13 accession dot version, and then of course, the
14 record description. This simple accession dot
15 version is the seq ID that's used in the GFF file,
16 in the FASTA files. And so it makes it much more
17 interoperable with some of the RNA seq alignment
18 data.
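The snippet below sketches why the simplified titles matter: the FASTA sequence ID and the GFF seqid are both just accession dot version, so the two files can be joined with no extra parsing. The header and GFF line are illustrative examples, using a real GRCh38 chromosome 1 accession purely for flavor.

    # Sketch of the simplified FASTA title: the ID is accession.version,
    # which is the same seqid used in column one of the GFF, so the two
    # files join trivially. Example lines are illustrative.

    fasta_header = ">NC_000001.11 Homo sapiens chromosome 1, GRCh38 Primary Assembly"
    gff_line = "NC_000001.11\tRefSeq\tregion\t1\t248956422\t.\t+\t.\tID=id0"

    def fasta_seqid(header):
        """Extract accession.version from a '>accession.version description' title."""
        return header[1:].split(None, 1)[0]

    def gff_seqid(line):
        """Column one of a GFF3 line is the seqid."""
        return line.split("\t", 1)[0]

    assert fasta_seqid(fasta_header) == gff_seqid(gff_line)
    print(fasta_seqid(fasta_header))  # NC_000001.11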
19 So, we do listen. We do really want to
20 make this a product set that people can use in
21 their pipelines. And if people have suggestions
22 and requests, we do want to hear about those. So,
278
1 RefSeq -- why is this a useful tool for comparison
2 to NGS data? I mean, as Raja really said, because
3 it's annotated. It gives you the context of your
4 alignment. It's an annotation dataset that NCBI
5 is committed to supporting.
6 We integrate the curated information so
7 we have a means to provide corrected information,
8 as we become aware that these corrections are
9 needed. So, we are really very interested in
10 community feedback to help us maintain it. This is a
11 community resource and we want community
12 involvement to help us maintain this. We only
13 have so many staff at NCBI. And we're handling a
14 large number of genomes, and so you know, we're
15 really welcoming community input.
16 It's evidence based. There's
17 transparency in our evidence, in our pipeline, in
18 the reagents that we're using. There's
19 transparency in the support evidence in terms of
20 the specific annotated transcripts. So, we're
21 really trying to build a layer of transparency to
22 this product so people can find that they can use
279
1 it with confidence. Of course, the integration of
2 curated information and the connection between
3 RefSeq and NCBI's gene database gives you a
4 powerful layer of connecting the sequence to the
5 knowledge, because in NCBI's gene database, we're
6 integrating a huge amount of information.
7 We've got GO, you know, gene ontology
8 data integrated. We're integrating publication
9 information. So, there's a lot of information in
10 gene that relates to the functional aspect of this
11 gene, and the sequence is then directly
12 connected to that. So you know,
13 basically, NCBI is committed to supporting the
14 human genome at several levels; at the level of
15 maintaining the sequence -- the human genome
16 sequence, the assembly, at the level of archiving
17 the pathogenic and clinically relevant variation
18 data and at the level of the annotation.
19 One of the things that we're looking
20 forward to doing in the next couple of years is to
21 start adding functional annotation about
22 regulatory regions. So, promoters, silencers,
280
1 enhancers -- to start adding annotation of those
2 regulatory regions that have been studied and
3 experimentally confirmed. Of course, there's a
4 large number -- there's a huge amount of data
5 that's coming out of the high throughput
6 genomics projects, but there's a wealth of
7 literature, and we would like to connect that to
8 the genome sequence, also. So, these are future
9 plans.
10 And of course, everything that we produce
11 at NCBI for the RefSeq project is really available
12 through our web site, FTPs and programming APIs.
13 And because it's at NCBI, we connect RefSeq to a
14 wealth of other resources, and so it really
15 facilitates navigation to more information, once
16 you're in NCBI's domain.
17 And that's it. And I would like to
18 thank, you know, the people that do all the work.
19 You know, the GRC, ClinVar, the eukaryotic genome
20 annotation pipeline. These people are also
21 producing the FTP site and the assembly resource,
22 and of course, my group, who curate the eukaryotic
281
1 RefSeq transcript and protein set. Thank you. (Applause)
2 MS. VOSKANIAN-KORDI: If there are any
3 questions at this point, please --
4 SPEAKER: Is it possible to have two
5 RefSeq genomic records for the same organism in
6 the same leaf level taxonomy node?
7 DR. PRUITT: So, are you talking about
8 prokaryotic (Inaudible)?
9 SPEAKER: Yes. That's what I'm asking.
10 Prokaryotic.
11 DR. PRUITT: Yeah. So RefSeq undertook
12 providing genome annotation for all submitted
13 prokaryotic genomes, and they may be individual
14 isolates of the same strain. So, there is now a
15 lot of redundancy in the genome representation in
16 RefSeq for prokaryotes. We are also, though,
17 selecting representative genomes that would be --
18 So, if you don't want to deal with the
19 however many thousands of Salmonella enterica
20 genomes that we have in RefSeq, we're selecting
21 those that, in our opinion, are the reference
22 standards or good representatives of different
282
1 strains. And we have metadata files that clearly
2 identify the reference and representative set.
3 And also, in our new genomes FTP site, we have a
4 directory level way to navigate to just that
5 subset of the data.
6 (Discussion off the record)
7 SPEAKER: Hi. I very much appreciate
8 the new FTP site. It's much more convenient to
9 navigate than the old one.
10 DR. PRUITT: Oh, that's great to hear.
11 SPEAKER: But there's one thing missing.
12 The viruses are still apart. Is there a plan to
13 integrate viruses into that same system?
14 DR. PRUITT: So, the viruses -- very
15 good observation. So, the viruses have not
16 traditionally been included on the assembly
17 resource, and we are -- there are some
18 conversations ongoing to try to come up with a
19 good model for inclusion of the viral genomes in
20 that resource.
21 There is still a viral genomes FTP site
22 where data can be downloaded. The viral genome
283
1 project, which Rodney is going to tell you more
2 about, is providing both reference sequence data
3 and also, some value added data. So, that's on
4 top of that. And so, in terms of the new genomes
5 FTP site, moving forward, our goal is to try to
6 include the reference sequence data in the new
7 structure. But this add-on data will still need
8 to be represented.
9 There will probably still be a duality
10 in terms of viral genome representation, because
11 we want to provide this additional dataset, which
12 really doesn't fit currently in the model of the
13 assembly database.
14 SPEAKER: Well, it's sort of like what
15 you have now for the non-viral, where you have the
16 representative.
17 DR. PRUITT: Mm-hmm.
18 SPEAKER: And then, it's almost like the
19 genome neighbors, that you're borrowing --
20 DR. PRUITT: Right.
21 SPEAKER: -- from the viral (Inaudible).
22 DR. PRUITT: Right, right. And so, the
284
1 genome neighbors don't really fit in the --
2 SPEAKER: Well, it's --
3 DR. PRUITT: Right now, they don't.
4 SPEAKER: -- you mentioned like
5 salmonella. You have one salmonella
6 representative, and then the other ones are sort
7 of like genome neighbors.
8 DR. PRUITT: They're sort of like genome
9 neighbors, but we've actually instantiated them as
10 RefSeq genomes.
11 SPEAKER: Oh, okay. Right.
12 DR. PRUITT: And in the viral world, the
13 genome neighbors are not all instantiated as a
14 RefSeq genome. They're still GenBank.
15 MS. VOSKANIAN-KORDI: We're actually
16 going to go ahead and close Dr. Pruitt's section
17 right now. If we have time at the end, we'll open
18 it up to further discussion, as well.
19 DR. MAZUMDER: Our next speaker is Dr.
20 Mike Cherry. He started the model organism
21 database SGD, the Saccharomyces Genome Database. He
22 also started gene ontology. And then, he took
285
1 over the ENCODE DCC. He grew up in -- there are
2 some interesting little things that I just
3 learned. He grew up in Indiana. His parents were
4 molecular biologists, biochemists and farmers.
5 And he is friends with Warren Gish --
6 you know, the author of BLAST -- and Mike Corell,
7 with whom he first started programming in C, or at
8 least did a lot of that at that time. Mike
9 Corell worked on Unix 4.2 BSD and repurposed the
10 C code. And he's going to talk more about what he
11 is working on now, and also talk about lessons
12 learned from the more (Inaudible) databases.
13 DR. CHERRY: Okay. So, I have the same
14 problem as Warren getting close to the mic. But
15 thank you guys for that nice introduction. So,
16 I'm not going to tell you about any programming
17 that I've done in the last 10 years. My guys
18 won't let me do it anymore. They do keep one
19 script that I wrote on the server, just so that I
20 can say I'm involved in the project. So, it's
21 very nice of them (Laughter). So, I'm Mike
22 Cherry. I'm at the Department of Genetics at
286
1 Stanford, and as Raja said, I've started several
2 genome databases, but this Saccharomyces
3 cerevisiae database is the longest running. I'll
4 talk to you a little bit about that and what we do
5 with curation. I won't go into too much detail
6 about our web pages and such, because I'll get
7 probably too carried away, but I will go a little
8 further into our work with the gene
9 ontology, as well as get into a little bit about
10 the data coordination center we're running for
11 ENCODE. And that's where the metadata comes in.
12 (Discussion off the record)
13 DR. CHERRY: Great. So, in model
14 organism databases, I sort of think of these guys,
15 the Saccharomyces cerevisiae, the brewing baker's
16 budding yeast; Drosophila melanogaster, the fruit
17 fly, although it doesn't eat fruit. It likes
18 yeast. And Caenorhabditis elegans, which doesn't like
19 yeast but eats bacteria.
20 So, these are really great models
21 because of their genetics and genomics, and the
22 long history that their community has brought.
287
1 They all really got started because of the genetics
2 -- Saccharomyces a little bit because of the
3 biochemistry. Nobody cares if you grind up pounds
4 and pounds of yeast. It smells good, too.
5 But they're currently used so much
6 because of the genetics and the genomics. They
7 have a rich future -- they do have
8 a rich future, but they also have a rich community
9 that's involved with them and that maintains this
10 sort of active work. People are studying yeast
11 not just to study RNA polymerases per se, but
12 they're looking at RNA polymerases within the cell
13 itself. So, they are really trying to take apart
14 the system, even though they may be working on one
15 protein?
16 There's also these very powerful
17 resources, molecular resources that are available
18 in these communities. In yeast, for example, there
19 is a knockout of every single gene in the genome
20 that's available for the non-essential genes. And the
21 essential genes have had a promoter added
22 upstream, so you could actually dial down the
288
1 protein, as well. And these knockout collections
2 are easy to change and manipulate, because yeast
3 really loves to do a homologous recombination.
4 So, as a result of that, you can
5 actually take the knockout of a gene in yeast
6 and put in the human gene, a human cDNA, and there's
7 hundreds of cases where the human gene complements
8 the knockout in yeast. And so then, you can
9 study the human gene in yeast where you can grind
10 up tons and have a lot of fun. You know, and this
11 really shows you the power of evolution, to be
12 able to go to a simple organism and study a
13 gene from a more complex organism.
14 So, I'll talk a little bit about the
15 Saccharomyces Genome Database, because there's a
16 nice URL. So, what do we do? I have this
17 problem. I keep wanting to walk away. We curate
18 information. Okay? So, I've got about six
19 curators, really experienced PhD curators who love
20 to read papers. Okay? Many of you in the room
21 are probably like that, as well, and I've got a
22 job for you. These six people read the paper.
289
1 They start with the results, maybe the methods.
2 We try not to read the introduction and
3 the discussion first, because we don't want to be
4 confused by what the author says they did. We
5 want to know what they actually did. And so
6 they're reading the paper, abstracting out of that
7 the details; oftentimes hidden in figure legends,
8 tables and such. And then, they integrate it into
9 our database using controlled vocabularies and ontologies.
10 Of course in some cases, we have to use the free
11 text, but as much as we can, we really try to put
12 it in this very structured way.
13 Generally, you know, as you imagine when
14 you're reading papers, the data is very
15 unstructured -- even if it's high throughput, it's
16 very unstructured. You know, people don't use
17 standards in the community. They just sort of
18 make it up themselves. And so, a lot of times, we
19 spend a considerable amount of effort wrangling
20 the data. So that is to take the data that they
21 have given -- I mean, you know, it's a nightmare
22 sometimes.
290
1 They provide you a table as a PDF file
2 or as an image. You know, here's the TIFF image
3 of my table. So, we have to do a lot of work to
4 convert that. Grabbing the metadata is very
5 difficult. Typically, it's very sparse in the
6 paper, and we try to work with the authors to get
7 more of that. The hard part though, sometimes, is
8 that we'll do all this work to correct -- get
9 their data wrangled nicely, and then they say, oh,
10 you can submit that to GEO. You know? I haven't
11 done that yet, because I didn't take the time.
12 And you know, and so we're sort of left with the
13 case of who's really going to do it.
14 More and more, we're making sure that we
15 don't manipulate their data until they submit it
16 to GEO, just as a way of having a little stick.
17 We're experimenting with something we're sort of
18 calling the wall of fame, or you know, give people
19 badges. So, somebody that actually submits their
20 data, does a really great job, you know, we want
21 to sort of put their name in lights somewhere.
22 You know, it may be a graduate student, but that
291
1 graduate student is going to say, hey, look at me.
2 You know, I'm on the home page of (Inaudible).
3 And hopefully, that will encourage others to do
4 that.
5 You know, and really, though,
6 unfortunately, it means -- the problem is, the PIs
7 have to start enforcing this, and they're really
8 the problem in many cases. Okay. So, you sort of
9 get the gist of this already, but the role of a
10 genomic resource, genome database is really to
11 take the data that's been published, okay, as a
12 result of experiments and computation. We grab
13 all that data out, put it in our database in a
14 nice, integrated way, and then we can have the
15 nice Muppet scientist read that information, do
16 discovery on that, propose new experiments,
17 integrate their own data into that, and hopefully
18 publish that, and then the cycle continues.
19 So, this is sort of our whole purpose.
20 My lab really doesn't do research per se. Our job
21 is to take data and make it available to people;
22 to really promote research, our publications are
292
1 about building databases and such. Okay. The
2 types of information that we have interconnected
3 within our database include the molecular
4 function, sort of what the protein does, the
5 biological process, sort of why it does it -- it's
6 part of secretion. The cellular component, where it
7 happens within the cell; the major complexes that
8 are there, sequence homology, mutant phenotypes we
9 spend a lot of effort on.
10 Yeast has lots of mutant phenotypes
11 reported from many strains. I forgot to mention
12 earlier that the yeast genome is in RefSeq, and
13 it's been there for quite a long time. We've
14 actually sort of hand curated, I would guess,
15 almost every nucleotide over the 20 years, because
16 people have banged on the genome so hard that
17 they'll actually tell us when there is an error,
18 because it's been re-sequenced so many times in
19 the community. So, that's been very useful.
20 We also have a lot of other strain
21 genomes, so we can do comparative analysis to fix
22 the main reference.
293
1 Genetic and protein interactions -- lots
2 and lots of information there, a genetic
3 interaction being where you have a single mutation
4 that doesn't have a phenotype on its own. You
5 combine these two mutations together in the same
6 cell, and the cell is dead or it grows very
7 poorly. So, it's a genetic interaction. Protein
8 interactions of course, are too touching. All
9 kinds of expression data, exploring more about the
10 protein domains and the sequence level of our 3D
11 structure. And we're getting much more into
12 pathways, protein complexes and the regulation of
13 all of this stuff together.
14 One thing I wanted to touch on briefly,
15 just because it was mentioned before -- one
16 problem we have, and Kim and them have done a
17 great job of putting together the transcripts with
18 -- even though RNA seq and the expression arrays
19 before that were created and used, and so there's
20 a lot of data there, and it's really easy for
21 people to go forward, they'll oftentimes, after
22 they create RNA seq -- of course, the next thing
294
1 they want to do is do it in humans.
2 And they sometimes forget to go back and
3 fill out everything they used. So, we don't have
4 the transcripts as tightly defined as we would
5 like, but we have a wealth of information. And
6 this is one case where we have to start piecing
7 things together. So, we have all of the
8 transcript levels. We have things like ribosome
9 profiling, where you look at the RNAs that the
10 ribosomes are bound to and try to pull out not
11 just the RNAs that exist at some point within the
12 cell's life cycle, but that the ribosome is
13 actually bound to. And this is the (Inaudible)
14 thing that then we have to do to make things a
15 little bit better.
16 Okay. So, an example of one of these
17 networks -- and this is really actually a very
18 simple one, genetic interactions. So, here is a
19 sort of famous network diagram within yeast where
20 they've actually taken the genetic interactions,
21 so the bright spots are proteins. And what you
22 probably can't see is there are lines connecting
295
1 all of these proteins together, and those are
2 describing the interactions themselves.
3 But they've annotated the proteins with
4 the biological processes, so that the pathway is
5 the larger sort of reason for that function within
6 the cell. And they've added some sort of
7 attractor to the proteins of a similar function
8 and clustered them on the cell like this. And
9 that's basically showing you that in the genetic
10 interactions, proteins with a similar process are
11 found close together.
12 But you can imagine not only taking the
13 synthetic lethal information -- you can actually
14 combine any number of networks. And so this gets
15 into the statistics that I can't actually
16 understand, but it's taking many different types
17 of information, looking at the genotypes, the
18 metabolomics, expression arrays, going after
19 protein-protein interactions -- you know, any
20 number of networks together and combining them to
21 create Bayesian networks.
22 But of course, what you don't see here
296
1 is that high quality annotations are underneath
2 these. So for each gene, we have high
3 quality connections into these networks themselves,
4 as well as the annotations about the functions in very
5 specific ways.
8 So, this is, I think one of the newer
9 ways that the yeast research is going, really
10 looking at the systems biology and systems
11 genomics, I think is what somebody called it
12 there. Gene ontology consortium was something we
13 created in 1998. The PIs currently are Judy Blake
14 at Jackson Lab in Maine, Paul Strindberg at
15 CalTech, Paul Thomas at UCS, Suzanna Lewis at
16 Berkeley and myself at Stanford, so it's sort of a
17 California thing, these days.
18 But it's really an international
19 consortium. You may not realize that the gene
20 ontology consortium, the folks that make gene
21 ontology, that do a lot of the annotations and
22 distribute the annotations that we create is only
297
1 about 40 people. Okay? And that's -- actually,
2 not all of them are even funded by the
3 consortium's grant.
4 Just real briefly, the gene ontology, if
5 you haven't seen it before, it's really a
6 hierarchical description of function and process
7 within the cell. And we do it in this way so that
8 you can actually annotate the different levels of
9 the ontology. So, I don't know if I should mess
10 with the pointer, but the ball indicates -- the
11 size of the ball indicates the number of proteins
12 that have been annotated to that term or to one of
13 its children below.
14 So, at the very bottom, you see the
15 balls are small. Then we have like cell cycle
16 checkpoint right there. Regulation of cell cycle
17 progress is right here. So, a lot of proteins are
18 annotated low, but some are annotated here,
19 because the evidence only allows it to be
20 annotated there. We don't see evidence within the
21 paper that allows it to be annotated lower.
22 Sometimes, it's like a mutant phenotype. The cell
298
1 doesn't live. We think it's the cell cycle
2 checkpoint. That would be annotated high.
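A toy sketch of that "ball size" idea follows, propagating each annotation from a term up through its ancestors so that higher-level terms accumulate larger counts; the three-term hierarchy and gene names are invented for illustration.

    # Toy sketch of annotation propagation in a GO-style hierarchy: a
    # gene product annotated to a term also counts for every ancestor
    # term, so high-level terms accumulate larger counts (the bigger
    # "balls"). The mini ontology and annotations are invented.

    parents = {
        "cell cycle checkpoint": ["regulation of cell cycle"],
        "regulation of cell cycle": ["biological process"],
        "biological process": [],
    }

    annotations = {
        "GENE1": "cell cycle checkpoint",      # specific evidence
        "GENE2": "regulation of cell cycle",   # evidence only supports mid-level
        "GENE3": "cell cycle checkpoint",
    }

    def ancestors(term):
        """Return the term plus all of its ancestors up the hierarchy."""
        seen = set()
        stack = [term]
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.add(t)
                stack.extend(parents.get(t, []))
        return seen

    counts = {}
    for gene, term in annotations.items():
        for t in ancestors(term):
            counts[t] = counts.get(t, 0) + 1

    print(counts)
    # {'cell cycle checkpoint': 2, 'regulation of cell cycle': 3,
    #  'biological process': 3}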
3 So, the gene ontology allows the
4 proteins to be annotated. And we do all of this
5 annotation in an evidence based way using
6 ontologies. It's called the Evidence Code
7 ontology, ECO. But what the curator does and why
8 it takes some training is the curator doesn't
9 annotate to just this phrase, nucleotide
10 metabolism. What they're actually doing is
11 they're annotating to a description of that term.
12 And that's particularly important because
13 apoptosis -- different people in the room may have
14 three or four different views of how you would
15 actually define that. But for the ontology, we've
16 defined it in a very specific way, and that's what
17 people are annotating to.
18 Just a little bit of statistics about
19 all of this work that's happened in 15 years, 18
20 years. So, the ontologies themselves are quite
21 large. There's more things about process, because
22 there's aspects to the cell that we can sort of
299
1 see and we know what's happening. We don't always
2 understand the functions that the proteins are
3 doing. We know the protein's there and it's
4 important, but we don't know exactly what it's doing.
5 And then, cellular components have to do with the
6 regions in the cell, the complexes in the cell. So
7 lots there.
8 I mean, it's really astounding that the
9 total number of annotations here is 370 million,
10 but you have to realize that that's not people
11 work, the majority of that. That's actually
12 computers crunching on these things. The UniProt
13 group at EBI in Hinxton, Cambridgeshire -- they
14 run a pipeline on all of the
15 proteins within UniProt. And this is using
16 protein domains, sequence similarity, very sort of
17 sophisticated rules that they've built over 10
18 years to actually annotate these proteins from all
19 organisms that they can find.
20 This manually curated set is
21 still quite big, but it's taken a number of years
22 to be created, for 65 different genomes
300
1 -- and so for example, for yeast, we have 80,000
2 annotations that are available that are hand
3 curated by reading papers, describing the
4 evidence. The ontologies also allow us to
5 connect the term, the annotation with small
6 molecules that are involved in the measurement of
7 that function.
8 This graph is just to show you how great
9 yeast is. So here, the red bars are showing you
10 the percentage of the gene products within the
11 cell that have at least two of the three domains
12 of the ontology annotated with an experimental
13 term. Okay? And so it's about 50 percent of
14 yeast. The others -- so human is about 10
15 percent, fly is about 10 percent. The line is
16 actually showing you the number of gene products
17 that have been annotated in this way. I know it's
18 difficult to read and all that. But this is not
19 annotating the number of genes, it's annotating
20 the isoforms in that gene. So it would be
21 annotating all of the records within UniProt.
22 Okay.
301
1 And then hopefully, I'll have time to go
2 through a little bit of what we're doing with
3 ENCODE. So the ENCODE project has been around for
4 about nine years. My group joined ENCODE two
5 years ago as the data coordination center. So,
6 our job is to take data from all of the groups.
7 They don't get credit for submitting data until
8 they give it to us, and they have to give it to us
9 in a way that we say it applies -- that it's met
10 the standards both in the file formats and quality
11 as well as the metadata, as well.
12 Okay. So, just real briefly, on the
13 ENCODE, there's a variety of methods that are used
14 that are actually mapping features of the genome
15 across the whole genome. So, there's a number of
16 -- they're all seq methods these days. So, RNA
17 seq is measuring where transcripts occur. There
18 is a lot of ChIP-seq methods going on to help you
19 identify where proteins bind, where modifications
20 happen to the chromosomes or to the histones,
21 methylation, acetylation and such, accessibility of
22 the DNA through the chromatin where the
302
1 DNA can be cut.
2 And so, all of this information is put
3 together, and the whole purpose is to create a
4 really gold -- a highly structured, high quality
5 gold standard set of information from these
6 datasets with the hope that they'll be around for
7 -- and useful for about four or five years, if not
8 longer. And so this really requires that the
9 standards are quite high, both on the quality of
10 the experiment that's done and on the metadata
11 that's available so you can actually find the
12 information that's available later.
13 A real simple view of how we do this; we
14 require experimental metadata. We have to have
15 the primary data submitted in proper forms. You
16 have to have the right number of replicates for
17 the particular experiment, and of course, the
18 biosample is very critical. We run that through
19 analysis pipelines, and this is -- so the analysis
20 pipelines, as have been talked about many times
21 today, not all of the software is really great.
22 How do you use one package or another package?
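A minimal sketch of that gatekeeping, checking a submission for required metadata, an accepted file format, and enough replicates for its assay; the field names, formats, and thresholds below are invented for illustration, not the DCC's actual schema.

    # Sketch of submission validation: a dataset does not "count" until
    # it carries the required metadata, an accepted format, and enough
    # replicates for its assay. All names and thresholds are invented.

    REQUIRED_FIELDS = {"lab", "assay", "biosample", "file_format"}
    ACCEPTED_FORMATS = {"fastq", "bam"}
    MIN_REPLICATES = {"ChIP-seq": 2, "RNA-seq": 2}

    def validate_submission(sub):
        """Return a list of problems; an empty list means the submission passes."""
        problems = []
        missing = REQUIRED_FIELDS - sub.keys()
        if missing:
            problems.append(f"missing metadata: {sorted(missing)}")
        if sub.get("file_format") not in ACCEPTED_FORMATS:
            problems.append(f"unaccepted format: {sub.get('file_format')}")
        needed = MIN_REPLICATES.get(sub.get("assay"), 1)
        if len(sub.get("replicates", [])) < needed:
            problems.append(f"need at least {needed} replicates")
        return problems

    sub = {"lab": "example-lab", "assay": "ChIP-seq",
           "biosample": "K562", "file_format": "fastq",
           "replicates": ["rep1"]}
    print(validate_submission(sub))  # ['need at least 2 replicates']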
303
1 What's needed here for ENCODE, though,
2 is that we want to have a unified pipeline for all
3 of the data of a particular data type. And that's
4 really particularly tough, because it's biological
5 labs that are doing the assays themselves, and of
5 course, they have their own post docs that say they
6 know the right way to analyze the data.
8 So, it's roughly taken two years to get
9 the various data providers to reach some form of
10 consensus that they agree is appropriate. Okay?
11 They don't want to be good enough. They want to
12 be right. But of course, they would never
13 actually finish, because they know they're not
14 right yet. And they'll keep saying, well in
15 another year, I'll tell you. In a year, I'll tell
16 you. And I've got to finish this paper.
17 But we have actually gotten them to
18 write down protocols, and we're implementing those
19 protocols in the cloud. So, it's sort of like
20 they're the R&D part and we're the production
21 part. We have to take their code and make it work
22 in a way that we can actually run it thousands of
304
1 times, hopefully automatically without a lot of
2 hand holding. And we're doing that in the cloud
3 using a DNAnexus environment. So, we have access
4 right in there.
5 We're writing -- we're building the
6 pipelines at DNAnexus, which allows us to share
7 the pipelines very easily. You can get command
8 line access in there. But also, if you have a
9 small number of experiments you want to analyze,
10 you can use their web site.
11 So, this has been a particularly
12 difficult one, getting these pipelines
13 together. But once we have all that -- once we
14 know we have the right number of replicates and
15 such, we fire it all through. Of course, you need
16 reference annotations to get there. And the
17 analysis pipeline results are made available,
18 because those files go back into the data
19 warehouse, which is at Amazon, and we have to have
20 the metadata for every step along the way here.
21 So, the principles of our metadata for
22 ENCODE are, of course, reproducibility. We want
305
1 you to be able to track the analysis that has
2 happened. We want to be able to communicate the
3 key assumptions and purposes of this particular
4 step within the pipelines, and of course, easily
5 accessible -- provide easy accessibility to the
6 quality and metrics that are there.
7 Transparency. We want you to be able to
8 redo the pipeline in the future. We want you to
9 be able to run the pipeline on your data and to
10 associate it cleanly with the standard data that
11 have been created. Of course, where the files
12 have been, which software they have been run on,
13 which files were used to create another analysis
14 file -- all of this is really critical to allow
15 this to really be a resource. A lot of the lab
16 groups are not used to this sort of level of
17 reproducibility requirements and sharing, and it
18 is a little bit of a cultural education that we're
19 confronted with.
20 Just as an example of the metadata
21 we capture for the software steps -- so we have
22 metadata about the file. We have metadata about
306
1 the software. And we freely, openly share all of
2 the software via GitHub, and then we have the
3 analysis steps as well that integrate some of the
4 software as part of the greater analysis steps.
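A small sketch of what such file, software, and analysis-step provenance might look like, using invented classes and identifiers rather than the actual ENCODE metadata model.

    # Sketch of provenance metadata: which software version produced
    # which output files from which input files. Classes and values
    # are illustrative placeholders.

    from dataclasses import dataclass, field

    @dataclass
    class Software:
        name: str
        version: str
        repo: str  # e.g. a GitHub URL where the code is shared

    @dataclass
    class AnalysisStep:
        software: Software
        inputs: list = field(default_factory=list)   # input file accessions
        outputs: list = field(default_factory=list)  # output file accessions

    aligner = Software("example-aligner", "1.2.0",
                       "https://github.com/example/aligner")  # hypothetical
    step = AnalysisStep(aligner, inputs=["FILE001"], outputs=["FILE002"])

    def provenance(step):
        """Human-readable trace: outputs <- software@version <- inputs."""
        s = step.software
        return f"{step.outputs} <- {s.name}@{s.version} <- {step.inputs}"

    print(provenance(step))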
5 So, we want to track all of this
6 information so that you can actually query on it.
7 You can find out more information about how a
8 particular pipeline has changed over time. So, in
9 the last two slides -- so I haven't shown you any
10 web sites, and that's what half my group does is
11 make web sites. So, I had to show you one.
12 This is our latest web site. It's the
13 ENCODE portal, and it's been a lot of fun to
14 create. We've used a lot of different
15 technologies than our SGD web site, which is
16 running on eight-year-old technology. But in this case, we
17 used the ontologies -- because the ontologies are
18 used in the metadata, we can create this faceted
19 searching. So, in this case, I've clicked on
20 DNase-seq, and it's told us that there's a
21 certain number of experiments that fit into
22 various categories.
307
1 I know you can't read them there, but
2 the gray bar tells you the relative number of
3 experiments that are being observed or found.
4 Just blowing that up, you can see -- so of those
5 DNase-seq experiments that were available, there
6 were a certain number here that are from primary
7 cells, from tissues, and then we actually --
8 because we use ontologies for -- Uberon ontology
9 for the anatomy and we use a cell line ontology in
10 connecting that in with Uberon, so we know where
11 the tissues -- sorry, where the cells are from.
12 You can actually drill down and say I only wanted
13 to see experiments for DNase-seq on the eye.
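A toy sketch of that ontology-backed faceted search, filtering experiment records and tallying facet counts; the records and field names are invented for illustration.

    # Sketch of faceted search: each experiment carries ontology-derived
    # fields (assay, sample type, anatomical part), and facet counts are
    # just tallies over the filtered set. Records are invented.

    from collections import Counter

    experiments = [
        {"assay": "DNase-seq", "sample_type": "primary cell", "organ": "eye"},
        {"assay": "DNase-seq", "sample_type": "tissue", "organ": "eye"},
        {"assay": "DNase-seq", "sample_type": "tissue", "organ": "liver"},
        {"assay": "RNA-seq", "sample_type": "tissue", "organ": "eye"},
    ]

    def facet(records, **filters):
        """Filter records, then count remaining values for each facet field."""
        hits = [r for r in records
                if all(r.get(k) == v for k, v in filters.items())]
        counts = {f: Counter(r[f] for r in hits)
                  for f in ("sample_type", "organ")}
        return hits, counts

    hits, counts = facet(experiments, assay="DNase-seq", organ="eye")
    print(len(hits), counts["sample_type"])
    # 2 Counter({'primary cell': 1, 'tissue': 1})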
14 Okay. So, I don't know if I was too
15 fast or too slow. I want to acknowledge the
16 funding that we've had for this. It's all funded
17 by the National Human Genome Research Institute, and it's been a really good,
18 fruitful collaboration with them and with the
19 other model organism databases over the years.
20 Thank you.
21 (Applause)
22 MS. VOSKANIAN-KORDI: Are there any
308
1 questions from the audience?
2 SPEAKER: -- project, how much of the
3 data is shared with the other (Inaudible)?
4 DR. CHERRY: So, we're now at -- ENCODE
5 has been around for about nine years. We're
6 currently at ENCODE three, the third phase of it,
7 and we're two years into that third phase. All of
8 the data from ENCODE II was made available via the
9 EBI, and it's still there from sort of an FTP
10 site. ENCODE II in the U.S. is available from
11 Santa Cruz, and then it's also available now from
12 our new site at Amazon, and you can get at it via
13 the metadata searches and things. The NCBI -- I
14 don't know that we've put the data on NCBI, other
15 than in GEO and SRA.
16 MS. VOSKANIAN-KORDI: Other questions?
17 Yep. There's one over here.
18 SPEAKER: Hey, Mike. That's a great
19 talk. You know, I think it would be interesting
20 to compare and contrast how hard it's been to get
21 the community around the MODs to agree to
22 something like gene ontology, where we've had a
309
1 wonderful collaboration and ability to agree on a
2 set of standards, and how hard that's been in many
3 other spaces.
4 So from your view, what really made gene
5 ontology click and the agreements happen, and how
6 could we replicate that more, for instance, into
7 the genomic space into -- well, the clinical
8 genomic space?
9 DR. CHERRY: (Laughter) That's a good
10 question. So over the years, I've been
11 interviewed many times by ontologists that try to
12 say, how did you get this to work? And our goal,
13 at least at the start, was not to build an ontology
14 because it was ontologically correct. We needed
15 something that worked. And so, we built it sort
16 of in a minimalistic way that helped us get data
17 out.
18 And so, I think that was a critical
19 component, is that people could see there's data
20 there. It's annotated in this way. Oh, maybe I
21 can understand how it's used, but there's still
22 been -- you know, over the years, a lot of little
310
1 spats that have gone on about how to do it and how
2 not to do it. But it's really been great over the
3 past, I don't know, eight years or so when
4 everybody -- basically, any genome paper that
5 comes out or any RNA seq paper comes out, they
6 always list you know, some sort of enrichment
7 analysis using gene ontology.
8 And so, I think it's one of these things
9 that it's just ubiquitous, and you know, that was
10 the whole intention. Right? Nobody really knows
11 where gene ontology comes from. It's just there.
12 And it's created by these curators who are just
13 doing it to -- not to get credit for doing it, but to
14 actually help research go forward. And I should
15 say that Warren was involved in the Dictyostelium
16 database, so he's a mod guy, too. (Applause)
17 DR. MAZUMDER: Hello. Our next --
18 thanks a lot, Mike. Actually, you know, I think
19 90 percent of the curators or curation that you
20 see today, gene name, protein name, all of this
21 has been done by people who can fit in this room,
22 or even smaller than this room. I could make an
311
1 assertion like that. So it's quite amazing,
2 actually.
3 Anyway, our next speaker is Dr. Rodney
4 Brister. He's a group leader of Viral Genome
5 Group at NCBI. He's also the chair of the Virus
6 Genome Annotation Working Group, and he also is
7 involved with the international group, Virus
8 Genome Data Standards. And I know that there are
9 several people in this auditorium who are working
10 on viruses and are interested in this talk. Thank
11 you, Rodney.
12 DR. BRISTER: Sorry for being the last
13 talk (Laughter). We're obviously running a little
14 late. So, Kim went over RefSeq in sort of a
15 general way, and I'm going to focus more on the
16 viral aspects of RefSeq and the work that my group
17 does, the Viral Genome Group at NCBI.
18 So, the world is literally coated in
19 viruses. There are an estimated 10 to the 30th
20 viruses on the planet right here today, and
21 understandably, we're interested in them because
22 they kill a lot of people. Annually, several
312
1 million deaths are attributed to viruses, and this
2 has brought a lot of attention in the sequencing
3 and genomics world onto viruses, and a lot of
4 money has been put forth to sequence a number of
5 human and plant pathogens.
6 And so, people engaged in sequencing
7 have created somewhat of a sequence explosion,
8 which has resulted in about two million viral
9 sequences being deposited in INSDC databases or
10 commonly, GenBank. This sounds like a really
11 great thing. The problem is, sequences are just
12 A's, G's, C's and T's until someone comes along and
13 transforms those into sequence data.
14 And in the context of NGS sequencing,
15 transformation of the data means assembling the
16 raw sequences into consensus sequences and
17 annotating those consensus sequences into
18 biologically relevant data, like genes and
19 proteins. Of course, that's where the humans come
20 into play as well as the computers. And the
21 Reference Genome Group -- the Viral Reference
22 Genome Group is involved in trying to make this
313
1 whole process easier.
2 As part of the RefSeq project, we create
3 a non redundant, well annotated set of reference
4 genomes, which we distribute publicly for the
5 world to use. And before going further, I just
6 want to say we're a very small group and rely to a
7 great extent on collaborations with various
8 communities, public databases and individual
9 scientists, some of whom are here today.
10 Our goal is really to provide the
11 reference infrastructure for the identification,
12 assembly and annotation of viral sequence data.
13 So, our data model is pretty simple, or at least
14 it used to be. And that is, we would create one
15 reference genome for each viral species. Taxonomy
16 is central to this model. We rely on the taxonomy
17 from the International Committee for the Taxonomy
18 of Viruses. They set standards for family, genus
19 and species level taxonomy, which we bring into
20 NCBI and sort of use as our template for what we
21 create in NCBI.
22 Over the past decade or so, ICTV has
314
1 approved in a steady fashion, quite a number of
2 viruses -- viral species, excuse me. And we
3 validate the taxonomy for each viral reference
4 sequence that we create based on the criteria that
5 the various study sections within ICTV set forth.
6 Now, one of the problems of our data model is --
7 see here in orange, the rate at which novel
8 viruses are discovered is nearly exponential and
9 doesn't really match well with the linear sort of
10 taxonomy efforts at the ICTV.
11 So, we have to react to these novel
12 viruses, because we want to represent them in our
13 sequence space. So we create RefSeqs from these
14 novel viruses, which means we spend a lot of time
15 doing taxonomy on these novel viruses. And we've
16 put them in special bins that are called
17 unclassified bins that you can see in the NCBI
18 taxonomy pages, and we try to classify things to
19 the family or genus level. Sometimes, we're
20 actually able to classify them a little bit below
21 that.
22 Only a few viral -- another problem for
315
1 our data model is that unlike other model
2 organisms, which you've heard about in earlier
3 talks, only a few genomes have been experimentally
4 defined. And so, we have a data model that has to
5 incorporate very well annotated genomes; genomes
6 that have well defined experimental data with
7 genomes for the records that have been annotated
8 mostly by the computational transfer of annotation
9 from a well annotated genome to a less well -- or
10 a well defined genome to a less defined genome.
11 And finally, genomes that have been
12 annotated simply by ab-initio techniques for which
13 very little or no experimental data is known. And
14 so, we have developed multiple classes for RefSeqs
15 to help you discriminate between these types of
16 genomes. So, we have a reviewed class that
17 generally represents a very well annotated genome
18 that comes from mostly experimental data, and we
19 have our provisional class that generally
20 represents something that has mostly been
21 annotated through ab-initio techniques.
22 Now, once we create a RefSeq genome,
316
1 every other genome is then included as a neighbor
2 to that RefSeq genome. Every other genome for
3 that species, excuse me, is included as a genome
4 neighbor to that RefSeq genome. And over the past
5 decade or so, there's been a huge increase in
6 focused sequencing efforts around human, and to a
7 lesser extent, agricultural pathogens. So, you
8 can see in some cases, and this table does not
9 include influenza -- in some cases, you literally
10 have thousands of genomes for a particular
11 species.
12 Now, taxonomy and genome links were
13 validated for all of the genome neighbors. So, we
14 add value to things that are not RefSeqs, and we
15 try again, to assign them with the right taxonomic
16 identifier and try to validate whether or not this
17 is a full length genome as expected by other genomes
18 in that particular taxonomic grouping or not.
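A small sketch of that neighbor validation step, linking each non-RefSeq genome to its species and checking its length against what is expected for the group; the species name, length range, and accessions below are invented for illustration.

    # Sketch of genome neighbor validation: each non-RefSeq genome is
    # tied to its species and checked against an expected full-length
    # range for that taxonomic group. Values are illustrative.

    expected_length = {"Enterovirus A": (7000, 7600)}  # assumed range

    neighbors = [
        {"accession": "XX000001.1", "species": "Enterovirus A", "length": 7410},
        {"accession": "XX000002.1", "species": "Enterovirus A", "length": 2100},
    ]

    def validate_neighbor(n):
        """Flag a neighbor as full length or partial for its species."""
        lo, hi = expected_length[n["species"]]
        n["full_length"] = lo <= n["length"] <= hi
        return n

    for n in neighbors:
        validate_neighbor(n)
        print(n["accession"],
              "full length" if n["full_length"] else "partial")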
19 Now, over the past decade, the
20 sequencing for viruses has just skyrocketed, and
21 the number of validated RefSeq genome records is
22 nearly 80,000 at this point. In blue there, you
317
1 can see the number of RefSeqs, which is basically
2 linear. So, a lot of the new sequencing we're
3 seeing, despite a lot of novel viruses
4 being sequenced, is re-sequencing of new isolates
5 of already defined species.
6 Now, this has increased the extant
7 sequence space for viruses a great deal, and this
8 extant sequence space is growing with both novel
9 viruses, but to a greater extent, with variants of
10 already discovered viruses. And some of this
11 space is captured fairly well by our current
12 RefSeqs, but other parts of the spaces, not at
13 all. And this represents new genotypes, new
14 strains and other variants.
15 So in an attempt to better capture the
16 sequence space so that when you come into our
17 RefSeq database you can find a hit to your
18 sequence, we're expanding our RefSeqs to break the
19 model of one RefSeq per species to include
20 multiple RefSeqs per certain species, and we're
21 relying more on the sequence analysis to define
22 the sequence space, define holes in it, in terms
318
1 of our RefSeq representation. We're also relying
2 on the communities at large to tell us about their
3 important genotypes, their important strains, so
4 that we can capture all of this diversity within
5 this RefSeq model.
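One way to sketch that hole-finding step: flag isolates whose best identity to any existing RefSeq falls below a cutoff, marking diversity not yet captured. The identity values and threshold here are made up; a real pipeline would compute identities with an aligner such as BLAST.

    # Sketch of finding "holes" in RefSeq sequence-space coverage: an
    # isolate whose best identity to any RefSeq is below a threshold is
    # a candidate for an additional RefSeq. Values are invented.

    refseq_identity = {            # best percent identity to any RefSeq
        "isolate_1": 99.1,         # well covered by an existing RefSeq
        "isolate_2": 78.4,         # a hole: new genotype candidate
        "isolate_3": 85.0,
    }

    THRESHOLD = 90.0               # assumed cutoff for "captured"

    candidates = [iso for iso, ident in refseq_identity.items()
                  if ident < THRESHOLD]
    print(candidates)  # ['isolate_2', 'isolate_3']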
6 And again, the idea is that you can come
7 into the RefSeq database and identify a sequence
8 right away. So, it may not be enough to know that
9 you have an enterovirus, but you need to know that
10 there's a particular type of enterovirus, and that
11 allows you to know something about that virus
12 clinically right away.
13 Now, most people think of viruses as
14 things that pop into cells and blow them up with a
15 bunch of (Inaudible). But for a lot of viruses,
16 they get into the cell and they hang out and they
17 take up residency of the cell, sometimes as an
18 episone, sometimes as an integrate. And this
19 sequence space has really been missed by the
20 RefSeq model of years past.
21 And so, we're trying to reclaim this
22 sequence space by using extant viruses to identify
319
1 viruses that are integrated as part of host
2 genomes; those genomes having been submitted to
3 the INSDC database, and create RefSeqs that have
4 some context to them, i.e., this RefSeq was made from
5 a provirus, to allow people to find these guys
6 in terms of when they do a search against RefSeq,
7 but also, to give people, users some context
8 awareness about these sequences.
9 Another project we're doing is trying to
10 give every RefSeq species a host type. And that's
11 another manually curated operation where we're
12 going through both old and newly submitted viruses
13 and identifying the host -- or group of
14 hosts -- that they infect and assigning that to
15 the actual taxonomy associated with that RefSeq.
16 So, that actually helps the whole database,
17 because then all other viruses of that taxonomy
18 node also get this property. So, you can come
19 into our infrastructure and identify viruses by
20 this host type.
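A toy sketch of that inheritance, walking up an invented mini-taxonomy until a curated host type is found, so every virus under a curated node picks up the property.

    # Sketch of host-type propagation: curate a host type once at a
    # taxonomy node, and every virus under that node inherits it by
    # walking up the tree. The mini taxonomy is invented.

    parent = {
        "Enterovirus C": "Enterovirus",
        "Enterovirus": "Picornaviridae",
    }
    host_type = {"Enterovirus": "vertebrates"}  # curated once, at the genus

    def host_for(taxon):
        """Walk up the taxonomy until a curated host type is found."""
        node = taxon
        while node is not None:
            if node in host_type:
                return host_type[node]
            node = parent.get(node)
        return "unknown"

    print(host_for("Enterovirus C"))   # vertebrates (inherited from genus)
    print(host_for("Picornaviridae"))  # unknown (no curation above it)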
21 The distribution of host type is
22 actually kind of interesting. Humans only make up
320
1 a small number of the hosts in our databases.
2 Actually, our BLAST analytics tell us that
3 recently, most of the hits in BLAST to viruses
4 come against bacteria (Inaudible). And we're
5 seeing an uptick in other agriculturally important
6 viruses that infect plants and other vertebrates,
7 as well.
8 Now, our goal is to let you know what
9 you have. So, we want to create a reference
10 infrastructure that allows you to identify what
11 you have. And our products consist of our RefSeq
12 sequences, the taxonomy data associated with those
13 RefSeq sequences, the host type metadata
14 associated with those RefSeq sequences, and then
15 the genome neighbors associated with those RefSeq
16 sequences.
17 And so the idea is, you can come and get
18 a hit to our RefSeq, know what taxon it belongs
19 to, know what kind of host it infects. And then,
20 if you want to learn more about the variation
21 associated with that species, then be able to
22 delve down into all of the neighbor sequences and
321
1 see the variants on the genome level associated
2 with that particular RefSeq. And of course, our
3 model is very much dependent on feedback from the
4 community. So, we take what you guys bring us and
5 we integrate it back into the database and
6 hopefully, together, we all improve the data.
7 So, the next question is, where is all
8 of this going? And one example of this is
9 something that we call virus variation, and it
10 tries to deal with this problem that we have
11 millions of GenBank records that really don't have
12 trustworthy annotation. They may not be
13 taxonomically filed away correctly. They may not
14 have genes or proteins annotated at all.
15 They may be annotated incorrectly. So, the Virus
16 Variation Project attempts to bring into this
17 space specialized databases, user interfaces,
18 reference driven annotation tools, metadata
19 parsing, mapping to standardized vocabularies to
20 make all this data much more accessible, so to
21 give it some place in the world as not quite a
22 reference, but something that's been analyzed
322
1 computationally.
2 And the goal is to take things from
3 GenBank, Biosample, SRA, and pull them to our
4 pipeline and create a standardized sequence
5 annotation for all of these records in a context
6 that allows you to analyze variants and metadata
7 associated with those sequences, and bringing
8 about some more user friendly interfaces while
9 doing it, because I realize that some of the
10 Entrez tools can be difficult.
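A minimal sketch of that kind of metadata parsing and mapping to a standardized vocabulary, with an invented synonym table; unmapped values fall through to manual curation.

    # Sketch of mapping free-text metadata (e.g. host fields from
    # submitted records) to controlled terms, as in the Virus Variation
    # pipeline. The synonym table is invented for illustration.

    HOST_VOCAB = {
        "homo sapiens": "Homo sapiens",
        "human": "Homo sapiens",
        "h. sapiens": "Homo sapiens",
        "gallus gallus": "Gallus gallus",
        "chicken": "Gallus gallus",
    }

    def normalize_host(raw):
        """Map a free-text host value to a controlled term, if known."""
        key = raw.strip().lower()
        return HOST_VOCAB.get(key)  # None means it needs manual curation

    for raw in ["Human", " homo sapiens ", "chikcen"]:
        print(repr(raw), "->", normalize_host(raw))
    # 'Human' -> Homo sapiens; ' homo sapiens ' -> Homo sapiens;
    # 'chikcen' -> None (typo left for a curator)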
11 And so there are a lot of people that
12 contribute to this. Again, we have a very small
13 group, so we rely on the kindness of strangers in
14 some cases, but also, the kindness of the people
15 within NCBI to help us out on various projects.
16 And here they all are. Thank you.
17 (Applause) On time, too
18 (Laughter).
19 MS. VOSKANIAN-KORDI: Sorry. Before we
20 ask questions, if anyone has their posters
21 outside, they're trying to remove the bulletin
22 boards, so please go ahead and remove your posters
323
1 before five, which is in five minutes. So, thank
2 you.
3 SPEAKER: A lot of the RNA viruses that
4 were originally sequenced no longer exist anymore
5 or can't be found in nature. So, how do you take
6 into account -- would you go through the neighbors
7 as -- annotation of the neighbors?
8 DR. BRISTER: Well, that's a good
9 question. And Arifa Khan and some other people
10 who are part of the Advanced Virus Detection Group
11 -- endogenous viruses, extinct viruses, things
12 "missing from the databases" -- that's really the
13 kind of stuff that we have to rely on the
14 community at large to tell us about.
15 We don't know it's a problem until
16 someone tells us about it. And then, once we
17 recognize there's an issue there, then we go and
18 we try to solve it. I know ICTV struggled with
19 this issue that they had many viruses that were
20 characterized long ago based on phenotype. And
21 they could filter material. They knew it was a
22 virus. They may have even had an EM of it -- no
324
1 sequence.
2 And I think at this point, they're sort
3 of giving up on those guys. We're a sequence
4 database, so we need representation within a
5 sequence. Sometimes those exist someplace.
6 Sometimes they don't. If we have a sequence, we
7 can start from there, and we can start making some
8 representation of that.
9 DR. MAZUMDER: Can I add to that list
10 which you just named? The recombinants and some
11 subtype mosaics -- how do they relate to your
12 neighbor concepts?
13 DR. BRISTER: Okay, so this has come up
14 with HIV, and it's kind of an interesting topic.
15 We'll add to this constructs and non natural
16 constructs. Right? This came up with some of the
17 constructs that were made to study HIV in the
18 laboratory where simian and human viruses were
19 sort of fused together.
20 From our standpoint, there's really two
21 levels to this. One is this representation of the
22 sequence base, and the second is the context. So,
325
1 in a perfect world -- we're not there yet -- we
2 would like to have a context in the database
3 -- we would like to be able to store things as,
4 you know, laboratory strains, or store things as
5 these are not natural variants, so when you get
6 this hit, be aware of this.
7 Right now, we're doing it with host type
8 in terms of environmental samples. We want to
9 represent the sequence space associated with
10 environmental samples. There's an easy way to
11 mark these as environmental samples. We're doing
12 it that way through the host type property. We've
13 discussed doing the same thing for laboratory
14 strains.
15 In virus variation, we've solved the
16 problem by doing that. It's not clear how we're
17 going to do that within the referenced context.
18 Some of the strains are neighbors now. The real
19 question is, should we make them into RefSeqs.
20 And we haven't yet, but we are having an ongoing
21 discussion about that.
22 DR. MAZUMDER: Very important, because I
326
1 know you were (Inaudible) samples. We had folks
2 from HIV come in (Inaudible) --
3 (Simultaneous discussion)
4 SPEAKER: I just wanted to continue just
5 with that. There are a lot of natural
6 recombinations, particularly like in the
7 enterovirus families.
8 DR. MAZUMDER: Right.
9 SPEAKER: And so those occur all the
10 time. As a matter of fact, in some cases, it's
11 only the capsid portion that defines the virus as
12 being a particular genotype.
13 DR. BRISTER: Yeah. And so, the same
14 thing happens in the bacteriophage. So, we had a
15 project that's focused on bacteriophages, where you
16 have cassettes that sort of move around between viruses.
17 And so, we're a protein database and a nucleotide
18 database. So, in order to capture the protein
19 space associated with a group of bacteriophages,
20 we've actually had to make several different
21 RefSeqs that together, represent the protein space
22 of that group.
327
1 And I suspect we're going to go back and
2 do this for many other viruses. I mean, as Kim
3 kind of alluded to, we're breaking a lot of rules.
4 And so (Laughter), you know, the thing is when
5 you're a small community and you require a lot of
6 interactions with other scientists, you start
7 hearing from them. You go, this is a stupid rule. We
8 need to get rid of this. And so, we're taking
9 test cases to do that, and then we're going back
10 to some of the other stuff and going, okay, we
11 should do that here, too.
12 And it's really about bringing people
13 into the discussion, contacting me -- search my
14 name on Google. I'm in this -- I usually use J.
15 Rodney. If you do J. Rodney Brister, you'll find
16 me. Give me an email. We'll solve the problem.
17 I can't guarantee you we'll solve it tomorrow.
18 We'll start the discussion.
19 The way we like to approach these things
20 is to create a group, maybe five or six people who
21 can have a discussion about this as a community,
22 bring the ICTV in it. When necessary, bring
328
1 sequencing centers into it, when necessary, and
2 really sort of work this out together. And we've
3 been successful in some cases, and we're looking
4 to scale that up.
5 SPEAKER: Yes. I'm wondering how the
6 database accounts for important viral genotypes
7 that are resistant to anti-virals. Would they be
8 considered a neighbor to a RefSeq, or are they
9 accounted for in the metadata? How do you address
10 those types of things?
11 DR. BRISTER: Okay. So this is another
12 great thing that we've kind of solved in the virus
13 variation construct where we can mark things with
14 clinically relevant metadata. That really doesn't
15 fit into the RefSeq model. The RefSeq model is
16 not designed for viruses in many ways, shapes and
17 forms.
18 How we deal with that down the road, I
19 don't know. One way is to bring in new index
20 terms into Entrez -- we haven't really had that
21 discussion. In my mind, I'd rather get away from
22 the constraints that a model somebody built for
329
1 something else places upon us and move into a
2 space where we take advantage of a model that we
3 built to solve this particular problem.
4 So, I think with clinically relevant
5 mutations, we like to get those viruses themselves
6 into virus variation, and we've then -- we've
7 expanded virus variation just over the last couple
8 of months, actually adding MERS and Ebola virus to
9 it. We have plans for Norovirus, Rotavirus.
10 Again, we're open to suggestions.
11 So you know, bring these ideas to me.
12 We have a small group, but if you know, we get a
13 bunch of people asking us for these resources, you
14 guys start calling the Congressmen and start
15 calling my director and saying, we need more viral
16 resources, then maybe we can get it done.
17 DR. MAZUMDER: If there are no more
18 questions, thanks a lot. It was a great, great
19 talk. (Applause) Thank you very much, Rodney.
20 MS. VOSKANIAN-KORDI: At this point,
21 we're going to go ahead and close the first day of
22 our conference. We'll start again tomorrow at
330
1 8:30. Thank you all for joining, and again, if
2 you have a poster up, please go ahead and remove
3 it. Thank you so much. Bye.
4
5 (Whereupon, the PROCEEDINGS were
6 adjourned.)
7 * * * * *
331
1 CERTIFICATE OF NOTARY PUBLIC
2 STATE OF MARYLAND
3 I, Mark Mahoney, notary public in and for
4 the State of Maryland, do hereby certify that the
5 foregoing PROCEEDING was duly recorded and
6 thereafter reduced to print under my direction;
7 that the witnesses were sworn to tell the truth
8 under penalty of perjury; that said transcript is a
9 true record of the testimony given by witnesses;
10 that I am neither counsel for, related to, nor
11 employed by any of the parties to the action in
12 which this proceeding was called; and, furthermore,
13 that I am not a relative or employee of any
14 attorney or counsel employed by the parties hereto,
15 nor financially or otherwise interested in the
16 outcome of this action.
17
18 (Signature and Seal on File)
19 ------
20 Notary Public, in and for the State of Maryland
21 My Commission Expires: November 1, 2014
22