
United States of America

Food and Drug Administration

2014 Next Generation Standards Conference

Bethesda, Maryland

Wednesday, September 24, 2014

A g e n d a

Introductory Remarks:

RICHARD G. H. COTTON, PhD
Founding Patron and Scientific Director
Human Variome Project

CAROLYN A. WILSON, PhD
Associate Director for Science, CBER
Food and Drug Administration

VAHAN SIMONYAN, PhD
Lead Scientist, HIVE, CBER
Food and Drug Administration

Next Generation Sequencing Standards:

VAHAN SIMONYAN, PhD, Session Chair

WEIDA TONG, PhD
Director, Division of Bioinformatics and Biostatistics
Food and Drug Administration

AMNON SHABO, PhD
Chair, Translational Health Informatics
European Federation of Medical Informatics

EUGENE YASCHENKO
Chief, Molecular Software Section, NCBI
National Institutes of Health

Big Data Administration and Computational Infrastructure:

EUGENE YASCHENKO, Session Chair

VAHAN SIMONYAN, PhD
Lead Scientist, HIVE, CBER
Food and Drug Administration

PARTICIPANTS (cont'd):

WARREN KIBBE, PhD
Director, CBIIT
National Cancer Institute

TOBY BLOOM, PhD
Deputy Scientific Director, Informatics
New York Genome Center

Database Development:

RAJA MAZUMDER, PhD, Session Chair

KIM PRUITT, PhD
RefSeq Project Lead, National Library of Medicine, NCBI, National Institutes of Health

MIKE CHERRY, PhD
Department of Genetics
Stanford University

RODNEY BRISTER, PhD
Staff Scientist, Virus Genome Group, NCBI
National Institutes of Health

* * * * *

P r o c e e d i n g s

DR. COTTON: I'd like to thank the organizers for letting me join this project and this conference, particularly Vahan and Alin. Unfortunately, due to the time zone, I'm not going to be able to stay for the full conference, but this is one of the reasons why more recently we've changed our administration, so we now have two scientific directors in the U.S.: Mike Watson, who is the CEO of the American College of Medical Genetics, and Garry Cutting, who is the well-known curator of the Cystic Fibrosis Database; and Sir John Burn joining in the EU, in Europe, and in Australia one of the leaders of the InSiGHT database. They spoke to the FDA in Silver Spring in 2010 on this topic, and some in the audience may have been at that meeting.

The Human Variome Project is a coherent group, which has been really working together for 20 years to try and obtain the best data for clinical decision making. That's pretty critical when you're talking about life or death information. And Sir John Burn has said that not sharing data kills. In other words, if some data is not available on the other side of the world, someone could die or be aborted, et cetera. So that's quite dramatic, and that's one of the reasons why the clinicians clearly need the proper data.

The Human Variome Project first started with looking at research data and sharing that. But now, of course, everything's been modified because we are now talking about next generation sequencing. Nevertheless, the curation of single mutations and single genes is much the same; it's just the actual way of getting the data that is different, and the way it's shared. And so now, of course, we're focusing on clinical grade results to share, and I'd like to thank John-Paul Plazzer, who's here, the InSiGHT curator who, in fact, is helping me with this.

So in summary, the Human Variome Project is an international nongovernment organization with over 1,000 individual members from 80 countries. It is working to integrate the free and open collection, curation, interpretation, and sharing of all clinically validated genetic and genomic information in inherited disease. It's focused on supporting diagnostic work and diagnostic labs and building the world's genetics and genomics service delivery capacity. And, in fact, we've done quite a lot of work in the developing world already, and with increased funding in the future and with the help of WHO and UNESCO, we hope to accelerate that. In other words, we're harmonizing national and regional efforts around regulatory frameworks of governance.

We are an NGO of UNESCO, and we're working with WHO on a global genomic policy, and we're likely to become an NGO of WHO in the future. And we're particularly interested in those two bodies because we hope that that will help particularly in the developing world.

So clinical genetic testing -- if one knows about that -- has pre-analytical, analytical, and interpretation phases, of course. We may be at the forefront of the $1,000 genome, but it can be the $10,000 interpretation, and that's really tough for clinicians, especially when not all the data is available from around the world to make the best decision for their patients.

I think the need for expert curation is highlighted in this slide. Unfortunately, it's going to be a while till it's overtaken by informatics, but one day hopefully it will. And I think when I first saw this data, I was quite shocked. There was a summary of published results to decide whether a particular mutation in one gene was, in fact, pathogenic or not. And you can see that five labs used eight tests, and, in fact, they all differed in their actual results. And in the end, the InSiGHT group -- the model for the Human Variome Project and how to deal with information in a particular gene -- had a teleconference, and they did this with John-Paul Plazzer pulling it together. They had a telephone hookup to discuss their homework on this particular gene, and they interpreted that variant as pathogenic. And they did, in fact, incorporate, because they put this on their Website; so if they've been proven to be wrong, they're protected by incorporation. So this is quite serious, and if you're talking about individual health care, this is quite a challenge, in fact.

So another thing that's being worked on is the actual terminology. There's a number of different ways of saying "large bone," and Ada Hamosh from OMIM and others have been working on trying to get the ontology down to about 1,000 words, which can be taken from a checklist so that the actual data can be more uniform.

So how does the HVP work? It facilitates the collection, curation, interpretation, and sharing of clinically useful genetic variation from all countries into internationally existing gene and disease databases. Now, it doesn't do all this itself; it's just a community trying to tell the community, or work out how the community does something, in much the same way the FDA is doing in this project or this meeting.

So, obviously, we involve all relevant experts and bodies, build on existing work to avoid duplication, and we standardize approaches and methods. And it's very important to indicate that this really is a very inclusive project, and we do try to talk with anybody who comes along who is, in fact, doing something in the area, and assist them and have them assist the world. So all this is being integrated into routine clinical care so that it is sustainable.

So this is the architecture, and all of this can be seen on the Website. In the end we've agreed that all data has to go into NCBI/EBI for safekeeping and for comparison with other data. At the top you can see Channel 1. These are gene/disease-specific databases, which are curated by experts daily. It's tough trying to get money to do this properly. In the case of our model genes in inherited colon cancer, we've been fortunate to retain charitable funding to support John-Paul Plazzer. And, in fact, clearly the Cystic Fibrosis group has been very lucky to get a large amount of money put together for their database.

On the lower channel we have country nodes, where the countries need to collect data for their own purposes, and this ultimately will go into the gene- or disease-specific databases for curation and will go to the multiple central databases. Some of you might have heard of ClinGen. ClinGen is also following this model, whereby the gene/disease-specific databases are curated and then the data goes on to ClinVar, which you might have heard of.

So this shows you the countries. The orange blobs are the countries who have agreed to actually start working towards collecting data as a node and then sharing it worldwide. And the others -- the blue -- are the countries we are still talking with. At the moment there are about 20 countries that have agreed.

So what do the nodes collect? The reason that they are collecting the data is, in fact, in a paper of several years ago in "Genetics in Medicine," if anyone wants to see that. So each country node collects data from the laboratory on all genes tested within the country, together with the scientific data. And some nodes also collect research data. Australia is developing an LOVD 3 database for the collection of next generation sequencing data, and this will be integrated into other nodes.

I think I will dwell a little bit on the LOVD 3 software, which I will come back to, I think. This software was developed by the community, specifically in Leiden, to curate single genes and the mutations in them. Now, the software has developed so well that you can put a whole genome into that database and it then writes it into the 22,000 gene-specific databases. It's important to say that this is open source software and, therefore, it can be used in whatever way various users might wish. And it could even be used for viruses, et cetera, and food products -- whatever is being talked about in this conference. And then, of course, the relevant data is shared, as I said in the diagram.

So we've got a couple of lead nodes. What's happened in Malaysia is quite dramatic because we have a very active leader. He was able to get to government and be quite well-funded, and data collection is beginning. And even further, they are actually leading the Southeast Asian region. Obviously, these things are very slow, and I think probably everyone in the room knows that it's very, very slow to get regulatory things through, collect data, and get everyone to actually agree to put the data in. In Australia we got two successive informatics grants funded and we are collecting data, but we did not yet get funding for a proper curator for the whole country. But we do have most of the big labs in Australia agreeing to submit data to the Australian database.

Now, I've mentioned InSiGHT before. This is a lead database for HVP as a model and, in fact, it deals with inherited colon cancer. The organization was formed in 2003 with the merger of two groups, and its aim is to improve the care of individuals and families with inherited colon cancer. Their meetings are multi-disciplinary, ranging from basic science right through to clinicians, genetics, et cetera. And their first step, once they decided to make a difference in their databases, was to actually reconcile three independent databases: one of which was from the literature, another was from labs, and the other was from in vitro testing. And it's hosted on the LOVD database in Leiden.

If any of you are not familiar with the LOVD databases in Leiden, as I said there are 22,000 locus- or gene-specific databases. Some of them do not have any data in them. Some of them do not have curators. But, of course, as soon as a whole genome sequence is put in there, those will be populated. And, of course, as I said earlier, there's expert review of variants of unknown significance in the InSiGHT database, which has, in effect, been an assist. And, of course, CFTR, whilst not initially a part of the Human Variome Project, is in effect another lead database, and it doesn't use the software that was used by the LOVD system.

What about PharmGKB, for the pharmacogenomics aspects of the FDA's activities? It, in effect, has a focus on pharmacogenomics and was actually formed in parallel with the Human Variome Project activities. And you might well know about that, and obviously Russell is a key person in that area.

Now, regarding standards and recommendations, which are a key part of this conference: consortium members have developed these starting in the 1990s. Because initially, when this group started in the early-to-mid 1990s, people were collecting data on their genes of interest for their clinics on accounting software, et cetera. And so this is where we started developing the LOVD software, and, of course, from then on members actually developed various recommendations. And, in fact, people can see the Human Mutation virtual issue, where they've actually pulled together all of those recommendations and standards, and they are also listed on the Human Variome Project Website. And Mauno Vihinen, one of their most active consortium members, is writing a paper on that at the moment.

Probably the standard that is most widely used is the variant nomenclature, and that's, in fact, led by Johan den Dunnen, who also leads the LOVD software work. Also, the software's content model is widely used, as in the LOVD software, so people over that period of time have agreed what sort of data should be in there, for inherited diseases at least. And now, of course, what's happening is that a wider community is further developing those into actual standards.
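The variant nomenclature he refers to (maintained today as the HGVS nomenclature) gives each variant a compact, machine-readable description. As a rough illustration only -- this regex is a toy, not the official grammar, and the accession and change shown are placeholders -- a simple substitution can be split into the parts curators agree on when sharing data:

```python
import re

# Toy sketch (NOT the official HGVS parser): pull apart an HGVS-style
# description such as "NM_000249.3:c.350C>T" -- reference accession,
# coordinate scheme (c./g./p.), position, and the base change.
HGVS_LIKE = re.compile(
    r"^(?P<reference>[A-Z]{2}_\d+\.\d+):"  # RefSeq-style accession with version
    r"(?P<scheme>[cgp])\."                 # coding / genomic / protein coordinates
    r"(?P<position>\d+)"                   # 1-based position
    r"(?P<change>[A-Z]>[A-Z])$"            # simple substitution, e.g. C>T
)

def parse_variant(description: str) -> dict:
    """Return the named parts of a simple substitution, or raise ValueError."""
    match = HGVS_LIKE.match(description)
    if match is None:
        raise ValueError(f"not a simple HGVS-style substitution: {description!r}")
    return match.groupdict()

parts = parse_variant("NM_000249.3:c.350C>T")
# parts == {"reference": "NM_000249.3", "scheme": "c",
#           "position": "350", "change": "C>T"}
```

Real HGVS descriptions cover far more than substitutions (deletions, duplications, intronic offsets), which is exactly why a shared, formally specified nomenclature matters for databases exchanging variants.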

So what are the standards under development? We have disclaimer statements that we should put on gene- or disease-specific databases. And just as an aside, I think it's quite likely that maybe there will have to be gene- or disease-specific databases for the actual genes of interest which this conference is talking about. There's also assessing pathogenicity -- Marc Greenblatt is a prominent U.S. person there. Minimal content for the databases. The sequence description committee -- I've said that. Variant database quality assessment. Disease and phenotype descriptions in gene-specific databases and minimum content, and an ethics checklist.

So the need for accreditation -- this is very important. If any database is critical to support clinical decision making, it needs to be accredited, and, obviously, there are ways of doing this. And it won't be possible without a national framework like the College of Pathologists or the American Society of Medical Genetics. And there's also a requirement for national accreditation, really, if we are to believe data coming in from around the world.

So in Australia, for example, the Royal College of Pathologists and the Genetic Society both have draft documents that can be read on applied translational genetics -- a proposal that will go to the National Pathology Accreditation Advisory Council, and then be a requirement for pathology to follow. This will also be submitted to the Human Variome Project for standards development, and HVP member databases are expected to conform -- and one day, maybe, one form of international accreditation.

So what about the FDA? At least in pharmacogenomics, and also in viruses as well, I suppose. First of all, what about data collection? We've found it extremely difficult to get collection from labs, and maybe grants should make data submission a condition. Labs, as part of their accreditation or quality control, should submit data, and maybe in the U.S., for example, it might be state-by-state collection. For quality you really need gene-specific experts for curation, as I've said, and for bioinformatics, the LOVD open source database might be of interest.

So just coming to the acknowledgements: we've got a large international scientific advisory committee that meets every month. Some very prominent people are involved with that, and we're well represented from around the world. These are the people who work in the coordinating office.

The final thing I'd like to say -- one of the most difficult things, which I didn't put on the slide -- is connecting the actual genetic data to the phenotype. And, in fact, there is a system called BioGrid in Australia whereby you can actually access the data in a way about particular diseases and about particular patients. And this is one way of following -- you could keep following a patient, for example, in a particular hospital. And this was initially set up for research, and now, obviously, we're looking at that for following an inherited disease.

So thank you for listening, and I hope you could hear.

DR. WILSON: Dr. Cotton, thank you so much. This is Carolyn Wilson. That was a really wonderful overview of the project that you're working on. It sounds very exciting and challenging. I just wanted to see if there are any questions. There are microphones now, so if you could come to a microphone if you have questions. Questions for Dr. Cotton? Well, thank you, again, Dr. Cotton, for that talk.

So our last speaker in this introductory session is Dr. Vahan Simonyan, who has been leading a lot of the efforts within the Center for Biologics in the FDA. He'll be talking about development and implementation of novel next generation sequencing standards, really setting the tone for the conversation that we're hoping to have over the next two days.

DR. SIMONYAN: Thank you, thank you for coming today. And thank you, Carolyn Wilson, because she introduced the strategic perspective from the FDA's viewpoint: Why do we need this workshop? What are the challenges coming our way? And thanks to Dr. Cotton, who actually gave us the global perspective on this: How does the data move on a global scale? What kinds of standardization efforts are being done? And he also highlighted curation.

What I'm going to discuss are the more technical aspects of it -- what we are expecting, being part of the Food and Drug Administration, and how we are seeking solutions.

So this is just an introductory slide of mine. This is our mission. We want to develop a versatile platform for harmonization of next generation sequencing technology. We want to come up with standardization of data formats and promote the way different platforms interact with each other, and also make sure that when you do run bioinformatics and analytics, there is a way to verify and validate the bioinformatics and analytics from software and science perspectives.

This is my disclaimer. The perspectives given here are my own perspectives and they do not obligate or bind FDA.

So what are we going to discuss today? Number one, I'll briefly mention the NGS lifecycle, which most of you are probably very familiar with. And then I'll talk about the challenges we are facing every day. We'll then move to the goals, the main goal statements -- what we are trying to achieve by getting all of these bright people around the same conference room and trying to see what the vision is. And then I'll briefly mention format standardization attempts, which we altogether can do, and bioinformatics harmonization attempts. And after that, I'll very briefly talk about the future plans coming from our perspective: How can we continue our communications?

So usually, and just very generally -- and you have to understand I'm not a biologist, I'm a quantum physicist, so in this light, if you see a bunch of things are missing, it's because I don't know much of the biology. By my understanding, we always start from some biological specimen. And then that ends up somehow, after some chemical and physical treatments, in the sequencing machine. The results are then transferred through these wires -- one gigabit, ten gigabit, we all hear these names -- into the archiving system. And then computational programs have to get the data back and do the computations, and the circle here represents that it's not just one computation; there are many computations here. And then later the analysis moves onto somebody's table, who looks at the data and makes a decision, and it gets approved or not approved.

So there are two models of operations. One of them I call the trust-based model, and when I say trust, I don't mean trust in somebody or some organization. I mean trust in the pipeline, trust in the workflows, trust in the bioinformatics, trust in the data. And, of course, what we are trying to achieve is to move to a provenance-based mode of operation, where we can validate every step of the way wherever it is possible. So we are trying to move from the top diagram to the bottom diagram here.

What are the steps that are of concern from our perspective, and the reason for this workshop and the future work that has to be done? We want to make sure that when archival pipelines are working, we know what is happening in there. We want to make sure we know what is happening when the computations run. There are all of these wonderful algorithms, hundreds of tools out there, but not many people understand how they work. They use them with their arguments, launch them, get the results. We want to understand and look at it.

Also we want to talk about the platforms. There are different platforms, and a lot of the time computational results do depend on what platform you are running on. And from a data perspective, we want to talk about standardization of data and metadata. By data, we mean something that comes out of these devices. Metadata means descriptive information that is assigned to the data. And we also want to talk about archival standards: What do we keep? How much do we keep? Also interoperability standards -- those are the standards by which different computational protocols can talk to each other. And, of course, how do we generate the results?

So let's move to the next stage: the NGS challenges. Number one is file formats. Before we even came to this conversation, there were many different file formats, which are wonderful formats used for many different purposes. But the challenge now is that we are dealing with a huge amount of data. So some of the data file formats, which were wonderful in serving that role, can still do that, but sometimes there's too much engineering information. One of the questions -- and believe me, I don't know the solutions, I'm just asking questions -- is how important it is to maintain all of the ID lines --

DR. WILSON: Sorry to interrupt. The Adobe connection is a little fuzzy, so if you could put that on you, I think it would be better, because I think it's picking up whatever static is --

DR. SIMONYAN: Sorry, sorry. Give me a minute. We have plenty of time. In fact, we assigned a very big 1.5-hour break. So we were late half an hour, but we are going to catch up. Don't worry about it, lunch is just next door.

So let's move to the FASTQ format, for example. We have ID lines, we have base qualities, we have the sequences themselves. Biologically, the most important part is the sequence information. Qualities are important for understanding and trusting a particular base call. And ID lines in FASTQ files mostly contain information about your plate: what coordinates were used to read that particular sequence. These are wonderful if you're running quality controls. These are wonderful if you're trying to analyze how well your experiment went. And I can tell wonderful stories of how we can predict earthquakes, actually post factum, by looking at the ID lines and the qualities. But that's a different conversation.
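The three kinds of content he distinguishes sit in fixed places in every FASTQ record. A minimal sketch (the record below is fabricated, and real ID-line layouts vary by instrument):

```python
# One FASTQ record: ID line, sequence, separator, per-base qualities.
# The ID line carries run/flowcell/tile/coordinate bookkeeping; the
# record shown here is made up for illustration.
record = (
    "@M00001:12:A1B2C:1:1101:15589:1337 1:N:0:ACGT\n"  # ID line
    "GATTACAGATTACA\n"                                 # sequence (the biology)
    "+\n"                                              # separator
    "IIIIIIIIIHHHGG\n"                                 # Phred+33 quality string
)

id_line, sequence, _, qualities = record.rstrip("\n").split("\n")

# Phred+33 encoding: quality score = ASCII code minus 33, so 'I' is Q40.
phred_scores = [ord(ch) - 33 for ch in qualities]
assert len(sequence) == len(qualities)   # one quality value per base

# The x/y tile coordinates live in the ID line, not in the sequence:
x, y = id_line.split(" ")[0].split(":")[-2:]
```

Only the second line is the biology; the ID line and quality string are the engineering overhead whose long-term value he is questioning.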

So the question is this. If we are compressing these file formats -- and, of course, every sensible technology person tries to compress these file formats, not maintain them raw -- there are different ways you can compress and there are different coefficients of compression. So ID lines compress somewhat, and here you can see I've tried to show this. Sequences can be compressed heavily, and EBI has come up with this reference-based compression machinery: instead of keeping the whole sequence, we keep only the position and the genome from which it is coming, and your huge sequence compresses to a few bytes only. So this is wonderful. Sequences are the most useful part, and they are the most compressible part.
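The reference-based idea can be sketched in a few lines. This is a toy illustration of the principle behind formats such as CRAM, not their actual encoding: a read that matches the reference is stored as a position and length plus any mismatches, instead of as its full sequence.

```python
# Toy reference-based compression: store (position, length, edits)
# against a known reference instead of the read's full sequence.
reference = "ACGTACGTTTGACCAGGTTTACGTAGGCAT"

def compress_read(read: str, ref: str):
    """Return (pos, length, edits); edits lists (offset, base) pairs
    where the read differs from the reference window."""
    for pos in range(len(ref) - len(read) + 1):
        window = ref[pos:pos + len(read)]
        edits = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
        if len(edits) <= 1:                  # tolerate at most one mismatch
            return pos, len(read), edits
    raise ValueError("read does not map to reference")

def decompress_read(pos, length, edits, ref):
    """Rebuild the original read from the reference plus stored edits."""
    bases = list(ref[pos:pos + length])
    for offset, base in edits:
        bases[offset] = base
    return "".join(bases)

pos, length, edits = compress_read("GACCAGGTTTA", reference)
assert decompress_read(pos, length, edits, reference) == "GACCAGGTTTA"
```

The catch, which he returns to later with SAM files, is that the compressed form is only meaningful while you still have the exact reference it was encoded against.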

And then we have to deal with qualities. Qualities, again, are important for quality control -- that's why they are called qualities. And then we use them sometimes in the bioinformatics pipeline, but in most of the pipeline it's, again, just the flip of a bit whether it's important to consider these values or not. So there is a question: how much of that quality data are we going to keep? Because this material is not well compressible. And we have done some analysis: on some viral and bacterial NGS data, 80 percent of the data sometimes ends up being the qualities, and it's about 15 percent on average for human cases. And then people ask me, why do you worry about this? It's only 15 percent of the data, after all. But 15 percent of the data, when we're talking about terabytes and petabytes, is a huge amount of investment. We have to keep it, and we have to keep it for many, many years. So that is important.
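The storage argument is easy to put in numbers. A back-of-envelope sketch, with illustrative figures only:

```python
# If qualities are 15% of a 1-petabyte archive, that slice alone is
# 150 terabytes that must be kept (and periodically migrated) for years.
PB = 10**15                      # bytes in a petabyte (SI)
archive_bytes = 1 * PB
quality_fraction = 0.15          # the "only 15 percent" for human data

quality_bytes = archive_bytes * quality_fraction
print(quality_bytes / 10**12)    # → 150.0 (terabytes)
```

At the 80 percent he cites for some viral and bacterial datasets, the same petabyte archive would be carrying 800 TB of quality strings alone.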

Another big question is about alignment formats. Well, I have a computer, I have a program. It takes my reads, it takes my reference sequences, runs the program, produces an alignment. And then I'm going to do something with that alignment. Believe me, if you send a SAM file to FDA reviewers saying this is my result -- there's a huge amount of work to be done before a biologically meaningful conclusion can be made. So that by itself is not a real result from a reviewer's perspective. So the question is this. Most of the time the SAM files that we are using are going to be used on the same platform, on the same set of machines, where we already have the sequences and we have the qualities. But if you look at the SAM file, that information is completely duplicated. Well, people say maybe it's not an intra-computer file format -- it's not ideal because it's duplicating a lot of stuff -- maybe it's an export file format. But look at its reference to the genome. It is wonderful if you are working with human genome version 37, which is available from NCBI, but a lot of the time we deal with sequences where references are either internal or not published. So by itself it actually -- well, I'm being asked to slow down a little bit. So by itself, that format is not a complete format either. So it's repetitious, because it is repeating information, and it is not complete. But it's a wonderful format, and I don't want to sound like we're criticizing the format. We are just asking a question: we are moving to the next generation era; do we need to sometimes be critical and skeptical of our previously well-serving file formats?
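Both complaints -- duplication and incompleteness -- are visible in the layout of a single SAM record. A sketch with a fabricated record (field values are placeholders):

```python
# A SAM alignment line is 11+ tab-separated columns. Columns 10 and 11
# repeat the read's sequence and quality string already present in the
# FASTQ file, while column 3 merely *names* a reference the file does
# not contain. Record fabricated for illustration.
sam_line = "\t".join([
    "read_0001",        # QNAME: read identifier
    "0",                # FLAG
    "chr1",             # RNAME: reference name only -- not the reference itself
    "10468",            # POS: 1-based mapping position
    "60",               # MAPQ
    "14M",              # CIGAR: 14 bases aligned, no indels
    "*", "0", "0",      # mate/insert information (unused here)
    "GATTACAGATTACA",   # SEQ  -- duplicates the FASTQ sequence
    "IIIIIIIIIHHHGG",   # QUAL -- duplicates the FASTQ qualities
])

fields = sam_line.split("\t")
seq, qual = fields[9], fields[10]
assert len(seq) == len(qual)
# With an internal or unpublished reference behind fields[2], this
# record cannot be interpreted on its own -- the incompleteness he means.
```

So the same file manages to be both repetitious (SEQ/QUAL restate the input) and non-self-contained (the reference is only named), which is his point about re-examining formats for the next generation era.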

Let's move to metadata. Well, NCBI has done a wonderful job -- accumulating, and communicating with a huge number of people from all over the world -- in standardizing some of the fundamental metadata types: by project, by sample, by next gen sequencing run, and by experimental protocol. But if you look at the definitions of these formats, sometimes there are tens of kilobytes of very well-considered and carefully treated text. Yet if you look at samples of what information is actually being provided, a lot of the time you'll see it's very sparse. So it is complicated. So there is a need. And when you give somebody something that is too complicated, there are people who will try to circumvent it. Perhaps we have to think, from FDA's perspective, how we can make their life easier. So there are challenges also on the metadata part.

Then archival is an issue. I briefly mentioned it when I was talking about the pipeline. So the questions coming from FDA's perspective are: How long do we keep the data? Well, we knew there were regulations controlling the length of data storage in FDA. But when it comes to next gen, you understand, we are talking about petabytes. And these petabytes have to be transferred, copied again and again. If you take a hard drive with this petabyte and put it in your cool, dark place, and you come back in 30 years, it's not going to be readable -- not in 5 years, not in 10 years. And another thing: recently I had a hard drive with an IDE interface on it. I was trying to read my old pictures. There was no way; I couldn't find a computer that could read it. Which means that these petabytes of data cannot just be put somewhere as a one-time expenditure. You have to maintain it. You have to hire systems programmers and administrators. So a question arises: how long do you keep the data, understanding what your gradually increasing costs are?

Another one is: what do we transfer into the system? There's a huge amount of -- do you know that some of the sequencing machines generate TIFFs which are 17 times larger than the sequences? But at some point somebody made a smart decision: well, maybe we should just not transfer those; that's not a final format. But then the question is this. Now we have this big, huge FASTQ file. Is it a final format, or can we still do something about it? Because the FASTQ files sometimes are not the real data, despite that we are calling them real data.

So what can we and what can we not lose during the compression? That is another question. If something is not going to be used for the computations later, do we lose it or do we maintain it? And I talked about the cost.

Now, let's move to the challenges in bioinformatics. This is just one generalized pipeline, not anything in particular; no computational knowledge is coming out of it. The point is that a lot of the time we chain many components together, and I have highlighted some of the data file formats involved. Again, the question is: is there too much trust in bioinformatics? I have seen many wonderful scientists using all of the alignment tools and coming up with beautiful solutions. They take a bioinformatics pipeline -- Bowtie has about 50 parameters; BLAST has about 50 parameters; our own tool has about 50 parameters. Most of the time people read the instructions, specify the couple of parameters that are most customizable and easily understandable, run the pipeline, and get results. A lot of the time people say: hey, I have this wonderful pipeline and it works wonderfully for my viral datasets; I'm going to use it for human. They launch it. It either doesn't finish, or it finishes and you get results. How much do you know? Can you rely on it or not? So what you plug into your tool is important. I can take a hammer and drive in a big nail, but if I use that same hammer to put a pin on a wall, I'm going to make a hole in it. I have seen these cases where really smart people do this, not because they don't understand, but because it's too complicated; these are different professions. So we have to come up with a way of actually validating a bioinformatics pipeline. It's important to think about what I am applying my tool to, to understand what set of parameters I can use, and what I can expect from my computations. What I can apply my tool to is the usability domain. What kinds of parameter settings generate scientifically correct results? We call this the parameter space. It is just like a real lab experiment: I can run an experiment under one temperature condition and get one set of results. I am a chemist by my first education. Under different temperature or pressure conditions, I'll get different results. A bioinformatics pipeline is no different: with some parameters I get one set of results, with others I get different results. So the same tool, the same pipeline, may actually be good for many different purposes. We have different validity domains with different parameter spaces.
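To make this concrete, a validity domain could be modeled as a usability domain (what inputs a tool may be applied to) plus a parameter space (what settings yield trustworthy results). A minimal sketch, in which every name and range is hypothetical rather than taken from any real tool:

```python
# Sketch: a tool's validity domain as usability domain + parameter space.
# All names and ranges here are illustrative, not from any real aligner.

ALIGNER_DOMAIN = {
    "usability": {"bacteria", "virus"},          # organisms it was validated for
    "parameters": {"seed_len": range(16, 33),    # acceptable seed lengths
                   "min_identity": (0.90, 1.0)}, # acceptable identity range
}

def run_is_in_domain(organism, seed_len, min_identity, domain=ALIGNER_DOMAIN):
    """Return True if this run falls inside the validated domain."""
    lo, hi = domain["parameters"]["min_identity"]
    return (organism in domain["usability"]
            and seed_len in domain["parameters"]["seed_len"]
            and lo <= min_identity <= hi)

print(run_is_in_domain("virus", seed_len=20, min_identity=0.95))  # True
print(run_is_in_domain("human", seed_len=20, min_identity=0.95))  # False
```

The same pipeline could carry several such domain records, one per purpose it has been validated for.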

So what are our goals? Well, first of all, before we define the goals, we want to say what we do not want to do. We do not want to put any limitations on technology, and that is critically important. We don't want to show preference for any platform, and we don't want to put any limitations on the particulars of the implementation at any given local institution, organization, or research institute. We want to make it as open as possible. Whatever we come up with together in the end, we must maintain these freedoms, just like in the Constitution.

Now, what we do want to do is come up with a good data typing engine, which I'll describe briefly. When we are talking about standardization, we should talk not just about what the file format is or what the names of these little fields are. It is silly to gather a bunch of smart people just to talk about field names. What we actually need to create is the way standards get defined, and that's what I call a data typing engine.

Another thing is validation. In the end we want to find out how we validate bioinformatics protocols, what we archive, and how we submit the data, in this particular case to FDA. And it is important: how do we together generate something that will still be valid tomorrow, in one year, or in ten years? Honestly, our vision is very short. Technology has changed so much in the last five years that if I thought I was doing something right five years ago, I'm pretty sure half of it was wrong.

But let's look at our United States system. We have a Constitution and we have a set of laws. If you ask how to define the Constitution: the Constitution is a fundamental framework for creating laws. The laws are local, time-limited, and can be amended, which is not true of the Constitution. In the same way, if we generate a bunch of file format standards today, we are bound to fail in the future. But if we create engines of standardization, and if we do it well, just like our Constitution, we have the potential of creating a beautiful result in the end, just like our country.

So how do we create these engines, and how do we standardize our data? We're in luck, because we are not the first ones in this field. The software development paradigm and the software standards organizations have done the legwork for us. And NCBI -- I work closely with NCBI -- NCBI, NIST, and other organizations have applied those techniques to actual implementations for bioinformatics and biomedical data. So, for example, this is just a proposed set of metadata objects of biomedical importance that we use regularly in the NGS world. We can start from this.

And there is a hierarchy. One of the things we want to introduce, just like in any programming language, is inheritance. Let's say I have a type; I call it a bioproject type. This is my project; it's a generic thing. All projects have certain fields: who is submitting, where it is being submitted, where the project came from. Then we can have projects with a specific target that carry extra information. In the same way, a sample is something with certain fields describing it: where was the sample collected, how much of it was collected, how was it treated? But then, let's say, a human sample has additional characteristics. An animal sample has a taxonomic identification, which you generally don't need to specify for a sample, because samples in general can be metagenomic, where a taxonomic identification might or might not be useful. So if you create a hierarchy, if you borrow that idea of inheritance from the computational IT world and create an inherited set of objects, each object becomes very simple, and it contains only the things that are necessary. Otherwise you generate flat and complex data types: a bioproject data type without this inheritance has to carry every field itself. So there is a way to deal with this, and one of our initial propositions is to use this technology.

Another concept is inclusion. A lot of the time data can be atomic or complex. Atomic is like a string, my name for instance; if you split it, it's wrong and incomplete. But there are also complex data types, and programming languages and paradigms use this concept of inclusion: we define a nice, reusable data type once and use it everywhere. By using inclusion and inheritance, we can generate data types of any complexity while keeping them minimalistic.
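The two mechanisms borrowed here from programming languages, inheritance and inclusion (composition), can be sketched with dataclasses; all the field names below are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Sample:                      # base type: fields every sample needs
    source: str
    amount_mg: float

@dataclass
class AnimalSample(Sample):        # inheritance: adds only what is extra
    taxonomy_id: int = 0           # optional; metagenomic samples omit it

@dataclass
class BioProject:                  # inclusion: reuses Sample by composition
    submitter: str
    samples: list = field(default_factory=list)

p = BioProject(submitter="lab-42")
p.samples.append(AnimalSample(source="liver", amount_mg=5.0, taxonomy_id=10090))
print(p.samples[0].taxonomy_id)  # 10090
```

Each type stays minimal because it declares only what its parent does not already provide.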

Next, programming languages also came up with the very nice concept of data typing: correct field typing, with a type behind every variable. Fields are not just a bunch of values or long strings; there should be a strict interpretation of every field. Fields can be missing, and then the question is: what is the default value? Is there a default value at all? If you miss that field, does it invalidate the whole record, or only itself? Is the field optional or mandatory? Is it single-valued or multi-valued? My name, in a data or metadata format, is a single value: I have just one first name and one last name; I don't even have a second one. But if you are talking about, let's say, how many times I visited the doctor, that's multi-valued. You see, just by talking through this, we can standardize using these paradigms, and it already becomes easy.
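The field-level questions listed here -- type, default, optional versus mandatory, single versus multi-valued -- are exactly what a small field specification can capture. A hypothetical sketch, with made-up field names:

```python
# Sketch of a field-typing spec; names and rules are purely illustrative.
FIELDS = {
    "first_name": {"type": str, "required": True,  "multi": False},
    "visit_date": {"type": str, "required": False, "multi": True,
                   "default": []},
}

def validate(record):
    """Check a record against FIELDS; missing optional fields get defaults."""
    out = {}
    for name, spec in FIELDS.items():
        if name not in record:
            if spec["required"]:
                raise ValueError(f"mandatory field missing: {name}")
            out[name] = spec.get("default")
            continue
        value = record[name]
        values = value if spec["multi"] else [value]
        if not all(isinstance(v, spec["type"]) for v in values):
            raise ValueError(f"bad type for field: {name}")
        out[name] = value
    return out

print(validate({"first_name": "Vahan"}))
```

A missing mandatory field invalidates the whole record; a missing optional one only falls back to its default.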

Now, about flexibility and extensibility: suppose I design my data hierarchy today, with the bigger types including the previous ones by inheriting from them. If tomorrow I need to introduce a new one, because my vision a year ago was not complete, the engine we create should be able to adopt that modification without breaking the existing data types.

When we build this data typing engine, we must also not forget that we are not the first ones here: we are dealing with an existing industry, with existing data and existing data formats. There are ways to design our data typing engines so that they are adoptable, sometimes without even touching database content, so that existing databases can move to the new standards. We are not creating a brand-new data typing engine and forcing everyone to come and play with us. In fact, we are trying to create a data typing engine that adapts to existing data, and that is very important, because there is a huge amount of information out there and there is no way we can convert everyone.

Then we must also remember big data. Big data means our formats are sometimes too heavy. Recently I received a few million biomedical records, test data for our platforms, in XML format. It turned out that a significant fraction of the bytes was markup overhead, and the actual content was only a fraction of the data. XML is a very nice format to adopt and to work with, and there are many tools for it. But when you move to petabytes of data, it becomes too expensive. In fact, there are other, much simpler formats that can be used for these purposes, except that we have to adapt them for big data by pinning down the interpretation of missing values and similar things. As long as they are unambiguously interpreted, there is no issue with them. We also have to remember that it's not just big data, and not just big files; it's a big number of big files. For example, the Center for Food Safety and Applied Nutrition is involved, together with NCBI, ORA, and many other organizations, in the 100,000 genomes project. Now multiply: each of those 100,000 samples is going to be sequenced many times and has many other associated files, so you end up with millions of files, each one being big data. So when we design our data file formats, we have to think not just about big data, but about a big number of big data.
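The markup-overhead point is easy to see by encoding the same record two ways; the record and tag names below are made up for illustration:

```python
# Compare the size of one record in XML vs. a flat tab-separated line.
record = {"id": "S001", "organism": "E. coli", "reads": "1048576"}

xml = "<sample>" + "".join(
    f"<{k}>{v}</{k}>" for k, v in record.items()) + "</sample>"
tsv = "\t".join(record.values())

print(len(xml), len(tsv))  # the XML encoding is several times larger
```

At a few kilobytes the difference is irrelevant; multiplied across petabytes of records it dominates the storage bill.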

Let's move to bioinformatics harmonization now. There's a lot of confusion here. When we say we want to validate something, it's very important to determine what we are trying to validate. Are we trying to validate an experimental method? An experimental protocol? An experimental instance? For the method, we tried to simply define the terminology. I looked at some of the scientific ontology publications, and the definition is that the method is the underlying scientific knowledge that forms the basis of your instructions during the experiment. Being a scientific experimental method, it has to meet the criteria of science. As a French philosopher and mathematician defined, any experimental method should be objective, reproducible, deducible, and predictable. I don't want to go into details; these are the fundamentals of a first-year university course on what science is and how it differs from everything else. So when somebody uses an experimental method, it should comply with these rules.

Now let's go to the experimental protocol, the subject of our discussion. I mentioned briefly what a usability domain is. Can I apply my protocol to metagenomics? How about bacteria? Eukaryota? Human? Viruses? My protocol is a set of instructions, and as I mentioned before, the same protocol can produce valid or invalid results. So I have to know what the usability domain is when I'm testing a program.

Then there is the parametric domain. What are the seed sizes, if I'm talking about alignment? At what identity match are sequences considered aligned? If I am calling variants, which variants am I going to use for genotyping purposes? All of these make up the parametric space, not the knowledge domain. A lot of programs, as I said, produce many different kinds of information: mutations, identifications, genotypes, expression data. I have seen cases where people used a very nice pipeline, the TOPAZ pipeline, to generate expression data, but with conditions and parameters under which I wouldn't rely on the variant calls. Again: the same program, the same parameter space, the same usability domain, multiple output files; one of them is good and usable, the other is not, because the parameters are not good for variant calling in that particular case.

We also have to talk, when I am looking at a program to see how good it is, about deterministic versus heuristic programs. With deterministic programs, if you give them the same input, you always get the same output. Heuristic programs are different. They are based on random numbers, on some kind of sampling analysis, and they have a chance of generating false negatives and false positives. That's okay; we live in the real world. If you tried to find every possible alignment of a 100-million-read file against the human genome, considering all alternatives, without a heuristic program, it would take you 56 years. We don't want to do that. So our programs are much, much faster, but they have a chance of generating false negatives and false positives. So when we talk about validity, we don't require 100 percent conformance. There are always errors; the whole technology works within the assumptions of a science, and there are always inaccuracies. So we have to understand what the range of errors is.

Okay: usability domain, parametric space, knowledge domain, and range of errors.
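A "range of errors" for a heuristic caller is usually summarized as false-positive and false-negative counts against a truth set. A minimal sketch with toy variant tuples (the positions and alleles are invented):

```python
# Toy truth set vs. pipeline calls, summarized as FP/FN counts and rates.
truth = {("chr1", 1000, "G", "T"), ("chr1", 2000, "A", "C")}
called = {("chr1", 1000, "G", "T"), ("chr1", 3000, "C", "A")}

true_pos = truth & called
false_neg = truth - called      # present in truth, missed by the pipeline
false_pos = called - truth      # reported by the pipeline, not in truth

sensitivity = len(true_pos) / len(truth)
precision = len(true_pos) / len(called)
print(len(false_pos), len(false_neg), sensitivity, precision)  # 1 1 0.5 0.5
```

Declaring a pipeline valid then means its measured rates stay inside the range its validity domain promises.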

Now, another really big thing we have to know is what data I need to run my program. Are my databases good? Am I using the right BLAST matrix if I'm running BLAST? Am I using the right viral representative database if I'm trying to detect viruses? There are data prerequisites that will validate or invalidate a particular use of the program. And there is one big thing that is always missing in my life; I always feel the need for it, and that's a biocompute object. All of us have worked with NCBI, GenBank, or other formats, and there are very nice definitions for that data. But when it comes to computations, when I run a complex pipeline, where do I register it? Can I register it and share it with someone? Can I put it into some public database and say: please go use it, here is my parametric space, my usability domain, all of the things we discussed? There is a need to create metadata for biocompute objects. I know efforts are under way in different places, but I think this is also a very big question in our discussion. We have to define these things.
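A registrable biocompute object of the kind described here would bundle the pipeline steps, their versions and parameters, the usability domain, and the data prerequisites into one shareable record. A hypothetical minimal sketch, with every field name invented:

```python
import json

# Hypothetical biocompute-object record; every field name is illustrative.
biocompute_object = {
    "name": "viral-detection-run-001",
    "pipeline": [
        {"tool": "aligner-x", "version": "2.1", "params": {"seed_len": 20}},
        {"tool": "caller-y",  "version": "0.9", "params": {"min_depth": 10}},
    ],
    "usability_domain": ["virus", "bacteria"],
    "data_prerequisites": ["viral-representative-db rel. 2014-09"],
    "expected_error_range": {"false_positive_rate_max": 0.01},
}

# Serializing it is what makes the computation shareable and archivable.
blob = json.dumps(biocompute_object, indent=2)
print(json.loads(blob)["pipeline"][0]["tool"])  # aligner-x
```

Anyone receiving this record can see exactly which tools, versions, and parameters produced a result, and to which inputs the result may be trusted to apply.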

Now, this is one of my last slides: where do we go from here? Today is September 24, and this is the conference where we invited all of you to start a dialogue. We are not making any regulations, any guidance, anything. We are inviting you to work with us, because we are facing this challenge; the challenge is coming our way. We want your input, all of it: academia, industry, the diagnostics companies somebody mentioned, other government agencies, international organizations, consortia. There will be two series of concurrent workshops, held by teleconference or in face-to-face meetings. Please join us and tell us. There is a registration sheet, and we will pass this paper around; you can register so you are on the email list we are going to organize after this conference. We are going to edit our documents to make sure that whatever input you give us today or tomorrow is reflected, and the documents will be available for download. Also, based on your input and the demand from registration, we will start scheduling regular monthly meetings. Please come and talk to us.

And where do we end up in the end? I made this slide a little fuzzy on the right side on purpose, because we don't really know where we'll end up. We just know we have a big challenge and we want to solve it. It might eventually generate guidance documents, where your input will also be considered, but also standards documents. There are different organizations -- ISO, ANSI, ASTM International, and others -- that generate standards and recommend them for usage. And we at the FDA also use standards developed by others, where such standards are available. So if we do a good job together, if we succeed in this big idea, parts of it can be used more broadly. And we promise we'll work with the standardization organizations to make the results publicly available, not just for our purposes, but for anything else.

And these are my acknowledgment slides. There are many people I should acknowledge. Alin has done a great job; without her, it wouldn't have been possible to organize this. Hayley, Alissa, and John were editing all of the documents with me and making sure that none of the communications were missed. Carolyn Wilson: without her vision it wouldn't have been possible to even be here, because her support and encouragement are very important for us. We have very good friends with whom we talk every day; we communicate, we come up with ideas. They are very skeptical, very critical, and that's very helpful. And we hope that after this conference you will be on that second list, working with us. And the most important people, after all of these, are the researchers. Without them, none of this is possible.

Okay, so now we are ready for questions.

QUESTIONER: Paul Walsham, In Silico Life Signs. One of the challenges you pointed out was reproducibility, and this is more of a comment: I think the idea of traceability, which has been widely used and adopted, is very well placed in the FDA. It really matches this: being able to track the parametric space, as you outlined, what was done in the pipelines and who did it. So I think there's a good foundation there for addressing that issue.

DR. SIMONYAN: Any other questions? Any questions from online, maybe? You can type them and we can read them for you.

QUESTIONER: You made a couple of points that struck me. One was that the alignment pipeline is heuristic-based, so there are going to be errors. And given that, we know those pipelines are going to change with time and get better and better. It also seems to me there's a real need to be able to compare results from two different labs using two different pipelines, and to use a nomenclature or a format that says these two people have the same novel polymorphism. So it seems to me that we have to send the primary data and be able to reinterpret it on the fly using common methodologies. There needs to be a standard way of doing that, and part of the goal for this workshop is to recommend standards for doing that.

DR. SIMONYAN: Yes, the question was, let's say, that the alignment pipeline is heuristic, so it can give different results, and there are many pipelines that work with this data; I'm sorry if I don't represent the question exactly. How do we validate this?

Our idea is this. We can generate test cases for validation purposes, and by validation I mean bioinformatics and software validation. To our knowledge there are two ways to do that, though of course input is welcome; whatever I say, input is welcome. One way is simulated datasets. We have done a huge amount of work to generate simulated datasets, and engines to generate them, with which we can validate the mathematical accuracy of an algorithm or a pipeline. Let's say somebody claims this is a nice pipeline for generating mutation calls on bacterial genomes. We can mimic bacterial genomes; we have many tools for that and can build more. We seed the samples with known variants and generate reads as if they came from one of the devices on the market. We then run the data through the bioinformatics protocol, take the results, and compare the known truth to the outcome.

The other approach: computer-generated data are good because you know what to expect. But, for example, our collaborators at NIST -- Justin, I think he's here -- are working with actual biological data and validating what a good biological sample is. Biological data have much bigger variability than any simulated data can, and that is how you test the robustness of your bioinformatics pipeline. Let's say we take Genome in a Bottle data, where they are producing a validated set of mutations and have a very beautiful dataset with familial relations; I'm pretty sure Justin is going to talk about some aspects of this project. We can take that data and again create a test case, varying the parametric space and all of these things, and the pipelines will generate results. Then we compare against the validated variant and genotype calls Justin and his team are producing, and we can test it. So: simulated data and biologically validated data.

QUESTIONER: Lester Shulman from the Ministry of Health, Israel. If you have two different variants, or two different variations of one, in your pipeline, how do you decide which one; what parameters would you use to decide which one might be more valid?

DR. SIMONYAN: One position? You are saying one position?

QUESTIONER: Let's say you have two datasets. You've gone through two different pipelines, and now you get results that are not harmonized.

DR. SIMONYAN: So with test data, I put in one variant and I know what I am putting into the data. Let's say position 1 million in the human genome -- I don't know what that position actually does -- I can seed my data with it, generate NGS data, and run the program. And then let's say your pipeline produces an A-to-T call when I know I put a G there. That's a way to validate, and genotyping truth information is available for human data through the collaborators I mentioned. For viruses and bacteria we do this every time we run a program. We have our own pipelines, which you'll hear about later, and test sets with simulated data. Every time we change a single line of code, we take the whole pipeline, run it again, and compare the results with the ones we know we must get. That's mathematical accuracy.
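The seeded-variant test described here -- plant a known change in a reference, run the pipeline, check that you get it back -- can be sketched end to end. The "caller" below is a trivial stand-in for illustration, not a real variant caller:

```python
import random

def seed_variant(reference, pos, alt):
    """Plant a known variant at a known position in a reference sequence."""
    return reference[:pos] + alt + reference[pos + 1:]

def toy_caller(reference, sample):
    """Stand-in for a real variant caller: report positions that differ."""
    return {(i, r, s) for i, (r, s) in enumerate(zip(reference, sample))
            if r != s}

random.seed(0)
ref = "".join(random.choice("ACGT") for _ in range(100))
sample = seed_variant(ref, pos=50, alt="T" if ref[50] != "T" else "A")

calls = toy_caller(ref, sample)
expected = {(50, ref[50], sample[50])}
print(calls == expected)  # True: the pipeline recovered the seeded variant
```

In a real regression suite the same comparison runs against a full pipeline after every code change, exactly as the speaker describes.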

QUESTIONER: Toby Bloom from the New York Genome Center. I'm worried that at this point in time, comparing existing methods just isn't good enough; for alignment, maybe, but once you get to variant calling, and especially structural variant calling, you find that -- and anybody who's worked on big projects like TCGA or 1000 Genomes knows this -- you run three different methods, you get three different answers, and you don't find that one method is better than another. What you eventually find is that this somatic mutation caller is better at minor allele frequencies over 10 percent, but this one is better at allele frequencies lower than 10 percent, and this one really doesn't do very well if there's too much stromal contamination.

At this point we are running three of everything and comparing the results on every sample. So, yes, we do lots of validation. We actually revalidate every few months against all the gold standard samples, and we update which three best methods we use. But it's a really complicated problem, and we aren't ready for standards. I mean, we don't know what to make the standards.

DR. SIMONYAN: I think one of the reasons this hits us is that the technology is moving and our understanding of the data is moving, and we tried to address some of that by introducing the concept of the knowledge domain. When you say some pipelines are good for one particular goal and other pipelines are good for other goals: yes, it's very complicated. I have no choice but to agree with you, but that doesn't mean we don't need to take the steps. We still need to get together and try to work on it. There is no solution; if there were a good solution, we wouldn't be here. That's why your input, and the input from everybody else, needs to be there.

QUESTIONER: So the previous question is actually perfect, and following the green challenge project, for example, is just trying to get a consensus. But over the years I think we will slowly start understanding that under certain conditions we can make certain claims. You mentioned three conditions in that one sentence. So the question is which conditions we can speak to now, which we can speak to in a year; some conditions may take us a decade before we can say for sure this is going to work. So I think the biocompute object is going to help us tease out the different parts.

DR. SIMONYAN: Yep, yep. And also, don't forget we are in a biological world. Evolution is kind of inherent to us, so all of our visions should also evolve.

QUESTIONER: As we try to develop these standards, I'm wondering whether we have thought about the limits of our technology. Meaning, when it comes to storage, transfer capability, and volume, have we involved engineers and informatics scientists who can advise us as to what is realistic and what is not?

DR. SIMONYAN: That's a very important question. When I moved into science, I thought I was only going to do the fun stuff, coming up with beautiful algorithms and running them. Then you hit big data, and there's an infrastructure challenge behind it, a huge infrastructure challenge. I can see people in this auditorium who come from the hardware side: providers of computational platforms, of networking, of storage. We work with some of them; they have been advising us on what platforms exist today and where they are moving. Yes, this is a really big challenge, and that's why we must specifically invite the hardware production companies and manufacturers to work with us as well; I do intend to have them. One of the simple misconceptions I always see is: hey, I have this nice computer cluster with a thousand cores and a bunch of storage; now I'm going to buy two times more of everything, and it's going to work two times faster. It doesn't work like that, because some things grow linearly: computation grows linearly, data storage grows linearly, but the networking between them grows like a square. So now you have to buy four times more networking. Unless we have the expertise of the people working in these hardware manufacturing companies, who know where the technology is going, we cannot do this without them. And in a way it's very nice that they are coming to us and asking what we need, because a lot of the time we understand how the algorithmics work, since every day we struggle with algorithms. It's important for them too, because they have to write tomorrow's software to work with our algorithms.
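The scaling argument can be checked with a little arithmetic: compute and storage needs grow with the number of nodes, but the number of possible node-to-node links grows roughly with its square:

```python
# Doubling a cluster doubles compute and storage, but pairwise
# interconnect demand grows roughly quadratically with node count.
def interconnect_links(nodes):
    return nodes * (nodes - 1) // 2   # count of all node pairs

for n in (1000, 2000):
    print(n, "nodes ->", interconnect_links(n), "links")

ratio = interconnect_links(2000) / interconnect_links(1000)
print(round(ratio, 2))  # ~4x the networking for 2x the nodes
```

Real network topologies avoid full pairwise wiring, but the underlying pressure is the same: bandwidth demand between nodes outpaces linear growth in nodes.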

DR. WILSON: At this time I think we're going to close the questions. Dr. Simonyan will be around during lunchtime and at other times, so feel free to stop him and ask more questions. Our next session is chaired by --

DR. SIMONYAN: We are going to pass around the registration forms.

DR. WILSON: So Dr. Simonyan would like to pass around these registration forms. If you'd like to receive information about future conferences, or to be involved in these NGS standardization efforts or any bioinformatics aspect of them, there are two categories, so please check the appropriate one and write your email. If you do not feel comfortable passing your name around, that's fine; just email me. I'm sure most of you know my email, or Dr. Khaled Bouri's, whose email is circulating through the FDA channels; he's the contact person listed in all the different announcements. So feel free to email either one of us and we'll add your name to the list.

The next session, on NGS standards, is chaired by Dr. Simonyan.

DR. SIMONYAN: Dr. Weida Tong is going to give the next presentation. He works for FDA at the National Center for Toxicological Research, where he is Director of the Division of Bioinformatics and Biostatistics. He and his group, together with a big collaboration from around the world, have done wonderful bioinformatics and big data projects. Some of the most notable ones, very relevant to us right now, are the MAQC and SEQC projects, which he and his team spearheaded. He's going to present his perspective on this effort. Thank you.

DR. TONG: I'm glad I get a chance to use the microphone. I'm from Little Rock, Arkansas, and without a microphone I would give away my southern accent. (Laughter)

What I'm going to do today is talk about some of the lessons I have learned in the past dealing with microarray data, and about my recent experience dealing with next generation sequencing. Then, in the last part of my presentation, I'm going to offer a few points I think are important to consider as we try to move this field forward.

Before I start, I do want to take this opportunity to thank Vahan and Dr. Carolyn Wilson for allowing me to talk today. Even though I titled this as an FDA perspective, I have to say it is my personal perspective. In the past ten years I have had a tremendous opportunity to work with brilliant scientists in FDA on genomic technology. But Carolyn, in her presentation, outlined a variety of projects in the FDA, and I was not involved in all of them, so what I'm going to say here is not presented as the FDA's perspective; it's just my personal view.

Before I get to the main topics of my presentation, I would like to make two points, and I really want to get them out of the way. When we talk about next generation sequencing, it's a tool, and when we talk about its challenges and issues, it really depends on how the tool is going to be used. I think we need to keep this in mind, even though it's very obvious. For example, if we are dealing with the human genome — what we sometimes call DNA sequencing — we are dealing with 3.2 billion base pairs; that's the size. A microarray probe is only about 25 bases. If we're talking about alignment, that's almost a non-issue for the microarray; it's very easy, and any aligner will do. But when you start to look at the various alignment algorithms for dealing with the human genome, that's an entirely different story.

Before I came, I did a very quick search on what the landscape looks like in terms of where next generation sequencing tools have been used. Very clearly, almost half of the studies use them to look at genetic variants — how those variants are related to human disease and how patients respond to drug treatment. This is a very important field, and it is the field that gave rise to personalized medicine. If you go to any meeting related to next generation sequencing, you will see that the cost of sequencing has decreased over the years — I'll spare you that graph — and this has had a large and significant impact on DNA sequencing.

But if we start to talk about RNA sequencing, even though the cost factor is still there, when you apply the methodology to toxicogenomics, for example, the cost of the animals becomes the bottleneck, not the sequencing itself.

So I just wanted to get that out of the way: when we're dealing with these issues, we really need to understand the purpose and the application. This point is related to my second point, and to what I'm going to focus on in today's presentation.

Now, my group runs a variety of projects using next generation sequencing. We do a little DNA sequencing and a little microarray work. We also have one project using next generation sequencing for food safety, particularly for food-borne pathogen identification and detection. But most of our work is focused on RNA-seq.

So most of my presentation will focus on observations and results from a project we just completed, called RNA sequencing quality control, or SEQC. That's how I say it, but the word on the street pronounces it "SEXY." So at least the FDA has a little humor in this area.

Three papers from this project were published in the current issue of Nature Biotechnology. Along with those three papers, there are another two papers using our SEQC samples and data, and another two providing commentary and discussion of the results and their implications. If you have not seen that particular issue — the current, September issue — I really encourage you to take a look at it.

Most of my talk focuses on the results from the SEQC project, and this is really related to my second point. When we talk about RNA-seq, the issues are slightly different from DNA sequencing, even though the two still share a lot in terms of how to manage the data, how to communicate the data, and so on. In RNA sequencing, particularly in gene expression studies, we have two objectives. One is to accurately determine which genes are up- or down-regulated — differentially expressed — and sometimes we also want to understand isoforms, gene fusions, and so on. That information enhances our understanding of the underlying mechanisms of disease and health.

The other major goal is to use these tools to develop gene expression-based predictive models, to predict a variety of endpoints, particularly for clinical use as well as for safety assessment. Come to think of it, this space has been occupied by the microarray for a long, long time — the microarray has been around for over 15 years, and many companies have invested tremendous effort in developing microarray-based biomarkers and predictive models. So the questions are: how are we going to deal with that, and how will the microarray-based investment play out in the RNA-seq era? Those are the sorts of issues we certainly need to talk about.

It turns out emotions run high when we talk about microarray versus RNA-seq. This is an email I sent to the consortium two weeks before the paper was submitted to Nature Biotechnology, and it is very clear there are two camps: one is microarray and the other is next generation sequencing. People have invested pretty much half of their professional lives in one particular platform, and suddenly you say, hey, you are obsolete, go away. So there are a lot of emotions involved. Even when we develop data standards, those emotions are still there, and we need to pay attention to that as well.

This emotion does not exist only in our consortium. When we submitted the paper to Nature Biotechnology, we had five reviewers, and with the editorial process and revisions it took a total of eight months to get the paper published. You can see that a lot of the comments were, again, about microarray versus RNA-seq. So I will talk a little about this aspect in my presentation as well.

Where are we in terms of the microarray compared to RNA-seq? I did a very quick search of GEO, the public data repository, where you can find all kinds of information. In a snapshot I took a couple of months ago, for 2014 alone, about six times more microarray data than RNA-seq data had been deposited. GEO started to see RNA-seq data back in 2006, and it has had microarray data all the way back to the early 2000s. If you look at how the data accumulated in GEO over each platform's first eight years, again there's about a 6:1 ratio in favor of the microarray.

We took those data and projected how long it would take RNA-seq to overtake the microarray, and the result was pretty interesting. The projection suggests RNA-seq data in GEO will reach the one-million-sample mark — the current number of microarray samples in the GEO database — and that probably around 2028 RNA-seq could entirely replace the microarray. This is really just to give you a sense: we will have a fairly long coexistence between these two platforms.

So some of the lessons learned from the microarray still have value, and that is actually my main topic today.

About ten years ago, the microarray was the hot potato in the research community. Everyone wanted a piece of it; even our institution established a core facility for microarray and gene expression studies. At the time, industry, of course, was at the forefront of applying the emerging technology, using it in both preclinical and clinical settings. So they came to the FDA and asked whether these kinds of data could be shared with the FDA to support some of their submissions. Certainly that generated a lot of anxiety in the FDA community — just like what we have right now with next generation sequencing, only on a much, much larger scale.

So in 2003 the FDA released a draft guidance to industry on pharmacogenomic data submissions. In this guidance we articulated a specific mechanism called the voluntary genomics data submission program. Essentially, we encouraged industry to send their genomics data to the FDA on a voluntary basis. Through this process we would have an effective dialogue between industry and the FDA, so we could work together to establish guidance on how to deal with genomics data.

At the very, very beginning, we set up three objectives for this program. First, of course, we needed a place to store the data. This is about how we capture the data submitted by the sponsors, because we believe this type of data is important to support our future regulatory policy.

Second, we wanted to reproduce the sponsors' results, because at times we do not know how they did their analysis, or whether their results can be reproduced.

And the third objective was to do our own analysis, and provide the agency's view on how these types of data should be analyzed and appropriately interpreted. So those were our three objectives.

Now, clearly, if you want to do this sort of thing, you need a robust bioinformatics infrastructure in place to do all of this work. So that's what we did in our lab: very early on we developed a tool called ArrayTrack, whose intended use was to handle microarray data. We repurposed this tool to support the voluntary genomics data submission program, focusing on microarray data analysis. The tool is still widely used by various parts of the FDA as well as by the research community, and right now we are refining it to deal with RNA-seq data as well.

During the process of using our tool to reproduce the sponsors' results, the first thing we found was that we were essentially never able to reproduce them. Whatever the document they sent us said — "we did this, we used the t-test" — that's not enough. There are half a dozen different types of t-test out there; which one? It's just not enough, and at the end of the day we had to say: send us a script so we can reproduce the results. So reproducing results is highly challenging in this area. And when we did our own analysis, the overlap between our findings and the sponsors' results was not even close, and the two led to entirely different biological interpretations. How are we going to deal with these issues?
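The reproducibility problem Dr. Tong describes is concrete: "we used the t-test" underdetermines the analysis. As a minimal illustrative sketch (not from any actual MAQC submission; all numbers are invented), Student's pooled-variance t-test and Welch's unequal-variance t-test give different statistics and different degrees of freedom on the same data:

```python
# Two common "t-tests" that a methods section might both call "the t-test".
# Standard-library only; the expression values below are invented.
import math
from statistics import mean, variance

def student_t(x, y):
    """Two-sample t-test with pooled variance (assumes equal variances)."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2  # statistic, degrees of freedom

def welch_t(x, y):
    """Welch's t-test: no equal-variance assumption, approximate df."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

treated = [5.1, 6.3, 5.8, 6.9]        # hypothetical log-expression values
control = [4.0, 4.1, 3.9, 4.2, 4.0]

t1, df1 = student_t(treated, control)
t2, df2 = welch_t(treated, control)
# The two variants disagree on both the statistic and the degrees of
# freedom, so "t-test, p < 0.05" alone is not a reproducible description.
print(f"Student: t={t1:.2f}, df={df1}")
print(f"Welch:   t={t2:.2f}, df={df2:.2f}")
```

This is exactly why the consortium ended up asking sponsors for the analysis script itself rather than a prose description.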

So that's why we started a very large consortium effort. We summarized the two biggest challenges. The first, of course, is quality control. Every time we received data from the sponsors, they had done a very reasonable quality control — but how good is good enough? Can we have quantitative metrics to define the quality control that should be done? That really is the first question.

The second question is analysis. Can we standardize the analysis? Actually, no. If you standardize the analysis protocol, you essentially hinder innovation, and that is not something we want to do. We should let the research community explore a variety of novel approaches, but we do emphasize the need for a baseline practice. So when you submit data to us — and I'm still talking about the microarray here, nothing to do with next generation sequencing — you need to include one particular, fully specified methodology so we can make sure the results can indeed be reproduced.

And lastly there is the cross-platform question. For the microarray, back in 2005, this was a huge issue; there were at least around ten different platforms out there. If you give the same samples to ten different platforms, the question that naturally gets asked is: do you get the same results or not? For next generation sequencing it's actually a little simpler, because the platform space is smaller, but these are still the questions we try to address.

So we established a consortium effort called MicroArray Quality Control — the MAQC project — and we were very fortunate. When we started the project, we got tremendous support from all the FDA centers and from the broad research community; clearly people saw the need for such an effort. Our objective was, first, to look into the technical performance of microarray technology, to see whether the technology itself is reliable or not — we needed to get that out of the way. And next, we wanted to understand whether the technology is reliable enough to be used for clinical applications as well as for safety evaluations. Those were our two objectives.

In order to do that, we realized we needed to approach the research community and reach a consensus with the stakeholders. So first we decided to make all the data — the conclusions, the results — available to the public. That's why we approached Nature Biotechnology, to see whether they were willing to entertain the results from this consortium. They said yes, and that's why Nature Biotechnology has consistently published our results. We also wanted to make sure the data are available for people to reproduce our results.

But we also took it a step further. We said the samples we use also need to be made publicly available, so that if anyone is crazy enough to go back and spend $1 million to reproduce our results, they can do it. So we approached a company and asked: can you keep the reference samples available for the entire community for ten years? They said yes. Those samples are still available, and they are the same samples we just used for next generation sequencing as well.

We actually ran this project sequentially, in three phases, and the first phase ended in 2005. Dealing with microarray technology, we looked at whether the technology is reliable; we looked at the cross-platform issues; we compared the microarray results with quantitative gene expression assays; and we investigated how the choice of bioinformatics method impacts the downstream biology, and so on. We published six papers in Nature Biotechnology, and of course it is a very happy thing to see your paper in a good journal. But I think the most exciting thing — the thrill for us — was that some of the key results were incorporated into the FDA's companion guidance to industry on how to send data to the FDA.

With that, we found the microarray technology to be reliable, particularly with respect to what kind of bioinformatics approach should be put in place to make the platform reproducible across different laboratories. Then we started to look into how this tool can really be used for clinical applications and safety evaluations. It turned out to be a huge, huge undertaking, a long quest: it took us about four years, with around 200 people from 86 organizations. The project had tremendous input from Dr. Greg Campbell from CDRH — he is an office director — who provided a lot of recommendations and suggestions about which direction the project should take. Again, at the end of the day, we had a bunch of papers published, some in Nature Biotechnology and some in The Pharmacogenomics Journal.

So why were we still working on this project? It was painful. Why were we working on it just as RNA-seq was starting to gain momentum? Actually, we lost quite a few consortium members, because people said, hey, this is much harder stuff to do — and they were gone.

Then there is the paper published in 2008. I looked just before I came: it has been cited over 3,000 times, and it is clearly the point of reference on the future of RNA-seq. What the paper says is that RNA-seq will replace the microarray. Period. It does not say how long that will take, but six years later, as I mentioned at the very beginning, we still see quite a bit of coexistence between the microarray and RNA-seq. There was tremendous euphoria associated with RNA-seq back in 2008, and then people familiar with the nature of this technology stepped in and said: wait a minute — aren't we standing in the exact same spot we stood in before, with the microarray? We should have some MAQC-type effort to gain understanding of the quality and the reproducibility.

So that's why we started the third phase of the project, called SEQC. We had a little over 180 participants from seventy-some organizations, and it took about six years. We started in 2008, and we stopped in the middle for two years because our Science Advisory Board came in and said: this technology is moving too fast; if you take a snapshot now, it will have changed by tomorrow, so why don't you just wait two years. So we stopped for two years, and then we realized this technology will never stop — you just cannot wait; you've got to take a snapshot. So that's what we did. We generated a little over 10 terabytes of data, and a huge portion of it is available in GEO. We produced about ten manuscripts: three in Nature Biotechnology, two in Nature Communications, three in Scientific Data, and two in Genome Biology. Clearly I cannot cover all the findings from this project, but I do want to point out that the three papers in Scientific Data are probably the most interesting, because they provide very detailed descriptions of the data we used in this project. We hope those data can play a role in the research community.

What I'm going to do in the next few slides is explain a little more about those datasets. The first dataset was generated to assess cross-laboratory and cross-platform reproducibility. We have six samples, called A, B, C, D, E, and F. As I said before, two of them, A and B, are the reference samples you can purchase — they are commercially available, generated for MAQC-I ten years ago; the exact same samples, the same batch, generated back then. C and D are mixtures of A and B: C is a 3:1 ratio of A to B, and D is the reverse. E and F are the ERCC reference samples. We distributed these six samples to seven Illumina laboratories, so they all had exactly the same samples for sequencing; we distributed them to four SOLiD laboratories, which also had the same samples; and we also distributed them to a Roche 454 site.

Once we got all the data together, we were really in a position to look into cross-laboratory and cross-platform reproducibility. And one particular benefit of this design is that, because C and D are defined mixtures of A and B, we know exactly what the ratios should look like, so we can use this built-in truth to assess the accuracy of the RNA-seq technology.
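The "built-in truth" of the mixture design can be sketched in a few lines. In linear expression units, a 3:1 mixture of A and B should measure as a weighted average of the A and B measurements; real SEQC analyses also had to correct for mRNA-fraction differences between A and B, which this toy sketch (with invented numbers) ignores:

```python
# Toy sketch of the SEQC titration check: C = 3:1 A:B and D = 1:3 A:B,
# so expected linear-scale expression is a weighted average of A and B.
# Numbers are invented; mRNA-fraction correction is deliberately omitted.
def expected_mix(expr_a, expr_b, frac_a):
    """Expected linear-scale expression of a mixture containing frac_a of sample A."""
    return frac_a * expr_a + (1 - frac_a) * expr_b

# Hypothetical linear-scale measurements for one gene:
a, b = 120.0, 40.0
c_expected = expected_mix(a, b, 0.75)   # sample C = 3:1 A:B
d_expected = expected_mix(a, b, 0.25)   # sample D = 1:3 A:B
print(c_expected, d_expected)  # 100.0 60.0

def titration_consistent(a, b, c_obs, d_obs, tol=0.10):
    """Do observed C and D fall within tol (relative) of the mixture prediction?"""
    c_exp, d_exp = expected_mix(a, b, 0.75), expected_mix(a, b, 0.25)
    return (abs(c_obs - c_exp) <= tol * c_exp
            and abs(d_obs - d_exp) <= tol * d_exp)

print(titration_consistent(a, b, c_obs=103.0, d_obs=58.0))  # True
```

Genes whose observed C and D values violate the mixture prediction flag accuracy problems without requiring any external ground truth.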

The second dataset is also very exciting. We took pediatric patient samples — 500 of them — and ran RNA-seq on all of them. We also ran microarrays, so we have both kinds of data: microarray and RNA-seq. Then we started to ask a question: which platform gives you an edge in terms of clinical utility? So we did that part.

The third dataset, also pretty interesting, is called the rat body map. We took ten organs from rats — not just from one rat, but from males and females at four different developmental stages. In total we have around 320 samples, and we ran RNA-seq on all of them. For the liver and the kidney we also ran microarrays. That really gives you a picture of gene expression at the high resolution that next generation sequencing provides. These are the papers published in Nature Communications.

The last dataset, also pretty interesting, is the toxicogenomics dataset — two datasets, actually: a training set and a test set. In the middle of the slide are the chemicals; you can see we have 15 of them. For each chemical we have three treated rats versus controls, so we can see what the differentially expressed genes look like across all 15 treatments. And for every three chemicals, we know exactly which mechanism was involved in causing the toxicity. We were also able to develop classifiers on the training set to predict the test set, which has a very similar design. This is the paper in Nature Biotechnology, if you want to take a look.

Again, as I mentioned, I really do not have time to go over all of these results — it's just impossible. But we do have a poster outside, and I will stand by it, so please come to me if you have any questions. I would love to talk with you about the various results we obtained from this project.

But I'm going to highlight a few findings that I personally find interesting. First of all, we do find the technology is absolutely reproducible, regardless of who does it and which platform you use. That's the good news. The bad news, on the other side, is that which bioinformatics approach you choose actually matters — in particular, how you filter out the lower-expressed genes. The largest variation we see is in how the lower-expressed genes are handled. That's the first point I'm trying to make.

The second point is very interesting. When we talk about the microarray, a lot of people say it's boring — it's just up and down. The reason we are so excited about RNA-seq technology is that it allows novel discovery, and many believe this is the only point we need to focus on, and that other things are really not important.

So we spent tremendous effort trying to understand how reproducible those novel discoveries are. We looked at the novel junctions identified by RNA-seq across different laboratories and different platforms, and we found that 80 percent of them can be verified by real-time PCR. This is great news — but as for their biological significance, we have no idea whatsoever. That's why a new concept was coined in one of the commentaries in Nature Biotechnology: we need to introduce the so-called "transcript of unknown significance," the equivalent of the "variant of unknown significance" in the DNA sequencing area.

Another finding is about differentially expressed genes. As I mentioned, RNA-seq is a gene expression platform, and one objective is to accurately profile which genes are differentially expressed. This is a pretty busy slide, so I'm not going to go through all of it; I'll just point out two things. First, when we compare the microarray with RNA-seq, the agreement depends on which samples you're working with. If you're comparing a normal sample with cancer, those two types of samples are so different that both platforms agree very well. But if you look at the same liver samples, one treated and one untreated, where only a few genes are differentially expressed, both platforms struggle and the overlap is very small. So keep in mind that which samples you're working with matters.

Second, we do find that RNA-seq has a tremendous advantage at the low end of transcript abundance, for the lower-expressed genes.

Lastly, here is another result I think is interesting. One time I went to a meeting where a lot of industry people were talking about the microarray. One company representative commented: my company has invested for the past ten years in generating microarrays — 20,000 of them every year. Now you are telling me to give up all those samples, give up all those microarrays, and redo everything with RNA-seq? That is not realistic. And then, of course, a lot of hands went up, and people said: yes, my company is in the same position. This is probably one of the reasons we see slow adoption of RNA-seq — the companies have invested so much. So the questions our consortium asked are: how will the legacy microarray data play out in the RNA-seq environment, and can biomarkers developed on microarray platforms be applied directly to RNA-seq data? And, moving forward — certainly RNA-seq is going to replace the microarray, there's no question about it — as more and more data are generated with RNA-seq, can RNA-seq-based biomarkers be applied back to microarray data to leverage the past investment? These questions are important. The result, on the right side of the slide, is pretty complicated; it summarizes this part of the investigation.

I think I've taken too much time, so finally let me just make a few points from my personal view. From working with many FDA reviewers and scientists in the area of genomics, particularly the microarray, and from working with the MAQC consortium members, I find a couple of things that need to be mentioned — at least, that need to be said.

First, NGS is a tool; whether it poses challenging issues largely rests on how you use it, and I think fitness for purpose is the key issue to focus on. Second, I think we should not aim for one standard, one shoe fits all. People have different foot sizes; that's why we have so many different shoe stores, and one-size-fits-all often just does not work out very well. So I think this should be kept in mind.

Third, of course, we do need to recognize the evolving nature of this technology. If how this technology evolved in the past ten years was scary, I guess one could be even more scared about the next ten years, because the technology is only going faster, not slower. So we really need to keep this in mind, and we should not let our emotions run too high when we talk about this issue.

And bioinformatics is certainly a significant part of this field; applying the bioinformatics approaches accurately is very important. One of the lessons we learned is that we are now in a very, very dangerous zone. We get tens of thousands of data points in an Excel spreadsheet; we have no idea what's in it and no time to look at it — and Excel will silently convert some of the data into dates, and you don't even know. And then you start to apply an algorithm, and you are very excited to get some results and move forward.

We see that again and again in our MAQC project. I'll just give you one quick example. We generated two datasets. For one, the endpoint was the sex of the animal, male versus female, but we did not tell the consortium that — we just said, can you predict this endpoint? The other endpoint was a random number, and we said both were critical endpoints. Guess what? Some people came out with fantastic predictions for the random data, and sometimes miserable ones for male versus female. So this issue is real, and it is very relevant whenever we are dealing with data, not just genomics data. I think we need, as in high-throughput screening, to build in positive and negative controls to minimize error in this process. That is also important, and certainly we cannot dwell on it.
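The random-endpoint trap Dr. Tong describes has a simple mechanism: with many more features than samples, picking the feature that best separates the labels on the same samples you score on looks "predictive" purely by chance. A toy sketch (not from the MAQC analyses; data and labels are randomly generated) shows why a built-in negative control such as a random endpoint is so valuable:

```python
# With 2000 random features and only 20 samples, selecting the single
# best-separating feature ON THE SAME DATA it is evaluated on yields
# high apparent accuracy for a purely random endpoint.
import random

random.seed(0)
n_samples, n_features = 20, 2000
labels = [random.randint(0, 1) for _ in range(n_samples)]  # random "endpoint"
data = [[random.gauss(0.0, 1.0) for _ in range(n_features)]
        for _ in range(n_samples)]

def best_threshold_acc(values, labels):
    """Accuracy of the best single-threshold rule (either direction) on these values."""
    n = len(labels)
    best = 0.0
    for cut in values:
        hits = sum((v >= cut) == (l == 1) for v, l in zip(values, labels))
        best = max(best, hits / n, 1 - hits / n)
    return best

# Selection and evaluation on the same samples = optimistic bias:
apparent = max(
    best_threshold_acc([row[j] for row in data], labels) for j in range(n_features)
)
print(f"apparent accuracy on a random endpoint: {apparent:.2f}")
# On held-out samples the same rule would hover around 0.5 (chance),
# which is exactly what a random-endpoint control is designed to expose.
```

If a pipeline reports strong performance on such a control, its feature selection or cross-validation is leaking information.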

That is why we're all here, and Dr. Wilson has done a fantastic job putting this together and starting the dialogue. Hopefully we will walk away with some solutions. Thank you very much. I do need to thank all the consortium members who supported this project. Thank you.

QUESTIONER: I have a question. Can you put in perspective the quantitative versus qualitative predictions from RNA-seq and microarray data — how quantitatively reliable each is, and how qualitatively reliable?

DR. TONG: Actually, if we're talking about biomarkers or predictive models, RNA-seq and the microarray are pretty much the same. Think about it: when we develop predictive models, we select only a few genes to differentiate health and disease. The microarray is not bad — it measures 40,000 or 50,000 genes, you pick two or three, and you certainly can find them. RNA-seq measures a little more, and you can find them too. So from the statistical point of view, we cannot see any edge in terms of the classifier. But if you are really engaged in mechanistic understanding, particularly trying to identify novel genes and how they contribute to disease and health, RNA-seq holds a lot of promise. We actually increased the depth to 10 billion reads — that was a luxury — and even at the 10 billion level we continually found new genes, and those new genes could be verified using real-time PCR.

QUESTIONER: Hi. Great talk, thank you very much. I have a question about the number of biological replicates you recommend. We have observed with RNA-seq that the reproducibility between biological replicates is not very high, and we have also seen batch effects. So my question is specifically: how many biological replicates are you recommending for RNA-seq, based on your experience?

17 DR. TONG: If I say more is better, then

18 certainly people -- this is a very, very good

19 question and back in the microarray era this same

20 question was raised. We found it still depends on

21 the samples. For example, in the first dataset I

22 mentioned, A and B were entirely different samples

92

1 and we can use very few replicas -- they can

2 generate reliable data. But if you look at the

3 two biological systems, it was so close. If you

4 want to determine which genes are differentially

5 expressed between these two conditions, you definitely need more

6 samples. So statistical power depends on the

7 endpoints you study. This probably did not really

8 answer your question.

9 QUESTIONER: Well, the thing is, at

10 least in our experience, we have observed a lot

11 more variability between samples and replicas for

12 RNA-seq versus microarrays. So, for example,

13 right now we're doing four biological replicas and

14 we tried to randomize the machines where we do the

15 different runs, and that seems to help. But I was

16 wondering if you had the same experience.

17 DR. TONG: Let me just address your

18 question from another angle. So in the toxicology

19 field, that's where I come from, and in the

20 toxicology field we normally use three animals,

21 sometimes four animals. That's it and we cannot

22 use many more because it's too expensive for the

93

1 animals. So most of the time we are trying to fix

2 our question and ask: if we're using three

3 animals, how much variation are we going to

4 introduce in this analysis? Rather than asking

5 how we can narrow that error down to a given level and how

6 many animals we would need, we ask a question

7 where we are not wrong.
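The two framings Dr. Tong contrasts -- fix the number of animals and ask how much error that introduces, versus fix a target error and ask how many animals are needed -- are the two directions of a standard power calculation. A minimal sketch using a normal-approximation two-sample test (the effect sizes, alpha, and target power below are illustrative assumptions, not values from the talk):

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Replicates per group needed to detect a standardized
    mean difference (Cohen's d) in a two-sided two-sample z-test."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)
    z_b = nd.inv_cdf(power)
    return math.ceil(2 * ((z_a + z_b) / effect_size) ** 2)

def power_for_n(n, effect_size, alpha=0.05):
    """The other direction: with n replicates per group fixed,
    what power do we have to detect the effect?"""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(effect_size * math.sqrt(n / 2) - z_a)

# Entirely different samples (large effect): few replicates suffice.
print(n_per_group(effect_size=2.0))
# Two very close biological systems (small effect): many more needed.
print(n_per_group(effect_size=0.3))
# Toxicology setting: three animals fixed, report the achievable power.
print(round(power_for_n(3, effect_size=2.0), 2))
```

The sketch matches the answer in the exchange: the required number of replicates is not a constant; it is driven by the effect size of the endpoint under study.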

8 QUESTIONER: Okay, thanks. Laura Van T

9 Veer, University of California, San Francisco. I

10 see on your last slide your very impressive number

11 of people and organizations with whom you have

12 collaborated. I was wondering whether in that

13 whole activity in sort of reaching out to other

14 organizations how much have you been working or

15 are interested to work with the College of

16 American Pathologists and institutes like the NIST

17 because I think it's a tremendous work you've

18 undertaken, which will yield very useful

19 information for everybody.

20 DR. TONG: Well, thank you very much for

21 the question. For the MAQC project and what we

22 did, we used the normal mechanism we would use in FDA.

94

1 We put it in the Federal Register, said the

2 project is coming and there is a deadline. Please

3 send your interest to us. So this is how we

4 constituted the consortium members. And we

5 certainly made some personal calls and brought in

6 some key persons we knew from reading the

7 literature. They contributed to this project. So

8 right now we are in the phase of MAQC-4, and we're

9 starting to try to decide which particular area

10 we're supposed to go into. And, hopefully, we can give

11 a much broader announcement and other people have

12 the opportunity to join our efforts.

13 ALIN: There was a question online.

14 Hannah asks "What types of biological reference

15 sources are best for reliability assessment? How

16 relevant are animal models?"

17 DR. TONG: Biological samples -- A and

18 B. Actually, probably for any samples you look

19 at, the most data was generated at A and B because

20 A and B became the standards. Everybody can do

21 the proficiency assessment. Using the A and B, you

22 go through the entire process and compare the data

95

1 we generated from our project as well as by the

2 research community. You immediately realize where

3 you are. So many laboratory tests use A and

4 B samples.

5 ALIN: One more question and then we'll

6 end it there.

7 QUESTIONER: Hi. This is Warren Kibbe

8 from NCI. A great talk. A lot of what you and

9 Vahan were both talking about was really around

10 reproducibility and lots of things. How do we do

11 this in both a QC way and a reproducible way? And

12 it seems like the part that you didn't talk about

13 is how do we not just package up all of our

14 scripts, but how do we really do this in a more

15 permanent way so other folks can actually use

16 exactly the same versions of software, the same

17 versions of the analysis? How do you see that

18 working into the framework that you're talking

19 about?

20 DR. TONG: Well, I hope Vahan can

21 eventually -- you know, we're working together and

22 we can come up with a solution for that. And no,

96

1 we have not really addressed this issue. And we

2 actually do it the opposite way. We have tried to

3 give much more freedom to the scientists to use

4 whatever they want because we are coming in with

5 no hypothesis and we don't know what is the best

6 way. We don't want to lock down.

7 But as I said in the very beginning, we

8 are emphasizing some baseline practice. We tried

9 to identify methods whose results are always

10 reproducible. So with this methodology -- if

11 someone sends data to the FDA and says okay, you

12 can do whatever you want to do, but make sure you

13 include that particular pipeline. So I think this

14 is always in our mind when we're working with the

15 project.

16 DR. SIMONYAN: To answer that question

17 also, the concept of biocompute objects is

18 exactly designed to address that issue, so it's

19 not just within the consortium. SEQC can use

20 the same kind of reproducible protocols, but hopefully the

21 whole world will be able to use them as well.

22 And I know NCI/NCBI IT has the harmonization

97

1 efforts also, and we will definitely need your

2 input. We already work with your team members and

3 we are pleased to have the discussion continue.
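To make Dr. Kibbe's concern concrete: the minimum a "biocompute object" has to capture is which tools, which exact versions, and which parameters produced a result, in a machine-readable form another lab can replay. A minimal sketch (the field names and pipeline steps are illustrative, not the published BioCompute Object schema):

```python
import json

def provenance_record(name, steps):
    """Serialize a pipeline so another lab can rerun it with
    exactly the same software versions and parameters."""
    return json.dumps(
        {
            "name": name,
            "pipeline": [
                {"tool": tool, "version": version, "parameters": params}
                for tool, version, params in steps
            ],
        },
        indent=2,
        sort_keys=True,
    )

# Hypothetical two-step RNA-seq pipeline, pinned step by step.
bco = provenance_record(
    "RNA-seq differential expression",
    [
        ("aligner", "2.7.0", {"seed_length": 20}),
        ("quantifier", "1.3.1", {"min_count": 5}),
    ],
)
print(bco)
```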

4 So the next speaker is Dr. Amnon Shabo.

5 He's the Chair of Translational Health Informatics

6 from the European Federation of Medical

7 Informatics, and he's going to give us the

8 perspective from an HL7 protocol also. As you

9 well know, this is one of the most frequently used

10 protocols for exchange of biomedical information.

11 So hopefully we can see that perspective of how do

12 the results of our computations eventually get

13 communicated to different health organizations.

14 DR. SHABO: Thank you and good morning.

15 Thanks for the invitation. So this stuff will be

16 closer to the phenotypic side of the world rather

17 than the genotype and NGS and all this stuff that

18 you all are doing. So I call this talk

19 "translational and interoperable health

20 infostructure, information infrastructure -- the

21 servant of three masters."

22 So what I've been trying to do in the

98

1 past 14 years where I have been working with IBM

2 Research is to develop information infrastructure,

3 or in short infostructure, that could really serve

4 both research and health care. As we know

5 different stakeholders in the health arena are

6 developing their own information infrastructure

7 and they are quite distinct. They are quite

8 different. If you look, for example, at EHR,

9 electronic health record system of a health care

10 provider, there might be even some genetic data,

11 but most likely clinical data. And on the other

12 side of the continuum if you look at the

13 information infrastructures of laboratories

14 running NGS, I guess these are like two very

15 different systems. But eventually, as the

16 previous speaker said, NGS and all other platforms

17 are just tools and eventually we would like to

18 bring the data and the interpretation of the data

19 into the clinical space. And I think

20 translational is really the game, the playground,

21 here. And I think NGS is probably a good tool for

22 translational efforts because it really gives you

99

1 a lot of data and translational is mostly

2 data-driven, bypassing the traditional classic,

3 scientific methodologies most often.

4 Some of the ideas are published in a

5 pharmacogenomics paper that I published - "The

6 Patient-centric Translational Health Record." So

7 that's another focus that I'm bringing to the

8 table. So eventually the health record -- those

9 that are mainly maintained by health care

10 providers -- but we also see emerging today the

11 personal health record. The idea is an

12 environment where you get the full picture of the

13 health conditions of the patient, and that's the

14 ideal environment for actually bringing in genetic

15 and genomic data so they could be taken as another

16 input in the interpretation process and the

17 clinical decisions support process that is being

18 done by humans and machines.

19 So this is just a very -- briefly to

20 give you my motivation and passion for many years

21 already, even before I joined IBM, and it's really

22 about a methodology that is in contrast, or

100

1 complementary, to the main methodology used today

2 in clinical decision support and expert systems in

3 general, which is the rule-based reasoning.

4 There is another alternative

5 methodology, which is called case-based reasoning.

6 It's not machine learning. It's different. And

7 you actually are running ontological comparisons

8 between the cases that you are now treating and

9 some kind of a case base, looking for the most

10 similar case to what you are treating now, and

11 using the data that is in the most similar case or

12 the similar cases in order to refine and get

13 insights on how better that case could be treated.

14 And by the way, the biggest success story of

15 case-based reasoning is actually in help desks for

16 products just coming to the market, totally unrelated to

17 health. The manufacturer doesn't really know much

18 about the defects and the problems that are

19 happening. The customers are very angry and those

20 who are answering the phones are not really so

21 knowledgeable. And so case-based reasoning has

22 been tried there very successfully.

101

1 Now this is very human, actually, if you

2 think about humans and how they are going about

3 reasoning. Most often they kind of try to echo a

4 very similar case that they have seen or their

5 colleagues have seen or they read in the

6 literature and obviously also are based on rules.

7 They take clinical guidance, for example, in the

8 clinical environment, but it is kind of what we

9 call also intuition in a sense that it kind of

10 brings up similar cases. So I'm saying this is

11 very human and I think that we have to acknowledge

12 the fact that in the health arena, what we don't

13 know is much, much, much more than what we know.

14 That's why I think case-based reasoning could be

15 a very appealing complementary methodology to

16 refine the decisions that you are making based on

17 rules.
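The "retrieve" step of case-based reasoning -- compare the case being treated against a case base and reuse the most similar case's outcome -- can be sketched in a few lines (the features, weights, and outcomes below are invented for illustration; a real system would compare full lifetime records ontologically):

```python
def similarity(case, query, weights):
    """Weighted fraction of features on which two cases agree."""
    matched = sum(w for f, w in weights.items() if case.get(f) == query.get(f))
    return matched / sum(weights.values())

def retrieve(case_base, query, weights):
    """Return the stored case most similar to the query case."""
    return max(case_base, key=lambda c: similarity(c, query, weights))

weights = {"diagnosis": 3.0, "mutation": 2.0, "age_band": 1.0}
case_base = [
    {"diagnosis": "NSCLC", "mutation": "first EGFR variant",
     "age_band": "60s", "outcome": "responsive to the drug"},
    {"diagnosis": "NSCLC", "mutation": "second EGFR variant",
     "age_band": "60s", "outcome": "resistant to the drug"},
]
query = {"diagnosis": "NSCLC", "mutation": "second EGFR variant",
         "age_band": "50s"}
print(retrieve(case_base, query, weights)["outcome"])
```

The retrieved case's outcome is then reused or adapted to refine the decision, which is the complementary role to rule-based reasoning described above.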

18 Now, that leads actually to the

19 question, what is a case? And that's a big

20 question in the case-based reasoning field. In

21 many domains, it will be obvious. But in

22 the health space, that's a big question. Is it

102

1 just an episode? Is it just this hospitalization

2 or this visit that I made to the physician? No, I

3 came to the conclusion that this is the entire

4 lifetime electronic health record. And then the

5 next question that should be asked, if we agree on

6 that, is how could we sustain -- it's

7 sustainability here. How can we sustain,

8 preserve, lifetime electronic health records? And

9 those lifetime electronic health records should

10 contain all data collected about the patient,

11 including data that the patient generated by

12 himself, by herself. And that could also

13 include NGS data and any other type of data,

14 sensor data, imaging data, what have you. So I'll

15 get to this point of health record sustainability

16 at the end of my talk.

17 So, again, translational is really the

18 relevant playground here I think. I guess you are

19 familiar with the field and the barriers from

20 bench to bedside, from bedside to community, and

21 community to policy. Many successful

22 interventions at the bedside do not scale out to

103

1 the community and definitely not to the policy.

2 And that's because of many factors and some of

3 them are not related only to the biology --

4 socioeconomic, bioethical and so forth -- so all of

5 them have their own kind of informatics so to

6 speak. So I'm an informaticist. I'm looking at it

7 from the information point of view. In order for

8 translational to succeed, we need to somehow come

9 up with some harmonization between the languages

10 that all those disciplines are talking. And

11 obviously within biology and health, there is

12 already plenty of languages and different formats

13 that we are coping with. But don't forget that

14 there is a language of the economist and the

15 ethicist and all those experts that are actually

16 feeding more and more factors and constraints and

17 considerations. Otherwise we'll eventually fail

18 to scale out the successful intervention that we have

19 been able to get to the bedside, from the bench

20 to the bedside. So this is a new kind of world.

21 Today is the biomedical informaticist kind of

22 bridging between the medical informaticist and the

104

1 bioinformaticist.

2 So in the U.S. you have the PCAST

3 report, if you are familiar with that. This is

4 the President's Council of Advisors on Science and

5 Technology. And in 2010 they came up with a

6 recommendation to create and disseminate a

7 universal exchange language. I think that's very

8 key to what we are talking here today. We are

9 talking here today about NGS standards, but we all

10 have to keep in mind that NGS is just one

11 component and one input out of many, many types of

12 inputs that are going into this kind of health

13 language so to speak.

14 In the next few slides I will show you a

15 few key principles that I think should be kept no

16 matter what format you are actually using, no

17 matter what standard you are developing. This is

18 the kind of informatics principle that I've been

19 promoting and developing over the years, over

20 the past 15 years. And first and foremost, the

21 most important to me that I present as a key

22 principle is that we cannot really convey and

105

1 represent clinical genomics data,

2 genotype/phenotype association in this case, or in

3 general health data in a flat manner. So I'm

4 saying that flat representations are flat tires.

5 And why? Because it's really all about the

6 associations between the data. So if I had

7 inflammation and that was an indication to go

8 through some operation, inflammation in the gall

9 bladder, for example, and I went through an

10 operation to remove the gall bladder. So there is

11 an indication here that is indicating the

12 operation. So there's got to be this association

13 in order to understand later on for humans and for

14 machines, especially for machines, why this

15 operation took place. And then maybe there were

16 complications, so again, another association to

17 the complications. Then maybe I was prescribed

18 medication; another association to the medication,

19 and so forth and so forth. So it's all about

20 context. It's all about association of different

21 data items that today are quite disassociated.

22 They are quite discrete. And that does not make

106

1 for easy interpretation later on -- downstream,

2 as we like to call it also in the genomic area.

3 But downstream is all the way downstream to the

4 patient himself, to the body of the patient. So

5 that's really the most important principle, have

6 it really represented in kind of a statement

7 model. And that's basically the base, what we

8 call the clinical statement model. The codes on

9 the left side are drawn from medical terminologies.

10 They are inserted into the basic building blocks

11 like the patients' procedures and medications and

12 so forth. But then -- and that is being done

13 already today and quite nicely -- but then they

14 should be put within some syntax language into

15 statements. And those statements are already the

16 basis of some common language that we can then use

17 in different domains, clinical domains I mean.
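The gall bladder example above -- inflammation indicating an operation, which may link onward to complications and medications -- can be sketched as statements carrying explicit associations (the class, relation, and code values are illustrative, not the HL7 clinical statement model):

```python
from dataclasses import dataclass, field

@dataclass
class ClinicalStatement:
    kind: str    # building block: observation, procedure, medication...
    code: str    # terminology code drawn from a medical vocabulary
    text: str    # human-readable narrative
    links: list = field(default_factory=list)   # (relation, statement)

    def associate(self, relation, target):
        self.links.append((relation, target))

inflammation = ClinicalStatement("observation", "obs-001",
                                 "inflammation of the gall bladder")
operation = ClinicalStatement("procedure", "proc-001",
                              "removal of the gall bladder")
medication = ClinicalStatement("medication", "med-001",
                               "post-operative medication")

# The associations are exactly the context a flat record loses.
operation.associate("has-indication", inflammation)
operation.associate("followed-by", medication)

for relation, target in operation.links:
    print(relation, "->", target.text)
```

A downstream machine (or human) traversing the links can answer why the operation took place, which is impossible when the same items sit disassociated in flat tables.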

18 And that's a simple example that is

19 taken from a project that I led the IBM

20 contribution to it. It was a GWAS on

21 hypertension, essential hypertension, like 25

22 cohorts funded by the European Commission, 25

107

1 cohorts from all over Europe. It was done by one

2 million SNP genotyping. This was back in 2008.

3 But clinical data, very rich clinical data, about

4 500 fields just around hypertension, everything

5 that you can think of that is related to

6 hypertension and eventually may make this GWAS the

7 most successful in finding on the one side the

8 variants, but on the other side what does it mean?

9 What are the biological processes underlying it,

10 and what does it mean clinically wise?

11 So we get data and you all know it. I

12 mean most of you are also involved in research,

13 and we get data from research typically in

14 spreadsheets in relational databases. So, on the

15 bottom you see blood pressure, heart rate, being

16 measured along some kind of a timeline. In a

17 different spreadsheet you see an indication of

18 taking an anti-hypertension drug, again along

19 some timeline. Apparently, if we talk with the

20 researchers, those timelines are the same for the

21 same patient and this context is essential. I

22 mean without knowing that the patient is on anti-

108

1 hypertension drug, compared to another patient

2 that is not, the measurement of the blood pressure

3 and the heart rate are a bit -- well, I don't want

4 to say useless, but they are out of context and

5 analysis of it might lead to wrong conclusions.

6 So what I'm saying is just represent it

7 explicitly. That's what I'm advocating for. And

8 it's not rocket science. It's actually very

9 simple. You have, as the previous -- oh, I don't

10 know who mentioned it -- we know it's working

11 nicely in programming in the computer science

12 world. It's here the same, you know, design of

13 the program and design and modeling. You have the

14 objects. You have the observations. You just tie

15 them together based on some common language. Of

16 course, we need to agree on common language. It's

17 not that simple. We don't tend to agree so much

18 as humans. But anyway, you can relate to it

19 semantically, and you are now much better off in terms

20 of conveying this information to another party, to

21 a clinical decision support application, to a

22 colleague of yours.

109

1 Now, going to the genomic side: So

2 obviously we would like to have clinical genomic

3 statements. So that's just a study co-written by

4 OMIM from like 10 years ago I think. And they've

5 been doing those studies on the EGFR, somatic

6 mutations related to the non-small-cell lung

7 cancer, and that's a description of a single

8 patient. This is a study published in the

9 literature. A single patient was under complete

10 remission for the first two years because he had a

11 specific somatic mutation in the EGFR that caused

12 a responsiveness to the drug. After two years

13 that guy had a second somatic mutation in the EGFR

14 gene, and this second somatic mutation was

15 resistant to the same drug and at that time, 10

16 years ago, this knowledge was not known yet. This

17 guy went into relapse.

18 So if you just convey the variants by

19 themselves, and the clinical descriptions, to another

20 system in some clinic or medical record systems,

21 they are not tied together. It's really hard to

22 bring about decision support and reasoning

110

1 engines, and all those nice things that computer

2 scientists know to do and develop. So I'm saying

3 there are variations. There is the phenotypic

4 side. We need to tie them together. We need to

5 distinguish between observed phenotypes so I know

6 that this patient is responsive to this drug.

7 This has been observed already. I can make the

8 distinction that this phenotype is observed as

9 opposed to interpretation. I guess most of you

10 here are more interested in the interpretation, of

11 course, because you are just basically testing and

12 having the data kind of available to downstream

13 analysis, and you are interested in the

14 interpretive or the potential phenotype of that

15 genomic observation.

16 So that's kind of the expansion of the

17 model of the clinical genomic statement model.

18 It has already been used in HL7 standards. Now,

19 how many of you -- show of hands -- heard about

20 HL7? Oh, wow, that's nice. I was expecting a

21 lower number. So HL7 is the most dominant

22 organization in the world of developing health

111

1 care standards. So 10 years ago, together with a

2 few other members of HL7, I founded the Clinical

3 Genomics Working Group. And we have been

4 developing several standards in Version 2, in

5 Version 3. And recently I also developed a

6 genetic testing report, which is already published

7 in a document structure. Now this document

8 structure is based on structures that have been

9 already adopted and recommended by the Meaningful

10 Use Criteria in the U.S. How many of you have

11 heard about the Meaningful Use Criteria? Wow, not

12 so many. So Meaningful Use Criteria, this is

13 meaningful use of health care IT in the U.S. This

14 is a mechanism of incentivizing health care

15 providers to use appropriate health care IT

16 technologies and there are several criteria. Some

17 of them are in the area of standards. The

18 document standard has been adopted, documents for

19 summary, summary in general, all those referral

20 letters and so forth. So now in addition to this

21 collection that is being pushed by the government,

22 by the way, by the HHS -- which is also called the

112

1 Consolidated Clinical Document Architecture -- in

2 addition to that we would like to bring in a

3 genetic testing report document on top of the same

4 technology platform -- or framework, better to say --

5 that could be consumed by health care providers.

6 So now when this is consumed by health care

7 providers, we're actually opening the door. Now

8 I'm kind of wearing the hat of genomics. We are

9 opening the door for us to push and bring data

10 into the EHR environment, into the information

11 system environment of the health care providers.

12 The next principle, informatics

13 principle, is that we must keep in mind that

14 narrative and stories, stories that are told by

15 the physicians, what they have been kind of

16 thinking of our health situation and stories that

17 we are telling, that's in the context of personal

18 health records and so forth, these are not going

19 to go away. These are the narratives and they

20 contain much richer information quite often and

21 it's got to be in the equation. And what we

22 actually need here is kind of a balance between

113

1 narrative and structured data. Of course, without

2 structured data it will be hard for analysis and

3 decision support verification to act upon the

4 data, but the narrative is there. We can analyze

5 it through natural language processing, but this

6 is not so patient safe. It could be nice for

7 research, nice for the IBM Watson Program, if you

8 heard about that, but I think not that safe for

9 what we call patient care.

10 So anyway there's got to be some

11 coexistence between narrative and structured. We

12 should formulate interlinks between both of them,

13 but this is kind of inherent redundancy of the

14 data within the formats that we are developing.

15 And I guess also on the genomics side there is a

16 lot of narrative as well, descriptions of what has

17 been done and all kinds of things.

18 So the Clinical Document Architecture,

19 the standard that I've been alluding to, is really

20 kind of the major platform where we can convey

21 both structured and narrative data. That's the

22 basic structure of it. I'll go through it

114

1 quickly. That's the structure of the genetic

2 testing report. In some years, I think it was

3 like a couple of years ago, we got participation

4 in the HL7 Clinical Genomic Group from Illumina

5 and Life Technologies. And we have been actively

6 trying to use this genetic testing report not only

7 for the kind of testing of known mutations and the

8 more classic genetic testing, but also for NGS.

9 We got some samples of NGS reports from I think it

10 was Illumina. We started working on bringing at

11 least the summary level of it into this document

12 format that could be consumed, again, by health

13 care providers.

14 That's the layout of it. I'll go

15 quickly through that. That's a rendering of it.

16 As you remember, it has the rendering mechanism,

17 but it can be human-to-human communication. And

18 underlying it with inherent redundancy the data is

19 being also structured; not all of it, of course.

20 There is a summary section there, saying

21 that okay we have been doing those testing. In

22 this specific sample we are talking about GJB2

115

1 full gene tests and some deletion tests and some

2 hearing loss mutation tests in the mitochondrial.

3 So each of these tests has its own interpretation,

4 whether it's pathogenic or benign or whatever.

5 And then there's got to be some kind of an overall

6 interpretation and that's the gist of this report

7 actually. This report could be consumed maybe

8 even by a primary physician or by someone that is

9 not that expert in all the genetic testing. And

10 they want to see just the overall interpretation.

11 So there's got to be, again, some kind of a

12 rule-based engine that will take all of the

13 interpretations together and digest them and then

14 see how those can be conveyed in one overall

15 interpretation. This is being done also by

16 GeneInsight and now spinning out as a company.
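The overall-interpretation step described above -- a rule that digests the per-test interpretations into one summary call a non-specialist can read -- can be sketched as a worst-severity rule (the severity ordering and test names are illustrative assumptions, not a clinical guideline):

```python
# Severity ladder, least to most severe; an illustrative assumption.
SEVERITY = ["benign", "likely benign", "uncertain significance",
            "likely pathogenic", "pathogenic"]

def overall_interpretation(per_test):
    """Most severe interpretation across all tests in the report."""
    return max(per_test.values(), key=SEVERITY.index)

report = {
    "GJB2 full gene test": "pathogenic",
    "deletion test": "benign",
    "mitochondrial hearing-loss mutation test": "uncertain significance",
}
print(overall_interpretation(report))
```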

17 Another key principle is that we've got

18 to develop those standards, using tooling, using

19 modeling tools, because quite often and even

20 within HL7, which is really as I said the major

21 standardization body, standards are being

22 developed just by writing text actually. As an

116

1 implementer what you get is a PDF document. And

2 now go figure how you can really take all those

3 fields and all those constraints and instructions

4 and guidance. You have to eyeball them and

5 somehow retype them into your specific solution.

6 So most often those things will be error prone and

7 you'll get the implementations actually different

8 from each other.

9 So if we are starting with a tool and

10 the tool already has the base language

11 within it -- in this case here I'm talking about

12 the Clinical Document Architecture, it could be

13 any other kind of base language that we all agree

14 on -- then all I need to do is actually take the

15 tool and constrain it and refine it to the needs

16 of what I'm actually developing, whether it could

17 be a report, NGS, a genetic testing report that

18 will go into the clinical environment, then I'm

19 kind of confident that I'm doing design and

20 development of a standard that is actually fully

21 aligned with the base language because the base

22 language is actually within the tool. And if the

117

1 tool does not allow you to diverge. And

2 divergence is really the most kind of dangerous

3 thing in standard development and there is

4 obviously the joke that the nice thing about

5 standards is that there are many to choose from.

6 And we standard developers are really good at

7 that, continuing to develop more and more formats,

8 thinking that each format could be really good for

9 another use case. We have the right excuses

10 always, but that's the situation.

11 Now coming I think even closer to the

12 topic of this workshop is raw data. So that's

13 another key principle in the clinical environment,

14 in clinical standards. Eventually we need to be

15 able to encapsulate raw data into the clinical

16 structure. We cannot encapsulate everything,

17 depending on its amount. What I'm advocating

18 actually is for the encapsulation of raw data, but

19 only key chunks of the raw data that we think are

20 relevant for the treatment of the patient, keeping

21 references back to the full-blown raw data; and

22 most importantly, not trying to remodel the raw

118

1 data. Now, the raw data could differ. You could

2 discuss today different formats for NGS, fine,

3 maybe there will be eventually just one if you're

4 all able to agree, that's fine. So we use that.

5 If not, if there are two alternatives, that's also

6 fine. So we encapsulate whatever format we are

7 getting into the clinical environment and we just

8 specify in the field where the encapsulation is

9 being done, what is the exact schema, the version

10 of the schema of that NGS standard. So let's

11 assume for a second that there is one like that.

12 So I'm not trying to remodel it. I'm not trying

13 to block and say there's got to be only one

14 standard that is encapsulated. There could be

15 multiple, as long as they are acknowledged in the schemas.

16 We have to admit that we would have wanted

17 just a single standard.

18 So that's the basic idea, having genomic

19 data sources, laboratories, for example, sending

20 data to clinical environments not just

21 interpretations, including key chunks of raw data.

22 If it is not really the whole genome, then you can

119

1 actually send the entire raw data into it. And

2 then in EHR, in the health care provider system,

3 that's the ideal environment to get out the most

4 up-to-date knowledge, to get the full-blown

5 picture of the health condition of the patient,

6 and altogether fuse it and come up with what is

7 recommended actually for the clinicians.

8 And the most important is this other

9 matter, reanalysis. So reanalysis is now being

10 enabled if you go with this concept because the

11 raw data is there and you can reanalyze it by

12 different algorithms or later on when there are

13 new discoveries and you want to actually parse

14 again the same raw data. So that's very important

15 and not really practiced today.

16 This is just an example of the

17 implementation that we have been doing, so XML

18 tagging. The outer XML tags are actually coming

19 from the HL7 world and the inner are just -- this

20 has been done actually 10 years ago -- but it

21 could be as well valid today. This is BSML. I

22 don't know how many of you heard about BSML,

120

1 the Bioinformatic Sequence Markup Language? That

2 was done by a grant from the NLM I think, but now

3 is unfortunately dead. But it is just a format

4 for describing sequences. I guess again today

5 there are other languages. They could be

6 encapsulated in the same manner. Instead of BSML,

7 it would be like, I don't know, NGS markup

8 language or what have you. And it shouldn't even

9 be XML. It could be binary data or whatever. XML

10 is nice because the fusion is very clear to the

11 implementers, not to the consumers. The consumers

12 are not really interested in the underlying

13 presentation, but most often the implementers are

14 trying to understand what data they are dealing

15 with.
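The encapsulation pattern in that example -- an outer HL7-style XML element that carries the raw payload verbatim and declares the inner schema and version instead of remodeling it -- can be sketched with ElementTree (all element names, attributes, and the reference URI are illustrative, not the actual CDA or BSML vocabularies):

```python
import xml.etree.ElementTree as ET

def encapsulate(raw_xml, schema, version):
    """Wrap raw data in an outer clinical observation, declaring the
    inner schema/version and keeping a reference to the full raw data."""
    obs = ET.Element("observation")               # outer, clinical side
    value = ET.SubElement(obs, "value", mediaType="text/xml",
                          schema=schema, schemaVersion=version)
    value.append(ET.fromstring(raw_xml))          # inner payload, as-is
    # Hypothetical pointer back to the full-blown raw data set.
    ET.SubElement(obs, "reference", href="file:///archive/run-001")
    return obs

raw = "<Sequence id='seq1'><Seq-data>ACGT</Seq-data></Sequence>"
doc = encapsulate(raw, schema="BSML", version="3.1")
print(ET.tostring(doc, encoding="unicode"))
```

Because the schema and version are declared rather than remodeled, the same wrapper carries BSML today or any successor NGS format tomorrow, and the reference back to the full raw data is what makes later reanalysis possible.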

16 I'll skip that. That's our recent

17 effort to develop a kind of standards-

18 independent clinical genomic statement domain

19 information model, which has just been validated

20 with all those principles already inside.

Now I'm switching gears into something that is kind of broader, again going back to the translational playground. And in the translational playground there is eventually much more to it. What I'm showing in this chart is kind of a roadmap that I'm pursuing within the European Federation of Medical Informatics, the Translational Health Informatics Working Group. The idea is that at the top of it you see knowledge. Knowledge standards: starting from the top left, which is about scientific literature; in the middle, of course, all the terminologies; and then on the right side, decision support formats. And at the bottom of it, the blue boxes are the formats actually used at the point of care.

Now, the middle layer is really the challenge here. In the middle layer, if you start from the left, those are things that are closer to the research. If you go to the right, those are things that I call the bridge standards; they are bridging between research and point of care. And these are the three bridge standards that already exist today. There is the GTR that I already mentioned, the genetic testing report. There is the DIR, which stands for diagnostic imaging report: in the same manner as the encapsulation of genomics, you can encapsulate imaging as well, so it points to key images, not the entire CT set — only to the key images that summarize the diagnostic report. And then the PHMR, the personal health monitoring report, which is really in the space of sensors and wearable devices and so forth. So this could lead nicely into the clinical environment.

Let's move on. One of the things that was in the middle area was also in this paper, where they came up with what I would call a kind of packaging format. They called it iPOP, the integrative Personal Omics Profile. There was a paper published in Cell, reviewed in Nature, and they called it "The Rise of the Narciss-ome." One of the authors was a healthy guy who went through all the omics testing that you can think of — whole genome sequencing and so forth, all the clinical events recording and everything — for about two and a half years. And they were trying to see whether there is any value in doing all this omics testing over time and not just one-off. And, indeed, what is reported in this paper is that at the beginning of the study he was found predisposed to diabetes. It may sound to you like a joke, but at the end of the study he was actually diagnosed with diabetes. So in the interview he says that he thinks that, based on the predisposition he was found with, he changed his lifestyle, and then the actual onset of diabetes was less severe than it could have been if he had not known about it.

So that's the project that I've been telling you about briefly. It's called HYPERGENES and it was about essential hypertension — a genomic study — and that's the translational health informatics architecture that we came up with. We implemented it, and it was used by all those clinical centers that contributed the 25 cohorts. They all had specimens, and those specimens went through one-million-SNP genotyping.

So what you see — the principles here are that at the heart of it is the warehousing, but not in the regular sense that people usually think of warehousing. This warehousing was standards-based, so based on those clinical standards I've been describing so far, including the genomics component, with cross links to the mass data. So the red repository is really the clinical data; that's kind of the entry point. And then those green repositories are the mass and raw data coming from omics, sensors, and imaging. Those are standardized, and the principle to keep in mind is that this is the only place where we keep the original data in as rich a form as possible. That's where it can be maintained. It can also be a good place to do exploratory analysis, but once you want to get into a specific analysis, then — if you look at the top of the chart — you go and produce some kind of what are also called data marts, but I call them information marts. Why information? Because out of the data in the warehouse, which is again the original data — standardized, but original data in its richest form — I am producing some kind of information with a different perspective and a different view, and that's the information mart that actually gets analyzed. One of the best known ones is tranSMART, though the confusion around the terminology is that they call tranSMART a warehouse as well. How many of you have heard about tranSMART? It is kind of an extension of i2b2, which is like the warehousing machinery for getting EHR data into research.
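The warehouse-versus-mart distinction described above can be sketched in a few lines: the warehouse keeps the original records once, and each analysis derives its own disposable view. Everything here — record shapes, codes, field names — is hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical standardized warehouse records: the original data,
# in its richest form, kept in one place.
warehouse = [
    {"patient": "p1", "code": "SBP",     "value": 150,   "visit": 1},
    {"patient": "p1", "code": "SNP:rs1", "value": "A/G", "visit": 1},
    {"patient": "p2", "code": "SBP",     "value": 120,   "visit": 1},
]

def build_information_mart(records, codes):
    """Project the warehouse into one analysis-specific view (an
    'information mart'): one row per patient, one column per code.
    The warehouse stays untouched; the mart is derived and disposable."""
    mart = defaultdict(dict)
    for r in records:
        if r["code"] in codes:
            mart[r["patient"]][r["code"]] = r["value"]
    return dict(mart)
```

A different analysis would simply build a different mart from the same warehouse, which is the separation the talk is arguing for.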

Let's move on. I guess I have just a few minutes left. That's the principle I've been talking about, and that's how we've been doing it for this hypertension study. These are the specific standards we picked, constrained, and interlinked to come up with the standards-based information model governing the warehousing. An ontology must be developed for each specific solution; you cannot avoid it. You cannot always find the right terms in the common terminologies.

I'll skip that. That's another issue, but I don't have time for it: expressiveness versus interoperability. The more expressive you are, the less interoperable you are. That's a big source of tension between health care and research. In research we would like to be very expressive, to keep all the details of the data. In health care they are now more and more interested in interoperability, the exchange of information between providers for care coordination — that's the latest hype. How can we coordinate the care we are giving to the patients?

So I'm coming now to the last part, and I'll do it very quickly; I already alluded to it in the first slide of my talk. Eventually, as I said — if I'm coming back to case-based reasoning — the case to me is really the lifetime electronic health record, including any genomic data, sensor data, and imaging data. And in those datasets, we all know that most of the data is of unknown significance. But with case-based reasoning, we could actually compare those data of unknown significance ontologically and still come up with some insight that would help us refine our clinical decision support.

So just to be on the same page, what is the difference between an EHR and medical records? Medical records are being expanded along three dimensions: institutional; time, that is, longitudinal; and content, from medicine to health. What is the base architecture of an EHR? Again, the main benefit here is not just the temporal data — medical records, charts, documents, and so forth — it has to be in the summary of it. If we don't create the summary, it's almost useless for the clinician at the point of care, who has just 5 or 10 minutes for giving us the best care. So the summary also includes personal genetic evaluation and genetic-based information — it's all there in the spirit of having summaries as normally done, but also topical summaries around diseases, around events and problems.

So what are the sustainability constellations? If you go back up to the highest level possible — which I don't even know how to name, but these are the constellations for sustaining health records. At the top left it's government-centric, and we see some of it in places like the UK perhaps, or even Australia. The risk is obvious, I think: Big Brother. At the top right, provider-centric, so providers are actually the record keepers, and the risk here — we all know it from our experience as patients — is that you always get partial data. You never get the entire data, because there are always silos of information within the health care provider system.

Now, on the bottom left, region-centric: okay, we are all part of communities, we don't really move between communities, so let's do it on a community basis. The risk here is obvious: it's limited because it's just regional, and we know that we do move — maybe not many of us, but we move. And then consumer-centric. Consumer-centric — eventually all those PHRs, personal health records, would in my mind be unreliable in the eyes of the clinicians, because they really cannot trust them. They might say to themselves, who knows who messed with this data.

So I'm actually proposing an independent, non-centric constellation, which I call independent health record banks. The argument here is that none of the existing players in the health care arena can or should sustain lifetime electronic health records. And these are the main principles. The first one is the most radical: I'm actually proposing to change the legislation so that health care providers are no longer the record keepers. They could keep the records that they created, at their own expense, but they are no longer the keepers of the legal, medical copy of the record. Rather, it is going to be held and sustained by independent health record banks, which will be independent and regulated by the legislation.

We escape the issue of a unique ID, which is very problematic, especially in the U.S. We also escape the problem of ownership. There is no owner here; even the patient is not the owner. There is a custodian model, a very simple model.

And that's the major conceptual transition. We are moving the medical record archives that today sit within the health care providers into independent repositories. We are emphasizing standards-based communication, which would help standardization in general, of course, but it could be enabled only by new legislation.

That's the basic production cycle. I introduce myself to the health care provider. The health care provider gets the entire full-blown EHR into its operational system, takes care of me, and then feeds the records back to the bank, where they will be incorporated into the summary — which would be done on an ongoing basis, and not as the result of some query or data federation, which never actually works. And that also goes for pharma. A pharma company goes through a clinical trial, and then they should send the genomic data back to the independent health record bank. They won't like it so much, but they would like to get the record, of course.

These are, on the macroeconomic level, the main transformations. I'm arguing that there is no additional cost here, because the archiving costs today are embedded within the health care costs; they would just be moved to the health record banking. These are the benefits that I already alluded to. A bill was introduced in the U.S. Congress exactly seven years ago, influenced by my publications. Unfortunately it didn't pass, but at least it was a nice step toward this idea of a health record bank. And for more information, you can read the special issue we published recently on health record banking.

So thank you for your attention.

ALIN: Due to time constraints, we're only going to take two or three questions. Sorry.

QUESTIONER: How do you envision, or how do you work in, protection for individual privacy — who has access to the data — and how can you strip the data for, let's say, epidemiological research that does not involve personal information?

DR. SHABO: Yes, so privacy is a tough issue, as we all know. In a sense there is a tension: privacy versus availability. As a healthy person, sometimes I'm kind of cautious about my privacy, and I'm not really thinking about the availability of the data once I'm sick. I don't have a magic solution to it. I'm only arguing that the current situation — where the data of individuals is scattered all over and totally fragmented, and that fragmentation is the only mechanism actually protecting privacy — is really a bit ridiculous. We need to protect privacy with the means that we do have today. I know that there could be breaches, but today the data is fragmented and scattered, and privacy can be breached at each provider or health information service provider.

So I think that the bank model is a chance to actually better protect privacy. And think also about the unique ID. You don't need any unique ID here. In the current interoperability effort in the U.S., for example, you've got to have some kind of what is called patient matching, matching based on the ID, and that's a huge issue.

QUESTIONER: When we are talking about genomics data, which will eventually start flowing through EHR systems, the size of the data is a big issue. Also, the amount of things you can report is huge — I mean, 400,000 transcripts and any variation on them. Imagine the amount of information that will flow through HL7 protocols; that might be very challenging. Trying to see the vision, what would that be?

DR. SHABO: I really want to emphasize that this model of health record banking is totally distributed, so this is not about some kind of centralized repository for all health records. It is distributed first on the business level: at the business level there are competing independent health record banks. Now, within each bank there could be, architecturally, in terms of IT architecture, again some distribution, depending on the volume and so forth. So I think there shouldn't be any problem in taking in any volume of data, including whole genome sequencing and so forth. Whereas in the current situation, if you try to push whole genome sequencing to health care providers — they cannot even take in simple genetic testing, like known mutations, not to mention NGS.

And those health record banks should be specialized in health information, health standards, and so forth. This will be their specialty, unlike providers, whose specialty is providing care, or insurers, who should provide insurance, and so forth. Each should be focused on its specialty. I'm only saying that health record banking would be kind of an engine to get all those things solved. I'm not saying that they are solved today, but they could be solved better when you have that kind of expertise within those organizations.

DR. SIMONYAN: Just one more comment about privacy: sometimes people mix up privacy and security, and one of the information management officers told me the big difference. Privacy is who has the right to access the data. Security is who actually does it. So it's important to distinguish that security is sometimes actually a much bigger question than privacy, because ownership of the data is sometimes clear legislatively, but the security is not.
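The right-versus-actual distinction can be made concrete with a toy sketch: an access policy records who *may* read a record (privacy), while an audit log records who *actually did* (security). The record IDs, user names, and the single-dictionary ACL are all illustrative assumptions, not any real system's design.

```python
# Privacy: who has the RIGHT to access the data (policy).
ACL = {"record42": {"dr_smith"}}     # hypothetical access policy

# Security: who ACTUALLY accesses it (enforcement plus audit).
audit_log = []

def access(user: str, record: str) -> bool:
    """Check the policy, and record every attempt either way,
    so that breaches are at least detectable after the fact."""
    allowed = user in ACL.get(record, set())
    audit_log.append((user, record, allowed))
    return allowed
```

A policy without the audit trail answers only the privacy question; the log is what lets you answer the security one.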

So we have the next speaker, Eugene Yaschenko. He is from the National Center for Biotechnology Information. He is the head of the molecular software efforts at NCBI, and we know that NCBI is one of the biggest, if not the biggest, NGS data repositories and tool providers. So we are happy to hear what he has to say about NGS standardization.

MR. YASCHENKO: It's still morning, so good morning. I'll try to be quick and let you have lunch. My name is Eugene Yaschenko and I'm chief of the molecular software section at NCBI. What we are doing is building large genomic, genetic, and biomedical databases for the world.

So, the outline: I will introduce NCBI, for people who still don't know what NCBI is and what is produced at NCBI, and then build up to a section describing the subsection of NCBI which we call the primary archives. Then I'll talk about the sequencing archive.

So, about NCBI: NCBI was created by Congress in 1988 to develop (audio skip). We are funded by act of Congress, and we are in a unique position. We have the funds, obligations to develop databases, obligations to accept the data and serve those databases to the public, but we have no leverage over what kind of data we get, in what formats, and to what standard. The only leverage we have is that many publishers require people to submit to qualified databases. We can try not to accept data in the wrong format, but we have to get it resolved and stay compliant with our mandate from Congress. We have to store the data, but we don't control what formats are being sent to us.

NCBI in general is mostly known for its biomedical efforts. You all know PubMed, PubMed Central, and the public health sites, which we support. We also have a lot of databases that support bioinformatics resources at NCBI: the nucleotide databases, part of GenBank; the protein database, also part of GenBank. There is a very extensive and nicely curated database on genes; the whole genomes, which were assembled and submitted to NCBI; and, very critical for bioinformatics research, BLAST running on NCBI hardware.

So part of our databases we call primary data archives. When we build primary data archives, we do it with certain principles. First of all, fidelity to the submitter: we try to reflect as faithfully as we can what is submitted or provided to us. Second, the term of archival of this data is assumed to be indefinite. Obviously, nothing in the world is permanent, but when we design these databases, we assume that the database will be used indefinitely.

A data archive is different from a file archive. Our principle is that we are not just a file archive; we are not archiving just files. We try to figure out what data those files contain, possibly build specialized databases for them, and archive the data, not the files. Also, as a result of that, it is not necessarily lossless. Some controlled loss of information may be done for the sake of space, or simply because, for a given data type, we don't want to preserve every bit of information. We just control the loss.

Also, for internal storage, we try to make it as uniform as possible. We utilize the transformation to internal storage for validation, for additional indexing, sorting, and so on and so forth. And as I already mentioned, the extra pain we get is from the multitude of formats generated outside, and we always welcome any movement toward a common standard.

So, the primary sequencing data archives at NCBI: GenBank was created in the eighties. Then, with the advent of the first automated sequencing technology, we created the trace archive to store single traces. And when next generation sequencing came, we developed SRA to handle several orders of magnitude more data. We call those two archives, the trace archive and SRA, raw data archives, because we apply a minimal amount of curation to this data, and the data represents as closely as possible what was produced by the submitter. So it doesn't mean — this is not a highly curated set. This is the data as the submitter sees it.

So in this picture you see the growth of three databases: GenBank, which I think on the screen should be blue; the trace archive, which is red; and yellow, which at this resolution is almost a vertical line, is SRA. When we change the zoom and shift the offset, you see all those lines scaled to the trace archive. And when you scale to SRA, you see this long yellow line and everything else stays at zero — that's the number of bases in SRA. As you see, by now we approach three petabases of storage, and one petabase is a lot. If you take DNA in its unstretched form and just line it up, then one petabase would be a single DNA molecule from Washington, D.C., to New York. This is how micro-scale times a very large number becomes something we can measure in real distances.
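The Washington-to-New-York claim checks out arithmetically: at roughly 0.34 nanometers of rise per base pair (an approximate figure for B-form DNA), one petabase stretched end to end is on the order of a few hundred kilometers.

```python
# Rough check of the scale claim: one petabase of DNA at ~0.34 nm
# rise per base pair, stretched end to end.
NM_PER_BASE = 0.34       # approximate rise per base pair, B-form DNA
bases = 1e15             # one petabase

# nm -> m (1e-9), then m -> km (1e-3)
length_km = bases * NM_PER_BASE * 1e-9 / 1e3   # ~340 km
```

340 km is indeed about the Washington, D.C. to New York distance, so the analogy is quantitatively reasonable.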

So this is our daily input, the number of bases we inject every day into SRA. These lines are the inputs, and the smoother line is the average. As you see, starting a couple of years ago, we effectively insert into SRA every day the full size of the trace archive, and much more. So this is how scales are changing with next generation technologies.

Because of this, this is how we model the data for our internal standards — the data that we store. As we realized the volume is big, we started to try to compress the data and reduce it to as few bytes as we can. As you see, in the early stages, when we went through the NGS explosion, we spent up to 8 bytes per base, because we were recording not only the quality — Illumina was producing four-channel quality, and some of the technologies were producing the signals which resulted in particular base calls.

In the second phase, when we realized that's not sustainable, both we and the vendors started to converge on the need to store only bases and qualities. This is what I call the FASTQ conversion. As we went further and tried to apply better and better technology, we realized that one of the steps in preserving the data is aligning it to the reference. And alignment to the reference, as mentioned before, provides a very good possibility of doing compression by reference, where you don't need to record the sequence itself; a lot of the sequences match the reference pretty well. This allowed us to converge to a more or less steady pace where we spend approximately two-thirds of a byte per base. And, as was previously mentioned, almost 80 percent of this data is quality scores, which we will attack next: do we really need to store all of that?
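The compression-by-reference idea can be sketched in a few lines: for a read already aligned to the reference, store only its position and its mismatches, since most bases match the reference exactly. This is a minimal toy, not the actual SRA or CRAM encoding, and it drops quality scores entirely (the ~80 percent of the bytes mentioned above).

```python
def compress_to_reference(read: str, ref: str, pos: int):
    """Reference-based compression sketch: for a read aligned at
    `pos`, store only (position, length, mismatches) instead of
    the full sequence. No indels handled in this toy."""
    mismatches = [(i, b) for i, b in enumerate(read)
                  if ref[pos + i] != b]
    return (pos, len(read), mismatches)

def decompress(record, ref: str) -> str:
    """Reconstruct the read by copying the reference slice and
    patching the stored mismatches back in."""
    pos, length, mismatches = record
    bases = list(ref[pos:pos + length])
    for i, b in mismatches:
        bases[i] = b
    return "".join(bases)
```

A read identical to the reference compresses to just a position and a length, which is why this approach gets the per-base cost down so far.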

So, the need for standards, to be discussed: this is NGS data. If you look at the SRA now, you find 8 sequencing technologies. You find 34 sequencing instrument models. You find 20 types of experiments — whole genome sequencing, RNA-seq, all of them. And you find many cell types and taxonomy IDs. Obviously, to develop separate standards for each of those combinations is impossible. We need a common standard for all of them. And the standard is relatively simple: we need to record the metadata, and we need to record the bases and qualities, and possibly their alignment to the reference.

There is also the need for a standard from the long-term archival perspective. It is the goal of NCBI to build an archive for indefinite use, and this raises the first question: long-term usability means software support. Imagine a format generated today that is understood by a particular set of software. What happens ten years from now? Will this software still work on future computers? What's convenient now may not necessarily be relevant in the long term.

Then there is the uniformity of the data produced by different technologies. Even within the scope of the same manufacturer, they change the formats they produce over successive releases. And their latest software doesn't understand — I think doesn't understand — the format they created six years ago when they started.

So also, uniformity of the data across different types, and long-term usability of the various elements: as I said before, initially we tried to model more elements than just bases and qualities, but later practice showed that it doesn't need to be modeled in that much detail. The benefits of transforming to an archival format are that, in addition to being archived, the original data gets validated and curated while it is being transformed. The transform is possible because, at the time of the transform, the software that understands the submitted format is up to date — it's still current. And the transformation to an internal standard gives us the ability to apply data compression. Then there are the problems: we need to keep up with all the formats being submitted to us. There is potential loss of data due to bugs in the archive software, if they creep in, and potential loss of data due to archive decisions about which data to keep or discard. Data decompression slows down short-term computational needs, and data reduction may discard data fields which were used to carry additional information. Several times we found fields which are used only to carry information from one bioinformatics pipeline to the next — just to carry it from, let's say, Bowtie to — from TopHat to Cufflinks — and no other meaning can be applied to those fields.

So, the decision cycle we go through — we go in circles. We first decide which data series to store. Then we look at the data series for redundancy removal. Then we decide whether we do lossy compression. And then we go through practical application, and when something happens — a new technology shows up, or the pattern of using the data changes — we go into the cycle again.

We initially preserve as many of the available data series as is reasonable and possible. Then we do compression routines; existing compression routines speed up the development and acceptance of new data. Then we improve compression methods as the data becomes better understood. And then we discard data series which did not prove to be useful, and we recover disc space.

Also, we need to manage the risk of data loss, the risk of losing archived data. We do the normal stuff: redundant discs, redundant-location backups. For the risk of introducing new software, obviously we handle that with unit tests and regression tests and quality assurance, and we can recover the original by re-processing. And for the risk of making bad decisions — for example, we decided to discard some data series — since the discarded data goes to tape storage, which has a limited lifetime, our assumption is that a mistake will be discovered within the lifetime of the tape.

So, as I mentioned, SRA is big data, and what are the conditions for big data? This is my understanding of how big data gets born. First of all, we have advanced technology to generate it, because you have to be able to generate the data to call it big; it's not easy to create so many bytes. You also need a sufficient amount of resources to store this data, because you cannot call big data the stuff you can never store. At the initial stages of Illumina's design, there was a requirement to try to store all the images produced by Illumina. That storage technology doesn't really exist, so it was never even attempted.

Another thing about big data is that we don't have enough methods to digest it. That's why we keep it big: because we cannot extract the useful information at this time, we keep it for the future, hoping that maybe in the future we will be able to.

So these, for example, are properties of the NGS data which we analyzed. We do have random noise in the data. We also have systematic noise from well-known biases of, let's say, PCR or some platforms. We have expected signal. First of all, we have the known carrier, which I call known because a lot of the data, when you align an individual genome to the human genome, will just tell you that yes, you are human. It will not tell you anything individual about you. It will just tell you, yes, you are human, because it matches the genome perfectly.

Then you have expected signals. For example, your eyes are blue; definitely you will have some kind of variations that make your eyes blue. And then you have novel signals, the signals you are looking for, but hidden within this big data. It's like looking at an astronomical picture. You have all of the black background. You have all of the white speckles and noise. Then you have the well-known and well-positioned stars, and then you find the supernova. This is exactly what you are looking at in big data. And our analysis of the data in this area shows that approximately 64 percent of the total reads we find identical to the reference for human; 25 percent of the reads are suspect — suspect in the sense that, for some reason, the black box of the sequencer decided that two or three bases have low quality scores, or even half of the read has low quality scores. So something made the hardware misfire. I would say, why not call them suspect, and not try to chase sequencing errors as signal.
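The triage just described — identical-to-reference, suspect, and potential signal — can be sketched as a toy classifier over aligned reads. The quality threshold and category names are illustrative assumptions, not the actual rules SRA applies.

```python
def classify_read(read: str, quals: list, ref: str, pos: int,
                  min_q: int = 20) -> str:
    """Toy triage in the spirit described: reads with low-quality
    bases are 'suspect' (the hardware misfired somewhere); reads
    identical to the reference mostly just say 'yes, you are human';
    the rest are candidate signal worth a closer look."""
    if any(q < min_q for q in quals):
        return "suspect"
    if read == ref[pos:pos + len(read)]:
        return "identical"
    return "signal"
```

Run over a whole dataset, buckets like these are what produce breakdowns of the 64-percent / 25-percent kind quoted above.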

149

1 Then we also now have very well-known

2 variants, which recently have been recognized and

3 catalogued by the 1,000 genome project, variants

4 of human that are common to a large percentage of

5 people. And all this is a small fraction, and

6 this fraction is 1 percent, on a scale of 1

7 percent, which is actually useful data, data which

8 you can feed into clinical decisions or academic

9 research. And the problems are finite. The

10 additional alignment phase to SRA data allow us to

11 go through silos of data for each individual

12 sample. Two matrix of data where you can align

13 those silos with their position on the genome.

14 And this is where the existence of standards is

15 important because once you go from set of

16 individual data files to matrix, you can start

17 doing a lot of -- a lot more flexibility of how to

18 look at the data. For example, you can slice the

19 data. You can look at the region on the genome

20 and see how different individuals of data 1, data

21 2, data 3, data 4 look in this region. And just,

22 for example, the four individuals for high-depth

150

sequencing of the 1000 Genomes Project, and they are in the HLA region. And those four graphs you see on the bottom represent the coverage, and the red lines represent mismatches -- what percentage of mismatches each person has. And you see from those pictures that you can find people with a homozygous mutation or a heterozygous mutation. You can find people who have mutations and no mutations, and you are looking at one single gene. You don't have to go to the separate genomes and cut a piece out of each of those genomes. You just take a slice in a different dimension and the data is there.
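The silo-to-matrix slicing just described can be sketched in a few lines; the sample names, positions, bases, and coverage numbers below are all invented for illustration:

```python
# Toy illustration of turning per-sample "silos" into a position-aligned
# matrix that can be sliced by genomic region across samples.
# All sample names, positions, and coverage values are invented.

samples = {
    "sample1": {100: ("A", 30), 101: ("C", 28), 102: ("G", 31)},
    "sample2": {101: ("T", 40), 102: ("G", 38), 103: ("A", 35)},
}

def region_slice(samples, start, end):
    """Return {position: {sample: (base, coverage)}} for start <= pos < end."""
    matrix = {}
    for name, silo in samples.items():
        for pos, cell in silo.items():
            if start <= pos < end:
                matrix.setdefault(pos, {})[name] = cell
    return matrix

view = region_slice(samples, 101, 103)
# Both samples are now visible side by side at positions 101 and 102.
```

Because every silo is keyed by the same genomic coordinates (the standard), the slice lines samples up side by side without reparsing any individual file.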

So another application that we tried to do, for the human part of the SRA under dbGaP protection, is to build an index of features found in the people. For example, when you have a very strong mismatch from the reference, you can index it properly. We tried to build an index which reports allele existence. So the idea is that you have a human sample and you narrow down the problematic area to a particular mutation, a rare mutation that you've never seen before. And all you want to do is go to a big system containing a lot of people and ask who has seen this mutation before. And in this case you don't even use NGS directly, because you have the phenotype for the person who has the mutation, and then you want to find all other people who have this mutation and try to analyze linkage to their phenotype. So you use NGS only as an indexing medium, not as direct research.

So to end, we're experimenting with something called a beacon. So, for example, this is one of the articles about the APOC3 gene, which is responsible for the lifecycle of triglycerides in your cells. And this particular mutation supposedly makes you resistant to the effects of eating fatty food. I'm not a doctor, so this is not medical advice, but if you have this mutation, according to the theory, you can eat as much fat as you want and it will not affect your health. But this is a recent study, and I want to look at which people have this mutation. So this is the beacon at NCBI, dbGaP, and you just enter

152

the coordinates you are looking for, and you find there is one sample in each of two studies, and you happen to have access to one of them. This one is a general research use data set, and the person, Steven Cherry, who was doing this study, has access to the set. So he happened to find one person in each of those studies, and they also appear in a further dbGaP study. He has no access to that one, but he has the ability to request access. By looking at those few fields, he finds that yes, for those two people, according to the SRA records existing under dbGaP protection, there is coverage and they have the heterozygous mutation at that position. So I gave two examples of how the existence of standards allows you to go from a collection of data files to a big data matrix and index it, because the data that fills in the fields of this matrix is uniform and conforms to the same standards. So this is the end of my talk.
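The beacon lookup described above can be reduced to a toy in-memory sketch; the study names and the coordinate here are invented, and a real beacon (such as the dbGaP one mentioned in the talk) answers the same question over a web API rather than a Python dictionary:

```python
# Minimal sketch of a beacon: an index that answers only "does any sample
# in some study carry this allele?", never which samples those are.
# Study names and the genomic coordinate are invented for illustration.

beacon_index = {
    ("chr11", 116701354, "A"): {"study_1", "study_2"},  # an APOC3-like variant
}

def beacon_query(index, chrom, pos, allele):
    """Return the sorted list of studies containing the allele, no sample IDs."""
    return sorted(index.get((chrom, pos, allele), set()))

hits = beacon_query(beacon_index, "chr11", 116701354, "A")
# Access to the underlying samples still has to be requested through dbGaP;
# the beacon only reports existence.
```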

21 QUESTIONER: (off mic)

22 MR. YASCHENKO: How we handle storage of

153

1 --

2 QUESTIONER: (off mic)

3 MR. YASCHENKO: We maintain the software

4 on NCBI's Website, but we also -- I think we're in

5 the process of moving this into GitHub because it

6 becomes --

7 QUESTIONER: (off mic)

8 MR. YASCHENKO: Yeah, we're going to

9 move to GitHub in the next couple of weeks

10 actually.

11 QUESTIONER: (off mic)

12 MR. YASCHENKO: We are also using -- you

13 mean virtual machines?

14 QUESTIONER: (off mic)

15 MR. YASCHENKO: We are also working, not

16 program, modeling how to create the virtual

17 machine environment for people. We tried to

18 create examples how to use SRA from the cloud and

19 the cloud can be anything. We're modeling with

20 Amazon, but it can be --

21 QUESTIONER: (off mic)

22 MR. YASCHENKO: Exactly. Right. We're

154

1 at the point where we're just providing the

2 recipe, how to install software in the virtual

3 machine. But we also can generate the Amazon

4 images for people to use direct result

5 installation.

6 ALIN: Are there any other questions?

7 QUESTIONER: What kind of redundancy do

8 you have of the basic data? If you're going to

9 condense your files and base them on a sequence

10 alignment of a standard sequence, how can you

11 preserve that standard sequence because if you

12 lose that, everything else goes?

13 MR. YASCHENKO: The standard sequence,

14 first of all, we prefer the standard sequence to

15 be one of the GenBank sequences. And the GenBank

16 sequencing is highly reliable because it's not

17 only stored at NCBI, it's also kept in multiple

18 places. It's stored in the INSDC, which is an

19 international collaboration.

20 We do allow for some of the cases what

21 we call local references, where reference is

22 stored with the object. Obviously, the

155

1 compression is less there because you store the

2 reference. But then it's part of the data. It's

3 not remote, it's local. If you lose it, you lose

4 it with the data.

5 ALIN: There's actually a question

6 online. Kay asks "Does anyone see any linkage

7 between" (audio skip).

8 MR. YASCHENKO: I had trouble

9 understanding.

10 ALIN: I'm not sure if I follow. So

11 does anyone have a question here?

12 QUESTIONER: When we are talking about

13 human data, reference-based compression would be a

14 wonderful compression algorithm. But let's say if

15 it's viral data, and we deal with a lot of viral

16 data or bacterial data for high coverage, we get

17 sequences that are absolutely identical even with

18 the error rates of current models. So there are

19 different approaches. Even before you do

20 reference-based compression, you can do just

21 purely redundancy compression. It sometimes gives

22 us a factor of eight to ten right there, so you

156

1 don't even need to keep the reference for that.

2 MR. YASCHENKO: Reference-based

3 compression is redundancy compression because, in
4 fact -- even if your reference is genomic, you

5 store your sequence once when you have coverage.

6 Even if you first assemble -- you don't need to

7 have a perfect reference, you just assemble your

8 reference and you compress against it. This is

9 already reference-based compression because you

10 store the sequence but once and all the rest that

11 are generated are highly redundant.
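A minimal sketch of the reference-based compression being discussed, storing each read as an alignment position plus only its mismatches; the reference string and read are invented:

```python
# Sketch of reference-based compression: each read is stored as its
# alignment position plus only the bases that differ from the reference,
# so the redundant sequence is stored once (in the reference itself).

reference = "ACGTACGTACGT"   # invented reference

def compress_read(ref, read, pos):
    """Encode a read aligned at `pos` as (pos, length, diffs)."""
    diffs = [(i, b) for i, b in enumerate(read) if ref[pos + i] != b]
    return (pos, len(read), diffs)

def decompress_read(ref, record):
    """Rebuild the original read from the reference plus the stored diffs."""
    pos, length, diffs = record
    bases = list(ref[pos:pos + length])
    for i, b in diffs:
        bases[i] = b
    return "".join(bases)

rec = compress_read(reference, "ACGAACG", 4)   # one mismatch vs the reference
assert decompress_read(reference, rec) == "ACGAACG"
```

A read identical to the reference compresses to just a position and a length, which is why high-coverage data shrinks so dramatically.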

12 QUESTIONER: But sometimes assembling

13 the genome itself is a challenge. So maybe --

14 MR. YASCHENKO: I think we can have

15 another whole meeting today on assembling the genome.

16 QUESTIONER: Some of the data from RNA

17 viruses, you're dealing with a quasi-species

18 rather than a single sequence. So how do you deal

19 with that when you're talking about compression of

20 data?

MR. YASCHENKO: We are not doing any such compression on our own. There's a part of NCBI that deals with pathogens, and you will hear later where we do analyze the data. But from an SRA perspective, we take the data as given by the submitter. So far we are not introducing anything of our own. If the submitter's pipeline involves alignment to a common reference, we take it. If the submitter produces a genome reference, aligns to it, and gives it to us, we also take it, but we are not imposing our analysis.

10 ALIN: Before we close the session, we

11 are actually going to take a break now and go to

12 lunch. But before we do that I have a couple of

13 general announcements. One, there are posters

14 outside. Feel free to look at those at your

15 leisure. They're there for you guys to look at

16 and the presenters are also somewhere in the room.

So at some point, if you don't mind standing near your poster, that would be great, so that if people have questions you can address them.

20 Also it is lunch. You are welcome to

21 leave the campus, but note that if you do leave

22 the campus you have to go through all the security

158

1 procedures that you went through to get in and

2 those take some time. So there's a cafeteria

3 across from us. We recommend that you go there,

4 so just to ease the time-limiting factor.

5 Also there is a tweet going out that has

hashtag FDANGS. We are required to say, because we're a part of the government, that we don't associate with that and, honestly, if you could not

9 have FDA in that Twitter tag, it would be probably

10 better for us, to not get us in trouble, just a

11 public announcement that we don't endorse that

12 Twitter hashtag.

13 Thank you. We'll reconvene at 1:30 p.m.

14 (Recess)

MR. YASCHENKO: Welcome back from lunch. I hope you feel better now. So, we're at the second session of the day, Big Data Administration and Computational Infrastructure. There have already been several mentions of Big Data, but now we have a whole session dedicated to it. So, the first speaker is Dr. Vahan Simonyan, on the high performance integrated virtual environment (Inaudible).

2 DR. SIMONYAN: Good afternoon. So, in

3 the first speech I was representing the

4 perspective of standardization, and this speech is

5 actually the work which I do on a daily basis.

6 So, this is representing high performance

7 integrated virtual environments, a platform which

8 is developed specifically for NGS data inside the

9 FDA and outside in public (Inaudible). So again,

10 this is my disclaimer. I'm supposed to say this,

11 and everything I say, it's my own perspective.

12 It's not a regulation or anything. That does not

13 bind the FDA.

14 Okay, so, in an NGS world, again -- I

15 mentioned this before, when maybe we start working

16 in science, we all think we're going to do fun

17 stuff and do big, different beautiful algorithms,

18 new science, new research. As soon as you move to

19 the NGS world, you find out that you have to build
20 infrastructure, and huge infrastructure. And data

21 is big. Data standards are -- some are there.

22 Some are not there.

160

1 It's difficult to interpret these in big

2 ways sometimes. And the data complexity is big.

3 And also, the computations that are to be done are

4 really very large. Every time somebody comes to

5 me and they want to analyze everything against

6 everything, well, it's possible, but it takes

7 years and years and years to finish. So,

computations need to be done carefully -- it's not just that they are big, they are also complex -- to avoid all of these huge problems.

11 And computational standards -- we

12 already raised the issue, and that's why we are

13 here today. Computational standards are also a

14 very big question. So, what do we need? And

15 being part of the FDA or doing the same thing for

16 the public, we need a good storage capacity. We

17 need a good security capacity.

Also, being a regulatory agency and accepting review data, it is important to make sure that the data is very secure -- I have prepared a slide on that -- and then availability of the data is important. We have all heard about

161

1 these beautiful storage systems which are very

fast internally, but they are facing out with one (Inaudible) cable, and then you have a compute cloud which is facing in with one cable.

5 And so, they are not available. Data is there.

6 It's just not available for you to compute. It's

7 very slow.

8 And interfaces are an issue, because I

9 mean, the results of our computations sometimes

10 are hundred-million or billion-row tables -- how
11 useful is that? I mean, unless you have a nice

12 interface which allows you to look at them, it's

13 not a result. And support, of course. Big Data

14 administration -- as soon as you go to

15 infrastructure, you have to think about the

16 administration and support, because you have to

17 support the hardware, you have to support the
18 staff.

19 And then, people move and then you have

20 to think about it; how it has to be upgraded. You

21 have to think about it. So, originally, we

22 started this to become scientists, and then, we

162

1 ended up doing all of these other things which are

2 infrastructure related. So, what HIVE does, it

3 tries to address some of these issues -- the few

4 things it does -- it does robust data loading. It
5 has a robust data loading infrastructure.

6 Pretty much, we can go anywhere and get

7 anything for you. You just give us a URL or an

8 ID. If it's NCBI or UniProt or anything, you come

9 to the system. You command us to go and get the

10 data for you. You go home. You drink your

11 coffee. You come back. You have the data.

12 Distributed storage. When we take the

13 data, we spread it across the cluster and store it

14 in pieces. There are two good reasons for that.

15 Number one is that computations are faster when

16 they're also distributed. And number two is that

17 if a particular node is compromised, we don't

18 compromise the whole data. Everything is

19 distributed. And security -- HIVE provides

20 security.
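The chunked, distributed storage just described might look like this in miniature; the node names, chunk size, and placement rule are invented for the example:

```python
# Sketch of the distributed-storage idea: a data object is split into
# chunks that are spread across cluster nodes, so computation can run
# near each chunk and no single compromised node holds the whole object.
# Node names, the tiny chunk size, and the hash placement are invented.

import hashlib

NODES = ["node-a", "node-b", "node-c"]

def place_chunks(data: bytes, chunk_size: int = 4):
    """Split data into chunks and assign each chunk to a node by content hash."""
    placement = []
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        node = NODES[digest[0] % len(NODES)]
        placement.append((off, node, chunk))
    return placement

def reassemble(placement):
    """Rebuild the original object by concatenating chunks in offset order."""
    return b"".join(chunk for _, _, chunk in sorted(placement))

layout = place_chunks(b"ACGTACGTACGT")
```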

21 And we had to come up with a new model

22 of security we called Hierarchical Security. And

163

1 then computations -- distributed computations.

2 When you are a developer in HIVE, you don't have

3 to think about where your computation is going to run.

4 You don't have to think, is my data in that

5 particular compute node. You don't have to think

6 about how to parallelize it. HIVE does a lot of

7 it for you.

And interfaces. HIVE provides HTML5-based, JavaScript-based web interfaces, so if you are working in some IT infrastructure where there are limitations on what software you can install, you don't need to think

13 about it. Browsers are everywhere, and most

14 modern browsers do support HTML5.

15 And of course, the most important part,

16 HIVE is an expertise. It doesn't matter what the

17 tools are. Sometimes we don't know how to use the

18 tools, and sometimes, the tool does 90 percent of

19 what you want, but there is still that 10 percent.

20 So, unless you have an expertise and an

21 infrastructure, you are doomed, because now your

22 biology is trying to deal with that 10 percent.

164

1 Okay. So from a hardware perspective,

2 topological perspective, HIVE is an encapsulated,

3 behind the firewall, cloud-like infrastructure.

4 There is a HIVE. It stands for High Performance

5 Integrated Virtual Environment. The letter I is

6 the integrated, because everything is together. I

7 draw on that data that you need and the storage

8 you need in a separate fashion, but sometimes they

9 are the same notes. Sometimes they are different

10 notes. And there's an extremely well optimized

11 internal network connectivity in between them.

12 Everything is controlled with cloud

13 servers, and there is a single point of access

14 which is the web servers. No (Inaudible), no

15 additional users. We are trying to minimize

16 potential violations of the system's integrity.

17 And then, we can link to your devices, if you have

18 an Illumina device or a Roche device or any other

19 device, we're just doing the sequencing. We can

20 integrate, and then if those mounts are available

21 -- technology people know where the mounts are --

22 it's where your (Inaudible) end up being -- we can

165

1 get them for you even without you actually

2 thinking about it.

3 We can go and upload your data from your

4 local storage, or you can tell us to get the data

5 from somewhere else. Everything is done from the

6 web browser. And this is how HIVE looked before

7 and after. One of these before and after pictures

8 (Laughter). To be honest with you, the second

9 perspective -- after perspective is taken -- the

10 guy is still there behind the cave walls.

11 (Laughter) He's just not visible (Laughter). The

12 other side. Yeah.

13 Okay. So, the first few slides are about the
14 storage data flow -- I come here. I define my

15 metadata. I define my sequences. I click a

16 button. Data gets uploaded into HIVE, and that's

17 the only time when HIVE depends on your computers

18 being live. When you upload the data from your

19 local computer -- because if your computer goes to
20 sleep or you go to another page, you break the

21 connection. There is nothing we can do about it.

22 That's the only point in HIVE -- when

166

1 you issue a command, you have to wait for it to

2 finish. So what do we do? We initiate the

3 processing pipeline. We recognize almost all of

4 the formats which are available in industry,

5 unless something came up from yesterday. And we

do validate it, because we know what validation protocols are out there in the industry.

9 There are about 40 different validations

10 possible. Ten or 20 of them are being done

automatically. We parse the data. We compress the data. We encrypt some of the data, where feasible, and we archive it for

14 distribution. And the next thing, you come. You

15 don't have the data in your machine. You just

16 point it to NCBI or UniProt or anywhere, and then

17 we go and start getting it.

18 It seems like NCBI provides a lot of

19 nice FTP downloads and things, but information is

20 not just a packaged FTP lying there. So, you have

21 to do this electronic handshaking. It's called
22 eUtils. So, you have to submit a query, get the

167

1 results, go on -- start downloading by chunks. We

2 take all of this complexity into HIVE, and we do

3 it concurrently. Hopefully, not too much

4 concurrently, because NCBI will blacklist us if we

5 go too much.
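The "handshake, then download by chunks" pattern described here can be sketched against NCBI's real E-utilities endpoint; only the URLs are constructed below (no network call is made), and the record IDs are invented placeholders:

```python
# Sketch of chunked retrieval via NCBI E-utilities: after a search
# identifies records, efetch is called repeatedly on batches of IDs
# instead of one giant request. The IDs here are invented placeholders.

from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_urls(db, ids, chunk=100):
    """Build one efetch URL per chunk of IDs, mimicking download-by-chunks."""
    urls = []
    for off in range(0, len(ids), chunk):
        params = urlencode({"db": db,
                            "id": ",".join(ids[off:off + chunk]),
                            "retmode": "text"})
        urls.append(f"{EUTILS}/efetch.fcgi?{params}")
    return urls

urls = efetch_urls("nuccore", [f"ID{i}" for i in range(250)], chunk=100)
# 250 IDs with chunk=100 -> 3 polite requests instead of one huge one.
```

Keeping the chunk count modest is exactly the "not too much concurrency" point in the talk: the service will throttle or blacklist overly aggressive clients.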

6 So, and then, once we get the data, what

7 do we do? We monitor -- sorry. Okay, we're

8 splitting the -- okay, we monitor while the

9 downloads are finished. We parse, compress and do

10 the other stuff automatically. And so, how do you

11 do computations? This is a very general view.

I come. I select my data. The data has to be inside of the system already; HIVE does not work on data which is outside. We can get it for you, but we can compute only when the data is inside. You remember, that's integrated (Inaudible). So, you click on a

button. Based on what kind of computation it is and an estimation of how much time it may take, our software decides how many chunks have to be made out of it -- how many pieces, how many compute instances should be involved in it.

2 Then, we parallelize. It's called

parallelization. Then we launch -- wake up computers which start working on the data chunks -- and it's very fascinating to get the information while your browser is updating and showing progress -- 1 percent, 2 percent, 3 percent, et cetera. Once it is done, the parallel (Inaudible) results are collated. The visualization is prepared and sent to you.
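The chunk, parallelize, and collate flow described above can be sketched as follows; the cost estimate, the GC-counting stand-in workload, and the reads are all invented:

```python
# Sketch of the chunk -> parallelize -> collate flow: the input is cut
# into chunks sized by a crude time estimate, each chunk is processed
# on a worker, and the partial results are collated at the end.
# The workload (GC counting) and all numbers are invented stand-ins.

from concurrent.futures import ThreadPoolExecutor

def choose_chunks(items, est_seconds_per_item, budget_seconds=2.0):
    """Decide the number of chunks from an estimated per-item cost."""
    n = max(1, round(len(items) * est_seconds_per_item / budget_seconds))
    size = -(-len(items) // n)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def count_gc(chunk):
    """Per-chunk work: count G/C bases (a stand-in for real analysis)."""
    return sum(seq.count("G") + seq.count("C") for seq in chunk)

reads = ["ACGT", "GGCC", "ATAT", "CGCG"]
chunks = choose_chunks(reads, est_seconds_per_item=1.0)
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_gc, chunks))
total = sum(partials)  # collate the parallel partial results
```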

11 And this is what visualizations usually
12 look like. Well, they're not crooked in real

13 life. I just tried to create a perspective.

14 Yeah, so there are all kinds of visualizations.

15 This is all external based. Some visualizations

16 are built internally by HIVE. Some visualizations
17 come from adapted tools -- many tools. We have --

18 you'll see some of the tools I mentioned.

If the tool produces a nice visualization or a table, or it produces the data in a text format that we can visualize, we are trying to bring it to you in this same framework. And

169

1 sometimes, it's actually difficult. Do you know

2 how many times people say, hey, I've done my

3 computation. I waited my three hours, and I'm

4 clicking on this human chromosome number one. I

5 want to see all of the mutations.

In the real estate of the screen, which is 300 pixels -- the first human chromosome has 250 million positions. And we know

9 about 20 different things about every single

10 position. People say, I clicked it. It's

11 computed. It should take a second to show.

12 But in reality, to go through 200

13 gigabytes -- 250 million positions and produce

14 just the output for you, which is visually

15 appealing and you can understand what's happening,

even that has to launch like 50 processes on a box, each one doing a chunk of the work, and then collating back and bringing it to you.

19 So, when we are dealing with Big Data, even the

20 simplest things which seem simple, they're really

21 a big deal.

22 Okay. So what is the ultimate goal and

170

where are we? We built this platform to be extremely modular. And we spent a significant amount of time developing these small modules: the alignment module, variant (Inaudible) modules, phylogenetic modules, parsing modules, security modules.

7 And we make these black boxes. I don't

8 think they're black here. I'm color blind, but I

9 still can see some of them. It produces black

10 boxes which take particular inputs and outputs,

11 and on the right side, you can see just some

12 examples of what they actually are. And the

13 letter V in HIVE stands for virtual. But unlike

14 most virtual infrastructures, we don't virtualize

15 the machines. We virtualize the services. If

it's an alignment, it's an alignment. It takes some inputs. It produces some outputs.

19 Why would you ever care where it is done?

20 Is it a Mac computer? Is it a Windows computer?

21 How much memory does it have? We, as the

22 configurators of HIVE, configured the system in

171

1 such a way that if you launch a service, the

2 service is executed for you regardless of where it

3 is done. It's not your worry to do it. So

4 eventually, well, we are working also on a

5 pipeline design component where we'll take all of

6 these modules and link them together. And the

7 hope is that we can produce actual working

8 pipelines which are preconfigured and validated,

9 and we'll provide it to our users.

10 Okay. So, what actual kind of

11 computations, just to mention a few, are there?

12 We have about 40 tools in production, and we have

13 more in development, and we have some in beta and

14 alpha testing. So of course, the retrieval,

15 storage, security parts of it, visualization tools

16 -- we have a number of alignments which are

17 available to us and adaptable to our system. We

18 have assemblers. We have a variant calling
19 (Inaudible) arsenal, and you can notice that there

20 are tools which are ours; there are tools which

21 are not ours. They're adapted.

22 It takes us, depending on how complex

172

1 your tool is, it takes some time to adapt it.

2 Generally -- if you have a command

3 line, it just takes like five inputs and produces

4 like two outputs. That's easy to do. But if you

5 want us to make it optimal, we can spend extra

6 time and develop an explicit parallelization

7 routine. But even implicit parallelization, when

8 we just adapt your tool and just launch it on one of

9 the computers, you already benefit from HIVE.

10 Why? Because now you can launch 10

11 copies of it, they are going to go to 10 different

12 computers. You are going to do 10 jobs in the

time of one. That's implicitly benefiting from it, although we didn't spend time parallelizing it ourselves. But because alignment and variant callers are significant time consumers, we spent the time and developed not only our own aligners and variant callers, but we also spent time to explicitly parallelize existing ones.

So, let's say you run a computation in a multi-sample pipeline on some big, high-coverage data; it would take you two days. In HIVE, it will take you a few hours, like

2 two or three hours. And if the system is busy, we

3 tend to get busy sometimes on a Thursday,

4 Wednesday, it will take six hours. But it's still

5 much better than this two or three day computation

6 time. And the reason it is important to generate

7 these very optimized routines is because we are in

8 science.

9 I, being a scientist, I never was able

10 to ask the right question from a first time. I

11 don't know what question to ask. I ask a

12 question, I get an answer, and I recognize, hey,

13 that was the wrong question. I have to compute

14 again. I have to compute again. Now, imagine if

15 every time I ask a question, I have to wait days,

16 how efficient I can be. But if I ask a question,

17 I get an answer immediately or within a reasonable

18 time limit, then I have a better potential of

19 doing a nice science. That's why we do spend the

20 time optimizing all of these things.

21 So, let's just move forward. There's a

22 big arsenal of tools in there. The next thing is

174

1 working with scientists -- is the difficulty of

2 their inquiries. Like, when we designed HIVE, we

3 took all of the CBI approaches, designed all of

4 their data models. We said, hey, we have these

5 beautiful databases you can use. You can populate

6 your information in it. The next day, somebody

7 comes, I have a data model which doesn't fit

8 there. Please design it for me.

9 Hey, I am -- we have DB administrators

10 -- very smart guys. They went. They started

11 designing a new data type. In two days, we have

12 five more requests. We recognize it's

13 unattainable, because now you have so many

14 different data types, it's not possible to

15 maintain. So we stopped. We spent one month or

16 two. We designed a new data model which looks
17 more like predicate databases or triplet databases, but

18 it's heavily adapted.

19 We borrowed all of the nice ideas and

20 joined them in a nice, hybrid way, and we created

our own HIVE honeycomb database model. So we can define a new database quickly -- once you, the researcher, know what you want, it takes us 15 minutes to make a new database for you, because there is one database in there which maintains all of the joined databases. It turns out we are saving a lot of money on it. We are saving by needing fewer database administrators -- a much simpler life.
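A toy version of the predicate/triple-style model described here, where adding a new data type needs no new table; all type and field names below are invented:

```python
# Sketch of a predicate/triple-style store: every record is a set of
# (object_id, field, value) rows in one table, so defining a new "type"
# means writing new field names, not designing a new schema.
# All type names, field names, and values are invented for the example.

triples = []

def add_object(obj_id, obj_type, **fields):
    """Store an object of any type as rows in the single triple store."""
    triples.append((obj_id, "type", obj_type))
    for field, value in fields.items():
        triples.append((obj_id, field, value))

def query(obj_type, field):
    """Return {object_id: value} of `field` for all objects of `obj_type`."""
    ids = {o for o, f, v in triples if f == "type" and v == obj_type}
    return {o: v for o, f, v in triples if o in ids and f == field}

add_object("run1", "sequencing_run", platform="Illumina", reads=1200)
add_object("s1", "sample", organism="human")  # a brand-new type, no schema change

result = query("sequencing_run", "platform")
```

The "15 minutes for a new database" claim maps onto this: a new data model is just a new set of field names, validated by one shared store.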

9 Okay. Now about the security aspects of

10 it. So, HIVE implements this hierarchical

11 security model. In any big system where it has to

be extremely secure, you have a lot of objects which can be shared and a lot of objects -- sorry, a lot of users which can accept the shares. The

15 problem is that if you have many rules of defining

16 the permission universe, the system slows down.

17 At every point of access to any object, you have

18 to check the permission.

19 We recognize this is a challenge, so we

20 actually did some studies, did some reading and

21 came up with a hierarchical security model where

22 objects can be shared with their hierarchies. I

176

1 can say I'm giving this object to this particular

2 entity and down in hierarchy, or to this

3 particular entity and up in hierarchy. So, by

4 object, I mean files. I mean processes. I mean

5 computation. I mean algorithms.

6 Let's say you have an algorithm which is

7 only yours in HIVE, and you want to share it with

8 me, but with nobody else. You share an algorithm.

9 So, the next time somebody wants to use it, he

10 says, hey, show me your aligners. If he doesn't

11 have access to your aligner, he cannot get it. If

12 he does have an access to your aligner, he can get

13 it. So, even algorithms are objects. That means

14 they are shareable. Files are shareable. Results

15 are shareable.

16 The fact that computation is running is

17 also shareable. I can give a computation to

18 somebody saying, hey, I launched it. Now you take

19 care of that. And then, we share not just within

20 one hierarchy, which might be an organizational

21 hierarchy of your institution, or you can have

project hierarchies. Each hierarchy is a tree, and HIVE works with forests, not with single trees.

2 So, we can have multiple hierarchies,

3 organizational projects or any other kinds of

4 entities.
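The hierarchical sharing model just described can be sketched as a toy permission check; the hierarchy, user, and object names are all invented:

```python
# Sketch of hierarchical sharing: each user sits in a hierarchy (a tree),
# and an object can be shared with a node "and down", meaning every
# descendant inherits the grant. All names here are invented.

parents = {            # child -> parent within one organizational tree
    "lab_member": "lab",
    "lab": "division",
    "division": None,
}

grants = {"object_42": {"lab"}}   # object shared with "lab" and down

def can_access(user, obj):
    """Walk up from the user; grant access if any ancestor holds a share."""
    node = user
    while node is not None:
        if node in grants.get(obj, set()):
            return True
        node = parents.get(node)
    return False

# "lab_member" inherits access from "lab"; "division" sits above the
# grant point, so sharing downward does not reach it.
```

A forest is just several such `parents` trees (organizational, project, and so on) checked in turn for the same user.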

5 Okay. So now, let's talk about the

6 misconception that a lot of times, people are

7 asking why do you even need this. When we refer

8 to HIVE as a cloud-like infrastructure, why do you

9 need it? There are all these kinds of clouds

already. It's important to note that clouds are just rented computers, except you don't pay for the computer which is going to consume your power. It's consuming somebody else's power. You are still paying for it, but

15 you get the box. You get the box with the command

16 line on it, and you are responsible for putting

17 the software in it, optimizing the software, doing

18 the parallelization and doing all kinds of stuff.

19 But to do all of that, you need to know

programming languages. And these are programming languages -- (Inaudible), shell, and there are many others. I know 13 of them. It's a nightmare. I barely speak English or Russian or Armenian, but I know 13 programming languages. So, for a biologist to actually learn these new programming paradigms will take a significant amount of time.

We thought it's actually a waste of the tens of years it took these people to become scientists to make them crawl under the tables with a cable in their mouth. That's not a good way to use somebody's time. So -- these are

10 actual comments. When I was describing to one of

11 the biologists, like we have -- at that point, we

12 had 500 CPU cores. He said, oh, that's

13 magnificent. What is the core (Laughter)? You

14 know? And in programming language, that's okay.

So again, about Python, he said, I thought you were a theoretical scientist. I didn't know you worked with snakes. So, there's a big difference

18 between programmers and IT developers and actual

19 (Inaudible) biology scientists. And we have to

20 maintain that separation clearly, and HIVE tries

21 to do that. And of course, there are people like

22 me on the bottom, when we see a computer monitor,

179

1 we are really, really excited.

2 Now, about deployments of HIVE. HIVE

3 has multiple deployments. It's a packaged

product. We have maxi-HIVE, which is specifically designed for regulatory usage and for big data, to accept the submissions. We got it only a week ago, for which we are very glad. And

8 we have a mini-HIVE platform, which is designed

9 for cutting edge research algorithms. This is

10 where we adapt the tools and we let our scientists

11 play with the weird data, real data, and do all
12 kinds of studies.

13 And we have a public HIVE. And our

14 collaborator, Dr. Mazumder from GW, is maintaining

15 that side, and they are actually actively opening

16 and promoting the technology which we are

17 developing. They are inviting people to run

pilots. It's an open public resource. Everybody can use it, except that we ask people to contact us personally, because you know, a lot of times when

21 we initially opened it for everybody, immediately,

22 people started launching so many computations, the

180

1 poor computers actually went down, because there

were too many things to do. But it's open source. We try to transfer the technology to everyone who wants to use it.

5 And we have public elastic HIVE. That's

6 when, let's say, somebody now wants to do many

7 computations in a cloud environment. That's why

8 colonial won at Ashburn. Dr. Mazumder is

9 maintaining that plastic elastic HIVE. Not

10 plastic, sorry. Public elastic HIVE. I have to

11 know my terms.

12 Okay, we also adapted HIVE to Amazon and

13 Rock Space, and we didn't make big studies on

14 them. We did a feasibility study on them, and

15 they are running, and we see some performance

16 degradation and we see significant costs

17 associated when we move the data. You know,

18 there's a huge amount of computer power in these

19 clouds, and you pay for them very little. To

20 store the data, you pay for them very little.

21 Unfortunately, the reality is that when

22 you work in a big data universe, data moving costs

181

1 you. And you move these huge amounts of terabytes

2 of the data -- just today, we are loading like six

3 terabytes, and it's a routine day. And I imagine

4 you are moving this every day, and the costs add

5 up. So, I, myself, am a supporter of a hybrid

6 platform where we'll have a local private

7 cloud-like environment like HIVE, and then, we can

8 support efforts for other search patterns. We can

9 extend to Amazon or Hard Rock Space or any other

10 provider.
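The economics described here can be checked with a quick back-of-the-envelope calculation. A minimal Python sketch, where the link speeds and the $0.09/GB egress rate are illustrative assumptions, not any specific provider's actual pricing:

```python
# Estimate the time and egress cost of moving "a routine day" of
# sequencing data out of a cloud. The bandwidths and per-gigabyte
# price below are illustrative assumptions.

def transfer_hours(terabytes, gigabits_per_second):
    """Hours needed to move `terabytes` over a fully utilized link."""
    bits = terabytes * 1e12 * 8            # terabytes -> bits (decimal)
    seconds = bits / (gigabits_per_second * 1e9)
    return seconds / 3600

def egress_cost_usd(terabytes, usd_per_gb=0.09):
    """Cloud egress fee at an assumed $0.09/GB rate."""
    return terabytes * 1000 * usd_per_gb

daily_tb = 6  # "just today, we are loading like six terabytes"

print(f"{transfer_hours(daily_tb, 1):.1f} h on a 1 Gbps link")    # ~13.3 h
print(f"{transfer_hours(daily_tb, 10):.1f} h on a 10 Gbps link")  # ~1.3 h
print(f"${egress_cost_usd(daily_tb):,.0f} in egress fees per day")  # $540
```

At six terabytes every routine day, even a modest per-gigabyte fee compounds into a substantial recurring cost, which is the motivation for the hybrid model described above.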

11 We are also working with IBM SoftLayer

12 now. We are trying to use their bare metal

13 systems, bare metal clouds, because the performance

14 is better, in our opinion. So you see, we are

15 trying to adapt to the environment, not to

16 stick to just the public cloud or just the private

17 cloud, because economics is, after all, deciding what we

18 can and what we cannot do.

19 And then, let me bring up this slide. This

20 is important. We think that we are solving

21 something which is completely new, that we are solving

22 a big challenge which was not here before, and I

182

1 want to say that that is not true. There are so

2 many big data machines right in this room. Every

3 human is a big data machine. Do you know that

4 every second in your body, gazillions of

5 zettabytes of information are actually generated?

6 And that information is not being transferred to

7 a central processing unit. It

8 is actually being processed right where it exists.

9 That is why only a minute amount of

10 information is being transferred to the central

11 processing unit. And even then, our brains

12 sometimes consume 40 percent of the energy. They

13 overheat and they make us do stupid things.

14 (Laughter) So, the point is, the body

15 is a distributed computing entity. If

16 you think about it, computation and

17 information processing is not

18 done in one big, big data center or

19 on a big data cloud. It's done

20 everywhere the data is generated.

21 Of course, the most vital information

22 which is critical for the survival of the system

183

1 as a whole is being transferred to our brains.

2 Similarly, let's model this. Evolution came up

3 with this concept, and it took, actually, three

4 and a half billion years. Why do we need to

5 reinvent it? Let's borrow this concept. So,

6 we've tried to produce HIVE in a box.

7 We take one machine, a powerful one -- it

8 should be powerful, but it's not a cloud. Of

9 course, it's not like a whole infrastructure. We

10 can put a certain number of cores, a

11 certain amount of storage into it, a large-memory

12 machine. We stick it next to the Illumina or

13 (Inaudible) life sciences machines or any of these data

14 producers, and then as soon as the data is being

15 produced, HIVE in a box will be able to run

16 particular pipelines which are predefined for

17 that particular setting. It's like an appliance.

18 So, that's what we want to develop and

19 make this available for everyone. And the reason

20 we went this way is because we have an array of

21 regulatory offices: the FDA has

22 hundreds of offices across America, and

184

1 some of them are in other

2 countries. So, the reason we looked at this is to

3 make sure that they can do their analytics without

4 actually moving the data all the way to us,

5 because it's network, network, network --

6 not location, location, location.

7 Yeah. So that's another platform which we are

8 developing.
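The appliance idea -- run the heavy analysis next to the instrument and send only small results over the network -- can be sketched conceptually. This is an illustrative sketch of the pattern, not HIVE's actual pipeline code; the length threshold and summary fields are hypothetical choices:

```python
# Conceptual sketch of "analyze where the data lives": process the
# raw reads locally, at the site that produced them, and transmit
# only a compact summary instead of terabytes of raw data.

def summarize_run(reads, min_len=50):
    """Local analysis step: reduce a run to aggregate stats
    small enough to send over any network link."""
    passing = [r for r in reads if len(r) >= min_len]
    return {
        "total_reads": len(reads),
        "passing_reads": len(passing),
        "total_bases": sum(len(r) for r in reads),
    }

# Toy "run" of three reads produced at a remote office or lab.
reads = ["A" * 60, "C" * 40, "G" * 75]
summary = summarize_run(reads)
print(summary)   # a few bytes cross the network, not the raw reads
```

Only the summary dictionary travels to headquarters; the raw data stays where it was generated, which is the "network, network, network" point above.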

11 So, what is the future? Moving forward --

12 actually, we have already put

13 some of the HIVE developments into the curriculum at

14 George Washington University. We are trying to

15 get students to start developing on HIVE. It takes

16 some effort to develop the APIs and things,

17 but the hope is that, just like your iPhone, just

18 like your smart phone, unless you have young

19 people working and developing new stuff, exciting

20 stuff, you are destined to fail.

21 That's why we want to actually

22 have this platform available for everyone, so they

185

1 can develop for it. And we are also at the core

2 of developing the NGS standards because of the

3 technology expertise which we have. Also, because of

4 HIVE's success in an NGS setting, HIVE is

5 now being studied and investigated for its ability

6 to handle postmarket and clinical data. That's

7 the honeycomb databasing model.

8 And we are collaborating with different

9 (Inaudible) centers. With some of them, we are

10 running active collaboration. For some of them,

11 we are actually developing new tools. We are at

12 the stage of actually adopting their pipelines,

13 and public HIVE collaborates externally with many

14 different organizations.

15 So, this is just a brief history of

16 HIVE. We started on this concept deployment -- we

17 had four Macintosh computers, three scientists,

18 two students and one goal, and we hoped to have

19 zero challenges. Unfortunately, that didn't

20 materialize. It's a four, three, two, one, zero.

21 It didn't happen. Okay, but that was a very small

22 team -- developers and scientists together,

186

1 working and trying to do something, and then we

2 actually got funding in 2012 from the Medical

3 Countermeasures Initiative.

4 And then, we went to research production

5 in May of 2013, and -- actually, this is a

6 milestone for us, also, because we designed a very

7 nice short-read aligner, but just like any aligner --

8 there are tens of them; everybody likes their own

9 flavor. It is adapted in HIVE, and it performs

10 very well. And oh, just this week, we got our

11 FISMA categorization and we (Inaudible) to operate

12 in a regulatory environment as a regulatory

13 production system for NGS data. And currently, we

14 have 80 big or small projects running in HIVE,

15 with all kinds of data, from huge terabyte

16 datasets down to a single file where you're

17 trying to find out where it aligns.

18 And in the last year -- actually, I should

19 say the last 1.5 years -- we have

20 15 publications, and some are pending or in the

21 submission process, which I believe is the

22 biggest achievement, because in science, your

187

1 worth is measured by how much you can actually

2 publish and what you can actually change in the

3 world, what impact you can have in the

4 world.

5 So, this is the HIVE development team.

6 I hope you understand that a big project like this

7 needs a lot of people, and I tried to mention all

8 of them, but in case I forget, I promise to bring

9 chocolates. And we had project leaders and HIVE

10 friends in different countries; those are

11 mentioned here. Those are people whose advice or

12 ideas are incorporated, or people who are helping

13 with purchasing hardware or moving stuff or

14 connecting stuff. And the high performance

15 computing center has done a wonderful job.

16 We have a 3,000-core high

17 performance computing center, and they have done a

18 wonderful job collaborating with us. And again,

19 as usual, our researchers, without whom we

20 couldn't do it. And please ask me questions.

21 (Applause)

22 MS. VOSKANIAN-KORDI: Actually, there

188

1 was a question online before they get the mikes

2 out. Or, I'll give them some time to get the

3 mikes out. So, there is a question saying, any

4 plans to work with FDA to develop a validation

5 methodology for this pipeline?

6 DR. SIMONYAN: Yeah. Like I said, if I

7 understand it correctly -- if it's about

8 HIVE, yes, we are at the core of the development of

9 these standards and validation protocols. I

10 should be very clear -- HIVE is not the only

11 NGS platform at FDA. There are many

12 beautifully built bioinformatics pipelines and

13 platforms, but we are the one who got the

14 regulatory approval for regulatory analysis.

15 But we are inclusive. We are

16 collaborating, and yes, we are involved in this

17 development of the validation pipelines. And

18 as I was saying, we wrote a big

19 document on NGS bioinformatics validation and its

20 standardization. All of it is already implemented

21 in HIVE, because we didn't want to just sit down

22 and write stuff. We wanted to actually see if it

189

1 works. So yes, we are involved in it.

2 SPEAKER: Is the intent for HIVE to

3 become an open platform for others to develop

4 applications --

5 DR. SIMONYAN: Yes.

6 SPEAKER: -- to use that framework?

7 DR. SIMONYAN: Yes, yes.

8 SPEAKER: And when will it be available

9 as an open platform?

10 DR. SIMONYAN: Yeah. We are pushing

11 hard for it. I mean, the reality is that some of

12 the code base -- the API -- we intend to release in

13 January. The API will come very soon. As for the

14 source code of the algorithms, there are multiple

15 layers of source code. Do you understand? The

16 source code of the algorithms, we also hope to

17 release at the same time, although, I mean,

18 as soon as you release source

19 code, you have to support compilation and you

20 are committing some actual developer

21 resources. And that's the difficult thing to

22 commit to.

190

1 So yes, we are trying to release the

2 source code of the algorithmic layer also, given

3 those connected issues. But there is also code

4 base for which the IP is being resolved now.

5 We are trying to make sure that the FDA has

6 ownership of it. But those are things that

7 still have to come.

8 SPEAKER: Thanks, Vahan. Yeah, great

9 talk. Just like any performance system, there's a

10 lot of design decisions to make. With your

11 honeycomb system, you know, you are able to deal

12 with sensible data, let's say. But what kind of

13 trade-offs are there in that? Like is it easy to

14 query? Is it easy to report on? Are there other

15 trade-offs that were made?

16 DR. SIMONYAN: Yeah. So, honeycomb

17 is actually -- underneath, honeycomb is

18 a low level engine. We are going to provide

19 whatever standards we decide on. Like, we are

20 adopting NCBI standards in this case. Yes? When

21 you query honeycomb to

22 give you the data for a particular metadata

191

1 object, those standards will be generated by

2 honeycomb.

3 At this particular moment, if you are

4 asking honeycomb, give me the data

5 for, let's say, a BioProject or BioSample, it will

6 produce the same thing as NCBI produces when

7 you're running (Inaudible) to retrieve the same

8 information. Underneath, we keep it slightly

9 differently, because, I mean --

10 within half a year, we expect to get 300 million

11 metadata records.

12 The reality is that we are trying to

13 step away from XML internally, because it's too

14 heavy. So underneath, we keep it a different way.

15 We optimize it differently. But export and import

16 should be conformant to whatever this community

17 decides. We'll make sure of that.
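The pattern being described -- a compact internal representation with a standards-conformant export boundary -- can be sketched as follows. This is an illustrative sketch only: the field names and XML layout are hypothetical stand-ins, not honeycomb's internals or the actual NCBI BioSample schema:

```python
# Keep metadata in a lightweight internal form (plain dicts here),
# and render the heavier, standards-conformant XML only when a
# record is exported.
import xml.etree.ElementTree as ET

class MetadataStore:
    def __init__(self):
        self._records = {}                 # compact internal form

    def put(self, record_id, attributes):
        self._records[record_id] = attributes

    def export_xml(self, record_id):
        """Build the conformant XML view only at the export boundary."""
        root = ET.Element("BioSample", accession=record_id)
        attrs = ET.SubElement(root, "Attributes")
        for name, value in self._records[record_id].items():
            ET.SubElement(attrs, "Attribute",
                          attribute_name=name).text = value
        return ET.tostring(root, encoding="unicode")

store = MetadataStore()
store.put("SAMN0001", {"organism": "Homo sapiens", "tissue": "blood"})
print(store.export_xml("SAMN0001"))
```

The design choice is the one stated above: XML is too heavy to carry for hundreds of millions of records internally, but anything that crosses the system boundary conforms to whatever schema the community agrees on.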

18 SPEAKER: Yeah, there are hybrid solutions

19 --

20 DR. SIMONYAN: Yeah.

21 SPEAKER: -- like x amount of

22 (Inaudible) relational with (Inaudible) space.

192

1 DR. SIMONYAN: He likes 14 14 XML

2 (Inaudible).

3 SPEAKER: Thanks.

4 MS. VOSKANIAN-KORDI: There is another

5 online question. What element of the pipeline

6 will be validated, and when?

7 DR. SIMONYAN: Which particular pipeline

8 and when --

9 MS. VOSKANIAN-KORDI: The HIVE pipeline.

10 DR. SIMONYAN: Well, HIVE is not a

11 pipeline. It has multiple pipelines; it's a

12 platform. So in HIVE -- actually, a very

13 interesting question came from this gentleman.

14 He said, can we download just one pipeline out of

15 HIVE? The unfortunate reality is that we are

16 working within this platform, and all of our tools

17 are platform dependent, because of the way

18 our data is stored -- if it's (Inaudible) data, we

19 cannot get it out. And the way data is

20 maintained internally, there are certain

21 limitations to it. So, the

22 pipelines are adapted to the platform.

193

1 Just like -- I mean, you wouldn't be surprised if

2 you tried to run Microsoft Word for Windows on a

3 Linux platform and it doesn't launch. Yes?

4 Because it is a different platform.

5 So in a way, you can think of HIVE as an

6 operating system designed to work on multiple

7 computers, except that it also is extensively

8 attuned for big data, and a lot of tools exist

9 already in NGS. So, HIVE has many pipelines, and

10 they will be scrutinized exactly as much as any

11 other FDA tools or any other tool from industry,

12 whoever tries to validate it.

13 But again, this is my

14 perspective, not the FDA saying so -- there are

15 many beautifully built tools. This is just one

16 platform. You have to understand that. We are

17 proposing this, and we are exactly on the same

18 footing as everybody sitting in this room is.

19 There are no prejudices here whatsoever.

20 SPEAKER: Good question. Here I am. So

21 good talk, actually.

22 DR. SIMONYAN: Thank you.

194

1 SPEAKER: My question is, so is the HIVE

2 similar to some of these open source frameworks,

3 like MapReduce or Hadoop?

4 DR. SIMONYAN: Yeah. We are a little

5 bit lazier than MapReduce. We map, but we do not

6 reduce, because a lot of times -- yes, it's very

7 similar. We are not using it underneath, but the

8 platforms and ideas are very similar, except in

9 MapReduce, there is a stage where you may or may not

10 reduce. Because our data is so big, reduction

11 doesn't really help much. And the next tool you are

12 going to run on the data also knows how to

13 work on this mapped data. So, reduction is done

14 usually only for visualization, conclusion

15 making, or download purposes. There is a big

16 similarity between them, but the engine is

17 different.
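The "map, but do not reduce" idea can be illustrated with a toy example. This is a conceptual sketch of the pattern, not HIVE's actual engine; the nucleotide-counting map function is a hypothetical stand-in for real per-chunk analysis:

```python
# Each input chunk is mapped independently, and the mapped results
# are kept per chunk so that downstream tools can consume them
# directly. Reduction is deferred until a small summary is wanted,
# e.g. for visualization or download.
from collections import Counter

def map_chunk(reads):
    """Map step: per-chunk nucleotide counts."""
    return Counter("".join(reads))

chunks = [["ACGT", "AACG"], ["GGTT", "ACAC"]]

# Map without reduce: one result object per chunk, retained as-is.
mapped = [map_chunk(c) for c in chunks]

# Reduce only on demand, when an aggregate view is actually needed.
totals = sum(mapped, Counter())
print(totals)
```

Because the per-chunk results stay addressable, the next stage of a pipeline can operate on the mapped data directly instead of waiting for a global reduce, which is the laziness being described.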

18 SPEAKER: So how fast is your engine?

19 DR. SIMONYAN: For map reducing? So,

20 examples I can bring -- I mean, I can't speak to

21 the speed of the engine, because it depends on

22 what pipeline we are talking about. Yes?

195

1 SPEAKER: Yes, yes.

2 DR. SIMONYAN: Hadoop and similar

3 platforms are Java based, whereas the

4 core of HIVE and all of these components are C/C++

5 based, very close to the machine. So,

6 we compared them initially when we were

7 initiating the process. We tried

8 different platforms, and it came out that C/C++

9 performance is significantly better, so we stuck

10 with this paradigm. But they're very similar

11 ideas. Over there?

12 SPEAKER: I really enjoyed your

13 presentation.

14 DR. SIMONYAN: Thank you.

15 SPEAKER: So early on, when we developed

16 our (Inaudible) track, we ran into these issues.

17 And actually, we had some discussions about

18 whether these are tools we need to develop for

19 the technology savvy person or just for the

20 reviewers. I think these are the seminal questions

21 for HIVE. And of course,

22 at the end of the day, we have to conduct a lot of

196

1 training courses to get users really on board to

2 use these tools. So, do you have these kinds of

3 (Inaudible)?

4 DR. SIMONYAN: Both for reviewers and for

5 technology savvy people? That is the question?

6 SPEAKER: Yes.

7 DR. SIMONYAN: So, I mean, based on the

8 interfaces -- let's be completely honest with it.

9 It's a very good question. The computations are

10 sometimes really complex, and even the

11 interpretation is very complex. So, the current

12 set up is such that there is a learning curve.

13 You have to understand how you get the data,

14 how you launch, and what the interpretation of

15 the many available arguments is.

16 But you're absolutely right that we have

17 to move towards a more technician-friendly, few-

18 buttons, here-is-the-data, get-me-the-output kind

19 of situation. And we have done that for some of

20 our tools already. We wanted to create this

21 advanced engine which can be customized by

22 advanced users. At the same time now, we are

197

1 moving to these overview engines, which is the web

2 page, pretty much. You come. You select two

3 things. Click. You get your third thing.

4 And I think the NCBI model is wonderful,

5 because they've done a lot of work in making it

6 available and understandable from not just a

7 technology expert's perspective, but also from a

8 scientist's or reviewer's perspective. We are

9 getting there, but I wouldn't say that right now

10 it's so easy, because there's a really huge amount

11 of information there. We are trying. You know?

12 It's a work in progress. You know?

13 MS. WILSON: I was just going to add

14 that also, we are in the process of developing

15 training for review staff to just get them

16 introduced to the technology, and then have more

17 in-depth training for people who are going to

18 actually have to analyze the data as part of the

19 regulatory submissions.

20 DR. SIMONYAN: Yep. Yep.

21 MS. VOSKANIAN-KORDI: Well, we're going

22 to end the questions there and allow the next

198

1 speaker to take over the podium. I'm going to

2 pull up those slides, if you want to introduce

3 him.

4 MR. YASCHENKO: And the next speaker is

5 Dr. Warren Kibbe. He represents the National

6 Cancer Institute, which, as you can imagine, deals

7 with a huge volume of cancer data and the

8 solutions for storing this data, and also

9 the computations on it, which they build and will present.

10 (Break in recording)

11 DR. KIBBE: Great. Well, I want to

12 thank the organizers for this, and this microphone

13 is a little bit low for me, so I guess I'll

14 lean over here so you can hear me. So, unlike

15 the previous speaker -- and I really enjoyed

16 hearing what HIVE can do, I'm going to focus more

17 on the problems, frankly, that the NCI is

18 grappling with in generating a lot of this

19 data, and how do we get it into folks' hands?

20 How do we really provide the community with access

21 to these data sets and the computational

22 horsepower behind it?

199

1 So, I'm going to take a little detour

2 before I get into some of that. I want to talk

3 just about what the national challenges in cancer

4 data really look like. I'll briefly talk about

5 disruptive technologies, because I think that's

6 what's leading us into these problems of big data,

7 and I think that's a good thing. I'll talk a

8 little bit about two specific initiatives in the

9 NCI, the genomics data commons and the cloud

10 pilots, and then close with just what I think is a

11 really important issue, and that's how do we start

12 to build a national learning system where we

13 really learn from every single cancer patient

14 that's getting clinical genomics done. And I

15 think that's something that everybody who's in the

16 cancer space agrees to, and I think it has a lot

17 of impact, both with the FDA and frankly, all

18 kinds of diseases, not just cancer.

19 So, two of the big problems from an

20 informatics perspective that I see we're all

21 grappling with are how we really lower barriers

22 to data access. So, from a security and a privacy

200

1 standpoint, we have an awful lot of barriers that

2 stop us from gaining access, particularly to

3 genomic data and associated clinical data.

4 And then, what we really want to do is

5 get to the point where we have access to those

6 data and we can learn from them, and we can build

7 predictive models that let us really help cancer

8 patients. So again, my perspective is very much

9 on the cancer side. So, how do we do this for

10 cancer patients? But again, I think it has

11 relevance for every single disease area.

12 And from a principle standpoint, I think

13 we really need open science. We need that

14 semantic interoperability, and Vahan really spent

15 the whole first session talking about all of the

16 pieces of that and how important they are. And

17 then, the last piece is really, we need to have

18 sustainable models for this infrastructure. And I

19 think that's something, particularly when we're

20 thinking about big data, that's becoming apparent.

21 We can't replicate this data everywhere in the

22 world, so how do we do this in a sustainable way?

201

1 So, I'm going to turn first to

2 disruptive technologies. So, how did we get to

3 this place? And I think we all know that high

4 throughput biology is both the source of much of

5 the really important biology that we're doing now,

6 and also the source of all of this big data that

7 we're grappling with, or one of the sources.

8 And something that I guess I like to

9 think about is, as we start generating Next

10 Generation sequencing and as we start doing this

11 in a really detailed way, it's really forcing us

12 to think as a community about computational

13 biology and systems biology. Have we really

14 reached the end of thinking about this from a

15 purely reductionist standpoint? And I think the

16 answer is yes. Or at least I hope the answer is

17 yes.

18 So the other thing, and this came out of

19 a workshop from the IOM about a year ago that I

20 was involved in, is realizing how ubiquitous

21 computing and access to data has really become

22 throughout the whole world. So, it was shocking

202

1 for me, that as of December of 2013, so it's now

2 almost a year ago, to realize that there were 6.6

3 billion active mobile contracts in the world, and

4 the world population being at the time, 7.1

5 billion. So more than 90 percent of the

6 world's population has access to at least a cell phone

7 contract, and there were 1.9 billion smart phone contracts.

8 So, that means that access to data has

9 now really become pervasive in a way that wasn't

10 true five years ago. So, how do we really

11 capitalize on that, knowing that almost everyone

12 now is a data provider and that data immersion is a

13 real thing? Are you going to --

14 (Discussion off the record)

15 DR. KIBBE: So, the reason I think

16 that's really important is there are lots of folks

17 who talk about social media now. I think that

18 there seems to be a pretty big age gap in thinking

19 about social media, but it's clear that everyone

20 under the age of 30 lives by social media.

21 And how do we reach them? How do we really change

22 their behavior? Again, from a cancer perspective,

203

1 that's really important, because I think that

2 there are three modifiable risks that everyone has

3 for cancer: infectious disease; smoking; and

4 poor nutrition and lack of exercise. Those three

5 things contribute to about 50 percent of our

6 cancer burden across the world. So, if we can

7 just start to address the things we know, we'll

8 really relieve a huge burden on the world with

9 respect to cancer. So, I think that's really

10 important, and again, I see those

11 disruptive technologies playing an enormous role in

12 that.

13 So, I'll get into big data now, since

14 that's what I'm supposed to talk about. So, we

15 have three very large cancer projects that have

16 generated comparatively speaking, huge amounts of

17 data: TCGA, TARGET and ICGC. The

18 ICGC is not an NCI initiative, but it's

19 closely aligned. And then coming out of those, we

20 have the Cancer Genomics Data Commons, which I'll

21 talk about, and the NCI cloud pilots. And along

22 the way, we also have a number of clinical trials

204

1 that are starting to use these data to actually

2 assign patients to specific arms.

3 So, I won't belabor this, because you've

4 been hearing about this all morning and in the

5 previous talk, but we're now inundated by data.

6 What's really good is, of course, that the

7 computational capacity and the storage capacity

8 we have are all rapidly increasing -- all

9 exponentially increasing.

10 I'm going to switch gears for just a

11 second, and we're going to go back to how we got

12 in this place, because I think it's useful to

13 think about very briefly. So, this takes me back

14 almost to when I was first becoming a scientist,

15 not quite that far back, and it really starts to

16 map out the Human Genome Project, when it started

17 and where we were. So, when the Human Genome

18 Project started, we weren't really doing mass

19 sequencing. We were doing mapping.

20 And again, from looking at that from an

21 NCBI standpoint, the amount of data that was

22 around then seems laughably trivial today. But

205

1 it wasn't all that laughable at the time. And

2 here's a little timeline of various sequencing

3 groups. So, you can see the Saccharomyces

4 cerevisiae sequencing completed in '96. We've got

5 SGD represented here in the room.

6 Bacterial genome sequencing continues

7 onward, because there are so many different

8 species there that are being sequenced. And you

9 can see we moved from just sequencing incredibly

10 small things to getting closer and closer to doing

11 the full human genome.

12 And likewise, you see this enormous ramp

13 up in the amount of data, and that was certainly

14 reflected in Eugene's talk looking at the current

15 version of what's in NCBI. But this is now

16 looking backward more than 14 years ago. And I

17 And again, the

18 reason I show this slide is that a

19 technology, as it starts to reach maturity,

20 generates most of its data at the end of its

21 current life cycle. And so, that's something we

22 see with TCGA.

206

1 And of course, in February of 2001,

2 these two very seminal works coming out of the

3 Human Genome Project were published. And of

4 course from an outcome standpoint, we know that

5 the Human Genome Project cost a bit more than $5

6 billion, but it's generated more than $800 billion

7 in the U.S. economy alone. So from an economics

8 standpoint, it's been enormously successful. From

9 a scientific standpoint, equally so.

10 So, these are some papers that actually

11 have come out more recently from TCGA. So again,

12 The Cancer Genome Atlas. I'm not going to dwell

13 on these, but we're really starting to understand

14 much more consistently, much more precisely, the

15 genetic underpinnings of cancer. And with that,

16 we hope we'll start to be able to understand much

17 more precisely how we can intervene in cancer.

18 And of course, along with that, TCGA is

19 a long running project at this point. It's going

20 into its tenth year shortly, and it's been a great

21 collaboration. It started out looking at just a

22 few tissue types in cancer, and now, it has been

207

1 expanded to more than 20 tissues. And I think

2 again, echoing what a number of folks said this

3 morning, a lot of it is -- it's a test bed, and

4 there's been incredibly important QC components

5 that have come out, out of TCGA.

6 So, I'm going to skip to a few slides

7 coming directly from the TCGA Consortium. And one

8 of the really interesting parts of what's come out

9 of the TCGA is just being able to look across all

10 of these tissues and realize that there are

11 different mutational patterns in different types

12 of cancers. So, what in retrospect is fairly

13 obvious, is that pediatric cancers have a

14 relatively low mutation rate, and things like

15 melanoma have a very high mutation rate. And

16 that's all diagrammed out here very nicely. But

17 there's been some incredibly transformative

18 understanding of human cancer coming from TCGA.

19 And the papers are numerous, and every one of them

20 has been very insightful in helping us understand

21 important parts of cancer.

22 So, I want to delve into this one just a

208

1 little bit. This is looking at endometrial

2 cancers. What was interesting is that looking at

3 the histology of these cancers versus the way that

4 they were characterized from a genomic standpoint

5 pointed out that there were a number of cancers

6 that were being misdiagnosed. And so, those are

7 the ones on the right in those panels, where when

8 you look genomically at them, they were clearly

9 misdiagnosed, and there were a number of

10 endometrioid cancers that were put in there.

11 And what that let everyone do is start to

12 understand that, in fact, the outcomes for a given

13 treatment looked very different -- you

14 can see the survival curves look quite different

15 for each of these classes, even though previously,

16 they were all treated as one disease. So, now

17 we're starting to understand that the pathways

18 that underlie these diseases are specific, they're

19 important, and they highlight the different kinds

20 of therapy that need to be done.

21 So, with all of that said, one of the

22 problems for TCGA is it's been a 10 year project,

209

1 and it has a heterogeneity of technologies behind

2 it. There's imaging. There's multiple kinds of

3 sequencing. So, there's a push now when we want

4 to create the Cancer Genomics Data Commons and

5 create some harmonization between the way all

6 these data are being handled and the way they can

7 be analyzed.

8 And we also want now to create an

9 infrastructure that allows individuals to directly

10 contribute their own data to this

11 repository, and that would then create this

12 national or hopefully even international cancer

13 knowledge base.

14 So, as I mentioned, there's a lot of

15 different data types in TCGA. This highlights a

16 few of them, and that they're actually held in

17 different places. They're not just BAM files.

18 It's all kinds of things, and imaging, again, is

19 incredibly important to it. And we don't really

20 have a consistent way to gain access to it all.

21 So again, one of the drivers behind the Genomic

22 Data Commons is to put some cohesion around all of

210

1 this.

2 However, the way that the Genomics Data

3 Commons is still being thought of and built is

4 it's a classic data centric repository. So, how

5 do we actually allow people to access it and

6 download from it? And this is becoming a critical

7 point, because we now have -- in TCGA from a

8 sequencing standpoint alone, about two and a half

9 petabytes of data. And you can see the rapid

10 increase in the amount of data in TCGA. So again,

11 as I mentioned earlier, that means that individual

12 groups really can't download the whole data set

13 and compute on it locally. It just becomes a non-

14 starter.

15 On top of that, just downloading it, if

16 you assume everybody had a 10 gigabit connection

17 -- so, sorry for the -- I think we flipped back

18 and forth between geek speak and normal language

19 or normal scientific language, at least, which

20 already isn't normal language, but it takes about

21 23 days just to download the current TCGA dataset.

22 So, it's clear that just moving that data around

211

1 is no longer an option.
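The 23-day figure checks out arithmetically. A quick verification in Python, assuming the 10-gigabit link is fully utilized (real transfers would be slower):

```python
# Verify the speaker's estimate: downloading ~2.5 petabytes of TCGA
# sequence data over a dedicated 10 Gbps connection.
PETABYTE = 1e15          # bytes, decimal convention

data_bytes = 2.5 * PETABYTE
link_bps = 10e9          # 10 gigabits per second, assumed saturated

seconds = data_bytes * 8 / link_bps
days = seconds / 86400
print(f"{days:.0f} days")   # roughly 23 days
```

Two petabytes-scale downloads per research group is clearly untenable, which is the argument for bringing the computation to the data rather than the reverse.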

2 And then, of course, when you want to

3 actually compute on it, the amount of resources

4 and the amount of tooling necessary requires

5 really, something like HIVE. And again, we

6 shouldn't be asking everybody to set up their own

7 HIVE instance, although that might make a few

8 people very happy. So, that highlights, then, the

9 relationship between what we want to do between

10 the Genomic Data Commons and the NCI cloud pilots.

11 So, the idea is that the cloud pilots

12 will create an infrastructure that's tightly

13 coupled to the Genomics Data Commons, allow people

14 into that infrastructure -- a cloud

15 infrastructure much like what was being described

16 for HIVE, except we don't actually have any

17 working software yet -- and then be able to do the

18 analysis of TCGA data. So, that's the essence of

19 the cancer genomics cloud pilots.

20 So, we'll be announcing very shortly,

21 that there will be a number of folks funded to

22 do this, and the idea is to really explore the

212

1 models for cancer genomics and what those APIs

2 might look like that would be consumed by the

3 community, and of course, explore cloud models for

4 how to make data plus analysis really happen. And

5 again, I think the HIVE model is very

6 elegant.

7 But how do we do this in a community

8 focused way? So, how do we allow anyone to have

9 access to it? It's clear that NCI or the federal

10 government can't pay for it all, so how do we

11 implement this in a way that's scalable and

12 becomes cost effective for everyone? And I think

13 I've said almost all of this.

14 So, another part of this is how do we do

15 this in a reproducible way? And that was, again,

16 I think this morning, Vahan was talking about how

17 we can make standards a part of Next Gen

18 sequencing, and reproducibility is a big part of

19 those -- a big reason to have those standards.

20 So, I flew through my slides, and that

21 will hopefully keep us on target here. But I want

22 to leave you with what the future might look like.

213

1 I really think that computing clouds

2 are here to stay, and the NCI wants to understand

3 how we can make use of that kind of

4 infrastructure. I also think there's some real

5 benefit to social networks. How can we actually

6 change the behavior of folks throughout the whole

7 world in a way that reduces the risk for cancer?

8 That's a whole different kind of big data

9 talk.

10 And of course, there is a precision

11 medicine piece to this. So, it's not just

12 sequencing. It's imaging. It's histology.

13 There's all kinds of data that are being made

14 available. And how do we really combine all of

15 those in a way we can learn from them and do true

16 prediction? And again, the take home is how do we

17 take all of this and build something where we

18 really can build a learning healthcare network

19 where we learn from every single cancer patient?

20 So with that, I'll take questions. (Applause)

21 SPEAKER: That's a great talk, Warren.

22 So, one of the things about dbGaP -- so, do you

214

1 see streamlining of the dbGaP process, which do

2 you think is actually hampering some of the access

3 to the data? And do you see some other mechanism

4 or even some different set of rules, a new set of

5 tools in the new era? Who accesses --

6 (Simultaneous discussion)

7 DR. KIBBE: That's a great question. I

8 don't think dbGaP itself is necessarily the

9 problem. I think what it is, is there's a lack of

10 consistency in the way that consents are done.

11 There's a lack of consistency across projects. So

12 right now, I think everyone who has gained access

13 to dbGaP, the current status is it's project by

14 project. So, you submit and you gain access to a

15 given project. That's clearly not scalable.

16 That's not where we want to go.

17 The good part for TCGA is it's one

18 project. So, you gain access to everything inside

19 TCGA, but you've got to do the same thing for

20 TARGET. You know? Again, that's another set of

21 permissions. So, where I see this going in the

22 future is, hopefully, we'll have more harmonized

215

1 consent forms where we can actually lump things

2 into a group consent. I think, in fact, that was

3 a point this morning in someone's slides, talking

4 about these access groups in dbGaP.

5 So, I see that as one potential way

6 around it. The other side is getting participants

7 directly involved and getting them to actually

8 give their data freely in a very different model

9 -- so, one where the government isn't directly

10 involved. And that's a very different model, and

11 I think it would be very interesting to see that

12 pursued.

13 SPEAKER: This is just a comment. Since

14 I work with dbGaP, the concept of using the

15 application for multiple datasets simultaneously

16 is being considered by NIH. This is not NCBI.

17 It's NIH's decision. And one of them already

18 exists -- the general research usage set. You can

19 apply once for the general research and get

20 approved for multiple things at the same time.

21 So, it's being changed.

22 DR. KIBBE: And there's the new genomics

216

1 data sharing policy. Again, it starts driving us

2 toward that consistency of access.

3 SPEAKER: Another question is that you

4 mentioned TCGA and TARGET and other cancer related

5 big projects. And recently, in (Inaudible)

6 Lincoln Stein was talking about ICGC in detail and

7 all of the challenges. What other similar big

8 projects are running? And what is the mode of

9 interaction between those? Sharing the data?

10 Sharing the resources and --

11 DR. KIBBE: Well, so I think back to

12 Raja's point about the difficulty with dbGaP. So, when

13 you start going into the international

14 consortiums, so ICGC is the international

15 consortium, it turns out some of the countries

16 that have submitted data to it won't allow their

17 data to be held on U.S. soil.

18 The U.S., in some of the agreements,

19 won't allow it to go outside of the United States.

20 So, that makes it very hard to combine some of

21 this. So, there has been some really interesting,

22 and frankly, I think very novel thinking about how

217

1 we can start to combine ICGC and TCGA and TARGET

2 in a virtual environment that respects these

3 geographical locations, but still allows people to

4 do computation across them. But I think that's --

5 frankly, part of the problem there is that we have

6 the ability to say, no, we don't want to share

7 with them. That probably shouldn't happen in the

8 first place. But that's beyond a discussion just

9 for this room.

10 MS. VOSKANIAN-KORDI: There's actually a

11 couple of questions online.

12 SPEAKER: What is the NCI cancer genomic

13 data RFP? Expected (Inaudible)?

14 DR. KIBBE: The Genomics Data Commons

15 was awarded back in July or August. I'm not sure.

16 And I believe there will be, if there isn't

17 already, a public announcement of it.

18 SPEAKER: Okay. And are the NCI

19 informatics initiatives planned to (Inaudible)?

20 DR. KIBBE: Oh, ASCO LinQ. Sorry.

21 SPEAKER: Right.

22 DR. KIBBE: Yes. So, we're certainly

218

1 talking with ASCO, and the -- it's actually ASCO

2 CancerLinQ. That was partly why I was confused.

3 SPEAKER: Okay, sorry.

4 DR. KIBBE: No, no problem. So, those

5 of you who don't know about CancerLinQ, it's a way

6 that different healthcare providers involved in

7 oncology can start to share data about outcomes

8 and about therapies for their cancer patients.

9 Looking at TCGA -- so, TCGA, one of its downsides

10 is we don't have long-term outcomes for many of

11 the TCGA patients.

12 So, we have this very detailed snapshot

13 of their cancer, their histology, their point in

14 time where we took the data, but we don't always

15 have their long-term outcomes, and not even

16 necessarily what therapies they were given. So, I

17 think it's very natural to think about coupling an

18 archive like TCGA with something like CancerLinQ.

19 The problem is, there will be very few patients

20 that actually overlap between what's in CancerLinQ

21 and what's in TCGA.

22 So, long-term, yes, I think that's

219

1 exactly what we want to be able to do.

2 Short-term, it probably won't be of much value

3 initially. Laura?

4 MS. VENTRIA: Laura Ventria, UCSF.

5 Maybe to give a little positive comment to your

6 comments --

7 DR. KIBBE: Oh, absolutely, please.

8 MS. VENTRIA: So, I think the activities

9 ongoing in cancer but also other diseases to share

10 data and to be able to actually get to the

11 learning systems is very much taken up by the

12 Global Alliance for Genomics and Health. And I

13 think also tomorrow, that will be one of the talks

14 of this global alliance, where also, the

15 structured data components, as well as a way how

16 to access those data by API will be presented.

17 DR. KIBBE: Absolutely. So, for those

18 of you who don't know about the Global Alliance

19 for Genomics and Health, it is an international

20 consortium. It's made up of now -- I think it's

21 over 205 different organizations across the world.

22 And go and read their web site. They've laid out

220

1 very beautifully, I think it's eight principles

2 for data sharing and how we go about thinking

3 about data sharing across all kinds of diseases,

4 and explicitly, how we start thinking about

5 crossing interesting boundaries.

6 So, I think there's -- then there's some

7 great work going on. Actually, the beacon that

8 Eugene mentioned is a response to one of the GA4GH

9 initiatives that everyone should create a

10 beacon. And I think right now NCBI is the only

11 one that's created a beacon, but we have to start

12 somewhere.

13 So, I think there is a lot of hope here,

14 and I appreciate, Laura, your calling me out for

15 not being quite hopeful enough. There's a lot of

16 really good things going on for data sharing.

17 MR. YASCHENKO: The next speaker is Dr.

18 Toby Bloom. She's from the New York Genome Center.

19 So, welcome. She will present issues more related

20 to, I believe, real-life hospital clinical

21 decisions and sequencing.

22 (Discussion off the record)

221

1 DR. BLOOM: Can everybody hear me?

2 SPEAKER: Yes.

3 DR. BLOOM: Okay, good. So, I'm going

4 to talk about big data in clinical genomics and

5 clinical research studies. Okay? Why don't I

6 understand what this is doing?

7 (Discussion off the record)

8 (Break in recording)

9 DR. BLOOM: That did it. I just want to

10 tell you a little bit about the New York Genome

11 Center; just a couple of slides, because the New

12 York Genome Center is fairly new.

13 (Discussion off the record)

14 DR. BLOOM: The New York Genome Center

15 is fairly new. It was formed a couple of years

16 ago by a collaboration of 12 large health systems

17 in New York on the theory that it was better to

18 have one -- and cheaper to have one central genome

19 center than try to build 12.

20 We have more members now than we did

21 then, but you can see that almost all of the New

22 York hospitals are in here. Weill Cornell,

222

1 Columbia, Presbyterian. Cold Spring Harbor is a

2 member. Sloan Kettering is a member. Most

3 recently, IBM became our first corporate member,

4 but we're open --

5 (Discussion off the record)

6 DR. BLOOM: And we have little strange

7 ones, like the American Museum of Natural History,

8 which isn't exactly doing clinical genomics, but

9 they do have a lot of things they want to run

10 genomic analysis on. Our current capacity, I

11 think, qualifies us as big data. We are one of

12 the first four of the organizations that got

13 X Tens. In fact, the last two of our X Tens are

14 only coming in this week. Today?

15 They were supposed to arrive today, and

16 then we will have ten. We have eight of them

17 running right now. We have 16 2500s. We have

18 capacity for somewhere north of 10 terabases a

19 day. We do have a CLIA lab, although New York

20 CLIA lab standards mean that not everything we

21 want to run in the CLIA lab can we run quite yet.

22 Do you want to know about computer

223

1 infrastructure? We've got nine petabytes of

2 storage. Not all of it is used, but we expect it

3 to all be used by December. We have only 3,000

4 cores, but we expect to double that. We have all

5 the standard pipelines using mostly standard

6 methods. As I said, for cancer, we run three

7 somatic variant callers, three structural variant

8 callers, two copy number callers, one or two

9 purity and ploidy callers and god knows what else,

10 and then compare the results semi-automatically,

11 but a lot manually right now. And I will talk

12 more about databases later.

13 So, big data. It's a term that is

14 overused, and I don't think anybody really has a

15 definition for it. The standard definition a lot

16 of people use, which is attributed to Gartner

17 Group, but I'm not sure if it's really true, is

18 that it's a combination of how much data you have

19 as a volume, how fast it's coming at you, the

20 velocity, and how complex it is, the variety.

21 So, basically, big data is anything

22 that's too big for you to handle easily, and that

224

1 usually involves some combination of volume, speed

2 and complexity of the data. What I'm going to

3 talk about today -- everybody talks about the

4 volume. Lots of people talk about the velocity.

5 I really want to talk about the variety. I really

6 want to talk about how complex data is going to

7 cause us problems, and it's really showing up in

8 clinical genomics first. And that doesn't mean

9 it's going to stay there, but I think we've really

10 hit it.

11 So, you know, everybody understands that

12 we're talking about dealing with real patients,

13 not just samples we get that are anonymous, and

14 that we need the results faster, and that it's

15 going to be used for treatment or maybe not. And

16 everybody is worried about the accuracy of

17 interpretation when it's going to be used for

18 treatment. But I'm really going to focus, for

19 now, on how complex clinical research projects can

20 be. And I'm going to talk about clinical research

21 for now, rather than CLIA clinical one-at-a-time

22 samples.

225

1 Let me start with one example. The New

2 York Genome Center is currently running an

3 autoimmune disease study. We're currently doing only

4 rheumatoid arthritis, but we're expecting to

5 extend it to lupus and multiple sclerosis and

6 Crohn's Disease. And because we're a

7 collaborative center and were formed as a

8 collaborative center, we've got two other

9 hospitals working on this now, but we expect other

10 hospitals to come in for the other diseases.

11 What are we doing in this project? We

12 are taking weekly blood samples from patients. We

13 basically taught the patients to prick their

14 fingers the way diabetics do. They put four drops

15 of blood into a pre bar-coded tube; take a picture

16 of the bar-code with their cell phone and email it

17 in, so that we know when they took the sample, and

18 stick it in their freezer until they go to the

19 doctor.

20 We're doing this because the genome

21 isn't going to change. The RNA changes all the

22 time. And what we're hoping is that if we follow

226

1 people over a long period of time, we will be able

2 to figure out what changes in gene -- what genes

3 are involved when a flare happens. What happens

4 in the prodrome period before the flare? Can we

5 find early predictors?

6 We're interested in a lot of things, but

7 we're interested in finding the mechanism of

8 action. We're also interested in finding ways to

9 help patients manage their chronic diseases

10 better, and we're hoping that if we can get early

11 indicators of when a flare is going to happen,

12 that not only can we get the patients to doctors

13 earlier, but we may be able to connect it to

14 environmental triggers that you can't find now if

15 they're really long before the actual flare

16 happens.

17 So, we are collecting not only weekly

18 RNA data, we're collecting microbiome data. Right

19 now, just at the beginning of the trial and at the

20 first flare, fecal microbiome is known to change,

21 not only in Crohn's, but also in RA. And I've

22 heard recently in multiple sclerosis, as well.

227

1 It's not surprising; this is an autoimmune

2 disease. It's probably an effect rather than a

3 cause, but it might give us signs of things.

4 And we're doing both, because we want --

5 oral is a lot easier to collect than fecal, and if

6 we can see changes in oral, we'll be able to do

7 weekly oral microbiome collections, also, and just

8 drop the fecal. We may add methylome. We aren't

9 right now. Patients are filling out weekly surveys

10 of how they feel. They're going to the doctor

11 monthly to get clinical assessments of

12 inflammation that we can tie, then, to the RNA

13 samples.

14 We need their medication data, because

15 if they're on -- they're all on medications all

16 the time, and during flares, they're on more. But

17 we need to know when they started and then what

18 they're taking, because it changes the RNA data.

19 And the most interesting of these, maybe, is that

20 we actually have put an app on their smart phone,

21 which right now, is mostly collecting mobility

22 data.

228

1 And again, it's there mostly so that if

2 we can correlate changes in how fast people are

3 moving or how far they're going daily, we may be

4 able to connect that to changes in RNA expression

5 levels. If we can use that as a very simple

6 predictor that they're going into flare, we really

7 can easily change that app to say go to your

8 doctor now. That's the hope. We aren't there.

9 We'd really like to know about food

10 intake. There's no reliable way to do it.

11 There's actually somebody -- Cornell Tech, which

12 is the alliance of Cornell and the Technion, which

13 started in New York about a year ago. They're the

14 ones who are doing the smart phone apps. They're

15 doing mobile health. There's actually somebody

16 there who's trying to analyze pictures of plates

17 of food to figure out what's on them, because

18 that's about the only way we think we can get

19 accurate information about food intake. When it's

20 ready, we'll add that, but it's not there yet.

21 I talked about the study goals already,

22 but yeah, it's a combination. And by the way,

229

1 these are contradictory, and they compete with

2 each other. We want to understand the mechanism

3 of action of autoimmune disease, and we want to

4 know whether it's the same in all of them or not.

5 And we also want to help patients better manage

6 their disease and maybe find the environmental

7 triggers. The better the patients can manage the

8 disease, the less information we have to figure

9 out the mechanism of action. But we're doing both

10 together, nonetheless. Okay?

11 But look at how many kinds of data I

12 have. Right? And how I have to analyze them.

13 Okay? So I've got time series data for about six

14 different things from different time periods.

15 Okay? And I can't even smooth the peaks, because

16 what I'm looking for is the peaks. Right? So,

17 I've got to get the daily and the weekly and the

18 monthly stuff all aligned and analyze it. And

19 then, after I do that, I have to figure out what

20 correlates with what. Okay?
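The alignment step she describes -- daily, weekly, and monthly streams tied together without smoothing away the peaks -- can be sketched with a nearest-in-time join. This is an illustrative sketch, not the center's pipeline; the sample dates and expression values are invented:

```python
from datetime import date

# Toy weekly RNA samples: sample date -> an expression summary.
# The spike on 9/17 is exactly what we must not smooth away.
rna_samples = {
    date(2014, 9, 3): 1.1,
    date(2014, 9, 10): 1.0,
    date(2014, 9, 17): 2.7,
    date(2014, 9, 24): 1.2,
}
clinic_visits = [date(2014, 9, 15)]   # monthly inflammation assessments

def nearest_sample(visit, samples):
    """Date of the RNA sample closest in time to a clinic visit."""
    return min(samples, key=lambda d: abs((d - visit).days))

# Tie each monthly assessment to its nearest weekly sample.
aligned = {v: rna_samples[nearest_sample(v, rna_samples)]
           for v in clinic_visits}
print(aligned)   # the 9/15 visit picks up the 9/17 spike, 2.7
```

The same join generalizes to the daily mobility data: each stream keeps its own cadence, and correlation is computed only after this alignment.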

21 I don't think this is going to be an

22 unusual kind of clinical study in the future.

230

1 Everybody talks about wanting clinical data and

2 better and deeper clinical data to go along with

3 the genomic data. This is sort of, I think an

4 early example of that. We've seen smaller

5 projects like this, but this is a bigger one, and

6 it's not easy to do. So, those of you who know

7 me, know I always talk about the problems and I

8 rarely talk about the solutions.

9 We are working on it. Right now, we're

10 doing what we have to do, to deal with the

11 patients as they come in. We're trying to figure

12 out enough as we do it to build a system that will

13 deal with this. But for starters from an

14 infrastructure perspective, to ask the questions

15 we want, we can't have all this stuff in a

16 gazillion little files. Right? Some files are

17 big, but RNA files -- you know, if I have a cancer

18 genome -- whole genome, it could be 500 gig to a

19 terabyte.

20 RNA files are small and I'm going to

21 have them weekly, and I might have them daily.

22 Right? They're not that small when you get that

231

1 many of them per patient, and you have a lot of

2 patients. It's still going to be big data. But

3 I've got clinical data. I've got patient surveys

4 with 10 questions on them. I have to get all of

5 that -- and I've got to take the EHR data. I've

6 got to get all of that into a database, or at

7 least indexed in a database that lets me ask

8 questions.

9 Okay? And a standard relational

10 database isn't going to do it. And by the way, as

11 we get bigger and want more data, yes, I am

12 really, really hoping that the global alliance

13 will come up with the mechanism by which we can

14 connect everybody else, and yes, I'm interested in

15 doing (Inaudible). But we haven't even gotten

16 there yet, because we're new and I've got way too

17 much to do.

18 But we're in the middle of designing

19 that database. It's not easy, and finding the

20 infrastructure isn't easy, and getting it to be

21 scalable in all of those dimensions is not easy.

22 Some things in HIVE may help. HIVE as it is,

232

1 probably won't. But I'll steal what I can. From

2 a compute perspective, you know, we've got

3 alignment methods out there. Take your pick.

4 We've got variant call methods out

5 there. Take your pick. There are people working

6 on longitudinal data. There's nobody who's really

7 done this yet in any standard way that we can

8 steal. We're building the methods ourselves. And

9 we're going to build them up as we get more and

10 more patients in and figure out what we're doing.

11 We're doing it manually right now.

12 And by the way, even with just the first

13 patients, we can see changes -- we can see real

14 differences in RNA expression as patients go into

15 flare. We don't know what it means yet, and we

16 don't have enough data to know what it means yet.

17 But we're hopeful that that means we're really

18 going to find something. We don't know the

19 compute capacity we're going to need. We have no

20 idea what this compute is going to take, but doing

21 longitudinal data itself is time intensive.

22 So, I'm trying to point out here that as

233

1 we get closer and closer to clinical studies where

2 we're not in sort of the research area, where with

3 TCGA, you have three centers that all analyze the

4 same data through their pipelines and find three

5 different answers, and then take three months to

6 figure out which answers were right. We're not

7 there. We're in clinical right now. Right? And

8 we're trying to figure out -- this isn't real time

9 yet, but we're trying to figure out fast what's

10 going on here.

11 Among other things, we want to be able

12 to change the protocols as we find out what works

13 and what doesn't work. Data interpretation and

14 clinical accuracy. I know everybody at the FDA is

15 really worried about diagnostics and tying

16 diagnostics to drugs. I'm going to go over this

17 really quickly, because it's not the part I'm

18 worried about. Okay? Yes, I believe if you

19 understand what the mutation is and you have a

20 drug that you believe works, you can come up with

21 a diagnostic for it.

22 The harder question is, if you come up

234

1 with a new mutation nobody has seen before, but

2 when you analyze it biologically, it affects the

3 same pathway in the same way as some drug out

4 there for a different mutation does, what do you

5 do? Especially if it's a rare disease and you're

6 not going to get a drug company to test it.

7 Right? Are doctors going to use it off label?

8 Are you going to do more testing? What do you do?

9 And we have one example where -- of

10 exactly this, where I know that at least one

11 doctor used a drug off label for a dying kid, and

12 then I never heard back again, so I assume the kid

13 died anyway. But I don't know. But we know that

14 sometimes it works in a different -- especially if

15 it's a different disease. We know that sometimes

16 the same mutation and the same drug, the drug will

17 work in one disease. It won't work in another

18 disease, when it's the exact same mutation.

19 And so these are problems that I'm not

20 trying to solve, but they seem to tie in to what

21 people are interested in here. I'm worried about

22 the next level of cases, which is my next case of

235

1 big data. Okay? So, I started out worrying about

2 the multi modal longitudinal data. This is a

3 different issue. When we're talking about

4 variants that have a risk of disease, that carry a

5 risk of disease with them, we don't have a good

6 handle in almost any case of what the penetrance

7 is of those variants; what modifiers, what other

8 genes, what other variants may or may not keep the

9 risk variant in check. Okay?

10 And so, we don't know when some

11 preventive treatment might be needed and when it

12 isn't. Okay? David Altshuler just -- sorry.

13 Two different mikes here get me confused. David

14 Altshuler not too long ago published a study on

15 diabetes, a rare variant that protects you against

16 diabetes, no matter what your other risk factors

17 are. It took 250,000 patients to get statistical

18 power to publish that paper. Okay?

19 So, this is a different case of big

20 data, and it's a really important one. And as we get

21 more and more into this, I think it's going to be

22 the next place that genomics is going to be

236

1 spending a lot of time. And yes, we've already

2 found some centenarians with two copies of ApoE4

3 who have no signs of Alzheimer's, and at over a

4 hundred, they really are controls. (Laughter)

5 They're not getting Alzheimer's. Right? But we

6 don't know why.

7 Everybody is really aware of the changes

8 in thinking about BRCA1 now and how much of a risk

9 BRCA1 is to various -- to different women. We

10 have to figure this out, especially as we get to

11 whether you use drugs early to prevent something

12 like Alzheimer's. We don't understand it. But

13 we're going to need tens of thousands of genomes.

14 We don't know how to get them. We're not going to

15 have them all. Hopefully, the global alliance is

16 going to help. There are regulations that make

17 this difficult.

18 Here's a situation that at the moment,

19 is driving me crazy. The New York Genome Center

20 is a collaboration of a lot of hospitals. We

21 often ask researchers if we can keep their data

22 at the New York Genome Center, so that it can be

237

1 easily available to combine with other data if

2 somebody needs more data for their studies, or at

3 least, to let us index it so they can go ask for

4 it and get permission.

5 And those people often say, you can keep

6 it, but our data access committee has to approve

7 its use, or our IRB has to approve its use. I

8 recently got permission to take some very large

9 datasets on one particular disease, some of which

10 were not sequenced with us. And they were all

11 consented for use in medical research around a

12 single group of diseases, but not all medical

13 research.

14 And I said to the caretakers of this

15 data, so I'm really interested in this penetrance

16 and modifier problem. So if somebody comes to me

17 working on that, the first question they're going

18 to ask is, can you tell me how many people with

19 the disease of interest I'm working on have this

20 variant. And how many other people in the general

21 population have this variant? So, all I'm going

22 to give them back is a count of some numbers of

238

1 people with this variant out of the tens of

2 thousands of genomes I have.

3 Can I include your samples in just

4 counting this variant? If it's not a rare

5 variant, then the number is large. There is no

6 chance of identification here. Okay? The answer

7 is no, because patients consented it only for use

8 for this one disease. And it doesn't matter that

9 this is only a count that is not identifiable.

10 They didn't allow for somebody to look and see if

11 they had this variant.
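The count-only query she proposes -- release an aggregate carrier count, never an individual answer -- could be sketched as follows. The suppression threshold and the cohort are illustrative assumptions, not an actual policy:

```python
MIN_COUNT = 20   # assumed policy knob: suppress counts smaller than this

def variant_count(genomes, variant):
    """Carrier count for `variant`, or None if too small to release safely."""
    n = sum(1 for g in genomes if variant in g["variants"])
    return n if n >= MIN_COUNT else None

# Toy cohort: 25 carriers of one variant among 125 genomes.
cohort = ([{"variants": {"rs123"}} for _ in range(25)]
          + [{"variants": set()} for _ in range(100)])

print(variant_count(cohort, "rs123"))   # 25: large enough to release
print(variant_count(cohort, "rs999"))   # None: too small, suppressed
```

Her point stands even with such a mechanism: if the consent form did not tick the right box, the sample cannot contribute to the count at all, no matter how unidentifiable the output is.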

12 So first of all, that causes major

13 problems for making progress in any study that

14 needs large numbers of controls or just large

15 numbers of patients, period, and especially when

16 you're looking at penetrance and modifiers, and

17 you're looking for modifiers. This is horrendous.

18 I wish somebody would tell me if there was a

19 regulatory way around that, but I don't think

20 there is. Okay?

21 But now, we come back to the database.

22 I need to connect all these kinds of data and

239

1 these huge numbers of genomes. And I mean, I have

2 a variant database. The variant database doesn't

3 do me any good. It doesn't matter if I actually

4 keep the data in a bunch of different databases,

5 as long as I can query across them easily. The

6 structure itself doesn't matter, but look at what

7 it does to my access control and security.

8 I now need, on every query, to look at

9 every cell in the database, not even every row in

10 the database, and decide if I can use it in this

11 query. You're saying no way. But if I look at

12 what my security is, in terms of informed

13 consents, in terms of data access permissions,

14 okay, informed consents have check boxes. Okay?

15 This disease only, all cancers, all medical

16 research, non commercial research only, commercial

17 -- okay. Okay? There can be multiple boxes

18 there. Okay?

19 That changes what samples, even within

20 one project, can and can't be used for what.

21 Okay? And it's not even completely hierarchical,

22 because the commercial, non commercial stuff is

240

1 orthogonal to the other stuff. And then, I have a

2 different problem. I have owners of samples, the

3 biosample repositories, that allow their samples

4 to be used in multiple projects that are unrelated

5 to each other, but they only want the PIs in their

6 project, in that particular project to see the

7 results of that particular project.

8 So, two PIs can have access to the same

9 sample. It can be aliquots of the very same

10 sample taken from the same patient at the same

11 time. And there are two different files or six

12 different files for that patient. And this PI can

13 see two and this PI can see four, and that PI can

14 only see one. And maybe they can ask for

15 permission for more, and maybe they can't. But

16 the more projects we try to use the same samples

17 in, the harder and harder the security and privacy

18 constraints get, and the more and more impossible

19 it becomes to put them into a database.
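The per-record check she is describing amounts to this: every sample carries the boxes ticked on its consent form, and a query states what it needs; a record is usable only if its consents cover every need. A minimal sketch, with invented flag names (the real forms are richer, and as she notes, commercial use sits orthogonal to the disease scopes):

```python
SAMPLES = [
    {"id": "s1", "consent": {"this_disease_only"}},
    {"id": "s2", "consent": {"all_medical_research"}},
    {"id": "s3", "consent": {"all_cancers", "commercial_ok"}},
]

def usable(sample, required):
    """True if the sample's consents cover everything the query needs."""
    return required <= sample["consent"]

query_needs = {"all_medical_research"}
allowed = [s["id"] for s in SAMPLES if usable(s, query_needs)]
print(allowed)   # only s2 consented broadly enough
```

This check must run inside every query, per record, which is exactly why she says an off-the-shelf relational database won't do it.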

20 There are databases out there that do

21 cell-level access control. Accumulo does. It's

22 a name-value pair database. I never get it to run

241

1 fast enough. Okay? I'm just putting this out

2 there. I can't even remember -- how far behind am

3 I? I'm five minutes behind.

4 There's lots of questions we want to

5 ask. Is there anything else on this I haven't

6 said yet? (Laughter) I don't think so. I'll just

7 keep going. I already said we don't know how to

8 analyze this data. We're working on it. I hope

9 other people are working on it. But forget the

10 multi modal longitudinal stuff. If I need 250,000

11 genomes and I have to analyze it once, that's a

12 problem in itself. Okay?

13 This is just a summary of everything I

14 just said, and since I'm behind, I'm not even

15 going to say it again. But it's important. All

16 right? Especially the penetrance and modifier

17 stuff. But for anything that needs to aggregate

18 data across locations, across diseases, across

19 projects, the project-by-project way we gain data

20 access now, like dbGaP allows, causes problems.

21 There's actually only one little thing

22 on this page that I wanted to say, which is -- and

242

1 I'll tell you (Laughter) -- my lawyer actually

2 didn't want me to say this, but I'm going to say

3 it anyway. Okay. We all know that eventually,

4 genomic data is going to fall totally under HIPAA,

5 and right now, it's in this gray area. But

6 everybody agrees that it's personally identifiable

7 information.

8 And as far as I'm concerned, when

9 things are personally identifiable information,

10 they should be kept encrypted. And I am perfectly

11 happy to keep it completely encrypted, once the

12 pipelines are over. But it can take weeks to get

13 through the pipelines, and I'm going to read and

14 write that data dozens of times in that time.

15 And I have tried hardware enhanced

16 encryption. It takes three hours to encrypt a big

17 BAM file. I can't do it. So, I'm actually

18 looking for other algorithms that I think will

19 help me maintain the data in storage, not

20 encrypted, but not identifiable. And I'm working

21 on some algorithms, and I'm going to go the

22 statistician route and try to get a statistician

243

1 to tell me it's not identifiable. And I'm hoping.

2 But it is a problem for all of us. We need to

3 understand what it means if we're not going to be

4 at risk of a breach that exposes a whole lot of

5 data that could wind up identifying people.
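One generic shape that "not encrypted, but not identifiable" storage can take (this is not the speaker's algorithm, which is still in development, just a common pseudonymization sketch): replace the direct identifier with a keyed HMAC token, hold the key separately with an honest broker, and let the working files move through the pipelines without PII. The key material and field names below are invented:

```python
# A generic pseudonymization sketch -- NOT the speaker's algorithm.
# Direct identifiers become keyed HMAC tokens, so working copies carry no
# PII, while the key (held separately) still permits authorized re-linking.
import hmac
import hashlib

SECRET_KEY = b"kept-offline-with-the-honest-broker"  # hypothetical key material

def pseudonym(patient_id: str) -> str:
    """Deterministic keyed token: same input + same key -> same token."""
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

record = {"patient_id": "MRN-0012345", "variant": "chr8:g.117864A>G"}
deidentified = {"pid": pseudonym(record["patient_id"]),
                "variant": record["variant"]}
assert "MRN" not in str(deidentified)  # no direct identifier in the working copy
```

Because the token is deterministic, the same patient links across files and pipeline runs; whether the residual data is truly "not identifiable" is exactly the statistical question raised above.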

6 I've probably said all of this already,

7 but yeah, we're going to have larger numbers of

8 genomes together. We're going to have more kinds

9 of data. They're all going to be longitudinal.

10 We have to figure out how to store the databases

11 so we can query them, not just run pipelines over

12 them from front to back on every file and combine

13 lots of files together in each algorithm.

14 And that's it. I have a lot of people

15 to thank for this and a lot of people who are

16 working these projects. And thank you.

17 (Applause). Not too far over.

18 MS. VOSKANIAN-KORDI: Keep that, because

19 you're answering questions.

20 DR. BLOOM: Oh, I'm still answering

21 questions. Okay.

22 MS. VOSKANIAN-KORDI: Please.

244

1 DR. BLOOM: I'll put it back on.

2 MS. VOSKANIAN-KORDI: Please answer

3 questions.

4 DR. BLOOM: Yes.

5 SPEAKER: You were talking about some of

6 the -- in the first part of your lecture when you

7 were talking about the immune -- sorry --

8 DR. BLOOM: The autoimmune disease

9 project.

10 SPEAKER: Autoimmune diseases.

11 DR. BLOOM: The rheumatoid arthritis.

12 SPEAKER: Yeah. A lot of those -- the

13 primary signal comes much before you see the

14 disease state.

15 DR. BLOOM: Yes.

16 SPEAKER: And some of it actually can be

17 related to date of birth, which means some kind of

18 environmental factor. How do you -- how far can

19 you go backwards when you set up the databases?

20 DR. BLOOM: So, there's two answers to

21 that. In terms of -- I mean, these patients flare

22 only once or twice a year. So, for things that

245

1 are you know, months ahead, we will have even the

2 RNA data. And we can get access to their clinical

3 records, but I don't think -- I think if there

4 were things in the clinical records, we would have

5 found them already. So, in terms of being able to

6 relate the genomic data to the clinical data, I

7 think it's only going to be from the start of the

8 study.

9 The other side of that is that I'm doing

10 something a little bit backwards, which some

11 people think I'm crazy for. The New York Genome

12 Center is the host for the PCORI clinical data

13 research network for all of New York. I am going

14 to have full longitudinal clinical --

15 de-identified clinical records for just about

16 every patient in New York at the New York Genome

17 Center, which means moving forward, I am hoping,

18 and I do not have full permission for this yet,

19 but I am hoping that all the researchers at those

20 hospitals who send us genomic data will be able to

21 get access to the anonymized ID to link to full

22 clinical records.

246

1 So in that sense, we can do it. But I

2 can't -- you know, I don't have their genome from

3 their date of birth, and when we start sequencing

4 all babies, maybe I will. But I don't know how

5 else to answer that.

6 MR. YASCHENKO: I have one comment and

7 one question. The comment would be that about

8 encryption. And I think that we hit the same

9 issue when we were in post (Inaudible), how do we

10 encrypt your data. And in my opinion, I think

11 there is a need to develop hardware platforms

12 which are maintaining the data encrypted.

13 And I see here, and I made sure that

14 some hardware manufacturers are also here, because

15 this question of encryption and encoding is --

16 DR. BLOOM: It's a -- yes. I'm sorry to

17 interrupt. Go ahead.

18 MR. YASCHENKO: That is what the --

19 DR. BLOOM: I was going to say, you can

20 buy encrypted disks.

21 MR. YASCHENKO: Yep.

22 DR. BLOOM: Okay? They're more

247

1 expensive. We could do it. There's a problem

2 with them.

3 MR. YASCHENKO: Uh-huh.

4 DR. BLOOM: If somebody breaches your

5 system and gets into your server, those encrypted

6 disks are designed so that when your server starts

7 up, the decryption key is loaded, and anybody who

8 is using the disk -- it sees the data unencrypted

9 automatically. So, if your server is breached,

10 encrypted disks don't help.

11 MR. YASCHENKO: But perhaps the file

12 system developers, and those would be the same

13 people -- yes? Because I see people from AMC,

14 from IBM, from others perhaps I didn't recognize

15 -- if they develop file systems, it's a design of

16 the file system -- when does the encryption -- it

17 becomes open for the usage, for the programs?

18 Perhaps, if they collaborated with you and with us

19 --

20 DR. BLOOM: That would be --

21 MR. YASCHENKO: -- that would be a

22 wonderful addition to it.

248

1 DR. BLOOM: That's an excellent -- yes.

2 I think it's going to take collaboration with

3 hardware manufacturers to do something.

4 MR. YASCHENKO: Yep, yep. Okay.

5 DR. BLOOM: Or, finding a different

6 software solution that's not encrypted, which is

7 what I'm currently trying to do.

8 MR. YASCHENKO: And the question would be

9 towards -- not towards you, but towards maybe some

10 folks from FDA. But we're living in this

11 century, when younger people don't have any

12 concerns about identities. They put all their

13 pictures, videos, everything out. So, if

14 let's say, somebody wants to publish a genome for

15 usage of any purpose, are there any regulations

16 which would be controlling that?

17 DR. BLOOM: Yes.

18 MR. YASCHENKO: There are. You know the

19 answer. That's good.

20 DR. BLOOM: Well, it depends in part, on

21 what state you're in.

22 MR. YASCHENKO: Uh-huh.

249

1 DR. BLOOM: Okay, it's country. It's

2 not just state. New York has particularly strict

3 regulations about it. We are trying to crowd

4 source. Warren?

5 DR. KIBBE: Toby, that was great. You

6 brought some really wonderful points. I just

7 wanted to bring up a conversation I heard that

8 came out of the global alliance, and that was,

9 there was a discussion of some treaties that were

10 signed back after World War II that explicitly

11 called out the right of patients to participate in

12 research and to benefit from research.

13 And I think they were looking at some of

14 those treaties, which all countries in the world

15 have signed, as a way to say there is a way to

16 force data sharing, looking at it from a right of

17 --

18 DR. BLOOM: Awesome.

19 DR. KIBBE: -- the individual to

20 participate. So I think that again, that becomes

21 -- that makes these beacon services possible.

22 DR. BLOOM: That's really -- that's a

250

1 really interesting thing. I love it.

2 MS. VOSKANIAN-KORDI: Anybody else?

3 (No response heard)

4 DR. BLOOM: Okay. Thank you so much

5 (Applause). Now I can take this off.

6 (Recess)

7 MS. VOSKANIAN-KORDI: All right. We're

8 going to go ahead and get started, but before we

9 start, the next session is Database Development.

10 There was a wallet that was left on a table

11 outside by Daniel Guittierez. So, if that's you

12 and you don't have your wallet, please come see us

13 outside. All right. I'm going to turn the podium

14 over to Dr. Mazumder with the Database Development

15 session.

16 DR. MAZUMDER: So, thank you very much

17 for inviting me -- oh, just one second. Thank you

18 for inviting me to this session, and it's a really

19 great workshop, and lots of different comments and

20 opinions. And this session is a little bit

21 different. I mean, we'll not really directly talk

22 about NGS initially, at least, as much.

251

1 And I just want to set the tone for this

2 session. So, before NGS was there, there were

3 many, many resources: model organism databases,

4 RefSeq, UniProt, SGD, Gene Ontology. Many of

5 these reference datasets have been -- you know,

6 they have been built over years, over decades.

7 And a sequence by itself is completely

8 meaningless. It doesn't have any meaning. You

9 have to add annotation to it.

10 And most of the time, annotation

11 initially is added manually through biocuration.

12 And then, once you have some gold standard

13 annotation like RefSeq, Swiss-Prot or the model

14 organism database type of annotation, you can

15 create an automated process to take this

16 annotation and add this to other data sources.

17 And then, when you have NGS data, you

18 have mapping algorithms or other algorithms which

19 will take this NGS data and map it against this

20 reference -- which has to be maintained and

21 annotated and everything else by some database

22 curators or database providers -- and then this can

252

1 be then used for biological knowledge generation.

2 So, if you look at that -- you know,

3 this is just the National Human Genome Research

4 Institute. They have many model organism

5 databases that are supported. This is their URL.

6 You know, I have some of the names here. Not all

7 of the names are here. Then, there are funding

8 mechanisms which support HIV databases, influenza

9 virus resource, virus sequence databases and so

10 on. So, these databases are quite important.

11 So, in terms of -- I lead the public

12 HIVE, and many times, the question comes, you

13 know, okay, so you have a mapping algorithm. It's

14 really great, so I want to use it, and so on. So,

15 HIVE is not just developing new tools. So, we

16 develop pipelines, work flows, and when we see a

17 need for development of new tools or work flows,

18 we do it. And the need could be many reasons why

19 you want to do it. And Dr. Simonyan mentioned a

20 few of them.

21 We work with XO varnisi DNAC data,

22 textual and image data, ontology standards,

253

1 (Inaudible) and DO cancer slims. So, disease

2 ontology. So, there was a mention of ontology

3 earlier on. Having an ontology -- you know, Dr.

4 Warren Kibbe of CBIIT -- you know, he wrote this

5 paper on disease ontology. One of the things that

6 we were working on this publication where we are

7 doing (Inaudible) cancer analysis from ICGC and

8 TCGA -- and I can tell you that even within ICGC,

9 the same cancer type is mentioned -- has a

10 different name.

11 Now, if you look at it for a few seconds

12 -- if a human looks at it, they will know it's

13 exactly the same cancer type from two different

14 countries. Where from our computational

15 viewpoint, if you try to figure out what is what,

16 it becomes a nightmare. So, we started a small

17 project with a small group of people from Lynn,

18 from University of Maryland, COSMIC and the Early

19 Detection Research Network Group over -- funded by

20 NCI, like 250 scientists.

21 So, we are trying to even take care of

22 this ontology, so that when I compare ICGC,

254

1 IntOGen, COSMIC, I can at least map it to a

2 particular cancer type, which is a hierarchy, and

3 then I can propagate some information across it.
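A toy sketch of the mapping just described: route cohort-specific names to one canonical Disease-Ontology-style term, then walk that term's hierarchy so information can be propagated or aggregated. All labels and the parent chain below are invented for illustration:

```python
# Hypothetical sketch of the problem described: the same cancer type
# carries different names in different ICGC/TCGA cohorts, so synonyms are
# routed to one Disease-Ontology-style term whose hierarchy supports
# propagation. All labels below are invented.
SYNONYMS = {
    "Liver Cancer - Cohort JP": "hepatocellular carcinoma",
    "Liver Hepatocellular Carcinoma": "hepatocellular carcinoma",
}
PARENTS = {
    "hepatocellular carcinoma": "liver cancer",
    "liver cancer": "cancer",
}

def canonical(name: str) -> str:
    """Map a cohort-specific label to its canonical term (or pass through)."""
    return SYNONYMS.get(name, name)

def ancestors(term: str) -> list:
    """Walk up the hierarchy so information can be propagated/aggregated."""
    chain = [term]
    while term in PARENTS:
        term = PARENTS[term]
        chain.append(term)
    return chain

# Two differently named cohorts land on the same node, so they aggregate.
assert canonical("Liver Cancer - Cohort JP") == \
       canonical("Liver Hepatocellular Carcinoma")
```

The human eye does this mapping instantly; the point of the curated synonym table and hierarchy is to let a computation do it too.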

4 We also try to collect data to map data. There

5 are projects from UniProt, for example, ID

6 mapping, which allows us to do things like that.

7 But the key thing is that references and

8 standards, they have to be done in collaboration.

9 It cannot be done alone.

10 So, this is a perfect place, I think,

11 where we can start also talking about some of the

12 references and standards that many of you are

13 involved in, and bring it to the forefront, so

14 that we can use it within the HIVE group. So, I

15 tried to put the session emphasis, but you know,

16 I'll just go quickly through this, and there are

17 other things that our speakers will talk about;

18 focus on the need for development of curated

19 databases, focus on validation and integration

20 protocols, steps needed to produce viable and

21 reliable resources that can facilitate

22 collaboration and research.

255

1 So, this session, I am pleased -- it's

2 my pleasure and honor to have three great

3 speakers, Dr. Kim Pruitt, Dr. Mike Cherry and Dr.

4 Rodney Brister. So, Dr. Pruitt is our first

5 speaker. She is a senior staff scientist,

6 Eukaryotic Group at NCBI.

7 She's the RefSeq unit chief, and she has

8 a PhD from Cornell in genetics and development.

9 She leads a great project called the CCDS, and

10 she's the NCBI lead for the CCDS project, which tries

11 to standardize the protein coding regions of human

12 and mouse, which is an extremely useful project

13 when you're trying to map and understand your

14 results with a particular reference dataset.

15 She's also the founding member for the

16 International Society of Biocuration. How many of

17 you have heard of that name, International Society

18 of Biocuration? I talked a lot about biocuration,

19 but I think you should take a look at it. So, the

20 next meeting -- our next meeting is in China,

21 Beijing, April 23rd, I think so. And there is

22 also -- Database is the official journal of

256

1 ISB, and you can submit your paper. It deals with

2 database development and biocuration. Please

3 remember that. And if any of you can make it to

4 Beijing, hopefully, we will see you there.

5 Without further ado, Kim?

6 (Break in recording - long pause)

7 DR. PRUITT: I can't manage the mike

8 (Laughter). Thank you for that really nice

9 introduction. So, switching gears, I'm going to

10 talk about RefSeq, which is an NCBI product.

11 Let's see. Next slide.

12 This is a project at NCBI to provide

13 reference sequence standards at the level of the

14 central dogma. So, we're providing reference

15 sequence standards for genomes, for transcripts,

16 for proteins at a huge level ranging from archaea

17 to eukaryotes to viruses.

18 There are numerous advantages in the

19 RefSeq set. I'm going to tell you a little bit in

20 a couple of slides -- a very 30,000-foot view of how

21 we build the RefSeq set. But there are numerous

22 advantages in using the RefSeq data as compared to

257

1 the primary data that's submitted to GenBank.

2 (Discussion off the record)

3 DR. PRUITT: So, the advantages include

4 consistency, because we control this data product.

5 We're offering a greater consistency in the

6 formatting of these sequence records. There's

7 greater transparency in the source of the data

8 that goes into building the RefSeq dataset. And

9 there's more annotation.

10 We have several annotation pipelines.

11 We have annotation pipelines for prokaryotes and

12 eukaryotes. We also generate annotation for

13 viruses in collaboration with experts in these

14 organisms, but also, through our robust annotation

15 pipelines. And so for many genomes that are

16 submitted to GenBank in an unannotated form,

17 you'll be able to find annotated in the RefSeq

18 dataset.

19 Our data sources are primarily GenBank,

20 and so we rely on a continued submission of

21 primary data to the archival databases. We do

22 engage in collaborations with model organism

258

1 databases or individual researchers who are an

2 expert in our particular protein family. As I

3 said, we have annotation pipelines.

4 And just a quick note on my terminology

5 in this talk, I will call RefSeqs that are a

6 direct product over annotation pipeline -- I call

7 those model RefSeqs, and RefSeqs that are in the

8 pool that is subject to curation, I call that the

9 known RefSeq dataset. So, remember model and

10 known.

11 In terms of products and access, this is

12 a series of sequence records. They're available

13 in our web interfaces, in the Nucleotide and

14 protein databases. They're available in Blast

15 databases. They're available for FTP, and they're

16 available through our programming utilities --

17 the E-utilities.

18 Okay, so I'd like to just sort of take a

19 step back. My talk is going to really focus on

20 our support for the human genome. I could talk

21 for two hours, probably, or more, on the breadth

22 of RefSeq. But for the human genome, NCBI has

259

1 several curated -- we're providing curation

2 support for several very important databases and

3 resources. So, we're involved with the community

4 that is maintaining the assembled human genome.

5 This is the Genome Reference Consortium.

6 And the curators in my group actually

7 contribute tickets to the GRC that are

8 highlighting areas of the genome sequence that we

9 would like them to investigate. These are areas

10 where we're questioning if they're representing a

11 mutation versus the non-mutated version of a gene,

12 or if they have faithfully represented the

13 completeness of a gene. Some of these are

14 questions of redundancy in the genome assembly.

15 So, we're an active contributor to

16 tickets to the GRC, and then as those regions of

17 the genome get fixed, we are often involved in

18 reviewing the corrections to the genomic sequence

19 to see if then the genome annotation will be

20 indeed, improved in that region of the genome.

21 NCBI is also involved in the Locus Reference

22 Genomic project, which is -- and the prelude to

260

1 that is our RefSeq gene product set.

2 These are genomic records that we're

3 putting out so that clinical testing communities

4 have a stable reference for reporting their

5 mutations in, so they have a reference that has

6 been curated and vetted. This is in coordination

7 with locus-specific databases and the clinical

8 community, and these are sort of gene region

9 snapshots of the genomic sequence, and they're

10 annotated with the transcripts that are most used

11 for reporting relevance in the clinical lab

12 setting.

13 NCBI also has the ClinVar database,

14 which was alluded to earlier. And this is an

15 archival database of clinical variants that have

16 an asserted relevance; that they are asserted to

17 have a pathogenic relevance. And so, we are

18 archiving this information. Again, we're working

19 with clinical community to gather this

20 information. We have curators who support the

21 submission process so that data standards are met

22 and the information can be clearly distributed

261

1 back out to the community.

2 And as Raja mentioned, we're involved with

3 the CCDS collaboration -- and this is an

4 international collaboration with our partners at

5 the Wellcome Trust Sanger Institute, the Ensembl

6 resource, the HUGO Gene Nomenclature Committee, the

7 Mouse Genome Informatics group and UCSC to

8 stabilize the protein coding annotation on the

9 human and mouse genome for those proteins that are

10 the most supported.

11 So, there continue to be differences in

12 Ensembl and UCSC and RefSeq for more predicted or

13 model kinds of coding region representations. But

14 the most supported layer is subject to this

15 collaboration, where we are working closely

16 together and trying to effect any updates in a

17 synchronized manner.

18 And then of course, there's RefSeq

19 curation, where we are curating genes, transcripts

20 and proteins. So, here's a quick example of a

21 region of the original genome sequence in

22 version GRCh37. This is a region of human

262

1 chromosome 8. The SCXA and SCXB genes are highlighted

2 in red boxes on the top level, and those genes

3 were found to be really redundant. RefSeq

4 questioned whether this was a valid gene

5 duplication or a redundancy in the genomic

6 sequence that was represented on chromosome 8.

7 So, the GRC undertook some experimental

8 validation and retiled through this region and

9 actually deduced that -- determined that there was

10 a single gene in this location and not two. And

11 so in the GRCh38 genome, which is the current

12 public version, you'll find a single gene in this

13 location.

14 And ClinVar -- I'm really not going to

15 talk about ClinVar, but I do want to highlight it

16 as an important resource that NCBI is engaged

17 with, where we're working closely with a range --

18 a large variety of groups that are involved with

19 monitoring and reporting clinically relevant

20 variations and aggregating that, so that that

21 information can be archived and then distributed

22 back out.

263

1 And the little panel on the right is

2 just the top of a ClinVar page where things are

3 given stars; where you can see how many groups

4 have reported this, if it's been curated by an

5 expert panel, and so on. So, back to RefSeq.

6 Basically, RefSeq can be thought of as a genome

7 annotation database, and the foundation of our

8 genomic sequence is GenBank. And so, we are not

9 reassembling genomes. We are not changing the

10 genomic sequence in RefSeq versus what is in

11 GenBank.

12 We are simply making a copy of the

13 submitted assembly and then subjecting that to

14 annotation and curation. So, we have several

15 annotation pipelines. We have a very robust

16 eukaryotic annotation pipeline. The product of

17 that is, of course, the annotated reference

18 genome, transcript proteins, FTP Blast databases

19 and annotation reports.

20 In our curation mode, we're looking at

21 data at the level of genes, transcripts and

22 proteins. So, we are representing novel splice

264

1 variants. We're doing extensive sequence

2 alignment analysis. We're diving into the

3 literature in order to represent things that are

4 the most common protein isoform in transcript

5 variant. That is represented in the literature.

6 That is thought to be the functional unit, and we

7 want to make sure that we are representing that in

8 the RefSeq database. And we're also adding

9 publications. We're adding names. We're adding

10 content to NCBI's gene database, and so we're

11 adding functional information in the process of

12 our curating our sequence records.

13 Some of the functional information that

14 we're adding -- here is an example. This impacts

15 both the structural annotation of the genome and

16 the functional annotation. So in this case, this

17 is the mouse Bag1 gene, and actually, the same

18 situation occurs for the human ortholog. This is

19 -- in inverse orientation, this is a gene that's

20 on the opposite strand. So the five prime end of

21 the gene is to your right.

22 So, the canonical protein was annotated

265

1 at the AUG site that I have highlighted in red on

2 the bottom part of your panel, which is a zoomed

3 in view of the top panel where I have boxed the

4 first exon. So, we have two RefSeqs represented,

5 two RefSeq transcripts and proteins represented for

6 the mouse and human Bag1 gene. And one of them

7 starts at a canonical AUG start codon, and the

8 other one starts at a CUG start codon, which has

9 been experimentally described in both mouse and

10 human.

11 And so, this is an example of a type of

12 annotation that automatic annotation pipelines

13 would fail to provide, because they're looking for

14 the canonical AUG start codons. They're not

15 trying to annotate proteins from every CUG start

16 codon that might occur in the genome. So, that

17 would introduce a lot of false noise. And so,

18 here is a value added layer that curation has

19 introduced.

20 This is a highly simplified view of our

21 annotation pipeline on the right part of my slide.

22 It is significantly more complex than this. I

266

1 want to make the point that we are aligning a huge

2 amount of evidence data in generating our

3 annotation. This is really an evidence based

4 product, our genome annotation.

5 So, we are using the known RefSeq

6 component as an input. So, what we curate, like

7 the example I've just shown you, then becomes a

8 reagent in developing the genome annotation

9 product. We also are aligning cDNAs, ESTs, TSAs

10 and RNA seq data. And this is something that we

11 introduced in 2013. And the addition of RNA seq

12 data has allowed us to greatly expand on the

13 number of exons that we are -- exons and introns

14 that we're representing in our final genome

15 annotation product set.

16 These are in the model RefSeq component,

17 because these are predicted outputs of our

18 interpretation of all of these alignments. And

19 just to give you sort of a snapshot of some of the

20 numbers, on the left part of the slide, we have

21 some 48,000 known RefSeq transcripts that we are

22 curating. The vast majority of those have

267

1 undergone some level of curation.

2 And we have -- towards the bottom, we

3 have some 52,000 model transcripts that have been

4 added by the integration of RNA -- primarily added

5 by the integration of RNA seq data. For human,

6 we're using the Human BodyMap 2.0 dataset as our

7 alignment pool.

8 So, this is a significant number of

9 annotated transcripts and proteins available for

10 comparison in Next Gen alignment analysis. So,

11 some of the quick highlights of our annotation

12 pipeline speed -- we have a very fast pipeline.

13 It can turn around a whole eukaryotic genome in 2

14 to 10 days, and that is from when we first do our

15 data snapshot of RNA seq data, transcripts,

16 proteins, everything. We do a data snapshot at

17 the beginning. It's a turnkey pipeline. We can

18 have that product loaded, average 2 to 10 days

19 after we have started that, it is available

20 publicly in NCBI's nucleotide protein database,

21 and shortly thereafter on our FTP site.

22 So, it's a very robust pipeline. It has

268

1 very good speed. It is a quality pipeline. We

2 have put a huge amount of engineering work behind

3 it. We do regression testing. You know, any kind

4 of software change, we check. Does it have any

5 deleterious effects? Do we get the product that

6 we're expecting to get? So, we do all of the

7 things that we should be doing for annotating the

8 human genome and other eukaryotic genomes.

9 We can annotate multiple assemblies

10 simultaneously. For human, we annotate the GRCh38

11 assembly. There's a CHM1 assembly and the HuRef

12 assembly. We can annotate multiple organisms

13 simultaneously. So, this is a very powerful

14 pipeline that we have put together. And of

15 course, one of the main advantages, from my sort

16 of slightly biased point of view is that it

17 integrates the curated RefSeq content. And so we

18 have a means, then, to directly affect in a

19 positive way the output of the annotation pipeline

20 over time, because we do re-annotate organisms

21 over time. Now, as we're re-annotating those

22 organisms, we can be affecting positive change,

269

1 correcting errors that have been identified by us

2 or reported by the community in the genome

3 annotation.

4 So, so far, we've annotated 153

5 organisms using our eukaryotic genome annotation

6 pipeline. Eighty-eight of those organisms we've

7 integrated RNA seq data with. That's quite a lot.

8 A hundred and sixteen of those organisms have some

9 curated RefSeq records available for them. For

10 some of these organisms, it might be really small.

11 It might be 10 RefSeqs, but these are 10 RefSeqs

12 that we have corrected in the genome annotation

13 product for the organism.

14 Human and mouse we update at least

15 yearly. We're trying to achieve a yearly update

16 goal for many of our eukaryotes, and certainly a

17 new assembly triggers a re-annotation. And if we

18 become aware of a new dataset that would probably

19 have a positive outcome on our annotation, then

20 we'll rerun. Or, if we make a significant change

21 to our methods that would have a positive outcome

22 on our annotation, then we'll rerun.

270

1 One of the things that we've really put

2 a lot of effort into is transparency and support

3 evidence, both at the level of the gene and at the

4 level of individual transcripts. So, on the left

5 side, you're looking at a little display from an

6 Entrez Gene record for the crystallin A gene.

7 This is the human crystallin A gene in the top,

8 and on the top of that left panel, you can see the

9 gene structure that we annotated on the genome.

10 The top transcript protein pair, the

11 blue red lines there is a curated RefSeq. And

12 then, the second set of transcript proteins is a

13 model RefSeq that was predicted based on RNA seq

14 data. As you look further down, you can see a

15 line that shows you that one of these proteins is

16 in the CCDS project, and so it is tracked as a

17 stable protein coding annotation on the genome.

18 Now below that, you can see the clinical

19 variants that are tracked in the ClinVar database.

20 And then below that, you see three tracks that are

21 an aggregate of our RNA seq data. So, the top

22 track is showing you an exon coverage graph. The

271

1 track below that is sort of a mirror image of

2 that. This is showing you the coverage for the

3 intron-spanning reads, and so it's kind of a

4 mirror image of where the exon coverage is.

5 And then below that, the track with the

6 black lines, that is our intron features that we

7 culled from the RNA seq data. So, the predicted

8 model, the XM that's annotated up there is

9 supported. We see an intron that corresponds to

10 the novel intron that's introduced in that model

11 that came from RNA seq data.

12 So, here's a window. Here's a quick

13 snapshot. Here's the evidence behind this

14 annotation. Yes, I can see it's supported by

15 transcript data. Of course, the problem with RNA

16 seq data is you don't know if you've got long-

17 range exon coverage from that data -- what we know

18 is that the intron exon pairs are supported. In

19 the right side, I'm showing you on a record by

20 record basis that we are also providing

21 information per record about the evidence

22 supporting that record.

272

1 So on the known RefSeqs, we clearly

2 indicate the data sources behind the construction

3 of this record. So we will tell you that we made

4 this RefSeq record from accession number A, B and

5 C, and maybe we've used three accession numbers in

6 order to avoid what we think are rare or erroneous

7 mismatches in any of those accessions, in order to

8 provide a record that is the best match to the

9 genome that we don't think is representing

10 anything that is deleterious or a mutation. We

11 also provide what's highlighted here --

12 (Break in recording)

DR. PRUITT: Aha. So, up here at the top, there's a section on the record that is reporting two levels of evidence. One is that we have long-range support for the transcript exon combination, and we're arbitrarily giving you two accessions that support that long-range exon combination. So, here's a three-exon RefSeq model, and exons one, two and three in combination are found in these two transcript records.

We're also telling you where we have RNA-seq support for the exon pairs. So, we have RNA-seq support for exon one to exon two, and exon two to exon three. Again, we're giving you an arbitrary number of samples that support that -- we're just showing two, because it can be a very long list, and it would make the record view kind of unwieldy. So, this is what you'll see on a known RefSeq record.

On a model RefSeq record, the information is provided a little bit differently. We're telling you that this is a product of the NCBI annotation pipeline -- it's software version 5.2. There's a link here to an annotation report page where you can get detailed information -- I'll show you a quick glimpse of that on my next slide -- about the inputs and outputs of the annotation pipeline.

And down below in the record, there are comments that tell you the supporting evidence for this model: it includes similarity to mRNAs, 199 ESTs, and one hundred percent coverage of the annotated genomic feature by RNA-seq alignments, including five samples that support all of the introns. So, there's a wealth of information here about evidence.

This extends to evidence at the level of the whole annotation run: we're providing annotation report pages for all of our eukaryotic genomes. You can find the links to these reports on the sequence records, and they bring you to a report page that tells you what software version was used, exactly what version of the assembly was annotated, what the results of the annotation are, how many features of what type were produced from this annotation run, and huge amounts of information about the alignments.

How many of the RefSeqs did align? How many transcripts did we start with, and how many of those did align, at what quality threshold? And for RNA-seq data, we'll tell you exactly what RNA-seq samples we used -- again, with huge detail about the alignment statistics. This type of information is provided for each assembly that was annotated, so for human, you can interrogate this kind of information for GRCh38, or if you're interested in comparing to HuRef, you can switch to that.

So, a quick warning. A lot of people think that they can get RefSeq by going to UCSC. What you get from UCSC is their alignment of the known RefSeq dataset. It is not the same thing as NCBI's genome annotation product of RefSeq. They do not align model records, so often you will get an under-representation of NCBI's view of the number of possible transcripts and exons, and -- because they don't align any of the model RefSeqs -- sometimes we may have a gene call that's not reflected in the UCSC display.

So, user beware. If you're downloading RefSeq genes from UCSC, you are downloading their alignment of the known RefSeq records. You might have ambiguous placement of paralogs. You might have slightly different exon calls for some exons because of the different alignment methods that were used. You'll be missing splice variants. You may be missing genes. And I'm really almost done.

So, yes, you can get RefSeq data at NCBI. We have a genomes FTP site that we just recently regenerated, and it is now a comprehensive report of all RefSeq genomes that are available in NCBI's assembly database. The scope of our new genomes FTP site is archaea, bacteria and eukaryotes. You can get all of the GenBank genomes there, and you can get all of the RefSeq genomes there, by simply toggling your path to RefSeq or GenBank.

We have metadata files that help you traverse the FTP directory to find exactly what you're looking for. I encourage you to look at the README files that are in this directory, the recent NCBI news announcement, and the FAQ we have put out about this directory. One of the things that we have done with this new genomes FTP site is to modify our FASTA titles.

We did this so that the RefSeq data that you download from NCBI can be used more readily in some of the big RNA-seq aligner programs. We provide GFF format on our FTP site, we provide FASTA, and of course, we provide the NCBI standard text-view document style, as well.

NCBI traditionally has used a complex FASTA format that had an awful lot of information in there, which people then had to parse to get the bit they wanted to track on. In our new style, we're providing simply the accession.version, and then of course, the record description. This simple accession.version is the seqid that's used in the GFF file and in the FASTA files, and so it makes them much more interoperable with some of the RNA-seq alignment tools.
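As a rough illustration of why the simplified deflines help, here is a sketch (with invented example records, not real downloads) of matching the accession.version keys from new-style FASTA headers against the seqid column of a GFF3 file:

```python
# Sketch: the new-style NCBI defline is ">ACCESSION.VERSION description",
# and the accession.version doubles as the GFF3 column-1 seqid.
# The sample records below are hypothetical.

def parse_deflines(fasta_text):
    """Map accession.version -> description from new-style FASTA headers."""
    ids = {}
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            acc, _, desc = line[1:].partition(" ")
            ids[acc] = desc
    return ids

def gff_seqids(gff_text):
    """Collect column-1 seqids from a GFF3 body, skipping ## pragmas."""
    return {line.split("\t")[0]
            for line in gff_text.splitlines()
            if line and not line.startswith("#")}

fasta = (">NM_000000.1 hypothetical example transcript\nATGACGT\n"
         ">XM_000001.2 hypothetical example model transcript\nATGCCCT\n")
gff = "##gff-version 3\nNM_000000.1\tRefSeq\texon\t1\t7\t.\t+\t.\tID=exon1\n"

deflines = parse_deflines(fasta)
shared = set(deflines) & gff_seqids(gff)
print(sorted(shared))  # the accession.version keys line up directly
```

With the old compound deflines, the header and the GFF seqid would not match without extra parsing; with the new scheme, the set intersection above works as-is.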

So, we do listen. We really want to make this a product set that people can use in their pipelines, and if people have suggestions and requests, we want to hear about those. So, RefSeq -- why is this a useful tool for comparison to NGS data? As Raja said, because it's annotated. It gives you the context of your alignment. It's an annotation dataset that NCBI is committed to supporting.

We integrate curated information, so we have a means to provide corrected information as we become aware that corrections are needed. We are really very interested in community feedback: this is a community resource, and we want community involvement to help us maintain it. We only have so many staff at NCBI, and we're handling a large number of genomes, so we really welcome community input.

It's evidence-based. There's transparency in our evidence, in our pipeline, and in the reagents that we're using. There's transparency in the supporting evidence for the specific annotated transcripts. So, we're really trying to build a layer of transparency into this product so that people can use it with confidence. Of course, the integration of curated information, and the connection between RefSeq and NCBI's Gene database, gives you a powerful layer connecting the sequence to the knowledge, because in NCBI's Gene database, we're integrating a huge amount of information.

We've got GO -- gene ontology -- data integrated. We're integrating publication information. So, there's a lot of information in Gene that relates to the functional aspects of a gene, and the sequence is then directly connected to that. Basically, NCBI is committed to supporting the human genome at several levels: at the level of maintaining the human genome sequence and the assembly, at the level of archiving the pathogenic and clinically relevant variation data, and at the level of the annotation.

One of the things that we're looking forward to doing in the next couple of years is to start adding functional annotation of regulatory regions -- promoters, silencers, enhancers -- those regulatory regions that have been studied and experimentally confirmed. Of course, there's a huge amount of data coming out of the high-throughput genomics projects, but there's also a wealth of literature, and we would like to connect that to the genome sequence as well. So, these are future plans.

And of course, everything that we produce at NCBI for the RefSeq project is available through our web site, FTP sites and programming APIs. And because it's at NCBI, we connect RefSeq to a wealth of other resources, so it really facilitates navigation to more information once you're in NCBI's domain.
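For the programming-API route, NCBI's public E-utilities service is the usual entry point. A minimal sketch of building an efetch request for a RefSeq nucleotide record follows; the endpoint and parameter names come from the public E-utilities documentation, and the actual network call is left commented out so the sketch stands alone:

```python
# Sketch: fetching a RefSeq record through NCBI E-utilities (efetch).
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, rettype="fasta"):
    """Build an efetch URL for a nucleotide record in the given format."""
    params = {"db": "nuccore", "id": accession,
              "rettype": rettype, "retmode": "text"}
    return EUTILS + "?" + urlencode(params)

url = efetch_url("NM_000546.5")  # example accession (TP53 mRNA)
# import urllib.request
# fasta = urllib.request.urlopen(url).read().decode()  # actual download
print(url)
```

Swapping `rettype` to `"gb"` would return the flat-file (text-view) record instead of FASTA.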

And that's it. I would like to thank the people that do all the work: the GRC, ClinVar, the eukaryotic genome annotation pipeline. These people also produce the FTP site and the assembly resource, and of course, my group, who curate the eukaryotic RefSeq transcript and protein set. Thank you. (Applause)

MS. VOSKANIAN-KORDI: If there are any questions at this point, please --

SPEAKER: Is it possible to have two RefSeq genomic records for the same organism in the same leaf-level taxonomy node?

DR. PRUITT: So, are you talking about prokaryotic (Inaudible)?

SPEAKER: Yes. That's what I'm asking. Prokaryotic.

DR. PRUITT: Yeah. So, RefSeq undertook providing genome annotation for all submitted prokaryotic genomes, and they may be individual isolates of the same strain. So, there is now a lot of redundancy in the genome representation in RefSeq for prokaryotes. We are also, though, selecting representative genomes.

So, if you don't want to deal with the however many thousands of Salmonella enterica genomes that we have in RefSeq, we're selecting those that, in our opinion, are the reference standards or good representatives of different strains. And we have metadata files that clearly identify the reference and representative set. Also, in our new genomes FTP site, we have a directory-level way to navigate to just that subset of the data.
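One way to use those metadata files is sketched below: filtering an assembly_summary.txt-style table down to just the reference and representative genomes. The two-row sample and the reduced column set are illustrative; the real file carries many more columns, so the sketch reads the column names from the header line rather than hard-coding positions:

```python
# Sketch: keep only reference/representative genomes from an
# assembly_summary.txt-style metadata file. Sample rows are invented.

def load_summary(text):
    """Parse the tab-delimited summary into dicts keyed by header names."""
    rows, header = [], None
    for line in text.splitlines():
        if line.startswith("#"):
            if "\t" in line:                 # the header is a comment line
                header = line.lstrip("# ").split("\t")
            continue
        rows.append(dict(zip(header, line.split("\t"))))
    return rows

sample = (
    "#   See the README in this directory for details\n"
    "# assembly_accession\trefseq_category\torganism_name\n"
    "GCF_000000001.1\treference genome\tExample bacterium A\n"
    "GCF_000000002.1\tna\tExample bacterium B\n"
)

wanted = {"reference genome", "representative genome"}
subset = [r for r in load_summary(sample) if r["refseq_category"] in wanted]
print([r["assembly_accession"] for r in subset])
```

Reading the header by name keeps the sketch working even if the file gains or reorders columns.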

(Discussion off the record)

SPEAKER: Hi. I very much appreciate the new FTP site. It's much more convenient to navigate than the old one.

DR. PRUITT: Oh, that's great to hear.

SPEAKER: But there's one thing missing. The viruses are still kept apart. Is there a plan to integrate viruses into that same system?

DR. PRUITT: So, the viruses -- very good observation. The viruses have not traditionally been included in the assembly resource, and there are some ongoing conversations to try to come up with a good model for inclusion of the viral genomes in that resource.

There is still a viral genomes FTP site where data can be downloaded. The viral genome project, which Rodney is going to tell you more about, is providing both reference sequence data and also some value-added data on top of that. In terms of the new genomes FTP site, moving forward, our goal is to try to include the reference sequence data in the new structure. But this add-on data will still need to be represented.

There will probably still be a duality in terms of viral genome representation, because we want to provide this additional dataset, which really doesn't fit currently in the model of the assembly database.

SPEAKER: Well, it's sort of like what you have now for the non-viral, where you have the representative.

DR. PRUITT: Mm-hmm.

SPEAKER: And then, it's almost like the genome neighbors, that you're borrowing --

DR. PRUITT: Right.

SPEAKER: -- from the viral (Inaudible).

DR. PRUITT: Right, right. And so, the genome neighbors don't really fit in the --

SPEAKER: Well, it's --

DR. PRUITT: Right now, they don't.

SPEAKER: -- you mentioned Salmonella. You have one Salmonella representative, and then the other ones are sort of like genome neighbors.

DR. PRUITT: They're sort of like genome neighbors, but we've actually instantiated them as RefSeq genomes.

SPEAKER: Oh, okay. Right.

DR. PRUITT: And in the viral world, the genome neighbors are not all instantiated as a RefSeq genome. They're still GenBank.

MS. VOSKANIAN-KORDI: We're actually going to go ahead and close Dr. Pruitt's section right now. If we have time at the end, we'll open it up to further discussion, as well.

DR. MAZUMDER: Our next speaker is Dr. Mike Cherry. He started the model organism database SGD, the Saccharomyces Genome Database. He also started the Gene Ontology, and then he took over the ENCODE DCC. There are some interesting little things that I just learned: he grew up in Indiana; his parents were molecular biologists, biochemists and farmers.

And he's friends with Warren Gish -- you know, the author of BLAST -- and with Mike Corell, with whom he first started programming in C, or at least did a lot of that at that time. Mike Corell worked on Unix 4.2 BSD, and he repurposed the C code. He's going to talk more about what he is working on now, and also talk about lessons learned from the more (Inaudible) databases.

DR. CHERRY: Okay. So, I have the same problem as Warren getting close to the mic. But thank you for that nice introduction. I'm not going to tell you about any programming that I've done in the last 10 years -- my guys won't let me do it anymore. They do keep one script that I wrote on the server, just so that I can say I'm involved in the project. So, it's very nice of them. (Laughter)

So, I'm Mike Cherry. I'm in the Department of Genetics at Stanford, and as Raja said, I've started several genome databases, but this Saccharomyces cerevisiae database is the longest running. I'll talk to you a little bit about that and what we do with curation. I won't go into too much detail about our web pages and such, because I'll probably get too carried away, but I will go a little further into our work with the Gene Ontology, as well as get into the data coordination center we're creating for ENCODE. And that's where the metadata comes in.

(Discussion off the record)

DR. CHERRY: Great. So, for model organism databases, I sort of think of these guys: Saccharomyces cerevisiae, the brewing, baker's, budding yeast; Drosophila melanogaster, the fruit fly, although it doesn't eat fruit -- it likes yeast; and Caenorhabditis elegans, who doesn't like yeast but eats bacteria.

These are really great models because of their genetics and genomics, and the long history that their communities have brought. They really all started because of the genetics -- Saccharomyces a little bit because of the biochemistry. Nobody cares if you grind up pounds and pounds of yeast. It smells good, too.

But they're currently used so much because of the genetics and the genomics. They do have a rich future, but they also have a rich community that's involved with them and that maintains this sort of active work. People are studying yeast not just to study RNA polymerases per se; they're looking at RNA polymerases within the cell itself. They're really trying to take apart the system, even though they may be working on one protein.

There are also very powerful molecular resources available in these communities. In yeast, for example, there is a knockout of every single gene in the genome available for the non-essential genes. And the essential genes have had a promoter added upstream, so you can actually dial down the protein, as well. These knockout collections are easy to change and manipulate, because yeast really loves to do homologous recombination.

As a result of that, you can actually take the knockout of a gene in yeast and put in the human gene, the human cDNA, and there are hundreds of cases where the human gene complements the knockout in yeast. And so then, you can study the human gene in yeast, where you can grind up tons and have a lot of fun. This really shows you the power of evolution: being able to go from a simple organism and study a gene from a more complex organism.

So, I'll talk a little bit about the Saccharomyces Genome Database -- there's a nice URL. So, what do we do? (I have this problem; I keep wanting to walk away.) We curate information. I've got about six curators -- really experienced PhD curators who love to read papers. Many of you in the room are probably like that, as well, and I've got a job for you. These six people read the paper. They start with the results, maybe the methods.

We try not to read the introduction and the discussion first, because we don't want to be confused by what the authors say they did; we want to know what they actually did. So, they're reading the paper, abstracting out of it the details -- oftentimes hidden in figure legends, tables and such -- and then they integrate it into our database using controlled vocabularies and ontologies. Of course, in some cases we have to use free text, but as much as we can, we really try to put it in this very structured way.

Generally, as you can imagine when you're reading papers, the data is very unstructured -- even if it's high-throughput, it's very unstructured. People don't use standards in the community; they just sort of make it up themselves. And so, a lot of times we spend a considerable amount of effort wrangling the data -- that is, taking the data that they have given us -- I mean, it's a nightmare sometimes.

They'll provide you a table as a PDF file or as an image -- you know, here's the TIFF image of my table. So, we have to do a lot of work to convert that. Grabbing the metadata is very difficult. Typically, it's very sparse in the paper, and we try to work with the authors to get more of it. The hard part, though, sometimes, is that we'll do all this work to get their data wrangled nicely, and then they say, oh, you can submit that to GEO -- you know, I haven't done that yet, because I didn't take the time. And so we're sort of left with the question of who's really going to do it.

More and more, we're making sure that we don't manipulate their data until they submit it to GEO, just as a way of having a little stick. We're also experimenting with something we're calling the wall of fame -- you know, giving people badges. So, somebody that actually submits their data and does a really great job, we want to sort of put their name in lights somewhere. It may be a graduate student, but that graduate student is going to say, hey, look at me -- I'm on the home page of (Inaudible). And hopefully, that will encourage others to do the same.

Really, though, unfortunately, the problem is that the PIs have to start enforcing this, and they're really the problem in many cases. Okay. So, you sort of get the gist of this already, but the role of a genomic resource, a genome database, is really to take the data that's been published as a result of experiments and computation. We grab all that data out, put it in our database in a nice, integrated way, and then the nice Muppet scientist can read that information, do discovery on it, propose new experiments, integrate their own data into it, and hopefully publish that -- and then the cycle continues.

So, this is sort of our whole purpose. My lab really doesn't do research per se. Our job is to take data and make it available to people, to really promote research; our publications are about building databases and such. Okay. The types of information that we have interconnected within our database include the molecular function -- sort of what the protein does; the biological process -- sort of why it does it, say it's part of secretion; the cellular component -- where it happens within the cell, the major complexes that are there; sequence homology; and mutant phenotypes, which we spend a lot of effort on.

Yeast has lots of mutant phenotypes reported from many strains. I forgot to mention earlier that the yeast genome is in RefSeq, and it's been there for quite a long time. We've actually hand-curated, I would guess, almost every nucleotide over the 20 years, because people have banged on the genome so hard that they'll actually tell us when there is an error -- it's been re-sequenced so many times in the community. So, that's been very useful. We also have a lot of other strain genomes, so we can do comparative analysis to fix the main reference.

Genetic and protein interactions -- lots and lots of information there. A genetic interaction is where you have single mutations that don't have a phenotype on their own; you combine the two mutations together in the same cell, and the cell is dead or grows very poorly. That's a genetic interaction. Protein interactions, of course, are two proteins touching. We have all kinds of expression data, and we're exploring more about the protein domains, at the sequence level and in 3D structure. And we're getting much more into pathways, protein complexes and the regulation of all of this together.
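The genetic-interaction idea described above can be sketched as a small filter: keep the pairs where each single mutant is healthy but the double mutant is dead or very sick. The fitness values, gene names and thresholds here are invented for illustration:

```python
# Sketch: detect synthetic-lethal pairs from (made-up) fitness measurements.
# Fitness ~1.0 means the mutant grows normally; ~0 means it is dead.

single_fitness = {"geneA": 1.0, "geneB": 0.95, "geneC": 0.98}  # hypothetical
double_fitness = {("geneA", "geneB"): 0.05,   # dead only in combination
                  ("geneA", "geneC"): 0.93,
                  ("geneB", "geneC"): 0.90}

def synthetic_lethals(singles, doubles, healthy=0.8, dead=0.2):
    """Pairs where both single mutants are healthy but the double is dead."""
    hits = []
    for (a, b), fit in doubles.items():
        if singles[a] >= healthy and singles[b] >= healthy and fit <= dead:
            hits.append((a, b))
    return hits

print(synthetic_lethals(single_fitness, double_fitness))
# -> [('geneA', 'geneB')]
```

Real screens score interactions quantitatively against an expected combined fitness, but the hard cutoff above captures the logic of the definition in the talk.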

One thing I wanted to touch on briefly, just because it was mentioned before -- one problem we have. Kim and her group have done a great job of putting together the transcripts. Even though RNA-seq, and the expression arrays before that, were created and used -- so there's a lot of data there, and it's really easy for people to go forward -- oftentimes, after they create RNA-seq data, the next thing they want to do, of course, is do it in humans. And they sometimes forget to go back and fill everything out. So, we don't have the transcripts as tightly defined as we would like, but we have a wealth of information.

This is one case where we have to start piecing things together. We have all of the transcript levels. We have things like ribosome profiling, where you look at the RNAs that the ribosomes are bound to, and try to pull out not just the RNAs that exist at some point within the cell's life cycle, but those that the ribosome is actually bound to. And this is the (Inaudible) thing that we then have to do to make things a little bit better.

Okay. So, here's an example of one of these networks -- and this is actually a very simple one, genetic interactions. Here is a sort of famous network diagram within yeast, where they've actually taken the genetic interactions -- the bright spots are proteins. What you probably can't see is that there are lines connecting all of these proteins together, and those describe the interactions themselves.

But they've annotated the proteins with their biological processes -- the pathway being the larger sort of reason for that function within the cell -- and they've added a sort of attractor to proteins of similar function and clustered them like this. And that's basically showing you that in the genetic interactions, proteins with a similar process are found close together.

But you can imagine not only taking the synthetic-lethal information -- you can actually combine any number of networks. And this gets into statistics that I can't actually understand, but it's taking many different types of information -- looking at the genotypes, the metabolomics, expression arrays, going after protein interactions -- any number of networks together, and combining them to create Bayesian networks.

But of course, what you don't see here is that high-quality annotations are underneath all of this. For each gene, we have high-quality connections into these networks, along with annotations about the functions in very specific ways -- what's actually going on. So, this is, I think, one of the newer directions yeast research is going: really looking at systems biology, and systems genomics, I think is what somebody called it.

The Gene Ontology Consortium was something we created in 1998. The PIs currently are Judy Blake at the Jackson Lab in Maine, Paul Sternberg at Caltech, Paul Thomas at USC, Suzanna Lewis at Berkeley, and myself at Stanford -- so it's sort of a California thing, these days.

But it's really an international consortium. You may not realize that the Gene Ontology Consortium -- the folks that make the Gene Ontology, that do a lot of the annotations and distribute the annotations that we create -- is only about 40 people. And actually, not all of them are even funded by the consortium's grant.

Just real briefly: the Gene Ontology, if you haven't seen it before, is really a hierarchical description of function and process within the cell. And we do it this way so that you can annotate at the different levels of the ontology. I don't know if I should mess with the pointer, but the size of the ball indicates the number of proteins that have been annotated to that term or to one of its children below.

So, at the very bottom, you see the balls are small. Then we have, say, cell cycle checkpoint right there, and regulation of cell cycle progression right here. A lot of proteins are annotated low, but some are annotated higher up, because the evidence only allows them to be annotated there -- we don't see evidence within the paper that allows the annotation to go lower. Sometimes it's like a mutant phenotype: the cell doesn't live; we think it's the cell cycle checkpoint. That would be annotated high.
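The "ball size" computation just described -- counting the gene products annotated to a term or to any of its descendants -- can be sketched by propagating annotation sets up a toy DAG. The ontology fragment, term names and protein IDs below are invented for illustration:

```python
# Sketch: propagate direct annotations up a (tiny, made-up) GO-style DAG
# so that each term's count covers the term and everything below it.

parents = {                       # child -> parents (a DAG, not a tree)
    "cell cycle checkpoint": ["regulation of cell cycle"],
    "regulation of cell cycle": ["biological process"],
}

annotations = {                   # term -> gene products annotated directly
    "cell cycle checkpoint": {"P1", "P2"},
    "regulation of cell cycle": {"P3"},   # evidence only supports the parent
    "biological process": set(),
}

def ancestors(term):
    """All terms reachable upward from `term` through the parent links."""
    seen, stack = set(), [term]
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

counts = {t: set(prods) for t, prods in annotations.items()}
for term, prods in annotations.items():
    for anc in ancestors(term):
        counts[anc] |= prods

print({t: len(p) for t, p in counts.items()})
```

Using sets (rather than adding counts) keeps a protein annotated along two paths from being counted twice at a shared ancestor, which matters because GO is a DAG.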

So, the Gene Ontology allows the proteins to be annotated, and we do all of this annotation in an evidence-based way using ontologies -- it's called the Evidence Code Ontology, ECO. But what the curator does, and why it takes some training, is that the curator doesn't annotate to just a phrase, like nucleotide metabolism. What they're actually doing is annotating to a description of that term.

That's particularly important because, for apoptosis, say, different people in the room may have three or four different views of how you would actually define it. But for the ontology, we've defined it in a very specific way, and that's what people are annotating to.

Just a little bit of statistics about all of this work that's happened over 15, 18 years. The ontologies themselves are quite large. There are more terms about process, because there are aspects of the cell that we can sort of see, where we know what's happening, but we don't always understand the functions that the proteins are doing. We know the protein's there and it's important, but we don't know exactly what it's doing. And then, cellular components have to do with the regions in the cell, the complexes in the cell. So, lots there.

I mean, it's really astounding that the total number of annotations here is 370 million, but you have to realize that the majority of that is not people work -- that's actually computers crunching on these things. The UniProt group at the EBI in Hinxton, Cambridgeshire, runs a pipeline on all of the proteins within UniProt, using protein domains, sequence similarity, and very sophisticated rules that they've built over 10 years, to annotate proteins from all the organisms that they can find.

The manually curated set is still quite big, but it's taken a number of years to create, for 65 different genomes. So, for example, for yeast, we have 80,000 annotations available that were hand-curated by reading papers and describing the evidence. The ontology access also allows us to connect the term, the annotation, with small (Inaudible) that are involved in the measurement of that function.

This graph is just to show you how great yeast is. Here, the red bars show the percentage of gene products within the cell that have at least two of the three domains of the ontology annotated with an experimental term. For yeast, it's about 50 percent. The others -- human is about 10 percent, fly is about 10 percent. The line is actually showing you the number of gene products that have been annotated in this way -- I know it's difficult to read. But this is not counting the number of genes; it's counting the isoforms of each gene, so it would be counting all of the records within UniProt. Okay.

And then hopefully, I'll have time to go through a little bit of what we're doing with ENCODE. The ENCODE project has been around for about nine years. My group joined ENCODE two years ago as the data coordination center. Our job is to take data from all of the groups. They don't get credit for submitting data until they give it to us, and they have to give it to us in a way that complies -- that meets the standards, both in the file formats and quality, and in the metadata as well.

Okay. So, just real briefly on ENCODE: there's a variety of methods used that map features across the whole genome -- they're all seq methods these days. RNA-seq measures where transcripts occur. There are a lot of ChIP-seq methods going on to help you identify where proteins bind and where modifications happen to the chromosomes or to the histones -- methylation, acetylation and such -- and there's accessibility of the DNA through the chromatin, where the DNA can be cut.

And so, all of this information is put together, and the whole purpose is to create a highly structured, high-quality, gold-standard set of information from these datasets, with the hope that they'll be around for -- and useful for -- about four or five years, if not longer. And so this really requires that the standards are quite high, both on the quality of the experiment that's done and on the metadata that's available, so that you can actually find the information later.

A real simple view of how we do this: we require experimental metadata. We have to have the primary data submitted in the proper formats. You have to have the right number of replicates for the particular experiment, and of course, the biosample is very critical. We run that through analysis pipelines -- and with the analysis pipelines, as has been talked about many times today, not all of the software is really great. How do you use one package or another package?

1 What's needed here for ENCODE, though,

2 is that we want to have a unified pipeline for all

3 of the data of a particular data type. And that's

4 really particularly tough, because it's biological

5 labs that are doing the assays themselves, and of

6 course, they have their own post docs that say the

7 right way to analyze the data.

8 So, it's roughly taken two years to get

9 the various data providers to reach some form of

10 consensus that they agree is appropriate. Okay?

11 They don't want to be good enough. They want to

12 be right. But of course, they would never

13 actually finish, because they know they're not

14 right yet. And they'll keep saying, well in

15 another year, I'll tell you. In a year, I'll tell

16 you. And I've got to finish this paper.

17 But we have actually gotten them to

18 write down protocols, and we're implementing those

19 protocols in the cloud. So, it's sort of like

20 they're the R&D part and we're the production

21 part. We have to take their code and make it work

22 in a way that we can actually run it thousands of

304

1 times, hopefully automatically without a lot of

2 hand holding. And we're doing that in the cloud

3 using the DNAnexus environment. So, we have access

4 right in there.

5 We're writing -- we're building the

6 pipelines to DNA access that allows us to share

7 the pipelines very easily. You can get command

8 line access in there. But also, if you have a

9 small number of experiments you want to analyze,

10 you can use their web site.

11 So, this has been a particularly

12 difficult one, getting these pipelines

13 together. But once we have all that -- once we

14 know we have the right number of replicates and

15 such, we fire it all through. Of course, you need

16 reference annotations to get there. And the

17 analysis pipeline results are made available,

18 because those files go back into the data

19 warehouse, which is at Amazon, and we have to have

20 the metadata for every step along the way here.
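The provenance chain just described, where every file carries the files it was derived from and the software step that produced it, can be sketched roughly as follows. The field names and accessions are illustrative assumptions, not the actual ENCODE metadata schema.

```python
from dataclasses import dataclass, field

@dataclass
class SoftwareStep:
    """One software step in an analysis pipeline (illustrative fields)."""
    name: str
    version: str

@dataclass
class FileRecord:
    """A data file plus the provenance needed to reproduce it."""
    accession: str
    derived_from: list = field(default_factory=list)  # input file accessions
    step: SoftwareStep = None  # software that produced this file

def trace(files, accession):
    """Walk derived_from links back to the originally submitted files."""
    rec = files[accession]
    if not rec.derived_from:
        return [accession]
    lineage = []
    for parent in rec.derived_from:
        lineage.extend(trace(files, parent))
    return lineage + [accession]

# Example chain: raw reads -> aligned BAM -> peak calls
files = {
    "FASTQ1": FileRecord("FASTQ1"),
    "BAM1": FileRecord("BAM1", ["FASTQ1"], SoftwareStep("bwa", "0.7.10")),
    "PEAKS1": FileRecord("PEAKS1", ["BAM1"], SoftwareStep("spp", "1.10")),
}
print(trace(files, "PEAKS1"))  # ['FASTQ1', 'BAM1', 'PEAKS1']
```

With records like these, the "which files were used to create another analysis file" question becomes a simple graph walk, and the same records can be queried to see how a pipeline changed over time.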

21 So, the principles of our metadata for

22 ENCODE are, of course, reproducibility. We want

305

1 you to be able to track the analysis that has

2 happened. We want to be able to communicate the

3 key assumptions and purposes of this particular

4 step within the pipelines, and of course, easily

5 accessible -- provide easy accessibility to the

6 quality and metrics that are there.

7 Transparency. We want you to be able to

8 redo the pipeline in the future. We want you to

9 be able to run the pipeline on your data and to

10 associate it cleanly with the standard data that

11 have been created. Of course, where the files

12 have been, which software they have been run on,

13 which files were used to create another analysis

14 file -- all of this is really critical to allow

15 this to really be a resource. A lot of the lab

16 groups are not used to this sort of level of

17 reproducibility requirements and sharing, and it

18 is a little bit of a cultural education that we're

19 confronted with.

20 Just as an example of the metadata

21 we capture for the software steps -- so we have

22 metadata about the file. We have metadata about

306

1 the software. And we freely, openly share all of

2 the software via GitHub, and then we have the

3 analysis steps as well that integrate some of the

4 software as part of the greater analysis steps.

5 So, we want to track all of this

6 information so that you can actually query on it.

7 You can find out more information about how a

8 particular pipeline has changed over time. So, in

9 the last two slides -- so I haven't shown you any

10 web sites, and that's what half my group does is

11 make web sites. So, I had to show you one.

12 This is our latest web site. It's the

13 ENCODE portal, and it's been a lot of fun to

14 create. We've used a lot of different

15 technologies than our SGD web site, which has been

16 running for eight years. But in this case, we

17 used the ontologies -- because the ontologies are

18 used in the metadata, we can create this faceted

19 searching. So, in this case, I've clicked on DNase-

20 seq, and it's told us that there's a

21 certain number of experiments that fit into

22 various categories.

307

1 I know you can't read them there, but

2 the gray bar tells you the relative number of

3 experiments that are being observed or found.

4 Just blowing that up, you can see -- so of those

5 DNase-seq experiments that were available, there

6 were a certain number here that are from primary

7 cells, from tissues, and then we actually --

8 because we use ontologies for -- Uberon ontology

9 for the anatomy and we use a cell line ontology in

10 connecting that in with Uberon, so we know where

11 the tissues -- sorry, where the cells are from.

12 You can actually drill down and say I only wanted

13 to see DNase-seq experiments on the eye.
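Faceted searching of this kind, counting experiments per ontology-derived term and filtering as the user clicks, can be sketched as below. The experiments, term values, and field names are invented for illustration and are not the real portal's data.

```python
from collections import Counter

# Hypothetical experiments annotated with an assay type and an anatomy term
# (the real ENCODE portal uses ontologies such as Uberon for anatomy).
experiments = [
    {"assay": "DNase-seq", "organ": "eye"},
    {"assay": "DNase-seq", "organ": "liver"},
    {"assay": "RNA-seq",   "organ": "eye"},
    {"assay": "DNase-seq", "organ": "eye"},
]

def facet_counts(records, facet):
    """Count records per value of one facet (one column of the sidebar)."""
    return Counter(r[facet] for r in records)

def drill_down(records, **filters):
    """Apply the facets a user has clicked and return matching records."""
    return [r for r in records
            if all(r[k] == v for k, v in filters.items())]

print(facet_counts(experiments, "assay"))  # DNase-seq: 3, RNA-seq: 1
print(len(drill_down(experiments, assay="DNase-seq", organ="eye")))  # 2
```

Because the anatomy terms come from an ontology, the same mechanism supports drilling from a broad term down to a specific tissue, as in the eye example above.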

14 Okay. So, I don't know if I was too

15 fast or too slow. I want to acknowledge the

16 funding that we've had for this. It's all funded

17 by the National Human Genome Research Institute, and it's been a really good,

18 fruitful collaboration with them and with the

19 other model organism databases over the years.

20 Thank you.

21 (Applause)

22 MS. VOSKANIAN-KORDI: Are there any

308

1 questions from the audience?

2 SPEAKER: -- project, how much of the

3 data is shared with the other (Inaudible)?

4 DR. CHERRY: So, we're now at -- ENCODE

5 has used this for about nine years. We're

6 currently at ENCODE three, the third phase of it,

7 and we're two years into that third phase. All of

8 the data from ENCODE II was made available via the

9 EBI, and it's still there from sort of an FTP

10 site. ENCODE II in the U.S. is available from

11 Santa Cruz, and then it's also available now from

12 our new site at Amazon, and you can get at it via

13 the metadata searches and things. The NCBI -- I

14 don't know that we've put the data on NCBI, other

15 than in GEO and SRA.

16 MS. VOSKANIAN-KORDI: Other questions?

17 Yep. There's one over here.

18 SPEAKER: Hey, Mike. That's a great

19 talk. You know, I think it would be interesting

20 to compare and contrast how hard it's been to get

21 the community around the MODs to agree to

22 something like gene ontology, where we've had a

309

1 wonderful collaboration and ability to agree on a

2 set of standards, and how hard that's been in many

3 other spaces.

4 So from your view, what really made gene

5 ontology click and the agreements happen, and how

6 could we replicate that more, for instance, into

7 the genomic space into -- well, the clinical

8 genomic space?

9 DR. CHERRY: (Laughter) That's a good

10 question. So over the years, I've been

11 interviewed many times by ontologists that try to

12 say, how did you get this to work? And our goal

13 was not, at least at the start, to build an ontology

14 that was ontologically correct. We needed

15 something that worked. And so, we built it sort

16 of in a minimalistic way that helped us get data

17 out.

18 And so, I think that was a critical

19 component, is that people could see there's data

20 there. It's annotated in this way. Oh, maybe I

21 can understand how it's used, but there's still

22 been -- you know, over the years, a lot of little

310

1 spats that have gone on about how to do it and how

2 not to do it. But it's really been great over the

3 past, I don't know, eight years or so when

4 everybody -- basically, any genome paper that

5 comes out or any RNA seq paper comes out, they

6 always list you know, some sort of enrichment

7 analysis using gene ontology.

8 And so, I think it's one of these things

9 that it's just ubiquitous, and you know, that was

10 the whole intention. Right? Nobody really knows

11 where gene ontology comes from. It's just there.

12 And it's created by these curators who are just

13 doing it -- not to get credit for doing it, but to

14 actually help research go forward. And I should

15 say that Warren was involved in the Dictyostelium

16 database, so he's a mod guy, too. (Applause)

17 DR. MAZUMDER: Hello. Our next --

18 thanks a lot, Mike. Actually, you know, I think

19 90 percent of the curators or curation that you

20 see today, gene name, protein name, all of this

21 has been done by people who can fit in this room,

22 or even smaller than this room. I could make an

311

1 assertion like that. So it's quite amazing,

2 actually.

3 Anyway, our next speaker is Dr. Rodney

4 Brister. He's the group leader of the Viral Genome

5 Group at NCBI. He's also the chair of the Virus

6 Genome Annotation Working Group, and he also is

7 involved with the international group, Virus

8 Genome Data Standards. And I know that there are

9 several people in this auditorium who are working

10 on viruses and are interested in this talk. Thank

11 you, Rodney.

12 DR. BRISTER: Sorry for being the last

13 talk (Laughter). We're obviously running a little

14 late. So, Kim went over RefSeq in sort of a

15 general way, and I'm going to focus more on the

16 viral aspects of RefSeq and the work that my group

17 does, the Viral Genome Group at NCBI.

18 So, the world is literally coated in

19 viruses. There are an estimated 10 to the 30th

20 viruses on the planet right here today, and

21 understandably, we're interested in them because

22 they kill a lot of people. Annually, several

312

1 million deaths are attributed to viruses, and this

2 has brought a lot of attention in the sequencing

3 and genomics world onto viruses, and a lot of

4 money has been put forth to sequence a number of

5 human and plant pathogens.

6 And so, people engaged in sequencing

7 have created somewhat of a sequence explosion,

8 which has resulted in about two million viral

9 sequences being deposited in INSDC databases or

10 commonly, GenBank. This sounds like a really

11 great thing. The problem is, sequences are just

12 A's, G's, C's and T's until someone comes along and

13 transforms those into sequence data.

14 And in the context of NGS sequencing,

15 transformation of the data means assembling the

16 raw sequences into consensus sequences and

17 annotating those consensus sequences into

18 biologically relevant data, like genes and

19 proteins. Of course, that's where the humans come

20 into play as well as the computers. And the

21 Reference Genome Group -- the Viral Reference

22 Genome Group is involved in trying to make this

313

1 whole process easier.

2 As part of the RefSeq project, we create

3 a non redundant, well annotated set of reference

4 genomes, which we distribute publicly for the

5 world to use. And before going further, I just

6 want to say we're a very small group and rely to a

7 great extent on collaborations with various

8 communities, public databases and individual

9 scientists, some of whom are here today.

10 Our goal is really to provide the

11 reference infrastructure for the identification,

12 assembly and annotation of viral sequence data.

13 So, our data model is pretty simple, or at least

14 it used to be. And that is, we would create one

15 reference genome for each viral species. Taxonomy

16 is central to this model. We rely on the taxonomy

17 from the International Committee for the Taxonomy

18 of Viruses. They set standards for family, genus

19 and species level taxonomy, which we bring into

20 NCBI and sort of use as our template for what we

21 create in NCBI.

22 Over the past decade or so, ICTV has

314

1 approved in a steady fashion, quite a number of

2 viruses -- viral species, excuse me. And we

3 validate the taxonomy for each viral reference

4 sequence that we create based on the criteria that

5 the various study sections within ICTV set forth.

6 Now, one of the problems of our data model is --

7 see here in orange, the rate at which novel

8 viruses are discovered is nearly exponential and

9 doesn't really match well with the linear sort of

10 taxonomy efforts at the ICTV.

11 So, we have to react to these novel

12 viruses, because we want to represent them in our

13 sequence space. So we create RefSeqs from these

14 novel viruses, which means we spend a lot of time

15 doing taxonomy on these novel viruses. And we've

16 put them in special bins that are called

17 unclassified bins that you can see in the NCBI

18 taxonomy pages, and we try to classify things to

19 the family or genus level. Sometimes, we're

20 actually able to classify them a little bit below

21 that.

22 Only a few viral -- another problem for

315

1 our data model is that unlike other model

2 organisms, which you've heard about in earlier

3 talks, only a few genomes have been experimentally

4 defined. And so, we have a data model that has to

5 incorporate very well annotated genomes; genomes

6 that have well defined experimental data with

7 genomes for the records that have been annotated

8 mostly by the computational transfer of annotation

9 from a well annotated genome to a less well -- or

10 a well defined genome to a less defined genome.

11 And finally, genomes that have been

12 annotated simply by ab-initio techniques for which

13 very little or no experimental data is known. And

14 so, we have developed multiple classes for RefSeqs

15 to help you discriminate between these types of

16 genomes. So, we have a reviewed class that

17 generally represents a very well annotated genome

18 that comes from mostly experimental data, and we

19 have our provisional class that generally

20 represents something that has mostly been

21 annotated through ab-initio techniques.

22 Now, once we create a RefSeq genome,

316

1 every other genome is then included as a neighbor

2 to that RefSeq genome. Every other genome for

3 that species, excuse me, is included as a genome

4 neighbor to that RefSeq genome. And over the past

5 decade or so, there's been a huge increase in

6 focused sequencing efforts around human, and to a

7 lesser extent, agricultural pathogens. So, you

8 can see in some cases, and this table does not

9 include influenza -- in some cases, you literally

10 have thousands of genomes for a particular

11 species.
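The data model just described, one reference genome per species with every other genome for that species linked as a neighbor, can be sketched as a toy structure like this. The accessions and counts are illustrative only.

```python
# Toy sketch of the RefSeq/neighbor model: each species has a reference
# genome (or, under the expanded model, several), and every other genome
# for that species is linked as a "neighbor". Accessions are illustrative.
species_records = {}

def add_genome(species, accession, is_refseq=False):
    """Register a genome for a species as either a RefSeq or a neighbor."""
    rec = species_records.setdefault(species, {"refseqs": [], "neighbors": []})
    rec["refseqs" if is_refseq else "neighbors"].append(accession)

add_genome("Enterovirus C", "NC_002058", is_refseq=True)
add_genome("Enterovirus C", "KX000001")
add_genome("Enterovirus C", "KX000002")

rec = species_records["Enterovirus C"]
print(len(rec["refseqs"]), len(rec["neighbors"]))  # 1 2
```

Under this layout, a hit against the RefSeq immediately gives access to the full set of neighbor genomes, which for heavily sequenced human pathogens can number in the thousands.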

12 Now, taxonomy and genome links were

13 validated for all of the genome neighbors. So, we

14 add value to things that are not RefSeqs, and we

15 try again, to assign them with the right taxonomic

16 identifier and try to validate whether or not this

17 is a full-length genome as expected by other genomes

18 in that particular taxonomic grouping or not.

19 Now, over the past decade, the

20 sequencing for viruses has just skyrocketed, and

21 the number of validated RefSeq genome records is

22 nearly 80,000 at this point. In blue there, you

317

1 can see the number of RefSeqs, which is basically

2 linear. So, a lot of the new sequencing we're

3 seeing, despite a lot of novel viruses

4 being sequenced, is re-sequencing of new isolates

5 of already defined species.

6 Now, this has increased the extant

7 sequence space for viruses a great deal, and this

8 extant sequence space is growing with both novel

9 viruses, but to a greater extent, with variants of

10 already discovered viruses. And some of this

11 space is captured fairly well by our current

12 RefSeqs, but other parts of the space, not at

13 all. And this represents new genotypes, new

14 strains and other variants.

15 So in an attempt to better capture the

16 sequence space so that when you come into our

17 RefSeq database you can find a hit to your

18 sequence, we're expanding our RefSeqs to break the

19 model of one RefSeq per species to include

20 multiple RefSeqs per certain species, and we're

21 relying more on the sequence analysis to define

22 the sequence space, define holes in it, in terms

318

1 of our RefSeq representation. We're also relying

2 on the communities at large to tell us about their

3 important genotypes, their important strains, so

4 that we can capture all of this diversity within

5 this RefSeq model.

6 And again, the idea is that you can come

7 into the RefSeq database and identify a sequence

8 right away. So, it may not be enough to know that

9 you have an enterovirus, but you need to know that

10 there's a particular type of enterovirus, and that

11 allows you to know something about that virus

12 clinically right away.

13 Now, most people think of viruses as

14 things that pop into cells and blow them up with a

15 bunch of (Inaudible). But for a lot of viruses,

16 they get into the cell and they hang out and they

17 take up residency of the cell, sometimes as an

18 episome, sometimes as an integrant. And this

19 sequence space has really been missed by the

20 RefSeq model of years past.

21 And so, we're trying to reclaim this

22 sequence space by using extant viruses to identify

319

1 viruses that are integrated as part of host

2 genomes; those genomes having been submitted to

3 the INSDC databases, and create RefSeqs that have

4 some context with them, i.e., this RefSeq was made from

5 a crow virus, to allow people to find these guys

6 in terms of when they do a search against RefSeq,

7 but also, to give people, users some context

8 awareness about these sequences.

9 Another project we're doing is trying to

10 give every RefSeq species a host type. And that's

11 another manually curated operation where we're

12 going through both old and newly submitted viruses

13 and identifying the host -- or group of

14 hosts -- that they infect and assigning that to

15 the actual taxonomy associated with that RefSeq.

16 So, that actually helps the whole database,

17 because then all other viruses of that taxonomy

18 node also get this property. So, you can come

19 into our infrastructure and identify viruses by

20 this host type.
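The inheritance mechanism described here, where a host type assigned to one taxonomy node applies to everything beneath it, can be sketched as a walk up the taxonomy tree. The node names and the placement of the curated annotation are illustrative assumptions.

```python
# Sketch of host-type propagation: a host annotation curated at one
# taxonomy node is inherited by every taxon below it. Names illustrative.
parent = {
    "Enterovirus C": "Enterovirus",
    "Enterovirus": "Picornaviridae",
}
host_of = {"Enterovirus": "vertebrates"}  # curated at the genus level

def host_type(taxon):
    """Walk up the taxonomy until a curated host annotation is found."""
    while taxon is not None:
        if taxon in host_of:
            return host_of[taxon]
        taxon = parent.get(taxon)
    return "unknown"

print(host_type("Enterovirus C"))   # inherited from the genus annotation
print(host_type("Picornaviridae"))  # no annotation at or above this node
```

One curated assignment at an interior node thus covers all of that node's descendants, which is why annotating the taxonomy, rather than individual records, helps the whole database.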

21 The distribution of host type is

22 actually kind of interesting. Humans only make up

320

1 a small number of the hosts in our databases.

2 Actually, our BLAST analytics tell us that

3 recently, most of the hits in BLAST to viruses

4 come against bacteria (Inaudible). And we're

5 seeing an uptick in other agriculturally important

6 viruses that infect plants and other vertebrates,

7 as well.

8 Now, our goal is to let you know what

9 you have. So, we want to create a reference

10 infrastructure that allows you to identify what

11 you have. And our products consist of our RefSeq

12 sequences, the taxonomy data associated with those

13 RefSeq sequences, the host type metadata

14 associated with those RefSeq sequences, and then

15 the genome neighbors associated with those RefSeq

16 sequences.

17 And so the idea is, you can come and get

18 a hit to our RefSeq, know what taxon it belongs

19 to, know what kind of host it infects. And then,

20 if you want to learn more about the variation

21 associated with that species, then be able to

22 delve down into all of the neighbor sequences and

321

1 see the variants on the genome level associated

2 with that particular RefSeq. And of course, our

3 model is very much dependent on feedback from the

4 community. So, we take what you guys bring us and

5 we integrate it back into the database and

6 hopefully, together, we all improve the data.

7 So, the next question is, where is all

8 of this going? And one example of this is

9 something that we call virus variation, and it

10 tries to deal with this problem that we have

11 millions of GenBank records that really don't have

12 trustworthy annotation. They may not be

13 taxonomically filed away correctly. They may not

14 have genes or proteins annotated at all.

15 They may be annotated incorrectly. So, the Virus

16 Variation Project attempts to bring into this

17 space specialized databases, user interfaces,

18 reference driven annotation tools, metadata

19 parsing, mapping to standardized vocabularies to

20 make all this data much more accessible, so to

21 give it some place in the world as not quite a

22 reference, but something that's been analyzed

322

1 computationally.

2 And the goal is to take things from

3 GenBank, Biosample, SRA, and pull them to our

4 pipeline and create a standardized sequence

5 annotation for all of these records in a context

6 that allows you to analyze variants and metadata

7 associated with those sequences, and bringing

8 about some more user friendly interfaces while

9 doing it, because I realize that some of the

10 Entrez tools can be difficult.
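One small piece of the pipeline described above, mapping the free-text metadata found on submitted records onto standardized vocabularies, can be sketched like this. The synonym table, field name, and records are invented for illustration and are not the actual Virus Variation vocabulary.

```python
# Sketch of metadata standardization: free-text values from submitted
# records are normalized to a controlled vocabulary via a synonym table.
COUNTRY_SYNONYMS = {
    "usa": "United States",
    "u.s.a.": "United States",
    "united states": "United States",
    "viet nam": "Vietnam",
}

def standardize_country(raw):
    """Normalize a free-text country field to a standard term.

    Unrecognized values are passed through unchanged so a curator
    can review them later, rather than being silently dropped.
    """
    key = raw.strip().lower()
    return COUNTRY_SYNONYMS.get(key, raw.strip())

records = [{"country": "USA"}, {"country": "Viet Nam"}, {"country": "France"}]
cleaned = [standardize_country(r["country"]) for r in records]
print(cleaned)  # ['United States', 'Vietnam', 'France']
```

Normalizing fields this way is what makes the downstream interfaces searchable: every record for a given country, host, or collection date lands under one term instead of dozens of spellings.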

11 And so there are a lot of people that

12 contribute to this. Again, we have a very small

13 group, so we rely on the kindness of strangers in

14 some cases, but also, the kindness of the people

15 within NCBI to help us out on various projects.

16 And here they all are. Thank you.

17 (Applause) On time, too

18 (Laughter).

19 MS. VOSKANIAN-KORDI: Sorry. Before we

20 ask questions, if anyone has their posters

21 outside, they're trying to remove the bulletin

22 boards, so please go ahead and remove your posters

323

1 before five, which is in five minutes. So, thank

2 you.

3 SPEAKER: A lot of the RNA viruses that

4 were originally sequenced no longer exist

5 or can't be found in nature. So, how do you take

6 into account -- would you go through the neighbors

7 as -- annotation of the neighbors?

8 DR. BRISTER: Well, that's a good

9 question. And Arifa Khan and some other people

10 who are part of the Advanced Virus Detection Group

11 -- endogenous viruses, extinct viruses, things

12 "missing from the databases" -- that's really the

13 kind of stuff that we have to rely on the

14 community at large to tell us about.

15 We don't know it's a problem until

16 someone tells us about it. And then, once we

17 recognize there's an issue there, then we go and

18 we try to solve it. I know ICTV struggled with

19 this issue that they had many viruses that were

20 characterized long ago based on phenotype. And

21 they had filterable material. They knew it was a

22 virus. They may have even had an EM of it -- no

324

1 sequence.

2 And I think at this point, they're sort

3 of giving up on those guys. We're a sequence

4 database, so we need representation within a

5 sequence. Sometimes those exist someplace.

6 Sometimes they don't. If we have a sequence, we

7 can start from there, and we can start making some

8 representation of that.

9 DR. MAZUMDER: Can I add to that list

10 which you just named? The recombinants and some

11 subtype mosaics -- how do they relate to your

12 neighbor concepts?

13 DR. BRISTER: Okay, so this has come up

14 with HIV, and it's kind of an interesting topic.

15 We'll add to this constructs and non natural

16 constructs. Right? This came up with some of the

17 constructs that were made to study HIV in the

18 laboratory where simian and human viruses were

19 sort of fused together.

20 From our standpoint, there's really two

21 levels to this. One is this representation of the

22 sequence base, and the second is the context. So,

325

1 in a perfect world, which we're not there yet, we

2 would like to have a context of where a database

3 -- we would like to be able to store things as

4 laboratory you know, strains, or store things as

5 these are not natural variants, so when you get

6 this hit, be aware of this.

7 Right now, we're doing it with host type

8 in terms of environmental samples. We want to

9 represent the sequence space associated with

10 environmental samples. There's an easy way to

11 mark these as environmental samples. We're doing

12 it that way through the host type property. We've

13 discussed doing the same thing for laboratory

14 strains.

15 In virus variation, we've solved the

16 problem by doing that. It's not clear how we're

17 going to do that within the referenced context.

18 Some of the strains are neighbors now. The real

19 question is, should we make them into RefSeqs?

20 And we haven't yet, but we are having an ongoing

21 discussion about that.

22 DR. MAZUMDER: Very important, because I

326

1 know you were (Inaudible) samples. We had folks

2 from HIV come in (Inaudible) --

3 (Simultaneous discussion)

4 SPEAKER: I just wanted to continue just

5 with that. There are a lot of natural

6 recombinations, particularly like in the

7 enterovirus families.

8 DR. MAZUMDER: Right.

9 SPEAKER: And so those occur all the

10 time. As a matter of fact, in some cases, it's

11 only the cap-set portion that defines the virus as

12 being a particular genotype.

13 DR. BRISTER: Yeah. And so, the same

14 thing happens in the bacteriophage. So, we had a

15 project that's focused on bacteriophages, where you have

16 gene cassettes that sort of move around between viruses.

17 And so, we're a protein database and a nucleotide

18 database. So, in order to capture the protein

19 space associated with a group of bacteriophages,

20 we've actually had to make several different

21 RefSeqs that together, represent the protein space

22 of that group.

327

1 And I suspect we're going to go back and

2 do this for many other viruses. I mean, as Kim

3 kind of alluded to, we're breaking a lot of rules.

4 And so (Laughter), you know, the thing is when

5 you're a small community and you require a lot of

6 interactions with other scientists, you start

7 hearing them. You go, this is a stupid rule. We

8 need to get rid of this. And so, we're taking

9 test cases to do that, and then we're going back

10 to some of the other stuff and going, okay, we

11 should do that here, too.

12 And it's really about bringing people

13 into the discussion, contacting me -- search my

14 name on Google. I'm in this -- I usually use J.

15 Rodney. If you do J. Rodney Brister, you'll find

16 me. Give me an email. We'll solve the problem.

17 I can't guarantee you we'll solve it tomorrow.

18 We'll start the discussion.

19 The way we like to approach these things

20 is to create a group, maybe five or six people who

21 can have a discussion about this as a community,

22 bring the ICTV in it. When necessary, bring

328

1 sequencing centers into it, when necessary, and

2 really sort of work this out together. And we've

3 been successful in some cases, and we're looking

4 to scale that up.

5 SPEAKER: Yes. I'm wondering how the

6 database accounts for important viral genotypes

7 that are resistant to anti-virals. Would they be

8 considered a neighbor to a RefSeq, or are they

9 accounted for in the metadata? How do you address

10 those types of things?

11 DR. BRISTER: Okay. So this is another

12 great thing that we've kind of solved in the virus

13 variation construct where we can mark things with

14 clinically relevant metadata. That really doesn't

15 fit into the RefSeq model. The RefSeq model is

16 not designed for viruses in many ways, shapes and

17 forms.

18 How we deal with that down the road, I

19 don't know. One way is to bring in new index

20 terms into Entrez -- we haven't really had that

21 discussion. In my mind, I'd rather get away from

22 the constraints that a model somebody built for

329

1 something else places upon us and move into a

2 space where we take advantage of a model that we

3 built to solve this particular problem.

4 So, I think with clinically relevant

5 mutations, we like to get those viruses themselves

6 into virus variation, and we've then -- we've

7 expanded virus variation just over the last couple

8 of months, actually adding MERS and Ebola virus to

9 it. We have plans for Norovirus, Rotavirus.

10 Again, we're open to suggestions.

11 So you know, bring these ideas to me.

12 We have a small group, but if, you know, we get a

13 bunch of people asking us for these resources, you

14 guys start calling the Congressmen and start

15 calling my director and saying, we need more viral

16 resources, then maybe we can get it done.

17 DR. MAZUMDER: If there are no more

18 questions, thanks a lot. It was a great, great

19 talk. (Applause) Thank you very much, Rodney.

20 MS. VOSKANIAN-KORDI: At this point,

21 we're going to go ahead and close the first day of

22 our conference. We'll start again tomorrow at

330

1 8:30. Thank you all for joining, and again, if

2 you have a poster up, please go ahead and remove

3 it. Thank you so much. Bye.

4

5 (Whereupon, the PROCEEDINGS were

6 adjourned.)

7 * * * * *


331

1 CERTIFICATE OF NOTARY PUBLIC

2 STATE OF MARYLAND

3 I, Mark Mahoney, notary public in and for

4 the State of Maryland, do hereby certify that the

5 foregoing PROCEEDING was duly recorded and

6 thereafter reduced to print under my direction;

7 that the witnesses were sworn to tell the truth

8 under penalty of perjury; that said transcript is a

9 true record of the testimony given by witnesses;

10 that I am neither counsel for, related to, nor

11 employed by any of the parties to the action in

12 which this proceeding was called; and, furthermore,

13 that I am not a relative or employee of any

14 attorney or counsel employed by the parties hereto,

15 nor financially or otherwise interested in the

16 outcome of this action.

17

18 (Signature and Seal on File)

19 ------

20 Notary Public, in and for the State of Maryland

21 My Commission Expires: November 1, 2014

22