Your sequence submission pack

The purpose of this document is to provide detailed information on the key EGA submission stages for your sequence data.

If your submission also includes array-based data covered under the same study (publication), please generate your study accession first, using the instructions provided below. The study accession obtained should then be used for your array-based submission.

Before progressing, please ensure that you read and follow our guidelines on preparing your sequence data files.

Additional submission help and support can be obtained by emailing EGA-Helpdesk.

Key stages of sequence submissions in detail

Encrypt - Calculate - Upload - Send Key - Document

Encrypt

Encrypt all your documents and files using GnuPG. Contact EGA Helpdesk to obtain the GnuPG public key by email.

Before uploading your data files to your submission account, all data files must be encrypted using GnuPG.

Quick guide to using GnuPG for encryption

i) Follow the installation instructions found here.

ii) If creating your own key, use the command: gpg --output -c
Follow the onscreen prompts and choose the default options, which will create an encrypted copy for each file.
If using the EGA public key, import the key using the command: gpg --import EGA_Public_Key

iii) Now encrypt your files using the command: gpg -e [filename1] [filename2] [etc]
If using your own key, enter the UID generated when you created the key in step ii. For the EGA public key, enter your UID as 'EGA_Public_Key'.

You should now have an encrypted copy for each file, with the suffix *.gpg*.
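As an illustration, the encryption step can be scripted. The sketch below is one possible approach in Python, not an EGA tool: it builds a gpg command for each file and runs it, assuming gpg is installed and the recipient key (for example 'EGA_Public_Key') has already been imported. The function names are hypothetical. It calls gpg once per file because, depending on the GnuPG version, a single --encrypt call may accept only one input file unless --multifile is also given.

```python
import subprocess
from pathlib import Path


def encrypted_name(path):
    """Return the path of the encrypted copy: the original name plus '.gpg'."""
    source = Path(path)
    return source.with_name(source.name + ".gpg")


def gpg_encrypt_command(path, recipient):
    """Build the gpg argument list that encrypts one file for one recipient."""
    return [
        "gpg",
        "--recipient", recipient,              # UID of the key to encrypt for
        "--output", str(encrypted_name(path)),  # encrypted copy beside the original
        "--encrypt", str(Path(path)),
    ]


def encrypt_files(paths, recipient):
    """Run gpg once per file and return the paths of the encrypted copies."""
    encrypted = []
    for path in paths:
        # check=True raises CalledProcessError if gpg exits with an error
        subprocess.run(gpg_encrypt_command(path, recipient), check=True)
        encrypted.append(encrypted_name(path))
    return encrypted
```

For example, encrypt_files(['reads_1.srf', 'reads_2.srf'], 'EGA_Public_Key') would be expected to leave reads_1.srf.gpg and reads_2.srf.gpg beside the originals. The file names are placeholders, and gpg may still ask you to confirm use of a key you have not marked as trusted.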

Further information on using GnuPG can be found on their documentation pages here.

You can also use the EGA uploader tool to encrypt and generate md5sum values for your files locally (without upload). See the submission tools for more information.

Calculate

Calculate md5 checksums for each file before and after encryption (i.e. each file should have two md5 values).

The md5sum program is installed by default on most Linux and Unix-like systems. The Windows md5sum program is available here.

To generate md5sum values for any number of files use the command: md5sum [filename1] [filename2] [etc] > myvalues.md5

This will create md5sum values for the files listed and save these values into a file called 'myvalues.md5'
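Where md5sum is not available, the same values can be computed with Python's standard hashlib module. The following is a minimal sketch rather than an EGA-provided tool; the function names are hypothetical, and the output mimics the usual md5sum line layout (checksum, two spaces, file name).

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal md5 checksum of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use low for large sequence files
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_file(paths, output="myvalues.md5"):
    """Write one 'checksum  filename' line per file and return the lines."""
    lines = [f"{md5_of_file(path)}  {path}\n" for path in paths]
    Path(output).write_text("".join(lines))
    return lines
```

Calling write_md5_file(['reads.srf', 'reads.srf.gpg']) would record both the unencrypted and the encrypted checksum for one file; the file names are placeholders.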

Please upload your md5sum values to your data upload account.

Further information on md5sum can be found here.

You can also use the EGA uploader tool to encrypt and generate md5sum values for your files locally (without upload). See the submission tools for more information.

Upload

Upload all your data files into your data upload account.

Methods available for uploading data are detailed below.

Using Aspera: Downloading the Aspera ascp command line program

Aspera is a commercial file transfer protocol that provides faster transfer speeds than FTP over long distances.

For short-distance file transfers we continue to recommend the use of FTP.

The Aspera ascp command line client (Aspera Connect) can be downloaded here. Please select the correct version for your operating system.

The ascp command line client is distributed as part of the Aspera Connect high-performance transfer browser plug-in.

Using Aspera: Using the Aspera ascp command line program

Please note: The ascp command line should be run from within the Aspera directory containing ascp.exe.

Your command should look similar to this: ascp -QT -l300M -L- <file(s)> <login>@fasp.ega.ebi.ac.uk:/.

The '-l300M' option sets the upload speed limit to 300 Mbit/s (about 37.5 MB/s). You may wish to lower this value to increase the reliability of the transfer.

The '-L-' option prints logs while transferring.

<file(s)> can be a file mask (e.g. '/homes/submitter/*.srf') or a list of files.

<login> is your password protected Aspera login.

Add the '-k2' switch to enable transfer restarts.
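The pieces of the ascp command can be assembled programmatically, which also makes the rate arithmetic explicit. This is a hedged sketch with hypothetical function names: it only builds the argument list from the options described above and does not run ascp. It assumes the 'M' suffix of '-l' denotes megabits per second, so the value is divided by 8 to obtain megabytes per second.

```python
def mbit_to_mbyte_per_s(rate_mbit):
    """Convert a rate in megabits per second to megabytes per second."""
    return rate_mbit / 8


def ascp_upload_command(files, login, rate_mbit=300, resume=False):
    """Build the ascp argument list for uploading files to the EGA Aspera host."""
    command = ["ascp", "-QT", f"-l{rate_mbit}M", "-L-"]
    if resume:
        command.append("-k2")  # allow an interrupted transfer to be restarted
    command.extend(files)      # explicit file list; masks must be expanded first
    command.append(f"{login}@fasp.ega.ebi.ac.uk:/.")
    return command
```

The returned list could be passed to subprocess.run from within the Aspera directory; lowering rate_mbit is the programmatic equivalent of lowering the '-l' value by hand.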

Using default ftp command line client in Windows

1- Start the command line interpreter: press Win-R, type cmd, hit enter
2- Enter 'ftp ftp-private.ebi.ac.uk'
3- Enter your login
4- Enter your password
5- To see a list of available ftp commands type 'help'.
6- Type 'ls' command to check the content of your submission account.
7- Type 'prompt' to switch off confirmation for each file uploaded.
8- Use 'mput' command to upload files: 'mput *.srf'
9- Use 'bye' command to exit the ftp client.
10- Use 'exit' command to exit the command line interpreter.

Using default ftp command line client in Linux/Unix

1- Open a terminal and type 'ftp ftp-private.ebi.ac.uk'
2- Enter your login
3- Enter your password
4- To see a list of available ftp commands type 'help'.
5- Type 'ls' command to check the content of your drop box.
6- Type 'prompt' to switch off confirmation for each file uploaded.
7- Use 'mput' command to upload files: 'mput *.srf'
8- Use 'bye' command to exit the ftp client.
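The same FTP session can be scripted on either platform. The sketch below uses Python's standard ftplib module and mirrors the manual steps (log in, upload each file, list the account contents). The function names are hypothetical, and it assumes plain FTP access to ftp-private.ebi.ac.uk with your own login and password.

```python
from ftplib import FTP
from pathlib import Path


def upload_files(ftp, paths):
    """Upload each file in binary mode and return the uploaded file names."""
    uploaded = []
    for path in paths:
        name = Path(path).name
        with open(path, "rb") as handle:
            ftp.storbinary(f"STOR {name}", handle)  # one file, like 'put'
        uploaded.append(name)
    return uploaded


def upload_to_ega(paths, login, password, host="ftp-private.ebi.ac.uk"):
    """Log in, upload the files, and return (uploaded names, remote listing)."""
    with FTP(host) as ftp:  # the connection is closed on exit, like 'bye'
        ftp.login(user=login, passwd=password)
        uploaded = upload_files(ftp, paths)
        return uploaded, ftp.nlst()  # listing, comparable to 'ls'
```

Keeping upload_files separate from the connection handling makes it possible to try the upload logic against a stand-in object before connecting to the real account.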

Send Key

Pass your encryption key to the EGA by post or phone (not required if the GnuPG public key is used).

Please do not pass your encryption key over email. You may use postal/courier services, deliver in person or pass the key over the phone.

Our contact details:

Mr Jeff Almeida-King
EGA User Support Officer
EMBL-EBI
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Tel: +44 (0) 1223 494559

Document

Provide details of your Study, Samples, Experiments, Runs/Analysis, Policy and Dataset/s

We require the following documentation for your submission:
1- Policy documentation to enable submission to EGA
2- Metadata associated with your study

1- Policy documentation

Please be advised that the EGA can only archive and distribute your data submitted upon receipt and validation of all policy documentation. These documents provide details of your Data Access Committee (DAC), which will be responsible for granting access to the data, and provide authorisation for your data upload.

Further information on DACs can be found here.

Please see the links below for examples of the required policy documentation that should accompany your submission and may be emailed directly to EGA Helpdesk.

Data Access Agreement
Data access application form
Policy statements

2- Metadata associated with your study

**Metadata submitted as XMLs or through the Webin tool will be made publicly available to view on the EGA website and other EBI resource/partner websites**

Your metadata, which will include details of your samples, experiments, runs/analysis, Data Access Committee (DAC), policy and dataset/s can be provided by two alternative means:

i) Online using the EGA Webin tool
ii) Creating and submitting XMLs

i) Using the EGA Webin tool

This online tool enables you to create new and edit existing submissions.

Go to the EGA Webin page and log in using your submission account name and password.

For the submission of sequencing reads that have been uploaded to your submission upload account:
• Go to the ‘New Submission’ tab
• Choose ‘I wish to do a complete submission’ and follow the online prompts, which will guide you through adding information for your study, samples, experiments and runs.
• Once completed, please register your Data Access Committee (DAC), data access policy and dataset to conclude your metadata submission.

To generate a study accession number (EGASXXXXXXXXXXX), for use in your publication, before your reads have been uploaded:
• Go to the ‘New Submission’ tab
• Choose ‘I wish to register study’ and follow the online prompts
• Your samples, Data Access Committee (DAC) and data access policy may also be registered before your reads have been uploaded.

To use the study accession number in a publication, we suggest the following format:

"Genotype data has been deposited at the European Genome-phenome Archive (EGA, http://www.ebi.ac.uk/ega/), which is hosted by the EBI, under accession number EGASXXXXXXXXXXX."

Further information regarding the use of Webin can be found here.

What happens after the key submission stages have been completed?

Upon the completion of a dataset, your website is prepared, which will point to your study, dataset and Data Access Committee.

Once your draft website is completed, a member of the EGA will be in touch before your website goes live to ensure:

• Your study is represented accurately
• Access to EGA user management tools is provided to the Data Access Committee named contacts

Further information regarding the role of the Data Access Committee can be found here.

Finally, your data is archived within our databases and prepared for encrypted distribution upon the request of permitted EGA account holders.

We strongly advise you NOT to delete your data until we confirm that your data has been successfully archived.

ii) Creating and submitting XMLs

All metadata required by the EGA may be collected using our EGA XMLs.

Submitters are required to prepare, validate and submit the XMLs.

Working with XML

We recommend manipulating EGA metadata using an XML editor, preferably one with the ability to validate against XML schemas. A good article on choosing an XML editor can be found here. Alternatively, XML can be edited in standard text editors and then checked using an XML validator, e.g. xmllint, a free Unix-based XML validator.
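If no XML editor or validator is at hand, a basic well-formedness check can be done with Python's standard library. Note the limits of this sketch: it only confirms that the XML parses (matching tags, valid syntax) and does not validate against the EGA schemas, so it complements a schema-aware validator rather than replacing one. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if the text parses as XML, otherwise the parser's message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return str(error)
    return None
```

A return value of None means the document is well formed; any other value is the parser's description of the first problem it found, including a line and column position.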

General concepts: Aliases and center names

Every EGA object must be uniquely identified within the submission account using the alias attribute. The aliases can be used in submissions to make references between EGA objects. Please find more information about the use of aliases and center names below:

alias attribute: every object should have a name that is unique within your submission account. Once submitted successfully, every alias will be assigned an accession.

refname attribute: when an object references another by its alias, the alias goes into the refname attribute. For example, if a sample has the alias "sample1", and an experiment uses this sample, then the EXPERIMENT/SAMPLE/refname should be "sample1".

center_name attribute: the center_name attribute is required within the submission XML and will be propagated to all other XMLs if not individually provided. This attribute is the controlled vocabulary acronym or abbreviation that is provided to the account holder when the account is first generated for an institute. If the submitter is brokering a submission for another institute, the submitter should use their special broker account name in broker_name while the data centre acronym remains in center_name.

run_center attribute: many submitting centres contract out the sequencing to another centre. In these cases, the sequencing centre should be acknowledged in the run_center attribute. Again, this is controlled vocabulary and the acronym should be sought from EGA before submitting.
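The alias and refname rules lend themselves to a quick consistency check before submission. The sketch below is illustrative only: it works on a hypothetical in-memory list of objects (dictionaries with an 'alias' and an optional 'refnames' list) rather than on EGA XML, and it treats aliases as unique across the whole list, as described above. A refname pointing at an object submitted earlier would be reported here as unknown, so such reports should be read as warnings.

```python
def find_alias_problems(objects):
    """Report duplicate aliases and refnames that match no alias in the list."""
    problems = []
    seen = set()
    for obj in objects:
        if obj["alias"] in seen:
            problems.append(f"duplicate alias: {obj['alias']}")
        seen.add(obj["alias"])
    for obj in objects:
        for refname in obj.get("refnames", []):
            if refname not in seen:
                problems.append(
                    f"{obj['alias']} references unknown alias: {refname}"
                )
    return problems
```

For instance, a sample with alias "sample1" and an experiment whose refnames list contains "sample1" yield no problems, whereas an experiment referring to "sample2" without a matching sample is flagged.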

Validating and submitting your EGA XMLs

Please submit your EGA XMLs to your XML upload account. Please note that your log-in details for this account should have been provided at the beginning of the submission process.

Test XML upload account (recommended for first time users): https://www-test.ebi.ac.uk/ena/submit/drop-box/submit/

Production XML upload account: https://www.ebi.ac.uk/ena/submit/drop-box/submit/

Submitters are advised to use the Test XML upload account when submitting XMLs to the EGA for the first time. The test service is identical to the production service except that all submissions will be discarded on the following day.

We recommend that you validate all XMLs using the ‘VALIDATE’ action in your submission XML before submitting using the ‘ADD’ action.
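To make the difference between the two actions concrete, the sketch below generates a submission XML in which only the action element changes between a validation run and a real submission. The element and attribute layout used here (SUBMISSION, ACTIONS, ACTION, and an action element carrying source and schema attributes) is an assumption based on the ENA-style submission format; please check it against the online EGA submission manual before use. The function name, alias and file names are hypothetical.

```python
import xml.etree.ElementTree as ET


def submission_xml(alias, center_name, action, files):
    """Build a submission XML string; files is a list of (schema, source) pairs."""
    if action not in ("VALIDATE", "ADD"):
        raise ValueError("action must be 'VALIDATE' or 'ADD'")
    # Assumed layout: SUBMISSION > ACTIONS > ACTION > VALIDATE or ADD
    submission = ET.Element("SUBMISSION", alias=alias, center_name=center_name)
    actions = ET.SubElement(submission, "ACTIONS")
    for schema, source in files:
        wrapper = ET.SubElement(actions, "ACTION")
        ET.SubElement(wrapper, action, source=source, schema=schema)
    return ET.tostring(submission, encoding="unicode")
```

Generating the file first with action="VALIDATE" and, once it passes, again with action="ADD" keeps the two submissions identical apart from the action itself.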


EGA XML objects

Nine EGA XML objects are required to be completed, which must be validated and submitted to your XML upload account in two separate stages, shown below:

Please note: Each stage requires a submission.xml, which defines the submission transaction.

Analysis data submissions

A typical EGA analysis (as opposed to raw) data submission consists of 7 EGA XMLs: Submission, Study, Sample, Analysis, DAC, Policy and Dataset XML. Currently, we accept two different types of analysis data submissions:

• BAM files (for read alignments)
• VCF files (for sequence variations)

In both cases EGA samples must be created to refer to the samples used within the BAM and VCF files.

Study, Samples, DAC, Policy and Dataset may be submitted using Webin, but your Analysis can only be submitted as an XML.

Your unique study accession number

Once you have completed submission of stage 1 XMLs, you will receive a unique study accession number (EGASXXXXXXXXXXX) for use in future publications.

To use the accession number in a publication, we suggest the following format:

"Genotype data has been deposited at the European Genome-phenome Archive (EGA, http://www.ebi.ac.uk/ega/), which is hosted by the EBI, under accession number EGASXXXXXXXXXXX."

Preparing your XMLs

For further information, templates and examples please see the online EGA submission manual.

What happens after the key submission stages have been completed?

Upon the completion of a dataset, your website is prepared, which will point to your study, dataset and Data Access Committee.

Once your draft website is completed, a member of the EGA will be in touch before your website goes live to ensure:

• Your study is represented accurately
• Access to EGA user management tools is provided to the Data Access Committee named contacts to enable EGA accounts to be created and managed to access the data submitted.

Further information regarding the role of the Data Access Committee can be found here.

Finally, your data is archived within our databases and prepared for encrypted distribution upon the request of permitted EGA account holders.

We strongly advise you NOT to delete your data until we confirm that your data has been successfully archived.
