January

Monday, January 1

I rose early this morning and spent an hour cleaning up my late 2018 work journal.

Wednesday, January 3

We stopped by work this afternoon so that I could pick up my most recent SF50 and other documentation required for my unemployment application. I had received a box containing 199 specimens in 88 vials.

Monday, January 28 To do:

Time sheet. AKES meeting arrangements. Associate travel card with Concur account. Refuge Notebook catch-up. Biology News entries, new literature to add to Bio publications bibliography page. Biota of Canada post to akentsoc.org. Finish late 2018 work journal. Look over bristletails received.

I worked on adding images and scripts to my 2018 work journal, trying to finalize it. John asked me to post the recent Refuge Notebook articles from December, so I formatted and posted these. I also made posts to Biology News.

Tuesday, January 29 To do:

AKES meeting arrangements. STDP Final Report to Liz. Figure out AKES meeting topic and get going on presentation. Restart AWCC slurm job that was canceled on 26.Dec.2018. Take care of vehicle. New literature to add to Bio publications bibliography page and to our literature database. Biota of Canada post to akentsoc.org. Finish late 2018 work journal. Arrange for return shipping of Betula specimens. Look over bristletails received.

Debbie helped me make travel arrangements for Fairbanks, so this is a go. Now I need to figure out my talk since I was not able to work on the alpine defoliation project in January as I had intended.

Liz Graham wrote, asking that I fill in the STDP report review. I started looking through that dataset for notable records. Pissodes fiskei is not in our Alaska checklist. I checked that sequence again. It is closest to a sequence of Pissodes costatus, which we have in Alaska. There were no other new records. I need to get this into a publication of some sort. I also need to start an Alaska library on github.

I worked on re-submitting AWCC/Caribou Hills data for the ITSx step (see script).

Examining specimen with barcode label UAM100185885 (KNWR:Ento:11301), from Bay View Cemetery. This is a female Petrobiinae, with sole-shaped lateral ocelli. There appear to be 2+2 eversible vesicles on segment VI, but it is a little hard to tell because only the outer pair are ever everted. This looks like Pedetontus s. str.

KNWR:Ento:11301, label. KNWR:Ento:11301, face.

Wednesday, January 30 To do:

Figure out AKES meeting topic and get going on presentation. New literature to add to Bio publications bibliography page and to our literature database. Biota of Canada post to akentsoc.org Finish late 2018 work journal. Arrange for return shipping of Betula specimens. Look over bristletails received.

I sorted the remaining 67 or so vials of bristletail specimens given to me by Rod Crawford, not opening the vials, but just roughly sorting them based on what they appeared to be. There appear to be fewer vials than the 88 listed on the loan, but this is ok. I was hoping to see some Petridiobius specimens, but there were none. There was more Pedetontus s. str. than anything, followed by Pedetontus cf. submutans and Machilinus. Most of these specimens had been sifted from moss or litter. Sorted vials of bristletails from Rod Crawford.

I finished adding material to my late 2018 work journal including scans, references, etc.

Needing to figure out what to present on next week at the AKES meeting, I resumed the AKES Newsletter article on soil fungi affected by earthworms at Stormy Lake with the intention of adding effects on soil fungi to the earthworm presentation I gave in the fall. See the R script. Biplot of a PCA of Stormy Lake soil fungi occurrence data. Note that all nightcrawler-infested sites (1-3) are to the left of the nightcrawler-free sites (4-6). Frequencies of occurrence of soil fungal OTUs.

Thursday, January 31

To do:

Pay Arctos invoices. Work up Stormy Lake soil fungi data for AKES Newsletter and presentation. New literature to add to Bio publications bibliography page and to our literature database. Biota of Canada post to akentsoc.org Arrange for return shipping of Betula specimens.

I paid the Arctos invoices. I helped Todd with tagging eight sea otter pelts.

Before the winter is over I want to ski down to the trail intersection near Slikok Lake where I collected specimen MOBIL6660-18 last year. This was a Diptera larva collected from a spring in winter. There is no close match for the COI sequence obtained from this specimen on BOLD; the closest match is 89.92% similarity and is identified as Diptera. Other matches over 89% are Chironomidae. I would like to collect more specimens to try to put a genus name on these using morphology.

Chironomid larva specimen MOBIL6660-18.

I requested that records from dataset DS-BOWSER be included the next time that the BOLD BIN algorithm (Ratnasingham and Hebert, 2013) is run. This has not been done for some time. For example, the chironomid specimen referred to above collected in March 2018 has not yet been assigned to a BIN despite having a clean, full-length sequence.

I did look at results of the PIPITS analysis of 2017 Stormy Lake fungal data (see script).

I also skied down to the spring where I had collected that mystery chironomid before and collected more sediment, both from the spring and from the stream where it crosses the snowmachine trail.

Sediment sample taken from spring. Sediment sample taken from stream below spring where it crosses the trail.

February

Friday, February 1 To do:

Post this week's Refuge Notebook article. Look at samples collected yesterday. Work up Stormy Lake soil fungi data for AKES Newsletter and presentation. New literature to add to Bio publications bibliography page and to our literature database. Biota of Canada post to akentsoc.org Arrange for return shipping of Betula specimens.

I looked through the sediment sample from the stream collected yesterday. There were many ostracods, a few tiny worms, two caddisfly larvae, and three Diptera larvae, each different from the other.

I posted this week's Refuge Notebook article, starting a new volume for the year.

I updated the KNWR Biology publications bibliography and posted new literature announcements to http://www.akentsoc.org/.

I resumed work on the Stormy Lake fungal data (see script).

I am excited to have just learned about FUNGuild (Nguyen et al., 2016). After some formatting fixes (removing spaces, etc.) I was able to submit my OTU table to FUNGuild at http://www.stbates.org/guilds/app.php.
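As a rough illustration of that formatting step (hypothetical file and column names, not the exact script used), removing spaces from a taxonomy column in R before submitting the OTU table to FUNGuild might look like this:

otu <- read.delim("otu_table.txt", stringsAsFactors=FALSE, check.names=FALSE)  ## hypothetical file name
otu$taxonomy <- gsub(" ", "", otu$taxonomy)   ## strip spaces; the "taxonomy" column name is an assumption
write.table(otu, "otu_table_funguild.txt", sep="\t", quote=FALSE, row.names=FALSE)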

Guilds v1.0 Beta report:
- 386 assignments were made on OTUs within the input file!
- Total calculating time = 8.8 seconds!

I looked through these results in R (see script). Relative abundances of guilds of fungi. Sites 1-3 were in the Lumbricus infestation; sites 4-6 were outside in otherwise similar woods. Comparison of relative abundances of guilds of fungi based on reads summed over infested and not infested sites.

There were clearly proportionately more mycorrhizal fungi in Lumbricus-free plots than plots in the infestation.

Monday, February 4 To do:

Get going on AKES presentation. Arrange for return shipping of Betula specimens.

I started work on revising my worm presentation. I am trying to determine which Eisenia species are present.

Specimen/lot UAM:Ento:378050 is now identified as "Eisenia andrei and Amynthas," identified by Adrian Wackett by morphology. The specimen I collected from my compost pile recently (MOBIL8994-18, now also KNWR:Inv:35) was identified inconclusively as just Eisenia by BOLD's ID Engine. I read in Römbke et al. (2016) that these two species are separable by DNA barcodes. I submitted the sequence from this specimen to NCBI BLAST and looked only at results that were in the Appendix of Römbke et al. (2016). This placed my specimen in the Eisenia andrei clade.

Eisenia andrei specimen KNWR:Inv:35, live, 25.Nov.2018.

I also updated the identification of specimen KNWR:Ento:6756 to E. andrei because the worms in my compost pile had come from the population at my parents' house. I later made this new identification unaccepted because vermicomposting cultures may contain both species (Domínguez, 2018). I need to look for E. fetida in these populations. I worked on revising the presentation on worms I gave in November, updating it with new information and modifying it for the upcoming AKES meeting.

I skied east across Headquarters Lake, the small lake to the east, and onto the PSDRA trail system. These trails had not been groomed. Upon a little searching on the internet, it looks like this organization is no longer active. Its website is gone and I saw no activities cited past 2014.

Wednesday, February 6

To do:

Finish AKES presentation. Finish travel-related arrangements. Get specimens together to take to Fairbanks. Fix akentsoc.org links and post meeting agenda. Arrange for return shipping of Betula specimens.

My SLURM job on Yeti was canceled, again after six days. I do not know why. I will have to look into this later. I need to get my presentation done today.

In looking through literature for my talk, I learned that there is another invasive species very similar to Lumbricus terrestris that I need to watch out for, Lumbricus friendi (see Csuzdi and Szlávecz, 2003).

I attended the all-employee meeting with the regional director and deputy director in the middle of the day. I worked on my AKES presentation on worms.

Looking up more about Lumbricus friendi. This species is not included in the key of Gates and Reynolds (2017). There are currently only two COI sequences from this species in BOLD, both of them from Europe.

I examined specimens KNWR:Ento:8612 and KNWR:Ento:7096. Both appear to have canoe-shaped tubercula pubertatis, so they appear to be Lumbricus terrestris.

Thursday, February 7 To do:

Finish AKES presentation. Finish travel-related arrangements. AKES accounts/passwords

I want to find a good resource for discriminating between Lumbricus terrestris and Lumbricus friendi. I found a comparison from The Earthworm Society of Britain's ESB Earthworm Identikit at https://www.earthwormsoc.org.uk/fullscreen/earthwormkey. Comparison of Lumbricus terrestris and Lumbricus friendi from The Earthworm Society of Britain's ESB Earthworm Identikit.

I requested Sherlock (2012) through ARLIS ILL.

I spent most of the morning finishing my presentation for Saturday.

Friday, February 8

Much of my day was spent in travel to Fairbanks. After Derek brought me to the museum I walked around the museum vicinity looking for willow rosette galls, but I found none. I noted that the birches (Betula neoalaskana) still retained much if not most of their seeds and these had been falling very recently on top of the snow.

I attended the Alaska Entomological Society evening social at Derek's house. There I was happy to meet Jessica Rykken and Chris Fettig.

Saturday, February 9

This was a busy day at the annual meeting of the Alaska Entomological Society. I presented on earthworms.

Jessica Rykken presents at the 2019 annual meeting of the Alaska Entomological Society.

Monday, February 11

To do:

Uniform order. Travel voucher. AKES presentation Biology News post. Post last week's Refuge Notebook article. akentsoc.org updates. Solicit AKES Newsletter articles. Get permission to post AKES meeting presentations. E-mail regarding student presentation award. Resume AWCC analysis. Examine Lumbricus specimens. Finish article on Stormy Lake fungi. Process STDP data using improved pipeline. Format skunk moth article for AKES.

I made some small updates to the akentsoc.org website and made a post about my presentation to the Biology News page of the Refuge's website.

I edited, formatted, and posted last week's Refuge Notebook article.

I resumed the AWCC soil fungi analysis. See the command line stuff and SLURM script.

Examining KNWR:Inv:20, Lumbricus terrestris from Seward. Clitellum on 32-37. This is Lumbricus terrestris.

KNWR:Inv:33, from Rainbow Lake; KNWR:Inv:36 (just entered), from Homer; KNWR:Ento:7060, from Canoe Lake; KNWR:Ento:7206, from Fish Lake; KNWR:Ento:7096, from Merganser Lake; and KNWR:Ento:8954, from Cooper Landing are also Lumbricus terrestris. Now I have examined at least representative individuals from all localities from which I currently have specimens. All are Lumbricus terrestris; none are Lumbricus friendi.

Tuesday, February 12 To do:

Travel voucher. Time. Request permission from presenters to post AKES meeting presentations. Continue AWCC analysis. Identify "snow worms" from Homer. Finish article on Stormy Lake fungi. Process STDP arthropod data using improved pipeline. Format skunk moth article for AKES Newsletter. Format tick announcement for AKES Newsletter. Get worm protocol to Jess.

I submitted 100K more reads to be processed by pipits_funits on Yeti. I also did some preparation for processing more, but I am waiting to receive results from yesterday's job before proceeding. See the command line script, a SLURM script, and an example of a series of SLURM scripts submitted.

Examining worms brought to me from Homer. These look like Dendrobaena octaedra. Clitellum on 29-33. These are D. octaedra.

I circumnavigated the upland island east of Headquarters Lake, skiing around it in the wetlands. It was splendid. At a spring I collected a sample of bottom sand and muck. In the lab I found some small crustaceans in this which look like Harpacticoida. Harpacticoida specimen in filamentous algae from bottom of spring southeast of Headquarters Lake.

Wednesday, February 13

To do:

Continue AWCC analysis. Finish article on Stormy Lake fungi. Process STDP arthropod data using improved pipeline. Format skunk moth article for AKES Newsletter. Format tick announcement for AKES Newsletter. Get worm protocol to Jess.

I entered data for that copepod from yesterday (KNWR:Inv:37) and a Lumbricus terrestris specimen which was not in Arctos for some reason (KNWR:Ento:7207).

I resumed work on my Stormy Lake soil fungi article, making a map. See the R script.

Map of soil sampling locations for the AKES Newsletter article.

My AWCC soil fungi ITSx SLURM jobs finished just before lunch. I ran the rest of the PIPITS steps. See command line input and the SLURM script. The results did not look good, with way too many Cercozoa reads.

Thursday, February 14 To do:

New Biology News entries. AKES student presentation award. Format this week's Refuge Notebook. Continue AWCC analysis. Finish article on Stormy Lake fungi. Process STDP arthropod data using improved pipeline. Format skunk moth article for AKES Newsletter. Format tick announcement for AKES Newsletter. Format meeting article for AKES Newsletter. Get worm protocol to Jess. Respond to e-mail about LaTeX labels.

I posted Biology News entries.

Now I really want to know what happened with that AWCC soil fungi analysis. I was thinking about it last night. I did some looking at output on Yeti (see notes/script.) I might need to try split_libraries.py of QIIME (see Gweon et al., 2015). Looking at the documentation, I think I need to use split_libraries_fastq.py (http://qiime.org/scripts/split_libraries_fastq.html).

From http://qiime.org/tutorials/processing_illumina_data.html:

QIIME can be used to process single-end or paired-end read data from the Illumina platform. The primary script for merging paired-end read data in QIIME is join_paired_ends.py. See the script documentation for more details. This is typically applied as a pre-processing step before running split_libraries_fastq.py.

I worked with QIIME for a while, but it seems the kind of input files I have from MrDNA are difficult to work with.

I downloaded quality-filtered, demultiplexed reads I had made earlier on Galaxy (#15) and used an R script to reformat them to what pipits_funits expects.

I ran another SLURM script. We will see what happens.

Friday, February 15

To do:

Format this week's Refuge Notebook. Continue AWCC analysis. Biology News entries for Dawn's new remote sense article and Voices of the Kenai. Finish article on Stormy Lake fungi. Process STDP arthropod data using improved pipeline. Format skunk moth article for AKES Newsletter. Format tick announcement for AKES Newsletter. Format meeting article for AKES Newsletter. Get worm protocol to Jess. Respond to e-mail about LaTeX insect labels.

I canceled the SLURM job begun at the end of the day yesterday. It was still running, but it would take days. I used an R script to format one fasta file per sample for pipits_funits. I then wrote a series of SLURM scripts to process these data (see example). I submitted the 12 jobs in parallel (see below).

sbatch 2019-02-15-0806_ITSx_AWCC1.slurm
sbatch 2019-02-15-0806_ITSx_AWCC2.slurm
sbatch 2019-02-15-0806_ITSx_AWCC3.slurm
sbatch 2019-02-15-0806_ITSx_AWCC4.slurm
sbatch 2019-02-15-0806_ITSx_AWCC5.slurm
sbatch 2019-02-15-0806_ITSx_AWCC6.slurm
sbatch 2019-02-15-0806_ITSx_AWCC7.slurm
sbatch 2019-02-15-0806_ITSx_AWCC8.slurm
sbatch 2019-02-15-0806_ITSx_CaribouHills1.slurm
sbatch 2019-02-15-0806_ITSx_CaribouHills2.slurm
sbatch 2019-02-15-0806_ITSx_CaribouHills3.slurm
sbatch 2019-02-15-0806_ITSx_CaribouHills4.slurm

I posted this week's Refuge Notebook article.

I formatted a draft announcements article for the AKES Newsletter.

I posted three Biology News entries and added Dawn's new article to our literature database and to our on-line Publications Bibliography page.

I started working on the skunk moth article for AKES Newsletter, but the author notified me that he is making some changes, so I will wait on this.

I formatted the meeting article for AKES Newsletter and posted presentations to akentsoc.org so that they can be linked to in this article.

I started on work using the vegan package to look at the Stormy Lake fungi data. See the R script.

PCA plot of Stormy Lake soil fungi data with Lumbricus terrestris (Lt) as an environmental variable.

Monday, February 18

I checked on Yeti on my AWCC analysis. The ITSx step had failed because of an error in the re-inflation step. I looked at the files and could not see what was wrong. I may really need to somehow construct input files of the format PIPITS expects: separate R1 and R2 fasta files for each sample.

Might try Bayexer (https://github.com/HaisiYi/Bayexer). I tried to use this, but I ran into problems. See script.

Tuesday, February 19 To do:

Write Refuge Notebook article on earthworms. Continue AWCC analysis. Finish article on Stormy Lake fungi. Process STDP arthropod data using improved pipeline. Get worm protocol to Jess.

I came in late today due to family appointments in the morning.

Going back to Galaxy, I tried splitting the original R1 FASTQ file using the Barcode Splitter tool. That worked, creating datasets 188 (data) and 189 (summary, below).

# Barcode        Count
AWCC1            42351
AWCC2            45738
AWCC3            46639
AWCC4            52856
AWCC5            33779
AWCC6            25379
AWCC7            55761
AWCC8            64013
CaribouHills1B   37214
CaribouHills2B   50063
CaribouHills3B   62257
CaribouHills4B   60989
unmatched       777148
total          1354187

Need to trim barcodes off of these. Should I trim the primer region also? I did some looking and reading and found that yes, I should trim this off, so I will do so. This should be trimming 8 bp for the barcode and 18 bp for the primer, so 26 bp. I did so using Trimmomatic on Galaxy (collection 203).
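For illustration, a minimal R sketch of the same 26 bp head trim using ShortRead (hypothetical file names; the actual trimming was done with Trimmomatic on Galaxy as noted above):

library(ShortRead)
r1 <- readFastq("AWCC1_R1.fastq")      ## hypothetical demultiplexed sample file
r1_trimmed <- narrow(r1, start=27)     ## drop the first 26 bases (8 bp barcode + 18 bp primer) and their qualities
writeFastq(r1_trimmed, "AWCC1_R1_trimmed.fastq", compress=FALSE)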

Todd invited me to go with him to investigate a congregation of eagles in Slikok Creek near the Sterling Highway and Arc Loop. I could not turn this down. Colin came also.

We looked in Slikok Creek off of Arc Loop Road and off of the Sterling Highway where eagles were congregating. At Arc Loop we saw a dipper working in the culvert (ebird checklist: S52946889). We did see part of an old moose carcass, but most of the eagle activity was not centered around this. We had wondered if there was a run of fish or something moving in the stream, but we saw none. A few scoops with a net along vegetation yielded a sculpin and a nine-spined stickleback (iNaturalist observation: 20498842), but that was all. It appeared that this was just a big bird bath for eagles. There were many eagles drying their wings and we did see one bird just standing in the water. Bald eagles at Slikok Creek near the Sterling Highway (iNaturalist observation: 20498942).

John had asked me late in the day on Friday to write a Refuge Notebook article on worms, due tomorrow, so I need to get this done. I did get started on it.

Wednesday, February 20 To do:

Set up phone voice mail. Finish Refuge Notebook article on earthworms. Continue AWCC analysis.

Regarding the AWCC fungi, do I need to trim anything off of the tails? I checked e-mail correspondence with Dr. Dowd at MrDNA lab. We had planned on using the primers below.

illITS3kyo2 GATGAAGAACGYAGYRAA
illITS4kyo3 CTBTTVCCKCTTCACTCG

illITS3kyo2 is the forward primer. The first two reads in the reverse FASTA file end in CGAGTGAAGCGGCAACAG. This looks like the reverse complement of illITS4kyo3, but I do not know how the degenerate (?) nucleotides K, V, and Y work. Anyway, I should trim this tail off. I used FASTQ Trimmer on Galaxy to trim the last 18 bp from the original R2 file (dataset 218). 1354187 fastq reads were processed.
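For what it is worth, those symbols are standard IUPAC ambiguity codes (Y = C/T, K = G/T, B = C/G/T, V = A/C/G), and the observed tail can be checked against the primer in R. A quick sketch (an aside, not part of the workflow above):

library(Biostrings)
rc_primer <- reverseComplement(DNAString("CTBTTVCCKCTTCACTCG"))        ## illITS4kyo3
countPattern(rc_primer, DNAString("CGAGTGAAGCGGCAACAG"), fixed=FALSE)  ## 1 = the tail matches, treating ambiguity codes as wildcards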

I continued writing the Refuge Notebook article, finishing a draft in the afternoon and getting it to John.

I generated R2 fastq files using an R script. I uploaded these to Yeti. My first SLURM script failed. vsearch gave the error

Fatal error: Invalid line 3 in FASTQ file: '+' line must be empty or identical to header

I thought that "+" was a normal line 3 for FASTQ format.

Thursday, February 21 To do:

Set up phone voice mail. Continue AWCC analysis. Format tomorrow's Refuge Notebook article.

Troubleshooting yesterday's AWCC work. I compared the R1 and R2 FASTQ files. I found two problems with the R2 files generated by R. First, the R2 file had the Windows carriage returns (CRLF) instead of UNIX (LF). Second, all of the quality scores were changed to ";" so that all quality scores were lost.

I installed the ShortRead package and used an R script to make the R2 FASTQ files. I transferred this to Yeti and then converted the newline characters to UNIX.

dos2unix AWCC1_R2.fastq
...
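A minimal sketch of that approach (hypothetical file names, not the exact script used): read both files with ShortRead, keep the R2 mates whose IDs occur in a demultiplexed R1 sample, and write them with writeFastq(), which preserves the quality strings and writes UNIX line endings.

library(ShortRead)
r1 <- readFastq("AWCC1_R1.fastq")                      ## demultiplexed sample, hypothetical name
r2 <- readFastq("all_R2.fastq")                        ## full R2 file, hypothetical name
ids <- sub(" .*$", "", as.character(id(r1)))           ## read names without the comment field
keep <- sub(" .*$", "", as.character(id(r2))) %in% ids
writeFastq(r2[keep], "AWCC1_R2.fastq", compress=FALSE)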

I re-ran that SLURM script from yesterday. For some reason the pispino_createreadpairslist step worked but the next pispino step did not. It worked over the command line, though, yielding 339K reads.

I tried running the next steps of this analysis a couple of ways.

I entered data for Nicoletiidae specimens KNWR:Ento:11302-KNWR:Ento:11304 and looked up a little literature on this group. I examined specimen KNWR:Ento:11302 (two males and one female). They look like Grassiella as illustrated by Escherich (1905) and not like Allograssiella as described by Mendes and Schmid (2010).

Friday, February 22

To do:

Fill out University of Alaska Press author questionnaire for Drivers of Landscape Change in the Northwest Boreal Region book. Format and post today's Refuge Notebook article. Set up phone voice mail. Continue AWCC analysis. Format skunk moth article. Finish Stormy Lake fungi article.

That last SLURM script from yesterday was successful.

I tried to rerun the second script, but it failed very quickly.

I found the problem. I had cut up that original prepped.fasta file wrongly, making the second file start with a read and not the label. I fixed this problem and re-ran this. See the R script, shell script for splitting the original fasta, example SLURM script, and shell script to run all of the SLURM scripts.

I formatted and posted this week's Refuge Notebook article.

I edited and formatted the Polix article for the AKES Newsletter.

All of those SLURM jobs except the first one appeared to complete successfully. I think with that first one I had neglected to remove the out_funits_001 directory or something like that. I restarted this.

Monday, February 25

To do:

Continue AWCC analysis. Quick fix in skunk moth article. Finish Stormy Lake fungi article. Post AKES meeting presentations.

That last SLURM pipits_funits job appears to have run correctly.

I worked on coming up with a Refuge boundary map for the purposes of checklisting, removing all of the conveyed lands.

Simple Kenai National Wildlife Refuge boundaries map extracted for the purpose of checklisting.

I converted this to WKT format using https://mygeodata.cloud with the intent of supplying this over the URL to GBIF for Refuge-specific searches, but this did not work. I think that the URL was much too long. For the purpose of checklisting I will need to pull data off of GBIF just using the extent, then clip out the records from the Refuge.
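A sketch of that extent-then-clip idea using rgbif and sf (hypothetical file and object names, and only one of several ways to do this):

library(rgbif)
library(sf)
boundary <- st_transform(st_read("refuge_boundary.shp"), 4326)   ## hypothetical boundary file
wkt_bbox <- st_as_text(st_as_sfc(st_bbox(boundary)))             ## short WKT for the extent only
res <- occ_data(geometry=wkt_bbox, limit=10000)                  ## GBIF records within the extent
occ <- res$data[!is.na(res$data$decimalLongitude) & !is.na(res$data$decimalLatitude),]
pts <- st_as_sf(occ, coords=c("decimalLongitude", "decimalLatitude"), crs=4326)
refuge_recs <- pts[lengths(st_intersects(pts, boundary)) > 0,]   ## clip to the Refuge polygon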

Wow, the script I had started on Thursday running the whole thing through pipits_funits worked! See output. I started another SLURM script to run the pipits_process step.

I left work to take care of . At home I found that that last pipits_process step had worked (see output). I then ran the vsearch step required by LULU via a SLURM script.

In the evening I ran LULU, used FUNGuild, and looked at the results. See the R script and other stuff.

A comparison of soil fungal communities categorized by guilds at the Alaska Wildlife Conservation Center inside the bison pens, outside the pens, and in the Caribou Hills.

Hours today: 10:15-12:30, 16:30-17:45, 20:00-22:00, Σ = 5:00 hrs.

Tuesday, February 26

To do:

Continue AWCC analysis. Quick fix in skunk moth article. Revise Refuge Notebook article. Finish Stormy Lake fungi article. Post AKES meeting presentations. Backups.

John asked for a comparison of diversity among the AWCC/Caribou Hills soil fungi. I did so (see R script). Numbers of soil fungi OTUs detected inside the AWCC pens, outside the AWCC pens, and in the Caribou Hills.
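One simple way such a richness comparison can be made is with vegan's specnumber(), which counts non-zero OTUs per sample. A toy sketch (made-up data and a hypothetical grouping, not the real samples):

library(vegan)
set.seed(1)
otu_tab <- matrix(rpois(12*50, 1), nrow=12)       ## 12 samples x 50 OTUs, toy counts
grp <- rep(c("inside pens", "outside pens", "Caribou Hills"), each=4)   ## hypothetical grouping
richness <- specnumber(otu_tab)                   ## OTUs detected in each sample
tapply(richness, grp, mean)                       ## mean OTU richness per group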

I made the small change requested for the AKES Newsletter skunk moth article.

I worked on the introduction of my Lumbricus soil fungi article for the AKES Newsletter.

I received an e-mail from Kyungsoo giving densities of up to over 30 g/m2 of ash-free dry biomass of earthworms at Stormy Lake for the site closest to the boat launch. I wanted to convert these biomass numbers to something convenient for the typical newspaper reader. Below is my back-of-the-envelope conversion. Ash-free dry biomass of earthworms is roughly 15-22% of their live biomass (see https://www.researchgate.net/post/Earthworm_biomass_relation_between_fresh_mass_and_dry_mass), with 15-18% apparently accepted. I went with 17%. Using this to convert 25 g/m2 AFD biomass, a value from the middle of the range at the site closest to the boat launch:

## conversion factor from ash free dry biomass to fresh biomass is roughly
cf <- 1/0.17 *      ## ash free dry to fresh weight.
  0.00220462 *      ## lbs./g
  4046.86           ## m2/acre
cf
[1] 52.48111

## For Stormy Lake site 1:
25*cf
[1] 1312.028  ## lbs. live earthworms/acre

Hours: 09:30-12:30, 16:30-18:00, 20:15-22:15, 23:00-00:30, Σ = 8 hrs.

I revised and submitted to John my Refuge Notebook article.

I worked on my Stormy Lake soil fungi article, mostly on the introduction.

Wednesday, February 27

I made some quick calculations based on data from Saltmarsh et al. (2016).

## Occupancy estimate at remote sites is roughly 0.73.

## Average biomass was 0.36 g/m2
cf <- 1/0.17 *      ## ash free dry to fresh weight.
  0.00220462 *      ## lbs./g
  4046.86           ## m2/acre

## For the Refuge, where any earthworms occur:
0.36*cf
[1] 18.8932

## And then averaged over the Refuge, accounting for occupancy.
0.36*cf*.73
[1] 13.79204

## For remote sites
0.193*cf
[1] 10.12885

## Highest density (L. terrestris at boat launch)
5.651*cf

## Highest density, at site F07
(0.055928818 + 0 + 0.804069599 + 5.651124123)*cf
[1] 341.7109
(0.055928818 + 0 + 0.804069599 + 5.651124123)
[1] 6.511123

To do:

Time sheet. Get new lifescanner kits uploaded into Arctos and register kits. Continue AWCC analysis. Finish Refuge Notebook article. Finish Stormy Lake fungi article. Post AKES meeting presentations.

I uploaded life scanner vial barcodes to Arctos so that they are ready to use (R script). I tried registering lifescanner kit A3OP00 through the http://lifescanner.net/ interface. This looks much better than the iphone app in that coordinates, etc. can be manually entered. I will not register all of the kits to my account now so that these can be used by others.

I looked through the vials of Pedetontus s. str. from Rod Crawford, looking for newer vials for sequencing. Seven vials of these specimens had been collected in 2018. The abovementioned specimens are now KNWR:Ento:11305 and KNWR:Ento:11306. Pedetontus specimen KNWR:Ento:11306, habitus.

At home this afternoon I dug in my horse manure pile in search of Eisenia fetida, but all that I found looked like Eisenia andrei, lacking E. fetida's more conspicuous pale bands as figured by Domínguez (2018). Eisenia andrei in large horse manure pile, Old Kasilof Road.

I looked at some literature on herbivory by Lumbricus terrestris and resumed work on my AKES Newsletter article on soil fungi at Stormy Lake.

I continued with exploratory work on the Stormy Lake fungal data using the vegan package (see R script). Plot of CCA of Stormy Lake fungal data with Lumbricus presence as a factor.

Hours: 07:00-07:45, 09:15-13:15, 14:15-14:45, 20:45-22:00, 22:45-00:45, Σ = 8.5 hrs.

Friday, February 28 To do:

Respond to Todd about blackfish permit. Format Refuge Notebook article. Finish Stormy Lake fungi article. Post AKES meeting presentations.

I filled out a project description for applying for an ADF&G permit to collect blackfish this summer.

I worked on community analysis of Stormy Lake fungal data. I did some reading up on analysis types in McCune and Grace (2002). p. 102: It looks like either direct gradient analysis, where we know what the explanatory variables of interest are, or indirect gradient analysis, where we don't know ahead of time, would be appropriate. In my case I do know what my variable of interest is. CCA and canonical correlation are direct gradient analysis methods. p. 109: Vector fitting would be ok for what I am trying to do. p. 115: PCA would not be appropriate for my data, which are not linear or normal. At least I do not want to worry about these assumptions. p. 125: NMS would be appropriate, apparently the best. p. 154: Should not use CA. p. 160: Should not use DCA. p. 164: CCA would be ok with cautions.

I still am unsure whether vector fitting or constrained methods would be best. In the end I did both. I found that I had insufficient data (not enough sites?) to use NMS, but CCA seemed to work. See R script and a second R script, in the end making plots that were colorful if nothing else. Biplot of OTUs (circles) and sampling sites (labeled boxes) from a correspondence analysis where presence of Lumbricus terrestris was included as an environmental variable. Colors of OTU circles correspond to category colors from the pie charts I made on February 1. Red and blue lines indicate groupings of sites by earthworm presence. Biplot of OTUs and sampling sites from a constrained correspondence analysis where presence of Lumbricus terrestris was included as a constraint. See explanation in the caption of the figure above.
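For reference, a minimal vegan sketch of the two approaches compared here, a constrained CCA versus an unconstrained CA with the worm factor fitted afterward (toy data; object names are hypothetical):

library(vegan)
set.seed(1)
otus <- matrix(rpois(6*20, 5), nrow=6,
    dimnames=list(paste0("site", 1:6), paste0("OTU", 1:20)))   ## toy OTU table
worms <- factor(c(1,1,1,0,0,0), labels=c("absent", "present"))
cca1 <- cca(otus ~ worms)        ## constrained (direct gradient) ordination
plot(cca1)
ca1 <- cca(otus)                 ## unconstrained correspondence analysis...
envfit(ca1, data.frame(worms))   ## ...with Lumbricus presence fitted afterward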

I also made a new set of pie charts for the article. See R script.

I worked on the Stormy Lake fungi article, incorporating these figures.

Hours: 06:30-07:30, 09:45-12:45, 16:00-18:00, 20:00-22:00, Σ = 8 hr.

March

Friday, March 1 To do:

Format Refuge Notebook article. Finish Stormy Lake fungi article. Post AKES meeting presentations.

I formatted and posted today's Refuge Notebook article.

I worked on the Stormy Lake earthworm soil fungi article.

I just learned that there is a newer UNITE release (8.0). I should have used that. Oh, well, I am not starting over on the Stormy Lake fungal analysis at this point.

I worked on graphs (R script) and on examining some of the mycorrhizal OTUs (R script and notes).

Hours: 06:45-07:30, 10:00-12:00, 20:30-22:30

Sunday, March 3

I worked on trying to finish up the Stormy Lake earthworm soil fungi article.

Monday, March 4

To do: Finish Stormy Lake fungi article. See about uploading data to GenBank SRA, Zenodo, and/or PlutoF. Post AKES meeting presentations.

I made BioProject, BioSample, and read submissions to NCBI GenBank SRA for the Stormy Lake soil fungi data. → BioProject PRJNA525443

I started learning how to work with projects and samples in PlutoF, getting a project set up (https://plutof.ut.ee/#/study/view/74051), but I did not get to the point of uploading sequences.

I am trying to figure out the best way to get HTS occurrence data to GBIF. It appears that NCBI GenBank reads/occurrences do not get harvested by or linked to GBIF in a regular way. I know that PlutoF/UNITE reads can be harvested by GBIF. I did some tinkering and testing and got one test sequence uploaded via the importer. It will take a little bit of work to get the names formatted correctly.

Tuesday, March 5

Those GenBank SRA submissions were published and are now available at https://www.ncbi.nlm.nih.gov/sra/PRJNA525443.

I worked on reformatting OTU sequence data to upload to PlutoF. This ended up being a more difficult task than I had expected (see R script). I worked on the Stormy Lake fungi article, making a small change to the map (R script). I sent off a draft requesting reviews.

Wednesday, March 6 To do:

Check PlutoF uploads. Post AKES meeting presentations. Start blackfish diet article. See about ordering fungal sequencing kits. Take A100 and A312/325R classes.

Those PlutoF imports are still hung up, waiting to be processed.

I posted presentations from the AKES annual meeting.

I entered data for Dendrobaena octaedra specimen KNWR:Inv:38 and added this to the collection.

I started work on an Alaska blackfish diet article for the AKES Newsletter.

After some tweaking I got the first 60 records imported into PlutoF. See R script.

Thursday, March 7

I continued importing Stormy Lake sequence data to PlutoF, getting this done! (See R script.)

Colin and I drove out to Kenai to look for Alaska blackfish in the vicinity of the pond off of Candlelight Dr. where I had seen them earlier and in the stream behind Walmart. We did find open water in a few places, mainly at seeps, but we saw no blackfish.

I worked on entering data from blackfish gut contents specimens into Arctos in preparation for a short article on blackfish diet.

Friday, March 8

I started out responding to an e-mail inquiry about earthworms, finding out about the new Yukon record of Arctiostrotus fontinalis, and looking up and requesting pertinent literature.

To do:

Post today's Refuge Notebook article. Finish blackfish diet article. See about ordering fungal sequencing kits. Take A100 and A312/325R classes.

I formatted and posted this week's Refuge Notebook article.

I commented on a record of cutthroat trout from Woodpecker Lake on the Refuge (UAM:Fish:328). This seems to me to be a questionable record.

Another interesting record was for Coregonus laurettae, Bering cisco, at Gene Lake (UAM:Fish:2444). We do not have this species on our checklist.

Sunday, March 10

I worked on formatting a table of blackfish prey items.

Monday, March 11 To do:

Finish blackfish diet article. See about ordering fungal sequencing kits. Take A100 and A312/325R classes.

I worked on my blackfish diet article.

Tuesday, March 12

To do:

Credit card stuff. Finish blackfish diet article. See about ordering fungal sequencing kits. Take A100 and A312/325R classes. Start on STDP article.

I finished a draft of the blackfish diet article.

I started an article on the 2017 STDP metagenomic work in the journal Research Ideas and Outcomes. I uploaded all 64 raw FASTQ files to Yeti. I intend to begin an analysis using QIIME 2™. I wrote a manifest file using an R script.

Wednesday, March 13

To do:

Time. Safety committee meeting. Credit card stuff. Revise blackfish diet article. See about ordering fungal sequencing kits. Take A100 and A312/325R classes. STDP analysis. Review carabid table from Bergdahl.

I attended the safety meeting at 09:00. Action points:

MOCC refresher - I should take this this year.
New PPE policy - Record keeping, required PPE training. Refuge is responsible for providing PPE.
JHAs - Need to check bio JHAs. These should be done before April 5. Definitely need JHA for spraying.
Aviation - Will need to have all helmets inspected. Will likely need to destroy some helmets.
Need to update lab safety plan.

I revised my blackfish article. I wrote a short note on two new mayfly records from Alaska.

Thursday, March 14 To do:

Credit card stuff. See about ordering fungal sequencing kits. Take A100 and A312/325R classes. STDP analysis. Review carabid table.

I imported the STDP data into QIIME.
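The import step reads a manifest mapping sample IDs to FASTQ paths. A sketch of how the manifest written on March 12 might be generated from R, using the comma-separated PairedEndFastqManifestPhred33 layout (the sample IDs, paths, and file naming here are hypothetical):

samples <- c("EAFB22MAY17-E", "JBER24MAY17-E")   ## hypothetical subset of sample IDs
manifest <- data.frame(
    `sample-id` = rep(samples, each=2),
    `absolute-filepath` = paste0("/home/mattbowser/STDP/", rep(samples, each=2),
        c("_R1.fastq.gz", "_R2.fastq.gz")),      ## assumed file naming
    direction = rep(c("forward", "reverse"), times=length(samples)),
    check.names=FALSE
)
write.csv(manifest, "stdp_manifest.csv", row.names=FALSE, quote=FALSE)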

I did some backing up of data because our server is scheduled to go down soon. Back-up list:

Slikok project
Elodea work
LTEMP
AWCC/Caribou Hills stuff
Melvin thesis datasets

I started an analysis in QIIME. See notes and I/O.

QIIME summary of read counts per sample for STDP dataset.

Read counts per sample:

Sample name       Sequence count
EAFB07JUN17-E             175187
EAFB30JUN17-EA             27823
EAFB30JUN17-IT             27604
JNUF06JUL17-EA             26951
JNUF26MAY17-EA             26040
JBER06JUN17-E              24615
JBER20JUN17-R              24546
JBER11JUL17-IT             22823
EAFB22MAY17-E              22578
JBER06JUN17-IT             21891
JNUF02JUN17-EA             21553
JBER20JUN17-EA             21357
JBER06JUN17-EA             21281
JNUF20JUN17-R              20948
EAFB30JUN17-E              20928
JNUF06JUL17-E              20686
JNUF11AUG17-E              19748
JNUF26MAY17-E              19287
JBER20JUN17-E              18981
EAFB22MAY17-EA             18886
JNUF20JUN17-E              18876
JNUF02JUN17-E              18375
EAFB07JUN17-EA             18233
JBER24MAY17-E              17063
JBER24MAY17-IT             16134
EAFB07JUN17-IT             15180
JBER24MAY17-EA             14808
JBER10MAY17-R2             13334
JBER10MAY17-R1             13144
JNUF20JUL17-EA              9834

Why does one sample, EAFB07JUN17-E, have over 170,000 reads while most are closer to 20,000?

Forward read quality.

Reverse read quality.

I think that now I need to build a guild library, but I am not sure of the very best way to make this in a format that QIIME will like.

Friday, March 15 To do:

See about ordering fungal sequencing kits. Take A100 and A312/325R classes. STDP analysis. Review carabid table. Assemble AKES Newsletter draft.

I looked at methods of reference library creation of Nilsson et al. (2018), Richardson et al. (2018), and Pruesse et al. (2007). The work of Richardson et al. (2018) is most similar to what I need to do, but I think it would not be easy for me to set up Metaxa2. I need to figure out how to make a library that QIIME 2 can use for now, not necessarily a library as well-curated as these major databases.

I started downloading records from BOLD, which was taking a long time.

While downloading I worked on assembling all of the submitted articles for the AKES Newsletter into an issue.

I got started on building a library. See notes, scripts, and I/O.
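A rough sketch of one way to go from a BOLD download to the two inputs a QIIME 2 reference classifier needs, a FASTA of sequences and a tab-delimited taxonomy file (the input file and its column names are assumptions about the BOLD TSV export, not the actual files used):

bold <- read.delim("bold_alaska_download.tsv", stringsAsFactors=FALSE)     ## hypothetical file
bold <- bold[!is.na(bold$nucleotides) & nchar(bold$nucleotides) > 0,]      ## assumed column names throughout
fasta <- paste0(">", bold$processid, "\n", gsub("-", "", bold$nucleotides))
writeLines(fasta, "alaska_coi_refs.fasta")
tax <- paste(bold$phylum_name, bold$class_name, bold$order_name,
    bold$family_name, bold$genus_name, bold$species_name, sep=";")
write.table(data.frame(bold$processid, tax), "alaska_coi_taxonomy.tsv",
    sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)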

Monday, March 18 To do:

Send out AKES Newsletter draft for review. STDP analysis. Review carabid table. See about ordering fungal sequencing kits. Take A100 and A312/325R classes.

I did some editing of the AKES Newsletter draft and sent it out to the editorial committee, authors, and others for review.

I saw that BOLD's BIN algorithm was run this weekend.

I resumed work on construction of an Alaska DNA barcode library for HTS. See notes and I/O.

I posted two Biology News entries requested by John.

Wednesday, March 20

To do:

Revise Lumbricus-fungi article based on comments received. Scan some of Dominique's artwork. STDP analysis. Review carabid table. See about ordering fungal sequencing kits. Take A100 and A312/325R classes. Format this week's Refuge Notebook article.

I found some problems with the library work I had done the other day. I resumed, dealing with these problems. See notes and I/O.

I scanned a couple of Dominique's illustrations to fulfill a request for someone writing an article about Dominique's work.

Watercolor illustration of the life history of a Mesopolobus by Dominique Collet.

I formatted this week's Refuge Notebook article.

Thursday, March 21

To do:

STDP analysis. Finish up AKES Newsletter. Review carabid table. See about ordering fungal sequencing kits. Take A100 and A312/325R classes.

I worked on the STDP analysis, finishing the library work and for the first time generating an OTU table in QIIME2. See scripts, I/O.

Friday, March 22 To do:

STDP analysis. Finish up AKES Newsletter. Post Refuge Notebook article.

I posted today's Refuge Notebook article.

I worked more on the library, re-running the dereplication and clustering steps. I also started selecting the best library records. See I/O, R script, and SLURM script.

References

Csuzdi, Csaba, and Katalin Szlávecz. 2003. "Lumbricus friendi Cognetti, 1904 a new exotic earthworm in North America." Northeastern Naturalist 10 (1): 77–83. https://doi.org/10.1656/1092-6194(2003)010[0077:LFCANE]2.0.CO;2
Domínguez, J. 2018. "Earthworms and vermicomposting." Ch. 5 in Ray, S. (Ed.), Earthworms. Rijeka: IntechOpen. https://doi.org/10.5772/intechopen.76088
Escherich, Karl. 1905. "Das System der Lepismatiden." Zoologica 43 (18). https://doi.org/10.5962/bhl.title.7909
Gates, Gordon Enoch, and John Warren Reynolds. 2017. "Preliminary key to North American megadriles (Annelida, Oligochaeta), based on external characters, insofar as possible." Megadrilogica 22 (10).
Gweon, Hyun S., Anna Oliver, Joanne Taylor, Tim Booth, Melanie Gibbs, Daniel S. Read, Robert I. Griffiths, and Karsten Schonrogge. 2015. "PIPITS: An automated pipeline for analyses of fungal internal transcribed spacer sequences from the Illumina sequencing platform." Methods in Ecology and Evolution 6 (8): 973–80. https://doi.org/10.1111/2041-210X.12399
McCune, Bruce, and James B. Grace. 2002. Analysis of Ecological Communities. Gleneden Beach, Oregon: MjM Software Design.
Mendes, Luis F., and Volker S. Schmid. 2010. "Description of Allograssiella floridana gen. nov., spec. nov. from the southern United States living with Pseudomyrmex ants." Spixiana 33: 49–54.
Nguyen, Nhu H., Zewei Song, Scott T. Bates, Sara Branco, Leho Tedersoo, Jon Menke, Jonathan S. Schilling, and Peter G. Kennedy. 2016. "FUNGuild: An open annotation tool for parsing fungal community datasets by ecological guild." Fungal Ecology 20: 241–48. https://doi.org/10.1016/j.funeco.2015.06.006
Nilsson, R. H., F. O. Glöckner, I. Saar, L. Tedersoo, U. Kõljalg, K. Abarenkov, K.-H. Larsson, A. F. Taylor, J. Bengtsson-Palme, D. Schigel, T. S. Jeppesen, P. Kennedy, and K. Picard. 2018. "The UNITE database for molecular identification of fungi: handling dark taxa and parallel taxonomic classifications." Nucleic Acids Research 47: D259–D264. https://doi.org/10.1093/nar/gky1022
Pruesse, E., C. Quast, K. Knittel, B. M. Fuchs, W. Ludwig, J. Peplies, and F. O. Glöckner. 2007. "SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB." Nucleic Acids Research 35: 7188–7196. https://doi.org/10.1093/nar/gkm864
Ratnasingham, Sujeevan, and Paul D. N. Hebert. 2013. "A DNA-based registry for all species: The Barcode Index Number (BIN) system." PLOS ONE 8 (7): e66213. https://doi.org/10.1371/journal.pone.0066213
Richardson, R. T., J. Bengtsson-Palme, M. M. Gardiner, and R. M. Johnson. 2018. "A reference cytochrome c oxidase subunit I database curated for hierarchical classification of arthropod metabarcoding data." PeerJ 6: e5126. https://doi.org/10.7717/peerj.5126
Römbke, Jörg, Manuel Aira, Thierry Backeljau, Karin Breugelmans, Jorge Domínguez, Elisabeth Funke, Nadin Graf, et al. 2016. "DNA barcoding of earthworms (Eisenia fetida/andrei complex) from 28 ecotoxicological test laboratories." Applied Soil Ecology 104: 3–11. https://doi.org/10.1016/j.apsoil.2015.02.010
Saltmarsh, Deanna Marie, Matthew L. Bowser, John M. Morton, Shirley Lang, Daniel Shain, and Roman Dial. 2016. "Distribution and abundance of exotic earthworms within a boreal forest system in Southcentral Alaska." NeoBiota 28: 67–86. https://doi.org/10.3897/neobiota.28.5503
Sherlock, E., and Field Studies Council (Great Britain). 2012. Key to the Earthworms of the UK and Ireland. Occasional Publication. Field Studies Council.

Appendices

2019-01-29-1116_work_on_yeti.txt

## Switching between yeti and R figuring out how to continue the AWCC analysis in the background on yeti
## while I take care of other things.
cd /home/mattbowser/2018_AWCC_soil_fungi/out_seqprep
head prepped.fasta
wc -l prepped.fasta
1019826 prepped.fasta
## That is a little over 1e6 lines.

## That is 1019826/2 = 509,913 sequences.

## For my last analysis on ? there were 59,732 input sequences. From these there were 31,214 dereplicated sequences,
## 31214/59732 = 0.5225675 or 52%.

## This took ITSx
2018-12-17 15:09:06 Extracting ITS2 from sequences [ITSx]
2018-12-18 00:02:54 ... done
start <- as.POSIXlt("2018-12-17 15:09:06")
stop <- as.POSIXlt("2018-12-18 00:02:54")
stop - start
Time difference of 8.896667 hours
## So the rate was 31214/8.896667 = 3508.505 sequences per hour processed.

## Assuming the same rates, the AWCC data should have about 509913 * 0.5225675 = 266464 dereplicated sequences,
## which should take about 266464/3508.505 = 75.94802 hours or 75.94802/24 = 3.164501 days.
## My previous job was canceled after six days (6-10:56:32), so it was taking longer than expected.

## Wanting to split this up.

## from https://stackoverflow.com/questions/6424856/r-function-for-returning-all-factors

FUN <- function(x) {
    x <- as.integer(x)
    div <- seq_len(abs(x))
    factors <- div[x %% div == 0L]
    factors <- list(neg = -factors, pos = factors)
    return(factors)
}
ns <- 1019826
FUN(ns)
ns/9
[1] 113314

## Ok, will try splitting this into 9 files.
split -l 113314 prepped.fasta

## This yielded output files with the following names.
xaa xab xac xad xae xaf xag xah xai
fn <- c("xaa", "xab", "xac", "xad", "xae", "xaf", "xag", "xah", "xai")
cmds <- paste("mv ", fn, " ", fn, ".fasta", sep="")
print(cmds)

## After editing...
mv xaa xaa.fasta
mv xab xab.fasta
mv xac xac.fasta
mv xad xad.fasta
mv xae xae.fasta
mv xaf xaf.fasta
mv xag xag.fasta
mv xah xah.fasta
mv xai xai.fasta
cmd <- paste(
    "srun --mpi=pmi2 pipits_funits -i out_seqprep/", fn, ".fasta -o out_funits_", fn, " -x ITS2",
    sep=""
)
print(cmd)
## #edited output:
srun --mpi=pmi2 pipits_funits -i out_seqprep/xaa.fasta -o out_funits_xaa -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xab.fasta -o out_funits_xab -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xac.fasta -o out_funits_xac -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xad.fasta -o out_funits_xad -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xae.fasta -o out_funits_xae -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xaf.fasta -o out_funits_xaf -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xag.fasta -o out_funits_xag -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xah.fasta -o out_funits_xah -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xai.fasta -o out_funits_xai -x ITS2
cd /home/mattbowser/2018_AWCC_soil_fungi
vi 2019-01-29-1202_AWCCfungi.slurm

#!/bin/bash
#SBATCH --job-name=AWCCITSx
#SBATCH -n 9   # number of nodes
#SBATCH -n 9   # number of tasks
#SBATCH -p long   # parition
#SBATCH --account=bio   # account code
#SBATCH --time=4-01:00:00   # requested job time D-HH:MM:SS
#SBATCH --mail-type=ALL   # choose when you want to be emailed
#SBATCH [email protected]   # add your email address
#SBATCH -o 2019-01-29-1202_AWCCITSx-%j.out   # name of output file (the %j inserts the jobid)
module load python/miniconda3-gcc6.1.0   # load required modules
source activate pipits_env   # load PIPITS environment
cd /home/mattbowser/2018_AWCC_soil_fungi
srun --mpi=pmi2 pipits_funits -i out_seqprep/xaa.fasta -o out_funits_xaa -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xab.fasta -o out_funits_xab -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xac.fasta -o out_funits_xac -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xad.fasta -o out_funits_xad -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xae.fasta -o out_funits_xae -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xaf.fasta -o out_funits_xaf -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xag.fasta -o out_funits_xag -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xah.fasta -o out_funits_xah -x ITS2
srun --mpi=pmi2 pipits_funits -i out_seqprep/xai.fasta -o out_funits_xai -x ITS2
source deactivate   # deactivate PIPITS
module purge   # unload those modules

## Running it.
sbatch 2019-01-29-1202_AWCCfungi.slurm

## That failed right away. Judging from the out file, I think it again tried to run things in parallel.
## I might need to break this into separate slurm jobs.
sf <- paste("#!/bin/bash
#SBATCH --job-name=AWCCITSx
#SBATCH -n 1   # number of nodes
#SBATCH -n 3   # number of tasks
#SBATCH -p long   # parition
#SBATCH --account=bio   # account code
#SBATCH --time=4-01:00:00   # requested job time D-HH:MM:SS
#SBATCH --mail-type=ALL   # choose when you want to be emailed
#SBATCH [email protected]   # add your email address
#SBATCH -o 2019-01-29-1229_AWCCITSx-", fn, ".out
module load python/miniconda3-gcc6.1.0   # load required modules
source activate pipits_env   # load PIPITS environment
cd /home/mattbowser/2018_AWCC_soil_fungi
srun --mpi=pmi2 pipits_funits -i out_seqprep/", fn, ".fasta -o out_funits_", fn, " -x ITS2
source deactivate   # deactivate PIPITS
module purge   # unload those modules
", sep=""
)
wd <- "I:/BIOLOGY/Data/ProjectData/Grasslands/2018_grassland_work/work_space/2019-01-29_ITSx"
setwd(wd)
sn <- paste(fn, ".slurm", sep="")
for (thisf in 1:9) {
    write(sf[thisf], fil=sn[thisf])
}
sb <- paste("sbatch", sn)

## Edited:
sbatch xaa.slurm
sbatch xab.slurm
sbatch xac.slurm
sbatch xad.slurm
sbatch xae.slurm
sbatch xaf.slurm
sbatch xag.slurm
sbatch xah.slurm
sbatch xai.slurm

## Wait. I want to change the job names.
sf <- paste("#!/bin/bash
#SBATCH --job-name=", fn, "
#SBATCH -n 1   # number of nodes
#SBATCH -n 3   # number of tasks
#SBATCH -p long   # parition
#SBATCH --account=bio   # account code
#SBATCH --time=4-01:00:00   # requested job time D-HH:MM:SS
#SBATCH --mail-type=ALL   # choose when you want to be emailed
#SBATCH [email protected]   # add your email address
#SBATCH -o 2019-01-29-1229_AWCCITSx-", fn, ".out
module load python/miniconda3-gcc6.1.0   # load required modules
source activate pipits_env   # load PIPITS environment
cd /home/mattbowser/2018_AWCC_soil_fungi
srun --mpi=pmi2 pipits_funits -i out_seqprep/", fn, ".fasta -o out_funits_", fn, " -x ITS2
source deactivate   # deactivate PIPITS
module purge   # unload those modules
", sep=""
)
for (thisf in 1:9) {
    write(sf[thisf], file=sn[thisf])
}

## Ok, trying one now.
sbatch xaa.slurm
## Got error. File has DOS line breaks instead of unix line breaks.
for (thisf in 1:9) {
    of <- file(sn[thisf], "wb")
    write(sf[thisf], file=sn[thisf])
    close(of)
}

I had to manually change the end of line character in Notepad++ using Edit -> EOL conversion.
sbatch xaa.slurm

## Got lots of errors. Maybe this cannot be split up like this.

## Going to just re-run that original job for now.
sbatch 2018-12-20-1038_AWCCfungi.slurm

2019-01-30-1509_looking_at_results.R

## Looking at Stormy fungi data again, this time making sure to eliminate all non-fungi.
wd <- "I:/BIOLOGY/Data/ProjectData/2017_earthworm_soil_fungi_NGS/work_space/2018-12-17_PIPITS"
setwd(wd)
load("2018-12-21-1558_workspace.RData")

## OTUs per sample for infested
mean(dbs[1:3])
[1] 103
## And nightcrawler-free sites.
mean(dbs[4:6])
[1] 104.6667

## I want to modify the biplot some for the article.
library(ggbiplot)
pdf(file="2019-01-30-1536_stormy_biplot.pdf",
    width=4.5,
    height=4.5,
    pointsize=9
)
g <- ggbiplot(pca1, obs.scale = 1, var.scale = 1,
    labels=rownames(d2),
    groups = c(rep("nightcrawlers", 3), rep("no nightcrawlers", 3)),
    ellipse = TRUE,
    labels.size=5,
    varname.size=1.5
)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal', legend.position = 'top')
print(g)
dev.off()

## Remaking histogram for the article.
pdf(file="2019-01-30-1662_hist_freq.pdf",
    width=4,
    height=4
)
par(mar=c(4,4,1,1))
hist(ct4$freq,
    breaks=0:6/6,
    xlab="Number of detections",
    ylab="Number of OTUs",
    main="",
    xaxt="n",
    col="gray"
)
axis(side=1, at=1:6/6 - 0.5/6, labels=1:6)
dev.off()

## Resorting to look at most frequently observed OTUs.
ct4 <- ct4[order(-ct4$freq, -ct4$nreads),]
summary(pca1)
Importance of components:
                           PC1    PC2    PC3    PC4    PC5       PC6
Standard deviation     10.8783 9.1793 8.8745 8.0632 7.4586 2.038e-14
Proportion of Variance  0.2944 0.2096 0.1959 0.1617 0.1384 0.000e+00
Cumulative Proportion   0.2944 0.5040 0.6999 0.8616 1.0000 1.000e+00
pca1$x
            PC1        PC2        PC3        PC4        PC5           PC6
site1 -4.914343  0.5844918  -3.839917 15.5903530 -1.3504194  2.517436e-14
site2 -5.110602 -4.0457191 -10.797290 -7.3450695 -8.9553186  6.359063e-15
site3 -9.776014 -7.2800890  14.443269 -2.1070481 -0.8510169 -3.169009e-14
site4 -3.060605 -0.7910831  -5.807692 -3.7639991 13.8212994  3.054160e-14
site5 20.789200 -6.1599321   1.973286  0.6347356 -0.6891978 -1.492242e-15
site6  2.072364 17.6923314   4.028345 -3.0089719 -1.9753467 -2.819267e-14

2019-01-31-1116_looking_at_results.R

## Today I am looking at differences between infested and non-infested sites.
wd <- "I:/BIOLOGY/Data/ProjectData/2017_earthworm_soil_fungi_NGS/work_space/2018-12-17_PIPITS"
setwd(wd)
load("2018-12-21-1558_workspace.RData")

## Worm presence:
wrm <- c(1,1,1,0,0,0)
wrm <- as.matrix(wrm)

## Log + 1 transformed data:
obs <- log(ct4[,1:6] + 1)
obs <- t(obs)

## Looking at correlations.
oc <- cor(wrm, obs)
ct4$wrmcor <- as.vector(oc)
ct4 <- ct4[order(ct4$wrmcor, -ct4$nreads),]

## 10 records most positively correlated with nightcrawlers:
ct4[402:393,]
    X5348.2017MLB100.MSITS3 X5348.2017MLB101.MSITS3 X5348.2017MLB102.MSITS3
45                       26                      29                     189
56                       10                      50                     111
34                       16                      31                     307
21                      161                     194                     182
38                      139                      60                      91
30                       36                      10                     353
20                       26                      28                     513
5                       458                     166                     434
209                      16                       0                      18
79                       67                       0                      54
    X5348.2017MLB103.MSITS3 X5348.2017MLB105.MSITS3 X5348.2017MLB107.MSITS3
45                        0                       0                       0
56                        0                       0                       0
34                        0                       0                       0
21                       20                       0                       0
38                       13                       0                       0
30                        0                       0                       0
20                        0                      11                       0
5                       137                       0                       0
209                       0                       0                       0
79                        0                       0                       0
    nreads otu_id king phyl clas ord
45     244 OTU661 Fungi Venturiales
56     171 OTU14  Fungi Mortierellomycota Mortierellomycetes
34     354 OTU611 Fungi Ascomycota Helotiales
21     557 OTU613 Fungi Ascomycota Leotiomycetes Helotiales
38     303 OTU674 Fungi Ascomycota Leotiomycetes Helotiales
30     399 OTU330 Fungi Ascomycota Hypocreales
20     578 OTU542 Fungi Ascomycota Leotiomycetes Helotiales
5     1195 OTU471 Fungi Ascomycota Sordariomycetes Sordariales
209     34 OTU459 Fungi Ascomycota Dothideomycetes Pleosporales
79     121 OTU365 Fungi Ascomycota Sordariomycetes Hypocreales
    fam gen spec sim
45  Venturiaceae Venturia 0.86
56  unidentified unidentified Mortierellales_sp_SH213394.07FU 1.00
34  1.00
21  0.96
38  unidentified unidentified 0.91
30  Nectriaceae 0.86
20  unidentified unidentified Helotiales_sp_SH013008.07FU 0.95
5   Chaetomiaceae Humicola Humicola_sp_SH195345.07FU 0.98
209 Pleomassariaceae Tumularia Tumularia_sp_SH198695.07FU 1.00
79  Nectriaceae 0.97
    freq wrmcor
45  0.5000000 0.9529024
56  0.5000000 0.9378498
34  0.5000000 0.9157785
21  0.6666667 0.8991145
38  0.6666667 0.8943249
30  0.5000000 0.8894675
20  0.6666667 0.8059055
5   0.6666667 0.7767712
209 0.3333333 0.7069103
79  0.3333333 0.7067543

## 10 records most negatively correlated with nightcrawler presence:
ct4[1:10,]
    X5348.2017MLB100.MSITS3 X5348.2017MLB101.MSITS3 X5348.2017MLB102.MSITS3
130                       0                       0                       0
7                         0                       0                       0
106                       0                      13                       0
46                       14                      19                       0
194                       0                       0                       0
35                        0                       0                       0
82                        0                       0                       0
233                       0                       0                       0
189                       0                       0                       0
201                       0                       0                       0
    X5348.2017MLB103.MSITS3 X5348.2017MLB105.MSITS3 X5348.2017MLB107.MSITS3
130                      11                      37                      21
7                        10                     779                     328
106                      36                      24                      20
46                       20                      87                     103
194                       0                      24                      17
35                        0                     116                     215
82                        0                      42                      75
233                       0                      17                      11
189                      28                       0                      15
201                      13                      25                       0
    nreads otu_id king phyl clas ord
130     69 OTU552 Fungi Ascomycota Dothideomycetes Pleosporales
7     1117 OTU715 Fungi Ascomycota Dothideomycetes Venturiales
106     93 OTU349 Fungi Ascomycota Eurotiales
46     243 OTU525 Fungi Ascomycota Dothideomycetes Capnodiales
194     41 OTU194 Fungi Umbelopsidomycetes Umbelopsidales
35     331 OTU810 Fungi Ascomycota Leotiomycetes Helotiales
82     117 OTU10  Fungi Mortierellomycota Mortierellomycetes Mortierellales
233     28 OTU721 Fungi Ascomycota Leotiomycetes Helotiales
189     43 OTU418 Fungi Ascomycota Sordariomycetes
201     38 OTU527 Fungi Ascomycota Leotiomycetes Helotiales
    fam gen spec sim
130 Melanommataceae 0.85
7   Venturiaceae 0.99
106 1.00
46  0.96
194 Umbelopsidaceae Umbelopsis 1.00
35  Myxotrichaceae Oidiodendron Oidiodendron_pilicola_SH216991.07FU 0.91
82  unidentified unidentified Mortierellales_sp_SH026734.07FU 1.00
233 Sclerotiniaceae Mycopappus Mycopappus_alni_SH177350.07FU 0.98
189 0.89
201 0.92
    freq wrmcor
130 0.5000000 -0.9772985
7   0.5000000 -0.8852464
106 0.6666667 -0.8028416
46  0.8333333 -0.7058453
194 0.3333333 -0.7055784
35  0.3333333 -0.7051749
82  0.3333333 -0.7044942
233 0.3333333 -0.7041084
189 0.3333333 -0.7021832
201 0.3333333 -0.7013344

## It is interesting that members of Venturiaceae are both most negatively and most positively correlated with nightcrawler presence. These are plant pathogens.
## Mortierellales - positively associated with worm presence. These are mostly saprobes (see https://doi.org/10.3767/003158513X666268).

## Humicola spp. appear to be mainly decomposers (see https://www.sciencedirect.com/science/article/pii/S0166061618300319).

## Whoa, I found out how to look up UNITE SH (species hypothesis) entities much like BOLD BINs. That Humicola identified as Humicola_sp_SH195345.07FU can be looked up at https://unite.ut.ee/bl_forw_sh.php?sh_name=SH195345.07FU or http://dx.doi.org/10.15156/BIO/SH195345.07FU. There is a newer version of this SH: https://unite.ut.ee/bl_forw_sh.php?sh_name=SH1615609.08FU. This is identified as genus Chaetomium. Some of these are endophytes. Some are soil-dwelling, I think decomposers.
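## For reference, these links can be generated from the SH code itself; a minimal sketch, assuming the URL and DOI patterns shown above remain stable across UNITE versions.
unite_sh_links <- function(sh) {
  ## Build UNITE SH lookup links from an SH code (patterns assumed from the
  ## links above; verify against the current UNITE site before relying on them).
  c(page = paste0("https://unite.ut.ee/bl_forw_sh.php?sh_name=", sh),
    doi  = paste0("http://dx.doi.org/10.15156/BIO/", sh))
}
unite_sh_links("SH195345.07FU")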

## Looking up that Mortierellales_sp_SH213394.07FU thing: https://unite.ut.ee/bl_forw_sh.php?sh_name=SH213394.07FU -> now SH1507815.08FU. Not much info on that one.

## Mycopappus alni is a leaf disease of alders, birches, and crabapples.

## Save image...
save.image("2019-01-31-1626_workspace.RData")

2019-02-01-1024_looking_at_results.R

## Today I am looking at differences between infested and non-infested sites.
wd <- "I:/BIOLOGY/Data/ProjectData/2017_earthworm_soil_fungi_NGS/work_space/2018-12-17_PIPITS"
setwd(wd)
load("2019-01-31-1626_workspace.RData")

## Updating IDs...
ct4$accid <- as.character(ct4$spec)
ct4$gen <- as.character(ct4$gen)
ct4$accid[is.na(ct4$accid)] <- ct4$gen[is.na(ct4$accid)]
ct4$fam <- as.character(ct4$fam)
ct4$accid[is.na(ct4$accid)] <- ct4$fam[is.na(ct4$accid)]
ct4$ord <- as.character(ct4$ord)
ct4$accid[is.na(ct4$accid)] <- ct4$ord[is.na(ct4$accid)]
ct4$clas <- as.character(ct4$clas)
ct4$accid[is.na(ct4$accid)] <- ct4$clas[is.na(ct4$accid)]
ct4$phyl <- as.character(ct4$phyl)
ct4$accid[is.na(ct4$accid)] <- ct4$phyl[is.na(ct4$accid)]
ct4$king <- as.character(ct4$king)
ct4$accid[is.na(ct4$accid)] <- ct4$king[is.na(ct4$accid)]
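## The same species-to-kingdom fallback can be written more compactly; a sketch assuming the dplyr package (not otherwise used in this journal) is available:
## Equivalent fallback through the ranks using dplyr::coalesce
## (first non-NA value wins, row by row).
require(dplyr)
ct4$accid <- coalesce(as.character(ct4$spec), as.character(ct4$gen),
                      as.character(ct4$fam), as.character(ct4$ord),
                      as.character(ct4$clas), as.character(ct4$phyl),
                      as.character(ct4$king))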

## How many unique identifications?
length(levels(as.factor(ct4$accid)))
[1] 245
accids <- levels(as.factor(ct4$accid))

## Doing some saving.
write.csv(accids, "2019-02-01-1041_accepted_ids.csv")
write.csv(ct4, "2019-02-01-1042_obs_df.csv")
save.image("2019-02-01-1043_workspace.RData")

2019-02-01-1128_guilds.R

wd <- "I:/BIOLOGY/Data/ProjectData/2017_earthworm_soil_fungi_NGS/work_space/2019-02-01_FUNGuild"
setwd(wd)
d1 <- read.delim("2019-02-01-1054_otu_table.guilds.txt")
dim(d1)
[1] 874 17
## Wow, why are there so many more rows now?

## Oh, yeah. I had filtered a bunch of these out. Loading more recent data...
setwd("I:/BIOLOGY/Data/ProjectData/2017_earthworm_soil_fungi_NGS/work_space/2018-12-17_PIPITS")
load("2019-02-01-1043_workspace.RData")
setwd(wd)
require(sqldf)
names(d1) <- gsub("\\.", "_", names(d1))
ct6 <- sqldf('
  select ct4.*, d1.Taxon, d1.Taxon_Level, d1.Trophic_Mode, d1.Guild,
    d1.Confidence_Ranking, d1.Growth_Morphology, d1.Trait, d1.Notes,
    d1.Citation_Source
  from ct4 left outer join d1
  on ct4.otu_id = d1.OTUID
')
dim(ct6)
[1] 402 28

## Saving this.
wd <- "I:/BIOLOGY/Data/ProjectData/2017_earthworm_soil_fungi_NGS/work_space/2019-02-01_FUNGuild"
setwd(wd)
write.csv(ct6, "2019-02-01-1147_guilds_assigned.csv", row.names=FALSE)
levels(ct6$Guild)
 [1] "-"
 [2] "Animal Pathogen"
 [3] "Animal Pathogen-Clavicipitaceous Endophyte-Fungal Parasite"
 [4] "Animal Pathogen-Dung Saprotroph-Endophyte-Epiphyte-Plant Saprotroph-Wood Saprotroph"
 [5] "Animal Pathogen-Dung Saprotroph-Endophyte-Lichen Parasite-Plant Pathogen-Undefined Saprotroph"
 [6] "Animal Pathogen-Endophyte-Fungal Parasite-Plant Pathogen-Wood Saprotroph"
 [7] "Animal Pathogen-Fungal Parasite-Undefined Saprotroph"
 [8] "Animal Pathogen-Plant Pathogen-Undefined Saprotroph"
 [9] "Animal Pathogen-Soil Saprotroph"
[10] "Animal Pathogen-Undefined Saprotroph"
[11] "Bryophyte Parasite-Ectomycorrhizal-Ericoid Mycorrhizal-Undefined Saprotroph"
[12] "Bryophyte Parasite-Litter Saprotroph-Wood Saprotroph"
[13] "Dung Saprotroph-Ectomycorrhizal"
[14] "Dung Saprotroph-Ectomycorrhizal-Soil Saprotroph-Wood Saprotroph"
[15] "Dung Saprotroph-Endophyte-Litter Saprotroph-Undefined Saprotroph"
[16] "Dung Saprotroph-Endophyte-Undefined Saprotroph"
[17] "Dung Saprotroph-Plant Saprotroph"
[18] "Dung Saprotroph-Plant Saprotroph-Wood Saprotroph"
[19] "Dung Saprotroph-Soil Saprotroph-Undefined Saprotroph"
[20] "Dung Saprotroph-Soil Saprotroph-Wood Saprotrop"
[21] "Ectomycorrhizal"
[22] "Ectomycorrhizal-Endophyte-Ericoid Mycorrhizal-Litter Saprotroph-Orchid Mycorrhizal"
[23] "Ectomycorrhizal-Fungal Parasite"
[24] "Ectomycorrhizal-Fungal Parasite-Plant Pathogen-Wood Saprotroph"
[25] "Ectomycorrhizal-Fungal Parasite-Soil Saprotroph-Undefined Saprotroph"
[26] "Ectomycorrhizal-Lichenized-Wood Saprotroph"
[27] "Ectomycorrhizal-Orchid Mycorrhizal-Root Associated Biotroph"
[28] "Ectomycorrhizal-Undefined Saprotroph"
[29] "Endophyte"
[30] "Endophyte-Fungal Parasite-Plant Pathogen"
[31] "Endophyte-Lichen Parasite-Plant Pathogen-Undefined Saprotroph"
[32] "Endophyte-Litter Saprotroph-Soil Saprotroph-Undefined Saprotroph"
[33] "Endophyte-Litter Saprotroph-Wood Saprotroph"
[34] "Endophyte-Plant Pathogen"
[35] "Endophyte-Plant Pathogen-Undefined Saprotroph"
[36] "Endophyte-Plant Pathogen-Wood Saprotroph"
[37] "Endophyte-Undefined Saprotroph"
[38] "Endophyte-Undefined Saprotroph-Wood Saprotroph"
[39] "Ericoid Mycorrhizal"
[40] "Fungal Parasite"
[41] "Fungal Parasite-Lichen Parasite"
[42] "Fungal Parasite-Plant Pathogen-Plant Saprotroph"
[43] "Fungal Parasite-Undefined Saprotroph"
[44] "Leaf Saprotroph-Plant Pathogen-Undefined Saprotroph-Wood Saprotroph"
[45] "Lichenized-Undefined Saprotroph"
[46] "Litter Saprotroph"
[47] "Litter Saprotroph-Plant Pathogen"
[48] "NULL"
[49] "Orchid Mycorrhizal"
[50] "Plant Pathogen"
[51] "Plant Pathogen-Plant Saprotroph"
[52] "Plant Pathogen-Undefined Saprotroph"
[53] "Plant Pathogen-Wood Saprotroph"
[54] "Plant pathogenic (?) on polen"
[55] "Plant Saprotroph"
[56] "Plant Saprotroph-Wood Saprotroph"
[57] "Soil Saprotroph"
[58] "Undefined Saprotroph"
[59] "Undefined Saprotroph-Wood Saprotroph"
[60] "Wood Saprotroph"
## That is a lot of different guild assignments.

##"Orchid Mycorrhizal" This is just a cool one. What was this? ct6[ct6$Guild == "Orchid Mycorrhizal",] X5348.2017MLB100.MSITS3 X5348.2017MLB101.MSITS3 X5348.2017MLB102.MSITS3 175 0 0 0 X5348.2017MLB103.MSITS3 X5348.2017MLB105.MSITS3 X5348.2017MLB107.MSITS3 nreads 175 0 15 0 15 otu_id king phyl clas ord fam gen 175 OTU151 Fungi Sebacinales Serendipitaceae Serendipita spec sim freq wrmcor accid Taxon Taxon_Level Trophic_Mode 175 0.99 0.1666667 -0.4472136 Serendipita Serendipita 13 Symbiotroph Guild Confidence_Ranking Growth_Morphology Trait Notes 175 Orchid Mycorrhizal Highly Probable NULL NULL NULL

Citation_Source
175 Tedersoo L, et al. 2010. Mycorrhiza 20:217-263 (pro parte); Weiss et al. DOI: 10.1111/nph.13977

## Ok, I need to simplify this. Using Excel.
gd <- levels(ct6$Guild)
write.csv(gd, "2019-02-01-1157_guilds.csv")
gd2 <- read.csv("2019-02-01-1158_guilds_simplified.csv")
levels(gd2$guild_s)
[1] "endophyte or parasite"
[2] "endophyte, mycorrhizal, parasite, or saprotroph"
[3] "endophyte, parasite, or saprotroph"
[4] "mycorrhizal"
[5] "mycorrhizal or saprotroph"
[6] "saprotroph"
[7] "unknown"

## Joining this to the data.
ct7 <- sqldf('
  select ct6.*, gd2.guild_s
  from ct6 left outer join gd2
  on ct6.Guild = gd2.guild_o
')
dim(ct7)
[1] 402 29
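## The same lookup could also be done without sqldf; a sketch using base merge(), assuming gd2 has the columns guild_o (original FUNGuild string) and guild_s (simplified category) as above:
## Base-R left join equivalent to the sqldf call above
## (column order in the result differs; the content is the same).
ct7_alt <- merge(ct6, gd2[, c("guild_o", "guild_s")],
                 by.x = "Guild", by.y = "guild_o",
                 all.x = TRUE, sort = FALSE)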

## Now for an overall pie chart.
ag1 <- aggregate(ct7$nreads, by=list(ct7$guild_s), sum)
                                          Group.1     x
1                            endophyte or parasite   765
2  endophyte, mycorrhizal, parasite, or saprotroph   841
3               endophyte, parasite, or saprotroph  9309
4                                      mycorrhizal  6537
5                        mycorrhizal or saprotroph  1860
6                                       saprotroph  9592
7                                          unknown 19411
ag1 <- ag1[order(-ag1$x),]
                                          Group.1     x
7                                          unknown 19411
6                                       saprotroph  9592
3               endophyte, parasite, or saprotroph  9309
4                                      mycorrhizal  6537
5                        mycorrhizal or saprotroph  1860
2  endophyte, mycorrhizal, parasite, or saprotroph   841
1                            endophyte or parasite   765
pie(ag1$x[2:7], labels = ag1$Group.1[2:7],
    main="Fungal guild abundances summed over all samples")

## Ok, now trying to compare the two.
ct7$nreads_worms <- apply(ct7[,1:3], 1, sum)
ct7$nreads_noworms <- apply(ct7[,4:6], 1, sum)
agw <- aggregate(ct7$nreads_worms, by=list(ct7$guild_s), sum)
agnw <- aggregate(ct7$nreads_noworms, by=list(ct7$guild_s), sum)
#agw <- agw[order(-agw$x),]
#agnw <- agnw[order(-agnw$x),]
png(filename="2019-02-01-1233_guilds_worms.png", width=900, height=500)
pie(agw$x[1:6], labels = agw$Group.1[1:6],
    main="Fungal guild abundance with Lumbricus terrestris")
dev.off()
png(filename="2019-02-01-1237_guilds_noworms.png", width=900, height=500)
pie(agnw$x[1:6], labels = agnw$Group.1[1:6],
    main="Fungal guild abundance without Lumbricus terrestris")
dev.off()

## Same graphs, now including unknowns.
png(filename="2019-02-01-1240_guilds_worms.png", width=900, height=500)
pie(agw$x, labels = agw$Group.1,
    main="Fungal guild abundance with Lumbricus terrestris")
dev.off()
png(filename="2019-02-01-1240_guilds_noworms.png", width=900, height=500)
pie(agnw$x, labels = agnw$Group.1,
    main="Fungal guild abundance without Lumbricus terrestris")
dev.off()

## Now percentages.
agw$percent <- round(agw$x/sum(agw$x)*100)
agnw$percent <- round(agnw$x/sum(agnw$x)*100)
agw

                                          Group.1     x percent
1                            endophyte or parasite   467       2
2  endophyte, mycorrhizal, parasite, or saprotroph   487       2
3               endophyte, parasite, or saprotroph  4254      18
4                                      mycorrhizal  1443       6
5                        mycorrhizal or saprotroph  1021       4
6                                       saprotroph  5455      23
7                                          unknown 10589      45
agnw
                                          Group.1     x percent
1                            endophyte or parasite   298       1
2  endophyte, mycorrhizal, parasite, or saprotroph   354       1
3               endophyte, parasite, or saprotroph  5055      21
4                                      mycorrhizal  5094      21
5                        mycorrhizal or saprotroph   839       3
6                                       saprotroph  4137      17
7                                          unknown  8822      36
## So earthworm infestation takes mycorrhizal fungi from 21% of reads down to 6%, cutting the relative abundance of mycorrhizal fungi by roughly two thirds.
save.image("2019-02-01-1248_workspace.RData")
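## Just double-checking that proportion change against the totals printed above:
## Mycorrhizal share of reads, worm-free vs. worm-infested sites.
p_noworms <- 5094 / sum(agnw$x)   # ~0.21
p_worms   <- 1443 / sum(agw$x)    # ~0.06
1 - p_worms / p_noworms           # ~0.71, i.e. roughly a two-thirds drop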

## Carrying on after a break... ## Let's compare individual sites. aga <- aggregate(ct7[,1], by=list(ct7$guild_s), sum) names(aga)[2] <- "s1" for (thissite in 2:6) { ag <- aggregate(ct7[,thissite], by=list(ct7$guild_s), sum) aga <- cbind(aga, ag$x) } names(aga)[3:7] <- paste("s", 2:6, sep="")

## Now plots... cls <- colors()[(1:7)*10] ## sort of random colors. par(mfrow=c(2,3)) for (thissite in 1:6) { pie(aga[,thissite + 1], labels = aga[,1], main=paste("Fungal guild relative abundances, site", thissite), col=cls) } png(filename="2019-02-01-1438_site_guilds.png", width=800, height=1200, pointsize = 24 ) cls <- rev(rainbow(7)) cls <- cls[c(3, 4, 5, 6, 7, 1, 2)] #cls <- colors()[(1:7)*10] ## sort of random colors. par(mfrow=c(4,2)) par(mar=c(0,0,1,0)) plot.new() legend( "center", legend = aga[,1], fill = cls ) for (thissite in 1:3) { pie(aga[,thissite + 1], labels = "", main=paste("Site", thissite), col=cls) } for (thissite in 4:6) { pie(aga[,thissite + 1], labels = "", main=paste("Site", thissite), col=cls) } dev.off() png(filename="2019-02-01- 1454_guilds_comparison.png", width=920, height=700, pointsize = 23 ) cls <- rev(rainbow(7)) cls <- cls[c(3, 4, 5, 6, 7, 1, 2)] par(mfrow=c(2,2)) par(mar=c(0,0,1,0)) pie(agnw$x, labels = "", col=cls, main="Lumbricus absent") pie(agw$x, labels = "", col=cls, main="Lumbricus present") plot.new() legend( "center", legend = aga[,1], fill = cls ) dev.off()

## Now to make a graph for the article. pdf(file="2019-02-01-1504_guilds_comparison.pdf", width=920/170, height=700/170, pointsize = 10 ) cls <- rev(rainbow(7)) cls <- cls[c(3, 4, 5, 6, 7, 1, 2)] par(mfrow=c(2,2)) par(mar=c(0,0,1,0)) pie(agnw$x, labels = "", col=cls, main="Lumbricus absent") pie(agw$x, labels = "", col=cls, main="Lumbricus present") plot.new() legend( "top", legend = aga[,1], fill = cls ) dev.off()

## Now for a table. st <- ag names(st)[1] <- "guild" st$worm_free_mean <- round(apply(aga[,5:7], 1, mean)) st <- st[,c(1,3)] st$worm_free_sd <- round(apply(aga[,5:7], 1, sd)) st$worm_mean <- round(apply(aga[,2:4], 1, mean)) st$worm_sd <- round(apply(aga[,2:4], 1, sd))

## saving this. write.csv(st, "2019-02-01-1522_guild_table.csv", row.names=FALSE)

2019-02-11-1324_yeti_script.txt

## Work on Yeti, 11.Feb.2019
## I am going to see if I can run just the first 5 sequences through the pipits_funits step.
cd /home/mattbowser/2018_AWCC_soil_fungi
mkdir out_seqprep_001

## Trying just 5 sequences (10 lines) and putting this in its own directory.
sed -e '10q' out_seqprep/prepped.fasta > out_seqprep_001/prepped.fasta

## Made a new SLURM file, included below.
sbatch 2019-02-11-1303_AWCCfungi_1-5.slurm

## That took 34 seconds but did not produce the expected output file.
cat *3939971.out

pipits_funits 2.2, the PIPITS Project
https://github.com/hsgweon/pipits
------

2019-02-11 15:08:49 pipits_funits started 2019-02-11 15:08:49 Checking input FASTA for illegal characters 2019-02-11 15:08:49 ... done 2019-02-11 15:08:49 Counting input sequences 2019-02-11 15:08:50 ... number of input sequences: 5 2019-02-11 15:08:50 Dereplicating sequences for efficiency 2019-02-11 15:08:51 ... done 2019-02-11 15:08:51 Counting dereplicated sequences 2019-02-11 15:08:51 ... number of dereplicated sequences: 5 2019-02-11 15:08:51 Extracting ITS2 from sequences [ITSx] 2019-02-11 15:09:02 ... done 2019-02-11 15:09:02 Counting ITS sequences (dereplicated) 2019-02-11 15:09:02 ... number of ITS sequences (dereplicated): 5 2019-02-11 15:09:02 Removing short sequences below < 100bp 2019-02-11 15:09:02 ... done 2019-02-11 15:09:02 Counting length-filtered sequences (dereplicated) 2019-02-11 15:09:02 ERROR: You have 0 sequences! Something isn't right. srun: error: n3-94: task 0: Exited with exit code 1

## Overwriting that previous file with a longer file.
sed -e '100q' out_seqprep/prepped.fasta > out_seqprep_001/prepped.fasta
wc -l out_seqprep_001/prepped.fasta
## Ok, that was 100 lines long as it should be.

## Trying again...
sbatch 2019-02-11-1303_AWCCfungi_1-5.slurm
## At the previous pace that should take
round(34*100/60)
[1] 57
## minutes, or one hour! (That estimate treats 34 s as a per-line cost; at 34 s per 10-line chunk the expectation would be closer to 6 minutes.)
## We will see how long that takes.
## It took 00:01:21, much faster, thankfully.
cat *3939978.out

pipits_funits 2.2, the PIPITS Project
https://github.com/hsgweon/pipits
------

2019-02-11 15:24:07 pipits_funits started 2019-02-11 15:24:07 Checking input FASTA for illegal characters 2019-02-11 15:24:07 ... done 2019-02-11 15:24:07 Counting input sequences 2019-02-11 15:24:07 ... number of input sequences: 50 2019-02-11 15:24:07 Dereplicating sequences for efficiency 2019-02-11 15:24:07 ... done 2019-02-11 15:24:07 Counting dereplicated sequences 2019-02-11 15:24:07 ... number of dereplicated sequences: 50 2019-02-11 15:24:07 Extracting ITS2 from sequences [ITSx] 2019-02-11 15:25:24 ... done 2019-02-11 15:25:24 Counting ITS sequences (dereplicated) 2019-02-11 15:25:24 ... number of ITS sequences (dereplicated): 31 2019-02-11 15:25:24 Removing short sequences below < 100bp 2019-02-11 15:25:24 ... done 2019-02-11 15:25:24 Counting length-filtered sequences (dereplicated) 2019-02-11 15:25:24 ... number of length-filtered sequences (dereplicated): 2 2019-02-11 15:25:24 Re-inflating sequences 2019-02-11 15:25:25 ... done 2019-02-11 15:25:25 Counting sequences after re- inflation 2019-02-11 15:25:25 ... number of sequences with ITS subregion: 2 2019-02-11 15:25:25 Cleaning temporary directory 2019-02-11 15:25:25 Done - pipits_funits ended successfully. (Your ITS sequences are "out_funits_001/ITS.fasta") 2019-02-11 15:25:25 Next step: pipits_process [ Example: pipits_process -i out_funits_001/ITS.fasta -o pipits_process ] [mattbowser@yeti-login20 2018_AWCC_soil_fungi]

## That worked!
## The rate was 100 lines (50 sequences) per 81 seconds, or
100/81
[1] 1.234568
## lines per second.

## Ok, trying a larger file. How big is the original file?
wc -l out_seqprep/prepped.fasta
1019826 out_seqprep/prepped.fasta

## Let's say we run 100K lines (50K reads) at a time.
## That might take
81/100 * 1e5
[1] 81000
## seconds or
81/100 * 1e5 * 1/60^2
[1] 22.5
## hours.

## Overwriting that previous file with a much longer file.
sed -e '100000q' out_seqprep/prepped.fasta > out_seqprep_001/prepped.fasta
wc -l out_seqprep_001/prepped.fasta
## That looked good.
wc -l out_seqprep/prepped.fasta
## That looked good.

## Now going for it.
sbatch 2019-02-11-1303_AWCCfungi_1-5.slurm

## Uh-oh. I might not have given that enough time.
scancel 3939980

## Revised the script to give two days.
## Now going for it again.
sbatch 2019-02-11-1303_AWCCfungi_1-5.slurm

2019-02-11-1303_AWCCfungi_1-5.slurm

#!/bin/bash
#SBATCH --job-name=AWCCfungi_1-5
#SBATCH -n 1                        # number of nodes
#SBATCH -n 1                        # number of tasks
#SBATCH -p long                     # partition
#SBATCH --account=bio               # account code
#SBATCH --time=02-01:00:00          # requested job time D-HH:MM:SS
#SBATCH --mail-type=ALL             # choose when you want to be emailed
#SBATCH [email protected]  # add your email address
#SBATCH -o 2019-02-11-1303_AWCCfungi_1-5-%j.out  # name of output file (the %j inserts the jobid)
module load python/miniconda3-gcc6.1.0  # load required modules
source activate pipits_env              # load PIPITS environment
cd /home/mattbowser/2018_AWCC_soil_fungi
srun --mpi=pmi2 pipits_funits -i out_seqprep_001/prepped.fasta -o out_funits_001 -x ITS2
source deactivate                       # deactivate PIPITS
module purge                            # unload those modules

2019-02-12-0925_yeti_script.txt

## Work on Yeti, 12.Feb.2019
## Splitting the rest of prepped.fasta into 100K-line chunks and submitting them.
cd /home/mattbowser/2018_AWCC_soil_fungi
mkdir out_seqprep_002
sed -e '1,100000d;200000q' out_seqprep/prepped.fasta > out_seqprep_002/prepped.fasta
wc -l out_seqprep_002/prepped.fasta
sbatch 2019-02-12-0940_AWCCfungi_2.slurm
## That started. I hope it works. Now getting ready for the rest.
mkdir out_seqprep_003
mkdir out_seqprep_004
mkdir out_seqprep_005
mkdir out_seqprep_006
mkdir out_seqprep_007
mkdir out_seqprep_008
mkdir out_seqprep_009
mkdir out_seqprep_010
sed -e '1,200000d;300000q' out_seqprep/prepped.fasta > out_seqprep_003/prepped.fasta
sed -e '1,300000d;400000q' out_seqprep/prepped.fasta > out_seqprep_004/prepped.fasta
sed -e '1,400000d;500000q' out_seqprep/prepped.fasta > out_seqprep_005/prepped.fasta
sed -e '1,500000d;600000q' out_seqprep/prepped.fasta > out_seqprep_006/prepped.fasta
sed -e '1,600000d;700000q' out_seqprep/prepped.fasta > out_seqprep_007/prepped.fasta
sed -e '1,700000d;800000q' out_seqprep/prepped.fasta > out_seqprep_008/prepped.fasta
sed -e '1,800000d;900000q' out_seqprep/prepped.fasta > out_seqprep_009/prepped.fasta
sed -e '1,900000d;1019826q' out_seqprep/prepped.fasta > out_seqprep_010/prepped.fasta
wc -l out_seqprep_003/prepped.fasta
wc -l out_seqprep_004/prepped.fasta
wc -l out_seqprep_005/prepped.fasta
wc -l out_seqprep_006/prepped.fasta
wc -l out_seqprep_007/prepped.fasta
wc -l out_seqprep_008/prepped.fasta
wc -l out_seqprep_009/prepped.fasta
wc -l out_seqprep_010/prepped.fasta

## Yeah, yesterday's SLURM job completed!
## Let's have a look.
cat *983.out
## Ok, 50K input reads filtered down to 7203 reads after re-inflation.

## Trying to run parallel analyses...
sbatch 2019-02-12-1026_AWCCfungi_3-10.slurm
## That failed.

## Cleaned up a little.
## Now running as separate batch files.
sbatch 2019-02-12-1055_AWCCfungi_03.slurm
sbatch 2019-02-12-1055_AWCCfungi_04.slurm
sbatch 2019-02-12-1055_AWCCfungi_05.slurm
sbatch 2019-02-12-1055_AWCCfungi_06.slurm
sbatch 2019-02-12-1055_AWCCfungi_07.slurm
sbatch 2019-02-12-1055_AWCCfungi_08.slurm
sbatch 2019-02-12-1055_AWCCfungi_09.slurm
sbatch 2019-02-12-1055_AWCCfungi_10.slurm

2019-02-12-0940_AWCCfungi_2.slurm

#!/bin/bash #SBATCH --job-name=AWCCfungi_1-5 #SBATCH -n 1 # number of nodes #SBATCH -n 1 # number of tasks #SBATCH -p long # parition #SBATCH --account=bio # account code #SBATCH --time=02-01:00:00 # requested job time D- HH:MM:SS #SBATCH --mail-type=ALL # choose when you want to be emailed #SBATCH [email protected] # add your email address #SBATCH -o 2019-02-12-0940_AWCCfungi_2-%j.out # name of output file (the %j inserts the jobid) module load python/miniconda3-gcc6.1.0 # load required modules source activate pipits_env # load PIPITS environment cd /home/mattbowser/2018_AWCC_soil_fungi srun --mpi=pmi2 pipits_funits -i out_seqprep_002/prepped.fasta -o out_funits_002 -x ITS2 source deactivate # deactivate PIPITS module purge # unload those modules

2019-02-12-1055_AWCCfungi_10.slurm

This is an example of one of the eight files submitted nearly simultaneously.

#!/bin/bash #SBATCH --job-name=AWCCfungi_10 #SBATCH -n 1 # number of nodes #SBATCH -n 1 # number of tasks #SBATCH -p long # parition #SBATCH --account=bio # account code #SBATCH --time=02-01:00:00 # requested job time D- HH:MM:SS #SBATCH --mail-type=ALL # choose when you want to be emailed #SBATCH [email protected] # add your email address #SBATCH -o 2019-02-12-1052_AWCCfungi_10-%j.out # name of output file (the %j inserts the jobid) module load python/miniconda3-gcc6.1.0 # load required modules source activate pipits_env # load PIPITS environment cd /home/mattbowser/2018_AWCC_soil_fungi #srun --mpi=pmi2 pipits_funits -i out_seqprep_003/prepped.fasta -o out_funits_003 -x ITS2 #srun --mpi=pmi2 pipits_funits -i out_seqprep_004/prepped.fasta -o out_funits_004 -x ITS2 #srun --mpi=pmi2 pipits_funits -i out_seqprep_005/prepped.fasta -o out_funits_005 -x ITS2 #srun --mpi=pmi2 pipits_funits -i out_seqprep_006/prepped.fasta -o out_funits_006 -x ITS2 #srun --mpi=pmi2 pipits_funits -i out_seqprep_007/prepped.fasta -o out_funits_007 -x ITS2 #srun --mpi=pmi2 pipits_funits -i out_seqprep_008/prepped.fasta -o out_funits_008 -x ITS2 #srun --mpi=pmi2 pipits_funits -i out_seqprep_009/prepped.fasta -o out_funits_009 -x ITS2 srun --mpi=pmi2 pipits_funits -i out_seqprep_010/prepped.fasta -o out_funits_010 -x ITS2 source deactivate # deactivate PIPITS module purge # unload those modules 2019-02-13-1006_soil_fungi_map.R

## Making a map for the Lumbricus soil fungi article. wd <- "I:/BIOLOGY/Data/ProjectData/2017_earthworm_soil_fungi_NGS/work_space/2019- 02-13_map" knwr = "J:/Goedata/Boundaries/KNWR/knwr.shp" lakes = "J:/Goedata/Hydro/KP_lakes/lakes.shp" streams = "J:/Goedata/Hydro/KP_streams/streams.shp" roads = "J:/Goedata/Transportation/Roads/roads.shp" require(maptools) require(rgdal) require(raster) require(GISTools) albers <- "+proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs" wgs84 <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"

## Load shape files. #knwr <- readShapeSpatial(knwr) lakes <- readShapeSpatial(lakes) streams <- readShapeSpatial(streams) roads <- readOGR(dsn=roads, layer="roads")

#proj4string(knwr) <- CRS(albers) proj4string(lakes) <- CRS(albers) setwd(wd)

## load points. pts1 <- read.csv("2019-02-13-1009_coordinates.csv") coordinates(pts1) <- c("lon", "lat") proj4string(pts1) <- CRS(wgs84) sitesa <- spTransform(pts1, CRS(albers)) water <- "#93CCEA" ## #76D7EA = Crayola Sky Blue "#BEBEBE" #93CCEA is Crayola cornflower land <- "#E8E8E8"

pdf(file="2019-02-13-1015_soil_fungi_map.pdf", width=6, height=6 ) par(mar=c(0.1, 0.1, 0.1, 0.1)) par(bg=land) plot(sitesa, pch="", bg=land ) plot(streams, add=TRUE, col="#1d6b95", lwd=1 ) plot(lakes, add=TRUE, col=water, border="#1d6b95", lwd=1 ) plot(roads, col="#888888", lwd=2, add=TRUE ) points( sitesa, lwd=2, cex=1.2 ) legend("topright", bg="white", legend=ldf$lab, fill=ldf$fill, border=ldf$border, lwd=ldf$lwd, pch=ldf$pch, pt.cex=ldf$ptcex, pt.lwd=ldf$ptlwd, col=ldf$col ) map.scale(x=160844.7+20, y=1204136+130, len=200, ndivs=2, units="m", subdiv=100) north.arrow(xb=160844.7+0, yb=1204136+60, len=8, lab="N") dev.off() lab <- c("soil samples", "water", "roads", "Lumbricus") ldf <- as.data.frame(lab) ldf$fill <- c(NA, "#93CCEA", NA, NA) ldf$border <- c(NA, "#1d6b95", NA, NA) ldf$lwd <- c(NA, 1, 2, NA) ldf$pch <- c(1, NA, NA, 21) ldf$ptcex <- c(1.2, NA, NA, 1.2) ldf$ptlwd <- c(2, NA, NA, 2) ldf$col <- c("black", "#1d6b95", "#888888", "red") ldf <- ldf[c(2,3,4,1),] pdf(file="2019-02-13-1041_soil_fungi_map.pdf", width=6, height=6 ) par(mar=c(0.1, 0.1, 0.1, 0.1)) par(bg=land) plot(sitesa, pch="", bg=land ) plot(streams, add=TRUE, col="#1d6b95", lwd=1 )

plot(roads, col="#888888", lwd=2, add=TRUE ) points( sitesa, lwd=2, cex=1.2 ) points( 160973.5, 1204136, pch=21, cex=40, col="red", lwd=2 ) plot(lakes, add=TRUE, col=water, border="#1d6b95", lwd=1 ) text(161110, 1204160, "Stormy Lake", srt=46 ) legend("topright", bg="white", legend=ldf$lab, fill=ldf$fill, border=ldf$border, lwd=ldf$lwd, pch=ldf$pch, pt.cex=ldf$ptcex, pt.lwd=ldf$ptlwd, col=ldf$col ) map.scale(x=160844.7+20, y=1204136+130, len=200, ndivs=2, units="m", subdiv=100) north.arrow(xb=160844.7+0, yb=1204136+60, len=8, lab="N") dev.off()

2019-02-13-1244_yeti_stuff.txt cd /home/mattbowser/2018_AWCC_soil_fungi

## Trying to combine all of those output files.
## (Note: out_funits_003/ITS.fasta is listed three times in this command as run.)
cat \
out_funits_001/ITS.fasta \
out_funits_002/ITS.fasta \
out_funits_003/ITS.fasta \
out_funits_003/ITS.fasta \
out_funits_003/ITS.fasta \
out_funits_004/ITS.fasta \
out_funits_005/ITS.fasta \
out_funits_006/ITS.fasta \
out_funits_007/ITS.fasta \
out_funits_008/ITS.fasta \
out_funits_009/ITS.fasta \
out_funits_010/ITS.fasta \
> out_funits/ITS.fasta

## Trying to finish... sbatch 2019-02-13-1226_AWCCfungi_finish.slurm

## How many lines are in that ITS.fasta file?
wc -l out_funits/ITS.fasta
## Got 76624.

2019-02-13-1226_AWCCfungi_finish.slurm

#!/bin/bash #SBATCH --job-name=AWCCfungifinish #SBATCH -n 1 # number of nodes #SBATCH -n 1 # number of tasks #SBATCH -p long # parition #SBATCH --account=bio # account code #SBATCH --time=02-01:00:00 # requested job time D- HH:MM:SS #SBATCH --mail-type=ALL # choose when you want to be emailed #SBATCH [email protected] # add your email address #SBATCH -o 2019-02-12-1052_AWCCfungifinish%j.out # name of output file (the %j inserts the jobid) module load python/miniconda3-gcc6.1.0 # load required modules source activate pipits_env # load PIPITS environment cd /home/mattbowser/2018_AWCC_soil_fungi srun --mpi=pmi2 pipits_process -i out_funits/ITS.fasta -o out_process srun --mpi=pmi2 vsearch --usearch_global out_process/repseqs.fasta \ --db out_process/repseqs.fasta --self --id .84 -- iddef 1 \ --userout match_list.txt -userfields query+target+id \ --maxaccepts 0 --query_cov .9 --maxhits 10 source deactivate # deactivate PIPITS module purge # unload those modules

2019-02-14-0939_yeti_stuff.txt cd /home/mattbowser/2018_AWCC_soil_fungi

## Looking at output from yesterday. cat *638.out pipits_process 2.2, the PIPITS Project https://github.com/hsgweon/pipits ------

2019-02-13 14:40:43 pipits_process started 2019-02-13 14:40:43 Generating a sample list from the input sequences 2019-02-13 14:40:43 Dereplicating and removing unique sequences prior to picking OTUs 2019-02-13 14:40:44 Picking OTUs [VSEARCH] 2019-02-13 14:40:48 Removing chimeras [VSEARCH] 2019-02-13 14:40:56 Renaming OTUs 2019-02-13 14:40:56 Mapping reads onto centroids [VSEARCH] 2019-02-13 14:42:10 Making OTU table 2019-02-13 14:42:11 Converting classic tabular OTU into a BIOM format [BIOM] 2019-02-13 14:42:38 Assigning with UNITE [RDP Classifier] 2019-02-13 14:46:34 Reformatting RDP_Classifier output 2019-02-13 14:46:34 Adding assignment to OTU table [BIOM] 2019-02-13 14:46:36 Converting OTU table with taxa assignment into a BIOM format [BIOM] 2019-02-13 14:46:38 Phylotyping OTU table 2019-02-13 14:46:40 Cleaning temporary directory 2019-02-13 14:46:40 Number of reads used to generate OTU table: 29017 2019-02-13 14:46:40 Number of OTUs: 899 2019-02-13 14:46:40 Number of phylotypes: 99 2019-02-13 14:46:40 Number of samples: 12 2019-02-13 14:46:40 Done - Resulting files are in "out_process" directory 2019-02-13 14:46:40 pipits_process ended successfully. vsearch v2.10.2_linux_x86_64, 125.9GB RAM, 20 cores https://github.com/torognes/vsearch

Reading file out_process/repseqs.fasta 100% 366088 nt in 899 seqs, min 103, max 472, avg 407 Masking 100% Counting k-mers 100% Creating k-mer index 100% Searching 100% Matching query sequences: 711 of 899 (79.09%)

## Ok, just looking at how much was filtered...
wc -l out_seqprep/prepped.fasta
1019826 out_seqprep/prepped.fasta
## So that would have been
1019826/2
[1] 509913
# reads.
wc -l out_funits/ITS.fasta
76624 out_funits/ITS.fasta
## Down to
76624/2
[1] 38312
# reads after the ITSx step. That is a whole lot fewer!

## Looking at output files. ## Here is the first of 10 (lines 1-100K): cat *983.out pipits_funits 2.2, the PIPITS Project https://github.com/hsgweon/pipits ------

2019-02-11 15:59:56 pipits_funits started 2019-02-11 15:59:56 Checking input FASTA for illegal characters 2019-02-11 15:59:56 ... done 2019-02-11 15:59:56 Counting input sequences 2019-02-11 15:59:56 ... number of input sequences: 50000 2019-02-11 15:59:56 Dereplicating sequences for efficiency 2019-02-11 15:59:57 ... done 2019-02-11 15:59:57 Counting dereplicated sequences 2019-02-11 15:59:57 ... number of dereplicated sequences: 48132 2019-02-11 15:59:57 Extracting ITS2 from sequences [ITSx] 2019-02-12 12:19:16 ... done 2019-02-12 12:19:16 Counting ITS sequences (dereplicated) 2019-02-12 12:19:16 ... number of ITS sequences (dereplicated): 25277 2019-02-12 12:19:16 Removing short sequences below < 100bp 2019-02-12 12:19:16 ... done 2019-02-12 12:19:16 Counting length-filtered sequences (dereplicated) 2019-02-12 12:19:16 ... number of length-filtered sequences (dereplicated): 7166 2019-02-12 12:19:16 Re-inflating sequences 2019-02-12 12:19:17 ... done 2019-02-12 12:19:17 Counting sequences after re- inflation 2019-02-12 12:19:17 ... number of sequences with ITS subregion: 7203 2019-02-12 12:19:17 Cleaning temporary directory 2019-02-12 12:19:17 Done - pipits_funits ended successfully. (Your ITS sequences are "out_funits_001/ITS.fasta") 2019-02-12 12:19:17 Next step: pipits_process [ Example: pipits_process -i out_funits_001/ITS.fasta -o pipits_process ]

## So here it looks like the ITSx step removed about half of the sequences (48K to 25K).
## But most of the remaining ITS sequences were < 100 bp; removing those left only 7K.
## So overall the pipits_funits step is yielding
7203/50000
[1] 0.14406
# i.e., about 14% of the input reads.

## Looking at last set (lines 900001-1019826). cat *736.out pipits_funits 2.2, the PIPITS Project https://github.com/hsgweon/pipits ------2019-02-12 13:02:16 pipits_funits started 2019-02-12 13:02:16 Checking input FASTA for illegal characters 2019-02-12 13:02:18 ... done 2019-02-12 13:02:18 Counting input sequences 2019-02-12 13:02:18 ... number of input sequences: 59913 2019-02-12 13:02:18 Dereplicating sequences for efficiency 2019-02-12 13:02:21 ... done 2019-02-12 13:02:21 Counting dereplicated sequences 2019-02-12 13:02:21 ... number of dereplicated sequences: 55663 2019-02-12 13:02:21 Extracting ITS2 from sequences [ITSx] 2019-02-13 13:39:14 ... done 2019-02-13 13:39:14 Counting ITS sequences (dereplicated) 2019-02-13 13:39:14 ... number of ITS sequences (dereplicated): 51522 2019-02-13 13:39:14 Removing short sequences below < 100bp 2019-02-13 13:39:14 ... done 2019-02-13 13:39:14 Counting length-filtered sequences (dereplicated) 2019-02-13 13:39:14 ... number of length-filtered sequences (dereplicated): 1284 2019-02-13 13:39:14 Re-inflating sequences 2019-02-13 13:39:15 ... done 2019-02-13 13:39:15 Counting sequences after re- inflation 2019-02-13 13:39:15 ... number of sequences with ITS subregion: 1295 2019-02-13 13:39:15 Cleaning temporary directory 2019-02-13 13:39:15 Done - pipits_funits ended successfully. (Your ITS sequences are "out_funits_010/ITS.fasta") 2019-02-13 13:39:15 Next step: pipits_process [ Example: pipits_process -i out_funits_010/ITS.fasta -o pipits_process ] [mattbowser@yeti-login20 2018_AWCC_soil_fungi]

## Wow, an even higher percentage was dropped there. Most of these reads were < 100 bp.

## I just reviewed my Stormy Lake analysis for comparison. That run had 85,606 raw reads, 60,178 joined reads, and 59,732 filtered reads input to pipits_funits. Of those 59,732 input reads there were 31,214 dereplicated sequences, 28,218 sequences after ITSx and length filtering, and 54,294 reads after re-inflation. Got 874 OTUs from 6 samples. This is much different than the AWCC analysis.
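## Putting the two retention rates side by side (numbers from this entry and from the line counts above):
## Fraction of pipits_funits input reads surviving to ITS.fasta.
stormy_retained <- 54294 / 59732    # ~0.91 (Stormy Lake run)
awcc_retained   <- 38312 / 509913   # ~0.075 (this AWCC run)
stormy_retained / awcc_retained     # roughly a 12-fold difference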

## Looking at notes from December 20 for the AWCC analysis.
## In the pispino_seqprep step there were initially 577,039 reads.

## Perhaps I did something wrong on December 18 in regards to splitting the original file, etc.

## Going to try installing QIIME2 to demultiplex.
## Using the instructions at the URL below.
https://docs.qiime2.org/2019.1/install/native/#install-qiime-2-within-a-conda-environment
module load python/miniconda3-gcc6.1.0
conda update conda
## This started, but permission was denied.
## Just checking:
module avail
## Nothing newer.
## Proceeding.
wget https://data.qiime2.org/distro/core/qiime2-2019.1-py36-linux-conda.yml
conda env create -n qiime2-2019.1 --file qiime2-2019.1-py36-linux-conda.yml

## (installed) ## Some output:

# To activate this environment, use: # > source activate qiime2-2019.1 # # To deactivate an active environment, use: # > source deactivate

## Transferred raw Illumina FASTQ files. Need to uncompress these.
gunzip original_data/SAMP1-12_S4_L001_R1_001.fastq.gz
gunzip original_data/SAMP1-12_S4_L001_R2_001.fastq.gz

## Looking at these.
wc -l original_data/SAMP1-12_S4_L001_R1_001.fastq
5416748 original_data/SAMP1-12_S4_L001_R1_001.fastq
wc -l original_data/SAMP1-12_S4_L001_R2_001.fastq
5416748 original_data/SAMP1-12_S4_L001_R2_001.fastq

## So those are the same length, good (5,416,748 lines / 4 lines per record = 1,354,187 reads each).

## Trying this as a SLURM script.
sbatch 2019-02-14-1248_join.slurm

qiime join_paired_ends.py \
  -f original_data/SAMP1-12_S4_L001_R1_001.fastq \
  -r original_data/SAMP1-12_S4_L001_R2_001.fastq \
  -o joined

-bash: join_paired_ends.py: command not found

## It looks like these scripts may not be available in qiime2.
module purge
source deactivate
module load qiime/1.9.2
module purge

## Trying running join_paired_ends.py under qiime 1.9.2 sbatch 2019-02-14-1317_join.slurm

## That did not work.

## Looks like I need to use qiime2.
module load python/miniconda3-gcc6.1.0  # load required modules
source activate qiime2-2019.1           # load QIIME environment
cd /home/mattbowser/2018_AWCC_soil_fungi
## I copied .fastq.gz files.
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path original_data \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux-paired-end.qza
sbatch 2019-02-14-1411_import.slurm

## That seemed to work.
sbatch 2019-02-14-1416_join.slurm

## Got an error: Plugin error from cutadapt:

Parameter 'seqs' received an argument of type SampleData[PairedEndSequencesWithQuality]. An argument of subtype MultiplexedPairedEndBarcodeInSequence is required.

## Oh, I might just need to change that parameter.
sbatch 2019-02-14-1429_import.slurm
## No.
## I think I am going the wrong direction here. I might need to provide my own prepped.fasta file.

2019-02-14-1530_seqprep.R

## Downloaded collection 15 from Galaxy.
wd <- "I:/BIOLOGY/Data/ProjectData/Grasslands/2018_grassland_work/work_space/2019-01-29_ITSx/2019-02-14_demux"
setwd(wd)
require(Biostrings)

## Get the file list.
fl <- dir(wd)

## Get the sample names.
sn <- gsub("\\.fasta", "", fl)

this_file <- 1
fas <- readDNAStringSet(fl[this_file])
lbs <- paste(sn[this_file], 1:length(fas), sep="_")
names(fas) <- lbs
fas1 <- fas
print(length(fas1))
for (this_file in 2:length(fl)) {
  fas <- readDNAStringSet(fl[this_file])
  lbs <- paste(sn[this_file], 1:length(fas), sep="_")
  names(fas) <- lbs
  fas1 <- c(fas1, fas)
  print(length(fas1))
}
writeXStringSet(fas1, "prepped.fasta")

2019-02-14-1603_funits.slurm

#!/bin/bash #SBATCH --job-name=funits #SBATCH -n 1 # number of nodes #SBATCH -n 3 # number of tasks #SBATCH -p long # parition #SBATCH --account=bio # account code #SBATCH --time=10-01:00:00 # requested job time D- HH:MM:SS #SBATCH --mail-type=ALL # choose when you want to be emailed #SBATCH [email protected] # add your email address #SBATCH -o 2019-02-14-1603_funits-%j.out # name of output file (the %j inserts the jobid) module load python/miniconda3-gcc6.1.0 # load required modules source activate pipits_env # load PIPITS environment cd /home/mattbowser/2018_AWCC_soil_fungi srun --mpi=pmi2 pipits_funits -i out_seqprep/prepped.fasta -o out_funits -x ITS2 srun --mpi=pmi2 pipits_process -i out_funits/ITS.fasta -o out_process srun --mpi=pmi2 vsearch --usearch_global out_process/repseqs.fasta \ --db out_process/repseqs.fasta --self --id .84 -- iddef 1 \ --userout match_list.txt -userfields query+target+id \ --maxaccepts 0 --query_cov .9 --maxhits 10 source deactivate # deactivate PIPITS module purge # unload those modules

2019-02-15-0753_seqprep.R

## This time I am going to run each of the demultiplexed datasets through ITSx individually.
wd <- "I:/BIOLOGY/Data/ProjectData/Grasslands/2018_grassland_work/work_space/2019-01-29_ITSx/2019-02-14_demux"
od <- "I:/BIOLOGY/Data/ProjectData/Grasslands/2018_grassland_work/work_space/2019-01-29_ITSx/2019-02-15_demux"
setwd(wd)
require(Biostrings)

## Get the file list. fl <- dir(wd) fl <- fl[1:12] ## Get the sample names. sn <- gsub("\\.fasta", "", fl) sn <- sn[1:12] sample_names <- sn nreads <- rep(0, 12) rdf <- as.data.frame(cbind(sample_names, nreads)) for (this_file in 1:length(fl))#length(fl) { fas <- readDNAStringSet(fl[this_file]) lbs <- paste(sn[this_file], 1:length(fas), sep="_") names(fas) <- lbs print(length(fas)) rdf$nreads[this_file] <- length(fas) fd <- paste(od, fl[this_file], sep="/") writeXStringSet(fas, fd) }

## numbers of reads: [1] 42351 [1] 45738 [1] 46639 [1] 52856 [1] 33779 [1] 25379 [1] 55761 [1] 64013 [1] 37214 [1] 50063 [1] 62257 [1] 60989

2019-02-15-0806_ITSx_AWCC1.slurm

#!/bin/bash
#SBATCH --job-name=ITSxAWCC1
#SBATCH -n 1                        # number of nodes
#SBATCH -n 1                        # number of tasks
#SBATCH -p long                     # partition
#SBATCH --account=bio               # account code
#SBATCH --time=02-01:00:00          # requested job time D-HH:MM:SS
#SBATCH --mail-type=ALL             # choose when you want to be emailed
#SBATCH [email protected]  # add your email address
#SBATCH -o 2019-02-15-0806_ITSxAWCC1-%j.out  # name of output file (the %j inserts the jobid)
module load python/miniconda3-gcc6.1.0  # load required modules
source activate pipits_env              # load PIPITS environment
cd /home/mattbowser/2018_AWCC_soil_fungi
srun --mpi=pmi2 pipits_funits -i 2019-02-15_demux/AWCC1.fasta -o out_funits_AWCC1 -x ITS2
source deactivate                       # deactivate PIPITS
module purge                            # unload those modules

2019-02-15-1308_difference_testing.R

## Today I want to test for differences between sites.
wd <- "I:/BIOLOGY/Data/ProjectData/2017_earthworm_soil_fungi_NGS/work_space/2018-12-17_PIPITS"
setwd(wd)
load("2019-02-01-1043_workspace.RData")
require(vegan)
require(MASS)

## Following example of http://cc.oulu.fi/~jarioksa/opetus/metodi/vegantutor.pdf vare.dis <- vegdist(d2) vare.mds0 <- isoMDS(vare.dis) initial value 11.130355 iter 5 value 3.853666 iter 10 value 1.504435 iter 15 value 0.461817 iter 20 value 0.224225 iter 25 value 0.078866 iter 30 value 0.043991 iter 35 value 0.019019 iter 40 value 0.011252 final value 0.006030 converged stressplot(vare.mds0, vare.dis) ordiplot(vare.mds0, type = "t") vare.mds <- metaMDS(d2, trace = FALSE)

Warning message: In metaMDS(d2, trace = FALSE) : stress is (nearly) zero: you may have insufficient data

vare.mds

Call: metaMDS(comm = d2, trace = FALSE) global Multidimensional Scaling using monoMDS

Data: d2 Distance: bray

Dimensions: 2 Stress: 7.879375e-05 Stress type 1, weak ties Two convergent solutions found after 20 tries Scaling: centring, PC rotation, halfchange scaling Species: expanded scores based on ‘d2’ plot(vare.mds, type = "t") ## That is not looking good. dis <- vegdist(decostand(d2, "norm"), "euclid") dis <- vegdist(decostand(d2, "hell"), "euclidean") d <- vegdist(d2, "bray", binary = TRUE) vare.pca <- rda(d2) vare.pca Call: rda(X = d2)

Inertia Rank Total 1034 Unconstrained 1034 5 Inertia is variance

Eigenvalues for unconstrained axes: PC1 PC2 PC3 PC4 PC5 315.33 217.00 197.77 163.17 140.44 plot(vare.pca) biplot(vare.pca, scaling = -1) vare.pca <- rda(d2, scale = TRUE) vare.pca Call: rda(X = d2, scale = TRUE)

Inertia Rank Total 402 Unconstrained 402 5 Inertia is correlations

Eigenvalues for unconstrained axes: PC1 PC2 PC3 PC4 PC5 118.34 84.26 78.76 65.01 55.63 plot(vare.pca, scaling = -1) vare.ca <- cca(d2) vare.ca Call: cca(X = d2)

Inertia Rank Total 2.734 Unconstrained 2.734 5 Inertia is scaled Chi-square

Eigenvalues for unconstrained axes: CA1 CA2 CA3 CA4 CA5 0.6669 0.5766 0.5192 0.4991 0.4726 plot(vare.ca) plot(vare.ca, scaling = 1)

Lt <- c(1,1,1,0,0,0) Lt <- as.data.frame(Lt) ef <- envfit(vare.pca, Lt, permu = 999) ef

***VECTORS

PC1 PC2 r2 Pr(>r) Lt -0.84117 -0.54077 0.6243 0.1 Permutation: free Number of permutations: 719 plot(vare.pca, display = "sites") plot(ef)

## Need to be done for now. It is time to go on a hot date with my wife! save.image("2019-02-15-1434_workspace.RData")

2019-02-18-0741_demux.txt

cd /home/mattbowser/2018_AWCC_soil_fungi
module load python/miniconda3-gcc6.1.0
source activate pipits_env
pipits_rereplicate -i out_funits_CaribouHills4/intermediate/derep.ITS2.sizefiltered.fasta -o out_funits_CaribouHills4/ITS.fasta --uc out_funits_CaribouHills4/intermediate/derep.uc

## Trying Bayexer.
./Bayexer \
  -i original_data/SAMP1-12_S4_L001_R1_001.fastq,original_data/SAMP1-12_S4_L001_R2_001.fastq \
  -o 2019-02-18_demux \
  -x mappingBayexer.txt

## That did not work. I think the index read files are needed, which I don't have.

2019-02-20-1433_making_R2_fastq_files.R

## Trying to construct R2 files.
dd <- "C:/Users/mattbowser/Documents/2019-02-18_AWCC_fungi_homework/2019-02-20_data"
wd <- "C:/Users/mattbowser/Documents/2019-02-18_AWCC_fungi_homework/"
setwd(dd)
require(Biostrings)

## Get file list.
fl <- dir(dd)
fl <- fl[grepl("R1", fl)]

## First load the big R2 file. r2 <- readDNAStringSet("Galaxy218.fastq", format="fastq") summary(r2) Length Class Mode 1354187 DNAStringSet S4

## Now trying to split the r2 by matching record labels.
this_file <- 1  # for testing.
for (this_file in 1:length(fl)) {
  r1 <- readDNAStringSet(fl[this_file], format="fastq")
  sl <- which(gsub("2:N:", "1:N:", names(r2)) %in% names(r1))
  r2t <- r2[sl]
  writeXStringSet(r2t, gsub("R1", "R2", fl[this_file]), format="fastq")
}

2019-02-20-1510_pispino.slurm

#!/bin/bash #SBATCH --job-name=AWCCpispino #SBATCH -n 1 # number of nodes #SBATCH -n 1 # number of tasks #SBATCH -p long # parition #SBATCH --account=bio # account code #SBATCH --time=1-01:00:00 # requested job time D- HH:MM:SS #SBATCH --mail-type=ALL # choose when you want to be emailed #SBATCH [email protected] # add your email address #SBATCH -o 2019-02-20-1510_pispino-%j.out # name of output file (the %j inserts the jobid) module load python/miniconda3-gcc6.1.0 # load required modules source activate pipits_env # load PIPITS environment cd /home/mattbowser/2018_AWCC_soil_fungi_try_02 srun --mpi=pmi2 pispino_createreadpairslist -i rawdata -o readpairslist.txt srun --mpi=pmi2 pispino_seqprep -i rawdata -o out_seqprep -l readpairslist.txt source deactivate # deactivate PIPITS module purge # unload those modules

2019-02-21-0816_making_R2_fastq_files.R

## Trying to construct R2 files.
dd <- "C:/Users/mattbowser/Documents/2019-02-18_AWCC_fungi_homework/2019-02-20_data"
wd <- "C:/Users/mattbowser/Documents/2019-02-18_AWCC_fungi_homework/"
setwd(dd)

#options(encoding = "native.enc") options(encoding = "en_US.UTF-8") require(ShortRead)

## Get file list. fl <- dir(dd) fl <- fl[grepl("R1", fl)]

## First load the big R2 file. r2 <- readFastq(dirPath=dd, pattern="Galaxy218.fastq") summary(r2) Length Class Mode 1354187 ShortReadQ S4

## Now trying to split the r2 by matching record labels. this_file <- 1 # for testing. for (this_file in 1:length(fl))#length(fl) { r1 <- readFastq(dirPath=dd, pattern=fl[this_file]) sl <- which(gsub("2:N:", "1:N:", id(r2)) %in% id(r1)) r2t <- r2[sl] writeFastq(r2t, gsub("R1", "R2", fl[this_file]), compress=FALSE ) }

2019-02-21-0958_Yeti_IO.txt

(pipits_env) [mattbowser@yeti-login20 2018_AWCC_soil_fungi_try_02] pispino_seqprep -i rawdata -o out_seqprep -l readpairslist.txt 2019-02-21 11:49:53 pispino_seqprep started 2019-02-21 11:49:53 Checking listfile 2019-02-21 11:49:53 ... done 2019-02-21 11:49:53 Counting sequences in rawdata 2019-02-21 11:49:59 ... number of reads: 577039 2019-02-21 11:49:59 Reindexing forward reads 2019-02-21 11:50:05 ... done 2019-02-21 11:50:05 Reindexing reverse reads 2019-02-21 11:50:13 ... done 2019-02-21 11:50:13 Joining paired-end reads [VSEARCH] 2019-02-21 11:50:37 ... number of joined reads: 359739 2019-02-21 11:50:37 Quality filtering [FASTX] 2019-02-21 11:53:10 ... number of quality filtered reads: 338832 2019-02-21 11:53:10 Converting FASTQ to FASTA [FASTX] (also removing reads with "N" nucleotide if specified with "--FASTX-n") 2019-02-21 11:54:27 ... number of prepped sequences: 338832 2019-02-21 11:54:27 Merging into a single file 2019-02-21 11:54:28 ... done 2019-02-21 11:54:28 Cleaning temporary directory 2019-02-21 11:54:29 ... done 2019-02-21 11:54:29 Done - pispino_seqprep completed (Resulting file: out_seqprep/prepped.fasta) 2019-02-21-1208_pipits_funits_001.slurm

#!/bin/bash #SBATCH --job-name=fu001 #SBATCH -n 1 # number of nodes #SBATCH -n 1 # number of tasks #SBATCH -p long # parition #SBATCH --account=bio # account code #SBATCH --time=1-00:00:00 # requested job time D- HH:MM:SS #SBATCH --mail-type=ALL # choose when you want to be emailed #SBATCH [email protected] # add your email address #SBATCH -o 2019-02-21-1208_pipits_funits_001.slurm- %j.out # name of output file (the %j inserts the jobid) module load python/miniconda3-gcc6.1.0 # load required modules source activate pipits_env # load PIPITS environment cd /home/mattbowser/2018_AWCC_soil_fungi_try_02 srun pipits_funits -i out_seqprep_001/prepped.fasta -o out_funits_001 -x ITS2 source deactivate # deactivate PIPITS module purge # unload those modules

2019-02-22-0849_splitting.R

## I am going to try to cut this up.
n <- 677664
n/16
[1] 42354
## Going to try making 16 separate directories.

## Making shell script in R. wd <- "/home/mattbowser/2018_AWCC_soil_fungi_try_02" setwd(wd) fn <- "2019-02-22-0846_split_script.sh" write("#!/bin/sh cd /home/mattbowser/2018_AWCC_soil_fungi_try_02 ", fn)

## Make directories. nn <- 1001:1016 nn <- as.character(nn) nn <- substr(nn, 2, 4) dn <- paste("out_seqprep_", nn, sep="") cmd <- paste("mkdir ", dn, sep="") write(cmd, fn, append=TRUE)

## Make files. sto <- n/16 * (1:16) cmd[1] <- "sed -e '42354q' out_seqprep/prepped.fasta > out_seqprep_001/prepped.fasta" cmd[2:16] <- paste("sed -e '1,", sto[1:15], "d;", sto[2:16], "q' out_seqprep/prepped.fasta > ", dn[2:16], "/prepped.fasta", sep="") write(cmd, fn, append=TRUE)

## I copied that to Yeti and ran it.

## Now making SLURM files. fn <- paste ("2019-02-22-0850_pipits_funits_", nn, ".slurm", sep="") hd <- paste("#!/bin/bash #SBATCH --job-name=fu", nn, " #SBATCH -n 1 # number of nodes #SBATCH -n 1 # number of tasks #SBATCH -p long # parition #SBATCH --account=bio # account code #SBATCH --time=1-00:00:00 # requested job time D- HH:MM:SS #SBATCH --mail-type=ALL # choose when you want to be emailed #SBATCH [email protected] # add your email address #SBATCH -o ", fn, "-%j.out # name of output file (the %j inserts the jobid) module load python/miniconda3-gcc6.1.0 # load required modules source activate pipits_env # load PIPITS environment cd /home/mattbowser/2018_AWCC_soil_fungi_try_02", sep="") cmd <- paste("srun pipits_funits -i ", dn, "/prepped.fasta -o out_funits_", nn, " -x ITS2", sep="") tl <- "source deactivate # deactivate PIPITS module purge # unload those modules" for(this_file in 1:16) { write(hd[this_file], fn[this_file]) write(cmd[this_file], fn[this_file], append=TRUE) write(tl, fn[this_file], append=TRUE) }

## Now make a script to submit all of those jobs. sfn <- fn fn <- "2019-02-22-0851_split_script.sh" write("#!/bin/sh ", fn) cmd <- paste("sbatch ", sfn, sep="") write(cmd, fn, append=TRUE)

2019-02-22-0846_split_script.sh

#!/bin/sh cd /home/mattbowser/2018_AWCC_soil_fungi_try_02 mkdir out_seqprep_001 mkdir out_seqprep_002 mkdir out_seqprep_003 mkdir out_seqprep_004 mkdir out_seqprep_005 mkdir out_seqprep_006 mkdir out_seqprep_007 mkdir out_seqprep_008 mkdir out_seqprep_009 mkdir out_seqprep_010 mkdir out_seqprep_011 mkdir out_seqprep_012 mkdir out_seqprep_013 mkdir out_seqprep_014 mkdir out_seqprep_015 mkdir out_seqprep_016 sed -e '42354q' out_seqprep/prepped.fasta > out_seqprep_001/prepped.fasta sed -e '1,42354d;84708q' out_seqprep/prepped.fasta > out_seqprep_002/prepped.fasta sed -e '1,84708d;127062q' out_seqprep/prepped.fasta > out_seqprep_003/prepped.fasta sed -e '1,127062d;169416q' out_seqprep/prepped.fasta > out_seqprep_004/prepped.fasta sed -e '1,169416d;211770q' out_seqprep/prepped.fasta > out_seqprep_005/prepped.fasta sed -e '1,211770d;254124q' out_seqprep/prepped.fasta > out_seqprep_006/prepped.fasta sed -e '1,254124d;296478q' out_seqprep/prepped.fasta > out_seqprep_007/prepped.fasta sed -e '1,296478d;338832q' out_seqprep/prepped.fasta > out_seqprep_008/prepped.fasta sed -e '1,338832d;381186q' out_seqprep/prepped.fasta > out_seqprep_009/prepped.fasta sed -e '1,381186d;423540q' out_seqprep/prepped.fasta > out_seqprep_010/prepped.fasta sed -e '1,423540d;465894q' out_seqprep/prepped.fasta > out_seqprep_011/prepped.fasta sed -e '1,465894d;508248q' out_seqprep/prepped.fasta > out_seqprep_012/prepped.fasta sed -e '1,508248d;550602q' out_seqprep/prepped.fasta > out_seqprep_013/prepped.fasta sed -e '1,550602d;592956q' out_seqprep/prepped.fasta > out_seqprep_014/prepped.fasta sed -e '1,592956d;635310q' out_seqprep/prepped.fasta > out_seqprep_015/prepped.fasta sed -e '1,635310d;677664q' out_seqprep/prepped.fasta > out_seqprep_016/prepped.fasta
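## For what it's worth, the same 16-way split could be done directly in R instead of via generated sed commands; a sketch, assuming two lines per record in prepped.fasta so the 42,354-line boundaries never cut a record in half:
## Split prepped.fasta into 16 chunks of whole records (illustration only;
## the sed-based script above is what was actually run).
lines <- readLines("out_seqprep/prepped.fasta")  # 677,664 lines, 2 per read
chunk <- length(lines) / 16                      # 42,354 lines per chunk
for (i in 1:16) {
  dn <- sprintf("out_seqprep_%03d", i)
  dir.create(dn, showWarnings = FALSE)
  writeLines(lines[((i - 1) * chunk + 1):(i * chunk)],
             file.path(dn, "prepped.fasta"))
}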

2019-02-22-0850_pipits_funits_001.slurm

#!/bin/bash #SBATCH --job-name=fu001 #SBATCH -n 1 # number of nodes #SBATCH -n 1 # number of tasks #SBATCH -p long # parition #SBATCH --account=bio # account code #SBATCH --time=1-00:00:00 # requested job time D- HH:MM:SS #SBATCH --mail-type=ALL # choose when you want to be emailed #SBATCH [email protected] # add your email address #SBATCH -o 2019-02-22-0850_pipits_funits_001.slurm- %j.out # name of output file (the %j inserts the jobid) module load python/miniconda3-gcc6.1.0 # load required modules source activate pipits_env # load PIPITS environment cd /home/mattbowser/2018_AWCC_soil_fungi_try_02 srun pipits_funits -i out_seqprep_001/prepped.fasta -o out_funits_001 -x ITS2 source deactivate # deactivate PIPITS module purge # unload those modules

2019-02-22-0851_split_script.sh

#!/bin/sh sbatch 2019-02-22-0850_pipits_funits_001.slurm sbatch 2019-02-22-0850_pipits_funits_002.slurm sbatch 2019-02-22-0850_pipits_funits_003.slurm sbatch 2019-02-22-0850_pipits_funits_004.slurm sbatch 2019-02-22-0850_pipits_funits_005.slurm sbatch 2019-02-22-0850_pipits_funits_006.slurm sbatch 2019-02-22-0850_pipits_funits_007.slurm sbatch 2019-02-22-0850_pipits_funits_008.slurm sbatch 2019-02-22-0850_pipits_funits_009.slurm sbatch 2019-02-22-0850_pipits_funits_010.slurm sbatch 2019-02-22-0850_pipits_funits_011.slurm sbatch 2019-02-22-0850_pipits_funits_012.slurm sbatch 2019-02-22-0850_pipits_funits_013.slurm sbatch 2019-02-22-0850_pipits_funits_014.slurm sbatch 2019-02-22-0850_pipits_funits_015.slurm sbatch 2019-02-22-0850_pipits_funits_016.slurm

2019-02-20-1531_funits-3956207.out pipits_funits 2.2, the PIPITS Project https://github.com/hsgweon/pipits ------

2019-02-21 12:02:44 pipits_funits started 2019-02-21 12:02:44 Checking input FASTA for illegal characters 2019-02-21 12:03:04 ... done 2019-02-21 12:03:04 Counting input sequences 2019-02-21 12:03:04 ... number of input sequences: 338832 2019-02-21 12:03:04 Dereplicating sequences for efficiency 2019-02-21 12:03:28 ... done 2019-02-21 12:03:28 Counting dereplicated sequences 2019-02-21 12:03:28 ... number of dereplicated sequences: 215865 2019-02-21 12:03:28 Extracting ITS2 from sequences [ITSx] 2019-02-25 12:59:13 ... done 2019-02-25 12:59:13 Counting ITS sequences (dereplicated) 2019-02-25 12:59:13 ... number of ITS sequences (dereplicated): 153120 2019-02-25 12:59:13 Removing short sequences below < 100bp 2019-02-25 12:59:14 ... done 2019-02-25 12:59:14 Counting length-filtered sequences (dereplicated) 2019-02-25 12:59:14 ... number of length-filtered sequences (dereplicated): 153074 2019-02-25 12:59:14 Re-inflating sequences 2019-02-25 12:59:21 ... done 2019-02-25 12:59:21 Counting sequences after re- inflation 2019-02-25 12:59:21 ... number of sequences with ITS subregion: 256247 2019-02-25 12:59:21 Cleaning temporary directory 2019-02-25 12:59:21 Done - pipits_funits ended successfully. (Your ITS sequences are "out_funits/ITS.fasta") 2019-02-25 12:59:21 Next step: pipits_process [ Example: pipits_process -i out_funits/ITS.fasta -o pipits_process ]

2019-02-25-1233_pipits_process.slurm

#!/bin/bash #SBATCH --job-name=processall #SBATCH -n 1 # number of nodes #SBATCH -n 1 # number of tasks #SBATCH -p long # parition #SBATCH --account=bio # account code #SBATCH --time=4-01:00:00 # requested job time D- HH:MM:SS #SBATCH --mail-type=ALL # choose when you want to be emailed #SBATCH [email protected] # add your email address #SBATCH -o 2019-02-25-1233_pipits_process-%j.out # name of output file (the %j inserts the jobid) module load python/miniconda3-gcc6.1.0 # load required modules source activate pipits_env # load PIPITS environment cd /home/mattbowser/2018_AWCC_soil_fungi_try_02 srun --mpi=pmi2 pipits_process -i out_funits/ITS.fasta -o pipits_process source deactivate # deactivate PIPITS module purge # unload those modules

2019-02-22-0850_pipits_funits_011.slurm- 3956517.out 2019-02-25 14:39:13 pipits_process started 2019-02-25 14:39:13 Generating a sample list from the input sequences 2019-02-25 14:39:14 Dereplicating and removing unique sequences prior to picking OTUs 2019-02-25 14:39:14 Picking OTUs [VSEARCH] 2019-02-25 14:39:18 Removing chimeras [VSEARCH] 2019-02-25 14:39:27 Renaming OTUs 2019-02-25 14:39:29 Mapping reads onto centroids [VSEARCH] 2019-02-25 14:40:54 Making OTU table 2019-02-25 14:40:56 Converting classic tabular OTU into a BIOM format [BIOM] 2019-02-25 14:41:51 Assigning taxonomy with UNITE [RDP Classifier] 2019-02-25 14:44:35 Reformatting RDP_Classifier output 2019-02-25 14:44:35 Adding assignment to OTU table [BIOM] 2019-02-25 14:44:36 Converting OTU table with taxa assignment into a BIOM format [BIOM] 2019-02-25 14:44:39 Phylotyping OTU table 2019-02-25 14:44:40 Cleaning temporary directory 2019-02-25 14:44:41 Number of reads used to generate OTU table: 248760 2019-02-25 14:44:41 Number of OTUs: 1311 2019-02-25 14:44:41 Number of phylotypes: 551 2019-02-25 14:44:41 Number of samples: 12 2019-02-25 14:44:41 Done - Resulting files are in "pipits_process" directory 2019-02-25 14:44:41 pipits_process ended successfully. 2019-02-25-1636_LULU_prep.slurm

#!/bin/bash #SBATCH --job-name=luluprep #SBATCH -n 1 # number of nodes #SBATCH -n 1 # number of tasks #SBATCH -p long # parition #SBATCH --account=bio # account code #SBATCH --time=4-01:00:00 # requested job time D- HH:MM:SS #SBATCH --mail-type=ALL # choose when you want to be emailed #SBATCH [email protected] # add your email address #SBATCH -o 2019-02-25-1636_LULU_prep-%j.out # name of output file (the %j inserts the jobid) module load python/miniconda3-gcc6.1.0 # load required modules source activate pipits_env # load PIPITS environment cd /home/mattbowser/2018_AWCC_soil_fungi_try_02 srun --mpi=pmi2 vsearch --usearch_global pipits_process/repseqs.fasta \ --db pipits_process/repseqs.fasta --self --id .84 - -iddef 1 \ --userout match_list.txt -userfields query+target+id \ --maxaccepts 0 --query_cov .9 --maxhits 10 source deactivate # deactivate PIPITS module purge # unload those modules

2019-02-25-1725_lulu.R

## Now in R. require(lulu) wd <- "C:/Users/mattbowser/Documents/2019-02- 18_AWCC_fungi_homework" setwd(wd)

## I deleted the first line and uncommented the header row in the otu_table.txt file.
otutab <- read.csv("pipits_process/otu_table_edited.txt", sep='\t', header=TRUE, as.is=TRUE)
row.names(otutab) <- otutab[,1]
otutab <- otutab[,2:(dim(otutab)[2]-1)]
matchlist <- read.table("match_list.txt", header=FALSE, as.is=TRUE, stringsAsFactors=FALSE)
cr <- lulu(otutab, matchlist,
  minimum_ratio_type = "min",
  minimum_match = 97
)

## That worked! ## Doing some saving. write.csv(cr$curated_table, "2019-02-25- 1740_curated_otu_table.csv") cr$curated_table == cr$original_table ## These are not all true... ## Let's look at the first lines... cr$original_table[1,] cr$curated_table[1,] ## The curated table appears to have been re-sorted. dim(cr$original_table) [1] 1311 12 dim(cr$curated_table) [1] 1311 12 sum(cr$original_table) [1] 248760 sum(cr$curated_table) [1] 248760

## Did LULU do anything? cr$original_table[500,] AWCC1 AWCC2 AWCC3 AWCC4 AWCC5 AWCC6 AWCC7 AWCC8 CaribouHills1 CaribouHills2 CaribouHills3 CaribouHills4 OTU900 0 0 0 0 0 0 0 1 1 0 0 40 cr$curated_table[which(row.names(cr$curated_table) == "OTU900"),] AWCC1 AWCC2 AWCC3 AWCC4 AWCC5 AWCC6 AWCC7 AWCC8 CaribouHills1 CaribouHills2 CaribouHills3 CaribouHills4 OTU900 0 0 0 0 0 0 0 1 1 0 0 40 ## So LULU did nothing at all there except to rearrange the table, not even doing relative abundance filtering?
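## A quick way to check whether the curated table really is just the original table with its rows re-ordered (a sketch using the cr object from the lulu() call above):
all.equal(
   cr$original_table[order(rownames(cr$original_table)), ],
   cr$curated_table[order(rownames(cr$curated_table)), ]
)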

## Trying to explicitly set relative occurrence.
cr2 <- lulu(otutab, matchlist,
   minimum_ratio_type = "min",
   minimum_match = 97,
   minimum_relative_cooccurence = 0.95
)
## That did the same thing.
## Just trying bumping down the minimum_match value to see what happens.
cr3 <- lulu(otutab, matchlist,
   minimum_ratio_type = "min",
   minimum_match = 96,
   minimum_relative_cooccurence = 0.95
)

## This time 34 OTUs were discarded. summary(cr3) Length Class Mode curated_table 12 data.frame list curated_count 1 -none- numeric curated_otus 1277 -none- character discarded_count 1 -none- numeric discarded_otus 34 -none- character runtime 1 difftime numeric minimum_match 1 -none- numeric minimum_relative_cooccurence 1 -none- numeric otu_map 5 data.frame list original_table 12 data.frame list

## Based on the reading I did previously, I am just going to go with the values used by Anslan et al. (2018). ct <- cr$curated_table dim(ct) [1] 1311 12

## Now filtering using N9 rule
rml <- function(x) {
   if (x < 10) {0} else {x}
}
ct2 <- ct
ct2[,] <- apply(ct2, c(1,2), rml)
sl <- !(rowSums(ct2) == 0)
ct2 <- ct2[sl,]
dim(ct2)
[1] 708 12
csct <- colSums(ct2)
csct
        AWCC1         AWCC2         AWCC3         AWCC4         AWCC5
         7639         12007         18445         21889         10224
        AWCC6         AWCC7         AWCC8 CaribouHills1 CaribouHills2
        13007         19994         30648         19602         20293
CaribouHills3 CaribouHills4
        32964         33000
write.csv(ct2, "2019-02-25-2007_curated_otu_table.csv")
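## For what it is worth, the same N9 filter can be written without the cell-by-cell apply(), which is much faster on larger tables (a sketch; ct is the curated table from above and ct2_alt is a hypothetical name):
ct2_alt <- ct
ct2_alt[ct2_alt < 10] <- 0                  # zero out counts of fewer than 10 reads
ct2_alt <- ct2_alt[rowSums(ct2_alt) > 0, ]  # drop OTUs with no reads left
identical(dim(ct2_alt), dim(ct2))           # should be TRUE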

## Now on yeti...
module load python/miniconda3-gcc6.1.0  # load required modules
source activate pipits_env              # load PIPITS environment
pipits_funguild.py -i pipits_process/otu_table.txt -o pipits_process/otu_table_funguild.txt
source deactivate  # deactivate PIPITS
module purge       # unload those modules

## Submitted this to FUNGuild at http://www.stbates.org/guilds/app.php.

## I also had to remove the space from "OTU ID;" changing it to "OTUID" worked:

Guilds v1.0 Beta report: - 450 assignments were made on OTUs within the input file! - Total calculating time = 11.57 seconds!

## Back in R. d1 <- read.delim("pipits_process/otu_table_funguild.guilds.txt") dim(d1) [1] 1311 23 require(sqldf) ct4 <- ct2 ct4$otu_id <- row.names(ct4) ct4 <- ct4[,c(13, 1:12)] names(d1) <- gsub("\\.", "_", names(d1)) ct6 <- sqldf(' select ct4.*, d1.Taxon, d1.Taxon_Level, d1.Trophic_Mode, d1.Guild, d1.Confidence_Ranking, d1.Growth_Morphology, d1.Trait, d1.Notes, d1.Citation_Source from ct4 left outer join d1 on ct4.otu_id = d1.OTUID ') dim(ct6) [1] 708 22 write.csv(ct6, "2019-02-25- 2037_guilds_assigned.csv", row.names=FALSE) levels(ct6$Guild) [1] "-" [2] "Algal Parasite--Leaf Saprotroph-Wood Saprotroph" [3] "Animal Endosymbiont-Animal Pathogen-Endophyte- Plant Pathogen-Undefined Saprotroph" [4] "Animal Endosymbiont-Animal Pathogen-Undefined Saprotroph" [5] "Animal Parasite-Fungal Parasite" [6] "Animal Pathogen" [7] "Animal Pathogen-Clavicipitaceous Endophyte- Fungal Parasite" [8] "Animal Pathogen-Dung Saprotroph-Endophyte- Epiphyte-Plant Saprotroph-Wood Saprotroph" [9] "Animal Pathogen-Dung Saprotroph-Endophyte- Lichen Parasite-Plant Pathogen-Undefined Saprotroph" [10] "Animal Pathogen-Endophyte-Epiphyte-Plant Pathogen-Undefined Saprotroph" [11] "Animal Pathogen-Endophyte-Fungal Parasite- Plant Pathogen-Wood Saprotroph" [12] "Animal Pathogen-Endophyte-Plant Pathogen-Wood Saprotroph" [13] "Animal Pathogen-Fungal Parasite-Undefined Saprotroph" [14] "Animal Pathogen-Plant Pathogen-Soil Saprotroph-Undefined Saprotroph" [15] "Animal Pathogen-Plant Pathogen-Undefined Saprotroph" [16] "Animal Pathogen-Soil Saprotroph" [17] "Animal Pathogen-Undefined Saprotroph" [18] "Arbuscular Mycorrhizal" [19] "Bryophyte Parasite-Ectomycorrhizal-Ericoid Mycorrhizal-Undefined Saprotroph" [20] "Bryophyte Parasite-Leaf Saprotroph-Soil Saprotroph-Undefined Saprotroph-Wood Saprotroph" [21] "Bryophyte Parasite-Litter Saprotroph-Wood Saprotroph" [22] "Dung Saprotroph" [23] "Dung Saprotroph-Ectomycorrhizal-Litter Saprotroph-Undefined Saprotroph" [24] "Dung Saprotroph-Ectomycorrhizal-Soil Saprotroph-Wood Saprotroph" [25] "Dung Saprotroph-Endophyte-Litter Saprotroph- Undefined Saprotroph" [26] "Dung Saprotroph-Endophyte-Undefined Saprotroph" [27] "Dung Saprotroph-Plant Saprotroph" [28] "Dung Saprotroph-Plant Saprotroph-Wood Saprotroph" [29] "Dung Saprotroph-Undefined Saprotroph" [30] "Dung Saprotroph-Wood Saprotroph" [31] "Ectomycorrhizal" [32] "Ectomycorrhizal-Endophyte-Ericoid Mycorrhizal- Litter Saprotroph-Orchid Mycorrhizal" [33] "Ectomycorrhizal-Fungal Parasite-Plant Pathogen-Wood Saprotroph" [34] "Ectomycorrhizal-Fungal Parasite-Soil Saprotroph-Undefined Saprotroph" [35] "Ectomycorrhizal-Orchid Mycorrhizal-Root Associated Biotroph" [36] "Ectomycorrhizal-Undefined Saprotroph" [37] "Ectomycorrhizal-Undefined Saprotroph-Wood Saprotroph" [38] "Ectomycorrhizal-Wood Saprotroph" [39] "Endomycorrhizal-Plant Pathogen-Undefined Saprotroph" [40] "Endophyte" [41] "Endophyte-Fungal Parasite-Plant Pathogen" [42] "Endophyte-Lichen Parasite-Plant Pathogen- Undefined Saprotroph" [43] "Endophyte-Lichen Parasite-Undefined Saprotroph" [44] "Endophyte-Litter Saprotroph-Soil Saprotroph- Undefined Saprotroph" [45] "Endophyte-Litter Saprotroph-Wood Saprotroph" [46] "Endophyte-Plant Pathogen" [47] "Endophyte-Plant Pathogen-Undefined Saprotroph" [48] "Endophyte-Plant Pathogen-Wood Saprotroph" [49] "Endophyte-Undefined Saprotroph-Wood Saprotroph" [50] "Ericoid Mycorrhizal" [51] "Fungal Parasite" [52] "Fungal Parasite-Lichen Parasite" [53] "Fungal Parasite-Plant Pathogen-Plant Saprotroph" [54] "Fungal Parasite-Undefined Saprotroph" [55] 
"Leaf Saprotroph-Plant Pathogen-Undefined Saprotroph-Wood Saprotroph" [56] "Lichenized-Undefined Saprotroph" [57] "Lichenized-Wood Saprotroph" [58] "Litter Saprotroph" [59] "Litter Saprotroph-Plant Pathogen" [60] "NULL" [61] "Orchid Mycorrhizal" [62] "Plant Pathogen" [63] "Plant Pathogen-Plant Saprotroph" [64] "Plant Pathogen-Undefined Saprotroph" [65] "Plant Pathogen-Wood Saprotroph" [66] "Plant Saprotroph-Wood Saprotroph" [67] "Soil Saprotroph" [68] "Undefined Saprotroph" [69] "Undefined Saprotroph-Undefined Biotroph" [70] "Undefined Saprotroph-Wood Saprotroph" [71] "Wood Saprotroph"

## Ok, I need to simplify this. Using Excel. gd <- levels(ct6$Guild) ## Going to start with my previous work. gd2 <- read.csv("2019-02-01- 1158_guilds_simplified.csv") gd <- as.data.frame(gd) length(gd) [1] 71 dim(gd2) [1] 60 2 gd3 <- sqldf(' select gd.gd as guild_o, gd2.guild_s from gd left outer join gd2 on gd.gd = gd2.guild_o ') write.csv(gd3, "2019-02-25- 2049_guilds_simplified.csv")

## Edited in Excel... gd2 <- read.csv("2019-02-25- 2056_guilds_simplified.csv") levels(gd2$guild_s) [1] "endophyte or parasite" [2] "endophyte, mycorrhizal, parasite, or saprotroph" [3] "endophyte, parasite, or saprotroph" [4] "mycorrhizal" [5] "mycorrhizal or saprotroph" [6] "saprotroph" [7] "unknown"

## Joining this to the data.
ct7 <- sqldf('
   select ct6.*, gd2.guild_s
   from ct6 left outer join gd2
   on ct6.Guild = gd2.guild_o
')
dim(ct7)
[1] 708 23
ct7 <- sqldf('
   select ct7.*, d1.taxonomy
   from ct7 left outer join d1
   on ct7.otu_id = d1.OTUID
')
dim(ct7)
[1] 708 24

## Saving here. write.csv(ct7, "2019-02-25- 2107_otu_table_tax_guilds.csv", row.names=FALSE)

## Removing non-fungi. sum(grepl("k__Fungi", ct7$taxonomy)) [1] 633 ct7 <- ct7[grepl("k__Fungi", ct7$taxonomy),]

## Now for an overall pie chart.
ct7$nreads <- apply(ct7[,2:13], 1, sum)
ag1 <- aggregate(ct7$nreads, by=list(ct7$guild_s), sum)
ag1 <- ag1[order(-ag1$x),]
ag1
                                          Group.1      x
7                                         unknown 105533
6                                      saprotroph  87387
3              endophyte, parasite, or saprotroph  27578
1                           endophyte or parasite   4291
4                                     mycorrhizal   3419
5                       mycorrhizal or saprotroph   2050
2 endophyte, mycorrhizal, parasite, or saprotroph   1189
pie(ag1$x, labels = ag1$Group.1, main="Fungal guild abundances summed over all samples")

## Ok, now trying to compare the groups. ct7$nreads_inside <- apply(ct7[,2:5], 1, sum) ct7$nreads_outside <- apply(ct7[,6:9], 1, sum) ct7$nreads_hills <- apply(ct7[,10:13], 1, sum) agin <- aggregate(ct7$nreads_inside, by=list(ct7$guild_s), sum) agout <- aggregate(ct7$nreads_outside, by=list(ct7$guild_s), sum) aghill <- aggregate(ct7$nreads_hills, by=list(ct7$guild_s), sum)

## Now percentages. agin$percent <- round(agin$x/sum(agin$x)*100) agout$percent <- round(agout$x/sum(agout$x)*100) aghill$percent <- round(aghill$x/sum(aghill$x)*100) agin Group.1 x percent 1 endophyte or parasite 1214 2 2 endophyte, mycorrhizal, parasite, or saprotroph 166 0 3 endophyte, parasite, or saprotroph 5751 11 4 mycorrhizal 356 1 5 mycorrhizal or saprotroph 59 0 6 saprotroph 10411 19 7 unknown 36647 67 agout Group.1 x percent 1 endophyte or parasite 1231 2 2 endophyte, mycorrhizal, parasite, or saprotroph 247 0 3 endophyte, parasite, or saprotroph 12254 17 4 mycorrhizal 2969 4 5 mycorrhizal or saprotroph 1735 2 6 saprotroph 25805 36 7 unknown 27401 38 aghill Group.1 x percent 1 endophyte or parasite 1846 2 2 endophyte, mycorrhizal, parasite, or saprotroph 776 1 3 endophyte, parasite, or saprotroph 9573 9 4 mycorrhizal 94 0 5 mycorrhizal or saprotroph 256 0 6 saprotroph 51171 49 7 unknown 41485 39

## Now to make a graph. Sorting. ag1 <- aggregate(ct7$nreads, by=list(ct7$guild_s), sum) agin <- agin[order(-ag1$x),] agout <- agout[order(-ag1$x),] aghill <- aghill[order(-ag1$x),]

pdf(file="2019-02-25-2131_guilds_comparison.pdf", width=920/170 * 3/2, height=700/170, pointsize = 10 ) cls <- rev(rainbow(7)) cls <- cls[c(3, 4, 5, 6, 7, 1, 2)] par(mfrow=c(2,3)) par(mar=c(0,0,2,0)) pie(rev(agin$x), labels = "", col=cls, main="AWCC inside") pie(rev(agout$x), labels = "", col=cls, main="AWCC outside") pie(rev(aghill$x), labels = "", col=cls, main="Caribou Hills") plot.new() legend( "top", legend = rev(agin[,1]), fill = cls ) dev.off()

## Saving this for now. save.image("2019-02-25-2154_workspace.RData")

2019-02-26-0945_diveristy.R

## Making a graph of diversity by category. wd <- "C:/Users/mattbowser/Documents/2019-02- 18_AWCC_fungi_homework" setwd(wd) load("2019-02-25-2154_workspace.RData") to10 <- function(x) { if (x > 0) {1} else {0} } ctp <- ct7[,1:13] ctp[,2:13] <- apply(ctp[,2:13], c(1,2), to10) div <- rep(0, 3) div[1] <- sum(apply(ctp[2:5], 1, max)) div[2] <- sum(apply(ctp[6:9], 1, max)) div[3] <- sum(apply(ctp[10:13], 1, max)) div [1] 290 245 285

## Plot pdf(file="2019-02-26-1003_plot_OTU_sums.pdf", width=5, height=5) bp1 <- barplot(div, names.arg = c("AWCC Inside", "AWCC Outside", "Caribou Hills"), ylab="Total Number of OTUs", ylim=c(0, max(div)*1.1 ), main="Soil Fungi" ) text(bp1, div+max(div)*0.05, div) dev.off() save.image("2019-02-26-1013_workspace.RData")

2019-02-27-0951_lifescanner_uploads.R wd <- "I:/BIOLOGY/Data/ProjectData/DNA_barcoding/LifeScanner/2019- 02-27_container_uploads" setwd(wd) require(xlsx) d1 <- read.xlsx("2019-02-26- 0951_lifescanner_vials.xlsx", sheetIndex=1) dim(d1) [1] 800 2 nk <- length(levels(d1$kit_code)) nk [1] 200 ## number of kits. nv <- length(levels(d1$vial_barcode)) nv [1] 800 d1$vial_barcode <- toupper(d1$vial_barcode) uc <- as.data.frame(d1$vial_barcode) names(uc) <- "BARCODE" uc$BARCODE <- as.character(uc$BARCODE)

## Now for the human-readable labels. uc$LABEL <- paste(substr(uc$BARCODE,1,5), substr(uc$BARCODE,12,14), sep="") uc$CONTAINER_TYPE <- "vial" uc$INSTITUTION_ACRONYM <- "KNWR" uc$DESCRIPTION <- "lifescanner vial" write.csv(uc, "2019-02-27- 1008_create_lifescanner_vial_series.csv", row.names=FALSE) ## Removed last carriage return in text editor before uploading. ## That failed to load. write.csv(uc, "2019-02-27- 1017_create_lifescanner_vial_series.csv", row.names=FALSE, quote=FALSE) ## Found one problem record: BOLD-3N2P2,BOLD-,vial,KNWR,lifescanner vial. ## Where was that? d1[which(d1$vial_barcode == "BOLD-3N2P2"),] kit_code vial_barcode 236 A3OP00 BOLD-3N2P2 ## Ok, so need to find kit A3OP00. ## Took me a while to find that particular box. d1[(d1$kit_code == "A3OP00"),] kit_code vial_barcode 233 A3OP00 BOLD-3NL1VYMZ2 234 A3OP00 BOLD-3NLFVXGO8 235 A3OP00 BOLD-3NM7KFP00 236 A3OP00 BOLD-3N2P2

## One of the vials is labeled "BOLD-2P2" ## Scan: BOLD-3NKZO82P2
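## A simple length check would catch truncated scans like this one before attempting an upload (a sketch; it assumes every valid barcode in this batch has the 14-character BOLD- format seen above):
bad <- uc$BARCODE[nchar(uc$BARCODE) != 14]
bad  ## should list only BOLD-3N2P2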

## Ok, edited the original file. d1 <- read.xlsx("2019-02-26- 0951_lifescanner_vials.xlsx", sheetIndex=1) d1$vial_barcode <- toupper(d1$vial_barcode) uc <- as.data.frame(d1$vial_barcode) names(uc) <- "BARCODE" uc$BARCODE <- as.character(uc$BARCODE) uc$LABEL <- paste(substr(uc$BARCODE,1,5), substr(uc$BARCODE,12,14), sep="") uc$CONTAINER_TYPE <- "vial" uc$INSTITUTION_ACRONYM <- "KNWR" uc$DESCRIPTION <- "lifescanner vial" write.csv(uc, "2019-02-27- 1036_create_lifescanner_vial_series.csv", row.names=FALSE) ## I uploaded that. It worked!

2019-02-27-2253_difference_testing.R

## Today I want to test for differences between sites. wd <- "C:/Users/mattbowser/Documents/2017_earthworm_soil_fungi_NGS/work_space/2018- 12-17_PIPITS" setwd(wd) load("2019-02-15-1434_workspace.RData") require(vegan) require(MASS)

## Continuing to follow the example of http://cc.oulu.fi/~jarioksa/opetus/metodi/vegantutor.pdf

## Trying Lumbricus as a factor. It should be numerically the same as the last thing I tried. Lt <- as.factor(c(rep("Lumbricus present", 3), rep("Lumbricus absent", 3))) Lt <- as.data.frame(Lt) d2.ca <- cca(d2) ef <- envfit(d2.ca, Lt, permu = 999) ef

***FACTORS:

Centroids: CA1 CA2 LtLumbricus absent 0.7950 0.0858 LtLumbricus present -0.7942 -0.0858

Goodness of fit: r2 Pr(>r) Lt 0.3194 0.3 Permutation: free Number of permutations: 719 plot(d2.ca, display = "sites") plot(ef) plot(d2.ca, display = "sites", type = "p") with(Lt, ordiellipse(d2.ca, Lt, kind = "se", conf = 0.95)) with(Lt, ordispider(d2.ca, Lt, col = "blue", label= TRUE)) with(Lt, ordihull(d2.ca, Lt, col="blue", lty=2)) plot(d2.ca, display = "sites") with(Lt, ordiellipse(d2.ca, Lt, kind = "se", conf = 0.95)) with(Lt, ordispider(d2.ca, Lt, col = "blue", label= TRUE)) with(Lt, ordihull(d2.ca, Lt, col="blue", lty=2)) plot(d2.ca, display = "sites") #with(Lt, ordiellipse(d2.ca, Lt, kind = "se", conf = 0.95, col=c("blue", "red"))) with(Lt, ordispider(d2.ca, Lt, col = c("blue", "red"), label= TRUE)) with(Lt, ordihull(d2.ca, Lt, col=c("blue", "red"), lty=2))

## constrained methods. d2.cca <- cca(d2 ~ Lt, Lt) Call: cca(formula = d2 ~ Lt, data = Lt)

Inertia Proportion Rank Total 2.7343 1.0000 Constrained 0.6073 0.2221 1 Unconstrained 2.1270 0.7779 4 Inertia is scaled Chi-square

Eigenvalues for constrained axes: CCA1 0.6073

Eigenvalues for unconstrained axes: CA1 CA2 CA3 CA4 0.5797 0.5620 0.5104 0.4749 plot(d2.cca) anova(d2.cca) 'nperm' >= set of all permutations: complete enumeration. Set of permutations < 'minperm'. Generating entire set. Permutation test for cca under reduced model Permutation: free Number of permutations: 719

Model: cca(formula = d2 ~ Lt, data = Lt) Df ChiSquare F Pr(>F) Model 1 0.60728 1.142 0.001389 ** Residual 4 2.12700 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 anova(d2.cca, by = "term", step=200) 'nperm' >= set of all permutations: complete enumeration. Set of permutations < 'minperm'. Generating entire set. Permutation test for cca under reduced model Terms added sequentially (first to last) Permutation: free Number of permutations: 719

Model: cca(formula = d2 ~ Lt, data = Lt) Df ChiSquare F Pr(>F) Lt 1 0.60728 1.142 0.001389 ** Residual 4 2.12700 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

## Looking at classification dis <- vegdist(d2) clus <- hclust(dis, "single") plot(clus)

## That will have to do for now. save.image("2019-02-28-0037_workspace.RData")

2019-02-28-1016_difference_testing.R

## Today I want to test for differences between sites. wd <- "C:/Users/mattbowser/Documents/2017_earthworm_soil_fungi_NGS/work_space/2018- 12-17_PIPITS" setwd(wd) require(vegan) require(MASS) load("2019-02-28-0037_workspace.RData")

## I am not sure what the best way is to test for the effect of Lumbricus terrestris.

## I read up on this.

## Vector fitting: ef <- envfit(vare.pca, Lt, permu = 999) ef

***FACTORS: Centroids: PC1 PC2 LtLumbricus absent 1.8168 1.1680 LtLumbricus present -1.8168 -1.1680

Goodness of fit: r2 Pr(>r) Lt 0.3122 0.1 Permutation: free Number of permutations: 719 plot(vare.pca, display = "sites") plot(ef, labels=levels(Lt$Lt)) ef <- envfit(vare.mds, Lt, permu = 999) ef

***FACTORS:

Centroids: NMDS1 NMDS2 LtLumbricus absent 0.1818 0.0001 LtLumbricus present -0.1818 -0.0001

Goodness of fit: r2 Pr(>r) Lt 0.5 0.3 Permutation: free Number of permutations: 719 plot(vare.mds, display = "sites") plot(ef, labels=levels(Lt$Lt)) ## That was 1D. It did not look right.

## Should I be doing vector fitting or constrained ordination to determine whether or not Lumbricus presence had an effect? vare.dis <- vegdist(d2) vare.mds0 <- isoMDS(vare.dis) vare.mds <- metaMDS(d2, try=100, trymax=200, trace=1) ... Warning message: In metaMDS(d2, try = 100, trymax = 200, trace = 1) : stress is (nearly) zero: you may have insufficient data vare.mds

Call: metaMDS(comm = d2, try = 100, trymax = 200, trace = 1) global Multidimensional Scaling using monoMDS

Data: d2 Distance: bray

Dimensions: 2 Stress: 8.076919e-05 Stress type 1, weak ties Two convergent solutions found after 100 tries Scaling: centring, PC rotation, halfchange scaling Species: expanded scores based on ‘d2’

## I think I just have insufficient data for this method.

## Ok, so CCA seems to be the best available method that will work for my data.
## Last night I did both direct gradient analysis (object d2.ca) and constrained analysis.
## The gradient analysis found Lumbricus presence to not be significant; the constrained analysis found Lumbricus to be a significant factor explaining fungal communities.
## The gradient plots sure looked better, though.

## ok, so these are my best I think. d2.ca <- cca(d2) d2.ca Call: cca(X = d2)

Inertia Rank Total 2.734 Unconstrained 2.734 5 Inertia is scaled Chi-square

Eigenvalues for unconstrained axes: CA1 CA2 CA3 CA4 CA5 0.6669 0.5766 0.5192 0.4991 0.4726 ef <- envfit(d2.ca, Lt, permu = 999) ef

***FACTORS:

Centroids: CA1 CA2 LtLumbricus absent 0.7950 0.0858 LtLumbricus present -0.7942 -0.0858

Goodness of fit: r2 Pr(>r) Lt 0.3194 0.3 Permutation: free Number of permutations: 719 pdf(file="2019-02-28-1157_gradient_plot.pdf", width=4, height=4 ) par(mar=c(5,5,1,1)) plot(d2.ca, display = "sites") with(Lt, ordiellipse(d2.ca, Lt, kind = "se", conf = 0.95, col=c("blue", "red"))) with(Lt, ordispider(d2.ca, Lt, col = c("blue", "red"), label= TRUE)) with(Lt, ordihull(d2.ca, Lt, col=c("blue", "red"), lty=2)) dev.off()

## Constrained method. d2.cca <- cca(d2 ~ Lt, Lt) d2.cca Call: cca(formula = d2 ~ Lt, data = Lt)

Inertia Proportion Rank Total 2.7343 1.0000 Constrained 0.6073 0.2221 1 Unconstrained 2.1270 0.7779 4 Inertia is scaled Chi-square

Eigenvalues for constrained axes: CCA1 0.6073

Eigenvalues for unconstrained axes: CA1 CA2 CA3 CA4 0.5797 0.5620 0.5104 0.4749 anova(d2.cca, by = "term", step=200) 'nperm' >= set of all permutations: complete enumeration. Set of permutations < 'minperm'. Generating entire set. Permutation test for cca under reduced model Terms added sequentially (first to last) Permutation: free Number of permutations: 719

Model: cca(formula = d2 ~ Lt, data = Lt) Df ChiSquare F Pr(>F) Lt 1 0.60728 1.142 0.001389 ** Residual 4 2.12700 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 pdf(file="2019-02-28-1201_constrained_plot.pdf", width=4, height=4 ) par(mar=c(5,5,1,1)) plot(d2.cca, display = "sites") with(Lt, ordiellipse(d2.cca, Lt, kind = "se", conf = 0.95, col=c("blue", "red"))) with(Lt, ordispider(d2.cca, Lt, col = c("blue", "red"), label= TRUE)) with(Lt, ordihull(d2.cca, Lt, col=c("blue", "red"), lty=2)) dev.off()

## Can I ask which species are most associated with presence or absence of Lumbricus?
str(d2.cca)
d2.cca$CCA$v
plot(d2.cca, type="t")

## d2.cca$CCA$v[which(rownames(d2.cca$CCA$v)=="OTU379")] d2.cca$CCA$v[which(rownames(d2.cca$CCA$v)=="OTU675")]

## This seems to get at it. save.image("2019-02-28-1242_workspace.RData")

2019-02-28-1619_difference_testing.R

## Today I want to test for differences between sites. wd <- "C:/Users/mattbowser/Documents/2017_earthworm_soil_fungi_NGS/work_space/2018- 12-17_PIPITS" setwd(wd) require(vegan) require(MASS) load("2019-02-01-1248_workspace.RData") load("2019-02-28-1242_workspace.RData")

## Going to try to make a prettier graph. ## Colors I used for different guilds on February 1 cls <- rev(rainbow(7)) cls <- cls[c(3, 4, 5, 6, 7, 1, 2)]

## Making color dataframe. gds <- levels(ct7$guild_s) cd <- as.data.frame(gds) cd$col <- cls require(sqldf) clrt <- ct7[,c("otu_id", "guild_s", "nreads")] clr2 <- sqldf(' select clrt.*, cd.col from clrt left outer join cd on clrt.guild_s = cd.gds ') otus <- colnames(d2) ndf <- as.data.frame(otus)

## Just need to re-order this. clr3 <- sqldf(' select ndf.otus, clr2.guild_s, clr2.col, clr2.nreads from ndf left outer join clr2 on ndf.otus = clr2.otu_id ') clr3$r <- (clr3$nreads*0.05/pi)^(1/2) pdf(file="2019-02-28-1748_gradient_plot.pdf", width=6, height=6 ) par(mar=c(5,5,1,1)) plot(d2.ca, display = "sites"#, #xlim=c(1, 1.5), #ylim=c(0.2,0.4) ) points(d2.ca, display = "species", pch=21, bg=clr3$col, #cex=clr3$lgr/3 cex=clr3$r ) #text(d2.ca, display="species", cex=clr3$nreads^0.5/20) with(Lt, ordiellipse(d2.ca, Lt, kind = "se", conf = 0.95, col=c("blue", "red"))) with(Lt, ordihull(d2.ca, Lt, col=c("blue", "red"), lty=2)) ordilabel(d2.ca, dis="sites") with(Lt, ordispider(d2.ca, Lt, col = c("blue", "red"), label= TRUE)) dev.off()

ct7[which(ct7$otu_id=="OTU209"),] ct7[which(ct7$otu_id=="OTU797"),] ## Those looked right. ct7[which(ct7$otu_id=="OTU261"),] pdf(file="2019-02-28-1759_constrained_plot.pdf", width=6, height=6 ) par(mar=c(5,5,1,1)) plot(d2.cca, display = "sites"#, #xlim=c(1, 1.5), #ylim=c(0.2,0.4) ) points(d2.cca, display = "species", pch=21, bg=clr3$col, #cex=clr3$lgr/3 cex=clr3$r ) #text(d2.cca, display="species", cex=clr3$nreads^0.5/20) with(Lt, ordiellipse(d2.cca, Lt, kind = "se", conf = 0.95, col=c("blue", "red"))) with(Lt, ordihull(d2.cca, Lt, col=c("blue", "red"), lty=2)) ordilabel(d2.cca, dis="sites") with(Lt, ordispider(d2.cca, Lt, col = c("blue", "red"), label= TRUE)) dev.off() save.image("2019-02-28-2015_workspace.RData")

2019-02-28-2045_guild_pie_chart.R

## Remaking one of those pie chart graphs for the article. wd <- "C:/Users/mattbowser/Documents/2017_earthworm_soil_fungi_NGS/work_space/2019- 02-01_FUNGuild" setwd(wd) load("2019-02-01-1248_workspace.RData") aga <- aggregate(ct7[,1], by=list(ct7$guild_s), sum) names(aga)[2] <- "s1" for (thissite in 2:6) { ag <- aggregate(ct7[,thissite], by=list(ct7$guild_s), sum) aga <- cbind(aga, ag$x) } names(aga)[3:7] <- paste("s", 2:6, sep="") aga Group.1 s1 s2 s3 s4 s5 s6 1 endophyte or parasite 38 181 248 22 145 131 2 endophyte, mycorrhizal, parasite, or saprotroph 18 0 469 0 291 63 3 endophyte, parasite, or saprotroph 1669 1424 1161 1805 2300 950 4 mycorrhizal 362 232 849 2859 968 1267 5 mycorrhizal or saprotroph 102 37 882 0 24 815 6 saprotroph 2326 1388 1741 590 987 2560 7 unknown 3680 2650 4259 2028 5049 1745 pdf(file="2019-02-28-2052_site_guilds.pdf", width=8.5, height=6, pointsize = 11 ) cls <- rev(rainbow(7)) cls <- cls[c(3, 4, 5, 6, 7, 1, 2)] par(mfrow=c(3,3)) par(mar=c(0,4,1,0)) for (thissite in 1:3) { if (thissite == 1){yl <- "Lumbricus present"} else {yl <- ""} pie(aga[,thissite + 1], labels = "", main=paste("Site", thissite), ylab=yl, col=cls) } for (thissite in 4:6) { if (thissite == 4){yl <- "Lumbricus absent"} else {yl <- ""} pie(aga[,thissite + 1], labels = "", main=paste("Site", thissite), ylab=yl, col=cls) } plot.new() plot.new() legend( "top", legend = aga[,1], fill = cls ) dev.off() save.image("2019-02-28-2112_workspace.RData")

2019-03-01-1007_improving_graphs.R

## Improving graphs for the article. wd <- "C:/Users/mattbowser/Documents/2017_earthworm_soil_fungi_NGS/work_space/2018- 12-17_PIPITS" setwd(wd) require(vegan) require(MASS) load("2019-02-28-2015_workspace.RData") ps <- 4 pdf(file="2019-03-01-1008_gradient_plot.pdf", width=ps, height=ps ) par(mar=c(5,5,1,1)) plot(d2.ca, display = "sites"#, #xlim=c(1, 1.5), #ylim=c(0.2,0.4) ) points(d2.ca, display = "species", pch=21, bg=clr3$col, #cex=clr3$lgr/3 cex=clr3$r ) #text(d2.ca, display="species", cex=clr3$nreads^0.5/20) with(Lt, ordiellipse(d2.ca, Lt, kind = "se", conf = 0.95, col=c("blue", "red"))) with(Lt, ordihull(d2.ca, Lt, col=c("blue", "red"), lty=2)) with(Lt, ordispider(d2.ca, Lt, col = c("blue", "red"), label= TRUE)) ordilabel(d2.ca, dis="sites") dev.off() pdf(file="2019-03-01-1013_constrained_plot.pdf", width=ps, height=ps ) par(mar=c(5,5,1,1)) plot(d2.cca, display = "sites"#, #xlim=c(1, 1.5), #ylim=c(0.2,0.4) ) points(d2.cca, display = "species", pch=21, bg=clr3$col, #cex=clr3$lgr/3 cex=clr3$r ) #text(d2.cca, display="species", cex=clr3$nreads^0.5/20) with(Lt, ordiellipse(d2.cca, Lt, kind = "se", conf = 0.95, col=c("blue", "red"))) with(Lt, ordihull(d2.cca, Lt, col=c("blue", "red"), lty=2)) with(Lt, ordispider(d2.cca, Lt, col = c("blue", "red"), label= TRUE)) ordilabel(d2.cca, dis="sites") dev.off()

## Now for some stats for the article. summary(d2.ca)

Call: cca(X = d2)

Partitioning of scaled Chi-square: Inertia Proportion Total 2.734 1 Unconstrained 2.734 1

Eigenvalues, and their contribution to the scaled Chi-square

Importance of components: CA1 CA2 CA3 CA4 CA5 Eigenvalue 0.6669 0.5766 0.5192 0.4991 0.4726 Proportion Explained 0.2439 0.2109 0.1899 0.1825 0.1728 Cumulative Proportion 0.2439 0.4548 0.6446 0.8272 1.0000

Scaling 2 for species and site scores * Species are scaled proportional to eigenvalues * Sites are unscaled: weighted dispersion equal on all dimensions

## I want to know how mycorrhizal species come out in the CCA. spcca1 <- summary(d2.cca)$species[,1] spcca1 <- as.data.frame(spcca1) spcca1$otu_id <- rownames(spcca1) require(sqldf) ct8 <- sqldf(' select ct7.*, spcca1.spcca1 from ct7 left outer join spcca1 on ct7.otu_id = spcca1.otu_id ') sl <- ct8$guild_s == "mycorrhizal" sum(sl) [1] 40 mc <- ct8[sl,] plot(1:40, mc$spcca1) mc <- mc[order(mc$spcca1, mc$nreads),] plot(mc$spcca1, 1:40) mc$r <- (mc$nreads*0.05/pi)^(1/2) plot(mc$spcca1, 1:40, pch=21, cex=mc$r, bg="#FFDB00FF", xlab="CCA1", ylab="Mycorrhizal OTUs" ) ps <- 4 pdf(file="2019-03-01-1116_mycorrhizal_otus.pdf", width=ps, height=ps ) par(mar=c(5,5,1,1)) plot(rev(mc$spcca1), rev(1:40), pch=21, cex=rev(mc$r), bg="#FFDB00FF", xlab="CCA1", ylab="Mycorrhizal OTUs" ) dev.off()

2019-03-01-2111_scrutinizing_some_OTUs.R

## Taking a closer look at some OTUs. wd <- "C:/Users/mattbowser/Documents/2017_earthworm_soil_fungi_NGS/work_space/2018- 12-17_PIPITS" setwd(wd) require(vegan) require(MASS) load("2019-02-28-2015_workspace.RData")

## I want to know how mycorrhizal species come out in the CCA. spcca1 <- summary(d2.cca)$species[,1] spcca1 <- as.data.frame(spcca1) spcca1$otu_id <- rownames(spcca1) require(sqldf) ct8 <- sqldf(' select ct7.*, spcca1.spcca1 from ct7 left outer join spcca1 on ct7.otu_id = spcca1.otu_id ') sl <- ct8$guild_s == "mycorrhizal" sum(sl) [1] 40 mc <- ct8[sl,] mc$r <- (mc$nreads*0.05/pi)^(1/2) write.csv(mc, "2019-03-01-2114_myc_OTUs.csv", row.names=FALSE)

## Top 4, all > 300 reads, all more abundant in Lumbricus-free sites. OTU261 OTU252 OTU233 OTU810
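## For reference, those rows can be pulled straight back out of the mycorrhizal table built above (a sketch; mc, nreads, and spcca1 are as defined in the script above):
head(mc[order(-mc$nreads), c("otu_id", "nreads", "spcca1")], 4)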

>OTU261 AAAAGTTCTCAACCACACCGGTTTCTCTGGTGTGGCTTGGATTTGGGGGTTTGCAGGCTTTCAAGTCGGCTCTCCTGAAA

AAGATTAGCGGTATCTGAGCAGAAACCAAGCCTCGGGCGTGATAATTATCTATGCCTTGGTGTATAATCTGTGTGGGCTG

CTTATAACTGGAGAGTGTTGACAG

This is an unidentified Inocybe.

Submitted to UNITE analysis at https://unite.ut.ee/analysis.php

Best hit is Inocybe borealis. This is a mesic boreal forest species commonly associated with Betula and Picea (Kokkonen and Vauras, 2012).

>OTU252 ATATATATATATCAACCTTCTCTTTTTGAGTGGTTTGGATGTGGGGGTTTGCTGGCCTTGTAAAGGGTTCAGCTCCCCTG

AAATGCATTAGCAGAACAACCCTGTTCATTGGTGTGATAACTATCTACGCTATTGAATGTAAGGGGCAGTTCAGCTTTCT

AACAGTCCTCGGACAATTCATCATTA

Matches for Cortinarius casimiri (most common) Cortinarius decipiens Cortinarius helvolus Cortinarius saturninus

Cortinarius casimiri (Velen.) Huijsman, 1955

>OTU233 TACTCTCACACTCTCTAATTAGTTAGAGAGCAGTGGATTTGGATGCTGCCTGGTACTTACTGTCAGGCTCATCTTAAATG

AAGTAGTGCGACTCTTAGTTAAACATAGTACGGCGTGATAAGTAACCCTCGCTGTTTTCTGTCTGATTAAGAGCTCTGTG

CTTCAAACCGTCTCAGGACAATATTTGATA

Helvellosebacina helvelloides (Schwein.) Oberw., Garnica & K. Riess, 2014 What is that?

>OTU810 CAACCCTCAAGCACTGCTTGGTGTTGGGCCCTGCCCGTCGCGGCCGGCCCTAAAGACAGTGGCGGCGCCGTCTGGCTCTA

AGCGTAGTACAACTCTCGCTCTGGAGCCCTGCGGTAGCTTGCCAGAACCCCTAATCTTAT

Oidiodendron pilicola Kobayasi, 1969

2019-03-05-0807_PlutoF_upload_prep.R

## Today I am formatting data to try to upload sequence/observation records to PlutoF.

## Loading older data. load("C:/Users/mattbowser/Documents/2017_earthworm_soil_fungi_NGS/work_space/2018- 12-17_PIPITS/2019-02-28-2015_workspace.RData") wd <- "C:/Users/mattbowser/Documents/2017_earthworm_soil_fungi_NGS/work_space/2019- 03-04_PlutoF_submissions" setwd(wd)

## Going to need the OTU representative sequences... require(Biostrings) seq <- readDNAStringSet("C:/Users/mattbowser/Documents/2017_earthworm_soil_fungi_NGS/work_space/2018- 12-17_PIPITS/out_process/repseqs.fasta")

## Getting started on an appropriate data frame. ct8 <- ct7

## Loading example. ex1 <- read.csv("2019-03-04- 1535_sequence_upload_test.csv")

## I think the hardest part here will be coming up with taxonomy labels that will work. require(reshape) accid <- levels(as.factor(ct8$accid)) txdf <- colsplit(accid, split="_", names=1:6) txdf$accid <- accid txdf$Determination.Taxon.name <- "" txdf$Determination.Taxon.name.add. <- "" txdf$X1 <- as.character(txdf$X1) txdf$X2 <- as.character(txdf$X2) txdf$X3 <- as.character(txdf$X3) txdf$X4 <- as.character(txdf$X4) txdf$X5 <- as.character(txdf$X5) txdf$X6 <- as.character(txdf$X6) for(this_name in 1:length(accid)) { if (txdf$X1[this_name] == "unidentified") { txdf$Determination.Taxon.name[this_name] <- "Fungi" } if (txdf$X1[this_name] == txdf$accid[this_name]) { txdf$Determination.Taxon.name[this_name] <- txdf$accid[this_name] } if (txdf$X2[this_name] == "sp") { txdf$Determination.Taxon.name[this_name] <- txdf$X1[this_name] txdf$Determination.Taxon.name.add.[this_name] <- paste(txdf$X2[this_name], txdf$X3[this_name], sep=" ") } if ( grepl("SH", txdf$X3[this_name]) & ! (txdf$X2[this_name] == "sp") ) { txdf$Determination.Taxon.name[this_name] <- paste(txdf$X1[this_name], txdf$X2[this_name], sep=" ") txdf$Determination.Taxon.name.add.[this_name] <- txdf$X3[this_name] } if ( grepl("GS", txdf$X1[this_name]) ) { txdf$Determination.Taxon.name[this_name] <- "Fungi" } } txdf ## That looked good. names(txdf) <- gsub("\\.", "_", names(txdf)) require(sqldf) ct9 <- sqldf(' select ct8.*, txdf.Determination_Taxon_name, txdf.Determination_Taxon_name_add_ from ct8 left outer join txdf on ct8.accid = txdf.accid ')

## Ok now it is time to make a data table conforming to the requirements of the PlutoF uploads.
## First, paring down to all that I need.
ct10 <- ct9[,c(1:6,8,32,33)]
ct11 <- melt(ct10, id.vars=c("otu_id", "Determination_Taxon_name", "Determination_Taxon_name_add_"))
ct11$variable <- gsub("X5348\\.", "", ct11$variable)
ct11$variable <- gsub("\\.MSITS3", "", ct11$variable)
dim(ct11)
[1] 2412 5
## Have to get rid of those all zero observations. This is presence only.
ct11 <- ct11[ct11$value > 0,]
dim(ct11)
[1] 623 5
## Whoa, is that different than what I reported? Nope, that is the same. Good.
names(ct11)[4] <- "Parent_Name"
names(ct11)[5] <- "Abundance_Value"
ct11$Parent_Type <- "materialsample"
ct11$Sequence_ID <- paste(ct11$Parent_Name, ct11$otu_id, sep="-")
ct11$Sequenced_regions <- "ITS2"
ct11$Sequence <- ""
for(this_otu in 1:dim(ct11)[2]) {
   sl <- which(names(seq) == ct11$otu_id[this_otu])
   ct11$Sequence[this_otu] <- as.character(seq[sl])
}
ct11$Isolation_source <- "Soil fungal DNA"
ct11$Forward.primer.name <- "ITS3F"
ct11$Forward.primer.sequence <- "GCATCGATGAAGAACGCAGC"
ct11$Reverse.primer.name <- "ITS4R"
ct11$Reverse.primer.sequence <- "TCCTCCGCTTATTGATATGC"
ct11$Locked..available.for.UNITE <- 1
ct11$Request.UNITE.accession.number <- 1
ct11$Abundance.Method.description <- "read count"
ct11$Rights.holder <- "Matthew Bowser"
ct11 <- ct11[2:18]
names(ct11) <- gsub("_", ".", names(ct11))
## Saving here.
save.image("2019-03-05-1016_workspace.RData")

## Saving first record to try and load it. write.csv(ct11[1,], "2019-03-05- 1017_load_record_1.csv", row.names=FALSE)

## Got error: [10:19:09] ValidationError: Error preparsing source file: Selection is missing a required field Sequence ID ## Bummer. There are spaces in the field names required by PlutoF.

## Trying tidyverse, which can handle spaces in dataframe column names.
library(tidyverse)
ct12 <- as_tibble(ct11)
names(ct12)[6] <- "Sequence ID"
names(ct12)[7] <- "Sequenced regions"
names(ct12)[9] <- "Isolation source"
names(ct12)[10] <- "Forward primer name"
names(ct12)[11] <- "Forward primer sequence"
names(ct12)[12] <- "Reverse primer name"
names(ct12)[13] <- "Reverse primer sequence"
names(ct12)[14] <- "Locked, available for UNITE"
names(ct12)[15] <- "Request UNITE accession number"
names(ct12)[16] <- "Abundance.Method description"
names(ct12)[17] <- "Rights holder"

## Saving first record to try and load it. write.csv(ct12[1,], "2019-03-05- 1040_load_record_1.csv", row.names=FALSE)

## Got error: [10:40:48] ValidationError: Error preparsing source file: Selection is missing a required field Sekvents

## Trying again without quoting.
write.csv(ct12[1,], "2019-03-05-1045_load_record_1.csv", row.names=FALSE, quote=FALSE)

## Got error: [10:46:07] ValidationError: Error preparsing source file: Selection is missing a required field Country

## Trying adding that field. ct12$Sampling.event.Sampling.area.Country <- "" names(ct12)[18] <- "Sampling event.Sampling area.Country" write.csv(ct12[1,], "2019-03-05- 1049_load_record_1.csv", row.names=FALSE, quote=FALSE)

## Got error. [10:50:59] ValidationError: Error preparsing source file: Selection is missing a required field Taxon name ## Oh yes, there was a problem with my field names. names(ct12)[1] <- "Determination.Taxon name" names(ct12)[2] <- "Determination.Taxon name add." write.csv(ct12[1,], "2019-03-05- 1054_load_record_1.csv", row.names=FALSE, quote=FALSE)

## Got error:
[10:54:25] ValidationError: Error preparsing source file: 'Locked' is not in list
## That use of a comma in the column name is a pain.
## I might just have to manually edit the header to make it conform.

## I did so. ## Got error: [10:58:53] ValidationError: Error preparsing source file: Selection is missing a required field Sekvents

## Maybe I will have to... ## My wife called. We have a newly sick kid and other kids need to be taken to town for engagements, so I must go home now.

## Back at the office. ## I think I will have to map the values I have onto that example template file that worked. ## I still don't know about those two UNITE boolean fields, whether they should be 1/0, yes/no, true/false, etc. dim(ex1) [1] 1 75 ct13 <- as_tibble(ex1) ct13[2:nrow(ct12),] <- NA

## Now filling. ct13$Parent.Type <- "materialsample" ct13$Parent.Name <- ct12$Parent.Name ct13$Sequence.ID <- ct12$`Sequence ID` ct13$Sequenced.regions <- "ITS2" ct13$Sequence <- ct12$Sequence ct13$Forward.primer.name <- "ITS3F" ct13$Forward.primer.sequence <- "GCATCGATGAAGAACGCAGC" ct13$Reverse.primer.name <- "ITS4R" ct13$Reverse.primer.sequence <- "TCCTCCGCTTATTGATATGC" ct13$Locked..available.for.UNITE <- ct12$`Locked, available for UNITE` ct13$Request.UNITE.accession.number <- ct12$`Request UNITE accession number` ct13$Determination.Taxon.name <- ct12$`Determination.Taxon name` ct13$Abundance.Value <- ct12$Abundance.Value ct13$Abundance.Method.description <- ct12$`Abundance.Method description` ct13$Rights.holder <- ct12$`Rights holder`

## Saving here. save.image("2019-03-05-1701_workspace.RData") write.csv(ct13[1,], "2019-03-05- 1702_load_record_1.csv", row.names=FALSE, quote=FALSE, na="")

## Manually replaced header. ## Got error: ValidationError: Error preparsing source file: Selection is missing a required field Sekvents

## I deleted those boolean 1 values and it seemed to work. ## It is taking quite a while, though. ## Import queue at the URI below. ## https://plutof.ut.ee/#/import ## That appears to be parked. I do not know what is going on there, perhaps a large number of records being imported by someone else like what goes on when large numbers of records are imported through Arctos' bulkloader.

2019-03-05-1803_soil_fungi_map.R ## Making a map for the Lumbricus soil fungi article. wd <- "C:/Users/mattbowser/Documents/2017_earthworm_soil_fungi_NGS/work_space/2019- 02-13_map" knwr = "J:/Goedata/Boundaries/KNWR/knwr.shp" lakes = "J:/Goedata/Hydro/KP_lakes/lakes.shp" streams = "J:/Goedata/Hydro/KP_streams/streams.shp" roads = "J:/Goedata/Transportation/Roads/roads.shp" require(maptools) require(rgdal) require(raster) require(GISTools) albers <- "+proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs" wgs84 <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"

## Load shape files. #knwr <- readShapeSpatial(knwr) lakes <- readShapeSpatial(lakes) streams <- readShapeSpatial(streams) roads <- readOGR(dsn=roads, layer="roads")

#proj4string(knwr) <- CRS(albers) proj4string(lakes) <- CRS(albers) setwd(wd)

## load points. pts1 <- read.csv("2019-02-13-1009_coordinates.csv") coordinates(pts1) <- c("lon", "lat") proj4string(pts1) <- CRS(wgs84) sitesa <- spTransform(pts1, CRS(albers)) water <- "#93CCEA" ## #76D7EA = Crayola Sky Blue "#BEBEBE" #93CCEA is Crayola cornflower land <- "#E8E8E8" lab <- c("soil samples", "water", "roads", "Lumbricus") ldf <- as.data.frame(lab) ldf$fill <- c(NA, "#93CCEA", NA, NA) ldf$border <- c(NA, "#1d6b95", NA, NA) ldf$lwd <- c(NA, 1, 2, NA) ldf$pch <- c(1, NA, NA, 21) ldf$ptcex <- c(1.2, NA, NA, 1.2) ldf$ptlwd <- c(2, NA, NA, 2) ldf$col <- c("black", "#1d6b95", "#888888", "red") ldf <- ldf[c(2,3,4,1),] pdf(file="2019-03-05-1804_soil_fungi_map.pdf", width=6, height=6 ) par(mar=c(0.1, 0.1, 0.1, 0.1)) par(bg=land) plot(sitesa, pch="", bg=land ) plot(streams, add=TRUE, col="#1d6b95", lwd=1 ) plot(roads, col="#888888", lwd=2, add=TRUE ) points( sitesa, lwd=2, cex=1.2 ) points( 160973.5, 1204136, pch=21, cex=40, col="red", lwd=2 ) plot(lakes, add=TRUE, col=water, border="#1d6b95", lwd=1 ) text(161110, 1204160, "Stormy Lake", srt=46 ) legend("topright", bg="white", legend=ldf$lab, fill=ldf$fill, border=ldf$border, lwd=ldf$lwd, pch=ldf$pch, pt.cex=ldf$ptcex, pt.lwd=ldf$ptlwd, col=ldf$col ) text( sitesa, lab=paste("site", 1:6, sep=""), pos=3 ) map.scale(x=160844.7+20, y=1204136+130, len=200, ndivs=2, units="m", subdiv=100) north.arrow(xb=160844.7+0, yb=1204136+60, len=8, lab="N") dev.off()

2019-03-06-1627_PlutoF_upload_prep.R

## Got one record to load. Trying to load more. wd <- "C:/Users/mattbowser/Documents/2017_earthworm_soil_fungi_NGS/work_space/2019- 03-04_PlutoF_submissions" setwd(wd) load("2019-03-05-1701_workspace.RData") library(tidyverse)

## Fiddling with a couple of Boolean variables. ct13$Locked..available.for.UNITE <- "Yes" ct13$Request.UNITE.accession.number <- "Yes"

## Now saving a file. write.csv(ct13[2,], "2019-03-06- 1522_load_record_2.csv", row.names=FALSE, quote=FALSE, na="")

## Got that to load after working with importer. Boolean values need to be TRUE or FALSE. Also, the taxon "Fungi" had to be selected manually. ## Also, to get this record to be associated with a project I needed to select the project association when importing; I could not figure out how to do this afterward for record 1. ct13$Locked..available.for.UNITE <- TRUE ct13$Request.UNITE.accession.number <- TRUE write.csv(ct13[1,], "2019-03-06- 1534_load_record_1.csv", row.names=FALSE, quote=FALSE, na="")

## That worked splendidly after replacing the header file. write.csv(ct13[3:10,], "2019-03-06- 1538_load_records_3-10.csv", row.names=FALSE, quote=FALSE, na="")

## Whoa, there was a problem there. Why do only the first 9 records in the dataframe have sequences?

## Found the problem in my code from yesterday: the sequence-filling loop ran over 1:dim(ct11)[2] (the number of columns, nine at that point) instead of 1:nrow(ct11), so only the first nine records got sequences. Rerunning the code from yesterday with that fixed.

## Going to need the OTU representative sequences... require(Biostrings) seq <- readDNAStringSet("C:/Users/mattbowser/Documents/2017_earthworm_soil_fungi_NGS/work_space/2018- 12-17_PIPITS/out_process/repseqs.fasta")

## Getting started on an appropriate data frame. ct8 <- ct7

## Loading example. ex1 <- read.csv("2019-03-04- 1535_sequence_upload_test.csv") require(reshape) accid <- levels(as.factor(ct8$accid)) txdf <- colsplit(accid, split="_", names=1:6) txdf$accid <- accid txdf$Determination.Taxon.name <- "" txdf$Determination.Taxon.name.add. <- "" txdf$X1 <- as.character(txdf$X1) txdf$X2 <- as.character(txdf$X2) txdf$X3 <- as.character(txdf$X3) txdf$X4 <- as.character(txdf$X4) txdf$X5 <- as.character(txdf$X5) txdf$X6 <- as.character(txdf$X6) for(this_name in 1:length(accid)) { if (txdf$X1[this_name] == "unidentified") { txdf$Determination.Taxon.name[this_name] <- "Fungi" } if (txdf$X1[this_name] == txdf$accid[this_name]) { txdf$Determination.Taxon.name[this_name] <- txdf$accid[this_name] } if (txdf$X2[this_name] == "sp") { txdf$Determination.Taxon.name[this_name] <- txdf$X1[this_name] txdf$Determination.Taxon.name.add.[this_name] <- paste(txdf$X2[this_name], txdf$X3[this_name], sep=" ") } if ( grepl("SH", txdf$X3[this_name]) & ! (txdf$X2[this_name] == "sp") ) { txdf$Determination.Taxon.name[this_name] <- paste(txdf$X1[this_name], txdf$X2[this_name], sep=" ") txdf$Determination.Taxon.name.add.[this_name] <- txdf$X3[this_name] } if ( grepl("GS", txdf$X1[this_name]) ) { txdf$Determination.Taxon.name[this_name] <- "Fungi" } } #txdf ## That looked good. names(txdf) <- gsub("\\.", "_", names(txdf)) require(sqldf) ct9 <- sqldf(' select ct8.*, txdf.Determination_Taxon_name, txdf.Determination_Taxon_name_add_ from ct8 left outer join txdf on ct8.accid = txdf.accid ')

## Ok now it is time to make a data table conforming to the requiremtns of the PlutoF uploads. ## First, paring down to all that I need. ct10 <- ct9[,c(1:6,8,32,33)] ct11 <- melt(ct10, id.vars=c("otu_id", "Determination_Taxon_name", "Determination_Taxon_name_add_")) ct11$variable <- gsub("X5348\\.", "", ct11$variable) ct11$variable <- gsub("\\.MSITS3", "", ct11$variable) dim(ct11) [1] 2412 5 ## Have to get rid of those all zero observations. This is presence only. ct11 <- ct11[ct11$value > 0,] dim(ct11) [1] 623 5 names(ct11)[4] <- "Parent_Name" names(ct11)[5] <- "Abundance_Value" ct11$Parent_Type <- "materialsample" ct11$Sequence_ID <- paste(ct11$Parent_Name, ct11$otu_id, sep="-") ct11$Sequenced_regions <- "ITS2" ct11$Sequence <- "" for(this_otu in 1:nrow(ct11)) { sl <- which(names(seq) == ct11$otu_id[this_otu]) ct11$Sequence[this_otu] <- as.character(seq[sl]) } ct11$Isolation_source <- "Soil fungal DNA" ct11$Forward.primer.name <- "ITS3F" ct11$Forward.primer.sequence <- "GCATCGATGAAGAACGCAGC" ct11$Reverse.primer.name <- "ITS4R" ct11$Reverse.primer.sequence <- "TCCTCCGCTTATTGATATGC" ct11$Locked..available.for.UNITE <- 1 ct11$Request.UNITE.accession.number <- 1 ct11$Abundance.Method.description <- "read count" ct11$Rights.holder <- "Matthew Bowser" ct12 <- ct11[2:18] names(ct12) <- gsub("_", ".", names(ct11)) names(ct12)[6] <- "Sequence ID" names(ct12)[7] <- "Sequenced regions" names(ct12)[9] <- "Isolation source" names(ct12)[10] <- "Forward primer name" names(ct12)[11] <- "Forward primer sequence" names(ct12)[12] <- "Reverse primer name" names(ct12)[13] <- "Reverse primer sequence" names(ct12)[14] <- "Locked, available for UNITE" names(ct12)[15] <- "Request UNITE accession number" names(ct12)[16] <- "Abundance.Method description" names(ct12)[17] <- "Rights holder"

## Trying adding that field. ct12$Sampling.event.Sampling.area.Country <- "" names(ct12)[18] <- "Sampling event.Sampling area.Country" ## Oh yes, there was a problem with my field names. names(ct12)[1] <- "Determination.Taxon name" names(ct12)[2] <- "Determination.Taxon name add." ct13 <- as_tibble(ex1) ct13[2:nrow(ct12),] <- NA

## Now filling. ct13$Parent.Type <- "materialsample" ct13$Parent.Name <- ct12$Parent_Name ct13$Sequence.ID <- ct12$`Sequence ID` ct13$Sequenced.regions <- "ITS2" ct13$Sequence <- ct12$Sequence ct13$Forward.primer.name <- "ITS3F" ct13$Forward.primer.sequence <- "GCATCGATGAAGAACGCAGC" ct13$Reverse.primer.name <- "ITS4R" ct13$Reverse.primer.sequence <- "TCCTCCGCTTATTGATATGC" ct13$Locked..available.for.UNITE <- TRUE ct13$Request.UNITE.accession.number <- TRUE ct13$Determination.Taxon.name <- ct12$`Determination.Taxon name` ct13$Abundance.Value <- ct12$Abundance_Value ct13$Abundance.Method.description <- ct12$`Abundance.Method description` ct13$Rights.holder <- ct12$`Rights holder`

## Saving here. save.image("2019-03-06-1547_workspace.RData")

## Ok, now we can proceed. write.csv(ct13[10,], "2019-03-06- 1608_load_record_10.csv", row.names=FALSE, quote=FALSE, na="") ## That worked after several tries!
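## Since each of these files still needs its header replaced by hand before import, the accepted header row from the example template could be written directly instead (a sketch; the output file name here is hypothetical):
hdr <- readLines("2019-03-04-1535_sequence_upload_test.csv", n = 1)  # header the importer accepted
out <- "2019-03-06-XXXX_load_records.csv"                            # hypothetical output name
writeLines(hdr, out)
write.table(ct13[11:20, ], out, sep = ",", append = TRUE,
   col.names = FALSE, row.names = FALSE, quote = FALSE, na = "")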

## Next 10. write.csv(ct13[11:20,], "2019-03-06- 1612_load_records_11-20.csv", row.names=FALSE, quote=FALSE, na="")

## Next 20. write.csv(ct13[21:40,], "2019-03-06- 1616_load_records_21-40.csv", row.names=FALSE, quote=FALSE, na="")

## Next 20. write.csv(ct13[41:60,], "2019-03-06- 1620_load_records_41-60.csv", row.names=FALSE, quote=FALSE, na="")

2019-03-07-0741_PlutoF_uploads.R

## Continuing PlutoF uploads. wd <- "C:/Users/mattbowser/Documents/2017_earthworm_soil_fungi_NGS/work_space/2019- 03-04_PlutoF_submissions" setwd(wd) load("2019-03-06-1547_workspace.RData") library(tidyverse)

## Next 40. write.csv(ct13[61:100,], "2019-03-07- 0743_load_records_61-100.csv", row.names=FALSE, quote=FALSE, na="")

## Next 100. write.csv(ct13[101:200,], "2019-03-07- 0750_load_records_101-200.csv", row.names=FALSE, quote=FALSE, na="")

## Next 100. write.csv(ct13[201:300,], "2019-03-07- 0756_load_records_201-300.csv", row.names=FALSE, quote=FALSE, na="")

## Next 100.
write.csv(ct13[301:400,], "2019-03-07-0803_load_records_301-400.csv", row.names=FALSE, quote=FALSE, na="")
## Weird. One of those (2017MLB105-OTU422) had no identification at all, yet it loaded ok.

## I drove a coworker to the car dealership.

## Next 100. write.csv(ct13[401:500,], "2019-03-07- 0842_load_records_401-500.csv", row.names=FALSE, quote=FALSE, na="")

## The rest. write.csv(ct13[501:623,], "2019-03-07- 0853_load_records_501-623.csv", row.names=FALSE, quote=FALSE, na="")

2019-03-12-1547_importing.R

## Trying to make a manifest file as shown at ## https://docs.qiime2.org/2018.11/tutorials/importing/#sequence- data-with-sequence-quality-information-i-e-fastq wd <- "C:/Users/mattbowser/Documents/2017_STDP_NGS/work_space/2019- 03-12_QIIME2_import" dd <- "C:/Users/mattbowser/Documents/2017_STDP_NGS/data/RTL_data/Graham_5376Raw01192018" setwd(wd) fn <- dir(dd)

## I need to remove the .fastq.gz files. fn <- fn[!grepl("\\.gz", fn)] length(fn) [1] 60 sample_id <- substr(fn, 6, 19) sample_id <-gsub("E-", "E", sample_id) sample_id <-gsub("R-", "R", sample_id) length(unique(sample_id)) [1] 30 ## Those looked good. yd <- "/home/mattbowser/2017_STDP/original_data" absolute_filepath <- paste(yd, fn, sep="/") mf <- as.data.frame(cbind(sample_id, absolute_filepath)) mf$direction <- NA mf$direction[grepl("R1\\.fastq", fn)] <- "forward" mf$direction[grepl("R2\\.fastq", fn)] <- "reverse" write.csv(mf, "2019-03-12-1614_manifest.csv", row.names=FALSE)
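## A quick sanity check on the manifest before moving it over to Yeti (a sketch using the mf data frame written above): every sample should have exactly one forward and one reverse file.
all(table(mf$sample_id, mf$direction) == 1)  # expect TRUE for 30 samples x 2 directions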

2019-03-14-0828_importing.txt

Again working on getting STDP data into QIIME. Looking at guidance at the URI below.
https://docs.qiime2.org/2018.11/tutorials/importing/#sequence-data-with-sequence-quality-information-i-e-fastq

Need to figure out the PHRED offset. Looking at the head of the first file, 5376-EAFB07JUN17-EA-mlCOIintF-HCO2198R_R1.fastq.

Looking at https://en.wikipedia.org/wiki/FASTQ_format#Quality Illumina 1.8+ Phred+33 would have quality scores in the set below.

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ

Illumina 1.5+ Phred+64 would have quality scores in the set below.

BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghi

Quality scores from the first read from my file:

#8BCCFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCFFGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGFCEEGGGGGGGGFGGGGGCFGGGDGGGFGGGGGFFFGGGGGGEFGGGGGGGGFGGGGGGFDGGGGGGFGGGGFGGGGGGGGGGGGGGGGG9:FFFDCGFFEGAFGGGGCFGGFDD:FBFFDGF=FF;=EE9>9DFFGGGGGGGGFF:EGGGGGFCCGGGGGG>G;E5C5;F6CE? DF:F6?EGGGFGA9C,;AEEA6FGGGF7C7EGGFFEFFFFF@FEC)4;

That is Illumina 1.8+ Phred+33.

So my data is PairedEndFastqManifestPhred33.
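The offset can also be checked programmatically rather than by eye (a sketch; it assumes the fastq file named above is readable from the R session):
q <- readLines("5376-EAFB07JUN17-EA-mlCOIintF-HCO2198R_R1.fastq", n = 4)[4]  # quality line of the first read
range(utf8ToInt(q))  # a minimum below 59 is only possible with the Phred+33 (Illumina 1.8+) offset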

I edited my manifest file to make it conform with the QIIME 2 manifest example. First three lines:
sample-id,absolute-filepath,direction
EAFB07JUN17-E,/home/mattbowser/2017_STDP/original_data/5376-EAFB07JUN17-E-mlCOIintF-HCO2198R_R1.fastq,forward
EAFB07JUN17-E,/home/mattbowser/2017_STDP/original_data/5376-EAFB07JUN17-E-mlCOIintF-HCO2198R_R2.fastq,reverse

On Yeti:
module load python/miniconda3-gcc6.1.0
source activate qiime2-2019.1
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path 2019-03-14-0833_manifest \
  --output-path paired-end-demux.qza \
  --input-format PairedEndFastqManifestPhred33

Imported 2019-03-14-0833_manifest as PairedEndFastqManifestPhred33 to paired-end-demux.qza

That worked!

Looking at example at the URI below.
https://github.com/BikLab/BITMaB2-Tutorials/blob/master/QIIME2-metabarcoding-tutorial-already-demultiplexed-fastqs.md
qiime demux summarize \
  --i-data paired-end-demux.qza \
  --o-visualization demux.qzv

Saved Visualization to: demux.qzv

Submitted to https://view.qiime2.org/

Demultiplexed sequence counts summary
Minimum: 9834
Median: 20807.0
Mean: 25289.8
Maximum: 175187
Total: 758694

I examined some of the reads. Should I trim off the 26 base pairs from the beginning of the forward reads that correspond to the 26 bp forward primer? The reverse primer is also 26 bp long. Judging from what I did in the past, I should trim these to yield the expected 313 bp region.

Tool documentation:
https://docs.qiime2.org/2019.1/plugins/available/dada2/denoise-paired/?highlight=dada2
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trim-left-f 26 \
  --p-trim-left-r 26 \
  --p-trunc-len-f 200 \
  --p-trunc-len-r 200 \
  --p-n-threads 2 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-denoising-stats denoise-stats.qza

That is taking a really long time. Maybe I should have submitted that via a SLURM script.
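For next time, a batch script along the lines of the PIPITS ones above could wrap the same command; this is only a sketch, and the partition, account, time limit, and working directory are assumptions:
#!/bin/bash
#SBATCH --job-name=dada2denoise
#SBATCH -N 1                  # number of nodes
#SBATCH -n 1                  # number of tasks
#SBATCH -p long               # partition (assumed)
#SBATCH --account=bio         # account code (assumed)
#SBATCH --time=1-00:00:00     # requested job time (assumed)
#SBATCH -o dada2-%j.out       # name of output file
module load python/miniconda3-gcc6.1.0
source activate qiime2-2019.1
cd /home/mattbowser/2017_STDP  # assumed working directory on Yeti
srun qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trim-left-f 26 --p-trim-left-r 26 \
  --p-trunc-len-f 200 --p-trunc-len-r 200 \
  --p-n-threads 2 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-denoising-stats denoise-stats.qza
source deactivate
module purge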

Finished!

Saved FeatureTable[Frequency] to: table.qza
Saved FeatureData[Sequence] to: rep-seqs.qza
Saved SampleData[DADA2Stats] to: denoise-stats.qza
qiime feature-table tabulate-seqs \
  --i-data rep-seqs.qza \
  --o-visualization rep-seqs.qzv

Looked at this. Got 404 sequences. Also using Microbiome 16S Analysis: A Quick-Start Guide by Amanda Birmingham (URI below).
http://compbio.ucsd.edu/wp-content/uploads/2016/10/20170712_microbiome_16s_tutorial_non-interactive.pdf
qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table.qzv

## This did not generate an OTU table like I am used to seeing.

2019-03-15-0908_library_construction.txt

## Looking at QIIME2 example at the URI below for importing reference datasets.
https://docs.qiime2.org/2019.1/tutorials/feature-classifier/

Downloading took a long time.

Moved files to Yeti.

## Combining into one big file. cat *.fas > 2019-03-15-1152_lib.fas

Now in R.

## in R: wd <- "C:/Users/mattbowser/Documents/2017_STDP_NGS/work_space/2019- 03-15_library_construction" dd <- "C:/Users/mattbowser/Documents/2017_STDP_NGS/work_space/2019- 03-15_library_construction/BOLD_downloads" setwd(dd) require(Biostrings) fas1 <- readDNAStringSet("2019-03-15-1152_lib.fas") label <- names(fas1) nm <- as.data.frame(label) require(reshape) nm2 <- colsplit(nm$label, split="\\|", names=c("processid", "identification", "marker", "id2")) head(nm2) ## That looked good. names(fas1) <- nm2$processid writeXStringSet(fas1, "2019-03-15-1214_lib.fas") t1 <- fas1[1:2] s1 <- as.character(t1) #t1[1] <- "ATAT"

## Doing a substitution.
fas2 <- fas1
fas2[1:length(fas2)] <- gsub("-", "N", as.character(fas1))

## Saving that.
writeXStringSet(fas1, "2019-03-15-1242_lib.fas")
## (Note: this wrote fas1, which still contains "-" characters, rather than the substituted fas2; that is likely why vsearch complains about an illegal "-" below.)
## Leaving R.
q("no")
## Now I want the list of identical sequences.
module load genetics/vsearch-2.4.3

## Looked at http://manpages.org/vsearch
vsearch --derep_fulllength 2019-03-15-1242_lib.fas --uc 2019-03-15-1400_derep.txt

Fatal error: illegal character '-' on line 19 in fasta file
sed -n '19p' 2019-03-15-1242_lib.fas
## Well that is bad. R's writeXStringSet command makes fasta files with multiple lines per sequence.
wc -l 2019-03-15-1152_lib.fas
782334 2019-03-15-1152_lib.fas
module purge

## in R:
require(Biostrings)
fas1 <- readDNAStringSet("2019-03-15-1152_lib.fas")
label <- names(fas1)
nm <- as.data.frame(label)
require(reshape)
nm2 <- colsplit(nm$label, split="\\|", names=c("processid", "identification", "marker", "id2"))
head(nm2)
## That looked good.
dim(nm2)
[1] 391167 4
seq <- gsub("-", "N", as.character(fas1))
length(seq)
[1] 391167
length(fas1)
[1] 391167
fas2 <- rep(as.character(nm2$processid), 2)
length(fas2)
[1] 782334
fas2[(1:length(fas1))*2 - 1] <- paste(">", as.character(nm2$processid), sep="")
fas2[(1:length(fas1))*2] <- seq
write(fas2, "2019-03-15-1426_lib.fas")

## Leaving R. q("no")

## Now I want the list of identical sequences.
module load genetics/vsearch-2.4.3
vsearch --derep_fulllength 2019-03-15-1426_lib.fas --uc 2019-03-15-1439_derep.txt
vsearch v2.4.3_linux_x86_64, 125.9GB RAM, 20 cores
https://github.com/torognes/vsearch

Reading file 2019-03-15-1426_lib.fas 100% 251623261 nt in 391167 seqs, min 63, max 1951, avg 643 Dereplicating 100% Sorting 100% 269985 unique sequences, avg cluster 1.4, median 1, max 309 Writing uc file, first part 100% Writing uc file, second part 100% sed -n '101,200p' 2019-03-15-1439_derep.txt tail 2019-03-15-1439_derep.txt

R de1 <- read.delim("2019-03-15-1439_derep.txt", header=FALSE) dim(de1) [1] 661152 10

## Where does that number come from?
391167 + 269985  ## input sequences + unique sequences.
[1] 661152

## Looking at that transition. de1[391120:391210, ]

## Looked at uclust file format: http://www.drive5.com/uclust/uclust_userguide_1_1_579.html#_Toc257997686
## Column names: Type Cluster Size %Id Strand Qlo Tlo Alignment Query Target
de1[de1$V9 == "LBCH2059-10",]
       V1 V2  V3 V4 V5 V6 V7 V8          V9 V10
1       S  0 658  *  *  *  *  * LBCH2059-10   *
391168  C  0 309  *  *  *  *  * LBCH2059-10   *
## Why are there two lines for this sequence, both with different lengths?
## One is the new seed record, the other is the cluster record.

## Looking at second sequence. de1[de1$V9 == "LBCH1210-10",] V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 2 H 0 658 100.0 + 0 0 * LBCH1210-10 LBCH2059- 10 ## That shows up only once. ## Ok, I think I can use just the first 391167 lines. de2 <- de1[1:391167, 9:10] names(de2) <- c("Query", "Target") sl <- grepl("\\*", de2$Target) de2$Query <- as.character(de2$Query) de2$Target <- as.character(de2$Target) de2$Target[sl] <- de2$Query[sl] ak <- read.delim("Alaska.txt") bc <- read.delim("British_Columbia.txt") Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : EOF within quoted string bc <- read.delim("British_Columbia.txt", quote="") ## That worked! Done for now, though.
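## For what it is worth, the same Query-to-Target mapping can be keyed on the record type column rather than on row positions, which avoids assuming the S and H records come first (a sketch using the de1 table read above):
sh <- de1[de1$V1 %in% c("S", "H"), ]
de2 <- data.frame(Query = as.character(sh$V9), Target = as.character(sh$V10),
   stringsAsFactors = FALSE)
de2$Target[sh$V1 == "S"] <- de2$Query[sh$V1 == "S"]  # seed sequences are their own targets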

2019-03-18-0933_library_work.txt

## Resuming library creation work. ## On Yeti. cd /home/mattbowser/AK_arhropod_COI_library

R de1 <- read.delim("2019-03-15-1439_derep.txt", header=FALSE) dim(de1) [1] 661152 10 de2 <- de1[1:391167, 9:10] names(de2) <- c("Query", "Target") sl <- grepl("\\*", de2$Target) de2$Query <- as.character(de2$Query) de2$Target <- as.character(de2$Target) de2$Target[sl] <- de2$Query[sl] dim(de2) [1] 391167 2 ak <- read.delim("Alaska.txt", quote="", stringsAsFactors = FALSE) bc <- read.delim("British_Columbia.txt", quote="", stringsAsFactors = FALSE) yk <- read.delim("Yukon_Territory.txt", quote="", stringsAsFactors = FALSE) ru <- read.delim("Russia.txt", quote="", stringsAsFactors = FALSE) d1 <- rbind(ak, bc, yk, ru) dim(d1) [1] 415779 68 tg <- levels(as.factor(de2$Target)) length(tg) [1] 269727

## Now for each target I want to select one of the records with the most complete data.

## First make a dataframe to put results in.
Target <- tg
rdf <- as.data.frame(tg)
rdf$rep <- NA ## for selected representative sequence records.
this_tg <- 1 ## for testing.
for (this_tg in 1:length(tg)) # length(tg)
{
   ## First make dataframe of records with matching sequences.
   ttg <- tg[this_tg]
   qr <- de2$Query[which(de2$Target == ttg)]
   sl <- which(d1$processid %in% qr)
   qdf <- d1[sl,]
   qdf$lt <- paste(qdf$class_name, qdf$order_name, qdf$family_name, qdf$species_name)

   ## If there are any species identifications, choose these.
   if (sum(is.na(qdf$species_name)) > 0) {
      sl <- !(is.na(qdf$species_name))
      qdf <- qdf[sl,]
   }

   ## If there are no species IDs just pick the longest taxonomy.
   if (sum(!is.na(qdf$species_name)) == 0) {
      mxlt <- max(nchar(qdf$lt))
      sl <- nchar(qdf$lt) == mxlt
      qdf <- qdf[sl,]
   }

   ## If some have BINs and some don't, choose the ones with BINs.
   if (sum(is.na(qdf$bin_uri)) > 0) {
      sl <- !(is.na(qdf$bin_uri))
      qdf <- qdf[sl,]
   }

   ## Now if some records are from Alaska, choose these.
   if (sum(qdf$province_state %in% "Alaska") > 0) {
      sl <- qdf$province_state %in% "Alaska"
      qdf <- qdf[sl,]
   }

   ## Now just make a random choice.
   sl <- sample(qdf$processid, size=1)

   ## Now populate that dataframe.
   rdf$rep[this_tg] <- sl
}

## That is taking forever. I think that this dataset is far too big and that I need to cluster at the 99% level or something. For comparison, the largest version of the UNITE database has 9,409 RefS's and 112,778 RepS's.

## Again consulting http://manpages.org/vsearch

## Which iddef should I use, and should I use consensus sequences or centroids?
## Looks like I should use centroids and not consensus sequences.
## Consulted https://www.drive5.com/usearch/UsearchUserGuide4.1.pdf

## Going with --iddef 0, the default.

## Stopped that.
## Saving image.
save.image("2019-03-18-1358_workspace.RData")

## Killed R.
module load genetics/vsearch-2.4.3
vsearch --cluster_fast 2019-03-15-1426_lib.fas \
   --centroids 2019-03-18-1357_clusters.fas \
   --uc 2019-03-18-1357_derep.txt \
   --id 0.99 \
   --iddef 0
vsearch v2.4.3_linux_x86_64, 125.9GB RAM, 20 cores
https://github.com/torognes/vsearch

Reading file 2019-03-15-1426_lib.fas 100%
251623261 nt in 391167 seqs, min 63, max 1951, avg 643
Masking 100%
Sorting by length 100%
Counting unique k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 140257 Size min 1, max 629, avg 2.8
Singletons: 91226, 23.3% of seqs, 65.0% of clusters

wc -l 2019-03-18-1357_clusters.fas
1324334 2019-03-18-1357_clusters.fas
head 2019-03-18-1357_clusters.fas
sed -n '1,100p' 2019-03-18-1357_clusters.fas

## Interesting. This is a multi-line-per-sequence FASTA file.
sed -n '1,100p' 2019-03-18-1357_derep.txt
wc -l 2019-03-18-1357_derep.txt
531424 2019-03-18-1357_derep.txt

So we went from 391K sequences to 140K clusters.
module purge
module load R/3.5.1-gcc7.1.0

R
de1 <- read.delim("2019-03-18-1357_derep.txt", header=FALSE)
dim(de1)
[1] 531424     10
de2 <- de1[, 9:10]
names(de2) <- c("Query", "Target")
sl <- grepl("\\*", de2$Target)
de2$Query <- as.character(de2$Query)
de2$Target <- as.character(de2$Target)
de2$Target[sl] <- de2$Query[sl]
de2 <- unique(de2)
dim(de2)
[1] 390923      2
ak <- read.delim("Alaska.txt", quote="", stringsAsFactors = FALSE)
bc <- read.delim("British_Columbia.txt", quote="", stringsAsFactors = FALSE)
yk <- read.delim("Yukon_Territory.txt", quote="", stringsAsFactors = FALSE)
ru <- read.delim("Russia.txt", quote="", stringsAsFactors = FALSE)
d1 <- rbind(ak, bc, yk, ru)
dim(d1)
[1] 415779     68
tg <- levels(as.factor(de2$Target))
length(tg)
[1] 140093
## Whoa, I expected more.

## Now for each target I want to select one of the records with the most complete data.

## First make a dataframe to put results in.
Target <- tg
rdf <- as.data.frame(tg)
rdf$rep <- NA ## for selected representative sequence records.
this_tg <- 1 ## for testing.
for (this_tg in 1:length(tg)) # length(tg)
{
   ## First make dataframe of records with matching sequences.
   ttg <- tg[this_tg]
   qr <- de2$Query[which(de2$Target == ttg)]
   sl <- which(d1$processid %in% qr)
   qdf <- d1[sl,]
   qdf$lt <- paste(qdf$class_name, qdf$order_name, qdf$family_name, qdf$species_name)

   ## If there are any species identifications, choose these.
   if (sum(is.na(qdf$species_name)) > 0) {
      sl <- !(is.na(qdf$species_name))
      qdf <- qdf[sl,]
   }

   ## If there are no species IDs just pick the longest taxonomy.
   if (sum(!is.na(qdf$species_name)) == 0) {
      mxlt <- max(nchar(qdf$lt))
      sl <- nchar(qdf$lt) == mxlt
      qdf <- qdf[sl,]
   }

   ## If some have BINs and some don't, choose the ones with BINs.
   if (sum(is.na(qdf$bin_uri)) > 0) {
      sl <- !(is.na(qdf$bin_uri))
      qdf <- qdf[sl,]
   }

   ## Now if some records are from Alaska, choose these.
   if (sum(qdf$province_state %in% "Alaska") > 0) {
      sl <- qdf$province_state %in% "Alaska"
      qdf <- qdf[sl,]
   }

   ## Now just make a random choice.
   sl <- sample(qdf$processid, size=1)

   ## Now populate that dataframe.
   rdf$rep[this_tg] <- sl
}
write.csv(rdf, "2019-03-18-1431_selected_records.csv", row.names=FALSE)

2019-03-20-0830_library_work.txt

## Picking up where I left off on Monday on Yeti in R.
save.image("2019-03-20-0810_workspace.RData")
dim(rdf)
[1] 140093      2
require(Biostrings)
fas1 <- readDNAStringSet("2019-03-15-1426_lib.fas")
labs <- labels(fas1)[labels(fas1) %in% rdf$rep]
length(labs)
[1] 140333
## So there must be duplicate matches for this number to be greater?
length(unique(labs))
[1] 140075
## Hmm. This makes me a little uncomfortable.
fsd <- as.data.frame(cbind(labels(fas1), as.character(fas1)), col.names=c("lab", "seq"))
dim(fsd)
[1] 391167      2

## Those names didn't take.
names(fsd) <- c("lab", "seq")
fsd2 <- fsd[fsd$lab %in% rdf$rep,]
dim(fsd2)
[1] 140333      2
fsd2 <- unique(fsd2)
dim(fsd2)
[1] 140333      2
## Are there multiple sequences for some processids?
summary(fsd2[duplicated(fsd2$lab),])
           lab
 GBMTG1989-16: 10
 ZSMDB107-16 : 10
 TANYT001-14 :  8
 TANYT002-14 :  8
 TANYT230-15 :  5
 ZSMDB105-16 :  5
 (Other)     :212
fsd2[fsd2$lab=="GBMTG1989-16",]
## Wow, I checked out this one and there really are a number of sequences of different
## genes on BOLD for this single specimen.
## http://boldsystems.org/index.php/Public_RecordView?processid=GBMTG1989-16

## I will now need to start over to filter out those non COI-5P sequences.
fas1 <- readDNAStringSet("2019-03-15-1152_lib.fas")
label <- names(fas1)
nm <- as.data.frame(label)
require(reshape)
nm2 <- colsplit(nm$label, split="\\|", names=c("processid", "identification", "marker", "id2"))
head(nm2)
levels(nm2$marker)
 [1] "12S"       "16S"       "18S"       "18S-V4"    "28S"       "28S-D2"
 [7] "28S-D2-D3" "AATS"      "ARK"       "CAD"       "CAD4"      "COI-3P"
[13] "COI-5P"    "COII"      "COXIII"    "CYTB"      "EF1-alpha" "ENO"
[19] "H3"        "H4"        "ITS2"      "ND1"       "ND2"       "ND3"
[25] "ND4"       "ND4L"      "ND5-0"     "ND6"       "PGD"       "TPI"
[31] "Wnt1"

## Selecting only COI-5P.
fas2 <- fas1[nm2$marker == "COI-5P"]
nm2 <- nm2[nm2$marker == "COI-5P",]
length(fas2)
[1] 390414
names(fas2) <- nm2$processid

## Doing a substitution.
fas2[1:length(fas2)] <- gsub("-", "N", as.character(fas2))

## Saving that.
writeXStringSet(fas2, "2019-03-20-1009_lib.fas")

## Leaving R.
q()

sed -n '1,100p' 2019-03-20-1009_lib.fas
## That looked good except that it was multiline. We will see if that works.
module purge
module load genetics/vsearch-2.4.3
vsearch --cluster_fast 2019-03-20-1009_lib.fas \
   --centroids 2019-03-20-1014_clusters.fas \
   --uc 2019-03-20-1014_uc_out.txt \
   --id 0.99 \
   --iddef 0
vsearch v2.4.3_linux_x86_64, 125.9GB RAM, 20 cores
https://github.com/torognes/vsearch

Reading file 2019-03-20-1009_lib.fas 100%
251143700 nt in 390414 seqs, min 63, max 1951, avg 643
Masking 100%
Sorting by length 100%
Counting unique k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 140091 Size min 1, max 599, avg 2.8
Singletons: 91146, 23.3% of seqs, 65.1% of clusters

module purge
module load R/3.5.1-gcc7.1.0

R

## Forgot to check whether or not there are still duplicates.
require(Biostrings)
fas1 <- readDNAStringSet("2019-03-20-1014_clusters.fas")
length(unique(names(fas1)))
[1] 140091
## Yeah!!!
de1 <- read.delim("2019-03-20-1014_uc_out.txt", header=FALSE)
dim(de1)
[1] 530505     10
de2 <- de1[, 9:10]
names(de2) <- c("Query", "Target")
sl <- grepl("\\*", de2$Target)
de2$Query <- as.character(de2$Query)
de2$Target <- as.character(de2$Target)
de2$Target[sl] <- de2$Query[sl]
de2 <- unique(de2)
dim(de2)
[1] 390414      2
ak <- read.delim("Alaska.txt", quote="", stringsAsFactors = FALSE)
bc <- read.delim("British_Columbia.txt", quote="", stringsAsFactors = FALSE)
yk <- read.delim("Yukon_Territory.txt", quote="", stringsAsFactors = FALSE)
ru <- read.delim("Russia.txt", quote="", stringsAsFactors = FALSE)
d1 <- rbind(ak, bc, yk, ru)
dim(d1)
[1] 415779     68
tg <- levels(as.factor(de2$Target))
length(tg)
[1] 140091
## Looks good.

## Now for each target I want to select one of the records with the most complete data.

## First make a dataframe to put results in.
Target <- tg
rdf <- as.data.frame(tg)
rdf$rep <- NA ## for selected representative sequence records.
this_tg <- 1 ## for testing.
for (this_tg in 1:length(tg)) # length(tg)
{
   ## First make dataframe of records with matching sequences.
   ttg <- tg[this_tg]
   qr <- de2$Query[which(de2$Target == ttg)]
   sl <- which(d1$processid %in% qr)
   qdf <- d1[sl,]
   qdf$lt <- paste(qdf$class_name, qdf$order_name, qdf$family_name, qdf$species_name)

   ## If there are any species identifications, choose these.
   if (sum(is.na(qdf$species_name)) > 0) {
      sl <- !(is.na(qdf$species_name))
      qdf <- qdf[sl,]
   }

   ## If there are no species IDs just pick the longest taxonomy.
   if (sum(!is.na(qdf$species_name)) == 0) {
      mxlt <- max(nchar(qdf$lt))
      sl <- nchar(qdf$lt) == mxlt
      qdf <- qdf[sl,]
   }

   ## If some have BINs and some don't, choose the ones with BINs.
   if (sum(is.na(qdf$bin_uri)) > 0) {
      sl <- !(is.na(qdf$bin_uri))
      qdf <- qdf[sl,]
   }

   ## Now if some records are from Alaska, choose these.
   if (sum(qdf$province_state %in% "Alaska") > 0) {
      sl <- qdf$province_state %in% "Alaska"
      qdf <- qdf[sl,]
   }

   ## Now just make a random choice.
   sl <- sample(qdf$processid, size=1)

   ## Now populate that dataframe.
   rdf$rep[this_tg] <- sl
}
write.csv(rdf, "2019-03-20-1031_selected_records.csv", row.names=FALSE)
save.image("2019-03-20-1031_workspace.RData")
dim(rdf)
[1] 140091      2
head(rdf)
           tg           rep
1 ABKWR002-07   DKNWR093-11
2 ABKWR003-07 DRYAS21339-15
3 ABKWR005-07    UAMU758-14
4 ABKWR006-07  UAMIC1050-13
5 ABKWR007-07    UAMU343-14
6 ABKWR008-07   ABKWR008-07
## Something is wrong here. ABKWR002-07 and DKNWR093-11 are in completely different groups.

## Where was the problem?
head(de2)
          Query        Target
1 DRYAS16715-15 DRYAS16715-15
2  GMRSA3978-14  GMRSA3978-14
3   CERPA342-08   CERPA342-08
4   BEECE684-10   BEECE684-10
5   BENTH337-08   BENTH337-08
6   BENTH361-08   BENTH361-08
tail(de2)
             Query       Target
390409 RDNMC355-05  RDNMC355-05
390410 UAMIC016-12  UAMIC016-12
390411 BBLPD189-10  BBLPD189-10
390412 BUICD124-15  CERPA385-08
390413 SPIAL243-10 GMODL1184-15
390414  LBCG504-08   LBCG504-08
## Something is wrong there. SPIAL243-10 and GMODL1184-15 are different things.
## The problem may have been with the FASTA file that I made.
## Perhaps the labels and sequences are mixed up.
as.character(fas1[names(fas1)=="GMODL1184-15"])
GMODL1184-15
"NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCTTTTAGAATATTAATTCGAACAGAACTAGGAATACCTGGATCATTAATTAATGATAATAGTCAAATTTATAACGTAATTGTAACTTCACATGCATTCTTAATAATTTTTTTCATAGTAATACCTGTTATAATTGGAGGATTTGGTAATTGATTAGTACCATTAATATTAGGAGCCCCTGATATAGCTTTTCCACGATTAAATAATATAAGATTTTGATTTCTACCCCCATCAATTACACTTTTATTATCAAGAAGTTTAGTAAATGCAGGGTCAGGAACAGGATGAACAGTTTATCCACCTTTATCAGGAAGAGTTTCTCATACAGGAGCATCTGTTGATTTAACTATTTTTTCTTTACATCTAGCAGGAATTTCATCAATTTTAGGAGCTATTAATTTCATCTCAACAATAATTAATATACGTGTAAAAGGAATAACATTTGAACGAATACCCCTATTTATTTGAGCAGTATCTCTAACAGCTTTATTATTACTTTTATCATTACCTGTATTAGCTGGTGCAATTACAATATTATTAACAGATCGAAATTTAAATACATCATTTTTTGATCCATCAGGTGGAGGAGATCCAATTCTTTATCAACATTTATTT"

That sequence is from Monsoma pulveratum. The label is Charipinae. Yes, these are mixed up in the FASTA file.

## Starting over.
fas1 <- readDNAStringSet("2019-03-15-1152_lib.fas")
label <- names(fas1)
nm <- as.data.frame(label)
require(reshape)
nm2 <- colsplit(nm$label, split="\\|", names=c("processid", "identification", "marker", "id2"))
## Selecting only COI-5P.
fas2 <- fas1[nm2$marker == "COI-5P"]
nm2 <- nm2[nm2$marker == "COI-5P",]
length(fas2)
names(fas2) <- nm2$processid[nm2$marker == "COI-5P"] ## This was the problematic line. Fixed now, I think.

## Doing a substitution.
fas2[1:length(fas2)] <- gsub("-", "N", as.character(fas2))

## Saving that.
writeXStringSet(fas2, "2019-03-20-1316_lib.fas")

## Some random quality checks, checking against BOLD.
as.character(fas2[sample(1:length(fas2), size=3)])
## Those all checked out.
q()

module purge
module load genetics/vsearch-2.4.3
vsearch --cluster_fast 2019-03-20-1316_lib.fas \
   --centroids 2019-03-20-1334_clusters.fas \
   --uc 2019-03-20-1334_uc_out.txt \
   --id 0.99 \
   --iddef 0
vsearch v2.4.3_linux_x86_64, 125.9GB RAM, 20 cores
https://github.com/torognes/vsearch

Reading file 2019-03-20-1316_lib.fas 100%
250958050 nt in 390414 seqs, min 63, max 1551, avg 643
Masking 100%
Sorting by length 100%
Counting unique k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 139973 Size min 1, max 629, avg 2.8
Singletons: 91032, 23.3% of seqs, 65.0% of clusters

module purge
module load R/3.5.1-gcc7.1.0

R
de1 <- read.delim("2019-03-20-1334_uc_out.txt", header=FALSE)
dim(de1)
de2 <- de1[, 9:10]
names(de2) <- c("Query", "Target")
sl <- grepl("\\*", de2$Target)
de2$Query <- as.character(de2$Query)
de2$Target <- as.character(de2$Target)
de2$Target[sl] <- de2$Query[sl]
de2 <- unique(de2)
dim(de2)
ak <- read.delim("Alaska.txt", quote="", stringsAsFactors = FALSE)
bc <- read.delim("British_Columbia.txt", quote="", stringsAsFactors = FALSE)
yk <- read.delim("Yukon_Territory.txt", quote="", stringsAsFactors = FALSE)
ru <- read.delim("Russia.txt", quote="", stringsAsFactors = FALSE)
d1 <- rbind(ak, bc, yk, ru)
dim(d1)
tg <- levels(as.factor(de2$Target))
length(tg)

## Now for each target I want to select one of the records with the most complete data.

## First make a dataframe to put results in.
Target <- tg
rdf <- as.data.frame(tg)
rdf$rep <- NA ## for selected representative sequence records.
this_tg <- 1 ## for testing.
for (this_tg in 1:length(tg)) # length(tg)
{
   ## First make dataframe of records with matching sequences.
   ttg <- tg[this_tg]
   qr <- de2$Query[which(de2$Target == ttg)]
   sl <- which(d1$processid %in% qr)
   qdf <- d1[sl,]
   qdf$lt <- paste(qdf$class_name, qdf$order_name, qdf$family_name, qdf$species_name)

   ## If there are any species identifications, choose these.
   if (sum(is.na(qdf$species_name)) > 0) {
      sl <- !(is.na(qdf$species_name))
      qdf <- qdf[sl,]
   }

   ## If there are no species IDs just pick the longest taxonomy.
   if (sum(!is.na(qdf$species_name)) == 0) {
      mxlt <- max(nchar(qdf$lt))
      sl <- nchar(qdf$lt) == mxlt
      qdf <- qdf[sl,]
   }

   ## If some have BINs and some don't, choose the ones with BINs.
   if (sum(is.na(qdf$bin_uri)) > 0) {
      sl <- !(is.na(qdf$bin_uri))
      qdf <- qdf[sl,]
   }

   ## Now if some records are from Alaska, choose these.
   if (sum(qdf$province_state %in% "Alaska") > 0) {
      sl <- qdf$province_state %in% "Alaska"
      qdf <- qdf[sl,]
   }

   ## Now just make a random choice.
   sl <- sample(qdf$processid, size=1)

   ## Now populate that dataframe.
   rdf$rep[this_tg] <- sl
}
## This loop takes forever. How could I make it faster?
write.csv(rdf, "2019-03-20-1337_selected_records.csv", row.names=FALSE)
save.image("2019-03-20-1337_workspace.RData")
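## One possible speedup (a sketch only, not run here): most of the time goes into the
## which() scans over de2 and d1 that happen once per cluster. Splitting the membership
## and metadata row indices by key up front removes those repeated scans. The helper
## name pick_rep is hypothetical, and for brevity this keeps only the species/BIN/Alaska
## preferences, not the longest-taxonomy rule.
members <- split(de2$Query, de2$Target)            ## cluster -> member processids
row_of  <- split(seq_len(nrow(d1)), d1$processid)  ## processid -> row numbers in d1
pick_rep <- function(ids) {
   qdf <- d1[unlist(row_of[ids], use.names = FALSE), ]
   if (nrow(qdf) == 0) return(NA_character_)
   ## same order of preference as the loop above:
   if (any(!is.na(qdf$species_name)))         qdf <- qdf[!is.na(qdf$species_name), ]
   if (any(!is.na(qdf$bin_uri)))              qdf <- qdf[!is.na(qdf$bin_uri), ]
   if (any(qdf$province_state %in% "Alaska")) qdf <- qdf[qdf$province_state %in% "Alaska", ]
   sample(qdf$processid, size = 1)
}
rdf <- data.frame(tg  = names(members),
                  rep = vapply(members, pick_rep, character(1)),
                  stringsAsFactors = FALSE)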

2019-03-21-0830_library_work.txt

## Picking up where I left off on Yeti in R.
module load R/3.5.1-gcc7.1.0

R
load("2019-03-20-1337_workspace.RData")

## Today I need to make the taxonomy file as in the example at
## https://docs.qiime2.org/2019.1/tutorials/feature-classifier/

## I just realized that, the way I did it, the sequence I included may not actually be
## the same sequence as the record for which I have identifiers, etc. I should have made
## my selections in a derep step first, then ...
## Wait. Let me check.
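## A quick in-R spot check (a sketch, not part of the session): with rdf and de2 from
## the 2019-03-20-1337 workspace loaded, this gives the fraction of clusters whose
## selected record is also the vsearch centroid.
mean(rdf$rep == rdf$tg)
## Any cluster where these differ has a centroid sequence that is not the sequence of
## the record whose metadata I kept.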

## Leaving R.
q()

## Checking some sequences. sed -n '1,100p' 2019-03-20-1334_clusters.fas >BEECE668-10 nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnaaTTGGTTCATCCATAAGATTATTAATTCGTATAGAATNAAGTC ATCCTGGTATATGAATNAATAATGATCAAATTTATAATTCATTAGTTACAAGTCATGCAtttttaataattttttttATA

GTTATACCATTTATAATTGGAGGATTTGGAAATTANTTAATTCCATTAATATTAGGATCACCTGATATAGCTTTTCCACG

AATAAATAATATTAGATTCTGATTACTTCCTCCATCTCTTTTTATATTTCTTTTAAGAAATTTATTTACTCCAAATGCAG

GAACAGGATGAACTGTTTATCCTCCTTTATCATCTTATATATTTCATTCATCACCTTCAATTGATATTGCAATCTTTTCT

TTACATATAACTGGAATTTCTTCAATTATTGGATCTTTAAATTTTATTGTAACTATTATAATAATAAAAAATTTTTCATT

AAATTATGATCAAATTAACTTATTTTCATGATCAGTTTGTATTACAGTAATATTATTAATTTTATCTTTACCAGTCCTAG

CAGGAGCAATTACTATATTATTATTTGATCGAAATTTTAATACATCTTTTTTTGATCCAATAGGAGGAGGTGATCCAATC

CTTTATCAACATTTATTTnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn >BEECE670-10 nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnaTAAGATTATTAATTCGAATAGAACTTAGTC

ATCCTGGGATATGAATTAATAATGATCAAATTTATAATTCATTAGTTACAAGTCATGCAtttttaataattttttttATA

GTTATACCTTTTATAATTGGAGGATTTGGAAATTATTTAATTCCTTTAATATTAGGATCACCCGATATAGCTTTCCCTCG

AATAAATAATATTAGATTTTGATTATTACCACCATCTCTTTTATTATTACTTTTAAGAACATTATTTTCTCCAAATGTAG

GAACAGGTTGAACTGTATATCCTCCTTTATCATCTTATATATTTCATTCATCTCCATCTGTTGATATTGCAATTTTCTCT

TTACATATAACTGGAATTTCTTCAATTATTGGATCATTAAATTTTATTGTAACNATCATACTAATAAAAAATTTTTCATT

AAATTATGATCAAATTAATTTATTTTCTTGATCAGTATGTATTACAGTATTATTATTAATTTTATCATTACCAGTTTTAG

CAGGAGCAATTACAATACTTCTTTTTGACCGAAATTTTAATACATCATTTTTTGACCCAATAGGAGGTGGAGATCCTATT

CTTTATCAACANTTATTTnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn ... ## Those looked good. sed -n '10001,10100p' 2019-03-20-1334_clusters.fas ... >NGNAS1179-14 nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTTATTAATTCGAGCTGAATTAG

GAACTCCAGGATCTTTAATTGGAGATNNNGATCAAATTTATAATACTATTGTCACTGCTCATGCATTTATTATAAttttt tttATAGTAATACCAATTATAATTGGAGGATTTGGAAATTGATTAGTCCCTTTAATANNNTTGGGAGCCCCTGATATAGC

TTTCCCTCGAATAAATAATATAAGATTTTGACTTCTTCCACCTTCTTTAACTTTATTAATTTCAAGAAGTATTGTAGAAA

ATGGAGCTGGAACTGGATGAACTGTGTACCCCCCACTTTCATCTAATATTGCCCATGGTGGAAGATCTGTTGATTTANNN

GCTATTTTTTCCCTTCATTTAGCAGGAATTTCTTCAATTTTAGGAGCTATTAATTTTATTACTACTATTATTAATATAAA

ATTAAATGGAATAATATTTGATCAAATACCTTTATTTGTGTGAGCTGTCGGAATTACAGCTTTATTACTTTTACTATCCC

TTCCTGTTTTAGCAGGGNNNGCTATTACTATGCTTTTAACAGATnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn

>NGNAS1180-14 nnnnnnnnnnTTTATTTTTGGAATTTGAGCAGGAATAGTTGGTACATCTCTANNNAGTCTTTTAATTCGAGCTGAATTAG

GTAATCCTGGATCTTTAATTGGAGATNNNGATCAAATTTATAATACTATCGTCACAGCTCATGCCTTTATTATAAttttt tttATAGTTATACCTATTATAATTGGAGGATTTGGAAATTGACTAGTCCCCTTAATANNNTTAGGAGCCCCAGATATAGC

TTTCCCCCGAATAAATAACATAAGATTTTGACTTCTCCCCCCCTCACTTACTCTTTTAATTTCAAGAAGAATTGTAGAAA

ATGGAGCAGGGACAGGATGAACAGTATACCCCCCACTTTCATCTAATATTGCTCATGGAGGAAGATCTGTAGATTTANNN

GCTATTTTTTCTCTCCATTTAGCTGGTATTTCTTCAATTTTAGGAGCAATTAATTTTATTACTACTATTATTAATATAAA

AATTAATGGATTATCTTTTGATCAAATACCATTATTTGTATGAGCAGTAGGAATTACTGCATTATTATTATTACTTTCTC

TCCCAGTTCTAGCTGGannnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn

...
## Those were good. So it will just be the taxonomy that I have gathered from the
## members of the clusters.
## Or should I just use the taxonomy of the sequences chosen by vsearch clustering?
## Some of these will be really coarse. I will use those best taxonomies chosen for the
## clusters, then manually edit if necessary.
module load R/3.5.1-gcc7.1.0

R
load("2019-03-20-1337_workspace.RData")
require(sqldf)
d2 <- sqldf('
   select, rdf.tg, d1.*
   from rdf
   left outer join d1
   on rdf.rep = d1.processid
')
## That crashed R. There seems to be a problem with the sqldf package.
## (The stray comma after "select" would also have been a syntax error.)
## There must be a way to do a similar thing just in R.
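## The same join could be done in base R along these lines (a sketch, not run here;
## it assumes processid is unique within d1):
d2 <- merge(rdf, d1, by.x = "rep", by.y = "processid", all.x = TRUE, sort = FALSE)
## One row per cluster (tg), carrying the metadata of the selected record.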

## Starting over, this time not using sqldf.
load("2019-03-20-1337_workspace.RData")
dim(rdf)
[1] 139973      2
## That is good, matches the number of clusters in the clustered fasta file.
sl <- which(d1$processid %in% rdf$rep)
length(sl)
[1] 139973
rdf2 <- rdf[order(rdf$rep),]
d2 <- d1[sl,]
d2 <- d2[order(d2$processid),]

## Checking...
sum(rdf2$rep == d2$processid)
[1] 139973
## Looks good.
levels(as.factor(d2$species_name))

## Trying to improve format of species names so that they are not a problem for QIIME2.

## Looking at the format of the QIIME files at https://unite.ut.ee/repository.php as an example.
## Species names have underscores instead of spaces.
d2$spf <- gsub(" ", "_", d2$species_name)
d2$spf <- gsub("\\.", "", d2$spf)

## Any # signs?
sum(grepl("#", d2$spf))
[1] 0
## Nope.

## Any ?'s?
sum(grepl("\\?", d2$spf))
[1] 3
## Yes.
## d2$spf <- gsub("\\?", "", d2$spf)
## Maybe these will be ok. I will leave them for now.

## Adding BIN_URIs.
d2$spf <- paste(d2$spf, gsub("BOLD:", "_BOLD-", d2$bin_uri), sep="")
d2$tax <- paste(
   "k__Animalia; ",
   "p__", d2$phylum_name, "; ",
   "c__", d2$class_name, "; ",
   "o__", d2$order_name, "; ",
   "f__", d2$family_name, "; ",
   "g__", d2$genus_name, "; ",
   "s__", d2$spf,
   sep=""
)
## Making taxonomy file.
tf <- as.data.frame(cbind(as.character(rdf2$tg), d2$tax))
write.table(tf, file = "2019-03-21-1023_tax.txt", sep = "\t", row.names = FALSE, col.names = FALSE, na = "", quote = FALSE)
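## For reference, each tab-separated line of the taxonomy file should come out looking
## roughly like this (a made-up example; the processid, taxon, and BIN are hypothetical):
## ABCDE123-19	k__Animalia; p__Arthropoda; c__Insecta; o__Diptera; f__Chironomidae; g__Tanytarsus; s__Tanytarsus_sp_BOLD-AAA1234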

## Saving here.
save.image("2019-03-21-1026_workspace.RData")
q()

## Going to try to import this library now.
module purge
module load python/miniconda3-gcc6.1.0
source activate qiime2-2019.1
qiime tools import \
   --type 'FeatureData[Sequence]' \
   --input-path 2019-03-20-1334_clusters.fas \
   --output-path 2019-03-20-1334_clusters.qza

An unexpected error has occurred:

Invalid characters in sequence: ['a', 'n', 't'].
Valid characters: ['M', 'K', 'W', 'D', 'N', 'G', 'A', 'H', 'V', 'T', 'C', '.', 'Y', '-', 'S', 'B', 'R']
Note: Use `lowercase` if your sequence contains lowercase characters not in the sequence's alphabet.

See above for debug info.

## Trying to fix this.
## The following is from https://gist.github.com/l-modolo/3384b250006b59e54157
awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' 2019-03-20-1334_clusters.fas > 2019-03-21-1023_clusters.fas

## Now trying import again.
qiime tools import \
   --type 'FeatureData[Sequence]' \
   --input-path 2019-03-21-1023_clusters.fas \
   --output-path 2019-03-21-1023_clusters.qza
Imported 2019-03-21-1023_clusters.fas as DNASequencesDirectoryFormat to 2019-03-21-1023_clusters.qza
qiime tools import \
   --type 'FeatureData[Taxonomy]' \
   --input-format HeaderlessTSVTaxonomyFormat \
   --input-path 2019-03-21-1023_tax.txt \
   --output-path 2019-03-21-1023_tax.qza
Imported 2019-03-21-1023_tax.txt as HeaderlessTSVTaxonomyFormat to 2019-03-21-1023_tax.qza

Extracting reference reads. Taking the primers from the Graham_5376M.txt file from RTL.
qiime feature-classifier extract-reads \
   --i-sequences 2019-03-21-1023_clusters.qza \
   --p-f-primer GGWACWGGWTGAACWGTWTAYCCYCC \
   --p-r-primer TAAACTTCAGGGTGACCAAAAAATCA \
   --p-trunc-len 400 \
   --p-min-length 100 \
   --p-max-length 600 \
   --o-reads 2019-03-21-1059_ref_seqs.qza

## That is taking a long time. I aborted it after taking lunch and waiting a while longer.

## Looking at https://docs.qiime2.org/2018.11/plugins/available/feature-classifier/classify-consensus-vsearch/
## and https://github.com/BikLab/BITMaB2-Tutorials/blob/master/QIIME2-metabarcoding-tutorial-already-demultiplexed-fastqs.md#step-3---assigning-taxonomy
cd /home/mattbowser/2017_STDP
qiime feature-classifier classify-consensus-vsearch \
   --i-query rep-seqs.qza \
   --i-reference-taxonomy /home/mattbowser/AK_arhropod_COI_library/2019-03-21-1023_tax.qza \
   --i-reference-reads /home/mattbowser/AK_arhropod_COI_library/2019-03-21-1023_clusters.qza \
   --o-classification taxonomy.qza \
   --p-perc-identity 0.90 \
   --p-maxaccepts 1

## This is also taking forever. I might need to submit these kinds of things via SLURM.
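## A SLURM submission for this kind of job might look something like the sketch below
## (not run; the job name, time limit, and output file name are placeholders, and it
## reuses the module/conda setup and the exact qiime command from above):
#!/bin/bash
#SBATCH --job-name=classifyvsearch
#SBATCH -n 1                     # number of tasks
#SBATCH -p long                  # partition
#SBATCH --account=bio            # account code
#SBATCH --time=1-00:00:00        # requested job time D-HH:MM:SS
#SBATCH -o classify-%j.out       # name of output file (the %j inserts the jobid)
module load python/miniconda3-gcc6.1.0
source activate qiime2-2019.1
cd /home/mattbowser/2017_STDP
qiime feature-classifier classify-consensus-vsearch \
   --i-query rep-seqs.qza \
   --i-reference-taxonomy /home/mattbowser/AK_arhropod_COI_library/2019-03-21-1023_tax.qza \
   --i-reference-reads /home/mattbowser/AK_arhropod_COI_library/2019-03-21-1023_clusters.qza \
   --o-classification taxonomy.qza \
   --p-perc-identity 0.90 \
   --p-maxaccepts 1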

## That finished up while I was away from the computer for a bit.
Saved FeatureData[Taxonomy] to: taxonomy.qza
mv table.qza unfiltered-table.qza
cp unfiltered-table.qza table.qza
qiime metadata tabulate \
   --m-input-file taxonomy.qza \
   --o-visualization taxonomy.qzv
Saved Visualization to: taxonomy.qzv
## That looked ok. Confidence was always 1.0 though, which is suspicious (though this may
## simply follow from --p-maxaccepts 1, since the consensus is then computed from a single hit).

## How do I get a classic OTU table?
## https://forum.qiime2.org/t/basic-questions-regarding-classic-otu-table-in-qiime2/4189/5
## Current documentation: https://docs.qiime2.org/2019.1/tutorials/exporting/
qiime tools export \
   --input-path table.qza \
   --output-path exported-feature-table
Exported table.qza as BIOMV210DirFmt to directory exported-feature-table

## That made a feature table in biom format, which I do not know how to read.
mkdir extracted-feature-table
qiime tools extract \
   --input-path table.qza \
   --output-path extracted-feature-table
## That still made data in the biom format.

#####Change the first line of biom-taxonomy.tsv (i.e. the header) to this:

#OTUID #taxonomy #confidence
biom add-metadata \
   -i exported-feature-table/feature-table.biom \
   -o table-with-taxonomy.biom \
   --observation-metadata-fp exported-feature-table/taxonomy.tsv \
   --sc-separated taxonomy

# Convert to classic table.
biom convert \
   -i table-with-taxonomy.biom \
   -o OTU_Table.txt \
   --to-tsv \
   --header-key taxonomy
## That worked! Taxonomy was missing, but I have that in another file already.

2019-03-22_identifications_etc.txt

Yesterday the OTU table I generated was missing taxonomy. I tried changing the header line of the exported-feature-table/taxonomy.tsv file to

OTUID taxonomy confidence

## Now firing up QIIME2.
cd /home/mattbowser/2017_STDP
module load python/miniconda3-gcc6.1.0
source activate qiime2-2019.1
biom add-metadata \
   -i exported-feature-table/feature-table.biom \
   -o table-with-taxonomy.biom \
   --observation-metadata-fp exported-feature-table/taxonomy.tsv \
   --sc-separated taxonomy
## That gave a bunch of errors.
## Changing the header back to:
#OTUID #taxonomy #confidence
biom add-metadata \
   -i exported-feature-table/feature-table.biom \
   -o table-with-taxonomy.biom \
   --observation-metadata-fp exported-feature-table/taxonomy.tsv \
   --sc-separated taxonomy
## That worked.

# Convert to classic table.
biom convert \
   -i table-with-taxonomy.biom \
   -o 2019-03-22-0838_OTU_Table.txt \
   --to-tsv \
   --header-key taxonomy

## That still lacks the taxonomy, but I have that from another file.

## Now all I am missing is a FASTA file of all OTUs. How do I get that?

## Those are in the rep-seqs.qzv file.

## Downloaded sequences.fasta, renamed it 2019-03-22-0850_sequences.fasta
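## (Exporting the artifact directly should give the same thing; a sketch, not run here.
## Exporting FeatureData[Sequence] writes a dna-sequences.fasta file in the output directory.)
qiime tools export \
   --input-path rep-seqs.qza \
   --output-path exported-rep-seqs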

## So I have all of the parts.
## I am not happy with the identifications for two reasons, though.
## First, I am concerned about the always 1.0 confidence in the identifications produced
## by vsearch. This just doesn't seem right.
## Second, I feel like I really should make my library a little differently:
## 1. derep
## 2. pick best taxonomy, etc. from verbatim replicate sequences
## 3. cluster at 99%
## Step 2 there will take a long time.

## Remaking the library, this time making sure that the metadata selected stays with the representative sequences.
source deactivate
module purge
module load genetics/vsearch-2.4.3
cd /home/mattbowser/AK_arhropod_COI_library
vsearch --derep_fulllength 2019-03-15-1426_lib.fas \
   --uc 2019-03-22-1439_derep.txt \
   --output 2019-03-22-1439_derep.fas
vsearch v2.4.3_linux_x86_64, 125.9GB RAM, 20 cores
https://github.com/torognes/vsearch

Reading file 2019-03-15-1426_lib.fas 100%
251623261 nt in 391167 seqs, min 63, max 1951, avg 643
Dereplicating 100%
Sorting 100%
269985 unique sequences, avg cluster 1.4, median 1, max 309
Writing output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%

## Ok, now I am going to try to write the R selections as a script.
module purge

## Trying out that script.
sbatch 2019-03-22-0935_choose_dereps.slurm

## I should be able to go ahead and cluster that other file.
module load genetics/vsearch-2.4.3
vsearch --cluster_fast 2019-03-22-1439_derep.fas \
   --centroids 2019-03-22-0954_clusters.fas \
   --uc 2019-03-22-0954_uc_out.txt \
   --id 0.99 \
   --iddef 0
vsearch v2.4.3_linux_x86_64, 125.9GB RAM, 20 cores
https://github.com/torognes/vsearch

Reading file 2019-03-22-1439_derep.fas 100%
173514025 nt in 269985 seqs, min 63, max 1951, avg 643
Masking 100%
Sorting by length 100%
Counting unique k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 140528 Size min 1, max 188, avg 1.9
Singletons: 96298, 35.7% of seqs, 68.5% of clusters

2019-03-22-0923_choose_dereps.R

## R script to select representatives with verbatim matching sequences, choosing those
## having the best identifications and associated data.
wd <- "/home/mattbowser/AK_arhropod_COI_library"
setwd(wd)
de1 <- read.delim("2019-03-20-1014_uc_out.txt", header=FALSE)
de2 <- de1[, 9:10]
names(de2) <- c("Query", "Target")
sl <- grepl("\\*", de2$Target)
de2$Query <- as.character(de2$Query)
de2$Target <- as.character(de2$Target)
de2$Target[sl] <- de2$Query[sl]
de2 <- unique(de2)
ak <- read.delim("Alaska.txt", quote="", stringsAsFactors = FALSE)
bc <- read.delim("British_Columbia.txt", quote="", stringsAsFactors = FALSE)
yk <- read.delim("Yukon_Territory.txt", quote="", stringsAsFactors = FALSE)
ru <- read.delim("Russia.txt", quote="", stringsAsFactors = FALSE)
d1 <- rbind(ak, bc, yk, ru)
tg <- levels(as.factor(de2$Target))

## Now for each target I want to select one of the records with the most complete data.
## First make a dataframe to put results in.
Target <- tg
rdf <- as.data.frame(tg)
rdf$rep <- NA ## for selected representative sequence records.
for (this_tg in 1:length(tg))
{
   ## First make dataframe of records with matching sequences.
   ttg <- tg[this_tg]
   qr <- de2$Query[which(de2$Target == ttg)]
   sl <- which(d1$processid %in% qr)
   qdf <- d1[sl,]
   qdf$lt <- paste(qdf$class_name, qdf$order_name, qdf$family_name, qdf$species_name)

   ## If there are any species identifications, choose these.
   if (sum(is.na(qdf$species_name)) > 0) {
      sl <- !(is.na(qdf$species_name))
      qdf <- qdf[sl,]
   }

   ## If there are no species IDs just pick the longest taxonomy.
   if (sum(!is.na(qdf$species_name)) == 0) {
      mxlt <- max(nchar(qdf$lt))
      sl <- nchar(qdf$lt) == mxlt
      qdf <- qdf[sl,]
   }

   ## If some have BINs and some don't, choose the ones with BINs.
   if (sum(is.na(qdf$bin_uri)) > 0) {
      sl <- !(is.na(qdf$bin_uri))
      qdf <- qdf[sl,]
   }

   ## Now if some records are from Alaska, choose these.
   if (sum(qdf$province_state %in% "Alaska") > 0) {
      sl <- qdf$province_state %in% "Alaska"
      qdf <- qdf[sl,]
   }

   ## Now just make a random choice.
   sl <- sample(qdf$processid, size=1)

   ## Now populate that dataframe.
   rdf$rep[this_tg] <- sl
}
## This loop takes forever. How could I make it faster?
write.csv(rdf, "selected_records.csv", row.names=FALSE)

2019-03-22-0935_choose_dereps.slurm

#!/bin/bash
#SBATCH --job-name=choosedereps
#SBATCH -N 1                       # number of nodes
#SBATCH -n 1                       # number of tasks
#SBATCH -p long                    # partition
#SBATCH --account=bio              # account code
#SBATCH --time=1-01:00:00          # requested job time D-HH:MM:SS
#SBATCH --mail-type=ALL            # choose when you want to be emailed
#SBATCH [email protected]          # add your email address
#SBATCH -o 2019-03-22-0935_choose_dereps-%j.out   # name of output file (the %j inserts the jobid)
module load R/3.5.1-gcc7.1.0
cd /home/mattbowser/AK_arhropod_COI_library

R CMD BATCH 2019-03-22-0923_choose_dereps.R 2019-03-22-0923_choose_dereps.txt
module purge

2019-03-22-1203_library_work.txt

## That SLURM job finished while I took a short walk.
module purge
module load R/3.5.1-gcc7.1.0
cd /home/mattbowser/AK_arhropod_COI_library

R

## Interesting. My workspace from the SLURM script was restored.
dim(rdf)
[1] 140091      2
## I checked over this and it looked to be exactly like the selected_records.csv file.
sl <- which(d1$processid %in% rdf$rep)
length(sl)
[1] 140091
rdf2 <- rdf[order(rdf$rep),]
d2 <- d1[sl,]
d2 <- d2[order(d2$processid),]

## Checking...
sum(rdf2$rep == d2$processid)
[1] 140091
## That checks out.
require(Biostrings)
fas1 <- readDNAStringSet("2019-03-22-0954_clusters.fas")
length(unique(names(fas1)))
[1] 140371
## Why does this differ from the 140091 selected records?

## Oh, I see. I chose the wrong file in that R script run by SLURM earlier today.
q()
module purge

## I made new R script and SLURM files.
sbatch 2019-03-22-1219_choose_dereps.slurm

2019-03-22-1219_choose_dereps.R

## R script to select representatives with verbatim matching sequences, choosing those
## having the best identifications and associated data.
wd <- "/home/mattbowser/AK_arhropod_COI_library"
setwd(wd)
de1 <- read.delim("2019-03-22-1439_derep.txt", header=FALSE)
de2 <- de1[, 9:10]
names(de2) <- c("Query", "Target")
sl <- grepl("\\*", de2$Target)
de2$Query <- as.character(de2$Query)
de2$Target <- as.character(de2$Target)
de2$Target[sl] <- de2$Query[sl]
de2 <- unique(de2)
ak <- read.delim("Alaska.txt", quote="", stringsAsFactors = FALSE)
bc <- read.delim("British_Columbia.txt", quote="", stringsAsFactors = FALSE)
yk <- read.delim("Yukon_Territory.txt", quote="", stringsAsFactors = FALSE)
ru <- read.delim("Russia.txt", quote="", stringsAsFactors = FALSE)
d1 <- rbind(ak, bc, yk, ru)
tg <- levels(as.factor(de2$Target))

## Now for each target I want to select one of the records with the most complete data.
## First make a dataframe to put results in.
Target <- tg
rdf <- as.data.frame(tg)
rdf$rep <- NA ## for selected representative sequence records.
for (this_tg in 1:length(tg))
{
   ## First make dataframe of records with matching sequences.
   ttg <- tg[this_tg]
   qr <- de2$Query[which(de2$Target == ttg)]
   sl <- which(d1$processid %in% qr)
   qdf <- d1[sl,]
   qdf$lt <- paste(qdf$class_name, qdf$order_name, qdf$family_name, qdf$species_name)

   ## If there are any species identifications, choose these.
   if (sum(is.na(qdf$species_name)) > 0) {
      sl <- !(is.na(qdf$species_name))
      qdf <- qdf[sl,]
   }

   ## If there are no species IDs just pick the longest taxonomy.
   if (sum(!is.na(qdf$species_name)) == 0) {
      mxlt <- max(nchar(qdf$lt))
      sl <- nchar(qdf$lt) == mxlt
      qdf <- qdf[sl,]
   }

   ## If some have BINs and some don't, choose the ones with BINs.
   if (sum(is.na(qdf$bin_uri)) > 0) {
      sl <- !(is.na(qdf$bin_uri))
      qdf <- qdf[sl,]
   }

   ## Now if some records are from Alaska, choose these.
   if (sum(qdf$province_state %in% "Alaska") > 0) {
      sl <- qdf$province_state %in% "Alaska"
      qdf <- qdf[sl,]
   }

   ## Now just make a random choice.
   sl <- sample(qdf$processid, size=1)

   ## Now populate that dataframe.
   rdf$rep[this_tg] <- sl
}
## This loop takes forever. How could I make it faster?
write.csv(rdf, "selected_records.csv", row.names=FALSE)

2019-03-22-1219_choose_dereps.slurm

#!/bin/bash
#SBATCH --job-name=choosedereps
#SBATCH -N 1                       # number of nodes
#SBATCH -n 1                       # number of tasks
#SBATCH -p long                    # partition
#SBATCH --account=bio              # account code
#SBATCH --time=1-01:00:00          # requested job time D-HH:MM:SS
#SBATCH --mail-type=ALL            # choose when you want to be emailed
#SBATCH [email protected]          # add your email address
#SBATCH -o 2019-03-22-0935_choose_dereps-%j.out   # name of output file (the %j inserts the jobid)
module load R/3.5.1-gcc7.1.0
cd /home/mattbowser/AK_arhropod_COI_library

R CMD BATCH 2019-03-22-1219_choose_dereps.R 2019-03-22-1219_choose_dereps.txt
module purge