Load Libraries

(Note, sections should ‘’not’’ be wrapped in chunk-tags)

Then in a new code chunk, enter the following line of code to load the needed library:

library("tidyverse")
library("ggseqlogo")

Load Data

Here, we will load the data we are to work with. Enter the following code, but change the x in the file to a number in the range 1-20 depending on which file you fancy:

TCR_data <- read_tsv(file = "data/HC4.tsv")
TCR_data

## # A tibble: 20,493 x 115
##    sample_name species locus product_subtype kit_pool sku   test_name
##    <chr>       <chr>   <chr> <chr>           <lgl>    <lgl> <lgl>    
##  1 HC4         Human   TCRB  Survey          NA       NA    NA       
##  2 HC4         Human   TCRB  Survey          NA       NA    NA       
##  3 HC4         Human   TCRB  Survey          NA       NA    NA       
##  4 HC4         Human   TCRB  Survey          NA       NA    NA       
##  5 HC4         Human   TCRB  Survey          NA       NA    NA       
##  6 HC4         Human   TCRB  Survey          NA       NA    NA       
##  7 HC4         Human   TCRB  Survey          NA       NA    NA       
##  8 HC4         Human   TCRB  Survey          NA       NA    NA       
##  9 HC4         Human   TCRB  Survey          NA       NA    NA       
## 10 HC4         Human   TCRB  Survey          NA       NA    NA       
## # … with 20,483 more rows, and 108 more variables: sample_catalog_tags <chr>,
## #   sample_rich_tags <lgl>, sample_rich_tags_json <chr>, hla_class_i <lgl>,
## #   hla_class_ii <lgl>, kit_control <lgl>, total_templates <dbl>,
## #   productive_templates <dbl>, outofframe_templates <dbl>,
## #   stop_templates <dbl>, dj_templates <dbl>, total_rearrangements <dbl>,
## #   productive_rearrangements <dbl>, outofframe_rearrangements <dbl>,
## #   stop_rearrangements <dbl>, dj_rearrangements <dbl>, total_reads <dbl>,
## #   total_productive_reads <dbl>, total_outofframe_reads <dbl>,
## #   total_stop_reads <dbl>, total_dj_reads <dbl>, productive_clonality <dbl>,
## #   productive_entropy <dbl>, sample_clonality <dbl>, sample_entropy <dbl>,
## #   sample_amount_ng <dbl>, sample_cells_mass_estimate <dbl>,
## #   fraction_productive_of_cells_mass_estimate <dbl>, sample_cells <lgl>,
## #   fraction_productive_of_cells <lgl>, max_productive_frequency <dbl>,
## #   max_frequency <dbl>, counting_method <chr>, primer_set <chr>,
## #   sequence_result_status <chr>, release_date <dbl>, upload_date <dbl>,
## #   sample_tags <chr>, fraction_productive <dbl>, order_name <lgl>,
## #   kit_id <lgl>, total_t_cells <lgl>, total_templates_agg <lgl>,
## #   rearrangement <chr>, amino_acid <chr>, frame_type <chr>,
## #   rearrangement_type <chr>, templates <dbl>, seq_reads <dbl>,
## #   frequency <dbl>, productive_frequency <dbl>, cdr3_length <dbl>,
## #   v_family <chr>, v_gene <chr>, v_allele <chr>, d_family <chr>, d_gene <chr>,
## #   d_allele <chr>, j_family <chr>, j_gene <chr>, j_allele <chr>,
## #   v_deletions <dbl>, d5_deletions <dbl>, d3_deletions <dbl>,
## #   j_deletions <dbl>, n2_insertions <dbl>, n1_insertions <dbl>, v_index <dbl>,
## #   n1_index <dbl>, n2_index <dbl>, d_index <dbl>, j_index <dbl>,
## #   v_family_ties <lgl>, v_gene_ties <lgl>, v_allele_ties <chr>,
## #   d_family_ties <chr>, d_gene_ties <chr>, d_allele_ties <chr>,
## #   j_family_ties <lgl>, j_gene_ties <chr>, j_allele_ties <lgl>,
## #   sequence_tags <lgl>, v_shm_count <lgl>, v_shm_indexes <lgl>,
## #   antibody <chr>, bio_identity <chr>, v_resolved <chr>, d_resolved <chr>,
## #   j_resolved <chr>, extended_rearrangement <lgl>, cdr1_rearrangement <lgl>,
## #   cdr1_amino_acid <lgl>, cdr1_start_index <lgl>,
## #   cdr1_rearrangement_length <lgl>, cdr2_rearrangement <lgl>,
## #   cdr2_amino_acid <lgl>, cdr2_start_index <lgl>,
## #   cdr2_rearrangement_length <lgl>, cdr3_rearrangement <lgl>,
## #   cdr3_amino_acid <lgl>, …

Inspect the Data

We can get a glimpse of the data like so

TCR_data %>% glimpse

## Observations: 20,493
## Variables: 115
## $ sample_name                                <chr> "HC4", "HC4", "HC4", "HC4"…
## $ species                                    <chr> "Human", "Human", "Human",…
## $ locus                                      <chr> "TCRB", "TCRB", "TCRB", "T…
## $ product_subtype                            <chr> "Survey", "Survey", "Surve…
## $ kit_pool                                   <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ sku                                        <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ test_name                                  <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ sample_catalog_tags                        <chr> "Cluster of Differentiatio…
## $ sample_rich_tags                           <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ sample_rich_tags_json                      <chr> "[]", "[]", "[]", "[]", "[…
## $ hla_class_i                                <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ hla_class_ii                               <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ kit_control                                <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ total_templates                            <dbl> 69944, 69944, 69944, 69944…
## $ productive_templates                       <dbl> 58236, 58236, 58236, 58236…
## $ outofframe_templates                       <dbl> 11021, 11021, 11021, 11021…
## $ stop_templates                             <dbl> 685, 685, 685, 685, 685, 6…
## $ dj_templates                               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_rearrangements                       <dbl> 20493, 20493, 20493, 20493…
## $ productive_rearrangements                  <dbl> 16518, 16518, 16518, 16518…
## $ outofframe_rearrangements                  <dbl> 3663, 3663, 3663, 3663, 36…
## $ stop_rearrangements                        <dbl> 312, 312, 312, 312, 312, 3…
## $ dj_rearrangements                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_reads                                <dbl> 813208, 813208, 813208, 81…
## $ total_productive_reads                     <dbl> 677093, 677093, 677093, 67…
## $ total_outofframe_reads                     <dbl> 128145, 128145, 128145, 12…
## $ total_stop_reads                           <dbl> 7970, 7970, 7970, 7970, 79…
## $ total_dj_reads                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ productive_clonality                       <dbl> 0.255897, 0.255897, 0.2558…
## $ productive_entropy                         <dbl> 10.42619, 10.42619, 10.426…
## $ sample_clonality                           <dbl> 0.2460609, 0.2460609, 0.24…
## $ sample_entropy                             <dbl> 10.79855, 10.79855, 10.798…
## $ sample_amount_ng                           <dbl> 400, 400, 400, 400, 400, 4…
## $ sample_cells_mass_estimate                 <dbl> 61538, 61538, 61538, 61538…
## $ fraction_productive_of_cells_mass_estimate <dbl> 0.9463421, 0.9463421, 0.94…
## $ sample_cells                               <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ fraction_productive_of_cells               <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ max_productive_frequency                   <dbl> 0.05377252, 0.05377252, 0.…
## $ max_frequency                              <dbl> 0.04477206, 0.04477206, 0.…
## $ counting_method                            <chr> "v2", "v2", "v2", "v2", "v…
## $ primer_set                                 <chr> "Human-TCRB-PD1x", "Human-…
## $ sequence_result_status                     <chr> "Published", "Published", …
## $ release_date                               <dbl> 1.446043e+12, 1.446043e+12…
## $ upload_date                                <dbl> 1.445966e+12, 1.445966e+12…
## $ sample_tags                                <chr> "Cluster of Differentiatio…
## $ fraction_productive                        <dbl> 0.8326089, 0.8326089, 0.83…
## $ order_name                                 <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ kit_id                                     <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ total_t_cells                              <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ total_templates_agg                        <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ rearrangement                              <chr> "CCCCTCACTCTGGAGTCAGCTACCC…
## $ amino_acid                                 <chr> "CASSDGVGQFF", NA, "CASRES…
## $ frame_type                                 <chr> "In", "Out", "In", "Out", …
## $ rearrangement_type                         <chr> "VDJ", "VDJ", "VDJ", "VDJ"…
## $ templates                                  <dbl> 1, 1, 1, 1, 1, 1, 1, 27, 1…
## $ seq_reads                                  <dbl> 5, 2, 5, 4, 7, 14, 12, 327…
## $ frequency                                  <dbl> 6.148488e-06, 2.459395e-06…
## $ productive_frequency                       <dbl> 7.384510e-06, NA, 7.384510…
## $ cdr3_length                                <dbl> 33, 29, 45, 46, 45, 36, 45…
## $ v_family                                   <chr> "TCRBV10", "TCRBV10", "TCR…
## $ v_gene                                     <chr> "TCRBV10-02", "TCRBV10-02"…
## $ v_allele                                   <chr> "01", "01", "01", "01", "0…
## $ d_family                                   <chr> NA, "TCRBD02", "TCRBD02", …
## $ d_gene                                     <chr> "unresolved", "TCRBD02-01"…
## $ d_allele                                   <chr> NA, NA, "01", NA, NA, NA, …
## $ j_family                                   <chr> "TCRBJ02", "TCRBJ02", "TCR…
## $ j_gene                                     <chr> "TCRBJ02-01", "TCRBJ02-03"…
## $ j_allele                                   <chr> "01", "01", "01", "01", "0…
## $ v_deletions                                <dbl> 3, 9, 8, 2, 0, 2, 7, 7, 5,…
## $ d5_deletions                               <dbl> 0, 2, 6, 1, 6, 10, 8, 4, 0…
## $ d3_deletions                               <dbl> 10, 10, 0, 8, 1, 0, 1, 5, …
## $ j_deletions                                <dbl> 12, 7, 7, 4, 0, 6, 1, 1, 0…
## $ n2_insertions                              <dbl> 1, 0, 6, 10, 0, 6, 4, 10, …
## $ n1_insertions                              <dbl> 6, 3, 6, 0, 1, 0, 0, 3, 1,…
## $ v_index                                    <dbl> 48, 52, 36, 35, 36, 45, 36…
## $ n1_index                                   <dbl> 62, -1, 45, 50, -1, 60, 46…
## $ n2_index                                   <dbl> 65, 64, 61, -1, 58, -1, -1…
## $ d_index                                    <dbl> 63, 60, 51, 60, 53, 66, 50…
## $ j_index                                    <dbl> 71, 67, 67, 63, 59, 68, 57…
## $ v_family_ties                              <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ v_gene_ties                                <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ v_allele_ties                              <chr> NA, NA, NA, NA, NA, NA, NA…
## $ d_family_ties                              <chr> "TCRBD01,TCRBD02", NA, NA,…
## $ d_gene_ties                                <chr> "TCRBD01-01,TCRBD02-01", N…
## $ d_allele_ties                              <chr> NA, "01,02", NA, NA, NA, N…
## $ j_family_ties                              <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ j_gene_ties                                <chr> NA, NA, NA, NA, NA, NA, NA…
## $ j_allele_ties                              <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ sequence_tags                              <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ v_shm_count                                <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ v_shm_indexes                              <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ antibody                                   <chr> "Vb 12", "Vb 12", "Vb 12",…
## $ bio_identity                               <chr> "CASSDGVGQFF+TCRBV10-02+TC…
## $ v_resolved                                 <chr> "TCRBV10-02*01", "TCRBV10-…
## $ d_resolved                                 <chr> NA, "TCRBD02-01", "TCRBD02…
## $ j_resolved                                 <chr> "TCRBJ02-01*01", "TCRBJ02-…
## $ extended_rearrangement                     <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr1_rearrangement                         <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr1_amino_acid                            <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr1_start_index                           <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr1_rearrangement_length                  <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr2_rearrangement                         <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr2_amino_acid                            <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr2_start_index                           <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr2_rearrangement_length                  <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr3_rearrangement                         <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr3_amino_acid                            <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr3_start_index                           <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr3_rearrangement_length                  <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ chosen_v_family                            <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ chosen_v_gene                              <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ chosen_v_allele                            <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ chosen_j_family                            <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ chosen_j_gene                              <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ chosen_j_allele                            <lgl> NA, NA, NA, NA, NA, NA, NA…

Note that the %>% simply takes the dataset on the left and puts it into the function on the right

Q1: How many observations are in the dataset?

TCR_data %>% nrow

## [1] 20493

Q2: How many variables are in the dataset?

TCR_data %>% ncol

## [1] 115

Q4: Which variable in the dataset holds the amino acid sequence of the TCRb-CDR3?

amino_acid

Wrangle the Data

We can reduce the dataset to only include the TCRb-CDR3 amino acid sequence data, we are interested in working with. The following will take the entire TCR_data dataset, selects the amino_acid variable, drops all missing observations as denoted by NA and finally puts the result into a new dataset variable TCRb_CDR3:

 TCRb_CDR3 <- TCR_data %>%
   select(amino_acid) %>%
   drop_na(amino_acid)
 TCRb_CDR3

## # A tibble: 16,830 x 1
##    amino_acid     
##    <chr>          
##  1 CASSDGVGQFF    
##  2 CASRESGGALYTQYF
##  3 CASSESGVSGNTIYF
##  4 CASSEPPAEQYF   
##  5 CASTLGGSYNSPLHF
##  6 CASTRTASLQETQYF
##  7 CASRDSVRFYEQYF 
##  8 CASSEPPAEQYF   
##  9 CASALGVAGYNEQFF
## 10 CASSEQGVNEQFF  
## # … with 16,820 more rows

Q5: How many data points were missing the amino acid sequence of the the TCRb-CDR3?

In HC10:

nrow(TCR_data) - nrow(TCRb_CDR3)

## [1] 3663

The first C and the last F (can also be a W) is technically not annotated as being part of the loop. The following code creates a variable amino_acid_no_FC, which takes the amino_acid variable only including anything between the 2nd initial character and the 2nd to last character:

TCRb_CDR3 <- TCRb_CDR3 %>%
  mutate(amino_acid_no_CF = str_sub(amino_acid, 2, -2))
TCRb_CDR3

## # A tibble: 16,830 x 2
##    amino_acid      amino_acid_no_CF
##    <chr>           <chr>           
##  1 CASSDGVGQFF     ASSDGVGQF       
##  2 CASRESGGALYTQYF ASRESGGALYTQY   
##  3 CASSESGVSGNTIYF ASSESGVSGNTIY   
##  4 CASSEPPAEQYF    ASSEPPAEQY      
##  5 CASTLGGSYNSPLHF ASTLGGSYNSPLH   
##  6 CASTRTASLQETQYF ASTRTASLQETQY   
##  7 CASRDSVRFYEQYF  ASRDSVRFYEQY    
##  8 CASSEPPAEQYF    ASSEPPAEQY      
##  9 CASALGVAGYNEQFF ASALGVAGYNEQF   
## 10 CASSEQGVNEQFF   ASSEQGVNEQF     
## # … with 16,820 more rows

We want to also look at the lengths of the CDR3-loops, i.e. how many amino acid residues are in each loop:

TCRb_CDR3 <- TCRb_CDR3 %>%
  mutate(k = str_length(amino_acid_no_CF)) %>% 
  filter(k > 0)
TCRb_CDR3

## # A tibble: 16,829 x 3
##    amino_acid      amino_acid_no_CF     k
##    <chr>           <chr>            <int>
##  1 CASSDGVGQFF     ASSDGVGQF            9
##  2 CASRESGGALYTQYF ASRESGGALYTQY       13
##  3 CASSESGVSGNTIYF ASSESGVSGNTIY       13
##  4 CASSEPPAEQYF    ASSEPPAEQY          10
##  5 CASTLGGSYNSPLHF ASTLGGSYNSPLH       13
##  6 CASTRTASLQETQYF ASTRTASLQETQY       13
##  7 CASRDSVRFYEQYF  ASRDSVRFYEQY        12
##  8 CASSEPPAEQYF    ASSEPPAEQY          10
##  9 CASALGVAGYNEQFF ASALGVAGYNEQF       13
## 10 CASSEQGVNEQFF   ASSEQGVNEQF         11
## # … with 16,819 more rows

Q6: What is it, that the above code does?

Calculates the length of the CDR3 without the first C and the last F and removes any empty sequences after trimming

Lastly, we want to see if there are multiple observations of the same TCRb-CDR3 amino acid sequence. The following code counts the occurrences of the variable amino_acid_no_CF, arranges the sequences descending with respect to counts:

TCRb_CDR3 %>%
  count(amino_acid_no_CF) %>%
  arrange(-n)

## # A tibble: 15,411 x 2
##    amino_acid_no_CF     n
##    <chr>            <int>
##  1 ASSPGQGEGYEQY       37
##  2 ASSLGQAYEQY         23
##  3 ASSSTAAGNQPQH       23
##  4 ASSLLGGAQALS        19
##  5 ASSFLGRWRTEAF       16
##  6 ASSRRQGIGTSYEQY     16
##  7 ASSLGRGVDEQY        15
##  8 ASSKGLAGTYEQY       13
##  9 ASSLSGRRYGYT        13
## 10 ASSLGGDNQPQH        12
## # … with 15,401 more rows

Q7: What is the sequence and counts of the 3 TCRb-CDR3 amino acid sequences, which occurs most times?

## # A tibble: 3 x 2
##   amino_acid_no_CF     n
##   <chr>            <int>
## 1 ASSPGQGEGYEQY       37
## 2 ASSLGQAYEQY         23
## 3 ASSSTAAGNQPQH       23

Now, let us just make a quick plot of all the sequences occurring at least 5 times (note the added filter(n >= 5))

TCRb_CDR3 %>%
  count(amino_acid_no_CF) %>%
  arrange(-n) %>%
  filter(n >= 5) %>%
  ggplot(aes(x = reorder(amino_acid_no_CF, -n), y = n, fill = n)) +
  geom_col() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  labs(x = "")

Before we proceed to the analysis, let us create a dataset of unique (i.e. distinct) observations

TCRb_CDR3_unique <- TCRb_CDR3 %>%
  select(amino_acid_no_CF, k) %>% 
  distinct
TCRb_CDR3_unique

## # A tibble: 15,411 x 2
##    amino_acid_no_CF     k
##    <chr>            <int>
##  1 ASSDGVGQF            9
##  2 ASRESGGALYTQY       13
##  3 ASSESGVSGNTIY       13
##  4 ASSEPPAEQY          10
##  5 ASTLGGSYNSPLH       13
##  6 ASTRTASLQETQY       13
##  7 ASRDSVRFYEQY        12
##  8 ASALGVAGYNEQF       13
##  9 ASSEQGVNEQF         11
## 10 ASSRGASGYNEQF       13
## # … with 15,401 more rows

Q8: How many unique sequences are in the dataset?

TCRb_CDR3_unique %>% nrow

## [1] 15411

Analyse the Data

Let us start with plotting the distribution of the lengths

Q9: What is the most observed length?

## [1] 13

Let us compare the amino acid frequencies of in the TCRb-CDR3 with those of naturally occurring. I.e. if you took all the proteins in the world, how often would you see e.g. ‘A’. These background frequencies can be found as follows:

Go to the [http://isoelectricpointdb.org/ Proteome-pI - Proteome Isoelectric Point Database]
In the top click [http://isoelectricpointdb.org/statistics.html Statistics]
Scroll to the bottom and find the ’‘’Amino acid frequency across kingdoms’’’ section
The appropriate ’‘’Kingdom’’’ in our case would be ’‘’Eukaryota’’’, we can record the amino acid frequencies, remembering to change them to one-letter encoding

In your rmarkdown document, manually create a dataset of these background amino acid frequencies like so:

natural_aa_frequencies <- tibble(
  amino_acid = c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
                 "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"),
  Eukaryota = c(7.63, 1.76, 5.40, 6.42, 3.87, 6.33, 2.44, 5.10, 5.64, 9.29,
                2.25, 4.28, 5.41, 4.21, 5.71, 8.34, 5.56, 6.20, 1.24, 2.87)
  ) %>%
  mutate(f_Eukaryota = Eukaryota / sum(Eukaryota)) %>%
  select(amino_acid, f_Eukaryota)
natural_aa_frequencies

## # A tibble: 20 x 2
##    amino_acid f_Eukaryota
##    <chr>            <dbl>
##  1 A               0.0763
##  2 C               0.0176
##  3 D               0.0540
##  4 E               0.0642
##  5 F               0.0387
##  6 G               0.0633
##  7 H               0.0244
##  8 I               0.0510
##  9 K               0.0564
## 10 L               0.0929
## 11 M               0.0225
## 12 N               0.0428
## 13 P               0.0541
## 14 Q               0.0421
## 15 R               0.0571
## 16 S               0.0834
## 17 T               0.0556
## 18 V               0.0620
## 19 W               0.0124
## 20 Y               0.0287

Now, we need a variable with the one-letter symbols for all the 20 proteogenic amino acids

aa <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
        "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")

and their corresponding chemical profile

chemistry <- c("A" = "Hydrophobic", "C" = "Polar", "D" = "Acidic", "E" = "Acidic",
               "F" = "Hydrophobic", "G" = "Polar", "H" = "Basic", "I" = "Hydrophobic",
               "K" = "Basic", "L" = "Hydrophobic", "M" = "Hydrophobic", "N" = "Neutral",
               "P" = "Hydrophobic", "Q" = "Neutral", "R" = "Basic", "S" = "Polar",
               "T" = "Polar", "V" = "Hydrophobic", "W" = "Hydrophobic", "Y" = "Polar")

We can get a vector of all the TCRb-CDR3 amino acid residues like so

 all_TCRb_CDR3_aa <- TCRb_CDR3_unique %>%
   select(amino_acid_no_CF) %>%
   drop_na(amino_acid_no_CF) %>%
   distinct %>%
   pull(amino_acid_no_CF) %>%
   str_split("") %>%
   unlist

We can get a quick count using the table function:

all_TCRb_CDR3_aa %>% table

## .
##     *     A     C     D     E     F     G     H     I     K     L     M     N 
##   316 22918    98  6351 12939  7406 22601  2466  2517  1915  9129   617  6271 
##     P     Q     R     S     T     V     W     Y 
##  7085 14724  9259 34727 14015  4257  1581 14752

We can see that there are non-standard amino acid residue characters, which we have to remember to filter away. First we convert the vector to a tibble (this is just a slightly different version of a data.frame).

all_TCRb_CDR3_aa <- tibble(amino_acid = all_TCRb_CDR3_aa)
all_TCRb_CDR3_aa

## # A tibble: 195,944 x 1
##    amino_acid
##    <chr>     
##  1 A         
##  2 S         
##  3 S         
##  4 D         
##  5 G         
##  6 V         
##  7 G         
##  8 Q         
##  9 F         
## 10 A         
## # … with 195,934 more rows

Now, we can create the dataset of amino acid frequencies in the TCR-data like so:

TCR_aa_frequencies <- all_TCRb_CDR3_aa %>%
  count(amino_acid) %>%
  filter(amino_acid %in% aa) %>% 
  mutate(f_TCR = n/sum(n)) %>%
  select(amino_acid, f_TCR)
TCR_aa_frequencies

## # A tibble: 20 x 2
##    amino_acid    f_TCR
##    <chr>         <dbl>
##  1 A          0.117   
##  2 C          0.000501
##  3 D          0.0325  
##  4 E          0.0661  
##  5 F          0.0379  
##  6 G          0.116   
##  7 H          0.0126  
##  8 I          0.0129  
##  9 K          0.00979 
## 10 L          0.0467  
## 11 M          0.00315 
## 12 N          0.0321  
## 13 P          0.0362  
## 14 Q          0.0753  
## 15 R          0.0473  
## 16 S          0.178   
## 17 T          0.0716  
## 18 V          0.0218  
## 19 W          0.00808 
## 20 Y          0.0754

Now, we can join the two created datasets to one and add the chemical profile:

aa_frequencies <- TCR_aa_frequencies %>%
  full_join(natural_aa_frequencies, by = "amino_acid") %>% 
  full_join(tibble(amino_acid = names(chemistry), chemical_profile = chemistry), by = "amino_acid")
aa_frequencies

## # A tibble: 20 x 4
##    amino_acid    f_TCR f_Eukaryota chemical_profile
##    <chr>         <dbl>       <dbl> <chr>           
##  1 A          0.117         0.0763 Hydrophobic     
##  2 C          0.000501      0.0176 Polar           
##  3 D          0.0325        0.0540 Acidic          
##  4 E          0.0661        0.0642 Acidic          
##  5 F          0.0379        0.0387 Hydrophobic     
##  6 G          0.116         0.0633 Polar           
##  7 H          0.0126        0.0244 Basic           
##  8 I          0.0129        0.0510 Hydrophobic     
##  9 K          0.00979       0.0564 Basic           
## 10 L          0.0467        0.0929 Hydrophobic     
## 11 M          0.00315       0.0225 Hydrophobic     
## 12 N          0.0321        0.0428 Neutral         
## 13 P          0.0362        0.0541 Hydrophobic     
## 14 Q          0.0753        0.0421 Neutral         
## 15 R          0.0473        0.0571 Basic           
## 16 S          0.178         0.0834 Polar           
## 17 T          0.0716        0.0556 Polar           
## 18 V          0.0218        0.0620 Hydrophobic     
## 19 W          0.00808       0.0124 Hydrophobic     
## 20 Y          0.0754        0.0287 Polar

And plot the correspondence:

Now, let us try to gain insights into where the differences are largest. We can do this by updating the dataset with a new variable f_diff, which is simply the difference between f_TCR and f_Eukaryota

aa_frequencies <- aa_frequencies %>%
  mutate(f_diff = f_TCR - f_Eukaryota)
aa_frequencies

## # A tibble: 20 x 5
##    amino_acid    f_TCR f_Eukaryota chemical_profile    f_diff
##    <chr>         <dbl>       <dbl> <chr>                <dbl>
##  1 A          0.117         0.0763 Hydrophobic       0.0408  
##  2 C          0.000501      0.0176 Polar            -0.0171  
##  3 D          0.0325        0.0540 Acidic           -0.0216  
##  4 E          0.0661        0.0642 Acidic            0.00191 
##  5 F          0.0379        0.0387 Hydrophobic      -0.000862
##  6 G          0.116         0.0633 Polar             0.0522  
##  7 H          0.0126        0.0244 Basic            -0.0118  
##  8 I          0.0129        0.0510 Hydrophobic      -0.0382  
##  9 K          0.00979       0.0564 Basic            -0.0466  
## 10 L          0.0467        0.0929 Hydrophobic      -0.0463  
## 11 M          0.00315       0.0225 Hydrophobic      -0.0194  
## 12 N          0.0321        0.0428 Neutral          -0.0108  
## 13 P          0.0362        0.0541 Hydrophobic      -0.0179  
## 14 Q          0.0753        0.0421 Neutral           0.0331  
## 15 R          0.0473        0.0571 Basic            -0.00980 
## 16 S          0.178         0.0834 Polar             0.0941  
## 17 T          0.0716        0.0556 Polar             0.0160  
## 18 V          0.0218        0.0620 Hydrophobic      -0.0403  
## 19 W          0.00808       0.0124 Hydrophobic      -0.00432 
## 20 Y          0.0754        0.0287 Polar             0.0467

We can plot these differences like so:

Q10: Which 3 amino acid have the largest positive difference?

SGY

Q11: Which 3 amino acid have the largest negative difference?

VLK

Q12: Using your biological knowledge, how do you interpret the correspondence between the TCRb_CDR3 and Natural amino acid frequencies?

Let us take a closer look at which amino acid residues are found at which positions in the TCRb-CDR3. The following code takes the TCRb_CDR3_unique dataset, filters to amino_acid_no_CF sequences of length X (See Q10) and then it takes all those sequences and puts them into the ggseqlogo function, which produces a https://prosite.expasy.org/sequence_logo.html:

my_k <- 13 # Put in the value from Q10
TCRb_CDR3_unique %>%
  filter(k == my_k) %>%
  pull(amino_acid_no_CF) %>%
  ggseqlogo

Try replacing the value of my_k with a couple of different values and inspect the resulting logos.

Q13: What does the sequence logos tell you?

Q14: Using what you learned in Q14, re-evalutate the correspondance plot from Q10-12

Optional - Advanced

Q15: Redo the entire analysis with a DIFFERENT HCx.tsv file and re-answer questions 7, 8, 9, 10 and 11

Go back in your rmarkdown document and find the command

 TCRb_CDR3 <- TCRb_CDR3 %>%
   mutate(amino_acid_no_CF = str_sub(amino_acid, 2, -2))

Change the command to

 TCRb_CDR3 <- TCRb_CDR3 %>%
   mutate(amino_acid_no_CF = str_sub(amino_acid, 5, -5))

Then re-run the entire analysis and re-inspect the correspondence plot

Q16: What happened and why?

Q17: Look at the bar-plot of the differences, what is your interpretation?

Exercise Summary

The following is a clarification of the learning objectives of this exercise. After having completed the above exercise, you should be able to:

Retrieve data from the [https://clients.adaptivebiotech.com/immuneaccess immuneACCESS database]
Setup a new [https://www.r-project.org/ R] project and workspace on [https://rstudio.cloud RStudio Cloud]
Upload data
Install packages
Create a simple [https://rmarkdown.rstudio.com/ rmarkdown document]
Use simple [https://www.r-project.org/ R] / [https://www.tidyverse.org/ Tidyverse] commands to load and wrangle data
Create simple data visualisations using [https://ggplot2.tidyverse.org/ ggplot], extract and interpret insights
Perform and interpret a simple amino acid frequency analysis to evaluate the variability of the TCRb-CDR3 loop
Create a [https://prosite.expasy.org/sequence_logo.html Sequence Logo] using [https://www.ncbi.nlm.nih.gov/pubmed/29036507 ggseqlogo], extract and interpret insights

22145 Immunological Bioinformatics

Repertoire Analysis

Leon Eyrich Jessen

Tuesday January 7th 2020

Load Libraries

Load Data

Inspect the Data

Q1: How many observations are in the dataset?

Q2: How many variables are in the dataset?

Q4: Which variable in the dataset holds the amino acid sequence of the TCRb-CDR3?

Wrangle the Data

Q5: How many data points were missing the amino acid sequence of the the TCRb-CDR3?

Q6: What is it, that the above code does?

Q7: What is the sequence and counts of the 3 TCRb-CDR3 amino acid sequences, which occurs most times?

Q8: How many unique sequences are in the dataset?

Analyse the Data

Q9: What is the most observed length?

Q10: Which 3 amino acid have the largest positive difference?

Q11: Which 3 amino acid have the largest negative difference?

Q12: Using your biological knowledge, how do you interpret the correspondence between the TCRb_CDR3 and Natural amino acid frequencies?

Q13: What does the sequence logos tell you?

Q14: Using what you learned in Q14, re-evalutate the correspondance plot from Q10-12

Optional - Advanced

Q15: Redo the entire analysis with a DIFFERENT HCx.tsv file and re-answer questions 7, 8, 9, 10 and 11

Q16: What happened and why?

Q17: Look at the bar-plot of the differences, what is your interpretation?

Exercise Summary

22145 Immunological Bioinformatics

Repertoire Analysis

Leon Eyrich Jessen

Tuesday January 7th 2020

Load Libraries

Load Data

Inspect the Data

Q1: How many observations are in the dataset?

Q2: How many variables are in the dataset?

Q3: How are observations and variables related to rows and columns?

Q4: Which variable in the dataset holds the amino acid sequence of the TCRb-CDR3?

Wrangle the Data

Q5: How many data points were missing the amino acid sequence of the the TCRb-CDR3?

Q6: What is it, that the above code does?

Q7: What is the sequence and counts of the 3 TCRb-CDR3 amino acid sequences, which occurs most times?

Q8: How many unique sequences are in the dataset?

Analyse the Data

Q9: What is the most observed length?

Q10: Which 3 amino acid have the largest positive difference?

Q11: Which 3 amino acid have the largest negative difference?

Q12: Using your biological knowledge, how do you interpret the correspondence between the TCRb_CDR3 and Natural amino acid frequencies?

Q13: What does the sequence logos tell you?

Q14: Using what you learned in Q14, re-evalutate the correspondance plot from Q10-12

Optional - Advanced

Q15: Redo the entire analysis with a DIFFERENT HCx.tsv file and re-answer questions 7, 8, 9, 10 and 11

Q16: What happened and why?

Q17: Look at the bar-plot of the differences, what is your interpretation?

Exercise Summary