(Note, sections should ‘’not’’ be wrapped in chunk-tags)
Then in a new code chunk, enter the following line of code to load the needed library:
library("tidyverse")
library("ggseqlogo")
Here, we will load the data we are to work with. Enter the following code, but change the x
in the file to a number in the range 1-20 depending on which file you fancy:
TCR_data <- read_tsv(file = "data/HC4.tsv")
TCR_data
## # A tibble: 20,493 x 115
## sample_name species locus product_subtype kit_pool sku test_name
## <chr> <chr> <chr> <chr> <lgl> <lgl> <lgl>
## 1 HC4 Human TCRB Survey NA NA NA
## 2 HC4 Human TCRB Survey NA NA NA
## 3 HC4 Human TCRB Survey NA NA NA
## 4 HC4 Human TCRB Survey NA NA NA
## 5 HC4 Human TCRB Survey NA NA NA
## 6 HC4 Human TCRB Survey NA NA NA
## 7 HC4 Human TCRB Survey NA NA NA
## 8 HC4 Human TCRB Survey NA NA NA
## 9 HC4 Human TCRB Survey NA NA NA
## 10 HC4 Human TCRB Survey NA NA NA
## # … with 20,483 more rows, and 108 more variables: sample_catalog_tags <chr>,
## # sample_rich_tags <lgl>, sample_rich_tags_json <chr>, hla_class_i <lgl>,
## # hla_class_ii <lgl>, kit_control <lgl>, total_templates <dbl>,
## # productive_templates <dbl>, outofframe_templates <dbl>,
## # stop_templates <dbl>, dj_templates <dbl>, total_rearrangements <dbl>,
## # productive_rearrangements <dbl>, outofframe_rearrangements <dbl>,
## # stop_rearrangements <dbl>, dj_rearrangements <dbl>, total_reads <dbl>,
## # total_productive_reads <dbl>, total_outofframe_reads <dbl>,
## # total_stop_reads <dbl>, total_dj_reads <dbl>, productive_clonality <dbl>,
## # productive_entropy <dbl>, sample_clonality <dbl>, sample_entropy <dbl>,
## # sample_amount_ng <dbl>, sample_cells_mass_estimate <dbl>,
## # fraction_productive_of_cells_mass_estimate <dbl>, sample_cells <lgl>,
## # fraction_productive_of_cells <lgl>, max_productive_frequency <dbl>,
## # max_frequency <dbl>, counting_method <chr>, primer_set <chr>,
## # sequence_result_status <chr>, release_date <dbl>, upload_date <dbl>,
## # sample_tags <chr>, fraction_productive <dbl>, order_name <lgl>,
## # kit_id <lgl>, total_t_cells <lgl>, total_templates_agg <lgl>,
## # rearrangement <chr>, amino_acid <chr>, frame_type <chr>,
## # rearrangement_type <chr>, templates <dbl>, seq_reads <dbl>,
## # frequency <dbl>, productive_frequency <dbl>, cdr3_length <dbl>,
## # v_family <chr>, v_gene <chr>, v_allele <chr>, d_family <chr>, d_gene <chr>,
## # d_allele <chr>, j_family <chr>, j_gene <chr>, j_allele <chr>,
## # v_deletions <dbl>, d5_deletions <dbl>, d3_deletions <dbl>,
## # j_deletions <dbl>, n2_insertions <dbl>, n1_insertions <dbl>, v_index <dbl>,
## # n1_index <dbl>, n2_index <dbl>, d_index <dbl>, j_index <dbl>,
## # v_family_ties <lgl>, v_gene_ties <lgl>, v_allele_ties <chr>,
## # d_family_ties <chr>, d_gene_ties <chr>, d_allele_ties <chr>,
## # j_family_ties <lgl>, j_gene_ties <chr>, j_allele_ties <lgl>,
## # sequence_tags <lgl>, v_shm_count <lgl>, v_shm_indexes <lgl>,
## # antibody <chr>, bio_identity <chr>, v_resolved <chr>, d_resolved <chr>,
## # j_resolved <chr>, extended_rearrangement <lgl>, cdr1_rearrangement <lgl>,
## # cdr1_amino_acid <lgl>, cdr1_start_index <lgl>,
## # cdr1_rearrangement_length <lgl>, cdr2_rearrangement <lgl>,
## # cdr2_amino_acid <lgl>, cdr2_start_index <lgl>,
## # cdr2_rearrangement_length <lgl>, cdr3_rearrangement <lgl>,
## # cdr3_amino_acid <lgl>, …
We can get a glimpse of the data like so
TCR_data %>% glimpse
## Observations: 20,493
## Variables: 115
## $ sample_name <chr> "HC4", "HC4", "HC4", "HC4"…
## $ species <chr> "Human", "Human", "Human",…
## $ locus <chr> "TCRB", "TCRB", "TCRB", "T…
## $ product_subtype <chr> "Survey", "Survey", "Surve…
## $ kit_pool <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ sku <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ test_name <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ sample_catalog_tags <chr> "Cluster of Differentiatio…
## $ sample_rich_tags <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ sample_rich_tags_json <chr> "[]", "[]", "[]", "[]", "[…
## $ hla_class_i <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ hla_class_ii <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ kit_control <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ total_templates <dbl> 69944, 69944, 69944, 69944…
## $ productive_templates <dbl> 58236, 58236, 58236, 58236…
## $ outofframe_templates <dbl> 11021, 11021, 11021, 11021…
## $ stop_templates <dbl> 685, 685, 685, 685, 685, 6…
## $ dj_templates <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_rearrangements <dbl> 20493, 20493, 20493, 20493…
## $ productive_rearrangements <dbl> 16518, 16518, 16518, 16518…
## $ outofframe_rearrangements <dbl> 3663, 3663, 3663, 3663, 36…
## $ stop_rearrangements <dbl> 312, 312, 312, 312, 312, 3…
## $ dj_rearrangements <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_reads <dbl> 813208, 813208, 813208, 81…
## $ total_productive_reads <dbl> 677093, 677093, 677093, 67…
## $ total_outofframe_reads <dbl> 128145, 128145, 128145, 12…
## $ total_stop_reads <dbl> 7970, 7970, 7970, 7970, 79…
## $ total_dj_reads <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ productive_clonality <dbl> 0.255897, 0.255897, 0.2558…
## $ productive_entropy <dbl> 10.42619, 10.42619, 10.426…
## $ sample_clonality <dbl> 0.2460609, 0.2460609, 0.24…
## $ sample_entropy <dbl> 10.79855, 10.79855, 10.798…
## $ sample_amount_ng <dbl> 400, 400, 400, 400, 400, 4…
## $ sample_cells_mass_estimate <dbl> 61538, 61538, 61538, 61538…
## $ fraction_productive_of_cells_mass_estimate <dbl> 0.9463421, 0.9463421, 0.94…
## $ sample_cells <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ fraction_productive_of_cells <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ max_productive_frequency <dbl> 0.05377252, 0.05377252, 0.…
## $ max_frequency <dbl> 0.04477206, 0.04477206, 0.…
## $ counting_method <chr> "v2", "v2", "v2", "v2", "v…
## $ primer_set <chr> "Human-TCRB-PD1x", "Human-…
## $ sequence_result_status <chr> "Published", "Published", …
## $ release_date <dbl> 1.446043e+12, 1.446043e+12…
## $ upload_date <dbl> 1.445966e+12, 1.445966e+12…
## $ sample_tags <chr> "Cluster of Differentiatio…
## $ fraction_productive <dbl> 0.8326089, 0.8326089, 0.83…
## $ order_name <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ kit_id <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ total_t_cells <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ total_templates_agg <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ rearrangement <chr> "CCCCTCACTCTGGAGTCAGCTACCC…
## $ amino_acid <chr> "CASSDGVGQFF", NA, "CASRES…
## $ frame_type <chr> "In", "Out", "In", "Out", …
## $ rearrangement_type <chr> "VDJ", "VDJ", "VDJ", "VDJ"…
## $ templates <dbl> 1, 1, 1, 1, 1, 1, 1, 27, 1…
## $ seq_reads <dbl> 5, 2, 5, 4, 7, 14, 12, 327…
## $ frequency <dbl> 6.148488e-06, 2.459395e-06…
## $ productive_frequency <dbl> 7.384510e-06, NA, 7.384510…
## $ cdr3_length <dbl> 33, 29, 45, 46, 45, 36, 45…
## $ v_family <chr> "TCRBV10", "TCRBV10", "TCR…
## $ v_gene <chr> "TCRBV10-02", "TCRBV10-02"…
## $ v_allele <chr> "01", "01", "01", "01", "0…
## $ d_family <chr> NA, "TCRBD02", "TCRBD02", …
## $ d_gene <chr> "unresolved", "TCRBD02-01"…
## $ d_allele <chr> NA, NA, "01", NA, NA, NA, …
## $ j_family <chr> "TCRBJ02", "TCRBJ02", "TCR…
## $ j_gene <chr> "TCRBJ02-01", "TCRBJ02-03"…
## $ j_allele <chr> "01", "01", "01", "01", "0…
## $ v_deletions <dbl> 3, 9, 8, 2, 0, 2, 7, 7, 5,…
## $ d5_deletions <dbl> 0, 2, 6, 1, 6, 10, 8, 4, 0…
## $ d3_deletions <dbl> 10, 10, 0, 8, 1, 0, 1, 5, …
## $ j_deletions <dbl> 12, 7, 7, 4, 0, 6, 1, 1, 0…
## $ n2_insertions <dbl> 1, 0, 6, 10, 0, 6, 4, 10, …
## $ n1_insertions <dbl> 6, 3, 6, 0, 1, 0, 0, 3, 1,…
## $ v_index <dbl> 48, 52, 36, 35, 36, 45, 36…
## $ n1_index <dbl> 62, -1, 45, 50, -1, 60, 46…
## $ n2_index <dbl> 65, 64, 61, -1, 58, -1, -1…
## $ d_index <dbl> 63, 60, 51, 60, 53, 66, 50…
## $ j_index <dbl> 71, 67, 67, 63, 59, 68, 57…
## $ v_family_ties <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ v_gene_ties <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ v_allele_ties <chr> NA, NA, NA, NA, NA, NA, NA…
## $ d_family_ties <chr> "TCRBD01,TCRBD02", NA, NA,…
## $ d_gene_ties <chr> "TCRBD01-01,TCRBD02-01", N…
## $ d_allele_ties <chr> NA, "01,02", NA, NA, NA, N…
## $ j_family_ties <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ j_gene_ties <chr> NA, NA, NA, NA, NA, NA, NA…
## $ j_allele_ties <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ sequence_tags <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ v_shm_count <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ v_shm_indexes <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ antibody <chr> "Vb 12", "Vb 12", "Vb 12",…
## $ bio_identity <chr> "CASSDGVGQFF+TCRBV10-02+TC…
## $ v_resolved <chr> "TCRBV10-02*01", "TCRBV10-…
## $ d_resolved <chr> NA, "TCRBD02-01", "TCRBD02…
## $ j_resolved <chr> "TCRBJ02-01*01", "TCRBJ02-…
## $ extended_rearrangement <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr1_rearrangement <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr1_amino_acid <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr1_start_index <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr1_rearrangement_length <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr2_rearrangement <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr2_amino_acid <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr2_start_index <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr2_rearrangement_length <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr3_rearrangement <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr3_amino_acid <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr3_start_index <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ cdr3_rearrangement_length <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ chosen_v_family <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ chosen_v_gene <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ chosen_v_allele <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ chosen_j_family <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ chosen_j_gene <lgl> NA, NA, NA, NA, NA, NA, NA…
## $ chosen_j_allele <lgl> NA, NA, NA, NA, NA, NA, NA…
Note that the %>%
simply takes the dataset on the left and puts it into the function on the right
TCR_data %>% nrow
## [1] 20493
TCR_data %>% ncol
## [1] 115
amino_acid
We can reduce the dataset to only include the TCRb-CDR3 amino acid sequence data, we are interested in working with. The following will take the entire TCR_data
dataset, selects the amino_acid
variable, drops all missing observations as denoted by NA
and finally puts the result into a new dataset variable TCRb_CDR3
:
TCRb_CDR3 <- TCR_data %>%
select(amino_acid) %>%
drop_na(amino_acid)
TCRb_CDR3
## # A tibble: 16,830 x 1
## amino_acid
## <chr>
## 1 CASSDGVGQFF
## 2 CASRESGGALYTQYF
## 3 CASSESGVSGNTIYF
## 4 CASSEPPAEQYF
## 5 CASTLGGSYNSPLHF
## 6 CASTRTASLQETQYF
## 7 CASRDSVRFYEQYF
## 8 CASSEPPAEQYF
## 9 CASALGVAGYNEQFF
## 10 CASSEQGVNEQFF
## # … with 16,820 more rows
In HC10:
nrow(TCR_data) - nrow(TCRb_CDR3)
## [1] 3663
The first C
and the last F
(can also be a W
) is technically not annotated as being part of the loop. The following code creates a variable amino_acid_no_FC
, which takes the amino_acid
variable only including anything between the 2nd initial character and the 2nd to last character:
TCRb_CDR3 <- TCRb_CDR3 %>%
mutate(amino_acid_no_CF = str_sub(amino_acid, 2, -2))
TCRb_CDR3
## # A tibble: 16,830 x 2
## amino_acid amino_acid_no_CF
## <chr> <chr>
## 1 CASSDGVGQFF ASSDGVGQF
## 2 CASRESGGALYTQYF ASRESGGALYTQY
## 3 CASSESGVSGNTIYF ASSESGVSGNTIY
## 4 CASSEPPAEQYF ASSEPPAEQY
## 5 CASTLGGSYNSPLHF ASTLGGSYNSPLH
## 6 CASTRTASLQETQYF ASTRTASLQETQY
## 7 CASRDSVRFYEQYF ASRDSVRFYEQY
## 8 CASSEPPAEQYF ASSEPPAEQY
## 9 CASALGVAGYNEQFF ASALGVAGYNEQF
## 10 CASSEQGVNEQFF ASSEQGVNEQF
## # … with 16,820 more rows
We want to also look at the lengths of the CDR3-loops, i.e. how many amino acid residues are in each loop:
TCRb_CDR3 <- TCRb_CDR3 %>%
mutate(k = str_length(amino_acid_no_CF)) %>%
filter(k > 0)
TCRb_CDR3
## # A tibble: 16,829 x 3
## amino_acid amino_acid_no_CF k
## <chr> <chr> <int>
## 1 CASSDGVGQFF ASSDGVGQF 9
## 2 CASRESGGALYTQYF ASRESGGALYTQY 13
## 3 CASSESGVSGNTIYF ASSESGVSGNTIY 13
## 4 CASSEPPAEQYF ASSEPPAEQY 10
## 5 CASTLGGSYNSPLHF ASTLGGSYNSPLH 13
## 6 CASTRTASLQETQYF ASTRTASLQETQY 13
## 7 CASRDSVRFYEQYF ASRDSVRFYEQY 12
## 8 CASSEPPAEQYF ASSEPPAEQY 10
## 9 CASALGVAGYNEQFF ASALGVAGYNEQF 13
## 10 CASSEQGVNEQFF ASSEQGVNEQF 11
## # … with 16,819 more rows
Calculates the length of the CDR3 without the first C
and the last F
and removes any empty sequences after trimming
Lastly, we want to see if there are multiple observations of the same TCRb-CDR3 amino acid sequence. The following code counts the occurrences of the variable amino_acid_no_CF
, arranges the sequences descending with respect to counts:
TCRb_CDR3 %>%
count(amino_acid_no_CF) %>%
arrange(-n)
## # A tibble: 15,411 x 2
## amino_acid_no_CF n
## <chr> <int>
## 1 ASSPGQGEGYEQY 37
## 2 ASSLGQAYEQY 23
## 3 ASSSTAAGNQPQH 23
## 4 ASSLLGGAQALS 19
## 5 ASSFLGRWRTEAF 16
## 6 ASSRRQGIGTSYEQY 16
## 7 ASSLGRGVDEQY 15
## 8 ASSKGLAGTYEQY 13
## 9 ASSLSGRRYGYT 13
## 10 ASSLGGDNQPQH 12
## # … with 15,401 more rows
## # A tibble: 3 x 2
## amino_acid_no_CF n
## <chr> <int>
## 1 ASSPGQGEGYEQY 37
## 2 ASSLGQAYEQY 23
## 3 ASSSTAAGNQPQH 23
Now, let us just make a quick plot of all the sequences occurring at least 5 times (note the added filter(n >= 5)
)
TCRb_CDR3 %>%
count(amino_acid_no_CF) %>%
arrange(-n) %>%
filter(n >= 5) %>%
ggplot(aes(x = reorder(amino_acid_no_CF, -n), y = n, fill = n)) +
geom_col() +
theme_bw() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
labs(x = "")
Before we proceed to the analysis, let us create a dataset of unique (i.e. distinct) observations
TCRb_CDR3_unique <- TCRb_CDR3 %>%
select(amino_acid_no_CF, k) %>%
distinct
TCRb_CDR3_unique
## # A tibble: 15,411 x 2
## amino_acid_no_CF k
## <chr> <int>
## 1 ASSDGVGQF 9
## 2 ASRESGGALYTQY 13
## 3 ASSESGVSGNTIY 13
## 4 ASSEPPAEQY 10
## 5 ASTLGGSYNSPLH 13
## 6 ASTRTASLQETQY 13
## 7 ASRDSVRFYEQY 12
## 8 ASALGVAGYNEQF 13
## 9 ASSEQGVNEQF 11
## 10 ASSRGASGYNEQF 13
## # … with 15,401 more rows
TCRb_CDR3_unique %>% nrow
## [1] 15411
Let us start with plotting the distribution of the lengths
## [1] 13
Let us compare the amino acid frequencies of in the TCRb-CDR3 with those of naturally occurring. I.e. if you took all the proteins in the world, how often would you see e.g. ‘A’. These background frequencies can be found as follows:
In your rmarkdown document, manually create a dataset of these background amino acid frequencies like so:
natural_aa_frequencies <- tibble(
amino_acid = c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
"M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"),
Eukaryota = c(7.63, 1.76, 5.40, 6.42, 3.87, 6.33, 2.44, 5.10, 5.64, 9.29,
2.25, 4.28, 5.41, 4.21, 5.71, 8.34, 5.56, 6.20, 1.24, 2.87)
) %>%
mutate(f_Eukaryota = Eukaryota / sum(Eukaryota)) %>%
select(amino_acid, f_Eukaryota)
natural_aa_frequencies
## # A tibble: 20 x 2
## amino_acid f_Eukaryota
## <chr> <dbl>
## 1 A 0.0763
## 2 C 0.0176
## 3 D 0.0540
## 4 E 0.0642
## 5 F 0.0387
## 6 G 0.0633
## 7 H 0.0244
## 8 I 0.0510
## 9 K 0.0564
## 10 L 0.0929
## 11 M 0.0225
## 12 N 0.0428
## 13 P 0.0541
## 14 Q 0.0421
## 15 R 0.0571
## 16 S 0.0834
## 17 T 0.0556
## 18 V 0.0620
## 19 W 0.0124
## 20 Y 0.0287
Now, we need a variable with the one-letter symbols for all the 20 proteogenic amino acids
aa <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
"M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
and their corresponding chemical profile
chemistry <- c("A" = "Hydrophobic", "C" = "Polar", "D" = "Acidic", "E" = "Acidic",
"F" = "Hydrophobic", "G" = "Polar", "H" = "Basic", "I" = "Hydrophobic",
"K" = "Basic", "L" = "Hydrophobic", "M" = "Hydrophobic", "N" = "Neutral",
"P" = "Hydrophobic", "Q" = "Neutral", "R" = "Basic", "S" = "Polar",
"T" = "Polar", "V" = "Hydrophobic", "W" = "Hydrophobic", "Y" = "Polar")
We can get a vector of all the TCRb-CDR3 amino acid residues like so
all_TCRb_CDR3_aa <- TCRb_CDR3_unique %>%
select(amino_acid_no_CF) %>%
drop_na(amino_acid_no_CF) %>%
distinct %>%
pull(amino_acid_no_CF) %>%
str_split("") %>%
unlist
We can get a quick count using the table
function:
all_TCRb_CDR3_aa %>% table
## .
## * A C D E F G H I K L M N
## 316 22918 98 6351 12939 7406 22601 2466 2517 1915 9129 617 6271
## P Q R S T V W Y
## 7085 14724 9259 34727 14015 4257 1581 14752
We can see that there are non-standard amino acid residue characters, which we have to remember to filter away. First we convert the vector to a tibble
(this is just a slightly different version of a data.frame
).
all_TCRb_CDR3_aa <- tibble(amino_acid = all_TCRb_CDR3_aa)
all_TCRb_CDR3_aa
## # A tibble: 195,944 x 1
## amino_acid
## <chr>
## 1 A
## 2 S
## 3 S
## 4 D
## 5 G
## 6 V
## 7 G
## 8 Q
## 9 F
## 10 A
## # … with 195,934 more rows
Now, we can create the dataset of amino acid frequencies in the TCR-data like so:
TCR_aa_frequencies <- all_TCRb_CDR3_aa %>%
count(amino_acid) %>%
filter(amino_acid %in% aa) %>%
mutate(f_TCR = n/sum(n)) %>%
select(amino_acid, f_TCR)
TCR_aa_frequencies
## # A tibble: 20 x 2
## amino_acid f_TCR
## <chr> <dbl>
## 1 A 0.117
## 2 C 0.000501
## 3 D 0.0325
## 4 E 0.0661
## 5 F 0.0379
## 6 G 0.116
## 7 H 0.0126
## 8 I 0.0129
## 9 K 0.00979
## 10 L 0.0467
## 11 M 0.00315
## 12 N 0.0321
## 13 P 0.0362
## 14 Q 0.0753
## 15 R 0.0473
## 16 S 0.178
## 17 T 0.0716
## 18 V 0.0218
## 19 W 0.00808
## 20 Y 0.0754
Now, we can join the two created datasets to one and add the chemical profile:
aa_frequencies <- TCR_aa_frequencies %>%
full_join(natural_aa_frequencies, by = "amino_acid") %>%
full_join(tibble(amino_acid = names(chemistry), chemical_profile = chemistry), by = "amino_acid")
aa_frequencies
## # A tibble: 20 x 4
## amino_acid f_TCR f_Eukaryota chemical_profile
## <chr> <dbl> <dbl> <chr>
## 1 A 0.117 0.0763 Hydrophobic
## 2 C 0.000501 0.0176 Polar
## 3 D 0.0325 0.0540 Acidic
## 4 E 0.0661 0.0642 Acidic
## 5 F 0.0379 0.0387 Hydrophobic
## 6 G 0.116 0.0633 Polar
## 7 H 0.0126 0.0244 Basic
## 8 I 0.0129 0.0510 Hydrophobic
## 9 K 0.00979 0.0564 Basic
## 10 L 0.0467 0.0929 Hydrophobic
## 11 M 0.00315 0.0225 Hydrophobic
## 12 N 0.0321 0.0428 Neutral
## 13 P 0.0362 0.0541 Hydrophobic
## 14 Q 0.0753 0.0421 Neutral
## 15 R 0.0473 0.0571 Basic
## 16 S 0.178 0.0834 Polar
## 17 T 0.0716 0.0556 Polar
## 18 V 0.0218 0.0620 Hydrophobic
## 19 W 0.00808 0.0124 Hydrophobic
## 20 Y 0.0754 0.0287 Polar
And plot the correspondence:
Now, let us try to gain insights into where the differences are largest. We can do this by updating the dataset with a new variable f_diff
, which is simply the difference between f_TCR
and f_Eukaryota
aa_frequencies <- aa_frequencies %>%
mutate(f_diff = f_TCR - f_Eukaryota)
aa_frequencies
## # A tibble: 20 x 5
## amino_acid f_TCR f_Eukaryota chemical_profile f_diff
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 A 0.117 0.0763 Hydrophobic 0.0408
## 2 C 0.000501 0.0176 Polar -0.0171
## 3 D 0.0325 0.0540 Acidic -0.0216
## 4 E 0.0661 0.0642 Acidic 0.00191
## 5 F 0.0379 0.0387 Hydrophobic -0.000862
## 6 G 0.116 0.0633 Polar 0.0522
## 7 H 0.0126 0.0244 Basic -0.0118
## 8 I 0.0129 0.0510 Hydrophobic -0.0382
## 9 K 0.00979 0.0564 Basic -0.0466
## 10 L 0.0467 0.0929 Hydrophobic -0.0463
## 11 M 0.00315 0.0225 Hydrophobic -0.0194
## 12 N 0.0321 0.0428 Neutral -0.0108
## 13 P 0.0362 0.0541 Hydrophobic -0.0179
## 14 Q 0.0753 0.0421 Neutral 0.0331
## 15 R 0.0473 0.0571 Basic -0.00980
## 16 S 0.178 0.0834 Polar 0.0941
## 17 T 0.0716 0.0556 Polar 0.0160
## 18 V 0.0218 0.0620 Hydrophobic -0.0403
## 19 W 0.00808 0.0124 Hydrophobic -0.00432
## 20 Y 0.0754 0.0287 Polar 0.0467
We can plot these differences like so:
SGY
VLK
Let us take a closer look at which amino acid residues are found at which positions in the TCRb-CDR3. The following code takes the TCRb_CDR3_unique
dataset, filters to amino_acid_no_CF
sequences of length X (See Q10) and then it takes all those sequences and puts them into the ggseqlogo
function, which produces a https://prosite.expasy.org/sequence_logo.html:
my_k <- 13 # Put in the value from Q10
TCRb_CDR3_unique %>%
filter(k == my_k) %>%
pull(amino_acid_no_CF) %>%
ggseqlogo
Try replacing the value of my_k
with a couple of different values and inspect the resulting logos.
Go back in your rmarkdown document and find the command
TCRb_CDR3 <- TCRb_CDR3 %>%
mutate(amino_acid_no_CF = str_sub(amino_acid, 2, -2))
Change the command to
TCRb_CDR3 <- TCRb_CDR3 %>%
mutate(amino_acid_no_CF = str_sub(amino_acid, 5, -5))
Then re-run the entire analysis and re-inspect the correspondence plot
The following is a clarification of the learning objectives of this exercise. After having completed the above exercise, you should be able to:
[https://www.r-project.org/ R]
project and workspace on [https://rstudio.cloud RStudio Cloud][https://www.r-project.org/ R]
/ [https://www.tidyverse.org/ Tidyverse]
commands to load and wrangle data[https://ggplot2.tidyverse.org/ ggplot]
, extract and interpret insights[https://www.ncbi.nlm.nih.gov/pubmed/29036507 ggseqlogo]
, extract and interpret insights