WikiSysop: Created page with "'''It is the intention with the following exercises, that you discuss and solve the following tasks and questions in your groups, so talk to each other for mutual benefit''' = Introduction = The purpose of this exercise is to get a feeling with T-cell CDR3 sequence data and to try to make predictions on which sequences will bind a specific peptide. We will here use to approaches for predictions: a naive 'database look-up' approach and a neural network approach. With th..."

2024-03-20T14:46:49Z


'''The intention of the following exercises is that you discuss and solve the tasks and questions in your groups, so talk to each other for mutual benefit.'''

= Introduction =

The purpose of this exercise is to get a feel for T-cell CDR3 sequence data and to try to predict which sequences will bind a specific peptide. We will use two approaches for prediction: a naive 'database look-up' approach and a neural network approach. In the naive approach, we have a set of CDR3 sequences for which we know the corresponding epitopes. We compare our query CDR3 sequences with this validated set and compute a similarity score. We could then assume that a high similarity score can be used to infer the epitope for the query CDR3 sequence. However, if this approach solved the problem of pairing CDR3 sequences and epitopes, we wouldn't still be doing this research. The pairing is simply more complicated than that. Therefore, we instead use neural networks, which let us model non-linear relations. In the exercise we present our in-house model, netTCR. Other models exist, but we can only speak for our own.
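To make the naive look-up idea concrete, here is a minimal Python sketch. It is an illustration only: the similarity function is a stand-in from the Python standard library (<code>difflib</code>), not the metric used by the servers in this exercise, and the names <code>similarity</code>, <code>best_hit</code> and <code>toy_db</code>, as well as the two database entries, are hypothetical and chosen purely for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the metric used by the course servers."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_hit(query: str, database: dict[str, str]) -> tuple[str, str, float]:
    """Return (closest database CDR3, its annotated epitope, similarity score)."""
    hit = max(database, key=lambda cdr3: similarity(query, cdr3))
    return hit, database[hit], similarity(query, hit)


# Hypothetical toy database for illustration: CDR3 sequence -> annotated epitope
toy_db = {
    "CASSIRSSYEQYF": "GILGFVFTL",
    "CASSLAPGATNEKLFF": "NLVPMVATV",
}

# The query inherits the epitope of its most similar database entry.
print(best_hit("CASSIRSAYEQYF", toy_db))
```

The weakness discussed above is visible in the design: the query always receives the epitope of its nearest neighbour, whether or not that neighbour is actually a meaningful match.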

In this exercise, you will be working with curated T-cell epitopes from the IEDB and single cell immune profiling data sets from 10x Genomics. The peptides <code>GILGFVFTL</code>, <code>GLCTLVAML</code> and <code>NLVPMVATV</code> in complex with <code>HLA-A*02:01</code> are heavily studied, and many <code>TCR-CDR3b</code> sequences annotated as binders exist in the IEDB. Here, we will create a database of annotated binders using the IEDB and then use single cell data to compare the similarity between <code>TCR-CDR3b</code> sequences annotated as binders and non-binders.

= Theoretical questions =

Discuss the following ''11 quick ones'' with your group to make sure that you are on track:

* '''Q1:''' Where in the cell are MHC Class I molecules found?
* '''Q2:''' Do all human cells express MHC Class I molecules?
* '''Q3:''' How many HLA class I molecules does each person have?
* '''Q4:''' Which T-cells interact with MHC Class I?
* '''Q5:''' Where do the peptides that bind MHC Class I come from?
* '''Q6:''' Does MHC class I distinguish between self and non-self?
* '''Q7:''' Explain the difference between a peptide and an epitope in the context of MHC Class I presentation
* '''Q8:''' How do virus peptides end up being presented by MHC Class I?
* '''Q9:''' In cancer, there is no influx of foreign proteins. Explain how cancer epitopes arise
* '''Q10:''' Briefly explain why understanding all three components of the TCRpMHC system is so important in the context of vaccinology
* '''Q11:''' What are the source proteins for the peptides <code>GILGFVFTL</code>, <code>GLCTLVAML</code> and <code>NLVPMVATV</code>?

= Set up your Google Colab Notebook =

Access the notebooks using a Google Drive account. The notebooks can be found [https://drive.google.com/drive/folders/18360Nkt6oKFVDfBVvO1tp33z8VnWTCSG?usp=sharing here]. You need the notebooks to complete the exercise. The notebooks contain code blocks that you will not be held accountable for. Coding is not a prerequisite for this course, but you are encouraged to try to understand what is going on. All you need to do is upload data, run the notebooks, and download data.

= Retrieve Data =

== IEDB Data ==

# Go to the [https://www.iedb.org/ Immune Epitope Database and Analysis Resource], which you worked with on day 2 of the course
# In the '''Epitope''' box, select ''Linear Epitope'' and enter the sequence of one of the aforementioned peptides
# In the '''Assay''' box, tick ''Positive Assays Only'' and ''T Cell Assays''
# In the '''MHC Restriction''' box, click ''Find''
## Click the blue plus sign next to ''MHC molecule''
## Click the blue plus sign next to ''Class I''
## Click the blue plus sign next to ''Human''
## Click the blue plus sign next to ''HLA-A''
## Click ''HLA-A*02:01''
## Click ''Apply''
# Click ''Search''
# On the left of the results page find the '''Receptor''' box and tick ''Has receptor sequence'', set '''type''' to ''TCRab'' and '''Chain''' to ''beta''
# Scroll down and click ''Search''
# Now, click the '''Receptors''' tab
# Verify that you are on the '''T Cell Receptors''' tab
# Click ''Export Results'' and select ''Export to CSV file''

== 10x Genomics Single Cell T-Cell Data ==

''You may need to register to download the data in this section. If prompted, simply do so.''

# Go to the [https://support.10xgenomics.com/single-cell-vdj/datasets 10x Genomics Single Cell Immune Profiling Datasets]
# Find the '''Application Note - A New way of Exploring Immunity''' section
# Under '''Cell Ranger 3.0.2''', click the '''CD8+ T cells of Healthy Donor X''' link (choose an X in the range 1-4)
# This will take you to the page for the CD8+ T cells of Healthy Donor X
# Scroll down to the '''Output Files''' section
# Click the '''binarized matrix CSV''' link to download the '''vdj v1 hs aggregated donorX binarized matrix''' data set

= Prepare and analyse the data =

From the previous exercise steps you should now have the following two files:

* <code>tcell_receptor_table_export_1578653238.csv</code>
* <code>vdj_v1_hs_aggregated_donor1_binarized_matrix.csv</code>

Verify that this is the case. Note that <code>1578653238</code> is a download ID, so yours will be different. Likewise, <code>donor1</code> will differ depending on which donor you chose.
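Because the download ID and donor number vary, file names are easier to match by pattern than by exact name. The following Python sketch shows one way to do this check. The two patterns are derived from the example file names above, while the function name <code>find_inputs</code> and the example list are hypothetical.

```python
from fnmatch import fnmatch

# Patterns derived from the two example file names; the ID and donor number vary.
IEDB_PATTERN = "tcell_receptor_table_export_*.csv"
TENX_PATTERN = "vdj_v1_hs_aggregated_donor[1-4]_binarized_matrix.csv"


def find_inputs(filenames):
    """Return (IEDB export names, 10x matrix names) found in a list of file names."""
    iedb = sorted(f for f in filenames if fnmatch(f, IEDB_PATTERN))
    tenx = sorted(f for f in filenames if fnmatch(f, TENX_PATTERN))
    return iedb, tenx


# In a notebook you could pass os.listdir(".") instead of this example list.
example = [
    "tcell_receptor_table_export_1578653238.csv",
    "vdj_v1_hs_aggregated_donor1_binarized_matrix.csv",
    "notes.txt",
]
print(find_inputs(example))
```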

Go to [https://colab.research.google.com/drive/1aNgFI8jmf_SYl9PRfYVl19gFoNnGBDcV?usp=sharing this colab notebook]

This notebook will create eight files:

# Two database files (based on the IEDB data) containing CDR3 alpha and beta sequences, respectively.
# Three "binder" query files (based on the 10x annotated binders) containing CDR3 alpha, beta, and alpha+beta sequences, respectively.
# Three "non-binder" query files (based on the 10x annotated non-binders) containing CDR3 alpha, beta, and alpha+beta sequences, respectively.
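The notebook handles the splitting into alpha, beta and alpha+beta files for you. For those curious about what such a step can look like, here is a minimal Python sketch. It assumes that the paired CDR3 annotation of a cell is stored as a single string of the form <code>TRA:...;TRB:...</code>. That format, the function names and the short example sequences are assumptions made for illustration and should be checked against your downloaded file.

```python
def split_chains(annotation: str) -> dict[str, list[str]]:
    """Split an ASSUMED 'TRA:<cdr3>;TRB:<cdr3>' string into per-chain lists."""
    chains = {"TRA": [], "TRB": []}
    for item in annotation.split(";"):
        chain, sep, cdr3 = item.partition(":")
        if sep and chain in chains and cdr3:
            chains[chain].append(cdr3)
    return chains


def paired(annotation: str):
    """Return (alpha, beta) only when exactly one of each chain is present."""
    chains = split_chains(annotation)
    if len(chains["TRA"]) == 1 and len(chains["TRB"]) == 1:
        return chains["TRA"][0], chains["TRB"][0]
    return None


# Made-up short sequences, used only to show the behaviour.
print(split_chains("TRA:CAVSA;TRB:CASSL"))
print(paired("TRB:CASSL"))  # no alpha chain, so no alpha+beta pair
```

Cells with missing or multiple chains are excluded from the paired result here; this is one possible design choice, not necessarily the one made in the notebook.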

== Create Sequence Logos ==

This is also part of the colab notebook.

* '''Q15:''' Look at and interpret the logos
* '''Q16:''' It seems that for both logos, the first and last three positions are quite conserved. Yet, we know that one logo represents a set of non-binders and the other a set of binders, so why can we not see the specificity for the target peptide?
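As background for interpreting the logos: the height of a logo column reflects how conserved that position is. The Python sketch below computes a simple per-position information content for equal-length sequences. It is a simplified illustration (a uniform background over 20 amino acids and no small-sample correction), and the logo method used in the notebook may differ. The function name and example sequences are made up.

```python
from collections import Counter
from math import log2


def column_information(seqs):
    """Per-position information content (bits) for equal-length sequences.

    Computed as log2(20) minus the Shannon entropy of each column,
    without any small-sample correction.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    info = []
    for pos in range(length):
        counts = Counter(s[pos] for s in seqs)
        total = sum(counts.values())
        entropy = -sum((n / total) * log2(n / total) for n in counts.values())
        info.append(log2(20) - entropy)
    return info


# Made-up sequences: conserved ends, variable middle position.
for pos, bits in enumerate(column_information(["CASAF", "CASGF", "CASTF", "CASLF"])):
    print(pos, round(bits, 2))
```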

'''Go to this [https://docs.google.com/spreadsheets/d/1lqsEby0AuK8UmpgGBDUECM3kZvgi5CibHkJ0LSa78YE/edit?usp=sharing Google Sheet] and enter the stated values, which you can find in your analysis above. The first line is filled in for reference.''' (Note that the database/query files we create are based on random sampling, so do not despair if you do not get the same means and standard deviations.)

= Naive database look-up approach =

== Calculate Similarity Scores using the MAIT-match server ==

The MAIT-match server calculates a similarity metric between two amino acid sequences based on the method proposed by [https://arxiv.org/abs/1205.6031 Shen ''et al.'']. If two sequences are highly similar, then the score will be close to 1 and if they are highly dissimilar, then the score will be close to 0.
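One common way to obtain a similarity score between 0 and 1 is to normalize a kernel value <code>K(a, b)</code> by <code>sqrt(K(a, a) * K(b, b))</code>. The Python sketch below illustrates this with a simple k-mer counting kernel. This kernel is a deliberately simplified stand-in and does not reproduce the method of Shen ''et al.'' or the MAIT-match server; all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 3) -> int:
    """Stand-in kernel: dot product of the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * cb[kmer] for kmer, n in ca.items())


def normalized_similarity(a: str, b: str, k: int = 3) -> float:
    """K(a, b) / sqrt(K(a, a) * K(b, b)); 1 for identical, 0 for no shared k-mers."""
    denom = sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


# The two sequences from the example server output line shown later on this page.
print(normalized_similarity("ASSKARSLGNRGNEQF", "ASSKGGTRGNEQF"))
```

The value printed here will not match the server score for the same pair, since the underlying kernels differ.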

From the above exercise step, you need 6 of the 8 files and you will have to make several runs/queries. For each query you will need:

# Database: an IEDB file (either alpha or beta)
# A query: a 10x Genomics data set (either alpha or beta, and either binders or non-binders)

Make sure you query both 10x alpha files against the single IEDB alpha file and both 10x beta files against the single IEDB beta file. You should end up with 4 files, which we recommend naming as follows:

# mait.cdr3a.pos.tsv
# mait.cdr3a.neg.tsv
# mait.cdr3b.pos.tsv
# mait.cdr3b.neg.tsv

The task ahead is to use the [http://www.cbs.dtu.dk/services/MAIT_Match/ MAIT Match Server] to calculate similarity scores between the known IEDB binders and the annotated 10x binders/non-binders, to see if 10x annotated binders are more similar to the IEDB binders, compared to 10x non-binders.

# Go to the [http://www.cbs.dtu.dk/services/MAIT_Match/ MAIT Match Server]
# Click the ''Choose file'' button under the first submission window and select the query file (10x data)
# Click the ''Choose file'' button under the second submission window and select the database file (IEDB data)
# Click ''Submit''

Once the server is done:

# Open a text editor (Mac: TextEdit, Windows: Notepad)
# Copy / paste the resulting output
# Delete everything ''before'' <code>Res Sequence MAIT_hit Score</code>, so that the first line in the file is <code>Res Sequence MAIT_hit Score</code>
# Delete everything ''after'' the last line of scores, e.g. <code>Best ASSKARSLGNRGNEQF ASSKGGTRGNEQF 0.8460</code>
# Name the file according to the scheme above
# Repeat the above steps for the remaining queries
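To check that a cleaned file has the expected shape, you can parse it as whitespace-separated columns. The Python sketch below assumes exactly the layout described above: a header line <code>Res Sequence MAIT_hit Score</code> followed by four-column score lines. The function name <code>parse_mait_output</code> is hypothetical, and the evaluation notebook may read the files differently.

```python
def parse_mait_output(text: str):
    """Parse rows under the 'Res Sequence MAIT_hit Score' header into dicts."""
    header = ["Res", "Sequence", "MAIT_hit", "Score"]
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != header:
        raise ValueError("first line must be 'Res Sequence MAIT_hit Score'")
    rows = []
    for fields in lines[1:]:
        if len(fields) != 4:
            raise ValueError(f"unexpected row: {fields}")
        rows.append({"query": fields[1], "hit": fields[2], "score": float(fields[3])})
    return rows


# Header plus the example score line quoted in the steps above.
example = """Res Sequence MAIT_hit Score
Best ASSKARSLGNRGNEQF ASSKGGTRGNEQF 0.8460
"""
print(parse_mait_output(example))
```

If the function raises an error, the file most likely still contains text before the header or after the last score line.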

== Evaluate Similarity Scores ==

From the previous exercise steps you should now have the following four files:

# mait.cdr3a.pos.tsv
# mait.cdr3a.neg.tsv
# mait.cdr3b.pos.tsv
# mait.cdr3b.neg.tsv

Go to [https://colab.research.google.com/drive/1TRkvF831K12r2p30ag4QkyW7vK1SAkr9?usp=sharing this colab notebook]
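The notebook performs the evaluation for you. As a rough idea of what comparing two score distributions involves, the Python sketch below computes the mean and standard deviation of each set and the AUC, that is, the probability that a randomly chosen binder scores higher than a randomly chosen non-binder. The function names and the scores are made up for illustration, and the notebook's own analysis may differ.

```python
from statistics import mean, stdev


def summarize(scores):
    """Number of scores, mean and sample standard deviation."""
    return {"n": len(scores), "mean": mean(scores), "sd": stdev(scores)}


def auc(pos, neg):
    """Probability that a positive outscores a negative (ties count as 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores, for illustration only.
binders = [0.91, 0.85, 0.78, 0.80]
non_binders = [0.82, 0.70, 0.75, 0.66]

print(summarize(binders))
print(summarize(non_binders))
print(auc(binders, non_binders))
```

An AUC near 0.5 means the scores do not separate binders from non-binders, while an AUC near 1 means they separate them almost perfectly.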

= The netTCR server =

The [https://services.healthtech.dtu.dk/service.php?NetTCR-2.0 netTCR server] predicts the binding probability between a T-cell receptor CDR3 protein sequence and an MHC-I peptide binding to HLA-A*02:01. In this part, we want to compare the ability of the netTCR server to assign scores with that of the MAIT-match server.

# Go to the [https://services.healthtech.dtu.dk/service.php?NetTCR-2.0 netTCR Server]
# Click the '''Upload''' file button and select a query file you created previously
# In the drop-down menu '''Select peptides''', click the arrow down and select the target peptide your query file is based on
# Click the green '''Submit''' button
# Repeat the above such that you have predictions for alpha, beta and alpha+beta for both the positive and negative set.

When the predictions are completed:
# Copy the results to a text editor and save files with meaningful names
# Upload the results to this [https://colab.research.google.com/drive/1uyx3MSDTunvkrl8me8a41e-0Sj4lzzOv?usp=sharing notebook]

Meaningful filenames:
# nettcr.cdr3a.pos.tsv
# nettcr.cdr3a.neg.tsv
# nettcr.cdr3b.pos.tsv
# nettcr.cdr3b.neg.tsv
# nettcr.cdr3ab.pos.tsv
# nettcr.cdr3ab.neg.tsv
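With this many result files, it is easy to lose track of one. The following Python sketch generates the file names of the naming scheme above and reports which ones are still missing. The helper names <code>expected_files</code> and <code>missing_files</code> are hypothetical, and the check is optional.

```python
from itertools import product


def expected_files(tool, chains, labels=("pos", "neg")):
    """Build names of the form <tool>.<chain>.<label>.tsv."""
    return [f"{tool}.{chain}.{label}.tsv" for chain, label in product(chains, labels)]


def missing_files(expected, present):
    """Names in `expected` that are absent from `present`."""
    return sorted(set(expected) - set(present))


nettcr = expected_files("nettcr", ("cdr3a", "cdr3b", "cdr3ab"))
print(nettcr)

# In a notebook you could pass os.listdir(".") as the second argument.
print(missing_files(nettcr, ["nettcr.cdr3a.pos.tsv", "nettcr.cdr3a.neg.tsv"]))
```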

= Advanced =

If you reach this point and there is still time left for exercises, then

* Redo the above analysis for another target peptide and re-enter the values you find in the google sheet
* From your sequence logos, you can see the clear conservation of the first and last 3 positions. Adjust your above analysis to account for this
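One possible way to account for the conserved ends is to remove them before computing similarities, so that only the variable middle part of each CDR3 contributes to the score. The Python sketch below shows such a trimming step. The function name <code>trim_conserved</code> is hypothetical, and other adjustments (for example down-weighting rather than removing positions) are equally valid.

```python
def trim_conserved(cdr3: str, n: int = 3) -> str:
    """Drop the first and last n residues; return '' if nothing would remain."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return cdr3[n:len(cdr3) - n] if len(cdr3) > 2 * n else ""


# Sequence taken from the example MAIT-match output line earlier on this page.
print(trim_conserved("ASSKARSLGNRGNEQF"))
```

After trimming, the query and database files can be rewritten and submitted again in the same way as before.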

= Exercise Summary =

A student who has met the objectives of this exercise will be able to:

# Extract peptide and allele specific T-cell epitopes from the IEDB
# Extract single cell immune profiling data from 10x genomics
# Upload data to a Google Colab notebook
# Run an analysis using a notebook with pre-specified code
# Calculate sequence similarity metrics using the [http://www.cbs.dtu.dk/services/MAIT_Match/ MAIT Match Server]
# Use similarity metrics to create a TCR-CDR3b database
# Use similarity metrics to query a TCR-CDR3b database
# Interpret score distributions for TCR-CDR3b sequences annotated as binders and non-binders

Single cell technologies for T-cell epitopes - Revision history