Single cell technologies for T-cell epitopes

The intention with the following exercises is that you discuss and solve the tasks and questions in your groups, so talk to each other for mutual benefit.

Introduction

The purpose of this exercise is to get a feel for T-cell CDR3 sequence data and to try to predict which sequences will bind a specific peptide. We will use two approaches for prediction: a naive 'database look-up' approach and a neural network approach. In the naive approach, we have a set of CDR3 sequences for which we know the corresponding epitopes. We compare each query CDR3 sequence with this validated set and compute a similarity score. We could then assume that a high similarity score can be used to infer the epitope for the query CDR3 sequence. However, if this approach solved the problem of pairing CDR3 sequences and epitopes, we wouldn't still be doing this research. The pairing is simply more complicated than that. Therefore, we instead use neural networks, which let us model non-linear relations. In this exercise we present our in-house model, netTCR. Other models exist, but we can only speak for our own.
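
The naive look-up idea can be sketched in a few lines of Python. This is only an illustration: the `similarity` function below uses Python's `difflib` as a stand-in and is not the scoring used by the MAIT-match server, and the entries in `toy_db` are hypothetical toy data chosen for the example, not curated IEDB records.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the MAIT-match kernel."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_hit(query: str, database: dict[str, str]) -> tuple[str, str, float]:
    """Return (closest database CDR3, its annotated epitope, similarity)."""
    hit = max(database, key=lambda cdr3: similarity(query, cdr3))
    return hit, database[hit], similarity(query, hit)


# Hypothetical toy database: CDR3 sequence -> epitope (illustration only).
toy_db = {
    "CASSIRSSYEQYF": "GILGFVFTL",
    "CASSLAPGATNEKLFF": "NLVPMVATV",
}

hit, epitope, score = best_hit("CASSIRSAYEQYF", toy_db)
print(hit, epitope, round(score, 3))
```

The look-up simply transfers the epitope of the most similar database sequence to the query, which is exactly the assumption the exercise asks you to question.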

In this exercise, you will be working with curated T-cell epitopes from the IEDB and single cell immune profiling data sets from 10x Genomics. The peptides GILGFVFTL, GLCTLVAML and NLVPMVATV in complex with HLA-A*02:01 are heavily studied, and the IEDB contains many TCR-CDR3b sequences annotated as binders. Here, we will create a database of annotated binders using the IEDB and then use single cell data to compare the similarity between TCR-CDR3b sequences annotated as binders and non-binders.

Theoretical questions

Discuss the following 11 quick questions with your group to make sure that you are on track:

  • Q1: Where in the cell are MHC Class I molecules found?
  • Q2: Do all human cells express MHC Class I molecules?
  • Q3: How many HLA class I molecules does each person have?
  • Q4: Which T-cells interact with MHC Class I?
  • Q5: Where do the peptides that bind MHC Class I come from?
  • Q6: Does MHC class I distinguish between self and non-self?
  • Q7: Explain the difference between a peptide and an epitope in the context of MHC Class I presentation
  • Q8: How do virus peptides end up being presented by MHC Class I?
  • Q9: In cancer, there is no influx of foreign proteins, explain how cancer epitopes arise
  • Q10: Briefly explain why understanding all three components of the TCRpMHC system is so important in the context of vaccinology
  • Q11: What are the source proteins for the peptides GILGFVFTL, GLCTLVAML and NLVPMVATV?

Set up your Google Colab Notebook

Access the notebooks using a Google Drive account. Notebooks can be found here. You need the notebooks to complete the exercise. The notebooks contain code blocks that you will not be held accountable for. Coding is not a prerequisite for this course, but you are encouraged to try to understand what is going on. All you need to do is upload data, run the notebooks, and download data.

Retrieve Data

IEDB Data

  1. Go to the Immune Epitope Database and Analysis Resource, which you worked with on day 2 of the course
  2. In the Epitope box, select Linear Epitope and enter the sequence of one of the aforementioned peptides
  3. In the Assay box, tick Positive Assays Only and T Cell Assays
  4. In the MHC Restriction box, click Find
    1. Click the blue plus sign next to MHC molecule
    2. Click the blue plus sign next to Class I
    3. Click the blue plus sign next to Human
    4. Click the blue plus sign next to HLA-A
    5. Click HLA-A*02:01
    6. Click Apply
  5. Click Search
  6. On the left of the results page find the Receptor box and tick Has receptor sequence, set type to TCRab and Chain to beta
  7. Scroll down and click Search
  8. Now, click the Receptors tab
  9. Verify that you are on the T Cell Receptors tab
  10. Click Export Results and select Export to CSV file
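
Once exported, the CSV file can be reduced to a clean list of CDR3 sequences. The sketch below shows one way to do this with Python's standard `csv` module. The column name `CDR3 beta` and the miniature file contents are hypothetical; inspect the header of your own export to find the actual column holding the beta chain CDR3.

```python
import csv
import io

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def unique_cdr3s(handle, column):
    """Collect unique, cleaned CDR3 sequences from one CSV column."""
    sequences = []
    for row in csv.DictReader(handle):
        seq = (row.get(column) or "").strip().upper()
        # Skip empty cells, non-standard characters and duplicates.
        if seq and set(seq) <= VALID_RESIDUES and seq not in sequences:
            sequences.append(seq)
    return sequences


# Hypothetical miniature export; the real file has many more columns.
demo = io.StringIO(
    "Receptor ID,CDR3 beta\n"
    "1,CASSIRSSYEQYF\n"
    "2,cassirssyeqyf\n"
    "3,\n"
    "4,CASS?RSSYEQYF\n"
)
print(unique_cdr3s(demo, "CDR3 beta"))
```

For a real file, replace `demo` with `open("your_export.csv", newline="")` and adjust the column name.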

10x Genomics Single Cell T-Cell Data

You may need to register to download the data in this section. If prompted, simply do so.

  1. Go to the 10x Genomics Single Cell Immune Profiling Datasets
  2. Find the Application Note - A New way of Exploring Immunity section
  3. Under Cell Ranger 3.0.2, click the CD8+ T cells of Healthy Donor X link (choose an X in the range 1-4)
  4. This will take you to the page for the CD8+ T cells of Healthy Donor X
  5. Scroll down to the Output Files section
  6. Click the binarized matrix CSV link to download the vdj_v1_hs_aggregated_donorX_binarized_matrix data set

Prepare and analyse the data

From the previous exercise steps you should now have the following two files:

tcell_receptor_table_export_1578653238.csv
vdj_v1_hs_aggregated_donor1_binarized_matrix.csv

Verify that this is the case. Note that 1578653238 is a download id, so yours will be different. Likewise, donor1 will differ depending on which donor you chose.

Go to this Colab notebook

This notebook will create eight files:

  1. Two database files (based on the IEDB data) containing CDR3 alpha and beta sequences, respectively.
  2. Three "binder" query files (based on the 10x annotated binders) containing CDR3 alpha, beta, and alpha+beta sequences, respectively.
  3. Three "non-binder" query files (based on the 10x annotated non-binders) containing CDR3 alpha, beta, and alpha+beta sequences, respectively.
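
The notebook does this splitting for you. As a rough sketch of what is involved, the code below separates a per-cell clonotype string into alpha and beta CDR3 sequences and checks whether a cell is flagged as a binder for a given peptide. It assumes that the clonotype is written as `TRA:...;TRB:...` and that binder flags live in columns whose names contain the peptide and end in `_binder`. Both are assumptions about the file layout that you should check against your download, and the example row is made up.

```python
def split_chains(clonotype: str) -> dict[str, list[str]]:
    """Split 'TRA:SEQ;TRB:SEQ' into lists of alpha and beta CDR3s."""
    chains = {"TRA": [], "TRB": []}
    for entry in clonotype.split(";"):
        chain, _, seq = entry.partition(":")
        if chain in chains and seq:
            chains[chain].append(seq)
    return chains


def is_binder(row: dict[str, str], peptide: str) -> bool:
    """True if any '<...peptide...>_binder' column is flagged for the cell."""
    return any(
        value.strip().lower() == "true"
        for name, value in row.items()
        if peptide in name and name.endswith("_binder")
    )


# Made-up row; column names are assumptions about the file layout.
row = {
    "clonotype": "TRA:CAASVSGTYKYIF;TRB:CASSIRSSYEQYF",
    "A0201_GILGFVFTL_example_binder": "True",
}
chains = split_chains(row["clonotype"])
label = "binder" if is_binder(row, "GILGFVFTL") else "non-binder"
print(label, chains["TRA"], chains["TRB"])
```

A cell can carry more than one alpha or beta chain, which is why each chain maps to a list rather than a single sequence.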

Create Sequence Logos

This is also part of the Colab notebook.

  • Q15: Look at and interpret the logos
  • Q16: It seems that for both logos, the first and last three positions are quite conserved. Yet, we know that one logo represents a set of non-binders and the other a set of binders, so why can we not see the specificity for the target peptide?
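
To make the notion of "conserved" concrete, the height of a logo column can be approximated by its information content, here computed as log2(20) minus the Shannon entropy of the residues observed at that position. The sketch below is a simplified calculation that assumes equal-length sequences and applies no small-sample correction, so it will not exactly reproduce the logos from the notebook. The input sequences are made up.

```python
import math
from collections import Counter


def information_content(seqs):
    """Per-position information content (bits) for equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    bits = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        # A fully conserved column reaches the maximum of log2(20) bits.
        bits.append(math.log2(20) - entropy)
    return bits


# Made-up sequences: identical ends, variable middle.
toy = ["CASSAEQYF", "CASTGEQYF", "CASRLEQYF", "CASKDEQYF"]
for position, value in enumerate(information_content(toy), start=1):
    print(position, round(value, 2))
```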

Go to this Google Sheet and enter the stated values, which you can find in your analysis above. The first line is filled in for reference. (Note that the database/query files we create are based on random sampling, so do not despair if you do not get the same means and standard deviations.)

Naive database look-up approach

Calculate Similarity Scores using the MAIT-match server

The MAIT-match server calculates a similarity metric between two amino acid sequences based on the method proposed by Shen et al. If two sequences are highly similar, the score will be close to 1, and if they are highly dissimilar, the score will be close to 0.
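
One common way to obtain a score that is 1 for identical sequences and approaches 0 for unrelated ones is to normalise a sequence kernel K as K(a, b) / sqrt(K(a, a) · K(b, b)). The sketch below does this with a very simple kernel that counts shared k-mers. It is a simplified stand-in meant to illustrate the normalisation only; it is not the method of Shen et al. and will not reproduce MAIT-match scores.

```python
import math
from collections import Counter


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def raw_kernel(a: str, b: str, k: int = 3) -> int:
    """Number of matching k-mer pairs between the two sequences."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


def normalized_similarity(a: str, b: str, k: int = 3) -> float:
    """Kernel score scaled to [0, 1]; 1.0 for identical sequences."""
    denom = math.sqrt(raw_kernel(a, a, k) * raw_kernel(b, b, k))
    return raw_kernel(a, b, k) / denom if denom else 0.0


print(normalized_similarity("ASSKARSLGNRGNEQF", "ASSKARSLGNRGNEQF"))
print(normalized_similarity("ASSKARSLGNRGNEQF", "ASSKGGTRGNEQF"))
```

Because this stand-in only rewards exact k-mer matches, it is much stricter than a kernel that scores similar amino acids as partial matches.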

From the above exercise step, you need 6 of the 8 files and you will have to make several runs/queries. For each query you will need:

  1. Database: an IEDB file (either alpha or beta)
  2. Query: a 10x Genomics file (either alpha or beta, and either binders or non-binders)

Make sure you query both 10x alpha files against the single IEDB alpha file and both 10x beta files against the single IEDB beta file. You should end up with 4 files, which we recommend naming as follows:

  1. mait.cdr3a.pos.tsv
  2. mait.cdr3a.neg.tsv
  3. mait.cdr3b.pos.tsv
  4. mait.cdr3b.neg.tsv

The task ahead is to use the MAIT Match Server to calculate similarity scores between the known IEDB binders and the annotated 10x binders/non-binders, to see if 10x annotated binders are more similar to the IEDB binders, compared to 10x non-binders.

  1. Go to the MAIT Match Server
  2. Click the Choose file button under the first submission window and select the query file (10x data)
  3. Click the Choose file button under the second submission window and select the database file (IEDB data)
  4. Click Submit

Once the server is done:

  1. Open a text editor (Mac: TextEdit, Windows: Notepad)
  2. Copy / paste the resulting output
  3. Delete everything before Res Sequence MAIT_hit Score, so that the first line in the file is Res Sequence MAIT_hit Score
  4. Delete everything after the last line of scores, e.g. Best ASSKARSLGNRGNEQF ASSKGGTRGNEQF 0.8460
  5. Name the file according to the scheme above
  6. Repeat the above steps for each of the remaining query/database combinations
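
If you prefer not to trim the output by hand, the same clean-up can be sketched in Python. The function below keeps only the rows between the `Res Sequence MAIT_hit Score` header and the end of the score table. It assumes that every score line has four whitespace-separated fields and starts with `Best`, as in the example above; the surrounding text in the demo is hypothetical.

```python
HEADER = ("Res", "Sequence", "MAIT_hit", "Score")


def extract_score_table(raw_output: str) -> list[tuple[str, str, float]]:
    """Return (query, best database hit, score) rows from pasted output."""
    rows = []
    in_table = False
    for line in raw_output.splitlines():
        fields = line.split()
        if tuple(fields) == HEADER:
            in_table = True
            continue
        if in_table:
            if len(fields) == 4 and fields[0] == "Best":
                rows.append((fields[1], fields[2], float(fields[3])))
            elif rows:
                break  # first non-score line after the table ends it
    return rows


# The score line is the example from the text; the rest is hypothetical.
raw = """some text before the table
Res Sequence MAIT_hit Score
Best ASSKARSLGNRGNEQF ASSKGGTRGNEQF 0.8460
some text after the table
"""
print(extract_score_table(raw))
```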


Evaluate Similarity Scores

From the previous exercise steps you should now have the following four files:

  1. mait.cdr3a.pos.tsv
  2. mait.cdr3a.neg.tsv
  3. mait.cdr3b.pos.tsv
  4. mait.cdr3b.neg.tsv

Go to this Colab notebook
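
The notebook produces the summary for you. As a minimal sketch of the comparison, the code below computes the mean and standard deviation of two score lists and, as an optional extra, the AUC: the probability that a randomly chosen binder scores higher than a randomly chosen non-binder. The scores used here are made-up numbers, not results.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of scores."""
    return mean(scores), stdev(scores)


def auc(pos, neg):
    """Fraction of (binder, non-binder) pairs ranked correctly; ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores for illustration only.
binder_scores = [0.91, 0.85, 0.88, 0.79]
non_binder_scores = [0.83, 0.80, 0.86, 0.77]

print("binders:", summarize(binder_scores))
print("non-binders:", summarize(non_binder_scores))
print("AUC:", auc(binder_scores, non_binder_scores))
```

An AUC of 0.5 means the scores do not separate the two groups at all, while 1.0 means perfect separation.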

The netTCR server

The netTCR server predicts the binding probability between a T-cell receptor CDR3 protein sequence and an MHC-I peptide binding to HLA-A*02:01. In this part we want to compare the ability of the netTCR server to assign scores with that of the MAIT-match server.

  1. Go to the netTCR Server
  2. Click the Upload file button and select a query file you created previously
  3. In the drop-down menu Select peptides, click the arrow down and select the target peptide your query file is based on
  4. Click the green Submit button
  5. Repeat the above such that you have predictions for alpha, beta and alpha+beta for both the positive and negative set.

When the predictions are completed:

  1. Copy the results to a text editor and save files with meaningful names
  2. Upload the results to this notebook

Meaningful filenames:

  1. nettcr.cdr3a.pos.tsv
  2. nettcr.cdr3a.neg.tsv
  3. nettcr.cdr3b.pos.tsv
  4. nettcr.cdr3b.neg.tsv
  5. nettcr.cdr3ab.pos.tsv
  6. nettcr.cdr3ab.neg.tsv

Advanced

If you reach this point and there is still time left for exercises, then

  • Redo the above analysis for another target peptide and enter the values you find in the Google Sheet
  • Your sequence logos show clear conservation of the first and last 3 positions. Adjust your analysis above to account for this
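
One simple way to account for the conserved ends is to strip them before computing similarity, so that only the variable middle of each CDR3 contributes to the score. The sketch below is one possible adjustment, not a prescribed solution; the flank width of 3 follows the observation above.

```python
def core(cdr3: str, flank: int = 3) -> str:
    """Drop `flank` residues from each end; return '' if nothing remains."""
    if len(cdr3) <= 2 * flank:
        return ""
    return cdr3[flank:len(cdr3) - flank]


print(core("CASSIRSSYEQYF"))  # prints SIRSSYE
```

You would apply `core` to both the database and the query sequences before rerunning the similarity analysis.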

Exercise Summary

A student who has met the objectives of this exercise will be able to:

  1. Extract peptide and allele specific T-cell epitopes from the IEDB
  2. Extract single cell immune profiling data from 10x Genomics
  3. Upload data to a Google Colab notebook
  4. Run an analysis using a notebook with pre-specified code
  5. Calculate sequence similarity metrics using the MAIT Match Server
  6. Use similarity metrics to create a TCR-CDR3b database
  7. Use similarity metrics to query a TCR-CDR3b database
  8. Interpret score distributions for TCR-CDR3b sequences annotated as binders and non-binders