ExGenbank-new-answers: Difference between revisions

Latest revision as of 12:41, 10 September 2025

Note: the numbers in Part 2 and Part 3 were updated on September 6, 2023.

Part 1

QUESTION 1.1

a) Inspecting the FEATURE table of the entry reveals that two CDS regions are defined; therefore there are two genes in this entry. As stated on the GenBank hand-out, "CDS" is the most stable definition of a protein coding gene used in the GenBank format - sometimes "gene" will also be present, but CDS is more commonly used.

b) Columba livia (Rock pigeon / domestic pigeon).

c) The HEADER contains general information about the entry: organism, publication references, keywords, accession ID etc. The FEATURE table contains information that refers to coordinates in the DNA sequence - for example the definition of CDS regions.

QUESTION 1.2

a) Since the FEATURE table has been thrown away, we no longer have the coordinates for the genes. As such they are "in there" somewhere, but we cannot find them without using external information.

b) The entire "ORIGIN" block (all the DNA sequence) has been converted to FASTA format. The FEATURE table is discarded. From the HEADER block the definition (title) and accession number are preserved; the rest is discarded.
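The conversion described in b) can be illustrated with a minimal Python sketch. It assumes a single-record flat file with the usual DEFINITION / ACCESSION / ORIGIN layout; the function name `genbank_to_fasta` and the toy record are made up for illustration, and this is not how NCBI itself produces FASTA files.

```python
def genbank_to_fasta(genbank_text, width=60):
    """Keep only accession, definition and sequence from one GenBank record."""
    accession = ""
    definition_parts = []
    seq_parts = []
    in_origin = False
    in_definition = False
    for line in genbank_text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("ORIGIN"):      # sequence block starts here
            in_origin = True
            continue
        if in_origin:
            # Sequence lines look like "        1 atggcctaat ga":
            # drop the position number and the spaces, keep the letters.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
            continue
        if line.startswith("DEFINITION"):
            definition_parts.append(line[len("DEFINITION"):].strip())
            in_definition = True
        elif in_definition and line.startswith(" "):
            # Indented continuation line of a long definition
            definition_parts.append(line.strip())
        else:
            in_definition = False
            if line.startswith("ACCESSION"):
                fields = line.split()
                accession = fields[1] if len(fields) > 1 else ""
            # Everything else (references, FEATURES, ...) is discarded.
    sequence = "".join(seq_parts).upper()
    header = (">" + accession + " " + " ".join(definition_parts)).rstrip()
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + chunks) + "\n"


# Toy record (hypothetical, not a real GenBank entry)
toy_record = """LOCUS       TOY00001      12 bp    DNA
DEFINITION  Toy record for illustration,
            complete cds.
ACCESSION   TOY00001
FEATURES             Location/Qualifiers
     CDS             1..12
ORIGIN
        1 atggcctaat ga
//
"""

fasta = genbank_to_fasta(toy_record)
print(fasta)
```

Note that the CDS coordinates from the FEATURE table appear nowhere in the output, which is exactly why the genes can no longer be located from the FASTA file alone.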

QUESTION 1.3: The downloaded file has Unix line endings. Remember from "Plain text files and Geany" that line endings are indicated in the status line at the bottom of the Geany window.
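If you want to check the line endings without an editor, the same information can be read from the raw bytes of the file. The sketch below is only an illustration; the function name `detect_line_endings` and the labels it returns are assumptions, not part of the exercise.

```python
def detect_line_endings(data: bytes) -> str:
    """Classify the line endings found in the raw bytes of a text file."""
    crlf = data.count(b"\r\n")            # Windows: carriage return + line feed
    lf = data.count(b"\n") - crlf         # Unix: line feed alone
    cr = data.count(b"\r") - crlf         # Classic Mac: carriage return alone
    counts = {"Windows (CRLF)": crlf, "Unix (LF)": lf, "Classic Mac (CR)": cr}
    present = [name for name, n in counts.items() if n > 0]
    if not present:
        return "no line endings"
    if len(present) > 1:
        return "mixed"
    return present[0]


unix_example = b">seq1\nACGT\n"
windows_example = b">seq1\r\nACGT\r\n"
print(detect_line_endings(unix_example))
print(detect_line_endings(windows_example))
```

For a real file, read it in binary mode (`open(path, "rb").read()`) so that Python does not translate the line endings before you count them.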

QUESTION 1.4

a) The "join" statement defines how to extract the coding sequence from the entire length of DNA in the entry: "join(1104..1192,1306..1510,1614..1742)" is basically a recipe stating to paste together the three intervals - and we'll get the protein coding part of the gene: the coding exons glued together. The CDS will always start with a START codon (e.g. ATG) and end with a STOP codon (e.g. TAA).

b) The gene contains three coding exons. Note: from a CDS definition we don't get any information about the UnTranslated Regions (UTRs) that are often found before and after the coding region in the mRNA.
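The "recipe" in a join statement can be followed mechanically. The sketch below is a minimal illustration under two assumptions: the location is a plain forward-strand join (no "complement(...)", no "<" or ">" markers), and the helper names `parse_join` and `extract_cds` as well as the toy DNA string are invented for this example.

```python
import re


def parse_join(location):
    """Return (start, end) pairs from a string like 'join(5..7,11..13)'."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def extract_cds(dna, location):
    """Glue the listed intervals together, in the order given."""
    # GenBank coordinates are 1-based and inclusive, Python slices are
    # 0-based and end-exclusive, hence start - 1 and an unchanged end.
    return "".join(dna[start - 1:end] for start, end in parse_join(location))


# Toy sequence: the coding exons are in upper case, everything else in lower case
toy_dna = "ccccATGaaaGCCttTAAgg"
toy_cds = extract_cds(toy_dna, "join(5..7,11..13,16..18)")
print(toy_cds)

# The intervals from the entry in the exercise
exons = parse_join("join(1104..1192,1306..1510,1614..1742)")
cds_length = sum(end - start + 1 for start, end in exons)
print(exons, cds_length)
```

For the join statement in the exercise the three intervals add up to 89 + 205 + 129 = 423 nucleotides, which is 141 codons including the STOP codon.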

QUESTION 1.5: The first number is the GI number (GenInfo Identifier, taken from the VERSION line in the header). The subsequent numbers are the positions (coordinates) in the original gene entry (taken from the join line).

Part 2

QUESTION 2.1.1

a) 252,430 hits.

b) No. There is e.g. the first hit, M57671.1, "Octodon degus insulin mRNA, complete cds", which is from a Degu, a rat-like rodent from Chile. In fact, you can see on the right side of the results page that only 11,521 hits are from human. There is no reason to expect only human results from GenBank, since it is not a human-centric database.

c) No. There are many hits to complete or partial chromosome sequences which contain a lot of other genes. An example is JWIN03000075.1, "Camelus dromedarius breed African isolate Drom800 Contig74, whole genome shotgun sequence".

QUESTION 2.1.2: a) In the Search details box, you find "insulin[All Fields]".

QUESTION 2.1.3

a) 18,838 hits.

b) Yes, it is among the hits on the first page of results.

Title: Homo sapiens insulin (INS) gene, complete cds

Accession: AH002844

c) ("Homo sapiens"[Organism] OR human[All Fields]) AND insulin[All Fields]

QUESTION 2.2

a) 5609 hits.

b) Yes (except for 10 hits that are synthetic constructs, but based on human sequence). See the "Top Organisms" box on the right.

c) No.

There are many examples of insulin-degrading enzyme, insulin-like growth factor, insulin receptor and insulin-induced genes.
Many entries are mRNA and therefore not gene entries.

QUESTION 2.3

a) 9 hits.

b) 15 hits.

c) Accession codes: AH002844 J00265 J00268, Locus name: AH002844, Definition (title): "Human insulin gene, complete cds".

QUESTION 2.4

The important thing here is not the precise search string, but that you understand the principle of using "kill-words". One possible answer could be:

insulin[title] complete[title] NOT mRNA[title] NOT receptor[title] NOT receptor-like[title] NOT "insulin like"[title] NOT "insulin degrading"[title] NOT "growth factor"[title] NOT "family member"[title] NOT "insulin induced"[title] NOT "insulin dependent"[title] NOT "insulin promoter"[title]

which gives 21 hits, representing 13 organisms and some synthetic constructs.

Note: double quotes ("") are used to add two-word "kill phrases".

Note: don't kill "insulin precursor"! Insulin is always synthesized as a precursor, preproinsulin, that contains both a signal peptide, a propeptide, and the two mature chains. More about insulin in the exercises next week.
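Long kill-word queries are easy to mistype, so it can help to assemble them from lists. The following is a small sketch of that idea; the function name `build_title_query` is hypothetical, and the only rule it encodes is the one from the note above: multi-word phrases are wrapped in double quotes.

```python
def build_title_query(required, kill_terms):
    """Build an Entrez-style query restricted to the title field."""

    def tag(term):
        # Phrases containing a space must be quoted to be searched as a unit
        quoted = f'"{term}"' if " " in term else term
        return f"{quoted}[title]"

    parts = [tag(term) for term in required]
    parts += [f"NOT {tag(term)}" for term in kill_terms]
    return " ".join(parts)


kill_words = [
    "mRNA", "receptor", "receptor-like", "insulin like", "insulin degrading",
    "growth factor", "family member", "insulin induced", "insulin dependent",
    "insulin promoter",
]
insulin_query = build_title_query(["insulin", "complete"], kill_words)
print(insulin_query)
```

This reproduces the search string given above; adding or removing a kill-word is then a one-item change to the list.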

Part 3

QUESTION 3.1: It's a good idea to separate the two logical parts of the search string. First, one for narrowing down the species:

(rat[ORGANISM] OR mouse[ORGANISM])

And one for actually searching for insulin:

insulin[KEYWORD]

They can then be AND'ed together:

(rat[ORGANISM] OR mouse[ORGANISM]) AND insulin[KEYWORD]

This gives 10 hits.

By manual inspection of the results, I then pick the following entries:

J00748 - Rat insulin II gene (ins-2) with two introns
J00747 - Rat insulin-I (ins-1) gene
X04724 - Mouse preproinsulin gene II
X04725 - Mouse preproinsulin gene I

Note: rodents have two copies of the insulin gene in their genomes.

Note: using "Protein Name" as field yields no results - you cannot assume that entries are always annotated with Protein Name.
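The "two logical parts" approach from the start of this answer can also be written down programmatically. The sketch below builds the species part and the insulin part separately, AND's them together, and then forms a search URL for NCBI's E-utilities "esearch" service. The helper name `any_of` is hypothetical, and the choice of `db=nuccore` (the Nucleotide database) is an assumption about where you want to search; the URL is only constructed here, not requested.

```python
from urllib.parse import urlencode


def any_of(terms, field):
    """OR together several terms in the same field, wrapped in parentheses."""
    return "(" + " OR ".join(f"{term}[{field}]" for term in terms) + ")"


species_part = any_of(["rat", "mouse"], "ORGANISM")
insulin_part = "insulin[KEYWORD]"
query = f"{species_part} AND {insulin_part}"
print(query)

# Assumed endpoint and parameters: E-utilities esearch against the Nucleotide database
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
search_url = base_url + "?" + urlencode({"db": "nuccore", "term": query})
print(search_url)
```

The parentheses matter: without them the AND and OR operators would not necessarily group the way you intend, so keeping each logical part in its own parenthesized unit is the safe choice.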

QUESTION 3.2: It will never be possible to do this query perfectly - a good attempt could be:

"alcohol dehydrogenase"[title] complete[title] NOT mRNA[title] NOT synthetic[title]

which gives 2179 hits.

Note: as many as 360 of these hits are from one organism, Populus nigra (Poplar tree).

QUESTION 3.3

"Capra hircus"[ORGANISM] AND "alpha globin"[title]

This gives 6 hits. There are 2 alpha globin genes, HBAI and HBAII, and they are both present in two entries. Correct answers could be:

EU938074 Capra hircus I alpha globin (HBAI) gene, complete cds
EU938078 Capra hircus II alpha globin (HBAII) gene, complete cds

QUESTION 3.4: From the Tree of Life we find that ruminants (Danish: "Drøvtyggere") are contained in the taxon "Ruminantia". Since we can search any level of taxonomy in the ORGANISM field, we can use this:

Ruminantia[ORGANISM] AND "alpha globin"[title]

This yields 16 hits (which will need a bit of clean-up).

QUESTION 3.5: As in 3.2, it will never be possible to do this query perfectly - a good attempt could be:

actin[title] AND actin[protein name] NOT mRNA[title] NOT partial[title]

which yields 585 hits.

Note that this will miss entries that are not annotated with "Protein name". Alternatively, you could search with the "Title" field, but that requires a lot of "kill words":

actin[title] complete[title] NOT mRNA[title] NOT pseudogene[title] NOT regulator[title] 
 NOT binding[title] NOT associated[title] NOT related[title]

yields 1106 hits and still requires some cleanup.

QUESTION 3.6

human[organism] "insulin receptor"[title] NOT mRNA[title] NOT substrate[title] NOT partial[title]

gives 74 hits, with #1 or #2 being the right one:

NG_008852.2 Homo sapiens insulin receptor (INSR), RefSeqGene on chromosome 19
AH002851.2 Homo sapiens insulin receptor (INSR) gene, complete cds
