First look exercise answers

Latest revision as of 12:48, 20 November 2025

Solutions

First look at data

1. Navigate to home directory.

cd

2. Create directory first_look.

mkdir first_look
cd first_look

3. Copy FASTQ file.

cp /home/projects/22126_NGS/exercises/first_look/reads.fastq.gz .

4. Inspect reads.

zless -S reads.fastq.gz

5. Count number of reads (lines / 4).

zcat reads.fastq.gz | wc -l

If the result is 1000 lines, the file contains 1000 / 4 = 250 reads, because each FASTQ record spans four lines.
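
The division by four can also be done in the shell itself. The following is a minimal sketch, assuming a small hypothetical file demo.fastq.gz created on the spot (it is not part of the exercise data) and using gzip -dc, which behaves like zcat:

```shell
# Build a tiny two-read FASTQ file (hypothetical demo data, not the exercise file).
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > "$tmpdir/demo.fastq.gz"

# Count the lines, then divide by 4 because each FASTQ record spans four lines.
n_lines=$(gzip -dc "$tmpdir/demo.fastq.gz" | wc -l)
n_reads=$((n_lines / 4))

echo "$n_reads reads"
```

For the exercise file, the same arithmetic would be applied to the line count of reads.fastq.gz.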

---

Illumina data

1. Extract paired-end data.

tar xvfz /home/projects/22126_NGS/exercises/first_look/pairedReads.tar.gz

This creates:

  • ERR243038_1.fastq
  • ERR243038_2.fastq
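
To check which files an archive holds before unpacking it, tar can list the contents with the t option instead of x. A minimal sketch, assuming a hypothetical archive demo.tar.gz built here from two empty placeholder files:

```shell
# Create a small demo archive (hypothetical names, not the exercise data).
tmpdir=$(mktemp -d)
touch "$tmpdir/demo_1.fastq" "$tmpdir/demo_2.fastq"
tar czf "$tmpdir/demo.tar.gz" -C "$tmpdir" demo_1.fastq demo_2.fastq

# t = list contents, z = gzip-compressed, f = archive file name.
listing=$(tar tzf "$tmpdir/demo.tar.gz")
echo "$listing"
```

Nothing is extracted by the listing step, so it is safe to run in any directory.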

2. Inspect the first read header in each file.

head ERR243038_1.fastq
head ERR243038_2.fastq

Extract first 10 header lines using grep:

grep '^@ERR243038' ERR243038_1.fastq | head
grep '^@ERR243038' ERR243038_2.fastq | head

Example output:

@ERR243038.1 HS4_09359:1:1101:1072:21612#33/1
@ERR243038.2 HS4_09359:1:1101:1076:69021#33/1
@ERR243038.3 HS4_09359:1:1101:1081:60568#33/1
...

3. Remove trailing /1 and /2 using sed.

grep '^@ERR243038' ERR243038_1.fastq | sed 's:/1$::' > human_1.headers
grep '^@ERR243038' ERR243038_2.fastq | sed 's:/2$::' > human_2.headers

Alternative using a generic regex, which strips only the final character (the read number) and leaves the trailing slash in place:

grep '^@ERR243038' ERR243038_1.fastq | sed 's/.$//' > human_1.headers
grep '^@ERR243038' ERR243038_2.fastq | sed 's/.$//' > human_2.headers
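
To see how the two sed expressions differ, both can be applied to a single header. This sketch uses the first header from the example output; the variable names are illustrative only:

```shell
# Header copied from the example output.
header='@ERR243038.1 HS4_09359:1:1101:1072:21612#33/1'

# Anchored pattern: removes the literal "/1" only when it ends the line.
specific=$(printf '%s\n' "$header" | sed 's:/1$::')

# Generic pattern: removes whichever single character ends the line.
generic=$(printf '%s\n' "$header" | sed 's/.$//')

echo "$specific"   # @ERR243038.1 HS4_09359:1:1101:1072:21612#33
echo "$generic"    # @ERR243038.1 HS4_09359:1:1101:1072:21612#33/
```

Either form yields identical headers across the two files, so the comparison in the next step works with both.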

4. Compare the results.

View first 10 lines of each:

head human_1.headers
head human_2.headers

Side-by-side:

paste human_1.headers human_2.headers | head

Ensure they match:

diff human_1.headers human_2.headers

If diff prints nothing, the two header files are identical, which means the reads appear in the same order in both paired-end files.
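
The same check can be scripted, since diff exits with status 0 when the files are identical and 1 when they differ. A minimal sketch, assuming two hypothetical header files written here for illustration:

```shell
# Two small header files with identical content (hypothetical demo data).
tmpdir=$(mktemp -d)
printf '@demo.1 A#33\n@demo.2 B#33\n' > "$tmpdir/demo_1.headers"
printf '@demo.1 A#33\n@demo.2 B#33\n' > "$tmpdir/demo_2.headers"

# Use the exit status of diff rather than reading its output.
if diff "$tmpdir/demo_1.headers" "$tmpdir/demo_2.headers" > /dev/null; then
    result="in sync"
else
    result="out of sync"
fi
echo "$result"

# Count rows where the two tab-separated columns produced by paste differ.
mismatches=$(paste "$tmpdir/demo_1.headers" "$tmpdir/demo_2.headers" \
    | awk -F'\t' '$1 != $2 {n++} END {print n+0}')
echo "$mismatches mismatched header pairs"
```

The awk count is handy when the files do differ, because it reports how many pairs disagree without printing every line.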