22113 - User contributions [en]

QT clustering

2025-05-15T12:34:15Z

WikiSysop: /* References */

__NOTOC__
===Description===
The program reads a number of data points (multi-dimensional vectors) from a file and partitions those into clusters. Clustering is important in discovering patterns or modes in multi-dimensional data sets. It is also a method of organizing data examples into similar groups (clusters). In this particular case, QT clustering partitions the data set such that each example (data point) is assigned to exactly one cluster. Compared with K-means clustering, QT clustering has two advantages: the number of clusters does not have to be given beforehand, and repeated runs yield the same result. It requires more CPU time, though.

QT (Quality Threshold) takes its name from the user-determined threshold (a distance) that sets the maximal diameter of the clusters the method computes.

While not strictly required, you could make your QT algorithm into a class. That would make it very easy to include and use in the future.

===Input and output===
The input is a tab-separated file containing one data point on each line. As the example shows, each line starts with a name for the data point, followed by the numbers that make up its vector. The program should handle any given vector size, but the vector size is constant within a data file. Input file example:

ex01 8.76 3.29 1.05
ex02 12.3 2.33 3.53
ex03 -0.54 -3.56 1.45
.
.
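
Reading this format could be sketched as below. This is a minimal illustration, not a required design: the function name <code>read_points</code> and the choice to store each point as a (name, vector) pair are assumptions. It splits on any whitespace, which covers tabs, and checks that the vector size is constant.

```python
def read_points(lines):
    """Parse lines of 'name value value ...' into (name, vector) pairs."""
    points = []
    dimension = None
    for line in lines:
        fields = line.split()          # splits on tabs and spaces alike
        if not fields:
            continue                   # skip blank lines
        name = fields[0]
        vector = tuple(float(x) for x in fields[1:])
        if dimension is None:
            dimension = len(vector)    # the first line fixes the vector size
        elif len(vector) != dimension:
            raise ValueError(f"{name}: expected {dimension} values, got {len(vector)}")
        points.append((name, vector))
    return points


sample = ["ex01\t8.76\t3.29\t1.05", "ex02\t12.3\t2.33\t3.53", "ex03\t-0.54\t-3.56\t1.45"]
points = read_points(sample)
```

With a real file, the same function can be given the open file object, for example <code>read_points(open("vectors.txt"))</code>.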

The output is all data points, partitioned into the clusters they belong to. In the output example, each cluster starts with the cluster name, followed by the members of that cluster:

Cluster-1
ex10 1.04 2.98 1.34
ex12 1.23 2.34 1.69
.
.
Cluster-2
ex04 -0.34 3.51 9.02
ex07 -8.56 5.12 12.5
.
.
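
Producing this layout could look like the sketch below. The name <code>format_clusters</code> and the representation of a cluster as a list of (name, vector) pairs are assumptions made for the illustration. Note that <code>str()</code> on a float does not necessarily reproduce the exact text of the input file.

```python
def format_clusters(clusters):
    """Return the output text for a list of clusters of (name, vector) pairs."""
    lines = []
    for number, members in enumerate(clusters, start=1):
        lines.append(f"Cluster-{number}")
        for name, vector in members:
            # One tab-separated line per member: the name, then the coordinates.
            lines.append("\t".join([name] + [str(x) for x in vector]))
    return "\n".join(lines)


example = [
    [("ex10", (1.04, 2.98, 1.34)), ("ex12", (1.23, 2.34, 1.69))],
    [("ex04", (-0.34, 3.51, 9.02))],
]
print(format_clusters(example))
```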

===Examples of program execution===
cluster.py vectors.txt 500
The 500 is the maximum diameter for a cluster in the data set. An interesting twist could be to decide the cluster diameter automatically, as X % of the distance between the two data points furthest away from each other. It would be called like this:
cluster.py vectors.txt 30%
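
Handling both forms of the threshold argument could be sketched as follows. The function names <code>data_diameter</code> and <code>parse_threshold</code> are assumptions, and the brute-force search over all pairs is only one way to find the two points furthest from each other.

```python
import itertools
import math


def data_diameter(vectors):
    """Largest Euclidean distance between any two vectors (0.0 if fewer than two)."""
    return max(
        (math.dist(a, b) for a, b in itertools.combinations(vectors, 2)),
        default=0.0,
    )


def parse_threshold(argument, diameter):
    """Turn '500' into 500.0, and '30%' into 30 percent of the data set diameter."""
    if argument.endswith("%"):
        return diameter * float(argument[:-1]) / 100.0
    return float(argument)


vectors = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
diameter = data_diameter(vectors)
threshold = parse_threshold("30%", diameter)
```

On the command line, the argument would typically come from <code>sys.argv[2]</code>.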

===Details===
The method works for any type of data set where it is possible to calculate a ''distance'' between any two points. In this project we only consider Euclidean distances, as they are simple to calculate with [https://en.wikipedia.org/wiki/Pythagorean_theorem Pythagoras's theorem].
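
As a small illustration, the Euclidean distance between two vectors of the same size can be written out like this. The function name <code>euclidean</code> is an assumption; from Python 3.8 the standard library function <code>math.dist</code> computes the same value.

```python
import math


def euclidean(a, b):
    """Euclidean distance: square root of the summed squared coordinate differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean((0.0, 0.0), (3.0, 4.0)))
```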

The algorithm works like this:
# For each point in the data set, calculate the ''candidate cluster'' with that point as the starting point. With '''n''' points in the data set, there are '''n''' candidate clusters.
# Choose the candidate cluster that contains the most points as the primary cluster and remove those points from the data set. If two or more candidate clusters tie for the most points, pick the one with the smallest diameter. If they are still equal, pick the first one you found.
# Repeat steps 1 and 2 until there are no points left in the data set or a set limit has been reached, for example when all remaining candidate clusters have fewer than, say, 10 points and are therefore not true clusters, but noise.
# Print the resulting clusters.
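
The selection rule in step 2 could be sketched as below. It assumes that each candidate is stored as a (members, diameter) pair and that the list is in the order the candidates were found; the names <code>pick_winner</code> and <code>remove_members</code> are hypothetical. Python's <code>max</code> returns the first of several equally good items, which gives the "first one you found" behaviour.

```python
def pick_winner(candidates):
    """Most members first, then smallest diameter, then first found."""
    return max(candidates, key=lambda c: (len(c[0]), -c[1]))


def remove_members(data, members):
    """Return the data set without the points of the chosen cluster."""
    chosen = set(members)
    return [point for point in data if point not in chosen]


candidates = [
    (["a", "b"], 2.0),
    (["c", "d"], 1.0),   # same size as the first, smaller diameter
    (["e", "f"], 1.0),   # ties with the previous one, but was found later
    (["g"], 0.1),
]
winner = pick_winner(candidates)
```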

A ''candidate cluster'' for a point is calculated using "complete linkage" like this:
# Consider the starting point as the beginning of the candidate cluster for that point. This is trivially seen as a subset of your data set.
# Add one point from your data set at a time, choosing the point that extends the candidate cluster diameter the least. Again, if two or more points tie, pick the first one you find.
# Continue adding points - that is, repeat step 2 - to your candidate cluster until the diameter exceeds the Quality Threshold (hence the name QT clustering). The point that makes the diameter exceed the QT is not part of the candidate cluster.
Important definition: The diameter of a data set (or candidate cluster) is the distance between the two points furthest from each other.
Note: Every point in the data set can participate in multiple candidate clusters. No point is permanently assigned to a candidate cluster until you actually pick the largest one and remove the winner's points from the data set.
Note: Building a candidate cluster according to the above method is '''NOT''' the same as just adding the point nearest to the starting point - or nearest to any point in the growing candidate cluster.
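
A straightforward, unoptimized reading of these steps is sketched below. It is one interpretation, not a reference solution: the function name <code>candidate_cluster</code> and the use of list indices to identify points are assumptions. The diameter after adding a point is the larger of the current diameter and the distance from that point to its farthest member.

```python
import math


def candidate_cluster(start, vectors, threshold):
    """Return (member_indices, diameter) for the candidate grown from index start."""
    members = [start]
    diameter = 0.0
    remaining = [i for i in range(len(vectors)) if i != start]
    while remaining:
        best_index, best_diameter = None, None
        for i in remaining:
            # Diameter if point i joined the candidate cluster.
            farthest = max(math.dist(vectors[i], vectors[m]) for m in members)
            new_diameter = max(diameter, farthest)
            # Strict '<' keeps the first point found when several tie.
            if best_diameter is None or new_diameter < best_diameter:
                best_index, best_diameter = i, new_diameter
        if best_diameter > threshold:
            break   # the point that would exceed the QT is not added
        members.append(best_index)
        remaining.remove(best_index)
        diameter = best_diameter
    return members, diameter


vectors = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
members, diameter = candidate_cluster(0, vectors, 3.5)
```

In this small example the candidate grown from the first point takes in the second and third points, while the fourth would push the diameter past the threshold and is left out.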

A fairly large part of this project is optimizing the algorithm just described. This is done by gaining insight into the algorithm: not calculating what does not need to be calculated, and not calculating the same thing again and again.
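
One possible way to avoid repeated work, shown purely as an illustration, is to calculate every pairwise distance once and to remember, for each point not yet in the candidate cluster, the distance to its farthest member. The names <code>distance_matrix</code> and <code>candidate_cluster_fast</code> are hypothetical, and other optimizations may well serve better.

```python
import math


def distance_matrix(vectors):
    """Calculate each pairwise distance once and store it symmetrically."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = math.dist(vectors[i], vectors[j])
    return matrix


def candidate_cluster_fast(start, matrix, threshold):
    """Grow a candidate cluster using looked-up distances only."""
    members = [start]
    diameter = 0.0
    # farthest[i]: distance from outside point i to its farthest current member.
    farthest = {i: matrix[start][i] for i in range(len(matrix)) if i != start}
    while farthest:
        # min() returns the first of several equally good points.
        best = min(farthest, key=lambda i: max(diameter, farthest[i]))
        new_diameter = max(diameter, farthest[best])
        if new_diameter > threshold:
            break
        members.append(best)
        diameter = new_diameter
        del farthest[best]
        # Only the newly added member can change a point's farthest distance.
        row = matrix[best]
        for i in farthest:
            if row[i] > farthest[i]:
                farthest[i] = row[i]
    return members, diameter


vectors = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
matrix = distance_matrix(vectors)
members, diameter = candidate_cluster_fast(0, matrix, 3.5)
```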

Various data sets: [https://teaching.healthtech.dtu.dk/material/22113/point100.lst 100 data points], [https://teaching.healthtech.dtu.dk/material/22113/point1000.lst 1000 data points], [https://teaching.healthtech.dtu.dk/material/22113/point3000.lst 3000 data points],
[https://teaching.healthtech.dtu.dk/material/22113/point4169.lst 4169 data points], [https://teaching.healthtech.dtu.dk/material/22113/point5000.lst 5000 data points], [https://teaching.healthtech.dtu.dk/material/22113/point6000.lst 6000 data points], [https://teaching.healthtech.dtu.dk/material/22113/point10000.lst 10000 data points].

Checking the correctness of your program:
[https://teaching.healthtech.dtu.dk/material/22113/QT-small.lst Result] of clustering the small list (100 points) with QT being 30% of the diameter. 
[https://teaching.healthtech.dtu.dk/material/22113/result1000.lst Result] of clustering 1000 points with QT being 30% of the diameter.
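
Comparing your own output with a reference result could be sketched as below. This assumes that both texts use the output layout shown under "Input and output", which has not been verified for the linked result files. The names <code>parse_clusters</code> and <code>same_partition</code> are hypothetical. The comparison ignores the cluster numbering and the order of members.

```python
def parse_clusters(lines):
    """Collect the member names of each 'Cluster-N' block as a set of frozensets."""
    clusters = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Cluster-"):
            clusters.append(set())           # a new cluster starts here
        elif clusters:
            clusters[-1].add(fields[0])      # the first field is the point name
    return {frozenset(cluster) for cluster in clusters}


def same_partition(lines_a, lines_b):
    """True if both outputs group the same names into the same clusters."""
    return parse_clusters(lines_a) == parse_clusters(lines_b)


mine = ["Cluster-1", "ex10\t1.04", "ex12\t1.23", "Cluster-2", "ex04\t-0.34"]
reference = ["Cluster-1", "ex04\t-0.34", "Cluster-2", "ex12\t1.23", "ex10\t1.04"]
print(same_partition(mine, reference))
```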

The algorithm is deterministic - meaning that an implementation will yield the same result on the same data set every time. However, there are two places in the description where you "pick the first one you find". Which one comes first is implementation dependent, and therefore two different implementations of QT can give different results. The data sets given here are constructed in such a way that this will NOT happen for them, i.e. no matter how you implement your method, you should get the same result.

===References===
# [http://genome.cshlp.org/content/9/11/1106.full.pdf+html Exploring expression data: identification and analysis of coexpressed genes. LJ Heyer, S Kruglyak, S Yooseph - Genome research, 1999 - genome.cshlp.org]
# [https://www.chem.agilent.com/cag/bsp/products/gsgx/Downloads/pdf/qt_clustering.pdf QT clustering in industry - Agilent]
Peter's speed reference using 30% as the threshold, using pure Python:
{| class="wikitable"
! Points !! Time (seconds) !! Improved time (seconds)
|-
| 100 || 0 || 0
|-
| 1000 || 2 || 1
|-
| 3000 || 32 || 16
|-
| 4169 || 65 || 33
|-
| 5000 || 115 || 53
|-
| 6000 || 149 || 74
|-
| 10000 || 505 || 223
|}
It is not required to achieve these numbers, but it is important to have a reference - and maybe a goal. The improved times have been reached by using efficient data types, not by any change in the method or computer.
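
To compare your own running time with these numbers, a small timing helper could look like the sketch below. The name <code>timed</code> is hypothetical, and <code>sum</code> merely stands in for your clustering function.

```python
import time


def timed(function, *args):
    """Call function(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = function(*args)
    return result, time.perf_counter() - start


result, seconds = timed(sum, range(1000))
print(f"{seconds:.3f} seconds")
```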

QT clustering

2025-04-22T12:17:15Z

WikiSysop: /* References */

__NOTOC__
===Description===
The program reads a number of data points (multi-dimensional vectors) from a file and partitions those into clusters. Clustering is important in discovering patterns or modes in multi-dimensional data sets. It is also a method of organizing data examples into similar groups (clusters). In this particular case, QT clustering partitions the data set such that each example (data point) is assigned to exactly one cluster. QT clustering is superior to K-means clustering in that the number of clusters is not given beforehand and it yields the same result in repeated runs. It requires more CPU time, though.

QT (Quality Threshold) has its name from the user-determined threshold (distance) of the maximal diameter of the clusters that the method computes.

===Input and output===
The input is a tab separated file containing one data point on each line. Each data point is a vector consisting of a number of numbers. The program should handle any given vector size, but the vector size is constant in any data file. Input file example:

ex01 8.76 3.29 1.05
ex02 12.3 2.33 3.53
ex03 -0.54 -3.56 1.45
.
.

The output is all data points, partitioned in the clusters they belong to. Output example where each cluster starts with the cluster and is proceeded by the the members of that cluster:

Cluster-1
ex10 1.04 2.98 1.34
ex12 1.23 2.34 1.69
.
.
Cluster-2
ex04 -0.34 3.51 9.02
ex07 -8.56 5.12 12.5
.
.

===Examples of program execution===
cluster.py vectors.txt 500
The 500 is the maximum diameter for a cluster in the data set. An interesting twist could be to automatically decide the cluster diameter like this: X % of the distance between the two data point furthest away from each other. Called like this
cluster.py vectors.txt 30%

===Details===
The method works for any type of data set where it is possible to calculate a ''distance'' between any two points. In this project we are just considering euclidean distances, as they are simple.
[https://en.wikipedia.org/wiki/Pythagorean_theorem Pythagoras's theorem].

The algorithm works like this.
# For each point in the data set, calculate the ''candidate cluster'' with that point as the starting point. With '''n''' points in the data set, there are '''n''' candidate clusters, obviously.
# Choose the candidate cluster that contains most points as the primary cluster and remove those points from the data set. If two or more candidate clusters have equally most points, pick the cluster with the smallest diameter. If they are still equal, pick the first you found.
# Repeat step 1 and 2 until there are no points in the data set or a set limit has been reached; like all remaining candidate clusters has less than, say, 10 points and are therefore not true clusters, but noise.
# Print the resulting clusters.

A ''candidate cluster'' for a point is calculated using "complete linkage" like this:
# Consider the starting point as the beginning of the candidate cluster for that point. This is trivially seen as a subset of your data set.
# Add one point from your data set at a time in such a way that you extend the candidate cluster diameter the least. Again, if two points would extend the diameter the least, pick the first one you find.
# Continue adding points - that is repeat step 2 - to your candidate cluster until the diameter exceeds the Quality Threshold (hence the name QT clustering). The point that makes the diameter exceed the QT is not part of the candidate cluster.
Important definition: The diameter of a data set (or candidate cluster) is the distance of the two points furthest from each other. 
Note: All points in the data set can participate in multiple candidate clusters. Any point is not permanently assigned to a candidate cluster, before you actually pick the largest one and remove the "winners" points from the data set. 
Note: Building a candidate cluster according to above method is '''NOT''' the same method as just adding the nearest point to the starting point - or any point in the growing candidate cluster.

A fairly large part of this project is optimizing the algorithm just described. This is done by gaining insight in the algorithm - not calculating what does not need to be calculated, not calculating the same thing again and again.

Various data sets: [https://teaching.healthtech.dtu.dk/material/22113/point100.lst 100 data points], [https://teaching.healthtech.dtu.dk/material/22113/point1000.lst 1000 data points],
[https://teaching.healthtech.dtu.dk/material/22113/point4169.lst 4169 data points], [https://teaching.healthtech.dtu.dk/material/22113/point5000.lst 5000 data points], [https://teaching.healthtech.dtu.dk/material/22113/point6000.lst 6000 data points], [https://teaching.healthtech.dtu.dk/material/22113/point10000.lst 10000 data points].

Checking the correctness of your program. 
[https://teaching.healthtech.dtu.dk/material/22113/QT-small.lst Result] of clustering the small list (100 points) with QT being 30% of the diameter. 
[https://teaching.healthtech.dtu.dk/material/22113/result1000.lst Result] of clustering 1000 points with QT being 30% of the diameter.

The algorithm is deterministic - meaning that an implementation will yield the same result on the same data set every time. However, in the description there are two places, where "you pick the first one" you find. This is implementation dependent and therefore two different implementations of QT can give rise to different results. The data sets given here are constructed in such a way, that this will NOT happen for them, i.e. no matter how you implement your method you should get the same result.

===References===
# [http://genome.cshlp.org/content/9/11/1106.full.pdf+html Exploring expression data: identification and analysis of coexpressed genes. LJ Heyer, S Kruglyak, S Yooseph - Genome research, 1999 - genome.cshlp.org]
# [https://www.chem.agilent.com/cag/bsp/products/gsgx/Downloads/pdf/qt_clustering.pdf QT clustering in industry - Agilent]
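Peter's speed reference using 30% as the threshold:
{| class="wikitable"
! Points !! Time (seconds) !! Improved
|-
| 100 || 0 || 0
|-
| 1000 || 2 || 2
|-
| 3000 || 32 || 20
|-
| 4169 || 65 || 44
|-
| 5000 || 115 || 70
|-
| 6000 || 149 || 106
|-
| 10000 || 505 || 342
|}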
It is not required to achieve these numbers, but it is important to have a reference, and maybe a goal.

Unit test - start of reverse polish notation class

2025-03-24T13:52:16Z

WikiSysop: Created page

These are the two files I made in class for unit test demo purposes.

My original class in the file: ReversePolishCalc.py
<pre>
class ReversePolishCalc:
    def __init__(self):
        self.stack = list()

    def _checkstack(self, count):
        # Make sure the stack holds at least count elements
        if len(self.stack) < count:
            raise IndexError("Stack does not contain enough elements to perform operation")

    def push(self, vector):
        # Accept a single number/string or a list/tuple of them
        if isinstance(vector, (int, float, str)):
            vector = [vector]
        if not isinstance(vector, (list, tuple)):
            raise ValueError("Input can not be understood as numbers")
        for number in vector:
            if isinstance(number, (int, float)):
                self.stack.append(number)
            elif isinstance(number, str):
                # Try int first, then float
                try:
                    self.stack.append(int(number))
                except ValueError:
                    try:
                        self.stack.append(float(number))
                    except ValueError:
                        raise ValueError("Input can not be understood as numbers")
            else:
                raise ValueError("Input can not be understood as numbers")

    def pop(self):
        self._checkstack(1)
        return self.stack.pop()

    def add(self):
        self._checkstack(2)
        self.stack[-2] += self.stack[-1]
        del self.stack[-1]

    def subtract(self):
        self._checkstack(2)
        self.stack[-2] -= self.stack[-1]
        del self.stack[-1]

    def multiply(self):
        self._checkstack(2)
        self.stack[-2] *= self.stack[-1]
        del self.stack[-1]

    def divide(self):
        self._checkstack(2)
        if self.stack[-1] == 0:
            raise ZeroDivisionError
        self.stack[-2] /= self.stack[-1]
        del self.stack[-1]

    def factorial(self):
        self._checkstack(1)
        no = int(self.stack[-1])
        if no != self.stack[-1]:
            raise ValueError("Factorial with floats is invalid")
        if no < 0:
            raise ValueError("Factorial can not be calculated with negatives")
        res = 1
        for i in range(2, no+1):
            res *= i
        self.stack[-1] = res
</pre>

My test file of the class: test_ReversePolishCalc.py
<pre>
import pytest
from ReversePolishCalc import ReversePolishCalc as rpc

def test_push1():
    # Arrange
    calc = rpc()
    # Act
    calc.push(1)
    # Assert
    assert calc.stack[-1] == 1, "Simple push of integer 1"

def test_push2():
    # Arrange
    calc = rpc()
    # Act
    calc.push(1)
    # Assert
    assert calc.stack[0] == 1, "Simple push of integer 2"

def test_push3():
    # Arrange
    calc = rpc()
    # Act
    calc.push(1.2)
    # Assert
    assert calc.stack[0] == 1.2, "Push of float"

def test_push4():
    # Arrange
    calc = rpc()
    # Act
    calc.push("1")
    # Assert
    assert calc.stack[0] == 1, "Push of 1 as string"

def test_push5():
    # Arrange
    calc = rpc()
    # Act
    calc.push([1, 1.5, "2.5"])
    # Assert
    assert calc.stack == [1, 1.5, 2.5], "Advanced list push of ints, floats and strings"

# One test function run once per (input, expected) pair
@pytest.mark.parametrize("x,y", [(1,1), (1.2, 1.2), ("1.2", 1.2), ("1", 1)])
def test_push(x,y):
    # Arrange
    calc = rpc()
    # Act
    calc.push(x)
    # Assert
    assert calc.stack[0] == y, "Push of a single int, float or numeric string"

def test_pop1():
    # Arrange
    calc = rpc()
    calc.push(1)
    # Act
    num = calc.pop()
    # Assert
    assert num == 1

def test_failpop1():
    # Arrange
    calc = rpc()
    # Act and assert: popping an empty stack must raise IndexError
    with pytest.raises(IndexError):
        calc.pop()

</pre>

Unit test

2025-03-24T13:48:12Z

WikiSysop: /* Required course material for the lesson */

__NOTOC__
{| width=500 style="font-size: 10px; float:right; margin-left: 10px; margin-top: -56px;"
|Previous: [[Classes]]
|Next: [[Scientific Libraries, Pandas, Numpy]]
|}
== Required course material for the lesson ==
Powerpoint: [https://teaching.healthtech.dtu.dk/material/22113/22113_08-Testing.ppt Testing] 
Online: [https://docs.pytest.org/ pytest documentation] 
Resource: [[Example code - Unit test]] 
Resource: [[Unit test - start of reverse polish notation class]] 
Blog: [https://www.joelonsoftware.com/2000/04/30/top-five-wrong-reasons-you-dont-have-testers/ On testing], by a co-founder of Stack Exchange. 

== Subjects covered ==
Overview of test methods 
Unit test using pytest framework.

== Exercises to be handed in ==
You should make a special folder for the exercises. I will refer to my special folder as ''unittest'' in these exercises. You will also see some ''__pycache__'' folders appear in places. This is Python's cache for "compiled" programs. It is safe to ignore and also safe to delete, because it may become outdated.
# Use your factorial function from exercise 2 in [[Making Functions]]. If you did not do so already, change it to use exceptions instead of '''sys.exit()''', when an error occurs. Now make simple unit tests for the following test cases: 12, 2, 1, 0, -1, 3.0, 3.4, "3", "3.1.", "ABC". The factorial function and all test functions must be in one single file (''factorial_test.py'' in ''unittest''), which you can run ''pytest'' on.
# Now remove the factorial function from ''factorial_test.py'' and put it in its own file ''factorial.py''. Import it in ''factorial_test.py'' like this: '''from factorial import factorial'''. The first factorial is the name of the .py file, the second factorial is the name of your factorial function. Just run ''pytest'' (no file name) in the folder to check that it works. It is more common to keep test and function separated.
# Above we separated test code from function code by creating two files. Next, put the files in their own folders in ''unittest''. I would put my ''factorial.py'' in ''unittest/src'' and ''factorial_test.py'' in ''unittest/test''. This way there is a very clear separation between function and test. The problem is making sure the test code loads the function code. Do it wrong a couple of times - it is very instructive.
# Follow the file structure of having a ''code'' (or ''src'') folder for programs, a ''test'' folder for tests, and now a ''testdata'' folder for files containing test data. Now make unit tests and appropriate test data files for your '''fasta''' class from last week. In this exercise you just need to make unit tests for the method '''load'''. You need to hand in both tests and test data. Maybe you should zip it all. Learn to zip :-)
# Add unit tests for the method '''save''' in your '''fasta''' class.
# Add unit tests for the method '''delete''' in your '''fasta''' class.
I would not be surprised if you find errors in your '''fasta''' class based on these tests. I found flaws in my code.
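
For the exercise with separate ''src'' and ''test'' folders, one common way (among several) to let the test code find the function code is to put the source folder on Python's module search path before importing. The sketch below only illustrates that idea. It assumes the ''unittest/src'' and ''unittest/test'' layout described above, and the helper name <code>add_source_folder</code> is made up.

```python
import sys
from pathlib import Path

def add_source_folder(test_file, folder_name="src"):
    """Put the source folder next to the test folder on sys.path.

    test_file is the path of a file inside the test folder, so its
    grandparent is the project folder that also holds the source folder.
    Returns the path that was added, as a string.
    """
    src = str(Path(test_file).resolve().parent.parent / folder_name)
    if src not in sys.path:
        sys.path.insert(0, src)
    return src

# Intended use at the top of a test file such as unittest/test/factorial_test.py:
# add_source_folder(__file__)
# from factorial import factorial
```

Because the path is derived from the location of the test file, ''pytest'' can be started from any folder.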

== Exercises for extra practice ==
# Add unit tests for all methods in your '''fasta''' class. That will be a bit of work.

Classes

2025-03-17T17:24:50Z