Data analysis
Description
This project is about analyzing specific data and answering various questions about it. The data file is a flat file database constructed in the year 2000 with various information about people, and can be seen here as
people.db.
The program must read this file ONCE, line by line, without storing the actual lines for future reference. Instead it must enter the data in an appropriate data structure of your own devising. Some of the questions ask whether a distribution is "normal". "Normal" here does not mean that it fits the bell curve (standard normal distribution). It means reasonable or natural, what you would expect.
The program must now answer the following questions:
- Is the age and gender distribution normal/sensible in the database? A yes/no answer is not good enough.
- At what age do men become fathers for the first time (max age, min age, average age)?
- Is the distribution of first-time fatherhood age normal/sensible? A yes/no answer is not good enough.
- At what age do women become mothers for the first time (max age, min age, average age)?
- Is the distribution of first-time motherhood age normal/sensible? A yes/no answer is not good enough.
- How many men and women do not have children (in percent)?
- What is the average age difference between the parents (with a child in common obviously)?
- How many people have at least one grandparent who is still alive? A person is living if he/she is in the database. State the number both as a percentage and as an absolute number.
- How many people have at least one cousin in the data set? What is the average number of cousins among those who have cousins?
Note: This number is historically difficult to compute right, but here are some thoughts to help you out in verifying your count.
You have to construct a method for finding cousin pairs. Any cousin pair you identify, can be written as a tuple (cpr1, cpr2) in a list.
a) There should be no duplicate tuples in the list.
b) There should be no tuple with the same cpr on position 1 and 2.
c) Because of symmetry, it is expected that for any (cpr1, cpr2) tuple there is a (cpr2, cpr1) tuple - which also implies that the set of cpr1's is equal to the set of cpr2's.
d) The length of the list of cousin tuples is the number of cousin pairs, and the size of the set of cpr's is the number of people who have cousins.
- Is the firstborn likely to be male or female?
- How many men/women (percentage) have children with more than one woman/man?
- Do tall people marry each other (or at least have children together)? To answer that, calculate the percentages of tall/tall, tall/normal, tall/short, normal/normal, normal/short, and short/short couples. Decide your own limits for tall, normal and short, and whether they are the same for men and women.
- Do tall parents have tall children?
- Do fat people marry each other (or at least have children together)? To answer that, calculate the percentages of fat/fat, fat/normal, fat/slim, normal/normal, normal/slim, and slim/slim couples. Decide your own limits for fat, normal and slim. Calculate the BMI, and let that be the fatness indicator.
- Using the knowledge of blood group type inheritance, are there any children in the database where you can safely say that at least one of the parents is not the real parent? If such children exist, make a list of them. In the report you must discuss how you determine that the parent(s) of the child are not the "true" parents.
- Make a list of fathers who can donate blood to their sons. The list must identify the father and the son(s) and their blood type. You must write the length of the list in the report, together with the number of fathers and the number of sons.
- Make a list of persons who can donate blood to at least one of their grandparents. The list must identify the person, the grandparent(s) and their blood type. You must write the length of the list in the report, together with the number of grandchildren and the number of grandparents.
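The sanity checks a) to d) listed under the cousin question can be automated. The following is only a minimal sketch, assuming the cousin pairs have already been collected as a list of (cpr1, cpr2) tuples. The function name check_cousin_pairs and the sample identifiers are made up for illustration and are not part of the assignment.

```python
def check_cousin_pairs(pairs):
    """Verify properties a) to d) for a list of (cpr1, cpr2) cousin tuples.

    Returns (number_of_tuples, number_of_people_with_cousins).
    Raises ValueError if one of the checks fails.
    """
    unique = set(pairs)
    # a) no duplicate tuples
    if len(unique) != len(pairs):
        raise ValueError("duplicate cousin tuples found")
    # b) no tuple with the same cpr in both positions
    if any(a == b for a, b in pairs):
        raise ValueError("tuple with the same cpr in both positions")
    # c) symmetry: every (a, b) must have a matching (b, a)
    if any((b, a) not in unique for a, b in pairs):
        raise ValueError("cousin list is not symmetric")
    # d) length of the list and number of distinct people with cousins
    people = {a for a, _ in pairs}
    return len(pairs), len(people)


# Hypothetical identifiers standing in for CPR numbers.
example = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A")]
n_tuples, n_people = check_cousin_pairs(example)
print(n_tuples, n_people)
```

With a symmetric list like this, each person's cousins are the tuples that have that person in position 1, so the average number of cousins among those who have cousins would be the list length divided by the number of people.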
All questions have to be answered in one run of the program, but not necessarily in that order. You are welcome to answer other interesting questions that can be posed from the data. Many questions are about distributions and whether the distributions are "normal". The program can calculate the distributions, but the analysis of the result (evaluating normalcy) belongs in the report. Here is an example: "Is the distribution of first-time fatherhood age normal? A yes/no answer is not good enough." You must at least calculate and print something like:
Age     Percentage
16-20:  5%
21-25:  10%
26-30:  40%
31-35:  30%
36-40:  15%
41-45:  0%
From that you simply evaluate whether it is normal and put your evaluation in the report. I think the numbers above are rather normal, but the ones below are very strange. If you want to, you can support your opinions with references to Statistics Denmark.
Age     Percentage
16-20:  0%
21-25:  0%
26-30:  10%
31-35:  20%
36-40:  50%
41-45:  20%
The problem you should solve in the project is not "how to make good statistics", but "how to collect the data from the database". If you feel you can do better statistics than above, you are welcome.
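A table like the two examples can be produced from a plain list of ages. This is a sketch only, assuming the ages are whole years. The function name age_distribution, the five-year bucket width and the starting age of 16 are illustrative choices, and the sample ages are invented.

```python
from collections import Counter


def age_distribution(ages, start=16, width=5):
    """Return a list of (label, percentage) rows for fixed-width age buckets.

    Buckets are start..start+width-1, start+width..start+2*width-1, and so on.
    Empty buckets between the lowest and highest observed bucket are kept.
    """
    if not ages:
        return []
    counts = Counter((age - start) // width for age in ages)
    rows = []
    for index in range(min(counts), max(counts) + 1):
        low = start + index * width
        label = f"{low}-{low + width - 1}"
        rows.append((label, 100 * counts[index] / len(ages)))
    return rows


sample_ages = [18, 22, 24, 27, 28, 29, 30, 33, 34, 38]  # invented data
print("Age     Percentage")
for label, percent in age_distribution(sample_ages):
    print(f"{label}:  {percent:.0f}%")
```

The same helper could be reused for every question that asks about a distribution, leaving the judgement of whether the result is "normal" to the report.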
Tip: For the questions to make sense, you should assume/pretend you are doing this analysis at the beginning of 2000.
Tip: The CPR consists of a date part (the first 6 digits, as DDMMYY), which is the birthday, and a 4-digit number. There are rules about how a CPR should be constructed, but they are not followed here, since it is illegal to publish CPR numbers. What you need to know is that the date part is a date in 1900-1999, and that the last digit is significant: odd means male, even means female.
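The CPR rules in this tip can be turned into a small helper. The sketch below assumes the CPR is available as a string of 10 digits, possibly with a separator such as "-" between the date part and the 4-digit number; the exact layout in people.db is not stated here, so that is an assumption. The names parse_cpr and age_on are hypothetical, the reference date of 1 January 2000 is one reading of "the beginning of 2000", and the sample CPR is made up.

```python
from datetime import date

# Assumption: "the beginning of 2000" is taken as 1 January 2000.
REFERENCE_DATE = date(2000, 1, 1)


def parse_cpr(cpr):
    """Split a CPR string into (birthday, gender).

    Assumes DDMMYY followed by 4 digits; any non-digit separator is ignored.
    """
    digits = "".join(ch for ch in cpr if ch.isdigit())
    if len(digits) != 10:
        raise ValueError(f"unexpected CPR format: {cpr!r}")
    day, month, year = int(digits[0:2]), int(digits[2:4]), int(digits[4:6])
    birthday = date(1900 + year, month, day)  # date part is in 1900-1999
    gender = "male" if int(digits[-1]) % 2 == 1 else "female"
    return birthday, gender


def age_on(birthday, when=REFERENCE_DATE):
    """Age in whole years on the date `when`."""
    had_birthday = (when.month, when.day) >= (birthday.month, birthday.day)
    return when.year - birthday.year - (0 if had_birthday else 1)


birthday, gender = parse_cpr("010270-1235")  # made-up CPR
print(birthday, gender, age_on(birthday))
```

The age of a parent at the birth of a child could be computed the same way, by passing the child's birthday as `when`.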
Tip: The data are somewhat randomly constructed, so you can find 'facts' that seem very unlikely, like 6 year old kids with a height of 2 m. Just accept it.
Tip: In the database the children of a person are clear. This means you can follow a thread down the generations. As can be seen from some of the questions, it can be nice to find the parents directly from a child, i.e. follow a thread up the generations. Can you find a way to do this easily?
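One possible way to follow the thread upwards is to invert the parent-to-children relation once, after the file has been read. The sketch below assumes the data has been loaded into a dict that maps each person's CPR to a list of their children's CPRs. The names children_of and build_parent_index, and the sample keys, are hypothetical and only serve as illustration.

```python
from collections import defaultdict


def build_parent_index(children_of):
    """Invert a parent -> children mapping into child -> set of parents."""
    parents_of = defaultdict(set)
    for parent, children in children_of.items():
        for child in children:
            parents_of[child].add(parent)
    return dict(parents_of)


# Hypothetical mapping; real keys would be CPR strings read from the file.
children_of = {
    "father": ["kid1", "kid2"],
    "mother": ["kid1", "kid2"],
    "kid1": ["grandkid"],
}
parents_of = build_parent_index(children_of)
print(sorted(parents_of["kid1"]))

# Grandparents are the parents of the parents; people without a known
# parent in the data simply have no entry in the index.
grandparents = set()
for parent in parents_of.get("grandkid", ()):
    grandparents |= parents_of.get(parent, set())
print(sorted(grandparents))
```

Because only people in the database can appear as parents, a lookup in such an index also answers whether a parent or grandparent counts as alive.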