The primary objective of the project is to encourage you to explore and think about potential applications of the techniques you will learn in this class. This group project gives you an opportunity to apply your data mining knowledge to real-life data and to mine managerially relevant insights. Please note that this is a semester long
A final report/presentation (225 points) – a Microsoft PowerPoint document containing slides of the following types:
- Executive summary
- Project motivation/background
- Data description
- Data preparation activities
- Models used – at least three distinct techniques (with screenshots of related SAS EM output)
- Managerial/business implications
- References (if needed)
Here are a few additional things to consider.
- There is no Word document. The report is in the form of a PowerPoint Presentation.
- Not just screenshots. Because the report is a PowerPoint presentation, you may be tempted to just insert a bunch of screenshots. I do want screenshots of models, results, etc., but they should be annotated so that I have some sense of why they are there and what I am supposed to take from the slide.
- Keep it tight. Fully explaining your final project in the form of a PowerPoint (without the benefit of being able to explain it verbally during a presentation) should be hard. You need to balance telling a good (and complete) story against generating an excessive number of slides. If you were presenting, I would be looking for about 15 minutes of content. Given that timeline, I would expect approximately 20 slides… this is a guideline and not an absolute, but please try to tell a complete and concise story.
- Tell a coherent story. The story is important… I want to know what you did and why. I want to know what models you ran and which was the best performer. I also want to know what that model says and why that is important to the organization.
1. The aim of our project is to classify the records in the Breast Cancer Wisconsin (Original) dataset using several classification models and to compare the misclassification rates of these models.
a. We plan to use classification models such as decision tree, bagging, random forest, naïve Bayes, and support vector machine, and to compare their results to find the most accurate one.
b. We have chosen the Breast Cancer Wisconsin (Original) dataset, obtained from the UCI Machine Learning Repository, for analysis.
c. This is a secondhand dataset available at
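The comparison planned in item 1 can be sketched outside SAS EM as well. The following is a minimal Python sketch, not the project's actual workflow. It assumes scikit-learn is installed and that scikit-learn's bundled copy of the Wisconsin breast cancer data (569 records, 30 features) is an acceptable stand-in for the project dataset. The function name `compare_models`, the 5-fold cross-validation, and the default hyperparameters are illustrative choices, not part of the proposal.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_models(seed=0, folds=5):
    """Return {model name: cross-validated misclassification rate}."""
    # Assumption: scikit-learn's bundled Wisconsin breast cancer data
    # stands in for the dataset named in the proposal.
    X, y = load_breast_cancer(return_X_y=True)

    models = {
        "decision tree": DecisionTreeClassifier(random_state=seed),
        # BaggingClassifier bags decision trees by default.
        "bagging": BaggingClassifier(random_state=seed),
        "random forest": RandomForestClassifier(random_state=seed),
        "naive Bayes": GaussianNB(),
        # SVMs are sensitive to feature scale, so standardize first.
        "support vector machine": make_pipeline(StandardScaler(), SVC()),
    }

    # The same stratified, shuffled folds are reused for every model.
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    rates = {}
    for name, model in models.items():
        accuracy = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        # Misclassification rate is one minus mean accuracy.
        rates[name] = 1.0 - accuracy.mean()
    return rates


if __name__ == "__main__":
    # Print models from lowest to highest misclassification rate.
    for name, rate in sorted(compare_models().items(), key=lambda kv: kv[1]):
        print(f"{name:24s} misclassification rate = {rate:.3f}")
```

Using the same folds for every model keeps the comparison fair, since each model is scored on identical held-out records.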
2. The Breast Cancer Wisconsin (Original) dataset has 32 attributes (ID, diagnosis, and 30 real-valued input features). Attribute information:
a. ID number
b. Diagnosis (M = malignant, B = benign) – target variable
c. Ten real-valued features are computed for each cell nucleus: radius (mean of distances from center to points on the perimeter), texture (standard deviation of gray-scale values), perimeter, area, smoothness (local variation in radius lengths), compactness (perimeter^2 / area - 1.0), concavity (severity of concave portions of the contour), concave points (number of concave portions of the contour), symmetry, and fractal dimension (“coastline approximation”). The mean, standard error, and worst (largest) value of each feature are recorded for each image, which gives the 30 input features.
d. Class distribution: 357 benign, 212 malignant
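The class distribution above also fixes a useful reference point: a model that always predicts the majority class (benign) misclassifies every malignant record. The short Python sketch below works through that arithmetic from the counts quoted in the list. The helper name `majority_baseline` is illustrative and not part of the proposal.

```python
from fractions import Fraction


def majority_baseline(class_counts):
    """Misclassification rate of always predicting the most common class."""
    total = sum(class_counts.values())
    majority = max(class_counts.values())
    return Fraction(total - majority, total)


# Counts quoted in the class distribution above.
counts = {"benign": 357, "malignant": 212}
baseline = majority_baseline(counts)

print(f"records: {sum(counts.values())}")                      # records: 569
print(f"baseline misclassification: {float(baseline):.3f}")    # 0.373
```

Any of the planned classifiers should achieve a misclassification rate well below this 212/569 baseline to be considered useful.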
3. Diagnosis is the field that we are predicting. It takes two values: B = benign and M = malignant.
a. Benign – Benign tumors may grow larger but do not spread to other parts of the body. Also called nonmalignant.
b. Malignant – Malignant tumors are cancerous.
4. After skin cancer, breast cancer is the most commonly diagnosed cancer among women in the U.S. Breast cancer occurs in both men and women, but it is far more common in women. Survival rates have risen and mortality from the disease has steadily declined, primarily because of factors such as earlier detection, a more individualized approach to treatment, and a better understanding of the disease.