Assignment 1: Using the WEKA Work-bench
A. Become familiar with the use of the WEKA workbench to invoke several different machine learning schemes. Employ latest steady version. Work with both the visual interface (Explorer) and command line program (CLI). Observe Weka webpage for Weka documentation.
B. Utilize the following learning schemes, with all the default configurations to analyze the next thunderstorm data (in weather. arff). For test out options, initially choose " Use schooling set", in that case choose " Percentage Split" using standard 66% percentage split. Record model percent error rate. ZeroR (majority class)
Naive Bayes Simple
J4. almost eight
C. Which of such classifiers are you more likely to trust when determining whether to learn? Why? M. What can you declare about reliability when using teaching set info and when using a separate percentage to train?
Assignment a couple of: Preparing the data and mining it
A. Take the document genes-leukemia. csv (here is the description in the data) and convert this to Weka file genes-a. arff. You may convert the file both using a text editor like emacs (brute force way) or look for a Weka command word that changes. csv document to. arff (a intelligent way). N. Target discipline is SCHOOL. Use J48 on genes-leukemia with " Use schooling set" alternative. C. Employ genes-leukemia. arff to create two subsets:
genes-leukemia-train. arff, with the 1st 38 samples (s1... s38) of the data genes-leukemia-test. arff, with the staying 34 selections (s39... s72). D. Educate J48 on genes-leukemia-train. arff and specify " Use training set" as test option. What decision tree do you obtain? What is its veracity?
Electronic. Now identify genes-leukemia-test. arff as test set.
What decision tree do you really get and exactly how does its veracity compare to one in the previous question? F. At this point remove the field " Source" from the classifier (unclick checkmark next to Source, and click on Apply Filter in the top menu) and duplicate steps M and At the.
So what do you watch? Does the precision on test out set improve and if therefore , why do you consider it does? G. Extra credit: which repertorier gives the maximum accuracy on the test established? Assignment three or more: Data Cleaning and Preparing for Modeling
The prior assignment was with the picked subset of top 50 genes for your Leukemia dataset. In this project you will be working on the project of real data miner, and you will be dealing with an actual hereditary dataset, starting from the beginning. You will see that the process of data mining regularly has many small steps that need to be completed correctly to get great outcomes. However tedious these steps might appear, the target is a deserving one -- help make an early diagnosis for leukemia -- a common type of cancer. Making a correct analysis is literally a life and death decision, and so we should be careful that people do the research correctly. 3A. Get info
Take ALL_AML_original_data. zip data file from Data directory and extract coming from it Teach file: data_set_ALL_AML_train. txt
Test data file: data_set_ALL_AML_independent. txt
Sample and category data: table_ALL_AML_samples. txt
This info comes from pioneering work by simply Todd Golub et 's at MIT Whitehead Company (now MIT Broad Institute). 1 . Rename the teach file to ALL_AML_grow. teach. orig. txt and evaluation file to ALL_AML_grow. check. orig. txt. Convention: we all use the same file underlying for files of similar type and use different extensions for different versions of these files. Below " orig" stands for first input data files and " grow" stands for genes in rows. All of us will use expansion. tmp pertaining to temporary files that are commonly used for just one step in the task and can be deleted later. Notice: The landmark analysis of MIT biologists is described in their daily news Molecular Classification of Malignancy: Class Breakthrough and Course Prediction simply by Gene Phrase Monitoring (pdf). Both teach and evaluation datasets will be tab-delimited data files with 7130 records. The " train" file should have 78 areas and " test" 75 fields. The first two fields are Gene Information (a long description just like GB DEF =...