Programming Languages: Data Mining

QUESTION ONE [25 MARKS]

  1. Study the “dataframe” below and answer the questions that follow.

          Column1   Column2   Column3   Column4   Column5   Column6
Name1     Alpha     12        24        54        0         Alpha
Name2     Beta      16        32        51        1         Beta
Name3     Alpha     52        104       32        0         Gamma
Name4     Beta      36        72        84        1         Delta
Name5     Beta      45        90        32        0         Phi
Name6     Alpha     12        24        12        0         Zeta
Name7     Beta      32        64        64        1         Sigma
Name8     Alpha     42        84        54        0         Mu
Name9     Alpha     56        112       31        1         Eta

dataframe

iv.  State and explain the techniques and tools (R or Python packages) that are used to preprocess data so that it is ready for data mining. [5 Marks]

(b) Suppose that your local bank has a data mining system. The bank has been studying your debit card usage patterns. Noticing that you make many transactions at home renovation stores, the bank decides to contact you, offering information regarding their special loans for home improvements.

(c) Data quality can be assessed in terms of several issues, including accuracy, completeness, and consistency. For each of the above three issues:

  1. Briefly discuss how data quality assessment can depend on the intended use of the data, giving examples. [2 Marks]
  2. Propose TWO other dimensions of data quality. [2 Marks]

(d) In real-world data, tuples with missing values for some attributes are a common occurrence. Describe any TWO methods for handling this problem. [2 Marks]
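As an illustration of what (d) is asking about, the sketch below shows two commonly taught approaches: ignoring incomplete tuples and filling a numeric attribute with its mean. The records, attribute names and helper functions are hypothetical and are not part of the question paper; other methods (a global constant, the most probable value) are equally valid answers.

```python
from statistics import mean

# Hypothetical tuples; None marks a missing attribute value.
records = [
    {"name": "Name1", "age": 34, "income": 52000.0},
    {"name": "Name2", "age": None, "income": 61000.0},
    {"name": "Name3", "age": 45, "income": None},
    {"name": "Name4", "age": 29, "income": 48000.0},
]


def drop_incomplete(rows):
    """Method 1: ignore every tuple that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, attribute):
    """Method 2: replace missing values of a numeric attribute with its mean."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    fill = mean(observed)
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in rows
    ]


complete_only = drop_incomplete(records)
filled = impute_mean(impute_mean(records, "age"), "income")

print(len(complete_only), "complete tuples kept")
print(filled)
```

Dropping tuples is simple but discards information; mean imputation keeps every tuple at the cost of reducing the variance of the filled attribute.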

(e) Briefly describe any TWO issues to consider during data integration. Give an example for each case. [2 Marks]

(f) What are the differences between the three main types of data warehouse usage, namely:

  1. Information processing [1 Mark]
  2. Analytical processing [1 Mark]
  3. Data mining [1 Mark]

(g) Briefly discuss the motivation behind OLAP. [2 Marks]

QUESTION TWO [25 MARKS]

(a) Describe the difference between clustering and classification. Give one example to illustrate each category. [4 Marks]

(b) Briefly explain three metrics (functions) for measuring the similarity of data items during clustering.

  Use one example to illustrate each function. [3 Marks]
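For orientation, here is a minimal sketch of three measures that are often used for this purpose: Euclidean distance, Manhattan distance and cosine similarity. The choice of these three is an assumption, since the question does not name specific functions. The two vectors are the Column2 to Column4 values of Name1 and Name2 from the dataframe in Question One.

```python
import math


def euclidean(a, b):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences: smaller means more similar."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors: closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


name1 = (12, 24, 54)  # Column2..Column4 of Name1
name2 = (16, 32, 51)  # Column2..Column4 of Name2

d_euclid = euclidean(name1, name2)
d_manhattan = manhattan(name1, name2)
s_cosine = cosine_similarity(name1, name2)

print(f"Euclidean distance: {d_euclid:.3f}")
print(f"Manhattan distance: {d_manhattan}")
print(f"Cosine similarity:  {s_cosine:.3f}")
```

The first two are strictly distances (dissimilarities), so a clustering algorithm treats a smaller value as a closer match, whereas cosine similarity grows as items become more alike.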

(c) Suppose you have been hired as a Data Scientist by the Kenya Government to analyze COVID-19 data and submit results. Illustrate the specific data mining techniques you can apply and for what purpose. [4 Marks]

(d) Suppose KCA University is planning to implement a data storage and mining infrastructure that has an operational data store, a data warehouse, a data mart and data mining tools. As a data scientist, do the following:

    i) Draw a well-labelled architecture that illustrates how the four components are interconnected. [4 Marks]

    ii) Explain how each of the components can be used by KCA University. [4 Marks]

    iii) There are three types of schemas that can be used to design and develop the data warehouse. [3 Marks]

(e) Describe the meaning and importance of the ETL process in business enterprises. Draw a well-labelled diagram to illustrate your answer. [3 Marks]

Suppose that your local bank has a data mining system. The bank has been studying your debit card usage patterns. Noticing that you make many transactions at home renovation stores, the bank decides to contact you, offering information regarding their special loans for home improvements.

  1. Briefly explain how this may conflict with your right to privacy. [2 Marks]
  2. Describe a privacy-preserving data mining method that may allow the bank to perform customer pattern analysis without infringing on its customers' right to privacy. [2 Marks]

The accountant at Typing Haven wants a program that will help her prepare a customer’s bill. She will enter the number of typed envelopes and the number of typed pages, as well as the charge per typed envelope and the charge per typed page. The program should calculate and display the amount due for the envelopes, the amount due for the pages, and the total amount due. Complete an IPO chart for this problem. Desk-check the algorithm using 50, 100, $.10, and $.25 as the number of typed envelopes, the number of typed pages, the charge per typed envelope, and the charge per typed page. Then desk-check it using your own set of data.
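The processing step of the IPO chart can be sketched in Python as below. The function and variable names are illustrative assumptions; the inputs are the first desk-check data set given in the problem (50 envelopes, 100 pages, $.10 per envelope, $.25 per page).

```python
def compute_bill(envelopes, pages, charge_per_envelope, charge_per_page):
    """Return (amount due for envelopes, amount due for pages, total due)."""
    envelope_due = envelopes * charge_per_envelope
    page_due = pages * charge_per_page
    total_due = envelope_due + page_due
    return envelope_due, page_due, total_due


# Desk-check with the values from the problem statement.
envelope_due, page_due, total_due = compute_bill(50, 100, 0.10, 0.25)

print(f"Amount due for envelopes: ${envelope_due:.2f}")  # $5.00
print(f"Amount due for pages:     ${page_due:.2f}")      # $25.00
print(f"Total amount due:         ${total_due:.2f}")     # $30.00
```

A production billing program would normally use a decimal type rather than binary floating point for currency; floats are kept here only to keep the sketch short.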

1. Using the proper mathematical notation, write the following sets:
   (a) Months of the year
   (b) Even integers
2. Obtain the result of the following summation by using Python: $\sum_{k=0}^{10} (k^2 \;\ldots\; k)$
3. By using Python, obtain how many different pairs we could have if the class has 34 students.
4. Write Python code to produce a list with the squares of the first 100 integers.
5. Write Python code to create a variable one and assign it any integer. Using that variable, do the following:
   (a) Take the square of one and assign it to one_sq
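One possible way to approach items 3, 4 and 5(a) is sketched below. It assumes that "pairs" means unordered pairs of distinct students and that "the first 100 integers" means 1 through 100; both readings are assumptions, and the value chosen for `one` is arbitrary.

```python
import math

# Item 3: number of unordered pairs that can be formed from 34 students, C(34, 2).
pairs = math.comb(34, 2)
print(pairs)  # 561

# Item 4: squares of the first 100 integers, taken here as 1..100.
squares = [n ** 2 for n in range(1, 101)]
print(squares[:5], "...", squares[-1])

# Item 5(a): square an integer variable and store the result.
one = 7  # any integer will do
one_sq = one ** 2
print(one_sq)
```

`math.comb` requires Python 3.8 or later; on older versions the same count can be computed as `34 * 33 // 2`.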

A professor has constructed a 3-by-5 two-dimensional array of grades. This array contains the test grades of students in the professor’s advanced compiler design class. Write, compile, and run a C++ program that reads 15 array values and then determines the total number of grades in these ranges: less than 60, greater than or equal to 60 and less than 70, greater than or equal to 70 and less than 80, greater than or equal to 80 and less than 90, and greater than or equal to 90.
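The assignment itself requires C++, but the range-counting logic is language independent and is sketched here in Python purely to show the boundary handling. The grade values are hypothetical test data and the bucket labels are illustrative.

```python
# Hypothetical 3-by-5 array of test grades.
grades = [
    [55, 62, 78, 91, 84],
    [67, 73, 59, 88, 95],
    [70, 80, 90, 60, 45],
]


def count_ranges(table):
    """Count grades per range; each lower bound is inclusive, each upper bound exclusive."""
    counts = {"<60": 0, "60 to <70": 0, "70 to <80": 0, "80 to <90": 0, ">=90": 0}
    for row in table:
        for grade in row:
            if grade < 60:
                counts["<60"] += 1
            elif grade < 70:
                counts["60 to <70"] += 1
            elif grade < 80:
                counts["70 to <80"] += 1
            elif grade < 90:
                counts["80 to <90"] += 1
            else:
                counts[">=90"] += 1
    return counts


range_counts = count_ranges(grades)
for label, count in range_counts.items():
    print(f"{label}: {count}")
```

Because the `elif` chain is evaluated in ascending order, each test only needs the upper bound; the same structure carries over directly to nested `for` loops over a `int grades[3][5]` array in C++.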

ASSESSMENT DESCRIPTION: Your task is to design, develop and test a small application which will allow a mobile phone user to compare the cost of their phone usage on a particular day under plans from three different phone providers and find the most expensive and the cheapest of them.

Task 1: Design This stage requires you to prepare documentation that describes the function of the program and how it is to be tested. There is no coding or code testing involved in this stage. Requirements:

1) Read through Task 2: Program Development to obtain details of the requirements of this program.
2) Write pseudocode that describes how the program will operate.
   a. All program requirements must be included, even if you do not end up including all these requirements in your program code.
   b. The pseudocode must be structured logically so that the program would function correctly.
3) Prepare and document test cases that can be used to check that the program works correctly once it has been coded. You do NOT need to actually run the test cases in this stage; this will occur in Task 3: Testing.
   a. Test cases should be documented using the template from the week 6 lecture and tutorial. You may include extra information if you wish. At this stage, the Actual Result column will be left blank. Two test cases per group member are required to gain full marks in this task.

Task 2: Program Development Using the Design Documentation to assist you, develop a Java program that allows the user to enter details of their phone usage and then compare the bills which would result from this usage under different billing plans. All requirements require that you follow coding conventions, such as proper layout of code, using naming conventions and writing meaningful comments throughout your program.

Requirement 1: Display a welcome message when the program starts
• The welcome message should have a row of “*” at the top and the bottom, just long enough to extend over the text. Hint: Use a loop for this.
• The first line of the message should read “WELCOME TO PHONE BILL COMPARISON SYSTEM”
• The second line of the message should be blank.
• The third line should read “Developed by” followed by your names and a comma, then “student ID”, then the student IDs of all group members.
• The fourth line should display “OODP101 Object Oriented Design and Programming”
• The fifth line should display the current date and time of the system. You are expected to do some research to complete this task.
• The sixth line should be blank, and the seventh line should be another row of “*”

Requirement 2 Provide a menu from which the user can select to Enter Usage Details, Display Cost Under Provider 1, Display Cost Under Provider 2, Display Cost Under Provider 3, Clear Usage, or Exit System. This menu should be repeated each time after the user has chosen and completed an option until the user chooses to Exit. The user selects an option by entering the number next to it. If an invalid number is selected, the user is advised to make another selection.

Requirement 3 When the user selects the Enter Usage Details option, provide another menu from which the user can select Phone Call, SMS, Data Usage, or Return to Main Menu. The user selects an option by entering the number next to it. If an invalid number is selected, the user is told to make another selection.

Requirement 3.1 If the user selects Phone Call, they are prompted to enter the length of the call in seconds. If the user selects this option more than once, it means that the user made more than one call on that particular day, so your program should include all calls in the billing calculation. The value entered must be positive; if not, the user should be prompted to re-enter a new value. After a valid call length has been entered, the number of calls should be displayed and the user is returned to the Enter Usage Details Menu so that they may choose to enter additional usage details.

Requirement 3.2 If the user selects SMS, the program should simply increment the count of SMS messages. No further information is required, so the program should simply display the total number of SMS messages recorded so far, and then return to the Enter Usage Details Menu.

Requirement 3.3 If the user selects Data Usage, they should be prompted to enter the amount of data used in MB. The value entered must be positive; if not, the user should be prompted to re-enter a new value. After entering a valid value, the user is returned to the Enter Usage Details Menu so that they may choose to enter additional usage details.