Data Mining


Credits

Types

Elective

Requirements

Prerequisite: PE
Prerequisite: PRO2

Department

CS;EIO

Web

https://www-eio.upc.edu/~karina/datamining/

Mail

Data Mining is concerned with converting data into knowledge for decision making, and as such constitutes the central phase of the process of extracting knowledge from databases (KDD, Knowledge Discovery in Databases). Data Mining is a meeting point of different disciplines: statistics, machine learning, database techniques and decision-making systems. The discipline makes it possible to address many current problems related to information processing.

This course introduces the most well-established techniques for solving three fundamental problems: analysis of binary data ("Transactions"), analysis of scientific data (e.g. in genomics) and analysis of enterprise data. An added goal is the use of R, a powerful free programming environment.

Why this course is of interest to a computer science degree student:
Data Mining is a discipline devoted to processing big data from the complex information systems of large organizations in order to extract relevant, new, understandable and useful knowledge for decision making. It applies in all kinds of contexts, from e-commerce to social networks, including environmental systems monitoring, customer loyalty cards, consumption in general, public health, banking, finance and industrial production.

Data Mining is an umbrella discipline that requires combining techniques and methodologies from several computer science areas (data warehouse design, machine learning, statistical modelling, multivariate data analysis, data visualization, intensive computing, software engineering) to cope with the complexity of the field.

Currently, it is clear that the value of organizations is directly related to the information that can be extracted from the available data, and there is still a shortage of professionals with a suitable profile to do so. Data Mining is the science that transforms data into value for organizations, and acquiring skills in this area is an excellent complement for the computer science professional, whatever specialization they follow.

For those in the information systems specialization, this course provides the skills needed to complete the data processing chain: too often an excellent information system design is underused for lack of a good exploitation service with suitable mining. Also, knowing what can be extracted from data is an important reference to take into account when designing the data structure itself. Software engineering students will learn useful criteria to identify and standardize data mining services to include in the large software applications that support the organization, by deciding and planning the data consumption services to be provided.

Students from the information technologies area may be interested in the relationship between the real-time monitoring of fixed or mobile systems and data mining, which can reduce signals to relevant features, detect events to communicate, or extract relevant information in an incremental process (data stream mining). Knowledge extraction from distributed data or from the cloud is an area expected to grow strongly in the near future.

For students of computing, this subject offers very interesting challenges related to the development of new knowledge extraction algorithms that are more efficient and/or scalable, able to deal with big datasets or with less classical structures such as graphs (social network mining) or documents (web mining).

Teachers

Person in charge

Caroline König
Karina Gibert Oliveras

Others

Dante Conti
Manuel Gijon Agudo
Mario Martín Muñoz
Sergi Ramirez Mitjans
Sonia Garcia Esteban
Xavier Angerri Torredeflot

Weekly hours

Theory

Problems

Laboratory

Guided learning: 0.4

Autonomous learning: 5.6

Competences

Technical Competences of each Specialization

Information systems specialization

CSI2 - To integrate solutions of Information and Communication Technologies, and business processes to satisfy the information needs of the organizations, allowing them to achieve their objectives effectively.
- CSI2.2 - To conceive, deploy, organize and manage computer systems and services, in business or institutional contexts, to improve the business processes; to take responsibility and lead the start-up and the continuous improvement; to evaluate its economic and social impact.
- CSI2.3 - To demonstrate knowledge of, and the ability to apply, knowledge extraction and knowledge management systems.
- CSI2.6 - To demonstrate knowledge and capacity to apply decision support and business intelligence systems.

Transversal Competences

Reasoning

G9 [Assessable] - Capacity for critical, logical and mathematical reasoning. Capacity to solve problems in one's area of study. Abstraction capacity: capacity to create and use models that reflect real situations. Capacity to design and perform simple experiments and to analyse and interpret their results. Capacity for analysis, synthesis and evaluation.
- G9.3 - Critical capacity, evaluation capacity.

Third language

G3 [Assessable] - To know the English language at an adequate oral and written level, in accordance with the needs of graduates in Informatics Engineering. Capacity to work in a multidisciplinary group and in a multi-language environment and to communicate, orally and in writing, knowledge, procedures, results and ideas related to the technical informatics engineer profession.
- G3.2 - To study using resources written in English. To write a report or a technical document in English. To participate in a technical meeting in English.

Objectives

1. Knowing the main types of Data Mining problems
Related competences: CSI2.3, CSI2.6, CSI2.2
2. Data quality assessment and preprocessing
Related competences: CSI2.3, CSI2.6, CSI2.2
3. Problem solving: identifying the statistical and/or machine learning techniques most appropriate to solve the problem
Related competences: G9.3, CSI2.3, CSI2.6, CSI2.2
4. Implementing simple learning algorithms
Related competences: G9.3, CSI2.3, CSI2.6, CSI2.2
5. Validation of results
Related competences: G9.3, CSI2.3, CSI2.6, CSI2.2
6. Presentation of results in a professional environment for decision making
Related competences: G9.3, CSI2.3, CSI2.6, G3.2, CSI2.2

Contents

1. Introduction to Data Mining.
Statistical modeling and types of problems: analysis of binary data ("transactions"), analysis of scientific data and analysis of data from enterprises
2. Visualization and dimensionality reduction
Feature selection and extraction. Visualization of multivariate data.
3. Clustering
Direct partitioning methods, hierarchical methods and expectation maximization
4. Predictive Methods
Multiple and generalized linear regression. Logistic regression. Neural networks.
5. Decision Trees
Classification and regression trees (CART).
6. Validation protocols and data resampling
Holdout, cross-validation and the bootstrap
7. Generation of association rules
A-priori and Eclat algorithms.
8. Discriminant Analysis
Bayesian decision theory. LDA and QDA Discriminant Analysis and Naïve Bayes
9. Non parametric discrimination
Nearest neighbours
10. Regression Shrinkage and Variable Selection
Regularized linear regression. LASSO and the Elastic Net methods.
11. Formal concept analysis
Formal method for pattern finding
12. Preprocessing
13. Bagging and ensemble methods

Activities

Development Unit 1
Objectives: 1
Contents: 1. Introduction to Data Mining.

A review of R language

Development of item 2
Objectives: 2
Contents: 2. Visualization and dimensionality reduction

Development of item 3
Objectives: 2
Contents: 3. Clustering

Development of item 4
Objectives: 2
Contents: 4. Predictive Methods

Development of item 5
Objectives: 2

Development of item 6
Objectives: 2
Contents: 5. Decision Trees

Development of item 7
Objectives: 2
Contents: 8. Discriminant Analysis

Development of item 8
Objectives: 2

Development of item 9
Objectives: 2
Contents: 6. Validation protocols and data resampling

Development of item 10
Objectives: 5
Contents: 9. Non parametric discrimination

Practice 1
Objectives: 2, 5, 4, 3
Week: 13
Autonomous learning: 20h

Practice 2
Objectives: 5, 4, 3, 6
Week: 15
Autonomous learning: 20h

Teaching methodology

The learning methodology will consist of the analysis of case studies involving complex data sets from real problems. The necessary body of scientific knowledge will be introduced from these problems. Theoretical and practical lessons are interleaved so that programming and/or integrating data mining functions reinforces the assimilation of the concepts explained. The open programming environment R will be used in the laboratory.

The laboratory classes will be devoted to solving problems related to the material covered in the theory classes, followed by a similar problem that the students solve themselves. This problem may include very brief conceptual questions and will be handed in for evaluation. Finally, the students must complete two full practical works: a statistical modeling problem and a modeling problem of the "scientific", "transaction" or "marketing" kind (the student chooses only one of these kinds). This last practical work will be presented orally to the whole class.

Evaluation methodology

The evaluation of the course will be based on the grade obtained in the exercises developed during the lab sessions and on two practical works. For each practical work, the student will deliver the corresponding written report. Finally, at the end of the course, the students must present the second practical work orally.

The student will be required to demonstrate the necessary reasoning and English skills. These skills will be evaluated using the corresponding rubrics.

The overall laboratory grade is the average of the grades obtained for the exercises from the laboratory sessions.

The final mark will be obtained as follows:

Lab = overall laboratory grade
PR1 = grade for the first practical work
PR2 = grade for the second practical work

Final grade = 0.2*Lab + 0.4*PR1 + 0.4*PR2

In both practical works (each counting for 40%), 35% corresponds to technical correctness and 5% to the 'reasoning' generic competence, so this competence has an overall weight of 10% of the final grade.
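As a worked illustration of the weighting above, the following minimal sketch computes the final grade and the share carried by the 'reasoning' competence. It is written in Python purely for illustration (the course itself uses R). The function names, the example marks, and the assumption that each practical work is split into a technical mark and a reasoning mark on the same scale are hypothetical and not taken from the course rules.

```python
def practical_work_grade(technical: float, reasoning: float) -> float:
    """Grade of one practical work.

    Assumption: of the 40 percentage points a practical work is worth,
    35 come from the technical mark and 5 from the reasoning mark,
    both given on the same scale.
    """
    return (35 * technical + 5 * reasoning) / 40


def final_grade(lab: float, pr1: float, pr2: float) -> float:
    """Final grade = 0.2*Lab + 0.4*PR1 + 0.4*PR2."""
    return 0.2 * lab + 0.4 * pr1 + 0.4 * pr2


# Share of the final grade carried by the 'reasoning' competence:
# 5 of the 40 points in each of the two practical works.
REASONING_WEIGHT = 2 * 0.4 * (5 / 40)

# Hypothetical marks, for illustration only.
example_pr1 = practical_work_grade(technical=7.0, reasoning=7.0)
example_pr2 = practical_work_grade(technical=9.0, reasoning=9.0)
example_final = final_grade(lab=8.0, pr1=example_pr1, pr2=example_pr2)

print(round(example_final, 2))
print(round(REASONING_WEIGHT, 2))
```

With these hypothetical marks, the laboratory contributes 0.2*8, and the two practical works contribute 0.4*7 and 0.4*9 respectively.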

Bibliography

Basic:

Construction and assessment of classification rules - Hand, D.J, Wiley, 1997. ISBN: 978-0-471-96583-1
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991001900839706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
The elements of statistical learning: data mining, inference, and prediction - Hastie, T.; Tibshirani, R.; Friedman, J, Springer, 2009. ISBN: 9780387848570
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003549679706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Introducción a la minería de datos - Hernández Orallo, J.; Ramírez Quintana, M.J.; Ferri Ramírez, C, Pearson, 2004. ISBN: 9788420540917
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991002742379706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Data analysis and graphics using R: an example-based approach - Maindonald, J.H.; Braun, J, Cambridge University Press, 2010. ISBN: 9780521762939
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991003210549706711&context=L&vid=34CSUC_UPC:VU1&lang=ca
Pattern classification - Duda, R.O.; Hart, P.E.; Stork, D.G, John Wiley & Sons, 2001. ISBN: 0-471-05669-3
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991002131619706711&context=L&vid=34CSUC_UPC:VU1&lang=ca

Complementary:

Aprender de los datos: el análisis de componentes principales: una aproximación desde el Data Mining - Aluja Banet, T.; Morineau, A, EUB, 1999. ISBN: 9788483120224
https://discovery.upc.edu/discovery/fulldisplay?docid=alma991001877509706711&context=L&vid=34CSUC_UPC:VU1&lang=ca

Web links

The page for downloading R and finding information about it. http://www.cran.es.r-project.org
The page for downloading Weka and finding information about it. http://www.cs.waikako.ac.nz
General information on data mining software, courses and news in the United States. http://www.kdnuggets.com/

Previous capacities

Foundations of probability and statistics. Basic programming in R.
