Data Mining Techniques

Credits
3
Types
Elective
Requirements
This course has no formal prerequisites, but some prior knowledge is recommended (see Previous capacities)
Department
EIO
The main goal of this course is to provide a global and practical view of the central step of the Knowledge Discovery from Databases process, called Data Mining, a discipline devoted to extracting relevant information from different kinds of data (surveys, monitoring, data warehouses...) to support decision-making about phenomena or organizations with high degrees of complexity. The course focuses on providing the elements needed to design efficient and correct Data Mining processes, according to the real problem targeted in each application. Besides reviewing the main Data Mining methods, the course provides training on several important practical aspects, such as the effects of wrong preprocessing, wrong selection of the data mining method, wrong interpretation of results, or the assumption of false hypotheses about the analyzed process. Attention to these issues helps to guarantee the validity and utility of the final results. Real cases from several fields, such as health, environment or economy, will be discussed to show the versatility of the discipline in supporting a wide spectrum of very difficult real problems.

Teachers

Person in charge

  • Karina Gibert Oliveras

Others

  • Xavier Angerri Torredeflot

Weekly hours

Theory
1.5
Problems
0
Laboratory
1.5
Guided learning
0
Autonomous learning
0

Competences

Technical Competences of each Specialization

Direction and management

  • CDG1 - Capability to integrate technologies, applications, services and systems of Informatics Engineering, in general and in broader and multidisciplinary contexts.

Specific

  • CTE9 - Capability to apply mathematical, statistical and artificial intelligence methods to model, design and develop applications, services, intelligent systems and knowledge-based systems.

Generic Technical Competences

Generic

  • CG8 - Capability to apply the acquired knowledge and to solve problems in new or unfamiliar environments inside broad and multidisciplinary contexts, being able to integrate this knowledge.

Transversal Competences

Sustainability and social commitment

  • CTR2 - Capability to know and understand the complexity of the typical economic and social phenomena of the welfare society. Capacity for being able to analyze and assess the social and environmental impact.

Teamwork

  • CTR3 - Capacity of being able to work as a team member, either as a regular member or performing directive activities, in order to help the development of projects in a pragmatic manner and with sense of responsibility; capability to take into account the available resources.

Information literacy

  • CTR4 - Capability to manage the acquisition, structuring, analysis and visualization of data and information in the area of informatics engineering, and critically assess the results of this effort.

Appropriate attitude towards work

  • CTR5 - Capability to be motivated by professional achievement and to face new challenges, to have a broad vision of the possibilities of a career in the field of informatics engineering. Capability to be motivated by quality and continuous improvement, and to act strictly on professional development. Capability to adapt to technological or organizational changes. Capacity for working in absence of information and/or with time and/or resources constraints.

Reasoning

  • CTR6 - Capacity for critical, logical and mathematical reasoning. Capability to solve problems in their area of study. Capacity for abstraction: the capability to create and use models that reflect real situations. Capability to design and implement simple experiments, and analyze and interpret their results. Capacity for analysis, synthesis and evaluation.

Basic

  • CB6 - Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.
  • CB7 - Ability to integrate knowledge and handle the complexity of making judgments based on information which, being incomplete or limited, includes considerations on social and ethical responsibilities linked to the application of their knowledge and judgments.
  • CB8 - Capability to communicate their conclusions, and the knowledge and rationale underpinning these, to both specialist and non-specialist audiences in a clear and unambiguous way.
  • CB9 - Possession of the learning skills that enable the students to continue studying in a way that will be mainly self-directed or autonomous.

Objectives

  1. Be able to carry out the basic automatic descriptive analysis of a complex database
    Related competences: CTE9, CG8
  2. Be able to translate a given real problem into a data mining problem
    Related competences: CB6, CTR2, CTR6, CG8
  3. Be able to choose the appropriate data mining technique for a given real problem
    Related competences: CB6, CTR6, CG8
  4. Be able to design an integrated knowledge discovery project, with all its phases, from the formulation of objectives to the explicit production of knowledge, integrating the appropriate techniques at each point of the process under a multidisciplinary approach
    Related competences: CB7, CB8, CB6, CTR2, CTR4, CTR6, CDG1, CTE9, CG8
  5. Be able to choose and use the appropriate tools to implement and deploy a Knowledge Discovery project, using the most effective combination of freely distributed programming environments or specialized professional packages
    Related competences: CTR4, CDG1, CTE9, CG8
  6. Be able to interpret the results of a Knowledge Discovery project correctly, validate them critically, report them clearly, and communicate them in writing (both in detail and in summary) or orally to technical or non-specialist audiences
    Related competences: CB7, CB8, CTR2, CTR4, CTR6
  7. Be able to turn to complementary literature to find solutions to new problems, incorporating more advanced knowledge into the design of Knowledge Discovery projects. Be able to incorporate new software or a new technique into a project.
    Related competences: CB9, CTR5, CTR6, CDG1, CG8
  8. Be able to do medium-term planning (about three months) for the development of a Knowledge Discovery project of a certain size
    Related competences: CTR5, CDG1, CTR3
  9. Be able to join a work team (possibly multidisciplinary) for the development of a Knowledge Discovery project
    Related competences: CB8, CTR3, CTR4, CTR5, CDG1
  10. Be able to design an appropriate preprocessing of the data to be analyzed, according to the objectives of the study and the original state of the data themselves
    Related competences: CB6, CTR2, CTR4, CG8

Contents

  1. Introduction. Data Mining origins, steps, Statistics and Artificial Intelligence
    Data Mining is placed in the historical context.
    The overall process of Knowledge Discovery from Databases is presented, together with its steps and including Data Mining itself.
    The disciplinary pillars of Data Mining are introduced: Statistics, Artificial Intelligence, Information Systems and Data Visualization.
  2. Scope and tools
    The different natures of real problems and their different levels of complexity are discussed according to the classification proposed by Simpson. Ill-structured domains are introduced, as well as the management of a priori and implicit knowledge, its causes and consequences.
    Some software tools for developing data mining tasks are introduced.
  3. Method Selection. Typology of problems (DMMCM)
    The course follows a problem-oriented KDD approach, where the nature of the problem mainly determines the analysis process. Factors determining a correct choice of data mining method in real cases are presented. The DMMCM typology of methods is presented as a conceptual basis for selection.
  4. Data, Metadata
    Main data structures analyzed by Data Mining techniques.
    Importance of metadata, formats and contents
  5. Preprocessing
    Brief introduction to the relevant aspects of the data preparation step: missing data, outlier detection and treatment, derived variables, transformed variables, filtering, sampling, feature weighting, dimensionality reduction. Good practice guidelines will be provided.
  6. Data Mining Descriptive methods
    Statistical clustering: partitional methods, hierarchical methods, density-based, model-based, scalability; Conceptual Clustering (AI); hybrid AI & Statistics methods: clustering based on rules. WHO case: mental health systems
  7. Associative Data Mining methods
    Association rules induction. Factorial methods. Bayesian Networks.
  8. Predictive Data Mining methods
    Regression, statistical modelling in general. Temporal methods, Artificial Neural Networks, Swarm Intelligence.
  9. Data Mining Discriminant methods
    Decision trees, rule induction, support vector machines, Random Forest, discriminant analysis, hybrid methods. Case: elderly people functioning and profiles assessment grid
  10. Spatio-temporality
    Introduction to some tools for managing data that simultaneously include spatial information changing over time. Case: Quality of Life Guttmann
  11. Post-processing and validation
    Post-processing tools and validation tools for both models and results, adapted to the different Data Mining methods. Case: wastewater treatment
  12. Conclusion
    All the elements seen during the course will be placed over the general scheme of the Knowledge Discovery process presented in section 1, as a global synthesis of the course
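Two of the preprocessing topics listed in item 5, missing data and outlier detection, can be illustrated with a small sketch. The code below is only one possible treatment, not the procedure taught in the course: it assumes median imputation for missing values and the common 1.5 × IQR rule for flagging outliers. The function names and the sample values are hypothetical and chosen for illustration.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return the indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Hypothetical measurements with two missing entries and one extreme value.
raw = [4.1, 3.9, None, 4.3, 4.0, 19.5, 4.2, None, 3.8]
clean = impute_median(raw)       # missing entries become the median of the rest
flagged = iqr_outliers(clean)    # positions of suspected outliers
print(clean)
print(flagged)
```

Whether a flagged value is an error or a genuine extreme observation depends on the domain, which is why outlier treatment is a design decision and not an automatic step.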

Activities


Paper reading

A paper from an impact journal about a real data mining application will be selected. The paper can be proposed by either the student or the lecturer. The student must read and understand the Knowledge Discovery process used in the application, with all its components. A form with this information must be filled in.
Objectives: 6 7
Contents:
Theory
0h
Problems
0h
Laboratory
4h
Guided learning
0h
Autonomous learning
4h

Team project definition

In groups, students will choose a topic and a dataset on which to solve a Data Mining problem
Objectives: 2 8 9
Theory
0h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
0h

Team project checkpoint presentation

Each group will present the approach of its project in public: project description, objectives, structure, content and origin of the data, design of the Data Mining process to be applied, and work plan
  • Laboratory: Two lab sessions dedicated to group presentations and discussion
Objectives: 1 2 3 8 9 10
Theory
0h
Problems
0h
Laboratory
4h
Guided learning
0h
Autonomous learning
7h

Final team project presentation

Each group will hand in the project report and present to their classmates the results of the data mining application developed. There will be debate and discussion with the lecturer about the decisions taken throughout the project
Objectives: 1 2 3 4 5 6 8 9
Week: 18
Theory
0h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
14h

Final course conclusion

Brings together all the elements seen and worked on during the course, along with a sharing session on the projects developed by the groups and the papers read during the course
Objectives: 3 6
Theory
2h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
0h

Introduction



Theory
2h
Problems
0h
Laboratory
0h
Guided learning
0h
Autonomous learning
0h

Scope, tools, Data, Metadata, Preprocessing



Theory
6h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
0h

DMMCM map, Data Mining methods



Theory
12h
Problems
0h
Laboratory
8h
Guided learning
0h
Autonomous learning
0h

Spatio-temporality



Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
0h

Post-processing



Theory
2h
Problems
0h
Laboratory
2h
Guided learning
0h
Autonomous learning
0h

Teaching methodology

The course uses a mixed methodology of case-based learning and project-based learning.

In the first week, the fundamentals of the subject will be given and the activities that students must carry out to guarantee the learning process will be assigned. There are basically two activities: a paper reading activity on a Data Mining application and the development of a Data Mining project in a working team.

In the following weeks, the structure will be as described below:
Every week, two hours will be devoted to a case presentation covering all the steps of development (preprocessing, analysis, post-processing and validation). In part of the third hour, students will give short presentations of complementary cases that they document individually. The rest of the third hour and the fourth hour will be devoted to lab activities related to the project developed by each working team.

Together with the acquisition of technical skills directly related to Data Mining, an important goal of the course is to provide students with transversal skills considered relevant for professional development, such as teamwork, long-term planning, oral, visual and written communication, synthesis, justification of the decisions made during the project, incident management, and knowledge integration for building solutions to highly complex problems. The activities scheduled during the course have been specially designed for this purpose.

In the last week of the course, every project will be presented and followed by a discussion, which serves as an oral examination. The lecturer will use the last hour of the course to highlight commonalities and particularities of the presented projects in relation to the basic schemes of a Data Mining project. A common discussion will follow on what students understood about the usefulness of Data Mining in Computer Engineering, completing the general message of the course.

Evaluation methodology

Two scores corresponding to two activities developed during the course:
20% for the paper activity: it evaluates comprehension (0.5), synthesis (0.5), oral and visual communication (0.5), and argumentative capacity (0.5), as demonstrated through discussion.

80% for a project developed in teams. There will be a single evaluation of the quality of the Data Mining project, considering methodological rigour (0.5), the correctness of the Knowledge Discovery process designed (0.5), the selected preprocessing methods (0.25), the selected data mining methods (0.25), the selected tools (0.5), correct application and interpretation of results (1), the integration of several techniques in the project (0.5), the quality of the written report (1), and the final public presentation (1). The final score also takes into account the level of planning and coordination of the team and how incidents during the course have been solved (1). Additionally, an individual evaluation of each student's communication skills (0.5) and of their level of integration into the working team (1) will be taken into account.
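The figures in parentheses above can be read as points on a 10-point scale: the paper criteria add up to 2 points (20%) and the project criteria to 8 points (80%). That reading is an assumption, although the sums are consistent with it. The sketch below checks the arithmetic; the dictionary keys are shortened labels, and the `final_grade` function with its input format (fraction earned per criterion) is hypothetical, not an official grading tool.

```python
# Points per criterion, copied from the evaluation text.
# Assumption: they are points on a 0-10 scale.
PAPER = {
    "comprehension": 0.5,
    "synthesis": 0.5,
    "oral and visual communication": 0.5,
    "argumentative capacity": 0.5,
}
PROJECT = {
    "methodological rigour": 0.5,
    "correctness of the KDD process": 0.5,
    "preprocessing methods": 0.25,
    "data mining methods": 0.25,
    "tools": 0.5,
    "application and interpretation": 1.0,
    "integration of techniques": 0.5,
    "written report": 1.0,
    "public presentation": 1.0,
    "planning, coordination, incidents": 1.0,
    "individual communication": 0.5,
    "integration into the team": 1.0,
}

def final_grade(achieved):
    """Sum the points earned; `achieved` maps a criterion name to the
    fraction earned in [0, 1] (hypothetical format, missing means 0)."""
    rubric = {**PAPER, **PROJECT}
    return sum(points * achieved.get(name, 0.0) for name, points in rubric.items())

paper_total = sum(PAPER.values())      # 2.0 points, i.e. 20% of 10
project_total = sum(PROJECT.values())  # 8.0 points, i.e. 80% of 10
full_marks = final_grade({name: 1.0 for name in {**PAPER, **PROJECT}})
print(paper_total, project_total, full_marks)
```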

Previous capacities

Prior knowledge of statistics in general, and more particularly of multivariate data analysis and of machine learning, is advisable but not essential.