Instructor: Weixiong Zhang
Prereq: CS241 and SSM 326A (or Math 320), or their equivalent, or permission of the instructor
Text book:
Reference books:
Location: Whitaker Hall 216
Time: Monday and Wednesday, 1:00 pm - 2:30 pm
Office hours: Monday and Wednesday, 2:30 - 3:30 pm, Jolley Hall 506, or by appointment.
Brief description: Many scientific computing problems are statistical by nature. Such problems appear in many domains, such as text analysis, data mining on the web, computational biology, and various medical applications. Another source of their statistical nature is the lack of sufficient information about the problem domain and about the specific problem at hand. What is available for a typical application is usually a set of data from observations or experiments. The main objective of this course is to gain experience in dealing with statistical data analysis problems by studying various statistical methods that can be used to make sense of data, by reading and reviewing the literature, and by working on a specific statistical problem in a selected application domain.
Syllabus
Reading materials on biology
Policies on homework, project, and grading
Collaboration Policy
Homework assignments
Topics: What data mining is about; Characteristics and relationship with statistics and machine learning; Tasks, objectives, components (models, patterns, evaluation, and algorithms), and overall procedures; The curse of dimensionality; Uncertainty in data. Basic background of biology.
Dates: 1/21, 1/26, 1/28
Reading: chp 1.4, 1.5, 1.6, 2.6, 2.7, 4.2, 4.7.
Topics: Basics of subjective probabilities, statistics, maximum likelihood, Bayesian estimator, independence, Bayesian networks, simulation methods and Monte Carlo Markov chains, statistical modeling; Bayesian classifier; Learning Bayesian networks.
Dates: 2/2, 2/4, 2/9, 2/11, 2/16
Reading: N. Friedman, et al., Using Bayesian networks to analyze expression data, J. Computational Biology, 7(3/4), 2000, pp. 601-620, which can be downloaded from a machine on campus.
Topics: Systematic search: Dynamic programming, state-space search (best-first search and depth-first branch-and-bound); local search strategies; parameter optimization methods; optimization with missing data (EM) and Monte Carlo Markov Chain sampling.
Applications: Global and local alignment, motif finding (Gibbs sampler)
Reading:
Topics: Classification modeling; Decision-tree classification and entropy minimization; Cross validation; Kernel methods and support vector machines
Applications: Regulatory module identification; Identification of regulatory networks; Gene splicing prediction
Dates: 3/15, 3/17, 3/22, 3/24, 3/29
Reading:
Topics: Linear regression; Artificial neural networks.
Applications: Identification of regulatory networks
Dates: 3/31, 4/5, 4/7
Reading:
Applications: Gene expression analysis
Dates: 4/12, 4/14, 4/19, 4/21
Reading:
Topics: Regular and context free grammar and hidden Markov models (HMMs)
Applications: HMM models and gene prediction; Learning a grammar from sequences
Dates: if we have time
Reading:
Topics: Student presentations
Dates: 4/26, 4/28, 5/3
If you need basic knowledge of cell biology, microarray technology, and microarray data analysis, the following links will be useful.
H. Lodish, et al., Molecular Cell Biology (online version)
Specific reading materials can be found in the reading lists in the Syllabus.
Reading some papers on a particular data mining topic that we do not have time to cover, or do not go into in detail, in class, and then giving an in-depth presentation that describes and discusses the problem, the data mining objectives, the data used, the data mining procedure and techniques adopted, your criticism of the existing work, and your thoughts and suggestions on possible future research.
Designing an algorithm/method for a particular data mining problem, or comparing some existing methods for the problem.
Please keep any discussions you have with other students to a small group of no more than 3 students, and be sure that each of you is equally involved. If you just listen in and are then able to understand and write up the solution, you have missed at least half of the benefit of the homework. It is really important to work through the process of recognizing when you are heading the wrong way and learning how to work through the problem-solving process.
Violations of any of the above rules will be dealt with harshly! The homework problems and projects are designed to help you learn the material being taught. Being told the solution and understanding it is VERY different from working through the process of actually finding a solution. If you do not take an active role in the process of solving the homework problems and project, then you won't get much out of it, hence you won't learn the material.
Created by Weixiong Zhang, Jan. 2004. Last modified by Weixiong Zhang, Feb. 2004.