A Perspective on Statistical Tools for Data Mining Applications

David M. Rocke
Center for Image Processing and Integrated Computing
University of California, Davis

Statistics and Data Mining
Statistics is about the analysis of data. Some statistical ideas are designed for problems in which well-formulated prior hypotheses are evaluated by the collection and analysis of data, but other currents of thought in the field are aimed at more exploratory ends. In this sense, data mining (defined as the exploratory analysis of large data sets) should be a branch of statistics. Yet the field of data mining has evolved almost independently of the community of statistical researchers. Why is this?
One reason for the separate development of data mining is that the methods were developed by those who needed to solve the problems, and these rarely included researchers whose primary areas of interest were statistical theory and methodology. Several authors have recently pointed out ways in which statistical ideas are relevant to data mining (see Elder and Pregibon 1996; Friedman 1997; Glymour et al. 1996, 1997; as well as many contributors to the NRC CATS report 1996). If the only reason for the lack of use of standard statistical and data-analytic methods were unfamiliarity, this approach would be sufficient. But perhaps there are other problems.
Statistical methods have traditionally been developed to make the greatest use of relatively sparse data. The concept of statistical efficiency, for example, is crucial when data are expensive, and less important when additional data can be obtained for the price of fetching them from the disk. On the other hand, computational efficiency has always played a smaller role, especially since the advent of electronic computers. When analyzing potentially vast data sets, the importance of various considerations is changed or reversed compared to what statisticians are traditionally used to. It is perhaps this that has made the use of standard statistical methods less common in the field of data mining than it might otherwise have been.
What is Difficult about Data Mining?
-0.0032 Tc 1.2565 Tw (The problem is not just that there is a large amount of data, or that the goal is exploratory.) Tj
0 -13.8 TD -0.0018 Tc 0.4018 Tw (Statisticians \(among other disciplines\) have developed many tools for the exploration of data.) Tj
T* -0.0084 Tc 0.6769 Tw (For many exploratory statistical problems, the answers become clearer as the size of the data) Tj
T* -0 Tc 0.052 Tw (set become larger. Finding means and medians, as well as regression coefficients, fall into this) Tj
T* -0.001 Tc 0.4483 Tw (easy category. Suppose, for example, that one wished to predict account ) Tj
354.12 0 TD -0.0073 Tc 0.4873 Tw (delinquincies from a) Tj
-354.12 -13.8 TD -0.0021 Tc 2.6335 Tw (fixed set of 25 predictors that exist in a database, and suppose that previous experience) Tj
0 -13.8 TD -0.0033 Tc 2.2833 Tw (showed that a linear logistic regression specification worked well. In this case, the only) Tj
T* -0.005 Tc 1.4193 Tw (uncertainty lies in the values of the regression coefficients. If a particular coefficient were) Tj
ET
endstream
endobj
27 0 obj[/PDF/Text]
endobj
1 0 obj<>/ProcSet 27 0 R>>/Type/Page>>
endobj
2 0 obj<>stream
BT
79.2 735.12 TD
0 0 0 rg
/F1 12 Tf
-0.0062 Tc 1.0329 Tw (estimated to be 2.0 with a standard error of 1.3 from a sample of 100 cases, there would be) Tj
0 -13.8 TD -0.003 Tc 0.643 Tw (considerable uncertainty as to its value. If the sample size was increased to 1,000,000 cases,) Tj
T* -0.0016 Tc 2.9093 Tw (the standard error of the coefficient would be only .0013, or essentially no uncertainty.) Tj
T* -0.0014 Tc 1.626 Tw (Furthermore, the calculations for this larger regression would not be very expensive in the) Tj
0 -13.68 TD -0.0009 Tc 0.0009 Tw (context of normal data mining applications.) Tj
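The shrinkage quoted above is just the 1/√n law for standard errors. A back-of-envelope check, using only the numbers from the example (the helper name is mine):

```python
import math

def scaled_se(se_ref, n_ref, n_new):
    """Rescale a coefficient's standard error using the 1/sqrt(n) law."""
    return se_ref * math.sqrt(n_ref / n_new)

# SE of 1.3 at n = 100 shrinks by a factor of sqrt(10,000) = 100
print(round(scaled_se(1.3, 100, 1_000_000), 6))   # 0.013
```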
But many problems in data mining have what might be called subtle structure, defined as those structures that are difficult to find even in large samples. Multivariate outlier detection and cluster finding fall into this category. This means that considerable searching is required to determine how the data break out into clusters, and more data do not necessarily make this easier. If the cluster structure is hidden from easy detection because of the orientation of the clusters, no amount of data will make this immediately apparent.
Subtle structure is often definable as the global optimum of a function with many local optima. The global optimum may be hard to find with small data sets, but good computational procedures with small data sets may be impractical with large ones. Cluster finding can be cast in this mold if one defines clusters, for example, by means of the normal likelihood or a similar criterion function. (One example would be to compute the pooled covariance matrix of all clusters together, in which each point is centered at its cluster mean.) Such criterion functions may have many local optima, and finding the correct one is often a difficult problem.
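The parenthetical criterion can be made concrete. In this sketch (names and data are illustrative, not from the paper), a candidate cluster assignment is scored by the determinant of the covariance of the cluster-mean-centered data, so tight, correctly separated clusters score lower:

```python
import numpy as np

def pooled_det(X, labels):
    """Score a clustering: determinant of the covariance of the
    cluster-mean-centered data (smaller = tighter clusters)."""
    centered = np.empty_like(X, dtype=float)
    for k in np.unique(labels):
        mask = labels == k
        centered[mask] = X[mask] - X[mask].mean(axis=0)
    return np.linalg.det(np.cov(centered, rowvar=False))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
good = np.repeat([0, 1], 50)   # the true grouping
bad = np.tile([0, 1], 50)      # an arbitrary grouping that mixes the clusters
print(pooled_det(X, good) < pooled_det(X, bad))   # True
```

The criterion has exactly the character the text describes: evaluating one assignment is cheap, but the space of assignments is combinatorial and full of local optima.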
Computational Complexity of Statistical Methods for Data Mining Applications
We consider this issue in the context of two fundamentally important methods in data mining: data cleaning and data segmentation. Data cleaning here will mean the identification of anomalous data (outliers for short) that may need to be removed or separately addressed in an analysis. Variable-by-variable data cleaning is straightforward, but often anomalies appear only when many attributes of a data point are considered simultaneously. Multivariate data cleaning is more difficult, but it is an essential step in a complete analysis. We will avoid technical details of these methods (see Rocke and Woodruff 1996), but their essential nature is to identify the main “shape” of the data and flag as outliers those data points that lie too far from the main body of data.
Data segmentation is here taken to mean the division of the data into nonoverlapping subsets that are relatively similar within a subset. Some points may be outliers, and so belong to no group. We do not assume that any reliable prior knowledge exists as to which points belong together, or even how many groups there may be. Statistical methods for this problem often go by the name “cluster analysis.” In both data cleaning and data segmentation, consideration will be restricted to measurement data rather than categorical data. This will simplify the discussion so that many sub-cases do not need to be described separately.
It seems an obvious point that the computational effort cannot rise too fast with the size of the dataset; otherwise, processing large datasets would be impossible. If there are n data points, each with p associated measurements, then many statistical methods naively used have a computational complexity of O(np^3), while more complex methods may in principle be high-order polynomial, or even exponential, in n or p. This is obviously not satisfactory for larger data sets.
In the quest for low computational effort, one limit is that there must of necessity be a piece that is linear in n. This is because we cannot identify outliers unless each point is examined. It will be important that the linear part have a low constant; that is, the effort will rise proportionally to n, but the constant of proportionality must be small. This will be the case if we simply plug each point into a quick outlier-identification routine. In no case, however, should we tolerate calculations that rise more than linearly with n.
In data segmentation, it might seem difficult to avoid a piece that is proportional to n^2, since pairwise distances are often needed for classification. We avoid this difficulty by employing sampling (in the first of two ways). If n is very large, the basic structure of a data set can be estimated using a much smaller number of points than n, and that basic structure can be used to classify the remaining points. For example, if a complex algorithm for data segmentation is O(n^3), but we instead perform the calculation on a subset of size proportional to n^(1/4) (for large n), the net complexity is a sublinear O(n^(3/4)).
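A hedged sketch of this sample-then-classify idea, with a plain Lloyd-style routine standing in for the expensive clustering step (the subset size, the stand-in routine, and all constants below are illustrative choices, not the authors'):

```python
import numpy as np

def segment_by_subsample(X, rng, subset_size=None):
    """Two-group segmentation: estimate centers on a small subset
    (size on the order of n**(1/4)), then classify every point by
    nearest center in a single linear pass."""
    n = len(X)
    m = subset_size or max(20, int(round(n ** 0.25)))
    S = X[rng.choice(n, size=m, replace=False)]
    # farthest-point start: two provisional centers far apart in the subset
    c0 = S[0]
    c1 = S[np.argmax(np.linalg.norm(S - c0, axis=1))]
    centers = np.stack([c0, c1])
    for _ in range(25):  # Lloyd iterations on the subset only
        lab = np.linalg.norm(S[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.stack([S[lab == j].mean(axis=0) if np.any(lab == j)
                            else centers[j] for j in (0, 1)])
    # classification sweep over the full data: O(n) with a small constant
    return np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (5000, 2)),
               rng.normal(10, 1, (5000, 2))])
labels = segment_by_subsample(X, rng)
print(np.bincount(labels))   # two groups of roughly 5000 points each
```

The expensive iteration touches only the subset; the full data set is visited exactly once.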
An additional advantage of estimating the structure of the data set on a subset of the data is that it can be done more than once and the results compared. This allows for a kind of independent verification of results and avoids drawing conclusions based on accidental appearances. If each subcalculation is sublinear, then repeating it any fixed number of times that does not rise with n is also sublinear.
Sampling has an additional role to play in estimating complex structures within a sample. Some estimation methods have a computational effort that rises exponentially with n if done naively. An example of this is a method of data cleaning that depends on finding the most compact half of the data and then evaluating each point against this half. Specifically, the Minimum Covariance Determinant (MCD) method finds that half of the data for which the determinant of the covariance matrix of the data is smallest. Since the space that must be searched to find even a gross approximation of the MCD rises exponentially with n, there is a danger that the computational effort to find a solution within given quality bounds will also rise rapidly (Woodruff and Rocke 1993). A solution (Woodruff and Rocke 1994) is to conduct most computations only on the cells of a partition of the data, and then use the full sample only for the final stages.
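To make the exponential search concrete, here is a brute-force MCD sketch on a deliberately tiny sample (my own illustration, not the Woodruff–Rocke partitioning algorithm). It examines every half-subset, which is feasible only because n is small:

```python
import numpy as np
from itertools import combinations

def mcd_brute_force(X):
    """Exact MCD on a tiny sample: the half of the data whose covariance
    matrix has the smallest determinant. The search is exponential in n."""
    n, h = len(X), (len(X) + 1) // 2
    best_det, best_idx = np.inf, None
    for idx in combinations(range(n), h):   # C(n, h) subsets: blows up fast
        d = np.linalg.det(np.cov(X[list(idx)], rowvar=False))
        if d < best_det:
            best_det, best_idx = d, idx
    return np.array(best_idx), best_det

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (10, 2)),
               rng.normal(20, 1, (2, 2))])   # 10 clean points + 2 gross outliers
half, _ = mcd_brute_force(X)
print(half)   # all indices < 10: the compact half excludes both outliers
```

At n = 12 this loop visits C(12, 6) = 924 subsets; at n = 200 it would visit roughly 10^59, which is exactly why the partition-based approximation is needed.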
Properties of Algorithms for Data Mining
Computer-science theorists usually strive to find methods that are deterministic, get the right answer every time, and run in worst-case polynomial time. The problems that we face apparently cannot be solved without giving up something from this list; most likely, we will need to give up both determinism and sure correctness. We clearly cannot use methods that fail to be polynomial; in fact, a low-constant linear portion plus a sublinear remainder is required. Probabilistic algorithms provide a way to obtain good, if not provably optimal, answers for even very large problems.
Global Optimization
Many problems have both a discrete and a continuous formulation. For example, if the data are thought to fall into two clusters, we can define this by the (discrete) cluster membership (an integer vector of length n) or by the (continuous) location and shape of the clusters (two continuous vectors and two continuous matrices). Often continuous formulations lead to better answers if the “correct” local optimum is found, but this may be difficult to find. Discrete methods can yield search techniques with no obvious or easy continuous analog, thus improving the search for the local optimum. These include swap neighborhoods, constructive neighborhoods with highly random starting points (a small number of points), tabu lists, etc. Often it is very effective to use a discrete heuristic search method to provide one or more starting points for locating optima of a continuous global optimization problem. Among heuristic optimization methods, genetic algorithms (GA) seem to work well for optimization only when a descent step is added (so that the GA serves mainly to diversify the starting points). Simulated annealing can perform well if tuned properly, but we have had more success with other methods. Steepest descent with random restarts (perhaps with a constructive start) is hard to beat for many problems.
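The recipe the paragraph ends with, steepest descent restarted from random points, can be sketched on a toy one-dimensional criterion. The function, step size, and restart count below are illustrative choices, not from the paper:

```python
import numpy as np

def descend(grad, x0, lr=0.01, steps=200):
    """Plain steepest descent from a single starting point."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def random_restart_descent(f, grad, rng, restarts=30, span=5.0):
    """Steepest descent with random restarts: keep the best local optimum."""
    best_x, best_f = None, np.inf
    for _ in range(restarts):
        x = descend(grad, rng.uniform(-span, span))
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

# toy multimodal criterion: global minimum at x = 0, many local minima
f = lambda x: x**2 + 3.0 * np.sin(3.0 * x) ** 2
g = lambda x: 2.0 * x + 9.0 * np.sin(6.0 * x)   # its derivative
rng = np.random.default_rng(3)
x_star, f_star = random_restart_descent(f, g, rng)
print(x_star, f_star)   # the restarts escape basins where a single descent stalls
```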
The Role of Algorithms in Statistical Science
-0.0043 Tc 0.1093 Tw (At one time, it would have been feasible to argue that algorithms were not central to statistical) Tj
0 -13.8 TD -0.0038 Tc 0.9553 Tw (science because all a better algorithm did was somewhat speed up arriving at essentially the) Tj
T* -0.008 Tc 1.7651 Tw (same estimate. Things have changed. More and more, we are using estimators that have a) Tj
T* -0.0038 Tc 1.062 Tw (substantial stochastic component, in which the \223theoretical estimator\224 is not achievable, and) Tj
T* -0.004 Tc 3.3363 Tw (all we get are various approximations of varying degrees of exactness. To avoid extra) Tj
T* -0.0021 Tc 2.9421 Tw (complexity, we will describe this phenomenon in terms of an example\227) Tj
378.24 0 TD /F2 12 Tf
0 Tc 0 Tw (S) Tj
6 0 TD /F1 12 Tf
0.0028 Tc 2.9972 Tw (-estimators of) Tj
-384.24 -13.8 TD -0.0035 Tc 0.8035 Tw (multivariate location and scale\227but it clearly applies equally to Markov chain Monte Carlo,) Tj
0 -13.8 TD -0.0024 Tc 0.0024 Tw (optimal experimental design, and other important areas of statistical science.) Tj
An S-estimator of multivariate location and scale consists of that choice of a location vector T and PDS matrix C which minimizes |C| subject to a data-based constraint (omitted here) that is designed to standardize the estimator. The S-estimator can always be shown to exist in some theoretical sense, but there exists no known algorithm that can always produce it, even in an arbitrarily long computation. The theoretical S-estimator must be a solution to a set of associated M-estimating equations; however, there is no known method of determining how many such local solutions there are. Given a list of such local solutions, the best of them is a candidate for the theoretical S-estimator, but there is no known method of determining whether there is another solution with a smaller determinant. Furthermore, there can be very large differences between the best solution (the theoretical S-estimator) and the second-best solution (Rocke and Woodruff 1993, 1997).
Under these circumstances, which occur in more and more modern statistical methods, the algorithm used to find local solutions is critical to the quality of the approximation to the theoretical S-estimator. Use of a less effective algorithm can result not in a small degradation of performance, but in a very large one (Woodruff and Rocke 1994; Rocke and Woodruff 1996). In a very real sense, “the algorithm is the estimator,” and different algorithms result in essentially different estimators, which can have very different finite-sample properties.
Furthermore, if the only available algorithm for an estimator has exponential computational complexity, estimation in large or even medium-sized samples is impossible. For example, the exact computation of a robust multivariate estimator, the Minimum Covariance Determinant (MCD) estimator, is only possible in samples of size 30 or less. No conceivable improvement in computational speed could raise this level higher than 100. Consider the case of a sample of size 200 in which we are interested in examining all subsamples of size 100. There are about 10^59 such subsamples. Suppose that one had a million-processor parallel computer, each processor of which could examine a billion configurations per second (this is perhaps 1000 Gflops, considerably faster than any existing processor). It would still take 10^33 millennia for the process to finish!
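The arithmetic here is easy to verify; the check below reproduces both the count of half-samples and the running-time figure:

```python
import math

n_subsets = math.comb(200, 100)   # number of size-100 subsamples of 200 points
print(f"{n_subsets:.2e}")         # 9.05e+58, i.e. about 10^59

checks_per_sec = 1_000_000 * 1_000_000_000   # 10^6 processors x 10^9 configs/sec
seconds = n_subsets / checks_per_sec
millennia = seconds / (60 * 60 * 24 * 365.25 * 1000)
print(f"{millennia:.1e}")         # 2.9e+33 millennia
```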
For massive data sets, even polynomial-time algorithms may be far too slow. An O(n^3) algorithm may be feasible in samples of size 1000, but not in samples of size 1,000,000. Linear, or even sublinear, algorithms are required to deal with massive data sets.
Conclusion
-0.0064 Tc 3.0156 Tw (A statistical perspective on data mining can yield important benefits if the data mining) Tj
0 -13.8 TD -0.0024 Tc 2.8086 Tw (perspective on statistical methods is kept in mind during their development. In the data) Tj
T* -0.0031 Tc 4.5271 Tw (mining perspective, statistical efficiency is not a particularly important goal, whereas) Tj
T* -0.0024 Tc 1.3224 Tw (computational efficiency is critical. Statistical methods must be used that do not depend on) Tj
T* -0.0022 Tc 1.3858 Tw (prior knowledge of the exact structure of the data; the questions to be asked as well as the) Tj
T* -0.0027 Tc 0.0113 Tw (answers to be derived must be free to depend on the outcome of the analysis.) Tj
0 -23.88 TD -0.0034 Tc 0.0034 Tw (Some specific recommendations include:) Tj
• Perspectives from statistics, computer science, and machine learning must contribute to an understanding of algorithms for data mining.

• Global optimization is often necessary to solve data mining problems, which often have subtle structure.

• Attention to the performance of heuristic search algorithms is critical.

• Sampling is needed to keep complexity linear or sublinear. Partitioning to find a starting point by combinatorial search for a smooth algorithm is often a viable strategy.

• Statistical asymptotics must now focus on computational complexity as well as statistical convergence.

• Although data mining has developed mostly independently of statistics as a discipline, a fusion of the ideas from these two fields will lead to better methods for the analysis of massive data sets.
References
Banfield, J. D. and Raftery, A. E. (1993), “Model-Based Gaussian and Non-Gaussian Clustering,” Biometrics, 49, 803–821.

Cabena, Peter, Hadjinian, Pablo, Stadler, Rolf, Verhees, Jaap, and Zanasi, Alessandro (1998), Discovering Data Mining: From Concept to Implementation, Upper Saddle River, NJ: Prentice Hall.

Elder IV, John and Pregibon, Daryl (1996), “A Statistical Perspective on Knowledge Discovery in Databases,” in Advances in Knowledge Discovery and Data Mining, Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy (eds.), Cambridge, MA: MIT Press.

Fayyad, Usama M., Piatetsky-Shapiro, Gregory, Smyth, Padhraic, and Uthurusamy, Ramasamy (eds.) (1996), Advances in Knowledge Discovery and Data Mining, Cambridge, MA: MIT Press.

Friedman, Jerome H. (1997), “On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality,” Data Mining and Knowledge Discovery, 1, 55–78.

Glymour, Clark, Madigan, David, Pregibon, Daryl, and Smyth, Padhraic (1996), “Statistical Inference and Data Mining,” Communications of the ACM, 39, 35–41.

Glymour, Clark, Madigan, David, Pregibon, Daryl, and Smyth, Padhraic (1997), “Statistical Themes and Lessons for Data Mining,” Data Mining and Knowledge Discovery, 1, 1–28.

National Research Council, Board on Mathematical Sciences, Committee on Applied and Theoretical Statistics (1996), Massive Data Sets: Proceedings of a Workshop, Washington, DC: National Academy Press.

Rocke, D. M. (1996), “Robustness Properties of S-Estimators of Multivariate Location and Shape in High Dimension,” Annals of Statistics, 24, 1327–1345.

Rocke, D. M. and Woodruff, D. L. (1993), “Computation of Robust Estimates of Multivariate Location and Shape,” Statistica Neerlandica, 47, 27–42.

Rocke, D. M. and Woodruff, D. L. (1996), “Identification of Outliers in Multivariate Data,” Journal of the American Statistical Association, 91, 1047–1061.

Rocke, D. M. and Woodruff, D. L. (1997), “Robust Estimation of Multivariate Location and Shape,” Journal of Statistical Planning and Inference, 57, 245–255.

Woodruff, D. L. and Rocke, D. M. (1993), “Heuristic Search Algorithms for the Minimum Volume Ellipsoid,” Journal of Computational and Graphical Statistics, 2, 69–95.

Woodruff, D. L. and Rocke, D. M. (1994), “Computable Robust Estimation of Multivariate Location and Shape in High Dimension using Compound Estimators,” Journal of the American Statistical Association, 89, 888–896.