data mining

(noun)

a technique for searching large-scale databases for patterns; used mainly to find previously unknown correlations between variables that may be commercially useful

Related Terms

  • exploratory data analysis
  • skewed

Examples of data mining in the following topics:

  • Data Snooping: Testing Hypotheses Once You've Seen the Data

    • Testing hypotheses once you've seen the data may result in inaccurate conclusions.
    • The error is particularly prevalent in data mining and machine learning.
    • Sometimes, people deliberately test hypotheses once they've seen the data.
    • Data snooping (also called data fishing or data dredging) is the inappropriate (sometimes deliberately so) use of data mining to uncover misleading relationships in data.
    • Although data-snooping bias can occur in any field that uses data mining, it is of particular concern in finance and medical research, which both heavily use data mining.
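The snooping effect described in this topic can be illustrated with a minimal Python sketch. It assumes a purely random target and 200 purely random candidate predictors; the function names, sample sizes, and seed are illustrative choices, not taken from the source. Because no predictor is truly related to the target, any "significant" correlation the search turns up is an artifact of testing many hypotheses on the same data.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def snoop(n_obs=20, n_predictors=200, threshold=0.444, seed=0):
    """Correlate many random predictors with one random target.

    threshold=0.444 is approximately the two-sided 5% critical value
    of r for 20 observations (an assumption of this sketch).
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    correlations = []
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_obs)]
        correlations.append(pearson_r(predictor, target))
    # Predictors that would pass a naive significance check by chance alone.
    hits = [r for r in correlations if abs(r) > threshold]
    return correlations, hits


correlations, hits = snoop()
print(f"{len(hits)} of {len(correlations)} random predictors exceed |r| > 0.444")
print(f"largest |r| found by searching: {max(abs(r) for r in correlations):.2f}")
```

With a 5% threshold, roughly one in twenty unrelated predictors is expected to pass, which is why a hypothesis chosen after searching the data needs to be confirmed on fresh data.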
  • Exploratory Data Analysis (EDA)

    • Exploratory data analysis (EDA) is an approach to analyzing data sets in order to summarize their main characteristics, often with visual methods.
    • Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments.
    • Tukey promoted the use of the five-number summary of numerical data: the minimum, the lower quartile, the median, the upper quartile, and the maximum.
    • Many EDA techniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking.
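As a rough illustration of the five-number summary mentioned in this topic, the sketch below computes it with Python's standard `statistics` module. The function name `five_number_summary` and the sample values are made up for illustration. Quartile conventions differ between textbooks and software, so the "inclusive" method used here is one reasonable choice rather than the definitive one.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    # n=4 splits the data into quartiles; "inclusive" treats the sample
    # minimum and maximum as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


print(five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6]))
# prints (1, 3.0, 5.0, 7.0, 9)
```

These five values are the ones a box plot draws, which is why the summary pairs naturally with the visual methods of EDA.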
  • Applications of Statistics

    • In calculating the arithmetic mean of a sample, for example, the algorithm works by summing all the data values observed in the sample and then dividing this sum by the number of data items.
    • Statistical methods can summarize or describe a collection of data.
    • These inferences may take the form of: answering yes/no questions about the data (hypothesis testing), estimating numerical characteristics of the data (estimation), describing associations within the data (correlation) and modeling relationships within the data (for example, using regression analysis).
    • It can include extrapolation and interpolation of time series or spatial data and can also include data mining.
    • This box plot represents Michelson and Morley's data on the speed of light.
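The arithmetic-mean algorithm described in this topic (sum all observed values, then divide by the number of items) can be sketched in a few lines of Python. The function name and the sample values are illustrative assumptions, not data from the source.

```python
def arithmetic_mean(sample):
    """Sum every observed value, then divide by the number of data items."""
    if not sample:
        raise ValueError("the mean of an empty sample is undefined")
    total = 0.0
    for value in sample:
        total += value
    return total / len(sample)


print(arithmetic_mean([2, 4, 6, 8]))  # prints 5.0
```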
  • Fundamentals of Statistics

    • Data collected about this kind of "population" constitutes what is called a time series.
    • Numerical descriptors include mean and standard deviation for continuous data types (like heights or weights), while frequency and percentages are more useful in terms of describing categorical data (like race).
    • These inferences may take the form of: answering yes/no questions about the data (hypothesis testing), estimating numerical characteristics of the data (estimation), describing associations within the data (correlation) and modeling relationships within the data (for example, using regression analysis).
    • It can include extrapolation and interpolation of time series or spatial data, and can also include data mining.
  • Stepwise Regression

    • Hence it is prone to overfitting the data.
    • This method is particularly valuable when data is collected in different settings.
    • Stepwise regression procedures are used in data mining, but are controversial.
    • The tests themselves are biased, since they are based on the same data.
    • Models that are created may be too small relative to the real models in the data.
  • Distorting the Truth with Descriptive Statistics

    • Reporting bias involves a skew in the availability of data, such that observations of a certain kind may be more likely to be reported and consequently used in research.
    • Descriptive statistics is a powerful form of research because it collects and summarizes vast amounts of data and information in a manageable and organized manner.
    • correlate (associate) data or model any type of statistical relationship among variables;
    • In other words, every time you try to describe a large set of observations with a single descriptive statistic, you run the risk of distorting the original data or losing important detail.
  • Examining numerical data exercises

    • Data were collected on life spans (in years) and gestation lengths (in days) for 62 mammals.
    • Workers at a particular mining site receive an average of 35 days paid vacation, which is lower than the national average.
    • Exercise 1.6 introduces a data set on the smoking habits of UK residents.
    • Create a box plot for the data given in Exercise 1.30.
    • (d) The time series plot shown below is another way to look at these data.
  • Confounding

    • Beyond these factors, researchers may not consider or have access to data on other causal factors.
    • Smoking and confounding are reviewed in occupational risk assessments such as the safety of coal mining.
  • Observations, variables, and data matrices

    • These observations will be referred to as the email50 data set, and they are a random sample from a larger data set that we will see in Section 1.7.
    • The data in Table 1.3 represent a data matrix, which is a common way to organize data.
    • Data matrices are a convenient way to record and store data.
    • How might these data be organized in a data matrix?
    • These data were collected from the US Census website.
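A data matrix of the kind described in this topic places one observation in each row and one variable in each column. The Python sketch below shows that layout. The variable names and every value in it are made up for illustration; they are not the actual email50 data.

```python
# Hypothetical variable (column) names, for illustration only.
variables = ["spam", "num_char", "line_breaks"]

# Each inner list is one observation (row); the values are invented.
data_matrix = [
    [0, 12.5, 300],
    [0, 7.0, 150],
    [1, 1.2, 25],
]


def column(matrix, names, name):
    """Return every observation's value for one variable."""
    index = names.index(name)
    return [row[index] for row in matrix]


def observation(matrix, names, row_number):
    """Return one observation as a {variable: value} mapping."""
    return dict(zip(names, matrix[row_number]))


print(column(data_matrix, variables, "num_char"))  # prints [12.5, 7.0, 1.2]
print(observation(data_matrix, variables, 2))
```

Adding a new observation appends a row, and adding a new variable appends a column, which is what makes the layout convenient for recording and storing data.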
  • Optional Collaborative Classroom Exercise

    • The science of statistics deals with the collection, analysis, interpretation, and presentation of data. We see and use data in our everyday lives.
    • Your instructor will record the data.
    • For example, consider the following data:
    • Where do your data appear to cluster?
    • Effective interpretation of data (inference) is based on good procedures for producing data and thoughtful examination of the data.

Except where noted, content and user contributions on this site are licensed under CC BY-SA 4.0 with attribution required.