data snooping

(noun)

the inappropriate (sometimes deliberately so) use of data mining to uncover misleading relationships in data

Related Terms

  • Type I error

Examples of data snooping in the following topics:

  • Data Snooping: Testing Hypotheses Once You've Seen the Data

    • Testing a hypothesis once you've seen the data may result in inaccurate conclusions.
    • The error is particularly prevalent in data mining and machine learning.
    • Data snooping (also called data fishing or data dredging) is the inappropriate (sometimes deliberately so) use of data mining to uncover misleading relationships in data.
    • Data-snooping bias is a form of statistical bias that arises from this misuse of statistics.
    • Although data-snooping bias can occur in any field that uses data mining, it is of particular concern in finance and medical research, which both heavily use data mining.
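
A small simulation can make the link between data snooping and Type I error concrete. The sketch below is an illustration, not part of the excerpted topic: it assumes 20 groups of pure noise, a z-test with known standard deviation 1, and a 5% significance level. The function names and parameter values are hypothetical choices made for this example. It compares testing one group chosen in advance with testing whichever group looks most extreme after all groups have been inspected.

```python
import random
from statistics import NormalDist, fmean


def z_statistic(sample):
    """z statistic for H0: mean = 0, assuming the standard deviation is known to be 1."""
    return fmean(sample) * len(sample) ** 0.5


def false_positive_rates(n_groups=20, n_obs=30, n_trials=2000, alpha=0.05, seed=0):
    """Return (planned, snooped) false positive rates when every group is pure noise."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    planned_hits = 0
    snooped_hits = 0
    for _ in range(n_trials):
        # No group has a real effect, so every rejection is a Type I error.
        z_values = [
            abs(z_statistic([rng.gauss(0, 1) for _ in range(n_obs)]))
            for _ in range(n_groups)
        ]
        # Planned analysis: the group to test was fixed before seeing the data.
        planned_hits += z_values[0] > critical
        # Snooped analysis: the most extreme-looking group is picked, then tested.
        snooped_hits += max(z_values) > critical
    return planned_hits / n_trials, snooped_hits / n_trials


planned, snooped = false_positive_rates()
print(f"planned test false positive rate: {planned:.3f}")
print(f"snooped test false positive rate: {snooped:.3f}")
```

Under these assumptions the planned test should reject close to 5% of the time, while the snooped test should reject far more often, because with 20 independent groups the chance that at least one crosses the threshold is 1 - 0.95^20, roughly 64%.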
  • Is batting performance related to player position in MLB?

    • We will use a data set called bat10, which includes batting records of 327 Major League Baseball (MLB) players from the 2010 season.
    • The primary issue here is that we are inspecting the data before picking the groups that will be compared.
    • It is inappropriate to examine all data by eye (informal testing) and only afterwards decide which parts to formally test.
    • This is called data snooping or data fishing.
  • Distorting the Truth with Descriptive Statistics

    • Reporting bias involves a skew in the availability of data, such that observations of a certain kind may be more likely to be reported and consequently used in research.
    • Descriptive statistics is a powerful form of research because it collects and summarizes vast amounts of data and information in a manageable and organized manner.
    • correlate (associate) data or model any type of statistical relationship among variables;
    • In other words, every time you try to describe a large set of observations with a single descriptive statistic, you run the risk of distorting the original data or losing important detail.
  • Exercises

    • Alan, while snooping around his grandmother's basement, stumbled upon a shiny object protruding from under a stack of boxes.
  • Observations, variables, and data matrices

    • These observations will be referred to as the email50 data set, and they are a random sample from a larger data set that we will see in Section 1.7.
    • The data in Table 1.3 represent a data matrix, which is a common way to organize data.
    • Data matrices are a convenient way to record and store data.
    • How might these data be organized in a data matrix?
    • These data were collected from the US Census website.
  • Optional Collaborative Classroom Exercise

    • The science of statistics deals with the collection, analysis, interpretation, and presentation of data. We see and use data in our everyday lives.
    • Your instructor will record the data.
    • For example, consider the following data:
    • Where do your data appear to cluster?
    • Effective interpretation of data (inference) is based on good procedures for producing data and thoughtful examination of the data.
  • Types of Data

    • Qualitative data: race, religion, gender, etc.
    • Primary data is original data that has been collected specially for the purpose in mind.
    • This type of data is collected first hand.
    • Secondary data is data that has been collected for another purpose.
    • Differentiate between primary and secondary data and qualitative and quantitative data.
  • Data

    • Data may come from a population or from a sample.
    • Quantitative data are always numbers.
    • All data that are the result of counting are called quantitative discrete data.
    • All data that are the result of measuring are quantitative continuous data, assuming that we can measure accurately.
    • The data are the colors of backpacks.
  • Student Learning Outcomes

    • Recognize, describe, and calculate the measures of location of data: quartiles and percentiles.
    • Recognize, describe, and calculate the measures of the center of data: mean, median, and mode.
    • Recognize, describe, and calculate the measures of the spread of data: variance, standard deviation, and range.
  • When to Use These Tests

    • In statistics, "ranking" refers to the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted.
    • If, for example, the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of these data items would be 2, 3, 1 and 4 respectively.
    • The upper plot uses raw data.
    • Indicate why and how data transformation is performed and how this relates to ranked data.
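
The ranking transformation described under "When to Use These Tests" can be sketched in a few lines. This is a minimal illustration, and `rank_values` is a hypothetical helper name. It assumes there are no tied values: ties would receive consecutive ranks in order of appearance here, whereas statistical practice usually assigns tied values their average rank.

```python
def rank_values(values):
    """Replace each value by its rank (1 = smallest) when the data are sorted.

    Assumes no ties; tied values would get consecutive ranks in input order.
    """
    # Indices of the values, ordered from smallest value to largest.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


example_ranks = rank_values([3.4, 5.1, 2.6, 7.3])
print(example_ranks)  # prints [2, 3, 1, 4]
```

The result matches the ranks 2, 3, 1 and 4 given for the data 3.4, 5.1, 2.6, 7.3.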

Except where noted, content and user contributions on this site are licensed under CC BY-SA 4.0 with attribution required.