primary data
(noun)
data that has been compiled for a specific purpose and has not been collated or merged with other data
Examples of primary data in the following topics:
-
Types of Data
- Data can be classified as either primary or secondary.
- Primary data is original data that has been collected specially for the purpose in mind.
- An example of primary data is conducting your own questionnaire.
- Those who gather primary data get to write the questions.
- Differentiate between primary and secondary data and qualitative and quantitative data.
-
Introducing observational studies and experiments
- There are two primary types of data collection: observational studies and experiments.
- Researchers perform an observational study when they collect data in a way that does not directly interfere with how the data arise.
- In each of these situations, researchers merely observe the data that arise.
-
Introduction
- These observations - collected from the likes of field notes, surveys, and experiments - form the backbone of a statistical investigation and are called data.
- Statistics is the study of how best to collect, analyze, and draw conclusions from data.
- That is, statistics has three primary components: How best can we collect data?
- However, many of these investigations can be addressed with a small number of data collection techniques, analytic tools, and fundamental concepts in statistical inference.
-
Observational studies
- Generally, data in observational studies are collected only by monitoring what occurs, while experiments require the primary explanatory variable in a study be assigned for each subject by the researchers.
- In the same way, the county data set is an observational study with confounding variables, and its data cannot easily be used to make causal conclusions.
- This prospective study recruits registered nurses and then collects data from them using questionnaires.
- Some data sets, such as county, may contain both prospectively- and retrospectively-collected variables.
-
Is batting performance related to player position in MLB?
- We will use a data set called bat10, which includes batting records of 327 Major League Baseball (MLB) players from the 2010 season.
- The primary issue here is that we are inspecting the data before picking the groups that will be compared.
- It is inappropriate to examine all data by eye (informal testing) and only afterwards decide which parts to formally test.
- This is called data snooping or data fishing.
-
Rank Randomization: Two Conditions (Mann-Whitney U, Wilcoxon Rank Sum)
- The primary advantage of rank randomization tests is that there are tables that can be used to determine significance.
- Fictitious data converted to ranks.
- Rearrangement of data converted to ranks.
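The rearrangements of ranked data mentioned above can be sketched in code. The following is a minimal illustration, not the procedure from the original text: it converts two small groups to ranks, enumerates every way of reassigning the ranks to the first group, and reports the share of rearrangements whose rank sum is at least as large as the observed one (a one-tailed test). The data values and the function names `rank_values` and `rank_randomization_p` are hypothetical.

```python
from itertools import combinations


def rank_values(values):
    """Assign 1-based ranks to values, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_randomization_p(group_a, group_b):
    """One-tailed p-value: share of rearrangements whose rank sum for
    group A is at least as large as the observed rank sum."""
    ranks = rank_values(list(group_a) + list(group_b))
    n_a = len(group_a)
    observed = sum(ranks[:n_a])
    as_extreme = 0
    arrangements = 0
    for chosen in combinations(range(len(ranks)), n_a):
        arrangements += 1
        if sum(ranks[i] for i in chosen) >= observed:
            as_extreme += 1
    return as_extreme / arrangements


# Hypothetical scores for two conditions (not from the original text).
experimental = [12, 15, 17, 20]
control = [3, 5, 8, 10]
p_value = rank_randomization_p(experimental, control)
print(f"one-tailed p = {p_value:.4f}")
```

Because the total of all ranks is fixed, comparing rank sums for one group is equivalent to comparing the difference in mean ranks between the groups. Full enumeration is only practical for small samples, which is where published tables are most useful.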
-
Data Snooping: Testing Hypotheses Once You've Seen the Data
- Testing hypotheses once you've seen the data may result in inaccurate conclusions.
- The error is particularly prevalent in data mining and machine learning.
- Sometimes, people deliberately test hypotheses once they've seen the data.
- Data snooping (also called data fishing or data dredging) is the inappropriate (sometimes deliberately so) use of data mining to uncover misleading relationships in data.
- Although data-snooping bias can occur in any field that uses data mining, it is of particular concern in finance and medical research, which both heavily use data mining.
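A small simulation can illustrate why testing after looking at the data is risky. The sketch below is an assumption-laden illustration rather than anything from the original text: every comparison is pure noise (both groups are drawn from the same normal distribution with known standard deviation 1), and the function names, sample sizes, and number of comparisons are hypothetical. It contrasts testing one comparison chosen in advance with testing whichever of 20 comparisons looks most extreme.

```python
import math
import random


def two_sided_p(sample_a, sample_b):
    """Two-sided z-test p-value for a difference in means,
    assuming a known standard deviation of 1 in both groups."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    z = diff / math.sqrt(1 / n_a + 1 / n_b)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def simulate(trials=1000, n_tests=20, n=20, alpha=0.05, seed=0):
    """Return (rate for one pre-specified test, rate for best of n_tests)."""
    rng = random.Random(seed)
    prespecified_hits = 0
    snooped_hits = 0
    for _ in range(trials):
        p_values = []
        for _ in range(n_tests):
            # Both groups come from the same distribution: no real effect.
            group_a = [rng.gauss(0, 1) for _ in range(n)]
            group_b = [rng.gauss(0, 1) for _ in range(n)]
            p_values.append(two_sided_p(group_a, group_b))
        if p_values[0] < alpha:      # comparison chosen before seeing data
            prespecified_hits += 1
        if min(p_values) < alpha:    # comparison chosen after seeing data
            snooped_hits += 1
    return prespecified_hits / trials, snooped_hits / trials


prespecified_rate, snooped_rate = simulate()
print(f"false-positive rate, pre-specified test: {prespecified_rate:.3f}")
print(f"false-positive rate, best of 20 tests:   {snooped_rate:.3f}")
```

With independent tests at level 0.05, the chance that at least one of 20 null comparisons appears significant is 1 - 0.95^20, roughly 0.64, so the second rate should land far above the nominal 0.05.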
-
The Null and the Alternative
- In the significance testing approach of Ronald Fisher, a null hypothesis is potentially rejected or disproved on the basis of data that would be significantly unlikely under its assumption, but it is never accepted or proved.
- In the hypothesis testing approach of Jerzy Neyman and Egon Pearson, a null hypothesis is contrasted with an alternative hypothesis, and a decision between the two is made on the basis of data, with specified error rates.
- The evidence is in the form of sample data.
- $H_0$: No more than 30% of the registered voters in Santa Clara County voted in the primary election.
- $H_a$: More than 30% of the registered voters in Santa Clara County voted in the primary election.
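To show how sample data might bear on the pair of hypotheses above, here is a minimal sketch of an exact one-sided binomial test of $H_0$: $p \le 0.30$ against $H_a$: $p > 0.30$. The sample size and the number of voters are hypothetical, chosen only for illustration, and the function name `binomial_upper_tail` is an assumption. The p-value is computed at the boundary value $p = 0.30$.

```python
import math


def binomial_upper_tail(successes, n, p):
    """P(X >= successes) for X ~ Binomial(n, p), computed exactly."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(successes, n + 1)
    )


# Hypothetical sample: 500 registered voters surveyed, 170 voted.
n_sampled = 500
n_voted = 170
p_null = 0.30  # boundary of the null hypothesis

p_value = binomial_upper_tail(n_voted, n_sampled, p_null)
print(f"sample proportion = {n_voted / n_sampled:.3f}")
print(f"one-sided p-value = {p_value:.4f}")
```

A small p-value would count as evidence against $H_0$ in favor of $H_a$; a large one would mean the data fail to reject $H_0$, which is not the same as proving it.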
-
Tree diagrams
- Tree diagrams are a tool to organize outcomes and probabilities around the structure of the data.
- The smallpox data fit this description.
- The first branch for inoculation is said to be the primary branch while the other branches are secondary.
- This tree diagram splits the smallpox data by inoculation into the yes and no groups with respective marginal probabilities 0.0392 and 0.9608.
- When constructing a tree diagram, variables provided with marginal probabilities are often used to create the tree's primary branches; in this case, the marginal probabilities are provided for midterm grades.
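The arithmetic behind a tree diagram can be sketched as follows. The marginal probabilities 0.0392 and 0.9608 for the inoculation branches come from the text above; the conditional probabilities on the secondary branches are hypothetical placeholders, since the excerpt does not give them, and the function name `tree_leaves` is likewise an assumption.

```python
def tree_leaves(marginals, conditionals):
    """Multiply along each path: P(primary and secondary) =
    P(primary) * P(secondary | primary)."""
    leaves = {}
    for primary, p_primary in marginals.items():
        for secondary, p_conditional in conditionals[primary].items():
            leaves[(primary, secondary)] = p_primary * p_conditional
    return leaves


# Primary branches: inoculated yes/no (marginals from the text).
marginals = {"yes": 0.0392, "no": 0.9608}

# Secondary branches: HYPOTHETICAL conditional probabilities of the outcome
# given inoculation status, used only to make the example run.
conditionals = {
    "yes": {"lived": 0.98, "died": 0.02},
    "no": {"lived": 0.86, "died": 0.14},
}

leaves = tree_leaves(marginals, conditionals)
for path, probability in leaves.items():
    print(path, round(probability, 4))

# Inverting the tree: P(inoculated | lived) via the joint probabilities.
p_lived = leaves[("yes", "lived")] + leaves[("no", "lived")]
p_inoculated_given_lived = leaves[("yes", "lived")] / p_lived
print("P(inoculated | lived) =", round(p_inoculated_given_lived, 4))
```

The leaf probabilities sum to 1, which is a quick check that the branches were filled in consistently.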
-
Types of outliers in linear regression
- (4) There is a primary cloud and then a small secondary cloud of four outliers.
- In (5), data with no clear trend were assigned a line with a large trend simply due to one outlier (!).
- If there are outliers in the data, they should not be removed or ignored without a good reason.
- Whatever final model is fit to the data would not be very helpful if it ignores the most exceptional cases.
- Not every data set contains an outlier.
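The effect described in case (5), where a single outlier gives trendless data a steep fitted line, can be reproduced with a short sketch. The data points below are hypothetical and the helper `least_squares` is an illustrative implementation of ordinary least squares, not code from the original text.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


# Hypothetical cloud of points with no clear trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 3, 2, 3, 2, 3, 2, 3]

slope_without, _ = least_squares(xs, ys)

# Add one hypothetical outlier far from the cloud in both x and y.
slope_with, _ = least_squares(xs + [30], ys + [30])

print(f"slope without outlier: {slope_without:.3f}")
print(f"slope with outlier:    {slope_with:.3f}")
```

Comparing the two fits is a way to judge how much one point drives the line; it is not by itself a justification for removing that point.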