Getting Started: A Few Definitions.
Statistics: The science of collecting, describing, interpreting and analyzing data
What's Data?
Experiment: A procedure that gathers "information" about a collection of items.
Population: A collection of items
Sample: A subset of a population
Variable: A characteristic of each item in a population (A QUESTION)
e.g. People -> age, hair color (if you're not bald), gender, ethnicity, salary
Why is it called a variable? (As opposed to a constant)
Most common way to represent a variable? X
Data: A collection of responses from a population or sample to a variable or set of variables
Examples of Populations v. Samples
"Students at DeVry" v. "Students in this class"
"People hospitalized at RWJUH" v. "All people currently hospitalized in the U.S."
"People with lung cancer" v. "All people who have ever had or will ever have lung cancer"
"All U.S. voters" v. "All voters who participated in a specific opinion poll"
Calculations based upon a population -> Parameters (Population Mean, Std. Dev.)
Calculations based upon a sample -> Statistics (sample mean, sample std. dev.)
Population -> parameter
Sample -> statistic
A branch of STATISTICS is called "inferential statistics" and deals primarily with the following question:
I have information from a sample. What does that mean for my population?
The second major branch of statistics is called "descriptive statistics". It focuses on taking a set of data and analyzing and displaying it in a meaningful way.
What can I skip (and not lose points on the quiz)? Sections 1.5, 1.7, 1.9
Today: What kinds of variables are there?
The type of variable affects the potential forms of analysis
Example: Ethnicity vs. Age
Ethnicity: I can only count and compute percentages.
Ages: I can compute an average and a standard deviation.
Two major types of variables:
1) Categorical (Qualitative)
Ordinal - has an order
Nominal - no inherent order (alphabetical does not count)
2) Numerical (Quantitative)
Discrete: values that come from a predefined list (most common - integers)
Continuous: values that can occur anywhere within a range of numbers and can be measured as finely as possible (lengths, times, volumes)
What I care about less: Levels of Measurement
a) Nominal b) Ordinal c) Interval d) Ratio
Some examples: Can you classify me? (Cat: Ord v. Nom. OR Num: Disc v. Cont.)
1) Blood type
2) Current Temperature!
3) Number of Vehicles in Parking Lot
4) Hotel Rating
5) Meat Quality Grade
6) Amount of Drug in a Pill
7) Homicides in New York City in One Year
8) Speed of a Car on the Jersey Turnpike
9) A voter's political affiliation
10) Movie Rating
Q: Who cares about statistics?
A: People are constantly bombarded by data of all sorts and we need some tools to understand it.
A2: It describes a number of situations that people find important.
**Law Enforcement: Crime, Terrorism -> Insurance
**Business: Sales, Marketing
**Medical: Is a drug effective? Should I take it if it has serious side effects: Vioxx
I have a disease. Will it kill me?
**Political
**Sports
**Gambling
Sample - smaller than a population.
Generally: I want to know something about a population. But, rather than collect information from the ENTIRE population (census), I often look to a smaller sample instead (1000-10,000 items)
Why sample? Focusing on the population is:
1) Too costly
2) Too time-consuming
3) Less accurate
4) Impossible: does not exist (medical patients)
too destructive (tire, lightbulb lifespans, blood sample)
Q: How do you sample?
A: Goal - Representative (Unbiased) Sample - Random
Book: Probability Sample - each item has a predefined probability of inclusion in my sample.
Simple Random Sample: Each item has the same chance of inclusion in my sample
Each sample of a given size is equally likely
Systematic Random Sample:
To generate a random sample use:
1) A sampling frame: numbers associated with each item
2) Random numbers: computer?
Simple random sample of size n.
Pick n random numbers from 1 to the end of the sampling frame.
My example: Pick 2 numbers from 1 to 6.
2 ways: with replacement, **without replacement**
Systematic sample of size n.
If N is the size of my population, I take k = N/n (maybe I'll round up).
Pick ONLY ONE random number from 1 to k. Then, add k to this value until you have a total of n numbers
Simple Random Sampling:
1) Construct a Sampling Frame: A numbered list of
2) To choose a sample of n items, choose n random numbers - Excel's Sampling Tool (repetitions are possible), Simple random numbers
Week 2:
----------
Chapter 2: Methods of Describing/Displaying Data
Display - Tables (non-graphic), Charts (Graphs)
Theme to consider: When does the method "remove" information?
Methods Can Vary Depending Upon the Data:
SECTION 2.3
First kind of data is a series of values (numbers) shown by date.
examples: exercise 2.8 p.55, Presidential Deaths
Time-Series or a Time-Order Plot
For numerical, sequentially-ordered data
SECTION 2.4
Some Simple Ways to Describe Numerical Data
Given Raw Numerical Data:
1) Sort the numbers
2) Two Simple Plots:
a) Dotplot - values listed along x-axis, a dot per value; when values repeat, dots stack vertically.
b) Stem-and-Leaf: Divide each value into parts--a "stem" and a "leaf". You record EVERY leaf and write each stem only once. You also should include stems that have no leaves.
e.g. Presidential Age: 53, 56, 69, etc.
53: Stem is 5 and Leaf is 3. 56: Stem is 5 and Leaf is 6; 69: Stem is 6 and Leaf is 9.
Note #1: No loss of information!
Note #2: Stems are not always 10's
Possibly if a data value is 326. Maybe Stem:32 and Leaf 6 OR Stem:3 and leaf 26 (possibly represented as just a 2 or a 3).
Also, if there are many leaves for a single stem it can be broken down into "high" and "low" values.
3) More Complex Methods: Frequency Classes
a) Our Stem-and-Leaf Counts by Tens: 40's, 50's, 60's, etc.
b) Frequency - "Count"
c) Relative Frequency - "Count/Total"
d) Example: Baseball - Hits (Frequency), Batting Average (Relative Freq.) = Hits/(At Bats)
e) Cumulative Relative Frequency
Note #3: Information is lost when grouping into classes.
Best guess in each class: the midpoint
ages 40-49: midpoint (40+49)/2=44.5 (or 45)
Q: Why use relative frequencies? (Think Baseball! or Crime!)
A: Comparison of two items with different totals (different classes or players)
To Do: More Frequency Classes
==============
Lecture #3:
==============
Generally:
Fix a number of classes that you plan to use (5-20)
Determine a set of classes that are: non-overlapping but all-inclusive.
Class Width W = (Max-Min)/(# of classes) -> Round Up!
5 Classes: (3.34-2.78)/5 = 0.112 -> 0.12
Now start with the minimum and keep adding the width enough times to generate the appropriate number of classes
Graphs: Histograms, Polygons, Ogives, Pareto Diagrams*
Bivariate Data
Graphical Displays:
1) Histogram - Bar Chart! x=values, y=frequencies. No gaps between bars
2) Polygon - line graph - x=midpoint, y=frequency
Why polygon v. histogram? Can show multiple polygons in the same window
3) Ogive - cumulative percentage polygon
x=RIGHT ENDPOINT, y=cumulative relative frequency
Which is better? A "higher" or a "lower" ogive?
4) Pareto Chart -
a) Used For Qualitative Nominal Data
b) Construct a table with categories and matching frequencies
c) List them in order of Descending Frequency
d) Add on relative frequencies -> cumulative relative frequencies
e) Chart: Frequency Histogram with an Ogive
==============
Lecture #4: Descriptive Measures of Numerical Data
==============
Given a set of data we want ways to compress all the information into easily digestible parts that represent the whole in some meaningful way:
We use ideas like:
1) Central Tendency: "middle of the data", "typical value"
2) Variation: "spread of the data", "how far is unusual?"
3) Position: "ranking of the data"
At the same time, we'll examine the effect of outliers or "extreme values" on these measures.
Start with measures of central tendency.
There are many ways to define the middle of a data set. We'll discuss 5 of them.
The "5 M's"
Listed in order of simplicity
1) Mode: Most common data value (there is an Excel MODE formula Yay!)
*There may be more than one mode or no mode at all.
2) Midrange: (Maximum+Minimum)/2 (**No Excel Formula Exists :-<** )
3) Mean: - sum all numbers and divide by the total count (AVERAGE in Excel)
4) Median: "the middle number" (MEDIAN in Excel)
**Sort all data (from lowest to highest)
**N=# of data values, Take (N+1)/2
**Find the value at that position in the sorted list and call it the median.
**IF (N+1)/2 is not an integer, average the two values at the positions nearest it.
5) Midhinge: (Q1+Q3)/2 (No Excel formula: Use QUARTILE instead)
Quartiles divide a data set into quarters.
Min<-> 25%<->Q1<->Median<->Q3<->Max
SKEWED:
a) Symmetric: If Mean = Median
b) Right-Skewed: Mean > Median
c) Left-Skewed: Mean < Median
##Example (using Excel on all the diesel gas prices by state)
Occasionally there are "Outliers" -> Unusually large (or small) values (definition to follow!)
These occur naturally or through (recording) errors
Let's take a look at how an outlier affects the measures of central tendency
## Example, replace 2.933 with 4.933
Trimmed Mean - Takes an average of all numbers except for a certain percentage of the unusually high and low values.
Example 10% trimmed mean (TRIMMEAN)
Measures of Variation:
1) Range (R) = Maximum - Minimum (No formula in Excel)
2) Interquartile Range (IQR) = Q3-Q1 (No formula in Excel)
3) Variance (S^2), **Standard Deviation(S)**, Coefficient of Variation
See book for definitions
Excel Formulas (for samples): VAR, STDEV
S = sqrt(S^2)
Connection between the measures:
R is roughly 2S for small data sets, 4S for medium data sets, 6S for large data sets
IQR is roughly 1.33*S for large enough data sets
Finish up 3.3 next week
Measures of Position: Percentiles & Quartiles
Q1 = 1/4 of all values are below and 3/4 of all values are above
P33 = 33% of all values are below and 67% of all values are above.
Five Number Summary: Minimum, Q1, Median, Q3, Maximum
Box-and-Whisker Plot
Example: DJIA: Five Number Summary & 90th percentile and 15th percentile.
Empirical Rule (68%, 95%, 99.7%)
Chebyshev's Theorem (---,75%,89%,...)
Second Example: State of World's Children
Chapter 4: Probability
Statistics analyzes past data
Probability predicts the future using intuition or past information or pure guesswork
Probability - the subject
probability - a likelihood that something will happen
Experiment: a situation that produces results
Outcome: an individual result
Sample Space S: the set of all outcomes
Event: any subset of the sample space A, B, C
probability: a value between 0 and 1 that represents the likelihood that a specific event will take place.
event A has a probability P(A). 0 <= P(A) <= 1
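Excel, with the values in A1:A5 and their probabilities in B1:B5 (mean stored in C2, variance in C3):
Mean -> SUMPRODUCT(A1:A5,B1:B5)
Variance -> SUMPRODUCT(A1:A5,A1:A5,B1:B5)-C2^2
Standard Deviation -> SQRT(C3)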
Examples: Problems 4.10 and 4.12 on p. 149
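A rough Python version of the spreadsheet formulas above, as a sketch only. It assumes column A holds the possible values and column B holds their probabilities. The five values and probabilities below are made up for illustration; they are not taken from the book's problems.

```python
import math

# Hypothetical discrete distribution (stand-ins for A1:A5 and B1:B5).
values = [0, 1, 2, 3, 4]
probs = [0.10, 0.20, 0.40, 0.20, 0.10]

# The probabilities of a distribution must add up to 1.
assert abs(sum(probs) - 1.0) < 1e-9

# Mean: sum of value * probability, like SUMPRODUCT(A1:A5,B1:B5).
mean = sum(x * p for x, p in zip(values, probs))

# Variance: sum of value^2 * probability, minus the mean squared,
# like SUMPRODUCT(A1:A5,A1:A5,B1:B5)-C2^2.
variance = sum(x * x * p for x, p in zip(values, probs)) - mean ** 2

# Standard deviation: square root of the variance, like SQRT(C3).
std_dev = math.sqrt(variance)

print("mean:", mean)
print("variance:", variance)
print("std dev:", std_dev)
```

For these made-up numbers the mean works out to 2 and the variance to 1.2 (up to floating-point rounding).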
New This Week
Specialized Discrete Distributions
*Certain Common Situations
Binomial, Geometric, Negative Binomial, Poisson, (Hypergeometric)
Note: Parameters - specific features that distinguish one distribution from
another within a class of distributions
Binomial Distribution
**Whenever you ask a "Yes/No" question of a FIXED number of Items
Important:
1) Only two possible answers: "Yes" - Success, "No" - Failure
2) Fixed number (N) of Trials
3) Independent Trials - Equal Success Probability - p
probability of failure -> q = 1-p
4) Binomial Distribution X - counts the number of successes
X = 0,1,2,....N
Example: Examine 12 light bulbs to see if they work; each works with probability
p=0.73 -> q=1-0.73 = 0.27
X = the actual number of lightbulbs that work out of the 12.
I can calculate the probability of each possible outcome X=0,1,...,11,12
Formula: Page 150 P(X=k) = C(N,k) * p^k * q^(N-k)
Excel: =BINOMDIST(k,N,p,FALSE)
Note: "TRUE" is for cumulative probabilities like P(X<=k)
Secondary Formulas: Mean=N*p, Standard Deviation = SQRT(N*p*q)
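A minimal Python sketch of the binomial formula, using the light bulb example (12 bulbs, p=0.73). The function names binomial_pmf and binomial_cdf are my own, chosen for illustration; they are meant to play the role of BINOMDIST with FALSE and TRUE respectively.

```python
from math import comb, sqrt


def binomial_pmf(k, n, p):
    """P(X = k): exactly k successes in n independent trials."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)


def binomial_cdf(k, n, p):
    """P(X <= k): add up the probabilities for 0, 1, ..., k successes."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


# Light bulb example: 12 bulbs, each works with probability 0.73.
n, p = 12, 0.73

mean = n * p                       # N*p
std_dev = sqrt(n * p * (1 - p))    # SQRT(N*p*q)

print("P(X = 9)  =", binomial_pmf(9, n, p))
print("P(X <= 9) =", binomial_cdf(9, n, p))
print("mean =", mean, " std dev =", std_dev)
```

Adding binomial_pmf over k = 0, 1, ..., 12 should give 1, which is a quick way to check the table of probabilities.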
Geometric (Negative Binomial)
Want to Succeed! (independent trials, fixed probability p)
No fixed number of trials
Example: Sending out resumes until you have a job, Hitting on people in a bar
Looking for two working batteries in a large pile
Negative Binomial: Counts the number of attempts until r successes
X=# of attempts, X=r,r+1,...... (no upper limit)
Parameters: p=success probability, r=number of successes needed to stop.
Geometric (r=1)
P(X=k) = C(k-1,r-1)*p^r*q^(k-r) **(P. 162)**
Excel: =negbinomdist(k-r,r,p)
Example: I need two lightbulbs for my lamp, p=0.73
X=number of bulbs I need to test before my lamp is working
P(X=k)=NegBinomDist(k-r,r,p)
Mean: r/p
Standard Deviation: SQRT (r*q/p^2) = SQRT(r*q)/p
Second Example: p.157 -> 4.16
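A short Python sketch of the negative binomial formula for the two-lightbulb example (r=2, p=0.73). The function name neg_binomial_pmf is my own. Here k is the total number of attempts, the same convention as the formula in these notes; Excel's NEGBINOMDIST takes the number of failures (k-r) as its first argument instead.

```python
from math import comb, sqrt


def neg_binomial_pmf(k, r, p):
    """P(X = k): the r-th success happens on attempt number k."""
    if k < r:
        return 0.0  # r successes cannot happen in fewer than r attempts
    q = 1 - p
    return comb(k - 1, r - 1) * p**r * q**(k - r)


# Lightbulb example: need r = 2 working bulbs, each works with p = 0.73.
r, p = 2, 0.73

mean = r / p                       # r/p
std_dev = sqrt(r * (1 - p)) / p    # SQRT(r*q)/p

# Both of the first two bulbs work: 0.73 * 0.73 = 0.5329
print("P(X = 2) =", neg_binomial_pmf(2, r, p))
print("P(X = 3) =", neg_binomial_pmf(3, r, p))
print("mean =", mean, " std dev =", std_dev)

# Geometric is the special case r = 1 (stop at the first success).
print("Geometric P(X = 1) =", neg_binomial_pmf(1, 1, p))
```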
Poisson Distribution: Count the number of occurrences in a fixed amount of time
Examples:
1) people entering a store in an hour,
2) phone calls to a call center in a 10-minute period
3) hits to a busy website in a minute
4) print jobs sent to a printer in a half-hour
Assumptions: Each occurrence is independent
X= count of events in a given amount of time
Parameter: mean number of events in the same amount of time (L=lambda)
P(X=k)= L^k/(k!)*exp(-L)
In Excel: POISSON(k,L,FALSE)
X=0,1,2,.......
Example: 2.8 people pass a toll booth each minute on average
Mean = L, Std. Dev. = sqrt(L)
Example p.172, problem 4.38
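A minimal Python sketch of the Poisson formula for the toll booth example (L=2.8 per minute). The function name poisson_pmf is my own; it is meant to play the role of POISSON(k,L,FALSE).

```python
from math import exp, factorial, sqrt


def poisson_pmf(k, lam):
    """P(X = k): exactly k occurrences when the mean count is lam."""
    return lam**k / factorial(k) * exp(-lam)


# Toll booth example: 2.8 people per minute on average.
lam = 2.8

mean = lam            # Mean = L
std_dev = sqrt(lam)   # Std. Dev. = sqrt(L)

# Probability of 0, 1, 2, ... 5 people in one minute.
for k in range(6):
    print("P(X =", k, ") =", poisson_pmf(k, lam))

# Cumulative probability P(X <= 3), like POISSON(3,L,TRUE).
print("P(X <= 3) =", sum(poisson_pmf(k, lam) for k in range(4)))
print("mean =", mean, " std dev =", std_dev)
```

Unlike the binomial, X has no upper limit here, so the probabilities only add up to 1 when the list of k values is carried far enough.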
***Week 8***
Continuous Probability Distributions
Review:
Discrete probability distribution: list of outcomes -- integers?
e.g. 0,1,2,3,....
*Each outcome had a probability P(x) often described by a table or a formula
*P(x) is called a "probability mass function" (a "probability density function" is the continuous version)
*There's a related cumulative probability F(x) = P(X<=x)
the official term is a "cumulative distribution function"
When talking about continuous probabilities the outcomes can lie within a range
of values 0