Basics of Data Science
- What is Data Science?
- Answer: Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It combines statistics, computer science, domain knowledge, and machine learning to analyze and interpret complex data.
- Differentiate between Data Science, Machine Learning, and Artificial Intelligence.
- Answer:
- Data Science: Broad field that focuses on extracting insights from data.
- Machine Learning: Subset of AI, and a core tool of Data Science, focused on developing algorithms that learn patterns from data.
- Artificial Intelligence: Broader concept of machines being able to carry out tasks in a way that humans would consider "smart."
- What is the CRISP-DM methodology?
- Answer: CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It includes six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
Data Preprocessing and Feature Engineering
- What is data preprocessing, and why is it important?
- Answer: Data preprocessing involves cleaning, transforming, and organizing raw data into a usable format. It is crucial for improving the quality of data and ensuring accurate, reliable, and efficient model training.
- Name some common data preprocessing techniques.
- Answer: Techniques include handling missing values, removing duplicates, encoding categorical variables, and feature scaling (normalization/standardization), as in the sketch below.
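A minimal sketch of these steps using pandas and scikit-learn; the DataFrame, column names, and values are all hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [25, None, 32, 25],
    "city": ["NY", "LA", None, "NY"],
    "income": [50_000, 62_000, 58_000, 50_000],
})

df = df.drop_duplicates()                         # remove exact duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # impute missing numerics
df["city"] = df["city"].fillna("unknown")         # impute missing categories
df = pd.get_dummies(df, columns=["city"])         # one-hot encode categoricals

scaler = StandardScaler()                         # feature scaling (z-score)
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])
print(df)
```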
- What is feature engineering, and why is it important?
- Answer: Feature engineering involves creating new features from existing data to improve model performance. It makes the data more informative for machine learning algorithms; see the sketch below.
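A small illustrative example of deriving new features from raw columns; the transactions table and its columns are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-05 09:15", "2024-01-06 22:40"]),
    "price": [19.99, 250.00],
    "quantity": [3, 1],
})

df["revenue"] = df["price"] * df["quantity"]          # interaction feature
df["hour"] = df["timestamp"].dt.hour                  # datetime decomposition
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5  # calendar feature
df["log_price"] = np.log1p(df["price"])               # tame a skewed distribution
print(df)
```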
Statistics and Probability
- Explain the difference between descriptive and inferential statistics.
- Answer:
- Descriptive Statistics: Summarizes and describes the features of a dataset.
- Inferential Statistics: Draws conclusions about a population from sample data. (Both are contrasted in the sketch below.)
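A quick sketch contrasting the two on a hypothetical sample, using NumPy and SciPy:

```python
import numpy as np
from scipy import stats

sample = np.array([4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0])

# Descriptive: summarize the sample itself.
print(sample.mean(), sample.std(ddof=1))

# Inferential: estimate something about the population, e.g. a 95%
# confidence interval for the population mean.
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print(ci)
```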
- What is the Central Limit Theorem (CLT)?
- Answer: The Central Limit Theorem states that the distribution of the sample mean of a large number of independent, identically distributed variables with finite variance approaches a normal distribution as the sample size grows, regardless of the original distribution. The simulation below illustrates this.
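A short simulation: draw many samples from a skewed exponential distribution and check that their means look normal (the sample size of 50 and the 10,000 repetitions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 samples of size 50 from Exponential(1), which is strongly skewed.
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# Exponential(1) has mean 1 and std 1, so the CLT predicts the sample
# means are approximately Normal(1, 1/sqrt(50)) ~ Normal(1, 0.141).
print(sample_means.mean(), sample_means.std(ddof=1))
```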
- Define p-value.
- Answer: The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the observed value under the null hypothesis. It helps in determining the significance of the results.
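A worked example of this definition using SciPy's one-sample t-test; the measurements and the null mean of 5.0 are hypothetical:

```python
from scipy import stats

data = [5.2, 4.9, 5.4, 5.1, 5.3, 5.0, 5.2, 5.5]
t_stat, p_value = stats.ttest_1samp(data, popmean=5.0)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# A small p-value (e.g. < 0.05) means data this extreme would be unlikely
# if the null were true, so we reject the null at that significance level.
```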
Exploratory Data Analysis (EDA)
- What is Exploratory Data Analysis (EDA)?
- Answer: EDA involves analyzing and summarizing the main characteristics of a dataset, often using visual methods. It helps in discovering patterns, spotting anomalies, testing hypotheses, and checking assumptions.
- Name some common techniques used in EDA.
- Answer: Techniques include summary statistics, data visualization (e.g., histograms, scatter plots, box plots), correlation analysis, and data transformation.
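A minimal EDA pass sketching these techniques; `data.csv` and its columns are placeholders for your own dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

print(df.describe())                # summary statistics
print(df.isna().sum())              # missing values per column
print(df.corr(numeric_only=True))   # pairwise correlations

df.hist(figsize=(10, 6))            # distribution of each numeric column
plt.show()
```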
- How do you handle outliers in a dataset?
- Answer: Common techniques (illustrated below) include:
- Removing outliers if they are errors.
- Transforming data (e.g., using logarithms).
- Using robust statistical methods.
- Capping (winsorizing) outliers or replacing them with the mean/median.
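A sketch of the common IQR rule for flagging outliers, plus the capping and log-transform options; the data and the 1.5×IQR fences are illustrative:

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 95, 11, 10])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)                        # -> [95]

# Option A: drop them.  Option B: cap (winsorize) to the fences.
capped = np.clip(values, lower, upper)
# Option C: a log transform compresses extreme values.
logged = np.log1p(values)
```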
Machine Learning Algorithms
- What are the types of machine learning algorithms?
- Answer:
- Supervised Learning: Trained on labeled data (e.g., regression, classification).
- Unsupervised Learning: Trained on unlabeled data (e.g., clustering, association).
- Reinforcement Learning: Learns through trial-and-error interaction with an environment. (The first two are contrasted in the sketch below.)
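A toy contrast of supervised vs. unsupervised learning in scikit-learn; the data are made up, and reinforcement learning is omitted since it requires an environment loop:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])       # labels available -> supervised

clf = LogisticRegression().fit(X, y)   # classification with labels
print(clf.predict([[1.1], [5.1]]))     # -> [0 1]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # no labels
print(km.labels_)                      # clusters discovered from X alone
```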
- Explain the difference between a parametric and a non-parametric model.
- Answer:
- Parametric Model: Assumes a fixed functional form with a fixed number of parameters (e.g., Linear Regression).
- Non-Parametric Model: Makes no fixed-form assumption, so model complexity can grow with the data (e.g., K-Nearest Neighbors). Both are contrasted in the sketch below.
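A sketch contrasting the two model families on the same synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=100)

# Parametric: fixed form y = w*x + b, two parameters no matter the data size.
lin = LinearRegression().fit(X, y)

# Non-parametric: predictions come from nearby training points, so the
# effective complexity grows with the amount of data.
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

print(lin.predict([[3.0]]), knn.predict([[3.0]]))
```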
- What is overfitting, and how can it be prevented?
- Answer: Overfitting occurs when a model learns the noise in the training data, leading to poor generalization on new data. It can be prevented using cross-validation, regularization, pruning (for tree models), and simplifying the model, as sketched below.
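A sketch of two of these defences, regularization (Ridge) and cross-validation to estimate generalization, on synthetic data where plain least squares tends to overfit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))              # few samples, many features
y = X[:, 0] + rng.normal(0, 0.5, size=50)  # only feature 0 matters

# Cross-validation scores estimate out-of-sample performance; the
# regularized model typically generalizes better in this regime.
for model in (LinearRegression(), Ridge(alpha=1.0)):
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, scores.mean().round(3))
```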