- What is data science primarily concerned with?
a) Collecting data
b) Analyzing data
c) Both a and b
d) None of the above
- What is the first step in the data science process?
a) Data cleaning
b) Data visualization
c) Data analysis
d) Data collection
- Which programming language is commonly used for data analysis and visualization in data science?
a) Java
b) Python
c) C++
d) Ruby
- What is the term for finding patterns and insights in data?
a) Data collection
b) Data cleaning
c) Data analysis
d) Data visualization
- Which of the following is NOT a data type commonly used in data science?
a) Numeric
b) Boolean
c) Text
d) Sound
- What is the purpose of exploratory data analysis (EDA)?
a) To create predictive models
b) To summarize data
c) To understand data and discover patterns
d) To visualize data
- Which statistical measure describes the central tendency of a dataset?
a) Standard deviation
b) Median
c) Range
d) Variance
- What is the main goal of data preprocessing?
a) To remove all data
b) To prepare data for analysis
c) To add noise to data
d) To create more complex data
- What is the term for data that is missing in a dataset?
a) Outliers
b) Noise
c) Null or missing values
d) Data artifacts
- What is the process of converting categorical data into numerical values called?
a) Categorical encoding
b) Numerical transformation
c) Data normalization
d) Data scaling
- Which of the following is NOT a supervised learning algorithm?
a) Linear regression
b) K-means clustering
c) Decision tree
d) Support vector machine
- What is the primary goal of unsupervised learning?
a) Classification
b) Regression
c) Clustering
d) Feature engineering
- Which technique is used for reducing the dimensionality of data while preserving as much information as possible?
a) Principal Component Analysis (PCA)
b) Linear regression
c) K-means clustering
d) Decision trees
- Which data visualization type is best suited for showing the distribution of a single variable?
a) Scatter plot
b) Histogram
c) Box plot
d) Bar chart
- In a confusion matrix for a binary classification problem, what does “true positive” represent?
a) Correctly predicted positive instances
b) Incorrectly predicted positive instances
c) Correctly predicted negative instances
d) Incorrectly predicted negative instances
- What is overfitting in machine learning?
a) When a model performs well on the training data but poorly on new, unseen data
b) When a model performs equally well on training and testing data
c) When a model has too few parameters
d) When a model is undertrained
- What is the purpose of regularization techniques in machine learning?
a) To make the model fit the training data perfectly
b) To reduce the complexity of a model and prevent overfitting
c) To increase the variance of a model
d) To decrease the bias of a model
- What is the ROC curve used to evaluate in machine learning?
a) Model accuracy
b) Model bias
c) Model variance
d) Model performance at different thresholds
- Which of the following is an example of a natural language processing (NLP) task?
a) Image classification
b) Speech recognition
c) Sentiment analysis
d) Regression analysis
- What is the purpose of a decision tree in machine learning?
a) To perform clustering
b) To make predictions or classifications
c) To reduce the dimensionality of data
d) To visualize data
- Which library is commonly used for deep learning in Python?
a) Scikit-learn
b) Matplotlib
c) TensorFlow
d) NumPy
- What is the term for a subset of data that is held out from training and used to evaluate and tune a model during development?
a) Validation set
b) Test set
c) Training set
d) Feature set
- Which of the following is NOT a step in the CRISP-DM data mining process?
a) Data collection
b) Model deployment
c) Data visualization
d) Data preprocessing
- What is the objective of k-fold cross-validation in machine learning?
a) To train multiple models with different parameters
b) To divide data into k equal-sized subsets for training and testing
c) To increase model complexity
d) To reduce model interpretability
- What is the main advantage of using ensemble methods in machine learning?
a) They are faster to train
b) They are simpler to implement
c) They often improve model performance
d) They require less data
- Which of the following is a commonly used algorithm for recommendation systems?
a) K-means clustering
b) Decision tree
c) Naive Bayes
d) Collaborative filtering
- What is a data warehouse used for in data science?
a) Storing and managing large volumes of data
b) Performing real-time data analysis
c) Collecting data from external sources
d) Visualizing data
- What is the goal of feature engineering in machine learning?
a) To create new data
b) To select the most relevant features for a model
c) To increase the dimensionality of data
d) To reduce the amount of data
- What is the purpose of the term “bias” in machine learning?
a) To introduce randomness into the model
b) To reduce model accuracy
c) To make the model more flexible
d) To control systematic errors in predictions
- In time series analysis, what is a lag?
a) A gap between data points
b) A time delay between two variables
c) A seasonality factor
d) A statistical error term
Answers:
- c) Both a and b
- d) Data collection
- b) Python
- c) Data analysis
- d) Sound
- c) To understand data and discover patterns
- b) Median
- b) To prepare data for analysis
- c) Null or missing values
- a) Categorical encoding
- b) K-means clustering
- c) Clustering
- a) Principal Component Analysis (PCA)
- b) Histogram
- a) Correctly predicted positive instances
- a) When a model performs well on the training data but poorly on new, unseen data
- b) To reduce the complexity of a model and prevent overfitting
- d) Model performance at different thresholds
- c) Sentiment analysis
- b) To make predictions or classifications
- c) TensorFlow
- a) Validation set
- c) Data visualization
- b) To divide data into k equal-sized subsets for training and testing
- c) They often improve model performance
- d) Collaborative filtering
- a) Storing and managing large volumes of data
- b) To select the most relevant features for a model
- d) To control systematic errors in predictions
- b) A time delay between two variables
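Several of the answers above can be checked by hand with a few lines of Python. The sketch below is only illustrative and uses the standard library alone; the helper names (one_hot, confusion_counts, k_fold_indices) are hypothetical and do not come from any particular library. It shows the median as a measure of central tendency, a simple form of categorical encoding, the counting of true positives in a binary confusion matrix, and the splitting of data into k folds for cross-validation.

```python
from statistics import median


def one_hot(values):
    """Categorical encoding: turn each category into a 0/1 indicator vector.

    Columns follow the alphabetical order of the distinct categories.
    """
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, tn, fn


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds (no shuffling).

    Fold sizes differ by at most one when n_samples is not divisible by k.
    """
    start = 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        test = list(range(start, start + size))
        train = [j for j in range(n_samples) if j < start or j >= start + size]
        yield train, test
        start += size


if __name__ == "__main__":
    print(median([3, 1, 4, 1, 5]))                 # central tendency
    print(one_hot(["red", "blue", "red"]))         # categorical encoding
    print(confusion_counts([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
    for train, test in k_fold_indices(6, 3):       # k-fold splits
        print(train, test)
```

In practice, libraries such as scikit-learn provide tested versions of these operations; the functions here are meant only to make the quiz concepts concrete.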