
1. What is data analysis?
Data analysis is the process of inspecting, cleansing, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making.
2. What are the key responsibilities of a data analyst?
- Collecting and interpreting data
- Identifying patterns and trends
- Cleaning and validating data
- Creating reports and dashboards
- Working with cross-functional teams to support business decisions
3. What is the difference between data analysis and data analytics?
- Data analysis: Examining a specific dataset, usually historical, to identify trends and insights.
- Data analytics: The broader discipline, which includes data analysis and also uses those insights to predict future outcomes and optimize processes.
4. What are the different types of data?
- Structured data: Organized data (e.g., SQL databases)
- Unstructured data: Data with no predefined schema (e.g., free text, images)
- Semi-structured data: Partially organized (e.g., JSON, XML)
5. Explain the data analysis process.
- Define objectives
- Collect data
- Data cleaning
- Exploratory data analysis (EDA)
- Data modeling
- Interpret results
- Report findings
6. What is data cleaning, and why is it important?
Data cleaning is the process of detecting and then correcting or removing inaccurate, incomplete, or duplicate data. It protects data integrity and improves the accuracy of any analysis built on the data.
7. What tools are commonly used for data analysis?
- SQL
- Excel
- Python (Pandas, NumPy)
- R
- Tableau/Power BI
- SAS
8. What is the difference between primary and secondary data?
- Primary data: Collected firsthand through surveys, experiments, etc.
- Secondary data: Previously collected data (e.g., from reports or public databases).
9. How do you handle missing data?
- Deletion: Remove rows/columns with missing values
- Imputation: Replace missing values using mean, median, mode, or predictive models
- Flagging: Use a placeholder value or an indicator column when the fact that a value is missing is itself informative
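The three options above can be sketched in plain Python. This is a minimal illustration, not a recommended pipeline: the `ages` list is made-up sample data, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical sample data: None marks a missing reading.
ages = [34, None, 29, 41, None, 38]

# Deletion: keep only the observed values.
observed = [a for a in ages if a is not None]

# Imputation: replace each missing value with the mean of the observed values.
fill_value = mean(observed)
imputed = [a if a is not None else fill_value for a in ages]

# Flagging: record which positions were originally missing.
was_missing = [a is None for a in ages]

print(observed)
print(imputed)
print(was_missing)
```

Mean imputation keeps the sample size but shrinks the variance, which is one reason the median or a model-based estimate is sometimes preferred.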
10. What is normalization and denormalization?
- Normalization: Organizing data to reduce redundancy (e.g., splitting one wide table into several related tables).
- Denormalization: Combining tables for faster read performance.
11. Explain the difference between inner join and left join in SQL.
- Inner Join: Returns records that have matching values in both tables.
- Left Join: Returns all records from the left table plus the matching records from the right table. Where there is no match, the right-table columns are NULL.
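A small sketch of the difference, using Python's built-in sqlite3 module with an in-memory database. The `customers` and `orders` tables and their rows are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 20.0), (12, 2, 75.0);
""")

# INNER JOIN: only customers that have at least one order.
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

# LEFT JOIN: every customer; amount is NULL (None in Python) when there is no order.
left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

print(inner_rows)
print(left_rows)
```

Here 'Cy' has no orders, so the customer disappears from the inner join but appears in the left join with a missing amount.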
12. What is the difference between OLAP and OLTP?
- OLAP (Online Analytical Processing): Used for complex queries and reporting (data analysis).
- OLTP (Online Transaction Processing): Used for day-to-day transaction processing.
13. What is a primary key and a foreign key?
- Primary key: Uniquely identifies each record in a table.
- Foreign key: A column (or set of columns) in one table that references the primary key of another table, establishing a link between the two.
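To see both constraints in action, here is a minimal sketch with Python's sqlite3 module. The `departments` and `employees` tables are hypothetical. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`; other databases typically enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK enforcement
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES departments(id)
    );
    INSERT INTO departments VALUES (1, 'Sales');
    INSERT INTO employees VALUES (100, 'Ana', 1);
""")

# Foreign key violation: department 99 does not exist.
try:
    conn.execute("INSERT INTO employees VALUES (101, 'Ben', 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# Primary key violation: department id 1 is already taken.
try:
    conn.execute("INSERT INTO departments VALUES (1, 'Ops')")
    pk_rejected = False
except sqlite3.IntegrityError:
    pk_rejected = True

print(fk_rejected, pk_rejected)
```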
14. What is data visualization?
The graphical representation of data using charts, graphs, and dashboards to identify trends and patterns.
15. What visualization tools have you used?
Examples include:
- Tableau
- Power BI
- Matplotlib/Seaborn (Python)
- Excel
16. How do you identify outliers in a dataset?
- Using box plots, Z-scores, or IQR (Interquartile Range)
- Visual analysis through scatter plots
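As an illustration of the IQR approach, the sketch below flags values that fall more than 1.5 × IQR outside the quartiles. The `values` list is made-up data, and the 1.5 multiplier is the conventional rule of thumb rather than a fixed requirement.

```python
from statistics import quantiles

# Hypothetical measurements with one suspiciously large value.
values = [12, 14, 15, 15, 16, 17, 18, 19, 95]

# quantiles(..., n=4) returns the three quartile cut points: Q1, median, Q3.
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(outliers)
```

A flagged value is a candidate for review, not automatically an error; whether to keep, cap, or remove it depends on the domain.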
17. What is a pivot table?
A pivot table summarizes large datasets by aggregating data using rows, columns, and calculated values in Excel or similar tools.
18. What is a correlated subquery in SQL?
A subquery that references columns from the outer query, so it is logically evaluated once for each row the outer query processes (the optimizer may rewrite it as a join).
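A typical example is "employees who earn more than their department's average". The sketch below runs it with Python's sqlite3 module; the `employees` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ana', 'Sales', 50000),
        ('Ben', 'Sales', 70000),
        ('Cy',  'Ops',   60000),
        ('Dee', 'Ops',   40000);
""")

# The inner query refers to e.dept from the outer row, which makes it correlated.
above_dept_avg = conn.execute("""
    SELECT e.name
    FROM employees AS e
    WHERE e.salary > (
        SELECT AVG(e2.salary)
        FROM employees AS e2
        WHERE e2.dept = e.dept
    )
    ORDER BY e.name
""").fetchall()

print(above_dept_avg)
```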
19. What is the difference between UNION and UNION ALL?
- UNION: Combines results from two queries and removes duplicates.
- UNION ALL: Combines results without removing duplicates.
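A quick way to see the difference is to combine two lists that share one value. This sketch uses Python's sqlite3 module; the two customer tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE online_customers (email TEXT);
    CREATE TABLE store_customers (email TEXT);
    INSERT INTO online_customers VALUES ('a@example.com'), ('b@example.com');
    INSERT INTO store_customers VALUES ('b@example.com'), ('c@example.com');
""")

# UNION removes the duplicate 'b@example.com'.
union_rows = conn.execute("""
    SELECT email FROM online_customers
    UNION
    SELECT email FROM store_customers
    ORDER BY email
""").fetchall()

# UNION ALL keeps every row from both queries.
union_all_rows = conn.execute("""
    SELECT email FROM online_customers
    UNION ALL
    SELECT email FROM store_customers
    ORDER BY email
""").fetchall()

print(len(union_rows), len(union_all_rows))
```

Because UNION ALL skips the duplicate check, it is usually the cheaper choice when duplicates are impossible or acceptable.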
20. What is data mining?
The process of discovering patterns, correlations, and anomalies in large datasets using statistical and machine learning techniques.
21. Explain ETL and its stages.
ETL (Extract, Transform, Load) involves:
- Extracting data from different sources
- Transforming data into a suitable format
- Loading data into a data warehouse or database
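The three stages can be sketched end to end with the standard library. Everything here is illustrative: the CSV content, the `orders` table, and the cleaning rules (drop rows without an amount, upper-case the country code) are assumptions, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Extract: read rows from a source (a hypothetical CSV export held in memory).
raw_csv = io.StringIO(
    "order_id,amount,country\n"
    "1, 19.99 ,us\n"
    "2,5.00,DE\n"
    "3,,us\n"
)
rows = list(csv.DictReader(raw_csv))

# Transform: drop rows with no amount, convert types, standardize country codes.
clean = [
    (int(r["order_id"]), float(r["amount"]), r["country"].strip().upper())
    for r in rows
    if r["amount"].strip()
]

# Load: write the cleaned rows into the target table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean)
conn.commit()

loaded = conn.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```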
22. What is a hypothesis test?
A statistical procedure that uses sample data to decide whether there is enough evidence to reject a null hypothesis about a population (e.g., t-test, chi-square test).
23. How do you deal with multicollinearity?
- Removing highly correlated variables
- Using dimensionality reduction techniques (e.g., PCA)
- Regularization methods (e.g., Ridge, Lasso)
24. What is a time series analysis?
A technique used to analyze data points collected over time to identify trends, seasonality, and forecast future values.
25. What are key performance indicators (KPIs)?
Quantifiable metrics used to evaluate the success of an organization or specific activity (e.g., revenue growth, churn rate).
26. Explain the difference between supervised and unsupervised learning.
- Supervised learning: Uses labeled data to train models (e.g., regression, classification).
- Unsupervised learning: Identifies patterns in unlabeled data (e.g., clustering, dimensionality reduction).
27. What is a clustered index vs. a non-clustered index?
- Clustered index: Sorts and stores data rows based on key values (only one per table).
- Non-clustered index: Creates a separate structure for faster lookups (multiple per table).
28. What is the difference between variance and standard deviation?
- Variance: The average of the squared deviations from the mean, expressed in squared units.
- Standard deviation: The square root of the variance, expressed in the same units as the data, which makes it easier to interpret.
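A short worked example with Python's statistics module. The `data` list is made up, and the population formulas (`pvariance`, `pstdev`) are used here; for a sample, `variance` and `stdev` divide by n - 1 instead of n.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

avg = mean(data)

# Population variance: mean of the squared deviations from the mean.
manual_variance = sum((x - avg) ** 2 for x in data) / len(data)

# Library equivalents; the standard deviation is the square root of the variance.
var = pvariance(data)
std = pstdev(data)

print(avg, manual_variance, var, std)
```

The squared deviations sum to 32 over 8 values, so the variance is 4 and the standard deviation is 2.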
29. How would you optimize a slow SQL query?
- Using appropriate indexing
- Selecting only the columns you need instead of using SELECT *
- Optimizing joins
- Analyzing query execution plans
30. What is A/B testing?
A controlled experiment comparing two versions (A and B) to determine which performs better based on specific metrics.
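One common way to judge an A/B test on conversion rates is a two-proportion z-test. The sketch below is one possible approach, not the only valid one; the function name `two_proportion_z_test` and the visitor and conversion counts are hypothetical.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical experiment: 200/2000 conversions for A, 260/2000 for B.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print(round(z, 2), p_value < 0.05)
```

A small p-value suggests the difference is unlikely to be due to chance alone, assuming users were randomly assigned and the sample size was fixed in advance.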
31. What is the difference between a data warehouse and a data lake?
- Data warehouse: Stores structured data optimized for reporting and analytics.
- Data lake: Stores raw data in its native format, whether structured, semi-structured, or unstructured, for a variety of later uses.
32. Explain the difference between cross join and self join.
- Cross join: Produces a Cartesian product of two tables.
- Self join: Joins a table with itself.
33. What is a CTE (Common Table Expression)?
A named, temporary result set defined with a WITH clause. It exists only for the duration of a single SQL query and improves readability and reuse.
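For illustration, the query below names an aggregated result `region_totals` and then filters it, run through Python's sqlite3 module. The `sales` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('North', 100), ('North', 300),
        ('South', 150), ('South', 50);
""")

# The CTE gives the aggregation a name that the main query can read like a table.
big_regions = conn.execute("""
    WITH region_totals AS (
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region, total
    FROM region_totals
    WHERE total > 250
    ORDER BY region
""").fetchall()

print(big_regions)
```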
34. How do you calculate the correlation coefficient?
Using the Pearson formula directly, or built-in functions such as CORR() in SQL dialects that provide it (e.g., PostgreSQL, Oracle) or corr() in Python (Pandas).
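For reference, the Pearson coefficient can be computed by hand in a few lines. This is a minimal sketch in plain Python; the helper name `pearson_r` and the sample lists are made up for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either input is constant.
    return cov / sqrt(ss_x * ss_y)


hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]    # rises in step with hours
score_down = [10, 8, 6, 4, 2]  # falls in step with hours

print(pearson_r(hours, score_up), pearson_r(hours, score_down))
```

The result ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 indicating no linear relationship.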
35. What is a dimension table and a fact table?
- Dimension table: Contains descriptive attributes (e.g., product details).
- Fact table: Contains measurable metrics (e.g., sales data).
36. What is a surrogate key?
A system-generated key used to uniquely identify records, often replacing natural primary keys in data warehouses.
37. What is the difference between DELETE and TRUNCATE?
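- DELETE: Removes the rows that match an optional WHERE condition, logs each row, and can be rolled back within a transaction.
- TRUNCATE: Removes all rows without logging individual row deletions, so it is usually faster. Whether it can be rolled back depends on the DBMS: PostgreSQL and SQL Server allow it inside a transaction, while MySQL and Oracle commit it implicitly.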
38. What is data integrity?
The accuracy, consistency, and reliability of data throughout its lifecycle.
39. Explain the concept of normalization forms.
Normalization reduces data redundancy through multiple forms (1NF, 2NF, 3NF, BCNF, etc.).
40. What is a rolling average?
An average calculated over a defined time window that “rolls” forward, useful for smoothing time-series data.
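A minimal sketch of a simple (unweighted) rolling average in plain Python. The function name `rolling_average` and the sample values are hypothetical, and only full windows are reported.

```python
def rolling_average(values, window):
    """Return the mean of each full window of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[end - window + 1 : end + 1]) / window
        for end in range(window - 1, len(values))
    ]


# Hypothetical daily sales and a 3-day window.
daily_sales = [10, 20, 30, 40, 50]
smoothed = rolling_average(daily_sales, window=3)
print(smoothed)
```

The output has `window - 1` fewer points than the input because the first full window ends at the third value.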
41. What is a histogram?
A graphical representation of data distribution using bars to show frequency of data ranges.
42. What are some common data quality issues?
- Missing values
- Duplicates
- Inconsistent data formats
- Outliers
43. Explain the difference between JOIN and UNION.
- JOIN: Combines columns from multiple tables based on related keys.
- UNION: Combines rows from multiple queries.
44. What is the difference between WHERE and HAVING clauses?
- WHERE: Filters rows before aggregation.
- HAVING: Filters groups after aggregation.
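Both clauses can appear in the same query, which makes the order of operations visible. The sketch below uses Python's sqlite3 module; the `orders` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, status TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('Ana', 'paid', 80),
        ('Ana', 'paid', 40),
        ('Ana', 'refunded', 500),
        ('Ben', 'paid', 90),
        ('Ben', 'refunded', 30);
""")

# WHERE drops refunded rows first; HAVING then filters the per-customer totals.
big_spenders = conn.execute("""
    SELECT customer, SUM(amount) AS paid_total
    FROM orders
    WHERE status = 'paid'
    GROUP BY customer
    HAVING SUM(amount) > 100
    ORDER BY customer
""").fetchall()

print(big_spenders)
```

Ana's 500 refund never reaches the SUM because WHERE removed it before grouping, and Ben's paid total of 90 is removed by HAVING after grouping.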
45. How do you handle duplicate data?
- Using DISTINCT in SQL queries
- Removing duplicates via data cleaning (e.g., drop_duplicates() in Python)
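Without Pandas, the same idea can be sketched in plain Python. The `records` list is made-up data; the example separates exact duplicates from records that merely share a key, since the two cases usually need different handling.

```python
# Hypothetical (email, name) records with one exact duplicate
# and one record that repeats an email under a different name.
records = [
    ("ana@example.com", "Ana"),
    ("ben@example.com", "Ben"),
    ("ana@example.com", "Ana"),
    ("ben@example.com", "Benjamin"),
]

# Exact duplicates: dict.fromkeys keeps the first occurrence and preserves order.
unique_records = list(dict.fromkeys(records))

# Key-based duplicates: keep the first name seen for each email.
first_per_email = {}
for email, name in records:
    first_per_email.setdefault(email, name)

print(unique_records)
print(first_per_email)
```

Which record to keep when keys collide (first, latest, most complete) is a business decision, so it is worth stating the rule explicitly.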
46. What is logistic regression?
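A statistical model that estimates the probability of a binary outcome by applying the logistic (sigmoid) function to a linear combination of the input features. It is used for classification, and multinomial variants handle more than two classes.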
47. Explain the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column).
- COUNT(*): Counts all rows
- COUNT(column): Counts non-null values in a column
- COUNT(DISTINCT column): Counts unique non-null values
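The three forms give different answers as soon as a column contains NULLs or repeats. A minimal sketch with Python's sqlite3 module, using a hypothetical `visits` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (city TEXT);
    INSERT INTO visits VALUES ('Oslo'), ('Oslo'), (NULL), ('Rome');
""")

# One row, three counts: all rows, non-null cities, distinct non-null cities.
all_rows, non_null, distinct = conn.execute("""
    SELECT COUNT(*), COUNT(city), COUNT(DISTINCT city)
    FROM visits
""").fetchone()

print(all_rows, non_null, distinct)
```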
48. What is dimensionality reduction?
Reducing the number of features in a dataset using techniques like PCA or feature selection.
49. What is the difference between clustered and non-clustered columns in data visualization?
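This refers to how bars are arranged in a column chart. A clustered column chart places the bars for each series side by side within a category, which makes individual values easy to compare. A stacked (non-clustered) column chart places them on top of each other, which emphasizes each category's total.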
50. How do you ensure data accuracy in your reports?
- Thorough data cleaning
- Validation checks
- Cross-referencing data with trusted sources
- Peer review of analysis
Bonus Tips for Interviews:
- Be prepared to solve SQL queries and basic statistical problems.
- Practice explaining technical concepts in simple terms.
- Demonstrate problem-solving approaches with real-world examples.