In this updated version of the project, I explored and analyzed trending repositories on GitHub. I aimed to extract insights from the repositories' data, rank them based on several criteria, and design a schema for further analysis.
Here's a high-level overview of the workflow I followed in this project:
- Data Extraction: I fetched trending repositories' data from GitHub using web scraping techniques and the GitHub API.
- Data Transformation: The extracted data was cleaned, transformed, and organized into a structured format suitable for analysis.
- Schema Design: I designed a star schema that included dimension tables for users, repositories, time, and ranks, along with a fact table for repository information.
- Database Creation: I set up a PostgreSQL database to store the structured data using the designed star schema.
- Triggers and Functions: I implemented triggers and functions in PostgreSQL to handle the dynamic updates and insertions in the user dimension.
- Ranking Algorithm: I developed a Python function to calculate rank values for repositories based on customizable weights.
- Data Analysis: With the data in place, I conducted insightful analyses, such as identifying top repositories and understanding user behaviors.
- Documentation and Sharing: I documented the entire project to capture challenges, solutions, workflow, and outcomes. This documentation serves as a valuable reference for both personal reflection and sharing with others.
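To make the schema step more concrete, here is a minimal sketch of a star schema with the four dimensions and the fact table listed above, followed by a "top repositories" query. It is not the project's actual DDL: every table and column name is an assumption, the rows are made-up placeholders, and Python's built-in `sqlite3` stands in for PostgreSQL so the example runs without a database server. In PostgreSQL the column types and key generation would likely differ.

```python
import sqlite3

# Hypothetical table and column names; the project's real schema may differ.
SCHEMA = """
CREATE TABLE dim_user (
    user_id INTEGER PRIMARY KEY,
    login   TEXT NOT NULL UNIQUE
);
CREATE TABLE dim_repository (
    repo_id   INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL UNIQUE,
    language  TEXT
);
CREATE TABLE dim_time (
    time_id     INTEGER PRIMARY KEY,
    snapshot_on TEXT NOT NULL UNIQUE  -- ISO date of the trending snapshot
);
CREATE TABLE dim_rank (
    rank_id    INTEGER PRIMARY KEY,
    rank_value REAL NOT NULL
);
CREATE TABLE fact_repository (
    user_id INTEGER NOT NULL REFERENCES dim_user(user_id),
    repo_id INTEGER NOT NULL REFERENCES dim_repository(repo_id),
    time_id INTEGER NOT NULL REFERENCES dim_time(time_id),
    rank_id INTEGER NOT NULL REFERENCES dim_rank(rank_id),
    stars   INTEGER NOT NULL,
    forks   INTEGER NOT NULL,
    PRIMARY KEY (repo_id, time_id)
);
"""


def build_demo_db():
    """Create an in-memory database with the sketch schema and placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    # Made-up sample data, for illustration only.
    conn.executemany("INSERT INTO dim_user VALUES (?, ?)",
                     [(1, "alice"), (2, "bob")])
    conn.executemany("INSERT INTO dim_repository VALUES (?, ?, ?)",
                     [(1, "alice/demo-a", "Python"), (2, "bob/demo-b", "Go")])
    conn.execute("INSERT INTO dim_time VALUES (1, '2024-01-01')")
    conn.executemany("INSERT INTO dim_rank VALUES (?, ?)",
                     [(1, 0.4), (2, 0.9)])
    conn.executemany("INSERT INTO fact_repository VALUES (?, ?, ?, ?, ?, ?)",
                     [(1, 1, 1, 1, 120, 10), (2, 2, 1, 2, 300, 45)])
    conn.commit()
    return conn


def top_repositories(conn, limit=3):
    """Join the fact table to its dimensions and return the highest-ranked repos."""
    query = """
        SELECT r.full_name, u.login, k.rank_value
        FROM fact_repository AS f
        JOIN dim_repository AS r ON r.repo_id = f.repo_id
        JOIN dim_user       AS u ON u.user_id = f.user_id
        JOIN dim_rank       AS k ON k.rank_id = f.rank_id
        ORDER BY k.rank_value DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for row in top_repositories(demo):
        print(row)
```

The foreign keys on the fact table are what keep each fact row tied to existing dimension rows. With enforcement switched on, inserting a fact for an unknown user fails instead of silently creating an orphan.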
Throughout the project, I encountered several challenges that pushed me to think creatively and problem-solve effectively:
Solution: I used Python with BeautifulSoup for scraping and the GitHub API to gather and process repository and user information.

Schema Design and Database Management:
Challenge: Designing a star schema to organize data effectively and managing primary and foreign keys.
Solution: I carefully designed the schema with proper relationships between dimensions and the fact table, ensuring data integrity.

Trigger and Function Implementation:
Challenge: Implementing triggers and functions to update user information while avoiding recursive loops.
Solution: I crafted triggers and functions in PostgreSQL to ensure smooth updates in the user dimension while maintaining control over the insertion process.

Ranking Algorithm Development:
Solution: I created a Python function to calculate rank values using customizable weights and implemented the function in the PostgreSQL database.
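
The ranking step could look roughly like the sketch below. It is not the project's actual function: the metric names (`stars`, `forks`, `watchers`), the default weights, and the sample repositories are all assumptions chosen for illustration, and the score is assumed to be a plain weighted sum.

```python
# Hypothetical default weights; the project's real weights are not given here.
DEFAULT_WEIGHTS = {"stars": 0.5, "forks": 0.3, "watchers": 0.2}


def rank_value(metrics, weights=DEFAULT_WEIGHTS):
    """Return the weighted sum of the metrics named in `weights`.

    Metrics missing from `metrics` count as 0.
    """
    return sum(weight * metrics.get(name, 0) for name, weight in weights.items())


def rank_repositories(repos, weights=DEFAULT_WEIGHTS):
    """Return (name, score) pairs sorted from highest to lowest score."""
    scored = [(repo["name"], rank_value(repo, weights)) for repo in repos]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        {"name": "alice/demo-a", "stars": 10, "forks": 5},
        {"name": "bob/demo-b", "stars": 8, "forks": 12},
    ]
    # Integer weights keep the arithmetic easy to check by hand.
    print(rank_repositories(sample, {"stars": 2, "forks": 1}))
    # [('bob/demo-b', 28), ('alice/demo-a', 25)]
```

Passing the weights as a dictionary means the ranking criteria can change without touching the function: adding a metric is a new key, and dropping one is a removed key.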
- Python: pandas, psycopg2, PyGithub.
- PostgreSQL.