I would love to connect! Feel free to send me a message at my Gmail account.
I am currently a master's student in mathematics at CUHK and a full-time researcher at CUHK, building big data solutions with Machine Learning, Large Language Models, and Graph Neural Networks for mass historical archive (image + text) research.
- Graph Neural Networks
- Knowledge modeling, representation theory of knowledge, and Knowledge base schema design
- Knowledge mining and curation (for personal or multi-user use)
- Scene_graph_benchmark_nvidia: This repository provides a November-2024 rework of the Scene Graph Benchmark Docker container definition, aimed at tasks such as scene graph generation, object detection, and relationship detection. The current implementation is based on NVIDIA's PyTorch container with GPU acceleration, ensuring compatibility with CUDA and cuDNN.
- Envsync: envsync creates an apt-get-like experience for Python packages, automatically updating the contents of requirements.txt and synchronizing the virtual environment of any Python project via Git hooks.
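The Git-hook idea behind Envsync could be sketched as below. This is a minimal, hypothetical illustration (the function name and hook body are my own, not Envsync's actual implementation): a post-commit hook is written into `.git/hooks` so that every commit refreshes `requirements.txt` from the active environment.

```python
import os
import stat

# Hypothetical sketch of an Envsync-style hook: after each commit,
# regenerate requirements.txt from the active virtual environment.
HOOK_BODY = """#!/bin/sh
# Auto-generated (sketch): keep requirements.txt in sync with the venv.
pip freeze > requirements.txt
git add requirements.txt
"""

def install_post_commit_hook(repo_root):
    """Write the hook into .git/hooks and mark it executable."""
    hook_path = os.path.join(repo_root, ".git", "hooks", "post-commit")
    with open(hook_path, "w") as f:
        f.write(HOOK_BODY)
    # Git only runs hooks that have the execute bit set.
    os.chmod(hook_path, os.stat(hook_path).st_mode | stat.S_IXUSR)
    return hook_path
```

A real implementation would also guard against overwriting a user's existing hook.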
- Gites: The objective of Gites is to replicate the user experience of Google Drive or OneDrive within the context of Git commands. Batch actions such as push, pull, and clone are provided.
- Prog-for-humanists-web: A collection I created as a resource for senior-year humanities students at CUHK, helping them grow their programming skills in data engineering, database management, machine learning, natural language processing, and project deployment as part of a 3-credit university course. The course aims to give students a solid foundation for developing modern, impactful humanities projects in Python.
Display:
While my closed-source projects are not listed here, here's a glimpse of the projects I'm currently working on, categorized by development stage:
Bots
- Keyword-listening-discord-bot: This project represents my personal endeavor to create a Discord bot for my dedicated server. The bot is designed to monitor all messages within a specific Discord server, requiring the specification of a token and guild ID. Whenever it detects predefined commands or keywords in messages, it responds accordingly. In essence, my Discord bot serves as a helpful tool, providing instant access to information from a manual, aiding the coder members of the server with informative responses to their commands.
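The keyword-dispatch logic such a bot might use can be sketched as follows. The Discord wiring (token, guild ID, event handlers) is omitted, and the keywords and replies are purely illustrative, not the bot's actual manual content.

```python
# Illustrative keyword -> canned-reply table; the real bot's entries differ.
RESPONSES = {
    "deploy": "See the deployment section of the manual.",
    "lint": "Run the linter before pushing; the config lives in the repo root.",
}

def respond(message):
    """Return a canned reply if any known keyword appears in the message,
    otherwise None (the bot stays silent)."""
    lowered = message.lower()
    for keyword, reply in RESPONSES.items():
        if keyword in lowered:
            return reply
    return None
```

In the actual bot, a function like this would be called from the library's on-message event handler.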
- youtube-chatroom-response: An asynchronous YouTube chatroom response bot written in Python that allows users to customize the patterns matched in the chatroom and then responds automatically.
Finance
- yahoo-finance-scraper: This Python package uses the API from the yfin package to collect data from the Yahoo Finance website. Check out the snapshots in this repository's readme.md for more details.
- My TradingView profile: Check out my contributions to the trader community's script library, which have garnered over 2000 stars on TradingView.
Software engineering
- Software Engineering Toolbox: Part 1 is completed. It curates several folders of tools so that users can easily diagnose a target directory during software development.
Project management
- csv2gantt: Aims to provide a tool that converts a CSV data file into a Gantt chart.
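One way such a converter could work is to emit Mermaid's Gantt syntax, which Markdown renderers can display. This sketch assumes a three-column CSV layout (`task`, `start`, `duration`) of my own choosing; the real tool's input format and output target may differ.

```python
import csv
import io

def csv_to_mermaid_gantt(csv_text, title="Project"):
    """Convert CSV rows of (task, start, duration) into a Mermaid gantt chart.
    The column names are an assumed input format, for illustration only."""
    lines = ["gantt", "    title " + title, "    dateFormat YYYY-MM-DD"]
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines.append("    {} : {}, {}".format(row["task"], row["start"], row["duration"]))
    return "\n".join(lines)
```

Pasting the returned text into a Mermaid-aware renderer draws the chart.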
Large Language Models
- poe-langchain: Adapts the LLMs accessible on poe into LangChain's LLM ecosystem.
Software engineering
- dir2tree.py: Recursively prints text containing a tree structure of the folders and files in a given directory, giving an LLM or a human developer insight into the structure of a package.
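The core of such a script can be sketched with `os.walk`; this is my own minimal version, and the actual dir2tree.py output format may differ.

```python
import os

def dir2tree(root, indent="    "):
    """Return an indented text tree of the folders and files under `root`.
    A sketch of the dir2tree.py idea; not its exact output format."""
    lines = [os.path.basename(os.path.abspath(root)) + "/"]
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if rel != ".":
            lines.append(indent * depth + os.path.basename(dirpath) + "/")
        for name in sorted(filenames):
            lines.append(indent * (depth + 1) + name)
        dirnames.sort()  # deterministic traversal order
    return "\n".join(lines)
```

Feeding the returned string to an LLM is a cheap way to give it an overview of a package.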
Format change
- gpt2md.py: A Python script that converts the mathematical LaTeX format you can copy from ChatGPT into the LaTeX format that Markdown renderers (such as Obsidian) can display.
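The essence of that conversion is swapping delimiters: ChatGPT emits `\( \)` and `\[ \]`, while Obsidian-style Markdown expects `$ $` and `$$ $$`. A minimal sketch of the idea (gpt2md.py itself may handle more cases):

```python
import re

def gpt_latex_to_md(text):
    """Rewrite ChatGPT-style math delimiters \\( \\) and \\[ \\] into the
    $ / $$ delimiters that Markdown renderers such as Obsidian display.
    A sketch of the gpt2md.py idea, not its exact rules."""
    # Display math first, so \[ \] is not confused with inline math.
    text = re.sub(r"\\\[\s*(.+?)\s*\\\]", r"$$\1$$", text, flags=re.DOTALL)
    text = re.sub(r"\\\(\s*(.+?)\s*\\\)", r"$\1$", text, flags=re.DOTALL)
    return text
```

Edge cases such as escaped brackets inside the formula would need extra care in a full implementation.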
- mermaid2md.py: A Python script that automatically converts LLM-generated Mermaid output into a syntax-error-free Markdown format.
File migration
- Migrate_to_public_space: This Python script facilitates the management of two spaces: a creation space (a large and private area for work) and a publication space (a smaller area for selected items ready for publishing). Users can specify a list of items to migrate from the creation space to the publication space, streamlining the publishing process.
YouTube content download
- youtube_subtitle: A Python script that outputs a YouTube video's subtitles. Only the video_id is needed to run the script.
- downloadYT_whole.py: A Python script that downloads a video in one piece, without splitting. Only the videoID needs to be provided to run the script.
- downloadYT_chopping.py: A Python script that downloads a video and splits it into segments. Only the videoID needs to be provided to run the script.
Video editing
- compress_concate.py: A Python script that compresses and concatenates a series of videos into a single video.
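One common way to do this is to drive ffmpeg's concat demuxer from Python. The sketch below only builds the command rather than running it, and the function name and flag choices (x264, CRF) are my assumptions about the approach, not the script's actual internals.

```python
import os
import tempfile

def build_concat_command(video_paths, output_path, crf=28):
    """Build (but do not run) an ffmpeg command that concatenates the input
    videos and re-encodes them at a higher compression level.
    An illustrative sketch of what compress_concate.py could do."""
    # ffmpeg's concat demuxer reads its input list from a text file.
    list_file = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
    for path in video_paths:
        list_file.write("file '{}'\n".format(path))
    list_file.close()
    return [
        "ffmpeg", "-f", "concat", "-safe", "0", "-i", list_file.name,
        "-c:v", "libx264", "-crf", str(crf),  # higher CRF = smaller file
        output_path,
    ]

# Running it would look like:
#   subprocess.run(build_concat_command(["a.mp4", "b.mp4"], "out.mp4"), check=True)
```

Separating command construction from execution also makes the logic easy to test without ffmpeg installed.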
Scraping
- Home PC activity parser: An automatic home-PC user-activity parser that parses user activity into a text file for real-time streaming, or into a database for further analytics.
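The parsing step might look like the sketch below. The tab-separated `timestamp / app / window title` line format is an assumption of mine for illustration; the real parser targets whatever format the activity logger actually emits.

```python
import datetime

def parse_activity_line(line):
    """Parse one activity log line of the assumed form
    'YYYY-MM-DD HH:MM:SS<TAB>app<TAB>window title' into a dict.
    The line format is illustrative, not the tool's real schema."""
    timestamp, app, title = line.rstrip("\n").split("\t", 2)
    return {
        "time": datetime.datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S"),
        "app": app,
        "title": title,
    }
```

Records in this shape can be appended to a text file for streaming or inserted into a database for analytics.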
- JobsDB scraper: A Python script that scrapes job information from JobsDB, a popular recruitment website in Hong Kong. It also counts mentioned skillsets and computes simple statistics on the data collected.
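The skill-counting step can be sketched as a whole-word scan over the scraped descriptions. The skill list here is illustrative, and the scraping itself (HTTP requests, HTML parsing) is omitted.

```python
import re
from collections import Counter

SKILLS = ["python", "sql", "excel", "spark"]  # illustrative skill list

def count_skills(job_descriptions):
    """Count how many postings mention each skill (whole-word,
    case-insensitive), in the spirit of the scraper's simple statistics."""
    counts = Counter()
    for text in job_descriptions:
        lowered = text.lower()
        for skill in SKILLS:
            if re.search(r"\b" + re.escape(skill) + r"\b", lowered):
                counts[skill] += 1
    return counts
```

Whole-word matching avoids false positives such as counting "sequel" as "sql" would not, but "r" inside other words would, which is why single-letter skills need special handling.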
I have discovered a true passion for crafting articles that break down intricate concepts into easily digestible pieces. As a result, I am currently working on developing the following sets of materials for the public. Please remember to cite the source when using these materials.
- GPU-Environment-Windows-Linux-Docker: When I first ventured into setting up a GPU-accelerated environment with CUDA and PyTorch, the process proved to be daunting and time-consuming. This was due to a couple of factors:
- Choice Overload: With multiple setup methodologies available, it was unclear which combination of steps would yield a functional environment without trial and error.
- Component Complexity: The setup involves integrating components from various providers, each independently maintained and with often opaque error messaging, which complicates troubleshooting.
To help streamline this setup process, I've created a series of Jupyter notebooks that address different setup scenarios:
- CUDA-GPU Environment: Detailed guides for various CUDA-GPU environment configurations, tailored to circumvent common pitfalls.
- Docker Approach: Clear, step-by-step instructions for defining a Docker image, designed to be flexible and accommodate updates to underlying components.
- Demonstration Scripts: Practical examples demonstrating how to utilize Large Language Models (LLMs) in the configured environment, providing a quick start to harnessing their capabilities.
These resources aim to minimize setup headaches and get you started with a robust data science platform, whether you're working in Windows or Linux, and whether you prefer a native installation or a Dockerized solution.
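For the Docker route, the core idea can be sketched as a short Dockerfile layered on NVIDIA's PyTorch image. This is an illustrative config fragment, not the repository's actual Dockerfile; the image tag is an example and should match your driver version.

```dockerfile
# Illustrative sketch: start from NVIDIA's PyTorch container (CUDA/cuDNN
# already configured), then layer project dependencies on top.
# The tag below is an example; choose one compatible with your GPU driver.
FROM nvcr.io/nvidia/pytorch:24.01-py3
WORKDIR /workspace
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Sanity check that the container sees the GPU.
CMD ["python", "-c", "import torch; print(torch.cuda.is_available())"]
```

Pinning the base-image tag is what makes the setup reproducible when the underlying components update.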
- Multilingual-Semantic-Search-Course: Contextual embeddings generated by LLMs are the core technology behind semantic search. This short-course repository walks through the environment setup and the concepts involved, and demonstrates the key steps of using these functions.
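The scoring step at the heart of semantic search can be sketched independently of any model: documents are ranked by the cosine similarity of their embeddings to the query embedding. In the course the vectors come from a multilingual LLM; here they are placeholder lists, since the ranking itself is language-agnostic.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def search(query_vec, doc_vecs):
    """Return document indices ranked by similarity to the query."""
    return sorted(range(len(doc_vecs)),
                  key=lambda i: cosine(query_vec, doc_vecs[i]),
                  reverse=True)
```

With real embeddings, the same two functions rank documents across languages, because a multilingual model maps semantically similar text to nearby vectors.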
- Big-Data-Integration-and-Processing-Course: This repository serves as a comprehensive guide to mastering big data techniques, featuring:
- Software Setup: Detailed installation instructions for the required software
- Integration Tools: Introduction to MongoDB, a NoSQL database perfect for document integration, alongside strategies for effective data amalgamation.
- Processing Frameworks: Tutorials and examples on using Apache Spark, the leading platform for large-scale data processing.
- Final Project Demonstration: Demonstrates how to integrate, update, query, and delete data from a >1 TB textual corpus.
- BigdataMath: During my learning journey, I've found that I can create effective notes that simplify complex concepts, making them more accessible and comprehensible. Consequently, I'm planning to write a series of articles covering a variety of topics in the mathematics of big data, an area I have a strong interest in.
The content of these repositories is available for educational and informational purposes. While I encourage you to explore, learn from, and engage with the material, please respect the following terms:
- Access: The repositories are publicly accessible and open to everyone.
- Use: You are free to clone the repositories, run the Jupyter notebooks, and build upon the material for personal and educational use.
- Distribution: Redistribution of the original or modified content is allowed, provided that you give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests I endorse you or your use.
- Commercial Use: Commercial use of the content may be restricted depending on the license chosen. Please review the specific license of each repository to understand what is and isn't allowed.
- Contribution: If you wish to contribute to the repositories or have any suggestions, feel free to open an issue or a pull request in the respective repository.