This repository contains a reproduction of the paper *Password Guessing Using Random Forest*. The author proposes a set of new methods to translate PII (Personally Identifiable Information) into structures that perform well in classical machine learning models. I have implemented the main concepts of the paper and built an easy-to-use tool for training models, generating patterns, conducting guesses, and evaluating accuracy. This repo contributes:
- A GUI program exclusively for the PII-based targeted password guessing scenario
- A pre-trained model
If you are looking for more detail on the underlying logic and training process of this project, this article explains the algorithm, and the corresponding transcript is available here.
- Overview
- Features
- Prerequisites
- Usage
- Advanced Configuration
- Build from source
- License
- Contact
- Acknowledgements
- Appendix
- PII-based targeted password guessing
- A ready-to-use pre-trained model (get here), trained on a dataset of about 110,000 records
- Generate password patterns based on PII dataset
- Conduct password guesses for given personal information
- Support for training custom models on self-defined datasets
- Support for evaluating the accuracy of generated guesses
Clone this repo to your local machine and install the dependencies:
pip install -r requirements.txt
This project uses MySQL to store analysis data. The recommended way is to launch the prepared database in Docker:
docker-compose up -d
Then connect to mysql://root:[email protected]:3307/rfguess.
Or, if you prefer a custom database, import user.sql into your database manually; it will create all the data tables.
Launch the user interface:
python main.py
Run the executable and you will see the panel shown below:
There are three main modules in the user interface: Guess-Generator, Pattern-Generator and Model-Trainer.
First, obtain a trained model (either train one yourself in Model-Trainer or use the pre-trained model rfguess.clf). Then set a limit on the number of patterns to generate and start generating.
This module requires a pattern file (see Appendix for more detail) and the PII data of the target user. You can load a pattern file generated by Pattern-Generator or use the default pattern file.
- Fill in PII data: input the personal data of the target user or load it from a JSON file (format)
The training process for classical machine learning models is considerably more laborious than that of deep learning. The algorithms in this program use a MySQL database to store intermediate data structures while processing the original dataset. Fortunately, you only need a running MySQL server and a database URL to connect to; all the data structures are configured automatically.
- Connect to your database and import database structure
Import the SQL file (get here). Note that this script drops and recreates data tables that are already in use (you can check and modify the table names in Parse/Config.py).
If you started the MySQL server via docker-compose (the recommended way), user.sql has already been imported at startup, so you can skip this step.
- Load your PII dataset (.txt)
The PII dataset should be in CSV format and comply with the rules below:
- the first line gives the field names
- field names must fall into ['account', 'name', 'phone', 'idcard', 'email', 'password'], case-insensitive
- you can include any combination of the allowed fields, but name and password are mandatory
- each subsequent line contains one PII record
- fields within a line are separated by commas
- blank characters are ignored
A valid dataset looks like:
name, email, password, phone
张三, [email protected], zhangsan, 111122222
John, [email protected], 3333, 44444
Jason Harris, [email protected], 5555, 5555
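The rules above are easy to check programmatically. This is an illustrative validator, not the project's actual loader (the real parsing lives in the tool); it uses only the standard library:

```python
import csv
import io

ALLOWED = {"account", "name", "phone", "idcard", "email", "password"}
REQUIRED = {"name", "password"}

def validate_pii_csv(text: str) -> list[dict]:
    """Check a PII dataset against the rules above and return parsed rows."""
    reader = csv.reader(io.StringIO(text))
    rows = [[cell.strip() for cell in row] for row in reader if row]
    header = [h.lower() for h in rows[0]]
    unknown = set(header) - ALLOWED
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    missing = REQUIRED - set(header)
    if missing:
        raise ValueError(f"missing mandatory fields: {missing}")
    records = []
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            raise ValueError(f"line {line_no}: expected {len(header)} fields")
        records.append(dict(zip(header, row)))
    return records

sample = """name, email, password, phone
John, [email protected], 3333, 44444
"""
print(validate_pii_csv(sample))
```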
You can specify the character set of the target dataset in the Charset edit box.
Push the Load PII Data button and wait; after some processing, your dataset will be stored in the database.
- Analyze and process dataset
This step analyzes the PII dataset and produces intermediate data structures.
- Train model
This step trains a classifier model and dumps it into a .clf file.
- Evaluate accuracy
To evaluate the accuracy of a model, this step uses 50% of your dataset as the training set and the other 50% as the test set, generates a password dictionary for each PII record, and checks whether the correct password falls into the dictionary.
- Restore the status of the last run
Use the Update Status button to load the progress of the last run and check the status of each phase.
See more detailed configuration at Config.py.
Algorithm configuration
A Markov n-gram model is used in the main algorithm; you can set n via the pii_order parameter:
pii_order = 6
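To give an intuition for what the order means: an n-gram model predicts each next symbol from the previous n-1 symbols of context. A minimal sketch (order 3 for brevity; not the project's actual implementation) that counts such transitions over passwords:

```python
from collections import defaultdict

def ngram_counts(passwords, n):
    """Count n-gram transitions: n-1 symbols of context -> next symbol."""
    counts = defaultdict(lambda: defaultdict(int))
    for pw in passwords:
        # Pad with start/end markers so prefixes and endings are modeled too.
        padded = "\x02" * (n - 1) + pw + "\x03"
        for i in range(len(padded) - n + 1):
            ctx, nxt = padded[i : i + n - 1], padded[i + n - 1]
            counts[ctx][nxt] += 1
    return counts

counts = ngram_counts(["password", "pass1234"], n=3)
# After the context "pa", the model has only ever seen "s":
print(dict(counts["pa"]))  # {'s': 2}
```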
You can control the number of guesses with the following threshold, which is compared against the probability of the growing pattern: a pattern is adopted only if its probability is greater than the threshold. So the larger the threshold, the smaller the number of guesses, and vice versa. Note that you should not set the threshold excessively small (less than 1e-11), to avoid being overwhelmed by useless patterns.
general_generator_threshold = 1.2e-8
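The pruning rule can be illustrated with a small sketch. Here a pattern's probability is assumed to be the product of its per-step probabilities, and any branch that falls below the threshold is abandoned; the tag names and probabilities are made up for illustration:

```python
def grow_patterns(step_probs, threshold, max_len=3):
    """Enumerate tag sequences whose cumulative probability (product of
    per-step probabilities) stays above the threshold."""
    results = []
    stack = [((), 1.0)]
    while stack:
        pattern, prob = stack.pop()
        if pattern:
            results.append((pattern, prob))
        if len(pattern) == max_len:
            continue
        last = pattern[-1] if pattern else None
        for tag, p in step_probs.get(last, {}).items():
            if prob * p > threshold:  # the pruning rule described above
                stack.append((pattern + (tag,), prob * p))
    return results

# Hypothetical transition table: previous tag -> {next tag: probability}
step_probs = {
    None: {"N2": 0.5, "B1": 0.3},
    "N2": {"B5": 0.4, "END": 0.6},
    "B1": {"END": 1.0},
}
# A high threshold keeps only likely patterns; ("N2", "B5") is pruned
# because 0.5 * 0.4 = 0.2 falls below 0.25:
for pat, p in grow_patterns(step_probs, threshold=0.25):
    print(pat, round(p, 2))
```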
Database configuration
You can config the table names of database as you like:
class TableNames:
PII = "PII"
pwrepresentation = "pwrepresentation"
representation_frequency = "representation_frequency"
pwrepresentation_frequency = "pwrepresentation_frequency"
pwrepresentation_unique = "pwrepresentation_unique"
pwrepresentation_general = f"{pwrepresentation}_general"
representation_frequency_base_general = "representation_frequency_base_general"
representation_frequency_general = f"{representation_frequency}_general"
pwrepresentation_frequency_general = f"{pwrepresentation_frequency}_general"
pwrepresentation_unique_general = f"{pwrepresentation_unique}_general"
Classifier configuration
Tune the parameters of random forest by the following config:
class RFParams:
n_estimators = 30
criterion = 'gini'
min_samples_leaf = 10
max_features = 0.8
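These fields map one-to-one onto scikit-learn's RandomForestClassifier constructor arguments. A sketch of how such a classifier would be built from this config (the project wires this up itself; the dummy data here is only to show the object is usable):

```python
from sklearn.ensemble import RandomForestClassifier

# Mirror of the RFParams config above.
clf = RandomForestClassifier(
    n_estimators=30,      # number of trees in the forest
    criterion="gini",     # split quality measure
    min_samples_leaf=10,  # minimum samples required at a leaf node
    max_features=0.8,     # fraction of features considered per split
)

# Tiny dummy fit just to demonstrate the API:
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
y = [0, 0, 1, 1] * 5
clf.fit(X, y)
print(clf.predict([[1, 1]]))
```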
This project is written in Python 3.11. You can install the dependencies with pip:
pip install -r requirements.txt
And run the following command to launch the main window:
python main.py
This code is released under the MIT License. You are free to use, modify, distribute, or sell it under those terms.
Project Link: https://github.com/PadishahIII/RFGuess
Tag | Description |
---|---|
N1 | Full name |
N2 | Abbreviation of name |
N3 | Family name |
N4 | Given name |
N5 | First character of given name + family name |
N6 | First character of family name + given name |
N7 | Family name capitalized |
N8 | First character of family name |
N9 | Abbreviation of given name |
B1 | Birthday in YYYYMMDD |
B2 | MMDDYYYY |
B3 | DDMMYYYY |
B4 | MMDD |
B5 | YYYY |
B6 | YYYYMM |
B7 | MMYYYY |
B8 | YYMMDD |
B9 | MMDDYY |
B10 | DDMMYY |
A1 | Account |
A2 | Letter segment of account |
A3 | Digit segment of account |
E1 | Email prefix |
E2 | Letter segment of email |
E3 | Digit segment of email |
E4 | Email domain, e.g. qq, 163 |
P1 | Phone number |
P2 | First three digits of phone number |
P3 | Last four digits of phone number |
I1 | Id card number |
I2 | First three digits of idCard |
I3 | First six digits of idCard |
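As an illustration of how passwords decompose into these tags, here is a sketch deriving a few N*/B* values for one user. The extraction below is simplified and hypothetical ("Given Family" name order, birthday given as YYYYMMDD); the project's real tag extraction may differ:

```python
def tags_for(name: str, birthday: str) -> dict:
    """Derive an illustrative subset of the tag table for one user."""
    parts = name.lower().split()
    given, family = parts[0], parts[-1]   # assumption: "Given Family" order
    y, m, d = birthday[:4], birthday[4:6], birthday[6:]
    return {
        "N1": given + family,    # full name
        "N3": family,            # family name
        "N4": given,             # given name
        "N6": family[0] + given, # first char of family name + given name
        "B1": y + m + d,         # birthday in YYYYMMDD
        "B4": m + d,             # MMDD
        "B5": y,                 # YYYY
    }

t = tags_for("Jason Harris", "19900102")
# A password like "harris1990" is then representable as the tag sequence N3 + B5:
print(t["N3"] + t["B5"])  # harris1990
```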