Awais11227/Data_-Analysis

Data Wrangling

📖 Project Overview

This project works with a clinical trial dataset covering 500 patients, of whom 350 participated in a trial comparing two insulin treatments: Novodra (injectable) and Auralin (oral).
The dataset includes patient details, treatment records, HbA1c measurements, and adverse reactions.

The main goal of data wrangling here is to:

  • Clean and organize raw data
  • Handle missing or inconsistent values
  • Prepare data for analysis (statistical testing, visualization, reporting)

📂 Dataset Description

Patients Table (patients)

Contains demographic and baseline details:

  • Identifiers (patient_id, name, contact, address)
  • Demographics (sex, birthdate, age)
  • Measurements (weight, height, BMI)

Treatments Table (treatments, treatment_cut)

Tracks treatment progress and effectiveness:

  • Insulin doses (Auralin, Novodra)
  • HbA1c levels (start, end, change)

Adverse Reactions Table (adverse_reactions)

Logs reported side effects for both treatment groups.


🛠️ Data Wrangling Steps

The wrangling process will include:

  1. Loading Data – Import CSV/Excel files into Pandas
  2. Exploring Structure – Use .info(), .head(), .describe()
  3. Cleaning
    • Remove duplicates
    • Standardize column names
    • Handle missing values (imputation or removal)
  4. Transformations
    • Convert datatypes (e.g., birthdate → datetime, zip_code → string)
    • Calculate derived columns (e.g., age from birthdate, BMI categories)
  5. Merging Tables – Combine patients, treatments, and adverse reactions for complete analysis
  6. Validation – Ensure correct ranges (BMI, age ≥ 18, HbA1c values)
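As a rough illustration of steps 1 to 4, the sketch below loads a small inline CSV, standardizes column names, drops duplicates, converts datatypes, and derives age and a BMI category. The sample rows, the helper name `clean_patients`, the reference date, and the BMI cut-offs (the commonly used 18.5 / 25 / 30 thresholds) are all assumptions made for the example, not values taken from the trial data.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real patients file.
raw_csv = """Patient ID,Birthdate,Zip Code,BMI
1,1976-07-10,92390,19.6
2,1967-04-03,61812,19.2
2,1967-04-03,61812,19.2
3,1980-02-19,02110,31.5
"""


def clean_patients(df: pd.DataFrame, as_of: str = "2024-01-01") -> pd.DataFrame:
    """Standardize, de-duplicate, and enrich a patients-like table."""
    out = df.copy()

    # Standardize column names: "Zip Code" -> "zip_code".
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")

    # Remove exact duplicate rows.
    out = out.drop_duplicates().reset_index(drop=True)

    # Convert datatypes: birthdate -> datetime, zip_code -> 5-character string.
    out["birthdate"] = pd.to_datetime(out["birthdate"])
    out["zip_code"] = out["zip_code"].astype(str).str.zfill(5)

    # Derive age in whole years at an (assumed) reference date.
    ref = pd.Timestamp(as_of)
    month_day = out["birthdate"].dt.month * 100 + out["birthdate"].dt.day
    birthday_pending = (month_day > ref.month * 100 + ref.day).astype(int)
    out["age"] = ref.year - out["birthdate"].dt.year - birthday_pending

    # Derive a BMI category using assumed cut-offs.
    out["bmi_category"] = pd.cut(
        out["bmi"],
        bins=[0, 18.5, 25, 30, float("inf")],
        right=False,
        labels=["underweight", "normal", "overweight", "obese"],
    )
    return out


patients = pd.read_csv(io.StringIO(raw_csv))
cleaned = clean_patients(patients)
print(cleaned)
```

Reading zip codes as numbers drops leading zeros, which is why the sketch pads them back to five characters after converting to strings.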

Data Wrangling - Clinical Trial Dataset

📖 Project Overview

This project works with a clinical trial dataset of 500 patients, of whom 350 participated in a study comparing two insulin treatments: Novodra (injectable) and Auralin (oral).
The dataset includes patient demographics, treatment details, HbA1c levels, and reported adverse reactions.

The goal is to clean, transform, and prepare the data for analysis.


📂 Dataset Structure

🧑 Patients Table (patients)

  • patient_id → Unique patient ID
  • assigned_sex → Sex at birth (Male/Female)
  • given_name, surname → Patient names
  • address, city, state, zip_code, country → Mailing address (all patients are in the US)
  • contact → Phone number and email address
  • birthdate → Patient’s date of birth (only patients aged 18 or older were included)
  • weight, height, bmi → Body measurements (inclusion criterion: BMI 16–38)

💉 Treatments Table (treatments, treatment_cut)

  • given_name, surname → Patient identifiers
  • auralin → Baseline and final insulin doses for the Auralin group (in units, “u”)
  • novodra → Baseline and final insulin doses for the Novodra group (in units, “u”)
  • hba1c_start, hba1c_end → HbA1c levels at start and end (%)
  • hba1c_change → Change in HbA1c (start − end)
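Because hba1c_change is defined as start minus end, it can be recomputed to fill gaps and to cross-check recorded values. A minimal sketch, assuming the column names listed above and a few made-up rows:

```python
import pandas as pd

# Hypothetical rows; the second one is missing hba1c_change.
treatments = pd.DataFrame({
    "given_name": ["ann", "bo", "cy"],
    "surname": ["lee", "kim", "roe"],
    "hba1c_start": [7.63, 7.97, 7.65],
    "hba1c_end": [7.20, 7.58, 7.27],
    "hba1c_change": [0.43, None, 0.38],
})

# Recompute the change from its definition (start - end).
expected = (treatments["hba1c_start"] - treatments["hba1c_end"]).round(2)

# Fill missing values with the recomputed change.
treatments["hba1c_change"] = treatments["hba1c_change"].fillna(expected)

# Flag rows whose recorded change disagrees with the definition.
mismatch = treatments[(treatments["hba1c_change"] - expected).abs() > 0.005]
print(treatments)
print(f"{len(mismatch)} row(s) disagree with start - end")
```

The 0.005 tolerance is an arbitrary choice that absorbs floating-point noise at two decimal places.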

⚠️ Adverse Reactions Table (adverse_reactions)

  • given_name, surname → Patient identifiers
  • adverse_reaction → Reported side effect
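Since the adverse reactions table identifies patients by name rather than by patient_id, linking it back to the patients table means joining on given_name and surname. The sketch below is one possible approach; the sample names and reactions are invented, and the case-insensitive key (`name_key`) is an assumption in case the tables spell names with different casing.

```python
import pandas as pd

# Hypothetical sample rows for illustration only.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "given_name": ["Ann", "Bo", "Cy"],
    "surname": ["Lee", "Kim", "Roe"],
})
adverse_reactions = pd.DataFrame({
    "given_name": ["ann", "cy"],
    "surname": ["lee", "roe"],
    "adverse_reaction": ["headache", "nausea"],
})


def name_key(df: pd.DataFrame) -> pd.Series:
    """Build a case-insensitive join key from given_name and surname."""
    given = df["given_name"].str.strip().str.lower()
    family = df["surname"].str.strip().str.lower()
    return given + "|" + family


reactions_with_id = adverse_reactions.assign(
    name_key=name_key(adverse_reactions)
).merge(
    patients.assign(name_key=name_key(patients))[["patient_id", "name_key"]],
    on="name_key",
    how="left",
    validate="many_to_one",  # fails if two patients share the same name key
    indicator=True,          # adds a _merge column to spot unmatched rows
)

unmatched = reactions_with_id[reactions_with_id["_merge"] == "left_only"]
print(reactions_with_id[["patient_id", "adverse_reaction"]])
```

Joining on names is fragile when two patients share a name, which is why the sketch asks pandas to validate the key and keeps the merge indicator for inspection.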

🛠️ Data Wrangling Steps

  1. Load Data → Import CSV/Excel files into Pandas
  2. Explore → Use .info(), .head(), .describe()
  3. Clean → Remove duplicates, standardize names, handle missing values
  4. Transform → Convert datatypes, derive new columns (e.g., Age, BMI category)
  5. Merge → Combine patients, treatments, and adverse reactions
  6. Validate → Ensure correct ranges (Age ≥ 18, BMI 16–38, valid HbA1c values)
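The validation step can be sketched as a set of boolean checks, one per rule. The age and BMI limits come from the inclusion criteria above; the plausible HbA1c bounds (`HBA1C_MIN`, `HBA1C_MAX`) and the sample rows are assumptions for the example.

```python
import pandas as pd

# Assumed plausibility bounds for HbA1c (%); adjust to the study protocol.
HBA1C_MIN, HBA1C_MAX = 3.0, 20.0

# Hypothetical merged rows for illustration.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [45, 17, 60, 33],
    "bmi": [22.4, 30.1, 41.0, 16.0],
    "hba1c_start": [7.6, 7.9, 8.1, None],
})

# One boolean column per rule; missing values count as failures.
checks = pd.DataFrame({
    "age_ok": df["age"] >= 18,
    "bmi_ok": df["bmi"].between(16, 38),
    "hba1c_ok": df["hba1c_start"].between(HBA1C_MIN, HBA1C_MAX),
})

# Rows that break at least one rule.
invalid = df[~checks.all(axis=1)]
print(invalid)
```

Keeping the checks in their own table makes it easy to see which rule each flagged row failed before deciding whether to correct or drop it.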

🔍 Example Pandas Functions

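df.info()                  # column dtypes and non-null counts
df.describe()              # summary statistics for numeric columns
df.isna().sum()            # number of missing values per column
df.drop_duplicates()       # remove exact duplicate rows
df.fillna(value)           # replace missing values (value is a placeholder)
df.merge(other, on=keys)   # join with another table (other, keys are placeholders)
df.groupby(keys)           # group rows for aggregation (keys is a placeholder)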
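To show how a few of these functions might combine, the sketch below reshapes a treatments-like table so each row carries a single treatment label, then compares the mean HbA1c change per group. The sample rows, the dose string format, and the assumption that the unassigned treatment column is empty are all hypothetical.

```python
import pandas as pd

# Hypothetical rows; dose strings are placeholders in an assumed format.
treatments = pd.DataFrame({
    "given_name": ["ann", "bo", "cy", "di"],
    "surname": ["lee", "kim", "roe", "fox"],
    "auralin": ["41u - 48u", None, "33u - 36u", None],
    "novodra": [None, "40u - 45u", None, "39u - 36u"],
    "hba1c_change": [0.4, 0.5, 0.3, 0.4],
})

# Missing values per column, as a quick completeness check.
print(treatments.isna().sum())

# Reshape the two treatment columns into treatment/dose pairs and keep
# only the row for the treatment each patient actually received.
long = treatments.melt(
    id_vars=["given_name", "surname", "hba1c_change"],
    value_vars=["auralin", "novodra"],
    var_name="treatment",
    value_name="dose",
).dropna(subset=["dose"])

# Mean HbA1c change per treatment group.
summary = long.groupby("treatment")["hba1c_change"].mean()
print(summary)
```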
