Skip to content

Data Science Competition: “Screening and Diagnosis of Esophageal Cancer” by Mauna Kea. Top 12% Finalist.

Notifications You must be signed in to change notification settings

Jonas1312/ChallengeMKea

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

39 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Challenge Mauna Kea

Challenge "Screening and Diagnosis of esophageal cancer" by Mauna Kea

Challenge goals

The goal of this challenge is to build an image classifier to assist physicians in the screening and diagnosis of esophageal cancer. Such a tool would have a massive impact on patient management and patient lives.

Illustration

Illustration

Data description

There are 11161 images acquired from 61 patients to be classified as:

  • Class 0: Squamous Epithelium
  • Class 1: Intestinal Metaplasia
  • Class 2: Gastric Metaplasia
  • Class 3: Dysplasia/Cancer

The split between the training and the two test sets is 80%-10%-10%.

The training set is made of 9446 images, acquired from 44 patients:

  • Class 0: 1469 images
  • Class 1: 3177 images
  • Class 2: 1206 images
  • Class 3: 3594 images

The two test sets, Test_1 and Test_2, are as follow:

  • 893 images acquired from 10 patients
  • 822 images acquired from 7 patients

The total is 1715 images acquired from 17 patients.

We also know that if an image acquired from a patient is in the training set, there is no image acquired from the same patient in any test sets.

Metric

Non weighted Multiclass accuracy = (Number of images correctly classified) / (Number of images in the test set)

Goal: 99%

Baseline: 75% (simple CNN)

Score updates

Date Model LB score Rank Solution weight_name
06/07 First commit x x x
11/07 EfficientNet0 0.894 21/45 AdamW
MultiStepLR
simple data aug
efficientnet_acc=99.13_loss=0.00521_AdamW_ep=17_sz=224_wd=1e-05.pth
13/07 Ensemble (5):
DenseNet
SE-ResNext
InceptionResNetV2
EfficientNet0
0.918 15/45 SGD
Cosine annealing
warm restarts
5best.csv
15/07 Ensemble (3) 0.944 8/45 Pseudo-labeling
+ MixUp
ch_3best.csv
16/07 Ensemble (5) 0.949 7/45 Add 2 models 2ch_5best.csv
01/01/2020 Competition deadline x x x x

Final solution

  • Data preprocessing:

    • remove duplicates with dhash (9446 images -> 6973 images)
    • split 6973 left images in train/test/valid (80%/10%/10%)
    • No split by patient (I assumed it was not necessary since I had removed nearly identical images, maybe it's a bad idea...)
  • Class imbalance:

    • over-sampling during training
    • Weighted CE loss for testing
  • Training:

    • simple data aug (flips, random rotations)
    • resized to (224, 224), (299, 299) for InceptionResNetV2
    • Pretrained weights from imagenet
    • SGD
    • LR scheduler : cosine annealing + warm restarts (60 epochs with 3 periods of 20 epochs)
    • no regularization needed
  • Pseudo-labeling:

    • on test set (1715 images)
    • predict labels for each image
    • if all (or more than X%) images that belong to the same patient have the same prectited class => add patient to train set
  • Ensemble:

    • with probabilites from softmax

What didn't work

  • Focal loss
  • Adam/AdamW (converges fast but not as good as SGD)
  • Adabound/Amsbound

Things I would have liked to try if GPU weren't so expensive 😕

Leaderboard

Public Leaderboard Final (private) Leaderboard

About

Data Science Competition: “Screening and Diagnosis of Esophageal Cancer” by Mauna Kea. Top 12% Finalist.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published