Smart contract security has become one of the most critical problems in modern computing. Billions of dollars now flow through decentralized applications, yet the average Solidity contract is still handwritten, under-reviewed, and often deployed without meaningful automated analysis.
Analysis tools exist (Slither, Mythril, Scribble), but they are large, complex systems built by security professionals. What if you want something simpler? Something you can understand, modify, and extend yourself? Something you can integrate into custom pipelines or research experiments?
In this article, we build exactly that: a Python-based feature extraction system that reads Solidity code and transforms it into structured security signals.
This is not about building a full auditor. Instead, it’s about understanding how Python can learn to “read” Solidity, identify risky patterns, and produce features that can power heuristic scoring or machine learning.
Before you can detect vulnerabilities, classify risky code, or train an AI model to audit contracts, you need one thing:
Signal.
Classic vulnerabilities such as reentrancy, oracle manipulation, delegatecall misuse, and access-control bugs leave detectable fingerprints in code. These fingerprints become features.
Examples:
| Vulnerability | Possible Feature |
|---|---|
| Reentrancy | Presence of call{value:} or external calls before state updates |
| Access control bug | Setter functions lacking require(msg.sender == owner) |
| Oracle manipulation | Public state mutation without checks |
| Delegatecall injection | Literal use of delegatecall or proxy patterns |
| Expanded attack surface | High count of public / payable functions |
| Gas griefing | Unbounded loops over dynamic arrays |
Most real auditors use mental models:
- “This contract uses delegatecall. That’s dangerous unless this is a proxy.”
- “This function writes to storage but is publicly accessible.”
We can encode these intuitions into Python.
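As a rough illustration, the first intuition quoted above could be written as a pair of small predicates. The function names and the `looks_like_proxy` heuristic are assumptions made up for this sketch; a real proxy check would need far more context than two string tests.

```python
import re

def looks_like_proxy(source: str) -> bool:
    """Hypothetical heuristic: a proxy usually has a fallback and mentions an implementation."""
    has_fallback = re.search(r"\bfallback\s*\(", source) is not None
    mentions_impl = "implementation" in source.lower()
    return has_fallback and mentions_impl

def delegatecall_outside_proxy(source: str) -> bool:
    """Encodes: 'uses delegatecall, dangerous unless this is a proxy'."""
    return "delegatecall" in source and not looks_like_proxy(source)

snippet = "function run(address t, bytes memory d) public { t.delegatecall(d); }"
print(delegatecall_outside_proxy(snippet))
```

The point is only that an auditor's rule of thumb maps naturally onto a boolean function of the source text.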
The simplest possible way to get Solidity into Python is just to read the file:
```python
from pathlib import Path

def read_source(path):
    return Path(path).read_text(encoding="utf-8")
```
But this raw text means nothing yet. We need to transform it into features.
Regex is surprisingly effective for identifying dangerous low-level constructs. Each of these is a security smell:
- `delegatecall`
- `call.value`
- `tx.origin`
- `selfdestruct`

Simple structural counts are useful signals as well:
- number of `payable` functions
- number of `public` functions
- number of lines (complexity proxy)
Let’s build a feature extractor:
```python
import re
import hashlib  # used later for fingerprinting the source
from pathlib import Path

RISKY_KEYWORDS = [
    "delegatecall",
    "call.value",
    "tx.origin",
    "selfdestruct",
    "block.timestamp",
]

def extract_features_from_text(source: str) -> dict:
    lines = source.splitlines()
    n_lines = len(lines)
    # These count keyword occurrences, which approximates (but does not
    # equal) the number of payable/public functions.
    n_payable = len(re.findall(r"\bpayable\b", source))
    n_public = len(re.findall(r"\bpublic\b", source))
    features = {
        "n_lines": n_lines,
        "n_payable": n_payable,
        "n_public": n_public,
    }
    # One binary flag per keyword, e.g. "call.value" -> "has_call_value".
    for kw in RISKY_KEYWORDS:
        features[f"has_{kw.replace('.', '_')}"] = 1 if kw in source else 0
    return features
```
This already detects:
- large contracts
- payable-heavy contracts
- `delegatecall` → proxy or exploit
- `tx.origin` → broken access control
- value transfer patterns
You’re now performing a simplified version of the pattern matching that dedicated analysis tools run as an early pass.
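One weakness of plain substring matching is that a keyword inside a comment still sets a flag. A possible mitigation, sketched below, is to strip comments before extracting features. The `strip_comments` helper is an assumption of this sketch and not part of the extractor above; it also ignores string literals, so a `//` inside a quoted URL would be cut off incorrectly.

```python
import re

# Matches // line comments and /* block comments */ (non-greedy, across lines).
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)

def strip_comments(source: str) -> str:
    """Remove Solidity comments so keywords inside them are not counted."""
    return COMMENT_RE.sub("", source)

sample = """
// never use tx.origin for auth
contract Safe {
    /* delegatecall is not used here */
    address owner;
}
"""
cleaned = strip_comments(sample)
print("tx.origin" in sample, "tx.origin" in cleaned)
```

Running the extractor on `cleaned` instead of the raw text would avoid flagging this contract for keywords it only mentions in comments.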
Instead of immediately applying machine learning, we start with a heuristic scoring engine that mirrors how human auditors think.
Example scoring logic:
```python
def compute_risk(features):
    score = 0
    if features["has_delegatecall"]:
        score += 50
    if features["has_tx_origin"]:
        score += 40
    if features["has_call_value"]:
        score += 30
    if features["n_payable"] > 3:
        score += 25
    elif features["n_payable"] > 0:
        score += 5
    if features["n_lines"] > 300:
        score += 15
    elif features["n_lines"] > 100:
        score += 5
    score = min(100, score)
    if score <= 20:
        level = "Low"
    elif score <= 60:
        level = "Medium"
    else:
        level = "High"
    return score, level
```
This allows Python to:
- identify highly dangerous contracts
- classify contracts into risk buckets
- detect unsafe code without running it
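A score is easier to trust when you can see which rules produced it. The sketch below reuses the same weights as the scoring logic above but returns each contribution with a reason; `explain_risk` and the example feature values are hypothetical and exist only to show the arithmetic.

```python
def explain_risk(features: dict) -> list:
    """Return (points, reason) pairs using the same weights as the scoring logic."""
    f = features.get  # missing features are treated as 0
    reasons = []
    if f("has_delegatecall", 0):
        reasons.append((50, "delegatecall present"))
    if f("has_tx_origin", 0):
        reasons.append((40, "tx.origin present"))
    if f("has_call_value", 0):
        reasons.append((30, "call.value present"))
    if f("n_payable", 0) > 3:
        reasons.append((25, "more than 3 payable occurrences"))
    elif f("n_payable", 0) > 0:
        reasons.append((5, "at least one payable occurrence"))
    if f("n_lines", 0) > 300:
        reasons.append((15, "more than 300 lines"))
    elif f("n_lines", 0) > 100:
        reasons.append((5, "more than 100 lines"))
    return reasons

# Made-up feature values for illustration only.
example = {"has_delegatecall": 1, "n_payable": 1, "n_lines": 120}
reasons = explain_risk(example)
total = min(100, sum(points for points, _ in reasons))
for points, reason in reasons:
    print(f"+{points:>2}  {reason}")
print("total:", total)
```

For this example the contributions are 50 + 5 + 5 = 60, which sits at the top of the "Medium" bucket under the thresholds above.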
Each contract is hashed:
```python
def hash_source(source):
    return hashlib.sha256(source.encode()).hexdigest()
```
This gives you a unique fingerprint for each Solidity file. It allows:
- caching analyses
- tracking versions
- linking risk results to a specific source
- storing assessments in a database or blockchain
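As one example of the caching use case, the hash can serve as a dictionary key so that unchanged sources are never analyzed twice. This is a minimal sketch: the in-memory `_cache` and the `analyze_cached` wrapper are assumptions, and a real pipeline would likely persist results to disk or a database.

```python
import hashlib

_cache = {}  # hypothetical in-memory store: source hash -> analysis result

def analyze_cached(source: str, analyze) -> dict:
    """Run `analyze` only if this exact source text has not been seen before."""
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = analyze(source)
    return _cache[key]

calls = []

def fake_analyze(source: str) -> dict:
    """Stand-in analysis that records how often it is invoked."""
    calls.append(source)
    return {"n_lines": len(source.splitlines())}

src = "contract A {}\n"
analyze_cached(src, fake_analyze)
analyze_cached(src, fake_analyze)  # served from the cache
print(len(calls))
```

Any change to the source, even a single whitespace character, produces a different hash and therefore a fresh analysis.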
A final CLI glues everything together:
```
python src/cli.py --file data/examples/high_risk_delegatecall.sol
```
Produces:
```json
{
  "source_hash": "…",
  "features": { … },
  "risk_score": 90,
  "risk_level": "High"
}
```
This is a complete static-analysis pipeline.
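The CLI source itself is not shown here, so the following is only a guess at its shape: an `argparse` entry point that reads the file, hashes it, and prints a JSON report. The `analyze` function is a deliberately simplified stand-in for the feature extraction and scoring steps, and `build_report` and `main` are names invented for this sketch.

```python
import argparse
import hashlib
import json
import tempfile
from pathlib import Path

def analyze(source: str) -> dict:
    """Simplified stand-in for feature extraction plus scoring."""
    features = {
        "n_lines": len(source.splitlines()),
        "has_delegatecall": int("delegatecall" in source),
    }
    score = 50 if features["has_delegatecall"] else 0
    level = "Low" if score <= 20 else "Medium" if score <= 60 else "High"
    return {"features": features, "risk_score": score, "risk_level": level}

def build_report(path: str) -> dict:
    """Read a file and combine its hash with the analysis result."""
    source = Path(path).read_text(encoding="utf-8")
    report = {"source_hash": hashlib.sha256(source.encode("utf-8")).hexdigest()}
    report.update(analyze(source))
    return report

def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Score a Solidity file")
    parser.add_argument("--file", required=True, help="path to a .sol file")
    args = parser.parse_args(argv)
    print(json.dumps(build_report(args.file), indent=2))
    return 0

# Demo: write a throwaway contract and run the entry point on it.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.sol"
    demo.write_text(
        "contract P { function f(address t, bytes memory d) public { t.delegatecall(d); } }",
        encoding="utf-8",
    )
    main(["--file", str(demo)])
```

In a standalone script, the demo at the bottom would be replaced by a call to `main()` with no arguments, which makes `argparse` read the real command line.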
Feed Python various Solidity snippets and watch the signals react.
```solidity
target.delegatecall(data);
```
Python flags:
`has_delegatecall = 1`
Risk score spikes.
```solidity
require(tx.origin == owner);
```
Python flags:
`has_tx_origin = 1`
Immediate medium/high risk.
```solidity
(bool ok, ) = msg.sender.call{value: amount}("");
```
The `call.value` keyword check doesn’t catch this newer syntax, so we extend the extractor with an extra check:
```python
if "call{value:" in source.replace(" ", ""):
    features["has_reentrancy_pattern"] = 1
```
Python now detects reentrancy fingerprints.
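A single regular expression can cover both the older `.call.value(...)` form and the newer `.call{value: ...}` form, including cases where other options such as `gas` appear inside the braces. The pattern and the `has_value_call` name below are assumptions of this sketch. A match is only a fingerprint worth reviewing, not proof of a reentrancy bug.

```python
import re

# Old style: addr.call.value(x)(...)    New style: addr.call{value: x}(...)
VALUE_CALL_RE = re.compile(r"\.call\s*(?:\.value\s*\(|\{[^}]*\bvalue\s*:)")

def has_value_call(source: str) -> bool:
    """True if the source contains a low-level call that forwards ether."""
    return VALUE_CALL_RE.search(source) is not None

print(has_value_call('(bool ok, ) = msg.sender.call{value: amount}("");'))
print(has_value_call("msg.sender.call.value(amount)();"))
print(has_value_call("token.transfer(to, amount);"))
```

The first two lines print `True` and the last prints `False`, since a plain token transfer does not use a low-level value call.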
Once you extract features, the next step is obvious:
- Build a dataset:
  - Label each `.sol` file as low/medium/high risk
  - Extract features programmatically
- Train: `RandomForestClassifier().fit(X, y)`
- Predict risk automatically.
This turns Python into an AI-powered lightweight auditor.
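Before any classifier can be trained, the feature dictionaries have to become rows with a fixed column order. The sketch below shows one way to build `X` and `y` in plain Python. The column list, the label encoding, and the two sample contracts are invented for illustration and are not real data.

```python
# Fixed column order so every row has the same layout.
FEATURE_ORDER = ["n_lines", "n_payable", "n_public", "has_delegatecall", "has_tx_origin"]
LABELS = {"low": 0, "medium": 1, "high": 2}

def to_row(features: dict) -> list:
    """Convert a feature dict to a list, filling missing features with 0."""
    return [features.get(name, 0) for name in FEATURE_ORDER]

# Made-up (features, label) pairs standing in for a labeled corpus.
samples = [
    ({"n_lines": 40, "n_payable": 0, "n_public": 2}, "low"),
    ({"n_lines": 350, "n_payable": 4, "n_public": 9, "has_delegatecall": 1}, "high"),
]

X = [to_row(features) for features, _ in samples]
y = [LABELS[label] for _, label in samples]
print(X)
print(y)
```

Lists of this shape can be passed directly as the `X` and `y` arguments of the `fit` call mentioned above, although a useful model needs far more than two labeled contracts.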
- Regex = fast and simple
- AST = accurate and powerful
Future upgrades:
- Use Slither programmatically
- Use a Python port of solidity-parser-antlr (for example, python-solidity-parser)
- Extract:
  - function graph
  - call graph
  - state mutation patterns
  - protected/unprotected setters
  - role-based access control detection
This is how professional auditing tools work internally.
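To see why a parsed representation matters, it helps to try the "unprotected setter" check with text tools alone. The sketch below splits functions out with a regex plus brace counting and flags public, non-view functions that assign something without referring to `msg.sender` or an `onlyOwner` modifier. Every name here is hypothetical, and the approach is deliberately crude: it treats local variable assignments as writes and is confused by braces inside strings or comments, which is exactly the kind of error an AST avoids.

```python
import re

# Captures: function name, then everything between the parameter list and the body.
FUNC_RE = re.compile(r"\bfunction\s+(\w+)\s*\([^)]*\)([^{;]*)\{")

def function_bodies(source: str):
    """Yield (name, header, body) for each function that has a body."""
    for m in FUNC_RE.finditer(source):
        depth, i = 1, m.end()
        while i < len(source) and depth > 0:  # walk to the matching brace
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
            i += 1
        yield m.group(1), m.group(2), source[m.end():i - 1]

def unprotected_setters(source: str) -> list:
    flagged = []
    for name, header, body in function_bodies(source):
        is_open = re.search(r"\b(public|external)\b", header) is not None
        is_readonly = re.search(r"\b(view|pure)\b", header) is not None
        guarded = "onlyOwner" in header or "msg.sender" in body
        # An '=' that is not part of ==, !=, <=, >= (compound forms like += count).
        writes = re.search(r"(?<![=!<>])=(?!=)", body) is not None
        if is_open and not is_readonly and writes and not guarded:
            flagged.append(name)
    return flagged

sample = """
contract Vault {
    address owner;
    uint256 fee;
    function setFee(uint256 f) public { fee = f; }
    function setOwner(address o) public { require(msg.sender == owner); owner = o; }
    function getFee() public view returns (uint256) { return fee; }
}
"""
print(unprotected_setters(sample))
```

On this sample only `setFee` is flagged, because `setOwner` checks `msg.sender` and `getFee` is read-only.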
Teaching Python to “read” Solidity is easier than you think, but more powerful than it appears. With just:
- raw text
- some regex
- simple heuristics
- proper feature engineering
you can build a functioning static analyzer capable of flagging dangerous patterns before deployment.
This project is the perfect foundation for:
- blockchain ML research
- educational security tooling
- automated CI security pipelines
- smart contract QA systems
- future open-source security tools
Python doesn’t just read Solidity; it learns to understand it.