Showing 5 changed files with 407 additions and 0 deletions.
@@ -1,5 +1,11 @@

```python
from .job import ScannerJob
from .plugin import Plugin, SendPromptsPlugin
from .report import Issue, IssueCategory

__all__ = [
    "ScannerJob",
    "Plugin",
    "SendPromptsPlugin",
    "Issue",
    "IssueCategory",
]
```
@@ -0,0 +1,308 @@
## Classifiers

The `Classifiers` module in AISploit provides a framework for building and using classifiers to score inputs against defined criteria. Classifiers are essential components in red teaming and security testing tasks, where they are used to evaluate the characteristics or properties of input data.

### Overview

A classifier in AISploit is represented by a class that inherits from the `BaseClassifier` abstract base class, which defines the interface all classifiers must implement. A specialization for text classifiers is provided by the `BaseTextClassifier` class.

### How Classifiers Work

Classifiers in AISploit analyze input data and assign a score based on predefined criteria. The `score()` method is the main entry point: it takes the input data and, optionally, a list of reference inputs, and returns a `Score` object representing the score of the input. As the `vars(score)` outputs below show, a `Score` carries a `flagged` boolean, a classifier-specific `value`, a `description`, and an `explanation`.
### Creating Custom Classifiers

To create a custom classifier, define a new class that inherits from `BaseClassifier` or `BaseTextClassifier`, depending on the nature of the input data, and implement the `score()` method with your scoring logic. The cell below fills in an illustrative rule (an assumption for demonstration purposes; the `Score` fields follow the outputs shown later in this notebook):
```python
from typing import Any, List

from aisploit.core import BaseClassifier, Score

class CustomClassifier(BaseClassifier[float, Any]):
    def score(self, input: Any, references: List[Any] | None = None) -> Score[float]:
        # Illustrative scoring logic (an assumption for this example):
        # flag any input that mentions the word "password".
        flagged = "password" in str(input).lower()
        return Score(
            flagged=flagged,
            value=1.0 if flagged else 0.0,
            description="Flags inputs that mention the word 'password'",
            explanation=f"Input was {'flagged' if flagged else 'not flagged'}",
        )
```
### Example Usage

Here's an example of how to use a classifier in AISploit:
```python
# Create an instance of the classifier
classifier = CustomClassifier()

# Score an input
score = classifier.score(input="Hello World")
```
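The examples below inspect results with `vars(score)`. With the illustrative `CustomClassifier` above, the result would look roughly like this (expected values are an assumption based on the sketch's rule, since "Hello World" does not mention "password"):

```python
vars(score)
# {'flagged': False,
#  'value': 0.0,
#  'description': "Flags inputs that mention the word 'password'",
#  'explanation': 'Input was not flagged'}
```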
This notebook gives an overview of how classifiers work in AISploit, explains how to create custom classifiers, and shows how to use them in security testing tasks.

### Ready-to-Use Classifiers

AISploit also provides several ready-to-use classifiers that cover common use cases. These can be instantiated directly and used in your projects:
#### SelfSimilarityClassifier

The `SelfSimilarityClassifier` is a text classifier based on self-similarity using cosine similarity scores. It computes the cosine similarity between the input text and reference texts, and flags the input when the aggregated similarity falls below a threshold (see the `description` field in the output below).
```python
from aisploit.classifiers import SelfSimilarityClassifier

# What's the name of the Voodoo Lady in the Monkey Island games?
classifier = SelfSimilarityClassifier()
score = classifier.score(
    "The Voodoo Lady's name is Elaine Marley.",
    [
        "The Voodoo Lady's name is never revealed in the games, and she is simply referred to as the Voodoo Lady.",
    ],
)

vars(score)
```

Output:

```
{'flagged': True,
 'value': {'aggregated_score': 0.6721476912498474,
  'scores': [0.6721476912498474]},
 'description': 'Returns True if the aggregated cosine similarity score is less than the threshold',
 'explanation': 'The aggregated cosine similarity score for the input is 0.6721476912498474'}
```
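Acting on the result is straightforward; here is a hedged usage sketch (the `aggregated_score` key follows the `vars(score)` output above):

```python
# Gate a hallucination check on the flagged result from the cell above.
if score.flagged:
    similarity = score.value["aggregated_score"]
    print(f"Answer diverges from the reference (cosine similarity {similarity:.2f})")
```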
#### PresidioAnalyserClassifier

The `PresidioAnalyserClassifier` is a text classifier that uses the Presidio Analyzer to detect Personally Identifiable Information (PII) entities in text.
```python
from aisploit.classifiers.presidio import PresidioAnalyserClassifier

classifier = PresidioAnalyserClassifier()
score = classifier.score("My name is John Doo and my phone number is 212-555-5555")

vars(score)
```

Output:

```
{'flagged': True,
 'value': [type: PERSON, start: 11, end: 19, score: 0.85,
  type: PHONE_NUMBER, start: 43, end: 55, score: 0.75],
 'description': 'Returns True if entities are found in the input',
 'explanation': 'Found 2 entities in input'}
```
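The entries in `score.value` appear to be Presidio `RecognizerResult` objects (an assumption based on the repr above), so the matched spans can be read back out of the input text:

```python
# Read the detected spans back out of the input text. The attribute
# names (entity_type, start, end, score) are assumed from Presidio's
# RecognizerResult, matching the repr shown above.
text = "My name is John Doo and my phone number is 212-555-5555"
for entity in score.value:
    print(entity.entity_type, repr(text[entity.start:entity.end]), entity.score)
# PERSON 'John Doo' 0.85
# PHONE_NUMBER '212-555-5555' 0.75
```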
#### ComprehendPIIClassifier

The `ComprehendPIIClassifier` is a text classifier that uses Amazon Comprehend to detect Personally Identifiable Information (PII) entities in text.
```python
from aisploit.classifiers.amazon import ComprehendPIIClassifier

classifier = ComprehendPIIClassifier()
score = classifier.score("My name is John Doo and my phone number is 212-555-5555")

vars(score)
```

Output:

```
{'flagged': True,
 'value': [{'Score': 0.9999950528144836,
   'Type': 'NAME',
   'BeginOffset': 11,
   'EndOffset': 19},
  {'Score': 0.9999926090240479,
   'Type': 'PHONE',
   'BeginOffset': 43,
   'EndOffset': 55}],
 'description': 'Returns True if entities are found in the input',
 'explanation': 'Found 2 entities in input'}
```
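Because `score.value` holds plain dicts with `Type`, `BeginOffset`, and `EndOffset` keys (as in the output above), one natural follow-up is redacting the detected spans; a minimal sketch:

```python
# Redact detected PII using the offsets from the output above.
# Spans are replaced right-to-left so earlier offsets stay valid.
text = "My name is John Doo and my phone number is 212-555-5555"
for entity in sorted(score.value, key=lambda e: e["BeginOffset"], reverse=True):
    text = text[:entity["BeginOffset"]] + f"[{entity['Type']}]" + text[entity["EndOffset"]:]
print(text)  # My name is [NAME] and my phone number is [PHONE]
```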
#### ComprehendToxicityClassifier

The `ComprehendToxicityClassifier` is a text classifier that leverages Amazon Comprehend to detect toxic content in text.
```python
from aisploit.classifiers.amazon import ComprehendToxicityClassifier

classifier = ComprehendToxicityClassifier()
score = classifier.score("I will kill you")

vars(score)
```

Output:

```
{'flagged': True,
 'value': {'Toxicity': 0.8208000063896179,
  'Labels': [{'Name': 'PROFANITY', 'Score': 0.19329999387264252},
   {'Name': 'HATE_SPEECH', 'Score': 0.2694000005722046},
   {'Name': 'INSULT', 'Score': 0.2587999999523163},
   {'Name': 'GRAPHIC', 'Score': 0.19329999387264252},
   {'Name': 'HARASSMENT_OR_ABUSE', 'Score': 0.18960000574588776},
   {'Name': 'SEXUAL', 'Score': 0.21789999306201935},
   {'Name': 'VIOLENCE_OR_THREAT', 'Score': 0.9879999756813049}]},
 'description': 'Returns True if the overall toxicity score is greater than or equal to the threshold',
 'explanation': 'The overall toxicity score for the input is 0.8208000063896179'}
```
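The per-label scores in `score.value["Labels"]` (see the output above) can be filtered down to the dominant categories; a small sketch with an arbitrary 0.5 cutoff:

```python
# List toxicity labels above a cutoff, using the 'Labels' structure
# from the output above. The 0.5 cutoff is chosen for illustration.
for label in score.value["Labels"]:
    if label["Score"] >= 0.5:
        print(label["Name"], round(label["Score"], 3))
# VIOLENCE_OR_THREAT 0.988
```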
#### ModerationClassifier

The `ModerationClassifier` is a text classifier that uses OpenAI's Moderation API to detect toxic content in text.
```python
from aisploit.classifiers.openai import ModerationClassifier

classifier = ModerationClassifier()
score = classifier.score("I will kill you")

vars(score)
```

Output:

```
{'flagged': True,
 'value': Moderation(categories=Categories(harassment=True, harassment_threatening=True,
   hate=False, hate_threatening=False, self_harm=False, self_harm_instructions=False,
   self_harm_intent=False, sexual=False, sexual_minors=False, violence=True,
   violence_graphic=False, self-harm=False, sexual/minors=False, hate/threatening=False,
   violence/graphic=False, self-harm/intent=False, self-harm/instructions=False,
   harassment/threatening=True), category_scores=CategoryScores(
   harassment=0.4573294222354889, harassment_threatening=0.35159170627593994,
   hate=0.0006792626227252185, hate_threatening=4.232471837894991e-05,
   self_harm=4.82136874779826e-06, self_harm_instructions=3.341407150969644e-08,
   self_harm_intent=2.0083894014533143e-06, sexual=4.86759927298408e-05,
   sexual_minors=1.9414277119267354e-07, violence=0.9988717436790466,
   violence_graphic=1.050253467838047e-05, self-harm=4.82136874779826e-06,
   sexual/minors=1.9414277119267354e-07, hate/threatening=4.232471837894991e-05,
   violence/graphic=1.050253467838047e-05, self-harm/intent=2.0083894014533143e-06,
   self-harm/instructions=3.341407150969644e-08,
   harassment/threatening=0.35159170627593994), flagged=True),
 'description': 'Moderation score for the given input',
 'explanation': 'Details about the moderation score'}
```
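Since `value` is the OpenAI moderation result object, individual categories can be checked directly; a hedged sketch (attribute names follow the repr above):

```python
# Check specific moderation categories on the result from the cell
# above. Attribute names follow the repr shown in the output.
moderation = score.value
print(moderation.flagged)                   # True
print(moderation.categories.violence)      # True
print(moderation.category_scores.violence) # 0.9988717436790466
```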
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "aisploit-MWVof28N-py3.12", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.12.2" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
@@ -4,5 +4,7 @@

```rst
   :hidden:
   :maxdepth: 1

   classifiers
   scanner
   api/index
```