{ "cells": [ { "cell_type": "markdown", "id": "f3ff45d6-f5c4-44fb-aef7-a9edc0596a90", "metadata": {}, "source": [ "# PublicDatasets (Analyser)" ] }, { "cell_type": "markdown", "id": "df4b8014-f228-46f3-a745-bfb4e63cbc60", "metadata": {}, "source": [ "## 1. Re-reading" ] }, { "cell_type": "code", "execution_count": 1, "id": "17433dfb-a29b-4374-9112-0ad4afa2224a", "metadata": {}, "outputs": [], "source": [ "special_separator = '___'" ] }, { "cell_type": "code", "execution_count": 2, "id": "ba056aa2-a5a2-4c12-923c-814b78cfa6a2", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import re" ] }, { "cell_type": "code", "execution_count": 3, "id": "a553a422-2566-4a43-824b-6baaff9027d2", "metadata": {}, "outputs": [], "source": [ "research_paper_table = pd.read_csv('data/ResearchPapers.csv',index_col=0)" ] }, { "cell_type": "code", "execution_count": 4, "id": "b5f4b519-9ce8-43f3-ba33-e5c5e491e5b5", "metadata": {}, "outputs": [], "source": [ "#Drop ignored\n", "research_paper_table\n", "research_paper_table = research_paper_table[research_paper_table.included] " ] }, { "cell_type": "code", "execution_count": 5, "id": "bd1b100a-f80d-4f13-ac61-88717d87a31c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "

\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
VenueTitleincluded
1CHIL 2022Data Augmentation for ElectrocardiogramsTrue
2CHIL 2022MedMCQA: A Large-scale Multi-Subject Multi-Cho...True
3CHIL 2022Disability prediction in multiple sclerosis us...True
4CHIL 2022Lead-agnostic Self-supervised Learning for Loc...True
5CHIL 2022Context-Sensitive Spelling Correction of Clini...True
\n", "
" ], "text/plain": [ " Venue Title included\n", "1 CHIL 2022 Data Augmentation for Electrocardiograms True\n", "2 CHIL 2022 MedMCQA: A Large-scale Multi-Subject Multi-Cho... True\n", "3 CHIL 2022 Disability prediction in multiple sclerosis us... True\n", "4 CHIL 2022 Lead-agnostic Self-supervised Learning for Loc... True\n", "5 CHIL 2022 Context-Sensitive Spelling Correction of Clini... True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "research_paper_table.head()" ] }, { "cell_type": "markdown", "id": "8d352189-d02b-4723-b92a-e14809c0863b", "metadata": {}, "source": [ "### 1.1 get file names" ] }, { "cell_type": "code", "execution_count": 6, "id": "e7384f08-4eb9-4fac-97d1-83323d9f309c", "metadata": {}, "outputs": [], "source": [ "def get_text_filename(v,t,path='data/texts/'):\n", " return path+v+special_separator+t+'.txt'" ] }, { "cell_type": "code", "execution_count": 7, "id": "3a0986de-9dc4-4b68-9f93-183afa7591d1", "metadata": {}, "outputs": [], "source": [ "texts = []\n", "titles=[]\n", "venues = []\n", "for index, row in research_paper_table.iterrows():\n", " texts.append(get_text_filename(row[\"Venue\"],row[\"Title\"]))\n", " titles.append(row[\"Title\"])\n", " venues.append(row[\"Venue\"])" ] }, { "cell_type": "code", "execution_count": 8, "id": "e3e720eb-976b-47a8-acfc-014f27e87a0a", "metadata": {}, "outputs": [], "source": [ "# mention_matches\n" ] }, { "cell_type": "markdown", "id": "60ca6250-7627-4496-97d7-201b51af1a5c", "metadata": {}, "source": [ "## 2. Get Datasets" ] }, { "cell_type": "markdown", "id": "7aa8ace4-411e-4992-b48d-55a57c322689", "metadata": {}, "source": [ "### 2.1 Get section function" ] }, { "cell_type": "code", "execution_count": 9, "id": "92a412ec-9a11-4755-b354-9c12f5444ef1", "metadata": {}, "outputs": [], "source": [ "get_dataset_section = None\n", "#REPLACE THE get_dataset_section function to change behavior" ] }, { "cell_type": "code", "execution_count": 10, "id": "e549fbc8-2dea-41fd-a0d2-2653ab116a98", "metadata": {}, "outputs": [], "source": [ "def get_section(contents, section_header=\"data and code availability\", section_end=['1. introduction','© 2022']):\n", " \"\"\"\n", " Get the Data Source section from a research paper.\n", " \n", " :param contents: Text contents of the resaerch paper.\n", " :param section_header: A string indicating the start of the dataset section.\n", " :param section_end: A list of strings to indicate end of dataset section. \n", " :return: returns substring of text region between section_header and a potential section_end. returns \"\" if it fails to find it.\n", " \n", " \"\"\"\n", " \n", " contents = contents.lower()\n", " contents = contents.replace(\"\\n\\n\", \"$$$\" )\n", " contents = contents.replace(\"-\\n\", \"\" )\n", " # contents = contents.replace(\"\\n\", \"\" )\n", " contents = contents.replace(\"$$$\", \"\\n\" )\n", "\n", " idx0=contents.find(section_header)\n", " contents=contents[idx0:]\n", " for end in section_end:\n", " idxend=contents.find(end)\n", " if idxend==-1: continue\n", " contents=contents[:idxend]\n", " return contents\n", " return \"\"\n", " #print(contents)" ] }, { "cell_type": "markdown", "id": "95071bbe-4e85-4f8b-9733-adba6fcef24b", "metadata": {}, "source": [ "### 2.2 Get section (CHIL)" ] }, { "cell_type": "code", "execution_count": 11, "id": "41060179-8576-44ad-b4e3-ae3d1aecca8f", "metadata": {}, "outputs": [], "source": [ "\n", "def get_dataset_section(contents):\n", " \"\"\"Get the Data and Code Availability section from CHIL papers\"\"\"\n", " #CHANGE THIS FUNCTION TO ALTER THE BEHAVIOR OF HOW THE DATASET SECTION IS EXTRACTED FROM THE TEXT\n", " return get_section(contents, section_header=\"data and code availability\", section_end=['1. introduction','© 2022'])\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "id": "24e3fd73-235b-45af-b25e-5ed51e9a75ec", "metadata": {}, "outputs": [], "source": [ "\n", "# items = {text_file:[] for text_file in texts}\n", "text_contents = []\n", "\n", "for i in range(len(texts)):\n", " with open(texts[i], 'r') as f:\n", " contents = f.read()\n", " # items[titles[i]] = get_section(contents)\n", " text_contents.append(get_section(contents))" ] }, { "cell_type": "markdown", "id": "b2e8b45f-8613-4bc4-a0b2-977f5796fefd", "metadata": {}, "source": [ "## 3. Get Mentions" ] }, { "cell_type": "markdown", "id": "d8be7480-eac7-489b-957b-59b673917952", "metadata": {}, "source": [ "### 3.1 regex patterns" ] }, { "cell_type": "code", "execution_count": 13, "id": "53c4c383-fc5a-47d3-870b-36ddc8ab21be", "metadata": {}, "outputs": [], "source": [ "xn = r'\\n?' #optional newline\n", "son = r'(?: |\\n)' #space or newline\n", "\n", "author = fr\"(?:[\\nA-Za-z'`-]+)\"\n", "etal = fr\"(?:et{son}al\\.?)\"\n", "additional = f\"(?:{xn},?{son}(?:(?:and{son}|&{son})?{author}|{etal}))\"\n", "year_num = f\"(?:{xn}19|20)[0-9][0-9]\"\n", "page_num = f\"(?:{xn},{son}p\\.? [0-9]+)?\" # Always optional\n", "year = fr\"(?:{xn},{son}*{year_num}{page_num}| *\\({year_num}{page_num}\\))\"\n", "inline_citation = fr'\\b(?!(?:Although|Also)\\b){xn}{author}{xn}{additional}*{xn}{year}{xn}[a-f]?'\n", "# ADAPTED FROM https://stackoverflow.com/a/63633049/2089784\n", "\n", "num_section = r'([0-9]\\.)+[0-9]'\n", "url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+' #https://urlregex.com/\n", "split_url_pattern = fr'h{xn}t{xn}t{xn}p{xn}[s]?{xn}:{xn}/{xn}/{xn}(?:[a-zA-Z{xn}]|[0-9]|[$-_@.&+]|[!*\\(\\{xn}),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'\n", "\n", "seq_num=r', [0-9]'\n", "contained_num=r'\\([0-9]\\)'\n", "footnote_pattern = fr'(?:(?:(?<=[a-zA-Z\\.]) ?|,)[1-9](?!\\d)|{seq_num}|{contained_num})'\n", "\n", "# print(inline_citation)" ] }, { "cell_type": "code", "execution_count": 14, "id": "866d6fd1-a511-4bc3-b08a-13c3d0847359", "metadata": {}, "outputs": [], "source": [ "\n", "def collect_tokens(pat,str,name,tokens):\n", " \"\"\"Collect pat matches from str, remove from str and insert into tokens.\"\"\"\n", " matches = re.findall(pat,str)\n", " str = re.sub(pat, '', str)\n", " tokens |= {name: matches}\n", " return str\n" ] }, { "cell_type": "markdown", "id": "dd84945e-f2cf-4ddb-a1df-21ca456abc76", "metadata": {}, "source": [ "### 3.2 Post-Processing" ] }, { "cell_type": "code", "execution_count": 15, "id": "1287fe8d-d48c-4035-90de-c4f396d83954", "metadata": {}, "outputs": [], "source": [ "# def clean_newlines(dict):\n", "# for key, val in dict.items():\n", "# dict[key] = [ s.replace('\\n','') for s in val ]\n", " \n", "# def clean_url_end(dict):\n", "# dict[\"URL\"] = [ s.rstrip('.') for s in dict[\"URL\"] ]\n", " \n", "def clean_tokens(dict):\n", " dict[\"Inline Citation\"] = [ s.replace('\\n',' ') for s in dict[\"Inline Citation\"] ]\n", " dict[\"URL\"] = [ s.replace('\\n','') for s in dict[\"URL\"] ]\n", " dict[\"URL\"] = [ s.rstrip('.') for s in dict[\"URL\"] ]" ] }, { "cell_type": "markdown", "id": "2e002fa7-138f-469e-822f-c4489426be63", "metadata": {}, "source": [ "## 4. Assembling the mentions table" ] }, { "cell_type": "code", "execution_count": 16, "id": "b26809bd-f923-4a5b-be98-687da74bf2eb", "metadata": {}, "outputs": [], "source": [ "\n", "# for name in mention_matches:\n", "# merged_dict[name]=mention_matches[name]\n", "# for keyword in keyword_matches:\n", "# merged_bdict[keyword]=keyword_matches[keyword]\n", "\n", "# NEW CSV USING get_dasa function from 3.4\n", "\n", "# for item in items:\n", "# print(item)\n", "# print(items[item])\n", "\n", "\n", "# MENTIONS_TABLE = pd.DataFrame()\n", "\n", "# MENTIONS_TABLE = pd.DataFrame( columns=('Venue', 'Paper Title', 'Citation Style', 'Citation') )\n", "\n", "mention_venues, mention_titles, mention_styles, mentions, notes = [], [], [], [], []\n", "\n", "for i in range(len(text_contents)):\n", " references = {}\n", " context = block = text_contents[i]\n", " venue = venues[i]\n", " title = titles[i]\n", " \n", " if (type(block) == type(None)):\n", " # print(\"ADD AN ERROR\")\n", " # print(venue, title, \"ERROR:TEXT WAS NOT PARSIBLE\", '')\n", " \n", " # MENTIONS_TABLE = pd.concat([MENTIONS_TABLE, df2], axis=0)\n", "\n", " mention_venues.append(venue)\n", " mention_titles.append(title)\n", " mention_styles.append(r)\n", " mentions.append('')\n", " notes.append(\"ERROR:TEXT WAS NOT PARSIBLE\")\n", " \n", " continue\n", " # print(titles[i])\n", " # print(block)\n", " \n", " block = collect_tokens(inline_citation,block,\"Inline Citation\",references)\n", " block = collect_tokens(split_url_pattern,block,\"URL\",references)\n", " block = collect_tokens(num_section,block,\"Numbered Section\",{})\n", " # print(block)\n", " block = collect_tokens(footnote_pattern,block,\"Footnote\",references) \n", " \n", " clean_tokens(references)\n", " \n", " mention_list = []\n", " first=True\n", " for r,l in references.items():\n", " for ll in l:\n", " mention_venues.append(venue)\n", " mention_titles.append(title)\n", " mention_styles.append(r)\n", " mentions.append(ll)\n", " if first:\n", " notes.append(context)\n", " else:\n", " notes.append(\"\")\n", " first=False\n", "\n", "\n", "\n", "MENTIONS_TABLE = pd.DataFrame(\n", " {\n", " 'Venue': mention_venues,\n", " 'Paper Title': mention_titles,\n", " 'Mention Style': mention_styles,\n", " 'Mention': mentions,\n", " 'Notes' : notes\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 17, "id": "00948939-22aa-42ce-8cf7-b0b2b2ca2d66", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
VenuePaper TitleMention StyleMentionNotes
0CHIL 2022Data Augmentation for ElectrocardiogramsInline Citationwagner et al., 2020data and code availability we use three\\ndatas...
1CHIL 2022Data Augmentation for ElectrocardiogramsInline Citationgoldberger et al., 2000
2CHIL 2022Data Augmentation for ElectrocardiogramsURLhttps://github.com/aniruddhraghu/ecg_aug
3CHIL 2022Lead-agnostic Self-supervised Learning for Loc...Inline Citationreyna et al., 2021data and code availability this paper uses\\nth...
4CHIL 2022Lead-agnostic Self-supervised Learning for Loc...Inline Citationwagner et al., 2020
..................
72CHIL 2022Identification of Subgroups With Similar Benef...Inline Citationjiang and li (2016)
73CHIL 2022Identification of Subgroups With Similar Benef...Inline Citationthomas and brunskill (2016)
74CHIL 2022Identification of Subgroups With Similar Benef...Inline Citationkallus and uehara (2020)
75CHIL 2022Identification of Subgroups With Similar Benef...Inline Citationkomorowski et al. (2018)
76CHIL 2022Identification of Subgroups With Similar Benef...Footnote6
\n", "

77 rows × 5 columns

\n", "
" ], "text/plain": [ " Venue Paper Title \\\n", "0 CHIL 2022 Data Augmentation for Electrocardiograms \n", "1 CHIL 2022 Data Augmentation for Electrocardiograms \n", "2 CHIL 2022 Data Augmentation for Electrocardiograms \n", "3 CHIL 2022 Lead-agnostic Self-supervised Learning for Loc... \n", "4 CHIL 2022 Lead-agnostic Self-supervised Learning for Loc... \n", ".. ... ... \n", "72 CHIL 2022 Identification of Subgroups With Similar Benef... \n", "73 CHIL 2022 Identification of Subgroups With Similar Benef... \n", "74 CHIL 2022 Identification of Subgroups With Similar Benef... \n", "75 CHIL 2022 Identification of Subgroups With Similar Benef... \n", "76 CHIL 2022 Identification of Subgroups With Similar Benef... \n", "\n", " Mention Style Mention \\\n", "0 Inline Citation wagner et al., 2020 \n", "1 Inline Citation goldberger et al., 2000 \n", "2 URL https://github.com/aniruddhraghu/ecg_aug \n", "3 Inline Citation reyna et al., 2021 \n", "4 Inline Citation wagner et al., 2020 \n", ".. ... ... \n", "72 Inline Citation jiang and li (2016) \n", "73 Inline Citation thomas and brunskill (2016) \n", "74 Inline Citation kallus and uehara (2020) \n", "75 Inline Citation komorowski et al. (2018) \n", "76 Footnote 6 \n", "\n", " Notes \n", "0 data and code availability we use three\\ndatas... \n", "1 \n", "2 \n", "3 data and code availability this paper uses\\nth... \n", "4 \n", ".. ... \n", "72 \n", "73 \n", "74 \n", "75 \n", "76 \n", "\n", "[77 rows x 5 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "MENTIONS_TABLE" ] }, { "cell_type": "markdown", "id": "ccefd05a-bcd6-4ee9-9e4c-00824dafbeb4", "metadata": {}, "source": [ "## 5. Saving" ] }, { "cell_type": "code", "execution_count": 18, "id": "3e6bf04d-dfe3-4cdd-9779-4b8007e0ab8e", "metadata": {}, "outputs": [], "source": [ "# #TODO:\n", "# print(\"* SWITCH TO MULTI-MATCHING\")\n", "MENTIONS_TABLE.to_csv(\"data/DatasetMentions_Unprocessed.csv\")" ] }, { "cell_type": "markdown", "id": "03228f26-452d-4299-b140-7fa9d1b4df69", "metadata": {}, "source": [ "## **SOME MANUAL ANNOTATION STILL REQUIRED**" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 5 }