Welcome to secutils
secutils is a utility package to facilitate large bulk downloads of SEC documents. It works with any SEC document type and will retrieve the entire historical database if required. Multi-threaded file downloads are enabled in the command line utility.
Key functionality includes:
- Multi-threaded downloading
- Caching of index files
- Automatic directory structure build-out (e.g. downloading multiple file types into a directory structure of the form ftype --> year --> quarter --> files)
- Resume downloading
- Built-in logging and download success tracker
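As a rough illustration of the ftype --> year --> quarter --> files layout listed above, here is a minimal sketch. The helper name `build_output_path` and the `QTR<n>` folder naming are assumptions made for this example and may not match what secutils actually creates on disk.

```python
from pathlib import PurePosixPath


def build_output_path(root, form_type, year, quarter, file_name):
    """Return root/form_type/year/QTR<quarter>/file_name (hypothetical layout)."""
    # Form types such as "10-K/A" contain a slash, which would otherwise
    # create an extra directory level, so swap it for an underscore.
    safe_form = form_type.replace("/", "_")
    return PurePosixPath(root) / safe_form / str(year) / f"QTR{quarter}" / file_name


path = build_output_path("/mnt/sda/sec", "S-1", 2014, 1, "0001437749-17-020936.txt")
print(path.as_posix())
# /mnt/sda/sec/S-1/2014/QTR1/0001437749-17-020936.txt
```

Keeping the form type at the top level means a single output directory can hold several form types without their files colliding.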
Overview of README:
secutils picks up where a number of other repos left off. There are a couple of SEC downloading Python packages out there; however, they are designed for retrieving a small number of documents. I needed a way to consistently download the latest updates from the SEC and secure a local copy of the entire SEC history for certain file types. This translates into terabytes of documents, at which point networking, directory structure, logging, and similar issues arise.
There is a nice package available to download and construct index files; however, the user is still left to download the actual files and must be comfortable with bash scripting.
With secutils, the program detects the files you have already retrieved, gets the files missing from your local archive, and continues.
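The resume behaviour described above can be pictured as a set difference between the file names listed in an index and the file names already on disk. The sketch below is an illustration under that assumption, not the package's actual implementation; `find_missing` is a hypothetical helper.

```python
import os
import tempfile


def find_missing(expected_names, archive_dir):
    """Return the expected file names not found anywhere under archive_dir."""
    seen = set()
    # Walk the whole archive so files nested under form/year/quarter are found.
    for _dirpath, _dirnames, filenames in os.walk(archive_dir):
        seen.update(filenames)
    return [name for name in expected_names if name not in seen]


# Demo against a throwaway archive containing one already-downloaded file.
with tempfile.TemporaryDirectory() as archive:
    quarter_dir = os.path.join(archive, "10-K", "2017", "QTR1")
    os.makedirs(quarter_dir)
    with open(os.path.join(quarter_dir, "0001437749-17-020936.txt"), "w") as fh:
        fh.write("placeholder")
    missing = find_missing(
        ["0001437749-17-020936.txt", "0001000015-98-000009.txt"], archive
    )

print(missing)
# ['0001000015-98-000009.txt']
```

Only the names in `missing` would need to be requested again, which is what makes restarting an interrupted multi-terabyte run practical.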
For examples of other repos that exist:
Furthermore, the hope is for this package to provide parsers for the respective form types. A user could import the 10-K parser and call the Management Discussion and Analysis method to retrieve the MD&A sections from selected files.
There are also plans to integrate directly with popular cloud providers given the scale of these filings. Processing 10-K and 10-Q filings alone requires terabytes of storage.
There are two primary methods of installing secutils. The first is via the Python Package Index (PyPI). The second is straight from source.
To install from PyPI:
pip install secutils
And to install from source:
git clone https://github.com/datawrestler/sec-utils && cd sec-utils
conda create --name sec_env python=3.7 pip
conda activate sec_env
pip install -r requirements.txt
pip install -e .
conda activate sec_env
python download_sec.py --output_dir=/mnt/sda/sec --form_types=S-1 --num_workers=-1 --start_year=2014 --end_year=2019 --quarters 1 2 3 4
Even more cleanly, you can coordinate long-running jobs and keep track of your parameters by modifying this example script.
Make sure to make it executable on your system:
chmod +x run.sh
./run.sh
You can also generate a config file and use the config to control parameters of longer runs:
from secutils.utils import generate_config
path_for_config = ''
generate_config(path_for_config)
Then, when starting the longer download run:
python -m secutils.download_sec --config_path='path_for_config'
A useful trick when working with remote servers is to direct output from a session to a file. Using screen also maintains a session even if you disconnect from ssh:
screen -dm -L python -m secutils.download_sec --config_path='path_for_config'
Additionally, users can leverage the API directly for more hands-on work. An overview resides in an example Jupyter notebook, with additional details below:
from secutils.edgar import FormIDX
form = FormIDX(year=2017, quarter=1, seen_files=None, cache_dir=None, form_types=['10-K'])
files = form.index_to_files()
form.master_index.head()
# CIK Company Name Form Type Date Filed Filename fname
# 1000015 META GROUP INC 10-K 1998-03-31 edgar/data/1000015/0001000015-98-000009.txt 0001000015-98-000009.txt
# 1000112 CHEVY CHASE MASTER CREDIT CARD TRUST II 10-K 1998-03-27 edgar/data/1000112/0000920628-98-000038.txt 0000920628-98-000038.txt
# 1000179 PARAMOUNT FINANCIAL CORP 10-K 1998-03-30 edgar/data/1000179/0000950120-98-000108.txt 0000950120-98-000108.txt
# let's take a peek at the attributes available on individual files:
ex = files[0]
msg = f"""
Company Name: {ex.company_name}
CIK Number: {ex.cik_number}
Date Filed: {ex.date_filed}
Form Type: {ex.form_type}
File Name: {ex.file_name}
Download URL: {ex.file_download_url}
"""
print(msg)
# Company Name: OPTICAL CABLE CORP
# CIK Number: 1000230
# Date Filed: 2017-12-20 00:00:00
# Form Type: 10-K
# File Name: 0001437749-17-020936.txt
# Download URL: https://www.sec.gov/Archives/edgar/data/1000230/0001437749-17-020936.txt
# download our example file:
output_dir = '.'
ex.download_file(output_dir) # 200 is a successful download
# verify download
import os
list(filter(lambda x: x.endswith('txt'), os.listdir(output_dir)))
# ['0001437749-17-020936.txt']
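The relationship between the `Filename` column and the download URL in the output above can be sketched as follows. This assumes the pipe-delimited layout of EDGAR's `master.idx` files (CIK|Company Name|Form Type|Date Filed|Filename). `IndexEntry` and `parse_index_line` are hypothetical names for illustration and are not part of the secutils API.

```python
from typing import NamedTuple

ARCHIVE_ROOT = "https://www.sec.gov/Archives/"


class IndexEntry(NamedTuple):
    """One row of a quarterly index (hypothetical container for this sketch)."""
    cik: str
    company_name: str
    form_type: str
    date_filed: str
    filename: str

    @property
    def fname(self):
        # Last path component, e.g. '0001437749-17-020936.txt'
        return self.filename.rsplit("/", 1)[-1]

    @property
    def download_url(self):
        # The index stores paths relative to the Archives root.
        return ARCHIVE_ROOT + self.filename


def parse_index_line(line):
    """Split one pipe-delimited index row into an IndexEntry."""
    parts = line.strip().split("|")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return IndexEntry(*parts)


entry = parse_index_line(
    "1000230|OPTICAL CABLE CORP|10-K|2017-12-20|"
    "edgar/data/1000230/0001437749-17-020936.txt"
)
print(entry.fname)
# 0001437749-17-020936.txt
print(entry.download_url)
# https://www.sec.gov/Archives/edgar/data/1000230/0001437749-17-020936.txt
```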
Getting hands-on is great; however, using the CLI does provide several advantages:
- Automatic directory structure creation
- Built-in logging and caching
- Ability to resume downloading via download scanning
- Multi-threaded file downloading
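For readers using the API directly, multi-threaded downloading can be approximated with the standard library. The sketch below is not how the CLI is implemented. It assumes each file object exposes `file_name` and a `download_file(output_dir)` method returning an HTTP status code, as the earlier example suggests; `FakeFile` is a stand-in so the example runs without network access, and `download_all` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class FakeFile:
    """Stand-in for a secutils file object (assumption, for illustration only)."""

    def __init__(self, file_name, status=200):
        self.file_name = file_name
        self._status = status

    def download_file(self, output_dir):
        # A real object would fetch the filing and write it under output_dir.
        return self._status


def download_all(files, output_dir, num_workers=4):
    """Download files concurrently; return {file_name: status or exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = {pool.submit(f.download_file, output_dir): f for f in files}
        for future in as_completed(futures):
            f = futures[future]
            try:
                results[f.file_name] = future.result()
            except Exception as exc:  # record the failure and keep going
                results[f.file_name] = exc
    return results


demo_files = [FakeFile("a.txt"), FakeFile("b.txt", status=404)]
results = download_all(demo_files, output_dir=".")
failed = sorted(name for name, status in results.items() if status != 200)
print(failed)
# ['b.txt']
```

Threads suit this workload because the time is spent waiting on the network rather than on computation.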
The vision for this project extends far beyond its current state of downloading index and SEC files from the EDGAR database. Currently, parsing SEC files is tremendously difficult. There are numerous reasons for these difficulties, including:
- No systematic tagging structure for SEC filings
- File submission formats have changed over the years
- Many different file types, header types, and content from one filing type to another
Given the above, parsing even a 10-K takes tremendous effort. The goal of this project is to bring together like-minded individuals and take a stab at a systematic parsing effort with a consistent API. The future state of the project would allow users to download SEC filings and use convenient methods to retrieve particular sections of the filings. For instance, a user could do something like the following:
from secutils.file_types import file_10k
file_path = '/path/to/10-K'
f = file_10k.from_path(file_path)
# and retrieve the management discussion and analysis section directly:
f.management_discussion()
# Here at XYZ company, we believe the following year will bring about great prosperity due to our R&D efforts in packages like secutils...
This would open up a world of opportunity for collaboration, text analytics research, and general business information gathering.
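To give a feel for why section extraction is hard, here is a deliberately naive sketch that pulls the text between an "Item 7." heading and the next "Item 7A." or "Item 8." heading in plain text. The function name `naive_management_discussion` is hypothetical and this is not the planned secutils parser. Real filings mix HTML, SGML headers, and inconsistent heading styles, so a pattern like this will miss many cases.

```python
import re

# Text from a line starting with "Item 7." up to the next line that
# starts with "Item 7A." or "Item 8." (case-insensitive).
MDA_PATTERN = re.compile(
    r"^\s*item\s+7\.(?P<body>.*?)(?=^\s*item\s+(?:7a|8)\.)",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def naive_management_discussion(text):
    """Return the longest 'Item 7.' section found in text, or None."""
    candidates = [m.group("body").strip() for m in MDA_PATTERN.finditer(text)]
    # A table of contents usually lists "Item 7." as well, so prefer the
    # longest candidate over the first one.
    return max(candidates, key=len) if candidates else None


sample = """Table of Contents
Item 7. Management's Discussion and Analysis 21
Item 7A. Quantitative Disclosures 30
Item 1. Business
We make cables.
Item 7. Management's Discussion and Analysis
Sales grew this year.
Costs were flat.
Item 7A. Quantitative and Qualitative Disclosures
None.
"""

section = naive_management_discussion(sample)
print(section)
```

Even on this toy input the table of contents produces a false candidate, which hints at the amount of special-casing a robust parser would need.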