To efficiently read, process, and write the YAML front matter of hundreds of Markdown files spread across multiple subdirectories, you can combine `yq` (a command-line YAML processor) with `pandas` (a Python data analysis library). Below is a comprehensive guide to help you accomplish this task.
## Overview

- **Extract YAML front matter:** Use `yq` to parse and extract the YAML front matter from each Markdown file.
- **Aggregate data:** Collect the extracted data into a structured format (e.g., JSON or CSV).
- **Process with pandas:** Load the aggregated data into a pandas DataFrame for analysis or manipulation.
- **Write results:** Output the processed data as needed, such as updating YAML front matter or generating reports.
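These four steps can be sketched end-to-end in pure Python before bringing `yq` into the loop; `parse_front_matter` below is an illustrative helper (not a library function) standing in for the extraction step:

```python
import yaml          # PyYAML: pip install PyYAML
import pandas as pd

def parse_front_matter(text):
    """Return the YAML front matter of a Markdown document as a dict.

    The front matter is the block between the leading '---' line and
    the next '---' line; an empty dict means the file has none.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != '---':
        return {}
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == '---')
    except StopIteration:
        return {}  # opening delimiter never closed
    return yaml.safe_load('\n'.join(lines[1:end])) or {}

# Steps 1-2: extract and aggregate (here from strings, normally from files)
docs = [
    "---\ntitle: First post\ntags: [python, yaml]\n---\nBody text.",
    "No front matter here.",
]
rows = [fm for fm in (parse_front_matter(d) for d in docs) if fm]

# Step 3: load into a DataFrame for processing
df = pd.DataFrame(rows)
```

Step 4 (writing back) mirrors this in reverse: dump the updated dict with `yaml.dump` between fresh `---` delimiters.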
## Prerequisites

1. **Install `yq`:**
   - Using Homebrew (macOS/Linux): `brew install yq`
   - Using Snap (Linux): `sudo snap install yq`
   - From source or other methods: refer to the yq GitHub repository for detailed installation instructions.
2. **Install Python and pandas:**
   - Ensure you have Python installed (preferably Python 3.6+).
   - Install `pandas` and `PyYAML`: `pip install pandas PyYAML`
3. **Directory structure:**
   - Assume your Markdown files are organized in various subdirectories under a root directory, e.g., `./content`.
## Step-by-Step Guide
### 1. Extract YAML Front Matter Using yq

First, you'll need to extract the YAML front matter from each Markdown file. YAML front matter is typically enclosed between `---` lines at the beginning of a Markdown file.
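For reference, a file with front matter looks like this (the field names are only illustrative):

```markdown
---
title: My First Post
tags: [python, yaml]
---

The Markdown body starts after the closing --- line.
```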
Here's how you can extract it with `yq` v4, whose front-matter mode reads only the YAML block instead of trying to parse the whole file (the body of a Markdown file is not valid YAML):

```bash
# Extract the YAML front matter from a single file
yq --front-matter=extract '.' file.md
```
However, to process multiple files across subdirectories, it’s more efficient to automate this with a Python script.
### 2. Python Script to Aggregate YAML Front Matter

Below is a Python script that:

- Recursively searches for all Markdown files in the specified directory.
- Uses `yq` to extract YAML front matter from each file.
- Aggregates the data into a pandas DataFrame.
- Processes the data as needed.
- Writes the results back or outputs them as required.
```python
import subprocess
import pandas as pd
import yaml
from pathlib import Path


def find_markdown_files(root_dir):
    """Recursively find all Markdown files in the given directory."""
    return list(Path(root_dir).rglob('*.md'))


def extract_yaml_front_matter(file_path):
    """
    Extract YAML front matter from a Markdown file using yq.
    Assumes that YAML front matter is enclosed between '---' lines.
    """
    try:
        # Use yq's front-matter mode (yq v4) to extract only the YAML block.
        # Passing the command as a list avoids shell-quoting problems with
        # paths that contain spaces or special characters.
        result = subprocess.run(
            ['yq', '--front-matter=extract', '.', str(file_path)],
            check=True, capture_output=True, text=True
        )
        # Parse the YAML content into a Python dictionary
        return yaml.safe_load(result.stdout)
    except subprocess.CalledProcessError as e:
        print(f"Error processing {file_path}: {e.stderr}")
        return None
    except yaml.YAMLError as e:
        print(f"YAML parsing error in {file_path}: {e}")
        return None


def aggregate_data(markdown_files):
    """Aggregate YAML front matter from all Markdown files into a list of dictionaries."""
    data = []
    for file in markdown_files:
        yaml_data = extract_yaml_front_matter(file)
        if yaml_data:
            yaml_data['file_path'] = str(file)  # Keep track of the source file
            data.append(yaml_data)
    return data


def main():
    # Define the root directory containing Markdown files
    root_dir = './content'  # Change this to your directory

    # Find all Markdown files
    markdown_files = find_markdown_files(root_dir)
    print(f"Found {len(markdown_files)} Markdown files.")

    # Aggregate YAML front matter
    data = aggregate_data(markdown_files)
    print(f"Aggregated YAML data from {len(data)} files.")

    # Create a pandas DataFrame
    df = pd.DataFrame(data)
    print("Initial DataFrame:")
    print(df.head())

    # Example processing: analyze tags
    if 'tags' in df.columns:
        # Expand tags into separate rows and count each tag
        df_expanded = df.explode('tags')
        tag_counts = df_expanded['tags'].value_counts()
        print("Tag Counts:")
        print(tag_counts)
        # Save tag counts to a CSV
        tag_counts.to_csv('tag_counts.csv', header=['count'])
        print("Tag counts saved to tag_counts.csv")

    # Example processing: add a new field 'processed: true' to each front matter
    df['processed'] = True

    # Iterate over the DataFrame and update the YAML front matter in each file
    for index, row in df.iterrows():
        file_path = row['file_path']
        with open(file_path, 'r', encoding='utf-8') as f:
            lines = f.readlines()

        # Find the YAML front matter delimiters
        if lines and lines[0].strip() == '---':
            end_index = next(
                (i for i, line in enumerate(lines[1:], start=1)
                 if line.strip() == '---'),
                None,
            )
            if end_index is None:
                print(f"Unterminated front matter in {file_path}")
                continue
            yaml_content = ''.join(lines[1:end_index])

            # Update the YAML data
            yaml_data = yaml.safe_load(yaml_content) or {}
            yaml_data['processed'] = bool(row['processed'])
            new_yaml = yaml.dump(yaml_data, sort_keys=False)

            # Reconstruct the file with the updated YAML
            new_content = ['---\n', new_yaml, '---\n'] + lines[end_index + 1:]
            with open(file_path, 'w', encoding='utf-8') as f:
                f.writelines(new_content)
            print(f"Updated {file_path} with processed=True")
        else:
            print(f"No YAML front matter found in {file_path}")

    # Save the aggregated data to a CSV for further analysis
    df.to_csv('aggregated_yaml_data.csv', index=False)
    print("Aggregated YAML data saved to aggregated_yaml_data.csv")


if __name__ == "__main__":
    main()
```
### 3. Explanation of the Script

- **Finding Markdown files:** The `find_markdown_files` function uses `pathlib` to recursively locate all `.md` files within the specified root directory.
- **Extracting YAML front matter:** The `extract_yaml_front_matter` function invokes `yq` via `subprocess` to extract the YAML front matter from each Markdown file. It handles errors gracefully, printing any issues encountered during extraction or parsing.
- **Aggregating data:** The `aggregate_data` function compiles all extracted YAML data into a list of dictionaries, adding the file path for reference.
- **Creating a pandas DataFrame:** The aggregated data is loaded into a pandas DataFrame for easy manipulation and analysis.
- **Processing data with pandas:** Example 1: if there's a `tags` field, the script counts the occurrence of each tag and saves the results to `tag_counts.csv`. Example 2: the script adds a new field `processed: true` to each YAML front matter and writes the updated YAML back to the original Markdown files.
- **Writing results:** Aggregated data is saved to `aggregated_yaml_data.csv` for any additional processing or reporting needs. Updated YAML front matter is written back to the original Markdown files, ensuring that changes are reflected directly in the source files.
### 4. Running the Script

1. **Save the script:** Save the above Python script to a file, e.g., `process_markdown_yaml.py`.
2. **Execute the script:**

   ```bash
   python process_markdown_yaml.py
   ```

3. **Check the outputs:**
   - `aggregated_yaml_data.csv`: contains all extracted YAML data.
   - `tag_counts.csv` (if applicable): contains counts of each tag.
   - Updated Markdown files: each file now includes the `processed: true` field in its YAML front matter.
## Additional Tips

**Custom Processing:**

- Modify the pandas processing section to suit your specific needs, such as filtering, aggregating, or transforming data.
**Backup Files:**

- Before writing changes back to your Markdown files, consider creating backups to prevent accidental data loss.
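A minimal backup pass can be sketched with the standard library; the `.bak` suffix is just a convention, pick whatever fits your workflow:

```python
import shutil
from pathlib import Path

def backup_markdown_files(root_dir):
    """Copy every .md file under root_dir to a sibling .md.bak file."""
    for md_file in Path(root_dir).rglob('*.md'):
        backup = md_file.with_name(md_file.name + '.bak')
        shutil.copy2(md_file, backup)  # copy2 preserves timestamps/metadata
```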
**Performance Optimization:**

- For a large number of files, consider optimizing the script by parallelizing the YAML extraction process using libraries like `concurrent.futures`.
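The parallelization tip can be sketched with `concurrent.futures`; the worker below is a stand-in for whatever extraction function you use (the `yq` subprocess call, or a pure-Python parser):

```python
from concurrent.futures import ThreadPoolExecutor

def extract_stub(path):
    # Stand-in worker: in the real script this would shell out to yq
    # and return the parsed front matter for the file at `path`.
    return {'file_path': path, 'processed': False}

def extract_all(paths, max_workers=8):
    """Run the extraction worker over many files concurrently.

    Threads work well here because the workload is I/O-bound: subprocess
    calls and file reads release the GIL while they wait.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(extract_stub, paths))  # preserves input order
    return [r for r in results if r is not None]
```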
**Error Handling:**

- Enhance error handling to manage files without YAML front matter, or with malformed YAML, gracefully.
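One way to make that concrete is a classifier that distinguishes the failure modes instead of crashing mid-run; `safe_front_matter` is an illustrative helper, not a library function:

```python
import yaml

def safe_front_matter(text):
    """Classify a Markdown document's front matter.

    Returns a (status, data) pair where status is one of 'ok',
    'missing', or 'malformed', so callers can log and skip bad files.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != '---':
        return 'missing', None
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == '---')
    except StopIteration:
        return 'malformed', None          # opening '---' never closed
    try:
        data = yaml.safe_load('\n'.join(lines[1:end]))
    except yaml.YAMLError:
        return 'malformed', None          # delimiters fine, YAML is not
    return 'ok', data or {}
```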
**Alternative Approaches:**

- While this guide uses `yq` for YAML extraction, Python's `python-frontmatter` library is another option for handling Markdown front matter directly within Python.
## Conclusion

By combining the strengths of `yq` for YAML processing and `pandas` for data manipulation, you can create a robust workflow to manage and analyze YAML front matter across a large collection of Markdown files. The provided Python script serves as a foundation that you can customize further based on your specific requirements.
---

## Alternative Walkthrough: yq's Front-Matter Mode

To read and process the YAML front matter of hundreds of Markdown files located in various subdirectories using `yq` and `pandas`, you can follow a systematic approach that combines shell commands and Python scripting. Here's a step-by-step guide:
1. **Extract the YAML front matter with `yq`.** The `yq` command-line tool can extract and process the YAML front matter directly from your Markdown files. It has a `--front-matter` flag specifically designed for this purpose.
2. **Collect data into a pandas DataFrame.** Use Python to traverse your directory structure, invoke `yq` to extract the front matter from each file, and collect this data into a pandas DataFrame for processing.
3. **Process the data with pandas.** Perform any data manipulation or analysis using pandas' powerful DataFrame operations.
4. **Write results back to files with `yq`.** After processing, use `yq` again to write the updated front matter back into your Markdown files.
Below is a detailed explanation and example code for each step.
### Step 1: Extract the YAML Front Matter with yq

The `yq` tool (v4) can extract the front matter from Markdown files using the `--front-matter=extract` option. It can output the front matter in JSON format, which is convenient for loading into pandas.

**Example command:**

```bash
yq eval --front-matter=extract -o=json '.' file.md
```

- `--front-matter=extract`: tells `yq` to extract the front matter.
- `-o=json`: outputs the result in JSON format (the v3-style `-j` shorthand does not exist in yq v4).
- `'.'`: the expression to evaluate (in this case, the entire front matter).
### Step 2: Collect Data into a Pandas DataFrame

Create a Python script that:

- Walks through all subdirectories to find Markdown files.
- Uses `subprocess` to call `yq` and extract the front matter.
- Parses the JSON output and collects it into a list.
- Converts the list into a pandas DataFrame.

**Python script:**
```python
import os
import subprocess
import pandas as pd
import json

# Initialize a list to store the front matter data
data = []

# Traverse the directory structure
for root, dirs, files in os.walk('.'):
    for file in files:
        if file.endswith('.md'):
            file_path = os.path.join(root, file)
            try:
                # Use yq to extract front matter in JSON format
                result = subprocess.run(
                    ['yq', 'eval', '--front-matter=extract', '-o=json', '.', file_path],
                    capture_output=True,
                    text=True,
                    check=True
                )
                # Load the JSON output into a dict
                front_matter = json.loads(result.stdout)
                # Add the file path to the data
                front_matter['file'] = file_path
                data.append(front_matter)
            except subprocess.CalledProcessError as e:
                print(f"Error processing file {file_path}: {e}")
            except json.JSONDecodeError as e:
                print(f"Error parsing front matter in file {file_path}: {e}")

# Create a DataFrame from the collected data
df = pd.DataFrame(data)

# Display the first few rows
print(df.head())
```
**Explanation:**

- `os.walk('.')`: recursively traverses the current directory.
- `subprocess.run()`: calls `yq` to extract the front matter.
- `json.loads()`: parses the JSON output from `yq`.
- `data.append()`: collects each front matter dictionary.
- `pd.DataFrame()`: converts the list of dictionaries into a DataFrame.
### Step 3: Process the Data with Pandas

Now that you have the data in a DataFrame, you can perform any processing you need. For example, you can filter, modify, or analyze the data.

**Example processing:**

```python
# Example: Add a new column based on existing data
# (fillna guards against files whose front matter has no title)
df['title_length'] = df['title'].fillna('').apply(len)

# Example: Filter rows whose tags list contains 'important'
# (the isinstance check guards against missing or scalar tags fields)
df_filtered = df[df['tags'].apply(lambda x: isinstance(x, list) and 'important' in x)]

# Display the processed DataFrame
print(df_filtered.head())
```
### Step 4: Write Results Back to Files with yq

After processing, you may want to write the results back into the YAML front matter of the Markdown files. You can use `yq` with the `--front-matter=process` option to update the front matter in place.

**Python script for writing back:**
```python
for index, row in df.iterrows():
    file_path = row['file']

    # Build the yq expression clause by clause, then join with ' | '
    clauses = []
    if 'title' in row and pd.notna(row['title']):
        clauses.append(f'.title = "{row["title"]}"')
    if 'tags' in row and isinstance(row['tags'], list):
        tags_yaml = ', '.join(f'"{tag}"' for tag in row['tags'])
        clauses.append(f'.tags = [{tags_yaml}]')
    if not clauses:
        continue  # nothing to update for this file
    yq_expression = ' | '.join(clauses)

    try:
        # Use yq to update the front matter in place
        subprocess.run(
            ['yq', 'eval', '--front-matter=process', yq_expression, '-i', file_path],
            check=True
        )
    except subprocess.CalledProcessError as e:
        print(f"Error updating file {file_path}: {e}")
```
**Explanation:**

- `yq_expression`: constructs the expression to update fields.
- `--front-matter=process`: tells `yq` to process and update the front matter while leaving the Markdown body intact.
- `-i`: updates the file in place.
**Note:** Be cautious when constructing `yq` expressions, especially with strings that may contain special characters. You may need to sanitize or escape values appropriately.
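One way to sidestep the quoting problem entirely is to pass values through environment variables with yq's `strenv()` function, so user data never appears inside the expression itself. A sketch (`set_title_safely` is an illustrative helper; the run is guarded so nothing executes where `yq` is unavailable):

```python
import os
import shutil
import subprocess

def set_title_safely(file_path, title):
    """Set .title in a file's front matter without interpolating the
    value into the yq expression (avoids quoting/injection issues)."""
    env = dict(os.environ, NEW_TITLE=title)
    cmd = ['yq', 'eval', '--front-matter=process',
           '.title = strenv(NEW_TITLE)', '-i', file_path]
    # Only run when yq is installed and the target file exists
    if shutil.which('yq') and os.path.exists(file_path):
        subprocess.run(cmd, env=env, check=True)
    return cmd  # returned for inspection
```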
### Handling Special Cases and Errors

- **Empty or missing front matter:** handle files that may not have front matter by adding error checking in your script.
- **Data types:** ensure that the data types in your DataFrame match what you expect in the YAML (e.g., lists, strings, numbers).
- **Special characters:** be careful with special characters in strings that may need to be escaped in YAML.
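The data-types point matters because pandas stores numbers as NumPy scalars and missing values as NaN, neither of which serializes cleanly to YAML. A small normalization pass, sketched with illustrative helper names:

```python
import math
import numpy as np
import pandas as pd

def to_yaml_safe(value):
    """Convert a single DataFrame cell into a YAML-friendly value."""
    if isinstance(value, (list, tuple)):
        return [to_yaml_safe(v) for v in value]
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    if isinstance(value, np.generic):   # np.int64, np.float64, np.bool_, ...
        return value.item()             # unwrap to a plain Python scalar
    return value

def row_to_metadata(row):
    """Turn a DataFrame row into a clean metadata dict, skipping NaNs."""
    return {k: to_yaml_safe(v) for k, v in row.items()
            if v is not None and not (isinstance(v, float) and math.isnan(v))}
```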
### Putting It All Together

Here's a consolidated version of the script:
```python
import os
import subprocess
import pandas as pd
import json

# Steps 1 and 2: Extract front matter and collect data
data = []
for root, dirs, files in os.walk('.'):
    for file in files:
        if file.endswith('.md'):
            file_path = os.path.join(root, file)
            try:
                result = subprocess.run(
                    ['yq', 'eval', '--front-matter=extract', '-o=json', '.', file_path],
                    capture_output=True,
                    text=True,
                    check=True
                )
                front_matter = json.loads(result.stdout)
                front_matter['file'] = file_path
                data.append(front_matter)
            except Exception as e:
                print(f"Error processing {file_path}: {e}")

df = pd.DataFrame(data)

# Step 3: Process data with pandas (example: uppercase every title)
df['title'] = df['title'].str.upper()

# Step 4: Write results back to files
for index, row in df.iterrows():
    file_path = row['file']
    clauses = []
    if 'title' in row and pd.notna(row['title']):
        clauses.append(f'.title = "{row["title"]}"')
    # Add more fields as needed
    if not clauses:
        continue
    yq_expression = ' | '.join(clauses)
    try:
        subprocess.run(
            ['yq', 'eval', '--front-matter=process', yq_expression, '-i', file_path],
            check=True
        )
    except Exception as e:
        print(f"Error updating {file_path}: {e}")
```
### Alternative: Using Python Libraries Only

If you prefer to avoid subprocess calls and `yq`, you can use Python libraries like `PyYAML` and `python-frontmatter` to read and write the YAML front matter directly.

**Example with `python-frontmatter`:**
```python
import os
import frontmatter
import pandas as pd

data = []
for root, dirs, files in os.walk('.'):
    for file in files:
        if file.endswith('.md'):
            file_path = os.path.join(root, file)
            post = frontmatter.load(file_path)
            metadata = dict(post.metadata)
            metadata['file'] = file_path
            data.append(metadata)

df = pd.DataFrame(data)

# Process data (example: uppercase every title)
df['title'] = df['title'].str.upper()

# Write back to files
for index, row in df.iterrows():
    if pd.isna(row.get('title')):
        continue  # skip files that had no title in their front matter
    file_path = row['file']
    post = frontmatter.load(file_path)
    post.metadata['title'] = row['title']
    with open(file_path, 'w', encoding='utf-8') as f:
        frontmatter.dump(post, f)
```
**Note:** While this method avoids shelling out to an external command-line tool, you mentioned specifically wanting to use `yq` and pandas.
**Final tips:**

- **Backup your files:** before running scripts that modify files in place, make sure to back up your data.
- **Test on a subset:** test your script on a small subset of files to ensure it works as expected.
- **Error handling:** enhance the script with robust error handling to manage files without front matter or with unexpected formats.
By following these steps, you can effectively use `yq` and pandas to read, process, and write the YAML front matter of multiple Markdown files spread across various subdirectories.