[FEAT]: Cleaning up Content in collectors #2702

Open
@morbificagent


What would you like to see?

Hi all,
I have been testing AnythingLLM for a few days now and had problems getting it to find context in my files.
I played around with the settings but couldn't get it to work the way I wanted. It returned some information but missed many parts.

Looking at the citations showed that the chunks from my Office files looked like this:

Information....
10 empty lines
some information...
8 empty lines
Footer

All in all, there are many empty lines, probably caused by style elements in the document, plus redundant information because the footer is repeated on every page.

So I tried to "compress" the information a bit by changing the document collectors/converters, adding:

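function deduplicateContent(content) {
  // Lines already emitted, compared by exact text.
  const seen = new Set();
  return content
    .split("\n")
    .filter((line) => {
      // Drop empty and whitespace-only lines.
      if (line.trim() === "") return false;
      // Drop any line that has already appeared earlier.
      if (seen.has(line)) return false;
      seen.add(line);
      return true;
    })
    .join("\n");
}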

and, a little further down in the same code, calling it on the joined page content:
const content = deduplicateContent(pageContent.join("\n"));

Here is an example file:
asDocx.txt

The result is that all repeated lines are removed, along with the empty lines (which are redundant too, for sure ;-) )
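
To make the effect concrete, here is a small sketch that runs the helper above on a made-up chunk shaped like the one from the citations. The sample text is invented for illustration and is not taken from asDocx.txt.

```javascript
// Same helper as above, repeated here so the snippet runs on its own.
function deduplicateContent(content) {
  const seen = new Set();
  return content
    .split("\n")
    .filter((line) => {
      if (line.trim() === "") return false;
      if (seen.has(line)) return false;
      seen.add(line);
      return true;
    })
    .join("\n");
}

// Made-up sample: two "pages" with runs of blank lines and a repeated footer.
const sample = [
  "Information on page 1",
  "",
  "",
  "Footer",
  "Information on page 2",
  "",
  "Footer",
].join("\n");

const cleaned = deduplicateContent(sample);
console.log(cleaned);
// Information on page 1
// Footer
// Information on page 2
```

Note that lines are compared by exact text, so two footers that differ only in leading or trailing whitespace would both be kept.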

I don't know if this is the best way to do it, but it works and helps me a lot, since AnythingLLM can now send better context to the LLM.

Maybe something like this could be implemented by someone who is able to do it better ;-)
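
As one possible starting point, here is a more conservative sketch. It is only an idea, not AnythingLLM code: the name `cleanContent` and the `minRepeats` threshold are assumptions made up for this example. Instead of dropping every repeated line, it trims lines, collapses runs of blank lines into a single one (so paragraph breaks survive), and only deduplicates lines that occur at least `minRepeats` times, which are likely headers or footers.

```javascript
// Hypothetical helper, not part of AnythingLLM: name and default are assumptions.
function cleanContent(content, minRepeats = 3) {
  const lines = content.split("\n").map((line) => line.trim());

  // Count how often each non-empty line occurs in the whole document.
  const counts = new Map();
  for (const line of lines) {
    if (line !== "") counts.set(line, (counts.get(line) || 0) + 1);
  }

  const seenRepeated = new Set();
  const out = [];
  for (const line of lines) {
    if (line === "") {
      // Collapse runs of blank lines into one, and skip leading blanks.
      if (out.length > 0 && out[out.length - 1] !== "") out.push("");
      continue;
    }
    if (counts.get(line) >= minRepeats) {
      // Likely header/footer: keep only its first occurrence.
      if (seenRepeated.has(line)) continue;
      seenRepeated.add(line);
    }
    out.push(line);
  }
  // Drop a trailing blank line left over from the collapse.
  if (out.length > 0 && out[out.length - 1] === "") out.pop();
  return out.join("\n");
}
```

With the default threshold, a line that legitimately appears twice (for example the same value in two table rows) is kept both times, while a footer repeated on three or more pages is kept only once. The right threshold depends on the documents, so it would probably need to be configurable.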
