[FEAT]: Cleaning up Content in collectors

### What would you like to see?

Hi everyone,
I have been testing AnythingLLM for a few days now and had problems finding context in my files...
I played around with the settings but couldn't get it working the way I wanted. It delivered some information but was missing many parts.

Looking at the citations showed that the chunks from my office files looked like this:


Information....
10 empty lines
some information...
8 empty lines
Footer

All in all, there are many empty lines, probably because of style elements in the document, and redundant information because of the footer on every page.

So I tried to "compress" the information a little by changing the document collectors/converters, adding:


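// Remove empty lines and every repeated line (e.g. a footer on each page),
// keeping only the first occurrence of each line.
function deduplicateContent(content) {
  const seen = new Set();
  return content
    .split("\n")
    .filter((line) => {
      if (line.trim() === "") return false; // drop empty or whitespace-only lines
      if (seen.has(line)) return false; // drop lines that were already seen
      seen.add(line);
      return true;
    })
    .join("\n");
}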

and, a little deeper in the converter code, calling it on the joined page content:

  const content = deduplicateContent(pageContent.join("\n"));

Here is an example file:
[asDocx.txt](https://github.com/user-attachments/files/17870044/asDocx.txt)

The result is that all redundant lines are removed, and the empty lines too (which are certainly redundant as well ;-) ).
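A small runnable check of that behaviour, as a sketch: the two-page sample text below is made up for illustration (it is not taken from the attached file), and the function is repeated only so the snippet runs on its own.

```javascript
// Same function as in this issue, repeated so the snippet is self-contained.
function deduplicateContent(content) {
  const seen = new Set();
  return content
    .split("\n")
    .filter((line) => {
      if (line.trim() === "") return false;
      if (seen.has(line)) return false;
      seen.add(line);
      return true;
    })
    .join("\n");
}

// Hypothetical two-page extract: blank lines pad the text and the footer repeats.
const pageContent = [
  "Information....\n\n\n\nsome information...\n\n\nFooter",
  "More information\n\n\nFooter",
];

const content = deduplicateContent(pageContent.join("\n"));
console.log(content);
// Information....
// some information...
// Footer
// More information
```

Note that the comparison is on the exact line text, so two lines that differ only in leading or trailing whitespace are both kept.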

I don't know if this is the best way to do it, but it works and helps me a lot, so AnythingLLM can send better context to the LLM...
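One possible refinement, as a rough sketch only: removing every repeated line also drops legitimate repeats (for example the same short answer in several table rows), and removing every empty line loses paragraph breaks. The variant below trims lines before comparing, collapses runs of empty lines into one, and only deduplicates lines that occur often enough to look like a header or footer. The name `cleanContent` and the `minRepeats` threshold are assumptions for illustration, not anything that exists in AnythingLLM.

```javascript
// Hypothetical variant: cleanContent and minRepeats are illustrative names only.
function cleanContent(content, minRepeats = 3) {
  const lines = content.split("\n").map((line) => line.trim());

  // Count how often each non-empty line occurs in the whole text.
  const counts = new Map();
  for (const line of lines) {
    if (line === "") continue;
    counts.set(line, (counts.get(line) || 0) + 1);
  }

  const seen = new Set();
  const out = [];
  for (const line of lines) {
    if (line === "") {
      // Collapse any run of empty lines into a single paragraph break.
      if (out.length > 0 && out[out.length - 1] !== "") out.push("");
      continue;
    }
    // Lines occurring at least minRepeats times are treated as headers or
    // footers: keep the first occurrence and skip the rest.
    if (counts.get(line) >= minRepeats) {
      if (seen.has(line)) continue;
      seen.add(line);
    }
    out.push(line);
  }

  // Drop a trailing paragraph break left over from trailing empty lines.
  if (out.length > 0 && out[out.length - 1] === "") out.pop();
  return out.join("\n");
}

// Made-up sample: "Footer" occurs three times, "Yes" only twice.
const sample = "Intro\n\n\nFooter\n\nYes\n\n\nFooter\nYes\n\nFooter\n\n";
console.log(cleanContent(sample));
```

With the default threshold, the sample keeps both "Yes" lines and a single "Footer". Whether a threshold like this fits real documents would need testing against actual files.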

Maybe something like this could be implemented by someone who is able to make it better ;-)
