Description
What would you like to see?
Hi all,
I have been testing AnythingLLM for a few days now and had problems getting it to find context in my files.
I played around with the settings but couldn't get it working the way I wanted. It delivered some information but was missing many parts.
Looking at the citations showed that the chunks of my office files looked like this:
Information....
10 empty lines
some information...
8 empty lines
Footer
All in all, there are many empty lines, probably because of style elements in the document, and redundant information because the footer appears on every page.
So I tried to "compress" the information a little by changing the document collectors/converters, adding:
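// Removes empty lines and every line that has already appeared once.
function deduplicateContent(content) {
  const seen = new Set();
  return content
    .split("\n")
    .filter((line) => {
      if (line.trim() === "") return false; // drop empty / whitespace-only lines
      if (seen.has(line)) return false; // drop exact repeats (e.g. page footers)
      seen.add(line);
      return true;
    })
    .join("\n");
}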
And, a little further down in the converter:
const content = deduplicateContent(pageContent.join("\n"));
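To illustrate the effect, here is a small self-contained sketch that runs the same function on a made-up chunk shaped like the one described above. The sample text is invented, and `pageContent` is only assumed here to be an array with one string per page; the real collector may be structured differently.

```javascript
// Same function as above, repeated so this snippet runs on its own.
function deduplicateContent(content) {
  const seen = new Set();
  return content
    .split("\n")
    .filter((line) => {
      if (line.trim() === "") return false; // drop empty / whitespace-only lines
      if (seen.has(line)) return false; // drop exact repeats (e.g. page footers)
      seen.add(line);
      return true;
    })
    .join("\n");
}

// Hypothetical input: two "pages" padded with blank lines and a repeated footer.
const pageContent = [
  "Information\n\n\n\nsome information\n\n\nFooter",
  "more information\n\n\nFooter",
];

const content = deduplicateContent(pageContent.join("\n"));
console.log(content);
// Prints four lines:
// Information
// some information
// Footer
// more information
```

Note that the comparison is on the exact line, so "Footer" and "Footer " (with a trailing space) would count as different lines.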
Here is an example file:
asDocx.txt
The result is that all redundant lines are removed, and the empty lines too (which are certainly redundant as well ;-) ).
I don't know if this is the best way to do it, but it works and helps me a lot, so AnythingLLM can send better context to the LLM.
Maybe something like this could be implemented by someone who is able to do it better ;-)
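As a starting point for such an improvement, here is a rough sketch of one possible variant. It is only an idea, not anything that exists in AnythingLLM: the name `compactContent` and the option `minDedupeLength` are made up for this example. It normalizes whitespace before comparing lines (so footers that differ only in spacing are still caught), handles Windows line endings, and can optionally keep short repeated lines such as "Yes" in a table, which the simple version would throw away.

```javascript
// Hypothetical variant: `compactContent` and `minDedupeLength` are
// illustrative names, not part of AnythingLLM.
function compactContent(content, { minDedupeLength = 0 } = {}) {
  const seen = new Set();
  return content
    .split(/\r?\n/) // accept both \n and \r\n line endings
    .map((line) => line.trim().replace(/\s+/g, " ")) // normalize whitespace
    .filter((line) => {
      if (line === "") return false; // drop empty lines
      if (line.length < minDedupeLength) return true; // keep short lines even if repeated
      if (seen.has(line)) return false; // drop repeated longer lines (e.g. footers)
      seen.add(line);
      return true;
    })
    .join("\n");
}

// Invented sample: a repeated short cell value and a footer with uneven spacing.
const sample =
  "Yes\r\nCompany Inc.  -  Page footer\n\nYes\n  Company Inc. - Page footer  ";
const compacted = compactContent(sample, { minDedupeLength: 10 });
console.log(compacted);
// Prints three lines:
// Yes
// Company Inc. - Page footer
// Yes
```

The trade-off is that trimming also removes indentation, which could matter for documents where leading whitespace carries meaning.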