@pwm @nuze @Death @w0rm @p This is basically what I'm doing. They're not factoids, they're assertions, and they're graded by sourcing (unsourced, anonymously sourced, named source, linked source). For each assertion I build a list of up to 20 words representing its entities, plus a similar list of up to 20 words for the entities of the article as a whole.
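A minimal sketch of how those graded assertions could be modeled (the class and field names here are illustrative, not my actual schema):

```python
from dataclasses import dataclass, field
from enum import IntEnum

class SourcingGrade(IntEnum):
    # Ordered so a higher value means stronger sourcing.
    UNSOURCED = 0
    ANONYMOUS = 1
    NAMED = 2
    LINKED = 3

@dataclass
class Assertion:
    text: str
    grade: SourcingGrade
    entities: list[str] = field(default_factory=list)

    def __post_init__(self):
        # Cap the entity list at 20 words, per the scheme above.
        self.entities = self.entities[:20]

@dataclass
class Article:
    url: str
    assertions: list[Assertion]
    entities: list[str]  # up to 20 words for the whole article
```

Making the grade an ordered enum means "is this better sourced than that?" is just an integer comparison.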
I find syndicated articles by cosine similarity. If two articles match at more than 75% cosine similarity, then the outlet took the syndicated copy, added editor's notes or additional commentary, and pushed it out as their own, as is commonly done.
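The syndication check boils down to something like this, here sketched over plain token-count vectors (my real pipeline may vectorize differently):

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity over simple token-count vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

SYNDICATION_THRESHOLD = 0.75  # the 75% cutoff described above

def looks_syndicated(text_a: str, text_b: str) -> bool:
    return cosine_similarity(text_a, text_b) > SYNDICATION_THRESHOLD
```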
Then I use simhash to ask, "if this isn't syndicated, is it about the same thing?" This is fuzzier: 55-60% simhash matching means the articles share a lot of the same subjects, verbs, and phrases.
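For reference, a toy 64-bit simhash over whitespace tokens, with similarity as the fraction of matching bits (real implementations usually use shingles or weighted features, so treat this as a sketch):

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """64-bit simhash over whitespace tokens (toy feature set)."""
    v = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def simhash_similarity(a: int, b: int, bits: int = 64) -> float:
    """Fraction of matching bits; 0.55-0.60 is the fuzzy 'same topic' band."""
    return 1 - bin(a ^ b).count("1") / bits
```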
Once we're past that point, I compare the entity lists of the two articles and their assertions, measure the similarity/difference of the assertions in each article, and, when I have two matching articles, literally ask the AI: "are we looking at news, or are we looking at an advertisement, astroturfing, sponsored content, or fluff?"
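The entity-list comparison can be as simple as a Jaccard overlap; the prompt below is illustrative wording, and the actual model call is whatever client you're using:

```python
def entity_overlap(entities_a: list[str], entities_b: list[str]) -> float:
    """Jaccard overlap between two entity word lists."""
    a, b = set(entities_a), set(entities_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Illustrative classification prompt; the real one is tuned per model.
CLASSIFY_PROMPT = (
    "Here are two matched articles. Are we looking at news, or at an "
    "advertisement, astroturfing, sponsored content, or fluff? "
    "Answer with exactly one label."
)
```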
There was some work put into an agent that can clean things up after the fact, but that hasn't proved helpful. I'll likely work out a better architecture and plan for a version two that simplifies some of it, expands other parts, and yes, leans into the graph -- right now the graph can be selected from the sqlite db, but it's rudimentary.