Egregoros

@pwm @p @w0rm @Death so I uh, I made another thing. this is kinda what I meant to build like a year ago but now it actually works.

it still needs work, and if I set it up as a fedi bot I would want a new account because it will be spammy for the first 15min. but basically its job is to correlate unique articles on the same news stories, and extract verifiable and confirmed facts while dismissing ideology, advocacy journalism, etc.
@vii @Death @pwm @w0rm

> its job is to correlate unique articles on the same news stories, and extract verifiable and confirmed facts while dismissing ideology, advocacy journalism, etc.

I'll reenable registrations if you wanna do it on FSE.

What would be *very* interesting is, while you're doing the fact extraction pass, you could do sentiment by topic/source. If you wanna do two bots, that would be cool.
@p @pwm @Death @w0rm This is definitely doable but I think I would need to spend some more time on the centroid problem; if you check out @nuze right now you'll find that the same story gets multiple centroids, in no small part due to how the story changes over time, but also due to the perspectives from which people choose to write their articles. "Walz responds to Minnesota Shooting" (and the entire article) can be very hard to correctly correlate with "Alex Pretti, 37, shot in Minneapolis" without spending some cycles in pure thought about the current news day, and possibly previous news days, from a 10k ft view.
@vii @nuze @Death @w0rm @p Apologies if this is what you're already doing, but I think the way you address this is to get the thing to not treat an individual article as the atomic unit, but rather to treat the article as a collection of factoids.

If you go this route your task is to distill articles into a smaller atomic unit, call 'em "factoids," and treat each article as a graph that connects factoid vertices. Putting together multiple sources becomes the union of those graphs that share vertices, and you can treat developments to stories (really, graphs) as coming in two kinds: updates to the attributes of existing vertices, or the linking in of new vertices. I guess it would make your real task the factoid distillation, and less so synthesis directly from multiple sources. It would make ingestion harder but synthesis much easier.

I dunno if this is practical, but it would be the approach I would begin to think along given the task of deduplication of information.
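A minimal sketch of that union-of-graphs idea (the vertex names and attributes are illustrative, loosely based on the story mentioned upthread; the real schema would surely differ):

```python
def merge_graphs(a, b):
    """Union two factoid graphs that share vertices.

    A graph is {"vertices": {id: attrs}, "edges": set of (id, id) pairs}.
    Developments show up either as attribute updates on shared vertices
    or as newly linked-in vertices and edges.
    """
    merged = {
        "vertices": {vid: dict(attrs) for vid, attrs in a["vertices"].items()},
        "edges": set(a["edges"]) | set(b["edges"]),
    }
    for vid, attrs in b["vertices"].items():
        # shared vertices get updated attributes; new vertices are linked in
        merged["vertices"].setdefault(vid, {}).update(attrs)
    return merged

# Two articles on the same story, written from different angles:
early = {
    "vertices": {"shooting": {"city": "Minneapolis"}, "victim": {}},
    "edges": {("shooting", "victim")},
}
later = {
    "vertices": {"victim": {"age": 37}, "response": {"who": "Walz"}},
    "edges": {("shooting", "response")},
}
story = merge_graphs(early, later)
# story now has three vertices; "victim" picked up the age attribute
```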
@pwm @nuze @Death @w0rm @p This is basically what I'm doing. They're not factoids, they're assertions and they're graded by sourcing (unsourced, anonymously sourced, named source, linked source), and I create a list of up to 20 words that represent the entities in that assertion, as well as a list of up to 20 words that represent the entities of the article.
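For what it's worth, a hypothetical sketch of that assertion record (the field names are my guesses, not the actual schema; only the four sourcing grades and the 20-word caps come from the post above):

```python
from dataclasses import dataclass, field

# The four sourcing grades described above, weakest to strongest.
SOURCING_GRADES = ("unsourced", "anonymously sourced", "named source", "linked source")

@dataclass
class Assertion:
    text: str
    sourcing: str                                 # one of SOURCING_GRADES
    entities: list = field(default_factory=list)  # up to 20 entity words

    def __post_init__(self):
        if self.sourcing not in SOURCING_GRADES:
            raise ValueError(f"unknown sourcing grade: {self.sourcing}")
        self.entities = self.entities[:20]        # enforce the 20-word cap

a = Assertion("suspect arrested downtown", "named source", ["suspect", "downtown"])
```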

I find syndicated articles by cosine similarity match. If the article is more than 75% matching by cosine similarity, then the outlet took the syndicated article, added editor's notes or additional commentary, and pushed it out as its own, as is commonly done.
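The syndication check might look something like this toy bag-of-words version (the real vectorization is presumably fancier; only the 75% threshold comes from the post):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity over bag-of-words term counts (a toy stand-in
    for whatever vectorization the bot actually uses)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

SYNDICATION_THRESHOLD = 0.75  # the threshold from the post

# A wire story, and a local outlet's copy with a note tacked on:
wire = "city council approves new transit budget after lengthy debate"
local = wire + " with an editor's note"
# cosine_similarity(wire, local) clears the threshold, flagging syndication
```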

I use simhash to then say, "If this isn't syndicated, is it about the same thing?" This is fuzzier. 55-60% simhash matching means that they're using a lot of the same subjects, verbs, and phrases.
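A toy 64-bit simhash over word tokens, just to show the mechanism (the real implementation presumably hashes shingles/phrases rather than single words):

```python
import hashlib

def simhash(tokens, bits=64):
    """Toy simhash: each token votes +1/-1 on every fingerprint bit;
    the fingerprint keeps the sign of each bit's vote total."""
    votes = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def bit_agreement(a, b, bits=64):
    """Fraction of fingerprint bits that agree (cf. the 55-60% figure)."""
    return 1 - bin(a ^ b).count("1") / bits

doc = "police say the suspect fled the scene on foot".split()
fp = simhash(doc)
```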

Once we're past this point, for two matching articles I'm comparing the entity-lists of the two articles and their factoids, I'm comparing the similarity/difference of the factoids in each article, and I'm literally asking the AI, "are we looking at news, or are we looking at an advertisement, astroturfing, sponsored content, or fluff?"
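One plausible way to score the entity-list comparison is plain Jaccard overlap (a guess on my part, not necessarily what the bot actually does; the entity words are illustrative):

```python
def entity_overlap(entities_a, entities_b):
    """Jaccard overlap of two entity lists: |intersection| / |union|.
    A guessed scoring function, not the bot's confirmed method."""
    a, b = set(entities_a), set(entities_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Entity lists for two articles on the same story, different angles:
score = entity_overlap(
    ["walz", "minneapolis", "shooting", "response"],
    ["pretti", "minneapolis", "shooting", "victim"],
)
# 2 shared entities out of 6 distinct ones
```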

There was some work put into an agent that can clean things up after the fact, but that hasn't proved helpful. I will likely try to work out a better architecture and plan for a version two of this that simplifies some of it, expands on other parts of it, and yes, leans into the graph -- right now the graph can be selected from the sqlite db but it's rudimentary.
@vii @nuze @Death @w0rm @p > Once we're past this point I'm now comparing the entity-lists of the two articles and their factoids
Rather than comparing entity lists here in some hellish n^2 fashion, have you thought about trying to get a model to place all the factoids in some sort of vector space where proximity implies they are the same but e.g. phrased differently? Think about the relative similarity of "dead", "died", "killed", "murdered", for example. This seems like a task analogous to image classification, only with language, and might yield interesting results, allowing you to spatially query for similarity rather than directly compare when reversing the process for synthesis at the end. Though you're left with the unenviable task of training a model if no such model (to vectorize a piece of information?) exists. I obviously don't know the lingo very well, but conceptually this sort of makes sense to me.
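That spatial-query idea, sketched with hand-made toy vectors standing in for a real embedding model's output (in practice you'd use an off-the-shelf sentence embedder; these numbers are invented so that near-synonyms land close together):

```python
import math

# Hypothetical toy embeddings: stand-ins for what a trained model
# would emit. Near-synonyms for "killed" cluster; "resigned" does not.
EMBEDDINGS = {
    "died":     (0.90, 0.10, 0.00),
    "killed":   (0.85, 0.20, 0.05),
    "murdered": (0.80, 0.30, 0.10),
    "resigned": (0.00, 0.10, 0.95),
}

def nearest(query_vec, k=2):
    """Spatial similarity query: the k closest keys by cosine similarity,
    instead of an n^2 pairwise comparison of every factoid."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)
    return sorted(EMBEDDINGS, key=lambda w: -cos(query_vec, EMBEDDINGS[w]))[:k]

# Querying near "died" surfaces its phrasing variants, not "resigned".
```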
