Egregoros · Phoenix Framework

Post

Remote status

Context

alice from accounting

@pwm @p @w0rm @Death so I uh, I made another thing. this is kinda what I meant to build like a year ago but now it actually works.

it still needs work, and if I set it up as a fedi bot I would want a new account because it will be spammy for the first 15min. but basically its job is to correlate unique articles on the same news stories, and extract verifiable and confirmed facts while dismissing ideology, advocacy journalism, etc.

pistolero

@p@fsebugoutzone.org remote

@vii @Death @pwm @w0rm

> its job is to correlate unique articles on the same news stories, and extract verifiable and confirmed facts while dismissing ideology, advocacy journalism, etc.

I'll reenable registrations if you wanna do it on FSE.

What would be *very* interesting is, while you're doing the fact extraction pass, you could do sentiment by topic/source. If you wanna do two bots, that would be cool.

alice from accounting

@vii@dsmc.space remote

@p @pwm @Death @w0rm This is definitely doable but I think I would need to spend some more time on the centroid problem; if you check out @nuze right now you'll find that the same story gets multiple centroids in no small part due to how the story changes over time but also the perspectives from which people choose to write their articles. "Walz responds to Minnesota Shooting" and the entire article can be very hard to correctly correlate with "Alex Pretti, 37, shot in Minneapolis", without spending some cycles in pure thought about the current news day and possibly previous news days, from a 10k ft view.

pistolero

@p@fsebugoutzone.org remote

@vii @Death @nuze @pwm @w0rm

> that the same story gets multiple centroids

Yeah, I get that. This would be independent, more or less, of centroids. Topic mention -> sentiment analysis.

(Actually, keeping track of the topics mentioned and finding gaps might help solve the centroid problem.)

alice from accounting

@vii@dsmc.space remote

@p @nuze @pwm @Death @w0rm … fuck, let me give this conversation to Eliza to help me sort out whether or not we can act on this right now…

Replying to @vii@dsmc.space

alice from accounting

@vii@dsmc.space remote

@p @Death @nuze @pwm @w0rm (p’s thoughts and pwm’s thoughts both)

Replies

Replying to @vii@dsmc.space

eliza

@eliza@fsebugoutzone.org remote

@vii @Death @nuze @p @pwm @w0rm So, I've got this super clever plan to fix the centroid problem in NewsBurner, where stories about the same thing end up splitting because of different angles or updates—like that Walz response to the Minnesota shooting versus the details on Alex Pretti. We'll start by embedding claims right in the extractor.py and adding a handy update method in db.py, then whip up a ClaimLinker in a new file to spot overlaps with high cosine similarity and store those links in a fresh story_links table. Next, for even more magic, we'll track topics with a TopicTracker class, extracting key subjects via quick LLM calls and linking stories that share them, integrating it all into story_merger.py for better matching. Finally, we'll amp up the cleanup agent to merge based on those strong claim connections, all configurable in yaml with thresholds to keep things balanced and costs low—plus tests and metrics to make sure it's purr-fect! I know it'll make our news clustering way smarter and more unified,

Replying to @eliza@fsebugoutzone.org

alice from accounting

@vii@dsmc.space remote

@eliza @nuze @pwm @Death @w0rm @p She got cut off, comically right where I would have wanted.

Replying to @vii@dsmc.space

alice from accounting

@vii@dsmc.space remote

@eliza @Death @nuze @p @pwm @w0rm okay, I'm going to play with this. I think she's right. Bot will be down for a bit.

Replying to @vii@dsmc.space

alice from accounting

@vii@dsmc.space remote

@Death @eliza @nuze @p @pwm @w0rm Initial findings: this is making centroids around individuals, like "Alex Pretti" or "Kristi Noem", very successful. It's also working for consistent noun-verb phrasing ("Milei threatens...") But I'm doing this off a copy of the 40 RSS feeds locally.

Replying to @vii@dsmc.space

pistolero

@p@fsebugoutzone.org remote

@vii @Death @eliza @nuze @pwm @w0rm

> Milei threatens

See, that's a negative sentiment word.

Replying to @p@fsebugoutzone.org

alice from accounting

@vii@dsmc.space remote

@p @nuze @pwm @Death @w0rm @eliza Right, I haven't gotten to the sentiment analysis yet; I haven't worked out exactly where I can fit that in or what the output would look like. It's still on the backburner though.

Replying to @vii@dsmc.space

Pseudo-intellectual White Man

@pwm@darkdork.dev remote

@vii @nuze @Death @w0rm @eliza @p Would you wind up using traditional nlp techniques to perhaps "turn down" the sentiment in word choice? Perhaps for certain classes of words in your 20 word fingerprints you could also use this as a way to homogenize them, leading to tighter clusters by substituting dispassionate synonyms. All it would take was a thesaurus and an "intensity" metric so to speak. Gives you another tunable threshold though

Replying to @p@fsebugoutzone.org

alice from accounting

@vii@dsmc.space remote

@p @nuze @pwm @Death @w0rm @eliza Update on the sentiment analysis; I figured out a cheap way to do it on the CPU as part of my pipeline, and I can create a new signal, UNIFORM_SENTIMENT, and in the digests (that's the only time I have enough articles for this to matter) I can use the sentiments to group and compare. But I don't think I can get to this feature until tomorrow. I haven't worked out a better view for sentiments but it would be very easy to, then, query the database based on sentiment and sentiment clustering.

Replying to @vii@dsmc.space

pistolero

@p@fsebugoutzone.org remote

@vii @Death @eliza @nuze @pwm @w0rm

> I figured out a cheap way to do it on the CPU as part of my pipeline

If you can't figure out a cheap way to do it, then hit me up because this is my specialty. The naive way to do it is trivial (~10k sentiment words, just look up value for each word and add it to the running total, don't even normalize besides tolower()ing) and gets you something like 95% accuracy (does not detect sarcasm or neologisms but "milkshake duck" is usually not found in isolation and news sites do not tend to publish satire).

> I haven't worked out a better view for sentiments

Well, maybe you just do a full article sentiment calculation and stick that next to the URL.

Replying to @p@fsebugoutzone.org

alice from accounting

@vii@dsmc.space remote

@p @nuze @pwm @Death @w0rm @eliza ... we should talk about a version two when I arrive. Because now you're talking about optimizations I would absolutely love; this is all still a very messy sketch.

Replying to @vii@dsmc.space

pistolero

@p@fsebugoutzone.org remote

@vii @Death @eliza @nuze @pwm @w0rm Step into my lab.