Category: search

  • Mid-caffeination Mastodon Thoughts

    Derek Powazek posted this on Mastodon yesterday:

    An actual use for machine learning that I’d want: a bot that records all the posts that cause me to block someone, saves them into a db, and then automatically hides posts that match above a certain threshold.

    Derek on Mastodon

    I love a good brain exercise, so I’ve been thinking about it, and I don’t actually think this is that hard; it should be very possible using tools you already need to run Mastodon in production.

    I might play with actually implementing this during my week off around cooking and family time, but if someone else wanted to do it, this idea is 100% free.

    To enable search in Mastodon, you have to install and use ElasticSearch, which already has machine learning goodies like nearest neighbor and vector search built in.

    Basically, we should be able to build a very personal spam/block bot for Mastodon given some training data (posts that pushed you to block someone) and some fiddling about (which is the hard/fun part).
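
    To make the training-data half less abstract, here’s a rough sketch, assuming a recent Elasticsearch (8.x, which has dense_vector fields and kNN search) and its official Python client. The index name, field names, and the embed() helper are placeholders I made up for illustration, not anything Mastodon ships with:

      # Sketch only: store the posts that made you hit Block, alongside an
      # embedding, in the same Elasticsearch cluster the instance already runs.
      from elasticsearch import Elasticsearch

      es = Elasticsearch("http://localhost:9200")

      def embed(text: str) -> list[float]:
          """Placeholder: turn a post into a vector with whatever embedding
          model you like (a sentence-transformer, a hosted API, etc.)."""
          raise NotImplementedError

      # dense_vector with kNN indexing needs Elasticsearch 8.x.
      es.indices.create(
          index="blockworthy_posts",
          mappings={
              "properties": {
                  "text": {"type": "text"},
                  "embedding": {
                      "type": "dense_vector",
                      "dims": 384,  # must match whatever embed() returns
                      "index": True,
                      "similarity": "cosine",
                  },
              }
          },
      )

      def record_block(post_text: str) -> None:
          """Save a post that pushed you to block someone."""
          es.index(
              index="blockworthy_posts",
              document={"text": post_text, "embedding": embed(post_text)},
          )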

    Right now, there are no dates on blocks in Mastodon (I haven’t checked the schema yet to see if they’re there but not returned), and you can’t see which post “triggered” the block. I think that could be added fairly easily – or at least something like an “Add this to Blockbot” action to train the bot.

    Mastodon doesn’t really have a plugin architecture yet, so I’m not sure if this should be a standalone app that sits alongside your running Mastodon instance or a feature – I’ll probably try it as a feature to get familiar with Mastodon.

    Basically, we take “blockworthy” posts, index them, and then compare new posts against that index to get a semantic distance. Once we have the distance, we can start manually testing for accuracy and tweaking settings until we get something close to a “block score”. Users could then say, “yep, don’t show me anything with a block score greater than 1.5” and ta-da, a little robot janitor is just cleaning up your feed for you. That’s probably too computationally intensive to do on every post, but I think you could apply it to people you don’t follow who reply to you, to weed out the worst Reply Guys and riff raff.
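
    Continuing the sketch above, the scoring half could be a kNN query against that index. Again, this is only illustrative: the knn search parameter is an Elasticsearch 8.x feature, averaging the top hits is one arbitrary choice among many, and the threshold is exactly the knob a user would tune.

      # Sketch only: score an incoming post by how close it sits to the posts
      # that have made you block people, then hide it above a user-set threshold.
      def block_score(post_text: str) -> float:
          """Higher means 'more like the stuff that made me block people'."""
          hits = es.search(
              index="blockworthy_posts",
              knn={
                  "field": "embedding",
                  "query_vector": embed(post_text),
                  "k": 5,
                  "num_candidates": 50,
              },
              size=5,
          )["hits"]["hits"]
          # With a cosine dense_vector, _score lands in [0, 1]; average the top
          # matches so a single fluke hit doesn't dominate.
          return sum(h["_score"] for h in hits) / len(hits) if hits else 0.0

      def should_hide(post_text: str, threshold: float = 0.85) -> bool:
          # The user's "don't show me anything above X" setting.
          return block_score(post_text) > threshold

    In Mastodon terms, should_hide() is roughly the check you’d run on replies from people you don’t follow, per the paragraph above.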

    You could also have community-wide block bots that are trained on a communal collection of blockworthy posts. It could help get around rigid blocklists by allowing targeted removal of replies from timelines instead of blocking whole instances.

    It could be used for finding good stuff too… Imagine something that found you people who post things like you do and brought them to you. It could be an “attract” bot as well.

    I think, ideally, it could be used like left- and right-handed whuffie. When you come in contact with a profile, how alike and how different are your posts from theirs? Do we agree on anything? Are our disagreements strong enough, and on topics that are sensitive enough, that I probably don’t want to engage with them? Then it’s more informative than just a robot going out and sweeping up my replies.

    Yeah, this is hand-wavy, but a lot of this stuff is already built into ElasticSearch, so it’s not like we have to invent anything (yay, because that’s hard). We just have to assemble it and feed it enough data.

    It should be fun, and I think it could be helpful, especially for folks who get inundated with awful replies.

    And if you beat me to implementing it, that’s great! Then it’ll be out there in the world and we can all play with it!

  • Notes from Supernova: Personal Infosphere

    This panel’s all about how we can keep up with all the information that comes in every day.

    Dalton from imeem

    • We’re reaching some limit as to the amount of information we can handle.
    • imeem creates both an IM client and web client
    • Instant messaging is useful as a communications tool, but it’s really about presence. Presence is actually the most important part of IM clients.
    • They’ve got real-time notification of new blog posts, profile updates, etc.
    • They have groups to “aggregate people around particular topics”
    • Trying to manage all forms of digital information, can pull in data from other services
    • They have a unified tag space across media types (eeenteresting). I wonder how that plays out with users. People tag different content differently; do users of imeem use consistent tags across media?

    Yael from eSnips

    • They have mainstream users, not teens.
    • Social, but focused on content, not people
    • It’s for sharing interests and passions, but lets you go one step further

    Ben from Plaxo

    • 5-year-old company
    • Synchronized address book
    • People have on average:
      • 3-4 phone numbers
      • 2-3 e-mail addresses
      • 2-3 physical addresses
    • And this information is always changing
    • 33% of mobile phone numbers and 24% of e-mail addresses change annually

    Tariq Krim from Netvibes

    • Create a single place for your entire digital life. Another personal portal.
    • They have an open API for module developers.
    • They have a public wiki for users to request features and report bugs
    • They have a really cool live translating tool
    • So they want to use “open standards”, but didn’t really say which ones

    Hans from Plum

    • Connect with each other’s “heads”, not with dates.
    • Collect data of all types and drop it into buckets
    • “Communities of Knowledge”
    • Tiny little application that runs and allows you to add anything you read into a collection.
    • Wow… this is really cool. Collect anything from your desktop and throw it up into a collection. Neato.
    • Everything is indexed and searchable.
    • Works great on the Mac too. Yay!!
    • Also allows you to connect to people with similar collections to yours.
    • They dig microformats as well.
    • They use Amazon’s S3 for the data.

    Discussion

    • Collaborative Filtering
      • imeem uses collaborative filtering to decide how popular or “good” something is; they compared it to PageRank.
      • Plum called on “big” companies like Yahoo and AOL to come up with a good scheme for licensing documents or declaring a document’s license. Time to go read up on rel-license, isn’t it?
    • Lots of talk of ownership while completely avoiding the topic of lock-in and open APIs. Oh well, we’ll talk about it in the next panel.
  • My Standards Story

    Molly’s post about search engines and standards has inspired me to tell my standards evolution story, because it’s really all about AOL Search.

    I worked on AOL Search for five years, from 1999 to 2004. In that time, it went from being “AOL Netfind”, powered by Excite and stuck in a horrible frameset where we had very little control over anything, to something built completely in-house powered by the ODP, to what you see today (powered by a bunch of in-house technology and incorporating results from all over the web, most noticeably from Google).

    I was the only person to touch the frontend code for those five years. I wrote the first in-house version of AOL Search in AOLPress, a WYSIWYG HTML editor that started life as NaviPress. It was a glorious example of old school HTML. It was all uppercase tags, unquoted attributes, tables all over the place and non-breaking spaces. But, it was one of the first successful web products at AOL, and was a whole lot better than NetFind was.

    I started noticing the web standards movement in, I think, 2000. Back then, I couldn’t do anything about it because we had to support Netscape 4.7 and all the other old school browsers. But, in 2000, I removed all the font tags from AOL Search and we started using CSS for text (which was all it was good for back then). Life continued… I started blogging in July of 2000, and in November of 2001, my blog went all CSS-y (Zeldman even wrote about it).

    In 2002, it was time to break out of tables, and we did. I dropped all the tables, and put in a browser sniff to give Netscape 4 users a stinky old tabled header and footer. I didn’t have a DOCTYPE (because I was young and stupid), but we were table-less. This is when our business decided that speed was all-important. They really wanted us to get our load time under six seconds. I don’t remember why six seconds was the magic number, but it was. We still had several large banner ads on the site, and six seconds seemed like an impossible dream.

    I went crazy in 2003. I was on a mission. I was the six second man. I was going to get us there, because dammit, I love a challenge. We started 2003 at about 14 seconds (measured by an internal tool, over a modem, using the AOL client; not perfect, but it was consistent). I dove into the standards, and pored over weblogs, forums and A List Apart, looking for anything that would help me get there. This is when I discovered semantic markup. I started trimming. In the spring, we hit 10 seconds. Then, I put in a standards-mode DOCTYPE, and, without changing anything else, we went from 10 to 8. We were close, and I smelled blood. So, I devised a test where I made a version of the product that didn’t have any ads on it at all. We tested it and it loaded in less than four seconds, according to the tool. Now I had my villain, and the lobbying started. I won’t go into everything I had to go through to get those big ads taken off, and I never actually got them all removed, but I got the “bad” ones taken off, and we replaced them with sponsored links, first from Overture and then from Google.

    In late 2003, we hit 6 seconds. I did a little dance and took a vacation. When I got back, I decided that I could hit four without too much more effort. I started trimming. Everything on the results page was meaningful. All the results were list items. We had headers for result sections. There was a place for everything, and everything in its place. I got better with CSS, wrote better selectors, and shed more meaningless markup and lots of bad CSS. We hit 4 seconds (and broke it for a little while… 3.78 seconds). In the spring of 2004, by our measurements, we were one second faster than Yahoo (and much faster than everyone else) and one second slower than Google. AOL Search, at that moment, was everything I wanted it to be. It was fast. It was standards-compliant (except for those stupid ampersands). It was accessible (at one point, we did all these tests with an internal tool, and we were more accessible than any other major search engine, and scored better than Accessify).

    That’s when I moved on to other stuff. I couldn’t work on it after that. I’d done everything I could with it. In my mind, it was perfect. Changing anything would have broken my heart. I wrote another search app, built on the philosophy of the CSS Zen Garden, that allowed products within AOL to create search products without writing any code – just configure, create your own CSS file, and go.

    Now, I’m doing training and mentoring stuff. I tell other folks how to do this stuff. What did I learn from my five years on AOL Search? Everything. I learned everything. It was challenging, stressful and really scary at times, but I learned more than I would have if I had jumped around from project to project, never seeing the consequences of my choices. With AOL Search, I had this huge high-profile product where every little change made a difference in the company’s bottom line. I had a platform for experimentation, and since I was the only one who touched that one part of it, I had almost complete control over how it was built. I got the blame when I messed up (and I did), and the credit when it worked (and it did, mostly). I got to see how making it faster made people use it more. When we got to six seconds, usage went through the roof.

    Why tell the story? I don’t know. It’s important to me. It’s part of what’s driving me to push standards to the rest of AOL. It’s important because it makes for better products. It’s important because our users benefit. They don’t wait around for pages to download and can actually get on with using the product. We get consistency and maintainability. It’s important to me because the process is repeatable. It’s possible to go from old school, inaccessible and slow to standards-based, accessible and fast. Today, it shouldn’t take you five years to do it. I was hamstrung by the browser environment at the time. You have no excuse. There’s never been a better time to work on the web, and it’s only going to get better. Get to it.

  • Who’s Number One?

    Me! Well, I’m number one for everything dumb. What about me is number one for that? It’s this super dumbness. Dumb, huh?

  • Wow, at the bottom of

    Wow, at the bottom of yesterday’s post, I mentioned Tardy from Greg the Bunny and today I’ve gotten over a dozen visits from people who found me by way of Google searches for Tardy from Greg the Bunny. Google is amazing.

  • Insider Info

    Yesterday, people on AOL searched for “yo mama jokes” as many times as they searched for “notre dame” and “cysts”.

  • Into the fray

    I have a big meeting in fifteen minutes where I have to defend a whole platform against a roomful of people. I doubt the folks who write AOLserver will even show up, which will leave me all alone to defend it. I have papers and figures and drawings, and it probably won’t matter. They’ll make the switch and then they’ll realize that they can’t do everything they used to, and feel bad for not listening to me. I would much rather they listen to me now than realize their mistake later when it’s too late to turn back.

    If I have time today, I’m going to write an essay about national cultures, but don’t hold your breath for it. I’m still sick, and work is really busy. I went to bed at 9:30 last night and barely woke up when the alarm went off.

  • Freedom!!

    Nine days. I have nine days off in a row. I don’t believe it. It’s too good to be true. I have a terrible feeling that I won’t get to take all nine off because something will break at work, or someone will need me to either rescue them from their own incompetence or implement something that will make somebody a whole lot of money.

    Every time I’ve tried to take a vacation this year, I’ve had to move something around. I have to go back to work, do something and mess up our plans. Please, not this week.

    I like what I do. I’m good at it. Whenever I look at where I am, I think back to my interview for this job over two years ago. The guy interviewing me asked what I wanted to be doing in six months. I said I wanted to be the go-to guy. I want to be the guy people come to when something needs doing. I’m that guy. Then he asked what I wanted to be doing in 5 years. I said I wanted to be running a big site like Amazon (yeah, shoot for the moon). Well, it’s almost three years later and I’m the only production guy on one of the most-used search engines on the web. There’s a whole team for the backend. A whole team that keeps it up and running. I am the only guy who works on the frontend and middleware pieces. And now, I’ve got a dozen other search projects that I’m the only frontend guy on. I guess I’m running a collection of sites that gets (I think) more hits than Amazon on any given day. How crazy is that? How messed up is the world that I’m the only guy for these projects?

    You know, this stuff is bad for my ego. It’s made me arrogant. I’m trying not to be, I swear I am. I know I’m in the position I’m in because my group has made some really bad decisions over the years, letting the wrong people get away, while replacing them with people with little-to-no talent or imagination. That means that they make up for the lack of talent in most by overworking those that have some. That’s also the way they lose good people. It’s a vicious cycle, and now the economics of everything mean we’re not hiring. So, it will be this way for the foreseeable future. What a downer…