mTrawl – A free gift for the web community

For a number of years the landscape of free website checkers hasn’t really moved forward much. Since the introduction of Xenu’s Link Sleuth it seems like everyone said game over and stopped trying with free link checkers and other quality tools.

We don’t want that to be the case. Our tools, frankly, aren’t for everyone (yes, I really did just type that). If you’re a professional web developer working on many sites, or the owner of quite a large site, then DeepTrawl and CloudTrawl make complete sense. You need pro tools, because your time is very valuable. But if you’re an amateur web developer, or a pro in the making, you may simply not have the cash to splash out on something commercial. You may want something free. That doesn’t mean the tool you use should be slow, unpolished, or lacking in essential features.

This is why we’ve created mTrawl. It’s based on DeepTrawl, just with fewer features. It’s aimed squarely at those who need something to check their website works right, but can’t justify spending money on a product or service.

[Image: mTrawl]

So what does mTrawl do? Check it out:

It’s a link checker

Install it on your PC or Mac & click start. mTrawl rips through every page of your site (yes, every page, there are no limits) & finds the broken links. It shows every broken link on every page in a really easy to read report. It even gives you the line number where the broken links were found.
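
For the curious, here’s roughly what a link checker has to do under the hood. The Python sketch below (standard library only) is just an illustration, not mTrawl’s actual code, and example.com stands in for a real site: it parses one page, remembers the line number of each link, then requests each target and reports hard failures. A real checker like mTrawl does this for every page of the site, not just one.

from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    # Collects every <a href> together with the source line it appeared on.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                line, _column = self.getpos()
                self.links.append((line, href))

def check_page(page_url):
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(html)
    for line, href in collector.links:
        target = urljoin(page_url, href)
        if not target.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript:, #anchors etc.
        try:
            urlopen(target, timeout=10)
        except HTTPError as err:
            print(f"{page_url} line {line}: broken link {target} ({err.code})")
        except URLError as err:
            print(f"{page_url} line {line}: unreachable {target} ({err.reason})")

check_page("https://example.com/")  # example URL only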

It’s a validator

This is where mTrawl really excels. If you just want a free link checker there are many options, but none we’re aware of also validate your html. While mTrawl is checking each page for broken links it’s also validating every page, just like the W3C validator does. Why would you want this? Well, validation is really, really important. It checks your code is correct, i.e. compliant with a standard (either html 5, xhtml or html 4.01). If you have errors in your code, that could damage your seo or mean your site doesn’t render correctly in different browsers.
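
If you want a feel for what a validation report contains, the W3C’s public checker can be scripted against. The snippet below is a hedged sketch using the Nu Html Checker’s JSON output (the endpoint and response fields are assumptions based on its public documentation, and nothing to do with mTrawl’s internals); it needs the requests library.

import requests

def validate_html(html_source):
    # POST the raw markup to the public Nu Html Checker and ask for JSON back.
    response = requests.post(
        "https://validator.w3.org/nu/?out=json",
        data=html_source.encode("utf-8"),
        headers={"Content-Type": "text/html; charset=utf-8"},
        timeout=30,
    )
    for message in response.json().get("messages", []):
        # Each message has a type ("error" or "info"), a line number and a description.
        print(message.get("type"), message.get("lastLine"), message.get("message"))

validate_html("<!DOCTYPE html><title>Test</title><p>Unclosed <b>tag")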

It’s really very, very polished

It may have far fewer features than its sister, DeepTrawl, but it’s just as polished. Just because it’s a free product doesn’t mean it should look bad, crash, only work with some older operating systems, or generally just annoy you with little niggles. Everything we know about producing a successful commercial product also went into mTrawl, so you’ll feel good using it. Of course we hope that one day you’ll love it enough to move over to one of our pro products. Until then, happy trawling!

You can download mTrawl here.

How to spell check web pages effectively

Making sure to spell check web pages is really important. This isn’t a new issue. The problem of spelling errors is as old as the written word, but the modern web does present some unique challenges. When you review a page on the screen, you need to make sure you’re checking all the text. This is sometimes tricky. Take a look at this example:

[Image: spelling]

Can you see the spelling error? Try looking again. Still no? Here it is…

[Image: spellMenu]

The reason you couldn’t see the spelling error in the (made up) example is that it was hidden until the mouse rolls over the menu, and this is a common issue when spell checking sites.

If you spell check web pages manually in your browser you need to be extremely careful to check the “hidden” areas of the page. Often these are simply menus, where there isn’t too much text to get wrong, but in some sites these hidden areas can be a lot bigger…

When you spell check web pages, do you check the tabs too?

In the above example there’s a large amount of text hidden behind those tabs. In fact, taken together they contain more text than the rest of the page. So the first lesson is: beware of tabs, menus and other hidden areas, they often contain text that needs to be spell checked.

The following sections contain hints on checking other parts of pages which, in some cases, are completely hidden when you view them in the browser, yet are some of the most important parts of the content.

The title

The <title> tag in the <head> section of a web page contains the text that will appear in a couple of places:

1) The browser tab used to open the page:

[Image: title]

2) Search engine results. Search engines use the title tag in their listings. This makes the title an incredibly important page element to check!

[Image: search]

The description

This is a slightly more obscure part of a page. It never appears in the browser when the page is viewed. The description is found in the top of your html, like this:

<html>
<head>
<title>Your title</title>
<meta name="description" content="Text describing the content of the page">
</head>

The description is often used by the search engines to describe the page, enticing the viewer to click on it. Here’s an example of how it appears:


[Image: description]
Since the title and description may make someone decide whether or not to view your page, getting the spelling right here is very important.

Image alts

Image alts are put into the html of your page alongside images. They provide a description of what’s in the image. They look something like this:

<img src="imageFile.png" alt="A description of what's in the image" />

These are fairly important because image alts are shown by some web browsers when a visitor’s mouse is positioned over the image. They’re also used in indexing images to be shown in systems like Google’s image search. Finally, image alts can sometimes be read out as an alternative to the image by screen reader software.

How to handle these

If you’re spell checking manually by opening pages in your browser it’s important to check tabs, menus etc. to find hidden content. It’s also important to do right click > View page source and make sure all the hidden areas (title, description, image alts) are spelled correctly.
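
If you’d rather script the gathering step, here’s a rough sketch of pulling the easily-missed text out of a page so you can run it through any spell checker you like. It assumes the requests and beautifulsoup4 packages and is only an illustration, not how DeepTrawl works internally.

import requests
from bs4 import BeautifulSoup

def text_to_spell_check(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    pieces = []
    if soup.title and soup.title.string:
        pieces.append(soup.title.string)  # the <title> text
    description = soup.find("meta", attrs={"name": "description"})
    if description and description.get("content"):
        pieces.append(description["content"])  # the meta description
    for img in soup.find_all("img"):
        if img.get("alt"):
            pieces.append(img["alt"])  # every image alt
    # get_text() walks the whole document, so menu and tab text that css
    # or JavaScript hides from view is included too.
    pieces.append(soup.get_text(separator=" ", strip=True))
    return "\n".join(pieces)

print(text_to_spell_check("https://example.com/"))  # example URL only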

There is also a better way. Our tool, DeepTrawl, will automatically check all the content of every page and all of the above. With a single click you can find the spelling errors in an entire site.

[Image: spell check web pages]

DeepTrawl v4 is here

[Image: v4]


It’s been a little while since DeepTrawl has seen a major upgrade so this one is truly huge. Here’s an overview of the most important new features.

Html 5 validation

This is something we’re very proud to announce. Html 5 validation (along with xhtml & html 4.01) is now baked in. In fact it’s exactly the same html 5 validation you’ll get from the W3C’s own validator. Of course, as with every DeepTrawl check, this works on your entire site with one click. More.

Css Validation

This is a brand new feature in v4. When a site is trawled all of the internal, external and inline styles are read, just like the html. Css validation shows you all the errors alongside all the site’s other issues. But we’ve gone a lot further than just validation. Css is now a first class citizen – user added checks can now analyze css, and its import, font & image links are checked. More.
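
To give a flavour of what “css links” means in practice, here’s a small, hedged sketch (regular expressions rather than a real css parser, and not DeepTrawl’s code) that pulls out the @import, font and image URLs a stylesheet refers to, so they could be checked like any other link.

import re

CSS_URL = re.compile(r"""url\(\s*['"]?([^'")]+)['"]?\s*\)""")
CSS_IMPORT = re.compile(r"""@import\s+(?:url\()?\s*['"]?([^'");]+)""")

def css_references(css_text):
    # Every url(...) plus every @import target, de-duplicated.
    return sorted(set(CSS_URL.findall(css_text)) | set(CSS_IMPORT.findall(css_text)))

sample = """
@import url("base.css");
body { background: url(img/bg.png); }
@font-face { src: url('fonts/brand.woff2') format('woff2'); }
"""
print(css_references(sample))  # ['base.css', 'fonts/brand.woff2', 'img/bg.png']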

(Much) better html exports

We’ve upgraded html exports hugely. They’re now beautifully rendered in html 5 & css. We’ve also added branding options – you can now add your own logo, colors and text, making html exports perfect for sharing with clients.

Dependency checks

DeepTrawl has always checked for broken links. In fact, for a long time it was known mainly as a link checker. Starting in v4 we’ve added the dependency check, which does link checking for things like JavaScript, font & iframe imports. Ever seen a page with a broken css import? We bet you have, and would wager the page was ugly as hell – now you never have to torture your visitors like this. More.
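
As a rough illustration of what a dependency check has to look at (a sketch using BeautifulSoup, not DeepTrawl’s implementation), these are the kinds of references that can quietly 404 without a normal link checker noticing; font links come from inside the css, as in the earlier css sketch.

from bs4 import BeautifulSoup

def page_dependencies(html_source):
    soup = BeautifulSoup(html_source, "html.parser")
    deps = [s["src"] for s in soup.find_all("script", src=True)]   # JavaScript imports
    deps += [f["src"] for f in soup.find_all("iframe", src=True)]  # iframes
    for link in soup.find_all("link", href=True):
        if "stylesheet" in link.get("rel", []):                    # css imports
            deps.append(link["href"])
    return deps

sample = '<link rel="stylesheet" href="site.css"><script src="app.js"></script><iframe src="/map.html"></iframe>'
print(page_dependencies(sample))  # ['app.js', '/map.html', 'site.css']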

Better UI

The interface in DeepTrawl is now a lot sexier. We’ve gone for a cross platform look & feel that works really well on all modern platforms like Windows 7, 8 & OS X Mavericks. Besides looking better, the interface now also works better. We’ve added features like Chrome-style reorderable tabs and a new Monitor which pops out from the bottom of the screen, instead of being in a separate window.

Improved filters

DeepTrawl has had filter tabs for a while now – they allow you to see pages with specific errors in a new tab. We’ve enhanced them with two new features:

1) It’s now possible to hide all other errors in the filter tab.

2) Filter tabs can now be used to filter by url. This allows you to show only errors in specific parts of your site or even zero in on a single page in the error results.

Check analytics code

Did you remember to put your analytics tracking code in *every* … *single* … *page* in your site? 100% sure? Now DeepTrawl can tell you which pages you’ve missed.

Try it

There are many more new features and enhancements (more than twice as many as we’ve covered here). We suggest you try it out.

Why Google should use plus for rankings – and almost certainly will

Recently there was a bit of a stir caused by Moz.com discovering that there’s a correlation between sharing on Google+ and the ranking of pages. Their data shows that pages with more shares tend to get higher rankings.

They made the case that Google+ is actually something of an seo machine – that each post is like a mini seo’d article. This seemed very odd to us. Google own all that data – they don’t actually need to crawl it all with their spider to index it. It’s already in a database sitting in one of their data centers. Now, maybe they do just crawl it because everything isn’t always as joined up as we might imagine, but that seems unlikely. As the article points out, shares on + are indexed way faster than most pages, indicating some kind of special sauce.

But, here’s the really interesting part: The article explicitly stated that giving a +1 to a page wasn’t directly influencing its ranking. In a response to the article on Hacker News (later appended to the article), Google’s own Matt Cutts even took the time to explicitly rule this out:

It is not the +1’s themselves that are causing the high rankings of posts but the fact that most +1’s on a site result in a shared post on Google+, which creates a followed link back to the post. It’s instant organic link building.

This got us thinking… why on earth wouldn’t Google use +1’s to directly influence page rankings? The obvious answer is that it would be open to manipulation by spammers. That’s true, but by all accounts Google’s previous most powerful weapon (PageRank) has been demoted in relevance by them because it’s being spammed way too much. Generating spammy links is just too easy these days.

The advantage of +1’s is that they’re tied to your Google account, which gives the search giant a lot of really useful details to home in on spammers and ignore them. For example (there’s a toy sketch after the list below), they could:

- Not count any +1’s generated in the x days after an account was created

- Discount any by accounts which have +1’d a number of sites deemed to be spammy

- Weight against +1’s from non-verified accounts.
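
To make the thought experiment concrete, here’s a toy weighting function along those lines. It’s pure speculation on our part – the parameter values are made up and it has nothing to do with how Google actually ranks anything.

def plus_one_weight(account_age_days, spammy_plus_ones, verified, min_age_days=30):
    # Ignore +1's from accounts younger than some threshold entirely.
    if account_age_days < min_age_days:
        return 0.0
    weight = 1.0
    # Halve the weight for every spammy site the account has +1'd.
    weight *= 0.5 ** spammy_plus_ones
    # Weight against non-verified accounts.
    if not verified:
        weight *= 0.5
    return weight

print(plus_one_weight(account_age_days=400, spammy_plus_ones=0, verified=True))  # 1.0
print(plus_one_weight(account_age_days=5, spammy_plus_ones=0, verified=True))    # 0.0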

That’s just a quick thought experiment. Given time one could generate hundreds of spam-spotting metrics using the data Google holds about all of us. Hell – I might seriously demote the +1’s of anyone who’d Googled phrases like “black hat seo” in the past – ooh – creepy!

But slightly sinister jokes aside – it’s clear that this data could help Google create a better search. Even if they aren’t using +1’s now, it’s a good bet they will in the future. So, if there weren’t enough benefits already, installing those +1 buttons now is probably a good idea – you might just see a direct rankings boost from it in the future.

The world’s biggest companies have boatloads of broken links

Since we recently released CloudTrawl we decided to undertake some research to show just how valuable it is. The uptime of major websites, and the damage to reputation and profits that downtime causes, has been written about extensively, so we decided to go a different way. Every web user has seen a broken link; they often make our blood boil & frequently people will leave a site on seeing one, assuming the content they’re seeking simply doesn’t exist. 404 has become the new blue screen of death. Broken links are a real risk to reputation & profit, but we’ve never seen a comprehensive study on just how common they are in major sites.

We decided to undertake that study and to perform it on the group of sites whose owners aren’t lacking in resources: the Fortune 500.

The Results

Here’s a big figure to open with:

[Infographic: Fortune500_1]

You read that right, 92% of the sites in our sample included at least one broken link & most had several. 68% had more than 10 broken links, 49% had more than 50 and a surprising 43% of Fortune 500 sites had more than 100 broken links.

We also broke down the number of pages which had broken links against the total number of pages in each site. A stunning 13% of all pages in Fortune 500 sites have at least one broken link (many pages have several).

[Infographic: Fortune500_2]

What isn’t shown in the figures is the importance of some of these links. We saw examples of broken links to annual reports, quarterly statements, social presences (e.g. broken Facebook links) & external + internal news articles. Perhaps most worrying were the unreachable legal notices & terms & conditions documents. Along with making users leave the sites (& possibly making lawyers pass out!) these things are bad for search engine optimization. Google won’t be able to find these pages & sites will be penalized.

Our Method

To get a fair cross section of the Fortune 500 we chose 100 companies at random across the set. We entered their names into Google and picked the first US / international result owned by that company. This resulted in a mix of sites. Some were corporate (company news, quarterly statements etc.) and some were online presences for customers (stores & marketing). We rejected any sites which CloudTrawl didn’t finish crawling in 5 hours or which contained more than 5,000 pages (these can sometimes spawn loops in page generation and unfairly bias results; search engines also stop crawling sites if they think this is happening).

To eliminate false positives we quality checked results both randomly and where sites contained a high percentage of broken links. To make sure the headline figures weren’t biased we only checked links (not images) and only checked for 404 & 410 http error codes, ignoring server timeouts etc. as these can sometimes be temporary.
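
In sketch form, the rule we applied looks something like this (an illustration of the policy, not the production crawler, and it assumes the requests library):

import requests

HARD_BROKEN = {404, 410}  # the only status codes counted as broken

def is_broken(url):
    try:
        status = requests.head(url, allow_redirects=True, timeout=15).status_code
        return status in HARD_BROKEN
    except requests.RequestException:
        return False  # timeouts and connection errors may be temporary, so ignore them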

Conclusion

Although there are some big headline figures above, the one that troubles us most is the 13%. Essentially we’re saying that more than 1 in 10 Fortune 500 web pages has a severe bug that’s waiting to pop up and grab unsuspecting users.

Next time you see a 404 error you’ll at least have the consolation that they’re proven to be really common. Of course we do give webmasters the tools to fix these issues – and I think we’ve presented a decisive demonstration of why they’re needed.

Note: feel free to use the infographics in this post; we hereby release them for use on other sites.

OMG; We’ve Launched!

It’s a proud day over at CloudTrawl.com; we just launched the full live service!

We’d love it if you signed up for the free trial, and we’re all ears for new feature requests & suggestions.

So, what made it into the first version? CloudTrawl is designed to watch out for stuff that goes wrong on its own, even if you don’t change your site. So for the first version we have:

- Link Checking (we check every page of your site, daily or weekly)

- Uptime Monitoring (we check your site is online every 30 seconds / 24×7 – there’s a bare-bones sketch of the idea just below)
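
In the most bare-bones terms, uptime monitoring boils down to something like this toy single-machine sketch (it assumes the requests library – CloudTrawl’s real checks run from multiple regions and record full history):

import time
import requests

def monitor(url, interval_seconds=30):
    # Request the page on a fixed interval and flag anything that isn't healthy.
    while True:
        try:
            ok = requests.get(url, timeout=10).status_code < 400
        except requests.RequestException:
            ok = False
        if not ok:
            print(f"{time.strftime('%H:%M:%S')} DOWN: {url}")
        time.sleep(interval_seconds)

# monitor("https://example.com/")  # example URL; runs until interrupted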

We also have features like complete history charting, the ability to share site reports and settings with colleagues & customers, very cool looking real time views for uptime checks, the ability to “Start Now” for link checks, image validation and a lot more.

Even this tidy set of features is really just the tip of the iceberg of what’s planned for CloudTrawl. The ultimate goal: monitor absolutely everything that could go wrong with your site on its own. Over time we’ll be adding more checks and we’d love for you to tell us what extra features and checks you think CloudTrawl should have.

Happy Trawling!

Last bug is fixed!

This is a real development milestone. All of the code for CloudTrawl v1 has been written for a while and we’ve been focused entirely on testing. Our testing has included a lot of steps:

1. Automated testing; we now have a massive suite of automated tests which can be run at the click of a button

2. Functional testing; making sure every feature works as described and they all hang together well

3. Cross browser testing; making sure the interface works across browsers and operating systems

4. Scale; running up hundreds or thousands of uptime checks and hundreds of link & image checks simultaneously to make sure the system performs well with lots of people using it (if I can think of a way to make this not boring it deserves a blog post all of its own).

5. Third party testing; we got the guys over at TestLab² to do a barrage of tests to make sure we hadn’t missed anything.

And then this evening it finally happened… the last known bug was fixed. So “OMG”, it’s so nearly time to open the champagne and hit the release button. Watch this space!

Will crawling affect your Google Analytics?

This is a question we’ve been asked a few times. Many users want to know if the hits generated when their pages are checked for broken links will show up in Analytics. The answer for both CloudTrawl and DeepTrawl is no, it won’t be a problem. For reasons I’ve written about previously, neither product executes the JavaScript on your pages. Since Google Analytics relies on JavaScript to count page views, it has no way of knowing that we’ve visited a page, so this won’t show up.

As a side note, in the next version of DeepTrawl we’re planning to implement a way to make sure all of your pages contain analytics tracking code. Until then it’s relatively easy to do this check yourself using DeepTrawl’s ability to add your own new checks.
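
If you’d rather script it outside DeepTrawl, a standalone sketch of the same idea might look like this – the folder name and tracking ID are placeholders, so substitute your own:

from pathlib import Path

TRACKING_ID = "UA-XXXXXXX-1"  # placeholder - use your own Analytics ID

for page in Path("site").rglob("*.html"):  # assumes a local copy of the site
    if TRACKING_ID not in page.read_text(errors="replace"):
        print(f"Missing analytics code: {page}")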

404’s Are so important…

… there’s even a TED video devoted to them!

<object width="526" height="374">
<param name="movie" value="http://video.ted.com/assets/player/swf/EmbedPlayer.swf"></param>
<param name="allowFullScreen" value="true" />
<param name="allowScriptAccess" value="always"/>
<param name="wmode" value="transparent"></param>
<param name="bgColor" value="#ffffff"></param>
<param name="flashvars" value="vu=http://video.ted.com/talk/stream/2012U/Blank/RennyGleeson_2012U-320k.mp4&su=http://images.ted.com/images/ted/tedindex/embed-posters/RennyGleeson_2012-embed.jpg&vw=512&vh=288&ap=0&ti=1444&lang=&introDuration=15330&adDuration=4000&postAdDuration=830&adKeys=talk=renny_gleeson_404_the_story_of_a_page_not_found;year=2012;theme=art_unusual;event=TED2012;tag=marketing;tag=technology;tag=web;&preAdTag=tconf.ted/embed;tile=1;sz=512x288;" />
<embed src="http://video.ted.com/assets/player/swf/EmbedPlayer.swf" pluginspage="http://www.macromedia.com/go/getflashplayer" type="application/x-shockwave-flash" wmode="transparent" bgColor="#ffffff" width="526" height="374" allowFullScreen="true" allowScriptAccess="always" flashvars="vu=http://video.ted.com/talk/stream/2012U/Blank/RennyGleeson_2012U-320k.mp4&su=http://images.ted.com/images/ted/tedindex/embed-posters/RennyGleeson_2012-embed.jpg&vw=512&vh=288&ap=0&ti=1444&lang=&introDuration=15330&adDuration=4000&postAdDuration=830&adKeys=talk=renny_gleeson_404_the_story_of_a_page_not_found;year=2012;theme=art_unusual;event=TED2012;tag=marketing;tag=technology;tag=web;&preAdTag=tconf.ted/embed;tile=1;sz=512x288;"></embed>
</object>

Why we’re building CloudTrawl using Amazon Web Services (and why you should consider them too)

For those not in the know, AWS is a Cloud hosting provider; they allow their customers to use servers on a pay-as-you-go basis, starting them up and shutting them down quickly and paying by the hour.

Some of their customers are traditional web sites, some are web applications. In both cases the beauty is that extra web servers can be added almost instantly to cope when peak load comes along, i.e. when lots and lots of people are using the site.

So what’s so special about CloudTrawl that we need this? Are we expecting 100 users to log on one hour and then 10,000,000 the next? Well no, probably not.

The answer lies in the type of things CloudTrawl does:

1) Uptime Checking

This is nice and consistent. At launch we’ll have three servers doing this job, based in the US, Ireland and Japan. That number will grow but not overnight, as we get more customers we can add more.

2) Link Checking

This is the big reason we need a true Cloud service to run on, but it’s not obvious at first sight. With other online link checking services we’ve seen, you set up your account and your site is scanned perhaps once a day, once a week or once a month. That’s nice and consistent right? Surely we can balance all of that out and just add servers as we need them? Nope, afraid not. We have an awesome button that rides right over that idea:

[Image: the Start Now button]

That little Start Now button means our service needs to be truly flexible. One minute we could be checking 10 sites for broken links, the next minute it could be 1,000.

So we needed to make sure we’d always have enough servers to do all that work and that’s why we’re running on AWS. We can automatically start up as many servers as we need to do the work and our customers don’t have to wait around.
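
To give a flavour of what “automatically start up as many servers as we need” means in AWS terms, here’s a simplified sketch using the boto3 library. It’s an illustration only – the AMI, instance type and region are placeholders, not our actual provisioning code.

import boto3

def add_link_check_workers(count):
    # Launch `count` extra worker instances when the queue of link checks grows.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder worker image
        InstanceType="t3.small",          # placeholder instance type
        MinCount=count,
        MaxCount=count,
    )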

If they’re worried their site might have broken links they can always hit Start Now and see CloudTrawl checking their site in real time and even fix the errors as they come in.

Pretty cool, huh?

So what’s the lesson for the web community? Well, the requirement to scale your site can come when you least expect it. Once your site is gaining some popularity it may be time to start seriously wondering: will one server always be enough? What if I suddenly get linked to from the BBC, CNN or Slashdot?

Luckily scaling isn’t necessarily that hard. For example, if you have a site running static HTML, Amazon’s EC2 is pretty easy to set up for scaling. If you’re into WordPress, services like WP Engine are designed to scale automatically for you. It’s not that old-fashioned single server hosting is dead, but if you think there’s a chance you might see a big spike in traffic some day, now is a great time to start looking into options.