statmt.blogspot.com

Main

id

295078170

name

statmt.blogspot.com · homepage snapshot

processing priority

4

site type

3 (personal blog or private political site, e.g. Blogspot, Substack, also small blogs on own domains)

review version

11

html import

20 (imported)

Events

first seen date

2024-08-14 03:21:54

expired found date

-

created at

2024-08-14 03:21:54

updated at

2026-02-24 18:04:11

Domain name statistics

length

19

crc

48448

tld

2211

nm parts

0

nm random digits

0

nm rare letters

0

Connections

is subdomain of id

69893241 (blogspot.com)

previous id

0

replaced with id

0

related id

-

dns primary id

0

dns alternative id

0

lifecycle status

0 (unclassified, or currently active)

Subdomains and pages

deleted subdomains

0

page imported products

0

page imported random

0

page imported parking

0

Error counters

count skipped due to recent timeouts on the same server IP

0

count content received but rejected due to 11-799

0

count dns errors

0

count cert errors

0

count timeouts

0

count http 429

0

count http 404

0

count http 403

0

count http 5xx

0

next operation date

-

Server

server bits

-

server ip

-

Mainpage statistics

mp import status

20

mp rejected date

-

mp saved date

-

mp size orig

69393

mp size raw text

14793

mp inner links count

11

mp inner links status

20 (imported)

Open Graph

title

The StatMT Blog

description

A blog for and about statistical machine translation.

image

-

site name

-

author

-

updated

2026-02-06 09:36:15

raw text

The StatMT Blog A blog for and about statistical machine translation. Saturday, September 13, 2014 Easy parallel corpora from Wikipedia We're off to a busy start to the semester, and between co-teaching a new (for me) class, proposals, project work, and students returning from internships, I haven't had much capacity for extracurricular writing. But I wanted to post a link to some scripts I just pushed to GitHub that will build a parallel corpus by extracting the titles from the interlingual links on Wikipedia. I've found Wikipedia title pairs to be a surprisingly useful resource on a number of occasions (great coverage of interesting languages and scripts, good license for data use/distribution), and I imagine others will as well. Posted by Chris at 9:50 PM 1 comment: Monday, July 14, 2014 Understanding mkcls Clustering words based on contextual similarity is an effective way to learn lexical representations that improve performan...

Text analysis

redirect type

0 (-)

block type

0 (no issues)

detected language

1 (English)

category id

Other [en] (231)

index version

2025123101

spam phrases

0

Text statistics

text nonlatin

0

text cyrillic

0

text characters

11509

text words

2348

text unique words

886

text lines

260

text sentences

95

text paragraphs

31

text words per sentence

24

text matched phrases

1

text matched dictionaries

2

Link statistics

links self subdomains

0

links other subdomains

12 - anthology.aclweb.org, dl.acm.org, link.springer.com, iro.umontreal.ca, homepages.inf.ed.ac.uk

links other domains

9 - mitpressjournals.org, acl2014.org, brenocon.com, hunch.net, machinedlearnings.com

links spam adult

0

links spam random

0

links spam expired

0

links ext activities

16

links ext ecommerce

0

links ext finance

0

links ext crypto

0

links ext booking

0

links ext news

2

links ext leaks

0

links ext ugc

37 - blogger.com, en.wikipedia.org, facebook.com, nlpers.blogspot.com, gtnlp.wordpress.com

links ext klim

0

links ext generic

1

dol status

0

dol updated

2026-02-06 09:36:15

RSS

rss path

https://statmt.blogspot.com/feeds/posts/default

rss status

32 (unknown)

rss found date

2024-08-19 22:02:28

rss size orig

32798

rss items

6

rss spam phrases

0

rss detected language

1 (English)

inbefore feed id

-

inbefore status

0 (new)

Sitemap

sitemap path

https://statmt.blogspot.com/sitemap.xml

sitemap status

40 (completed successful import of reports.txt file to table in_pages)

sitemap review version

2

sitemap urls count

6

sitemap urls adult

0

sitemap filtered products

0

sitemap filtered videos

0

sitemap found date

2024-08-15 08:47:48

sitemap process date

2025-08-04 12:52:31

sitemap first import date

-

sitemap last import date

2026-02-24 18:04:11