Main

type

0 (not classified)

status

21 (imported old-v2, waiting for another import)

review version

0

cleanup version

0

pending deletion

0 (-)

created at

2025-10-14 14:00:18

updated at

2026-01-07 11:47:28

Address

url

https://www.cs.jhu.edu/~jason/hopskip/

url length

38

url crc

33713

url crc32

3456598961

location type

1 (url matches target location, page_location is empty)

canonical status

2 (missing canonical tag in html)

canonical page id

-

Source

domain id

491243395

domain tld

2295

domain parts

0

originating warc id

-

originating url

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151281008.23/warc/CC-MAIN-20250812234112-20250813024112-00514.warc.gz

source type

11 (CommonCrawl)

Server response

server ip

128.220.13.64

Publication date

2025-07-19 02:04:26

Fetch attempts

0

Original html size

4007

Normalized and saved size

4007

Content

title

Hopskip - The Johns Hopkins finite-state learning toolkit

excerpt

content

This is the future home page of Hopskip (www.hopskip.org) - an open-source tool for specifying, training, and sharing probabilistic models of string and sequence data. Applications include text analysis and manipulation, speech recognition, machine translation, information extraction, music, genomics, etc. You specify an appropriate probabilistic model using an extended regular expression language. Internally this is compiled into a parameterized finite-state machine. You can then train the free parameters from data. Training can be supervised, unsupervised, or something in between. It is easy to specify complex models that are sensitive to linguistically meaningful features, that incorporate dictionaries or morphological analyzers, etc. You can try your models right away, without writing additional code. The Hopskip code will handle them in a highly optimized way. We are planning a communal library of useful finite-state machines, such as taggers, parsers, lemmatizers, we...

author

updated

1768359992

Text analysis

block type

0

extracted fields

104

extracted bits

title
full content
content was extracted heuristically

detected location

0

detected language

1 (English)

category id

AI applications (149)

index version

1

paywall score

0

spam phrases

0

Text statistics

text nonlatin

0

text cyrillic

0

text characters

2051

text words

370

text unique words

234

text lines

1

text sentences

22

text paragraphs

1

text words per sentence

16

text matched phrases

1

text matched dictionaries

2