id
type
0 (not classified)
status
21 (imported old-v2, waiting for another import)
review version
0
cleanup version
0
pending deletion
0 (-)
created at
2025-10-14 14:00:18
updated at
2026-01-07 11:47:28
url
https://www.cs.jhu.edu/~jason/hopskip/
url length
38
url crc
33713
url crc32
3456598961
location type
1 (url matches target location, page_location is empty)
canonical status
2 (missing canonical tag in html)
canonical page id
-
domain id
domain tld
2295
domain parts
0
originating warc id
-
originating url
https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-33/segments/1754151281008.23/warc/CC-MAIN-20250812234112-20250813024112-00514.warc.gz
source type
11 (CommonCrawl)
server ip
Publication date
2025-07-19 02:04:26
Fetch attempts
0
Original html size
4007
Normalized and saved size
4007
title
Hopskip - The Johns Hopkins finite-state learning toolkit
excerpt
content
This is the future home page of Hopskip (www.hopskip.org) - an open-source tool for specifying, training, and sharing probabilistic models of string and sequence data. Applications include text analysis and manipulation, speech recognition, machine translation, information extraction, music, genomics, etc. You specify an appropriate probabilistic model using an extended regular expression language. Internally this is compiled into a parameterized finite-state machine. You can then train the free parameters from data. Training can be supervised, unsupervised, or something in between. It is easy to specify complex models that are sensitive to linguistically meaningful features, that incorporate dictionaries or morphological analyzers, etc. You can try your models right away, without writing additional code. The Hopskip code will handle them in a highly optimized way. We are planning a communal library of useful finite-state machines, such as taggers, parsers, lemmatizers, we...
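The specify, compile, train workflow described above can be illustrated with a minimal sketch. This is not Hopskip's API or model language, which the page does not show. The machine below is a hand-written, deterministic toy acceptor for the pattern `a*b+`, and every name in it (`ARCS`, `path_events`, `train`, `string_probability`) is hypothetical. It assumes fully supervised training, where each training string has exactly one path, so maximum-likelihood estimation reduces to counting arc usage and normalizing per state.

```python
from collections import defaultdict

# Hypothetical toy machine (NOT Hopskip's API): a deterministic finite-state
# acceptor for a*b+ whose arcs carry trainable probabilities.
# ARCS maps (state, symbol) -> next state; STOP is a pseudo-symbol for halting.
STOP = "</s>"
START = 0
FINAL_STATES = {1}
ARCS = {
    (0, "a"): 0,
    (0, "b"): 1,
    (1, "b"): 1,
}


def path_events(string):
    """Return the (state, event) pairs used to accept `string`, or None if rejected."""
    state, events = START, []
    for symbol in string:
        if (state, symbol) not in ARCS:
            return None
        events.append((state, symbol))
        state = ARCS[(state, symbol)]
    if state not in FINAL_STATES:
        return None
    events.append((state, STOP))
    return events


def train(strings):
    """Supervised maximum-likelihood training: count events, normalize per state."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in strings:
        events = path_events(s)
        if events is None:
            continue  # skip strings the machine cannot accept
        for state, event in events:
            counts[state][event] += 1.0
    probs = {}
    for state, by_event in counts.items():
        total = sum(by_event.values())
        for event, count in by_event.items():
            probs[(state, event)] = count / total
    return probs


def string_probability(probs, string):
    """Probability of `string`: product of the trained arc probabilities on its path."""
    events = path_events(string)
    if events is None:
        return 0.0
    p = 1.0
    for key in events:
        p *= probs.get(key, 0.0)
    return p


probs = train(["ab", "aab", "abb", "b"])
print(string_probability(probs, "ab"))
```

Unsupervised training, which the page also mentions, would replace the direct counts with expected counts (for example via expectation-maximization over all paths). That step is omitted here.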
author
updated
1768359992
block type
0
extracted fields
104
extracted bits
title
full content
content was extracted heuristically
detected location
0
detected language
1 (English)
category id
index version
1
paywall score
0
spam phrases
0
text nonlatin
0
text cyrillic
0
text characters
2051
text words
370
text unique words
234
text lines
1
text sentences
22
text paragraphs
1
text words per sentence
16
text matched phrases
1
text matched dictionaries
2
links self subdomains
0
links other subdomains
0
links other domains
3
links spam adult
0
links spam random
0
links spam expired
0
links ext activities
3
links ext ecommerce
0
links ext finance
0
links ext crypto
0
links ext booking
0
links ext news
0
links ext leaks
0
links ext ugc
0
links ext klim
0
links ext generic
0
image author
featured image