Introduction
LibreQoS is an open source project for monitoring and providing quality-of-experience for Internet Service Providers (ISPs) and large networks. It runs as a “middle-box”, monitoring traffic that passes through it. It recently gained the ability to track individual data flows - connections between two endpoints. It’s also my favorite Open Source project, to which I contribute regularly.
Public Internet IP addresses belong to an ASN - an Autonomous System Number. An ASN is a unique number that identifies a network on the Internet. Tracking flows allows you to see which ASNs your users are connecting to, how much data is flowing, and monitor the quality of the connection.
Just providing ASN names isn’t very useful. Most people who see that Joe is connecting to “SSI-AS” won’t realize that this means they are watching Netflix! To make this data useful, we need to categorize the ASNs. There are a lot of ASN IP blocks (over 400,000!), so we need to automate this process.
Let’s start by fetching the data we need to analyze.
Obtaining the ASN Data
ipinfo.io provides downloadable CSV files containing information about IP addresses and ASNs.
(An outdated copy is included in data/asn.csv.) The data looks like this:
start_ip,end_ip,asn,name,domain
1.0.0.0,1.0.0.255,AS13335,"Cloudflare, Inc.",cloudflare.com
1.0.4.0,1.0.7.255,AS38803,Wirefreebroadband Pty Ltd,gtelecom.com.au
1.0.16.0,1.0.16.255,AS2519,ARTERIA Networks Corporation,arteria-net.com
There are 420,772 lines in the file - we’re not going to list them all here!
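As an aside, the start_ip and end_ip columns are what make IP-to-ASN lookups possible. Here’s a minimal sketch of the idea (IPv4 only, with hypothetical pre-parsed ranges sorted by start address; the real file also contains IPv6 rows):

use std::net::Ipv4Addr;

// Find the name of the ASN whose range contains `ip`, if any.
fn lookup_asn(ranges: &[(Ipv4Addr, Ipv4Addr, String)], ip: Ipv4Addr) -> Option<&str> {
    // Binary search for the last range starting at or before `ip`...
    let idx = ranges.partition_point(|(start, _, _)| *start <= ip);
    let (start, end, name) = ranges.get(idx.checked_sub(1)?)?;
    // ...then check that `ip` actually falls inside it.
    (ip >= *start && ip <= *end).then(|| name.as_str())
}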
Loading the Data
The code for this section is in the accompanying source repository.
Start a new Rust project:
cargo new categorize
cd categorize
Rust makes reading CSV files easy. We’ll use a couple of crates to help us out: Serde (the Serialization-Deserialization library) and CSV. We’ll also include anyhow for easy error handling. You can add them as follows:
cargo add serde -F derive
cargo add csv
cargo add anyhow
Now we create a structure to define the data. We won’t be using most of the fields, so we’ll add #[allow(dead_code)] to suppress warnings about unused fields. We also add #[derive(Deserialize)] to automatically generate the code to read the data from the CSV file.
use serde::Deserialize;

#[derive(Deserialize)]
#[allow(dead_code)] // Ignore unused fields. They have to be here to match the CSV file.
struct AsnRow {
    start_ip: String,
    end_ip: String,
    asn: String,
    name: String,
    domain: String,
}
Next, let’s read the data from the CSV file. We only care about the domain field, so we’ll write some code to load the data and return just that field:
use anyhow::Result;

pub fn load_asn_domains() -> Result<Vec<String>> {
    let data = include_str!("../../data/asn.csv");
    let mut reader = csv::Reader::from_reader(data.as_bytes());
    let rows: Vec<_> = reader
        .deserialize::<AsnRow>() // Each record arrives as a Result<AsnRow, _>
        .flatten() // Keep only the Ok records
        .map(|r| r.domain.to_lowercase().trim().to_string()) // Extract just the domain
        .filter(|d| !d.is_empty()) // Remove empty domains
        .collect(); // Move the results into a vector
    println!("Loaded {} domains", rows.len());
    Ok(rows)
}
Calling this function returns 412,795 domains. A scan through the data shows a lot of duplicates! We don’t want to categorize the same domain multiple times, so we need to de-duplicate the data.
Fortunately, a crate named Itertools makes this very easy.
If you’re tempted to just add it all to a HashSet and let that do the job: it’ll work, but it will be substantially slower. Generating a hash for every string is expensive. It’s much faster to sort the strings, iterate forward, and only retain unique items!
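Here’s a minimal sketch contrasting the two approaches (the helper names are hypothetical):

use std::collections::HashSet;
use itertools::Itertools;

// Hash-based de-duplication: computes a hash for every string.
fn dedup_hashed(domains: Vec<String>) -> Vec<String> {
    let set: HashSet<String> = domains.into_iter().collect();
    set.into_iter().collect()
}

// Sort-based de-duplication: sort once, then drop adjacent repeats.
fn dedup_sorted(domains: Vec<String>) -> Vec<String> {
    domains.into_iter().sorted().dedup().collect()
}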
Add a dependency on Itertools:

cargo add itertools
We can then add two lines to our function, and we’re done!
use itertools::Itertools;

pub fn load_asn_domains() -> Result<Vec<String>> {
    let data = include_str!("../../data/asn.csv");
    let mut reader = csv::Reader::from_reader(data.as_bytes());
    let rows: Vec<_> = reader
        .deserialize::<AsnRow>() // Each record arrives as a Result<AsnRow, _>
        .flatten() // Keep only the Ok records
        .map(|r| r.domain.to_lowercase().trim().to_string()) // Extract just the domain
        .filter(|d| !d.is_empty()) // Remove empty domains
        .sorted() // Sort the results (from Itertools)
        .dedup() // Remove adjacent duplicates
        .collect(); // Move the results into a vector
    println!("Loaded {} domains", rows.len());
    Ok(rows)
}
That runs very quickly - and we’re down to 63,519 domains. That’s still a lot—but at least we aren’t doing the same work over and over again.
Setting Up a Local LLM
You probably don’t want to pay for 63,519 API calls (assuming everything is one shot, works perfectly the first time, and you never need to run a second test!). So let’s set up a local LLM. I used Ollama on my Linux box: it neatly wraps the complexities of llama-cpp, supports my AMD GPU out of the box, and is easy to install. Your setup will vary by platform. Visit https://ollama.com/ and follow the instructions there. I’m using the llama3.1 model, which I installed with ollama pull llama3.1.
Once you have Ollama installed, you can test it with ollama run llama3.1 and have a little chat:
>>> Is Rust a great language?
Rust is a highly-regarded programming language that has gained popularity in recent years, and opinions about it vary depending on
one's background, experience, and goals. Here are some aspects where Rust excels:
**Great features:**
1. **Memory Safety**: Rust's ownership model ensures memory safety without relying on garbage collection. This makes it an attractive
choice for systems programming and applications that require high performance.
2. **Concurrency**: Rust provides built-in support for concurrency through its `async/await` syntax, making it easy to write
asynchronous code.
3. **Performance**: Rust is designed with performance in mind. It can compete with C++ in terms of execution speed and memory usage.
4. **Type System**: Rust's type system is both expressive and flexible. It allows for static checking of types, ensuring that your
code is correct at compile-time rather than runtime.
(and on, and on, and on - this is a chatty LLM!)
So now that we have a working local LLM, let’s talk to it via the API from Rust.
Talking to the LLM
In our Rust code, we’re going to add three dependencies: Tokio (an async runtime), Reqwest (an HTTP client), and serde_json (to make JSON easy to work with). Note that the json feature belongs to Reqwest; it provides the .json() request-builder method we’ll use below.
cargo add tokio -F full
cargo add reqwest -F json
cargo add serde_json
Talking to Ollama uses a relatively simple REST API. If you’ve taken our Rust Foundations or Rust as a Service classes, you’ll know this one!
Let’s start by setting a constant to the local LLM’s API endpoint URL:
const LLM_API: &str = "http://localhost:11434/api/generate";
Now, we’ll define a structure to receive data from the API:
#[derive(Deserialize)]
struct Response {
response: String,
}
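Each line of Ollama’s stream is a small JSON object carrying a fragment of the reply; Serde simply ignores the fields we didn’t declare. A rough illustration (field values invented):

fn main() {
    // One line from the /api/generate stream (values invented):
    let chunk = r#"{"model":"llama3.1","response":"Hello","done":false}"#;
    let parsed: Response = serde_json::from_str(chunk).unwrap();
    assert_eq!(parsed.response, "Hello");
}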
And finally, we can write a function that talks to the LLM:
use serde_json::json;

async fn llm_completion(prompt: &str) -> Result<String> {
    // Use serde_json to quickly make a JSON object
    let request = json!({
        "model": "llama3.1",
        "prompt": prompt,
    });

    // Start the Reqwest client
    let client = reqwest::Client::new();

    // Create a POST request, add the request JSON, and send it
    let mut res = client
        .post(LLM_API)
        .json(&request)
        .send()
        .await?;

    // Empty string to assemble the response
    let mut response = String::new();

    // While res.chunk() returns Some(data), the stream still holds data
    // we want. Grab each chunk and add it to the response string.
    while let Some(chunk) = res.chunk().await? {
        let chunk: Response = serde_json::from_slice(&chunk)?;
        response.push_str(&chunk.response);
    }

    // Return the response
    Ok(response)
}
The only tricky part here is that we are streaming the response, rather than processing it all at once. LLMs generate their output one token at a time, and Ollama streams each fragment back as it is produced. Fortunately, streaming is built into Reqwest.
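One caveat, not handled above: Ollama streams newline-delimited JSON, and a single network chunk can, in principle, carry more than one line (or an incomplete one). A more defensive sketch buffers the bytes and parses complete lines:

// A more defensive variant (sketch): buffer the byte stream and parse one
// newline-delimited JSON object at a time.
async fn llm_completion_buffered(prompt: &str) -> Result<String> {
    let request = json!({ "model": "llama3.1", "prompt": prompt });
    let client = reqwest::Client::new();
    let mut res = client.post(LLM_API).json(&request).send().await?;

    let mut response = String::new();
    let mut buffer = String::new();
    while let Some(chunk) = res.chunk().await? {
        buffer.push_str(std::str::from_utf8(&chunk)?);
        // Consume every complete line currently in the buffer
        while let Some(pos) = buffer.find('\n') {
            let line: String = buffer.drain(..=pos).collect();
            if line.trim().is_empty() {
                continue;
            }
            let parsed: Response = serde_json::from_str(line.trim())?;
            response.push_str(&parsed.response);
        }
    }
    Ok(response)
}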
Let’s give this a try:
#[tokio::main]
async fn main() -> Result<()> {
    let response = llm_completion("Good Morning!").await?;
    println!("{}", response);
    Ok(())
}
On my machine, this returns:
Good morning! Hope you're having a great start to the day! How can I help or chat with you today?
Now that we have a working LLM, we can start categorizing the data.
Categorizing the Data: First Try - Oneshot!
“Oneshot” is a term used in the LLM world to describe a single request with no
helper data, no introspection, chain-of-thought or anything else. In a perfect
world, this would be all you need. (Note: this isn’t a perfect world).
It will take a while to categorize 63,519 domains, so we’ll start with a small sample. Let’s add the rand crate (cargo add rand) to help us pick random test data.
We’ve already loaded the domain list, and we have a function to talk to the LLM, so let’s put this together:
use rand::seq::SliceRandom; // provides choose_multiple

#[tokio::main]
async fn main() -> Result<()> {
    let domains = load_asn_domains()?;

    // Pick a small random sample
    let mut rng = rand::thread_rng();
    let sample = domains.choose_multiple(&mut rng, 2);

    // Let's do some categorizing
    for domain in sample {
        println!("Domain: {}", domain);
        let prompt = format!("Please categorize this domain with a single keyword. \
            Do not elaborate, do not explain or otherwise \
            enhance the answer. The domain is: {domain}");
        let response = llm_completion(&prompt).await?;
        println!("Response: {}", response);
    }
    Ok(())
}
Things to note:
- We’re using a very simple prompt. Adding “do not elaborate” and “do not explain” is a common technique to get a single-word answer. LLMs like to talk.
- We’re not providing any additional information or context—we’re relying on the LLM’s baked-in knowledge.
- I like to say “please” to LLMs, so when the inevitable Super Intelligence Apocalypse happens, hopefully I’ll be spared.
Running this gave me:
Domain: r-tk.net
Response: Radio
Domain: 365it.fr
Response: Software
Is the answer any good? I’d not heard of either of those domains, so—unlike the LLM—I fired up
a browser to try and determine if the LLM was hallucinating (I was definitely expecting
some hallucination!).
Unsurprisingly, the LLM was wrong. r-tk.net
is a Russian company that provides Internet and
streaming television services. 365it.fr
is an IT services company in France.
I ran it a few more times, and the results weren’t great. Sometimes, the LLM was
spot on—and most of the time, it was hallucinating an effectively random answer.
Large Language Models are “next token predictors”. While there is some evidence that emergent
behavior develops in large models, they aren’t a font of all knowledge. If a domain wasn’t
part of their training data, they don’t know what it is—and will “hallucinate” a likely-
sounding answer.
So let’s try and give the LLM some context to work with.
Adding Context
The vast majority of the listed domains have an associated website. Maybe we could scrape text from the website and use that as context?
Many recent LLMs can use tools. For example, ChatGPT can fire up a web browser
to search for the answer to your question. This article isn’t going to try to
incorporate tool usage—instead, we’ll scrape the website data and provide
context directly.
Let’s make use of the reqwest crate to fetch the website data, and the scraper crate to extract some text. LLMs have a pretty short context window, so we don’t want to overwhelm the poor AI with too much data.
We’ll add a function to fetch the website data:
use std::time::Duration;
use reqwest::header;

async fn website_text(domain: &str) -> Result<String> {
    let url = format!("http://{}/", domain);

    // Build a header with a Firefox user agent
    let mut headers = header::HeaderMap::new();
    headers.insert(
        header::USER_AGENT,
        header::HeaderValue::from_static("Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion"),
    );

    // Set up Reqwest with the header
    let client = reqwest::Client::builder()
        .default_headers(headers)
        .timeout(Duration::from_secs(30))
        .build()?;

    // Fetch the website
    let body = client.get(&url).send().await?.text().await?;

    // Parse the HTML
    let doc = scraper::Html::parse_document(&body);

    // Search for parts of the site with text in likely places
    let mut content = Vec::new();
    for items in ["title", "meta", "ul,li", "h1", "p"] {
        content.extend(find_content(items, &doc));
    }

    // We now have a big list of words (hopefully) from the website
    let result = content
        .into_iter() // Consuming iterator
        .sorted() // Sort alphabetically
        .dedup_with_count() // De-duplicate, returning (count, word) tuples
        .sorted_by(|a, b| b.0.cmp(&a.0)) // Sort by count, descending
        .map(|(_count, word)| word) // Take only the word
        .take(100) // Take the top 100 words
        .join(" "); // Join them into a string
    Ok(result)
}
That’s quite a mouthful, but it does a lot:
- The function creates a header mimicking Firefox. Many domains won’t reply without a valid USER-AGENT string.
- It creates a Reqwest instance featuring the header, and a 30-second timeout window for obtaining data from the remote website.
- It fetches the website, and extracts the result’s body as text.
- It uses scraper to parse the HTML.
- It searches for text in likely places: the title, meta tags, unordered lists, list items, headers, and paragraphs. The find_content function (below) gathers the word list from each section of the website.
- It sorts the words.
- It de-duplicates the word list, keeping count of how many times each word appeared.
- It retains the 100 most-used words on the website.
The helper function uses the scraper crate to extract text from the HTML and convert it into lowercase words, returned as a vector of strings:
fn find_content(selector: &str, document: &Html) -> Vec<String> {
    let selector = scraper::Selector::parse(selector).unwrap();
    let mut content = Vec::new();
    for element in document.select(&selector) {
        // Get all text elements matching the selector
        let e: String = element.text().collect();
        // Split at whitespace, keep only words longer than 3 characters,
        // and convert to lowercase.
        let e: Vec<String> = e
            .split_whitespace()
            .filter(|s| s.len() > 3)
            .map(|s| s.trim().to_lowercase())
            .collect();
        if !e.is_empty() {
            content.extend(e);
        }
    }
    content
}
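A quick usage sketch, with a hand-written HTML fragment to show the shape of the output:

fn main() {
    let html = scraper::Html::parse_document(
        "<html><head><title>Fiber Internet Provider</title></head>\
         <body><p>Fast fiber internet for rural homes</p></body></html>",
    );
    let words = find_content("title", &html);
    assert_eq!(words, vec!["fiber", "internet", "provider"]);
}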
Why trim and to_lowercase? Websites are often full of whitespace, and often have pretty strange formatting. We only care about the word content; we don’t want the gaps in between. Normalizing to lowercase ensures that “Provision” and “provision” will be counted as the same word.
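A tiny illustration of that normalization:

fn main() {
    let raw = ["  Provision ", "provision", "PROVISION"];
    let normalized: Vec<String> = raw.iter().map(|s| s.trim().to_lowercase()).collect();
    assert!(normalized.iter().all(|s| s == "provision"));
}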
Here’s the resulting context from provision.ro (which my system picked at random):
security management data application threat protection risk firewall access endpoint
digital privacy e-mail encryption gateway response secure testing detection hunting
identity infrastructure network assessment training vulnerability advance
analytics/ anti-phishing asset attacks authentication automated automation awareness
bots browser casb centric classification client collaboration container cspm database
ddos deception detection/protection discovery ediscovery governance human incident
intelligence isolation malware mast penetration privilege rights runtime sase
self-protection side siem soar third-party tools ueba visibility wireless media
operations analysis cloud compliance generation masking messaging mobile next social
trust zero about services solutions provision technologies partners contact find home
more cyber experience expertise help information provision’s
Combining these functions yields a list of at most 100 words from the website. We have to keep the list small, both to avoid overwhelming the LLM’s context window and to keep processing relatively performant.
So now we slightly change our main function to include the context in the prompt:
#[tokio::main]
async fn main() -> Result<()> {
    let domains = load_asn_domains()?;

    // Pick a small random sample
    let mut rng = rand::thread_rng();
    let sample = domains.choose_multiple(&mut rng, 5);

    // Let's do some categorizing
    for domain in sample {
        println!("Domain: {}", domain);
        if let Ok(text) = website_text(domain).await {
            let prompt = format!("Please categorize this domain with a single keyword. \
                Do not elaborate, do not explain or otherwise enhance the answer. \
                The domain is: {domain}. Here are some items from the website: {text}");
            let response = llm_completion(&prompt).await?;
            println!("Response: {}", response);
        } else {
            println!("unable to scrape: {domain}");
        }
    }
    Ok(())
}
So let’s see how we’re doing with some context included:
Domain: eternet.cc
Response: Internet
Domain: wilken-rz.de
unable to scrape: wilken-rz.de
Domain: orovalleyaz.gov
Response: Government
Domain: embl.de
Response: Biotechnology
Domain: baikonur.net
Response: Internet
Manually visiting these sites:
- eternet.cc is indeed an Internet provider.
- orovalleyaz.gov is the official site for the town of Oro Valley, Arizona. So “Government” is right.
- embl.de is the European Molecular Biology Laboratory. So “Biotechnology” is right.
- baikonur.net is a Russian site that provides Internet services. So “Internet” is right.
One scraping failure, and 4/4 on categorization! I repeated this a few times, and it was
consistently good!
Now let’s turn this into a program that can complete our intended task.
Running the LLM queries is by far the slowest part of this process. We aren’t going to optimize Ollama itself; sadly, buying a better GPU (or using a faster LLM) is the best way to speed that up.
Running through each site one at a time is going to take a really long time. Let’s speed things up and make use of Tokio’s async capabilities (one thread per core, work stealing). We’ll add the futures crate to provide join_all, which I find easier than Tokio’s JoinSet system.
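If you haven’t used join_all before, here’s a minimal sketch of the pattern with a toy workload:

use futures::future::join_all;

#[tokio::main]
async fn main() {
    // Spawn a few tasks and wait for all of them to finish.
    let handles: Vec<_> = (0..4).map(|i| tokio::spawn(async move { i * 2 })).collect();
    for result in join_all(handles).await {
        println!("{}", result.unwrap()); // unwrap each task's JoinHandle result
    }
}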
Appending Results to a File
Let’s add a helper function that appends a line to a file:
async fn append_to_file(filename: &str, line: &str) -> Result<()> {
    let mut file = tokio::fs::OpenOptions::new()
        .append(true)
        .create(true)
        .open(filename)
        .await?;
    tokio::io::AsyncWriteExt::write_all(&mut file, format!("{}\n", line).as_bytes()).await?;
    Ok(())
}
This is not thread-safe on its own, but it’s designed to be fed from a channel, with a single task draining the channel and serializing the writes.
Next, we’ll build a function that receives failure reports over a channel, and uses the append_to_file function to record errors as they occur:
use tokio::sync::mpsc::Sender;

async fn failures() -> Sender<String> {
    let (tx, mut rx) = tokio::sync::mpsc::channel::<String>(32);
    tokio::spawn(async move {
        while let Some(domain) = rx.recv().await {
            println!("Failed to scrape: {}", domain);
            // Append to "failures.txt"
            if let Err(e) = append_to_file("failures.txt", &domain).await {
                eprintln!("Failed to write to file: {}", e);
            }
        }
    });
    tx
}
We’ll do the same for success, but we’ll use a struct to store both the domain and the category (I think a struct is nicer than a (String, String) tuple):
struct Domain {
    domain: String,
    category: String,
}

async fn success() -> Sender<Domain> {
    let (tx, mut rx) = tokio::sync::mpsc::channel::<Domain>(32);
    tokio::spawn(async move {
        while let Some(domain) = rx.recv().await {
            println!("Domain: {}, Category: {}", domain.domain, domain.category);
            // Append to "categories.csv"
            if let Err(e) = append_to_file(
                "categories.csv",
                &format!("{},{}", domain.domain, domain.category),
            ).await {
                eprintln!("Failed to write to file: {}", e);
            }
        }
    });
    tx
}
Categorization as a Function
Let’s move our prompt generation and calling into a function as well:
async fn categorize_domain(domain: &str, text: &str) -> Result<Domain> {
    let prompt = format!("Please categorize this domain with a single keyword in English. \
        Do not elaborate, do not explain or otherwise enhance the answer. \
        The domain is: {domain}. Here are some items from the website: {text}");
    let response = llm_completion(&prompt).await?;
    Ok(Domain {
        domain: domain.to_string(),
        category: response,
    })
}
Finally, let’s tie this together to process the entire list.
And Let’s Call It!
We’re probably going to melt some CPU and GPU chips here! Let’s do it!
use futures::future::join_all;

#[tokio::main]
async fn main() -> Result<()> {
    // Load the domains
    let mut domains = load_asn_domains()?;

    // Shuffle the domains (so test runs don't always hit the same ones)
    domains.shuffle(&mut rand::thread_rng());

    // Create the channels for results
    let report_success = success().await;
    let report_failures = failures().await;

    // Load the list of domains we've already categorized, in case we
    // have to run this more than once
    let already_done = std::fs::read_to_string("categories.csv").unwrap_or_default();

    // Create a big set of tasks
    let mut futures = Vec::new();
    for domain in domains.into_iter() {
        // Skip domains we've already done
        if already_done.contains(&domain) {
            continue;
        }

        // Clone the channels - they are designed for this.
        let my_success = report_success.clone();
        let my_failure = report_failures.clone();
        let future = tokio::spawn(async move {
            match website_text(&domain).await {
                Ok(text) => match categorize_domain(&domain, &text).await {
                    Ok(domain) => { let _ = my_success.send(domain).await; }
                    Err(_) => { let _ = my_failure.send(domain).await; }
                },
                Err(_) => {
                    let _ = my_failure.send(domain).await;
                }
            }
        });
        futures.push(future);

        // Limit the number of concurrent tasks
        if futures.len() >= 32 {
            let batch: Vec<_> = futures.drain(..).collect();
            let _ = join_all(batch).await;
        }
    }

    // Await any leftover tasks
    join_all(futures).await;
    Ok(())
}
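A note on the design: draining the task list in batches of 32 means one slow website can stall an entire batch. An alternative (sketched here with a stand-in workload) uses a semaphore to keep 32 tasks in flight continuously:

use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let semaphore = Arc::new(Semaphore::new(32));
    let mut handles = Vec::new();
    for i in 0..100 {
        // Wait for a free slot before spawning the next task
        let permit = semaphore.clone().acquire_owned().await?;
        handles.push(tokio::spawn(async move {
            let _permit = permit; // the slot frees when this task finishes
            // Stand-in for website_text + categorize_domain
            tokio::time::sleep(std::time::Duration::from_millis(10)).await;
            println!("processed item {i}");
        }));
    }
    futures::future::join_all(handles).await;
    Ok(())
}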
I ran this with take(10) to test it. There were no failures! The category list looks good, too:
phatriasulung.net.id,Webhosting
cornellcollege.edu,Education
connexcs.com,Communications
provedorsupply.net.br,Internet
thornburg.com,Investment
usp.org,Pharmaceuticals
valassis.com,Marketing
agen-rs.si,Energy
balasai.com,Hosting
lima.co.uk,Technology
We have the basis of a working categorization engine!
Conclusion
LLMs provide a powerful tool for categorizing data, and Rust makes it easy to
work with them. Rust has excellent tools for scraping web data and massaging
the results, allowing you to provide context to your LLM calls. With Tokio, Reqwest
and futures, you can easily parallelize your work to make the most of your
hardware.
There are quite a few possible improvements to this code:
- Ask the LLM to pick from a provided list of keywords, rather than letting it invent its own (see the sketch below).
- You could use more than one LLM and take the majority answer.
- You should try different LLM models. Llama 3 is a fun, open source model, but there are a lot of models out there!
- You could definitely improve the word-selection algorithm: remove stop words, prioritize the title, and so on.
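As a taste of that first idea, here’s a sketch; the category list is entirely made up, and falling back to “Other” is just one possible choice:

// Hypothetical fixed category list - tune this to your network's traffic.
const CATEGORIES: &[&str] = &[
    "Internet", "Hosting", "Education", "Government",
    "Finance", "Healthcare", "Media", "Other",
];

async fn categorize_domain_constrained(domain: &str, text: &str) -> Result<Domain> {
    let prompt = format!(
        "Please categorize this domain with exactly one keyword from this list: {}. \
         Do not elaborate. The domain is: {domain}. \
         Here are some items from the website: {text}",
        CATEGORIES.join(", ")
    );
    let response = llm_completion(&prompt).await?;
    // If the model wanders off the list, fall back to "Other".
    let category = CATEGORIES
        .iter()
        .find(|c| response.trim().eq_ignore_ascii_case(c))
        .map(|c| c.to_string())
        .unwrap_or_else(|| "Other".to_string());
    Ok(Domain { domain: domain.to_string(), category })
}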
The final program categorizes about 1,300 domains per hour on my MacBook Air M1. It’s not
amazingly fast, but it’s not bad either. It’s definitely faster than doing it by hand!
Spot-checking the results shows that it’s pretty accurate, too. It’s not perfect, but it’s
more accurate than I would be if I read all 63,519 domains myself!