WEBVTT

00:00:00.000 --> 00:00:02.000
Recording on…

00:00:02.000 --> 00:00:17.000
Okay. Welcome, everyone. This is the third Thursday in the month of April 2026. It's April the seventeenth.

00:00:17.000 --> 00:00:18.000
16th, Gary.

00:00:18.000 --> 00:00:34.000
The 16th. That's right. I mean to check these things in advance, and don't I know it. So anyway, you are at the monthly meeting of the St. Louis Linux Users Group. We are part of the St. Louis Unix Users Group. The

00:00:34.000 --> 00:00:55.000
Linux meetings try to focus more on Linux. We're a free group; there are no membership dues, and anyone's welcome to come. I think probably everybody on the meeting right now is familiar with this. So, if I skip anything and you

00:00:55.000 --> 00:01:12.000
want to know something you see on a slide, raise your meat hook or say something in the chat, and I'll try and address it. So we usually try and hold the Linux meeting eight days after the monthly general meeting, and our meetings are all free and open to the public.

00:01:12.000 --> 00:01:16.000
Next. Go ahead.

00:01:16.000 --> 00:01:20.000
There we go.

00:01:20.000 --> 00:01:22.000
Did somebody start to ask something? Oh, okay. Okay.

00:01:22.000 --> 00:01:28.000
No, that was me cussing under my breath.

00:01:28.000 --> 00:01:43.000
As those of you who are already here probably noticed, the Zoom meeting usually is opened at 6 o'clock. The meeting starts at 6:30, and hopefully, after some short announcements and any general Q&A, we'll have the main speaker actually talking by 6:45.

00:01:43.000 --> 00:01:51.000
So our chairman of the LUG is James Conroy.

00:01:51.000 --> 00:02:06.000
The MC is me, Gary Meyer. I'm also the president of the organization. The host is Lee Lammert of Omnitech. He's the principal and the lead scientist there. He's also the treasurer of SLUUG. So…

00:02:06.000 --> 00:02:23.000
And he also provides us with virtual machines to keep this production going. Speaking of production, Stan Reichardt is our vice president, and he's the man who does an enormous amount of production work to make this all happen. He takes the videos, puts them in the library,

00:02:23.000 --> 00:02:48.000
tries to keep recordings organized. So thank you all for all the help you're providing. Speaking of help, we could use somebody to volunteer to monitor the chat. Sometimes somebody will put something in the chat, and we may not notice that it's something that needs to be addressed immediately. So if anybody would like to volunteer, let me know, and keep an eye on it.

00:02:48.000 --> 00:03:06.000
Topic wrangler. It would really help, for either this meeting or the general monthly meeting. We could really use somebody to help find speakers, so yeah, if you'd like to try and line up some speakers, we'd love to have your help.

00:03:06.000 --> 00:03:15.000
And of course, we always need speakers. So if any of you have any ideas of what to present, topic-wise or speaker-wise, or would like to talk yourself,

00:03:15.000 --> 00:03:25.000
by all means, please volunteer. Talk to me, or send an email to me, Stan, or Lee, or just send it to

00:03:25.000 --> 00:03:29.000
editor@sluug.org, and one of us will get ahold of you.

00:03:29.000 --> 00:03:37.000
Announcements. The next Linux meeting, I already said that.

00:03:37.000 --> 00:03:58.000
The next SLACC meeting is usually the first Tuesday of each month. SLACC is the St. Louis Area Computer Club. It doesn't restrict itself just to Linux or Unix or other open systems. It's a little more hardware-oriented, and they'll even talk about… Microsoft. So.

00:03:58.000 --> 00:03:59.000
Bite your tongue.

00:03:59.000 --> 00:04:00.000
I bit my tongue. Yes.

00:04:00.000 --> 00:04:06.000
Microsoft, you're saying that wrong. It's gonna be redubbed as MicroSlop.

00:04:06.000 --> 00:04:08.000
I'll pop up.

00:04:08.000 --> 00:04:09.000
Okay.

00:04:09.000 --> 00:04:15.000
Has earned that title by crashing more of its star products somewhere between here and the moon.

00:04:15.000 --> 00:04:34.000
Yep. Oh, boy. The New Linux meeting is oriented toward people new to Linux. They get to basically set the agenda and ask the questions, and the other people there are

00:04:34.000 --> 00:04:45.000
really enjoying it. Even if they're experienced graybeards, it's amazing what you can learn from other people answering what seemed like a simple beginner question. So, you're all encouraged to come.

00:04:45.000 --> 00:04:54.000
Announcements. Let's see. Oh, I guess I'll go on ahead to the next slide.

00:04:54.000 --> 00:04:55.000
Okay.

00:04:55.000 --> 00:05:03.000
Oh, I… I definitely want to talk about the Missouri Radiation Exposure Compensation Act. A lot of the people that

00:05:03.000 --> 00:05:09.000
are in Missouri don't know about it. A lot of people that used to be here in St. Louis don't know about it.

00:05:09.000 --> 00:05:29.000
But go do a Google search for the Missouri Radiation Exposure Compensation Act. If you had family with, or currently have, any kind of cancer, it might be a covered cancer. A family could get $25,000 as a death benefit, or you could get $50,000 tax-free for your own medical coverage.

00:05:29.000 --> 00:05:30.000
Anyway.

00:05:30.000 --> 00:05:31.000
Anyway…

00:05:31.000 --> 00:05:47.000
Mm-hmm. And for any of you out of town: yes, there's some radiation that was dumped here, going all the way back to making the nuclear bombs used against Japan. And so we're still trying to clean that up in the area. That's the source of this problem.

00:05:47.000 --> 00:05:48.000
Yep. Hey, Stan?

00:05:48.000 --> 00:05:51.000
Okay, okay.

00:05:51.000 --> 00:05:53.000
Sir. What?

00:05:53.000 --> 00:05:58.000
What about somebody that died back in the 80s?

00:05:58.000 --> 00:06:15.000
If they died from one of the covered cancers, the family can file for a death benefit. I've got an uncle who died from cancer, and I'm trying to help his daughter file a claim.

00:06:15.000 --> 00:06:16.000
Okay, well.

00:06:16.000 --> 00:06:24.000
My cousin. So, yeah. The thing is, anybody that has been in these covered areas since…

00:06:24.000 --> 00:06:25.000
In 1949.

00:06:25.000 --> 00:06:31.000
1949 falls under this, if they meet the other criteria.

00:06:31.000 --> 00:06:41.000
Yeah, well… I think I want to look into it for my mom. I mean, she passed back in '85, but… it was definitely cancer.

00:06:41.000 --> 00:06:56.000
Okay, well, you need to go to the websites. I can send you some links, or, as I already talked about, if you hit the link on this web page, it'll get you into it. Or just do a Google search for Missouri Radiation Exposure Compensation Act.

00:06:56.000 --> 00:06:57.000
Yeah, come to think of it… My daughter died of cancer too.

00:06:57.000 --> 00:07:02.000
Boom. Or go to libraries.

00:07:02.000 --> 00:07:14.000
If you hit either the St. Louis County libraries or the St. Charles County libraries, they've got links to the RECA thing. Go there.

00:07:14.000 --> 00:07:15.000
All right. I'll do it. Thank you, sir.

00:07:15.000 --> 00:07:19.000
Okay?

00:07:19.000 --> 00:07:28.000
Okay. Um… Let's see. Bye, Windows 10. It's of course been replaced as of the end of October, and there's…

00:07:28.000 --> 00:07:35.000
Boy, that's a quiet meeting.

00:07:35.000 --> 00:07:36.000
No?

00:07:36.000 --> 00:07:40.000
Well, so far. So, there's the book advertised there on the slide right now. You can go and download that.

00:07:40.000 --> 00:07:53.000
Oh, excuse me. And no, that's not a beer, it's a Diet Pepsi.

00:07:53.000 --> 00:07:54.000
Mm-hmm.

00:07:54.000 --> 00:08:09.000
Okay, I was muted. All right. Anyway, so yes, if anybody wants to make any comments about Windows 11, you're more than welcome to, but I think we've chatted about it quite a bit. So the obvious solution is to replace it with a Unix-based operating system.

00:08:09.000 --> 00:08:15.000
Okay, moving on.

00:08:15.000 --> 00:08:20.000
There's an endof10.org website. They're starting to list

00:08:20.000 --> 00:08:31.000
places throughout the country and throughout the world that'll help people, either commercially for a price or for free, move over to Linux.

00:08:31.000 --> 00:08:32.000
Oh, did you hear that, um… France is officially dumping Windows?

00:08:32.000 --> 00:08:38.000
Next slide.

00:08:38.000 --> 00:08:39.000
I'm sorry.

00:08:39.000 --> 00:08:48.000
I have not heard that. Yeah, a lot of the European countries are… there are three or four now that have officially dumped it.

00:08:48.000 --> 00:08:51.000
France will certainly be a big one to lose.

00:08:51.000 --> 00:08:53.000
Or for them to lose.

00:08:53.000 --> 00:08:56.000
Oh, I see. Here we go.

00:08:56.000 --> 00:09:16.000
I… Alrighty, privacy and security. We record these meetings. However, even when our recorders are off, or when we don't have a chance to go in and edit something out, anybody who is in the meeting can, of course, have their own recorders, so…

00:09:16.000 --> 00:09:22.000
just be careful what you're saying, because we're not the only ones in control.

00:09:22.000 --> 00:09:37.000
Next. Presentation archives. The SLUUG general monthly meeting, and this meeting, the St. Louis Linux LUG meeting: we put the recordings of these meetings into the archive, and they're available for…

00:09:37.000 --> 00:09:40.000
Oh, I've got dead silence. Is anybody out there?

00:09:40.000 --> 00:09:42.000
Yes. I'm the one murderer in Iraq.

00:09:42.000 --> 00:09:49.000
You can't hear us?

00:09:49.000 --> 00:09:50.000
You can't hear us, Steve? Steve.

00:09:50.000 --> 00:09:56.000
Whoa, what's that?

00:09:56.000 --> 00:10:03.000
Okay, somebody send Steve a text message and tell him that his speaker needs to be turned on.

00:10:03.000 --> 00:10:10.000
We have the presentation archives linked on all of our websites, at the bottom right-hand corner.

00:10:10.000 --> 00:10:15.000
Or you can follow the link that's shown on the slide.

00:10:15.000 --> 00:10:16.000
Next.

00:10:16.000 --> 00:10:32.000
Okay. The calendar. We certainly try and keep our own meetings on our calendar, and we highlight those in red. But we also try and put up technical events that may be of interest to other people.

00:10:32.000 --> 00:10:46.000
So the calendar is listed there on that slide. It's basically sluug.org/calendar. And if you have any suggestions of stuff that should be on the calendar, send them to editor@sluug.org. Thank you.

00:10:46.000 --> 00:10:55.000
Mailing lists. We maintain a number of mailing lists for our group, or anybody who wants to read them, actually. So

00:10:55.000 --> 00:10:56.000
you can sign up as shown there.

00:10:56.000 --> 00:10:59.000
Hey, Stan, have I lost you for some reason?

00:10:59.000 --> 00:11:09.000
Yes. Go ahead. I cannot deal with Steve while I'm showing the slides. Lee, take care of him.

00:11:09.000 --> 00:11:17.000
Already done. Okay, the announce mailing list has very little traffic. Basically, it's just

00:11:17.000 --> 00:11:33.000
announcements of our upcoming meetings and other important, timely things. And because of that, the traffic is moderated. The discuss mailing list is open to any of our members who want to sign up, and

00:11:33.000 --> 00:11:54.000
basically, you can make comments, you can ask questions, you can tell people what you're doing and see if they have any suggestions. So there's a lot more traffic on discuss than there is on announce. Steering committee: if you like to see sausage being made, that's what we do in the steering committee. We basically try and plan these meetings and get things organized.

00:11:54.000 --> 00:12:10.000
Anybody's welcome to join. The sysadmin group: as I said earlier, thanks to Omnitech, we've got some virtual machines that basically provide our technical machinery. And so, basically, the volunteers

00:12:10.000 --> 00:12:29.000
from this group who offer to help run these virtual machines are the ones who are on that mailing list. So it's not really open to anyone unless you're a volunteer, just in case we make a mistake and slip a password or something. But we do want more people to volunteer, so if you want to volunteer to help sysadmin, let me know.

00:12:29.000 --> 00:12:33.000
Next. Sponsored meetings.

00:12:33.000 --> 00:12:51.000
We already mentioned the LUG meeting; that's obviously tonight. The New Linux meeting, again, that's for the new users, although it really, really is useful to also have, I'll call them the graybeards, the experienced users. And a lot of times, as an experienced user, you learn a lot. So please come.

00:12:51.000 --> 00:13:11.000
SLACC is the one I said is more hardware-oriented. The St. Louis Linux meeting and the SLUUG general meeting are the only ones that have planned topics. The other two meetings are basically just driven by whoever comes and asks questions.

00:13:11.000 --> 00:13:27.000
Call for help, before we go on, if any of you just want to ask a quickie question. There'll also be time, probably after the meeting, to just ask general questions or make comments. But if anybody feels they need to get something mentioned here on the front end,

00:13:27.000 --> 00:13:38.000
by all means, please say something if you want to ask a question or make a quick comment.

00:13:38.000 --> 00:13:42.000
Going once. Going twice.

00:13:42.000 --> 00:13:44.000
Three times.

00:13:44.000 --> 00:13:59.000
Okay. All right. With that, I think we're ready to turn the presentation over to our main speaker, our only speaker tonight, Robert Citek. Robert, I think, lived in town here

00:13:59.000 --> 00:14:05.000
for a dozen years, here in St. Louis, worked for WashU and/or Barnes, and he was a great contributor to our group for those years when he was here in town. And even now that he…

00:14:05.000 --> 00:14:16.000
Okay. He's here, he can introduce himself, and you need to ask him whether he wants to hold comments or questions to the end, or what.

00:14:16.000 --> 00:14:37.000
That's right. That's right. So Robert now lives out in Albuquerque, New Mexico, and he still participates with us on a quasi-regular basis, and once again, he's presenting. So with that, I'm going to turn it over to Robert. And Robert, tell us, do you want people to just interrupt you and ask questions or make comments?

00:14:37.000 --> 00:14:38.000
As you, uh…

00:14:38.000 --> 00:14:44.000
Oh yeah, just interrupting is just fine.

00:14:44.000 --> 00:14:45.000
Can you hear me? Okay.

00:14:45.000 --> 00:14:49.000
That kind of echoed, Robert. Say it again. Yes, you're overmodulating, though. You're kind of echoing.

00:14:49.000 --> 00:14:58.000
Uh-oh. That's not good, and I just see my video just froze.

00:14:58.000 --> 00:14:59.000
Are you there?

00:14:59.000 --> 00:15:05.000
We're here, and you're obviously a cold person if you're just frozen there.

00:15:05.000 --> 00:15:11.000
I have no idea why that's happening. I'm looking at my video, and the video's fine, it's with… anyway.

00:15:11.000 --> 00:15:12.000
Anyway. All right, cool. I'm gonna go get a drink. I'll be right back.

00:15:12.000 --> 00:15:15.000
Okay, you're moving again.

00:15:15.000 --> 00:15:17.000
Okay.

00:15:17.000 --> 00:15:25.000
That's… I'm out of drinks. Shoot. It's coffee, really.

00:15:25.000 --> 00:15:37.000
Or tea or something. Anyhow, uh, thanks everyone for being here. Let me see if I can pull some stuff up real quick.

00:15:37.000 --> 00:15:50.000
Let's just do this. Click. Come on, talk to me.

00:15:50.000 --> 00:15:58.000
All right. That'll work. Oh, shoot, I need to log in real quick. Give me one second. I've got an MFA code.

00:15:58.000 --> 00:16:08.000
Come on.

00:16:08.000 --> 00:16:17.000
That worked. All right, sweet, sweet. All right. And…

00:16:17.000 --> 00:16:33.000
I'm gonna bring my entire screen back.

00:16:33.000 --> 00:16:54.000
Just want to show one more… get one more thing set up. Give me one sec.

00:16:54.000 --> 00:16:59.000
And finally… the Internet's slow today. I'm actually going to walk a little bit through some of the

00:16:59.000 --> 00:17:16.000
interesting things that have happened, and why. Which… I can't find it. There we go.

00:17:16.000 --> 00:17:31.000
Oh, well. While that's going, I'm gonna share something else in the meantime. So… yeah, the impetus for this talk was that Gary and I were chatting not too long ago about some of the stuff that I've been working on.

00:17:31.000 --> 00:18:01.000
My name is Robert Citek. I am an instructor here, here being Albuquerque, New Mexico, at Central New Mexico Community College. I teach in the data science program. I'm actually in my classroom right now. I wish I could show you everything, but my camera isn't… how would you say, the moving parts aren't working very well. One of the things that, of course, I deal with in teaching is teaching students how to use databases, right? So it's a data science course. So we go over things like

00:18:02.000 --> 00:18:19.000
mostly querying databases, right? So we don't talk about… I mean, we talk a little bit about building a database, and what an ER diagram is, and what indexing is, and things like that. But for the most part, most of the, you know,

00:18:19.000 --> 00:18:27.000
time is not spent building databases, but rather querying databases, right? So

00:18:27.000 --> 00:18:44.000
I brought up the DuckDB page here, and it, you know, kind of breaks it out into two things, OLTP and OLAP. OLTP is what we normally, at least what I normally, think of as databases, right? So, you know, the Oracles and the Postgreses, the MySQLs and the SQLites, and so forth.

00:18:44.000 --> 00:18:53.000
You do transaction processing, right? You update the database, you're adding things, you're deleting things, you're modifying things, and so forth.

00:18:53.000 --> 00:19:08.000
The other extreme, OLAP, is Online Analytical Processing, right? So the database is already there, and what you want to do is query it. You want to ask all kinds of questions, like: how many records are there? And what happens if I take data from here and here and join it, and then do stuff with it?

00:19:08.000 --> 00:19:23.000
Right? And it turns out that the workloads that you do are actually quite different. The concerns that you have with an OLTP kind of thing are kind of different than an OLAP kind of database.

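NOTE
Editor's sketch of the OLTP/OLAP split described above, in SQL. The table and column names (accounts, acs_households) are hypothetical, purely for illustration.
-- OLTP-style workload: small transactional writes with commit/rollback semantics.
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
UPDATE accounts SET balance = balance + 100 WHERE id = 7;
COMMIT;
-- OLAP-style workload: read-only scans and aggregations over data that is already there.
SELECT state, count(*) AS households, avg(income) AS avg_income
FROM acs_households
GROUP BY state
ORDER BY households DESC;
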
00:19:23.000 --> 00:19:30.000
And so this is some of the stuff that I teach my students. And

00:19:30.000 --> 00:19:46.000
as part of the course that I teach… the way this course is structured is, you know, we have six modules that we go through, talking about machine learning and databases and natural language processing, and so forth. And at the end of the course,

00:19:46.000 --> 00:19:55.000
all the students are expected to form teams, two to four individuals, and come up with a capstone.

00:19:55.000 --> 00:20:01.000
And about a year and a half ago, one of these teams

00:20:01.000 --> 00:20:11.000
was using data from the US Census, right? And that's actually another page I want to bring up: US Census geo…

00:20:11.000 --> 00:20:16.000
Geographies, obviously, but there we go. That's good enough.

00:20:16.000 --> 00:20:40.000
See this? So then we're working with geographies. And the US Census has broken out, you know, a lot of things about geographies, right? So we have this kind of national thing, and then, you know, regions and divisions and so forth, and it goes all the way down to these things called census blocks, right? And they collect an enormous amount of data

00:20:40.000 --> 00:21:01.000
about the population in the United States. Most people, when you talk about the census, think, oh yeah, this happens every 10 years, right? They go out and… you fill out the form and you mail it back in, or you go online, you fill out the data, and you're done with it. But it turns out that the census actually collects a lot more data than just that.

00:21:01.000 --> 00:21:07.000
In fact, there's something called the American Community Survey, the ACS, which.

00:21:07.000 --> 00:21:22.000
literally, what it does is it sends out a survey, not to the entire United States, but to a random representative sample. And they've got people out there to determine what exactly that is, but they send out a survey literally every

00:21:22.000 --> 00:21:40.000
month, right? So every month, the census is collecting data on the population of the United States, and the one that's collected every month is what is called the long form, right? So, you know, the one that's every 10 years is the short form; the one every month is the long form. And what they do is, then,

00:21:40.000 --> 00:21:53.000
after 12 months go by, they aggregate all the data, and they put out what is called the one-year American Community Survey information, right? So that information is available on a yearly basis.

00:21:53.000 --> 00:22:04.000
It comes out in September or something like that. The other thing they do is they also go and look back five years, and they put out the five-year estimates, and I think those come out in…

00:22:04.000 --> 00:22:08.000
I don't remember if it's October or December, but again, late in the year.

00:22:08.000 --> 00:22:33.000
But it turns out that what that means is we have an enormous amount of information, in fact, about the US population, and not just individuals, but households and businesses, and, you know, characteristics about these households and businesses as well. Like, what's their education level? How many people live in the household? And

00:22:33.000 --> 00:22:48.000
all the way down to: do you have a car, do you have internet service, do you have a telephone, and things like that, right? So, lots of really interesting information. And so my students were, you know, using this information in their capstone,

00:22:48.000 --> 00:22:55.000
using census information in addition to other data sources, aggregating the data, and doing some analysis on it.

00:22:55.000 --> 00:23:10.000
And what happened is, literally, the week before they were going to make their final presentations,

00:23:10.000 --> 00:23:15.000
the census… basically went offline.

00:23:15.000 --> 00:23:35.000
Right? And when I say offline, I mean that in one particular way. The census provides lots of really good data, and in fact, as far as data sources go, the US Census is perhaps one of the best data sources out there. And you can get data literally just from their website.

00:23:35.000 --> 00:23:51.000
So you can go to the website and look at data. In fact, we're looking at their website right now, which talks about, you know, the geographic entities and how they're divided down. So you can go to the website and find this information. They also have APIs, right? So you can apply for an API key, and

00:23:51.000 --> 00:24:06.000
I think you just fill out your email address, and boom, you get, usually within an hour, but for sure within 24 hours, they'll send you an API key. And you can use that API key against their RESTful APIs to go and query the data.

00:24:06.000 --> 00:24:17.000
And then the third way they make data available is via their FTP site, right? So… Good old-fashioned FTP, you go there, you don't even need to log in.

00:24:17.000 --> 00:24:34.000
You just go to the FTP site and download the data as you need. Well, it turns out that in this particular case, my students were using the API, and all of a sudden the APIs went offline, right? They couldn't get the data. And of course, at the scale that we're working with data,

00:24:34.000 --> 00:24:52.000
it's one of these things where you can't do this on the website, right? The website is great for clicking on things and saying, oh, I want to know this and this and this, click, click, click, boom. Oh, this looks nice, and, you know, maybe download a spreadsheet or something like that. But at the volumes of data that we're working with, and the way we want the data,

00:24:52.000 --> 00:25:01.000
that's just not feasible, right? So that's why we use the APIs. And of course, in this particular case, the APIs went offline.

00:25:01.000 --> 00:25:11.000
Fortunately, from a lot of, you know, Google searching and digging and things like that, we found out that they have a website, or excuse me, an FTP site.

00:25:11.000 --> 00:25:23.000
We went through their FTP site, found the data that we needed, downloaded it, and started, you know, working with it that way. Now, that said, the data on the FTP side is not exactly, um… the most…

00:25:23.000 --> 00:25:48.000
accessible data. Accessible it is, in that you can get it, but interpretable it isn't so much. And we'll see some examples of that in a little bit. But we were able to get it, we were able to make sense of it, we were able to tease it apart and what have you, and eventually, how would you say, the students were able to do their capstone projects, and it went really, really well.

00:25:48.000 --> 00:25:58.000
So that was the impetus for me thinking: gee whiz, that's not good if all of a sudden this really rich

00:25:58.000 --> 00:26:09.000
data source just disappears. Especially if, you know, I'm working with my students, I want to show them things, and we can't access the data.

00:26:09.000 --> 00:26:31.000
And so my thought was… how hard would it be to basically make a copy of the US Census, right? I mean, how much data does the US Census have? And as Gary mentioned, my background is, when I was at WashU in St. Louis, we were working in genomics.

00:26:31.000 --> 00:26:36.000
Right? So, uh, you know, big sequencing projects generating lots and lots and lots of data.

00:26:36.000 --> 00:26:48.000
So, you know, that's kind of where I'm coming from. It was not uncommon for us, when we were doing sequencing reactions, especially in later years, as in, I would say, within the past

00:26:48.000 --> 00:27:03.000
five to ten years, to have these sequencers generating, you know, 10 terabytes of data in a week. So terabytes and terabytes of data is not uncommon, or not unfamiliar, to me.

00:27:03.000 --> 00:27:28.000
So these data sets that the US Census had just made me wonder: how hard would it be to get all the data and make a duplicate of it? And of course, after starting in, I pared back my expectations and said, you know what, the US Census has a lot of data. A lot of it is not immediately relevant for me. A lot of it is, how would you say, historical data,

00:27:28.000 --> 00:27:43.000
which is great, but not what I was, or am, immediately interested in. So I'm interested in current data, right? So, you know, the most recent census, the most recent ACS, the one-year or the five-year, something like that.

00:27:43.000 --> 00:27:49.000
How much of that data is there, and can I get it? And once I get it, what can I do with it?

00:27:49.000 --> 00:28:02.000
So that was one motivation for me for starting this project: just, you know, hey, can I get the data from the US Census and then make it available?

00:28:02.000 --> 00:28:16.000
Related to that, a lot of times I get students who get data from lots of other places. And like I said, these students that were getting data from the U.S. Census were also getting data from other places, and they wanted to combine the data, right?

00:28:16.000 --> 00:28:23.000
And so I've had students that go out to things like the IRS, and…

00:28:23.000 --> 00:28:41.000
And here in the state, you know, you can go to the tax and revenue office and get a bunch of information about gross tax receipts for businesses or business sectors. You can get information from school districts, and, you know, crime reports and transit information and roads.

00:28:41.000 --> 00:29:01.000
All kinds of different stuff. And they would get this data, and an overwhelming majority of the time, the data is not in a truly usable format. And what I mean by that is the data may be in, I kid you not, Excel files, and we'll see some examples of that, right? So, um…

00:29:01.000 --> 00:29:04.000
Wow.

00:29:04.000 --> 00:29:24.000
So they'll download the Excel files, and they'll open up the Excel files and look at them and go, oh, gee whiz, okay, great, you know, I need to pull the data over here and the data over here, and reformat this and that, and all that kind of fun stuff, and make it so that it's actually usable. And then once they've spent all this effort cleaning the data and reorganizing it and making it, you know,

00:29:24.000 --> 00:29:38.000
equivalent to what would be, you know, like third normal form in a database kind of thing, they would then build the database, load the data, and then continue the capstone, and say, hey, look, we built this database.

00:29:38.000 --> 00:29:48.000
The unfortunate thing about databases is that, well, they need a server, right? They need, you know, Postgres, MySQL, Oracle,

00:29:48.000 --> 00:29:56.000
SQL Server, whatever, you need a dedicated server that's there with an internet connection, or some kind of a connection, some kind of network connection.

00:29:56.000 --> 00:30:05.000
That's number one, and then number two is you need some kind of a client that can communicate with it, right? To actually get the data.

00:30:05.000 --> 00:30:15.000
Those databases tend to be expensive. Even if you put it on a cloud server that offers it as a service, it's still up and running 24/7.

00:30:15.000 --> 00:30:25.000
And it's got to be maintained, right? It is a database. You've got to do stuff with it, right? Backups and all kinds of fun stuff.

00:30:25.000 --> 00:30:43.000
And so, I was looking at how we can solve this problem. With the problem being: how can we take data, for example, from the US Census, reorganize it so that it's useful, you know, basically make a database-like structure,

00:30:43.000 --> 00:30:51.000
and then host it someplace so people could actually then use it, right? So restructure it and make it usable.

00:30:51.000 --> 00:30:59.000
And one of the cool things that's really happened over the past couple of years is,

00:30:59.000 --> 00:31:09.000
we actually have some really cool technologies coming down the pipe. One of them is a data format called Parquet.

00:31:09.000 --> 00:31:19.000
And this is, you know, it's been around for a while, and it's, of course, evolving and getting better, and, how would you say, getting more and more adopted.

00:31:19.000 --> 00:31:26.000
And what's nice about Parquet is that it's what is known as a columnar storage format, or data format,

00:31:26.000 --> 00:31:34.000
which is nice. But for my personal use, what I'm mostly interested in is: it's a file.

00:31:34.000 --> 00:31:43.000
That's all it is. It's just a file. Now, it's got some structure to it, no question about it, but it's just a file.

00:31:43.000 --> 00:31:58.000
With that, it has some characteristics that I like that are consistent with a database, right? One of them is that it's tabular data, right? So you have this concept of rows and columns; that's number one. Number two is,

00:31:58.000 --> 00:32:16.000
it has data types. Right? So lots of people think, you know, gee whiz, we can take data and put it up on the web, and we can make it available as a CSV file or an Excel file or something else along those lines.

00:32:16.000 --> 00:32:38.000
The unfortunate thing is, CSV files and Excel files, in that realm, don't have data types, right? So if you want to get a CSV file and load it up and say, well, I want this to be an integer, and I want this one to be a float, and I want this one to be a date, and this is a string, and so forth…

00:32:38.000 --> 00:32:41.000
There's no way to enforce that in a CSV file.

00:32:41.000 --> 00:33:05.000
And to give you an idea of why this is kind of a problem with CSV files as well as Excel files: it's not uncommon in the genomics world to store data as CSV files, and then people will load them up into Excel. The problem becomes when you have genes that are named things like, uh, SEPT5.

00:33:05.000 --> 00:33:19.000
Right? And, uh… in fact, let me see if I can bring it up real quick. I think it's called SEPT5. Yeah, there you go. So septin-5 is a protein in humans encoded by the SEPT5 gene. Great, that's the name of the gene.

00:33:19.000 --> 00:33:22.000
Anybody want to take a guess what happens when you load

00:33:22.000 --> 00:33:31.000
information about SEPT5, including its name, into Excel?

00:33:31.000 --> 00:33:32.000
It thinks it's September 5th. It becomes a date.

00:33:32.000 --> 00:33:34.000
I imagine it pukes.

00:33:34.000 --> 00:33:39.000
It blows it up, right?

00:33:39.000 --> 00:33:44.000
All right. I heard it pukes. I heard it blows up. What, specifically, happens?

00:33:44.000 --> 00:33:51.000
It's… It sets it to September 5th.

00:33:51.000 --> 00:33:52.000
It's a date. Excel thinks it's a date.

00:33:52.000 --> 00:34:02.000
Exactly. Because it… yeah, it thinks it's a date, right? And the reason why is because, well, of course, CSV doesn't have

00:34:02.000 --> 00:34:15.000
a date type, right? It just has strings and things that look like numbers, and, you know… but there's no schema to a CSV file. In fact,

00:34:15.000 --> 00:34:25.000
not only is there no schema in a CSV file, but unlike most other data formats, there actually isn't a standard for CSV files. For most other things, you can go and look up the RFC: for JSON and YAML and

00:34:25.000 --> 00:34:31.000
No.

00:34:31.000 --> 00:34:54.000
God knows whatever else. And they exist. CSV does not; it does not have a standard. It's more of a convention than anything else. And so, therefore, we run into problems like this, right? MARCH3, I think, is another one, or MARC3, I don't remember. There are a bunch of genes that have names that are very similar to dates, right?

00:34:54.000 --> 00:35:11.000
And so those are the issues with storing things as CSV files and Excel files, if you want to put them on the web for people to actually download and then use, right? It's not, how would you say, expressive enough. It doesn't have a schema.

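NOTE
Editor's sketch of a workaround for the problem just described: since CSV itself carries no schema, pin the column types explicitly at load time. This uses DuckDB's read_csv with its columns parameter; the file name and columns are hypothetical.
SELECT *
FROM read_csv('genes.csv',
              header = true,
              columns = {'gene_symbol': 'VARCHAR',  -- stays the string 'SEPT5', never a date
                         'expression': 'DOUBLE'});
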
00:35:11.000 --> 00:35:18.000
So… What's really nice is Parquet does. It's a file format, it has a schema.

00:35:18.000 --> 00:35:26.000
Um, and it has a very elaborate schema. In fact, I don't remember how many data types it has, but it has a bunch of them.

00:35:26.000 --> 00:35:33.000
And that actually then sets it apart. Or, excuse me… so that's the data format.

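NOTE
Editor's sketch: because Parquet embeds a schema, you can ask the file itself what its columns and types are. parquet_schema is a built-in DuckDB function; the file name is a placeholder.
SELECT name, type
FROM parquet_schema('genes.parquet');
-- Every column reports a declared type (VARCHAR, DOUBLE, DATE, ...),
-- so nothing is silently reinterpreted the way Excel reinterprets SEPT5.
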
00:35:33.000 --> 00:35:40.000
The second technology, which is really nice… right, so we talked about Parquet; that's a file that you can read and query data from.

00:35:40.000 --> 00:35:43.000
Why can't you… sorry, I'm yelling at my Postgres install right now.

00:35:43.000 --> 00:35:50.000
right? The other nice…

00:35:50.000 --> 00:35:51.000
The morons couldn't say, oh, by the way, this starts up as user 999.

00:35:51.000 --> 00:35:55.000
Rudy.

00:35:55.000 --> 00:36:00.000
And we're going to get into that in a second as well.

00:36:00.000 --> 00:36:20.000
I'm leading into that. But anyhow, the other technology, which is really cool, is what is called, or at least the way I like to call it, a querying serv… not service, a query engine, right? So,

00:36:20.000 --> 00:36:47.000
unlike traditional databases, which have, how would you say, all kinds of things that they need to keep track of, right? So things like: it has to make sure, for transactions, that you can do commits and rollbacks, that there's some kind of logging going on, that there's some kind of authentication, etc., etc., right? That is what, you know, traditional OLTP databases do:

00:36:47.000 --> 00:36:59.000
Postgres, MySQL, Oracle, and so forth. DuckDB doesn't do that. DuckDB is literally just a query engine. You can connect it to a data source and query it, and that's it.

00:36:59.000 --> 00:37:10.000
At least, that's the way it used to be. It's now gaining some more features, if you will. But think of it for now as just a query engine, right? So it looks at a data source,

00:37:10.000 --> 00:37:18.000
It knows how to interpret that data source, and it can query that data source. And that data source could be anywhere right?
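NOTE
Editor's sketch of the "just a query engine" idea above: one DuckDB session querying several kinds of data sources in place, with no server and no load step. The file names are hypothetical.
SELECT count(*) FROM read_csv('local_data.csv');         -- a CSV file
SELECT count(*) FROM read_json_auto('local_data.json');  -- a JSON file
SELECT count(*) FROM read_parquet('local_data.parquet'); -- a Parquet file
-- Same engine, same SQL, three different data sources.
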

00:37:18.000 --> 00:37:34.000
And so now what we have is we have parquet, which is a nice format, a file format that has data structure, or how you say a nice database like structure to it, and you have DuckDB, which is a nice query engine.

00:37:34.000 --> 00:37:54.000
Well, DuckDB can query a parquet file. And what's really nice about this is that Parquet file can be anywhere. And when I say anywhere, I literally mean anywhere. So let me show you this real quick. This is kind of the the end result of a bunch of toil. But in essence, what I've done is.

00:37:54.000 --> 00:38:03.000
Cure the Parquet files that I have in Amazon. I've taken a bunch of data sets. We'll talk about that in a little bit, taken a bunch of data sets.

00:38:03.000 --> 00:38:17.000
converted them into parquet and put it on Amazon, right? So they are here. If I click on one of these guys, actually, I don't want that one. Let me go back real quick. I want. Oops, I don't want that one either.

00:38:17.000 --> 00:38:21.000
I just want the one with one file and a folder.

00:38:21.000 --> 00:38:28.000
Come on. Uh…

00:38:28.000 --> 00:38:41.000
Oh, well, let me just pick on one of these guys, then. So here it is. Here's the URL to this one Parquet file. And if I want to, I can literally just point DuckDB at this and say, hey, you see this one Parquet file?

00:38:41.000 --> 00:38:59.000
Query it. Do a select star from this Parquet file and tell me what's in it, right? Not only can I query it in the traditional sense, you know, hey, select star, I can actually also query the metadata about it. And then, you know, it'll tell me, oh yeah, these are all the columns, these are the data types of the columns, and so forth.

00:38:59.000 --> 00:39:05.000
Let me show you an example of that. So let me go to here.

00:39:05.000 --> 00:39:09.000
So here's my… one of my Parquet files.

00:39:09.000 --> 00:39:14.000
And let me just clear out all the goodies here.

00:39:14.000 --> 00:39:31.000
Alright, so here's my Parquet file. We'll talk about what table shells are in a little second. This is kind of a fun little piece here. In essence, what I'm doing is telling it: hey, install httpfs, so it knows how to interact with files over the Internet.

00:39:31.000 --> 00:39:44.000
And then I'm literally just saying select star from this… table. And of course, that table is just a Parquet file that's sitting on the web someplace. And I'm saying limit 10. And if I run this…

00:39:44.000 --> 00:39:51.000
boom. Notice how fast that was. That literally went out over the internet, grabbed the information, and pulled it back down.

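NOTE
Editor's reconstruction of the query being demonstrated here, from the description in the talk; the URL is a placeholder, not the speaker's actual bucket.
INSTALL httpfs;  -- one-time: fetch the extension
LOAD httpfs;     -- per-session: enable http(s):// and s3:// paths
SELECT *
FROM 'https://example-bucket.s3.amazonaws.com/census/table_shells.parquet'
LIMIT 10;
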
00:39:51.000 --> 00:40:04.000
One of the cool things about Parquet and its format, and also the fact that it uses the HTTP protocol for getting things, is that,

00:40:04.000 --> 00:40:15.000
based on its structure, DuckDB goes and makes a query to the Parquet file and says, hey, give me the metadata, or some of the metadata, enough of the metadata that I can work with.

00:40:15.000 --> 00:40:30.000
It gets some of that, and then it looks at it and says, okay, great, I only want the first 10 lines. And so… give me just the first 10 lines. And it knows how to do…

00:40:30.000 --> 00:40:43.000
oh, shoot, now I'm drawing a blank… a range request. It can go to a file on an HTTP server and say, you know what? I only want data from here to here,

00:40:43.000 --> 00:40:55.000
and that's it. If I've got a terabyte file, and I don't, but let's say I did. If I had a terabyte file and I just wanted a piece of the data that's in the middle,

00:40:55.000 --> 00:41:12.000
DuckDB knows how to query that Parquet file, get just the, how would you say, byte offsets of where that range is, and just get that data, and only query that small piece, right? So let's say I only want a kilobyte of data in the middle of a terabyte file.

00:41:12.000 --> 00:41:22.000
It knows how to query it and pull that data down. So here, when I'm requesting, hey, I want 10 rows, it literally goes to the

00:41:22.000 --> 00:41:32.000
Parquet file, finds out whereabouts those 10 rows are, and makes that range request to just get those 10 rows and bring them down.

00:41:32.000 --> 00:41:40.000
It does not download the entire file, just enough to do what it needs to do.

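NOTE
Editor's sketch of what the partial-read behavior above rests on: the Parquet footer records where each row group and column chunk lives, so a reader can issue HTTP range requests for just those bytes. DuckDB exposes that footer through parquet_metadata; the URL is a placeholder.
SELECT row_group_id,
       row_group_num_rows,
       path_in_schema,
       total_compressed_size
FROM parquet_metadata('https://example.com/data/big_table.parquet');
-- Answering this touches the footer, not the data pages themselves.
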
00:41:40.000 --> 00:41:58.000
It also… and I haven't quite figured this out yet with Amazon, it can do it also over HTTPS, but it definitely works with S3. So S3 buckets work really well. I don't have to specify the names of the files. I can say, well, you know what? Here's a glob

00:41:58.000 --> 00:42:05.000
of files that I have. So I've got lots of Parquet files, and we saw them over here earlier.

00:42:05.000 --> 00:42:14.000
If we go back real quick… check, check, check. Great. I've got a bunch. Look how many of these Parquet files I have.

00:42:14.000 --> 00:42:23.000
Chunk after chunk after chunk, right? A whole bunch of them. I can basically tell DuckDB: hey, you know what? I've got a bunch of Parquet files.

00:42:23.000 --> 00:42:30.000
Go ahead and query them. I want just the first 10 records from that bunch of Parquet files.

00:42:30.000 --> 00:42:44.000
And what I've done with these particular files is put them in a structure called a hive partition, and it knows what that is. In essence, it's a file system hierarchy where I can break pieces down into groups.

00:42:44.000 --> 00:42:57.000
And DuckDB knows how to query that. So I say: go ahead, go out there, you know, hey, I've got a bunch of Parquet files out there, query them, and give me just the first 10. And it goes out, boom, comes back.

00:42:57.000 --> 00:43:12.000
Right? And this is kind of really nice. Let me just see if I can… I did do this query earlier, but I think I removed it. Let me just do a count of how many there are.

00:43:12.000 --> 00:43:19.000
Right, so I'm not going to limit that. I'm just going to say, hey, count… count how many rows are in all my Parquet files.

00:43:19.000 --> 00:43:31.000
And boom, it went out there and it calculated that there are, what is that, 22.8 million rows. And notice it did that in, like, a fraction of a second.

00:43:31.000 --> 00:43:40.000
Right? I think I've got somewhere in the neighborhood of, you know, 1,200 Parquet files or something like that. It literally just went out there and said, you know, go to all of them,

00:43:40.000 --> 00:43:47.000
give me the metadata, count how many rows there are, and tell it back to me.

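NOTE
Editor's sketch of the glob-and-count demo above. The bucket name and layout are placeholders; hive_partitioning tells DuckDB to read folder names like table_shell=B28003 back as a column.
SELECT count(*)
FROM read_parquet('s3://example-bucket/census/table_shell=*/*.parquet',
                  hive_partitioning = true);
-- A bare count(*) can be answered from row-group metadata alone,
-- which is how ~1,200 remote files are counted in a fraction of a second.
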
00:43:47.000 --> 00:43:50.000
So these are some of the kind of cool things that we can do.

00:43:50.000 --> 00:43:55.000
I can query data that's in a file out on the Internet.

00:43:55.000 --> 00:44:01.000
So what are the implications of that? It means that I don't have to set up a database.

00:44:01.000 --> 00:44:15.000
What I need is just a web server, and I can just take files and put them on a web server. Now, I may have to organize them in a certain way. I did here, you know, using hive partitioning, but I don't have to do that. I mean, that just

00:44:15.000 --> 00:44:21.000
makes things go a little bit faster. But I don't need a database.

00:44:21.000 --> 00:44:30.000
Right? What it also means is that if I want to make data available to other people,

00:44:30.000 --> 00:44:33.000
in addition to not having to set up a database,

00:44:33.000 --> 00:44:38.000
I can just put it on a web server, and they don't have to download all this data

00:44:38.000 --> 00:44:52.000
and then do whatever they do locally. An example of that, like I said earlier, is there are lots of places out there that make Excel files available, CSV files, and so forth. That's great.

00:44:52.000 --> 00:45:04.000
But if you actually want to use them, you have to download all of them, get the pieces that you want, and then, again, probably build your own database to query them.

00:45:04.000 --> 00:45:14.000
With this, you don't. The data's already there. You just need to know how the data is structured and how to form a SQL query, and query the data.

00:45:14.000 --> 00:45:26.000
What that also means is, if you want to make the data available as a quote-unquote database itself, or make the data available to folks, you don't need to set up

00:45:26.000 --> 00:45:44.000
an API server, right? And all the things that go with it: well, gee whiz, you know, if I want this for a route, I'd have to set this up and then make this query, and so forth. Instead, you can just say, well, here's the data. It's in Parquet format, and you can query it

00:45:44.000 --> 00:45:56.000
however you like. So it makes, you know, public data available very quickly and easily to the general populace.

00:45:56.000 --> 00:46:05.000
So, this was kind of the impetus. This is now what I'm going to be showing my students, of course in different contexts. But the idea is we can take

00:46:05.000 --> 00:46:27.000
CSV files, Excel files, statistics files. And fortunately, yes, you can do it with shapefiles and TIGER files, which are geospatial stuff. Obviously, I've got some geospatial stuff that I was working on. I didn't get all the way into the geospatial stuff that I wanted to join in here.

00:46:27.000 --> 00:46:48.000
That'll be something for next time. But clearly, we can do that as well. And so the normal way, right, the quote-unquote normal way, is, you know, download the files, get them locally, and work with them with specialized tools. If you've got Excel files, you need to have Excel, or something that can read an Excel file, some spreadsheet or something like that.

00:46:48.000 --> 00:47:09.000
If you've got CSV files, again, you need some kind of a tool that can read those. If you've got geospatial data, well, you now need to have something that knows how to work with GIS stuff. Of course, if you've got Python and pandas, they know how to do all that stuff. But what's nice in this way is that you can just have DuckDB, and it can do all of those.

00:47:09.000 --> 00:47:27.000
And not only can it query individual files, right? So I had this individual Parquet file up here, and I said, great, you know, give me the first 10 things. Here, I've got a bunch of Parquet files that all have the same schema, or, as you say, all the same structure, and I said, great, how many are there?

00:47:27.000 --> 00:47:32.000
What are they? Right, so here they are, at least the first 10 rows of them.

00:47:32.000 --> 00:47:47.000
We'll talk about these fields in a minute, but notice what I've got up here. These are actually two related tables. This is a table shells table, where we've got something called the table shell; we'll talk about what that is. But it gives you some information,

00:47:47.000 --> 00:47:56.000
some meta-information about the ACS. Down here, we also have table shells.

00:47:56.000 --> 00:48:14.000
Right? And these are basically a one-to-many relationship. Right? So this one down here is called a data file, or a data table. The data tables are related to these table shell tables in a many-to-one relationship.

00:48:14.000 --> 00:48:22.000
Well, we can join them, right? A traditional inner join or outer join or whatever, but

00:48:22.000 --> 00:48:37.000
again, what we can do is use DuckDB for this. So I'm going to do a little bit of sleight of hand, in the sense that I'm going to simplify things. What I'm going to do is say: well, you know that Parquet file that's out there? I'm going to make a view

00:48:37.000 --> 00:48:50.000
of that. I'm gonna call it table shells. And so instead of having to reference it as this long URL, I can just reference it as a view called table shells. So I'm going to create this view.

00:48:50.000 --> 00:49:03.000
I'm going to create another view called data. And this is going to be the one where there are multiple files, right? They're in a hive structure, and I'm going to call that view data.

00:49:03.000 --> 00:49:11.000
And what I can do now is join them, right? So what I'm going to do is say: grab all the fields from the table shells table,

00:49:11.000 --> 00:49:27.000
grab all the data from the data tables, and exclude the table shell column. And the reason why is because that's what we're going to be joining on, and if you don't exclude it, you end up with two columns with the same name, or almost the same name.

00:49:27.000 --> 00:49:31.000
And so I can then say: great, from this table

00:49:31.000 --> 00:49:39.000
called table shells, and from this other table called data, and you're going to use these aliases called T and D.

00:49:39.000 --> 00:49:50.000
And I can specify the join relationship, right? So I can say table shell in one table is the same as table shell in the other. And then I can specify a state and a county.

00:49:50.000 --> 00:50:03.000
And if I run this, it's literally smart enough to go out and say: okay, great, I just need enough data from this one and enough data from this one, and… put them together, and here they are, right?

00:50:03.000 --> 00:50:22.000
None of this stuff is running on my machine, in the sense that I don't have a database on my machine. It's literally just going out to these Parquet files and pulling the data together. So I was able to pull down 106,000 rows and 13 columns. It just so happens that state 29 is Missouri,

00:50:22.000 --> 00:50:31.000
and county 510 is St. Louis. So this is some information that has to do with

00:50:31.000 --> 00:50:42.000
computer use in the county of St. Louis. So, um…

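NOTE
Editor's reconstruction of the two views and the join described above. View names, the join key, and the state/county codes follow the talk; the URLs and exact column names are assumptions.
INSTALL httpfs; LOAD httpfs;
CREATE VIEW table_shells AS
    SELECT * FROM read_parquet('https://example-bucket.s3.amazonaws.com/census/table_shells.parquet');
CREATE VIEW data AS
    SELECT * FROM read_parquet('s3://example-bucket/census/table_shell=*/*.parquet',
                               hive_partitioning = true);
SELECT t.*,
       d.* EXCLUDE (table_shell)          -- drop the duplicated join key
FROM table_shells AS t
JOIN data AS d ON t.table_shell = d.table_shell
WHERE d.state = 29 AND d.county = 510;    -- 29 = Missouri, 510 = St. Louis
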
00:50:42.000 --> 00:50:59.000
We can now literally, just using DuckDB, go out, pull this data, and join it. This is on AWS. I wanted to do the same thing on DigitalOcean. Unfortunately, I couldn't figure out how to upload a folder with thousands of files in the way that I wanted to, so I was only able to upload this one.

00:50:59.000 --> 00:51:02.000
That's… let me see if I can bring it up.

00:51:02.000 --> 00:51:08.000
That's… this file over here. Again, it's just a…

00:51:08.000 --> 00:51:24.000
I don't think they have any way to show this. It's just a URL to a web page, and if we go and query it, boom, it does the same thing, right? So instead of going to AWS, I'm doing this on DigitalOcean, right? And it's acting as a web server.

00:51:24.000 --> 00:51:38.000
So literally, nothing fancy. I have Parquet files, I put the Parquet files on a web server, and I've got DuckDB, and DuckDB is able to query stuff on a web server. No database server involved.

00:51:38.000 --> 00:51:43.000
All right. Any questions on any of this so far?

00:51:43.000 --> 00:51:50.000
So I'd like to go a little bit deeper and show you a little bit.

00:51:50.000 --> 00:51:51.000
Yeah.

00:51:51.000 --> 00:52:02.000
This is Stan. I have a question on what's on the screen now, the 13 lines. Is that a particular code? Like… or is it just specific to that Parquet? And also, can you put a remark in at, say, line 11,

00:52:02.000 --> 00:52:06.000
Nope.

00:52:06.000 --> 00:52:09.000
to say that I changed this on Tuesday.

00:52:09.000 --> 00:52:16.000
Oh, yeah. So this is standard SQL. So I can say…

00:52:16.000 --> 00:52:26.000
I changed this on Thursday. Oops, not Tuesday, Thursday, right? The double dash is the comment in SQL.

00:52:26.000 --> 00:52:31.000
The first two lines here basically say: hey, DuckDB,

00:52:31.000 --> 00:52:49.000
load up these extra modules so you know how to get stuff over the HTTP protocol. And then this is pretty much standard SQL, right? Select the columns that you want, from the table that you want, and limit 10.

00:52:49.000 --> 00:52:58.000
And if we run this… it runs just fine. So this is a comment, and if you want to change it later on, or add something else, you absolutely can.

00:52:58.000 --> 00:53:00.000
Does that help, Stan? All right.

00:53:00.000 --> 00:53:04.000
Yes, it does, definitely.

00:53:04.000 --> 00:53:08.000
Yeah, any other questions?

00:53:08.000 --> 00:53:14.000
So this basically competes directly with solutions like

00:53:14.000 --> 00:53:18.000
Snowflake and Databricks.

00:53:18.000 --> 00:53:23.000
Oh, good question. Is it competing directly with them?

00:53:23.000 --> 00:53:29.000
I don't know. I don't think it is. But I could be wrong.

00:53:29.000 --> 00:53:30.000
Because that's what Snowflake and Databricks, from what I can tell…

00:53:30.000 --> 00:53:34.000
What? Yes.

00:53:34.000 --> 00:53:45.000
Oh… basically do. They're a web-based

00:53:45.000 --> 00:53:46.000
Uh-huh.

00:53:46.000 --> 00:53:55.000
big data toolkit where you can go in and ETL data in and out of sources and do your analysis.

00:53:55.000 --> 00:54:08.000
Yeah, so what's nice about those two services, solutions, products, I guess, is a good term for it, is they provide an entire ecosystem, if you will, for not only just

00:54:08.000 --> 00:54:23.000
querying data, but also doing analysis and all kinds of other fun stuff with the data, right? All within a, you know, consistent environment, if you will.

00:54:23.000 --> 00:54:29.000
So… Does… does DuckDB compete with that?

00:54:29.000 --> 00:54:32.000
I don't know. I don't think it does.

00:54:32.000 --> 00:54:50.000
But what is cool is that, for some things where traditionally you would say, well, gee whiz, yeah, go to Databricks or Snowflake because there are no other tools…

00:54:50.000 --> 00:54:58.000
I think for some of the low-hanging-fruit stuff, this is a good solution, right?

00:54:58.000 --> 00:55:14.000
And the examples that I'm giving here, of course, use, you know, a web server that's out on the internet, right? DigitalOcean, AWS; it could be, you know, somebody else's web server that's out there. There's no reason why it couldn't be internal as well, right? You could have an internal web server, or even an internal file system.

00:55:14.000 --> 00:55:30.000
And instead of just HTTP, you just tell DuckDB: hey, go to, you know, the G drive or something like that, or whatever the mount is, you know, /mount/blah/blah/data, and pull data off of that, and DuckDB will do that just as well.

00:55:30.000 --> 00:55:46.000
But yeah, I don't know if it's a competitor. I don't know enough about Databricks and Snowflake's target

00:55:46.000 --> 00:55:55.000
audience. You know, maybe 80% of their clients

00:55:55.000 --> 00:55:59.000
just do this. They just have some data someplace that.

00:55:59.000 --> 00:56:09.000
You know, they want it on a file server or web server, and they just want to query it. If that's all you're doing, then yeah, DuckDB with some Parquet files will work just fine.

00:56:09.000 --> 00:56:20.000
In fact, what's really cool about Parquet is that if you set things up like this in a hive structure with Parquet files, you can use, um.

00:56:20.000 --> 00:56:37.000
Spark or, you know, Databricks or Snowflake, or any of their tools, to access the data the same way, right? So if you've got the data in that format, you can come in from DuckDB, or you can come in with any of those other tools and query the exact same data.

00:56:37.000 --> 00:56:44.000
Earlier, Ed asked if you could talk about the hive structure. What is that?

00:56:44.000 --> 00:56:45.000
Yeah, that's…

00:56:45.000 --> 00:56:55.000
Oh, yeah, sure. Let me, uh… I think I may have an example of that up here. In essence, what you do is you you take some of the columns and.

00:56:55.000 --> 00:57:03.000
Uh, you basically partition them, and of course, I don't have an example of that. You partition them.

00:57:03.000 --> 00:57:08.000
Uh, on your file system.

00:57:08.000 --> 00:57:20.000
Uh, I don't think this is gonna work. Let's… let's see. Oh yeah, here we go. So, uh, here's an example, you know, of a smaller piece that I was working on. So we've got census export, um.

00:57:20.000 --> 00:57:34.000
I only looked at some of these shells, and within each of these shells, these are just… and I'll go into this in a little bit. But basically what I did is I took these shell files, which were Excel files,

00:57:34.000 --> 00:57:42.000
pulled the data out of Excel, parsed it a little bit, and then saved it as a Parquet file, and I used the shell name

00:57:42.000 --> 00:58:04.000
as the partitioning process. And so what DuckDB does, it'll say, okay, great, what I'm going to do is create a folder named table shell, which is the column name, equals, and then, in this case, B28003 is the value, and then it puts

00:58:04.000 --> 00:58:14.000
inside that folder, it puts the actual Parquet file, right? And it does that for every single different value that I have for table shell.

00:58:14.000 --> 00:58:29.000
The advantage of this is if I go in and I say, well, geez, you know, I want to select blah blah blah from this entire data set, but I'm only interested in table shell, I don't know, B28009G,

00:58:29.000 --> 00:58:37.000
It automatically knows about the structure and says, oh, you know what, I only need to go to this one Parquet file and ignore the rest.

00:58:37.000 --> 00:58:42.000
Right? And so that's what hive structuring is.

00:58:42.000 --> 00:58:56.000
In this case, I only went one level deep; you can go multiple levels deep. A good example of that… actually, let me scroll down. One of my plans, unfortunately, I didn't get a chance to get to. If we look down here, we've got

00:58:56.000 --> 00:59:09.000
FIPS information. So we got state, county, and there's actually one more level as well, which I didn't parse out, which I'll need to do. But you can, in essence, have something that says,

00:59:09.000 --> 00:59:26.000
the state is the first level, and the county is the next level. And I think the next one is the census tract, I believe. And so you can say, I want everything from, you know, St. Louis, and only look at things that are in St. Louis.

00:59:26.000 --> 00:59:31.000
Right? Or, I only want to look at things that are in a particular tract number, and it knows it.

00:59:31.000 --> 00:59:39.000
DuckDB will look at the hive structure, index it, and say, okay, great, you only need to look at these folders.
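
A sketch of the read side of that, assuming the census_export layout shown on screen; DuckDB reads the table_shell=... folder names as a column and prunes everything the filter excludes:

    import duckdb

    # hive_partitioning tells DuckDB the folder names carry column values,
    # so this WHERE clause only touches the one matching folder's file.
    duckdb.sql("""
        SELECT *
        FROM read_parquet('census_export/*/*.parquet', hive_partitioning = true)
        WHERE table_shell = 'B28009G'
        LIMIT 10;
    """).show()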

00:59:39.000 --> 00:59:46.000
And how do you define this structure? SQL statements, or what?

00:59:46.000 --> 00:59:54.000
Yeah, exactly. Let me just go back a little bit. Let's see where I actually do that, where I created.

00:59:54.000 --> 01:00:05.000
Yeah, right here. So in this particular case, and if we have time, I'll go through all of this stuff, but yeah, these are basically SQL statements that go through and,

01:00:05.000 --> 01:00:22.000
I would just say, restructure the data in a way that makes sense. And then at the very end, you can say, great, take all that data, save it as a Parquet file, and partition it, in this particular case by table shell, and table shell happens to be one of the column names.

01:00:22.000 --> 01:00:37.000
And so what DuckDB automatically does is it says, okay, great, I'm going to get a distinct list, a unique list, of all the table shells, create folders for those, and then take each one of those, basically think of it as a group by, and say, group by this,

01:00:37.000 --> 01:00:46.000
turn that into one Parquet file. Next one, next Parquet file. Next group, next Parquet file, and so on and so forth. You don't need to worry about it.

01:00:46.000 --> 01:00:53.000
It does that for you. You just have to tell it how you want it partitioned.
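
The write side is a single COPY statement in DuckDB; the source table name here is a placeholder for whatever the presenter's script produces:

    import duckdb

    # One folder per distinct table_shell value, each holding the
    # Parquet file with that group's rows.
    duckdb.sql("""
        COPY (SELECT * FROM my_table)
        TO 'census_export'
        (FORMAT PARQUET, PARTITION_BY (table_shell));
    """)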

01:00:53.000 --> 01:00:56.000
Does that answer your question, Stan?

01:00:56.000 --> 01:01:09.000
Well, the question was actually from Ed, and I listened to what you said, and I have no idea, other than, yes, SQL statements is how you define it.

01:01:09.000 --> 01:01:10.000
So, I'm not real sure what a hive structure is.

01:01:10.000 --> 01:01:18.000
Yes.

01:01:18.000 --> 01:01:23.000
Okay.

01:01:23.000 --> 01:01:24.000
Okay, I can understand that. All right, thank you.

01:01:24.000 --> 01:01:36.000
Oh, it's just a hierarchy of files and folders. The files ultimately are Parquet files, and the folders are… are this. Yeah. In fact, let me just drop into the terminal real quick. Whoa, not like that.

01:01:36.000 --> 01:01:39.000
Hey, let me blow this up a little bit.

01:01:39.000 --> 01:01:48.000
Here's my folder called census export. And these are all the folders.

01:01:48.000 --> 01:01:56.000
They just happen to be in a structure where the naming convention for the folders is that

01:01:56.000 --> 01:02:09.000
the first part is the name of the column, or the field, that you're interested in in your database. And the second part, the one to the right of the equals sign, is the particular value that was in that column.

01:02:09.000 --> 01:02:14.000
And so you can… you do a find. Oops.

01:02:14.000 --> 01:02:22.000
There you go. You got all the files and all the folders with it. So here's a folder, there's a file that's in it. There's a folder, there's a file that's in it.

01:02:22.000 --> 01:02:29.000
It is just a file system hierarchy structure.

01:02:29.000 --> 01:02:32.000
That's understandable. Okay, thank you.

01:02:32.000 --> 01:02:37.000
Uh-huh. All right.

01:02:37.000 --> 01:02:56.000
And I want to go through and show you what my students had to go through, and eventually how we kind of solved it, and how this project kind of came about. So I'm not going to go through all of this stuff. I'm going to just try to find the original,

01:02:56.000 --> 01:03:04.000
because I just took some of these. Let me grab this. What I did is I looked at the.

01:03:04.000 --> 01:03:15.000
at the census 5-year data. No, no, no… oh, please tell me it's… all right. So this is what was interesting.

01:03:15.000 --> 01:03:25.000
This was happening to me this afternoon. Everything was going fine until about maybe two this afternoon.

01:03:25.000 --> 01:03:32.000
Okay, good. So these… are all the data files.

01:03:32.000 --> 01:03:48.000
that the U.S. Census provides for the 5-year data, right? So we got B01001.dat, B01001A.dat, and then B, C, D, and so forth.

01:03:48.000 --> 01:03:49.000
That's it. Is that… is that zip file a roll-up of these data files?

01:03:49.000 --> 01:03:55.000
These individual files. Yeah, go ahead. Question.

01:03:55.000 --> 01:03:56.000
Okay, thank you.

01:03:56.000 --> 01:04:16.000
It is. It is. Yeah, so if you want, you can download this one zip file, which I strongly recommend. And the reason why I say that is because, as nice as it is to do web scraping, one of the things the U.S. Census does do is it recognizes when you start making lots of requests.

01:04:16.000 --> 01:04:23.000
Um, it sees you as a bot, thinks you're being abusive, and will shut you down.

01:04:23.000 --> 01:04:27.000
Right? It'll prevent you from grabbing stuff

01:04:27.000 --> 01:04:40.000
through the web interface. So my recommendation is, download the zip file and you get all these data files. Now, the question, of course, is, well, gee whiz, what is a DAT file? And it turns out that these DAT files…

01:04:40.000 --> 01:04:53.000
Let me make sure I've got all this stuff in here. Looks like I do. These data files, if we look at them… I'm just doing a quick curl request to verify that I can access things.

01:04:53.000 --> 01:05:08.000
Let me… actually, let me do… just real quick, right? So, great, I created a list of these, a portion of them, I should say. There are roughly…

01:05:08.000 --> 01:05:20.000
I can't remember, it's 1,200 or 1,600 of these data files. And so I just want to make sure: can I get it using curl? And I can. That's good. Let's go ahead and look at the first two lines.

01:05:20.000 --> 01:05:26.000
And what we see is… it's a CSV file.

01:05:26.000 --> 01:05:42.000
It's not a CSV file in the sense that it's not comma-separated values, but it is a CSV file in the sense that it's just a text file, and instead of commas, it's using vertical bars to delineate the different values, right? So,

01:05:42.000 --> 01:05:55.000
It has a header in the beginning, it's got this geo ID, and then it has all these other things. We'll talk about that in a second. And then we start getting the actual data. So what's cool about that is we can read this.

01:05:55.000 --> 01:06:03.000
And I'm gonna… I'm gonna do a little bit of Python here first. Um, we can read this into a data frame to look at this.

01:06:03.000 --> 01:06:06.000
Right? And this is what our… this is what we… whoops.

01:06:06.000 --> 01:06:09.000
And this is what we get as a data frame.
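
Reading one of those pipe-delimited .dat files into a data frame is one call in pandas; the URL is a stand-in for the real census file:

    import pandas as pd

    # The .dat files are plain text with '|' as the separator,
    # so pandas reads them like any CSV once you pass sep='|'.
    df = pd.read_csv("https://example.com/acs5y/B28003.dat", sep="|")
    print(df.head(2))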

01:06:09.000 --> 01:06:17.000
Great. Um… Does anybody have an idea what B28003 underscore E001 is?

01:06:17.000 --> 01:06:23.000
Because I certainly don't. Right.

01:06:23.000 --> 01:06:36.000
Um… This represents, or at least the first half of this, everything to the left of the underscore. That represents the… what is called the table shell.

01:06:36.000 --> 01:06:44.000
I have no idea what that is. It would be nice to have a list,

01:06:44.000 --> 01:06:58.000
right? Another table that says, oh, B28003 is this. And, you know, I don't have another one in here, but some other number is something else, and so on and so forth, right?

01:06:58.000 --> 01:07:04.000
We then have this E001, and then on this one we have M001.

01:07:04.000 --> 01:07:09.000
Right? Um, and then we have E002 and M002.

01:07:09.000 --> 01:07:14.000
Turns out I actually do know what those are. The E represents estimates. Remember?

01:07:14.000 --> 01:07:29.000
This is the, uh, this is the annual, or excuse me, the monthly survey that they send out, and then every year they aggregate it, and then every five years, they aggregate the previous five years. So this is the aggregate of the previous five years. And because it's a survey.

01:07:29.000 --> 01:07:45.000
Um, it's not a… it's not a complete count of everything in the United States. They call it an estimate, right? And so the E, these numbers in this, in every column that has an E in it, those are estimates of whatever those… whatever they're measuring, right? Whatever they're calculating.

01:07:45.000 --> 01:08:02.000
The M is the margin of error. And so in this case, what is that? 129,227,496. It's got a margin of error of 209,365, right? And then this one here is 100.

01:08:02.000 --> 01:08:18.000
12,340, or excuse me, 123 million, blah blah blah, it has an error of this. And so we've got these pairings: estimate, margin of error, estimate, margin of error, and so forth,

01:08:18.000 --> 01:08:25.000
which frustrates me when trying to work with data, right? I would prefer to have two columns,

01:08:25.000 --> 01:08:45.000
estimates and margin of error, and then have a column that says, okay, well, this is A and B and C and D and so forth, right? So that's one thing that I would love to do with this data. And the other thing, of course, is this geo ID. This geo ID actually represents

01:08:45.000 --> 01:09:04.000
Um, geographic regions in the United States. And if we notice, uh, not so much in the top one, but on the bottom one, it's a little bit more obvious, we've got a number to the left of the US and a number to the right of the US. There are two different numbers. It turns out the number to the left of the US, the one that in this case begins with 97, is the U.S. Census.

01:09:04.000 --> 01:09:13.000
I forgot the term already, but it's basically how the census breaks things up into pieces.

01:09:13.000 --> 01:09:28.000
And it's that. On the right-hand side is something called the FIPS code. And of course, I don't remember what FIPS stands for, it's federal something-something-something, but basically, everything gets an identifier for different things.

01:09:28.000 --> 01:09:50.000
You can parse these out, as in, the 56 means something, the 05 means something, the 830 means something. I don't remember exactly where the boundaries are, and they have different sizes, or sometimes they're not even in here. Sometimes they're shorter or longer, and depending on the length, it means different things.

01:09:50.000 --> 01:10:08.000
It'd be nice if those meanings were actually parsed out and usable, right? So this is the kind of data that my students were working on, and we ended up going through and parsing these. So the first thing that we did is we tackled the columns, right, the column names.

01:10:08.000 --> 01:10:11.000
And we broke this up into a bunch of pieces.

01:10:11.000 --> 01:10:21.000
And so what that looks like is. Well, I don't want to go through all this stuff.

01:10:21.000 --> 01:10:26.000
The short version is that you can do something like this.

01:10:26.000 --> 01:10:35.000
We just do this real quick. I'm just going to say, uh… actually, I'm going to do the whole thing. This letter, oh.

01:10:35.000 --> 01:10:40.000
Oh.

01:10:40.000 --> 01:10:46.000
Bill doesn't like it.

01:10:46.000 --> 01:10:54.000
Oh, man. Oh, I hate it when it was just working, and now it's not.

01:10:54.000 --> 01:11:13.000
Oh, there we go. So in essence, what this is doing, I'm going to say, is rearranging things in Python using pandas. It's basically going through all the columns and saying, hey, break these out into their individual pieces. And yes, I'm using regular expressions to do that.

01:11:13.000 --> 01:11:31.000
These are the regular expressions. It then goes through and does what is called an unpivot. So it takes it from this wide format and makes it this long format, right? The other thing it does is it takes that geo ID and parses it into its individual components as well, right?

01:11:31.000 --> 01:11:49.000
And then what it does is a little pivot, where instead of having it go row by row, estimate, margin of error, estimate, margin of error, it pivots those so you have two columns. So in the end, this is what it looks like, and this was done in pandas.
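
A condensed sketch of that pandas pipeline, with invented sample values; the presenter's actual column names and regular expressions may differ:

    import pandas as pd

    # A tiny wide-format stand-in for one of the census .dat files.
    wide = pd.DataFrame({
        "GEO_ID": ["9700000US5605830"],
        "B28003_E001": [129227496], "B28003_M001": [209365],
        "B28003_E002": [123400000], "B28003_M002": [150000],
    })
    # Unpivot: wide to long, one row per original column.
    long = wide.melt(id_vars="GEO_ID", var_name="column", value_name="value")
    # Break each column name into table shell, E/M flag, and line number.
    parts = long["column"].str.extract(
        r"^(?P<table_shell>[A-Z0-9]+)_(?P<kind>[EM])(?P<line_number>[0-9]+)$")
    long = pd.concat([long.drop(columns="column"), parts], axis=1)
    # Pivot the E/M flag back out so estimate and MOE sit side by side.
    tidy = (long.pivot_table(index=["GEO_ID", "table_shell", "line_number"],
                             columns="kind", values="value")
                .reset_index()
                .rename(columns={"E": "estimate", "M": "moe"}))
    print(tidy)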

01:11:49.000 --> 01:12:01.000
Um, so I have the geo ID, and I've parsed out the GIS number, that's the number to the left of the US, and that is composed of smaller pieces. One's called the summary level, another's called the variant.

01:12:01.000 --> 01:12:12.000
Everything to the right of the US is called the FIPS. So in this case, we've got this number down here, this 72 and so forth, that's the FIPS number.

01:12:12.000 --> 01:12:17.000
State 72, and I don't know what state that is, and then the county.

01:12:17.000 --> 01:12:23.000
and then what we have is, remember those table shells in the beginning that were, you know, the table

01:12:23.000 --> 01:12:29.000
underscore E with the number? Well, it turns out that that number is what is called the line number.

01:12:29.000 --> 01:12:38.000
Exactly why it's called the line number, I'm not sure. I'm guessing that it is literally the line number of the survey when people filled this out.

01:12:38.000 --> 01:12:55.000
But I don't know, that's just my guess. But what I've done is I've pulled those pieces apart. I've also turned that E and M, the underscore E and the underscore M, into two columns, the estimate and the MOE, right? So now we can go through, and actually, I can see the estimates

01:12:55.000 --> 01:13:10.000
and the MOEs side by side, and I can start querying them. So, for example, if I wanted everything from the state of Missouri or New Mexico or Alaska or whatever, if I knew that number, I could just say, well, great, just select everything from that state and give me those estimates.

01:13:10.000 --> 01:13:17.000
I now have… not quite, but almost a normalized table.

01:13:17.000 --> 01:13:18.000
Yeah.

01:13:18.000 --> 01:13:24.000
Question. Would you have to file a Freedom of Information Act request to find out what that line number was on the form?

01:13:24.000 --> 01:13:30.000
No, no. And I'm glad you asked. Because that's where we're going next.

01:13:30.000 --> 01:13:31.000
Okay.

01:13:31.000 --> 01:13:39.000
Right? Yeah. So, before we get there, I just want to show that this is what it looks like in pandas. Of course,

01:13:39.000 --> 01:13:53.000
don't get me wrong, pandas is great. You can do a lot of things with pandas, but of course it's specific to Python, right? This syntax can't be used in Rust or Ruby or Go or anything else.

01:13:53.000 --> 01:14:13.000
In contrast, this is the exact same thing written in SQL in DuckDB. Again, I've got my regular expressions, and yes, absolutely, DuckDB knows how to deal with regular expressions. It also knows what a CTE is, a common table expression, and so I've done the exact same process that I had in pandas.

01:14:13.000 --> 01:14:27.000
I did it in SQL using a CTE, right? And in a CTE, you can basically say, hey, do this, and then in the next section, use the results of the previous section and do some more things on it, right?

01:14:27.000 --> 01:14:39.000
Just real quick, I'm basically saying, hey, read all the data, and then for all that data, apply those regular expressions to pull things apart, right? Regular expressions are great for that, right?

01:14:39.000 --> 01:14:50.000
Then once you pull things apart, some things will be null. Go ahead and fill those in with some values, and that's what this does, right? So I'm cleaning up the data.

01:14:50.000 --> 01:15:05.000
I then go through and I say, okay, great, you know how you got everything in columns? Well, don't do that. Take all that stuff and make it into rows. Make it go from a wide format to a long format. And these are the details on how to do that. And then, once you've done that,

01:15:05.000 --> 01:15:16.000
go ahead and pull some more stuff apart, right? The column things, break them into three pieces. You've got the actual table name, excuse me,

01:15:16.000 --> 01:15:36.000
the table shell name, whether it's an estimate or a margin of error, and a line number. So break it into those three parts, make each a column, and then, at the very end, take the E and M ones and pivot them again, right? So I unpivot and then I pivot a little bit. And in the end,

01:15:36.000 --> 01:15:45.000
I get exactly what I want, right? And, oh yeah, I'll just add this extra thing down here, which is, hey, great, now that you've done this stuff,

01:15:45.000 --> 01:15:55.000
For God's sake, don't save it in a CSV file. Save it as a Parquet file, and partition it based on the table shell name, right? And that's how we get this structure here.
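
A compressed sketch of what that SQL might look like in DuckDB, under the assumption that the file and columns match what's on screen; the exact regular expressions and the UNPIVOT-in-a-CTE placement may need adjusting:

    import duckdb

    duckdb.sql("""
        WITH raw AS (
            SELECT * FROM read_csv('B28003.dat', delim = '|', header = true)
        ),
        -- wide to long: every data column becomes (name, value) rows
        long AS (
            UNPIVOT raw ON COLUMNS(* EXCLUDE (GEO_ID)) INTO NAME col VALUE val
        )
        -- pull the column name apart with regular expressions
        SELECT GEO_ID,
               regexp_extract(col, '^([A-Z0-9]+)_', 1)  AS table_shell,
               regexp_extract(col, '_([EM])[0-9]+$', 1) AS kind,
               regexp_extract(col, '([0-9]+)$', 1)      AS line_number,
               val
        FROM long
        LIMIT 5;
    """).show()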

01:15:55.000 --> 01:16:11.000
Of course, lastly, what I did, just to show that it actually works: here's my query straight off the drive, off the local disk, right? So I've got the geo ID, the GIS, and the pieces that make that up; I've got the FIPS and all the pieces that make that up.

01:16:11.000 --> 01:16:20.000
Um, I have the table shell, B28 and the line number, and then lastly, I actually have the counts of things, right? The estimates as well as the margin of error.

01:16:20.000 --> 01:16:31.000
Right? Once I've got it saved to a file, of course, what I did is I then said, great, put it up in Amazon and see if we can query it up there, which is what I've already shown you.

01:16:31.000 --> 01:16:44.000
So Stan asked, what the heck are these, right? What is a B28003? And I said, I don't know, because the answer is, I don't know.

01:16:44.000 --> 01:16:55.000
But here's the cool thing. That data is also available, oops, that's okay. That data is also available on the Census website. So let's go find that data.

01:16:55.000 --> 01:17:01.000
And we can find that right here. These are the… These are the shells.

01:17:01.000 --> 01:17:05.000
But let's load this up. I hope this one works.

01:17:05.000 --> 01:17:10.000
And the reason why I say I hope it works is because.

01:17:10.000 --> 01:17:20.000
I was trying to read this with pandas as well as DuckDB, and I was getting a 524 error, and it turns out that.

01:17:20.000 --> 01:17:38.000
Apparently the web interface to the U.S. Census is handled by Cloudflare, and apparently I did this often enough that Cloudflare thought I was a bot and said, no, your IP address, or whatever criteria they use, you're not getting access to us.

01:17:38.000 --> 01:17:42.000
Right? I can use a web browser just fine.

01:17:42.000 --> 01:17:48.000
The difference is, the web browser is actually running on my laptop right here.

01:17:48.000 --> 01:18:11.000
This script here that's running is actually running in Colab, and of course it's coming from Google. So Google is making these requests, and so… Cloudflare is probably looking at it and said, oh, boy, someone's running some weird stuff on Google again, we're gonna shut them down. And so, um, this was working, uh, until, uh, early this afternoon, and then they said, nope, we're not gonna let you get that data anymore.

01:18:11.000 --> 01:18:20.000
Which is the entire reason why I want the data from the census: so that my students and I can't be shut out from accessing the data.

01:18:20.000 --> 01:18:32.000
Thank you. So these are all the shell files, or yes, the table shell files, and… notice what format they're in.

01:18:32.000 --> 01:18:41.000
They're not CSV. Although the original data files that we wanted are in CSV, even though they have vertical bars instead of a comma.

01:18:41.000 --> 01:18:49.000
Plain text, great. But the thing that describes what those columns are, well, that's not in CSV.

01:18:49.000 --> 01:18:54.000
That is in an Excel file. And not just one Excel file.

01:18:54.000 --> 01:19:03.000
There are, and I don't remember, like 1,200 or 1,600 of these things, right? Individual Excel files.

01:19:03.000 --> 01:19:08.000
And unlike the dat files that we had before.

01:19:08.000 --> 01:19:14.000
There is no zip file to get all the Excel files.

01:19:14.000 --> 01:19:20.000
So if you want these Excel files. You have two choices. One is you go to the website.

01:19:20.000 --> 01:19:25.000
which is what I was trying to do, get a list of all these things and just start downloading them.

01:19:25.000 --> 01:19:35.000
That's probably how I got on the blacklist at Cloudflare. The other way is to use their FTP service.

01:19:35.000 --> 01:19:45.000
Which is what I ended up doing. So I just wrote this little script here for lftp that said, hey, you know what? Instead of going to the web,

01:19:45.000 --> 01:20:00.000
use their FTP service. Here's the folder where I want you to go. And of course, what's nice about the lftp program is that you can specify how many parallel transfers, basically, you can have at one time. I specified 10, and I just say,

01:20:00.000 --> 01:20:08.000
Just get everything, right? There's not that much data. There's just a lot of files. In the end, I think it's only, like.

01:20:08.000 --> 01:20:13.000
20 megabytes or something like that? I mean, tiny, right? But it was able to get all the files.
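
Not the presenter's lftp script, but the same fetch sketched with Python's standard library; the directory path is a guess:

    from ftplib import FTP

    # Anonymous login to the census FTP host named in the talk,
    # then pull down every Excel workbook in one folder.
    ftp = FTP("ftp2.census.gov")
    ftp.login()
    ftp.cwd("/programs-surveys/acs/")  # placeholder path
    for name in ftp.nlst():
        if name.lower().endswith(".xlsx"):
            with open(name, "wb") as f:
                ftp.retrbinary(f"RETR {name}", f.write)
    ftp.quit()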

01:20:13.000 --> 01:20:24.000
I was able to… open up the Excel files and extract the sheets. Again, I used pandas for this.

01:20:24.000 --> 01:20:42.000
Truth be told, I tried this with DuckDB, to open up Excel files, and it was not successful. According to the documentation and things that I've read, you should be able to do it; I just wasn't able to. So I said, I'm not going to worry about DuckDB for now. I just want to get the data, and I know pandas will do it just fine.

01:20:42.000 --> 01:20:48.000
So, I ended up opening up every single… Excel file.

01:20:48.000 --> 01:21:07.000
And of course, the Excel file itself is not just a spreadsheet. You know, back in the day, it used to be that a file was just a spreadsheet. Now they're called workbooks, right? And a workbook can have multiple sheets, or some people call them tabs, right? So you have multiple tabs in a single workbook.

01:21:07.000 --> 01:21:12.000
And each of those tabs can have a name.

01:21:12.000 --> 01:21:13.000
Yeah.

01:21:13.000 --> 01:21:21.000
Question. How did you find out about the FTP site? So many places seem to have stopped doing FTP.

01:21:21.000 --> 01:21:26.000
That was a Google search. There was also panic.

01:21:26.000 --> 01:21:40.000
Remember, the original impetus for this was I had a student that was pulling this data for part of his capstone with his team. His team was working on other stuff.

01:21:40.000 --> 01:21:57.000
I don't remember the full details, but I think someone was working on using city data, and some of it was using some other housing data from within the city, and he wanted to get information from the U.S. Census about, you know, census tracts and household income and things like that.

01:21:57.000 --> 01:22:05.000
And… this data is available via the API as well, and the API went down.

01:22:05.000 --> 01:22:18.000
And the only other option that we knew how to do was basically go and get all this data via the website, one by one. And of course, just like what happened to me today, I'm guessing,

01:22:18.000 --> 01:22:25.000
Um, when we try to do it on the website, it also shut us down and said, nope, you're a bad person, you're an automated script.

01:22:25.000 --> 01:22:40.000
You know, we're not gonna let you do this. And so, it was, okay, great. We can't do it manually, we can't do it via the API, or I should say, we can't do it via automated queries on the web. We can't get it that way. We can't get it via the API. What else is there?

01:22:40.000 --> 01:22:45.000
and it literally was a Google search for FTP.

01:22:45.000 --> 01:22:54.000
Right? And so… oops, let's spell census right.

01:22:54.000 --> 01:23:02.000
And… let's see…

01:23:02.000 --> 01:23:16.000
It is not obvious, because every link that you can click on, how would you say, takes you to the HTTPS site, right? So if we click on here, let's click on summary file, there you go.

01:23:16.000 --> 01:23:22.000
This is the website. We've already been here. Well, it turns out that if you actually.

01:23:22.000 --> 01:23:27.000
slow down, which we had to do, and read…

01:23:27.000 --> 01:23:39.000
Um, it turns out that… I'm trying to… oh, here you go. It's not even a link. That… is an FTP server.

01:23:39.000 --> 01:23:43.000
And if you go there, you can get all the same files, which is.

01:23:43.000 --> 01:23:48.000
what I did up here. Um, where is… yeah, there we go.

01:23:48.000 --> 01:23:54.000
Right there. So it says lftp: go to ftp2.census.gov and get it.

01:23:54.000 --> 01:24:00.000
Stan, you bring up a great point. A lot of places don't use FTP servers anymore.

01:24:00.000 --> 01:24:03.000
Yeah, that's… That's kind of a bummer.

01:24:03.000 --> 01:24:12.000
I noticed… I noticed that Phil posted in the chat a link to the Federal Information Processing Standards.

01:24:12.000 --> 01:24:15.000
Have you looked at that? Was it any help?

01:24:15.000 --> 01:24:16.000
Alrighty.

01:24:16.000 --> 01:24:22.000
Yeah, that's the FIPS codes.

01:24:22.000 --> 01:24:23.000
Okay.

01:24:23.000 --> 01:24:32.000
So if you look in there, you'll find all the numbers that correspond to different things, and yes, eventually I will get to that for the geography stuff. I just haven't yet.

01:24:32.000 --> 01:24:33.000
Thank you.

01:24:33.000 --> 01:24:54.000
Yeah. Oh, there's lots of data out there. It's just getting it in a reasonable way. A lot of the data that I found is also not organized, in the sense of, you know, you've got a FIPS code that looks like, you know, 720053776, or something like that, right?

01:24:54.000 --> 01:25:10.000
You actually can't find that number. Instead, at least I haven't been able to; what you find out is that, well, 72 stands for the state of Texas, or something, I'm just making this up, right? And the next three digits stand for the county of, I don't know,

01:25:10.000 --> 01:25:25.000
you know, Springfield. And the next thing is the census tract number. They describe what each of the pieces are. And so you have to figure out how to parse the FIPS code, and then match it to the appropriate number.

01:25:25.000 --> 01:25:34.000
Where that gets interesting is FIPS codes are not a uniform size, they vary in size, and so therefore, if you have a FIPS code that's 5 characters.

01:25:34.000 --> 01:25:50.000
That means something different. In other words, you've got to use a different data set to look up what the… what the pieces are compared to something that's, let's say, 10 characters. That's a completely different interpretation.

01:25:50.000 --> 01:25:51.000
Hmm. Okay.

01:25:51.000 --> 01:26:05.000
It gets really interesting. It'd be nice if everything was nicely, logically organized, but it's not. I mean, it is to some extent, but definitely not, you know, third normal form or any normalized data table kind of thing.

01:26:05.000 --> 01:26:14.000
All right. So the Excel files: I had to pull out all the data from the Excel files, and in the end,

01:26:14.000 --> 01:26:22.000
what I did is I grabbed the workbooks. The workbooks themselves, the files themselves, have

01:26:22.000 --> 01:26:34.000
sheets. And so what I wanted to know is, okay, great. How many sheets are in each of the workbooks in each of the files? And it turns out that there's exactly one.

01:26:34.000 --> 01:26:50.000
Um, now you would think it would be something like, well, gee whiz, you know, they would all be named consistently. Well, it turns out that they are named consistently to some extent, but the name of the sheet is the name of the table shell.

01:26:50.000 --> 01:27:04.000
Right? So it wasn't a consistent name like, oh, the name is always data or table shell or some generic name. No, no, no. The name of the sheet was the name of the file,

01:27:04.000 --> 01:27:15.000
which also happened to be the name of the table. So you couldn't just say, oh, you know what? I'm going to open up this file, and I'll grab the, you know, the name of the sheet to get what I want, right?

01:27:15.000 --> 01:27:25.000
So, what I did is I went through every single one, got the name of the sheet, the name of the tab that's in there, and uh… and then in each one of them, when I parsed it, I also looked at, well, what's its description?
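
The per-workbook step he describes might look roughly like this in pandas; the file name is invented:

    import pandas as pd

    # Each workbook holds exactly one sheet, and the sheet's name
    # is the table shell, so capture the name alongside the rows.
    book = pd.ExcelFile("B28003.xlsx")
    shell_name = book.sheet_names[0]
    shell = book.parse(shell_name)
    shell["table_shell"] = shell_name
    print(shell_name, shell.shape)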

01:27:25.000 --> 01:27:31.000
And of course, as you can see here, they're not consistently

01:27:31.000 --> 01:27:46.000
formatted. Some of them have three columns, right? So table ID, line number, and description. Some of them have more than that, right? So I just did a count. Turns out that, yeah, some have 17, some have 13.

01:27:46.000 --> 01:27:57.000
The numbers are all over the place, right? So the other thing that I noticed is that the sheets themselves, or what I think,

01:27:57.000 --> 01:28:05.000
more correctly, is actually called the table shell… they begin with either one or two letters. I only looked at the first letter.

01:28:05.000 --> 01:28:21.000
And an overwhelming number of them are B. And if we look at the ones that have more than three columns, there are only these three, and there are very few of them. So I just said, yeah, you know, just tackle the Bs, and that's what we're going to work with.

01:28:21.000 --> 01:28:37.000
And so I got 1,189 sheets for the Bs, pulled them all out, put them together, and this is what we got. And so now we have the table shell and then the description. Now, one of the things I noticed is that

01:28:37.000 --> 01:28:49.000
The description is kind of incomplete. There are actually some lines that are… As in, there are rows in the Excel file that are not.

01:28:49.000 --> 01:29:04.000
nicely organized like this, that actually give you more description. Um, so I need to go back and pull that out and do something else with that, um, to provide a little bit more… more metadata. Think of it as almost a one-to-many relationship. There's information about the.

01:29:04.000 --> 01:29:20.000
table shell itself, right? A nice description, a summary of it. And then what you have is each of the line items, right? So, in this case, I've got line 10, 20, 30, 40, 50, and so forth, right? And so line 10 is the total. Line 20 is the male.

01:29:20.000 --> 01:29:26.000
Line 30 is under five years old, and it turns out that.

01:29:26.000 --> 01:29:44.000
In this particular case, these are subcategories of line 20. So these are all males under 5 years old, all males between 5 and 9, and so forth. And then these are all… and yes, there are also females, and then the males and the females are added up into a single total.

01:29:44.000 --> 01:29:55.000
Great. This is that metadata that's just elsewhere in the spreadsheet, and that you just need to know, right? Not very well described.

01:29:55.000 --> 01:30:01.000
What does make this interesting is that, remember, we have those dat files. We now have the sheet files.

01:30:01.000 --> 01:30:13.000
The sheets are organized by… The name, the line number, and the description, right? So they're this way. The DAT files are the column names.

01:30:13.000 --> 01:30:26.000
I want to know what a column name is. There's no easy way for me to join. That's why, when I have the DAT file, I have to do that pivot, or excuse me, that unpivot, first. Once I do that, I can now join

01:30:26.000 --> 01:30:37.000
the table shell name with the actual information from this Excel file, and put them together. So the logical thing, at least to me, to do here:

01:30:37.000 --> 01:30:44.000
take all these files, Excel files, and dump them and save them as a Parquet file, and when we do that.

01:30:44.000 --> 01:30:51.000
Let me see if this runs. Oh, this isn't running. This one shut down. But basically, it's the one that's on the website.

01:30:51.000 --> 01:31:07.000
turn that into a separate table. So now we have the data files, which have been mostly normalized. We now have these Excel files, the shell files, which have been mostly normalized. Now we can do a join, and we can now see

01:31:07.000 --> 01:31:11.000
What are we looking at in each of those data files?

01:31:11.000 --> 01:31:21.000
So, two different files, or two different data sets turned into Parquet, which we can now query.
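
Once both sides are Parquet, the join he describes is ordinary SQL; the paths and column names below are placeholders:

    import duckdb

    # Data files on one side, shell descriptions on the other,
    # matched on table shell and line number.
    duckdb.sql("""
        SELECT geo_id, table_shell, line_number,
               s.description, d.estimate, d.moe
        FROM read_parquet('census_export/**/*.parquet',
                          hive_partitioning = true) AS d
        JOIN read_parquet('shells/*.parquet') AS s
          USING (table_shell, line_number)
        LIMIT 10;
    """).show()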

01:31:21.000 --> 01:31:28.000
All right. Any questions on what I've done here?

01:31:28.000 --> 01:31:35.000
Any place where anybody would like to go deeper on some of this stuff.

01:31:35.000 --> 01:31:46.000
All right. Given that I've already gone an hour, the things that I wanted to do that I didn't get a chance to do were

01:31:46.000 --> 01:31:54.000
the things that I kind of mentioned to Gary, which was the geospatial stuff. It turns out that this information is in the GEOID,

01:31:54.000 --> 01:32:10.000
and I don't know which part yet. I don't know if it's the GIS part, which is my guess, but it could also be the FIPS side. I don't know. But the U.S. Census has a bunch of information, right?

01:32:10.000 --> 01:32:25.000
That's geography based. And not just information that's geography-based of, you know, hey, how many households and how many people are this versus that, and how many people have computers, how many people have cars, how many people have phones, and so forth. Um, but they also have.

01:32:25.000 --> 01:32:31.000
geographic information. That is, they have points…

01:32:31.000 --> 01:32:37.000
points on the globe, if you will, right? So, uh, where are certain landmarks?

01:32:37.000 --> 01:32:53.000
The U.S. Census has that. Where are a bunch of roads, right? So roads are basically line segments. The U.S. Census has that as well. And so you can download these different geographies. They're called, basically, shapes.

01:32:53.000 --> 01:33:12.000
The most common shapes are points, lines, and areas, right, or polygons. And so you can download this information. The geographical information, and then merge it with the stuff that I've talked about already, right? From the dat files and the shell files.

01:33:12.000 --> 01:33:17.000
Again, what we could do is we can take these shape files.

01:33:17.000 --> 01:33:25.000
and turn them into Parquet. It's a special format called GeoParquet, but it's nevertheless a Parquet format.

01:33:25.000 --> 01:33:42.000
And, of course, DuckDB can read that. So we have DuckDB that can read Excel files, DuckDB that can read CSV files, and DuckDB that can read these shapefiles, and then start operating on them to tweak them into a format that's actually usable,

01:33:42.000 --> 01:33:47.000
and then actually use them to query the data and put it together.
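
A sketch of that with DuckDB's spatial extension; the shapefile name is invented, and whether a plain COPY is the right way to produce GeoParquet may depend on the version:

    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL spatial;")
    con.sql("LOAD spatial;")
    # ST_Read pulls a shapefile in as a table with a geometry column,
    # which can then be written back out as Parquet.
    con.sql("""
        COPY (SELECT * FROM ST_Read('tl_roads.shp'))
        TO 'roads.parquet' (FORMAT PARQUET);
    """)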

01:33:47.000 --> 01:34:03.000
So that's kind of the project that I'm working on. It's just kind of a side project. Of course, I'm showing this to my class, my students. Hopefully, you know, some of them will be interested in doing portions of this. I don't have this on a GitHub site yet. This is literally all on my Google Drive.

01:34:03.000 --> 01:34:13.000
But eventually, I'm going to put it on GitHub, just because it makes working with the data and with other people much, much easier.

01:34:13.000 --> 01:34:32.000
And then, ultimately, what I'd like to do, once I have these small pieces, comparatively, these are small data sets, it'd be nice to start to say, well, gee, what other interesting data sets are there that we can also turn into Parquet files, that we can then, how would you say, do joins and other interesting queries on?

01:34:32.000 --> 01:34:44.000
For example, one of my teams in a previous cohort was using satellite data from NASA to,

01:34:44.000 --> 01:34:52.000
Uh, basically look at wildfires in different parts of the United States, primarily New Mexico, but also some of the surrounding states.

01:34:52.000 --> 01:35:02.000
And what they did is, you would get the image from NASA, so you actually got the actual satellite image, but NASA also provided you with information about,

01:35:02.000 --> 01:35:07.000
Um… geolocation.

01:35:07.000 --> 01:35:22.000
right? So you have latitude and longitude and things like that about all the different features that NASA had identified within that satellite image. And so we can then take that information, combine it with stuff from the census information. For example, we want to know, let's say.

01:35:22.000 --> 01:35:43.000
Uh, you can, you know, you can use that information to identify where a fire is, and then you can use that if you know where the fire is, and you take the information about rivers that you can get from the U.S. Census, you can then say, okay, great, given that fire, what's the closest river? What's the closest town? What are the roads that are going to be impacted by that fire? All kinds of really cool stuff.

01:35:43.000 --> 01:35:53.000
Right? Right now, that data is in all different kinds of formats. I'd like to believe that at some point, if we turn these all into nice Parquet files, we can just use tools like DuckDB.

01:35:53.000 --> 01:36:10.000
I'd like to believe in the future there'll be something even cooler. But at least for right now, we can use DuckDB to say, okay, great, grab that information about the fire, grab that information about the rivers, grab this other information about people that are affected, join them together, and, hey, where's the closest, I don't know,

01:36:10.000 --> 01:36:16.000
shelter or route out, or something like that, and provide that as information.

01:36:16.000 --> 01:36:33.000
So that's where I am right now. Lots of cool things, and it's really nice that some open source tools make this possible. Basically, a web server, so Apache and Nginx; on top of that, you can put Parquet files.

01:36:33.000 --> 01:36:37.000
On top of that, you can then use DuckDB to go in and query the data.

01:36:37.000 --> 01:36:47.000
Lots of cool stuff. All right. With that, I'm gonna draw it to a close. Anybody have any questions?

01:36:47.000 --> 01:36:48.000
Yeah.

01:36:48.000 --> 01:37:14.000
Hi everyone. Other than things you've already mentioned, do you possibly foresee other potential kinds of data mining that people may not be aware of, like maybe financial institutions looking at their data or something? Is this an advantage over the access that they already currently have?

01:37:14.000 --> 01:37:15.000
Or not.

01:37:15.000 --> 01:37:31.000
So, one of my students used to, well, not used to, he still does as far as I know, work at a local credit union, right? And this is, what, about a year? Anybody remember when

01:37:31.000 --> 01:37:34.000
Silicon Valley Bank imploded? About a year and a half ago, something like that? Is that about right?

01:37:34.000 --> 01:37:39.000
Yes. Yeah. Yeah.

01:37:39.000 --> 01:37:53.000
Right? And when that happened, I was talking with him about that. It had happened, like, maybe a month or two before he was in my class, and I was talking to him about it, and one of the things that he said is that.

01:37:53.000 --> 01:38:06.000
In retrospect, it's like, duh. You know, every financial institution has to report these things to, basically, the feds. I don't know who, I'm guessing the

01:38:06.000 --> 01:38:07.000
Federal Reserve or somebody, right? They have to take that data.

01:38:07.000 --> 01:38:09.000
FDIC. FDIC, I believe, and… couple other things. Yeah.

01:38:09.000 --> 01:38:24.000
Yeah, go ahead, Stan. Probably. Yeah, and probably not just them, but probably maybe more than one. And all that information is public information, right? And his comment was that.

01:38:24.000 --> 01:38:30.000
If someone had looked at that data, they would have seen,

01:38:30.000 --> 01:38:41.000
like, months earlier, that this was going to happen, right? And it kind of reminds me of back in 2008,

01:38:41.000 --> 01:38:50.000
uh, you know, the financial crisis that we had, you know, nationwide, you know, it wasn't just Silicon Valley Bank, it was, you know, much bigger than that.

01:38:50.000 --> 01:39:03.000
Um, and if anybody's seen the movie or read the book, The Big Short, at least in the movie, I don't remember the book, but in the movie, in the very beginning, there's a… there's a, you know, um… How would you say the opening…

01:39:03.000 --> 01:39:14.000
dialogue or monologue, rather, introduction is, hey, you know, this is what happened and nobody saw it coming except for a couple of people.

01:39:14.000 --> 01:39:22.000
Those couple people weren't, you know, you know, phenomenal people. They just did something that nobody else did.

01:39:22.000 --> 01:39:29.000
They looked, right? They looked at the data. What I'd like to believe is that with tools like this,

01:39:29.000 --> 01:39:44.000
we can access that public information. And instead of being in whatever format it is, right, and when I say whatever format, obviously you can see right here, from the U.S. Census we've got a variety of different formats. We've got Excel, we've got DAT files,

01:39:44.000 --> 01:39:50.000
We got, uh, there's also shapefiles, zip files, stat files.

01:39:50.000 --> 01:40:09.000
a bunch of different formats. If we could have tools that can, number one, read that information, and number two, reorganize it in a way that it's useful, right? So think of that Excel file: turn it into Parquet so you can actually query it. I'd like to believe that more people will then go and say,

01:40:09.000 --> 01:40:17.000
hey, you know what? We can see things before they happen. I'd like to believe that.

01:40:17.000 --> 01:40:28.000
probably a little bit naive and optimistic, but I'd like to believe it. What I do know is, at least here in Albuquerque, there are a number of nonprofits that I've been talking to about data and stuff.

01:40:28.000 --> 01:40:39.000
One of the biggest challenges they have is, I'm going to say, accessibility. But it's…

01:40:39.000 --> 01:40:42.000
getting the data is not the problem. They can get data.

01:40:42.000 --> 01:40:51.000
Once they get the data, it's what to do with it, right? It's in these formats that they don't know how to work with.

01:40:51.000 --> 01:40:59.000
And then the second challenge for a lot of nonprofits, and probably most businesses, is having someone who actually then knows how to work with that data.

01:40:59.000 --> 01:41:05.000
This is especially true if it's, how would you say, domain-specific,

01:41:05.000 --> 01:41:09.000
like this U.S. Census data with the table shells.

01:41:09.000 --> 01:41:27.000
I know nothing about that. That's not my area of expertise. I'm literally, you know, piecing this together as I go along. If this was in a database with a proper data dictionary that I could go through and query quickly, I could probably make a lot more progress than what I'm doing at the moment.

01:41:27.000 --> 01:41:36.000
Right? And that is probably true for a lot of the nonprofits that I've worked for, excuse me, worked with. And,

01:41:36.000 --> 01:41:48.000
Hey, how do we get data from the different data providers, right? The city, the county, the state, the feds, anybody? How do we then take that data and use it?

01:41:48.000 --> 01:42:09.000
If they have data in a usable format, and that format is fairly consistent, so that people can use tools like DuckDB to query and work with it, I think that would be phenomenal. It really would be phenomenal.

01:42:09.000 --> 01:42:12.000
Any other questions?

01:42:12.000 --> 01:42:32.000
Just a comment. We've mentioned this group in the discuss list before, and I think also on the calendar, and I can't think of their name now, but there's a nationwide group that's basically computer geeks who volunteer time to work on a public project, do some

01:42:32.000 --> 01:42:39.000
data mining or data analysis utilizing public data. And I can't think of the name of the group.

01:42:39.000 --> 01:42:43.000
Anybody remember? Anyway, an example.

01:42:43.000 --> 01:42:44.000
Yeah, I can't say I have. I'd love to hear about it. That'd be cool.

01:42:44.000 --> 01:42:48.000
Go ahead.

01:42:48.000 --> 01:43:06.000
Okay, there was a St. Louis chapter. I haven't seen anything from them in some time, but one of the things that the St. Louis chapter did a couple of years ago… the question was, well,

01:43:06.000 --> 01:43:18.000
Who owns a property in the city of St. Louis that is in need of work, in need of repair and.

01:43:18.000 --> 01:43:19.000
Yeah.

01:43:19.000 --> 01:43:42.000
There wasn't a readily available answer to that, and it was interesting. They had to go through a number of different city and state agencies to somehow find out the ownership of

01:43:42.000 --> 01:43:43.000
Yeah.

01:43:43.000 --> 01:43:59.000
abandoned property, or property that had problems like fallen-down trees, overgrown lawns, falling-apart buildings, and so on. It turned out their best indicator was the City Department of Forestry, and you would go, what the hell would that have to do with it?

01:43:59.000 --> 01:44:06.000
And it was because, well, if the trees are falling into the building, or the grass is 10 foot high.

01:44:06.000 --> 01:44:35.000
you know, any other city agency basically figured, well, hell, the Department of Forestry probably has lawnmowers or guys with a saw who know what to do. So they would basically get the Department of Forestry to tend to this property. And the end result of this whole project was that the largest landholder.

01:44:35.000 --> 01:44:36.000
Yeah.

01:44:36.000 --> 01:44:43.000
in the city of St. Louis is the city of St. Louis, because of the number of abandoned properties, or properties that have just fallen, you know, so far in arrears on their taxes that they've been abandoned. So it was interesting that, again, there was no

01:44:43.000 --> 01:44:47.000
good central database, not even good definitions of, you know,

01:44:47.000 --> 01:44:53.000
the data items that they were looking for, so they had to spend a lot of time basically figuring out,

01:44:53.000 --> 01:44:59.000
you know, what the data was, where it was, and how they could make sense of it. So yeah, this is a great, great project you're working on here. Great, uh…

01:44:59.000 --> 01:45:24.000
Yeah. And something I didn't show, again, there's so many things I want to do that, I would just say, I can only do so much and, you know, gotta break it off in chunks. But one of the cool things about DuckDB is that you can create a special database that is called the data catalog. It's basically a database

01:45:24.000 --> 01:45:31.000
of databases, right? And so what you can do is you can create this,

01:45:31.000 --> 01:45:46.000
this database that basically says, I'm gonna have this table called foo, and it is the Parquet file at this URL, or a collection of Parquet files at this URL, and I'm going to have this other table, or view, called bar.

01:45:46.000 --> 01:46:05.000
And it's going to be this collection of Excel files that are over here, right? And I'm going to have this other table called bat, and it's this collection of CSV files that are over here. And what happens is DuckDB knows how to, you know, basically read that file, and since it knows, oh, I know what a parquet file and an Excel file and so forth is.

01:46:05.000 --> 01:46:20.000
It automatically can turn those into regular-looking database tables that can then say, okay, great, I want, you know, this, join this, left join this, where, blah blah blah, group by, yadda yadda yadda, right? And, you know, conduct queries on those things.

01:46:20.000 --> 01:46:43.000
Again, another thing that I'm working on, because that would be great to have here to simplify things. Instead of saying, you know, okay, great, I've got my DAT files over here, and I've got my table shell file over here, yeah, you can join them, but wouldn't it be cool if I just said, hey, you know what, here's a view that already does that, and it just abstracts that away from people.

01:46:43.000 --> 01:46:48.000
All they need to do is go, oh, select star from this view, and here's the data.
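
The catalog-of-views idea might be sketched like this; every path and view name below is a placeholder:

    import duckdb

    # A persisted database whose contents are just views over files
    # that live elsewhere -- a catalog, not a copy of the data.
    con = duckdb.connect("catalog.duckdb")
    con.sql("""
        CREATE OR REPLACE VIEW foo AS
        SELECT * FROM read_parquet('https://example.com/export/**/*.parquet');
    """)
    con.sql("""
        CREATE OR REPLACE VIEW bar AS
        SELECT * FROM read_csv('data/*.dat', delim = '|', header = true);
    """)
    con.sql("SELECT * FROM foo LIMIT 5").show()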

01:46:48.000 --> 01:47:01.000
Another thing I didn't mention, which is really cool, which I was working on with the guys next door, yeah, literally on the other side of this wall, where we do our full stack class.

01:47:01.000 --> 01:47:13.000
Those guys, you know, they do everything. They do front-end, back-end, and all that fun stuff. I was working with them on a portion of this,

01:47:13.000 --> 01:47:27.000
to use DuckDB within the browser, right? So DuckDB can also run as WASM, WebAssembly, and so you don't even need to install anything. Literally, all you need to do is just

01:47:27.000 --> 01:47:44.000
get the DuckDB, you know, have the JavaScript load up the WASM, and then you can type in, or have a web interface to, query data. So it'd be interesting to, you know, create a web application that is literally just a web page

01:47:44.000 --> 01:48:02.000
and nothing else, right? The database itself is the files that are out there, right? The Parquet files or Excel files or what have you. And what you have is you just have a web page that automatically says, oh yeah, this data is over here and that's parquet and this data is over here, and this is something else and so on and so forth. And you run everything.

01:48:02.000 --> 01:48:09.000
You run the DuckDB client, if you will, inside the browser. Nothing to install.

01:48:09.000 --> 01:48:15.000
Nothing to download. Well, I mean, you gotta download the web page that has the WASM, or the DuckDB client, in it.

01:48:15.000 --> 01:48:23.000
But there's no database anywhere, right? There's no service running, nothing, just a web page.

01:48:23.000 --> 01:48:29.000
So yeah, interesting things.

01:48:29.000 --> 01:48:30.000
Any other questions?

01:48:30.000 --> 01:48:43.000
Ed posed a question in the chat. He said, do you think that Parquet will be a new standard adopted more widely? The auto spell check changed the name to whatever.

01:48:43.000 --> 01:48:44.000
Sorry.

01:48:44.000 --> 01:49:01.000
Yeah. Will it become that? There's part of me that says, boy, I hope so. And there's another part of me that thinks, you know what, it'd be really cool if something even cooler would come along. I don't know what that would be, but…

01:49:01.000 --> 01:49:07.000
We live in interesting times, right? Technology is changing at an amazingly fast pace, right?

01:49:07.000 --> 01:49:25.000
So it'd be cool. Do I think it will? I think it will when enough people realize the utility of having an actual table structure that has data types, right?

01:49:25.000 --> 01:49:41.000
Um, and that can be… read by multiple different kinds of clients. Um… DuckDB is nice. There's no question about it. I love it. You can do lots of cool things with it. In fact, again, something I didn't show was.

01:49:41.000 --> 01:49:55.000
within Pandas itself, I can use DuckDB to query a Pandas data frame as though it was a database table. So I could literally like write SQL queries to query a data frame.
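
That trick is built into the Python client: DuckDB resolves in-scope data frames by their variable names. A tiny example with made-up data:

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"state": ["MO", "NM"], "estimate": [100, 200]})
    # 'df' below is found by name in the surrounding Python scope.
    duckdb.sql("SELECT state, estimate FROM df WHERE state = 'MO'").show()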

01:49:55.000 --> 01:49:59.000
And you can… what's cool about that is, uh.

01:49:59.000 --> 01:50:12.000
You could do that in R or Rust, or Ruby, or anything else as well. So that's kind of cool. I think it won't take off until.

01:50:12.000 --> 01:50:24.000
how to say this… certain large companies are simply overwhelmed to the point that, you know,

01:50:24.000 --> 01:50:29.000
there's a reason, a financial reason, for them to

01:50:29.000 --> 01:50:39.000
be able to query that Parquet format. So, for example, as of the last time I checked, which was a couple of days ago,

01:50:39.000 --> 01:50:43.000
Um, I could not read a Parquet file in Excel.

01:50:43.000 --> 01:50:49.000
And when I say Excel, I mean Excel and Power Query, right?

01:50:49.000 --> 01:50:54.000
There may be third-party tools that enable it. I'm not aware of any.

01:50:54.000 --> 01:51:08.000
On my company laptop, I'm not allowed to install anything, and so even if they existed, they would be useless to me. So until it became something where I could literally just open up a laptop, open up Excel, and just

01:51:08.000 --> 01:51:18.000
query a Parquet file… I think, for the vast majority of people,

01:51:18.000 --> 01:51:25.000
They're just gonna be completely unaware of it. For everybody else, I think it'll be phenomenal.

01:51:25.000 --> 01:51:26.000
Okay. Oh, go ahead, Sam.

01:51:26.000 --> 01:51:37.000
A question. My question is, the artificial idiot sitting next to you, is it capable of using Parquet?

01:51:37.000 --> 01:51:49.000
Oh, yeah. I've told… I've told Jim. And, by the way, another thing I'm working on since

01:51:49.000 --> 01:52:05.000
Scott's talk last week is that I was actually able to get open code with Ollama running within Codespaces on GitHub, as well as in Google Colab. And so I was actually playing around with,

01:52:05.000 --> 01:52:17.000
you know, that kind of thing. I'm saying, hey, I've got a Parquet file, I've got a CSV file, I've got an Excel file. Go through each of them and tell me what their structure is. And open code, using Ollama,

01:52:17.000 --> 01:52:23.000
I think I was using the Qwen 2.5?

01:52:23.000 --> 01:52:28.000
I think that's right, that model. And it had no problem

01:52:28.000 --> 01:52:39.000
opening it and reading it, and doing, you know, everything that I asked it to do. So yes, the artificial idiot sitting next to me was doing all right with it.

01:52:39.000 --> 01:52:44.000
Hmm. Oh.

01:52:44.000 --> 01:52:45.000
This is not Ed.

01:52:45.000 --> 01:52:48.000
This is Ed. So, back on… the, uh, the real Ed, not the chatbot.

01:52:48.000 --> 01:52:57.000
Okay. Okay.

01:52:57.000 --> 01:52:58.000
Uh-huh?

01:52:58.000 --> 01:53:07.000
The question I had about the Parquet adoption is that… so, like… like JSON. JSON came out of nowhere, absolutely out of nowhere,

01:53:07.000 --> 01:53:12.000
Uh-huh.

01:53:12.000 --> 01:53:13.000
Yeah.

01:53:13.000 --> 01:53:22.000
and took over the frickin' world in, like, 5 or 10 years, right? And it's just everywhere, right? Everything is JSON.

01:53:22.000 --> 01:53:23.000
Yeah. Yeah, I think it beat out XML. Right?

01:53:23.000 --> 01:53:26.000
And, uh… Uh… Yeah.

01:53:26.000 --> 01:53:37.000
Well, yeah, it was easier to work with than XML, and XML is a nightmare.

01:53:37.000 --> 01:53:38.000
Yeah.

01:53:38.000 --> 01:53:50.000
Right, exactly. I think a lot of that had to do with, you know, the web and JavaScript and all that stuff. But anyway, my point about this, right, is that I don't know whether Parquet… I mean, I don't know enough about it to say, well, okay, is Parquet

01:53:50.000 --> 01:54:06.000
a richer data format than JSON. JSON's pretty simple, right? It doesn't really have much in the way of data types. It basically just has, like, you know, strings, arrays, bools, numbers.

01:54:06.000 --> 01:54:13.000
Yeah. It doesn't have date. It doesn't have a date type,

01:54:13.000 --> 01:54:14.000
Right?

01:54:14.000 --> 01:54:27.000
stuff like that. No, which it should, really. I mean, that's one I would think you would want. But anyway, Parquet probably has a lot more data types, but I guess one of the questions would be, okay, well,

01:54:27.000 --> 01:54:44.000
do you get any advantages that way, or is it, like, a really bulky data format? You know, does it use more bandwidth to download, or are the files much bigger?

01:54:44.000 --> 01:54:53.000
Is it not suitable for, you know, web APIs? That's the kind of question I have about Parquet.

01:54:53.000 --> 01:54:54.000
In other words, I don't see it as a replacement for JSON.

01:54:54.000 --> 01:54:58.000
Yes.

01:54:58.000 --> 01:55:11.000
It's definitely not a replacement for JSON, because one of the nice things about JSON is that it is simple; you only have six data types in it.

01:55:11.000 --> 01:55:26.000
At least I think it's six, but it's a small number, right? And the other really good thing about JSON is that it's text, right? It is plain text. Parquet is not. Parquet is not plain text. It is a binary format.

01:55:26.000 --> 01:55:51.000
That said, it has a richer set of data types. I don't remember how many there are, but it definitely has, you know, varchar, int64, int32, category, date… a bunch of different kinds. And there are, how would you say, community additions or community add-ons. So, for example, the spatial

01:55:51.000 --> 01:56:21.000
one for geospatial data; that is an add-on to Parquet that allows it to handle geospatial data. And it handles it in such a way that, with DuckDB or anything else that understands Parquet, you could say, oh, hey, I've got this point over here and this point over here, what's the distance between them, right? I've got this point over here and this area over here, what's the distance from this point to the closest edge, or to the center?

01:56:21.000 --> 01:56:29.000
Is it inside it? Is it east or west? You know, what's the direction? There's lots of things that you can do with that. JSON doesn't have that, right?
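
A small sketch of those kinds of queries, assuming DuckDB's spatial extension is installed; the ST_* functions are the extension's, and the coordinates are made up.

import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial")
con.execute("LOAD spatial")

# Distance between two points (planar, in the coordinates' own units).
print(con.execute(
    "SELECT ST_Distance(ST_Point(0, 0), ST_Point(3, 4))"
).fetchone())  # (5.0,)

# The "is it inside?" question: point-in-polygon.
print(con.execute("""
    SELECT ST_Contains(
        ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))'),
        ST_Point(5, 5))
""").fetchone())  # (True,)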

01:56:29.000 --> 01:56:37.000
You can glue stuff together in existing formats, but, you know, natively JSON doesn't have that.

01:56:37.000 --> 01:56:38.000
Okay. Oh, go ahead.

01:56:38.000 --> 01:56:56.000
So, yeah, Parquet does that, number one. Number two is it's binary, which can be an advantage and a disadvantage. The disadvantage, of course, is that if you download a Parquet file and you want to edit it, you have to use a particular tool, and it's not really designed for editing.

01:56:56.000 --> 01:57:10.000
It's designed for… you read the Parquet file, you modify something, and then you write out the entire Parquet file, right? You don't make changes to the individual bits inside the Parquet file. It's an all-or-nothing kind of thing.
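
A sketch of that all-or-nothing pattern, with hypothetical file and column names: read the whole file, apply the "edit" in SQL, and write a brand-new Parquet file instead of patching bytes in place.

import duckdb

# Read, transform, and write out an entirely new file in one statement.
duckdb.sql("""
    COPY (
        SELECT * REPLACE (upper(city) AS city)  -- the "modification"
        FROM read_parquet('events.parquet')
    ) TO 'events_fixed.parquet' (FORMAT PARQUET)
""")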

01:57:10.000 --> 01:57:11.000
It's more of a…

01:57:11.000 --> 01:57:29.000
That said, there are some workarounds for that. And, you know, this is as of the last time I was reading about Parquet. I mean, it's entirely possible that they've made some amazing changes and you can now read it and modify it in place, but…

01:57:29.000 --> 01:57:36.000
Last time I checked, that wasn't the case. Another cool thing about Parquet is that it's compressed by default.

01:57:36.000 --> 01:57:57.000
I don't remember what the default compression algorithm is, but it is compressed. And so, what that means is, because I've got data types that I can specify, and I can compress the data, I can read in a CSV file that is filled with numbers, especially, you know, big numbers,

01:57:57.000 --> 01:58:06.000
or even small numbers, and I can then say, well, you know what, I want this to be an int8, or an int32, or a Boolean.

01:58:06.000 --> 01:58:18.000
And when I do that, I can, I would just say, affect the size of the Parquet file. If I then, on top of that, add compression to it, it makes it much, much smaller.
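
A minimal sketch of that, with hypothetical files and columns: cast columns down to small types while converting CSV to Parquet, and name an explicit compression codec.

import duckdb
import os

duckdb.sql("""
    COPY (
        SELECT
            CAST(user_id AS INTEGER) AS user_id,  -- int32 instead of int64
            CAST(age     AS TINYINT) AS age,      -- int8
            CAST(active  AS BOOLEAN) AS active
        FROM read_csv_auto('users.csv')
    ) TO 'users.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")

# Compare the sizes; typed plus compressed is usually a small fraction.
print(os.path.getsize("users.csv"), os.path.getsize("users.parquet"))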

01:58:18.000 --> 01:58:34.000
All the Excel files that I have, basically every single Excel file, when I save it as a Parquet file, is a fraction of the size, I think less than 10% of the original size of the Excel file. And these Excel files are not that big; there's not much data in them.

01:58:34.000 --> 01:58:38.000
I think there's just a bunch of extra stuff that Excel throws in there.

01:58:38.000 --> 01:58:39.000
But…

01:58:39.000 --> 01:58:45.000
An Excel file is a zip file that is a set of XML files.

01:58:45.000 --> 01:58:49.000
So of course it's going to have a ton of garbage around it.

01:58:49.000 --> 01:58:50.000
It's Microsoft. When do they do anything that's efficient?

01:58:50.000 --> 01:58:57.000
Yeah. Well, so…

01:58:57.000 --> 01:59:06.000
And yeah, and, you know, everybody uses it, right? Because they really don't know any better for the most part, right? And they're very comfortable with it.

01:59:06.000 --> 01:59:18.000
What's the meme? After the NHS managed to lose all those records during the pandemic:

01:59:18.000 --> 01:59:23.000
Morons decided to use Excel as a database. It's not one.

01:59:23.000 --> 01:59:31.000
It's a data manipulation slash visualization tool.

01:59:31.000 --> 01:59:32.000
Okay, so… oh, okay, you're not done? Sure.

01:59:32.000 --> 01:59:36.000
Yeah. Yeah. So…

01:59:36.000 --> 01:59:43.000
Yeah, so what I'm getting at is, you know, there's lots of advantages for doing stuff with Parquet.

01:59:43.000 --> 02:00:00.000
I don't know who it was that mentioned, you know, things like Databricks and Snowflake and Spark and Apache Arrow. A lot of those can literally read those files and work with them and manipulate them as is, right?

02:00:00.000 --> 02:00:04.000
There's lots of really cool things you can do with Parquet.

02:00:04.000 --> 02:00:14.000
What it is missing, or at least I haven't seen yet, is a nice graphical interface to work with the stuff that's in Parquet.

02:00:14.000 --> 02:00:26.000
you know, maybe it exists, and I'm just not aware of it. But you can imagine that it could be anything. It could be a web interface that literally loads the data, does stuff, and then dumps the data back.

02:00:26.000 --> 02:00:31.000
But, um… I'm not aware of any.

02:00:31.000 --> 02:00:41.000
primarily because… even if it existed, I probably wouldn't use it because that's not how I work with data.

02:00:41.000 --> 02:00:44.000
you know, I've got, I think in the example I showed.

02:00:44.000 --> 02:00:49.000
just the few files that I worked with, you know, we're at 22 million rows.

02:00:49.000 --> 02:01:00.000
I'm not going to be looking… I can't look at 22 million rows. I have no idea what that means, right? So I do queries and things like that, and

02:01:00.000 --> 02:01:05.000
I haven't seen anything graphical that is a really good

02:01:05.000 --> 02:01:09.000
way to query things.

02:01:09.000 --> 02:01:10.000
Hey, Rob. Go, go ahead. Go ahead.

02:01:10.000 --> 02:01:21.000
Okay, so… well, the final point, or maybe observation, I wanted to make is that,

02:01:21.000 --> 02:01:33.000
see, if you look at the overall landscape of, you know, data that's out there, right? Isn't most of it

02:01:33.000 --> 02:01:49.000
sitting in some SQL database, right? And when you go to get the data from a website or an FTP site or something, right? They have exported that SQL data from that relational database

02:01:49.000 --> 02:01:59.000
into Excel files or CSV or, in some weird cases, JSON.

02:01:59.000 --> 02:02:11.000
But they don't really ever… you know, my point being is that you're kind of talking about, whoa, this is a really great,

02:02:11.000 --> 02:02:21.000
you know, I assume it's open source, right? Data interchange format, right? That is database-agnostic.

02:02:21.000 --> 02:02:38.000
And can be, you know, used to, let's say I've got… I love my SQLite. Okay, great. But you love your Postgres. Okay, great. Right? You give me your data in Parquet, and then I import it into SQLite.

02:02:38.000 --> 02:02:45.000
Right? Or DuckDB, I guess, is a way of doing it. But you know what I'm saying? In other words,
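
A sketch of that handoff with DuckDB in the middle; the connection string and table names are hypothetical, and the postgres and sqlite extensions are DuckDB's own.

import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
con.execute("INSTALL sqlite")
con.execute("LOAD sqlite")

# Your side: dump a Postgres table to a Parquet file.
con.execute("ATTACH 'dbname=shop host=localhost' AS pg (TYPE postgres)")
con.execute("COPY (SELECT * FROM pg.orders) TO 'orders.parquet' (FORMAT PARQUET)")

# My side: load that Parquet file into SQLite.
con.execute("ATTACH 'local.db' AS sq (TYPE sqlite)")
con.execute("CREATE TABLE sq.orders AS SELECT * FROM read_parquet('orders.parquet')")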

02:02:45.000 --> 02:02:53.000
In other words, maybe it's not a replacement for JSON, you know, for web stuff. But it is…

02:02:53.000 --> 02:02:58.000
Most of the data that's out there is in SQL databases. It's been that way for like 50 years.

02:02:58.000 --> 02:03:04.000
And I'm… So…

02:03:04.000 --> 02:03:10.000
I would modify what you said in one slight way.

02:03:10.000 --> 02:03:17.000
Clearly, the overwhelming majority of the structured data that's out there is in a SQL database of some form, right? Or in databases you can use SQL to query.

02:03:17.000 --> 02:03:21.000
Hmm.

02:03:21.000 --> 02:03:32.000
I would say… the overwhelming amount of data that's out there is unstructured. And in bizarre formats.

02:03:32.000 --> 02:03:47.000
that are not consistent, right? And there, I think, there are opportunities to be able to query that data and work with that data as well.

02:03:47.000 --> 02:03:55.000
To your point about data being in a database: one of my students,

02:03:55.000 --> 02:04:05.000
or I should say one of the teams from my cohorts, was looking at a particular state agency and was getting their data

02:04:05.000 --> 02:04:25.000
Uh, from their website. And they were getting it as Excel files. And… I would like to believe what you said, Ed, where what they do is they have a database, and they run a query, and it has a template, and it spits out this Excel file, right?

02:04:25.000 --> 02:04:30.000
If, um… If that were true.

02:04:30.000 --> 02:04:34.000
Then what you would have is you would have… if you were to query, let's say, the past.

02:04:34.000 --> 02:04:43.000
20 years of data and say, well, I want an Excel sheet for every month, right? So 20 years times 12, so 240 spreadsheets.

02:04:43.000 --> 02:04:55.000
If you were to query a database that would generate these Excel sheets, let's say, on the fly, you would expect every single Excel sheet to be the same same structure, same schema, same format.

02:04:55.000 --> 02:05:11.000
What my students discovered is, over the 20 years and the 240 Excel spreadsheets, it was rare, if ever, that two spreadsheets were the same structure. They literally varied from month to month.

02:05:11.000 --> 02:05:33.000
Sometimes it was just subtle changes, you know, an extra row or an extra column, a blank row or column, or a name change in a field, something like that. But other times it was completely new calculations, completely new, all kinds of different stuff.

02:05:33.000 --> 02:05:50.000
What that suggests is that there isn't a database. It suggests that literally what they would do is, once a month, someone would sit down and say, oh, I'm going to take this data, this data, this data, put it all together so it looks like a spreadsheet. Yeah, it looks good. Great. Save it to the file system, and it's out on the web, and we're done with it.

02:05:50.000 --> 02:05:54.000
Oh, hey, in Denver.

02:05:54.000 --> 02:05:55.000
Or copy and paste.

02:05:55.000 --> 02:06:14.000
Right? Yeah. Yeah, that's what I suspect. Right? Because if you had a database and you had a process, an automated process, I would expect to see much more consistency in whatever the data source of those Excel files is. And that is not what we saw.

02:06:14.000 --> 02:06:15.000
Right?

02:06:15.000 --> 02:06:28.000
Oh, maybe I'm just being naive. It's just that, you know, the relational database has been around since, like, God came up with it in the seventies, and IBM, which is the biggest manufacturer of mainframes, adopted it very quickly.

02:06:28.000 --> 02:06:33.000
There?

02:06:33.000 --> 02:06:49.000
And, you know, they kind of said, okay, you're going to put data in this format, that's it, right? And I was like, okay, well, I bet you there's a lot of mainframes out there, and I bet you a lot of them have data in some relational database, DB2 or something like that.

02:06:49.000 --> 02:07:02.000
Uh, so, you know, I would think it's out there, but maybe you're right. Maybe this… Maybe this other stuff is, you know, too unstructured.

02:07:02.000 --> 02:07:04.000
Yeah, I…

02:07:04.000 --> 02:07:05.000
Oh. The real annoying.

02:07:05.000 --> 02:07:11.000
This is Phil. I was going to mention that, uh… Probably… Say what?

02:07:11.000 --> 02:07:12.000
The really annoying thing is that for 90% of these agencies, the reason why they do that is cost.

02:07:12.000 --> 02:07:18.000
The, uh…

02:07:18.000 --> 02:07:19.000
Right.

02:07:19.000 --> 02:07:29.000
I was going to mention that kind of the reason I think Robert went that route was that they originally were hitting the database with the API,

02:07:29.000 --> 02:07:31.000
and they thought he was a bot. So he had to go to the FTP site and pull down the Parquet files.

02:07:31.000 --> 02:07:36.000
Yeah.

02:07:36.000 --> 02:07:37.000
Which is kind of the reason. Okay, okay.

02:07:37.000 --> 02:07:38.000
So…

02:07:38.000 --> 02:07:43.000
There were… Yeah, they weren't Parquet files. They were CSV and Excel and God knows whatever else. But yes.

02:07:43.000 --> 02:07:49.000
Right, right. Okay, but you… in effect, you used Parquet to change it into.

02:07:49.000 --> 02:07:50.000
Yeah. Yep.

02:07:50.000 --> 02:07:59.000
to be able to use it. Gotcha. I mean, that's kind of the whole reason, you know, you were using the database. But then they said, oh, he's a bot.

02:07:59.000 --> 02:08:00.000
Could be. With me?

02:08:00.000 --> 02:08:03.000
Okay.

02:08:03.000 --> 02:08:22.000
This is Stan, rolling back. So Parquet deals with non-databases and then makes them into a Parquet file that you can deal with, right? So there's no server required on the other end. It just pulls whatever's on the other end, and

02:08:22.000 --> 02:08:25.000
manipulation, right?

02:08:25.000 --> 02:08:35.000
Not quite. Think of a Parquet file as simply a CSV file on steroids. It's just a file format. That's it. Nothing else.

02:08:35.000 --> 02:08:41.000
Well, you're not relying on the other end to do anything with a Parquet file.

02:08:41.000 --> 02:08:46.000
Correct?

02:08:46.000 --> 02:08:47.000
So, converting the… converting the data that you pulled down.

02:08:47.000 --> 02:08:52.000
I'm not sure. It's the only thing I'm…

02:08:52.000 --> 02:08:53.000
Uh-huh.

02:08:53.000 --> 02:08:59.000
It's Parquet that… manipulates it and puts it into a Parquet file.

02:08:59.000 --> 02:09:00.000
Right?

02:09:00.000 --> 02:09:08.000
No, I use the query engine DuckDB to pull the data, transform it, and save it as a Parquet file.
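
A minimal sketch of that pull-transform-save flow, with a hypothetical URL and columns; httpfs is DuckDB's extension for reading over HTTP.

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Pull a remote CSV, clean it up, and save it as Parquet in one statement.
con.execute("""
    COPY (
        SELECT
            trim(name)                  AS name,
            CAST(amount AS DOUBLE)      AS amount,
            CAST(when_utc AS TIMESTAMP) AS when_utc
        FROM read_csv_auto('https://example.org/export/data.csv')
        WHERE amount IS NOT NULL
    ) TO 'data.parquet' (FORMAT PARQUET)
""")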

02:09:08.000 --> 02:09:10.000
I'm sorry, I skipped that step in my brain.

02:09:10.000 --> 02:09:11.000
Yeah, I was just going to point that out.

02:09:11.000 --> 02:09:14.000
Yeah, wait. Okay.

02:09:14.000 --> 02:09:16.000
It was a separate… a separate tool.

02:09:16.000 --> 02:09:24.000
So there's nothing required on the other end; it's DuckDB that sucks it down and makes the Parquet file.

02:09:24.000 --> 02:09:25.000
Correct.

02:09:25.000 --> 02:09:26.000
Okay. Hey, hey, Robert, I got I got a different question for you.

02:09:26.000 --> 02:09:31.000
Okay.

02:09:31.000 --> 02:09:32.000
I am getting a headache trying to wrap my mind around.

02:09:32.000 --> 02:09:36.000
Yeah.

02:09:36.000 --> 02:09:43.000
A query counting 222 million rows of data in less than a second.

02:09:43.000 --> 02:09:47.000
Is that some kind of function provided by S3?

02:09:47.000 --> 02:09:56.000
Is that… you know, how would we implement that in the non-Amazon world?

02:09:56.000 --> 02:10:03.000
Hello?

02:10:03.000 --> 02:10:04.000
There we go down here.

02:10:04.000 --> 02:10:23.000
It's not just Amazon; it's just that it's on a web server, and it's, how would you say, taking advantage of the Parquet format, right? So, again, like I mentioned, Stan, think of Parquet as a CSV file, but on steroids, in a very big way. Not only do you have data types, it's structured in such a way that there's a lot of metadata being

02:10:23.000 --> 02:10:42.000
stored in the file as well, right? So, you know, the data types, the descriptions, the number of rows, the number of columns… and extra things where you have, I think the term is chunks, right? And so…

02:10:42.000 --> 02:10:48.000
The example that I did was just, hey, I just want to count everything. If I've got a thousand files.

02:10:48.000 --> 02:10:59.000
each with, you know, a million rows in them, all it needs to do is go to every single file, each of those thousand, and say, hey, look at the metadata and tell me how many rows you've got.

02:10:59.000 --> 02:11:08.000
And it does that really fast, right? Because there's, I mean, a thousand files is nothing. It just goes boom, boom, boom, boom, right? Adds them up, and then goes, here you go.
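
A sketch of why that count is cheap, with hypothetical file names: DuckDB can answer count(*) over many Parquet files from their footers, and parquet_file_metadata exposes the per-file row count directly.

import duckdb

# The count comes from Parquet footer metadata, not a full scan of every file.
print(duckdb.sql("SELECT count(*) FROM read_parquet('data/*.parquet')"))

# Peek at the metadata being used: one row per file, with its row count.
print(duckdb.sql("""
    SELECT file_name, num_rows
    FROM parquet_file_metadata('data/*.parquet')
"""))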

02:11:08.000 --> 02:11:25.000
More likely, DuckDB is also pretty smart and says, oh, we've got a thousand files? I don't need to go and query it sequentially. I don't need to go file A, then file B, or file 1, file 2, file 3, file 4. I can probably do that in parallel.

02:11:25.000 --> 02:11:34.000
And then aggregate all the data, get it all back at once, and add it up really quickly. DuckDB does a lot of that kind of stuff behind the scenes.

02:11:34.000 --> 02:11:46.000
Huh? Okay. So you're saying that us normal people could implement that on a gigabit network, you know, with a decent server in the back room?

02:11:46.000 --> 02:11:47.000
Okay.

02:11:47.000 --> 02:12:03.000
Absolutely. And it's not really an implementation, right? The only implementation would be that you partition things so that you can take advantage of the parallelism, right? If you have everything in one file,

02:12:03.000 --> 02:12:19.000
That's good if you want to do, you know, metadata. Hey, how many things are in that file? Boom, that's really quick. But if you then want to query the files based on different columns, that takes a little bit more effort on a single file, if it's big, right? Let's say it's got a billion records in it.

02:12:19.000 --> 02:12:23.000
If instead you broke it up into 10 or 100 pieces.

02:12:23.000 --> 02:12:24.000
That would probably be a lot faster.

02:12:24.000 --> 02:12:27.000
Okay.

02:12:27.000 --> 02:12:29.000
Because you could do a lot of that in parallel.

02:12:29.000 --> 02:12:43.000
And does it have, like, an idea of an index? You know, like you've indexed the data on this primary key or something, and it says, okay,

02:12:43.000 --> 02:12:44.000
I…

02:12:44.000 --> 02:12:50.000
I can go directly to this row in, you know, physically in the file.

02:12:50.000 --> 02:12:56.000
In other words, a random access jump.

02:12:56.000 --> 02:13:09.000
The short answer is, I don't know. My understanding from what I've read is, yeah, the data is organized into chunks. So it's not indexed in the sense of a B-tree or a hash or something like that. But it does know, hey, I've got, you know,

02:13:09.000 --> 02:13:12.000
Mm-hmm.

02:13:12.000 --> 02:13:26.000
Within this chunk, I've got this kind of data. Within the second chunk, I've got this kind of data. Within this third chunk, I've got this other kind of data. And so… If you query it, and it realizes, well, gee whiz, what I'm… what you're looking for is only in the third chunk.

02:13:26.000 --> 02:13:38.000
It knows where that third chunk is, and it can go directly to that third chunk to query the data. In other words, it doesn't look anywhere else. It doesn't look, you know, it doesn't look at the first or second chunk, or the fourth or fifth or sixth.

02:13:38.000 --> 02:13:43.000
If everything it needs is in that third chunk, it just goes there, queries that data, and gives it to you.
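
Those chunks are Parquet's row groups, and the footer keeps min/max statistics per column within each row group, which is what lets a reader skip straight to the right one. A sketch of inspecting that with DuckDB, with a hypothetical file name (the column names here are parquet_metadata's own):

import duckdb

# One row per column within each row group, with its min/max statistics.
# A filter like WHERE ts >= '2024-01-01' only has to touch row groups whose
# [stats_min, stats_max] range for ts could contain matches.
print(duckdb.sql("""
    SELECT row_group_id, path_in_schema, stats_min, stats_max, num_values
    FROM parquet_metadata('events.parquet')
"""))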

02:13:43.000 --> 02:13:52.000
So, it's not like a traditional index, but there's definitely some… smart searching?

02:13:52.000 --> 02:13:54.000
I guess is a way to say it?

02:13:54.000 --> 02:14:00.000
Is that a feature of Parquet, or of the way DuckDB handles it?

02:14:00.000 --> 02:14:10.000
It's a… it's a feature of well, it's a feature of Parquet, because of its organization, and of course DuckDB takes advantage of that.

02:14:10.000 --> 02:14:12.000
Uh, okay.

02:14:12.000 --> 02:14:20.000
Hmm. Rudy posted something in the chat which is too extensive for me to read.

02:14:20.000 --> 02:14:22.000
That's for after we're done with this, but since we've been talking about DBMSs:

02:14:22.000 --> 02:14:25.000
Oh, okay. Okay.

02:14:25.000 --> 02:14:30.000
I've been fussing with something, and I could use somebody's help with it.

02:14:30.000 --> 02:14:37.000
Later. You're saying. Okay.

02:14:37.000 --> 02:14:49.000
Oh yeah, Code for America. Yeah, I'm just looking at the chat stuff. Holy smokes, lots of stuff.

02:14:49.000 --> 02:15:06.000
Yeah, pandas to SQL, Postgres. It's free, you have SQLite. What's cool about… and again, I didn't show this with DuckDB, but DuckDB knows how to connect to a lot of different databases. So let's say you do have data in a Postgres database, you can connect DuckDB to it and do your queries there, or

02:15:06.000 --> 02:15:15.000
BigQuery or SQL Server, or Oracle, or what have you. It's like this generic SQL client.

02:15:15.000 --> 02:15:16.000
And you can do joins across the two different sources.

02:15:16.000 --> 02:15:22.000
The other thing, which… Yes, and you could do joins across different sources, right? So you've got data in one or… Uh-huh.

02:15:22.000 --> 02:15:33.000
So if I have a SQLite file on one and my Postgres on another, and I need to run a join across both, a table inside of the SQLite and the

02:15:33.000 --> 02:15:39.000
or the Excel file and, uh… See, that's where the power is right there. Because I can't even do that with the Django ORM.

02:15:39.000 --> 02:15:42.000
Yeah.

02:15:42.000 --> 02:15:46.000
I have to go get the data from one database.

02:15:46.000 --> 02:15:54.000
and then, in memory, do the join manually inside of Python, which is slower.
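
A sketch of that cross-source join done inside DuckDB instead, with hypothetical names: attach a SQLite file and a Postgres database, then join a table from each in one SQL statement.

import duckdb

con = duckdb.connect()
con.execute("INSTALL sqlite")
con.execute("LOAD sqlite")
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

con.execute("ATTACH 'app.db' AS sq (TYPE sqlite)")
con.execute("ATTACH 'dbname=shop host=localhost' AS pg (TYPE postgres)")

# One join across two different engines; no manual merge in Python.
rows = con.execute("""
    SELECT u.name, SUM(o.total) AS spent
    FROM sq.users AS u
    JOIN pg.orders AS o ON o.user_id = u.id
    GROUP BY u.name
""").fetchall()
print(rows)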

02:15:54.000 --> 02:16:09.000
And what's also cool about DuckDB, unlike pandas, right? With pandas, everything's got to be in RAM, right? Which is good up until you run out of RAM. DuckDB is smart enough to do things in chunks.

02:16:09.000 --> 02:16:25.000
So it can say, great, I need some stuff. Whatever I don't need, I'll save to disk, then query some more, save it to disk, query, save it to disk, and only keep the stuff that it needs, and then, you know, eventually give you a result. So it's not bound by RAM like pandas is.
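
A sketch of steering that behavior explicitly; memory_limit and temp_directory are real DuckDB settings, and the values here are arbitrary.

import duckdb

con = duckdb.connect("big.duckdb")  # hypothetical database file

# Cap RAM use; past this, DuckDB spills intermediate results to temp files.
con.execute("SET memory_limit = '2GB'")
con.execute("SET temp_directory = '/tmp/duckdb_spill'")

# A larger-than-RAM aggregation can still complete.
print(con.execute("""
    SELECT category, count(*)
    FROM read_parquet('huge/*.parquet')
    GROUP BY category
""").fetchall())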

02:16:25.000 --> 02:16:35.000
And of course it's SQL, right? So you don't have to have a different syntax, which pandas has, which Polars has, which many, many other tools have.

02:16:35.000 --> 02:16:41.000
So yeah.

02:16:41.000 --> 02:16:50.000
So… If I wanted to standardize and clean, like, the traditional ETL process.

02:16:50.000 --> 02:16:59.000
What would be best to do, if I was using DuckDB, is I would take DuckDB and tell it to connect to that file,

02:16:59.000 --> 02:17:07.000
that's the lineage, and once it's in, then tell it to dump out the data,

02:17:07.000 --> 02:17:15.000
and then do my load procedure into whatever system I was using.

02:17:15.000 --> 02:17:19.000
You could do that. Absolutely. Absolutely. Right?

02:17:19.000 --> 02:17:26.000
Oh, by the way, just real quick, Robert. You didn't touch on this.

02:17:26.000 --> 02:17:44.000
But… you were doing queries and things, but is there a way to, like… I know in SQLite, for example, I can say, okay, run this query and output it as a CSV or something, right? In other words,

02:17:44.000 --> 02:17:50.000
Like, can I get the data in some other data format from the query itself?

02:17:50.000 --> 02:17:51.000
Yeah.

02:17:51.000 --> 02:17:58.000
right? Or is there a way to export the data from DuckDB in some format?

02:17:58.000 --> 02:18:09.000
Absolutely, and yes, you're right. I did cover it, but I guess it was a bit subtle. All the data that I'm pulling, I am saving as a Parquet file.

02:18:09.000 --> 02:18:10.000
I mean, other than Parquet is what I meant to say. Yeah.

02:18:10.000 --> 02:18:25.000
Absolutely. I can go in and say, you know what, save this as a SQLite table, save this as an Excel file, save this as a CSV file, and I want tabs as delimiters instead.

02:18:25.000 --> 02:18:27.000
Absolutely, you can do that. Absolutely. Yeah.
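
A sketch of a couple of those exports, with hypothetical names; COPY ... TO with a DELIMITER option and the sqlite extension are DuckDB features (native Excel output depends on version, so it's left out here).

import duckdb

con = duckdb.connect()
con.execute("INSTALL sqlite")
con.execute("LOAD sqlite")

# Tab-delimited text instead of commas.
con.execute("""
    COPY (SELECT * FROM read_parquet('data.parquet'))
    TO 'data.tsv' (FORMAT CSV, DELIMITER '\t', HEADER)
""")

# The same result saved as a table in a SQLite file.
con.execute("ATTACH 'out.db' AS sq (TYPE sqlite)")
con.execute("CREATE TABLE sq.data AS SELECT * FROM read_parquet('data.parquet')")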

02:18:27.000 --> 02:18:38.000
Okay. Okay. So you could theoretically then automate the procedure, right? You could say, here's a shell script, right? And I'm going to run the shell script and that shell script is going to go off to the web or the.

02:18:38.000 --> 02:18:44.000
Yep.

02:18:44.000 --> 02:18:54.000
FTP site, or whatever, and then query the data, and then give it back to me in this format, in this directory.

02:18:54.000 --> 02:18:55.000
Correct. Correct.

02:18:55.000 --> 02:19:04.000
Huh? Okay. That's… that's pretty cool. You don't have to use the command line interface to do it.

02:19:04.000 --> 02:19:09.000
I assume you can, like, write a DuckDB script, right? That has all your SQL statements in it.

02:19:09.000 --> 02:19:11.000
Yeah, that's what I was doing. All that stuff is a SQL script.

02:19:11.000 --> 02:19:16.000
No.

02:19:16.000 --> 02:19:17.000
Okay.

02:19:17.000 --> 02:19:26.000
Absolutely. You can pipe it into DuckDB, or even say DuckDB, you know, and I forgot what the option is, but hey, here's my script, run it and, you know, do things. Absolutely.

02:19:26.000 --> 02:19:27.000
Okay. I would have expected that, but I I didn't know. Okay.

02:19:27.000 --> 02:19:33.000
Absolutely.

02:19:33.000 --> 02:19:34.000
This is…

02:19:34.000 --> 02:19:46.000
Yeah. And in fact, one of the things that I did, talking about different formats: there's a teaching database for SQLite called Chinook.

02:19:46.000 --> 02:20:00.000
I basically took the Chinook database in SQLite, read it with DuckDB, and turned it into a bunch of files, you know, every table as a Parquet file, right? And now you can use DuckDB to query those Parquet files.
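
A sketch of that conversion, assuming a local Chinook SQLite file; it lists the attached database's tables via duckdb_tables() and writes one Parquet file per table.

import duckdb

con = duckdb.connect()
con.execute("INSTALL sqlite")
con.execute("LOAD sqlite")
con.execute("ATTACH 'chinook.db' AS ch (TYPE sqlite)")

# One Parquet file per table in the attached SQLite database.
tables = [r[0] for r in con.execute(
    "SELECT table_name FROM duckdb_tables() WHERE database_name = 'ch'"
).fetchall()]
for t in tables:
    con.execute(f'COPY (SELECT * FROM ch."{t}") TO \'{t}.parquet\' (FORMAT PARQUET)')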

02:20:00.000 --> 02:20:11.000
Is that, like, a common practice? Like, if you've got a huge data set, it doesn't have to be huge, right? But just a data set that's made up of multiple database tables, right, that you output,

02:20:11.000 --> 02:20:16.000
Uh-huh.

02:20:16.000 --> 02:20:23.000
but if you're going to use Parquet, it would be one file per table.

02:20:23.000 --> 02:20:25.000
Yes.

02:20:25.000 --> 02:20:41.000
Okay, and so when you come back with DuckDB, you say, okay, here's a directory with all these files in it, right? And, you know, treat them as individual database tables, and select, you know, A from,

02:20:41.000 --> 02:20:44.000
you know, this table, comma, this other table, which is really just a Parquet file.

02:20:44.000 --> 02:21:04.000
Yeah, or you could do what I mentioned earlier: you've got a folder with all these Parquet files. Inside that, you put a… there is a DuckDB database format as well, a DuckDB data file. And in that you can create

02:21:04.000 --> 02:21:22.000
all your tables as… or views that reference the files, right? So, you know, table…

02:21:22.000 --> 02:21:23.000
Exactly. Yep.

02:21:23.000 --> 02:21:32.000
Right. So you don't have to type the path name or the weird file name or whatever. You can just say, this is called, you know, whatever: this is St. Louis, this is Kansas City, this is blah blah blah. Okay.
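
A sketch of that pattern, with hypothetical paths: a persistent DuckDB file holding friendly view names over the Parquet files sitting next to it.

import duckdb

# This .duckdb file lives in the folder with the Parquet files.
con = duckdb.connect("catalog.duckdb")

con.execute("""
    CREATE OR REPLACE VIEW st_louis AS
    SELECT * FROM read_parquet('stl_2024_export_v3.parquet')
""")
con.execute("""
    CREATE OR REPLACE VIEW kansas_city AS
    SELECT * FROM read_parquet('kc_2024_export_final_FINAL.parquet')
""")

# Queries use the friendly names, not the ugly file paths.
print(con.execute("SELECT count(*) FROM st_louis").fetchone())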

02:21:32.000 --> 02:21:33.000
Okay.

02:21:33.000 --> 02:21:43.000
This is Stan. Am I incorrect in making the observation that the Internet is much, much faster than I think it is, if I use the right tool?

02:21:43.000 --> 02:21:44.000
Good question.

02:21:44.000 --> 02:21:49.000
Well, if you're going to go across the ISDN link, of course it's going to be slow.

02:21:49.000 --> 02:21:52.000
Yeah.

02:21:52.000 --> 02:21:53.000
Yeah, I… I've been using the Starlink for a few months, and.

02:21:53.000 --> 02:21:59.000
There is vision, yeah.

02:21:59.000 --> 02:22:15.000
I was just amazed at how fast it is. I don't necessarily think it's the fastest thing in the world, but… You know, and I don't do multiplayer 3D games, but… It does everything else. It downloads gigabyte, uh…

02:22:15.000 --> 02:22:25.000
operating system files pretty well, no problem. And it's just a little phased-array antenna lying in my backyard.

02:22:25.000 --> 02:22:29.000
I hope some deer don't step on it.

02:22:29.000 --> 02:22:33.000
Well, that's your fault. Go get some EMT and put it up high.

02:22:33.000 --> 02:22:34.000
Oh, we're.

02:22:34.000 --> 02:22:35.000
There's…

02:22:35.000 --> 02:22:40.000
That's what happened to my weather station. Deer crashed into it one night, destroyed it.

02:22:40.000 --> 02:22:45.000
See, why don't you put a pole up on your roof at some point? I'll help you.

02:22:45.000 --> 02:22:46.000
or or uh…

02:22:46.000 --> 02:22:49.000
Doesn't even have to be on the roof. Just… up higher.

02:22:49.000 --> 02:22:57.000
Well, at least then no critter's going to run into it.

02:22:57.000 --> 02:22:58.000
Hey, I just got to say that this dude.

02:22:58.000 --> 02:23:03.000
Yeah, that's true. I'll tell you, Lee, I don't… I pretty much don't do ladders anymore.

02:23:03.000 --> 02:23:15.000
Oh, you don't do ladders. Get some tube clamps, take the EMT, put the antenna on top of it, and then take a T-post and drive it in the ground.

02:23:15.000 --> 02:23:18.000
Then you take the tube clamps and clamp the antenna to the T-post.

02:23:18.000 --> 02:23:21.000
Oh.

02:23:21.000 --> 02:23:31.000
Actually, I could use the post I got out in the backyard for my weather station. It's… I don't know, it's about five and a half feet, that's fine.

02:23:31.000 --> 02:23:32.000
Yeah.

02:23:32.000 --> 02:23:40.000
I have to get a longer cable.

02:23:40.000 --> 02:23:46.000
I'm sorry if I derailed the conversation.

02:23:46.000 --> 02:23:50.000
No worries. Any other questions before we call it a day?

02:23:50.000 --> 02:23:53.000
One final question. Can I read a Parquet file

02:23:53.000 --> 02:23:57.000
from far away

02:23:57.000 --> 02:24:04.000
with my Rust programming language? You sure about that? Okay.

02:24:04.000 --> 02:24:05.000
DuckDB… I'm sorry, Rust.

02:24:05.000 --> 02:24:10.000
Absolutely. Yeah, let me… and here's the reason why.

02:24:10.000 --> 02:24:29.000
Here's the reason why. In addition to native C and C++ APIs, DuckDB supports a range of programming languages, including Java, Python, Rust, Node.js, R, Julia, Swift, WebAssembly, Go, and probably a bunch of others, given, you know,

02:24:29.000 --> 02:24:40.000
given its rapid, how would you say, adoption, right? And so,

02:24:40.000 --> 02:24:55.000
since these programming languages know how to work with DuckDB, or, you know, support DuckDB, and DuckDB knows how to read Parquet files, any of these languages can read Parquet files as well.

02:24:55.000 --> 02:24:56.000
You're…

02:24:56.000 --> 02:25:07.000
Okay, so you would have to use DuckDB. There's not just, like, a library crate for Rust that natively reads Parquet? I was going to look it up, but…

02:25:07.000 --> 02:25:17.000
Yeah. Yeah, the Rust API is distributed as a Rust crate that exposes an elegant wrapper over the native C API.

02:25:17.000 --> 02:25:18.000
Uh, okay.

02:25:18.000 --> 02:25:34.000
And I don't know what it's like in Rust, but I do know in Python, if you want to load up DuckDB, you just say import duckdb, and it's like, you know, 2 seconds, boom. You've got that module.

02:25:34.000 --> 02:25:47.000
And then you can start querying anything you want, Parquet files, using the DuckDB module, and it's blazingly fast.

02:25:47.000 --> 02:25:48.000
I guess the best thing about this is a… oh.

02:25:48.000 --> 02:25:52.000
Hmm.

02:25:52.000 --> 02:26:00.000
So DuckDB would replace SQLAlchemy in the Python stack itself?

02:26:00.000 --> 02:26:07.000
SQLAlchemy is the… usually the go-to

02:26:07.000 --> 02:26:08.000
Yeah.

02:26:08.000 --> 02:26:18.000
to run queries with, when you're not using something like the Django object-relational model.

02:26:18.000 --> 02:26:26.000
Probably. Right?

02:26:26.000 --> 02:26:27.000
Was it November that you first presented about DuckDB?

02:26:27.000 --> 02:26:31.000
Good.

02:26:31.000 --> 02:26:32.000
I think so. Somewhere around that. Cool.

02:26:32.000 --> 02:26:33.000
Yes, it was. It was November, the November LUG meeting.

02:26:33.000 --> 02:26:36.000
Okay. Okay.

02:26:36.000 --> 02:26:39.000
Okay. Yeah.

02:26:39.000 --> 02:26:43.000
St. Louis Linux user group meeting? Thank you.

02:26:43.000 --> 02:26:58.000
Yes. And I think I put that URL in a Discuss message the other day. But I went out and looked at it, and yes, the recording is out there in the archive.

02:26:58.000 --> 02:26:59.000
And just…

02:26:59.000 --> 02:27:03.000
I never read them.

02:27:03.000 --> 02:27:04.000
Steve, you really should.

02:27:04.000 --> 02:27:13.000
No, I never read Gary's postings because they're so sloppy.

02:27:13.000 --> 02:27:14.000
Oh, right.

02:27:14.000 --> 02:27:22.000
He uses his Gmail account, and they convert all the URLs to these long, stupid strings that are ugly.

02:27:22.000 --> 02:27:29.000
Oh. That sounds like we all need to get together to go for a beer and a pizza sometime.

02:27:29.000 --> 02:27:31.000
Ah, yes.

02:27:31.000 --> 02:27:32.000
Just let me know.

02:27:32.000 --> 02:27:35.000
Yeah. Okay.

02:27:35.000 --> 02:27:40.000
Nick can only use…

02:27:40.000 --> 02:27:51.000
One final note, then, and again, something that… there are so many things I didn't show you. DuckDB can do a lot of things. And one of the things that I'm, uh,

02:27:51.000 --> 02:28:14.000
I'm actually really liking is something called PRQL, "prequel," P-R-Q-L, which stands for Pipelined Relational Query Language. Technically, it's… what is it called? It translates its syntax into SQL to then execute the query. So it's…

02:28:14.000 --> 02:28:33.000
Technically, I don't think you could really classify it as a query language, but it is. And what's… What I like about it is it handles some of the grievances that I have with SQL, although CTEs have done a pretty good job of handling that. But a good example of that, just a basic query.

02:28:33.000 --> 02:28:46.000
Think about the format of a basic query, right? You say select.

02:28:46.000 --> 02:28:47.000
Sorry, guys.

02:28:47.000 --> 02:29:04.000
Oh. I agree. PRQL is one of the most exquisite, amazing things that has… anyway. So with PRQL, you know, if you look at what SQL is, you say select and you specify the columns first, and then you say from table.

02:29:04.000 --> 02:29:09.000
And then lastly, you say where, which, in essence, filters the rows.

02:29:09.000 --> 02:29:32.000
If I were to redesign SQL from scratch, one of the first things I would do is flip that. I would say, here's your table, tell me where you're starting from, from this table. Okay, great, now I've got some context to understand what we're doing. Then I would say where, to filter the rows that I want, and then I would select the columns that I want.

02:29:32.000 --> 02:29:45.000
And what's cool is with Prequel, you can do that, right? It's literally a pipeline at a flow. Hey, do this to the data, then do this to the data, then do this to the data, then do this to the data, and so on and so forth, right?

02:29:45.000 --> 02:30:01.000
One of the cool things it gets rid of is this whole silly thing with HAVING, right? Do this, group by blah blah, having, got it, da-da-da. Now that goes away, right? Because what you have is that every time you execute a command,

02:30:01.000 --> 02:30:18.000
you, in essence, think of that as a new data set. So if you are doing aggregation, that aggregation is now what you're working on. So instead of saying, having something greater than 2, for example, you simply say, well, where, whatever column name you called it, is greater than 2, right? So it's this nice,

02:30:18.000 --> 02:30:34.000
in my opinion, nice flow of data. It's a nice pipeline for thinking about data. But that's just my opinion. And that will be a topic for a future time.
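
For a flavor of that pipeline style, a rough sketch; the PRQL in the string below follows the from-then-filter-then-select flow described above (table and column names are made up, and exact PRQL syntax can vary by version), with the SQL it roughly compiles to in the comments.

# A PRQL pipeline, held as a string for illustration (not executed here).
# Roughly the SQL equivalent:
#   SELECT customer_id, SUM(total) AS spent
#   FROM invoices
#   WHERE status = 'paid'
#   GROUP BY customer_id
#   HAVING SUM(total) > 2   -- in PRQL, just another filter after aggregate
prql = """
from invoices
filter status == "paid"
group {customer_id} (
    aggregate { spent = sum total }
)
filter spent > 2
select { customer_id, spent }
"""
print(prql)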

02:30:34.000 --> 02:30:40.000
All right. It's 8 o'clock, folks, or 8 o'clock here. It's 9 o'clock where you guys are.

02:30:40.000 --> 02:30:47.000
I've had a long day. So, any other questions before we skedaddle?

02:30:47.000 --> 02:30:50.000
Where are you teaching at?

02:30:50.000 --> 02:31:02.000
Central New Mexico Community College. It's here in Albuquerque, New Mexico. Tell you what, let me find a link, and I will post the link

02:31:02.000 --> 02:31:03.000
to the chat.

02:31:03.000 --> 02:31:10.000
And I'm going to be real specific. I'm gonna post a link to my course, because… Well, by golly, my course is the best course.

02:31:10.000 --> 02:31:11.000
You betcha.

02:31:11.000 --> 02:31:16.000
Not because of me, because of the students. The students are just freaking awesome.

02:31:16.000 --> 02:31:24.000
And so, there you go. Actually, you know, if you don't mind, if I can make a quick plug.

02:31:24.000 --> 02:31:43.000
Right now, I'm teaching a part-time course, which is on Fridays and goes for six months. The full-time course is starting up in June. Actually, the middle of May is when we start sending out what is called the pre-work, but it goes full steam the first week of June.

02:31:43.000 --> 02:32:12.000
And it goes for 8 weeks, excuse me, 12 weeks, Monday through Thursday. And if anybody's interested, just post to the SLUUG list or send me an email; I'd be happy to answer. We will take students from anywhere, and also there are funding opportunities, right? So if you see the sticker price, don't get sticker shock; most of the students that we have qualify for financial aid. What I mean by financial aid is,

02:32:12.000 --> 02:32:30.000
literally, the course is paid for; it's not a loan. And yeah, we cover supervised and unsupervised learning, classification, regression, natural language processing, deep learning, and we're also now adding in AI components, so things like

02:32:30.000 --> 02:32:46.000
Claude Code and open code, and Ollama, and, you know, prompt engineering and all that kind of fun stuff as well. So if you're interested, just shoot me a message, or click on the link and fill out the form.

02:32:46.000 --> 02:32:52.000
Our admin staff will be happy to field your questions.

02:32:52.000 --> 02:32:55.000
Cool.

02:32:55.000 --> 02:33:04.000
Excellent. Thank you very, very much, Robert. This was a really excellent presentation. Loved getting into all the.

02:33:04.000 --> 02:33:09.000
The nuts and bolts of data mining. It's kind of cool.

02:33:09.000 --> 02:33:17.000
Any other final comments from anyone before we wrap it up for the evening?

02:33:17.000 --> 02:33:18.000
I'm ready to go home. Cool, thank you.

02:33:18.000 --> 02:33:20.000
It was a great talk.

02:33:20.000 --> 02:33:27.000
Albuquerque… have you been by the old MITS Altair building yet?

02:33:27.000 --> 02:33:35.000
I have not been out to… Excuse me. Say that again. Where in Albuquerque?

02:33:35.000 --> 02:33:39.000
Well, wherever MITS was, the Altair, you know, they made the Altair.

02:33:39.000 --> 02:33:44.000
Oh, yeah! Yeah, yeah, I have no idea where they were.

02:33:44.000 --> 02:33:46.000
Okay.

02:33:46.000 --> 02:33:59.000
It was a strip mall, from what I understand. You can't really find where it was anymore because they redid the whole thing.

02:33:59.000 --> 02:34:04.000
I'm sure the street exists.

02:34:04.000 --> 02:34:05.000
I move we end the meeting.

02:34:05.000 --> 02:34:06.000
Hey, don't know.

02:34:06.000 --> 02:34:07.000
Oh.

02:34:07.000 --> 02:34:17.000
Oh, and also… Going to the, uh, government database thing, or whatever, the FTP site: just by coincidence this morning, I went to the FCC FTP site,

02:34:17.000 --> 02:34:24.000
and I pulled down some data files, and they also use vertical pipe symbols for the.

02:34:24.000 --> 02:34:28.000
commas in the CSV files.

02:34:28.000 --> 02:34:36.000
Yeah, pipe delimiters, pretty standard, because comma-separated runs into issues when you have

02:34:36.000 --> 02:34:40.000
Addresses.

02:34:40.000 --> 02:34:48.000
Yeah, it's used a lot in HL7, which is the Health Level Seven standard in the medical industry.

02:34:48.000 --> 02:35:10.000
So they're really strange files, because you have to kind of go by the first field to tell what this row is, right? And so you might have, like, a group of those, and then you get to another one, and now the first field is something totally different, and now you have, like, an even longer row,

02:35:10.000 --> 02:35:25.000
or maybe a shorter row. So you can't just, like, oh, import this and convert the pipes to commas. No, it doesn't work. You literally have to write your own code

02:35:25.000 --> 02:35:26.000
Oh, look. And that's what's called extract, transform, and load.

02:35:26.000 --> 02:35:32.000
to actually be able to parse it.

02:35:32.000 --> 02:35:35.000
That's what that process is, Paul.

02:35:35.000 --> 02:35:37.000
Thanks. Thanks, Robert. I really enjoyed it.

02:35:37.000 --> 02:35:41.000
I've had to go with… All that.

02:35:41.000 --> 02:36:08.000
And one more thing, a side note, talking about MITS, since Phil mentioned it: I had to look it up. Yeah, it's literally kitty-corner from… or at least, you know, where they used to be is literally kitty-corner from this absolutely fantastic Vietnamese restaurant. And so I have probably driven by the old MITS

02:36:08.000 --> 02:36:15.000
headquarters a countless number of times and not even known it. So, very cool.

02:36:15.000 --> 02:36:19.000
Micro Instrumentation and Telemetry Systems.

02:36:19.000 --> 02:36:21.000
Yeah. Yeah.

02:36:21.000 --> 02:36:27.000
Rocketry, model rocketry.

02:36:27.000 --> 02:36:28.000
Yeah. Um… Yeah, Gary, go for it.

02:36:28.000 --> 02:36:48.000
All right, I think we can put a bookend on it for the meeting. This has really been good. The bookend for the recording: this is 9:07 PM Central time. It is the monthly meeting of the St. Louis Linux user group,

02:36:48.000 --> 02:37:03.000
St. Louis LUG. Our speaker tonight was Robert Situk. He was doing a presentation on DuckDB and all the great data that you can analyze or manipulate this way.

02:37:03.000 --> 02:37:13.000
We appreciated his talk. The date today is Thursday, April the.

02:37:13.000 --> 02:37:15.000
16th.

02:37:15.000 --> 02:37:20.000
17

