Search - "text processing"
-
TLDR;
Wrote a slick scheduling and communication system allowing me to assign photography resources based on time and location.
I'll tell you a little secret ... I'm not actually a dev. I'm a photographer, pretending to be a dev.
Or ... perhaps it's the other way around? (I spend most of my time writing code these days, but only for me - I write the software I use to run my business).
I own a photography studio - we specialize in youth volleyball photography (mostly 12-18 year old girls with a bit of high school, college and semi-pro thrown in for good measure - it's a hugely popular sport) and travel all over the US (and sometimes Europe) photographing.
As a point of scale, this year we photographed a tournament in Denver that featured 100 volleyball courts (in one room!), playing at the same time.
I'm based in California and fly a crew of part-time staff around to these events, but my father and I drive our booth equipment wherever it needs to go. We usually set up a 30'x90' booth with local servers, download/processing/cashier computers and 45 laptops for viewing/ordering photographs. Not to mention 16' drape and banners, tons of samples, 55" TVs, etc. It's quite the production.
We photograph by paid signup only - when there are upwards of 800 teams/9,600 athletes per weekend playing, and you only have four trained photographers, you've got to manage your resources!
This of course means you have to have a system for taking those sign-ups, assigning teams to photographers and doing so in the most efficient manner possible based on who is available when the team is playing. (You can waste an awful lot of time walking from one court to another in a large convention center - especially if you have to navigate through large crowds - not to mention exhausting yourself.)
So this year I finally added a feature I've wanted for quite some time - an interactive court map. I can take an image of the court layout from the tournament and create an HTML version in our software. As I mouse over requests in one window, the corresponding court is highlighted on the map in another browser window. Each photographer has a color associated with them. When I assign requests to a photographer, the court is color coded with the color of the photographer. This allows me to group assignments to minimize photographer walk time and keep them in a specific area. It's also very easy to look at the map and see unassigned requests and look to see what photographer is nearby.
This year I also integrated with Twilio and set up a simple set of text shortcuts that photographers can use to let our booth staff know where they are, if they have memory cards that need picking up, if they need water/coffee/snack, etc. They can also move assignments on their schedule or send an SOS for help if it looks like they aren't going to be able to photograph a team.
Kind of a CLI via the phone. :)
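A minimal sketch of how such text shortcuts could be parsed, in Python. The keywords (`LOC`, `CARDS`, `NEED`, `MOVE`, `SOS`) and the `parse_shortcut` helper are hypothetical, made up for illustration; the post does not list the real shortcuts, and the Twilio webhook that would deliver the message body is left out.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical shortcut keywords; the real system's commands are not given in the post.
COMMANDS = {
    "LOC": "report current court",
    "CARDS": "memory cards need pickup",
    "NEED": "request water/coffee/snack",
    "MOVE": "move an assignment",
    "SOS": "cannot cover a team, send help",
}


@dataclass
class Shortcut:
    command: str
    args: List[str]


def parse_shortcut(body: str) -> Optional[Shortcut]:
    """Split an incoming SMS body into a known command and its arguments."""
    words = body.strip().split()
    if not words:
        return None
    command = words[0].upper()
    if command not in COMMANDS:
        return None  # unknown text: leave it for a human to read
    return Shortcut(command, words[1:])
```

With these assumed keywords, a message such as `need coffee` would parse to the `NEED` command with `["coffee"]` as its argument, and anything unrecognized falls through to the booth staff.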
The additions have turned out to be really useful and have made scheduling and managing the photographers much easier than it was in the past.
-
Doing linguistic research where I need to parse 2000 files totaling 36 GB. Since we are using Python, the first thing I thought of was to implement multithreading. That cut the total runtime from three days to about a day and a half. But when I checked the activity monitor I saw only 20 percent CPU usage. After some searching I started to understand how multithreading and multiprocessing work: in CPython the global interpreter lock lets only one thread run Python code at a time, so threads cannot spread CPU-heavy work across cores. Moral of the story: if you want to ping a website till they block you, or do easy tasks that will not use up all the power of one core, use multithreading. If you need to do something complicated that can easily consume all the power of a single CPU core, split up the work and use multiprocessing. In my case, when I tried to grab information from a website, I used multithreading since the work is easy and I really wanted to ping the website 16 times simultaneously but only have 4 cores. But when it comes to text processing, where a single file will take 80 percent of a CPU, split it up and use multiprocessing.
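A rough sketch of that split using Python's standard `concurrent.futures` module. The `fetch` and `count_tokens` functions are hypothetical stand-ins for the website requests and the per-file parsing; the worker counts (16 threads, 4 processes) simply mirror the numbers mentioned above.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fetch(url: str) -> str:
    """I/O-bound stand-in; a real version would make a network request here."""
    return f"fetched {url}"


def count_tokens(text: str) -> int:
    """CPU-bound stand-in for parsing one file (hypothetical workload)."""
    return len(text.split())


def run_io_bound(urls, workers=16):
    # Threads: each task mostly waits, so more workers than cores is fine.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


def run_cpu_bound(texts, workers=4):
    # Processes: each worker has its own interpreter, so work spreads across cores.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_tokens, texts))
```

When calling `run_cpu_bound` from a script, it is safest to do so under an `if __name__ == "__main__":` guard, since worker processes may re-import the module on some platforms.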
This is just a post for those who are confused about when to use which.
-
It works.
How I hate that sentence.
Whenever that sentence pops up, I wanna take a frying pan, make some bacon, eat the bacon and slam the still hot pan with grease through someone's face till the skull breaks.
Why has he so many anger issues, one might ask.
Usually the sentence "It works" means that, on closer inspection, the "working thing" works wrong in 95 % of all cases, but hey - for 5 % it at least does *something* right. Not everything, don't get your hopes up.
We had this fun topic happening again today and I'm still too angry to sleep.
Lucene analysis of texts in Elasticsearch.
Stopword list? Multiple word n-grams per line, duplicates, not lower cased, not properly encoded.
Tokenizers? Duh. Why should one put them in proper order.... Or more realistically: a specific order of tokenizers is necessary? *devs with shocked faces*.
Language specific details... UHM. Wait. Languages are different? There are edge cases in languages? *more shocked faces*.
Even more shocking: if a text processing pipeline is implemented horribly wrong, it delivers wrong results. *mind blown*.
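To make the stopword complaint concrete, here is a small Python sketch of the cleanup such a list would need before it is handed to an analyzer. It assumes each stopword should end up as a single lower-cased, consistently encoded token; `normalize_stopwords` is a hypothetical helper, not part of Lucene or Elasticsearch.

```python
import unicodedata


def normalize_stopwords(lines):
    """Return a sorted, de-duplicated, lower-cased list of single-token stopwords."""
    seen = set()
    for line in lines:
        # Assumption: multi-word lines are split into single terms.
        for term in line.split():
            # NFC makes visually identical strings compare equal.
            term = unicodedata.normalize("NFC", term).lower()
            seen.add(term)
    return sorted(seen)
```

A test for it should include the ugly inputs as well as the obvious ones: duplicates in different case, several words on one line, decomposed accents, and an empty list.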
But our unit tests (this goes out to @kiki) were working.
Yeah. You dumb nuggets who even an amoeba would be ashamed of, when you only do positive tests in unit tests with the most obvious working examples, then your unit tests are just a useless waste of nibbles.
Some of the devs are really a fucking waste of genetic information, should have probably ended better in a sock.
If this sounds too harsh, they had 2 weeks.
In just 3 hours I found out that they can redo that with supervision.
-.-
I'm getting too old for that shit. Seriously.
-
Had an interview with a potential customer last week, and he started questioning my technical capability in the middle of the discussion on the basis that I’m taking notes with pen and paper...
Yes, I can type. At 90+ WPM, I can darn near produce a transcript of everything we say. But I won’t remember any of it afterward, because it passes straight from the ears to the hands without any processing.
“You see, that’s what we have something called ’search’ for...”
...Yeah. Except that doesn’t help with picking out the most important points from a wall of text, organizing it in a way that allows visualizing relationships between concepts, and other non-linear things that are hard to do on the fly in a word processor.
“Well, how about we get you a tablet with a pen and you can just write on that, then?”
How about no.
Ended up turning him down because of other concerns that came up that were, suffice to say, about as onerous as you might expect from that exchange.
-
I’m trying to add digit separators to a few amount fields. There are actually three tickets to do this in various places, and I’m working on the last of them.
I had a nightmare debugging session earlier where literally everything would 404 unless I navigated through the site in a very roundabout way. I never did figure out the cause, but I found a viable workaround. Basically: the house doesn’t exist if you use the front door, but it’s fine if you go through the garden gate, around the back, and crawl in through the side window. After hours of debugging I eventually discovered that if I unlocked the front door with a different key, everything was fine… but nobody else has this problem?
Whatever.
Onto the problem at hand!
I’m trying to add digit separators to some values. I found a way to navigate to the page in question (more difficult than it sounds), and … I don’t know what view is rendering the page. Or what controller. Or how it generates its text.
The URL is encrypted, so I get no clues there. (Which was the lead dev’s solution to having scrapeable IDs instead of just, you know, fixing them.) The encryption also happens in middleware, so it’s a nightmare to work through. And it’s by the lead dev, so the code is fucking atrocious.
The view… could be one of many, and I don’t even know where they are. Or what layout. Or what partials go into building it.
All of the text on the page comes from “resources” (think named translations that support plus nested macros). I don’t know their names, and the bits of text I can search for are used fucking everywhere. “Confirmation number” (the most unique of them) turns up 79 matches. “Fee” showed up in 8310 places before my editor gave up looking. Really.
The table displaying the data, which is what I actually care about, isn’t built in JS or markup, but is likely a resource that goes through heavy processing. It gets generated in a controller somewhere (I don’t know the resource name so I can’t find it), and passed through several layers of “dynamic form” abstraction, eventually turned into markup, and rendered as a partial template. At least, that’s how it worked in the previous ticket. I found a resource that looks right, and there’s only the one. I found the nested macros it uses for the amount and total, and added the separators there… only to find that it doesn’t work.
Fucking dead end.
And I have absolutely nothing else to go on.
Page title? “Show”
URL? /~LiolV8N8KrIgaozEgLv93s…
Text? All from macros with unknown names. Can’t really search for it without considerable effort.
Table? Doesn’t work.
Text in the table? Doesn’t turn up anything new.
Legal agreement? There are multiple, used in many places and generated dynamically via (of course) resources, and even looking through the method usages doesn’t narrow it down very much.
Just.
What the fuck?
Why does this need to be so fucking complicated?
And what genius decided “$100000.00” doesn’t need separators? Right, the lot of them because separators aren’t used ANYWHERE but in code I authored. Like, really? This is fintech. You’d think they would be ubiquitous.
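For reference, the formatting itself is the easy part. A minimal Python sketch, assuming US-style grouping with a comma and two decimal places; the post does not name the stack involved, so `format_amount` is a hypothetical helper for illustration only.

```python
from decimal import Decimal


def format_amount(amount: Decimal, symbol: str = "$") -> str:
    """Format a monetary amount with thousands separators and two decimals."""
    # The "," option in the format spec inserts the digit separators.
    return f"{symbol}{amount:,.2f}"


print(format_amount(Decimal("100000.00")))  # $100,000.00
```

Using `Decimal` rather than `float` keeps the cents exact, which matters more in fintech than the separators do.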
And the sheer amount of abstraction?
Stupid stupid stupid stupid stupid.
-
Chrome, Firefox, and yes even you Opera, Falkon, Midori and Luakit. We need to talk, and all readers should grab a seat and prepare for some reality checks when their favorite web browsers are in this list.
I've tried literally all of them, in search for a lightweight (read: not ridiculously bloated) web browser. None of them fit the bill.
Yes Midori, you get a couple of bonus points for being the most lightweight. Luakit however.. as much as I like vim in my terminal, I do not want it in a graphical application. Not to mention that just like all the others you just use webkit2gtk, and therefore are just as bloated as all the others. Lightweight my ass! But programmable with Lua, woo! Not like Selenium, Chrome headless, ... does that for any browser. And that's it for the unique features as far as I'm concerned. One is slow, single-threaded and lightweight-ish (Midori) and another has vim keybindings in an application that shouldn't (Luakit).
Pretty much all of them use webkit2gtk as their engine, and pretty much all of them launch a separate process for each tab. People say this is more secure, but I have serious doubts about that. You're still running all these processes as the same user, and they all have full access to the X server they run under (this is also a criticism against user separation on a single X session in general). The only thing it protects against is a website crashing the browser, where only that tab and its process would go down. Which.. you know.. should a webpage even be able to do that?
But what annoys me the most is the sheer amount of memory that all of these take. With all due respect all of you browsers, I am not quite prepared to give 8 fucking gigabytes - half the memory in this whole box! - just for a dozen or so tabs. I shouldn't have to move my web browser to another lesser used 16GB box, just to prevent this one from going into fucking swap from a dozen tabs. And before someone has a go at the add-ons, there's 4 installed and that's it. None of them are even close to this complete and utter memory clusterfuck. It's the process separation. Each process consumes half a GB of memory, and there's around a dozen of them in a usual browsing session. THAT is the real problem. And I want to get rid of it.
Browsers are at their pinnacle of fucked up in my opinion, literally to the point where I'm seriously considering elinks. Being a sysadmin, I already live my daily life in terminals anyway. As such I also do have resources. But because of that I also associate every process with its cost to run it, in terms of resources required. Web browsers are easily at the top of the list.
I want to put 8GB into perspective. You can store nearly 2 entire DVD movies in that memory. However media players used to play them (such as SMPlayer) obviously don't do that. They use 60-80MB on average to play the whole movie. They also require far less processing power than YouTube in a web browser does, even when you download that exact same video with youtube-dl (either streamed within the media player or externally). That is what an application should be.
Let's talk a bit about these "complicated" websites as well. I hate to break it to you framework web devs, but you're a dime a dozen. The competition is high between web devs for that exact reason. And websites are not complicated. The document itself is plain old HTML, yes even if your framework converts to it in the background. That's the skeleton of your document, where I would draw a parallel with documents in office suites that are more or less written in XML. CSS.. oh yes, markup. Embolden that shit, yes please! And JavaScript.. oh yes, that pile of shit that's been designed in half a day, and has a framework called fucking isEven (which does exactly what it says on the tin, modulo 2 be damned). Fancy some macros in your text editor? Yes, same shit, different pile.
Imagine your text editor being as bloated as a web browser. Imagine it being prone to crashing tabs like a web browser. Imagine it being so ridiculously slow to get anything done in your productivity suite. But it's just the usual with web browsers, isn't it? Maybe Gopher wasn't such a bad idea after all... Oh and give me another update where I have to restart the browser when I commit the heinous act of opening another tab, just because you had to update your fucking CA certs again. Yes please!
-
Me: The dev agency didn’t follow best practices. They only implemented front end validation on the form. The form submits to a public endpoint, so bots don’t have to go through our site to submit the form. That’s why our database is still filled with $1 donation transactions. I honestly recommend telling this to the dev agency and requesting that you not be charged for the extra work needed to do this right.
Manager: They charge $95/hr and they’re billing for 8 hours already.
[Aside: The agency’s task was to implement a $10 minimum on the form, do some text changes, and deploy.]
Me: I would expect work to be done according to accepted best practices. It’s really a half done job.
Manager: But they were very helpful when we had that payment processing emergency. They stayed late to help us. We shouldn’t push this in case we need their help again. Can you do the backend validation? [We are in the US and the agency is in Lithuania.]
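For what it's worth, the backend check in question is small. A sketch in Python, assuming the endpoint receives the amount as a string; `validate_donation` and `MINIMUM_DONATION` are hypothetical names, and the real endpoint and framework are not described here.

```python
from decimal import Decimal, InvalidOperation

MINIMUM_DONATION = Decimal("10.00")  # the $10 minimum from the agency's task


def validate_donation(raw_amount: str) -> Decimal:
    """Server-side check; raises ValueError for anything below the minimum."""
    try:
        amount = Decimal(raw_amount)
    except InvalidOperation:
        raise ValueError("amount is not a number")
    # Reject NaN/Infinity first, then enforce the minimum.
    if not amount.is_finite() or amount < MINIMUM_DONATION:
        raise ValueError(f"minimum donation is ${MINIMUM_DONATION}")
    return amount
```

Because this runs on the server, a bot posting straight to the public endpoint hits the same rule as someone using the form.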
Me: 🤬😩😑🤐 [To myself: This wouldn’t have happened if the fundraising team hadn’t panicked and had only waited until I came back from my one day of PTO.]
-
Started a new course called "Introduction to natural language processing" in uni. I am super bad at regular expressions and don't understand anything about them.
Saw the first week's homework. Have to do, e.g., some text cleanup with regex... I was sad. But now after reading the course material and trying some of the exercises I'm super excited since I'm actually doing something "real" with it.
Do you guys just love it when teaching material is well written? I do.
-
My most humbling experience was finding the source code to the original Pokemon games online. It was right after I had finished my first text-based Linux console game and I was looking up other programs' source code just for shits and giggles. Most of them were simple and I learned a few simple tricks, but Pokemon Red and Blue were the first code that fascinated me. The addressing, the memory allocation, even the simple audio processing was simply genius. So many unique innovations and techniques. If I achieve 1/5th of the skill I found in those files, I can die a happy programmer!
-
!rant
For all of youse that ever wanted to try out Common Lisp and do not know where to start (but are interested in getting some knowledge of Common Lisp) I recommend two things:
As an introductory tutorial:
https://lisperati.com/casting.html/
And as your dev environment:
https://portacle.github.io/
Notice that the dev environment in question is Emacs. Regardless of how you might feel about it as a text editor, I can recommend just going through the Portacle help, which gives you some basic starting points regarding editing. Learn about splitting buffers, evaluating the code you are typing so that it appears in the Common Lisp REPL (this one comes with an environment known as SLIME, which is very popular in the Lisp world), as well as saving and editing your files.
Portacle is self contained inside of one single directory, so if you by any chance already have an Emacs environment then do not worry, Portacle will not touch any of that. I will admit that as far as I am concerned, Emacs will probably be the biggest hurdle for most people not used to it.
Can I use VS Code? Yes, yes you can, but I am not familiar with setting up a VS Code dev environment for Common Lisp, or any other environment that comes close to the live environment that Emacs provides for this.
Why the fuck should I try Common Lisp or any Lisp for that matter? You do not have to. I happen to like it a lot and have built applications at work with a different dialect of Lisp known as Clojure, which runs on the JVM. Do I recommend it? Yeah I do. I love functional programming, Clojure is pretty pure on that (not Haskell level imo though, but I am not using Haskell for anything other than academic purposes) and with Clojure you get the entire repertoire of Java libraries at your disposal. Moving to Clojure was cake coming from Common Lisp.
Why Common Lisp then if you used Clojure in prod? Mostly historical reasons, I want to just let people know that ANSI Common Lisp has a lot of good things going for it, I selected Clojure since I already knew what I needed from the JVM, and parallelism and concurrency are baked into Clojure, which was a priority. While I could have done the same thing in Common Lisp, I wanted to turn in a deliverable as quickly as possible rather than building the entire thing by myself which would have taken longer (had one week)
Am I getting something out of learning Common Lisp? Depends on you, I am not bringing about the whole "it opens your mind" deal with Lisp dialects as most other people do inside of the community, although I did experience new perspectives as to what programming and a programming language could do, and had fun doing it, maybe you will as well.
Does Lisp stand for Lots of Irritating Superfluous Parentheses or Lost In Stupid Parentheses? Yes, also for Lots of Insidious Silly Parentheses. Use paredit (comes with Portacle). Also, Lisp stands for Lisp Is Perfect. None of that List Processing bs, any other definition will do.
Are there any other books? Yes, the famous text Practical Common Lisp can be read online for free. I would recommend the Lisperati tutorial first to get a feel for it, since PCL demands more tedious study. There is also Common Lisp: A Gentle Introduction to Symbolic Computation. If you want to go the Clojure route, try Clojure for the Brave and True.
What about Scheme and the Structure and Interpretation of Computer Programs? Too academic for my taste, and if in Common Lisp you have to do a lot of things on your own, Scheme is a whole other beast. Simple and beautiful really, but I go for practical in terms of Lisp, thus I prefer Common Lisp.
How did you start with Lisp?
I was stupid and thought I should start with it after a failed attempt at learning C++, then Java, and then JavaScript when I started programming years ago. I was overwhelmed, but I continued. Then I moved to other things. But always kept Common Lisp close to heart. I am also heavy into A.I.; Lisp has a history there and it is used in a lot of new and sort of unknown projects dealing with knowledge reasoning and representation. It is also alien tech that contains many things that just seem super interesting to me, such as treating code as data and data as code (back-quoting, macros etc.)
I need some inspiration man......show me something? Sure, look for a game called Kandria on YouTube. The creator, Shinmera (Nicolas Hafner), is an absolute genius in the world of Lisp and a true inspiration. He coded the game in Common Lisp, and he is also the person behind Portacle. If that were not enough, he might very well also be Shirakumo, another prominent member of the Common Lisp community.
Ok, you got me, what is the first thing in Common Lisp that I should try after I install the Portacle environment? Go to the REPL and evaluate this:
(+ 0.1 0.2)
Watch in awe at what you get.
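For comparison, a small Python sketch of the same sum. The assumption here is that the "awe" comes from precision: Common Lisp reads `0.1` as a single-float by default, while Python floats are double precision. The `to_single` helper is hypothetical and only rounds a Python float to single precision so the two cases can be put side by side.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


double_sum = 0.1 + 0.2
single_sum = to_single(to_single(0.1) + to_single(0.2))

# Double precision: the sum is not exactly the double nearest to 0.3.
print(double_sum == 0.3)
# Single precision: the sum rounds to the same value as single-precision 0.3.
print(single_sum == to_single(0.3))
```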
In the truest and original sense of the phrase (MIT based) "happy hacking!"
-
StackOverflow locked my account. I'm hoping someone here might be kind enough to help me with a bash script I'm "bashing" my head with. Actually, it's zsh on MacOS if it makes any difference.
I have an input file. Four lines. No blank lines. Each of the four lines has two strings of text delimited by a tab. Each string on either side of the tab is either one word with no spaces or a bunch of words with spaces. Like this (using <tab> as a placeholder here on devRant for where the tab actually is):
ABC<tab>DEF
GHI<tab>jkl mno pq
RST<tab>UV
wx<tab>Yz
I need to open and read the file, separate them into key-value pairs, and put them into an array for processing. I have this script to do that:
# Get input arguments
search_string_file="$1"
file_path="$2"
# Read search strings and corresponding names from the file and store in arrays
search_strings=()
search_names=()
while IFS= read -r line || [[ -n "$line" ]]; do
  echo "Line: $line"
  search_string=$(echo "$line" | awk -F'\t' '{print $1}')
  name=$(echo "$line" | awk -F'\t' '{print $2}')
  search_strings+=("$search_string")
  search_names+=("$name")
done < "$search_string_file"
# Debug: Print the entire array of search strings
echo "Search strings array:"
for (( i=0; i<${#search_strings[@]}; i++ )); do
  echo "[$i] ${search_strings[$i]} -- ${search_names[$i]}"
done
However, in the output, I get the following:
Line: ABC<tab>DEF
Line: GHI<tab>jkl mno pq
Line: RST<tab>UV
Line: wx<tab>Yz
Search strings array:
[0] --
[1] ABC -- DEF
[2] GHI -- jkl mno pq
[3] RST -- UV
That's it. I seem to be off by one because that last line...
Line: wx<tab>Yz
never gets added to the array. What I need it to be is:
[0] ABC -- DEF
[1] GHI -- jkl mno pq
[2] RST -- UV
[3] wx -- Yz
What am I doing wrong here?
Thanks.
-
Imagine you were developing an on screen keyboard that has a word prediction function and you have access to unlimited resources. Like Apple for instance.
Would you prioritize common English words like at, and, in, or, what, the
Or would you prioritize letter combinations like ave, ayy, inn, our, eraser, three
Would you use your vast resources to build in any context processing at all that suggests the next word based on the previous words?
Would you then also delete parts of the text that have already been typed when the user decides against your suggestion?
I know what Apple would do.
This message took 25+ corrections.
-
Anyone tried converting speech waveforms to some type of image and then using those as training data for a stable diffusion model?
Hypothetically it should generate "ultrarealistic" waveforms for phonemes, for any given style of voice. The training labels are naturally the words or phonemes themselves, in text format (well, embedding vectors fwiw)
After that it's a matter of testing text-to-image, which should generate the relevant phonemes as images of waveforms (or your given visual representation, however you choose to pack it)
I would have tried this myself but I only have 3 GB of VRAM.
Even rudimentary voice generation that produces recognizable words from text input, would be interesting to see implemented and maybe a first for SD.
In other news:
Implementing SQL for an identity explorer. Basically the system generates sets of values for given known identities, and stores the formulas as strings, along with the values.
For any given value test set we can then cross reference to look up equivalent identities. And then we can test if these same identities hold for other test sets of actual variable values. If not, the identity string can be removed, or gophered elsewhere in the database for further exploration and experimentation.
I'm hoping by doing this, I can somewhat automate the process of finding identities, instead of relying on logs and using the OS built-in text search for test value (which I can then look up in the files that show up, and cross reference the logged equations that produced those values), which I use to find new identities.
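A minimal sketch of that lookup using Python's built-in `sqlite3` module. The table layout (`results` with `formula`, `test_set`, `value`) and the helper names are assumptions for illustration; the post does not describe the actual schema.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory table of (formula, test set, value) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (formula TEXT, test_set TEXT, value REAL)")
    return conn


def record(conn, formula: str, test_set: str, value: float) -> None:
    """Store the value one formula string produced for one test set."""
    conn.execute("INSERT INTO results VALUES (?, ?, ?)", (formula, test_set, value))


def equivalent_formulas(conn, test_set: str, tol: float = 1e-9):
    """Pairs of distinct formulas whose values agree (within tol) on one test set."""
    rows = conn.execute(
        """
        SELECT a.formula, b.formula
        FROM results a JOIN results b
          ON a.test_set = b.test_set
         AND a.formula < b.formula
         AND ABS(a.value - b.value) <= ?
        WHERE a.test_set = ?
        ORDER BY a.formula, b.formula
        """,
        (tol, test_set),
    )
    return rows.fetchall()
```

Running `equivalent_formulas` against a second test set and keeping only the pairs that show up in both is one way to check whether a candidate identity still holds for other variable values.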
I was even considering processing the logs of equations and identities as some form of training data perhaps for a ML system that generates plausible new identities but that's a little outside my reach I think.
Finally, now that I know the new modular function converts semiprimes into numbers with larger factor trees, I'm thinking of writing a visual browser that maps the connections from factor tree to factor tree, making them expandable and collapsible, and allowing adjusting the formula and regenerating trees on the fly.
-
My current task involves processing the Common Crawl web archive, and it's like a box of junk you buy at a flea market. You find so much useless stuff, broken stuff, stuff that makes you question people...
My latest find makes me wonder what lies out there if what I found was in plain sight. I found tens of thousands of websites that look like someone used Markov chains to generate pron ads. Those websites exist in 10+ languages, use the same URL scheme, read like a dyslexic camgirl reading alphabet soup and are hosted on the same three IP addresses. There is no JavaScript involved and some pages link to a variety of Twitter accounts.
I queried a few Common Crawl files and amassed 4 GB of this spam. Every time I look at it, it gets weirder. There is an Italian article about malware in there too.
Here's a text sample:
"Not from her bedroom, she her stream view and meet new experience. In hd india, because swimsuit still laws exist no interaction or frigthened and."
-
Oh... I don't know what to pick...
So I will pick 3 projects from the 3 stages of my dev "career".
1. Right after I discovered programming and learned how if, while and similar structures worked. The language was Object Pascal with Delphi 2007.
It was a "safe" with a stupidly complicated lock (text inputs, sliders, etc.); it opened a secret folder in the end.
2. Embedded code for an ATmega8 AVR, written in Atmel Studio in pure C but without memory management (I didn't even know that existed).
It was a Pip-Boy knockoff: a 16x2 display and a small keyboard connected to an Arduino-like board that I made on a proto board.
It wasn't that much of a Pip-Boy; it was more of a showoff of the ATmega8's internal systems (ADC, timers, interrupts and such).
3. DataLab. After helping my friend with his master's thesis (we met on Discord, long story; I was in high school) I decided that MATLAB is shit and created a visual scripting environment. The language was C# on .NET 4 (in the latest version).
I remade the whole program from scratch once, significantly improving everything (code reuse, better algorithms, data processing, code readability and edge cases). I learned good practices from everywhere. I learned how to use git.
The DataLab project looks just like LabVIEW (I didn't know that it even existed...). It is frozen now because of my mental state, but I'm planning on using it on my CV when I look for jobs over the holidays. There are many things that I can improve in that program but ... first I have to fix myself.
-
How do I go about tokenizing strings of text from a LaTeX file into a Go structure for further processing? I feel like I'm splitting hairs trying to loop over an io.Reader and using regexp to find where a sentence ends; there must be a better way...