Search - "html parsing"
-
Oh, man, I just realized I haven't ranted one of my best stories on here!
So, here goes!
A few years back the company I work for was contacted by an older client regarding a new project.
The guy was now pitching to build the website for the Parliament of another country (not gonna name it, NDAs and stuff), and was planning on outsourcing the development, as he had no team and was only aiming to handle the client service/project management side of the project.
Out of principle (and also to preserve our mental integrity), we had purposely avoided working with government bodies of any kind, in any country, but he was a friend of our CEO and pleaded until we signed on board.
Now, the project itself was way bigger than we expected, as they wanted more of an internal CRM: a centralized document archive, event management, internal planning, a multi-interface, role-based-access-restricted monster of an administration interface, complete with a regular user-facing website, itself packed with all kinds of features, dashboards and so on.
Long story short, a lot bigger than what we were expecting based on the initial brief.
The development period was hell. New features were coming in on a weekly basis. Already-implemented functionality was constantly being changed or redefined. No requests we made for clarifications, materials, or information were ever answered on time.
They also somehow bullied the guy who brought us the project into including the data migration from the old website into the new one we were building. So we somehow ended up having to extract meaningful, formatted, sanitized content by parsing static HTML files and connecting it to downloadable files (almost every page on the old website had files available for download) that we also needed to include in a sane way.
Now, don't think the files were simple URL paths we could trace to a folder/file path, oh no!!! The links were some form of hash combination that had to be exploded and tested against some kind of database relationship tables that only had hashed indexes relating to other tables, which also only had hashed indexes relating to yet other tables that kept the naming of the website's HTML page files. So what we had to do was identify the files based on a combination of hashed indexes and re-hashed HTML file names that, in the end, would give us a filename for a real file we then had to search for inside a list of over 20 folders not related to one another.
So we did this. We created a script that processed the hell out of over 10,000 HTML files, database entries and files, and re-indexed and renamed all this shit into a meaningful database of sane data and well-organized files.
So, with this we were nearing the finish line for the project, which by now had exceeded the estimated time by over two times.
We test everything, retest it all again for good measure, pack everything up for deployment, simulate on a staging environment, give the final client access to the staging version, get them to accept that all requirements are met, finish writing the documentation for the codebase, write detailed deployment procedure, include some automation and testing tools also for good measure, recommend production setup, hardware specs, software versions, server side optimization like caching, load balancing and all that we could think would ever be useful, all with more documentation and instructions.
As the project was built on PHP/MySQL (as requested), we recommended a Linux environment for production. Oh, I forgot to tell you that over the development period they kept asking us to also include steps for Windows procedures along with our regular documentation. It was a bit strange, but we added it in there just so we could finish and close the damn project.
So, we send them all the above and go get drunk as fuck in celebration of getting rid of them once and for all...
Next day: hung over, I get to the office, open my laptop and see one new email. I only had the one new mail, so I open it to see what it's about.
Lo and behold! The fuckers over in the other country that called themselves "IT guys", and were the ones making all the changes and additions to our requirements, were not capable enough to follow step-by-step instructions in order to deploy the project on their servers!!!
[Continues in the comments]
-
I am much too tired to go into details, probably because I left the office at 11:15pm, but I finally finished a feature. It doesn't even sound like a particularly large or complicated feature. It sounds like a simple, 1-2 day feature until you look at it closely.
It took me an entire fucking week. And all the while, I was coaching a junior dev who had just picked up Rails and was building something very similar.
It's the model, controller, and UI for creating a parent object along with 0-n child objects, with default children suggestions, a fancy UI including the ability to dynamically add/remove children via buttons, and having the entire happy family save nicely and atomically on the backend. Plus a detailed-but-simple listing for non-technicals, including some absolutely nontrivial CSS acrobatics.
After getting about 90% of everything built and working and beautiful, I learned that Rails does quite a bit of this for you, through `accepts_nested_attributes_for :collection`. But that requires very specific form input namespacing, and building that out correctly is flipping difficult. It's not like I could find good examples anywhere, either. I looked for hours. I finally found a Rails tutorial video linked from a comment on an SO answer from five years ago, and mashed its oversimplified and dated examples with the newer documentation, and worked around the issues that of course arose from that disastrous pairing.
like.
I needed to store a template of the child object markup somewhere, yeah? The video had me trying to store all of the markup in a `data-fields=" "` attrib. wth? I tried storing it as a string and injecting it into javascript, but that didn't work either. parsing errors! yay! good job, you two.
So I ended up storing the markup (rendered from a Rails partial) in an HTML comment of all things, and pulling the markup out of the comment and gsubbing its IDs on document load. This has the annoying effect of preventing me from using HTML comments in that partial (not that I really use them anyway, but.)
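Something like this, as a minimal sketch of the comment trick in plain JS (the element IDs and the NEW_ID placeholder are made up for illustration):

```javascript
// Sketch: pull child-row template markup out of an HTML comment and
// re-ID it before appending. Assumes a container along the lines of:
//   <div id="children"><!--<div class="child" id="child_NEW_ID">...</div>--></div>
document.addEventListener('DOMContentLoaded', () => {
  const container = document.getElementById('children');
  // Find the comment node that holds the template markup.
  const commentNode = Array.from(container.childNodes)
    .find((node) => node.nodeType === Node.COMMENT_NODE);
  const template = commentNode ? commentNode.textContent : '';

  document.getElementById('add-child').addEventListener('click', () => {
    // Swap the placeholder for a unique value so each child's nested
    // form inputs stay distinct when submitted.
    container.insertAdjacentHTML(
      'beforeend',
      template.replace(/NEW_ID/g, String(Date.now()))
    );
  });
});
```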
Just.
Every step of the way on building this was another mountain climb.
* singular vs plural naming and routing, and named routes; and dealing with issues arising from existing incorrect pluralization.
* reverse polymorphic relation (child -> x parent)
* The testing suite is incompatible with the new Rails 6. There is no fix. None. I checked. Nope. Not happening.
* Rails 6 randomly and constantly crashes and/or caches random things (including arbitrary code changes) in development mode (and only development mode) when working with multiple databases.
* nested form builders
* styling a fucking checkbox
* Making that checkbox (rather, its label and container div) into a sexy animated slider
* passing data and locals to and between partials
* misleading documentation
* building the partials to be self-contained and reusable
* coercing form builders into namespacing nested html inputs the way Rails expects
* input namespacing redux, now with nested form builders too!
* Figuring out how to generate markup for an empty child when I'm no longer rendering the children myself
* Figuring out where the fuck to put the blank child template markup so it's accessible, has the right namespacing, and is not submitted with everything else
* Figuring out how the fuck to read an html comment with JS
* nested strong params
* nested strong params
* nested fucking strong params
* caching parsed children's data on parent when the whole thing is bloody atomic.
* Converting datetimes from/to milliseconds on save/load
* CSS and bootstrap collisions
* CSS and bootstrap stupidity
* Reinventing the entire multi-child / nested params / atomic creating/updating/deleting feature on my own before discovering Rails can do that for you.
Just.
I am so glad it's working.
I don't even feel relieved. I just feel exhausted.
But it's done.
finally.
and it's done well. It's all self-contained and reusable, it's easy to read, has separate styling and reusable partials, etc. It's a two line copy/paste drop-in for any other model that needs it. Two lines and it just works, and even tells you if you screwed up.
I'm incredibly proud of everything that went into this.
But mostly I'm just incredibly tired.
Time for some well-deserved sleep.
-
So I just spent the last few hours trying to get an intro of given Wikipedia articles into my Telegram bot. It turns out that Wikipedia does have an API! But unfortunately it's born as a retard.
First I looked at https://www.mediawiki.org/wiki/API and almost thought it was a Wikipedia article about APIs. I almost skipped right over it in the search results (and it turns out that I should've). Upon opening and reading it, I found a shitload of endpoints that frankly I didn't give a shit about. Come on Wikipedia, just give me the fucking data to read out.
Ctrl-F in that page and I find a tiny little link to https://mediawiki.org/wiki/... which is basically what I needed. There's an example that.. gets the data in XML form. Because JSON is clearly too much to ask for. Are you fucking braindead Wikipedia? If my application was able to parse XML/HTML/whatevers, that would be called a browser. With all due respect but I'm not gonna embed a fucking web browser in a bot. I'll leave that to the Electron "devs" that prefer raping my RAM instead.
OK, so after that I found in third-party documentation (always a good sign when that's more useful, isn't it) that it does support JSON. Retardpedia just doesn't use it by default. In fact, that parameter wasn't even in the example query. Not including something crucial like that surely is a good way to let people know the feature is there. Massive kudos to you Wikipedia.. but not really. Meanwhile, a parameter that WAS in there by default - for fucking CORS - broke the whole goddamn thing unless I REMOVED it. Yeah, because CORS is so useful in a goddamn fucking API.
So I finally get to a functioning JSON response; now all that's left is parsing it. Again, I only care about the content on the page. So I curl the endpoint and trim off the bits I don't need with jq... and I was left with this monstrosity.
curl "https://en.wikipedia.org/w/api.php/...=*" | jq -r '.query.pages[0].revisions[0].slots.main.content'
Just how far can you nest your JSON Wikipedia? Are you trying to find the limits of jq or something here?!
And THEN.. as icing on the cake, the result doesn't quite look like JSON, nor does it really look like XML, but it has elements of both. I had no idea what to make of this, especially before I had a chance to look at the exact structured output of that command above (if you just pipe into jq without arguments it's much less readable).
Then a friend of mine mentioned Wikitext. Turns out that Wikipedia's API is not only retarded, even the goddamn output is. What the fuck is Wikitext even? It's the Apple of wikis apparently. Only Wikipedia uses it.
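For the record, a minimal sketch of the kind of call that sidesteps the Wikitext problem entirely - the TextExtracts module (prop=extracts) returns rendered text, so there's nothing to parse. The exact parameter set here is my reconstruction, not the query from the rant:

```javascript
// Sketch: fetch a plain-text article intro from the MediaWiki API.
// format=json has to be passed explicitly (XML is the default),
// and formatversion=2 turns .query.pages into a plain array.
async function wikipediaIntro(title) {
  const params = new URLSearchParams({
    action: 'query',
    format: 'json',
    formatversion: '2',
    prop: 'extracts',   // TextExtracts: rendered text, no Wikitext
    exintro: '1',       // intro section only
    explaintext: '1',   // plain text instead of HTML
    titles: title,
  });
  const res = await fetch(`https://en.wikipedia.org/w/api.php?${params}`);
  const data = await res.json();
  return data.query.pages[0].extract;
}

// wikipediaIntro('API').then(console.log);
```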
And apparently I'm not the only one who found Wikipedia's API.. irritating to say the least. See e.g. https://utcc.utoronto.ca/~cks/...
Needless to say, my bot will not be getting Wikipedia integration at this point. I've seen enough. How about you make your API not retarded first Wikipedia? And hopefully this rant saves someone else the time required to wade through this clusterfuck.
-
So, I've had a personal project going for a couple of years now. It's one of those "I think this could be the billion-dollar idea" things. But I suffer from the typical "it's not PERFECT, so let's start again!" mentality, and the "hmm, I'm not sure I like that technology choice, so let's start again!" mentality.
Or, at least, I DID until 3-4 months ago.
I made the decision that I was going to charge ahead with it even if I started having second thoughts along the way. But, at the same time, I made the decision that I was going to rely on as little external technology as possible. Simplicity was going to be the key guiding light and if I couldn't truly justify bringing a given technology into the mix, it'd stay out.
That means that when I built the front end, I would go with plain HTML/CSS/JS... you know, just like I did 20+ years ago... and when I built the back end, I'd minimize the libraries I used as much as possible (though I allowed myself a bit more flexibility on the back end because that seems to be where there are fewer issues generally). Similarly, any choice I made I wanted to have little to no additional tooling required.
So, given this is a webapp with a Node back-end, I had some decisions to make.
On the back end, I decided to go with Express. Previously, I had written all the server code myself from "first principles", so I effectively built my own version of Express in other words. And you know what? It worked fine! It wasn't particularly hard, the code wasn't especially bad, and it worked. So, I considered re-using that code from the previous iteration, but I ultimately decided that Express brings enough value - more specifically all the middleware available for it - to justify going with it. I also stuck with NeDB for my data storage needs since that was aces all along (though I did switch to nedb-promises instead of writing my own async/await wrapper around it as I had previously done).
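To make that concrete, a minimal sketch of that kind of Express + nedb-promises setup - the routes and filenames are hypothetical, not the actual project code:

```javascript
// Sketch: Express + nedb-promises, no build step, no ORM.
const express = require('express');
const Datastore = require('nedb-promises');

const app = express();
app.use(express.json());

// NeDB persists each collection to a single file on disk.
const items = Datastore.create({ filename: 'items.db', autoload: true });

app.get('/api/items', async (req, res) => {
  res.json(await items.find({}));
});

app.post('/api/items', async (req, res) => {
  // Persist whatever the client sent; validation omitted in this sketch.
  res.status(201).json(await items.insert(req.body));
});

app.listen(8080, () => console.log('Listening on 8080'));
```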
What I DIDN'T do though is go with TypeScript. In previous versions, I had. And, hey, it worked fine. TS of course brings some value, but having to have a compile step goes against my "as little additional tooling as possible" mantra, and the value it brings I find to be dubious when there's just one developer. As it stands, my "tooling" amounts to a few very simple JS scripts run with NPM. It's very simple, and that was my big goal: simplicity.
On the front end, I of course had to choose a framework first. React is fine, Angular is horrid, Vue, Svelte, others are okay. But I didn't want to bother with any of that because I dislike the level of abstraction they bring. But I also didn't want to be building my own widget library. I've done that before and it takes a lot of time and effort to do it well. So, after looking at many different options, I settled on Webix. I'm a fan of that library because it has a JS-centric approach. There's no JSX-like intermediate format, no build step involved, it's just straight, simple JS, and it's powerful and looks pretty good. Perfect for my needs. For one specific capability I did allow myself to bring in AnimeJS and ThreeJS. That's it though, no other dependencies (well, at first, I was using Axios because it was comfortable, but I've since migrated to plain old fetch). And no Webpack, no bundling at all, in fact. I dynamically load resources, which effectively is code-splitting, and I have some NPM scripts to do minification for a production build, but otherwise the code that runs in the browser is what I actually wrote, unlike using a framework.
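And a tiny sketch of the dynamic-loading idea, assuming native ES modules in the browser (the paths and IDs are hypothetical):

```javascript
// Sketch: load a view module on demand; native dynamic import()
// gives you code-splitting without a bundler.
async function showView(name) {
  // The browser fetches /views/<name>.js only the first time it's needed.
  const module = await import(`/views/${name}.js`);
  module.render(document.getElementById('app'));
}

document.getElementById('nav-settings')
  .addEventListener('click', () => showView('settings'));
```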
So, what's the point of this whole rant?
The point is that I've made more progress in these last few months than I did the previous several years, and the experience has been SO much better!
All the tools and dependencies we tend to use these days, by and large, I think get in the way. Oh, to be sure, they have their own benefits, I'm not denying that... but I'm not at all convinced those benefits outweighs the time lost configuring this tool or that, fixing breakages caused by dependency updates, dealing with obtuse errors spit out by code I didn't write, going from the code in the browser to the actual source code to get anywhere when debugging, parsing crappy documentation, and just generally having the project be so much more complex and difficult to reason about. It's cognitive overload.
I've been doing this professionally for a LONG time; I've seen so many fads come and go. The one thing I think we've lost along the way is the idea that simplicity leads to the best outcomes, and simplicity doesn't automatically mean you write less code, doesn't mean you cede responsibility for various things to third parties. Those things aren't automatically bad, but they CAN be, and I think more than we realize. We get wrapped up in "what everyone else is doing", we don't stop to question the "best practices", we just blindly follow.
I'm done with that, and my project is better for it!
-
[CMS Of Doom™]
Ah, yes, their built-in bullshit newsletter module just sent the n-th user n emails. Wonderful considering n=368.
The culprit? Better not to ask...
OK, anyway: So the mailer is running as a CRONjob, but nah, not as a console script call but by a public HTTP GET URL call, fucking obviously (it's the CMS Of Doom for a reason).
So these fucking imbeciles "implemented" an ob_start() callback where HTML links are - for whatever fucking reason - modified by some regex (obviously everybody knows parsing HTML by regex is trivial). In this case the link was somehow modified to re-call the mailer CRONjob...
This must have upset the ongoing mailing process, thus spamming mails. Whyyyy
And I thought I'd seen it all after 6 months in this legacy hell...
This is why you don't run a company consisting of only beginners in PHP (including their "CEO")!
-
Which misanthropic, terrible, perverse excuse for a dogfucker decided that damned non-breaking spaces (SPACES!) return false on isWhitespace? It's in the name: space. It's white, it's a fucking white space, a whitespace if you will. So who do I have to kill for wasting two damned hours of my life trying to parse away those bastards?
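For anyone who hits the same wall: Java's Character.isWhitespace, for one, is documented to return false for no-break spaces like U+00A0. A minimal sketch of the usual workaround in JS - normalize the bastards before parsing:

```javascript
// Sketch: map no-break spaces to plain spaces before any
// whitespace-based splitting or trimming.
function normalizeSpaces(text) {
  // U+00A0 no-break space, U+2007 figure space, U+202F narrow no-break space
  return text.replace(/[\u00A0\u2007\u202F]/g, ' ');
}

console.log(normalizeSpaces('two\u00A0hours').split(' ')); // ['two', 'hours']
```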
-
Currently making a perfect sudoku webapp / plugin using native JS and HTML templates, which I'm very enthusiastic about.
It allows you to select multiple cells and then put in a number, and all selected cells get that number. It keeps the state of every change; you can do unlimited redos. Right click or double click somewhere removes the selection. Not built yet, but it will have a box where you can paste sudokus you've found on the internet. I just parse 81 occurrences of [1-9] with regex, so all formats are supported, including noisy ones, as long as the noise is not numbers. Making your own puzzle is very easy. The art is in making hard ones. I'm generating extra hard puzzles using C threading. For reference: there are 6,670,903,752,021,072,936,960 sudoku puzzles possible, and from those I try to resolve the hard ones using simple human logic, with brute forcing as a fallback until it can use logic again. 30 million attempts to solve per second. I should add some more logic. I don't do xwing or ywing, bs imho. You have to be a superhuman to spot xwing / ywing possibilities. I think I can imagine a better logic myself. We'll see.
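A minimal sketch of that paste-parsing idea - assuming blanks arrive as 0 or '.', which the rant doesn't spell out:

```javascript
// Sketch: pull 81 cell values out of pasted sudoku text, ignoring noise.
function parsePastedSudoku(text) {
  const cells = (text.match(/[0-9.]/g) || [])
    .slice(0, 81)
    .map((ch) => (ch === '.' ? 0 : Number(ch)));
  if (cells.length !== 81) {
    throw new Error(`Expected 81 cells, found ${cells.length}`);
  }
  // Chunk the flat list into a 9x9 grid.
  return Array.from({ length: 9 }, (_, row) => cells.slice(row * 9, row * 9 + 9));
}
```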
And yes, that's a real screenshot. The puzzle is validated and it found issues, marked with a red font. Green is the current selection by the user.
-
Teach data structures by showing how they're used in real life situations. Don't make us do some nonsense puzzle shit. For example, a friend of mine is learning stacks/queues right now and his assignment is to build a simple HTML parsing algorithm to determine whether an HTML file is valid. This shows the student a practical use of the data structure and reinforces that this shit actually does get used in real life.
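Something like this, as a bare-bones sketch of that assignment (a stack of open tags; real HTML with optional closing tags needs more care):

```javascript
// Sketch: stack-based check that every opening tag has a matching
// closing tag, in the right order. Ignores attribute edge cases.
function isBalancedHtml(html) {
  const stack = [];
  const tagPattern = /<\/?([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  const voidTags = new Set(['br', 'img', 'hr', 'input', 'meta', 'link']);

  for (const match of html.matchAll(tagPattern)) {
    const name = match[1].toLowerCase();
    if (voidTags.has(name) || match[0].endsWith('/>')) continue;
    if (match[0].startsWith('</')) {
      // A closing tag must match the most recently opened tag.
      if (stack.pop() !== name) return false;
    } else {
      stack.push(name);
    }
  }
  return stack.length === 0; // nothing left open
}

console.log(isBalancedHtml('<ul><li>ok</li></ul>')); // true
console.log(isBalancedHtml('<div><p>nope</div>'));   // false
```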
-
I started building an application for FIDAL (Italian Federation of Athletics) because why not: I was bored and wanted to learn Flutter.
There is no API, but I didn't even expect one. Parsing the HTML is easy enough.
BUT OH MY GOD, THE ENTIRE WEBSITE IS SHIT. Take this page: http://www.fidal.it/graduatorie.php. It uses some useless jQuery plugin and a buttload of JavaScript that isn't even needed. BUT WAIT. Try entering an invalid "club code" (http://fidal.it/graduatorie.php/...): a FUCKING white page with a 200. Are you kidding?
I'd also like to mention that all pages that require form input won't load correctly if you don't include "submit=Invia" in the URL.
I am not giving up.
-
Parsing my college's terrible, classless, float-left-div-mosaic HTML timetables so I could hide lectures I wasn't taking.
-
I wish there was a better free engine than Mercury for parsing the relevant content out of a URL's HTML.
I was making a class project in React and it's not getting it done for my thing. 😑😑
-
I have to say that JavaScript is not a language I like, because I'm more fascinated by building a nice and scalable backend than by building a GUI (in the browser).
The funny thing is that right now JavaScript helps me on my current project, because many websites implement wide-open APIs for their JS frontend, and it just works like a charm to use them instead of parsing the whole HTML and doing some XPath stuff to fetch information.
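A minimal sketch of that approach, with a completely hypothetical endpoint (the real one is whatever shows up in the browser's network tab):

```javascript
// Sketch: call the JSON API a site's own frontend uses, instead of
// scraping HTML. The endpoint and fields here are hypothetical.
async function fetchArticles() {
  const res = await fetch('https://example.com/api/v1/articles?page=1', {
    headers: { Accept: 'application/json' },
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const { items } = await res.json();
  // Structured data directly; no XPath over rendered markup.
  return items.map(({ title, url }) => ({ title, url }));
}
```
-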
#Suphle Rant 7: transphporm failure
In this issue, I'll be sharing observations about 3 topics.
First and most significant is that the brilliant SSR templating library I've eyed for so many years, and even integrated as Suphle's presentation layer adapter, is virtually non-functional. It only works for the trivial use case of outputting the value of a property in the dataset. For instance, when validation fails, preventing execution from reaching the controller, parsing fails without signifying what ordinance was being violated. I trim the stylesheet and it only works when outputting one of the values added by the validation handler. Meaning the missing keys it can't find in the controller result are the culprit.
Even when I trimmed everything else for it to pass, the closing `</li>` tag seems to have been abducted.
I mail the project owner explaining what I need his library for. No response. I chat one of the maintainers on Twitter. Nothing. Since they have no forum, I find their Gitter chatroom, tag them, and post my questions. Nothing. The only semblance of documentation they have is the GitHub wiki. So, support is practically dead. Last commit on the project: 2020. It's disappointing that this is how my journey with them ends. There isn't even an alternative that shares the same philosophy. It's so sad to see how everybody is comfortable with PHP templating syntax and back-end logic entangled within their markup.
Among all other templating libraries, Blade (which influenced my strong distaste for interspersing markup and PHP) seems to be the most popular. First admission: we're headed back to the Blade trenches, sadly.
2nd topic: While writing tests yesterday, I had this weird feeling about something being off. I guess that's what code smell is. I was uncomfortable with the excessive number of mocking wrappers I had to layer upon the SUT before I could observe whether the HTML adapter receives the expected markup file, when I could simply put a `var_dump` there. There's a black-box test for verifying the output, but since the Transphporm headaches were causing it to fail, I tried going white-box. The mocking fixture was such a monstrosity, I imagined Sebastian Bergmann's ghost looking down in abhorrence over how much this Degenerate is perverting and butchering his creation.
I ultimately deleted the test travesty, but it gave rise to the question of how properly designed the system really is. Or are certain things beyond white-box testing? Are there still gaps in the testing knowledge of a supposed testing connoisseur? 2nd admission.
Lastly, I randomly wanted to tweet an idea at Tomas Votruba. I visited his profile, only to see this: https://twitter.com/PovilasKorop/.... Apparently, Laravel has implemented yet another feature previously existing only in Suphle (or in the libraries Arkitekt and Deptrac). I laughed mirthlessly as I watched them gain feature parity under my nose, when Suphle is yet to be launched. I refuse to believe they're actually stalking Suphle.
-
Once upon a time I was working with an engineer who loved sed and awk a bit too much. We had data stored in SharePoint that was retrievable via an RSS feed. Said engineer insisted on using curl to grab the feed and sed/awk to parse the HTML ...
I, on the other hand, suggested using libcurl (primarily for NTLM auth support) and parsing the RSS feed using libxml.
Which engineer do you think management decided to support?
Hint: Reusability and maintainability were big requirements in this project.